<h3>Tidying Up Your Data</h3>

One of the most common things you will do with pandas involves tidying your
data, which is the process of preparing raw data for analysis. Showing you how to use
various features of pandas to get raw data into a tidy form is the focus of this chapter.

In [1]:
# import pandas, numpy and datetime
import numpy as np
import pandas as pd
import datetime

# Set some pandas options for controlling output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

<h3>Working with missing data</h3>

Data is “missing” in pandas when it has a value of NaN (also seen as np.nan, the form
from NumPy). A NaN value in a Series means that no value is specified for that
particular index label.

In [2]:
# create a DataFrame with 5 rows and 3 columns
df = pd.DataFrame(np.arange(0, 15).reshape(5, 3), 
index=['a', 'b', 'c', 'd', 'e'],
columns=['c1', 'c2', 'c3'])
df

   c1  c2  c3
a   0   1   2
b   3   4   5
c   6   7   8
d   9  10  11
e  12  13  14

In [3]:
np.arange(0, 15).reshape(5, 3)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [4]:
# add some columns and rows to the DataFrame
# column c4 with NaN values
df['c4'] = np.nan

# row 'f' with 15 through 18
df.loc['f'] = np.arange(15, 19)

# row 'g' with all NaN
df.loc['g'] = np.nan

# column 'c5' with NaN values
df['c5'] = np.nan

# change value in col 'c4' row 'a'
# (a single .loc call avoids chained assignment, which may not modify df)
df.loc['a', 'c4'] = 20
df

     c1    c2    c3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN
g   NaN   NaN   NaN   NaN NaN

This DataFrame object exhibits the following characteristics that will support most of the
examples that follow in this section:
<ul>
    <li>One row consisting only of NaN values</li>
    <li>One column consisting only of NaN values</li>
    <li>Several rows and columns consisting of both numeric values and NaN values</li>
    </ul>

<h3>Determining NaN values in Series and DataFrame
objects</h3>

The NaN values in a DataFrame object can be identified using the .isnull() method. Any
True value means that the item is a NaN value:

In [5]:
# which items are NaN?
df.isnull()

      c1     c2     c3     c4    c5
a  False  False  False  False  True
b  False  False  False   True  True
c  False  False  False   True  True
d  False  False  False   True  True
e  False  False  False   True  True
f  False  False  False  False  True
g   True   True   True   True  True

We can use the fact that the .sum() method treats True as 1 and False as 0 to determine
the number of NaN values in a DataFrame object. By applying .sum() on the result of
.isnull(), we will get a total for the number of True values (representing NaN values) in
each column:

In [6]:
# count the number of NaN values in each column
df.isnull().sum()

c1    1
c2    1
c3    1
c4    5
c5    7
dtype: int64

Applying .sum() to the resulting series gives the total number of NaN values in the original
DataFrame object.

In [7]:
# total count of NaN values
df.isnull().sum().sum()

15

Another way to determine this is to use the .count() method of a Series or DataFrame
object. For a Series object, this method returns the number of non-NaN values.
For a DataFrame object, it counts the number of non-NaN values in each column:

In [8]:
# number of non-NaN values in each column
df.count()

c1    6
c2    6
c3    6
c4    2
c5    0
dtype: int64

In [9]:
df

     c1    c2    c3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN
g   NaN   NaN   NaN   NaN NaN

This then needs to be flipped around to give the number of NaN values, which can be
calculated as follows:

In [10]:
# and this counts the number of NaN values too
(len(df) - df.count()).sum()

15

In [11]:
len(df)

7

In [12]:
df.count()

c1    6
c2    6
c3    6
c4    2
c5    0
dtype: int64

In [13]:
len(df) - df.count()

c1    1
c2    1
c3    1
c4    5
c5    7
dtype: int64
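
The two counting approaches above can be checked against each other. The following is a minimal sketch that rebuilds the example DataFrame directly from a list of rows (an assumption made only to keep the snippet self-contained) and also computes the fraction of missing values per column; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
# the example DataFrame from this section, built in one step
df = pd.DataFrame(
    [[0, 1, 2, 20, nan],
     [3, 4, 5, nan, nan],
     [6, 7, 8, nan, nan],
     [9, 10, 11, nan, nan],
     [12, 13, 14, nan, nan],
     [15, 16, 17, 18, nan],
     [nan, nan, nan, nan, nan]],
    index=list('abcdefg'),
    columns=['c1', 'c2', 'c3', 'c4', 'c5'])

# NaN count per column, computed two ways
nan_by_isnull = df.isnull().sum()
nan_by_count = len(df) - df.count()

# True counts as 1, so the mean of the Boolean mask is the
# fraction of missing values in each column
missing_fraction = df.isnull().mean()
```

A fraction of 1.0 flags a column, such as c5, that holds no data at all.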

We can also determine whether an item is not NaN using the .notnull() method, which
returns True if the value is not a NaN value, otherwise it returns False:

In [14]:
# which items are not null?
df.notnull()

      c1     c2     c3     c4     c5
a   True   True   True   True  False
b   True   True   True  False  False
c   True   True   True  False  False
d   True   True   True  False  False
e   True   True   True  False  False
f   True   True   True   True  False
g  False  False  False  False  False

In [15]:
df

     c1    c2    c3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN
g   NaN   NaN   NaN   NaN NaN

In [16]:
df.notnull().sum()

c1    6
c2    6
c3    6
c4    2
c5    0
dtype: int64

In [17]:
df.notnull().sum().sum()

20

<h3>Selecting out or dropping missing data</h3>

In [18]:
df

     c1    c2    c3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN
g   NaN   NaN   NaN   NaN NaN

In [19]:
df.c4.notnull()

a     True
b    False
c    False
d    False
e    False
f     True
g    False
Name: c4, dtype: bool

In [20]:
# select the non-NaN items in column c4
df.c4[df.c4.notnull()]

a    20.0
f    18.0
Name: c4, dtype: float64

pandas also provides a convenience function .dropna(), which will drop the items in a
Series where the value is NaN, involving less typing than the previous example.

In [21]:
# .dropna will also return non NaN values
# this gets all non NaN items in column c4
df.c4.dropna()

a    20.0
f    18.0
Name: c4, dtype: float64

In [22]:
df

     c1    c2    c3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN
g   NaN   NaN   NaN   NaN NaN

In [23]:
# dropna returns a copy with the values dropped
# the source DataFrame / column is not changed
df.c4

a    20.0
b     NaN
c     NaN
d     NaN
e     NaN
f    18.0
g     NaN
Name: c4, dtype: float64

Note that .dropna() returned a copy of the c4 column without the NaN items. The
original DataFrame, and its c4 column, are not changed.

When applied to a DataFrame object, .dropna() will drop all rows from a DataFrame
object that have at least one NaN value. The following code demonstrates this in action,
and since each row has at least one NaN value, <b>there are no rows in the result:</b>

In [24]:
# on a DataFrame this will drop entire rows
# where there is at least one NaN
# in this case, that is all rows
df.dropna()

Empty DataFrame
Columns: [c1, c2, c3, c4, c5]
Index: []

If you want to only drop rows where all values are NaN, you can use the how='all'
parameter. The following code only drops the g row since it has all NaN values:

In [25]:
# only rows where all values are NaN will be dropped
df.dropna(how = 'all')

     c1    c2    c3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN

This can also be applied to the columns instead of the rows, by changing the axis
parameter to axis=1. The following code drops the c5 column as it is the only one with all
NaN values:

In [26]:
# flip to drop columns instead of rows
df.dropna(how='all', axis=1) # say goodbye to c5

     c1    c2    c3    c4
a   0.0   1.0   2.0  20.0
b   3.0   4.0   5.0   NaN
c   6.0   7.0   8.0   NaN
d   9.0  10.0  11.0   NaN
e  12.0  13.0  14.0   NaN
f  15.0  16.0  17.0  18.0
g   NaN   NaN   NaN   NaN

In [27]:
# make a copy of df
df2 = df.copy()

In [28]:
# replace two NaN cells in row 'g' with values
# (a single .loc call avoids chained assignment)
df2.loc['g', 'c1'] = 0

In [29]:
df2

     c1    c2    c3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN
g   0.0   NaN   NaN   NaN NaN

In [30]:
df2.loc['g', 'c3'] = 0
df2

     c1    c2    c3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN
g   0.0   NaN   0.0   NaN NaN

In [31]:
# now drop columns with any NaN values
df2.dropna(how='any', axis=1)

     c1    c3
a   0.0   2.0
b   3.0   5.0
c   6.0   8.0
d   9.0  11.0
e  12.0  14.0
f  15.0  17.0
g   0.0   0.0

The .dropna() method also has a parameter, thresh, which when given an integer value
specifies the minimum number of non-NaN values that a row or column must contain to be
kept. The following code keeps only the columns with at least five non-NaN values, which
drops the c4 and c5 columns:

In [32]:
# keep only columns with at least 5 non-NaN values
df.dropna(thresh=5, axis=1)

     c1    c2    c3
a   0.0   1.0   2.0
b   3.0   4.0   5.0
c   6.0   7.0   8.0
d   9.0  10.0  11.0
e  12.0  13.0  14.0
f  15.0  16.0  17.0
g   NaN   NaN   NaN

Note that the .dropna() method (and the Boolean selection) returns a copy of the
DataFrame object, and the data is dropped from that copy. If you want to drop the data in
the actual DataFrame, use the inplace=True parameter.
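
Since thresh counts the non-NaN values that must be present for a row or column to survive, it can help to try a few thresholds side by side. The sketch below rebuilds the example DataFrame from a list of rows (an assumption for self-containment) and applies thresh along both axes.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[0, 1, 2, 20, nan],
     [3, 4, 5, nan, nan],
     [6, 7, 8, nan, nan],
     [9, 10, 11, nan, nan],
     [12, 13, 14, nan, nan],
     [15, 16, 17, 18, nan],
     [nan, nan, nan, nan, nan]],
    index=list('abcdefg'),
    columns=['c1', 'c2', 'c3', 'c4', 'c5'])

# columns need at least 5 non-NaN values to be kept: c4 (2) and c5 (0) go
cols_thresh_5 = df.dropna(thresh=5, axis=1)

# lowering the threshold to 2 keeps c4, which has exactly 2 non-NaN values
cols_thresh_2 = df.dropna(thresh=2, axis=1)

# rows need at least 4 non-NaN values to be kept: only 'a' and 'f' qualify
rows_thresh_4 = df.dropna(thresh=4)
```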

<h3>How pandas handles NaN values in mathematical
operations</h3>

The NaN values are handled differently in pandas than in NumPy. This is demonstrated
using the following example:

In [33]:
# create a NumPy array with one NaN value
a = np.array([1, 2, np.nan, 3])

# create a Series from the array
s = pd.Series(a)

# the mean of each is different
a.mean(), s.mean()

(nan, 2.0)

NumPy functions, when encountering a NaN value, will return NaN. pandas functions, in
contrast, will typically ignore the NaN values and continue processing as though the
values were not part of the Series object.

More specifically, the way that pandas handles NaN values is as follows:
<ul><li>Summing of data treats NaN as 0</li>
    <li>If all values are NaN, the mean is NaN (in recent pandas versions the sum is 0)</li>
<li>Methods like .cumsum() and .cumprod() ignore NaN values, but preserve them in the
    resulting arrays</li>
</ul>

In [34]:
df

     c1    c2    c3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN
g   NaN   NaN   NaN   NaN NaN

In [35]:
# demonstrate sum, mean and cumsum handling of NaN
# get one column
s = df.c4
s.sum() # NaN values treated as 0

38.0

In [36]:
s.mean() # NaN values are excluded: 38.0 / 2 items

19.0

In [37]:
# NaN values are skipped by the cumsum, but preserved in the result Series
s.cumsum()

a    20.0
b     NaN
c     NaN
d     NaN
e     NaN
f    38.0
g     NaN
Name: c4, dtype: float64

In [38]:
# in arithmetic, a NaN value will result in NaN
df.c4 + 1

a    21.0
b     NaN
c     NaN
d     NaN
e     NaN
f    19.0
g     NaN
Name: c4, dtype: float64
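
The default NaN handling of the reductions can be adjusted. As a small sketch, the skipna and min_count parameters of .sum() are used below on a Series that mirrors the c4 column and on an all-NaN Series; the all-NaN sum of 0 assumes a reasonably recent pandas version.

```python
import numpy as np
import pandas as pd

s = pd.Series([20, np.nan, np.nan, 18])

default_sum = s.sum()               # NaN values skipped
strict_sum = s.sum(skipna=False)    # NaN propagates, as in NumPy
default_mean = s.mean()             # mean over the 2 non-NaN items

all_nan = pd.Series([np.nan, np.nan])
empty_sum = all_nan.sum()               # 0.0 in recent pandas versions
guarded_sum = all_nan.sum(min_count=1)  # NaN unless at least 1 valid value
all_nan_mean = all_nan.mean()           # NaN
```

Passing min_count=1 is a way to tell "no data" apart from "data that sums to zero".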

<h3>Filling in missing data</h3>

If you prefer to replace the NaN values with a specific value, instead of having them
propagated or flat out ignored, you can use the .fillna() method. The following code
fills the NaN values with 0:

In [39]:
# return a new DataFrame with NaN values filled with 0
filled = df.fillna(0)
filled

     c1    c2    c3    c4   c5
a   0.0   1.0   2.0  20.0  0.0
b   3.0   4.0   5.0   0.0  0.0
c   6.0   7.0   8.0   0.0  0.0
d   9.0  10.0  11.0   0.0  0.0
e  12.0  13.0  14.0   0.0  0.0
f  15.0  16.0  17.0  18.0  0.0
g   0.0   0.0   0.0   0.0  0.0

Be aware that this causes differences in the resulting values. As an example, the following
code shows the result of applying the .mean() method to the DataFrame object with the
NaN values, as compared to the DataFrame that has its NaN values filled with 0:

In [40]:
# NaNs don't count as an item in calculating
# the means
df.mean()

c1     7.5
c2     8.5
c3     9.5
c4    19.0
c5     NaN
dtype: float64

In [41]:
# having replaced NaN with 0 can make
# operations such as mean have different results
filled.mean()

c1    6.428571
c2    7.285714
c3    8.142857
c4    5.428571
c5    0.000000
dtype: float64

It is also possible to limit the number of NaN values that are filled by using the limit
parameter. When filling with a fixed value, pandas fills at most limit NaN values in each
column, working from the top down, and leaves the remaining NaN values in place.

In [42]:
# only fills the first two NaN values in each column with 0
df.fillna(0, limit=2)

     c1    c2    c3    c4   c5
a   0.0   1.0   2.0  20.0  0.0
b   3.0   4.0   5.0   0.0  0.0
c   6.0   7.0   8.0   0.0  NaN
d   9.0  10.0  11.0   NaN  NaN
e  12.0  13.0  14.0   NaN  NaN
f  15.0  16.0  17.0  18.0  NaN
g   0.0   0.0   0.0   NaN  NaN

<h3>Forward and backward filling of missing values</h3>

Gaps in data can be filled by propagating non-NaN values forward or backward along a
Series. To demonstrate this, the following example will “fill forward” the c4 column of
DataFrame:

In [43]:
df.c4

a    20.0
b     NaN
c     NaN
d     NaN
e     NaN
f    18.0
g     NaN
Name: c4, dtype: float64

In [44]:
# extract the c4 column and fill NaNs forward
df.c4.fillna(method="ffill")

a    20.0
b    20.0
c    20.0
d    20.0
e    20.0
f    18.0
g    18.0
Name: c4, dtype: float64

In [45]:
# perform a backwards fill
df.c4.fillna(method="bfill")

a    20.0
b    18.0
c    18.0
d    18.0
e    18.0
f    18.0
g     NaN
Name: c4, dtype: float64

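To save a little typing, Series and DataFrame objects also have the .ffill() and .bfill()
methods, which are equivalent to .fillna(method="ffill") and .fillna(method="bfill").
Recent pandas versions deprecate the method parameter of .fillna(), which makes these
methods the preferred form.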

In [46]:
df.c4

a    20.0
b     NaN
c     NaN
d     NaN
e     NaN
f    18.0
g     NaN
Name: c4, dtype: float64

In [47]:
df.c4.bfill()

a    20.0
b    18.0
c    18.0
d    18.0
e    18.0
f    18.0
g     NaN
Name: c4, dtype: float64

In [48]:
df.c4.ffill()

a    20.0
b    20.0
c    20.0
d    20.0
e    20.0
f    18.0
g    18.0
Name: c4, dtype: float64
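
Forward and backward fills also accept a limit parameter, which caps how many consecutive NaN values are filled from a single source value. The following sketch uses a Series that mirrors the c4 column; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
s = pd.Series([20, nan, nan, nan, nan, 18, nan], index=list('abcdefg'))

# carry each value forward over at most 2 consecutive NaN values
limited = s.ffill(limit=2)

# chain both directions to fill any NaN values left at either end
both = s.ffill().bfill()
```

With limit=2, labels b and c receive 20.0 while d and e stay NaN, which avoids carrying a stale value across a long gap.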

<h3>Filling using index labels</h3>

Data can be filled using the labels of a Series or keys of a Python dictionary. This allows
you to specify different fill values for different elements based upon the value of the index
label:

In [49]:
# create a new Series of values to be
# used to fill NaN values where the index label matches
fill_values = pd.Series([100, 101, 102], index=['a', 'e', 'g'])
fill_values

a    100
e    101
g    102
dtype: int64

In [50]:
df.c4

a    20.0
b     NaN
c     NaN
d     NaN
e     NaN
f    18.0
g     NaN
Name: c4, dtype: float64

In [51]:
# using c4, fill using fill_values
# a, e and g will be filled with matching values only if they are NaN
df.c4.fillna(fill_values)

a     20.0
b      NaN
c      NaN
d      NaN
e    101.0
f     18.0
g    102.0
Name: c4, dtype: float64
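
The same kind of fill can be driven by a Python dictionary. In this sketch, a dictionary keyed by index label fills a Series, and a dictionary keyed by column name fills a small two-column DataFrame; the data and fill values are made up for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
s = pd.Series([20, nan, nan, nan, nan, 18, nan], index=list('abcdefg'))

# for a Series, dictionary keys are matched against index labels;
# only NaN items are replaced, so 'a' keeps its value of 20
by_label = s.fillna({'a': 100, 'e': 101, 'g': 102})

df = pd.DataFrame({'c4': [20, nan, nan, 18], 'c5': [nan] * 4},
                  index=list('abcd'))

# for a DataFrame, dictionary keys are matched against column names
by_column = df.fillna({'c4': 0, 'c5': -1})
```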

Another common scenario is to fill all the NaN values in a column with the mean of the
column:

In [52]:
df

     c1    c2    c3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0   NaN NaN
c   6.0   7.0   8.0   NaN NaN
d   9.0  10.0  11.0   NaN NaN
e  12.0  13.0  14.0   NaN NaN
f  15.0  16.0  17.0  18.0 NaN
g   NaN   NaN   NaN   NaN NaN

In [53]:
# fill NaN values in each column with the
# mean of the values in that column
df.fillna(df.mean())

     c1    c2    c3    c4  c5
a   0.0   1.0   2.0  20.0 NaN
b   3.0   4.0   5.0  19.0 NaN
c   6.0   7.0   8.0  19.0 NaN
d   9.0  10.0  11.0  19.0 NaN
e  12.0  13.0  14.0  19.0 NaN
f  15.0  16.0  17.0  18.0 NaN
g   7.5   8.5   9.5  19.0 NaN

<h3>Interpolation of missing values</h3>

Both DataFrame and Series have an .interpolate() method that will, by default,
perform a linear interpolation of missing values:

In [54]:
# linear interpolate the NaN values from 1 through 2
s = pd.Series([1, np.nan, np.nan, np.nan, 2])
s.interpolate()

0    1.00
1    1.25
2    1.50
3    1.75
4    2.00
dtype: float64

The .interpolate() method also accepts a method parameter that selects the kind of
interpolation to perform. One of the common choices is time-based interpolation. Consider
the following Series of dates and values:

In [55]:
# create a time series, but missing one date in the Series
ts = pd.Series([1, np.nan, 2], index=[datetime.datetime(2014, 1, 1),
datetime.datetime(2014, 2, 1), datetime.datetime(2014, 4, 1)])
ts

2014-01-01    1.0
2014-02-01    NaN
2014-04-01    2.0
dtype: float64

In [56]:
# linear interpolate based on the number of items in the Series
ts.interpolate()

2014-01-01    1.0
2014-02-01    1.5
2014-04-01    2.0
dtype: float64

<p>The value for 2014-02-01 is calculated as 1.0 + (2.0-1.0)/2 = 1.5, since there is one NaN
value between the values 1.0 and 2.0.</p>
<p>The important thing to note is that the series is missing an entry for 2014-03-01. If we
were expecting to interpolate monthly values, there would be two values calculated, one for
2014-02-01 and another for 2014-03-01, resulting in one more step in the denominator of
the interpolation.</p>
<p>This can be corrected by specifying the method of interpolation as "time":</p>

In [57]:
# this accounts for the fact that we don't have
# an entry for 2014-03-01
ts.interpolate(method="time")

2014-01-01    1.000000
2014-02-01    1.344444
2014-04-01    2.000000
dtype: float64

This is the correct interpolation for 2014-02-01 based upon dates. Also note that the index
label and value for 2014-03-01 are not added to the Series; the gap is just factored into
the interpolation.
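
The time-based result can be reproduced by hand from the number of days between the dates. The sketch below recomputes the value for 2014-02-01 and compares it with what .interpolate(method="time") returns; the variable names are illustrative.

```python
import datetime
import numpy as np
import pandas as pd

start = datetime.datetime(2014, 1, 1)
missing = datetime.datetime(2014, 2, 1)
end = datetime.datetime(2014, 4, 1)

elapsed = (missing - start).days   # days from the first date to the gap
span = (end - start).days          # days between the two known values

# linear interpolation weighted by elapsed time
manual = 1.0 + (2.0 - 1.0) * elapsed / span

ts = pd.Series([1, np.nan, 2], index=[start, missing, end])
by_time = ts.interpolate(method="time")
```

Here elapsed is 31 days and span is 90 days, so the value is 1.0 + 31/90, which matches the 1.344444 shown in the output.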

Interpolation can also be specified to calculate values relative to the index values when
using numeric index labels. To demonstrate this, we will use the following Series:

In [58]:
# a Series to demonstrate index label based interpolation
s = pd.Series([0, np.nan, 100], index=[0, 1, 10])
s

0       0.0
1       NaN
10    100.0
dtype: float64

If we perform a linear interpolation, we get the following value for label 1, which is
correct for a linear interpolation:

In [59]:
# linear interpolate
s.interpolate()

0       0.0
1      50.0
10    100.0
dtype: float64

However, what if we want to interpolate the value to be relative to the index value? To do
this, we can use method="values":

In [60]:
# interpolate based upon the values in the index
s.interpolate(method="values")

0       0.0
1      10.0
10    100.0
dtype: float64

Now, the value calculated for NaN is interpolated using relative positioning based upon the
labels in the index. The NaN value has a label of 1, which is one tenth of the way between 0
and 10, so the interpolated value will be 0 + (100-0)/10, or 10.
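
As a cross-check of the arithmetic, the same value can be obtained with NumPy's np.interp, using the index labels as x coordinates. This is only a sketch for verification; method="index" is used here and, for a numeric index, behaves like method="values".

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 100], index=[0, 1, 10])

# interpolate relative to the numeric index labels
by_index = s.interpolate(method="index")

# the same calculation by hand: x = index labels, y = known values
known = s.dropna()
manual = np.interp(1, known.index.to_numpy(), known.to_numpy())
```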

<h3>Handling duplicate data</h3>

To facilitate finding duplicate data, pandas provides a .duplicated() method that returns
a Boolean Series where each entry represents whether or not the row is a duplicate. A
True value represents that the specific row has appeared earlier in the DataFrame object
with all column values being identical.

In [61]:
# a DataFrame with lots of duplicate data
data = pd.DataFrame({'a': ['x'] * 3 + ['y'] * 4,
'b': [1, 1, 2, 3, 3, 4, 4]})

data

   a  b
0  x  1
1  x  1
2  x  2
3  y  3
4  y  3
5  y  4
6  y  4

In [62]:
# reports which rows are duplicates based upon
# if the data in all columns was seen before
data.duplicated()

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

<p>Duplicate rows can be dropped from a DataFrame using the .drop_duplicates() method.
This method will return a copy of the DataFrame object with the duplicate rows removed.</p> <p>It is also possible to use the inplace=True parameter to remove the rows without making
a copy:</p>

In [63]:
# drop duplicate rows retaining first row of the duplicates
data.drop_duplicates()

   a  b
0  x  1
2  x  2
3  y  3
5  y  4

<p>Note that dropping duplicates affects which index labels remain. The
duplicate records may have different index labels (labels are not taken into account in
calculating a duplicate). So, which row is kept can affect the set of labels in the resulting
DataFrame object.</p><p>The default operation is to keep the first row of the duplicates. If you want to keep the last
row of duplicates, you can use the keep='last' parameter. The following code
demonstrates how the result differs using this parameter:</p>

In [64]:
# drop duplicate rows, only keeping the last
# instance of any data
data.drop_duplicates(keep='last')

   a  b
1  x  1
2  x  2
4  y  3
6  y  4

If you want to check for duplicates based on a smaller set of columns, you can specify a
list of column names:

In [65]:
# add a column c with values 0..6
# this makes .duplicated() report no duplicate rows
data['c'] = range(7)
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6    False
dtype: bool

In [94]:
data

   a  b  c
0  x  1  0
1  x  1  1
2  x  2  2
3  y  3  3
4  y  3  4
5  y  4  5
6  y  4  6

In [95]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6    False
dtype: bool

In [97]:
# but if we specify duplicates to be dropped only in columns a & b
# they will be dropped
data.drop_duplicates(['a', 'b'], keep="first")

   a  b  c
0  x  1  0
2  x  2  2
3  y  3  3
5  y  4  5
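
Both .duplicated() and .drop_duplicates() also accept keep=False, which treats every member of a duplicate group as a duplicate. The following sketch rebuilds the example data and combines keep with a column subset; the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'a': ['x'] * 3 + ['y'] * 4,
                     'b': [1, 1, 2, 3, 3, 4, 4],
                     'c': list(range(7))})

# flag every row whose (a, b) pair occurs more than once
all_dups = data.duplicated(subset=['a', 'b'], keep=False)

# rows whose (a, b) pair is unique in the DataFrame
unique_only = data[~all_dups]

# keep the last row of each (a, b) group instead of the first
last_kept = data.drop_duplicates(subset=['a', 'b'], keep='last')
```

Note how keep='last' retains index labels 1, 2, 4, and 6, rather than 0, 2, 3, and 5.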

<h2>Transforming Data</h2>

<h3>Mapping</h3>

One of the basic tasks in data transformations is mapping of a set of values to another set.
pandas provides a generic ability to map values using a lookup table (via a Python
dictionary or a pandas Series) using the .map() method. This method performs the
mapping by matching the values of the outer Series with the index labels of the inner
Series, and returning a new Series with the index labels of the outer Series but the
values from the inner Series:

In [67]:
# create two Series objects to demonstrate mapping
x = pd.Series({"one": 1, "two": 2, "three": 3})
y = pd.Series({1: "a", 2: "b", 3: "c"})
x,y

(one      1
 two      2
 three    3
 dtype: int64,
 1    a
 2    b
 3    c
 dtype: object)

In [68]:
# map values in x to values in y
x.map(y)

one      a
two      b
three    c
dtype: object

As with other alignment operations, if pandas does not find a match between a value of
the outer Series and an index label of the inner Series, it fills the value with NaN. To
demonstrate this, the following code removes the 3 key from the inner Series, so the
alignment fails for that record and a NaN value is introduced:

In [69]:
# three in x will not align / map to a value in y
x = pd.Series({"one": 1, "two": 2, "three": 3})
y = pd.Series({1: "a", 2: "b"})
x.map(y)

one        a
two        b
three    NaN
dtype: object
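
The lookup table passed to .map() can also be a plain Python dictionary or a function. As a sketch, the code below maps with a dictionary and then shows two ways of supplying a default for values that have no match; the "unknown" label is an arbitrary choice for illustration.

```python
import pandas as pd

x = pd.Series({"one": 1, "two": 2, "three": 3})
lookup = {1: "a", 2: "b"}

# values without a matching dictionary key become NaN
mapped = x.map(lookup)

# option 1: fill the unmatched items after mapping
with_default = x.map(lookup).fillna("unknown")

# option 2: map with a function that supplies the default itself
via_function = x.map(lambda v: lookup.get(v, "unknown"))
```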

<h3>Replacing Values</h3>

<p>We previously saw how the .fillna() method can be used to replace the NaN values with
a value of your own decision. The .fillna() method can actually be thought of as an
implementation of the .map() method that maps a single value, NaN, to a specific value.</p><p>
Even more generically, the .fillna() method itself can be considered a specialization of
a more general replacement that is provided by the .replace() method, which provides
more flexibility by being able to replace any value (not just NaN) with another value.</p><p>
The most basic use of the .replace() method replaces an individual value with another:</p>

In [70]:
# create a Series to demonstrate replace
s = pd.Series([0., 1., 2., 3., 2., 4.])
s

0    0.0
1    1.0
2    2.0
3    3.0
4    2.0
5    4.0
dtype: float64

In [71]:
# replace all items with value 2 with value 5
s.replace(2, 5)

0    0.0
1    1.0
2    5.0
3    3.0
4    5.0
5    4.0
dtype: float64

It is also possible to specify multiple items to replace and also specify their substitute
values by passing two lists:

In [72]:
# replace all items with new values
s.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])

0    4.0
1    3.0
2    2.0
3    1.0
4    2.0
5    0.0
dtype: float64

Replacement can also be performed by specifying a dictionary for lookup (a variant of the
map process in the previous section):

In [73]:
# replace using entries in a dictionary
s.replace({0: 10, 1: 100})

0     10.0
1    100.0
2      2.0
3      3.0
4      2.0
5      4.0
dtype: float64

<p>If using .replace() on a DataFrame, it is possible to specify a different value to replace
in each column. This is performed by passing a Python dictionary to the .replace()
method, where the keys of the dictionary represent the names of the columns where
replacement is to occur and the values of the dictionary are the values that you want to
replace. The second parameter to the method is the value that will be substituted wherever
matches are found.</p><p>
The following code demonstrates by creating a DataFrame object and then replacing
specific values in each of the columns with 100:</p>

In [74]:
# DataFrame with two columns
df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})
df

   a  b
0  0  5
1  1  6
2  2  7
3  3  8
4  4  9

In [75]:
# specify a different value to replace in each column; matches become 100
df.replace({'a': 1, 'b': 8}, 100)

     a    b
0    0    5
1  100    6
2    2    7
3    3  100
4    4    9
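
When each column needs its own substitute value as well, .replace() accepts a nested dictionary of the form {column: {old: new}}. The following is a minimal sketch on the same two-column data; the replacement values 100 and 200 are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

# in column a replace 1 with 100; in column b replace 8 with 200
replaced = df.replace({'a': {1: 100}, 'b': {8: 200}})
```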

In [76]:
s

0    0.0
1    1.0
2    2.0
3    3.0
4    2.0
5    4.0
dtype: float64

In [98]:
# demonstrate replacement with pad method
# start from a new Series of 0..4 and set the first item to 10,
# to have a distinct replacement value
s = pd.Series(np.arange(0, 5))
s[0] = 10
s

0    10
1     1
2     2
3     3
4     4
dtype: int64

In [99]:
# replace items with values 1, 2, 3, using fill from the
# most recent value prior to the matched items (10)
# note: the method parameter of .replace() is deprecated in recent pandas versions
s.replace([1, 2, 3], method='pad')

0    10
1    10
2    10
3    10
4     4
dtype: int64

<h3>Applying functions to transform data</h3>

In situations where a direct mapping or substitution will not suffice, it is possible to apply
a function to the data to perform an algorithm on the data. pandas provides the ability to
apply functions to individual items, entire columns, or entire rows, providing incredible
flexibility in transformation.

Functions can be applied using the conveniently named .apply() method, which, given a
Python function, will iteratively call the function, passing in each value from a Series, or
each Series representing a DataFrame column, or a Series representing each row in
a DataFrame. The technique used depends on whether the object is a
Series or a DataFrame object and, for a DataFrame object, on which axis
is specified:

In [79]:
# demonstrate applying a function to every item of a Series
s = pd.Series(np.arange(0, 5))
s

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [80]:
s.apply(lambda v: v * 2)

0    0
1    2
2    4
3    6
4    8
dtype: int64

<p>When applying a function to items in a Series, only the value for each Series item is
passed to the function, not the index label and the value.</p><p>When a function is applied to a DataFrame, the default is to apply the method to each
column. pandas will iterate through all columns passing each as a Series to your function.
The result will be a Series object with index labels matching column names and with the
result of the function applied to the column:</p>

In [81]:
# demonstrate applying a sum on each column
df = pd.DataFrame(np.arange(12).reshape(4, 3),
columns=['a', 'b', 'c'])
df

   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

In [83]:
# calculate the sum of items in each column
df.apply(lambda col: col.sum())

a    18
b    22
c    26
dtype: int64

Application of the function can be switched to the values from each row by specifying
axis=1:

In [84]:
df

   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

In [85]:
# calculate the sum of items in each row
df.apply(lambda row: row.sum(), axis=1)

0     3
1    12
2    21
3    30
dtype: int64

<p>A common practice is to take the result of an apply operation and add it as a new column
of the DataFrame. This is convenient as you can add onto the DataFrame the result of one
or more successive calculations, providing yourself with progressive representations of the
derivation of results through every step of the process.</p><p>The following code demonstrates this process. The first step will multiply column a by
column b and create a new column named interim. The second step will add those values
and column c, and create a result column with those values:</p>

In [86]:
# create a new column 'interim' with a * b
df['interim'] = df.apply(lambda r: r.a * r.b, axis=1)
df

   a   b   c  interim
0  0   1   2        0
1  3   4   5       12
2  6   7   8       42
3  9  10  11       90

In [87]:
# and now a 'result' column with 'interim' + 'c'
df['result'] = df.apply(lambda r: r.interim + r.c, axis=1)
df

   a   b   c  interim  result
0  0   1   2        0       2
1  3   4   5       12      17
2  6   7   8       42      50
3  9  10  11       90     101
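
For simple arithmetic like this, the same columns can be produced with vectorized column operations instead of a row-wise .apply(). The sketch below is an alternative to the approach shown above, not a replacement for it, and checks that both give the same values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=['a', 'b', 'c'])

# vectorized: operate on whole columns at once
df['interim'] = df['a'] * df['b']
df['result'] = df['interim'] + df['c']

# row-wise equivalent using apply, for comparison
row_wise = df.apply(lambda r: r.a * r.b + r.c, axis=1)
```

Column arithmetic avoids calling a Python function once per row, so it tends to be faster on large DataFrames, while .apply() remains useful when the per-row logic cannot be expressed as column operations.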

If you would like to change the values in the existing column, simply assign the result to
an already existing column. The following code changes the 'a' column values to be the
sum of the values in columns a, b, and c:

In [88]:
# replace column a with the sum of columns a, b and c
df.a = df.a + df.b + df.c
df

    a   b   c  interim  result
0   3   1   2        0       2
1  12   4   5       12      17
2  21   7   8       42      50
3  30  10  11       90     101

<p>Another point to note is that a pandas DataFrame is not a spreadsheet where cells are
assigned formulas and can be recalculated when cells that are referenced by the formula
change. If you desire this to happen, you will need to execute the formulas whenever the
dependent data changes. On the flip side, this is more efficient than with spreadsheets as
every little change does not cause a cascade of operations to occur.</p><p>The .apply() method will always apply the provided function to all of the items, or
rows or columns. If you want to apply the function to a subset of these, then first perform
a Boolean selection to filter out the items you do not want to process.</p>

To demonstrate this, the following code creates a DataFrame of values and inserts one NaN
value into the second row. It then applies a function to only the rows where all values are
not NaN:

In [89]:
# create a 3x5 DataFrame
# only second row has a NaN
df = pd.DataFrame(np.arange(0, 15).reshape(3,5))
df.loc[1, 2] = np.nan
df

    0   1     2   3   4
0   0   1   2.0   3   4
1   5   6   NaN   8   9
2  10  11  12.0  13  14

In [90]:
df.dropna()

    0   1     2   3   4
0   0   1   2.0   3   4
2  10  11  12.0  13  14

In [91]:
# demonstrate applying a function to only rows having
# a count of 0 NaN values
df.dropna().apply(lambda x: x.sum(), axis=1)

0    10.0
2    60.0
dtype: float64
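
The same filtering can be written as an explicit Boolean selection, which generalizes to conditions other than "no NaN values". In this sketch the DataFrame is created with a float dtype (an assumption, so that assigning NaN does not need a dtype change), and the mask name complete is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 15).reshape(3, 5).astype(float))
df.loc[1, 2] = np.nan

# Boolean mask: True for rows where every value is non-NaN
complete = df.notnull().all(axis=1)

# apply the function only to the selected rows
sums = df[complete].apply(lambda row: row.sum(), axis=1)
```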

The last (but not least) method for applying functions is the .applymap() method of the
DataFrame, shown in the next example. The .apply() method is always passed an
entire row or column. If you want to apply a function to every individual item in the
DataFrame one by one, then .applymap() is the method to use. (pandas 2.1 and later
provide this as DataFrame.map() and deprecate .applymap().)

Here is a practical example of applying the .applymap() method to every item in a DataFrame,
specifically to format each value to a specified number of decimal places:

In [92]:
df

    0   1     2   3   4
0   0   1   2.0   3   4
1   5   6   NaN   8   9
2  10  11  12.0  13  14

In [93]:
# use applymap to format all items of the DataFrame
df.applymap(lambda x: '%.2f' % x)

       0      1      2      3      4
0   0.00   1.00   2.00   3.00   4.00
1   5.00   6.00    nan   8.00   9.00
2  10.00  11.00  12.00  13.00  14.00
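
Because the name of the element-wise DataFrame method differs between pandas versions, one version-independent way to get the same effect is to apply Series.map to each column. The following is a sketch of that approach on a small made-up DataFrame; fmt is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [0.0, 5.0], 'y': [1.5, np.nan]})

def fmt(value):
    # format a single item to two decimal places
    return '%.2f' % value

# apply passes each column as a Series; map then visits every item
formatted = df.apply(lambda col: col.map(fmt))
```

As in the example above, NaN items are passed to the function as well and come out as the string 'nan'.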