# Methods for detecting, removing, and replacing null values in Pandas data structures.


In [1]:
import pandas as pd
import numpy as np

Brief exploration and demonstration of these
routines.

In [3]:
data = pd.Series([1, np.nan, 'hello', None])
data

0        1
1      NaN
2    hello
3     None
dtype: object

Use the isnull() method to check which entries of a Series are null:

In [8]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

The Boolean mask returned by notnull() can be used to select only the non-null entries:

In [9]:
data[data.notnull()]

0        1
2    hello
dtype: object

The isnull() and notnull() methods produce similar Boolean results for
DataFrame objects.
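As a minimal sketch of the DataFrame case, the snippet below applies isnull() and notnull() to a small frame. The frame and the variable names (`frame`, `null_mask`, `nulls_per_column`, `complete_rows`) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for illustration.
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# isnull() returns a Boolean DataFrame with the same shape as `frame`.
null_mask = frame.isnull()

# True counts as 1 when summed, so this gives the number of nulls per column.
nulls_per_column = null_mask.sum()

# notnull() combined with all(axis=1) selects rows that have no null values.
complete_rows = frame[frame.notnull().all(axis=1)]
```

Summing the mask is a quick way to see where the missing values are concentrated before deciding whether to drop or fill them.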

# Dropping Null Values

In [12]:
data.dropna()

0        1
2    hello
dtype: object

For a DataFrame, there are more options. Consider the following DataFrame:

In [15]:
df = pd.DataFrame(
    [
        [1, np.nan, 2],
        [2, 3, 5],
        [np.nan, 4, 6]
    ]
)
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


By default, dropna will drop all rows in which any null value is present:

In [16]:
df.dropna()

     0    1  2
1  2.0  3.0  5


Alternatively, you can drop NA values along a different axis. Using axis=1 or
axis='columns' drops all columns containing a null value:

In [18]:
df.dropna(axis = 'columns')

   2
0  2
1  5
2  6


The default is how='any', so any row or column containing at least one null value is
dropped. You can also specify how='all', which drops only those rows or columns whose
values are all null:

In [20]:
df[3] = np.nan
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [21]:
df.dropna(axis=1, how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


For finer-grained control, the thresh parameter lets you specify a minimum number
of non-null values for the row/column to be kept:

In [25]:
df.dropna(axis='rows', thresh=3)

     0    1  2   3
1  2.0  3.0  5 NaN


Here, the first and last rows have been dropped because they each contain only two
non-null values.
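The same threshold can be applied to columns instead of rows. The sketch below rebuilds the DataFrame from the cells above so that it runs on its own; the names `kept_two` and `kept_three` are hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame, including the all-null column 3.
df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
df[3] = np.nan

# Keep columns with at least two non-null values: columns 0, 1, and 2 qualify.
kept_two = df.dropna(axis="columns", thresh=2)

# Keep columns with at least three non-null values: only column 2 qualifies.
kept_three = df.dropna(axis="columns", thresh=3)
```

Raising thresh makes the filter stricter, since each column must then contain more observed values to survive.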

# Filling Null Values

In [28]:
data1 = pd.Series([1, np.nan, 2, None, 3], index = list('abcde'), dtype = 'Int32')
data1

a       1
b    <NA>
c       2
d    <NA>
e       3
dtype: Int32

We can fill NA entries with a single value, such as zero:

In [30]:
data1.fillna(0)

a    1
b    0
c    2
d    0
e    3
dtype: Int32

We can specify a forward fill to propagate the previous value forward:

In [32]:
# forward fill; fillna(method='ffill') is deprecated as of pandas 2.1
data1.ffill()

a    1
b    1
c    2
d    2
e    3
dtype: Int32

Or we can specify a backward fill to propagate the next values backward:

In [33]:
# back fill; fillna(method='bfill') is deprecated as of pandas 2.1
data1.bfill()

a    1
b    2
c    2
d    3
e    3
dtype: Int32

In the case of a DataFrame, the options are similar, but we can also specify an axis
along which the fills should take place:

In [34]:
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [36]:
df.ffill(axis = 1)

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0


Notice that if a previous value is not available during a forward fill, the NA value
remains.
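One possible way to handle such leftover values is to follow the forward fill with a backward fill. This is only a sketch on a hypothetical Series `s`, and whether chaining the two fills is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical Series whose first entry has no previous value to copy.
s = pd.Series([np.nan, 1.0, np.nan, 3.0])

# The forward fill alone leaves the leading NaN in place.
forward = s.ffill()

# A backward fill afterwards fills it from the next available value.
both = s.ffill().bfill()
```

Here `forward` still has a null in position 0, while `both` contains no null values.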