By default, pandas represents missing values with NaN.
If the dataset uses placeholder codes such as 0, 99 or 999 for missing entries, treat them as missing values too: either drop them or approximate them.

It is often better to approximate the missing values than to simply drop them, because dropping a row also discards the valid values it contains.
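
As a minimal sketch of the placeholder case, the snippet below converts such codes to NaN with replace() so that the usual missing-value tools apply. The survey_df frame and the choice of 99 and 999 as "not recorded" codes are assumptions made for illustration; 0 is left out here because it is often a legitimate value.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which 99 and 999 stand for "not recorded".
survey_df = pd.DataFrame({'age': [20, 99, 22, 999],
                          'score': [3.5, 4.0, 999.0, 2.5]})

# Replace the placeholder codes with NaN so pandas treats them as missing.
cleaned_df = survey_df.replace([99, 999], np.nan)

# Count the missing values per column after the conversion.
missing_per_column = cleaned_df.isnull().sum()
print(missing_per_column)
```

Once the codes are NaN, fillna(), interpolate() and dropna() handle them like any other missing value.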

In [147]:
import numpy as np
import pandas as pd

from pandas import DataFrame

Fill in missing values using fillna(), replace() and interpolate()

In [148]:
data = {'names': ['steve', 'john', 'richard', 'sara', 'randy', 'michael', 'julie'],
        'age': [20, 22, 20, 21, 24, 23, 22],
        'gender': ['Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female'],
        'rank': [2, 1, 4, 5, 3, 7, 6]}
data
type(data)

dict

Create a dataframe for the ranking data.
The dataframe can be created directly from the dictionary.

In [149]:
ranking_df = DataFrame(data)
ranking_df

Unnamed: 0,names,age,gender,rank
0,steve,20,Male,2
1,john,22,Male,1
2,richard,20,Male,4
3,sara,21,Female,5
4,randy,24,Male,3
5,michael,23,Male,7
6,julie,22,Female,6


In [150]:
ranking_df.iloc[2:5, 1] = np.nan
ranking_df

Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2
1,john,22.0,Male,1
2,richard,,Male,4
3,sara,,Female,5
4,randy,,Male,3
5,michael,23.0,Male,7
6,julie,22.0,Female,6


In [151]:
ranking_df.iloc[3:6, 3] = np.nan
ranking_df.iloc[3, :] = np.nan
ranking_df

Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
2,richard,,Male,4.0
3,,,,
4,randy,,Male,
5,michael,23.0,Male,
6,julie,22.0,Female,6.0


Now the dataframe has some missing values.

Detect if there are any missing values.

The isnull() method returns True wherever a value is NaN.

In [152]:
ranking_df.isnull()

Unnamed: 0,names,age,gender,rank
0,False,False,False,False
1,False,False,False,False
2,False,True,False,False
3,True,True,True,True
4,False,True,False,True
5,False,False,False,True
6,False,False,False,False
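
A boolean table is hard to scan on larger data, so it can help to summarise it. The sketch below rebuilds the same frame with the gaps written in directly (an assumption made to keep the snippet self-contained) and counts the NaNs per column and the rows that contain at least one NaN.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the cells above, with the same gaps written in directly.
ranking_df = pd.DataFrame({
    'names': ['steve', 'john', 'richard', np.nan, 'randy', 'michael', 'julie'],
    'age': [20, 22, np.nan, np.nan, np.nan, 23, 22],
    'gender': ['Male', 'Male', 'Male', np.nan, 'Male', 'Male', 'Female'],
    'rank': [2, 1, 4, np.nan, np.nan, np.nan, 6],
})

# Number of NaNs in each column.
missing_per_column = ranking_df.isnull().sum()

# True for every row that has at least one NaN.
rows_with_gaps = ranking_df.isnull().any(axis=1)

print(missing_per_column)
print(rows_with_gaps.sum())
```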


The notnull() method returns True wherever a value is not NaN.

In [153]:
ranking_df.notnull()

Unnamed: 0,names,age,gender,rank
0,True,True,True,True
1,True,True,True,True
2,True,False,True,True
3,False,False,False,False
4,True,False,True,False
5,True,True,True,False
6,True,True,True,True


Apply the isnull() method for the 'age' column

In [154]:
bool_series = pd.isnull(ranking_df['age'])
bool_series

0    False
1    False
2     True
3     True
4     True
5    False
6    False
Name: age, dtype: bool

Filter the dataframe object by passing the bool_series mask.

In [155]:
ranking_df[bool_series] # Returns the rows where the age is missing (NaN)

Unnamed: 0,names,age,gender,rank
2,richard,,Male,4.0
3,,,,
4,randy,,Male,


Fill the NaNs with 0s.

In [156]:
ranking_df.fillna(0)

Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
2,richard,0.0,Male,4.0
3,0,0.0,0,0.0
4,randy,0.0,Male,0.0
5,michael,23.0,Male,0.0
6,julie,22.0,Female,6.0
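
Filling every gap with 0 also puts 0 into text columns and distorts statistics such as the mean age. One possible alternative, sketched here, is to pass fillna() a dictionary with one fill value per column. Using the column mean for age and the label 'unknown' for names are illustrative choices, not recommendations from the notebook.

```python
import numpy as np
import pandas as pd

# Same frame as above, with the gaps written in directly.
ranking_df = pd.DataFrame({
    'names': ['steve', 'john', 'richard', np.nan, 'randy', 'michael', 'julie'],
    'age': [20, 22, np.nan, np.nan, np.nan, 23, 22],
    'gender': ['Male', 'Male', 'Male', np.nan, 'Male', 'Male', 'Female'],
    'rank': [2, 1, 4, np.nan, np.nan, np.nan, 6],
})

# mean() skips NaNs, so this is the mean of the four known ages.
mean_age = ranking_df['age'].mean()

# Columns not named in the dictionary ('gender', 'rank') keep their NaNs.
filled_df = ranking_df.fillna({'age': mean_age, 'names': 'unknown'})
print(filled_df)
```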


Fill each missing value with the last valid value above it (forward fill).

In [157]:
ranking_df


Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
2,richard,,Male,4.0
3,,,,
4,randy,,Male,
5,michael,23.0,Male,
6,julie,22.0,Female,6.0


In [158]:
ranking_df.ffill() # Same result as fillna(method='pad'), which recent pandas versions deprecate



Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
2,richard,22.0,Male,4.0
3,richard,22.0,Male,4.0
4,randy,22.0,Male,4.0
5,michael,23.0,Male,4.0
6,julie,22.0,Female,6.0


Fill each missing value with the next valid value below it (backward fill).

In [159]:
ranking_df

Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
2,richard,,Male,4.0
3,,,,
4,randy,,Male,
5,michael,23.0,Male,
6,julie,22.0,Female,6.0


In [160]:
ranking_df.bfill() # Same result as fillna(method='bfill'), which recent pandas versions deprecate



Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
2,richard,23.0,Male,4.0
3,randy,23.0,Male,6.0
4,randy,23.0,Male,6.0
5,michael,23.0,Male,6.0
6,julie,22.0,Female,6.0
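
Two details of directional filling are easy to miss: a gap at the very start (for a forward fill) or the very end (for a backward fill) has nothing to copy from and stays NaN, and a long run of NaNs is filled entirely unless a limit is given. The small series below is a made-up example that shows both with ffill(), bfill() and the limit parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, a two-value gap and a trailing gap.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: the leading NaN has no earlier value and stays NaN.
forward = s.ffill()

# Backward fill: the trailing NaN has no later value and stays NaN.
backward = s.bfill()

# limit=1 fills at most one consecutive NaN per gap.
limited = s.ffill(limit=1)

print(pd.DataFrame({'original': s, 'ffill': forward,
                    'bfill': backward, 'ffill_limit_1': limited}))
```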


The interpolate() method fills NaNs by interpolating between the valid values above and below them. The linear method assumes the values are equally spaced. Non-numeric columns such as names and gender are left unchanged.


In [161]:
ranking_df

Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
2,richard,,Male,4.0
3,,,,
4,randy,,Male,
5,michael,23.0,Male,
6,julie,22.0,Female,6.0


In [162]:
ranking_df.interpolate(method='linear')



Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
2,richard,22.25,Male,4.0
3,,22.5,,4.5
4,randy,22.75,Male,5.0
5,michael,23.0,Male,5.5
6,julie,22.0,Female,6.0
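
Since interpolation only makes sense for numbers, and recent pandas versions warn when interpolate() is called on a frame that contains object columns, one option is to interpolate the numeric columns explicitly. The sketch below does this with select_dtypes(); the frame is rebuilt with the gaps written in directly so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same frame as above, with the gaps written in directly.
ranking_df = pd.DataFrame({
    'names': ['steve', 'john', 'richard', np.nan, 'randy', 'michael', 'julie'],
    'age': [20, 22, np.nan, np.nan, np.nan, 23, 22],
    'gender': ['Male', 'Male', 'Male', np.nan, 'Male', 'Male', 'Female'],
    'rank': [2, 1, 4, np.nan, np.nan, np.nan, 6],
})

# Pick out the numeric columns ('age' and 'rank').
numeric_cols = ranking_df.select_dtypes(include='number').columns

# Interpolate only those columns; text columns are copied over untouched.
interpolated_df = ranking_df.copy()
interpolated_df[numeric_cols] = ranking_df[numeric_cols].interpolate(method='linear')
print(interpolated_df)
```

The three missing ages lie between 22 and 23, so each step adds 0.25, which matches the output shown above.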


Missing values can also be dropped with the dropna() method.

In [163]:
ranking_df.dropna() # By default, only rows without any NaN are returned; no columns are dropped.

Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
6,julie,22.0,Female,6.0


With how='all', only the rows in which every value is NaN are dropped, so a single NaN is not enough to drop a row. (With axis=1 the same rule applies to columns.)

In [164]:
ranking_df

Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
2,richard,,Male,4.0
3,,,,
4,randy,,Male,
5,michael,23.0,Male,
6,julie,22.0,Female,6.0


In [165]:
ranking_df.dropna(how='all')

Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
2,richard,,Male,4.0
4,randy,,Male,
5,michael,23.0,Male,
6,julie,22.0,Female,6.0
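
Between dropping on any NaN and dropping only on all NaNs there are two finer controls, sketched below on the same frame (rebuilt here so the snippet is self-contained): subset restricts the check to chosen columns, and thresh keeps rows that have at least a given number of non-missing values. The particular column and threshold are illustrative choices.

```python
import numpy as np
import pandas as pd

# Same frame as above, with the gaps written in directly.
ranking_df = pd.DataFrame({
    'names': ['steve', 'john', 'richard', np.nan, 'randy', 'michael', 'julie'],
    'age': [20, 22, np.nan, np.nan, np.nan, 23, 22],
    'gender': ['Male', 'Male', 'Male', np.nan, 'Male', 'Male', 'Female'],
    'rank': [2, 1, 4, np.nan, np.nan, np.nan, 6],
})

# Drop a row only if its 'age' is missing; NaNs in other columns are ignored.
has_age_df = ranking_df.dropna(subset=['age'])

# Keep rows that have at least 3 non-missing values out of the 4 columns.
mostly_complete_df = ranking_df.dropna(thresh=3)

print(has_age_df)
print(mostly_complete_df)
```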


Drop all columns that contain at least one NaN.


In [166]:
ranking_df.dropna(axis=1) # No columns are returned, all columns had at least one NaN. 

0
1
2
3
4
5
6


Drop all rows that contain at least one NaN.

In [167]:
ranking_df.dropna(axis=0)

Unnamed: 0,names,age,gender,rank
0,steve,20.0,Male,2.0
1,john,22.0,Male,1.0
6,julie,22.0,Female,6.0
