# Missing data
Missing data is a recurring problem in real-world datasets. In areas such as machine learning and data mining, missing values degrade data quality and can severely reduce the accuracy of model predictions, so missing value treatment is a major step toward making models more accurate and valid.

Let us now see how we can handle missing values (say NA or NaN) using Pandas.

In [5]:
# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)

# Reindexing adds the labels 'b', 'd' and 'g', which have no data, so their rows are filled with NaN. In the output, NaN means Not a Number.

        one       two     three
a  0.940463 -0.340565 -1.837282
b       NaN       NaN       NaN
c -0.728328  0.474133  0.054413
d       NaN       NaN       NaN
e -1.064686 -0.898499  0.894368
f  0.375816 -0.675072 -0.287711
g       NaN       NaN       NaN
h  0.321263  0.068812 -1.024018


# Check for Missing Values
To make detecting missing values easier across different array dtypes, Pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects:

In [6]:
import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'].isnull())

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool


In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'].notnull())

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
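
Because isnull() returns booleans, the result can be summed to count missing values. The following is a minimal sketch of that idea; the seeded generator and the variable names are assumptions added here for reproducibility and are not part of the original cells.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

df = pd.DataFrame(rng.standard_normal((5, 3)),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# True counts as 1 when summed, so this is the number of NaN per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Labels of the rows in which every value is missing.
all_missing_rows = df.index[df.isnull().all(axis=1)].tolist()
print(all_missing_rows)
```

Here each column reports three missing values, one for each label introduced by the reindex.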


# Calculations with Missing Data
>  When summing data, NA values are skipped, so they are effectively treated as zero

>  If the data are all NA, the sum is 0 in current pandas versions; pass min_count=1 to sum() to get NA instead

In [5]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

print(df['one'].sum())

        one       two     three
a  1.357254 -0.622479 -0.234068
b       NaN       NaN       NaN
c -1.083986 -0.152065  0.694614
d       NaN       NaN       NaN
e  1.511118 -0.323155 -0.134069
f -1.219519 -0.057692 -1.697159
g       NaN       NaN       NaN
h  1.545225 -0.839488  0.233139
2.11009211022252


In [6]:
import pandas as pd
import numpy as np

df = pd.DataFrame(index=[0,1,2,3,4,5],columns=['one','two'])

print(df)
print(df['one'].sum())

   one  two
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
0
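
The behaviour of sum() on missing data can be adjusted with its skipna and min_count parameters. The sketch below is illustrative only and uses small hand-written Series (an assumption, not data from the cells above) so that the results are easy to check.

```python
import numpy as np
import pandas as pd

all_missing = pd.Series([np.nan, np.nan, np.nan])
partly_missing = pd.Series([1.0, np.nan, 2.0])

# Default: NaN values are skipped, so an all-NaN sum is an empty sum.
default_sum = all_missing.sum()

# min_count=1 requires at least one valid value, otherwise the result is NaN.
strict_sum = all_missing.sum(min_count=1)

# skipna=False propagates NaN instead of skipping it.
propagated_sum = partly_missing.sum(skipna=False)

# Other reductions such as mean() also skip NaN by default.
mean_value = partly_missing.mean()

print(default_sum, strict_sum, propagated_sum, mean_value)
```

Choosing min_count=1 is useful when an all-missing column should stay visibly missing rather than silently becoming zero.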


# Cleaning / Filling Missing Data
Pandas provides various methods for cleaning missing values. The fillna function can “fill in” NA values with non-null data in several ways, which are illustrated in the following sections.

# Replace NaN with a Scalar Value
The following program shows how you can replace "NaN" with "0".

In [9]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
print(df)

df = df.reindex(['a', 'b', 'c'])
print(df)

print("NaN replaced with '0':")
print(df.fillna(0))

# Here we fill with zero; any other scalar value can be used instead.

        one       two     three
a -0.530415 -0.389313  1.220359
c  0.107680 -1.255502 -2.223981
e -0.547719 -0.890175 -2.193763
        one       two     three
a -0.530415 -0.389313  1.220359
b       NaN       NaN       NaN
c  0.107680 -1.255502 -2.223981
NaN replaced with '0':
        one       two     three
a -0.530415 -0.389313  1.220359
b  0.000000  0.000000  0.000000
c  0.107680 -1.255502 -2.223981
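
Besides a single scalar, fillna also accepts one fill value per column, given either as a dict or as a Series indexed by column name. The sketch below assumes a small hand-written DataFrame; filling with the column mean is shown as one common choice, not as a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 4.0, 8.0]},
                  index=['a', 'b', 'c'])

# df.mean() skips NaN and returns one value per column;
# fillna aligns that Series with the column names.
filled_by_mean = df.fillna(df.mean())
print(filled_by_mean)

# A dict gives an explicit fill value for each column.
filled_by_dict = df.fillna({'one': 0.0, 'two': -1.0})
print(filled_by_dict)
```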


# Fill NA Forward and Backward
The filling methods discussed in the ReIndexing chapter can also be used to fill missing values:

>  pad/ffill: fill values forward (each NA takes the last valid value before it)

>  bfill/backfill: fill values backward (each NA takes the next valid value after it)

In [11]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

# Forward fill. fillna(method='pad') is deprecated in recent pandas releases;
# ffill() is the equivalent call.
print(df.ffill())

        one       two     three
a  0.343581  1.061289  0.264791
b       NaN       NaN       NaN
c  0.451677 -1.401001 -1.789420
d       NaN       NaN       NaN
e  0.002460  0.646232  0.281441
f  1.056473  0.245884  1.146801
g       NaN       NaN       NaN
h  0.514490  0.649167 -0.916691
        one       two     three
a  0.343581  1.061289  0.264791
b  0.343581  1.061289  0.264791
c  0.451677 -1.401001 -1.789420
d  0.451677 -1.401001 -1.789420
e  0.002460  0.646232  0.281441
f  1.056473  0.245884  1.146801
g  1.056473  0.245884  1.146801
h  0.514490  0.649167 -0.916691


In [12]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

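# Backward fill. fillna(method='backfill') is deprecated in recent pandas releases;
# bfill() is the equivalent call.
print(df.bfill())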

        one       two     three
a  0.096311  2.571615 -0.835888
b       NaN       NaN       NaN
c -2.106597 -0.337306 -1.122852
d       NaN       NaN       NaN
e -0.424513  0.774408 -0.857533
f  1.045501 -0.570207 -0.798906
g       NaN       NaN       NaN
h  1.468672  0.824735 -1.107703
        one       two     three
a  0.096311  2.571615 -0.835888
b -2.106597 -0.337306 -1.122852
c -2.106597 -0.337306 -1.122852
d -0.424513  0.774408 -0.857533
e -0.424513  0.774408 -0.857533
f  1.045501 -0.570207 -0.798906
g  1.468672  0.824735 -1.107703
h  1.468672  0.824735 -1.107703
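
Forward and backward filling can be restricted so that long runs of missing values are not filled completely. The following sketch assumes a hand-written Series and uses the limit parameter of ffill() and bfill(), which caps how many consecutive NA values are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one consecutive NaN going forward from the last valid value.
forward = s.ffill(limit=1)
print(forward)

# Fill at most one consecutive NaN going backward from the next valid value.
backward = s.bfill(limit=1)
print(backward)
```

With limit=1, only the position adjacent to a valid value is filled and the rest of the gap stays NaN.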


# Drop Missing Values
If you simply want to exclude missing values, use the dropna function, optionally with the axis argument. By default, axis=0 (along rows), which means that if any value within a row is NA, the whole row is excluded.

In [14]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

print(df.dropna())

        one       two     three
a  0.565671 -0.436732 -1.050266
b       NaN       NaN       NaN
c  0.726742  1.957275 -0.362878
d       NaN       NaN       NaN
e  0.310776 -1.311723  0.709041
f  0.687475  0.536972 -2.300331
g       NaN       NaN       NaN
h  0.930695  1.321308 -1.367443
        one       two     three
a  0.565671 -0.436732 -1.050266
c  0.726742  1.957275 -0.362878
e  0.310776 -1.311723  0.709041
f  0.687475  0.536972 -2.300331
h  0.930695  1.321308 -1.367443


In [19]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

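# axis=1 drops every column that contains an NA; here that is all three columns
print(df.dropna(axis=1))
# axis=0 (the default) drops every row that contains an NA
print(df.dropna(axis=0))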

        one       two     three
a -1.211359 -0.736993  0.551887
b       NaN       NaN       NaN
c -0.712212  1.482116 -0.044715
d       NaN       NaN       NaN
e -1.075143  0.235990 -1.194043
f  0.127326 -0.295030  0.823656
g       NaN       NaN       NaN
h  0.006178 -0.826464 -0.773656
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
        one       two     three
a -1.211359 -0.736993  0.551887
c -0.712212  1.482116 -0.044715
e -1.075143  0.235990 -1.194043
f  0.127326 -0.295030  0.823656
h  0.006178 -0.826464 -0.773656
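
Dropping a row as soon as it contains one NA can be too aggressive. The sketch below, on a hand-written DataFrame assumed for illustration, shows three dropna parameters that relax this: how, thresh and subset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0, np.nan],
                   'two': [np.nan, np.nan, 6.0, np.nan],
                   'three': [9.0, np.nan, 11.0, 12.0]},
                  index=['a', 'b', 'c', 'd'])

# how='all' drops a row only if every value in it is NA.
only_fully_empty_dropped = df.dropna(how='all')
print(only_fully_empty_dropped)

# thresh=2 keeps rows that have at least two non-NA values.
at_least_two_values = df.dropna(thresh=2)
print(at_least_two_values)

# subset limits the NA check to the listed columns.
two_must_be_present = df.dropna(subset=['two'])
print(two_must_be_present)
```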


# Replace Missing (or) Generic Values
We often need to replace a generic value with a specific one. We can do this with the replace method.

Replacing NA with a scalar value using replace is equivalent to using the fillna() function.

In [23]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})
print(df)

print(df.replace({1000:10,2000:60,0:20}))

    one   two
0    10  1000
1    20     0
2    30    30
3    40    40
4    50    50
5  2000    60
   one  two
0   10   10
1   20   20
2   30   30
3   40   40
4   50   50
5   60   60
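
The equivalence between replace and fillna for missing values can be checked directly. This is a small sketch on a hand-written DataFrame (an assumption for illustration); it also shows a nested dict, which restricts a replacement to a single column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [10.0, np.nan, 30.0],
                   'two': [np.nan, 0.0, 60.0]})

# Replacing NaN with a scalar gives the same frame as fillna with that scalar.
via_replace = df.replace(np.nan, 0)
via_fillna = df.fillna(0)
print(via_replace.equals(via_fillna))

# Nested dict: {column: {old_value: new_value}} replaces only in that column.
per_column = df.replace({'two': {0: 20}})
print(per_column)
```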
