# Filling in Missing Data

In [1]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

In [2]:
df = DataFrame(np.random.randn(7, 3))
df

Unnamed: 0,0,1,2
0,-0.23001,-0.054632,-1.3074
1,0.96396,0.237626,-0.293383
2,1.181242,-0.088504,-1.539212
3,-0.60659,0.597766,0.872352
4,-0.970013,1.081886,0.054375
5,0.903728,-2.271121,-0.279293
6,0.796697,-0.285128,0.088446


In [3]:
from numpy import nan as NA

In [4]:
df.loc[:4, 1] = NA  # label-based slice: rows 0 through 4 of column 1


In [5]:
df

Unnamed: 0,0,1,2
0,-0.23001,,-1.3074
1,0.96396,,-0.293383
2,1.181242,,-1.539212
3,-0.60659,,0.872352
4,-0.970013,,0.054375
5,0.903728,-2.271121,-0.279293
6,0.796697,-0.285128,0.088446


In [6]:
df.loc[:2, 2] = NA  # label-based slice: rows 0 through 2 of column 2


In [7]:
df

Unnamed: 0,0,1,2
0,-0.23001,,
1,0.96396,,
2,1.181242,,
3,-0.60659,,0.872352
4,-0.970013,,0.054375
5,0.903728,-2.271121,-0.279293
6,0.796697,-0.285128,0.088446


Rather than filtering out missing data (and potentially discarding other data along with
it), you may want to fill in the “holes” in any number of ways. For most purposes, the
fillna method is the workhorse function to use. Calling fillna with a constant replaces
missing values with that value:

In [8]:
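df.fillna(0)  # returns a filled copy; the result is not stored here
df.isnull()   # so df itself still contains its missing values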

Unnamed: 0,0,1,2
0,False,True,True
1,False,True,True
2,False,True,True
3,False,True,False
4,False,True,False
5,False,False,False
6,False,False,False
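The cell above displays only the result of `df.isnull()`, so the filled values themselves never appear. A minimal sketch of the same idea on a small hand-written frame (the names `frame` and `filled` and the values are illustrative assumptions, not part of the session above) shows the filled copy next to the untouched original:

```python
import numpy as np
import pandas as pd

# Small hand-written frame so the result is easy to check by eye.
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

# fillna with a constant returns a new DataFrame with each NaN replaced.
filled = frame.fillna(0)

print(filled)
print(frame.isnull().sum().sum())   # missing values left in the original
print(filled.isnull().sum().sum())  # missing values left in the copy
```

Binding the result to a name (or displaying it as the last expression of a cell) is what makes the filled data visible.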


By calling fillna with a dict, you can use a different fill value for each column:

In [9]:
df.fillna({1: 0.5, 2: -1})

Unnamed: 0,0,1,2
0,-0.23001,0.5,-1.0
1,0.96396,0.5,-1.0
2,1.181242,0.5,-1.0
3,-0.60659,0.5,0.872352
4,-0.970013,0.5,0.054375
5,0.903728,-2.271121,-0.279293
6,0.796697,-0.285128,0.088446
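Column 0 is absent from the dict above and is left alone. The following sketch, using hypothetical column names chosen only for illustration, shows what happens when a column that does contain missing values is omitted from the dict:

```python
import numpy as np
import pandas as pd

# Hypothetical columns; "discount" has NaNs but is left out of the dict.
frame = pd.DataFrame({
    "price": [10.0, np.nan, 12.5],
    "qty": [np.nan, 4.0, np.nan],
    "discount": [np.nan, 0.1, np.nan],
})

# Keys are column labels, values are the fill value for that column.
filled = frame.fillna({"price": 0.0, "qty": 1.0})

print(filled)
```

Columns not named in the dict keep their missing values, which is handy when only some columns have a sensible default.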


fillna returns a new object, but you can modify the existing object in place:

In [10]:
# inplace=True modifies df itself; do not rely on the return value
df.fillna(0, inplace=True)
df.isnull()

Unnamed: 0,0,1,2
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,False
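As a rough comparison of the two styles, the sketch below fills one frame by rebinding the result and another by using `inplace=True`. The frame and variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"x": [np.nan, 2.0], "y": [3.0, np.nan]})

# Style 1: keep the original and bind the filled result to a new name.
reassigned = original.fillna(0)

# Style 2: fill a separate copy in place; the copy itself is changed.
in_place = original.copy()
in_place.fillna(0, inplace=True)

print(reassigned.equals(in_place))
print(original.isnull().sum().sum())  # original is untouched by both styles
```

Both styles end with the same data. The in-place form avoids creating a second name, while the rebinding form leaves the source object available for later comparison.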


The same interpolation methods available for reindexing can be used with fillna:

In [11]:
df = DataFrame(np.random.randn(6, 3))
df

Unnamed: 0,0,1,2
0,0.019556,-1.010601,-1.317891
1,1.036795,0.598009,0.344202
2,-1.625356,1.083522,0.558252
3,0.236336,-0.279523,0.242042
4,-0.233641,-0.369312,-0.238208
5,-1.483483,0.530471,-1.239086


In [12]:
df.loc[2:, 1] = NA; df.loc[4:, 2] = NA


In [13]:
df

Unnamed: 0,0,1,2
0,0.019556,-1.010601,-1.317891
1,1.036795,0.598009,0.344202
2,-1.625356,,0.558252
3,0.236336,,0.242042
4,-0.233641,,
5,-1.483483,,


In [14]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,0.019556,-1.010601,-1.317891
1,1.036795,0.598009,0.344202
2,-1.625356,0.598009,0.558252
3,0.236336,0.598009,0.242042
4,-0.233641,0.598009,0.242042
5,-1.483483,0.598009,0.242042


In [15]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,0.019556,-1.010601,-1.317891
1,1.036795,0.598009,0.344202
2,-1.625356,0.598009,0.558252
3,0.236336,0.598009,0.242042
4,-0.233641,,0.242042
5,-1.483483,,0.242042
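Recent pandas releases deprecate the `method` argument of fillna in favor of the dedicated `ffill` and `bfill` methods. A minimal sketch of the equivalent calls, on an illustrative Series that is not part of the session above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Carry the last valid value forward over every gap.
forward = s.ffill()

# Same, but fill at most 2 consecutive missing values per gap.
forward_limited = s.ffill(limit=2)

# Pull the next valid value backward instead.
backward = s.bfill()

print(pd.DataFrame({"ffill": forward,
                    "ffill_limit_2": forward_limited,
                    "bfill": backward}))
```

With `limit=2`, the third missing value in the run stays missing, mirroring the `limit=2` output shown above.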


With fillna you can do lots of other things with a little creativity. For example, you
might pass the mean or median value of a Series:

In [16]:
data = Series([1., NA, 3.5, NA, 7])

In [17]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
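The same pattern works with the median, and it extends to a DataFrame. The sketch below assumes a small made-up frame (`frame`) to show that passing `frame.mean()` fills each column with its own mean, since the means come back as a Series keyed by column label.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

# Median of the non-missing values 1.0, 3.5, 7.0.
median_filled = data.fillna(data.median())

# Illustrative frame; each column gets its own mean as the fill value.
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})
column_mean_filled = frame.fillna(frame.mean())

print(median_filled)
print(column_mean_filled)
```

The median is often preferred over the mean when a column has outliers, since a few extreme values shift it much less.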

Table 5-13. fillna function arguments


| Argument | Description |
|---|---|
| value | Scalar value or dict-like object to use to fill missing values |
| method | Interpolation method, such as 'ffill' (forward fill) or 'bfill' (backward fill) |
| axis | Axis to fill on, default axis=0 |
| inplace | Modify the calling object without producing a copy |
| limit | For forward and backward filling, maximum number of consecutive periods to fill |
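The axis argument is the one entry in the table not exercised above. As a hedged sketch, on a made-up frame and using the `ffill` method, filling along `axis=1` carries values across columns within each row rather than down each column:

```python
import numpy as np
import pandas as pd

# Illustrative frame with all-float columns.
frame = pd.DataFrame(
    [[1.0, np.nan, np.nan],
     [np.nan, 2.0, np.nan]],
    columns=["a", "b", "c"],
)

down_rows = frame.ffill(axis=0)    # default: fill down each column
across_cols = frame.ffill(axis=1)  # fill left to right within each row

print(down_rows)
print(across_cols)
```

A leading missing value has nothing before it to copy, so the first cell of the second row stays missing when filling across columns.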