# Filtering Out Missing Data

You have a number of options for filtering out missing data. While doing it by hand is
always an option, dropna can be very helpful. On a Series, it returns the Series with only
the non-null data and index values:

In [1]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

In [5]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [6]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

Naturally, you could have computed this yourself by boolean indexing:

In [7]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64
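
As a quick check that the two approaches agree, the sketch below rebuilds the same Series and compares dropna against the boolean mask. The names s, dropped, and masked are illustrative and not part of the original session; notna is used here as the modern spelling of notnull.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3.5, np.nan, 7])

dropped = s.dropna()      # built-in filter
masked = s[s.notna()]     # same filter written as a boolean mask

# Both results keep the original index labels of the surviving rows
print(dropped.equals(masked))
print(list(dropped.index))
```

Either form returns a new Series and leaves the original untouched.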

With DataFrame objects, things are a bit more complex. You may want to drop rows
or columns that are all NA, or only those containing any NAs. By default, dropna drops
any row containing a missing value:

In [18]:
data = DataFrame([[1, 65, 3], [1., NA, NA],[NA, NA, NA], [NA, 65, 3]])
data

     0     1    2
0  1.0  65.0  3.0
1  1.0   NaN  NaN
2  NaN   NaN  NaN
3  NaN  65.0  3.0


In [20]:
cleaned = data.dropna()
cleaned.astype(int)

   0   1  2
0  1  65  3


In [26]:
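number = int(input('enter a number : '))
# cleaned.values yields one NumPy array per row, so compare element-wise
# and use .any() to test whether the number appears anywhere in the row
for row in cleaned.values:
    if (row == number).any():
        print('hi', number)

enter a number : 1
hi 1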

Passing how='all' will only drop rows that are all NA:

In [14]:
data.dropna(how='all')

     0     1    2
0  1.0  65.0  3.0
1  1.0   NaN  NaN
3  NaN  65.0  3.0
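
The same options apply to columns. As a minimal sketch, assume a copy of the frame above with an extra all-NA column labeled 4 (a hypothetical column added only for illustration); passing axis=1 together with how='all' then drops just that column:

```python
import numpy as np
import pandas as pd

NA = np.nan
frame = pd.DataFrame([[1, 65, 3], [1., NA, NA], [NA, NA, NA], [NA, 65, 3]])
frame[4] = NA  # hypothetical extra column that is entirely missing

# axis=1 switches dropna from rows to columns
by_column = frame.dropna(axis=1, how='all')
print(by_column.columns.tolist())
```

Columns 0, 1, and 2 survive because each still holds at least one non-NA value.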


A related way to filter out DataFrame rows tends to concern time series data. Suppose
you want to keep only rows containing at least a certain number of non-NA observations.
You can indicate this with the thresh argument:

In [15]:
df = DataFrame(np.random.randn(7, 3))

In [16]:
# .loc slices by label and includes the end label
df.loc[:4, 1] = NA
df.loc[:2, 2] = NA


In [17]:
df

          0         1         2
0  0.013432       NaN       NaN
1 -1.851820       NaN       NaN
2 -0.481040       NaN       NaN
3 -0.898491       NaN -0.012771
4  0.105190       NaN -0.810356
5 -0.290724 -0.167729 -0.890571
6  0.435922 -0.879174 -1.656530


In [18]:
df.dropna(thresh=3)

          0         1         2
5 -0.290724 -0.167729 -0.890571
6  0.435922 -0.879174 -1.656530
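
Because thresh counts non-NA values per row, a small deterministic frame makes the rule easier to verify than random data. The sketch below uses made-up values, and the names small, counts, and kept are illustrative only.

```python
import numpy as np
import pandas as pd

NA = np.nan
small = pd.DataFrame([[1.0, NA, NA],
                      [2.0, NA, 5.0],
                      [3.0, 4.0, 6.0]])

counts = small.notna().sum(axis=1)   # non-NA values in each row
kept = small.dropna(thresh=2)        # keep rows with at least 2 non-NA values

print(counts.tolist())
print(kept.index.tolist())
```

Row 0 has only one observation and is dropped; rows 1 and 2 meet the threshold and are kept.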
