# Handling Missing Data with Pandas

This code-camp-level section covers the concepts behind missing data and how to clean it in Python with NumPy and pandas. Pandas builds on NumPy's selection capabilities and adds a number of its own methods for handling missing values.

Missing data can be identified very quickly. Other problems, however, are specific to a domain. For example, an age of 200 is inherently invalid, and we know that only because we understand what an age is. In an unfamiliar domain you may not have that knowledge.

It is therefore important to gain some understanding of the domain when you work on an analysis project. This is much like knowledge engineering for an expert system.
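The age example can be sketched as a simple range check. This is only an illustration: the sample values and the bounds `MIN_AGE` and `MAX_AGE` are assumptions made here, not rules taken from any particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 is a perfectly valid number but an impossible human age.
ages = pd.Series([25, 200, 41, np.nan, 33], name="age")

# Assumed plausible range for a human age; these bounds are illustrative only.
MIN_AGE, MAX_AGE = 0, 120

# between() is False for NaN, so only flag values that are actually present.
out_of_range = ages.notna() & ~ages.between(MIN_AGE, MAX_AGE)

# Replace the flagged entries with NaN so they are handled like other missing data.
cleaned = ages.mask(out_of_range)

print(out_of_range.sum())    # number of domain-invalid entries
print(cleaned.isna().sum())  # missing values after cleaning
```

Turning invalid entries into `NaN` is one possible design choice; depending on the project, you might instead keep them and add a separate flag column.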

## Hands on

In [2]:
import numpy as np
import pandas as pd

## Pandas utility functions

These functions identify null values, much like their NumPy counterparts.

In [7]:
pd.isnull(np.nan)

True

In [8]:
pd.isnull(None)

True

In [9]:
pd.isna(np.nan)

True

In [10]:
pd.isna(None)

True

Similarly, ``pd.notnull`` (alias ``pd.notna``) identifies non-null values.
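A short sketch of the inverse checks, using the same kinds of scalar inputs as the cells above:

```python
import numpy as np
import pandas as pd

# pd.notnull / pd.notna return the opposite of pd.isnull / pd.isna.
print(pd.notnull(np.nan))  # False
print(pd.notnull(None))    # False
print(pd.notna(3))         # True
```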

These functions also work on a whole Series or DataFrame:

In [13]:
pd.isnull(pd.DataFrame({
        "Column A" : [13, np.nan, 7],
        "Column B" : [np.nan, 100, 50]
}))

   Column A  Column B
0     False      True
1      True     False
2     False     False
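Because the result is itself a boolean DataFrame, it can be aggregated. The sketch below rebuilds the same frame (the names `df`, `missing_per_column` and `rows_with_missing` are chosen here for illustration) and summarizes where values are missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Column A": [13, np.nan, 7],
    "Column B": [np.nan, 100, 50],
})

# Summing the boolean mask counts missing values per column (True counts as 1).
missing_per_column = df.isnull().sum()

# any(axis=1) flags each row that has at least one missing value.
rows_with_missing = df.isnull().any(axis=1)

print(missing_per_column)
print(rows_with_missing)
```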


## Filtering null values

In [14]:
s = pd.Series([1,2,3,np.nan, np.nan, 4])

In [15]:
pd.notnull(s)

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [17]:
pd.notnull(s).sum() # Count not null

4

In [18]:
pd.isnull(s).sum() # Count null

2

We can opt to drop the null values:

In [19]:
s.dropna()

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64
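Note that ``dropna()`` returns a new Series and leaves ``s`` unchanged. As a minimal sketch, the same result can also be obtained by indexing with the ``pd.notnull`` mask from earlier (the names `dropped` and `filtered` are introduced here for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

# dropna() returns a new Series; s itself still holds the two NaN entries.
dropped = s.dropna()

# Boolean indexing with the notnull mask selects the same rows.
filtered = s[pd.notnull(s)]

print(dropped.equals(filtered))  # True
print(len(s), len(dropped))      # 6 4
```

In both cases the original index labels (0, 1, 2, 5) are kept, so the gap left by the dropped rows remains visible.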

Te