In [29]:
import numpy as np, pandas as pd


# Handling Missing Data

Missing data can be marked in one of two ways: with a sentinel value, i.e. a data point that clearly falls outside the range of conventionally occurring values, or with a mask, i.e. a separate boolean variable that is true when the data point exists and false when it is missing.

Pandas builds on NumPy's conventions for representing missing data. A separate mask would require an additional boolean array, which adds storage and computation overhead, while a sentinel takes one value out of the range that a data type can represent. Pandas uses the sentinel approach, with two existing null values as sentinels: the Python `None` object and the floating-point `NaN`.
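
To make the two approaches concrete, here is a minimal sketch that computes the same mean both ways. The sentinel `-9999` and the variable names are assumptions made up for this illustration; they are not a NumPy or Pandas convention.

```python
import numpy as np

# Sentinel approach: -9999 is a hypothetical marker chosen for this sketch.
SENTINEL = -9999
readings = np.array([12, 7, SENTINEL, 3])
mean_with_sentinel = readings[readings != SENTINEL].mean()

# Masking approach: a separate boolean array flags which entries exist.
values = np.array([12, 7, 0, 3])               # the 0 at index 2 is only a placeholder
present = np.array([True, True, False, True])  # False marks the missing entry
mean_with_mask = values[present].mean()

print(mean_with_sentinel, mean_with_mask)
```

The sentinel version gives up one representable value (`-9999` can no longer be real data), while the mask version carries a second array alongside the data.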

## `None`: Pythonic Missing Data

In [30]:
vals1=np.array([1,None,'bear',3,4])
vals1

array([1, None, 'bear', 3, 4], dtype=object)

The dtype `object` (abbreviated `O`) means that the array holds Python objects, so all operations on it are done at the Python level and take more time than NumPy's native computations on numeric dtypes.
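
Speed is not the only cost. As a small sketch (the array below is a simplified variant of `vals1`, made up for this example), aggregations over an object array containing `None` generally fail, because Python cannot add an integer and `None`:

```python
import numpy as np

# An object array holding a None, similar to vals1 but without the string.
vals = np.array([1, None, 3, 4])

try:
    vals.sum()
    sum_failed = False
except TypeError as exc:
    # Addition between int and NoneType is undefined at the Python level.
    sum_failed = True
    print("sum() failed:", exc)
```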

## `NaN`: Missing Numerical Data

`NaN` is a special floating-point value standing for 'Not a Number'.

In [31]:
vals2=np.array([1,np.nan,3,4])
vals2.dtype

dtype('float64')

`NaN` should be viewed as a data virus, i.e. any arithmetic operation that it touches produces `NaN`.

In [32]:
print(1+np.nan)
print(0*np.nan)
print(vals2.sum(),vals2.mean(),vals2.max())

nan
nan
nan nan nan


NumPy also provides aggregation functions that ignore `NaN` values:

In [33]:
print(np.nansum(vals2),np.nanmean(vals2),np.nanmax(vals2))

8.0 2.6666666666666665 4.0
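
One related detail worth sketching: `NaN` never compares equal to anything, including itself, so an equality test cannot be used to locate it. The example below uses `np.isnan` instead; the variable names are chosen only for this illustration.

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])

# NaN is not equal to itself, so `vals2 == np.nan` would find nothing.
equal_to_itself = (np.nan == np.nan)

# np.isnan returns a boolean mask marking the NaN entries.
nan_mask = np.isnan(vals2)
nan_count = int(nan_mask.sum())

print(equal_to_itself, nan_mask, nan_count)
```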


Keep in mind that `NaN` is specifically a floating-point value; there is no equivalent `NaN` value for integers, strings, or other types.
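
A short sketch of what this means in practice, assuming plain NumPy arrays: mixing `NaN` with integers produces a float array, and writing `NaN` into an existing integer array raises an error.

```python
import numpy as np

# Creating an array from integers and NaN upcasts everything to float64.
mixed = np.array([1, 2, np.nan])
print(mixed.dtype)

# An integer array has no slot that could hold NaN.
ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan
    assignment_failed = False
except ValueError as exc:
    assignment_failed = True
    print("assignment failed:", exc)
```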



## `NaN` and `None` in Pandas

Although `NaN` and `None` each have their own place and interpretation, Pandas treats them as nearly interchangeable.

In [34]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

Notice how Pandas converted `None` to `NaN`.
The same conversion happens when `None` is assigned into an integer `Series`, as shown below:

In [35]:
x=pd.Series([0,1,2,4], dtype=int)
x

0    0
1    1
2    2
3    4
dtype: int32

In [36]:
x[0]=None
x[1]=np.nan
x

0    NaN
1    NaN
2    2.0
3    4.0
dtype: float64

Notice that doing so also changed the dtype of the `Series` from integer to `float64`.
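
If the integer dtype needs to survive, one option is the nullable integer extension dtype `'Int64'` (capital `I`), available in reasonably recent Pandas versions. This is a sketch of an alternative, not what the cells above use:

```python
import pandas as pd

# 'Int64' is the Pandas nullable integer dtype; missing entries become pd.NA.
nullable = pd.Series([0, None, 2, 4], dtype='Int64')
print(nullable)
print(nullable.dtype)
```

Here the missing entry is displayed as `<NA>` and the remaining values stay integers instead of being upcast to floats.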

## Operating on Null Values

### Detecting Null Values

The `isnull()` and `notnull()` methods each return a boolean mask over the data.

In [37]:
thedata=pd.Series([1,np.nan,2,4,'hello',None])
print(thedata.notnull())
print(thedata.isnull())
thedata[thedata.notnull()]#Using the boolean array as a mask

0     True
1    False
2     True
3     True
4     True
5    False
dtype: bool
0    False
1     True
2    False
3    False
4    False
5     True
dtype: bool


0        1
2        2
3        4
4    hello
dtype: object
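
Since the mask is boolean, summing it counts the null entries. The sketch below also uses `isna()`, which Pandas provides as an alias of `isnull()`; the variable names are chosen only for this example.

```python
import numpy as np
import pandas as pd

thedata = pd.Series([1, np.nan, 2, 4, 'hello', None])

# True counts as 1 and False as 0, so the sum is the number of nulls.
n_missing = int(thedata.isnull().sum())

# isna() is an alias of isnull() and yields the same mask.
same_mask = thedata.isna().equals(thedata.isnull())

print(n_missing, same_mask)
```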

### Dropping null values

In [38]:
thedata.dropna()

0        1
2        2
3        4
4    hello
dtype: object

Dropping null values from a `Series` is simpler than dropping them from a `DataFrame`, where we can only drop entire rows or entire columns.

In [39]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [40]:
#This will, by default, return a dataframe after dropping all the rows with null values
df.dropna()

     0    1  2
1  2.0  3.0  5


In [41]:
df.dropna(axis='columns')
#This drops all columns containing any null values. also works with axis=1

   2
0  2
1  5
2  6


In [42]:
df[3]=np.nan
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


Say we want to drop only the columns in which all values are null:

In [43]:
df.dropna(axis='columns',how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [44]:
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN
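
As the unchanged `df` suggests, `dropna` returns a new object and leaves the original alone. A minimal sketch, rebuilding the same frame so the block is self-contained (the name `trimmed` is made up for this example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan  # a column made up entirely of nulls

# Keep the result by assigning it; df itself is not modified.
trimmed = df.dropna(axis='columns', how='all')

print(df.shape, trimmed.shape)
```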


The default value of `how` is `'any'`, i.e. a row or column is dropped if it contains any null value.

For finer control, we can pass an integer to the `thresh` parameter.

`thresh` is the minimum number of non-null values that a row or column must contain in order to be kept.

In [45]:
df.dropna(axis='rows',thresh=3)

     0    1  2   3
1  2.0  3.0  5 NaN
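
To see why only row 1 survives, it can help to count the non-null values per row by hand. This sketch rebuilds the same frame and compares a manual filter with `dropna(thresh=3)`; the names `counts` and `kept` are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# Number of non-null values in each row.
counts = df.notnull().sum(axis='columns')
print(counts.tolist())

# Rows with at least 3 non-null values, selected manually and via thresh.
kept = df[counts >= 3]
via_thresh = df.dropna(thresh=3)
print(list(kept.index), list(via_thresh.index))
```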


### Filling Null Values

The `fillna` method returns a copy of the data with the null values replaced by the argument.

In [46]:
print(df)
df.fillna(0)

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


     0    1  2    3
0  1.0  0.0  2  0.0
1  2.0  3.0  5  0.0
2  0.0  4.0  6  0.0
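
The fill value does not have to be a single scalar. As a sketch, `fillna` also accepts a dict that maps column labels to fill values; the values `-1` and `0` below are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# Fill column 0 with -1 and column 1 with 0; unlisted columns are left as-is.
filled = df.fillna({0: -1, 1: 0})
print(filled)
```

Column 3 is not mentioned in the dict, so its nulls remain.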


We can also fill each null value with the value that precedes it, i.e. propagate the previous value forward.

In [47]:
thedata = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
thedata

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [115]:
thedata.fillna(method='ffill')

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

Or it can be done the other way round, propagating the next value backward:

In [116]:
thedata.fillna(method='bfill')

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
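
Recent Pandas releases deprecate the `method` argument of `fillna` in favour of the dedicated `ffill()` and `bfill()` methods, so depending on the installed version the cells above may emit a warning. A sketch of the equivalent calls:

```python
import numpy as np
import pandas as pd

thedata = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Forward fill: each null takes the last valid value before it.
forward = thedata.ffill()

# Backward fill: each null takes the next valid value after it.
backward = thedata.bfill()

print(forward.tolist())
print(backward.tolist())
```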

For `DataFrame`s the options are the same, the only difference being that we can also specify the axis along which the fill takes place.

In [117]:
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [121]:
df.fillna(method='ffill',axis=0)

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  2.0  4.0  6 NaN


In [120]:
df.fillna(method='ffill',axis=1)

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0
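
Note that a null stays null when there is no earlier value along the chosen axis to propagate, which is why some entries above are still missing. One possible way to handle leading nulls, sketched here with the `ffill()` and `bfill()` methods, is to chain a backward fill after the forward fill:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# Forward fill down each column, then backward fill what is still missing.
filled = df.ffill().bfill()
print(filled)
```

Column 3 contains no valid value at all, so neither direction has anything to propagate and it remains entirely null.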
