<a class='anchor' id='top'></a>
<h1>Handling Missing Data</h1>

<b>Author</b>: Calvin King<br>
<b>Date</b>:   04/09/2022

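Here, we'll learn how to handle missing data in Pandas using the <code>fillna()</code>, <code>interpolate()</code>, and <code>dropna()</code> methods. Missing values can be filled with a single value, with a different value per column, or with one of the interpolation methods. Rows that contain missing values can also be dropped.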

<hr/>

<h2>Contents</h2>

* [Convert String Column Into <code>type=Date</code>](#convert_string)
* [Use date as an Index with <code>.set_index()</code>](#set_index)
* [Use <code>fillna()</code> method](#fillna)
* [Use <code>fillna(method=ffill)</code> in DataFrame](#ffill)
* [Use <code>fillna(method=bfill)</code> in DataFrame](#bfill)
* [Use <code>fillna(axis=0)</code> in DataFrame](#axis)
* [Use <code>fillna(limit)</code> in DataFrame](#limit)
* [Use <code>interpolate()</code> to do interpolation](#interpolate)
* [Interpolate() method 'Time'](#time)
* [Use <code>.dropna()</code> to drop all rows with 'na'](#dropna)
* [Use <code>.dropna(how)</code>](#how)
* [Use <code>.dropna(thresh)</code>](#thresh)



In [69]:
import numpy as np
import pandas as pd
import seaborn as sns

<hr/>

* [...back to top](#top)
<a class='anchor' id='convert_string'><h2>Convert String Column Into <code>type=Date</code></h2></a>

When importing data, you can pass <code>parse_dates=['COL_NAME']</code> to <code>read_csv()</code> to convert a column to the `datetime64[ns]` data type:

In [70]:
df = pd.read_csv(r'C:\Users\Work\Desktop\Python Lessons\Data Science\Data Science w Py Course\Data For Use\weather.csv',
                parse_dates=['day']);
df.day.dtypes

dtype('<M8[ns]')

Alternatively, you can convert a column that has already been loaded with <code>.astype('datetime64[ns]')</code>:

In [71]:
df.day = df.day.astype('datetime64[ns]')
df

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32,6,Rain
1,2017-01-02,35,7,Sunny
2,2017-01-03,28,2,Snow
3,2017-01-04,24,7,Snow
4,2017-01-05,32,4,Rain
5,2017-01-06,32,2,Sunny
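
The two conversion routes above can be compared side by side. The sketch below is only illustrative: it reads a small in-memory CSV (a hypothetical stand-in for `weather.csv`) and uses `pd.to_datetime()`, which is not used in the original notebook, as a common alternative to `.astype()`.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for weather.csv
csv_text = """day,temperature,windspeed,event
2017-01-01,32,6,Rain
2017-01-02,35,7,Sunny
2017-01-03,28,2,Snow
"""

# Route 1: parse the 'day' column while reading the file
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['day'])

# Route 2: read 'day' as plain text, then convert it afterwards
raw = pd.read_csv(io.StringIO(csv_text))
raw['day'] = pd.to_datetime(raw['day'])

# Both columns now hold datetime values instead of strings
print(parsed['day'].dtype)
print(raw['day'].dtype)
```

Either route gives a datetime column; parsing at read time simply saves a step.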


<hr/>

* [...back to top](#top)
<a class='anchor' id='set_index'><h2>Use date as an Index with <code>.set_index()</code></h2></a>

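Calling <code>set_index('Col', inplace=True/False)</code> sets the index to the values of the specified column. With <code>inplace=True</code> the DataFrame is modified directly; otherwise a new DataFrame is returned and the original is left unchanged: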

In [53]:
df.set_index('day',inplace=True); df

day,temperature,windspeed,event
2017-01-01,32,6,Rain
2017-01-02,35,7,Sunny
2017-01-03,28,2,Snow
2017-01-04,24,7,Snow
2017-01-05,32,4,Rain
2017-01-06,32,2,Sunny
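
As a minimal sketch of the difference between the two modes, the example below builds a small hypothetical frame (`df_demo`, not part of the original notebook) and uses the non-inplace form, which returns a new DataFrame.

```python
import pandas as pd

# Hypothetical three-day sample with the same column names as the weather data
df_demo = pd.DataFrame({
    'day': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03']),
    'temperature': [32, 35, 28],
})

# Without inplace=True, set_index() returns a new DataFrame
indexed = df_demo.set_index('day')

# Rows can now be looked up by date label
jan_2_temp = indexed.loc[pd.Timestamp('2017-01-02'), 'temperature']
print(jan_2_temp)

# The original frame still has 'day' as an ordinary column
print(df_demo.columns.tolist())
```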


<hr/>

* [...back to top](#top)
<a class='anchor' id='fillna'><h2>Use <code>fillna()</code> method</h2></a>

The <code>.fillna()</code> method replaces any 'NaN' values with the value specified in the parameter: 

In [72]:
nullDF

Unnamed: 0,High,Low,People
0,1.0,,4.0
1,,2.0,3.0
2,3.0,3.0,


In [68]:
nullDF.isnull().sum()

High      0
Low       0
People    0
dtype: int64

The per-column counts above are all zero, meaning no 'null' or 'NaN' values were detected. To be safe, we'll still set a default for each column by passing a dict to `fillna()` (here `0` for the numeric columns and `'no event'` for `event`):

In [75]:
df.fillna({
    'temperature': 0,
    'windspeed': 0,
    'event': 'no event'
})

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32,6,Rain
1,2017-01-02,35,7,Sunny
2,2017-01-03,28,2,Snow
3,2017-01-04,24,7,Snow
4,2017-01-05,32,4,Rain
5,2017-01-06,32,2,Sunny
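
Because the weather data has no gaps, the call above returns the frame unchanged. To see `fillna()` actually replace something, here is a small sketch on a hypothetical frame (`demo`) with a few values removed. It shows both a constant default per column and, as one possible alternative, the column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with deliberately missing values
demo = pd.DataFrame({
    'temperature': [32.0, np.nan, 28.0],
    'windspeed': [6.0, 7.0, np.nan],
    'event': ['Rain', None, 'Snow'],
})

# A dict gives each column its own replacement value
by_constant = demo.fillna({'temperature': 0, 'windspeed': 0, 'event': 'no event'})

# Alternative: fill numeric columns with their own mean
# (mean() skips missing values by default)
by_mean = demo.fillna({
    'temperature': demo['temperature'].mean(),
    'windspeed': demo['windspeed'].mean(),
})

print(by_constant)
print(by_mean)
```

Columns that are not listed in the dict, such as `event` in the second call, keep their missing values.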


<hr/>

* [...back to top](#top)
<a class='anchor' id='ffill'><h2>Use <code>fillna(method=ffill)</code> in DataFrame</h2></a>

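The `ffill` method carries the last valid value forward (down each column) into the 'NaN' cells that follow it. Note that `fillna()` returns a new DataFrame rather than changing `df` in place, and that recent pandas releases deprecate the `method` argument in favor of `df.ffill()`.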

In [82]:
# fillna() returns a new DataFrame, so assign the result before displaying it
df = df.fillna(method='ffill')
df

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-02,35.0,7.0,Sunny
2,2017-01-03,28.0,2.0,Snow
3,2017-01-04,24.0,7.0,Snow
4,2017-01-05,32.0,4.0,Rain
5,2017-01-06,32.0,2.0,Sunny
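
The output above is unchanged because the data has no gaps. The sketch below uses a hypothetical frame with missing values and `DataFrame.ffill()`, which performs the same forward fill as `fillna(method='ffill')`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in every column
demo = pd.DataFrame({
    'temperature': [32.0, np.nan, np.nan, 24.0],
    'windspeed': [np.nan, 7.0, np.nan, 2.0],
    'event': ['Rain', None, 'Snow', None],
})

# Each NaN takes the last valid value above it in the same column
forward = demo.ffill()
print(forward)
```

A gap at the very top of a column (the first `windspeed` value here) stays empty, because there is no earlier value to carry forward.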


<hr/>

* [...back to top](#top)
<a class='anchor' id='bfill'><h2>Use <code>fillna(method=bfill)</code> in DataFrame</h2></a>

The `bfill` method carries the next valid value backward (up each column) into the 'NaN' cells that precede it. 

In [83]:
df.fillna(method='bfill')

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-02,35.0,7.0,Sunny
2,2017-01-03,28.0,2.0,Snow
3,2017-01-04,24.0,7.0,Snow
4,2017-01-05,32.0,4.0,Rain
5,2017-01-06,32.0,2.0,Sunny
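
As with the forward fill, the effect only shows up on data that has gaps. This sketch uses a hypothetical frame and `DataFrame.bfill()`, the method form of the same backward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps
demo = pd.DataFrame({
    'temperature': [32.0, np.nan, np.nan, 24.0],
    'windspeed': [6.0, np.nan, 2.0, np.nan],
})

# Each NaN takes the next valid value below it in the same column
backward = demo.bfill()
print(backward)
```

A gap at the very bottom of a column (the last `windspeed` value here) stays empty, because there is no later value to pull back.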


<hr/>

* [...back to top](#top)
<a class='anchor' id='axis'><h2>Use <code>fillna(axis=0)</code> in DataFrame</h2></a>

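The `axis` parameter sets the direction in which `fillna()` works: `axis=0` fills along the index (down each column) and `axis=1` fills along the columns (across each row). It matters mainly for directional fills such as `ffill` and `bfill`; with a constant fill value like the one below, the result is the same either way: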

In [85]:
df.fillna(0, axis=0)

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-02,35.0,7.0,Sunny
2,2017-01-03,28.0,2.0,Snow
3,2017-01-04,24.0,7.0,Snow
4,2017-01-05,32.0,4.0,Rain
5,2017-01-06,32.0,2.0,Sunny
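
To make the direction visible, the sketch below applies a forward fill along each axis of a small hypothetical numeric frame (the column names `mon`, `tue`, `wed` are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical readings: one column per weekday, one row per station
demo = pd.DataFrame({
    'mon': [1.0, np.nan],
    'tue': [np.nan, 5.0],
    'wed': [3.0, np.nan],
})

# axis=0: carry values down each column
down = demo.ffill(axis=0)

# axis=1: carry values across each row, left to right
across = demo.ffill(axis=1)

print(down)
print(across)
```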


<hr/>

* [...back to top](#top)
<a class='anchor' id='limit'><h2>Use <code>fillna(limit)</code> in DataFrame</h2></a>

The `limit` parameter caps how many consecutive 'NaN' values are filled in each gap when a fill method is used:

In [86]:
df.fillna(method='ffill', limit=1)

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-02,35.0,7.0,Sunny
2,2017-01-03,28.0,2.0,Snow
3,2017-01-04,24.0,7.0,Snow
4,2017-01-05,32.0,4.0,Rain
5,2017-01-06,32.0,2.0,Sunny
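
A short sketch on a hypothetical column with a three-row gap shows the cap in action. It uses `ffill(limit=1)`, the method form of the call above.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a run of three missing values
demo = pd.DataFrame({'temperature': [32.0, np.nan, np.nan, np.nan, 24.0]})

# Only the first NaN in the run is filled; the rest stay missing
limited = demo.ffill(limit=1)
print(limited)
```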


<hr/>

* [...back to top](#top)
<a class='anchor' id='interpolate'><h2>Use <code>interpolate()</code> to do interpolation</h2></a>

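By default, the `interpolate()` method fills each null value by linear interpolation between the nearest valid values above and below it in the same column. A null value at the very top of a column has no earlier neighbor and is left as is:


<p align='center'> <b>Before interpolation... </b></p>
    
| Low | Medium | High |
|---:|---:|---:|
| 1 | 3 | 1 | 
| 2 |  |  |
| 3 | 5 | 3 |
    
<p align='center'> <b>After interpolation... </b></p>

| Low | Medium | High |
|---:|---:|---:|
| 1 | 3 | 1 | 
| 2 | 4 | 2 |
| 3 | 5 | 3 |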

In [None]:
df.interpolate()
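
Here is a runnable sketch of the default linear behavior on a hypothetical frame. It also shows what happens at the edges of a column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: gaps in the middle, at the top, and at the bottom
demo = pd.DataFrame({
    'temperature': [28.0, np.nan, np.nan, 34.0, np.nan],
    'windspeed': [np.nan, 4.0, np.nan, 8.0, 9.0],
})

# Default method is 'linear': rows are treated as equally spaced
smooth = demo.interpolate()
print(smooth)
```

Interior gaps are filled with evenly spaced values. With the default settings a trailing gap repeats the last valid value, while a leading gap is left missing.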

<hr/>

* [...back to top](#top)
<a class='anchor' id='time'><h2>Interpolate() method 'Time'</h2></a>

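Passing `method='time'` to `interpolate()` weights the linear interpolation by the actual time elapsed between index values instead of treating the rows as equally spaced. It requires a `DatetimeIndex`, such as the one created with `set_index('day')`: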

In [None]:
df.interpolate(method='time')
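
The difference from plain linear interpolation only appears when the dates are unevenly spaced. The sketch below uses a hypothetical frame with a three-day jump between the last two rows.

```python
import numpy as np
import pandas as pd

# Hypothetical readings on unevenly spaced days (Jan 1, Jan 2, Jan 5)
idx = pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-05'])
demo = pd.DataFrame({'temperature': [10.0, np.nan, 50.0]}, index=idx)

# Equal spacing: the gap is treated as the midpoint of its neighbors
linear = demo.interpolate()

# Time weighting: Jan 2 is 1 day into a 4-day span, so 1/4 of the way up
timed = demo.interpolate(method='time')

print(linear)
print(timed)
```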

<hr/>

* [...back to top](#top)
<a class='anchor' id='dropna'><h2>Use <code>.dropna()</code> to drop all rows with 'na'</h2></a>

With no arguments, `.dropna()` deletes every row that contains at least one null value. 

In [90]:
df.dropna()

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-02,35.0,7.0,Sunny
2,2017-01-03,28.0,2.0,Snow
3,2017-01-04,24.0,7.0,Snow
4,2017-01-05,32.0,4.0,Rain
5,2017-01-06,32.0,2.0,Sunny
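
No rows are removed above because the data is complete. On a hypothetical frame with gaps, the default behavior looks like this:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 0 is complete, row 1 has one gap, row 2 is empty
demo = pd.DataFrame({
    'temperature': [32.0, np.nan, np.nan],
    'windspeed': [6.0, 7.0, np.nan],
    'event': ['Rain', 'Sunny', None],
})

# Default (how='any'): a single missing value is enough to drop the row
complete_rows = demo.dropna()
print(complete_rows)
```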


<hr/>

* [...back to top](#top)
<a class='anchor' id='how'><h2>Use <code>.dropna(how)</code></h2></a>

`.dropna(how='all')` only deletes rows in which ALL values are null.

In [91]:
df.dropna(how='all')

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-02,35.0,7.0,Sunny
2,2017-01-03,28.0,2.0,Snow
3,2017-01-04,24.0,7.0,Snow
4,2017-01-05,32.0,4.0,Rain
5,2017-01-06,32.0,2.0,Sunny
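
A brief sketch on a hypothetical frame contrasts `how='all'` with the default `how='any'`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 0 is complete, row 1 has one gap, row 2 is empty
demo = pd.DataFrame({
    'temperature': [32.0, np.nan, np.nan],
    'windspeed': [6.0, 7.0, np.nan],
    'event': ['Rain', 'Sunny', None],
})

# how='any' (the default) drops rows 1 and 2
any_dropped = demo.dropna(how='any')

# how='all' drops only row 2, which has no valid values at all
all_dropped = demo.dropna(how='all')

print(any_dropped)
print(all_dropped)
```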


<hr/>

* [...back to top](#top)
<a class='anchor' id='thresh'><h2>Use <code>.dropna(thresh)</code></h2></a>

`.dropna(thresh=n)` gives finer control: it keeps only the rows that have at least `n` non-null values and drops the rest. With `thresh=1`, a row survives if it has at least one valid value:

In [98]:
df.dropna(thresh=1)

Unnamed: 0,day,temperature,windspeed,event
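
To see how the threshold changes the result, this sketch uses a hypothetical frame whose rows have three, two, one, and zero valid values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: valid values per row are 3, 2, 1, 0
demo = pd.DataFrame({
    'temperature': [32.0, 35.0, np.nan, np.nan],
    'windspeed': [6.0, np.nan, np.nan, np.nan],
    'event': ['Rain', 'Sunny', 'Snow', None],
})

# Keep rows with at least one valid value (drops only the empty row)
at_least_one = demo.dropna(thresh=1)

# Keep rows with at least two valid values
at_least_two = demo.dropna(thresh=2)

print(at_least_one)
print(at_least_two)
```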


<h2>Adding new dates to the index</h2>

In [97]:
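# date_range() already returns a DatetimeIndex covering Jan 1-10, 2017
idx = pd.date_range('01-01-2017', '01-10-2017')
# reindex() aligns on the existing index labels;
# labels that are not in the current index produce empty (NaN/NaT) rows
df = df.reindex(idx)
df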

Unnamed: 0,day,temperature,windspeed,event
2017-01-01,NaT,,,
2017-01-02,NaT,,,
2017-01-03,NaT,,,
2017-01-04,NaT,,,
2017-01-05,NaT,,,
2017-01-06,NaT,,,
2017-01-07,NaT,,,
2017-01-08,NaT,,,
2017-01-09,NaT,,,
2017-01-10,NaT,,,
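
Every row in the output above is empty because `day` is still an ordinary column at this point, so none of the new date labels match the existing integer index. A minimal sketch of the likely intended workflow, on a hypothetical frame, sets `day` as the index before reindexing so that existing rows are kept and only the new dates come back empty.

```python
import pandas as pd

# Hypothetical sample with Jan 3 missing from the data
demo = pd.DataFrame({
    'day': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-04']),
    'temperature': [32, 35, 24],
})
new_index = pd.date_range('2017-01-01', '2017-01-05')

# Reindexing the integer-indexed frame matches nothing
misaligned = demo.reindex(new_index)

# Setting 'day' as the index first lets the dates line up
aligned = demo.set_index('day').reindex(new_index)

print(misaligned)
print(aligned)
```

The empty rows in `aligned` (Jan 3 and Jan 5 here) are then candidates for `fillna()`, `interpolate()`, or `dropna()`.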
