# Handle Missing Data

1. fillna
2. interpolate
3. dropna

In [1]:
import pandas as pd
df = pd.read_csv("7-5_weather_data.csv")
df

,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,7.0,Sunny
2,1/5/2017,28.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,31.0,2.0,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny


In [2]:
import pandas as pd
df = pd.read_csv("7-5_weather_data.csv", parse_dates=["day"])
df

,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,,7.0,Sunny
2,2017-01-05,28.0,,Snow
3,2017-01-06,,7.0,
4,2017-01-07,32.0,,Rain
5,2017-01-08,31.0,2.0,Sunny
6,2017-01-09,,,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


In [6]:
import pandas as pd
df = pd.read_csv("7-5_weather_data.csv", parse_dates=["day"])
df.set_index('day',inplace=True)
df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,7.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


First, we read the CSV into a dataframe. Next, we parse the "day" column values as dates, using **parse_dates**. Finally, we set the "day" column as the index. Remember to pass **inplace=True**; otherwise **set_index** returns a new dataframe and leaves the original unchanged.
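
The three steps above can also be collapsed into one call. The sketch below is only an illustration, not one of the notebook's cells: it assumes the CSV has the columns shown in the output above and reads the same rows from an inline string (via `io.StringIO`, with the hypothetical name `CSV_TEXT`) so it runs without the file on disk.

```python
import io
import pandas as pd

# Inline stand-in for 7-5_weather_data.csv (same rows as the output above).
CSV_TEXT = """day,temperature,windspeed,event
1/1/2017,32.0,6.0,Rain
1/4/2017,,7.0,Sunny
1/5/2017,28.0,,Snow
1/6/2017,,7.0,
1/7/2017,32.0,,Rain
1/8/2017,31.0,2.0,Sunny
1/9/2017,,,
1/10/2017,34.0,8.0,Cloudy
1/11/2017,40.0,12.0,Sunny
"""

# parse_dates converts "day" to datetimes and index_col makes it the index,
# so no separate set_index(..., inplace=True) call is needed.
df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["day"], index_col="day")

print(df.index.dtype)
print(df.head(3))
```

If you prefer to keep the steps separate, `df = df.set_index("day")` does the same job as **inplace=True** by reassigning the result.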

Often it's useful to replace NA values (NaN) with something meaningful. We can do this in several ways.
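
Before picking a replacement, it can help to see how many values are missing in each column. A minimal sketch, assuming the same weather data rebuilt inline (the dictionary below is a stand-in for the CSV, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Rebuild the weather data by hand so the example is self-contained.
df = pd.DataFrame(
    {
        "temperature": [32, np.nan, 28, np.nan, 32, 31, np.nan, 34, 40],
        "windspeed": [6, 7, np.nan, 7, np.nan, 2, np.nan, 8, 12],
        "event": ["Rain", "Sunny", "Snow", np.nan, "Rain",
                  "Sunny", np.nan, "Cloudy", "Sunny"],
    },
    index=pd.DatetimeIndex(
        ["2017-01-01", "2017-01-04", "2017-01-05", "2017-01-06", "2017-01-07",
         "2017-01-08", "2017-01-09", "2017-01-10", "2017-01-11"],
        name="day",
    ),
)

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```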

## fillna

In [9]:
new_df = df.fillna(0)
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,7.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,0
2017-01-07,32.0,0.0,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,0.0,0.0,0
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


**fillna(0)** = Replaces every NaN with 0.

But 0 isn't always the best guess. For instance, in "event", what does 0 mean? You still want to use **fillna**, but you don't want to fill the entire **df** with 0. You want to put specific values in specific columns.

In [10]:
new_df = df.fillna({
    'temperature': 0,
    'windspeed': 0,
    'event': 'no event',
    })
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,7.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,no event
2017-01-07,32.0,0.0,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,0.0,0.0,no event
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


You can pass a dictionary to **fillna** to specify a fill value for each column.

But it's still not perfect. In "temperature", the NA value shouldn't be 0: that would be a huge drop from 32 degrees to 0 degrees between consecutive readings. We can instead carry the value over from the previous row.

In [11]:
new_df = df.fillna(method="ffill")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,7.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,31.0,2.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [13]:
new_df = df.fillna(method="bfill", axis="columns")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6,Rain
2017-01-04,7.0,7,Sunny
2017-01-05,28.0,Snow,Snow
2017-01-06,7.0,7,
2017-01-07,32.0,Rain,Rain
2017-01-08,31.0,2,Sunny
2017-01-09,,,
2017-01-10,34.0,8,Cloudy
2017-01-11,40.0,12,Sunny


In [14]:
new_df = df.fillna(method="ffill", limit=1)
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,7.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,31.0,2.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


**ffill** = Forward fill. If a value is NA, carry forward the previous row's value.

**bfill** = Backward fill. Copy the next row's value back into the NA.

**axis="columns"** = Fills horizontally instead of vertically (the example fills backwards, "bfill", across the columns, so a missing temperature takes the windspeed value to its right).

**limit** = The maximum number of consecutive NA values to fill forward/backward. In this data no gap is longer than one row, so **limit=1** gives the same result as the plain forward fill.
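
Because the weather data never has two NA values in a row, the effect of **limit** is easier to see on a toy series with a longer gap. The sketch below is an illustration on made-up numbers, not the notebook's data. Newer pandas releases flag `fillna(method=...)` as deprecated, so it uses the equivalent `ffill()` and `bfill()` methods.

```python
import numpy as np
import pandas as pd

# Toy series with a two-row gap, so that limit has a visible effect.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()               # 1.0, 1.0, 1.0, 4.0
backward = s.bfill()              # 1.0, 4.0, 4.0, 4.0
forward_one = s.ffill(limit=1)    # 1.0, 1.0, NaN, 4.0 (second NA stays)

print(pd.DataFrame({
    "original": s,
    "ffill": forward,
    "bfill": backward,
    "ffill(limit=1)": forward_one,
}))
```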

## interpolate

In [15]:
new_df = df.interpolate()
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,30.0,7.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,30.0,7.0,
2017-01-07,32.0,4.5,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,32.5,5.0,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [16]:
new_df = df.interpolate(method="time")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,29.0,7.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,30.0,7.0,
2017-01-07,32.0,4.5,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,32.5,5.0,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


**interpolate** has pandas estimate each missing number from its neighbours, so the filled values rise or fall gradually instead of jumping.

The default interpolation is linear, but many other methods are available.

The second example interpolates with time (the date index) in mind. The data skips from 1/1 to 1/4, so linear interpolation puts 1/4 halfway between 32 and 28 (30.0), while **method="time"** accounts for the three days that have passed and gives 29.0.
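
The difference between the two methods comes down to simple arithmetic, which the sketch below reproduces on just the first three temperature readings (a cut-down stand-in for the full dataframe). Linear treats the rows as equally spaced; time weights by the days elapsed: 32 + (28 - 32) * 3/4 = 29.

```python
import numpy as np
import pandas as pd

# First three temperature readings; 1/4 is missing and 1/2, 1/3 are skipped.
temperature = pd.Series(
    [32.0, np.nan, 28.0],
    index=pd.DatetimeIndex(["2017-01-01", "2017-01-04", "2017-01-05"], name="day"),
)

linear = temperature.interpolate()                # rows treated as equally spaced
by_time = temperature.interpolate(method="time")  # weighted by elapsed time

jan4 = pd.Timestamp("2017-01-04")
print(round(linear[jan4], 2))    # halfway between 32 and 28
print(round(by_time[jan4], 2))   # three quarters of the way from 32 to 28
```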

## dropna

In [17]:
new_df = df.dropna()
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [18]:
new_df = df.dropna(how="all")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,7.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [20]:
new_df = df.dropna(thresh=2)
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,7.0,Sunny
2017-01-05,28.0,,Snow
2017-01-07,32.0,,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [25]:
dt = pd.date_range("01-01-2017", "01-11-2017")
idx = pd.DatetimeIndex(dt)
df = df.reindex(idx)
df

,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-02,,,
2017-01-03,,,
2017-01-04,,7.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


**dropna** = Drops every row that contains at least one NA.

Sometimes you want to drop a row if it has at least one NA (like the first example). But what if you only want to drop rows where ALL values are NA?

The second example uses the **how="all"** parameter, which only drops rows where every cell is NA.

What if you want to go by non-NA values instead? Keep a row if it has at least two non-NA values, and drop every other row.

The third example uses **thresh=2**, which keeps only the rows that have at least two non-NA values.
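
One way to check what **thresh** does is to count the non-NA values in each row yourself and compare. A minimal sketch, assuming the same weather data rebuilt inline (the names `non_na_per_row` and `manual` are illustrative):

```python
import numpy as np
import pandas as pd

# Rebuild the weather data by hand so the example is self-contained.
df = pd.DataFrame(
    {
        "temperature": [32, np.nan, 28, np.nan, 32, 31, np.nan, 34, 40],
        "windspeed": [6, 7, np.nan, 7, np.nan, 2, np.nan, 8, 12],
        "event": ["Rain", "Sunny", "Snow", np.nan, "Rain",
                  "Sunny", np.nan, "Cloudy", "Sunny"],
    },
    index=pd.DatetimeIndex(
        ["2017-01-01", "2017-01-04", "2017-01-05", "2017-01-06", "2017-01-07",
         "2017-01-08", "2017-01-09", "2017-01-10", "2017-01-11"],
        name="day",
    ),
)

# Count the non-NA cells in every row.
non_na_per_row = df.notna().sum(axis=1)

kept = df.dropna(thresh=2)          # pandas does the filtering
manual = df[non_na_per_row >= 2]    # same filter written out by hand

print(non_na_per_row)
print(kept.equals(manual))
```

The rows for 1/6 (one non-NA value) and 1/9 (none) are the only ones that fall below the threshold.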

How do you go about inserting the missing dates?

You create a date range (**dt**), build a **DatetimeIndex** from it (**idx**), and reindex your dataframe (**df**) with that index. Dates that were not in the original data show up as rows of NaN.
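
The same idea on a smaller scale, as a sketch under a few assumptions: the frame below is a three-row stand-in for the weather data, the dates are written in ISO format to avoid any month/day ambiguity, and `all_days` and `full` are illustrative names. Since `pd.date_range` already returns a `DatetimeIndex`, it can be passed to `reindex` directly.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with a gap between 1 Jan and 4 Jan.
df = pd.DataFrame(
    {"temperature": [32.0, np.nan, 28.0], "windspeed": [6.0, 7.0, np.nan]},
    index=pd.DatetimeIndex(["2017-01-01", "2017-01-04", "2017-01-05"], name="day"),
)

# Every calendar day in the range (daily frequency is the default).
all_days = pd.date_range("2017-01-01", "2017-01-05")

# Existing rows keep their values; the new dates are filled with NaN.
full = df.reindex(all_days)
print(full)
```

Once the missing dates are in place, any of the techniques above (**fillna**, **interpolate**, **dropna**) can be applied to the reindexed dataframe.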