## Module 7-5
#### This notebook contains my work on the fifth Tutorial video.

## How to handle missing data in pandas
#### Data that comes from the Internet often has missing values
   * fillna to fill missing values
   * interpolate to make a guess on missing values
   * dropna to drop rows with missing values

### Begin by reading a CSV file into pandas

In [1]:
import pandas as pd
df = pd.read_csv('7-5_weather_data.csv')
df

,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,7.0,Sunny
2,1/5/2017,28.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,31.0,2.0,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny


### Convert the "day" column to dates by using the parse_dates argument

In [2]:
df = pd.read_csv('7-5_weather_data.csv', parse_dates=['day'])
df

,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,,7.0,Sunny
2,2017-01-05,28.0,,Snow
3,2017-01-06,,7.0,
4,2017-01-07,32.0,,Rain
5,2017-01-08,31.0,2.0,Sunny
6,2017-01-09,,,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


In [3]:
type(df.day[0]) # Check the type of the parsed values: each one is a pandas Timestamp

pandas._libs.tslibs.timestamps.Timestamp
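
The effect of `parse_dates` can also be checked on the whole column rather than on a single value. A minimal sketch, assuming a two-row stand-in for the CSV file built inline with `io.StringIO` (the variable names `csv_text`, `plain`, and `parsed` are only for illustration):

```python
import io
import pandas as pd

# Two rows in the same layout as the weather file (stand-in, not the real file)
csv_text = "day,temperature,windspeed,event\n1/1/2017,32,6,Rain\n1/4/2017,,7,Sunny\n"

plain = pd.read_csv(io.StringIO(csv_text))                        # day stays text
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["day"])  # day becomes datetime

print(plain["day"].dtype)
print(parsed["day"].dtype)
print(pd.api.types.is_datetime64_any_dtype(parsed["day"]))
```

Without `parse_dates` the column holds plain strings, so date-based operations such as `interpolate(method="time")` would not be available on it.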

### Make the "day" column the index for the dataframe using the set_index method

In [None]:
df.set_index('day', inplace=True) #Use inplace=True argument to modify original dataframe

In [6]:
df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,7.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


### Replace all "NaN" values in the dataframe with some other value using the _fillna_ method

In [7]:
new_df = df.fillna(0) # Assign to new_df so the original dataframe is not overwritten
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,7.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,0
2017-01-07,32.0,0.0,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,0.0,0.0,0
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


### How to use _fillna_ to supply different values for different columns
#### Pass a dictionary

In [8]:
new_df = df.fillna({
    'temperature': 0,
    'windspeed': 0,
    'event': 'no event'
})
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,7.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,no event
2017-01-07,32.0,0.0,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,0.0,0.0,no event
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


### 0 might not be a great option because it will skew the data. How to get a better estimate?
#### Carry forward the temperature (or whichever value) from the previous day

In [9]:
new_df = df.fillna(method="ffill") #Forward fill --> Carry forward previous value
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,7.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,31.0,2.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


### Use method='bfill' to copy the next day's value (back fill)

In [10]:
new_df = df.fillna(method='bfill') #Back fill
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,28.0,7.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,32.0,7.0,Rain
2017-01-07,32.0,2.0,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,34.0,8.0,Cloudy
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
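
Newer pandas releases (2.1 and later) deprecate the `method` argument of `fillna` in favor of the dedicated `ffill()` and `bfill()` methods. A minimal sketch of the equivalent calls, assuming a small frame rebuilt inline from the first four rows of the weather data instead of being read from the CSV file:

```python
import numpy as np
import pandas as pd

# Stand-in for the weather data: the first four rows, typed in by hand
df = pd.DataFrame(
    {
        "temperature": [32.0, np.nan, 28.0, np.nan],
        "windspeed": [6.0, 7.0, np.nan, 7.0],
        "event": ["Rain", "Sunny", "Snow", None],
    },
    index=pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05", "2017-01-06"]),
)

forward = df.ffill()   # carries the previous row's value down
backward = df.bfill()  # copies the next row's value up

print(forward)
print(backward)
```

In `backward`, the last temperature stays NaN because this shortened frame has no later row to copy from.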


### Try the axis argument

In [13]:
new_df = df.fillna(method='bfill', axis="columns") # Fills across each row (from the column to the right) rather than down each column
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6,Rain
2017-01-04,7.0,7,Sunny
2017-01-05,28.0,Snow,Snow
2017-01-06,7.0,7,
2017-01-07,32.0,Rain,Rain
2017-01-08,31.0,2,Sunny
2017-01-09,,,
2017-01-10,34.0,8,Cloudy
2017-01-11,40.0,12,Sunny


In [14]:
new_df = df.fillna(method='ffill', limit=1) # limit=1 fills at most one consecutive NaN after each valid value
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,7.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,31.0,2.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
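
In this dataset no column has two missing values in a row, so `limit=1` gives the same result as a plain forward fill. A small sketch with a made-up series (the values are hypothetical, not taken from the weather file) shows where the limit takes effect. It uses `ffill(limit=...)`, which accepts the same `limit` argument:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a run of three missing values
s = pd.Series([32.0, np.nan, np.nan, np.nan, 40.0])

unlimited = s.ffill()        # fills the whole gap with 32.0
limited = s.ffill(limit=1)   # fills only the first NaN after 32.0

print(unlimited.tolist())
print(limited.tolist())
```

With the limit in place, the second and third missing values stay NaN, which avoids stretching one old reading across a long gap.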


### Use the interpolate method to come up with a "better guess" for missing values

In [15]:
new_df = df.interpolate() # Linear by default; other interpolation methods can be chosen with the method argument
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,30.0,7.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,30.0,7.0,
2017-01-07,32.0,4.5,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,32.5,5.0,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [16]:
new_df = df.interpolate(method="time") # Weights each estimate by the time gap between dates, rather than
                                       # treating the rows as equally spaced as linear does
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,29.0,7.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,30.0,7.0,
2017-01-07,32.0,4.5,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,32.5,5.0,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
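
The only value that differs between the two outputs is the temperature on 2017-01-04. Linear interpolation treats the three rows as equally spaced and returns the midpoint of 32 and 28. The time method sees that January 4 is three of the four days between January 1 and January 5, so it returns 32 + (28 - 32) * 3/4 = 29. A minimal sketch of just those three rows, rebuilt by hand as a series:

```python
import numpy as np
import pandas as pd

# Temperature readings around the first gap in the weather data
temps = pd.Series(
    [32.0, np.nan, 28.0],
    index=pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"]),
)

linear = temps.interpolate()                # rows treated as equally spaced
by_time = temps.interpolate(method="time")  # weighted by the gap in days

print(linear.iloc[1])   # midpoint of 32 and 28
print(by_time.iloc[1])  # three quarters of the way from 32 to 28
```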


### Drop all the rows with NaN values using the _dropna( )_ method

In [17]:
new_df = df.dropna() # Drops the whole row, even if there is only one NaN value in the row.
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [18]:
new_df = df.dropna(how="all") # Drops only the rows that have ALL NaN cells
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,7.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [20]:
new_df = df.dropna(thresh=2) # Keep only the rows that have at least 2 non-NaN values; drop the rest.
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,7.0,Sunny
2017-01-05,28.0,,Snow
2017-01-07,32.0,,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
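
Note that `thresh` counts the values that are present, not the ones that are missing. A minimal sketch, assuming a four-row stand-in typed in from the weather data, that counts the non-NaN values per row and compares that count with what `dropna(thresh=2)` keeps:

```python
import numpy as np
import pandas as pd

# Four rows from the weather data with 3, 2, 1 and 0 values present
df = pd.DataFrame(
    {
        "temperature": [32.0, np.nan, np.nan, np.nan],
        "windspeed": [6.0, 7.0, 7.0, np.nan],
        "event": ["Rain", "Sunny", None, None],
    },
    index=pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-06", "2017-01-09"]),
)

non_missing = df.notna().sum(axis=1)  # number of real values in each row
kept = df.dropna(thresh=2)            # rows with at least 2 real values

print(non_missing.tolist())
print(kept.index.strftime("%Y-%m-%d").tolist())
```

This is why 2017-01-06 (one value) and 2017-01-09 (no values) disappear from the output above while 2017-01-04 stays.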


### Inserting missing dates into the dataframe

In [22]:
dt = pd.date_range("01-01-2017", "01-11-2017") # Build the full daily date range, including both end dates
idx = pd.DatetimeIndex(dt) # date_range already returns a DatetimeIndex, so this step is optional
df = df.reindex(idx) # Reindex the dataframe; dates that were absent become all-NaN rows
df

,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-02,,,
2017-01-03,,,
2017-01-04,,7.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,31.0,2.0,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


#### Even though there are NaN values above, we can use _fillna_, _dropna_, or _interpolate_ to handle these.
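
As a sketch of that last step, the temperature readings for the first five days can be reindexed and then filled by time-based interpolation. The series below is typed in by hand from the weather data (32 on January 1, 28 on January 5), and the names `full_range`, `reindexed`, and `filled` are only for illustration:

```python
import numpy as np
import pandas as pd

# The two known temperature readings in the first five days
temps = pd.Series(
    [32.0, 28.0],
    index=pd.to_datetime(["2017-01-01", "2017-01-05"]),
    name="temperature",
)

full_range = pd.date_range("2017-01-01", "2017-01-05")  # every day, both ends included
reindexed = temps.reindex(full_range)                   # added dates appear as NaN
filled = reindexed.interpolate(method="time")           # estimate the added dates

print(reindexed)
print(filled)
```

Because the gap is evenly spaced here, the filled values step down by one degree per day from 32 to 28.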