In this tutorial we'll learn how to handle missing data in pandas using the fillna, interpolate and dropna methods. You can fill missing values with a single value, with a dictionary of per-column values, or with one of the interpolation methods.

In [1]:
import pandas as pd

###### 1) fillna - to fill missing data using different ways
###### 2) interpolate - to make a guess on missing values using interpolation
###### 3) dropna - to drop rows with missing data

In [2]:
weather = pd.read_csv('D:\\Pandas\\CodeBasics\\datasets\\05_weather.csv', delimiter = '\t')

In [3]:
weather

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,28.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny


In [4]:
type(weather['day'][0]) 

str

Although the day column looks like a date, it is in fact a string. read_csv does not try to recognise dates by default, so they come in as plain text. To convert this into an actual date, you have to use the parse_dates argument...

In [5]:
weathercsv = pd.read_csv('D:\\Pandas\\CodeBasics\\datasets\\05_weather.csv', parse_dates = ['day'], delimiter = '\t')

In [6]:
weathercsv # We can see that the day column has indeed undergone a bit of a transformation

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,,9.0,Sunny
2,2017-01-05,28.0,,Snow
3,2017-01-06,,7.0,
4,2017-01-07,32.0,,Rain
5,2017-01-08,,,Sunny
6,2017-01-09,,,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


In [7]:
type(weathercsv['day'][0]) # When we look at the type, we see that it is now a Timestamp object and not a string

pandas._libs.tslib.Timestamp
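
If the df has already been loaded, the same conversion can be done after the fact. The sketch below is only an illustration: it assumes the first three day strings typed inline rather than read from the file, and uses pd.to_datetime with an explicit format to make it clear that the month comes first in this dataset.

```python
import pandas as pd

# The first three day strings as they appear in the raw file (typed inline here)
days = pd.Series(["1/1/2017", "1/4/2017", "1/5/2017"])

# An explicit format states that the month comes first, so "1/4/2017" is 4 January
parsed = pd.to_datetime(days, format="%m/%d/%Y")

print(type(days[0]))            # still a plain string before conversion
print(type(parsed[0]))          # a pandas Timestamp after conversion
print(parsed.dt.day.tolist())   # day of the month for each row
```

Spelling out the format is optional, but it removes any doubt about day-first versus month-first dates.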

We are now going to set the day column as our index...

In [8]:
weathercsv.set_index('day', inplace = True)

In [9]:
weathercsv # That seems to have worked very well

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


NaN values - It is often a good idea to replace these with some other meaningful value or even a guess...

###### 1) fillna

In [10]:
weathercsv.fillna(50) # This will replace all NaN values with 50

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,50.0,9.0,Sunny
2017-01-05,28.0,50.0,Snow
2017-01-06,50.0,7.0,50
2017-01-07,32.0,50.0,Rain
2017-01-08,50.0,50.0,Sunny
2017-01-09,50.0,50.0,50
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


There are a couple of issues here though. This fills every NaN with 50 across the board, whatever the column. What does 50 mean in the event column? Were the temp and the windspeed really both 50? Not a very subtle solution.

###### To specify different fillna values for different columns
We are going to pass fillna a dictionary that specifies a different replacement value for the NaNs in each of our columns

In [11]:
weathercsv = weathercsv.fillna({
        'temperature': 0,
        'windspeed': 0,
        'event': 'No Event'
    })
weathercsv # This has worked really well

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,9.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,No Event
2017-01-07,32.0,0.0,Rain
2017-01-08,0.0,0.0,Sunny
2017-01-09,0.0,0.0,No Event
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


We can see that the different values from our dictionary have been applied properly across our df. This still isn't ideal however. As we can see, the temp went from 32F on the 1st to zero on the 4th, which doesn't seem very likely. Also, if you are trying to work out mean values, these zeros will drag them down.
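
To see how much the zeros distort a mean, here is a small sketch. It assumes the temperature column is typed inline from the table above rather than read from the file.

```python
import numpy as np
import pandas as pd

# Temperature column from the table above, with NaN for the missing days
temperature = pd.Series([32, np.nan, 28, np.nan, 32, np.nan, np.nan, 34, 40])

# mean() skips NaN by default, so only the five real readings are averaged
mean_skipping_nan = temperature.mean()

# After filling with 0, the four zeros are counted as if they were real readings
mean_after_zero_fill = temperature.fillna(0).mean()

print(round(mean_skipping_nan, 2), round(mean_after_zero_fill, 2))
```

The five known readings add up to 166, so the first mean divides that by 5 and the second by 9.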

So another way to fill in missing values is just to carry over the previous value that we do know into the next empty cell...

In [22]:
weathervsc = pd.read_csv('D:\\Pandas\\CodeBasics\\datasets\\05_weather.csv', parse_dates = ['day'], delimiter = '\t')

In [24]:
weathervsc

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,,9.0,Sunny
2,2017-01-05,28.0,,Snow
3,2017-01-06,,7.0,
4,2017-01-07,32.0,,Rain
5,2017-01-08,,,Sunny
6,2017-01-09,,,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


In [25]:
weathervsc = weathervsc.fillna(method = 'ffill')

In [26]:
weathervsc # We can see that all of our missing values have been forward filled with the value from the previous cell

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,32.0,9.0,Sunny
2,2017-01-05,28.0,9.0,Snow
3,2017-01-06,28.0,7.0,Snow
4,2017-01-07,32.0,7.0,Rain
5,2017-01-08,32.0,7.0,Sunny
6,2017-01-09,32.0,7.0,Sunny
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


ffill and bfill work across the whole df. With ffill, wherever there is an empty cell, the previous value in that column is used to fill it. The bfill argument (which I also could not get working today) does the reverse: pandas takes the value from the following cell to fill the empty cell before it. Same concept though.
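
A minimal sketch of both directions, assuming the temperature column typed inline. It uses the dedicated ffill() and bfill() methods rather than fillna(method = ...); as far as I know, pandas 2.1 and later deprecate the method argument of fillna, so these are the safer spelling on a current install.

```python
import numpy as np
import pandas as pd

# Temperature column from the table above, with NaN for the missing days
temperature = pd.Series([32, np.nan, 28, np.nan, 32, np.nan, np.nan, 34, 40])

forward = temperature.ffill()    # copy the last known value down into each gap
backward = temperature.bfill()   # copy the next known value up into each gap

print(forward.tolist())
print(backward.tolist())
```

Both methods also accept limit, for example temperature.ffill(limit = 1).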

The ffill and bfill arguments for fillna will keep filling for as many consecutive empty cells as there are. If you have two empty cells in a row, the last known value is copied into both of them. If you don't want this behaviour or you want to limit it, then pandas allows you to do that too...

In [29]:
weathervs = pd.read_csv('D:\\Pandas\\CodeBasics\\datasets\\05_weather.csv', parse_dates = ['day'], delimiter = '\t')

In [30]:
weathervs

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,,9.0,Sunny
2,2017-01-05,28.0,,Snow
3,2017-01-06,,7.0,
4,2017-01-07,32.0,,Rain
5,2017-01-08,,,Sunny
6,2017-01-09,,,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


In [31]:
weathervs = weathervs.fillna(method = 'ffill', limit = 1) # This will limit the fillna method to just one forward cell

In [32]:
weathervs # Now that the limit argument is in place, we are only ffill-ing one empty cell so consecutive cells still have NaN

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,32.0,9.0,Sunny
2,2017-01-05,28.0,9.0,Snow
3,2017-01-06,28.0,7.0,Snow
4,2017-01-07,32.0,7.0,Rain
5,2017-01-08,32.0,,Sunny
6,2017-01-09,,,Sunny
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


###### 2) (linear) interpolate - Making a better guess

In [36]:
weathercv = pd.read_csv('D:\\Pandas\\CodeBasics\\datasets\\05_weather.csv', parse_dates = ['day'], delimiter = '\t')

In [37]:
weathercv

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,,9.0,Sunny
2,2017-01-05,28.0,,Snow
3,2017-01-06,,7.0,
4,2017-01-07,32.0,,Rain
5,2017-01-08,,,Sunny
6,2017-01-09,,,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


In [38]:
weathercv = weathercv.interpolate() 

In [39]:
weathercv # The linear interpolation gives us a better guess than simply copying the previous cell's data

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,30.0,9.0,Sunny
2,2017-01-05,28.0,8.0,Snow
3,2017-01-06,30.0,7.0,
4,2017-01-07,32.0,7.25,Rain
5,2017-01-08,32.666667,7.5,Sunny
6,2017-01-09,33.333333,7.75,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


This is supposed to come up with a better guess of what the values should be rather than just copying the previous or the next day's values. As in the online example, pandas came up with a temp of 30 to go between 32 and 28 on the 4th, and the same for the 6th, while the 8th got 32.667 and the 9th got 33.333. You get the idea. It did the same for the windspeed column but did nothing for the event column. That is because interpolation only makes sense for numeric data (floats as well as ints), so the text column is left alone.
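
One way to make the numeric-only behaviour explicit is to interpolate just the numeric columns. This is a sketch under assumptions: the df is rebuilt inline instead of being read from the csv, and select_dtypes is my choice for picking the columns, not something the original notebook does. Newer pandas versions may warn when interpolate is called on a df that contains text columns, which this also avoids.

```python
import numpy as np
import pandas as pd

# The weather data from above, typed inline (day column left out for brevity)
weather = pd.DataFrame({
    "temperature": [32, np.nan, 28, np.nan, 32, np.nan, np.nan, 34, 40],
    "windspeed": [6, 9, np.nan, 7, np.nan, np.nan, np.nan, 8, 12],
    "event": ["Rain", "Sunny", "Snow", None, "Rain", "Sunny", None, "Cloudy", "Sunny"],
})

# Pick out the numeric columns and interpolate only those
numeric_cols = weather.select_dtypes(include="number").columns
weather[numeric_cols] = weather[numeric_cols].interpolate()

print(weather)
```

The event column still has its two missing values afterwards; filling those needs fillna or similar.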

Although this is definitely a better guess, we think we can do better. By default interpolate uses method = 'linear', which treats the rows as evenly spaced and ignores the dates, hence we get the middle value for the temp. However, the 4th is only one day before the 5th but three days after the 1st, so it is not in the middle at all. Is there a way that we can get interpolate to recognise this and give us a temp that is closer to the temp of the 5th?

In [47]:
weathercs = pd.read_csv('D:\\Pandas\\CodeBasics\\datasets\\05_weather.csv', parse_dates = ['day'], delimiter = '\t')

In [49]:
weathercs = weathercs.set_index('day') # set_index returns a new df, so assign it back; method = 'time' needs a DatetimeIndex
weathercs

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [51]:
weathercs = weathercs.interpolate(method = 'time') # Each guess is weighted by the gap in days between the known values

In [46]:
weathercs

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,29.0,9.0,Sunny
2017-01-05,28.0,8.0,Snow
2017-01-06,30.0,7.0,
2017-01-07,32.0,7.25,Rain
2017-01-08,32.666667,7.5,Sunny
2017-01-09,33.333333,7.75,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


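Using method = 'time', we get a temp of 29F for the 4th, which is closer to the temp of the 5th than just going for the middle value: the temp falls 4 degrees over the 4 days from the 1st to the 5th, so after 3 days it has fallen 3 degrees. Note that this only works when the index is a DatetimeIndex, which is why day has to be set as the index first. This is a very good method to use when making a guess for missing values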
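
The difference between the two methods can be checked on the temperature column alone. This is a minimal sketch, assuming the dates and temps are typed inline and the dates are used directly as the index.

```python
import numpy as np
import pandas as pd

# Dates from the table above, used as a DatetimeIndex
days = pd.to_datetime([
    "2017-01-01", "2017-01-04", "2017-01-05", "2017-01-06", "2017-01-07",
    "2017-01-08", "2017-01-09", "2017-01-10", "2017-01-11",
])
temperature = pd.Series([32, np.nan, 28, np.nan, 32, np.nan, np.nan, 34, 40], index=days)

by_position = temperature.interpolate()             # default 'linear': rows treated as evenly spaced
by_time = temperature.interpolate(method="time")    # gaps weighted by the number of days between readings

fourth = pd.Timestamp("2017-01-04")
print(by_position.loc[fourth], by_time.loc[fourth])
```

Only the 4th differs here, because it is the only missing day whose neighbours are unevenly spaced in time.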

###### 3) dropna
This is for when you just want to drop any rows that have missing values

In [55]:
weatherna = pd.read_csv('D:\\Pandas\\CodeBasics\\datasets\\05_weather.csv', parse_dates = ['day'], delimiter = '\t')

In [56]:
weatherna = weatherna.dropna()
weatherna

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


Look what's happened to our df when we drop any row that has a NaN! We are down from 9 rows to 3. You have to be very careful with this as a single NaN is enough to drop a row, so you can easily lose the vast majority of your df.

What we could do is only drop the rows that have all missing values and preserve the rows that have at least some data...

In [71]:
weathernaall = pd.read_csv('D:\\Pandas\\CodeBasics\\datasets\\05_weather.csv', parse_dates = ['day'], delimiter = '\t')

In [74]:
weathernaall.set_index('day', inplace = True)

In [77]:
weatherall = weathernaall.dropna(how = 'all')

In [78]:
weatherall

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


We see that we have dropped the row for the 9th as that was completely made up of missing values. All other rows are present as they had at least one valid value.

Threshold - The thresh argument is the minimum number of non-NaN values a row needs in order to be kept. With thresh = 1, any row with at least one non-NaN value is kept and the rest are dropped...

In [81]:
weatherthresh = weathernaall.dropna(thresh = 1)
weatherthresh # Again, we lose the 9/1/17 because it had no valid values.

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [82]:
weatherthresh02 = weathernaall.dropna(thresh = 2) # Now we are insisting on at least two valid values, otherwise the row gets dropped
weatherthresh02

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-07,32.0,,Rain
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


Now we have lost the rows for the 6th, 8th & 9th as they do not have at least two valid values
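
To see what thresh is actually counting, the number of valid values per row can be worked out by hand. A small sketch, assuming the df is rebuilt inline with day as the index; the names valid_per_row and kept_by_mask are just for illustration.

```python
import numpy as np
import pandas as pd

# The weather data from above, typed inline with day as the index
weather = pd.DataFrame(
    {
        "temperature": [32, np.nan, 28, np.nan, 32, np.nan, np.nan, 34, 40],
        "windspeed": [6, 9, np.nan, 7, np.nan, np.nan, np.nan, 8, 12],
        "event": ["Rain", "Sunny", "Snow", None, "Rain", "Sunny", None, "Cloudy", "Sunny"],
    },
    index=pd.to_datetime([
        "2017-01-01", "2017-01-04", "2017-01-05", "2017-01-06", "2017-01-07",
        "2017-01-08", "2017-01-09", "2017-01-10", "2017-01-11",
    ]),
)

valid_per_row = weather.notna().sum(axis=1)    # how many non-NaN cells each row has
kept_by_mask = weather[valid_per_row >= 2]     # the thresh = 2 rule, spelled out by hand
kept_by_thresh = weather.dropna(thresh=2)      # the same rule via dropna

print(valid_per_row.tolist())
print(kept_by_mask.equals(kept_by_thresh))
```

The 6th and 8th have one valid value each and the 9th has none, which is why those three rows go with thresh = 2.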

###### Inserting our missing dates
First we have to create a date range. This is then passed to DatetimeIndex as an argument (date_range already returns a DatetimeIndex, so this step is optional). Finally, you reindex your df, passing the variable that you created in the previous step

In [83]:
dt = pd.date_range("01-01-2017","01-11-2017")
idx = pd.DatetimeIndex(dt)
weatherdates = weather.reindex(idx)

In [84]:
weatherdates # Well, we have got our missing dates and their values are NaN as expected.

Unnamed: 0,day,temperature,windspeed,event
2017-01-01,,,,
2017-01-02,,,,
2017-01-03,,,,
2017-01-04,,,,
2017-01-05,,,,
2017-01-06,,,,
2017-01-07,,,,
2017-01-08,,,,
2017-01-09,,,,
2017-01-10,,,,


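However, the rest of our df is now NaN as well. The reason is that weather still has its default integer index (0 to 8) and day is only a string column, so none of the new date labels match an existing row and reindex fills every one of them with NaN. For this to work, the df being reindexed needs the parsed day column as its index, as weathernaall does.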
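
Here is a sketch of the reindex step on a df that has day as its index, next to the same call on the default integer index. It assumes the data is rebuilt inline (temperature only, for brevity) rather than read from the csv, and the names wrong and right are just for illustration.

```python
import numpy as np
import pandas as pd

# Day and temperature from the table above, typed inline
weather = pd.DataFrame({
    "day": pd.to_datetime([
        "2017-01-01", "2017-01-04", "2017-01-05", "2017-01-06", "2017-01-07",
        "2017-01-08", "2017-01-09", "2017-01-10", "2017-01-11",
    ]),
    "temperature": [32, np.nan, 28, np.nan, 32, np.nan, np.nan, 34, 40],
})

idx = pd.date_range("2017-01-01", "2017-01-11")    # one label per calendar day

wrong = weather.reindex(idx)                    # integer index 0 to 8: no label matches a date
right = weather.set_index("day").reindex(idx)   # date index: existing rows kept, new dates added

print(wrong["temperature"].notna().sum(), right["temperature"].notna().sum())
print(right)
```

In right, only the 2nd and 3rd are brand new rows; the other NaN values were already missing in the original data.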