<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Dealing with Missing Data

As you'll recall, most scikit-learn estimators require that the input dataset contains no missing data. In this tutorial, we're going to use the Automobiles dataset, which doesn't meet this criterion.



# Load the Automobiles Dataset 

Start by loading the data, and examining it.



In [277]:
import pandas as pd

auto_df = pd.read_csv('../../Data/automobiles.csv')
auto_df

Unnamed: 0,symboling,normalised_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.40,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.40,8.0,115.0,5500.0,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,55.5,2952,ohc,four,141,mpfi,3.78,3.15,9.5,114.0,5400.0,23,28,16845.0
201,-1,95.0,volvo,gas,turbo,four,sedan,rwd,front,109.1,188.8,68.8,55.5,3049,ohc,four,141,mpfi,3.78,3.15,8.7,160.0,5300.0,19,25,19045.0
202,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,55.5,3012,ohcv,six,173,mpfi,3.58,2.87,8.8,134.0,5500.0,18,23,21485.0
203,-1,95.0,volvo,diesel,turbo,four,sedan,rwd,front,109.1,188.8,68.9,55.5,3217,ohc,six,145,idi,3.01,3.40,23.0,106.0,4800.0,26,27,22470.0


Note that not all columns of data are shown (the missing columns are between wheel_base and engine_size)

Let's tell Pandas to show all columns when displaying the dataframe. 

In [278]:
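# Use the full option name; the short form 'max_columns' is ambiguous in newer pandas versions
pd.set_option('display.max_columns', None)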
auto_df

Unnamed: 0,symboling,normalised_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.40,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.40,8.0,115.0,5500.0,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,55.5,2952,ohc,four,141,mpfi,3.78,3.15,9.5,114.0,5400.0,23,28,16845.0
201,-1,95.0,volvo,gas,turbo,four,sedan,rwd,front,109.1,188.8,68.8,55.5,3049,ohc,four,141,mpfi,3.78,3.15,8.7,160.0,5300.0,19,25,19045.0
202,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,55.5,3012,ohcv,six,173,mpfi,3.58,2.87,8.8,134.0,5500.0,18,23,21485.0
203,-1,95.0,volvo,diesel,turbo,four,sedan,rwd,front,109.1,188.8,68.9,55.5,3217,ohc,six,145,idi,3.01,3.40,23.0,106.0,4800.0,26,27,22470.0
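
If you'd rather not change the display setting for the whole notebook, pandas also provides a context manager that applies an option temporarily. The sketch below uses a small made-up frame (`wide_df` is a hypothetical name, not part of the Automobiles data) to show the idea.

```python
import pandas as pd

# Hypothetical wide frame with 30 columns, purely for illustration
wide_df = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(30)})

default_limit = pd.get_option("display.max_columns")

# Inside the block the column limit is lifted; it reverts when the block ends
with pd.option_context("display.max_columns", None):
    inside_limit = pd.get_option("display.max_columns")
    print(wide_df)

restored_limit = pd.get_option("display.max_columns")
```

This keeps the wider display local to the cells where you actually need it.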


Next we'll examine which columns are missing data. 

In [279]:
auto_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalised_losses  164 non-null    float64
 2   make               205 non-null    object 
 3   fuel_type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num_of_doors       203 non-null    object 
 6   body_style         205 non-null    object 
 7   drive_wheels       205 non-null    object 
 8   engine_location    205 non-null    object 
 9   wheel_base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb_weight        205 non-null    int64  
 14  engine_type        205 non-null    object 
 15  cylinders          205 non-null    object 
 16  engine_size        205 non

Again, some columns are missing from the output, so let's view the column information in two slices.

In [280]:
auto_df.iloc[:, 0:20].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalised_losses  164 non-null    float64
 2   make               205 non-null    object 
 3   fuel_type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num_of_doors       203 non-null    object 
 6   body_style         205 non-null    object 
 7   drive_wheels       205 non-null    object 
 8   engine_location    205 non-null    object 
 9   wheel_base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb_weight        205 non-null    int64  
 14  engine_type        205 non-null    object 
 15  cylinders          205 non-null    object 
 16  engine_size        205 non

In [281]:
auto_df.iloc[:, 20:].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   compression_ratio  205 non-null    float64
 1   horsepower         203 non-null    float64
 2   peak_rpm           203 non-null    float64
 3   city_mpg           205 non-null    int64  
 4   highway_mpg        205 non-null    int64  
 5   price              201 non-null    float64
dtypes: float64(4), int64(2)
memory usage: 9.7 KB
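
A more compact way to see which columns have gaps is to count the missing values per column directly. This is only a sketch: `sample_df` is a hypothetical miniature frame that borrows a few column names from the Automobiles data, and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Automobiles data, for illustration only
sample_df = pd.DataFrame({
    "make": ["audi", "dodge", "mazda", "isuzu"],
    "num_of_doors": ["two", np.nan, "two", "four"],
    "bore": [3.13, 3.03, np.nan, 3.03],
    "price": [np.nan, 8558.0, 10945.0, np.nan],
})

# Count the missing values in each column
missing_counts = sample_df.isna().sum()

# Keep only the columns that have at least one gap, worst first
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

On the real dataframe, the same two lines applied to `auto_df` should list only the columns that need attention.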


# Strategy 1: Deleting columns with Missing Data

The column with the most missing data is normalised_losses, which has values for only 164 of the 205 observations.

As it happens, symboling and normalised_losses relate to predicting insurance premiums, so we can drop these two columns. 



In [282]:
auto_df = auto_df.drop(['symboling', 'normalised_losses'], axis=1)
auto_df

Unnamed: 0,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.40,10.0,102.0,5500.0,24,30,13950.0
4,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.40,8.0,115.0,5500.0,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,volvo,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,55.5,2952,ohc,four,141,mpfi,3.78,3.15,9.5,114.0,5400.0,23,28,16845.0
201,volvo,gas,turbo,four,sedan,rwd,front,109.1,188.8,68.8,55.5,3049,ohc,four,141,mpfi,3.78,3.15,8.7,160.0,5300.0,19,25,19045.0
202,volvo,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,55.5,3012,ohcv,six,173,mpfi,3.58,2.87,8.8,134.0,5500.0,18,23,21485.0
203,volvo,diesel,turbo,four,sedan,rwd,front,109.1,188.8,68.9,55.5,3217,ohc,six,145,idi,3.01,3.40,23.0,106.0,4800.0,26,27,22470.0
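
Here we picked the columns to drop by hand. With many columns, one possible approach is to compute the fraction of missing values in each column and drop those above a cut-off. The sketch below makes two assumptions: `sample_df` is a made-up frame, and the 25% cut-off is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names echo the Automobiles dataset
sample_df = pd.DataFrame({
    "normalised_losses": [np.nan, np.nan, 164.0, 164.0, 95.0],
    "make": ["alfa-romero", "alfa-romero", "audi", "audi", "volvo"],
    "price": [13495.0, 16500.0, 13950.0, np.nan, 16845.0],
})

max_missing_fraction = 0.25  # arbitrary cut-off (an assumption)

# isna() gives True/False, and the mean of booleans is the fraction of True
missing_fraction = sample_df.isna().mean()

# Select the columns whose missing fraction exceeds the cut-off, then drop them
cols_to_drop = missing_fraction[missing_fraction > max_missing_fraction].index
trimmed_df = sample_df.drop(columns=cols_to_drop)
print(list(trimmed_df.columns))
```

Domain knowledge should still have the final say: a column with few gaps may be irrelevant, and a column with many gaps may be too valuable to lose.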


# Strategy 2: Finding Rows of Missing Data



The code below finds rows that have a missing value in any column.

In [283]:
flt_missing = auto_df.isna().any(axis=1)
auto_df[flt_missing]

Unnamed: 0,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
9,audi,gas,turbo,two,hatchback,4wd,front,99.5,178.2,67.9,52.0,3053,ohc,five,131,mpfi,3.13,3.4,7.0,160.0,5500.0,16,22,
27,dodge,gas,turbo,,sedan,fwd,front,93.7,157.3,63.8,50.6,2191,ohc,four,98,mpfi,3.03,3.39,7.6,102.0,5500.0,24,30,8558.0
44,isuzu,gas,std,two,sedan,fwd,front,94.5,155.9,63.6,52.0,1874,ohc,four,90,2bbl,3.03,3.11,9.6,70.0,5400.0,38,43,
45,isuzu,gas,std,four,sedan,fwd,front,94.5,155.9,63.6,52.0,1909,ohc,four,90,2bbl,3.03,3.11,9.6,70.0,5400.0,38,43,
55,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2380,rotor,two,70,4bbl,,,9.4,101.0,6000.0,17,23,10945.0
56,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2380,rotor,two,70,4bbl,,,9.4,101.0,6000.0,17,23,11845.0
57,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2385,rotor,two,70,4bbl,,,9.4,101.0,6000.0,17,23,13645.0
58,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2500,rotor,two,80,mpfi,,,9.4,135.0,6000.0,16,23,15645.0
63,mazda,diesel,std,,sedan,fwd,front,98.8,177.8,66.5,55.5,2443,ohc,four,122,idi,3.39,3.39,22.7,64.0,4650.0,36,42,10795.0
129,porsche,gas,std,two,hatchback,rwd,front,98.4,175.7,72.3,50.5,3366,dohcv,eight,203,mpfi,3.94,3.11,10.0,288.0,5750.0,17,28,



# Strategy 3: Deleting Rows

A common approach is to remove the rows that contain missing data. As a rule of thumb, this is usually acceptable as long as those rows make up less than about 5% of the overall data. 

In this dataset the target variable will be price. 

Let's delete the rows that have no price. 

In [285]:
# show the rows that are missing values for price
flt_missing = auto_df['price'].isna()
auto_df[flt_missing]

Unnamed: 0,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
9,audi,gas,turbo,two,hatchback,4wd,front,99.5,178.2,67.9,52.0,3053,ohc,five,131,mpfi,3.13,3.4,7.0,160.0,5500.0,16,22,
44,isuzu,gas,std,two,sedan,fwd,front,94.5,155.9,63.6,52.0,1874,ohc,four,90,2bbl,3.03,3.11,9.6,70.0,5400.0,38,43,
45,isuzu,gas,std,four,sedan,fwd,front,94.5,155.9,63.6,52.0,1909,ohc,four,90,2bbl,3.03,3.11,9.6,70.0,5400.0,38,43,
129,porsche,gas,std,two,hatchback,rwd,front,98.4,175.7,72.3,50.5,3366,dohcv,eight,203,mpfi,3.94,3.11,10.0,288.0,5750.0,17,28,


In [286]:
auto_df = auto_df.dropna(subset=['price'])  # the subset contains a list of columns

In [287]:
auto_df  # note that we're down to 201 rows.

Unnamed: 0,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.40,10.0,102.0,5500.0,24,30,13950.0
4,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.40,8.0,115.0,5500.0,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,volvo,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,55.5,2952,ohc,four,141,mpfi,3.78,3.15,9.5,114.0,5400.0,23,28,16845.0
201,volvo,gas,turbo,four,sedan,rwd,front,109.1,188.8,68.8,55.5,3049,ohc,four,141,mpfi,3.78,3.15,8.7,160.0,5300.0,19,25,19045.0
202,volvo,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,55.5,3012,ohcv,six,173,mpfi,3.58,2.87,8.8,134.0,5500.0,18,23,21485.0
203,volvo,diesel,turbo,four,sedan,rwd,front,109.1,188.8,68.9,55.5,3217,ohc,six,145,idi,3.01,3.40,23.0,106.0,4800.0,26,27,22470.0
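
The 5% rule of thumb mentioned at the start of this section can be checked before dropping anything. The sketch below uses a stand-in Series with the same counts as our data (205 rows, 4 of them without a price). The filler price of 10000.0 and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the price column: 201 filled rows and 4 missing ones
price = pd.Series([10000.0] * 201 + [np.nan] * 4, name="price")

# The mean of a boolean Series is the fraction of True values
fraction_missing = price.isna().mean()
print(f"{fraction_missing:.1%} of rows have no price")

threshold = 0.05  # the rule-of-thumb limit
safe_to_drop = fraction_missing < threshold
print(safe_to_drop)
```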


## Checkpoint

This leaves us with 8 rows that still have missing data.

In [288]:
flt_missing = auto_df.isna().any(axis=1)
auto_df[flt_missing]

Unnamed: 0,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
27,dodge,gas,turbo,,sedan,fwd,front,93.7,157.3,63.8,50.6,2191,ohc,four,98,mpfi,3.03,3.39,7.6,102.0,5500.0,24,30,8558.0
55,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2380,rotor,two,70,4bbl,,,9.4,101.0,6000.0,17,23,10945.0
56,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2380,rotor,two,70,4bbl,,,9.4,101.0,6000.0,17,23,11845.0
57,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2385,rotor,two,70,4bbl,,,9.4,101.0,6000.0,17,23,13645.0
58,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2500,rotor,two,80,mpfi,,,9.4,135.0,6000.0,16,23,15645.0
63,mazda,diesel,std,,sedan,fwd,front,98.8,177.8,66.5,55.5,2443,ohc,four,122,idi,3.39,3.39,22.7,64.0,4650.0,36,42,10795.0
130,renault,gas,std,four,wagon,fwd,front,96.1,181.5,66.5,55.2,2579,ohc,four,132,mpfi,3.46,3.9,8.7,,,23,31,9295.0
131,renault,gas,std,two,hatchback,fwd,front,96.1,176.8,66.6,50.5,2460,ohc,four,132,mpfi,3.46,3.9,8.7,,,23,31,9895.0


# Strategy 4: Imputing Data

## 4a - Imputing using dropna and fillna

Imputing data means replacing a missing value with a value that seems reasonable. 

Imagine that we survey a class of year 13 students to find out what their age is. 

Of the 10 students in the class, only 8 responded. 

| #  | Student | Age |
| -- | ------- | --- |
| 1  | Anne    | 16  |
| 2  | Brian   | 17  |
| 3  | Claire  | 16  |
| 4  | David   | 17  |
| 5  | Ellen   |     |
| 6  | Fred    |     |
| 7  | Gwen    | 16  |
| 8  | Harry   | 17  |
| 9  | Ian     | 16  |
| 10 | Julie   | 17  |

Let's calculate the average age of the students

$$ \frac{16+17+16+17+16+17+16+17}{10} = 13.2 $$

Clearly there's something wrong here, because there are no students aged 13 in year 13! 

With an equal number of students aged 16 and 17, common sense would tell you that the average age of the students should have been 16.5.

The mistake that we made was to divide by 10. Because we only have ages for 8 of the 10 students we probably should have divided by 8. 

$$ \frac{16+17+16+17+16+17+16+17}{8} = 16.5 $$

Effectively, this is the same as deleting the rows with missing values, as we did in Strategy 3 above. 
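
The two calculations above can be reproduced in pandas. As a sketch (the variable names are just for illustration): `Series.mean()` skips missing values by default, so it divides by 8 rather than 10.

```python
import numpy as np
import pandas as pd

ages = pd.Series([16, 17, 16, 17, np.nan, np.nan, 16, 17, 16, 17])

# Dividing the total by every row, answered or not, reproduces the mistake
naive_mean = ages.sum() / len(ages)

# Series.mean() ignores missing values, so only the 8 answers are counted
pandas_mean = ages.mean()

# count() reports how many non-missing values there are
answered = ages.count()

print(naive_mean, pandas_mean, answered)
```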

Alternatively, we could have replaced the missing values with a value that seems reasonable.

For example, blank values could be substituted with: 

* zero 
  * in some cases, zero is the correct number to substitute. Know your data!
* mean 
  * the average of the numbers that do have values - in the example above this would be 16.5
* median
  * this is the middle value when the data are sorted: half of the values lie above it and half below. 
* mode 
  * the most frequently occurring value in the dataset. Sometimes it's appropriate to substitute a missing value with the value it's statistically most likely to be.

Let's consider how this would work in our class of Year 13 students. 

In [289]:
import numpy as np

In [290]:
data_dict = {
    'Student': ['Anne', 'Brian', 'Claire', 'David', 'Ellen', 'Fred', 'Gwen', 'Harry', 'Ian', 'Julie'],
    'Age': [16, 17, 16, 17, np.nan, np.nan, 16, 17, 16, 17],
}




* Note that np.nan stands for "Not a Number" (the older alias np.NaN was removed in NumPy 2.0)
* pandas treats it as a null, i.e. you're explicitly saying "there's nothing here"

Let's make our data_dict into a dataframe. 

In [291]:
students_df = pd.DataFrame(data = data_dict)
students_df

Unnamed: 0,Student,Age
0,Anne,16.0
1,Brian,17.0
2,Claire,16.0
3,David,17.0
4,Ellen,
5,Fred,
6,Gwen,16.0
7,Harry,17.0
8,Ian,16.0
9,Julie,17.0


In [292]:
students_df.fillna(0)  # fill every missing value with 0; this applies across all columns


Unnamed: 0,Student,Age
0,Anne,16.0
1,Brian,17.0
2,Claire,16.0
3,David,17.0
4,Ellen,0.0
5,Fred,0.0
6,Gwen,16.0
7,Harry,17.0
8,Ian,16.0
9,Julie,17.0


Of course we didn't actually save the new values (by now, I expect you know how to do this). 

fillna() can also be applied to a single column rather than a whole dataframe. 


In [293]:
students_df['Age'].fillna(0) 

0    16.0
1    17.0
2    16.0
3    17.0
4     0.0
5     0.0
6    16.0
7    17.0
8    16.0
9    17.0
Name: Age, dtype: float64

In [294]:
students_df

Unnamed: 0,Student,Age
0,Anne,16.0
1,Brian,17.0
2,Claire,16.0
3,David,17.0
4,Ellen,
5,Fred,
6,Gwen,16.0
7,Harry,17.0
8,Ian,16.0
9,Julie,17.0
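
As the output above shows, the dataframe still has its gaps, because fillna() returns a new object rather than changing the original. To keep the result, assign it back. A minimal sketch, using the hypothetical names `demo_df` and `saved_df` so that students_df itself is left untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row version of the class, for illustration only
demo_df = pd.DataFrame({
    "Student": ["Anne", "Brian", "Ellen", "Fred"],
    "Age": [16, 17, np.nan, np.nan],
})

# Work on a copy so the original frame keeps its missing values
saved_df = demo_df.copy()

# Assigning the result back to the column is what makes the change stick
saved_df["Age"] = saved_df["Age"].fillna(0)
print(saved_df)
```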


What if you want to fill the missing data with the mean, median or mode? 

In [295]:
students_df['Age'].fillna(students_df['Age'].mean()) 

0    16.0
1    17.0
2    16.0
3    17.0
4    16.5
5    16.5
6    16.0
7    17.0
8    16.0
9    17.0
Name: Age, dtype: float64

In [296]:
students_df['Age'].fillna(students_df['Age'].median())

0    16.0
1    17.0
2    16.0
3    17.0
4    16.5
5    16.5
6    16.0
7    17.0
8    16.0
9    17.0
Name: Age, dtype: float64

In this simple example, the median is the same as the mean.

In [297]:
students_df['Age'].mode()

0    16.0
1    17.0
dtype: float64

Note that mode() always returns a pandas Series. In this case there's a tie, so the Series contains more than one value.

We can get the first number from this Series with: 

In [298]:
students_df['Age'].mode()[0]

16.0

And we can check how many values are tied for most frequent with:

In [299]:
len(students_df['Age'].mode())

2

Finally let's fill the missing ages with the mode (or at least the first value that comes back)

In [300]:
students_df['Age'].fillna(students_df['Age'].mode()[0]) 

0    16.0
1    17.0
2    16.0
3    17.0
4    16.0
5    16.0
6    16.0
7    17.0
8    16.0
9    17.0
Name: Age, dtype: float64

Knowing whether the mean, median or mode is most suitable requires knowing your data and what you're trying to achieve. 

The diagram below illustrates how the mean, median and mode differ depending on whether your data are skewed to one side or the other.

![Mean Median Mode](../../Images/wikipedia-mean-median-mode-skew.png)

Image source: https://en.wikipedia.org/wiki/Skewness
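
The effect of skew is easy to reproduce with a few numbers. In the made-up sample below (an assumption for illustration, not taken from the Automobiles data), a single large value pulls the mean well above the median and mode:

```python
import pandas as pd

# Hypothetical right-skewed sample: one large value drags the mean upwards
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])

mean_value = values.mean()      # sensitive to the outlier
median_value = values.median()  # middle of the sorted values
mode_value = values.mode()[0]   # most frequent value

print(mean_value, median_value, mode_value)
```

If these were the known values in a column with gaps, filling with the mean would insert a number larger than almost every observed value, which is one reason the median is often preferred for skewed data.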

# Challenge

In [301]:
flt_missing = auto_df.isna().any(axis=1)
auto_df[flt_missing]

Unnamed: 0,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
27,dodge,gas,turbo,,sedan,fwd,front,93.7,157.3,63.8,50.6,2191,ohc,four,98,mpfi,3.03,3.39,7.6,102.0,5500.0,24,30,8558.0
55,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2380,rotor,two,70,4bbl,,,9.4,101.0,6000.0,17,23,10945.0
56,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2380,rotor,two,70,4bbl,,,9.4,101.0,6000.0,17,23,11845.0
57,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2385,rotor,two,70,4bbl,,,9.4,101.0,6000.0,17,23,13645.0
58,mazda,gas,std,two,hatchback,rwd,front,95.3,169.0,65.7,49.6,2500,rotor,two,80,mpfi,,,9.4,135.0,6000.0,16,23,15645.0
63,mazda,diesel,std,,sedan,fwd,front,98.8,177.8,66.5,55.5,2443,ohc,four,122,idi,3.39,3.39,22.7,64.0,4650.0,36,42,10795.0
130,renault,gas,std,four,wagon,fwd,front,96.1,181.5,66.5,55.2,2579,ohc,four,132,mpfi,3.46,3.9,8.7,,,23,31,9295.0
131,renault,gas,std,two,hatchback,fwd,front,96.1,176.8,66.6,50.5,2460,ohc,four,132,mpfi,3.46,3.9,8.7,,,23,31,9895.0


Let's apply what we learned to the auto_df data frame.

1. Replace missing values in the 'num_of_doors' column with the mode (yes, it will work on a text column!)
2. Replace missing values for 'bore' and 'stroke' with the mean
3. Replace missing values for 'horsepower' and 'peak_rpm' with the median


Note that with some versions of pandas, I've encountered warnings when performing these tasks. 

If you encounter a warning that says "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.", it means pandas can't tell whether auto_df is an independent dataframe or a view of an earlier one. In this notebook you can ignore it, but always check that your data actually changed.
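
One common way to avoid this warning is to take an explicit copy when the subset is created, so pandas knows that later assignments are meant for the new frame. The following is a minimal sketch on made-up data (`raw_df` and `clean_df` are hypothetical names). Whether the warning appears at all depends on your pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a gap in each column
raw_df = pd.DataFrame({
    "num_of_doors": ["two", np.nan, "four", "four"],
    "price": [13495.0, 8558.0, np.nan, 16845.0],
})

# .copy() makes clean_df an independent frame rather than a possible view
clean_df = raw_df.dropna(subset=["price"]).copy()

# Assigning into the explicit copy leaves raw_df untouched
mode_doors = clean_df["num_of_doors"].mode()[0]
clean_df["num_of_doors"] = clean_df["num_of_doors"].fillna(mode_doors)
print(clean_df)
```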



In [302]:
mode = auto_df['num_of_doors'].mode()
mode[0]

'four'

In [303]:
auto_df['num_of_doors'].fillna(mode[0])

0       two
1       two
2       two
3      four
4      four
       ... 
200    four
201    four
202    four
203    four
204    four
Name: num_of_doors, Length: 201, dtype: object

In [304]:
auto_df['num_of_doors'] = auto_df['num_of_doors'].fillna(mode[0])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  auto_df['num_of_doors'] = auto_df['num_of_doors'].fillna(mode[0])


In [305]:
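# inplace=True on a selected column may not update auto_df in newer pandas; assigning back (as in the next cell) is safer
auto_df['bore'].fillna(auto_df['bore'].mean(), inplace=True)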

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().fillna(


In [306]:
auto_df['stroke'] = auto_df['stroke'].fillna(auto_df['stroke'].mean())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  auto_df['stroke'] = auto_df['stroke'].fillna(auto_df['stroke'].mean())


In [307]:
auto_df['horsepower'] = auto_df['horsepower'].fillna(auto_df['horsepower'].median())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  auto_df['horsepower'] = auto_df['horsepower'].fillna(auto_df['horsepower'].median())


In [308]:
auto_df['peak_rpm'] = auto_df['peak_rpm'].fillna(auto_df['peak_rpm'].median())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  auto_df['peak_rpm'] = auto_df['peak_rpm'].fillna(auto_df['peak_rpm'].median())
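
The separate fills above can also be collapsed into a single call, because DataFrame.fillna() accepts a dictionary that maps column names to fill values. This is a sketch on a made-up frame (`cars_df` is hypothetical and reuses only three of the column names):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column
cars_df = pd.DataFrame({
    "num_of_doors": ["two", "four", np.nan, "four"],
    "bore": [3.0, np.nan, 3.5, 4.0],
    "horsepower": [100.0, 110.0, np.nan, 200.0],
})

# One fill value per column, each computed from that column's known values
fill_values = {
    "num_of_doors": cars_df["num_of_doors"].mode()[0],
    "bore": cars_df["bore"].mean(),
    "horsepower": cars_df["horsepower"].median(),
}

# Columns not named in the dictionary are left as they are
cars_df = cars_df.fillna(fill_values)
print(cars_df)
```

Keeping the fill values in one dictionary also makes it easier to review, in one place, which statistic was chosen for each column.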


Does this make sense? Who knows! Remember this is just a tutorial, so I want to use a range of techniques. 

In an ideal world, I'd be able to hunt down the missing data, and get the correct values. 

After that, it's up to you to judge what the best trade-offs are. We're certainly better off using the mean and median than zeros in this dataset. But would we have been better off simply deleting the rows with missing data? 

Perhaps a little domain knowledge could be useful? Or maybe you'll try training your model with and without imputed data to see how it performs in each case?

Finally, let's check that we haven't left any missing values. 

In [309]:
flt_missing = auto_df.isna().any(axis=1)
auto_df[flt_missing]

Unnamed: 0,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
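
An empty result like the one above is easy to misread, so a single number can be a clearer check. This is a sketch using a made-up, already complete frame (`checked_df` is hypothetical); on the real data you would apply the same lines to auto_df.

```python
import pandas as pd

# Hypothetical frame with no gaps, for illustration only
checked_df = pd.DataFrame({"bore": [3.47, 3.19], "price": [13495.0, 13950.0]})

# The first sum() counts gaps per column, the second adds those counts together
total_missing = int(checked_df.isna().sum().sum())

# Fails loudly if any gap slipped through
assert total_missing == 0, f"{total_missing} missing values remain"
print(total_missing)
```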
