# More on Missing Data - Lab

## Introduction

In this lab, you'll continue to practice techniques for dealing with missing data. You'll also observe how each technique changes the distribution of your data.

## Objectives

In this lab you will: 

- Evaluate and execute the best strategy for dealing with missing, duplicate, and erroneous values for a given dataset   
- Determine how the distribution of data is affected by imputing values 

## Load the data

To start, load the dataset `'titanic.csv'` using pandas.

In [56]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

In [57]:
# Your code here
df = pd.read_csv("titanic.csv")

Use the `.info()` method to quickly preview which features have missing data. Any column whose non-null count is below the total number of entries contains missing values.

In [58]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1391 entries, 0 to 1390
Data columns (total 12 columns):
PassengerId    1391 non-null float64
Survived       1391 non-null float64
Pclass         1391 non-null object
Name           1391 non-null object
Sex            1391 non-null object
Age            1209 non-null float64
SibSp          1391 non-null float64
Parch          1391 non-null float64
Ticket         1391 non-null object
Fare           1391 non-null float64
Cabin          602 non-null object
Embarked       1289 non-null object
dtypes: float64(6), object(6)
memory usage: 130.5+ KB
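Because `.info()` lists non-null counts, the number of missing values has to be worked out by subtraction (for `Age`, 1391 - 1209 = 182). A minimal sketch of counting them directly is shown here. It uses a small hypothetical frame named `toy` rather than the real `titanic.csv`, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (not the real file).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, None, "E46"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

# isna() marks each missing cell as True; summing counts them per column.
missing_counts = toy.isna().sum()

# The mean of the boolean mask is the fraction of rows that are missing.
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share.round(2))
```

The same two calls should work unchanged on `df`, which makes it easier to see at a glance which columns need attention.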


## Observe previous measures of centrality

Let's look at the `'Age'` feature. Calculate the mean, median, and standard deviation of this feature. Then plot a histogram of the distribution.

In [14]:
# Your code here
print(df['Age'].agg(['mean', 'median', 'std']))

df['Age'].plot(kind = 'hist', bins = 30)

mean      29.731894
median    27.000000
std       16.070125
Name: Age, dtype: float64


[Figure: histogram of `Age` with 30 bins]

## Impute missing values using the mean 

Fill the missing `'Age'` values using the average age. (Don't overwrite the original data, as we will be comparing to other methods for dealing with the missing values.) Then recalculate the mean, median, and std and replot the histogram.

In [15]:
# Your code here
# fillna returns a new Series, so df['Age'] itself is left unchanged
df['Age'].fillna(df['Age'].mean()).describe()

count    1391.000000
mean       29.731894
std        14.981155
min         0.420000
25%        22.000000
50%        29.731894
75%        37.000000
max        80.000000
Name: Age, dtype: float64

### Commentary

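Note that the standard deviation dropped (from about 16.07 to 14.98), the median rose slightly (from 27.0 to about 29.73), and the mean is unchanged. The distribution now has more mass near the center, because every imputed value sits exactly at the mean.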
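The drop in standard deviation can be predicted ahead of time. Imputed values sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the row count. For the sample standard deviation, that gives `new_std = old_std * sqrt((n_observed - 1) / (n_total - 1))`. With the numbers above, 16.070125 * sqrt(1208 / 1390) is about 14.98, in line with the output. The sketch below checks the relationship on a hypothetical series of ages (not taken from `titanic.csv`).

```python
import numpy as np
import pandas as pd

# Hypothetical ages (not taken from titanic.csv); three values are missing.
ages = pd.Series(
    [2.0, 19.0, 24.0, 27.0, 31.0, 40.0, 58.0, np.nan, np.nan, np.nan],
    name="Age",
)

# fillna returns a new Series, so `ages` keeps its missing values.
mean_filled = ages.fillna(ages.mean())

# Side-by-side summary of the original and the mean-imputed series.
stats = ["mean", "median", "std"]
comparison = pd.DataFrame({
    "original": ages.agg(stats),
    "mean_filled": mean_filled.agg(stats),
})
print(comparison)

# Predict the new sample standard deviation from the old one.
n_obs = ages.count()   # number of non-missing values
n_total = len(ages)    # number of rows, including missing ones
predicted_std = ages.std() * np.sqrt((n_obs - 1) / (n_total - 1))
print(predicted_std, mean_filled.std())
```

The more values are missing, the smaller the ratio under the square root, and the more the imputed column understates the spread of the data.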

## Impute missing values using the median 

Fill the missing `'Age'` values, this time using the median age. (Again, don't overwrite the original data, as we will be comparing to other methods for dealing with the missing values.) Then recalculate the mean, median, and std and replot the histogram.

In [20]:
# Your code here
# fillna returns a new Series, so df['Age'] itself is left unchanged
df['Age'].fillna(df['Age'].median()).describe()

count    1391.000000
mean       29.374450
std        15.009476
min         0.420000
25%        22.000000
50%        27.000000
75%        37.000000
max        80.000000
Name: Age, dtype: float64

### Commentary

Imputing the median has a similar effect to imputing the mean. The standard deviation is again reduced (from about 16.07 to 15.01), while the mean is slightly lowered (from about 29.73 to 29.37) and the median stays at 27.0. Once again, there is a larger mass of data near the center of the distribution.
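One way to see the extra mass near the center without a plot is to bin the values before and after imputation using the same bin edges. The sketch below does this with `np.histogram` on a hypothetical series of ages. The data and the 10-year bin width are illustrative assumptions, not values from the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical ages (not taken from titanic.csv); three values are missing.
ages = pd.Series(
    [2.0, 19.0, 24.0, 27.0, 31.0, 40.0, 58.0, np.nan, np.nan, np.nan],
    name="Age",
)

# Fill the gaps with the median of the observed values.
median_filled = ages.fillna(ages.median())

# Use identical 10-year bins for both series so the counts are comparable.
edges = np.arange(0, 70, 10)
before, _ = np.histogram(ages.dropna(), bins=edges)
after, _ = np.histogram(median_filled, bins=edges)

# Only the bin that contains the median should gain observations.
grown = after - before
print(before)
print(after)
print(grown)
```

Every imputed row lands in the single bin that contains the median, which is the spike that shows up in the replotted histogram.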

## Dropping rows

Finally, let's observe the impact on the distribution if we were to simply drop all of the rows that are missing an age value. Then, calculate the mean, median and standard deviation of the ages along with a histogram, as before.

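In [79]:
# Drop every row with a missing 'Age'; inplace=True modifies df itself
df.dropna(subset=['Age'], inplace=True)

In [80]:
# Confirm that 'Age' no longer contains missing values
df.isna().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool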

In [81]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1209 entries, 0 to 1390
Data columns (total 12 columns):
PassengerId    1209 non-null float64
Survived       1209 non-null float64
Pclass         1209 non-null object
Name           1209 non-null object
Sex            1209 non-null object
Age            1209 non-null float64
SibSp          1209 non-null float64
Parch          1209 non-null float64
Ticket         1209 non-null object
Fare           1209 non-null float64
Cabin          578 non-null object
Embarked       1107 non-null object
dtypes: float64(6), object(6)
memory usage: 122.8+ KB


### Commentary

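Dropping rows with a missing age leaves the distribution of the observed ages, and the associated measures of centrality, unchanged: pandas already skips missing values when it computes these statistics. The cost is throwing away data. Here, 182 of the 1391 rows are removed, along with whatever those rows recorded in the other columns.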
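A minimal sketch of this trade-off follows, using a hypothetical frame named `toy` rather than the real dataset. Unlike the cell above, it calls `dropna` without `inplace=True`, so the original frame stays available for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real Titanic data); two ages are missing.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 51.86],
})

# Without inplace=True, dropna returns a new frame and leaves `toy` intact.
dropped = toy.dropna(subset=["Age"])

# pandas skips NaN by default, so both summaries describe the same ages.
stats = ["mean", "median", "std"]
stats_before = toy["Age"].agg(stats)
stats_after = dropped["Age"].agg(stats)

# The cost of dropping: whole rows vanish, including their valid Fare values.
rows_lost = len(toy) - len(dropped)
fares_lost = toy["Fare"].count() - dropped["Fare"].count()

print(stats_before)
print(stats_after)
print(rows_lost, fares_lost)
```

The age statistics match before and after the drop, while the valid `Fare` entries in the dropped rows are lost along with the missing ages.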

## Summary

In this lab, you briefly practiced some common techniques for dealing with missing data. Moreover, you observed the impact that these methods had on the distribution of the feature itself. When you begin to tune models on your data, these considerations will be an essential part of developing robust and accurate models.