### Imputation Methods and Resources

One of the most common ways to handle missing values is to impute them.  Imputation means filling in a substitute value wherever a value was originally missing.

It is very common to impute in the following ways:
1. Impute the **mean** of a column.<br><br>

2. If you are working with categorical data or a variable with outliers, then use the **mode** of the column.<br><br>

3. Impute 0, a very small number, or a very large number to differentiate missing values from other values.<br><br>

4. Use k-nearest neighbors (KNN) to impute values based on the rows that are most similar.<br><br>

In general, you should think carefully about why values are missing and what that implies in the real world before deciding how to handle them.  At the same time, these solutions are very quick, and they let you get models off the ground.  You can then iterate on your feature engineering to be more careful as time permits.
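
The first three strategies in the list can be sketched with plain pandas. This is a minimal illustration on a made-up dataframe: the `toy` data, its column names, and the sentinel value `-1` are assumptions chosen for the example, not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the strategies.
toy = pd.DataFrame({
    'height': [150.0, np.nan, 170.0, 180.0],
    'color': ['red', 'blue', np.nan, 'red'],
})

# 1. Mean imputation for a quantitative column.
height_mean = toy['height'].fillna(toy['height'].mean())

# 2. Mode imputation for a categorical column.
#    mode() returns a Series, so [0] takes its first entry.
color_mode = toy['color'].fillna(toy['color'].mode()[0])

# 3. Sentinel imputation: a value that cannot occur naturally
#    (a negative height) marks where data was missing.
height_flag = toy['height'].fillna(-1)
```

Which strategy is appropriate depends on the column: a sentinel keeps the "was missing" signal visible, while the mean and mode hide it.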

Let's take a look at how some of them work. Chris Albon's content is again very helpful for many of these items, and you can access it [here](https://chrisalbon.com/).  He uses the [sklearn.preprocessing module](http://scikit-learn.org/stable/modules/preprocessing.html).  There are also many ways to fill in missing values directly using pandas, which can be found [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html).
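
For reference, here is a rough sketch of mean and KNN imputation with scikit-learn. It assumes scikit-learn 0.22 or later, where the imputers are in `sklearn.impute` rather than `sklearn.preprocessing`. The array `X` and the choice of `n_neighbors=2` are illustrative assumptions, not part of this notebook's exercises.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Small illustrative array with two missing entries.
X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]])

# Replace each NaN with the mean of its column.
mean_filled = SimpleImputer(strategy='mean').fit_transform(X)

# Replace each NaN with the average of that feature over the
# two rows that are closest on the features both rows share.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```

Both imputers return a NumPy array with the same shape as `X` and leave the observed values unchanged.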

Create the dataset you will be using for this notebook using the code below.


In [None]:
import pandas as pd
import numpy as np
import ImputationMethods as t

df = pd.DataFrame({'A':[np.nan, 2, np.nan, 0, 7, 10, 15],
                   'B':[3, 4, 5, 1, 2, 3, 5],
                   'C':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D':[np.nan, True, np.nan, False, True, False, np.nan],
                   'E':['Yes', 'No', 'Maybe', np.nan, np.nan, 'Yes', np.nan]})

df

#### Question 1

**1.** Use the dictionary below to label the columns as the appropriate data type.

In [None]:
a = 'categorical'
b = 'quantitative'
c = 'we cannot tell'
d = 'boolean - can treat either way'

question1_solution = {'Column A is': b,
                      'Column B is': b,
                      'Column C is': c,
                      'Column D is': d,
                      'Column E is': a,
                     }

# Check your answer
t.var_test(question1_solution)

#### Question 2

**2.** Are there any columns or rows that you feel comfortable dropping in this dataframe?

In [None]:
a = "Yes"
b = "No"

should_we_drop = a

# Check your answer
t.can_we_drop(should_we_drop)

In [None]:
# Use this cell to drop any columns or rows you feel comfortable dropping based on the above
new_df = df.drop('C', axis=1)
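
As an alternative to naming the column by hand, one possible approach is to measure how much of each column is missing and drop only the columns that are entirely empty. The sketch below rebuilds the same dataframe so that it runs on its own; `missing_share` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same data as the notebook's df, rebuilt so this cell is self-contained.
df = pd.DataFrame({'A': [np.nan, 2, np.nan, 0, 7, 10, 15],
                   'B': [3, 4, 5, 1, 2, 3, 5],
                   'C': [np.nan] * 7,
                   'D': [np.nan, True, np.nan, False, True, False, np.nan],
                   'E': ['Yes', 'No', 'Maybe', np.nan, np.nan, 'Yes', np.nan]})

# Fraction of missing values in each column (1.0 means entirely missing).
missing_share = df.isna().mean()

# Drop only the columns in which every value is missing.
new_df = df.dropna(axis=1, how='all')
```

Here only column C has a missing share of 1.0, so the result has the same columns as `df.drop('C', axis=1)`.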

#### Question 3

**3.** Below is a lambda function that you can use to impute the mean for the columns of **new_df** using the **apply** method.  Use as many cells as you need to correctly fill in the dictionary **impute_q3** to answer a few questions about your findings.

In [None]:
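# Fill each column's missing values with that column's mean
fill_mean = lambda col: col.fillna(col.mean())

try:
    new_df.apply(fill_mean, axis=0)
except Exception:
    # The mean of column E is undefined because it holds strings
    print('That broke...because column E is a string.')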

In [None]:
new_df[['A', 'B', 'D']].apply(fill_mean, axis=0)

In [None]:
a = "fills with the mean, but that doesn't actually make sense in this case."
b = "gives an error."
c = "is no problem - it fills the NaN values with the mean as expected."


impute_q3 = {'Filling column A': c,
             'Filling column D': a,
             'Filling column E': b
}

# Check your answer
t.impute_q3_check(impute_q3)
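
To see why the mean is a questionable fill for the boolean column, it can help to compute it explicitly. The sketch below recreates column D on its own and casts it to floats first (an extra step added here for clarity, not something the exercise requires); `d_float`, `d_mean`, and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Column D from the notebook, recreated as a standalone Series.
d = pd.Series([np.nan, True, np.nan, False, True, False, np.nan], name='D')

# Cast to floats so that True becomes 1.0 and False becomes 0.0.
d_float = d.astype(float)

# Two of the four observed values are True, so the mean is 0.5.
d_mean = d_float.mean()

# Every missing entry becomes 0.5, which is neither True nor False.
filled = d_float.fillna(d_mean)
```

The filled column no longer contains only two states, which is why mean imputation "works" here without really making sense.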

#### Question 4

**4.** Given the results above, it might make more sense to fill some columns with the mode.  Write your own function to fill a column with the mode value, and use it on the two columns that might benefit from this type of imputation.  Use the dictionary **impute_q4** to answer some questions about your findings.

In [None]:
fill_mode = lambda col: col.fillna(col.mode()[0])

new_df.apply(fill_mode, axis=0)

In [None]:
a = "Did not impute the mode."
b = "Imputes the mode."


impute_q4 = {'Filling column A': a,
             'Filling column D': a,
             'Filling column E': b}


# Check your answer
t.impute_q4_check(impute_q4)
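
The answers above follow from how `mode()` behaves when values tie: it returns every value that shares the highest count, and `[0]` simply takes the first of them. The sketch below recreates the three columns as standalone Series (the `col_a`, `col_d`, `col_e`, and `modes` names are introduced here for illustration) and lists all the modes of each.

```python
import numpy as np
import pandas as pd

# Columns A, D, and E from the notebook, recreated as standalone Series.
col_a = pd.Series([np.nan, 2, np.nan, 0, 7, 10, 15], name='A')
col_d = pd.Series([np.nan, True, np.nan, False, True, False, np.nan], name='D')
col_e = pd.Series(['Yes', 'No', 'Maybe', np.nan, np.nan, 'Yes', np.nan], name='E')

# Collect every mode of each column, not just the first one.
modes = {col.name: col.mode().tolist() for col in (col_a, col_d, col_e)}

# Number of tied modes per column: more than one means no unique mode.
n_modes = {name: len(values) for name, values in modes.items()}
```

Column A has five distinct observed values, so all five are tied and `[0]` just picks the smallest of them. Column D is tied between True and False. Only column E has a single mode, `'Yes'`.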

You saw two of the most common ways to impute values in this notebook, and hopefully you noticed that even these simple methods have complications.  Again, they can be a great first step to get your models off the ground, but the values you impute can introduce bias that harms your models.