## Handling Missing Data

In this Jupyter notebook we attempt Question 1 of Assignment 2: imputing Age and Gender.
We use pandas to load the data and scikit-learn to fit the models that perform the imputation, and we discuss our approach along the way.

Let us start by reading in our data files.

In [1]:
import pandas as pd

In [2]:
bestincome = pd.read_csv('BestIncome.txt')

In [3]:
surveyincome = pd.read_csv('SurvIncome.txt')

In [4]:
bestincome

Unnamed: 0,lab_inc,cap_inc,height,weight
0,52655.605507,9279.509829,64.568138,152.920634
1,70586.979225,9451.016902,65.727648,159.534414
2,53738.008339,8078.132315,66.268796,152.502405
3,55128.180903,12692.670403,62.910559,149.218189
4,44482.794867,9812.975746,68.678295,152.726358
5,55394.631435,10769.461417,67.370550,151.602678
6,62627.896258,9730.261003,64.547689,151.421983
7,54936.555868,8712.627564,63.080352,153.917777
8,52730.248945,9260.989766,63.417904,147.327536
9,60525.267381,10310.988854,65.310226,154.179314


In [5]:
surveyincome

Unnamed: 0,tot_inc,weight,age,gender
0,63642.513655,134.998269,46.610021,1.0
1,49177.380692,134.392957,48.791349,1.0
2,67833.339128,126.482992,48.429894,1.0
3,62962.266217,128.038121,41.543926,1.0
4,58716.952597,126.211980,41.201245,1.0
5,45708.541543,116.023500,44.504057,1.0
6,73594.659828,111.142211,52.943038,1.0
7,66831.999693,140.317751,34.138021,1.0
8,63690.343092,138.023277,40.662877,1.0
9,59965.928189,124.794773,32.909093,1.0


## Imputation Technique

There are many ways to handle missing data. The simplest techniques drop observations with missing values or replace each missing value with the mean of the observed values. A widely used alternative is to fit a regression model on a data set where the variable is observed, and then use the estimated coefficients to predict the values that are missing. We will attempt this method! In our case, age and gender are observed in the survey data but not in the best income data, so we fit the models on the former and predict on the latter.

For a single predictor, the model is
\begin{equation}
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
\end{equation}

Here, $Y_i$ is the variable we wish to impute, $X_i$ is an observed predictor, $\epsilon_i$ is the error term, and $\beta_0$ and $\beta_1$ are the coefficients we estimate and then use to predict $Y_i$.
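
As a minimal sketch of this formula, the snippet below estimates $\beta_0$ and $\beta_1$ by least squares on the rows where $Y$ is observed and then fills in the missing rows. The data and the column names `x` and `y` are invented for illustration and are not the assignment data.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): y = 2 + 3x exactly,
# with two y values set to missing.
df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})
df["y"] = 2.0 + 3.0 * df["x"]
df.loc[[1, 4], "y"] = np.nan

# Fit the slope (beta_1) and intercept (beta_0) on the observed rows only.
observed = df.dropna(subset=["y"])
beta_1, beta_0 = np.polyfit(observed["x"], observed["y"], deg=1)

# Fill only the missing entries with the fitted values beta_0 + beta_1 * x.
df["y_imputed"] = df["y"].fillna(beta_0 + beta_1 * df["x"])
print(df)
```

With more than one predictor, the same idea applies with one coefficient per predictor.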

Let us use scikit-learn for our regression model.

## Fitting our Model

In [6]:
from sklearn import linear_model

In [7]:
model_age = linear_model.LinearRegression()
model_age.fit(surveyincome[['tot_inc', 'weight']], surveyincome['age'])



LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [8]:
model_age.coef_

array([ 2.52022048e-05, -6.72214424e-03])

These are the coefficients of our model for predicting age: the first applies to `tot_inc` and the second to `weight`. They will be used in our estimation.
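
To make the role of these coefficients concrete, the sketch below shows how a fitted `LinearRegression` combines `intercept_` and `coef_` into a prediction. It uses a small synthetic data set that reuses the column names `tot_inc`, `weight`, and `age`; all values are invented and noise-free, so they are not the assignment results.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in for the survey data (values invented for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tot_inc": rng.uniform(40_000, 80_000, size=50),
    "weight": rng.uniform(110, 180, size=50),
})
toy["age"] = 45 + 2e-5 * toy["tot_inc"] - 0.01 * toy["weight"]

features = toy[["tot_inc", "weight"]]
model = linear_model.LinearRegression()
model.fit(features, toy["age"])

# A prediction is the intercept plus the coefficient-weighted sum of the features.
manual = model.intercept_ + features.to_numpy() @ model.coef_
print(np.allclose(manual, model.predict(features)))
print(model.intercept_, model.coef_)
```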

In [9]:
model_gender = linear_model.LogisticRegression()
model_gender.fit(surveyincome[['tot_inc', 'weight']], surveyincome['gender'])



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [10]:
model_gender

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

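The output above shows the fitted logistic regression model for gender along with its settings. Its coefficients are stored in `model_gender.coef_`, and the fitted model will be used in our estimation.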
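
For a classifier such as this one, the quantities of interest are `coef_`, `intercept_`, and the class probabilities returned by `predict_proba`. The sketch below fits a `LogisticRegression` on invented data in which the 0/1 label depends on weight alone; the cutoff of 150, the income scale, and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic data (invented): the label is 1 when weight is below 150.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "tot_inc": rng.uniform(40, 80, size=200),    # income in thousands (assumed scale)
    "weight": rng.uniform(110, 190, size=200),
})
toy["gender"] = (toy["weight"] < 150).astype(float)

features = toy[["tot_inc", "weight"]]
clf = linear_model.LogisticRegression(max_iter=1000)
clf.fit(features, toy["gender"])

# predict_proba gives one probability per class; predict picks the more likely class.
proba = clf.predict_proba(features)
labels = clf.predict(features)
print(clf.coef_, clf.intercept_)
print((labels == clf.classes_[proba.argmax(axis=1)]).all())
```

Note that `predict` returns hard 0/1 labels, which is why an imputed gender column produced this way contains only the values 0 and 1.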

## Imputing our Values

We will now use our fitted models to impute the values.
The models were fit on `tot_inc` and `weight`, so we first create a temporary `tot_inc` column as the sum of `lab_inc` and `cap_inc`.
We then use each model's `predict` method to do the imputation, and drop the temporary column afterwards.

In [11]:
bestincome['tot_inc'] = bestincome['cap_inc'] + bestincome['lab_inc']

In [12]:
bestincome['imputed_age'] = model_age.predict(bestincome[['tot_inc', 'weight']])

In [13]:
bestincome['imputed_gender'] = model_gender.predict(bestincome[['tot_inc', 'weight']])

In [14]:
bestincome = bestincome.drop(columns=['tot_inc'])

In [15]:
bestincome

Unnamed: 0,lab_inc,cap_inc,height,weight,imputed_age,imputed_gender
0,52655.605507,9279.509829,64.568138,152.920634,44.742614,0.0
1,70586.979225,9451.016902,65.727648,159.534414,45.154387,1.0
2,53738.008339,8078.132315,66.268796,152.502405,44.742427,0.0
3,55128.180903,12692.670403,62.910559,149.218189,44.915836,1.0
4,44482.794867,9812.975746,68.678295,152.726358,44.551391,0.0
5,55394.631435,10769.461417,67.370550,151.602678,44.858053,0.0
6,62627.896258,9730.261003,64.547689,151.421983,45.015372,1.0
7,54936.555868,8712.627564,63.080352,153.917777,44.779109,0.0
8,52730.248945,9260.989766,63.417904,147.327536,44.781626,0.0
9,60525.267381,10310.988854,65.310226,154.179314,44.958481,1.0


## Describing Variables

Let us now look at some statistics describing our imputed variables.

In [16]:
bestincome['imputed_age'].describe()

count    10000.000000
mean        44.890828
std          0.219150
min         43.976495
25%         44.743776
50%         44.886944
75%         45.038991
max         45.703819
Name: imputed_age, dtype: float64

In [17]:
bestincome['imputed_gender'].describe()

count    10000.000000
mean         0.471700
std          0.499223
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max          1.000000
Name: imputed_gender, dtype: float64
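
The small standard deviation of `imputed_age` is expected: regression predictions lie on the fitted surface and carry none of the residual variation around it. The sketch below illustrates this on synthetic data (all values invented, with a single hypothetical predictor `x`) by comparing the spread of a noisy outcome with the spread of its fitted values.

```python
import numpy as np

# Synthetic data (invented): a weak linear signal plus a lot of noise.
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 45 + 0.5 * x + rng.normal(scale=5.0, size=500)

# Least-squares fit and the corresponding fitted values.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x

print("std of observed y:", y.std())
print("std of fitted y:  ", fitted.std())
```

Comparing `describe()` on the observed variable in the survey data with the imputed column would be one way to check how much variation is lost in the actual data.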

## Correlation Matrix

Our last step is to print the correlation matrix, which we will do using the `corr` method.

In [18]:
bestincome.corr()

Unnamed: 0,lab_inc,cap_inc,height,weight,imputed_age,imputed_gender
lab_inc,1.0,0.005325,0.00279,0.004507,0.924053,0.677675
cap_inc,0.005325,1.0,0.021572,0.006299,0.234159,0.176901
height,0.00279,0.021572,1.0,0.172103,-0.045083,-0.066972
weight,0.004507,0.006299,0.172103,1.0,-0.300288,-0.382659
imputed_age,0.924053,0.234159,-0.045083,-0.300288,1.0,0.78426
imputed_gender,0.677675,0.176901,-0.066972,-0.382659,0.78426,1.0
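
The strong correlations between the imputed columns and the income variables arise by construction, since each imputed value is computed from `tot_inc` and `weight`. As a rough illustration, the sketch below builds a column as a linear function of two independent synthetic columns (the names `a`, `b`, and `derived` and all values are invented) and prints the resulting correlation matrix.

```python
import numpy as np
import pandas as pd

# Two independent synthetic columns (invented for illustration).
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "a": rng.normal(size=1000),
    "b": rng.normal(size=1000),
})

# A column built as a linear function of a and b, like a regression prediction.
toy["derived"] = 1.0 + 2.0 * toy["a"] - 0.5 * toy["b"]

corr = toy.corr()
print(corr.round(3))
```

The derived column correlates with its inputs even though the inputs are unrelated to each other, so these correlations describe the imputation rule rather than an independent finding about the data.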
