# Regression Imputation

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this import needs an older version
%matplotlib inline

In [None]:
boston = load_boston() # Dataset from the sklearn library about house prices

In [None]:
print(boston.DESCR)

**Load** the pickle file into a variable called **boston_df** and have a look at the first few rows of data.

In [None]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

In [None]:
boston_df.shape

We need these pieces:
- X_train
- X_test
- y_train
- y_test

We can get all four from a single call to `train_test_split()`.
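
As a minimal sketch of that split, the cell below uses a made-up `toy_df` (a hypothetical stand-in for `boston_df`, with a target column assumed to be called `y`) so that it runs without the pickle file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for boston_df: 100 rows, three features plus a target column 'y'
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "y"])

X = toy_df.drop("y", axis=1)   # features
y = toy_df["y"]                # target

# By default 25% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=111)

shapes = [part.shape for part in (X_train, X_test, y_train, y_test)]
print(shapes)
```

Fixing `random_state` makes the split reproducible, which matters later when the same split is compared across datasets.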

# Check the data shape and explain what has happened

--------------------

# This dataset has no missing values

I have degraded a feature (removed some of its values to simulate missing data) so that we can compare the full dataset with a dataset that has missing values.

# AGE

The degraded feature: the proportion of owner-occupied units built prior to 1940.

**Load** the *other* pickle file into a variable called **boston_dropna_df** and have a look at the first few rows of data.

Make a copy of this data, because we will use it again to impute values.

### Why do we need to do this?



```python
boston_impute_df = boston_dropna_df.copy()
```

### What would happen to 'boston_impute_df' if we did this and made changes to 'boston_dropna_df'?

```python
boston_impute_df = boston_dropna_df
```
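
One way to explore the difference is on a tiny made-up frame (the names `original`, `alias` and `snapshot` are illustrative, not from the notebook). Plain assignment only creates a second name for the same object, while `.copy()` creates independent data. Note that this applies to in-place changes; rebinding a name to a new object leaves the other name untouched.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"AGE": [65.2, np.nan, 61.1]})

alias = original            # second name for the same object
snapshot = original.copy()  # independent copy of the data

# Modify the original in place
original.loc[0, "AGE"] = 0.0

print(alias is original)       # same object under two names
print(alias.loc[0, "AGE"])     # the alias sees the change
print(snapshot.loc[0, "AGE"])  # the copy keeps the old value
```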

----------------------

### How many NaNs are in boston_dropna_df['AGE']?

### boston_dropna_df will show what happens if we just drop the rows with missing values

Go ahead and drop the rows with NaNs in `boston_dropna_df`.

In [None]:
#boston_dropna_df.dropna......

### Check that they are gone

In [None]:
#boston_dropna_df['AGE']......

Check that boston_impute_df didn't change when we changed boston_dropna_df.

### Again check the shape to look for confirmation

In [None]:
#boston_dropna_df......
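
The counting, dropping and shape checks above can be rehearsed on a small made-up frame first. This is only a sketch: `toy_df` and its values are invented for illustration, and the exercise cells above are left for you to fill in on the real data.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "AGE": [65.2, np.nan, 61.1, np.nan, 54.2],
    "RM": [6.5, 6.4, 7.1, 6.9, 7.1],
})

# Number of NaN entries in the AGE column
n_missing = toy_df["AGE"].isnull().sum()
print(n_missing)

# dropna returns a new frame; subset limits the check to the AGE column
toy_dropped = toy_df.dropna(subset=["AGE"])

# Confirm the NaNs are gone and see how many rows were lost
print(toy_dropped["AGE"].isnull().sum())
print(toy_df.shape, toy_dropped.shape)
```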

# We will compare the performance of a model trained on the dropped data with the imputed and complete models

Now we need to train a linear regression model and predict the missing 'AGE' values for boston_impute_df.

Should we include the price in the training data for the imputation?

### create a dataset for linear regression modelling



In [None]:
# lm_fitting_df = .......

### Our target is now the 'AGE' column; that is now our `y`.

- We will use boston_dropna_df as the data to fit
- We cannot use data with missing values to train a model

# Using a separate model for the imputation.

In [None]:
lm_for_impute = LinearRegression()  # instantiate

### There are many ways to do this

You need the data and the labels - 'y'

**model.fit(data,labels)**

```python
lm_for_impute.fit(lm_fitting_df.drop('AGE', axis=1), lm_fitting_df['AGE'])  # fit on every column except 'AGE'
```

In [None]:
boston_impute_df[boston_impute_df['AGE'].isnull()].head()

### Now predict the missing values with the model

In [None]:
lm_for_impute.predict(boston_impute_df.drop(['AGE','y'],axis=1)) 
#this uses the other features to predict 'AGE' with the model

In [None]:
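missing = boston_impute_df['AGE'].isnull()  # predict only for these rows, and assign with .loc to avoid chained assignment
boston_impute_df.loc[missing, 'AGE'] = lm_for_impute.predict(boston_impute_df.loc[missing].drop(['AGE','y'],axis=1))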
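
To see the whole imputation step in one place, here is a minimal end-to-end sketch on synthetic data. Everything in it is an assumption made for illustration: the frame `toy_df`, the linear relationship used to generate `AGE`, and the choice to knock out 40 of 200 values. The pattern is the same as above: fit on rows where the target is known, then predict only for the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
toy_df = pd.DataFrame({"RM": rng.normal(6, 1, n), "DIS": rng.normal(4, 2, n)})
# Synthetic AGE depends linearly on the other features plus noise
toy_df["AGE"] = 100 - 5 * toy_df["RM"] - 4 * toy_df["DIS"] + rng.normal(0, 1, n)

# Knock out 20% of the AGE values to mimic missing data
missing_idx = rng.choice(n, size=40, replace=False)
toy_df.loc[missing_idx, "AGE"] = np.nan

features = ["RM", "DIS"]
missing = toy_df["AGE"].isnull()   # boolean mask of rows to impute

# Fit only on rows where AGE is known
model = LinearRegression()
model.fit(toy_df.loc[~missing, features], toy_df.loc[~missing, "AGE"])

# Predict only for the rows where AGE is missing, and write them back with .loc
toy_df.loc[missing, "AGE"] = model.predict(toy_df.loc[missing, features])

print(missing.sum(), toy_df["AGE"].isnull().sum())
```

Selecting the rows with the same boolean mask on both sides of the assignment keeps the number of predictions equal to the number of gaps being filled.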

In [None]:
boxplot = pd.DataFrame({'imputed': boston_impute_df['AGE'],'full': boston_df['AGE'],'dropped': boston_dropna_df['AGE']})
boxplot.plot(kind='box')

The 'imputed' box shows 'AGE' after the missing values were filled in with predictions from the linear model, which was trained on the rows where 'AGE' was known.

# Predicting the price with the full data

\begin{equation}
y_i = \beta_0 + \beta_1 X_i + \epsilon_i
\end{equation}


In [None]:
X_train, X_test, y_train, y_test = train_test_split(boston_df.iloc[:,:-1],boston_df.iloc[:,-1:],random_state=111)

In [None]:
[part.shape for part in (X_train, X_test, y_train, y_test)]  # pd.np was removed in pandas 2.0, and map() is lazy in Python 3

In [None]:
lm_full = LinearRegression()

In [None]:
lm_full.fit(X_train,y_train)

In [None]:
print('r-squared for this model = ', lm_full.score(X_test, y_test))

# The dropped data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(boston_dropna_df.iloc[:,:-1],boston_dropna_df.iloc[:,-1:],random_state=111)

In [None]:
lm_dropped = LinearRegression()
lm_dropped.fit(X_train,y_train)

print('r-squared for this model = ', lm_dropped.score(X_test, y_test))

# The imputed data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(boston_impute_df.iloc[:,:-1],boston_impute_df.iloc[:,-1:],random_state=111)

In [None]:

lm_impute = LinearRegression()
lm_impute.fit(X_train,y_train)

print('r-squared for this model = ', lm_impute.score(X_test, y_test))

from [adventures-in-statistics](http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit)

What Is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or:

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
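
The definition quoted above can be checked by hand on a few made-up numbers (the arrays below are invented for illustration). scikit-learn's `score` and `r2_score` compute 1 - (residual variation / total variation), which matches the quoted definition for an ordinary least-squares fit with an intercept on its own training data. On held-out data, as in the cells above, the score can fall below 0 when the predictions are worse than always guessing the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

# The manual value and scikit-learn's value should agree
print(r2_manual, r2_score(y_true, y_pred))

# Predictions worse than always guessing the mean give a negative score
bad_pred = np.array([9.0, 7.0, 5.0, 3.0])
r2_bad = r2_score(y_true, bad_pred)
print(r2_bad)
```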