#### Imputing Values

You now have some experience working with missing values and imputing them with common methods. Now it is your turn to put those skills to work and make predictions for rows even when they contain NaN values.

First, let's read in the necessary libraries and reproduce the results you achieved in the previous attempt.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import ImputingValues as t
import seaborn as sns
%matplotlib inline

df = pd.read_csv('./survey_results_public.csv')
df.head()

#Only use quant variables and drop any rows with missing values
num_vars = df[['Salary', 'CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]
df_dropna = num_vars.dropna(axis=0)

#Split into explanatory and response variables
X = df_dropna[['CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]
y = df_dropna['Salary']

#Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42) 

lm_model = LinearRegression(normalize=True) # Instantiate (normalize was removed in scikit-learn 1.2; omit it on newer versions)
lm_model.fit(X_train, y_train) #Fit
        
#Predict and score the model
y_test_preds = lm_model.predict(X_test) 
"The r-squared score for your model was {} on {} values.".format(r2_score(y_test, y_test_preds), len(y_test))

'The r-squared score for your model was 0.01917066180376248 on 645 values.'
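For reference, the r-squared score reported above is one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical, not taken from the survey) that computes the score by hand and compares it with `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Both numbers should agree
print(r2_manual, r2_score(y_true, y_pred))
```

A score near 0, like the one above, means the model explains almost none of the variance in Salary.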

#### Question 1

**1.** As you may remember from an earlier analysis, there are many more salaries to predict than the 645 values scored by the code above.  One way to start making predictions for those rows is to impute values into the **X** matrix instead of dropping the rows.

Using the **num_vars** dataframe, drop the rows with a missing response (Salary) and store the result in **drop_sal_df**. Then impute every other missing value with the mean of its column and store the result in **fill_df**.
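As a small illustration of the two steps, here is a sketch on a toy dataframe. The frame `toy` and its values are hypothetical stand-ins for **num_vars**, not survey data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for num_vars
toy = pd.DataFrame({
    'Salary': [100.0, np.nan, 300.0, 200.0],
    'HoursPerWeek': [1.0, 8.0, np.nan, 3.0],
})

# Drop only the rows where the response is missing
toy_drop = toy.dropna(subset=['Salary'])

# Fill the remaining gaps with each column's mean
toy_fill = toy_drop.apply(lambda col: col.fillna(col.mean()), axis=0)
print(toy_fill)
```

Note that the column mean is computed after the rows without a salary are dropped, so the 8.0 in the dropped row does not influence the imputed value.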

In [2]:
drop_sal_df = num_vars.dropna(subset=['Salary'])
# test look
drop_sal_df.head()

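| | Salary | CareerSatisfaction | HoursPerWeek | JobSatisfaction | StackOverflowSatisfaction |
|---|---|---|---|---|---|
| 2 | 113750.0 | 8.0 | NaN | 9.0 | 8.0 |
| 14 | 100000.0 | 8.0 | NaN | 8.0 | 8.0 |
| 17 | 130000.0 | 9.0 | NaN | 8.0 | 8.0 |
| 18 | 82500.0 | 5.0 | NaN | 3.0 | NaN |
| 22 | 100764.0 | 8.0 | NaN | 9.0 | 8.0 |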


In [4]:
#Check that you dropped all the rows that have salary missing
t.check_sal_dropped(drop_sal_df)

Nice job! That looks right!


In [7]:
drop_sal_df.HoursPerWeek.isnull().mean()

0.54442004392094234
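The value above says that roughly 54% of the HoursPerWeek entries are missing among rows that have a salary. `isnull().mean()` works because each `True` counts as 1 when averaged. A minimal sketch on a hypothetical toy frame shows the same idea applied to every column at once:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, 2.0, 3.0, np.nan],
})

# Share of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)
```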

In [8]:
fill_mean = lambda col: col.fillna(col.mean())

try:
    drop_sal_df.apply(fill_mean, axis=0)
except Exception:
    print('That broke....')

In [3]:
wtry_df = drop_sal_df.apply(lambda col: col.fillna(col.mean()), axis=0)
wtry_df

| | Salary | CareerSatisfaction | HoursPerWeek | JobSatisfaction | StackOverflowSatisfaction |
|---|---|---|---|---|---|
| 2 | 113750.000000 | 8.000000 | 2.447415 | 9.000000 | 8.000000 |
| 14 | 100000.000000 | 8.000000 | 2.447415 | 8.000000 | 8.000000 |
| 17 | 130000.000000 | 9.000000 | 2.447415 | 8.000000 | 8.000000 |
| 18 | 82500.000000 | 5.000000 | 2.447415 | 3.000000 | 8.442686 |
| 22 | 100764.000000 | 8.000000 | 2.447415 | 9.000000 | 8.000000 |
| 25 | 175000.000000 | 7.000000 | 0.000000 | 7.000000 | 9.000000 |
| 34 | 14838.709677 | 10.000000 | 1.000000 | 8.000000 | 10.000000 |
| 36 | 28200.000000 | 7.000000 | 2.447415 | 9.000000 | 7.000000 |
| 37 | 118279.569892 | 7.534907 | 1.000000 | 7.024825 | 8.000000 |
| 52 | 15674.203822 | 6.000000 | 4.000000 | 5.000000 | 8.000000 |


In [9]:
#Fill all missing values with the mean of the column.

fill_df = drop_sal_df.apply(fill_mean, axis=0)

# test look
fill_df.head()

| | Salary | CareerSatisfaction | HoursPerWeek | JobSatisfaction | StackOverflowSatisfaction |
|---|---|---|---|---|---|
| 2 | 113750.0 | 8.0 | 2.447415 | 9.0 | 8.0 |
| 14 | 100000.0 | 8.0 | 2.447415 | 8.0 | 8.0 |
| 17 | 130000.0 | 9.0 | 2.447415 | 8.0 | 8.0 |
| 18 | 82500.0 | 5.0 | 2.447415 | 3.0 | 8.442686 |
| 22 | 100764.0 | 8.0 | 2.447415 | 9.0 | 8.0 |


In [10]:
#Check that your salary-dropped, mean-imputed dataframe matches the solution
t.check_fill_df(fill_df)

Nice job! That looks right!


#### Question 2

**2.** Using **fill_df**, predict Salary based on all of the other quantitative variables in the dataset.  You can use the template above to assist in fitting your model:

* Split the data into explanatory and response variables
* Split the data into train and test (using seed of 42 and test_size of .30 as above)
* Instantiate your linear model using normalized data
* Fit your model on the training data
* Predict using the test data
* Compute a score for your model fit on all the data, and show how many rows you predicted for

Use the tests to ensure you completed the steps correctly.
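The steps listed above can be sketched end to end on synthetic data. Everything in this block is an assumption made for illustration: the frame `demo`, its columns, and the model name `demo_model` are hypothetical, and the sketch omits the `normalize` argument, which newer scikit-learn versions no longer accept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for fill_df (hypothetical data, not the survey)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
demo['target'] = 3 * demo['x1'] - 2 * demo['x2'] + rng.normal(scale=0.1, size=200)

# Split into explanatory and response variables
X_demo = demo[['x1', 'x2']]
y_demo = demo['target']

# Split into train and test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=.30, random_state=42)

# Instantiate and fit on the training data
demo_model = LinearRegression()
demo_model.fit(X_tr, y_tr)

# Predict on the test data and score
demo_preds = demo_model.predict(X_te)
demo_r2 = r2_score(y_te, demo_preds)
print(demo_r2, len(y_te))
```

The names here are deliberately different from `X_train`, `lm_model`, and so on, so running the sketch does not overwrite the variables used in the exercise.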

In [11]:
X_f = fill_df[['CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]
y_f = fill_df['Salary']

In [12]:
#Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X_f, y_f, test_size = .30, random_state=42)

lm_model = LinearRegression(normalize=True) # Instantiate (normalize was removed in scikit-learn 1.2; omit it on newer versions)
lm_model.fit(X_train, y_train) #Fit

#Predict and score the model
y_test_preds = lm_model.predict(X_test)

#Rsquared and y_test
rsquared_score = r2_score(y_test, y_test_preds)
length_y_test = len(y_test_preds)

"The r-squared score for your model was {} on {} values.".format(rsquared_score, length_y_test)

'The r-squared score for your model was 0.03257139063404413 on 1503 values.'

In [13]:
# Pass your r2_score, length of y_test to the below to check against the solution
t.r2_y_test_check(rsquared_score, length_y_test)

Nice job! That looks right!


This model still isn't great.  Let's see if we can improve it by using some of the other columns in the dataset.