# Linear Regression: sklearn - many (numeric) variables

Contents
 - load data
 - data manipulation
   - variables to use
 - multiple regression (with sklearn)
   - fit model
   - predictions
   - performance
   - coefficients

Sources:
http://ww2.amstat.org/publications/jse/v19n3/decock.pdf

Copyright (C) 2018 Alan Chalk  
Please do not distribute or publish without permission.

## Start

**Packages needed**

In [5]:
import os
import numpy as np
import pandas as pd
import pickle

from sklearn import preprocessing

import matplotlib.pyplot as plt
%matplotlib inline

**Functions**

In [6]:
def fn_MAE(actuals, predictions):
    return np.round(np.mean(np.abs(predictions - actuals)))

**Settings**

In [7]:
font = {'size'   : 22}
plt.rc('font', **font)

**Directories and paths**

In [8]:
# Set directories
print(os.getcwd())
dirRawData = "../input/"
dirPData   = "../PData/"

/home/jovyan/Projects/AmesHousing/PCode


**Load data**

In [9]:
#store = pd.HDFStore(dirPData + '02_df_all.h5')
#df_all = pd.read_hdf(store, 'df_all')
#store.close()
f_name = dirPData + '02_df.pickle'

with open(f_name, "rb") as f:
    dict_ = pickle.load(f)

df_all = dict_['df_all']

del f_name, dict_

In [10]:
# load the variables information
f_name = dirPData + '02_vars.pickle'
with open(f_name, "rb") as f:
    dict_ = pickle.load(f)
    
var_dep = dict_['var_dep']
vars_ind_numeric = dict_['vars_ind_numeric']

del f_name, dict_

**Choose the variables for the linear model (initially all numeric variables)**

In [11]:
vars_toUse = vars_ind_numeric

In [12]:
X = df_all[vars_toUse].values
y = df_all[var_dep].values

## Multiple linear regression with sklearn

TODO

 - import LinearRegression from sklearn.linear_model
 - create an instance called lm_
 - fit the model to X and y

In [13]:
from sklearn.linear_model import LinearRegression

In [14]:
lm_ = LinearRegression()

In [15]:
lm_.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

**Coefficients**

In [16]:
# np.ravel handles both a scalar intercept (1-D y) and a length-1 array (2-D y)
coef_skl_intercept = list(np.ravel(lm_.intercept_))
coef_skl_other = list(np.ravel(lm_.coef_))
coef_skl = coef_skl_intercept + coef_skl_other

In [17]:
df_lm_results = pd.DataFrame({'features': ['intercept'] + vars_toUse,
                              'estimateCoefficients': np.round(coef_skl, 0)})
df_lm_results

| | features | estimateCoefficients |
|---|---|---|
| 0 | intercept | 384639.0 |
| 1 | lot_area | 1.0 |
| 2 | overall_qual | 15491.0 |
| 3 | overall_cond | 4304.0 |
| 4 | year_built | 359.0 |
| 5 | year_remod_add | 204.0 |
| 6 | bsmtfin_sf_1 | 22.0 |
| 7 | bsmtfin_sf_2 | 2.0 |
| 8 | bsmt_unf_sf | -1.0 |
| 9 | total_bsmt_sf | 23.0 |


TODO 

- Why are some coefficients so large? (These are actually not too bad, but some early runs produced coefficients like 5e16 and -5e16.)
- What can we do about this?
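
One common cause of very large, offsetting coefficients is exact or near-exact collinearity, for example a total column that equals the sum of its component columns. The sketch below uses synthetic data only; the names `part_a`, `part_b` and `total` are hypothetical and are not taken from the housing data. It checks whether a feature matrix has full column rank, which is one way to detect the problem before fitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features: two components and their exact total
part_a = rng.uniform(0, 1000, size=200)
part_b = rng.uniform(0, 1000, size=200)
total = part_a + part_b  # exact linear combination of the other two columns

X_collinear = np.column_stack([part_a, part_b, total])
X_reduced = np.column_stack([part_a, part_b])

# A rank below the number of columns means the least-squares
# coefficients are not uniquely determined
rank_collinear = np.linalg.matrix_rank(X_collinear)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print('collinear: columns =', X_collinear.shape[1], 'rank =', rank_collinear)
print('reduced:   columns =', X_reduced.shape[1], 'rank =', rank_reduced)
```

Dropping one of the redundant columns restores full column rank, which is the idea behind removing variables from the model.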

TODO

 - Rerun the above model with the smaller subset of variables in the cell below

In [None]:
vars_toUse = [var for var in vars_toUse if var not in ['bsmtfin_sf_1', 'bsmtfin_sf_2', 'bsmt_unf_sf',
                                                       'bsmt_full_bath', 'bsmt_half_bath',
                                                       'garage_cars',
                                                       'bedroom_abvgr', 'kitchen_abvgr',
                                                       'full_bath', 'half_bath',
                                                       'x1st_flr_sf', 'x2nd_flr_sf', 'low_qual_fin_sf',
                                                       'total_bsmt_sf',
                                                       'totrms_abvgrd', 'lot_area', 'overall_qual']]
X = df_all[vars_toUse].values
y = df_all[var_dep].values

TODO  
 - Based on the above coefficients, which feature is most important? (fireplaces?? - how many are there...)
 - Now rerun the model after using the cell below to scale the features (gr_liv_area - what was the range of this before scaling?)
 - What would you now say is the most important feature?  Does this make sense?

In [None]:
from sklearn.preprocessing import StandardScaler

standardScaler_ = StandardScaler()
standardScaler_.fit(X)
X = standardScaler_.transform(X)

In [None]:
# scaled features: means 
print("scaled features: means: ", np.round(X.mean(axis=0),5) )

# scaled features: standard deviation
print("scaled features: standard deviation: ", np.round(X.std(axis=0),5) )
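
To illustrate why scaling matters when comparing coefficients, here is a minimal sketch on synthetic data. The features `area` and `count` and all the numbers are made up for illustration. On the raw scale the small-range feature gets the larger coefficient; after standardising, each coefficient is the effect of a one-standard-deviation change, so the ranking can flip while the predictions stay the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500

# Hypothetical features on very different scales
area = rng.uniform(500, 4000, size=n)   # wide range
count = rng.integers(0, 3, size=n)      # takes the values 0, 1 or 2
X_demo = np.column_stack([area, count])
y_demo = 100 * area + 5000 * count + rng.normal(0, 1000, size=n)

# Fit once on raw features and once on standardised features
lm_raw = LinearRegression().fit(X_demo, y_demo)
X_scaled = StandardScaler().fit_transform(X_demo)
lm_scaled = LinearRegression().fit(X_scaled, y_demo)

print('raw coefficients:   ', np.round(lm_raw.coef_, 1))
print('scaled coefficients:', np.round(lm_scaled.coef_, 1))

# Scaling changes the coefficients but not the fitted values
same_predictions = np.allclose(lm_raw.predict(X_demo), lm_scaled.predict(X_scaled))
print('same predictions:', same_predictions)
```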

**Graph of average sale price by living area quantile**

In [None]:
df_all['gr_liv_area_q'] = pd.qcut(df_all['gr_liv_area'], 20, labels=False)
gb_temp = df_all.groupby('gr_liv_area_q').agg({'saleprice': lambda x: np.round(np.mean(x))}).reset_index()
gb_temp.head()

In [None]:
# create a new figure
fig = plt.figure(figsize = (10,6))

# add a subplot
ax1 = fig.add_subplot(1, 1, 1)

_ = ax1.scatter(gb_temp['gr_liv_area_q'],  gb_temp['saleprice'], s = 16)
ax1.set_xlabel('gr_liv_area quantile bin (20 bins)')
ax1.set_ylabel('saleprice ($)')
_ = plt.title('saleprice by gr_liv_area quantile')

## Predictions 

**TODO**

- Use the predict method of your lm_ object to generate a prediction for each row of X
- Find the mean prediction and compare it to the mean of the target variable.  Is it the same?  Is this surprising?
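
As a hedged illustration of the comparison on synthetic data (the data and the names `X_demo`, `y_demo`, `lm_demo` are made up and are not the notebook's objects), the sketch below fits an ordinary least squares model with an intercept and compares the mean prediction on the training rows with the mean of the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Synthetic regression problem with a non-zero intercept
X_demo = rng.normal(size=(300, 3))
y_demo = 50 + X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(size=300)

lm_demo = LinearRegression().fit(X_demo, y_demo)
pred_demo = lm_demo.predict(X_demo)

# With an intercept, least-squares residuals on the training data sum to zero,
# so the two means agree up to floating-point error
mean_gap = abs(pred_demo.mean() - y_demo.mean())
print('mean predicted: {:.6f}'.format(pred_demo.mean()))
print('mean actual:    {:.6f}'.format(y_demo.mean()))
print('gap: {:.2e}'.format(mean_gap))
```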

In [None]:
# prediction 
lm__pred = 

In [None]:
mean_predicted = 
mean_actual = 
print('mean predicted sale price: ${:,.0f}'.format(mean_predicted))
print('mean actual sale price: ${:,.0f}'.format(mean_actual))

**Replicate one prediction**

The prediction for the first example is given below

In [None]:
# What is the prediction on the first example:
print('prediction on first example', np.round(lm__pred[0], 0), '\n')

In [None]:
# What is the intercept and what are the other coefficients
print('intercept', lm_.intercept_, '\n')
print('coefficients', lm_.coef_, '\n')

**TODO** 
- Calculate the prediction for the first example:
 $$ \text{intercept} + \text{dot product of other coefficients and X[0]}$$
- Compare your result to the prediction above (197,352)
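
The formula above can be sketched on synthetic data as follows. The names `X_demo`, `y_demo` and `lm_demo` are hypothetical stand-ins; here the target is 1-D, so `coef_` is a 1-D array and `intercept_` is a scalar.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Synthetic data with four features
X_demo = rng.normal(size=(100, 4))
y_demo = 10 + X_demo @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

lm_demo = LinearRegression().fit(X_demo, y_demo)

# Prediction from the fitted model for the first row
model_first = lm_demo.predict(X_demo[:1])[0]

# The same prediction by hand: intercept + dot product of coefficients and the row
manual_first = lm_demo.intercept_ + np.dot(lm_demo.coef_, X_demo[0])

print('model prediction: ', model_first)
print('manual prediction:', manual_first)
```

If the target was passed as a 2-D array, `coef_` has shape `(1, n_features)` and `intercept_` is a length-1 array, so flattening them with `np.ravel` first keeps the arithmetic the same.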

In [None]:
manual_prediction = 
print('manual prediction', manual_prediction)

**Performance**

**TODO**

 - Calculate the mean absolute error of your predictions
 - Is this a good indication of how the model will perform on future data that it has not yet seen?  (Probably yes, because a simple linear model like this is unlikely to be overfitted.)
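
One way to check whether the training error is a fair guide is to hold out some rows and compare the error on both parts. The sketch below does this on synthetic data with scikit-learn's `train_test_split`; the 25% test fraction, the random seeds and the data itself are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Synthetic regression data
X_demo = rng.normal(size=(400, 5))
y_demo = X_demo @ np.array([4.0, 0.0, -3.0, 2.0, 1.0]) + rng.normal(scale=2.0, size=400)

# Hold out 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

lm_demo = LinearRegression().fit(X_train, y_train)

# Mean absolute error on the rows used for fitting and on the held-out rows
mae_train = np.mean(np.abs(lm_demo.predict(X_train) - y_train))
mae_test = np.mean(np.abs(lm_demo.predict(X_test) - y_test))

print('train MAE: {:.3f}'.format(mae_train))
print('test MAE:  {:.3f}'.format(mae_test))
```

When the two errors are close, the training error is a reasonable guide; a much larger test error would suggest overfitting.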

In [None]:
# mean absolute error in predictions
# on train data
train_error = 
print('train error', train_error)