# Preprocessing and Fitting Model
## Import Libraries

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
import pickle

## Preprocessing

In [2]:
df = pd.read_csv('../datasets/clean_data.csv')
df.head()

Unnamed: 0,Gr Liv Area,Total Bsmt SF,Garage Area,Total Area,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,...,Overall Qual_2,Overall Qual_3,Overall Qual_4,Overall Qual_5,Overall Qual_6,Overall Qual_7,Overall Qual_8,Overall Qual_9,Overall Qual_10,SalePrice
0,1479,725.0,475.0,509330600.0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,130500
1,2122,913.0,559.0,1082999000.0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,220000
2,1057,1057.0,246.0,274843300.0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,109000
3,1444,384.0,400.0,221798400.0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,174000
4,1445,676.0,484.0,472780900.0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,138500


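Assign the features to `X` and the target value (`SalePrice`, the last column) to `y`. Because every column except the last is selected, `X` also includes the `Unnamed: 0` column, which looks like a leftover row index from the CSV rather than a real feature.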

In [3]:
X = df[df.columns[:-1]]
y = df[df.columns[-1]]

Split the data into two portions, training (80%) and testing (20%), so the model's performance can be validated on data it has not seen.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

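Standardize the features so they share a common scale, which matters for regularized models such as Lasso. The scaler is fitted on the training set only and then applied to the test set.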

In [5]:
ss = StandardScaler()

X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
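
To make the train-only fitting concrete, here is a minimal sketch on a small made-up array. The values and the names `train`, `test`, and `scaler` are illustrative assumptions, not taken from the dataset. The scaler learns each column's mean and standard deviation from the training rows and reuses them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, purely illustrative: two features on very different scales.
train = np.array([[1000.0, 1.0], [1500.0, 2.0], [2000.0, 3.0]])
test = np.array([[1500.0, 4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics come from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(train_scaled.mean(axis=0))  # approximately zero for each column
print(train_scaled.std(axis=0))   # approximately one for each column
print(test_scaled)                # test values need not fall within the training range
```

The scaled test columns are generally not exactly zero-mean and unit-variance, which is expected: they are expressed relative to the training distribution.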

## Training and Fitting Model

Use cross-validation to gauge model performance without regularization.

In [6]:
lr = LinearRegression()
# Run cross-validation once and reuse the per-fold scores
cv_scores = cross_val_score(lr, X_train, y_train, cv=5)
print(f'Average cross val score:{cv_scores.mean()}')
print(f'Cross val score for each fold:{cv_scores}')

Average cross val score:0.6682977399427619
Cross val score for each fold:[ 0.89713882  0.87279688  0.85420786  0.89120763 -0.17386249]
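
Because `X_train` was scaled before cross-validation, every fold's validation rows contributed to the scaling statistics. One common way to avoid that is to wrap the scaler and the model in a pipeline, so the scaler is refitted inside each fold. The sketch below uses synthetic data from `make_regression` as a stand-in for the housing features; `X_demo`, `y_demo`, and `pipe` are assumed names for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the housing dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=42
)

# The scaler is refitted on the training part of each fold.
pipe = make_pipeline(StandardScaler(), LinearRegression())
fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(f'Mean R^2: {fold_scores.mean():.3f}')
print(f'Std of R^2 across folds: {fold_scores.std():.3f}')
```

Reporting the spread alongside the mean makes an outlying fold, like the negative one above, easier to spot.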


### Lasso Regularization
Lasso is used to regularize the linear model because its score is inconsistent across the cross-validation folds (one fold scores -0.17).

In [7]:
lasso = Lasso(max_iter=2000)
lasso.fit(X_train, y_train)
lasso.score(X_test, y_test)

0.8853386566365018

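As shown above, with regularization the $R^2$ score improves from a mean cross-validation score of about 0.67 to 0.89 on the held-out test set. The two figures are measured on different data, so the comparison is indicative rather than exact.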
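
For a like-for-like comparison, both models can be scored with the same cross-validation folds. The following is a minimal sketch on synthetic data, not the notebook's actual result; `X_demo`, `y_demo`, `folds`, and `scores` are assumed names, and the shuffled `KFold` is an illustrative choice rather than the setting used above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the scaled training data (not the housing dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=42
)

# Reuse one splitter so both models see identical folds.
folds = KFold(n_splits=5, shuffle=True, random_state=42)

scores = {}
for name, model in [('linear', LinearRegression()), ('lasso', Lasso(max_iter=2000))]:
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=folds)
    print(f'{name}: mean R^2 = {scores[name].mean():.3f}')
```

Applied to `X_train` and `y_train`, the same pattern would show whether Lasso also removes the unstable fold, without touching the test set.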

---

**Export** the fitted Lasso model with pickle, along with `X_test` and `y_test`.

In [8]:
with open('../Pickle/model.pickle', 'wb') as handle:
    pickle.dump(lasso, handle)

with open('../Pickle/X_test.pickle', 'wb') as handle:
    pickle.dump(X_test, handle)

with open('../Pickle/y_test.pickle', 'wb') as handle:
    pickle.dump(y_test, handle)
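
The exported objects can later be restored with `pickle.load`, opening the files in `'rb'` mode. The sketch below shows the round trip on a small stand-in model written to a temporary directory; the data, the `alpha` value, and the names `model`, `restored`, and `tmp_dir` are assumptions for illustration. Note that the exported `X_test` is already standardized, while the fitted scaler `ss` is not exported, so raw new data would need the same scaling before prediction.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Lasso

# Tiny illustrative model; it stands in for the fitted `lasso` above.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
model = Lasso(alpha=0.01, max_iter=2000).fit(X_demo, y_demo)

# Hypothetical temporary location, used here instead of '../Pickle/'.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pickle')
    with open(path, 'wb') as handle:
        pickle.dump(model, handle)
    with open(path, 'rb') as handle:  # 'rb' for reading binary pickle data
        restored = pickle.load(handle)

# The restored model should reproduce the original predictions.
print(np.allclose(model.predict(X_demo), restored.predict(X_demo)))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.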