# Ames Housing Project


## Project Challenge Statement

#### Goal: Predict the price of homes at sale for the Aimes Iowa Housing dataset. 

Four files used to build the model. 

- feature_eng_train.csv -- this data contains all of the training data for this model after feature engineering.
- feature_eng_test.csv -- this data contains the test data for the model after feature engineering. You will feed this data into your regression model to make predictions.

#### Prediction Variables 
- train_X_variables -- this data contains all the training variables
- test_X_variables -- this data contains the final testing variable to predict housing prices 

## Table of Contents 

This Notebook is broken down into different sections for analysis purpose. The following links are connected to differenct section within the Notebook for simple navigation. 

### Contents:
- [Lasso Model](#Lasso-Model)
- [Ridge Model](#Ridge-Model)
- [ElasticNet Model](#ElasticNet-Model)

In [3]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [4]:
# Library imports
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, PowerTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression, RFECV
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

np.random.seed(42)
%matplotlib inline

In [6]:
#import Data 
clean_train_data = pd.read_csv('../datasets/train_data_clean.csv')
clean_test_data = pd.read_csv('../datasets/test_data_clean.csv')
train1_df = pd.read_csv("../datasets/train1_data")
test1_df = pd.read_csv("../datasets/test1_data")
train2_df = pd.read_csv("../datasets/train2_data")
test2_df = pd.read_csv("../datasets/test2_data")

### Lasso Model 1 on Train Test 1 model 

Model 1 focus on using feature engineered clean data set, perform polynomial tranformation, and standardize them using powertransformation. Find the number of Kbest variables, and use Lasso model. 

In [48]:
X = X_train_variables.drop(columns = ["Id", 'Unnamed: 0'])
y = train_df['SalePrice']

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [55]:
# code inference: Mark's local lecture 
pipe_lasso = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('pt', PowerTransformer()),
    ('kbest',SelectKBest(f_regression, k = 150)),
    ('rfecv', RFECV(estimator =LassoCV()))
])

In [56]:
pipe_lasso.fit(X_train, y_train)
lasso_predict = pipe_lasso.predict(X_test)
print("train score", pipe_lasso.score(X_train, y_train))
print('test score', pipe_lasso.score(X_test, y_test))
print("R2 score", r2_score(lasso_predict, y_test));

train score 0.9182082640478301
test score 0.9255771697036619
R2 score 0.9202072676335374


In [60]:
lasso_params = {
    'kbest__k': np.arange(100, 310, 10)
}

gs = GridSearchCV(pipe_lasso, lasso_params, verbose = 1)

In [61]:
gs.fit(X_train, y_train)

print('gs best param', gs.best_params_)
print('gs train score',gs.score(X_train, y_train))
print('gs test score', gs.score(X_test, y_test))

Fitting 3 folds for each of 21 candidates, totalling 63 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


ValueError: Input contains infinity or a value too large for dtype('float64').

In [15]:
# Pipe 2 after grid search CV
pipe_lasso = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('pt', PowerTransformer()),
    ('kbest',SelectKBest(f_regression, k = 230)),
    ('rfecv', RFECV(estimator =LassoCV()))
])

pipe_lasso.fit(X_train, y_train)

print("lasso train score", pipe_lasso.score(X_train, y_train))

print("lasso test score", pipe_lasso.score(X_test, y_test))

print("lasso R2 score", r2_score(lasso_predict, y_test))

lasso_test_predict = pipe_lasso.predict(X_test)
lasso_train_predict = pipe_lasso.predict(X_train)


print("lasso mse test", mean_squared_error(y_test, lasso_test_predict))
print("lasso mse train", mean_squared_error(y_train, lasso_train_predict))

lasso train score 0.9280421963906998
lasso test score 0.9333691428254745
lasso R2 score 0.9204532296035921
lasso mse test 421247912.4068305
lasso mse train 451341036.52801204


In [215]:
#save df procedure
id_df = X_test_variables[['Id']]

lasso_firstsub_test_predict = pipe_lasso.predict(X_test_variables.drop(columns = 'Id'))

lasso_firstsub_test_predict = pd.DataFrame(lasso_firstsub_test_predict, columns = ['SalePrice'])

df = id_df.join(lasso_firstsub_test_predict)

df.columns = ['Id', "SalePrice"]

df = df.set_index('Id')

df.to_csv('../datasets/lasso_first_submit.csv')

### Lasso Model 2
Model 2 contain train data with selected polynomial features, and uses Lasso as the model for fitting. 

In [39]:
X = X_train2_variables.drop(columns = "Id")
y = train_df['SalePrice']

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [41]:
# code inference: Mark's local lecture 
pipe_lasso2 = Pipeline([
#     ('poly', PolynomialFeatures(include_bias=False)),
    ('pt', PowerTransformer()),
    ('kbest',SelectKBest(f_regression, k = 30)),
    ('lasso', LassoCV())
])

In [42]:
pipe_lasso2.fit(X_train, y_train)

print("train score", pipe_lasso2.score(X_train, y_train))

print('test score', pipe_lasso2.score(X_test, y_test))

lasso2_predict = pipe_lasso2.predict(X_test)

print("R2 score", r2_score(lasso2_predict, y_test));

train score 0.8563424942338403
test score 0.8706056239966992
R2 score 0.8416849761165797


In [43]:
lasso_params = {
    'kbest__k': np.arange(10, 100, 10),
#     'lasso__alpha' : np.logspace(0, 5, 100)
}

gs = GridSearchCV(pipe_lasso, lasso_params)

gs.fit(X_train, y_train)

print('gs best param', gs.best_params_)

print('gs train score',gs.score(X_train, y_train))

print('gs test score', gs.score(X_test, y_test))

ValueError: Input contains infinity or a value too large for dtype('float64').

In [265]:
# Pipe 2 after grid search CV
pipe_lasso2 = Pipeline([
    ('poly', PolynomialFeatures()),
    ('ss', StandardScaler()),
    ('kbest',SelectKBest(f_regression, k = 90)),
    ('lasso', LassoCV())
])

pipe_lasso2.fit(X_train, y_train)

print("lasso train score", pipe_lasso2.score(X_train, y_train))

print("lasso test score", pipe_lasso2.score(X_test, y_test))

lasso2_test_predict = pipe_lasso2.predict(X_test)
lasso2_train_predict = pipe_lasso2.predict(X_train)

from sklearn.metrics import mean_squared_error
print("lasso2 mse test", mean_squared_error(y_test, lasso2_test_predict))
print("lasso2 mse train", mean_squared_error(y_train, lasso2_train_predict))

lasso train score 0.9163156720478631
lasso test score 0.9187616813329698
lasso2 mse test 513597957.4192093
lasso2 mse train 524893332.26655126


In [266]:
#save df procedure
id_df = X_test_variables[['Id']]

lasso_firstsub_test_predict = pipe_lasso.predict(X_test_variables.drop(columns = 'Id'))

lasso_firstsub_test_predict = pd.DataFrame(lasso_firstsub_test_predict, columns = ['SalePrice'])

df = id_df.join(lasso_firstsub_test_predict)

df.columns = ['Id', "SalePrice"]

df = df.set_index('Id')

df.to_csv('../datasets/lasso_first_submit.csv')

### Lasso Model 3

In [286]:
#import Data 
train2_df = pd.read_csv("../datasets/train2_data")
test2_df = pd.read_csv("../datasets/test2_data")

In [287]:
lst_1 = train2_df.columns.values

lst_2 = test2_df.columns.values

intersection = list(set(lst_1).intersection(set(lst_2)))

In [288]:
train2_df.shape

(2049, 291)

In [289]:
train2_df.head()

Unnamed: 0.1,Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,...,Misc Feature_TenC,Sale Type_COD,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_WD
0,0,109,533352170,60,69.0552,13517,6,8,1976,2005,...,0,0,0,0,0,0,0,0,0,1
1,1,544,531379050,60,43.0,11492,7,5,1996,1997,...,0,0,0,0,0,0,0,0,0,1
2,2,153,535304180,20,68.0,7922,5,7,1953,2007,...,0,0,0,0,0,0,0,0,0,1
3,3,318,916386060,60,73.0,9802,5,5,2006,2007,...,0,0,0,0,0,0,0,0,0,1
4,4,255,906425045,50,82.0,14235,6,8,1900,1993,...,0,0,0,0,0,0,0,0,0,1


In [291]:
X = train2_df[intersection].drop(columns = ["Id", "Unnamed: 0", "PID"]) 
y = train2_df['SalePrice']

In [292]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [293]:
# code inference: Mark's local lecture 
pipe_lasso3 = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ss', StandardScaler()),
    ('kbest',SelectKBest(f_regression, k = 250)),
    ('lasso', LassoCV())
])

In [294]:
pipe_lasso3.fit(X_train, y_train)

print("train score", pipe_lasso3.score(X_train, y_train))

print('test score', pipe_lasso3.score(X_test, y_test))

lasso3_predict = pipe_lasso3.predict(X_test)

print("R2 score", r2_score(lasso3_predict, y_test));

train score 0.9209283127964466
test score 0.9258766478357242
R2 score 0.9213114597771419


In [297]:
lasso_params = {
    'kbest__k': np.arange(10, 200, 10),
}

gs = GridSearchCV(pipe_lasso3, lasso_params)

gs.fit(X_train, y_train)

print('gs best param', gs.best_params_)

print('gs train score',gs.score(X_train, y_train))

print('gs test score', gs.score(X_test, y_test))

KeyboardInterrupt: 

In [296]:
# Pipe 3 after grid search CV
pipe_lasso3 = Pipeline([
    ('poly', PolynomialFeatures()),
    ('ss', StandardScaler()),
    ('kbest',SelectKBest(f_regression, k = 180)),
    ('lasso', LassoCV())
])

pipe_lasso3.fit(X_train, y_train)

print("lasso train score", pipe_lasso3.score(X_train, y_train))

print("lasso test score", pipe_lasso3.score(X_test, y_test))

lasso3_test_predict = pipe_lasso3.predict(X_test)
lasso3_train_predict = pipe_lasso3.predict(X_train)

from sklearn.metrics import mean_squared_error
print("lasso3 mse test", mean_squared_error(y_test, lasso3_test_predict))
print("lasso3 mse train", mean_squared_error(y_train, lasso3_train_predict))

lasso train score 0.9144080734779071
lasso test score 0.9183036799050621
lasso3 mse test 516493494.9158985
lasso3 mse train 536858365.55909175


In [271]:
#save df procedure
id_df = test2_df[['Id']]

lasso3_sub_test_predict = pipe_lasso3.predict(test2_df.drop(columns = ["Id", "Unnamed: 0", "PID"]))

lasso3_sub_test_predict = pd.DataFrame(lasso3_firstsub_test_predict, columns = ['SalePrice'])

df3 = id3_df.join(lasso3_sub_test_predict)

df3.columns = ['Id', "SalePrice"]

df3 = df3.set_index('Id')

df3.to_csv('../datasets/lasso_thrid_submit.csv')

ValueError: X shape does not match training shape

### Ridge Model

In [62]:
X = X_train_variables.drop(columns = "Id")
y = train_df['SalePrice']

In [63]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [66]:
# code inference: Mark's local lecture 
ridge_pipe = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ss', StandardScaler()),
    ('kbest',SelectKBest(f_regression, k = 150)),
    ('Ridge', RidgeCV())
])

In [67]:
ridge_pipe.fit(X_train, y_train)

  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)


Pipeline(memory=None,
     steps=[('poly', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('ss', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kbest', SelectKBest(k=150, score_func=<function f_regression at 0x1a142c7d90>)), ('Ridge', RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=None, fit_intercept=True,
    gcv_mode=None, normalize=False, scoring=None, store_cv_values=False))])

In [76]:
ridge_pipe.score(X_train, y_train)

0.9189969737403699

In [77]:
ridge_pipe.score(X_test, y_test)

0.920474950698848

In [80]:
ridge_predict = ridge_pipe.predict(X_test)

In [81]:
r2_score(ridge_predict, y_test)

0.9133911954202524

In [84]:
Ridge_params = {
    'kbest__k': np.arange(10, 300, 10),
}

In [85]:
gs = GridSearchCV(ridge_pipe, Ridge_params)

In [86]:
gs.fit(X_train, y_train)

  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /

  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /

  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /

  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /

  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /

  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /

  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /

  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)


GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('poly', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('ss', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kbest', SelectKBest(k=150, score_func=<function f_regression at 0x1a142c7d90>)), ('Ridge', RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=None, fit_intercept=True,
    gcv_mode=None, normalize=False, scoring=None, store_cv_values=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'kbest__k': array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120, 130,
       140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260,
       270, 280, 290])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [87]:
gs.best_params_

{'kbest__k': 230}

In [88]:
gs.score(X_train, y_train)

0.9361426468554552

In [89]:
gs.score(X_test, y_test)

0.9241619609284162

In [90]:
# Pipe 2 after grid search CV
ridge_pipe = Pipeline([
    ('poly', PolynomialFeatures()),
    ('ss', StandardScaler()),
    ('kbest',SelectKBest(f_regression, k = 230)),
    ('Ridge', RidgeCV())
])

In [91]:
ridge_pipe.fit(X_train, y_train)

  corr /= X_norms
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)


Pipeline(memory=None,
     steps=[('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('ss', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kbest', SelectKBest(k=230, score_func=<function f_regression at 0x1a142c7d90>)), ('Ridge', RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=None, fit_intercept=True,
    gcv_mode=None, normalize=False, scoring=None, store_cv_values=False))])

In [92]:
ridge_pipe.score(X_train, y_train)

0.9361426468554552

In [93]:
ridge_pipe.score(X_test, y_test)

0.9241619609284162

In [98]:
ridge_test_predict = ridge_pipe.predict(X_test)
ridge_train_predict = ridge_pipe.predict(X_train)

In [100]:
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, ridge_test_predict))
print(mean_squared_error(y_train, ridge_train_predict))

479456771.15116304
400532569.2635403


In [118]:
#Id df
id_df = X_test_variables[['Id']]

In [119]:
ridge_firstsub_test_predict = ridge_pipe.predict(X_test_variables.drop(columns = 'Id'))

In [120]:
ridge_firstsub_test_predict = pd.DataFrame(ridge_firstsub_test_predict , columns = ['SalePrice'])

In [121]:
df_2 = id_df.join(ridge_firstsub_test_predict)

In [122]:
df_2.columns = ['Id', "SalePrice"]

In [123]:
df_2 = df_2.set_index('Id')

In [124]:
df_2.to_csv('../datasets/ridge_first_submit.csv')