# 2. Imports

In [1]:
import pandas as pd
import missingno as msno
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import RobustScaler, PowerTransformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split


plt.rcParams['figure.figsize'] = [8, 5]
plt.rcParams['font.size'] = 14
plt.rcParams['font.weight'] = 'bold'
# Note: matplotlib >= 3.6 renames this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')

# 3. Load Data

In [2]:
train_df = pd.read_csv('./datasets/train.csv')
test_df = pd.read_csv('./datasets/test.csv')

### 4.3.2. Null Value Comparison
This comparison lists the features whose null counts differ between train and test. Rows with missing values may need to be removed or imputed if any of these features is used as a predictor, although not all of them will necessarily be used.

In [3]:
train_null = train_df.isnull().sum().drop('SalePrice')
test_null = test_df.isnull().sum()
train_null.compare(test_null).sort_values('self',ascending=False)

Unnamed: 0,self,other
Pool QC,2042.0,874.0
Misc Feature,1986.0,837.0
Alley,1911.0,820.0
Fence,1651.0,706.0
Fireplace Qu,1000.0,422.0
Lot Frontage,330.0,160.0
Garage Cond,114.0,45.0
Garage Qual,114.0,45.0
Garage Finish,114.0,45.0
Garage Yr Blt,114.0,45.0
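
One caveat when reading this table: `Series.compare` keeps only the labels whose values differ, so a feature with the same null count in train and test does not appear at all. A minimal sketch on made-up counts (the numbers below are illustrative, not taken from the dataset), with a `pd.concat` alternative that keeps every feature:

```python
import pandas as pd

# Hypothetical null counts; the numbers are illustrative only.
train_null = pd.Series({'Alley': 3, 'Fence': 2, 'Lot Area': 0})
test_null = pd.Series({'Alley': 1, 'Fence': 2, 'Lot Area': 0})

# compare() reports only the rows where the two Series differ.
diff = train_null.compare(test_null)
print(diff)

# concat() keeps every feature, including those with equal counts.
side_by_side = pd.concat([train_null, test_null], axis=1, keys=['train', 'test'])
print(side_by_side[side_by_side.sum(axis=1) > 0])
```

In the toy data, 'Fence' has missing values in both sets but is absent from the `compare` result because the two counts are equal.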


### 4.3.3. Distribution Comparison

In [4]:
train_df['Bsmt Cond'].unique()

array(['TA', 'Gd', nan, 'Fa', 'Po', 'Ex'], dtype=object)

In [5]:
numeric_feat =[col for col in train_df.columns if train_df[col].dtypes != 'O']
discrete_feat = [col for col in numeric_feat if len(train_df[col].unique())<25 and col not in ['Id']]
continuous_feat = [col for col in numeric_feat if col not in discrete_feat and col not in ['Id','PID']]
categorical_feat = [col for col in train_df.columns if train_df[col].dtypes == 'O']

In [6]:
combined_df = pd.concat([train_df,test_df],axis=0)

In [7]:
combined_df['Label'] = combined_df['SalePrice'].apply(lambda x: 'train' if x > 0 else 'test')
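
The labelling above works because `SalePrice` is `NaN` for every test row after the concat, and any comparison with `NaN` evaluates to `False`. A small sketch on a toy frame (values are illustrative) shows this, together with an equivalent and more explicit `notna()` version. It also shows that `pd.concat` keeps the original row labels, so labels repeat across train and test.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_df and test_df; test has no SalePrice column.
toy_train = pd.DataFrame({'Id': [1, 2], 'SalePrice': [100000.0, 250000.0]})
toy_test = pd.DataFrame({'Id': [3]})

toy_combined = pd.concat([toy_train, toy_test], axis=0)

# NaN > 0 is False, so rows without a price fall into the 'test' branch.
toy_combined['Label'] = toy_combined['SalePrice'].apply(
    lambda x: 'train' if x > 0 else 'test'
)

# Equivalent labelling that states the intent directly.
explicit = np.where(toy_combined['SalePrice'].notna(), 'train', 'test')

print(toy_combined)
print(toy_combined.index.tolist())  # row labels repeat after the concat
```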

In [8]:
print(combined_df[['Yr Sold','Garage Yr Blt']].loc[[1699,1885]])

      Yr Sold  Garage Yr Blt
1699     2007         2207.0
1885     2007         2008.0


In [9]:
print("Total Number of Numeric features:\t",len(numeric_feat))
print("  Number of ID features:\t",2)
print("  Number of discrete features:\t",len(discrete_feat))
print("  Number of continuous features:", len(continuous_feat))
print("Total Number of Categorical features:\t",len(categorical_feat))

Total Number of Numeric features:	 39
  Number of ID features:	 2
  Number of discrete features:	 16
  Number of continuous features: 21
Total Number of Categorical features:	 42


#### 4.3.3.1. Distribution Comparison - Discrete
In general we expect each category count in train to be about 2.34 times the count in test, so that the two sets are relatively stratified.

['MS SubClass'] is the type of house, even though it is stored as a number.

['TotRms AbvGrd', 'Overall Qual'] have a roughly normal distribution.

['MS SubClass', 'age', 'garage_age'] are left-bound.

['Yr Sold'] is almost uniform.

['Mo Sold'], however, is not uniform, which implies some seasonality (June and July are summer months with higher sales), with a left skew.

#### 4.3.3.2. Distribution Comparison - Continuous
In general we expect each category count in train to be about 2.34 times the count in test, so that the two sets are relatively stratified.

In [10]:
print(f'expected ratio between train to test: {len(train_df)/len(test_df):.2f}')

expected ratio between train to test: 2.34
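
The per-category check described above can be sketched as follows. The helper name `category_ratio` and the toy data are assumptions made for illustration; with the real frames, the ratio column would be compared against the overall 2.34.

```python
import pandas as pd

def category_ratio(train_col, test_col):
    """Count each category in train and test and return the train/test ratio."""
    counts = pd.concat(
        [train_col.value_counts(), test_col.value_counts()],
        axis=1, keys=['train', 'test'],
    ).fillna(0)
    counts['ratio'] = counts['train'] / counts['test']
    return counts

# Toy columns with a 7:3 split in every category.
toy_train = pd.Series(['Pave'] * 7 + ['Grvl'] * 7)
toy_test = pd.Series(['Pave'] * 3 + ['Grvl'] * 3)

ratios = category_ratio(toy_train, toy_test)
print(ratios)
```

A category that is absent from test gets an infinite ratio, which makes such cases easy to spot.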


#### 4.3.3.3. Linearity Check
Many variables appear independent, showing either a random pattern or a discrete pattern with no slope.
Other variables (e.g. year built) show a relationship, but one with no practical meaning.

['Gr Liv Area', 'Total Bsmt SF', 'Garage Area', 'age', 'garage_age', '1st Flr SF', 'Overall Qual', 'TotRms AbvGrd'] show an apparent relationship in the scatterplots. The list is not exhaustive, and some of these variables need outliers removed.

#### 4.3.3.4. Categorical Distribution Comparison 
This is an easy way to check for values that are not listed in the 'data description' file.

We continue to expect train counts to be about 2.34 times test counts in every category, so that the two sets are relatively stratified. ['Pool QC'] has more 'Ex' values in test than in train, but it is also the column with the most missing data (recommend dropping the column).

['Roof Matl','Street','Condition 2','Utilities','Heating'] are shortlisted because a single value accounts for more than 2,000 of the training rows.

In [11]:
for feature in categorical_feat:
    # value_counts() sorts in descending order, so .iloc[0] is the count of the most frequent value
    if train_df[feature].value_counts().iloc[0] > 2000:
        print(feature)

Street
Utilities
Condition 2
Roof Matl
Heating


['Neighborhood'] also appears to have a significant spread of 'SalePrice' across its values.

### 4.4. Find Suitable value for Numerical missing values


### 4.5. Temporal Variable Analysis

In [12]:
for i,x in enumerate(combined_df['Yr Sold']-combined_df['Garage Yr Blt']):
    if x<0:
        print(i)

1699
1885


In [13]:
combined_df.loc[1699,'Garage Yr Blt'] = 2007
combined_df.loc[1885,'Garage Yr Blt'] = 2007
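
The loop and the two hard-coded fixes can also be written as a boolean mask. The sketch below runs on a toy frame (values are illustrative) and caps the garage year at the sale year, which for the two rows above gives the same result, 2007. Treat it as one possible formulation, not the notebook's method.

```python
import pandas as pd

# Toy frame: the first row has a garage "built" after the sale, the last has no garage.
toy = pd.DataFrame({
    'Yr Sold': [2007, 2007, 2009],
    'Garage Yr Blt': [2207.0, 1990.0, None],
})

# Comparisons with NaN are False, so rows without a garage are left alone.
built_after_sale = toy['Garage Yr Blt'] > toy['Yr Sold']

# Cap the impossible values at the year of sale.
toy.loc[built_after_sale, 'Garage Yr Blt'] = toy.loc[built_after_sale, 'Yr Sold']
print(toy)
```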

### 4.6. Data Correlation
Garage parameters are highly correlated with each other.
Year built and garage year built are highly correlated.
Number of rooms and living area are highly correlated.

All of the above may cause multicollinearity.
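
One way to list such pairs systematically is to scan the upper triangle of the correlation matrix. The helper name `correlated_pairs`, the 0.8 threshold, and the toy values are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: the second column is almost a multiple of the first, the third is unrelated.
a = np.arange(10.0)
c = np.tile([1.0, -1.0], 5)
toy = pd.DataFrame({'Garage Area': a, 'Garage Cars': 2 * a + 0.1 * c, 'Mo Sold': c})

pairs = correlated_pairs(toy)
print(pairs)
```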

## 5. Feature Engineering

### 5.1. Drop Columns or Rows

Missing values will be resolved by picking predictors which have significant data.

_['Pool QC', 'Misc Feature', 'Alley', 'Fence', 'Fireplace Qu',
       'Lot Frontage', 'Garage Cond', 'Garage Qual', 'Garage Finish',
       'Garage Yr Blt', 'Garage Type', 'Bsmt Exposure', 'BsmtFin Type 2',
       'BsmtFin Type 1', 'Bsmt Qual', 'Bsmt Cond', 'Mas Vnr Area',
       'Mas Vnr Type']_ 
       
Columns are dropped based on the following:
- there is a significant quantity of missing values (assumed threshold: > 5 missing training values)
- the remaining variables are related and likely to carry the predictive influence on the response even if the dropped variable is not considered

18 **columns** are removed from `combined_df` because they have more than 5 entries missing. 'PID' is dropped as well, for 19 columns in total.

At this point, note that 'Quality' features are harder to assess objectively. They will not be considered significant in the first iterations of the model but may be revisited for fine tuning if necessary.
       
---

In [14]:
isnull_set = train_null.compare(test_null).sort_values('self',ascending=False)
drop_columns = list(isnull_set[isnull_set['self']>5].index)
#drop_columns.append('Id')
drop_columns.append('PID')

In [15]:
print("Number of columns before dropping:\t",len(combined_df.columns))
print("Number of dropping columns:\t\t",len(drop_columns))
combined_df.drop(columns=drop_columns, inplace=True, errors='ignore')
print("Number of columns after dropping:\t",len(combined_df.columns))

Number of columns before dropping:	 82
Number of dropping columns:		 19
Number of columns after dropping:	 63


### 5.2 Temporal Variable Change



In [16]:
for feature in ['Year Built','Year Remod/Add']:
    combined_df[feature]=combined_df['Yr Sold']-combined_df[feature]

In [17]:
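# Rooms and living area are correlated with each other; one of them might have to be removed later.
# numeric_only=True skips object columns such as 'Label' (required on pandas >= 2.0).
combined_df.corr(numeric_only=True)['SalePrice'].drop('SalePrice').sort_values(key=lambda x:abs(x),ascending=False)[:12]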

Overall Qual      0.800207
Gr Liv Area       0.697038
Garage Area       0.650270
Garage Cars       0.648220
Total Bsmt SF     0.628925
1st Flr SF        0.618486
Year Built       -0.571881
Year Remod/Add   -0.551716
Full Bath         0.537969
TotRms AbvGrd     0.504014
Fireplaces        0.471093
BsmtFin SF 1      0.423519
Name: SalePrice, dtype: float64

### testing: add 'Gr Liv Area'^2, 'Year Built'^2

In [18]:
combined_df['Year Built^2'] = combined_df['Year Built']**2
combined_df['Gr Liv Area^2'] = combined_df['Gr Liv Area']**2

### 5.3 Fill Missing Values
The remaining columns have only 1 or 2 missing entries each, so we assume the values are missing completely at random. Rather than dropping these rows, the values are imputed:

- numerical nulls are set to 0 (except 'SalePrice', which is null for every test row)
- 'Electrical' is categorical and is set to the mode

0 **rows** removed

'*' An alternative approach is to drop only the affected train entries and leave test untouched. After testing, imputing values was found to be more effective.

'*' Finding another value for `Electrical` was considered, but none is available.

In [19]:
[col for col in combined_df.columns if combined_df[col].isnull().sum() > 0]

['BsmtFin SF 1',
 'BsmtFin SF 2',
 'Bsmt Unf SF',
 'Total Bsmt SF',
 'Electrical',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Garage Cars',
 'Garage Area',
 'SalePrice']

In [20]:
null_features_numerical = ['Bsmt Full Bath','Bsmt Half Bath','Total Bsmt SF','Bsmt Unf SF','BsmtFin SF 2','BsmtFin SF 1','Garage Cars','Garage Area']

In [21]:
for col in null_features_numerical:
    if col not in drop_columns:
        combined_df[col] = combined_df[col].fillna(0.0)

combined_df['Electrical'] = combined_df['Electrical'].fillna(combined_df['Electrical'].mode()[0])        

In [22]:
before_drop = len(combined_df)
print("Number of rows before dropping:\t", before_drop)
# Keep only rows with no nulls outside 'SalePrice' (which is null for every test row).
# A boolean mask avoids dropping by index label, since labels repeat after the concat.
has_null = combined_df.drop(columns='SalePrice').isnull().any(axis=1)
combined_df = combined_df[~has_null]
print("Number of rows dropped:\t\t", before_drop - len(combined_df))
print("Number of rows after dropping:\t", len(combined_df))

Number of rows before dropping:	 2929
Number of rows dropped:		 0
Number of rows after dropping:	 2929


### 5.4 Convert Numerical feature to Categorical
some of the numeric features can be grouped as
- Area
- Quality / Condition
    - assumed to be linearly related
- Type of house feature
    - _house type as seen in data description is not linear and will be changed to categorical_
- Quantity of house feature
- Year
- Price

In [23]:
convert_list = ['MS SubClass']
combined_df['MS SubClass'] = combined_df['MS SubClass'].astype('str')

### 5.5. Apply PowerTransformer to columns
- The distributions of the continuous features showed that some features are skewed and not linearly related to the target, so they need to be transformed.
- First we check the skewness of all distributions.
- After review, the following are shortlisted for power transformation:
    ['Gr Liv Area','Lot Area', 'Garage Area','1st Flr SF', '2nd Flr SF', 'Enclosed Porch','BsmtFin SF 1', 'Total Bsmt SF', 'Bsmt Unf SF']
- Both the 'yeo-johnson' and 'box-cox' methods are used. Box-Cox requires strictly positive values, so it is applied only to 'Lot Area' and '1st Flr SF'; the other features use Yeo-Johnson, which also accepts zeros.
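
The difference between the two methods can be seen on synthetic data. In the sketch below, the column names `area` and `porch` and the lognormal samples are assumptions for illustration: Box-Cox reduces the skew of a strictly positive column, but raises an error on a column that contains zeros, where Yeo-Johnson still works.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Strictly positive, right-skewed column.
area = pd.DataFrame({'area': rng.lognormal(mean=9, sigma=0.5, size=500)})
# Column with many exact zeros, e.g. houses without a porch.
with_zeros = pd.DataFrame({'porch': np.r_[np.zeros(100), rng.lognormal(4, 0.5, 400)]})

# Box-Cox on positive data.
area_t = PowerTransformer(method='box-cox', standardize=True).fit_transform(area)
skew_before = area['area'].skew()
skew_after = pd.Series(area_t.ravel()).skew()
print(f'skew before: {skew_before:.2f}, after: {skew_after:.2f}')

# Box-Cox rejects zeros; Yeo-Johnson accepts them.
try:
    PowerTransformer(method='box-cox').fit(with_zeros)
    box_cox_ok = True
except ValueError:
    box_cox_ok = False
porch_t = PowerTransformer(method='yeo-johnson').fit_transform(with_zeros)
print('Box-Cox accepted zeros:', box_cox_ok)
```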

In [24]:
numeric_feat =[col for col in combined_df.columns if combined_df[col].dtypes != 'O']
discrete_feat = [col for col in numeric_feat if len(combined_df[col].unique())<25]
continuous_feat = [col for col in numeric_feat if col not in discrete_feat]

# check the skew of all numerical features
skewed_feats = combined_df[continuous_feat].skew()
print('\n Skew in numerical features: \n')
skewness_df = pd.DataFrame({'Skew' : skewed_feats}).sort_values('Skew',ascending=False)
print(skewness_df)


 Skew in numerical features: 

                      Skew
Misc Val         21.996036
Lot Area         12.899157
Low Qual Fin SF  12.116056
3Ssn Porch       11.401807
BsmtFin SF 2      4.139978
Enclosed Porch    4.013674
Screen Porch      3.956673
Gr Liv Area^2     3.820356
Open Porch SF     2.525565
Wood Deck SF      1.843810
Year Built^2      1.813130
SalePrice         1.557551
1st Flr SF        1.449236
BsmtFin SF 1      1.410038
Gr Liv Area       1.220100
Total Bsmt SF     1.132228
Bsmt Unf SF       0.923750
2nd Flr SF        0.865112
Year Built        0.603031
Year Remod/Add    0.450886
Garage Area       0.240043
Id                0.000660


In [25]:
log_list = ['Gr Liv Area','Gr Liv Area^2','Lot Area', 'Garage Area','1st Flr SF', '2nd Flr SF', 'Enclosed Porch','BsmtFin SF 1', 'Total Bsmt SF', 'Bsmt Unf SF']

In [26]:
for col in log_list:
    if col in ['Lot Area', '1st Flr SF']:
        power = PowerTransformer(method='box-cox', standardize=True)
        combined_df[[col]] = power.fit_transform(combined_df[[col]])
    else:
        power = PowerTransformer(method='yeo-johnson', standardize=True)
        combined_df[[col]] = power.fit_transform(combined_df[[col]]) # fit with combined_data to avoid overfitting with training data?

print('Number of skewed numerical features got transform : ', len(log_list))

Number of skewed numerical features got transform :  10
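
On the question raised in the code comment: fitting the transformer on the combined data lets the test rows influence the fitted parameters. A common alternative is to fit on train only and reuse the fitted parameters on test. The sketch below uses synthetic frames (`toy_train` and `toy_test` are hypothetical stand-ins) and is not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
toy_train = pd.DataFrame({'Lot Area': rng.lognormal(9, 0.5, size=300)})
toy_test = pd.DataFrame({'Lot Area': rng.lognormal(9, 0.5, size=100)})

power = PowerTransformer(method='box-cox', standardize=True)
train_t = power.fit_transform(toy_train)  # lambda and scaling are learned from train only
test_t = power.transform(toy_test)        # the same parameters are reused on test

print('fitted lambda:', power.lambdas_)
```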


### 5.6 Regroup Features
Features are regrouped to remove unnecessary dummy variables, considering [4.3.3.4. Categorical Distribution Comparison](#4.3.3.4.-Categorical-Distribution-Comparison). 'Type' features are still relevant since they are binary. However, the discretionary quality features, namely those with the values:
       Ex	Excellent  
       Gd	Good  
       TA	Average/Typical  
       Fa	Fair  
       Po	Poor  
have few 'Fa' and 'Po' counts, so these two levels are consolidated. Affected variables:

['Kitchen Qual','Heating QC','Exter Cond', 'Exter Qual']

Exclusions:
- ['Exter Cond'] is distributed around 'TA', its most frequent value. We assume it is normally distributed, so 'Fa' and 'Po' remain significant.
- ['Exter Qual'] has no 'Po', so it does not need to be regrouped.

The remainder to regroup is ['Kitchen Qual','Heating QC'].

In [27]:
combined_df['Kitchen Qual'] = combined_df['Kitchen Qual'].apply(lambda x: 'Fa/Po' if x in ['Fa','Po'] else x)
combined_df['Heating QC'] = combined_df['Heating QC'].apply(lambda x: 'Fa/Po' if x in ['Fa','Po'] else x)

### 5.7 Get-Dummies

starting with 62 columns >`get_dummies`> 241 columns

In [28]:
combined_df = pd.get_dummies(combined_df).reset_index(drop=True)
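
Encoding the combined frame, rather than train and test separately, guarantees that both parts end up with the same dummy columns. A toy illustration (the two small frames are made up for this sketch):

```python
import pandas as pd

toy_train = pd.DataFrame({'Roof Matl': ['CompShg', 'ClyTile']})
toy_test = pd.DataFrame({'Roof Matl': ['CompShg', 'Membran']})

# Encoded separately, each frame only gets columns for the categories it contains.
train_cols = list(pd.get_dummies(toy_train).columns)
test_cols = list(pd.get_dummies(toy_test).columns)
print(train_cols)
print(test_cols)

# Encoded together, both parts share the full set of columns.
both = pd.get_dummies(pd.concat([toy_train, toy_test], keys=['train', 'test']))
train_part, test_part = both.loc['train'], both.loc['test']
print(list(train_part.columns))
```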

### 5.8 Get X and y

In [29]:
combined_df.to_csv('combined_test_df.csv',index=False)

In [30]:
combined_df = pd.read_csv('combined_test_df.csv')

In [31]:
coef_index = combined_df.columns.drop(['Id','Label_test','Label_train','SalePrice'])
new_train_data = combined_df.loc[combined_df['Label_train']==1].drop(columns=['Id','Label_test','Label_train'])
new_test_data = combined_df.loc[combined_df['Label_test']==1].drop(columns=['Id','Label_test','Label_train'])
X_train = new_train_data.drop('SalePrice', axis=1)
y_train = np.log1p(new_train_data['SalePrice'].values.ravel())
X_test = new_test_data.drop('SalePrice', axis=1)
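
The target is modelled on the `log1p` scale, so predictions must be passed through `np.expm1` to get prices back. A short sketch with made-up prices shows the round trip, and also that a constant relative error in price becomes a constant absolute error on the log scale:

```python
import numpy as np

prices = np.array([100000.0, 200000.0, 400000.0])  # illustrative sale prices

y_log = np.log1p(prices)        # what the models are trained on
recovered = np.expm1(y_log)     # inverse transform back to dollars

# Over-predicting every price by 10% gives nearly the same error on the log scale.
overpriced = prices * 1.10
log_error = np.log1p(overpriced) - np.log1p(prices)
print(log_error)
```

This is why errors measured on the transformed target treat cheap and expensive houses more evenly than errors measured in dollars.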

In [32]:
preprocessing_pipeline = make_pipeline(RobustScaler())

X_train = preprocessing_pipeline.fit_transform(X_train)
X_test = preprocessing_pipeline.transform(X_test)

print(X_train.shape)
print(X_test.shape)

(2051, 240)
(878, 240)
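
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range, so a single extreme value does not distort the scaling of the other rows. A small check on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

scaled = RobustScaler().fit_transform(x)

# The same result computed by hand: (x - median) / (Q3 - Q1).
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
```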


## 6.4 Model Development

The entire train set is used to determine whether adding power terms improves the fit. We compare R^2 because it is a simple metric and is assumed to rank the models in the same way as RMSE.

In [33]:
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error

In [34]:
cv = KFold(n_splits=10, random_state=42, shuffle=True)

### 6.0 OLS

#### 6.0.1 OLS fit

In [35]:
ols=LinearRegression()
ols.fit(X_train, y_train)

LinearRegression()

In [36]:
ols_scores = cross_val_score(ols, X_train, y_train, cv=cv, n_jobs=-1,scoring='r2')
ols_scores.mean()

0.8725843222789429

#### 6.0.2 OLS Evaluating Model

In [37]:
ols_train_score = ols.score(X_train,y_train)
print(f'ols train R^2:\t\t {ols_train_score:.4f}')

ols_cv_score = ols_scores.mean()
print(f'ols mean cv score:\t {ols_cv_score:.4f}')

y_pred = ols.predict(X_train)
ols_mse = mean_squared_error(y_train, y_pred)
print(f'train mse\t\t{ols_mse:.4f}')

ols_score_delta = ols_train_score-ols_cv_score
print(f'train-test score delta \t{ols_score_delta/ols_cv_score*100:.2f}%')

ols train R^2:		 0.9378
ols mean cv score:	 0.8726
train mse		0.0105
train-test score delta 	7.47%


In [38]:
ols_coef = pd.DataFrame(ols.coef_, index = coef_index)
ols_coef.columns = ['ols_coef']
ols_coef.sort_values(by='ols_coef',key=lambda x: abs(x),ascending=False)[:20]

Unnamed: 0,ols_coef
Gr Liv Area^2,15.330139
Gr Liv Area,-15.168436
Roof Matl_ClyTile,-1.551506
MS Zoning_A (agr),-0.567966
Exterior 1st_CBlock,0.510191
Functional_Sal,-0.453992
Roof Matl_Membran,0.424306
Neighborhood_GrnHill,0.410906
Exterior 2nd_CBlock,-0.404666
Roof Matl_WdShngl,0.349068


In [52]:
ols_cv_score = ols_scores.mean()
print(f'ols mean cv score:\t {ols_cv_score:.4f}')

ols_score_delta = ols_train_score-ols_cv_score
print(f'train-test score delta \t{ols_score_delta/ols_cv_score*100:.2f}%')

y_pred = np.expm1(ols.predict(X_train))
ols_mse = mean_squared_error(np.expm1(y_train), y_pred)
print(f'train rmse\t\t{ols_mse**.5:,}')

ols mean cv score:	 0.8726
train-test score delta 	7.47%
train rmse		18,701.387293360924


In [43]:
x = ols_coef.to_dict()
x=x['ols_coef']
for coef in ['Gr Liv Area','Gr Liv Area^2','Year Built','Year Built^2']:
    print(coef,'\t', x[coef])

Gr Liv Area 	 -15.168435887556809
Gr Liv Area^2 	 15.33013854093558
Year Built 	 -0.18392298584706354
Year Built^2 	 0.03910670744498802


### 6.1 Ridge

#### 6.1.1 Ridge fit

In [44]:
ridge_cv = RidgeCV(cv=5)
ridge_cv_scores = cross_val_score(ridge_cv, X_train, y_train, cv=cv, n_jobs=-1,scoring='r2')

In [45]:
ridge_cv_scores.mean()

0.8928436327993833

In [58]:
ridge_cv = RidgeCV(cv=100,alphas=[.125,1.25,12.5],scoring='neg_mean_squared_error')
ridge_cv.fit(X_train, y_train)
print('Ridge alpha:', ridge_cv.alpha_)

Ridge alpha: 12.5
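
The selected alpha, 12.5, is the largest value in the grid, so the best value may lie beyond it. A hedged sketch of a wider, log-spaced search on synthetic data follows; `make_regression` stands in for the housing features and the grid bounds are an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X_toy, y_toy = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# 13 candidate alphas from 0.001 to 1000, evenly spaced on a log scale.
alphas = np.logspace(-3, 3, 13)
ridge_toy = RidgeCV(alphas=alphas).fit(X_toy, y_toy)

# If the winner sits on either end of the grid, the grid should be extended.
on_edge = bool(ridge_toy.alpha_ in (alphas.min(), alphas.max()))
print('selected alpha:', ridge_toy.alpha_, '| on grid edge:', on_edge)
```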


#### 6.1.2 Ridge Evaluating Model

In [60]:
ridge_cv_score = ridge_cv.score(X_train, y_train)
print(f'train R^2 :{ridge_cv_score:.4f}')

ridge_cv_score_delta = ridge_cv_score-ridge_cv_scores.mean()
print(f'train-test score delta: {ridge_cv_score_delta/ridge_cv_score*100:.2f}%')

y_pred = np.expm1(ridge_cv.predict(X_train))
ridge_cv_mse = mean_squared_error(np.expm1(y_train), y_pred)
print(f'train rmse\t\t{ridge_cv_mse**.5:,}')

train R^2 :0.9243
train-test score delta: 3.41%
train rmse		21,581.778161312188


In [49]:
ridge_cv_coef = pd.DataFrame(ridge_cv.coef_, index = coef_index)
ridge_cv_coef.columns = ['ridge_coef']
ridge_cv_coef.sort_values(by='ridge_coef',key=lambda x: abs(x),ascending=False)[:50]

Unnamed: 0,ridge_coef
Functional_Sal,-0.114111
Roof Matl_ClyTile,-0.112913
Overall Qual,0.112632
Functional_Typ,0.096432
Year Built,-0.095988
Exter Cond_Po,-0.08751
MS Zoning_A (agr),-0.087171
MS Zoning_C (all),-0.085586
Exterior 1st_BrkFace,0.084329
Gr Liv Area,0.08329


In [50]:
ridge_cv_coef

Unnamed: 0,ridge_coef
Lot Area,0.036089
Overall Qual,0.112632
Overall Cond,0.041832
Year Built,-0.095988
Year Remod/Add,-0.029160
...,...
Sale Type_ConLw,-0.011070
Sale Type_New,0.015277
Sale Type_Oth,0.031318
Sale Type_VWD,0.000000


In [51]:
x = ridge_cv_coef.to_dict()
x=x['ridge_coef']
for coef in ['Gr Liv Area','Gr Liv Area^2','Year Built','Year Built^2']:
    print(coef,'\t', x[coef])

Gr Liv Area 	 0.08328993012306976
Gr Liv Area^2 	 0.08322385912970143
Year Built 	 -0.09598776634539422
Year Built^2 	 0.0019979321960769563


### 6.2 Lasso
#### 6.2.1 Lasso fit

In [54]:
lasso_cv = LassoCV(cv=5)
lasso_cv_scores = cross_val_score(lasso_cv, X_train, y_train, cv=cv, n_jobs=-1,scoring='r2')

In [55]:
lasso_cv_scores.mean()

0.878256409413304

In [56]:
lasso_cv.fit(X_train, y_train)
print(f'Lasso alpha: {lasso_cv.alpha_}')

Lasso alpha: 0.0029715935987551113



In [68]:
lasso_cv_score = lasso_cv.score(X_train, y_train)
print(lasso_cv_score)

y_pred = np.expm1(lasso_cv.predict(X_train))
lasso_cv_mse = mean_squared_error(np.expm1(y_train), y_pred)
print(f'train rmse\t\t{lasso_cv_mse**.5:,}')

0.8953796051871508
train rmse		27,599.159520039735
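
Lasso's L1 penalty can set coefficients exactly to zero, which is consistent with its lower train R^2 here: the model relies on fewer features. The sketch below counts the retained features on synthetic data (`make_regression` is a stand-in, not the housing data). `max_iter` is raised above its default of 1000, a common remedy when coordinate descent reports that it did not converge.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# 30 features, of which only 5 carry signal.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0
)

lasso_toy = LassoCV(cv=5, max_iter=10000, random_state=0).fit(X_toy, y_toy)

n_selected = int(np.sum(lasso_toy.coef_ != 0))
print(f'alpha: {lasso_toy.alpha_:.4f}')
print(f'features kept: {n_selected} of {X_toy.shape[1]}')
```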


In [64]:
lasso_cv_coef = pd.DataFrame(lasso_cv.coef_, index = coef_index)
lasso_cv_coef.columns = ['lasso_coef']
lasso_cv_coef.sort_values(by='lasso_coef',key=lambda x: abs(x),ascending=False)[:50]

Unnamed: 0,lasso_coef
Overall Qual,0.1700717
Gr Liv Area,0.1503701
Year Built,-0.1199079
BsmtFin SF 1,0.07492205
Overall Cond,0.04687926
Kitchen Qual_Ex,0.0465076
Lot Area,0.03901468
Garage Cars,0.03742856
Year Remod/Add,-0.03270144
1st Flr SF,0.03049073


In [65]:
x = lasso_cv_coef.to_dict()
x=x['lasso_coef']
for coef in ['Gr Liv Area','Gr Liv Area^2','Year Built','Year Built^2']:
    print(coef,'\t', x[coef])

Gr Liv Area 	 0.15037008236307375
Gr Liv Area^2 	 0.002813224866034874
Year Built 	 -0.11990791543400811
Year Built^2 	 -0.0004346439018440656


## 6.5 Model Evaluation
R^2 on the train data for OLS, Ridge, Lasso: [0.938, 0.924, 0.895]  

There is insignificant improvement, likely due to the small coefficients and the multiple predictors involved. The power terms are rejected for the final model.

In [67]:
train_scores = [ols_train_score,ridge_cv_score,lasso_cv_score]
mses = [ols_mse,ridge_cv_mse,lasso_cv_mse]

print(train_scores)
print(mses)

[0.9378057798541433, 0.9243220890876771, 0.8953796051871508]
[349741886.69628143, 465773148.60409164, 761713606.2126]
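
The MSE values above are in squared dollars, which is hard to read. Taking the square root gives the train RMSE in dollars, matching the figures printed in the evaluation cells. The snippet only reuses the numbers reported above:

```python
import math

models = ['OLS', 'Ridge', 'Lasso']
train_r2 = [0.9378057798541433, 0.9243220890876771, 0.8953796051871508]
train_mse = [349741886.69628143, 465773148.60409164, 761713606.2126]

# RMSE is the square root of MSE and is expressed in dollars.
train_rmse = [math.sqrt(m) for m in train_mse]

for name, r2, rmse in zip(models, train_r2, train_rmse):
    print(f'{name:<6} R^2 = {r2:.3f}   RMSE = {rmse:,.0f}')
```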
