# Intermediate Linear Regression Practice

## Use a Linear Regression model to get the lowest RMSE possible on the following dataset:

[Dataset Folder](https://github.com/ryanleeallred/datasets/tree/master/Ames%20Housing%20Data)

[Raw CSV](https://raw.githubusercontent.com/ryanleeallred/datasets/master/Ames%20Housing%20Data/train.csv)

## You model must include (at least):
- A log-transformed y variable
- Two polynomial features
- One interaction feature
- 10 other engineered features

What is the lowest Root-Mean-Squared Error that you are able to obtain? Share your best RMSEs in Slack!

Notes:

There may be some data cleaning that you need to do on some features of this dataset. Linear Regression will only accept numeric values and will not accept

Note* There may not be a clear candidate for an interaction term in this dataset. Include one anyway, sometimes it's a good practice for predictive modeling feature engineering in general. 

In [1]:
##### Your Code Here #####
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
from math import sqrt

pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',500)

In [2]:
url = 'https://raw.githubusercontent.com/ryanleeallred/datasets/master/Ames%20Housing%20Data/train.csv'

data = pd.read_csv(url)

print(data.shape)
data.head()


(1460, 81)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [3]:
data.isna().sum()

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           37
BsmtCond           37
BsmtExposure       38
BsmtFinType1       37
BsmtFinSF1          0
BsmtFinType2       38
BsmtFinSF2          0
BsmtUnfSF           0
TotalBsmtSF         0
Heating             0
HeatingQC           0
CentralAir          0
Electrical          1
1stFlrSF            0
2ndFlrSF            0
LowQualFin

In [4]:
df = data.fillna(method='bfill')

In [5]:
df_nums = df.select_dtypes(include=['float','int']).dropna(axis=1)

y = df_nums['SalePrice']
X = df_nums.drop(columns=['Id','SalePrice'])

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.2,random_state=42)

model = LinearRegression()
model.fit(X_train,y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test,y_pred)
rmse = sqrt(mse)
r2 = r2_score(y_test,y_pred)

beta_0 = model.intercept_
beta_i = model.coef_

print('mean squared error: ', mse)
print('root mean squared error: ', rmse)
print('R squared: ', r2)
print('---------------------')
print('Intercept: ', beta_0)
print('Coefficients: ', beta_i)


mean squared error:  1359271613.9728708
root mean squared error:  36868.300936887106
R squared:  0.8227882912333992
---------------------
Intercept:  -36954.909413481044
Coefficients:  [-1.98197729e+02 -1.08601713e+02  4.46644897e-01  1.80939444e+04
  4.11936585e+03  2.78821229e+02  1.80951293e+02  2.42804498e+01
  9.08201306e+00 -2.86800053e+00 -5.82165259e-01  5.63184726e+00
  1.21756447e+01  1.30811338e+01  7.35929280e+00  3.26160712e+01
  1.11197466e+04 -1.20019250e+03  2.83558520e+03 -2.26517618e+03
 -8.74143945e+03 -8.89823618e+03  5.10768751e+03  4.69771598e+03
  1.08555357e+02  1.20290781e+04 -3.75146347e+00  2.23564788e+01
 -5.84084663e+00  8.45547792e+00  4.07573926e+01  6.39731955e+01
 -2.17273944e+01 -8.06214965e-01 -1.70515290e+02 -5.66404525e+02]


In [6]:
df['SaleCondition'].unique()

array(['Normal', 'Abnorml', 'Partial', 'AdjLand', 'Alloca', 'Family'],
      dtype=object)

In [7]:
mask = {
    'Abnorml': 1, 
    'Partial': 2,
    'Normal': 3, 
    'AdjLand': 5,
    'Alloca': 4,
    'Family': 6,
}

df['SaleCondition'] = df['SaleCondition'].replace(mask,inplace=True)

In [8]:
print((df['Heating'].unique()))
print((df['HeatingQC'].unique()))
print((df['Foundation'].unique()))

foundations = {
    'PConc': 1,
    'CBlock': 2, 
    'BrkTil': 3,
    'Wood': 4,
    'Slab': 5, 
    'Stone': 6,
}
heating = {
    'Ex': 1,
    'Gd': 2,
    'TA': 3, 
    'Fa': 4,
    'Po': 5,
    
}
type_heating = {
    'GasA': 1,
    'GasW': 2, 
    'Grav': 3, 
    'Wall': 4,
    'OthW': 5,
    'Floor': 6
}
# df['N_Foundation'] = df['Foundation'].replace(foundations)
# df['N_HeatingQC'] = df['HeatingQC'].replace(heating)
# df['N_Heating'] = df['Heating'].replace(type_heating)

['GasA' 'GasW' 'Grav' 'Wall' 'OthW' 'Floor']
['Ex' 'Gd' 'TA' 'Fa' 'Po']
['PConc' 'CBlock' 'BrkTil' 'Wood' 'Slab' 'Stone']


In [9]:
# all_the_ones = ['Ones_Foundation','Ones_HeatingQC','Ones_Heating']

In [10]:
df['ln_price'] = np.log(df['SalePrice'])
df['N_Baths'] = df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath'] + df['FullBath'] + 0.5 * df['HalfBath']
df['AboveGround'] = df['1stFlrSF'] + df['2ndFlrSF']
# df['Sale_Weight_Condition'] = df['ln_price'] * df['SaleCondition']
df['Quality_Overall'] = np.divide((df['OverallQual'] * df['OverallCond']), ((df['YrSold'] - df['YearRemodAdd'])+df['YrSold']-df['YearBuilt']) + df['YearBuilt'])
# df['Foundation_Weight'] = df['ln_price'] * df['Ones_Foundation']
# df['HeatingQC_Weight'] = df['Ones_HeatingQC'] * df['ln_price']
# df['HeatingType_Weight'] = df['Ones_Heating'] * df['ln_price']


In [18]:
## log
df_new = df.select_dtypes(include=['float','int']).dropna(axis=1)

y = df_new['ln_price']
X = df_new.drop(columns=['SalePrice','ln_price','Id',
                         'BsmtFullBath','BsmtHalfBath','FullBath',
                         'HalfBath','1stFlrSF','2ndFlrSF',
                         'OverallQual','OverallCond'])

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.2,random_state=42)

model = LinearRegression()
model.fit(X_train,y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test,y_pred)
rmse = sqrt(mse)
r2 = r2_score(y_test,y_pred)

beta_0 = model.intercept_
beta_i = model.coef_

print('mean squared error: ', mse)
print('root mean squared error: ', rmse)
print('R squared: ', r2)
print('---------------------')
print('Intercept: ', beta_0)
print('Coefficients: ', beta_i)

mean squared error:  0.023341880178788955
root mean squared error:  0.15278049672254948
R squared:  0.8749185996923229
---------------------
Intercept:  13.136370874334197
Coefficients:  [-6.66915347e-04 -4.11994306e-04  1.65405997e-06  3.67147084e-03
  1.09578949e-03 -1.91706881e-06  3.78389875e-05  1.25343798e-05
  1.55565854e-05  6.59285843e-05  1.12298071e-04  1.41980232e-04
 -1.06080409e-02 -6.52995049e-02  2.00083982e-02  5.31469501e-02
 -4.66128133e-04  7.76316918e-02  4.51233703e-05  1.19943054e-04
 -5.50414015e-05  2.51058560e-04  2.70637376e-04  3.23818066e-04
 -4.00838542e-04 -6.46980186e-06  7.69076923e-04 -5.29908980e-03
  5.53380550e-02  2.96730826e-05  2.16609653e+01]


In [16]:
pf5 = PolynomialFeatures(degree=5, interaction_only=False,include_bias=False)
pf6 = PolynomialFeatures(degree=6, interaction_only=False,include_bias=False)
pf3 = PolynomialFeatures(degree=3)
pf2 = PolynomialFeatures(degree=2)

#okay this is from some tutorial ima try it
# test = df[['N_HeatingQC','N_Heating']]
# test2 = pf2.fit_transform(test)
#didn't work lmao

In [17]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,ln_price,N_Baths,AboveGround,Quality_Overall
0,1,60,RL,65.0,8450,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,TA,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,Ex,MnPrv,Shed,0,2,2008,WD,,208500,12.247694,3.5,1710,0.017387
1,2,20,RL,80.0,9600,Pave,Grvl,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,Ex,MnPrv,Shed,0,5,2007,WD,,181500,12.109011,2.5,1262,0.023553
2,3,60,RL,68.0,11250,Pave,Grvl,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,Ex,MnPrv,Shed,0,9,2008,WD,,223500,12.317167,3.5,1786,0.017378
3,4,70,RL,60.0,9550,Pave,Grvl,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,Ex,MnPrv,Shed,0,2,2006,WD,,140000,11.849398,2.0,1717,0.01714
4,5,60,RL,84.0,14260,Pave,Grvl,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,Ex,MnPrv,Shed,0,12,2008,WD,,250000,12.429216,3.5,2198,0.019841


In [14]:
print(df['Condition1'].unique())
print(df['Condition2'].unique())
print(df['Neighborhood'].unique())
print(df['LandSlope'].unique())
print(df['BldgType'].unique())

bldg = {
    '1Fam':1,
    '2fmCon':2,
    'Duplex':3,
    'TwnhsE':4,
    'Twnhs':5
}
land = {
    'Gtl':1,
    'Mod':2,
    'Sev':3
}
# df['N_LandSl'] = df['LandSlope'].replace(land)
# df['N_Bldg'] = df['BldgType'].replace(bldg)

['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']
['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']
['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']
['Gtl' 'Mod' 'Sev']
['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']


In [19]:
le = LabelEncoder()
ohe_0 = OneHotEncoder(categorical_features=[0])

n_cols = ['HeatingQC','Heating', 'LandSlope', 'BldgType', 'Foundation']

lmao = df[n_cols].values
lmao

array([['Ex', 'GasA', 'Gtl', '1Fam', 'PConc'],
       ['Ex', 'GasA', 'Gtl', '1Fam', 'CBlock'],
       ['Ex', 'GasA', 'Gtl', '1Fam', 'PConc'],
       ...,
       ['Ex', 'GasA', 'Gtl', '1Fam', 'Stone'],
       ['Gd', 'GasA', 'Gtl', '1Fam', 'CBlock'],
       ['Gd', 'GasA', 'Gtl', '1Fam', 'CBlock']], dtype=object)

In [20]:
for i in range(5):
    lmao[:,i] = le.fit_transform(lmao[:,i])

type(lmao)

df_le = pd.DataFrame(data=lmao, columns={'LE_HeatingQC','LE_Heating','LE_LandSlope','LE_Bldg','LE_Foundation'})

final_df = pd.DataFrame.join(df,df_le,how='right')
final_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,ln_price,N_Baths,AboveGround,Quality_Overall,LE_HeatingQC,LE_LandSlope,LE_Foundation,LE_Bldg,LE_Heating
0,1,60,RL,65.0,8450,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,TA,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,Ex,MnPrv,Shed,0,2,2008,WD,,208500,12.247694,3.5,1710,0.017387,0,1,0,0,2
1,2,20,RL,80.0,9600,Pave,Grvl,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,Ex,MnPrv,Shed,0,5,2007,WD,,181500,12.109011,2.5,1262,0.023553,0,1,0,0,1
2,3,60,RL,68.0,11250,Pave,Grvl,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,Ex,MnPrv,Shed,0,9,2008,WD,,223500,12.317167,3.5,1786,0.017378,0,1,0,0,2
3,4,70,RL,60.0,9550,Pave,Grvl,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,Ex,MnPrv,Shed,0,2,2006,WD,,140000,11.849398,2.0,1717,0.01714,2,1,0,0,0
4,5,60,RL,84.0,14260,Pave,Grvl,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,Ex,MnPrv,Shed,0,12,2008,WD,,250000,12.429216,3.5,2198,0.019841,0,1,0,0,2


In [21]:
cols = ('MSSubClass','MSZoning','Street','Alley','LotShape','LandContour','Utilities','LotConfig',
        'LandSlope','Neighborhood', 'Condition1','Condition2','BldgType','HouseStyle','RoofStyle',
        'RoofMatl','Exterior1st','Exterior2nd','MasVnrType','ExterQual','ExterCond','Foundation',
        'BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC',
        'CentralAir','Electrical','KitchenQual','Functional','FireplaceQu','GarageType','GarageFinish',
        'GarageQual','GarageCond','PavedDrive','PoolQC','Fence','MiscFeature','SaleType','SaleCondition') 

In [22]:
df['HouseStyle'].unique()

style = {
    '2Story': 2, 
    '1Story': 1,
    '1.5Fin': 2,  
    '1.5Unf': 1, 
    'SFoyer': 2, 
    'SLvl': 1, 
    '2.5Unf': 2,
    '2.5Fin': 3
}
df['N_HouseStyle'] = df['HouseStyle'].replace(style)

In [32]:
df['StyleScore_weight'] = np.multiply(df['N_HouseStyle'],df['ln_price'])

In [33]:
for i in cols:
    
    LE = LabelEncoder() 
    LE.fit(list(df[i].values)) 
    df[i] = LE.transform(list(df[i].values))


In [46]:
## going to readd some features
df_hope = df.select_dtypes(include=['float','int']).dropna(axis=1)
df_hope.shape

(1460, 87)

In [53]:
#creating some more features?
df_hope['SalePrice'].shape

(1460,)

In [54]:
y = df_hope['SalePrice']
X = df_hope.drop(columns=['SalePrice','ln_price','Id'])

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.2,random_state=42)

model = LinearRegression()
model.fit(X_train,y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test,y_pred)
rmse = sqrt(mse)
r2 = r2_score(y_test,y_pred)

beta_0 = model.intercept_
beta_i = model.coef_

print('mean squared error: ', mse)
print('root mean squared error: ', rmse)
print('R squared: ', r2)
print('---------------------')
print('Intercept: ', beta_0)
print('Coefficients: ', beta_i)

mean squared error:  724790858.3979365
root mean squared error:  26921.940093498768
R squared:  0.9055071663420514
---------------------
Intercept:  -90187.89849698622
Coefficients:  [-5.21724237e+02  4.56690949e+03 -1.28849619e+02  9.95543127e-02
  1.81906344e+04 -4.72651592e+02 -2.95123666e+02  2.08954038e+03
 -1.58515337e+04  1.60902747e+02  6.06683589e+03  3.41787294e+02
 -7.84519189e+02 -6.56430905e+03 -2.36009226e+02 -1.67608097e+03
  1.19950809e+04  5.39594510e+03 -8.74034329e+01  4.95892120e+01
  2.15374039e+03  3.08698923e+03 -8.00397573e+02 -5.46588366e+01
  4.46638444e+03  2.14847689e+01 -7.06825556e+03 -7.49004606e+02
 -5.54140885e+02 -6.31114070e+03  1.65700873e+03 -2.43878627e+03
  5.04449282e+02  4.51775019e+00  6.91587040e+01  1.32100497e+00
 -1.34916948e+00  4.49100226e+00  1.62202121e+03  9.18502261e+02
 -1.12695715e+04  9.57782441e+02  1.70466915e+01 -7.89263966e+00
 -1.96348623e+01 -1.04849455e+01  9.80090990e+02 -5.87869756e+02
  1.05762678e+03 -2.09927938e+03 -3.0

In [56]:
X_train.shape

(1168, 84)

In [51]:
SalePrice = model.predict(X_test)
Id = df.Id
SalePrice.shape

(292,)

In [44]:
predict_Sale = pd.Series(SalePrice,name='SalePrice')
predict_Sale.shape

(292,)

# Stretch Goals

- Write a blog post explaining one of today's topics.
- Find a new regression dataset from the UCI machine learning repository and use it to test out your new modeling skillz.
 [ - UCI Machine Learning Repository - Regression Datasets](https://)
- Make a list for yourself of common feature engineering techniques. Browse Kaggle kernels to learn more methods.
- Start studying for tomorrow's topic: Gradient Descent
- Try and make the ultimate model with this dataset. clean as many features as possible, engineer the most sensible features as possible and see how accurate of a prediction you can make. 
- Learn about the "Dummy Variable Trap" and how it applies to linear regression modeling.
- Learning about using linear regression to model time series data