<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 3

### Regression and Classification with the Ames Housing Data

---

You have just joined a new "full stack" real estate company in Ames, Iowa. The strategy of the firm is two-fold:
- Own the entire process from the purchase of the land all the way to sale of the house, and anything in between.
- Use statistical analysis to optimize investment and maximize return.

The company is still small, and though investment is substantial, its short-term goals are oriented more toward purchasing existing houses and flipping them than toward constructing entirely new houses. That being said, the company has access to a large construction workforce operating at rock-bottom prices.

This project uses the [Ames housing data recently made available on Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

# Ames, Iowa:
- Pop: ~60k
- It's a college town: Iowa State University (ISU)
- Size: ~25 sq mi

# Ideas:
- Open concept = 1st floor sqft / first floor # rooms
- School district per neighborhood
- Ratio of Bedrooms to bathrooms
- Yard size
- Number of rooms * Quality

<img src='http://cdn.ames.k12.ia.us/wp-content/uploads/2015/02/boundary20142-gray.jpg' style='width:300px'>


In [None]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [None]:
# Load the data
h = pd.read_csv('./housing.csv')

In [None]:
orig = pd.read_csv('./housing.csv')

In [None]:
h.head(1)

# Quick exploration of the data

# Neighborhood: Physical locations within Ames city limits

       Blmngtn	Bloomington Heights
       Blueste	Bluestem
       BrDale	Briardale
       BrkSide	Brookside
       ClearCr	Clear Creek
       CollgCr	College Creek
       Crawfor	Crawford
       Edwards	Edwards
       Gilbert	Gilbert
       IDOTRR	Iowa DOT and Rail Road
       MeadowV	Meadow Village
       Mitchel	Mitchell
       NAmes	North Ames
       NoRidge	Northridge
       NPkVill	Northpark Villa
       NridgHt	Northridge Heights
       NWAmes	Northwest Ames
       OldTown	Old Town
       SWISU	South & West of Iowa State University
       Sawyer	Sawyer
       SawyerW	Sawyer West
       Somerst	Somerset
       StoneBr	Stone Brook
       Timber	Timberland
       Veenker	Veenker

# Since real estate is usually all about location, I bet the prices of the neighborhoods will differ.
It is very interesting that the top three neighborhoods have average sale prices ~120k above the Ames average sale price. I bet some of these will be very good predictors.

In [None]:
from sklearn import linear_model

In [None]:
a = [.0005,.0008,.001,.002,.0025,.003,.0035,.004,.005,.01,.05,.1]
score = []
for i in a:
    reg = linear_model.Lasso(alpha = i)
    reg.fit(X_train,y_train)
    score.append(reg.score(X_test,y_test))
list(zip(a,score))  # list() is needed on Python 3, where zip returns a lazy iterator

# This looks promising, but again we should look at the cross-val scores to get a more realistic estimate. Let's make a function that runs a model through K-fold cross-validation and returns the RMSE

In [None]:
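def cv_rmse(estimator):
    """Per-fold RMSE of `estimator` under K-fold CV (uses notebook-level X, y, kfold)."""
    # scikit-learn reports negated MSE so that higher is always better; flip the sign back
    neg_mse = cross_val_score(estimator,X,y,scoring='neg_mean_squared_error',cv=kfold)
    return np.sqrt(-neg_mse)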

In [None]:
cv_rmse(lr).mean()

# Lasso cross validation with cv=10 and alpha=[that list]
#### Lasso leads to sparsity and a simpler model, which is very good in this case because we want to be able to explain SalePrice in terms of the coefficients
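
To see where the sparsity comes from, here is a minimal sketch under a simplifying assumption that is not part of this notebook: if the standardized features were exactly uncorrelated, the Lasso solution would be the least-squares coefficients passed through a soft-threshold at alpha. The helper name `soft_threshold` and the coefficient values are made up for illustration.

```python
import numpy as np

def soft_threshold(coefs, alpha):
    """Shrink every coefficient toward zero by alpha; anything smaller than alpha becomes 0."""
    return np.sign(coefs) * np.maximum(np.abs(coefs) - alpha, 0.0)

# hypothetical least-squares coefficients on standardized, uncorrelated features
ols_coefs = np.array([0.120, -0.003, 0.002, -0.045, 0.0005])

for alpha in [0.0005, 0.004, 0.05]:
    shrunk = soft_threshold(ols_coefs, alpha)
    print(alpha, shrunk, "non-zero:", np.count_nonzero(shrunk))
```

Larger alphas zero out more coefficients, which is the trade-off the alpha sweep is exploring. With correlated features, as in the real data, the solution is no longer this simple, but the qualitative behaviour is the same.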

In [None]:
a = [.0005,.0008,.001,.002,.003,.004,.005,.01,.05]
Lasso_rmse = []
Lasso_std = []
for i in a:
    # run the cross-validation once per alpha and reuse the fold scores
    fold_rmse = cv_rmse(linear_model.Lasso(alpha=i))
    Lasso_rmse.append(fold_rmse.mean())
    Lasso_std.append(fold_rmse.std())
    

In [None]:
from pprint import pprint
# materialise the zip first; on Python 3 it is a lazy iterator
ls = list(zip(a,Lasso_rmse,Lasso_std))
ls = pd.DataFrame(ls,columns=['Alpha','RMSE','Std'])

In [None]:
ls.plot('Alpha','RMSE',title='Lasso k=10 Cross-Validated RMSE')

In [None]:
# import pylab
# from scipy.stats import norm

# for i in range(len(ls)):
#     xp = np.linspace(.05,.3,100)
#     yp = norm.pdf(xp, loc=ls[i][1], scale=ls[i][2])    # for example
#     pylab.plot(xp,yp,label=ls[i][0])
#     pylab.legend()
#     #ax.legend(loc='upper center', shadow=True)
# pylab.show()

# Train Lasso model on best hyperparameter alpha

In [None]:
lasso = linear_model.Lasso(alpha=.004)
lasso.fit(X,y)
Lasso_coef = pd.DataFrame(lasso.coef_,index=Xn.columns)

In [None]:
lasso.score(X_hold,y_hold)

In [None]:
y_pred = lasso.predict(X_hold)

In [None]:
# sanity-check that features, targets and predictions line up
print(X_hold.shape)
print(y_hold.shape)
print(y_pred.shape)

In [None]:
#X_hold.reset_index()
lasso_pred = pd.concat([X_hold.reset_index(),y_hold.reset_index(),pd.DataFrame(y_pred,columns=['Pred_SalePrice'])],axis=1)

In [None]:
lasso_pred.plot.scatter('SalePrice','Pred_SalePrice')

In [None]:
lasso_pred.plot.scatter('GrLivArea','Pred_SalePrice')
lasso_pred.plot.scatter('GrLivArea','SalePrice')

In [None]:
lasso_pred.plot.scatter('QualOnSqft','Pred_SalePrice')
lasso_pred.plot.scatter('QualOnSqft','SalePrice')

# Lasso strength of top 20 coef (out of 53 Lasso selected)

In [None]:
list(zip(X.columns,lr.coef_))[:5]  # a zip object cannot be sliced on Python 3

In [None]:
np.e**lr.intercept_
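
Assuming the model was fit on the log-transformed SalePrice, the intercept and every prediction are in log-dollars, and exponentiating converts them back. With roughly centered features the intercept is close to the mean log price, so its exponential is a geometric mean rather than the arithmetic mean. A small sketch with made-up prices (none of these numbers come from the Ames data):

```python
import numpy as np

# made-up sale prices in dollars, for illustration only
prices = np.array([120000.0, 180000.0, 250000.0, 410000.0])
log_prices = np.log(prices)

# exponentiating the mean of the logs gives the geometric mean, not the arithmetic mean
geometric_mean = np.exp(log_prices.mean())
arithmetic_mean = prices.mean()
print(geometric_mean, arithmetic_mean)

# a prediction made on the log scale converts back to dollars the same way
log_prediction = 12.1  # hypothetical model output
print(np.exp(log_prediction))
```

The geometric mean is always at or below the arithmetic mean, so the back-transformed intercept should be expected to sit somewhat under the plain average sale price.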

# Let's get a more reliable estimate by running lr through cross-validation
#### Okay, our R2 value got hammered. But it typically falls when going from a single train/test split to cross-validation

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
# recent scikit-learn versions reject random_state unless shuffle=True
kfold = KFold(n_splits=10, shuffle=True, random_state=4)
score = cross_val_score(lr,X,y,scoring='neg_mean_squared_error',cv=kfold)
positive_score = -score
np.sqrt(positive_score).mean()

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
# recent scikit-learn versions reject random_state unless shuffle=True
kfold = KFold(n_splits=10, shuffle=True, random_state=4)
score = cross_val_score(lr,X,y,scoring='r2',cv=kfold)
np.mean(score)

# Look at all those coefficients! This needs some regularization.
#### AKA: this needs to be simplified. Regularization injects bias into the model and limits the size of the coefficients based on an alpha value


# Take a look at the correlations of the new features I added.
- They are built off one another, so they are highly correlated with each other.
- They may be less useful than they look, but for now they show strong correlations with SalePrice
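
A quick way to see why features built from the same ingredients end up correlated is to rebuild two of them on synthetic data. The sketch below uses random stand-ins for `GrLivArea` and `OverallQual` (not the real Ames columns) and takes `GrLivArea` as a proxy for the quality square footage:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic stand-ins for two base columns (not the real Ames data)
toy = pd.DataFrame({
    'GrLivArea': rng.uniform(800, 3000, size=200),
    'OverallQual': rng.integers(1, 11, size=200),
})
# two engineered features that share the same ingredients
toy['QualOnSqft'] = toy['GrLivArea'] * toy['OverallQual'] ** 2
toy['Qualsqft3'] = np.sqrt(toy['GrLivArea']) * toy['OverallQual'] ** 2

derived_corr = toy[['QualOnSqft', 'Qualsqft3']].corr().iloc[0, 1]
print(round(derived_corr, 3))
```

Even though the base columns are independent here, the two derived columns track each other closely, so a strong correlation with SalePrice for several of them does not mean each one adds new information.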

In [None]:
#saleprice correlation matrix
k = 15 #number of variables for heatmap
corr = h.corr()
cols = corr.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(h[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, 
                 fmt='.2f', annot_kws={'size': 6}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

# Make sure added cols have no NaNs

In [None]:
#h.fillna(0,inplace=True)
h.isnull().sum().sort_values(ascending=False).head()

# Start Evaluation of data
- h_log_dummy is the fully log transformed and categorical dummy dataframe
- h_no_transforms is the original data cleaned
- h_no_dummy is log transformed, but doesn't have categorical dummy variables
- h_dummy is not transformed but has dummy variables

In [None]:
#h = h_dummy.copy()
h = h_log_dummy.copy()

# Create fixed and non-fixed feature dataframes to evaluate separately

In [None]:
h_fixed = h_log_dummy.copy()

In [None]:
#fixed features only
fixed_col = []
for col in h_fixed.columns:
    for fix in feature_dict['fixed']:
        if fix in col:
            #del h_fixed[col]
            fixed_col.append(col)
            break  # stop at the first match so a column is never added twice
h_fixed = h_fixed[fixed_col]

In [None]:
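#non-fixed features only
h_nonfixed = h_log_dummy.copy()  # was never initialised before the loop below
nonfixed_col = []
for col in h_nonfixed.columns:
    for nonfix in feature_dict['non-fixed']:
        if nonfix in col:
            nonfixed_col.append(col)
            break  # e.g. a HeatingQC dummy matches both 'Heating' and 'HeatingQC'
h_nonfixed = h_nonfixed[nonfixed_col]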
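
Substring matching is convenient but loose: a feature name such as 'Heating' also matches every 'HeatingQC' dummy. One stricter option, sketched below with a hypothetical helper `select_columns` and made-up column names, is to accept only exact names or names that start with the feature followed by an underscore.

```python
def select_columns(columns, features):
    """Keep columns that equal a feature name or are a dummy column derived from it."""
    selected = []
    for col in columns:
        if any(col == feat or col.startswith(feat + '_') for feat in features):
            selected.append(col)
    return selected

# hypothetical column names in the style produced by the dummy-encoding step
columns = ['Heating_GasA', 'HeatingQC_Gd', 'GrLivArea', 'Neighborhood_NAmes', 'OverallQualSq']
features = ['Heating', 'HeatingQC', 'GrLivArea', 'OverallQual']

print(select_columns(columns, features))
```

The trade-off is that engineered columns such as `OverallQualSq` are no longer picked up automatically and would have to be listed by name.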

In [None]:
h = h_nonfixed.copy()
#h = h_fixed.copy()

# Remove the target variable

In [None]:
y = h['SalePrice']
X = h.drop("SalePrice", axis=1)
Xn = h.drop("SalePrice", axis=1)

In [None]:
#X = h
print(X.shape)
print(y.shape)

In [None]:
#y

In [None]:
#h = pd.concat([h,y],axis=1)
#h.describe().T.head()

# Scale it

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X = ss.fit_transform(X)
X = pd.DataFrame(X,columns=Xn.columns)
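
The cell above fits the scaler on every row before any holdout is split off, so the holdout rows influence the means and standard deviations. A common alternative, sketched here with plain NumPy on random data rather than the notebook's `StandardScaler`, is to learn the statistics from the training rows only and reuse them on the holdout. The split position and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
X_all = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic feature matrix

# hypothetical 80/20 split by position
X_tr, X_ho = X_all[:80], X_all[80:]

# learn the scaling statistics from the training rows only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)

X_tr_scaled = (X_tr - mu) / sigma
X_ho_scaled = (X_ho - mu) / sigma  # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The holdout columns will not have exactly zero mean and unit variance after this, which is expected: they are being measured on the training set's scale.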

# Create Holdouts

# BldgType: Type of dwelling
		
       1Fam	Single-family Detached	
       2FmCon	Two-family Conversion; originally built as one-family dwelling
       Duplx	Duplex
       TwnhsE	Townhouse End Unit
       TwnhsI	Townhouse Inside Unit

# Some types of houses are more desirable than others as well. Detached single-family homes and end-unit townhouses have the highest mean price

In [None]:
h.groupby('BldgType')['SalePrice'].mean()

# To get a feel for what types of houses are in Ames, I took a quick look around in Google Maps Street View.

# There are nice Victorians and a lot of ranches
<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Carey_house_denison_iowa.jpg/800px-Carey_house_denison_iowa.jpg' style='width:300px'><img src='http://images.housesforsalelists.com/Images/Houses/ia/ames/3318-southdale-dr-ames-iowa-50010.jpg' style='width:300px'>
# And there are some very nice townhouses
<img src='https://thumbs.frmonline.com/imgs/fr/propertyFiles/865/591/1000/resized/08_139698905401514141024076890008020.jpg' style='width:300px'>

# Take a look at the mean price by type and neighborhood!
- Interesting to see that in the nicest neighborhoods single-family homes are the most valuable, while townhouses fare better in average neighborhoods, especially Crawford and Bloomington Heights.
- For long-term real estate plays, it might be a good idea to pick up single-family houses in a neighborhood on the cusp of becoming a 'nice' neighborhood, because they have a higher price ceiling.

In [None]:
h.groupby(['Neighborhood','BldgType'])['SalePrice'].mean().unstack().sort_values(by='1Fam',ascending=False)

# Condition2: Proximity to various conditions (if more than one is present)
		
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to positive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad

# Define fixed/non-fixed features

### 'Fixed': 
'MSSubClassCat'
,'MSZoning'
,'LotFrontage'
,'LotArea'
,'Street'
,'Alley'
,'LotShape'
,'LandContour'
,'Utilities'
,'LotConfig'
,'LandSlope'
,'Neighborhood'
,'Condition1'
,'Condition2'
,'BldgType'
,'YearBuilt'
,'YearRemodAdd'
,'RoofStyle'
,'Foundation'
,'BsmtQual'
,'BsmtExposure'
,'GarageYrBlt'
,'MoSold'
,'YrSold'
,'SaleType'
,'SaleCondition'
,'SalePrice'

### 'Non-Fixed': 
'HouseStyle'
,'OverallQual'
,'OverallCond'
,'RoofMatl'
,'Exterior1st'
,'Exterior2nd'
,'MasVnrType'
,'MasVnrArea'
,'ExterQual'
,'ExterCond'
,'BsmtCond'
,'BsmtFinType1'
,'BsmtFinSF1'
,'BsmtFinType2'
,'BsmtFinSF2'
,'BsmtUnfSF'
,'TotalBsmtSF'
,'Heating'
,'HeatingQC'
,'CentralAir'
,'Electrical'
,'1stFlrSF'
,'2ndFlrSF'
,'LowQualFinSF'
,'GrLivArea'
,'BsmtFullBath'
,'BsmtHalfBath'
,'FullBath'
,'HalfBath'
,'BedroomAbvGr'
,'KitchenAbvGr'
,'KitchenQual'
,'TotRmsAbvGrd'
,'Functional'
,'Fireplaces'
,'FireplaceQu'
,'GarageType'
,'GarageFinish'
,'GarageCars'
,'GarageArea'
,'GarageQual'
,'GarageCond'
,'PavedDrive'
,'WoodDeckSF'
,'OpenPorchSF'
,'EnclosedPorch'
,'3SsnPorch'
,'ScreenPorch'
,'PoolArea'
,'PoolQC'
,'Fence'
,'MiscFeature'
,'MiscVal'

In [None]:
#create dict from string above, manually selected
feature_dict = {
    'fixed': ['MSSubClassCat'
,'MSZoning'
,'LotFrontage'
,'LotArea'
,'Street'
,'Alley'
,'LotShape'
,'LandContour'
,'Utilities'
,'LotConfig'
,'LandSlope'
,'Neighborhood'
,'Condition1'
,'Condition2'
,'BldgType'
,'YearBuilt'
,'YearRemodAdd'
,'RoofStyle'
,'Foundation'
,'BsmtQual'
,'BsmtExposure'
,'GarageYrBlt'
,'MoSold'
,'YrSold'
,'SaleType'
,'SaleCondition'
,'SalePrice'],
    'non-fixed': ['HouseStyle'
,'OverallQual'
,'OverallCond'
,'RoofMatl'
,'Exterior1st'
,'Exterior2nd'
,'MasVnrType'
,'MasVnrArea'
,'ExterQual'
,'ExterCond'
,'BsmtCond'
,'BsmtFinType1'
,'BsmtFinSF1'
,'BsmtFinType2'
,'BsmtFinSF2'
,'BsmtUnfSF'
,'TotalBsmtSF'
,'Heating'
,'HeatingQC'
,'CentralAir'
,'Electrical'
,'1stFlrSF'
,'2ndFlrSF'
,'LowQualFinSF'
,'GrLivArea'
,'BsmtFullBath'
,'BsmtHalfBath'
,'FullBath'
,'HalfBath'
,'BedroomAbvGr'
,'KitchenAbvGr'
,'KitchenQual'
,'TotRmsAbvGrd'
,'Functional'
,'Fireplaces'
,'FireplaceQu'
,'GarageType'
,'GarageFinish'
,'GarageCars'
,'GarageArea'
,'GarageQual'
,'GarageCond'
,'PavedDrive'
,'WoodDeckSF'
,'OpenPorchSF'
,'EnclosedPorch'
,'3SsnPorch'
,'ScreenPorch'
,'PoolArea'
,'PoolQC'
,'Fence'
,'MiscFeature'
,'MiscVal'
,'SalePrice']
}

#pprint(feature_dict)

In [None]:
#confirm all features are in dict
j = sum(len(cols) for cols in feature_dict.values())  # total number of listed features
#print(j)

# Elementary Schools and ratings:
- Expect the northwest part of town to be the highest priced based on school districts


http://www.greatschools.org/iowa/ames/schools/
http://www.ames.k12.ia.us/boundaries/

In [None]:
elem_rating = {
    'Edwards': 8,
    'Sawyer': 9,
    'Fellows': 9,
    'Meeker': 7,
    'Mitchell': 7
}

# Remove commercial properties from the dataset

In [None]:
h.groupby(['MSZoning']).size()

In [None]:
h = h[h['MSZoning'] != 'C (all)']

# Set Id as index

In [None]:
h.set_index('Id',inplace=True)

# Take a look at hist and top pair plots

In [None]:
#h.hist(bins=50,figsize=(20,20))

# Remove cols with less than 90% of their values present

In [None]:
#flag columns with more than 10% null values; these are the ones dropped below
print((1 - (h.isnull().sum().sort_values(ascending=False) / len(h)) < .9).head(10))
h.drop(['PoolQC','MiscFeature','Alley','Fence','FireplaceQu','LotFrontage'],axis=1,inplace=True)
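
The drop list above is typed by hand from the printed flags. As a sketch of how the same list could be derived from the threshold, the example below uses a small made-up frame (the column names are borrowed from the data, the values are not) and the 10% cutoff from the heading:

```python
import numpy as np
import pandas as pd

# toy frame standing in for the housing data
toy = pd.DataFrame({
    'PoolQC': [np.nan] * 9 + ['Gd'],
    'LotFrontage': [65.0, 80.0, np.nan, 60.0, 84.0, 85.0, np.nan, 51.0, 50.0, 70.0],
    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 6120, 7420],
})

null_fraction = toy.isnull().mean()  # share of missing values per column
sparse_cols = null_fraction[null_fraction > 0.10].index.tolist()
cleaned = toy.drop(columns=sparse_cols)

print(sparse_cols)
print(list(cleaned.columns))
```

Deriving the list this way keeps the cutoff and the dropped columns in sync if the data or the threshold changes.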

# Convert MSSubClass to cat

In [None]:
def MSconvert(num):
    MS_dict = {
        20: '1STORY1946',
        30: '1STORY1945',
        40: '1STORYATTIC',
        45: '1.5STORYUNFIN',
        50: '1.5STORYFIN',
        60: '2STORY1946N',
        70: '2STORY1945O',
        75: '2.5STORY',
        80: 'SPLITLEVEL',
        85: 'SPLITFOYER',
        90: 'DUPLEX',
       120: '1STORYPUD',
       150: '1.5STORYPUD',
       160: '2STORYPUD',
       180: 'MULTILEVELPUD',
       190: '2FAMILY'}
    return MS_dict[num]

MSconvert(orig['MSSubClass'][0])

In [None]:
h['MSSubClassCat'] = h['MSSubClass'].apply(MSconvert)

In [None]:
h.drop('MSSubClass',axis=1,inplace=True)

# Take copy

In [None]:
hcopy = h.copy()

# Function to fillna with median (numerics only)

In [None]:
#np.median(h['GarageYrBlt'])
#h.info()

In [None]:
for col in h.columns:
    if h[col].dtype != 'object':
        # assign the result back; inplace=True on a column selection may not reach h in newer pandas
        h[col] = h[col].fillna(h[col].median())

In [None]:
for col in h.columns:
    if h[col].dtype == 'object':
        h[col] = h[col].fillna('None')
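
The two loops above could be wrapped in a single helper. This is only a sketch: `fill_missing` is a hypothetical name, and the toy frame is made up to show the behaviour on one numeric and one text column.

```python
import numpy as np
import pandas as pd

def fill_missing(df):
    """Return a copy with numeric NaNs set to the column median and text NaNs set to 'None'."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include='number').columns
    other_cols = out.columns.difference(numeric_cols)
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    out[other_cols] = out[other_cols].fillna('None')
    return out

toy = pd.DataFrame({
    'GarageYrBlt': [2003.0, np.nan, 1976.0],
    'GarageType': ['Attchd', None, 'Detchd'],
})
filled = fill_missing(toy)
print(filled)
```

Returning a copy leaves the input untouched, which makes it easier to compare the frame before and after imputation.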

# Take a look at NaNs

In [None]:
(h.groupby('Neighborhood')['SalePrice'].mean() / h.groupby('Neighborhood')['GrLivArea'].mean()).sort_values(ascending=False)


In [None]:
#h.groupby('Neighborhood').size().sort_values(ascending=False)
#pd.options.display.float_format = '{:,.2f}'.format
print('Mean saleprice for Ames dataset: ' + str(h['SalePrice'].mean()))
# mean sale price per neighborhood, highest first
h.groupby('Neighborhood')['SalePrice'].mean().sort_values(ascending=False)

# Condition1: Proximity to various conditions
	
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to positive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad

# The condition feature is also an interesting one. We expect PosA and PosN to have a positive impact on price, but wouldn't expect RRNe to be positive
Taking a closer look, it turns out that being close to the East-West railroad also benefits the price! You can get to Chicago in ~4.5 hrs via Amtrak.

In [None]:
h.isnull().sum().sort_values(ascending=False).head()

# Correlation matrix

In [None]:
#saleprice correlation matrix
k = 15 #number of variables for heatmap
corr = h.corr(numeric_only=True)  # h still has text columns here; pandas >= 2.0 needs numeric_only=True
cols = corr.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(h[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, 
                 fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

# Pairplot top corrs
- OverallQual is the best indicator, which is interesting because it is definitely a non-fixed, somewhat subjective variable. It ties right into the obsession that real estate brokers have with 'staging' a house with higher-quality art and such.
- GrLivArea makes sense as well, since we generally think of larger houses as higher priced.
- I'm somewhat surprised by the next two garage variables. Though garages are important in suburbia, I didn't expect them to play this large a role.

In [None]:
h.groupby('Condition1')[['SalePrice']].mean()
#h.groupby('Condition1')['SalePrice'].size()
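
A group mean is only as trustworthy as the number of sales behind it, so it can help to look at the mean and the count side by side. The rows below are made up purely to show the shape of the check:

```python
import pandas as pd

# made-up rows, only to show the shape of the check
toy = pd.DataFrame({
    'Condition1': ['Norm', 'Norm', 'Norm', 'Feedr', 'Feedr', 'RRNe'],
    'SalePrice': [180000, 210000, 195000, 140000, 150000, 190000],
})

summary = toy.groupby('Condition1')['SalePrice'].agg(['mean', 'count'])
print(summary.sort_values('mean', ascending=False))
```

A category that looks expensive but rests on only a handful of sales is weak evidence of a real effect.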

In [None]:
#Set up an ordered list of neighborhoods
Ordered_Nbs = list(h.groupby('Neighborhood')['SalePrice'].mean().sort_values(ascending=False).index)

In [None]:
print('mean price for 5 best neighborhoods: ' + str(h[h['Neighborhood'].isin(Ordered_Nbs[:5])]['SalePrice'].mean()))
print('mean lotarea for 5 best neighborhoods: ' + str(h[h['Neighborhood'].isin(Ordered_Nbs[:5])]['LotArea'].mean()))
print('\n')
print('mean price for 5 worst neighborhoods: ' + str(h[h['Neighborhood'].isin(Ordered_Nbs[-5:])]['SalePrice'].mean()))
print('mean lotarea for 5 worst neighborhoods: ' + str(h[h['Neighborhood'].isin(Ordered_Nbs[-5:])]['LotArea'].mean()))

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_hold, y_train, y_hold = train_test_split(X,y,test_size=0.2)

# Create new test/train set off the non-holdout terms

In [None]:
X = X_train.copy()
y = y_train.copy()
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
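
Neither split above fixes a seed, so the holdout and test rows change on every run and scores are not directly comparable between runs. A minimal sketch of a repeatable split on synthetic arrays follows; the seed value 4 is arbitrary and the `_demo` names are only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)   # synthetic features
y_demo = np.arange(20)                  # synthetic target

# any fixed integer makes the split repeatable
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

X_tr_a, X_ho_a, y_tr_a, y_ho_a = split_a
print(X_tr_a.shape, X_ho_a.shape)
print(np.array_equal(split_a[3], split_b[3]))
```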

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

In [None]:
#h = h_no_dummy.copy()

In [None]:
#h = h_no_log.copy()

# Add dummy cols

In [None]:
# #old version
# for col in h.columns:
#     if h[col].dtype == 'object':
#         h = pd.concat([h, pd.get_dummies(h[col])], axis=1)
#         del h[col]

In [None]:
#drop base case per dummies col
for col in h.columns:
    if h[col].dtype == 'object':
        dummies = pd.DataFrame()
        dummies = pd.get_dummies(h[col]).rename(columns=lambda x: str(col) + '_' + str(x))
        drop_col = str(col) + '_' + str(h[col].unique()[-1])
        h = pd.concat([h, dummies], axis=1)
        #h = pd.concat([h, pd.get_dummies(h[col])], axis=1)
        del h[col]
        del h[drop_col]
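
pandas can also drop a base level itself through the `drop_first` argument of `get_dummies`. The sketch below runs on a made-up frame; note that it drops the first level in sorted order, whereas the loop above drops the last value returned by `unique()`, so the base case will generally differ.

```python
import pandas as pd

toy = pd.DataFrame({
    'BldgType': ['1Fam', 'TwnhsE', '1Fam', 'Duplex'],
    'LotArea': [8450, 3000, 9600, 7200],
})

# drop_first=True removes the first (sorted) level of each encoded column
encoded = pd.get_dummies(toy, columns=['BldgType'], drop_first=True)
print(list(encoded.columns))
```

Which level is dropped does not change the fit of an unregularized linear model, but it does change how the remaining dummy coefficients are read, since each is relative to the dropped level.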

In [None]:
#categorize Mosold and any other categorical numerics into dummies too
col = 'MoSold'
dummies = pd.DataFrame()
dummies = pd.get_dummies(h[col]).rename(columns=lambda x: str(col) + '_' + str(x))
drop_col = str(col) + '_' + str(h[col].unique()[-1])
h = pd.concat([h, dummies], axis=1)
del h[col]
del h[drop_col]

In [None]:
#h.columns
#h['MSSubClassCat'].value_counts()[-1]
#h.groupby('MSSubClassCat')['MSSubClassCat'].size().min()
#h['MSSubClassCat'].unique()[-1]
#dummies = pd.get_dummies(h['MSSubClassCat']).rename(columns=lambda x: str('MSSubClassCat') + '_' + str(x))
#dummies = dummies.drop(dummies.columns[0],axis=1)
#len(dummies.columns)

# Add some features

In [None]:
#amount of high quality sqft
h['Qualsqft'] = (h.GrLivArea  - h.LowQualFinSF)

#2 story building, assuming multiple floors is a more desirable house
#h['2story'] = h['2ndFlrSF'] > 0
#h['Y' if h[h['2ndFlrSF'] > 0] else 'N']

#first-floor sqft per non-bedroom room, indicative of open concept?
h['oConcept'] = (h['1stFlrSF'] / (h.TotRmsAbvGrd - h.BedroomAbvGr + 1))

#rooms other than bedrooms
h['Rooms'] = (h.TotRmsAbvGrd - h.BedroomAbvGr)

#avg sqft per room
h['SqftPerRoom'] = (h.GrLivArea / h.TotRmsAbvGrd)

#yard size, based on lot size minus all other measured areas we have
h['Yard'] = h.LotArea - h['1stFlrSF'] - h.GarageArea# - h.WoodDeckSF - h.OpenPorchSF - h['3SsnPorch'] - h.ScreenPorch

#bathroom to bed ratio, weighting halfbaths appropriately
h['Bath2Bed'] = ((h.FullBath + (h.HalfBath)*.5) / (h.BedroomAbvGr + 1))

#bathroom to kitchen ratio, weighting halfbaths appropriately
h['Bath2Kitch'] = ((h.FullBath + (h.HalfBath)*.5) / (h.KitchenAbvGr + 1))

#overall quality squared
h['OverallQualSq'] = h['OverallQual'] ** 2

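# NOTE: ** raises Qualsqft to the power OverallQualSq, which overflows for realistic values; * may have been intended
h['QualOnLot'] =  (h['Qualsqft'] ** h['OverallQualSq'])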

h['QualOnSqft'] =  (h['GrLivArea'] * h['OverallQualSq'])

h['QualOnRooms'] =  ((h['BedroomAbvGr'] + h['KitchenAbvGr'] + 
                      h['TotRmsAbvGrd'] + h['GarageCars'] + 
                      h['FullBath'] + h['HalfBath'] + 
                      h['BsmtFullBath'] + h['BsmtHalfBath'] + h['Fireplaces']) * h['OverallQualSq'])

h['RoomsOverArea'] = h['TotRmsAbvGrd'] / h['GrLivArea']

#price per room
#h['PriceRoom'] = h['SalePrice'] / h['TotRmsAbvGrd']# + h['FullBath'] + h['HalfBath']

#price per bed
#h['PriceBed'] = h['SalePrice'] / h['BedroomAbvGr']
#h['SalePrice'] / h['BedroomAbvGr']

#years between construction and the last remodel (0 if never remodeled)
h['YearsBetween'] = h['YearRemodAdd'] - h['YearBuilt']

h['Qualsqft2'] = np.sqrt(h['Qualsqft']) * h['OverallQual'] # * h['sqftperroom']
#h['Qualsqft2'].fillna(0)

h['Qualsqft3'] = np.sqrt(h['Qualsqft']) * h['OverallQualSq'] # * h['sqftperroom']
#h['Qualsqft3'].fillna(0)

h['BKG'] =  (h['Bath2Bed'] * h['Qualsqft'])

#h['PriceOnRooms'] =  (h['SalePrice'] / (h['BedroomAbvGr'] + h['KitchenAbvGr'] + h['TotRmsAbvGrd']))

h['BKP'] =  (h['Bath2Kitch'] * h['QualOnRooms'])
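
The `QualOnLot` line uses `**`, so the area is raised to the power of the squared quality score. A quick check with hypothetical values shows how fast that grows in float64 arithmetic compared with a plain product:

```python
import numpy as np

qual_sqft = 1500.0          # hypothetical high-quality living area in sqft
overall_qual_sq = 10 ** 2   # an OverallQual of 10, squared

with np.errstate(over='ignore'):
    as_power = np.power(qual_sqft, float(overall_qual_sq))  # 1500 ** 100
as_product = qual_sqft * overall_qual_sq                     # 1500 * 100

print(as_power, as_product)
```

The power overflows to `inf`, which `fillna(0)` does not replace by default. Integer columns overflow as well and typically wrap around silently instead of producing `inf`. Whether a product was the intent is an assumption; the original line is left as written.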

In [None]:
#add features
#(h.groupby('Neighborhood')['SalePrice'].mean() / h.groupby('Neighborhood')['GrLivArea'].mean()).sort_values(ascending=False)

In [None]:
#h['Qualsqft2']
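# assign the result back; inplace=True on a column selection may not reach h in newer pandas
h['Qualsqft2'] = h['Qualsqft2'].fillna(0)
h['Qualsqft3'] = h['Qualsqft3'].fillna(0)
h['QualOnLot'] = h['QualOnLot'].fillna(0)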

In [None]:
sns.pairplot(h, 
             x_vars=['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
       'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt'],
             y_vars=['SalePrice'],height=7,aspect=.7,kind='reg')  # seaborn >= 0.9 uses height instead of size

# Take a look at distributions, looks like a few of the top candidates are not normally distributed

In [None]:
from scipy.stats import norm

In [None]:
f, (ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8, ax9, ax10) = plt.subplots(10,figsize=(7,36))
sns.distplot(h['SalePrice'], fit=norm, ax=ax1)
sns.distplot(h['OverallQual'], fit=norm, ax=ax2)
sns.distplot(h['FullBath'], fit=norm, ax=ax3)
sns.distplot(h['GrLivArea'], fit=norm, ax=ax4)
sns.distplot(h['GarageArea'], fit=norm, ax=ax5)
sns.distplot(h['GarageCars'], fit=norm, ax=ax6)
sns.distplot(h['TotalBsmtSF'], fit=norm, ax=ax7)
sns.distplot(h['1stFlrSF'], fit=norm, ax=ax8)
sns.distplot(h['TotRmsAbvGrd'], fit=norm, ax=ax9)
sns.distplot(h['YearBuilt'], fit=norm, ax=ax10)

# Save cleaned no_transforms dataframe

In [None]:
h_no_transforms = h.copy()

In [None]:
#h_no_log = h_no_transforms.copy()

# Transform the variables that are not normally distributed - linear regression assumes normally distributed residuals (not features), and log-transforming skewed variables usually helps with that
- log transform SalePrice, GrLivArea, 1stFlrSF
- do something about the 0's in GarageArea, TotalBsmtSF
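
To make the effect of the log transform concrete, the sketch below measures skewness before and after on synthetic right-skewed "prices". The lognormal parameters are arbitrary, not fitted to the Ames data, and `skewness` is a small helper written for this example.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# synthetic right-skewed "prices"; the parameters are arbitrary
prices = rng.lognormal(mean=12.0, sigma=0.75, size=5000)

print(round(skewness(prices), 2), round(skewness(np.log(prices)), 2))
```

The log of a strictly positive, right-skewed variable is much closer to symmetric. For columns that contain zeros, `np.log` returns `-inf`, which is why the 0's need separate handling.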

In [None]:
#saleprice
f, (ax1, ax2) = plt.subplots(2,figsize=(10,10))
res = stats.probplot(h['SalePrice'], plot=ax1)
h['SalePrice'] = np.log(h['SalePrice'])
res = stats.probplot(h['SalePrice'], plot=ax2)

In [None]:
#GrLivArea
f, (ax1, ax2) = plt.subplots(2,figsize=(10,10))
res = stats.probplot(h['GrLivArea'], plot=ax1)
h['GrLivArea'] = np.log(h['GrLivArea'])
res = stats.probplot(h['GrLivArea'], plot=ax2)

In [None]:
#1stFlrSF
f, (ax1, ax2) = plt.subplots(2,figsize=(10,10))
res = stats.probplot(h['1stFlrSF'], plot=ax1)
h['1stFlrSF'] = np.log(h['1stFlrSF'])
res = stats.probplot(h['1stFlrSF'], plot=ax2)

In [None]:
#OverallQual - debatable whether this one should be transformed
f, (ax1, ax2) = plt.subplots(2,figsize=(10,10))
res = stats.probplot(h['OverallQual'], plot=ax1)
h['OverallQual'] = np.log(h['OverallQual'])
res = stats.probplot(h['OverallQual'], plot=ax2)

# Convert 0's in some numerics to categorical - TotalBsmtSF and GarageArea

In [None]:
def Zeroconvert(num):
    """Return 1 when the value is zero (feature absent), else 0."""
    return 1 if num == 0 else 0
Zeroconvert(orig['TotalBsmtSF'][0])

In [None]:
#there are 37 samples without a basement
h['TotalBsmtSF'].value_counts().head()

In [None]:
h['NoBsmt'] = h['TotalBsmtSF'].apply(Zeroconvert)

In [None]:
h['NoBsmt'].value_counts().head()

In [None]:
#there are 79 samples without a Garage
h['GarageArea'].value_counts().head()

In [None]:
h['NoGarage'] = h['GarageArea'].apply(Zeroconvert)

In [None]:
#TotalBsmtSF
f, (ax1, ax2) = plt.subplots(2,figsize=(10,10))
res = stats.probplot(h[h['TotalBsmtSF'] > 0]['TotalBsmtSF'], plot=ax1)
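has_bsmt = h['NoBsmt'] == 0
h['TotalBsmtSF'] = h['TotalBsmtSF'].astype(float)  # the logged values are floats
# log only the non-zero areas so np.log never sees 0
h.loc[has_bsmt,'TotalBsmtSF'] = np.log(h.loc[has_bsmt,'TotalBsmtSF'])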
res = stats.probplot(h[h['TotalBsmtSF'] > 0]['TotalBsmtSF'], plot=ax2)

In [None]:
#GarageArea
f, (ax1, ax2) = plt.subplots(2,figsize=(10,10))
res = stats.probplot(h[h['GarageArea'] > 0]['GarageArea'], plot=ax1)
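has_garage = h['NoGarage'] == 0
h['GarageArea'] = h['GarageArea'].astype(float)  # the logged values are floats
# log only the non-zero areas so np.log never sees 0
h.loc[has_garage,'GarageArea'] = np.log(h.loc[has_garage,'GarageArea'])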
res = stats.probplot(h[h['GarageArea'] > 0]['GarageArea'], plot=ax2)

# Check distribution afterwards

In [None]:
# f, (ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8, ax9, ax10) = plt.subplots(10,figsize=(7,36))
# sns.distplot(h['SalePrice'], fit=norm, ax=ax1)
# #ax1.set_title('SalePrice')
# sns.distplot(h['OverallQual'], fit=norm, ax=ax2)
# #ax2.set_title('OverallQual')
# sns.distplot(h['FullBath'], fit=norm, ax=ax3)
# sns.distplot(h['GrLivArea'], fit=norm, ax=ax4)
# sns.distplot(h['GarageArea'], fit=norm, ax=ax5)
# sns.distplot(h['GarageCars'], fit=norm, ax=ax6)
# sns.distplot(h['TotalBsmtSF'], fit=norm, ax=ax7)
# sns.distplot(h['1stFlrSF'], fit=norm, ax=ax8)
# sns.distplot(h['TotRmsAbvGrd'], fit=norm, ax=ax9)
# sns.distplot(h['YearBuilt'], fit=norm, ax=ax10)

# Copy of transformed data without dummies