# Cleaning

#### Columns where NA (NaN) is a meaningful value
1. **Alley:** Type of alley access to property
2. **BsmtQual:** Evaluates the height of the basement
3. **BsmtCond:** Evaluates the general condition of the basement
4. **BsmtExposure:** Refers to walkout or garden level walls
5. **BsmtFinType1:** Rating of basement finished area
6. **BsmtFinType2:** Rating of basement finished area (if multiple types)
7. **FireplaceQu:** Fireplace quality
8. **GarageType:** Garage location
9. **GarageFinish:** Interior finish of the garage
10. **GarageQual:** Garage quality
11. **GarageCond:** Garage condition
12. **PoolQC:** Pool quality (has the most missing values of the above)
13. **Fence:** Fence quality
14. **MiscFeature:** Miscellaneous feature not covered in other categories

##### The two below are listed as None in description, but NA in dataset
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet

##### Others
* LotFrontage: Maybe set to NA if there is no street connected to property?
* GarageYrBlt: Set to NA if the above Garage attributes are set as NA

* Electrical: Electrical system. Only one property has a missing value for Electrical, at row 1381.
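
The groups above suggest treating NA differently per column. As a minimal sketch, assuming the column names from the list above, NA can be turned into an explicit category for the "feature not present" columns while numeric gaps are left alone. The toy values and the helper name `fill_structural_na` are hypothetical and not part of this notebook:

```python
import numpy as np
import pandas as pd

# Columns where NA means "feature not present" (a subset of the list above).
NA_MEANS_ABSENT = ["Alley", "PoolQC", "Fence", "GarageType"]

def fill_structural_na(df, columns=NA_MEANS_ABSENT, label="None"):
    """Return a copy where NA in the given columns becomes an explicit category."""
    out = df.copy()
    # Only touch columns that actually exist in this frame.
    present = [c for c in columns if c in out.columns]
    out[present] = out[present].fillna(label)
    return out

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, np.nan],
    "PoolQC": [np.nan, np.nan, "Ex"],
    "LotFrontage": [65.0, np.nan, 80.0],
})
filled = fill_structural_na(toy)
print(filled)
```

The numeric column `LotFrontage` keeps its NaN here, since a missing frontage is not obviously the same thing as "no frontage".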


In [175]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import copy
%matplotlib inline

In [176]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(train.shape, test.shape, "total[0] =",test.shape[0]+train.shape[0])

(1460, 81) (1459, 80) total[0] = 2919


In [177]:
border = train.shape[0]
border

1460

In [178]:
# sort=True keeps the alphabetical column order shown below and silences the pandas FutureWarning
DFc = pd.concat([train, test], ignore_index=True, sort=True)
DFc.shape

(2919, 81)

In [179]:
DFc.iloc[border-2:border+2, :]

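| | 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | SaleType | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1458 | 1078 | 0 | 0 | NaN | 2 | 1Fam | TA | Mn | 49.0 | 1029.0 | ... | WD | 0 | Pave | 5 | 1078.0 | AllPub | 366 | 1950 | 1996 | 2010 |
| 1459 | 1256 | 0 | 0 | NaN | 3 | 1Fam | TA | No | 830.0 | 290.0 | ... | WD | 0 | Pave | 6 | 1256.0 | AllPub | 736 | 1965 | 1965 | 2008 |
| 1460 | 896 | 0 | 0 | NaN | 2 | 1Fam | TA | No | 468.0 | 144.0 | ... | WD | 120 | Pave | 5 | 882.0 | AllPub | 140 | 1961 | 1961 | 2010 |
| 1461 | 1329 | 0 | 0 | NaN | 3 | 1Fam | TA | No | 923.0 | 0.0 | ... | WD | 0 | Pave | 6 | 1329.0 | AllPub | 393 | 1958 | 1958 | 2010 |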


In [180]:
DFc = DFc.drop(columns=['LotShape', 'GarageYrBlt'])

In [181]:
def summary_missing_data(data):
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return missing_data

In [182]:
missing_data = summary_missing_data(DFc)
print(missing_data)

               Total   Percent
PoolQC          2909  0.996574
MiscFeature     2814  0.964029
Alley           2721  0.932169
Fence           2348  0.804385
SalePrice       1459  0.499829
FireplaceQu     1420  0.486468
LotFrontage      486  0.166495
GarageCond       159  0.054471
GarageFinish     159  0.054471
GarageQual       159  0.054471
GarageType       157  0.053786
BsmtCond          82  0.028092
BsmtExposure      82  0.028092
BsmtQual          81  0.027749
BsmtFinType2      80  0.027407
BsmtFinType1      79  0.027064
MasVnrType        24  0.008222
MasVnrArea        23  0.007879
MSZoning           4  0.001370
BsmtFullBath       2  0.000685
BsmtHalfBath       2  0.000685
Utilities          2  0.000685
Functional         2  0.000685
Exterior2nd        1  0.000343
Exterior1st        1  0.000343
KitchenQual        1  0.000343
GarageCars         1  0.000343
Electrical         1  0.000343
GarageArea         1  0.000343
SaleType           1  0.000343
...              ...       ...
GrLivAre

In [183]:
atts = pd.read_csv('attributes.csv')

In [184]:
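# Treat NA in each categorical column as its own 'None' category, then one-hot encode it.
for col_name in atts.loc[atts['Type'] == 'Categorical', 'Attribute']:
    DFc[col_name] = DFc[col_name].fillna('None')
    DFc = pd.get_dummies(DFc, columns=[col_name])

# Rows with sparse missing values are not dropped: after get_dummies the original
# categorical columns no longer exist (KeyError), and dropping would remove test rows.
# for value in missing_data[missing_data['Total'] < 50].index:
#     DFc = DFc.drop(DFc.loc[DFc[value].isnull()].index)
summary_missing_data(DFc).head(17)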

In [185]:
summary_missing_data(DFc.drop(columns=['SalePrice'])).head(11)

| | Total | Percent |
|---|---|---|
| LotFrontage | 486 | 0.166495 |
| MasVnrArea | 23 | 0.007879 |
| BsmtFullBath | 2 | 0.000685 |
| BsmtHalfBath | 2 | 0.000685 |
| BsmtUnfSF | 1 | 0.000343 |
| GarageArea | 1 | 0.000343 |
| TotalBsmtSF | 1 | 0.000343 |
| BsmtFinSF1 | 1 | 0.000343 |
| BsmtFinSF2 | 1 | 0.000343 |
| GarageCars | 1 | 0.000343 |


#### Manually filled in some missing values with modes

In [186]:
# Fill the remaining numeric gaps with each column's mode (the first mode if there are ties).
mode_cols = ['LotFrontage', 'MasVnrArea', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtUnfSF',
             'GarageArea', 'TotalBsmtSF', 'BsmtFinSF1', 'BsmtFinSF2', 'GarageCars']
for col in mode_cols:
    DFc[col] = DFc[col].fillna(DFc[col].mode().iloc[0])

In [187]:
summary_missing_data(DFc.drop(columns=['SalePrice'])).head(11)

| | Total | Percent |
|---|---|---|
| SaleCondition_Partial | 0 | 0.0 |
| HouseStyle_SFoyer | 0 | 0.0 |
| BldgType_Duplex | 0 | 0.0 |
| BldgType_Twnhs | 0 | 0.0 |
| BldgType_TwnhsE | 0 | 0.0 |
| HouseStyle_1.5Fin | 0 | 0.0 |
| HouseStyle_1.5Unf | 0 | 0.0 |
| HouseStyle_1Story | 0 | 0.0 |
| HouseStyle_2.5Fin | 0 | 0.0 |
| HouseStyle_2.5Unf | 0 | 0.0 |


In [188]:
DFtrain = DFc.iloc[0:border,:]
print(DFtrain.shape)
DFtest = DFc.iloc[border:,:]
DFtest = DFtest.drop(columns=['SalePrice'])
print(DFtest.shape)

(1460, 376)
(1459, 375)


In [189]:
DFtrain.to_csv('trainer.csv', index=False)
DFtest.to_csv('tester.csv', index=False)

# Modeling

Prepare the data for PCA below.

In [190]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [191]:
X = DFtrain.drop(columns=['Id', 'SalePrice'])
Y = DFtrain['SalePrice']
Xt = DFtest.drop(columns=['Id'])

##### Scaling: without it, the largest-scale features dominate and PCA shows essentially one component

In [192]:
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
Xt = scaler.transform(Xt)
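
To illustrate why scaling matters here, the following sketch compares the explained variance ratios of PCA with and without `StandardScaler` on two synthetic, independent features of very different magnitude. The data is invented for illustration and the names `X_toy`, `raw_ratio`, and `scaled_ratio` are assumptions, not part of the notebook:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent synthetic features on very different scales
# (think lot area in square feet vs. number of rooms).
area = rng.normal(10_000, 3_000, size=500)
rooms = rng.normal(6, 1.5, size=500)
X_toy = np.column_stack([area, rooms])

# PCA on the raw values: the large-scale feature carries almost all the variance.
raw_ratio = PCA(n_components=2).fit(X_toy).explained_variance_ratio_

# PCA after standardizing: both features contribute on an equal footing.
scaled_ratio = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X_toy)
).explained_variance_ratio_

print("unscaled:", raw_ratio)
print("scaled:  ", scaled_ratio)
```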

In [193]:
Xcopy  = X
Ycopy  = Y
Xtcopy = Xt

In [194]:
pca = PCA(5)

In [195]:
pca.fit(X)

PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [196]:
pca.n_components_

5

In [197]:
X = pca.transform(X)
Xt = pca.transform(Xt)

In [198]:
X.shape, Xt.shape, Y.shape

((1460, 5), (1459, 5), (1460,))

## Linear Regression

In [199]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [200]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [201]:
XTRAIN, XTEST, YTRAIN, YTEST=train_test_split(X,Y)
r=LinearRegression().fit(XTRAIN,YTRAIN)
P=r.predict(XTEST)
R2=r2_score(YTEST,P)
MSE = mean_squared_error(YTEST,P)
print(R2,MSE)

0.7866504333174228 1209913494.2356231


In [202]:
errs=[]
for i in range(100):
    XTRAIN, XTEST, YTRAIN, YTEST=train_test_split(X,Y)
    r=LinearRegression().fit(XTRAIN,YTRAIN)
    P=r.predict(XTEST)
    R2=r2_score(YTEST,P)
    #MSE = mean_squared_error(YTEST,P)
    #print(R2,MSE)
    errs.append(1-R2)

In [203]:
print('Result: mean R^2 ~', np.round((1-np.mean(errs))*100), '%')

Result: mean R^2 ~ 77.0 %


Maybe we should try a different number of components.
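
One common way to pick the number of components is to look at the cumulative explained variance. The sketch below is only an illustration on synthetic data: the helper `components_for_variance` and the 0.99 target are assumptions, not something this notebook ran.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(X, target=0.90):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `target`."""
    ratios = PCA().fit(X).explained_variance_ratio_
    cumulative = np.cumsum(ratios)
    # First index where the cumulative ratio is >= target, converted to a count.
    k = int(np.searchsorted(cumulative, target)) + 1
    # Guard against floating-point shortfall when target is close to 1.
    return min(k, len(ratios))

rng = np.random.default_rng(0)
# Synthetic data: 10 observed features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(300, 10))

k = components_for_variance(X_toy, target=0.99)
print("components needed for 99% of the variance:", k)
```

On the real data, the same function could be called on the scaled `Xcopy` to see how far 5 components are from a chosen target.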

## Naive Bayes

In [204]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,confusion_matrix

In [209]:
gnb=GaussianNB()
errs=[]
nsplits = 100 #it takes a very long time with 100 splits
for split in range(nsplits):
    XTRAIN, XTEST, YTRAIN, YTEST=train_test_split(X,Y,test_size=.25)
    gnb.fit(XTRAIN,YTRAIN)
    YP=gnb.predict(XTEST)
    errs.append(1-accuracy_score(YTEST,YP))
print("%d Splits: Mean Error=%7.6f +/- %7.6f (95%%)"\
      %(nsplits, np.mean(errs),1.96*np.std(errs)))
print(confusion_matrix(YTEST,YP))

100 Splits: Mean Error=0.991151 +/- 0.009140 (95%)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
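
An error near 99% is what one would expect when a classifier is given a continuous target: every distinct `SalePrice` becomes its own class. A minimal sketch of one possible alternative, assuming that price bands would be an acceptable target, bins prices into quartile classes with `pd.qcut`. The prices below are synthetic, and the band labels are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic prices standing in for SalePrice (rounded to the nearest 100); not the real data.
prices = pd.Series(np.round(rng.lognormal(mean=12, sigma=0.4, size=1000), -2))

# For a classifier, each distinct price is a separate class.
print("distinct classes:", prices.nunique())

# Quartile bins give four roughly balanced classes instead.
price_class = pd.qcut(prices, q=4, labels=["low", "mid-low", "mid-high", "high"])
print(price_class.value_counts().sort_index())
```

A classifier trained on `price_class` would then be scored with `accuracy_score` in a meaningful way, at the cost of predicting a band rather than a price.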


## Random Forest

In [210]:
from sklearn.ensemble import RandomForestClassifier

In [213]:
RF=RandomForestClassifier(n_estimators=500)
XTRAIN, XTEST, YTRAIN, YTEST=train_test_split(Xcopy,Ycopy,test_size=.25)
RF.fit(XTRAIN,YTRAIN)
YP=RF.predict(XTEST)
error=(1-accuracy_score(YTEST,YP))
print('Error: ', error*100,'%')

Error:  98.08219178082192 %
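
The same issue applies here: `RandomForestClassifier` treats each price as a class. As a hedged sketch of a swapped-in technique (not what this notebook ran), `RandomForestRegressor` can be scored with R^2 instead of accuracy. The data below is synthetic and stands in for `Xcopy` and `Ycopy`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic features and a continuous target; values are invented for illustration.
X_toy = rng.normal(size=(400, 5))
y_toy = (180_000 + 50_000 * X_toy[:, 0] + 20_000 * X_toy[:, 1]
         + rng.normal(0, 5_000, size=400))

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# A regressor predicts a price directly instead of picking one of many classes.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_te)
r2 = r2_score(y_te, pred)
print("R^2 on held-out toy data:", round(r2, 3))
```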


## Logistic Regression

In [206]:
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression(solver = 'lbfgs')
logisticRegr.fit(X,Y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [207]:
logisticRegr.predict(Xt[0].reshape(1,-1))

array([139000.])

In [208]:
logisticRegr.predict(Xt[0:10])

array([139000., 139000., 176000., 250000., 147000., 178000., 175000.,
       178000., 180000., 148000.])