# Cleaning

#### Columns where NA (NaN) is a valid value
1. **Alley:** Type of alley access to property
2. **BsmtQual:** Evaluates the height of the basement
3. **BsmtCond:** Evaluates the general condition of the basement
4. **BsmtExposure:** Refers to walkout or garden level walls
5. **BsmtFinType1:** Rating of basement finished area
6. **BsmtFinType2:** Rating of basement finished area (if multiple types)
7. **FireplaceQu:** Fireplace quality
8. **GarageType:** Garage location
9. **GarageFinish:** Interior finish of the garage
10. **GarageQual:** Garage quality
11. **GarageCond:** Garage condition
12. **PoolQC:** Pool quality (has the most missing values of any column)
13. **Fence:** Fence quality
14. **MiscFeature:** Miscellaneous feature not covered in other categories

##### The two below are listed as None in description, but NA in dataset
* MasVnrArea: Masonry veneer area in square feet
* MasVnrType: Masonry veneer type

##### Others
* LotFrontage: Maybe set to NA if there is no street connected to property?
* GarageYrBlt: Set to NA if the above Garage attributes are set as NA

* Electrical: Electrical system. Only one property is missing a value for Electrical, at row 1381.
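The distinction drawn above, between NA as a real category ("no such feature") and NA as a genuinely missing value, can be sketched on a toy frame. pandas reads the literal string `NA` as NaN by default, so both cases look identical after loading. The column subset and sample values below are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Columns where NA means "the house has no such feature" (subset of the list above)
NA_MEANS_NONE = ["Alley", "PoolQC", "Fence"]

# Toy data for illustration only; not taken from train.csv
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, np.nan],
    "PoolQC": [np.nan, "Gd", np.nan],
    "Fence": ["MnPrv", np.nan, "GdWo"],
    "LotFrontage": [65.0, np.nan, 80.0],
})

# Replace NA with an explicit "None" category only in the listed columns
toy[NA_MEANS_NONE] = toy[NA_MEANS_NONE].fillna("None")

# LotFrontage is left alone: its NA is a truly missing number
print(toy.isnull().sum())
```

Filling only a named list of columns keeps numeric columns such as LotFrontage numeric, which a frame-wide `fillna("None")` would not.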


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ds_utils import *
import copy
%matplotlib inline

In [2]:
DFtrain = pd.read_csv('train.csv')
Y = DFtrain['SalePrice']
DFtest = pd.read_csv('test.csv')
DFtrain = DFtrain.drop(['SalePrice'], axis=1)
# DFtrain.shape, DFtest.shape

In [3]:
DFc = pd.concat([DFtrain, DFtest])
del DFc['Id']
del DFc['GarageYrBlt']
DFc.shape

(2919, 78)

In [4]:
def summary_missing_data(data, n):
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return missing_data.head(n)
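As a rough cross-check of the helper above, the same summary can be built from `isnull().mean()`, which gives the missing fraction directly. The function name `missing_fraction` and the toy frame are hypothetical and only serve the illustration. Note that the `Percent` column holds a fraction between 0 and 1, not a percentage.

```python
import numpy as np
import pandas as pd

def missing_fraction(data: pd.DataFrame, n: int) -> pd.DataFrame:
    """Hypothetical variant of summary_missing_data using isnull().mean()."""
    is_na = data.isnull()
    summary = pd.DataFrame({"Total": is_na.sum(), "Percent": is_na.mean()})
    return summary.sort_values("Total", ascending=False).head(n)

# Toy data for illustration only
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1, 2, 3, 4],
    "c": [np.nan, "x", "y", "z"],
})

result = missing_fraction(toy, 2)
print(result)
```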

In [5]:
missing_data = summary_missing_data(DFc,34)
print(missing_data)

              Total   Percent
PoolQC         2909  0.996574
MiscFeature    2814  0.964029
Alley          2721  0.932169
Fence          2348  0.804385
FireplaceQu    1420  0.486468
LotFrontage     486  0.166495
GarageCond      159  0.054471
GarageQual      159  0.054471
GarageFinish    159  0.054471
GarageType      157  0.053786
BsmtCond         82  0.028092
BsmtExposure     82  0.028092
BsmtQual         81  0.027749
BsmtFinType2     80  0.027407
BsmtFinType1     79  0.027064
MasVnrType       24  0.008222
MasVnrArea       23  0.007879
MSZoning          4  0.001370
BsmtHalfBath      2  0.000685
Utilities         2  0.000685
Functional        2  0.000685
BsmtFullBath      2  0.000685
BsmtFinSF1        1  0.000343
BsmtFinSF2        1  0.000343
Exterior2nd       1  0.000343
BsmtUnfSF         1  0.000343
TotalBsmtSF       1  0.000343
SaleType          1  0.000343
Exterior1st       1  0.000343
Electrical        1  0.000343
KitchenQual       1  0.000343
GarageArea        1  0.000343
GarageCars

#### Dropping rows with missing values in columns that have few NAs

In [6]:
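# Drop every row that has a missing value in a column with fewer than 50 NAs.
# Note: DFc keeps the original train/test index labels, which overlap, so
# dropping by label also removes any other row that shares the same label.
for value in missing_data[missing_data['Total'] < 50].index:
    DFc = DFc.drop(DFc.loc[DFc[value].isnull()].index)
summary_missing_data(DFc,17)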

              Total   Percent
PoolQC         2835  0.996485
MiscFeature    2742  0.963796
Alley          2651  0.931810
Fence          2280  0.801406
FireplaceQu    1387  0.487522
LotFrontage     466  0.163796
GarageFinish    156  0.054833
GarageCond      156  0.054833
GarageQual      156  0.054833
GarageType      155  0.054482


In [7]:
DFc.shape

(2845, 78)
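One thing worth checking in the row count above: `pd.concat` without `ignore_index=True` keeps the original index labels, so train and test rows share labels, and dropping by label removes every row carrying that label. The toy frames below are assumptions made up for the illustration; they contrast label-based dropping with `dropna(subset=...)`.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train and test; both are indexed 0..2
train = pd.DataFrame({"Electrical": ["SBrkr", np.nan, "FuseA"]})
test = pd.DataFrame({"Electrical": ["SBrkr", "SBrkr", "FuseF"]})

# Without ignore_index the labels 0, 1, 2 each appear twice
combined = pd.concat([train, test])

# Dropping by label removes every row labelled 1, including the complete test row
by_label = combined.drop(combined.loc[combined["Electrical"].isnull()].index)

# dropna(subset=...) removes only the rows that actually contain the NA
by_mask = combined.dropna(subset=["Electrical"])

print(len(combined), len(by_label), len(by_mask))
```

Passing `ignore_index=True` to `pd.concat` would be another way to avoid the shared labels, at the cost of losing the original row numbers.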

#### Filled missing values in LotFrontage with the mode

In [8]:
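# mode() returns a Series; take its first entry as a scalar (mode is 60)
mode = float(DFc['LotFrontage'].mode()[0])
# Assign the result back instead of calling fillna(inplace=True) on a column selection
DFc['LotFrontage'] = DFc['LotFrontage'].fillna(mode)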
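A small sketch of the mode fill on made-up values, mainly to show that `mode()` returns a Series because several values can tie for most frequent. The numbers below are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
lot_frontage = pd.Series([60.0, 80.0, np.nan, 60.0, np.nan, 75.0])

# mode() returns a Series (there can be ties); take the first entry as a scalar
most_common = lot_frontage.mode().iloc[0]

# fillna returns a new Series, so the result is assigned to a new name
filled = lot_frontage.fillna(most_common)
print(most_common, filled.tolist())
```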

#### Filled the remaining missing values with "None"
The remaining columns with missing values are the ones where the data description uses NA to mean "None", i.e. the feature is absent.

In [9]:
DFc.fillna("None",inplace=True)
summary_missing_data(DFc,5)

               Total  Percent
SaleCondition      0      0.0
Foundation         0      0.0
RoofMatl           0      0.0
Exterior1st        0      0.0
Exterior2nd        0      0.0


In [10]:
atts = pd.read_csv('attributes.csv')

In [11]:
DFdum = DFc.copy()

### Creating dummy variables only for the categorical attributes

In [12]:
# Add dummy columns for every attribute that attributes.csv marks as Categorical
for attribute, dtype in zip(atts['Attribute'], atts['Type']):
    if dtype == 'Categorical':
        dummies = pd.get_dummies(DFc[attribute], prefix=attribute)
        for dum in dummies:
            DFdum[dum] = dummies[dum]
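The same idea can be sketched with the `columns=` argument of `pd.get_dummies`, which encodes only the named columns in one call. The small `atts` table below is a hypothetical stand-in for attributes.csv (assumed to have `Attribute` and `Type` columns, as the loop above implies), and the data values are made up.

```python
import pandas as pd

# Hypothetical stand-in for attributes.csv: one row per column with its type
atts = pd.DataFrame({
    "Attribute": ["Street", "LotArea", "SaleType"],
    "Type": ["Categorical", "Numerical", "Categorical"],
})

# Toy data for illustration only
df = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
    "SaleType": ["WD", "New", "WD"],
})

# Select the categorical column names without a manual counter
cat_cols = atts.loc[atts["Type"] == "Categorical", "Attribute"].tolist()

# columns= encodes only the named columns and drops the originals;
# dtype=int gives 0/1 values rather than booleans
encoded = pd.get_dummies(df, columns=cat_cols, dtype=int)
print(encoded.columns.tolist())
```

Unlike the loop above, this drops the original categorical columns from the result, which is usually what a model needs but differs from what `DFdum` contains.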

In [13]:
DFdum

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,MiscVal_17000,SaleType_COD,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,0,0,0,0,0,0,0,1
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,0,0,0,0,0,0,0,0,1
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,0,0,0,0,0,0,0,1
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,0,0,0,0,0,0,0,1
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,0,0,0,0,0,0,0,0,1
5,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,0,0,0,0,0,0,0,1
6,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,0,0,0,0,0,0,0,1
7,60,RL,60.0,10382,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,0,0,0,0,0,0,0,1
8,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,0,0,0,0,0,0,0,1
9,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,Corner,...,0,0,0,0,0,0,0,0,0,1


# Modeling

The cell below scales the features to get ready for PCA.

In [14]:
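from sklearn.preprocessing import StandardScaler

# DF, datatrain and datatest are not defined in this notebook, so the frames
# loaded in In [2] are used instead; Y already holds SalePrice from that cell.
# StandardScaler needs numeric input, so only numeric columns are kept.
X = DFtrain.drop(columns=['Id']).select_dtypes(include='number')
Xt = DFtest.drop(columns=['Id'])[X.columns]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Reuse the training mean and std on the test set instead of refitting
Xt_scaled = scaler.transform(Xt)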
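Since the scaling is meant as preparation for PCA, here is a minimal sketch of how the two steps could fit together. It uses synthetic data rather than the housing features, and `n_components=2` is an arbitrary choice for the illustration. The scaler and the PCA are both fitted on the training data only and then applied to the test data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data standing in for the train and test features
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 4))

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# PCA is likewise fitted on the training data and applied to both sets
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(X_train_pca.shape, X_test_pca.shape)
```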

