## Kaggle Submission

The purpose of this notebook is to use the production model that we chose earlier, ridge regression, to predict the sale price of each house in the test data we were given, and to write those predictions to a CSV file for upload to Kaggle.

In [46]:
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
import pickle
import csv
import re
import time
np.random.seed(42)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

This imports all of the libraries we need.

In [47]:
X = pd.read_pickle("../datasets/training_data_cleaned_X.pkl")
y = pd.read_pickle("../datasets/training_data_cleaned_y.pkl")
X_train_sc = np.load('../datasets/X_train_sc.npy')
X_test_sc = np.load('../datasets/X_test_sc.npy')
y_train = np.load('../datasets/y_train.npy')
y_test = np.load('../datasets/y_test.npy')
ss = pickle.load(open('../datasets/ss.sav', 'rb'))
columns = np.load('../datasets/columns.npy')

This loads all of the saved data we need.

In [48]:

kaggle_data = pd.read_csv('../datasets/test.csv', index_col='Id')

In [49]:
kaggle_data.isnull().sum().sort_values(ascending=False)

Pool QC            875
Misc Feature       838
Alley              821
Fence              707
Fireplace Qu       422
Lot Frontage       160
Garage Cond         45
Garage Qual         45
Garage Yr Blt       45
Garage Finish       45
Garage Type         44
BsmtFin Type 1      25
Bsmt Exposure       25
Bsmt Qual           25
Bsmt Cond           25
BsmtFin Type 2      25
Mas Vnr Area         1
Electrical           1
Mas Vnr Type         1
Year Built           0
Exter Qual           0
Exter Cond           0
Foundation           0
Exterior 2nd         0
Exterior 1st         0
Roof Matl            0
Roof Style           0
Year Remod/Add       0
Sale Type            0
Overall Cond         0
                  ... 
Misc Val             0
Pool Area            0
Screen Porch         0
3Ssn Porch           0
Enclosed Porch       0
Open Porch SF        0
Wood Deck SF         0
Paved Drive          0
Garage Area          0
Garage Cars          0
Fireplaces           0
Functional           0
TotRms AbvG

As the output above shows, 19 columns contain missing values. I will now replace the null values the same way I did for the training set.

In [50]:
na_replacement_values = {
    'Pool QC': 'No pool',
    'Misc Feature': 'No feature',
    'Alley': 'No alley access',
    'Fence': 'No fence',
    'Fireplace Qu': 'No Fireplace',
    'Lot Frontage': 0,
    'Garage Finish': 'No Garage',
    'Garage Cond': 'No Garage',
    'Garage Qual': 'No Garage',
    'Garage Yr Blt': 0,
    'Garage Type': 'No Garage',
    'Bsmt Exposure': 'No Basement',
    'BsmtFin Type 2': 'No Basement',
    'BsmtFin Type 1': 'No Basement',
    'Bsmt Cond': 'No Basement',
    'Bsmt Qual': 'No Basement',
    'Mas Vnr Type': 'None',
    'Mas Vnr Area': 0,
    'Bsmt Half Bath': 'No Basement',
    'Bsmt Full Bath': 'No Basement',
    'Garage Cars': 0,
    'Garage Area': 0,
    'Bsmt Unf SF': 0,
    'BsmtFin SF 2': 0,
    'Total Bsmt SF': 0,
    'BsmtFin SF 1': 0,
}

In [51]:
kaggle_data.fillna(value=na_replacement_values,inplace=True)

In [52]:
kaggle_data.isnull().sum().sort_values(ascending=False).head(5)

Electrical    1
Sale Type     0
Exter Cond    0
Roof Style    0
Roof Matl     0
dtype: int64

I will now replace the one remaining null value, in the Electrical column, with that column's most common value, since the data has no category meaning that a house has no electricity.

In [53]:
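# Assign the result back; an in-place fillna on a selected column may not update kaggle_data in newer pandas.
kaggle_data['Electrical'] = kaggle_data['Electrical'].fillna('SBrkr')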

In [54]:
kaggle_data.isnull().sum().sort_values(ascending=False).head(5)

Sale Type       0
Exter Cond      0
Roof Style      0
Roof Matl       0
Exterior 1st    0
dtype: int64

Thus, I have now replaced every null value in the test dataset with an appropriate value.
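
As a minimal sketch of the two imputation steps above, the following uses a small made-up DataFrame rather than the real Kaggle data. The toy values are assumptions for illustration only. It fills column-specific defaults from a dictionary, then looks up the most common value with mode() instead of typing it in by hand.

```python
import pandas as pd

# Toy stand-in for kaggle_data; the values are invented for illustration.
toy = pd.DataFrame({
    'Pool QC': [None, 'Gd', None],
    'Lot Frontage': [60.0, None, 75.0],
    'Electrical': ['SBrkr', None, 'SBrkr'],
})

# Column-specific replacements, in the style of na_replacement_values.
toy = toy.fillna(value={'Pool QC': 'No pool', 'Lot Frontage': 0})

# Fall back to the most common value where "missing" has no natural label.
most_common = toy['Electrical'].mode()[0]
toy['Electrical'] = toy['Electrical'].fillna(most_common)

# Total count of nulls left anywhere in the frame.
remaining_nulls = int(toy.isnull().sum().sum())
```

Computing the mode keeps the fill value in step with the data if the file ever changes, although for a fair comparison it should arguably come from the training set rather than the test set.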

In [55]:
neighborhood_replacement_values = {
       'Blmngtn': 'Bloomington Heights',
       'Blueste': 'Bluestem',
       'BrDale': 'Briardale',
       'BrkSide': 'Brookside',
       'ClearCr': 'Clear Creek',
       'CollgCr': 'College Creek',
       'Crawfor': 'Crawford',
       'Edwards': 'Edwards',
       'Gilbert': 'Gilbert',
       'Greens': 'Greens',
       'GrnHill': 'Green Hills',
       'IDOTRR': 'Iowa DOT and Rail Road',
       'Landmrk': 'Landmark',
       'MeadowV': 'Meadow Village',
       'Mitchel': 'Mitchell',
       'Names': 'North Ames',
       'NoRidge': 'Northridge',
       'NPkVill': 'Northpark Villa',
       'NridgHt': 'Northridge Heights',
       'NWAmes': 'Northwest Ames',
       'OldTown': 'Old Town',
       'SWISU': 'South & West of Iowa State University',
       'Sawyer': 'Sawyer',
       'SawyerW': 'Sawyer West',
       'Somerst': 'Somerset',
       'StoneBr': 'Stone Brook',
       'Timber': 'Timberland',
       'Veenker': 'Veenker'
}

In [56]:
kaggle_data['Neighborhood'] = kaggle_data['Neighborhood'].replace(neighborhood_replacement_values)

I have now replaced all of the neighborhood abbreviations in the Kaggle data with full names to make them more human-readable.
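
Because replace leaves any value that is not a key of the dictionary untouched, a misspelled key fails silently. The sketch below shows one way to check for that before replacing. The helper name unmapped_codes and the toy values are hypothetical, not part of the original notebook.

```python
import pandas as pd

def unmapped_codes(series, mapping):
    """Return the sorted distinct values of `series` that are not keys of `mapping`."""
    return sorted(set(series.dropna().unique()) - set(mapping))

# Toy column and a partial mapping, both invented for illustration.
toy_neighborhood = pd.Series(['CollgCr', 'Sawyer', 'XyzHts', 'CollgCr'])
toy_mapping = {'CollgCr': 'College Creek', 'Sawyer': 'Sawyer'}

# Any code listed here would pass through replace() unchanged.
leftover = unmapped_codes(toy_neighborhood, toy_mapping)
```

On the real data, the same check would be called as unmapped_codes(kaggle_data['Neighborhood'], neighborhood_replacement_values) before the replacement cell runs.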

In [57]:
ms_subclass_replacement_values = {
       20 :'1-STORY 1946 & NEWER ALL STYLES',
       30 : '1-STORY 1945 & OLDER',
       40 : '1-STORY W/FINISHED ATTIC ALL AGES',
       45 : '1-1/2 STORY - UNFINISHED ALL AGES',
       50 : '1-1/2 STORY FINISHED ALL AGES',
       60 : '2-STORY 1946 & NEWER',
       70 : '2-STORY 1945 & OLDER',
       75 : '2-1/2 STORY ALL AGES',
       80 : 'SPLIT OR MULTI-LEVEL',
       85 : 'SPLIT FOYER',
       90 : 'DUPLEX - ALL STYLES AND AGES',
       120 : '1-STORY PUD (Planned Unit Development) - 1946 & NEWER',
       150 : '1-1/2 STORY PUD - ALL AGES',
       160 : '2-STORY PUD - 1946 & NEWER',
       180 : 'PUD - MULTILEVEL - INCL SPLIT LEV/FOYER',
       190 : '2 FAMILY CONVERSION - ALL STYLES AND AGES',
}

In [58]:
kaggle_data['MS SubClass'] = kaggle_data['MS SubClass'].replace(ms_subclass_replacement_values)

I am replacing the numeric MS SubClass codes with their descriptions to make them more human-readable.

In [59]:
ms_zoning_replacement_values = {
       'A' : 'Agriculture',
       'C' : 'Commercial',
       'FV' : 'Floating Village Residential',
       'I' : 'Industrial',
       'RH' : 'Residential High Density',
       'RL' : 'Residential Low Density',
       'RP' : 'Residential Low Density Park', 
       'RM' : 'Residential Medium Density'
}

In [60]:
kaggle_data['MS Zoning'] = kaggle_data['MS Zoning'].replace(ms_zoning_replacement_values)

I am replacing the MS Zoning abbreviations to make them more human-readable.

In [61]:
month_replacement_values = {
    1 : 'January',
    2 : 'February',
    3 : 'March',
    4 : 'April',
    5 : 'May',
    6 : 'June',
    7 : 'July',
    8 : 'August',
    9 : 'September',
    10 : 'October',
    11 : 'November',
    12 : 'December',
}

In [62]:
kaggle_data['Mo Sold'] = kaggle_data['Mo Sold'].replace(month_replacement_values)

In [63]:
categorical_numeric_columns = [
    'Mo Sold', 'Yr Sold', 'Full Bath', 'Half Bath', 'Exter Qual',
    'Exter Cond', 'Overall Qual', 'Overall Cond', 'Garage Cars',
    'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces',
]
for column in categorical_numeric_columns:
    kaggle_data[column] = kaggle_data[column].astype('object')

I am converting the numeric columns that are really categorical into objects, so that get_dummies will later encode each distinct value as its own category.
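
To illustrate why the cast matters, here is a small sketch on made-up data. It relies on the default behavior of pd.get_dummies, which encodes object columns and leaves numeric columns as they are.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({'Yr Sold': [2008, 2009, 2008], 'Lot Area': [8450, 9600, 11250]})

# With integer dtype, get_dummies passes both columns through unchanged.
as_numbers = pd.get_dummies(toy)

# After casting to object, 'Yr Sold' becomes one indicator column per year.
toy['Yr Sold'] = toy['Yr Sold'].astype('object')
as_categories = pd.get_dummies(toy)
```

Without the cast, the year would be treated as a quantity with a linear effect on price rather than as a label.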

In [64]:
kaggle_data.drop(columns=['Garage Yr Blt', 'Fireplaces'], inplace=True)

I am dropping the Garage Yr Blt column because most garages are built at the same time as the house, which makes the column redundant. I am also dropping the Fireplaces column, as I believe it is extraneous as well.

In [65]:
# Collapse the 1-10 ratings into three bands, assigning the result back to the column.
for column in ['Overall Qual', 'Overall Cond']:
    kaggle_data[column] = (kaggle_data[column]
                           .replace([1, 2, 3], 'Bad')
                           .replace([4, 5, 6, 7], 'Good')
                           .replace([8, 9, 10], 'Excellent'))

This collapses the ratings in Overall Qual and Overall Cond into three categories: Bad (1 to 3), Good (4 to 7), and Excellent (8 to 10).
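
The same banding could also be written with pd.cut, shown here as an alternative sketch on a made-up Series rather than as what this notebook does. Note that pd.cut needs numeric input, so it would have to run before the columns are cast to object.

```python
import pandas as pd

# Toy ratings on the 1-10 scale; invented for illustration.
ratings = pd.Series([1, 3, 4, 7, 8, 10])

# Bin edges are right-inclusive: (0, 3] -> Bad, (3, 7] -> Good, (7, 10] -> Excellent.
binned = pd.cut(
    ratings,
    bins=[0, 3, 7, 10],
    labels=['Bad', 'Good', 'Excellent'],
).astype('object')
```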

Thus, I have replicated on the test data all of the data cleaning that I did on the training data.

In [66]:
kaggle_dummies = pd.get_dummies(kaggle_data)

In [67]:
training_dummies = pd.read_pickle("../datasets/training_data_cleaned_X.pkl")

In [68]:
# Add every dummy column that exists in the training data but not in the Kaggle data.
for column in training_dummies.columns.difference(kaggle_dummies.columns):
    kaggle_dummies[column] = 0

In [69]:
# Drop every dummy column that the training data does not have.
kaggle_dummies.drop(columns=kaggle_dummies.columns.difference(training_dummies.columns), inplace=True)

In [70]:
# Put the columns in the same order as the training data.
kaggle_dummies = kaggle_dummies[training_dummies.columns]
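
The three alignment steps above (add missing columns, drop extra columns, reorder) can be sketched as a single reindex call. The column names and values below are hypothetical stand-ins, not the real dummy columns.

```python
import pandas as pd

# Hypothetical column sets: training has a level the test set lacks, and vice versa.
training_columns = ['Lot Area', 'Fence_No fence', 'Fence_MnPrv']
toy_test = pd.DataFrame({
    'Lot Area': [8450, 9600],
    'Fence_No fence': [1, 0],
    'Fence_GdWo': [0, 1],   # level never seen in training
})

# reindex keeps only the training columns, in training order,
# and creates any that are missing, filled with 0.
aligned = toy_test.reindex(columns=training_columns, fill_value=0)
```

On the real data this would read kaggle_dummies.reindex(columns=training_dummies.columns, fill_value=0).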

In [71]:
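# Remove features with variance of at most 0.05; fit on the training data, then apply to the Kaggle data.
threshold = VarianceThreshold(.05)
training_threshold = threshold.fit_transform(training_dummies)
kaggle_threshold = threshold.transform(kaggle_dummies)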

In [72]:
# Refit the scaler on the thresholded training features, then scale the Kaggle data with it.
training_threshold_ss = ss.fit_transform(training_threshold)
kaggle_threshold_ss = ss.transform(kaggle_threshold)

In [73]:
ridge_final = RidgeCV(cv=5)
ridge_final.fit(training_threshold_ss, y)

RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=5, fit_intercept=True,
    gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)

This fits the ridge regression model to my training data and sale prices, using 5-fold cross-validation to choose the regularization strength alpha from the default candidates (0.1, 1, and 10).
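
The threshold, scaling, and ridge steps could also be chained in a scikit-learn pipeline, so that each step is fit on the training data and then reused unchanged on new data. The following is a sketch on synthetic data, not the notebook's actual workflow. The arrays X_toy, y_toy, and X_new are assumptions standing in for training_dummies, y, and kaggle_dummies.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for the training features, target, and Kaggle features.
X_toy = rng.normal(size=(100, 4))
X_toy[:, 3] = 1.0  # constant column, removed by the variance threshold
y_toy = X_toy[:, 0] * 3.0 + rng.normal(scale=0.1, size=100)
X_new = rng.normal(size=(5, 4))

# Same three steps as the cells above, chained in order.
model = make_pipeline(VarianceThreshold(0.05), StandardScaler(), RidgeCV(cv=5))
model.fit(X_toy, y_toy)                  # every step is fit on the training data only
toy_predictions = model.predict(X_new)   # the same fitted steps are reused here

# Boolean mask of the features that survived the threshold.
kept_features = model.named_steps['variancethreshold'].get_support()
```

A pipeline makes it harder to accidentally fit the threshold or the scaler on the Kaggle data.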

In [74]:
predictions = ridge_final.predict(kaggle_threshold_ss)

This generates an array of predicted sale prices, using my Kaggle data as the predictor variables.

In [75]:
kaggle_submissions = pd.DataFrame(predictions, index=kaggle_data.index, columns=['SalePrice'])
kaggle_submissions.sort_index(inplace=True)
kaggle_submissions.to_csv('../datasets/kaggle_submission.csv')

This creates the CSV file I will submit to Kaggle.
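
Before uploading, it can help to read the file back and check its shape. The sketch below does this on invented predictions written to an in-memory buffer instead of the real file. It assumes the expected header is Id followed by SalePrice, which matches what the cell above writes.

```python
import io
import pandas as pd

# Invented predictions standing in for the real output of ridge_final.predict.
toy_submission = pd.DataFrame(
    {'SalePrice': [181000.0, 142500.0, 215300.0]},
    index=pd.Index([2658, 2, 2718], name='Id'),
).sort_index()

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy_submission.to_csv(buffer)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Three basic checks on the file contents.
header_ok = list(round_trip.columns) == ['Id', 'SalePrice']
no_missing = bool(round_trip['SalePrice'].notnull().all())
ids_unique = bool(round_trip['Id'].is_unique)
```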

Thus, I have successfully applied the production model to my Kaggle data and generated a CSV of predictions to submit.