# Preprocessing and Dummying

The purpose of this notebook is to split our data into X and y training and test sets, so that we can fit our models on one part of the data and check their predictions on the other. Additionally, we will dummy out the columns containing categorical variables, so that we can see the effect each category has on our prediction as a whole.

In [1]:
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
import pickle
import csv
import re
import time
np.random.seed(42)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

This imports all of the libraries we need. 

In [2]:
ames_training_data_cleaned = pd.read_pickle("../datasets/training_data_cleaned.pkl")

This reads in the pickle of the cleaned training data that we created in 01_Cleaning_and_EDA.

In [3]:
ames_training_data_cleaned.set_index('Id', inplace=True)

This sets the index of the cleaned training data equal to the Id column.  

In [4]:
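ames_training_data_cleaned.drop(columns='PID', inplace=True)

This drops the PID column. It is an identifier rather than a feature, so it does not contribute to our model in any meaningful way. The column is named with the `columns` keyword because recent pandas versions no longer accept the axis as a positional argument.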

In [5]:
ames_training_data_cleaned_dummies = pd.get_dummies(ames_training_data_cleaned)

This dummies out the categorical variables into their own separate columns. 
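As a minimal sketch of what dummying does, the toy frame below (its column names and values are illustrative, not taken from our data) shows that numeric columns pass through unchanged while each category of a text column becomes its own indicator column.

```python
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

toy_dummies = pd.get_dummies(toy)

# The numeric column is kept as-is; "Street" is replaced by one
# indicator column per category, named <column>_<category>.
print(list(toy_dummies.columns))
print(toy_dummies.shape)
```

This is why the column count grows so much after dummying: every distinct category in every text column adds a column.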

In [6]:
ames_training_data_cleaned_dummies.shape

(2051, 377)

The number of columns has gone up from 81 to 377 as a result of the dummying.

In [7]:
X = ames_training_data_cleaned_dummies.drop(columns='SalePrice')
y = ames_training_data_cleaned_dummies.SalePrice

This sets the predictor variables to every variable that isn't the sale price, and the predicted variable to the sale price. 

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

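This splits the training data into a train set and a test set at a ratio of 0.75 to 0.25, which is the default for train_test_split. Because no random_state is passed, the shuffle draws on the global NumPy seed set in the first cell, so the split is only reproducible when the notebook is run from the top.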
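The default split sizes can be checked on toy data. This is a sketch only: the arrays are made up, and passing `random_state` directly is an assumption for the example, not something this notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features (illustrative only).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# With no test_size given, 25% of the rows go to the test set.
# random_state is passed here only to make the sketch self-contained.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

print(X_tr.shape, X_te.shape)
```

Passing `random_state` to the call itself is one way to make a split repeatable regardless of what else has consumed the global seed.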

In [9]:
threshold = VarianceThreshold(.05)
X_train_thresh = threshold.fit_transform(X_train)
X_test_thresh = threshold.transform(X_test)

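This removes all features with a variance lower than 0.05. The threshold is fit on the train set only, and the same columns are then removed from the test set. Since this happens before scaling, the variance is measured in each feature's original units.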
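For a 0/1 dummy column the variance is p * (1 - p), where p is the share of rows in that category, so a threshold of 0.05 corresponds to categories covering roughly 5% of rows or fewer (or, symmetrically, roughly 95% or more). The sketch below illustrates this with two made-up dummy columns; the data is hypothetical.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

n_rows = 100

# Hypothetical dummy columns: one category present in 3% of rows,
# another present in 30% of rows.
rare = np.zeros(n_rows)
rare[:3] = 1
common = np.zeros(n_rows)
common[:30] = 1
X_toy = np.column_stack([rare, common])

selector = VarianceThreshold(0.05).fit(X_toy)

# Variances are 0.03 * 0.97 and 0.30 * 0.70 respectively.
print(selector.variances_)
# Boolean mask of the columns that pass the threshold.
print(selector.get_support())
```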

In [10]:
columns = X.columns[threshold.get_support()]

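This preserves an ordered list of the names of the columns that passed the variance threshold. We need it because fit_transform returns a NumPy array with no column labels.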
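A small sketch of how the saved names can be used, assuming a toy frame with made-up columns: the mask from `get_support()` picks out the surviving names, which can then relabel the unlabeled array.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "GrLivArea": [1500, 1800, 2100, 1200],
    "Constant": [1, 1, 1, 1],      # zero variance, will be dropped
    "Garage_Yes": [0, 1, 1, 0],
})

selector = VarianceThreshold(0.05)
kept = selector.fit_transform(toy)  # plain ndarray, labels are lost

# get_support() returns a boolean mask in the original column order.
kept_columns = toy.columns[selector.get_support()]

# Reattach the labels to the filtered array.
kept_df = pd.DataFrame(kept, columns=kept_columns, index=toy.index)
print(list(kept_df.columns))
```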

In [11]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train_thresh)
X_test_sc = ss.transform(X_test_thresh)

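This standardizes each feature to a mean of 0 and a standard deviation of 1, using the mean and standard deviation of the train set for both the train and test sets. It puts the features on a common scale but does not change the shape of their distributions, so a skewed feature stays skewed.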
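To illustrate, the sketch below standardizes a deliberately right-skewed toy feature (randomly generated, not our data). After scaling, the train set has mean 0 and standard deviation 1, but the median still sits below the mean, which shows the skew is preserved.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature (exponential draws).
train = rng.exponential(scale=2.0, size=(1000, 1))
test = rng.exponential(scale=2.0, size=(200, 1))

scaler = StandardScaler()
train_sc = scaler.fit_transform(train)  # learn mean/std from train only
test_sc = scaler.transform(test)        # reuse the train statistics

print(train_sc.mean(), train_sc.std())  # approximately 0 and 1
print(np.median(train_sc))              # negative: still right-skewed
```

The test set will generally not have a mean of exactly 0 after scaling, since it is transformed with the train set's statistics.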

In [12]:
X.to_pickle("../datasets/training_data_cleaned_X.pkl")
y.to_pickle("../datasets/training_data_cleaned_y.pkl")
np.save('../datasets/columns', columns)
np.save('../datasets/X_train_sc', X_train_sc)
np.save('../datasets/X_test_sc', X_test_sc)
np.save('../datasets/y_train', y_train)
np.save('../datasets/y_test', y_test)
# Use a context manager so the file is closed once the scaler is written.
with open('../datasets/ss.sav', 'wb') as f:
    pickle.dump(ss, f)

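This saves the full X and y, the names of the retained columns, the scaled train and test sets, the train and test targets, and the fitted scaler, so that they can be reused later.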
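A sketch of how files saved this way can be read back, using a temporary directory and stand-in objects rather than our actual files. Two details are worth noting: `np.save` appends `.npy` to the file name, and an array of column-name strings stored with object dtype needs `allow_pickle=True` when loading. That our saved column array has object dtype is an assumption here.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in objects (hypothetical, not the notebook's real data).
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))
columns = np.array(["LotArea", "Street_Pave"], dtype=object)

with tempfile.TemporaryDirectory() as tmp:
    # Save: np.save writes "columns.npy" even though we pass "columns".
    np.save(os.path.join(tmp, "columns"), columns)
    with open(os.path.join(tmp, "ss.sav"), "wb") as f:
        pickle.dump(scaler, f)

    # Load: object arrays require allow_pickle=True.
    loaded_columns = np.load(os.path.join(tmp, "columns.npy"), allow_pickle=True)
    with open(os.path.join(tmp, "ss.sav"), "rb") as f:
        loaded_scaler = pickle.load(f)

print(list(loaded_columns))
# The reloaded scaler applies the statistics it was fit with.
print(loaded_scaler.transform(np.array([[2.0]])))
```

Pickle files should only be loaded from sources we trust, since unpickling can execute arbitrary code.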