# Predict the Sale Price of a Property

### W207 Final Project
#### Surya, Ian Anderson, Allison Godfrey, and Jackie Ma

### 1. Problem Statement

We use machine learning to predict the sale price of a home from its relevant features. The goal of this notebook is to understand the data, extract the most relevant features, and fit several machine learning models at various regularization strengths, then compare their errors and choose the best model accordingly.

The components of our notebook are as follows: 
1. Exploratory Data Analysis 
2. Data Cleaning 
3. Feature Engineering
4. Encoding Categorical Features
5. Outlier Analysis
6. Machine Learning Models and Assessment

https://www.kaggle.com/c/house-prices-advanced-regression-techniques  
 

### 2. Exploratory Data Analysis (EDA)

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from scipy import stats
from scipy.stats import norm
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.svm import LinearSVR
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix
from sklearn import metrics
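# Set the default matplotlib style to seaborn's darkgrid.
# The style was renamed 'seaborn-v0_8-darkgrid' in matplotlib 3.6, so use whichever name is available.
darkgrid_style = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in mpl.style.available else 'seaborn-darkgrid'
mpl.style.use(darkgrid_style)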
base_color='#436BAD'
red_color='#990000'

In [None]:
# Load Data
pd.options.display.width=None
pd.options.display.max_columns = None
#Train data
train = pd.read_csv('./house-prices-data/train.csv',index_col=0)

#Test data
test = pd.read_csv('./house-prices-data/test.csv',index_col=0)

In [None]:
# Train data size
print(train.shape)

In [None]:
#Sample Train data
train.head()

In [None]:
# Test data size
print(test.shape)

In [None]:
#Sample Test data
test.head()

#### 2.1 Missing Values            

In [None]:
# Count and percentage of missing values per column in the train data
train_missing = train.isnull().sum()
train_missing = pd.DataFrame(train_missing[train_missing > 0])
train_missing.columns = ['Count']
train_missing.sort_values(by='Count', ascending = False, inplace=True)
train_missing['Percent'] = round((train_missing['Count'] /  len(train.index))* 100, 2) 
plt.figure(figsize=(15, 5))   
plt.subplot(1,2, 1)
print(train_missing)
plt.bar(train_missing.index,train_missing['Count'],color=base_color)
plt.xticks(rotation=90)
plt.ylabel('Count')
plt.title('Missing Values - Count')
plt.subplot(1,2, 2)
plt.bar(train_missing.index,train_missing['Percent'],color=base_color)
plt.xticks(rotation=90)
plt.ylabel('Percentage')
plt.title('Missing Values - Percentage')

#### 2.2  Feature Selection

(Jacky): Select 25 features and apply L1 analysis.

#### 2.2.1 Exclude columns

In [None]:
# From section 2.1 there are 6 features above 10% NA. Let us drop them. Most of them are above 50% NA
excluded_columns = ['PoolQC','MiscFeature','Alley','Fence','FireplaceQu','LotFrontage']
train_new= train.drop(excluded_columns, axis=1)
test_new= test.drop(excluded_columns ,axis=1)
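
The excluded columns above are listed by hand from the output of section 2.1. As a minimal sketch, the same list could be derived from the data with a missing-value threshold. The helper name `columns_above_missing_threshold` and the toy frame are hypothetical and only stand in for the train data.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(df, threshold=0.10):
    """Return the columns whose fraction of missing values exceeds threshold."""
    missing_fraction = df.isnull().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()

# Hypothetical toy frame, NOT the Kaggle data
toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd', np.nan],        # 80% missing
    'LotArea': [8450, 9600, 11250, 9550, 14260],             # 0% missing
    'GarageType': ['Attchd', 'Attchd', np.nan, 'Detchd', 'Attchd'],  # 20% missing
})

to_drop = columns_above_missing_threshold(toy, threshold=0.10)
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Applied to the notebook's data, the same call on `train` would be expected to return the six columns named above, assuming the 10% cutoff.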

#### 2.2.2  Update Data

In [None]:
# Numerical and Categorical (type=object) columns/features
numerical_cols = train_new.select_dtypes(exclude=['object'])
categorical_cols = train_new.select_dtypes(include=['object'])

print('Numeric columns info')
print('Total:', len(numerical_cols.columns))
print('Column Names:', numerical_cols.columns)
print('\n')
print('Categorical columns info')
print('Total:', len(categorical_cols.columns))
print('Column Names:', categorical_cols.columns)
# Fill missing categorical values with the string 'NA'
for col in categorical_cols.columns:
    train_new[col] = train_new[col].fillna('NA')
    test_new[col] = test_new[col].fillna('NA')

# Fill missing numerical values with 0 (SalePrice exists only in the train data)
for col in numerical_cols.columns:
    train_new[col] = train_new[col].fillna(0)
    if col != 'SalePrice':
        test_new[col] = test_new[col].fillna(0)

#Re-initialize the columns with updated data
numerical_cols = train_new.select_dtypes(exclude=['object'])
categorical_cols = train_new.select_dtypes(include=['object'])

In [None]:
train_new

###### TODO: Analyze the categorical columns and write a paragraph summarizing our understanding.
(Ian)

#### 2.2.3  Features with Highest Correlation with Sale Price

In [None]:
# Spearman's rank correlation of each numerical feature with SalePrice
# (categorical columns still hold strings at this point, so they are skipped)
numeric_features = numerical_cols.columns
spearman_rank = pd.DataFrame()
spearman_rank['Feature'] = numeric_features
spearman_rank['Spearman Rank'] = [train_new[f].corr(train_new['SalePrice'], method='spearman') for f in numeric_features]
spearman_rank = spearman_rank.sort_values('Spearman Rank')

plt.figure(figsize=(6, 0.25*len(numeric_features)))
sns.barplot(data=spearman_rank, y='Feature', x='Spearman Rank', orient='h', palette="RdBu")

In [None]:
# Heat map for the 25 numerical features most correlated with SalePrice
corr_matrix = numerical_cols.corr()
high_corr_cols = corr_matrix.nlargest(26, 'SalePrice')['SalePrice'].index # Add 1 as it includes SalePrice
cm = np.corrcoef(train_new[high_corr_cols].values.T)
sns.set(font_scale=1.1)
fig, ax = plt.subplots(figsize=(12,12))
hm = sns.heatmap(cm,annot=True, square=True, fmt='.2f', 
                 annot_kws={'size': 10}, yticklabels=high_corr_cols.values, 
                 xticklabels=high_corr_cols.values, ax = ax, cmap="Blues", linewidths=1, 
                 linecolor=base_color, cbar=False)
plt.show()

TODO: Provide analysis based on the heat map and Spearman's rank.

#### 2.3 Distributions

In [None]:
# Show the distribution plots for numerical features.
fig = plt.figure(figsize=(20,30))
cell_no = 1
for column_name in numerical_cols.columns:
    plt.subplot(9, 4, cell_no)
    sns.distplot(numerical_cols[column_name], hist=True, kde=True, 
                 hist_kws = {'color':base_color, 'alpha':0.9},  
                 kde_kws = {'color':red_color, 'alpha':0.9},label=column_name)
    cell_no+=1
# Adjust the spacing once all subplots exist
fig.tight_layout()

TODO: Provide analysis (Surya)

#### 2.4 Analysis on 'Sale Price' (Outcome / Dependent / Target) variable

In [None]:
# Plot the Sale Price distribution
fig = plt.figure(figsize=(20, 6))
ax1 = fig.add_subplot(121)
sns.distplot(train_new['SalePrice'], hist=True, kde=True, 
                 hist_kws = {'color':base_color, 'alpha':0.9},  
                 kde_kws = {'color':red_color, 'alpha':0.9}, ax=ax1)
# Set each title on its own axes so the first is not overwritten by the second
ax1.set_title("Distribution")
ax2 = fig.add_subplot(122)
stats.probplot(train_new['SalePrice'], plot=ax2)
ax2.get_lines()[0].set_color(base_color)
ax2.get_lines()[1].set_color(red_color)
ax2.set_title("Probability Plot");

TODO: Provide analysis of the right-skewed distribution.

In [None]:
# Apply log transformation to the SalePrice outcome variable
fig = plt.figure(figsize=(20, 6))
ax1 = fig.add_subplot(121)
sns.distplot(np.log(train_new['SalePrice']),   hist=True, kde=True, 
                 hist_kws = {'color':base_color, 'alpha':0.9},  
                 kde_kws = {'color':red_color, 'alpha':0.9}, ax=ax1)
# Set each title on its own axes so the first is not overwritten by the second
ax1.set_title("Distribution after Log transformation")

ax2 = fig.add_subplot(122)
stats.probplot(np.log(train_new['SalePrice']), plot=ax2)
ax2.get_lines()[0].set_color(base_color)
ax2.get_lines()[1].set_color(red_color)
ax2.set_title("Probability after Log transformation");

TODO: Provide analysis of the approximately normal distribution after log transformation.
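
One way to back the two plots with a number is to compare the sample skewness before and after the log transformation. The sketch below uses synthetic log-normal values as a stand-in for `SalePrice` (an assumption, not the Kaggle data); in the notebook the same two calls could be made on `train_new['SalePrice']`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" drawn from a log-normal distribution (NOT the Kaggle data)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=2000)

skew_raw = stats.skew(prices)          # clearly positive for a right-skewed sample
skew_log = stats.skew(np.log(prices))  # close to 0 when the log values are roughly normal
print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness near zero after the transformation is consistent with, but does not prove, normality; the probability plot remains the more informative check.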

#### 2.5    Target feature ('Sale Price') and Predictors

In [None]:
plt.figure(figsize=(30, 30))

n = 1
for col in numerical_cols.columns:  
    scatter = plt.subplot(6, 6, n)
    plt.scatter(train_new[col], train_new["SalePrice"], color = 'b')
    plt.xlabel(col)
    plt.ylabel("Sale Price")
    n+=1
plt.show()

TODO: Analysis

#### 2.6   Exclude Outliers

In [None]:
#TODO: For the selected features, drop the Outliers with certain conditions.
# (Ian) - Summary statistics and box plots
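
The outlier criterion has not been chosen yet. As one possible starting point, the sketch below applies the common 1.5 × IQR rule to a single column. The rule, the helper name `drop_iqr_outliers`, and the toy values are all assumptions for illustration.

```python
import pandas as pd

def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

# Hypothetical toy rows, NOT the Kaggle data; the last row has an unusually large living area
toy = pd.DataFrame({
    'GrLivArea': [1200, 1500, 1700, 1600, 1400, 5600],
    'SalePrice': [150000, 180000, 210000, 200000, 170000, 160000],
})

filtered = drop_iqr_outliers(toy, 'GrLivArea')
print(filtered)
```

Outliers should be removed from the train data only, since the test rows must all receive a prediction.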

#### 2.7  Create Dev data

In [None]:
# TODO: Split the train data into train (80%) and dev (20%) sets for model building.
# The label is the SalePrice column. Randomize the rows before the 80/20 split if possible.

#Output should be train_data, train_labels, dev_data, dev_labels. No change to test_data. 
# (Jacky)
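
A minimal sketch of the split described in the TODO, assuming scikit-learn's `train_test_split` with shuffling. The toy frame and the fixed `random_state` are illustrative assumptions; in the notebook the frame would be `train_new`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical toy frame standing in for the cleaned train data
toy = pd.DataFrame({
    'GrLivArea': rng.integers(800, 3000, size=100),
    'OverallQual': rng.integers(1, 11, size=100),
})
toy['SalePrice'] = 50000 + 80 * toy['GrLivArea'] + 10000 * toy['OverallQual']

features = toy.drop(columns='SalePrice')
labels = toy['SalePrice']

# shuffle=True randomizes the rows before the 80/20 split; random_state makes it repeatable
train_data, dev_data, train_labels, dev_labels = train_test_split(
    features, labels, test_size=0.2, shuffle=True, random_state=42)
print(train_data.shape, dev_data.shape)
```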

### 3. Model building

##### L1 Analysis

TODO: Run each model with different L1 values
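
For the linear case, one way to vary the L1 strength is scikit-learn's `Lasso`, whose `alpha` parameter scales the L1 penalty. The sketch below fits it on synthetic data (an assumption, not the Kaggle data) and counts the non-zero coefficients at each `alpha`; the particular `alpha` grid is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Synthetic target: only the first three features carry signal
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Standardize so the penalty treats every feature on the same scale
X_scaled = StandardScaler().fit_transform(X)

nonzero_counts = {}
for alpha in [0.001, 0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} non-zero coefficients")
```

Stronger penalties drive more coefficients to exactly zero, which is what makes L1 useful for feature selection. Tree-based models have no direct L1 term, so this sketch applies only to the linear models.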

#### 3.1 Linear Regression      

In [None]:
# TODO

#### 3.2 KNN Regression   

In [None]:
# TODO

#### 3.3 Random Forest  

In [None]:
# TODO

#### 3.4 Gradient Boosting   

In [None]:
# TODO

#### 3.5 Support Vector Regression 

In [None]:
# TODO This is from Week 8

#### 3.6 XGBoost Regression 

In [None]:
# TODO 

#### 3.7 Models comparison (RMSE comparison between models)

In [None]:
# TODO: Provide a simple bar chart comparing the RMSE (and optionally the R squared value) of each model.
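
A sketch of how the comparison could be assembled, assuming each model is fit on the train split and scored by RMSE on the dev split. The data here is synthetic (from `make_regression`), the three models are only a subset of those planned in section 3, and the results say nothing about the actual housing data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the prepared train/dev data
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=0),
    'Gradient Boosting': GradientBoostingRegressor(random_state=0),
}

rmse_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_dev)
    # RMSE is the square root of the mean squared error on the dev split
    rmse_by_model[name] = float(np.sqrt(mean_squared_error(y_dev, predictions)))

# Sort ascending so the best (lowest RMSE) model comes first
rmse_table = pd.Series(rmse_by_model).sort_values()
print(rmse_table)
```

Calling `rmse_table.plot.bar()` on the resulting series would draw the bar chart mentioned in the TODO.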

### 4. Summary

###### TODO

### 5. References

TODO