# Kaggle Housing Prices competition

describe here the competition

## Goal:
Get familiar with the overall process of solving an ML problem.
My purpose here is not necessarily to reach the best score, but rather to select a few relevant features, train a model and submit predictions, using a clean workflow built on a pipeline.
More specifically, I want to practice cross-validation, pipelines and fine-tuning the model.

# Load the data

In [1]:
# Import standard libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import os

In [3]:
os.path.join("dataset","housing")

'dataset\\housing'

In [7]:
cwd = os.getcwd()
cwd

'C:\\Users\\flore\\KaggleHousingCompetition'

In [5]:
def load_housing_train():
    csv_path = os.path.join("house-prices-advanced-regression-techniques","train.csv")
    return pd.read_csv(csv_path)

def load_housing_test():
    csv_path = os.path.join("house-prices-advanced-regression-techniques", "test.csv")
    return pd.read_csv(csv_path)

In [6]:
housing = load_housing_train()
housing_test = load_housing_test()

FileNotFoundError: [Errno 2] File b'house-prices-advanced-regression-techniques\\train.csv' does not exist: b'house-prices-advanced-regression-techniques\\train.csv'
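
The error above means the relative path was resolved against the current working directory and the data folder was not found there. Below is a minimal sketch of a loader that fails early with a more readable message. The helper name `load_csv` and the `data_dir` parameter are hypothetical; the folder name is simply the one used in the loader functions above.

```python
from pathlib import Path

import pandas as pd

# Folder name taken from the loader functions above; adjust to your layout.
DATA_DIR = Path("house-prices-advanced-regression-techniques")


def load_csv(filename, data_dir=DATA_DIR):
    """Read a CSV from data_dir, reporting where it was looked for if missing."""
    csv_path = Path(data_dir) / filename
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path.resolve()} not found; working directory is {Path.cwd()}"
        )
    return pd.read_csv(csv_path)
```

With this helper, `load_csv("train.csv")` and `load_csv("test.csv")` would play the role of the two loader functions, provided the data folder sits next to the notebook.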

In [None]:
housing.head()

In [None]:
housing_test.head()

In [None]:
print("Train data shape:", housing.shape)
print("Test data shape:", housing_test.shape)

In [None]:
# Information about training set:
# info() prints its report itself and returns None, so no print() is needed
housing.info()

In [None]:
# Information about test set:
# info() prints its report itself and returns None, so no print() is needed
housing_test.info()

### Quick feedback
There are many features, so we will select only some of them to train a first model.
There are also many missing values. Some of them will need care: for example, a missing pool size may simply mean that the house has no pool, so it cannot be filled in like an ordinary gap.

# Discover and visualize the data to gain insights

## Analyzing "SalePrice"

In [None]:
housing["SalePrice"].hist(bins=20)

In [None]:
np.log(housing["SalePrice"]).hist(bins=20)

In [None]:
# Plotting the distribution with seaborn

sns.distplot(housing['SalePrice'])

The distribution has a very long right tail. We should probably use the logarithm of SalePrice to make the distribution more symmetric.
Apart from that, there are no other problems such as null or negative sale prices.

In [None]:
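# Distribution of the logarithm of the sale price.
# The y-axis is a probability density, not a probability, so it can exceed 1:
# the log prices span only a few units and the area under the curve must equal 1.

sns.distplot(np.log(housing['SalePrice']))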
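
To put a number on the effect of the logarithm on a long tail, one option is to compare the skewness before and after the transform. The sketch below uses synthetic log-normal values as a stand-in for SalePrice (the parameters are made up), so it only illustrates the idea and says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for SalePrice: log-normal values have a long right tail.
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

raw_skew = prices.skew()          # clearly positive for a long right tail
log_skew = np.log(prices).skew()  # close to 0 for a symmetric distribution

print(f"skewness of raw values: {raw_skew:.2f}")
print(f"skewness of log values: {log_skew:.2f}")
```

On the real data, the same two lines applied to `housing["SalePrice"]` would show whether the log transform helps as much as the histograms suggest.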

Let's separate the numerical and categorical features

In [None]:
num_features = housing.select_dtypes(include='number').columns.to_list()
cat_features = housing.select_dtypes(exclude='number').columns.to_list()

In [None]:
# %d formats the counts as integers (%f would print 38.000000-style floats)
print("Number of numerical features: %d" % len(num_features))
print("Number of categorical features: %d" % len(cat_features))

In [None]:
# Let's verify that there is no data type mismatch in the categorical features

#housing[cat_features[:19]].head()
housing[cat_features[-19:]].head()

The data types are well assigned for the categorical features.

## Correlation matrix to determine which numerical features should be considered

In [None]:
corr_matrix = housing[num_features].corr()
corr_matrix["SalePrice"].sort_values(ascending=False)

In [None]:
plt.subplots(figsize=(12, 9))
sns.heatmap(corr_matrix)

In the last row of the heatmap we can see which features correlate well with SalePrice.
Keep in mind that this heatmap only captures *linear* relationships. More complex relationships will need to be found in other ways.
Several features are strongly correlated with SalePrice. We should check whether these features are also correlated with each other, to avoid multicollinearity.
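
One way to check whether features are correlated with each other is to list the pairs whose absolute correlation exceeds a threshold. This is a minimal sketch on a toy DataFrame; the column names, the data and the 0.8 threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
area = rng.normal(1500, 300, n)
toy = pd.DataFrame({
    "area": area,
    "rooms": area / 250 + rng.normal(0, 0.3, n),         # nearly a function of area
    "year": rng.integers(1950, 2010, n).astype(float),   # independent of the others
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair appears once.
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)
pairs = corr.where(mask).stack()
high_pairs = pairs[pairs > 0.8]  # 0.8 is an arbitrary threshold
print(high_pairs)
```

Applied to `housing[num_features]`, the same steps would list candidate pairs from which only one feature needs to be kept.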

Here are the most correlated features:
* OverallQual : Rates the overall material and finish of the house, from 1 (very poor) to 10 (excellent)
* GrLivArea: Above grade (ground) living area square feet
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* TotalBsmtSF: Total square feet of basement area
* 1stFlrSF: First Floor square feet 
* FullBath: Full bathrooms above grade (the number of full bathrooms above ground level, if I understand correctly)
* TotRmsAbvGrd: Total rooms above grade **does not include bathrooms**
* YearBuilt: Original year of construction
* YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

About the surface-related features:
There are 4 different features: GrLivArea; TotalBsmtSF; 1stFlrSF; 2ndFlrSF

To clarify, GrLivArea is the living area and appears to be equal to the sum of 1stFlrSF and 2ndFlrSF (this is checked below), so the 1st floor corresponds to the ground floor. TotalBsmtSF is the basement area, which I assume roughly matches the surface of the house at ground level.

GarageCars and GarageArea are strongly correlated with each other. We will keep only GarageCars since it is more correlated with SalePrice.

The total number of rooms does not include bathrooms, so we can keep these 2 features as they are.
YearBuilt is correlated with SalePrice, so we may need to do some time series analysis (which is not what I first expected for this challenge).

In [None]:
housing["FullBath"].value_counts()

In [None]:
# Just checking the assumptions made for the surface-related features
surfaces_features = ["GrLivArea", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF"]
# Work on a copy so that adding a column does not trigger SettingWithCopyWarning
housing_surfaces = housing[surfaces_features].copy()
housing_surfaces["FloorSurfacesSum"] = housing_surfaces["1stFlrSF"] + housing_surfaces["2ndFlrSF"]
housing_surfaces.head(10)

In [None]:
# Let's analyze the impact of YearBuilt and YearRemodAdd

housing_by_YearBuilt = housing.groupby(by="YearBuilt").SalePrice.mean()
housing_by_YearRemodAdd = housing.groupby(by="YearRemodAdd").SalePrice.mean()

plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.title("SalePrice per YearBuilt")
housing_by_YearBuilt.plot()
plt.subplot(122)
plt.title("SalePrice per YearRemodAdd (= year remodeled)")
housing_by_YearRemodAdd.plot()
plt.show()

In [None]:
# Plotting the distribution of YearBuilt
sns.distplot(housing['YearBuilt'])

As we can see, there is a positive correlation between YearBuilt and SalePrice. Houses built before 1900 are very expensive, but according to the distribution plot there are not many of them.
However, it is more difficult to draw a conclusion for YearRemodAdd because it is set to the YearBuilt value if the house was not remodeled. Since YearBuilt is more correlated with SalePrice, we will not consider YearRemodAdd.
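
Because YearRemodAdd equals YearBuilt when no remodeling took place, the two columns could be combined into features that separate remodeled houses from the others. The sketch below uses made-up rows, and the derived column names `Remodeled` and `YearsToRemod` are hypothetical; this is only one possible way to explore the question, not a step of this notebook.

```python
import pandas as pd

# Toy rows using the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "YearBuilt":    [1950, 1975, 2003, 1920],
    "YearRemodAdd": [1950, 1998, 2003, 2005],
})

# Hypothetical derived features.
toy["Remodeled"] = (toy["YearRemodAdd"] != toy["YearBuilt"]).astype(int)
toy["YearsToRemod"] = toy["YearRemodAdd"] - toy["YearBuilt"]
print(toy)
```

Grouping SalePrice by such a flag would show whether remodeling carries information that YearBuilt alone does not.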

In [None]:
# Creation of a list containing the selected features

selected_features = ["OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF", "1stFlrSF", "FullBath", "TotRmsAbvGrd", "YearBuilt"]

## Categorical features analysis
We will select some features that are correlated with SalePrice

Some features that should be explored:

* MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density
* LotShape: General shape of property

       Reg	Regular	
       IR1	Slightly irregular
       IR2	Moderately Irregular
       IR3	Irregular
* LandContour: Flatness of the property

       Lvl	Near Flat/Level	
       Bnk	Banked - Quick and significant rise from street grade to building
       HLS	Hillside - Significant slope from side to side
       Low	Depression
* Utilities: Type of utilities available
		
       AllPub	All public Utilities (E,G,W,& S)	
       NoSewr	Electricity, Gas, and Water (Septic Tank)
       NoSeWa	Electricity and Gas Only
       ELO	Electricity only


In [None]:
# Let's analyze the influence of MSZoning, with different methods of calculating the mean of SalePrice

housing.groupby(by="MSZoning")["SalePrice"].agg([len, min, max, sum, 'mean', lambda x: sum(x)/len(x)])

In [None]:
housing.groupby(by="MSZoning")[["SalePrice"]].mean()

The 65 houses zoned "FV" (Floating Village Residential) are significantly more expensive, while "RM" houses are cheaper.
In conclusion, we should consider this feature. One-hot encoding seems appropriate since the feature takes only a few distinct values.
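
As an illustration of one-hot encoding on such a column, here is a minimal sketch with scikit-learn's `OneHotEncoder` on toy data. The rows are made up, and `handle_unknown="ignore"` is an assumption about how to treat a category that shows up only in the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: the category "C" is deliberately absent from the training rows.
train = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL", "RH"]})
test = pd.DataFrame({"MSZoning": ["RL", "C"]})

encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()  # unseen category -> all zeros

print(encoder.categories_[0])
print(test_encoded)
```

Each known category gets its own column, and a category never seen during fitting is encoded as a row of zeros instead of raising an error.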

# Prepare the data for Machine Learning algorithms

To clarify our datasets for now:
* housing is the original training set
* housing_test is the test set

In [None]:
# Here are the selected features according to our assumptions
#selected_features = ["OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF", "1stFlrSF", "FullBath", "TotRmsAbvGrd", "YearBuilt"]

In [None]:
housing_X = housing[selected_features].copy()

In [None]:
housing_X.head()

In [None]:
housing_y = housing["SalePrice"].copy()

In [None]:
housing_y.head()

## Missing values

In [None]:
housing_X.info()

In [None]:
housing_X.isnull().any()

There are no missing values, but we will apply a simple imputer with a median strategy in case missing values appear when the data is updated.
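
To show what the median strategy does when missing values appear later, here is a small sketch with `SimpleImputer`. The column names come from the selected features, but the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Complete training data (made-up values).
train = pd.DataFrame({"GarageCars": [1.0, 2.0, 2.0, 3.0],
                      "TotalBsmtSF": [800.0, 1000.0, 1200.0, 900.0]})
# Later data with gaps.
new_data = pd.DataFrame({"GarageCars": [np.nan, 2.0],
                         "TotalBsmtSF": [1100.0, np.nan]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                      # learns one median per column
filled = imputer.transform(new_data)    # fills gaps with the training medians

print(imputer.statistics_)
print(filled)
```

The medians are learned from the training data only, so the same values are reused for any data transformed afterwards.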

## Outliers

## Create pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer



# Create pipeline for numerical features
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])

In [None]:
housing_X_prepared = num_pipeline.fit_transform(housing_X)

In [None]:
housing_X_prepared[:10]
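
When the test set is prepared later, the fitted pipeline should be reused with `transform` rather than fitted again, so that the test data is imputed and scaled with the statistics learned on the training set. A minimal sketch on made-up arrays, with the same two steps as the pipeline above:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up arrays standing in for the training and test features.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, np.nan], [4.0, 400.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

X_train_prepared = pipe.fit_transform(X_train)  # learns medians, means and stds
X_test_prepared = pipe.transform(X_test)        # reuses them, learns nothing new
print(X_test_prepared)
```

In this notebook the equivalent call would be `num_pipeline.transform(housing_test[selected_features])`.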

# Select a model and train it

## Let's try some models

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_X_prepared, housing_y)

In [None]:
# Let's check the RMSE on the training set for linear regression

from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_X_prepared)
lin_mse = mean_squared_error(housing_y, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

In [None]:
from sklearn.tree import DecisionTreeRegressor

# An unconstrained tree can memorize the training set, so a very low
# training RMSE here probably means overfitting
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_X_prepared, housing_y)

housing_predictions = tree_reg.predict(housing_X_prepared)

tree_rmse = np.sqrt(mean_squared_error(housing_y, housing_predictions))
tree_rmse 

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_X_prepared, housing_y,
                         scoring="neg_mean_squared_error", cv=5)
tree_rmse_scores = np.sqrt(-scores)
tree_rmse_scores

In [None]:
# Use housing_X_prepared (housing_X_transformed was never defined)
scores = cross_val_score(lin_reg, housing_X_prepared, housing_y,
                         scoring="neg_mean_squared_error", cv=5)
lin_rmse_scores = np.sqrt(-scores)
lin_rmse_scores
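
The cross-validation above runs on data that was scaled once on the full training set. A possible refinement is to put the preprocessing and the model in a single pipeline, so that each fold fits its own scaler on its training part only. The sketch below uses synthetic data from `make_regression` as a stand-in for `housing_X` and `housing_y`, so the numbers it prints say nothing about the real scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for housing_X and housing_y.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Scaler and model are refit together inside every fold.
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
rmse_scores = np.sqrt(-scores)

print("RMSE per fold:", rmse_scores)
print("Mean:", rmse_scores.mean(), "Std:", rmse_scores.std())
```

Reporting the mean and standard deviation of the fold scores makes it easier to compare the linear regression and the decision tree than reading the raw arrays.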

# Fine-tune a model