![Ames Housing dataset image](https://i.imgur.com/lTJVG4e.png)

# Housing Price Prediction

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

##### Expand the cell below for the data dictionary and file descriptions

## File descriptions

    train.csv - the training set
    test.csv - the test set
    data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
    sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms

## Data fields

Here's a brief version of what you'll find in the data description file.

    SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
    MSSubClass: The building class
    MSZoning: The general zoning classification
    LotFrontage: Linear feet of street connected to property
    LotArea: Lot size in square feet
    Street: Type of road access
    Alley: Type of alley access
    LotShape: General shape of property
    LandContour: Flatness of the property
    Utilities: Type of utilities available
    LotConfig: Lot configuration
    LandSlope: Slope of property
    Neighborhood: Physical locations within Ames city limits
    Condition1: Proximity to main road or railroad
    Condition2: Proximity to main road or railroad (if a second is present)
    BldgType: Type of dwelling
    HouseStyle: Style of dwelling
    OverallQual: Overall material and finish quality
    OverallCond: Overall condition rating
    YearBuilt: Original construction date
    YearRemodAdd: Remodel date
    RoofStyle: Type of roof
    RoofMatl: Roof material
    Exterior1st: Exterior covering on house
    Exterior2nd: Exterior covering on house (if more than one material)
    MasVnrType: Masonry veneer type
    MasVnrArea: Masonry veneer area in square feet
    ExterQual: Exterior material quality
    ExterCond: Present condition of the material on the exterior
    Foundation: Type of foundation
    BsmtQual: Height of the basement
    BsmtCond: General condition of the basement
    BsmtExposure: Walkout or garden level basement walls
    BsmtFinType1: Quality of basement finished area
    BsmtFinSF1: Type 1 finished square feet
    BsmtFinType2: Quality of second finished area (if present)
    BsmtFinSF2: Type 2 finished square feet
    BsmtUnfSF: Unfinished square feet of basement area
    TotalBsmtSF: Total square feet of basement area
    Heating: Type of heating
    HeatingQC: Heating quality and condition
    CentralAir: Central air conditioning
    Electrical: Electrical system
    1stFlrSF: First Floor square feet
    2ndFlrSF: Second floor square feet
    LowQualFinSF: Low quality finished square feet (all floors)
    GrLivArea: Above grade (ground) living area square feet
    BsmtFullBath: Basement full bathrooms
    BsmtHalfBath: Basement half bathrooms
    FullBath: Full bathrooms above grade
    HalfBath: Half baths above grade
    BedroomAbvGr: Number of bedrooms above basement level
    KitchenAbvGr: Number of kitchens
    KitchenQual: Kitchen quality
    TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
    Functional: Home functionality rating
    Fireplaces: Number of fireplaces
    FireplaceQu: Fireplace quality
    GarageType: Garage location
    GarageYrBlt: Year garage was built
    GarageFinish: Interior finish of the garage
    GarageCars: Size of garage in car capacity
    GarageArea: Size of garage in square feet
    GarageQual: Garage quality
    GarageCond: Garage condition
    PavedDrive: Paved driveway
    WoodDeckSF: Wood deck area in square feet
    OpenPorchSF: Open porch area in square feet
    EnclosedPorch: Enclosed porch area in square feet
    3SsnPorch: Three season porch area in square feet
    ScreenPorch: Screen porch area in square feet
    PoolArea: Pool area in square feet
    PoolQC: Pool quality
    Fence: Fence quality
    MiscFeature: Miscellaneous feature not covered in other categories
    MiscVal: Dollar value of miscellaneous feature
    MoSold: Month Sold
    YrSold: Year Sold
    SaleType: Type of sale
    SaleCondition: Condition of sale


# Setup

In [1]:
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex6 import *
print("Setup Complete")

Setup Complete


In [2]:
# for data processing
import pandas as pd
import numpy as np

# for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

# Loading Data

In [3]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

In [4]:
print ("Train Shape :" ,train.shape)
print ("Test Shape :" ,test.shape)

Train Shape : (1460, 81)
Test Shape : (1459, 80)


#### Let's look at a sample of the training data to get a feel for it

In [5]:
train.head(3).transpose()

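| | 0 | 1 | 2 |
|---|---|---|---|
| Id | 1 | 2 | 3 |
| MSSubClass | 60 | 20 | 60 |
| MSZoning | RL | RL | RL |
| LotFrontage | 65.0 | 80.0 | 68.0 |
| LotArea | 8450 | 9600 | 11250 |
| ... | ... | ... | ... |
| MoSold | 2 | 5 | 9 |
| YrSold | 2008 | 2007 | 2008 |
| SaleType | WD | WD | WD |
| SaleCondition | Normal | Normal | Normal |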


##### Summary statistics for the training data

In [6]:
train.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Id | 1460.0 | 730.5 | 421.610009 | 1.0 | 365.75 | 730.5 | 1095.25 | 1460.0 |
| MSSubClass | 1460.0 | 56.89726 | 42.300571 | 20.0 | 20.0 | 50.0 | 70.0 | 190.0 |
| LotFrontage | 1201.0 | 70.049958 | 24.284752 | 21.0 | 59.0 | 69.0 | 80.0 | 313.0 |
| LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.5 | 9478.5 | 11601.5 | 215245.0 |
| OverallQual | 1460.0 | 6.099315 | 1.382997 | 1.0 | 5.0 | 6.0 | 7.0 | 10.0 |
| OverallCond | 1460.0 | 5.575342 | 1.112799 | 1.0 | 5.0 | 5.0 | 6.0 | 9.0 |
| YearBuilt | 1460.0 | 1971.267808 | 30.202904 | 1872.0 | 1954.0 | 1973.0 | 2000.0 | 2010.0 |
| YearRemodAdd | 1460.0 | 1984.865753 | 20.645407 | 1950.0 | 1967.0 | 1994.0 | 2004.0 | 2010.0 |
| MasVnrArea | 1452.0 | 103.685262 | 181.066207 | 0.0 | 0.0 | 0.0 | 166.0 | 1600.0 |
| BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.0 | 383.5 | 712.25 | 5644.0 |


# Exploratory Data Analysis

In [7]:
import pandas_profiling
profile_report = pandas_profiling.ProfileReport(train)

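`pandas_profiling` builds a full report from a single `ProfileReport` call, which makes it a very easy way to do a quick EDA. The headline figures from the report are listed below.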

#### High level summary
* Number of variables: 81
* Number of observations: 1460
* Missing cells: 6965 (5.9%)

#### Variable types
* Numeric: 32
* Categorical: 48
* Boolean: 1
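
The headline counts above can be approximated with plain pandas when the profiling package is not available. The following is a minimal sketch under stated assumptions: `quick_overview` and the toy frame are hypothetical names and made-up values, not part of the notebook, and the dtype-based split will not match the profiler's inferred variable types exactly (for example, it has no separate "Boolean" bucket for two-valued text columns).

```python
import numpy as np
import pandas as pd


def quick_overview(df: pd.DataFrame) -> dict:
    """Return headline counts similar to a profiling report's overview."""
    n_missing = int(df.isnull().sum().sum())
    return {
        "variables": df.shape[1],
        "observations": df.shape[0],
        "missing_cells": n_missing,
        "missing_pct": round(100 * n_missing / df.size, 1) if df.size else 0.0,
        # Split by pandas dtype only; a profiler infers types more finely.
        "numeric_columns": df.select_dtypes(include="number").shape[1],
        "non_numeric_columns": df.select_dtypes(exclude="number").shape[1],
    }


# Toy frame standing in for `train` (values are made up for illustration).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550],
    "Street": ["Pave", "Pave", None, "Grvl"],
    "OverallQual": [7, 6, 7, 7],
})

overview = quick_overview(toy)
print(overview)
```

In the notebook, the same function could be called as `quick_overview(train)`.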


# Feature Engineering

#### Let's take a closer look at the missing values

In [8]:
def missingValuesInfo(df):
    total = df.isnull().sum().sort_values(ascending = False)
    percent = round(df.isnull().sum().sort_values(ascending = False)/len(df)*100, 2)
    temp = pd.concat([total, percent], axis = 1,keys= ['Total', 'Percent'])
    return temp.loc[(temp['Total'] > 0)]

missingValuesInfo(train)

| | Total | Percent |
|---|---|---|
| PoolQC | 1453 | 99.52 |
| MiscFeature | 1406 | 96.3 |
| Alley | 1369 | 93.77 |
| Fence | 1179 | 80.75 |
| FireplaceQu | 690 | 47.26 |
| LotFrontage | 259 | 17.74 |
| GarageYrBlt | 81 | 5.55 |
| GarageCond | 81 | 5.55 |
| GarageType | 81 | 5.55 |
| GarageFinish | 81 | 5.55 |


#### Let's apply a basic missing-value treatment: 'UNKNOWN' for categorical features and the median for numeric features

In [9]:
def HandleMissingValues(df):

    num_cols = [cname for cname in df.columns if df[cname].dtype in ['int64', 'float64']]
    cat_cols = [cname for cname in df.columns if df[cname].dtype == "object"]
    values = {}

    for a in cat_cols:
        values[a] = 'UNKNOWN'

    for a in num_cols:
        values[a] = df[a].median()
        
    df.fillna(value=values,inplace=True)
    
    
HandleMissingValues(train)
HandleMissingValues(test)
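
The cell above computes medians separately for `train` and `test`. A common variant, sketched here as an alternative rather than as the notebook's method, is to learn the fill values on the training data only and reuse them for the test data, so both sets are imputed with the same constants. The helper name `fit_fill_values` and the toy frames are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fit_fill_values(df: pd.DataFrame, placeholder: str = "UNKNOWN") -> dict:
    """Learn one fill value per column: median for numeric, placeholder otherwise."""
    values = {}
    for col in df.columns:
        if is_numeric_dtype(df[col]):
            values[col] = df[col].median()
        else:
            values[col] = placeholder
    return values


# Toy frames standing in for `train` and `test` (values are made up).
toy_train = pd.DataFrame({
    "LotFrontage": [60.0, None, 80.0],
    "Alley": ["Grvl", None, "Pave"],
})
toy_test = pd.DataFrame({
    "LotFrontage": [None, 50.0],
    "Alley": [None, "Grvl"],
})

# Fit on the training data only, then apply the same values to both sets.
fill_values = fit_fill_values(toy_train)
toy_train_filled = toy_train.fillna(value=fill_values)
toy_test_filled = toy_test.fillna(value=fill_values)

print(toy_test_filled)
```

Here the missing `LotFrontage` in the toy test set receives the training median (70.0) rather than a value derived from the test set itself.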

#### Encoding the categorical variables with low cardinality. "Cardinality" means the number of unique values in a column

In [10]:
# select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in train.columns if train[cname].nunique() < 10 and 
                        train[cname].dtype == "object"]
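
To make the cardinality filter concrete, here is a small sketch on made-up data. The toy frame and the threshold of 3 are assumptions chosen so the effect is visible on four rows; the notebook itself uses a threshold of 10.

```python
import pandas as pd

# Toy frame standing in for `train` (values are made up).
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Neighborhood": ["CollgCr", "Veenker", "Crawfor", "NoRidge"],
    "LotArea": [8450, 9600, 11250, 9550],
})

max_unique = 3  # hypothetical threshold for this toy example

# Count distinct values in each non-numeric column.
cardinality = toy.select_dtypes(exclude="number").nunique()

# Keep only the columns whose number of distinct values is below the threshold.
low_card = cardinality[cardinality < max_unique].index.tolist()

print(cardinality.to_dict())
print(low_card)
```

`Street` has two distinct values and is kept, while `Neighborhood` has four and is filtered out, which keeps the number of one-hot columns small.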

#### Selecting only the required columns

In [11]:
# select numeric columns
numeric_cols = [cname for cname in train.columns if train[cname].dtype in ['int64', 'float64']]

# remove Id and SalePrice (the target)
numeric_cols.remove('Id')
numeric_cols.remove('SalePrice')

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols

In [12]:
X_train = train[my_cols].copy()
X_test = test[my_cols].copy()
y_train = train['SalePrice']

#### One-hot Encoding

In [13]:
# concatenating train and test for uniform one-hot encoding
X_train['train_or_test']='train'
X_test['train_or_test']='test'
data=pd.concat([X_train,X_test],sort=False)

# resetting index, removing old index
data.reset_index(inplace=True)
data.drop('index',axis=1,inplace=True)
data = pd.get_dummies(data,columns = low_cardinality_cols)

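# split back into train and test, dropping the helper column
# (drop on the filtered result returns a new frame, avoiding SettingWithCopyWarning)
X_train = data[data['train_or_test']=='train'].drop('train_or_test',axis=1)
X_test = data[data['train_or_test']=='test'].drop('train_or_test',axis=1)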

# looking at data after encoding
print("Shape of Training Data", X_train.shape)
print("Shape of Test Data", X_test.shape)

Shape of Training Data (1460, 253)
Shape of Test Data (1459, 253)
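
Concatenating train and test is one way to get identical dummy columns. Another option, sketched below as an alternative and not as the notebook's method, is to encode each set separately and then align the test columns to the training columns with `DataFrame.align`. Unlike the concatenation approach, categories that appear only in the test set are dropped. The toy frames are made up.

```python
import pandas as pd

# Toy frames standing in for `X_train` and `X_test` (values are made up).
toy_train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})
toy_test = pd.DataFrame({
    "LotArea": [9550, 14260],
    "Street": ["Pave", "Dirt"],  # "Dirt" never appears in the training data
})

train_enc = pd.get_dummies(toy_train, columns=["Street"])
test_enc = pd.get_dummies(toy_test, columns=["Street"])

# Keep exactly the training columns: dummies missing from the test set are
# added and filled with 0, and test-only dummies are dropped.
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

After alignment both frames expose the same columns in the same order, which is what a fitted model expects at prediction time.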


# Model Building

1. Ridge CV

In [14]:
from sklearn.linear_model import RidgeCV

ridge_cv = RidgeCV(alphas=(0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10), cv=4)
ridge_cv.fit(X_train, y_train)
ridge_cv_preds = ridge_cv.predict(X_test)
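
`RidgeCV` picks one regularization strength from the supplied `alphas` by cross-validation and exposes it as the fitted attribute `alpha_`. The sketch below shows how that choice can be inspected; it uses synthetic data, so the selected value says nothing about the housing data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression problem (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Same candidate grid and fold count as in the notebook cell above.
alphas = (0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10)
model = RidgeCV(alphas=alphas, cv=4)
model.fit(X, y)

# alpha_ is the candidate with the best cross-validated score.
print("selected alpha:", model.alpha_)
print("coefficients:", np.round(model.coef_, 2))
```

In the notebook, `ridge_cv.alpha_` gives the same information for the housing model. If the selected value sits at either end of the grid, widening the grid in that direction may be worth trying.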

2. XGBoost

In [15]:
import xgboost as xgb

model_xgb = xgb.XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.2)
model_xgb.fit(X_train, y_train)
xgb_preds=model_xgb.predict(X_test)

In [16]:
# final prediction based on an ensemble: each model gets a 50% weight
predictions = ( ridge_cv_preds + xgb_preds )/2

# make the submission data frame
submission = {
    'Id': test.Id.values,
    'SalePrice': predictions
}
solution = pd.DataFrame(submission)
solution.head()

| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 112504.354972 |
| 1 | 1462 | 160570.333939 |
| 2 | 1463 | 181005.360775 |
| 3 | 1464 | 187673.901328 |
| 4 | 1465 | 195812.157865 |
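
The equal-weight average above is a special case of a weighted blend. If a held-out validation set is available, the weight can be chosen by comparing errors on it. The following is a rough sketch under that assumption: the helper names, the validation values, and the use of mean absolute error are all illustrative and are not taken from the notebook.

```python
import numpy as np


def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two prediction vectors; weight_a applies to preds_a."""
    return weight_a * np.asarray(preds_a) + (1.0 - weight_a) * np.asarray(preds_b)


def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


# Made-up validation targets and model predictions (not from the notebook).
y_valid = np.array([200000.0, 150000.0, 320000.0])
preds_a = np.array([210000.0, 140000.0, 300000.0])  # e.g. a linear model
preds_b = np.array([195000.0, 158000.0, 330000.0])  # e.g. a tree ensemble

# Try weights 0.0, 0.1, ..., 1.0 and keep the one with the lowest error.
weights = np.linspace(0.0, 1.0, 11)
errors = [mean_absolute_error(y_valid, blend(preds_a, preds_b, w)) for w in weights]
best_weight = float(weights[int(np.argmin(errors))])

equal_blend = blend(preds_a, preds_b, 0.5)
print("equal-weight blend:", equal_blend)
print("best weight for model A on this validation set:", best_weight)
```

With only a handful of validation rows the chosen weight is noisy, so a 50/50 split remains a reasonable default when no reliable validation set exists.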


In [17]:
solution.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


# References:
1. EDA - https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/
2. Ridge CV - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html
3. Ideas for feature engineering - https://www.kaggle.com/vikumsw/house-prices-solution-beginner
