# Kaggle Micro Course: "*Intro to Machine Learning*"

# House Pricing Prediction (Iowa)

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# Read dataset from csv file
home_data = pd.read_csv(iowa_file_path)

# Print the first 5 rows of the table
home_data.head()

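| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |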


---

## Building the model

### Specify the prediction target (`y`)

First, I select the target variable, which is the sale price, and save it to a new variable called `y`.

To do this, I print the list of columns and look for the name of the one I need.

In [2]:
home_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [3]:
y = home_data.SalePrice

### Specify the predictive features (`X`)
Now I need to create a DataFrame called `X` holding the predictive features.

Since I want only some columns from the original data, I first create a list with the names of the columns I want in `X`.

I will use just the following columns in the list:
- LotArea
- YearBuilt
- 1stFlrSF
- 2ndFlrSF
- FullBath
- BedroomAbvGr
- TotRmsAbvGrd

In [4]:
# Create the list of features
feature_names = [
    "LotArea",
    "YearBuilt",
    "1stFlrSF", 
    "2ndFlrSF", 
    "FullBath", 
    "BedroomAbvGr", 
    "TotRmsAbvGrd"
]

# Select data corresponding to features in feature_names
X = home_data[feature_names]
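
The selection step can be checked on a small scale. The sketch below rebuilds a tiny DataFrame from the first five rows of three columns in this dataset and selects features the same way. The names `sample`, `feature_subset`, `X_sample` and `y_sample` are made up for illustration. It also shows why bracket indexing is used for the features: a column such as `1stFlrSF` starts with a digit, so it cannot be reached with dot notation the way `SalePrice` can.

```python
import pandas as pd

# First five rows of three columns, copied from the dataset preview
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "1stFlrSF": [856, 1262, 920, 961, 1145],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Dot notation works for names that are valid Python identifiers
y_sample = sample.SalePrice

# A list of names inside the brackets returns a DataFrame
feature_subset = ["LotArea", "1stFlrSF"]
X_sample = sample[feature_subset]

# Confirm the selection has the expected columns and no missing values
print(list(X_sample.columns))
print(X_sample.isnull().sum().sum())
```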

### Review Data

Before building a model, I take a quick look at **X** to verify it looks sensible.

In [5]:
# print description or statistics from X
X.describe()

| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 10516.828082 | 1971.267808 | 1162.626712 | 346.992466 | 1.565068 | 2.866438 | 6.517808 |
| std | 9981.264932 | 30.202904 | 386.587738 | 436.528436 | 0.550916 | 0.815778 | 1.625393 |
| min | 1300.0 | 1872.0 | 334.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 7553.5 | 1954.0 | 882.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| 50% | 9478.5 | 1973.0 | 1087.0 | 0.0 | 2.0 | 3.0 | 6.0 |
| 75% | 11601.5 | 2000.0 | 1391.25 | 728.0 | 2.0 | 3.0 | 7.0 |
| max | 215245.0 | 2010.0 | 4692.0 | 2065.0 | 3.0 | 8.0 | 14.0 |


In [6]:
# print the top few lines
X.head()

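| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 |
| 2 | 11250 | 2001 | 920 | 866 | 2 | 3 | 6 |
| 3 | 9550 | 1915 | 961 | 756 | 1 | 3 | 7 |
| 4 | 14260 | 2000 | 1145 | 1053 | 2 | 4 | 9 |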


### Split data

I use the `train_test_split` function to split my data into a training set and a validation set.

I pass `random_state=1` so that the split is reproducible.

In [7]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
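
To make the behaviour of this call concrete, here is a minimal sketch on synthetic stand-in data (not the Iowa data; every name ending in `_demo` is made up for illustration). It shows that when neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows, and that repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features (not the Iowa data)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# With no test_size or train_size given, 25% of the rows are held out
train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=1
)
print(train_X_demo.shape, val_X_demo.shape)

# The same random_state reproduces the same split
_, val_X_again, _, _ = train_test_split(X_demo, y_demo, random_state=1)
print(np.array_equal(val_X_demo, val_X_again))
```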

### Specify and Fit Model

I create a `DecisionTreeRegressor` and save it as `iowa_model`.

Then I fit the model on the training data, `train_X` and `train_y`, from the split above.

In [8]:
# First of all I need to specify the model
# For model reproducibility, I set the `random_state` argument
iowa_model = DecisionTreeRegressor(random_state=0)

# Then I need to fit the model to training data
iowa_model.fit(train_X, train_y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=0, splitter='best')

### Make predictions

Finally, I make predictions with the model's `predict` method, using `val_X` as the input data.

I save the results to a variable called `val_predictions`.

In [9]:
val_predictions = iowa_model.predict(val_X)

---

## Evaluating the model

I manually compare the first few predictions with the actual values from the validation data.

In [10]:
# print the top few validation predictions
print("MY PREDICTIONS\n{}\n".format(val_predictions[:5]))
# print the top few actual prices from validation data
print("VALIDATION DATA\n{}".format(val_y.head()))

MY PREDICTIONS
[186500. 163000. 130000.  92000. 157900.]

VALIDATION DATA
258     231500
267     179500
288     122000
649      84500
1233    142000
Name: SalePrice, dtype: int64


### Calculate the *Mean Absolute Error* in Validation Data

In [11]:
val_mae = mean_absolute_error(val_y, val_predictions)
print(f'{val_mae:,.2f}')

29,478.64
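
To show what the metric measures, the sketch below computes the mean absolute error by hand for the five predictions and actual prices printed earlier: take the absolute difference for each house, then average. The variable names are made up for illustration. The result covers only these five rows, so it is not expected to match the MAE over the full validation set.

```python
# First five predictions and actual prices, copied from the printed output
predictions = [186500, 163000, 130000, 92000, 157900]
actual = [231500, 179500, 122000, 84500, 142000]

# Absolute error per house, then the average of those errors
absolute_errors = [abs(p - a) for p, a in zip(predictions, actual)]
mae_first_five = sum(absolute_errors) / len(absolute_errors)

print(absolute_errors)            # [45000, 16500, 8000, 7500, 15900]
print(f"{mae_first_five:,.2f}")   # 18,580.00
```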


## Improving the model

The following function computes the validation MAE of a `DecisionTreeRegressor` for a given value of the `max_leaf_nodes` parameter.

I will use it shortly to see how varying that parameter changes the model's error.

In [12]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [13]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

# Loop to find the ideal tree size from candidate_max_leaf_nodes
mae_values = []
for max_leaf_nodes in candidate_max_leaf_nodes:
    mae_values.append(get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y))

# Best value of max_leaf_nodes
index_of_minimum_mae = mae_values.index(min(mae_values))
best_tree_size = candidate_max_leaf_nodes[index_of_minimum_mae]

print("Best MAE value: {:,.2f}".format(mae_values[index_of_minimum_mae]))
print("Best tree size:", best_tree_size)

Best MAE value: 27,282.51
Best tree size: 100
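
The same search can be written more compactly by mapping each candidate size to its MAE in a dictionary and taking the key with the smallest value. This is only a sketch of that pattern, run on synthetic stand-in data rather than the Iowa data. The helper `demo_mae` and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Iowa data): a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=400)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)

def demo_mae(max_leaf_nodes):
    """Validation MAE of a tree limited to max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))

# Map each candidate size to its validation MAE
scores = {size: demo_mae(size) for size in [5, 25, 50, 100, 250]}

# min with key=scores.get returns the size whose MAE is lowest
best_size = min(scores, key=scores.get)
print(best_size, round(scores[best_size], 3))
```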


Next, I try a more sophisticated model.

I will use a `RandomForestRegressor`.

In [14]:
# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)

# Fit the model
rf_model.fit(train_X, train_y)

# Calculate the mean absolute error of the Random Forest model on the validation data
val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, val_predictions)

print("Validation MAE for Random Forest Model: {:,.2f}".format(rf_val_mae))

Validation MAE for Random Forest Model: 22,762.43
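
For intuition about what the forest does when predicting, the sketch below fits a small `RandomForestRegressor` on synthetic stand-in data (not the Iowa data) and checks that its prediction is the average of the predictions of its individual trees, which are available in the fitted model's `estimators_` attribute. The data, `n_estimators=20` and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (not the Iowa data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(0, 1.0, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=1)
forest.fit(X_demo, y_demo)

# Predict with each fitted tree, then average across trees
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction matches the average of its trees
print(np.allclose(manual_average, forest.predict(X_demo)))
```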




### Fit Model Using All Data
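The random forest has the lowest validation MAE of the models I tried, so I keep it as my final model. Now that I have made all my modeling decisions, I no longer need to hold out the validation data, so I refit the model on all of the data. Training on more rows will usually make the model somewhat more accurate, although I can no longer measure that on the validation set.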

In [15]:
final_model = RandomForestRegressor(random_state=1)

# fit the final model
final_model.fit(X, y)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=1, verbose=0, warm_start=False)
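
Once a model is fitted, it is used by passing new rows with the same feature columns to `predict`. The sketch below shows only that call pattern. It trains a toy forest on the first five rows of `X` and `y` from this dataset, so its prediction is not meaningful, and the "new house" values are invented for illustration. The names `X_small`, `y_small`, `toy_model` and `new_house` are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

feature_names = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF",
                 "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]

# First five rows of X and y, copied from the outputs shown earlier
X_small = pd.DataFrame(
    [[8450, 2003, 856, 854, 2, 3, 8],
     [9600, 1976, 1262, 0, 2, 3, 6],
     [11250, 2001, 920, 866, 2, 3, 6],
     [9550, 1915, 961, 756, 1, 3, 7],
     [14260, 2000, 1145, 1053, 2, 4, 9]],
    columns=feature_names,
)
y_small = pd.Series([208500, 181500, 223500, 140000, 250000], name="SalePrice")

# Toy model: five rows are far too few for a useful fit
toy_model = RandomForestRegressor(random_state=1)
toy_model.fit(X_small, y_small)

# A hypothetical house; the values are invented for illustration
new_house = pd.DataFrame([[10000, 1990, 1000, 500, 2, 3, 7]], columns=feature_names)
predicted_price = toy_model.predict(new_house)[0]
print(f"{predicted_price:,.0f}")
```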