In [2]:
# On the use of the Pandas library for data exploration

import pandas as pd

# Read the data and store as a pandas dataframe
melbourne_data = pd.read_csv('./data/melb_data.csv')
# print a summary of the data
melbourne_data.describe()


Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


In [3]:
# Read columns of dataset

melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [4]:
# For this example with Melbourne data, we will drop missing values

melbourne_data = melbourne_data.dropna(axis=0)

# Select prediction target (price)
# Convention dictates that the target is called 'y'
y = melbourne_data.Price

# Columns that are inputs to our model are called features
# Convention dictates that the features are called 'X'

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.describe()


Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635
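Dropping missing values is why the row count falls from 13,580 to 6,196 above, and why the index labels keep their original values. As a minimal sketch of what `dropna(axis=0)` does, here is a small hypothetical frame (the `toy` data is made up for illustration and is not the Melbourne dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Melbourne data
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [500000.0, np.nan, 900000.0, 750000.0],
    "BuildingArea": [80.0, 120.0, np.nan, 110.0],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# axis=0 drops every row that contains at least one missing value
complete_rows = toy.dropna(axis=0)
print(len(toy), "rows before,", len(complete_rows), "rows after")

# The surviving rows keep their original index labels
print(list(complete_rows.index))
```

Checking `isnull().sum()` first is a cheap way to see how much data a blanket `dropna` would cost before committing to it.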


In [5]:
X.head()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


We will use scikit-learn to create models.
The steps to build and use a model are as follows:
1. Define: What type of model will it be? Some other parameters of the model type are specified too.
2. Fit: Capture patterns from the provided data. This is the heart of modeling.
3. Predict: Just what it sounds like.
4. Evaluate: Determine how accurate the model's predictions are.
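The four steps can be sketched end to end on a small synthetic dataset. The data and the names `X_demo`, `y_demo`, and `demo_model` are assumptions made up for this illustration. The evaluation here uses rows the model has already seen, only to show the call, so the score says nothing about real accuracy.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical): two features, one noisy target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)

# 1. Define: choose the model type and its parameters
demo_model = DecisionTreeRegressor(random_state=1)

# 2. Fit: learn patterns from features and target
demo_model.fit(X_demo, y_demo)

# 3. Predict: produce estimates for some rows
demo_preds = demo_model.predict(X_demo[:5])

# 4. Evaluate: compare predictions with known targets
demo_mae = mean_absolute_error(y_demo[:5], demo_preds)
print(demo_preds, demo_mae)
```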

In [6]:
# Here is an example of defining a decision tree model
# and fitting it with the features and target variable

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [7]:
print("Making predictions for the following 5 houses:")

print(X.head())

print("The predictions are")

print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]


Now we will validate the model. Being able to measure model quality is the key to iteratively improving models.
* In most, though not all, applications the relevant measure of model quality is predictive accuracy.
* Many people make a huge mistake when measuring predictive accuracy: they make predictions with their training data and compare those predictions to the target values in the training data.
* We need to summarize model quality in an understandable way. One way to do this is to condense it into a single metric.

As one example, we will look at the Mean Absolute Error (MAE): the average of the absolute differences between the actual and predicted values.
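To make the metric concrete, here is a minimal sketch that computes MAE by hand and compares it with scikit-learn's `mean_absolute_error`. The three `actual` and `predicted` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical values, chosen so the arithmetic is easy to follow
actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 230.0])

# Absolute errors are 10, 10 and 30, so the mean is 50 / 3
manual_mae = np.mean(np.abs(actual - predicted))

# The library function should agree with the manual calculation
sklearn_mae = mean_absolute_error(actual, predicted)
print(manual_mae, sklearn_mae)
```

Because errors are taken as absolute values, over- and under-predictions do not cancel out, and the result is in the same units as the target (here, price).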

In [8]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

Problem with "In-Sample" scores
* There may exist patterns that appear only in the sample we used to build and evaluate the model.
* A model that has learned those patterns will look accurate in-sample but be less accurate on new data.
* The most straightforward way to resolve this is to exclude some data from the model-building process, then use that data to test the model's accuracy, as this is data that the model has not seen before.
**This is called the validation data**
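As a small sketch of how the hold-out works mechanically, the snippet below splits a made-up 20-row dataset (`X_demo` and `y_demo` are hypothetical) with `train_test_split`. With no size arguments, scikit-learn holds out 25% of the rows by default, and a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 features, targets 0..19
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two calls with the same random_state give the same split
a_train, a_val, ya_train, ya_val = train_test_split(X_demo, y_demo, random_state=0)
b_train, b_val, yb_train, yb_val = train_test_split(X_demo, y_demo, random_state=0)

print(a_train.shape, a_val.shape)
print(np.array_equal(a_val, b_val))

# No row appears in both the training and the validation part
print(set(ya_train) & set(ya_val))
```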

In [9]:
# Scikit-learn has a function to split the data into training and test sets

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator.
# Supplying a numeric value to the random_state argument guarantees we get the same split every time.
# Here, we split the data with a random_state of 0
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define model
melbourne_model = DecisionTreeRegressor()

# Fit model
melbourne_model.fit(train_X, train_y)

# Get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

273782.4600817732


Underfitting and Overfitting
* Experimentation is the name of the game here. We have a reliable way to test model accuracy, so we will want to see how alternative models perform.
* For this first example, let's refine the same model (DecisionTreeRegressor).
* This model has an option to change the tree's depth (how many decisions it makes before coming to a prediction).
---
_Excerpt from Kaggle_
In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree only had 1 split, it divides the data into 2 groups. If each group is split again, we would get 4 groups of houses. Splitting each of those again would create 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 groups of houses by the time we get to the 10th level. That's 1024 leaves.

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called **overfitting**, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called **underfitting**.
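One way to see both effects is to compare training and validation error side by side as the tree grows. The sketch below is an illustration on synthetic data (a noisy sine curve, made up for this purpose), not the Melbourne data. We would expect a very small tree to do poorly on both sets, and a tree with one leaf per training row to fit the training data almost perfectly while still missing on validation data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): a sine curve plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(600, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=600)

# Default split: 450 training rows, 150 validation rows
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for leaves in [2, 20, 450]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[leaves] = (train_mae, val_mae)
    print(f"leaves={leaves:4d}  train MAE={train_mae:.3f}  validation MAE={val_mae:.3f}")
```

A large gap between the two columns is the usual sign of overfitting. Both columns being high points to underfitting.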

In [10]:
# With the DecisionTreeRegressor model, we can use max_leaf_nodes to control overfitting/underfitting

# Let's set up a utility function to compare MAE scores for different values of max_leaf_nodes
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree with the given max_leaf_nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# Compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  385696
Max leaf nodes: 50  		 Mean Absolute Error:  279794
Max leaf nodes: 500  		 Mean Absolute Error:  261718
Max leaf nodes: 5000  		 Mean Absolute Error:  271320


In the above example, 500 max leaf nodes gives the lowest MAE of the options we tried.
With this information, we can go back, set max_leaf_nodes to that value, and fit the model on all of the data instead of splitting it.
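Rather than reading the best value off the printed list, the selection can also be done in code. The following is a minimal sketch on synthetic data. The helper `validation_mae`, the candidate list, and the data are all assumptions made up for this illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical), standing in for features and prices
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = X_demo[:, 0] ** 2 + 5 * X_demo[:, 1] + rng.normal(0, 2, size=400)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

def validation_mae(leaves):
    """Hypothetical helper: fit on the training part, score on the validation part."""
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, tree.predict(va_X))

# Score each candidate, then keep the one with the lowest validation MAE
candidates = [5, 50, 500]
scores = {leaves: validation_mae(leaves) for leaves in candidates}
best_leaves = min(scores, key=scores.get)
print(scores, best_leaves)

# Refit on all rows with the chosen setting
final_model = DecisionTreeRegressor(max_leaf_nodes=best_leaves, random_state=1)
final_model.fit(X_demo, y_demo)
```

Once the model is refit on all rows, there is no held-out data left to score it on, so the validation MAE from the search is the best available estimate of its accuracy.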

In [11]:
best_model = DecisionTreeRegressor(max_leaf_nodes=500, random_state=1)
best_model.fit(X, y)
best_model.predict(X)

array([1078446.42857143, 1246050.        , 1660600.        , ...,
        393070.45454545,  635110.39330544, 2677500.        ],
      shape=(6196,))

We took a look at the process of training a model and tuning it.
But could another model make the predictions even better? So far we have looked at the decision tree model; next we will move on to a more sophisticated one.

## Random Forest
The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.

Building a random forest model is similar to how we built the decision tree in scikit-learn. We just use the `RandomForestRegressor` class.
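The averaging described above can be checked directly: a fitted `RandomForestRegressor` exposes its component trees in `estimators_`, and averaging their individual predictions should reproduce the forest's own output. The sketch below uses synthetic data and a small `n_estimators=25`, both chosen only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical): three features, one noisy target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=1)
forest.fit(X_demo, y_demo)

# Predictions of every individual tree for the first five rows
new_rows = X_demo[:5]
per_tree = np.array([tree.predict(new_rows) for tree in forest.estimators_])

# Averaging across trees should match the forest's prediction
manual_average = per_tree.mean(axis=0)
forest_output = forest.predict(new_rows)
print(np.allclose(manual_average, forest_output))
```

Individual trees can disagree quite a lot on a single row. Averaging them smooths out those differences, which is the intuition behind the forest's better accuracy.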

In [12]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
forest_predictions = forest_model.predict(val_X)
forest_mae = mean_absolute_error(val_y, forest_predictions)
print(forest_mae)

207190.6873773146
