# Introductory Machine Learning:
- Making predictions on house prices using basic scikit-learn and pandas
- Tech used:
    - Pandas
    - scikit-learn:
        - train/test split
        - mean absolute error
        - decision tree regressor - max_leaf_nodes argument
        - random forest regressor
- Using validation metrics to counteract overfitting and underfitting

# Introduction Notes:
- Machine Learning uses models to make predictions from data
- Find patterns in data by fitting a model
- BASIC STEPS:
    - Train model on training data 
    - Apply the fitted model to new data to predict prices
- The deeper the decision tree, the more specific its predictions become
- Pandas is very important for data science and machine learning

In [1]:
# General useful imports
import pandas as pd

In [2]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

In [3]:
# save filepath to variable
data_file_path = 'melb_data.csv'
# read the data and store data in a DataFrame
data = pd.read_csv(data_file_path) 
# print a summary of the data in data
data.describe()

,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


# Selecting data for modelling
- If a dataset has many variables, you can select a subset of them, either by intuition or with statistical methods, so the model is easier to understand
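
One simple statistical approach is to rank candidate features by how strongly they correlate with the target. The sketch below is only an illustration: the `toy` table, its values, and the 0.5 threshold are all made up, and correlation captures only linear relationships, so this is not how the features in this notebook were actually chosen.

```python
import pandas as pd

# Hypothetical toy data standing in for a housing table (values are made up).
toy = pd.DataFrame({
    "Rooms":    [2, 3, 4, 3, 5, 2],
    "Landsize": [150, 300, 450, 320, 600, 140],
    "Noise":    [7, 1, 5, 1, 6, 4],
    "Price":    [500, 750, 1000, 760, 1250, 480],
})

# Absolute correlation of every candidate feature with the target column.
correlations = toy.drop(columns="Price").corrwith(toy["Price"]).abs()

# Keep features above an arbitrary, illustrative threshold.
selected = correlations[correlations > 0.5].index.tolist()
print(selected)
```

Here `Rooms` and `Landsize` track `Price` closely while `Noise` does not, so only the first two survive the filter.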

In [7]:
# display the columns
print(data.columns)
# a simple way to handle missing values is to drop every row that contains one
data = data.dropna(axis=0) # axis=0 drops rows; axis=1 would drop columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
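
To make the `axis` argument of `dropna` concrete, here is a minimal sketch on a hypothetical three-row table (the values are invented for illustration). It shows that `axis=0` removes rows with missing values, while `axis=1` removes columns with missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table with one missing value (values are made up).
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [90.0, np.nan, 120.0],
})

rows_dropped = toy.dropna(axis=0)  # removes the row that contains NaN
cols_dropped = toy.dropna(axis=1)  # removes the column that contains NaN

print(rows_dropped.shape)  # (2, 2)
print(cols_dropped.shape)  # (3, 1)
```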


In [5]:
# Select the prediction target; dot notation returns a Series
y = data.Price
# Select the features as a list of column names
data_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = data[data_features]
X.describe()
X.head() # manual inspection of the feature dataframe
 

,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


# Build the model
- scikit-learn is the predominant modelling library
- Model building steps:
    - Define: what type of model, and which parameters
    - Fit: capture patterns in the training data
    - Predict: make predictions on unseen data
    - Evaluate: determine how accurately the model predicts unseen data
    

In [6]:
# USE A DECISION TREE FROM SKLEARN - USE FEATURES AND TARGET VARIABLE
from sklearn.tree import DecisionTreeRegressor

# DEFINE MODEL, SPECIFY A RANDOM STATE TO ENSURE SAME RESULT EACH RUN
data_model = DecisionTreeRegressor(random_state=1)

# Fit/Train model
data_model.fit(X, y)

# DEMONSTRATIVE PREDICTION OF FIRST 5 HOUSES
# (in practice, predictions should be made on data you didn't train on)
print(X.head())
print("The predictions are")
print(data_model.predict(X.head()))

   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]


# Model Validation
- How good is your model?
- Use validation to iteratively improve your model
- How accurately can your model predict unseen instances?
- Metrics:
    - Mean Absolute Error (MAE): for each house, error = actual - predicted
        - take the absolute value of each error, then average those absolute errors over all houses
- Avoid in-sample scoring (scoring on the data the model was trained on) by using a train/test split
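
The MAE definition above can be checked by hand. This is a minimal sketch with made-up `actual` and `predicted` values (not taken from the housing data) that computes the metric step by step and compares it with scikit-learn's `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values purely for illustration.
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

errors = actual - predicted            # per-item error: -10, 10, -30
manual_mae = np.mean(np.abs(errors))   # average of 10, 10, 30
library_mae = mean_absolute_error(actual, predicted)

print(manual_mae, library_mae)
```

Both values come out to 50 / 3, roughly 16.67, which confirms that the sign of each error is discarded before averaging.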

        

In [8]:
# Model evaluated in-sample: the same data is used for fitting and scoring (no train/test split yet)
data = data.dropna(axis=0)
# Choose target and features
y = data.Price
data_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = data[data_features]
# Define model
data_model_1 = DecisionTreeRegressor()
# Fit model
data_model_1.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [10]:
# evaluate model
from sklearn.metrics import mean_absolute_error

predicted_home_prices = data_model_1.predict(X)
mean_absolute_error(y, predicted_home_prices)

434.71594577146544

In [16]:
# TRAIN_TEST_SPLIT - helps detect overfitting, i.e. a model that has learned noise patterns in the training data
from sklearn.model_selection import train_test_split

# split data into training and validation data
# the split is random; specify random_state to get the same split each run
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)
# Define model
data_model_2 = DecisionTreeRegressor()
# Fit model
data_model_2.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = data_model_2.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

253425.3692704971


# Underfitting and Overfitting:
- Using validation metrics, you can experiment with different models to see what works best
- Other decision tree options:
    - tree depth: a trade-off between underfitting and overfitting
        - too deep (e.g. a tree so specific it has a leaf for each individual house) and the model fits noise in the training data, so it is not effective on unseen instances
        - too shallow and it misses important patterns, so its predictions are inaccurate as well
        - find the middle ground
    - there are a few ways to control tree size; a basic one is max_leaf_nodes, which caps the number of leaves
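
Another way to limit tree size is the `max_depth` argument of `DecisionTreeRegressor`. The sketch below is an illustration on synthetic data (a noisy sine curve, an assumption made here, not the housing data) that compares training and validation MAE at several depths. A large gap between the two scores is the usual sign of overfitting.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(400, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=400)

tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=0)

results = {}
for depth in [1, 4, None]:  # None places no limit on the depth
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[depth] = (train_mae, val_mae)
    print(depth, round(train_mae, 3), round(val_mae, 3))
```

The unlimited tree memorises the training set, so its training MAE is essentially zero while its validation MAE is not. The depth-1 tree is too coarse to follow the curve at all.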

In [23]:
# effects of max_leaf_nodes using MAE scores
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    data_model_3 = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    data_model_3.fit(train_X, train_y)
    preds_val = data_model_3.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))


Max leaf nodes: 5  		 Mean Absolute Error:  324110
Max leaf nodes: 50  		 Mean Absolute Error:  252108
Max leaf nodes: 500  		 Mean Absolute Error:  239204
Max leaf nodes: 5000  		 Mean Absolute Error:  249507


# Random Forests:
- A random forest builds many decision trees and averages their predictions
- It is an ensemble-style model
- It often works well with default parameters, without much tuning
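
To illustrate the ensemble idea, the following sketch fits a small forest on synthetic data (invented here for illustration) and checks that the forest's prediction equals the average of its individual trees' predictions. It relies on the fitted `estimators_` attribute of `RandomForestRegressor`, which holds the individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): y depends on the first column.
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_toy, y_toy)

# Predict the first five rows with every tree, then average across trees.
per_tree = np.array([tree.predict(X_toy[:5]) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(np.allclose(averaged, forest.predict(X_toy[:5])))
```

Averaging many trees, each trained on a different bootstrap sample, tends to smooth out the errors of any single overfitted tree.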

In [28]:
# build a random forest
# can you specify 
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
data_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, data_preds))

185932.79104798794




# CONCLUSION:
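For a single decision tree, the lowest validation MAE (239,204) was obtained at an intermediate tree size (max_leaf_nodes=500); both smaller and larger trees scored worse. The random forest improved predictions further, reducing the validation MAE to about 185,933.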

# Notes to self:
- Further Study Topics:
    - Decision tree from scratch, how does it get updated?
    - Other scikit-learn models
    - Other metrics to summarise model quality
    - AutoML
    - Monte Carlo simulation
    - Other random forest arguments