## Predict Tomorrow's Temperature based on Historical Data

### Random Forest Regression Problem



We are working on a supervised regression task: predict tomorrow's maximum temperature in Seattle, WA, from the past six years of historical data.


Following are explanations of the columns:

year: year of the observation, 2011 to 2016

month: number for month of the year

day: number for day of the year

weekday: day of the week as a character string

ws_1: one week average

temp_2: max temperature 2 days prior

temp_1: max temperature 1 day prior

average: historical average max temperature

actual: max temperature measurement

friend: your friend’s prediction, a random number between 20 below the average and 20 above the average
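
The notebook does not show how `Temperature.csv` was assembled. As a minimal sketch, lag columns such as temp_1 and temp_2 could be derived from a daily series of maximum temperatures with `shift`. The temperature values and the `daily` and `lagged` names below are made up for illustration.

```python
import pandas as pd

# Hypothetical daily maximum temperatures (made-up values, not from Temperature.csv)
daily = pd.DataFrame({"actual": [45, 44, 41, 40, 44, 51]})

# temp_1 is the previous day's maximum, temp_2 the maximum two days prior
daily["temp_1"] = daily["actual"].shift(1)
daily["temp_2"] = daily["actual"].shift(2)

# The first two rows have no complete history, so drop them
lagged = daily.dropna().reset_index(drop=True)
print(lagged)
```

Each remaining row pairs a day's `actual` value with the two preceding maxima, which is the layout the column descriptions above imply.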


In [0]:
# Pandas is used for data manipulation
import pandas as pd

# Read in data as a dataframe
features = pd.read_csv('Temperature.csv')

In [0]:
# One Hot Encoding
features = pd.get_dummies(features)

# Extract features and labels
labels = features['actual']
features = features.drop('actual', axis = 1)
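
To see what the one-hot encoding step does to the weekday column, here is a small sketch on a three-row frame. The `toy` frame and its values are hypothetical. The real data has one row per day.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real data
toy = pd.DataFrame({"temp_1": [37, 40, 39], "weekday": ["Fri", "Sat", "Sun"]})

# get_dummies leaves numeric columns alone and expands string columns
# into one indicator column per distinct value
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
# ['temp_1', 'weekday_Fri', 'weekday_Sat', 'weekday_Sun']
```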

## Restrict to the most important features

In [0]:
# Names of six features accounting for 95% of total importance
important_feature_names = ['temp_1', 'average', 'ws_1', 'temp_2', 'friend', 'year']

# Update feature list for visualizations
feature_list = important_feature_names[:]

features = features[important_feature_names]
features.head(5)

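|   | temp_1 | average | ws_1 | temp_2 | friend | year |
|---|--------|---------|------|--------|--------|------|
| 0 | 37 | 45.6 | 4.92 | 36 | 40 | 2011 |
| 1 | 40 | 45.7 | 5.37 | 37 | 50 | 2011 |
| 2 | 39 | 45.8 | 6.26 | 40 | 42 | 2011 |
| 3 | 42 | 45.9 | 5.59 | 39 | 59 | 2011 |
| 4 | 38 | 46.0 | 3.8 | 42 | 39 | 2011 |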
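
The notebook does not show how the six features were chosen. One way such a list could be produced is to fit a forest on all features, sort the importances, and keep the smallest set whose cumulative importance reaches 95%. The sketch below uses synthetic data and hypothetical feature names (`f0` to `f3`), so it illustrates the idea rather than reproducing the selection above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in data: four features, only the first two drive the target
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
names = ["f0", "f1", "f2", "f3"]

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Sort features from most to least important and accumulate
order = np.argsort(forest.feature_importances_)[::-1]
cumulative = np.cumsum(forest.feature_importances_[order])

# Keep the smallest prefix whose cumulative importance reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
selected = [names[i] for i in order[:n_keep]]
print(selected)
```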


In [0]:
# Convert to numpy arrays
import numpy as np

features = np.array(features)
labels = np.array(labels)

# Training and Testing Sets
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, 
                                                                            test_size = 0.25, random_state = 42)

In [0]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)
print('{:0.1f} years of data in the training set'.format(train_features.shape[0] / 365.))
print('{:0.1f} years of data in the test set'.format(test_features.shape[0] / 365.))

Training Features Shape: (1643, 6)
Training Labels Shape: (1643,)
Testing Features Shape: (548, 6)
Testing Labels Shape: (548,)
4.5 years of data in the training set
1.5 years of data in the test set
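
The split sizes follow directly from the row count. With 1643 + 548 = 2191 rows and `test_size = 0.25`, the test set gets the rounded-up quarter and the training set gets the rest. A quick check on placeholder arrays of the same shape (the zero-filled arrays are stand-ins, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1643 + 548  # total rows implied by the shapes printed above

# Placeholder arrays; only the shapes matter for this check
dummy_X = np.zeros((n_rows, 6))
dummy_y = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    dummy_X, dummy_y, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)  # (1643, 6) (548, 6)
```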


## Examine the Default Random Forest to Determine Parameters

We will use the default parameters as a starting point. I relied on the scikit-learn random forest documentation to determine which hyperparameters to change and which values are available.

In [0]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state = 42)

from pprint import pprint

# Look at parameters used by our current forest
print('Parameters currently in use:\n')
print(rf.get_params())

Parameters currently in use:

{'bootstrap': True, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 'warn', 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}


## Evaluation Function

In [0]:
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    
    return accuracy
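
The accuracy reported by `evaluate` is 100 minus the mean absolute percentage error (MAPE). A worked example with made-up numbers shows the arithmetic:

```python
import numpy as np

# Made-up predictions and true values, chosen so the arithmetic is easy to follow
predictions = np.array([55.0, 63.0, 48.0])
actuals = np.array([50.0, 60.0, 50.0])

errors = np.abs(predictions - actuals)    # [5, 3, 2] degrees
mape = 100 * np.mean(errors / actuals)    # mean of [10%, 5%, 4%]
accuracy = 100 - mape

print(f"{mape:.2f} {accuracy:.2f}")       # 6.33 93.67
```

Because each error is divided by the true value, this measure is only meaningful when the labels stay well away from zero.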

## Evaluate the Performance of a Base RF Model with Default Parameters

In [0]:
base_model = RandomForestRegressor(n_estimators = 10, random_state = 42)
base_model.fit(train_features, train_labels)
base_accuracy = evaluate(base_model, test_features, test_labels)

Model Performance
Average Error: 3.9170 degrees.
Accuracy = 93.36%.


## Random Search with Cross Validation

In [0]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

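# Number of features to consider at every split
# ('auto' means all features for a regressor; it was removed in scikit-learn 1.3,
#  so newer versions need 1.0 in its place)
max_features = ['auto', 'sqrt']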

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
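
It helps to know how large this search space is before sampling from it. Counting the candidate values per hyperparameter in the grid printed above gives the total number of combinations. The `sizes` dictionary is just a hand count of that grid.

```python
from math import prod

# Number of candidate values per hyperparameter, read off the grid printed above
sizes = {
    "n_estimators": 10,
    "max_features": 2,
    "max_depth": 12,        # 11 depths plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}

total = prod(sizes.values())
print(total)  # 4320

# Share of the space that 100 random draws would cover
print(f"{100 * 100 / total:.1f}%")  # 2.3%
```

Trying all 4320 combinations with cross validation would be expensive, which is why a random sample is a reasonable first pass.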


### Run the Random Search to Find the Best Hyperparameters

In [0]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor(random_state = 42)

# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                              n_iter = 100, scoring='neg_mean_absolute_error', 
                              cv = 3, verbose=2, random_state=42, n_jobs=-1,
                              return_train_score=True)

# Fit the random search model
rf_random.fit(train_features, train_labels)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  6.2min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 11.4min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=100, n_jobs=-1,
          param_distributions={'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score=True, scoring='neg_mean_absolute_error',
          verbose=2)

## Best Parameters found by Random Search

In [0]:
rf_random.best_params_


{'bootstrap': True,
 'max_depth': 70,
 'max_features': 'auto',
 'min_samples_leaf': 4,
 'min_samples_split': 10,
 'n_estimators': 400}

In [0]:
rf_random.cv_results_

{'mean_fit_time': array([ 0.84624712,  4.64778495,  3.28077642, 10.02533134,  3.31612508,
         2.11023132,  3.03564095,  1.96018505,  7.38710658,  8.6583643 ,
         1.36810565,  2.53769374,  5.1920716 ,  4.32967599,  2.1373558 ,
         8.94244957,  3.5270896 ,  2.83347273,  9.32520223,  2.59909773,
         4.99765396,  5.05104891,  3.0171001 ,  7.60802484,  4.53987757,
         1.49470242, 10.22373875,  2.34093984,  9.59968702,  2.88799691,
         3.12360771,  2.47784901,  2.5455362 ,  1.70067183,  3.90168707,
         2.15646068,  2.85034903,  3.97024989,  3.25654229,  2.45215495,
         0.70434515,  1.31626693,  1.734212  ,  0.83521628,  1.17650652,
         2.88968221,  8.38717397,  1.01106771,  3.14120293,  1.8751688 ,
        13.0903577 , 10.68149471,  4.86921501,  5.00468191,  4.25635918,
         1.6022222 ,  6.40568717,  7.61964504,  7.06363455,  3.02211372,
         0.58193453,  0.85522278,  4.83202974,  6.23708161,  3.58184783,
         5.91677427,  8.67000961, 
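
The raw `cv_results_` dictionary is hard to read. A common way to inspect it is to load it into a DataFrame and sort by rank. The sketch below runs a deliberately tiny search on synthetic data so that it finishes in seconds. The data, `tiny_grid`, and the `cv_mae` column name are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data and a tiny grid so the search runs quickly
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=120)
tiny_grid = {"n_estimators": [10, 20], "max_depth": [2, None], "min_samples_leaf": [1, 4]}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=tiny_grid,
    n_iter=4,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=42,
    return_train_score=True,
)
search.fit(X, y)

# One row per sampled candidate; scores are negated MAE, so flip the sign
results = pd.DataFrame(search.cv_results_)
results["cv_mae"] = -results["mean_test_score"]
summary = results.sort_values("rank_test_score")[["params", "cv_mae", "mean_fit_time"]]
print(summary)
```

With `scoring='neg_mean_absolute_error'`, higher scores are better, so the top-ranked row is the one with the lowest cross-validated error.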

### Evaluate the Best Random Search Model

In [0]:
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, test_features, test_labels)

Model Performance
Average Error: 3.7159 degrees.
Accuracy = 93.73%.


## Grid Search

We can now perform grid search building on the result from the random search. We will test a range of hyperparameters around the best values returned by random search.

In [0]:
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],    # 1
    'max_depth': [80, 90, 100, 110],  # 4
    'max_features': [2, 3], # 2
    'min_samples_leaf': [3, 4, 5], # 3
    'min_samples_split': [8, 10, 12], # 3
    'n_estimators': [100, 200, 300, 1000] # 4
}  # Total 1x4x2x3x3x4 = 288 parameter combinations (exhaustive search)
   # With 3-fold cross validation on each combination, that is 288x3 = 864 fits

# Create a base model
rf = RandomForestRegressor(random_state = 42)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2, return_train_score=True)
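
The candidate count can be checked before starting a long run. scikit-learn's `ParameterGrid` enumerates the same combinations that `GridSearchCV` will try, so its length gives the number of candidates. The grid is repeated here so the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    "bootstrap": [True],
    "max_depth": [80, 90, 100, 110],
    "max_features": [2, 3],
    "min_samples_leaf": [3, 4, 5],
    "min_samples_split": [8, 10, 12],
    "n_estimators": [100, 200, 300, 1000],
}

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 3  # 3-fold cross validation
print(n_candidates, n_fits)  # 288 864
```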

In [0]:
# Fit the grid search to the data
grid_search.fit(train_features, train_labels);

Fitting 3 folds for each of 288 candidates, totalling 864 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   17.9s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 361 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 644 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done 864 out of 864 | elapsed:  7.8min finished


### Evaluate the best model from Grid Search

In [0]:
# Best parameter values after exhaustive grid search
grid_search.best_params_

{'bootstrap': True,
 'max_depth': 80,
 'max_features': 3,
 'min_samples_leaf': 5,
 'min_samples_split': 12,
 'n_estimators': 100}

### Select the best Estimator

In [0]:
best_grid = grid_search.best_estimator_


### Performance of the Best Estimator on the Test Set

In [0]:
# Performance of the Best Estimator on the Test Set
grid_accuracy = evaluate(best_grid, test_features, test_labels)

print('\n Performance of the best model : ')
print(grid_accuracy)

Model Performance
Average Error: 3.6565 degrees.
Accuracy = 93.83%.

 Performance of the best model : 
93.82944978324339


In [0]:
print('Improvement of {:0.2f}%.'.format( 100 * (grid_accuracy - base_accuracy) / base_accuracy))

Improvement of 0.50%.
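
The 0.50% figure is a relative change in accuracy. Recomputing it from the rounded values printed above, alongside the change in average error, puts the gain in more concrete terms:

```python
# Rounded values printed by the evaluation cells above
base_accuracy, grid_accuracy = 93.36, 93.83
base_mae, grid_mae = 3.9170, 3.6565

# Relative accuracy gain, as in the cell above
relative_gain = 100 * (grid_accuracy - base_accuracy) / base_accuracy

# Absolute gain in percentage points and the reduction in average error
point_gain = grid_accuracy - base_accuracy
mae_drop = base_mae - grid_mae
mae_drop_pct = 100 * mae_drop / base_mae

print(f"Relative accuracy gain: {relative_gain:.2f}%")
print(f"Accuracy gain: {point_gain:.2f} percentage points")
print(f"Average error reduced by {mae_drop:.4f} degrees ({mae_drop_pct:.2f}%)")
```

Seen this way, tuning lowered the average error by roughly a quarter of a degree compared with the ten-tree base model.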
