# Regression using random forest

This example is adapted from [this article](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0). In this problem we want to predict the maximum temperature for any day in Seattle based on historical data from 2017. Temperature and precipitation data for Seattle may be accessed from the [NOAA Climate Data Online portal](https://www.ncdc.noaa.gov/cdo-web/). For convenience we have downloaded the data in comma-separated values (CSV) format and placed it in ```DATA/temps.csv```.

Our workflow for this section is:

 1. Load the data into a Pandas dataframe
 1. Clean the data as required
 1. Prepare the data for ingestion by the machine learning system
 1. Train the model and visualise the training and results
 1. Make predictions against a test dataset and assess performance
 1. Adjust hyperparameters and retrain
 1. Visualise results

### 1) Load the data into a dataframe

In [None]:
# Import the Pandas module
import pandas as pd

# Load the data
features = pd.read_csv('DATA/temps.csv')

# Print the dimensions (shape) of the table
print("Table dimensions [rows, columns]:", features.shape)

# Display the first 5 rows
features.head(5)

In [None]:
# Display the last 5 rows
features.tail(5)

The key columns for our analysis are:

 * temp_2: max temperature 2 days prior
 * temp_1: max temperature 1 day prior
 * average: historical average max temperature
 * actual: max temperature measurement
 * friend: your friend’s prediction, a random number between 20 below the average and 20 above the average

### 2) Clean the data

It is useful now to examine simple statistics for each column:

In [None]:
# Print descriptive stats for each column
features.describe()

The statistics of the data don't show any unusual outliers.

Next we want to plot the data for a visual examination. For convenience we will make a function that creates a nicely formatted plot with date on the x-axis and temperature on the y-axis.

In [None]:
# Necessary imports and magic command to display plots inline
import datetime
import matplotlib.pyplot as plt
%matplotlib inline


def plot_trend_ax(ax, df, columnNames = ['actual'], yLabel='Temperature'):
    """Plot multiple trend lines over each other"""
    
    # Select columns years, months and days
    years = df['year'].astype(str)
    months = df['month'].astype(str)
    days = df['day'].astype(str)
    
    # Convert dates to a datetime object
    dateStrLst = ["-".join(x) for x in zip(years, months, days)]
    dates = [datetime.datetime.strptime(x, '%Y-%m-%d') for x in dateStrLst]
    
    # Plot the data
    for colName in columnNames:
        ax.step(dates, df[colName], where='mid', label=colName, linewidth=1)
    ax.set_xlabel('Date')
    ax.set_ylabel(yLabel)
    ax.tick_params(axis='x', rotation=45)
    
    # Enable the legend
    ax.legend(loc='best')
    
    return ax    

In [None]:
# Initialise the figure
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111)

# Plot 
plot_trend_ax(ax, features, columnNames=['actual', 'temp_1', 'temp_2', 'friend'])

### 3) Preparing the data for the ML algorithm

We will be using an implementation of the Random Forest algorithm within *scikit-learn*. Most machine learning algorithms require *categorical data* to be encoded numerically, commonly as a binary vector. For example, a column containing days of the week ```['Mon', 'Tue', 'Wed' ...]``` would be converted to a set of new columns that looks like this:

![image.png](attachment:image.png)

This is known as *one-hot encoding* and is simply a vector where the current category is indicated by a '1' and every other category by a '0'. Pandas has a built-in function to convert all categorical variables to new 'one-hot' columns:

In [None]:
# One-hot encode categorical variables


Note that the other columns (e.g., month, day) are already in numerical format and so are not converted.
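
As a minimal sketch of what the empty cell above might contain: `pd.get_dummies` is the Pandas function that performs one-hot encoding. The toy table and its `week` column are assumptions made for illustration; the real column names come from ```DATA/temps.csv```.

```python
import pandas as pd

# Toy table (hypothetical values) with one categorical and one numeric column
toy = pd.DataFrame({
    'week': ['Mon', 'Tue', 'Wed', 'Mon'],
    'temp_1': [45, 44, 41, 40],
})

# One-hot encode: text columns become one indicator column per category,
# numeric columns pass through unchanged
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
```

In the notebook the same call would presumably be applied to `features` and the result assigned back to it.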

The goal of this exercise is to predict the actual temperature using the other columns in the dataset. We want to split out the 'actual' column and put it in a separate array.

In [None]:
# Use numpy to create a new array from the 'actual' column

# Then delete the column from the table
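
One possible way to fill in this cell is sketched below on a toy table. The names `toy` and `toy_labels` and the values are hypothetical; later cells in this notebook expect the real array to be called `labels` and NumPy to be imported as `np`.

```python
import numpy as np
import pandas as pd

# Toy table (hypothetical values) standing in for the real dataframe
toy = pd.DataFrame({
    'temp_1': [45, 44, 41],
    'average': [45.6, 45.7, 45.8],
    'actual': [44, 41, 40],
})

# Copy the target column into its own numpy array
toy_labels = np.array(toy['actual'])

# Drop the target from the table; axis=1 means "drop a column"
toy = toy.drop('actual', axis=1)

print(toy_labels, list(toy.columns))
```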


The remaining cells index the data with NumPy-style slicing, so we want to convert the table into a standard NumPy array that contains only numerical values. (Scikit-learn can also accept Pandas dataframes directly, but the code below assumes an array.) During this process the column names will be lost, so we save these as a separate list.

In [None]:
# Save a list of column headers for later use



In [None]:
# Convert the table to a numpy array
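
The two cells above can be sketched together as follows, again on a toy table with hypothetical values. The names `toy_feature_list` and `toy_array` are assumptions; later cells in this notebook refer to `feature_list` and to `features` as an array.

```python
import numpy as np
import pandas as pd

# Toy table (hypothetical values)
toy = pd.DataFrame({'temp_1': [45, 44, 41], 'average': [45.6, 45.7, 45.8]})

# Keep the column names, since a numpy array has no headers
toy_feature_list = list(toy.columns)

# Convert the table to a 2-D numpy array of numbers
toy_array = np.array(toy)

# A column can still be located by name through the saved list
col = toy_array[:, toy_feature_list.index('average')]
print(col)
```

Looking a column up through `list.index` is the same pattern the visualisation cells at the end of the notebook use to recover the `year`, `month` and `day` columns.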



**Training and Test data**

In supervised learning we want to set aside some data as an 'independent' dataset to test our predictions. This is usually some randomly-chosen fraction of the input data. Scikit-learn has a convenience function to make this split easy:

In [None]:
# Using Scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split 25% of the data into a test set
train_features, test_features, train_labels, test_labels = \
train_test_split(features, labels, test_size=0.25, random_state=42)

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

### 4) Train the model

In [None]:
# Import the specific algorithm 

# Create an instance of the model with 1000 estimators

# Train the model on training data

In [None]:
# Use the forest's predict method on the test data

The predictions are a temperature output for each date in the test dataset.
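
The training and prediction cells above are left empty. A minimal sketch of the steps they describe is given here on synthetic data; the data, the variable names and the reduced number of estimators are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): 200 samples, 3 features
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(200, 3))
y = 0.7 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 1, size=200)

# Hold out the last 50 rows as a stand-in test set
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Create the forest; the notebook asks for 1000 estimators,
# 100 are used here only to keep the sketch fast
forest = RandomForestRegressor(n_estimators=100, random_state=42)

# Train on the training data
forest.fit(X_train, y_train)

# Predict one value per row of the test data
toy_predictions = forest.predict(X_test)
print(toy_predictions.shape)
```

Later cells refer to the fitted model as `rf` and to its output as `predictions`, so those are the names the notebook version would need to use.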

### 5) Assess the performance of the model

In [None]:
# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')

In [None]:
# Calculate the percentage error of each prediction
mape = 100 * (errors / test_labels)

# Accuracy is defined here as 100 minus the
# mean absolute percentage error (MAPE)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
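
To make the two metrics concrete, here is a small worked example with hypothetical numbers chosen so the arithmetic is easy to follow by hand. Note that the percentage error divides by the measured value, so it is undefined if a label is exactly zero.

```python
import numpy as np

# Hypothetical predictions and measurements, chosen for easy arithmetic
toy_predictions = np.array([55.0, 38.0, 76.0, 100.0])
toy_labels = np.array([50.0, 40.0, 80.0, 100.0])

# Absolute errors: [5, 2, 4, 0]
toy_errors = np.abs(toy_predictions - toy_labels)
toy_mae = np.mean(toy_errors)           # (5 + 2 + 4 + 0) / 4 = 2.75

# Percentage errors: [10, 5, 5, 0], so the MAPE is 5%
toy_mape = np.mean(100 * toy_errors / toy_labels)
toy_accuracy = 100 - toy_mape           # 100 - 5 = 95

print(round(toy_mae, 2), round(toy_accuracy, 2))
```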

### 6) Adjusting the model

In [None]:
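# New forest with explicit hyperparameters. 'squared_error' replaces the
# older name 'mse', which was removed in scikit-learn 1.2
# (use 'mse' on versions older than 1.0)
rf_new = RandomForestRegressor(n_estimators = 100, 
                               criterion = 'squared_error', 
                               max_depth = None, 
                               min_samples_split = 2, 
                               min_samples_leaf = 1)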
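
The cell above only constructs the adjusted model. A minimal sketch of how two settings could be compared is given below, on synthetic data rather than the temperature table. The helper `fit_and_score` and the chosen `max_depth` values are assumptions for illustration, not settings recommended by the source article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(300, 3))
y = 0.7 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)


def fit_and_score(**params):
    """Fit a forest with the given hyperparameters and return its test MAE."""
    model = RandomForestRegressor(random_state=42, **params)
    model.fit(X_train, y_train)
    return np.mean(np.abs(model.predict(X_test) - y_test))


# Compare a very shallow forest with an unrestricted one
scores = {
    'max_depth=1': fit_and_score(n_estimators=100, max_depth=1),
    'max_depth=None': fit_and_score(n_estimators=100, max_depth=None),
}
for name, mae in scores.items():
    print(name, round(mae, 2))
```

Scoring every candidate on the same held-out split keeps the comparison fair, although repeatedly tuning against one test set will tend to overstate the final performance.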

### 7) Visualising the tree

In [None]:
# Import tools needed for visualization
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydot

# Pull out one tree from the forest
tree = rf.estimators_[5]

# Export the image to a dot file
export_graphviz(tree, out_file = 'TMP/tree.dot', feature_names = feature_list, rounded = True, precision = 1)

# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('TMP/tree.dot')

# Write graph to a png file and display
graph.write_png('TMP/tree.png');
Image('TMP/tree.png')

In [None]:
# Print the depth of the tree
print('The depth of this tree is:', tree.tree_.max_depth)
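
If Graphviz or `pydot` is not installed, a tree can also be inspected as plain text. The sketch below swaps in scikit-learn's `export_text` in place of the `export_graphviz` route used above; the synthetic data and the small forest are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic data (hypothetical) with two features
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(100, 2))
y = 0.7 * X[:, 0] + 0.3 * X[:, 1]

# Small, shallow forest so the printed tree stays readable
small_forest = RandomForestRegressor(n_estimators=10, max_depth=2,
                                     random_state=42)
small_forest.fit(X, y)

# Pull out one tree and render its split rules as indented text
one_tree = small_forest.estimators_[5]
tree_text = export_text(one_tree, feature_names=['temp_1', 'average'])
print(tree_text)
print('Depth:', one_tree.tree_.max_depth)
```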

### Assess variable importance


In [None]:
# Get numerical feature importances
importances = list(rf.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))


**Model with 2 most important features only**

In [None]:
# New random forest with only the two most important variables
rf_most_important = RandomForestRegressor(n_estimators= 1000, random_state=42)

# Extract the two most important features
important_indices = [feature_list.index('temp_1'), feature_list.index('average')]
train_important = train_features[:, important_indices]
test_important = test_features[:, important_indices]

# Train the random forest
rf_most_important.fit(train_important, train_labels)

# Make predictions and determine the error
predictions = rf_most_important.predict(test_important)

errors = abs(predictions - test_labels)

# Display the performance metrics
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')

mape = np.mean(100 * (errors / test_labels))
accuracy = 100 - mape

print('Accuracy:', round(accuracy, 2), '%.')

### 8) Visualisations

In [None]:
# Visualise the importance
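
One way this cell might be filled in is a simple bar chart, sketched below. The importance values are placeholders invented for illustration, not results from the model; in the notebook they would come from `rf.feature_importances_` and `feature_list`.

```python
import matplotlib.pyplot as plt

# Placeholder importances (hypothetical values, not results from the model)
toy_feature_list = ['temp_1', 'average', 'temp_2', 'friend']
toy_importances = [0.6, 0.25, 0.1, 0.05]

fig, ax = plt.subplots(figsize=(8, 5))

# One vertical bar per feature
x_positions = list(range(len(toy_importances)))
bars = ax.bar(x_positions, toy_importances)

# Label each bar with its feature name
ax.set_xticks(x_positions)
ax.set_xticklabels(toy_feature_list, rotation=45)
ax.set_xlabel('Variable')
ax.set_ylabel('Importance')
ax.set_title('Variable Importances')
```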

In [None]:
# Dates of training values
months = features[:, feature_list.index('month')]
days = features[:, feature_list.index('day')]
years = features[:, feature_list.index('year')]

# List and then convert to datetime object
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]

# Dataframe with true values and dates
true_data = pd.DataFrame(data = {'date': dates, 'actual': labels})

# Dates of predictions
months = test_features[:, feature_list.index('month')]
days = test_features[:, feature_list.index('day')]
years = test_features[:, feature_list.index('year')]

# Column of dates
test_dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]

# Convert to datetime objects
test_dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in test_dates]

# Dataframe with predictions and dates
predictions_data = pd.DataFrame(data = {'date': test_dates, 'prediction': predictions})

In [None]:
# Plot the actual values
plt.plot(true_data['date'], true_data['actual'], 'b-', label = 'actual')

# Plot the predicted values
plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label = 'prediction')
plt.xticks(rotation = 60); 
plt.legend()

# Graph labels
plt.xlabel('Date'); plt.ylabel('Maximum Temperature (F)'); plt.title('Actual and Predicted Values');
