# Random Forest

Random Forest is an ensemble machine learning algorithm built on bootstrap aggregation (bagging). It can be used for both classification and regression problems.

#### Quick Background on Bootstrap
First, to understand how a Random Forest works, we will start with the bootstrap method. Bootstrapping is a powerful statistical method for estimating a quantity from a data sample. This is easiest to understand if the quantity is a descriptive statistic such as a mean or a standard deviation.

If our sample is small (not enough training samples), we can improve the estimate of our mean by using the bootstrap procedure. We create many sub-samples by drawing randomly, with replacement, from the same dataset and calculate the mean of each sub-sample. We then average all of the collected means and use that as our estimated mean for the data. This lets us move on to bootstrap aggregation.
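
As a rough illustration of the procedure described above, here is a minimal sketch using only the Python standard library. The data values, the function name `bootstrap_mean`, and the number of resamples are made up for the example and are not part of the notebook's dataset.

```python
import random
import statistics

def bootstrap_mean(sample, n_resamples=1000, seed=0):
    """Average the means of many bootstrap resamples of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    resample_means = []
    for _ in range(n_resamples):
        # Draw n values WITH replacement from the original sample
        resample = rng.choices(sample, k=n)
        resample_means.append(statistics.mean(resample))
    # The spread of the resample means approximates the standard error of the mean
    return statistics.mean(resample_means), statistics.stdev(resample_means)

sample = [12, 15, 9, 20, 14, 11, 18, 16]  # hypothetical data
boot_mean, boot_se = bootstrap_mean(sample)
print("sample mean: {:.2f}".format(statistics.mean(sample)))
print("bootstrap mean: {:.2f} (standard error ~ {:.2f})".format(boot_mean, boot_se))
```

The bootstrap mean lands close to the plain sample mean. In practice the more useful by-product is the spread of the resample means, which gives a sense of how uncertain the estimate is.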

#### Quick Background on Bootstrap Aggregation
Bootstrap aggregation (bagging) is a general procedure that can be used to reduce the variance of algorithms that have high variance (that is, algorithms that tend to overfit). Decision trees, such as classification and regression trees (CART), are a typical high-variance algorithm.

The main idea is that we are averaging multiple sets of observations to reduce variance. 

- Take many training sets from the population
- Build a separate model using each set and take an average of all of the results
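
The two steps above can be sketched by hand. Since we rarely have many training sets from the population, bagging substitutes bootstrap samples of the one training set we do have. The synthetic data from `make_regression`, the `_demo` names, and the choice of 25 trees are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rng = np.random.default_rng(0)
n_models = 25
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(random_state=0).fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Aggregate: average the predictions of all trees for the first 5 rows
all_preds = np.stack([t.predict(X_demo[:5]) for t in trees])  # shape (n_models, 5)
bagged_pred = all_preds.mean(axis=0)
print(bagged_pred)
```

Each individual tree overfits its own bootstrap sample, and averaging the trees smooths out much of that variance.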
    
#### Quick Background on Ensemble Method
An ensemble method is a technique that combines the predictions from multiple machine learning models to make more accurate predictions than any individual model. Think of multiple weak learners being combined, for example by averaging their results, into one stronger model.

#### Main Difference Between Bootstrapping & Bagging
- Bootstrapping: sampling technique (draws repeated samples)
- Bagging: machine learning ensemble technique based on bootstrap samples

#### Decision Trees & Random Forest
A problem with decision trees like CART is that they are greedy. They choose which variable to split on using a greedy algorithm that minimizes error. 

Greedy means that the algorithm makes the locally optimal choice at each step in the hope of reaching a good overall solution. It makes decisions based only on the information it has at any one step, without regard to the overall problem. As a result, bagged trees tend to split on the same strong predictors and end up similar to one another. Random Forest addresses this by letting each split consider only a random subset of the features, which makes the trees less correlated. The decision to split at each node is made according to a metric called purity. For a two-class problem, a node is maximally impure when its data is split evenly 50/50 and 100% pure when all of its data belongs to a single class. This specific subject can be discussed in a separate notebook about decision trees.
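
To make the purity idea concrete, here is a small sketch using Gini impurity, one common impurity measure for classification trees. The notebook does not say which metric it has in mind, so choosing Gini is an assumption, and `gini_impurity` is a hypothetical helper name.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["a", "a", "a", "a"]))  # pure node -> 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # 50/50 two-class node -> 0.5
```

With two classes, 0.5 is the largest value Gini impurity can take, which corresponds to the evenly split node described above. A candidate split is scored by how much it lowers the weighted impurity of the child nodes.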

### Versatility

#### Estimated OOB Performance
"For each bootstrap sample taken from the training data, there will be samples left behind that were not included. These samples are called Out-Of-Bag samples or OOB.

The performance of each model on its left out samples when averaged can provide an estimated accuracy of the bagged models. This estimated performance is often called the OOB estimate of performance.

These performance measures are reliable test error estimate and correlate well with cross validation estimates."
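
In scikit-learn, the OOB estimate can be requested with `oob_score=True`. The sketch below uses synthetic data from `make_regression` rather than the notebook's dataset, and the `_demo` and `oob_model` names are placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration, not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# oob_score=True requires bootstrap sampling, which is the default
oob_model = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
oob_model.fit(X_demo, y_demo)

# For a regressor, oob_score_ is the R^2 computed from out-of-bag predictions
print("OOB R^2: {:.3f}".format(oob_model.oob_score_))
```

Each row is scored only by the trees that did not see it during training, so no separate validation split is needed for this estimate.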

#### Variable Importance
Random Forest does not just have a built-in validation method (OOB). We can also calculate how much the error function drops for a variable at each split point, which is part of what makes the algorithm so popular.

In regression problems this may be the drop in sum squared error and in classification this might be the Gini score.

These drops in error can be averaged across all decision trees and output to provide an estimate of the importance of each input variable. The greater the drop when the variable was chosen, the greater the importance.

These outputs can help identify subsets of input variables that may be most or least relevant to the problem and suggest possible feature selection experiments you could perform where some features are removed from the dataset.
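
The averaging across trees can be checked directly in scikit-learn, where every fitted tree in `estimators_` exposes its own `feature_importances_`. This is a sketch on synthetic data, and `imp_model` and the `_demo` names are placeholders rather than objects from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration, not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, n_informative=3, random_state=0)
imp_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Each fitted tree reports its own normalized impurity-based importances
per_tree = np.array([tree.feature_importances_ for tree in imp_model.estimators_])
averaged = per_tree.mean(axis=0)

print(np.round(averaged, 3))
# Compare the manual average with the forest-level attribute
print(np.allclose(averaged, imp_model.feature_importances_))
```

Impurity-based importances are computed on the training data, so they are best read as a rough ranking rather than an exact measure of each variable's effect.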

    Many of the notes about OOB/VarImp from MLM.
    
#### Advantages
- Solves both classification and regression
- Handles large data sets with higher dimensionality
- Handles thousands of input variables and identifies the most significant ones, so it is sometimes used as a dimensionality reduction (feature selection) method. The model also outputs variable importance
- Has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing (bagging method)
- Handles imbalanced classes better than most algorithms
- Has in-model cross-validation method (OOB testing)

#### Disadvantages
- Not the most explainable algorithm due to the randomness of feature selection per tree
    - Controlling what goes into the model is difficult
- Memory usage is higher than most algorithms

### Data
1. Title: Facebook performance metrics
2. Sources
   Created by: Sérgio Moro, Paulo Rita and Bernardo Vala (ISCTE-IUL) @ 2016
3. Past Usage:
    - The full dataset was described and analyzed in:
    - S. Moro, P. Rita and B. Vala. Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, Elsevier, In press, Available online since 28 February 2016.
4. Relevant Information:
    - The data relates to posts published during 2014 on the Facebook page of a renowned cosmetics brand.
    - This dataset contains 500 of the 790 rows and part of the features analyzed by Moro et al. (2016). The rest were omitted due to confidentiality issues.
5. Number of Instances: 500
6. Number of Attributes: 19
7. Missing Attribute Values: None

### Import libraries

In [299]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings(action='ignore')

### Import data

In [300]:
data = pd.read_csv('Data/13-facebook-likes.csv')

### Nulls

In [301]:
# Define a new function
def get_nulls(df):
    '''
    Return the null count and null percentage of every column with missing values
    '''
    # Get null counts and pct per column, largest first
    null_counts = df.isnull().sum().sort_values(ascending=False)
    null_cols = null_counts.to_frame('Null Data Count')
    null_cols_pct = (null_counts / len(df)).round(2).to_frame('Null Data Pct')

    # Combine dataframes horizontally
    null_cols_df = pd.concat([null_cols, null_cols_pct], axis=1)

    # Note: the pct is rounded to 2 decimals, so a column with very few nulls
    # (under about 0.5% of rows) rounds to 0 and is filtered out here
    all_nulls = null_cols_df[null_cols_df['Null Data Pct']>0]

    # Print
    print('There are', len(all_nulls), 'columns with missing values.')
    return all_nulls

In [302]:
get_nulls(data)

There are 1 columns with missing values.


| | Null Data Count | Null Data Pct |
|---|---|---|
| share | 4 | 0.01 |


In [303]:
data.share = data.share.fillna(data.share.median())
data.like = data.like.fillna(data.like.median())

In [304]:
get_nulls(data)

There are 0 columns with missing values.


| | Null Data Count | Null Data Pct |
|---|---|---|


### Dummies

In [305]:
# Get dummies
data = pd.get_dummies(data, drop_first=False)

### X / y Split

In [306]:
# Split into target (y) and features (X)
y = data['Total Interactions']
X = data.drop(['Total Interactions'], axis=1)

# # Re-test with top 6 features
# high_varimp = ['like', 'share', 'Lifetime Post reach by people who like your Page', 'comment',
#                'Lifetime Post Total Reach', 'Lifetime Engaged Users']
# X = data[high_varimp]


### Train / Test Split

In [307]:
# Import split module
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=100)

In [308]:
for idx, val in enumerate(X_train):
    print(idx, val)

0 Page total likes
1 Lifetime Post Total Reach
2 Lifetime Post Total Impressions
3 Lifetime Engaged Users
4 Lifetime Post Consumers
5 Lifetime Post Consumptions
6 Lifetime Post Impressions by people who have liked your Page
7 Lifetime Post reach by people who like your Page
8 Lifetime People who have liked your Page and engaged with your post
9 comment
10 like
11 share


In [309]:
# X_train.reset_index(inplace=True, drop=True)
# X_test.reset_index(inplace=True, drop=True)

### Scaler

In [311]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Transform the variables to be on the same scale
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### Evaluation - RMSE

In [312]:
from sklearn.metrics import mean_squared_error

# Create a function that will calculate error
def base_rmse(y, y_pred):
    '''
    Return the sqrt of the mean squared error between the observed and predicted values
    '''
    rmse = np.sqrt(mean_squared_error(y, y_pred))
    return rmse
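
For reference, the quantity returned by `base_rmse` can be written as follows. The symbols are standard notation chosen here, not names from the notebook: $n$ is the number of observations, $y_i$ an observed value, and $\hat{y}_i$ the corresponding prediction.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones. The result is in the same units as the target, here total interactions.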

### Fix NA Issue (replace np array values)

In [313]:
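# Check for NaNs in the scaled training array
# (in a notebook only the last expression of the cell is displayed)
np.where(np.isnan(X_train))
# Replace any NaNs with 0
X_train = np.nan_to_num(X_train)
# Re-check: empty index arrays mean no NaNs remain
np.where(np.isnan(X_train))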

(array([], dtype=int64), array([], dtype=int64))

## Model

In [314]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error 

# Random Forest
model = RandomForestRegressor(n_estimators=500, random_state=10)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
rf_rmse = base_rmse(y_test, y_pred)
print('Random Forest RMSE: {:.2f}'.format(rf_rmse))

Random Forest RMSE: 54.88


### Feature Importance

In [315]:
headers = X.columns

# Create a new function to capture feature importance for tree-based models (RF, GB, XGB)
def feature_importance(model):
    
    importance = pd.DataFrame({'Feature': headers,
                               'Importance': np.round(model.feature_importances_,5)})
    
    importance = importance.sort_values(by='Importance', ascending=False).set_index('Feature')
    
    return importance

In [316]:
feature_importance(model)

| Feature | Importance |
|---|---|
| like | 0.62148 |
| share | 0.09315 |
| Lifetime Post reach by people who like your Page | 0.08174 |
| comment | 0.08097 |
| Lifetime Post Total Reach | 0.06536 |
| Lifetime Engaged Users | 0.03358 |
| Lifetime People who have liked your Page and engaged with your post | 0.00996 |
| Lifetime Post Impressions by people who have liked your Page | 0.00466 |
| Lifetime Post Total Impressions | 0.00454 |
| Lifetime Post Consumers | 0.00289 |
