# Random Forest Basics
A random forest model is a collection of decision tree models whose predictions are combined. When you build a random forest, you specify the number of decision trees to use. The random forest algorithm then takes random samples of observations from your training data and builds a decision tree for each sample. The samples are typically drawn with replacement, meaning the same observation can be drawn more than once. The end result is a set of decision trees, each built from a different group of records drawn from the original training data.
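
To make the sampling step concrete, here is a minimal sketch of drawing bootstrap samples with NumPy. The ten-row index array, the seed, and the choice of three trees are illustrative assumptions, not part of the Titanic workflow below.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for repeatability
row_ids = np.arange(10)          # stand-in for the row indices of a training set
n_trees = 3                      # a real forest would use far more

bootstrap_samples = []
for tree in range(n_trees):
    # Draw as many rows as the original data has, with replacement
    sample = rng.choice(row_ids, size=len(row_ids), replace=True)
    bootstrap_samples.append(sample)
    left_out = np.setdiff1d(row_ids, sample)   # rows this tree never sees
    print(f"tree {tree}: sample={np.sort(sample)}, left out={left_out}")
```

Because rows are drawn with replacement, some indices repeat within a sample and others are left out entirely, so each tree would be trained on a slightly different view of the data.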

The decision trees in a random forest model are a little different than the standard decision trees we made last time. Instead of growing trees where every single explanatory variable can potentially be used to make a branch at any level in the tree, random forests limit the variables that can be used to make a split in the decision tree to some random subset of the explanatory variables. Limiting the splits in this fashion helps avoid the pitfall of always splitting on the same variables and helps random forests create a wider variety of trees to reduce overfitting.
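
The feature restriction can be sketched in a few lines. The snippet below only illustrates how candidate variables might be drawn for each split; it is not scikit-learn's internal implementation, and the helper name `candidate_features` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
all_features = ["Sex", "Pclass", "SibSp", "Embarked", "Age", "Fare"]

def candidate_features(features, max_features, rng):
    """Return a random subset of features to consider at one split (hypothetical helper)."""
    idx = rng.choice(len(features), size=max_features, replace=False)
    return [features[i] for i in idx]

# Each split gets its own draw, so different splits see different variables
for split in range(4):
    print(f"split {split}: {candidate_features(all_features, 2, rng)}")
```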

Random forests are an example of an ensemble model: a model composed of some combination of several different underlying models. Ensemble models often yield better results than single models because different models may detect different patterns in the data, and combining models tends to dull the tendency that complex single models have to overfit the data.
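
A small calculation shows why combining models can help. The sketch below assumes each model is right 70% of the time and that the models err independently. Both are simplifying assumptions (trees in a real forest are correlated), so treat the numbers as an illustration rather than a prediction. The function name `majority_vote_accuracy` is made up for this example.

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """Probability that more than half of n independent models are correct."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

# Odd ensemble sizes avoid tied votes
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Under these assumptions the majority vote becomes more reliable as models are added, which is the intuition behind averaging many trees.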

# Random Forests on the Titanic

Python's sklearn package offers a random forest model that works much like the decision tree model we used last time. Let's use it to train a random forest model on the Titanic training set:

In [1]:
import numpy as np
import pandas as pd
import os

In [2]:
dir_path = os.path.curdir
os.getcwd()

'C:\\Users\\aakas'

In [3]:
# Load and prepare Titanic data
#os.chdir('data\\') # Set working directory

In [5]:
?pd.read_csv

In [6]:
titanic_train = pd.read_csv("D:\\PythonFiles\\Decision Tree, Random Forest, KNN\\4_Codes\\Random_Forest\\data\\train.csv")    # Read the data

In [7]:
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [8]:
# Impute median Age for NA Age values
new_age_var = np.where(titanic_train["Age"].isnull(), # Logical check
                       28,                       # Value if check is true
                       titanic_train["Age"])     # Value if check is false

titanic_train["Age"] = new_age_var 
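
The cell above hard-codes 28 as the median age. As a hedged alternative, the median can be computed from the column instead of typed in. The sketch below uses a tiny made-up frame (`toy`) rather than the Titanic data so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

median_age = toy["Age"].median()            # NaN values are skipped by default
toy["Age"] = toy["Age"].fillna(median_age)  # replace missing ages with the median

print(median_age)
print(toy["Age"].tolist())
```

Computing the value keeps the imputation in sync with the data if the training file ever changes.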

In [9]:
# Impute Embarked to "S" for NA values
new_embarked_var = np.where(titanic_train["Embarked"].isnull(),  # Logical check
                            "S",                                 # Value if check is true
                            titanic_train["Embarked"])           # Value if check is false

titanic_train["Embarked"] = new_embarked_var

In [10]:
SouthamptonValues = (titanic_train["Embarked"]=="S")


In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing 

In [13]:
# Set the seed
np.random.seed(12)

# Initialize label encoder
label_encoder = preprocessing.LabelEncoder()

# Convert some variables to numeric
titanic_train["Sex"] = label_encoder.fit_transform(titanic_train["Sex"])

titanic_train["Embarked"] = label_encoder.fit_transform(titanic_train["Embarked"])

In [14]:
# Initialize the model
rf_model = RandomForestClassifier(n_estimators=1000, # Number of trees
                                  max_features=2,    # Num features considered
                                  oob_score=True)    # Use OOB scoring

In [17]:
features = ["Sex","Pclass","SibSp","Embarked","Age","Fare"]

In [18]:
# Train the model
rf_model.fit(X=titanic_train[features],
             y=titanic_train["Survived"])

print("OOB accuracy: ")
print(rf_model.oob_score_)

OOB accuracy: 
0.8215488215488216


Since random forest models build each tree from a random subset, or "bag", of the data, model performance can be estimated by making predictions on the out-of-bag (OOB) samples, the observations a given tree did not see during training, instead of using cross-validation. You can still use cross-validation on random forests, but OOB validation already provides a good estimate of performance, and fitting several random forest models for K-fold cross-validation can be computationally expensive.
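
OOB scoring is possible because each bootstrap sample leaves out a sizable share of the rows. As a rough sketch, assuming each tree draws n rows with replacement from n training rows, the chance that a given row is never drawn is (1 - 1/n)^n, which approaches 1/e. The row count of 891 is an assumption based on the standard Kaggle Titanic training file.

```python
import math

n_rows = 891  # assumed number of training rows

# Probability that one particular row is missed by all n_rows draws
p_left_out = (1 - 1 / n_rows) ** n_rows

print(round(p_left_out, 4))      # share of rows a single tree is expected to miss
print(round(math.exp(-1), 4))    # the large-n limit, 1/e
```

So roughly a third of the rows are unseen by any one tree, and each row can be scored using only the trees that never trained on it.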

The random forest classifier assigns an importance value to each feature used in training. Features with higher importance were more influential in building the trees, which suggests a stronger association with the response variable. Let's check the feature importances for our random forest model:

In [19]:
for feature, imp in zip(features, rf_model.feature_importances_):
    print(feature, imp)

Sex 0.26841606896799425
Pclass 0.08807509846507673
SibSp 0.0511499211000967
Embarked 0.03140154321207602
Age 0.27296071210127776
Fare 0.2879966561534785
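
The importances are easier to compare once sorted. This sketch re-enters the values printed above by hand so that it runs standalone; in the notebook the same mapping could be built from `features` and `rf_model.feature_importances_`.

```python
# Values copied from the output above
importances = {
    "Sex": 0.26841606896799425,
    "Pclass": 0.08807509846507673,
    "SibSp": 0.0511499211000967,
    "Embarked": 0.03140154321207602,
    "Age": 0.27296071210127776,
    "Fare": 0.2879966561534785,
}

# Sort from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(f"{name:<10}{value:.3f}")

# The importances are normalized, so they should sum to about 1
print("Total:", round(sum(importances.values()), 3))
```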


In [21]:
# Read and prepare test data
titanic_test = pd.read_csv("D:\\PythonFiles\\Decision Tree, Random Forest, KNN\\4_Codes\\Random_Forest\\data\\test.csv")    # Read the data

In [22]:
# Check if there are any null values, NaN or incompatible entries in test data
for feature in features:
    print(feature,set(titanic_test[feature].isnull()))

Sex {False}
Pclass {False}
SibSp {False}
Embarked {False}
Age {False, True}
Fare {False, True}
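
The True/False sets above show which columns contain missing values but not how many. A common alternative is to count them with `isnull().sum()`. The frame below (`toy_test`) is made up for illustration and is not the Titanic test set.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy_test = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, np.nan],
    "Fare": [7.8292, 7.0, np.nan, 8.6625],
    "Pclass": [3, 3, 2, 3],
})

# Number of missing values in each column
missing_counts = toy_test.isnull().sum()
print(missing_counts)
```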


In [23]:
# Impute median Age for NA Age values
new_age_var = np.where(titanic_test["Age"].isnull(),
                       28,                      
                       titanic_test["Age"])      

titanic_test["Age"] = new_age_var 


# Impute the mode of Fare for NA values
new_fare_var = titanic_test["Fare"].fillna(titanic_test["Fare"].mode().iloc[0])
titanic_test["Fare"] = new_fare_var

# Convert some variables to numeric
titanic_test["Sex"] = label_encoder.fit_transform(titanic_test["Sex"])
titanic_test["Embarked"] = label_encoder.fit_transform(titanic_test["Embarked"])
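
Re-fitting the encoder on the test set gives the same codes as in training only as long as both sets contain the same categories, since LabelEncoder assigns codes in sorted order of the labels it sees. A more defensive pattern, sketched below with made-up rows and a hypothetical `encoders` dict, fits one encoder per column on the training data and reuses it on the test data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration only
train = pd.DataFrame({"Sex": ["male", "female", "female"],
                      "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Sex": ["female", "male"],
                     "Embarked": ["Q", "S"]})

encoders = {}
for col in ["Sex", "Embarked"]:
    encoders[col] = LabelEncoder().fit(train[col])   # learn the mapping from training data
    train[col] = encoders[col].transform(train[col])
    test[col] = encoders[col].transform(test[col])   # raises ValueError on unseen labels

print(list(encoders["Sex"].classes_))
print(list(encoders["Embarked"].classes_))
print(test)
```

With this pattern a category that appears only in the test data produces an error instead of silently shifting the codes.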

Feature importance can help identify useful features and eliminate features that don't contribute much to the model.

As a final exercise, let's use the random forest model to make predictions on the Titanic test set and submit them to Kaggle to see how our actual generalization performance compares to the OOB estimate:

In [24]:
# Make test set predictions
test_preds = rf_model.predict(X= titanic_test[features])


In [25]:
# Create a submission for Kaggle
submission = pd.DataFrame({"PassengerId":titanic_test["PassengerId"],
                           "Survived":test_preds})

# Save submission to CSV
submission.to_csv("tutorial_randomForest_submission.csv", 
                  index=False)        # Do not save index values

In [26]:
print('As per our prediction model:\n Out of {} passengers {} survived and {} did not survive'.format(
    len(test_preds), list(test_preds).count(1), list(test_preds).count(0)))

As per our prediction model:
 Out of 418 passengers 153 survived and 265 did not survive


Upon submission, the random forest model achieves an accuracy score of 0.75120, which is actually worse than the decision tree model and even the simple gender-based model. What gives? Is the model overfitting the training data? Did we choose bad variables and model parameters? Or perhaps our simplistic imputation of filling in missing age data using median ages is hurting our accuracy. Data analyses and predictive models often don't turn out how you expect, but even a "bad" result can give you more insight into your problem and help you improve your analysis or model in a future iteration.
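
One way to start answering the overfitting question is to compare accuracy on the training rows with the OOB estimate. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption made so the example runs without the Titanic files), so the numbers say nothing about the Titanic model itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, noisy stand-in data with six features
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           flip_y=0.2, random_state=12)

rf = RandomForestClassifier(n_estimators=200, max_features=2,
                            oob_score=True, random_state=12)
rf.fit(X, y)

train_acc = rf.score(X, y)   # accuracy on rows the trees have already seen
print("Training accuracy:", round(train_acc, 3))
print("OOB accuracy:     ", round(rf.oob_score_, 3))
print("Gap:              ", round(train_acc - rf.oob_score_, 3))
```

A large gap between the two suggests the trees are memorizing training rows. Parameters such as `max_depth` or `min_samples_leaf` are common places to start when trying to narrow it.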