# Activity: Build a random forest model

## **Introduction**


As you're learning, random forests are popular statistical learning algorithms. Compared with a single decision tree, their primary benefits include reducing variance and the chance of overfitting.

This activity is a continuation of the project you began modeling with decision trees for an airline. Here, you will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from 129,880 customers. It includes data points such as class, flight distance, and inflight entertainment. Your random forest model will be used to predict whether a customer will be satisfied with their flight experience.

**Note:** Because this lab uses a real dataset, this notebook first requires exploratory data analysis, data cleaning, and other manipulations to prepare it for modeling.

## **Step 1: Imports** 


Import relevant Python libraries and modules, including the `numpy` and `pandas` libraries for data processing; the `pickle` package to save the model; and the `sklearn` library, containing:
- The module `ensemble`, which has the class `RandomForestClassifier`
- The module `model_selection`, which has the function `train_test_split` and the classes `PredefinedSplit` and `GridSearchCV`
- The module `metrics`, which has the functions `make_scorer`, `f1_score`, `precision_score`, `recall_score`, and `accuracy_score`


In [35]:
# Import `numpy`, `pandas`, `pickle`
import numpy as np
import pandas as pd
import pickle as pkl

# Import `sklearn` functions
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score, accuracy_score

In [36]:
# Import data
air_data = pd.read_csv("Invistico_Airline.csv")

Now, you're ready to begin cleaning your data. 

## **Step 2: Data cleaning** 

To get a sense of the data, display the first five rows.

In [37]:
# Display the first five rows.
air_data.head()

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0


Now, display the variable names and their data types. 

In [38]:
# Display variable names and types.
air_data.dtypes

satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object

**Question:** What do you observe about the differences in data types among the variables included in the data?

**Answer:** 
* Four variables (`satisfaction`, `Customer Type`, `Type of Travel`, and `Class`) are categorical (`object`); the rest are numerical (`int64` or `float64`).

Next, to understand the size of the dataset, identify the number of rows and the number of columns.

In [39]:
# Identify the number of rows and the number of columns.
air_data.shape

(129880, 22)

Now, check for missing values in the rows of the data. 

In [40]:
# Get Booleans to find missing values in data.
# Get Booleans to find missing values along columns.
# Get the number of rows that contain missing values.
air_data.isna().any(axis=1).sum()


393

**Question:** How many rows of data are missing values?

**Answer:** 
* There are 393 rows with missing values. 
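
To see how the row-wise count differs from a per-column count, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from the airline data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with missing entries in two different rows.
toy = pd.DataFrame({
    "Age": [65, 47, np.nan, 60],
    "Arrival Delay in Minutes": [0.0, np.nan, 12.0, 5.0],
})

# isna() gives a Boolean frame; any(axis=1) flags rows with at least one NaN.
rows_with_missing = toy.isna().any(axis=1).sum()

# Summing the Boolean frame directly counts NaNs per column instead.
missing_per_column = toy.isna().sum()

print(rows_with_missing)
print(missing_per_column)
```

The per-column view can help you decide whether dropping rows is reasonable or whether a single column accounts for most of the missing values.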

Drop the rows with missing values. This is an important step in data cleaning, as it makes the data more useful for analysis and modeling. Then, save the resulting pandas DataFrame in a variable named `air_data_subset`.

In [41]:
# Drop rows with missing values.
# Save the DataFrame in variable `air_data_subset`.
air_data_subset = air_data.dropna(axis=0)

Next, display the first 10 rows to examine the data subset.

In [42]:
# Display the first 10 rows.
air_data_subset.head(10)

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,...,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,...,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,...,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,...,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,...,2,2,2,4,5,3,4,2,30,26.0


Confirm that it does not contain any missing values.

In [43]:
# Count of missing values.
air_data_subset.isna().sum()


satisfaction                         0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
dtype: int64

Next, convert the categorical features to indicator (one-hot encoded) features. 

**Note:** The `drop_first` argument can be kept as default (`False`) during one-hot encoding for random forest models, so it does not need to be specified. Also, the target variable, `satisfaction`, does not need to be encoded and will be extracted in a later step.

In [44]:
# Convert categorical features to one-hot encoded features.
air_data_subset_dummies = pd.get_dummies(air_data_subset, columns=['Customer Type', 'Type of Travel', 'Class'], dtype=int)

# Display the first 10 rows.
air_data_subset_dummies.head(10)

Unnamed: 0,satisfaction,Age,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,...,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel,Class_Business,Class_Eco,Class_Eco Plus
0,satisfied,65,265,0,0,0,2,2,4,2,...,2,0,0.0,1,0,0,1,0,1,0
1,satisfied,47,2464,0,0,0,3,0,2,2,...,2,310,305.0,1,0,0,1,1,0,0
2,satisfied,15,2138,0,0,0,3,2,0,2,...,2,0,0.0,1,0,0,1,0,1,0
3,satisfied,60,623,0,0,0,3,3,4,3,...,3,0,0.0,1,0,0,1,0,1,0
4,satisfied,70,354,0,0,0,3,4,3,4,...,5,0,0.0,1,0,0,1,0,1,0
5,satisfied,30,1894,0,0,0,3,2,0,2,...,2,0,0.0,1,0,0,1,0,1,0
6,satisfied,66,227,0,0,0,3,2,5,5,...,3,17,15.0,1,0,0,1,0,1,0
7,satisfied,10,1812,0,0,0,3,2,0,2,...,2,0,0.0,1,0,0,1,0,1,0
8,satisfied,56,73,0,0,0,3,5,3,5,...,4,0,0.0,1,0,0,1,1,0,0
9,satisfied,22,1556,0,0,0,3,2,0,2,...,2,30,26.0,1,0,0,1,0,1,0


**Question:** Why is it necessary to convert categorical data into dummy variables?

**Answer:** 
* Scikit-learn's `RandomForestClassifier`, like many machine learning implementations, requires numeric input features, so string categories must be encoded as numbers. One-hot encoding does this without implying an order among the categories.
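
As a minimal sketch of what `pd.get_dummies` does, the toy frame below (hypothetical values) encodes a single `Class` column and leaves the numeric column unchanged.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical and one numeric column.
toy = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco Plus", "Eco"],
    "Age": [65, 47, 15, 60],
})

# Each category of `Class` becomes its own 0/1 column; `Age` is left unchanged.
encoded = pd.get_dummies(toy, columns=["Class"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Because `drop_first` is left at its default, every category gets a column, so exactly one of the `Class_*` columns is 1 in each row.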

Then, check the variables of `air_data_subset_dummies`.

In [45]:
# Display variables.
air_data_subset_dummies.dtypes

satisfaction                          object
Age                                    int64
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
Customer Type_Loyal Customer           int64
Customer Type_disloyal Customer        int64
Type of Travel_Business travel         int64
Type of Travel_Personal Travel         int64
Class_Business                         int64
Class_Eco                              int64
Class_Eco Plus                         int64
dtype: object

**Question:** What changes do you observe after converting the string data to dummy variables?

**Answer:**
* There are additional binary columns, one per category of each encoded variable. All the data is now numerical except for `satisfaction`.

## **Step 3: Model building** 

The first step to building your model is separating the labels (y) from the features (X).

In [46]:
# Separate the dataset into labels (y) and features (X).

y = air_data_subset_dummies['satisfaction']
X = air_data_subset_dummies.copy()
X = X.drop(columns='satisfaction')

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

To obtain the features, drop the `satisfaction` column from the DataFrame.

</details>

Once separated, split the data into train, validate, and test sets. 

In [47]:
# Separate into train, validate, test sets.

# Split set into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

# Split train into train and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Split `X`, `y` to get `X_train`, `X_test`, `y_train`, `y_test`. Set the `test_size` argument to the proportion of data points you want to select for testing. 

Split `X_train`, `y_train` to get `X_tr`, `X_val`, `y_tr`, `y_val`. Set the `test_size` argument to the proportion of data points you want to select for validation. 

</details>
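
Applying `test_size=0.25` twice leaves 75% × 75% = 56.25% of the rows for training, 18.75% for validation, and 25% for testing. The sketch below checks this on synthetic data; the `*_toy` names are illustrative and not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption for illustration): 1,600 rows, two balanced classes.
X_toy = np.arange(1600).reshape(-1, 1)
y_toy = np.tile([0, 1], 800)

# First split: hold out 25% of all rows for testing.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42)

# Second split: hold out 25% of the remaining rows for validation.
X_tr_toy, X_val_toy, y_tr_toy, y_val_toy = train_test_split(
    X_train_toy, y_train_toy, test_size=0.25, stratify=y_train_toy, random_state=42)

for name, part in [("train", X_tr_toy), ("validate", X_val_toy), ("test", X_test_toy)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(X_toy):.4f} of the data)")
```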

### Tune the model

Now, fit and tune a random forest model with a separate validation set. Begin by determining a set of hyperparameters for tuning the model using `GridSearchCV`.


In [55]:
# Determine set of hyperparameters.
cv_params = {'n_estimators': [50, 75, 100],
             'max_samples': [0.5, 0.9],
             'max_features': ['sqrt'], 
             'max_depth': [5, 10, 50, None],
             'min_samples_leaf': [0.1, 0.2, 0.4, 0.5, 0.8, 1],
             'min_samples_split': [0.0001, 0.001, 0.01],
             }


Next, create a list of split indices.

In [56]:
# Create list of split indices.
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
custom_split = PredefinedSplit(split_index)
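
In `PredefinedSplit`, an entry of `-1` keeps a row in the training portion for every split, while `0` assigns it to validation fold 0. The sketch below illustrates this on a small hypothetical index; it also shows a vectorized way to build the same kind of list with `Index.isin`, which is an alternative to the list comprehension, not the lab's method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import PredefinedSplit

# Hypothetical row labels standing in for X_train.index and X_val.index.
train_index = pd.Index([10, 11, 12, 13, 14, 15])
val_index = pd.Index([12, 14])

# 0 marks validation rows; -1 marks rows that are always used for training.
test_fold = np.where(train_index.isin(val_index), 0, -1)
toy_split = PredefinedSplit(test_fold)

print(test_fold)
print(toy_split.get_n_splits())

# The single split yields positional indices, not the original labels.
for train_pos, val_pos in toy_split.split():
    print(train_pos, val_pos)
```

This is why the fit output later reports a single fold per candidate: there is exactly one predefined train/validation split.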

Now, instantiate your model.

In [57]:
# Instantiate model.
rf = RandomForestClassifier(random_state=0)

# Set scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, pos_label='satisfied'),
    'recall': make_scorer(recall_score, pos_label='satisfied'),
    'f1': make_scorer(f1_score, pos_label='satisfied')
}

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use `RandomForestClassifier()`, specifying the `random_state` argument for reproducible results. This will help you instantiate a random forest model, `rf`.

</details>

Next, use GridSearchCV to search over the specified parameters.

In [58]:
# Search over specified parameters.

rf_gs = GridSearchCV(estimator=rf, param_grid=cv_params, scoring=scoring, cv=custom_split, refit='f1', n_jobs=-1, verbose=1)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use `GridSearchCV()`, passing in `rf` and `cv_params` and specifying `cv` as `custom_split`. Additional arguments that you can specify include: `refit='f1', n_jobs = -1, verbose = 1`. 

</details>

Now, fit your model.

In [59]:
%%time
# Fit the model.
rf_gs.fit(X_train, y_train)

Fitting 1 folds for each of 432 candidates, totalling 432 fits


CPU times: user 14.7 s, sys: 1.09 s, total: 15.8 s
Wall time: 1min 40s


Finally, obtain the optimal parameters.

In [60]:
# Obtain optimal parameters.

rf_gs.best_params_


{'max_depth': 50,
 'max_features': 'sqrt',
 'max_samples': 0.9,
 'min_samples_leaf': 1,
 'min_samples_split': 0.0001,
 'n_estimators': 100}
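
Beyond `best_params_`, a fitted `GridSearchCV` object also exposes the validation scores for every candidate (`cv_results_`) and, when `refit` is set, a model already refit with the best parameters (`best_estimator_`). The following is a minimal sketch on synthetic data with a much smaller grid and 3-fold cross-validation; those choices are assumptions made to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption for illustration).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

toy_gs = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, None]},
    scoring={"accuracy": "accuracy", "f1": "f1"},
    refit="f1",   # the metric used to pick the best candidate
    cv=3,
)
toy_gs.fit(X_toy, y_toy)

# One row per candidate; multi-metric columns are named mean_test_<metric>.
results = pd.DataFrame(toy_gs.cv_results_)
best_row = results.loc[toy_gs.best_index_, ["mean_test_accuracy", "mean_test_f1"]]

print(toy_gs.best_params_)
print(best_row)

# With refit set, the best candidate is refit on all data passed to fit().
best_model = toy_gs.best_estimator_
```

In this lab, that means `rf_gs.best_estimator_` should already be a model trained on `X_train` with the optimal parameters, which is an alternative to re-instantiating the classifier by hand.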

## **Step 4: Results and evaluation** 

Use the selected model to predict on your test data. Use the optimal parameters found via GridSearchCV.

In [62]:
# Instantiate a model with the optimal parameters found by GridSearchCV.

rf_opt = RandomForestClassifier(n_estimators=100,
                                max_depth=50,
                                max_features='sqrt',
                                max_samples=0.9,
                                min_samples_leaf=1,
                                min_samples_split=0.0001,
                                random_state=0)


Once again, fit the optimal model.

In [63]:
# Fit the optimal model.
rf_opt.fit(X_train, y_train)


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `fit()` method to train `rf_opt` on `X_train` and `y_train`.

</details>
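
Step 1 imports `pickle` so that a fitted model can be saved and reloaded without retraining. Here is a minimal sketch of that round trip, using a small synthetic model and a temporary directory. Both are assumptions for illustration, and the file name `rf_toy.pickle` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the airline data (assumption for illustration).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "rf_toy.pickle"  # hypothetical file name

    # Write the fitted model to disk.
    with open(model_path, "wb") as to_write:
        pickle.dump(toy_model, to_write)

    # Read it back into a new object.
    with open(model_path, "rb") as to_read:
        restored_model = pickle.load(to_read)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored_model.predict(X_toy) == toy_model.predict(X_toy)).all())
print(same_predictions)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.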

And predict on the test set using the optimal model.

In [64]:
# Predict on test set.
y_pred = rf_opt.predict(X_test)


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `predict()` method to make predictions on `X_test` using `rf_opt`. Save the predictions now (for example, as `y_pred`), to use them later for comparing to the true labels. 

</details>

### Obtain performance scores

First, get your precision score.

In [66]:
# Get precision score.
prcs_test = precision_score(y_test, y_pred, pos_label='satisfied')
print(f"Precision score: {prcs_test:.3f}" )


Precision score: 0.965


Then, collect the recall score.

In [68]:
# Get recall score.
rcl_test = recall_score(y_test, y_pred, pos_label='satisfied')
print(f"Recall score: {rcl_test:.3f}")

Recall score: 0.951


Next, obtain your accuracy score.

In [71]:
# Get accuracy score.
acc_test = accuracy_score(y_test, y_pred)
print(f"Accuracy score: {acc_test:.3f}")

Accuracy score: 0.954


Finally, collect your F1-score.

In [73]:
# Get F1 score.

f1_test = f1_score(y_test, y_pred, pos_label='satisfied')
print(f"The F1 score is {f1_test:.3f}")


The F1 score is 0.958


**Question:** How is the F1-score calculated?

**Answer:** The F1 score is calculated as the harmonic mean of the precision and recall scores. That is, $F1 = 2 \cdot (Precision \cdot Recall)/(Precision + Recall)$, where: 

* $Precision = TP/(TP + FP)$
* $Recall = TP/(TP + FN)$
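
As a quick check of the formula, the sketch below plugs in the unrounded precision and recall values reported in this notebook's results table and recomputes the F1 score by hand.

```python
# Precision and recall as reported in this notebook's results table.
precision = 0.965116
recall = 0.950793

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 from the formula: {f1:.3f}")
```

The result rounds to 0.958, which agrees with the value returned by `f1_score`.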

**Question:** What are the pros and cons of performing the model selection using test data instead of a separate validation dataset?

Pros: <br />
*  The coding workload is reduced.
*  The scripts for data splitting are shorter.
*  Performance only needs to be evaluated once, on the test dataset, instead of twice (validation and test).

Cons: <br />
* If a model is evaluated using samples that were also used to build or fine-tune that model, it likely will provide a biased evaluation.
* Selecting hyperparameters against the test data can overfit the model to that data, so the reported test scores may overstate how well the model generalizes.




### Evaluate the model

Now that you have results, evaluate the model. 

**Question:** What are the four basic parameters for evaluating the performance of a classification model?

The four basic parameters for evaluating the performance of a classification model are: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

**Question:**  What do the four scores demonstrate about your model, and how do you calculate them?

- Accuracy ((TP + TN) / (TP + FP + FN + TN)): The ratio of correctly predicted observations to total observations. 

- Precision (TP / (TP + FP)): The ratio of correctly predicted positive observations to total predicted positive observations. 

- Recall (sensitivity, TP / (TP + FN)): The ratio of correctly predicted positive observations to all observations in the actual positive class.

- F1 score: The harmonic mean of precision and recall, which takes into account both false positives and false negatives.
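
The four formulas above can be sketched directly from confusion-matrix counts and compared with scikit-learn's metric functions. The labels below are a toy example, and the negative label name `dissatisfied` is an assumption, since only `satisfied` appears in the rows displayed in this notebook.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels (hypothetical); "satisfied" is treated as the positive class.
y_true = ["satisfied", "satisfied", "satisfied", "dissatisfied",
          "dissatisfied", "satisfied", "dissatisfied", "dissatisfied"]
y_hat = ["satisfied", "satisfied", "dissatisfied", "dissatisfied",
         "satisfied", "satisfied", "dissatisfied", "dissatisfied"]

# With labels ordered [negative, positive], ravel() returns TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(
    y_true, y_hat, labels=["dissatisfied", "satisfied"]).ravel()

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, accuracy_score(y_true, y_hat))
print(precision, precision_score(y_true, y_hat, pos_label="satisfied"))
print(recall, recall_score(y_true, y_hat, pos_label="satisfied"))
print(f1, f1_score(y_true, y_hat, pos_label="satisfied"))
```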

Calculate the scores: precision score, recall score, accuracy score, F1 score.

In [84]:
# Precision score on test data set.

print(f'The precision score is: {prcs_test:.3f}, meaning that {prcs_test*100:.2f}% of positive predictions are true positives.')


The precision score is: 0.965, meaning that 96.51% of positive predictions are true positives.


In [85]:
# Recall score on test data set.

print(f'The recall score is: {rcl_test:.3f}, meaning that {rcl_test*100:.2f}% of real positive cases in the \ndataset are correctly predicted as positive.')


The recall score is: 0.951, meaning that 95.08% of real positive cases in the 
dataset are correctly predicted as positive.


In [86]:
# Accuracy score on test data set.
print(f'The accuracy score is: {acc_test:.3f}, meaning that {acc_test*100:.2f}% of the test observations \nare correctly classified by the model.')


The accuracy score is: 0.954, meaning that 95.43% of the test observations 
are correctly classified by the model.


In [88]:
# F1 score on test data set.
print(f"The F1 score is: {f1_test:.3f}, meaning that the harmonic mean of precision and recall on the test set is {f1_test*100:.2f}%.")


The F1 score is: 0.958, meaning that the harmonic mean of precision and recall on the test set is 95.79%.


**Question:** How does this model perform based on the four scores?

The model performs well according to all four metrics, each of which is above 95% on the test set. The model was selected based on the F1 score, which is 0.958.

### Evaluate the model

Finally, create a table of results that you can use to evaluate the performance of your model.

In [90]:
# Create table of results.
table = pd.DataFrame({'Model': ["Tuned Random Forest"],
                      'F1': [f1_test],
                      'Recall': [rcl_test],
                      'Precision': [prcs_test],
                      'Accuracy': [acc_test]
                      })
table

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Tuned Random Forest,0.957901,0.950793,0.965116,0.954251


**Question:** How does the random forest model compare to the decision tree model you built in the previous lab?

The tuned random forest has higher scores, and therefore is a better model. 


## **Considerations**


**What are the key takeaways from this lab?**
- Data exploring, cleaning, and encoding are necessary for model building.
- A separate validation set is typically used for tuning a model, rather than using the test set. This also helps avoid the evaluation becoming biased.
- F1 scores are often more informative than accuracy scores. If the costs of false positives and false negatives are very different, it’s better to use the F1 score and combine the information from precision and recall. 
- The random forest model yields a more effective performance than a decision tree model. 

**What summary would you provide to stakeholders?**
* The random forest model predicted satisfaction with about 95.4% accuracy on the test set. The precision is about 96.5% and the recall is about 95.1%. 
* The random forest model outperformed the tuned decision tree with the best hyperparameters in most of the four scores. This indicates that the random forest model may perform better.
* Because stakeholders were interested in learning about the factors that are most important to customer satisfaction, this would be shared based on the tuned random forest. 
* In addition, you would provide details about the precision, recall, accuracy, and F1 scores to support your findings.
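
To share which factors matter most, one common option is the fitted forest's `feature_importances_` attribute. The sketch below shows the idea on synthetic data with made-up feature names; in this lab you would apply the same pattern to `rf_opt` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names (assumption for illustration).
X_arr, y_toy = make_classification(n_samples=300, n_features=5,
                                   n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

toy_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances: one value per feature, summing to 1.
importances = pd.Series(toy_model.feature_importances_,
                        index=X_toy.columns).sort_values(ascending=False)

print(importances)
```

These importances are impurity-based, so treat the ranking as a guide to what the model relies on rather than as proof of what causes satisfaction.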
