<a href="https://colab.research.google.com/github/andremarinho17/data_analytics_projects_en/blob/main/Activity_Build_a_random_forest_model_Andr%C3%A9_Marinho.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Activity: Building a random forest model to predict customer satisfaction for an airline company

<p align="center"><img src="https://cdn.shopify.com/s/files/1/0657/3100/2634/files/Papier_peint_avion_Decollage_en_style_pop_art_et_colore.jpg?v=1712910949" >


**Author**: André Marinho

## **Introduction**


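As you're learning, random forests are popular statistical learning algorithms. Their primary benefit is reducing variance, and with it the chance of overfitting, by averaging the predictions of many decision trees trained on bootstrapped samples and random feature subsets.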

This activity is a continuation of the project you began modeling with decision trees for an airline. Here, you will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from 129,880 customers. It includes data points such as class, flight distance, and inflight entertainment. Your random forest model will be used to predict whether a customer will be satisfied with their flight experience.

**Note:** Because this lab uses a real dataset, this notebook first requires exploratory data analysis, data cleaning, and other manipulations to prepare it for modeling.

## **Step 1: Imports**


Import relevant Python libraries and modules, including the `numpy` and `pandas` libraries for data processing; the `pickle` package to save the model; and the `sklearn` library, containing:
- The module `ensemble`, which has the class `RandomForestClassifier`
- The module `model_selection`, which has the function `train_test_split` and the classes `PredefinedSplit` and `GridSearchCV`
- The module `metrics`, which has the functions `f1_score`, `precision_score`, `recall_score`, and `accuracy_score`


In [None]:
# Import `numpy`, `pandas`, `pickle`, and `sklearn`.
# Import the relevant functions from `sklearn.ensemble`, `sklearn.model_selection`, and `sklearn.metrics`.

import numpy as np
import pandas as pd

import pickle as pkl

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
# RUN THIS CELL TO IMPORT YOUR DATA.

df = pd.read_csv("/content/Invistico_Airline.csv")

Now, you're ready to begin cleaning your data.

## **Step 2: Data cleaning**

To get a sense of the data, display the first 10 rows.

In [None]:
# Display first 10 rows.

df.head(10)

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,...,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,...,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,...,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,...,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,...,2,2,2,4,5,3,4,2,30,26.0


Now, display the variable names and their data types.

In [None]:
# Display variable names and types.

df.dtypes

Unnamed: 0,0
satisfaction,object
Customer Type,object
Age,int64
Type of Travel,object
Class,object
Flight Distance,int64
Seat comfort,int64
Departure/Arrival time convenient,int64
Food and drink,int64
Gate location,int64


**Question:** What do you observe about the differences in data types among the variables included in the data?

Most of the variables are integer type, while some are object, such as `satisfaction`, `Customer Type`, `Type of Travel`, and `Class`.

Next, to understand the size of the dataset, identify the number of rows and the number of columns.

In [None]:
# Identify the number of rows and the number of columns.

df.shape

(129880, 22)

Now, check for missing values in the rows of the data. Start with `.isna()` to get Booleans indicating whether each value in the data is missing. Then, use `.any(axis=1)` to get one Boolean per row indicating whether any of its columns has a missing value. Finally, use `.sum()` to get the number of rows that contain missing values.

In [None]:
# Get Booleans to find missing values in data.
# Get Booleans to find missing values along columns.
# Get the number of rows that contain missing values.

df.isna().any(axis=1).sum()

393

**Question:** How many rows of data are missing values?

There are 393 rows with missing values in the dataset.
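
A minimal sketch of that three-step chain on a toy frame may make each intermediate result easier to see. The column values below are made up for illustration and are not taken from the airline dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: rows 1 and 3 each contain a missing value.
toy = pd.DataFrame({
    "Age": [65, 47, 15, 60],
    "Arrival Delay in Minutes": [0.0, np.nan, 0.0, np.nan],
    "Class": ["Eco", None, "Eco", "Business"],
})

missing_mask = toy.isna()                      # True wherever a single value is missing
rows_with_missing = missing_mask.any(axis=1)   # one Boolean per row
n_rows_missing = int(rows_with_missing.sum())  # rows with at least one missing value
n_cells_missing = int(missing_mask.sum().sum())  # individual missing cells

print(n_rows_missing, n_cells_missing)
```

Counting rows and counting cells answer different questions: a row with two missing values adds one to the first count and two to the second.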

Drop the rows with missing values. This is an important step in data cleaning, as it makes the data more useful for analysis and modeling. Then, save the resulting pandas DataFrame in a variable named `df_subset`.

In [None]:
# Drop missing values.
# Save the DataFrame in variable `df_subset`.

df_subset = df.dropna(axis=0)

Next, display the first 10 rows to examine the data subset.

In [None]:
# Display the first 10 rows.

df_subset.head(10)

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,...,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,...,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,...,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,...,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,...,2,2,2,4,5,3,4,2,30,26.0


Confirm that it does not contain any missing values.

In [None]:
# Count of missing values.

df_subset.isna().sum()

Unnamed: 0,0
satisfaction,0
Customer Type,0
Age,0
Type of Travel,0
Class,0
Flight Distance,0
Seat comfort,0
Departure/Arrival time convenient,0
Food and drink,0
Gate location,0


Next, convert the categorical features to indicator (one-hot encoded) features.

**Note:** The `drop_first` argument can be kept as default (`False`) during one-hot encoding for random forest models, so it does not need to be specified. Also, the target variable, `satisfaction`, does not need to be encoded and will be extracted in a later step.

In [None]:
# Convert categorical features to one-hot encoded features.

df_subset = pd.get_dummies(df_subset, columns=['Customer Type','Type of Travel','Class'])
df_subset.head()

Unnamed: 0,satisfaction,Age,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,...,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel,Class_Business,Class_Eco,Class_Eco Plus
0,satisfied,65,265,0,0,0,2,2,4,2,...,2,0,0.0,True,False,False,True,False,True,False
1,satisfied,47,2464,0,0,0,3,0,2,2,...,2,310,305.0,True,False,False,True,True,False,False
2,satisfied,15,2138,0,0,0,3,2,0,2,...,2,0,0.0,True,False,False,True,False,True,False
3,satisfied,60,623,0,0,0,3,3,4,3,...,3,0,0.0,True,False,False,True,False,True,False
4,satisfied,70,354,0,0,0,3,4,3,4,...,5,0,0.0,True,False,False,True,False,True,False


**Question:** Why is it necessary to convert categorical data into dummy variables?

It's necessary because scikit-learn's `RandomForestClassifier` does not accept string-valued categorical features; they must be encoded as numeric values before fitting.
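
To illustrate what the encoding does to a single column, here is a small sketch on a toy frame. The three rows are hypothetical; only the column names and category labels mirror the dataset.

```python
import pandas as pd

# Hypothetical rows; "Class" is the only categorical feature being encoded here.
toy = pd.DataFrame({
    "satisfaction": ["satisfied", "dissatisfied", "satisfied"],
    "Age": [65, 47, 15],
    "Class": ["Eco", "Business", "Eco Plus"],
})

# One indicator column is created per category; the original column is dropped.
encoded = pd.get_dummies(toy, columns=["Class"])

print(encoded.columns.tolist())
```

Each row has exactly one indicator set among the `Class_*` columns, and the untouched columns (`satisfaction`, `Age`) keep their position at the front.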

Next, display the first 10 rows to review the encoded `df_subset`.

In [None]:
# Display the first 10 rows.

df_subset.head(10)

Unnamed: 0,satisfaction,Age,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,...,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel,Class_Business,Class_Eco,Class_Eco Plus
0,satisfied,65,265,0,0,0,2,2,4,2,...,2,0,0.0,True,False,False,True,False,True,False
1,satisfied,47,2464,0,0,0,3,0,2,2,...,2,310,305.0,True,False,False,True,True,False,False
2,satisfied,15,2138,0,0,0,3,2,0,2,...,2,0,0.0,True,False,False,True,False,True,False
3,satisfied,60,623,0,0,0,3,3,4,3,...,3,0,0.0,True,False,False,True,False,True,False
4,satisfied,70,354,0,0,0,3,4,3,4,...,5,0,0.0,True,False,False,True,False,True,False
5,satisfied,30,1894,0,0,0,3,2,0,2,...,2,0,0.0,True,False,False,True,False,True,False
6,satisfied,66,227,0,0,0,3,2,5,5,...,3,17,15.0,True,False,False,True,False,True,False
7,satisfied,10,1812,0,0,0,3,2,0,2,...,2,0,0.0,True,False,False,True,False,True,False
8,satisfied,56,73,0,0,0,3,5,3,5,...,4,0,0.0,True,False,False,True,True,False,False
9,satisfied,22,1556,0,0,0,3,2,0,2,...,2,30,26.0,True,False,False,True,False,True,False


Then, check the variables of `df_subset`.

In [None]:
# Display variables.

df_subset.columns

Index(['satisfaction', 'Age', 'Flight Distance', 'Seat comfort',
       'Departure/Arrival time convenient', 'Food and drink', 'Gate location',
       'Inflight wifi service', 'Inflight entertainment', 'Online support',
       'Ease of Online booking', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Cleanliness', 'Online boarding',
       'Departure Delay in Minutes', 'Arrival Delay in Minutes',
       'Customer Type_Loyal Customer', 'Customer Type_disloyal Customer',
       'Type of Travel_Business travel', 'Type of Travel_Personal Travel',
       'Class_Business', 'Class_Eco', 'Class_Eco Plus'],
      dtype='object')

**Question:** What changes do you observe after converting the string data to dummy variables?

As can be seen, there are now 26 columns instead of 22: the three categorical columns `Customer Type`, `Type of Travel`, and `Class` were replaced by seven indicator columns, one per category.

## **Step 3: Model building**

The first step to building your model is separating the labels (y) from the features (X).

In [None]:
# Separate the dataset into labels (y) and features (X).

y = df_subset['satisfaction']

X = df_subset.copy()
X = X.drop('satisfaction', axis = 1)

Once separated, split the data into train, validate, and test sets.

In [None]:
# Separate into train, validate, test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.25, random_state = 0)
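
Because the second split is applied only to the 75% training portion, the three sets do not end up in a 75/25 ratio. The arithmetic sketch below works out the approximate shares, assuming the 129,487 rows left after dropping the 393 rows with missing values; the exact row counts produced by `train_test_split` may differ by a row because of rounding.

```python
# Fractions passed as `test_size` in the two calls to train_test_split.
test_fraction = 0.25
val_fraction_of_train = 0.25

# Share of the full dataset that lands in each set.
test_share = test_fraction
val_share = (1 - test_fraction) * val_fraction_of_train
train_share = (1 - test_fraction) * (1 - val_fraction_of_train)

n_rows = 129880 - 393  # rows remaining after dropna
for name, share in [("train", train_share), ("validate", val_share), ("test", test_share)]:
    print(f"{name}: {share:.4f} of the data, about {round(share * n_rows)} rows")
```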

### Tune the model

Now, fit and tune a random forest model with a separate validation set. Begin by determining a set of hyperparameters for tuning the model using `GridSearchCV`.


In [None]:
# Determine set of hyperparameters.

cv_params = {'n_estimators' : [10, 50, 100, 120],
              'max_depth' : [10,20,50],
              'min_samples_leaf' : [0.01, 0.5, 1],
              'min_samples_split' : [0.001, 0.01],
              'max_features' : ["sqrt", "log2" ],
              'max_samples': [0.3, 0.7] }

Next, create a list of split indices.

In [None]:
# Create list of split indices.

split_index = [0 if x in X_val.index else -1 for x in X_train.index]
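
The list above follows the `PredefinedSplit` convention: `-1` marks rows that are always used for training, and `0` marks rows assigned to the single validation fold. The sketch below shows that behavior on hypothetical index labels standing in for `X_train.index` and `X_val.index`, and uses a vectorized `isin` lookup as one possible alternative to the list comprehension.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import PredefinedSplit

# Hypothetical index labels (stand-ins for X_train.index and X_val.index).
train_index = pd.Index([10, 11, 12, 13, 14, 15])
val_index = pd.Index([11, 14])

# 0 = row belongs to the validation fold, -1 = row is never used for validation.
split_index = np.where(train_index.isin(val_index), 0, -1)

custom_split = PredefinedSplit(split_index)
train_pos, val_pos = next(custom_split.split())

print(split_index.tolist())
print(train_pos.tolist(), val_pos.tolist())
```

Note that the split yields integer positions, not index labels, which is why it can be passed to `GridSearchCV` together with the full `X_train`.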

Now, instantiate your model.

In [None]:
# Instantiate model.

rf = RandomForestClassifier(random_state = 0)

Next, use GridSearchCV to search over the specified parameters.

In [None]:
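# Search over specified parameters.
# Note: no `scoring` argument is passed, so candidates are ranked by the estimator's
# default score (accuracy); with a single metric, `refit='f1'` behaves like `refit=True`.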

custom_split = PredefinedSplit(split_index)

rf_val = GridSearchCV(rf, cv_params, cv=custom_split, refit='f1', n_jobs = -1, verbose = 1)
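
If the intent of `refit='f1'` is to select hyperparameters by F1 score, one possible approach is to pass an explicit scorer. The sketch below is an assumption about that intent, not the notebook's actual configuration: it uses synthetic data from `make_classification`, a small hypothetical grid, and 3-fold cross-validation in place of the predefined split. Because the labels are strings, `make_scorer` is given `pos_label="satisfied"`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with string labels like the airline target.
X_toy, y_num = make_classification(n_samples=200, n_features=6, random_state=0)
y_toy = np.where(y_num == 1, "satisfied", "dissatisfied")

# F1 scorer that treats "satisfied" as the positive class.
f1_satisfied = make_scorer(f1_score, pos_label="satisfied")

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, 5]},  # hypothetical small grid
    scoring=f1_satisfied,  # candidates are ranked by F1 instead of accuracy
    cv=3,
    refit=True,
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```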

Now, fit your model.

In [None]:
%%time
# Fit the model.
rf_val.fit(X_train, y_train)

Fitting 1 folds for each of 288 candidates, totalling 288 fits
CPU times: user 20.2 s, sys: 1.64 s, total: 21.9 s
Wall time: 10min 33s
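
The 288 candidates reported above are simply the product of the number of values listed for each hyperparameter. A short sketch that reproduces the count from the same grid:

```python
from math import prod

# Same grid as `cv_params` defined earlier in this notebook.
cv_params = {
    "n_estimators": [10, 50, 100, 120],
    "max_depth": [10, 20, 50],
    "min_samples_leaf": [0.01, 0.5, 1],
    "min_samples_split": [0.001, 0.01],
    "max_features": ["sqrt", "log2"],
    "max_samples": [0.3, 0.7],
}

# Every combination of values is one candidate model.
n_candidates = prod(len(values) for values in cv_params.values())
n_folds = 1  # the predefined split has a single validation fold

print(n_candidates, n_candidates * n_folds)
```

Adding one more value to any list multiplies the number of fits, so the grid size is worth checking before starting a long search.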


Finally, obtain the optimal parameters.

In [None]:
# Obtain optimal parameters.

rf_val.best_params_

{'max_depth': 50,
 'max_features': 'sqrt',
 'max_samples': 0.7,
 'min_samples_leaf': 1,
 'min_samples_split': 0.001,
 'n_estimators': 100}
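
Step 1 imports `pickle` to save the model, and the search above took about ten minutes, so persisting the fitted object may avoid re-running it. The sketch below shows one way to do that round trip; the small model, the synthetic data, and the file name `rf_model.pkl` are placeholders, and in this notebook the object to save would presumably be `rf_val`.

```python
import pickle as pkl
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data.
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "rf_model.pkl"  # hypothetical file name
    with open(model_path, "wb") as f:
        pkl.dump(model, f)       # write the fitted model to disk
    with open(model_path, "rb") as f:
        restored = pkl.load(f)   # read it back

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.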

## **Step 4: Results and evaluation**

Use the selected model to predict on your test data. Use the optimal parameters found via GridSearchCV.

In [None]:
# Instantiate a model with the optimal parameters found by GridSearchCV.

rf_opt = RandomForestClassifier(n_estimators = 100, max_depth = 50, min_samples_leaf = 1, min_samples_split = 0.001, max_features = 'sqrt', max_samples = 0.7, random_state = 0)

Once again, fit the optimal model.

In [None]:
# Fit the optimal model.

rf_opt.fit(X_train, y_train)

And predict on the test set using the optimal model.

In [None]:
# Predict on test set.

y_pred = rf_opt.predict(X_test)

### Obtain performance scores

In [None]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label = "satisfied")
recall = recall_score(y_test, y_pred, pos_label = "satisfied")
f1 = f1_score(y_test, y_pred, pos_label = "satisfied")

In [None]:
print("Accuracy:", "%.3f" % accuracy)
print("Precision:", "%.3f" % precision)
print("Recall:", "%.3f" % recall)
print("F1 Score:", "%.3f" % f1)

Accuracy: 0.941
Precision: 0.949
Recall: 0.944
F1 Score: 0.946


**Question:** How is the F1-score calculated?

The F1-score is the harmonic mean of precision and recall. It's calculated by multiplying precision by recall, dividing that product by the sum of precision and recall, and multiplying the result by 2. The formula is the following:

$$
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$
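
As a quick check of the formula, the sketch below plugs in the unrounded precision and recall that this notebook reports for the tuned random forest in its results table (0.949163 and 0.943542).

```python
# Unrounded test-set values reported in this notebook's results table.
precision = 0.949163
recall = 0.943542

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.946
```

The result agrees with the value returned by `f1_score`, and it sits between the two inputs, slightly closer to the smaller one.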

**Question:** What are the pros and cons of performing the model selection using test data instead of a separate validation dataset?

Using test data for model selection has some pros and cons. On the positive side, it allows for a quick comparison of candidate models without setting aside a separate validation set, which leaves more data for training and can speed up the decision-making process. However, the major downside is that the hyperparameters end up tuned to the test set itself, so the test scores become optimistically biased and no longer estimate performance on truly unseen data. This undermines the purpose of the test set, which should be used solely for final evaluation. Ultimately, using a separate validation dataset is more reliable, as it keeps the test set untouched until the model has been chosen.




### Evaluate the model

Now that you have results, evaluate the model.

**Question:** What are the four basic parameters for evaluating the performance of a classification model?

The four basic parameters for evaluating the performance of a classification model are True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

* **TP:** The correctly predicted positive values. In this context, customers who were truly satisfied and correctly predicted as satisfied by the model.
* **TN:** The correctly predicted negative values. In this context, customers who were truly dissatisfied and correctly predicted as dissatisfied by the model.
* **FP:** The negative values that were incorrectly predicted as positive. In this context, customers who were truly dissatisfied but incorrectly predicted as satisfied by the model.
* **FN:** The positive values that were incorrectly predicted as negative. In this context, customers who were truly satisfied but incorrectly predicted as dissatisfied by the model.
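
The four counts can be tallied directly from paired true and predicted labels. The sketch below does this for six made-up customers, treating `"satisfied"` as the positive class as in the scoring cells above.

```python
# Hypothetical labels for six customers.
y_true = ["satisfied", "satisfied", "dissatisfied", "dissatisfied", "satisfied", "dissatisfied"]
y_hat = ["satisfied", "dissatisfied", "dissatisfied", "satisfied", "satisfied", "dissatisfied"]

positive = "satisfied"
pairs = list(zip(y_true, y_hat))

tp = sum(t == positive and p == positive for t, p in pairs)  # satisfied, predicted satisfied
tn = sum(t != positive and p != positive for t, p in pairs)  # dissatisfied, predicted dissatisfied
fp = sum(t != positive and p == positive for t, p in pairs)  # dissatisfied, predicted satisfied
fn = sum(t == positive and p != positive for t, p in pairs)  # satisfied, predicted dissatisfied

print(tp, tn, fp, fn)  # 2 2 1 1
```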







**Question:**  What do the four scores demonstrate about your model, and how do you calculate them?

**Accuracy**

Demonstrates the proportion of correct predictions. It's calculated by dividing the total of correct predictions by the total of predictions.

$$
\text{Accuracy} = \frac{\text{Total of correct predictions}}{\text{Total of predictions}}
$$

**Recall**

Shows the proportion of actual positives that the model predicted correctly. It's calculated by dividing the total of true positives by the sum of true positives and false negatives.

$$
\text{Recall} = \frac{TP}{TP+FN}
$$

**Precision**

It represents the proportion of positive predictions that are actually positive. It's calculated by dividing the total of true positives by the sum of true positives and false positives.

$$
\text{Precision} = \frac{TP}{TP+FP}
$$

**F1-score**

It is the harmonic mean of precision and recall. It's calculated by multiplying precision by recall, dividing that product by the sum of precision and recall, and multiplying the result by 2. The formula is the following:

$$
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$
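
The four definitions can be collected into one small helper. This is a sketch with a hypothetical function name and made-up counts, and it does not guard against empty denominators (for example, a model that never predicts the positive class).

```python
def classification_scores(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 computed from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for 200 customers.
scores = classification_scores(tp=80, tn=90, fp=10, fn=20)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In practice the scikit-learn functions imported in Step 1 compute the same quantities directly from the label arrays.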





**Question:** How does this model perform based on the four scores?

The Random Forest demonstrates strong performance based on key metrics. With an accuracy of 94.1%, the model correctly identifies satisfied and dissatisfied customers in the majority of cases. The precision of 94.9% indicates that nearly all customers predicted as satisfied are truly satisfied, minimizing false positives. The recall of 94.4% shows that the model effectively captures most satisfied customers, reducing the chance of missing positive cases. Finally, the F1 score of 94.6%, balancing precision and recall, highlights the model's robustness in handling both satisfied and dissatisfied customers.

### Evaluate the model

Finally, create a table of results that you can use to evaluate the performance of your model.

In [None]:
# Create table of results.

table = pd.DataFrame({'Model': ["Tuned Decision Tree", "Tuned Random Forest"],
                        'F1':  [0.943673, f1],
                        'Recall': [0.935216, recall],
                        'Precision': [0.952305, precision],
                        'Accuracy': [0.938887, accuracy]
                      }
                    )
table

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Tuned Decision Tree,0.943673,0.935216,0.952305,0.938887
1,Tuned Random Forest,0.946344,0.943542,0.949163,0.9414


**Question:** How does the random forest model compare to the decision tree model you built in the previous lab?

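The tuned random forest outperforms the tuned decision tree on F1, recall, and accuracy, and trails it only slightly on precision (0.949 versus 0.952). The differences are all below one percentage point, so the random forest is the better model in this scenario, but only by a small margin.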
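
To make the size of each gap explicit, the sketch below subtracts the decision tree's scores from the random forest's, using the values shown in the results table.

```python
# Scores copied from the results table above.
decision_tree = {"F1": 0.943673, "Recall": 0.935216, "Precision": 0.952305, "Accuracy": 0.938887}
random_forest = {"F1": 0.946344, "Recall": 0.943542, "Precision": 0.949163, "Accuracy": 0.9414}

# Positive values mean the random forest scored higher.
differences = {
    metric: round(random_forest[metric] - decision_tree[metric], 6)
    for metric in decision_tree
}

for metric, diff in differences.items():
    print(f"{metric}: {diff:+.6f}")
```

Recall shows the largest gain (about 0.8 percentage points), while precision is the only metric on which the decision tree is ahead.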


## **Considerations**

**What summary would you provide to stakeholders?**

The random forest demonstrated good performance in predicting customer satisfaction for the company, with key metrics reflecting its reliability and effectiveness. The accuracy of 94.1% suggests that the model makes correct predictions in most cases. Its precision of 94.9% indicates that the vast majority of customers predicted as satisfied are indeed satisfied, ensuring that resources are not wasted on false positives. The recall of 94.4% shows that the model captures almost all satisfied customers, which is crucial for targeting areas of improvement. With an F1 score of 94.6%, the model balances precision and recall well, offering a comprehensive and trustworthy tool to improve customer experience and satisfaction.

Furthermore, it outperformed the tuned decision tree on accuracy, recall, and F1 score, so it is the better choice in this scenario and can be used to inform business decisions and customer engagement strategies.

### References

[What is the Difference Between Test and Validation Datasets?, Jason Brownlee](https://machinelearningmastery.com/difference-test-validation-datasets/)

[Decision Trees and Random Forests, Neil Liberman](https://towardsdatascience.com/decision-trees-and-random-forests-df0c3123f991)
