# Classification using the Random Forest Algorithm
|                  |                                                                                                                                                                                                     |
|:-----------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Course Codes** | BBT 4106, BCM 3104, and BFS 4102                                                                                                                                                                    |
| **Course Names** | BBT 4106: Business Intelligence I (Week 10-12 of 13),<br/>BCM 3104: Business Intelligence and Data Analytics (Week 10-12 of 13) and<br/>BFS 4102: Advanced Business Data Analytics (Week 4-6 of 13) |
| **Semester**     | April to July 2025                                                                                                                                                                                  |
| **Lecturer**     | Allan Omondi                                                                                                                                                                                        |
| **Contact**      | aomondi@strathmore.edu                                                                                                                                                                              |
| **Note**         | The lecture contains both theory and practice. This notebook forms part of the practice and is intended for educational purposes only.                                                              |


## Step 1: Import the necessary libraries

In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler

## Step 2: Load the data

In [26]:
# Load the dataset into a DataFrame
# Description of dataset: https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset
url = 'https://raw.githubusercontent.com/course-files/RegressionAndClassification/refs/heads/main/data/online_shoppers_intention.csv'
online_shoppers_intention_data = pd.read_csv(url)
# online_shoppers_intention_data = pd.read_csv("./data/online_shoppers_intention.csv")

## Step 3: Preprocess the Data

We need to convert the data into a numeric format suitable for the model. First, we map the boolean `Revenue` target to integers (0 and 1). Then we encode the non-numeric columns `VisitorType`, `Weekend`, and `Month` as integers, using label encoding for simplicity. `LabelEncoder` assigns codes in sorted (alphabetical) label order, so the resulting integers carry no meaningful ranking; this is usually acceptable for tree-based models such as random forests.

In [27]:
# Map the target 'Revenue' from False/True to 0/1
online_shoppers_intention_data['Revenue'] = online_shoppers_intention_data['Revenue'].map({False: 0, True: 1})

le = LabelEncoder()

# Encode the categorical columns: 'VisitorType', 'Weekend', and 'Month'
# LabelEncoder assigns integer codes in sorted (alphabetical) label order
online_shoppers_intention_data['VisitorType'] = le.fit_transform(online_shoppers_intention_data['VisitorType']) # 'New_Visitor'->0, 'Other'->1, 'Returning_Visitor'->2
online_shoppers_intention_data['Weekend'] = le.fit_transform(online_shoppers_intention_data['Weekend']) # False->0, True->1
online_shoppers_intention_data['Month']   = le.fit_transform(online_shoppers_intention_data['Month']) # e.g., 'Aug'->0, 'Dec'->1, 'Feb'->2
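
The ordering that `LabelEncoder` uses can be checked directly. The following is a minimal sketch on a small hand-written list (the `visitor_types` list is an assumption for illustration; it only reuses the three labels that occur in the `VisitorType` column):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative list using the three labels found in the VisitorType column
visitor_types = ["Returning_Visitor", "New_Visitor", "Other", "Returning_Visitor"]

encoder = LabelEncoder()
codes = encoder.fit_transform(visitor_types)

# classes_ holds the labels in sorted order; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'New_Visitor': 0, 'Other': 1, 'Returning_Visitor': 2}
print(codes.tolist())  # [2, 0, 1, 2]
```

Inspecting `encoder.classes_` after fitting is a simple way to confirm which integer stands for which label before building new samples by hand.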


## Step 4: Initial Exploratory Data Analysis (EDA)

In [28]:
# Separate features (X) and target (y)
X = online_shoppers_intention_data.drop('Revenue', axis=1)  # all columns except target
y = online_shoppers_intention_data['Revenue']               # target variable

print("\nThe data types:")
print(online_shoppers_intention_data.info())

print("\nThe summary of the numeric columns:")
display(online_shoppers_intention_data.describe())

print("\nThe whole dataset:")
display(online_shoppers_intention_data)

print("The feature data (independent variables or predictors):")
print(X.head())

print("\nTarget labels (the dependent variable or outcome):")
print(y.head())

print("\nPercentage distribution for each category in y")
print("\nNumber of observations per class:")
print("Frequency counts:\n", y.value_counts())
print("\nPercentages:\n", y.value_counts(normalize=True) * 100, "%")


The data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  int64  
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  Traff

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 |
| mean | 2.315166 | 80.818611 | 0.503569 | 34.472398 | 31.731468 | 1194.74622 | 0.022191 | 0.043073 | 5.889258 | 0.061427 | 5.16399 | 2.124006 | 2.357097 | 3.147364 | 4.069586 | 1.718329 | 0.232603 | 0.154745 |
| std | 3.321784 | 176.779107 | 1.270156 | 140.749294 | 44.475503 | 1913.669288 | 0.048488 | 0.048597 | 18.568437 | 0.198917 | 2.370199 | 0.911325 | 1.717277 | 2.401591 | 4.025169 | 0.690759 | 0.422509 | 0.361676 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 184.1375 | 0.0 | 0.014286 | 0.0 | 0.0 | 5.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 |
| 50% | 1.0 | 7.5 | 0.0 | 0.0 | 18.0 | 598.936905 | 0.003112 | 0.025156 | 0.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 2.0 | 0.0 | 0.0 |
| 75% | 4.0 | 93.25625 | 0.0 | 0.0 | 38.0 | 1464.157214 | 0.016813 | 0.05 | 0.0 | 0.0 | 7.0 | 3.0 | 2.0 | 4.0 | 4.0 | 2.0 | 0.0 | 0.0 |
| max | 27.0 | 3398.75 | 24.0 | 2549.375 | 705.0 | 63973.52223 | 0.2 | 0.2 | 361.763742 | 1.0 | 9.0 | 8.0 | 13.0 | 9.0 | 20.0 | 2.0 | 1.0 | 1.0 |



The whole dataset:


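| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.000000 | 0.0 | 2 | 1 | 1 | 1 | 1 | 2 | 0 | 0 |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.000000 | 0.100000 | 0.000000 | 0.0 | 2 | 2 | 2 | 1 | 2 | 2 | 0 | 0 |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.000000 | 0.0 | 2 | 4 | 1 | 9 | 3 | 2 | 0 | 0 |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.050000 | 0.140000 | 0.000000 | 0.0 | 2 | 3 | 2 | 2 | 4 | 2 | 0 | 0 |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.020000 | 0.050000 | 0.000000 | 0.0 | 2 | 3 | 3 | 1 | 4 | 2 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12325 | 3 | 145.0 | 0 | 0.0 | 53 | 1783.791667 | 0.007143 | 0.029031 | 12.241717 | 0.0 | 1 | 4 | 6 | 1 | 1 | 2 | 1 | 0 |
| 12326 | 0 | 0.0 | 0 | 0.0 | 5 | 465.750000 | 0.000000 | 0.021333 | 0.000000 | 0.0 | 7 | 3 | 2 | 1 | 8 | 2 | 1 | 0 |
| 12327 | 0 | 0.0 | 0 | 0.0 | 6 | 184.250000 | 0.083333 | 0.086667 | 0.000000 | 0.0 | 7 | 3 | 2 | 1 | 13 | 2 | 1 | 0 |
| 12328 | 4 | 75.0 | 0 | 0.0 | 15 | 346.000000 | 0.000000 | 0.021053 | 0.000000 | 0.0 | 7 | 2 | 2 | 3 | 11 | 2 | 0 | 0 |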


The feature data (independent variables or predictors):
   Administrative  Administrative_Duration  Informational  \
0               0                      0.0              0   
1               0                      0.0              0   
2               0                      0.0              0   
3               0                      0.0              0   
4               0                      0.0              0   

   Informational_Duration  ProductRelated  ProductRelated_Duration  \
0                     0.0               1                 0.000000   
1                     0.0               2                64.000000   
2                     0.0               1                 0.000000   
3                     0.0               2                 2.666667   
4                     0.0              10               627.500000   

   BounceRates  ExitRates  PageValues  SpecialDay  Month  OperatingSystems  \
0         0.20       0.20         0.0         0.0      2                 1   


## Step 5: Resample with replacement to balance the dataset

In [29]:
# Separate majority and minority classes
df_majority = online_shoppers_intention_data[online_shoppers_intention_data['Revenue']==0]
df_minority = online_shoppers_intention_data[online_shoppers_intention_data['Revenue']==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                               replace=True,     # Sample with replacement
                               n_samples=len(df_majority),    # To match the majority class
                               random_state=53)  # To ensure the results are reproducible

# Combine majority class with upsampled minority class
df_balanced = pd.concat([df_majority, df_minority_upsampled])

# Separate features and target from balanced dataset
X_balanced = df_balanced.drop('Revenue', axis=1)
y_balanced = df_balanced['Revenue']

print("\nNumber of observations per class:")
print("Frequency counts:\n", y_balanced.value_counts())
print("\nPercentages:\n", y_balanced.value_counts(normalize=True) * 100, "%")



Number of observations per class:
Frequency counts:
 Revenue
0    10422
1    10422
Name: count, dtype: int64

Percentages:
 Revenue
0    50.0
1    50.0
Name: proportion, dtype: float64 %


## Step 6: Split the data into training and testing sets

In [30]:
# Split into a training set and a test set (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.30, random_state=42)
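
A caveat on ordering: when the minority class is upsampled before the split, as in Steps 5 and 6, copies of the same row can land in both the training and the test set, which tends to inflate test scores. A common alternative is to split first and upsample only the training portion. The sketch below illustrates that ordering on hypothetical synthetic data (the `demo` DataFrame and its `feature` column are assumptions, not part of this notebook); it is not the procedure that produced the results reported here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 900 rows of class 0 and 100 rows of class 1
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "Revenue": [0] * 900 + [1] * 100,
})

# 1. Split first, stratifying so both sets keep the original class ratio
train_df, test_df = train_test_split(
    demo, test_size=0.30, stratify=demo["Revenue"], random_state=42)

# 2. Upsample the minority class inside the training set only
train_majority = train_df[train_df["Revenue"] == 0]
train_minority = train_df[train_df["Revenue"] == 1]
train_minority_upsampled = resample(
    train_minority, replace=True,
    n_samples=len(train_majority), random_state=53)
train_balanced = pd.concat([train_majority, train_minority_upsampled])

# The training set is now balanced, and no test row also appears in it
overlap = set(train_balanced.index) & set(test_df.index)
print(train_balanced["Revenue"].value_counts())
print("Rows shared by train and test:", len(overlap))
```

With this ordering the test set keeps the original class imbalance, so precision and recall for the minority class become more informative than overall accuracy.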

## Step 7: Apply data transformation (feature scaling)

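`StandardScaler` is a scikit-learn preprocessing transformer that standardizes each feature by subtracting its mean and dividing by its standard deviation, so that the feature has zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set, so that no information from the test set influences the transformation. It applies the standardization formula to each feature:
- Standardization formula: `z = (x - μ) / σ`
- Where:
   - `x` is the original value of the feature.
   - `μ` is the mean of the feature values in the training set.
   - `σ` is the standard deviation of the feature values in the training set.
- The result is:
    - Each transformed feature has a mean of 0 on the training data
    - Each transformed feature has a standard deviation of 1 on the training data
    - If the feature is approximately normally distributed, roughly 68% of the values will lie between -1 and 1
    - Under the same assumption, roughly 95% of the values will lie between -2 and 2

- Advantages:
    - Makes features comparable when they have different scales
    - Many machine learning algorithms perform better when features are on similar scales
    - Particularly important for algorithms that rely on distance calculations or gradient-based optimization. Tree-based models such as random forests split on thresholds and are largely insensitive to feature scaling, so the step has little effect on this particular model.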
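
The formula above can be verified against `StandardScaler` on a toy feature. This is a minimal sketch; the array `x` is a made-up example, not data from the notebook:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with five observations
x = np.array([[2.0], [4.0], [4.0], [6.0], [9.0]])

# Manual standardization: z = (x - mu) / sigma
mu = x.mean(axis=0)
sigma = x.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
z_manual = (x - mu) / sigma

# The same transformation using scikit-learn
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))  # True
print("mean:", z_sklearn.mean(), "std:", z_sklearn.std())
```

Note that `StandardScaler` divides by the population standard deviation, whereas `pandas.Series.std()` defaults to the sample standard deviation (`ddof=1`), so the two can differ slightly on small datasets.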


In [31]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Step 8: Train the model

In [35]:
# Initialize and train Random Forest
model = RandomForestClassifier(
    n_estimators=100, # The number of trees in the forest
    random_state=53, # Ensures reproducibility
    max_depth=3      # Limit depth for smoother decision boundaries
)
model.fit(X_train_scaled, y_train)

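| Parameter                | Value  |
|--------------------------|--------|
| n_estimators             | 100    |
| criterion                | 'gini' |
| max_depth                | 3      |
| min_samples_split        | 2      |
| min_samples_leaf         | 1      |
| min_weight_fraction_leaf | 0.0    |
| max_features             | 'sqrt' |
| max_leaf_nodes           | None   |
| min_impurity_decrease    | 0.0    |
| bootstrap                | True   |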
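
To make the role of `n_estimators` concrete, the following sketch fits a forest on synthetic data from `make_classification` (an assumption for illustration, not the shoppers dataset) and checks that the forest's class probabilities equal the average of the probabilities from its individual trees, which is how scikit-learn combines them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=53)
forest.fit(X_demo, y_demo)

# The fitted forest stores one decision tree per estimator
print("Number of trees:", len(forest.estimators_))

# Average the class probabilities predicted by each individual tree
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# The forest's own probabilities match the per-tree average
print(np.allclose(mean_proba, forest.predict_proba(X_demo)))
```

Averaging many shallow trees, each trained on a bootstrap sample with a random subset of features per split, is what reduces the variance of a single decision tree.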


## Step 9: Evaluate the Model

`y_pred = model.predict(X_test_scaled)`

- This uses the trained random forest classifier (`model`) to predict the labels for the scaled test set features (`X_test_scaled`). This gives you the model's predictions on data held out from training, which is necessary for evaluating its performance.

`print("Classification Report:\n", classification_report(y_test, y_pred))`
- This prints a detailed classification report comparing the true labels (`y_test`) to the predicted labels (`y_pred`). The report includes precision, recall, F1-score, and support for each class, enabling you to understand how well the model performs for each category.
- It shows the performance metrics for a model that predicts two classes:
    - Class 0 - A case where the user's interaction with the eCommerce website does not lead to a purchase.
    - Class 1 - A case where the user's interaction with the eCommerce website leads to a purchase.

- There are 6,254 total items tested:
    - Class 0 has 3,146 (about 50%)
    - Class 1 has 3,108 (about 50%)

| Term             | Meaning                                                                                                                             |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **Precision**    | Out of all items the model said are class X, how many are actually class X?                                                         |
| **Recall**       | Out of all actual items in class X, how many did the model correctly find?                                                          |
| **F1-score**     | A balance between precision and recall such  that a higher value means better balance.                                              |
| **Support**      | The number of actual items in that class.                                                                                           |
| **Macro avg**    | The average of precision, recall, and F1-score for both classes, treating them equally.                                             |
| **Weighted avg** | The average of precision, recall, and F1-score, weighted by the number of samples in each class (here the classes are nearly equal in size, so it is close to the macro average). |

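- The results show that the model performs similarly on both classes and gets about 84% of predictions correct overall. It is slightly better at finding class 0 (recall 0.88) than class 1 (recall 0.81), while its class 1 predictions are slightly more precise (0.87 versus 0.83). Because the minority class was upsampled before the train/test split, copies of some class 1 rows can appear in both sets, so these scores may be optimistic.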

In [33]:
# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Compute and display the accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))  # Overall fraction of correct predictions

# Show precision, recall, F1-score for each class
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Compute and display the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.8444195714742565

Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.88      0.85      3146
           1       0.87      0.81      0.84      3108

    accuracy                           0.84      6254
   macro avg       0.85      0.84      0.84      6254
weighted avg       0.85      0.84      0.84      6254

Confusion Matrix:
[[2756  390]
 [ 583 2525]]
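
The figures in the classification report can be reproduced by hand from the confusion matrix above, where rows are the true classes and columns are the predicted classes. A minimal sketch using only those four counts:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[2756, 390],
               [583, 2525]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()

# Class 1 (purchase)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (no purchase): the roles of the two classes are swapped
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy: {accuracy:.2f}")                                  # 0.84
print(f"class 0: P={precision_0:.2f} R={recall_0:.2f} F1={f1_0:.2f}")  # 0.83 0.88 0.85
print(f"class 1: P={precision_1:.2f} R={recall_1:.2f} F1={f1_1:.2f}")  # 0.87 0.81 0.84
```

Reading the matrix this way also shows where the errors fall: 390 sessions without a purchase were predicted as purchases, and 583 purchase sessions were missed.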


## Step 10: Use the model to make a prediction on a new sample

In [34]:
# Create a new sample (example values based on the dataset's feature order)
new_session = [[0,     # Administrative
                0.0,   # Administrative_Duration
                2,     # Informational
                0.0,   # Informational_Duration
                20,    # ProductRelated
                500.0, # ProductRelated_Duration
                0.02,  # BounceRates
                0.01,  # ExitRates
                0.005, # PageValues
                0.0,   # SpecialDay
                3,     # Month (encoded)
                2,     # OperatingSystems
                1,     # Browser
                1,     # Region
                1,     # TrafficType
                2,     # VisitorType (Returning_Visitor encoded as 2)
                0      # Weekend (False encoded as 0)
               ]]

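# Use the same column names as the training data
new_session_online_shoppers_intention_data = pd.DataFrame(new_session, columns=X.columns)

# Apply the same scaling that was fitted on the training data
new_session_scaled = scaler.transform(new_session_online_shoppers_intention_data)

# Predict using the trained model (it was trained on scaled features)
prediction = model.predict(new_session_scaled)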

# Display the result
print("\nPredicted Revenue:", "Yes" if prediction[0] == 1 else "No")


Predicted Revenue: Yes
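
Because the model is trained on scaled features, every new sample must pass through the same fitted scaler before prediction. One way to make this hard to forget is to bundle the scaler and the classifier in a scikit-learn `Pipeline`. The sketch below assumes synthetic data from `make_classification` with 17 features (chosen only to mirror the number of feature columns here); it is an illustration, not the notebook's trained model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 17 features (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=17, random_state=0)

# The pipeline fits the scaler and the forest together
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, max_depth=3, random_state=53),
)
pipeline.fit(X_demo, y_demo)

# New samples are scaled automatically before reaching the forest
new_sample = X_demo[:1]
prediction = pipeline.predict(new_sample)
probabilities = pipeline.predict_proba(new_sample)

print("Predicted class:", prediction[0])
print("Class probabilities:", probabilities[0])
```

`predict_proba` is often more useful than the hard label alone, since it shows how confident the forest is in a "Yes" or "No" prediction.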
