# Classification using the Support Vector Machine (SVM) Algorithm
|                  |                                                                                                                                                                                                     |
|:-----------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Course Codes** | BBT 4106, BCM 3104, and BFS 4102                                                                                                                                                                    |
| **Course Names** | BBT 4106: Business Intelligence I (Week 10-12 of 13),<br/>BCM 3104: Business Intelligence and Data Analytics (Week 10-12 of 13) and<br/>BFS 4102: Advanced Business Data Analytics (Week 4-6 of 13) |
| **Semester**     | April to July 2025                                                                                                                                                                                  |
| **Lecturer**     | Allan Omondi                                                                                                                                                                                        |
| **Contact**      | aomondi@strathmore.edu                                                                                                                                                                              |
| **Note**         | The lecture contains both theory and practice. This notebook forms part of the practice. This is intended for educational purposes only.                                                            |


## Step 1: Import the necessary libraries

In [None]:
# Import pandas for data manipulation
import pandas as pd
# Import LabelEncoder for encoding categorical features
from sklearn.preprocessing import LabelEncoder
# Import train_test_split for splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split
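# Import resample for balancing the dataset by upsampling the minority class
from sklearn.utils import resample
# Import StandardScaler for feature scaling
from sklearn.preprocessing import StandardScaler
# Import SVC, the support vector machine classifier
from sklearn.svm import SVC
# Import metrics for evaluating the classifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix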

## Step 2: Load and Explore the Data

In [None]:
# Load the dataset into a DataFrame
# Description of dataset: https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset
url = 'https://raw.githubusercontent.com/course-files/RegressionAndClassification/refs/heads/main/data/online_shoppers_intention.csv'
online_shoppers_intention_data = pd.read_csv(url)
# online_shoppers_intention_data = pd.read_csv("./data/online_shoppers_intention.csv")

## Step 3: Preprocess the Data

We need to convert the data into a numeric format suitable for the model. First, we map the boolean Revenue target to integers (0 and 1). Then we encode any categorical variables into numeric form. In this dataset, columns like VisitorType, Weekend, and Month are categorical. We can use label encoding for simplicity.

In [None]:
# Map the target 'Revenue' from False/True to 0/1
online_shoppers_intention_data['Revenue'] = online_shoppers_intention_data['Revenue'].map({False: 0, True: 1})

# One LabelEncoder is refit for each column; it assigns integer codes in sorted (alphabetical) label order
le = LabelEncoder()

# Encode the categorical columns: 'VisitorType', 'Weekend', and 'Month'
online_shoppers_intention_data['VisitorType'] = le.fit_transform(online_shoppers_intention_data['VisitorType']) # 'New_Visitor'->0, 'Other'->1, 'Returning_Visitor'->2
online_shoppers_intention_data['Weekend'] = le.fit_transform(online_shoppers_intention_data['Weekend']) # False->0, True->1
online_shoppers_intention_data['Month']   = le.fit_transform(online_shoppers_intention_data['Month']) # alphabetical, not calendar, order: e.g., 'Aug'->0, 'Dec'->1, 'Feb'->2
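
The mapping that label encoding produces can be inspected directly. The following is a minimal sketch on a hypothetical mini-column (the values and the names `visitor_type`, `encoder`, and `mapping` are illustrative and not part of the notebook). It shows that `LabelEncoder` stores the labels in sorted order in `classes_`, so each label's position in that array is its integer code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-column standing in for 'VisitorType'
visitor_type = pd.Series(
    ["Returning_Visitor", "New_Visitor", "Other", "Returning_Visitor"]
)

encoder = LabelEncoder()
codes = encoder.fit_transform(visitor_type)

# classes_ lists the labels in sorted order; each label's position is its code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'New_Visitor': 0, 'Other': 1, 'Returning_Visitor': 2}
print(codes.tolist())  # [2, 0, 1, 2]
```

Because the codes follow alphabetical order, they carry no meaningful ranking for columns such as `Month`. Label encoding is used here for simplicity only.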


## Step 4: Separate the features and target, then display the data

In [None]:
# Separate features (X) and target (y)
X = online_shoppers_intention_data.drop('Revenue', axis=1)  # all columns except target
y = online_shoppers_intention_data['Revenue']               # target variable

print("\nThe data types:")
print(online_shoppers_intention_data.info())

print("\nThe summary of the numeric columns:")
display(online_shoppers_intention_data.describe())

print("\nThe whole dataset:")
display(online_shoppers_intention_data)

print("The feature data (independent variables or predictors):")
print(X.head())

print("\nTarget labels (the dependent variable or outcome):")
print(y.head())

print("\nNumber of observations per class:")
print("Frequency counts:\n", y.value_counts())
print("\nPercentages (%):\n", y.value_counts(normalize=True) * 100)

## Step 5: Resample with replacement to balance the dataset

In [None]:
# Separate majority and minority classes
df_majority = online_shoppers_intention_data[online_shoppers_intention_data['Revenue']==0]
df_minority = online_shoppers_intention_data[online_shoppers_intention_data['Revenue']==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                               replace=True,     # Sample with replacement
                               n_samples=len(df_majority),    # To match the majority class
                               random_state=53)  # To ensure the results are reproducible

# Combine majority class with upsampled minority class
df_balanced = pd.concat([df_majority, df_minority_upsampled])

# Separate features and target from balanced dataset
X_balanced = df_balanced.drop('Revenue', axis=1)
y_balanced = df_balanced['Revenue']

print("\nNumber of observations per class:")
print("Frequency counts:\n", y_balanced.value_counts())
print("\nPercentages (%):\n", y_balanced.value_counts(normalize=True) * 100)
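
One caveat with upsampling the whole dataset before splitting is that copies of the same minority row can land in both the training and the test set, which may inflate the test scores. A commonly used alternative is to split first and upsample only the training portion. The sketch below illustrates that ordering on hypothetical toy data (the `toy` DataFrame and all variable names are assumptions for illustration, not part of the notebook).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced toy data: 80 rows of class 0 and 20 rows of class 1
toy = pd.DataFrame({
    "feature": range(100),
    "Revenue": [0] * 80 + [1] * 20,
})

# Split first, stratifying so both sets keep the original class ratio
train_df, test_df = train_test_split(
    toy, test_size=0.30, stratify=toy["Revenue"], random_state=42)

# Upsample the minority class inside the training set only
train_majority = train_df[train_df["Revenue"] == 0]
train_minority = train_df[train_df["Revenue"] == 1]
train_minority_upsampled = resample(
    train_minority, replace=True,
    n_samples=len(train_majority), random_state=53)
train_balanced = pd.concat([train_majority, train_minority_upsampled])

print(train_balanced["Revenue"].value_counts())  # balanced training classes
print(test_df["Revenue"].value_counts())         # test set keeps the original ratio

# No original row appears in both the training and the test set
overlap = set(train_balanced["feature"]) & set(test_df["feature"])
print("Rows shared by train and test:", len(overlap))
```

With this ordering the test set keeps the real-world class ratio, so metrics such as precision and recall for class 1 reflect performance on unseen, unduplicated sessions.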


## Step 6: Split the data into training and testing sets

In [None]:
# Split into a training set and a test set (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.30, random_state=42)

## Step 7: Apply data transformation (feature scaling)

`StandardScaler` is a preprocessing class from scikit-learn that standardizes each feature by removing its mean and scaling it to unit variance. It does this by applying the standardization formula to each feature:
- Standardization formula: `z = (x - μ) / σ`
- Where:
    - `x` is the original value of the feature.
    - `μ` is the mean of the feature values.
    - `σ` is the standard deviation of the feature values.
- The result is:
    - Each transformed feature has a mean of 0.
    - Each transformed feature has a standard deviation of 1.
    - If the feature is approximately normally distributed, roughly 68% of its values lie between -1 and 1.
    - If the feature is approximately normally distributed, roughly 95% of its values lie between -2 and 2.
- `μ` and `σ` are learned from the training set only (`fit_transform`) and then reused on the test set (`transform`), so that no information from the test set influences the scaling.

- Advantages:
    - Makes features comparable when they have different scales.
    - Many machine learning algorithms perform better when features are on similar scales.
    - Particularly important for algorithms that use distance calculations, such as an SVM with an RBF kernel, or that assume features are centered on zero with comparable variance.
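
The formula can be checked by hand on a small example. The sketch below uses a hypothetical five-value feature (the array `x` is illustrative) and compares the manual calculation with `StandardScaler`. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default for `std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature column with five values
x = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

mu = x.mean()     # mean of the feature
sigma = x.std()   # population standard deviation (ddof=0), as StandardScaler uses
z_manual = (x - mu) / sigma

z_sklearn = StandardScaler().fit_transform(x)

print(z_manual.ravel())
print("Manual result matches StandardScaler:", np.allclose(z_manual, z_sklearn))
```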


In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Step 8: Train the model

In [None]:
# Initialize the classifier
model = SVC(kernel='rbf', random_state=53)

# Fit the model on the training data
model.fit(X_train_scaled, y_train)

## Step 9: Evaluate the Model

`y_pred = model.predict(X_test_scaled)`

- This uses the trained SVM classifier (`model`) to predict the labels for the scaled test set features (`X_test_scaled`). This gives you the model’s predictions on data it has not seen before, which is necessary for evaluating its performance.

`print("Classification Report:\n", classification_report(y_test, y_pred))`
- This prints a detailed classification report comparing the true labels (`y_test`) to the predicted labels (`y_pred`). The report includes precision, recall, F1-score, and support for each class, enabling you to understand how well the model performs for each category.
- It shows the performance metrics for a model that predicts two classes:
    - Class 0 - A case where the user's interaction with the eCommerce website does not lead to a purchase.
    - Class 1 - A case where the user's interaction with the eCommerce website leads to a purchase.

- There are 6,254 total items tested:
    - Class 0 has 3,146 (about 50%)
    - Class 1 has 3,108 (about 50%)

| Term | Meaning |
|------|---------|
| **Precision** | Out of all items the model said are class X, how many are actually class X? |
| **Recall** | Out of all actual items in class X, how many did the model correctly find? |
| **F1-score** | The harmonic mean of precision and recall; it is high only when both are high. |
| **Support** | The number of actual items in that class. |
| **Macro avg** | The unweighted average of precision, recall, and F1-score across both classes, treating them equally. |
| **Weighted avg** | The average of precision, recall, and F1-score, weighted by how many samples are in each class (so the larger class has more influence). |

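- In the run reported here, the model is much better at predicting class 1 than class 0, and overall gets about 75% of predictions correct. Because Step 5 balanced the classes (the test set is split almost exactly 50/50), this gap cannot be explained by class imbalance. One possible factor is that the upsampled class 1 rows are duplicates, and some copies may appear in both the training and test sets, which can make class 1 look easier to predict.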
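
The metric definitions in the table above can be reproduced from the four cells of a confusion matrix. The sketch below uses hypothetical labels for ten sessions (`y_true` and `y_hat` are illustrative and not taken from the dataset) and computes precision, recall, and F1-score for class 1 by hand.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels for ten sessions
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the sessions predicted as class 1, the share that are class 1
recall = tp / (tp + fn)      # of the actual class 1 sessions, the share that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1-score: {f1:.3f}")
# Precision: 0.667, Recall: 0.800, F1-score: 0.727
```

These hand-computed values agree with `precision_score`, `recall_score`, and `f1_score` from scikit-learn, which report the same quantities for the positive class by default.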

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Compute and display the accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))  # Overall fraction of correct predictions

# Show precision, recall, F1-score for each class
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Compute and display the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


## Step 10: Use the model to make a prediction on a new sample

In [None]:
# Create a new sample (example values based on the dataset's feature order)
new_session = [[0,     # Administrative
                0.0,   # Administrative_Duration
                2,     # Informational
                0.0,   # Informational_Duration
                20,    # ProductRelated
                500.0, # ProductRelated_Duration
                0.02,  # BounceRates
                0.01,  # ExitRates
                0.005, # PageValues
                0.0,   # SpecialDay
                3,     # Month (encoded)
                2,     # OperatingSystems
                1,     # Browser
                1,     # Region
                1,     # TrafficType
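                2,     # VisitorType ('Returning_Visitor' is encoded as 2 by LabelEncoder)
                0      # Weekend (False encoded as 0)
               ]]

# Use the same column names as the training data
new_session_online_shoppers_intention_data = pd.DataFrame(new_session, columns=X.columns)

# Apply the scaler fitted on the training data, because the model was trained on scaled features
new_session_scaled = scaler.transform(new_session_online_shoppers_intention_data)

# Predict using the trained model
prediction = model.predict(new_session_scaled)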

# Display the result
print("\nPredicted Revenue:", "Yes" if prediction[0] == 1 else "No")
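
Since the model was trained on scaled features, every new sample must pass through the same scaler before prediction. One way to make this harder to forget is to bundle the scaler and the classifier in a scikit-learn pipeline. The following is a minimal sketch on synthetic data (`X_toy`, `y_toy`, and `svm_pipeline` are hypothetical names, and the synthetic data only stands in for the shoppers dataset).

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the shoppers dataset
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=53)

# The pipeline fits the scaler and the classifier together
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=53))
svm_pipeline.fit(X_toy, y_toy)

# Raw (unscaled) samples can be passed directly; scaling happens inside predict()
new_samples = X_toy[:3]
toy_predictions = svm_pipeline.predict(new_samples)
print(toy_predictions)
```

With this design, the scaling parameters learned during `fit` are applied automatically whenever `predict` is called, so training data and new samples are always transformed consistently.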