# Classification using the CART Algorithm (a decision tree classifier) and the Gini Impurity

|                  |                                                                                                                                                                                                     |
|:-----------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Course Codes** | BBT 4106, BCM 3104, and BFS 4102                                                                                                                                                                    |
| **Course Names** | BBT 4106: Business Intelligence I (Week 10-12 of 13),<br/>BCM 3104: Business Intelligence and Data Analytics (Week 10-12 of 13) and<br/>BFS 4102: Advanced Business Data Analytics (Week 4-6 of 13) |
| **Semester**     | April to July 2025                                                                                                                                                                                  |
| **Lecturer**     | Allan Omondi                                                                                                                                                                                        |
| **Contact**      | aomondi@strathmore.edu                                                                                                                                                                              |
| **Note**         | The lecture contains both theory and practice. This notebook forms part of the practice. This is intended for educational purposes only.                                                            |

**Business context**: A business has a strategic objective to *reduce customer churn to 10% by the end of the current financial year*. The lagging KPI in the customer perspective of the business' performance is the churn rate whereas its leading KPI is the number of support calls. The business would like to predict whether a customer will renew their subscription so that the marketing and sales teams can intervene early and avoid losing customers.

**Dataset**: The synthetic dataset used in this notebook is based on the **"Subscription Churn"** dataset. It contains 1,000 observations of customer data with the following features and target.

| **Type**    | **Name**        | **Description**                                                                                       |
|:------------|-----------------|:------------------------------------------------------------------------------------------------------|
| **Feature** | `monthly_fee`   | Monthly fee paid by the customer                                                                      |
| **Feature** | `customer_age`  | Age of the customer in years                                                                          |
| **Feature** | `support_calls` | Number of support calls made by the customer                                                          |
| **Target**  | `renew`         | A categorical variable that indicates if the customer renewed (1) or cancelled (0) their subscription |

## Step 1: Import the necessary libraries

1. **Data Manipulation Libraries**
    - `pandas as pd`: For loading the dataset, creating and managing DataFrames, data manipulation and analysis using DataFrames

2. **Machine Learning Libraries**
    - `DecisionTreeClassifier`: A class from scikit-learn that implements the CART (Classification and Regression Trees) algorithm for building decision tree models.
    - `plot_tree`: A function from scikit-learn’s tree module that visualizes the decision tree structure.
    - `train_test_split`: A function from scikit-learn’s model_selection module that splits the dataset into training and testing sets.
    - `classification_report`: A function from scikit-learn’s metrics module used to evaluate the performance of the classifier. It gives detailed metrics such as precision, recall, f1-score, and support for each class.

3. **Statistical Analysis (SciPy)**
    - `kurtosis`: Measures the "tailedness" of data distribution
    - `skew`: Measures the asymmetry of data distribution

4. **Visualization Libraries**
    - `matplotlib.pyplot as plt`: For basic plotting functionality
    - `seaborn as sns`: For enhanced statistical visualizations

5. **Warnings Management**
    - `warnings`: Controls warning messages
    - `warnings.filterwarnings('ignore')`: Suppresses warning messages for cleaner output
    - Used to suppress warnings that may arise during the execution of the code. Even though it is not necessary for the code to run, it helps in keeping the output clean and focused on the results.

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from scipy.stats import kurtosis
from scipy.stats import skew
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## Step 2: Load and Explore the Data

`url =`
- Specifies the location where the `.csv` dataset can be found

`subscription_churn_data = pd.read_csv(url)`
- Loads the dataset into the DataFrame called `subscription_churn_data`

`subscription_churn_data.shape`
* The `shape` attribute returns a tuple with the number of rows (observations) and the number of columns (variables) in the DataFrame. It is displayed as the first output of Step 3 and gives you an idea of how large the dataset is.

In [None]:
# If you are using Google Colab, uncomment the following lines to mount your Google Drive and load the data from there.
# from google.colab import drive
# drive.mount('/content/drive')
# url = '/content/drive/My Drive/Colab Notebooks/data/subscription_churn.csv'

# url = './data/subscription_churn.csv'
url = 'https://raw.githubusercontent.com/course-files/RegressionAndClassification/refs/heads/main/data/subscription_churn.csv'
subscription_churn_data = pd.read_csv(url)

## Step 3: Initial Exploratory Data Analysis (EDA)

In [None]:
print("\n*1* The number of observations and variables")
display(subscription_churn_data.shape)

print("\n*2* The data types:")
# info() prints its summary directly and returns None, so it is not wrapped in display()
subscription_churn_data.info()

print("\n*3* The summary of the numeric columns:")
display(subscription_churn_data.describe())

print("\n*4* The whole dataset:")
display(subscription_churn_data)

print("\n*5* The first 5 rows in the dataset:")
display(subscription_churn_data.head())

print("\n*6* Percentage distribution for each category")
print("\nNumber of observations per class:")
print("Frequency counts:\n", subscription_churn_data['renew'].value_counts())
print("\nPercentages (%):\n", subscription_churn_data['renew'].value_counts(normalize=True) * 100)

### Measures of Distribution

#### Variance of numeric columns

**Selection of numeric columns**
- The code selects columns with numeric data types (`int64` and `float64`) that can be subjected to mathematical or statistical functions.
- This is done using `select_dtypes()` method of the DataFrame, which filters columns based on their data types.
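
As a minimal sketch of how `select_dtypes()` filters columns, the toy DataFrame below mixes numeric and text columns. The values and the `plan_name` column are hypothetical and are not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one text column
toy = pd.DataFrame({
    "monthly_fee": [49.99, 79.50, 65.00],    # stored as float64
    "support_calls": [2, 0, 5],              # stored as int64
    "plan_name": ["basic", "pro", "basic"],  # text, so it is filtered out
})

# Keep only the integer and floating-point columns
toy_numeric_cols = toy.select_dtypes(include=["int64", "float64"]).columns
print(list(toy_numeric_cols))
```

Passing `include="number"` instead would also pick up numeric columns of other widths, such as `int32`, which may matter if the data was loaded with non-default types.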

In [None]:
numeric_cols = subscription_churn_data.select_dtypes(include=['int64', 'float64']).columns
print("\nVariance of the numeric columns:")
print(subscription_churn_data[numeric_cols].var())

#### Standard deviation of numeric columns

In [None]:
print("\nStandard deviation of the numeric columns:")
print(subscription_churn_data[numeric_cols].std())

#### Kurtosis of numeric columns

In [None]:
print("\nFisher Kurtosis of numeric columns:")
print("\nInterpretation:")
print("→ Positive kurtosis indicates heavier tails (more outliers) than what is expected in a normal distribution - leptokurtic")
print("→ Negative kurtosis indicates lighter tails (fewer outliers) than what is expected in a normal distribution - platykurtic")
print("→ A normal distribution has kurtosis of 0 - mesokurtic")
print("\nKurtosis values:")
print(subscription_churn_data[numeric_cols].apply(lambda x: kurtosis(x, fisher=True)))

#### Skewness of numeric columns

In [None]:
print("\nSkewness of numeric columns:")
print("\nInterpretation:")
print("→ Positive skewness indicates a long right tail (right-skewed distribution)")
print("→ Negative skewness indicates a long left tail (left-skewed distribution)")
print("→ Skewness close to 0 indicates a symmetric distribution")
print("\nSkewness values:")
print(subscription_churn_data[numeric_cols].apply(lambda x: skew(x)))

### Measures of Relationship

#### Covariance matrix of numeric features

In [None]:
print("\nCovariance matrix of numeric features:")
print("\nInterpretation:")
print("→ Positive values indicate that variables move in the same direction")
print("→ Negative values indicate that variables move in opposite directions")
print("→ Values close to 0 indicate little to no linear relationship")
print("\nCovariance values:")
display(subscription_churn_data[numeric_cols].cov())

#### Correlation matrix of numeric features

In [None]:
print("\nSpearman's rank correlation matrix of numeric features:")
spearman_corr = subscription_churn_data[numeric_cols].corr(method='spearman')
print("\nInterpretation:")
print("→ Values range from -1 to +1")
print("→ +1 indicates perfect positive correlation")
print("→ -1 indicates perfect negative correlation")
print("→ 0 indicates no monotonic relationship")
print("\nCorrelation values:")
display(spearman_corr)

### Basic visualization of the data

- `n_cols = 3` Sets the number of plots per row to 3
- `n_rows = (len(numeric_cols) // n_cols) + (1 if len(numeric_cols) % n_cols else 0)` Calculates the number of rows needed based on the number of numeric columns and the number of columns per row.
- `plt.figure(figsize=(15, 5 * n_rows))` Sets the figure width to 15 inches and scales the height with the number of rows.
- `for i, col in enumerate(numeric_cols, 1):` Iterates over each numeric column (`numeric_cols`), starting the index at 1. `enumerate(numeric_cols, 1)` returns pairs of (index, value) for each item in the list. The 1 means that the index will start from 1, e.g., (1, 'monthly_fee'), (2, 'customer_age'), etc.
- `plt.subplot(n_rows, n_cols, i)` Creates a subplot in a grid layout with `n_rows` rows and `n_cols` columns, placing the current plot in the `i`-th position.
- `sns.histplot(data=subscription_churn_data, x=col)` Plots a histogram for the current numeric column using Seaborn's `histplot` function.
- `sns.boxplot(data=subscription_churn_data, y=col)` Plots a box plot for the current numeric column using Seaborn's `boxplot` function.
- `sns.despine(right=True, top=True)` Removes the right and top spines (borders) of the plot for a cleaner look.
- `plt.title(f'Distribution of {col}')` Sets the title of the current subplot to indicate which column's distribution is being shown.
- `plt.grid(axis='y', alpha=0.2)` Adds a grid to the y-axis with a transparency level of 0.2 for better visibility.
- `plt.grid(axis='x', visible=False)` Hides the grid for the x-axis to reduce clutter and increase the data-to-ink ratio.
- `plt.tight_layout()` Adjusts the spacing between subplots to prevent overlap and ensure a clean layout.
- `plt.show()` Displays the entire figure with all subplots.
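
The row calculation and the 1-based indexing described above can be checked without drawing anything. The following is a small sketch; the helper name `rows_needed` and the column order in `columns` are assumptions made for illustration.

```python
import math

n_cols = 3  # plots per row, as in the notebook


def rows_needed(n_plots, n_cols):
    """Rows needed to place n_plots subplots in a grid that is n_cols wide."""
    return (n_plots // n_cols) + (1 if n_plots % n_cols else 0)


# The expression is a ceiling division: a partly filled row still counts as a row
for n_plots in [3, 4, 7]:
    print(n_plots, rows_needed(n_plots, n_cols), math.ceil(n_plots / n_cols))

# enumerate(..., 1) yields 1-based positions, which is what plt.subplot() expects
columns = ["monthly_fee", "customer_age", "support_calls", "renew"]  # assumed order
positions = list(enumerate(columns, 1))
print(positions)
```

With four numeric columns and three plots per row, the grid therefore needs two rows, and the last two cells of the second row stay empty.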

#### Histograms

In [None]:
n_cols = 3
n_rows = (len(numeric_cols) // n_cols) + (1 if len(numeric_cols) % n_cols else 0)

plt.figure(figsize=(15, 5 * n_rows))
for i, col in enumerate(numeric_cols, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.histplot(data=subscription_churn_data, x=col)
    sns.despine(right=True, top=True)
    plt.title(f'Distribution of {col}')
    plt.grid(axis='y', alpha=0.2)
    plt.grid(axis='x', visible=False)
plt.tight_layout()  # Adjust spacing
plt.show()

#### Box plots

In [None]:
n_cols = 3
n_rows = (len(numeric_cols) // n_cols) + (1 if len(numeric_cols) % n_cols else 0)

plt.figure(figsize=(15, 5 * n_rows))
for i, col in enumerate(numeric_cols, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.boxplot(data=subscription_churn_data, y=col)
    sns.despine(right=True, top=True, bottom=True)
    plt.title(f'Box Plot of {col}')
    plt.grid(axis='y', alpha=0.2)
    plt.grid(axis='x', visible=False)
plt.tight_layout()
plt.show()

#### Missing data plot

- This visualization helps to quickly identify which columns have missing values and the extent of the missing data. The heatmap will show yellow for missing values and purple for non-missing values, making it easy to spot patterns of missingness. This is useful for understanding the completeness of the dataset and deciding how to handle missing values in subsequent analysis.
- The code uses `sns.heatmap()` to visualize missing data in the DataFrame.
- The code also uses the `isnull()` method to create a boolean DataFrame indicating where values are missing (True) or present (False).
- `yticklabels=False` hides the y-axis labels to reduce clutter.
- `cbar=False` removes the color bar, which is not necessary for this plot.
- `cmap='viridis'` sets the color map to 'viridis' which is a perceptually uniform color map suitable for visualizing missing data.
- `plt.title('Missing Data')` sets the title of the plot to 'Missing Data'
- `plt.show()` displays the plot.
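
A heatmap shows where values are missing, but exact counts are often easier to act on. The sketch below applies the same `isnull()` mask to a hypothetical toy DataFrame (not the course dataset) and summarizes it per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    "monthly_fee": [50.0, np.nan, 48.0, 80.0],
    "customer_age": [25, 40, 50, 31],
    "support_calls": [2.0, 0.0, np.nan, np.nan],
})

# isnull() gives True where a value is missing; summing counts the True values
missing_counts = toy.isnull().sum()

# The mean of a boolean column is the fraction of True values
missing_pct = toy.isnull().mean() * 100

print(missing_counts)
print(missing_pct)
```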

In [None]:
plt.figure(figsize=(8, 4))
sns.heatmap(subscription_churn_data.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.title('Missing Data')
plt.show()

#### Correlation heatmap

- This visualization helps to quickly identify relationships between numeric features. The heatmap will show the strength and direction of correlations, with colors indicating positive (red) or negative (blue) correlations. This is useful for understanding how features relate to each other and can inform feature selection or feature engineering in subsequent analysis.
- The code uses `sns.heatmap()` to visualize the Spearman correlation matrix of the numeric features in the DataFrame.
- `annot=True` adds the correlation values as annotations in the heatmap.
- `cmap='coolwarm'` sets the color map to 'coolwarm' which provides a gradient from blue (negative correlation) to red (positive correlation).
- `center=0` centers the color map at 0, which is useful for visualizing both positive and negative correlations.

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(spearman_corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()

#### Scatter plot matrix

- This visualization helps to quickly identify relationships between pairs of numeric features. The scatter plot matrix will show scatter plots for each pair of numeric features, allowing for visual inspection of relationships, trends, and potential outliers. This is useful for understanding how features interact with each other and can inform feature selection or feature engineering in subsequent analysis.
- The code uses `sns.pairplot()` to create a scatter plot matrix of the numeric features in the DataFrame

In [None]:
# sns.pairplot() creates its own figure, so a separate plt.figure() call would only add an empty plot
sns.pairplot(subscription_churn_data[numeric_cols])
plt.suptitle('Scatter Plot Matrix', y=1.02)
plt.show()

## Step 4: Data preparation

### Create X and y datasets for the features and target variable respectively

`X = pd.DataFrame(subscription_churn_data, columns = ['monthly_fee','customer_age','support_calls'])`
* Separates the data such that the data frame called `X` contains only the features (independent variables or predictors)

`y = pd.Series(subscription_churn_data['renew'])`
* Separates the data such that the Series called `y` contains only the target (dependent variable or outcome)

In [None]:
X = pd.DataFrame(subscription_churn_data, columns = ['monthly_fee','customer_age','support_calls'])
y = pd.Series(subscription_churn_data['renew'])

### Train‑test split

- `train_test_split` is a function from scikit-learn that splits your dataset into two parts: one for training the model and one for testing it.
- `X` is your feature data (inputs), and `y` is your target data (outputs/labels).
- `test_size=0.3` means 30% of the data will be used for testing, and the remaining 70% for training.
- `random_state=53` sets a seed for the random number generator, ensuring that the split is reproducible (you get the same split every time you run the code).

- The `train_test_split` function returns four objects:
  - `X_train`: features for training
  - `X_test`: features for testing
  - `y_train`: labels for training
  - `y_test`: labels for testing

**Why:**  
Splitting the data this way allows you to train your model on one part of the data and evaluate its performance on unseen data, which helps you detect overfitting and gives a more realistic measure of model accuracy.

Analogy: This is similar to how a student learning a subject is not exposed to only one past paper that they then memorize. If they memorize the past paper and the exam assesses them on a different set of questions, then their performance in the exam will not be the same as their performance in the memorized past paper.
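
The behaviour described above can be checked on a small example. The sketch below uses hypothetical toy data of the same size as the course dataset; the `stratify` argument in the last call is an optional extra that the notebook itself does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows, 80% renewals (1) and 20% cancellations (0)
toy_X = pd.DataFrame({"support_calls": range(1000)})
toy_y = pd.Series([1] * 800 + [0] * 200, name="renew")

# 70% of the rows go to training and 30% to testing
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=53
)
print(len(X_tr), len(X_te))

# The same random_state reproduces exactly the same split
X_tr2, X_te2, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=53
)
same_split = X_te.index.equals(X_te2.index)
print("Same split:", same_split)

# Optional: stratify keeps the class proportions (about) equal in both parts
_, _, _, y_te_strat = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=53, stratify=toy_y
)
print("Share of class 1 in the stratified test set:", y_te_strat.mean())
```

Stratifying may be worth considering when the classes are imbalanced, since an unlucky random split could otherwise leave very few examples of the smaller class in the test set.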

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=53
)

## Step 5: Create and Train the CART Decision Tree

**Explanation:**

- `decisiontree_model = DecisionTreeClassifier(criterion="gini", random_state=53, max_depth=4)`
  - This creates an instance of a decision tree classifier using the CART algorithm.
  - **criterion="gini"**: Specifies that the tree should use the Gini impurity measure to decide splits (the default in scikit-learn).
  - **random_state=53**: Ensures reproducibility by setting the random seed.
  - **max_depth=4**: Limits the tree to at most 4 levels of splits, which keeps it readable and reduces the risk of overfitting.

- `decisiontree_model.fit(X_train, y_train)`
  - This trains (fits) the decision tree classifier on the training data (`X_train` for features, `y_train` for the target).
  - This step therefore builds the decision tree model so it can learn patterns from the training data and later make predictions on new, unseen data.
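
The Gini impurity of a node is 1 minus the sum of the squared class proportions in that node: 0 means the node is pure (one class only), and 0.5 is the maximum for two classes. The sketch below is a plain-Python illustration of that definition and of how a candidate split is scored; the helper names are hypothetical, and scikit-learn's own implementation is optimized and differs in detail.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


parent = [1, 1, 1, 0, 0, 0]          # 3 renewals and 3 cancellations
print(gini_impurity(parent))          # evenly mixed node

perfect = split_impurity([1, 1, 1], [0, 0, 0])  # both children are pure
poor = split_impurity([1, 1, 0], [1, 0, 0])     # both children stay mixed
print(perfect, poor)
```

CART evaluates candidate thresholds in this way and keeps the split with the lowest weighted impurity, so the first split would be preferred over the second.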

In [None]:
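# Gini impurity is scikit-learn's default criterion; it is set explicitly here for clarity
decisiontree_model = DecisionTreeClassifier(
    criterion="gini",
    random_state=53,  # makes the fitted tree reproducible
    max_depth=4)      # limits the tree to 4 levels of splits
decisiontree_model.fit(X_train, y_train)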


## Step 6: Evaluate the Model

`y_pred = decisiontree_model.predict(X_test)`

- This uses the trained decision tree classifier (`decisiontree_model`) to predict the labels for the test set features (`X_test`). This gives you the model’s predictions on data it has not seen before, which is necessary for evaluating its performance.

`print("Classification Report:\n", classification_report(y_test, y_pred))`
- This prints a detailed classification report comparing the true labels (`y_test`) to the predicted labels (`y_pred`). The report includes precision, recall, F1-score, and support for each class, enabling you to understand how well the model performs for each category.
- It shows the performance metrics for a model that predicts two classes:
    - Class 0
    - Class 1

- There are 300 total items tested:
    - Class 0 has 56 items
    - Class 1 has 244 items

| Term             | Meaning                                                                                                                             |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **Precision**    | Out of all items the model said are class X, how many are actually class X?                                                         |
| **Recall**       | Out of all actual items in class X, how many did the model correctly find?                                                          |
| **F1-score**     | The harmonic mean of precision and recall; a higher value means that both are high.                                                 |
| **Support**      | The number of actual items in that class.                                                                                           |
| **Macro avg**    | The average of precision, recall, and F1-score for both classes, treating them equally.                                             |
| **Weighted avg** | The average of precision, recall, and F1-score, but weighted by how many samples are in each class (so class 1 has more influence). |

- The results show that the model is much better at predicting class 1 than class 0, and overall gets 75% of predictions correct. This may be because there are more class 1 cases in the data.
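
To make the terms in the table concrete, the sketch below computes them by hand for a hypothetical set of six labels. The helper `class_metrics` is written for illustration only; in the notebook, `classification_report` does this work.

```python
def class_metrics(y_true, y_pred, cls):
    """Precision, recall, F1-score and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # correctly predicted as cls
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # wrongly predicted as cls
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # cls that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    support = tp + fn  # number of actual items in the class
    return precision, recall, f1, support


# Hypothetical labels: 4 actual renewals (1) and 2 actual cancellations (0)
toy_true = [1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0]

p1, r1, f1_1, s1 = class_metrics(toy_true, toy_pred, 1)
p0, r0, f1_0, s0 = class_metrics(toy_true, toy_pred, 0)

macro_f1 = (f1_1 + f1_0) / 2                          # classes treated equally
weighted_f1 = (f1_1 * s1 + f1_0 * s0) / (s1 + s0)     # weighted by support

print("Class 1:", p1, r1, f1_1, s1)
print("Class 0:", p0, r0, f1_0, s0)
print("Macro F1:", macro_f1, "Weighted F1:", weighted_f1)
```

Because class 1 has twice the support of class 0 in this toy example, the weighted average sits closer to the class 1 score than the macro average does.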

In [None]:
y_pred = decisiontree_model.predict(X_test)

print("Classification Report:\n", classification_report(y_test, y_pred))

## Step 7: Visualize the Decision Tree

`plt.figure(figsize=(20, 7))`
- This creates a new matplotlib figure with a size of 20 inches by 7 inches so that the decision tree plot is large enough to read.

`plot_tree(decisiontree_model, feature_names=['monthly_fee','customer_age','support_calls'], class_names=['Cancel', 'Renew'], filled=True, max_depth=4)`
- Plots the trained decision tree (`decisiontree_model`).
    - `feature_names=[...]`: Labels the split conditions with the feature names, in the same column order as `X`.
    - `class_names=['Cancel', 'Renew']`: Labels each node with its majority class. The names are listed in ascending order of the class values (0 = Cancel, 1 = Renew).
    - `filled=True`: Colors the nodes based on the majority class for better visualization.
    - `max_depth=4`: Limits the number of levels that are drawn.
- This visually shows how the decision tree splits the data and makes decisions.

`plt.title("Decision Tree using the Gini Impurity (CART)")`
- Sets the title of the plot to provide context for the visualization.

`plt.show()`
- This is used to display the plot in the notebook.
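
When the plotted tree is hard to read, the same rules can be printed as text. The sketch below uses scikit-learn's `export_text` on a hypothetical six-row toy dataset; the notebook itself only uses `plot_tree`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [monthly_fee, support_calls] and whether the customer renewed
toy_X = [[50, 0], [80, 1], [50, 2], [80, 5], [50, 6], [80, 7]]
toy_y = [1, 1, 1, 0, 0, 0]
toy_features = ["monthly_fee", "support_calls"]

toy_tree = DecisionTreeClassifier(criterion="gini", random_state=53, max_depth=4)
toy_tree.fit(toy_X, toy_y)

# Each indented line is a split condition; lines starting with "class:" are leaves
rules = export_text(toy_tree, feature_names=toy_features)
print(rules)
```

In this toy data only `support_calls` separates the two classes cleanly, so the tree needs a single split even though `max_depth=4` would allow more.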

In [None]:
plt.figure(figsize=(20, 7))
plot_tree(
    decisiontree_model,
    feature_names=['monthly_fee','customer_age','support_calls'],
    class_names= ['Cancel', 'Renew'],
    filled=True,
    max_depth=4)
plt.title("Decision Tree using the Gini Impurity (CART)")
plt.show()

## Step 8: Make predictions on new data and save the results for reporting in Power BI

In [None]:
# Example: Using the trained model to make predictions on new or unseen data.
# Create a DataFrame with new customer data (replace values as needed)
new_data = pd.DataFrame({
    'monthly_fee': [50, 80, 48],
    'customer_age': [25, 40, 50],
    'support_calls': [2, 0, 1]
})

# Predict whether these 3 customers will renew or cancel their subscription
predictions = decisiontree_model.predict(new_data)
print("Predictions for new data:", predictions)

label_map = {0: 'Cancel', 1: 'Renew'}
predicted_labels = [label_map[p] for p in predictions]
print("Predicted labels for new data:", predicted_labels)

- This section demonstrates how to use the trained decision tree model to make predictions on new data and save the results for reporting in Power BI.
- It loads new data, processes it similarly to the training data, makes predictions, and saves the results to a CSV file for further analysis or reporting.
- The new data is expected to have the same structure as the training data, with the same features used for training the model.
- The predictions include whether the customers are likely to renew or cancel their subscription, along with the probabilities of each outcome.
- The results are saved to a CSV file, which can then be imported into Power BI for visualization and reporting.
- This is useful for businesses to understand customer behavior and make informed decisions based on the model's predictions.
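
The columns returned by `predict_proba` follow the order of the model's `classes_` attribute, and each probability is the share of training samples of that class in the leaf that the customer falls into. The sketch below illustrates this on hypothetical toy data with a deliberately shallow tree; the column names `prob_class_0`, `prob_class_1`, and `predicted_renew` are chosen for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training data
toy_train = pd.DataFrame({
    "monthly_fee": [50, 80, 50, 80, 50, 80, 60, 70],
    "support_calls": [0, 1, 2, 5, 6, 7, 1, 6],
    "renew": [1, 1, 1, 0, 0, 0, 1, 1],
})
features = ["monthly_fee", "support_calls"]

# max_depth=1 leaves one impure leaf, so the probabilities are not all 0 or 1
toy_model = DecisionTreeClassifier(criterion="gini", random_state=53, max_depth=1)
toy_model.fit(toy_train[features], toy_train["renew"])

toy_new = pd.DataFrame({"monthly_fee": [55, 75], "support_calls": [1, 6]})

# Select the feature columns explicitly so the input matches the training data
proba = toy_model.predict_proba(toy_new[features])

scored = toy_new.copy()
for i, cls in enumerate(toy_model.classes_):
    scored[f"prob_class_{cls}"] = proba[:, i]  # column i belongs to classes_[i]
scored["predicted_renew"] = toy_model.predict(toy_new[features])
print(scored)
```

Selecting the feature columns before predicting is a defensive habit: if the new file carries extra columns, such as a customer ID, passing the whole DataFrame to the model would likely raise an error about mismatched feature names.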

In [None]:
# Load the new data
# If you are using Google Colab, uncomment the following lines to mount your
# Google Drive and load the new data from there.
# from google.colab import drive
# drive.mount('/content/drive')
# new_data_file = '/content/drive/My Drive/Colab Notebooks/data/subscription_churn_new_data.csv'

# If you are using Google Colab, comment out the following line
new_data_file = './data/subscription_churn_new_data.csv'

new_data = pd.read_csv(new_data_file)

# Make predictions
predictions = decisiontree_model.predict(new_data)
probabilities = decisiontree_model.predict_proba(new_data)

# Add predictions and probabilities to the original dataframe
new_data['Predicted_Renew'] = predictions
new_data['Renew_Probability_Class_0'] = probabilities[:, 0]  # Probability of cancellation
new_data['Renew_Probability_Class_1'] = probabilities[:, 1]  # Probability of renewal

print("\nThe new data with predictions:")
display(new_data)

# Save the results to a CSV file
# If you are using Google Colab, uncomment the following line to save the predicted data to your Google Drive.
# output_file = '/content/drive/My Drive/Colab Notebooks/data/subscription_churn_predicted_data.csv'

output_file = './data/subscription_churn_predicted_data.csv'
new_data.to_csv(output_file, index=False)

# Print save confirmation
print(f"\nPredictions saved to: {output_file}")

## Business Insights
- The model's predictions can be used to identify customers who are likely to churn, allowing the marketing and sales teams to take proactive measures to retain them.
- The predictions and probabilities for new customer data have been saved to a CSV file, which can be imported into Power BI for reporting.
- The decision tree visualization provides insights into the key features that influence customer churn, such as monthly fee, customer age, and support calls.
- The model can be further improved by tuning hyperparameters, using more advanced algorithms, or incorporating additional features.