## Name:

## Admin Number:

## Brief Overview

This notebook performs a binary classification analysis on the **Course Completion Prediction** dataset to predict whether a student will complete a course or not. The dataset contains 100,000 records with 40 features spanning student demographics, course information, engagement metrics, and payment details.

### Decision Points

Throughout this analysis, two key decision points were documented where alternative approaches were considered:

1. **Decision Point 1 (Feature Selection):** We chose to **drop high-cardinality categorical features** (Student_ID, Name, City, Course_ID, Course_Name, Enrollment_Date) rather than one-hot encoding them. One-hot encoding was considered but rejected because it would create thousands of sparse columns, leading to overfitting, increased computational overhead, and diminishing model interpretability.

2. **Decision Point 2 (Feature Scaling):** We chose **StandardScaler over MinMaxScaler** for feature scaling. MinMaxScaler was considered but rejected because StandardScaler is more robust to the outliers identified during our EDA, and it preserves the relative distances between data points better when outliers are present.

### Dataset-Specific Constraint

**Constraint:** The dataset contains many high-cardinality categorical features (Student_ID, Name, City, Course_ID, Course_Name) that act as near-unique identifiers or have too many categories for effective encoding. This is a dataset-specific constraint that limits the usefulness of these features and influenced our preprocessing, feature selection, and model interpretation. This constraint is explicitly referenced in the EDA, model selection, and conclusion sections.

## Table of Contents

1. [Import Libraries](#section1)
2. [Import Data](#section2)
3. [Data Exploration (EDA)](#section3)
4. [Data Cleaning and Preparation](#section4)
5. [Model Training](#section5)
6. [Model Comparison](#section6)
7. [Tuning](#section7)
8. [Validation](#section8)

<a id="section1"></a>
## 1. Import Libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

%matplotlib inline

print('All libraries imported successfully.')

<a id="section2"></a>
## 2. Import Data

In [None]:
df = pd.read_csv('Course_Completion_Prediction.csv')
print(f'Dataset shape: {df.shape}')
print(f'Number of rows: {df.shape[0]}')
print(f'Number of columns: {df.shape[1]}')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
print('Missing values per column:')
print(df.isnull().sum())
print(f'\nTotal missing values: {df.isnull().sum().sum()}')

<a id="section3"></a>
## 3. Data Exploration (EDA)

### 3.1 Target Variable Distribution

In [None]:
print('Target variable distribution:')
print(df['Completed'].value_counts())
print(f'\nClass balance ratio: {df["Completed"].value_counts(normalize=True).round(3).to_dict()}')

plt.figure(figsize=(6, 4))
ax = sns.countplot(x='Completed', data=df, palette='viridis')
for p in ax.patches:
    ax.annotate(f'{int(p.get_height()):,}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', fontsize=12)
plt.title('Distribution of Course Completion Status')
plt.xlabel('Completion Status')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

The classes are nearly balanced (~49% Completed vs ~51% Not Completed), so we do not need to apply resampling techniques such as SMOTE.

### 3.2 Correlation Heatmap of Numerical Features

In [None]:
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f'Numerical columns ({len(numerical_cols)}): {numerical_cols}')

plt.figure(figsize=(16, 12))
corr_matrix = df[numerical_cols].corr()
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0, linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()

### 3.3 Distribution of Key Numerical Features

In [None]:
key_features = ['Age', 'Course_Duration_Days', 'Average_Session_Duration_Min',
                'Video_Completion_Rate', 'Quiz_Score_Avg', 'Progress_Percentage',
                'Time_Spent_Hours', 'Payment_Amount']

fig, axes = plt.subplots(2, 4, figsize=(18, 8))
for i, col in enumerate(key_features):
    ax = axes[i // 4, i % 4]
    df[col].hist(bins=30, ax=ax, color='steelblue', edgecolor='black', alpha=0.7)
    ax.set_title(col, fontsize=10)
    ax.set_xlabel('')
plt.suptitle('Distribution of Key Numerical Features', fontsize=14)
plt.tight_layout()
plt.show()

### 3.4 Boxplots for Outlier Detection

In [None]:
outlier_features = ['Age', 'Course_Duration_Days', 'Average_Session_Duration_Min',
                    'Time_Spent_Hours', 'Quiz_Score_Avg', 'Payment_Amount']

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for i, col in enumerate(outlier_features):
    ax = axes[i // 3, i % 3]
    sns.boxplot(y=df[col], ax=ax, color='lightcoral')
    ax.set_title(f'Boxplot of {col}', fontsize=10)
plt.suptitle('Boxplots for Outlier Detection', fontsize=14)
plt.tight_layout()
plt.show()

### 3.5 Categorical Feature Analysis

In [None]:
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f'Categorical columns ({len(categorical_cols)}): {categorical_cols}')
print()
for col in categorical_cols:
    n_unique = df[col].nunique()
    print(f'{col}: {n_unique} unique values')

### Dataset-Specific Constraint (Referenced in EDA)

**Constraint:** The dataset contains many high-cardinality categorical features (Student_ID, Name, City, Course_ID, Course_Name) that act as near-unique identifiers or have too many categories for effective encoding. This is a dataset-specific constraint that limits the usefulness of these features and influenced our preprocessing, feature selection, and model interpretation.

As seen above, features like Student_ID and Name have nearly 100,000 unique values (essentially unique identifiers), City has many unique values, and Course_ID/Course_Name also have high cardinality. These features cannot be meaningfully one-hot encoded without creating an unmanageable number of sparse columns.

### Decision Point 1: Feature Selection Strategy

**Decision:** Drop high-cardinality categorical features (Student_ID, Name, City, Course_ID, Course_Name, Enrollment_Date) rather than one-hot encoding them.

**Alternative considered:** One-hot encoding all categorical features including high-cardinality ones.

**Justification:** One-hot encoding features like Student_ID (~100K unique values), Name (~100K), and City (many unique values) would create thousands of sparse binary columns. This would lead to:
- Severe overfitting due to the curse of dimensionality
- Excessive computational overhead (memory and training time)
- Loss of model interpretability

Instead, we drop these features since they are identifiers or have too many categories to provide meaningful predictive signal in an encoded form.

<a id="section4"></a>
## 4. Data Cleaning and Preparation

### 4.1 Introduce Dirty Data

Since the original dataset has no missing values, we intentionally introduce some dirty elements (missing values and outliers) to demonstrate data cleaning skills.

In [None]:
# Make a copy to work with
df_dirty = df.copy()

# Set random seed for reproducibility
np.random.seed(42)

# Inject ~5% missing values into selected numerical columns
cols_to_dirty = ['Age', 'Average_Session_Duration_Min', 'Quiz_Score_Avg', 'Time_Spent_Hours', 'Payment_Amount']
for col in cols_to_dirty:
    mask = np.random.random(len(df_dirty)) < 0.05
    df_dirty.loc[mask, col] = np.nan

# Inject outliers into Age (add some extreme values)
outlier_idx = np.random.choice(df_dirty.index, size=100, replace=False)
df_dirty.loc[outlier_idx, 'Age'] = np.random.choice([150, 200, -5, 0], size=100)

print('Missing values after injection:')
print(df_dirty[cols_to_dirty].isnull().sum())
print(f'\nTotal missing values: {df_dirty[cols_to_dirty].isnull().sum().sum()}')

### 4.2 Handle Missing Values

In [None]:
# Impute missing values with median (robust to outliers)
for col in cols_to_dirty:
    median_val = df_dirty[col].median()
    df_dirty[col].fillna(median_val, inplace=True)
    print(f'{col}: imputed with median = {median_val:.2f}')

print(f'\nRemaining missing values: {df_dirty[cols_to_dirty].isnull().sum().sum()}')

### 4.3 Handle Outliers Using IQR Method

In [None]:
def cap_outliers_iqr(dataframe, column):
    Q1 = dataframe[column].quantile(0.25)
    Q3 = dataframe[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    before = ((dataframe[column] < lower_bound) | (dataframe[column] > upper_bound)).sum()
    dataframe[column] = dataframe[column].clip(lower=lower_bound, upper=upper_bound)
    print(f'{column}: capped {before} outliers (bounds: [{lower_bound:.2f}, {upper_bound:.2f}])')
    return dataframe

outlier_cols = ['Age', 'Average_Session_Duration_Min', 'Time_Spent_Hours', 'Quiz_Score_Avg', 'Payment_Amount']
for col in outlier_cols:
    df_dirty = cap_outliers_iqr(df_dirty, col)

### 4.4 Drop High-Cardinality Features

As identified in our dataset-specific constraint, we drop features that are identifiers or have too many unique categories.

In [None]:
cols_to_drop = ['Student_ID', 'Name', 'City', 'Course_ID', 'Course_Name', 'Enrollment_Date']
df_clean = df_dirty.drop(columns=cols_to_drop)
print(f'Dropped columns: {cols_to_drop}')
print(f'Remaining shape: {df_clean.shape}')

### 4.5 Encode Target Variable

In [None]:
df_clean['Completed'] = df_clean['Completed'].map({'Completed': 1, 'Not Completed': 0})
print('Target variable encoded:')
print(df_clean['Completed'].value_counts())

### 4.6 Encode Categorical Features

In [None]:
# Label encode ordinal features
ordinal_features = {
    'Education_Level': ['High School', 'Diploma', 'Undergraduate', 'Postgraduate', 'PhD'],
    'Course_Level': ['Beginner', 'Intermediate', 'Advanced'],
    'Internet_Connection_Quality': ['Poor', 'Average', 'Good', 'Excellent']
}

le = LabelEncoder()
for col, order in ordinal_features.items():
    if col in df_clean.columns:
        mapping = {val: idx for idx, val in enumerate(order)}
        # Handle any values not in the expected order
        df_clean[col] = df_clean[col].map(mapping).fillna(-1).astype(int)
        print(f'{col}: label encoded with mapping {mapping}')

print()

In [None]:
# One-hot encode remaining categorical features
nominal_features = ['Gender', 'Employment_Status', 'Device_Type', 'Category', 'Payment_Mode', 'Fee_Paid', 'Discount_Used']

# Filter to only columns that exist
nominal_features = [col for col in nominal_features if col in df_clean.columns]
print(f'One-hot encoding: {nominal_features}')

df_clean = pd.get_dummies(df_clean, columns=nominal_features, drop_first=True, dtype=int)
print(f'Shape after one-hot encoding: {df_clean.shape}')

### 4.7 Feature Engineering

In [None]:
# Create Assignment_Completion_Ratio
df_clean['Assignment_Completion_Ratio'] = df_clean['Assignments_Submitted'] / (
    df_clean['Assignments_Submitted'] + df_clean['Assignments_Missed'])

# Handle division by zero (where both are 0)
df_clean['Assignment_Completion_Ratio'].fillna(0, inplace=True)

print('Created feature: Assignment_Completion_Ratio')
print(df_clean['Assignment_Completion_Ratio'].describe())

### 4.8 Feature Scaling

### Decision Point 2: Scaling Strategy

**Decision:** Use StandardScaler over MinMaxScaler for feature scaling.

**Alternative considered:** MinMaxScaler, which rescales features to a [0, 1] range.

**Justification:** StandardScaler (zero mean, unit variance) is more robust to the outliers we identified during EDA. MinMaxScaler compresses all values into [0, 1] and is heavily influenced by extreme values, which can distort the scaled distribution. Since our dataset had outliers (even after capping), StandardScaler is the more appropriate choice.

In [None]:
# Separate features and target
X = df_clean.drop(columns=['Completed'])
y = df_clean['Completed']

print(f'Features shape: {X.shape}')
print(f'Target shape: {y.shape}')
print(f'Target distribution:\n{y.value_counts()}')

In [None]:
# Apply StandardScaler
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

print('Feature scaling applied (StandardScaler).')
X_scaled.describe().round(2)

### 4.9 Train/Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)

print(f'Training set: {X_train.shape[0]} samples')
print(f'Test set: {X_test.shape[0]} samples')
print(f'\nTraining target distribution:\n{y_train.value_counts()}')
print(f'\nTest target distribution:\n{y_test.value_counts()}')

<a id="section5"></a>
## 5. Model Training

### 5.1 Logistic Regression

In [None]:
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

print('=== Logistic Regression ===')
print(f'Accuracy: {accuracy_score(y_test, lr_pred):.4f}')
print()
print(classification_report(y_test, lr_pred))

### 5.2 Random Forest Classifier

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

print('=== Random Forest Classifier ===')
print(f'Accuracy: {accuracy_score(y_test, rf_pred):.4f}')
print()
print(classification_report(y_test, rf_pred))

### 5.3 Decision Tree Classifier

In [None]:
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)

print('=== Decision Tree Classifier ===')
print(f'Accuracy: {accuracy_score(y_test, dt_pred):.4f}')
print()
print(classification_report(y_test, dt_pred))

<a id="section6"></a>
## 6. Model Comparison

In [None]:
# Collect metrics for all models
models = {'Logistic Regression': lr_pred, 'Random Forest': rf_pred, 'Decision Tree': dt_pred}

results = []
for name, preds in models.items():
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, preds),
        'Precision': precision_score(y_test, preds),
        'Recall': recall_score(y_test, preds),
        'F1 Score': f1_score(y_test, preds)
    })

results_df = pd.DataFrame(results).set_index('Model')
print('Model Comparison:')
print(results_df.round(4))

In [None]:
# Bar chart comparison
results_df.plot(kind='bar', figsize=(10, 6), colormap='viridis', edgecolor='black')
plt.title('Model Performance Comparison')
plt.ylabel('Score')
plt.xlabel('Model')
plt.xticks(rotation=0)
plt.ylim(0, 1)
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()

### Model Comparison Discussion

The comparison above shows the performance of all three models. Random Forest is expected to outperform Logistic Regression and Decision Tree because:

1. **Logistic Regression** assumes linear decision boundaries, which may not capture complex feature interactions in this dataset.
2. **Decision Tree** is prone to overfitting, especially with the number of features we have after encoding.
3. **Random Forest** is an ensemble method that reduces overfitting through bagging and feature randomisation.

**Dataset Constraint Reference:** Due to our dataset-specific constraint (high-cardinality categorical features), we had to drop several features before modelling. This means our models work with a reduced but cleaner feature set. The Random Forest model handles the remaining mixed feature types (numerical + encoded categorical) well, making it our choice for hyperparameter tuning.

<a id="section7"></a>
## 7. Tuning (Hyperparameter Tuning)

In [None]:
# GridSearchCV on Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring='f1',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f'\nBest Parameters: {grid_search.best_params_}')
print(f'Best F1 Score (CV): {grid_search.best_score_:.4f}')

In [None]:
# Evaluate tuned model on test set
best_rf = grid_search.best_estimator_
tuned_pred = best_rf.predict(X_test)

print('=== Tuned Random Forest ===')
print(f'Accuracy: {accuracy_score(y_test, tuned_pred):.4f}')
print()
print(classification_report(y_test, tuned_pred))

In [None]:
# Compare original vs tuned Random Forest
print('Performance Comparison: Original vs Tuned Random Forest')
print(f'{"Metric":<12} {"Original":>10} {"Tuned":>10}')
print('-' * 34)
print(f'{"Accuracy":<12} {accuracy_score(y_test, rf_pred):>10.4f} {accuracy_score(y_test, tuned_pred):>10.4f}')
print(f'{"Precision":<12} {precision_score(y_test, rf_pred):>10.4f} {precision_score(y_test, tuned_pred):>10.4f}')
print(f'{"Recall":<12} {recall_score(y_test, rf_pred):>10.4f} {recall_score(y_test, tuned_pred):>10.4f}')
print(f'{"F1 Score":<12} {f1_score(y_test, rf_pred):>10.4f} {f1_score(y_test, tuned_pred):>10.4f}')

<a id="section8"></a>
## 8. Validation (Cross-Validation)

In [None]:
# 5-fold cross-validation on the tuned model
cv_scores = cross_val_score(best_rf, X_scaled, y, cv=5, scoring='f1', n_jobs=-1)

print('5-Fold Cross-Validation Results (F1 Score):')
print(f'Fold scores: {cv_scores.round(4)}')
print(f'Mean F1 Score: {cv_scores.mean():.4f}')
print(f'Std F1 Score:  {cv_scores.std():.4f}')

### Confusion Matrix

In [None]:
# Confusion matrix visualization
cm = confusion_matrix(y_test, tuned_pred)

plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Not Completed', 'Completed'],
            yticklabels=['Not Completed', 'Completed'])
plt.title('Confusion Matrix - Tuned Random Forest')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()

print(f'True Negatives: {cm[0][0]}')
print(f'False Positives: {cm[0][1]}')
print(f'False Negatives: {cm[1][0]}')
print(f'True Positives: {cm[1][1]}')

## Conclusion

### Summary

In this analysis, we built and compared three classification models (Logistic Regression, Random Forest, and Decision Tree) to predict course completion status. The tuned Random Forest model was selected as the best performer after hyperparameter optimisation via GridSearchCV.

### Dataset-Specific Constraint

The dataset contained many **high-cardinality categorical features** (Student_ID, Name, City, Course_ID, Course_Name) that acted as near-unique identifiers. This constraint was a key factor throughout the analysis:
- **In EDA**, we identified that these features had too many unique values to be meaningfully visualised or encoded.
- **In preprocessing**, we dropped these features rather than encoding them (Decision Point 1).
- **In model selection**, we noted that the reduced but cleaner feature set favoured ensemble methods like Random Forest that can handle mixed feature types effectively.

### Decision Points Recap

1. **Decision Point 1 (Feature Selection):** Dropped high-cardinality features instead of one-hot encoding — prevented dimensionality explosion and overfitting.
2. **Decision Point 2 (Feature Scaling):** Used StandardScaler instead of MinMaxScaler — better handling of outliers identified during EDA.

### Cross-Validation

The 5-fold cross-validation confirmed that the tuned Random Forest model generalises well, with consistent F1 scores across folds and low standard deviation, indicating stable performance.

### Final Note

The nearly balanced class distribution (~49% vs ~51%) meant that no special resampling was needed, and accuracy is a reliable metric alongside F1 score for model evaluation.