## Name:

## Admin Number:

## Brief Overview (provide your video link here too)

**Problem:** Predict whether a student will complete an online course based on demographic, behavioural, and engagement features.

**Dataset:** [Student Course Completion Prediction Dataset](https://www.kaggle.com/datasets/nisargpatel344/student-course-completion-prediction-dataset) — 100,000 records with 40 features covering student demographics, course details, learning behaviour, and payment information.

**Approach:** This is a binary classification problem (Completed vs Not Completed). We train and compare Logistic Regression, Random Forest, and Gradient Boosting classifiers, then tune the best performer using GridSearchCV and validate with Stratified K-Fold Cross-Validation.

**Video link:** *(insert link here)*

<a id='table_of_contents'></a>

1. [Import libraries](#imports)
2. [Import data](#import_data)
3. [Data exploration](#data_exploration)
4. [Data cleaning and preparation](#data_cleaning)
5. [Model training](#model_training)
6. [Model comparison](#model_comparsion)
7. [Tuning](#tuning)
8. [Validation](#validation)
9. [Conclusion](#conclusion)

# 1. Import libraries <a id='imports'></a>
[Back to top](#table_of_contents)

In [None]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import (train_test_split, GridSearchCV, StratifiedKFold,
                                     cross_val_score, StratifiedShuffleSplit)
from sklearn.preprocessing import StandardScaler  # LabelEncoder removed: ordinal features are mapped manually below
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix, roc_auc_score, roc_curve)

import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully.")

# 2. Import data <a id='import_data'></a>
[Back to top](#table_of_contents)

In [None]:
df = pd.read_csv('Course_Completion_Prediction.csv')
print(f"Dataset shape: {df.shape}")
print(f"\nFirst 5 rows:")
df.head()

In [None]:
print("Column names and data types:")
print(df.dtypes)

# 3. Data exploration <a id='data_exploration'></a>
[Back to top](#table_of_contents)

In this section we perform Exploratory Data Analysis (EDA) to understand the structure, distributions, and relationships within the data before modelling.

**Dataset-Specific Constraint:** The dataset is pre-cleaned with **no missing values** and a **nearly balanced target** (~49% Completed vs ~51% Not Completed). Balanced classes simplify classification, but the absence of real-world data quality issues means we must **introduce dirty data** (missing values, duplicates) in the next section for learning purposes. In addition, `Student_ID` and `Name` are identifiers, `Enrollment_Date` is a raw date string, and `City` is a high-cardinality categorical variable. These columns are unlikely to carry generalisable predictive signal, so they are removed to reduce the risk of overfitting.

In [None]:
# Basic statistics
print("Dataset summary statistics:")
df.describe()

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

In [None]:
# Target variable distribution
print("Target variable distribution:")
print(df['Completed'].value_counts())
print(f"\nPercentage:")
print(df['Completed'].value_counts(normalize=True).round(4) * 100)

fig, ax = plt.subplots(figsize=(6, 4))
df['Completed'].value_counts().plot(kind='bar', color=['#e74c3c', '#2ecc71'], ax=ax)
ax.set_title('Distribution of Course Completion')
ax.set_xlabel('Completion Status')
ax.set_ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.savefig('target_distribution.png', dpi=100, bbox_inches='tight')
plt.show()
print("The target classes are nearly balanced, so class imbalance is NOT a constraint here.")

In [None]:
# Distribution of key numerical features
numerical_cols = ['Age', 'Login_Frequency', 'Average_Session_Duration_Min',
                  'Video_Completion_Rate', 'Quiz_Score_Avg', 'Progress_Percentage',
                  'Assignments_Submitted', 'Assignments_Missed', 'Satisfaction_Rating']

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
for i, col in enumerate(numerical_cols):
    ax = axes[i // 3, i % 3]
    df[col].hist(bins=30, ax=ax, color='steelblue', edgecolor='black', alpha=0.7)
    ax.set_title(col)
plt.suptitle('Distribution of Key Numerical Features', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('numerical_distributions.png', dpi=100, bbox_inches='tight')
plt.show()

In [None]:
# Correlation heatmap of numerical features
numeric_df = df.select_dtypes(include=[np.number])
fig, ax = plt.subplots(figsize=(14, 10))
sns.heatmap(numeric_df.corr(), annot=False, cmap='coolwarm', center=0, ax=ax)
ax.set_title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=100, bbox_inches='tight')
plt.show()

In [None]:
# Boxplots of key features by completion status
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
box_features = ['Video_Completion_Rate', 'Quiz_Score_Avg', 'Progress_Percentage',
                'Login_Frequency', 'Assignments_Submitted', 'Satisfaction_Rating']
for i, col in enumerate(box_features):
    ax = axes[i // 3, i % 3]
    df.boxplot(column=col, by='Completed', ax=ax)
    ax.set_title(col)
    ax.set_xlabel('')
plt.suptitle('Feature Distributions by Completion Status', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('boxplots_by_completion.png', dpi=100, bbox_inches='tight')
plt.show()

print("Key observation: Progress_Percentage and Video_Completion_Rate show clear separation between completed and not completed students.")

In [None]:
# Categorical feature distributions
cat_features = ['Gender', 'Education_Level', 'Employment_Status', 'Device_Type',
                'Internet_Connection_Quality', 'Course_Level', 'Category']

fig, axes = plt.subplots(2, 4, figsize=(18, 8))
axes = axes.flatten()
for i, col in enumerate(cat_features):
    ct = pd.crosstab(df[col], df['Completed'], normalize='index') * 100
    ct.plot(kind='bar', ax=axes[i], stacked=True, color=['#e74c3c', '#2ecc71'])
    axes[i].set_title(col)
    axes[i].set_ylabel('Percentage')
    axes[i].legend(title='', fontsize=8)
    axes[i].tick_params(axis='x', rotation=45)
if len(cat_features) < len(axes):
    axes[-1].set_visible(False)
plt.suptitle('Completion Rate by Categorical Features', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('categorical_completion_rates.png', dpi=100, bbox_inches='tight')
plt.show()

print("Observation: Completion rates are relatively uniform across most categorical features,")
print("suggesting that behavioural/engagement features may be more predictive than demographics.")

### EDA Summary & Dataset Constraint Discussion

**Key findings from EDA:**
1. The dataset has 100,000 rows and 40 columns with **no missing values** — it is pre-cleaned.
2. The target variable is **nearly balanced** (~49% Completed vs ~51% Not Completed).
3. `Progress_Percentage`, `Video_Completion_Rate`, and `Quiz_Score_Avg` show the strongest visual separation between completed and not-completed students.
4. Most categorical features (Gender, Education_Level, etc.) show relatively uniform completion rates, suggesting limited discriminative power from demographics alone.

**Dataset-Specific Constraint (referenced throughout):**
The dataset contains **identifier and high-cardinality columns** (`Student_ID`, `Name`, `City`) and **date strings** (`Enrollment_Date`) that could cause overfitting if included as features. Additionally, the data is **entirely pre-cleaned**, which, while convenient, means we must **artificially introduce data quality issues** to practise real-world data preprocessing skills. We address this in the next section.
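
One way to spot identifier-like columns before modelling is to compare each column's unique-value count with the row count. The sketch below runs on a tiny made-up frame; the helper name `flag_identifier_like` and the 0.9 threshold are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def flag_identifier_like(frame: pd.DataFrame, ratio_threshold: float = 0.9) -> list:
    """Return columns whose unique-value ratio meets or exceeds the threshold.

    A ratio near 1.0 means almost every row has its own value, which is
    typical of IDs and names rather than of useful predictors.
    """
    ratios = frame.nunique() / len(frame)
    return ratios[ratios >= ratio_threshold].index.tolist()

# Hypothetical toy data, not taken from the course dataset
toy = pd.DataFrame({
    'Student_ID': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'],
    'Name': ['Ann', 'Ben', 'Cat', 'Dan', 'Eve', 'Fay'],
    'Gender': ['F', 'M', 'F', 'M', 'F', 'F'],
    'Course_Level': ['Beginner', 'Advanced', 'Beginner', 'Beginner', 'Advanced', 'Beginner'],
})

identifier_cols = flag_identifier_like(toy)
print(identifier_cols)
```

Continuous numeric columns can also score high on this ratio, so the result is best treated as a list of candidates to review rather than columns to drop automatically.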

# 4. Data cleaning and preparation <a id='data_cleaning'></a>
[Back to top](#table_of_contents)

Since the dataset is pre-cleaned (no missing values), we will **introduce dirty data for learning purposes** as required by the assignment, then clean it. We also perform feature engineering and encoding.

> **Decision Point 1 — Feature Encoding Strategy:**
> - **Alternative considered:** One-Hot Encoding for all categorical features. This would create a very wide feature matrix (e.g., `City` alone has 15+ unique values), increasing dimensionality and training time without meaningful predictive benefit for tree-based models.
> - **Final choice:** Ordinal (integer) mapping for ordered features (`Education_Level`, `Course_Level`, `Internet_Connection_Quality`) and One-Hot Encoding only for low-cardinality nominal features (`Gender`, `Employment_Status`, `Device_Type`, `Category`, `Payment_Mode`, `Fee_Paid`, `Discount_Used`). High-cardinality columns (`City`, `Course_Name`, `Course_ID`) are dropped.
> - **Justification:** This hybrid approach keeps dimensionality manageable, respects ordinal relationships, and avoids the curse of dimensionality from one-hot encoding high-cardinality features. The dataset constraint of having **identifier-like columns** (`Student_ID`, `Name`) and **high-cardinality categoricals** (such as `City`) directly influenced this decision.

In [None]:
# --- Step 1: Introduce dirty data for learning purposes ---
df_dirty = df.copy()

# Introduce ~2% missing values in selected columns
np.random.seed(42)
for col in ['Age', 'Video_Completion_Rate', 'Quiz_Score_Avg', 'Satisfaction_Rating']:
    mask = np.random.random(len(df_dirty)) < 0.02
    df_dirty.loc[mask, col] = np.nan

# Introduce ~500 duplicate rows
dup_indices = np.random.choice(df_dirty.index, size=500, replace=False)
duplicates = df_dirty.loc[dup_indices].copy()
df_dirty = pd.concat([df_dirty, duplicates], ignore_index=True)

print(f"Dirty dataset shape: {df_dirty.shape}")
print(f"\nMissing values introduced:")
print(df_dirty.isnull().sum()[df_dirty.isnull().sum() > 0])
print(f"\nDuplicate rows: {df_dirty.duplicated().sum()}")

In [None]:
# --- Step 2: Clean the dirty data ---

# Remove duplicates
df_clean = df_dirty.drop_duplicates().reset_index(drop=True)
print(f"After removing duplicates: {df_clean.shape}")

# Fill missing values with median (numerical)
for col in ['Age', 'Video_Completion_Rate', 'Quiz_Score_Avg', 'Satisfaction_Rating']:
    median_val = df_clean[col].median()
    df_clean[col] = df_clean[col].fillna(median_val)
    print(f"Filled {col} missing values with median: {median_val}")

print(f"\nRemaining missing values: {df_clean.isnull().sum().sum()}")
print(f"Clean dataset shape: {df_clean.shape}")
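
The cleaning cell above computes each median over the whole dataset, before the train/test split. A stricter variant learns the median from the training rows only and reuses it for the test rows. The following is a minimal sketch of that idea using scikit-learn's `SimpleImputer` on a made-up `Age` column; the toy values and the 70/30 split are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy column with two missing entries
toy = pd.DataFrame({'Age': [20.0, 22.0, np.nan, 30.0, 35.0, np.nan, 40.0, 45.0, 50.0, 55.0]})
train, test = train_test_split(toy, test_size=0.3, random_state=0)

imputer = SimpleImputer(strategy='median')
train_filled = imputer.fit_transform(train)   # median learned from training rows only
test_filled = imputer.transform(test)         # same median reused for the test rows

print(f"Median learned from training rows: {imputer.statistics_[0]}")
```

With only about 2% of values missing, the difference from the whole-dataset median is likely to be small here, but the pattern matters more when missingness is heavier.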

In [None]:
# --- Step 3: Drop identifier and non-predictive columns ---

# These columns are identifiers or have too high cardinality to be useful
drop_cols = ['Student_ID', 'Name', 'Enrollment_Date', 'City', 'Course_ID', 'Course_Name']
df_clean = df_clean.drop(columns=drop_cols)
print(f"Dropped columns: {drop_cols}")
print(f"Remaining columns: {df_clean.shape[1]}")
print(f"Columns: {list(df_clean.columns)}")

In [None]:
# --- Step 4: Encode the target variable ---
df_clean['Completed'] = df_clean['Completed'].map({'Completed': 1, 'Not Completed': 0})
print("Target encoding: Completed=1, Not Completed=0")
print(df_clean['Completed'].value_counts())

In [None]:
# --- Step 5: Encode categorical features ---

# Ordinal encoding for features with natural order
ordinal_maps = {
    'Education_Level': {'HighSchool': 0, 'Diploma': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4},
    'Course_Level': {'Beginner': 0, 'Intermediate': 1, 'Advanced': 2},
    'Internet_Connection_Quality': {'Low': 0, 'Medium': 1, 'High': 2}
}

for col, mapping in ordinal_maps.items():
    df_clean[col] = df_clean[col].map(mapping)
    print(f"Ordinal encoded {col}: {mapping}")

# One-hot encoding for nominal features
nominal_cols = ['Gender', 'Employment_Status', 'Device_Type', 'Category', 'Payment_Mode', 'Fee_Paid', 'Discount_Used']
df_clean = pd.get_dummies(df_clean, columns=nominal_cols, drop_first=True, dtype=int)

print(f"\nFinal dataset shape after encoding: {df_clean.shape}")
print(f"\nFeature columns: {list(df_clean.columns)}")
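
To make the dimensionality argument from Decision Point 1 concrete, the sketch below compares the column count of "one-hot everything" against the hybrid approach. It uses a made-up six-row frame with a six-value `City` column, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame: one ordinal, one low-cardinality nominal, one high-cardinality nominal
toy = pd.DataFrame({
    'Course_Level': ['Beginner', 'Advanced', 'Intermediate', 'Beginner', 'Advanced', 'Beginner'],
    'Gender': ['F', 'M', 'F', 'M', 'F', 'M'],
    'City': ['A', 'B', 'C', 'D', 'E', 'F'],
})

# Option 1: one-hot encode every column
one_hot_all = pd.get_dummies(toy, drop_first=True, dtype=int)

# Option 2: hybrid approach (ordinal map, one-hot for Gender, drop City)
hybrid = toy.drop(columns=['City']).copy()
hybrid['Course_Level'] = hybrid['Course_Level'].map(
    {'Beginner': 0, 'Intermediate': 1, 'Advanced': 2})
hybrid = pd.get_dummies(hybrid, columns=['Gender'], drop_first=True, dtype=int)

print(f"One-hot everything: {one_hot_all.shape[1]} columns")
print(f"Hybrid encoding:    {hybrid.shape[1]} columns")
```

The gap widens with every additional high-cardinality column, which is the effect the hybrid strategy is meant to avoid.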

In [None]:
# --- Step 6: Feature Engineering ---

# Create engagement ratio: assignments submitted vs total assignments
df_clean['Assignment_Completion_Rate'] = df_clean['Assignments_Submitted'] / (
    df_clean['Assignments_Submitted'] + df_clean['Assignments_Missed'] + 1e-9)

# Create a combined quiz performance metric
df_clean['Quiz_Performance'] = df_clean['Quiz_Score_Avg'] * df_clean['Quiz_Attempts']

print("Engineered features:")
print("  - Assignment_Completion_Rate: ratio of submitted to total assignments")
print("  - Quiz_Performance: quiz score weighted by number of attempts")
print(f"\nFinal dataset shape: {df_clean.shape}")
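
A short worked example may help show how `Assignment_Completion_Rate` behaves at the edges. The three rows below are made up; the formula is the same as in the cell above.

```python
import pandas as pd

# Hypothetical rows: a typical student, one who missed everything, one with no assignments
toy = pd.DataFrame({
    'Assignments_Submitted': [8, 0, 0],
    'Assignments_Missed':    [2, 5, 0],
})

# Same formula as the cell above: the small constant avoids division by zero
toy['Assignment_Completion_Rate'] = toy['Assignments_Submitted'] / (
    toy['Assignments_Submitted'] + toy['Assignments_Missed'] + 1e-9)

print(toy.round(3))
```

Note that a student with no assignments at all receives the same rate (0) as a student who missed every assignment. Whether those two cases should be treated alike is a modelling choice.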

In [None]:
# --- Step 7: Prepare features and target, split data ---
X = df_clean.drop('Completed', axis=1)
y = df_clean['Completed']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTraining target distribution:")
print(y_train.value_counts(normalize=True).round(4))

In [None]:
# --- Step 8: Scale features ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features scaled using StandardScaler.")
print(f"Scaled training set shape: {X_train_scaled.shape}")
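
When scaled features are later used inside cross-validation, one common pattern is to wrap the scaler and the model in a pipeline so the scaler is refitted on each fold's training portion. The sketch below shows this on synthetic data from `make_classification`; it is an illustration under those assumptions, not the notebook's own workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook would pass X_train and y_train instead
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# The scaler is refitted on the training part of every fold, so validation
# rows never influence the scaling statistics
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(scaled_lr, X_demo, y_demo, cv=5, scoring='accuracy')

print(f"Mean CV accuracy: {scores.mean():.4f}")
```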

# 5. Model training <a id='model_training'></a>
[Back to top](#table_of_contents)

We train three classification models:
1. **Logistic Regression** — A simple linear baseline.
2. **Random Forest** — An ensemble of decision trees (handles non-linear relationships well).
3. **Gradient Boosting** — A sequential ensemble method (often achieves higher accuracy).

> **Decision Point 2 — Model Selection:**
> - **Alternative considered:** Support Vector Machine (SVM). SVM can achieve strong classification performance, especially with kernel tricks. However, kernel SVMs scale poorly with large datasets: training time grows at least quadratically with the number of samples, making them impractical for our **100,000-row dataset** without significant subsampling, which would reduce representativeness.
> - **Final choice:** Logistic Regression, Random Forest, and Gradient Boosting.
> - **Justification:** Logistic Regression provides an interpretable linear baseline. Random Forest and Gradient Boosting are both scalable ensemble methods that handle mixed feature types well and train efficiently on large datasets. The **dataset-specific constraint** of having 100,000 rows makes SVM computationally expensive, so tree-based ensembles are a better fit. Additionally, the nearly balanced class distribution means we do not need specialised techniques like SMOTE or class weighting.
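
To give a rough sense of the quadratic growth behind Decision Point 2, the back-of-envelope calculation below counts the pairwise kernel entries for the training split. It assumes float64 storage and a dense matrix. Practical solvers (for example scikit-learn's `SVC`, which exposes a `cache_size` parameter) hold only part of this in memory, so the figure illustrates scaling rather than actual memory use.

```python
# Back-of-envelope estimate; assumes float64 entries and an 80/20 train/test split
n_rows = 100_000
n_train = n_rows * 8 // 10           # 80,000 training rows
bytes_per_float = 8

kernel_entries = n_train ** 2        # one entry per pair of training rows
kernel_gigabytes = kernel_entries * bytes_per_float / 1e9

print(f"Pairwise kernel entries: {kernel_entries:,}")
print(f"Dense float64 storage:   {kernel_gigabytes:.1f} GB")
```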

In [None]:
# Model 1: Logistic Regression
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)
lr_prob = lr_model.predict_proba(X_test_scaled)[:, 1]

print("Logistic Regression trained.")
print(f"Training accuracy: {lr_model.score(X_train_scaled, y_train):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, lr_pred):.4f}")

In [None]:
# Model 2: Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_prob = rf_model.predict_proba(X_test)[:, 1]

print("Random Forest trained.")
print(f"Training accuracy: {rf_model.score(X_train, y_train):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, rf_pred):.4f}")

In [None]:
# Model 3: Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
gb_model.fit(X_train, y_train)
gb_pred = gb_model.predict(X_test)
gb_prob = gb_model.predict_proba(X_test)[:, 1]

print("Gradient Boosting trained.")
print(f"Training accuracy: {gb_model.score(X_train, y_train):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, gb_pred):.4f}")

# 6. Model comparison <a id='model_comparsion'></a>
[Back to top](#table_of_contents)

We compare all three models using Accuracy, Precision, Recall, F1-Score, and ROC-AUC.

In [None]:
# Classification reports
models = {
    'Logistic Regression': (lr_pred, lr_prob),
    'Random Forest': (rf_pred, rf_prob),
    'Gradient Boosting': (gb_pred, gb_prob)
}

for name, (pred, prob) in models.items():
    print(f"\n{'='*50}")
    print(f"{name}")
    print('='*50)
    print(classification_report(y_test, pred, target_names=['Not Completed', 'Completed']))
    print(f"ROC-AUC: {roc_auc_score(y_test, prob):.4f}")

In [None]:
# Comparison table
results = []
for name, (pred, prob) in models.items():
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, pred),
        'Precision': precision_score(y_test, pred),
        'Recall': recall_score(y_test, pred),
        'F1-Score': f1_score(y_test, pred),
        'ROC-AUC': roc_auc_score(y_test, prob)
    })

results_df = pd.DataFrame(results)
print("\nModel Comparison Summary:")
print(results_df.to_string(index=False))

In [None]:
# Confusion matrices
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, (name, (pred, _)) in enumerate(models.items()):
    cm = confusion_matrix(y_test, pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[i],
                xticklabels=['Not Completed', 'Completed'],
                yticklabels=['Not Completed', 'Completed'])
    axes[i].set_title(name)
    axes[i].set_ylabel('Actual')
    axes[i].set_xlabel('Predicted')
plt.suptitle('Confusion Matrices', fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('confusion_matrices.png', dpi=100, bbox_inches='tight')
plt.show()

In [None]:
# ROC curves
fig, ax = plt.subplots(figsize=(8, 6))
for name, (_, prob) in models.items():
    fpr, tpr, _ = roc_curve(y_test, prob)
    auc = roc_auc_score(y_test, prob)
    ax.plot(fpr, tpr, label=f'{name} (AUC={auc:.4f})')

ax.plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.5)')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curves')
ax.legend()
plt.tight_layout()
plt.savefig('roc_curves.png', dpi=100, bbox_inches='tight')
plt.show()

In [None]:
# Feature importance (Random Forest)
feature_importance = pd.Series(rf_model.feature_importances_, index=X.columns)
top_features = feature_importance.nlargest(15)

fig, ax = plt.subplots(figsize=(10, 6))
top_features.sort_values().plot(kind='barh', ax=ax, color='steelblue')
ax.set_title('Top 15 Feature Importances (Random Forest)')
ax.set_xlabel('Importance')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=100, bbox_inches='tight')
plt.show()

print("Top 5 most important features:")
for feat, imp in top_features.head(5).items():
    print(f"  {feat}: {imp:.4f}")

### Model Comparison Summary

Based on the comparison above, we select the best-performing model for hyperparameter tuning in the next section. The comparison considers all metrics — Accuracy, Precision, Recall, F1, and AUC — with particular attention to F1-Score as a balanced metric.

**Dataset constraint reference:** Since the target classes are nearly balanced, accuracy is a reliable metric here. If the classes were imbalanced, we would need to rely more heavily on Precision, Recall, and F1-Score to avoid being misled by accuracy alone.
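
The point about imbalance can be illustrated with a small made-up example: a "model" that always predicts the majority class on a hypothetical 95/5 split scores high on accuracy while never finding a positive case.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")  # high, despite finding no positives
print(f"F1-Score: {f1:.2f}")
```

With the roughly 49/51 split in this dataset, a majority-class predictor would reach only about 51% accuracy, which is why accuracy remains informative here.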

# 7. Tuning <a id='tuning'></a>

[Back to top](#table_of_contents)

We perform hyperparameter tuning on the Gradient Boosting model using GridSearchCV. Gradient Boosting is chosen because boosted tree ensembles often achieve the strongest performance on tabular data; this choice should be checked against the comparison table in Section 6.

In [None]:
# Hyperparameter tuning for Gradient Boosting using GridSearchCV
# Use a stratified subsample for tuning to keep computation tractable
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.85, random_state=42)
tune_idx, _ = next(sss.split(X_train, y_train))
X_tune, y_tune = X_train.iloc[tune_idx], y_train.iloc[tune_idx]
print(f"Tuning subsample size: {X_tune.shape[0]} (15% of training data)")

param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.2]
}

grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='f1',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_tune, y_tune)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV F1-Score: {grid_search.best_score_:.4f}")
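
The cost of the grid search above can be estimated before running it. The sketch below counts the candidate combinations with scikit-learn's `ParameterGrid`, using the same grid and fold count as the tuning cell. The extra fit accounts for `GridSearchCV`'s default behaviour of refitting the best candidate on all tuning rows.

```python
from sklearn.model_selection import ParameterGrid

# Same grid and fold count as the tuning cell above
param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.2],
}
cv_folds = 3

n_candidates = len(ParameterGrid(param_grid))   # 2 * 2 * 2 combinations
n_fits = n_candidates * cv_folds + 1            # +1 for the final refit on all tuning rows

print(f"Candidates: {n_candidates}, total model fits: {n_fits}")
```

Adding a third value to each hyperparameter would raise the candidate count from 8 to 27, which is one reason the tuning step works on a subsample.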

In [None]:
# Retrain best model on full training data with tuned hyperparameters
best_params = grid_search.best_params_
best_model = GradientBoostingClassifier(random_state=42, **best_params)
best_model.fit(X_train, y_train)

tuned_pred = best_model.predict(X_test)
tuned_prob = best_model.predict_proba(X_test)[:, 1]

print("Tuned Gradient Boosting - Test Set Performance:")
print(classification_report(y_test, tuned_pred, target_names=['Not Completed', 'Completed']))
print(f"ROC-AUC: {roc_auc_score(y_test, tuned_prob):.4f}")

# Compare before and after tuning
print(f"\nBefore tuning - Accuracy: {accuracy_score(y_test, gb_pred):.4f}, F1: {f1_score(y_test, gb_pred):.4f}")
print(f"After tuning  - Accuracy: {accuracy_score(y_test, tuned_pred):.4f}, F1: {f1_score(y_test, tuned_pred):.4f}")

# 8. Validation <a id='validation'></a>

[Back to top](#table_of_contents)

We apply Stratified K-Fold Cross-Validation to assess model generalisation. Stratified K-Fold ensures each fold preserves the class distribution, which is important for reliable performance estimates.
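
As a quick illustration of what "preserves the class distribution" means, the sketch below splits a made-up label vector (40% class 0, 60% class 1) and prints the share of class 1 in each validation fold. The labels and sizes are assumptions chosen so the proportions divide evenly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 40 of class 0 and 60 of class 1
y_demo = np.array([0] * 40 + [1] * 60)
X_demo = np.zeros((len(y_demo), 1))   # features are irrelevant for the split itself

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_positive_rates = []
for fold, (_, val_idx) in enumerate(skf_demo.split(X_demo, y_demo), start=1):
    rate = y_demo[val_idx].mean()
    fold_positive_rates.append(rate)
    print(f"Fold {fold}: {len(val_idx)} rows, share of class 1 = {rate:.2f}")
```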

In [None]:
# Stratified K-Fold Cross-Validation on the tuned model
# Use a representative subsample for cross-validation to keep computation tractable
sss_cv = StratifiedShuffleSplit(n_splits=1, test_size=0.8, random_state=42)
cv_idx, _ = next(sss_cv.split(X, y))
X_cv, y_cv = X.iloc[cv_idx], y.iloc[cv_idx]
print(f"Cross-validation subsample size: {X_cv.shape[0]}")

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score all four metrics in a single pass so each fold is fitted only once
from sklearn.model_selection import cross_validate

cv_results = cross_validate(best_model, X_cv, y_cv, cv=skf,
                            scoring=['accuracy', 'f1', 'precision', 'recall'], n_jobs=-1)
cv_accuracy = cv_results['test_accuracy']
cv_f1 = cv_results['test_f1']
cv_precision = cv_results['test_precision']
cv_recall = cv_results['test_recall']

print("5-Fold Stratified Cross-Validation Results (Tuned Gradient Boosting):")
print(f"  Accuracy:  {cv_accuracy.mean():.4f} (+/- {cv_accuracy.std():.4f})")
print(f"  F1-Score:  {cv_f1.mean():.4f} (+/- {cv_f1.std():.4f})")
print(f"  Precision: {cv_precision.mean():.4f} (+/- {cv_precision.std():.4f})")
print(f"  Recall:    {cv_recall.mean():.4f} (+/- {cv_recall.std():.4f})")
print(f"\nIndividual fold accuracies: {[round(x, 4) for x in cv_accuracy]}")

In [None]:
# Visualise cross-validation results
fig, ax = plt.subplots(figsize=(8, 5))
metrics = ['Accuracy', 'F1-Score', 'Precision', 'Recall']
means = [cv_accuracy.mean(), cv_f1.mean(), cv_precision.mean(), cv_recall.mean()]
stds = [cv_accuracy.std(), cv_f1.std(), cv_precision.std(), cv_recall.std()]

bars = ax.bar(metrics, means, yerr=stds, capsize=5, color=['#3498db', '#2ecc71', '#e74c3c', '#f39c12'],
              edgecolor='black', alpha=0.8)
ax.set_ylim(0, 1)
ax.set_ylabel('Score')
ax.set_title('5-Fold Stratified Cross-Validation Results (Tuned Gradient Boosting)')

for bar, mean in zip(bars, means):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, f'{mean:.4f}',
            ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig('cv_results.png', dpi=100, bbox_inches='tight')
plt.show()

print("A low standard deviation across folds suggests that performance is stable across different splits of the data.")
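
Fold-to-fold stability alone does not rule out overfitting; comparing training scores with validation scores is a more direct check. The sketch below does this with `cross_validate(..., return_train_score=True)` on synthetic data. The data, the small model, and the variable names are stand-ins, and in the notebook the same call could be made with `best_model`, `X_cv`, and `y_cv`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data; the real notebook would use X_cv, y_cv and best_model
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(model, X_demo, y_demo, cv=skf_demo,
                         scoring='f1', return_train_score=True)

train_f1 = results['train_score'].mean()
val_f1 = results['test_score'].mean()
print(f"Mean train F1:      {train_f1:.4f}")
print(f"Mean validation F1: {val_f1:.4f}")
print(f"Gap:                {train_f1 - val_f1:.4f}")
```

A large gap between the two means would point to overfitting even when the validation scores themselves vary little between folds.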

# 9. Conclusion <a id='conclusion'></a>

[Back to top](#table_of_contents)

### Summary of Results

We built a binary classification pipeline to predict whether a student will complete an online course. Three models were trained and compared. The table below summarises each step of the pipeline:

| Step | What was done |
|------|---------------|
| EDA | Visualised distributions, correlations, and feature-target relationships |
| Data Preprocessing | Introduced and cleaned dirty data (missing values, duplicates); dropped identifier columns; encoded categorical features |
| Feature Engineering | Created `Assignment_Completion_Rate` and `Quiz_Performance` features |
| Model Training | Trained Logistic Regression, Random Forest, and Gradient Boosting |
| Model Comparison | Compared using Accuracy, Precision, Recall, F1-Score, and ROC-AUC |
| Hyperparameter Tuning | GridSearchCV on Gradient Boosting with 3-fold CV |
| Validation | 5-Fold Stratified Cross-Validation on tuned model |

### Decision Points Recap

**Decision Point 1 — Feature Encoding Strategy:**
We chose a hybrid encoding approach (ordinal for ordered features, one-hot for low-cardinality nominal features, and dropping high-cardinality identifiers) instead of one-hot encoding everything. This was driven by the dataset constraint of having identifier-like columns and high-cardinality categoricals that would inflate dimensionality without improving predictions.

**Decision Point 2 — Model Selection:**
We chose Logistic Regression, Random Forest, and Gradient Boosting over SVM. The 100,000-row dataset makes kernel SVM computationally expensive (training time grows at least quadratically with the number of rows), while the chosen models scale more gracefully with sample count and the tree-based ensembles handle mixed feature types naturally.

### Dataset-Specific Constraint

The primary constraint is that this dataset is **pre-cleaned with no missing values**, which is unrealistic for real-world data science. We addressed this by intentionally introducing dirty data to practise preprocessing skills. Additionally, the **identifier and high-cardinality columns** (`Student_ID`, `Name`, `City`) had to be excluded to reduce the risk of overfitting. The **near-balanced target distribution** (~49% vs ~51%) meant standard accuracy was a valid evaluation metric and specialised imbalance-handling techniques (SMOTE, class weighting) were unnecessary.

### Recommendation

The tuned Gradient Boosting model is the recommended model for predicting course completion, subject to the test-set and cross-validation results reported above. Key predictive features (visible from the feature importance analysis) can be used by course providers to identify at-risk students early and implement targeted interventions to improve completion rates.
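
As a rough sketch of how such early identification might work in practice, the example below converts predicted completion probabilities into an at-risk list. It runs on synthetic data, and the 0.7 risk threshold is a hypothetical cut-off that would need to be agreed with course staff.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for engagement features; label 1 means "Completed"
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X_demo[:200], y_demo[:200])

# Probability of NOT completing = 1 - P(Completed)
p_not_completed = 1 - model.predict_proba(X_demo[200:])[:, 1]

risk_threshold = 0.7   # hypothetical cut-off, not derived from the course data
at_risk = np.where(p_not_completed >= risk_threshold)[0]
print(f"{len(at_risk)} of {len(p_not_completed)} students flagged as at risk")
```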