# 1. Setup

## 1.A Summary

### <span style="color: #e74c3c;">**k-Nearest Neighbours Implementation Summary**</span>

This notebook implements **k-Nearest Neighbours (k-NN)** classification to predict student withdrawal risk using the preprocessed student dataset.

### <span style="color: #2E86AB;">**1. Algorithm Overview**</span>

**k-Nearest Neighbours** is a **non-parametric, instance-based learning algorithm** that makes predictions by:
- Finding the k closest data points to a new instance
- Using majority voting amongst these neighbours to determine classification
- Making no assumptions about the underlying data distribution

**Key characteristics:**
- **Lazy learning**: No explicit training phase - stores all data points
- **Distance-based**: Uses similarity measures (typically Euclidean distance)
- **Local decision boundaries**: Adapts to local patterns in the data
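
The steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration on made-up toy data, not the implementation used later in the notebook (which relies on scikit-learn's `KNeighborsClassifier`); the helper name `knn_predict` and the toy arrays are assumptions introduced here.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote amongst its k nearest training points."""
    # Euclidean distance from x_new to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up): two features, label 1 = continuation, 0 = withdrawn
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [3.0, 3.2], [3.1, 2.9]])
y_toy = np.array([1, 1, 1, 0, 0])

pred_near_first_cluster = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_near_second_cluster = knn_predict(X_toy, y_toy, np.array([3.0, 3.0]), k=3)
print(pred_near_first_cluster, pred_near_second_cluster)
```

There is no fitting step: "training" amounts to keeping `X_toy` and `y_toy` in memory, which is what lazy learning means in practice.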

### <span style="color: #2E86AB;">**2. Binary Classification Setup**</span>

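**Target transformation**: Combined "Graduate" and "Enrolled" into "Continuation" (1), with "Dropout" as "Withdrawn" (0). This gives a moderately imbalanced 68:32 class distribution (3,003 continuing, 1,421 withdrawn), so k-NN's majority vote leans towards the majority class and per-class metrics matter more than raw accuracy.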

**Dataset**: 4,424 students with preprocessed features ready for distance-based classification.

### <span style="color: #2E86AB;">**3. Preprocessing Requirements**</span>

**Essential for k-NN performance:**
- **Feature scaling**: StandardScaler applied to prevent features with larger ranges from dominating distance calculations
- **Categorical encoding**: Categorical features converted to numeric form (binary indicators and withdrawal-rate encoding, see Section 2)
- **Feature selection**: Remove redundant and uninformative features identified in earlier analysis
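
To illustrate why scaling matters for a distance-based model, the sketch below measures how much of the squared Euclidean distance between two rows comes from a single wide-range feature, before and after standardisation. The four rows are invented values loosely resembling `admission_grade` and `application_order`; the helper `share_from_first_feature` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 resembles a grade (wide range),
# column 1 resembles an application order (narrow range)
X_range_demo = np.array([
    [120.0, 1.0],
    [130.0, 6.0],
    [160.0, 2.0],
    [170.0, 5.0],
])


def share_from_first_feature(X):
    """Fraction of the squared distance between rows 0 and 1 owed to column 0."""
    diff_sq = (X[0] - X[1]) ** 2
    return diff_sq[0] / diff_sq.sum()


raw_share = share_from_first_feature(X_range_demo)
scaled_share = share_from_first_feature(StandardScaler().fit_transform(X_range_demo))
print(f"raw: {raw_share:.3f}, scaled: {scaled_share:.3f}")
```

On the raw values the wide-range column accounts for most of the distance; after standardisation the comparison reflects how unusual each difference is relative to that feature's own spread.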

### <span style="color: #2E86AB;">**4. Model Configuration**</span>

**Hyperparameter tuning** focuses on:
- **k value**: Number of neighbours to consider (typically odd numbers to avoid ties)
- **Distance metric**: Euclidean distance for continuous features
- **Weighting scheme**: Uniform vs distance-weighted voting
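
One possible way to search these hyperparameters is a cross-validated grid search over a scaling-plus-k-NN pipeline. The sketch below runs on synthetic data from `make_classification` (not the student dataset), and the grid values and the macro-F1 scoring choice are assumptions rather than settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a roughly 32:68 class split
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.32, 0.68], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# Scaling sits inside the pipeline so each CV fold fits the scaler
# on its own training portion only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(metric="euclidean")),
])
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11, 15, 21],  # odd values to avoid ties
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)

# score() uses the same scorer as the search, i.e. macro F1
test_f1 = search.score(X_test, y_test)
print(search.best_params_)
print(f"held-out macro F1: {test_f1:.3f}")
```

Macro F1 is used here because it weights both classes equally, which can be more informative than accuracy on a 68:32 split.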

### <span style="color: #e74c3c;">**Expected Outcomes**</span>

This implementation will evaluate k-NN's effectiveness for student dropout prediction, comparing performance against logistic regression whilst addressing the algorithm's sensitivity to feature scaling and dimensionality.

## 1.B Libraries Import

In [355]:
from tools import Tools
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

## 1.C Invoke Classes

In [356]:
tools = Tools()

## 1.D Load Configuration

In [357]:
config = tools.load_toml_file("config.toml")
tools.print_message('success', 'Loaded configuration', format_dict={'number of keys': len(config)})

## 1.E Load the dataset

In [358]:
# Open dataset
# Realinho, V., Martins, M.V., Machado, J. and Baptista, L.M.T., 2021. Predict Students' Dropout and Academic Success. UCI Machine Learning Repository. Available at: https://doi.org/10.24432/C5MC89 [Accessed 31 May 2025].
df_dataset = tools.load_dataset(file_name='dataset_raw.csv')
df_dataset.head()

| | marital_status | application_mode | application_order | course | daytime_evening_attendance | previous_qualification | previous_qualification_grade | nationality | mothers_qualification | fathers_qualification | ... | curricular_units_2nd_sem_credited | curricular_units_2nd_sem_enrolled | curricular_units_2nd_sem_evaluations | curricular_units_2nd_sem_approved | curricular_units_2nd_sem_grade | curricular_units_2nd_sem_without_evaluations | unemployment_rate | inflation_rate | gdp | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | 5 | 171 | 1 | 1 | 122.0 | 1 | 19 | 12 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 1 | 1 | 15 | 1 | 9254 | 1 | 1 | 160.0 | 1 | 1 | 3 | ... | 0 | 6 | 6 | 6 | 13.666667 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
| 2 | 1 | 1 | 5 | 9070 | 1 | 1 | 122.0 | 1 | 37 | 37 | ... | 0 | 6 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 3 | 1 | 17 | 2 | 9773 | 1 | 1 | 122.0 | 1 | 38 | 37 | ... | 0 | 6 | 10 | 5 | 12.4 | 0 | 9.4 | -0.8 | -3.12 | Graduate |
| 4 | 2 | 39 | 1 | 8014 | 0 | 1 | 100.0 | 1 | 37 | 38 | ... | 0 | 6 | 6 | 6 | 13.0 | 0 | 13.9 | -0.3 | 0.79 | Graduate |


## 1.F Apply Target Binary Transformation

In [359]:
# Add a binary target column: 0 = Dropout (withdrawn), 1 = Graduate or Enrolled (continuation)
df_dataset['target_binary'] = df_dataset['target'].map({'Dropout': 0, 'Graduate': 1, 'Enrolled': 1})
df_dataset['target_binary'].value_counts()

target_binary
1    3003
0    1421
Name: count, dtype: int64

## 1.G Data Shape Check

In [360]:
shape = df_dataset.shape
tools.print_message('success', 'Dataset loaded', format_dict={'rows': shape[0], 'columns': shape[1]})

# 2. Feature Selection

## 2.A Summary

### <span style="color: #e74c3c;">**Feature Selection for k-Nearest Neighbours**</span>

This analysis reduced the dataset from 36 original features to 10 carefully selected features optimised for k-NN performance. The selection process addressed key challenges including data leakage, multicollinearity, and the curse of dimensionality.

### <span style="color: #2E86AB;">**1. Data Leakage Prevention**</span>

**Data leakage** occurs when we accidentally include information that wouldn't be available when making real predictions. In our student withdrawal dataset, second semester data creates severe leakage issues.

**The Problem:**
- Students who withdraw during first semester have **zero values** for all second semester metrics
- These zeros perfectly identify withdrawn students - but only **after** withdrawal has occurred
- Using 2nd semester features gives artificially high accuracy but useless real-world predictions

**Solution Applied:**
- **Removed ALL second semester features**: grades, credited units, enrolled units, approved units, evaluations
- **Kept first semester features**: These represent genuine early warning indicators available during the semester
- **Focus on early intervention**: The model can now predict withdrawals using information available **before** students drop out

### <span style="color: #2E86AB;">**2. Multicollinearity Resolution**</span>

**Multicollinearity** occurs when features provide nearly identical information, measured here by the **Variance Inflation Factor (VIF)**. Highly collinear features harm k-NN performance by:
- Creating redundant dimensions in distance calculations
- Counting the same underlying information several times, so it outweighs the other features
- Amplifying noise and reducing accuracy

**Problematic Features Removed:**
- `curricular_units_1st_sem_enrolled` (VIF: 23.49)
- `curricular_units_1st_sem_credited` (VIF: 15.57)  
- `curricular_units_1st_sem_approved` (VIF: 12.63)

**Kept:** `curricular_units_1st_sem_grade` (VIF: 4.98) - represents actual academic performance without multicollinearity issues.
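
For reference, the VIF of feature *j* is `1 / (1 - R_j^2)`, where `R_j^2` comes from regressing that feature on all the others. The sketch below computes it with scikit-learn on synthetic columns; it is an illustration only, the helper name `compute_vif` is hypothetical, and the VIF values quoted above were produced in earlier analysis, possibly with a different tool or convention.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def compute_vif(df):
    """Return VIF_j = 1 / (1 - R_j^2) for every column of a numeric DataFrame."""
    vifs = {}
    for col in df.columns:
        others = df.drop(columns=[col])
        # R^2 from regressing this column on all remaining columns
        r_squared = LinearRegression().fit(others, df[col]).score(others, df[col])
        vifs[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")


# Synthetic example: 'approved' is almost a copy of 'enrolled'; 'grade' is independent
rng = np.random.default_rng(0)
enrolled = rng.normal(6, 2, size=500)
df_vif_demo = pd.DataFrame({
    "enrolled": enrolled,
    "approved": enrolled + rng.normal(0, 0.3, size=500),
    "grade": rng.normal(12, 3, size=500),
})
vif_demo = compute_vif(df_vif_demo)
print(vif_demo.round(2))
```

The two near-duplicate columns receive large VIFs while the independent column stays close to 1, which mirrors the pattern that motivated dropping the enrolled, credited, and approved unit counts.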

### <span style="color: #2E86AB;">**3. High Cardinality Feature Engineering**</span>

**High cardinality features** (many categories) are a problem for k-NN because one-hot encoding them adds many sparse binary dimensions that dilute the distance calculation. Two different encoding strategies were applied:

### <span style="color: #2E86AB;">**Parental Background - Binary Grouping:**</span>
- Parents' **qualifications** had **29 categories each**
- Parents' **occupations** had **46 categories each**
- After one-hot encoding, this would create **150 new binary features** (29+29+46+46)

**Solution Applied:**
```python
# Reduced 4 high-cardinality features to 2 meaningful binary indicators:
parental_higher_education          # Combines 58 education categories
parental_professional_occupation   # Combines 92 occupation categories
```

### <span style="color: #2E86AB;">**Academic Context - Withdrawal Rate Encoding:**</span>
- `course`: 17 different programmes of study
- `application_mode`: 18 different admission routes

**Solution Applied:**
```python
# Target encoding using withdrawal rates for each category:
course_withdrawal_rate              # Each course gets its historical withdrawal rate
application_mode_withdrawal_rate    # Each application route gets its withdrawal rate
```

**Benefits of Withdrawal Rate Encoding:**
- **Reduces dimensionality**: 17 course categories → 1 continuous feature
- **Preserves predictive power**: Directly captures risk level of each category
- **Interpretable results**: Higher values = higher risk groups
- **k-NN friendly**: Creates meaningful distance measurements between similar risk levels

**Overall Benefits:**
- Captures meaningful background and context information
- Reduces dimensionality from 150+ to 4 features
- Maintains predictive power whilst eliminating noise
- Creates continuous features suitable for distance calculations

### <span style="color: #2E86AB;">**4. Uninformative Feature Removal**</span>

**Severely imbalanced features** provide little predictive value:
- `nationality`: 97.5% Portuguese students
- `educational_special_needs`: 98.9% have no special needs
- `international`: 97.5% domestic students

**Weak predictors** with minimal correlation to target:
- Economic indicators: unemployment, inflation, GDP (correlations -0.03 to 0.05)
- `previous_qualification_grade`: correlation 0.08

These features add noise without improving k-NN accuracy.
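
Two quick screening checks of this kind are the share of rows taken by a feature's most common value and its correlation with the binary target. The sketch below shows both on synthetic columns; the column names `rare_flag` and `informative` are invented, and the percentages and correlations quoted above come from the earlier analysis, not from this code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
target = rng.integers(0, 2, n)
df_screen = pd.DataFrame({
    # Flag that is 1 for roughly 2% of rows, independent of the target
    "rare_flag": (rng.uniform(size=n) < 0.02).astype(int),
    # Continuous feature constructed to track the target
    "informative": target + rng.normal(0, 0.5, n),
    "target_binary": target,
})

# Share of rows taken by the feature's most common value
dominant_share = df_screen["rare_flag"].value_counts(normalize=True).max()
# Pearson correlation of every feature with the binary target
correlations = df_screen.corr()["target_binary"].drop("target_binary")
print(f"rare_flag dominant share: {dominant_share:.3f}")
print(correlations.round(2))
```

A feature whose dominant value covers almost every row barely changes the distance between most pairs of students, so it rarely alters which neighbours are selected.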

### <span style="color: #e74c3c;">**Why 10 Features is Optimal for k-NN**</span>

### <span style="color: #2E86AB;">**1. Curse of Dimensionality**</span>
**The curse of dimensionality** means that as the feature count increases, data points spread out and pairwise distances become increasingly similar, so the gap between the nearest and the farthest neighbour shrinks and similarity measurements lose meaning. With too many features:
- All students appear equally "different" from each other
- Nearest neighbours become arbitrary rather than truly similar
- Model performance degrades despite having more information
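
The effect can be seen with random points: as the number of features grows, the nearest and farthest points from a query end up at almost the same distance. This is a small sketch on uniform random data, not on the student dataset; the helper `nearest_to_farthest_ratio` is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def nearest_to_farthest_ratio(n_features, n_points=500):
    """Ratio of nearest to farthest Euclidean distance from one random query point."""
    points = rng.uniform(size=(n_points, n_features))
    query = rng.uniform(size=n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()


ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"{d:>4} features: nearest/farthest = {ratio:.2f}")
```

A ratio near 0 means the nearest neighbour is far closer than the farthest; a ratio approaching 1 means "nearest" carries little information.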

### <span style="color: #2E86AB;">**2. Distance Calculation Efficiency**</span>
k-NN calculates distances between data points using all features. With 10 features:
- **Computational efficiency**: Distance calculations remain fast
- **Feature scaling effectiveness**: Each feature has meaningful impact on similarity
- **Interpretable results**: Easy to understand why students are classified as similar

### <span style="color: #2E86AB;">**3. Signal-to-Noise Ratio**</span>
Ten carefully selected features provide:
- **Strong signal**: Each feature contributes meaningful predictive information
- **Minimal noise**: Removed redundant and weak predictors
- **Balanced representation**: Academic, financial, demographic, and family factors

### <span style="color: #e74c3c;">**Final 10 Features Selected**</span>

### <span style="color: #2E86AB;">**Financial Predictors (2 features):**</span>
- `tuition_fees_up_to_date` - strongest single predictor
- `scholarship_holder` - financial support indicator

### <span style="color: #2E86AB;">**Academic Context (3 features):**</span>
- `course_withdrawal_rate` - programme risk level
- `application_mode_withdrawal_rate` - admission route risk level
- `application_order` - preference ranking

### <span style="color: #2E86AB;">**Performance Indicators (2 features):**</span>
- `curricular_units_1st_sem_grade` - academic achievement
- `age_at_enrollment` - maturity/readiness indicator

### <span style="color: #2E86AB;">**Background Factors (2 features):**</span>
- `parental_higher_education` - family education background
- `parental_professional_occupation` - family socio-economic status

### <span style="color: #2E86AB;">**Pre-enrollment Predictor (1 feature):**</span>
- `admission_grade` - academic preparation

This feature selection enables k-NN to identify truly similar students based on meaningful characteristics whilst avoiding the pitfalls of high-dimensional data and data leakage.

## 2.B Features to Remove

In [361]:
# Severe class imbalance or zero mutual information with the target makes these features uninformative
uninformative_categorical = [
    'nationality',                    # 97.5% Portuguese - no variation
    'educational_special_needs',      # 98.9% no special needs - no variation
    'international',                  # 97.5% domestic - no variation
    'displaced',                      # Zero mutual information with target
    'daytime_evening_attendance'      # Zero mutual information with target
]

# Very weak correlation with target variable makes these unhelpful
weak_economic_features = [
    'unemployment_rate',              # -0.03 correlation - essentially no relationship
    'inflation_rate',                 # 0.02 correlation - essentially no relationship
    'gdp'                            # 0.05 correlation - essentially no relationship
]

# Data leakage - using information that only exists after the outcome has occurred
second_semester_remove = [
    'curricular_units_2nd_sem_grade',           # VIF 5.46 but still data leakage
    'curricular_units_2nd_sem_enrolled',        # VIF 16.42
    'curricular_units_2nd_sem_credited',        # VIF 12.39
    'curricular_units_2nd_sem_approved',        # VIF 10.14
    'curricular_units_2nd_sem_evaluations',     # VIF 3.33
    'curricular_units_2nd_sem_without_evaluations'  # VIF 1.57
]

# Remove HIGH VIF 1st semester features (>10) to fix multicollinearity
first_semester_high_vif_remove = [
    'curricular_units_1st_sem_enrolled',        # VIF 23.49 (WORST)
    'curricular_units_1st_sem_credited',        # VIF 15.57 
    'curricular_units_1st_sem_approved'         # VIF 12.63
]

# Features to remove for final k-NN model - keeping only top 10 predictive features
features_to_remove_final = [
    'marital_status',                           # Weaker categorical predictor
    'previous_qualification',                   # Weaker categorical predictor  
    'previous_qualification_grade',             # Weak correlation (0.08)
    'debtor',                                  # Redundant with tuition_fees_up_to_date
    'gender',                                  # Weaker categorical predictor
    'curricular_units_1st_sem_evaluations',   # Moderate but less critical
    'curricular_units_1st_sem_without_evaluations',  # Moderate but less critical

    'target'                                 # Old target variable (replaced with target_binary) - no longer needed
]

# Combine all features to drop
drop_columns = (uninformative_categorical + weak_economic_features + 
                second_semester_remove + first_semester_high_vif_remove + 
                features_to_remove_final)

df_dataset.drop(columns=drop_columns, inplace=True)

## 2.C Reduce High Cardinality Features

In [362]:
# Check if parental features still exist in dataset
parental_features = ['mothers_qualification', 'fathers_qualification', 'mothers_occupation', 'fathers_occupation']
existing_features = [f for f in parental_features if f in df_dataset.columns]
print(f"Remaining parental features: {existing_features}")

Remaining parental features: ['mothers_qualification', 'fathers_qualification', 'mothers_occupation', 'fathers_occupation']


In [363]:
# To reduce the number of categories in the parental qualification and occupation features, we will group them into broader categories.
def create_parental_higher_ed(df):
    """
    Creates binary indicator for parental higher education.
    Returns 1 if at least one parent has higher education, 0 otherwise.
    """
    higher_ed_codes = [2, 3, 4, 5, 6, 39, 40, 41, 42, 43, 44]
    
    mother_higher_ed = df['mothers_qualification'].isin(higher_ed_codes)
    father_higher_ed = df['fathers_qualification'].isin(higher_ed_codes)
    
    # At least one parent has higher education
    df['parental_higher_education'] = (mother_higher_ed | father_higher_ed).astype(int)
    df = df.drop(columns=['mothers_qualification', 'fathers_qualification'])
    
    return df

# Usage:
df_dataset = create_parental_higher_ed(df_dataset)
df_dataset.parental_higher_education.value_counts()

parental_higher_education
0    3616
1     808
Name: count, dtype: int64

In [364]:
def create_parental_professional_occupation(df):
    """
    Creates binary indicator for parental professional occupation.
    Returns 1 if at least one parent has professional/managerial role, 0 otherwise.
    """
    professional_codes = [1, 2, 3, 101, 102, 112, 114, 121, 122, 123, 124, 
                          131, 132, 134, 135]
    
    mother_professional = df['mothers_occupation'].isin(professional_codes)
    father_professional = df['fathers_occupation'].isin(professional_codes)
    
    # At least one parent has professional occupation
    df['parental_professional_occupation'] = (mother_professional | father_professional).astype(int)
    df = df.drop(columns=['mothers_occupation', 'fathers_occupation'])
    
    return df

# Usage:
df_dataset = create_parental_professional_occupation(df_dataset)
df_dataset.parental_professional_occupation.value_counts()

parental_professional_occupation
0    3270
1    1154
Name: count, dtype: int64

In [365]:
print(f"Dataset shape after parental feature engineering: {df_dataset.shape}")
print(f"Remaining features: {df_dataset.columns.tolist()}")

Dataset shape after parental feature engineering: (4424, 11)
Remaining features: ['application_mode', 'application_order', 'course', 'admission_grade', 'tuition_fees_up_to_date', 'scholarship_holder', 'age_at_enrollment', 'curricular_units_1st_sem_grade', 'target_binary', 'parental_higher_education', 'parental_professional_occupation']


In [366]:
def encode_categorical_withdrawal_rate(df, cat_col, target_col='target_binary'):
    """
    Replace categorical column with withdrawal rate encoding.
    
    Parameters:
    df: pandas DataFrame
    cat_col: name of categorical column to encode
    target_col: name of target column where 0=withdrawn
    
    Returns:
    pandas DataFrame with categorical column replaced by withdrawal_rate
    """
    
    df_encoded = df.copy()
    
    # Calculate withdrawal rate for each category
    withdrawal_rates = (df[target_col] == 0).groupby(df[cat_col]).mean()
    
    # Create new withdrawal rate column
    new_col_name = f'{cat_col.lower().replace(" ", "_")}_withdrawal_rate'
    df_encoded[new_col_name] = df[cat_col].map(withdrawal_rates)
    
    # Remove original column
    df_encoded = df_encoded.drop(columns=[cat_col])
    
    return df_encoded

# Usage:
df_dataset = encode_categorical_withdrawal_rate(df_dataset, 'application_mode')
df_dataset = encode_categorical_withdrawal_rate(df_dataset, 'course')
df_dataset.describe().to_clipboard()
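
Because the withdrawal rates above are computed from `target_binary` over the whole dataset, the encoded features carry some target information from rows that may later be used for testing. One way to avoid this, sketched below under the assumption that a train/test split exists, is to learn the rates on the training rows and only map them onto other rows. The helper names `fit_withdrawal_rates` and `apply_withdrawal_rates` and the tiny demo frames are hypothetical.

```python
import pandas as pd


def fit_withdrawal_rates(train_df, cat_col, target_col="target_binary"):
    """Learn per-category withdrawal rates from training rows only."""
    withdrawn = train_df[target_col] == 0
    rates = withdrawn.groupby(train_df[cat_col]).mean()
    global_rate = withdrawn.mean()
    return rates, global_rate


def apply_withdrawal_rates(df, cat_col, rates, global_rate):
    """Map categories to learned rates; unseen categories get the global rate."""
    encoded = df[cat_col].map(rates).fillna(global_rate)
    return df.drop(columns=[cat_col]).assign(**{f"{cat_col}_withdrawal_rate": encoded})


# Tiny made-up frames standing in for a train/test split
train_demo = pd.DataFrame({"course": [171, 171, 9254, 9254], "target_binary": [0, 1, 1, 1]})
test_demo = pd.DataFrame({"course": [171, 9999], "target_binary": [1, 0]})

rates, global_rate = fit_withdrawal_rates(train_demo, "course")
test_encoded = apply_withdrawal_rates(test_demo, "course", rates, global_rate)
print(test_encoded)
```

Falling back to the global rate for unseen categories is one simple choice; smoothing rare categories towards the global rate is a common alternative.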

# 3. Scaling

In [367]:
from sklearn.preprocessing import StandardScaler

# Separate the features from the binary target
X = df_dataset.drop(columns=['target_binary'])
y = df_dataset['target_binary']

# Features that need scaling (different ranges); the binary indicators and
# withdrawal-rate features already lie in [0, 1]
features_to_scale = [
    'application_order', 
    'admission_grade', 
    'age_at_enrollment',
    'curricular_units_1st_sem_grade'
]

# Create and fit scaler
# Note: when evaluating a model, fit the scaler on the training split only
scaler = StandardScaler()
X_scaled = X.copy()

# Scale only the features that need it
X_scaled[features_to_scale] = scaler.fit_transform(X[features_to_scale])

# Check the results
print("Before scaling - ranges:")
print(X[features_to_scale].describe())
print("\nAfter scaling - should be mean≈0, std≈1:")
print(X_scaled[features_to_scale].describe())
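
A possible next step is to split the data first and let a pipeline fit the scaler on the training rows only, scaling just the continuous columns. The sketch below uses a synthetic stand-in frame with random values and a few column names borrowed from this notebook; `df_demo`, the chosen columns, and `n_neighbors=5` are assumptions, not results or settings from the analysis.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df_dataset (random values, so no real signal)
rng = np.random.default_rng(42)
n = 400
df_demo = pd.DataFrame({
    "admission_grade": rng.uniform(95, 190, n),
    "age_at_enrollment": rng.integers(17, 60, n),
    "tuition_fees_up_to_date": rng.integers(0, 2, n),
    "target_binary": rng.integers(0, 2, n),
})
X_all = df_demo.drop(columns=["target_binary"])
y_all = df_demo["target_binary"]

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42
)

# Scale the continuous columns only; binary columns pass through unchanged
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["admission_grade", "age_at_enrollment"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
model.fit(X_train, y_train)  # scaler statistics come from X_train only

predictions = model.predict(X_test)
print(predictions[:10])
```

Keeping the scaler inside the pipeline means the same object can be passed to cross-validation without the test folds influencing the scaling statistics.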