# 1. Setup

## 1.A Summary

### <span style="color: #e74c3c;">**k-Nearest Neighbours Implementation Summary**</span>

This notebook implements **k-Nearest Neighbours (k-NN)** classification to predict student withdrawal risk using the preprocessed student dataset.

### <span style="color: #2E86AB;">**1. Algorithm Overview**</span>

**k-Nearest Neighbours** is a **non-parametric, instance-based learning algorithm** that makes predictions by:
- Finding the k closest data points to a new instance
- Using majority voting amongst these neighbours to determine classification
- Making no assumptions about the underlying data distribution

**Key characteristics:**
- **Lazy learning**: No explicit training phase - stores all data points
- **Distance-based**: Uses similarity measures (typically Euclidean distance)
- **Local decision boundaries**: Adapts to local patterns in the data
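
The voting procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used later in the notebook (which relies on scikit-learn's `KNeighborsClassifier`); the function name `knn_predict` and the toy coordinates are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Lazy learning": nothing is fitted; we just measure the Euclidean
    # distance from the query to every stored training point.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled feature pairs; 1 = Continuation, 0 = Withdrawn
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (-1.0, -1.0), (-1.1, -0.9)]
train_y = [1, 1, 1, 0, 0]

prediction = knn_predict(train_X, train_y, query=(0.8, 0.9), k=3)
print(prediction)
```

Because every prediction scans the full training set, the cost of this naive version grows linearly with the number of stored points.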

### <span style="color: #2E86AB;">**2. Binary Classification Setup**</span>

**Target transformation**: Combined "Graduate" and "Enrolled" into "Continuation" (1), with "Dropout" as "Withdrawn" (0). This yields a 68:32 class distribution (3,003 vs 1,421 students), a moderate imbalance that can tilt k-NN's majority voting towards the larger "Continuation" class.

**Dataset**: 4,424 students with preprocessed features ready for distance-based classification.

### <span style="color: #2E86AB;">**3. Preprocessing Requirements**</span>

**Essential for k-NN performance:**
- **Feature scaling**: StandardScaler applied to prevent features with larger ranges from dominating distance calculations
- **One-hot encoding**: Categorical features converted to binary dummy variables
- **Feature selection**: Remove redundant and uninformative features identified in earlier analysis
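
One way to combine the scaling and encoding steps is scikit-learn's `ColumnTransformer`. The sketch below uses a small hypothetical frame (the name `toy` and its values are illustrative, not rows from the dataset) and assumes that continuous and categorical columns are listed explicitly.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows that borrow column names from the dataset
toy = pd.DataFrame({
    "admission_grade": [120.0, 140.0, 160.0, 180.0],
    "age_at_enrollment": [18, 20, 25, 37],
    "course": [171, 9254, 9070, 171],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Zero mean, unit variance so no feature dominates the distance
        ("scale", StandardScaler(), ["admission_grade", "age_at_enrollment"]),
        # One binary column per category; unseen categories become all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["course"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(toy)
print(X.shape)  # 2 scaled columns + 3 one-hot columns
```

In a real workflow the transformer would normally be fitted on the training split only and then applied to the test split, so that test-set statistics do not leak into the scaling.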

### <span style="color: #2E86AB;">**4. Model Configuration**</span>

**Hyperparameter tuning** focuses on:
- **k value**: Number of neighbours to consider (odd values avoid tied votes in binary classification)
- **Distance metric**: Euclidean distance for continuous features
- **Weighting scheme**: Uniform vs distance-weighted voting
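
A grid search over these settings might look like the sketch below. It is an assumption about how the tuning could be organised, not the notebook's final configuration: the data come from `make_classification` as a stand-in, and the candidate values for k are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook would use the selected features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling sits inside the pipeline so it is re-fitted on each training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(metric="euclidean")),
])

param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],      # odd k values
    "knn__weights": ["uniform", "distance"],   # voting scheme
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Macro-averaged F1 is used here as one reasonable choice for a moderately imbalanced target; other scoring options would work with the same structure.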

### <span style="color: #e74c3c;">**Expected Outcomes**</span>

This implementation will evaluate k-NN's effectiveness for student dropout prediction, comparing performance against logistic regression whilst addressing the algorithm's sensitivity to feature scaling and dimensionality.

## 1.B Libraries Import

In [14]:
from tools import Tools
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

## 1.C Invoke Classes

In [15]:
tools = Tools()

## 1.D Load Configuration

In [16]:
config = tools.load_toml_file("config.toml")
tools.print_message('success', 'Loaded configuration', format_dict={'number of keys': len(config)})

## 1.E Load the dataset

In [17]:
# Open dataset
# Realinho, V., Martins, M.V., Machado, J. and Baptista, L.M.T., 2021. Predict Students' Dropout and Academic Success. UCI Machine Learning Repository. Available at: https://doi.org/10.24432/C5MC89 [Accessed 31 May 2025].
df_dataset = tools.load_dataset(file_name='dataset_raw.csv')
df_dataset.head()

| Unnamed: 0 | marital_status | application_mode | application_order | course | daytime_evening_attendance | previous_qualification | previous_qualification_grade | nationality | mothers_qualification | fathers_qualification | ... | curricular_units_2nd_sem_credited | curricular_units_2nd_sem_enrolled | curricular_units_2nd_sem_evaluations | curricular_units_2nd_sem_approved | curricular_units_2nd_sem_grade | curricular_units_2nd_sem_without_evaluations | unemployment_rate | inflation_rate | gdp | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | 5 | 171 | 1 | 1 | 122.0 | 1 | 19 | 12 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 1 | 1 | 15 | 1 | 9254 | 1 | 1 | 160.0 | 1 | 1 | 3 | ... | 0 | 6 | 6 | 6 | 13.666667 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
| 2 | 1 | 1 | 5 | 9070 | 1 | 1 | 122.0 | 1 | 37 | 37 | ... | 0 | 6 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 3 | 1 | 17 | 2 | 9773 | 1 | 1 | 122.0 | 1 | 38 | 37 | ... | 0 | 6 | 10 | 5 | 12.4 | 0 | 9.4 | -0.8 | -3.12 | Graduate |
| 4 | 2 | 39 | 1 | 8014 | 0 | 1 | 100.0 | 1 | 37 | 38 | ... | 0 | 6 | 6 | 6 | 13.0 | 0 | 13.9 | -0.3 | 0.79 | Graduate |


## 1.F Apply Target Binary Transformation

In [18]:
# Add a binary target column: Dropout -> 0 (Withdrawn); Graduate and Enrolled -> 1 (Continuation)
df_dataset['target_binary'] = df_dataset['target'].map({'Dropout': 0, 'Graduate': 1, 'Enrolled': 1})
df_dataset['target_binary'].value_counts()

target_binary
1    3003
0    1421
Name: count, dtype: int64

## 1.G Data Shape Check

In [19]:
shape = df_dataset.shape
tools.print_message('success', 'Dataset loaded', format_dict={'rows': shape[0], 'columns': shape[1]})

# 2. Feature Selection

## 2.A Summary

### <span style="color: #e74c3c;">**Feature Selection for k-Nearest Neighbours**</span>

This section details the feature selection process optimised specifically for k-NN algorithm requirements, reducing dimensionality whilst preserving predictive power.

### <span style="color: #2E86AB;">**1. Statistical Significance Measures**</span>

**Mutual Information (MI)**: Measures how much information one variable provides about another. For a categorical feature and a binary target, MI quantifies the reduction in uncertainty about withdrawal risk once the feature's value is known. It is 0 when the feature and target are independent and grows as the feature becomes more informative. MI=0.085, the value for `tuition_fees_up_to_date`, is the highest among the categorical features considered here.
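
For two discrete variables, MI can be computed directly from joint and marginal frequencies. The sketch below is a plain-Python illustration of the formula, using natural logarithms (nats); the function name and toy lists are assumptions, and the scores quoted in this notebook may have been produced with a different estimator.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI (in nats) between two equal-length sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (feature, target) pair
    px = Counter(xs)               # marginal counts of the feature
    py = Counter(ys)               # marginal counts of the target
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# A feature that is independent of the target carries no information
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
# A feature that determines the target carries the target's full entropy
mi_perfect = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
print(mi_independent, mi_perfect)
```

The second case equals ln 2 because the toy target is a fair coin; for an imbalanced target the ceiling is lower, which is one reason raw MI values look small.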

**Chi-Square (χ²) Test**: Tests whether a categorical feature and the target variable are statistically associated. For a given number of categories, a larger χ² statistic indicates a stronger departure from independence. A p-value <0.05 means an association this strong would be unlikely if the feature and target were truly independent. χ²=811.93, the value for `tuition_fees_up_to_date`, indicates a very strong association.
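
The χ² statistic compares the observed counts in a feature-by-target contingency table with the counts expected under independence. The sketch below works through that arithmetic on a hypothetical 2×2 table (the counts are invented for illustration); in practice `scipy.stats.chi2_contingency` returns the statistic together with its p-value.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if feature and target were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = feature value (0/1), columns = target (0/1)
observed = [[30, 10],
            [20, 40]]
stat = chi_square_statistic(observed)

# Rows proportional to each other: no association, statistic is zero
stat_independent = chi_square_statistic([[10, 20], [20, 40]])
print(round(stat, 3), stat_independent)
```

With one degree of freedom the 5% critical value is about 3.84, so the first toy table would count as a significant association and the second would not.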

**Why These Matter for k-NN**: Features with high MI and χ² scores separate the classes more cleanly in feature space, so a student's nearest neighbours are more likely to share the same outcome, which leads to more accurate predictions.

### <span style="color: #2E86AB;">**2. Categorical Features Strategy**</span>

**✅ Keep High-Value Features (7 features):**
- `tuition_fees_up_to_date`                 - MI=0.085, χ²=811.93 (strongest predictor)
- `scholarship_holder`                      - MI=0.047, χ²=265.10
- `course`                                  - MI=0.033, χ²=298.27
- `application_mode`                        - MI=0.029, χ²=399.12
- `gender`, `debtor`, `marital_status`      - moderate predictors

**❌ Remove Uninformative Features:**
- **Severely imbalanced**: `nationality` (97.5% Portuguese), `educational_special_needs` (98.9% no needs)
- **Near-zero information**: `daytime_evening_attendance`, `displaced`, `international` (MI≈0.00)

**⚠️ Handle High Cardinality (4 features):**
- `mothers_qualification` (35 categories), `fathers_qualification` (44 categories)
- `mothers_occupation` (32 categories), `fathers_occupation` (46 categories)
- **Strategy**: Collapse each parental pair into a single binary indicator (`parental_higher_education`, `parental_professional_occupation`) to prevent dimensionality explosion

### <span style="color: #2E86AB;">**3. Continuous Features Strategy**</span>

**✅ Keep Strong Predictors (7 features):**
- **Academic performance**: `curricular_units_2nd_sem_grade` (correlation=0.57), `curricular_units_2nd_sem_approved` (correlation=0.57)
- **Demographic**: `age_at_enrollment` (correlation=-0.25)
- **Academic background**: `previous_qualification_grade`, `admission_grade`
- **2nd semester metrics**: enrolled units, evaluations

**❌ Remove Multicollinear and Weakly Correlated Features:**
- **1st semester metrics**: VIF >15, highly correlated with 2nd semester (r=0.84-0.94)
- **Economic indicators**: `unemployment_rate`, `inflation_rate`, `gdp` (correlation <0.05)
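
The variance inflation factor for a feature is 1 / (1 − R²), where R² comes from regressing that feature on all the others. The NumPy sketch below illustrates the calculation on synthetic columns; it is not the code that produced the VIF values quoted in this notebook, and the helper name `vif` is an assumption.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Ordinary least squares of column j on the remaining columns + intercept
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        # Perfectly collinear columns would make this divide by zero
        factors.append(1.0 / (1.0 - r_squared))
    return factors

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)   # near-duplicate of a
c = rng.normal(size=500)             # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print([round(v, 1) for v in vifs])
```

The two near-duplicate columns receive large factors while the unrelated column stays close to 1, which mirrors the relationship between 1st and 2nd semester metrics described above.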

### <span style="color: #2E86AB;">**4. k-NN Specific Rationale**</span>

**Curse of Dimensionality**: As the number of dimensions grows, the feature space becomes sparse and distances between points become less discriminating. Reducing from 36 to ~25 features (after encoding) is intended to keep k-NN's distance measurements meaningful.
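
The loss of contrast between near and far points can be observed on random data. The sketch below is an illustrative experiment with uniformly distributed synthetic points, not an analysis of the student dataset; `distance_contrast` is a hypothetical helper.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Relative gap between the farthest and nearest point from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    # Large values: the nearest neighbour is much closer than the farthest.
    # Values near 0: all points are almost equally far away.
    return (distances.max() - distances.min()) / distances.min()

contrast_low = distance_contrast(2)
contrast_high = distance_contrast(100)
print(contrast_low, contrast_high)
```

In low dimensions the nearest neighbour is far closer than the farthest point; in high dimensions the gap shrinks, so "nearest" carries less information.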

**Distance Calculation**: k-NN relies on measuring similarity between data points. **Multicollinear features** would double-count similar information, whilst **uninformative features** add noise to distance calculations.

**Feature Scaling Preparation**: Remaining features have diverse scales (grades 0-200, units 0-26, age 17-70) requiring careful scaling for meaningful distance computation.

### <span style="color: #e74c3c;">**Final Feature Portfolio**</span>

- **Before Selection**: 36 features (18 categorical + 18 continuous)
- **After Selection**: 16 features (7 categorical + 2 parental binary indicators derived from the 4 high-cardinality features + 7 continuous)
- **After Encoding**: ~25 features (one-hot encoding expands categoricals)
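
The post-encoding width can be checked before fitting anything by counting distinct categories. The helper below is a hypothetical sketch on a toy frame; it assumes every categorical column is one-hot encoded with no category dropped.

```python
import pandas as pd

def encoded_width(df, categorical_cols, continuous_cols):
    """Column count after one-hot encoding the categoricals (no category dropped)."""
    # Each categorical column expands into one column per distinct value
    return int(df[categorical_cols].nunique().sum()) + len(continuous_cols)

# Hypothetical rows that borrow column names from the dataset
toy = pd.DataFrame({
    "course": [171, 9254, 9070, 171],
    "gender": [0, 1, 1, 0],
    "admission_grade": [120.0, 140.0, 160.0, 180.0],
})

width = encoded_width(toy, ["course", "gender"], ["admission_grade"])
print(width)
```

Binary columns that are already coded 0/1 can be left out of the encoder, in which case they contribute one column each rather than two.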

**Outcome**: Streamlined feature set optimised for distance-based similarity matching whilst preserving strongest predictive signals for student withdrawal detection.

In [20]:
# Function to check if required features are present in the dataset
def check_features_are_present(df, features):
    """
    Verifies that every feature in `features` is a column of the dataframe `df`.
    Reports the missing names and raises ValueError if any are absent.
    """
    not_present = [feature for feature in features if feature not in df.columns]
    if not_present:
        tools.print_message('error', 'Missing features in dataset', format_dict={'missing features': not_present})
        raise ValueError('Missing features')
    else:
        tools.print_message('success', 'All features are present in the dataset')

## 2.A Categorical Features - Removal (Uninformative)

In [21]:
# List of features to remove based on the EDA
categorical_features_remove = ['target', 'nationality', 'educational_special_needs', 'international', 'daytime_evening_attendance', 'displaced']

# Check that features exist in the dataset
check_features_are_present(df_dataset, categorical_features_remove)
    
# Remove these features from the dataset
df_dataset.drop(columns=categorical_features_remove, inplace=True)
tools.print_message('success', 'Removed categorical features', format_dict={'removed features': categorical_features_remove})


## 2.B Categorical Features - List High Predictive Value

In [22]:
categorical_features_keep = [
    'tuition_fees_up_to_date',  # MI=0.085, χ²=811.93
    'scholarship_holder',       # MI=0.047, χ²=265.10
    'course',                   # MI=0.033, χ²=298.27
    'application_mode',         # MI=0.029, χ²=399.12
    'gender',
    'debtor',
    'marital_status'
]

# Check that features exist in the dataset
check_features_are_present(df_dataset, categorical_features_keep)

## 2.C Categorical Features - High Cardinality

In [23]:
categorical_features_group_rare = [
    'mothers_qualification',
    'fathers_qualification',
    'mothers_occupation',
    'fathers_occupation'
]

# Check that features exist in the dataset
check_features_are_present(df_dataset, categorical_features_group_rare)

In [24]:
# To reduce the number of categories in the parental qualification and occupation features, we collapse each pair into a single binary indicator.
def create_parental_higher_ed(df):
    """
    Creates binary indicator for parental higher education.
    Returns 1 if at least one parent has higher education, 0 otherwise.
    """
    higher_ed_codes = [2, 3, 4, 5, 6, 39, 40, 41, 42, 43, 44]
    
    mother_higher_ed = df['mothers_qualification'].isin(higher_ed_codes)
    father_higher_ed = df['fathers_qualification'].isin(higher_ed_codes)
    
    # At least one parent has higher education
    df['parental_higher_education'] = (mother_higher_ed | father_higher_ed).astype(int)
    df = df.drop(columns=['mothers_qualification', 'fathers_qualification'])
    
    return df

# Usage:
df_dataset = create_parental_higher_ed(df_dataset)
df_dataset.parental_higher_education.value_counts()

parental_higher_education
0    3616
1     808
Name: count, dtype: int64

In [25]:
def create_parental_professional_occupation(df):
    """
    Creates binary indicator for parental professional occupation.
    Returns 1 if at least one parent has professional/managerial role, 0 otherwise.
    """
    professional_codes = [1, 2, 3, 101, 102, 112, 114, 121, 122, 123, 124, 
                          131, 132, 134, 135]
    
    mother_professional = df['mothers_occupation'].isin(professional_codes)
    father_professional = df['fathers_occupation'].isin(professional_codes)
    
    # At least one parent has professional occupation
    df['parental_professional_occupation'] = (mother_professional | father_professional).astype(int)
    df = df.drop(columns=['mothers_occupation', 'fathers_occupation'])
    
    return df

# Usage:
df_dataset = create_parental_professional_occupation(df_dataset)
df_dataset.parental_professional_occupation.value_counts()

parental_professional_occupation
0    3270
1    1154
Name: count, dtype: int64

## 2.D Continuous Features - Remove (VIF > 15) & Weak Correlations

In [26]:
continuous_features_remove = [
    # 1st semester metrics: dropped in favour of their 2nd semester equivalents (higher target correlation)
    'curricular_units_1st_sem_credited',     # VIF=15.57
    'curricular_units_1st_sem_enrolled',     # VIF=23.49
    'curricular_units_1st_sem_evaluations',
    'curricular_units_1st_sem_approved',
    'curricular_units_1st_sem_grade',
    # Weak economic predictors
    'unemployment_rate',                      # correlation = -0.03
    'inflation_rate',                         # correlation = 0.02
    'gdp'                                     # correlation = 0.05
]

# Check that features exist in the dataset
check_features_are_present(df_dataset, continuous_features_remove)

# Remove these features from the dataset
df_dataset.drop(columns=continuous_features_remove, inplace=True)
tools.print_message('success', 'Removed features from dataset', format_dict={'removed features': continuous_features_remove})

## 2.E Continuous Features - Strong Predictors

In [27]:
continuous_features_keep = [
    # High target correlation
    'curricular_units_2nd_sem_grade',        # correlation = 0.57
    'curricular_units_2nd_sem_approved',     # correlation = 0.57
    'age_at_enrollment',                      # correlation = -0.25
    'curricular_units_2nd_sem_enrolled',
    'curricular_units_2nd_sem_evaluations',
    # Moderate predictors  
    'previous_qualification_grade',           # correlation = 0.08
    'admission_grade'                         # correlation = 0.10
]

# Check that features exist in the dataset
check_features_are_present(df_dataset, continuous_features_keep)