# KNN Classifier on the PBCSeq Dataset

## Intro
In this project, I apply the **K-Nearest Neighbours (KNN)** classifier to the **PBCSeq dataset** from OpenML.

Reasons I chose this dataset:
- It contains a lot of **missing values**, which makes preprocessing more challenging and realistic
- It mixes **numerical and categorical features**, which requires data cleaning and encoding
- Unlike toy datasets (such as Iris or Wine), it is closer to real-world data

The goal of this project is to handle missing values properly, encode categorical values, and then build a pipeline with **StandardScaler + KNN**, using **GridSearchCV** for hyperparameter tuning.

In [33]:
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
import pandas as pd

## Load Dataset

The dataset is fetched directly from [OpenML](https://www.openml.org/d/802)

In [58]:
data = fetch_openml('pbcseq', as_frame=True)

X = data.data
y = data.target

# Display the column names and the first 3 rows
list(X.columns), X.head(3)

- version 1, status: active
  url: https://www.openml.org/search?type=data&id=516
- version 2, status: active
  url: https://www.openml.org/search?type=data&id=802



(['case_number',
  'number_of_days',
  'drug',
  'age',
  'sex',
  'day',
  'presence_of_asictes',
  'presence_of_hepatomegaly',
  'presence_of_spiders',
  'presence_of_edema',
  'serum_bilirubin',
  'serum_cholesterol',
  'albumin',
  'alkaline_phosphatase',
  'SGOT',
  'platelets',
  'prothrombin_time',
  'histologic_stage_of_disease'],
    case_number  number_of_days  ... prothrombin_time  histologic_stage_of_disease
 0            1             400  ...             12.2                            4
 1            1             400  ...             11.2                            4
 2            2            5169  ...             10.6                            3
 
 [3 rows x 18 columns])

## First step of preprocessing the data

After fetching the data, I check the number of missing values and the types of the features.

In [35]:
#Checking where the missing values are
X.isna().sum()

#Checking what the value types are in each column
X['presence_of_asictes'].unique()
X['presence_of_hepatomegaly'].unique()
X['presence_of_spiders'].unique()
X['serum_cholesterol'].unique() 
X['alkaline_phosphatase'].unique()
X['platelets'].unique()

# The first 3 columns indicate presence, so they hold yes/no or 1/0 values, while the last 3 hold continuous values

array([190., 183., 221., 188., 161., 122., 135., 100., 103., 113., 151.,
       160., 107., 109., 240., 251., 220., 338., 200., 101., 136., 114.,
        99.,  90.,  82.,  nan, 297., 296., 264., 204., 192., 164., 173.,
       121., 373., 362., 369., 388., 322., 229., 232., 238., 269., 252.,
       250., 331., 195., 302., 258., 285., 265., 244., 187., 155., 108.,
        85.,  77.,  92., 132.,  71.,  98., 245., 246., 217., 247., 233.,
       197., 222., 218., 241., 207., 176., 156., 137., 123.,  96., 133.,
       295., 235., 268., 272., 181., 184., 196., 198., 280., 253., 230.,
       289., 317., 209., 210., 243., 219., 231., 224., 216., 283., 215.,
       205., 191., 199., 165., 182., 178., 162., 437., 506., 336., 328.,
       294., 128., 119.,  95., 214.,  70., 286., 323., 341., 313., 311.,
       242., 324., 150., 208., 170., 110., 148., 421., 443., 360.,  80.,
       144., 143., 390., 335., 290., 163., 124., 116., 260., 316., 327.,
       282., 129., 120., 223., 225., 212., 159., 16
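
The per-column checks above can also be collected into a single table. The following is a minimal sketch on a made-up frame: `toy` and its values are illustrative, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X; the values are invented for illustration.
toy = pd.DataFrame({
    "presence_of_spiders": pd.Categorical(["yes", "no", None, "no"]),
    "serum_cholesterol": [190.0, np.nan, 221.0, np.nan],
    "age": [58, 58, 56, 70],
})

# One row per column: dtype, number of missing values, number of distinct values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "n_unique": toy.nunique(),
})
print(summary)
```

Applied to the real `X`, a table like this shows at a glance which columns need imputation and which need encoding.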

## Second step of Preprocessing

After finding out the data types and which columns have missing values, I fill in the missing values and encode the categorical columns as integers.

In [36]:
# Work on a copy so the assignments below do not trigger SettingWithCopyWarning
X = X.copy()

# Continuous columns: fill missing values with the column median
# (assigning the result back instead of using inplace=True on a column)
for col in ['serum_cholesterol', 'alkaline_phosphatase', 'platelets']:
    X[col] = X[col].fillna(X[col].median())

# Presence columns: 0 = no, 1 = yes, 2 = missing
X['presence_of_asictes'] = X['presence_of_asictes'].cat.add_categories([2])
X['presence_of_asictes'] = X['presence_of_asictes'].replace({'no': 0, 'yes': 1})
X['presence_of_asictes'] = X['presence_of_asictes'].fillna(2)

X['presence_of_hepatomegaly'] = X['presence_of_hepatomegaly'].cat.add_categories([2])
X['presence_of_hepatomegaly'] = X['presence_of_hepatomegaly'].fillna(2)
X['presence_of_hepatomegaly'] = X['presence_of_hepatomegaly'].replace({'no': 0, 'yes': 1})
# The original call discarded the result of astype; assign it back
X['presence_of_hepatomegaly'] = X['presence_of_hepatomegaly'].astype(int)

X['presence_of_spiders'] = X['presence_of_spiders'].cat.add_categories([2])
X['presence_of_spiders'] = X['presence_of_spiders'].fillna(2)
X['presence_of_spiders'] = X['presence_of_spiders'].astype(int)

# Remaining categorical columns: map labels to integers
X['drug'] = X['drug'].map({'D-penicillamine': 1, '0': 0})
X['drug'] = X['drug'].astype(int)
X['sex'] = X['sex'].map({'male': 0, 'female': 1})
X['sex'] = X['sex'].astype(int)
X['day'] = X['day'].replace({'no': 0})
X['day'] = X['day'].astype(int)
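
As an alternative to filling values by hand, imputation can live inside the scikit-learn pipeline, so that medians are learned from the training data only. The sketch below is not the approach used in this notebook. It assumes made-up rows (`toy_X`, `toy_y`), and it one-hot encodes the categorical column with an extra "missing" level instead of using the 0/1/2 codes above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; only the column names are borrowed from the dataset.
toy_X = pd.DataFrame({
    "serum_cholesterol": [190.0, np.nan, 221.0, 188.0, np.nan, 240.0],
    "platelets": [190.0, 183.0, np.nan, 161.0, 122.0, 135.0],
    "presence_of_spiders": ["yes", "no", np.nan, "no", "yes", "no"],
})
toy_y = pd.Series([1, 0, 1, 0, 1, 0])

numeric_cols = ["serum_cholesterol", "platelets"]
categorical_cols = ["presence_of_spiders"]

preprocess = ColumnTransformer([
    # Median imputation, then scaling, for the numeric columns
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Missing categories become their own "missing" level before one-hot encoding
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
model.fit(toy_X, toy_y)

# 2 scaled numeric columns + 3 one-hot columns (no / yes / missing)
transformed = model.named_steps["preprocess"].transform(toy_X)
print(transformed.shape)
```

Because the imputers are pipeline steps, a grid search would refit them on each training fold rather than on the full dataset.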


## Splitting the data

To keep the proportions of the target classes the same in the training and test sets, I use **StratifiedShuffleSplit**.

In my runs it gave ≈1-2% higher accuracy than a plain **train_test_split**, because both sets preserve the class balance and neither is skewed toward one class.

In [44]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2)

for train_ind, test_ind in split.split(X, y):
    X_train, X_test = X.iloc[train_ind], X.iloc[test_ind]
    y_train, y_test = y.iloc[train_ind], y.iloc[test_ind]
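
To illustrate what stratification preserves, here is a small sketch on a made-up imbalanced target (`toy_X` and `toy_y` are invented). The `random_state` argument is an addition for reproducibility and is not set in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up imbalanced target: 80 samples of class 0, 20 of class 1
toy_y = pd.Series([0] * 80 + [1] * 20)
toy_X = pd.DataFrame({"feature": np.arange(100)})

# random_state is an assumption added here so the split is reproducible
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_ind, test_ind = next(split.split(toy_X, toy_y))

train_share = toy_y.iloc[train_ind].mean()  # share of class 1 in the training set
test_share = toy_y.iloc[test_ind].mean()    # share of class 1 in the test set
print(train_share, test_share)
```

Both shares match the 20% share of class 1 in the full toy target, which is the property the split above relies on.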

## Pipeline + GridSearch

For the sake of scalability and code clarity, I used a **Pipeline** that combines preprocessing (**StandardScaler**) with the model (**KNeighborsClassifier**).

This approach makes the workflow more readable, maintainable and easier to extend (e.g., replacing the classifier later or adding more preprocessing steps).

The use of **StandardScaler** is motivated by the way this model classifies: it makes decisions based on distances. Without scaling, features with larger ranges would dominate those with smaller ones, reducing or even eliminating their importance. **StandardScaler** ensures all features are brought to a comparable scale.
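
The effect of scaling on distances can be seen in a tiny sketch. The numbers below are made up, and `nearest_to_first` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features on very different scales:
# column 0 is in the thousands, column 1 is in single digits
points = np.array([
    [1000.0, 1.0],
    [1010.0, 9.0],
    [1400.0, 1.5],
    [2000.0, 5.0],
])

def nearest_to_first(data):
    """Return the row index (1..n-1) closest to row 0 in Euclidean distance."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

raw_nearest = nearest_to_first(points)
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(points))
print(raw_nearest, scaled_nearest)
```

On the raw values, row 1 is the nearest neighbour of row 0 because only the large-range column matters. After scaling, row 2 becomes the nearest, since the big difference in the second column is no longer drowned out.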

Finally, **GridSearchCV** finds the best **hyperparameters** for the model by evaluating its accuracy on every combination in the supplied grid; in this instance it compares 30 possible combinations (5 x 2 x 3).


In [55]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

grid_param = {
    'knn__n_neighbors': [1, 3, 5, 7, 9],
    'knn__weights' : ['distance', 'uniform'],
    'knn__metric': ['euclidean', 'manhattan', 'minkowski']
}
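
As a quick check of the combination count, the grid can be expanded with scikit-learn's `ParameterGrid`. This sketch repeats the dictionary above so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

grid_param = {
    'knn__n_neighbors': [1, 3, 5, 7, 9],
    'knn__weights': ['distance', 'uniform'],
    'knn__metric': ['euclidean', 'manhattan', 'minkowski'],
}

# ParameterGrid expands the dictionary into every combination GridSearchCV would try
candidates = list(ParameterGrid(grid_param))
print(len(candidates))  # 5 * 2 * 3 = 30
print(candidates[0])
```

One caveat: with the default `p=2` of `KNeighborsClassifier`, the `minkowski` metric is the same as `euclidean`, so 10 of the 30 candidates duplicate others. Adding `knn__p` to the grid or dropping `minkowski` would avoid the redundant fits.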

## Results

Using **GridSearchCV**, I ran a **5-fold cross-validation**: the training set is divided into 5 folds, and in each round 4 folds are used for training and 1 fold for validation. This process is repeated 5 times, each time with a different validation fold, and the 5 scores are averaged.
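
The procedure described above can be written out by hand for a single hyperparameter setting. This is a sketch on synthetic data from `make_classification`, not the PBCSeq data, and `n_neighbors=5` is an arbitrary choice. It uses `StratifiedKFold`, which is what `GridSearchCV` defaults to for classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training set (not the PBCSeq data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

folds = StratifiedKFold(n_splits=5)
fold_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    # A fresh pipeline per fold, so the scaler is fitted on the 4 training folds only
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
    ])
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

# The mean over the 5 validation folds is the score used to compare candidates
mean_score = np.mean(fold_scores)
print(fold_scores, mean_score)
```

`GridSearchCV` repeats this loop for every candidate in the grid and keeps the one with the highest mean score.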

After running all the combinations, the best model achieved an accuracy of ≈0.93 on the test set, using the **manhattan** distance metric, 1 **neighbour**, and **distance** weighting.

In [59]:
grid = GridSearchCV(pipeline, grid_param, scoring='accuracy', error_score='raise')
grid.fit(X_train, y_train)

grid.score(X_test, y_test), grid.best_params_

(0.9305912596401028,
 {'knn__metric': 'manhattan',
  'knn__n_neighbors': 1,
  'knn__weights': 'distance'})
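
The score above is measured on the held-out test set, while the winner is chosen by the mean cross-validation score. A fitted `GridSearchCV` object exposes both. The sketch below shows how they might be compared, using synthetic data and a smaller grid (`demo_grid` and the other names are illustrative, not part of this notebook).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the PBCSeq data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

demo_grid = GridSearchCV(
    Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]),
    {'knn__n_neighbors': [1, 3, 5], 'knn__weights': ['distance', 'uniform']},
    scoring='accuracy',
)
demo_grid.fit(X_tr, y_tr)

cv_mean = demo_grid.best_score_         # mean validation accuracy of the best candidate
test_acc = demo_grid.score(X_te, y_te)  # accuracy of the refitted best model on held-out data

# Rank all candidates by their mean cross-validated accuracy
results = pd.DataFrame(demo_grid.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(cv_mean, test_acc)
print(results.head())
```

Reporting `best_score_` next to the test accuracy makes it easier to see whether the chosen model generalises beyond the folds it was tuned on.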

## Conclusion

- Proper handling of missing values and categorical encoding is crucial for real-world datasets
- Feature scaling is essential for distance-based models like KNN
- Cross-validation provides a robust estimate of model performance
- Hyperparameter tuning is very important, boosting this model's performance by ≈8-10%