# Example usage

To use `pkg_pyknnclassifier` in a project:

In [8]:
import pkg_pyknnclassifier
import pandas as pd

from pkg_pyknnclassifier.data_loading import data_loading
from pkg_pyknnclassifier.scaling import scaling
from pkg_pyknnclassifier.predict import predict
from pkg_pyknnclassifier.evaluate import evaluate

Check the version of our package.

In [2]:
print(pkg_pyknnclassifier.__version__)

0.1.0


## Narrative


We first wrangle the data so that it is ready for classification. Note that our package is designed to accept only numerical features and a binary target variable. If your data includes categorical features, convert them into a numerical format with an appropriate encoding technique before using the package.
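As an illustration of the encoding step mentioned above, here is a minimal sketch that one-hot encodes a categorical column with pandas. The toy DataFrame and its column names (`size`, `color`) are hypothetical and are not part of the Iris data or of this package.

```python
import pandas as pd

# Hypothetical toy data: one numerical and one categorical feature.
df = pd.DataFrame({
    "size": [1.0, 2.5, 3.0],
    "color": ["red", "blue", "red"],
})

# One-hot encode the categorical column so that every feature is numerical.
# dtype=int yields 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```

The encoded frame can then be saved to CSV and loaded like any other all-numerical dataset.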


## Data-Loading

For ease of interpretation, we use the Iris dataset restricted to 2 distinct categories (Iris _setosa_ and Iris _virginica_), each with 50 samples, along with their corresponding features:
- "SepalLengthCm": length of the sepal in cm
- "SepalWidthCm": width of the sepal in cm
- "PetalLengthCm": length of the petal in cm
- "PetalWidthCm": width of the petal in cm

The data showcases various species of the iris flower, serving as a prime example to demonstrate and evaluate the performance of our tailored k-Nearest Neighbors (kNN) model.

We load the training data using the `data_loading()` function, passing the path to the CSV file and the name of the target column that holds the class labels.


In [3]:
path_to_training = "../data/iris_train.csv"
path_to_validation = "../data/iris_valid.csv"
target_column = "Species"
X_train, y_train = data_loading(path_to_training, target_column)

A quick look at the training features:

In [4]:
X_train.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
0,4.8,3.4,1.6,0.2
1,4.6,3.2,1.4,0.2
2,7.7,3.0,6.1,2.3
3,5.2,3.4,1.4,0.2
4,6.8,3.2,5.9,2.3



We have excluded the "id" column, recognizing it as a unique identifier that could lead to overfitting if included. Since it does not offer any meaningful predictive value, it is appropriately removed from the feature set for our analysis.


In [5]:
X_val, y_val = data_loading(path_to_validation, target_column)
print(set(y_train))

{'Iris-setosa', 'Iris-virginica'}


## Scaling
Before building our kNN model, we scale the features so that no single feature dominates the distance calculations. This lets every feature contribute equally and typically leads to a better-performing model. Here we use 'MinMaxScaler'; 'StandardScaler' is an alternative that can be selected through the `scale_method` parameter.

This dataset has no missing values, but if yours does, you can specify an imputation strategy through the `impute_strategy` parameter.


In [6]:
train_X_scaled = scaling(X_train, impute_strategy="mean", scale_method="MinMaxScaler")
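To make the idea concrete, the following is a minimal sketch of mean imputation followed by min-max scaling, written directly in pandas. It is not the package's implementation of `scaling()`; the helper name `min_max_scale` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def min_max_scale(features: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values with column means, then rescale each column to [0, 1]."""
    filled = features.fillna(features.mean())
    col_min = filled.min()
    col_range = filled.max() - col_min
    # Constant columns have zero range; divide by 1 instead so they map to 0.0.
    col_range = col_range.replace(0, 1)
    return (filled - col_min) / col_range

# Hypothetical toy data with one missing value in column "a".
toy = pd.DataFrame({"a": [1.0, None, 3.0], "b": [10.0, 20.0, 30.0]})
scaled = min_max_scale(toy)
print(scaled)
```

After scaling, both columns span the same [0, 1] range, so neither dominates a Euclidean distance purely because of its units.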

## Prediction
The code below predicts the labels of the unlabeled observations from their nearest neighbors, using Euclidean distance as the similarity measure.

In [14]:
y_pred = predict(X_train, y_train, X_val, pred_method="hard", k=3)
pd.DataFrame({'predictions': y_pred, 'values': y_val})

Unnamed: 0,predictions,values
0,Iris-virginica,Iris-virginica
1,Iris-virginica,Iris-virginica
2,Iris-virginica,Iris-virginica
3,Iris-setosa,Iris-setosa
4,Iris-setosa,Iris-setosa
5,Iris-setosa,Iris-setosa
6,Iris-setosa,Iris-setosa
7,Iris-virginica,Iris-virginica
8,Iris-setosa,Iris-setosa
9,Iris-setosa,Iris-setosa
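For readers unfamiliar with kNN, the sketch below shows the general technique using only the standard library: rank the training points by Euclidean distance to each query point and take a majority vote among the `k` closest. It assumes that `pred_method="hard"` corresponds to majority voting; the function `knn_predict` and the toy points are hypothetical and do not reflect the internals of `predict()`.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query_X, k=3):
    """Label each query point by majority vote among its k nearest training points."""
    predictions = []
    for query in query_X:
        # Rank training points by Euclidean distance to the query.
        ranked = sorted(
            zip(train_X, train_y),
            key=lambda pair: math.dist(pair[0], query),
        )
        nearest_labels = [label for _, label in ranked[:k]]
        # Majority vote over the k nearest labels.
        predictions.append(Counter(nearest_labels).most_common(1)[0][0])
    return predictions

# Hypothetical toy data: two well-separated clusters.
train_X = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
train_y = ["Iris-setosa"] * 3 + ["Iris-virginica"] * 3
toy_pred = knn_predict(train_X, train_y, [(1.1, 1.0), (5.1, 5.0)], k=3)
print(toy_pred)
```

With a binary target, an odd `k` such as 3 avoids ties in the vote.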


## Evaluate
We now evaluate how the model performs. The code below computes the accuracy on both the training and the validation set; both scores are 1.0, so the model fits this dataset very well. Alternatively, other metrics (such as precision, recall, and F1 score) can be used to evaluate the performance.


In [15]:
y_pred_train = predict(X_train, y_train, X_train, pred_method="hard", k=3)
acc_train = evaluate(y_train, y_pred_train, metric='accuracy')
print("The training score is:", acc_train)

The training score is: 1.0


In [16]:
acc_valid = evaluate(y_val, y_pred, metric='accuracy')
print("The validation score is:", acc_valid)

The validation score is: 1.0
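As a reminder of how these metrics relate to one another, here is a minimal standard-library sketch that computes accuracy, precision, recall, and F1 for a binary problem. The helper `classification_metrics` and the toy labels are hypothetical; this is not how `evaluate()` is implemented.

```python
def classification_metrics(y_true, y_pred, positive_label):
    """Compute accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical toy labels with one misclassified setosa sample.
toy_true = ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"]
toy_guess = ["Iris-setosa", "Iris-virginica", "Iris-virginica", "Iris-virginica"]
scores = classification_metrics(toy_true, toy_guess, positive_label="Iris-setosa")
print(scores)
```

Accuracy alone can hide class-specific errors: here one of the two setosa samples is missed, which shows up as a recall of 0.5 even though precision is 1.0.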


## Reference

Fisher, R. A. (1988). Iris. UCI Machine Learning Repository. https://doi.org/10.24432/C56C76