# Example usage

To use `pkg_pyknnclassifier` in a project:

In [1]:
import pkg_pyknnclassifier

print(pkg_pyknnclassifier.__version__)

from pkg_pyknnclassifier.data_loading import data_loading
from pkg_pyknnclassifier.scaling import scaling
from pkg_pyknnclassifier.predict import predict
from pkg_pyknnclassifier.evaluate import evaluate

0.1.0


## Narrative


We first wrangle the data so that it is ready for classification. Note that the package is designed to accept only numerical features and a binary target variable. If your data include categorical features, transform them into numerical form before passing them to the package.
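One common way to turn a categorical feature into numerical columns is one-hot encoding. The sketch below uses pandas and is only an illustration: the column names and values are hypothetical, and the package itself does not prescribe this particular transformation.

```python
import pandas as pd

# Hypothetical data: "colour" is categorical, "petal_length" is numeric.
df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 5.1],
    "colour": ["white", "purple", "purple"],
})

# One-hot encode the categorical column so that every feature is numeric.
# dtype=float yields 0.0/1.0 columns instead of booleans.
encoded = pd.get_dummies(df, columns=["colour"], dtype=float)
print(encoded)
```

Each category becomes its own 0/1 column, which keeps the features compatible with distance-based methods such as kNN.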


## Data-Loading

For ease of interpretation, we use the Iris dataset restricted to 2 distinct categories (Iris Setosa and Iris Virginica), each with 50 samples, along with the following features:
- "SepalLengthCm": length of the sepal in cm
- "SepalWidthCm": width of the sepal in cm
- "PetalLengthCm": length of the petal in cm
- "PetalWidthCm": width of the petal in cm

The data describe species of the iris flower and serve as a simple example for demonstrating and evaluating the performance of our k-Nearest Neighbors (kNN) model.

We load the data using the `data_loading()` function, passing the path to each CSV file (training and validation) and the name of the target column on which the classification is based.

## Preprocessing
We exclude the "Id" column because it is a unique identifier with no meaningful predictive value, and including it could lead to overfitting.
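If your own files still contain an identifier column, it can be dropped before modelling. The sketch below assumes the features are held in a pandas DataFrame (the small frame here is a hypothetical stand-in, not the output of `data_loading()`).

```python
import pandas as pd

# Hypothetical stand-in for a feature frame that still has an identifier.
X = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 4.9, 6.3],
    "PetalLengthCm": [1.4, 1.4, 6.0],
})

# errors="ignore" makes the call a no-op if "Id" was already removed,
# so the cell can be re-run safely.
X = X.drop(columns=["Id"], errors="ignore")
print(X.columns.tolist())
```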



In [12]:
path_to_training = "../data/iris_train.csv"
path_to_validation = "../data/iris_valid.csv"
target_column = "Species"
X_train, y_train = data_loading(path_to_training, target_column)
# X_train = X_train.drop(columns="Id")
# index_mask = y_train != 'Iris-virginica'
# X_train = X_train[index_mask]
# y_train = y_train[index_mask]
X_val, y_val = data_loading(path_to_validation, target_column)
print(set(list(y_train)))

{'Iris-setosa', 'Iris-virginica'}


## Scaling
Before building our kNN model, we scale the features so that no single feature dominates the distance calculations. This lets the model weigh all features equally, which generally leads to better performance.
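To make the idea concrete, here is a minimal sketch of min-max scaling, the method named by `scale_method="MinMaxScaler"` in the next cell. The helper `min_max_scale` is hypothetical and written with NumPy for illustration; it is not the package's implementation and ignores imputation.

```python
import numpy as np

def min_max_scale(X_fit, X_apply):
    """Rescale the columns of X_apply using the min and max learned from X_fit."""
    X_fit = np.asarray(X_fit, dtype=float)
    X_apply = np.asarray(X_apply, dtype=float)
    col_min = X_fit.min(axis=0)
    col_range = X_fit.max(axis=0) - col_min
    # Constant columns have zero range; use 1.0 to avoid division by zero.
    col_range[col_range == 0] = 1.0
    return (X_apply - col_min) / col_range

# Two features on different scales end up in the same [0, 1] range.
train = np.array([[5.0, 0.2], [7.0, 2.2], [6.0, 1.2]])
scaled = min_max_scale(train, train)
print(scaled)
```

Learning the minimum and maximum from the training data and reusing them for validation data avoids leaking information from the validation set.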


In [13]:
train_X_scaled = scaling(X_train, impute_strategy="mean", scale_method="MinMaxScaler")

In [15]:
y_pred = predict(X_train, y_train, X_val, pred_method="hard", k=3)
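
The call above requests a "hard" prediction with `k=3`. As a rough sketch of what a hard-vote kNN prediction can look like, assuming "hard" means a majority vote among the k nearest training points under Euclidean distance (an assumption, not confirmed by the package), consider the hypothetical helper `knn_predict_one`:

```python
import math
from collections import Counter

def knn_predict_one(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: three points near the origin, two near (1, 1).
toy_X = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [1.0, 1.0], [0.9, 1.0]]
toy_y = ["Iris-setosa"] * 3 + ["Iris-virginica"] * 2

print(knn_predict_one(toy_X, toy_y, [0.05, 0.05]))
print(knn_predict_one(toy_X, toy_y, [0.95, 0.95]))
```

With a binary target, an odd `k` such as 3 avoids tied votes.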

## Evaluate
We now evaluate the model. The line below computes its accuracy on the validation dataset; the result of 1.0 shows that the model fits this dataset very well.
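Accuracy is the fraction of predictions that match the true labels. A minimal sketch of that calculation, using a hypothetical helper `accuracy` rather than the package's `evaluate()` function:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Three of the four toy predictions are correct.
toy_true = ["Iris-setosa", "Iris-virginica", "Iris-setosa", "Iris-virginica"]
toy_pred = ["Iris-setosa", "Iris-virginica", "Iris-virginica", "Iris-virginica"]
print(accuracy(toy_true, toy_pred))
```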


In [17]:
acc = evaluate(y_val, y_pred, metric='accuracy')
print(acc)

1.0


In [18]:
y_pred_train = predict(X_train, y_train, X_train, pred_method="hard", k=3)
acc_train = evaluate(y_train, y_pred_train, metric='accuracy')