# Iris Classification

This Jupyter notebook demonstrates how to classify the [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) with a k-nearest neighbors model, [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), placed in a pipeline behind a feature scaler.

In [13]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from joblib import dump

## Load Dataset

In [2]:
# Load the Iris dataset as one DataFrame (four feature columns plus `target`)
df = load_iris(as_frame=True).frame
# Shuffle the rows; `random_state` makes the order reproducible
df = df.sample(frac=1, random_state=42)

## Explore Dataset

In [3]:
# Look at the first 5 rows
df.head(5)

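|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
| --- | --- | --- | --- | --- | --- |
| 73  | 6.1 | 2.8 | 4.7 | 1.2 | 1 |
| 18  | 5.7 | 3.8 | 1.7 | 0.3 | 0 |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | 2 |
| 78  | 6.0 | 2.9 | 4.5 | 1.5 | 1 |
| 76  | 6.8 | 2.8 | 4.8 | 1.4 | 1 |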


In [4]:
# Generate descriptive statistics
df.describe()

|       | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
| ----- | --- | --- | --- | --- | --- |
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean  | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std   | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min   | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25%   | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50%   | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75%   | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max   | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


In [5]:
# Check if there are any NaNs
df.isna().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

In [6]:
# Look at the `target` variable
df.value_counts("target")

target
0    50
1    50
2    50
Name: count, dtype: int64
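
The integer codes in `target` stand for species names, which scikit-learn exposes as `target_names` on the object returned by `load_iris`. The following is a minimal sketch of how the codes could be translated; the names `code_to_name` and `species` are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# `target_names` is indexed by the integer class code (0, 1, 2)
code_to_name = {code: str(name) for code, name in enumerate(iris.target_names)}
print(code_to_name)

# Count samples per species name instead of per integer code
species = iris.target.map(code_to_name)
print(species.value_counts())
```

Keeping the integer codes for training and mapping to names only for display avoids changing anything in the modelling steps.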

## Split Dataset

In [7]:
# Define the `features` and the `target` variable
features = ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]
target = "target"

In [8]:
# Split the dataset into X (features) and y (target)
X = df[features].values
y = df[target].values

# Use `train_test_split` to split the dataset into 25% test and 75% train
dev_X, test_X, dev_y, test_y = train_test_split(X, y, test_size=0.25, random_state=42)
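
A plain random split does not guarantee that the three classes stay balanced in both parts. As a possible variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves the class proportions. The sketch below reloads the data with `load_iris(return_X_y=True)` so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# `stratify=y` keeps the class proportions (approximately) equal in both parts
dev_X, test_X, dev_y, test_y = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Samples per class in the development and test sets
print(np.bincount(dev_y), np.bincount(test_y))
```

With only 150 samples, stratification mainly reduces the variance of the test estimate; whether it is worth using here is a judgment call.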

## Model

In [9]:
# Create a pipeline that first scales the data and then uses kNN
knn = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("predictor", KNeighborsClassifier()),
])
knn.fit(dev_X, dev_y)
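
kNN relies on distances between samples, so features with a larger spread would otherwise dominate the result; this is the usual reason for putting a scaler in front of it. To make the scaling step concrete, this sketch compares `StandardScaler` with the same computation done by hand. It uses the full dataset purely for illustration, whereas the pipeline above learns the scaling from `dev_X` only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

X_scaled = StandardScaler().fit_transform(X)

# Same result by hand: subtract each column's mean, divide by its standard deviation
# (population standard deviation, i.e. NumPy's default ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X.std(axis=0).round(2), X_scaled.std(axis=0).round(2))
```

After scaling, every feature has zero mean and unit standard deviation, so each contributes on a comparable scale to the distance.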

In [10]:
# Estimate accuracy with 10-fold cross-validation on the development set
# (each fold refits a clone of the pipeline, so the fit above is not reused)
cross_val_score(knn, dev_X, dev_y, cv=10, scoring="accuracy").mean()

0.9363636363636362
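
The mean hides how much the accuracy varies from fold to fold. A minimal sketch of inspecting the individual scores follows. It reloads the data without the shuffle used earlier in this notebook, so the numbers it prints may differ from the value above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
dev_X, test_X, dev_y, test_y = train_test_split(X, y, test_size=0.25, random_state=42)

knn = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("predictor", KNeighborsClassifier()),
])

# One accuracy value per fold
scores = cross_val_score(knn, dev_X, dev_y, cv=10, scoring="accuracy")
print(scores.round(3))
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

With roughly 11 samples per fold, a single misclassification changes a fold's accuracy by about 9 percentage points, so a noticeable spread is expected.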

In [11]:
# Evaluate the model on the test set
accuracy_score(test_y, knn.predict(test_X))

0.9736842105263158
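
A single accuracy figure does not show which classes get confused with each other. One way to look closer, sketched here with scikit-learn's `confusion_matrix` and `classification_report`, is shown below. As before, the data is reloaded without the notebook's shuffle, so the exact counts may not match the score above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target
dev_X, test_X, dev_y, test_y = train_test_split(X, y, test_size=0.25, random_state=42)

knn = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("predictor", KNeighborsClassifier()),
])
knn.fit(dev_X, dev_y)
pred_y = knn.predict(test_X)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(test_y, pred_y, labels=[0, 1, 2])
print(cm)

# Per-class precision, recall and F1
print(classification_report(test_y, pred_y, labels=[0, 1, 2],
                            target_names=iris.target_names))
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total equals the accuracy.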

## Save Model

Retrain the model using all the data and save it to the *model.pkl* file:

In [12]:
knn.fit(X, y)
dump(knn, "model.pkl")

['model.pkl']
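
The saved file can later be read back with `joblib.load`. The sketch below trains and saves the same pipeline, then reloads it and predicts for one flower. The measurements in `sample` are an illustrative input, given in the same feature order as `features`.

```python
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

knn = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("predictor", KNeighborsClassifier()),
])
knn.fit(X, y)
dump(knn, "model.pkl")

# Load the pipeline back; scaling and prediction are both restored
model = load("model.pkl")

# Illustrative input: sepal length, sepal width, petal length, petal width (cm)
sample = [[5.1, 3.5, 1.4, 0.2]]
print(model.predict(sample))
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, ideally with the same scikit-learn version that wrote them.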