# Using K-Nearest Neighbors to Classify Iris Flowers

## 1. Introduction

In this notebook, we will demonstrate the power and simplicity of the **k-nearest neighbors (k-NN)** algorithm by applying it to the classic **Iris flower dataset**. This dataset contains measurements of iris flowers' sepals and petals across three species (*setosa*, *versicolor*, and *virginica*), making it an ideal choice for supervised classification.

Our goals are to:
- Load and explore the dataset to understand its features and target classes.
- Prepare the data by handling categorical labels and scaling features, which is crucial for distance-based algorithms like k-NN.
- Train a **k-NN classifier** with `scikit-learn`'s `KNeighborsClassifier`, tuning the `k` parameter to observe its impact on model performance.
- Evaluate our model using metrics such as accuracy, confusion matrices, and classification reports to ensure robust performance.

By the end of this analysis, we will not only have a working k-NN model that can accurately predict iris species, but also a deeper understanding of how proximity-based classification works and how critical preprocessing is in achieving meaningful results.
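
The notebook relies on `scikit-learn` for the classifier, but the core idea of proximity-based classification fits in a few lines. The following is a minimal sketch in plain Python, not the library's implementation: the function name `knn_predict` and the toy points are invented for illustration, and ties are broken arbitrarily.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy (petal length, petal width) points, invented for illustration only
train_points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.5, 1.5), (4.7, 1.4), (4.4, 1.3)]
train_labels = ["small", "small", "small", "large", "large", "large"]

prediction = knn_predict(train_points, train_labels, query=(1.6, 0.4), k=3)
print(prediction)  # small
```

There is no training step beyond storing the points: all the work happens at prediction time, which is why the scale of each feature matters so much.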

## 2. Data Checking

We are going to load the data and take a look at it before running it through the k-NN algorithm.

In [1]:
import pandas as pd
iris_df = pd.read_csv('iris.csv')
iris_df.head()

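|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |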


We’ve successfully loaded the Iris dataset into a pandas DataFrame and displayed the first few rows to get a quick look at the data structure. This CSV file, sourced from Kaggle, contains the same classic measurements as the built-in `sklearn.datasets.load_iris()` version, but working directly with the CSV allows us to showcase practical data handling steps like file reading, exploratory analysis, and label encoding.
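
For comparison, here is a rough sketch of loading the built-in copy mentioned above and reshaping it to resemble the CSV. It assumes scikit-learn 0.23 or later (for `as_frame=True`) and pandas; the variable name `builtin_df` is made up for this example, and the built-in version has no `Id` column and stores the species as integers rather than strings.

```python
from sklearn.datasets import load_iris

# as_frame=True returns the data as a pandas DataFrame (scikit-learn >= 0.23)
iris = load_iris(as_frame=True)
builtin_df = iris.frame

# Rename to match the CSV's column names; the built-in target is already integer-coded
builtin_df = builtin_df.rename(columns={
    "sepal length (cm)": "SepalLengthCm",
    "sepal width (cm)": "SepalWidthCm",
    "petal length (cm)": "PetalLengthCm",
    "petal width (cm)": "PetalWidthCm",
    "target": "Species",
})

print(builtin_df.shape)
print(list(iris.target_names))
```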

Let's take a deeper look into the dataset.  

In [2]:
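# Print a quick summary of the dataset: dtypes, summary statistics,
# missing-value counts, and the class balance.

def check_data(df):

    # info() prints on its own; the other calls return objects,
    # so they must be printed explicitly inside a function
    df.info()
    print(df.describe())
    print(df.isnull().sum())
    print(df['Species'].value_counts())

check_data(iris_df)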

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


From our initial data check, we see that:

- There are no missing values (yay! NaNs make me cringe).

- The four main features (`SepalLengthCm`, `SepalWidthCm`, `PetalLengthCm`, `PetalWidthCm`) are all numeric (`float64`), as expected for measurements in centimeters. We are going to rename these columns to shorten them for legibility.

- The `Species` column is an `object` type holding the names of the three iris species. We will deal with this next.

- The `Id` column is just an index and we won't need it, unless I find out some try-hard on Kaggle used it to squeeze out an extra 1% of precision.

Overall, the data is **clean, complete, and well-structured**, which makes it an ideal candidate to demonstrate the k-NN algorithm without additional cleaning complications.  Neat!

Now let's clean up the column names, encode the "Species" column, and I guess we can drop the "Id" column as well.

In [4]:
from sklearn.preprocessing import LabelEncoder

def data_prepper(df):
    
    df = df.rename(columns={"SepalLengthCm":"Sepal_Length", "SepalWidthCm":"Sepal_Width",
                       "PetalLengthCm":"Petal_Length", "PetalWidthCm":"Petal_Width"})
    
    df = df.drop(columns=['Id'])
    
    le = LabelEncoder()
    
    # fit on the three values in species and print the labels
    le.fit(df['Species'])
    print("Classes found by LabelEncoder:", le.classes_)
    
    # see the mapping
    mapping = {label: idx for idx, label in enumerate(le.classes_)}
    print("Label encoding mapping:", mapping)
    
    # reuse the already-fitted encoder instead of fitting it a second time
    df['Species'] = le.transform(df['Species'])
    
    return df

iris_clean = data_prepper(iris_df)
    
iris_clean.head()

Classes found by LabelEncoder: ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Label encoding mapping: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}


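|   | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width | Species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |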
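
The mapping printed above follows from how `LabelEncoder` orders its classes: it sorts the unique labels and numbers them from zero. A small plain-Python sketch of that idea (the short `species` list is made up for illustration and is not taken from the dataset):

```python
# Sketch of the mapping LabelEncoder builds: sort the unique labels, then number them
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

classes = sorted(set(species))
mapping = {label: idx for idx, label in enumerate(classes)}
encoded = [mapping[label] for label in species]

print(mapping)  # {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
print(encoded)  # [2, 0, 1, 0]
```

Because the codes come from alphabetical order, they carry no ranking between species. That is fine for a classification target, since k-NN only computes distances on the features, not on the labels.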


Now let's move on to the next part, where we train a model on the data.

## 3. Training and Scaling

Now that our dataset is cleaned and the target labels are properly encoded, we're ready to build the full machine learning pipeline. This step includes:

- **Separating features and target:** Extracting the measurement columns as input features (`X`) and the encoded species as the target variable (`y`)

- **Splitting into training and test sets:** Ensuring we have data to evaluate how well the model generalizes to unseen examples.

- **Scaling the features:** Since k-NN relies on distance calculations, it's critical to standardize our data so that all features contribute equally, regardless of their original units.

- **Fitting the k-NN model:** We'll train a k-nearest neighbors classifier with a chosen `k` value, ready to predict new iris samples.
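
To make the scaling step concrete, here is a minimal plain-Python sketch of standardization for a single feature. The helper names `fit_scaler` and `transform` and the sample values are hypothetical; the sketch uses the population standard deviation, which is what `StandardScaler` uses as well. The key point is that the mean and standard deviation are learned from the training set only and then reused on the test set.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardize values with statistics learned from the training data."""
    return [(value - mean) / std for value in column]

# Hypothetical training and test values for one feature, in centimeters
train_lengths = [5.1, 4.9, 6.3, 7.0, 5.8]
test_lengths = [5.0, 6.7]

# Learn the statistics from the training set only
mean, std = fit_scaler(train_lengths)
train_scaled = transform(train_lengths, mean, std)
# Reuse the training statistics on the test set
test_scaled = transform(test_lengths, mean, std)

print(train_scaled)
print(test_scaled)
```

After this step the training column has mean 0 and standard deviation 1, so a feature measured on a larger scale no longer dominates the distance calculation. The test column will usually be close to, but not exactly at, those values.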

Once the model is trained, we can assess how well it generalizes by making predictions on the held-out test set. We'll report accuracy, the macro-averaged F1-score (the unweighted mean of the per-class F1-scores, so each class counts equally regardless of support), and a detailed classification report to understand how well our model distinguishes between the three iris species.

By printing the confusion matrix alongside these scores, we gain a deeper look into which classes are most frequently misclassified. This analysis is crucial for diagnosing the strengths and limitations of our chosen `k` value, and will also serve as the foundation for any hyperparameter tuning we perform to further optimize the model.

We'll wrap this entire process into a reusable function to streamline experimentation and then do a detailed analysis of the results and consider optimization.


In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score

def train_knn_model(df, k):
    
    # separate features and target
    X = df.drop(columns=['Species'])
    y = df['Species']
    
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size = 0.2, random_state = 15, stratify = y
    )
    
    # scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # fit k-NN model
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    
    y_pred = knn.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='macro')
    
    print(f"Model trained with k={k}")
    print(f"Accuracy on test set: {acc:.2f}")
    print(f"Macro F1-score: {f1:.2f}")
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    return f1
    
# store the result under a new name so the imported f1_score function is not overwritten
baseline_f1 = train_knn_model(iris_clean, 5)

Model trained with k=5
Accuracy on test set: 0.90
Macro F1-score: 0.90

Confusion Matrix:
[[10  0  0]
 [ 0 10  0]
 [ 0  3  7]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.77      1.00      0.87        10
           2       1.00      0.70      0.82        10

    accuracy                           0.90        30
   macro avg       0.92      0.90      0.90        30
weighted avg       0.92      0.90      0.90        30



Our initial k-NN model, trained on the standardized Iris dataset with `k = 5`, reached an accuracy of 0.90 and a macro F1-score of 0.90 on the 30-sample test set. The confusion matrix shows that all 10 *setosa* and all 10 *versicolor* samples were classified correctly, while 3 of the 10 *virginica* samples were predicted as *versicolor*. That accounts for the lower precision on class 1 (0.77) and the lower recall on class 2 (0.70). These results show that even a straightforward distance-based approach like k-NN performs well when paired with appropriate preprocessing and scaling, while leaving room for improvement in separating *versicolor* from *virginica*.
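
As a check on how the reported scores relate to each other, the per-class and macro F1 values can be recomputed by hand from the confusion matrix printed above. This is a plain-Python sketch of the arithmetic, not how `scikit-learn` computes it internally:

```python
# Confusion matrix from the k=5 run: rows are true classes, columns are predictions
cm = [
    [10, 0, 0],
    [0, 10, 0],
    [0, 3, 7],
]

f1_per_class = []
for c in range(len(cm)):
    true_pos = cm[c][c]
    predicted_c = sum(row[c] for row in cm)  # column sum: everything predicted as c
    actual_c = sum(cm[c])                    # row sum: everything truly in class c
    precision = true_pos / predicted_c
    recall = true_pos / actual_c
    f1_per_class.append(2 * precision * recall / (precision + recall))

# Macro average: unweighted mean of the per-class F1-scores
macro_f1 = sum(f1_per_class) / len(f1_per_class)
print([round(f, 2) for f in f1_per_class], round(macro_f1, 2))  # [1.0, 0.87, 0.82] 0.9
```

The values agree with the classification report: class 1 loses precision (10 of 13 predictions correct) and class 2 loses recall (7 of 10 samples recovered).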

However, the value of `k` plays a crucial role in how well k-NN generalizes. A very small `k` makes predictions sensitive to individual noisy points (overfitting), while a very large `k` smooths over genuine class boundaries (underfitting). To ensure we have the most robust classifier possible, we’ll next explore optimizing `k` using the macro F1-score as our objective metric.


## 4. Optimization  

In this section, we'll systematically vary `k` across a reasonable range of values, training a separate k-NN model for each. By recording the resulting macro F1-scores, we can identify the value of `k` that achieves the best balance between precision and recall across all classes.

Finally, we'll visualize this relationship by plotting F1-score versus `k`, allowing us to clearly see how model performance changes as we adjust the number of neighbors. This will guide us to select the optimal `k` for our final, tuned model.
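
A minimal sketch of the sweep described above, under a few assumptions: it uses the built-in `load_iris()` copy as a stand-in for `iris_clean` so the block is self-contained, it reuses the split settings from Section 3, and the range `1` to `20` and the names `k_values`, `macro_f1_by_k`, and `best_k` are illustrative choices rather than the notebook's own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for iris_clean: the built-in copy of the dataset (assumption)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=15, stratify=y
)

k_values = range(1, 21)  # assumed search range
macro_f1_by_k = {}
for k in k_values:
    # The pipeline fits the scaler on the training data only, then the classifier
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    macro_f1_by_k[k] = f1_score(y_test, model.predict(X_test), average="macro")

# max() returns the smallest k among ties because the dict is in ascending order
best_k = max(macro_f1_by_k, key=macro_f1_by_k.get)
print(f"Best k: {best_k} (macro F1 = {macro_f1_by_k[best_k]:.2f})")
```

The dictionary's keys and values can be passed directly to a line plot to draw F1-score against `k`. Note that choosing `k` by its test-set score tends to give an optimistic estimate; cross-validating on the training set is a common alternative.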