# Iris Flower Classification with Scikit-Learn

![Iris](https://github.com/featurestoreorg/serverless-ml-course/raw/main/src/01-module/assets/iris.png)


In this notebook we will:

1. Generate synthetic data for Iris flowers
2. Split the training data into train and test sets (one set each for features and labels)
3. Encode the labels
4. Train a KNN model using Scikit-Learn
5. Evaluate model performance on the test set

In [19]:
import random
import pandas as pd

def generate_flower(name, sepal_len_max, sepal_len_min, sepal_width_max, sepal_width_min, 
                    petal_len_max, petal_len_min, petal_width_max, petal_width_min):

    df = pd.DataFrame({ "sepal_length": [random.uniform(sepal_len_max, sepal_len_min)],
                       "sepal_width": [random.uniform(sepal_width_max, sepal_width_min)],
                       "petal_length": [random.uniform(petal_len_max, petal_len_min)],
                       "petal_width": [random.uniform(petal_width_max, petal_width_min)]
                      })
    df['variety'] = name
    return df


virginica_df = generate_flower("virginica", 8, 5.5, 3.8, 2.2, 7, 4.5, 2.5, 1.4)
versicolor_df = generate_flower("versicolor", 7.5, 4.5, 3.5, 2.1, 5.5, 3.1, 1.8, 1.0)
setosa_df =  generate_flower("setosa", 6, 4.5, 4.5, 2.3, 1.2, 2, 0.7, 0.3)

# randomly pick one of these 3 flowers
pick_random = random.uniform(0,3)
if pick_random >= 2:
    iris_df = virginica_df
elif pick_random >= 1:
    iris_df = versicolor_df
else:
    iris_df = setosa_df

iris_df

| | sepal_length | sepal_width | petal_length | petal_width | variety |
|---|---|---|---|---|---|
| 0 | 7.929584 | 3.133859 | 4.961827 | 1.688672 | virginica |


In [22]:
features = iris_df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
labels = iris_df[["variety"]]
features

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 7.929584 | 3.133859 | 4.961827 | 1.688672 |


In [23]:
labels

| | variety |
|---|---|
| 0 | virginica |
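
The cells below refer to **X_train**, **X_test**, **y_train** and **y_test**, which this notebook does not define, and a single generated row is not enough to split. The following is a minimal sketch of one way to produce them, assuming we sample 50 flowers per variety from the same ranges used above. The helper `sample_flowers`, the `specs` dictionary, the sample size, the seed and the 80/20 split are all illustrative assumptions, not part of the original notebook.

```python
import random

import pandas as pd
from sklearn.model_selection import train_test_split

random.seed(42)  # fixed seed so the sketch is repeatable


def sample_flowers(name, ranges, n):
    """Draw n synthetic flowers; `ranges` maps a feature name to (low, high)."""
    rows = [{col: random.uniform(low, high) for col, (low, high) in ranges.items()}
            for _ in range(n)]
    df = pd.DataFrame(rows)
    df["variety"] = name
    return df


# Ranges mirror the arguments passed to generate_flower above.
specs = {
    "virginica": {"sepal_length": (5.5, 8), "sepal_width": (2.2, 3.8),
                  "petal_length": (4.5, 7), "petal_width": (1.4, 2.5)},
    "versicolor": {"sepal_length": (4.5, 7.5), "sepal_width": (2.1, 3.5),
                   "petal_length": (3.1, 5.5), "petal_width": (1.0, 1.8)},
    "setosa": {"sepal_length": (4.5, 6), "sepal_width": (2.3, 4.5),
               "petal_length": (1.2, 2), "petal_width": (0.3, 0.7)},
}
many_df = pd.concat([sample_flowers(name, r, 50) for name, r in specs.items()],
                    ignore_index=True)

features = many_df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
labels = many_df["variety"]  # a 1-D Series, which LabelEncoder expects

# Hold out 20% of the rows for testing; stratify keeps the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)
print(X_train.shape, X_test.shape)
```

Stratifying on the labels is a design choice here: it keeps all three varieties represented in both the train and test sets.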


Now we will do some **Feature Engineering**. 

We will transform the label from a categorical variable (a string) into a numerical variable (an int). Many machine learning training algorithms only take numerical values as inputs for training (and inference).

We can see that our original labels (**y_train** and **y_test**) are categorical variables. We will use Scikit-Learn's **LabelEncoder** to transform the strings into numbers.
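
The encoding step itself might look like the sketch below. It uses short toy lists in place of the real **y_train** and **y_test** so that it runs on its own; in the notebook these would come from the train/test split. The variable names `le`, `y_train_encoded` and `y_test_encoded` match those used in the following cells.

```python
from sklearn.preprocessing import LabelEncoder

# Toy string labels standing in for y_train and y_test (assumed values).
y_train = ["setosa", "virginica", "versicolor", "setosa", "virginica"]
y_test = ["versicolor", "setosa", "virginica"]

le = LabelEncoder()
# Learn the string -> int mapping from the training labels...
y_train_encoded = le.fit_transform(y_train)
# ...then reuse the same mapping for the test labels.
y_test_encoded = le.transform(y_test)

print(le.classes_)       # classes are stored in sorted order
print(y_train_encoded)
print(y_test_encoded)
```

Fitting the encoder once and reusing it for the test labels guarantees that both sets share the same string-to-integer mapping.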

We can see that **y_test_encoded** has been transformed into a numerical variable (an int). **y_train_encoded** has been similarly transformed.

In [None]:
y_test_encoded[0:8]

Now, we can fit a model to our features and labels from our training set (**X_train** and **y_train_encoded**). Fitting a model to a dataset is more commonly called "training a model".

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=4)
model.fit(X_train, y_train_encoded)
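
To give some intuition for what a k-nearest-neighbors classifier does at prediction time, here is a hand-rolled sketch of the idea: find the k closest training points and take a majority vote over their labels. This is an illustration only, not Scikit-Learn's implementation; the function `knn_predict`, the toy data and the choice of `k=3` (odd, to avoid ties) are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every stored training point
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


# Toy data: (petal_length, petal_width) pairs with encoded labels 0 and 1
X_toy = np.array([[1.4, 0.2], [1.5, 0.3], [1.3, 0.2],
                  [4.5, 1.5], [4.7, 1.4], [4.4, 1.3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.6, 0.4])))
print(knn_predict(X_toy, y_toy, np.array([4.6, 1.5])))
```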

Now, we have trained our model. We can evaluate our model on the test set (**X_test**) to estimate its performance.
Notice that the model was trained to output the encoded labels (numbers).

In [None]:
y_pred_encoded = model.predict(X_test)
y_pred_encoded

We can look at the predicted flower names by inverse transforming our numerical predictions back into their original string form. To perform the inverse transform, we need the **le** (LabelEncoder) object used to perform the original categorical-to-numerical mapping.

In [None]:
le.inverse_transform(y_pred_encoded)

We can report on how accurate these predictions (**y_pred_encoded**) are compared to the labels (the actual results - **y_test_encoded**). 

In [None]:
from sklearn.metrics import classification_report

metrics = classification_report(y_test_encoded, y_pred_encoded)
print(metrics)

In [None]:
from sklearn.metrics import confusion_matrix

results = confusion_matrix(y_test_encoded, y_pred_encoded)
print(results)
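
To help read the matrix, here is a small sketch on made-up labels (not the notebook's data). In Scikit-Learn's confusion matrix, rows are the true classes and columns are the predicted classes, so the diagonal counts correct predictions and accuracy is the diagonal sum divided by the total.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up encoded labels: one true class-1 flower is mispredicted as class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy from the matrix (correct / total) matches accuracy_score.
accuracy_from_cm = cm.trace() / cm.sum()
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```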

In [None]:
import seaborn as sns
from matplotlib import pyplot

# Setosa = 0, Versicolor = 1, Virginica = 2

df_cm = pd.DataFrame(results, ['True Setosa', 'True Versicolor', 'True Virginica'],
                     ['Pred Setosa', 'Pred Versicolor', 'Pred Virginica'])

sns.heatmap(df_cm, annot=True)

## Homework task

Rewrite the last two cells, but instead of computing the *classification_report* and the *confusion_matrix* with the encoded labels, use the unencoded labels. 

Are the results the same? Why?

In [None]:
from sklearn.metrics import confusion_matrix

predictions_untransformed = le.inverse_transform(y_pred_encoded)
results = confusion_matrix(y_test, predictions_untransformed)
print(results)
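
One way to check your answer is to compute both matrices on a small made-up example and compare them. The sketch below assumes toy labels rather than the notebook's data. It relies on two behaviors: `LabelEncoder` stores its classes in sorted order, and `confusion_matrix` orders its rows and columns by the sorted labels it finds.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Made-up string labels and predictions (assumed values).
y_test = ["setosa", "versicolor", "virginica", "virginica", "versicolor"]
y_pred = ["setosa", "virginica", "virginica", "virginica", "versicolor"]

le = LabelEncoder().fit(y_test)

cm_strings = confusion_matrix(y_test, y_pred)
cm_encoded = confusion_matrix(le.transform(y_test), le.transform(y_pred))

print(cm_strings)
print(np.array_equal(cm_strings, cm_encoded))
```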

In [None]:
!pip install gradio --quiet
!pip install typing-extensions==4.3.0

In [None]:
import gradio as gr
import numpy as np


def iris(sl, sw, pl, pw):
    # scikit-learn expects a 2D array of shape (n_samples, n_features)
    res = model.predict(np.asarray([sl, sw, pl, pw]).reshape(1, -1))
    # Convert the numerical representation of the label back to its original iris flower name.
    # le.inverse_transform returns an array of flower names with only 1 entry, so we index it
    # with '[0]' to return only the iris flower name (not the array)
    return le.inverse_transform(res)[0]

demo = gr.Interface(
    fn=iris,
    title="Iris Flower Predictive Analytics",
    description="Experiment with sepal/petal lengths/widths to predict which flower it is.",
    allow_flagging="never",
    inputs=[
        gr.Number(value=1.0, label="sepal length (cm)"),
        gr.Number(value=1.0, label="sepal width (cm)"),
        gr.Number(value=1.0, label="petal length (cm)"),
        gr.Number(value=1.0, label="petal width (cm)"),
        ],
    outputs="text")

demo.launch(share=True)
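
The `reshape(1, -1)` call in the `iris` function is easy to overlook. Scikit-Learn estimators expect a 2D array of shape `(n_samples, n_features)`, so a single flower has to be passed as one row with four columns. The sketch below shows the shapes involved; the measurement values are made up.

```python
import numpy as np

# Made-up measurements for a single flower
sl, sw, pl, pw = 5.1, 3.5, 1.4, 0.2

flat = np.asarray([sl, sw, pl, pw])  # 1-D: four values, no sample axis
row = flat.reshape(1, -1)            # 2-D: one sample with four features

print(flat.shape, row.shape)
```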