# Iris Flower Classification with Scikit-Learn

![Iris](https://github.com/featurestoreorg/serverless-ml-course/raw/main/src/01-module/assets/iris.png)


In this notebook we will:

1. Load the Iris Flower dataset into Pandas from a CSV file
2. Split the data into train and test sets (one set each for features and labels)
3. Encode the label
4. Train a KNN model using Scikit-Learn
5. Evaluate model performance on the test set

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import seaborn as sns

We will download the raw Iris data, rather than a transformed version that is already prepared for training.

So, let's download the Iris dataset and preview some rows.

Note that it is tabular data. There are 5 columns: 4 of them are features, and the "variety" column is the **target** (what we are trying to predict from the 4 feature values in the same row).

In [None]:
iris_df = pd.read_csv("https://repo.hops.works/master/hopsworks-tutorials/data/iris.csv")
iris_df.sample(10)
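
To make the features/target layout concrete, here is a minimal sketch that parses a few rows with the same column names using only the standard library. The three rows are hand-picked for illustration (one per variety) and are not the full dataset. The names `SAMPLE_CSV`, `FEATURE_COLUMNS`, and `TARGET_COLUMN` are assumptions made for this example.

```python
import csv
import io

# A few illustrative rows in the same layout as iris.csv (not the full dataset).
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,variety
5.1,3.5,1.4,0.2,Setosa
7.0,3.2,4.7,1.4,Versicolor
6.3,3.3,6.0,2.5,Virginica
"""

FEATURE_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
TARGET_COLUMN = "variety"

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Features are the numeric inputs; the target is the class name we want to predict.
sample_features = [[float(row[col]) for col in FEATURE_COLUMNS] for row in rows]
sample_targets = [row[TARGET_COLUMN] for row in rows]

for x, y in zip(sample_features, sample_targets):
    print(x, "->", y)
```

Each printed line pairs the 4 feature values of a row with the target from that same row.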

We can see that our 3 classes of iris flowers have different *petal_length* distributions
(although the ranges of Versicolor and Virginica overlap somewhat).

In [None]:
sns.set(style='white', color_codes=True)

sns.boxplot(x='variety', y='petal_length', data=iris_df)

In [None]:
sns.set(style='white', color_codes=True)

sns.boxplot(x='variety', y='sepal_length', data=iris_df)

In [None]:
sns.set(style='white', color_codes=True)

sns.boxplot(x='variety', y='petal_width', data=iris_df)

In [None]:
sns.set(style='white', color_codes=True)

sns.boxplot(x='variety', y='sepal_width', data=iris_df)

We need to split our DataFrame into two DataFrames:

* The **features** DataFrame will contain the inputs for training/inference.
* The **labels** DataFrame will contain the target we are trying to predict.

Note that the ordering of the rows is preserved between the features and labels. For example, row 40 in the **features** DataFrame contains the features for row 40 in the **labels** DataFrame. That is, the row index acts like a common "join key" between the two DataFrames.

In [None]:
features = iris_df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
labels = iris_df[["variety"]]
features

In [None]:
labels

Next, we split our features and labels into a **train_set** and a **test_set**. The model is trained only on the train_set and then evaluated on the test_set, which it did not see during training. This estimates how accurately the model predicts on data it has not seen before.

The variables are named as follows:

* **X_** is a matrix of features (one row per flower, one column per feature), so **X_train** holds the features from the **train_set**.
* **y_** is a vector of labels (one value per flower), so **y_train** holds the labels from the **train_set**.

Note: a matrix is a two-dimensional array of values and a vector is a one-dimensional array of values.

Note: by mathematical convention, a matrix is denoted by an uppercase letter (hence "X") and a vector is denoted by a lowercase letter (hence "y").

**X_test** is the features and **y_test** is the labels from our holdout **test_set**. The **test_set** is used to evaluate model performance after the model has been trained.


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
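
The idea behind a shuffled train/test split can be sketched in plain Python. This is a simplified illustration, not Scikit-Learn's implementation: the helper `shuffle_split` and its `seed` argument are hypothetical names, and the rounding of the test-set size may differ from `train_test_split`. The point it shows is that features and labels are selected with the same shuffled indices, so the rows stay aligned.

```python
import random

def shuffle_split(features, labels, test_size=0.2, seed=None):
    """Split two parallel lists into train and test parts, keeping rows aligned."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    # The same index lists select from both features and labels.
    return (
        [features[i] for i in train_idx],
        [features[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )

# Toy data: each label names its row, so alignment is easy to check.
toy_X = [[i, i * 10] for i in range(10)]
toy_y = [f"row-{i}" for i in range(10)]

X_tr, X_te, y_tr, y_te = shuffle_split(toy_X, toy_y, test_size=0.2, seed=42)
print(len(X_tr), len(X_te))
```

Because the split is random, results vary from run to run unless the shuffle is seeded. In Scikit-Learn, `train_test_split` accepts a `random_state` argument for the same purpose.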

Now we will do some **Feature Engineering**. 

We will transform the label from a categorical variable (a string) into a numerical variable (an int). Many machine learning training algorithms only take numerical values as inputs for training (and inference).

We can see that our original labels (**y_train** and **y_test**) are categorical variables. We will use Scikit-Learn's **LabelEncoder** to transform the strings into numbers.

In [None]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
y_train_encoded = le.fit_transform(y_train['variety'])
y_test_encoded = le.transform(y_test['variety'])
y_test.head(8)

We can see that **y_test_encoded** has been transformed into a numerical variable (an int). **y_train_encoded** has been similarly transformed.

In [None]:
y_test_encoded[0:8]
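
Conceptually, label encoding is a lookup table from class names to integers. The sketch below is a toy stand-in written for illustration, not Scikit-Learn's code. The class `SimpleLabelEncoder` is a hypothetical name, and it assumes classes are numbered in sorted order, which matches how **LabelEncoder** orders its `classes_`.

```python
class SimpleLabelEncoder:
    """Toy label encoder: sorted class names map to 0..n-1."""

    def fit(self, labels):
        self.classes_ = sorted(set(labels))
        self._index = {name: i for i, name in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        # This sketch raises KeyError for a label that was not seen during fit.
        return [self._index[name] for name in labels]

    def inverse_transform(self, codes):
        return [self.classes_[code] for code in codes]

encoder = SimpleLabelEncoder().fit(["Virginica", "Setosa", "Versicolor", "Setosa"])
codes = encoder.transform(["Setosa", "Virginica", "Versicolor"])
print(codes)                             # [0, 2, 1]
print(encoder.inverse_transform(codes))  # ['Setosa', 'Virginica', 'Versicolor']
```

The mapping is learned once (on the training labels) and then reused for the test labels, so that the same number always means the same variety.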

Now, we can fit a model to our features and labels from our training set (**X_train** and **y_train_encoded**). Fitting a model to a dataset is more commonly called "training a model".

In [None]:
model = KNeighborsClassifier(n_neighbors=4)
model.fit(X_train, y_train_encoded) 
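
For intuition, a k-nearest-neighbours prediction can be sketched in a few lines: find the k training points closest to the query and take a majority vote over their labels. This is a simplified illustration under stated assumptions (Euclidean distance, unweighted votes, brute-force search), not the **KNeighborsClassifier** implementation. The function `knn_predict` and the toy points are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=4):
    """Predict a label by majority vote among the k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # In this sketch, a tied vote goes to the label encountered first (the closer neighbour).
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D points loosely resembling (petal_length, petal_width) for three classes.
toy_points = [
    [1.4, 0.2], [1.5, 0.2], [1.3, 0.3],   # class 0
    [4.7, 1.4], [4.5, 1.5], [4.9, 1.5],   # class 1
    [6.0, 2.5], [5.9, 2.1], [5.6, 1.8],   # class 2
]
toy_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

print(knn_predict(toy_points, toy_labels, [1.4, 0.3]))  # 0
print(knn_predict(toy_points, toy_labels, [4.6, 1.4]))  # 1
print(knn_predict(toy_points, toy_labels, [6.1, 2.3]))  # 2
```

Note that "training" a KNN model amounts to storing the training data. The distance computations happen at prediction time.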

Now, we have trained our model. We can evaluate our model on the **test_set** to estimate its performance.
Notice that the model was trained to output the encoded labels (numbers).

In [None]:
y_pred_encoded = model.predict(X_test)
y_pred_encoded

We can look at the predicted flower names by inverse transforming our numerical predictions back into their original string form. To perform the inverse transform, we need the **le** (LabelEncoder) object used to perform the original categorical-to-numerical mapping.

In [None]:
le.inverse_transform(y_pred_encoded)

We can report on how accurate these predictions (**y_pred_encoded**) are compared to the labels (the actual results - **y_test_encoded**). 

In [None]:
from sklearn.metrics import classification_report

metrics = classification_report(y_test_encoded, y_pred_encoded, output_dict=True)
print(metrics)
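
The numbers in such a report can be reproduced by hand on a small example. The sketch below computes accuracy and per-class precision and recall in plain Python. The helper names and the toy labels are assumptions made for illustration, and the zero-division handling here is a simplification.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, cls):
    """Precision and recall for one class, treating it as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    predicted = sum(p == cls for p in y_pred)   # how often cls was predicted
    actual = sum(t == cls for t in y_true)      # how often cls really occurred
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Toy encoded labels: 0 = Setosa, 1 = Versicolor, 2 = Virginica.
toy_true = [0, 0, 1, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 1, 2, 1, 2, 2, 1]

print(accuracy(toy_true, toy_pred))             # 0.75
print(precision_recall(toy_true, toy_pred, 0))  # (1.0, 1.0)
```

Precision answers "when the model predicted this class, how often was it right?", while recall answers "of the flowers that truly belong to this class, how many did the model find?".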

In [None]:
from sklearn.metrics import confusion_matrix

predictions_untransformed = le.inverse_transform(y_pred_encoded)
results = confusion_matrix(y_test, predictions_untransformed)
print(results)

In [None]:
from sklearn.metrics import confusion_matrix

results = confusion_matrix(y_test_encoded, y_pred_encoded)
print(results)
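
To read a confusion matrix, it helps to see how one is built. In the sketch below, each (true, predicted) pair increments one cell, with rows indexed by the true class and columns by the predicted class, the same orientation Scikit-Learn's `confusion_matrix` uses. The function `confusion_counts` and the toy labels are hypothetical, for illustration only.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Return an n_classes x n_classes table: rows are true classes, columns are predicted."""
    table = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        table[t][p] += 1
    return table

# Toy encoded labels: 0 = Setosa, 1 = Versicolor, 2 = Virginica.
toy_true = [0, 0, 1, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 1, 2, 1, 2, 2, 1]

toy_cm = confusion_counts(toy_true, toy_pred, n_classes=3)
for row in toy_cm:
    print(row)
# [2, 0, 0]
# [0, 2, 1]
# [0, 1, 2]
```

Correct predictions lie on the diagonal, so the sum of the diagonal divided by the total count gives the accuracy (here 6 / 8 = 0.75).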

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema
import os
import joblib
import hopsworks

project = hopsworks.login()
mr = project.get_model_registry()

# The 'iris_model' directory, containing these two serialized objects, will be saved to the model registry
model_dir = "iris_model"
os.makedirs(model_dir, exist_ok=True)
model_file = "knn_iris_model.pkl"
encoder_file = "knn_iris_encoder.pkl"

joblib.dump(model, os.path.join(model_dir, model_file))
joblib.dump(le, os.path.join(model_dir, encoder_file))

input_example = X_train.sample()
input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema, output_schema)

iris_model = mr.python.create_model(
    version=2,
    name="iris_model", 
    metrics={"accuracy" : metrics['accuracy']},
    model_schema=model_schema,
    input_example=input_example, 
    description="Iris Flower Predictor")

iris_model.save(model_dir)

In [None]:
from matplotlib import pyplot

# Setosa = 0, Versicolor = 1, Virginica = 2

df_cm = pd.DataFrame(results, ['True Setosa', 'True Versicolor', 'True Virginica'],
                     ['Pred Setosa', 'Pred Versicolor', 'Pred Virginica'])

sns.heatmap(df_cm, annot=True)

In [None]:
!pip install gradio --quiet
!pip install typing-extensions==4.3.0

In [None]:
import gradio as gr
import numpy as np
from PIL import Image
import urllib.request

url = "https://repo.hops.works/dev/jdowling/iris/"

def iris(sepal_length, sepal_width, petal_length, petal_width):
    input_list = [sepal_length, sepal_width, petal_length, petal_width]
    # 'res' is an array of predictions returned as the encoded label.
    # If the model predicted "Setosa", then res[0] == 0. For "Virginica", res[0] == 2.
    res = model.predict(np.asarray(input_list).reshape(1, -1))

    # inverse_transform converts the encoded label (the number) back to the original iris flower name.
    # We add '[0]' to the result of the inverse transform, because it is an array and we only want
    # the first element.
    flower = le.inverse_transform(res)[0] + ".png"

    # Now we can download a png file for that flower name from the Internet at 'url'
    urllib.request.urlretrieve(url + flower, flower)
    
    # We downloaded the png file into the same directory as this notebook, and can open the png file with PIL
    img = Image.open(flower)            
    return img
        
demo = gr.Interface(
    fn=iris,
    title="Iris Flower Predictive Analytics",
    description="Experiment with sepal/petal lengths/widths to predict which flower it is.",
    allow_flagging="never",
    inputs=[
        gr.Number(value=1.0, label="sepal length (cm)"),
        gr.Number(value=1.0, label="sepal width (cm)"),
        gr.Number(value=1.0, label="petal length (cm)"),
        gr.Number(value=1.0, label="petal width (cm)"),
        ],
    outputs=gr.Image(type="pil"))

demo.launch(share=True)