# Air Quality (AQI) Train and Publish Model


In this notebook we will:

1. Load the AQI dataset into randomly split (train/test) DataFrames using a feature view
2. Train a KNN model using scikit-learn
3. Evaluate model performance on the test set
4. Register the model with Hopsworks Model Registry

In [1]:
!pip install -U hopsworks --quiet

In [4]:
!pip install xgboost


In [1]:
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import seaborn as sns
import hopsworks

Let's first get a feature view for the AQI (air quality index) dataset, or create one if it does not already exist.
If you are running this notebook for the first time, it will create the feature view from six columns selected from the **aqi feature group**.

Of these 6 columns, 5 are "features", and the **aqi_category** column is the **label** (what we are trying to predict using the 5 feature values in the label's row). The label is often called the **target**.
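
As a rough illustration of the feature/label distinction, the sketch below builds a tiny pandas DataFrame by hand and separates the label column from the feature columns. The column names are the ones selected in the query in the next cell; the row values are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names come from this notebook.
df = pd.DataFrame({
    "aqi_value": [42, 120, 160],
    "co_aqi_value": [1, 2, 1],
    "ozone_aqi_value": [42, 30, 49],
    "no2_aqi_value": [0, 5, 2],
    "pm25_aqi_value": [38, 120, 160],
    "aqi_category": ["Good", "Unhealthy for Sensitive Groups", "Unhealthy"],
})

label_columns = ["aqi_category"]
features = df.drop(columns=label_columns)  # what the model receives as input
labels = df[label_columns]                 # what the model learns to predict

print(features.shape)  # (3, 5)
print(labels.shape)    # (3, 1)
```

The feature view performs this separation for us through its `labels` argument, so the notebook never has to drop the label column manually.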

In [2]:
project = hopsworks.login()
fs = project.get_feature_store()

try:
    # Reuse the feature view if it already exists
    feature_view = fs.get_feature_view(name="aqi", version=1)
except Exception:
    # Otherwise create it from the "aqi" feature group
    aqi_fg = fs.get_feature_group(name="aqi", version=1)
    query = aqi_fg.select([
        "aqi_value", "aqi_category", "co_aqi_value",
        "ozone_aqi_value", "no2_aqi_value", "pm25_aqi_value"
    ])
    feature_view = fs.create_feature_view(
        name="aqi",
        version=1,
        description="Read from AQI dataset with uuid",
        labels=["aqi_category"],
        query=query,
    )

2025-01-03 01:37:18,789 INFO: Initializing external client
2025-01-03 01:37:18,790 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-01-03 01:37:21,965 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1207459


We will read our features and labels split into a **train_set** and a **test_set**. You split your data into a train_set and a test_set, because you want to train your model on only the train_set, and then evaluate its performance on data that was not seen during training, the test_set. This technique helps evaluate the ability of your model to accurately predict on data it has not seen before.

We can ask the feature_view to return a **train_test_split** and it returns:

* **X_** is a matrix of features, so **X_train** holds the features from the **train_set** (one row per sample, one column per feature).
* **y_** is a vector of labels, so **y_train** holds the labels from the **train_set** (one label per sample).

Note: a matrix is a two-dimensional array of values and a vector is a one-dimensional array of values.

Note: by mathematical convention, a matrix is denoted by an uppercase letter (hence "X") and a vector by a lowercase letter (hence "y").

**X_test** is the features and **y_test** is the labels from our holdout **test_set**. The **test_set** is used to evaluate model performance after the model has been trained.
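
A minimal sketch of the same idea, using scikit-learn's `train_test_split` on a small synthetic DataFrame. This is only an analogy: it assumes the feature view performs a comparable random split, and the data below is invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 10 rows, 2 feature columns and 1 label column (all invented).
features = pd.DataFrame({"f1": range(10), "f2": range(10, 20)})
labels = pd.DataFrame({"label": ["a", "b"] * 5})

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8, 1) (2, 1)

# Rows stay aligned: each label keeps the index of its feature row.
assert (X_train.index == y_train.index).all()
```

No row appears in both sets, which is what makes the test set a fair estimate of performance on unseen data.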

In [3]:
X_train, X_test, y_train, y_test = feature_view.train_test_split(0.2)

Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (3.93s) 




In [4]:
y_train.shape
X_train.shape
X_test

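| | aqi_value | co_aqi_value | ozone_aqi_value | no2_aqi_value | pm25_aqi_value |
|---|---|---|---|---|---|
| 7 | 169 | 1 | 49 | 0 | 169 |
| 17 | 49 | 1 | 49 | 2 | 38 |
| 25 | 55 | 2 | 29 | 0 | 55 |
| 28 | 150 | 2 | 25 | 8 | 150 |
| 31 | 22 | 1 | 22 | 2 | 22 |
| ... | ... | ... | ... | ... | ... |
| 16355 | 58 | 1 | 31 | 2 | 58 |
| 16363 | 54 | 2 | 6 | 17 | 54 |
| 16366 | 72 | 1 | 45 | 0 | 72 |
| 16367 | 41 | 1 | 41 | 3 | 32 |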


Now, we can fit a model to our features and labels from our training set (**X_train** and **y_train**). 

Fitting a model to a dataset is more commonly called "training a model".

In [5]:
model = KNeighborsClassifier(n_neighbors=2)
model.fit(X_train, y_train.values.ravel())
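
The `.values.ravel()` call in the cell above is presumably there because `y_train` is a one-column DataFrame, while scikit-learn classifiers expect the labels as a one-dimensional array. A small sketch with invented labels shows the change in shape:

```python
import pandas as pd

# Invented labels in a one-column DataFrame, assumed to mirror the layout of y_train.
y = pd.DataFrame({"aqi_category": ["Good", "Moderate", "Good"]})

print(y.shape)                 # (3, 1): two-dimensional
print(y.values.ravel().shape)  # (3,): flattened to one dimension
```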

Now, we have trained our model. We can evaluate our model on the **test_set** to estimate its performance.

In [6]:
y_pred = model.predict(X_test)
y_pred

array(['Unhealthy', 'Good', 'Moderate', ..., 'Moderate', 'Good', 'Good'],
      dtype=object)
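
To make concrete what a KNN classifier does when it predicts, here is a simplified pure-Python sketch: find the k training points closest to the query and take a majority vote over their labels. The function name `knn_predict` and the sample points are hypothetical, and scikit-learn's implementation differs in its details (for example, search data structures and tie handling).

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented two-feature points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["Good", "Good", "Good", "Unhealthy", "Unhealthy", "Unhealthy"]

print(knn_predict(points, labels, (1.5, 1.5)))  # Good
print(knn_predict(points, labels, (8.5, 8.5)))  # Unhealthy
```

There is no separate "learning" step in this sketch: the training data itself is the model, which is why fitting a KNN classifier is fast and prediction does most of the work.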

We can report on how accurate these predictions (**y_pred**) are compared to the labels (the actual results - **y_test**). 

In [7]:
from sklearn.metrics import classification_report

metrics = classification_report(y_test, y_pred, output_dict=True)
print(metrics)

{'Good': {'precision': 0.9953795379537954, 'recall': 0.9993373094764745, 'f1-score': 0.9973544973544973, 'support': 1509.0}, 'Hazardous': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 15.0}, 'Moderate': {'precision': 0.9956521739130435, 'recall': 0.9934924078091106, 'f1-score': 0.99457111834962, 'support': 1383.0}, 'Unhealthy': {'precision': 0.9424083769633508, 'recall': 0.994475138121547, 'f1-score': 0.967741935483871, 'support': 181.0}, 'Unhealthy for Sensitive Groups': {'precision': 0.9869281045751634, 'recall': 0.9263803680981595, 'f1-score': 0.9556962025316456, 'support': 163.0}, 'Very Unhealthy': {'precision': 0.96, 'recall': 0.8571428571428571, 'f1-score': 0.9056603773584906, 'support': 28.0}, 'accuracy': 0.9917657822506862, 'macro avg': {'precision': 0.9800613655675589, 'recall': 0.9618046801080248, 'f1-score': 0.9701706885130207, 'support': 3279.0}, 'weighted avg': {'precision': 0.991869434757589, 'recall': 0.9917657822506862, 'f1-score': 0.9917041949029392, 's

In [8]:
from sklearn.metrics import confusion_matrix

results = confusion_matrix(y_test, y_pred)
print(results)

[[1508    0    1    0    0    0]
 [   0   15    0    0    0    0]
 [   7    0 1374    0    2    0]
 [   0    0    0  180    0    1]
 [   0    0    5    7  151    0]
 [   0    0    0    4    0   24]]


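Notice in the confusion matrix that almost all predictions fall on the diagonal (rows are the true categories, columns are the predicted categories).
Of the 3279 samples in our test set (**y_test**), only 27 are misclassified.
Most errors are between neighboring categories: for example, 7 "Unhealthy for Sensitive Groups" samples were predicted as "Unhealthy", and 7 "Moderate" samples were predicted as "Good".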
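
The figures in the classification report can be checked by hand from the confusion matrix. The sketch below copies the matrix printed above and recomputes accuracy, precision, and recall in plain Python. It assumes the rows and columns follow the sorted label order shown in the classification report, which is how scikit-learn orders them when no `labels` argument is given.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
labels = ["Good", "Hazardous", "Moderate", "Unhealthy",
          "Unhealthy for Sensitive Groups", "Very Unhealthy"]
matrix = [
    [1508, 0, 1, 0, 0, 0],
    [0, 15, 0, 0, 0, 0],
    [7, 0, 1374, 0, 2, 0],
    [0, 0, 0, 180, 0, 1],
    [0, 0, 5, 7, 151, 0],
    [0, 0, 0, 4, 0, 24],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal entries
accuracy = correct / total
print(total, correct, total - correct)  # 3279 3252 27
print(round(accuracy, 4))               # 0.9918

# Recall: correct predictions divided by the row total (all true samples of a class).
# Precision: correct predictions divided by the column total (all predictions of a class).
for i, name in enumerate(labels):
    recall = matrix[i][i] / sum(matrix[i])
    precision = matrix[i][i] / sum(row[i] for row in matrix)
    print(f"{name}: precision={precision:.4f} recall={recall:.4f}")
```

The row totals also reproduce the `support` values in the report, which is a quick way to confirm the label order.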

In [9]:
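from matplotlib import pyplot
import os

# Class names in the sorted order that confusion_matrix uses for its rows and columns
categories = ['Good', 'Hazardous', 'Moderate', 'Unhealthy',
              'Unhealthy for Sensitive Groups', 'Very Unhealthy']
df_cm = pd.DataFrame(results,
                     ['True ' + c for c in categories],
                     ['Pred ' + c for c in categories])

# fmt='d' prints the counts as integers instead of scientific notation
cm = sns.heatmap(df_cm, annot=True, fmt='d')

fig = cm.get_figure()
os.makedirs("assets", exist_ok=True)  # make sure the output directory exists
fig.savefig("assets/confusion_matrix.png")
fig.show()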

## Register the Model with Hopsworks Model Registry



In [9]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema
import os
import joblib
import shutil

# Reuse the `project` handle from the login cell above
mr = project.get_model_registry()

# The 'aqi_model' directory will be saved to the model registry
model_dir = "aqi_model"
os.makedirs(model_dir, exist_ok=True)
joblib.dump(model, os.path.join(model_dir, "aqi.pkl"))
# Uncomment once assets/confusion_matrix.png has been generated:
# shutil.copyfile("assets/confusion_matrix.png", model_dir + "/confusion_matrix.png")

# Describe the model's inputs and outputs using the training data
input_example = X_train.sample()
input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema, output_schema)

aqi_model = mr.python.create_model(
    version=1,
    name="aqi",
    metrics={"accuracy": metrics['accuracy']},
    model_schema=model_schema,
    input_example=input_example,
    description="Air Quality Predictor")

aqi_model.save(model_dir)



Model created, explore it at https://c.app.hopsworks.ai:443/p/1207459/models/aqi/1


Model(name: 'aqi', version: 1)
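
The file uploaded to the registry is an ordinary `joblib` pickle, so a quick local sanity check is to save a model and load it back before relying on it elsewhere. The sketch below does this with a small stand-in model trained on invented data and a temporary directory; it does not call the Hopsworks API, and the file name `aqi.pkl` is reused only for familiarity.

```python
import os
import tempfile

import joblib
from sklearn.neighbors import KNeighborsClassifier

# Invented training data with two features and two classes.
X = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
y = ["Good", "Good", "Good", "Unhealthy", "Unhealthy", "Unhealthy"]
model = KNeighborsClassifier(n_neighbors=2).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "aqi.pkl")
    joblib.dump(model, path)      # serialize the fitted model to disk
    restored = joblib.load(path)  # read it back into a new object

queries = [[0.5, 0.5], [9.5, 9.5]]
print(restored.predict(queries).tolist())  # ['Good', 'Unhealthy']
```

Loading a pickle generally requires a compatible scikit-learn version in the environment that reads it, so it is worth recording the library versions alongside the model.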