# Exercise: Development of a BentoML Service
## Iris Dataset Classification

This exercise consists of training an ML model to predict the species of a flower from the Iris dataset and then developing a Service around that model using BentoML.
This notebook covers data loading and model training experiments, which precede the Service development and deployment.

## Imports

In [58]:
from pycaret.classification import ClassificationExperiment
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

## Data

In [15]:
data = load_iris(as_frame=True)
print(data["DESCR"])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [18]:
df = data["frame"]
df

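| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |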


In [24]:
X = df.drop(columns=["target"])
y = df["target"]

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=7)

In [28]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(135, 4)
(15, 4)
(135,)
(15,)
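The split above is not stratified, so the 15 test samples need not contain the classes in equal proportion (the classification report further down shows supports of 5, 7 and 3). As a minimal sketch of one alternative, and not what this notebook actually ran, `train_test_split` accepts a `stratify` argument that keeps the class balance in both subsets. The `_s` suffixed names are introduced here only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Rebuild X and y the same way as in the cells above.
df = load_iris(as_frame=True)["frame"]
X = df.drop(columns=["target"])
y = df["target"]

# stratify=y keeps the 1/3 : 1/3 : 1/3 class ratio in both subsets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.1, random_state=7, stratify=y
)

# Number of test samples per class.
print(y_test_s.value_counts().sort_index())
```

With 150 balanced samples and `test_size=0.1`, each class contributes 5 samples to the test set.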


## Model 

The first tests use PyCaret's classification module to compare candidate models. Further tuning is then done with scikit-learn.

### PyCaret

In [37]:
pycaret_clf = ClassificationExperiment()
pycaret_clf.setup(df, target="target", session_id=7, train_size=0.8)

| | Description | Value |
|---|---|---|
| 0 | Session id | 7 |
| 1 | Target | target |
| 2 | Target type | Multiclass |
| 3 | Original data shape | (150, 5) |
| 4 | Transformed data shape | (150, 5) |
| 5 | Transformed train set shape | (120, 5) |
| 6 | Transformed test set shape | (30, 5) |
| 7 | Numeric features | 4 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


<pycaret.classification.oop.ClassificationExperiment at 0x7fdf9c1cece0>

In [39]:
best_pycaret_clf = pycaret_clf.compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| qda | Quadratic Discriminant Analysis | 0.975 | 1.0 | 0.975 | 0.98 | 0.9746 | 0.9625 | 0.9653 | 0.023 |
| lda | Linear Discriminant Analysis | 0.9667 | 1.0 | 0.9667 | 0.9733 | 0.9661 | 0.95 | 0.9537 | 0.025 |
| lr | Logistic Regression | 0.95 | 1.0 | 0.95 | 0.9622 | 0.9484 | 0.925 | 0.932 | 0.026 |
| knn | K Neighbors Classifier | 0.9417 | 0.9844 | 0.9417 | 0.9533 | 0.9407 | 0.9125 | 0.919 | 0.028 |
| gbc | Gradient Boosting Classifier | 0.9417 | 0.974 | 0.9417 | 0.9556 | 0.9399 | 0.9125 | 0.9205 | 0.09 |
| et | Extra Trees Classifier | 0.9417 | 0.9969 | 0.9417 | 0.9578 | 0.939 | 0.9125 | 0.9219 | 0.071 |
| nb | Naive Bayes | 0.9333 | 0.9979 | 0.9333 | 0.9511 | 0.9306 | 0.9 | 0.9104 | 0.023 |
| dt | Decision Tree Classifier | 0.9333 | 0.95 | 0.9333 | 0.9511 | 0.9306 | 0.9 | 0.9104 | 0.022 |
| rf | Random Forest Classifier | 0.9333 | 0.9958 | 0.9333 | 0.9511 | 0.9306 | 0.9 | 0.9104 | 0.076 |
| ada | Ada Boost Classifier | 0.9333 | 0.9906 | 0.9333 | 0.9511 | 0.9306 | 0.9 | 0.9104 | 0.05 |


### Sklearn

In [43]:
qda_clf = QuadraticDiscriminantAnalysis()

In [44]:
qda_clf.fit(X_train, y_train)

In [52]:
y_pred = qda_clf.predict(X_test)
y_pred

array([2, 1, 0, 1, 2, 0, 1, 1, 0, 1, 1, 1, 0, 2, 0])

In [54]:
print(classification_report(y_test, y_pred, target_names=data["target_names"]))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         5
  versicolor       1.00      1.00      1.00         7
   virginica       1.00      1.00      1.00         3

    accuracy                           1.00        15
   macro avg       1.00      1.00      1.00        15
weighted avg       1.00      1.00      1.00        15
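
A perfect score on 15 held-out samples is weak evidence on its own, so a cross-validated estimate can complement the report above. The following is a minimal sketch, assuming 5 stratified folds and reusing seed 7 from the split above. The fold count and the `cv` and `scores` names are choices made here, not part of the original experiment.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Rebuild X and y the same way as in the Data section.
df = load_iris(as_frame=True)["frame"]
X = df.drop(columns=["target"])
y = df["target"]

# 5 stratified folds: every fold keeps the balanced class distribution.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# A fresh QDA estimator is fitted on each training fold and scored
# on the corresponding validation fold.
scores = cross_val_score(
    QuadraticDiscriminantAnalysis(), X, y, cv=cv, scoring="accuracy"
)

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean(), "std:", scores.std())
```

The mean over folds should be read alongside the PyCaret comparison table, which is also based on cross-validation but on a different train/test split.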



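Quadratic Discriminant Analysis was the best-performing model: it ranked first in the PyCaret comparison (accuracy 0.975) and classified all 15 held-out test samples correctly with scikit-learn. This model will now be retrained on the whole dataset and deployed as a service using BentoML.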
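
Retraining on the whole dataset could look like the sketch below. The `final_clf` variable, the `save_to_bentoml` helper and the model name `"iris_qda"` are hypothetical, introduced here for illustration. The helper assumes BentoML 1.x is installed and relies on its `bentoml.sklearn.save_model` function. The Service definition itself is not shown, since it lives outside this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Use all 150 samples: no hold-out set is kept for the final model.
df = load_iris(as_frame=True)["frame"]
X = df.drop(columns=["target"])
y = df["target"]

final_clf = QuadraticDiscriminantAnalysis()
final_clf.fit(X, y)


def save_to_bentoml(model, name="iris_qda"):
    """Store a fitted scikit-learn model in the local BentoML model store.

    Hypothetical helper: assumes BentoML 1.x is installed. The import is
    done lazily so the rest of this cell runs without BentoML.
    """
    import bentoml

    # save_model returns a BentoML model object; its tag identifies the
    # stored version and can be used to load the model in a Service.
    return bentoml.sklearn.save_model(name, model)
```

Calling `save_to_bentoml(final_clf)` would register the model locally, after which a Service can load it by name.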