# Iris Flower Classification

<img src="./img/iris_small.png" alt="iris flower" width="200"/>


Before we get started, please make sure you have Jupyter Notebook installed and running.

You have three options for running Jupyter Notebook:
1. Run Jupyter Notebook Locally
1. Run Jupyter Notebook on Kubeflow
1. Run Jupyter Notebook via Colaboratory


#### Run Jupyter Notebook Locally
Install Jupyter Notebook with pip:
```
pip install notebook
```

After installation, run the following command from the directory where you want to create your notebook:
```
jupyter notebook
```

#### Run Jupyter Notebook on Kubeflow
This option is for those who have a full Kubeflow installation.
- Navigate to [localhost:8080](http://localhost:8080)
- Sign in with username `user@example.com` and password `12341234`
- Click "Notebook" in the menu
- Click "New server"
- Enter the name "iris-demo" and click "Launch"
- Wait a few seconds for the server to start
- Once the "connect" button becomes available, click it to open the notebook
- Choose "python3" under "Notebook"

#### Run Jupyter Notebook via Colaboratory
- Navigate to [https://research.google.com/colaboratory/](https://research.google.com/colaboratory/)
- Log in with your Google account
- Click "New Notebook"

### Goal
Using the [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html), design a model that learns from the measurements of iris flowers to **predict the species of new irises**.

### Approach


![Iris Problem Statement](./img/problem_statement.png)

|    |    |
| -- | -- |
| **Features** | Petal width & length and Sepal width & length |
| **Labels** | Setosa, Versicolor, and Virginica |
| **Classifiers** | Decision Tree and K-Nearest Neighbors |

### High Level End-to-End ML Workflow

1. Collect Data
1. Prepare Data
1. Choose the model
1. Train the model
1. Evaluate the model's performance and predictions
1. Tune hyperparameters
1. Predict / Inference
1. Monitor the model
1. Iterate on the steps and adjust the model

### Step 1 - Collect Data

For the workshop, we'll load a built-in dataset from the scikit-learn library.

- What is **[scikit learn](https://scikit-learn.org/stable/)**?
- Do you have scikit-learn installed?
```
! python3 -m pip show scikit-learn
```
- Install scikit-learn
```
! python3 -m pip install -U scikit-learn
```

In [None]:
# Run terminal commands in the notebook cells by prepending with an exclamation mark

# check to see if scikit-learn is installed
! python3 -m pip show scikit-learn

In [None]:
# install scikit-learn
! python3 -m pip install -U scikit-learn

In [None]:
from sklearn import datasets

iris_dataset = datasets.load_iris()
iris_dataset
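
The object returned by `load_iris()` behaves like a dictionary. As a minimal sketch of how to inspect it (the printed attributes are the ones scikit-learn documents for this dataset), the following lists the keys and shows where the features and labels live:

```python
from sklearn import datasets

iris_dataset = datasets.load_iris()

# The returned object behaves like a dictionary; list the available keys
print(list(iris_dataset.keys()))

# Feature matrix: one row per flower, one column per measurement
print(iris_dataset.data.shape)  # (150, 4)
print(iris_dataset.feature_names)

# Labels are integer codes that index into target_names
print(iris_dataset.target_names)
print(iris_dataset.target[:5])
```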

### Step 2 - Prepare Data

Some of the steps involved in data preparation are data discovery, data cleaning, data transformation, data sampling, and data splitting.


Let's learn some basic skills for exploring a dataset with pandas!
- What is **[pandas](https://pandas.pydata.org/)**?
- Do you have pandas installed?
```
! python3 -m pip show pandas
```
- Install pandas
```
! python3 -m pip install -U pandas
```

In [None]:
! python3 -m pip show pandas

In [None]:
! python3 -m pip install -U pandas

- Let's try to answer these questions
    - [ ] How many columns?
    - [ ] What are the names of the columns?
    - [ ] How many rows exist with label "Setosa"?
    - [ ] Which label does min petal length belong to?
    - [ ] Which label does max petal length belong to?
    - [ ] What are the input variables and output variables?

In [None]:
import pandas as pd

df_features = pd.DataFrame(data=iris_dataset.data, columns=iris_dataset.feature_names)
df_features.head()

- [x] How many columns?
- [x] What are the names of the columns?
- [ ] How many rows exist with label "Setosa"?
- [ ] Which label does min petal length belong to?
- [ ] Which label does max petal length belong to?
- [ ] What are the input variables and output variables?

In [None]:
# let's look at the dataset again, which key do we need to get the labels?
iris_dataset

In [None]:
df_features_and_target = df_features.copy()

# add a new column target
df_features_and_target['target'] = pd.Series(iris_dataset.target)

# add a new column species
df_features_and_target['species'] = pd.Categorical.from_codes(iris_dataset.target, iris_dataset.target_names)

# show the first 5 rows
df_features_and_target.head(5)

In [None]:
df_features_and_target["species"].value_counts()


In [None]:
temp = df_features_and_target.copy()

temp[temp["petal length (cm)"] == temp["petal length (cm)"].min()].species


In [None]:
temp[temp["petal length (cm)"] == temp["petal length (cm)"].max()].species


- [x] How many columns?
- [x] What are the names of the columns?
- [x] How many rows exist with label "Setosa"?
- [x] Which label does min petal length belong to?
- [x] Which label does max petal length belong to?
- [ ] What are the input variables and output variables?

```
df.shape
df.describe()
```
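
As a small, self-contained sketch of these two calls (the variable name `df` is just a stand-in for the features DataFrame built earlier), note that `shape` is an attribute while `describe()` is a method:

```python
import pandas as pd
from sklearn import datasets

iris_dataset = datasets.load_iris()
df = pd.DataFrame(data=iris_dataset.data, columns=iris_dataset.feature_names)

# shape is an attribute holding (rows, columns), so no parentheses
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# describe() reports count, mean, std, min, quartiles, and max per column
summary = df.describe()
print(summary)
```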

- Check out more about [pandas dataframes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
- Check out [pandas cheatsheet](https://www.educative.io/blog/pandas-cheat-sheet)
- Check out [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)

In [None]:
# outcome of data exploration - X, y

def download_data():
    iris_dataset = datasets.load_iris()
    
    # do some preprocessing magic if needed!
    
    # X contains the iris features
    # (Sepal length, Sepal width, Petal length, Petal width)
    X = iris_dataset.data
    
    # y contains the labels
    # (0 for Setosa, 1 for Versicolor, or 2 for Virginica)
    y = iris_dataset.target
    
    return X, y

In [None]:
X, y = download_data()

In [None]:
# Final Outcome of Step 2

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

- **X_train** contains the training features (Sepal length, Sepal width, Petal length, Petal width)
- **X_test** contains the testing features (Sepal length, Sepal width, Petal length, Petal width)

- **y_train** contains training labels (0 for Setosa, 1 for Versicolor, or 2 for Virginica)
- **y_test** contains the testing labels (0 for Setosa, 1 for Versicolor, or 2 for Virginica)
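
The split above is shuffled differently on every run. As a hedged variation (the seed value `42` is an arbitrary choice, not something the workshop prescribes), passing `random_state` makes the split reproducible and `stratify=y` keeps the three species balanced across both sets:

```python
from collections import Counter

from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# random_state fixes the shuffle so the split is the same on every run;
# stratify=y keeps the species proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)

# Count how many test samples belong to each species code
test_counts = Counter(y_test.tolist())
print(test_counts)
```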

In [None]:
print('There are {} samples in the training set and {} samples in the test set'.format(X_train.shape[0], X_test.shape[0]))


### Step 3 - Choose the Model

There are different algorithms for different tasks. 
Explore, experiment, and choose the right one!
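
One possible way to experiment is to score each candidate with cross-validation before committing to one. The sketch below assumes 5 folds and `n_neighbors=5`; both values are illustrative choices rather than recommendations:

```python
from sklearn import datasets, neighbors, tree
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# The two classifiers used in this workshop
candidates = {
    "decision tree": tree.DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": neighbors.KNeighborsClassifier(n_neighbors=5),
}

mean_scores = {}
for name, classifier in candidates.items():
    # 5-fold cross-validation: train on 4 folds, score on the remaining one
    scores = cross_val_score(classifier, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print("{}: mean accuracy={:.3f}".format(name, mean_scores[name]))
```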


### Step 4 - Train the Model

The goal of training is for the model to predict the correct species as often as possible.


In [None]:
from sklearn import tree

# valid splitter values: "best" or "random"
# "best": choose the best split at each node
# "random": choose the best of several randomly drawn splits

tree_classifier = tree.DecisionTreeClassifier(splitter="random")
tree_classifier.fit(X_train, y_train)

In [None]:
from sklearn import neighbors

# n_neighbors must not exceed the number of training samples

knn_classifier = neighbors.KNeighborsClassifier(n_neighbors=2)
knn_classifier.fit(X_train, y_train)


### Step 5 - Evaluate the Model

Use metrics to measure the model's performance. Accuracy is the fraction of test samples whose predicted species matches the true label.

In [None]:
tree_predictions = tree_classifier.predict(X_test)
knn_predictions = knn_classifier.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

tree_score = accuracy_score(y_test, tree_predictions)
knn_score = accuracy_score(y_test, knn_predictions)

print(tree_score)
print(knn_score)
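
Accuracy is a single number, so it hides which species get confused with each other. As a sketch under assumed settings (a decision tree with `random_state=0`, chosen only for repeatability), a confusion matrix and a per-class report give a more detailed view:

```python
from sklearn import datasets, tree
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

# Rows are the true species, columns are the predicted species
matrix = confusion_matrix(y_test, predictions, labels=[0, 1, 2])
print(matrix)

# Precision, recall, and F1 score for each species
report = classification_report(
    y_test, predictions, labels=[0, 1, 2], target_names=iris.target_names
)
print(report)
```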

In [None]:
# put it all together!

def build_model(model="tree", parameter=None):
    iris_dataset = datasets.load_iris()
    X = iris_dataset.data
    y = iris_dataset.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    if model == "knn":
        # parameter is the number of neighbors (defaults to 5)
        classifier = neighbors.KNeighborsClassifier(n_neighbors=int(parameter or 5))
    else:
        # any other value falls back to a decision tree;
        # parameter is the splitter, "best" or "random" (defaults to "best")
        classifier = tree.DecisionTreeClassifier(splitter=parameter or "best")

    classifier.fit(X_train, y_train)

    predictions = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("accuracy={}".format(accuracy))

    return classifier

In [None]:
model_tree = build_model("tree", "random")


In [None]:
model_knn = build_model("knn", "2")


### Test with our own data

Consider an iris with the following features:
- Sepal length of 1 cm
- Sepal width of 2 cm
- Petal length of 3 cm
- Petal width of 4 cm


In [None]:
X_new = [[1, 2, 3, 4]]

prediction_tree = model_tree.predict(X_new)
print("Predicted Iris species using Decision Tree: {}".format(iris_dataset.target_names[prediction_tree]))

prediction_knn = model_knn.predict(X_new)
print("Predicted Iris species using K Nearest Neighbors: {}".format(iris_dataset.target_names[prediction_knn]))
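
A predicted label alone does not say how sure the model is. The following sketch assumes a K-Nearest Neighbors model trained on the full dataset with `n_neighbors=5` (an illustrative setup, not the one built above) and uses `predict_proba` to show how the neighbors voted:

```python
from sklearn import datasets, neighbors

iris = datasets.load_iris()

# Assumption for this sketch: train on the full dataset with n_neighbors=5
classifier = neighbors.KNeighborsClassifier(n_neighbors=5)
classifier.fit(iris.data, iris.target)

X_new = [[1, 2, 3, 4]]

# predict returns one integer code per input row
code = classifier.predict(X_new)[0]
predicted_name = iris.target_names[code]
print("Predicted species: {}".format(predicted_name))

# predict_proba returns the share of the 5 neighbors voting for each species
probabilities = classifier.predict_proba(X_new)[0]
for name, probability in zip(iris.target_names, probabilities):
    print("{}: {:.2f}".format(name, probability))
```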

### Steps Achieved

1. Collect Data
1. Prepare Data
1. Choose the model
1. Train the model
1. Evaluate the model's performance and predictions

### Now What?

- What if we get a new dataset to train?
- What if we want to try more parameter values?
- What if we want to test different models?
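
For the question about trying more parameters, one option that scikit-learn provides is `GridSearchCV`. The sketch below is only a starting point: the search range of 1 to 10 neighbors, the 5 folds, and the fixed `random_state` are all assumptions made for illustration.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical search range: try every n_neighbors from 1 to 10
param_grid = {"n_neighbors": list(range(1, 11))}

# Each candidate value is scored with 5-fold cross-validation on the training set
search = GridSearchCV(neighbors.KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy: {:.3f}".format(search.best_score_))

# The held-out test set is used once, after the search has finished
test_accuracy = search.score(X_test, y_test)
print("test accuracy: {:.3f}".format(test_accuracy))
```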
