# Iris Species Classifier

This notebook is an example of supervised learning, specifically a classification model. We are going to create a model that predicts which of three iris species a flower belongs to:

* Setosa
* Versicolor
* Virginica

The sample dataset for training will be loaded from scikit-learn.

References:

1. https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
2. https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html


In [None]:
from sklearn.datasets import load_iris

# load iris dataset
data = load_iris()

# check the data type
print("data type is {}".format(type(data)))

In [None]:
# get the keys
print("data keys: {}".format(data.keys()))
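The object returned by `load_iris()` is a `Bunch` (see reference 2), which behaves like a dictionary whose keys can also be read as attributes. A minimal sketch of the two access styles, assuming only the keys printed above:

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

# dictionary-style access and attribute-style access return the same values
by_key = data['feature_names']
by_attribute = data.feature_names

print(by_key)
print(by_attribute)
print(np.array_equal(data['data'], data.data))
```

The rest of this notebook uses the dictionary style.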

In [None]:
# get description
print(data['DESCR'])

In [None]:
# get feature names
print(data['feature_names'])

In [None]:
# get the target names (labels/classes)
print(data['target_names'])

In [None]:
# sample data
print(type(data['data']))
print(data['data'])

In [None]:
# sample label/target/output
print(type(data['target']))
print(data['target'])
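The integers in `target` are indices into `target_names`, so the two arrays can be combined to count the samples per species. A minimal sketch using `np.bincount`, which is not used elsewhere in this notebook; the variable names are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

# count how many samples carry each integer label (0, 1, 2)
counts = np.bincount(data['target'])

# pair each count with the species name stored at the same index
for name, count in zip(data['target_names'], counts):
    print("{}: {} samples".format(name, count))
```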

**Prepare the dataset for training and testing**

Size of dataset: 150 samples. By default, `train_test_split` shuffles the data and holds out 25% of it for testing.

- features (input/attributes): x_train, x_test
- label (output/class): y_train, y_test

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data['data'], data['target'])

# check the size
print("x_train shape: {}".format(x_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("x_test shape: {}".format(x_test.shape))
print("y_test shape: {}".format(y_test.shape))
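The split above is random, so the rows that end up in each part change on every run. The sketch below shows one way to make the split repeatable. The values `test_size=0.25` and `random_state=0`, and the use of `stratify`, are assumptions chosen for illustration and are not settings from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()

# random_state fixes the shuffle so every run produces the same split;
# stratify keeps the species proportions similar in both parts
x_train, x_test, y_train, y_test = train_test_split(
    data['data'], data['target'],
    test_size=0.25, random_state=0, stratify=data['target'])

print("x_train shape: {}".format(x_train.shape))
print("x_test shape: {}".format(x_test.shape))
```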

**Build your first ML model**

We can start with a logistic regression model.

In [None]:
from sklearn.linear_model import LogisticRegression

# create model
model = LogisticRegression()

# train model
model.fit(x_train, y_train)

**Use trained model for prediction**

To use the model for prediction, we need to prepare new input data as a 2D NumPy array: one row per sample, with the four features in the same order as `feature_names`.

In [None]:
import numpy as np

# create new data for prediction
x_new = np.array([[5, 2.9, 1, 0.2]])

# use the model for prediction
prediction = model.predict(x_new)

# display result
print("prediction: {}".format(prediction))

# extract the species names from iris dataset
print("species: {}".format(data['target_names'][prediction]))
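`predict` returns only the winning class. A logistic regression model can also report how confident it is through `predict_proba`. The sketch below is self-contained so it can run on its own; `random_state=0` and `max_iter=1000` are assumptions added here (the latter to give the solver room to converge), not settings used in the cells above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data['data'], data['target'], random_state=0)

# max_iter=1000 is an assumption for this sketch
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

x_new = np.array([[5, 2.9, 1, 0.2]])

# one row per sample, one column per class, in the order of model.classes_
probabilities = model.predict_proba(x_new)

for name, p in zip(data['target_names'], probabilities[0]):
    print("{}: {:.3f}".format(name, p))
```

The class with the highest probability is the one `predict` returns.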

In [None]:
# test with another sample for prediction
x_new = np.array([[6.5, 0.5, 2.5, 4.2]])

# use model for prediction
prediction = model.predict(x_new)

# display result
print("prediction: {}".format(prediction))

# extract the target names instead
print("prediction target names: {}".format(data['target_names'][prediction]))
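A classifier returns a class for any input, even for measurements unlike anything it was trained on. The sketch below compares the second sample with the range of each feature in the dataset. The helper variables (`feature_min`, `feature_max`, `outside`) are illustrative names, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
x_new = np.array([[6.5, 0.5, 2.5, 4.2]])

# smallest and largest value of each feature across all 150 samples
feature_min = data['data'].min(axis=0)
feature_max = data['data'].max(axis=0)

# True where the new sample lies outside the observed range
outside = (x_new[0] < feature_min) | (x_new[0] > feature_max)

for name, value, low, high, flag in zip(
        data['feature_names'], x_new[0], feature_min, feature_max, outside):
    print("{}: value={}, dataset range=[{}, {}], outside={}".format(
        name, value, low, high, flag))
```

Predictions for samples flagged this way are best treated with caution, since the model has to extrapolate.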

**Evaluate the model**

Use the test dataset to evaluate the model by scoring its predictions against the known labels.

In [None]:
# use testing dataset
y_pred = model.predict(x_test)

# y_pred contains the predictions generated by the model
# y_test contains the actual labels/species from the iris dataset
# compare the two arrays to verify the model's accuracy
print(y_pred)
print(y_test)

# score() returns the mean accuracy: the fraction of test samples
# for which y_pred matches y_test
print("model's score is: {:.3f}".format(model.score(x_test, y_test)))

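In the recorded run, the model scored 0.974, which is 97.4% accuracy on the test set. Because `train_test_split` shuffles the data randomly when `random_state` is not set, the exact score can differ from run to run.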
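For a classifier, `score` is the fraction of test samples predicted correctly. The sketch below computes that fraction by hand, checks it against `accuracy_score`, and prints a confusion matrix to show which species get mixed up. It is self-contained; `random_state=0` and `max_iter=1000` are assumptions for this example, so the numbers need not match the run described above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data['data'], data['target'], random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# fraction of test samples predicted correctly, computed by hand
manual_accuracy = (y_pred == y_test).mean()
print("manual accuracy: {:.3f}".format(manual_accuracy))

# the same value from scikit-learn's helpers
print("accuracy_score:  {:.3f}".format(accuracy_score(y_test, y_pred)))
print("model.score:     {:.3f}".format(model.score(x_test, y_test)))

# rows are the actual species, columns are the predicted species
matrix = confusion_matrix(y_test, y_pred)
print(matrix)
```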

**Repeat the process/steps with different models/algorithms**

Next: K-Nearest Neighbors (KNN)

Reference: https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#k-nearest-neighbors-classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# create model
model = KNeighborsClassifier(n_neighbors=1)

# train model
model.fit(x_train, y_train)

In [None]:
# create new data for prediction
x_new = np.array([[5, 2.9, 1, 0.2]])

# use the model for prediction
prediction = model.predict(x_new)

# display result
print("prediction: {}".format(prediction))

# extract the species names from iris dataset
print("species: {}".format(data['target_names'][prediction]))

In [None]:
# test with another sample for prediction
x_new = np.array([[6.5, 0.5, 2.5, 4.2]])

# use model for prediction
prediction = model.predict(x_new)

# display result
print("prediction: {}".format(prediction))

# extract the target names instead
print("prediction target names: {}".format(data['target_names'][prediction]))

In [None]:
# use testing dataset
y_pred = model.predict(x_test)

# y_pred contains the predictions generated by the model
# y_test contains the actual labels/species from the iris dataset
# compare the two arrays to verify the model's accuracy
print(y_pred)
print(y_test)

# score() returns the mean accuracy: the fraction of test samples
# for which y_pred matches y_test
print("model's score is: {:.3f}".format(model.score(x_test, y_test)))

In the recorded run, the score for the KNN model is 94.7%.

In that run, the logistic regression model predicts better than KNN with 1 neighbor, though with 38 test samples the gap amounts to a single misclassified sample.

Let's try KNN with 2 neighbors.

In [None]:
# repeat with 2 neighbors
model = KNeighborsClassifier(n_neighbors=2)
model.fit(x_train, y_train)

# first test
x_new = np.array([[5, 2.9, 1, 0.2]])
prediction = model.predict(x_new)
print("prediction: {}".format(prediction))
print("species: {}".format(data['target_names'][prediction]))

# second test
x_new = np.array([[6.5, 0.5, 2.5, 4.2]])
prediction = model.predict(x_new)
print("prediction: {}".format(prediction))
print("prediction target names: {}".format(data['target_names'][prediction]))

# evaluate on the test set
y_pred = model.predict(x_test)
print(y_pred)
print(y_test)
print("model's score is: {:.3f}".format(model.score(x_test, y_test)))
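Rather than editing `n_neighbors` by hand for each attempt, the same steps can run in a loop. A minimal sketch, assuming a fixed `random_state=0` so that every value of k is scored on the same split; the list of k values is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data['data'], data['target'], random_state=0)

# test-set accuracy for each number of neighbors
scores = {}
for k in [1, 2, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train, y_train)
    scores[k] = model.score(x_test, y_test)
    print("n_neighbors={}: score={:.3f}".format(k, scores[k]))
```

With a test set this small, differences of one or two samples between values of k are not strong evidence that one setting is better.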

**TRY THIS**

Repeat the steps with another ML model: Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

model = GaussianNB()
model.fit(x_train, y_train)

# first test
x_new = np.array([[5, 2.9, 1, 0.2]])
prediction = model.predict(x_new)
print("prediction: {}".format(prediction))
print("species: {}".format(data['target_names'][prediction]))

# second test
x_new = np.array([[6.5, 0.5, 2.5, 4.2]])
prediction = model.predict(x_new)
print("prediction: {}".format(prediction))
print("prediction target names: {}".format(data['target_names'][prediction]))

# evaluate on the test set
y_pred = model.predict(x_test)
print(y_pred)
print(y_test)
print("model's score is: {:.3f}".format(model.score(x_test, y_test)))

print("Gaussian Naive Bayes model accuracy (in %):", metrics.accuracy_score(y_test, y_pred) * 100)
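The three models in this notebook share the same `fit` and `score` interface, so they can be compared side by side on one split. A sketch under stated assumptions: `random_state=0` and `max_iter=1000` are choices made for this example, and the dictionary labels are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data['data'], data['target'], random_state=0)

# the models tried in this notebook, keyed by a display label
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN (1 neighbor)": KNeighborsClassifier(n_neighbors=1),
    "Gaussian Naive Bayes": GaussianNB(),
}

# train each model on the same data and record its test accuracy
results = {}
for label, model in models.items():
    model.fit(x_train, y_train)
    results[label] = model.score(x_test, y_test)
    print("{}: {:.3f}".format(label, results[label]))
```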