# Demo 4: A demo project for IRIS Flower Classification
(Refer to: https://ai.plainenglish.io/iris-flower-classification-step-by-step-tutorial-c8728300dc9e)

## Introduction

### What is IRIS?
The IRIS dataset contains 150 samples from three iris species (Setosa, Versicolour, and Virginica), each described by four measurements: sepal length, sepal width, petal length, and petal width.
<img src="iris.png">


### Aims
In this demo, we aim to do the following, among other things:
1. analyze the data composition of IRIS,
2. examine and visualize the data distribution,
3. use traditional machine learning algorithms for classification.

## Dataset Loading
In this section, we illustrate how to use the pandas library to load a dataset and list its items.

In [1]:
# Import the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_palette('husl')
import matplotlib.pyplot as plt
import matplotlib.style as style
# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8-deep'
style.use('seaborn-deep')
import sklearn


# Remote copy of the data, kept for reference; a local copy is used below
# dataset_source = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
dataset_source = './iris.csv'
# The CSV has no header row, so the column names are supplied explicitly
col_name = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(dataset_source, names=col_name)
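
If the local `./iris.csv` is not at hand, a similar table can be assembled from the copy of the data bundled with scikit-learn. The following is only a sketch under that assumption: `load_iris_frame` and `dataset_alt` are hypothetical names, and the bundled copy may differ from the CSV in a few measurements.

```python
import pandas as pd
from sklearn.datasets import load_iris

COL_NAMES = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

def load_iris_frame():
    """Hypothetical helper: build a DataFrame shaped like the one read from iris.csv."""
    bunch = load_iris()
    frame = pd.DataFrame(bunch.data, columns=COL_NAMES[:-1])
    # Map integer targets (0, 1, 2) to the 'Iris-<species>' labels used in the CSV
    frame['class'] = ['Iris-' + bunch.target_names[t] for t in bunch.target]
    return frame

dataset_alt = load_iris_frame()
print(dataset_alt.shape)
```

The resulting frame uses the same column names as `col_name`, so the later cells should work on it unchanged.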

In [2]:
# List some items at the beginning and the end
dataset

| | sepal-length | sepal-width | petal-length | petal-width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |


In [3]:
# List only the top 10 items
dataset.head(10)


| | sepal-length | sepal-width | petal-length | petal-width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |


In [4]:
# Show the summary of the dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal-length    150 non-null float64
sepal-width     150 non-null float64
petal-length    150 non-null float64
petal-width     150 non-null float64
class           150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [5]:
# How many classes does the dataset have?
dataset['class'].value_counts()

Iris-versicolor    50
Iris-setosa        50
Iris-virginica     50
Name: class, dtype: int64

The three classes, with 50 samples each, indicate a balanced 3-class classification problem.
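
To make the class balance concrete, the counts can be turned into proportions and a majority-class baseline, which is the accuracy of always predicting the most frequent class. This is a minimal sketch: the `labels` series is a stand-in built here to mirror the counts above, and `baseline_accuracy` is an illustrative name. In the notebook the same calls would apply to `dataset['class']`.

```python
import pandas as pd

# Stand-in for dataset['class'], mirroring the counts shown above (50 per species)
labels = pd.Series(
    ['Iris-setosa'] * 50 + ['Iris-versicolor'] * 50 + ['Iris-virginica'] * 50,
    name='class')

proportions = labels.value_counts(normalize=True)  # fraction of samples per class
baseline_accuracy = proportions.max()              # always predict the largest class

print(proportions)
print(f'majority-class baseline accuracy: {baseline_accuracy:.3f}')
```

Any classifier trained later should beat this baseline of about one third.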

In [6]:
dataset.describe()

| | sepal-length | sepal-width | petal-length | petal-width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


## Dataset Analysis

### Dataset Description


In [7]:
dataset.describe()

| | sepal-length | sepal-width | petal-length | petal-width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
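
The summary above pools all three species, which hides how the features differ between them. A per-class summary can be sketched with `groupby`. To stay self-contained, this example assumes the copy of the data bundled with scikit-learn (whose class labels lack the `Iris-` prefix) rather than `./iris.csv`; `df` and `per_class` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

bunch = load_iris()
cols = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
df = pd.DataFrame(bunch.data, columns=cols)
df['class'] = bunch.target_names[bunch.target]  # 'setosa', 'versicolor', 'virginica'

# Mean and standard deviation of every feature, computed separately per class
per_class = df.groupby('class')[cols].agg(['mean', 'std'])
print(per_class.round(3))
```

In the notebook, the equivalent call would be `dataset.groupby('class').agg(['mean', 'std'])`.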


### Characteristic Distribution

In [None]:
sns.violinplot(y='class', x='sepal-length', data=dataset, inner='quartile')
plt.show()
sns.violinplot(y='class', x='sepal-width', data=dataset, inner='quartile')
plt.show()
sns.violinplot(y='class', x='petal-length', data=dataset, inner='quartile')
plt.show()
sns.violinplot(y='class', x='petal-width', data=dataset, inner='quartile')
plt.show()

In [None]:
# Using another plot style: histogram with a density curve.
# sns.distplot is deprecated since seaborn 0.11; histplot is its replacement.
sns.histplot(dataset["sepal-length"], kde=True, stat="density"); plt.show()
sns.histplot(dataset["sepal-width"], kde=True, stat="density"); plt.show()
sns.histplot(dataset["petal-length"], kde=True, stat="density"); plt.show()
sns.histplot(dataset["petal-width"], kde=True, stat="density"); plt.show()

### Pair-wise Characteristic Analysis

In [None]:
sns.pairplot(dataset, hue='class', markers='+')
plt.show()

**Finding**: The classes occupy different regions in the pair-wise characteristic plots, which suggests that they are separable.
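
One way to check this finding numerically is to test whether a single threshold on one feature already isolates a class. The sketch below assumes the copy of the data bundled with scikit-learn, and the cut-off `THRESHOLD = 2.5` on petal length is a value chosen by eye for illustration, not something taken from the notebook.

```python
from sklearn.datasets import load_iris

bunch = load_iris()
petal_length = bunch.data[:, 2]  # third feature column is petal length (cm)
is_setosa = bunch.target_names[bunch.target] == 'setosa'

THRESHOLD = 2.5  # assumed cut-off, chosen by eye

# Rule: call a flower "setosa" when its petal is shorter than the threshold
predicted_setosa = petal_length < THRESHOLD
accuracy = (predicted_setosa == is_setosa).mean()
print(f'setosa-vs-rest accuracy with one threshold: {accuracy:.2f}')
```

A rule this simple separates Setosa from the other two species, while Versicolour and Virginica overlap and need more than one feature.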

### Principal Component Analysis (PCA)

In [1]:
# This cell needs `dataset` from the Dataset Loading section; running it in a
# fresh kernel raises "NameError: name 'dataset' is not defined".
# Convert the dataset from a DataFrame to NumPy arrays for PCA using sklearn
dataset_np = dataset.to_numpy()
X = dataset_np[:, :-1].astype(float)  # the four numeric features
Y = dataset_np[:, -1]                 # the class labels

from sklearn.decomposition import PCA

# Reduce the data dimension to 2
pca = PCA(n_components=2)
X_transform = pca.fit_transform(X)

# Fraction of variance explained by each component
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))
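
To show what the explained variance ratio measures, the same quantity can be computed by hand from a singular value decomposition of the mean-centered data. This is a sketch rather than scikit-learn's internal code path. It assumes the copy of the data bundled with scikit-learn, and `X_demo`, `manual_ratio`, and `sklearn_ratio` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_demo = load_iris().data                  # 150 x 4 feature matrix
X_centered = X_demo - X_demo.mean(axis=0)  # PCA works on mean-centered data

# Each singular value s_i gives a component variance: var_i = s_i**2 / (n - 1)
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)
variances = singular_values ** 2 / (X_demo.shape[0] - 1)
manual_ratio = variances / variances.sum()  # share of total variance per component

sklearn_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

print('manual :', manual_ratio[:2])
print('sklearn:', sklearn_ratio)
```

The first two entries of `manual_ratio` should agree with scikit-learn's values up to floating-point error.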

In [2]:
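# Requires `np`, `plt`, `dataset`, `X_transform` and `Y` from the cells above;
# a fresh kernel raises "NameError: name 'plt' is not defined".
plt.figure(1)
for cls in sorted(set(dataset['class'])):
    idx = np.where(Y == cls)[0]
    # Label each scatter so the legend entry matches the plotted class
    plt.scatter(X_transform[idx, 0], X_transform[idx, 1], label=cls)

plt.xlabel('component 1')
plt.ylabel('component 2')
plt.legend()
plt.show()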

In [3]:
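# Requires `pd`, `sns`, `plt`, `X_transform` and `Y` from the cells above
# (a fresh kernel raises a NameError).
# Build the frame from the float array first so the components stay numeric,
# then attach the class labels used for coloring
X_Y = pd.DataFrame(X_transform, columns=['1st component', '2nd component'])
X_Y['class'] = Y
sns.pairplot(X_Y, hue='class', markers='+')
plt.show()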
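
Since the classes stay largely separated after projection to two components, a traditional classifier can be tried directly on the projected data. The sketch below is one possible setup, not the notebook's method. It assumes the copy of the data bundled with scikit-learn, a k-nearest-neighbors classifier with `n_neighbors=5`, and a 70/30 stratified split with `random_state=0`; all of these choices are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X_all, y_all = load_iris(return_X_y=True)

# Hold out 30% of the samples, keeping the three classes balanced in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=0)

# Fit PCA on the training data only, then classify in the 2-D projected space
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f'test accuracy on 2 principal components: {test_accuracy:.3f}')
```

Wrapping PCA and the classifier in one pipeline keeps the projection from seeing the held-out samples during fitting.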