# Iris Dataset Notebook


## What is the Iris Dataset?
The Iris flower dataset was introduced in 1936 by Ronald Fisher, a British statistician and biologist, in his paper *The Use of Multiple Measurements in Taxonomic Problems*. He used it as an example of linear discriminant analysis, which looks for a linear combination of features that separates two or more classes.
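
As a rough illustration of the linear discriminant analysis mentioned above, the sketch below projects the four measurements onto two discriminant axes. It assumes scikit-learn's bundled copy of the data (`load_iris`) and its `LinearDiscriminantAnalysis` class rather than Fisher's original calculation, so it should be read as a modern approximation of the idea.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: scikit-learn's bundled Iris data stands in for the CSV used later.
iris = load_iris()
X, y = iris.data, iris.target

# With 3 classes, LDA can produce at most 2 discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=2)
projected = lda.fit_transform(X, y)

# One row per flower, one column per discriminant axis.
print(projected.shape)
```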

---
### [More information about the content of the dataset](https://medium.com/codebagng/basic-analysis-of-the-iris-data-set-using-python-2995618a6342)
For this dataset, 50 flowers from each of the 3 Iris species were collected and measured, so the classes are balanced. However, 2 of the 3 species were collected at the same location (Gaspé Peninsula), from the same pasture on the same day, and were measured at the same time by the same person with the same apparatus. This may have introduced a slight bias, which could be one reason why the visualisations later on show 2 of the species looking more alike.
* 150 iris flowers of 3 species were identified and used for this dataset
    * setosa
    * virginica
    * versicolor
* 4 columns hold the flowers' measurements
    * sepal length
    * sepal width
    * petal length
    * petal width
* The 5th column in the dataset is the flower species, named `class` in this case



## Packages
First and foremost, we need to import the packages required for this code to work.

In [None]:
# Load data sets - data processing
import pandas as pd

# Encoding categorical variables.
import sklearn.preprocessing as pre

# Splitting into training and test sets.
import sklearn.model_selection as mod

# k-nearest neighbours classifier
import sklearn.neighbors as nei

# Compute subset accuracy
from sklearn.metrics import accuracy_score

# See the relation between each pair of features
import seaborn as sns

# Linear algebra
import numpy as np

import random

# Visualise data
import matplotlib.pyplot as plt
%matplotlib inline

## Show the dataset

### Read the iris dataset
Read the dataset from a CSV (comma-separated values) file.

In [None]:
# Loading the dataset
dataset = pd.read_csv("https://raw.githubusercontent.com/ianmcloughlin/datasets/master/iris.csv")

# Displays the dataset in a user-friendly manner
dataset

---
## Dataset information
Gather additional information from the dataset that might be useful in coding

In [None]:
# Source code adapted from: 
# - https://medium.com/codebagng/basic-analysis-of-the-iris-data-set-using-python-2995618a6342

In [None]:
# Shows the start of the dataset
# the first 5 rows
dataset.head()

In [None]:
# Information about the dataset
# - how many entries/objects were used in this dataset
# - the value of those entries (null or not)
# - the type of those entries (integer, float, object, etc)
dataset.info() 

In [None]:
# Statistics summarizing dispersion & shape of dataset’s distribution
dataset.describe()

In [None]:
# Outputs the number of rows and columns of the dataset
dataset.shape

In [None]:
# Number of rows (flowers) in each class or species
dataset['class'].value_counts()

### Visualise

#### Display the measurements of the iris species

In [None]:
# Source code adapted from:
# - https://www.kaggle.com/adityabhat24/iris-data-analysis-and-machine-learning-python
# Violin plot of the sepal length distribution for each species
sns.violinplot(data=dataset, x="class", y="sepal_length")

In [None]:
# Violin plot of the sepal width distribution for each species
sns.violinplot(data=dataset, x="class", y="sepal_width")

In [None]:
# Violin plot of the petal length distribution for each species
sns.violinplot(data=dataset, x="class", y="petal_length")

In [None]:
# Violin plot of the petal width distribution for each species
sns.violinplot(data=dataset, x="class", y="petal_width")

#### Plot every pair of features, with the species distinguished by colour

In [None]:
# Source code adapted from:
# - https://www.kaggle.com/jchen2186/machine-learning-with-iris-dataset
# Show the iris dataset using pairplot
sns.pairplot(dataset, hue='class', markers='o')
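
The plots can be complemented with a numeric summary per species. The sketch below uses six hand-copied rows in a hypothetical `sample` frame so that it runs on its own; in this notebook the same `groupby` call could be applied to `dataset` instead.

```python
import pandas as pd

# Assumption: six sample rows (two per species) stand in for the full dataset.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

# Mean of every measurement within each species.
means = sample.groupby("class").mean()
print(means)
```

Large gaps between the per-species means of a feature suggest that the feature is useful for telling the species apart.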

---
## Train the dataset
### To train and test the Iris dataset, we follow these steps:
1. Inputs/Outputs
    * Get the inputs/features
        - **Features:**
            1. sepal length in cm
            2. sepal width in cm
            3. petal length in cm
            4. petal width in cm
    * Get the desired output/targets
        - **Targets:**
            1. Iris Setosa
            2. Iris Versicolor
            3. Iris Virginica
2. Split 
    * The dataset is separated into 'training' & 'testing'
3. Train 
    * The model is trained using the 'fit' method on the 'training' data.
4. Predict 
    * The outputs for the 'test' data are predicted
5. Print & plot 
    * Outputs

This information was found [here](https://mclguide.readthedocs.io/en/latest/sklearn/multiclass.html#iris-dataset)
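
The steps above can be sketched end to end as follows. This is a minimal version that assumes scikit-learn's bundled copy of the data (`load_iris`) instead of the CSV, and the names `X`, `y`, `model` and `predicted` are illustrative rather than the ones used in the rest of this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Inputs/outputs (assumption: the bundled data stands in for the CSV).
X, y = load_iris(return_X_y=True)

# 2. Split: half for training, half for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# 3. Train on the training half only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# 4. Predict the species of the held-out half.
predicted = model.predict(X_test)

# 5. Print the share of correct predictions (between 0 and 1).
print(accuracy_score(y_test, predicted))
```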

### Inputs
These represent the features or measurements taken from the Iris flowers that can be found in the Iris dataset.

In [None]:
# Check column names to find out what names the features were given
dataset.columns

[In relation to the iris dataset features](https://www.ritchieng.com/machine-learning-iris-dataset/)
* Each row is an observation, or record
* Each column is a feature or an independent variable

In [None]:
# Source code adapted from: 
# - https://stackoverflow.com/questions/41130856/keyerror-petal-length-not-in-index
# Select the four measurement columns by passing a list of column names
# (note the double brackets) and convert them to a NumPy array
inputs = np.array(dataset[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])
inputs

### Outputs
**Get the targets of the Iris dataset, which are the species of each flower**

In [None]:
outputs = np.array(dataset['class'], dtype='<U10')
outputs
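
`KNeighborsClassifier` accepts string labels directly, so the species names can be used as they are. If integer codes are wanted instead, the `sklearn.preprocessing` module imported earlier provides `LabelEncoder`. The sketch below assumes a short hand-written list in place of the `class` column.

```python
from sklearn.preprocessing import LabelEncoder

# Assumption: a short hand-written list stands in for the 'class' column.
species = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
# Classes are numbered in sorted order: setosa=0, versicolor=1, virginica=2.
codes = encoder.fit_transform(species)

print(list(encoder.classes_))
print(list(codes))
# inverse_transform maps the integer codes back to the species names.
print(list(encoder.inverse_transform(codes)))
```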

### Classify

In [None]:
knn = nei.KNeighborsClassifier(n_neighbors=5)
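
To give an idea of what a k-nearest neighbours classifier does, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: `knn_predict` is a hypothetical helper, the toy points use petal length and width only, and ties are resolved by whichever label was counted first.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Return the most common label among the k training points nearest to query."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and let them vote.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (assumption): (petal length, petal width) pairs for two species.
points = [(1.4, 0.2), (1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

print(knn_predict(points, labels, (1.5, 0.3), k=3))  # prints "setosa"
```

The `n_neighbors=5` argument above plays the role of `k` here: a larger value smooths out noisy neighbours but can blur the boundary between similar species.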

### Split

In [None]:
# Split the inputs and outputs into training and test sets;
# test_size=0.5 keeps half of the rows back for testing
inputs_train, inputs_test, outputs_train, outputs_test = mod.train_test_split(inputs, outputs, test_size=0.5)

### Train

In [None]:
# Fit the classifier on the training half only, so the test half stays unseen
knn.fit(inputs_train, outputs_train)
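
Because the split is random, the accuracy reported at the end of this notebook changes from run to run. `train_test_split` accepts a `random_state` argument to make the shuffle repeatable. The pure-Python sketch below shows the idea with a hypothetical `split_half` helper; it is an illustration, not how scikit-learn performs the split internally.

```python
import random

def split_half(rows, seed):
    """Shuffle row indices with a fixed seed and split them into two halves."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    middle = len(indices) // 2
    train = [rows[i] for i in indices[:middle]]
    test = [rows[i] for i in indices[middle:]]
    return train, test

rows = list(range(150))  # stand-in for the 150 flowers
train_a, test_a = split_half(rows, seed=42)
train_b, test_b = split_half(rows, seed=42)

print(len(train_a), len(test_a))  # 75 75
print(train_a == train_b)         # True: the same seed gives the same split
```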

### Predict

In [None]:
# Source code adapted from:
# - https://github.com/ianmcloughlin/jupyter-teaching-notebooks/blob/master/knn-iris.ipynb
# Predict the species from the inputs 
prediction1 = knn.predict([[5.1, 3.5, 1.4, 0.2]])
prediction2 = knn.predict([[5.9, 3. , 5.1, 1.8]])

In [None]:
# This should be setosa as it is the first input in the list of inputs
prediction1

In [None]:
# This should be virginica as it is the last input in the list of inputs
prediction2

In [None]:
# Source code adapted from: 
# - https://stackoverflow.com/questions/1058712/how-do-i-select-a-random-element-from-an-array-in-python
# Pick one random row of measurements from the inputs
rand = random.choice(inputs)
rand

In [None]:
# Predicts the species that matches the random features
prediction3 = knn.predict([rand])

In [None]:
# This should predict the species of the iris according to the random measurements chosen
prediction3

### Accuracy

In [None]:
# Source code adapted from:
# - https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/
# Predict the species for every row in the test inputs
pred = knn.predict(inputs_test)
pred

In [None]:
# Show the accuracy as a percentage of correct predictions
perc = accuracy_score(outputs_test, pred) * 100
print(f"{perc}%")
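
For reference, the accuracy used here is the fraction of predictions that match the true labels. The sketch below reproduces that calculation in plain Python with a hypothetical `accuracy` helper and hand-written labels.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Hand-written example (assumption): 3 of the 4 predictions are correct.
truth = ["setosa", "versicolor", "virginica", "virginica"]
guess = ["setosa", "versicolor", "versicolor", "virginica"]

print(f"{accuracy(truth, guess) * 100}%")  # prints "75.0%"
```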