## Table of contents

* [Introduction](#introduction)
* [What is the Iris dataset?](#whatIsIrisDataset)
* [Imports](#imports)
* [Get the dataset](#getTheDataset)
* [Data observation](#dataObservation)
* [Data visualization](#dataVisualization)
* [Problems with the dataset](#problemsWithTheDataset)
* [Using a neural network](#usingNetwork)
    - [Linear discriminant analysis](#linearDiscriminantAnalysis)
    - [K neighbour classifier](#kNeighbourClassifier)
* [Building and training the network](#buildingAndTraining)
    - [Imports](#imports2)
    - [Linear discriminant](#linearDiscriminant)
    - [K neighbours](#kNeighbours)
* [References](#references)


# Introduction <a name="introduction"></a>
This notebook is concerned with Fisher's Iris data set. In this notebook, I will be explaining the dataset itself, as well as creating easy on the eye visualisations of the dataset. I will also be discussing why it is difficult to write an algorithm that will accurately separate the three species of iris based on the variables in the dataset.

 ![](img/iris.jpg)

## What is the Iris dataset? <a name="whatIsIrisDataset"></a>
The Iris dataset was introduced by the British statistician and biologist [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) in 1936. It consists of 150 samples from three species of Iris (setosa, versicolor and virginica), with 50 samples per species. Four features were measured for each sample: sepal length, sepal width, petal length and petal width. The dataset has become a very popular test case for machine learning techniques. Let's take a look at the actual dataset to get a better understanding.

## Imports <a name="imports"></a>

In [1]:
import numpy as np
import pandas as pd ### so we can read the data from an external url
import seaborn as sns ### so we can style the plots
sns.set_palette('husl') ### use the husl colour palette for all plots
import matplotlib.pyplot as plt ### get the plotting package
%matplotlib inline


## Get the dataset <a name="getTheDataset"></a>

In [None]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv" ### read in the data from that url
headings = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species'] ### give each column a heading 
data = pd.read_csv(url, names=headings) ### save the data to the data variable

## Data observation <a name="dataObservation"></a>
Now that we have read in our data and have a handle on it, let's have a look at what this dataset contains.

In [None]:
data.head(10) ### view the first 10 records in the dataset

As we can see, the DataFrame has five columns: sepal-length, sepal-width, petal-length, petal-width and species. The unlabelled numbers on the far left are not a sixth column; they are the row index that pandas adds automatically.

In [None]:
data.info() ### print information about the dataset

In [None]:
data.describe() ### describe what the dataset contains

From the above table, we can get a taste for what the dataset contains: summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numerical column.

In [None]:
data['species'].value_counts() ### output the count of each species in the dataset 

We can see that there are equal numbers of samples for each species (50 each).

## Data Visualization <a name="dataVisualization"></a>
Now that we have taken a look at what the dataset contains, I think it's time that we actually graph this data so that we can see for ourselves the relationship between each species and the four variables, sepal-length, sepal-width, petal-length, petal-width.

In [None]:
g = sns.violinplot(y='species', x='sepal-length', data=data, inner='quartile') ### sepal length in cm
plt.show()
g = sns.violinplot(y='species', x='sepal-width', data=data, inner='quartile') ### sepal width in cm
plt.show()
g = sns.violinplot(y='species', x='petal-length', data=data, inner='quartile') ### petal length in cm
plt.show()
g = sns.violinplot(y='species', x='petal-width', data=data, inner='quartile') ### petal width in cm
plt.show()

Here, we can see the relationship that each species has with each individual variable. We also get a nice illustration of the distribution of each species across each individual variable.
Next, let's take a look at the relationships between all four variables for each of the three species.

In [None]:
g = sns.pairplot(data, hue='species', markers='+')
plt.show()

In the above plot, you can clearly see that Iris-setosa (pink) is distinctly separated from the other two species: there is almost no overlap between setosa and the other two species for the four variables in question.

## Problems with the dataset <a name="problemsWithTheDataset"></a>
While this may be a fairly extensive dataset, we quickly run into problems when we try to write an algorithm that can accurately predict the species of flower from the four variables we have. This is because of the overlap between versicolor and virginica that can be seen in the above graph. For example, if you take the (sepal-length, sepal-width) graph, the versicolor and virginica clusters are completely mixed up with each other, making it extremely difficult to construct a hand-written rule that can precisely differentiate between the two species. The best way to get around this would be to train a machine learning classifier on the data.
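
To make the idea of overlap concrete, here is a minimal sketch that measures how many values of one species fall inside the observed range of another. The function name `overlap_fraction` and the measurements are hypothetical toy values chosen for illustration; they are not taken from the real dataset.

```python
def overlap_fraction(values_a, values_b):
    """Return the fraction of values_a lying inside the [min, max] range of values_b."""
    low, high = min(values_b), max(values_b)
    inside = [v for v in values_a if low <= v <= high]
    return len(inside) / len(values_a)


# Hypothetical toy sepal lengths in cm (NOT the real Iris measurements).
setosa = [4.6, 4.9, 5.0, 5.1, 5.4]
versicolor = [5.5, 5.9, 6.1, 6.4, 6.9]
virginica = [5.8, 6.3, 6.5, 7.1, 7.6]

# No toy setosa value falls inside the versicolor range.
print(overlap_fraction(setosa, versicolor))

# Most toy versicolor values fall inside the virginica range.
print(overlap_fraction(versicolor, virginica))
```

A fraction near zero means a simple threshold on that variable can separate the two species, while a fraction near one means that variable alone cannot tell them apart.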

## Using a Neural Network <a name="usingNetwork"></a>
In this notebook, I will be using two models. Strictly speaking, neither is a neural network; both are classical machine learning classifiers:
*  [LinearDiscriminantAnalysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)
*  [KNeighborsClassifier](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)

If you want a detailed explanation of what each of those models does, please feel free to click on the links, which will bring you to their Wikipedia pages. However, I will give a brief, high-level summary of each of them here.

### Linear Discriminant Analysis <a name="linearDiscriminantAnalysis"></a>
Linear discriminant analysis was developed by Ronald Fisher in 1936 to characterize or separate two or more classes of events. At a high level, the method looks for a straight line (more generally, a linear boundary) that best separates the clusters. If most of class A lands above the line and most of class B lands below it, we can predict the class of an incoming event based on which side of the line it lands on.

 ![](img/linear.png)

In this simple picture, we can see that a dividing line can clearly be drawn between the setosa cluster and the versicolor cluster. This means that if a new sample came in and its variables placed it above that line, we could predict with good precision that it is a setosa. Likewise, a line can be drawn between versicolor and virginica, although that line would be harder to visualize.
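
The dividing-line idea can be sketched in a few lines of plain Python for two classes and two variables. This is only an illustration of Fisher's two-class discriminant, not what scikit-learn does internally; the helper names (`fit_fisher`, `predict`) and the toy points are assumptions made up for this example.

```python
def mean(points):
    """Mean of a list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, mu):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mu."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fit_fisher(class_a, class_b):
    """Return the direction w and threshold of the separating line."""
    mu_a, mu_b = mean(class_a), mean(class_b)
    axx, axy, ayy = scatter(class_a, mu_a)
    bxx, bxy, byy = scatter(class_b, mu_b)
    # Pooled within-class scatter matrix S_w.
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy * sxy
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    # w = S_w^{-1} (mu_a - mu_b), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    # Place the boundary at the projection of the midpoint between the class means.
    mid = ((mu_a[0] + mu_b[0]) / 2, (mu_a[1] + mu_b[1]) / 2)
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold


def predict(point, w, threshold):
    """Class A lies on the side of the line where the projection exceeds the threshold."""
    score = w[0] * point[0] + w[1] * point[1]
    return "A" if score > threshold else "B"


# Hypothetical toy (petal length, petal width) pairs, NOT real measurements.
class_a = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (1.6, 0.4)]
class_b = [(4.5, 1.4), (4.1, 1.3), (4.7, 1.5), (3.9, 1.1)]

w, threshold = fit_fisher(class_a, class_b)
print(predict((1.5, 0.2), w, threshold))  # a small flower, close to class A
print(predict((4.4, 1.4), w, threshold))  # a large flower, close to class B
```

The direction `w` is chosen so that the two class means are far apart after projection while each class stays tightly grouped, which is why the scatter matrix appears in the formula.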

### K Neighbour Classifier <a name="kNeighbourClassifier"></a>
The K neighbour classifier is a form of supervised learning. On a high level, this algorithm takes an example dataset of, say, three classes and predicts the class of an incoming event based on the classes of the samples situated closest to it on a graph.

![](img/knn.png)


In the above image, we can see clearly that if a sample came through with variables placing it near the bottom left corner of the graph, chances are it is a setosa because all of its closest neighbours on the graph are setosa. This may be obvious on the graph displayed above, but if we had three overlapping clusters, this algorithm could become very helpful, as linear discriminant analysis would become difficult to perform.
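
The nearest-neighbour vote described above can be sketched in plain Python. This is a simplified illustration rather than scikit-learn's implementation; the function name `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy (petal length, petal width) pairs, NOT real measurements.
points = [
    (1.4, 0.2), (1.3, 0.2), (1.5, 0.4),
    (4.5, 1.5), (4.0, 1.3), (4.7, 1.4),
    (6.0, 2.5), (5.8, 2.2), (6.1, 1.9),
]
labels = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3

print(knn_predict(points, labels, (1.6, 0.3)))  # lands in the bottom left cluster
print(knn_predict(points, labels, (5.9, 2.1)))  # lands in the top right cluster
```

Because the prediction depends only on nearby samples, no straight boundary is assumed, which is what makes the approach useful when clusters overlap.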

These are very high level and oversimplified explanations of the above algorithms just so we can understand better the examples that follow. If you wish to dive deeper into how these algorithms work, please click on the links provided. 

## Building and training the network <a name="buildingAndTraining"></a>
In this section, we are going to build a classifier with each of the models discussed above and train each one on a sample from the dataset.

### Imports <a name="imports2"></a>


In [None]:
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier ### knn model
from sklearn.model_selection import train_test_split ### so we can split the data
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis ### linear discriminant analysis

The reason for importing train_test_split from model_selection is so that we can split the data into two groups, one for training the models and the other for testing them. In this case, we will use 75% of the dataset for training and 25% for testing.
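
As a rough picture of what such a split involves, here is a minimal sketch in plain Python that shuffles the row indices and holds out a quarter of them. The function name `split_indices` is hypothetical, rounding the test share up is a choice made for this sketch, and the shuffle will not reproduce the exact rows that scikit-learn selects for the same seed.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.25, seed=5):
    """Shuffle the indices 0..n_samples-1 and return (train_indices, test_indices)."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(150)
print(len(train_idx), len(test_idx))  # 112 38
```

Keeping the test rows completely separate from the training rows is what allows the accuracy measured later to reflect how a model behaves on samples it has never seen.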

In [None]:
array = data.values
X = array[:,0:4]
Y = array[:,4]

### random_state fixes the random seed so that the split is reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=5) ### 75% train data and 25% test data
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

### Linear Discriminant <a name="linearDiscriminant"></a>

First, let us see how accurate we can expect this model to be.

In [None]:
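linear = LinearDiscriminantAnalysis()
linear.fit(X_train, Y_train) ### train on the 75% training split only
y_pred = linear.predict(X_test) ### predict the species of the held-out 25%
print(metrics.accuracy_score(Y_test, y_pred)) ### accuracy on samples the model has not seen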
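
The accuracy score printed above is the fraction of predictions that match the true labels. A minimal sketch of that calculation, with a hypothetical function name `accuracy` and made-up labels, might look like this:

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that equal the corresponding true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up labels for illustration: three of the four predictions are right.
y_true = ["setosa", "versicolor", "virginica", "virginica"]
y_hat = ["setosa", "versicolor", "versicolor", "virginica"]
print(accuracy(y_true, y_hat))  # 0.75
```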

We can now pass measurements to the model and see which species it predicts. We will give it fairly low numbers as a test because we know that it should then output setosa.

In [None]:
linear.predict([[4,4,1,2]]) ### make a prediction for an example of an out-of-sample observation

As expected, we got back setosa, which is a good sign that the model is working. Now we can start entering more complicated input variables and observing the outputs.

### K neighbours <a name="kNeighbours"></a>
First, let us see how accurate we can expect this model to be.

In [None]:
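knn = KNeighborsClassifier(n_neighbors=12) ### classify each sample by a vote among its 12 nearest neighbours
knn.fit(X_train, Y_train) ### train on the 75% training split only
y_pred = knn.predict(X_test) ### predict the species of the held-out 25%
print(metrics.accuracy_score(Y_test, y_pred)) ### accuracy on samples the model has not seen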

We can now pass measurements to the model and see which species it predicts. We will give it fairly low numbers as a test because we know that it should then output setosa.

In [None]:

knn.predict([[4, 4, 1, 2]]) ### make a prediction for an example of an out-of-sample observation

As expected, we got back setosa, which is a good sign that the model is working. Now we can start entering more complicated input variables and observing the outputs.

## References <a name="references"></a>

* [LinearDiscriminantAnalysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)
* [KNeighborsClassifier](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
* [Machine Learning with Iris Dataset](https://www.kaggle.com/jchen2186/machine-learning-with-iris-dataset/notebook)


Author: *Gary Connelly*