# The Iris Dataset

The iris dataset is a famous dataset taken from Fisher's 1936 paper. Iris is a genus of flowering plants, and "iris" is also the common name used for all of its species.

Fisher's dataset contains data about 3 classes/species of iris plant (iris-setosa, iris-versicolor and iris-virginica). 150 plants were sampled, 50 for each species of iris. Four features (individual x variables) were measured: sepal length, sepal width, petal length and petal width, all in cm. The class of iris plant was the target (y variable) which had been assigned by an expert botanist. Thus, this is an example of supervised learning. 

Importing an image that shows the different species of iris:

In [None]:
from IPython import display
display.Image(r)

Aim: to build a model from the iris dataset that predicts the species of an iris from these four measured features.

**Importing the data**

Scikit-learn contains a few datasets (including iris). Therefore, scikit-learn can be loaded and the iris dataset imported from this, rather than having to download it from an external website. 

Defining and splitting the data into X and y: 

- X (capital) is given for all the features (sepal length, width and petal length, width) 

- y is given for the target (iris class - setosa, versicolor and virginica)

In [None]:
from sklearn.datasets import load_iris

X, y = load_iris(as_frame=True, return_X_y=True)
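As a quick sanity check of the description above, the sketch below confirms the shape of the feature table and the number of samples per class. The variable names `n_samples`, `n_features` and `samples_per_class` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_iris

X, y = load_iris(as_frame=True, return_X_y=True)

# Rows are individual plants, columns are the four measurements.
n_samples, n_features = X.shape

# Count how many plants belong to each numeric class (0, 1, 2).
samples_per_class = y.value_counts().sort_index().tolist()

print(n_samples, n_features, samples_per_class)
```

If the dataset matches the description, this should report 150 plants, 4 features and 50 plants in each class.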

**Examining the data**

First looking at the targets:

In [None]:
y = y.replace(dict(enumerate(load_iris().target_names)))

The line above converts the numerical target values (0, 1, 2) in the y column to the corresponding species names (setosa, versicolor, virginica).
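To see what is passed to `replace`, the mapping can be built on its own. This is a small sketch; the name `mapping` is illustrative, and the names are converted to plain Python strings only to make the printed result easier to read.

```python
from sklearn.datasets import load_iris

target_names = load_iris().target_names

# enumerate pairs each species name with its position, which is its numeric code.
mapping = {code: str(name) for code, name in enumerate(target_names)}

print(mapping)
```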

In [None]:
y

Looking at the features:

In [None]:
X.head()

This table shows the 4 features and the data for the first 5 samples of iris.

**Plotting the correlation between features**

To get the linear correlation between all the features, the corr() method is called on the features (X) and the result assigned to the variable corr. The table sets out the features along the rows and columns, giving the correlation between each pair. A value of 1.0 is a perfect positive correlation, which is why it appears on the diagonal, where each feature is compared with itself.

In [None]:
corr = X.corr()

In [None]:
corr
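For reference, each entry in this table is a Pearson correlation coefficient, which is the default for `corr()`. The sketch below recomputes one entry by hand with NumPy and compares it with the pandas value. The names `manual_r` and `pandas_r` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(as_frame=True, return_X_y=True)
a = X["petal length (cm)"].to_numpy()
b = X["petal width (cm)"].to_numpy()

# Pearson r: covariance of the two features divided by the product of their spreads.
a_centred = a - a.mean()
b_centred = b - b.mean()
manual_r = (a_centred * b_centred).sum() / np.sqrt(
    (a_centred**2).sum() * (b_centred**2).sum()
)

# The same entry taken from the pandas correlation table.
pandas_r = X.corr().loc["petal length (cm)", "petal width (cm)"]

print(manual_r, pandas_r)
```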

It is useful to see the correlation visually. Thus, a heatmap and pairplot can be plotted.

Plotting a heatmap:

The visualisation library Seaborn is imported and its heatmap function is applied to the correlation data.

In [None]:
%matplotlib inline

import seaborn as sns

sns.heatmap(corr, vmin=-1.0, vmax=1.0, square=True, cmap="RdBu")

The heatmap plots every feature against every other, colouring each cell by the strength and sign of the correlation between the two features. A first glance shows that sepal width and sepal length are the least correlated pair. In comparison, petal width and petal length are the most positively correlated. Petal length and sepal width are the most negatively correlated pair.
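The pairs read off the heatmap can also be checked numerically. The following is a minimal sketch that lists each unordered pair of features once and picks out the extremes. The helper names (`pairs`, `most_positive`, `most_negative`, `least_correlated`) are illustrative.

```python
from itertools import combinations

from sklearn.datasets import load_iris

X, _ = load_iris(as_frame=True, return_X_y=True)
corr = X.corr()

# One entry per unordered pair of distinct features.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

most_positive = max(pairs, key=pairs.get)
most_negative = min(pairs, key=pairs.get)
least_correlated = min(pairs, key=lambda pair: abs(pairs[pair]))

print(most_positive, most_negative, least_correlated)
```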

Plotting a pairplot:

pandas is imported and the theme is set.

In [None]:
import pandas as pd

In [None]:
sns.set_theme(style="darkgrid", palette="husl") 

To plot a pairplot and a scatter plot, the X and y data need to be combined into a single DataFrame:

In [None]:
X_y = pd.concat([X,y], axis=1)

In [None]:
sns.pairplot(data=X_y, hue="target")

The pairplot plots each feature against every other, and each cell shows a scatter plot of the relationship between the two features. The points are also coloured by species of iris. It confirms that sepal width and sepal length are the least correlated features, as the datapoints are more scattered, forming a blob rather than a strong diagonal line. It therefore seems that sepal width and sepal length are a reasonable pair of features for a model that predicts the species of iris, since each of two weakly correlated features contributes information that the other does not.

Looking at the individual scatter plot for sepal length against sepal width:

In [None]:
sns.relplot(data=X_y, x="sepal length (cm)", y="sepal width (cm)", hue="target")

This scatter plot is the same as in the pairplot but blown up. It shows clearly that the two features are fairly uncorrelated. The species look fairly distinct from each other with setosa being well separated from versicolor and virginica. 

**Making a predictive model**

**k-nearest neighbours (kNN)**

To make a model which can predict the species of iris based on sepal length and sepal width, k-nearest neighbours can be used. This should work reasonably well because the two features are only weakly correlated and, when plotted against each other, the species are fairly distinguishable.

kNN classifies a new sample by finding the k training samples closest to it and assigning the most common species among those nearest neighbours.
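The idea can be sketched in a few lines of NumPy. This is a simplified illustration, not how scikit-learn implements it. The function `knn_predict` and the six toy measurements are made up for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy (sepal length, sepal width) values, invented for illustration.
train_points = np.array(
    [[5.0, 3.5], [4.9, 3.0], [5.1, 3.8], [6.5, 2.8], [6.7, 3.0], [6.3, 2.5]]
)
train_labels = np.array(["setosa"] * 3 + ["versicolor"] * 3)

prediction = knn_predict(train_points, train_labels, np.array([5.0, 3.4]), k=3)
print(prediction)
```

The query point sits among the three toy setosa points, so all three nearest neighbours vote for setosa.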

First the dataframe is defined and the two desired features selected. In this case they are sepal length and sepal width. Here XS is used to represent X-Subset.

The data is then split into a train and a test subset. Here random_state is set to 42 so that the random split is reproducible each time.

In [None]:
from pandas import DataFrame
from sklearn.model_selection import train_test_split

iris = load_iris()  # load the dataset once and reuse it
X = DataFrame(iris.data, columns=iris.feature_names)
XS = X[["sepal length (cm)", "sepal width (cm)"]]
y = iris.target

train_XS, test_XS, train_yS, test_yS = train_test_split(XS, y, random_state=42)
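As a rough check of what the split produces, the sketch below repeats it twice with the same `random_state` and compares the results. It assumes the default `test_size` of 0.25; the names `train_a`, `test_a`, `train_b`, `test_b` and `same_split` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(as_frame=True, return_X_y=True)
XS = X[["sepal length (cm)", "sepal width (cm)"]]

# Two independent calls with the same seed.
train_a, test_a, _, _ = train_test_split(XS, y, random_state=42)
train_b, test_b, _, _ = train_test_split(XS, y, random_state=42)

# The same rows should land in the same subsets both times.
same_split = train_a.index.equals(train_b.index) and test_a.index.equals(test_b.index)

print(len(train_a), len(test_a), same_split)
```

With 150 samples and the default settings, 112 samples go to training and 38 to testing.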

To build the model, the KNeighborsClassifier class needs to be imported. For now, the number of neighbours is set to the default, 5.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(train_XS, train_yS)

The performance of the model can be checked against the test data, producing a score which assesses how good the model is.

In [None]:
model.score(test_XS, test_yS)

This score suggests the model is pretty good but there is still room for improvement.
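For a classifier, `score` returns the mean accuracy: the fraction of test samples whose predicted species matches the true one. The sketch below rebuilds the same split and model to show this. The names `predictions` and `manual_accuracy` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(as_frame=True, return_X_y=True)
XS = X[["sepal length (cm)", "sepal width (cm)"]]
train_XS, test_XS, train_yS, test_yS = train_test_split(XS, y, random_state=42)

model = KNeighborsClassifier(n_neighbors=5).fit(train_XS, train_yS)

# Fraction of test samples that were classified correctly.
predictions = model.predict(test_XS)
manual_accuracy = np.mean(predictions == test_yS.to_numpy())

print(manual_accuracy, model.score(test_XS, test_yS))
```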

The hyperparameter (number of neighbours) can be tuned to find the number of nearest neighbours that produces the best score. This can be done using GridSearchCV, which uses cross-validation to evaluate every hyperparameter value it is given, so the best one can be chosen.

In this case, it will try every number of neighbours from 1 to 59, since range(1, 60) excludes 60, using only the training data to choose the best one.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

parameters = {
    "n_neighbors" : range(1, 60),
}
model = GridSearchCV(KNeighborsClassifier(), parameters)
model.fit(train_XS, train_yS)
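Conceptually, the grid search is a loop over candidate values with cross-validation inside. The following sketch writes that loop out by hand and compares the best mean score with the one reported by GridSearchCV. It assumes five-fold cross-validation, which is the scikit-learn default, and names such as `candidate_ks`, `mean_scores` and `search` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(as_frame=True, return_X_y=True)
XS = X[["sepal length (cm)", "sepal width (cm)"]]
train_XS, test_XS, train_yS, test_yS = train_test_split(XS, y, random_state=42)

candidate_ks = list(range(1, 60))

# Manual version: mean cross-validated accuracy for each candidate k.
mean_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), train_XS, train_yS, cv=5).mean()
    for k in candidate_ks
]
manual_best_score = max(mean_scores)

# The same search done by GridSearchCV.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": candidate_ks}, cv=5)
search.fit(train_XS, train_yS)

print(manual_best_score, search.best_score_, search.best_params_)
```

Only the training data is used in both versions, so the test set stays untouched until the final evaluation.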

The cross-validation results are then sorted by rank:

In [None]:
cv_results = DataFrame(model.cv_results_)
cv_results = cv_results.sort_values(["rank_test_score", "mean_test_score"])
cv_results.head()[["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]]

This table shows that 18 and 31 neighbours are ranked best, giving the highest mean cross-validation score.

A scatter plot can also be plotted to show this visually:

In [None]:
cv_results.plot.scatter("param_n_neighbors", "mean_test_score", yerr="std_test_score")

From both the table and plot, 18 or 31 seems like the best number of nearest neighbours to use. The model can now be fitted with both 18 and 31 nearest neighbours to see which performs better on the test data.

First with 18 nearest neighbours:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=18)
model.fit(train_XS, train_yS)

In [None]:
model.score(test_XS, test_yS)

This gives a lower score than the default of 5 nearest neighbours.

Now fitting with 31 nearest neighbours:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=31)
model.fit(train_XS, train_yS)

In [None]:
model.score(test_XS, test_yS)

This shows that using 31 nearest neighbours gives a better model than with 18 and 5. It improves the model, increasing the score to 0.895 (3 d.p.). Again, this is a good score but can still be improved.

This model can then be plotted by creating a plot.py module with the relevant code and importing it.

The plot shows the nearest neighbours classification and the decision boundaries for each species when k = 31.

In [None]:
from plot import plot_knn

plot_knn(model, XS, y)

As before, this kNN plot of the model shows that setosa is well separated from both versicolor and virginica. However, versicolor and virginica overlap somewhat and are not as well separated from each other.

**Modifying the model**

**Feature scaling**

The values for sepal length and sepal width have different ranges, but they are not hugely different. A big difference in ranges can affect kNN, because features with larger values dominate the distance calculation. One way to combat this is to scale the values. In this case, feature scaling is applied to see what difference it makes to the ranges and, subsequently, to the model score. StandardScaler can be imported from scikit-learn to scale the data.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(XS)

In [None]:
XS_scaled_raw = scaler.transform(XS)
XS_scaled = pd.DataFrame(XS_scaled_raw, columns=XS.columns)
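StandardScaler subtracts each feature's mean and divides by its standard deviation. The sketch below reproduces that by hand, assuming the population standard deviation (`ddof=0`), which is what StandardScaler uses. The names `scaled` and `manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(as_frame=True, return_X_y=True)
XS = X[["sepal length (cm)", "sepal width (cm)"]]

# Scaling done by scikit-learn.
scaled = StandardScaler().fit_transform(XS)

# The same scaling written out: (value - mean) / standard deviation.
manual = (XS - XS.mean()) / XS.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each feature has a mean of approximately 0 and a standard deviation of 1.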

The scaled results can then be plotted as a scatter plot to see the changes in range to sepal length and sepal width.

In [None]:
sns.relplot(data=XS_scaled, x="sepal length (cm)", y="sepal width (cm)", hue=y)

After being scaled, both features are on a comparable scale, each with zero mean and unit variance, so neither dominates the spread of the data.

Feature scaling can then be added to the model by making a pipeline that first scales the data and then passes the scaled data to kNN. The pipeline is then fitted and a model score generated.

In [None]:
from sklearn.pipeline import make_pipeline

scaled_knn_XS = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=31)
)
scaled_knn_XS

In [None]:
scaled_knn_XS.fit(train_XS, train_yS)
scaled_knn_XS.score(test_XS, test_yS)

The score is in fact slightly worse compared to the non-scaled data with k=31. The boundaries can then be plotted again by calling the plot_knn function from the previously created module.

In [None]:
plot_knn(scaled_knn_XS, XS, y)

Comparing this to the plot for the non-scaled data it can be observed that there is only a subtle change with the boundaries.
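To make explicit what the scaling pipeline does, the sketch below performs the same two steps by hand: the scaler is fitted on the training data only, and the same fitted scaler is then applied to the test data before scoring. The names `pipeline_score` and `manual_score` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(as_frame=True, return_X_y=True)
XS = X[["sepal length (cm)", "sepal width (cm)"]]
train_XS, test_XS, train_yS, test_yS = train_test_split(XS, y, random_state=42)

# Pipeline version: scaling and kNN are chained automatically.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=31))
pipeline_score = pipeline.fit(train_XS, train_yS).score(test_XS, test_yS)

# Manual version: fit the scaler on the training data, then reuse it for the test data.
scaler = StandardScaler().fit(train_XS)
knn = KNeighborsClassifier(n_neighbors=31).fit(scaler.transform(train_XS), train_yS)
manual_score = knn.score(scaler.transform(test_XS), test_yS)

print(pipeline_score, manual_score)
```

Using a pipeline avoids accidentally fitting the scaler on the test data, which would leak information from the test set into the model.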

**Applying the model to all the features**

This pipeline (scaling and kNN) can now be applied to all the features rather than just two (sepal length and sepal width). The data is first re-split, this time using all the features, and then the pipeline is applied to it.

In [None]:
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42) 

In [None]:
knn_all_X = make_pipeline(    
    StandardScaler(),
    KNeighborsClassifier()
)
knn_all_X

In [None]:
knn_all_X.fit(train_X, train_y)
knn_all_X.score(test_X, test_y)

The score is 1.0, which looks too perfect: the test set holds only 38 samples, so a perfect score may be optimistic rather than proof of a flawless model. PCA can be applied to reduce the dimensionality whilst still retaining most of the information in the data, to see whether this gives a more realistic score.

**Principal Component Analysis (PCA)**

Principal Component Analysis (PCA) finds new axes (principal components) that are combinations of the original features, so that highly correlated features are grouped together in the same component. Keeping only the leading components reduces the dimensionality while still allowing information from more than two features to be included in the model, which may improve the score and better represent the data as a whole.

The PCA class can be imported and then incorporated into the pipeline. Note that here the number of components is set to 4, the same as the number of features in the dataset, so no dimensions are dropped at this stage.

In [None]:
from sklearn.decomposition import PCA

pca_knn_X = make_pipeline(
    StandardScaler(),
    PCA(n_components=4),  
    KNeighborsClassifier(n_neighbors=31)
)
pca_knn_X

In [None]:
pca_knn_X.fit(train_X, train_y)
pca_knn_X.score(test_X, test_y)

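The score is no longer exactly 1.0, which looks more realistic. Note, however, that this pipeline also uses 31 neighbours rather than the default 5, and that keeping all four components only rotates the scaled data without changing the distances between samples. The change in score is therefore more likely to come from the change in k than from PCA itself.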
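Since kNN depends only on distances between samples, it is worth checking what PCA with all four components does to those distances. The sketch below is an illustration on the full scaled dataset rather than the training split. The helper `all_pairwise_distances` and the other names are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def all_pairwise_distances(points):
    """Euclidean distance between every pair of rows in `points`."""
    differences = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(differences, axis=-1)


X, _ = load_iris(as_frame=True, return_X_y=True)
scaled = StandardScaler().fit_transform(X)

# Keep all four components, so nothing is discarded.
rotated = PCA(n_components=4).fit_transform(scaled)

distances_before = all_pairwise_distances(scaled)
distances_after = all_pairwise_distances(rotated)
max_difference = np.abs(distances_before - distances_after).max()

print(max_difference)
```

The largest difference should be at the level of floating-point error, which suggests that PCA only starts to affect kNN once some components are dropped.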

The amount of variation explained by each principal component (PC) can then be observed.

In [None]:
pca_knn_X["pca"].explained_variance_ratio_

This output reveals that the first principal component explains 72% of the variation, the second 24%, the third 4% and the fourth 0.6%. Thus, the first two PCs explain most of the variation. All four components together explain 100% of the variation, as shown below.

In [None]:
sum(pca_knn_X["pca"].explained_variance_ratio_)
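The explained variance ratios correspond to the eigenvalues of the covariance matrix of the scaled data, each divided by their total. The sketch below checks this with NumPy. It is fitted on the full dataset rather than the training split, so the numbers will differ slightly from the output above, and the names `manual_ratio` and `pca_ratio` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(as_frame=True, return_X_y=True)
scaled = StandardScaler().fit_transform(X)

# Eigenvalues of the covariance matrix, sorted from largest to smallest.
covariance = np.cov(scaled, rowvar=False)
eigenvalues = np.linalg.eigvalsh(covariance)[::-1]  # eigvalsh returns ascending order
manual_ratio = eigenvalues / eigenvalues.sum()

# The same quantity as reported by scikit-learn.
pca_ratio = PCA(n_components=4).fit(scaled).explained_variance_ratio_

print(manual_ratio)
print(pca_ratio)
```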

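GridSearchCV can be employed again to try different numbers of PCs and see how many components give the best score. Note that the pipeline used in this search applies PCA directly to the features, without the scaling step.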

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

pca_knn_cv = GridSearchCV(
    make_pipeline(
        PCA(),
        KNeighborsClassifier(n_neighbors=31)
    ),
    {
        "pca__n_components" : range(1, 5),
    }
)
pca_knn_cv

In [None]:
pca_knn_cv.fit(train_X, train_y)
pca_knn_cv.score(test_X, test_y)

In [None]:
pca_knn_cv.best_estimator_["pca"].n_components_

This shows that 4 PCs are best to use. As before, the pipeline can then be run with scaled data, 4 PCs and 31 nearest neighbours.

In [None]:
from sklearn.decomposition import PCA

pca_knn_2 = make_pipeline(
    StandardScaler(),
    PCA(n_components=4),  
    KNeighborsClassifier(n_neighbors=31)
)
pca_knn_2

In [None]:
pca_knn_2.fit(train_X, train_y)
pca_knn_2.score(test_X, test_y)

In [None]:
sum(pca_knn_2["pca"].explained_variance_ratio_)

The score is 0.947 (3 d.p.), which is good and is higher than every two-feature model above: kNN with 5 nearest neighbours, kNN with 31 nearest neighbours, and the scaled pipeline. The model incorporates scaled data and principal components that together explain all the variation in the data. This is good, as it takes all the collected data into account rather than just a selected part of it. Because all four components are kept, PCA here re-expresses the four features rather than reducing their number, which is what allows more than two features to be incorporated into the model. Overall, the model is a good predictor of the species of an iris plant and has a distinct separation between setosa and both versicolor and virginica.