# Iris Data Set Visualization and Machine Learning Implementation

## Goal: To classify the species of flower based on their attributes

## Importing our dataset and creating a DataFrame for analysis

In [None]:
##Importing basic Python libraries necessary for data manipulation and visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

##Enabling matplotlib in jupyter notebook

%matplotlib inline

In [None]:
##Step 1: Loading and examining the Iris dataset file into a Pandas DataFrame for analysis.
#Creating a DataFrame with variable name 'iris'.

iris = pd.read_csv ('../input/iris/Iris.csv')
iris.head()

In [None]:
#Assigning Id column as the index for slightly easier data manipulation later on.

iris.set_index('Id', inplace = True)

In [None]:
iris.head()

In [None]:
#Changing all column names to lowercase for easier selection (personal preference).

iris.columns = iris.columns.str.lower()

In [None]:
iris.head()

## Exploratory Data Analysis

In [None]:
#Step 2: Performing statistical analysis on the dataset, as well as checking for possible errors (missing values).

iris.info()

In [None]:
iris.describe()

In [None]:
#Counting the missing values in each feature column.
iris[['sepallengthcm', 'sepalwidthcm', 'petallengthcm', 'petalwidthcm']].isna().sum()

#Fortunately, there are no missing values in this dataset.

In [None]:
#Step 3: Plotting a few graphs to get a sense of the relationships between the features.
#Using Seaborn's pairplot to gain a broad overview of the dataset.

sns.pairplot(iris, hue = 'species', diag_kind = 'hist', palette = 'Set1')

In [None]:
#Using Seaborn's jointplot for a slightly more in-depth look.

sns.jointplot(x = 'sepallengthcm', y = 'sepalwidthcm', data = iris)

In [None]:
#Using Seaborn's scatterplot to obtain a similar plot, but color-coded by their species

sns.scatterplot(x = 'sepallengthcm', y = 'sepalwidthcm', data = iris, hue = 'species', palette = 'Set1')
plt.title('Sepal Width vs Sepal Length')
plt.legend(loc = 'center left', bbox_to_anchor = (1.0, 0.5))

From the above plots, namely the pairplot and scatterplot, we can see that one species already stands out from the rest: Setosa. Its data points form a group that is clearly separated from those of the other two species, which lie much closer to each other.
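To put a number on what the plots suggest, the per-species feature means can be compared. This is only a sketch: it assumes scikit-learn's bundled copy of the Iris data (`load_iris`) as a stand-in for `Iris.csv`, so the column names differ from the lowercase ones used in this notebook, and `df` and `species_means` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
bunch = load_iris(as_frame=True)
df = bunch.frame  # four feature columns plus an integer 'target' column

# Map the integer targets (0, 1, 2) to readable species names.
names = {i: str(name) for i, name in enumerate(bunch.target_names)}
df['species'] = df['target'].map(names)

# Mean of every feature, computed separately for each species.
species_means = df.drop(columns='target').groupby('species').mean()
print(species_means)
```

If Setosa really is the odd one out, its petal measurements in particular should sit well below those of the other two species in this table.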

In [None]:
##We shall now continue with the plots, with a deeper look into the distribution and correlation between the features.

In [None]:
#Using matplotlib to create a figure for a more compressed view of the four features.
#Using histograms for an overview of the varying values of each feature.

iris.hist(figsize = (12, 12), ec = 'black')

In [None]:
#Using matplotlib to create a figure consisting of four subplots.
#Using Seaborn's boxplot to gain insight into the distribution of the four features.

plt.figure(figsize = (12, 12))
sns.set_style('whitegrid')
sns.set_palette('Set1')
plt.subplot(2, 2, 1)
sns.boxplot(x = 'species', y = 'petallengthcm', data = iris)
plt.subplot(2, 2, 2)
sns.boxplot(x = 'species', y = 'petalwidthcm', data = iris)
plt.subplot(2, 2, 3)
sns.boxplot(x = 'species', y = 'sepallengthcm', data = iris)
plt.subplot(2, 2, 4)
sns.boxplot(x = 'species', y = 'sepalwidthcm', data = iris)

In [None]:
#Now, I wish to show the correlation between the features (if any), using Seaborn's heatmap.
#Firstly, doing some basic transformation of the pandas DataFrame.

iris.head()

In [None]:
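#Dropping the non-numeric 'species' column, since correlation only applies to numeric data.
iris_corr = iris.drop('species', axis = 1).corr()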
iris_corr.head()

In [None]:
#Using Seaborn's heatmap to visualize the correlation.

sns.heatmap(iris_corr, cbar = True, annot = True, cmap = 'RdBu_r')

Having done some exploratory data analysis, it is time to move on to training and implementing our machine learning algorithms.

## Machine Learning Algorithms

The end goal of this analysis is to create a model which is able to classify the species of a flower based on its attributes. Since this is a classification problem, we will employ algorithms suited to classification.

In particular, we will strive to implement the following Machine Learning models:
1. Logistic Regression
2. K-Nearest Neighbours
3. Decision Trees/Random Forests
4. Support Vector Machines


### 1. Logistic Regression

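Logistic regression is a reasonably straightforward linear statistical model which allows us to predict a categorical
outcome from one or more independent variables. In its basic form the outcome is binary (two classes), but the model extends to more than two classes, as is the case with our three species.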

In our case, we will be measuring the relationships between our sole categorical variable (the species of flower), and the rest of our independent variables (the sepal length/width and petal length/width).

Ultimately, the model will estimate the probability of a data point belonging to a particular species of flower using a logistic function (the sigmoid function, or its multi-class generalization, the softmax function).
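As a rough illustration of how scores become probabilities, here is a minimal NumPy sketch of the sigmoid function and its multi-class counterpart, the softmax function. The `scores` values are hypothetical and not taken from a fitted model. Whether scikit-learn applies softmax or a one-vs-rest sigmoid scheme internally depends on the version and settings, so treat this as a conceptual sketch rather than a description of `LogisticRegression` itself.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Turns one score per class into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical linear scores for one flower, one score per species.
scores = np.array([2.0, 1.0, -1.0])
probs = softmax(scores)

print(sigmoid(0.0))        # a score of 0 maps to a probability of 0.5
print(probs, probs.sum())  # the class with the largest score is the most probable
```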

In [None]:
##I will be going through my process of implementing logistic regression in the following rows using a step-by-step
##approach once again.

In [None]:
##Step 1: Creating a training and test set (the test set is held back so that the model can be evaluated on data it has not seen during training)

#Brief review of our DataFrame:
iris.head()

In [None]:
#Creating our matrix of flower features and their respective values for each data point, X.

#We drop the 'species' column since it is our dependent variable in this case.
X = iris.drop('species', axis = 1) 
X.head()

In [None]:
#Assigning our dependent variable, the species of the flower, to y.

y = iris['species']
y.value_counts()

In [None]:
#From here, we will be importing relevant modules from libraries to allow us to create our training and test data sets.

from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)

In [None]:
#With this, our training (X_train, y_train) and test (X_test, y_test) sets have been created.

In [None]:
##Step 2: Training our model by fitting it to the training data sets.

In [None]:
#Importing our LogisticRegression module and creating an instance to carry out the training of the model

from sklearn.linear_model import LogisticRegression

logm = LogisticRegression()

In [None]:
#Fitting the model to our training data.

logm.fit(X_train, y_train)

In [None]:
##Step 3: Obtaining our predictions by using the model with the test set.

In [None]:
lg_predictions = logm.predict(X_test)

In [None]:
##Step 4: Evaluating our results by comparing the predicted results (lg_predictions) to the actual results (y_test).
#To do this, I will import a few functions which allow us to summarize this comparison.

from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(confusion_matrix(y_test, lg_predictions))

In [None]:
print(classification_report(y_test, lg_predictions))

From this, we can see that our Logistic Regression model did reasonably well at predicting the species of flower, with good accuracy and F1-scores. However, it may be too soon to conclude whether this is the best model, given the relatively simple dataset. Perhaps we can achieve better scores with the other models which we will be training.
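To make the numbers in the two reports easier to read, the sketch below shows how accuracy, recall, and precision follow from a confusion matrix. The labels are hypothetical values chosen for illustration, not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only, not the notebook's real output.
y_true = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_pred = ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica', 'virginica']

# Rows are the actual species, columns are the predicted species.
cm = confusion_matrix(y_true, y_pred)

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per species: correct / actual count
precision = np.diag(cm) / cm.sum(axis=0)  # per species: correct / predicted count

print(cm)
print(accuracy, recall, precision)
```

Here one versicolor flower is mistaken for virginica, which lowers the recall of versicolor and the precision of virginica while leaving setosa untouched.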

### 2. K-Nearest Neighbours (kNN)

K-Nearest Neighbours is another relatively simple machine learning algorithm which can be used for both classification and regression problems. It works on the assumption that data points which share similar features will be clustered together. Hence, a new data point can be classified by looking at the labelled training points closest to it and assigning it the most common species among them.

The main challenge when it comes to implementing this algorithm is choosing the value of 'k', the number of nearest neighbours taken into account when making each prediction. If 'k' is too small, the prediction is sensitive to noise and we may end up jumping to conclusions. If 'k' is too large, the neighbourhood starts to include points from other 'clusters' and the boundaries between them become blurred.

Fortunately, from the above Exploratory Data Analysis we have carried out, we can see that the data points have already been nicely 'clustered' for us. Hence, implementing kNN should proceed rather smoothly.
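The idea behind kNN can be sketched in a few lines of NumPy: measure the distance from a new point to every training point, take the 'k' closest, and let them vote. The function `knn_predict` and the toy petal measurements below are hypothetical, written only to illustrate the principle; the notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_known, y_known, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled points."""
    distances = np.linalg.norm(X_known - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    votes = Counter(str(y_known[i]) for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (petal length, petal width) for two species.
X_toy = np.array([[1.4, 0.2], [1.5, 0.2], [1.3, 0.3],
                  [4.5, 1.5], [4.7, 1.4], [4.4, 1.3]])
y_toy = np.array(['setosa'] * 3 + ['versicolor'] * 3)

print(knn_predict(X_toy, y_toy, np.array([1.6, 0.25]), k=3))
print(knn_predict(X_toy, y_toy, np.array([4.6, 1.4]), k=3))
```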

In [None]:
##Step 1: Creating a training and test data set.

In [None]:
X = iris.drop('species', axis = 1)
y = iris['species']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)

In [None]:
##Step 2: Implementing kNN, starting with an arbitrary value of k=5.
#Now, we will import KNeighborsClassifier to carry out our implementation and learning.

from sklearn.neighbors import KNeighborsClassifier

In [None]:
#Creating an instance and fitting it to our training data.

knn = KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean') 
knn.fit(X_train, y_train)

In [None]:
#Creating our predictions.

knn_predictions = knn.predict(X_test)

In [None]:
##Step 3: Evaluating our predicted results against the actual results.

In [None]:
print(confusion_matrix(y_test, knn_predictions))

In [None]:
print(classification_report(y_test, knn_predictions))

In [None]:
##Step 4: Choosing the 'best' value of 'k'
#In order to do this, we can plot a graph of the error-rate of the model against the k-value that is being selected.
#This allows us to get a better gauge of the domain of k values which may allow us to obtain a better accuracy on our
#model.

In [None]:
#We will use a for loop to repeat steps 2-3 using different values of k.
error_rate = []

for i in range(1, 30):
    knn_i = KNeighborsClassifier(n_neighbors = i)
    knn_i.fit(X_train, y_train)
    knn_i_pred = knn_i.predict(X_test)
    #Error rate = fraction of test points predicted incorrectly.
    error_rate.append(np.mean(knn_i_pred != y_test))

In [None]:
plt.figure(figsize = (12, 8))
plt.plot(np.arange(1, 30), error_rate, 'o-')
plt.xlabel('k')
plt.ylabel('Error rate')

In [None]:
#Seems like our initial value of k=5 was a pretty good estimate. Hence, we shall stick with it.

### 3. Decision Trees/Random Forests

Decision trees are another classification/regression algorithm whose objective is to predict the value of a variable by applying a series of 'rules', arranged as a tree, to the features of each data point.

In layman's terms, a decision tree follows an 'if this condition on a feature is true, go down this branch, else go down the other' principle until it reaches a leaf, which gives the prediction.
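The 'if this, else that' rules of a fitted tree can be printed as text with scikit-learn's `export_text`. The sketch below assumes scikit-learn's bundled Iris data as a stand-in for `Iris.csv` and limits the tree to a depth of 2 so the output stays short. The name `shallow_tree` and the choice of `max_depth = 2` are illustrative and not part of the notebook's model.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
data = load_iris()

# A deliberately shallow tree so that the printed rules stay readable.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(data.data, data.target)

# Each indented line is one 'if feature <= threshold' test; leaves show the class.
rules = export_text(shallow_tree, feature_names=list(data.feature_names))
print(rules)
```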

In [None]:
##Step 1: Creating a training/test data set.
#X and y have already been defined from above examples.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)

In [None]:
##Step 2: Importing our module to fit and train the model for predictions.

from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()

In [None]:
tree.fit(X_train, y_train)

In [None]:
tree_predictions = tree.predict(X_test)

In [None]:
##Step 3: Evaluating our metrics given the trained model.

print(confusion_matrix(y_test, tree_predictions))

In [None]:
print(classification_report(y_test, tree_predictions))

In [None]:
##We shall now carry on with the Random Forests implementation. For a
##more detailed explanation of this model, please refer to other
##sources.

In [None]:
##Step 1: Importing our random forests module and training it.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

In [None]:
rfc.fit(X_train, y_train)
rfc_predictions = rfc.predict(X_test)

In [None]:
##Step 2: Evaluating our metrics given the predicted values.

print(confusion_matrix(y_test, rfc_predictions))

In [None]:
print(classification_report(y_test, rfc_predictions))

Looks like we got similar results to our DecisionTreeClassifier.

A Random Forest trains many decision trees, each on a random sample of the data, and considers only a random subset of the features at every split. This reduces variance by 'decorrelating' the individual trees, which is especially useful if one particular feature has a disproportionately large impact on the outcome.

In such cases, depending on whether our end goal is to obtain high precision or high recall, a decision on which model to use will have to be made.
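The random selection of features at each split is controlled by the `max_features` parameter of `RandomForestClassifier`. The sketch below assumes scikit-learn's bundled Iris data, and the names `forest`, `X_demo`, and `y_demo` are introduced here for illustration. With four features, `max_features = 'sqrt'` lets each split consider two randomly chosen features.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

# 100 trees; each split considers only sqrt(4) = 2 randomly chosen features.
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
forest.fit(X_demo, y_demo)

print(len(forest.estimators_))      # number of individual trees in the forest
print(forest.feature_importances_)  # relative importance of each feature, sums to 1
```

Inspecting `feature_importances_` is one way to see whether a single feature dominates the predictions.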

### 4. Support Vector Machines

Support Vector Machines work by learning from the data to fit a dividing hyperplane which best separates the data points based on their class, that is, with the largest possible margin to the nearest points on either side. This can also be seen as a 'decision boundary' for those who may be more familiar with this term. SVMs can be used for both regression and classification problems, and are suitable for our dataset here.
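To make the hyperplane concrete, the sketch below fits a linear SVM to just two species and two petal features, then reads off the hyperplane's weights `w` and offset `b`. It assumes scikit-learn's bundled Iris data. The two-class, two-feature restriction and the linear kernel are simplifications made here for illustration; the notebook's own model uses the default `SVC()` on all features and classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X_all, y_all = load_iris(return_X_y=True)

# Keep two species (targets 0 and 1) and the two petal features.
mask = y_all < 2
X_two, y_two = X_all[mask][:, 2:4], y_all[mask]

linear_svc = SVC(kernel='linear')
linear_svc.fit(X_two, y_two)

# The hyperplane is the set of points x where w . x + b = 0.
w, b = linear_svc.coef_[0], linear_svc.intercept_[0]
print(w, b)

# Support vectors are the training points closest to the hyperplane.
print(len(linear_svc.support_vectors_))
```

Points where `w . x + b` is positive fall on one side of the boundary and points where it is negative fall on the other, which is what `decision_function` returns.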

In [None]:
##Step 1: Creating a training and test data set.
##Since the steps have been repeated multiple times in the above few
##examples, I shall be skipping this step from here on.

In [None]:
##Step 2: Importing our SVM classifier and training it.

from sklearn.svm import SVC
svc = SVC()

In [None]:
svc.fit(X_train, y_train)
svc_predictions = svc.predict(X_test)

In [None]:
##Step 3: Evaluating our metrics given predicted values.

print(confusion_matrix(y_test, svc_predictions))

In [None]:
print(classification_report(y_test, svc_predictions))

With that, we have completed the implementation of the ML algorithms which we set out to do. There are definitely other methods which could be used, but I shall leave that to the curious mind to explore. As this is my first project, please do go easy on me, and feel free to point out any mistakes or ask questions. Thank you!