#  Predicting Iris Flower Species

The purpose of this notebook is to get acquainted with basic machine learning algorithms and the steps involved in a typical data science workflow. As a beginner, this is my first notebook on my journey into data science and machine learning. The dataset used is the famous Iris dataset from the UCI Machine Learning Repository. It consists of 3 classes of 50 instances each, and our task is to predict the class of a flower from its measurements. The three classes are Iris Setosa, Iris Versicolor, and Iris Virginica. The features are sepal length, sepal width, petal length, and petal width, all in centimeters. This is a supervised classification problem, since the data is labeled and our model learns from those labels. The algorithms we will use are Decision Trees, K-Nearest Neighbours, and Linear Regression. In the end, we will compare how the algorithms performed on our problem.

In [1]:
# import the necessary libraries 
import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  

Let's now import the dataset and load it into a pandas DataFrame.

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read dataset to pandas dataframe
dataset = pd.read_csv(url, names=names)  
dataset.head()

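| | sepal-length | sepal-width | petal-length | petal-width | Class |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |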


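Our target variable is Class, which holds categorical values (the species names). Here we map each species name to an integer so that the labels are numeric. This is integer (label) encoding, not one-hot encoding, which would instead create a separate 0/1 column for each class. The scikit-learn classifiers used here also accept string labels directly, so the step is optional for them, but Linear Regression needs a numeric target.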

In [3]:
# Map each class name to a numerical value
class_mapping = {'Iris-setosa': 1, 'Iris-versicolor': 2, 'Iris-virginica': 3}

# Replace the class names in the Class column with their integer codes
dataset['Class'] = dataset['Class'].map(class_mapping)
dataset.head()

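| | sepal-length | sepal-width | petal-length | petal-width | Class |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 1 |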
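
The mapping above assigns one integer per class, which is usually called integer or label encoding. For comparison, one-hot encoding creates one indicator column per class. Here is a minimal sketch of the difference; the small `labels` Series is made up for illustration, and `pd.get_dummies` is just one of several ways to one-hot encode.

```python
import pandas as pd

# Hypothetical mini-sample of the Class column, for illustration only
labels = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa'])

# Integer (label) encoding: a single column with one integer per class
class_mapping = {'Iris-setosa': 1, 'Iris-versicolor': 2, 'Iris-virginica': 3}
integer_encoded = labels.map(class_mapping)

# One-hot encoding: one indicator column per class
one_hot = pd.get_dummies(labels)

print(integer_encoded.tolist())
print(one_hot)
```

For a target column, a single integer (or even the original string) per row is what scikit-learn classifiers expect; one-hot encoding is more commonly applied to categorical input features.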


Now let's go ahead and separate the features and labels.

In [4]:
X = dataset.iloc[:, :-1].values  #attributes
y = dataset.iloc[:, 4].values    #labels

Split the data into training and testing sets. We will keep 20% of our data for testing and train our algorithms on the remaining 80% data.

In [5]:
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)  
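
The split above is shuffled randomly, so the scores will change from run to run. A minimal sketch of a reproducible, class-balanced split follows. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for the CSV loaded earlier; note that the bundled labels are 0, 1, 2 rather than 1, 2, 3. The names `X_demo`, `y_demo`, and the seed `42` are arbitrary choices for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris copy stands in for the CSV loaded above
X_demo, y_demo = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)   # sizes of the training and test feature matrices
print(np.bincount(y_te))        # number of test samples per class
```

With only 30 test samples, an unstratified split can leave one species under-represented, which makes the test score noisier.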

In machine learning it is often important to scale the features. StandardScaler centers each feature around 0 and rescales it to unit variance, so that a feature whose variance is larger does not end up dominating the others. This matters most for distance-based algorithms such as K-Nearest Neighbours; decision trees are largely insensitive to feature scaling. We fit the scaler on the training data only and then apply the same transformation to both sets.

In [6]:
from sklearn.preprocessing import StandardScaler  
scaler = StandardScaler()  
scaler.fit(X_train)

X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test)  
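
To show what the scaling step computes, here is a small NumPy sketch of standardization, z = (x - mean) / std, with the statistics taken from the training set only. The arrays are made up for illustration, and this is a simplified stand-in for the scaler, not its actual implementation.

```python
import numpy as np

# Made-up training and test features, for illustration only
train = np.array([[5.1, 0.2], [4.9, 0.4], [6.3, 1.8], [5.8, 1.2]])
test = np.array([[5.0, 0.3], [6.0, 1.5]])

# Statistics are estimated on the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0)

# The same training statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for every column
print(train_scaled.std(axis=0))   # approximately 1 for every column
```

The test set is transformed with the training mean and standard deviation, so its columns will generally not be exactly centered. That is expected: reusing the training statistics keeps information about the test data out of the preprocessing.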

**Decision Trees**

We then import the tree classifier from sklearn and train our model using the training data.

In [7]:
from sklearn import tree
classifier = tree.DecisionTreeClassifier()
classifier = classifier.fit(X_train, y_train)

Our model is now trained and ready to predict the class of the flowers. Let's now take the test features and predict their labels using our classifier.

In [8]:
y_pred = classifier.predict(X_test)
y_pred

array([1, 3, 3, 2, 1, 3, 1, 2, 3, 3, 2, 1, 2, 1, 3, 2, 2, 1, 1, 3, 3, 2,
       1, 2, 1, 2, 3, 1, 1, 1], dtype=int64)

To evaluate how well our model did, we will use the classifier's score function.

In [9]:
print(classifier.score(X_train,y_train))

1.0
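
The cell above scores the model on the data it was trained on. A minimal sketch of scoring on the held-out test set as well is shown here. It assumes scikit-learn's bundled Iris copy and fixed seeds so that it runs on its own; `model`, `X_demo`, and `y_demo` are names chosen for this example, and the numbers it prints are not the ones reported in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Iris copy and fixed seeds stand in for the notebook's data
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Accuracy on data the tree has already seen
train_acc = model.score(X_tr, y_tr)
# Accuracy on held-out data, computed from explicit predictions
test_acc = accuracy_score(y_te, model.predict(X_te))

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The test accuracy is the figure to compare across models, since a fully grown tree can memorize its training set.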


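The decision tree is reported here at 96% accuracy, although the cell above prints 1.0 because it scores the model on the training data, where an unpruned tree typically fits perfectly. The accuracy that matters is the one on the held-out test set, obtained with `classifier.score(X_test, y_test)`. Now let's try another supervised machine learning algorithm, K-Nearest Neighbours. We will use the same training and testing data for this process.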

**K-Nearest Neighbours**

In [10]:
from sklearn.neighbors import KNeighborsClassifier  
classifier = KNeighborsClassifier(n_neighbors=10)  
classifier.fit(X_train, y_train)  

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=10, p=2,
           weights='uniform')
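
For intuition, here is a rough NumPy sketch of the idea behind K-Nearest Neighbours: find the k training points closest to a query point (Euclidean distance, which corresponds to the `metric='minkowski', p=2` shown above) and take a majority vote over their labels. The helper `knn_predict` and the toy points are hypothetical, and scikit-learn's implementation differs in details such as tie-breaking and its search structures.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each training point
    nearest = np.argsort(distances)[:k]              # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())       # count labels among the neighbours
    return votes.most_common(1)[0][0]


# Made-up 2-D points in two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))  # query near the first group
print(knn_predict(X_toy, y_toy, np.array([5.0, 5.1])))  # query near the second group
```

Because the prediction depends directly on distances, this is the kind of model for which the feature scaling done earlier matters.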

In [11]:
y_pred = classifier.predict(X_test)

In [12]:
print(classifier.score(X_train,y_train))

0.9583333333333334


For K-Nearest Neighbours we get an accuracy of 93%; the 0.958 printed above is again the score on the training data. Now let's go ahead and try Linear Regression.

**Linear Regression**

In [14]:
from sklearn.linear_model import LinearRegression 
Classifier = LinearRegression()
Classifier.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [15]:
y_pred = Classifier.predict(X_test)

In [16]:
# Note: for LinearRegression, score() returns the R^2 coefficient, not classification accuracy
print(Classifier.score(X_train,y_train))

0.9257159309857883
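
For a regressor, `score` returns the coefficient of determination R², so the number above is not an accuracy. The sketch below contrasts it with `LogisticRegression`, a linear model that is an actual classifier. Logistic regression is a substitute introduced here for comparison, not a method this notebook uses, and the bundled Iris copy, the fixed seed, and `max_iter=1000` are assumptions made so the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris copy and a fixed seed stand in for the notebook's data
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

# LinearRegression treats the class codes as continuous targets; score() is R^2
reg = LinearRegression().fit(X_tr, y_tr)
r2 = reg.score(X_te, y_te)

# LogisticRegression is a linear classifier; score() is accuracy
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)

print(f"LinearRegression R^2:        {r2:.3f}")
print(f"LogisticRegression accuracy: {acc:.3f}")
```

The two numbers measure different things: R² describes how much of the variance in the numeric class codes the fit explains, while accuracy is the fraction of flowers assigned to the correct species.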


<h1><center>Observations</center></h1>

| Algorithm | Accuracy (%) |
| --- | --- |
| Decision Trees | 96 |
| K-nearest Neighbours | 93 |
| Linear Regression | 92 (R², not accuracy) |

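To conclude, Decision Trees gave the best score for the task of flower prediction among the models we tried. Keep in mind that the Linear Regression figure is an R² value rather than an accuracy, so only the two classifiers are directly comparable. We can use this notebook as a baseline and work further on more complex problems in data science.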