**Introduction**

From Wikipedia

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.[1] It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.[2] Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".[3]

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

The iris dataset contains measurements for 150 iris flowers from three different species.

The three classes in the Iris dataset:

    Iris-setosa (n=50). 
    Iris-versicolor (n=50).
    Iris-virginica (n=50). 

The four features of the Iris dataset:

    sepal length in cm
    sepal width in cm
    petal length in cm
    petal width in cm

For this classification exercise on the Iris species data:
* First, we use some simple Python techniques to explore the data set.
* Then we split the data in half: one half is the training set used to fit the model, and the other half is a held-out set used to check the test accuracy score.
* Next, we train a support vector machine classifier (SVC).
* Finally, we use GridSearchCV to tune the hyperparameters (C, gamma, kernel) of the SVC model to improve the accuracy score (about 98% in the cross-validated runs reported in Part 7).

One very important reminder: train_test_split shuffles the data before doing the split, but GridSearchCV does not shuffle the data before doing cross-validation. Our iris data is ordered by the response variable Species (50 Iris-setosa, then 50 Iris-versicolor, then 50 Iris-virginica), so we shuffle iris before using GridSearchCV.
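To see why the ordering matters, here is a minimal sketch that builds labels in the same sorted layout (50 per class) and compares three splitters. The names `labels`, `dummy_X` and `classes_per_test_fold` are illustrative and not part of the notebook. One caveat: when `cv` is an integer and the estimator is a classifier, GridSearchCV uses StratifiedKFold, which keeps all classes in every fold even without shuffling. The plain, unshuffled KFold is the case where sorted data goes badly wrong.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Labels laid out like the Iris CSV: 50 of class 0, then 50 of class 1, then 50 of class 2.
labels = np.repeat([0, 1, 2], 50)
dummy_X = np.zeros((len(labels), 1))  # the splitters only need the number of rows


def classes_per_test_fold(splitter):
    """Return the set of classes present in each test fold."""
    return [
        set(labels[test_idx].tolist())
        for _, test_idx in splitter.split(dummy_X, labels)
    ]


plain = classes_per_test_fold(KFold(n_splits=3))
shuffled = classes_per_test_fold(KFold(n_splits=3, shuffle=True, random_state=0))
stratified = classes_per_test_fold(StratifiedKFold(n_splits=3))

print("KFold, no shuffle:   ", plain)
print("KFold, shuffle=True: ", shuffled)
print("StratifiedKFold:     ", stratified)
```

With unshuffled KFold each test fold holds a single species, so the model is always tested on a class it never saw during training.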

In the future, I will also add decision tree, bagging, boosting (including AdaBoost), random forest, logistic regression, k-nearest neighbors and naive Bayes classifiers to classify the species.

* If you find this kernel helpful, please give it an upvote. This is very important for new people like me. Thank you in advance.
* If you have any questions, please feel free to leave me a message; I check every day. Thank you so much.

**Part 1: Import libraries and load data**

In [None]:
# Data analysis libraries
import pandas as pd
import numpy as np

# Data visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# show plot in the notebook
%matplotlib inline

# Next, we'll load the Iris flower dataset from the "dataset" directory.
# A forward slash works on every platform and avoids the "\I" escape-sequence warning.
iris = pd.read_csv("dataset/Iris.csv") # the iris dataset is now a Pandas DataFrame
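If the CSV file is not at hand, a similar DataFrame can be rebuilt from the copy of the data bundled with scikit-learn. This is only a sketch: `iris_alt` is a hypothetical name, the column names are copied from the CSV used in this notebook, and a few measurements in scikit-learn's copy may differ slightly from the CSV.

```python
import pandas as pd
from sklearn.datasets import load_iris

raw = load_iris()

# Column names follow the CSV used in this notebook.
iris_alt = pd.DataFrame(
    raw.data,
    columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
)
iris_alt.insert(0, "Id", range(1, len(iris_alt) + 1))

# The CSV labels look like "Iris-setosa"; scikit-learn stores "setosa".
iris_alt["Species"] = ["Iris-" + str(raw.target_names[t]) for t in raw.target]

print(iris_alt.shape)
print(iris_alt["Species"].value_counts())
```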

**Part 2: Check the data information**

In [None]:
# first five observations
iris.head()

In [None]:
# Number of observations and missing values. 
# There are 150 observations and no NaN values
iris.info()

In [None]:
# Check basic description for features
iris.drop(['Id','Species'], axis=1).describe()

In [None]:
# Check the response variable frequency
iris['Species'].value_counts()

**Part 3: Exploratory data analysis**

In [None]:
# Create a pairplot of the data set. Which flower species seems to be the most separable?
sns.pairplot(iris.drop(['Id'], axis=1),hue='Species')
# Iris setosa seems most separable from the other two species
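The visual impression from the pairplot can be checked numerically. The sketch below uses scikit-learn's bundled copy of the data (an assumption, since the notebook reads a CSV) and tests whether a single threshold on petal length separates setosa from the other two species. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

raw = load_iris()
petal_length = raw.data[:, 2]   # third column is petal length in cm
is_setosa = raw.target == 0     # class 0 is setosa in scikit-learn's copy

setosa_max = petal_length[is_setosa].max()
others_min = petal_length[~is_setosa].min()
print("longest setosa petal:     ", setosa_max)
print("shortest non-setosa petal:", others_min)

# Any threshold strictly between the two values splits setosa from the rest.
threshold = (setosa_max + others_min) / 2
predicted_setosa = petal_length < threshold
print("setosa separated without error:", bool(np.array_equal(predicted_setosa, is_setosa)))
```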

In [None]:
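# Create a bivariate KDE plot of sepal length versus sepal width for the setosa species.
# Assumes seaborn >= 0.11: x= and y= select the two variables and fill= replaces shade=.
# The lowest contour level stays unfilled by default (thresh=0.05), as shade_lowest=False did.
sub=iris[iris['Species']=='Iris-setosa']
sns.kdeplot(data=sub, x='SepalLengthCm', y='SepalWidthCm', cmap="plasma", fill=True)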
plt.title('Iris-setosa')
plt.xlabel('Sepal Length Cm')
plt.ylabel('Sepal Width Cm')

In [None]:
# Same plot for the petal measurements (seaborn >= 0.11 API: x=, y=, fill=)
sns.kdeplot(data=sub, x='PetalLengthCm', y='PetalWidthCm', cmap="plasma", fill=True)
plt.title('Iris-setosa')
plt.xlabel('Petal Length Cm')
plt.ylabel('Petal Width Cm')

In [None]:
sub_virginica=iris[iris['Species']=='Iris-virginica']
# Create a scatter plot of the Sepal
plt.scatter(sub_virginica['SepalLengthCm'], sub_virginica['SepalWidthCm'], marker='o', color='r')
plt.xlabel('Sepal Length Cm')
plt.ylabel('Sepal Width Cm')
plt.title('Sepal Width versus Length for virginica species')

**Part 4: Train Test Split**

In [None]:
# Split data into a training set and a testing set.
# train_test_split shuffles the data before the split (shuffle=True by default)
from sklearn.model_selection import train_test_split
X=iris.drop(['Species', 'Id'], axis=1)
y=iris['Species']
X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=0.5, shuffle=True,random_state=100)
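A shuffled split does not guarantee equal class counts in each half. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves the class proportions. The sketch below compares both on scikit-learn's bundled copy of the data; `X_demo`, `y_demo` and the other names are illustrative.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Plain shuffled split, as in the cell above: class counts may be uneven.
_, _, y_tr_plain, _ = train_test_split(
    X_demo, y_demo, test_size=0.5, shuffle=True, random_state=100
)

# Stratified split: every class keeps its one-third share on both sides.
_, _, y_tr_strat, _ = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=100
)

print("plain split, training class counts:     ", Counter(y_tr_plain.tolist()))
print("stratified split, training class counts:", Counter(y_tr_strat.tolist()))
```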

**Part 5: Train a Model**

sklearn.svm.SVC() is implemented on top of libsvm. Its fit time scales at least quadratically with the number of samples, which makes it impractical for datasets with more than a few tens of thousands of samples. Multiclass support is handled with a one-vs-one scheme.
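The one-vs-one scheme trains one binary classifier per pair of classes, so k classes give k(k-1)/2 classifiers. A minimal sketch of this, using synthetic 4-class data from `make_blobs` (an assumption chosen only because with Iris's 3 classes both counts happen to equal 3):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic data with 4 classes, used only for illustration.
X_blobs, y_blobs = make_blobs(n_samples=200, centers=4, random_state=0)
n_classes = 4

ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X_blobs, y_blobs)
ovr = SVC(kernel="rbf", decision_function_shape="ovr").fit(X_blobs, y_blobs)

ovo_shape = ovo.decision_function(X_blobs).shape  # one column per pair of classes
ovr_shape = ovr.decision_function(X_blobs).shape  # one column per class

print("pairwise classifiers expected:", n_classes * (n_classes - 1) // 2)
print("ovo decision_function shape:  ", ovo_shape)
print("ovr decision_function shape:  ", ovr_shape)
```

Note that `decision_function_shape` only changes the shape of the returned scores; SVC still trains with the one-vs-one strategy internally.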

In [None]:
# Now it's time to train a Support Vector Machine Classifier. 
# Call the SVC() model from sklearn and fit the model to the training data.
from sklearn.svm import SVC
model=SVC(C=1, kernel='rbf', tol=0.001)
model.fit(X_train, y_train)

**Part 6: Model Evaluation**

In [None]:
# Now get predictions from the model and create a confusion matrix and a classification report.
pred=model.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print(confusion_matrix(y_test, pred))
print('\n')
print(classification_report(y_test, pred))
print('\n')
print('Accuracy score is: ', accuracy_score(y_test, pred))
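To read the confusion matrix, it helps to work through a tiny hand-made case. In the sketch below the six labels are invented for illustration (they are not model output): rows are the true species, columns are the predicted species, and accuracy is the diagonal divided by the total.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

class_names = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Made-up labels: one versicolor flower is mistaken for virginica.
y_true_toy = ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"]
y_pred_toy = ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica", "Iris-virginica"]

cm = confusion_matrix(y_true_toy, y_pred_toy, labels=class_names)
print(cm)

# Correct predictions sit on the diagonal.
acc_from_cm = cm.trace() / cm.sum()
print("accuracy from matrix:", acc_from_cm)
print("accuracy_score:      ", accuracy_score(y_true_toy, y_pred_toy))
```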

**Part 7: Gridsearch to tune hyperparameters**

In [None]:
iris.head(20)

***7.1 Unlike train_test_split, GridSearchCV does not shuffle the data before cross-validation, so we shuffle the Iris data ourselves before using GridSearchCV, because the original Iris data is sorted by species.***

In [None]:
from sklearn.utils import shuffle
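# Drop 'Id' as well as 'Species': Id follows the species ordering, so keeping it as a feature would leak the label.
X=iris.drop(['Species', 'Id'], axis=1)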
y=iris['Species']
print('Before shuffle: ',y[0:20])
X,y = shuffle(X,y, random_state=0)
print("After shuffle: ", y[0:20])
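`sklearn.utils.shuffle` applies the same permutation to every array it receives, so features and labels stay paired. A small sketch with toy arrays (the names `X_toy` and `y_toy` are illustrative):

```python
import numpy as np
from sklearn.utils import shuffle

X_toy = np.arange(10).reshape(5, 2)  # rows: [0, 1], [2, 3], [4, 5], [6, 7], [8, 9]
y_toy = X_toy[:, 0].copy()           # label each row with its own first value

X_shuf, y_shuf = shuffle(X_toy, y_toy, random_state=0)
print(X_shuf)
print(y_shuf)

# Each shuffled row still carries its original label.
print("rows and labels still aligned:", bool(np.array_equal(X_shuf[:, 0], y_shuf)))
```

With pandas objects the index is permuted along with the rows, which is why the "After shuffle" printout above shows index values out of order.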

In [None]:
# Create a dictionary called param_grid and fill out some parameters for C and gamma.
param_grid = {'C': [0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']}
# param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': ['auto'], 'kernel': ['rbf']}
from sklearn.model_selection import GridSearchCV
grid=GridSearchCV(estimator=SVC(), param_grid=param_grid, scoring='accuracy',cv=3, verbose=1, refit=True )
grid.fit(X, y)

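* GridSearchCV performs an exhaustive search over the parameter values specified in param_grid for an estimator.
* cv=3 means three-fold cross-validation: each parameter combination is scored by its mean accuracy over the three validation folds. (Three folds was the default in older scikit-learn releases; since version 0.22 the default is five.) With an integer cv and a classifier, the folds are stratified by class but not shuffled.
* scoring='accuracy' means we choose the parameters with the best accuracy score. For SVC this matches the default behaviour, because scoring=None falls back to the estimator's own score method, which is accuracy. Other scorers can be used; for a multiclass target like Iris, use averaged variants such as 'precision_macro', 'recall_macro' or 'f1_macro'.
* By using GridSearchCV to tune the hyperparameters, we get a mean cross-validated accuracy score of 98.6% (reported as grid.best_score_).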
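"Exhaustive" has a concrete cost: every combination in the grid is fitted once per fold. The sketch below counts the candidates for the grid used above with `ParameterGrid`; `param_grid_demo`, `n_candidates` and `n_folds` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    "C": [0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf"],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # 9 values of C x 5 of gamma x 1 kernel
n_folds = 3

print("parameter combinations:", n_candidates)
print("cross-validation fits: ", n_candidates * n_folds)
```

With refit=True, one more fit is run at the end on the full data using the best combination.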

In [None]:
# The best hyperparameters chosen are
print(grid.best_params_)
print(grid.best_estimator_)
print('Mean cross-validated score of the best_estimator: ', grid.best_score_)
print('The number of cross-validation splits (folds/iterations): ', grid.n_splits_)

In [None]:
# Another option for shuffling is to pass a KFold splitter with shuffle=True as cv; we get 98% accuracy
from sklearn.model_selection import KFold
X=iris.drop(['Species', 'Id'], axis=1)
y=iris['Species']
# Create a dictionary called param_grid and fill out some parameters for C and gamma.
param_grid = {'C': [0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']}
# param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': ['auto'], 'kernel': ['rbf']}
from sklearn.model_selection import GridSearchCV
grid=GridSearchCV(estimator=SVC(), param_grid=param_grid, scoring='accuracy',
                  cv=KFold(n_splits=3, shuffle=True, random_state=0), verbose=1, refit=True )
grid.fit(X, y)

# The best hyperparameters chosen are
print(grid.best_params_)
print(grid.best_estimator_)
print('Mean cross-validated score of the best_estimator: ', grid.best_score_)
print('The number of cross-validation splits (folds/iterations): ', grid.n_splits_)
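The scores above are cross-validated on all 150 flowers, so they are not a test score on unseen data. One possible way to combine Part 4 and Part 7, sketched here under assumptions (scikit-learn's bundled copy of the data, a stratified split, and a smaller illustrative grid), is to tune on the training half only and then score the held-out half once.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.5, stratify=y_all, random_state=100
)

search = GridSearchCV(
    estimator=SVC(),
    # Smaller grid than the one above, chosen only to keep the sketch short.
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001], "kernel": ["rbf"]},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    refit=True,
)
search.fit(X_tr, y_tr)  # hyperparameters are chosen using the training half only

held_out_accuracy = search.score(X_te, y_te)  # the test half is used exactly once
print("best params:", search.best_params_)
print("mean CV accuracy on the training half:", search.best_score_)
print("accuracy on the held-out half:", held_out_accuracy)
```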