# Setting foot into the world of Machine Learning with IRIS.

The Iris dataset was used in R.A. Fisher's classic 1936 paper, [The Use of Multiple Measurements in Taxonomic Problems](http://rcs.chemometrics.ru/Tutorials/classification/Fisher.pdf), and can also be found on the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/).

It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
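The separability claim can be checked with a single threshold on one feature. This is a minimal sketch, assuming scikit-learn's bundled copy of Iris (not the Kaggle CSV used below) and a hypothetical cut-off of 2.5 cm on petal length chosen by eye.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the Kaggle CSV.
iris = load_iris()
petal_length = iris.data[:, 2]      # third column: petal length in cm
is_setosa = iris.target == 0        # class 0 is Iris setosa

# Hypothetical threshold picked by eye between the two groups.
threshold = 2.5
predicted_setosa = petal_length < threshold

print("setosa max petal length:", petal_length[is_setosa].max())
print("other species min petal length:", petal_length[~is_setosa].min())
print("threshold separates setosa:", bool((predicted_setosa == is_setosa).all()))
```

No comparably simple rule splits the other two species, because their measurements overlap.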

The columns in this dataset are:

- Id
- SepalLengthCm
- SepalWidthCm
- PetalLengthCm
- PetalWidthCm
- Species

![](https://i.imgur.com/7iqseyn.png)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


## Loading the data

In [None]:
#Reading the csv file and saving the data into a dataframe named df.
df = pd.read_csv('/kaggle/input/iris/Iris.csv')

In [None]:
#Checking the first 5(by default) rows of the dataframe
df.head()

## Exploring the data with dataframe.describe()

The **describe()** method computes summary statistics such as **count, mean, std, min, max and percentiles** for the numerical values of a Series or DataFrame. It can summarize numeric and object Series as well as DataFrames whose columns have mixed data types.

**Syntax**
> DataFrame.describe(percentiles=None, include=None, exclude=None)  

![](https://i.imgur.com/NEGwo2c.png)
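The parameters in the syntax above can be tried on a small frame before applying them to the real data. The sketch below uses a hypothetical five-row DataFrame named `toy`; the values are made up for illustration.

```python
import pandas as pd

# Tiny hypothetical frame with one numeric and one text column.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})

default_summary = toy.describe()                        # numeric columns only
custom_summary = toy.describe(percentiles=[0.1, 0.9])   # ask for the 10th and 90th percentiles
full_summary = toy.describe(include="all")              # adds unique/top/freq for the text column

print(default_summary)
print(custom_summary.index.tolist())
print(full_summary)
```

In the `include="all"` output, statistics that do not apply to a column (for example the mean of `species`) show up as `NaN`.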

In [None]:
df.describe() # With a mix of categorical and numerical data, describe() summarizes only the numerical columns by default.

In [None]:
df.describe(include ='all') # include ='all' summarizes the categorical columns as well as the numerical ones.

In [None]:
#Counting the null values in each column
df.isna().sum()

In [None]:
#Count of each kind of flower species by creating a bar Plot using Pandas.
df['Species'].value_counts().plot(kind='bar')

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Designing the Pairplot

The pair plot uses two basic figures: the histogram and the scatter plot. The histograms on the diagonal show the distribution of a single variable, while the scatter plots in the upper and lower triangles show the relationship between two variables.


In [None]:
#Remove the Id column, as it is only a row identifier and has no value in predicting the Species.
df = df[df.columns.drop('Id')]

In [None]:
#Finally plotting the pair plot using the seaborn library
sns.pairplot(df)
plt.show()

Analysis of the top row of the above pair plot:
- The first graph shows the count/frequency distribution of SepalLengthCm.
- The second graph shows a weak negative relationship between SepalLengthCm and SepalWidthCm.
- The third and fourth graphs show a strong positive relationship of SepalLengthCm with PetalLengthCm and PetalWidthCm.

## Correlation Matrix
A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.

### An example of a correlation matrix
A correlation matrix is usually “square”, with the same variables shown in the rows and columns. The correlation matrices computed below have the following properties:

- Each cell holds the correlation coefficient, a value between -1 and 1, for one pair of variables.
- The line of 1.00s going from the top left to the bottom right is the main diagonal, which shows that each variable always perfectly correlates with itself.
- The matrix is symmetrical. The correlations above the diagonal are a mirror image of the ones below it.
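These properties can be verified by building a small correlation matrix by hand. The sketch below uses hypothetical measurements and a helper named `pearson` (not part of the notebook), then compares the result with NumPy's `np.corrcoef`.

```python
import numpy as np

# Hypothetical measurements for three variables (rows are samples).
data = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
    [7.0, 3.2, 4.7],
])

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

n_vars = data.shape[1]
corr_by_hand = np.array([[pearson(data[:, i], data[:, j]) for j in range(n_vars)]
                         for i in range(n_vars)])

print(np.round(corr_by_hand, 3))
print("diagonal is all ones:", np.allclose(np.diag(corr_by_hand), 1.0))
print("matrix is symmetric:", np.allclose(corr_by_hand, corr_by_hand.T))
print("matches np.corrcoef:", np.allclose(corr_by_hand, np.corrcoef(data, rowvar=False)))
```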

### Applications of a correlation matrix

There are three broad reasons for using a correlation matrix:

- To summarize a large amount of data when the goal is to see patterns. A correlation matrix makes it easy to tell which variables correlate highly with each other and which don't.
- To input into other analyses. For example, people commonly use correlation matrices as inputs for:

    - Exploratory factor analysis
    - Confirmatory factor analysis
    - Structural equation models

- As a diagnostic when checking other analyses. For example, with linear regression, a large number of high correlations between the predictors suggests that the regression estimates will be unreliable, because highly correlated variables carry largely the same information.
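As a rough illustration of the diagnostic use, the sketch below lists every pair of variables whose absolute correlation exceeds a cut-off. The data is synthetic, and the 0.8 threshold and the column names `x`, `y`, `z` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=200),   # nearly a rescaled copy of x
    "z": rng.normal(size=200),                      # unrelated to x and y
})

toy_corr = toy.corr()
threshold = 0.8   # hypothetical cut-off for "highly correlated"

# Look only at the strict upper triangle so each pair is listed once.
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = [
    (toy_corr.index[i], toy_corr.columns[j], round(float(toy_corr.iloc[i, j]), 3))
    for i, j in zip(*np.where(upper))
    if abs(toy_corr.iloc[i, j]) > threshold
]
print(pairs)
```

A pair flagged this way is a candidate for dropping one of the two variables before fitting a linear model.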

In [None]:
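#Confirming the same by finding the correlation of all the relevant attributes.
#Species is a text column, so restrict the calculation to the numeric columns (recent pandas versions raise an error otherwise).
df.select_dtypes(include='number').corr()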

In [None]:
#Better way to represent the correlation matrix
# Compute the correlation matrix on the numeric columns only (Species is text)
corr = df.select_dtypes(include='number').corr()

# Draw the heatmap in its most basic form
sns.heatmap(corr)

At first glance, it doesn’t seem like we’ll be able to figure out much from the heatmap. Well, that’s because this heatmap is in its most basic form. Let me walk you through the most common ways of formatting a correlation heatmap so that your data can be presented in a way that is both effective and visually appealing.

In [None]:
#Adding annotation - displaying the correlation value inside each cell of the heatmap
sns.heatmap(corr, annot = True)

In [None]:
# The upper half is a mirror image of the lower half, so you can hide either one.
# Generate a mask for the upper triangle.
mask = np.triu(np.ones_like(corr, dtype=bool))
# np.ones_like returns an array of ones (True here) with the same shape as corr.
# np.triu keeps the upper triangle, diagonal included, and zeroes out the rest.
sns.heatmap(corr, annot = True, mask=mask)

In [None]:
# Generate a mask for the lower triangle.
mask = np.tril(np.ones_like(corr, dtype=bool)) #np.tril is used for Lower triangle of an array.
sns.heatmap(corr, annot = True, mask=mask)

In [None]:
#Invert Axis
mask = np.triu(np.ones_like(corr, dtype=bool))
ax = sns.heatmap(corr, annot = True, mask=mask)
ax.invert_yaxis()
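seaborn leaves out the cells where the mask is `True`, so the masks above also hide the diagonal of 1.00s. The sketch below, on a hypothetical 3x3 array, shows how the `k` argument of `np.triu` controls whether the diagonal is part of the mask.

```python
import numpy as np

ones = np.ones((3, 3), dtype=bool)

mask_with_diagonal = np.triu(ones)           # True on and above the diagonal
mask_without_diagonal = np.triu(ones, k=1)   # True strictly above the diagonal

print(mask_with_diagonal.astype(int))
print(mask_without_diagonal.astype(int))
```

Passing the `k=1` variant as `mask` would keep the diagonal visible in the heatmap.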

## Pandas Profiling: EDA with single line command

pandas_profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

- Type inference: detect the types of columns in a dataframe.
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histogram
- Correlations: highlighting of highly correlated variables, plus Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.


In [None]:
# Note: newer releases of this package are published as ydata-profiling (import ydata_profiling).
import pandas_profiling as pp
pp.ProfileReport(df)

## Applying Machine Learning to Predict the Flower Species

### Steps To Be followed When Applying an Algorithm
- Split the dataset into a training set and a testing set. The testing set is generally the smaller of the two, so that most of the data is available for training the model.
- Select an algorithm suited to the type of problem (classification or regression).
- Pass the training set to the algorithm to train it, using the .fit() method.
- Pass the testing set to the trained model to predict the outcome, using the .predict() method.
- Check the accuracy by comparing the predicted outcome with the actual output, here with metrics.accuracy_score().
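The last step boils down to counting matches. The sketch below computes accuracy by hand on hypothetical labels, which is the same quantity that `metrics.accuracy_score` reports: the fraction of predictions that equal the true label.

```python
# Hypothetical true labels and model predictions for six test flowers.
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

# Count the positions where the prediction matches the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(f"{correct} of {len(y_true)} correct, accuracy = {accuracy:.3f}")
```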

In [None]:
# separate the features (X) from the target (y)
X = df.drop(['Species'],axis=1)
y = df['Species']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

#Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

# evaluate the model
prediction = model.predict(X_test)

print('The accuracy of the Logistic Regression is',metrics.accuracy_score(prediction,y_test))
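A plain random split does not guarantee that each species is equally represented in the test set. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves the class proportions. The sketch uses stand-in arrays `X_demo` and `y_demo` with the same shape as Iris.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as Iris: 150 rows, 3 balanced classes.
X_demo = np.arange(150).reshape(-1, 1)
y_demo = np.repeat(["setosa", "versicolor", "virginica"], 50)

# Same split twice: once purely random, once stratified on the labels.
_, _, _, y_plain = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
_, _, _, y_strat = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1,
                                    stratify=y_demo)

print("plain split:     ", dict(Counter(y_plain.tolist())))
print("stratified split:", dict(Counter(y_strat.tolist())))
```

With stratification, each species contributes the same number of rows to the 45-row test set.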

In [None]:
from sklearn import svm
model = svm.SVC()
model.fit(X_train,y_train) 
prediction=model.predict(X_test)
print('The accuracy of the SVM is:',metrics.accuracy_score(prediction,y_test))

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier(n_neighbors=3)
model.fit(X_train,y_train)
prediction=model.predict(X_test)
print('The accuracy of the KNN is',metrics.accuracy_score(prediction,y_test))

In [None]:
from sklearn.tree import DecisionTreeClassifier
model=DecisionTreeClassifier()
model.fit(X_train,y_train) 
prediction=model.predict(X_test) 
print('The accuracy of the Decision Tree is ',metrics.accuracy_score(prediction,y_test))

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=20,criterion='entropy',random_state=0)
model.fit(X_train,y_train) 
prediction=model.predict(X_test) 
print('The accuracy of the Random Forest is ',metrics.accuracy_score(prediction,y_test))
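The five cells above repeat the same fit, predict and score pattern, so they can also be written as one loop over a dictionary of models. This is a sketch under two assumptions: scikit-learn's bundled Iris stands in for the Kaggle CSV, and `max_iter=1000` is added to the logistic regression to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris stands in for the Kaggle CSV.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=20, criterion="entropy", random_state=0),
}

# Fit every model on the same training set and score it on the same test set.
scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

Keeping the scores in a dictionary makes it easy to sort or plot them afterwards.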

In [None]:
petal=df[['PetalLengthCm','PetalWidthCm','Species']]
sepal=df[['SepalLengthCm','SepalWidthCm','Species']]

In [None]:
train_p,test_p=train_test_split(petal,test_size=0.3,random_state=0)  #petals
train_x_p=train_p[['PetalWidthCm','PetalLengthCm']]
train_y_p=train_p.Species
test_x_p=test_p[['PetalWidthCm','PetalLengthCm']]
test_y_p=test_p.Species

In [None]:
train_s,test_s=train_test_split(sepal,test_size=0.3,random_state=0)  #Sepal
train_x_s=train_s[['SepalWidthCm','SepalLengthCm']]
train_y_s=train_s.Species
test_x_s=test_s[['SepalWidthCm','SepalLengthCm']]
test_y_s=test_s.Species

In [None]:
model=svm.SVC()
model.fit(train_x_p,train_y_p) 
prediction=model.predict(test_x_p) 
print('The accuracy of the SVM using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model=svm.SVC()
model.fit(train_x_s,train_y_s) 
prediction=model.predict(test_x_s) 
print('The accuracy of the SVM using Sepal is:',metrics.accuracy_score(prediction,test_y_s))

In [None]:
model = LogisticRegression()
model.fit(train_x_p,train_y_p) 
prediction=model.predict(test_x_p) 
print('The accuracy of the Logistic Regression using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model.fit(train_x_s,train_y_s) 
prediction=model.predict(test_x_s) 
print('The accuracy of the Logistic Regression using Sepals is:',metrics.accuracy_score(prediction,test_y_s))

In [None]:
model=DecisionTreeClassifier()
model.fit(train_x_p,train_y_p) 
prediction=model.predict(test_x_p) 
print('The accuracy of the Decision Tree using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model.fit(train_x_s,train_y_s) 
prediction=model.predict(test_x_s) 
print('The accuracy of the Decision Tree using Sepals is:',metrics.accuracy_score(prediction,test_y_s))

In [None]:
model=KNeighborsClassifier(n_neighbors=3) 
model.fit(train_x_p,train_y_p) 
prediction=model.predict(test_x_p) 
print('The accuracy of the KNN using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model.fit(train_x_s,train_y_s) 
prediction=model.predict(test_x_s) 
print('The accuracy of the KNN using Sepals is:',metrics.accuracy_score(prediction,test_y_s))

**Observations:**

Training the models on the petal measurements gives much better accuracy than training them on the sepal measurements.
This is in line with the heatmap above, where the correlation between sepal width and length was very low whereas the correlation between petal width and length was very high.
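The two correlations mentioned in the observation can be read off directly. The sketch below assumes scikit-learn's bundled Iris, whose column names differ from the Kaggle CSV, so the exact values may differ slightly from the heatmap.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris, with its own column names.
features = load_iris(as_frame=True).data

sepal_corr = features["sepal length (cm)"].corr(features["sepal width (cm)"])
petal_corr = features["petal length (cm)"].corr(features["petal width (cm)"])

print(f"sepal length vs sepal width: {sepal_corr:.2f}")
print(f"petal length vs petal width: {petal_corr:.2f}")
```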

In [None]:
!pip install pycaret

In [None]:
# Importing the pycaret classification module

from pycaret.classification import *
# This is the first step of model selection.
# Here data is our dataset, target is the label
# column (dependent variable), and session_id is a random seed that makes the experiment reproducible.
exp = setup(data = df, target = 'Species', session_id=77 )

# After this we will get a list of our columns and their inferred types. Confirm they are correct, then press Enter.

In [None]:
# Importing PyCaret's own copy of the iris dataset
from pycaret.datasets import get_data
iris = get_data('iris')

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = iris, target = 'species')

# return best model
best = compare_models()

# return top 3 models based on 'Accuracy'
top3 = compare_models(n_select = 3)

# return best model based on AUC
best = compare_models(sort = 'AUC') #default is 'Accuracy'

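# compare specific models (later PyCaret releases renamed whitelist to include)
# best_specific = compare_models(whitelist = ['dt','rf','xgboost'])

# blacklist certain models (later PyCaret releases renamed blacklist to exclude)
# best_specific = compare_models(blacklist = ['catboost', 'svm'])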