# Objective:


### 1. The Iris flower dataset contains three species: setosa, versicolor, and virginica, which can be distinguished by their measurements. Given measurements labeled by species, the objective is to train a machine learning model that learns from them and classifies Iris flowers into the correct species.
### 2. Use the Iris dataset to build a model that classifies iris flowers by species from their sepal and petal measurements. The dataset is widely used for introductory classification tasks.



# Downloading the dataset

In [None]:
%pip install kaggle



In [None]:
!mkdir -p ~/.kaggle


In [None]:
! kaggle datasets download arshid/iris-flower-dataset

iris-flower-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
!unzip -o /content/iris-flower-dataset.zip -d /content/

Archive:  /content/iris-flower-dataset.zip
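If the Kaggle download is not available (for example, no API token is configured), scikit-learn ships its own copy of the Iris data. The sketch below builds a DataFrame with the same column names as `IRIS.csv`. The column mapping, the `Iris-` label prefix, and the name `data_alt` are assumptions made here for illustration, and the bundled copy has not been checked row by row against the Kaggle file.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load scikit-learn's bundled copy of the Iris data as a DataFrame.
iris = load_iris(as_frame=True)
frame = iris.frame.copy()

# Rename columns to the snake_case names used in IRIS.csv (assumed mapping).
frame.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]

# Map the integer target to "Iris-<name>" labels to mirror the CSV's species column.
frame["species"] = frame["target"].map(lambda i: "Iris-" + iris.target_names[i])
data_alt = frame.drop(columns="target")

print(data_alt.shape)
print(sorted(data_alt["species"].unique()))
```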

# Importing Libraries

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.figure_factory as ff

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import confusion_matrix

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings('ignore')

# Reading & Understanding the data

In [None]:
data = pd.read_csv('/content/IRIS.csv')

In [None]:
data.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


# Checking for missing data

In [None]:
data.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

# Checking for Duplicate data

In [None]:
data.duplicated().sum()

3

### The dataset has only 150 rows and just 3 duplicate entries, which may simply be different flowers with identical measurements. I therefore kept them rather than shrink an already small dataset.
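Before deciding whether to keep or drop duplicates, it can help to look at the repeated rows themselves. The following is a minimal sketch on a hypothetical toy frame (the values are made up for illustration, not taken from `IRIS.csv`); it shows the difference between counting the extra copies, listing every member of a duplicate group, and dropping the repeats.

```python
import pandas as pd

# Hypothetical toy frame with one repeated measurement (not the real data).
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1, 5.8],
    "petal_length": [1.4, 1.4, 1.4, 5.1],
    "species": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

# duplicated() marks only the second and later copies by default.
n_extra = int(toy.duplicated().sum())

# keep=False marks every member of a duplicate group, which is handy for inspection.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first copy of each group.
deduped = toy.drop_duplicates()

print(n_extra, len(all_copies), len(deduped))
```

The same three calls can be run on `data` to see which measurements are repeated and in which species.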

# Exploratory Data Analysis

In [None]:
data.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| sepal_length | 150.0 | 5.843333 | 0.828066 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 |
| sepal_width | 150.0 | 3.054 | 0.433594 | 2.0 | 2.8 | 3.0 | 3.3 | 4.4 |
| petal_length | 150.0 | 3.758667 | 1.76442 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
| petal_width | 150.0 | 1.198667 | 0.763161 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 |


In [None]:
data['species'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: species, dtype: int64

### Based on the value counts, the dataset is balanced.

In [None]:
fig = px.scatter(data, x="sepal_width", y="sepal_length", color="species",
                 size='petal_length', hover_data=['petal_width'], template = 'plotly_dark')
fig.show()

In [None]:
fig = px.scatter(data, x="petal_width", y="petal_length", color="species", symbol="species", template = 'plotly_dark')
fig.show()

## Insights

### 1. In the sepal plot, Iris-setosa tends to have shorter but wider sepals. Versicolor is intermediate in sepal length and Virginica tends to have the longest sepals. Versicolor and Virginica both have narrower sepals than Setosa and overlap heavily in sepal width.

### 2. In the petal plot, Setosa has the smallest petal length and petal width, Versicolor has intermediate values for both, and Virginica has the largest values for both.
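The visual impressions above can be checked numerically with per-species means. This is a small sketch that uses scikit-learn's bundled copy of the data so that it runs on its own; the renamed columns and the frame name `df` are assumptions for illustration. On the notebook's own frame the equivalent call would be `data.groupby('species').mean()`.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Build a self-contained frame from scikit-learn's bundled Iris copy.
iris = load_iris(as_frame=True)
df = iris.frame.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
df["species"] = df["target"].map(lambda i: iris.target_names[i])

# Mean of each measurement within each species.
means = df.drop(columns="target").groupby("species").mean()
print(means.round(2))
```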

In [None]:
fig = px.scatter(data,
                 x = 'sepal_length',
                 y = 'petal_length',
                 color = 'species',
                 facet_col= 'species',
                 template = 'plotly_dark'
                 )

fig.show()

### From the above plot, together with the earlier scatter plots, we can conclude that:

### 1. The Setosa species is clearly distinct from the other two. It has small petal width and length, along with a high sepal width and low sepal length.

### 2. The Versicolor species generally shows moderate dimensions for both sepal length and petal attributes.

### 3. The Virginica species stands out with its high petal width and length and its larger sepal length, while its sepals are narrower than those of Setosa.

In [None]:
# Correlate the numeric columns only; the species column is text
data.drop(columns='species').corr()

|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| sepal_length | 1.0 | -0.109369 | 0.871754 | 0.817954 |
| sepal_width | -0.109369 | 1.0 | -0.420516 | -0.356544 |
| petal_length | 0.871754 | -0.420516 | 1.0 | 0.962757 |
| petal_width | 0.817954 | -0.356544 | 0.962757 | 1.0 |


In [None]:
fig = px.imshow(data.drop(columns='species').corr(), text_auto=True, color_continuous_scale='deep', template = 'plotly_dark')
fig.show()

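### Insights

### Very high correlation: Petal Length & Petal Width (0.96)
### High correlation: Petal Length & Sepal Length (0.87), Petal Width & Sepal Length (0.82)
### Weak to moderate negative correlation: Sepal Width with Sepal Length (-0.11), Petal Length (-0.42) and Petal Width (-0.36)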
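The coefficients in the matrix are Pearson correlations. As a rough illustration of what that number measures, the sketch below computes it by hand for petal length and petal width and compares the result with NumPy's `corrcoef`. It uses scikit-learn's bundled copy of the data so that it is self-contained; the helper `pearson` is a name introduced here, and the value may differ slightly from the table above if the two copies of the dataset are not identical.

```python
import numpy as np
from sklearn.datasets import load_iris

# Columns 2 and 3 of the bundled data are petal length and petal width.
X = load_iris().data
petal_length, petal_width = X[:, 2], X[:, 3]

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

r_manual = pearson(petal_length, petal_width)
r_numpy = float(np.corrcoef(petal_length, petal_width)[0, 1])

print(round(r_manual, 3), round(r_numpy, 3))
```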

In [None]:
fig = px.histogram(data, x="sepal_width", color="species",
                   labels={"species": "Species", "sepal_width": "Sepal Width"},
                   title="Distribution of Sepal Width by Species",
                   opacity=0.7,
                   marginal="rug")

fig.update_layout(
    xaxis_title="Sepal Width",
    yaxis_title="Count",
    legend_title="Species",
    font=dict(family="Arial", size=12)
)

fig.show()

In [None]:
fig = px.histogram(data, x="sepal_length", color="species",
                   labels={"species": "Species", "sepal_length": "Sepal Length"},
                   title="Distribution of Sepal Length by Species",
                   opacity=0.7,
                   marginal="rug")

fig.update_layout(
    xaxis_title="Sepal Length",
    yaxis_title="Count",
    legend_title="Species",
    font=dict(family="Arial", size=12)
)

fig.show()

In [None]:
fig = px.histogram(data, x="petal_width", color="species",
                   labels={"species": "Species", "petal_width": "Petal Width"},
                   title="Distribution of Petal Width by Species",
                   opacity=0.7,
                   marginal="rug")

fig.update_layout(
    xaxis_title="Petal Width",
    yaxis_title="Count",
    legend_title="Species",
    font=dict(family="Arial", size=12)
)

fig.show()

In [None]:
fig = px.histogram(data, x="petal_length", color="species",
                   labels={"species": "Species", "petal_length": "Petal Length"},
                   title="Distribution of Petal Length by Species",
                   opacity=0.7,
                   marginal="rug")

fig.update_layout(
    xaxis_title="Petal Length",
    yaxis_title="Count",
    legend_title="Species",
    font=dict(family="Arial", size=12)
)

fig.show()

## Summary of Exploratory Data Analysis (EDA):

### 1. The dataset is balanced, with 50 records for each of the three species.
### 2. Petal width and petal length are very strongly correlated (0.96).
### 3. The Setosa species is the most distinguishable because of its much smaller petal dimensions.
### 4. Versicolor and Virginica are harder to tell apart because their measurements overlap, with Versicolor typically showing intermediate values and Virginica larger ones.
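Points 3 and 4 can be made concrete by comparing the range of a single feature across species. The sketch below does this for petal length, again on scikit-learn's bundled copy of the data so that it runs standalone; the variable names are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Self-contained frame built from scikit-learn's bundled Iris copy.
iris = load_iris(as_frame=True)
df = iris.frame.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
df["species"] = df["target"].map(lambda i: iris.target_names[i])

# Minimum and maximum petal length per species.
ranges = df.groupby("species")["petal_length"].agg(["min", "max"])

# Setosa is separable if its largest petal is shorter than every other species' smallest.
other_min = ranges.loc[["versicolor", "virginica"], "min"].min()
setosa_separable = bool(ranges.loc["setosa", "max"] < other_min)

# Versicolor and virginica overlap if versicolor's longest petal reaches virginica's shortest.
overlap = bool(ranges.loc["versicolor", "max"] >= ranges.loc["virginica", "min"])

print(ranges)
print(setosa_separable, overlap)
```

A single threshold on petal length is therefore enough to isolate setosa, whereas separating the other two species needs more than one feature or a more flexible decision boundary.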

# Model Building

In [None]:
models = {
    'LogisticRegression': LogisticRegression(),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'GaussianNB': GaussianNB(),
    'SVC': SVC(),
    'DecisionTreeClassifier' : DecisionTreeClassifier(),
    'RandomForestClassifier': RandomForestClassifier()
}

# n_neighbors and max_depth must be at least 1, so the ranges start at 1 rather than 0
params = {
    'LogisticRegression': {'solver' : ["liblinear"]},
    'KNeighborsClassifier': {'n_neighbors' : list(range(1, 20))},
    'GaussianNB': {},
    'SVC' : {},
    'DecisionTreeClassifier' : {'max_depth' : list(range(1, 20))},
    'RandomForestClassifier' : {'n_estimators' : [5, 10, 20, 50, 100], 'max_depth' : list(range(1, 20))}
}
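Both `n_neighbors` and `max_depth` must be at least 1, so a grid that starts at 0 contains a candidate that cannot be fitted. With `GridSearchCV`'s default `error_score`, a failed fit is, as far as I understand it, scored as NaN and reported only through a warning, which `warnings.filterwarnings('ignore')` hides. The sketch below shows the failure directly and then runs a valid grid with `error_score="raise"` so that any bad candidate would stop the search. It uses scikit-learn's bundled Iris data and illustrative variable names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# n_neighbors=0 is rejected when the model is fitted.
try:
    KNeighborsClassifier(n_neighbors=0).fit(X, y)
    zero_is_valid = True
except ValueError:
    zero_is_valid = False

# A valid grid; error_score="raise" turns any failed fit into an exception
# instead of a silently ignored NaN score.
grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 20))},
    cv=5,
    error_score="raise",
)
grid.fit(X, y)

print(zero_is_valid, grid.best_params_)
```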

In [None]:
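# Features are the four measurements; the target is the species label
x = data.drop('species',axis=1)
y = data['species']

# Hold out 30% of the rows (45 samples) for testing
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42,test_size=.3)

for model_name, model in models.items():

    # Tune each model with 5-fold cross-validation on the training set only
    model_to_tune = GridSearchCV(model, params[model_name], cv=5)
    model_to_tune.fit(x_train, y_train)

    # Predict with the refitted best estimator on both splits
    y_pred = model_to_tune.predict(x_train)
    y_pred_test = model_to_tune.predict(x_test)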

    print('---'*50, end='\n')
    print('Model Name: ', model_name)
    print(f"Best parameters: {model_to_tune.best_params_}")

    train_accuracy = accuracy_score(y_train, y_pred)
    test_accuracy = accuracy_score(y_test, y_pred_test)

    classification_rep = classification_report(y_test, y_pred_test)

    print("Training Accuracy:", train_accuracy, end = '\n')
    print("Testing Accuracy:", test_accuracy, end = '\n')
    print("\nClassification Report:\n", classification_rep, end='\n')
    print('---'*50, end='\n')

------------------------------------------------------------------------------------------------------------------------------------------------------
Model Name:  LogisticRegression
Best parameters: {'solver': 'liblinear'}
Training Accuracy: 0.9619047619047619
Testing Accuracy: 0.9777777777777777

Classification Report:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        19
Iris-versicolor       1.00      0.92      0.96        13
 Iris-virginica       0.93      1.00      0.96        13

       accuracy                           0.98        45
      macro avg       0.98      0.97      0.97        45
   weighted avg       0.98      0.98      0.98        45

------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
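The printed reports are long, and a single 70/30 split gives only one accuracy figure per model. One possible way to compare the models side by side is to collect cross-validated accuracies in a small table. This is a sketch under assumptions: it uses scikit-learn's bundled Iris data, default hyperparameters (apart from `max_iter=1000` for logistic regression, chosen here to avoid convergence warnings), and a shuffled 5-fold split, so its numbers are not the ones reported in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate models with (mostly) default settings; an illustrative subset.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
    "SVC": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

# Stratified folds keep the three species balanced in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv)
    rows.append({"model": name, "mean_accuracy": scores.mean(), "std": scores.std()})

# One row per model, best mean accuracy first.
summary = pd.DataFrame(rows).sort_values("mean_accuracy", ascending=False)
print(summary.to_string(index=False))
```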

In [None]:
# Refit a decision tree on the training split; max_depth=18 only caps the depth
model = DecisionTreeClassifier(max_depth=18)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

In [None]:
# Fix the class order explicitly so the axis labels match the matrix rows and columns
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
cm = confusion_matrix(y_test, y_pred, labels=labels)

fig = ff.create_annotated_heatmap(cm, x=labels, y=labels,
                                  annotation_text=cm,
                                  colorscale='Blues')

fig.update_layout(title_text='<i><b>Confusion matrix</b></i>')

fig.add_annotation(dict(font=dict(color="black", size=14),
                        x=0.5,
                        y=-0.27,
                        showarrow=False,
                        text="Predicted value",
                        xref="paper",
                        yref="paper"))

fig.add_annotation(dict(font=dict(color="black", size=14),
                        x=-0.15,
                        y=0.5,
                        showarrow=False,
                        text="Actual value",
                        textangle=-90,
                        xref="paper",
                        yref="paper"))

fig.update_layout(margin=dict(t=170, l=250))

fig['data'][0]['showscale'] = True

fig.show()
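When reading the heatmap, keep in mind that `confusion_matrix` puts actual classes in rows and predicted classes in columns, and that without an explicit `labels` argument it orders the classes by sorted label. The toy example below, with made-up predictions, shows the layout.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: one versicolor is mistaken for virginica.
y_true = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica"]
y_hat = ["Iris-virginica", "Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-virginica"]

# Passing labels explicitly pins the row/column order.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
cm_toy = confusion_matrix(y_true, y_hat, labels=labels)

# Rows are actual classes, columns are predicted classes.
for label, row in zip(labels, cm_toy):
    print(f"{label:>16}: {row.tolist()}")
```

The off-diagonal 1 in the versicolor row and virginica column is the single misclassified sample.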

# Conclusion:

## Based on the classification reports and accuracy values for the different models, the DecisionTreeClassifier is a strong choice for the Iris dataset, for the following reasons:

### 1. The Decision Tree model achieves a testing accuracy of 100%, meaning it correctly classifies all 45 samples in the test set.

### 2. Decision Trees need no feature scaling, and their splits depend only on the ordering of feature values, which makes them relatively insensitive to outliers in the inputs. The perfect test accuracy is encouraging, although 45 samples are too few to prove robustness to noise.

### 3. The Decision Tree model's training accuracy is also 100%. On its own that could signal overfitting, but the matching 100% testing accuracy suggests that the model generalizes to the held-out data. Cross-validation would give a more reliable estimate than a single 70/30 split.

### 4. While models like K-Nearest Neighbors and Support Vector Machines achieve high accuracy as well, the Decision Tree model strikes a good balance between accuracy, interpretability, and simplicity.
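The points about generalization and interpretability can be probed a little further. The sketch below is an assumption-laden illustration rather than the notebook's evaluation: it uses scikit-learn's bundled Iris data, repeated stratified 5-fold cross-validation in place of the single split, and a fixed `random_state` chosen here. It also prints the learned rules and the depth the tree actually reaches, which shows whether the `max_depth=18` cap is ever binding.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# Repeated stratified CV gives 50 accuracy estimates instead of one.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=18, random_state=42), X, y, cv=cv
)

# Fit once on all rows to inspect the tree itself.
clf = DecisionTreeClassifier(max_depth=18, random_state=42).fit(X, y)
depth = clf.get_depth()

# export_text renders the tree as nested if/else rules.
rules = export_text(clf, feature_names=list(iris.feature_names))

print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
print(f"depth actually reached: {depth}")
print(rules)
```

If the cross-validated mean is clearly below 100%, the perfect score on the single test split is best read as a favourable draw rather than a guarantee.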