#**About Dataset**
The IRIS dataset is a classic dataset in machine learning, frequently used as an introduction to classification algorithms and data analysis techniques. It contains 150 samples, 50 from each of three species of the Iris flower: Iris setosa, Iris versicolor, and Iris virginica. Each sample is described by four feature variables:

1. **Sepal Length (cm)**: The length of the flower's sepal.
2. **Sepal Width (cm)**: The width of the flower's sepal.
3. **Petal Length (cm)**: The length of the flower's petal.
4. **Petal Width (cm)**: The width of the flower's petal.

These four measurements are enough to separate the three species well, which is why the IRIS dataset is a common benchmark for supervised learning models. Iris setosa is linearly separable from the other two species, while Iris versicolor and Iris virginica overlap slightly. Because the dataset is small and the features relate clearly to the target classes, it is well suited to demonstrating data visualization, exploratory data analysis, and classification techniques such as logistic regression, decision trees, and k-nearest neighbors.

The IRIS dataset’s simplicity and well-defined structure make it a practical first dataset for anyone starting out in machine learning.

In [432]:
import warnings
warnings.filterwarnings('ignore')

#**Importing Libraries**

In [433]:
import numpy as np
import pandas as pd

df = pd.read_csv("https://github.com/YBI-Foundation/Dataset/raw/main/IRIS.csv")
df

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |


#**Describing the dataset**

In [434]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [435]:
df.describe()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [436]:
df['species'].value_counts()

| species | count |
|---|---|
| Iris-setosa | 50 |
| Iris-versicolor | 50 |
| Iris-virginica | 50 |


#**Preprocessing as per requirement**

In [437]:
from sklearn.preprocessing import StandardScaler

# Standardize the four numeric features to zero mean and unit variance.
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
ss = StandardScaler()
df[feature_cols] = ss.fit_transform(df[feature_cols])
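
As a worked check of what the scaling step computes, the sketch below standardizes one value by hand using the summary statistics from the `df.describe()` output. It relies on two facts: `describe()` reports the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The variable names are illustrative only.

```python
import math

# Summary statistics for sepal_length, copied from the df.describe() output.
n = 150
mean = 5.843333
sample_std = 0.828066  # describe() uses the sample std (ddof=1)

# StandardScaler divides by the population std (ddof=0), so convert first.
population_std = sample_std * math.sqrt((n - 1) / n)

# Standardize the sepal_length of the first row (5.1 cm).
z = (5.1 - mean) / population_std
print(round(z, 4))
```

The result agrees, up to rounding of the inputs, with the value of about -0.9007 shown for row 0 in the preprocessed dataset.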

##**Encoding the categorical variable**

In [438]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])
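
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why Iris-setosa, Iris-versicolor, and Iris-virginica become 0, 1, and 2. The plain-Python sketch below mimics that mapping on a small hypothetical list of labels; it is an illustration of the idea, not scikit-learn's implementation.

```python
# A small, made-up sample of species labels in arbitrary order.
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# Sorted unique labels define the code for each class.
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}

# Encode, then decode again to recover the original names.
codes = [to_code[name] for name in labels]
decoded = [classes[code] for code in codes]

print(to_code)
print(codes)
```

In the notebook itself, the same information is available from the fitted encoder through `le.classes_`, and `le.inverse_transform` turns predicted codes back into species names.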

##**Checking for duplicate values**

In [439]:
df.duplicated().sum()

3

In [440]:
df.drop_duplicates(inplace=True)

In [441]:
df.duplicated().sum()

0
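
The sketch below shows on a toy DataFrame (made-up values, not the Iris data) how `duplicated()` counts repeats: with the default `keep='first'`, the first occurrence of a row is not flagged, only the later copies are. It also shows that `drop_duplicates()` keeps the original index labels, so `reset_index(drop=True)` is one option if a contiguous index is wanted afterwards.

```python
import pandas as pd

# Toy frame: rows 1 and 3 repeat row 0 exactly.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.4, 5.1, 1.4],
    "species": [0, 0, 2, 0],
})

# Only the second and later copies are marked as duplicates.
dup_mask = toy.duplicated()
n_dups = int(dup_mask.sum())

# Dropping duplicates leaves index labels 0 and 2; reset them to 0 and 1.
deduped = toy.drop_duplicates()
deduped_reset = deduped.reset_index(drop=True)

print(n_dups, list(deduped.index), list(deduped_reset.index))
```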

#**Preprocessed Dataset**

In [442]:
df

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | -0.900681 | 1.032057 | -1.341272 | -1.312977 | 0 |
| 1 | -1.143017 | -0.124958 | -1.341272 | -1.312977 | 0 |
| 2 | -1.385353 | 0.337848 | -1.398138 | -1.312977 | 0 |
| 3 | -1.506521 | 0.106445 | -1.284407 | -1.312977 | 0 |
| 4 | -1.021849 | 1.263460 | -1.341272 | -1.312977 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 1.038005 | -0.124958 | 0.819624 | 1.447956 | 2 |
| 146 | 0.553333 | -1.281972 | 0.705893 | 0.922064 | 2 |
| 147 | 0.795669 | -0.124958 | 0.819624 | 1.053537 | 2 |
| 148 | 0.432165 | 0.800654 | 0.933356 | 1.447956 | 2 |


#**Defining target variable (y) and feature variables (x)**

In [443]:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [444]:
y = df['species']
x = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

#**Train-Test Split**

In [445]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.5, random_state=2529)
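
The split sizes follow from simple arithmetic. With a fractional `train_size` and no `test_size`, `train_test_split` takes the floor of `train_size * n_samples` rows for training and leaves the rest for testing. The sketch below applies that to the 147 rows that remain after the 3 duplicates were dropped.

```python
import math

n_samples = 150 - 3  # rows left after drop_duplicates()
train_size = 0.5

# Fractional train_size: floor for the training set, remainder for testing.
n_train = math.floor(train_size * n_samples)
n_test = n_samples - n_train

print(n_train, n_test)
```

This gives 73 training rows and 74 test rows, which matches the total support of 74 in the classification report. Because the call does not pass `stratify=y`, the per-class counts in the test set are not forced to be equal.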

#**Model Selection**

In [446]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

#**Training the model**

In [447]:
rfc.fit(x_train,y_train)
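
The workflow above fits `StandardScaler` on the full dataset before splitting, so statistics from the test rows influence the scaling. A hedged alternative is sketched below: split first, then wrap the scaler and the classifier in a pipeline so the scaler is fitted on the training rows only. The sketch makes several assumptions that differ from the notebook: it loads scikit-learn's bundled copy of the data through `load_iris` instead of the CSV, skips the duplicate removal, adds `stratify=y`, and fixes `random_state` on the forest for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data (150 rows) stands in for the CSV.
X, y = load_iris(return_X_y=True)

# Split before any fitting so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=2529, stratify=y
)

# The pipeline fits the scaler on X_train only, then trains the forest.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=2529),
)
model.fit(X_train, y_train)

# score() applies the training-set scaling to X_test and returns accuracy.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Tree-based models such as random forests split on feature thresholds and are largely insensitive to this kind of rescaling, so the scaler matters more if the classifier is later swapped for a distance-based or linear model.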

#**Testing the model**

In [448]:
y_pred = rfc.predict(x_test)

#**Calculating the metrics**

In [451]:
from sklearn.metrics import confusion_matrix, classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        24
           1       0.92      0.96      0.94        23
           2       0.96      0.93      0.94        27

    accuracy                           0.96        74
   macro avg       0.96      0.96      0.96        74
weighted avg       0.96      0.96      0.96        74



In [452]:
confusion_matrix(y_test,y_pred)

array([[24,  0,  0],
       [ 0, 22,  1],
       [ 0,  2, 25]])
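
The figures in the classification report can be recomputed directly from this confusion matrix. In scikit-learn's convention, rows are true classes and columns are predicted classes, so precision for a class is its diagonal entry divided by the column sum, and recall is the diagonal entry divided by the row sum. The plain-Python sketch below works through that arithmetic; the variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes.
cm = [
    [24, 0, 0],
    [0, 22, 1],
    [0, 2, 25],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

report = {}
for k in range(n_classes):
    true_pos = cm[k][k]
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(cm[k])  # row sum, i.e. the support
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[k] = (round(precision, 2), round(recall, 2), round(f1, 2), actual_k)

for k, row in report.items():
    print(k, row)
print("accuracy", round(accuracy, 2))
```

All three misclassifications fall between classes 1 and 2 (Iris versicolor and Iris virginica), while class 0 (Iris setosa) is classified perfectly.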