
# IRIS DATA SET 

**Context** 

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper *The use of multiple measurements in taxonomic problems*. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor).

Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

This dataset became a typical test case for many statistical classification techniques in machine learning.

**Content**

The dataset contains 150 records with 5 attributes: sepal length, sepal width, petal length, petal width, and the class label (species).
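As a quick check of the counts above, here is a minimal sketch that loads the copy of the data set bundled with scikit-learn instead of the CSV file used in this notebook. This is an assumption made for portability: the bundled copy names its columns differently (for example `sepal length (cm)`) and stores the class as an integer `target` column.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# iris.frame holds the four measurement columns plus an integer 'target' column
frame = iris.frame

print(frame.shape)               # rows, columns
print(list(iris.target_names))   # species names, in the order of the integer codes
```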

**Acknowledgements**

This dataset is free and publicly available at the UCI Machine Learning Repository.


# Import libraries

In [62]:
# import libraries
import numpy as np
import pandas as pd

# import the model, train/test split, label encoder, and accuracy metric from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

#ignore warnings
import warnings
warnings.filterwarnings('ignore')


# Collecting Data

In [63]:
# raw string so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r"E:\Kaggle_DATA\IRIS\IRIS.csv")

In [64]:
data.head()

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [65]:
data.shape

(150, 5)

In [66]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 5.3+ KB


The `species` column is categorical (dtype `object`), whereas the other four columns hold float values. None of the columns has missing values.

__The target column is Species__
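Before encoding the target, it can help to confirm that the classes are balanced and that no values are missing. The sketch below is illustrative only: `demo` is a hypothetical stand-in for the notebook's `data`, built from scikit-learn's bundled copy so the snippet runs without the CSV file.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# Stand-in for the notebook's `data`: rename the target column to 'species'
# and replace the integer codes with the species names
demo = iris.frame.rename(columns={"target": "species"})
code_to_name = {code: str(name) for code, name in enumerate(iris.target_names)}
demo["species"] = demo["species"].map(code_to_name)

# Number of rows per class
counts = demo["species"].value_counts()
print(counts)

# Total number of missing values across all columns
n_missing = int(demo.isna().sum().sum())
print(n_missing)
```

On the notebook's own DataFrame the equivalent calls would be `data["species"].value_counts()` and `data.isna().sum()`.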

In [67]:
print("\nColumn Names:\n") 
print(data.columns)


Column Names:

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')


# Feature Engineering

To know more about Label Encoder, click below:

**[Label Encode the target variable](https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd)**

In [68]:
#creating instance of a label encoder
encode = LabelEncoder()

#Assigning numerical values and storing in the same name column 'species'
data.species = encode.fit_transform(data.species)

In [69]:
print(data.head(10))

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0
5           5.4          3.9           1.7          0.4        0
6           4.6          3.4           1.4          0.3        0
7           5.0          3.4           1.5          0.2        0
8           4.4          2.9           1.4          0.2        0
9           4.9          3.1           1.5          0.1        0


In [81]:
# species column values are changed to numerical form
data.species.unique()

array([0, 1, 2])
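To see which integer stands for which species, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a small hand-written list of labels rather than the full column, and the variable names (`labels`, `encoder`, `mapping`) are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# A few labels in arbitrary order, standing in for the 'species' column
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; the position of a label is its code
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}

print(mapping)
print(codes)
```

Because the labels are sorted before codes are assigned, the mapping does not depend on the order in which the species appear in the data.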

In [70]:
#train test split
train, test = train_test_split(data, test_size = 0.2, random_state = 0)

In [71]:
print(train.shape)

(120, 5)


In [72]:
print(test.shape)

(30, 5)
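A plain random split does not guarantee that each species is equally represented in the 30 test rows. One optional variation, not used in this notebook, is to pass `stratify` so both subsets keep the original class proportions. The sketch below assumes scikit-learn's bundled copy of the data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# stratify=y keeps the 1:1:1 class ratio in both the train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts().sort_index())
```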


In [73]:
#Separate the target and independent variables
train_X = train.drop(columns = ['species'])
train_Y = train['species']

test_X = test.drop(columns = ['species'])
test_Y = test['species']

In [74]:
print(train_X.shape, train_Y.shape)

(120, 4) (120,)


In [75]:
print(test_X.shape, test_Y.shape)

(30, 4) (30,)


# Data Modelling

### What is [Logistic Regression](https://careerfoundry.com/en/blog/data-analytics/what-is-logistic-regression/)

In [76]:
#Create the Object of the model
model = LogisticRegression()

model.fit(train_X, train_Y)

LogisticRegression()
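A fitted logistic regression model assigns each sample one probability per class, and `predict` returns the class with the highest probability. The sketch below illustrates this with `predict_proba`. It rebuilds the split from scikit-learn's bundled copy of the data, and `max_iter=1000` is an assumption chosen to give the solver room to converge, not a setting from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# max_iter raised above the default as a precaution against convergence warnings
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_test[:3])
print(proba.round(3))

# The predicted class is the one with the largest probability
most_likely = clf.classes_[proba.argmax(axis=1)]
print(most_likely, clf.predict(X_test[:3]))
```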

In [77]:
predict = model.predict(test_X)

In [78]:
print("Predicted values on test data before inverse transform:", predict)

Predicted values on test data before inverse transform: [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0]


In [79]:
print("Predicted values on test data: ", encode.inverse_transform(predict))

Predicted values on test data:  ['Iris-virginica' 'Iris-versicolor' 'Iris-setosa' 'Iris-virginica'
 'Iris-setosa' 'Iris-virginica' 'Iris-setosa' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa'
 'Iris-virginica' 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa'
 'Iris-virginica' 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-setosa']


In [80]:
print("\nAccuracy score on test data: \n")
print(accuracy_score(test_Y,predict))


Accuracy score on test data: 

1.0
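An accuracy of 1.0 on a single 30-row test set can depend on which rows happened to land in that set. A possible follow-up, not part of the original notebook, is k-fold cross-validation, which averages accuracy over several different splits. The sketch below assumes scikit-learn's bundled copy of the data, 5 folds, and `max_iter=1000`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)

# For a classifier, an integer cv uses stratified folds;
# the default score for a classifier is accuracy
scores = cross_val_score(clf, X, y, cv=5)

print(scores)          # accuracy on each of the 5 held-out folds
print(scores.mean())   # average accuracy across folds
```

The per-fold scores give a sense of how much the result varies with the choice of test rows.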
