# Building a prediction model over the Iris Dataset

## Loading and setting up the data

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
from sklearn.datasets import load_iris
iris_data = load_iris()  # load the bundled dataset once and reuse it
iris = pd.DataFrame(iris_data.data, columns = iris_data.feature_names)
iris['target'] = iris_data.target
target_names = iris_data.target_names

In [3]:
iris.info()
iris.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

In [4]:
target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
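The integer codes stored in the `target` column index directly into `target_names`. As a minimal sketch of decoding them back into species names, the snippet below uses a hypothetical `codes` array as a stand-in for the real column:

```python
import numpy as np

# Species names in the same order as the array printed above
target_names = np.array(['setosa', 'versicolor', 'virginica'])

# Hypothetical example codes; in the notebook these would come from iris['target']
codes = np.array([0, 2, 1, 0])

# NumPy integer indexing turns each code into the matching species name
decoded = target_names[codes]
print(decoded)
```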

In [5]:
iris.columns = ['s_len', 's_wid', 'p_len', 'p_wid', 'target']

## Splitting the data for training the model

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X = iris.iloc[:, :-1]
y = iris['target']

In [8]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   s_len   150 non-null    float64
 1   s_wid   150 non-null    float64
 2   p_len   150 non-null    float64
 3   p_wid   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


In [9]:
y.info()

<class 'pandas.core.series.Series'>
RangeIndex: 150 entries, 0 to 149
Series name: target
Non-Null Count  Dtype
--------------  -----
150 non-null    int64
dtypes: int64(1)
memory usage: 1.3 KB


In [10]:
Xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)
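With `test_size=0.2` and 150 rows, the split should leave 120 rows for training and 30 for testing. The sketch below checks this. It assumes the data is reloaded with `load_iris(return_X_y=True, as_frame=True)` so that it runs on its own, which means the columns keep their original names rather than the shortened ones used above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Reload features and labels so this sketch is self-contained
X, y = load_iris(return_X_y=True, as_frame=True)

# Same split parameters as the cell above
Xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)

# Row and column counts of each part of the split
print(Xtrain.shape, xtest.shape)

# How many test samples fall into each class (the split is not stratified)
print(ytest.value_counts().sort_index())
```

Because no `stratify` argument is passed, the class counts in the test set are not guaranteed to be equal.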

## Loading LogisticRegression

### Selected LogisticRegression because this is a multi-class classification problem.

In [11]:
from sklearn.linear_model import LogisticRegression

In [12]:
model = LogisticRegression()

### Fitting the model with the training data.

In [13]:
model.fit(Xtrain, ytrain)

Parameter          Value
---------          -----
penalty            'l2'
dual               False
tol                0.0001
C                  1.0
fit_intercept      True
intercept_scaling  1
class_weight       None
random_state       None
solver             'lbfgs'
max_iter           100
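To see how the fitted model handles the three classes, one option is to inspect its learned attributes. The sketch below refits the same default `LogisticRegression()` on the same split so that it runs on its own. Looking at only the first three test rows is an arbitrary choice for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Rebuild the same split as above, using plain NumPy arrays
X, y = load_iris(return_X_y=True)
Xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(Xtrain, ytrain)

# Class labels the model learned from ytrain
print(model.classes_)

# One row of weights per class, one column per feature
print(model.coef_.shape)

# Predicted class probabilities for the first three test rows;
# each row holds one probability per class and sums to 1
probs = model.predict_proba(xtest[:3])
print(probs.round(3))
```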


### Evaluating the built model.

In [14]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [15]:
ypred = model.predict(xtest)

In [16]:
print(accuracy_score(ytest, ypred))

1.0


In [17]:
print(confusion_matrix(ytest, ypred))

[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
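The accuracy score can be recovered from the confusion matrix by hand: the diagonal holds the correctly classified samples, and the sum of all entries is the test set size. A small sketch, with the matrix values copied from the output above:

```python
import numpy as np

# Confusion matrix from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[10, 0, 0],
               [0, 9, 0],
               [0, 0, 11]])

correct = np.trace(cm)  # sum of the diagonal: correctly classified samples
total = cm.sum()        # every test sample is counted exactly once
accuracy = correct / total

print(correct, total, accuracy)
```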


In [18]:
print(classification_report(ytest, ypred, target_names = target_names))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
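The per-class numbers in the report can also be derived from the confusion matrix: precision divides each diagonal entry by its column sum, recall divides it by its row sum, and the row sum is the support. A minimal sketch, again using the matrix values from the earlier output:

```python
import numpy as np

# Confusion matrix from the earlier output
# (rows: true class, columns: predicted class)
cm = np.array([[10, 0, 0],
               [0, 9, 0],
               [0, 0, 11]])

tp = np.diag(cm)                  # true positives per class
precision = tp / cm.sum(axis=0)   # column sums: samples predicted as each class
recall = tp / cm.sum(axis=1)      # row sums: samples that truly belong to each class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)          # number of true samples per class

for name, p, r, f, s in zip(['setosa', 'versicolor', 'virginica'],
                            precision, recall, f1, support):
    print(f"{name:>10}  {p:.2f}  {r:.2f}  {f:.2f}  {s}")
```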



## Conclusion

A logistic regression model was trained and evaluated successfully. It reached an accuracy score of 1.0 on the held-out test set, meaning all 30 test samples were predicted correctly.

From the confusion matrix we can infer that the model correctly classifies all the flowers in the test set: every off-diagonal entry is zero.

The classification report reflects this, with precision and recall of 1.00 for every class.

The model achieved a perfect accuracy of 1.0 on this specific run, which could be described as the "ideal" outcome for any ML model. As we all know, "ideal" results are like the north star of the programming world: widely pursued but rarely attained.

This is primarily due to the random_state=42 used in train_test_split. This particular seed created a favourable split in which the 30 test samples were perfectly classifiable by the model trained on the other 120 samples.

Changing the random_state might result in a slightly different, non-perfect accuracy.
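One way to check how much the result depends on the seed is to repeat the split and fit for several values of `random_state`. The sketch below does this. The list of seeds is a hypothetical choice for illustration, and the data is reloaded so the snippet runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hypothetical set of seeds; any integers would do
seeds = [0, 1, 2, 3, 42]

scores = {}
for seed in seeds:
    # Same 80/20 split as in the notebook, but with a different seed each time
    Xtrain, xtest, ytrain, ytest = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Same default model as in the notebook
    model = LogisticRegression()
    model.fit(Xtrain, ytrain)
    scores[seed] = accuracy_score(ytest, model.predict(xtest))

for seed, score in scores.items():
    print(f"random_state={seed}: accuracy={score:.3f}")
```

Reporting the spread of these scores, rather than a single run, gives a fairer picture of how the model is likely to perform on unseen data.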