# Logistic Regression on the Iris dataset

In [1]:
#importing the necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

In [2]:
df = pd.read_csv('iris.csv') #read in the csv file
df.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


In [3]:
df.shape #let's see how big this data set is.

(150, 6)

In [4]:
df.isnull().sum() #Quick check if there are any missing values. It seems this data set is already clean.

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

In [5]:
df['Species'].unique() #list the unique values in the Species column, because I want to collapse them into just two categories: setosa and not setosa.

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Now I will split my data set into X (the features) and y (the target) to prepare the data for model building.

In [6]:
X = df.iloc[:,1:5].values #columns 1 to 4 are my X.
y = df.iloc[:, 5].values #Species column as my y.

In [7]:
df.replace({'Species' : {'Iris-virginica':'not setosa', 'Iris-versicolor' : 'not setosa'}}, inplace = True)

In the species column, I replaced 'versicolor' and 'virginica' with 'not setosa', so that I will have only two unique values for this column.

In [8]:
df['Species'].unique() #check if replacing the values went as I expected. Seems it worked.

array(['Iris-setosa', 'not setosa'], dtype=object)

In [9]:
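df['Species'] = df['Species'].replace({'not setosa': 1, 'Iris-setosa': 0}) #I'll just encode setosa as 0 and not setosa as 1.
y = df['Species'].values #re-extract y: it was first taken in In [6], before the relabeling, so refresh it to make sure it holds the 0/1 labels.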

In [10]:
df #quick check if that changed my species column as I expected.

      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0      1            5.1           3.5            1.4           0.2        0
1      2            4.9           3.0            1.4           0.2        0
2      3            4.7           3.2            1.3           0.2        0
3      4            4.6           3.1            1.5           0.2        0
4      5            5.0           3.6            1.4           0.2        0
..   ...            ...           ...            ...           ...      ...
145  146            6.7           3.0            5.2           2.3        1
146  147            6.3           2.5            5.0           1.9        1
147  148            6.5           3.0            5.2           2.0        1
148  149            6.2           3.4            5.4           2.3        1
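The relabeling and encoding steps above can also be collapsed into a single comparison. The following is only a minimal sketch under stated assumptions: `species` is a small made-up Series standing in for the real `Species` column, and `is_not_setosa` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the Species column (not the real iris.csv).
species = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
    name="Species",
)

# The comparison is True for every non-setosa row;
# casting to int gives 0 = setosa, 1 = not setosa.
is_not_setosa = (species != "Iris-setosa").astype(int)

print(is_not_setosa.tolist())  # [0, 1, 1, 0]
```

This avoids the intermediate 'not setosa' string, at the cost of skipping the sanity check on the relabeled column.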


In [11]:
from sklearn.model_selection import train_test_split #Import train test split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2) #splitting my data into 80% train set and 20% test set
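The split above uses no fixed seed, so the exact rows that land in the test set change from run to run. A possible variant, sketched here under assumptions, fixes the seed and stratifies on the label. It uses scikit-learn's bundled copy of Iris instead of `iris.csv`, and the names `X_demo`, `y_demo`, and the seed value 42 are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv.
iris = load_iris()
X_demo = iris.data                        # 150 x 4 feature matrix
y_demo = (iris.target != 0).astype(int)   # target 0 is setosa in this copy

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,   # arbitrary seed, used only to make the split repeatable
    stratify=y_demo,   # keep the 1:2 setosa / not-setosa ratio in both sets
)

print(X_tr.shape, X_te.shape)
print("not setosa in test set:", int(y_te.sum()))
```

With stratification the test set keeps the same class proportions as the full data, which makes scores from different runs easier to compare.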

In [13]:
logreg = LogisticRegression() #build my model

In [14]:
logreg.fit(X_train, y_train)   #fit the model with my train set
y_pred = logreg.predict(X_test)  #predict y_pred using the X_test set

In [15]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(conf_mat,
                     index = ['setosa','not setosa'], 
                     columns = ['setosa','not setosa'])

cm_df

            setosa  not setosa
setosa           8           0
not setosa       0          22


On this test split the model classified all 30 flowers correctly (8 setosa and 22 not setosa), so precision, recall, and accuracy are all perfect.
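The row and column names given to the confusion matrix only match its contents if the label order is what we think it is. One way to make that explicit is the `labels` argument of `confusion_matrix`. This is a small sketch with made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not the notebook's results), assuming the 0 = setosa, 1 = not setosa encoding.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only: 0 = setosa, 1 = not setosa.
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# labels=[0, 1] fixes the row/column order instead of relying on sorted order.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

# Rows are true classes, columns are predicted classes.
cm_demo = pd.DataFrame(
    cm,
    index=["true setosa", "true not setosa"],
    columns=["pred setosa", "pred not setosa"],
)
print(cm_demo)
```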

In [17]:
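#I will manually calculate precision, recall, accuracy, and F1, then compare them with the values from sklearn's classification report.
#Rows of the confusion matrix are true labels and columns are predictions; setosa is treated as the positive class.
TP = cm_df.iloc[0,0] #true setosa, predicted setosa
FP = cm_df.iloc[1,0] #true not setosa, predicted setosa
FN = cm_df.iloc[0,1] #true setosa, predicted not setosa
TN = cm_df.iloc[1,1] #true not setosa, predicted not setosa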

In [18]:
#Precision
#TP/(TP+FP)

prec = TP/(TP+FP)
prec

1.0

In [19]:
#Recall
#TP/(TP+FN)

rec = TP/(TP+FN)
rec

1.0

In [20]:
#Accuracy
acc = (TP + TN)/(TP+FN+TN+FP)
acc

1.0

In [21]:
#F1
f1 = 2 *((prec*rec)/(prec+rec))
f1

1.0

As we expected, the model gave us precision, recall, accuracy, and f1 scores of 1.0.
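The same four numbers can be obtained from sklearn's individual metric functions. The sketch below uses made-up labels with a few deliberate mistakes so the scores are not all 1.0. It assumes the 0 = setosa, 1 = not setosa encoding, and passes `pos_label=0` so that setosa is the positive class, matching the manual TP/FP/FN/TN assignment. All `*_demo` names are hypothetical.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels for illustration only: 0 = setosa, 1 = not setosa.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 1, 0]

# Here TP = 2, FN = 1, FP = 1, TN = 4 with setosa (label 0) as the positive class.
prec_demo = precision_score(y_true_demo, y_pred_demo, pos_label=0)  # TP / (TP + FP)
rec_demo = recall_score(y_true_demo, y_pred_demo, pos_label=0)      # TP / (TP + FN)
f1_demo = f1_score(y_true_demo, y_pred_demo, pos_label=0)           # harmonic mean of the two
acc_demo = accuracy_score(y_true_demo, y_pred_demo)                 # (TP + TN) / total

print(prec_demo, rec_demo, f1_demo, acc_demo)
```

Without `pos_label=0` these functions would treat label 1 (not setosa) as the positive class, so precision and recall would describe the other class.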

In [16]:
from sklearn.metrics import classification_report

In [22]:
target_names = ['setosa', 'not setosa']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         8
  not setosa       1.00      1.00      1.00        22

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



We get the same results from sklearn's classification report.

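In conclusion, logistic regression works very well for this mini project. I split the data set into an 80% training set and a 20% test set, and the model reached 100% accuracy on the test set when predicting whether a flower is a setosa or not a setosa. This comes from a single random split of a small data set, and setosa is known to be linearly separable from the other two species, so a perfect score here is not surprising.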
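Since the result above comes from one random split, a quick way to check that it does not depend on that split is k-fold cross-validation. This is only a sketch under assumptions: it uses scikit-learn's bundled Iris data rather than `iris.csv`, and the names `X_cv` and `y_cv`, the choice of 5 folds, and `max_iter=1000` are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv.
iris = load_iris()
X_cv = iris.data
y_cv = (iris.target != 0).astype(int)  # 0 = setosa, 1 = not setosa

# With a classifier and an integer cv, the folds are stratified by class.
# The default score for a classifier is accuracy.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_cv, y_cv, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

If every fold scores about the same, the single-split result is more convincing than one test set alone.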