# Comparing Prediction Accuracy Between Logistic Regression Model and Decision Tree on the Iris Dataset

### Abstract

The Iris Dataset is arguablly one of the more known and used database for classification techniques in machine learning. It contains three species of the Iris flower, namely, Setosa, Versicolor, and Virginica.

The mini-project compares the accuracy of a logistic regression model and decision tree in predicting the species using the four features: sepal length, sepal width, petal length, and petal width.

**FINDINGS**


Create (1) a logistic model and (2) a decision tree model that predicts the species of the Iris plant

### Packages

In [58]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from acquire import get_iris_data
from split_scale import split_my_data
from sklearn.metrics import confusion_matrix, classification_report

### Create the Model

#### Acquire the Iris Dataset

In [5]:
df = get_iris_data()
df.head()

Unnamed: 0,species_id,measurement_id,sepal_length,sepal_width,petal_length,petal_width,species_name
0,1,1,5.1,3.5,1.4,0.2,setosa
1,1,2,4.9,3.0,1.4,0.2,setosa
2,1,3,4.7,3.2,1.3,0.2,setosa
3,1,4,4.6,3.1,1.5,0.2,setosa
4,1,5,5.0,3.6,1.4,0.2,setosa


#### Split Train-Test

In [11]:
X = df[["sepal_length","sepal_width","petal_length","petal_width"]]
y = df[["species_name"]]

X_train, X_test, y_train, y_test = split_my_data(X,y,0.7)

In [24]:
X_train.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
114,5.8,2.8,5.1,2.4
136,6.3,3.4,5.6,2.4
53,5.5,2.3,4.0,1.3
19,5.1,3.8,1.5,0.3
38,4.4,3.0,1.3,0.2


#### Create a Logistic Model

Fit the logistic regression classifier to your training sample and transform, i.e. make predictions on the training sample

In [67]:
log_model_saga = LogisticRegression(C=1, random_state = 123, solver='saga').fit(X_train, y_train)
y_train_pred = log_model_saga.predict(X_train)

y_train_pred = pd.DataFrame(y_train_pred).set_index=y_train

Evaluate your in-sample results using the model score, confusion matrix, and classification report

**Model's Score**

In [73]:
s_score = log_model_saga.score(X_train,y_train)

print(f"""
The model's score is {s_score}
""")


The model's score is 0.9619047619047619



**Confusion Matrix**

In [89]:
conf_matrix = confusion_matrix(y_train, y_train_pred)

In [90]:
predicted_labels = ["p_setosa", "p_versicolor", "p_virginica"]
actual_labels = ["a_setosa", "a_versicolor", "a_virginica"]

conf_matrix = pd.DataFrame(conf_matrix, index=actual_labels, columns=predicted_labels)
conf_matrix

Unnamed: 0,p_setosa,p_versicolor,p_virginica
a_setosa,32,0,0
a_versicolor,0,40,0
a_virginica,0,0,33


**Classification Report**

In [62]:
cr = classification_report(y_train, y_train_pred)
print(cr)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        32
  versicolor       1.00      1.00      1.00        40
   virginica       1.00      1.00      1.00        33

    accuracy                           1.00       105
   macro avg       1.00      1.00      1.00       105
weighted avg       1.00      1.00      1.00       105



Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [66]:
cr.loc["accuracy"]

AttributeError: 'str' object has no attribute 'loc'

Look in the scikit-learn documentation to research the `solver` parameter. What is your best option(s) for the particular problem you are trying to solve and the data to be used?

In [84]:
log_model_ll = LogisticRegression(C=1, random_state = 123, solver='liblinear').fit(X_train, y_train)
y_train_pred = log_model_ll.predict(X_train)

y_train_pred = pd.DataFrame(y_train_pred).set_index=y_train
ll_score = log_model_ll.score(X_train,y_train)
ll_score

# For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
# For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
# ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
# ‘liblinear’ and ‘saga’ also handle L1 penalty
# ‘saga’ also supports ‘elasticnet’ penalty
# ‘liblinear’ does not handle no penalty

0.9523809523809523

In [85]:
log_model_ncg = LogisticRegression(C=1, random_state = 123, solver='newton-cg').fit(X_train, y_train)
y_train_pred = log_model_ncg.predict(X_train)

y_train_pred = pd.DataFrame(y_train_pred).set_index=y_train
ncg_score = log_model_ncg.score(X_train,y_train)
ncg_score

0.9619047619047619

In [77]:
log_model_lbfgs = LogisticRegression(C=1, random_state = 123, solver='lbfgs').fit(X_train, y_train)
y_train_pred = log_model_lbfgs.predict(X_train)

y_train_pred = pd.DataFrame(y_train_pred).set_index=y_train
lbfgs_score = log_model_lbfgs.score(X_train,y_train)
lbfgs_score

0.9619047619047619

In [83]:
log_model_sag = LogisticRegression(C=1, random_state = 123, solver='sag').fit(X_train, y_train)
y_train_pred = log_model_sag.predict(X_train)
y_train_pred = pd.DataFrame(y_train_pred).set_index=y_train
sag_score = log_model_sag.score(X_train,y_train)
sag_score

0.9619047619047619