# Exercise 5 - IART - Supervised Learning

### Adapted from Notebook by [Randal S. Olson](http://www.randalolson.com/), supported by [Jason H. Moore](http://www.epistasis.org/)
#### [University of Pennsylvania Institute for Bioinformatics](http://upibi.org/)

## 5.3 Iris flower extended data set: classification using different algorithms
Continuing with the Iris dataset, suppose the flowers have already been identified as one of the 3 classes, and each one is now also packed in one of three package types: “Simple” (0), “Gift” (1) and “Luxury” (2). There is also a new variable, “price”, with three possible values: “Low”, “Medium” and “High”.

This gives a different classification problem: predict the “price” class from the remaining attributes, namely sepal_length_cm, sepal_width_cm, petal_length_cm, petal_width_cm, iris_type, and package.

a) Create a new notebook and start by importing the needed libraries.

b) Read the data from the CSV file and check the data using head(), describe(), and other Pandas commands.

In [3]:
import pandas as pd

iris_data = pd.read_csv('iris-data-new2.csv')

iris_data.head()

|   | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | iris_type | package | price |
|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 2 | Medium |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 1 | Low |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 0 | Low |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 0 | Low |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 0 | Low |


In [4]:
iris_data.describe()

|   | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | package |
|---|---|---|---|---|---|
| count | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 |
| mean | 5.847651 | 3.059732 | 3.775168 | 1.209732 | 0.442953 |
| std | 0.799542 | 0.430104 | 1.75872 | 0.762191 | 0.710753 |
| min | 4.4 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.4 | 1.3 | 0.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 1.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


In [5]:
iris_data['iris_type'].value_counts()

Iris-versicolor    50
Iris-virginica     50
Iris-setosa        49
Name: iris_type, dtype: int64

c) Using only the attributes sepal_length_cm, sepal_width_cm, petal_length_cm and petal_width_cm, fit a simple decision tree model to the data, using holdout with 75% for training.

In [17]:
# Import the necessary libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Variables and Target
x_decision = iris_data[['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm']]
y_decision = iris_data['price']

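# Split the data into training and testing sets.
# test_size is the fraction held out for testing, so 0.25 leaves 75% for training.
# Note: the outputs recorded below came from a run with test_size=0.75 (112 test rows).
x_decision_train, x_decision_test, y_decision_train, y_decision_test = train_test_split(x_decision, y_decision, test_size=0.25, random_state=42)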

# Create a Decision Tree Classifier
clf = DecisionTreeClassifier()
clf.fit(x_decision_train, y_decision_train)

DecisionTreeClassifier()

d) Analyze the accuracy, precision, recall and f-measure achieved.

In [21]:
from sklearn.metrics import classification_report

# Evaluate the performance of the model
score_decision = clf.score(x_decision_test, y_decision_test)
print("Accuracy:", score_decision)

y_decision_pred = clf.predict(x_decision_test)

print("Classification Report:")
print(classification_report(y_decision_test, y_decision_pred))

Accuracy: 0.75
Classification Report:
              precision    recall  f1-score   support

        High       0.76      0.81      0.79        16
         Low       0.86      0.75      0.80        57
      Medium       0.62      0.72      0.67        39

    accuracy                           0.75       112
   macro avg       0.75      0.76      0.75       112
weighted avg       0.76      0.75      0.75       112



e) Create and analyze a confusion matrix of the results.

In [22]:
from sklearn.metrics import confusion_matrix

print("Confusion Matrix:")
print(confusion_matrix(y_decision_test, y_decision_pred))

Confusion Matrix:
[[13  0  3]
 [ 0 43 14]
 [ 4  7 28]]
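
To analyze the matrix, it helps to see how the figures in the classification report follow from it. The sketch below is library-free and only illustrative: it copies the matrix printed above, assumes rows are true classes and columns are predicted classes in scikit-learn's sorted label order (High, Low, Medium), and uses two hypothetical helpers, `per_class_metrics` and `accuracy`, that are not part of the original notebook.

```python
# Confusion matrix copied from the output above.
# Assumption: rows = true class, columns = predicted class,
# in sorted label order (High, Low, Medium).
labels = ["High", "Low", "Medium"]
matrix = [
    [13, 0, 3],
    [0, 43, 14],
    [4, 7, 28],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)} computed from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted_as_label = sum(row[i] for row in matrix)  # column sum
        actually_label = sum(matrix[i])                      # row sum
        precision = true_positives / predicted_as_label if predicted_as_label else 0.0
        recall = true_positives / actually_label if actually_label else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


class_metrics = per_class_metrics(matrix, labels)
for label, (p, r, f) in class_metrics.items():
    print(f"{label:>6}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy(matrix):.2f}")
```

Reading the rows, most errors involve the Medium class: 14 Low samples are predicted as Medium, and 11 Medium samples are predicted as High or Low, which is why Medium has the lowest precision and recall.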


f) Using only the attributes sepal_length_cm, sepal_width_cm, petal_length_cm and petal_width_cm, fit a simple nearest neighbor model to the data using holdout with 75% for training.

In [23]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

x_neighbor = iris_data[['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm']]
y_neighbor = iris_data['price']

x_neighbor_train, x_neighbor_test, y_neighbor_train, y_neighbor_test = train_test_split(x_neighbor, y_neighbor, test_size=0.25, random_state=42)

knn = KNeighborsClassifier()
knn.fit(x_neighbor_train, y_neighbor_train)

KNeighborsClassifier()

g) Analyze the accuracy, precision, recall and f-measure achieved and the confusion matrix.

In [28]:

score_neighbor = knn.score(x_neighbor_test, y_neighbor_test)
print("Accuracy:", score_neighbor)

y_neighbor_pred = knn.predict(x_neighbor_test)

print("Classification Report:")
print(classification_report(y_neighbor_test, y_neighbor_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_neighbor_test, y_neighbor_pred))

Accuracy: 0.8157894736842105
Classification Report:
              precision    recall  f1-score   support

        High       1.00      0.57      0.73         7
         Low       0.89      0.89      0.89        18
      Medium       0.69      0.85      0.76        13

    accuracy                           0.82        38
   macro avg       0.86      0.77      0.79        38
weighted avg       0.84      0.82      0.81        38

Confusion Matrix:
[[ 4  0  3]
 [ 0 16  2]
 [ 0  2 11]]



h) Use two different methods for balancing the dataset and repeat the previous analyses.

##### UnderSampling Method

In [32]:
# pip install imbalanced-learn
from imblearn.under_sampling import RandomUnderSampler

x_balance_under = iris_data[['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm']]
y_balance_under = iris_data['price']

# UnderSampling Method
rus = RandomUnderSampler(random_state=42)
x_rus, y_rus = rus.fit_resample(x_balance_under, y_balance_under)

x_train_rus, x_test_rus, y_train_rus, y_test_rus = train_test_split(x_rus, y_rus, test_size=0.25, random_state=42)

knn_rus = KNeighborsClassifier()
knn_rus.fit(x_train_rus, y_train_rus)

y_pred_rus = knn_rus.predict(x_test_rus)

print("Undersampling Confusion Matrix:")
print(confusion_matrix(y_test_rus, y_pred_rus))

print("\nUndersampling Classification Report:")
print(classification_report(y_test_rus, y_pred_rus))

Undersampling Confusion Matrix:
[[5 0 3]
 [0 5 1]
 [1 2 2]]

Undersampling Classification Report:
              precision    recall  f1-score   support

        High       0.83      0.62      0.71         8
         Low       0.71      0.83      0.77         6
      Medium       0.33      0.40      0.36         5

    accuracy                           0.63        19
   macro avg       0.63      0.62      0.62        19
weighted avg       0.66      0.63      0.64        19




##### OverSampling Method

In [31]:
from imblearn.over_sampling import RandomOverSampler

x_balance_over = iris_data[['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm']]
y_balance_over = iris_data['price']

# OverSampling Method
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(x_balance_over, y_balance_over)

X_train_ros, X_test_ros, y_train_ros, y_test_ros = train_test_split(X_ros, y_ros, test_size=0.25, random_state=42)

knn_ros = KNeighborsClassifier()
knn_ros.fit(X_train_ros, y_train_ros)

y_pred_ros = knn_ros.predict(X_test_ros)

print("Oversampling Confusion Matrix:")
print(confusion_matrix(y_test_ros, y_pred_ros))

print("\nOversampling Classification Report:")
print(classification_report(y_test_ros, y_pred_ros))

Oversampling Confusion Matrix:
[[17  0  2]
 [ 0 13  3]
 [ 5  3 11]]

Oversampling Classification Report:
              precision    recall  f1-score   support

        High       0.77      0.89      0.83        19
         Low       0.81      0.81      0.81        16
      Medium       0.69      0.58      0.63        19

    accuracy                           0.76        54
   macro avg       0.76      0.76      0.76        54
weighted avg       0.75      0.76      0.75        54
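
Both cells above resample the whole dataset before calling train_test_split. With oversampling, that means a duplicated row can end up in both the training and the test set, which tends to make test scores look better than they are. A common alternative is to split first and balance only the training part. The sketch below illustrates the idea without any library; `random_oversample` is a hypothetical helper (not the imbalanced-learn API), and the toy rows are made up for illustration rather than taken from the CSV.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical training split (illustrative values only).
# The test split is left untouched and never passed to the sampler.
train_rows = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.0], [6.7, 3.1], [7.2, 3.6]]
train_labels = ["Low", "Low", "Low", "Medium", "Medium", "High"]

balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
print(Counter(balanced_labels))
```

With imbalanced-learn the same pattern amounts to calling `fit_resample` on the training split only and evaluating on the original, unbalanced test split.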




i) Using all the attributes available, and the balanced dataset, fit distinct models such as Nearest Neighbor, Decision Trees, SVMs and Neural Networks to the data and try different configuration parameters, using holdout with 75% for training.

In [None]:
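# pandas, train_test_split, the KNN and decision tree classifiers, the samplers
# and confusion_matrix were imported in earlier cells
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier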

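# The balanced dataset is assumed to have been exported beforehand to iris-data-balanced.csv

iris_data = pd.read_csv('iris-data-balanced.csv')
# One-hot encode the text column iris_type so that every feature is numeric
# (assumes the balanced file keeps the same columns as the original CSV)
X = pd.get_dummies(iris_data.drop(['price'], axis=1), columns=['iris_type'])
y = iris_data['price']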
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

over_sampler = RandomOverSampler(sampling_strategy='minority')
X_train_over, y_train_over = over_sampler.fit_resample(X_train, y_train)

under_sampler = RandomUnderSampler(sampling_strategy='majority')
X_train_under, y_train_under = under_sampler.fit_resample(X_train, y_train)

# K-Nearest Neighbor
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_over, y_train_over)
y_pred_knn = knn.predict(X_test)
print("K-Nearest Neighbor:")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print("Precision:", precision_score(y_test, y_pred_knn, average='macro'))
print("Recall:", recall_score(y_test, y_pred_knn, average='macro'))
print("F1-Score:", f1_score(y_test, y_pred_knn, average='macro'))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))

# Decision Tree
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train_under, y_train_under)
y_pred_dt = dt.predict(X_test)
print("Decision Tree:")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Precision:", precision_score(y_test, y_pred_dt, average='macro'))
print("Recall:", recall_score(y_test, y_pred_dt, average='macro'))
print("F1-Score:", f1_score(y_test, y_pred_dt, average='macro'))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_dt))

# Support Vector Machine - SVC
svm = SVC(kernel='rbf', C=1, gamma=0.1)
svm.fit(X_train_under, y_train_under)
y_pred_svm = svm.predict(X_test)
print("Support Vector Machine:")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Precision:", precision_score(y_test, y_pred_svm, average='macro'))
print("Recall:", recall_score(y_test, y_pred_svm, average='macro'))
print("F1-Score:", f1_score(y_test, y_pred_svm, average='macro'))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_svm))

# Neural Network
nn = MLPClassifier(hidden_layer_sizes=(10, 5), activation='logistic', solver='sgd', learning_rate_init=0.01, max_iter=1000)
nn.fit(X_train_over, y_train_over)
y_pred_nn = nn.predict(X_test)
print("Neural Network:")
print("Accuracy:", accuracy_score(y_test, y_pred_nn))
print("Precision:", precision_score(y_test, y_pred_nn, average='macro'))
print("Recall:", recall_score(y_test, y_pred_nn, average='macro'))
print("F1-Score:", f1_score(y_test, y_pred_nn, average='macro'))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_nn))


j) Analyze the accuracy, precision, recall and f-measure achieved and the confusion matrix.

k) Repeat the previous analysis, this time evaluating all models with 5-fold cross-validation.
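
One possible way to approach the cross-validation step is scikit-learn's `cross_validate` with a stratified 5-fold splitter. The sketch below is only a starting point: because the exercise CSV is not available here, it uses `load_iris` (species as the target) as a stand-in for the price dataset, and the model settings are copied from the holdout cell rather than tuned. The `models` and `cv_results` names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): replace X, y with the encoded features
# and the 'price' column of the exercise dataset.
X, y = load_iris(return_X_y=True)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "SVM": SVC(kernel="rbf", C=1, gamma=0.1),
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

cv_results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # cross_validate returns one score per fold under the key "test_<metric>".
    cv_results[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}
    print(name, {metric: round(value, 3) for metric, value in cv_results[name].items()})
```

An MLPClassifier can be added to the dictionary in the same way. If resampling is used, it should be applied inside each training fold only, for example through an imbalanced-learn pipeline.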

l) Use Grid Search to define the best parameters for the best two algorithms.
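
A minimal sketch of the grid search step, assuming KNN and an RBF SVM turn out to be the two best algorithms (the exercise leaves that choice to the analysis above). It again uses `load_iris` as a stand-in for the price dataset, and the parameter grids are illustrative guesses, not recommended values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in data (assumption): replace with the exercise features and 'price' target.
X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each search tries every parameter combination with 5-fold cross-validation
# and keeps the combination with the best mean macro F1.
searches = {
    "KNN": GridSearchCV(
        KNeighborsClassifier(),
        param_grid={"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]},
        scoring="f1_macro",
        cv=cv,
    ),
    "SVM": GridSearchCV(
        SVC(),
        param_grid={"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
        scoring="f1_macro",
        cv=cv,
    ),
}

best_params = {}
for name, search in searches.items():
    search.fit(X, y)
    best_params[name] = search.best_params_
    print(name, search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the model refitted on all the data passed to `fit`, which can then be evaluated on a held-out test split.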

m) Analyze the accuracy, precision, recall, f-measure and the confusion matrix for the best model.