# **CLASSIFICATION**

**DATASET: IRIS.XLS**


---

### **IMPORTING NECESSARY LIBRARIES**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import files
files.upload()

### **1. Read the dataset into the Python environment**

In [2]:
data=pd.read_excel(r"/content/iris.xls")
data

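| | SL | SW | PL | PW | Classification |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | NaN | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | NaN | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |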


### **2. Do the necessary pre-processing steps**

In [3]:
data.head()

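| | SL | SW | PL | PW | Classification |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | NaN | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |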


In [4]:
data.tail()

| | SL | SW | PL | PW | Classification |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | NaN | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   SL              143 non-null    float64
 1   SW              144 non-null    float64
 2   PL              144 non-null    float64
 3   PW              150 non-null    float64
 4   Classification  150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [6]:
data.describe()

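| | SL | SW | PL | PW |
|---|---|---|---|---|
| count | 143.0 | 144.0 | 144.0 | 150.0 |
| mean | 5.855944 | 3.049306 | 3.75625 | 1.198667 |
| std | 0.828168 | 0.430644 | 1.761306 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |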


In [7]:
data['Classification'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Classification, dtype: int64

In [8]:
data.shape

(150, 5)

**a) Handling missing values:**

In [9]:
# Check for missing values
missing_values = data.isnull().sum()
print("Missing values:\n", missing_values)

Missing values:
 SL                7
SW                6
PL                6
PW                0
Classification    0
dtype: int64


In [10]:
# Replace missing values with the mean of each numeric column.
# numeric_only=True skips the text column "Classification", which has no mean.
mean_values = data.mean(numeric_only=True)
data.fillna(mean_values, inplace=True)


In [11]:
data.isnull().sum()

SL                0
SW                0
PL                0
PW                0
Classification    0
dtype: int64
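Mean imputation can be checked on a small frame before trusting it on the real data. The following is a minimal sketch, assuming a hypothetical three-row frame `toy` (its values are made up for illustration and are not taken from `iris.xls`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric gap and one text column
toy = pd.DataFrame({
    "SL": [5.0, np.nan, 7.0],
    "PW": [0.2, 1.3, 2.3],
    "Classification": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Column means over numeric columns only; the text column is skipped
col_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column with its own mean
filled = toy.fillna(col_means)
print(filled)
```

The gap in `SL` is filled with the mean of the two observed values, and the text column is left untouched.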

**b) Checking for duplicates:**

In [12]:
# Check for duplicates
duplicates = data.duplicated()

# Count the number of duplicates
num_duplicates = duplicates.sum()
print("Number of duplicates:", num_duplicates)

Number of duplicates: 3


In [13]:
# Remove duplicates
data = data.drop_duplicates()

# Reset the index after removing duplicates
data = data.reset_index(drop=True)

**c) Encoding categorical variables:**

In [14]:
from sklearn.preprocessing import LabelEncoder


# Apply label encoding to the "Classification" column
label_encoder = LabelEncoder()
data["Classification"] = label_encoder.fit_transform(data["Classification"])

# Print the updated dataset
print(data)

           SL   SW       PL   PW  Classification
0    5.100000  3.5  1.40000  0.2               0
1    4.900000  3.0  1.40000  0.2               0
2    5.855944  3.2  1.30000  0.2               0
3    4.600000  3.1  1.50000  0.2               0
4    5.000000  3.6  1.40000  0.2               0
..        ...  ...      ...  ...             ...
142  6.700000  3.0  5.20000  2.3               2
143  6.300000  2.5  5.00000  1.9               2
144  6.500000  3.0  3.75625  2.0               2
145  6.200000  3.4  5.40000  2.3               2
146  5.900000  3.0  5.10000  1.8               2

[147 rows x 5 columns]


In [15]:
print(data["Classification"].unique())

[0 1 2]


In [16]:
data['Classification'].value_counts()

1    50
2    49
0    48
Name: Classification, dtype: int64
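To read the encoded classes back as species names, the fitted encoder's `classes_` attribute and `inverse_transform` method can be used. A small sketch on a hypothetical list of labels (the list `species` is illustrative, not the notebook's column):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, standing in for the "Classification" column
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ is sorted, so each name's position is its integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because the classes are sorted alphabetically, 0 stands for Iris-setosa, 1 for Iris-versicolor and 2 for Iris-virginica.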

**d) Outlier handling:**

In [17]:
from scipy import stats

# Compute the z-scores for every feature column.
# [:, :-1] keeps SL, SW, PL and PW and excludes only the encoded target column.
z_scores = stats.zscore(data.iloc[:, :-1])

# Set a threshold for outlier detection
threshold = 3

# Find the indices of outliers
outlier_indices = (abs(z_scores) > threshold).any(axis=1)

# Remove the rows containing outliers
data = data[~outlier_indices]

# Reset the index after removing outliers
data = data.reset_index(drop=True)

# Display the cleaned dataset
print(data)


           SL   SW       PL   PW  Classification
0    5.100000  3.5  1.40000  0.2               0
1    4.900000  3.0  1.40000  0.2               0
2    5.855944  3.2  1.30000  0.2               0
3    4.600000  3.1  1.50000  0.2               0
4    5.000000  3.6  1.40000  0.2               0
..        ...  ...      ...  ...             ...
141  6.700000  3.0  5.20000  2.3               2
142  6.300000  2.5  5.00000  1.9               2
143  6.500000  3.0  3.75625  2.0               2
144  6.200000  3.4  5.40000  2.3               2
145  5.900000  3.0  5.10000  1.8               2

[146 rows x 5 columns]
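The z-score rule used above can be illustrated on data where the outlier is known in advance. This is a sketch under stated assumptions: the frame `toy` and its column names `a`, `b` and `label` are hypothetical, with one deliberately extreme value in column `a`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 19 ordinary rows plus one extreme value in column "a"
toy = pd.DataFrame({
    "a": [3.0] * 19 + [10.0],
    "b": [1.0, 2.0] * 10,
    "label": [0] * 20,
})

# Score the feature columns only; the label is not a measurement
features = toy.drop(columns="label")
z = np.abs(stats.zscore(features.to_numpy(), axis=0))

# A row is an outlier if any of its features lies more than 3 std devs from the mean
outlier_rows = (z > 3).any(axis=1)
cleaned = toy[~outlier_rows].reset_index(drop=True)
print(outlier_rows.sum(), len(cleaned))
```

Only the last row is flagged here, since 10.0 sits more than four standard deviations above the column mean while every other value stays well inside the threshold.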


### **3. Model evaluation**

In [18]:
# Separate features and target variable
X = data.drop('Classification', axis=1)
y = data['Classification']

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
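A plain random split does not guarantee that each class is equally represented in the test set. One option is the `stratify` argument of `train_test_split`. The sketch below uses scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for `iris.xls`, so the names `X_demo` and `y_demo` are assumptions and the counts refer to that copy, not to the cleaned data above.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's built-in Iris, 50 samples per class
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

test_counts = Counter(y_te.tolist())
print(test_counts)
```

With three balanced classes and a 30-sample test set, stratification places ten samples of each class in the test set.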

**SVM**

In [20]:
# SVM
from sklearn.svm import SVC

# SVM model
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)

# Calculate evaluation metrics
svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_precision = precision_score(y_test, svm_predictions, average='weighted')
svm_recall = recall_score(y_test, svm_predictions, average='weighted')
svm_f1 = f1_score(y_test, svm_predictions, average='weighted')

# Classification report and confusion matrix
svm_classification_report = classification_report(y_test, svm_predictions)
svm_confusion_matrix = confusion_matrix(y_test, svm_predictions)

# Print the results
print("SVM Model")
print("Accuracy:", svm_accuracy)
print("Precision:", svm_precision)
print("Recall:", svm_recall)
print("F1-score:", svm_f1)
print("Classification Report:")
print(svm_classification_report)
print("Confusion Matrix:")
print(svm_confusion_matrix)

SVM Model
Accuracy: 0.9
Precision: 0.9079545454545455
Recall: 0.9
F1-score: 0.8996675191815858
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.92      0.96        12
           1       0.88      0.78      0.82         9
           2       0.82      1.00      0.90         9

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.89        30
weighted avg       0.91      0.90      0.90        30

Confusion Matrix:
[[11  1  0]
 [ 0  7  2]
 [ 0  0  9]]
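The reported metrics can be re-derived from the confusion matrix alone, which is a useful sanity check. The sketch below hard-codes the SVM matrix printed above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# SVM confusion matrix from the output above (rows: true, columns: predicted)
cm = np.array([[11, 1, 0],
               [0, 7, 2],
               [0, 0, 9]])

support = cm.sum(axis=1)                 # true samples per class: 12, 9, 9
accuracy = np.trace(cm) / cm.sum()       # correct predictions / all predictions
recall = np.diag(cm) / support           # correct / true samples of the class
precision = np.diag(cm) / cm.sum(axis=0) # correct / predictions of the class

# average='weighted' weights each class score by its support
weighted_precision = (precision * support).sum() / support.sum()

print("accuracy:", accuracy)
print("precision per class:", precision.round(2))
print("recall per class:", recall.round(2))
print("weighted precision:", weighted_precision)
```

The diagonal holds 27 of the 30 test samples, which gives the accuracy of 0.9, and the support-weighted precision reproduces the printed 0.9079...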


In [21]:
svm_model.predict(X_test)

array([0, 2, 1, 0, 0, 2, 0, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 0, 1, 1, 2,
       0, 2, 0, 1, 1, 2, 2, 0])

In [22]:
svm_model.score(X_test,y_test)

0.9

In [23]:
svm_model.score(X_train,y_train)

0.9568965517241379

**KNN**

In [24]:
from sklearn.neighbors import KNeighborsClassifier

# KNN model
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_predictions = knn_model.predict(X_test)

# Calculate evaluation metrics
knn_accuracy = accuracy_score(y_test, knn_predictions)
knn_precision = precision_score(y_test, knn_predictions, average='weighted')
knn_recall = recall_score(y_test, knn_predictions, average='weighted')
knn_f1 = f1_score(y_test, knn_predictions, average='weighted')

# Classification report and confusion matrix
knn_classification_report = classification_report(y_test, knn_predictions)
knn_confusion_matrix = confusion_matrix(y_test, knn_predictions)

# Print the results
print("KNN Model")
print("Accuracy:", knn_accuracy)
print("Precision:", knn_precision)
print("Recall:", knn_recall)
print("F1-score:", knn_f1)
print("Classification Report:")
print(knn_classification_report)
print("Confusion Matrix:")
print(knn_confusion_matrix)

KNN Model
Accuracy: 0.9
Precision: 0.9079545454545455
Recall: 0.9
F1-score: 0.8996675191815858
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.92      0.96        12
           1       0.88      0.78      0.82         9
           2       0.82      1.00      0.90         9

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.89        30
weighted avg       0.91      0.90      0.90        30

Confusion Matrix:
[[11  1  0]
 [ 0  7  2]
 [ 0  0  9]]


In [25]:
knn_model.predict(X_test)

array([0, 2, 1, 0, 0, 2, 0, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 0, 1, 1, 2,
       0, 2, 0, 1, 1, 2, 2, 0])

In [26]:
knn_model.score(X_test,y_test)

0.9

In [27]:
knn_model.score(X_train,y_train)

0.9568965517241379

**LOGISTIC REGRESSION**

In [28]:
from sklearn.linear_model import LogisticRegression

# Logistic Regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)

# Calculate evaluation metrics
lr_accuracy = accuracy_score(y_test, lr_predictions)
lr_precision = precision_score(y_test, lr_predictions, average='weighted')
lr_recall = recall_score(y_test, lr_predictions, average='weighted')
lr_f1 = f1_score(y_test, lr_predictions, average='weighted')

# Classification report and confusion matrix
lr_classification_report = classification_report(y_test, lr_predictions)
lr_confusion_matrix = confusion_matrix(y_test, lr_predictions)

# Print the results
print("Logistic Regression Model")
print("Accuracy:", lr_accuracy)
print("Precision:", lr_precision)
print("Recall:", lr_recall)
print("F1-score:", lr_f1)
print("Classification Report:")
print(lr_classification_report)
print("Confusion Matrix:")
print(lr_confusion_matrix)

Logistic Regression Model
Accuracy: 0.9
Precision: 0.9079545454545455
Recall: 0.9
F1-score: 0.8996675191815858
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.92      0.96        12
           1       0.88      0.78      0.82         9
           2       0.82      1.00      0.90         9

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.89        30
weighted avg       0.91      0.90      0.90        30

Confusion Matrix:
[[11  1  0]
 [ 0  7  2]
 [ 0  0  9]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [29]:
lr_model.predict(X_test)

array([0, 2, 1, 0, 0, 2, 0, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 0, 1, 1, 2,
       0, 2, 0, 1, 1, 2, 2, 0])

In [30]:
lr_model.score(X_test,y_test)

0.9

In [31]:
lr_model.score(X_train,y_train)

0.9655172413793104

**RANDOM FOREST CLASSIFIER**

In [32]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

# Calculate evaluation metrics
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_precision = precision_score(y_test, rf_predictions, average='weighted')
rf_recall = recall_score(y_test, rf_predictions, average='weighted')
rf_f1 = f1_score(y_test, rf_predictions, average='weighted')

# Classification report and confusion matrix
rf_classification_report = classification_report(y_test, rf_predictions)
rf_confusion_matrix = confusion_matrix(y_test, rf_predictions)

# Print the results
print("Random Forest Model")
print("Accuracy:", rf_accuracy)
print("Precision:", rf_precision)
print("Recall:", rf_recall)
print("F1-score:", rf_f1)
print("Classification Report:")
print(rf_classification_report)
print("Confusion Matrix:")
print(rf_confusion_matrix)


Random Forest Model
Accuracy: 0.9333333333333333
Precision: 0.9454545454545454
Recall: 0.9333333333333333
F1-score: 0.9325
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      0.78      0.88         9
           2       0.82      1.00      0.90         9

    accuracy                           0.93        30
   macro avg       0.94      0.93      0.92        30
weighted avg       0.95      0.93      0.93        30

Confusion Matrix:
[[12  0  0]
 [ 0  7  2]
 [ 0  0  9]]


In [33]:
rf_model.predict(X_test)

array([0, 2, 0, 0, 0, 2, 0, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 0, 1, 1, 2,
       0, 2, 0, 1, 1, 2, 2, 0])

In [34]:
rf_model.score(X_test,y_test)

0.9333333333333333

In [35]:
rf_model.score(X_train,y_train)

1.0

**DECISION TREE**

In [36]:
from sklearn.tree import DecisionTreeClassifier

# Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)

# Calculate evaluation metrics
dt_accuracy = accuracy_score(y_test, dt_predictions)
dt_precision = precision_score(y_test, dt_predictions, average='weighted')
dt_recall = recall_score(y_test, dt_predictions, average='weighted')
dt_f1 = f1_score(y_test, dt_predictions, average='weighted')

# Classification report and confusion matrix
dt_classification_report = classification_report(y_test, dt_predictions)
dt_confusion_matrix = confusion_matrix(y_test, dt_predictions)

# Print the results
print("Decision Tree Model")
print("Accuracy:", dt_accuracy)
print("Precision:", dt_precision)
print("Recall:", dt_recall)
print("F1-score:", dt_f1)
print("Classification Report:")
print(dt_classification_report)
print("Confusion Matrix:")
print(dt_confusion_matrix)

Decision Tree Model
Accuracy: 0.9333333333333333
Precision: 0.9454545454545454
Recall: 0.9333333333333333
F1-score: 0.9325
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      0.78      0.88         9
           2       0.82      1.00      0.90         9

    accuracy                           0.93        30
   macro avg       0.94      0.93      0.92        30
weighted avg       0.95      0.93      0.93        30

Confusion Matrix:
[[12  0  0]
 [ 0  7  2]
 [ 0  0  9]]


In [37]:
dt_model.predict(X_test)

array([0, 2, 0, 0, 0, 2, 0, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 0, 1, 1, 2,
       0, 2, 0, 1, 1, 2, 2, 0])

In [38]:
dt_model.score(X_test,y_test)

0.9333333333333333

In [39]:
dt_model.score(X_train,y_train)

1.0

**MODEL ACCURACIES**

In [40]:
from IPython.display import display

# Create a dictionary with model names and accuracies
model_accuracies = {
    'SVM': svm_accuracy,
    'KNN': knn_accuracy,
    'Random Forest': rf_accuracy,
    'Logistic Regression': lr_accuracy,
    'Decision Tree': dt_accuracy
}

# Create a DataFrame from the dictionary
accuracy_df = pd.DataFrame(list(model_accuracies.items()), columns=['Model', 'Accuracy'])

# Display the DataFrame as a table
display(accuracy_df)

| | Model | Accuracy |
|---|---|---|
| 0 | SVM | 0.9 |
| 1 | KNN | 0.9 |
| 2 | Random Forest | 0.933333 |
| 3 | Logistic Regression | 0.9 |
| 4 | Decision Tree | 0.933333 |


Based on the accuracy results obtained:

1. Logistic Regression, SVM, and KNN each achieved an accuracy of 0.90 (27 of 30 test samples correct). Their confusion matrices are identical: one Iris-setosa sample was predicted as Iris-versicolor, and two Iris-versicolor samples were predicted as Iris-virginica.

2. Decision Tree and Random Forest achieved an accuracy of 0.93 (28 of 30 correct). They classified every Iris-setosa sample correctly but made the same two Iris-versicolor/Iris-virginica errors, so they did not classify the whole test set correctly.


**Inference**: Decision Tree and Random Forest scored slightly higher on this test set, but the gap to SVM, KNN and Logistic Regression is a single sample out of 30, which is too small to rank the models reliably. Both tree-based models also reach a training accuracy of 1.0 against 0.93 on the test set, which suggests some overfitting. Overall, all five models classify the Iris dataset well.
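Because a single 30-sample test set separates the models by at most one prediction, cross-validation gives a steadier comparison. The following is a sketch, not the notebook's method: it uses scikit-learn's bundled Iris (`load_iris`) as a stand-in for the cleaned data, compares only two of the five models, and assumes 5 stratified folds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data, not the notebook's cleaned DataFrame
X_demo, y_demo = load_iris(return_X_y=True)

# Five stratified folds, shuffled with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "SVM": SVC(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

cv_results = {}
for name, model in models.items():
    # One accuracy score per fold; the spread shows how much a split matters
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the difference between two mean scores is smaller than the fold-to-fold standard deviation, the ranking of the models should be treated as inconclusive.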



---

