# Machine Learning and Statistics: Project

### Classification Algorithms on the Iris Flower Dataset

Author: Daria Sep

### Introduction
***
The objective of this project is to explore the use of classification algorithms in supervised learning, focusing on the renowned iris flower dataset introduced by Ronald A. Fisher.

The notebook will start with an introduction to supervised learning, explaining its core concepts and significance, followed by a detailed look at classification algorithms. It will then demonstrate the implementation of the Support Vector Machine (SVM) Algorithm using the scikit-learn Python library. Emphasis will be placed on enhancing understanding through appropriate plots, mathematical notation, and diagrams, ensuring a blend of theoretical knowledge and practical application.

### Supervised Learning
***

#### Overview of Supervised Learning

Supervised learning falls within the domain of artificial intelligence and machine learning. It is characterised by the use of labeled datasets to train algorithms to classify data or make accurate predictions (IBM n.d.). The goal of supervised learning is for the algorithm to learn a mapping function that can predict the labels for new, unseen data (Brownlee 2023).

In supervised learning, we typically have the following components:
- **Features** (also referred to as "X variables"): These are the input variables or attributes that describe the data instances. For example, in the Iris flower dataset, the features are sepal length, sepal width, petal length, and petal width.
- **Labels** (typically referred to as "target variables" or "y variables"): These are the output variables or categories that we want to predict. In the Iris dataset, the labels correspond to the species of iris flowers (e.g., setosa, versicolor, virginica).
- **Training Data**: This is the labelled dataset that we use to train the machine learning model. It consists of input features paired with the correct output labels (Ali 2022).
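As a concrete illustration of these components, the minimal sketch below loads the copy of the Iris data that ships with scikit-learn and separates the features from the labels. Using `load_iris` here is an assumption made for self-containment; the notebook itself reads the data from a local CSV file later on.

```python
from sklearn.datasets import load_iris

# Load scikit-learn's bundled copy of the Iris data
# (assumed to be equivalent to the CSV file used later in this notebook)
iris = load_iris()

X = iris.data      # features: one row per flower, one column per measurement
y = iris.target    # labels: integer-coded species

print(iris.feature_names)   # names of the four input features
print(iris.target_names)    # names of the three classes
print(X.shape, y.shape)     # number of rows and columns
```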

In supervised learning, the algorithm is taught by example. An operator provides the machine learning algorithm with a well-defined dataset containing specified inputs and the corresponding desired outputs. The algorithm's task is to discern the underlying patterns that link those inputs to the outputs. The operator knows the correct answers; the algorithm identifies patterns in the data, learns from observations, and makes predictions, which the operator then reviews and corrects. This iterative process continues until the algorithm achieves a high level of accuracy and performance (Wakefield n.d.).

The effectiveness of supervised learning is underscored by its wide applicability and the remarkable advancements it has driven in various fields. From enhancing customer experiences through personalized recommendations to enabling autonomous vehicles, supervised learning algorithms have become indispensable tools in extracting meaningful patterns and insights from complex datasets. Their significance is further magnified by their versatility in handling different types of data and tasks, making them a foundational pillar in the ever-evolving landscape of machine learning and artificial intelligence.

Supervised learning is divided into two main types: **classification** and **regression**. Classification is the task of predicting or identifying which category (or categories) a data point belongs to. In classification, output variables are always discrete values, meaning they can be placed into clear categories or classes.

Unlike classification, which places data into discrete categories, regression problems use input variables to predict continuous, real-valued quantities, e.g., time-series data, sales figures, salaries, scores, heights, and weights (Hillier 2022).
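The difference between the two task types can be seen in the kind of value a model returns. The sketch below is only an illustration: it uses scikit-learn's bundled Iris data, and the choice of `LogisticRegression` and `LinearRegression`, as well as predicting petal width from the other three measurements, are assumptions made for the example rather than part of this project.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

X, y = load_iris(return_X_y=True)

# Classification: predict the species (a discrete class label) from all four features
clf = LogisticRegression(max_iter=1000).fit(X, y)
class_pred = clf.predict(X[:3])

# Regression: predict petal width (a continuous quantity) from the other three features
reg = LinearRegression().fit(X[:, :3], X[:, 3])
value_pred = reg.predict(X[:3, :3])

print(class_pred)   # integer class labels
print(value_pred)   # real-valued estimates in cm
```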

In this project, we will focus on classification, where the goal is to assign each data instance to one of several predefined classes or categories.


<img src="./images/types.png" width="500" title="Types of Machine Learning"/>

*Figure 1: Types of Machine Learning (Geeksforgeeks n.d.)*


#### Classification Algorithms in Supervised Learning

As previously noted, classification is a type of supervised machine learning technique in which the goal is to accurately predict the label for a given input. The process involves training a model on labeled examples to learn patterns between input features and output classes, followed by an evaluation phase using test data, before being used to perform prediction on new, unseen data (Keita 2022).

Examples of Classification Algorithms include:
- **Logistic Regression**: Logistic regression models the probability of a binary outcome based on input features, using a logistic function to transform linear combinations of inputs into probabilities. It's commonly used for binary classification tasks, such as spam detection or disease diagnosis.

- **Support Vector Machine (SVM)**: SVM is a powerful classification method that works by finding the hyperplane that best separates different classes in the feature space. It's effective in high-dimensional spaces and versatile enough to handle linear and non-linear relationships. The algorithm is often used in image classification and bioinformatics.

- **Random Forest**: Random forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It is commonly used in stock market analysis and e-commerce.

- **Decision Tree**: A decision tree is a flowchart-like structure where each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label. It's simple to understand and interpret, but can be prone to overfitting. Decision trees can be used in customer segmentation and quality control.

- **K-Nearest Neighbors (KNN)**: KNN is a simple, non-parametric algorithm that classifies a data point based on how its neighbors are classified, typically using a majority vote of its k nearest neighbors. Its uses include recommendation systems and real estate valuations.

- **Naive Bayes**: Naive Bayes is a probabilistic classifier based on applying Bayes' theorem with the assumption of independence between the features. It is particularly suited for high-dimensional data and is known for its simplicity and speed in handling large datasets. It is often used for text classification.
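All of the algorithms listed above are available in scikit-learn behind the same `fit`/`predict` interface. The sketch below compares them on scikit-learn's bundled Iris data using 5-fold cross-validation. The hyperparameters (`max_iter`, `n_estimators`, `n_neighbors`, `random_state`) and the Gaussian variant of Naive Bayes are assumptions chosen for illustration, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One instance of each algorithm from the list above (illustrative settings)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes (Gaussian)": GaussianNB(),
}

# Mean accuracy over 5 cross-validation folds for each model
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

On a dataset this small, differences between the scores are within a few samples and should not be read as a ranking of the algorithms.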




### Exploration of a Support Vector Machine Algorithm
***

#### Overview

Support Vector Machines (SVM) are a set of supervised learning methods used for classification, regression, and outlier detection. They are particularly well-suited for complex but small- or medium-sized datasets. 

The main objective of SVM is to find the hyperplane that best separates different classes in the feature space by performing optimal data transformations that determine boundaries between data points based on predefined classes, labels, or outputs. In two dimensions, this hyperplane is a line. 

SVMs are widely used in areas such as image classification, text categorization, bioinformatics, and hand-written character recognition (Kanade 2022, Geeksforgeeks n.d.).

#### Linear vs Non-Linear SVM

Support vector machines are broadly classified into two types: simple or linear SVM and kernel or non-linear SVM.

A linear SVM refers to the SVM type used for classifying linearly separable data, meaning that the data points of different classes can be divided by a single straight line. 

In Linear SVM, **hyperplanes** act as decision-making boundaries, classifying data points based on which side of the hyperplane they fall on. The nature of the hyperplane is determined by the feature count; with two input features, it's a line, and with three, it becomes a two-dimensional plane. 

A linear SVM is typically used to address classification and regression analysis problems (Kanade 2022).

<img src="./images/plane.png" width="500" title="Hyperplanes"/>

*Figure 2: 1D and 2D Hyperplanes (Jana 2020)* 

**Support vectors**, the data points nearest to the hyperplane, significantly impact the hyperplane's placement and direction. By leveraging these support vectors, the classifier's margin is maximized for optimal separation (Gandhi 2008).
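To make the hyperplane and its support vectors tangible, the sketch below fits a linear SVM to a two-class, two-feature subset of scikit-learn's bundled Iris data and reads the fitted hyperplane back from the model. The choice of subset (setosa vs versicolor, petal measurements only) is an assumption made so that the hyperplane is a line.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Keep two classes (setosa = 0, versicolor = 1) and two features (petal length, petal width)
mask = y < 2
X2, y2 = X[mask][:, 2:4], y[mask]

model = SVC(kernel="linear").fit(X2, y2)

w = model.coef_[0]        # weight vector, perpendicular to the hyperplane
b = model.intercept_[0]   # offset of the hyperplane

# A point is assigned to a class according to the side of the hyperplane it falls on
side = (X2 @ w + b > 0).astype(int)

print("w =", w, "b =", b)
print("support vectors:\n", model.support_vectors_)
```

Only the points listed in `support_vectors_` determine the fitted line; removing any of the other training points would leave it unchanged.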

When the data cannot be separated into distinct categories by a straight line, a non-linear SVM uses the *kernel trick*. This involves mapping the input space into a higher-dimensional space where a linear separator can be found.

Kernel SVMs are typically used to handle optimization problems that have multiple variables.

<img src="./images/nonl.png" width="500" title="Linear vs Non-Linear SVM"/>

*Figure 3: Linear vs Non-Linear SVM (Girgin 2019)* 
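The effect of the kernel can be reproduced on synthetic data. In the sketch below, `make_circles` generates one class as a ring around the other, so no straight line separates them. The dataset parameters and the choice of the RBF kernel are assumptions made for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Same data, two kernels: a straight-line boundary vs a non-linear one
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"linear kernel training accuracy: {linear_acc:.2f}")
print(f"RBF kernel training accuracy:    {rbf_acc:.2f}")
```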




#### Maximizing the Margin


In the realm of SVM classification, the maximization of the margin is a fundamental concept. This margin represents the distance between the defining hyperplane and the nearest data points from each class, known as the support vectors. The primary objective of an SVM classifier is to locate a hyperplane that not only accurately separates the different classes but also maximizes this margin. Essentially, a larger margin equates to a more robust model with better generalization capabilities. The optimal hyperplane, therefore, is the one that achieves this maximum margin, effectively increasing the classifier's ability to distinguish between classes. 

Mathematically, the margin is defined as $\frac{2}{\|w\|}$, where $w$ is the weight vector perpendicular to the hyperplane. The optimization problem in SVM focuses on minimizing $\frac{1}{2}\|w\|^2$, subject to the constraint that each data point $x_i$ with label $y_i \in \{-1, +1\}$ satisfies:

$y_i(w \cdot x_i + b) \geq 1$

Here $b$ is the bias (offset) term of the hyperplane. This constraint ensures that every data point is correctly classified with at least a unit margin. Because the margin is $\frac{2}{\|w\|}$, minimizing $\frac{1}{2}\|w\|^2$ is equivalent to maximizing the margin. Maximizing the margin is crucial as it leads to a more distinct separation between classes, contributing to the effectiveness and accuracy of the SVM model.


<img src="./images/margin.jpg" width="500" title="Maximizing the Margin"/>

*Figure 4: Maximizing the Margin (Packt n.d.)* 
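The margin width and the constraint can be checked numerically on a linearly separable subset. The sketch below is a rough check under stated assumptions: it uses setosa vs versicolor from scikit-learn's bundled Iris data with the two petal features, relabels the classes as -1 and +1, and sets a large `C` (an assumed value) so that scikit-learn's soft-margin `SVC` approximates the hard-margin problem.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Setosa vs versicolor on the two petal features: a linearly separable subset
mask = y < 2
X2 = X[mask][:, 2:4]
y2 = np.where(y[mask] == 0, -1, 1)   # relabel as -1 / +1 to match the constraint

# A large C (assumed value) approximates the hard-margin problem
model = SVC(kernel="linear", C=1000.0).fit(X2, y2)

w, b = model.coef_[0], model.intercept_[0]
margin = 2 / np.linalg.norm(w)        # margin width 2 / ||w||
functional = y2 * (X2 @ w + b)        # y_i (w . x_i + b) for every point

print(f"margin width: {margin:.3f}")
print(f"smallest y_i(w.x_i + b): {functional.min():.3f}")
```

The smallest value of $y_i(w \cdot x_i + b)$ should be close to 1 (up to the solver's tolerance), and the points that attain it are the support vectors.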



#### Mathematical Formulation

#### Handling Non-Linear Data - Kernel Trick

### Dataset Overview and Historical Background
***

The Iris Dataset, also known as Fisher's Iris Dataset, is a multivariate dataset introduced by Sir Ronald Aylmer Fisher in 1936.

<img src="./images/fisher.jpeg" height="300" title="Ronald Aylmer Fisher"/>

*Figure 5: R. A. Fisher (Wikipedia 2023)*


This dataset is also known as Anderson's Iris dataset, named after Edgar Anderson who collected the data to quantify the variations among Iris flowers of three different classes.

The information included in the dataset is as follows:

1. Sepal length in cm
2. Sepal width in cm
3. Petal length in cm
4. Petal width in cm
5. Class: Iris Setosa, Iris Versicolour, Iris Virginica
    
<img src="./images/iris.png" width="500" title="Iris Species"/>

*Figure 6: Iris Species (Chauhan 2021)*

Originally, the dataset served as an example of linear discriminant analysis. However, over time, it gained popularity as a benchmark for evaluating statistical classification methods in machine learning. Today, the Iris Dataset is widely used as an introductory dataset for those starting in machine learning (Chauhan 2021).

### Data Preprocessing 

***

#### Overview

Data quality is crucial in data science and machine learning, as it directly influences machine learning model performance. Data preprocessing, which includes preparing and cleaning data for analysis and modeling, is a critical, not just preliminary, step in these fields (Buhl 2023). 

Data preprocessing is a detailed process, specific to each dataset and analysis requirement, essential for ensuring data quality and reliability. Effective preprocessing markedly improves machine learning model performance and the quality of analytical insights.

Data preprocessing may involve all or some of the following:

1. Data Cleaning: handling missing values, removing duplicates, filtering outliers.
2. Data Transformation: normalization/standardization, encoding categorical variables.
3. Feature Engineering: feature selection and extraction.
4. Data Splitting: training and testing split.
5. Data Balancing: handling imbalanced data.
6. Data Visualisation: exploratory data analysis.

(Gryczka 2023)
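Several of these steps can be sketched on a small table. The data below is a hypothetical toy example (the column names and values are invented for illustration), and filling missing values with the median is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data with one missing value and one duplicated row
raw = pd.DataFrame({
    "length": [5.1, 4.9, np.nan, 6.3, 6.3, 5.8, 7.1, 5.0],
    "width":  [3.5, 3.0, 3.2, 3.3, 3.3, 2.7, 3.0, 3.4],
    "label":  ["a", "a", "a", "b", "b", "b", "b", "a"],
})

# Data cleaning: drop duplicated rows, fill missing values with the column median
clean = raw.drop_duplicates().copy()
clean["length"] = clean["length"].fillna(clean["length"].median())

# Data transformation: encode the categorical label as integers
clean["label"] = LabelEncoder().fit_transform(clean["label"])

# Data splitting: hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    clean[["length", "width"]], clean["label"], test_size=0.25, random_state=0
)

# Data transformation: standardise using statistics from the training split only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(clean)
print(X_train_scaled.shape, X_test_scaled.shape)
```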

### Iris Dataset Preprocessing

***

##### Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy.stats as ss 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

##### Loading and Inspecting the Data

In [2]:
iris_df = pd.read_csv('csv/iris.csv')

iris_df.head()

   sepal_length  sepal_width  petal_length  petal_width   class
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [3]:
# Displaying general info, checking datatypes and missing values
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


- The Iris Dataset contains 150 entries without any missing values.
- The dataset has four numerical features: sepal_length, sepal_width, petal_length, and petal_width, and one categorical column, class, which holds the species of the iris flower and serves as the target variable.
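Two further checks are often made at this stage: class balance and duplicated rows. The sketch below runs them on scikit-learn's bundled copy of the data, used here as a stand-in for `iris_df` (an assumption, since the local CSV is not reproduced here and its column names differ).

```python
from sklearn.datasets import load_iris

# scikit-learn's bundled Iris data as a DataFrame (stand-in for iris_df)
frame = load_iris(as_frame=True).frame

class_counts = frame["target"].value_counts()   # samples per class
n_missing = int(frame.isna().sum().sum())       # total missing values
n_duplicates = int(frame.duplicated().sum())    # fully duplicated rows

print(class_counts)
print("missing:", n_missing, "duplicates:", n_duplicates)
```

The classes are perfectly balanced (50 samples each), so no rebalancing step is needed for this dataset.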

##### Encoding

In [4]:
# Encoding the 'class' column
label_encoder = LabelEncoder()
iris_df['class'] = label_encoder.fit_transform(iris_df['class'])

# Separating the features and the target variable
X = iris_df.drop('class', axis=1)
y = iris_df['class']
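`LabelEncoder` assigns integer codes to the class names in sorted order, which is why setosa becomes 0, versicolor 1, and virginica 2 in the results further down. A minimal standalone sketch with a hypothetical list of species names:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical input, deliberately not in alphabetical order
species = ["versicolor", "setosa", "virginica", "setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

print(encoder.classes_)                   # class names in sorted order
print(codes)                              # integer code for each input
print(encoder.inverse_transform(codes))   # back to the original names
```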

##### Splitting

In [5]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
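A plain random split does not guarantee equal class proportions in the test set; the classification report further down shows 10, 9, and 11 test samples per class. If equal proportions are wanted, `train_test_split` accepts a `stratify` argument. The sketch below is an optional variant on scikit-learn's bundled data, not the split used in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split sizes as above, but stratified on the labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

test_counts = np.bincount(y_test)   # number of test samples per class
print(test_counts)
```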

##### Scaling

In [6]:

# Scaling the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Displaying the first few rows of the scaled training data and the training labels
X_train_scaled[:5], y_train[:5]

(array([[-1.47393679,  1.22037928, -1.5639872 , -1.30948358],
        [-0.13307079,  3.02001693, -1.27728011, -1.04292204],
        [ 1.08589829,  0.09560575,  0.38562104,  0.28988568],
        [-1.23014297,  0.77046987, -1.21993869, -1.30948358],
        [-1.7177306 ,  0.32056046, -1.39196294, -1.30948358]]),
 22    0
 15    0
 65    1
 11    0
 42    0
 Name: class, dtype: int32)
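The scaler is fitted on the training data only and then reused on the test data, so that no information from the test set influences the transformation. The sketch below verifies the effect on scikit-learn's bundled copy of the data (an assumption; the notebook's own `X` also contains only the four measurement columns).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std on training data, then scale it
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics on the test data

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)

print(train_means.round(3), train_stds.round(3))   # zero mean, unit standard deviation
print(X_test_scaled.mean(axis=0).round(3))         # close to, but not exactly, zero
```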

###  Implementation Using scikit-learn
***

#### Code Implementation

In [11]:
svm_model = SVC(kernel='linear')  # Starting with a linear kernel
svm_model.fit(X_train_scaled, y_train)

# Predicting using the test set
y_pred = svm_model.predict(X_test_scaled)

# Evaluating the model
classification_rep = classification_report(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)

# Formatting the output
formatted_classification_rep = f"Classification Report:\n{classification_rep}"
formatted_confusion_matrix = f"Confusion Matrix:\n{confusion_mat}"

# Combined formatted output
formatted_output = f"{formatted_classification_rep}\n\n{formatted_confusion_matrix}"
print(formatted_output)

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.89      0.94         9
           2       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30


Confusion Matrix:
[[10  0  0]
 [ 0  8  1]
 [ 0  0 11]]


#### Code Explanation


The code above demonstrates the process of training, predicting, and evaluating a Support Vector Machine (SVM) model using the scikit-learn library in Python. The following is a step-by-step explanation of the code:

1. Model Initialization: An SVM classifier is initialised with a linear kernel using `SVC(kernel='linear')`. The kernel parameter specifies the type of hyperplane used to separate the data. In this case, 'linear' indicates a linear boundary between the classes.

2. Model Training: The `fit()` method trains the SVM model on the scaled training data (`X_train_scaled`) and the corresponding labels (`y_train`). The model learns to classify data points based on this training data.

3. Making Predictions: After training, the model uses the scaled test data (`X_test_scaled`) to make predictions. The output `y_pred` is the model's prediction of the class labels for the test set.

4. Evaluating the Model: The `classification_report()` function generates a report on several key metrics (precision, recall, F1-score) for evaluating the performance of the classification model, and `confusion_matrix()` tabulates the true classes against the predicted classes.

5. Formatting the Output: The classification report and confusion matrix outputs are formatted into more readable strings, prefixed with titles for clarity.

6. Displaying the Results: The final formatted output, combining both the classification report and the confusion matrix, is concatenated into a single string and printed. This provides a comprehensive view of the model's performance.



#### Results Interpretation

The model shows excellent performance, particularly for the Setosa class (label 0), which it classified perfectly. The Versicolor and Virginica classes (labels 1 and 2) also show high accuracy; the only error is one Versicolor sample predicted as Virginica.

The overall accuracy of 97% (29 of 30 test samples) indicates that the linear kernel SVM is highly effective for this dataset.
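The figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below copies the matrix printed above and derives accuracy, recall, and precision from it.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class)
cm = np.array([[10, 0, 0],
               [0, 8, 1],
               [0, 0, 11]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # per class: correct / actual members
precision = np.diag(cm) / cm.sum(axis=0)   # per class: correct / predicted members

print(f"accuracy:  {accuracy:.3f}")
print("recall:   ", recall.round(2))
print("precision:", precision.round(2))
```

The recall of 8/9 for class 1 and the precision of 11/12 for class 2 both come from the single off-diagonal entry.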

#### Model Evaluation

This SVM model demonstrates robust performance in classifying the species in the Iris dataset. The results highlight the effectiveness of SVM for this type of classification task, especially with a well-preprocessed dataset. The high accuracy and precision also suggest that a linear separation is quite effective for this particular dataset.
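A single 30-sample test set gives a coarse estimate, since each sample is worth about 3.3 percentage points. One way to get a steadier picture is cross-validation. The sketch below is an optional extension on scikit-learn's bundled data: the 5-fold setting, the use of a pipeline, and the comparison with an RBF kernel are assumptions added for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

cv_scores = {}
for kernel in ("linear", "rbf"):
    # The pipeline refits the scaler inside each fold,
    # so no statistics from the held-out fold leak into training
    pipeline = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    cv_scores[kernel] = cross_val_score(pipeline, X, y, cv=5)
    print(f"{kernel}: mean={cv_scores[kernel].mean():.3f}, std={cv_scores[kernel].std():.3f}")
```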

### Conclusion
***

### References
***

Ali M. (2022). *Supervised Machine Learning.* Available online at <https://www.datacamp.com/blog/supervised-machine-learning>

Brownlee J. (2022). *Supervised and Unsupervised Machine Learning Algorithms.* Available online at <https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/>

Buhl N. (2023). *Mastering Data Cleaning & Data Preprocessing.* Available online at <https://encord.com/blog/data-cleaning-data-preprocessing>

Chakure A. (2022). *How to Preprocess Data in Python.* Available online at <https://builtin.com/machine-learning/how-to-preprocess-data-python>

Chauhan G. (2021). *Iris Dataset Project from UCI Machine Learning Repository.* Available online at <https://machinelearninghd.com/iris-dataset-uci-machine-learning-repository-project/>

Cristianini N. & Shawe-Taylor J. (2000). *An Introduction to Support Vector Machines.* Cambridge University Press.

Gandhi R. (2008). *Support Vector Machine — Introduction to Machine Learning Algorithms.* Available online at <https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47>

Geeksforgeeks (n.d.). *Support Vector Machine (SVM) Algorithm.* Available online at <https://www.geeksforgeeks.org/support-vector-machine-algorithm/>

Geeksforgeeks (n.d.). *Top 10 Algorithms every Machine Learning Engineer should know.* Available online at <https://www.geeksforgeeks.org/top-10-algorithms-every-machine-learning-engineer-should-know/>

Geeksforgeeks (n.d.). *Types of Machine Learning.* Available online at <https://www.geeksforgeeks.org/types-of-machine-learning/>

Girgin S. (2019). *Day-12: Kernel SVM (Non-Linear SVM).* Available online at <https://medium.com/pursuitnotes/day-12-kernel-svm-non-linear-svm-5fdefe77836c>

Gryczka K. (2023). *Data preprocessing: a comprehensive step-by-step guide.* Available online at <https://www.future-processing.com/blog/data-preprocessing-a-comprehensive-step-by-step-guide/>

Hillier W. (2022). *What Is the Difference Between Regression and Classification?* Available online at <https://careerfoundry.com/en/blog/data-analytics/regression-vs-classification/>

Jana A. (2020). *Support Vector Machines for Beginners – Linear SVM.* Available online at <https://www.adeveloperdiary.com/data-science/machine-learning/support-vector-machines-for-beginners-linear-svm/>

IBM (n.d.). *What is supervised learning?* Available online at <https://www.ibm.com/topics/supervised-learning>

Kanade V. (2022). *What Is a Support Vector Machine? Working, Types, and Examples.* Available online at <https://www.spiceworks.com/tech/big-data/articles/what-is-support-vector-machine/>

Keita Z. (2022). *Classification in Machine Learning: An Introduction.* Available online at <https://www.datacamp.com/blog/classification-machine-learning>

Packt (n.d.). *SVM: Maximize the Margin.* Available online at <https://static.packt-cdn.com/products/9781783555130/graphics/3547_03_07.jpg>

scikit-learn (n.d.). *1.4. Support Vector Machines.* Available online at <https://scikit-learn.org/stable/modules/svm.html>

scikit-learn (n.d.). *sklearn.svm.SVC.* Available online at <https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html>

VanderPlas J. (2016). *Python Data Science Handbook: Essential Tools for Working with Data.* O'Reilly Media, Inc. Available online at <https://jakevdp.github.io/PythonDataScienceHandbook/>

Wakefield K. (n.d.). *A guide to the types of machine learning algorithms and their application.* Available online at <https://www.sas.com/en_ie/insights/articles/analytics/machine-learning-algorithms.html>

Wikipedia (2023). *Ronald Fisher.* Available online at <https://en.wikipedia.org/wiki/Ronald_Fisher>

***
### End