# Machine Learning and Statistics: Project

### Classification Algorithms on the Iris Flower Dataset

Author: Daria Sep

### Introduction
***
The objective of this project is to explore the use of classification algorithms in supervised learning, focusing on the renowned iris flower dataset introduced by Ronald A. Fisher.

The notebook will start with an introduction to supervised learning, explaining its core concepts and significance, followed by a detailed look at classification algorithms. It will then demonstrate the implementation of one common classification algorithm using the scikit-learn Python library. Emphasis will be placed on enhancing understanding through appropriate plots, mathematical notation, and diagrams, ensuring a blend of theoretical knowledge and practical application.

### Supervised Learning
***

#### Overview of Supervised Learning

Supervised learning falls within the domain of artificial intelligence and machine learning. It is characterised by the use of labelled datasets to train algorithms to classify data or make accurate predictions (IBM n.d.). The goal of supervised learning is for the algorithm to learn a mapping function that can predict the labels for new, unseen data (Brownlee 2023).

In supervised learning, we typically have the following components:
- **Features** (also referred to as "X variables"): These are the input variables or attributes that describe the data instances. For example, in the Iris flower dataset, the features are sepal length, sepal width, petal length, and petal width.
- **Labels** (typically referred to as "target variables" or "y variables"): These are the output variables or categories that we want to predict. In the Iris dataset, the labels correspond to the species of iris flowers (e.g., setosa, versicolor, virginica).
- **Training Data**: This is the labelled dataset that we use to train the machine learning model. It consists of input features paired with the correct output labels (Ali 2022).
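
The three components above can be illustrated with a short sketch. It assumes the copy of the Iris data bundled with scikit-learn (`load_iris`) rather than the CSV file used later in this notebook, so the variable names here are illustrative only.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the Iris data stands in for the project CSV
iris = load_iris()

X = iris.data    # features: one row per flower, one column per measurement
y = iris.target  # labels: integer codes, one per flower

print(iris.feature_names)  # names of the four input features
print(iris.target_names)   # species names behind the integer label codes
print(X.shape, y.shape)    # number of instances and features

# One training example: the input features paired with their correct label
print(X[0], iris.target_names[y[0]])
```

Together, `X` and `y` form the labelled training data: each row of `X` is paired with the entry of `y` at the same position.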

In supervised learning, the algorithm is taught by example. An operator provides the machine learning algorithm with a well-defined dataset containing specified inputs and the corresponding desired outputs. The algorithm's task is to discern the underlying patterns that relate those inputs to the outputs. While the operator knows the correct answers to the problem, the algorithm identifies patterns in the data, learns from observations and makes predictions, which are then reviewed and corrected by the operator. This iterative process continues until the algorithm achieves a high level of accuracy and performance (Wakefield n.d.).

Supervised learning is divided into two main types: **classification** and **regression**. Classification is the task of predicting or identifying which category (or categories) a data point belongs to. In classification, the output variables are always discrete values, meaning they can be placed into clear categories or classes.

Unlike classification, which places data into discrete categories, regression uses input variables to predict continuous, real-valued quantities, e.g. time-series data, sales figures, salaries, scores, heights and weights (Hillier 2022).

In this project, we will focus on classification, where the goal is to assign each data instance to one of several predefined classes or categories.
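
The difference between the two task types can be sketched on the same data. This is a minimal illustration, assuming scikit-learn's bundled Iris data; using petal width as a regression target is a choice made here for demonstration only and is not part of the project.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

X, y = load_iris(return_X_y=True)

# Classification: predict the species (a discrete class code) from all four measurements
clf = LogisticRegression(max_iter=1000).fit(X, y)
class_pred = clf.predict(X[:3])

# Regression (illustrative assumption): predict petal width, a continuous
# quantity in cm, from the other three measurements
X_reg, y_reg = X[:, :3], X[:, 3]
reg = LinearRegression().fit(X_reg, y_reg)
value_pred = reg.predict(X_reg[:3])

print("Predicted classes:", class_pred)  # integer class codes
print("Predicted values:", value_pred)   # real-valued estimates
```

The classifier can only ever return one of the three class codes, whereas the regressor returns real numbers that need not match any value seen in training.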

<figure align="center">
    <img src="https://media.geeksforgeeks.org/wp-content/cdn-uploads/20190522174744/MachineLearning.png" width="500" title="Types of Machine Learning"/>
    <figcaption> Types of Machine Learning (Geeksforgeeks n.d.)</figcaption>
</figure>


#### Classification Algorithms in Supervised Learning

As previously noted, classification is a type of supervised machine learning technique in which the goal is to accurately predict the label for a given input. The process involves training a model on labelled examples to learn patterns between input features and output classes, evaluating it on test data, and then using it to make predictions on new, unseen data (Keita 2022).
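
The train, evaluate and predict sequence described above might look as follows in scikit-learn. This is a sketch under stated assumptions: the choice of k-nearest neighbours, the 25% test split, the random seed and the measurements of the new flower are all illustrative and not taken from the project.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Hold back a quarter of the labelled data for evaluation (assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, stratify=iris.target, random_state=42
)

# 1. Training: learn from the labelled examples
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# 2. Evaluation: compare predictions with the known labels of the test set
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")

# 3. Prediction: classify a new, unseen flower (hypothetical measurements in cm)
new_flower = [[5.0, 3.4, 1.5, 0.2]]
predicted_species = iris.target_names[model.predict(new_flower)[0]]
print("Predicted species:", predicted_species)
```

Stratifying the split keeps the three species equally represented in both the training and the test portions.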

Examples of Classification Algorithms include:
- **Logistic Regression**: Logistic regression models the probability of a binary outcome based on input features, using a logistic function to transform linear combinations of inputs into probabilities. It's commonly used for binary classification tasks, such as spam detection or disease diagnosis, and can be extended to problems with more than two classes.
- **Support Vector Machine (SVM)**: SVM is a powerful classification method that works by finding the hyperplane that best separates different classes in the feature space. It's effective in high-dimensional spaces and versatile enough to handle linear and non-linear relationships. The algorithm is often used in image classification and bioinformatics.
- **Random Forest**: Random forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It is commonly used in stock market analysis and e-commerce.
- **Decision Tree**: A decision tree is a flowchart-like structure where each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label. It's simple to understand and interpret, but can be prone to overfitting. Decision trees can be used in customer segmentation and quality control.
- **K-Nearest Neighbors (KNN)**: KNN is a simple, non-parametric algorithm that classifies a data point based on how its neighbors are classified, typically using a majority vote of its k nearest neighbors. Its uses include recommendation systems and real estate valuations.
- **Naive Bayes**: Naive Bayes is a probabilistic classifier based on applying Bayes' theorem with the assumption of independence between the features. It is particularly suited for high-dimensional data and is known for its simplicity and speed in handling large datasets. It is often used for text classification.
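
Each of the algorithms listed above has a counterpart in scikit-learn. The sketch below fits all six to the bundled Iris data and reports a 5-fold cross-validated accuracy. The hyperparameters are library defaults apart from `max_iter` and the random seeds, and `GaussianNB` is assumed here as the Naive Bayes variant because the features are continuous. It is a rough side-by-side illustration, not a tuned benchmark.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One estimator per algorithm in the list above (mostly default settings)
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# Mean accuracy over 5 cross-validation folds for each classifier
mean_scores = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name:<24} {mean_scores[name]:.3f}")
```

All of these estimators share the same `fit` and `predict` interface, which is why they can be swapped in and out of the same loop.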




### Exploration of a Specific Classification Algorithm [NAME]
***

### Dataset Exploration and Preprocessing
***

#### Historical Background 

The Iris Dataset, also known as Fisher's Iris Dataset, is a multivariate dataset introduced by Sir Ronald Aylmer Fisher in 1936.

<figure align="center">
    <img src="https://upload.wikimedia.org/wikipedia/commons/a/aa/Youngronaldfisher2.JPG" height="300" title="Ronald Aylmer Fisher"/>
    <figcaption> R. A. Fisher (Wikipedia 2023)</figcaption>
</figure>


This dataset is also known as Anderson's Iris dataset, named after Edgar Anderson, who collected the data to quantify the variation among Iris flowers of three different species.

Originally, the dataset served as an example of linear discriminant analysis. However, over time, it gained popularity as a benchmark for evaluating statistical classification methods in machine learning. Today, the Iris Dataset is widely used as an introductory dataset for machine learning (Chauhan 2021).

#### Dataset Overview

The information included in the dataset is as follows:

1. Sepal length in cm
2. Sepal width in cm
3. Petal length in cm
4. Petal width in cm
5. Class:
    - Iris Setosa
    - Iris Versicolor
    - Iris Virginica

<figure align="center">
    <img src="images/iris.png" width="500" title="Iris Species"/>
    <figcaption> Iris Species (Chauhan 2021)</figcaption>
</figure>


#### Dataset Visualisation

##### Imports

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy.stats as ss 

##### Data

In [7]:
iris_df = pd.read_csv('csv/iris.csv')

iris_df.head()

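| Unnamed: 0 | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |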
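
The `Unnamed: 0` column in the output above suggests that the CSV file stores a row index as its first, unnamed column. One way to handle this, sketched below, is to pass `index_col=0` to `pd.read_csv`. To keep the example self-contained it reads an in-memory copy of the five rows shown above rather than `csv/iris.csv`, and the exact layout of the file's header line is an assumption.

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for the first rows of csv/iris.csv (header layout assumed)
csv_text = """,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
"""

# index_col=0 uses the unnamed first column as the row index instead of as data
iris_sample = pd.read_csv(StringIO(csv_text), index_col=0)

print(iris_sample.columns.tolist())           # only the four features and the class remain
print(iris_sample.dtypes)                     # numeric features, text class column
print(iris_sample["class"].value_counts())    # number of rows per species
```

For the real file the equivalent call would be `pd.read_csv('csv/iris.csv', index_col=0)`, assuming the file has the same layout as the sample.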


#### Data preprocessing

### Implementation Using scikit-learn
***

#### Code Implementation

#### Explanation of code and parameters

### Model Evaluation
***

### Conclusion
***

### References
***


Ali M. (2022). *Supervised Machine Learning.* Available online at <https://www.datacamp.com/blog/supervised-machine-learning>

Brownlee J. (2022). *Supervised and Unsupervised Machine Learning Algorithms.* Available online at <https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/>

Chauhan G. (2021). *Iris Dataset Project from UCI Machine Learning Repository.* Available online at <https://machinelearninghd.com/iris-dataset-uci-machine-learning-repository-project/>

Geeksforgeeks (n.d.). *Top 10 Algorithms every Machine Learning Engineer should know.* Available online at <https://www.geeksforgeeks.org/top-10-algorithms-every-machine-learning-engineer-should-know/>

Geeksforgeeks (n.d.). *Types of Machine Learning.* Available online at <https://www.geeksforgeeks.org/types-of-machine-learning/>

Hillier W. (2022). *What Is the Difference Between Regression and Classification?* Available online at <https://careerfoundry.com/en/blog/data-analytics/regression-vs-classification/>

IBM (n.d.). *What is supervised learning?* Available online at <https://www.ibm.com/topics/supervised-learning>

Keita Z. (2022). *Classification in Machine Learning: An Introduction.* Available online at <https://www.datacamp.com/blog/classification-machine-learning>

Maglogiannis I. G. (Ed.) (2007). *Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in EHealth, HCI, Information Retrieval and Pervasive Technologies.* Amsterdam: IOS Press.

Wakefield K. (n.d.). *A guide to the types of machine learning algorithms and their application.* Available online at <https://www.sas.com/en_ie/insights/articles/analytics/machine-learning-algorithms.html>

Wikipedia (2023). *Ronald Fisher.* Available online at <https://en.wikipedia.org/wiki/Ronald_Fisher>

***
### End