# Iris dataset - Machine Learning and Statistics, Winter 23/24

## Author: David Higgins - G00411302, Atlantic Technological University

## Introduction

The Fisher Iris dataset, introduced by Sir Ronald A. Fisher in 1936, is one of the best-known datasets in statistics and machine learning. It contains 150 iris flower samples and was introduced to explore whether quantitative measurements, namely sepal and petal dimensions, can be used to classify iris flowers accurately into three species: Iris setosa, Iris versicolor, and Iris virginica.

At its core, the dataset is simple. Each iris sample is described by four features (sepal length, sepal width, petal length, and petal width), each recorded in centimeters. These features are the inputs for supervised learning, where the task is to build classification models that distinguish the three iris species from their measurements.
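
The structure described above can be checked directly. The following is a minimal sketch, assuming the copy of the dataset bundled with scikit-learn; the variable names (`n_samples`, `samples_per_species`) are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the copy of the dataset that ships with scikit-learn.
iris = load_iris()
X, y = iris.data, iris.target

# Rows are flower samples, columns are the four measurements.
n_samples, n_features = X.shape

# Count how many samples belong to each species.
samples_per_species = dict(
    zip(iris.target_names.tolist(), np.bincount(y).tolist())
)

print(n_samples, n_features)
print(iris.feature_names)
print(samples_per_species)
```

In this copy of the data the three species are equally represented, which is one reason plain accuracy is a reasonable first metric for classifiers trained on it.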

Beyond its historical significance, the Fisher Iris dataset remains widely used in modern data science: as a benchmark for evaluating classification algorithms, as a subject for data visualization and exploration, and as a teaching example for fundamental machine learning concepts. In this project, we explore the dataset and show its continued relevance to data analysis and predictive modeling.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
import pandas as pd

# Load the dataset from sklearn and build a DataFrame with a species column
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(iris_df)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  5.1               3.5                1.4               0.2   
1                  4.9               3.0                1.4               0.2   
2                  4.7               3.2                1.3               0.2   
3                  4.6               3.1                1.5               0.2   
4                  5.0               3.6                1.4               0.2   
..                 ...               ...                ...               ...   
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4                5.4               2.3   
149                5.9               3.0                5.1               1.8   

       species  
0       se

## Supervised Learning

Supervised learning is one of the foundational paradigms of machine learning and underpins many real-world applications. It is characterized by labeled training data: the algorithm learns from examples whose correct outputs are known in order to make predictions on unseen data points. In its basic form, supervised learning is the process of learning a mapping from input features to their corresponding output labels.

Central to supervised learning is the concept of supervision: the algorithm is given a ground truth from which it can learn the relationship between input and output variables. This supervision takes the form of a training dataset of input instances paired with their labels, which allows the algorithm to generalize from known patterns and make predictions on new, unlabeled data. The overall objective is a model that captures the underlying patterns well enough to generalize beyond the training set, rather than merely memorizing it.
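
The idea of learning from labeled pairs and then predicting on unseen data can be sketched on the Iris dataset. This is an illustrative example, not the method used later in this project: the choice of logistic regression, the 70/30 split, and `random_state=42` are all assumptions made for the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Inputs X (four measurements) and labels y (species codes 0, 1, 2).
X, y = load_iris(return_X_y=True)

# Hold out 30% of the labeled samples to stand in for "unseen" data.
# stratify=y keeps the species proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Supervision: the model sees inputs together with their true labels.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Generalization is judged on samples the model never saw during training.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

Comparing the two scores gives a first indication of whether the model generalizes or has only memorized the training set.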

Supervised learning encompasses two primary categories: regression and classification. In regression tasks, the goal is to predict a continuous output variable, for example predicting housing prices from features of a property. In classification tasks, the goal is to assign input data to one of a set of predefined discrete classes, as in image recognition or sentiment analysis. Common classification algorithms include logistic regression, decision trees, support vector machines, and neural networks, each with different strengths.
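
The difference between the two categories can be shown on the same data. In this hedged sketch, the regression task (predicting petal width from the other three measurements) is an assumption chosen purely for illustration; both models are fitted and queried on the full dataset only to show the type of output each produces, not to measure performance.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Regression: the target is continuous (petal width in cm),
# predicted from sepal length, sepal width, and petal length.
X_reg, y_reg = X[:, :3], X[:, 3]
regressor = LinearRegression().fit(X_reg, y_reg)
predicted_width = regressor.predict(X_reg[:1])[0]

# Classification: the target is one of three discrete species labels.
classifier = LogisticRegression(max_iter=1000).fit(X, y)
predicted_species = iris.target_names[classifier.predict(X[:1])[0]]

print(f"regression output (a number):    {predicted_width:.2f} cm")
print(f"classification output (a class): {predicted_species}")
```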

Supervised learning is applied across diverse domains, from natural language processing and computer vision to medical diagnosis and recommendation systems. This versatility has made it a standard tool for data-driven decision-making and the basis on which many predictive models are built.

## Classification Algorithms

Classification algorithms assign data to predefined classes or categories. They are central to many real-world applications, such as sentiment analysis, image recognition (e.g., the MNIST dataset), and medical diagnosis (e.g., the Breast Cancer Wisconsin dataset).

These algorithms use labeled training data to learn the relationship between input features and output classes. Their objective is a predictive model that accurately assigns new, unlabeled data points to the correct category. Many classification algorithms exist, each suited to different kinds of task.

For instance, logistic regression models class probabilities, decision trees hierarchically partition the feature space (as can be illustrated on the Iris dataset), support vector machines seek the separating hyperplane with the largest margin, and k-nearest neighbors classifies a point according to the labels of its closest training samples. In recent years, deep neural networks (e.g., convolutional neural networks for image classification on datasets such as CIFAR-10) have made notable progress on complex, high-dimensional data.
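
Several of the algorithms named above can be compared side by side on the Iris dataset. The following is a rough sketch rather than a tuned benchmark: the hyperparameters (`max_depth=3`, `n_neighbors=5`, a linear SVM kernel), the feature scaling step, and the use of 5-fold cross-validation are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Distance- and margin-based models are wrapped in a pipeline so that
# feature scaling is fitted on the training folds only.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "support vector machine": make_pipeline(
        StandardScaler(), SVC(kernel="linear")
    ),
    "k-nearest neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
}

# Mean accuracy over 5 stratified folds for each model.
mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_accuracy[name] = scores.mean()
    print(f"{name:<24} {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On a dataset this small the differences between models are likely to be within the fold-to-fold variation, so the comparison is better read as a demonstration of the workflow than as a ranking.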

The choice of classification algorithm depends on factors such as data complexity, interpretability requirements, and computational resources. Continued advances in classification algorithms allow decision-making and classification tasks to be automated across many domains.

## K Nearest Neighbours

## Support Vector Machines