
# Machine Learning with Python

Welcome to the **Machine Learning** course! This course is designed to give you hands-on experience with the foundational concepts and advanced techniques in machine learning. You will explore:

1. **Supervised Learning**
    - Regression algorithms
    - Classification algorithms
2. **Unsupervised Learning**
    - Clustering algorithms
    - Dimensionality reduction
3. **Fairness and Interpretability**
    - Interpretable methods
    - Bias evaluation
    
Throughout the course, you'll engage in projects to solidify your understanding and gain practical skills in implementing machine learning algorithms.  

Instructor: Dr. Adrien Dorise  
Contact: adrien.dorise@hotmail.com  

---


## Part 1.2: Supervised learning - Classification on the Iris dataset
In this project, you will compare multiple classification models on the Iris dataset. The tasks will include:  

1. **Import and Understand a Dataset**: Learn how to load, preprocess, and explore a dataset to prepare it for training.
2. **Train a classification model**: Select and train a classification model using scikit-learn.
3. **Evaluate and plot the model performance**: Select a criterion by which you can evaluate the model, and plot its results.
4. **Compare multiple classification models, and get the best performance**: Compare multiple models, and find the one that best fits the data.

By the end of this project, you'll have a solid understanding of the different classification methods.

---

## Dataset

This exercise will use the **Iris dataset** (https://www.kaggle.com/datasets/uciml/iris).  
The Iris dataset is a classic dataset in machine learning that contains 150 samples of iris flowers, with 4 features per sample: sepal length, sepal width, petal length, and petal width. Each sample is labeled with one of three species: Setosa, Versicolor, or Virginica.  

This dataset can also be accessed through the scikit-learn API.  
Here, it is provided in the file `part1_supervised_learning/2_classification/classification_dataset.csv`.  
The code snippet below allows you to load the dataset.

In [2]:
import pandas as pd


# Import the Iris dataset
df = pd.read_csv("classification_dataset.csv")

# Remove ID column
df = df.drop(columns=["id"])
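
If the CSV file is not at hand, the same data can be pulled from scikit-learn directly. The sketch below is one way to do it, assuming a scikit-learn version that supports `as_frame=True`. The variable name `iris_df` is only illustrative, and the column names returned by scikit-learn (for example `sepal length (cm)`) may differ from those in the course CSV.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# iris.frame holds the 4 feature columns plus an integer "target" column
iris_df = iris.frame

print(iris_df.shape)             # number of samples and columns
print(list(iris.target_names))   # species names matching target values 0, 1, 2
```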

## Data visualisation

**Your job**:
- Print the first 10 samples of the dataset.
    - What can you say about the available features?
- Plot the dataset using matplotlib's `plt.scatter` function.
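
As a starting point, here is one possible way to approach both tasks. It is only a sketch: it loads the data from scikit-learn rather than from the course CSV, so the column names (`petal length (cm)`, `petal width (cm)`) are assumptions that may need adapting to your Dataframe, and the choice of which two features to plot is arbitrary.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: data comes from scikit-learn, not from the course CSV
iris = load_iris(as_frame=True)
data = iris.frame

# First 10 samples: four numeric features and the integer class label
print(data.head(10))

# One scatter call per class so that each species gets its own colour and legend entry
fig, ax = plt.subplots()
for class_id, name in enumerate(iris.target_names):
    subset = data[data["target"] == class_id]
    ax.scatter(subset["petal length (cm)"], subset["petal width (cm)"], label=name)

ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.legend()
```

In a notebook the figure is rendered at the end of the cell. Trying other feature pairs shows which ones separate the classes best.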

In [None]:
import numpy as np
import matplotlib.pyplot as plt


## Data preparation

**Your job:**
- Data cleaning
    - If necessary, take care of missing values.
    - The strategy used is up to you.
    - You can remove the sample, or fill the missing value with a strategy of your choice.
- Split features and targets.
    - The target is the column labelled as **"target"** in the Dataframe.
    - The features are all the other columns.
- Standardise the features using **StandardScaler** (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
    - StandardScaler rescales each feature to zero mean and unit variance. It does not map values to [0,1]; that is what MinMaxScaler does.
    - Scaling puts all features on a comparable scale, which prevents models from being biased toward features with larger values. It also helps gradient-based algorithms converge faster and matters for distance-based models such as k-nearest neighbours.
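
The preparation steps can be sketched as follows. This is a minimal example under stated assumptions, not the expected solution: the data is loaded from scikit-learn instead of the course CSV, and dropping incomplete rows is just one of the possible cleaning strategies.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: data comes from scikit-learn, not from the course CSV
data = load_iris(as_frame=True).frame

# Data cleaning: drop rows with missing values
# (the scikit-learn copy has none, so no row is removed here)
data = data.dropna()

# Split features and target
X = data.drop(columns=["target"])
y = data["target"]

# Standardise each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(np.round(X_scaled.mean(axis=0), 3))  # per-feature mean after scaling
print(np.round(X_scaled.std(axis=0), 3))   # per-feature standard deviation after scaling
print(X_scaled.min(), X_scaled.max())      # values are not confined to [0, 1]
```

Note that fitting the scaler on the whole dataset lets information from the future test samples influence the scaling. A stricter variant fits the scaler on the training split only and then applies `transform` to the test split.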

In [None]:
from sklearn.preprocessing import StandardScaler



## Training and evaluating a classification model using the hold out method

**Your job**:
- Train a model using the sklearn library (https://scikit-learn.org/stable/supervised_learning.html):
    - Divide the dataset into a train and test set using the holdout method.
    - Select an algorithm that we studied in the course.
    - Train the algorithm on the train set.
- Evaluate the model.
    - Compute the confusion matrix on the test set.
    - Compute the accuracy of the model for both the train and test set.
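
The steps above can be sketched as follows. This is one possible illustration, not the reference solution: the k-nearest neighbours model, the 80/20 split, `stratify=y`, and `random_state=0` are arbitrary choices, and the data is loaded from scikit-learn without scaling to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: raw features from scikit-learn, not the prepared course data
X, y = load_iris(return_X_y=True)

# Holdout: keep 20% of the samples aside for testing,
# stratified so each species keeps the same proportion in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Train on the train set only
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Evaluate: confusion matrix on the test set, accuracy on both sets
cm = confusion_matrix(y_test, model.predict(X_test))
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print(cm)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between train and test accuracy is a sign of overfitting. With only 30 test samples, the test accuracy also changes noticeably from one random split to another.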


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC




In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score



## Training and evaluating a classification model using cross-validation

**Your job**:
- Train a model using the sklearn library (https://scikit-learn.org/stable/supervised_learning.html):
    - Select another algorithm that we studied in the course.
    - Train the algorithm using K-fold cross-validation.
- Evaluate the model.
    - Compute the confusion matrix on the test set.
    - Compute the accuracy of the model for both the train and test set.
- Compare the performance with the holdout method.
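
A possible sketch of this workflow is given below. The decision tree, its `max_depth`, the 5 folds, and the random seeds are illustrative assumptions. It uses `cross_validate` and `cross_val_predict`, which are not imported in the cells of this notebook, to obtain train scores and out-of-fold predictions; `cross_val_score` alone only returns the test scores.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Assumption: raw features from scikit-learn, not the prepared course data
X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier(max_depth=3, random_state=0)

# shuffle=True matters here: the samples are ordered by species,
# so unshuffled folds would contain very unbalanced classes
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Accuracy on the training folds and on the held-out fold, for each of the 5 splits
scores = cross_validate(
    model, X, y, cv=cv, scoring="accuracy", return_train_score=True
)
print("mean train accuracy:", scores["train_score"].mean())
print("mean test accuracy:", scores["test_score"].mean())

# Each sample is predicted by the model that did not see it during training,
# which gives a confusion matrix over all 150 samples
y_pred = cross_val_predict(model, X, y, cv=cv)
cm = confusion_matrix(y, y_pred)
print(cm)
```

Because every sample is used for testing exactly once, the averaged score is usually a more stable estimate than the accuracy from a single holdout split.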

In [None]:
from sklearn.model_selection import KFold, cross_val_score



In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
