
# Machine Learning with Python

Welcome to the **Machine Learning** course! This course is designed to give you hands-on experience with the foundational concepts and advanced techniques in machine learning. You will explore:

1. **Supervised Learning**
    - Regression algorithms
    - Classification algorithms
2. **Unsupervised Learning**
    - Clustering algorithms
    - Dimensionality reduction
3. **Fairness and Interpretability**
    - Interpretable methods
    - Bias evaluation
    
Throughout the course, you'll engage in projects to solidify your understanding and gain practical skills in implementing machine learning algorithms.  

Instructor: Dr. Adrien Dorise  
Contact: adrien.dorise@hotmail.com  

---


## Part 3: Fairness in machine learning with the COMPAS dataset
In this project, you will tackle an ethics-oriented machine learning problem. The goal is to understand the biases that can be present in a dataset and that can appear when building a machine learning model. The tasks include:

1. **Import and Understand a Dataset**: Learn how to load, preprocess, and explore a dataset to prepare it for training.
2. **Perform classification on a dataset**: Learn to perform classification on a real dataset.
3. **Interpret the model**: Learn to create an interpretable representation of your model.
4. **Analyse the possible biases**: Learn to be critical of the model's predictions.


By the end of this project, you'll have a better understanding of the risks related to biases in datasets.

---

## Dataset

This exercise will use the **COMPAS dataset** (https://www.kaggle.com/datasets/danofer/compass/).  
The COMPAS dataset contains data on individuals involved in the criminal justice system, including features like age, race, and criminal history, used to predict recidivism risk scores. It has been widely used to study algorithmic bias in risk prediction models.  

Here, it is given to you in the `compas_binarised.csv` file.  
The code snippet below allows you to load the dataset.

In [1]:
import pandas as pd

# Read the dataset
df = pd.read_csv('compas_binarised.csv')

# Remove rows with missing values
df = df.dropna()

# Remove the 'id' column
df = df.drop(columns=['id'])

## Data visualisation

The description of the dataset is given to you in the code snippets below.

**Your job**:
- Look at the dataset information.
- What can you say about the available features?
- What is the target called?
- Split the dataset into *features* and *target*.


In [3]:
# Show the first few rows to check the structure of the data
df.head()

| | priors_count | is_recid | sex_Female | sex_Male | age_cat_25 - 45 | age_cat_Greater than 45 | age_cat_Less than 25 | race_African-American | race_Asian | race_Caucasian | race_Hispanic | race_Native American | race_Other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | False | True | False | True | False | False | False | False | False | False | True |
| 1 | 0 | 0 | False | True | False | True | False | False | False | False | False | False | True |
| 2 | 0 | 1 | False | True | True | False | False | True | False | False | False | False | False |
| 3 | 4 | 1 | False | True | False | False | True | True | False | False | False | False | False |
| 4 | 4 | 1 | False | True | False | False | True | True | False | False | False | False | False |


In [4]:
# Display the structure of the dataset
df.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10542 entries, 0 to 10541
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   priors_count             10542 non-null  int64
 1   is_recid                 10542 non-null  int64
 2   sex_Female               10542 non-null  bool 
 3   sex_Male                 10542 non-null  bool 
 4   age_cat_25 - 45          10542 non-null  bool 
 5   age_cat_Greater than 45  10542 non-null  bool 
 6   age_cat_Less than 25     10542 non-null  bool 
 7   race_African-American    10542 non-null  bool 
 8   race_Asian               10542 non-null  bool 
 9   race_Caucasian           10542 non-null  bool 
 10  race_Hispanic            10542 non-null  bool 
 11  race_Native American     10542 non-null  bool 
 12  race_Other               10542 non-null  bool 
dtypes: bool(11), int64(2)
memory usage: 360.3 KB


In [5]:
# Display the summary statistics of the dataset
print(df.describe())  

       priors_count      is_recid
count  10542.000000  10542.000000
mean       4.099507      0.473629
std        5.380332      0.499328
min        0.000000      0.000000
25%        1.000000      0.000000
50%        2.000000      0.000000
75%        6.000000      1.000000
max       43.000000      1.000000


In [7]:
# TODO
target = 'is_recid'
# Split the dataset into features and target
X = df.drop(columns=target)
y = df[target]
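
Before training, it can help to check how balanced the target is and how it is distributed across the one-hot encoded groups. The snippet below is a minimal sketch of that check. It runs on a tiny hand-made frame (`toy`) that only reuses a few column names from the dataset, and the helper `target_rate_by_group` is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Tiny hypothetical frame reusing a few column names from the dataset.
# In the notebook, pass the real `df` loaded above instead.
toy = pd.DataFrame({
    "priors_count": [0, 0, 0, 4, 4, 2],
    "is_recid": [0, 0, 1, 1, 1, 0],
    "race_African-American": [False, False, True, True, True, False],
    "race_Caucasian": [True, False, False, False, False, True],
    "race_Other": [False, True, False, False, False, False],
})


def target_rate_by_group(frame, prefix, target="is_recid"):
    """Group size and mean target for each one-hot column starting with `prefix`."""
    rows = {}
    for col in frame.columns:
        if col.startswith(prefix):
            members = frame[frame[col]]  # keep rows where the one-hot flag is True
            rows[col] = {"n": len(members), "target_rate": members[target].mean()}
    return pd.DataFrame(rows).T


# Share of each class in the target
balance = toy["is_recid"].value_counts(normalize=True)
# Group size and recidivism rate per race column
by_race = target_rate_by_group(toy, "race_")

print(balance)
print(by_race)
```

On the real data, `target_rate_by_group(df, "race_")` or `target_rate_by_group(df, "age_cat_")` would give the same kind of summary for each encoded group.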

## Train an SVM classifier

You will start by training an SVM classifier on the COMPAS dataset.

**Your job:**
- Split the dataset between train and test using the holdout method.
- Train an SVM model.
- Print the accuracy of the SVM model.

In [10]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
svm_model = SVC()
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

# Compute and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("SVM model accuracy:", round(accuracy, 2))

SVM model accuracy: 0.65
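
An accuracy of 0.65 is easier to judge against a trivial baseline. The `df.describe()` output above gives a mean of about 0.47 for `is_recid`, so always predicting "no recidivism" would already score about 0.53 on the full dataset. The sketch below shows how such a majority-class baseline could be computed with scikit-learn's `DummyClassifier`. The data here is synthetic (`X_demo`, `y_demo` are hypothetical stand-ins); in the notebook you would reuse `X_train`, `X_test`, `y_train` and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical), with roughly 47% positive labels.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 10, size=(200, 3))
y_demo = (rng.random(200) < 0.47).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The baseline ignores the features and always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_te, baseline.predict(X_te))

print("Majority-class baseline accuracy:", round(baseline_accuracy, 2))
```

A model is only informative to the extent that it beats this number on the same test split.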


## Train a Decision tree classifier and interpret the model

Decision trees are highly interpretable. They are useful for finding out which features are relevant to the prediction.

**Your job:**
- Split the dataset between train and test using the holdout method.
- Train a decision tree model and modify the hyperparameters.
- Print the accuracy of the decision tree model.
- Plot the confusion matrix.
- Visualise the tree.
    - You can use the **plot_tree** function from `sklearn.tree`.
    - `plot_tree(model, filled=True, feature_names=list(X.columns), class_names=['No Recidivism', 'Recidivism'], rounded=True)`
- Conclude about the most important features.

In [None]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay


# Train a decision tree
model = DecisionTreeClassifier(
    max_depth=5,
    criterion='gini',
    splitter='random',
    min_samples_leaf=800,
    min_samples_split=1000,
    random_state=42,
)

# TODO

In [None]:
import matplotlib.pyplot as plt
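
As a complement to the plots requested above, the following sketch shows one possible way to inspect a fitted tree numerically: accuracy, the raw confusion matrix, the `feature_importances_` attribute, and a text rendering of the tree with `export_text`. It uses synthetic data in which the label depends only on `priors_count` (an assumption made purely for illustration), and a shallow tree with default splitting rather than the hyperparameters of the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (hypothetical): the label depends only on priors_count.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "priors_count": rng.integers(0, 20, size=400),
    "sex_Male": rng.integers(0, 2, size=400),
})
y_demo = (X_demo["priors_count"] > 5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Fit a shallow tree and predict on the held-out split.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_tr, y_tr)
y_hat = tree.predict(X_te)

tree_accuracy = accuracy_score(y_te, y_hat)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat, labels=[0, 1])
# Impurity-based importance of each feature, largest first.
importances = pd.Series(
    tree.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print("Accuracy:", round(tree_accuracy, 2))
print(cm)
print(importances)
print(export_text(tree, feature_names=list(X_demo.columns)))
```

On the real data, the ranking in `importances` is a starting point for discussing which features drive the prediction; the sensitive attributes deserve particular attention there.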



## Evaluate biases

Now that you have evaluated your model on the whole dataset, it is time to check whether its predictions are biased.

**Your job:**
- Propose an evaluation method that checks whether the model is biased against a certain population.
- Is the model fair?
- Conclude on your results.
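
One possible evaluation method, among others, is to compute the same error metrics separately for each population and compare them. The sketch below computes the selection rate (share of positive predictions), the false positive rate and the false negative rate inside a group. The helper `group_rates` and the hand-made labels for groups "A" and "B" are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def group_rates(y_true, y_pred, in_group):
    """Selection rate, false positive rate and false negative rate inside one group."""
    mask = np.asarray(in_group, dtype=bool)
    y_true = np.asarray(y_true)[mask]
    y_pred = np.asarray(y_pred)[mask]
    negatives = y_true == 0
    positives = y_true == 1
    return {
        "n": int(len(y_true)),
        # Share of group members predicted positive
        "selection_rate": float(np.mean(y_pred == 1)) if len(y_true) else float("nan"),
        # Share of true negatives wrongly predicted positive
        "fpr": float(np.mean(y_pred[negatives] == 1)) if negatives.any() else float("nan"),
        # Share of true positives wrongly predicted negative
        "fnr": float(np.mean(y_pred[positives] == 0)) if positives.any() else float("nan"),
    }


# Hand-made hypothetical labels and predictions for two groups, A and B.
y_true_demo = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred_demo = [1, 0, 1, 1, 0, 0, 0, 1]
group_a = [True, True, True, True, False, False, False, False]
group_b = [not g for g in group_a]

report = pd.DataFrame({
    "A": group_rates(y_true_demo, y_pred_demo, group_a),
    "B": group_rates(y_true_demo, y_pred_demo, group_b),
}).T
print(report)
```

In the notebook, a call such as `group_rates(y_test, y_pred, X_test['race_African-American'])` could be compared with the same call for `X_test['race_Caucasian']`. Large gaps in false positive or false negative rates between groups suggest that the model's errors are not evenly distributed.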

# The END!

Congratulations!  
You have now completed this course about machine learning!  
You should now have a good understanding of the basic principles of artificial intelligence!  

This is a solid foundation to build on. You are now well-prepared to tackle new challenges in machine learning!

If you liked this course, don't hesitate to contact me for other courses:
- **Machine learning:** from supervised to unsupervised, with ethical questions.
- **Deep Learning:** from the 1950s perceptron up to the transformer powerhouse.
- **Reinforcement learning:** learn to create your own unique agent!
- **AI in games:** learn to apply deep learning in video games!

Also, don't hesitate to *star* this repository. It helps me a lot!