### 1. Choosing the dataset

Constraints for your datasets:
- Input data: images, time series, or tabular data (more than 10 features at the very least)
- Enough examples to make clustering feasible (for tabular data, an order of magnitude more examples than features)
- Labels / ground truth available
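
The size constraints above can be turned into a quick sanity check. This is only a minimal sketch: the helper name `meets_tabular_constraints` and the factor of 10 used for "an order of magnitude" are illustrative assumptions, not part of the assignment.

```python
def meets_tabular_constraints(n_examples: int, n_features: int,
                              min_features: int = 10, ratio: int = 10) -> bool:
    """Return True if a tabular dataset satisfies the size constraints.

    Assumptions (illustrative): "more than 10 features" is a strict
    inequality, and "an order of magnitude more examples" is read as
    n_examples >= ratio * n_features.
    """
    enough_features = n_features > min_features
    enough_examples = n_examples >= ratio * n_features
    return enough_features and enough_examples


# Hypothetical dataset sizes, for illustration only
print(meets_tabular_constraints(5000, 20))  # True
print(meets_tabular_constraints(100, 20))   # False: too few examples
```

The thresholds are exposed as parameters so the check can be tightened or relaxed for a given dataset.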

#### 1.1 Preprocessing your dataset

In [6]:
import pandas as pd

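# Load the dataset
AI_db = pd.read_csv("student-mat.csv")

# Display the first 5 rows of the dataset
AI_db.head()

# Note: without parentheses this shows the bound method (and the DataFrame repr);
# AI_db.shape gives the (rows, columns) counts directly
AI_db.count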

<bound method DataFrame.count of     school sex  age address famsize Pstatus  Medu  Fedu      Mjob      Fjob  \
0       GP   F   18       U     GT3       A     4     4   at_home   teacher   
1       GP   F   17       U     GT3       T     1     1   at_home     other   
2       GP   F   15       U     LE3       T     1     1   at_home     other   
3       GP   F   15       U     GT3       T     4     2    health  services   
4       GP   F   16       U     GT3       T     3     3     other     other   
..     ...  ..  ...     ...     ...     ...   ...   ...       ...       ...   
390     MS   M   20       U     LE3       A     2     2  services  services   
391     MS   M   17       U     LE3       T     3     1  services  services   
392     MS   M   21       R     GT3       T     1     1     other     other   
393     MS   M   18       R     LE3       T     3     2  services     other   
394     MS   M   19       U     LE3       T     1     1     other   at_home   

     ... famrel fr

**Exploration**

- How many examples?

There are 395 examples.

- What are the features? How many?

There are 33 features (counting every column, including the one later used as the label), listed below:

1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2. sex - student's sex (binary: 'F' - female or 'M' - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject (Math or Portuguese):

31. G1 - first period grade (numeric: from 0 to 20)
32. G2 - second period grade (numeric: from 0 to 20)
33. G3 - final grade (numeric: from 0 to 20, output target)
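
The counts of examples and features, and the split between numeric and non-numeric columns, can be read from the DataFrame itself. The sketch below uses a tiny made-up table (`toy`, with hypothetical values and only three of the columns) rather than the real file, so it runs on its own.

```python
import pandas as pd

# Toy stand-in for the real table (hypothetical values, 3 of the 33 columns)
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS", "MS"],
    "age": [18, 17, 20, 19],
    "Walc": [1, 3, 5, 2],
})

# shape is (number of rows, number of columns)
n_examples, n_columns = toy.shape

# Separate numeric columns from the ones that still need encoding
toy_numeric = toy.select_dtypes(include="number").columns.tolist()
toy_categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(n_examples, n_columns)
print("numeric:", toy_numeric)
print("categorical:", toy_categorical)
```

Applied to the real DataFrame, the same calls would be expected to report the 395 rows and 33 columns quoted above.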

- What are the labels? Look at their distribution: are they balanced?

The chosen label is Walc, the weekend alcohol consumption.
It is rated on a scale from 1 (very low) to 5 (very high). The classes are fairly balanced, except that class 1 is over-represented and class 5 is under-represented relative to the average.
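
One way to back up a statement about class balance is to count the examples per class. The sketch below does this on made-up labels (the `labels` series is hypothetical and does not reproduce the real Walc distribution); the `imbalance_ratio` name is also just illustrative.

```python
import pandas as pd

# Hypothetical label values on the 1-5 Walc scale (not the real data)
labels = pd.Series([1, 1, 1, 2, 2, 3, 3, 4, 4, 5], name="Walc")

# Number of examples per class, ordered by class value
counts = labels.value_counts().sort_index()

# Share of each class in the dataset (sums to 1)
proportions = labels.value_counts(normalize=True).sort_index()

# Largest class divided by smallest class; 1.0 means perfectly balanced
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(proportions.round(2))
print(imbalance_ratio)
```

On the real data, replacing `labels` with `AI_db['Walc']` would give the per-class counts behind the balance assessment.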

- Is there a wide variety of data?

The data are varied because each feature can take several values. For example, the parents' jobs fall into different categories, so the dataset already covers a wide range of social backgrounds.
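
Variety can be made more concrete by counting the distinct values in each column. This is a minimal sketch on a made-up table (`toy_jobs`, hypothetical values); a column with a single distinct value carries no information for clustering.

```python
import pandas as pd

# Hypothetical values for three of the categorical columns
toy_jobs = pd.DataFrame({
    "Mjob": ["at_home", "teacher", "health", "other"],
    "Fjob": ["teacher", "other", "other", "services"],
    "internet": ["yes", "yes", "yes", "yes"],
})

# Number of distinct values per column, most varied first
distinct_values = toy_jobs.nunique().sort_values(ascending=False)

# Columns that never change (here: 'internet' in this toy table)
constant_cols = distinct_values[distinct_values <= 1].index.tolist()

print(distinct_values)
print(constant_cols)
```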
  

**Normalization**

Remember to always normalize your data: https://scikit-learn.org/stable/modules/preprocessing.html
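
As an illustration of what standardization does, `StandardScaler` replaces each value x in a column by (x - mean) / std, so every column ends up with zero mean and unit variance. The sketch below applies it to a small made-up matrix (`X_toy` is hypothetical) whose two columns live on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns on very different scales
X_toy = np.array([[15.0, 0.0],
                  [17.0, 10.0],
                  [19.0, 20.0],
                  [21.0, 90.0]])

toy_scaler = StandardScaler()
X_toy_scaled = toy_scaler.fit_transform(X_toy)

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_toy_scaled.mean(axis=0))
print(X_toy_scaled.std(axis=0))
```

Without this step, distance-based methods would be dominated by the column with the largest range.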


In [7]:
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

encoders = {}

categorical_cols = [
    'school', 'sex', 'address', 'famsize', 'Pstatus',
    'Mjob', 'Fjob', 'reason', 'guardian',
    'schoolsup', 'famsup', 'paid', 'activities',
    'nursery', 'higher', 'internet', 'romantic'
]

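# Encode each categorical column as integer codes and keep the fitted encoder
for col in categorical_cols:
    le = LabelEncoder()
    AI_db[col] = le.fit_transform(AI_db[col])
    encoders[col] = le


# Standardize every column (zero mean, unit variance); note that this includes the Walc label
X_train = AI_db.to_numpy()
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

# Map each integer code back to its original category, per column
mapping = {}
for col, le in encoders.items():
    mapping[col] = {i: cls for i, cls in enumerate(le.classes_)}

mapping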

    

{'school': {0: 'GP', 1: 'MS'},
 'sex': {0: 'F', 1: 'M'},
 'address': {0: 'R', 1: 'U'},
 'famsize': {0: 'GT3', 1: 'LE3'},
 'Pstatus': {0: 'A', 1: 'T'},
 'Mjob': {0: 'at_home', 1: 'health', 2: 'other', 3: 'services', 4: 'teacher'},
 'Fjob': {0: 'at_home', 1: 'health', 2: 'other', 3: 'services', 4: 'teacher'},
 'reason': {0: 'course', 1: 'home', 2: 'other', 3: 'reputation'},
 'guardian': {0: 'father', 1: 'mother', 2: 'other'},
 'schoolsup': {0: 'no', 1: 'yes'},
 'famsup': {0: 'no', 1: 'yes'},
 'paid': {0: 'no', 1: 'yes'},
 'activities': {0: 'no', 1: 'yes'},
 'nursery': {0: 'no', 1: 'yes'},
 'higher': {0: 'no', 1: 'yes'},
 'internet': {0: 'no', 1: 'yes'},
 'romantic': {0: 'no', 1: 'yes'}}
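
The mapping above shows that `LabelEncoder` numbers the categories in alphabetical order (for example 'at_home' is 0 and 'teacher' is 4). For nominal features such as Mjob, a distance-based method will read those codes as an ordering that does not exist; scikit-learn also documents `LabelEncoder` as meant for target values rather than input features. One common alternative, shown here only as a sketch on a made-up table (`toy_people` is hypothetical), is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical rows with one nominal column and one numeric column
toy_people = pd.DataFrame({
    "Mjob": ["at_home", "teacher", "health", "other"],
    "age": [18, 17, 15, 16],
})

# One 0/1 indicator column per category; numeric columns are left untouched
toy_onehot = pd.get_dummies(toy_people, columns=["Mjob"], dtype=int)

print(toy_onehot.columns.tolist())
print(toy_onehot)
```

Binary columns (yes/no, F/M) are unaffected by this concern, since a single 0/1 code already is an indicator.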

In [9]:
X = AI_db.drop(columns=['Walc']).to_numpy()  # all features except the label (encoded, not standardized)
y = AI_db['Walc'].to_numpy()  # label: weekend alcohol consumption (1-5)
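
In the cells above, the scaler is fitted on all columns, Walc included, while `X` is built from the unscaled table. A possible way to keep the label out of the scaling and still feed standardized features downstream is sketched below on a made-up, already-encoded table (`toy_encoded`, `X_feat` and `y_lab` are hypothetical names and values).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy, already-encoded table standing in for AI_db (hypothetical values)
toy_encoded = pd.DataFrame({
    "age": [18, 17, 15, 16],
    "absences": [6, 4, 10, 2],
    "Walc": [1, 1, 3, 2],
})

# Separate the label first, so it is neither scaled nor used as a feature
X_feat = toy_encoded.drop(columns=["Walc"]).to_numpy(dtype=float)
y_lab = toy_encoded["Walc"].to_numpy()

# Standardize the features only
X_feat_scaled = StandardScaler().fit_transform(X_feat)

print(X_feat_scaled.shape)
print(y_lab)
```

The label keeps its original 1 to 5 values, which makes it directly usable for comparing clusters against the ground truth.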