# Analysis of the S3 grades (2021-2022)

The file `S3_2021-2022.csv` contains the (anonymized) S3 results from last year. We will try to identify patterns in these results.

The S3 courses (excluding the second foreign language, LV2) are:

- CSA: scientific computing (a)
- CSB: scientific computing (b)
- AUT: control engineering
- CSI: industrial systems design
- MFL: fluid mechanics
- MDS: structural mechanics
- SDM: materials science
- RAY: radiation
- EPS: physical education and sports
- COM: professional communication
- EED: energy and environment: the challenges
- ANG: English
- STA: blue-collar internship

In [1]:
import numpy as np
import pandas as pd

In [2]:
s3 = pd.read_csv('S3_2021-2022.csv')
s3

Unnamed: 0,CSA,CSB,AUT,CSI,MFL,MDS,SDM,RAY,EPS,COM,EED,ANG,STA
0,13.4,18.00,16.79,13.8,18.00,17.92,17.69,18.28,13.5,15.5,14.5,16.50,17.00
1,15.4,18.43,16.43,12.0,13.35,18.96,18.96,15.28,14.5,18.0,14.0,15.25,17.50
2,15.8,15.50,16.43,16.9,17.85,16.04,15.08,16.06,16.0,18.5,15.0,12.00,16.00
3,18.8,15.14,15.86,14.2,15.69,13.73,18.46,10.06,14.0,16.5,16.5,15.84,16.63
4,17.0,14.79,16.00,16.8,14.58,15.58,18.19,16.56,14.0,17.0,13.5,13.70,17.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
186,6.2,11.96,8.71,12.0,11.15,8.54,5.92,7.50,16.5,16.5,12.5,15.50,15.75
187,5.6,10.00,12.00,13.4,8.92,7.65,9.69,8.44,9.0,13.5,14.0,14.62,14.00
188,6.6,8.71,8.00,15.5,9.85,10.77,8.62,5.78,16.0,15.5,10.0,9.30,16.50
189,10.0,9.43,11.71,12.0,10.04,7.92,7.73,6.11,11.0,14.5,8.5,13.35,15.00


In [3]:
s3 = (s3.fillna(20) < 10)
s3.astype(int)

Unnamed: 0,CSA,CSB,AUT,CSI,MFL,MDS,SDM,RAY,EPS,COM,EED,ANG,STA
0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
186,1,0,1,0,0,1,1,1,0,0,0,0,0
187,1,0,0,0,1,1,1,1,1,0,0,0,0
188,1,1,1,0,1,0,1,1,0,0,0,1,0
189,0,1,0,0,0,1,1,1,0,0,1,0,0


In [4]:
s3['cnt'] = s3.sum(axis=1)
s3 = s3.sort_values(by=["cnt"] + list(s3.columns), ascending=True).drop(columns=['cnt']).reset_index(drop=True)
sample = s3.tail(10)
sample.astype(int)

Unnamed: 0,CSA,CSB,AUT,CSI,MFL,MDS,SDM,RAY,EPS,COM,EED,ANG,STA
181,1,1,1,0,0,1,0,0,0,0,0,0,0
182,0,0,1,0,1,1,0,1,0,0,1,0,0
183,0,1,0,0,0,1,1,1,0,0,1,0,0
184,0,1,0,0,1,0,1,1,0,0,1,0,0
185,1,0,0,0,1,1,0,1,0,1,0,0,0
186,1,0,0,0,1,1,1,1,0,0,0,0,0
187,1,0,1,0,0,1,1,1,0,0,0,0,0
188,0,1,1,0,1,1,1,1,0,0,0,0,0
189,1,0,0,0,1,1,1,1,1,0,0,0,0
190,1,1,1,0,1,0,1,1,0,0,0,1,0


The extract below shows the failures of the ten students who failed the most courses, with a pass threshold of 10 (a missing grade counts as a pass).

|     |   CSA |   CSB |   AUT |   CSI |   MFL |   MDS |   SDM |   RAY |   EPS |   COM |   EED |   ANG |   STA |
|----:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| 181 |     1 |     1 |     1 |     0 |     0 |     1 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |
| 182 |     0 |     0 |     1 |     0 |     1 |     1 |     0 |     1 |     0 |     0 |     1 |     0 |     0 |
| 183 |     0 |     1 |     0 |     0 |     0 |     1 |     1 |     1 |     0 |     0 |     1 |     0 |     0 |
| 184 |     0 |     1 |     0 |     0 |     1 |     0 |     1 |     1 |     0 |     0 |     1 |     0 |     0 |
| 185 |     1 |     0 |     0 |     0 |     1 |     1 |     0 |     1 |     0 |     1 |     0 |     0 |     0 |
| 186 |     1 |     0 |     0 |     0 |     1 |     1 |     1 |     1 |     0 |     0 |     0 |     0 |     0 |
| 187 |     1 |     0 |     1 |     0 |     0 |     1 |     1 |     1 |     0 |     0 |     0 |     0 |     0 |
| 188 |     0 |     1 |     1 |     0 |     1 |     1 |     1 |     1 |     0 |     0 |     0 |     0 |     0 |
| 189 |     1 |     0 |     0 |     0 |     1 |     1 |     1 |     1 |     1 |     0 |     0 |     0 |     0 |
| 190 |     1 |     1 |     1 |     0 |     1 |     0 |     1 |     1 |     0 |     0 |     0 |     1 |     0 |

## Frequent itemsets

**Q1** On this extract, list the indices (i.e. the student numbers between 181 and 190) corresponding to the following failures:

- CSA
- CSA, AUT
- CSA, AUT, MDS
- CSA, AUT, MDS, EED
- CSA, AUT, MDS, EED, MFL

**Q2** Deduce the support (absolute and relative) of these five itemsets.

**Q3** What is the support (absolute and relative) of the following itemsets:

- MFL
- MFL, MDS
- AUT, RAY, STA
- SDM, STA
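
As a minimal sketch of what Q1 to Q3 ask for, the extract can be encoded as one set of failed courses per student, and the support counted directly. The helper names `cover` and `support` are hypothetical and only serve this illustration; the data is copied from the table above.

```python
# Failed courses of the ten students of the extract (rows 181 to 190 above)
transactions = {
    181: {"CSA", "CSB", "AUT", "MDS"},
    182: {"AUT", "MFL", "MDS", "RAY", "EED"},
    183: {"CSB", "MDS", "SDM", "RAY", "EED"},
    184: {"CSB", "MFL", "SDM", "RAY", "EED"},
    185: {"CSA", "MFL", "MDS", "RAY", "COM"},
    186: {"CSA", "MFL", "MDS", "SDM", "RAY"},
    187: {"CSA", "AUT", "MDS", "SDM", "RAY"},
    188: {"CSB", "AUT", "MFL", "MDS", "SDM", "RAY"},
    189: {"CSA", "MFL", "MDS", "SDM", "RAY", "EPS"},
    190: {"CSA", "CSB", "AUT", "MFL", "SDM", "RAY", "ANG"},
}


def cover(itemset, transactions):
    """Sorted ids of the transactions that contain every item of `itemset`."""
    return sorted(tid for tid, items in transactions.items() if set(itemset) <= items)


def support(itemset, transactions):
    """(absolute, relative) support of `itemset`."""
    absolute = len(cover(itemset, transactions))
    return absolute, absolute / len(transactions)


print(cover({"CSA", "AUT"}, transactions))    # [181, 187, 190]
print(support({"CSA", "AUT"}, transactions))  # (3, 0.3)
print(support({"MFL", "MDS"}, transactions))  # (5, 0.5)
```

The absolute support is a number of students; the relative support divides it by the ten rows of the extract.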

**Q4** Following the Apriori algorithm, list the frequent itemsets for a minimum (absolute) support of 5, i.e. 0.5 in relative terms, together with the support of each itemset.

**Q5** List the frequent itemsets for a minimum (absolute) support of 7.
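
The level-wise search of Apriori can be sketched in plain Python as follows. This is an illustrative sketch under simple assumptions (transactions given as sets, absolute support threshold), not the implementation used by mlxtend; the name `apriori_sketch` is hypothetical.

```python
from itertools import combinations

# Failed courses of the ten students of the extract (rows 181 to 190 above)
transactions = [
    {"CSA", "CSB", "AUT", "MDS"},
    {"AUT", "MFL", "MDS", "RAY", "EED"},
    {"CSB", "MDS", "SDM", "RAY", "EED"},
    {"CSB", "MFL", "SDM", "RAY", "EED"},
    {"CSA", "MFL", "MDS", "RAY", "COM"},
    {"CSA", "MFL", "MDS", "SDM", "RAY"},
    {"CSA", "AUT", "MDS", "SDM", "RAY"},
    {"CSB", "AUT", "MFL", "MDS", "SDM", "RAY"},
    {"CSA", "MFL", "MDS", "SDM", "RAY", "EPS"},
    {"CSA", "CSB", "AUT", "MFL", "SDM", "RAY", "ANG"},
]


def apriori_sketch(transactions, min_count):
    """Return {frozenset: absolute support} for every frequent itemset."""

    def count(candidate):
        return sum(1 for t in transactions if candidate <= t)

    # Level 1: frequent single items
    items = sorted(set().union(*transactions))
    level = {frozenset([i]): count(frozenset([i])) for i in items}
    level = {s: c for s, c in level.items() if c >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Count step: keep the candidates that reach the threshold
        level = {c: count(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent


for itemset, n in sorted(apriori_sketch(transactions, 7).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a threshold of 7 the search stops after the pairs, because no triple has all of its pairs frequent: this is the pruning that makes Apriori cheaper than enumerating every combination.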

**Q6** Check your answers to the four previous questions against those computed by the `apriori` function of the [mlxtend](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/) library.

In [5]:
from mlxtend.frequent_patterns import apriori

In [6]:
# Q1
for itemset in [{'CSA'}, {'CSA', 'AUT'}, {'CSA', 'AUT', 'MDS'}, {'CSA', 'AUT', 'MDS', 'EED'}, {'CSA', 'AUT', 'MDS', 'EED', 'MFL'}]:
    print(f"{itemset}: {sample[(sample[list(itemset)] == 1).all(axis=1)].index.to_list()}")

{'CSA'}: [181, 185, 186, 187, 189, 190]
{'AUT', 'CSA'}: [181, 187, 190]
{'AUT', 'MDS', 'CSA'}: [181, 187]
{'AUT', 'MDS', 'EED', 'CSA'}: []
{'MFL', 'CSA', 'MDS', 'EED', 'AUT'}: []


In [7]:
# Q2 and Q3
fq = apriori(sample, min_support=0.1, use_colnames=True)  # with 10 transactions, any support below 0.1 is 0
for itemset in [
    {'CSA'}, {'CSA', 'AUT'}, {'CSA', 'AUT', 'MDS'}, {'CSA', 'AUT', 'MDS', 'EED'}, {'CSA', 'AUT', 'MDS', 'EED', 'MFL'}, # Q2
    {'MFL'}, {'MFL', 'MDS'}, {'STA', 'RAY', 'AUT'}, {'STA', 'SDM'}]: # Q3
    sup = fq[fq['itemsets'] == itemset]['support']
    if sup.size == 0:
        sup = 0
    else:
        sup = sup.iloc[0]
    print(f"{itemset}: {sup}")

{'CSA'}: 0.6
{'AUT', 'CSA'}: 0.3
{'AUT', 'MDS', 'CSA'}: 0.2
{'AUT', 'MDS', 'EED', 'CSA'}: 0
{'MFL', 'CSA', 'MDS', 'EED', 'AUT'}: 0
{'MFL'}: 0.7
{'MDS', 'MFL'}: 0.5
{'RAY', 'STA', 'AUT'}: 0
{'SDM', 'STA'}: 0


In [8]:
# Q4
fq = apriori(sample, min_support=0.5, use_colnames=True)
fq

Unnamed: 0,support,itemsets
0,0.6,(CSA)
1,0.5,(CSB)
2,0.5,(AUT)
3,0.7,(MFL)
4,0.8,(MDS)
5,0.7,(SDM)
6,0.9,(RAY)
7,0.5,"(MDS, CSA)"
8,0.5,"(RAY, CSA)"
9,0.5,"(MDS, MFL)"


In [9]:
# Q5
fq = apriori(sample, min_support=0.7, use_colnames=True)
fq

Unnamed: 0,support,itemsets
0,0.7,(MFL)
1,0.8,(MDS)
2,0.7,(SDM)
3,0.9,(RAY)
4,0.7,"(RAY, MFL)"
5,0.7,"(MDS, RAY)"
6,0.7,"(RAY, SDM)"


**Q7** List the maximal frequent itemsets for a minimum support of 0.5.

In [10]:
# Q7
# Note: this is not the most efficient way to compute the maximal frequent itemsets
def isstrictsubset(itemset, itemsets):
    for it in itemsets:
        if itemset < it:
            return True
    return False
fq = apriori(sample, min_support=0.5, use_colnames=True)
fq['max'] = fq['itemsets'].apply(lambda x: not isstrictsubset(x, fq['itemsets']))
maxfq = fq[fq['max']]
maxfq

Unnamed: 0,support,itemsets,max
1,0.5,(CSB),True
2,0.5,(AUT),True
7,0.5,"(MDS, CSA)",True
8,0.5,"(RAY, CSA)",True
15,0.5,"(MDS, RAY, MFL)",True
16,0.5,"(RAY, SDM, MFL)",True
17,0.5,"(MDS, RAY, SDM)",True


In [11]:
# Q7 (alternative, using fpmax)
from mlxtend.frequent_patterns import fpmax
fq = fpmax(sample, min_support=0.5, use_colnames=True)
fq

Unnamed: 0,support,itemsets
0,0.5,(CSB)
1,0.5,(AUT)
2,0.5,"(MDS, CSA)"
3,0.5,"(RAY, CSA)"
4,0.5,"(MDS, RAY, MFL)"
5,0.5,"(RAY, SDM, MFL)"
6,0.5,"(MDS, RAY, SDM)"


**Q8** List the closed frequent itemsets for a minimum support of 0.5.

In [12]:
# Q8
# Note: this is not the most efficient way to compute the closed frequent itemsets
fq = apriori(sample, min_support=0.5, use_colnames=True)
fq['closed'] = fq.apply(lambda x: not isstrictsubset(x['itemsets'], fq[fq['support'] == x['support']]['itemsets']), axis=1)
clfq = fq[fq['closed']]
clfq

Unnamed: 0,support,itemsets,closed
0,0.6,(CSA),True
1,0.5,(CSB),True
2,0.5,(AUT),True
4,0.8,(MDS),True
6,0.9,(RAY),True
7,0.5,"(MDS, CSA)",True
8,0.5,"(RAY, CSA)",True
11,0.7,"(RAY, MFL)",True
13,0.7,"(MDS, RAY)",True
14,0.7,"(RAY, SDM)",True


**Q9** Give two comparable itemsets, that is, two itemsets for which the support of one is guaranteed to be greater than or equal to the support of the other, and two incomparable itemsets, that is, two itemsets whose supports are not related by the monotonicity property.
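
One way to read Q9: two itemsets are comparable when one is included in the other, since every student who fails all the courses of the larger itemset also fails those of the smaller one. A small sketch of this test (the function name `comparable` is hypothetical):

```python
def comparable(x, y):
    """True when one itemset contains the other, so that their supports are ordered."""
    x, y = set(x), set(y)
    return x <= y or y <= x


# {CSA} is included in {CSA, AUT}: support({CSA}) >= support({CSA, AUT})
print(comparable({"CSA"}, {"CSA", "AUT"}))  # True
# Neither itemset contains the other: monotonicity gives no ordering
print(comparable({"CSA", "AUT"}, {"MFL"}))  # False
```

On the extract this agrees with the values printed for Q2: the support of {CSA} is 0.6 and that of {CSA, AUT} is 0.3.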

**Q10** On the full set of S3 grades, compare the execution times of the `apriori`, `fpgrowth` and `fpmax` functions.  
You can use the [`%timeit`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit) command for this.

In [13]:
from mlxtend.frequent_patterns import fpgrowth, fpmax

In [14]:
# Q10
min_sup = 0.001
%timeit apriori(s3, min_support=min_sup)
%timeit fpgrowth(s3, min_support=min_sup)
%timeit fpmax(s3, min_support=min_sup)

6.44 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.22 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.58 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


**Q11** Check that the results returned by `apriori` and `fpgrowth` are identical.

In [15]:
# Q11
fq1 = apriori(s3, min_support=min_sup)
fq2 = fpgrowth(s3, min_support=min_sup)
fq1['itemsets'] = fq1['itemsets'].apply(lambda x: tuple(sorted(x)))
fq2['itemsets'] = fq2['itemsets'].apply(lambda x: tuple(sorted(x)))
fq1 = fq1.sort_values(by='itemsets').reset_index(drop=True)
fq2 = fq2.sort_values(by='itemsets').reset_index(drop=True)
fq1.equals(fq2)

True

**Q12** Which teaching unit (UE) made up of 2, 3 or 4 courses would be the hardest?  
Could these results have been predicted from the number of failures in each course alone?

In [16]:
s3.sum().sort_values(ascending=False)

RAY    65
CSA    42
MFL    30
MDS    23
EED    17
ANG    11
SDM    10
CSB     9
AUT     7
STA     2
CSI     1
EPS     1
COM     1
dtype: int64

In [17]:
fq = fpgrowth(s3, min_support=0.01, use_colnames=True)
fq['len'] = fq['itemsets'].apply(len)
# Round before casting: astype(int) alone truncates, so 3.9999... would become 3
fq['support (abs.)'] = (fq['support'] * len(s3)).round(0).astype(int)

In [18]:
# Q12, for 4 courses
fq[fq['len'] == 4].sort_values(by='support', ascending=False).head(10)

Unnamed: 0,support,itemsets,len,support (abs.)
102,0.020942,"(MDS, RAY, SDM, CSA)",4,4
94,0.020942,"(MDS, RAY, SDM, MFL)",4,4
18,0.015707,"(ANG, MFL, RAY, CSA)",4,3
35,0.015707,"(MDS, RAY, MFL, CSA)",4,3
103,0.015707,"(SDM, MFL, RAY, CSA)",4,3
84,0.015707,"(RAY, SDM, MFL, CSB)",4,3
75,0.010471,"(MDS, RAY, SDM, CSB)",4,2
97,0.010471,"(RAY, SDM, EED, MFL)",4,2
79,0.010471,"(RAY, SDM, EED, CSB)",4,2
66,0.010471,"(MDS, RAY, SDM, AUT)",4,2


> Here the failures do not appear to be fully independent. However, one would need to check that this is not simply due to chance.  
> A branch of data analysis is dedicated to hypothesis testing; we will not cover it in this course.
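
As a rough illustration only, and outside the scope of the course, the check can be sketched for one pair of courses. The counts are read from the outputs of this notebook (191 students, 65 failures in RAY, 10 in SDM, and 10 students failing both according to the SDM → RAY rule of Q19). Using a one-sided hypergeometric tail here is an assumption of this sketch, not a method prescribed by the course.

```python
from math import comb

n = 191      # students in the full S3 table
n_ray = 65   # failures in RAY (output of In [16])
n_sdm = 10   # failures in SDM (output of In [16])
n_both = 10  # students failing both (rule SDM -> RAY, absolute support)

# Number of joint failures expected if the two failures were independent
expected = n_ray * n_sdm / n
lift = n_both / expected

# Probability of seeing at least n_both RAY failures among the n_sdm students
# if those students were drawn at random (hypergeometric tail)
p_value = sum(
    comb(n_ray, k) * comb(n - n_ray, n_sdm - k)
    for k in range(n_both, min(n_ray, n_sdm) + 1)
) / comb(n, n_sdm)

print(f"expected = {expected:.2f}, observed = {n_both}, lift = {lift:.2f}")
print(f"p-value = {p_value:.2e}")
```

The ratio of observed to expected joint failures is the lift of the rule; the tail probability indicates how surprising that ratio would be under independence.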

## Association rules

**Q13** On the extract of the last ten grade records, compute the support of the following itemsets:

- AUT
- MDS
- SDM
- MDS, SDM
- AUT, SDM
- AUT, MDS
- AUT, MDS, SDM
- AUT, MDS, CSA
- AUT, CSA, MDS, SDM

In [19]:
# Q13
fq = apriori(sample, min_support=0.1, use_colnames=True)
for itemset in [
    {'AUT'}, {'MDS'}, {'SDM'},
    {'MDS', 'SDM'}, {'AUT', 'SDM'}, {'AUT', 'MDS'},
    {'AUT', 'MDS', 'SDM'}, {'AUT', 'MDS', 'CSA'},
    {'AUT', 'CSA', 'MDS', 'SDM'}]:
    sup = fq[fq['itemsets'] == itemset]['support']
    if sup.size == 0:
        sup = 0
    else:
        sup = sup.iloc[0]
    print(f"{itemset}: {sup}")

{'AUT'}: 0.5
{'MDS'}: 0.8
{'SDM'}: 0.7
{'MDS', 'SDM'}: 0.5
{'SDM', 'AUT'}: 0.3
{'MDS', 'AUT'}: 0.4
{'MDS', 'SDM', 'AUT'}: 0.2
{'MDS', 'AUT', 'CSA'}: 0.2
{'MDS', 'SDM', 'AUT', 'CSA'}: 0.1


**Q14** Deduce the confidence of the following association rules:

- MDS $\rightarrow$ SDM
- AUT $\rightarrow$ SDM
- AUT, MDS $\rightarrow$ SDM
- AUT, CSA, MDS $\rightarrow$ SDM

**Q15** Compute the lift, leverage and conviction of the previous association rules.
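
A minimal sketch of these computations from the relative supports, using the usual definitions of the four measures. The function name `rule_measures` and its parameters are hypothetical.

```python
def rule_measures(sup_a, sup_c, sup_ac):
    """Confidence, lift, leverage and conviction of the rule A -> C.

    sup_a, sup_c and sup_ac are the relative supports of the antecedent,
    of the consequent and of their union.
    """
    confidence = sup_ac / sup_a
    lift = confidence / sup_c
    leverage = sup_ac - sup_a * sup_c
    # Conviction is infinite when the rule is never violated (confidence == 1)
    conviction = float("inf") if confidence == 1 else (1 - sup_c) / (1 - confidence)
    return confidence, lift, leverage, conviction


# MDS -> SDM on the extract: support(MDS) = 0.8, support(SDM) = 0.7, support(MDS, SDM) = 0.5
print(rule_measures(0.8, 0.7, 0.5))
```

For MDS → SDM this gives approximately 0.625, 0.893, -0.06 and 0.8. A lift below 1 and a negative leverage both indicate that the two failures occur together slightly less often on this extract than independence would predict.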

**Q16** Compute the Kulczynski, all_confidence, max_confidence, cosine and IR (imbalance ratio) measures for these association rules.
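
For reference, the definitions usually given for these measures can be written as follows for a rule $A \rightarrow B$, where $s(\cdot)$ denotes the relative support. The notation $s$, $A$, $B$ is introduced here for illustration, and these formulas are assumed to be the ones the question refers to.

$$
\begin{aligned}
\mathrm{Kulc}(A, B) &= \frac{1}{2}\left(\frac{s(A \cup B)}{s(A)} + \frac{s(A \cup B)}{s(B)}\right) \\
\mathrm{AllConf}(A, B) &= \frac{s(A \cup B)}{\max\bigl(s(A), s(B)\bigr)} \\
\mathrm{MaxConf}(A, B) &= \frac{s(A \cup B)}{\min\bigl(s(A), s(B)\bigr)} \\
\mathrm{cosine}(A, B) &= \frac{s(A \cup B)}{\sqrt{s(A)\, s(B)}} \\
\mathrm{IR}(A, B) &= \frac{\lvert s(A) - s(B) \rvert}{s(A) + s(B) - s(A \cup B)}
\end{aligned}
$$

Since $s(A \cup B)/s(A)$ is the confidence of $A \rightarrow B$ and $s(A \cup B)/s(B)$ that of $B \rightarrow A$, the Kulczynski measure is the mean of the two confidences, while all_confidence and max_confidence are their minimum and maximum. For MDS $\rightarrow$ SDM, with $s(A) = 0.8$, $s(B) = 0.7$ and $s(A \cup B) = 0.5$, this gives $\mathrm{Kulc} = \frac{1}{2}(0.625 + 0.714) \approx 0.670$.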

**Q17** Check your answers to Q14 and Q15 against those computed by the `association_rules` function of the [mlxtend](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) library.

In [20]:
from mlxtend.frequent_patterns import association_rules

In [21]:
# Q14 and Q15
fq = fpgrowth(sample, min_support=0.1, use_colnames=True)
rl = association_rules(fq, min_threshold=0.1)

rl = pd.concat([rl[(rl['antecedents'] == x) & (rl['consequents'] == y)]
           for x, y in [
               ({'MDS'}, {'SDM'}),
               ({'AUT'}, {'SDM'}),
               ({'AUT', 'MDS'}, {'SDM'}),
               ({'AUT', 'CSA', 'MDS'}, {'SDM'})
           ]])
rl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
1526,(MDS),(SDM),0.8,0.7,0.5,0.625,0.892857,-0.06,0.8
141,(AUT),(SDM),0.5,0.7,0.3,0.6,0.857143,-0.05,0.75
399,"(MDS, AUT)",(SDM),0.4,0.7,0.2,0.5,0.714286,-0.08,0.6
187,"(MDS, AUT, CSA)",(SDM),0.2,0.7,0.1,0.5,0.714286,-0.04,0.6


**Q18** Add the Kulczynski, all_confidence, max_confidence, cosine and IR measures to the resulting DataFrame.  
You can generalize this by defining a function that post-processes the DataFrame returned by `association_rules`, for later reuse for example.

In [22]:
# Q18
rl['Kulc'] = rl['support']*(rl['antecedent support']+rl['consequent support'])/(2*rl['antecedent support']*rl['consequent support'])
rl['all'] = pd.concat([rl['confidence'], rl['support']/rl['consequent support']], axis=1).min(axis=1)
rl['max'] = pd.concat([rl['confidence'], rl['support']/rl['consequent support']], axis=1).max(axis=1)
rl['cos'] = rl['support']/np.sqrt(rl['antecedent support']*rl['consequent support'])
rl['IR'] = np.abs(rl['antecedent support']-rl['consequent support'])/(rl['antecedent support']+rl['consequent support']-rl['support'])
rl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,Kulc,all,max,cos,IR
1526,(MDS),(SDM),0.8,0.7,0.5,0.625,0.892857,-0.06,0.8,0.669643,0.625,0.714286,0.668153,0.1
141,(AUT),(SDM),0.5,0.7,0.3,0.6,0.857143,-0.05,0.75,0.514286,0.428571,0.6,0.507093,0.222222
399,"(MDS, AUT)",(SDM),0.4,0.7,0.2,0.5,0.714286,-0.08,0.6,0.392857,0.285714,0.5,0.377964,0.333333
187,"(MDS, AUT, CSA)",(SDM),0.2,0.7,0.1,0.5,0.714286,-0.04,0.6,0.321429,0.142857,0.5,0.267261,0.625


In [23]:
def compute_measures(df):
    rl = df.copy()
    rl['Kulc'] = rl['support']*(rl['antecedent support']+rl['consequent support'])/(2*rl['antecedent support']*rl['consequent support'])
    rl['all'] = pd.concat([rl['confidence'], rl['support']/rl['consequent support']], axis=1).min(axis=1)
    rl['max'] = pd.concat([rl['confidence'], rl['support']/rl['consequent support']], axis=1).max(axis=1)
    rl['cos'] = rl['support']/np.sqrt(rl['antecedent support']*rl['consequent support'])
    rl['IR'] = np.abs(rl['antecedent support']-rl['consequent support'])/(rl['antecedent support']+rl['consequent support']-rl['support'])
    return rl

**Q19** According to the Kulczynski measure, which rules are the most interesting over the full set of S3 results?  
Remember to adjust the support and confidence thresholds to filter out results that are not significant.

In [24]:
# Q19
fq = fpgrowth(s3, min_support=5/len(s3), use_colnames=True)  # min_support derived from a meaningful number of students, here 5/191 ≈ 0.026
rl = association_rules(fq, min_threshold=0.5)
rl = compute_measures(rl)
rl['support (abs.)'] = (rl['support'] * len(s3)).round(0).astype(int)
rl.sort_values(by=['Kulc', 'IR'], ascending=False).head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,Kulc,all,max,cos,IR,support (abs.)
23,(SDM),"(MDS, RAY)",0.052356,0.078534,0.036649,0.7,8.913333,0.032537,3.071553,0.583333,0.466667,0.7,0.571548,0.277778,7
14,(SDM),(RAY),0.052356,0.340314,0.052356,1.0,2.938462,0.034539,inf,0.576923,0.153846,1.0,0.392232,0.846154,10
19,"(SDM, MFL)",(RAY),0.036649,0.340314,0.036649,1.0,2.938462,0.024177,inf,0.553846,0.107692,1.0,0.328165,0.892308,7
21,"(MDS, SDM)",(RAY),0.036649,0.340314,0.036649,1.0,2.938462,0.024177,inf,0.553846,0.107692,1.0,0.328165,0.892308,7
3,"(EED, MFL)",(RAY),0.026178,0.340314,0.026178,1.0,2.938462,0.017269,inf,0.538462,0.076923,1.0,0.27735,0.923077,5
25,"(SDM, CSA)",(RAY),0.026178,0.340314,0.026178,1.0,2.938462,0.017269,inf,0.538462,0.076923,1.0,0.27735,0.923077,5
20,(SDM),"(RAY, MFL)",0.052356,0.099476,0.036649,0.7,7.036842,0.031441,3.001745,0.534211,0.368421,0.7,0.507833,0.409091,7
16,(SDM),(MDS),0.052356,0.120419,0.036649,0.7,5.813043,0.030345,2.931937,0.502174,0.304348,0.7,0.461566,0.5,7
22,"(RAY, SDM)",(MDS),0.052356,0.120419,0.036649,0.7,5.813043,0.030345,2.931937,0.502174,0.304348,0.7,0.461566,0.5,7
8,"(MDS, MFL)",(RAY),0.036649,0.340314,0.031414,0.857143,2.518681,0.018941,4.617801,0.474725,0.092308,0.857143,0.281284,0.878788,6


**Q20** Are there rules that predict a failure in radiation (RAY) with 100% confidence? In control engineering (AUT)? In EPS?

In [25]:
# Q20
fq = fpgrowth(s3, min_support=0.001, use_colnames=True)
rl = association_rules(fq, min_threshold=1)
rl = compute_measures(rl)
rl['support (abs.)'] = (rl['support'] * len(s3)).round(0).astype(int)
# For RAY, some rules apply to several students
display(rl[rl['consequents'] >= {'RAY'}].sort_values(by=['support'], ascending=False).head(5))
# But not for AUT (at most one student per rule)
display(rl[rl['consequents'] >= {'AUT'}].sort_values(by=['support'], ascending=False).head(5))
# No exact rule applies to EPS, even though one student failed it (with a grade of 9)
display(rl[rl['consequents'] >= {'EPS'}].sort_values(by=['support'], ascending=False).head(5))

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,Kulc,all,max,cos,IR,support (abs.)
687,(SDM),(RAY),0.052356,0.340314,0.052356,1.0,2.938462,0.034539,inf,0.576923,0.153846,1.0,0.392232,0.846154,10
693,"(MDS, SDM)",(RAY),0.036649,0.340314,0.036649,1.0,2.938462,0.024177,inf,0.553846,0.107692,1.0,0.328165,0.892308,7
688,"(SDM, MFL)",(RAY),0.036649,0.340314,0.036649,1.0,2.938462,0.024177,inf,0.553846,0.107692,1.0,0.328165,0.892308,7
694,"(SDM, CSA)",(RAY),0.026178,0.340314,0.026178,1.0,2.938462,0.017269,inf,0.538462,0.076923,1.0,0.27735,0.923077,5
8,"(EED, MFL)",(RAY),0.026178,0.340314,0.026178,1.0,2.938462,0.017269,inf,0.538462,0.076923,1.0,0.27735,0.923077,5


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,Kulc,all,max,cos,IR,support (abs.)
9,"(MDS, MFL, CSB)",(AUT),0.005236,0.036649,0.005236,1.0,27.285714,0.005044,inf,0.571429,0.142857,1.0,0.377964,0.857143,1
388,"(ANG, CSA, CSB)","(RAY, SDM, AUT)",0.005236,0.015707,0.005236,1.0,63.666667,0.005153,inf,0.666667,0.333333,1.0,0.57735,0.666667,1
365,"(ANG, CSB)","(AUT, RAY, MFL, CSA)",0.005236,0.005236,0.005236,1.0,191.0,0.005208,inf,1.0,1.0,1.0,1.0,0.0,1
367,"(CSA, CSB, RAY, SDM, ANG)",(AUT),0.005236,0.036649,0.005236,1.0,27.285714,0.005044,inf,0.571429,0.142857,1.0,0.377964,0.857143,1
373,"(RAY, SDM, CSA, CSB)","(ANG, AUT)",0.005236,0.005236,0.005236,1.0,191.0,0.005208,inf,1.0,1.0,1.0,1.0,0.0,1


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,Kulc,all,max,cos,IR,support (abs.)


## Link between closed and maximal frequent itemsets and association rules

**Q21** Compute the rules that can be generated from the maximal frequent itemsets on the extract of the last ten grade records, with a frequency threshold of 0.5.

**Q22** Check your results against those generated by the `mlxtend` library.

In [26]:
maxfq = fpmax(sample, min_support=0.5, use_colnames=True)
maxfq

Unnamed: 0,support,itemsets
0,0.5,(CSB)
1,0.5,(AUT)
2,0.5,"(MDS, CSA)"
3,0.5,"(RAY, CSA)"
4,0.5,"(MDS, RAY, MFL)"
5,0.5,"(RAY, SDM, MFL)"
6,0.5,"(MDS, RAY, SDM)"


In [27]:
# Q22
maxrl = association_rules(maxfq, min_threshold=0.1, support_only=True)
maxrl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(MDS),(CSA),,,0.5,,,,
1,(CSA),(MDS),,,0.5,,,,
2,(RAY),(CSA),,,0.5,,,,
3,(CSA),(RAY),,,0.5,,,,
4,"(MDS, RAY)",(MFL),,,0.5,,,,
5,"(MDS, MFL)",(RAY),,,0.5,,,,
6,"(RAY, MFL)",(MDS),,,0.5,,,,
7,(MDS),"(RAY, MFL)",,,0.5,,,,
8,(RAY),"(MDS, MFL)",,,0.5,,,,
9,(MFL),"(MDS, RAY)",,,0.5,,,,


**Q23** Compare the association rules generated from the frequent itemsets (Q4), the maximal frequent itemsets (Q7/Q22) and the closed frequent itemsets (Q8).

In [28]:
# Q4
fq = apriori(sample, min_support=0.5, use_colnames=True)
# Q23 (frequent itemsets)
rl = association_rules(fq, min_threshold=0.1)
rl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(MDS),(CSA),0.8,0.6,0.5,0.625,1.041667,0.02,1.066667
1,(CSA),(MDS),0.6,0.8,0.5,0.833333,1.041667,0.02,1.2
2,(RAY),(CSA),0.9,0.6,0.5,0.555556,0.925926,-0.04,0.9
3,(CSA),(RAY),0.6,0.9,0.5,0.833333,0.925926,-0.04,0.6
4,(MDS),(MFL),0.8,0.7,0.5,0.625,0.892857,-0.06,0.8
5,(MFL),(MDS),0.7,0.8,0.5,0.714286,0.892857,-0.06,0.7
6,(SDM),(MFL),0.7,0.7,0.5,0.714286,1.020408,0.01,1.05
7,(MFL),(SDM),0.7,0.7,0.5,0.714286,1.020408,0.01,1.05
8,(RAY),(MFL),0.9,0.7,0.7,0.777778,1.111111,0.07,1.35
9,(MFL),(RAY),0.7,0.9,0.7,1.0,1.111111,0.07,inf


In [29]:
# see Q8 for how clfq is computed
clfq

Unnamed: 0,support,itemsets,closed
0,0.6,(CSA),True
1,0.5,(CSB),True
2,0.5,(AUT),True
4,0.8,(MDS),True
6,0.9,(RAY),True
7,0.5,"(MDS, CSA)",True
8,0.5,"(RAY, CSA)",True
11,0.7,"(RAY, MFL)",True
13,0.7,"(MDS, RAY)",True
14,0.7,"(RAY, SDM)",True


In [30]:
# Q23 (closed frequent itemsets)
clrl = association_rules(clfq, min_threshold=0.1, support_only=True)
clrl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(MDS),(CSA),,,0.5,,,,
1,(CSA),(MDS),,,0.5,,,,
2,(RAY),(CSA),,,0.5,,,,
3,(CSA),(RAY),,,0.5,,,,
4,(RAY),(MFL),,,0.7,,,,
5,(MFL),(RAY),,,0.7,,,,
6,(MDS),(RAY),,,0.7,,,,
7,(RAY),(MDS),,,0.7,,,,
8,(RAY),(SDM),,,0.7,,,,
9,(SDM),(RAY),,,0.7,,,,


> Generating rules from the closed frequent itemsets loses no information (except that, with mlxtend, the additional measures are not computed in this case: this is a technical limitation, not a theoretical one).
> 
> For example, the rule {MFL} $\to$ {SDM} is absent because it is covered by the rule {MFL} $\to$ {RAY, SDM}, which has the same support.
> 
> For the maximal frequent itemsets, the rule {RAY} $\to$ {MDS} is absent because it is covered. We know that its confidence is at least that of {RAY} $\to$ {MDS, MFL} and of {RAY} $\to$ {MDS, SDM}, but its exact value cannot be recovered.

**Q24** Fill in the antecedent and consequent support values using the supports of the associated frequent itemsets (closed or maximal).

In [31]:
def sup(itemset, fq):
    return fq[fq['itemsets'] >= itemset]['support'].max()

In [32]:
# Example
it = {'RAY', 'MFL'}
print(sup(it, fq))
print(sup(it, clfq))
print(sup(it, maxfq))

0.7
0.7
0.5


In [33]:
# Q24 (closed frequent itemsets)
# For closed itemsets the recovered supports are exact
# All other measures (confidence, lift, leverage, etc.) can then be computed from them
clrl['antecedent support'] = clrl['antecedents'].apply(lambda x: sup(x, clfq))
clrl['consequent support'] = clrl['consequents'].apply(lambda x: sup(x, clfq))
clrl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(MDS),(CSA),0.8,0.6,0.5,,,,
1,(CSA),(MDS),0.6,0.8,0.5,,,,
2,(RAY),(CSA),0.9,0.6,0.5,,,,
3,(CSA),(RAY),0.6,0.9,0.5,,,,
4,(RAY),(MFL),0.9,0.7,0.7,,,,
5,(MFL),(RAY),0.7,0.9,0.7,,,,
6,(MDS),(RAY),0.8,0.9,0.7,,,,
7,(RAY),(MDS),0.9,0.8,0.7,,,,
8,(RAY),(SDM),0.9,0.7,0.7,,,,
9,(SDM),(RAY),0.7,0.9,0.7,,,,


In [34]:
# Q24 (maximal frequent itemsets)
# For maximal itemsets the recovered supports may be lower than the true values
maxrl['antecedent support'] = maxrl['antecedents'].apply(lambda x: sup(x, maxfq))
maxrl['consequent support'] = maxrl['consequents'].apply(lambda x: sup(x, maxfq))
maxrl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(MDS),(CSA),0.5,0.5,0.5,,,,
1,(CSA),(MDS),0.5,0.5,0.5,,,,
2,(RAY),(CSA),0.5,0.5,0.5,,,,
3,(CSA),(RAY),0.5,0.5,0.5,,,,
4,"(MDS, RAY)",(MFL),0.5,0.5,0.5,,,,
5,"(MDS, MFL)",(RAY),0.5,0.5,0.5,,,,
6,"(RAY, MFL)",(MDS),0.5,0.5,0.5,,,,
7,(MDS),"(RAY, MFL)",0.5,0.5,0.5,,,,
8,(RAY),"(MDS, MFL)",0.5,0.5,0.5,,,,
9,(MFL),"(MDS, RAY)",0.5,0.5,0.5,,,,
