# Challenge Large Scale Machine Learning: Starting Kit

### Authors: 
#### Pavlo Mozharovskyi (pavlo.mozharovskyi@telecom-paris.fr), Nathan Noiry (nathan.noiry@telecom-paris.fr)

### Training data

The training set consists of two files, **train_data.npy** and **train_labels.txt**.

File **train_data.npy** contains one observation per row. Each observation is the concatenation of two templates, each of dimension 48, so a row has 96 entries.

File **train_labels.txt** contains one column. Each entry corresponds to one observation in **train_data.npy**, in the same order, and is '1' if the two images of the pair belong to the same person and '0' otherwise.

In total, there are 267508 observations.
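
As a rough illustration of this layout, the sketch below builds a small synthetic array with the same column structure and splits each row into its two templates. The random values and the names `X_demo`, `first` and `second` are placeholders introduced here, not challenge data.

```python
import numpy as np

TEMPLATE_DIM = 48  # dimension of one template, as stated above

# Synthetic stand-in for train_data.npy: 5 rows, 2 * 48 columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 2 * TEMPLATE_DIM))

# Columns 0..47 hold the first template, columns 48..95 the second.
first = X_demo[:, :TEMPLATE_DIM]
second = X_demo[:, TEMPLATE_DIM:]

print(first.shape, second.shape)
```

The same slicing should apply to the array returned by `np.load("train_data.npy")`, assuming the file follows the layout described above.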

### Performance criterion

You are asked to minimize the sum of the False Positive Rate (FPR) and the False Negative Rate (FNR). This amounts to maximizing:
$$ 1 - (FNR+FPR) $$
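
To make the criterion concrete, here is a minimal sketch that computes FPR, FNR and the score on a tiny hand-made example. The helper `fpr_fnr` and the toy labels are illustrative assumptions, not part of the starting kit.

```python
import numpy as np

def fpr_fnr(y_true, y_pred):
    """Return (FPR, FNR) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    return fp / (fp + tn), fn / (fn + tp)

# Toy example: 4 negatives (1 misclassified), 4 positives (2 misclassified).
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

fpr, fnr = fpr_fnr(y_true_demo, y_pred_demo)
score = 1 - (fpr + fnr)
print(fpr, fnr, score)
```

Here FPR is 1/4 and FNR is 2/4, so the score is 0.25. A perfect classifier scores 1, while a classifier that ignores its input has FPR + FNR = 1 in expectation and therefore a score of about 0.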

### Fairness Performance criterion

Moreover, we want the prediction to be as fair as possible with respect to the gender attribute. In our case, we want to make the ratios
$$ BFPR := \frac{\max(FPR(male),FPR(female))}{\mathrm{GeomMean}(FPR(male),FPR(female))} \geq 1 $$
and
$$ BFNR := \frac{\max(FNR(male),FNR(female))}{\mathrm{GeomMean}(FNR(male),FNR(female))} \geq 1 $$
as close to 1 as possible, which corresponds to their minimum value. <br>
Here, GeomMean stands for the geometric mean, which is equal to $\sqrt{xy}$ for two nonnegative real numbers $x$ and $y$.
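
As a small worked example of these ratios, the sketch below evaluates the formula for two pairs of rates. The helper name `bias_ratio` and the numeric rates are hypothetical. For strictly positive rates, the ratio simplifies to $\sqrt{\max/\min}$, so it only depends on how far apart the two groups are.

```python
import math

def bias_ratio(rate_a, rate_b):
    """max(rate_a, rate_b) / GeomMean(rate_a, rate_b), for strictly positive rates."""
    return max(rate_a, rate_b) / math.sqrt(rate_a * rate_b)

# Equal rates in both groups: the ratio is at its minimum, 1.
equal = bias_ratio(0.02, 0.02)

# One group has a rate 4 times larger: the ratio is sqrt(4) = 2.
unequal = bias_ratio(0.04, 0.01)

print(equal, unequal)
```

The zero-rate cases are not covered by this helper; they need separate handling, since the geometric mean is then 0.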

For a given input $\mathbf{x}$ made of two templates, the gender attribute of the first template is $\mathbf{x}[8]$ and the one of the second template is $\mathbf{x}[56]$. The value 1 is for "male" and 0 for "female".

# Example of submission

In [21]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_curve

np.random.seed(seed=42)

In [22]:
def extract_labels(txt_file):
    with open(txt_file) as file:
        lines = file.readlines()
    y = []
    for elem in lines:
        label = int(elem[0])
        y.append(label)
    y = np.array(y)
    return y

In [23]:
X, y = np.load("train_data.npy"), extract_labels("train_labels.txt")
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1)

In [24]:
models = []
for i in [2, 5, 10]:
    clf = RandomForestClassifier(n_estimators=i)
    clf.fit(X_train, y_train)   
    models.append(clf)

In [15]:
# mask males
mask_males = (X_valid[:,8] == X_valid[:,56]) & (X_valid[:,8] == 1)
# mask females
mask_females = (X_valid[:,8] == X_valid[:,56]) & (X_valid[:,8] == 0)
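
Note that the two masks above keep only pairs whose templates share the same gender, so mixed pairs belong to neither group. The following sketch, on a synthetic array where only columns 8 and 56 are filled in (all names ending in `_demo` are placeholders), shows how one might count each kind of pair.

```python
import numpy as np

# Synthetic stand-in for X_valid: only the two gender columns matter here.
X_demo = np.zeros((6, 96))
X_demo[:, 8] = [1, 1, 0, 0, 1, 0]   # gender of the first template
X_demo[:, 56] = [1, 0, 0, 1, 1, 0]  # gender of the second template

mask_males_demo = (X_demo[:, 8] == 1) & (X_demo[:, 56] == 1)
mask_females_demo = (X_demo[:, 8] == 0) & (X_demo[:, 56] == 0)
mask_mixed_demo = X_demo[:, 8] != X_demo[:, 56]

# Number of male-male, female-female and mixed pairs.
print(mask_males_demo.sum(), mask_females_demo.sum(), mask_mixed_demo.sum())
```

In this toy array each of the three groups contains two pairs. Checking the group sizes on the real validation split can help judge how stable the per-group rates are.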

In [25]:
def criterion(y_pred, y_true):
    CM = confusion_matrix(y_true, y_pred)
    TN, TP = CM[0, 0], CM[1, 1]
    FP, FN = CM[0, 1], CM[1, 0]
    return 1 - ( FP/(FP + TN) + FN/(FN + TP) )

def fairness_criterion(y_pred, y_true, mask_males, mask_females):
    y_true_male = y_true[mask_males]
    y_true_female = y_true[mask_females]
    y_pred_male = y_pred[mask_males]
    y_pred_female = y_pred[mask_females]

    CM_m = confusion_matrix(y_true_male, y_pred_male)
    TN_m, TP_m = CM_m[0, 0], CM_m[1, 1]
    FP_m, FN_m = CM_m[0, 1], CM_m[1, 0]
    FNR_m = FN_m/(FN_m + TP_m)
    FPR_m = FP_m/(FP_m + TN_m)

    CM_f = confusion_matrix(y_true_female, y_pred_female)
    TN_f, TP_f = CM_f[0, 0], CM_f[1, 1]
    FP_f, FN_f = CM_f[0, 1], CM_f[1, 0]
    FNR_f = FN_f/(FN_f + TP_f)
    FPR_f = FP_f/(FP_f + TN_f)

    if min(FNR_m, FNR_f) == 0:
        if max(FNR_m, FNR_f) == 0:
            BFNR = 1
        else:
            BFNR = np.inf  # np.Infinity was removed in NumPy 2.0
    else:
        BFNR = max(FNR_m, FNR_f) / np.sqrt(FNR_m * FNR_f)

    if min(FPR_m, FPR_f) == 0:
        if max(FPR_m, FPR_f) == 0:
            BFPR = 1
        else:
            BFPR = np.inf  # np.Infinity was removed in NumPy 2.0
    else:
        BFPR = max(FPR_m, FPR_f) / np.sqrt(FPR_m * FPR_f)

    return BFPR, BFNR

In [26]:
for n, i in enumerate([2, 5, 10]):
    print('RF with {} estimators'.format(i))
    y_pred = models[n].predict(X_valid)
    score_valid = criterion(y_pred, y_valid)
    BFPR_valid, BFNR_valid = fairness_criterion(y_pred, y_valid, mask_males, mask_females)
    print('1 - (FPR + FNR) = {}'.format(score_valid))
    print('Fairness scores: BFPR={}, BFNR={}'.format(BFPR_valid, BFNR_valid))
    print('---------------')

RF with 2 estimators
1 - (FPR + FNR) = 0.36018032566276403
Fairness scores: BFPR=1.2581889451648887, BFNR=1.1919824566836357
---------------
RF with 5 estimators
1 - (FPR + FNR) = 0.4703270084779976
Fairness scores: BFPR=1.2617382786986342, BFNR=1.3848258933600173
---------------
RF with 10 estimators
1 - (FPR + FNR) = 0.5121787108060827
Fairness scores: BFPR=1.3378775520350108, BFNR=1.3926933627069833
---------------


The results above show that improving the score can degrade the fairness metrics: going from 2 to 10 estimators raises the score, but both BFPR and BFNR increase as well. Your job is to find a good trade-off.
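
One possible way to explore this trade-off, sketched here under assumptions, is to threshold the predicted probabilities instead of using `predict` directly and to record the score at each threshold. The data below is synthetic (generated with `make_classification`), and the names `rates`, `clf_demo` and `results` are introduced for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def rates(y_true, y_pred):
    """Return (FPR, FNR) for binary labels in {0, 1}."""
    fpr = np.mean(y_pred[y_true == 0] == 1)
    fnr = np.mean(y_pred[y_true == 1] == 0)
    return fpr, fnr

# Synthetic data, used only to keep the sketch self-contained.
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)
proba = clf_demo.predict_proba(X_va)[:, 1]  # estimated probability of label 1

# Sweep the decision threshold and record the score 1 - (FPR + FNR).
results = []
for threshold in np.linspace(0.1, 0.9, 9):
    y_hat = (proba >= threshold).astype(int)
    fpr, fnr = rates(y_va, y_hat)
    results.append((threshold, 1 - (fpr + fnr)))
    print('threshold={:.1f}  score={:.3f}'.format(threshold, 1 - (fpr + fnr)))
```

On the challenge data, the same loop could call `criterion` and `fairness_criterion` at each threshold, so that the score, BFPR and BFNR can be compared side by side.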

### Prepare a file for submission

In [20]:
# Load test data
X_test = np.load("test_data.npy")
# Classify the provided test data with clf, the last model fitted above (10 estimators)
y_test = clf.predict(X_test).astype(np.int8)
np.savetxt('y_test_challenge_student_TEST.txt', y_test, delimiter=',')
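
Before uploading, it may be worth reading the file back to check that it has one prediction per line and only contains the two labels. The sketch below does this on placeholder predictions written to a temporary file; the file name `y_demo.txt` is hypothetical.

```python
import os
import tempfile

import numpy as np

y_demo = np.array([0, 1, 1, 0], dtype=np.int8)  # placeholder predictions

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'y_demo.txt')  # hypothetical file name
    np.savetxt(path, y_demo, delimiter=',')
    y_back = np.loadtxt(path)  # read the file back as floats

# One prediction per line, every value either 0 or 1.
assert y_back.shape == y_demo.shape
assert set(np.unique(y_back)) <= {0.0, 1.0}
print(y_back)
```

With its default format, `np.savetxt` writes each value in scientific notation; passing `fmt='%d'` would write plain integers instead. Which of the two the submission platform expects is not stated here, so check the platform instructions before changing the format.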

### About the evaluation

For each new submission, you will have access to your rank relative to the other participants in terms of
<ul>
    <li> the score, </li>
    <li> the BFPR, </li>
    <li> the BFNR. </li>
</ul>
Your final ranking is an aggregation of these three rankings. Beware that each new submission can degrade your position. 