In this notebook we evaluate the performance of the model, both overall and in subslices of

In [2]:
import pandas as pd
import joblib
import os

from pathlib import Path


In [3]:

path = Path(os.getcwd())
dirname = os.path.join(path.parent.absolute(), 'model', 'latest')
data_path = os.path.join(path.parent.absolute(), 'data', 'census.csv')

In [4]:
# Load the saved model, encoder, lb, and cat_features artifacts.
model = joblib.load(os.path.join(dirname, 'model'))
encoder = joblib.load(os.path.join(dirname, 'encoder'))
lb = joblib.load(os.path.join(dirname, 'lb'))
cat_features = joblib.load(os.path.join(dirname, 'cat_features'))

In [5]:
import sys
sys.path.append("../") # go to parent dir

In [6]:
from ml.data import process_data
from ml.model import inference, compute_model_metrics
from sklearn.metrics import confusion_matrix
from ml.eval import get_tn_fp_fn_tp, list_unique_features, show_performance_on_slices

In [7]:
data = pd.read_csv(data_path)
X, y, _, _ = process_data(
    data, categorical_features=cat_features, label="salary",
    encoder=encoder, lb=lb, training=False,
)
  

### First, an overall breakdown of metrics

In [8]:
preds = inference(model, X)
precision, recall, fbeta = compute_model_metrics(y, preds)
print(f"precision: {precision}, recall: {recall}, fbeta: {fbeta}")
tn, fp, fn, tp = get_tn_fp_fn_tp(y, preds)
print(f"TN, FP, FN, TP: {(tn, fp, fn, tp)}")

precision: 0.8046759396271086, recall: 0.6935339880117332, fbeta: 0.7449825330502088
TN, FP, FN, TP: (23400, 1320, 2403, 5438)


The model treats `<=50K` as the negative class and `>50K` as the positive class.
The dataset contains 24720 people who earn `<=50K` (negative) and 7841 who earn `>50K` (positive).
There are more False Negatives (2403) than False Positives (1320), and precision is better than recall.



Taking into account that
* a True Negative is when we correctly recognize that someone earns 50k or less,
* a False Positive is when someone earns 50k or less, but we predict that they earn more than 50k,
* a False Negative is when someone earns more than 50k, but we predict that they earn 50k or less,
* a True Positive is when we correctly recognize that someone earns more than 50k.


Overall we have a relatively high number of False Negatives, which means that the model more often predicts that someone earns less than they actually do than the other way around.
That is why recall, TP/(TP+FN), is lower than precision, TP/(TP+FP).
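
As a quick check, the two formulas can be evaluated directly on the counts printed above. This is a minimal sketch that only redoes the arithmetic and does not call the project's `compute_model_metrics`. The F-score line assumes beta = 1, which is consistent with the reported `fbeta` value.

```python
# Confusion-matrix counts copied from the overall output above.
tn, fp, fn, tp = 23400, 1320, 2403, 5438

precision = tp / (tp + fp)  # share of predicted >50K that really earn >50K
recall = tp / (tp + fn)     # share of actual >50K earners that the model finds
f1 = 2 * precision * recall / (precision + recall)  # assumes beta = 1

print(f"precision: {precision:.4f}, recall: {recall:.4f}, f1: {f1:.4f}")
print(f"actual negatives: {tn + fp}, actual positives: {tp + fn}")
```

Because FN (2403) is larger than FP (1320), the recall denominator grows more than the precision denominator, which is why recall ends up lower.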


In [9]:
list_unique_features(data, cat_features)


workclass : ['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' '?'
 'Self-emp-inc' 'Without-pay' 'Never-worked']
education : ['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 '1st-4th' 'Preschool' '12th']
marital-status : ['Never-married' 'Married-civ-spouse' 'Divorced' 'Married-spouse-absent'
 'Separated' 'Married-AF-spouse' 'Widowed']
occupation : ['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Machine-op-inspct' 'Tech-support' '?'
 'Protective-serv' 'Armed-Forces' 'Priv-house-serv']
relationship : ['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried' 'Other-relative']
race : ['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 'Other']
sex : ['Male' 'Female']
native-country : ['United-States' 'Cuba' 'Jamaica' 'India' '?' 'Mexico' 'South'
 'Puerto-Rico' 'Honduras' 'En

These are the ways the dataset can be sliced by the unique values of each categorical feature. Now we can check how the model treats different slices of the dataset.
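
To illustrate what a per-slice evaluation involves, here is a minimal sketch in plain Python. The helper `slice_metrics` and the toy data are hypothetical and written only for this illustration; this is not the code of `show_performance_on_slices` in `ml.eval`, whose implementation is not shown here.

```python
from collections import defaultdict


def slice_metrics(feature_values, y_true, y_pred):
    """Hypothetical helper: confusion counts, precision and recall per category."""
    counts = defaultdict(lambda: {"tn": 0, "fp": 0, "fn": 0, "tp": 0})
    for value, truth, pred in zip(feature_values, y_true, y_pred):
        if truth == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "fp" if pred == 1 else "tn"
        counts[value][key] += 1

    results = {}
    for value, c in counts.items():
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        results[value] = {
            **c,
            # Guard against empty denominators in very small slices.
            "precision": c["tp"] / predicted_pos if predicted_pos else float("nan"),
            "recall": c["tp"] / actual_pos if actual_pos else float("nan"),
        }
    return results


# Toy data, invented for illustration only (not taken from census.csv).
sex = ["Male", "Male", "Female", "Female", "Male", "Female"]
y_true = [1, 0, 1, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1]

for value, metrics in slice_metrics(sex, y_true, y_pred).items():
    print(value, metrics)
```

In the notebook, such a helper would be fed a column like `data["sex"]` together with `y` and `preds`, assuming these are aligned one-dimensional sequences of 0/1 labels.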


In [11]:
show_performance_on_slices(data, X, y, preds, cat_features)


==== SLICE FORF FEATURE : workclass
== CLS FOR FEATURE : State-gov
precision: 0.8126888217522659, recall: 0.7620396600566572, fbeta: 0.7865497076023392
TN, FP, FN, TP: (883, 62, 84, 269)
== CLS FOR FEATURE : Self-emp-not-inc
precision: 0.8134328358208955, recall: 0.6022099447513812, fbeta: 0.6920634920634922
TN, FP, FN, TP: (1717, 100, 288, 436)
== CLS FOR FEATURE : Private
precision: 0.8054408549914986, recall: 0.6681442675800927, fbeta: 0.730396475770925
TN, FP, FN, TP: (16932, 801, 1647, 3316)
== CLS FOR FEATURE : Federal-gov
precision: 0.7901234567901234, recall: 0.862533692722372, fbeta: 0.8247422680412372
TN, FP, FN, TP: (504, 85, 51, 320)
== CLS FOR FEATURE : Local-gov
precision: 0.7747163695299838, recall: 0.7747163695299838, fbeta: 0.7747163695299838
TN, FP, FN, TP: (1337, 139, 139, 478)
== CLS FOR FEATURE : ?
precision: 0.8035714285714286, recall: 0.4712041884816754, fbeta: 0.5940594059405941
TN, FP, FN, TP: (1623, 22, 101, 90)
== CLS FOR FEATURE : Self-emp-inc
precision: 0.8

We have printed precision and recall for this model over different slices of the dataset, and some conclusions can be drawn.

Marital status gives an example of how the model can be biased for some categories.

``` 
==== SLICE FORF FEATURE : marital-status
== CLS FOR FEATURE : Never-married
precision: 0.9513274336283186, recall: 0.4378818737270876, fbeta: 0.599721059972106
TN, FP, FN, TP: (10181, 11, 276, 215)
== CLS FOR FEATURE : Married-civ-spouse
precision: 0.792938459000165, recall: 0.7181709503885236, fbeta: 0.7537050105857446
TN, FP, FN, TP: (7029, 1255, 1886, 4806)
== CLS FOR FEATURE : Divorced
precision: 0.8744186046511628, recall: 0.4060475161987041, fbeta: 0.5545722713864307
TN, FP, FN, TP: (3953, 27, 275, 188)
```

People who have never married get very few False Positives (11), but more False Negatives (276) than True Positives (215). That is, the model tends to assume that never-married people earn less than they actually do. This shows up as a higher precision and a lower recall than the overall values.
The model is usually right when it predicts that a never-married person makes more than 50k, but it incorrectly flags many never-married people as low-income when in reality they make more. A likely reason is that most never-married people in the dataset have a low income, so the model may treat this status as an indication of low income. We see the same phenomenon for divorced people.

For Married-civ-spouse entries, on the other hand, we do not see this effect. The False Negative rate is even lower than the overall average, and recall is correspondingly higher (0.72 versus 0.69 overall). In other words, the model less often flags a married person as low-income when they are not.
Precision is comparable to that of the full dataset (0.79 versus 0.80), which suggests that the model is not as eager as for never-married entries to flag someone as low-income, and may treat this status as roughly neutral.
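
The comparison can be made concrete with two numbers per slice: the share of people in the slice who really earn more than 50k, and the false negative rate FN/(FN+TP). The sketch below only redoes arithmetic on the counts quoted above. The names `slices` and `summary` are introduced for this illustration.

```python
# (TN, FP, FN, TP) copied from the marital-status output quoted above.
slices = {
    "Never-married": (10181, 11, 276, 215),
    "Married-civ-spouse": (7029, 1255, 1886, 4806),
    "Divorced": (3953, 27, 275, 188),
}

summary = {}
for name, (tn, fp, fn, tp) in slices.items():
    total = tn + fp + fn + tp
    base_rate = (tp + fn) / total  # share of the slice that really earns >50K
    fnr = fn / (fn + tp)           # share of >50K earners predicted as <=50K
    summary[name] = (base_rate, fnr)
    print(f"{name}: base rate {base_rate:.3f}, false negative rate {fnr:.3f}")
```

The slices with a low share of high earners (never-married, divorced) are also the ones where more than half of the actual high earners are missed.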


```
==== SLICE FORF FEATURE : sex
== CLS FOR FEATURE : Male
precision: 0.8010186160871092, recall: 0.6846292404683278, fbeta: 0.738264810618323
TN, FP, FN, TP: (13995, 1133, 2101, 4561)
== CLS FOR FEATURE : Female
precision: 0.8169642857142857, recall: 0.6208651399491094, fbeta: 0.7055421686746988
TN, FP, FN, TP: (9428, 164, 447, 732)
```

```
== CLS FOR FEATURE : White
precision: 0.8037880046519356, recall: 0.6797808065196009, fbeta: 0.7366017052375152
TN, FP, FN, TP: (19518, 1181, 2279, 4838)
== CLS FOR FEATURE : Black
precision: 0.823943661971831, recall: 0.6046511627906976, fbeta: 0.6974664679582713
TN, FP, FN, TP: (2687, 50, 153, 234)
```


As expected, we see the same phenomenon in other splits, although it is not as dramatic as one might fear. Women have lower recall than men (0.62 versus 0.68) and are thus more often incorrectly flagged as low-income. The same holds for Black people compared to White people (0.60 versus 0.68).
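
One way to put a number on this is the difference in recall (true positive rate) between two groups, sometimes called the equal opportunity difference. The sketch below computes it from the counts quoted above. The names `groups`, `sex_gap` and `race_gap` are introduced for this illustration, and the choice of this particular fairness measure is an assumption rather than something the notebook prescribes.

```python
# (FN, TP) copied from the sex and race outputs quoted above.
groups = {
    "Male": (2101, 4561),
    "Female": (447, 732),
    "White": (2279, 4838),
    "Black": (153, 234),
}

recall = {name: tp / (tp + fn) for name, (fn, tp) in groups.items()}

# Difference in recall between groups; 0 would mean equal true positive rates.
sex_gap = recall["Male"] - recall["Female"]
race_gap = recall["White"] - recall["Black"]
print(f"recall gap Male - Female: {sex_gap:.3f}")
print(f"recall gap White - Black: {race_gap:.3f}")
```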
