# Human-AI Card

To make onboarding smoother for people working with an AI assistant, we introduce the Human-AI Card. The card summarizes the AI's capabilities, training, and performance. Its core fields are listed below:

## Model card presented to the human as part of onboarding

| Information                         | Description |
| ----------------------------------- | ----------- |
| AI Input                            | What the AI uses to make its prediction |
| AI Output                           | What the AI provides as output (predictions, explanations, ...) |
| Source of Training Data for AI      | Description of data used to train the AI |
| Source of Pre-Training Data of AI   | Description of pre-training data of model AI is based on (if relevant) |
| Training Objective of AI            | What the AI is trying to achieve (minimize classification error, detect objects, next word prediction) |
| Average AI Performance              | Overall AI performance metrics (accuracy, FPR, AUC, ...) |
| Average Human Performance           | Overall human performance metrics (accuracy, FPR, AUC, ...) |

Additionally, you should provide a breakdown of AI and human performance on different subgroups of the data. You may not have enough human data to break down human performance, but you should have enough to do so for the AI model.
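As a minimal sketch of how the average-performance rows of the card might be filled in, the snippet below computes accuracy and false positive rate (FPR) for binary labels with scikit-learn. The arrays `y_true`, `ai_preds`, and `human_preds` are made-up toy data, and the helper `summarize` is hypothetical rather than part of this repository.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix


def summarize(y_true, y_pred):
    """Return accuracy and false positive rate for binary 0/1 labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # FPR is undefined when there are no negative examples
    fpr = fp / (fp + tn) if (fp + tn) > 0 else float("nan")
    return {"accuracy": accuracy_score(y_true, y_pred), "fpr": fpr}


# Toy data (hypothetical): 8 examples with AI and human predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
ai_preds = np.array([0, 0, 0, 1, 1, 1, 1, 0])
human_preds = np.array([0, 0, 1, 1, 1, 1, 1, 1])

ai_summary = summarize(y_true, ai_preds)
human_summary = summarize(y_true, human_preds)
print("AI:", ai_summary)
print("Human:", human_summary)
```

The same helper can be applied to any subset of rows, which is how a per-subgroup breakdown could be produced once subgroups are defined.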

**This notebook** lets you find subgroups of your data, defined by the intersection of up to three discrete features, on which the AI's performance is statistically different from its average performance.
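To illustrate what "the intersection of up to three discrete features" means, here is a small sketch that enumerates such subgroups with `itertools.combinations` and a pandas `groupby`. The function `enumerate_subgroups` and the toy columns are hypothetical; this is not necessarily how `ai_error_analysis` builds its subgroups internally.

```python
from itertools import combinations

import pandas as pd


def enumerate_subgroups(df, feature_cols, max_order=3, min_sample_count=1):
    """Yield (columns, values, row_index) for every observed intersection of
    up to `max_order` discrete features with at least `min_sample_count` rows."""
    for order in range(1, max_order + 1):
        for cols in combinations(feature_cols, order):
            for values, group in df.groupby(list(cols)):
                # Older pandas returns a scalar key for a single grouping column
                if not isinstance(values, tuple):
                    values = (values,)
                if len(group) >= min_sample_count:
                    yield cols, values, group.index


# Toy metadata (hypothetical values)
toy = pd.DataFrame({
    "timeofday": ["night", "night", "day", "day"],
    "weather": ["rainy", "clear", "rainy", "clear"],
    "scene": ["highway", "highway", "city street", "highway"],
})

subgroups = list(enumerate_subgroups(toy, ["timeofday", "weather", "scene"]))
for cols, values, idx in subgroups:
    print(dict(zip(cols, values)), "n =", len(idx))
```

Only combinations that actually occur in the data are produced, and raising `min_sample_count` drops subgroups too small to analyze.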

We use the Berkeley DeepDrive (BDD) dataset as an example, where a subgroup might comprise images taken at night, in rainy weather, on a highway. We compute the model's error for each possible subgroup and then perform a two-sample t-test comparing the subgroup's error to the model's error over the entire dataset. For our user studies, we highlight subgroups defined by a single metadata category that show statistically significant differences ($p \leq 0.05$). A rigorous analysis would correct for multiple hypothesis testing. However, given the large number of metadata categories, many results might no longer be significant after such a correction, so for simplicity we adopt this heuristic approach.
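The test described above can be sketched as follows. The example uses `scipy.stats.ttest_ind`, the call that appears in the warning output further down in this notebook; whether `ai_error_analysis` applies it in exactly this way is an assumption. The error vectors are synthetic, `n_tests` is a hypothetical count of subgroups tested, and the Bonferroni threshold is shown only as one common multiple-testing correction, not as something the notebook applies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-example errors (1 = model was wrong), for illustration only.
n = 10_000
in_subgroup = np.zeros(n, dtype=bool)
in_subgroup[:500] = True  # the first 500 examples form the subgroup
errors = np.where(
    in_subgroup,
    rng.binomial(1, 0.30, size=n),  # higher error rate inside the subgroup
    rng.binomial(1, 0.16, size=n),  # baseline error rate elsewhere
)

# Compare subgroup errors against errors over the entire dataset
t_stat, p_value = stats.ttest_ind(errors[in_subgroup], errors)

alpha = 0.05
n_tests = 40  # hypothetical number of subgroups tested
significant_uncorrected = p_value <= alpha
significant_bonferroni = p_value <= alpha / n_tests  # stricter threshold

print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
print("significant at 0.05:", significant_uncorrected)
print("significant after Bonferroni:", significant_bonferroni)
```

With many subgroups, the corrected threshold `alpha / n_tests` becomes small quickly, which is why borderline results such as $p \approx 0.04$ would not survive the correction.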


Here is a concrete human-AI card to present before a user study:

![human-ai card for a BDD object detection task](./bdd_card.PNG)





The `ai_error_analysis` function requires that the subgroups you wish to find are defined by discrete features.
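If some of your metadata is continuous, it has to be bucketed before it can define subgroups. A minimal sketch using `pandas.cut` follows; the `speed_kmh` feature, the bin edges, and the labels are all hypothetical and not part of the BDD metadata.

```python
import pandas as pd

# Hypothetical continuous feature: vehicle speed in km/h
speeds = pd.Series([12.0, 47.5, 88.0, 130.0, 5.0], name="speed_kmh")

# Bins are right-inclusive by default: (0, 30], (30, 90], (90, inf]
speed_bucket = pd.cut(
    speeds,
    bins=[0, 30, 90, float("inf")],
    labels=["slow", "medium", "fast"],
)
print(speed_bucket.tolist())
```

The resulting categorical column can then be used like any other discrete metadata field.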

In [1]:
import logging
import math
import os
import pickle
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from sklearn.metrics import classification_report
from tqdm import tqdm

logging.getLogger().setLevel(logging.INFO)

# Make the repository root importable before loading the project modules
sys.path.append("../")
from utils.utils import *
from utils.metrics_hai import *
from human_ai_card.error_analysis import *
from datasets_hai.bdd import *

1.12.1 True




2.28.2


In [2]:
with open("../data/cleaned_pkl/bdd_dataset.pkl", "rb") as f:
    dataset = pickle.load(f)

In [3]:
print(classification_report(dataset.data_y, dataset.ai_preds))


              precision    recall  f1-score   support

           0       0.75      0.96      0.84      4349
           1       0.96      0.75      0.84      5651

    accuracy                           0.84     10000
   macro avg       0.85      0.85      0.84     10000
weighted avg       0.87      0.84      0.84     10000



In [6]:
dataset_pd = pd.DataFrame({
    'true_label': dataset.data_y,
    'predicted_label': dataset.ai_preds,
    'metadata': list(dataset.metadata),
})
# min_sample_count defines the smallest subgroup size to consider
md = ai_error_analysis(dataset_pd, dataset.metadata_labels, min_sample_count=100)

binary


  metadata_metrics_df = metadata_metrics_df.append(overall_df_row, ignore_index=True)
  metadata_metrics_df = metadata_metrics_df.iloc[[-1]].append(metadata_metrics_df.iloc[:-1]).reset_index(drop=True)
  t_stat, p_value = stats.ttest_ind(subgroup, overall)


In [7]:
md.loc[(md['analysis_type'] == 'univariate') & (md['significantly_different'] == 'yes')].sort_values('get_len', ascending=False)



| index | analysis_type | category | subcategory | accuracy_score | confusion_matrix_metric_0 | confusion_matrix_metric_1 | confusion_matrix_metric_2 | confusion_matrix_metric_3 | get_len | significantly_different | p_value |
| ----- | ------------- | -------- | ----------- | -------------- | ------------------------- | ------------------------- | ------------------------- | ------------------------- | ------- | ----------------------- | ------- |
| 18 | univariate | car | alot | 0.826427 | 0.965372 | 0.034628 | 0.269774 | 0.730226 | 7553.0 | yes | 0.01119001 |
| 10 | univariate | scene | city street | 0.828698 | 0.942753 | 0.057247 | 0.213199 | 0.786801 | 6112.0 | yes | 0.04225091 |
| 34 | univariate | traffic sign | few | 0.824871 | 0.964091 | 0.035909 | 0.278665 | 0.721335 | 6007.0 | yes | 0.008733935 |
| 32 | univariate | traffic light | none | 0.96321 | 0.96321 | 0.03679 | 0.0 | 0.0 | 4349.0 | yes | 2.095704e-95 |
| 31 | univariate | traffic light | few | 0.629226 | 0.0 | 0.0 | 0.370774 | 0.629226 | 3668.0 | yes | 1.29443e-160 |
| 11 | univariate | scene | highway | 0.869148 | 0.976836 | 0.023164 | 0.392318 | 0.607682 | 2499.0 | yes | 0.0004567342 |
| 19 | univariate | car | few | 0.886647 | 0.956522 | 0.043478 | 0.187114 | 0.812886 | 2329.0 | yes | 2.0147e-08 |
| 33 | univariate | traffic sign | alot | 0.895191 | 0.965625 | 0.034375 | 0.133632 | 0.866368 | 2204.0 | yes | 9.244689e-11 |
| 30 | univariate | traffic light | alot | 0.964196 | 0.0 | 0.0 | 0.035804 | 0.964196 | 1983.0 | yes | 4.083501e-48 |
| 3 | univariate | weather | partly cloudy | 0.879404 | 0.978571 | 0.021429 | 0.251572 | 0.748428 | 738.0 | yes | 0.00545678 |
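In the output above, `confusion_matrix_metric_0` and `confusion_matrix_metric_1` sum to 1 in each row, as do `confusion_matrix_metric_2` and `confusion_matrix_metric_3` (or both are 0). This suggests, though the notebook does not state it, that the four columns are a row-normalized binary confusion matrix flattened in the order TNR, FPR, FNR, TPR. Under that assumption, the values could be reproduced for toy labels like this:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 4 negatives and 6 positives
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# normalize="true" divides each row by the number of examples of that true class
cm = confusion_matrix(y_true, y_pred, labels=[0, 1], normalize="true")
tnr, fpr, fnr, tpr = cm.ravel()
print(f"TNR={tnr:.3f} FPR={fpr:.3f} FNR={fnr:.3f} TPR={tpr:.3f}")

# A subgroup containing only one true class gets a row of zeros for the other
only_negatives = confusion_matrix(
    np.zeros(5, dtype=int), np.array([0, 0, 0, 0, 1]),
    labels=[0, 1], normalize="true",
)
print(only_negatives.ravel())
```

The second call mirrors rows such as `traffic light` / `none`, where the last two metrics are 0.0 because the subgroup appears to contain no positive examples.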
