# Held-out evaluation

This notebook examines the results of the held-out evaluation.

In [1]:
import pandas as pd

### Load Ground Truth
Papers annotated by Jerome

In [2]:
ground_truth = pd.read_csv('../fmri_participant_demographics/data/outputs/evaluation_labels.csv')
unique_ids = ground_truth.pmcid.unique()

### GPT predictions
- First load the clean predictions and subset them to the papers in the ground truth (i.e. not those in the training sample)
- Also load the uncleaned predictions, to look at errors in papers that were cleaned up

In [3]:
gpt3_clean_preds = pd.read_csv('../fmri_participant_demographics/data/outputs/gpt/eval_participant_demographics_gpt_tokens-2000_clean.csv')
gpt3_all_preds = pd.read_csv('../fmri_participant_demographics/data/outputs/gpt/eval_participant_demographics_gpt_tokens-2000.csv')
embeddings = pd.read_parquet('../fmri_participant_demographics/data/outputs/gpt/eval_embeddings_tokens-2000.parquet')

In [4]:
gpt4_clean_preds = pd.read_csv('../fmri_participant_demographics/data/outputs/gpt/eval_participant_demographics_gpt4_tokens-2000_clean.csv')
gpt4_all_preds = pd.read_csv('../fmri_participant_demographics/data/outputs/gpt/eval_participant_demographics_gpt4_tokens-2000.csv')

In [5]:
gpt4_turbo_clean_preds = pd.read_csv('../fmri_participant_demographics/data/outputs/gpt/eval_participant_demographics_gpt-4-1106-preview_tokens-2000_clean.csv')
gpt4_turbo_all_preds = pd.read_csv('../fmri_participant_demographics/data/outputs/gpt/eval_participant_demographics_gpt-4-1106-preview_tokens-2000.csv')

In [6]:
(gpt4_turbo_clean_preds[gpt4_turbo_clean_preds.pmcid == 8752963]['final'] == True).any()

True

In [7]:
def _keep_final(df):
    """Within a PMCID, keep only the rows annotated as final if any exist; otherwise keep all rows."""
    is_final = df['final'] == True  # explicit comparison, so missing values count as not final
    if is_final.any():
        return df[is_final]
    return df
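
A small self-contained illustration of this rule on made-up data may help. The helper is repeated under the hypothetical name `keep_final` only so the snippet runs on its own, and the PMCIDs and counts are invented:

```python
import pandas as pd

def keep_final(df):
    """Stand-alone copy of the rule: keep final rows if any exist, else keep everything."""
    is_final = df['final'] == True
    return df[is_final] if is_final.any() else df

# Invented predictions: paper 1 has one row marked final, paper 2 has none
toy = pd.DataFrame({
    'pmcid': [1, 1, 2, 2],
    'count': [40, 36, 20, 22],
    'final': [False, True, False, False],
})

# Apply the rule per paper, then sum the counts that survive
kept = pd.concat([keep_final(group) for _, group in toy.groupby('pmcid')])
summed = kept.groupby('pmcid')['count'].sum()
print(summed)
```

For paper 1 only the final row contributes to the sum, while for paper 2 both rows are kept and added together.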

In [8]:
def _merge_score(clean_preds, all_preds, ground_truth, unique_ids):
    """Restrict predictions to the evaluation PMCIDs and score the summed count per paper."""
    clean_preds = clean_preds[clean_preds.pmcid.isin(unique_ids)]
    all_preds = all_preds[all_preds.pmcid.isin(unique_ids)]

    # GPT-4 Turbo output has a 'final' column; keep only the final rows where present
    if 'final' in clean_preds:
        clean_preds = clean_preds.groupby('pmcid').apply(_keep_final).reset_index(drop=True)
        all_preds = all_preds.groupby('pmcid').apply(_keep_final).reset_index(drop=True)

    # Sum participant counts over all groups within each paper
    # (selecting 'count' first avoids summing the non-numeric columns)
    clean_sum_count = clean_preds.groupby('pmcid')['count'].sum().reset_index()
    gt_sum_count = ground_truth.groupby('pmcid')['count'].sum().reset_index()
    merged = pd.merge(clean_sum_count, gt_sum_count, on='pmcid')
    merged = merged.rename(columns={'count_y': 'true_count', 'count_x': 'prediction'})

    # Score prediction error: absolute error relative to the true count
    merged['pe'] = abs((merged['true_count'] - merged['prediction']) / merged['true_count'])
    merged = merged.sort_values('pe')

    return clean_preds, all_preds, merged
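
To make the `pe` score concrete, here is a minimal worked example on invented totals (the PMCIDs and counts below are made up, not taken from the evaluation set). It uses the same formula as `_merge_score`:

```python
import pandas as pd

# Invented per-paper totals: exact match, undercount by half, large overcount
merged = pd.DataFrame({
    'pmcid': [101, 102, 103],
    'prediction': [30, 10, 78],
    'true_count': [30, 20, 32],
})

# Absolute error relative to the true count, as in _merge_score
merged['pe'] = abs((merged['true_count'] - merged['prediction']) / merged['true_count'])
print(merged)
```

Note the asymmetry: an undercount can contribute at most 1.0 (predicting zero participants), whereas an overcount is unbounded, so large mean errors are driven by overcounts.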

In [9]:
gpt3_clean_preds, gpt3_all_preds, gpt3_merged = _merge_score(gpt3_clean_preds, gpt3_all_preds, ground_truth, unique_ids)

In [10]:
gpt4_clean_preds, gpt4_all_preds, gpt4_merged = _merge_score(gpt4_clean_preds, gpt4_all_preds, ground_truth, unique_ids)

In [11]:
gpt4_turbo_clean_preds, gpt4_turbo_all_preds, gpt4_turbo_merged = _merge_score(gpt4_turbo_clean_preds, gpt4_turbo_all_preds, ground_truth, unique_ids)

## Scores

In [12]:
gpt3_merged['pe'].median()

0.009174311926605505

In [13]:
gpt3_merged['pe'].quantile(0.75)

0.2727272727272727

In [14]:
gpt3_merged['pe'].mean()

2.0561043082459176

In [15]:
gpt4_merged['pe'].median()

0.0

In [16]:
gpt4_merged['pe'].quantile(0.75)

0.2902097902097902

In [17]:
gpt4_merged['pe'].mean()

2.9548877383848384

Excluding the single largest error for each model (the last row, since the tables are sorted by `pe`), GPT-4 performs slightly better on mean error:

In [18]:
gpt3_merged.iloc[0:-1]['pe'].mean()

0.3697670955816621

In [19]:
gpt4_merged.iloc[0:-1]['pe'].mean()

0.29696997111410156
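
The gap between the mean and the median above is what a single extreme value would produce. A toy sketch with invented error values (not the real scores) shows the effect of dropping the last row of a sorted series:

```python
import pandas as pd

# Invented errors, sorted ascending; the last value plays the role of the outlier
pe = pd.Series([0.0, 0.0, 0.1, 0.3, 50.0]).sort_values()

mean_all = pe.mean()                   # dominated by the outlier
mean_without_max = pe.iloc[:-1].mean() # same slicing as iloc[0:-1] above
median_all = pe.median()               # barely affected by the outlier

print(mean_all, mean_without_max, median_all)
```

This is why the median and the 75th percentile are reported alongside the mean.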

In [20]:
gpt4_turbo_merged['pe'].median()

0.0

In [21]:
gpt4_turbo_merged['pe'].mean()

0.13866483482514214

In [22]:
gpt4_turbo_merged[-10:]

Unnamed: 0,pmcid,prediction,true_count,pe
22,3913832,31,62,0.5
89,7493988,10,20,0.5
90,7539836,43,93,0.537634
31,4349631,16,38,0.578947
46,4983635,19,64,0.703125
39,4522562,8,46,0.826087
20,3893192,102,51,1.0
80,7038454,217,104,1.086538
12,3672681,78,32,1.4375
27,4215530,136,46,1.956522


## Explore

In [50]:
tid = 3672681

In [51]:
gpt4_turbo_all_preds[gpt4_turbo_all_preds.pmcid == tid]

Unnamed: 0,count,diagnosis,group_name,final,rank,start_char,end_char,pmcid,female count,age range,male count,age mean,age minimum,age maximum,subgroup_name,age median
16,46,,healthy,True,0,13846,15500,3672681,,,,,,,behavioral study,
17,32,,healthy,True,0,13846,15500,3672681,,,,,,,fMRI study,


In [53]:
ground_truth[ground_truth.pmcid == tid]

Unnamed: 0,group_name,subgroup_name,project_name,annotator_name,pmcid,diagnosis,count,male count,age mean,female count,age minimum,age maximum,age median
16,healthy,_,participant_demographics,Jerome_Dockes,3672681,,32,,,,,,


In [54]:
content = embeddings[(embeddings.pmcid == tid) & (embeddings.start_char == gpt4_all_preds[gpt4_all_preds.pmcid == tid].iloc[0].start_char)].iloc[0].content

In [55]:
content

'\n## Materials and methods \n  \n### Participants \n  \nParticipants were recruited from a cohort of 615 young (behavioral study: age range 18–30 years, mean 23.65 ± 2.86; fMRI study: age range 19–30 years, mean 23.00 ± 2.51), healthy volunteers of a large-scale behavioral genetic study conducted at the Leibniz-Institute for Neurobiology, Magdeburg. Based on the assumption that a possible small effect of genes may not only require a large number of volunteers but also a strict control of non-genetic factors (Lee et al.,  ), participants were assessed for several exclusion criteria. All participants were right-handed according to self-report, not genetically related, and had obtained at least a university entrance diploma (  Abitur  ). Importantly, all participants had undergone routine clinical interview to exclude present or past neurological or psychiatric illness, alcohol, or drug abuse, use of centrally-acting medication, the presence of psychosis or bipolar disorder in a first-de

### Observations

GPT-3 and GPT-4 are both very good at extracting sample size (median error of about 0.009 and 0.0 respectively), with GPT-4 sometimes a bit better, although it is more ambitious and extracts more groups.

The main challenge: when more than one group is extracted, how do we select the fMRI group?
Often the extracted counts are the full counts, without accounting for exclusions.

GPT-4 often reports both initial and final counts.
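
One possible heuristic for the group-selection problem, sketched here as an assumption rather than something used in this notebook: when any group or subgroup name mentions fMRI, keep only those rows before summing. The function name `prefer_fmri_rows` is hypothetical; the two rows are the GPT-4 Turbo predictions for PMCID 3672681 shown above.

```python
import pandas as pd

def prefer_fmri_rows(df):
    """Hypothetical heuristic: keep rows whose group or subgroup name mentions fMRI, if any."""
    names = (df['group_name'].fillna('') + ' ' + df['subgroup_name'].fillna('')).str.lower()
    mask = names.str.contains('fmri', regex=False)
    # Fall back to all rows when nothing mentions fMRI
    return df[mask] if mask.any() else df

# The two predicted groups for PMCID 3672681 (see the Explore section)
preds = pd.DataFrame({
    'pmcid': [3672681, 3672681],
    'count': [46, 32],
    'group_name': ['healthy', 'healthy'],
    'subgroup_name': ['behavioral study', 'fMRI study'],
})

selected = prefer_fmri_rows(preds)
total = selected['count'].sum()
print(total)
```

For this paper the heuristic would keep only the fMRI study row, giving 32 instead of 78, which matches the annotated count. Whether it generalizes to other papers is untested.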