# Look for interesting targets to show bias with

The goal of the paper is to show how a clearly unrelated variable can be predicted from an X-ray when the CNN can detect the site location from the X-ray.

This notebook looks at fun and absurd variables to consider.

The questions being looked at all have multiple-choice answers, and the food questions are even ordinal. Yet most published research binarizes such variables with a threshold. This is a bit of a cheat. An algorithm that guesses randomly already has an AUC of 50%, so a paper that reports an AUC of 60+% isn't saying much: the model detects only a minor pattern beyond guessing, and that could be statistical noise. To show this, we will play the same game, which means we must find binarization thresholds that tell a good story.
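To make the 50% baseline concrete, the following is a minimal sketch and not part of the original analysis. It uses a small pure-Python rank-based AUC (the hypothetical helper `rank_auc`, standing in for `roc_auc_score`) and scores randomly generated labels with random guesses. The sample size and seed are arbitrary assumptions.

```python
import random

def rank_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formula; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every item tied with the score at position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

random.seed(0)  # arbitrary seed (assumption)
n = 10_000      # arbitrary sample size (assumption)
labels = [random.randint(0, 1) for _ in range(n)]
guesses = [random.random() for _ in range(n)]
random_auc = rank_auc(labels, guesses)
print(f"AUC of random guessing: {random_auc:.3f}")  # lands close to 0.5
```

Because the guesses carry no information about the labels, any distance from 0.5 here is sampling noise, which is the yardstick a reported AUC of 0.6 should be measured against.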

See TakeHomeQuestionnaire.pdf for the actual food questions

# Imports

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score

import OAI_Utilities as utils # ln -s ../../OAI/notebooks/OAI_Utilities.py

# Constants

In [2]:
OAI_DATA_PATH = Path.home() / 'code/OAI/notebooks/data/'
idxSlc = pd.IndexSlice

# Read in data

In [3]:
# Read in the dataframes for analysis 
allclinical_df = utils.read_parquet(OAI_DATA_PATH / 'allclinical_values.parquet')

xray_df = utils.read_parquet(OAI_DATA_PATH / 'xray_values.parquet')
xray_df = xray_df[xray_df['XRBARCD'] != '']
xray_df = xray_df[xray_df['EXAMTP'] == 'Bilateral PA Fixed Flexion Knee'] # Let's only consider X-rays that we might use in the deep learning

enrollees_df = utils.read_parquet(OAI_DATA_PATH / 'enrollees_values.parquet')

# Used to map variable descriptions to a variable name
labels_df = utils.read_parquet(OAI_DATA_PATH / 'oai_vars_labels_sources.parquet')
labels_df = labels_df.set_index('Variable')


# Merge to a single frame

In [4]:
# Map of patient IDs to barcodes, also drop 4 extraneous digits in barcode
# Result= ID: XRBARCD
barcode_site_id_df = pd.DataFrame(xray_df['XRBARCD'].reset_index('Visit', drop=True).str[4:])
print('{:,}'.format(len(barcode_site_id_df)))


# Create a list of possible targets
targets = ['INCOME', 'INCOME2', 'EDCV']

# Add in food questions
food_targets = ['FFQ' + str(num) for num in range(1,72)]
targets.extend(food_targets)

# Result= XRBARCD: ID, SITE, RACE, SEX, Visit, INCOME, INCOME2, EDCV, FFQ1, FFQ2, ....
barcode_site_id_df = barcode_site_id_df.join(enrollees_df[['SITE', 'RACE', 'SEX']], how='left')  # Add hospital site, patient race, and sex
barcode_site_id_df = barcode_site_id_df.join(allclinical_df.loc[idxSlc[:, 'V00'], :][targets].reset_index('Visit'), on='ID', how='left') # Add baseline (V00) income, education, and food questionnaire answers
barcode_site_id_df = barcode_site_id_df.reset_index('ID').set_index('XRBARCD')  # Switch to index by barcode
print('{:,}'.format(len(barcode_site_id_df)))  # Sanity check, the joins shouldn't increase the number of entries

26,522
26,522


# Check for useability

In [5]:
def remove_unused_missing_categories(series):
    # drop rows with categoricals that start with '.' (e.g. .Missing)
    missing_cats = [cat for cat in series.cat.categories if not cat[0].isdigit()]
    series = series[~series.isin(missing_cats)]
    return series.cat.remove_unused_categories().copy(deep=True) 

# See if any binarization point leads to some but not all sites being over 50%
# Returns the first binarization point where this is true
def check_useability(df, var, label, predictor='SITE'):
    s = remove_unused_missing_categories(df[var])

    df = pd.concat([df[predictor], s], axis=1, join='inner')

    # Get the percents by predictor (index: var values, cols: site names,
    #                           val: percent of patients with that answer at that site)
    var_by_site_df = df.groupby(predictor).value_counts(normalize=True).unstack(predictor)
    var_by_site_df = var_by_site_df.sort_index()  # Critical to ensure answers are in ordinal order  

    # Row by row, add the percentages of each row until one or more cols have over 50%
    sums = pd.Series(0, index=var_by_site_df.columns)
    for answer in var_by_site_df.index:  # Indices are question answers
        row = var_by_site_df.loc[answer]
        sums += row
        num_over_half = (sums > 0.50).sum()
        if num_over_half > 0:
            break

    if num_over_half == len(row):
        # No combination of values gives some columns over 50% and others under
        print("{}: {} doesn't work.\n Min percentage is {:.1%} with answers of {} or lower".format(var, label, sums.min(), answer))
        return None
    else:
        sums = sums[sums > 0]
        print("{}: {}\n With answers of {} or lower\n{}".format(var, label, answer, sums.apply(lambda x: '{:.1%}'.format(x))))
        return (sums, len(s))
    
# check_useability stops at the first threshold where any site has over 50% response.
# This function instead evaluates every threshold where at least one site is above 50%
# and at least one is below, and scores each of those thresholds with an AUC.
def find_best_auc(df, var):
    s = remove_unused_missing_categories(df[var])

    df = pd.concat([df['SITE'], s], axis=1, join='inner')

    # Get the percents by site (index: var values, cols: site names,
    #                           val: percent of patients with that answer at that site)
    var_by_site_df = df.groupby('SITE').value_counts(normalize=True).unstack('SITE')
    var_by_site_df = var_by_site_df.sort_index()  # Critical to ensure answers are in ordinal order  

    # Get per site counts
    counts = df.groupby('SITE').value_counts().unstack('SITE').sum()

    sums = pd.Series(0, index=var_by_site_df.columns)
    aucs, idxs = [], []
    # Indices are how often a person eats this food
    for answer in var_by_site_df.index:
        row = var_by_site_df.loc[answer]
        sums += row
        num_over_half = (sums > 0.50).sum()
        if num_over_half == len(sums):  # All are over 50%, model would just always guess true for any site
            break
        elif num_over_half > 0:  # At least one site must have over a 50% response rate   
            # Calc the AUCs
            pos = (sums * counts).round().astype(int)
            neg = (counts - pos).astype(int)
            actual, pred = np.empty(0), np.empty(0)
            for site in sums.index:
                actual = np.append(actual, np.ones(pos[site]))
                actual = np.append(actual, np.zeros(neg[site]))
                pred = np.append(pred, np.ones(int(counts[site])) * sums[site])
            aucs.append(roc_auc_score(actual, pred))
            idxs.append(answer)
    if aucs:
        aucs = pd.Series(aucs, index=idxs)
        return (aucs.max(), aucs.idxmax(), aucs, aucs.min())
    else:
        return None
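
The AUC step in `find_best_auc` scores every patient with their site's positive rate, so the result is the AUC of a predictor that knows only the site. A minimal sketch of that idea follows. The site counts and rates are hypothetical, and the hypothetical helper `site_only_auc` counts positive/negative pairs directly instead of calling `roc_auc_score`.

```python
def site_only_auc(site_counts, site_pos_rates):
    """AUC of a predictor whose score for each patient is their site's positive rate."""
    pos = {s: round(site_counts[s] * site_pos_rates[s]) for s in site_counts}
    neg = {s: site_counts[s] - pos[s] for s in site_counts}
    wins = 0
    ties = 0
    for a in site_counts:      # site of the positive patient
        for b in site_counts:  # site of the negative patient
            pairs = pos[a] * neg[b]
            if site_pos_rates[a] > site_pos_rates[b]:
                wins += pairs  # positive patient is scored higher
            elif site_pos_rates[a] == site_pos_rates[b]:
                ties += pairs  # identical scores count as half a win
    return (wins + 0.5 * ties) / (sum(pos.values()) * sum(neg.values()))

# Hypothetical example: two equally sized sites with 60% and 40% positives
toy_counts = {'A': 100, 'B': 100}
toy_rates = {'A': 0.6, 'B': 0.4}
toy_auc = site_only_auc(toy_counts, toy_rates)
print(f"{toy_auc:.3f}")  # 0.600
```

Two sites that differ by only 20 percentage points already yield an AUC of 0.6 without looking at any patient-level information.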

## Income

In [6]:
# Look at simplified income
var = 'INCOME2'
tmp = check_useability(barcode_site_id_df, var, labels_df.at['V00'+var, 'Label'])

INCOME2: Yearly income (>50K or <50K) (calc) doesn't work.
 Min percentage is 100.0% with answers of 2: > $50K or lower


In [7]:
# Look at more detailed income
var = 'INCOME'
tmp = check_useability(barcode_site_id_df, var, labels_df.at['V00'+var, 'Label'])

INCOME: Yearly income (calc) doesn't work.
 Min percentage is 67.4% with answers of 4: $50K to < $100K or lower


## Education

In [8]:
# Look at education
var = 'EDCV'
tmp = check_useability(barcode_site_id_df, var, labels_df.at['V00'+var, 'Label'])

EDCV: Highest grade or year of school completed (calc)
 With answers of 2: Some college or lower
SITE
A    42.1%
B    47.3%
C    20.6%
D    41.8%
E    52.2%
dtype: object
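
The cumulative check behind this kind of output can be sketched without pandas. This is a simplified illustration under assumed inputs, not the notebook's implementation: `first_split_threshold` and the toy fractions are hypothetical, and answers are assumed to arrive in ordinal order.

```python
def first_split_threshold(answer_fractions, cutoff=0.5):
    """Accumulate per-group answer fractions in ordinal order and stop at the
    first answer where any group exceeds the cutoff.

    answer_fractions: list of (answer, {group: fraction}) pairs.
    Returns (answer, running_totals, usable); usable is True only when some
    but not all groups are over the cutoff at that answer.
    """
    running = {group: 0.0 for group in answer_fractions[0][1]}
    for answer, fractions in answer_fractions:
        for group in running:
            running[group] += fractions.get(group, 0.0)
        over = [g for g, total in running.items() if total > cutoff]
        if over:
            return answer, dict(running), len(over) < len(running)
    return None, dict(running), False

# Hypothetical fractions for two sites
toy = [
    ('1: Never',                {'A': 0.30, 'B': 0.20}),
    ('2: A few times per year', {'A': 0.25, 'B': 0.20}),
    ('3: Once per month',       {'A': 0.45, 'B': 0.60}),
]
answer, totals, usable = first_split_threshold(toy)
print(answer, usable)  # 2: A few times per year True
```

In the toy data, site A crosses 50% at the second answer while site B is still at 40%, so binarizing at "A few times per year or lower" separates the sites.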


## Food

In [31]:
# New version
viable = []
all_pcts = []
cnts = {}
for var in food_targets:
    label = labels_df.at['V00'+var, 'Label'].split(':')[1]
    tmp = check_useability(barcode_site_id_df, var, label)
    if tmp is not None:
        viable.append(var)
        all_pcts.append(tmp[0])
        cnts[var] = tmp[1]
    print('\n')

# Collect all contenders into a single dataframe
contenders = pd.DataFrame(all_pcts, index=viable)

FFQ1:  eggs (include egg biscuits/Egg McMuffins (not egg substitutes)), eat how often, past 12 months
 With answers of 4: 2-3 times per month or lower
SITE
A    44.0%
B    43.5%
C    47.9%
D    50.8%
E    44.0%
dtype: object


FFQ2:  bacon/breakfast sausage (including sausage biscuit), eat how often, past 12 months
 With answers of 3: Once per month or lower
SITE
A    43.4%
B    60.7%
C    51.8%
D    52.9%
E    46.6%
dtype: object


FFQ3:  cooked cereals (e.g., oatmeal/cream of wheat/grits) eat how often, past 12 months
 With answers of 3: Once per month or lower
SITE
A    40.6%
B    49.9%
C    51.1%
D    49.4%
E    44.1%
dtype: object


FFQ4:  cold cereals (e.g., Corn Flakes/Cheerios...), eat how often, past 12 months
 With answers of 5: Once per week or lower
SITE
A    52.2%
B    48.6%
C    46.1%
D    46.0%
E    51.3%
dtype: object


FFQ5:  cereal, which eat most often doesn't work.
 Min percentage is 87.9% with answers of 3: Other cold cereal, like Corn Flakes etc or lower


FFQ6:  

In [32]:
print(list(contenders.index))
print(len(contenders))

['FFQ1', 'FFQ2', 'FFQ3', 'FFQ4', 'FFQ7', 'FFQ9', 'FFQ12', 'FFQ13', 'FFQ14', 'FFQ15', 'FFQ17', 'FFQ18', 'FFQ21', 'FFQ22', 'FFQ23', 'FFQ24', 'FFQ28', 'FFQ34', 'FFQ36', 'FFQ37', 'FFQ39', 'FFQ40', 'FFQ46', 'FFQ47', 'FFQ52', 'FFQ53', 'FFQ55', 'FFQ63', 'FFQ66', 'FFQ67', 'FFQ68', 'FFQ69', 'FFQ71']
33


In [39]:
site_diffs = (contenders.max(axis=1) - contenders.min(axis=1)).sort_values(ascending=False)
site_diffs[site_diffs > 0.15]

FFQ18    0.316086
FFQ37    0.272725
FFQ22    0.247722
FFQ52    0.245985
FFQ47    0.212894
FFQ69    0.197521
FFQ71    0.195691
FFQ55    0.191253
FFQ23    0.189555
FFQ68    0.176790
FFQ24    0.174829
FFQ2     0.173392
FFQ34    0.170539
FFQ14    0.166597
FFQ39    0.165356
dtype: float64
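
The spread used for this ranking is just the gap between the highest and lowest site percentage at the chosen threshold. As a small pure-Python sketch, the dictionary below copies the rounded site percentages that `check_useability` prints for FFQ18, FFQ37, and FFQ52 elsewhere in this notebook, so the results match the values above only to about three decimals. The variable names are illustrative.

```python
# Rounded per-site fractions at each question's threshold, copied from the
# check_useability output for these three questions in this notebook
site_pcts = {
    'FFQ18': {'A': 0.594, 'B': 0.661, 'C': 0.345, 'D': 0.560, 'E': 0.607},
    'FFQ37': {'A': 0.498, 'B': 0.770, 'C': 0.653, 'D': 0.618, 'E': 0.530},
    'FFQ52': {'A': 0.514, 'B': 0.525, 'C': 0.279, 'D': 0.436, 'E': 0.491},
}

# Spread = max site fraction minus min site fraction
spreads = {q: max(p.values()) - min(p.values()) for q, p in site_pcts.items()}
ranked = sorted(spreads, key=spreads.get, reverse=True)

for q in ranked:
    print(f"{q}  {spreads[q]:.3f}")
# FFQ18  0.316
# FFQ37  0.272
# FFQ52  0.246
```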

In [11]:
# Go through all food questions and calculate the best possible AUC for each question
# best_aucs - a dataframe with the best AUC for each question, and the threshold

questions = []
aucs = {}
max_aucs = []
min_aucs = []
answer = []
for var in food_targets:
    res = find_best_auc(barcode_site_id_df, var)
    if res is not None:
        questions.append(var)
        max_aucs.append(res[0])
        answer.append(res[1])
        aucs[var] = res[2]
        min_aucs.append(res[3])        
        
best_aucs = pd.DataFrame({'AUC': max_aucs, 'Freq':answer, 'min': min_aucs}, index=questions)
print(len(best_aucs))  #Sanity check, this should match the count from before
all_aucs = pd.DataFrame(aucs) # Has NA for any answer that won't work, most only have one answer with an AUC 

33


In [12]:
# Does anything have an AUC below 0.5 (meaning we can do well if we flip the positive case)
# This was a hack, the frequency value may not be correct for this.
print(all_aucs.T.sum(axis=1).idxmin(), all_aucs.T.sum(axis=1).min())

FFQ53 0.5220867873397199
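
The reason an AUC below 0.5 would still be useful: swapping which class counts as positive turns an AUC of a into 1 - a. A minimal sketch with made-up labels and scores, using a hypothetical pairwise helper rather than `roc_auc_score`:

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher;
    tied scores count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example data
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
flipped = [1 - y for y in labels]  # swap the positive and negative classes

original_auc = pairwise_auc(labels, scores)
flipped_auc = pairwise_auc(flipped, scores)
print(original_auc, flipped_auc)  # 0.75 0.25
```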


In [13]:
# How many have more than one cut-off frequency
tmp = (~all_aucs.isna()).sum()
tmp[tmp > 1]

FFQ66    2
dtype: int64

In [14]:
# Can we use eating fried chicken skin?
best_aucs.loc['FFQ39']

AUC                      0.56089
Freq    1: Avoid eating the skin
min                      0.56089
Name: FFQ39, dtype: object

In [15]:
# What has strong AUCs?
for idx in best_aucs[best_aucs['AUC'] > 0.6].index:
    label = labels_df.at['V00'+idx, 'Label'].split(':')[1]
    if 'months' in label:
        label = label.split(',')[0] # drop the ', past 12 months' 
    print('{:.3f} {}  {}'.format(best_aucs.loc[idx]['AUC'], idx, label), best_aucs.loc[idx]['Freq'])

0.631 FFQ18   refried beans 1: Never
0.607 FFQ37   fried chicken 2: A few times per year
0.609 FFQ52   tortillas 1: Never


In [16]:
var = 'FFQ18'
tmp = check_useability(barcode_site_id_df, var, labels_df.at['V00'+var, 'Label'])

FFQ18: Block Brief 2000: refried beans, eat how often, past 12 months
 With answers of 1: Never or lower
SITE
A    59.4%
B    66.1%
C    34.5%
D    56.0%
E    60.7%
dtype: object


In [17]:
var = 'FFQ37'
tmp = check_useability(barcode_site_id_df, var, labels_df.at['V00'+var, 'Label'])

FFQ37: Block Brief 2000: fried chicken, at home or in a restaurant, eat how often, past 12 months
 With answers of 2: A few times per year or lower
SITE
A    49.8%
B    77.0%
C    65.3%
D    61.8%
E    53.0%
dtype: object


In [18]:
var = 'FFQ52'
tmp = check_useability(barcode_site_id_df, var, labels_df.at['V00'+var, 'Label'])

FFQ52: Block Brief 2000: tortillas, eat how often, past 12 months
 With answers of 1: Never or lower
SITE
A    51.4%
B    52.5%
C    27.9%
D    43.6%
E    49.1%
dtype: object


# Check against other predictors

## Gender

In [19]:
# New version
viable = []
all_pcts = []
cnts = {}
for var in food_targets:
    label = labels_df.at['V00'+var, 'Label'].split(':')[1]
    tmp = check_useability(barcode_site_id_df, var, label, predictor='SEX')
    if tmp is not None:
        viable.append(var)
        all_pcts.append(tmp[0])
        cnts[var] = tmp[1]
    print('\n')

# Collect all contenders into a single dataframe
contenders = pd.DataFrame(all_pcts, index=viable)

FFQ1:  eggs (include egg biscuits/Egg McMuffins (not egg substitutes)), eat how often, past 12 months
 With answers of 5: Once per week or lower
SEX
1: Male      66.8%
2: Female    68.1%
dtype: object


FFQ2:  bacon/breakfast sausage (including sausage biscuit), eat how often, past 12 months
 With answers of 3: Once per month or lower
SEX
1: Male      45.2%
2: Female    57.6%
dtype: object


FFQ3:  cooked cereals (e.g., oatmeal/cream of wheat/grits) eat how often, past 12 months
 With answers of 3: Once per month or lower
SEX
1: Male      55.4%
2: Female    42.8%
dtype: object


FFQ4:  cold cereals (e.g., Corn Flakes/Cheerios...), eat how often, past 12 months
 With answers of 6: Twice per week or lower
SEX
1: Male      59.3%
2: Female    62.9%
dtype: object


FFQ5:  cereal, which eat most often
 With answers of 3: Other cold cereal, like Corn Flakes etc or lower
SEX
1: Male      90.3%
2: Female    89.5%
dtype: object


FFQ6:  cheese/sliced cheese/cheese spread (including on sandwiches

In [20]:
gender_diffs = (contenders['1: Male'] - contenders['2: Female']).abs().sort_values(ascending=False)
gender_diffs

FFQ70    0.326922
FFQ7     0.223119
FFQ13    0.222368
FFQ69    0.180269
FFQ42    0.170548
           ...   
FFQ61    0.010755
FFQ5     0.007320
FFQ11    0.004373
FFQ51    0.002981
FFQ30    0.000000
Length: 71, dtype: float64

In [21]:
gender_diffs[gender_diffs > 0.15]

FFQ70    0.326922
FFQ7     0.223119
FFQ13    0.222368
FFQ69    0.180269
FFQ42    0.170548
FFQ40    0.162464
FFQ26    0.162189
FFQ47    0.159888
FFQ31    0.159429
dtype: float64

In [22]:
var = 'FFQ18'
tmp = check_useability(barcode_site_id_df, var, labels_df.at['V00'+var, 'Label'], predictor='SEX')

FFQ18: Block Brief 2000: refried beans, eat how often, past 12 months
 With answers of 1: Never or lower
SEX
1: Male      46.6%
2: Female    57.5%
dtype: object


## Race

In [23]:
# New version
viable = []
all_pcts = []
cnts = {}
for var in food_targets:
    label = labels_df.at['V00'+var, 'Label'].split(':')[1]
    tmp = check_useability(barcode_site_id_df, var, label, predictor='RACE')
    if tmp is not None:
        viable.append(var)
        all_pcts.append(tmp[0])
        cnts[var] = tmp[1]
    print('\n')

# Collect all contenders into a single dataframe
contenders = pd.DataFrame(all_pcts, index=viable)

FFQ1:  eggs (include egg biscuits/Egg McMuffins (not egg substitutes)), eat how often, past 12 months
 With answers of 4: 2-3 times per month or lower
RACE
1: White or Caucasian                46.7%
2: Black or African American         47.4%
3: Asian                             33.7%
0: Other Non-white                   45.0%
.D: Don t Know/Unknown/Uncertain     45.5%
.R: Refused                         100.0%
dtype: object


FFQ2:  bacon/breakfast sausage (including sausage biscuit), eat how often, past 12 months
 With answers of 1: Never or lower
RACE
1: White or Caucasian               10.5%
2: Black or African American         7.6%
3: Asian                            24.5%
0: Other Non-white                  12.3%
.D: Don t Know/Unknown/Uncertain    45.5%
.R: Refused                         53.8%
dtype: object


FFQ3:  cooked cereals (e.g., oatmeal/cream of wheat/grits) eat how often, past 12 months
 With answers of 3: Once per month or lower
RACE
1: White or Caucasian             

In [24]:
race_diffs = (contenders['1: White or Caucasian'] - contenders['2: Black or African American']).abs().sort_values(ascending=False)
race_diffs

FFQ64    0.337775
FFQ39    0.286143
FFQ63    0.267746
FFQ32    0.261741
FFQ52    0.260000
           ...   
FFQ10    0.014827
FFQ61    0.008770
FFQ1     0.007197
FFQ8     0.004419
FFQ30    0.000000
Length: 71, dtype: float64

In [25]:
race_diffs[race_diffs > 0.2]

FFQ64    0.337775
FFQ39    0.286143
FFQ63    0.267746
FFQ32    0.261741
FFQ52    0.260000
FFQ37    0.255972
FFQ18    0.246337
FFQ71    0.241373
FFQ24    0.236763
FFQ55    0.230609
FFQ45    0.217710
FFQ6     0.213421
FFQ70    0.212952
FFQ3     0.212350
FFQ22    0.204757
dtype: float64

In [26]:
var = 'FFQ71'  # The loop above leaves var at the last food question; set it explicitly
tmp = check_useability(barcode_site_id_df, var, labels_df.at['V00'+var, 'Label'], predictor='RACE')

FFQ71: Block Brief 2000: wine/wine coolers, drink how often, past 12 months
 With answers of 1: Never or lower
RACE
1: White or Caucasian                21.3%
2: Black or African American         45.4%
3: Asian                             37.8%
0: Other Non-white                   30.7%
.D: Don t Know/Unknown/Uncertain     45.5%
.R: Refused                         100.0%
dtype: object


In [50]:
set(site_diffs[site_diffs > 0.15].index) & set(gender_diffs[gender_diffs > 0.15].index)

{'FFQ47', 'FFQ69'}

In [107]:
set(site_diffs[site_diffs > 0.15].index) & set(race_diffs[race_diffs > 0.2].index)

{'FFQ18', 'FFQ22', 'FFQ24', 'FFQ37', 'FFQ39', 'FFQ52', 'FFQ55', 'FFQ71'}

In [40]:
set(race_diffs[race_diffs > 0.15].index) & set(gender_diffs[gender_diffs > 0.15].index)

{'FFQ40', 'FFQ69', 'FFQ70'}

In [44]:
set(race_diffs[race_diffs > 0.15].index) & set(gender_diffs[gender_diffs > 0.15].index) & set(site_diffs[site_diffs > 0.15].index)

{'FFQ69'}

In [120]:
df = barcode_site_id_df
var = 'FFQ70'
predictor = 'RACE'
label = labels_df.at['V00'+var, 'Label']

s = remove_unused_missing_categories(df[var])

df = pd.concat([df[predictor], s], axis=1, join='inner')

# Get the percents by predictor (index: var values, cols: site names,
#                           val: percent of patients with that answer at that site)
var_by_site_df = df.groupby(predictor).value_counts(normalize=True).unstack(predictor)
var_by_site_df = var_by_site_df.sort_index()  # Critical to ensure answers are in ordinal order
tmp = var_by_site_df.sum(axis=0) != 0.0
var_by_site_df = var_by_site_df[tmp[tmp].index]

# Row by row, add the percentages of each row until one or more cols have over 50%
sum_list = []
sums = pd.Series(0, index=var_by_site_df.columns)
for answer in var_by_site_df.index:  # Indices are question answers
    row = var_by_site_df.loc[answer]
    sums += row
    num_over_half = (sums > 0.50).sum()
    sum_list.append(sums.copy(deep=True))
    
sums_df = pd.DataFrame(sum_list, var_by_site_df.index)
sums_df

RACE                     1: White or Caucasian  2: Black or African American  3: Asian  0: Other Non-white  .D: Don t Know/Unknown/Uncertain  .R: Refused
FFQ70
1: Never                              0.381534                      0.594486  0.581633            0.508951                          1.000000     1.000000
2: A few times per year               0.609777                      0.791942  0.765306            0.744246                          1.000000     1.000000
3: Once per month                     0.695565                      0.842366  0.816327            0.841432                          1.000000     1.000000
4: 2-3 times per month                0.793670                      0.896560  0.908163            0.895141                          1.000000     1.000000
5: Once per week                      0.857879                      0.919651  1.000000            0.959079                          1.000000     1.000000
6: Twice per week                     0.918270                      0.949340  1.000000            0.982097                          1.000000     1.000000
7: 3-4 times per week                 0.962286                      0.975966  1.000000            0.987212                          1.000000     1.000000
8: 5-6 times per week                 0.984103                      0.991517  1.000000            1.000000                          1.000000     1.000000
9: Every day                          1.000000                      1.000000  1.000000            1.000000                          1.000000     1.000000


In [121]:
(sums_df['1: White or Caucasian'] - sums_df['2: Black or African American']).abs()

FFQ70
1: Never                   2.129520e-01
2: A few times per year    1.821645e-01
3: Once per month          1.468007e-01
4: 2-3 times per month     1.028901e-01
5: Once per week           6.177186e-02
6: Twice per week          3.107033e-02
7: 3-4 times per week      1.368030e-02
8: 5-6 times per week      7.414701e-03
9: Every day               1.110223e-16
dtype: float64

In [122]:
df = barcode_site_id_df
predictor = 'SEX'
df = pd.concat([df[predictor], s], axis=1, join='inner')

# Get the percents by predictor (index: var values, cols: site names,
#                           val: percent of patients with that answer at that site)
var_by_site_df = df.groupby(predictor).value_counts(normalize=True).unstack(predictor)
var_by_site_df = var_by_site_df.sort_index()  # Critical to ensure answers are in ordinal order
tmp = var_by_site_df.sum(axis=0) != 0.0
var_by_site_df = var_by_site_df[tmp[tmp].index]

# Row by row, add the percentages of each row until one or more cols have over 50%
sum_list = []
sums = pd.Series(0, index=var_by_site_df.columns)
for answer in var_by_site_df.index:  # Indices are question answers
    row = var_by_site_df.loc[answer]
    sums += row
    num_over_half = (sums > 0.50).sum()
    sum_list.append(sums.copy(deep=True))
    
sums_df = pd.DataFrame(sum_list, var_by_site_df.index)
sums_df

SEX                       1: Male  2: Female
FFQ70
1: Never                 0.232787   0.559709
2: A few times per year  0.432332   0.799622
3: Once per month        0.531603   0.864998
4: 2-3 times per month   0.655282   0.930171
5: Once per week         0.747541   0.962083
6: Twice per week        0.855738   0.976386
7: 3-4 times per week    0.931967   0.989880
8: 5-6 times per week    0.970765   0.996762
9: Every day             1.000000   1.000000


In [123]:
(sums_df['1: Male'] - sums_df['2: Female']).abs()

FFQ70
1: Never                   3.269217e-01
2: A few times per year    3.672907e-01
3: Once per month          3.333951e-01
4: 2-3 times per month     2.748890e-01
5: Once per week           2.145424e-01
6: Twice per week          1.206487e-01
7: 3-4 times per week      5.791270e-02
8: 5-6 times per week      2.599654e-02
9: Every day               1.110223e-16
dtype: float64

In [124]:
df = barcode_site_id_df
predictor = 'SITE'
df = pd.concat([df[predictor], s], axis=1, join='inner')

# Get the percents by predictor (index: var values, cols: site names,
#                           val: percent of patients with that answer at that site)
var_by_site_df = df.groupby(predictor).value_counts(normalize=True).unstack(predictor)
var_by_site_df = var_by_site_df.sort_index()  # Critical to ensure answers are in ordinal order
tmp = var_by_site_df.sum(axis=0) != 0.0
var_by_site_df = var_by_site_df[tmp[tmp].index]

# Row by row, add the percentages of each row until one or more cols have over 50%
sum_list = []
sums = pd.Series(0, index=var_by_site_df.columns)
for answer in var_by_site_df.index:  # Indices are question answers
    row = var_by_site_df.loc[answer]
    sums += row
    num_over_half = (sums > 0.50).sum()
    sum_list.append(sums.copy(deep=True))
    
sums_df = pd.DataFrame(sum_list, var_by_site_df.index)
sums_df

SITE                            A         B         C         D         E
FFQ70
1: Never                 0.443422  0.437347  0.388405  0.408180  0.481230
2: A few times per year  0.692862  0.677984  0.604464  0.610439  0.690185
3: Once per month        0.768963  0.732587  0.683571  0.713717  0.777476
4: 2-3 times per month   0.846804  0.822366  0.784469  0.808721  0.839439
5: Once per week         0.899776  0.877844  0.852615  0.859803  0.893261
6: Twice per week        0.951505  0.920546  0.916931  0.921069  0.927635
7: 3-4 times per week    0.975131  0.956073  0.963418  0.969446  0.965174
8: 5-6 times per week    0.994031  0.978474  0.986265  0.986633  0.984622
9: Every day             1.000000  1.000000  1.000000  1.000000  1.000000


In [41]:
all_aucs.sum().sort_values(ascending=False)

FFQ66    1.071475
FFQ18    0.630751
FFQ52    0.609361
FFQ37    0.606613
FFQ22    0.597942
FFQ47    0.583821
FFQ69    0.569841
FFQ55    0.569658
FFQ34    0.568984
FFQ68    0.568760
FFQ23    0.563854
FFQ71    0.562935
FFQ39    0.560890
FFQ2     0.560073
FFQ36    0.556660
FFQ46    0.553255
FFQ14    0.551494
FFQ24    0.550623
FFQ40    0.549256
FFQ15    0.547580
FFQ63    0.541345
FFQ17    0.541153
FFQ21    0.537506
FFQ3     0.536043
FFQ13    0.532905
FFQ1     0.532064
FFQ28    0.529299
FFQ67    0.528528
FFQ7     0.528049
FFQ9     0.527305
FFQ12    0.526109
FFQ4     0.525188
FFQ53    0.522087
dtype: float64

In [49]:
all_aucs['FFQ66']

1: Avoid eating the skin         NaN
1: Never                         NaN
2: A few times per year          NaN
3: Low-fat 1% milk               NaN
3: Once per month                NaN
4: 2-3 times per month      0.538795
5: Once per week            0.532680
6: Twice per week                NaN
Name: FFQ66, dtype: float64