<a href="https://colab.research.google.com/github/dellacortelab/DeepLearningExamples/blob/master/Dietary_GI_GL_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's learn how to add dietary glycemic index (GI) and glycemic load (GL) variables to a subset of the NHANES national health survey dataset.

1. We will load the dataset.
2. We will assign GI values.
3. We calculate GL values.
4. We aggregate the data to get dietary GL values.
5. We calculate dietary GI values.
6. We perform some basic reporting.

In [10]:
!git clone https://github.com/dellacortelab/nhanes_gigl.git

Cloning into 'nhanes_gigl'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 9 (delta 1), reused 6 (delta 0), pack-reused 0
Receiving objects: 100% (9/9), 294.07 KiB | 1.54 MiB/s, done.
Resolving deltas: 100% (1/1), done.


In [44]:
import pickle
with open('./nhanes_gigl/demo_data.pkl', 'rb') as f:
  df = pickle.load(f)

Explore the DataFrame.

In [16]:
df

Unnamed: 0,respondent sequence number,Dietary 20 Year Weight,Data release cycle,USDA food code,Energy (kcal),Carbohydrate (gm),Dietary fiber (gm),Day
0,1.0,1213.225733,1,57348000.0,154.60,36.37,0.82,1.0
1,1.0,1213.225733,1,11112110.0,121.20,11.71,0.00,1.0
2,1.0,1213.225733,1,63149010.0,72.96,16.37,1.14,1.0
3,1.0,1213.225733,1,91611000.0,63.36,16.63,0.00,1.0
4,1.0,1213.225733,1,51101000.0,34.71,6.44,0.30,1.0
...,...,...,...,...,...,...,...,...
2259910,102951.0,9089.858831,10,58407030.0,205.00,28.04,1.30,2.0
2259911,102951.0,9089.858831,10,94100100.0,0.00,0.00,0.00,2.0
2259912,102951.0,9089.858831,10,27510241.0,490.00,30.72,1.10,2.0
2259913,102951.0,9089.858831,10,83110000.0,38.00,2.22,0.00,2.0


Let's extract the unique food codes.

In [45]:
food_codes = df['USDA food code'].unique().astype(int)
print('Example food code 1:', food_codes[0], '\nExample food code 2:', food_codes[1])


Example food code 1: 57348000 
Example food code 2: 11112110


Now see if you can find the descriptions of these foods on FoodData Central: https://fdc.nal.usda.gov/

Next, let's find the GI values associated with these foods: https://glycemicindex.com/gi-search/

Imagine having to do this by hand for over 10k food codes! Enter the AI!

In [33]:
from transformers import AutoTokenizer, AutoModel
import torch
from scipy.spatial.distance import cosine

# Initialize the BERT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Define the food descriptions
food1 = "Food Description from FoodData Central"
food2 = "Food Description from GI Database"

# Tokenize and encode the food descriptions
inputs1 = tokenizer(food1, return_tensors='pt', truncation=True, padding=True)
inputs2 = tokenizer(food2, return_tensors='pt', truncation=True, padding=True)

# Get the embeddings for the food descriptions
with torch.no_grad():
    embed1 = model(**inputs1).last_hidden_state.mean(dim=1)
    embed2 = model(**inputs2).last_hidden_state.mean(dim=1)

# Calculate the cosine similarity
similarity = 1 - cosine(embed1.numpy()[0,:], embed2.numpy()[0,:])

print(f"The cosine similarity between the food descriptions is {similarity}")

The cosine similarity between the food descriptions is 0.8749364614486694


The code above could be used to compare thousands of food descriptions across databases. It uses BERT, a somewhat dated language model. Our paper uses a more modern large language model from OpenAI (the makers of ChatGPT), but this example suffices for our purposes.
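To scale the pairwise comparison to many descriptions, one option is to stack the embeddings into matrices and pick, for each query description, the reference description with the highest cosine similarity. The following is a minimal sketch under stated assumptions: the helper name `best_matches` is hypothetical, and the toy 3-dimensional vectors stand in for real BERT embeddings. It is not necessarily the matching procedure used in the paper.

```python
import numpy as np

def best_matches(query_embeds, ref_embeds):
    """For each query row, return the index and cosine similarity
    of the most similar reference row."""
    # Normalize rows to unit length so the dot product equals cosine similarity
    q = query_embeds / np.linalg.norm(query_embeds, axis=1, keepdims=True)
    r = ref_embeds / np.linalg.norm(ref_embeds, axis=1, keepdims=True)
    sims = q @ r.T                      # shape: (n_query, n_ref)
    idx = sims.argmax(axis=1)           # best reference per query
    return idx, sims[np.arange(len(idx)), idx]

# Toy vectors standing in for embeddings of food descriptions
query = np.array([[1.0, 0.0, 0.0],
                  [0.0, 2.0, 2.0]])
ref = np.array([[0.0, 1.0, 1.0],
                [3.0, 0.1, 0.0],
                [0.0, 0.0, 1.0]])

idx, sim = best_matches(query, ref)
print(idx)  # index of the closest reference description for each query
```

In practice one would probably also set a minimum similarity threshold and review low-scoring matches by hand.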

Let's assume that we used the AI to align all food codes with corresponding GI values. Here, we will use random numbers for the sake of time.

In [51]:
import numpy as np
gi_values = [np.random.randint(45,100) for food_code in food_codes]
food_codes_to_gi = dict(zip(food_codes, gi_values))
for key in list(food_codes_to_gi.keys())[:5]:
  print(key,':', food_codes_to_gi[key])

57348000 : 67
11112110 : 76
63149010 : 60
91611000 : 76
51101000 : 97


Now, let's add the new GI variable to our dataframe!

In [52]:
df['GI'] = df['USDA food code'].map(food_codes_to_gi)
df

Unnamed: 0,respondent sequence number,Dietary 20 Year Weight,Data release cycle,USDA food code,Energy (kcal),Carbohydrate (gm),Dietary fiber (gm),Day,GI
0,1.0,1213.225733,1,57348000.0,154.60,36.37,0.82,1.0,67
1,1.0,1213.225733,1,11112110.0,121.20,11.71,0.00,1.0,76
2,1.0,1213.225733,1,63149010.0,72.96,16.37,1.14,1.0,60
3,1.0,1213.225733,1,91611000.0,63.36,16.63,0.00,1.0,76
4,1.0,1213.225733,1,51101000.0,34.71,6.44,0.30,1.0,97
...,...,...,...,...,...,...,...,...,...
2259910,102951.0,9089.858831,10,58407030.0,205.00,28.04,1.30,2.0,71
2259911,102951.0,9089.858831,10,94100100.0,0.00,0.00,0.00,2.0,87
2259912,102951.0,9089.858831,10,27510241.0,490.00,30.72,1.10,2.0,67
2259913,102951.0,9089.858831,10,83110000.0,38.00,2.22,0.00,2.0,87
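One caveat worth checking at this step: `Series.map` with a dictionary returns `NaN` for any value that is missing from the dictionary. In this tutorial every food code received a (random) GI, so nothing is missing, but with a real alignment some codes may go unmatched. A minimal sketch with toy data; the code `99999999` and its absence from the mapping are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for df['USDA food code'] and food_codes_to_gi
codes = pd.Series([57348000.0, 11112110.0, 99999999.0])
mapping = {57348000: 67, 11112110: 76}  # 99999999 deliberately left out

# Cast to int so the float codes line up with the integer dictionary keys
gi = codes.astype(int).map(mapping)

# Food codes that received no GI value
unmatched = codes[gi.isna()].astype(int).tolist()
print(unmatched)
```

Running the same `isna()` check on `df['GI']` is a quick way to confirm full coverage before computing GL.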


Now let's calculate the GL value for each food item. This needs to take into account the available carbohydrates, i.e. total carbohydrates minus dietary fiber.

In [53]:
df['Available Carbohydrate'] = df['Carbohydrate (gm)'] - df['Dietary fiber (gm)']
df['GL'] = df['GI'] * df['Available Carbohydrate'] / 100
df

Unnamed: 0,respondent sequence number,Dietary 20 Year Weight,Data release cycle,USDA food code,Energy (kcal),Carbohydrate (gm),Dietary fiber (gm),Day,GI,Available Carbohydrate,GL
0,1.0,1213.225733,1,57348000.0,154.60,36.37,0.82,1.0,67,35.55,23.8185
1,1.0,1213.225733,1,11112110.0,121.20,11.71,0.00,1.0,76,11.71,8.8996
2,1.0,1213.225733,1,63149010.0,72.96,16.37,1.14,1.0,60,15.23,9.1380
3,1.0,1213.225733,1,91611000.0,63.36,16.63,0.00,1.0,76,16.63,12.6388
4,1.0,1213.225733,1,51101000.0,34.71,6.44,0.30,1.0,97,6.14,5.9558
...,...,...,...,...,...,...,...,...,...,...,...
2259910,102951.0,9089.858831,10,58407030.0,205.00,28.04,1.30,2.0,71,26.74,18.9854
2259911,102951.0,9089.858831,10,94100100.0,0.00,0.00,0.00,2.0,87,0.00,0.0000
2259912,102951.0,9089.858831,10,27510241.0,490.00,30.72,1.10,2.0,67,29.62,19.8454
2259913,102951.0,9089.858831,10,83110000.0,38.00,2.22,0.00,2.0,87,2.22,1.9314
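As a sanity check, the GL calculation can be reproduced by hand for the first row of the table: GL = GI × (carbohydrate − fiber) / 100. The small helper below is a sketch; the function name `glycemic_load` is introduced here for illustration only.

```python
def glycemic_load(gi, carbohydrate_g, fiber_g):
    """GL of one food item: GI times available carbohydrate (g), divided by 100."""
    available = carbohydrate_g - fiber_g
    return gi * available / 100

# First row of the table: GI 67, 36.37 g carbohydrate, 0.82 g fiber
gl = glycemic_load(gi=67, carbohydrate_g=36.37, fiber_g=0.82)
print(round(gl, 4))  # 23.8185
```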


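Now we can sum the GL over all food items consumed by a given respondent in the NHANES survey. Doing this separately for each reporting day gives the Dietary GL. Dividing the Dietary GL by the summed available carbohydrate and multiplying by 100 then gives the Dietary GI.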

In [63]:
agg_dict = {
    'Dietary 20 Year Weight': 'first',
    'Energy (kcal)': 'sum',
    'Data release cycle': 'first',
    'Available Carbohydrate':'sum',
    'GI': 'sum',
    'GL': 'sum'
}
summed_df = df.groupby(['respondent sequence number','Day']).agg(agg_dict).reset_index()
summed_df.rename(columns={'GL': 'Dietary GL'}, inplace=True)
summed_df['Dietary GI'] = summed_df['Dietary GL'] / summed_df['Available Carbohydrate']*100
summed_df


Unnamed: 0,respondent sequence number,Day,Dietary 20 Year Weight,Energy (kcal),Data release cycle,Available Carbohydrate,GI,Dietary GL,Dietary GI
0,1.0,1.0,1213.225733,1358.88,1,242.95,969,151.1440,62.211978
1,73.0,1.0,1441.458533,1089.91,1,186.59,483,112.8040,60.455544
2,93.0,1.0,11027.861243,1376.98,1,149.60,1535,108.6756,72.644118
3,962.0,1.0,6592.491222,1622.83,1,244.07,868,202.7762,83.081165
4,1126.0,1.0,2010.452172,1467.16,1,153.51,1357,125.4847,81.743665
...,...,...,...,...,...,...,...,...,...
1789,102553.0,2.0,1020.750383,2730.00,10,252.69,2030,208.1669,82.380347
1790,102823.0,1.0,6265.820687,1468.00,10,190.04,1685,139.0979,73.194012
1791,102823.0,2.0,5478.640411,1045.00,10,120.45,1348,79.0978,65.668576
1792,102951.0,1.0,6832.369652,1471.00,10,196.79,1228,162.0140,82.328370
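Dividing the summed GL by the summed available carbohydrate is equivalent to taking a carbohydrate-weighted average of the item-level GI values, which is why the plain sum of `GI` in the table above is not used for Dietary GI. A small check using the first three rows of the item-level table, treated here as a toy "day" rather than respondent 1's full record:

```python
import numpy as np

# GI and available carbohydrate (g) of three food items
gi = np.array([67, 76, 60])
available_carb = np.array([35.55, 11.71, 15.23])

# Item-level GL, then the daily totals as in the tutorial
gl = gi * available_carb / 100
dietary_gl = gl.sum()
dietary_gi = dietary_gl / available_carb.sum() * 100

# The same number, expressed as a carbohydrate-weighted mean of GI
weighted_mean_gi = np.average(gi, weights=available_carb)
print(np.isclose(dietary_gi, weighted_mean_gi))  # True
```

If the available carbohydrate for a day sums to zero, the division is undefined, so such days may need to be filtered out before this step.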


Finally, we need to average over the different days for respondents who reported on more than one day.

In [64]:
agg_dict = {
    'Dietary 20 Year Weight': 'mean',
    'Energy (kcal)': 'mean',
    'Available Carbohydrate':'mean',
    'Dietary GI': 'mean',
    'Dietary GL': 'mean',
    'Data release cycle': 'first',
}
summed_df = summed_df.groupby(['respondent sequence number']).agg(agg_dict).reset_index()
summed_df

Unnamed: 0,respondent sequence number,Dietary 20 Year Weight,Energy (kcal),Available Carbohydrate,Dietary GI,Dietary GL,Data release cycle
0,1.0,1213.225733,1358.88,242.950,62.211978,151.14400,1
1,73.0,1441.458533,1089.91,186.590,60.455544,112.80400,1
2,93.0,11027.861243,1376.98,149.600,72.644118,108.67560,1
3,962.0,6592.491222,1622.83,244.070,83.081165,202.77620,1
4,1126.0,2010.452172,1467.16,153.510,81.743665,125.48470,1
...,...,...,...,...,...,...,...
984,102352.0,2729.342226,2305.50,270.765,66.486953,180.18945,10
985,102435.0,515.042577,3042.00,397.865,72.525800,288.62955,10
986,102553.0,1074.611849,2470.00,256.925,82.909924,213.03875,10
987,102823.0,5872.230549,1256.50,155.245,69.431294,109.09785,10
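The averaging step can be checked on respondent 102823, who has two reporting days in the per-day table. The sketch below rebuilds just those two rows (values copied from the output above) and averages them:

```python
import pandas as pd

# The two reporting days of respondent 102823 from the per-day table
per_day = pd.DataFrame({
    'respondent sequence number': [102823.0, 102823.0],
    'Day': [1.0, 2.0],
    'Energy (kcal)': [1468.00, 1045.00],
    'Dietary GL': [139.0979, 79.0978],
    'Dietary GI': [73.194012, 65.668576],
})

# Average the daily values per respondent
value_cols = ['Energy (kcal)', 'Dietary GL', 'Dietary GI']
per_person = (
    per_day.groupby('respondent sequence number')[value_cols]
    .mean()
    .reset_index()
)
print(per_person)
```

The result agrees with the last row of the table above. Note that averaging the daily Dietary GI values is not the same as dividing the mean Dietary GL by the mean available carbohydrate; the tutorial code does the former.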


Now we can explore the participants. The example data include only Data release cycle as a grouping variable; other variables are available from NHANES and were used in the paper. Here, however, we focus on time trends. Let's plot the average values and approximate the errors.
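One simple way to summarize a time trend is the mean and standard error of the mean (SEM) per release cycle. The sketch below uses a made-up six-row table in place of `summed_df` and, as a simplifying assumption, ignores the NHANES survey weights, so it is only a rough approximation of a properly weighted estimate.

```python
import pandas as pd

# Hypothetical per-respondent values standing in for summed_df
toy = pd.DataFrame({
    'Data release cycle': [1, 1, 1, 10, 10, 10],
    'Dietary GL': [151.1, 112.8, 108.7, 180.2, 288.6, 213.0],
})

# Unweighted mean, standard error, and sample size per release cycle
trend = toy.groupby('Data release cycle')['Dietary GL'].agg(['mean', 'sem', 'count'])
print(trend)
```

The `mean` and `sem` columns could then be passed to an error-bar plot, for example matplotlib's `plt.errorbar(trend.index, trend['mean'], yerr=trend['sem'])`.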