# Hyper-Palatable Foods (HPF): Food Product Clustering

Goal:
    
Pull a subset of the nutrient matrix created in a previous notebook and calculate the following variables:
* PFAT: Percent calories (kilocalories) from fat
* PSUGR: Percent calories (kilocalories) from simple sugars
* PCARB: Percent calories (kilocalories) from carbohydrates
* PSODI: Percent sodium by food weight (in grams) per portion

Using the variables, calculate if each USDA food product satisfies the conditions to fall into any of the three different HPF clusters: 

1) FSOD: Fat and Sodium (>25% kcal from fat, ≥0.30% sodium by weight)
2) FS: Fat and Simple Sugars (>20% kcal from fat, >20% kcal from sugar)
3) CSOD: Carbohydrates and Sodium (>40% kcal from carbohydrates, ≥0.20% sodium by weight)

A food product can fall into one, several, or none of these clusters. If a food doesn't fall into any of them, it may not be hyper-palatable. A True/False column will be returned for each cluster. The methods for this notebook follow directly from the 2019 article [Hyper-Palatable Foods: Development of a Quantitative Definition and Application to the US Food System Database](https://www.researchgate.net/publication/337039170_Hyper-Palatable_Foods_Development_of_a_Quantitative_Definition_and_Application_to_the_US_Food_System_Database).
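
The three cluster rules above can be written as boolean columns. The following is a minimal sketch, assuming a DataFrame that already holds the four percentages in columns named `PFAT`, `PSUGR`, `PCARB`, and `PSODI`. The helper name `flag_hpf_clusters`, the extra `HPF` column, and the toy values are hypothetical and only serve the illustration.

```python
import pandas as pd

def flag_hpf_clusters(df):
    """Return a copy of df with one True/False column per HPF cluster."""
    out = df.copy()
    # FSOD: >25% kcal from fat and >=0.30% sodium by weight
    out["FSOD"] = (out["PFAT"] > 25) & (out["PSODI"] >= 0.30)
    # FS: >20% kcal from fat and >20% kcal from sugar
    out["FS"] = (out["PFAT"] > 20) & (out["PSUGR"] > 20)
    # CSOD: >40% kcal from carbohydrates and >=0.20% sodium by weight
    out["CSOD"] = (out["PCARB"] > 40) & (out["PSODI"] >= 0.20)
    # Convenience flag (an addition for this sketch): in at least one cluster
    out["HPF"] = out[["FSOD", "FS", "CSOD"]].any(axis=1)
    return out

# Toy values, not taken from the USDA data
toy = pd.DataFrame(
    {"PFAT": [50.0, 30.0, 5.0, 10.0],
     "PSUGR": [2.0, 40.0, 10.0, 5.0],
     "PCARB": [10.0, 20.0, 60.0, 30.0],
     "PSODI": [0.50, 0.05, 0.25, 0.10]},
    index=pd.Index([1, 2, 3, 4], name="fdc_id"))
flags = flag_hpf_clusters(toy)
print(flags[["FSOD", "FS", "CSOD", "HPF"]])
```

Keeping one column per cluster, rather than a single label, preserves the fact that a product can belong to more than one cluster.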


Other misc resources:
* https://github.com/USDA/USDA-APIs/issues/120

#### Setup

In [2]:
import numpy as np
import pandas as pd
import sqlalchemy as sal

from sqlalchemy import text

In [3]:
nutrient_matrix_data_p = r"../../data/"

nutrient_matrix_csv_p = nutrient_matrix_data_p + "nutrients_matrix.csv.gz"

nutrient_matrix_nutriscore_p = nutrient_matrix_data_p + "usda_2022_hpf_component.csv.gz"

#### Import the data cleaned in a previous notebook. Set the fdc_id to the index.

In [4]:
nutrients_matrix = pd.read_csv(nutrient_matrix_csv_p)
nutrients_matrix.set_index("fdc_id", inplace = True)
print(nutrients_matrix.shape)
nutrients_matrix.head()

(1590701, 103)


fdc_id,1003,1004,1005,1008,1079,1082,1084,1087,1089,1092,...,1099,1196,1316,1233,1112,1111,1273,1236,1080,1068
344604,0.81,0.41,4.07,24.0,0.8,0.0,0.0,13.0,0.0,179.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344605,0.81,0.41,4.07,24.0,0.8,0.0,0.0,16.0,0.0,179.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344606,23.21,2.68,0.0,0.0,0.0,0.0,0.0,0.0,1.29,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344607,23.21,2.68,0.0,0.0,0.0,0.0,0.0,0.0,1.29,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344608,18.75,15.18,0.0,0.0,0.0,0.0,0.0,18.0,0.96,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Choose a subset of the nutrients

https://fdc.nal.usda.gov/docs/Foundation_Foods_Documentation_Apr2023.pdf

In the main article, the analyses focused on the measurements of fat, simple sugars, carbohydrates, and sodium. The following assumptions were made:

"Percent calories (kilocalories) from fat (PFAT), simple sugars (PSUGR), and carbohydrates (PCARB) per serving was calculated using standard values of 9 kcal/g for fat and 4 kcal/g for carbohydrates and simple sugars (46). Percent kilocalories from carbohydrates was calculated from a total carbohydrates variable, which included fiber. Fiber slows the process of absorption of carbohydrates and sugar into the system, enhances satiety, and can alter palatability and food texture (47). Therefore, we subtracted fiber before calculating percent kilocalories from carbohydrates. To avoid overlap between the carbohydrates and simple sugars variables, we also subtracted sugar before calculating percent kilocalories from carbohydrates. The total sugars variable, which consisted of both naturally occurring and added sugars, was used to calculate percent kilocalories from simple sugars. For sodium, percent sodium by food weight (PSODI) (in grams) per portion was calculated."

In addition to these assumptions, we must also consider any assumptions made by FoodData Central. Most of the assumptions are listed in their [documentation](https://fdc.nal.usda.gov/data-documentation.html). This is a summary of the FoodData Central assumptions:

- For calories, Atwater General Factors of 4, 9, and 4 for protein, fat, and carbohydrate, respectively, are used to calculate total energy in kcal.
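
The definitions quoted above translate into a few column operations. The sketch below is one possible reading, not the article's code. It assumes nutrient amounts per 100 g of food, uses the reported energy in kcal as the denominator, and clips negative net carbohydrates to zero. The column names (`fat_g`, `sugar_g`, `carb_g`, `fiber_g`, `sodium_mg`, `energy_kcal`) and the helper `hpf_components` are hypothetical.

```python
import pandas as pd

def hpf_components(df):
    """Compute PFAT, PSUGR, PCARB, and PSODI from per-100 g nutrient amounts."""
    out = pd.DataFrame(index=df.index)
    # Rows with missing or zero energy become NaN instead of dividing by zero
    kcal = df["energy_kcal"].where(df["energy_kcal"] > 0)
    out["PFAT"] = 100 * 9 * df["fat_g"] / kcal       # 9 kcal per g of fat
    out["PSUGR"] = 100 * 4 * df["sugar_g"] / kcal    # 4 kcal per g of sugar
    # Subtract fiber and sugar from total carbohydrates, as in the quoted method;
    # clipping at 0 is an assumption of this sketch
    net_carb_g = (df["carb_g"] - df["fiber_g"] - df["sugar_g"]).clip(lower=0)
    out["PCARB"] = 100 * 4 * net_carb_g / kcal       # 4 kcal per g of carbohydrate
    # mg of sodium per 100 g of food, divided by 1000, is g per 100 g (percent by weight)
    out["PSODI"] = df["sodium_mg"] / 1000
    return out

# Toy values, not taken from the USDA data
toy = pd.DataFrame({"fat_g": [10.0, 1.0], "sugar_g": [5.0, 0.0], "carb_g": [30.0, 2.0],
                    "fiber_g": [5.0, 0.0], "sodium_mg": [400.0, 50.0],
                    "energy_kcal": [250.0, 0.0]})
components = hpf_components(toy)
print(components)
```

For the first toy row, fat contributes 9 × 10 = 90 of 250 kcal, so PFAT is 36%. The second row has no reported energy, so its calorie-based percentages are left as NaN.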

The FoodData Central nutrients used for each variable:

* Fat: Use Total Lipid (Fat)
* Simple Sugar
* Carbohydrate

In [6]:
sorted(list(nutrients_matrix.columns))

['1003',
 '1004',
 '1005',
 '1007',
 '1008',
 '1009',
 '1011',
 '1012',
 '1013',
 '1018',
 '1026',
 '1038',
 '1051',
 '1056',
 '1057',
 '1062',
 '1068',
 '1072',
 '1078',
 '1079',
 '1080',
 '1081',
 '1082',
 '1084',
 '1086',
 '1087',
 '1088',
 '1089',
 '1090',
 '1091',
 '1092',
 '1093',
 '1095',
 '1096',
 '1098',
 '1099',
 '1100',
 '1101',
 '1102',
 '1103',
 '1104',
 '1107',
 '1109',
 '1110',
 '1111',
 '1112',
 '1114',
 '1123',
 '1124',
 '1158',
 '1162',
 '1165',
 '1166',
 '1167',
 '1170',
 '1175',
 '1176',
 '1177',
 '1178',
 '1180',
 '1181',
 '1185',
 '1186',
 '1190',
 '1196',
 '1210',
 '1211',
 '1212',
 '1213',
 '1214',
 '1215',
 '1216',
 '1217',
 '1218',
 '1219',
 '1220',
 '1221',
 '1222',
 '1223',
 '1224',
 '1225',
 '1226',
 '1227',
 '1232',
 '1233',
 '1234',
 '1235',
 '1236',
 '1253',
 '1257',
 '1258',
 '1261',
 '1262',
 '1263',
 '1269',
 '1273',
 '1292',
 '1293',
 '1316',
 '1368',
 '1403',
 '1404',
 '2000']

Convert the energy component from kilojoules to kilocalories, then combine the two energy columns. At the time this notebook was created, column 1062 holds energy in kJ and column 1008 holds energy in kcal.

In [4]:
nutrients_matrix['1062'] = nutrients_matrix['1062']/4.184 #1 kcal is 4.184 kj
nutrients_matrix['1008'] = nutrients_matrix['1008'] + nutrients_matrix['1062']
del nutrients_matrix['1062']
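
Adding the two columns gives the right total only if each product reports energy in a single unit. If some products report both, the sum counts their energy twice. The sketch below shows one way to guard against that, assuming that a kcal value of 0 or NaN means "not reported". The helper name `combine_energy` and the toy rows are hypothetical.

```python
import pandas as pd

KJ_PER_KCAL = 4.184  # 1 kcal is 4.184 kJ

def combine_energy(kcal, kj):
    """Prefer the reported kcal value; fall back to kJ converted to kcal."""
    kcal_from_kj = kj / KJ_PER_KCAL
    # Keep kcal where it is positive, otherwise use the converted kJ value
    return kcal.where(kcal.fillna(0) > 0, kcal_from_kj)

# Toy rows: kcal only, kJ only, and both units reported
toy = pd.DataFrame({"1008": [24.0, 0.0, 100.0], "1062": [0.0, 418.4, 418.4]})
toy["1008"] = combine_energy(toy["1008"], toy["1062"])
toy = toy.drop(columns="1062")
print(toy)
```

In the third toy row, both units describe the same 100 kcal, so the result stays at 100 rather than 200.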

#### Get the list of nutrient names from nourish

In [4]:
pip install psycopg2-binary

Collecting psycopg2-binary
  Using cached psycopg2_binary-2.9.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: psycopg2-binary
Successfully installed psycopg2-binary-2.9.6
Note: you may need to restart the kernel to use updated packages.


In [5]:
nourish_user = "gmichael"

nourish_pswd = "567khcwx3s"

engine = sal.create_engine('postgresql+psycopg2://' + nourish_user + ':' + nourish_pswd + '@awesome-hw.sdsc.edu/nourish')
conn = engine.connect()
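
Credentials written into a notebook cell are easy to leak when the notebook is shared. As a hedged alternative, the sketch below reads them from environment variables and escapes special characters before building the connection URL. The variable names `NOURISH_USER` and `NOURISH_PSWD`, the placeholder defaults, and the helper `build_postgres_url` are assumptions for illustration and are not part of the original setup.

```python
import os
from urllib.parse import quote_plus

def build_postgres_url(user, password, host, database):
    """Assemble a postgresql+psycopg2 URL, escaping special characters."""
    return ("postgresql+psycopg2://" + quote_plus(user) + ":" + quote_plus(password)
            + "@" + host + "/" + database)

# Hypothetical variable names; set them in the shell before starting Jupyter.
# The defaults are placeholders so the sketch runs without them.
nourish_user = os.environ.get("NOURISH_USER", "example_user")
nourish_pswd = os.environ.get("NOURISH_PSWD", "example:p@ss")

url = build_postgres_url(nourish_user, nourish_pswd, "awesome-hw.sdsc.edu", "nourish")
# engine = sal.create_engine(url)  # same call as in the cell above
```

Escaping with `quote_plus` matters when a password contains characters such as `@` or `:`, which would otherwise be parsed as URL delimiters.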

In [6]:
query_nutrients = text('''SELECT *
from "usda_2022_nutrient_master"''')

result = conn.execute(query_nutrients)

nutrient_names = [i for i in result]

nutrient_names[0:2]

[(2047, 'Energy (Atwater General Factors)', 'KCAL', Decimal('957'), '280.0'),
 (2048, 'Energy (Atwater Specific Factors)', 'KCAL', Decimal('958'), '290.0')]

In [8]:
nutrient_names_df = pd.DataFrame(nutrient_names)
nutrient_names_df['name'] = nutrient_names_df['name'].str.upper()
nutrient_names_df

Unnamed: 0,id,name,unit_name,nutrient_nbr,rank
0,2047,ENERGY (ATWATER GENERAL FACTORS),KCAL,957,280.0
1,2048,ENERGY (ATWATER SPECIFIC FACTORS),KCAL,958,290.0
2,1001,SOLIDS,G,201,200.0
3,1002,NITROGEN,G,202,500.0
4,1003,PROTEIN,G,203,600.0
...,...,...,...,...,...
469,2061,"ERGOSTA-7,22-DIENOL",MG,,16211.0
470,2062,"ERGOSTA-5,7-DIENOL",MG,,16211.0
471,2063,VERBASCOSE,G,,2450.0
472,2064,OLIGOSACCHARIDES,MG,,2250.0


In [10]:
nutrient_names_df[nutrient_names_df['name'].str.contains('FAT')]

Unnamed: 0,id,name,unit_name,nutrient_nbr,rank
5,1004,TOTAL LIPID (FAT),G,204,800.0
50,1049,"SOLIDS, NON-FAT",G,253,999999.0
86,1085,TOTAL FAT (NLEA),G,298,900.0
258,1257,"FATTY ACIDS, TOTAL TRANS",G,605,15400.0
259,1258,"FATTY ACIDS, TOTAL SATURATED",G,606,9700.0
291,1291,"FATTY ACIDS, OTHER THAN 607-615, 617-621, 624-...",G,644,999999.0
292,1292,"FATTY ACIDS, TOTAL MONOUNSATURATED",G,645,11400.0
293,1293,"FATTY ACIDS, TOTAL POLYUNSATURATED",G,646,12900.0
318,1318,"FATTY ACIDS, SATURATED, OTHER",G,677,999999.0
319,1319,"FATTY ACIDS, MONOUNSAT., OTHER",G,678,999999.0


#### Pull the fdc_ids and their ingredients / food categories from nourish

In [10]:
query_ingredients = text('''SELECT "fdc_id", "ingredients", "branded_food_category"
from "usda_2022_branded_food_product"''')

result = conn.execute(query_ingredients)

ingredient_data = [i for i in result]

conn.close()

In [11]:
ingredient_df = pd.DataFrame(ingredient_data)
ingredient_df.set_index("fdc_id", inplace = True)
print(ingredient_df.shape)
ingredient_df

(1702125, 2)


fdc_id,ingredients,branded_food_category
355336,"Granola (Whole Grain Rolled Oats, Brown Sugar,...",
355337,"Ingredients: Raw cane sugar #, cocoa butter #,...",
355338,"INGREDIENTS: POTATO FLOUR, CANOLA OIL, CORNSTA...",
355339,"INGREDIENTS: SUGAR, UNBLEACHED ENRICHED FLOUR ...",
355340,INGREDIENTS: UNBLEACHED ENRICHED FLOUR (WHEAT...,
...,...,...
355332,"Ingredients: MILK**, sugar, vegetable fats (pa...",
355333,INGREDIENTS: UNBLEACHED ENRICHED FLOUR (WHEAT ...,
355334,"Ingredients: Sweeteners (isomalt, aspartame, a...",
355335,"INGREDIENTS: SUGAR, INVERT SUGAR, CORN SYRUP, ...",


In [12]:
ingredient_df["branded_food_category"].value_counts()

branded_food_category
Popcorn, Peanuts, Seeds & Related Snacks    80570
Candy                                       78867
Cheese                                      66472
Ice Cream & Frozen Yogurt                   52665
Cookies & Biscuits                          49547
                                            ...  
Fresh Fruit and Vegetables                      1
Cakes/Slices/Biscuits                           1
Ice-Cream/Block Single                          1
Amino Acid Supplements                          1
Cakes - Sweet (Shelf Stable)                    1
Name: count, Length: 360, dtype: int64

#### Rename the nutrient columns and merge with ingredients

In [13]:
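# Map nutrient id -> name using the original query rows (id, name, ...).
# nutrient_names_df["name"] was upper-cased above, but later cells select
# mixed-case names such as "Sugars, Total", so the original case is needed.
rename_dict = {str(row[0]): row[1] for row in nutrient_names}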

In [14]:
nutrients_matrix.rename(columns = rename_dict, inplace = True)
nutrients_matrix

fdc_id,Protein,Total lipid (fat),"Carbohydrate, by difference",Energy,"Fiber, total dietary","Fiber, soluble","Fiber, insoluble","Calcium, Ca","Iron, Fe","Potassium, K",...,"Fluoride, F","Choline, from phosphotidyl choline","PUFA 18:2 n-6 c,c",Glutamine,Vitamin D3 (cholecalciferol),Vitamin D2 (ergocalciferol),SFA 22:0,"Sugars, intrinsic",Lignin,Beta-glucans
344604,0.81,0.41,4.07,24.0,0.8,0.0,0.0,13.0,0.00,179.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344605,0.81,0.41,4.07,24.0,0.8,0.0,0.0,16.0,0.00,179.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344606,23.21,2.68,0.00,0.0,0.0,0.0,0.0,0.0,1.29,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344607,23.21,2.68,0.00,0.0,0.0,0.0,0.0,0.0,1.29,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344608,18.75,15.18,0.00,0.0,0.0,0.0,0.0,18.0,0.96,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2340755,4.85,1.82,7.58,67.0,0.6,0.0,0.0,30.0,0.55,21.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2340756,4.85,1.82,7.58,67.0,0.6,0.0,0.0,30.0,0.55,52.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2340757,4.85,1.82,7.58,67.0,0.6,0.0,0.0,30.0,0.55,52.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2340758,4.85,1.82,7.58,67.0,0.6,0.0,0.0,30.0,0.55,21.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
print(sorted(list(nutrients_matrix.columns)))

['Acetic acid', 'Alanine', 'Alcohol, ethyl', 'Arginine', 'Ash', 'Aspartic acid', 'Beta-glucans', 'Biotin', 'Caffeine', 'Calcium, Ca', 'Carbohydrate, by difference', 'Carbohydrate, other', 'Carotene, beta', 'Chlorine, Cl', 'Cholesterol', 'Choline, from phosphotidyl choline', 'Choline, total', 'Chromium, Cr', 'Copper, Cu', 'Cysteine', 'Cystine', 'Energy', 'Epigallocatechin-3-gallate', 'Fatty acids, total monounsaturated', 'Fatty acids, total polyunsaturated', 'Fatty acids, total saturated', 'Fatty acids, total trans', 'Fiber, insoluble', 'Fiber, soluble', 'Fiber, total dietary', 'Fluoride, F', 'Folate, DFE', 'Folate, total', 'Folic acid', 'Fructose', 'Glucose', 'Glutamic acid', 'Glutamine', 'Glycine', 'Histidine', 'Inositol', 'Inulin', 'Iodine, I', 'Iron, Fe', 'Isoleucine', 'Lactic acid', 'Lactose', 'Leucine', 'Lignin', 'Lutein + zeaxanthin', 'Lysine', 'Magnesium, Mg', 'Manganese, Mn', 'Methionine', 'Molybdenum, Mo', 'Niacin', 'PUFA 18:2', 'PUFA 18:2 n-6 c,c', 'PUFA 18:3 n-3 c,c,c (ALA)'

In [16]:
nutrients_matrix = nutrients_matrix.merge(ingredient_df, left_index = True, right_index = True, how = 'left')
print(nutrients_matrix.shape)
nutrients_matrix.head(3)

(1590701, 104)


fdc_id,Protein,Total lipid (fat),"Carbohydrate, by difference",Energy,"Fiber, total dietary","Fiber, soluble","Fiber, insoluble","Calcium, Ca","Iron, Fe","Potassium, K",...,"PUFA 18:2 n-6 c,c",Glutamine,Vitamin D3 (cholecalciferol),Vitamin D2 (ergocalciferol),SFA 22:0,"Sugars, intrinsic",Lignin,Beta-glucans,ingredients,branded_food_category
344604,0.81,0.41,4.07,24.0,0.8,0.0,0.0,13.0,0.0,179.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Tomatoes, Tomato Juice, Less Than 2% Of: Salt,...",
344605,0.81,0.41,4.07,24.0,0.8,0.0,0.0,16.0,0.0,179.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Tomatoes, Tomato Juice, Less Than 2% Of: Salt,...",
344606,23.21,2.68,0.0,0.0,0.0,0.0,0.0,0.0,1.29,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"White Turkey, Natural Flavoring",
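
A left merge keeps every row of the nutrient matrix, so products without ingredient data silently get NaN. The sketch below shows one way to count those rows and to guard against duplicated keys, using the `indicator` and `validate` options of `DataFrame.merge`. The two small frames are toy stand-ins for `nutrients_matrix` and `ingredient_df`.

```python
import pandas as pd

# Toy stand-ins for the two frames merged above
nutrients = pd.DataFrame({"Protein": [0.81, 23.21, 4.85]},
                         index=pd.Index([344604, 344606, 2340755], name="fdc_id"))
ingredients = pd.DataFrame({"ingredients": ["Tomatoes, Tomato Juice", "White Turkey"]},
                           index=pd.Index([344604, 344606], name="fdc_id"))

merged = nutrients.merge(ingredients, left_index=True, right_index=True,
                         how="left",
                         validate="one_to_one",  # raise if an fdc_id repeats on either side
                         indicator=True)         # adds a "_merge" column
# Rows present only in the nutrient matrix have no ingredient record
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(n_unmatched, "product(s) without ingredient data")
```

The `_merge` column can be dropped once the check is done.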


### Grab a subset of nutrients and prepare for ranking

In [17]:
cols = ["branded_food_category", "ingredients", "Energy", "Sugars, Total", "Fatty acids, total saturated",
        "Sodium, Na", "Fiber, insoluble", "Fiber, soluble", "Fiber, total dietary", "Protein"]
# Take an explicit copy so later column assignments modify this frame, not a view
nutrients_matrix = nutrients_matrix[cols].copy()
nutrients_matrix.head()

fdc_id,branded_food_category,ingredients,Energy,"Sugars, Total","Fatty acids, total saturated","Sodium, Na","Fiber, insoluble","Fiber, soluble","Fiber, total dietary",Protein
344604,,"Tomatoes, Tomato Juice, Less Than 2% Of: Salt,...",24.0,2.44,0.0,203.0,0.0,0.0,0.8,0.81
344605,,"Tomatoes, Tomato Juice, Less Than 2% Of: Salt,...",24.0,2.44,0.0,203.0,0.0,0.0,0.8,0.81
344606,,"White Turkey, Natural Flavoring",0.0,0.0,0.89,67.0,0.0,0.0,0.0,23.21
344607,,"Turkey Breast, Natural Flavoring",0.0,0.0,0.89,67.0,0.0,0.0,0.0,23.21
344608,,"Turkey, natural Flavoring.",0.0,0.0,4.46,103.0,0.0,0.0,0.0,18.75


In [18]:
#turn off SettingWithCopyWarning
pd.options.mode.chained_assignment = None

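# Merge related columns for fibers.
# "Fiber, total dietary" already includes the soluble and insoluble fractions,
# so adding all three would double count. Take the larger of the reported total
# and the sum of the two fractions instead.
nutrients_matrix["Fibers"] = np.maximum(
    nutrients_matrix["Fiber, total dietary"],
    nutrients_matrix["Fiber, insoluble"] + nutrients_matrix["Fiber, soluble"])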

nutrients_matrix.drop(columns=["Fiber, insoluble", "Fiber, soluble", "Fiber, total dietary"], inplace = True)
nutrients_matrix

fdc_id,branded_food_category,ingredients,Energy,"Sugars, Total","Fatty acids, total saturated","Sodium, Na",Protein,Fibers
344604,,"Tomatoes, Tomato Juice, Less Than 2% Of: Salt,...",24.0,2.44,0.00,203.0,0.81,0.8
344605,,"Tomatoes, Tomato Juice, Less Than 2% Of: Salt,...",24.0,2.44,0.00,203.0,0.81,0.8
344606,,"White Turkey, Natural Flavoring",0.0,0.00,0.89,67.0,23.21,0.0
344607,,"Turkey Breast, Natural Flavoring",0.0,0.00,0.89,67.0,23.21,0.0
344608,,"Turkey, natural Flavoring.",0.0,0.00,4.46,103.0,18.75,0.0
...,...,...,...,...,...,...,...,...
2340755,Chocolate,"FILTERED WATER, ORGAIN ORGANIC PROTEIN BLEND (...",67.0,2.73,0.30,48.0,4.85,0.6
2340756,Chocolate,"FILTERED WATER, ORGAIN ORGANIC PROTEIN BLEND (...",67.0,2.73,0.30,58.0,4.85,0.6
2340757,Chocolate,"FILTERED WATER, ORGAIN ORGANIC PROTEIN BLEND (...",67.0,2.73,0.30,58.0,4.85,0.6
2340758,Chocolate,"FILTERED WATER, ORGAIN ORGANIC PROTEIN BLEND (...",67.0,2.73,0.30,48.0,4.85,0.6


### Calculate a modified Nutri Score
The score is modified because the data has no fruit or vegetable percentage. The scoring follows the mechanism laid out at https://en.wikipedia.org/wiki/Nutri-Score as closely as possible. The overall score for a food is found by subtracting the total number of favourable points from the total number of unfavourable points.

In [19]:
beverage_categories = ['Alcohol', 'Alcoholic Beverages', 'Baby/Infant  Foods/Beverages', 
                       'Baby/Infant – Foods/Beverages', 'Beer', 'Breakfast Drinks', 'Coffee', 
                       'Coffee - Instant, Roast and Ground', 'Coffee/Tea/Substitutes', 'Drinks', 
                       'Drinks - Energy Drinks', 'Drinks - Juices, Drinks and Cordials', 'Drinks - Powdered', 
                       'Drinks - Soft Drinks', 'Drinks Flavoured - Ready to Drink', 'Energy, Protein & Muscle Recovery Drinks', 
                       'Food/Beverage/Tobacco Variety Packs', 'Frozen Fruit & Fruit Juice Concentrates', 
                       'Fruit & Vegetable Juice, Nectars & Fruit Drinks', 'Iced & Bottle Tea', 'Infant Formula', 'Liquid Water Enhancer',
                       'Milk', 'Milk Additives', 'Milk/Cream', 'Milk/Cream - Shelf Stable', 'Milk/Milk Substitutes', 
                       'Non Alcoholic Beverages  Not Ready to Drink', 'Non Alcoholic Beverages  Ready to Drink',
                       'Non Alcoholic Beverages – Not Ready to Drink', 'Non Alcoholic Beverages – Ready to Drink',
                       'Other Drinks', 'Plant Based Milk', 'Plant Based Water', 'Powdered Drinks', 'Ready To Drink', 
                       'Soda', 'Sport Drinks', 'Tea - Bags, Loose Leaf, Speciality', 'Tea Bags', 'Water']

In [20]:
def get_energy_points(row):
    """Energy points (0-10) for one product row; Energy is in kcal."""
    # Beverages use a much finer step (7.2 kcal) than other foods (80 kcal)
    step = 7.2 if row["branded_food_category"] in beverage_categories else 80
    return min(row["Energy"] // step, 10)

In [21]:
def get_sugar_points(row):
    """Sugar points (0-10) for one product row; sugars are in grams."""
    sugars = row["Sugars, Total"]
    if row["branded_food_category"] in beverage_categories:
        if sugars == 0:
            return 0
        # Add 1 because the beverage scale is shifted up one row:
        # any non-zero amount of sugar already earns a point
        return min(sugars // 1.5 + 1, 10)
    return min(sugars // 4.5, 10)

First, calculate the unhealthy (unfavourable) points (0-10) for each component.

In [22]:
nutrients_matrix['Energy_NutriScore'] = nutrients_matrix.apply(get_energy_points, axis=1)

nutrients_matrix['Sugars_NutriScore'] = nutrients_matrix.apply(get_sugar_points, axis=1)

# One point per full gram of saturated fat, capped at 10
nutrients_matrix['SatFat_NutriScore'] = (nutrients_matrix["Fatty acids, total saturated"]//1).clip(upper=10)

# One point per full 90 mg of sodium, capped at 10
nutrients_matrix['Salt_NutriScore'] = (nutrients_matrix["Sodium, Na"]//90).clip(upper=10)
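
Row-wise `apply(..., axis=1)` calls a Python function once per row, which can be slow on a table of about 1.6 million products. The sketch below is a possible vectorized equivalent of the energy and sugar scoring, using the same steps and caps as the functions above. The function names `energy_points` and `sugar_points`, the shortened beverage set, and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

def energy_points(energy_kcal, is_beverage):
    """Energy points (0-10): step of 7.2 kcal for beverages, 80 kcal otherwise."""
    points = np.where(is_beverage, energy_kcal // 7.2, energy_kcal // 80)
    return pd.Series(points, index=energy_kcal.index).clip(upper=10)

def sugar_points(sugar_g, is_beverage):
    """Sugar points (0-10): beverage scale is shifted up one row for non-zero sugar."""
    beverage_pts = np.where(sugar_g == 0, 0, sugar_g // 1.5 + 1)
    points = np.where(is_beverage, beverage_pts, sugar_g // 4.5)
    return pd.Series(points, index=sugar_g.index).clip(upper=10)

# Toy rows, not taken from the USDA data
toy = pd.DataFrame({"branded_food_category": ["Soda", "Candy", None],
                    "Energy": [42.0, 400.0, 24.0],
                    "Sugars, Total": [10.6, 60.0, 2.44]})
# Shortened stand-in for the full beverage_categories list; a missing category is not a beverage
is_bev = toy["branded_food_category"].isin({"Soda", "Water"})
toy["Energy_NutriScore"] = energy_points(toy["Energy"], is_bev)
toy["Sugars_NutriScore"] = sugar_points(toy["Sugars, Total"], is_bev)
print(toy)
```

Computing the beverage mask once with `isin` also avoids repeating the list lookup for every row.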

Next calculate the healthy points (0-5).

In [23]:
# One point per full 0.7 g of fiber, capped at 5
nutrients_matrix['Fibers_NutriScore'] = (nutrients_matrix["Fibers"]//0.7).clip(upper=5)

# One point per full 1.6 g of protein, capped at 5
nutrients_matrix['Protein_NutriScore'] = (nutrients_matrix["Protein"]//1.6).clip(upper=5)

Since there is no fruit and vegetable percentage, we search the ingredients column instead. If the ingredients contain at least one of the fruit or vegetable terms listed below, the product gets 1 favourable point. The list of terms was given in a scientific and technical Q&A found here: https://www.santepubliquefrance.fr/en/nutri-score

In [24]:
fruits_veggies = "fruit,apple,pear,quince,medlar,date,lychee,persimmon,grape,cherry,blackcurrant,strawberries,redcurrants,blackberries,cranberries,bilberries,lemon,orange,grapefruit,kumquat,tangerine,banana,kiwi,pineapple,melon,fig,mango,passionfruit,guava,papaya,pomegranate,cashewfruit,carambola,durian,rambutan,sweetsop,pricklypear,sapodilla,breadfruit,tamarillo,tamarind,vegetable,endive,lettuce,leaflettuce,arugula,escarole,spinach,lamb'slettuce,dandeliongreens,nettle,beetgreens,sorrel,brassicasbcabbage,cauliflower,redcabbage,brusselssprouts,curlykale,greencabbage,chinesecabbage,watercress,radish,broccoli,celery,fennel,rhubarb,asparagus,chicory,globeartichoke,palmhearts,bambooshoots,taroshoots,onion,shallot,leek,garlic,chive,parsley,carrot,salsify,celeriac,radish,parsnip,beetroot,chicoryroot,tomato,aubergine,cucumber,courgette,sweetpepper,chillipepper,squash,gourd,greenbanana,plantain,avocado,olive,pickle,pumpkinflower,pea,broadbean,sweetcorn,soyabean,seaweed,algae,chickpea,greenpea,pigeonpea,bean,lentil,cowpea,soyabean,carobbean,broadbean,walnut,hazelnut,pistachio,brazilnut,cashew,pecan,coconut,peanut,almond,chestnuts,rapeseedoil,walnutoil,oliveoils,basil,coriander,lemongrass,marjoram,mint,oregano,sage".split(",")

In [25]:
print(fruits_veggies)

['fruit', 'apple', 'pear', 'quince', 'medlar', 'date', 'lychee', 'persimmon', 'grape', 'cherry', 'blackcurrant', 'strawberries', 'redcurrants', 'blackberries', 'cranberries', 'bilberries', 'lemon', 'orange', 'grapefruit', 'kumquat', 'tangerine', 'banana', 'kiwi', 'pineapple', 'melon', 'fig', 'mango', 'passionfruit', 'guava', 'papaya', 'pomegranate', 'cashewfruit', 'carambola', 'durian', 'rambutan', 'sweetsop', 'pricklypear', 'sapodilla', 'breadfruit', 'tamarillo', 'tamarind', 'vegetable', 'endive', 'lettuce', 'leaflettuce', 'arugula', 'escarole', 'spinach', "lamb'slettuce", 'dandeliongreens', 'nettle', 'beetgreens', 'sorrel', 'brassicasbcabbage', 'cauliflower', 'redcabbage', 'brusselssprouts', 'curlykale', 'greencabbage', 'chinesecabbage', 'watercress', 'radish', 'broccoli', 'celery', 'fennel', 'rhubarb', 'asparagus', 'chicory', 'globeartichoke', 'palmhearts', 'bambooshoots', 'taroshoots', 'onion', 'shallot', 'leek', 'garlic', 'chive', 'parsley', 'carrot', 'salsify', 'celeriac', 'radis

In [26]:
def get_fuit_veggie_point(ingredients):
    for fruit in fruits_veggies:
        if fruit in ingredients:
            return 1
    return 0

In [27]:
nutrients_matrix['ingredients'] = nutrients_matrix['ingredients'].astype(str)
nutrients_matrix['ingredients'] = nutrients_matrix['ingredients'].str.lower()
nutrients_matrix['FruitVeggie_NutriScore'] = nutrients_matrix['ingredients'].apply(get_fuit_veggie_point)
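
Plain substring matching can produce false positives: `'sage'` is found inside "sausage" and `'pea'` inside "peanut". List entries written without spaces, such as `'brusselssprouts'`, also never match ingredient text that contains spaces. The sketch below shows one possible stricter matcher based on word boundaries with an optional plural ending. It is an alternative for comparison, not the method used in this notebook. The names `build_term_matcher` and `fruit_veggie_point` and the short term list are hypothetical.

```python
import re

def build_term_matcher(terms):
    """Compile one regex that matches any term as a whole word (optionally plural)."""
    # Longest terms first so longer alternatives are tried before shorter ones
    ordered = sorted(set(terms), key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + alternation + r")(?:es|s)?\b")

def fruit_veggie_point(ingredients, matcher):
    """Return 1 if any term appears as a whole word in the ingredient text, else 0."""
    return 1 if matcher.search(ingredients.lower()) else 0

# Short stand-in for the full fruits_veggies list
terms = ["sage", "pea", "tomato", "date"]
matcher = build_term_matcher(terms)

print(fruit_veggie_point("Pork Sausage, Salt", matcher))      # whole-word match fails
print("sage" in "pork sausage, salt")                         # substring match succeeds
print(fruit_veggie_point("Tomatoes, Tomato Juice", matcher))
```

Whether the stricter rule is preferable depends on how the ingredient strings are written, so it is worth comparing both on a sample before switching.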

Calculate the modified Nutri-Score. Lower (more negative) values are better; higher values are worse.
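
As a quick arithmetic check, the score can be worked out by hand for single products. The sketch below applies the non-beverage steps and caps used in this notebook to the values shown in the tables above for fdc_id 344604 (tomatoes), 344606, and 344608 (turkey). The function name `modified_nutri_score` is hypothetical.

```python
def modified_nutri_score(energy, sugars, sat_fat, sodium, fibers, protein, fruit_veggie_point):
    """Modified Nutri-Score for a non-beverage product (values per 100 g)."""
    # Unfavourable points: each component is capped at 10
    unfavourable = (min(energy // 80, 10) + min(sugars // 4.5, 10)
                    + min(sat_fat // 1, 10) + min(sodium // 90, 10))
    # Favourable points: fiber and protein are capped at 5, plus the 0/1 ingredient point
    favourable = min(fibers // 0.7, 5) + min(protein // 1.6, 5) + fruit_veggie_point
    return unfavourable - favourable

# 344604: 2 sodium points minus 1 fiber point and 1 fruit/vegetable point
tomatoes = modified_nutri_score(24.0, 2.44, 0.0, 203.0, 0.8, 0.81, 1)
# 344606: no unfavourable points, 5 protein points
white_turkey = modified_nutri_score(0.0, 0.0, 0.89, 67.0, 0.0, 23.21, 0)
# 344608: 4 saturated fat points and 1 sodium point minus 5 protein points
turkey = modified_nutri_score(0.0, 0.0, 4.46, 103.0, 0.0, 18.75, 0)
print(tomatoes, white_turkey, turkey)
```

These three results agree with the `nutri_score` values listed for the same products in the final table of this notebook.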

In [28]:
nutrients_matrix["nutri_score"] = ((nutrients_matrix['Energy_NutriScore'] + nutrients_matrix['Sugars_NutriScore'] + 
                                   nutrients_matrix['SatFat_NutriScore'] + nutrients_matrix['Salt_NutriScore']) - 
                                    (nutrients_matrix['Fibers_NutriScore'] + nutrients_matrix['FruitVeggie_NutriScore'] +
                                    nutrients_matrix['Protein_NutriScore'] ) )

nutrients_matrix.drop(columns = ['Energy_NutriScore', 'Sugars_NutriScore', 'SatFat_NutriScore', 'Salt_NutriScore', 
                                 'Fibers_NutriScore', 'FruitVeggie_NutriScore', 'Protein_NutriScore', "Energy", "Sugars, Total", 
                                 "Fatty acids, total saturated", "Sodium, Na", "Protein", "Fibers", "branded_food_category",
                                "ingredients"], inplace = True)

Rank the nutri scores with grades A-E

In [29]:
def get_nutri_score_label(score):
    if score < 0:
        return "A"
    elif score < 3:
        return "B"
    elif score < 11:
        return "C"
    elif score < 19:
        return "D"
    else:
        return "E"

In [30]:
nutrients_matrix['nutri_score_label'] = nutrients_matrix['nutri_score'].apply(get_nutri_score_label)
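
The same grade boundaries can also be applied without a Python-level function call per row. The sketch below uses `pd.cut` with left-closed bins that mirror the thresholds in `get_nutri_score_label` (below 0 is A, 0 to 2 is B, 3 to 10 is C, 11 to 18 is D, 19 and above is E). The sample scores are toy values.

```python
import numpy as np
import pandas as pd

# Toy scores covering each boundary
scores = pd.Series([-5.0, 0.0, 2.0, 3.0, 10.0, 11.0, 18.0, 19.0])

# right=False makes each bin include its left edge: [-inf, 0), [0, 3), [3, 11), ...
labels = pd.cut(scores,
                bins=[-np.inf, 0, 3, 11, 19, np.inf],
                right=False,
                labels=["A", "B", "C", "D", "E"])
print(labels.tolist())
```

The result is a categorical column, which also keeps the A to E ordering available for sorting.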

In [31]:
nutrients_matrix

fdc_id,nutri_score,nutri_score_label
344604,0.0,B
344605,0.0,B
344606,-5.0,A
344607,-5.0,A
344608,0.0,B
...,...,...
2340755,-4.0,A
2340756,-4.0,A
2340757,-4.0,A
2340758,-4.0,A


In [34]:
nutrients_matrix.to_csv(nutrient_matrix_nutriscore_p, 
                   index = True, compression = "gzip")