# Hyper-Palatable Foods (HPF): Food Product Clustering

Goal:
    
Pull a subset of the nutrient matrix created in a previous notebook and calculate the following variables:
* PFAT: Percent calories (kilocalories) from fat
* PSUGR: Percent calories (kilocalories) from simple sugars
* PCARB: Percent calories (kilocalories) from carbohydrates
* PSODI: Percent sodium by food weight (in grams) per portion

Using these variables, determine whether each USDA food product meets the criteria for any of the three HPF clusters:

1) FSOD: Fat and Sodium (>25% kcal from fat, ≥0.30% sodium by weight)
2) FS: Fat and Simple Sugars (>20% kcal from fat, >20% kcal from sugar)
3) CSOD: Carbohydrates and Sodium (>40% kcal from carbohydrates, ≥0.20% sodium by weight)

A food product can fall into one, several, or none of the clusters. A food that falls into none of them does not meet this quantitative definition of hyper-palatability. A True/False column is returned for each cluster. The methods in this notebook follow the 2019 article [Hyper-Palatable Foods: Development of a Quantitative Definition and Application to the US Food System Database](https://www.researchgate.net/publication/337039170_Hyper-Palatable_Foods_Development_of_a_Quantitative_Definition_and_Application_to_the_US_Food_System_Database).
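
The three cluster rules can be sketched as boolean column tests. This is a minimal illustration, not the notebook's final implementation: the helper name `hpf_flags` and the toy values are hypothetical, and the input columns `PFAT`, `PSUGR`, `PCARB`, and `PSODI` are assumed to already be expressed in percent.

```python
import pandas as pd

def hpf_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Return boolean FSOD, FS, and CSOD columns from percent-valued inputs."""
    flags = pd.DataFrame(index=df.index)
    # FSOD: >25% kcal from fat and >=0.30% sodium by weight
    flags["FSOD"] = (df["PFAT"] > 25) & (df["PSODI"] >= 0.30)
    # FS: >20% kcal from fat and >20% kcal from simple sugars
    flags["FS"] = (df["PFAT"] > 20) & (df["PSUGR"] > 20)
    # CSOD: >40% kcal from carbohydrates and >=0.20% sodium by weight
    flags["CSOD"] = (df["PCARB"] > 40) & (df["PSODI"] >= 0.20)
    return flags

# Toy values for illustration only; they are not taken from the USDA data.
toy = pd.DataFrame(
    {
        "PFAT": [30.0, 22.0, 5.0],
        "PSUGR": [2.0, 35.0, 1.0],
        "PCARB": [10.0, 30.0, 60.0],
        "PSODI": [0.50, 0.05, 0.25],
    },
    index=["salty_fat", "sweet_fat", "salty_carb"],
)
toy_flags = hpf_flags(toy)
```

Each toy row satisfies exactly one cluster, so `toy_flags` has a single `True` per row. A real product may satisfy several rules at once, which is why the flags are kept as separate columns rather than a single label.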


Other misc resources:
* https://github.com/USDA/USDA-APIs/issues/120

## Setup

In [1]:
import numpy as np
import pandas as pd
import sqlalchemy as sal

from sqlalchemy import text

In [2]:
nutrient_matrix_data_p = r"../../data/"

nutrient_matrix_csv_p = nutrient_matrix_data_p + "nutrients_matrix.csv.gz"

nutrient_matrix_nutriscore_p = nutrient_matrix_data_p + "usda_2022_hpf_component.csv.gz"

#### Import the data cleaned in a previous notebook and set `fdc_id` as the index.

In [3]:
nutrients_matrix = pd.read_csv(nutrient_matrix_csv_p)
nutrients_matrix.set_index("fdc_id", inplace = True)
print(nutrients_matrix.shape)
nutrients_matrix.head()

(1590701, 103)


fdc_id,1003,1004,1005,1008,1079,1082,1084,1087,1089,1092,...,1099,1196,1316,1233,1112,1111,1273,1236,1080,1068
344604,0.81,0.41,4.07,24.0,0.8,0.0,0.0,13.0,0.0,179.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344605,0.81,0.41,4.07,24.0,0.8,0.0,0.0,16.0,0.0,179.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344606,23.21,2.68,0.0,0.0,0.0,0.0,0.0,0.0,1.29,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344607,23.21,2.68,0.0,0.0,0.0,0.0,0.0,0.0,1.29,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
344608,18.75,15.18,0.0,0.0,0.0,0.0,0.0,18.0,0.96,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Choose a subset of the nutrients

The main article focuses its analyses on fat, simple sugars, carbohydrates, and sodium. The calculations in this notebook rely on the following statements from the article and from FoodData Central:

* The article:
    - "Percent calories (kilocalories) from fat (PFAT), simple sugars (PSUGR), and carbohydrates (PCARB) per serving was calculated using standard values of 9 kcal/g for fat and 4 kcal/g for carbohydrates and simple sugars (46). Percent kilocalories from carbohydrates was calculated from a total carbohydrates variable, which included fiber. Fiber slows the process of absorption of carbohydrates and sugar into the system, enhances satiety, and can alter palatability and food texture (47). Therefore, we subtracted fiber before calculating percent kilocalories from carbohydrates. To avoid overlap between the carbohydrates and simple sugars variables, we also subtracted sugar before calculating percent kilocalories from carbohydrates. The total sugars variable, which consisted of both naturally occurring and added sugars, was used to calculate percent kilocalories from simple sugars. For sodium, percent sodium by food weight (PSODI) (in grams) per portion was calculated"

* FoodData Central (their [documentation](https://fdc.nal.usda.gov/data-documentation.html)):
    - For calories, the Atwater General Factors of 4, 9, and 4 kcal/g for protein, fat, and carbohydrate, respectively, are used to calculate total energy in kcal.
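
As a rough illustration of the Atwater General Factors, the sketch below estimates energy from macronutrient grams. The function name `atwater_kcal` is hypothetical. The example values come from the first row of the matrix preview and assume that column 1003 holds protein in grams (columns 1004 and 1005 are fat and carbohydrate by difference).

```python
def atwater_kcal(protein_g: float, fat_g: float, carb_g: float) -> float:
    """Estimate energy in kcal with the Atwater General Factors (4, 9, 4 kcal/g)."""
    return 4 * protein_g + 9 * fat_g + 4 * carb_g

# First row of the preview (fdc_id 344604): 0.81 g, 0.41 g, 4.07 g.
estimated_kcal = atwater_kcal(0.81, 0.41, 4.07)
```

For this row the estimate is about 23.2 kcal, close to the 24 kcal reported in column 1008.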

Variables

* Fat: Use Total Lipid (Fat)
* Simple Sugar: Use 'SUGARS, TOTAL'. "The total sugars variable, which consisted of both naturally occurring and added sugars, was used to calculate percent kilocalories from simple sugars."
* Carbohydrate: Use 'CARBOHYDRATE, BY DIFFERENCE', subtract 'FIBER, TOTAL DIETARY' and 'SUGARS, TOTAL'
    - FoodData Central: "Carbohydrate content, referred to as “carbohydrate by difference” in the tables, is expressed as the difference between 100 and the sum of the percentages of water, protein, total lipid (fat), ash, and alcohol (when present)." Documentation for foundation foods: "Values for carbohydrate by difference include total dietary fiber content."
    - Article: "...we subtracted fiber before calculating percent kilocalories from carbohydrates. To avoid overlap between the carbohydrates and simple sugars variables, we also subtracted sugar before calculating percent kilocalories from carbohydrates."
* Sodium: Use 'SODIUM, NA'
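
The variable definitions above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's final code: the helper name `hpf_components` is hypothetical, sodium is assumed to be already converted to grams, nutrient amounts are assumed to be reported per 100 g of food (so grams of sodium equal percent by weight), and clipping negative net carbohydrate to zero is an added safeguard that the article does not mention.

```python
import numpy as np
import pandas as pd

KCAL_PER_G_FAT = 9   # standard value quoted in the article
KCAL_PER_G_CARB = 4  # used for both carbohydrates and simple sugars

def hpf_components(df: pd.DataFrame) -> pd.DataFrame:
    """Compute PFAT, PSUGR, PCARB (percent of kcal) and PSODI (percent by weight)."""
    # Zero-energy rows would divide by zero, so treat them as missing.
    energy = df["ENERGY"].replace(0, np.nan)
    # Carbohydrate by difference includes fiber and sugars; subtract both.
    net_carb = (
        df["CARBOHYDRATE, BY DIFFERENCE"]
        - df["FIBER, TOTAL DIETARY"]
        - df["SUGARS, TOTAL"]
    ).clip(lower=0)  # assumption: negative differences are treated as 0
    out = pd.DataFrame(index=df.index)
    out["PFAT"] = 100 * KCAL_PER_G_FAT * df["TOTAL LIPID (FAT)"] / energy
    out["PSUGR"] = 100 * KCAL_PER_G_CARB * df["SUGARS, TOTAL"] / energy
    out["PCARB"] = 100 * KCAL_PER_G_CARB * net_carb / energy
    # Assumption: amounts are per 100 g, so g of sodium == percent by weight.
    out["PSODI"] = df["SODIUM, NA"]
    return out

# Two rows copied from the subset preview in this notebook.
sample = pd.DataFrame(
    {
        "ENERGY": [24.0, 0.0],
        "TOTAL LIPID (FAT)": [0.41, 2.68],
        "SUGARS, TOTAL": [2.44, 0.0],
        "CARBOHYDRATE, BY DIFFERENCE": [4.07, 0.0],
        "FIBER, TOTAL DIETARY": [0.8, 0.0],
        "SODIUM, NA": [0.203, 0.067],
    },
    index=pd.Index([344604, 344606], name="fdc_id"),
)
components = hpf_components(sample)
```

Rows with zero reported energy yield missing percentages instead of infinities, so they fail every kcal-based threshold without raising an error.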

In [4]:
subset = ['ENERGY', 'TOTAL LIPID (FAT)', 'SUGARS, TOTAL', 'CARBOHYDRATE, BY DIFFERENCE', 'FIBER, TOTAL DIETARY', 'SODIUM, NA']

#### Get the list of nutrient names from nourish

In [5]:
%pip install psycopg2-binary

Note: you may need to restart the kernel to use updated packages.


In [6]:
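import os

# Credentials are read from environment variables (the names NOURISH_USER and
# NOURISH_PSWD are placeholders) so that they are not stored in the notebook.
nourish_user = os.environ["NOURISH_USER"]
nourish_pswd = os.environ["NOURISH_PSWD"]

engine = sal.create_engine('postgresql+psycopg2://' + nourish_user + ':' + nourish_pswd + '@awesome-hw.sdsc.edu/nourish')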
conn = engine.connect()

In [7]:
query_nutrients = text('''SELECT *
from "usda_2022_nutrient_master"''')

result = conn.execute(query_nutrients)

nutrient_names = [i for i in result]

nutrient_names[0:2]

[(2047, 'Energy (Atwater General Factors)', 'KCAL', Decimal('957'), '280.0'),
 (2048, 'Energy (Atwater Specific Factors)', 'KCAL', Decimal('958'), '290.0')]

In [8]:
nutrient_names_df = pd.DataFrame(nutrient_names)
nutrient_names_df['name'] = nutrient_names_df['name'].str.upper()
nutrient_names_df[nutrient_names_df['name'].isin(subset)]

,id,name,unit_name,nutrient_nbr,rank
5,1004,TOTAL LIPID (FAT),G,204,800.0
6,1005,"CARBOHYDRATE, BY DIFFERENCE",G,205,1110.0
9,1008,ENERGY,KCAL,208,300.0
63,1062,ENERGY,kJ,268,400.0
80,1079,"FIBER, TOTAL DIETARY",G,291,1200.0
94,1093,"SODIUM, NA",MG,307,5800.0
415,2000,"SUGARS, TOTAL",G,269,1510.0


Convert the energy values reported in kilojoules to kilocalories, then combine the two energy columns. At the time of writing, column 1062 holds energy in kJ and column 1008 holds energy in kcal.

In [9]:
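# 1 kcal = 4.184 kJ
nutrients_matrix['1062'] = nutrients_matrix['1062'] / 4.184
# Summing assumes each product reports energy in only one unit (the other column being 0);
# a product with both values filled in would be double-counted.
nutrients_matrix['1008'] = nutrients_matrix['1008'] + nutrients_matrix['1062']
del nutrients_matrix['1062']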
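
The cell above sums the two energy columns, which is safe only if no product reports energy in both units. A more defensive variant is sketched here on toy data; the helper name `combine_energy` and the toy values are hypothetical, and preferring the kcal value over the converted kJ value is an assumption, not something the article prescribes.

```python
import pandas as pd

KJ_PER_KCAL = 4.184

def combine_energy(kcal: pd.Series, kj: pd.Series) -> pd.Series:
    """Keep the reported kcal value; fall back to converted kJ where kcal is missing or zero."""
    kcal_from_kj = kj / KJ_PER_KCAL
    # Series.where keeps values where the condition holds and substitutes elsewhere.
    return kcal.where(kcal.fillna(0) > 0, kcal_from_kj)

# Toy values: kcal only, kJ only, and both units reported.
toy_kcal = pd.Series([100.0, 0.0, 50.0])
toy_kj = pd.Series([0.0, 418.4, 209.2])
combined = combine_energy(toy_kcal, toy_kj)
```

For the third toy product, which reports both units, this keeps 50 kcal, whereas summing the columns would give roughly 100 kcal.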

Convert sodium from milligrams to grams

In [10]:
nutrients_matrix['1093'] = nutrients_matrix['1093'] / 1000

#### Update the matrix column headers

In [11]:
# Map each column header (a nutrient id stored as a string) to its upper-cased nutrient name.
var_mapping = {}
for col in nutrients_matrix.columns:
    var_mapping[col] = nutrient_names_df[nutrient_names_df['id']==int(col)].iloc[0]['name']

nutrients_matrix.rename(columns = var_mapping, inplace = True)

#### Keep only the columns in the chosen nutrient subset

In [13]:
nutrients_matrix = nutrients_matrix[subset]
nutrients_matrix

fdc_id,ENERGY,TOTAL LIPID (FAT),"SUGARS, TOTAL","CARBOHYDRATE, BY DIFFERENCE","FIBER, TOTAL DIETARY","SODIUM, NA"
344604,24.0,0.41,2.44,4.07,0.8,0.203
344605,24.0,0.41,2.44,4.07,0.8,0.203
344606,0.0,2.68,0.00,0.00,0.0,0.067
344607,0.0,2.68,0.00,0.00,0.0,0.067
344608,0.0,15.18,0.00,0.00,0.0,0.103
...,...,...,...,...,...,...
2340755,67.0,1.82,2.73,7.58,0.6,0.048
2340756,67.0,1.82,2.73,7.58,0.6,0.058
2340757,67.0,1.82,2.73,7.58,0.6,0.058
2340758,67.0,1.82,2.73,7.58,0.6,0.048
