In [2]:
import pandas as pd
import json
pd.set_option('display.max_colwidth', None)
pd.set_option('display.min_rows', 25)
pd.set_option('display.max_columns', None)

In [3]:
FILENAME = '../../datasets/products_0.995_cleaned.csv'
df = pd.read_csv(FILENAME)


# Preprocessing

In [4]:
def to_json(df: pd.DataFrame, filename): df.to_json(filename, indent=4, orient='records')

## Adding additives danger

In [5]:
with open('additives_danger.json', 'r') as json_file:
    ADDITIVES_DANGER = { additive['code']: 4 - additive['danger'] for additive in json.load(json_file) }


NOT_FOUND = []
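def get_codes_by_level(level: int) -> list[str]:
    # Iterate over (code, level) pairs; iterating the dict directly yields keys only.
    return [k for k, v in ADDITIVES_DANGER.items() if v == level]

def get_danger(additives):
    """Return (min, mean, max) danger level over the additives with a known rating."""
    # Missing values (NaN) are not strings: treat them as "no additives".
    if not isinstance(additives, str): return 0, 0, 0
    dangers = []
    for additive in additives.split(','):
        # The additive code is the part before the first '-'.
        code = additive.split('-')[0].strip()
        if code in ADDITIVES_DANGER:
            dangers.append(ADDITIVES_DANGER[code])
    if len(dangers) == 0: return 0, 0, 0
    return min(dangers), sum(dangers) / len(dangers), max(dangers)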


df[['additives_min_danger', 'additives_average_danger', 'additives_max_danger']] = df.apply(lambda r: get_danger(r['additives']), axis=1, result_type="expand")

In [6]:
def get_additives_count_hazard(additives, level):
    if type(additives) != str: return 0
    hazard = 0
    for additive in additives.split(','):
        code = additive.split('-')[0].strip()
        if ADDITIVES_DANGER.get(code, -1) == level: hazard += 1
    return hazard

# One count column per danger level (0 to 3).
for level in range(4):
    df[f'additives_{level}_count'] = df['additives'].apply(lambda a, level=level: get_additives_count_hazard(a, level))

In [7]:
df['has_additives'] = df['additives'].notna()
df['has_additives_3'] = df['additives_3_count'] != 0
df['additives_count'] = df['additives'].apply(lambda x: len(x.split(',')) if type(x) == str else 0)
df['ingredients_count'] = df['ingredients_tags'].apply(lambda x: len(x.split(',')) if type(x) == str else 0)

# Analysis

## Introduction
Additives are substances that are added to food products during processing to maintain or improve certain properties such as appearance, freshness, taste or texture.

Changing food consumption habits have forced the food industry to transform and to produce in ever greater quantities.

Thus, additives are sometimes necessary to ensure that processed foods remain safe and wholesome throughout their journey from factories through transportation to warehouses and stores to consumers.

Nevertheless, many products contain non-essential additives that serve no real need and are present only to make the product more attractive in taste or appearance (artificial sweeteners, food colorings, and flavor enhancers). Other products contain preservatives to extend their shelf life.

[1] https://food.ec.europa.eu/safety/food-improvement-agents/additives_en 

[2] https://www.who.int/news-room/fact-sheets/detail/food-additives

[3] https://www.fda.gov/food/food-ingredients-packaging/overview-food-ingredients-additives-colors


In [8]:
bubble_chart_additives = df['additives'].str.split(pat=',').explode(ignore_index=True).value_counts().head(50).reset_index()
# Name the columns explicitly: the defaults produced by reset_index() differ between pandas versions.
bubble_chart_additives.columns = ['additive', 'count']
bubble_chart_additives['dangerosity'] = bubble_chart_additives['additive'].apply(lambda a: ADDITIVES_DANGER.get(a.split('-')[0].strip(), 0))
to_json(bubble_chart_additives, 'graph/bubble_chart_additives.json')

## Dangerousness

However, many additives present health risks, and it is important to know what the products we consume contain. Organizations such as the World Health Organization (WHO), the Food and Drug Administration (FDA) in the United States, and the European Food Safety Authority (EFSA) evaluate and regulate the use of additives and set limits on the quantities allowed.

Even so, some of these substances still present risks, and many scientific studies show a correlation between high consumption of certain additives and adverse health effects such as an increased risk of cardiovascular disease and cancer.

For our project, we used data from the company Yuka, which reviewed numerous scientific studies [4] on the effects of additive consumption and assigned each additive a score according to its dangerousness:
- 0: No risk
- 1: Limited risk
- 2: Moderate risk
- 3: Hazardous

[4] https://help.yuka.io/l/fr/article/bf5vi9gytc
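
To make the scale concrete, here is a minimal sketch of how a comma-separated additives string can be reduced to a minimum, average and maximum score, mirroring the `get_danger` function from the preprocessing step. The table `SAMPLE_DANGER`, its placeholder codes and their levels are invented for illustration and are not Yuka's ratings; the function name `summarize_danger` is also introduced here.

```python
# Placeholder codes and levels, invented for this example (not real ratings).
SAMPLE_DANGER = {"A1": 0, "B2": 2, "C3": 3}

def summarize_danger(additives, table):
    """Return (min, mean, max) danger over the additives found in `table`."""
    # Missing values (e.g. NaN) are not strings: no additives, so all zeros.
    if not isinstance(additives, str):
        return 0, 0, 0
    levels = []
    for additive in additives.split(","):
        # Keep only the code, i.e. the part before the first '-'.
        code = additive.split("-")[0].strip()
        if code in table:
            levels.append(table[code])
    # Additives without a known rating are ignored entirely.
    if not levels:
        return 0, 0, 0
    return min(levels), sum(levels) / len(levels), max(levels)

# "Z9" has no rating in the table, so only A1 (0) and C3 (3) are used.
example = summarize_danger("A1 - first, C3 - third, Z9 - unknown", SAMPLE_DANGER)
print(example)  # (0, 1.5, 3)
```

Note that unrated additives are skipped rather than counted as 0, so they do not pull the average down; a product with only unrated additives, however, gets the same (0, 0, 0) result as a product with no additives at all.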

In [9]:
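print(df['has_additives'].mean())  # share of products with at least one additive
print(df['has_additives_3'].mean())  # share of products with at least one level-3 (hazardous) additive
print(df[df['has_additives']]['additives_count'].mean())  # mean additive count among products that contain additives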

0.45714986309207567
0.1955126895255221
3.0057214375111747


## Food categories

The presence and the danger of additives depend strongly on the product category. We find many more additives in cold cuts, sweetened drinks and ready-made meals than in vegetables, pasta and plant-based milks, for example.

The following graph shows both variables by product category: presence (radius) and dangerousness (color). Cold cuts and sodas clearly stand out as the most dangerous products.

In [10]:
df['categories_splitted'] = df['categories'].str.split(',')

bubble_chart_categories = df.explode(['categories_splitted']).groupby('categories_splitted')['additives_average_danger'].agg(['mean', 'count']).query('count > 25').reset_index()
bubble_chart_categories.columns = ['categories', 'dangerosity', 'count']
bubble_chart_categories['group'] = 1
to_json(bubble_chart_categories, 'graph/bubble_chart_categories.json')
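
The aggregation above relies on a product belonging to several categories at once: `explode` duplicates a product's row once per category before grouping. The sketch below shows the same steps on a three-row toy frame; the data is invented for illustration, and the `count > 25` filter is left out because the toy frame is too small for it.

```python
import pandas as pd

# Invented toy data: three products, two of them in more than one category.
toy = pd.DataFrame({
    "categories": ["sodas,beverages", "beverages", "sodas"],
    "additives_average_danger": [3.0, 0.0, 2.0],
})
toy["categories_splitted"] = toy["categories"].str.split(",")

# One row per (product, category) pair, then mean danger and product count per category.
per_category = (
    toy.explode("categories_splitted")
       .groupby("categories_splitted")["additives_average_danger"]
       .agg(["mean", "count"])
       .reset_index()
)
print(per_category)
```

Here the first product contributes its score of 3.0 to both `sodas` and `beverages`, so each category ends up with two products.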

## NOVA Index

The NOVA index is used to indicate how processed a product is. Product processing is a broad term that can mean mixing different products to create a new one as well as several processing steps such as cooking, freezing, drying, fermenting and, of course, adding additives. 
- 1: Unprocessed or minimally processed foods
- 2: Processed culinary ingredients
- 3: Processed foods
- 4: Ultra-processed food and drink products

[5] https://world.openfoodfacts.org/nova

In [11]:
df.groupby('nova_group')[['additives_0_count', 'additives_1_count', 'additives_2_count', 'additives_3_count']] \
    .mean() \
    .reset_index() \
    .to_json('graph/nova_stacked_bar_chart.json', indent=4, orient='records')

As shown in the graph opposite, ultra-processed products contain significantly more additives than other products and also a higher proportion of hazardous additives. Here are some key facts about the impact of product processing:

In [28]:
print(df[(df['nova_group'] == 4)]['additives_count'].mean() / df[(df['nova_group'] == 1)]['additives_count'].mean())
print(df[(df['nova_group'] == 4) & (df['has_additives'] == True)]['additives_3_count'].mean() / df[(df['nova_group'] == 1) & (df['has_additives'] == True)]['additives_3_count'].mean())
print((df[(df['nova_group'] == 4)]['additives_count'].mean() / df[(df['nova_group'] < 4)]['additives_count'].mean() - 1) * 100)
print((df[(df['nova_group'] == 4)]['additives_average_danger'].mean() / df[(df['nova_group'] < 4)]['additives_average_danger'].mean() - 1) * 100)

57.54998675635629
7.404296758402484
1160.1314470881543
460.8563230413776
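
The four figures above mix two conventions. The first two are plain ratios of NOVA 4 to NOVA 1 (about 57.5 times as many additives on average and, among products with additives, about 7.4 times as many level-3 additives). The last two are percentage increases of NOVA 4 over NOVA 1 to 3 combined. The sketch below shows how the two conventions relate; the helper names are introduced here for illustration and do not appear in the notebook.

```python
def ratio(a, b):
    """How many times larger a is than b."""
    return a / b

def percent_increase(a, b):
    """Relative increase of a over b, in percent."""
    return (a / b - 1) * 100

def percent_to_ratio(percent):
    """Convert a percentage increase back into a ratio."""
    return 1 + percent / 100

# An increase of about 1160 % corresponds to a ratio of about 12.6.
print(round(percent_to_ratio(1160.13), 1))  # 12.6
```

Read this way, the third figure says that ultra-processed products contain roughly 12.6 times as many additives on average as all other products combined.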


## Vegan and Vegetarian

The high presence of additives in meat products, revealed by the product category analysis, led us to analyze the impact of different diets.
A vegetarian diet avoids all animal flesh (meat, fish, seafood, poultry, etc.). A vegan diet goes further and excludes all animal products (milk, eggs, etc.).

As shown in the two graphs on the left, a vegetarian or vegan diet can significantly reduce the number and dangerousness of the additives consumed.

As before, here are some key figures:

In [41]:
df['is_meat_based'] = df['categories'].str.contains('meat')

def get_vegan(df: pd.DataFrame) -> pd.DataFrame:
    return df[df['is_vegan'] == True]

def get_non_vegan(df: pd.DataFrame) -> pd.DataFrame:
    return df[df['is_vegan'] == False]

def get_vegetarian(df: pd.DataFrame) -> pd.DataFrame:
    return df[df['is_vegetarian'] == True]

def get_non_vegetarian(df: pd.DataFrame) -> pd.DataFrame:
    return df[df['is_vegetarian'] == False]

def get_vegan_or_vegetarian(df: pd.DataFrame) -> pd.DataFrame:
    return df[(df['is_vegan'] == True) | (df['is_vegetarian'] == True)]

def get_non_vegan_and_non_vegetarian(df: pd.DataFrame) -> pd.DataFrame:
    return df[(df['is_vegan'] == False) & (df['is_vegetarian'] == False)]

def get_meat_based_products(df: pd.DataFrame) -> pd.DataFrame:
    return df[df['is_meat_based'] == True]

def get_non_meat_based_products(df: pd.DataFrame) -> pd.DataFrame:
    return df[df['is_meat_based'] == False]

def get_above_dangerosity(df: pd.DataFrame, dangerosity: int) -> pd.DataFrame:
    return df[df['additives_min_danger'] > dangerosity]

In [31]:
labels = ['Meat based products', 'All products', 'Vegetarian', 'Vegan', 'Vegetarian or Vegan']
average_additives_counts = [
    get_meat_based_products(df)['additives_count'].mean(),
    df['additives_count'].mean(),
    get_vegetarian(df)['additives_count'].mean(),
    get_vegan(df)['additives_count'].mean(),
    get_vegan_or_vegetarian(df)['additives_count'].mean()
]
pd.DataFrame({
    'label': labels,
    'average_additives_count': average_additives_counts,
}).to_json('graph/vegetarian_vegan_additives_count.json', indent=4, orient='records')

In [33]:
labels = ['Meat based products', 'All products', 'Vegetarian', 'Vegan', 'Vegetarian or Vegan']
average_hazards = [
    get_meat_based_products(df)['additives_average_danger'].mean(),
    df['additives_average_danger'].mean(),
    get_vegetarian(df)['additives_average_danger'].mean(),
    get_vegan(df)['additives_average_danger'].mean(),
    get_vegan_or_vegetarian(df)['additives_average_danger'].mean()
]
pd.DataFrame({
    'label': labels,
    'average_hazard': average_hazards
}).to_json('graph/vegetarian_vegan_average_hazards.json', indent=4, orient='records')

In [43]:
print(df['is_meat_based'].mean() * 100)
print(get_above_dangerosity(get_meat_based_products(df), 2)['additives_count'].sum() / get_above_dangerosity(df, 2)['additives_count'].sum() * 100)

26.241721854304632
71.41414141414143


In [44]:
print((get_meat_based_products(df)['additives_count'].mean() / get_non_meat_based_products(df)['additives_count'].mean() - 1) * 100)

405.15044424284133


In [46]:
print(get_meat_based_products(df)['additives_average_danger'].mean() / get_non_meat_based_products(df)['additives_average_danger'].mean())

3.45489107452079


In [48]:
print((get_vegetarian(df)['additives_3_count'].mean() / df['additives_3_count'].mean() - 1) * 100)

-76.15582617330767


In [49]:
print((get_vegan(df)['additives_3_count'].mean() / df['additives_3_count'].mean() - 1) * 100)

-81.19117488055899


In [57]:
print(df['additives_count'].mean() / get_vegetarian(df)['additives_count'].mean())

1.7550443886560905


## Nutriscore

The Nutri-Score is an index that provides information on the nutritional quality of food products. It was created in France in 2017 and is now used in several other countries (Belgium, Spain, Luxembourg, Germany, the Netherlands, ...). The Nutri-Score ranges from A (good nutritional quality) to E (poor nutritional quality). Although it is not directly related to the presence or absence of additives, a clear association can be observed between the Nutri-Score and the presence and dangerousness of additives in products.

[6] https://www.santepubliquefrance.fr/determinants-de-sante/nutrition-et-activite-physique/articles/nutri-score

In [58]:
df.groupby('nutriscore_fr')[['additives_0_count', 'additives_1_count', 'additives_2_count', 'additives_3_count']].mean().reset_index().to_json('graph/nutriscore_stacked_bar_chart.json', indent=4, orient='records')

In [67]:
print(df[df['nutriscore_fr'] == 'E']['additives_count'].mean() / df[df['nutriscore_fr'] == 'A']['additives_count'].mean())
print(df[df['nutriscore_fr'] == 'E']['additives_average_danger'].mean() / df[df['nutriscore_fr'] == 'A']['additives_average_danger'].mean())
print((df[~df['nutriscore_fr'].isin(['D', 'E'])]['additives_count'].mean() / df[df['nutriscore_fr'].isin(['D', 'E'])]['additives_count'].mean() - 1) * 100)
print((df[~df['nutriscore_fr'].isin(['D', 'E'])]['additives_average_danger'].mean() / df[df['nutriscore_fr'].isin(['D', 'E'])]['additives_average_danger'].mean() - 1) * 100)

3.590893363455362
4.6219489849590385
-35.096925613275445
-47.689800106547175
