# Open Food Facts database extraction for chocolate products

The Open Food Facts is a collaborative project where individual can contribute by adding data from the food products they buy. This project is a practice for visualization and classification specificaly on chocolate products. 

The code written hereafter was chosen as the best approach for this analysis after trying different approaches including using Hugging Face library, data dump and OpenFood Facts Python API.

The API is provided under MIT Licence. Please see https://github.com/openfoodfacts/openfoodfacts-python.git for further information.

## Data Loading and data selection

In [2]:
from openfoodfacts import API, APIVersion, Country, Environment, Flavor
import time
import json
import csv
import pandas as pd #version 2.3.3
import numpy as np #version 2.3.1

In [3]:
api = API(
    user_agent="<application name>",
    username=None,
    password=None,
    country=Country.world,
    flavor=Flavor.off,
    version=APIVersion.v2,
    environment=Environment.org,
    timeout=10
)

In [4]:
size = 100 #Maximum allowed by the api
results = api.product.text_search(query="chocolate, cocoa", page=1, page_size=size)
count = results.get("count")
total_pages = int(np.round(count/size))

Loading this first page allows to have a first sight on the data. It also gives the total products found and the number of pages to go through.
While iterating through each page of the search, only relevant data will be captured into the final dataframe.
Thus a first selection of the relevant columns is required.

In [7]:
def filter_df_by_keywords(df, keywords):
    """
    Filters a DataFrame to return a list which includes only columns 
    whose names contain any of the given keywords (provided as a list).
    """
    relevant_cols = []
    for col in df.columns: 
        for keyword in keywords:
            if keyword.lower() in col.lower():
                relevant_cols.append(col)
    return relevant_cols

targeted_keywords = ["name", "Quantity", "Brands", "Categories", 
                    "Manufacturing", "Stores", "Countr", 
                     "Ingredients","Origin", "nutriments", "id",
                    "keywords"]

df_test = pd.DataFrame(results["products"])
target_col = filter_df_by_keywords(df_test, targeted_keywords)

Columns number was narrowed down from 550 to 390 with this first keyword filtering.
Unfortunately, the remaining columns have very closely related names that can hardly be automaticaly filtered.
Hence, manual picking of columns was performed, aiming for information revelant for our analysis and provided in English.

In [8]:
column_selection = ['_id', '_keywords', 'generic_name_en','product_name', 'categories',
    'categories_hierarchy', 'brands','quantity', 'ingredients_original_tags', 'ingredients_text_en',
    'nutriments', 'countries', 'stores', 'manufacturing_places', 'origin', 'origin_en']
len(column_selection)

16

This final selection has 16 elements though the nutriments are nested within dictionaries for each product.
Normalizing gives 349 extra columns. Columns with more than 20 % of Nan are to be dropped.
The columns will then be filtered to retain only information standardized by 100 g.

In [40]:
def unpacking_nutriments(df, thershold):
    """
    Unpacking the nutriments, filterint relevant columns (100g an units),
    then merging it back with the initial dataframe

    Arguments: 
    - df: the dataframe containg a column nutriments where nutriments are stored in dictionaries
    - threshold: percentage of Nan within a column that triggers the removal of the column
    """
    df_nutriments = pd.json_normalize(df["nutriments"])

    # though the information is not often provided, we would like to keep information about cocoa content
    cocoa_columns = ["cocoa_100g", "cocoa_unit", "cocoa-minimum_100g", "cocoa-minimum_unit"]
    df_cocoa = df_nutriments[cocoa_columns]
    
    # removing columns that have a certain level of Nan values
    df_nutriments = df_nutriments.dropna(axis=1, thresh= len(df_nutriments)*(thershold/100)) 

    # then we want only data standardized per 100 g of product
    targeted_keywords_2 = ["100g", "unit"]
    target_col_nutriments = filter_df_by_keywords(df_nutriments, targeted_keywords_2)
    
    # this specific column is not relevant for our analysis because there shouldn't be any vegetable in any respectable chocolate.
    target_col_nutriments.remove("fruits-vegetables-legumes-estimate-from-ingredients_100g") 

    df = pd.concat([df, df_cocoa, df_nutriments[target_col_nutriments]], axis=1)
    df = df.drop(columns=["nutriments"])
    return df

df_test = unpacking_nutriments(df_test[column_selection], 20)
print(df_test.shape)

The final dataframe should have 44 columns provided that no other columns goes through the threshold of 20 % Nan value. The following code iterates through each pages of the research to store only the selected columns for chocolate products.

In order to avoid a time-out error from the OpenFood Facts server, a delay of 10 s is provided as an argument of the API. Because there is a lot of products, this retrieval code snippet takes 8 min to retrieve data which is a bit long. So I decided to export the dataset as .csv file to facilitate access to the data later on.

In [10]:
max_retries = 4
products = []

for i in range(1, total_pages + 1): # replace 5 by total_pages + 1 for definitive version
    for attempt in range(max_retries):
        try:
            results = api.product.text_search(query="chocolate, cocoa", page=i, page_size=size)
            product_list = results.get("products", [])

            # Filtering the column selected previously to more efficiently store data
            filtered_products = [
                {k: product.get(k) for k in column_selection if k in product}
                for product in product_list]
            
            products.extend(filtered_products)
            break
            
        except Exception as e:
            print(f"Error fetching page {i} (Attempt {attempt+1}/{max_retries}): {e}")
            time.sleep(10 + attempt * 2) # Exponential backoff: sleep longer on subsequent fails
            
    else: # This 'else' belongs to the inner 'for' loop and executes if 'break' wasn't hit
        
        print(f"Failed to fetch page {i} after {max_retries} attempts. Skipping.")
    #time.sleep(10)


In [42]:
df = pd.DataFrame(products)
df = unpacking_nutriments(df, 20)
df.shape

(15500, 44)

In [43]:
filepath = r'D:\OpenFoodFacts_chocolate.csv'
df.to_csv(filepath)

In [33]:
df.head()

Unnamed: 0,_id,_keywords,generic_name_en,product_name,categories,categories_hierarchy,brands,quantity,ingredients_original_tags,ingredients_text_en,nutriments,countries,stores,manufacturing_places,origin,origin_en
0,3046920029759,"[90, and, bar, chocolate, cocoa, dark, dot, ex...",Extra fine dark chocolate 90% cocoa,Supreme Dark 90%,"Snacks,Sweet snacks,Cocoa and its products,Cho...","[en:snacks, en:sweet-snacks, en:cocoa-and-its-...",Lindt,100 g,"[en:cocoa-paste, en:cocoa-butter, en:fat-reduc...","cocoa mass, cocoa butter, fat reduced cocoa, s...","{'added-sugars': 0, 'added-sugars_100g': 0, 'a...","Algeria,Austria,Belgium,Bulgaria,Canada,Czech ...","Carrefour,Géant,kupsch,Magasins U,Esselunga,Li...",Aachen,,
1,6111031005064,"[and, beverage, bimo, biscuit, cake, candie, c...",,Tonik,"Plant-based foods and beverages,Plant-based fo...","[en:plant-based-foods-and-beverages, en:plant-...",Bimo,22 g,"[fr:coffret-fourre-au-cacao, en:vanilla, fr:in...",,"{'alcohol': 0, 'alcohol_100g': 0, 'alcohol_ser...",Morocco,,,,
2,20995553,"[85, and, artificial, cacao, chocolat, chocola...",Dark chocolate,Chocolat noir - 85% cacao,"Snacks,Sweet snacks,Cocoa and its products,Cho...","[en:snacks, en:sweet-snacks, en:cocoa-and-its-...",J.D. Gross,125g,"[en:cocoa-paste, en:fat-reduced-cocoa-powder, ...","Cocoa mass, fat reduced cocoa powder, cocoa bu...","{'carbohydrates': 9, 'carbohydrates_100g': 36,...","Austria,Belgium,Bulgaria,Estonia,Finland,Franc...",Lidl,,,
3,8425197712024,"[almond, and, ce, chocolate, cocoa, compound, ...",Compound Chocolate with MILK AND ALMONDS,,"Snacks,Sweet snacks,Cocoa and its products,Con...","[en:snacks, en:sweet-snacks, en:cocoa-and-its-...",Maruja,150 g,"[en:sugar, en:cocoa-butter, en:whole-milk-powd...","sugar, cocoa butter, whole milk powder, cocoa ...","{'carbohydrates': 53, 'carbohydrates_100g': 53...","Algeria,Cameroon,France,Morocco,Spain",,Espagne,,
4,3608580065340,"[and, bonne, breakfast, cacao, chocolate, coco...",,Pâte à tartiner noisettes et cacao,"Breakfasts,Spreads,Sweet spreads,fr:Pâtes à ta...","[en:breakfasts, en:spreads, en:sweet-spreads, ...",Bonne Maman,360 g,"[en:sugar, en:hazelnut, en:vegetable-oil, en:s...","sugar, hazelnuts 20%, vegetable oils (sunflowe...","{'carbohydrates': 53, 'carbohydrates_100g': 53...",France,"carrefour.fr,Carrefour",,,
