# Chocolate Database analysis

Open Food Facts is a collaborative project where individuals contribute data about the food products they buy.
This project is an exercise in visualization and classification, focused specifically on chocolate products. Here is the list of ideas and questions I wanted to investigate:
1. As any chocolate lover knows, there are several types of chocolate bars, defined by the amount of actual cocoa in the product. My hypothesis is that these categories have a direct influence on the nutritional values (fat, carbohydrates, protein, etc.). Hence the first idea is to build a model that classifies the different types of chocolate bars.
2. Based on the identified types, examine the distribution of the nutrients and the prevalence of certain categories by brand.
3. Map the countries of origin, with a visual showing the number of products from each origin.
4. Map the dominant type of chocolate for each consumer country.

Without further ado, let's get started!

## Data Loading and cleaning

In [44]:
from datasets import load_dataset, Value, Features
import pandas as pd #version 2.3.3
import numpy as np #version 2.3.1
import seaborn as sns #version 0.13.2
import matplotlib #version 3.10.0
import matplotlib.pyplot as plt
import statsmodels.api as sm #version 0.14.5

To begin with, a small sample of the database is loaded to evaluate the structure.

In [45]:
data_stream = load_dataset(
    "openfoodfacts/product-database",
    split="food",
    streaming=True # Use streaming for massive datasets
    )
df_initial = pd.DataFrame(data_stream.take(1000))

The database contains many variables that are not relevant to our goal. Of its 110 columns, only a handful are of interest, so selecting columns is the first cleaning step.

In [46]:
def filter_df_by_keywords(df, keywords):
    """
    Returns the names of the DataFrame columns that contain any of the given keywords
    (case-insensitive). Each matching column is listed once, in its original order.
    """
    lowered_keywords = [keyword.lower() for keyword in keywords]
    relevant_cols = []
    for col in df.columns:
        # any() stops at the first match, so a column matching several keywords is not duplicated
        if any(keyword in col.lower() for keyword in lowered_keywords):
            relevant_cols.append(col)
    return relevant_cols

In [47]:
targeted_keywords = ["name", "Quantity", "Brands", "Categories", 
                    "Manufacturing", "Stores", "Country", 
                     "Ingredients","Origin", "nutriments"]

target_col = filter_df_by_keywords(df_initial, targeted_keywords)
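
For comparison, a similar case-insensitive column selection can be sketched with pandas' built-in `DataFrame.filter`. This is an alternative to the helper above, not the method used in this notebook, and the toy column names are invented for illustration.

```python
import re
import pandas as pd

# Toy frame standing in for df_initial; the column names are illustrative only.
toy_df = pd.DataFrame(columns=["code", "product_name", "brands_tags", "nutriments", "images"])

keywords = ["name", "Brands", "nutriments"]

# Build one case-insensitive pattern; re.escape guards against regex metacharacters.
pattern = "(?i)" + "|".join(re.escape(k) for k in keywords)

# DataFrame.filter(regex=...) keeps the columns whose label matches the pattern.
selected_cols = toy_df.filter(regex=pattern).columns.tolist()
print(selected_cols)
```

Because each column label is tested once against the combined pattern, a column that matches several keywords still appears only once in the result.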

This first pass narrows the selection down to 34 columns. Unfortunately, even after keyword filtering, several of them are not useful. At this stage, a manual review is the best way to finish the selection.

In [48]:
items_to_remove = [
    "categories", "categories_properties", "ingredients_analysis_tags",
    "ingredients_from_palm_oil_n", "quantity", "ingredients_text",
    "ingredients_with_specified_percent_n", "ingredients_with_unspecified_percent_n",
    "ingredients_without_ciqual_codes_n", "ingredients_without_ciqual_codes",
    "known_ingredients_n", "ingredients_n", "unknown_ingredients_n",
]

# A set gives constant-time membership checks while the list keeps the column order
removal_set = set(items_to_remove)
target_col_clean = [item for item in target_col if item not in removal_set]
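
Because the removal list is typed by hand, a quick sanity check can catch misspelled names, which would otherwise be silently ignored. A minimal sketch with made-up column names (the variables `candidate_cols` and `to_remove` are stand-ins, not part of the notebook):

```python
# Hypothetical stand-ins for target_col and items_to_remove.
candidate_cols = ["product_name", "brands", "quantity", "ingredients_n"]
to_remove = ["quantity", "ingredients_n", "ingredient_text"]  # last entry has a typo

removal_set = set(to_remove)

# Entries that match no column: removing them is a silent no-op.
unmatched = sorted(removal_set - set(candidate_cols))

# Same list-comprehension pattern as above, preserving column order.
kept_cols = [col for col in candidate_cols if col not in removal_set]

print("unmatched:", unmatched)
print("kept:", kept_cols)
```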

Now we are down to 21 columns. There is still some cleaning work to do to unpack certain columns that contain lists of dictionaries or specially formatted strings, but it makes more sense to do that once the definitive data is in place. Hence, the next step is to load only the data related to chocolate. As tested with the advanced search option of the Open Food Facts website, setting the category to Chocolate and requiring Cocoa among the ingredients is a good way to retrieve chocolate products.
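
As an aside on the unpacking step mentioned above, here is a minimal sketch that turns a list-of-dictionaries cell into flat columns. The record layout (a `nutriments` list whose entries carry `name` and `100g` keys) is an assumption made for this example and should be checked against the actual data; `unpack_nutriments` is a hypothetical helper.

```python
import pandas as pd

def unpack_nutriments(entries):
    """Map a list of {'name': ..., '100g': ...} dicts to {name: value_per_100g}."""
    if not isinstance(entries, list):
        return {}  # missing or malformed cell: no nutrient columns for this row
    return {
        e["name"]: e.get("100g")
        for e in entries
        if isinstance(e, dict) and "name" in e
    }

# Toy data with invented values; the second product has no nutriments at all.
toy = pd.DataFrame({
    "code": ["001", "002"],
    "nutriments": [
        [{"name": "fat", "100g": 35.0}, {"name": "sugars", "100g": 48.0}],
        None,
    ],
})

# Unpack each row into a dict, then expand the dicts into one column per nutrient.
nutrient_cols = pd.DataFrame(
    toy["nutriments"].apply(unpack_nutriments).tolist(), index=toy.index
)
flat = pd.concat([toy.drop(columns="nutriments"), nutrient_cols], axis=1)
print(flat)
```

Rows without nutriment data end up with missing values in the new columns rather than raising an error.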

In [49]:
def check_for_substring(categories_list: list, search_term: str) -> bool:
    """
    Helper function to check if the target substring is present in the list.
    """
    # Handle NaN/None values safely: if the cell is empty, treat it as an empty list
    if categories_list is None or not isinstance(categories_list, list):
        return False
        
    # Use any() to check if AT LEAST ONE item in the list contains the search term
    return any(search_term in item.lower() for item in categories_list)
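
To see how the helper behaves on edge cases, here is a small self-contained check on toy rows. The tag values are invented for illustration, and the function is repeated in compact form only so the snippet runs on its own.

```python
import pandas as pd

def check_for_substring(categories_list, search_term):
    # Compact copy of the helper above so this snippet is runnable in isolation.
    if not isinstance(categories_list, list):
        return False
    return any(search_term in item.lower() for item in categories_list)

# Toy rows: mixed case, a missing cell, and a non-chocolate product.
toy = pd.DataFrame({
    "categories_tags": [["en:dark-chocolates"], ["en:Chocolate-spreads"], None, ["en:biscuits"]],
    "ingredients_tags": [["en:cocoa-mass"], ["en:sugar"], ["en:cocoa"], ["en:cocoa-butter"]],
})

mask_chocolate = toy["categories_tags"].apply(check_for_substring, args=("chocolate",))
mask_cocoa = toy["ingredients_tags"].apply(check_for_substring, args=("cocoa",))

# Only rows satisfying both conditions survive.
combined = (mask_chocolate & mask_cocoa).tolist()
print(combined)  # [True, False, False, False]
```

Note that only the list items are lowercased, so the search term itself has to be passed in lowercase.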


In [51]:
# creating a dictionary object to pass to the load_dataset to load only the targeted columns
selected_features = {item:Value("string") for item in target_col_clean}

data_stream = load_dataset(
    "openfoodfacts/product-database",
    split="food",
    features=selected_features,
    streaming=True # Use streaming for massive datasets
    )

df = pd.DataFrame(data_stream.take(1000))
mask_chocolate = df['categories_tags'].apply(check_for_substring, args=("chocolate",))
mask_cocoa = df['ingredients_tags'].apply(check_for_substring, args=("cocoa",))

filtered_df = df[mask_chocolate & mask_cocoa].reset_index(drop=True)
filtered_df.head()

TypeError: argument of type 'Value' is not iterable

In [None]:
def filter_target_products(example):
    """
    Filters records to find products classified as 'chocolate' AND
    containing 'cocoa' in the ingredients list, safely handling lists, strings,
    and lists of dictionaries for both fields.
    """
    
    # --- 1. Process Categories (Must contain 'chocolate') ---
    # Try 'categories_tags', fall back to 'categories' if needed
    categories_data = example.get('categories_tags', example.get('categories', ''))
    
    searchable_categories = ''
    if isinstance(categories_data, list):
        # Check if the list contains dictionaries 
        if categories_data and isinstance(categories_data[0], dict):
            # Extract the 'name' or 'id' from each dictionary item
            string_parts = [item.get('name', item.get('id', '')) for item in categories_data if isinstance(item, dict)]
            searchable_categories = ' '.join(string_parts).lower()
        else:
            # Assume it's a list of strings and join them
            searchable_categories = ' '.join(categories_data).lower()
    else:
        # It's a string, NaN, or other single value
        searchable_categories = str(categories_data).lower()
        
    has_chocolate_category = 'chocolate' in searchable_categories
    
    
    # --- 2. Process Ingredients (Must contain 'cocoa') ---
    # Prioritize 'ingredients_text' (the clean string), fall back to 'ingredients'
    ingredients_data = example.get('ingredients_text', example.get('ingredients', ''))
    
    searchable_ingredients = ''
    if isinstance(ingredients_data, list):
        # Check if the list contains dictionaries (the cause of the TypeError)
        if ingredients_data and isinstance(ingredients_data[0], dict):
            # Critical fix: Extract the 'text' key from the dictionary objects
            string_parts = [item.get('text', '') for item in ingredients_data if isinstance(item, dict)]
            searchable_ingredients = ' '.join(string_parts).lower()
        else:
            # Assume it's a list of strings
            searchable_ingredients = ' '.join(ingredients_data).lower()
    else:
        # It's a string (like the pre-joined ingredients_text), NaN, or other single value
        searchable_ingredients = str(ingredients_data).lower()
    
    has_cocoa_ingredient = 'cocoa' in searchable_ingredients
    
    return has_chocolate_category and has_cocoa_ingredient
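
The two halves of `filter_target_products` repeat the same normalization logic. As a sketch of how that could be factored out, the snippet below uses a shared helper. The names `to_searchable_text` and `is_chocolate_with_cocoa`, and the toy records, are hypothetical and not part of the original notebook.

```python
def to_searchable_text(value, dict_keys=("name", "id", "text")):
    """Flatten a string, list of strings, or list of dicts into one lowercase string."""
    if value is None:
        return ""
    if isinstance(value, list):
        parts = []
        for item in value:
            if isinstance(item, dict):
                # Use the first non-empty value among the candidate keys.
                parts.append(next((str(item[k]) for k in dict_keys if item.get(k)), ""))
            else:
                parts.append(str(item))
        return " ".join(parts).lower()
    return str(value).lower()


def is_chocolate_with_cocoa(example):
    """True when the categories mention 'chocolate' and the ingredients mention 'cocoa'."""
    categories = to_searchable_text(example.get("categories_tags"))
    ingredients = to_searchable_text(
        example.get("ingredients_text") or example.get("ingredients")
    )
    return "chocolate" in categories and "cocoa" in ingredients


# Toy records covering the shapes handled above: strings, lists of dicts, and missing data.
records = [
    {"categories_tags": ["en:dark-chocolates"], "ingredients_text": "Cocoa mass, sugar"},
    {"categories_tags": ["en:chocolates"], "ingredients": [{"text": "cocoa butter"}]},
    {"categories_tags": ["en:biscuits"], "ingredients_text": "wheat flour, cocoa"},
    {"categories_tags": None, "ingredients_text": None},
]
results = [is_chocolate_with_cocoa(r) for r in records]
print(results)  # [True, True, False, False]
```

One behavioral difference from the function above: this version also falls back to `ingredients` when `ingredients_text` is present but empty, and it treats a missing value as an empty string.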



In [None]:
print("Starting data load and targeted filtering...")

data_stream = load_dataset(
    "openfoodfacts/product-database",
    split="food",
    streaming=True # Use streaming for massive datasets
)

filtered_data_stream = data_stream.filter(filter_target_products)
df_initial = pd.DataFrame(filtered_data_stream.take(5000))

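# Reuse the keyword list defined earlier (the name `keywords` is not defined in this notebook);
# note that filter_df_by_keywords returns a list of column names
df_keywords_filtered = filter_df_by_keywords(df_initial, targeted_keywords)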