# Chocolate Database Analysis

Open Food Facts (OFF) is a collaborative project to which individuals contribute data about the food products they buy.
This project is an exercise in visualization and classification, focused specifically on chocolate products.

Here is the list of ideas and questions I wanted to investigate:

1. As any chocolate lover knows, there are several types of chocolate bar based on the amount of actual cocoa in the product. My hypothesis is that those categories should have a direct influence on the nutrition values (fat, carbohydrates, protein, etc.). Hence the first idea is to create a model to classify the different types of chocolate bars.
2. Based on the identified types, examine the distribution of the nutrients and the prevalence of each category by brand.
3. Map the countries of origin, with a visual indicating the number of products from each origin.
4. Map the dominant type of chocolate for each consumer country.

Without further ado, let's get started!

## Data cleaning and exploration

In [3]:
import pandas as pd #version 2.3.3
import numpy as np #version 2.3.1
import seaborn as sns #version 0.13.2
import matplotlib #version 3.10.0
import matplotlib.pyplot as plt
import statsmodels.api as sm #version 0.14.5

The data was loaded through the OFF Python API in order to have an up-to-date dataset with all the information needed for the analysis. The API is provided under the MIT License. Please see https://github.com/openfoodfacts/openfoodfacts-python.git for further information. The loading code is detailed in the data_loading_from_OpenFoodFacts notebook.

I took a conservative approach and kept redundant columns at loading time to make sure the required data was captured.

In [32]:
filepath = r'OpenFoodFacts_chocolate.csv'
df = pd.read_csv(filepath)
# removing an index column and redundant columns
df.drop(columns=["Unnamed: 0", "energy_100g", "energy_unit"], inplace=True)

Here is a description of the dataset:

- _id: the unique identifier of the product entry
- _keywords: generic words used for search purposes
- generic_name_en and product_name: the commercial name, either in English or in the original language
- categories, categories_hierarchy: standardized names from OFF used to group products
- brands: the producer name
- quantity: the amount sold, provided as a string with the unit appended
- countries: comma-separated string of the countries where the product is sold
- stores: comma-separated string of the retail brands where the product is sold
- manufacturing_places: supposed to contain the manufacturing place, but the format and the language used vary a lot, and sometimes two locations are given. Getting usable information out of this column requires some heavy lifting.
- origin and origin_en: supposedly the origin of the product. A quick verification indicates that the two columns do not contain the same data. origin_en holds relevant information about where the cocoa was grown.
- _100g columns: indicate the amount of the nutrient or ingredient per 100 g of product.
- _unit columns: should indicate the unit of the corresponding _100g variable. Some cleaning is needed here. For instance, "cocoa_unit" contains g, % and % DV units.
- nova-group_100g: indicates the level of processing of the product. Group 1 - Unprocessed or minimally processed foods, Group 2 - Processed culinary ingredients, Group 3 - Processed foods, Group 4 - Ultra-processed food and drink products.
- nutrition-score-fr_100g: indicates the nutritional rating of the product
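Two of the cleaning steps implied by the descriptions above (mixed units in "cocoa_unit" and comma-separated country strings) could be sketched as follows. This is only an illustration on made-up rows, not the cleaning code used later in this notebook. The `toy` frame and the `cocoa_pct` and `country_list` columns are hypothetical names, and treating "g" and "%" as interchangeable assumes that the value really is expressed per 100 g of product.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "cocoa_100g": [70.0, 35.0, 12.0, None],
    "cocoa_unit": ["%", "g", "% DV", None],
    "countries": ["France, Belgium", "Germany", None, "Spain,Italy"],
})

# Assumption: per 100 g of product, grams and percent are numerically the same,
# so both are kept. "% DV" is on a different scale and is set to NaN here.
comparable_units = {"g", "%"}
toy["cocoa_pct"] = toy["cocoa_100g"].where(toy["cocoa_unit"].isin(comparable_units))

# Split the comma-separated country string into a list of trimmed names.
# Missing entries become an empty list.
toy["country_list"] = toy["countries"].str.split(",").apply(
    lambda names: [n.strip() for n in names] if isinstance(names, list) else []
)

print(toy[["cocoa_pct", "country_list"]])
```

Setting the "% DV" values aside rather than converting them avoids guessing a reference daily value that the dataset does not provide.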
  

In [38]:
print(df.shape)
# checking if there are any duplicate product in the dataset
df["_id"].duplicated().any()

(15500, 42)


np.False_

In [39]:
df.columns

Index(['_id', '_keywords', 'generic_name_en', 'product_name', 'categories',
       'categories_hierarchy', 'brands', 'quantity',
       'ingredients_original_tags', 'ingredients_text_en', 'countries',
       'stores', 'manufacturing_places', 'origin', 'origin_en', 'cocoa_100g',
       'cocoa_unit', 'cocoa-minimum_100g', 'cocoa-minimum_unit',
       'carbohydrates_100g', 'carbohydrates_unit', 'energy-kcal_100g',
       'energy-kcal_unit', 'fat_100g', 'fat_unit', 'fiber_100g', 'fiber_unit',
       'fruits-vegetables-nuts-estimate-from-ingredients_100g',
       'nova-group_100g', 'nutrition-score-fr_100g', 'proteins_100g',
       'proteins_unit', 'salt_100g', 'salt_unit', 'saturated-fat_100g',
       'saturated-fat_unit', 'sodium_100g', 'sodium_unit', 'sugars_100g',
       'sugars_unit', 'energy-kj_100g', 'energy-kj_unit'],
      dtype='object')

In [40]:
main_nutrients = ["fat_100g", "carbohydrates_100g", "proteins_100g", "fiber_100g", "salt_100g"]
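Since the _100g columns are amounts per 100 g of product, one possible sanity check on these nutrients is to flag values outside the 0 to 100 range. The sketch below runs on made-up rows. The `toy`, `out_of_range` and `suspect_rows` names are hypothetical, and the check is a suggestion rather than a step the analysis is known to perform.

```python
import pandas as pd

main_nutrients = ["fat_100g", "carbohydrates_100g", "proteins_100g", "fiber_100g", "salt_100g"]

# Toy rows with two deliberately implausible values (120 g of fat, -1 g of carbohydrates).
toy = pd.DataFrame({
    "fat_100g": [35.0, 120.0, 30.0],
    "carbohydrates_100g": [50.0, 45.0, -1.0],
    "proteins_100g": [7.0, 6.0, 8.0],
    "fiber_100g": [8.0, None, 10.0],
    "salt_100g": [0.1, 0.2, 0.05],
})

# True where a reported value lies outside 0-100 g per 100 g.
# Comparisons with NaN are False, so missing values are not flagged.
out_of_range = (toy[main_nutrients] < 0) | (toy[main_nutrients] > 100)

# Rows with at least one implausible nutrient value.
suspect_rows = toy[out_of_range.any(axis=1)]

print(out_of_range.sum())
print(suspect_rows)
```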

In [47]:
df["cocoa_100g"].describe()

count    2146.000000
mean       55.705191
std        20.982090
min         0.000000
25%        36.000000
50%        55.000000
75%        72.000000
max       100.000000
Name: cocoa_100g, dtype: float64
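The count above is worth relating to the size of the dataset: 2146 non-missing values out of the 15500 rows reported by `df.shape` means that only about 14% of the products declare a cocoa content. A minimal sketch of how such coverage could be computed per column is given below. The `coverage` helper and the `toy` frame are hypothetical names used for illustration only.

```python
import pandas as pd

def coverage(frame, columns):
    """Share of non-missing values per column, sorted from least to most complete."""
    return frame[columns].notna().mean().sort_values()

# Toy rows standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "cocoa_100g": [70.0, None, None, None],
    "fat_100g": [30.0, 35.0, None, 40.0],
})
toy_coverage = coverage(toy, ["cocoa_100g", "fat_100g"])
print(toy_coverage)

# The same ratio using the figures printed in this notebook.
cocoa_share = 2146 / 15500
print(f"{cocoa_share:.1%}")
```

On the real data, the same call would be `coverage(df, ["cocoa_100g"] + main_nutrients)`, assuming `df` and `main_nutrients` are defined as in the cells above.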