# Data collection and descriptive analysis
From [Open Food Facts](https://world.openfoodfacts.org/) we have a 1.7 GB `.csv` file that contains information on over 600 000 unique food products. The purpose of this notebook is to explore this dataset and compile the available information into one or several smaller files containing only what we need for our project.

In [1]:
import pandas as pd
import numpy as np
import functools
import re

In [2]:
data_folder = "./data/"

In [3]:
database = pd.read_csv(data_folder + "en.openfoodfacts.org.products.csv", sep='\t', dtype=object)

We take a look at the data:

In [4]:
database.describe()

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,...,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
count,696770,696770,696801,696800,696794,696804,696804,670800,77415,194249,...,344,140760,140760,2,1,27.0,1.0,14,33.0,13.0
unique,696731,696733,6656,562201,562207,551516,551515,479331,58231,22537,...,211,56,56,2,1,18.0,1.0,10,19.0,8.0
top,67275001132,http://world-en.openfoodfacts.org/product/0055...,kiliweb,1489077120,2017-03-09T10:37:09Z,France,en:france,Comté,Pâtes alimentaires au blé dur de qualité supér...,500 g,...,0,0,0,3,2,0.06,1.6e-05,4,0.02,0.007
freq,2,2,312135,20,20,29,31,451,181,7881,...,82,7338,10150,1,1,4.0,1.0,3,6.0,2.0


The dataset has 173 columns holding many different types of data, so the output of the `describe` method is hard to survey. We need to explore the data in another way.
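One possible way to get a more surveyable overview is to rank the columns by how many non-missing values they hold. The sketch below is only an illustration: `fill_overview` is a hypothetical helper and `toy` is a small made-up frame, not the real data.

```python
import pandas as pd

def fill_overview(df):
    """Return one row per column: non-missing count and share of rows filled."""
    counts = df.count()  # count() ignores NaN / None
    overview = pd.DataFrame({
        "non_null": counts,
        "fill_rate": counts / len(df),
    })
    return overview.sort_values("non_null", ascending=False)

# Toy stand-in for the real table (values are invented for illustration)
toy = pd.DataFrame({
    "product_name": ["Comté", "Cola", None, "Butter"],
    "energy_100g": ["1700", "180", None, None],
    "inositol_100g": [None, None, None, "0.02"],
})
overview = fill_overview(toy)
print(overview)
```

Applied to `database`, the same helper would list the well-populated columns first and the nearly empty ones last.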

# Finding the column with the country data
We are interested in doing our analysis based on which country the food item comes from. We therefore filter the data to find the columns whose names start with the string `countr`, for "countries":

In [5]:
filter_col = [col for col in database if col.startswith('countr')]
filter_col

['countries', 'countries_tags', 'countries_en']

We have three different columns with country data. Let's try to find out which one is relevant for us. A guess is that we need the `countries_en` column, since the name presumably means that the column contains data on the origin country in English.

Looking at the `countries` column, we notice that there are duplicates:

In [6]:
database.countries.value_counts().head(5)

en:france    205162
France       179274
US           168473
en:FR         28054
Suisse         9097
Name: countries, dtype: int64

France appears several times under different labels! Comparing the number of unique values in each of the three columns shows that the other two contain fewer than half as many unique values:

In [7]:
for col in filter_col:
    print("Number of unique country labels in column '{}': ".format(col) + str(database[col].value_counts().shape[0]))


Number of unique country labels in column 'countries': 3227
Number of unique country labels in column 'countries_tags': 1227
Number of unique country labels in column 'countries_en': 1227


A look at `countries_tags` and `countries_en` shows that they are essentially identical, differing only in how each country is formatted:

In [8]:
database.countries_tags.value_counts().head(6)

en:france               421492
en:united-states        173575
en:switzerland           13463
en:germany               11845
en:france,en:germany      6309
en:spain                  6234
Name: countries_tags, dtype: int64

In [9]:
database.countries_en.value_counts().head(6)

France            421492
United States     173575
Switzerland        13463
Germany            11845
France,Germany      6309
Spain               6234
Name: countries_en, dtype: int64

Because of this, we decide to use the `countries_en` column. We note the format of the column: each country starts with a capital letter, and if there are several countries they are separated by a comma without any whitespace.
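Given this format, a label can be split on commas to recover the individual country names. The following is a minimal sketch in plain Python, assuming a hypothetical helper `sold_in` and a few made-up labels; it is not part of the notebook's pipeline.

```python
def sold_in(label, country):
    """True if `country` is one of the comma-separated names in `label`."""
    if not isinstance(label, str):  # missing values (NaN) are not strings
        return False
    return country in label.split(",")

# Made-up labels in the same format as the countries_en column
labels = ["France", "France,Germany", "United States", float("nan")]
print([sold_in(label, "France") for label in labels])
```

Unlike a substring test, this only matches whole country names, so a partial string such as `"Fran"` would not count as a match.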

# Filtering out France and the United States
We are only interested in comparing France against the United States. Because of this, we want to compile the rows of the database which contain data for these two countries into two new dataframes respectively.

In [10]:
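# na=False treats rows where countries_en is NaN as non-matching, so they are excluded
# .copy() makes independent dataframes, so later in-place edits do not act on a view of `database`
france_data = database[database.countries_en.str.contains("France", na=False)].copy()
us_data = database[database.countries_en.str.contains("United States", na=False)].copy()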

# Filtering out relevant columns

Relevant columns are those that have values for many products. We consider a column relevant if it has more than 10 000 defined values in both the France data and the US data.

In [11]:
columns10000 = ((france_data.count() > 10000) & (us_data.count() > 10000))

In [12]:
france_data.columns[columns10000]

Index(['code', 'url', 'creator', 'created_t', 'created_datetime',
       'last_modified_t', 'last_modified_datetime', 'product_name', 'brands',
       'brands_tags', 'countries', 'countries_tags', 'countries_en',
       'ingredients_text', 'serving_size', 'serving_quantity', 'additives_n',
       'additives', 'additives_tags', 'additives_en',
       'ingredients_from_palm_oil_n',
       'ingredients_that_may_be_from_palm_oil_n', 'states', 'states_tags',
       'states_en', 'energy_100g', 'fat_100g', 'saturated-fat_100g',
       'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g',
       'salt_100g', 'sodium_100g'],
      dtype='object')

From the resulting columns, we will use the ones relevant for nutrition: those with values per 100 g, the serving columns and the product names. In addition to these, we will keep the three category columns (`categories`, `categories_tags` and `categories_en`), as we will use them to categorise our data. We will also keep some additional vitamins and minerals, as well as the two nutrition score columns. An overview of all columns can be found here: https://static.openfoodfacts.org/data/data-fields.txt

In [13]:
re_columns = ['product_name', 'brands', 'brands_tags', 'ingredients_text', 'serving_size', 'categories',
              'categories_tags', 'categories_en',
              'serving_quantity', 'energy_100g', 'proteins_100g', 'carbohydrates_100g', 'sugars_100g', 'fat_100g',
              'saturated-fat_100g','monounsaturated-fat_100g', 'polyunsaturated-fat_100g', 'omega-3-fat_100g',
              'omega-6-fat_100g','omega-9-fat_100g', 'trans-fat_100g', 'cholesterol_100g', 'fiber_100g',
              'sodium_100g', 'vitamin-a_100g','vitamin-d_100g', 'vitamin-e_100g', 'vitamin-k_100g', 'vitamin-c_100g',
              'vitamin-b1_100g','vitamin-b2_100g', 'vitamin-pp_100g', 'vitamin-b6_100g', 'vitamin-b9_100g',
              'vitamin-b12_100g',
              'biotin_100g', 'calcium_100g', 'phosphorus_100g', 'iron_100g', 'magnesium_100g', 'zinc_100g',
              'copper_100g', 'manganese_100g', 'fluoride_100g', 'selenium_100g', 'chromium_100g', 'molybdenum_100g',
              'iodine_100g', 'nutrition-score-fr_100g', 'nutrition-score-uk_100g']

In [14]:
france_data = france_data[re_columns]
us_data = us_data[re_columns]

# Cleaning data
### Drop rows which are unusable
Rows that contain only NaN values, and rows that have a value in neither any of the three category columns (`categories_en`, `categories` or `categories_tags`) nor the `product_name` column, are not usable for us, as we could not use them in any later comparison. For this reason we remove them:

In [15]:
# drop rows with all NaNs.
france_data.dropna(how='all', inplace=True)
us_data.dropna(how='all', inplace=True)

In [16]:
# drop rows if there are NaNs in all of the columns in `cols`
cols = ['categories_en', 'categories', 'categories_tags', 'product_name']
france_data.dropna(subset=cols, how='all', inplace=True)
us_data.dropna(subset=cols, how='all', inplace=True)
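To make the effect of `subset` together with `how='all'` concrete, here is a small sketch on a made-up three-row frame (`toy` and its values are assumptions for illustration only). A row is dropped only if every column listed in `subset` is missing.

```python
import pandas as pd

toy = pd.DataFrame({
    "product_name": ["Ice Cream", None, None],
    "categories": [None, None, "en:fats"],
    "energy_100g": [883.0, 1000.0, None],
})

# Row 1 has neither a name nor a category, so it is the only row dropped;
# rows 0 and 2 each have at least one of the two identifying columns.
kept = toy.dropna(subset=["product_name", "categories"], how="all")
print(kept.index.tolist())
```

Note that row 1 is removed even though it has an energy value, because columns outside `subset` are not considered.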

### Inspection of values

We continue the cleaning by counting the number of rows in the two dataframes.

In [17]:
# number of products
print("There are %d products sold in France" % len(france_data))
print("There are %d products sold in the USA" % len(us_data))

There are 438674 products sold in France
There are 174595 products sold in the USA


In [18]:
#us_data.to_csv(data_folder + "us_data.csv")
#france_data.to_csv(data_folder + "france_data.csv")

#france_data2 = pd.read_csv(data_folder + 'france_data.csv')
#us_data2 = pd.read_csv(data_folder + 'us_data.csv')

We will start to look at the data by considering the max and min of each column.

In [19]:
pd.options.display.float_format = '{:20.2f}'.format

# Convert to numeric columns where possible, otherwise ignore errors
france_data = france_data.apply(pd.to_numeric, errors='ignore')
us_data = us_data.apply(pd.to_numeric, errors='ignore')

france_data.max(numeric_only=True)

serving_quantity                     2601059.00
energy_100g                          1841546.00
proteins_100g                           4400.00
carbohydrates_100g                     72000.00
sugars_100g                            68000.00
fat_100g                                 915.00
saturated-fat_100g                       612.00
monounsaturated-fat_100g                  82.00
polyunsaturated-fat_100g                  75.00
omega-3-fat_100g                         485.00
omega-6-fat_100g                          71.00
omega-9-fat_100g                          75.00
trans-fat_100g                            31.00
cholesterol_100g                          28.00
fiber_100g                        5570000000.00
sodium_100g                              800.00
vitamin-a_100g                           800.00
vitamin-d_100g                             8.00
vitamin-e_100g                            22.90
vitamin-k_100g                             0.14
vitamin-c_100g                          

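The maximum is out of range for many of the per-100 g columns. We clean these columns by assuming that the values should lie in the range 0 to 100 and setting everything outside that range to NaN. Note that the code below applies this range to every numeric column except `energy_100g`, so it also affects `serving_quantity` and the nutrition score columns, for which the 0 to 100 assumption may not hold.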

In [20]:
def clean_with_range(df, column_names, max_value, min_value):
    for column in column_names:
        mask_max = df[column] > max_value
        mask_min = df[column] < min_value
        df.loc[mask_max, column] = np.nan
        df.loc[mask_min, column] = np.nan
    return df
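`clean_with_range` modifies the dataframe it is given and also returns it. If a non-mutating variant is preferred, one option is sketched below using `Series.between` and `Series.where`. The name `clean_with_range_copy` and the toy frame are hypothetical, and the bounds are inclusive, matching the strict `>` and `<` comparisons above.

```python
import pandas as pd

def clean_with_range_copy(df, column_names, max_value, min_value):
    """Return a copy of df where out-of-range values in column_names become NaN."""
    cleaned = df.copy()
    for column in column_names:
        in_range = cleaned[column].between(min_value, max_value)  # inclusive bounds
        cleaned[column] = cleaned[column].where(in_range)  # other values -> NaN
    return cleaned

# Made-up values: one plausible, two out of range, one exactly on the bound
toy = pd.DataFrame({"fat_100g": [12.5, 915.0, -1.0, 100.0]})
cleaned = clean_with_range_copy(toy, ["fat_100g"], 100, 0)
print(cleaned)
```

The original `toy` frame is left untouched, which can make it easier to compare the data before and after cleaning.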

In [21]:
numeric_columns = france_data.select_dtypes(include=[np.number]).columns.tolist()
columns_to_clean = [x for x in numeric_columns if x not in ['Unnamed: 0', 'energy_100g']]
france_data = clean_with_range(france_data, columns_to_clean, 100, 0)
# use same columns for the usa
us_data = clean_with_range(us_data, columns_to_clean, 100, 0)

Below we can see the cleaned max values.

In [22]:
france_data.max(numeric_only=True)

serving_quantity                         100.00
energy_100g                          1841546.00
proteins_100g                            100.00
carbohydrates_100g                       100.00
sugars_100g                              100.00
fat_100g                                 100.00
saturated-fat_100g                       100.00
monounsaturated-fat_100g                  82.00
polyunsaturated-fat_100g                  75.00
omega-3-fat_100g                          68.00
omega-6-fat_100g                          71.00
omega-9-fat_100g                          75.00
trans-fat_100g                            31.00
cholesterol_100g                          28.00
fiber_100g                               100.00
sodium_100g                               98.43
vitamin-a_100g                            73.00
vitamin-d_100g                             8.00
vitamin-e_100g                            22.90
vitamin-k_100g                             0.14
vitamin-c_100g                          

# Downsampling the France dataset
The France dataset contains considerably more entries than the US dataset. Because of this, we downsample the France dataset at this stage so that both datasets have the same number of entries.

In [23]:
print("Size of France dataset: " + str(france_data.shape[0]))
print("Size of US dataset: "+ str(us_data.shape[0]))

Size of France dataset: 438674
Size of US dataset: 174595


In [24]:
france_data = france_data.sample(us_data.shape[0])
print("Size of France dataset: " + str(france_data.shape[0]))
print("Size of US dataset: "+ str(us_data.shape[0]))

Size of France dataset: 174595
Size of US dataset: 174595
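The call to `sample` above draws a different random subset on every run. If a reproducible subset is wanted, `DataFrame.sample` accepts a `random_state` argument. The sketch below uses made-up frames (`big`, `small`) and an arbitrary seed of 0; both are assumptions for illustration.

```python
import pandas as pd

big = pd.DataFrame({"product_name": [f"item {i}" for i in range(10)]})
small = pd.DataFrame({"product_name": ["a", "b", "c"]})

# random_state fixes the seed, so the same rows are drawn on every run
sample_1 = big.sample(n=len(small), random_state=0)
sample_2 = big.sample(n=len(small), random_state=0)
print(len(sample_1), sample_1.equals(sample_2))
```

With a fixed seed, counts reported later in the notebook would stay the same between runs.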


# Define categories for comparisons
To compare the US and France we need to divide the data into comparable categories. As seen in the data, there are already three columns containing (possibly overlapping) category information for the food items. These category columns need to be investigated further, which we start by looking at the most common entries in each column:

In [25]:
france_data.categories.value_counts().head(5)

en:beverages                                  2730
en:fats                                        661
en:milks                                       269
Chocolats noirs                                254
Snacks sucrés,Biscuits et gâteaux,Biscuits     147
Name: categories, dtype: int64

In [26]:
france_data.categories_en.value_counts().head(5)

Beverages,Non-Alcoholic beverages           2021
Fats                                         712
Beverages,Sugared beverages                  645
Sugary snacks,Chocolates,Dark chocolates     449
Dairies,Milks                                371
Name: categories_en, dtype: int64

In [27]:
france_data.categories_tags.value_counts().head(5)

en:beverages,en:non-alcoholic-beverages              2021
en:fats                                               712
en:beverages,en:sugared-beverages                     645
en:sugary-snacks,en:chocolates,en:dark-chocolates     449
en:dairies,en:milks                                   371
Name: categories_tags, dtype: int64

We see that the three columns are very similar, but there are some differences between them. For example, for the most common entry, the columns `categories_en` and `categories_tags` provide additional information in the form of the tag "Non-Alcoholic beverages". The column `categories`, on the other hand, provides some category names in French. Having additional information increases the chance that an item is found and correctly identified as belonging to a certain category.

As the two columns `categories_en` and `categories_tags` contain the same information (just in a slightly different format), we can choose to use only one of them, the `categories_en` column. This gives us two columns to use, `categories` and `categories_en`.

It might be a good strategy to search through both columns when we want to find items for a comparison, for example searching for items with the keywords "yoghurt" and "dairies", especially since the search is computationally cheap.

We will also use the `product_name` column to find items that match certain categories, as the names can be descriptive and contain strings that are associated with a certain food category. The most common product names in the US data can be seen here:

In [28]:
us_data.product_name.value_counts().head(10)

Ice Cream                 408
Extra Virgin Olive Oil    296
Potato Chips              281
Premium Ice Cream         226
Beef Jerky                165
Pinto Beans               162
Popcorn                   150
Salsa                     149
Cookies                   144
Tomato Sauce              140
Name: product_name, dtype: int64

We define a function which takes a dataframe and a list of strings and returns the rows of that dataframe in which any of the strings occurs in any of the three columns `categories_en`, `categories` or `product_name`. The `categories_tags` column is left out since, as noted above, it carries the same information as `categories_en`.

In [29]:
def relevant_rows(df, tag_list):
    """Return the rows of df where any tag matches any category or name column."""
    cols = ['categories_en', 'categories', 'product_name']
    combinations = [df[col].str.contains(tag, na=False, flags=re.IGNORECASE) for tag in tag_list for col in cols]
    mask = functools.reduce(lambda x,y: x | y, combinations)
    return df[mask]
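One detail worth keeping in mind is that `str.contains` interprets each tag as a regular expression by default, so characters such as `.` or `(` have a special meaning. The sketch below shows the difference with the standard `re` module; `contains_tag` is a hypothetical helper that roughly mirrors the matching done in `relevant_rows`, and the product name is made up.

```python
import re

def contains_tag(text, tag, literal=False):
    """Case-insensitive search; with literal=True the tag is matched verbatim."""
    pattern = re.escape(tag) if literal else tag
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

name = "Vanilla Icexcream"  # made-up product name with a typo
print(contains_tag(name, "ice.cream"))                # "." matches any character
print(contains_tag(name, "ice.cream", literal=True))  # "." must be a literal dot
```

For plain keywords such as `Ice Cream` the two behave the same; the distinction only matters for tags that contain regex metacharacters.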

In [30]:
ice_cream = relevant_rows(us_data, ['Ice Cream'])
ice_cream.head(5)

Unnamed: 0,product_name,brands,brands_tags,ingredients_text,serving_size,categories,categories_tags,categories_en,serving_quantity,energy_100g,...,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,nutrition-score-fr_100g,nutrition-score-uk_100g
615,"Ice Cream, Vanilla",Lactaid,lactaid,"Milk, cream, sugar, corn syrup, guar gum, locu...",71 g (0.5 cup),,,,71.0,883.0,...,,,,,,,,,,
616,"Ice Cream, Butter Pecan",Lactaid,lactaid,"Milk, cream, sugar, corn syrup, pecans (pecans...",71 g (0.5 cup),,,,71.0,1000.0,...,,,,,,,,,,
2629,"Ice Cream, Vanilla Bean",Private Selection,private-selection,"Milk, cream, sugar, corn syrup, whey protein c...",93 g (0.5 cup),,,,93.0,946.0,...,,,,,,,,,,
2630,"Ice Cream, Classic Country Vanilla",Private Selection,private-selection,"Milk, cream, sugar, corn syrup, egg yolks, nat...",93 g (0.5 cup),,,,93.0,946.0,...,,,,,,,,,,
2631,"Ice Cream, Denali Extreme Moose Tracks",Private Selection,private-selection,"Ice cream - milk, cream, sugar, corn syrup, co...",85 g (0.5 cup),,,,85.0,1230.0,...,,,,,,,,,,


In [31]:
relevant_rows(us_data, ['Ice Cream']).shape

(3131, 50)

Here we see that as many as 3131 items in the US data are related to the string `Ice Cream`. This is more than 7 times the 408 items whose `product_name` is exactly `Ice Cream`, so this kind of search is better suited if we want to find the products with the most entries in the database, e.g. the food item type with the most variants.

This function lets us easily find all of the rows relevant to a category of our own definition. The preliminary categories we will use are “Dairy”, “Snacks”, “Bread and Dry goods”, "Fats" and “Meat, Poultry, Fish, Seafood, etc.”

We also define a function for omitting rows that contain certain words in these columns, including an option to omit rows which have certain ingredients.

In [32]:
def clean_categories(df, wrong_strings, wrong_ingredients=()):  # immutable default instead of a shared list
    cols = ['categories_en', 'categories', 'product_name']
    combinations = [df[col].str.contains(wrong, na=False, flags=re.IGNORECASE) for wrong in wrong_strings for col in cols]
    ingredient_combinations = [df['ingredients_text'].str.contains(wrong, na=False, flags=re.IGNORECASE) for wrong in wrong_ingredients]
    mask = functools.reduce(lambda x,y: x | y, combinations + ingredient_combinations)
    return df[~mask]

# Category example
As an example, we now filter the data for food items relating to `fats`, using the functions defined. 

In [33]:
butter = r'^(?:.*\s)?butter(?:\s.*)?$'
oil = r'^(?:.*\s)?oil(?:\s.*)?$'
fat_words = [butter, "fats", oil, "beurre", "lätta", "milda", "margarin", "huile", "coconut fat" ]
non_fat_words = ["butter cups", "pop corn", "chicken", "popcorn", "potato", "toffee", "in oil", "with olive oil", "with oil", "protein bar","olive oil &", "marinat", "in olive oil", "bean", "snack", "ice cream", "cheese", "fromage", "a l'huile", "à l'huile", "caramel beurre", "petits beurre", "granola", "frits", "fried", "au beurre", "croissant", "all butter", "yaourt", "cookie", "chip", "sans huile", "sandwich", "soup", "pur beurre", "thon", "sauce", "chocolat", "biscuits", "cake", "seafood"]
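The `butter` and `oil` patterns are meant to match the word only when it stands on its own, delimited by whitespace or the start or end of the string. The following sketch checks the `butter` pattern against a few made-up product names using the standard `re` module.

```python
import re

butter = r'^(?:.*\s)?butter(?:\s.*)?$'

# Made-up product names for illustration
names = ["Butter", "Peanut Butter", "Unsalted butter sticks",
         "Buttermilk", "Butterscotch", "Butter, Salted"]

# True where the pattern finds "butter" as a whitespace-delimited word
results = {name: re.search(butter, name, flags=re.IGNORECASE) is not None
           for name in names}
for name, matched in results.items():
    print(f"{name!r}: {matched}")
```

The pattern rejects `Buttermilk` and `Butterscotch` as intended, but it also rejects `Butter, Salted`, because the comma directly after the word is not whitespace.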

We now filter the France data:

In [34]:
france_fats = relevant_rows(france_data, fat_words)
print("Fat related items in France before cleaning: " + str(france_fats.shape[0]))

france_fats = clean_categories(france_fats, non_fat_words, ['flour'])
print("Fat related items in France after cleaning: " + str(france_fats.shape[0]))

Fat related items in France before cleaning: 6257
Fat related items in France after cleaning: 3215


A peek at the data reveals that we have somewhat succeeded, although some false positives remain, such as `Haricots beurre extra fins` (a bean product):

In [35]:
france_fats.head(5)

Unnamed: 0,product_name,brands,brands_tags,ingredients_text,serving_size,categories,categories_tags,categories_en,serving_quantity,energy_100g,...,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,nutrition-score-fr_100g,nutrition-score-uk_100g
399395,Huile d'olive,,,,,,,,0.0,,...,,,,,,,,,,
350400,Beurre doux,,,,,,,,0.0,3117.0,...,,,,,,,,,,
558664,Huile D'olive Bio 500ml,Delhaize,delhaize,extra en rijk aan vitamine E et riche en vitam...,10g,,,,10.0,3700.0,...,,,,,,,,,,
324403,Haricots beurre extra fins,,,,,,,,0.0,105.0,...,,,,,,,,,,
540594,Crunchy Peanut Butter,,,,,,,,0.0,2690.0,...,,,,,,,,,,


Now the same for the US:

In [36]:
us_fats = relevant_rows(us_data, fat_words)
print("Fat related items in the US before cleaning: " + str(us_fats.shape[0]))

us_fats = clean_categories(us_fats, non_fat_words, ['flour'])
print("Fat related items in the US after cleaning: " + str(us_fats.shape[0]))

Fat related items in the US before cleaning: 6373
Fat related items in the US after cleaning: 3306


In [37]:
us_fats.head(5)

Unnamed: 0,product_name,brands,brands_tags,ingredients_text,serving_size,categories,categories_tags,categories_en,serving_quantity,energy_100g,...,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,nutrition-score-fr_100g,nutrition-score-uk_100g
70,Organic Sunflower Oil,Napa Valley Naturals,napa-valley-naturals,"Organic expeller pressed, refined high oleic s...",14 g (1 Tbsp),,,,14.0,3586.0,...,,,,,,,,,,
166,Organic Canola Oil Refined,Spectrum,spectrum,Organic expeller pressed refined canola oil,14 ml (1 Tbsp),,,,14.0,3586.0,...,,,,,,,,,,
233,Organic Unrefined Extra Virgin Coconut Oil,Aunt Patty,aunt-patty,Organic unrefined extra virgin coconuts oil,14 g (1 Tbsp),,,,14.0,3586.0,...,,,,,,,,,,
549,100% Pure Canola Oil,Canola Harvest,canola-harvest,100% canola oil .,14 g (1 Tbsp),,,,14.0,3586.0,...,,,,,,,,,,
551,"Buttery Spread, With Flaxseed Oil",Canola Harvest,canola-harvest,"Canola oil, water, palm oil, flax oil, palm ke...",14 g (1 Tbsp),,,,14.0,2389.0,...,,,,,,,,,,


As can be seen, the number of items in each dataset remains similar after filtering (3215 for France and 3306 for the US).

# Saving data
We finish by writing the divided and curated data into two CSV files, one for France and one for the US. This yields two files of the much more manageable size of around 100 MB each, compared to the original 1.7 GB.

In [39]:
france_data = france_data.reset_index(drop=True)
us_data = us_data.reset_index(drop=True)
us_data.to_csv(data_folder + "us_data.csv")
france_data.to_csv(data_folder + "france_data.csv")
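By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `read_csv`. The sketch below illustrates this on a made-up two-row frame; `toy` and the in-memory buffers are assumptions for illustration and not part of the notebook. It also shows that passing `index=False` avoids the extra column.

```python
import io
import pandas as pd

toy = pd.DataFrame({"product_name": ["Butter", "Olive Oil"],
                    "fat_100g": [82.0, 100.0]})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)
print(reloaded.columns.tolist())

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)
print(reloaded_clean.columns.tolist())
```

Whether to keep the index column depends on how the files are consumed later; if nothing relies on `Unnamed: 0`, writing with `index=False` keeps the saved files slightly smaller and the column list cleaner.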