## Milestone 2
Dataset: Open Food Facts

The dataset is downloaded and stored in the /data folder.

When describing the data, in particular, you should show (non-exhaustive list):

- That you can handle the data in its size.
- That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
- That you considered ways to enrich, filter, transform the data according to your needs.
- That you have updated your plan in a reasonable way, reflecting your improved knowledge after data acquaintance. In particular, discuss how your data suits your project needs and discuss the methods you’re going to use, giving their essential mathematical details in the notebook.
- That your plan for analysis and communication is now reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.


In [78]:
import pandas as pd
import numpy as np
import scipy as sp

In [2]:
data_folder = './data/'

# Loading the data

## Open Food Facts dataset

The data is a tab-separated CSV file that can be downloaded from the Open Food Facts website. Its size is 1.6 GB. For this milestone we first downloaded and loaded it with Spark, because we were not sure that Pandas could handle a file of this size. We quickly realized that Pandas handles it smoothly, so we will use Pandas to manipulate the data.

In [3]:
data = pd.read_csv(data_folder + 'en.openfoodfacts.org.products.csv', sep='\t', encoding='utf-8', low_memory=False)
data.head()
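
If the file ever outgrows the available memory, one option is to parse only the needed columns and to read the file in chunks. The following is a minimal sketch on a tiny in-memory stand-in for the real file; the column subset `wanted` and the chunk size are illustrative assumptions, not choices made in this notebook.

```python
import io
import pandas as pd

# Tiny stand-in for the real tab-separated file (illustrative data only).
sample_tsv = (
    "code\tproduct_name\tcountries_en\tnutrition_grade_fr\n"
    "1\tCrackers\tFrance\tc\n"
    "2\tCacao\tFrance\t\n"
    "3\tCheddar\tUnited Kingdom\td\n"
)

# Assumed subset of columns, for illustration.
wanted = ['product_name', 'countries_en', 'nutrition_grade_fr']

# usecols limits the columns that are parsed;
# chunksize makes read_csv yield DataFrames piece by piece.
chunks = pd.read_csv(io.StringIO(sample_tsv), sep='\t', usecols=wanted, chunksize=2)
subset = pd.concat(chunks, ignore_index=True)
```

With the real file, `io.StringIO(sample_tsv)` would be replaced by the file path used in the cell above.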

The dataset has a lot of columns (173), and not all of them are useful for our project. We will select the most interesting ones and drop all the others to avoid carrying unused data through the computations. For better readability, we first remove the columns unrelated to our project, such as ingredient concentrations per product, images and palm oil information.

In [46]:
%pprint
not_related = ['100g', 'image', 'palm_oil', 'code', 'url']
columns = [column for column in list(data.columns) if not any(st in column for st in not_related)] 
columns

Pretty printing has been turned OFF


['creator', 'created_t', 'created_datetime', 'last_modified_t', 'last_modified_datetime', 'product_name', 'generic_name', 'quantity', 'packaging', 'packaging_tags', 'brands', 'brands_tags', 'categories', 'categories_tags', 'categories_en', 'origins', 'origins_tags', 'manufacturing_places', 'manufacturing_places_tags', 'labels', 'labels_tags', 'labels_en', 'cities', 'cities_tags', 'purchase_places', 'stores', 'countries', 'countries_tags', 'countries_en', 'ingredients_text', 'allergens', 'allergens_en', 'traces', 'traces_tags', 'traces_en', 'serving_size', 'serving_quantity', 'no_nutriments', 'additives_n', 'additives', 'additives_tags', 'additives_en', 'nutrition_grade_uk', 'nutrition_grade_fr', 'pnns_groups_1', 'pnns_groups_2', 'states', 'states_tags', 'states_en', 'main_category', 'main_category_en']

From the remaining columns, we chose the ones that interest us the most. We made our choice first by judging how useful each column is for our project, and then by checking in more detail whether the data in the column is usable.

In [83]:
%pprint
# TODO: remove all the elements of the list that may be useless
# We do not drop the NaN rows yet, as we do not know yet which values we will use
keep = ['product_name','categories_tags','categories_en','origins_tags','manufacturing_places_tags','labels','countries_en','allergens','nutrition_grade_fr','main_category_en', 'pnns_groups_1', 'pnns_groups_2']
selected_data = data[keep]
selected_data.head()

Pretty printing has been turned OFF


Unnamed: 0,product_name,categories_tags,categories_en,origins_tags,manufacturing_places_tags,labels,countries_en,allergens,nutrition_grade_fr,main_category_en,pnns_groups_1,pnns_groups_2
0,Vitória crackers,,,,,,France,,,,,
1,Cacao,,,,,,France,,,,,
2,Sauce Sweety chili 0%,,,,,,France,,,,,
3,Mini coco,,,,,,France,,,,,
4,Mendiants,,,,,,France,,,,,


We save this dataframe in a csv file to speed up the loading process in our next runs.

In [84]:
selected_data.to_csv(data_folder + 'selected_data.csv')
#data = pd.read_csv(data_folder + 'en.openfoodfacts.org.products.csv', sep='\t', encoding='utf-8', low_memory=False)

Since there seem to be a lot of null values in the dataset, it is interesting to look at the number of non-null values in each of the selected columns.

In [85]:
selected_data.dropna()[:100]

Unnamed: 0,product_name,categories_tags,categories_en,origins_tags,manufacturing_places_tags,labels,countries_en,allergens,nutrition_grade_fr,main_category_en,pnns_groups_1,pnns_groups_2
1066,Cornish Cruncher Cheddar & Pickled Onion Hand ...,"en:plant-based-foods-and-beverages,en:plant-ba...","Plant-based foods and beverages,Plant-based fo...",royaume-uni,royaume-uni,Point Vert,"France,United Kingdom","Lait, Lait, Babeurre",c,Plant-based foods and beverages,Salty snacks,Appetizers
1143,6 Breaded Jumbo Tiger Prawns,"en:meals,en:refrigerated-foods,en:breaded-prod...","Meals,Refrigerated foods,Breaded products,Refr...",vietnam,vietnam,"Point Vert,Décongelé","France,United Kingdom","Crevettes, Crustacés, blé, Gluten",c,Meals,Composite foods,One-dish meals
1348,"grilled Cajun chicken breast, spicy wedges & s...","en:meals,en:meat-based-products,en:meals-with-...","Meals,Meat-based products,Meals with meat,Poul...","europe,royaume-uni",royaume-uni,"Peu ou pas de matière grasse,Peu de matière gr...","France,United Kingdom",Lait,a,Meals,Composite foods,One-dish meals
1355,Cornish Cove Cheddar,"en:dairies,en:fermented-foods,en:fermented-mil...","Dairies,Fermented foods,Fermented milk product...",royaume-uni,royaume-uni,Point Vert,France,Lait,d,Dairies,Milk and dairy products,Cheese
1393,Mild Cheddar with Onions & Chives,"en:dairies,en:fermented-foods,en:fermented-mil...","Dairies,Fermented foods,Fermented milk product...",royaume-uni,"royaume-uni,irelande",Point Vert,"France,United Kingdom","Fromage, Lait",d,Dairies,Milk and dairy products,Cheese
1637,British Self Raising Flour,"en:farines,en:farines-avec-levures","Farines,Farines-avec-levures",royaume-uni,royaume-uni,Végétarien,"France,United Kingdom","Wheat, Wheat",d,Farines,unknown,unknown
1644,Tomato & Sausage Pasta Sauce,"en:groceries,en:meat-based-products,en:sauces,...","Groceries,Meat-based products,Sauces,Pasta sau...",italie,italie,"Point Vert,Fabriqué en Italie",France,Céleri,b,Groceries,Fat and sauces,Dressings and sauces
2026,Made Without Wheat New York Cheesecake,"en:sugary-snacks,en:biscuits-and-cakes,en:cake...","Sugary snacks,Biscuits and cakes,Cakes,Cheesec...",royaume-uni,royaume-uni,"Sans gluten,Végétarien","France,United Kingdom","Lait, crème, Lait, crème, Lait, œufs, Lait, Lait",e,Sugary snacks,Sugary snacks,Biscuits and cakes
2166,Wrap Poulet à la Jamaïcaine,"en:meals,en:fresh-foods,en:sandwiches,en:fresh...","Meals,Fresh foods,Sandwiches,Fresh meals,Poult...",grande-bretagne,royaume-uni,Point Vert,"France,United Kingdom","blé, Gluten, blé, Lait",a,Meals,Composite foods,Sandwich
14418,Quiche Lorraine,"en:meals,en:refrigerated-foods,en:pizzas-pies-...","Meals,Refrigerated foods,Pizzas pies and quich...",united-kingdom,united-kingdom,"fr:Fait avec des œufs de poule en liberté,fr:F...","France,United Kingdom","Milk, Lait, Œufs, blé, Gluten, blé, Fromage ch...",d,Meals,Composite foods,Pizza pies and quiche


In [86]:
selected_data.count()

product_name                 672399
categories_tags              180592
categories_en                180558
origins_tags                  42604
manufacturing_places_tags     67490
labels                       102216
countries_en                 697880
allergens                     69125
nutrition_grade_fr           141605
main_category_en             180481
pnns_groups_1                257628
pnns_groups_2                263579
dtype: int64

We can also express how well each column is filled as a percentage of the most complete column (`countries_en`):

In [61]:
selected_data.count() / max(selected_data.count()) * 100

product_name                  96.348799
categories_tags               25.877228
categories_en                 25.872356
origins_tags                   6.104774
manufacturing_places_tags      9.670717
labels                        14.646644
labels_tags                   14.651373
labels_en                     14.651373
cities_tags                    4.259042
countries_en                 100.000000
allergens                      9.904998
nutrition_grade_fr            20.290738
main_category_en              25.861323
pnns_groups_1                 36.915802
pnns_groups_2                 37.768528
dtype: float64
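
The percentages above are relative to the most complete column. The same measure relative to the total number of rows can be sketched as follows, on a small made-up frame (`demo` is hypothetical data, not taken from the dataset):

```python
import pandas as pd

# Hypothetical data with missing values.
demo = pd.DataFrame({
    'product_name': ['a', 'b', None, 'd'],
    'origins_tags': [None, 'france', None, None],
})

# notna() marks the filled cells; the mean of each column is its filled fraction.
fill_pct = demo.notna().mean() * 100
```

Applied to `selected_data`, this would give slightly lower values than the cell above whenever no column is completely filled.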

- Which countries are the highest exporters and importers and is there a relationship with the GDP?

For this question, we are interested in the following columns: `origins_tags`, `manufacturing_places_tags` and `countries_en`.

Here we face our first problem. While the countries column has more than enough samples (we still need to check its distribution later on), `origins_tags` and `manufacturing_places_tags` are each filled for less than 10% of the data. Let us now look at their actual values:

In [80]:
selected_data.manufacturing_places_tags.unique()[:50]

array([nan, 'france', 'brossard-quebec', 'united-kingdom',
       'brossard,quebec', 'etats-unis', 'france,avranches',
       'brossars,quebec', 'thailand', 'belgien', 'net-wt',
       'las-ventas-de-retamosa,toledo-provincia,castilla-la-mancha,espana',
       'saint-yrieix,france', 'germany',
       'france,limousin,87500,saint-yrieux', 'france,87500',
       '87500,france', 'sarlat', 'pays-bas,netherlands', 'royaume-uni',
       'royaume-uni,ecosse', 'belgique', 'ireland', '87500-saint-yrieix',
       'ecosse,royaume-uni', 'black-sheep-brewery',
       'estados-unidos-americanos', 'vietnam', 'argentine', 'usa',
       '30800-st-gilles', 'chester,united-kingdom', 'uk',
       'royaume-uni,irelande', 'california,usa', 'italie', 'china',
       'angleterre', 'switzerland', 'canada', 'mexico',
       'the-hershey-company', 'cincinnati', 'united-states', 'taiwan',
       'japon', 'japan', 'topco', 'san-nicolas-de-los-garza,nuevo-leon',
       'san-francisco-california'], dtype=object)

In [82]:
selected_data.origins_tags.unique()[:50]

array([nan, 'france', 'quebec', 'quebec,canada', 'united-kingdom',
       'germany', 'ue', 'canada', 'mexico', 'grande-bretagne',
       'estados-unidos-americanos', 'royaume-uni', 'vietnam', 'argentine',
       'brazil', 'england', 'europe,royaume-uni',
       'easter-grangemuir-farm,pittenweem,fife,ky10-2rb,scotland,united-kingdom',
       'royaume-uni,hors-royaume-uni', 'perou', 'italie', 'tibet',
       'espagne,royaume-uni', 'united-states', 'madagascar', 'taiwan',
       'japon', 'usa', 'estados-unidos', 'california', 'estero',
       'atlantique-nord-ouest,canada', 'suisse', 'saudi',
       'estados-unidos-de-america', 'britain,british-chicken', 'italy',
       'francia', 'fougerolles,france', 'ancaster,ontario,canada',
       'scotland', 'royaume-uni,west-sussex', 'malaisie',
       'washington,usa', 'e-u-a', 'etats-unis', 'californie,etats-unis',
       'new-zealand', 'sicile,italie', 'indetermine'], dtype=object)

In [73]:
a = selected_data['origins_tags'].dropna()  # TODO: search in lowercase to match all spellings
a[a.str.contains('suisse')].head()

11255          suisse
71519    vevey,suisse
73202          suisse
92403          suisse
92407          suisse
Name: origins_tags, dtype: object

In [72]:
# Demo of how we can search for the countries of origin we want
a = selected_data['origins_tags'].dropna()  # TODO: search in lowercase to match all spellings
a[a.str.contains('france')].head()

254    france
272    france
313    france
362    france
416    france
Name: origins_tags, dtype: object

We now face another problem: the tags are not normalized, and a lot of them are even invalid ("mer", postal codes, or names written in other languages).
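
One possible way to normalize these tags is a lookup table from known spellings to a canonical country name. The sketch below is only an illustration: the spellings come from the tags listed above, but the `ALIASES` table and the `normalize_tags` helper are hypothetical and far from complete.

```python
import pandas as pd

# Hypothetical alias table: spelling found in the tags -> canonical name.
ALIASES = {
    'france': 'France', 'francia': 'France',
    'suisse': 'Switzerland', 'switzerland': 'Switzerland',
    'royaume-uni': 'United Kingdom', 'united-kingdom': 'United Kingdom',
    'uk': 'United Kingdom',
    'etats-unis': 'United States', 'usa': 'United States',
    'united-states': 'United States',
}

def normalize_tags(tags):
    """Return the sorted canonical countries found in a comma-separated tag string."""
    if pd.isna(tags):
        return []
    # Unknown parts (cities, postal codes, brands) are simply ignored.
    found = {ALIASES[t] for t in tags.lower().split(',') if t in ALIASES}
    return sorted(found)

origins = pd.Series(['vevey,suisse', 'californie,etats-unis', '87500', None])
normalized = origins.apply(normalize_tags)
```

Entries that map to an empty list would then be treated as missing.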

## GDP and Life Expectancy 

In our project we would like to observe whether there is any relation between the food quality of a country and its general wealth. For this purpose we found additional datasets on the World Bank website at https://www.worldbank.org/. We will use the GDP.csv file to get the GDP per country (in USD) for the year 2017 as an estimate of the wealth of a country.

In [74]:
gdp = pd.read_csv(data_folder + 'GDP.csv')
#gdp.head()

In [75]:
gdp = gdp[['Country Name', 'Country Code', '2017']].dropna()
gdp.head()

Unnamed: 0,Country Name,Country Code,2017
1,Afghanistan,AFG,20815300000.0
2,Angola,AGO,124209400000.0
3,Albania,ALB,13039350000.0
4,Andorra,AND,3012914000.0
5,Arab World,ARB,2591047000000.0


We also want to observe a possible relation between food quality and health. For this we found, on the same website, the LE.csv file, from which we can obtain the life expectancy per country. We will use the data for the year 2016, as it is the most recent year available (the year 2017 has no value):

In [76]:
le = pd.read_csv(data_folder + 'LE.csv')
#le.head()

In [77]:
le = le[['Country Name', 'Country Code', '2016']].dropna()
le.head()

Unnamed: 0,Country Name,Country Code,2016
0,Aruba,ABW,75.867
1,Afghanistan,AFG,63.673
2,Angola,AGO,61.547
3,Albania,ALB,78.345
5,Arab World,ARB,71.198456


# Cleaning the data

## Open Food Facts dataset

## Countries

In [12]:
data_countries = data.filter(data.countries_en != "")

In [13]:
from pyspark.sql import functions as F  # needed for F.split and F.explode
col_split = F.split(data_countries['countries_en'], ',')

In [14]:
data_countries = data_countries.withColumn('countries_en', F.explode(col_split))

In [15]:
data_countries.select('countries_en').distinct().show(500)

+--------------------+
|        countries_en|
+--------------------+
|       Côte d'Ivoire|
|                Chad|
|            Anguilla|
|              Russia|
|            Paraguay|
|Virgin Islands of...|
|               World|
|               Yemen|
|British Indian Oc...|
|             Senegal|
|              Sweden|
|              Guyana|
|         Philippines|
|            Djibouti|
|           Singapore|
|            Malaysia|
|fr:republica-moldova|
|        ch:allemagne|
|                Fiji|
|              Turkey|
|           fr:nantes|
|Nutrition facts c...|
|              Malawi|
|                Iraq|
|           fr:tahiti|
|             Germany|
|                  En|
|            Cambodia|
|     To be completed|
|         Afghanistan|
|            de:grece|
|              Jordan|
|              Rwanda|
|            Maldives|
|    Photos validated|
|          ch:schweiz|
|              France|
|            de:japon|
|              Greece|
|     Photos uploaded|
|Packaging 

Some of the entries are still invalid because they are written in other languages; we decided not to count them. Since we already have a list of countries, we are going to use it to keep only the valid entries.

In [16]:
joined = data_countries.join(gdp, 'countries_en', how='inner').drop('Country Code', '2016')

In [26]:
joined.filter(data.origins != "").count()

45504

In [31]:
origins = joined.join(gdp, joined.origins.isin(gdp.countries_en), how='inner')

In [33]:
origins.count()

15203

In [34]:
or_pd = origins.toPandas()

In [None]:
or_pd.origins.hist()
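
Since the rest of the notebook relies on Pandas, the split, explode and filter steps above could also be written with Pandas. The sketch below assumes that `countries_en` holds comma-separated names and that the World Bank `Country Name` column serves as the reference list; both frames are small made-up stand-ins.

```python
import pandas as pd

# Made-up stand-ins for the product table and the World Bank table.
products = pd.DataFrame({
    'product_name': ['Cheddar', 'Quiche', 'Mystery'],
    'countries_en': ['France,United Kingdom', 'France', 'fr:nantes'],
})
gdp_demo = pd.DataFrame({'Country Name': ['France', 'United Kingdom']})

# One row per (product, country) pair.
exploded = products.assign(
    countries_en=products['countries_en'].str.split(',')
).explode('countries_en')
exploded['countries_en'] = exploded['countries_en'].str.strip()

# Keep only the entries that match a known country name.
valid = exploded[exploded['countries_en'].isin(gdp_demo['Country Name'])]
```

Names that differ between the two sources would be dropped by this exact match, so a mapping step may still be needed.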

# Plans for the Project

# Societal aspects of food 

## Relations between food quality and general health

We would first like to find out whether, for a given country, the quality of the food consumed there has an influence on the overall health of its population. For this purpose we will use the nutrition grades (and maybe the average number of harmful additives) together with the life expectancy data.

In [None]:
# TODO: correlation coefficients between average nutrition grade and life expectancy per country
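
A possible starting point for this TODO, as a sketch only. It assumes that the grades a to e are mapped linearly to the scores 1 to 5, which is our own simplification, and it runs on made-up numbers. Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations; Spearman's coefficient is Pearson's coefficient computed on ranks.

```python
import pandas as pd

# Made-up stand-ins for the exploded product table and the life expectancy table.
products = pd.DataFrame({
    'countries_en': ['France', 'France', 'Italy', 'Italy', 'Spain'],
    'nutrition_grade_fr': ['a', 'c', 'b', 'b', 'e'],
})
le_demo = pd.DataFrame({
    'Country Name': ['France', 'Italy', 'Spain'],
    '2016': [82.0, 83.0, 80.0],  # illustrative values, not World Bank data
})

# Assumption: grades a..e are mapped linearly to scores 1..5 (1 = best).
grade_score = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
products['score'] = products['nutrition_grade_fr'].map(grade_score)

# Average score per country, joined with the life expectancy.
mean_score = (products.groupby('countries_en')['score']
              .mean().rename('mean_score').reset_index())
merged = mean_score.merge(le_demo, left_on='countries_en', right_on='Country Name')

pearson_r = merged['mean_score'].corr(merged['2016'])
# Spearman's rho is Pearson's r computed on the ranks.
spearman_rho = merged['mean_score'].rank().corr(merged['2016'].rank())
```

Because the grade is ordinal, the rank-based coefficient is probably the safer of the two.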

## Relations between food quality and general wealth

We will also analyze the possible links between the average food quality and the overall wealth of a country. For this we will again use the nutrition grades (and maybe the average number of harmful additives), this time together with the GDP data.

# Impacts of Globalization

## Consequences of globalization, case study on countries

We will take concrete countries such as Switzerland and France as examples to observe how globalization impacts their food quality. Using data such as nutrition grades and additives, we will try to determine whether the food imported into these countries is on the whole healthier or not than the food produced locally.
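
The comparison between local and imported food could be sketched as follows. The products are made up, the `origin_country` column is assumed to hold an already normalized origin, and the mapping from grades to scores is again our own simplification.

```python
import pandas as pd

# Made-up products sold in Switzerland, with already normalized origins.
sold_in_ch = pd.DataFrame({
    'product_name': ['Cheese', 'Chocolate', 'Prawns', 'Pasta sauce'],
    'origin_country': ['Switzerland', 'Switzerland', 'Vietnam', 'Italy'],
    'nutrition_grade_fr': ['d', 'e', 'c', 'b'],
})

# Assumption: grades a..e are mapped linearly to scores 1..5 (1 = best).
grade_score = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
sold_in_ch['score'] = sold_in_ch['nutrition_grade_fr'].map(grade_score)

# A product counts as local when its origin matches the country of sale.
is_local = sold_in_ch['origin_country'].eq('Switzerland')
sold_in_ch['group'] = is_local.map({True: 'local', False: 'imported'})

mean_by_group = sold_in_ch.groupby('group')['score'].mean()
```

On the real data, the group sizes should be reported next to the means, since few products have a known origin.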

## Ecological footprint of globalization

We will analyze the ecological footprint of food transport. What distance does the food travel before reaching countries like the ones we will study? How can we quantify this footprint? For this we will use data from TODO FIND DATA SOURCE FOR THIS
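
As a first approximation of the distance travelled, one could use the great-circle distance between the country of origin and the country of sale, computed with the haversine formula. This is a sketch under strong assumptions: each country is reduced to the approximate coordinates of its capital, and the straight-line distance is only a lower bound on the real transport route.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Assumption: each country is represented by its capital (approximate coordinates).
CAPITALS = {'France': (48.8566, 2.3522), 'Switzerland': (46.9480, 7.4474)}

distance = haversine_km(*CAPITALS['France'], *CAPITALS['Switzerland'])
```

Turning such a distance into a footprint would still require emission factors per transport mode, which are not part of the current data.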

# Conclusion

We will try to bring together all our observations and analyses to make a general statement on the effects of globalization on the world, and more specifically on countries like Switzerland. Does globalization impact our environment? Is this globalization of food good for our health? Should we try to eat more local food or continue on this path?