## Milestone 2
Dataset: Open Food Facts

The dataset is downloaded and stored in the `./data/` folder.

When describing the data, in particular, you should show (non-exhaustive list):

- That you can handle the data in its size.
- That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
- That you considered ways to enrich, filter, transform the data according to your needs.
- That you have updated your plan in a reasonable way, reflecting your improved knowledge after data acquaintance. In particular, discuss how your data suits your project needs and discuss the methods you’re going to use, giving their essential mathematical details in the notebook.
- That your plan for analysis and communication is now reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.


In [2]:
import pandas as pd
import numpy as np
import scipy as sp

import findspark
findspark.init()

from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark import SparkContext

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [3]:
data_folder = './data/'

# Loading the data

## Open Food Facts dataset

The data is the CSV export that can be downloaded from the Open Food Facts website. Its size is 1.6 GB. For this milestone, we decided to download it and load it with Spark.

In [4]:
data = spark.read.option("delimiter", "\t").option("header", "true").csv(data_folder + "en.openfoodfacts.org.products.csv")

In [5]:
print('In the dataset, there are {} elements'.format(data.count()))

In the dataset, there are 697574 elements


We look at the columns and their types to make sure the file has been loaded correctly:

In [6]:
data.dtypes

[('code', 'string'),
 ('url', 'string'),
 ('creator', 'string'),
 ('created_t', 'string'),
 ('created_datetime', 'string'),
 ('last_modified_t', 'string'),
 ('last_modified_datetime', 'string'),
 ('product_name', 'string'),
 ('generic_name', 'string'),
 ('quantity', 'string'),
 ('packaging', 'string'),
 ('packaging_tags', 'string'),
 ('brands', 'string'),
 ('brands_tags', 'string'),
 ('categories', 'string'),
 ('categories_tags', 'string'),
 ('categories_en', 'string'),
 ('origins', 'string'),
 ('origins_tags', 'string'),
 ('manufacturing_places', 'string'),
 ('manufacturing_places_tags', 'string'),
 ('labels', 'string'),
 ('labels_tags', 'string'),
 ('labels_en', 'string'),
 ('emb_codes', 'string'),
 ('emb_codes_tags', 'string'),
 ('first_packaging_code_geo', 'string'),
 ('cities', 'string'),
 ('cities_tags', 'string'),
 ('purchase_places', 'string'),
 ('stores', 'string'),
 ('countries', 'string'),
 ('countries_tags', 'string'),
 ('countries_en', 'string'),
 ('ingredients_text', 'str

There are a lot of null values in the dataset, so it is a good idea to check how many products have the tags we're interested in:

- Which countries are the highest exporters and importers and is there a relationship with the GDP?

For this question, we're interested in the following tags: `origins`, `manufacturing_places` and `countries`.

In [7]:
print('There are {} products that contains the origins tags'.format(data.filter(data.origins != "").count()))
print('There are {} products that contains the manufacturing_places tags'.format(data.filter(data.manufacturing_places != "").count()))
print('There are {} products that contains the countries tags'.format(data.filter(data.countries != "").count()))

There are 42483 products that contains the origins tags
There are 67356 products that contains the manufacturing_places tags
There are 696976 products that contains the countries tags
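
To put these counts in perspective, they can be divided by the 697574 rows reported earlier. A minimal sketch in plain Python, using only the numbers printed above (the variable names are ours):

```python
# Counts copied from the notebook output above.
total_products = 697574
non_empty = {
    "origins": 42483,
    "manufacturing_places": 67356,
    "countries": 696976,
}

# Share of products that have a non-empty value for each tag.
coverage = {tag: count / total_products for tag, count in non_empty.items()}

for tag, share in coverage.items():
    print("{}: {:.1%} of products".format(tag, share))
```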


Here we face our first problem: while `countries` has more than enough samples (we still need to check its distribution later on), `origins` and `manufacturing_places` are each filled in for less than 10% of the products. Now, if we look at their actual values:

In [10]:
data.select('manufacturing_places').distinct().show(100)

+--------------------+
|manufacturing_places|
+--------------------+
|bordères sur l'ec...|
|             beauzac|
|    sable sur Sarthe|
|       Côte d'Ivoire|
| France,62800 Lievin|
|McCain Alimentair...|
|S.A.S,Haute-Norma...|
|        SOUTH AFRICA|
|33290 Blanquefort...|
|Candia (Filiale S...|
|France,Jurançon,6...|
|Primel Gastronomi...|
|Candia (Filiale S...|
|France,Conserveri...|
|             rueyres|
|France,LNUF MDD,3...|
|Delmas Poissons e...|
|Malissard,Provenc...|
|       Vaudes,France|
|Montgaillard,Haut...|
|BPA Anjou,1 Chemi...|
|Montoire sur le L...|
|France,Kervignac,...|
|56700,Kervignac,M...|
|Montredon des Cor...|
| France,SA Favrichon|
|            Lavernay|
|  st romain sur cher|
|Concept Fruits,07...|
|Belgique,5150 Flo...|
| Pays Basque.,France|
|30570 Saint André...|
|85340,Olonne-sur-...|
|Saint Genès-de-Bl...|
|     France,Bordeaux|
|Fabriqué en Franc...|
|         France,gers|
|Champagne-Ardenne...|
|               Jever|
|Silberstedt,Deuts...|
|oldenburg,

In [11]:
data.select('origins').distinct().show(100)

+--------------------+
|             origins|
+--------------------+
|Britain,British C...|
|               45273|
|          états unis|
|       Côte d'Ivoire|
|       Non spécifiée|
|Côtes du Rhône,Fr...|
|France,Pêcheries ...|
|        SOUTH AFRICA|
|   France ou Espagne|
|Avoine Française,...|
|Vin,Bourgogne,France|
|Noix de Saint-Jac...|
|     basse Normandie|
|      Porc de France|
|Languedoc Roussil...|
|     France,Bordeaux|
|Carpe origine France|
| Beaumont-de-Lomagne|
|                 Mer|
|farine de blé T18...|
|           France,35|
|République Tchèqu...|
|      Halluin,France|
|Lait origine Fran...|
|Pérou,Coopérative...|
|Chine,Privince du...|
|Pérou,République ...|
|La Réunion,Morbih...|
|      Chine,Paraguay|
|      Nordostpazifik|
|Union européenne,...|
|             Toskana|
|   Tipperary,Ireland|
|        Cork,Ireland|
|Ivory Coast,Domin...|
|               liban|
|      Mexico,Yucatan|
|  Thailand,  Amerika|
|Union européenne,...|
|        Italia,Parma|
|      Espa

We are now facing another problem: the tags are not normalized, and many of them are invalid (e.g. "Mer" or a postal code) or written in languages other than English.
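
One possible way to recover usable country names from these free-text values is to split each value on commas, normalize case and accents, and look each piece up in an alias table. The sketch below is only an illustration: the `ALIASES` table and the `normalize` and `to_countries` helpers are hypothetical and far from complete, and the sample values are taken from the listings above.

```python
import unicodedata

# Hypothetical, deliberately tiny alias table: normalized spelling -> English name.
ALIASES = {
    "france": "France",
    "south africa": "South Africa",
    "cote d'ivoire": "Côte d'Ivoire",
    "etats unis": "United States",
    "liban": "Lebanon",
    "italia": "Italy",
    "ireland": "Ireland",
    "belgique": "Belgium",
}


def normalize(text):
    """Lowercase, trim, and strip accents so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.strip().casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_countries(raw_value):
    """Return the known countries named in a comma-separated tag value."""
    found = []
    for piece in raw_value.split(","):
        country = ALIASES.get(normalize(piece))
        if country is not None and country not in found:
            found.append(country)
    return found


for value in ["SOUTH AFRICA", "états unis", "France,Bordeaux", "Mer", "45273"]:
    print(repr(value), "->", to_countries(value))
```

Values that match nothing in the table, such as "Mer" or a postal code, yield an empty list and can be treated as missing.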

## GDP and Life Expectancy

We found the GDP of each country (in USD) on the World Bank website.

In [7]:
gdp = spark.read.option("header", "true").csv(data_folder + 'GDP.csv').select('Country Name', 'Country Code', '2016').withColumnRenamed('Country Name', 'countries_en')

In [8]:
gdp.show()

+--------------------+------------+----------------+
|        countries_en|Country Code|            2016|
+--------------------+------------+----------------+
|               Aruba|         ABW|            null|
|         Afghanistan|         AFG|19469022207.6852|
|              Angola|         AGO|95337203468.1156|
|             Albania|         ALB|11883682170.8236|
|             Andorra|         AND|2877311946.90265|
|          Arab World|         ARB|2500164034395.78|
|United Arab Emirates|         ARE|357045064669.843|
|           Argentina|         ARG| 554860945013.62|
|             Armenia|         ARM| 10546135160.031|
|      American Samoa|         ASM|       658000000|
| Antigua and Barbuda|         ATG| 1460144703.7037|
|           Australia|         AUS|1208039015868.39|
|             Austria|         AUT|390799991147.468|
|          Azerbaijan|         AZE|37867518957.1975|
|             Burundi|         BDI| 3007029030.4001|
|             Belgium|         BEL|46754554876

The same goes for life expectancy:

In [9]:
le = spark.read.option("header", "true").csv(data_folder + 'LE.csv').select('Country Name', 'Country Code', '2016').withColumnRenamed('Country Name', 'countries_en')

In [10]:
le.dtypes

[('countries_en', 'string'), ('Country Code', 'string'), ('2016', 'string')]

In [11]:
le.show(10)

+--------------------+------------+---------------+
|        countries_en|Country Code|           2016|
+--------------------+------------+---------------+
|               Aruba|         ABW|         75.867|
|         Afghanistan|         AFG|         63.673|
|              Angola|         AGO|         61.547|
|             Albania|         ALB|         78.345|
|             Andorra|         AND|           null|
|          Arab World|         ARB|71.198456370659|
|United Arab Emirates|         ARE|         77.256|
|           Argentina|         ARG|         76.577|
|             Armenia|         ARM|         74.618|
|      American Samoa|         ASM|           null|
+--------------------+------------+---------------+
only showing top 10 rows
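
Both World Bank tables are read with every column as a string, as `le.dtypes` shows, so the `2016` values need a numeric conversion before they can be compared with product counts. The sketch below shows the idea on a few rows copied from the output above, using only the standard library. The `parse_indicator` helper is a hypothetical name, and treating empty or `null` cells as missing is our assumption.

```python
def parse_indicator(cell):
    """Convert a raw CSV cell to float, or None when the value is missing."""
    if cell is None:
        return None
    cell = cell.strip()
    if cell == "" or cell.lower() == "null":
        return None
    return float(cell)


# (country, raw 2016 value) pairs copied from the life expectancy table above.
rows = [
    ("Aruba", "75.867"),
    ("Afghanistan", "63.673"),
    ("Andorra", None),
    ("Arab World", "71.198456370659"),
]

life_expectancy = {country: parse_indicator(raw) for country, raw in rows}

# Keep only the countries that have a usable value.
usable = {c: v for c, v in life_expectancy.items() if v is not None}
print(usable)
```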



# Cleaning the data

## Open Food Facts dataset

### Countries

In [12]:
data_countries = data.filter(data.countries_en != "")

In [13]:
col_split = F.split(data_countries['countries_en'], ',')

In [14]:
data_countries = data_countries.withColumn('countries_en', F.explode(col_split))

In [15]:
data_countries.select('countries_en').distinct().show(500)

+--------------------+
|        countries_en|
+--------------------+
|       Côte d'Ivoire|
|                Chad|
|            Anguilla|
|              Russia|
|            Paraguay|
|Virgin Islands of...|
|               World|
|               Yemen|
|British Indian Oc...|
|             Senegal|
|              Sweden|
|              Guyana|
|         Philippines|
|            Djibouti|
|           Singapore|
|            Malaysia|
|fr:republica-moldova|
|        ch:allemagne|
|                Fiji|
|              Turkey|
|           fr:nantes|
|Nutrition facts c...|
|              Malawi|
|                Iraq|
|           fr:tahiti|
|             Germany|
|                  En|
|            Cambodia|
|     To be completed|
|         Afghanistan|
|            de:grece|
|              Jordan|
|              Rwanda|
|            Maldives|
|    Photos validated|
|          ch:schweiz|
|              France|
|            de:japon|
|              Greece|
|     Photos uploaded|
|Packaging 

Some of the entries are still invalid, for example because they are written in other languages; we decided not to count them. Since the GDP table already gives us a list of countries, we use it to keep only the valid entries.

In [16]:
joined = data_countries.join(gdp, 'countries_en', how='inner').drop('Country Code', '2016')
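
The inner join acts as a whitelist: after the comma-separated `countries_en` values are split and exploded, only entries that exactly match a name in the GDP table survive. The same logic on plain Python data, as a rough sketch: the `reference` set is a small hypothetical stand-in for the GDP country names, and the product rows are made up from values seen in the listings above.

```python
# Hypothetical stand-in for the country names in the GDP table.
reference = {"Afghanistan", "Australia", "Austria"}

# Made-up (product code, countries_en) rows built from values seen above.
products = [
    ("p1", "Afghanistan"),
    ("p2", "Austria,ch:allemagne"),
    ("p3", "fr:nantes"),
    ("p4", "To be completed"),
    ("p5", "Australia,Austria"),
]

# Split + explode: one (code, country) pair per listed country.
exploded = [
    (code, country)
    for code, countries in products
    for country in countries.split(",")
]

# Inner join with the reference list: unmatched entries are dropped.
kept = [(code, country) for code, country in exploded if country in reference]
dropped = sorted({country for _, country in exploded if country not in reference})

print(kept)
print(dropped)
```

A product sold in several countries appears once per country after the explode, so counts computed on `joined` are per (product, country) pair rather than per product.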

In [26]:
joined.filter(joined.origins != "").count()

45504

In [31]:
# Keep only the rows whose origins value exactly matches a country name from the GDP table
origins = joined.join(gdp, joined.origins.isin(gdp.countries_en), how='inner')

In [33]:
origins.count()

15203

In [34]:
or_pd = origins.toPandas()

In [None]:
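# origins holds country names (categorical), so plot the count per country as a bar chart
or_pd.origins.value_counts().plot(kind='bar')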