## Milestone 2
Dataset: Open Food Facts

The dataset is downloaded and stored in the `./data/` folder.

When describing the data, in particular, you should show (non-exhaustive list):

- That you can handle the data in its size.
- That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
- That you considered ways to enrich, filter, transform the data according to your needs.
- That you have updated your plan in a reasonable way, reflecting your improved knowledge after data acquaintance. In particular, discuss how your data suits your project needs and discuss the methods you’re going to use, giving their essential mathematical details in the notebook.
- That your plan for analysis and communication is now reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.


In [1]:
import pandas as pd
import numpy as np
import scipy as sp

import findspark
findspark.init()

from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark import SparkContext

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [2]:
data_folder = './data/'

# Loading the data

## Open Food Facts dataset

The data is the CSV export (tab-separated, despite its extension) that can be downloaded from the Open Food Facts website. Its size is 1.6 GB. For this milestone, we decided to download it and load it with Spark.

In [3]:
data = spark.read.option("delimiter", "\t").option("header", "true").csv(data_folder + "en.openfoodfacts.org.products.csv")

In [4]:
print('In the dataset, there are {} elements'.format(data.count()))

In the dataset, there are 697574 elements
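
As a point of comparison for handling the file size, the same kind of count could also be done without Spark by streaming the file in chunks with pandas. The sketch below is only an illustration of that alternative, not the approach used in this notebook: the in-memory sample, its rows, and the chunk size are made up, and only the column names come from the dataset.

```python
import io
import pandas as pd

# Tiny tab-separated sample standing in for the real 1.6 GB file (made-up rows).
sample = io.StringIO(
    "code\tproduct_name\torigins\tcountries\n"
    "1\tOlive oil\tSpain\tFrance\n"
    "2\tDark chocolate\t\tBelgium\n"
    "3\tPlain yogurt\t\tFrance\n"
)

# Only read the columns we care about, as strings, a few rows at a time.
wanted = ["product_name", "origins", "countries"]
n_rows = 0
n_with_origin = 0

for chunk in pd.read_csv(sample, sep="\t", usecols=wanted, dtype=str, chunksize=2):
    n_rows += len(chunk)
    # Empty fields are parsed as missing values, so notna() counts filled ones.
    n_with_origin += int(chunk["origins"].notna().sum())

print(n_rows, n_with_origin)
```

With the real file, `sample` would be replaced by the file path and `chunksize` by a much larger value. Only one chunk is held in memory at a time.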


We look at the columns and their types to make sure the data has been loaded correctly:

In [5]:
data.dtypes

[('code', 'string'),
 ('url', 'string'),
 ('creator', 'string'),
 ('created_t', 'string'),
 ('created_datetime', 'string'),
 ('last_modified_t', 'string'),
 ('last_modified_datetime', 'string'),
 ('product_name', 'string'),
 ('generic_name', 'string'),
 ('quantity', 'string'),
 ('packaging', 'string'),
 ('packaging_tags', 'string'),
 ('brands', 'string'),
 ('brands_tags', 'string'),
 ('categories', 'string'),
 ('categories_tags', 'string'),
 ('categories_en', 'string'),
 ('origins', 'string'),
 ('origins_tags', 'string'),
 ('manufacturing_places', 'string'),
 ('manufacturing_places_tags', 'string'),
 ('labels', 'string'),
 ('labels_tags', 'string'),
 ('labels_en', 'string'),
 ('emb_codes', 'string'),
 ('emb_codes_tags', 'string'),
 ('first_packaging_code_geo', 'string'),
 ('cities', 'string'),
 ('cities_tags', 'string'),
 ('purchase_places', 'string'),
 ('stores', 'string'),
 ('countries', 'string'),
 ('countries_tags', 'string'),
 ('countries_en', 'string'),
 ('ingredients_text', 'str

There are in fact many columns that we do not need. First, let's explicitly list the ones we do need for each of the questions we had.

- Which countries are the highest exporters and importers and is there a relationship with the GDP?

For this question, apart from the description/name of the food, we need the following columns:
1. `origins` : origins of ingredients
2. `manufacturing_places` : places where manufactured or transformed
3. `countries` : list of countries where the product is sold
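
Since these three columns can each hold several places for one product, a natural first transformation is to split them into one row per place. The sketch below shows the idea in pandas on a toy table. It assumes the values are comma-separated strings, which should be checked against the real data. The rows and the `split_places` helper are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up).
products = pd.DataFrame({
    "product_name": ["Olive oil", "Dark chocolate", "Plain yogurt"],
    "origins": ["Spain", "Ghana, Ecuador", None],
    "manufacturing_places": ["Italy", "Belgium", "France"],
    "countries": ["France, Switzerland", "Belgium", "France"],
})

def split_places(value):
    """Split a comma-separated string into stripped, lower-cased names."""
    if pd.isna(value):
        return []
    return [part.strip().lower() for part in value.split(",") if part.strip()]

# One row per (product, origin) pair; products without an origin are dropped.
origins_long = (
    products.assign(origin=products["origins"].map(split_places))
    .explode("origin")
    .dropna(subset=["origin"])
    [["product_name", "origin", "manufacturing_places"]]
    .reset_index(drop=True)
)
print(origins_long)
```

The same helper could be applied to `manufacturing_places` and `countries` to obtain origin-to-destination pairs per product.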

There are a lot of null values in the dataset, so it is a good idea to count the products that have an origin, a manufacturing place, or a country:

In [9]:
print('There are {} products that contain the origins tags'.format(data.filter(data.origins != "").count()))

There are 42483 products that contain the origins tags


In [8]:
print('There are {} products that contain the manufacturing_places tags'.format(data.filter(data.manufacturing_places != "").count()))

There are 67356 products that contain the manufacturing_places tags


In [10]:
print('There are {} products that contain the countries tags'.format(data.filter(data.countries != "").count()))

There are 696976 products that contain the countries tags


As we can see, nearly all products list the countries where they are sold, but the number of products with an origin (about 6% of the dataset) or a manufacturing place (about 10%) is very low compared to the whole dataset. However, we still think that it is a good idea to use that subset, since it can give us a first impression of what we want. The data will need cleaning afterwards.
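
To make the comparison concrete, the share of products covered by each column can be derived directly from the counts printed above. This is a small plain-Python sketch; the variable names are ours.

```python
# Counts reported by the cells above.
total = 697574
non_empty = {
    "origins": 42483,
    "manufacturing_places": 67356,
    "countries": 696976,
}

# Share of products with a non-empty value in each column.
coverage = {col: n / total for col, n in non_empty.items()}

for col, share in coverage.items():
    print(f"{col}: {share:.1%}")
```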

- How does the transport and production of food affect the ecosystem?

We have not yet determined which columns of the dataset are needed for this question.

- How does the food we eat affect our life expectancy?

We need the following tags from our dataset:
1. `categories`

We also need the life expectancy, which we obtain later.

## GDP and Life Expectancy

We found the GDP of each country (in current USD) on the World Bank website.

In [26]:
gdp = pd.read_csv(data_folder + 'GDP.csv')

In [23]:
gdp.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,Unnamed: 62
0,Aruba,ABW,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,...,2498933000.0,2467704000.0,2584464000.0,,,,,,,
1,Afghanistan,AFG,GDP (current US$),NY.GDP.MKTP.CD,537777800.0,548888900.0,546666700.0,751111200.0,800000000.0,1006667000.0,...,12486940000.0,15936800000.0,17930240000.0,20536540000.0,20264250000.0,20616100000.0,19215560000.0,19469020000.0,20815300000.0,
2,Angola,AGO,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,...,75492390000.0,82526140000.0,104115800000.0,113923200000.0,124912500000.0,126730200000.0,102621200000.0,95337200000.0,124209400000.0,
3,Albania,ALB,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,...,12044210000.0,11926950000.0,12890870000.0,12319780000.0,12776280000.0,13228240000.0,11386930000.0,11883680000.0,13039350000.0,
4,Andorra,AND,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,...,3660531000.0,3355695000.0,3442063000.0,3164615000.0,3281585000.0,3350736000.0,2811489000.0,2877312000.0,3012914000.0,
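
The World Bank file is in wide format, with one column per year. To relate it to other data, a long format with one row per country and year is usually easier to work with. The sketch below shows one possible reshaping with `DataFrame.melt`. The frame is rebuilt by hand from two rows of the output above and restricted to three years, and the names `year` and `gdp_usd` are our own choice.

```python
import pandas as pd

# Two rows copied from the gdp.head() output above, restricted to three years.
gdp_wide = pd.DataFrame({
    "Country Name": ["Afghanistan", "Albania"],
    "Country Code": ["AFG", "ALB"],
    "2015": [19215560000.0, 11386930000.0],
    "2016": [19469020000.0, 11883680000.0],
    "2017": [20815300000.0, 13039350000.0],
})

# Wide to long: one row per (country, year) pair.
gdp_long = gdp_wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="year",
    value_name="gdp_usd",
)
gdp_long["year"] = gdp_long["year"].astype(int)

# Drop country/year pairs for which no GDP value is reported.
gdp_long = gdp_long.dropna(subset=["gdp_usd"])
print(gdp_long)
```

On the full file, the non-year columns (`Indicator Name`, `Indicator Code`, and the empty `Unnamed: 62`) would have to be dropped before melting.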


The same goes for life expectancy:

In [27]:
le = pd.read_csv(data_folder + 'LE.csv')

In [28]:
le.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,Unnamed: 62
0,Aruba,ABW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,65.662,66.074,66.444,66.787,67.113,67.435,...,74.872,75.016,75.158,75.299,75.44,75.582,75.725,75.867,,
1,Afghanistan,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,32.292,32.742,33.185,33.624,34.06,34.495,...,60.754,61.226,61.666,62.086,62.494,62.895,63.288,63.673,,
2,Angola,AGO,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,33.251,33.573,33.914,34.272,34.645,35.031,...,57.231,58.192,59.042,59.77,60.373,60.858,61.241,61.547,,
3,Albania,ALB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,62.279,63.298,64.187,64.911,65.461,65.848,...,76.281,76.652,77.031,77.389,77.702,77.963,78.174,78.345,,
4,Andorra,AND,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,,,,,,,...,,,,,,,,,,
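
The output shows that the most recent years are not always filled in (2017 is empty here, and Andorra has no values at all). One possible way to handle this, sketched below, is to keep for each country the latest value that is available. The frame is rebuilt by hand from the output above and restricted to three years, and the name `life_expectancy` is our own choice.

```python
import numpy as np
import pandas as pd

# Three rows copied from the le.head() output above, restricted to three years.
le_wide = pd.DataFrame({
    "Country Code": ["ABW", "AFG", "AND"],
    "2015": [75.725, 63.288, np.nan],
    "2016": [75.867, 63.673, np.nan],
    "2017": [np.nan, np.nan, np.nan],
}).set_index("Country Code")

# Forward-fill along the year axis so the last column holds the most
# recent non-missing value, then keep only that column.
latest = le_wide.ffill(axis=1).iloc[:, -1].rename("life_expectancy")
print(latest)
```

Countries with no value in any year, such as Andorra here, stay missing and would have to be excluded from the analysis.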
