# Data collection and descriptive analysis
From [Open Food Facts](https://world.openfoodfacts.org/) we have a 1.7 GB `.csv` file containing information on over 600 000 unique food products. The purpose of this notebook is to explore this dataset and compile the available information into one or several smaller files containing only what we need for our project.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_folder = "./data/"

In [3]:
database = pd.read_csv(data_folder + "en.openfoodfacts.org.products.csv", sep='\t', dtype=object)

We take a look at the data:

In [4]:
database.describe()

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,...,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
count,696770,696770,696801,696800,696794,696804,696804,670800,77415,194249,...,344,140760,140760,2,1,27.0,1.0,14,33.0,13.0
unique,696731,696733,6656,562201,562207,551516,551515,479331,58231,22537,...,211,56,56,2,1,18.0,1.0,10,19.0,8.0
top,8057014130126,http://world-en.openfoodfacts.org/product/6003...,kiliweb,1489055829,2017-03-09T10:37:09Z,France,en:france,Comté,Pâtes alimentaires au blé dur de qualité supér...,500 g,...,0,0,0,3,2,0.06,1.6e-05,4,0.02,0.0073
freq,2,2,312135,20,20,29,31,451,181,7881,...,82,7338,10150,1,1,4.0,1.0,3,6.0,2.0


The data contains many different kinds of information spread over 173 columns, so the output of the `describe` method is hard to survey. We need to explore the data in another way.
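
One possible way to get a more compact overview is to count how many values each column actually holds. The following is a minimal sketch on a toy table; the `toy` dataframe and its values are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Toy stand-in for the real table (hypothetical values for illustration only).
toy = pd.DataFrame({
    "product_name": ["Comté", "Pasta", None, "Cola"],
    "countries_en": ["France", "France,Germany", "United States", None],
    "choline_100g": [None, None, None, "0.06"],
}, dtype=object)

# Number of non-missing values per column, best-populated columns first.
filled = toy.notna().sum().sort_values(ascending=False)
# Share of rows that have a value in each column.
share = filled / len(toy)

print(pd.DataFrame({"filled": filled, "share": share}))
```

Applied to `database`, the same two lines would give one row per column instead of the wide `describe` table.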

# Finding the column with the country data
We are interested in basing our analysis on which country each food item comes from. We therefore filter the data to find the columns whose names start with the string `countr`, for "countries":

In [5]:
filter_col = [col for col in database if col.startswith('countr')]
filter_col

['countries', 'countries_tags', 'countries_en']

We have three different columns with country data. Let's try to find out which one is relevant for us. A guess is that we need the `countries_en` column, since the name presumably means that the column contains the origin country in English.

Looking at the `countries` column, we notice that there are duplicates:

In [6]:
database.countries.value_counts().head(5)

en:france    205162
France       179274
US           168473
en:FR         28054
Suisse         9097
Name: countries, dtype: int64

France appears several times! Comparing the number of unique values in each of the three columns, we see that the other two columns contain fewer than half as many unique values:

In [7]:
for col in filter_col:
    print("Number of unique country labels in column '{}': ".format(col) + str(database[col].value_counts().shape[0]))


Number of unique country labels in column 'countries': 3227
Number of unique country labels in column 'countries_tags': 1227
Number of unique country labels in column 'countries_en': 1227
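
The same count can also be obtained with `nunique`, which, like `value_counts`, ignores missing values by default. A small sketch on made-up data (the `toy` dataframe is an assumption for illustration):

```python
import pandas as pd

# Toy data (hypothetical): four spellings in 'countries', two labels in 'countries_en'.
toy = pd.DataFrame({
    "countries": ["en:france", "France", "US", "en:FR", None],
    "countries_en": ["France", "France", "United States", "France", None],
}, dtype=object)

# One call gives the number of distinct non-missing labels for every column.
unique_labels = toy.nunique()

for col, n in unique_labels.items():
    print("Number of unique country labels in column '{}': {}".format(col, n))
```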


Taking a look at `countries_tags` and `countries_en` shows that they are essentially identical, differing only in how each country is formatted:

In [8]:
database.countries_tags.value_counts().head(6)

en:france               421492
en:united-states        173575
en:switzerland           13463
en:germany               11845
en:france,en:germany      6309
en:spain                  6234
Name: countries_tags, dtype: int64

In [9]:
database.countries_en.value_counts().head(6)

France            421492
United States     173575
Switzerland        13463
Germany            11845
France,Germany      6309
Spain               6234
Name: countries_en, dtype: int64

Because of this, we decide to use the `countries_en` column. We note the format of the column: each country starts with a capital letter, and if there are several countries they are separated by a comma without any whitespace.
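
To illustrate this format, the sketch below splits such a column on the bare comma and counts each country separately. The series is made up for illustration, and this per-country count is not a step performed in this notebook.

```python
import pandas as pd

# Toy column in the same format as countries_en (hypothetical values).
countries_en = pd.Series(
    ["France", "France,Germany", "United States", None, "Spain"],
    name="countries_en", dtype=object,
)

# Drop missing entries, split on the comma, then give every country its own row.
per_country = countries_en.dropna().str.split(",").explode()

# Count how many products list each individual country.
counts = per_country.value_counts()
print(counts)
```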

# Filtering out France and the United States
We are only interested in comparing France against the United States. Because of this, we want to compile the rows of the database which contain data for these two countries into two new dataframes respectively.

In [10]:
# na=False treats rows where countries_en is NaN as non-matches, so they are dropped
france_data = database[database.countries_en.str.contains("France", na=False)]
us_data = database[database.countries_en.str.contains("United States", na=False)]
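
A minimal sketch of how this filter behaves, using a made-up `toy` dataframe (an assumption for illustration): products listed for several countries are kept as long as the name appears somewhere in the string, and rows without country data are excluded.

```python
import pandas as pd

# Toy table (hypothetical values); the last row has no country information.
toy = pd.DataFrame({
    "product_name": ["Comté", "Pasta", "Cola", "Unknown"],
    "countries_en": ["France", "France,Germany", "United States", None],
}, dtype=object)

# na=False turns the missing entry into False instead of leaving it missing,
# so the result is a clean boolean mask.
mask = toy.countries_en.str.contains("France", na=False)
toy_france = toy[mask]

print(toy_france)
```

Note that "Pasta", sold in both France and Germany, ends up in the French subset.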

# Filtering out relevant columns

Relevant columns are columns with values for many products. We define a column as relevant if it has more than 10 000 defined values in both the French and the US data.

In [46]:
columns10000 = ((france_data.count() > 10000) & (us_data.count() > 10000))

In [47]:
france_data.columns[columns10000]

Index(['code', 'url', 'creator', 'created_t', 'created_datetime', 'last_modified_t', 'last_modified_datetime', 'product_name', 'brands', 'brands_tags', 'countries', 'countries_tags', 'countries_en', 'ingredients_text', 'serving_size', 'serving_quantity', 'additives_n', 'additives', 'additives_tags', 'additives_en', 'ingredients_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil_n', 'states', 'states_tags', 'states_en', 'energy_100g', 'fat_100g', 'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g', 'salt_100g', 'sodium_100g'], dtype='object')
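
The selection above can be illustrated on a small scale. In this sketch the two toy dataframes and the scaled-down `threshold` are assumptions for illustration; the logic is the same: a column is kept only if it exceeds the threshold in both subsets.

```python
import pandas as pd

# Two toy subsets with identical columns (hypothetical values).
toy_fr = pd.DataFrame({"product_name": ["a", "b", "c"],
                       "fat_100g": ["1", None, "3"],
                       "choline_100g": [None, None, "0.06"]}, dtype=object)
toy_us = pd.DataFrame({"product_name": ["d", "e", None],
                       "fat_100g": ["2", "4", "5"],
                       "choline_100g": [None, None, None]}, dtype=object)

threshold = 1  # assumption: scaled down from the 10 000 used on the real data

# count() returns the number of non-missing values per column;
# the & keeps only columns that pass the threshold in both subsets.
keep = (toy_fr.count() > threshold) & (toy_us.count() > threshold)
kept_columns = toy_fr.columns[keep.values]

print(list(kept_columns))
```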

From the resulting columns, we will use the ones relevant for nutrition. These are the ones with values per 100g, the serving columns and the product names. In addition to these, we will save the `categories` columns, as we will use them to categorise our data. We will also save some additional vitamins and minerals, as well as the two nutrition score columns. An overview of all columns can be found here: https://static.openfoodfacts.org/data/data-fields.txt

In [58]:
re_columns = ['product_name', 'brands', 'brands_tags', 'ingredients_text', 'serving_size', 'categories',
              'categories_tags', 'categories_en',
              'serving_quantity', 'energy_100g', 'proteins_100g', 'carbohydrates_100g', 'sugars_100g', 'fat_100g',
              'saturated-fat_100g','monounsaturated-fat_100g', 'polyunsaturated-fat_100g', 'omega-3-fat_100g',
              'omega-6-fat_100g','omega-9-fat_100g', 'trans-fat_100g', 'cholesterol_100g', 'fiber_100g',
              'sodium_100g', 'vitamin-a_100g','vitamin-d_100g', 'vitamin-e_100g', 'vitamin-k_100g', 'vitamin-c_100g',
              'vitamin-b1_100g','vitamin-b2_100g', 'vitamin-pp_100g', 'vitamin-b6_100g', 'vitamin-b9_100g',
              'vitamin-b12_100g',
              'biotin_100g', 'calcium_100g', 'phosphorus_100g', 'iron_100g', 'magnesium_100g', 'zinc_100g',
              'copper_100g', 'manganese_100g', 'fluoride_100g', 'selenium_100g', 'chromium_100g', 'molybdenum_100g',
              'iodine_100g', 'nutrition-score-fr_100g', 'nutrition-score-uk_100g']

In [59]:
france = france_data[re_columns]
usa = us_data[re_columns]