# Open Food Facts Notebook
## Table of Contents
1. [Helper Functions](#Helper-Functions) 
2. [Cleaning Data](#Cleaning-Data)  
    2.1 [Fill in missing Product Name](#product_name)  
    2.2 [Fill in Missing Values for Country](#country)  
    2.3 [Fill in Missing Nutrion Scores](#nutrition-scores)  
    2.4 [Fill in Missing Allergens](#allergens)  
    2.5 [Fill in Missing Traces](#traces)  
    2.6 [Fill/Clean Ingredients](#ingredients)  
    2.7 [Fill/Clean Labels](#labels_column)  
    2.8 [Clean float64 Columns](#float64_col)  
3. [Data Visualization & Analysis](#data_analysis)  
    3.1 [Maps](#Maps)  
    3.2 [Correlations Between Neighbouring Countries](#correlation_neighbours)

# Imports

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
#import folium
import re
%matplotlib inline

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from pandas.plotting import scatter_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
import seaborn as sns
import time

#from google.cloud import translate
#import pycountry
#import emoji

#translate_client = translate.Client()

import sys # for printing process
import unidecode # for normalizing text
from pathlib import Path # check files


from py_translator import Translator
translator = Translator()

# from googletrans import Translator

# WE HAD DIFFICULTIES MERGING, FATS CLEANING IS IN `project_fats` FILE

We need to create a data folder where the .csv file will be stored and also a maps folder where .html maps will be stored

In [2]:
filename = 'en.openfoodfacts.org.products.csv'
countryfile = 'wikipedia-iso-country-codes.csv'
translationsfile = 'translations.csv'
foodfile = 'food.csv'

In [3]:
using_col = [
    "product_name",
    "generic_name",
    "quantity",
    "brands",
    "brands_tags",
    "categories",
    "categories_tags",
    "categories_en",
    "manufacturing_places",
    "manufacturing_places_tags",
    "labels",
    "labels_tags",
    "labels_en",
    "purchase_places",
    "countries",
    "countries_tags",
    "countries_en",
    "ingredients_text",
    "allergens",
    "allergens_en",
    "traces",
    "traces_tags",
    "traces_en",
    "nutrition_grade_uk",
    "nutrition_grade_fr",
    "main_category",
    "main_category_en",
    "energy_100g",
    "energy-from-fat_100g",
    "fat_100g",
    "saturated-fat_100g",
    "trans-fat_100g",
    "cholesterol_100g",
    "carbohydrates_100g",
    "sugars_100g",
    "fiber_100g",
    "proteins_100g",
    "salt_100g",
    "sodium_100g",
    "alcohol_100g",
    "calcium_100g",
    "iron_100g",
    "carbon-footprint_100g",
    "nutrition-score-fr_100g",
    "nutrition-score-uk_100g",
    "glycemic-index_100g"
]

In [4]:
data_folder = './data/'
maps_folder = './maps/'

In [5]:
# CONSTANTS
# unknown values to use
UNKNOWN_NR='-1'
UNKNOWN_STR='unknown'
# delay between translation requests
TRANSLATION_DELAY=0.3
# progress in function
PROGRESS=0


In [77]:
# cache translations to save translation requests
translations_file = Path(data_folder + translationsfile)
translations = {}

if not translations_file.is_file():
    print('Translations file not found')
else:
    print('Translations file found')
    translations = pd.read_csv(data_folder + translationsfile, 
                               sep='\t',
                               low_memory=False).to_dict("records")[0]
    print('{} translations found'.format(len(translations)))

Translations file found
7435 translations found


In [78]:
food_file = Path(data_folder + foodfile)
food_df = pd.DataFrame()

if not food_file.is_file():
    print('Food file not found')
    food_df = pd.read_csv(data_folder + filename, 
                      sep='\t',
                      usecols = using_col,
                      quotechar='"', 
                      low_memory=False)
else:
    print('Food file found')
    food_df = pd.read_csv(data_folder + foodfile, 
                      sep='\t',
                      low_memory=False)
    print('{} Food entries found'.format(len(food_df)))


Food file found
693846 Food entries found


In [79]:
print("The types of the data set are: \n", format(food_df.dtypes))
print ("The total size of the data set is:", format(food_df.shape) )
food_df.head(5)

The types of the data set are: 
 product_name                  object
generic_name                  object
quantity                      object
brands                        object
brands_tags                   object
categories                    object
categories_tags               object
categories_en                 object
manufacturing_places          object
manufacturing_places_tags     object
labels                        object
labels_tags                   object
labels_en                     object
purchase_places               object
countries                     object
countries_tags                object
countries_en                  object
ingredients_text              object
allergens                     object
allergens_en                  object
traces                        object
traces_tags                   object
traces_en                     object
nutrition_grade_uk           float64
nutrition_grade_fr            object
main_category                 object
main_

Unnamed: 0,product_name,generic_name,quantity,brands,brands_tags,categories,categories_tags,categories_en,manufacturing_places,manufacturing_places_tags,...,proteins_100g,salt_100g,sodium_100g,alcohol_100g,calcium_100g,iron_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g
0,Vitória crackers,,,,,,,,,,...,7.8,1.4,0.551181,,,,,,,
1,Cacao,,130 g,,,,,,,,...,,,,,,,,,,
2,Sauce Sweety chili 0%,,,,,,,,,,...,0.2,2.04,0.80315,,,,,,,
3,Mendiants,,,,,,,,,,...,,,,,,,,,,
4,Salade de carottes râpées,,,,,,,,,,...,0.9,0.42,0.165354,,,,,,,


# Cleaning Data

In [108]:
# remove rows where the columns we are interested in are all null
food_df = food_df.dropna(subset=using_col, how='all')
saveFoodDF(food_df)

Saved module: food


# Helper Functions

In [80]:
# Gets the first non null value from gibben collumns in priority order
def getValueWithPriorityColumns(input_row, merging_columns):
    for column in merging_columns:
        if pd.notnull(input_row[column]):
            return input_row[column]
    return input_row[merging_columns[0]]

# Merges from a input DF the desired columns into a result column
def mergeColumnsFromDF(input_df, desired_columns, result_column):
    if result_column in input_df.columns:
        return input_df

    input_df[result_column] = input_df.apply(
        lambda x: getValueWithPriorityColumns(x,desired_columns),
        axis = 1
    )
    for column in desired_columns:
        if column in input_df.columns:
            input_df = input_df.drop(column, axis=1)

    return input_df


In [81]:
# Translates a value and saves it to translations cache dict
def translateWithCache(value):
    global translations
    # search translated word in translations map
    if value in translations:
        # print("Cached  {} -> {}".format(value,translations[value]))
        return translations[value]
    else:
        try:
            print(1)
            time.sleep(TRANSLATION_DELAY) 
            print(2)
            trns_value = translator.translate(text=value, dest='en')
            if not trns_value is None:
                new_translation=trns_value.text.lower()
                print("Translating {} -> {}".format(value,new_translation))
                translations[value]=new_translation
                return new_translation
            else:
                print("None {} / {} / {}".format(value,type(value), e))
                
        except Exception as e:
            print("Exception {} / {} / {}".format(value,type(value), e))
    return value

In [89]:
translator.token_acquirer.acquire('èè')
translator.detect('pepe')
#translator.translate(text='Hello my friend', dest='es').text

IndexError: list index out of range

In [42]:
# Save to file translations dict
def saveTranslations():
    pd.DataFrame.from_dict(translations,orient="index").T.to_csv(data_folder + translationsfile,sep='\t',index=False)

# Save to file a dataframe with gibben name
def saveModuleDF(name,df):
    df.to_csv(data_folder + name + '.csv', sep='\t',index=False)
    print('Saved module: {}'.format(name))

# Save to file food dataframe 
def saveFoodDF(df):
    saveModuleDF('food',df)

# Looks for existing file saved and returns it if found or a food_df copy
def getModuleDF(name):
    file_name=data_folder + name + '.csv'
    file = Path(file_name)

    if not file.is_file():
        print('{} file not found'.format(name))
        return food_df.copy()
    else:
        print('{} file found'.format(name))
        df = pd.read_csv(file_name, 
                          sep='\t',
                          low_memory=False)
        print('{} {} entries found'.format(len(df),name))
        return df

    
    
# Removes special character, numbers, accents, sets to lower case and removes trailing spaces
def normalizeString(string):
    cleaned=''.join([i for i in string if (i.isalnum() & ~i.isdigit()) | i.isspace() ])
    return unidecode.unidecode(cleaned.lower().strip())

# Removes array without duplicate values
def removeDuplcates(array):
    newItems=[]
    for item in array:
        if item and item not in newItems:
            newItems.append(item)
    
    return newItems


In [17]:
# Shows progress during processing
def showProgress(size):
    global PROGRESS 
    PROGRESS += 1
    progress_value = int(10000*PROGRESS/size)/100
    if (progress_value*100)%1==0:
        sys.stdout.write('\r'+'Progress {}%'.format(progress_value))
        sys.stdout.flush()
        
# Show NaN percentage of column values in input df    
def showNanPercentage(df,desired_columns):
    for column in desired_columns:
        print("Percentage of NaN in {} is {:.2f}%".format(column,100*len(df[df[column].isna()])/len(df) ))
        

In [18]:
# formats and translate rows formated as a list of values
def formatAndTranslateRow(row, translateNoFormat):
    showProgress(allergen_notna_df.shape[0])

    if type(row) is not list and pd.notnull(row) :
        raw_data = row.lower().split(',')
        data = []
        for value in raw_data:
            value_ = normalizeString(value)
            # format <langage_code:info>
            if (':') in value_:
                info_ = value_.split(':')
                if len(info_) == 2:
                    # already in english
                    if info_[0] == 'en':
                        data.append(info_[1])
                    # translate to english
                    else:
                        data.append(translateWithCache(info_[1]))
                else:
                    data.append(info_)
                #print("Appending {}".format(info_))
                
            # no format, let's translate it
            else:
                if translateNoFormat:
                    data.append(translateWithCache(value_))
                else:
                    data.append(value_)
        #print(data)
        return data                       
    else:
        return row
    


## Fill in Missing Product Name <a id="product_name"></a>

This section deals with NaN values for `product_name`. If it does not have a `product_name`, the `generic_name` was used. If neither field was filled, then a combination of `brands` and `categories_en`/`main_category` was used.

In [132]:
desired_columns = [
    'product_name',
    'generic_name',
    'main_category',
    'main_category_en',
    'brands',
    'brands_tags',
    'categories',
    'categories_tags',
    'categories_en',
]

result_column='product_name_value'
showNanPercentage(food_df,desired_columns)

Percentage of NaN in product_name is 3.73%
Percentage of NaN in generic_name is 88.88%
Percentage of NaN in main_category is 74.23%
Percentage of NaN in main_category_en is 74.23%
Percentage of NaN in brands is 33.17%
Percentage of NaN in brands_tags is 33.18%
Percentage of NaN in categories is 74.21%
Percentage of NaN in categories_tags is 74.21%
Percentage of NaN in categories_en is 74.22%


In [135]:
df = getModuleDF(result_column)
FOUND_MODULE = result_column in df.columns
print("Found module: {}".format(FOUND_MODULE))

product_name_value file found
693846 product_name_value entries found
Found module: True


In [136]:
def get_name(row):
    showProgress(df.shape[0])
    if pd.isnull(row['product_name']):
        if pd.isnull(row['generic_name']):
            if pd.isnull(row['main_category_en']) & pd.isnull(row['categories_en']) & pd.isnull(row['brands']):
                return
            else:
                category_name = row['main_category_en']
                if pd.isnull(category_name):
                    category_name = row['categories_en']
                return "{} {}".format(row['brands'], category_name)
        else:
            return row['generic_name']
    else:
        return row['product_name']


In [137]:
if not FOUND_MODULE:
    PROGRESS=0
    
    df[result_column] = df.apply(
        lambda x: get_name(x),
        axis = 1
    )
    
    # removing the columns that we no longer need
    df = df.drop(desired_columns, axis=1)
    saveModuleDF(result_column,df)


In [138]:
print("Number of rows w/missing product_name after modifications: {}".format(len(df) - df[result_column].count()))

Number of rows w/missing product_name after modifications: 22458


As seen in the results, 22 458 rows still do not have names after our modifications. Our team decided that names were not of particular importance for our analysis, so we decided to leave these no-named items in the dataframe. The name is not important because we mostly want to analyze the ingredients of the items for each country. Thus the `labels`, `allergens`, and the numbers for sugar/sodium/calcium/etc and `countries` are the important columns.

## Fill in Missing Values for Country <a id="country"></a>
This section deals with the missing values for `countries_en`. The `countries_en` column represents the countries where the product is sold. This column is important for our analysis because we want to analyze how viable it is to live in each country based off one's dietary restrictions.

In order to fix these missing values, we decided to first fill the column with values from `purchase_places`, then `manufacturing_places`. We decided to use `purchase_places` because if it was purchased in a certain country, obviously it means the product is sold there as well. As for `manufacturing_places`, we assumed that it is most likely that a product manufactured in a country would be sold there as well.

Furthermore, we looked at the column `origins`, however this column is actually describing where each ingredient came from. This would not be helpful for us because the origin would not tell us about which countries actually sell/consume this specific item.

In [192]:
desired_columns = [
    'countries_en',
    'purchase_places',
    'manufacturing_places',
    'manufacturing_places_tags',
    'countries_tags',
    'countries'
]

result_column='country_name'
showNanPercentage(food_df,desired_columns)

Percentage of NaN in countries_en is 0.07%
Percentage of NaN in purchase_places is 85.51%
Percentage of NaN in manufacturing_places is 90.35%
Percentage of NaN in manufacturing_places_tags is 90.35%
Percentage of NaN in countries_tags is 0.07%
Percentage of NaN in countries is 0.07%


In [148]:
df_1 = getModuleDF(result_column)
FOUND_MODULE = result_column in df.columns
print("Found module: {}".format(FOUND_MODULE))

countries_values file not found
Found module: False


In [149]:
print("Number of rows w/missing countries_en: {}".format(len(df) - df['countries_en'].count()))

Number of rows w/missing countries_en: 459


In [150]:
df_1['countries_en'].count()

693387

In [155]:
def translate_country(row,size):
    showProgress(size)
    if pd.isnull(row['countries_en']):
        alt_country = None
        if pd.notna(row['purchase_places']):
            alt_country = row['purchase_places']
        elif pd.notna(row['manufacturing_places']):
            alt_country = row['manufacturing_places']
            
        # got value from purchase_places or manufacturing_places
        if (not alt_country is None) and pd.notna(alt_country):
            try:
                en_alt_country = translateWithCache(alt_country)
                if not en_alt_country is None:
                    return en_alt_country.text
            except Exception as e:
                return alt_country
            
        return alt_country
    else:
        return row['countries_en']

In [156]:
if not FOUND_MODULE:
    PROGRESS=0
    size = df_1.shape[0]
    df_1[result_column] = df_1.apply(lambda x: 
        translate_country(x,size),
        axis = 1
    )
    saveTranslations()
    

Progress 100.0%

In [160]:
print("Number of rows w/missing countries_en: {}".format(len(df_1) - df_1[result_column].count()))
print("Percentage of rows w/missing countries_en: {0:.3f}%".format(100*(len(df_1)-df_1[result_column].count())/len(df_1)))


Number of rows w/missing countries_en: 322
Percentage of rows w/missing countries_en: 0.046%


In [161]:
if not FOUND_MODULE:
    # removing the columns that we no longer need
    df_1 = df_1.drop(desired_columns, axis=1)

    # drop rows without country
    df_1 = df_1.dropna(subset=[result_column])

    #saveModuleDF(result_column,df_1)



In [162]:
print("Number of rows w/multiple countries: {}".format(len(df_1[df_1[result_column].str.contains(',')])))

print("Number of total rows: {}".format(len(df_1)))

Number of rows w/multiple countries: 28626
Number of total rows: 693524


In [163]:
if not FOUND_MODULE:
    # shows that some countries_en are lists
    df_1[df_1[result_column].notnull() & df_1[result_column].str.contains(',')][result_column].head()


    df_1[result_column] = df_1.apply(
        lambda x: [x.strip() for x in x[result_column].split(',')],
        axis = 1
    )
    #saveModuleDF(result_column,df_1)


    # [x.strip() for x in my_string.split(',')]
    

In [166]:
# shows that the countries has been properly split
df_1[df_1.index == 173][[result_column]]

Unnamed: 0,countries_values
173,"[France, United States]"


Next, this subsection deals with standardizing the countries for each product. First, we notice that some products have more than one country in their `countries_en` field. In this case, we seperate/explode each country in the `countries_en` field so that each country has its own row for that item. Next, we join the countries with their respective country code.

In [168]:
# map the countries_en to country codes
country_df = pd.read_csv(data_folder + countryfile, 
                         sep=',',
                         header=0,
                         usecols=['English short name lower case', 'Alpha-2 code'],
                         quotechar='"')
# rename columns
country_df.rename(columns={
    'Alpha-2 code':'country_code',
    'English short name lower case': result_column
    }, inplace=True)

country_df[result_column] = country_df.apply(
    lambda x: x[result_column].lower(),
    axis=1
)

country_df.head()

Unnamed: 0,countries_values,country_code
0,afghanistan,AF
1,åland islands,AX
2,albania,AL
3,algeria,DZ
4,american samoa,AS


In [169]:
def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()

    if (lens > 0).all():
        # ALL lists in cells aren't empty
        return pd.DataFrame({
            col:np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty
        return pd.DataFrame({
            col:np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
          .append(df.loc[lens==0, idx_cols]).fillna(fill_value) \
          .loc[:, df.columns]

In [170]:
df_3 = df_1.copy()

if not FOUND_MODULE:
    df_3 = explode(df_3,result_column)

In [172]:
# see how the explode function created another row because there were two countries for Lion Peanut x2
df_3[df_3['product_name'].notna() & df_3['product_name'].str.contains('Lion Peanut x2')]

Unnamed: 0,product_name,generic_name,quantity,brands,brands_tags,categories,categories_tags,categories_en,labels,labels_tags,...,salt_100g,sodium_100g,alcohol_100g,calcium_100g,iron_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,countries_values
173,Lion Peanut x2,,,,,,,,,,...,,,,,,,,,,France
174,Lion Peanut x2,,,,,,,,,,...,,,,,,,,,,United States


In [180]:
df_3[result_column] = df_3.apply(
    lambda x: x[result_column].lower(),
    axis = 1
)

In [176]:
# method to check the stats of the rows with a country_name but still without a country_code
def notAssigned(df_sample):
    not_assigned = df_sample[df_sample[result_column].notna() & df_sample['country_code'].isna()]

    print("Number of unassigned items is: {}".format(len(not_assigned)))
    print("The important values are: ")
    print(not_assigned[result_column].value_counts())

In [177]:
df_5 = df_3.copy()

if not FOUND_MODULE:
    df_5 = df_5.merge(country_df, how='left')

    notAssigned(df_5)

Number of unassigned items is: 4566
The important values are: 
russia                              2203
en                                   477
fr:deutschland                       229
taiwan                               227
vietnam                              107
de:allemagne                          92
ch:suisse                             82
european union                        80
south korea                           67
fr:schweiz                            49
republic of macedonia                 41
fr:frankreich                         35
categories completed                  34
brands completed                      34
product name completed                34
packaging completed                   33
characteristics completed             33
ingredients completed                 33
nutrition facts completed             33
quantity completed                    33
photos uploaded                       33
photos validated                      30
to be checked                      

We decided to fix the country codes with the highest frequency, since the importance/effect of fixing the lower values will decrease as we descend through the list.

In [178]:
# changing Russian Federation to russia
country_df[result_column][country_df['country_code'] == 'RU'] = 'russia'

# changing Korea, Republic of to south korea
country_df[result_column][country_df['country_code'] == 'KR'] = 'south korea'

# changing Macedonia, the former Yugoslav Republic of to republic of macedonia
country_df[result_column][country_df['country_code'] == 'MK'] = 'republic of macedonia'

# changing Taiwan, Province of China to taiwan
country_df[result_column][country_df['country_code'] == 'TW'] = 'taiwan'

# changing Viet Nam to vietnam
country_df[result_column][country_df['country_code'] == 'VN'] = 'vietnam'


From the above analysis of the unpaired countries, we see that a few countries are still in another language. Specifically, they are in the format "language:country". The method `parseTranslate` tries to deal with this issue.

In [185]:
# Parse and translate columns that are in the format "language:value"
def parseTranslate(x, target_columns,size):
    showProgress(size)
    for column in target_columns:    
        if (':') in x[column]:
            info_ = x[column].split(':')
            if len(info_) == 2:
                value = info_[1]
                return translateWithCache(value)
        return x[column]

In [186]:
if not FOUND_MODULE:
    PROGRESS=0
    size=df_5.shape[0]
    df_5[result_column] = df_5.apply(
        lambda x: parseTranslate(x,[result_column],size),
        axis = 1
    )
    saveTranslations()

Progress 100.0%

From our description of the countries still missing country codes, it is found that most of these countries do not have the full name as the one in the CSV file `country_df`. Thus, we try to find the `best_match` and change the `country_name` in the food dataframe to match the one in the `country_df`. We consider something a `best_match` if the `country_name` from the food dataframe is a substring of the `country_name` in the country dataframe.

In [None]:
df_5 = df_5.merge(country_df, how='left')

In [None]:
# Display result
df_5[['product_name',result_column,'country_code']].head()

In [187]:
def best_match(country_df, row):
    if pd.isnull(row['country_code']):
        countries = list(country_df[result_column])
        for country in countries:
            if row[result_column] in country:
                return country

    return row[result_column]

In [188]:
if not FOUND_MODULE:
    df_5[result_column] = df_5.apply(
        lambda x: best_match(country_df, x),
        axis = 1
    )
    df_5 = df_5.drop(['country_code'], axis=1)

    df_5 = df_5.merge(country_df, how='left')
    notAssigned(df_5)
    saveModuleDF(result_column,df_5)


Number of unassigned items is: 928
The important values are: 
suisse                                                                                         85
european union                                                                                 80
product name completed                                                                         34
brands completed                                                                               34
categories completed                                                                           34
nutrition facts completed                                                                      33
photos uploaded                                                                                33
characteristics completed                                                                      33
packaging completed                                                                            33
ingredients completed                                   

In [190]:
print("Number of rows with a country code: {}".format(len(df_5[df_5['country_code'].notna()])))
print("Number of total rows: {}".format(len(food_df)))

Number of rows with a country code: 724730
Number of total rows: 693846


The number of rows we can use (meaning the rows with a `country_code`) is higher than what we original started with because we made duplicates of some rows so that each country has its own instance of the item. An issue we ran into is that with the high number of translations we need to do, Google's API will eventually block our requests,thus some more rows might have actually been able to be paired up with a `country_code`. To try a walka

## Fill in Missing Nutrition Scores <a id="nutrition-scores"></a>

This section deals with NaN values for `nutrition score`.
Starting with the analysis let's show the percentage of nan values in the desired columns

In [200]:
desired_columns=[
    'nutrition_grade_uk',
    'nutrition_grade_fr',
    'nutrition-score-fr_100g',
    'nutrition-score-uk_100g'
]
result_column='nutrition_score'
showNanPercentage(food_df,desired_columns)

Percentage of NaN in nutrition_grade_uk is 100.00%
Percentage of NaN in nutrition_grade_fr is 79.79%
Percentage of NaN in nutrition-score-fr_100g is 79.79%
Percentage of NaN in nutrition-score-uk_100g is 79.79%


It's important to note that `nutrition_grade_uk` is always nan and that `nutrition_grade_fr`, `nutrition-score-fr_100g` and `nutrition-score-uk_100g` have exactly the same value. For this reason, the column used is `nutrition_grade_fr`. Nan values are not filled since for now, the nutrition score is going to be an additional indicator.

In [201]:
nutrition_df = food_df.copy()
nutrition_df[result_column]=nutrition_df[desired_columns[1]] #.fillna(UNKNOWN_STR)
nutrition_df = nutrition_df.drop(desired_columns, axis=1)

In [202]:
nutrition_df[result_column].head()

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: nutrition_score, dtype: object

## Fill in Missing Allergens <a id="allergens"></a>

This section deals with NaN values for `allergens`.
Starting with the analysis let's show the percentage of nan values in the desired columns

In [203]:
desired_columns=[
    'allergens_en',
    'allergens'
]
result_column='allergen_values'
showNanPercentage(food_df,desired_columns)

Percentage of NaN in allergens_en is 100.00%
Percentage of NaN in allergens is 90.07%


In [204]:
allergen_df = getModuleDF(result_column)
FOUND_MODULE = result_column in df.columns
print("Found module: {}".format(FOUND_MODULE))

allergen_values file not found
Found module: False


Looking at the result, both columns have different percentages, for some reason, values in `allergens_en` are urls so for further analysis only the `allergen` column is taken.

In [205]:
if not FOUND_MODULE:
    print(allergen_df[allergen_df[desired_columns[0]].notna()][desired_columns].head(5))
    allergen_df[result_column]=allergen_df[desired_columns[1]]
    allergen_df1 = allergen_df.drop(desired_columns, axis=1)


                                             allergens_en allergens
264437  https://static.openfoodfacts.org/images/produc...   Dairies
264467  https://static.openfoodfacts.org/images/produc...   Dairies
264504  https://static.openfoodfacts.org/images/produc...   Dairies
264510  https://static.openfoodfacts.org/images/produc...   Dairies
264521  https://static.openfoodfacts.org/images/produc...   Dairies


Looking at the allergens format:

In [206]:
allergen_notna_df=allergen_df1[allergen_df1[result_column].notna()].copy()
allergen_notna_df[result_column].head()

10                                   en:eggs,en:mustard
22    BLÉ, GLUTEN, BLE, FROMAGE, LAIT, LAIT, LAIT, L...
31            BLÉ, SEIGLE, BLÉ, SEIGLE, SAUMON, FROMAGE
39                                              FROMAGE
44     BLÉ, GLUTEN, BLE, BLE, ORGE, BLÉ, SÉSAME, SEIGLE
Name: allergen_values, dtype: object

Pre-processing of the `result column` by assuring values are strings lowercase before processing.
Using the helper function `formatAndTranslateRow`, allergen rows are going to be formated to an array and translated to english.

In [207]:
if not FOUND_MODULE:
    allergen_notna_df[result_column].apply(str)
    allergen_notna_df[result_column] = allergen_notna_df[result_column].apply(
        lambda x: x.lower()
    )
    PROGRESS=0
    allergen_notna_df[result_column] = allergen_notna_df[result_column].apply(
        lambda x: formatAndTranslateRow(x,True)
    )
    saveTranslations()

Progress 100.0%

In [208]:
print(allergen_notna_df.shape[0])
allergen_notna_df[allergen_notna_df[result_column].notna()][result_column].head(10)

68868


10                                   [eneggs, enmustard]
22     [became, gluten, became, cheese, milk, milk, m...
31            [became, rye, became, rye, salmon, cheese]
39                                              [cheese]
44     [became, gluten, became, became, barley, becam...
46                                    [eneggs, engluten]
282                                                [soy]
293                     [became, butter, eggs, hazelnut]
315                                   [almonds, almonds]
342                                     [milk, hazelnut]
Name: allergen_values, dtype: object

In [209]:
if not FOUND_MODULE:
    allergen_df1_=allergen_df1.rename(columns = {result_column:'old_values'})['old_values']
    allergen_df1[result_column]=pd.concat([allergen_df1_, allergen_notna_df], axis=1, join_axes=[allergen_df1.index])[result_column]           
    saveModuleDF(result_column,allergen_df1)
    

In [210]:
print(allergen_df1.shape)
allergen_df1[allergen_df1[result_column].notna()][result_column].head(10)

(693846, 45)


10                                   [eneggs, enmustard]
22     [became, gluten, became, cheese, milk, milk, m...
31            [became, rye, became, rye, salmon, cheese]
39                                              [cheese]
44     [became, gluten, became, became, barley, becam...
46                                    [eneggs, engluten]
282                                                [soy]
293                     [became, butter, eggs, hazelnut]
315                                   [almonds, almonds]
342                                     [milk, hazelnut]
Name: allergen_values, dtype: object

In [211]:
showNanPercentage(allergen_df1,[result_column])

Percentage of NaN in allergen_values is 90.07%


## Fill in Missing Traces <a id="traces"></a>

This section deals with NaN values for `traces`.

In [216]:
desired_columns=[
    'traces_en',
    'traces'
]
result_column='traces_values'
showNanPercentage(food_df,desired_columns)

Percentage of NaN in traces_en is 91.46%
Percentage of NaN in traces is 93.22%


Starting with the analysis let's show the percentage of nan values in the desired columns

In [217]:
traces_df = getModuleDF(result_column)
FOUND_MODULE = result_column in df.columns
print("Found module: {}".format(FOUND_MODULE))

traces_values file not found
Found module: False


In [218]:
if not FOUND_MODULE:
    traces_df1=mergeColumnsFromDF(traces_df, desired_columns, result_column)
    
traces_notna_df=traces_df1[traces_df1[result_column].notna()].copy()

In [220]:
showNanPercentage(traces_df1,[result_column])
traces_notna_df[result_column].head()

Percentage of NaN in traces_values is 91.46%


111                                            Eggs,Milk
129                                         Sesame seeds
220    Eggs,Gluten,Milk,Nuts,Soybeans,Oatmeal,Wheatflour
255    fr:contient-oeuf-lait-anchois-soya-ble-seigle-...
275    Soybeans,Sulphur dioxide and sulphites,fr:cont...
Name: traces_values, dtype: object

Translate

In [221]:
if not FOUND_MODULE:
    PROGRESS=0
    traces_notna_df[result_column] = traces_notna_df[result_column].apply(
        lambda x: formatAndTranslateRow(x,False)
    )
    saveTranslations()
    #saveModuleDF(result_column,traces_notna_df)

Progress 86.0%

In [222]:
print(traces_notna_df.shape)
traces_notna_df[traces_notna_df[result_column].notna()][result_column].head(10)

(59252, 45)


111                                         [eggs, milk]
129                                       [sesame seeds]
220    [eggs, gluten, milk, nuts, soybeans, oatmeal, ...
255    [frcontientoeuflaitanchoissoyableseigleorgemou...
275    [soybeans, sulphur dioxide and sulphites, frco...
286    [gluten, frpeutcontenirnoixvariessoyalaitoeufs...
293                       [nuts, sesame seeds, soybeans]
299    [celery, crustaceans, eggs, fish, gluten, milk...
300                                               [eggs]
306    [eggs, gluten, milk, mustard, nuts, sesame see...
Name: traces_values, dtype: object

In [223]:
if not FOUND_MODULE:
    traces_df1_=traces_df1.rename(columns = {result_column:'old_values'})['old_values']
    traces_df1[result_column]=pd.concat([traces_df1_, traces_df1], axis=1, join_axes=[traces_df1.index])[result_column]           
    saveModuleDF(result_column,traces_df1)
    

Saved module: traces_values


In [224]:
print(traces_df1.shape)
traces_df1[traces_df1[result_column].notna()][result_column].head(10)
showNanPercentage(traces_df1,[result_column])

(693846, 45)
Percentage of NaN in traces_values is 91.46%


## Fill/Clean Ingredients <a id="ingredients"></a>
This section deals with NaN values for `ingredients`.

In [19]:
desired_columns=[
    'ingredients_text'
]
result_column='ingredients_values'
showNanPercentage(food_df,desired_columns)

Percentage of NaN in ingredients_text is 43.30%


Starting with the analysis let's show the percentage of nan values in the desired columns

In [21]:
ingredients_df = getModuleDF(result_column)
FOUND_MODULE = result_column in ingredients_df.columns
print("Found module: {}".format(FOUND_MODULE))

ingredients_values file not found
Found module: False


Let's look at the format of each ingredient entry

In [22]:
ingredients_df[ingredients_df[desired_columns[0]].notna()][desired_columns].head(5)

Unnamed: 0,ingredients_text
10,"antioxydant : érythorbate de sodium, colorant ..."
15,"Lait entier, sucre, amidon de maïs, cacao, Aga..."
22,"baguette Poite vin Pain baguette 50,6%: fqrine..."
31,"Paln suédois 42,6%: farine de BLÉ, eau, farine..."
33,"Taboulé 76,2%, légumes 12%, huile de colza, se..."


In [23]:
ingredients_df=ingredients_df.rename(columns={desired_columns[0]:result_column})

In [47]:
def removeCommonWords(value):
    words=['long','grain','white','refined','concentrate','natural','dry','roasted','organic','bar','whole','rolled','seasoning','juice','extract']
    value_=value
    for word in words:
        if word in value_:
            value_=value_.replace(word,"")
    return normalizeString(value_)

def cleanIngredients(row):
    showProgress(ingredients_df.shape[0])
    if type(row) is not list and pd.notnull(row) :
        values=row.split(',')
        ingredients=[]
        for item in values:
            ingredient=item
            # format key:value
            # take only value
            if (':') in ingredient:
                info_ = ingredient.split(':')
                if len(info_) == 2:
                    ingredient=info_[1]
            # format value (info)
            # take only values
            if ('(') in ingredient:
                info_ = ingredient.split('(')
                ingredient=info_[0]
                
            # format value [value,...]
            # take all values
            if ('[') in ingredient:
                info_ = ingredient.split('[')
                if len(info_) >= 2:
                    ingredients.append(removeCommonWords(normalizeString(info_[0])))
                    ingredient=info_[1]
            # format value [value,...]
            # take all values
            if (' or ') in ingredient:
                info_ = ingredient.split('or')
                if len(info_) >= 2:
                    ingredients.append(removeCommonWords(normalizeString(info_[0])))
                    ingredient=info_[1]
            # format value - info
            # take only values
            if (' - ') in ingredient:
                info_ = ingredient.split('-')
                ingredient=info_[0]
                
            # avoid empty strings and removeCommonWords
            ingredient=removeCommonWords(normalizeString(ingredient))
            if ingredient:
                ingredient_trans=translateWithCache(ingredient) 
                ingredients.append(ingredient_trans)
        
        # remove duplicate elements
        ingredients_=removeDuplcates(ingredients)
        return ingredients_
    else:
        return row

In [48]:
ingredients_notna_df=ingredients_df[ingredients_df[result_column].notna()]
if not FOUND_MODULE:
    PROGRESS=0
    ingredients_notna_df[result_column] = ingredients_notna_df[result_column].apply(
        lambda x: cleanIngredients(x)
    )
    saveTranslations()
    saveModuleDF(result_column,ingredients_notna_df)


Progress 0.0%Exception erythorbate de sodium / <class 'str'> / list index out of range
Exception jaunes doeuf / <class 'str'> / list index out of range
Exception es de moutarde / <class 'str'> / list index out of range
Exception dextrose / <class 'str'> / list index out of range
Exception gomme de cellulose / <class 'str'> / list index out of range
Exception sorbate de potassium / <class 'str'> / list index out of range
Exception carotene / <class 'str'> / list index out of range
Exception arome / <class 'str'> / list index out of range
Progress 0.0%

KeyboardInterrupt: 

In [26]:
ingredients_notna_df[result_column][10]

"antioxydant : érythorbate de sodium, colorant : caramel - origine UE), tomate 33,3%, MAYONNAISE 11,1% (huile de colza 78,9%, eau, jaunes d'OEUF 6%, vinaigre, MOUTARDE [eau, graines de MOUTARDE, sel, vinaigre, curcuma], sel, dextrose, stabilisateur : gomme de cellulose, conservateur : sorbate de potassium, colorant : ?-carotène, arôme)"

## Fill/Clean Labels <a id="labels_column"></a>

## Clean float64 Columns <a id="float64_col"></a>
Karen's part

# Data Visualization & Analysis <a id="data_analysis"></a>
TO BE COMPLETED FOR MILESTONE 3

In [43]:
saveTranslations()

## Maps

## Correlations b/w Neighbouring Countries <a id="correlation_neighbours"></a>