### Get Product Data from Open Food Facts API

We want to extract the following fields for each product: product_name, brands, categories, code (the unique product ID), and nutrition_grades_tags.
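As a rough sketch of that extraction step, the snippet below keeps only the five fields of interest from a list of product dictionaries. The helper name `extract_fields` and the sample records are made up for illustration; real records come from the API call shown later and carry many more keys.

```python
import pandas as pd

# The five fields named above.
FIELDS = ["product_name", "brands", "categories", "code", "nutrition_grades_tags"]


def extract_fields(products, fields=FIELDS):
    """Keep only the requested keys of each product; missing keys become None."""
    return [{field: product.get(field) for field in fields} for product in products]


# Made-up records standing in for API results (assumption, not real data).
sample_products = [
    {
        "code": "0001",
        "product_name": "Example Bar",
        "brands": "Acme",
        "categories": "Snacks",
        "nutrition_grades_tags": ["c"],
        "extra_key": 1,  # dropped by extract_fields
    },
    {"code": "0002", "product_name": "Example Chips"},  # incomplete record
]

# Build a table indexed by the product code.
subset = pd.DataFrame(extract_fields(sample_products)).set_index("code")
print(subset)
```

Using `dict.get` means an incomplete record yields a missing value instead of raising a `KeyError`.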

## Data Wrangling

### Import libraries

In [3]:
import requests
import pprint
import pandas as pd

In [26]:

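# function to pull data from Open Food Facts
def fetch_products(category="snacks", page=1):
    """Return the products on one page of a category, or None on failure."""
    url = f"https://world.openfoodfacts.org/category/{category}/{page}.json"
    print(url)
    try:
        # the timeout keeps the call from hanging on an unresponsive server
        response = requests.get(url, timeout=30)
        print(response.status_code)
        # parse the JSON body and return its list of products
        data = response.json()
        return data["products"]
    except (requests.RequestException, ValueError, KeyError):
        # network failure, invalid JSON, or a response without "products"
        return None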


# get products
products = fetch_products("snacks")
print(list(products[0].keys()))

https://world.openfoodfacts.org/category/snacks/1.json
200


Many column names end with a two-letter language tag such as '_fr'. Let's keep only the English ones, which end with '_en'.

In [38]:
# get all columns whose names end with '_xx' (an underscore plus two characters)
lang = df.filter(regex='_..$', axis=1).columns

# collect the ones that are not English, so they can be dropped
lang_non_en = [col for col in lang if '_en' not in col]

lang_non_en

['allergens_lc',
 'brands_lc',
 'categories_lc',
 'countries_lc',
 'generic_name_ar',
 'generic_name_es',
 'generic_name_fr',
 'generic_name_uk',
 'ingredients_lc',
 'ingredients_text_ar',
 'ingredients_text_es',
 'ingredients_text_fr',
 'ingredients_text_uk',
 'ingredients_text_with_allergens_ar',
 'ingredients_text_with_allergens_es',
 'ingredients_text_with_allergens_fr',
 'ingredients_text_with_allergens_uk',
 'labels_lc',
 'last_modified_by',
 'nutrition_grade_fr',
 'origin_ar',
 'origin_es',
 'origin_fr',
 'origin_uk',
 'origins_lc',
 'packaging_lc',
 'packaging_text_ar',
 'packaging_text_es',
 'packaging_text_fr',
 'packaging_text_uk',
 'product_name_ar',
 'product_name_es',
 'product_name_fr',
 'product_name_uk',
 'traces_lc',
 'generic_name_de',
 'generic_name_fi',
 'generic_name_it',
 'generic_name_ja',
 'generic_name_nb',
 'generic_name_nl',
 'generic_name_pl',
 'generic_name_sv',
 'ingredients_text_de',
 'ingredients_text_fi',
 'ingredients_text_it',
 'ingredients_text_ja',
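The list above also includes names such as `last_modified_by` and the `_lc` columns, which match the `_..$` pattern without being translated text fields. A stricter filter could compare the suffix against an explicit set of language codes. The sketch below assumes such a set (`LANG_CODES`, hand-written here from the suffixes visible above) and a hypothetical helper `non_english_columns`; it is one possible approach, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical allow-list of language codes; extend it to match the data at hand.
LANG_CODES = {"ar", "de", "en", "es", "fi", "fr", "it", "ja", "nb", "nl", "pl", "sv", "uk"}


def non_english_columns(columns, lang_codes=LANG_CODES):
    """Return the column names whose suffix is a known language code other than 'en'."""
    result = []
    for col in columns:
        # split "product_name_fr" into ("product_name", "_", "fr")
        base, sep, suffix = col.rpartition("_")
        if sep and base and suffix in lang_codes and suffix != "en":
            result.append(col)
    return result


# Empty demo frame; only the column names matter here.
demo = pd.DataFrame(
    columns=["product_name_en", "product_name_fr", "brands_lc", "last_modified_by", "generic_name_de"]
)
non_english = non_english_columns(demo.columns)
print(non_english)
```

A suffix check like this still flags names such as `nutrition_grade_fr`, so the resulting list is worth reviewing before dropping anything.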

**products** is a list of dictionaries, one per product. We convert it into a pandas DataFrame.

In [11]:
# convert to dataframe
df = pd.DataFrame(products)
# use the unique product id stored in '_id' as the index
df.set_index('_id', inplace=True)
print(df.shape)
print(df.columns)

(20, 458)
Index(['_keywords', 'added_countries_tags', 'additives_n',
       'additives_original_tags', 'additives_tags', 'allergens',
       'allergens_from_ingredients', 'allergens_from_user',
       'allergens_hierarchy', 'allergens_lc',
       ...
       'owners_tags', 'packaging_imported', 'producer_version_id',
       'producer_version_id_imported', 'product_name_fr_imported',
       'quantity_imported', 'serving_size_imported', 'sources_fields',
       'traces_imported', 'specific_ingredients'],
      dtype='object', length=458)


In [25]:
# fraction of missing values in each column
columns_comp = df.isnull().sum()/df.shape[0]

# select the columns in which less than 90 percent of the values are missing
columns_incomplete = columns_comp[columns_comp<0.9]
columns_incomplete.shape

(335,)

Because columns_comp holds the fraction of missing values, this selects the 335 columns in which less than 90% of the values are missing. Columns that are mostly empty can be removed, since imputing such features would introduce errors into the model.
Before we throw any columns away, we can check whether they contain anything important that we could handle by imputing the data.

Index(['allergens_lc', 'brands_lc', 'categories_lc', 'countries_lc',
       'generic_name_ar', 'generic_name_en', 'generic_name_es',
       'generic_name_fr', 'generic_name_uk', 'ingredients_lc',
       ...
       'packaging_text_sk', 'packaging_text_sl', 'product_name_bg',
       'product_name_dz', 'product_name_sk', 'product_name_sl',
       'abbreviated_product_name_fr', 'conservation_conditions_fr',
       'customer_service_fr', 'producer_version_id'],
      dtype='object', length=165)
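The missing-value check is easy to get backwards, because the fraction of missing values and the fraction of complete values are complements of each other. The sketch below computes both on a small made-up table and drops the columns that fall below a completeness threshold. The names `demo`, `completeness`, and `sparse_columns` and the 0.9 threshold are illustrative assumptions.

```python
import pandas as pd

# Small made-up table: one complete column, one half-empty, one fully empty.
demo = pd.DataFrame(
    {
        "full": [1, 2, 3, 4],
        "half": [1, None, 3, None],
        "empty": [None, None, None, None],
    }
)

# Fraction of missing values per column (mean of the boolean mask).
missing_fraction = demo.isna().mean()

# Completeness is the complement of the missing fraction.
completeness = 1 - missing_fraction

# Columns that are less than 90% complete.
threshold = 0.9
sparse_columns = completeness[completeness < threshold].index.tolist()

# Drop the sparse columns and keep the rest.
kept = demo.drop(columns=sparse_columns)
print(sparse_columns)
print(kept.columns.tolist())
```

Naming the series after what it measures (`missing_fraction` versus `completeness`) makes it harder to apply the threshold in the wrong direction.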