# USDA Food Data - Preliminary Analysis

The USDA food data comes from a consolidated dataset published by the Open Food Facts organization (https://world.openfoodfacts.org/) and made available on Kaggle (https://www.kaggle.com/openfoodfacts/world-food-facts).

Open Food Facts is a free, open, collaborative database of food products from around the world, with ingredients, allergens, nutrition facts and all the tidbits of information we can find on product labels (source: https://www.kaggle.com/openfoodfacts/world-food-facts).

The data file can be downloaded from https://www.kaggle.com/openfoodfacts/world-food-facts/downloads/en.openfoodfacts.org.products.tsv

For this analysis we look only at the USDA records (those whose `creator` field is `usda-ndb-import`) and exclude data sourced from other countries, because the USDA records appear to be the most completely populated part of the dataset.

## Loading the data

In [2]:
# load pre-requisite imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [67]:
# load world food data into a pandas dataframe
world_food_facts = pd.read_csv("./data/en.openfoodfacts.org.products.tsv", sep='\t', low_memory=False)

# extract USDA data from world data
usda_import = world_food_facts[world_food_facts.creator=="usda-ndb-import"]

# save the usda data to a csv file
usda_import.to_csv("./data/usda_imports_v2.csv")

## Preliminary look at the USDA data

In [10]:
# Examining available fields
print("Number of records:", len(usda_import))
print("Number of columns:", len(usda_import.columns))

print("\nField Names:")
list(usda_import)

Number of records: 169868
Number of columns: 162

Field Names:


['code',
 'url',
 'creator',
 'created_t',
 'created_datetime',
 'last_modified_t',
 'last_modified_datetime',
 'product_name',
 'generic_name',
 'quantity',
 'packaging',
 'packaging_tags',
 'brands',
 'brands_tags',
 'categories',
 'categories_tags',
 'categories_en',
 'origins',
 'origins_tags',
 'manufacturing_places',
 'manufacturing_places_tags',
 'labels',
 'labels_tags',
 'labels_en',
 'emb_codes',
 'emb_codes_tags',
 'first_packaging_code_geo',
 'cities',
 'cities_tags',
 'purchase_places',
 'stores',
 'countries',
 'countries_tags',
 'countries_en',
 'ingredients_text',
 'allergens',
 'allergens_en',
 'traces',
 'traces_tags',
 'traces_en',
 'serving_size',
 'no_nutriments',
 'additives_n',
 'additives',
 'additives_tags',
 'additives_en',
 'ingredients_from_palm_oil_n',
 'ingredients_from_palm_oil',
 'ingredients_from_palm_oil_tags',
 'ingredients_that_may_be_from_palm_oil_n',
 'ingredients_that_may_be_from_palm_oil',
 'ingredients_that_may_be_from_palm_oil_tags',
 'nutritio

## Quick look at a few of the rows

Each row is one product and has a separate field for each nutrient (for example `ph_100g` or `cocoa_100g`). Only the fields for which a product has a valid value are populated; the others are left empty (NaN).

In [13]:
usda_import.head(5)

| | code | url | creator | created_t | created_datetime | last_modified_t | last_modified_datetime | product_name | generic_name | quantity | ... | ph_100g | fruits-vegetables-nuts_100g | collagen-meat-protein-ratio_100g | cocoa_100g | chlorophyl_100g | carbon-footprint_100g | nutrition-score-fr_100g | nutrition-score-uk_100g | glycemic-index_100g | water-hardness_100g |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4530 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069957 | 2017-03-09T14:32:37Z | 1489069957 | 2017-03-09T14:32:37Z | Banana Chips Sweetened (Whole) | | | ... | | | | | | | 14.0 | 14.0 | | |
| 2 | 4559 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069957 | 2017-03-09T14:32:37Z | 1489069957 | 2017-03-09T14:32:37Z | Peanuts | | | ... | | | | | | | 0.0 | 0.0 | | |
| 3 | 16087 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489055731 | 2017-03-09T10:35:31Z | 1489055731 | 2017-03-09T10:35:31Z | Organic Salted Nut Mix | | | ... | | | | | | | 12.0 | 12.0 | | |
| 4 | 16094 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489055653 | 2017-03-09T10:34:13Z | 1489055653 | 2017-03-09T10:34:13Z | Organic Polenta | | | ... | | | | | | | | | | |
| 5 | 16100 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489055651 | 2017-03-09T10:34:11Z | 1489055651 | 2017-03-09T10:34:11Z | Breadshop Honey Gone Nuts Granola | | | ... | | | | | | | | | | |
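
To see which fields a single product actually reports, one option is to drop the empty entries from its row. The following is a minimal sketch on a toy frame that reuses two product names and a few column names from the output above; the helper name `populated_fields` and the frame `toy` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for usda_import (assumption): two products, three nutrient columns.
toy = pd.DataFrame(
    {
        "product_name": ["Banana Chips Sweetened (Whole)", "Organic Polenta"],
        "cocoa_100g": [np.nan, np.nan],
        "nutrition-score-fr_100g": [14.0, np.nan],
        "nutrition-score-uk_100g": [14.0, np.nan],
    }
)


def populated_fields(row):
    """Return only the fields of a row that hold a non-NaN value."""
    return row.dropna()


# First product: name plus the two nutrition scores survive.
first = populated_fields(toy.iloc[0])
print(first)
```

On the real data the same idea would presumably be applied as `populated_fields(usda_import.iloc[0])`.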


## Quick look at ingredients

Unlike nutrients, ingredients are not broken down into separate fields. Instead, all of a product's ingredients are stored together as a single line of text in `ingredients_text`.

In [15]:
usda_import['ingredients_text'].head(5)

1    Bananas, vegetable oil (coconut oil, corn oil ...
2    Peanuts, wheat flour, sugar, rice flour, tapio...
3    Organic hazelnuts, organic cashews, organic wa...
4                                      Organic polenta
5    Rolled oats, grape concentrate, expeller press...
Name: ingredients_text, dtype: object

In [66]:
# Extracting ingredients for the first three products
import re

pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 600)

for x in range(3):
    # split on commas and parentheses; the raw string avoids invalid-escape warnings
    ingredients = re.split(r',|\(|\)', usda_import['ingredients_text'].iloc[x])
    # trim whitespace and join multi-word ingredients with hyphens
    ingredients = [w.strip().replace(' ', '-') for w in ingredients]
    print(' '.join(ingredients))

Bananas vegetable-oil coconut-oil corn-oil-and/or-palm-oil sugar natural-banana-flavor.
Peanuts wheat-flour sugar rice-flour tapioca-starch salt leavening ammonium-bicarbonate baking-soda  soy-sauce water soybeans wheat salt  potato-starch.
Organic-hazelnuts organic-cashews organic-walnuts-almonds organic-sunflower-oil sea-salt.
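
The output above still carries a trailing period and double spaces where the split produced empty pieces. A possible refinement, sketched below, returns a clean list of tokens instead of a printed string. The function name `tokenize_ingredients`, the lowercasing step, and the sample string are assumptions made for illustration; they are not taken from the original notebook.

```python
import re


def tokenize_ingredients(text):
    """Split an ingredient string into normalized tokens.

    Splits on commas and parentheses, trims whitespace and a trailing
    period, lowercases, joins multi-word ingredients with hyphens, and
    drops empty pieces.
    """
    tokens = []
    for piece in re.split(r"[,()]", text):
        cleaned = piece.strip().rstrip(".").strip().lower()
        if cleaned:
            tokens.append(re.sub(r"\s+", "-", cleaned))
    return tokens


# Made-up sample string in the same style as the ingredient text above.
sample = "Peanuts, salt, leavening (baking soda), soy sauce (water, soybeans)."
print(tokenize_ingredients(sample))
```

Returning a list rather than a joined string keeps each ingredient addressable, which is convenient if the tokens are later counted or vectorized.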


## Cleaning up the dataset

We now look at the available data in the dataset and look for possible issues with the data that could impact our analysis.

Notice that many columns are not fully populated: most products report only a subset of the available nutrition fields.

Going by the counts below, we can limit the analysis to columns with more than 100,000 non-NaN values, so that we avoid working with columns that are not sufficiently populated.

In [56]:
# Looking for columns that are not sufficiently populated

# display count of all rows
print("Total rows in USDA dataset are:",len(usda_import))

# display count of all non-NaN entries in each column
print("\nCount of non-NaN values in each column")

usda_import.count().sort_values(ascending=False)

Total rows in USDA dataset are: 169868

Count of non-NaN values in each column


code                                          169868
states_en                                     169868
countries                                     169868
url                                           169868
creator                                       169868
created_t                                     169868
created_datetime                              169868
states_tags                                   169868
last_modified_t                               169868
last_modified_datetime                        169868
countries_tags                                169868
countries_en                                  169868
states                                        169868
additives_n                                   169867
ingredients_text                              169867
ingredients_from_palm_oil_n                   169867
ingredients_that_may_be_from_palm_oil_n       169867
serving_size                                  169866
additives                                     
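
The 100,000-value cutoff described in this section could be applied along the following lines. This is a sketch on a small toy frame with a correspondingly small threshold; the helper `well_populated_columns` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def well_populated_columns(df, min_count):
    """Keep only the columns with more than min_count non-NaN values."""
    counts = df.count()  # non-NaN count per column
    keep = counts[counts > min_count].index
    return df.loc[:, keep]


# Toy stand-in for usda_import (assumption): 4 rows, columns with 4, 3 and 1 values.
toy = pd.DataFrame(
    {
        "code": [1, 2, 3, 4],
        "ingredients_text": ["a", "b", "c", np.nan],
        "cocoa_100g": [np.nan, np.nan, np.nan, 0.5],
    }
)

filtered = well_populated_columns(toy, min_count=2)
print(list(filtered.columns))  # ['code', 'ingredients_text']
```

On the full dataset the call would presumably be `well_populated_columns(usda_import, 100000)`.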

## Looking for similar products based on ingredients

This section uses item similarity to find similar products based on the ingredients they contain. We vectorize each product's ingredients and compare the resulting vectors to find similar items.
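
As a rough illustration of the idea, the sketch below treats each product as a bag of ingredient tokens and ranks the other products by cosine similarity. The text here does not say which vectorizer or similarity measure is used, so count vectors and cosine similarity are assumptions, as are the function names. The token lists are the lowercased tokens printed for the first three products earlier.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, products):
    """Rank all other products (name -> tokens) by similarity to `query`."""
    scores = {
        name: cosine_similarity(products[query], tokens)
        for name, tokens in products.items()
        if name != query
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


products = {
    "Banana Chips Sweetened (Whole)": [
        "bananas", "vegetable-oil", "coconut-oil",
        "corn-oil-and/or-palm-oil", "sugar", "natural-banana-flavor",
    ],
    "Peanuts": [
        "peanuts", "wheat-flour", "sugar", "rice-flour", "tapioca-starch",
        "salt", "leavening", "ammonium-bicarbonate", "baking-soda",
        "soy-sauce", "water", "soybeans", "wheat", "salt", "potato-starch",
    ],
    "Organic Salted Nut Mix": [
        "organic-hazelnuts", "organic-cashews", "organic-walnuts-almonds",
        "organic-sunflower-oil", "sea-salt",
    ],
}

ranked = most_similar("Banana Chips Sweetened (Whole)", products)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With only three products the ranking is driven by a single shared token (`sugar`), so this is mainly useful for checking the mechanics before scaling up to the 1,000-row sample.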

In [74]:
# load the subsample USDA data
usda_sample_data = pd.read_csv("./data/usda_imports_v2_1000_hdr.csv", sep=',', low_memory=False)

In [75]:
list(usda_sample_data)

['code',
 'url',
 'creator',
 'created_t',
 'created_datetime',
 'last_modified_t',
 'last_modified_datetime',
 'product_name',
 'generic_name',
 'quantity',
 'packaging',
 'packaging_tags',
 'brands',
 'brands_tags',
 'categories',
 'categories_tags',
 'categories_en',
 'origins',
 'origins_tags',
 'manufacturing_places',
 'manufacturing_places_tags',
 'labels',
 'labels_tags',
 'labels_en',
 'emb_codes',
 'emb_codes_tags',
 'first_packaging_code_geo',
 'cities',
 'cities_tags',
 'purchase_places',
 'stores',
 'countries',
 'countries_tags',
 'countries_en',
 'ingredients_text',
 'allergens',
 'allergens_en',
 'traces',
 'traces_tags',
 'traces_en',
 'serving_size',
 'no_nutriments',
 'additives_n',
 'additives',
 'additives_tags',
 'additives_en',
 'ingredients_from_palm_oil_n',
 'ingredients_from_palm_oil',
 'ingredients_from_palm_oil_tags',
 'ingredients_that_may_be_from_palm_oil_n',
 'ingredients_that_may_be_from_palm_oil',
 'ingredients_that_may_be_from_palm_oil_tags',
 'nutritio

In [76]:
len(usda_sample_data)

1000