# Branded food data frame analysis

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import os

In [2]:
filepath = lambda x: os.path.join('data', x)

In [3]:
branded_food = pd.read_csv(filepath('branded_food.csv'), dtype={
    'brand_owner':str,
    'brand_name':str,
    'subbrand_name':str,
    'gtin_upc':str})
branded_food.head()


Unnamed: 0,fdc_id,brand_owner,brand_name,subbrand_name,gtin_upc,ingredients,not_a_significant_source_of,serving_size,serving_size_unit,household_serving_fulltext,branded_food_category,data_source,package_weight,modified_date,available_date,market_country,discontinued_date,preparation_state_code,trade_channel,short_description
0,1105904,Richardson Oilseed Products (US) Limited,,,27000612323,Vegetable Oil,,15.0,ml,,Oils Edible,GDSN,,2020-10-02,2020-11-13,United States,,,,
1,1105905,CAMPBELL SOUP COMPANY,,,51000198808,"INGREDIENTS: BEEF STOCK, CONTAINS LESS THAN 2%...",,240.0,ml,,Herbs/Spices/Extracts,GDSN,,2020-09-12,2020-11-13,United States,,,,
2,1105906,CAMPBELL SOUP COMPANY,,,51000213273,"INGREDIENTS: CLAM STOCK, POTATOES, CLAMS, CREA...",,440.0,g,,Prepared Soups,GDSN,,2020-09-01,2020-11-13,United States,,,,
3,1105907,CAMPBELL SOUP COMPANY,,,51000213303,"INGREDIENTS: WATER, CREAM, BROCCOLI, CELERY, V...",,440.0,g,,Prepared Soups,GDSN,,2020-09-01,2020-11-13,United States,,,,
4,1105908,CAMPBELL SOUP COMPANY,,,51000224637,"INGREDIENTS: CHICKEN STOCK, CONTAINS LESS THAN...",,240.0,ml,,Herbs/Spices/Extracts,GDSN,,2020-10-03,2020-11-13,United States,,,,


The initial inspection of the dataset covers:

- Missingness/null values
- Incorrect dtypes
- Reducing columns/memory where possible
- Validation of data (searching for outliers and correcting incorrect values)
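
As a rough illustration of the memory item in the checklist above, the sketch below converts low-cardinality string columns to the `category` dtype and compares memory use before and after. The `toy` frame is a made-up stand-in, not the real `branded_food` data, and which columns are worth converting is an assumption to verify against the actual file.

```python
import pandas as pd

# Toy stand-in for branded_food; the rows are invented for illustration only.
toy = pd.DataFrame({
    'data_source': ['GDSN', 'LI', 'GDSN', 'LI'] * 1000,
    'market_country': ['United States'] * 4000,
})

# deep=True counts the bytes of the Python strings, not just the pointers.
before = toy.memory_usage(deep=True).sum()

# Columns with few distinct values are stored more compactly as categoricals.
compact = toy.astype({'data_source': 'category', 'market_country': 'category'})
after = compact.memory_usage(deep=True).sum()

print(f'{before:,} bytes -> {after:,} bytes')
```

The same `astype` call could be applied to `branded_food` once the candidate columns have been checked with `nunique()`.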

## Initial Missingness check

In [4]:
branded_food.isnull().mean()

fdc_id                         0.000000
brand_owner                    0.007813
brand_name                     0.296762
subbrand_name                  0.952827
gtin_upc                       0.000000
ingredients                    0.002923
not_a_significant_source_of    0.960171
serving_size                   0.005842
serving_size_unit              0.010312
household_serving_fulltext     0.585086
branded_food_category          0.005729
data_source                    0.000000
package_weight                 0.617388
modified_date                  0.000011
available_date                 0.000000
market_country                 0.000000
discontinued_date              1.000000
preparation_state_code         0.978783
trade_channel                  0.991557
short_description              0.978720
dtype: float64

Several columns are almost entirely empty: discontinued_date (100% missing), trade_channel (99%), preparation_state_code and short_description (98% each), not_a_significant_source_of (96%), and subbrand_name (95%). We will look through the non-null examples in these columns and drop the columns if they turn out not to be useful, to save memory.
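
If some of these columns do end up being dropped, one way to do it is by a missingness cutoff. This is a minimal sketch on a made-up frame; the `threshold` value of 0.95 is a hypothetical choice, not one taken from this analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for branded_food; the values are invented for illustration.
toy = pd.DataFrame({
    'fdc_id': [1, 2, 3, 4],
    'brand_name': ['A', None, 'C', 'D'],
    'discontinued_date': [np.nan, np.nan, np.nan, np.nan],
})

threshold = 0.95  # hypothetical cutoff: drop columns missing in more than 95% of rows

# Share of null values per column, the same quantity as isnull().mean() above.
null_share = toy.isnull().mean()
mostly_empty = null_share[null_share > threshold].index.tolist()

trimmed = toy.drop(columns=mostly_empty)
print(mostly_empty)
```

A fixed cutoff is blunt: a column that is 98% empty may still be worth keeping if the non-null rows carry information that is unavailable elsewhere, which is why the non-null examples are inspected first.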

In [5]:
has_prep_code = branded_food[branded_food.preparation_state_code.notnull()]
has_prep_code.head()

Unnamed: 0,fdc_id,brand_owner,brand_name,subbrand_name,gtin_upc,ingredients,not_a_significant_source_of,serving_size,serving_size_unit,household_serving_fulltext,branded_food_category,data_source,package_weight,modified_date,available_date,market_country,discontinued_date,preparation_state_code,trade_channel,short_description
1549669,2219410,Cargill Incorporated/Honeysuckle White,HONEYSUCKLE WHITE,,642205546077,"Turkey, Natural Flavoring",,112.0,g,4 oz.,Meat/Poultry/Other Animals Unprepared/Unproce...,GDSN,1 LBR,2019-03-07,2022-02-10,United States,,UNPREPARED,,HSW Fh 93% Grd Tky Chub 12/1
1549670,2219411,Cargill Incorporated/Honeysuckle White,Honeysuckle White,,642205534517,"All Natural White Turkey, Natural Flavoring",,112.0,g,4 oz.,Meat/Poultry/Other Animals Unprepared/Unproce...,GDSN,1.25 LBR,2020-02-04,2022-02-10,United States,,UNPREPARED,,HSW Fh Gr WhtDry Ex Wt 6/1.25#
1549671,2219412,Cargill Incorporated/Honeysuckle White,HONEYSUCKLE WHITE,,642205534500,"All Natural Turkey, Natural Flavoing",,112.0,g,4 oz.,Meat/Poultry/Other Animals Unprepared/Unproce...,GDSN,1.25 LBR,2020-02-05,2022-02-10,United States,,UNPREPARED,,HSW Fh 85/15 Gr tky Ex Wt 6/1.25
1549672,2219413,Kellogg Company US,Kellogg's Pop-Tarts,,38000317101,"Enriched flour (wheat flour, niacin, reduced i...",,52.0,g,1 Pastry,Sweet Bakery Products,GDSN,14.7 ONZ,2019-04-09,2022-02-10,United States,,UNPREPARED,,Pop-Tarts
1549673,2219414,Kellogg Company US,Kellogg's Cheez It,,24100105236,"Enriched flour (wheat flour, niacin, reduced i...",,25.0,g,1 Pouch,Biscuits/Cookies,GDSN,12.6 ONZ,2019-04-30,2022-02-10,United States,,UNPREPARED,,Gripz Crackers


In [6]:
has_prep_code.preparation_state_code.value_counts()

UNPREPARED        26631
PREPARED           5676
READY_TO_EAT       2830
READY_TO_DRINK     2677
BAKE                603
HEAT_AND_SERVE      272
THAW                147
FREEZE               97
GRILL                95
CONVECTION           36
UNSPECIFIED          22
FRY                  15
STEAM                12
DEEP_FRY             12
ROAST                11
BOIL                  7
MICROWAVE             4
STIR_FRY              4
Name: preparation_state_code, dtype: int64

These codes mostly describe how the food should be prepared for consumption, and the majority of rows are simply labeled UNPREPARED or PREPARED. The codes are also ambiguous: UNPREPARED includes foods that would be considered ready to eat, such as Pop-Tarts or Cheez-Its. Because of this, it may be hard to sort foods cleanly into "prepared" and "unprepared" categories.

In [11]:
has_prep_code[has_prep_code.preparation_state_code == 'UNPREPARED'].head(5)

Unnamed: 0,fdc_id,brand_owner,brand_name,subbrand_name,gtin_upc,ingredients,not_a_significant_source_of,serving_size,serving_size_unit,household_serving_fulltext,branded_food_category,data_source,package_weight,modified_date,available_date,market_country,discontinued_date,preparation_state_code,trade_channel,short_description
1549669,2219410,Cargill Incorporated/Honeysuckle White,HONEYSUCKLE WHITE,,642205546077,"Turkey, Natural Flavoring",,112.0,g,4 oz.,Meat/Poultry/Other Animals Unprepared/Unproce...,GDSN,1 LBR,2019-03-07,2022-02-10,United States,,UNPREPARED,,HSW Fh 93% Grd Tky Chub 12/1
1549670,2219411,Cargill Incorporated/Honeysuckle White,Honeysuckle White,,642205534517,"All Natural White Turkey, Natural Flavoring",,112.0,g,4 oz.,Meat/Poultry/Other Animals Unprepared/Unproce...,GDSN,1.25 LBR,2020-02-04,2022-02-10,United States,,UNPREPARED,,HSW Fh Gr WhtDry Ex Wt 6/1.25#
1549671,2219412,Cargill Incorporated/Honeysuckle White,HONEYSUCKLE WHITE,,642205534500,"All Natural Turkey, Natural Flavoing",,112.0,g,4 oz.,Meat/Poultry/Other Animals Unprepared/Unproce...,GDSN,1.25 LBR,2020-02-05,2022-02-10,United States,,UNPREPARED,,HSW Fh 85/15 Gr tky Ex Wt 6/1.25
1549672,2219413,Kellogg Company US,Kellogg's Pop-Tarts,,38000317101,"Enriched flour (wheat flour, niacin, reduced i...",,52.0,g,1 Pastry,Sweet Bakery Products,GDSN,14.7 ONZ,2019-04-09,2022-02-10,United States,,UNPREPARED,,Pop-Tarts
1549673,2219414,Kellogg Company US,Kellogg's Cheez It,,24100105236,"Enriched flour (wheat flour, niacin, reduced i...",,25.0,g,1 Pouch,Biscuits/Cookies,GDSN,12.6 ONZ,2019-04-30,2022-02-10,United States,,UNPREPARED,,Gripz Crackers
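
One way to see how the preparation codes overlap with food categories is a cross-tabulation. The sketch below uses a small made-up frame with the same column names; on the real data it would be run on `has_prep_code`.

```python
import pandas as pd

# Toy stand-in for has_prep_code; the rows are invented for illustration.
toy = pd.DataFrame({
    'preparation_state_code': ['UNPREPARED', 'UNPREPARED', 'READY_TO_EAT', 'UNPREPARED'],
    'branded_food_category': ['Sweet Bakery Products', 'Biscuits/Cookies',
                              'Biscuits/Cookies', 'Meat/Poultry'],
})

# Rows: food category, columns: preparation code, cells: number of products.
counts = pd.crosstab(toy.branded_food_category, toy.preparation_state_code)
print(counts)
```

A category that shows up under both UNPREPARED and READY_TO_EAT is a sign that the code is not applied consistently.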


## Serving size cleaning

In [12]:
branded_food.serving_size_unit.value_counts()

g      1522350
ml      243399
GRM      40597
MLT       8288
MG        7555
IU        3671
GM         346
MC          63
Name: serving_size_unit, dtype: int64

We see several distinct unit labels. We will look up what each abbreviation stands for and possibly merge labels that denote the same unit. We can also visualize the serving size distribution for each unit.

- g (gram)
- ml (milliliter, most likely for fluids)
- GRM - unknown, will compare to the gram distribution for differences
- MLT - unknown
- MG - possibly milligram
- IU - possibly International Units
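
Before plotting, a quick numeric comparison of the distributions can be done with `groupby` and `describe`. This is a sketch on made-up values; on the real data the same call would be applied to `branded_food`.

```python
import pandas as pd

# Toy stand-in for branded_food; the serving sizes are invented for illustration.
toy = pd.DataFrame({
    'serving_size': [28.0, 30.0, 85.0, 28.0, 71.0, 156.0],
    'serving_size_unit': ['g', 'g', 'g', 'GRM', 'GRM', 'GRM'],
})

# One row of summary statistics per unit label.
summary = toy.groupby('serving_size_unit').serving_size.describe()
print(summary[['count', '50%', 'max']])
```

If two labels denote the same unit, their medians and quartiles should be of similar magnitude, although differences in the kinds of products reported under each label can still shift them.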

In [13]:
grm_foods = branded_food[branded_food.serving_size_unit == 'GRM']
grm_foods

Unnamed: 0,fdc_id,brand_owner,brand_name,subbrand_name,gtin_upc,ingredients,not_a_significant_source_of,serving_size,serving_size_unit,household_serving_fulltext,branded_food_category,data_source,package_weight,modified_date,available_date,market_country,discontinued_date,preparation_state_code,trade_channel,short_description
1751470,2456687,SCHWAN'S FOOD SERVICE INC,TONY'S,,10072180726718,"INGREDIENTS: FRENCH BREAD (WATER, WHITE WHOLE ...",,156.0,GRM,1 Pizza (156g),Pies/Pastries/Pizzas/Quiches - Savoury (Frozen),GDSN,60 EA,2022-12-15,2023-01-26,United States,,UNPREPARED,"[""CHILD_NUTRITION_FOOD_PROGRAMS""]",TN FB WG CHS 100
1751471,2456688,Bake Crafters Food Company,Bake Crafters,,00737410335001,"Enriched Wheat Flour [Wheat Flour, Malted Barl...",,28.0,GRM,1 oz (28g),Bread (Frozen),GDSN,6.25 LBR,2022-06-02,2023-01-26,United States,,UNPREPARED,,"Pullman Bread, White, 1 Slice, IW"
1751472,2456689,Bake Crafters Food Company,Bake Crafters,,00737410171708,"Whole Wheat Flour, Enriched Bleached Wheat Flo...",,78.0,GRM,"2.75 oz (78g), 4 pieces",Desserts (Frozen),GDSN,12.375 LBR,2022-06-02,2023-01-26,United States,,UNPREPARED,"[""CHILD_NUTRITION_FOOD_PROGRAMS"",""CHILD_NUTRIT...","Mini Breakfast Bites, Glz, WG, 4 Pk"
1751473,2456690,Bake Crafters Food Company,Bake Crafters,,00737410158105,"Water, Whole Wheat Flour, Enriched Wheat Flour...",,40.0,GRM,"1.4 oz (40g), 2 Pancakes",Bread (Frozen),GDSN,14.175 LBR,2022-06-02,2023-01-26,United States,,UNPREPARED,"[""CHILD_NUTRITION_FOOD_PROGRAMS"",""CHILD_NUTRIT...",Pancakes WG Wholesome Choice Mpl Ch
1751474,2456691,Brakebush Brothers,Brakebush,,10038034558706,UNCOOKED BONELESS CHICKEN BREAST TENDERS CONTA...,,71.0,GRM,1 Piece,Chicken - Prepared/Processed,GDSN,10 LBR,2022-04-26,2023-01-26,United States,,UNPREPARED,,Crispy-Lishus tenders
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1845290,2554908,Tanjoe Enterprises Inc.,LIDIA'S,,897712001049,"ITALIAN TOMATOES (CITRIC ACID), ARTICHOKES, CA...",,125.0,GRM,1/2 cup,Prepared Pasta & Pizza Sauces,LI,25 oz/708 g,2023-03-27,2023-05-25,United States,,,,
1845291,2554909,Oregon Growers & Shippers LLC,OREGON GROWERS,,898271000948,"BLACKBERRIES (MARIONBERRIES, BLACKBERRIES), CA...","Not a significant source of saturated fat, tra...",39.0,GRM,2 Tbsp,Syrups & Molasses,LI,8 fl oz/237 mL,2023-04-26,2023-05-25,United States,,,,
1845292,2554910,Mt. Garfield Winery Corp,LIFESTYLEFOODS,,898425002682,ARCADIAN HARVEST LETTUCE (BLEND OF LEAF LETTUC...,,163.0,GRM,,"Pickles, Olives, Peppers & Relishes",LI,5.75 oz./163 g,2023-03-10,2023-05-25,United States,,,,
1845295,2554913,"Ittella International, Inc.",TATTOOED CHEF,,899764001527,"CAULIFLOWER, CORN FLOUR, GRANA PADANO CHEESE (...",,71.0,GRM,1 pc,Frozen Patties and Burgers,LI,10 oz/283 g,2023-03-23,2023-05-25,United States,,,,


One thing to note is how the serving_size value of the GRM rows relates to household_serving_fulltext. In many rows, the amount in serving_size matches the gram amount written in household_serving_fulltext:

In [19]:
(
    # na=False treats rows with a missing household_serving_fulltext as non-matches
    grm_foods[grm_foods.household_serving_fulltext.str.contains('g)', regex=False, na=False)]
    [['serving_size', 'serving_size_unit', 'household_serving_fulltext']]
)

Unnamed: 0,serving_size,serving_size_unit,household_serving_fulltext
1751470,156.0,GRM,1 Pizza (156g)
1751471,28.0,GRM,1 oz (28g)
1751472,78.0,GRM,"2.75 oz (78g), 4 pieces"
1751473,40.0,GRM,"1.4 oz (40g), 2 Pancakes"
1751477,34.0,GRM,"1.19 oz (34g), 1 Bread Stick"
...,...,...,...
1841730,4.0,GRM,2 Tbsp (4g)
1841731,79.0,GRM,"2.8 oz (79g), 1 Biscuit"
1842987,61.0,GRM,2/3 cup dry mix (61g) (1cup prepared)
1842989,61.0,GRM,1 link (g)


From this, we can most likely treat **GRM** as synonymous with **g**.
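
If the two labels are merged, a small alias mapping is enough. The sketch below only maps GRM to g, since that is the only equivalence examined here; the `unit_aliases` dictionary and the `serving_size_unit_clean` column are hypothetical names, and any further entries would need to be checked first.

```python
import pandas as pd

# Toy stand-in for branded_food, using rows seen in the outputs above.
toy = pd.DataFrame({
    'serving_size': [156.0, 17.0, 240.0],
    'serving_size_unit': ['GRM', 'g', 'ml'],
})

# Hypothetical alias table; only GRM -> g is supported by the inspection so far.
unit_aliases = {'GRM': 'g'}

# Labels without an alias are left unchanged by replace().
toy['serving_size_unit_clean'] = toy.serving_size_unit.replace(unit_aliases)
print(toy)
```

Writing to a new column keeps the original label available for later checks.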

In [25]:
g_foods = branded_food[branded_food.serving_size_unit == 'g']
(
    # na=False treats rows with a missing household_serving_fulltext as non-matches
    g_foods[g_foods.household_serving_fulltext.str.contains('g)', regex=False, na=False)]
    [['serving_size', 'serving_size_unit', 'household_serving_fulltext']]
)

Unnamed: 0,serving_size,serving_size_unit,household_serving_fulltext
34110,17.0,g,1 Tbsp (17g)
34285,17.0,g,1 Tbsp(17g)
34382,62.0,g,1/4 cup (62g)
34383,62.0,g,1/4 cup (62g)
34384,62.0,g,1/4 cup (62g)
...,...,...,...
1757337,54.0,g,3 sheets (3x18g) dry
1759907,85.0,g,1 cup salad only/ 1 cup dressed salad with top...
1760115,85.0,g,1 cup salad only/1 cup dressed salad with topp...
1776928,85.0,g,"1 Cup Vegetables (85 g), 4 pieces Sausage (16 ..."


Looking only at these examples, GRM and g appear to be used in the same way. However, a handful of matching rows does not show that the two labels are equivalent across the whole dataset.
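
A more systematic check is to extract the gram amount from household_serving_fulltext and compare it to serving_size for every row. The sketch below does this on a few rows copied from the outputs above; the regular expression is an assumption about how the label text is formatted and only handles a plain number inside parentheses, such as `(156g)` or `(85 g)`.

```python
import pandas as pd

# Rows copied from the outputs above.
toy = pd.DataFrame({
    'serving_size': [156.0, 78.0, 61.0, 54.0, 17.0],
    'serving_size_unit': ['GRM', 'GRM', 'GRM', 'g', 'g'],
    'household_serving_fulltext': [
        '1 Pizza (156g)',
        '2.75 oz (78g), 4 pieces',
        '1 link (g)',
        '3 sheets (3x18g) dry',
        '1 Tbsp (17g)',
    ],
})

# Capture a number that sits directly inside "(...g)"; other layouts yield NaN.
pattern = r'\((\d+(?:\.\d+)?)\s*g\)'
toy['label_grams'] = (
    toy.household_serving_fulltext.str.extract(pattern, expand=False).astype(float)
)

# Only rows where a gram amount could be extracted are compared.
checked = toy.dropna(subset=['label_grams'])
agreement = (
    (checked.label_grams == checked.serving_size)
    .groupby(checked.serving_size_unit)
    .mean()
)
print(agreement)
```

Rows such as `1 link (g)` and `3 sheets (3x18g) dry` do not match the pattern and are left out of the comparison, so the agreement rate describes only the labels that follow the simple format.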