# Modeling USDA ERS Food data

In [1]:
dataset_id = "USDA_ERS_modeled"

In [2]:
!bq --location=US mk --dataset {dataset_id}  #Note: This will not work if you already have a dataset with this name

BigQuery error in mk operation: Dataset 'responsive-cab-267123:USDA_ERS_modeled'
already exists.


The `FIPS_Market_Group` and `State_Codes` tables do not contribute anything to our planned analysis, as described in the DATASETS.txt file, so we don't transfer them over to our modeled dataset. The `Geo_Market_Group` table contains the same data as `Geo_Market`, so we can drop it.

To model:

* Some child tables use the "name" instead of the "id" as the FK. We think the id should be the FK.
* Drop the extra blank column(s) in the `Market_Groups` table(s)
    * change column names to what's on the ERD
* Add meaningful column names to `Geo_Market` (check ERD):
    * the `nielson_name` (4th column) field in the `Geo_Master` table violates the 1NF design principle, so drop it
* union `Food_#_Market` tables, add a column to represent the food# (This table needs a PK)
    * changed some attribute names in Food_#_Market (but feel free to change them back and fix ERD)
        * se to standard_error
        * n to sample_size
        * aggweight to agg_weight
        * totexp to tot_q_exp
    * deleted region and division attributes from the new unioned `Food_Market` because it's repetitive (can be traced back to Geo_Market parent table)
* anything else you can think of
* anything not transformed/modeled, add to TRANSFORMS.txt file
    * we can make the mapping table between primary and secondary dataset in the next milestone

### Copy Foods table from staging dataset

In [3]:
%%bigquery
create table USDA_ERS_modeled.Foods as select distinct * from USDA_ERS_staging.Foods

Executing query with job ID: e9c0e57f-1bcb-4d8e-8096-c011dafb6b58
Query executing: 0.56s

Conflict: 409 GET https://www.googleapis.com/bigquery/v2/projects/responsive-cab-267123/queries/e9c0e57f-1bcb-4d8e-8096-c011dafb6b58?timeoutMs=400&location=US&maxResults=0: Already Exists: Table responsive-cab-267123:USDA_ERS_modeled.Foods

### Copy Food_Categories table from staging dataset

In [4]:
%%bigquery
create table USDA_ERS_modeled.Food_Categories as select distinct * from USDA_ERS_staging.Food_Categories

Executing query with job ID: e3c35cae-f2ba-440e-9053-4cae11da3a0e
Query executing: 0.59s

Conflict: 409 GET https://www.googleapis.com/bigquery/v2/projects/responsive-cab-267123/queries/e3c35cae-f2ba-440e-9053-4cae11da3a0e?timeoutMs=400&location=US&maxResults=0: Already Exists: Table responsive-cab-267123:USDA_ERS_modeled.Food_Categories
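
Both copy cells above fail with `409 Already Exists` on a re-run, because a plain `create table` refuses to overwrite an existing table. BigQuery also accepts `CREATE OR REPLACE TABLE` and `CREATE TABLE IF NOT EXISTS`, either of which makes such a cell safe to run again. The sketch below only builds the statement text; the helper name `copy_table_sql` and its parameters are hypothetical and not part of this notebook, and the resulting statement would still have to be run in a `%%bigquery` cell.

```python
def copy_table_sql(table, source="USDA_ERS_staging", target="USDA_ERS_modeled",
                   replace=True):
    """Build a re-runnable statement that copies the distinct rows of one table.

    Hypothetical helper (an assumption, not the notebook's own code).
    replace=True  -> overwrite the target table if it already exists.
    replace=False -> create the target table only when it is missing.
    """
    if replace:
        head = f"create or replace table {target}.{table}"
    else:
        head = f"create table if not exists {target}.{table}"
    return f"{head} as select distinct * from {source}.{table}"


if __name__ == "__main__":
    # Print the statements for the two tables copied above.
    for name in ("Foods", "Food_Categories"):
        print(copy_table_sql(name))
```

`create or replace` rebuilds the table from staging every time, which suits a modeling notebook that is re-run after staging changes; `if not exists` keeps whatever is already there.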

### Create the Market_Groups table that will contain market_id and market_names

We drop the three string fields from the staging table; only `market_id` and `market_name` are needed.

In [None]:
%%bigquery
create table USDA_ERS_modeled.Market_Groups as select market_id, market_name from USDA_ERS_staging.Market_Groups

### Create a table that combines all Food_#_Market Tables from staging dataset

The combined table uses `marketgroup` (renamed to `market_id`), `year`, `quarter`, `price`,
`se` (renamed to `standard_error`), `n` (`sample_size`), `aggweight` (`agg_weight`), `totexp` (`tot_q_exp`),
and a new `food_id` column that holds the number from each staging table's name (1-54).
A generated `food_market_id` serves as the primary key.

In [5]:
%%bigquery
create table USDA_ERS_modeled.Food_Market as 
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 1 as food_id from USDA_ERS_staging.Food_1_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 2 as food_id from USDA_ERS_staging.Food_2_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 3 as food_id from USDA_ERS_staging.Food_3_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 4 as food_id from USDA_ERS_staging.Food_4_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 5 as food_id from USDA_ERS_staging.Food_5_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 6 as food_id from USDA_ERS_staging.Food_6_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 7 as food_id from USDA_ERS_staging.Food_7_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 8 as food_id from USDA_ERS_staging.Food_8_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 9 as food_id from USDA_ERS_staging.Food_9_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 10 as food_id from USDA_ERS_staging.Food_10_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 11 as food_id from USDA_ERS_staging.Food_11_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 12 as food_id from USDA_ERS_staging.Food_12_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 13 as food_id from USDA_ERS_staging.Food_13_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 14 as food_id from USDA_ERS_staging.Food_14_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 15 as food_id from USDA_ERS_staging.Food_15_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 16 as food_id from USDA_ERS_staging.Food_16_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 17 as food_id from USDA_ERS_staging.Food_17_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 18 as food_id from USDA_ERS_staging.Food_18_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 19 as food_id from USDA_ERS_staging.Food_19_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 20 as food_id from USDA_ERS_staging.Food_20_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 21 as food_id from USDA_ERS_staging.Food_21_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 22 as food_id from USDA_ERS_staging.Food_22_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 23 as food_id from USDA_ERS_staging.Food_23_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 24 as food_id from USDA_ERS_staging.Food_24_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 25 as food_id from USDA_ERS_staging.Food_25_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 26 as food_id from USDA_ERS_staging.Food_26_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 27 as food_id from USDA_ERS_staging.Food_27_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 28 as food_id from USDA_ERS_staging.Food_28_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 29 as food_id from USDA_ERS_staging.Food_29_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 30 as food_id from USDA_ERS_staging.Food_30_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 31 as food_id from USDA_ERS_staging.Food_31_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 32 as food_id from USDA_ERS_staging.Food_32_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 33 as food_id from USDA_ERS_staging.Food_33_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 34 as food_id from USDA_ERS_staging.Food_34_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 35 as food_id from USDA_ERS_staging.Food_35_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 36 as food_id from USDA_ERS_staging.Food_36_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 37 as food_id from USDA_ERS_staging.Food_37_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 38 as food_id from USDA_ERS_staging.Food_38_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 39 as food_id from USDA_ERS_staging.Food_39_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 40 as food_id from USDA_ERS_staging.Food_40_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 41 as food_id from USDA_ERS_staging.Food_41_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 42 as food_id from USDA_ERS_staging.Food_42_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 43 as food_id from USDA_ERS_staging.Food_43_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 44 as food_id from USDA_ERS_staging.Food_44_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 45 as food_id from USDA_ERS_staging.Food_45_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 46 as food_id from USDA_ERS_staging.Food_46_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 47 as food_id from USDA_ERS_staging.Food_47_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 48 as food_id from USDA_ERS_staging.Food_48_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 49 as food_id from USDA_ERS_staging.Food_49_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 50 as food_id from USDA_ERS_staging.Food_50_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 51 as food_id from USDA_ERS_staging.Food_51_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 52 as food_id from USDA_ERS_staging.Food_52_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 53 as food_id from USDA_ERS_staging.Food_53_Market UNION DISTINCT
select GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, price, se as standard_error, n as sample_size, aggweight as agg_weight, totexp as tot_q_exp, 54 as food_id from USDA_ERS_staging.Food_54_Market;

Executing query with job ID: a4acc0eb-4bc8-443e-8081-51fa672ca5ea
Query executing: 2.37s

Conflict: 409 GET https://www.googleapis.com/bigquery/v2/projects/responsive-cab-267123/queries/a4acc0eb-4bc8-443e-8081-51fa672ca5ea?timeoutMs=400&location=US&maxResults=0: Already Exists: Table responsive-cab-267123:USDA_ERS_modeled.Food_Market
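
The 54 SELECT statements above differ only in the table number, so the query text can be generated instead of typed out. The sketch below reproduces the same statement shape; `COLUMNS` and `food_market_union_sql` are hypothetical names introduced for illustration, and the function only returns a string to paste into a `%%bigquery` cell.

```python
# Column list shared by every staging table, with the renames described above.
COLUMNS = (
    "GENERATE_UUID() as food_market_id, marketgroup as market_id, year, quarter, "
    "price, se as standard_error, n as sample_size, aggweight as agg_weight, "
    "totexp as tot_q_exp"
)


def food_market_union_sql(n_tables=54, set_op="UNION DISTINCT"):
    """Return the CREATE TABLE statement that unions Food_1_Market..Food_n_Market.

    Hypothetical helper (an assumption, not the notebook's own code).
    """
    selects = [
        f"select {COLUMNS}, {i} as food_id from USDA_ERS_staging.Food_{i}_Market"
        for i in range(1, n_tables + 1)
    ]
    # Every SELECT except the last is followed by the set operator.
    body = f" {set_op}\n".join(selects)
    return f"create table USDA_ERS_modeled.Food_Market as\n{body};"


if __name__ == "__main__":
    print(food_market_union_sql(n_tables=3))
```

One design note: because `GENERATE_UUID()` gives every row a different `food_market_id`, no two rows are ever identical, so `UNION DISTINCT` cannot remove duplicate measurements here and behaves like `UNION ALL`. If deduplication is wanted, it would have to happen before the UUID is assigned.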

### Create Product_Food_Map Table 

In [None]:
%%bigquery
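-- Draft: `foo` is an unfinished placeholder column carried over from the original cell
create table USDA_ERS_modeled.Product_Food_Map as
select food_id, foo from USDA_ERS_modeled.Foods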


## Check Primary Keys

## Food_Market Table

food_market_id

In [None]:
%%bigquery
select count(*) as no_entries from USDA_ERS_modeled.Food_Market

In [None]:
%%bigquery
select count(distinct food_market_id) as no_PKs from USDA_ERS_modeled.Food_Market

## Foods Table

In [None]:
%%bigquery
select count(*) as no_entries from USDA_ERS_modeled.Foods

In [None]:
%%bigquery
select count(distinct food_id) as no_PKs from USDA_ERS_modeled.Foods

## Food_Categories Table

In [None]:
%%bigquery
select count(*) as no_entries from USDA_ERS_modeled.Food_Categories

In [None]:
%%bigquery
select count(distinct category_id) as no_PKs from USDA_ERS_modeled.Food_Categories

## Market_Groups Table

In [6]:
%%bigquery
select count(*) as no_entries from USDA_ERS_modeled.Market_Groups

Unnamed: 0,no_entries
0,39


In [7]:
%%bigquery
select count(distinct market_id) as no_PKs from USDA_ERS_modeled.Market_Groups

Unnamed: 0,no_PKs
0,39
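
Each primary key check above compares two numbers: the row count and the count of distinct key values. A column qualifies as a primary key only when the two match, as they do for `Market_Groups` (39 and 39). The sketch below shows the same comparison on in-memory rows; `primary_key_report` and the sample rows are hypothetical and stand in for query results.

```python
def primary_key_report(rows, key_columns):
    """Compare the row count with the number of distinct, non-null keys.

    rows        -- list of dicts, one per table row (stand-in for a query result)
    key_columns -- column names that together form the candidate key
    Hypothetical helper (an assumption, not the notebook's own code).
    """
    keys = [tuple(row[c] for c in key_columns) for row in rows]
    # A key with a NULL component cannot identify a row, so it is not counted.
    non_null = [k for k in keys if None not in k]
    no_entries = len(keys)
    no_pks = len(set(non_null))
    return {"no_entries": no_entries, "no_PKs": no_pks,
            "is_valid": no_entries == no_pks}


if __name__ == "__main__":
    sample = [
        {"market_id": 1, "market_name": "Hartford"},
        {"market_id": 2, "market_name": "Boston"},
    ]
    print(primary_key_report(sample, ["market_id"]))
```

Passing several column names checks a composite key in the same way.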


# Foreign Key Check

market_id

In [14]:
%%bigquery
-- Orphan check: Food_Market rows whose market_id has no match in Market_Groups
select fm.market_id
from USDA_ERS_modeled.Food_Market fm left join USDA_ERS_modeled.Market_Groups g
on fm.market_id = g.market_id
where g.market_id is null

Unnamed: 0,market_id


food_id

In [15]:
%%bigquery
select fm.food_id
from USDA_ERS_modeled.Food_Market fm left join USDA_ERS_modeled.Foods f
on fm.food_id= f.food_id 
where f.food_id is null

Unnamed: 0,food_id


food_category

In [16]:
%%bigquery
-- Orphan check: Foods rows whose food_category has no match in Food_Categories
select f.food_category
from USDA_ERS_modeled.Foods f left join USDA_ERS_modeled.Food_Categories fc
on f.food_category = fc.category_id
where fc.category_id is null

Unnamed: 0,food_category
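
A foreign key check is an anti-join: start from the child table, look up each foreign key value in the parent table, and report the values that find no match. An empty result means every reference resolves. The sketch below shows that logic on in-memory rows; `orphan_keys` and the sample data are hypothetical, and unlike a SQL left join it skips NULL child values rather than reporting them.

```python
def orphan_keys(child_rows, child_column, parent_rows, parent_column):
    """Return the sorted child key values that have no matching parent row.

    Hypothetical helper (an assumption, not the notebook's own code).
    """
    parent_keys = {row[parent_column] for row in parent_rows}
    orphans = {
        row[child_column]
        for row in child_rows
        if row[child_column] is not None and row[child_column] not in parent_keys
    }
    return sorted(orphans)


if __name__ == "__main__":
    foods = [{"food_id": 1}, {"food_id": 2}]
    food_market = [{"food_id": 1}, {"food_id": 3}]
    # food_id 3 is referenced by the child table but missing from the parent.
    print(orphan_keys(food_market, "food_id", foods, "food_id"))
```

The direction matters: the child table (the one holding the foreign key) goes on the left, and the null test applies to the parent's key.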


# Predicting 2017 Food Prices Using Linear Regression in Apache Beam 
The transformed table will contain the food_id and the 2017 price predicted by linear regression.
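
The pipeline script itself is not reproduced in this notebook, so the following is only a sketch of the general technique: fit a straight line to (year, average price) pairs by ordinary least squares and evaluate it at 2017. The function names and the sample years and prices are hypothetical, not values taken from `Food_Market_beam.py`.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(years, prices, target_year=2017):
    """Extrapolate the fitted line to target_year."""
    slope, intercept = fit_line(years, prices)
    return slope * target_year + intercept


if __name__ == "__main__":
    # Hypothetical yearly average prices for one food.
    years = [2014, 2015, 2016]
    prices = [0.40, 0.42, 0.45]
    print(predict_price(years, prices))
```

With only two or three points per food, the fit is an extrapolation of a short trend and should be read with that in mind.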

In [1]:
%run Food_Market_beam.py

  experiments = p.options.view_as(DebugOptions).experiments or []
INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner.
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:apache_beam.io.gcp.bigquery_tools:Using location 'US' from table <TableReference
 datasetId: 'USDA_ERS_modeled'
 projectId: 'responsive-cab-267123'
 tableId: 'Food_Market'> referenced by query SELECT food_id, year, market_id, AVG(price) as average_price FROM USDA_ERS_modeled.Food_Market WHERE price IS NOT NULL and year IS NOT NULL GROUP BY food_id, year, market_id ORDER BY food_id, year ASC limit 100
INFO:apache_beam.io.filebasedsink:Starting finalize_write threads with num_shards: 1 (skipped: 0), batches: 1, num_threads: 1


[0.442133025, 0.4217025, 0.456592575]
[0.438534975, 0.42346865, 0.44865410000000006]
[0.407856425, 0.414252925, 0.45905062499999993]
[0.419936825, 0.394503425, 0.424838375]
[0.3658032, 0.37462070000000003]
[0.42713515, 0.3797036750000001, 0.43225255]
[0.41507285, 0.42564002500000003, 0.44038917499999997]
[0.4546993750000001, 0.479148925, 0.5055215]
[0.39316255000000006, 0.38859795, 0.42288034999999996]
[0.468440225, 0.43798404999999996, 0.45067602500000004]
[0.49199052499999996, 0.46225655, 0.48605499999999996]
[0.33849185, 0.35564524999999997, 0.3956174]
[0.414000975, 0.40931520000000005, 0.4471182]
[0.3897556, 0.391994025]
[0.40065605, 0.397285375, 0.453305275]
[0.5136139, 0.48310167499999995]
[0.356727675, 0.35333492499999997, 0.39516779999999996]
[0.40983525000000004, 0.37665344999999995, 0.43652029999999997]
[0.42266235, 0.441289675, 0.48769614999999994]
[0.476266725, 0.44032004999999996, 0.49953805000000007]
[0.46105285, 0.446557975, 0.5178164749999999]
[0.37860292500000003, 0.38

INFO:apache_beam.io.filebasedsink:Renamed 1 shards in 0.11 seconds.
INFO:apache_beam.io.filebasedsink:Starting finalize_write threads with num_shards: 1 (skipped: 0), batches: 1, num_threads: 1
INFO:apache_beam.io.filebasedsink:Renamed 1 shards in 0.10 seconds.
INFO:apache_beam.io.filebasedsink:Starting finalize_write threads with num_shards: 1 (skipped: 0), batches: 1, num_threads: 1
INFO:apache_beam.io.filebasedsink:Renamed 1 shards in 0.10 seconds.
INFO:apache_beam.io.filebasedsink:Starting finalize_write threads with num_shards: 1 (skipped: 0), batches: 1, num_threads: 1
INFO:apache_beam.io.filebasedsink:Renamed 1 shards in 0.10 seconds.
INFO:apache_beam.io.gcp.bigquery_tools:Created table responsive-cab-267123.USDA_ERS_modeled.Food_Market_Beam with schema <TableSchema
 fields: [<TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'food_id'
 type: 'INTEGER'>, <TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'market_id'
 type: 'INTEGER'>, <TableFieldSchema
 fields: []
 mode:

In [46]:
%run Food_Market_beam_dataflow.py

  kms_key=transform.kms_key))
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://bmease_cs327e/staging/transform-foodmarket-df1.1588035483.278157/pipeline.pb...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://bmease_cs327e/staging/transform-foodmarket-df1.1588035483.278157/pipeline.pb in 0 seconds.
INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi
INFO:apache_beam.runners.portability.stager:Executing command: ['/home/jupyter/venv/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpvv0h5tc1', 'apache-beam==2.19.0', '--no-deps', '--no-binary', ':all:']
INFO:apache_beam.runners.portability.stager:Staging SDK sources from PyPI to gs://bmease_cs327e/staging/transform-foodmarket-df1.1588035483.278157/dataflow_python_sdk.tar
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://bmease_cs327e/staging/transform-foodmarket-df1.1588035483.278157/dataflow_p

# Primary Key Check for Food_Market_Beam_DF

In [47]:
%%bigquery
SELECT count(*) from USDA_ERS_modeled.Food_Market_Beam_DF

Unnamed: 0,f0_
0,54


In [49]:
%%bigquery
SELECT count(distinct food_id) from USDA_ERS_modeled.Food_Market_Beam_DF

Unnamed: 0,f0_
0,54


# Foreign Key Check for Food_Market_Beam_DF

In [51]:
%%bigquery
SELECT fm.food_id
FROM USDA_ERS_modeled.Food_Market_Beam_DF fm
LEFT OUTER JOIN USDA_ERS_modeled.Foods f
ON fm.food_id = f.food_id
WHERE f.food_id IS NULL

Unnamed: 0,food_id


In [3]:
#Run all noms code with Python 3
# import noms
# client = noms.Client(API_KEY)  # supply your own API key; do not commit it to the notebook

In [198]:
# product_name = ''
# product_name = ''.join([i for i in product_name if not i.isdigit() and i != '%'])
# search_results = client.search_query(product_name)
# #print(search_results)
# groups = list()
# search_results = pd.DataFrame(search_results.__dict__.get('json').get('items'))
# #print(len(search_results.__dict__.get('json').get('items')))
# #for i in search_results.__dict__.get('json').get('items'): 
# #    groups.append(i.get('group'))
# #groups = set(groups)
# #print(groups)
# print(set(search_results['group']))
# len(set(search_results['group']))
# search_results.head()

{'Fruits and Fruit Juices', 'Baby Foods', 'Soups, Sauces, and Gravies', 'Sweets', 'Dairy and Egg Products', 'Breakfast Cereals', 'Snacks', 'Legumes and Legume Products', 'Baked Products', 'Beverages', 'Vegetables and Vegetable Products'}


Unnamed: 0,group,name,ndbno
0,Baked Products,"Cookies, sugar wafer, with creme filling, suga...",18202
1,Sweets,"Sugars, brown",19334
2,Sweets,"Sugars, granulated",19335
3,Sweets,"Sugars, powdered",19336
4,Sweets,"Sugars, maple",19340


In [201]:
# FoodIDMap = {
    
#     1: {'Fresh/Frozen fruit', 'Fruits and Fruit Juices'},
#     2: {'Canned Fruit', 'Fruits and Fruit Juices'}, #check for 'can' in name
#     3: {'Fruit Juice', 'Fruits and Fruit Juices'}, #check for 'juice'
#     4: {'Fresh/Frozen dark green vegetables', 'Vegetables and Vegetable'}, #check dark green
#     5: {'Canned dark green vegetables', 'Vegetables and Vegetable'}, #check dark green & canned
#     6: {'Fresh/Frozen orange vegetables', 'Vegetables and Vegetable'}, #check orange 
#     7: {'Canned orange vegetables', 'Vegetables and Vegetable'}, #check orange & canned
#     8: {'Fresh/Frozen starchy vegetables', 'Vegetables and Vegetable'}, #check starchy
#     9: {'Canned starchy vegetables', 'Vegetables and Vegetable'}, #check starchy & canned
#     10: {'Fresh/Frozen select nutrient vegetables', 'Vegetables and Vegetable'},
#     11: {'Canned select nutrients', 'Vegetables and Vegetable'}, #check canned
#     12: {'Fresh/Frozen other vegetables', 'Vegetables and Vegetable'}, #default veggie
#     13: {'Canned other vegetables', 'Vegetables and Vegetable'}, #default if canned
#     14: {'Frozen/Dried Legumes', 'Legumes and Legume Products'}, #check for beans
#     15: {'Canned Legumes', 'Legumes and Legume Products'}, #check for canned beans
#     16: {'Whole grain bread, rolls, rice, pasta, cereal', 'Cereal Grains and Pasta'}, #whole-grain as keyword
#     17: {'Whole grain flour and mixes', 'Cereal Grains and Pasta'}, #search for flour as keyword
#     18: {'Whole grain frozen/ready to cook', 'Baked Products'},
#     19: {'other bread, rolls, rice, pasta, cereal', 'Cereal Grains and Pasta'}, #default for bread and grains
#     20: {'other flour and mixes', 'Cereal Grains and Pasta'}, #default for flour
#     21: {'other frozen/ready to cook grains', 'Baked Products'}, #if frozen
#     22: {'Low fat milk', 'Dairy and Egg Products'}, #check for %s or the word fat
#     23: {'Low fat cheese', 'Dairy and Egg Products'},
#     24: {'Low fat yogurt & other dairy', 'Dairy and Egg Products'},
#     25: {'Whole and 2% milk', 'Dairy and Egg Products'},
#     26: {'Whole and 2% cheese', 'Dairy and Egg Products'},
#     27: {'Whole and 2% yogurt & other dairy', 'Dairy and Egg Products'},
#     28: {'Fresh/frozen low fat meat', 'Lamb, Veal, and Game Products'},
#     29: {'Fresh/frozen regular fat meat', 'Pork Products', 'Beef Products'},
#     30: {'Canned meat', None},
#     31: {'Fresh/frozen poultry', 'Poultry Products'},
#     32: {'Canned poultry', 'Poultry Products'}, #check canned
#     33: {'Fresh/frozen fish', 'Finfish and Shellfish Products'},
#     34: {'Canned fish', 'Finfish and Shellfish Products'}, #check canned
#     35: {'Raw nuts and seeds', 'Nut and Seed Products'}, #raw keyword
#     36: {'Processed nuts, seeds and nut butters', 'Nut and Seed Products'}, #not raw
#     37: {'Eggs', 'Dairy and Egg Products'},
#     38: {'Oils', 'Fats and Oils'}, #butter is an exception
#     39: {'Solid fats', 'Fats and Oils'}, 
#     40: {'Raw sugars', 'Sweets'},
#     41: {'Non-alcoholic nondiet carbonated beverages', 'Beverages'}, #sodas
#     42: {'Non-carbonated caloric beverages'}, #diet soda, sparkling water
#     43: {'Water', None},
#     44: {'Ice cream and frozen desserts', 'Sweets'},
#     45: {'Baked good mixes', 'breakfast bakery'}, #instacart data
#     46: {'Packaged sweets/baked goods', 'breakfast bars pastries'}, #this is from instacart aisles
#     47: {'Bakery items, ready to eat', 'Baked Products'}, #look for keyword baked
#     48: {'Frozen entrees and sides', None},
#     49: {'Canned soups, sauces, prepared foods', 'Soups, Sauces, and Gravies'},
#     50: {'Packaged snacks', 'Snacks'},
#     51: {'Ready to cook meals and sides', None},
#     52: {'Ready to eat deli items (hot and cold)', 'Sausages and Luncheon Meats'},
#     53: {'Non-alcoholic diet carbonated beverages', 'Beverages'},
#     54: {'Unsweetened coffee and tea', 'Beverages'}
# }

In [191]:
# import pandas as pd
# food_list = client.get_foods({'04646':100})
# m = noms.Meal(food_list)
# r = noms.report(m)
# #for i in r:
#  #   print(i)
#   #  print(i.get('name' == 'Fat'))
# r = pd.DataFrame(r)
# print(r)
# #print(noms.report(m).get)

       limit                 name      rda         state  unit    value
0        NaN              Protein   125.00     deficient     g    0.000
1        NaN                  Fat    55.56  satisfactory     g  100.000
2        NaN                Carbs   250.00     deficient     g    0.000
3        NaN             Calories  2000.00     deficient  kcal  884.000
4        NaN                Water  2000.00     deficient     g    0.000
5     400.00             Caffeine     0.00  satisfactory    mg    0.000
6     300.00          Theobromine     0.00  satisfactory    mg    0.000
7      50.00                Sugar     0.00  satisfactory     g    0.000
8        NaN                Fiber    28.00     deficient     g    0.000
9    2500.00              Calcium  1000.00     deficient    mg    0.000
10     45.00                 Iron     8.00     deficient    mg    0.040
11    700.00            Magnesium   300.00     deficient    mg    0.000
12   4000.00           Phosphorus   700.00     deficient    mg  

In [10]:
# food_id_map = {0: 8, 1: 43, 2: 93, 3: 112, 4: 128, 5: 26, 6: 31, 7: 64, 8: 77, 9: 90, 10: 94, 11: 98, 12: 115, 13: 48, 14: 57, 15: 121, 16: 130, 17: 18, 18: 68, 19: 59, 20: 69, 21: 81, 22: 95, 23: 99, 24: 2, 25: 21, 26: 36, 27: 53, 28: 71, 29: 84, 30: 86, 31: 91, 32: 108, 33: 120, 34: 1, 35: 13, 36: 14, 37: 67, 38: 96, 39: 4, 40: 9, 41: 12, 42: 63, 43: 131, 44: 34, 45: 37, 46: 38, 47: 42, 48: 52, 49: 58, 50: 79, 51: 113, 52: 116, 53: 119, 54: 129, 55: 30, 56: 33, 57: 66, 58: 76, 59: 7, 60: 15, 61: 35, 62: 39, 63: 49, 64: 106, 65: 122, 66: 100, 67: 6, 68: 5, 69: 17, 70: 19, 71: 29, 72: 51, 73: 72, 74: 88, 75: 89, 76: 97, 77: 104, 78: 105, 79: 110, 80: 16, 81: 24, 82: 32, 83: 83, 84: 123, 85: 3, 86: 23, 87: 45, 88: 46, 89: 50, 90: 61, 91: 78, 92: 103, 93: 107, 94: 117, 95: 125}

In [9]:
# %%bigquery
# select product_name, department, aisle from instacart_modeled.Products p
# inner join instacart_modeled.Departments d on d.department_id = p.department_id
# inner join instacart_modeled.Aisles a on a.aisle_id = p.aisle_id
# where d.department_id not in (1,2,5,11,12,17,18,19)
# and p.product_name not like '%Filters%'
# order by product_name

Unnamed: 0,product_name,department,aisle
0,& Go! Hazelnut Spread + Pretzel Sticks,pantry,spreads
1,(70% Juice!) Mountain Raspberry Juice Squeeze,beverages,juice nectars
2,+Energy Black Cherry Vegetable & Fruit Juice,beverages,refrigerated
3,0 Calorie Acai Raspberry Water Beverage,beverages,energy sports drinks
4,0 Calorie Fuji Apple Pear Water Beverage,beverages,energy sports drinks
...,...,...,...
26148,with Olive Oil Mayonnaise,pantry,condiments
26149,with Olive Oil Mayonnaise Dressing,pantry,condiments
26150,with Sweet Cinnamon Bunches Cereal,breakfast,cereal
26151,with a Splash of Mango Coconut Water,beverages,juice nectars


In [56]:
# green_veggies = {'arugula (rocket)', ' bok choy', ' broccoli', ' broccoli rabe (rapini)', ' broccolini', ' collard greens', ' leafy lettuce', ' endive', ' escarole', ' kale', ' mesclun', ' mixed greens', ' mustard greens', ' romaine lettuce', ' spinach', ' Swiss chard', ' turnip greens', ' watercress'}

# orange_veggies = list(str.split('acorn squash,bell peppers,butternut squash,carrots,hubbard squash,pumpkin,red chili peppers, sweet red peppers, sweet potatoes, tomatoes, 100% vegetable juice', ','))

# print(orange_veggies)

# starchy_veggies = list(str.split('cassava, corn, green bananas,green lima beans,green peas,parsnips,plantains,potatoes white,taro,water chestnuts,yams', ','))
# print(starchy_veggies)

['acorn squash', 'bell peppers', 'butternut squash', 'carrots', 'hubbard squash', 'pumpkin', 'red chili peppers', ' sweet red peppers', ' sweet potatoes', ' tomatoes', ' 100% vegetable juice']
['cassava', ' corn', ' green bananas', 'green lima beans', 'green peas', 'parsnips', 'plantains', 'potatoes white', 'taro', 'water chestnuts', 'yams']
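
The comments in `FoodIDMap` describe a keyword approach: decide the vegetable group from the product name, then pick the fresh/frozen or canned food_id. The sketch below illustrates that idea for food_ids 4-9 and 12-13 only. It is an assumption about how such a mapping could work, not the logic of `Food_Map_beam.py`; the function names are hypothetical and the keyword sets are a small subset of the lists above. `split_keywords` also strips the leading spaces that plain `str.split(',')` leaves behind, as visible in the printed lists.

```python
# Small keyword subsets taken from the lists above (illustrative only).
DARK_GREEN = {"spinach", "kale", "broccoli"}
ORANGE = {"carrots", "pumpkin", "sweet potatoes"}
STARCHY = {"corn", "green peas", "plantains"}

# (keywords, fresh/frozen food_id, canned food_id), following FoodIDMap.
GROUP_IDS = [(DARK_GREEN, 4, 5), (ORANGE, 6, 7), (STARCHY, 8, 9)]
OTHER_FRESH_ID, OTHER_CANNED_ID = 12, 13


def split_keywords(text):
    """Split a comma-separated list and strip stray whitespace from each item."""
    return [item.strip() for item in text.split(",") if item.strip()]


def vegetable_food_id(product_name):
    """Map a vegetable product name to a food_id by keyword (hypothetical rule)."""
    name = product_name.lower()
    # Match the whole word so that names such as "pecan" do not count as canned.
    canned = "canned" in name.split()
    for keywords, fresh_id, canned_id in GROUP_IDS:
        if any(keyword in name for keyword in keywords):
            return canned_id if canned else fresh_id
    return OTHER_CANNED_ID if canned else OTHER_FRESH_ID


if __name__ == "__main__":
    for product in ("Organic Baby Spinach", "Canned Sweet Corn", "Cucumber"):
        print(product, vegetable_food_id(product))
```

Plain substring matching is crude (for example, "popcorn" contains "corn"), so a real mapping would need word-level matching or manual review of edge cases.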


In [11]:
%run Food_Map_beam_dataflow.py

  kms_key=transform.kms_key))
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://bmease_cs327e/staging/foodmap-df.1588128899.077566/pipeline.pb...
INFO:oauth2client.transport:Refreshing due to a 401 (attempt 1/2)
INFO:oauth2client.transport:Refreshing due to a 401 (attempt 2/2)
INFO:oauth2client.transport:Refreshing due to a 401 (attempt 1/2)
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://bmease_cs327e/staging/foodmap-df.1588128899.077566/pipeline.pb in 0 seconds.
INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi
INFO:apache_beam.runners.portability.stager:Executing command: ['/home/jupyter/venv/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpb8niz0rx', 'apache-beam==2.19.0', '--no-deps', '--no-binary', ':all:']
INFO:apache_beam.runners.portability.stager:Staging SDK sources from PyPI to gs://bmease_cs327e/staging/foodmap-df.1588128899.077566/dataflow_python_sdk.tar


In [13]:
%run Food_Map_beam.py

  experiments = p.options.view_as(DebugOptions).experiments or []
INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner.
INFO:oauth2client.transport:Refreshing due to a 401 (attempt 1/2)
INFO:apache_beam.io.gcp.bigquery_tools:Using location 'US' from table <TableReference
 datasetId: 'instacart_modeled'
 projectId: 'responsive-cab-267123'
 tableId: 'Products'> referenced by query select lower(product_name) as product_name, product_id, a.aisle_id, department from instacart_modeled.Products p inner join instacart_modeled.Departments d on d.department_id = p.department_id inner join instacart_modeled.Aisles a on a.aisle_id = p.aisle_id where d.department_id not in (5,8,11,17,18) and p.product_name not like '%Filters%' order by product_name limit 100
INFO:apache_beam.io.gcp.bigquery_tools:Created table responsive-cab-267123.USDA_ERS_modeled.random with schema <TableSchema
 fields: [<TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'food_id'
 type: 'INTEGER'>

# Primary Key Check for Food_Map_Beam_DF

In [21]:
%%bigquery
SELECT count(*) from USDA_ERS_modeled.Food_Map_Beam_DF

Unnamed: 0,f0_
0,36907


In [22]:
%%bigquery
SELECT count(*) from (SELECT distinct food_id, product_id from USDA_ERS_modeled.Food_Map_Beam_DF)

Unnamed: 0,f0_
0,36907


# Foreign Key Check for Food_Map_Beam_DF

In [23]:
%%bigquery
SELECT m.food_id
FROM USDA_ERS_modeled.Food_Map_Beam_DF m
LEFT OUTER JOIN USDA_ERS_modeled.Foods f
ON m.food_id = f.food_id
WHERE f.food_id IS NULL

Unnamed: 0,food_id


In [24]:
%%bigquery
SELECT m.product_id
FROM USDA_ERS_modeled.Food_Map_Beam_DF m
LEFT OUTER JOIN instacart_modeled.Products p
ON m.product_id = p.product_id
WHERE p.product_id IS NULL

Unnamed: 0,product_id
