## TPC-H & TPC-DS BigQuery Data Import  
Import data from GCS to a previously created BigQuery dataset.  

This notebook assumes that you have already generated data at one or more scale factors and uploaded it to the project's Google Cloud Storage bucket named in `config.gcs_data_bucket`.  

Three values are required to initiate an upload to BigQuery:  
1. `test` - the test name, either `h` or `ds`
2. `scale` - the scale factor in GB, usually one of `1`, `100`, `1000`, or `10000`  
3. `name` - a label for this instance of the `test` and `scale` combination, e.g. `time-partitioned` (the upload variables below set this label as `cid`)
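
As a minimal sketch of how these three values fit together, the helper below validates them and builds a dataset name in the same `{test}_{scale}GB_{label}` pattern used later in this notebook. The function `make_dataset_name` and its checks are hypothetical and are not part of the `bq` or `config` modules.

```python
# Hypothetical helper, not part of the notebook's bq/config modules.
VALID_TESTS = {"h", "ds"}  # TPC-H or TPC-DS, as listed above


def make_dataset_name(test, scale, name):
    """Build a dataset name such as 'ds_100GB_01' from the three inputs."""
    if test not in VALID_TESTS:
        raise ValueError(f"test must be one of {sorted(VALID_TESTS)}, got {test!r}")
    if not isinstance(scale, int) or scale <= 0:
        raise ValueError(f"scale must be a positive integer number of GB, got {scale!r}")
    if not name:
        raise ValueError("name must be a non-empty string")
    # Same layout as the "{}_{}GB_{}".format(...) call in the notebook.
    return f"{test}_{scale}GB_{name}"


print(make_dataset_name("ds", 100, "01"))
```

Failing early on a bad `test` or `scale` value avoids creating a dataset whose name does not match the data that is about to be loaded.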

In [1]:
import config, bq

In [2]:
import pandas as pd

In [3]:
pd.set_option("display.max_rows", 1000)
pd.options.display.float_format = '{:.2f}'.format

#### Upload Variables

In [4]:
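test = "ds"              # benchmark to load: "h" (TPC-H) or "ds" (TPC-DS)
scale = 100              # scale factor in GB
cid = "01"               # label for this test/scale instance, used in the dataset name
schema = "bq_ds_01.sql"  # DDL file used to create the tables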

In [5]:
dataset_name = "{}_{}GB_{}".format(test, scale, cid)
dataset_name

'ds_100GB_01'

In [6]:
ddl_filepath = config.fp_schema + config.sep + schema

In [7]:
bq.create_dataset(dataset_name=dataset_name)

Dataset(DatasetReference('tpc-benchmarking-9432', 'ds_100GB_01'))

In [8]:
bq.create_schema(schema_file=ddl_filepath, dataset=dataset_name)

In [9]:
bq_upload = bq.BQUpload(test=test, scale=scale, dataset=dataset_name)

#### Initiate Upload  
Set `verbose=True` to print status messages as each table loads.

In [10]:
fp = bq_upload.upload(verbose=True)

Tables to upload:
call_center
catalog_page
catalog_returns
catalog_sales
customer
customer_address
customer_demographics
date_dim
household_demographics
income_band
inventory
item
promotion
reason
ship_mode
store
store_returns
store_sales
time_dim
warehouse
web_page
web_returns
web_sales
web_site
Loading Table: call_center
t0: 2020-07-09 22:23:32.026483
...
t1: 2020-07-09 22:23:34.883012
Load Job Done: True
ID: 3a642435-44d2-472a-af32-a6d5f195def7
dt: 0 days 00:00:02.856529
GB/s: 0.00
------------------------------
Loading Table: catalog_page
t0: 2020-07-09 22:23:34.884551
...
t1: 2020-07-09 22:23:45.401323
Load Job Done: True
ID: 23e43480-4cdf-49ac-9316-ded51eb52793
dt: 0 days 00:00:10.516772
GB/s: 0.00
------------------------------
Loading Table: catalog_returns
t0: 2020-07-09 22:23:45.402857
...
t1: 2020-07-09 22:25:05.539592
Load Job Done: True
ID: 1d2885fc-35bf-4287-aee0-a9c2fb73d087
dt: 0 days 00:01:20.136735
GB/s: 0.03
------------------------------
Loading Table: catalog_sales

#### Summary of Upload

In [11]:
fp

'/home/colin/code/bq_snowflake_benchmark/ds/bq_upload-ds_100GB-ds_100GB_01-2020-07-09 22:23:32.010143.csv'

In [12]:
dfa = bq.parse_log(fp)
dfa

Unnamed: 0,test,scale,dataset,table,status,t0,t1,size_bytes,job_id
0,test,scale,dataset,table,status,t0,t1,size_bytes,job_id
1,ds,100,ds_100GB_01,call_center,start,2020-07-09 22:23:32.026483,,9389,
2,ds,100,ds_100GB_01,call_center,end,2020-07-09 22:23:32.026483,2020-07-09 22:23:34.883012,9389,3a642435-44d2-472a-af32-a6d5f195def7
3,ds,100,ds_100GB_01,catalog_page,start,2020-07-09 22:23:34.884551,,2833824,
4,ds,100,ds_100GB_01,catalog_page,end,2020-07-09 22:23:34.884551,2020-07-09 22:23:45.401323,2833824,23e43480-4cdf-49ac-9316-ded51eb52793
5,ds,100,ds_100GB_01,catalog_returns,start,2020-07-09 22:23:45.402857,,2264498800,
6,ds,100,ds_100GB_01,catalog_returns,end,2020-07-09 22:23:45.402857,2020-07-09 22:25:05.539592,2264498800,1d2885fc-35bf-4287-aee0-a9c2fb73d087
7,ds,100,ds_100GB_01,catalog_sales,start,2020-07-09 22:25:05.541085,,30874934982,
8,ds,100,ds_100GB_01,catalog_sales,end,2020-07-09 22:25:05.541085,2020-07-09 22:26:41.941668,30874934982,26dcd9e3-4639-4519-bf69-0c0a90135d66
9,ds,100,ds_100GB_01,customer,start,2020-07-09 22:26:41.943236,,267479876,
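
The log rows above carry enough information to recompute the `dt` and `GB/s` figures printed during the upload. The sketch below does this with the standard library only, using a few rows copied from the output above with the index column dropped. The function name `load_throughput` is hypothetical, and the use of decimal gigabytes (10^9 bytes) is an assumption, since the notebook does not say which unit `bq` uses.

```python
import csv
import io
from datetime import datetime

# Rows copied from the parsed log above (leading index column dropped).
LOG = """test,scale,dataset,table,status,t0,t1,size_bytes,job_id
ds,100,ds_100GB_01,catalog_returns,start,2020-07-09 22:23:45.402857,,2264498800,
ds,100,ds_100GB_01,catalog_returns,end,2020-07-09 22:23:45.402857,2020-07-09 22:25:05.539592,2264498800,1d2885fc-35bf-4287-aee0-a9c2fb73d087
ds,100,ds_100GB_01,catalog_sales,start,2020-07-09 22:25:05.541085,,30874934982,
ds,100,ds_100GB_01,catalog_sales,end,2020-07-09 22:25:05.541085,2020-07-09 22:26:41.941668,30874934982,26dcd9e3-4639-4519-bf69-0c0a90135d66
"""

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def load_throughput(log_text):
    """Return {table: (seconds, GB/s)} for every completed load in the log."""
    results = {}
    for row in csv.DictReader(io.StringIO(log_text)):
        # Only "end" rows carry both timestamps; "start" rows have an empty t1.
        if row["status"] != "end":
            continue
        t0 = datetime.strptime(row["t0"], TS_FORMAT)
        t1 = datetime.strptime(row["t1"], TS_FORMAT)
        seconds = (t1 - t0).total_seconds()
        # Assumption: decimal gigabytes (1 GB = 10**9 bytes).
        gigabytes = int(row["size_bytes"]) / 1e9
        results[row["table"]] = (seconds, gigabytes / seconds)
    return results


throughput = load_throughput(LOG)
for table, (seconds, gb_per_s) in throughput.items():
    print(f"{table}: {seconds:.1f} s, {gb_per_s:.2f} GB/s")
```

For `catalog_returns` this gives roughly 0.03 GB/s, which agrees with the value printed by the verbose upload output. Filtering on `status == "end"` also skips any row that is not a completed load, such as a repeated header line.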
