In [1]:
import warnings
warnings.filterwarnings('ignore')

# Speed comparison

## Introduction

This page compares the data transfer speed of this library's load method:

- [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.loader.Loader.load)

with that of the following methods from the Python Client for Google BigQuery:

- [google.cloud.bigquery.job.QueryJob.to_dataframe()](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe)
- [google.cloud.bigquery.client.Client.load_table_from_dataframe()](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.load_table_from_dataframe)

## Set up 

In [2]:
# %%bash 
# pip install google-pandas-load
# pip install google-cloud-bigquery-storage[fastavro]==0.*
# pip install pyarrow==0.*

In [3]:
import os
import pandas
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1
from google_pandas_load import LoaderQuickSetup

In [4]:
project_id = 'dmp-y-tests'
dataset_id = 'tmp'
bucket_name = 'bucket_gpl'
local_dir_path = '/tmp/gpl_directory'

Next, set bq_client, bqstorage_client, table_ref and instantiate a loader.

In [5]:
bq_client = bigquery.Client(
    project=project_id, 
    credentials=None)

bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=None)

table_ref = bigquery.dataset.DatasetReference(
    project=project_id, 
    dataset_id=dataset_id).table('z0')

gpl = LoaderQuickSetup(
    project_id=project_id, 
    dataset_id=dataset_id,
    bucket_name=bucket_name,
    local_dir_path=local_dir_path)

In [6]:
# Create the local working directory if it does not exist yet.
os.makedirs(local_dir_path, exist_ok=True)
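
The cells in the following sections are timed with the %%time notebook magic. Outside a notebook, a comparable wall-clock measurement could be sketched as below. The helper name timed and the placeholder workload are hypothetical and not part of google_pandas_load or the BigQuery client; unlike %%time, this sketch reports wall time only, not CPU time.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


# Placeholder workload standing in for a download or upload call.
timings = {}
with timed('placeholder', timings):
    total = sum(range(100_000))

print(f"placeholder: {timings['placeholder']:.4f} s")
```

Wrapping each download or upload call in its own timed block would collect all durations in one dictionary for a side-by-side comparison.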

## Download

The query below creates a 1.01 GB BigQuery table.

In [7]:
query = """
select * from
(select rand() as a from unnest(generate_array(1, 8000)))
cross join
(select rand() as b from unnest(generate_array(1, 8000)))
"""

In [8]:
# %%time
# df = bq_client.query(query).to_dataframe()

KeyboardInterrupt: 

Using bqstorage_client speeds up the download. See [here](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas) for additional information.

In [9]:
%%time
df = bq_client.query(query).to_dataframe(bqstorage_client=bqstorage_client)

ArrowNotImplementedError: Not implemented type for lists: struct<action: string, html_url: string, page_name: string, sha: string, summary: string, title: string>

There is an issue with the previous download: it used cached query results.

In [10]:
%%time
job_config = bigquery.QueryJobConfig()
job_config.use_query_cache = False
df = bq_client.query(query, job_config=job_config).to_dataframe(bqstorage_client=bqstorage_client)

ArrowNotImplementedError: Not implemented type for lists: struct<action: string, html_url: string, page_name: string, sha: string, summary: string, title: string>

When a query is executed with [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.loader.Loader.load), its results are never served from the cache, because the method always writes to a destination table. See [here](https://cloud.google.com/bigquery/docs/cached-results) for more information.
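
To confirm whether a given download was served from the cache, the cache_hit property of a finished google.cloud.bigquery QueryJob can be inspected. The sketch below is only an illustration: the helper cache_status is hypothetical, and stand-in objects replace real query jobs so that it runs without BigQuery credentials.

```python
from types import SimpleNamespace


def cache_status(query_job):
    """Describe whether a finished query job was answered from the cache."""
    # cache_hit is True when cached results were used; it may be None
    # when the information is not available yet.
    if query_job.cache_hit is None:
        return 'unknown'
    return 'cached' if query_job.cache_hit else 'not cached'


# Stand-in objects (assumption: they mimic only the cache_hit attribute).
fake_cached_job = SimpleNamespace(cache_hit=True)
fake_fresh_job = SimpleNamespace(cache_hit=False)

print(cache_status(fake_cached_job))
print(cache_status(fake_fresh_job))
```

With a real client, the same helper could be applied to the job returned by bq_client.query(...) once its results have been fetched.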

In [11]:
%%time
df = gpl.load(source='query', destination='dataframe', query=query)

2019-12-18 16:35:26,760 - LoaderQuickSetup - DEBUG - Starting query to bq...


NotFound: 404 Not found: Dataset bigquery-public-data:samples was not found in location EU

(job ID: f5a7edcd-675a-4884-a06c-41d497a0d2da)

                -----Query Job SQL Follows-----                

    |    .    |    .    |    .    |    .    |    .    |
   1:
   2:select * from `bigquery-public-data.samples.github_nested`
    |    .    |    .    |    .    |    .    |    .    |

## Upload

In [None]:
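import random

# 64 million rows, the same row count as the table built in the Download section.
N = 64*10**6
# Note: each column repeats a single random value N times.
df = pandas.DataFrame({'a': [random.uniform(0, 1)]*N, 'b': [random.uniform(0, 1)]*N})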
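
In the cell above, the expression [random.uniform(0, 1)]*N draws one value and repeats it N times, so each column is constant. If distinct values per row are wanted instead, which is an assumption about intent and not something the original benchmark did, the difference can be sketched on a small list:

```python
import random

n = 5  # small size for illustration; the benchmark uses N = 64*10**6

# List multiplication evaluates random.uniform once and repeats the result.
repeated = [random.uniform(0, 1)] * n

# A comprehension calls random.uniform once per element.
distinct = [random.uniform(0, 1) for _ in range(n)]

print(len(set(repeated)))  # 1: every element is the same value
print(len(set(distinct)))
```

Drawing 64 million values one by one in pure Python is considerably slower than list multiplication, so the repeated form may have been chosen to keep the setup fast.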

In [12]:
%%time
bq_client.load_table_from_dataframe(dataframe=df, destination=table_ref).result()

NameError: name 'df' is not defined

In [13]:
%%time
gpl.load(source='dataframe', destination='bq', data_name='z1', dataframe=df)

NameError: name 'df' is not defined

## Conclusion 

The [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.loader.Loader.load) method downloads much faster and uploads slightly slower than the built-in methods of the [Python Client for Google BigQuery](https://googleapis.dev/python/bigquery/latest/index.html).