In [1]:
import warnings
warnings.filterwarnings('ignore')

# Speed comparison

## Introduction

This page compares the data transfer speed of this library's load method:

- [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.loader.Loader.load)

with that of the following methods from the Python Client for Google BigQuery:

- [google.cloud.bigquery.job.QueryJob.to_dataframe()](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe)
- [google.cloud.bigquery.client.Client.load_table_from_dataframe()](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.load_table_from_dataframe)

## Set up 

In [2]:
# %%bash 
# pip install google_pandas_load
# pip install google-cloud-bigquery-storage[pandas,fastavro]

In [3]:
import os
import pandas
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1
from google_pandas_load import LoaderQuickSetup

In [4]:
project_id = 'dmp-y-tests'
dataset_id = 'tmp'
bucket_name = 'bucket_gpl'
local_dir_path = '/tmp/gpl_directory'

Next, create bq_client, bqstorage_client and table_ref, and instantiate a loader.

In [5]:
bq_client = bigquery.Client(
    project=project_id, 
    credentials=None)

bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=None)

table_ref = bigquery.dataset.DatasetReference(
    project=project_id, 
    dataset_id=dataset_id).table('s0')

gpl = LoaderQuickSetup(
    project_id=project_id, 
    dataset_id=dataset_id,
    bucket_name=bucket_name,
    local_dir_path=local_dir_path)

In [6]:
# Create the local working directory if it does not exist yet.
os.makedirs(local_dir_path, exist_ok=True)

## Download

The query below creates a 580 MB BigQuery table.

In [7]:
query = """
select * from 
(select 'Hello, ' as a from unnest(generate_array(1, 6000))) 
cross join 
(select 'World!' as b from unnest(generate_array(1, 6000)))
"""
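
The table size quoted above can be sanity-checked with a little arithmetic. The sketch below assumes BigQuery's documented size rule for STRING values (2 bytes plus the UTF-8 encoded length); the variable names are only illustrative.

```python
# Rough size estimate for the table produced by the cross join.
# Assumption: BigQuery counts a STRING value as 2 bytes + its UTF-8 length.
rows = 6000 * 6000  # cross join of two 6000-row inputs

bytes_per_row = (
    2 + len('Hello, '.encode('utf-8'))   # column a
    + 2 + len('World!'.encode('utf-8'))  # column b
)

total_bytes = rows * bytes_per_row
size_mib = total_bytes / 2**20

print(f'{rows} rows x {bytes_per_row} bytes = {total_bytes} bytes '
      f'(about {size_mib:.0f} MiB)')
```

Under that assumption the estimate lands in the same range as the 580 MB figure, and the row count matches the `N = 36*10**6` used in the upload section.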

In [8]:
%%time
df = bq_client.query(query).to_dataframe()

CPU times: user 2min 52s, sys: 5.22 s, total: 2min 57s
Wall time: 15min 9s


Using bqstorage_client speeds up the download. See [here](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas) for additional information.

In [9]:
%%time
df = bq_client.query(query).to_dataframe(bqstorage_client=bqstorage_client)

CPU times: user 1min 54s, sys: 12.1 s, total: 2min 6s
Wall time: 1min 48s


There is an issue with the previous download: it used the cached query results.

In [10]:
%%time
job_config = bigquery.QueryJobConfig()
job_config.use_query_cache = False  # force the query to run again
df = bq_client.query(query, job_config=job_config).to_dataframe(bqstorage_client=bqstorage_client)

CPU times: user 1min 52s, sys: 15.9 s, total: 2min 8s
Wall time: 2min 2s


When executing a query with [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.loader.Loader.load), the query results are not cached because the method always creates a destination table. See [here](https://cloud.google.com/bigquery/docs/cached-results) for more information.

In [11]:
%%time
df = gpl.load(source='query', destination='dataframe', query=query)

2019-12-02 20:16:51,512 - LoaderQuickSetup - DEBUG - Starting query to bq...
2019-12-02 20:17:07,445 - LoaderQuickSetup - DEBUG - Ended source to bq [15s, 0.0$]
2019-12-02 20:17:07,446 - LoaderQuickSetup - DEBUG - Starting bq to gs...
2019-12-02 20:17:15,273 - LoaderQuickSetup - DEBUG - Ended bq to gs [7s]
2019-12-02 20:17:15,275 - LoaderQuickSetup - DEBUG - Starting gs to local...
2019-12-02 20:17:16,930 - LoaderQuickSetup - DEBUG - Ended gs to local [1s]
2019-12-02 20:17:16,932 - LoaderQuickSetup - DEBUG - Starting local to dataframe...
2019-12-02 20:17:23,267 - LoaderQuickSetup - DEBUG - Ended local to dataframe [6s]


CPU times: user 6.49 s, sys: 377 ms, total: 6.87 s
Wall time: 31.9 s
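
To see where the time goes, the per-stage durations can be recomputed from the log timestamps printed above. This is a minimal sketch: the log text is copied from the output of this cell, and the helper `parse_ts` is a hypothetical name, not part of google_pandas_load.

```python
from datetime import datetime

# Log lines copied from the output above.
log = """\
2019-12-02 20:16:51,512 - LoaderQuickSetup - DEBUG - Starting query to bq...
2019-12-02 20:17:07,445 - LoaderQuickSetup - DEBUG - Ended source to bq [15s, 0.0$]
2019-12-02 20:17:07,446 - LoaderQuickSetup - DEBUG - Starting bq to gs...
2019-12-02 20:17:15,273 - LoaderQuickSetup - DEBUG - Ended bq to gs [7s]
2019-12-02 20:17:15,275 - LoaderQuickSetup - DEBUG - Starting gs to local...
2019-12-02 20:17:16,930 - LoaderQuickSetup - DEBUG - Ended gs to local [1s]
2019-12-02 20:17:16,932 - LoaderQuickSetup - DEBUG - Starting local to dataframe...
2019-12-02 20:17:23,267 - LoaderQuickSetup - DEBUG - Ended local to dataframe [6s]
"""


def parse_ts(line):
    # Hypothetical helper: the timestamp is the first 23 characters,
    # formatted as 'YYYY-MM-DD HH:MM:SS,mmm'.
    return datetime.strptime(line[:23], '%Y-%m-%d %H:%M:%S,%f')


lines = log.strip().splitlines()
durations = {}
# Lines alternate between a "Starting ..." and an "Ended ..." message.
for start_line, end_line in zip(lines[0::2], lines[1::2]):
    message = start_line.split(' - ')[-1]
    stage = message[len('Starting '):].rstrip('.')
    elapsed = parse_ts(end_line) - parse_ts(start_line)
    durations[stage] = elapsed.total_seconds()

total = sum(durations.values())
for stage, seconds in durations.items():
    print(f'{stage}: {seconds:.1f}s ({seconds / total:.0%} of logged time)')
```

In this run the query step accounts for about half of the logged time, while the transfer from Cloud Storage to the local folder is the shortest stage.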


## Upload

In [12]:
N = 36*10**6
df = pandas.DataFrame({'a': ['Hello, ']*N, 'b': ['World!']*N})

In [13]:
%%time
bq_client.load_table_from_dataframe(dataframe=df, destination=table_ref).result()

CPU times: user 5.59 s, sys: 315 ms, total: 5.9 s
Wall time: 1min 12s


<google.cloud.bigquery.job.LoadJob at 0x7f447d9e0978>

In [15]:
%%time
gpl.load(source='dataframe', destination='bq', data_name='s1', dataframe=df)

2019-12-02 20:25:08,085 - LoaderQuickSetup - DEBUG - Starting dataframe to local...
2019-12-02 20:25:48,292 - LoaderQuickSetup - DEBUG - Ended dataframe to local [40s]
2019-12-02 20:25:48,293 - LoaderQuickSetup - DEBUG - Starting local to gs...
2019-12-02 20:25:48,874 - LoaderQuickSetup - DEBUG - Ended local to gs [0s]
2019-12-02 20:25:48,875 - LoaderQuickSetup - DEBUG - Starting gs to bq...
2019-12-02 20:26:45,445 - LoaderQuickSetup - DEBUG - Ended gs to bq [56s]


CPU times: user 41 s, sys: 182 ms, total: 41.2 s
Wall time: 1min 37s


## Conclusion 

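In the runs above, the [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.loader.Loader.load) method downloaded the query result faster than the built-in method from [Python Client for Google BigQuery](https://googleapis.dev/python/bigquery/latest/index.html): 31.9 s, against 2min 2s for to_dataframe() with bqstorage_client and the query cache disabled. The upload, however, was slower in this run: 1min 37s, against 1min 12s for load_table_from_dataframe(). Each figure comes from a single run, so these numbers are indicative only.

One possible explanation for the faster download, suggested by the [basic mechanism](index.rst#The-basic-mechanism), is that going through the local folder reduces the number of network calls and thus speeds up the transfer.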
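
The wall times reported by `%%time` in the cells above can be turned into ratios with a few lines of Python. This is only a sketch: `to_seconds` is a hypothetical helper written for this page, and the strings are copied from the outputs above.

```python
import re


def to_seconds(wall_time):
    """Convert a %%time wall-time string such as '2min 2s' or '31.9 s' to seconds."""
    minutes = re.search(r'(\d+)\s*min', wall_time)
    seconds = re.search(r'([\d.]+)\s*s\b', wall_time)
    total = 0.0
    if minutes:
        total += int(minutes.group(1)) * 60
    if seconds:
        total += float(seconds.group(1))
    return total


# (BigQuery client wall time, gpl.load wall time), copied from the outputs above.
wall_times = {
    'download': ('2min 2s', '31.9 s'),    # to_dataframe with bqstorage, cache disabled
    'upload': ('1min 12s', '1min 37s'),   # load_table_from_dataframe
}

ratios = {}
for direction, (client_time, gpl_time) in wall_times.items():
    ratios[direction] = to_seconds(client_time) / to_seconds(gpl_time)
    print(f'{direction}: client time / gpl.load time = {ratios[direction]:.2f}')
```

A ratio above 1 means `gpl.load` took less time than the client method in that run; a ratio below 1 means it took more.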