In [1]:
import warnings
warnings.filterwarnings('ignore')

# Speed comparison

## Introduction

The purpose of this page is to compare the speed of data transfer between the load method from this library: 

- [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.Loader.load)

and the methods from the Python Client for Google BigQuery: 

- [google.cloud.bigquery.job.QueryJob.to_dataframe()](https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.job.QueryJob.to_dataframe.html#google.cloud.bigquery.job.QueryJob.to_dataframe)
- [google.cloud.bigquery.client.Client.load_table_from_dataframe()](https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.load_table_from_dataframe.html#google.cloud.bigquery.client.Client.load_table_from_dataframe)

## Set up 

In [2]:
import os
import pandas
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1
from google_pandas_load import LoaderQuickSetup

In [3]:
project_id = 'dmp-y-tests'
dataset_id = 'tmp'
bucket_name = 'bucket_gpl'
local_dir_path = '/tmp/gpl_directory'

Next, set bq_client, bqstorage_client, table_ref and instantiate a loader.

Credentials are inferred from the environment. Further information about how to authenticate to Google Cloud Platform with the [Google Cloud Client Libraries for Python](https://googleapis.github.io/google-cloud-python/latest/index.html) can be found [here](https://googleapis.github.io/google-cloud-python/latest/core/auth.html?highlight=defaults).

In [4]:
bq_client = bigquery.Client(
    project=project_id, 
    credentials=None)

bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=None)

table_ref = bigquery.dataset.DatasetReference(
    project=project_id, 
    dataset_id=dataset_id).table('s0')

gpl = LoaderQuickSetup(
    project_id=project_id, 
    dataset_id=dataset_id,
    bucket_name=bucket_name,
    local_dir_path=local_dir_path)

In [5]:
# Create the local directory if it does not exist yet.
os.makedirs(local_dir_path, exist_ok=True)

## Download

The query below creates a 600 MB BigQuery table.

In [6]:
query = """
select * from 
(select 'Hello, ' as a from unnest(generate_array(1, 4000))) 
cross join 
(select 'World!' as b from unnest(generate_array(1, 4000)))
"""

In [7]:
%%time
df = bq_client.query(query).to_dataframe()

CPU times: user 1min 48s, sys: 2.34 s, total: 1min 51s
Wall time: 7min 5s


Using bqstorage_client speeds up the download. See [here](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas) for additional information.

In [8]:
%%time
df = bq_client.query(query).to_dataframe(bqstorage_client=bqstorage_client)

CPU times: user 1min 5s, sys: 2.67 s, total: 1min 8s
Wall time: 1min 5s


There is an issue with the previous download: it used cached query results!

In [9]:
%%time
# Disable the query cache so that the query is actually executed.
job_config = bigquery.QueryJobConfig()
job_config.use_query_cache = False
df = bq_client.query(query, job_config=job_config).to_dataframe(bqstorage_client=bqstorage_client)

CPU times: user 1min 8s, sys: 2.39 s, total: 1min 10s
Wall time: 1min 13s
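
One way to check whether a download was served from the cache is the `cache_hit` property of a finished `QueryJob`. The sketch below is a hypothetical helper (`cache_status` is not part of either library) and uses stand-in objects so that it runs without credentials; with a real client, the object returned by `bq_client.query(query)` would be passed instead, once the job has completed.

```python
from types import SimpleNamespace


def cache_status(query_job):
    """Describe whether a finished query job was served from cached results."""
    # QueryJob.cache_hit is True or False once the job is done,
    # and None while the job has not completed yet.
    hit = getattr(query_job, "cache_hit", None)
    if hit is None:
        return "unknown"
    return "cached" if hit else "not cached"


# Stand-ins for real QueryJob objects (assumption: only cache_hit is read).
print(cache_status(SimpleNamespace(cache_hit=True)))
print(cache_status(SimpleNamespace(cache_hit=False)))
```

Checking this flag before comparing timings helps avoid measuring a cached result against a freshly executed query.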


When executing a query with [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.Loader.load), the query results are not cached because the method always creates a destination table. See [here](https://cloud.google.com/bigquery/docs/cached-results) for more information. 

In [10]:
%%time
df = gpl.load(source='query', destination='dataframe', query=query)

2019-04-01 21:58:38,007 - LoaderQuickSetup - DEBUG - Starting query to bq...
2019-04-01 21:58:47,690 - LoaderQuickSetup - DEBUG - Ended query to bq [9s, 0.0$]
2019-04-01 21:58:47,692 - LoaderQuickSetup - DEBUG - Starting bq to gs...
2019-04-01 21:58:54,434 - LoaderQuickSetup - DEBUG - Ended bq to gs [6s]
2019-04-01 21:58:54,436 - LoaderQuickSetup - DEBUG - Starting gs to local...
2019-04-01 21:58:56,707 - LoaderQuickSetup - DEBUG - Ended gs to local [2s]
2019-04-01 21:58:56,708 - LoaderQuickSetup - DEBUG - Starting local to dataframe...
2019-04-01 21:59:00,431 - LoaderQuickSetup - DEBUG - Ended local to dataframe [3s]


CPU times: user 4.29 s, sys: 418 ms, total: 4.71 s
Wall time: 23.2 s
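
The log lines above carry millisecond timestamps, so the time spent in each stage can be recovered more precisely than the rounded values in brackets. The following is a minimal sketch that parses the log output copied from this cell; `stage_durations` is a hypothetical helper written for this page, not a function of the library.

```python
from datetime import datetime

# Log output copied from the cell above.
LOG = """\
2019-04-01 21:58:38,007 - LoaderQuickSetup - DEBUG - Starting query to bq...
2019-04-01 21:58:47,690 - LoaderQuickSetup - DEBUG - Ended query to bq [9s, 0.0$]
2019-04-01 21:58:47,692 - LoaderQuickSetup - DEBUG - Starting bq to gs...
2019-04-01 21:58:54,434 - LoaderQuickSetup - DEBUG - Ended bq to gs [6s]
2019-04-01 21:58:54,436 - LoaderQuickSetup - DEBUG - Starting gs to local...
2019-04-01 21:58:56,707 - LoaderQuickSetup - DEBUG - Ended gs to local [2s]
2019-04-01 21:58:56,708 - LoaderQuickSetup - DEBUG - Starting local to dataframe...
2019-04-01 21:59:00,431 - LoaderQuickSetup - DEBUG - Ended local to dataframe [3s]
"""


def stage_durations(log_text):
    """Return {stage name: seconds} from paired Starting/Ended log lines."""
    starts, durations = {}, {}
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # Each line is "timestamp - logger name - level - message".
        stamp, _name, _level, message = line.split(" - ", 3)
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
        if message.startswith("Starting "):
            stage = message[len("Starting "):].rstrip(".")
            starts[stage] = when
        elif message.startswith("Ended "):
            # Drop the trailing "[9s, ...]" summary to recover the stage name.
            stage = message[len("Ended "):].split(" [")[0]
            durations[stage] = (when - starts[stage]).total_seconds()
    return durations


for stage, seconds in stage_durations(LOG).items():
    print(f"{stage}: {seconds:.3f} s")
```

In this run, the query itself and the export from BigQuery to Cloud Storage account for most of the wall time; the local steps are comparatively short.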


## Upload

In [11]:
N = 16*10**6
df = pandas.DataFrame({'a': ['Hello, ']*N, 'b': ['World!']*N})

In [12]:
%%time
# you may need to install pyarrow (pip install pyarrow)
# for this to work.
bq_client.load_table_from_dataframe(dataframe=df, destination=table_ref).result()

CPU times: user 2.58 s, sys: 378 ms, total: 2.96 s
Wall time: 2min 1s


<google.cloud.bigquery.job.LoadJob at 0x7fe2757d3400>

In [13]:
%%time
df = gpl.load(source='dataframe', destination='bq', data_name='s1', dataframe=df)

2019-04-01 22:01:04,349 - LoaderQuickSetup - DEBUG - Starting dataframe to local...
2019-04-01 22:01:21,937 - LoaderQuickSetup - DEBUG - Ended dataframe to local [17s]
2019-04-01 22:01:21,938 - LoaderQuickSetup - DEBUG - Starting local to gs...
2019-04-01 22:01:22,401 - LoaderQuickSetup - DEBUG - Ended local to gs [0s]
2019-04-01 22:01:22,402 - LoaderQuickSetup - DEBUG - Starting gs to bq...
2019-04-01 22:01:48,976 - LoaderQuickSetup - DEBUG - Ended gs to bq [26s]


CPU times: user 18 s, sys: 91.7 ms, total: 18.1 s
Wall time: 44.9 s


## Conclusion 

The [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.Loader.load) method executes faster downloads and faster uploads than those executed by the built-in methods from [Python Client for Google BigQuery](https://googleapis.github.io/google-cloud-python/latest/bigquery/index.html). 
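
To put rough numbers on this, the wall times reported in the cells above can be turned into ratios. The sketch below only restates those measurements (single runs, so the ratios are indicative rather than a benchmark); the dictionary layout and the `speedup` helper are illustrative choices.

```python
# Wall times in seconds, copied from the cell outputs above.
wall_times = {
    "download": {
        "baseline": 73.0,   # to_dataframe with bqstorage_client, cache disabled (1min 13s)
        "gpl_load": 23.2,   # gpl.load(source='query', destination='dataframe')
    },
    "upload": {
        "baseline": 121.0,  # load_table_from_dataframe (2min 1s)
        "gpl_load": 44.9,   # gpl.load(source='dataframe', destination='bq')
    },
}


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


for direction, times in wall_times.items():
    ratio = speedup(times["baseline"], times["gpl_load"])
    print(f"{direction}: {ratio:.1f}x faster")
# download: 3.1x faster
# upload: 2.7x faster
```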

Looking at the [basic mechanism](index.rst#The-basic-mechanism), one plausible explanation is that staging the data in the local folder reduces the number of network calls, which in turn speeds up data transfers. 