In [1]:
import warnings
warnings.filterwarnings('ignore')

# Speed comparison

## Introduction

We would like to compare the speed of data transfers executed by these [Python Client for Google BigQuery](https://googleapis.github.io/google-cloud-python/latest/bigquery/index.html) methods, on the one hand:

- [google.cloud.bigquery.job.QueryJob.to_dataframe()](https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.job.QueryJob.to_dataframe.html#google.cloud.bigquery.job.QueryJob.to_dataframe)
- [google.cloud.bigquery.client.Client.load_table_from_dataframe()](https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.load_table_from_dataframe.html#google.cloud.bigquery.client.Client.load_table_from_dataframe)

and the load method of this library, on the other hand: 

- [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.Loader.load)

## Set up 

These are the imports used by this speed comparison.

In [2]:
import os
import pandas
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1
from google_pandas_load import LoaderQuickSetup

Set these variables with the corresponding values of your own resources if you want to run the tests. 

In [3]:
project_id = 'dmp-y-tests'
dataset_id = 'tmp'
bucket_name = 'augustin-b-bucket'
local_dir_path = '/tmp/gpl_directory'

Let's set up a bq_client, a bqstorage_client, a table_ref and a loader. In the following cell, the credentials are inferred from the environment. See [here](https://googleapis.github.io/google-cloud-python/latest/core/auth.html?highlight=defaults) for more information about how to authenticate to Google Cloud Platform with the [Google Cloud Client Libraries for Python](https://googleapis.github.io/google-cloud-python/latest/index.html).

In [4]:
bq_client = bigquery.Client(
    project=project_id, 
    credentials=None)

bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=None)

table_ref = bigquery.dataset.DatasetReference(
    project=project_id, 
    dataset_id=dataset_id).table('s0')

gpl = LoaderQuickSetup(
    project_id=project_id, 
    dataset_id=dataset_id,
    bucket_name=bucket_name,
    local_dir_path=local_dir_path)

Let's not forget to create a local folder at local_dir_path for the loader to use.

In [5]:
# Create the folder (and any missing parents) only if it does not exist yet.
os.makedirs(local_dir_path, exist_ok=True)

By default, [google_pandas_load.LoaderQuickSetup](LoaderQuickSetup.rst) automatically displays its log records in the console. This is convenient when scripting. See [here](Tutorial.ipynb#Logging) for more information.

## Download

The BigQuery table resulting from the query below has a size of 600 MB. 

In [6]:
query = """
select * from 
(select 'Hello, ' as a from unnest(generate_array(1, 4000))) 
cross join 
(select 'World!' as b from unnest(generate_array(1, 4000)))
"""

In [7]:
%%time
df = bq_client.query(query).to_dataframe()

CPU times: user 1min 44s, sys: 2.37 s, total: 1min 46s
Wall time: 6min 57s


Using the bqstorage_client speeds up the download. See [here](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas) for more information.

In [8]:
%%time
df = bq_client.query(query).to_dataframe(bqstorage_client=bqstorage_client)

CPU times: user 1min 26s, sys: 3.58 s, total: 1min 30s
Wall time: 1min 26s


There is a problem with the previous download: it used the cached query results!

In [9]:
%%time
job_config = bigquery.QueryJobConfig()
job_config.use_query_cache = False
df = bq_client.query(query, job_config=job_config).to_dataframe(bqstorage_client=bqstorage_client)

CPU times: user 1min 25s, sys: 3.02 s, total: 1min 28s
Wall time: 1min 32s


When executing a query, the load method of this library always sets a destination table. This implies that the query cache will not be used. See [here](https://cloud.google.com/bigquery/docs/cached-results) for more information.

In [10]:
%%time
df = gpl.load(source='query', destination='dataframe', query=query)

2019-03-22 20:51:28,274 - LoaderQuickSetup - DEBUG - Starting query to bq...
2019-03-22 20:51:36,489 - LoaderQuickSetup - DEBUG - Ended query to bq [8s, 0.0$]
2019-03-22 20:51:36,491 - LoaderQuickSetup - DEBUG - Starting bq to gs...
2019-03-22 20:51:42,837 - LoaderQuickSetup - DEBUG - Ended bq to gs [6s]
2019-03-22 20:51:42,839 - LoaderQuickSetup - DEBUG - Starting gs to local...
2019-03-22 20:51:43,684 - LoaderQuickSetup - DEBUG - Ended gs to local [0s]
2019-03-22 20:51:43,685 - LoaderQuickSetup - DEBUG - Starting local to dataframe...
2019-03-22 20:51:47,681 - LoaderQuickSetup - DEBUG - Ended local to dataframe [3s]


CPU times: user 4.62 s, sys: 508 ms, total: 5.13 s
Wall time: 20.3 s
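
The log records above already show where the time goes. As a rough sketch, the duration of each stage can be recomputed from the timestamps of its "Starting" and "Ended" lines. The helper name `stage_durations` is hypothetical, and the parsing assumes the exact log format printed above.

```python
from datetime import datetime

# Log lines copied from the output above.
LOG = """\
2019-03-22 20:51:28,274 - LoaderQuickSetup - DEBUG - Starting query to bq...
2019-03-22 20:51:36,489 - LoaderQuickSetup - DEBUG - Ended query to bq [8s, 0.0$]
2019-03-22 20:51:36,491 - LoaderQuickSetup - DEBUG - Starting bq to gs...
2019-03-22 20:51:42,837 - LoaderQuickSetup - DEBUG - Ended bq to gs [6s]
2019-03-22 20:51:42,839 - LoaderQuickSetup - DEBUG - Starting gs to local...
2019-03-22 20:51:43,684 - LoaderQuickSetup - DEBUG - Ended gs to local [0s]
2019-03-22 20:51:43,685 - LoaderQuickSetup - DEBUG - Starting local to dataframe...
2019-03-22 20:51:47,681 - LoaderQuickSetup - DEBUG - Ended local to dataframe [3s]
"""


def stage_durations(log_text):
    """Return {stage: seconds} from matching 'Starting X...' / 'Ended X [...]' lines."""
    starts, durations = {}, {}
    for line in log_text.strip().splitlines():
        # Fields are: timestamp - logger name - level - message.
        timestamp, _, _, message = line.split(" - ", 3)
        when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S,%f")
        if message.startswith("Starting "):
            stage = message[len("Starting "):].rstrip(".")
            starts[stage] = when
        elif message.startswith("Ended "):
            stage = message[len("Ended "):].split(" [")[0]
            durations[stage] = (when - starts[stage]).total_seconds()
    return durations


durations = stage_durations(LOG)
for stage, seconds in durations.items():
    print(f"{stage:<20} {seconds:6.3f} s")
print(f"{'total':<20} {sum(durations.values()):6.3f} s")
```

For this run, the four stages add up to about 19.4 s of the 20.3 s wall time, and the two BigQuery jobs (query to bq, bq to gs) account for roughly three quarters of that total.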


## Upload

In [11]:
N = 16*10**6
df = pandas.DataFrame({'a': ['Hello, ']*N, 'b': ['World!']*N})

In [12]:
%%time
# you may need to install pyarrow (pip install pyarrow)
# for this to work.
bq_client.load_table_from_dataframe(dataframe=df, destination=table_ref).result()

CPU times: user 2.59 s, sys: 349 ms, total: 2.94 s
Wall time: 1min 25s


<google.cloud.bigquery.job.LoadJob at 0x7f007d0c2278>

In [13]:
%%time
# The result is not assigned back to df, so the DataFrame is not overwritten.
gpl.load(source='dataframe', destination='bq', data_name='s1', dataframe=df)

2019-03-22 20:53:16,307 - LoaderQuickSetup - DEBUG - Starting dataframe to local...
2019-03-22 20:53:33,437 - LoaderQuickSetup - DEBUG - Ended dataframe to local [17s]
2019-03-22 20:53:33,437 - LoaderQuickSetup - DEBUG - Starting local to gs...
2019-03-22 20:53:33,763 - LoaderQuickSetup - DEBUG - Ended local to gs [0s]
2019-03-22 20:53:33,766 - LoaderQuickSetup - DEBUG - Starting gs to bq...
2019-03-22 20:53:58,420 - LoaderQuickSetup - DEBUG - Ended gs to bq [24s]


CPU times: user 17.5 s, sys: 107 ms, total: 17.6 s
Wall time: 42.4 s


## Conclusion 

In these runs, the load method of this library executed faster downloads and faster uploads than the built-in methods from [Python Client for Google BigQuery](https://googleapis.github.io/google-cloud-python/latest/bigquery/index.html): about 20 s versus 1 min 32 s for the uncached download, and about 42 s versus 1 min 25 s for the upload.

Looking at the [basic mechanism](index.rst#The-basic-mechanism), one may suppose that going through the local folder reduces the number of network calls and thus speeds up data transfers.
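
To put numbers on the comparison, the wall times reported by `%%time` in the Download and Upload sections can be converted to seconds and divided by the wall time of the load method. This is only a sketch over single measurements taken in one session; the dictionary labels and the helper name `speedups` are chosen here for illustration.

```python
# Wall times reported by %%time in this notebook, in seconds.
download = {
    "to_dataframe()": 6 * 60 + 57,
    "to_dataframe(bqstorage_client), cached": 1 * 60 + 26,
    "to_dataframe(bqstorage_client), no cache": 1 * 60 + 32,
    "gpl.load()": 20.3,
}
upload = {
    "load_table_from_dataframe()": 1 * 60 + 25,
    "gpl.load()": 42.4,
}


def speedups(wall_times, baseline="gpl.load()"):
    """Return how many times slower each method is than the baseline."""
    base = wall_times[baseline]
    return {
        name: seconds / base
        for name, seconds in wall_times.items()
        if name != baseline
    }


download_speedups = speedups(download)
upload_speedups = speedups(upload)

for label, ratios in (("download", download_speedups), ("upload", upload_speedups)):
    for name, ratio in ratios.items():
        print(f"{label}: gpl.load() is {ratio:.1f}x faster than {name}")
```

Since each cell was timed once, these ratios are indicative rather than precise; repeating the runs would give a better estimate.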