In [8]:
import warnings
warnings.filterwarnings('ignore')

# Tutorial

## Pre-setup

These are the imports used by this tutorial.

In [4]:
import os
import logging
import pandas
from datetime import datetime
from google.cloud import bigquery
from google.cloud import storage
from google_pandas_load import Loader
from google_pandas_load import LoaderQuickSetup
from google_pandas_load import LoadConfig

Set these variables with the corresponding values of your own resources. 

In [5]:
project_id = 'dmp-y-tests'
dataset_id = 'tmp'
bucket_name = 'augustin-b-bucket'
# gs_dir_path_in_bucket is the path in 
# the bucket of the directory that
# will contain the data in Storage.  
gs_dir_path_in_bucket = 'gpl_dir/subdir'
local_dir_path = '/tmp/gpl_directory'

Finally, create a local folder at the path local_dir_path; otherwise the load jobs that use it will crash.

In [6]:
os.makedirs(local_dir_path, exist_ok=True)

## Set up a loader

In the context of this package, a loader is an instance of [google_pandas_load.Loader](Loader.rst) or of [google_pandas_load.LoaderQuickSetup](LoaderQuickSetup.rst).

The second class is the only subclass of the first.

Let's see how to create instances of both classes with their main parameters: the locations where the data can be extracted from or moved to.

### the low-level way

To set up a loader the low-level way, use [google_pandas_load.Loader](Loader.rst).

In the following code cell, the credentials are inferred from the environment (see [here](https://googleapis.github.io/google-cloud-python/latest/core/auth.html?highlight=defaults) for more information about how to authenticate to Google Cloud Platform with the [Google Cloud Client Libraries for Python](https://googleapis.github.io/google-cloud-python/latest/index.html)).

In [13]:
# the bq_client executes the cloud parts of the load jobs:
# running queries, extracting BigQuery tables to Storage
# and loading tables into BigQuery from Storage.
bq_client = bigquery.Client(
    project=project_id, 
    credentials=None)

# the dataset_ref pointing to the dataset to store the data 
# in BigQuery. 
dataset_ref = bigquery.dataset.DatasetReference(
    project=project_id, 
    dataset_id=dataset_id)

# the gs_client is used to instantiate a bucket. 
gs_client = storage.Client(
    project=project_id, 
    credentials=None)
# the bucket to store the data in Storage. 
bucket = storage.bucket.Bucket(
    client=gs_client, 
    name=bucket_name)

gpl = Loader(
    bq_client=bq_client,
    dataset_ref=dataset_ref,
    bucket=bucket,
    gs_dir_path_in_bucket=gs_dir_path_in_bucket,
    local_dir_path=local_dir_path)

In the setup above, the bq_client, the dataset_ref and the gs_client share the same project_id. Furthermore, the bq_client and the gs_client share the same credentials. Neither of these is required.

Nonetheless, to be able to execute load jobs with all possible sources and destinations, the bq_client must have read and write access to the data in the dataset and in the bucket.

Set the parameter gs_dir_path_in_bucket to None to store the data loaded in Storage directly in the root directory of the bucket.

### the quick way

To set up a loader quickly, use [google_pandas_load.LoaderQuickSetup](LoaderQuickSetup.rst).

The code behind the instantiation of an object of this class is essentially the code of the previous cell.

One limitation is that the bq_client, the dataset_ref and the gs_client necessarily share the same project_id (the one specified as an argument). Another is that the bq_client and the gs_client share the same credentials (those specified as an argument).

A drawback is that these objects are built internally when the instance is created, even though they could be useful in other modules that do not need a loader.

In [6]:
gpl_quick_setup = LoaderQuickSetup(
    project_id=project_id, 
    dataset_id=dataset_id, 
    bucket_name=bucket_name, 
    gs_dir_path_in_bucket=gs_dir_path_in_bucket,
    credentials=None,
    local_dir_path=local_dir_path)

## A simple download

In [7]:
df = gpl.load(
    source='query', 
    destination='dataframe', 
    query='select 1 as x')

df

   x
0  1


## A simple upload

In [8]:
gpl.load(
    source='dataframe', 
    destination='bq',
    data_name='a0',
    dataframe=df)

In BigQuery, there is now the following table:

![](a0_in_bq.png)

Its table id is project_id:dataset_id.a0.

## Basic loading mechanism

### source and destination

The parameters source and destination of [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.Loader.load) can take the following values:

- 'query'
- 'bq'
- 'gs'
- 'local'
- 'dataframe'

### Loading paths

The downloading path is 'query' -> 'bq' -> 'gs' -> 'local' -> 'dataframe'.

The uploading path is the reverse.
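As a rough illustration of these two paths, the following sketch lists the elementary steps between a source and a destination. The helper load_steps is hypothetical and not part of google_pandas_load; the step names simply follow the pattern of the keys shown in the monitoring output later in this tutorial (query_to_bq, bq_to_gs, and so on).

```python
# The five locations, in downloading order.
LOCATIONS = ['query', 'bq', 'gs', 'local', 'dataframe']


def load_steps(source, destination):
    """Return the elementary steps between source and destination.

    Hypothetical helper for illustration only; it is not part of
    google_pandas_load.
    """
    i = LOCATIONS.index(source)
    j = LOCATIONS.index(destination)
    # Walk forward for a download, backward for an upload.
    step = 1 if i <= j else -1
    path = [LOCATIONS[k] for k in range(i, j + step, step)]
    return [f'{a}_to_{b}' for a, b in zip(path, path[1:])]


download = load_steps('query', 'dataframe')
upload = load_steps('dataframe', 'bq')
print(download)
print(upload)
```

Under this reading, a download from 'query' to 'dataframe' goes through every intermediate location, which is consistent with the log records displayed in the Logging section.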

### Load result in RAM

The value returned by a load job depends on the destination:

- If destination = 'query', the following BigQuery standard SQL query:  
  "select * from \`project_id.dataset_id.data_name\`",  
  where the project_id is the one of the dataset. 

- If destination = 'dataframe', a pandas dataframe populated with the loaded data. 

- Otherwise, None.
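To make the first case concrete, here is a minimal sketch that builds a query string of the form given above. The function table_query is a hypothetical name used for illustration; the library may build its string differently.

```python
def table_query(project_id, dataset_id, data_name):
    """Build a standard SQL query selecting a whole table.

    Hypothetical helper; it only reproduces the query format
    described in the text above.
    """
    return f'select * from `{project_id}.{dataset_id}.{data_name}`'


query = table_query('dmp-y-tests', 'tmp', 'a0')
print(query)
```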

### In general, data is moved, not copied ! 

Thus, in general, once the load job has been executed, the data does not exist anymore in the source and in the transitional locations. 

There are two exceptions:

- When source = 'dataframe', the dataframe is not deleted from RAM (a function cannot delete a global variable without knowing its name).
- When destination = 'query', the data is not deleted in BigQuery, so that it still exists somewhere. Indeed, in this case the load job returns a simple query (see the paragraph above), which represents the data but does not contain it.

Use the parameters delete_in_bq, delete_in_gs and delete_in_local of [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.Loader.load) to control the deletion of the data, during the execution of the load job.
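The rules above can be summarised with a small simulation. This is only a sketch of the behaviour as described in this section, not the library's implementation: the function remaining_locations is hypothetical, it only considers the locations crossed by the load job, and it ignores the dataframe kept in RAM.

```python
LOCATIONS = ['query', 'bq', 'gs', 'local', 'dataframe']
STORED = ('bq', 'gs', 'local')  # locations holding data outside RAM


def remaining_locations(source, destination, delete_in_bq=True,
                        delete_in_gs=True, delete_in_local=True):
    """List where the data still exists once the load job is done.

    Simplified, hypothetical model of the rules stated above.
    """
    i = LOCATIONS.index(source)
    j = LOCATIONS.index(destination)
    step = 1 if i <= j else -1
    path = [LOCATIONS[k] for k in range(i, j + step, step)]
    delete = {'bq': delete_in_bq, 'gs': delete_in_gs,
              'local': delete_in_local}
    remaining = []
    for location in STORED:
        if location not in path:
            continue  # the load job never touches this location
        if location == destination:
            remaining.append(location)  # the data has just arrived here
        elif location == 'bq' and destination == 'query':
            remaining.append(location)  # the returned query needs the table
        elif not delete[location]:
            remaining.append(location)  # deletion explicitly disabled
    return remaining


print(remaining_locations('query', 'dataframe'))
print(remaining_locations('dataframe', 'bq', delete_in_gs=False))
```

With the default values, a download to a dataframe leaves nothing behind in BigQuery, in Storage or in the local folder, while disabling delete_in_gs keeps a copy in Storage.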

### In general, pre-existing data is deleted !

In general, before data moves to any location, data with the same name already existing in the location is deleted, to make a clean space for the new data to come. 

There is one exception:

- When destination = 'bq' and the parameter write_disposition of [google_pandas_load.Loader.load()](Loader.rst#google_pandas_load.Loader.load) is set to 'WRITE_APPEND', the data is appended to pre-existing data with the same name in the dataset. The default value of this parameter is 'WRITE_TRUNCATE'.

## What is the data named data_name ? 

For a loader, the data named data_name is :

- in BigQuery : the table in the dataset whose id is data_name
- in Storage : the blobs which are inside the bucket directory and whose basename begins with data_name
- in local : the files which are inside the local folder and whose basename begins with data_name

This definition is motivated by the fact that BigQuery splits a big table into several blobs when extracting it to Storage.
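The prefix rule for Storage and the local folder can be sketched in a few lines. The helper is_named and the file a1.csv.gz are hypothetical and only serve the illustration; the library's own matching code may differ.

```python
import os


def is_named(path, data_name):
    """Tell whether a blob name or a file path belongs to data_name.

    Hypothetical helper applying the definition above: the basename
    must begin with data_name.
    """
    return os.path.basename(path).startswith(data_name)


paths = [
    '/tmp/gpl_directory/a0-000000000000.csv.gz',
    '/tmp/gpl_directory/a0-000000000001.csv.gz',
    '/tmp/gpl_directory/a1.csv.gz',  # hypothetical extra file
]
a0_paths = [p for p in paths if is_named(p, 'a0')]
print(a0_paths)
```

One consequence of such a prefix rule is that a name which is a prefix of another one (for instance a1 and a10) would match the files of both, so choosing names that are not prefixes of each other seems safer.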

## More examples


### from query to gs

In [9]:
gpl.load(
    source='query', 
    destination='gs', 
    data_name='a0',
    query='select 5 as y')

### from gs to local

In [10]:
gpl.load(
    source='gs', 
    destination='local', 
    data_name='a0')

### from local to dataframe

In [11]:
df = gpl.load(
    source='local', 
    destination='dataframe', 
    data_name='a0')

### from dataframe to gs

In [12]:
gpl.load(
    source='dataframe', 
    destination='gs', 
    data_name='a0', 
    dataframe=df)

### from gs to query

In [13]:
query = gpl.load(
    source='gs', 
    destination='query', 
    data_name='a0', 
    bq_schema=[bigquery.SchemaField('y', 'INTEGER')])

The bq_schema can also be inferred from the dataframe with 
google_pandas_load.LoadConfig.bq_schema_inferred_from_dataframe :  

In [14]:
bq_schema = LoadConfig.bq_schema_inferred_from_dataframe(df)
bq_schema

[SchemaField('y', 'INTEGER', 'NULLABLE', None, ())]

## List data

Let's create some data in BigQuery and transfer it to Storage. 

In [15]:
query = """
select * from 
(select 'Hello, ' as x from unnest(generate_array(1, 4000))) 
cross join 
(select 'World!' as y from unnest(generate_array(1, 4000)))
"""

gpl.load(
    source='query', 
    destination='gs',
    data_name='a0',
    query=query)

To list this data, [named](#What-is-the-data-named-data_name-?) a0, in Storage :  

In [16]:
gpl.list_blobs(data_name='a0')

[<Blob: augustin-b-bucket, gpl_dir/subdir/a0-000000000000.csv.gz>,
 <Blob: augustin-b-bucket, gpl_dir/subdir/a0-000000000001.csv.gz>]

You can also list the blob uris : 

In [17]:
gpl.list_blob_uris(data_name='a0')

['gs://augustin-b-bucket/gpl_dir/subdir/a0-000000000000.csv.gz',
 'gs://augustin-b-bucket/gpl_dir/subdir/a0-000000000001.csv.gz']

The data was big enough for BigQuery to split it into several files in Storage. 

Let's move this data into the local folder : 

In [18]:
gpl.load(
    source='gs', 
    destination='local',
    data_name='a0')

To list this data, [named](#What-is-the-data-named-data_name-?) a0, in the local folder :  

In [19]:
gpl.list_local_file_paths(data_name='a0')

['/tmp/gpl_directory/a0-000000000001.csv.gz',
 '/tmp/gpl_directory/a0-000000000000.csv.gz']

If you do not want BigQuery to split the data, set use_wildcard to False when creating the loader. 

## Check data existence

Use the exist methods to check the [existence](#What-is-the-data-named-data_name-?) of the data in BigQuery, in Storage or in the local folder. 

For instance : 

In [9]:
print(gpl.exist_in_local(data_name='a1'))

gpl.load(
    source='query', 
    destination='local',
    data_name='a1',
    query='select 2')

print(gpl.exist_in_local(data_name='a1'))

False
True


## Delete data

### delete parameters

Use delete parameters to control the deletion of the data in BigQuery, in Storage or in the local folder, during the execution of a load job. 

For instance, let's upload a dataframe into BigQuery, while keeping the data in Storage but not in the local folder.

In [10]:
df = pandas.DataFrame(data={'x':[1]})

gpl.load(
    source='dataframe', 
    destination='bq',
    data_name='a1',
    dataframe=df, 
    delete_in_local=True, 
    delete_in_gs=False)

Note that True is the [default](#In-general,-data-is-moved,-not-copied-!) value of the three parameters delete_in_bq, delete_in_gs and delete_in_local. 

In [11]:
print(gpl.exist_in_local(data_name='a1'))
print(gpl.exist_in_gs(data_name='a1'))
print(gpl.exist_in_bq(data_name='a1'))

False
True
True


### delete methods

Use the delete methods to delete data in BigQuery, in Storage or in the local folder. 

For instance : 

In [12]:
gpl.load(
    source='query', 
    destination='gs',
    data_name='a1',
    query='select 2')

print(gpl.exist_in_gs(data_name='a1'))
gpl.delete_in_gs(data_name='a1')
print(gpl.exist_in_gs(data_name='a1'))

True
False


## Cast data

### when it moves to pandas

When the data moves to pandas, use the parameter dtype to cast a column into one of the following Python types: bool, int, float or str.

In [24]:
query = """
select 5 as x, 5 as y, 5 as z
"""
dtype = {
    'x': str, 
    'y': float}

df = gpl.load(
    source='query', 
    destination='dataframe', 
    query=query, 
    dtype=dtype)

df

   x    y  z
0  5  5.0  5


In [25]:
df.dtypes

x     object
y    float64
z      int64
dtype: object

To cast a column into the datetime.datetime type, use the parameter parse_dates.

In [26]:
query = """
select 
cast('2012-11-14 14:32:30' as TIMESTAMP) as x, 
'2013-11-14 14:32:30.100121' as y,
cast('2012-11-14' as DATE) as z
"""

df = gpl.load(
    source='query',
    destination='dataframe',
    query=query,
    parse_dates=['x', 'y', 'z'])

df

                    x                          y          z
0 2012-11-14 14:32:30 2013-11-14 14:32:30.100121 2012-11-14


In [27]:
df.dtypes

x    datetime64[ns]
y    datetime64[ns]
z    datetime64[ns]
dtype: object

### when it moves to BigQuery

When data moves to BigQuery, you can specify the BigQuery types of the columns with the parameter bq_schema. 

In [28]:
df = pandas.DataFrame(data={'x': [7, 8], 'y': ['a', 'b']})

gpl.load(
    source='dataframe', 
    destination='gs', 
    data_name='a0', 
    dataframe=df)


bq_schema = [bigquery.SchemaField(name='x', field_type='FLOAT'),
             bigquery.SchemaField(name='y', field_type='STRING')]

gpl.load(
    source='gs', 
    destination='bq', 
    data_name='a0', 
    bq_schema=bq_schema)

Let's check that the BigQuery table a0 has the bq_schema specified : 

In [29]:
table_ref = dataset_ref.table(table_id='a0')
table = bq_client.get_table(table_ref=table_ref)
table.schema

[SchemaField('x', 'FLOAT', 'NULLABLE', None, ()),
 SchemaField('y', 'STRING', 'NULLABLE', None, ())]

If source = 'dataframe', bq_schema is not required. The pandas columns are given BigQuery types as follows, in this order of priority:

- the columns whose names are in the list parameter date_cols are given the BigQuery type DATE. 
- the columns whose names are in the list parameter timestamp_cols are given the BigQuery type TIMESTAMP. 
- the columns with python type bool are given the BigQuery type BOOLEAN.
- the columns with python type int are given the BigQuery type INTEGER. 
- the columns with python type float are given the BigQuery type FLOAT.
- the other columns are given the BigQuery type STRING. 

Let's see an example : 

In [30]:
dt = datetime.strptime(
    '2003-11-14 14:32:30.100121', 
    '%Y-%m-%d %H:%M:%S.%f')
df = pandas.DataFrame(
    data={
        'w': [8.0], 
        'x': ['e'], 
        'y': ['2018-01-01'], 
        'z': [dt]})

gpl.load(
    source='dataframe', 
    destination='bq', 
    data_name='a0', 
    dataframe=df, 
    date_cols=['y'], 
    timestamp_cols=['z'])

In [31]:
table_ref = dataset_ref.table(table_id='a0')
table = bq_client.get_table(table_ref=table_ref)
table.schema

[SchemaField('w', 'FLOAT', 'NULLABLE', None, ()),
 SchemaField('x', 'STRING', 'NULLABLE', None, ()),
 SchemaField('y', 'DATE', 'NULLABLE', None, ()),
 SchemaField('z', 'TIMESTAMP', 'NULLABLE', None, ())]
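The priority rules listed above can be sketched in plain Python. This is an assumption-laden illustration, not the library's code: the function infer_bq_types is hypothetical and takes a mapping from column names to Python types instead of a real dataframe.

```python
from datetime import datetime


def infer_bq_types(column_types, date_cols=(), timestamp_cols=()):
    """Map each column name to a BigQuery type name.

    Hypothetical helper following the order of priority given in
    the text: date_cols, timestamp_cols, bool, int, float, then
    STRING for everything else.
    """
    bq_types = {}
    for name, python_type in column_types.items():
        if name in date_cols:
            bq_types[name] = 'DATE'
        elif name in timestamp_cols:
            bq_types[name] = 'TIMESTAMP'
        elif python_type is bool:
            bq_types[name] = 'BOOLEAN'
        elif python_type is int:
            bq_types[name] = 'INTEGER'
        elif python_type is float:
            bq_types[name] = 'FLOAT'
        else:
            bq_types[name] = 'STRING'
    return bq_types


# Same columns as in the example above.
bq_types = infer_bq_types(
    {'w': float, 'x': str, 'y': str, 'z': datetime},
    date_cols=['y'],
    timestamp_cols=['z'])
print(bq_types)
```

Because date_cols and timestamp_cols are checked first, the string column y ends up as DATE rather than STRING, which matches the schema displayed above.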

## Multi load

The method [google_pandas_load.Loader.mload()](Loader.rst#google_pandas_load.Loader.mload) launches several load jobs at the same time. For each job, the user defines a load config, which is an instance of [google_pandas_load.LoadConfig](LoadConfig.rst). The list of load configs is then passed to the mload method.

A load config has the same parameters as the method google_pandas_load.Loader.load.

Let's see an example : 

In [32]:
config1 = LoadConfig(
    source='query', 
    destination='dataframe', 
    query='select 1 as x')


df = pandas.DataFrame(data={'x': [3]})
config2 = LoadConfig(
    source='dataframe', 
    destination='local', 
    data_name='a0',
    dataframe=df)

load_results = gpl.mload(configs=[config1, config2])

In [33]:
load_results[0]

   x
0  1


In [34]:
print(load_results[1])

None


## Monitoring

### a load job

The method [google_pandas_load.Loader.xload()](Loader.rst#google_pandas_load.Loader.xload) returns extra information about a load job, in particular monitoring information.

For instance : 

In [35]:
xload_result = gpl.xload(
    source='query', 
    destination='dataframe', 
    query='select 11 as x')

In [36]:
xload_result.load_result

    x
0  11


In [37]:
print(xload_result.data_name)
print(xload_result.duration)
print(xload_result.durations)
print(xload_result.query_cost)

20190322203436_427238_rand3108
2
Namespace(bq_to_gs=1, bq_to_query=None, dataframe_to_local=None, gs_to_bq=None, gs_to_local=0, local_to_dataframe=0, local_to_gs=None, query_to_bq=1)
0.0


### a multi load job

The method [google_pandas_load.Loader.xmload()](Loader.rst#google_pandas_load.Loader.xmload) returns extra information about a multi load job, in particular monitoring information.

For instance : 

In [38]:
config1 = LoadConfig(
    source='query', 
    destination='dataframe', 
    query='select 1 as x')


df = pandas.DataFrame(data={'x': [3]})
config2 = LoadConfig(
    source='dataframe', 
    destination='local', 
    data_name='a0',
    dataframe=df)

xmload_result = gpl.xmload(configs=[config1, config2])

In [39]:
xmload_result.load_results

[   x
 0  1, None]

In [40]:
print(xmload_result.data_names)
print(xmload_result.duration)
print(xmload_result.durations)
print(xmload_result.query_cost)
print(xmload_result.query_costs)

['20190322203440_093909_rand14', 'a0']
3
Namespace(bq_to_gs=2, bq_to_query=None, dataframe_to_local=0, gs_to_bq=None, gs_to_local=0, local_to_dataframe=0, local_to_gs=None, query_to_bq=1)
0.0
[0.0, None]


## Logging

The logger creating the log records of [google_pandas_load.Loader](Loader.rst) is named Loader and is controlled, as usual, by the application code. 

In [41]:
import logging
logger = logging.getLogger('Loader')
logger.setLevel(level=logging.DEBUG)
ch = logging.StreamHandler()
formatter = logging.Formatter(fmt='%(name)s - %(levelname)s - %(message)s')
ch.setFormatter(fmt=formatter)
logger.addHandler(hdlr=ch)

In [42]:
df = gpl.load(
    source='query', 
    destination='dataframe', 
    query='select 1 as x')

Loader - DEBUG - Starting query to bq...
Loader - DEBUG - Ended query to bq [1s, 0.0$]
Loader - DEBUG - Starting bq to gs...
Loader - DEBUG - Ended bq to gs [1s]
Loader - DEBUG - Starting gs to local...
Loader - DEBUG - Ended gs to local [0s]
Loader - DEBUG - Starting local to dataframe...
Loader - DEBUG - Ended local to dataframe [0s]


The logger creating the log records of [google_pandas_load.LoaderQuickSetup](LoaderQuickSetup.rst) is named LoaderQuickSetup. Unlike the Loader logger, it already has a built-in console handler. Thus, without any logging setup, log records are displayed in the console. For instance:

In [43]:
df = gpl_quick_setup.load(
    source='query', 
    destination='dataframe', 
    query='select 1 as x')

2019-03-22 20:34:48,200 - LoaderQuickSetup - DEBUG - Starting query to bq...
2019-03-22 20:34:49,841 - LoaderQuickSetup - DEBUG - Ended query to bq [1s, 0.0$]
2019-03-22 20:34:49,842 - LoaderQuickSetup - DEBUG - Starting bq to gs...
2019-03-22 20:34:51,904 - LoaderQuickSetup - DEBUG - Ended bq to gs [2s]
2019-03-22 20:34:51,906 - LoaderQuickSetup - DEBUG - Starting gs to local...
2019-03-22 20:34:52,404 - LoaderQuickSetup - DEBUG - Ended gs to local [0s]
2019-03-22 20:34:52,405 - LoaderQuickSetup - DEBUG - Starting local to dataframe...
2019-03-22 20:34:52,412 - LoaderQuickSetup - DEBUG - Ended local to dataframe [0s]


This is convenient when scripting for instance. 

In order to avoid duplicate log records in the console, the LoaderQuickSetup logger is set to not propagate its log records to its logger ancestors. 

[google_pandas_load.Loader](Loader.rst) and [google_pandas_load.LoaderQuickSetup](LoaderQuickSetup.rst) both have a parameter logger. The default values are respectively the Loader logger and the LoaderQuickSetup logger. In both cases, you can set up this parameter with another logger. 

This is mainly useful with [google_pandas_load.LoaderQuickSetup](LoaderQuickSetup.rst), to take back control of its log records (for instance to stop displaying them in the console). 
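As a sketch of that last use case, a silent logger can be built with the standard logging module and then passed through the logger parameter. The logger name my_quiet_loader is arbitrary, and the commented call is only the assumed usage suggested by the text above; it is left commented out because it needs real Google Cloud resources.

```python
import logging

# A logger that swallows its records: it has only a NullHandler and
# does not propagate to its ancestors (so nothing reaches the console).
quiet_logger = logging.getLogger('my_quiet_loader')  # arbitrary name
quiet_logger.addHandler(logging.NullHandler())
quiet_logger.propagate = False

# Assumed usage, relying on the logger parameter described above:
# gpl_quick_setup = LoaderQuickSetup(
#     project_id=project_id,
#     dataset_id=dataset_id,
#     bucket_name=bucket_name,
#     gs_dir_path_in_bucket=gs_dir_path_in_bucket,
#     credentials=None,
#     local_dir_path=local_dir_path,
#     logger=quiet_logger)
```

Replacing the NullHandler with a FileHandler would be one way to keep the records in a file instead of discarding them.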