# BigQuery Integration

## Google Authentication

The most common way to authenticate is to use the [default credentials](https://cloud.google.com/docs/authentication/application-default-credentials). Make sure the `GOOGLE_APPLICATION_CREDENTIALS` environment variable points to a credential file. You can also read the BigQuery [authentication doc](https://cloud.google.com/bigquery/docs/authentication). With this approach, every machine in the cluster must have `GOOGLE_APPLICATION_CREDENTIALS` set up.

An alternative is to keep the [service account credential JSON](https://developers.google.com/workspace/guides/create-credentials#service-account) as a string in an environment variable and, in the Fugue configs, set `fugue.bq.credentials.env` to the name of that environment variable. With this approach, you only need to configure your local environment; the credential is propagated to the compute cluster when the application runs.
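
As a minimal sketch of the environment variable approach, the helper below reads a service account key file, stores its content as a string in an environment variable, and returns the matching Fugue config. The helper name `export_credentials` and the variable name `MY_ENV` are hypothetical and chosen only for illustration; only the config key `fugue.bq.credentials.env` comes from this document.

```python
import json
import os
from pathlib import Path

# Hypothetical environment variable name used in this sketch
CREDENTIALS_ENV = "MY_ENV"


def export_credentials(key_file: str, env_name: str = CREDENTIALS_ENV) -> dict:
    """Store the service account JSON as a string in an environment
    variable and return the Fugue config that points to it."""
    raw = Path(key_file).read_text(encoding="utf-8")
    json.loads(raw)  # fail early if the file is not valid JSON
    os.environ[env_name] = raw
    return {"fugue.bq.credentials.env": env_name}
```

The returned dictionary can then be merged with your other Fugue configs. In practice you may prefer to set the variable in your shell or secret manager rather than from Python.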

Some platforms have their own way to inject credentials into your environment. For example, [this guide](https://cloud.google.com/bigquery/docs/connect-databricks) describes how Databricks users enable a service account on a cluster. After following those steps, no extra authentication is needed for Fugue.

## Environment Requirement

You MUST set a dataset to store temporary intermediate tables. The default dataset name is `FUGUE_TEMP_DATASET`; you can change it through the Fugue config `fugue.bq.temp_dataset`. It is strongly recommended to set a default table expiration for this dataset; `1 day` is a reasonable value.
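
One possible way to create such a dataset is with the official `google-cloud-bigquery` client, sketched below. This is not part of Fugue; the helper name `ensure_temp_dataset` is hypothetical, and the sketch assumes `google-cloud-bigquery` is installed and credentials are already configured.

```python
from typing import Optional

# 1 day expressed in milliseconds, the unit BigQuery uses for expiration
ONE_DAY_MS = 24 * 60 * 60 * 1000


def ensure_temp_dataset(
    dataset_id: str = "FUGUE_TEMP_DATASET", project: Optional[str] = None
):
    """Create the temp dataset with a 1-day default table expiration."""
    # Imported lazily so the sketch can be read and imported without the package
    from google.cloud import bigquery

    # project=None lets the client fall back to the default project
    client = bigquery.Client(project=project)
    dataset = bigquery.Dataset(f"{client.project}.{dataset_id}")
    dataset.default_table_expiration_ms = ONE_DAY_MS
    # With exists_ok=True an already existing dataset is returned unchanged,
    # so its expiration setting is NOT updated by this call
    return client.create_dataset(dataset, exists_ok=True)
```

If the dataset already exists, set its default table expiration in the BigQuery console or update it separately.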

You can also specify the project id through the config `fugue.bq.project`; otherwise, the default project id of your account is used.


## The BigQuery Client

`BigQueryClient` is the singleton client that talks to the BigQuery service. You may initialize it explicitly.

If you use the default credentials:

```python
from fugue_bigquery import BigQueryClient

client = BigQueryClient.get_or_create()
```

Or, if you use the environment variable approach (or have other configs):

```python
from fugue_bigquery import BigQueryClient

conf = {
    "fugue.bq.credentials.env":"MY_ENV",
    "fugue.bq.temp_dataset":"my_temp",
}
client = BigQueryClient.get_or_create(conf)
```

Alternatively, you can just add these configs when providing the `engine_conf`.
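
The following sketch shows what passing the configs through `engine_conf` might look like. The config values and the helper name `load_sample_with_conf` are hypothetical, and it is an assumption that `load_table` (described later in this document) accepts `engine_conf` the same way other Fugue API functions do; check the signature in your installed version.

```python
# Hypothetical values; replace them with your own names
BQ_CONF = {
    "fugue.bq.credentials.env": "MY_ENV",
    "fugue.bq.temp_dataset": "my_temp",
    "fugue.bq.project": "my-project",
}


def load_sample_with_conf():
    """Load a small sample with the pandas engine, passing the BigQuery
    configs through engine_conf instead of creating the client first."""
    # Imported lazily; requires fugue_bigquery and valid credentials to run
    import fugue_bigquery.api as fbqa

    return fbqa.load_table(
        "bigquery-public-data.usa_names.usa_1910_2013",
        sample=0.0001,
        engine="pandas",
        engine_conf=BQ_CONF,  # assumed parameter, see the note above
    )
```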

## Using the APIs

### get_schema

This function gets the schema of a table or a query without executing it.

```python
import fugue_bigquery.api as fbqa

fbqa.get_schema("bigquery-public-data.usa_names.usa_1910_2013")
```

```
state:str,gender:str,year:long,name:str,number:long
```

```python
fbqa.get_schema("SELECT COUNT(*) AS ct FROM `bigquery-public-data.usa_names.usa_1910_2013`")
```

```
ct:long
```

### load_table

This function loads the table under the context of the current execution engine.

* If the engine is `BigQueryExecutionEngine`, it will return `BigQueryDataFrame` or the underlying Ibis table, depending on `as_fugue`.
* If the engine is not a distributed engine, it will load the entire result as a local dataframe.
* If the engine is a distributed engine, it will use the engine to load the table content in a distributed way.

```python
fbqa.load_table("bigquery-public-data.usa_names.usa_1910_2013")  # ibis table
```

```python
fbqa.load_table("bigquery-public-data.usa_names.usa_1910_2013", as_fugue=True)  # BigQueryDataFrame
```

| | state:str | gender:str | year:long | name:str | number:long |
|---|---|---|---|---|---|
| 0 | AL | F | 1910 | Sadie | 40 |
| 1 | AL | F | 1910 | Mary | 875 |
| 2 | AR | F | 1910 | Vera | 39 |
| 3 | AR | F | 1910 | Marie | 78 |
| 4 | AR | F | 1910 | Lucille | 66 |
| 5 | CA | F | 1910 | Virginia | 101 |
| 6 | DC | F | 1910 | Margaret | 72 |
| 7 | GA | F | 1910 | Mildred | 133 |
| 8 | GA | F | 1910 | Vera | 51 |
| 9 | GA | F | 1910 | Sallie | 92 |


Note that the `sample` parameter can greatly reduce the amount of data loaded, which saves cost. Use it when you load the table with a different engine:

```python
fbqa.load_table("bigquery-public-data.usa_names.usa_1910_2013", sample=0.0001, engine="pandas")
```

| | state | gender | year | name | number |
|---|---|---|---|---|---|
| 0 | AL | F | 1910 | Hazel | 51 |
| 1 | AL | F | 1910 | Lucy | 76 |
| 2 | AR | F | 1910 | Nellie | 39 |
| 3 | AR | F | 1910 | Lena | 40 |
| 4 | CO | F | 1910 | Thelma | 36 |
| ... | ... | ... | ... | ... | ... |
| 925178 | CO | M | 2013 | Peyton | 35 |
| 925179 | IA | M | 2013 | Zander | 35 |
| 925180 | NE | M | 2013 | Carson | 35 |
| 925181 | OK | M | 2013 | Alan | 35 |


You can use `columns` and `row_filter` to further reduce the amount of data loaded.
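
A hedged sketch of how these two parameters might be combined is shown below. The parameter names `columns` and `row_filter` come from this document, but their exact value formats are assumptions here: a list of column names and a SQL-style predicate string. The helper name `load_recent_ca_names` is hypothetical.

```python
# Assumed formats: a list of column names and a SQL-style predicate string
LOAD_KWARGS = {
    "columns": ["state", "name", "number"],
    "row_filter": "state = 'CA' AND year >= 2000",
    "engine": "pandas",
}


def load_recent_ca_names():
    """Load only three columns of the California rows from 2000 onward."""
    # Imported lazily; requires fugue_bigquery and valid credentials to run
    import fugue_bigquery.api as fbqa

    return fbqa.load_table(
        "bigquery-public-data.usa_names.usa_1910_2013", **LOAD_KWARGS
    )
```

Selecting columns and filtering rows at load time means less data leaves BigQuery, which is usually cheaper than filtering after the load.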

### load_sql

It is similar to `load_table`, but the input is a SQL query. In some cases, the SQL output will be saved to the temp dataset and then loaded in a distributed way.

```python
fbqa.load_sql("""
SELECT COUNT(*) AS ct
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state='CA'
""")  # ibis table
```

```python
fbqa.load_sql("""
SELECT COUNT(*) AS ct
FROM `bigquery-public-data.usa_names.usa_1910_2013` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE state='CA'
""", as_fugue=True)  # BigQueryDataFrame
```

| | ct:long |
|---|---|
| 0 | 58252 |


```python
fbqa.load_sql("""
SELECT name
FROM `bigquery-public-data.usa_names.usa_1910_2013` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE state='CA'
""", engine="pandas")  # LocalDataFrame
```

| | name |
|---|---|
| 0 | Beatrice |
| 1 | Marion |
| 2 | Thelma |
| 3 | Muriel |
| 4 | Anita |
| ... | ... |
| 57712 | Marcelo |
| 57713 | Mathias |
| 57714 | Kadin |
| 57715 | Ronaldo |


## Registered Representations

You can use the tuple `("bq", table_or_sql)` to represent a dataframe that can be used by any Fugue API:

```python
import fugue.api as fa

fa.show(("bq","bigquery-public-data.usa_names.usa_1910_2013"))
```

| | state:str | gender:str | year:long | name:str | number:long |
|---|---|---|---|---|---|
| 0 | AL | F | 1910 | Sadie | 40 |
| 1 | AL | F | 1910 | Mary | 875 |
| 2 | AR | F | 1910 | Vera | 39 |
| 3 | AR | F | 1910 | Marie | 78 |
| 4 | AR | F | 1910 | Lucille | 66 |
| 5 | CA | F | 1910 | Virginia | 101 |
| 6 | DC | F | 1910 | Margaret | 72 |
| 7 | GA | F | 1910 | Mildred | 133 |
| 8 | GA | F | 1910 | Vera | 51 |
| 9 | GA | F | 1910 | Sallie | 92 |


```python
fa.show(("bq","SELECT COUNT(*) AS ct FROM `bigquery-public-data.usa_names.usa_1910_2013`"))
```

| | ct:long |
|---|---|
| 0 | 5552452 |


```python
import pandas as pd
from typing import List, Any

# schema: *
def median(df:pd.DataFrame) -> List[List[Any]]:
    return [[df.state.iloc[0], df.number.median()]]

fa.transform(
    ("bq", """SELECT state, number
    FROM `bigquery-public-data.usa_names.usa_1910_2013` TABLESAMPLE SYSTEM (1 PERCENT)"""),
    median,
    partition="state",
    engine="dask"
).compute()
```

| | state | number |
|---|---|---|
| 0 | AK | 9 |
| 1 | AL | 13 |
| 2 | AR | 13 |
| 3 | AZ | 12 |
| 4 | CA | 13 |
| 5 | CO | 12 |
| 6 | CT | 13 |
| 7 | DC | 11 |
| 8 | DE | 10 |
| 9 | FL | 13 |
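
To make the semantics of the partitioned transform concrete, here is a plain Python sketch of the same computation on a tiny in-memory sample, without Fugue, BigQuery, or Dask. The helper name `median_by_state` and the sample rows are made up for illustration; conceptually, `partition="state"` groups the rows by state and the `median` function runs once per group.

```python
from collections import defaultdict
from statistics import median
from typing import Any, Iterable, List, Tuple


def median_by_state(rows: Iterable[Tuple[str, int]]) -> List[List[Any]]:
    """Group (state, number) rows by state and return one
    [state, median] row per group, sorted by state."""
    groups = defaultdict(list)
    for state, number in rows:
        groups[state].append(number)
    return [[state, median(numbers)] for state, numbers in sorted(groups.items())]


# Made-up sample rows, not real query results
rows = [("CA", 10), ("CA", 14), ("CA", 30), ("AK", 5), ("AK", 9), ("AK", 7)]
print(median_by_state(rows))  # [['AK', 7], ['CA', 14]]
```

In the Fugue version, the grouping and the per-group calls are handled by the execution engine, so the same `median` function can run on Dask or another backend without changes.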
