# Google BigQuery

1. Installing and importing libraries
2. Initializing environment variables
3. Connecting to the BigQuery data warehouse
4. Creating tables and populating them with data from the SQL scripts (1 to 4)
5. Creating a table and importing data from a CSV file
6. Creating a table over external data in Google Cloud Storage
7. Creating views and materialized views

<div class="alert alert-info">
     
**Notes**

- To run the notebook, you need a Google Cloud account and a credentials file that grants access, plus a Cloud Storage bucket to hold the data. The bucket and the BigQuery dataset must be in the same GCP region.

- The sample files are in the `dados` folder. Two of the data files are used for the external table and for the CSV import into BigQuery.

- Change the dataset name in the `dataset_id` variable and inside the SQL files (1 to 4).

</div>


## Initial steps


### Installing the libraries used in the notebook


In [2]:
%pip install --upgrade google-cloud-bigquery python-dotenv

Note: you may need to restart the kernel to use updated packages.


### Importing the libraries used in the notebook and loading the environment


In [2]:
from google.cloud import bigquery
from dotenv import load_dotenv
import os

load_dotenv()

True

### Setting the GCP credentials and project variables


In [3]:
credentials_path = os.getenv('GOOGLE_APPLICATION_CREDENTIALS_PATH')
project_id = 'vendas-401823'
dataset_id = 'northwind2'
cloud_storage_path = os.getenv('BIGQUERY_CS_BUCKET')
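
`os.getenv` returns `None` when a variable is missing, and that only surfaces later as a confusing error. The sketch below fails early instead. The helper name `require_env` and the placeholder bucket value are assumptions made for illustration; only the variable name `BIGQUERY_CS_BUCKET` comes from the cell above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Demo only: provide a placeholder so the snippet runs on its own.
os.environ.setdefault("BIGQUERY_CS_BUCKET", "gs://example-bucket/")

bucket = require_env("BIGQUERY_CS_BUCKET")
print(bucket)
```

In the notebook, the same helper could wrap both `os.getenv` calls so that a missing `.env` entry is reported by name.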

### Connecting to Google BigQuery


In [4]:
client = bigquery.Client.from_service_account_json(credentials_path)

## Creating tables and inserting data in the BigQuery data warehouse

### Creating the tables in BigQuery


In [13]:
table_id = 'categories'
schema = [
    bigquery.SchemaField('category_id', 'INTEGER', mode='required'),
    bigquery.SchemaField('category_name', 'STRING', mode='required'),
    bigquery.SchemaField('description', 'STRING')
]

table_ref = client.dataset(dataset_id).table(table_id)
table = bigquery.Table(table_ref, schema=schema)
table = client.create_table(table)

In [52]:
table_id = 'customers'
schema = [
    bigquery.SchemaField('customer_id', 'STRING', mode='required'),
    bigquery.SchemaField('company_name', 'STRING', mode='required'),
    bigquery.SchemaField('contact_name', 'STRING'),
    bigquery.SchemaField('contact_title', 'STRING'),
    bigquery.SchemaField('address', 'STRING'),
    bigquery.SchemaField('city', 'STRING'),
    bigquery.SchemaField('region', 'STRING'),
    bigquery.SchemaField('postal_code', 'STRING'),
    bigquery.SchemaField('country', 'STRING'),
    bigquery.SchemaField('phone', 'STRING'),

]

table_ref = client.dataset(dataset_id).table(table_id)
table = bigquery.Table(table_ref, schema=schema)
table = client.create_table(table)

In [21]:
table_id = 'employees'
schema = [
    bigquery.SchemaField('employee_id', 'INTEGER', mode='required'),
    bigquery.SchemaField('last_name', 'STRING', mode='required'),
    bigquery.SchemaField('first_name', 'STRING', mode='required'),
    bigquery.SchemaField('title', 'STRING'),
    bigquery.SchemaField('title_of_courtesy', 'STRING'),
    bigquery.SchemaField('birth_date', 'TIMESTAMP'),
    bigquery.SchemaField('hire_date', 'DATE'),
    bigquery.SchemaField('address', 'STRING'),
    bigquery.SchemaField('city', 'STRING'),
    bigquery.SchemaField('region', 'STRING'),
    bigquery.SchemaField('postal_code', 'STRING'),
    bigquery.SchemaField('country', 'STRING'),
    bigquery.SchemaField('home_phone', 'STRING'),
    bigquery.SchemaField('extension', 'STRING'),
    bigquery.SchemaField('notes', 'STRING'),
    bigquery.SchemaField('reports_to', 'INTEGER'),
    bigquery.SchemaField('photo_path', 'STRING'),
    bigquery.SchemaField('salary', 'FLOAT')
]

table_ref = client.dataset(dataset_id).table(table_id)
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="birth_date",
    expiration_ms=1036800000,
    require_partition_filter=True,
)
table = client.create_table(table)
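
The `expiration_ms` value in the cell above is easier to read once converted to days. This is a minimal sketch of the arithmetic using only the standard library; the helper name `days_to_ms` is mine and is not part of the BigQuery client.

```python
from datetime import timedelta

expiration_ms = 1_036_800_000  # value used in the cell above

# Convert milliseconds to a timedelta to read it in days.
expiration = timedelta(milliseconds=expiration_ms)
print(expiration.days)


def days_to_ms(days: int) -> int:
    """Convert a number of days to milliseconds."""
    return days * 24 * 60 * 60 * 1000


print(days_to_ms(12) == expiration_ms)
```

My understanding is that the expiration is counted from each partition's own date rather than from the load time, so it is worth double-checking that a 12-day window is really what is wanted for a column such as `birth_date`.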

<div class="alert alert-info">
     
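**Note**

- This table is created with a SQL script so that it is partitioned by ingestion time on the `_PARTITIONTIME` pseudo-column. The Python client can also request ingestion-time partitioning by creating a `TimePartitioning` without a `field`, so the SQL route is a choice rather than a requirement.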

</div>


In [40]:
sql = """
CREATE TABLE """ + dataset_id + """.order_details (
    order_id smallint NOT NULL,
    product_id smallint NOT NULL,
    unit_price FLOAT64 NOT NULL,
    quantity smallint NOT NULL,
    discount FLOAT64 NOT NULL
)
PARTITION BY TIMESTAMP_TRUNC(_PARTITIONTIME, HOUR);
"""

job = client.query(sql)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7ffaf27e54d0>
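
To make the hourly ingestion-time partitioning more concrete, the sketch below mimics what truncating a timestamp to the hour looks like and formats it as `YYYYMMDDHH`, the form used for hourly partition identifiers. It is a local illustration with a made-up timestamp and does not call BigQuery.

```python
from datetime import datetime, timezone


def hourly_partition_id(ts: datetime) -> str:
    """Truncate a timezone-aware timestamp to the hour in UTC, as YYYYMMDDHH."""
    ts_utc = ts.astimezone(timezone.utc)
    return ts_utc.strftime("%Y%m%d%H")


# Two made-up ingestion times within the same hour share a partition.
first = datetime(2023, 10, 15, 14, 5, 0, tzinfo=timezone.utc)
second = datetime(2023, 10, 15, 14, 59, 59, tzinfo=timezone.utc)

print(hourly_partition_id(first))
print(hourly_partition_id(first) == hourly_partition_id(second))
```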

In [56]:
table_id = 'orders'
schema = [
    bigquery.SchemaField('order_id', 'INTEGER'),
    bigquery.SchemaField('customer_id', 'STRING'),
    bigquery.SchemaField('employee_id', 'INTEGER'),
    bigquery.SchemaField('order_date', 'DATE'),
    bigquery.SchemaField('required_date', 'DATE'),
    bigquery.SchemaField('shipped_date', 'DATE'),
    bigquery.SchemaField('ship_via', 'INTEGER'),
    bigquery.SchemaField('freight', 'FLOAT'),
    bigquery.SchemaField('ship_name', 'STRING'),
    bigquery.SchemaField('ship_address', 'STRING'),
    bigquery.SchemaField('ship_city', 'STRING'),
    bigquery.SchemaField('ship_region', 'STRING'),
    bigquery.SchemaField('ship_postal_code', 'STRING'),
    bigquery.SchemaField('ship_country', 'STRING')
]

table_ref = client.dataset(dataset_id).table(table_id)
table = bigquery.Table(table_ref, schema=schema)
table.range_partitioning = bigquery.RangePartitioning(
    range_=bigquery.PartitionRange(start=10300, end=12000, interval=100),
    field="order_id",
)
table = client.create_table(table)
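
Integer range partitioning places each row in a bucket derived from `start`, `end` and `interval`. The sketch below reproduces that bucketing locally for the settings used in the cell above, with made-up `order_id` values. The function name `range_partition` is hypothetical; the special names `__NULL__` and `__UNPARTITIONED__` are the ones BigQuery uses for null and out-of-range values.

```python
from typing import Optional


def range_partition(value: Optional[int], start: int, end: int, interval: int) -> str:
    """Return the partition a value would fall into (the end bound is exclusive)."""
    if value is None:
        return "__NULL__"
    if value < start or value >= end:
        return "__UNPARTITIONED__"
    # Each partition is named after the lower bound of its range.
    lower = start + ((value - start) // interval) * interval
    return str(lower)


for order_id in (10248, 10300, 10399, 11077, 12000, None):
    print(order_id, "->", range_partition(order_id, start=10300, end=12000, interval=100))
```

Rows that land in `__UNPARTITIONED__` are still stored and queryable, but they do not benefit from partition pruning, so the bounds should cover the identifiers actually present in the data.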

In [37]:
table_id = 'products'
schema = [
    bigquery.SchemaField('product_id', 'INTEGER', mode='required'),
    bigquery.SchemaField('product_name', 'STRING', mode='required'),
    bigquery.SchemaField('supplier_id', 'INTEGER'),
    bigquery.SchemaField('category_id', 'INTEGER'),
    bigquery.SchemaField('quantity_per_unit', 'STRING'),
    bigquery.SchemaField('unit_price', 'FLOAT'),
    bigquery.SchemaField('units_in_stock', 'INTEGER'),
    bigquery.SchemaField('units_on_order', 'INTEGER'),
    bigquery.SchemaField('reorder_level', 'INTEGER'),
    bigquery.SchemaField('discontinued', 'INTEGER', mode='required'),
]

table_ref = client.dataset(dataset_id).table(table_id)
table = bigquery.Table(table_ref, schema=schema)
table.range_partitioning = bigquery.RangePartitioning(
    range_=bigquery.PartitionRange(start=1, end=10001, interval=10),
    field="product_id",
)
table.clustering_fields = ['category_id']
table = client.create_table(table)

In [38]:

table_id = 'shippers'
schema = [
    bigquery.SchemaField('shipper_id', 'INTEGER', mode='required'),
    bigquery.SchemaField('company_name', 'STRING', mode='required'),
    bigquery.SchemaField('phone', 'STRING')
]

table_ref = client.dataset(dataset_id).table(table_id)
table = bigquery.Table(table_ref, schema=schema)
table = client.create_table(table)

In [39]:
table_id = 'suppliers'
schema = [
    bigquery.SchemaField('supplier_id', 'INTEGER', mode='required'),
    bigquery.SchemaField('company_name', 'STRING', mode='required'),
    bigquery.SchemaField('contact_name', 'STRING'),
    bigquery.SchemaField('contact_title', 'STRING'),
    bigquery.SchemaField('address', 'STRING'),
    bigquery.SchemaField('city', 'STRING'),
    bigquery.SchemaField('region', 'STRING'),
    bigquery.SchemaField('postal_code', 'STRING'),
    bigquery.SchemaField('country', 'STRING'),
    bigquery.SchemaField('phone', 'STRING'),
    bigquery.SchemaField('fax', 'STRING'),
    bigquery.SchemaField('homepage', 'STRING'),
]

table_ref = client.dataset(dataset_id).table(table_id)
table = bigquery.Table(table_ref, schema=schema)
table = client.create_table(table)

### Inserting data into the BigQuery tables with the SQL scripts (1 to 4)


In [58]:
with open('1.categories_shippers_suppliers.sql', 'r') as file:
    query1 = file.read()

with open('2.customers_employees_products.sql', 'r') as file:
    query2 = file.read()

with open('3.orders.sql', 'r') as file:
    query3 = file.read()

with open('4.orderdetails1.sql', 'r') as file:
    query4 = file.read()

with open('4.orderdetails2.sql', 'r') as file:
    query5 = file.read()

job = client.query(query1)
job.result()

job = client.query(query2)
job.result()

job = client.query(query3)
job.result()

job = client.query(query4)
job.result()

job = client.query(query5)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7ffaf1e56050>
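
The read-then-run pattern above repeats once per file. A possible way to fold it into one helper is sketched below. The function `run_sql_files` and its `execute` parameter are assumptions for illustration, and the demo uses throwaway files and a fake executor so that it runs without BigQuery access.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def run_sql_files(paths: Iterable[Path], execute: Callable[[str], object]) -> List[str]:
    """Read each SQL file in order and pass its text to `execute`.

    An exception raised by `execute` stops the loop, so later files do not run.
    """
    executed = []
    for path in paths:
        sql = Path(path).read_text(encoding="utf-8")
        execute(sql)
        executed.append(Path(path).name)
    return executed


# Demo: two temporary files and a list that records what would be executed.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "1.first.sql"
    second = Path(tmp) / "2.second.sql"
    first.write_text("select 1", encoding="utf-8")
    second.write_text("select 2", encoding="utf-8")
    seen = []
    done = run_sql_files([first, second], seen.append)

print(done)
print(seen)
```

In the notebook, the executor could be `lambda sql: client.query(sql).result()`, with the five script names passed in the order shown above.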

In [59]:
query = f"""
select * from `{dataset_id}.employees` where date(birth_date) between '1900-01-01' and '1980-12-31' order by employee_id
"""
query_job = client.query(query)

for row in query_job:
    print(row)

Row((1, 'Davolio', 'Nancy', 'Sales Representative', 'Ms.', datetime.datetime(1948, 12, 8, 0, 0, tzinfo=datetime.timezone.utc), datetime.date(1992, 5, 1), '507 - 20th Ave. E.Apt. 2A', 'Seattle', 'WA', '98122', 'USA', '(206) 555-9857', '5467', 'Education includes a BA in psychology from Colorado State University in 1970.  She also completed The Art of the Cold Call.  Nancy is a member of Toastmasters International.', 2, 'http://accweb/emmployees/davolio.bmp', 2954.55), {'employee_id': 0, 'last_name': 1, 'first_name': 2, 'title': 3, 'title_of_courtesy': 4, 'birth_date': 5, 'hire_date': 6, 'address': 7, 'city': 8, 'region': 9, 'postal_code': 10, 'country': 11, 'home_phone': 12, 'extension': 13, 'notes': 14, 'reports_to': 15, 'photo_path': 16, 'salary': 17})
Row((1, 'Davolio', 'Nancy', 'Sales Representative', 'Ms.', datetime.datetime(1948, 12, 8, 0, 0, tzinfo=datetime.timezone.utc), datetime.date(1992, 5, 1), '507 - 20th Ave. E.Apt. 2A', 'Seattle', 'WA', '98122', 'USA', '(206) 555-9857', '5

### Creating a pivot table


In [62]:
query = f"""
with cte_mes as (
  select 
  employee_id,
  extract(month FROM birth_date) as mes_nascimento
  from `{dataset_id}.employees`
  where extract(year FROM birth_date) between 1900 and 1990
)
select 
mes_nascimento,
sum(case when mes_nascimento = 1 then 1 else 0 end) as Janeiro,
sum(case when mes_nascimento = 2 then 1 else 0 end) as Fevereiro,
sum(case when mes_nascimento = 3 then 1 else 0 end) as Marco,
sum(case when mes_nascimento = 4 then 1 else 0 end) as Abril,
sum(case when mes_nascimento = 5 then 1 else 0 end) as Maio,
sum(case when mes_nascimento = 6 then 1 else 0 end) as Junho,
sum(case when mes_nascimento = 7 then 1 else 0 end) as Julho,
sum(case when mes_nascimento = 8 then 1 else 0 end) as Agosto,
sum(case when mes_nascimento = 9 then 1 else 0 end) as Setembro,
sum(case when mes_nascimento = 10 then 1 else 0 end) as Outubro,
sum(case when mes_nascimento = 11 then 1 else 0 end) as Novembro,
sum(case when mes_nascimento = 12 then 1 else 0 end) as Dezembro
from cte_mes
group by mes_nascimento
order by mes_nascimento
"""
query_job = client.query(query)

for row in query_job:
    print(row)

Row((1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), {'mes_nascimento': 0, 'Janeiro': 1, 'Fevereiro': 2, 'Marco': 3, 'Abril': 4, 'Maio': 5, 'Junho': 6, 'Julho': 7, 'Agosto': 8, 'Setembro': 9, 'Outubro': 10, 'Novembro': 11, 'Dezembro': 12})
Row((2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), {'mes_nascimento': 0, 'Janeiro': 1, 'Fevereiro': 2, 'Marco': 3, 'Abril': 4, 'Maio': 5, 'Junho': 6, 'Julho': 7, 'Agosto': 8, 'Setembro': 9, 'Outubro': 10, 'Novembro': 11, 'Dezembro': 12})
Row((3, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0), {'mes_nascimento': 0, 'Janeiro': 1, 'Fevereiro': 2, 'Marco': 3, 'Abril': 4, 'Maio': 5, 'Junho': 6, 'Julho': 7, 'Agosto': 8, 'Setembro': 9, 'Outubro': 10, 'Novembro': 11, 'Dezembro': 12})
Row((9, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0), {'mes_nascimento': 0, 'Janeiro': 1, 'Fevereiro': 2, 'Marco': 3, 'Abril': 4, 'Maio': 5, 'Junho': 6, 'Julho': 7, 'Agosto': 8, 'Setembro': 9, 'Outubro': 10, 'Novembro': 11, 'Dezembro': 12})
Row((12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2), {'mes_nascimento': 0, 'Jan
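
Because the query above groups by `mes_nascimento`, each output row has a single non-zero month column. Dropping the `group by` collapses the result into one row with one count per month. The sketch below shows that variant of the conditional-aggregation pattern on made-up rows, with SQLite standing in for BigQuery so that it runs locally; the table layout is simplified and only three months are listed to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employees (employee_id integer, birth_month integer)")
conn.executemany(
    "insert into employees values (?, ?)",
    [(1, 12), (2, 2), (3, 8), (4, 9), (5, 3), (6, 2)],  # made-up rows
)

# One sum(case ...) per output column, with no group by: a single pivoted row.
pivot_sql = """
select
  sum(case when birth_month = 2 then 1 else 0 end) as fevereiro,
  sum(case when birth_month = 3 then 1 else 0 end) as marco,
  sum(case when birth_month = 12 then 1 else 0 end) as dezembro
from employees
"""
row = conn.execute(pivot_sql).fetchone()
print(row)
conn.close()
```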

### Computing an average with a subquery


In [63]:
query = f"""
    select 
    product_id,
    product_name,
    (select avg(unit_price) from `{dataset_id}.products` as products2 where products2.category_id = products.category_id) as preco_medio
    from `{dataset_id}.products` products
    order by preco_medio asc
"""
query_job = client.query(query)

for row in query_job:
    print(row)

Row((52, 'Filo Mix', 20.25), {'product_id': 0, 'product_name': 1, 'preco_medio': 2})
Row((57, 'Ravioli Angelo', 20.25), {'product_id': 0, 'product_name': 1, 'preco_medio': 2})
Row((56, 'Gnocchi di nonna Alice', 20.25), {'product_id': 0, 'product_name': 1, 'preco_medio': 2})
Row((64, 'Wimmers gute Semmelkndel', 20.25), {'product_id': 0, 'product_name': 1, 'preco_medio': 2})
Row((22, 'Gustafs Knckebrd', 20.25), {'product_id': 0, 'product_name': 1, 'preco_medio': 2})
Row((23, 'Tunnbrd', 20.25), {'product_id': 0, 'product_name': 1, 'preco_medio': 2})
Row((42, 'Singaporean Hokkien Fried Mee', 20.25), {'product_id': 0, 'product_name': 1, 'preco_medio': 2})
Row((58, 'Escargots de Bourgogne', 20.682499999999997), {'product_id': 0, 'product_name': 1, 'preco_medio': 2})
Row((73, 'Rd Kaviar', 20.682499999999997), {'product_id': 0, 'product_name': 1, 'preco_medio': 2})
Row((30, 'Nord-Ost Matjeshering', 20.682499999999997), {'product_id': 0, 'product_name': 1, 'preco_medio': 2})
Row((37, 'Gravad la
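
The correlated subquery computes the category average once per product row. The same result can usually be obtained by joining against a grouped subquery. The sketch below compares the two forms on made-up data, again using SQLite as a local stand-in, so it says nothing about how BigQuery plans or prices either query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table products (product_id integer, category_id integer, unit_price real)"
)
conn.executemany(
    "insert into products values (?, ?, ?)",
    [(1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0), (4, 2, 15.0), (5, 2, 40.0)],  # made-up rows
)

# Form used in the notebook: a correlated subquery per product row.
correlated = """
select p.product_id,
       (select avg(p2.unit_price) from products p2
        where p2.category_id = p.category_id) as preco_medio
from products p
order by p.product_id
"""

# Alternative: aggregate once per category, then join back to the products.
joined = """
select p.product_id, c.preco_medio
from products p
join (select category_id, avg(unit_price) as preco_medio
      from products group by category_id) c
  on c.category_id = p.category_id
order by p.product_id
"""

rows_correlated = conn.execute(correlated).fetchall()
rows_joined = conn.execute(joined).fetchall()
print(rows_correlated)
print(rows_correlated == rows_joined)
conn.close()
```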

### Creating a table from a CSV file


In [10]:
table_id = 'shippers2'
csv_file_path = 'manualtable.csv'
table_ref = client.dataset(dataset_id).table(table_id)

schema = [
    bigquery.SchemaField('shipper_id', 'INTEGER', mode='required'),
    bigquery.SchemaField('company_name', 'STRING', mode='required'),
    bigquery.SchemaField('phone', 'STRING')
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=schema,
)

with open(csv_file_path, 'rb') as source_file:
    job = client.load_table_from_file(
        source_file, table_ref, job_config=job_config)

job.result()

LoadJob<project=vendas-401823, location=us-central1, id=182f0bb4-e882-4282-9f62-6e2f2a5528a5>

In [13]:
query = f"""
    SELECT * FROM `{dataset_id}.shippers2`
"""
query_job = client.query(query)

for row in query_job:
    print(row)

Row((1, 'Speedy Express', '(503) 555-9831'), {'shipper_id': 0, 'company_name': 1, 'phone': 2})
Row((2, 'United Package', '(503) 555-3199'), {'shipper_id': 0, 'company_name': 1, 'phone': 2})
Row((3, 'Federal Shipping', '(503) 555-9931'), {'shipper_id': 0, 'company_name': 1, 'phone': 2})
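
A load job rejects rows that do not match the schema, so it can help to check the CSV locally first. The sketch below validates text against the three-column `shippers2` schema. The function `check_shippers_csv` and the two malformed sample rows are invented for illustration; the first row is taken from the output above.

```python
import csv
import io
from typing import List, Tuple


def check_shippers_csv(text: str, has_header: bool = False) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for rows that would not fit the schema."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    for line_number, row in enumerate(reader, start=1):
        if has_header and line_number == 1:
            continue
        if len(row) != 3:
            problems.append((line_number, f"expected 3 columns, got {len(row)}"))
            continue
        shipper_id, company_name, _phone = row
        # shipper_id is INTEGER and required in the schema.
        if not shipper_id.strip().lstrip("-").isdigit():
            problems.append((line_number, "shipper_id is not an integer"))
        # company_name is STRING and required in the schema.
        if not company_name.strip():
            problems.append((line_number, "company_name is required"))
    return problems


sample = "1,Speedy Express,(503) 555-9831\nx,United Package,(503) 555-3199\n3,,\n"
print(check_shippers_csv(sample))
```

If the file starts with a header row, `LoadJobConfig` accepts `skip_leading_rows=1` so that the header is not loaded as data.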


### Creating an external table from a Parquet file in Cloud Storage


In [29]:
table_id = f'{project_id}.{dataset_id}.employees2'
external_source_format = "PARQUET"
source_uri = f'{cloud_storage_path}employees.parquet'
external_config = bigquery.ExternalConfig(external_source_format)
external_config.source_uris = [source_uri]  # the API expects a list of URIs
external_config.autodetect = True

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
table = client.create_table(table)
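
The `source_uri` above is built by plain concatenation, so it is only valid when `BIGQUERY_CS_BUCKET` ends with a slash. A small helper can make that independent of how the variable was written. The name `gcs_uri` and the bucket `gs://example-bucket` are placeholders, not values from the notebook.

```python
def gcs_uri(bucket_path: str, object_name: str) -> str:
    """Join a gs:// bucket path and an object name with exactly one slash."""
    if not bucket_path.startswith("gs://"):
        raise ValueError("bucket path must start with gs://")
    return bucket_path.rstrip("/") + "/" + object_name.lstrip("/")


# Both spellings of the bucket path give the same URI.
print(gcs_uri("gs://example-bucket", "employees.parquet"))
print(gcs_uri("gs://example-bucket/", "employees.parquet"))
```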

In [31]:
query = f"""
    SELECT *, _FILE_NAME as fn FROM `{dataset_id}.employees2`
"""
query_job = client.query(query)

for row in query_job:
    print(row)

Row((1, 'Davolio', 'Nancy', 'Sales Representative', 'Ms.', datetime.datetime(1948, 12, 8, 0, 0, tzinfo=datetime.timezone.utc), datetime.datetime(1992, 5, 1, 0, 0, tzinfo=datetime.timezone.utc), '507 - 20th Ave. E.Apt. 2A', 'Seattle', 'WA', '98122', 'USA', '(206) 555-9857', 5467, 'Education includes a BA in psychology from Colorado State University in 1970.  She also completed "The Art of the Cold Call."  Nancy is a member of Toastmasters International.', 2, 'http://accweb/emmployees/davolio.bmp', 2954.55, 'gs://dados-curso/employees.parquet'), {'EmployeeID': 0, 'LastName': 1, 'FirstName': 2, 'Title': 3, 'TitleOfCourtesy': 4, 'BirthDate': 5, 'HireDate': 6, 'Address': 7, 'City': 8, 'Region': 9, 'PostalCode': 10, 'Country': 11, 'HomePhone': 12, 'Extension': 13, 'Notes': 14, 'ReportsTo': 15, 'PhotoPath': 16, 'Salary': 17, 'fn': 18})
Row((2, 'Fuller', 'Andrew', 'Vice President, Sales', 'Dr.', datetime.datetime(1952, 2, 19, 0, 0, tzinfo=datetime.timezone.utc), datetime.datetime(1992, 8, 14, 

## Views and materialized views in BigQuery

### Creating the view and the materialized view


In [32]:
sql = f"""
create view `{dataset_id}.sales_by_year` as
select
  employee_id, 
  extract(year from order_date) as order_year,
  sum(unit_price * quantity) as total_sales
from `{dataset_id}.orders` as orders
join `{dataset_id}.order_details` as order_details on orders.order_id = order_details.order_id
group by employee_id, order_year
"""

job = client.query(sql)
job.result()

sql = f"""
create materialized view `{dataset_id}.sales_by_year_mat` cluster by employee_id as
select
  employee_id, 
  extract(year from order_date) as order_year,
  sum(unit_price * quantity) as total_sales
from `{dataset_id}.orders` as orders
join `{dataset_id}.order_details` as order_details on orders.order_id = order_details.order_id
group by employee_id, order_year
"""

job = client.query(sql)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f495f970a10>
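
Running the cell above a second time fails because both objects already exist. One option is to generate DDL that tolerates reruns, using `create or replace view` for the logical view and `create materialized view if not exists` for the materialized one. The sketch below only builds the SQL text; the helper `create_view_sql` is hypothetical and the dataset and view names in the demo are placeholders.

```python
def create_view_sql(
    dataset_id: str, view_name: str, select_sql: str, materialized: bool = False
) -> str:
    """Build a rerun-safe DDL statement for a view or a materialized view."""
    target = f"`{dataset_id}.{view_name}`"
    if materialized:
        head = f"create materialized view if not exists {target} as"
    else:
        head = f"create or replace view {target} as"
    return f"{head}\n{select_sql.strip()}"


select_sql = """
select employee_id, count(*) as total_orders
from `example_dataset.orders`
group by employee_id
"""

print(create_view_sql("example_dataset", "orders_by_employee", select_sql))
print(create_view_sql("example_dataset", "orders_by_employee_mat", select_sql, materialized=True))
```

The returned string can be passed to `client.query(...)` in the same way as the statements above. Note that `if not exists` keeps an existing materialized view untouched, so a changed definition would still require dropping it first.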

### Querying the view and the materialized view


In [33]:
query = f"""
    select * from `{dataset_id}.sales_by_year`;
"""
query_job = client.query(query)

for row in query_job:
    print(row)

Row((1, 2021, 53546.38), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((1, 2020, 38789.0), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((2, 2020, 22834.699999999997), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((2, 2021, 43684.3), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((3, 2021, 66836.5), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((3, 2020, 19231.8), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((4, 2020, 53114.799999999996), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((4, 2021, 79795.49), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((5, 2021, 17362.199999999997), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((5, 2020, 21965.2), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((6, 2020, 17731.1), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((6, 2021, 20081.54), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((7, 2020, 18

In [34]:
query = f"""
    select * from `{dataset_id}.sales_by_year_mat`;
"""
query_job = client.query(query)

for row in query_job:
    print(row)

Row((9, 2020, 11365.7), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((6, 2020, 17731.1), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((5, 2020, 21965.2), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((4, 2020, 53114.799999999996), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((7, 2020, 18104.800000000003), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((2, 2020, 22834.699999999997), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((3, 2020, 19231.8), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((1, 2020, 38789.0), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((8, 2020, 23161.4), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((9, 2021, 7519.6), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((4, 2021, 79795.48999999999), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((1, 2021, 53546.38), {'employee_id': 0, 'order_year': 1, 'total_sales': 2})
Row((6, 2