## BigQuery 1 & 2

### Things to do before lecture

1. Install the required packages: `pip3 install google-cloud-bigquery google-cloud-bigquery-storage pyarrow tqdm ipywidgets pandas matplotlib db-dtypes pandas-gbq`
2. Authenticate with gcloud: `gcloud auth application-default login --scopes=openid,https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/drive.readonly`
3. Start JupyterLab on your VM: `python3 -m jupyterlab --no-browser`
4. Establish an SSH tunnel for port 8888

In [None]:
project = "cs544-spring2024" 
# this name will probably be different for you

In [None]:
# import statement
from google.cloud import bigquery

In [None]:
# bigquery Client
bq = bigquery.Client(project=project)

In [None]:
q = bq.query(
"""
SELECT geo_id, county_name 
FROM `bigquery-public-data.geo_us_boundaries.counties` 
WHERE county_name = 'Dane'
"""
)
q.to_dataframe()

## Structure

A "project" contains "datasets", and each dataset contains "tables".
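
The same hierarchy shows up in fully qualified table IDs, which take the form `project.dataset.table`. As a minimal sketch, the ID from the query above can be taken apart as follows. The `TableRef` type and the `split_table_id` helper are hypothetical names used only for illustration; they are not part of the BigQuery client library.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """Hypothetical container for the three levels of the hierarchy."""
    project: str
    dataset: str
    table: str


def split_table_id(table_id: str) -> TableRef:
    """Split 'project.dataset.table' into its three parts."""
    parts = table_id.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected project.dataset.table, got {table_id!r}")
    return TableRef(*parts)


ref = split_table_id("bigquery-public-data.geo_us_boundaries.counties")
print(ref.project)  # the project that owns the data
print(ref.dataset)  # a dataset inside that project
print(ref.table)    # a table inside that dataset
```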

#### What datasets do I have in my project?

In [None]:
for ds in bq.list_datasets():
    print(ds.dataset_id)

### Dataset creation

In [None]:
ds = bigquery.Dataset(f"{project}.lec_demo")
# ds.location = "us-central1"
bq.create_dataset(ds, exists_ok=True)

### Public datasets

In [None]:
for ds in bq.list_datasets("bigquery-public-data"):
    print(ds.dataset_id)

### List tables

In [None]:
for table in bq.list_tables("bigquery-public-data.github_repos"):
    print(table.table_id)

### Running queries: three options

1. Using the `%%bigquery` extension (cell magic)
2. Using the extension and storing the result in a DataFrame
3. Using the Python API

#### Extension access

In [None]:
%load_ext google.cloud.bigquery

#### OPTION 1: Run a query using `%%bigquery`

In [None]:
%%bigquery
SELECT *
FROM `bigquery-public-data.github_repos.languages`
LIMIT 5

#### OPTION 2: Save a query result into `df` using `%%bigquery df`

In [None]:
%%bigquery df
SELECT *
FROM `bigquery-public-data.github_repos.languages`
LIMIT 5

In [None]:
df

#### OPTION 3: Python API

In [None]:
no_cache = bigquery.QueryJobConfig(use_query_cache=False)

In [None]:
q = bq.query("""
SELECT *
FROM `bigquery-public-data.github_repos.languages`
LIMIT 5
""", job_config=no_cache)

In [None]:
# DataFrame
q.to_dataframe()

#### Total bytes processed and billed (in MB)

In [None]:
q.total_bytes_processed / 1024**2 # MB

In [None]:
q.total_bytes_billed / 1024**2 # MB

#### How many times can we do this in the free tier?

In [None]:
tb = 1024**4  # the free tier covers 1 TB of query data per month
tb / q.total_bytes_billed

#### How much will it cost per query after that, in, say, Tokyo?

Source: https://cloud.google.com/bigquery/pricing#on_demand_pricing

In [None]:
price_per_tb = ???
q.total_bytes_billed / tb * price_per_tb

### Pricing factors

1. You pay for storage too (not just queries).
2. Each query is billed for a minimum of 10 MB.
3. Billed bytes are rounded up to the nearest 1 MB per query.
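
Factors 2 and 3 explain why `total_bytes_billed` can exceed `total_bytes_processed`. The following is a rough sketch of that arithmetic, not BigQuery's actual billing code. It assumes that "MB" means `1024**2` bytes (as in the cells above), that the minimum applies once per query, and that a query processing zero bytes is not billed. The function names are hypothetical, and `price_per_tb` must be taken from the pricing page.

```python
MB = 1024**2  # the notebook divides by 1024**2, so "MB" here means 2**20 bytes
TB = 1024**4


def billed_bytes(processed_bytes: int) -> int:
    """Estimate billed bytes from processed bytes (factors 2 and 3)."""
    if processed_bytes <= 0:
        # Assumption: a query that processes nothing is not billed.
        return 0
    rounded = (processed_bytes + MB - 1) // MB * MB  # round up to a whole MB
    return max(rounded, 10 * MB)                     # apply the 10 MB minimum


def queries_per_free_tb(processed_bytes: int) -> int:
    """How many identical queries fit into 1 TB of billed bytes."""
    return TB // billed_bytes(processed_bytes)


def query_cost(processed_bytes: int, price_per_tb: float) -> float:
    """Cost of one query at a given on-demand price per TB."""
    return billed_bytes(processed_bytes) / TB * price_per_tb


print(billed_bytes(1) // MB)            # 10: the minimum applies
print(billed_bytes(10 * MB + 1) // MB)  # 11: rounded up to the next MB
print(queries_per_free_tb(1))           # 104857
```

Under these assumptions, even a tiny query consumes 10 MB of the free tier, so the number of free queries per month is bounded at roughly 105 thousand.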

### `open-lambda` repositories

In [None]:
%%bigquery
SELECT *
FROM `bigquery-public-data.github_repos.languages`
WHERE repo_name LIKE 'open-lambda/%'

### Inspecting types

### `ARRAY` of `STRUCT`s aka `REPEATED RECORD`s 

#### Get the first language.

In [None]:
%%bigquery
SELECT repo_name, language[OFFSET(0)] AS first
FROM `bigquery-public-data.github_repos.languages`
WHERE repo_name LIKE 'open-lambda/%'

#### Get the last language.

In [None]:
%%bigquery
SELECT repo_name, language[OFFSET(ARRAY_LENGTH(language) - 1)] AS last
FROM `bigquery-public-data.github_repos.languages`
WHERE repo_name LIKE 'open-lambda/%'

#### Get the names of the first and the last languages.

In [None]:
%%bigquery
SELECT repo_name,
       language[OFFSET(0)].name AS first,
       language[OFFSET(ARRAY_LENGTH(language) - 1)].name AS last
FROM `bigquery-public-data.github_repos.languages`
WHERE repo_name LIKE 'open-lambda/%'
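
For intuition, an `ARRAY` of `STRUCT`s behaves much like a Python list of dicts, and `OFFSET(i)` like zero-based list indexing. The sketch below uses made-up rows shaped like the `languages` table; the repository names and byte counts are hypothetical. In BigQuery, `OFFSET` raises an error when the index is out of range, while `SAFE_OFFSET` returns `NULL`; the helper mimics the `SAFE_OFFSET` behavior by returning `None` for an empty array.

```python
from typing import Optional, Tuple

# Hypothetical rows shaped like the `languages` table:
# repo_name STRING, language ARRAY<STRUCT<name STRING, bytes INT64>>
rows = [
    {"repo_name": "example/one",
     "language": [{"name": "C", "bytes": 100},
                  {"name": "Go", "bytes": 2000},
                  {"name": "Python", "bytes": 300}]},
    {"repo_name": "example/empty", "language": []},
]


def first_and_last(row: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return the names of the first and last languages of one row."""
    langs = row["language"]
    if not langs:
        # Like SAFE_OFFSET: no element, so no value.
        return None, None
    first = langs[0]["name"]              # language[OFFSET(0)].name
    last = langs[len(langs) - 1]["name"]  # language[OFFSET(ARRAY_LENGTH(language) - 1)].name
    return first, last


for row in rows:
    print(row["repo_name"], first_and_last(row))
```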

## `CROSS JOIN`

#### How often is `C` used with `Dockerfile`?

In [None]:
%%bigquery
SELECT *
FROM `bigquery-public-data.github_repos.languages`
WHERE repo_name LIKE 'open-lambda/%'

In [None]:
%%bigquery
SELECT *
FROM `bigquery-public-data.github_repos.languages`
CROSS JOIN UNNEST(language)
WHERE repo_name LIKE 'open-lambda/%'

Double `CROSS JOIN`.

In [None]:
%%bigquery
SELECT *
FROM `bigquery-public-data.github_repos.languages`
CROSS JOIN UNNEST(language) 
WHERE repo_name LIKE 'open-lambda/%'

In [None]:
%%bigquery
SELECT repo_name, L1.name AS name1, L2.name AS name2
FROM `bigquery-public-data.github_repos.languages`
CROSS JOIN UNNEST(language) AS L1
CROSS JOIN UNNEST(language) AS L2
WHERE repo_name LIKE 'open-lambda/%'
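
To see what the double `CROSS JOIN UNNEST` produces, here is a small pure-Python sketch on made-up rows (the repository names are hypothetical). Each row is paired only with the elements of its own `language` array, so a repository with `n` languages contributes `n * n` output rows. Counting the rows where `name1` is `C` and `name2` is `Dockerfile` then answers the question of how often the two are used together.

```python
from itertools import product

# Hypothetical rows shaped like the `languages` table (bytes omitted).
rows = [
    {"repo_name": "example/a",
     "language": [{"name": "C"}, {"name": "Dockerfile"}, {"name": "Go"}]},
    {"repo_name": "example/b",
     "language": [{"name": "Python"}]},
]


def unnest_pairs(rows):
    """Mimic CROSS JOIN UNNEST(language) AS L1 CROSS JOIN UNNEST(language) AS L2."""
    out = []
    for row in rows:
        # The join is correlated: only pair elements from the same row.
        for l1, l2 in product(row["language"], repeat=2):
            out.append((row["repo_name"], l1["name"], l2["name"]))
    return out


pairs = unnest_pairs(rows)
print(len(pairs))  # 3*3 + 1*1 = 10

c_with_dockerfile = sum(
    1 for _, name1, name2 in pairs if name1 == "C" and name2 == "Dockerfile"
)
print(c_with_dockerfile)  # 1
```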


#### What are the ten most common languages on GitHub?

In [None]:
%%bigquery top10
SELECT L.name, COUNT(*) AS repo_count
FROM `bigquery-public-data.github_repos.languages`
CROSS JOIN UNNEST(language) AS L
GROUP BY L.name
ORDER BY repo_count DESC
LIMIT 10

In [None]:
top10

In [None]:
top10.set_index("name")

In [None]:
top10.set_index("name").plot.bar()

#### What software licenses are used most often for Python projects?

In [None]:
%%bigquery lic
SELECT *
FROM `bigquery-public-data.github_repos.languages`

In [None]:
lic.set_index("license").plot.bar()
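
One possible way to answer the license question with the Python API is sketched below. It assumes that the public `bigquery-public-data.github_repos.licenses` table has `repo_name` and `license` columns, and that a repository counts as a Python project if `Python` appears anywhere in its `language` array. The names `LICENSE_SQL` and `run_license_query` are hypothetical.

```python
# Assumption: the `licenses` table has columns `repo_name` and `license`.
LICENSE_SQL = """
SELECT lic.license, COUNT(*) AS repo_count
FROM `bigquery-public-data.github_repos.languages` AS langs
CROSS JOIN UNNEST(langs.language) AS L
INNER JOIN `bigquery-public-data.github_repos.licenses` AS lic
    ON langs.repo_name = lic.repo_name
WHERE L.name = 'Python'
GROUP BY lic.license
ORDER BY repo_count DESC
"""


def run_license_query(client):
    """Run the query with an existing bigquery.Client and return a DataFrame."""
    return client.query(LICENSE_SQL).to_dataframe()
```

Calling `run_license_query(bq)` would then give a DataFrame with a `license` column that can be plotted in the same way as `lic`.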

### Using BigQuery on our custom data

### Example 1: BigQuery Table

In [None]:
config = bigquery.LoadJobConfig(source_format="PARQUET", write_disposition="WRITE_TRUNCATE")
# Get this "gsutil URI" from your GCP account 
source = "gs://s24_msyamkumar/hdma-wi-2021.parquet"
dataset = "lec_demo"
job = bq.load_table_from_uri(source, f"{project}.{dataset}.loans", job_config=config)
job.result()

### Example 2: External Table (GCS)

In [None]:
config = bigquery.ExternalConfig(source_format="PARQUET")
config.source_uris = [source]
# config.autodetect = True
table = bigquery.Table(f"{project}.{dataset}.loans-external")
table.external_data_configuration = config
bq.create_table(table, exists_ok=True)

### Example 3: External Table (Google Sheets)

Form: https://forms.gle/wwqt8XBXmFj6pES56 <br>
Sheet: https://docs.google.com/spreadsheets/d/1FfalqAWdzz01D1zIvBxsDWLW05-lvANWjjAj2vI4A04/

In [None]:
config = bigquery.ExternalConfig(source_format="GOOGLE_SHEETS")
config.source_uris = ["https://docs.google.com/spreadsheets/d/1FfalqAWdzz01D1zIvBxsDWLW05-lvANWjjAj2vI4A04/"]
config.autodetect = True
table = bigquery.Table(f"{project}.{dataset}.applications")
table.external_data_configuration = config
bq.create_table(table, exists_ok=True)

In [None]:
%%bigquery
SELECT *
FROM `cs544-spring2024.lec_demo.applications`