
# GitHub Metrics: Commit History Reporting

This notebook shows how raw data in the `github_metrics` dataset is processed into a reporting-ready format stored in the `reporting` dataset.  The queries developed here are scheduled in the Cloud Function created by the step 3 notebook for Commits.

**Source Dataset** 
- `vertex-ai-mlops-369716.github_metrics`
- **Source Tables**
    - `commits`
    - `commits_files`

**Destination Dataset** 
- `vertex-ai-mlops-369716.reporting`
- **Destination Tables**
    - `commits`
    - `commits_files`

An alternative way to schedule these queries is [BigQuery Scheduled Queries](https://cloud.google.com/bigquery/docs/scheduling-queries#set_up_scheduled_queries).



---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/architectures/tracking/setup/github/GitHub%20Metrics%20-%202%20-%20Commits%20-%20Reporting%20Scheduled%20Query.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [None]:
PROJECT_ID = 'vertex-ai-mlops-369716' # replace with project ID

In [None]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

Updated property [core/project].


---
## Setup

In [None]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'vertex-ai-mlops-369716'

In [None]:
BQ_PROJECT = PROJECT_ID

In [None]:
from google.cloud import bigquery

In [None]:
bq = bigquery.Client(project = PROJECT_ID)

---
## Initial Reporting Tables

### github_metrics.commits -> reporting.commits

This section creates and runs the query that takes all the `github_metrics.commits` data to date and creates a reporting table `reporting.commits`. The Incremental Updates section later in this notebook builds the query that incrementally updates this initial table.

In [None]:
query = f"""
CREATE OR REPLACE TABLE `{BQ_PROJECT}.reporting.commits` AS
  SELECT
    * EXCEPT(datetime),
    DATETIME(TIMESTAMP(datetime)) AS datetime
  FROM `{BQ_PROJECT}.github_metrics.commits`
  ORDER BY datetime
"""
job = bq.query(query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f791569be50>
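
The `DATETIME(TIMESTAMP(datetime))` expression in the query above parses the source value as a timestamp and then drops the time zone, keeping the UTC wall-clock time (BigQuery's `DATETIME(timestamp)` uses UTC when no time zone argument is given). A rough Python analogue is sketched below. It assumes the raw `datetime` column holds ISO 8601 strings such as `2021-04-01T16:06:54Z`, which this notebook does not show, and the function name `to_utc_datetime` is hypothetical.

```python
from datetime import datetime, timezone


def to_utc_datetime(raw: str) -> datetime:
    """Parse an ISO 8601 string and return a naive datetime in UTC."""
    # Older Python versions do not accept a trailing 'Z', so normalize it first.
    parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    # Treat values without an offset as UTC, mirroring BigQuery's default.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Shift to UTC, then drop the time zone to get a civil (naive) datetime.
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)


# Assumed sample value in the format returned by the GitHub API.
print(to_utc_datetime("2021-04-01T16:06:54Z"))
```

Values that carry a non-UTC offset are shifted to UTC before the offset is dropped, so the reporting column is comparable across rows.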

In [None]:
bq.query(query = f"SELECT * FROM `{BQ_PROJECT}.reporting.commits` LIMIT 5").to_dataframe()

| | sha | url | message | author | datetime |
|---|---|---|---|---|---|
| 0 | 701994319c570361840886cb0db660d9ed7534be | https://github.com/statmike/vertex-ai-mlops/co... | Initial Load, 01 is complete | statmike | 2021-04-01 16:06:54 |
| 1 | ab6115f872328f308199918a11717be7a50df4b4 | https://github.com/statmike/vertex-ai-mlops/co... | Complete | statmike | 2021-04-01 16:08:42 |
| 2 | 24a093511a365c0f3b2de670de126b539119923c | https://github.com/statmike/vertex-ai-mlops/co... | Complete | statmike | 2021-04-01 17:41:03 |
| 3 | 1288d481c84e0078fd306c8ada9ade4244eaeb8f | https://github.com/statmike/vertex-ai-mlops/co... | Work on 03 | statmike | 2021-04-02 00:44:13 |
| 4 | 60e483e8b5084684c671ecfff5cd60d4ca7e97e8 | https://github.com/statmike/vertex-ai-mlops/co... | Complete | statmike | 2021-04-02 13:19:59 |


### github_metrics.commits_files -> reporting.commits_files

This section creates and runs the query that takes all the `github_metrics.commits_files` data to date and creates a reporting table `reporting.commits_files`. The Incremental Updates section that follows builds the query that incrementally updates this initial table.

In [None]:
query = f"""
CREATE OR REPLACE TABLE `{BQ_PROJECT}.reporting.commits_files` AS
  SELECT
    * EXCEPT(datetime),
    DATETIME(TIMESTAMP(datetime)) AS datetime
  FROM `{BQ_PROJECT}.github_metrics.commits_files`
  ORDER BY datetime
"""
job = bq.query(query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f7914a335b0>

In [None]:
bq.query(query = f"SELECT * FROM `{BQ_PROJECT}.reporting.commits_files` LIMIT 5").to_dataframe()

| | sha | url | message | author | file_sha | file | additions | deletions | datetime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 701994319c570361840886cb0db660d9ed7534be | https://github.com/statmike/vertex-ai-mlops/co... | Initial Load, 01 is complete | statmike | ce0a6b40ecb6343d312c9768678181b258af03a9 | statmike/vertex-ai-mlops/01 - BigQuery - Data.... | 125 | 0 | 2021-04-01 16:06:54 |
| 1 | 701994319c570361840886cb0db660d9ed7534be | https://github.com/statmike/vertex-ai-mlops/co... | Initial Load, 01 is complete | statmike | 3621c154ce65ce214b9a00f6e28057b04db8df95 | statmike/vertex-ai-mlops/02 BigQuery - BQML.ipynb | 46 | 0 | 2021-04-01 16:06:54 |
| 2 | 701994319c570361840886cb0db660d9ed7534be | https://github.com/statmike/vertex-ai-mlops/co... | Initial Load, 01 is complete | statmike | d4a7577c4b21cbe7cd47992a3e3511462a762bc0 | statmike/vertex-ai-mlops/03 - BigQuery - BQML ... | 46 | 0 | 2021-04-01 16:06:54 |
| 3 | ab6115f872328f308199918a11717be7a50df4b4 | https://github.com/statmike/vertex-ai-mlops/co... | Complete | statmike | 49f3ac9a0d7c662747fcd2e7d9523b17e3cd6865 | statmike/vertex-ai-mlops/01 - BigQuery - Data.... | 666 | 33 | 2021-04-01 16:08:42 |
| 4 | 24a093511a365c0f3b2de670de126b539119923c | https://github.com/statmike/vertex-ai-mlops/co... | Complete | statmike | 61d228b7bb947957a2867f08377a4d80f8410728 | statmike/vertex-ai-mlops/02 BigQuery - BQML.ipynb | 1132 | 2 | 2021-04-01 17:41:03 |


---
## Incremental Updates

In the case of commits, all changes are appends for newly arriving commits.  

For efficiency, it is best to update `reporting.commits_files` first.  Why?  Detecting a new commit involves comparing `github_metrics.commits` with `reporting.commits`.  Once the latter is updated, the new commits can no longer be found this way, and detecting them would instead require first selecting the distinct commits from the `reporting.commits_files` table.
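
The ordering argument above can be illustrated with plain Python sets standing in for the tables. This is only a sketch: the sha values and the function names `new_shas` and `incremental_update` are hypothetical, and the set difference plays the role of the `NOT EXISTS` comparison used in the queries that follow.

```python
def new_shas(source: set, current: set) -> set:
    """Anti-join: shas present in the source but not yet in reporting."""
    return source - current


def incremental_update(source, commits, files, files_first=True):
    """Apply both updates in the chosen order and return the resulting sets."""
    commits, files = set(commits), set(files)
    if files_first:
        # New commits are still detectable when commits_files is updated.
        files |= new_shas(source, commits)
        commits |= new_shas(source, commits)
    else:
        # After commits is updated, the comparison finds nothing for commits_files.
        commits |= new_shas(source, commits)
        files |= new_shas(source, commits)
    return commits, files


# Hypothetical contents: shas in github_metrics.commits, with only "a1" reported so far.
source = {"a1", "b2", "c3"}
commits_ok, files_ok = incremental_update(source, {"a1"}, {"a1"}, files_first=True)
commits_bad, files_bad = incremental_update(source, {"a1"}, {"a1"}, files_first=False)
print(sorted(files_ok))
print(sorted(files_bad))
```

With `files_first=False`, the files set never receives the new shas, which is the situation the recommended ordering avoids.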

### reporting.commits_files

In [None]:
query_1 = f"""
  WITH
    CURRENT_COMMITS AS (SELECT sha FROM `{BQ_PROJECT}.reporting.commits`),
    SOURCE_COMMITS AS (SELECT sha FROM `{BQ_PROJECT}.github_metrics.commits`),
    NEW_COMMITS AS (SELECT SOURCE_COMMITS.sha FROM SOURCE_COMMITS WHERE NOT EXISTS (SELECT CURRENT_COMMITS.sha FROM CURRENT_COMMITS WHERE SOURCE_COMMITS.sha = CURRENT_COMMITS.sha)),
    RAW_COMMITS AS (SELECT * FROM NEW_COMMITS LEFT OUTER JOIN `{BQ_PROJECT}.github_metrics.commits_files` USING(sha))
  SELECT
    * EXCEPT(datetime),
    DATETIME(TIMESTAMP(datetime)) AS datetime
  FROM RAW_COMMITS
  ORDER BY datetime
"""
bq.query(query = query_1).to_dataframe()

| | sha | url | message | author | file_sha | file | additions | deletions | datetime |
|---|---|---|---|---|---|---|---|---|---|


In [None]:
query_1 = f"INSERT INTO `{BQ_PROJECT}.reporting.commits_files`{query_1}"
print(query_1)

INSERT INTO `vertex-ai-mlops-369716.reporting.commits_files`
  WITH
    CURRENT_COMMITS AS (SELECT sha FROM `vertex-ai-mlops-369716.reporting.commits`),
    SOURCE_COMMITS AS (SELECT sha FROM `vertex-ai-mlops-369716.github_metrics.commits`),
    NEW_COMMITS AS (SELECT SOURCE_COMMITS.sha FROM SOURCE_COMMITS WHERE NOT EXISTS (SELECT CURRENT_COMMITS.sha FROM CURRENT_COMMITS WHERE SOURCE_COMMITS.sha = CURRENT_COMMITS.sha)),
    RAW_COMMITS AS (SELECT * FROM NEW_COMMITS LEFT OUTER JOIN `vertex-ai-mlops-369716.github_metrics.commits_files` USING(sha))
  SELECT
    * EXCEPT(datetime),
    DATETIME(TIMESTAMP(datetime)) AS datetime
  FROM RAW_COMMITS
  ORDER BY datetime



In [None]:
job = bq.query(query = query_1)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f79129bb3d0>

In [None]:
job.state

'DONE'

### reporting.commits

In [None]:
query_2 = f"""
  WITH
    CURRENT_COMMITS AS (SELECT sha FROM `{BQ_PROJECT}.reporting.commits`),
    SOURCE_COMMITS AS (SELECT sha FROM `{BQ_PROJECT}.github_metrics.commits`),
    NEW_COMMITS AS (SELECT SOURCE_COMMITS.sha FROM SOURCE_COMMITS WHERE NOT EXISTS (SELECT CURRENT_COMMITS.sha FROM CURRENT_COMMITS WHERE SOURCE_COMMITS.sha = CURRENT_COMMITS.sha)),
    RAW_COMMITS AS (SELECT * FROM NEW_COMMITS LEFT OUTER JOIN `{BQ_PROJECT}.github_metrics.commits` USING(sha))
  SELECT
    * EXCEPT(datetime),
    DATETIME(TIMESTAMP(datetime)) AS datetime
  FROM RAW_COMMITS
  ORDER BY datetime
"""
bq.query(query = query_2).to_dataframe()

| | sha | url | message | author | datetime |
|---|---|---|---|---|---|


In [None]:
query_2 = f"INSERT INTO `{BQ_PROJECT}.reporting.commits`{query_2}"
print(query_2)

INSERT INTO `vertex-ai-mlops-369716.reporting.commits`
  WITH
    CURRENT_COMMITS AS (SELECT sha FROM `vertex-ai-mlops-369716.reporting.commits`),
    SOURCE_COMMITS AS (SELECT sha FROM `vertex-ai-mlops-369716.github_metrics.commits`),
    NEW_COMMITS AS (SELECT SOURCE_COMMITS.sha FROM SOURCE_COMMITS WHERE NOT EXISTS (SELECT CURRENT_COMMITS.sha FROM CURRENT_COMMITS WHERE SOURCE_COMMITS.sha = CURRENT_COMMITS.sha)),
    RAW_COMMITS AS (SELECT * FROM NEW_COMMITS LEFT OUTER JOIN `vertex-ai-mlops-369716.github_metrics.commits` USING(sha))
  SELECT
    * EXCEPT(datetime),
    DATETIME(TIMESTAMP(datetime)) AS datetime
  FROM RAW_COMMITS
  ORDER BY datetime



In [None]:
job = bq.query(query = query_2)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f7914a33970>

In [None]:
job.state

'DONE'

---
## Query To Schedule

In the notebook 'GitHub Metrics - 3 - Commits - Incremental Update Cloud Function.ipynb', the Cloud Function that updates the raw data in the dataset `github_metrics` is set up.  Since updating the reporting tables should come right after the raw data update, it makes sense to add the updating queries to that Cloud Function.  The query is constructed by the print statement below and then copied and pasted into the Cloud Function for daily execution.

>An alternative way to schedule a query is using [BigQuery's Scheduled Queries](https://cloud.google.com/bigquery/docs/scheduling-queries) capability.  There are two ways to get started within the console:
>- From the query editor, use the Schedule option in the toolbar
>- From BigQuery > Scheduled Queries > + Create Scheduled Query In Editor
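
Scheduled queries can also be created from Python rather than the console. The sketch below is not what this notebook series uses (the queries are added to the Cloud Function instead). It assumes the `google-cloud-bigquery-datatransfer` package is installed and that the caller has permission to create transfer configs. The helper names, the display name, and the `every 24 hours` schedule are hypothetical choices.

```python
def scheduled_query_settings(query: str, display_name: str,
                             schedule: str = "every 24 hours") -> dict:
    """Collect the fields a scheduled query needs; no API call is made here."""
    return {
        "display_name": display_name,
        "data_source_id": "scheduled_query",  # data source ID for scheduled queries
        "params": {"query": query},           # DML script, so no destination table is set
        "schedule": schedule,
    }


def create_scheduled_query(project_id: str, settings: dict):
    """Create the scheduled query through the BigQuery Data Transfer Service."""
    # Imported lazily so the helper above works without the package installed.
    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()
    config = bigquery_datatransfer.TransferConfig(**settings)
    return client.create_transfer_config(
        parent=client.common_project_path(project_id),
        transfer_config=config,
    )


# Placeholder SQL; in this notebook the combined `query` string would be passed instead.
settings = scheduled_query_settings("SELECT 1;", "reporting-commits-incremental")
print(settings["schedule"])
```

Calling `create_scheduled_query(PROJECT_ID, settings)` would then submit the configuration. It is left out of the example because it needs credentials and creates a billable resource.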

In [None]:
query = query_1 + ';\n' + query_2 + ';'
print(query)

INSERT INTO `vertex-ai-mlops-369716.reporting.commits_files`
  WITH
    CURRENT_COMMITS AS (SELECT sha FROM `vertex-ai-mlops-369716.reporting.commits`),
    SOURCE_COMMITS AS (SELECT sha FROM `vertex-ai-mlops-369716.github_metrics.commits`),
    NEW_COMMITS AS (SELECT SOURCE_COMMITS.sha FROM SOURCE_COMMITS WHERE NOT EXISTS (SELECT CURRENT_COMMITS.sha FROM CURRENT_COMMITS WHERE SOURCE_COMMITS.sha = CURRENT_COMMITS.sha)),
    RAW_COMMITS AS (SELECT * FROM NEW_COMMITS LEFT OUTER JOIN `vertex-ai-mlops-369716.github_metrics.commits_files` USING(sha))
  SELECT
    * EXCEPT(datetime),
    DATETIME(TIMESTAMP(datetime)) AS datetime
  FROM RAW_COMMITS
  ORDER BY datetime
;
INSERT INTO `vertex-ai-mlops-369716.reporting.commits`
  WITH
    CURRENT_COMMITS AS (SELECT sha FROM `vertex-ai-mlops-369716.reporting.commits`),
    SOURCE_COMMITS AS (SELECT sha FROM `vertex-ai-mlops-369716.github_metrics.commits`),
    NEW_COMMITS AS (SELECT SOURCE_COMMITS.sha FROM SOURCE_COMMITS WHERE NOT EXISTS (SELECT C