### Google Cloud Functions & Cloud Scheduler

This notebook shows how I hosted and automated a web-scraping script in Google Cloud Functions (GCF).

---

Airflow is extremely useful for coordinating complex pipelines in distributed systems and can be run in the cloud with [Google Cloud Composer](https://cloud.google.com/composer). The downside is that Airflow's overhead makes the project more expensive to run. By contrast, GCF is built for simple stateless functions and is much more affordable. Thankfully, our use case is simple enough for GCF to handle.


#### Example: `nws_update_gcf`

We have two DAGs we'd like to port over as Google Cloud Functions. Let's use the NWS script as an example.

Each function needs its own directory to export as a package. 

```bash
nws_update_gcf
├── main.py     # houses actual script 
├── README.md   
├── requirements.txt # GCF container automatically installs these 
└── utils     
    ├── __init__.py # needed so utils can be imported as a package
    └── utils.py
```

To export our DAG as a GCF, all we really need to do is (1) remove the Airflow decorators (```@task``` and ```@dag```) and the closing DAG constructor call, and (2) add a function that serves as the "entry point" that GCF invokes when the function is triggered.
```python 
## Bash: pip install functions-framework
import functions_framework

@functions_framework.http
def main(request) -> str:
  """Entry point for google cloud function"""
  df = get_forecast_df()
  
  load_staging_table(df)
  
  insert_table()

  return "Mandatory Return Statement"
```

Here our function ```main``` is set up to be triggered when a request is sent to the function's target URL. Notice that ```main``` plays a role similar to the ```dag()``` constructor in Airflow, in that it defines the order in which the functions in the script are executed.
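
Since ```main``` is just a Python function, its ordering can be tried out locally before deploying. Here is a minimal sketch, assuming stand-in versions of the three steps (the real ones hit NWS and BigQuery) and leaving off the ```@functions_framework.http``` decorator so it runs without the package installed. The `calls` list, the stub bodies, and the placeholder row are all hypothetical.

```python
from types import SimpleNamespace

calls = []  # records the order in which the steps run (illustration only)

def get_forecast_df():
    """Stand-in for the scraping step; returns placeholder rows, not a DataFrame."""
    calls.append("get_forecast_df")
    return [{"location": "placeholder", "temperature_f": 0}]

def load_staging_table(df):
    """Stand-in for the BigQuery staging load."""
    calls.append("load_staging_table")

def insert_table():
    """Stand-in for the INSERT into the main table."""
    calls.append("insert_table")

def main(request) -> str:
    """Same shape as the GCF entry point, minus the decorator."""
    df = get_forecast_df()
    load_staging_table(df)
    insert_table()
    return "OK"

# A bare object is enough here because main() never inspects the request.
fake_request = SimpleNamespace()
result = main(fake_request)
print(result, calls)
```

The printed list shows the steps running in the order ```main``` lays them out, which is the same job the DAG definition did in Airflow.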

In [None]:
import pandas as pd
import numpy as np
import datetime as dt 
import logging 
from io import BytesIO
# GCP imports: 
from google.cloud import bigquery, storage, logging as cloud_logging 
from google.oauth2 import service_account
from google.api_core.exceptions import NotFound
# Utils
import utils.utils as utils 
## ^^ For the actual package it will just be "utils.utils"
# Functions Framework 
import functions_framework

## ---------- GCP INFO ---------- ## 
PROJECT_ID = "alaska-scrape"
DATASET_ID = "weather"
STAGING_TABLE_ID = "nws_staging"
MAIN_TABLE_ID = "nws"

SCHEMA =  [
    bigquery.SchemaField("location", "STRING", mode="REQUIRED"), 
    bigquery.SchemaField("utc_datetime", "DATETIME", mode="REQUIRED"), 
    bigquery.SchemaField("lst_datetime", "DATETIME", mode="REQUIRED"), 
    bigquery.SchemaField("temperature_f", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("dewpoint_f", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("wind_chill_f", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("surface_wind_mph", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("wind_dir", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("gust", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("sky_cover_pct", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("precipitation_potential_pct", "FLOAT", mode="NULLABLE"), 
    bigquery.SchemaField("relative_humidity_pct", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("rain", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("thunder", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("snow", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("freezing_rain", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("sleet", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("fog", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("last_update_nws", "DATETIME", mode="NULLABLE")
  ] 

## ---------- LOGGING ---------- ## 
# Cloud logging client
# logger_client = cloud_logging.Client(credentials=credentials, project=credentials.project_id)
logger_client = cloud_logging.Client()

# Cloud logging handler
handler = logger_client.get_default_handler()

# Create logger with cloud handler
logger = logging.getLogger(__name__)
logger.addHandler(handler)

# Set logging levels 
logger.setLevel(logging.INFO)
handler.setLevel(logging.INFO)

# Format logger 
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

# Confirm logger is working  
logger.info(f"Running daily scrape of NWS Weather Forecasts in Alaska")

## ---------- CLOUD STORAGE ---------- ## 
# storage_client = storage.Client(credentials=credentials, project=credentials.project_id)
storage_client = storage.Client()
bucket = storage_client.bucket(f"{PROJECT_ID}-bucket")

# Locations 
blob = bucket.blob("locations.csv")
content = blob.download_as_bytes()
locations_df = pd.read_csv(BytesIO(content))

## ---------- BIGQUERY ---------- ## 
# bq_client = bigquery.Client(credentials=credentials, project=credentials.project_id)
bq_client = bigquery.Client()


def get_forecast_df() -> pd.DataFrame:
  """Get dataframe of forecast data for next 6 days from various points in Alaska"""

  nws_urls = locations_df.apply(utils.get_nws_url, axis=1)
  url_map = dict(zip(locations_df['station_location'], nws_urls))

  combined_table = []
  for location, url in url_map.items():
    soup_list = [utils.get_soup(url + f"&AheadHour={hr}") for hr in (0,48,96)]
    table_list = utils.flatten([utils.extract_table_data(soup, location) for soup in soup_list])
    combined_table.extend(table_list)

  forecast_dict = utils.transpose_as_dict(combined_table)
  forecast_df = utils.transform_df(forecast_dict)
  
  return forecast_df


def load_staging_table(df:pd.DataFrame) -> None:
  """Upload dataframe from transform_df() to BigQuery staging table"""

  jc = bigquery.LoadJobConfig(
    source_format = bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=False,
    schema=SCHEMA,
    create_disposition="CREATE_IF_NEEDED",
    write_disposition="WRITE_TRUNCATE"   
  )
 
  # Set target table in BigQuery
  full_table_id = f"{PROJECT_ID}.{DATASET_ID}.{STAGING_TABLE_ID}"

  # Upload to BigQuery
  ## If any required columns are missing values, include name of column in error message
  try: 
    job = bq_client.load_table_from_dataframe(df, full_table_id, job_config=jc)
    job.result()
  except Exception as e:
    error_message = str(e)
    if 'Required column value for column index' in error_message:
      start_index = error_message.index('Required column value for column index') + len('Required column value for column index: ')
      end_index = error_message.index(' is missing', start_index)
      missing_column_index = int(error_message[start_index:end_index])
      missing_column_name = list(df.columns)[missing_column_index]
      error_message = error_message[:start_index] + f'{missing_column_name} ({missing_column_index})' + error_message[end_index:]
    raise Exception(error_message) from e  # keep the original traceback attached

  # Log result 
  table_ref = bq_client.get_table(full_table_id)
  logger.info(f"Loaded {table_ref.num_rows} rows and {len(table_ref.schema)} columns into {full_table_id}")

  
def insert_table() -> None: 
  """Insert staging table into the main data table -- creates the table if it doesn't exist yet"""
  
  insert_query=f"""
    INSERT INTO {DATASET_ID}.{MAIN_TABLE_ID} 
    SELECT *, CURRENT_TIMESTAMP() as date_added_utc
    FROM {DATASET_ID}.{STAGING_TABLE_ID}
    """
  
  full_table_id = f"{PROJECT_ID}.{DATASET_ID}.{MAIN_TABLE_ID}"

  try: 
    query_job = bq_client.query(insert_query) 
    query_job.result()
  except NotFound:
    logger.info(f"Table {DATASET_ID}.{MAIN_TABLE_ID} does not exist. Creating.")

    # Adding date_added to SCHEMA 
    schema = SCHEMA + [bigquery.SchemaField("date_added_utc", "TIMESTAMP", mode="REQUIRED")]

    table = bigquery.Table(full_table_id, schema=schema)
    table = bq_client.create_table(table)

    query_job = bq_client.query(insert_query)
    query_job.result()
    
  table = bq_client.get_table(full_table_id)
  logger.info(f"Loaded {table.num_rows} rows and {len(table.schema)} columns into {full_table_id}\n")

@functions_framework.http
def main(request) -> str:
  """Entry point for google cloud function"""
  df = get_forecast_df()
  
  load_staging_table(df)
  
  insert_table()

  return "Mandatory Return Statement" # Can put anything but must be present. 
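
The error handling in ```load_staging_table``` swaps a column index for a column name by slicing the message string. The same idea can be sketched in isolation with a regular expression. This is only an illustration: the helper name `name_missing_column` and the sample message are hypothetical, and the pattern assumes the error text has the `Required column value for column index: N is missing` shape that the script searches for.

```python
import re

# Matches the fragment that load_staging_table() looks for in the error text.
_MISSING_COL = re.compile(r"(Required column value for column index: )(\d+)( is missing)")

def name_missing_column(message: str, columns: list) -> str:
    """Replace the column index in the error text with 'name (index)'."""
    def _swap(match):
        index = int(match.group(2))
        # Leave the text untouched if the index is outside the column list.
        if index >= len(columns):
            return match.group(0)
        return f"{match.group(1)}{columns[index]} ({index}){match.group(3)}"
    return _MISSING_COL.sub(_swap, message)

# Hypothetical message, shaped like the fragment the script checks for.
sample = "Error while reading data: Required column value for column index: 1 is missing in row 4"
print(name_missing_column(sample, ["location", "utc_datetime", "lst_datetime"]))
```

Messages that do not contain the fragment pass through unchanged, so the helper is safe to apply to any exception text.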

To export our directory as a cloud function, we use the ```gcloud functions deploy``` command:

```bash
# Replace YOUR_FUNCTION_NAME and every <placeholder> before running.
# Comments are kept up here because a comment after a trailing "\" breaks the line continuation.
#   --gen2                   optional; gen1 vs gen2 has some minor differences
#   --region                 e.g. us-central1
#   --runtime                a Python runtime ID, e.g. python311
#   --entry-point            here it's our function "main"
#   --memory                 512MB here; default is 256MB
#   --trigger-http           how the function gets triggered
#   --allow-unauthenticated  allows anyone with the URL to call the function -- enable for setup convenience, NOT secure
gcloud functions deploy YOUR_FUNCTION_NAME \
  --gen2 \
  --region=<google-server-location> \
  --runtime=<python-version> \
  --source=</path/to/your/directory> \
  --entry-point=<your-target-function> \
  --memory=512MB \
  --trigger-http \
  --allow-unauthenticated
```

The documentation on [this](https://cloud.google.com/functions/docs/deploy#from-local-machine) page is pretty good, but it does not make clear how the script itself should be structured. Be sure your script defines a proper entry point, and that its name matches the value passed to `--entry-point`.
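
One way to catch a missing or misnamed entry point before deploying is to inspect the script without running it. The following is a minimal sketch using the standard-library `ast` module; the helper name `defines_function` and the example source string are hypothetical, not part of the NWS package.

```python
import ast

def defines_function(source: str, name: str) -> bool:
    """Return True if `source` defines a top-level function called `name`."""
    tree = ast.parse(source)  # parses only; nothing in the script is imported or executed
    return any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name
        for node in tree.body
    )

# Stand-in for the contents of main.py.
example_source = '''
import functions_framework

@functions_framework.http
def main(request):
    return "OK"
'''

print(defines_function(example_source, "main"))
print(defines_function(example_source, "handler"))
```

For a real check, the source string would come from reading `main.py` in the function's directory.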

You can test your function by clicking "Test in Cloud Shell" on the testing tab on the function's page.  

![gcf-test](../img/gcf_testingpage.png)

By default it will pre-fill a JSON with `Hello World!` -- this JSON is used to pass information to your function when you call it. We don't need that so we can leave the JSON empty. 
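
If the function ever did need input, that JSON body is where it would arrive. Here is a minimal sketch, assuming the Flask-style request object that HTTP functions receive, whose `get_json(silent=True)` returns `None` instead of raising when the body is missing or cannot be parsed. The `hours` option, `read_hours`, and the `FakeRequest` stub are hypothetical and not part of the NWS script.

```python
DEFAULT_HOURS = (0, 48, 96)  # same AheadHour offsets used in get_forecast_df()

def read_hours(request) -> tuple:
    """Return forecast offsets from the JSON body, or the defaults if absent."""
    payload = request.get_json(silent=True) or {}
    hours = payload.get("hours", DEFAULT_HOURS)
    return tuple(int(h) for h in hours)

class FakeRequest:
    """Hypothetical stand-in for the Flask request, for local experimentation."""
    def __init__(self, body=None):
        self._body = body

    def get_json(self, silent=False):
        return self._body

print(read_hours(FakeRequest()))                    # no body: falls back to the defaults
print(read_hours(FakeRequest({"hours": [0, 24]})))  # body supplies its own offsets
```

An empty `{}` body behaves the same as no body at all, which is why leaving the test JSON empty works for our function.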

Now we are able to trigger our function on demand from cloud shell or our command line:

```bash

curl -m 706 -X POST https://your-function-target-url \
-H "Authorization: bearer $(gcloud auth print-identity-token)" \
-H "Content-Type: application/json" \
-d '{}'

```
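This walkthrough does not cover authentication, since that requires a more involved setup to coordinate service accounts and access permissions. For a non-critical project like this we can forgo it, but keep in mind that with `--allow-unauthenticated` anyone who has the URL can trigger the function. I do suggest adding a string of random characters to your function's name to make the endpoint URL harder to guess, though that is obscurity rather than real access control.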
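
The same request can also be put together from Python using only the standard library. The following is a minimal sketch that mirrors the curl command above; `build_trigger_request` is a hypothetical helper, and the URL and token are placeholders to be replaced with your function's URL and the output of `gcloud auth print-identity-token`.

```python
import json
import urllib.request

def build_trigger_request(url: str, identity_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl command."""
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # same empty JSON body as -d '{}'
        headers={
            "Authorization": f"bearer {identity_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder URL and token (assumptions, not real values).
req = build_trigger_request("https://your-function-target-url", "TOKEN")
print(req.get_method(), req.full_url, req.data)

# To actually send it, with the same timeout as curl's -m 706:
# urllib.request.urlopen(req, timeout=706)
```

Building the request separately from sending it makes it easy to inspect the headers and body before anything goes over the network.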

#### Cloud Scheduler 

Once the function is deployed to Cloud Functions, we can schedule it to run at regular intervals using [Cloud Scheduler](https://cloud.google.com/scheduler). This is a simple service that sends requests on a schedule to the target URL generated when we deployed our function.

1. From the above page, go to the Cloud Scheduler Console for your project.
2. Click "Create Job" at the top of the page.

![gcf-screenshot](../img/gcf_screenshot.resized.png)
  

3. Set the time zone and schedule interval using [cron syntax](https://crontab.guru/) (e.g. `0 12 * * *` for noon every day)
4. Under "Configure execution" paste your function's target URL and select `Target type: HTTP` and `HTTP method: POST`. 
5. You can leave the `Retry config` settings in the next section as default. 
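
To make the five cron fields from step 3 concrete, here is a deliberately simplified sketch of how an expression like `0 12 * * *` lines up with a point in time. It is not how Cloud Scheduler is implemented: `cron_matches` is a hypothetical helper that only understands `*` and single integers, requires every field to match, and ignores ranges, lists, steps, and names.

```python
from datetime import datetime

def cron_matches(expr: str, when: datetime) -> bool:
    """Check a datetime against a 5-field cron expression ('*' or one integer per field)."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields: minute hour day-of-month month day-of-week")
    actual = (
        when.minute,
        when.hour,
        when.day,
        when.month,
        when.isoweekday() % 7,  # cron counts Sunday as 0, Monday as 1, ...
    )
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))

print(cron_matches("0 12 * * *", datetime(2024, 1, 15, 12, 0)))   # noon on the dot
print(cron_matches("0 12 * * *", datetime(2024, 1, 15, 12, 30)))  # half past noon
```

Reading left to right, the fields are minute, hour, day of month, month, and day of week, evaluated in the time zone chosen for the job.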

Refer back to the Cloud Scheduler and [Cloud Logging](https://cloud.google.com/logging) consoles to confirm that your function is working as scheduled. 