### NWS Data  

This notebook shows how to create and host an automated web-scraping script in Google Cloud Functions (GCF), repurposed from an Airflow DAG (`../airflow/nws_dag.py`). The GCF package directory this notebook creates (`notebooks/nws_update_gcf/`) is already included in this repo, so there is no need for you to run the code in the first two sections. The third will show how to upload the package to GCF.

---

Airflow is well suited to coordinating complex pipelines in distributed systems and can be run in the cloud with [Google Cloud Composer](https://cloud.google.com/composer). The downside is that Airflow's extra overhead makes the project more expensive to run. By contrast, GCF is dedicated to simple stateless functions and is much more affordable. Our use case is simple enough for GCF to handle.



#### 1.) Create directory for GCF scripts

First we need to make a project directory which we can upload to GCF. We will make it in our current directory `notebooks`.


In [14]:
# %%bash -- I've already run this for you  
# mkdir nws_update_gcf && cd $_
# mkdir utils
# touch main.py
# touch README.md requirements.txt 
# cp ../../airflow/dags/utils/* utils/

# echo "Script to scrape, transform, and upload NWS forecast information to BigQuery" >> README.md

# echo "# This file is required to mark the 'utils' directory as a package" >> utils/__init__.py

# # The GCF container will automatically install any dependencies listed in requirements.txt.
# # (`datetime` is part of the Python standard library, so it does not need to be listed.)
# echo -e "pandas\nnumpy\nrequests\nbeautifulsoup4\ngoogle-cloud-bigquery\ngoogle-cloud-storage\ngoogle-cloud-logging\ngoogle-auth\ngoogle-auth-oauthlib\ngoogle-auth-httplib2" > requirements.txt

# cd .. && tree nws_update_gcf

nws_update_gcf
├── main.py
├── README.md
├── requirements.txt
└── utils
    ├── __init__.py
    └── utils.py

1 directory, 5 files


cp: -r not specified; omitting directory '../../airflow/dags/utils/__pycache__'


#### 2.) Refactor Airflow Script

Next we modify our Airflow DAG to work as a regular Python script, which we will add to `main.py`.

In [1]:
import pandas as pd
import numpy as np
import requests
import os
import re
import logging
import io 
from datetime import datetime, timedelta
from bs4 import BeautifulSoup
# GCP imports: 
from google.cloud import bigquery, storage, logging as cloud_logging 
from google.oauth2 import service_account
# Utils
from nws_update_gcf.utils.utils import nws_url, get_table, table_to_dict

When a function is created in Google Cloud Functions, it is automatically associated with a single Google Cloud project. Inside GCF, GCP clients (e.g. `storage.Client()`) can therefore be created without specifying the project or credentials. We define both explicitly here only so that we can test-run the code locally.
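One way to keep a single code path for both situations is to build the client arguments conditionally. The sketch below is an illustration rather than part of the original script: the helper name `gcp_client_kwargs` and the `GCP_KEY_PATH` environment variable are hypothetical. It assumes that a client constructed with no arguments falls back to the default credentials and project of its environment.

```python
import os

def gcp_client_kwargs(key_path=None):
    """Return keyword arguments for a google-cloud client constructor.

    With a readable key file, build explicit service-account credentials
    (local testing). Otherwise return no arguments, so the client uses the
    environment's default credentials and project, as it would inside GCF.
    """
    if key_path and os.path.isfile(key_path):
        # Imported lazily: only needed when a key file is actually used.
        from google.oauth2 import service_account
        credentials = service_account.Credentials.from_service_account_file(
            key_path, scopes=["https://www.googleapis.com/auth/cloud-platform"]
        )
        return {"credentials": credentials, "project": credentials.project_id}
    return {}

# GCP_KEY_PATH is a hypothetical variable name used only for this sketch.
kwargs = gcp_client_kwargs(os.environ.get("GCP_KEY_PATH"))
# e.g. storage.Client(**kwargs) or bigquery.Client(**kwargs)
```

With this pattern the same `main.py` could run unchanged on a laptop (key file present) and in GCF (no key file).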

In [2]:
# Creating GCP Connection 
from yaml import full_load
from google.oauth2 import service_account

with open("../config/gcp-config.yaml", "r") as fp: 
    gcp_config = full_load(fp)

PROJECT_ID = gcp_config["project-id"]
DATASET_ID = gcp_config["dataset-id"]

key_path = gcp_config["credentials"]
credentials = service_account.Credentials.from_service_account_file(
    key_path, scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

## ---------- LOGGING ---------- ## 
# Cloud logging client
logger_client = cloud_logging.Client(credentials=credentials, project=credentials.project_id)

# Cloud logging handler
handler = logger_client.get_default_handler()

# Create logger with cloud handler
logger = logging.getLogger(__name__)
logger.addHandler(handler)

# Set logging levels 
logger.setLevel(logging.INFO)
handler.setLevel(logging.INFO)

# Format logger 
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

# Confirm logger is working
logger.info("Running daily scrape of NWS Weather Forecasts in Alaska")

## ---------- CLOUD STORAGE ---------- ## 
storage_client = storage.Client(credentials=credentials, project=credentials.project_id)
bucket = storage_client.bucket(f"{PROJECT_ID}-bucket")

# Locations 
blob = bucket.blob("locations.csv")
content = blob.download_as_bytes()
locations_df = pd.read_csv(io.BytesIO(content))

## ---------- BIGQUERY ---------- ## 
bq_client = bigquery.Client(credentials=credentials, project=credentials.project_id)

Now for the actual code:

In [3]:

def get_forecast_df() -> dict:
  """Get dictionary of forecast data for next 48 hours from various points in Alaska"""

  logger.info("Scraping forecast data")

  nws_urls = locations_df.apply(nws_url, axis=1)
  loc_dict = dict(zip(locations_df['station_location'], nws_urls))

  combined_table = []
  for location, url in loc_dict.items():
    result = requests.get(url, timeout=30)
    result.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(result.content, "html.parser")

    # Index 5 selects the <table> that holds the forecast grid
    tr_list = soup.find_all("table")[5].find_all("tr") # <tr> elements
    table = get_table(tr_list, location)
    combined_table.extend(table)

  return table_to_dict(combined_table)


def transform_df(fcast_dict: dict) -> pd.DataFrame:
  """Cast dictionary from get_forecast_df() to a dataframe and transform"""
  df = pd.DataFrame(fcast_dict)
  df.columns = [col.lower() for col in df.columns]

  # Check for values that failed to parse and drop the affected rows
  empty_values = df.isna().any(axis=1)
  if empty_values.any():
    bad_rows = df[empty_values]
    logger.warning(f"When creating dataframe, failed to parse values in the following rows:\n{bad_rows}")
    logger.info("Dropping bad rows from dataframe")
    df = df[~empty_values]

  # Replace missing value indicators with NaN (returns a new dataframe)
  df = df.replace({'': np.nan, '--': np.nan})

  ## Datetime Transformations
  # The scraped dates carry no year, so the current year is assumed
  cur_year = datetime.now().year
  dt_strings = df['date'] + '/' + str(cur_year) + ' ' + df['hour (akst)'] + ':00 AKST'
  # Local time (AKST)
  df['lst_datetime'] = pd.to_datetime(dt_strings, format='%m/%d/%Y %H:%M AKST')
  # UTC time (AKST is UTC-9)
  akst_offset = timedelta(hours=9)
  df['utc_datetime'] = df['lst_datetime'] + akst_offset

  # Reorder columns; .copy() avoids a SettingWithCopyWarning on the assignment below
  col_names = ['location','utc_datetime','lst_datetime'] + list(df.columns)[3:-2]
  df = df[col_names].copy()

  # Timestamp column: track when forecast was accessed
  df['date_added_utc'] = datetime.utcnow()

  # Edit column headers (raw string avoids invalid-escape warnings)
  df = df.rename(columns=lambda x: re.sub(r'°|\(|\)', '', x))
  df = df.rename(columns=lambda x: re.sub('%', 'pct', x))
  df = df.rename(columns=lambda x: re.sub(' ', '_', x.strip()))

  # Log result
  logger.info(f"Created dataframe with {len(df)} rows and {len(df.columns)} columns")

  return df


def load_df_to_bq(df) -> None:
  """Load dataframe to BigQuery"""

  # Set schema and job_config
  schema = [
    bigquery.SchemaField("location", "STRING", mode="REQUIRED"), 
    bigquery.SchemaField("utc_datetime", "DATETIME", mode="REQUIRED"), 
    bigquery.SchemaField("lst_datetime", "DATETIME", mode="REQUIRED"), 
    bigquery.SchemaField("temperature_f", "INTEGER", mode="REQUIRED"), 
    bigquery.SchemaField("dewpoint_f", "INTEGER", mode="REQUIRED"), 
    bigquery.SchemaField("wind_chill_f", "INTEGER", mode="REQUIRED"), 
    bigquery.SchemaField("surface_wind_mph", "INTEGER", mode="REQUIRED"), 
    bigquery.SchemaField("wind_dir", "STRING", mode="REQUIRED"), 
    bigquery.SchemaField("gust", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("sky_cover_pct", "INTEGER", mode="REQUIRED"), 
    bigquery.SchemaField("precipitation_potential_pct", "FLOAT", mode="REQUIRED"), 
    bigquery.SchemaField("relative_humidity_pct", "FLOAT", mode="REQUIRED"),
    bigquery.SchemaField("rain", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("thunder", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("snow", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("freezing_rain", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("sleet", "STRING", mode="NULLABLE")
  ]
  
  jc = bigquery.LoadJobConfig(
    source_format = bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=False,
    schema=schema,
    create_disposition="CREATE_IF_NEEDED",
    write_disposition="WRITE_APPEND"   
  )
 
  # Set target table in BigQuery
  full_table_id = f"{PROJECT_ID}.{DATASET_ID}.nws_forecasts"

  # Load from dataframe -- if any columns are missing, include names of column in the error message
  try: 
    job = bq_client.load_table_from_dataframe(df, full_table_id, job_config=jc)
    job.result()
  except Exception as e:
    error_message = str(e)
    # modify error message to include the name of the missing column
    if 'Required column value for column index' in error_message:
      start_index = error_message.index('Required column value for column index') + len('Required column value for column index: ')
      end_index = error_message.index(' is missing', start_index)
      missing_column_index = int(error_message[start_index:end_index])
      # get the name of the missing column based on its index
      missing_column_name = list(df.columns)[missing_column_index]
      # modify the error message to include the name of the missing column
      error_message = error_message[:start_index] + f'{missing_column_name} ({missing_column_index})' + error_message[end_index:]
    raise Exception(error_message) from e  # keep the original traceback

  # Log result
  table = bq_client.get_table(full_table_id)
  logger.info(f"Table {full_table_id} now has {table.num_rows} rows and {len(table.schema)} columns")
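The error handling in `load_df_to_bq` rewrites the message by computing string offsets. The same idea can be written with a regular expression. This is an alternative sketch, not the notebook's code. The helper name `name_missing_column` and the sample message are made up for illustration. Only the phrase `Required column value for column index: N is missing` is taken from the function above.

```python
import re

# Phrase that load_df_to_bq looks for in the error text.
MISSING_COL = re.compile(r"(Required column value for column index: )(\d+)( is missing)")

def name_missing_column(message, columns):
    """Insert the column name next to the column index reported in the message."""
    def add_name(match):
        index = int(match.group(2))
        # Fall back to a placeholder if the index is out of range.
        name = columns[index] if index < len(columns) else "unknown"
        return f"{match.group(1)}{name} ({index}){match.group(3)}"
    return MISSING_COL.sub(add_name, message)

# Hypothetical message and column list, for illustration only.
sample = "Required column value for column index: 1 is missing in row 0"
print(name_missing_column(sample, ["location", "utc_datetime", "lst_datetime"]))
```

A message that does not contain the phrase is returned unchanged, so the helper can be applied to any exception text.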

In [4]:
df = pd.DataFrame(get_forecast_df())
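The cell above runs only the scraping step. In `main.py` the three steps would need to be chained behind a single entry point that GCF can call. The sketch below shows one possible shape, assuming an HTTP trigger, where the function receives a request object and may return a `(body, status)` pair. The names `run_pipeline` and `make_entry_point` are hypothetical, and the lambdas are stand-ins so the example runs without network or GCP access.

```python
def run_pipeline(extract, transform, load):
    """Run the three steps in order and return the number of rows loaded."""
    raw = extract()
    rows = transform(raw)
    load(rows)
    return len(rows)

def make_entry_point(extract, transform, load):
    """Build an HTTP-style entry point that returns a (body, status) pair."""
    def main(request):
        try:
            n_rows = run_pipeline(extract, transform, load)
        except Exception as exc:
            # Report the failure to the caller instead of crashing silently.
            return f"NWS update failed: {exc}", 500
        return f"NWS update loaded {n_rows} rows", 200
    return main

# Stand-ins for get_forecast_df, transform_df and load_df_to_bq.
loaded = []
main = make_entry_point(
    extract=lambda: {"location": ["Fairbanks", "Aleknagik"]},
    transform=lambda d: list(zip(*d.values())),
    load=loaded.extend,
)
print(main(request=None))
```

In the real package the call would presumably be `make_entry_point(get_forecast_df, transform_df, load_df_to_bq)`, with `main` named as the function's entry point when it is deployed.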

In [5]:
df[df.isna().any(axis=1)]

Unnamed: 0,Date,location,Hour (AKST),Temperature (°F),Dewpoint (°F),Wind Chill (°F),Surface Wind (mph),Wind Dir,Gust,Sky Cover (%),Precipitation Potential (%),Relative Humidity (%),Rain,Thunder,Snow,Freezing Rain,Sleet
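The raw headers shown above are not valid as-is for the BigQuery schema, which is why `transform_df` normalizes them. The sketch below applies the same steps (lowercase, strip `°` and parentheses, replace `%` with `pct`, replace spaces with underscores) to plain strings so the effect is easy to see. The helper name `clean_header` is hypothetical.

```python
import re

def clean_header(header):
    """Apply the same renaming steps transform_df uses on column names."""
    name = header.lower()
    name = re.sub(r"°|\(|\)", "", name)     # drop degree signs and parentheses
    name = re.sub("%", "pct", name)         # spell out percent signs
    return re.sub(" ", "_", name.strip())   # spaces become underscores

for raw in ["Temperature (°F)", "Sky Cover (%)", "Hour (AKST)"]:
    print(raw, "->", clean_header(raw))
```

The results match the field names used in the schema of `load_df_to_bq`, for example `temperature_f` and `sky_cover_pct`.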


In [47]:
empty_values = df2.isna().any(axis=1)
if empty_values.any():
  bad_rows = df2[empty_values]
  # logging.warning(f"When creating dataframe, failed to parse values in the following rows:\n{bad_rows}")
  for index, row in bad_rows.iterrows():
      cols_with_nans = list(row[row.isna()].index)
      cols_with_nans
      # log_data = {index: cols_with_nans}
      # logging.warning(f"When creating dataframe, failed to parse values in the following columns of row {index}:\n{json.dumps(log_data)}")


In [44]:
df2[['Date', 'location', 'Hour (AKST)', 'Temperature (°F)', 'Gust']]

Unnamed: 0,Date,location,Hour (AKST),Temperature (°F),Gust
0,02/27,Fairbanks,10,-15,
1,02/27,Fairbanks,11,-14,
2,02/27,Fairbanks,12,-12,
3,02/27,Fairbanks,13,-10,
4,02/27,Fairbanks,14,-9,
...,...,...,...,...,...
1099,03/01,Aleknagik,05,22,
1100,03/01,Aleknagik,06,22,29
1101,03/01,Aleknagik,07,20,29
1102,03/01,Aleknagik,08,19,29
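Rows like those above carry only a month/day string and an hour, so `transform_df` has to add a year and then shift the local time to UTC. The sketch below works through that arithmetic with the standard library alone. The helper name `to_local_and_utc` is hypothetical, and the year 2023 is an assumption, since the table does not show one.

```python
from datetime import datetime, timedelta

AKST_OFFSET = timedelta(hours=9)  # AKST is UTC-9, so UTC = local time + 9 hours

def to_local_and_utc(date_str, hour_str, year):
    """Combine an 'MM/DD' date and 'HH' hour into naive local and UTC datetimes."""
    local = datetime.strptime(f"{date_str}/{year} {hour_str}:00", "%m/%d/%Y %H:%M")
    return local, local + AKST_OFFSET

# First and last rows of the table above; the year is assumed.
for date_str, hour_str in [("02/27", "10"), ("03/01", "08")]:
    local, utc = to_local_and_utc(date_str, hour_str, 2023)
    print(local, "->", utc)
```

Two caveats follow from this approach: a forecast scraped on December 31 that includes January rows would need the following year for those rows, and a fixed nine-hour offset does not account for daylight time.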


In [5]:
display(locations_df.apply(nws_url, axis=1)[0])

'https://forecast.weather.gov/MapClick.php?lat=64.97&lon=-147.51&unit=0&lg=english&FcstType=digital&menu=1'
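The URL above is a base address plus a fixed set of query parameters, with only the latitude and longitude varying by location. The real `nws_url` lives in `utils/utils.py` and is not shown in this notebook, so the sketch below is a guess at an equivalent builder, with the hypothetical name `build_nws_url`. The parameter names and values are copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://forecast.weather.gov/MapClick.php"

def build_nws_url(lat, lon):
    """Build a digital-forecast URL for one latitude/longitude pair."""
    params = {
        "lat": lat,
        "lon": lon,
        "unit": 0,
        "lg": "english",
        "FcstType": "digital",
        "menu": 1,
    }
    return f"{BASE_URL}?{urlencode(params)}"

print(build_nws_url(64.97, -147.51))
```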