### NWS Forecast Data

This notebook explains how to scrape, transform, and upload the NWS's hourly weather forecast for the next 6 days** into a dataset in BigQuery. The resulting script can be set to run at regular intervals as an Airflow DAG or a function in Cloud Functions. 

** *For some reason, the hourly forecast doesn't quite extend to a full week, but only 6.5 days. To keep the math easier, we will only scrape the next 6 days -- in the end, this won't affect our pipeline once we have it updating continuously.*

We will collect hourly forecasts for the locations of 23 USCRN data collection stations in Alaska. Because observations are recorded at these stations, we and any other users of our dataset can evaluate the accuracy of the forecasts.

---

#### Why use scraping over `api.weather.gov`?

Generally speaking, if a website offers an API to access its data then it's a good bet to use it.

So why not just use `api.weather.gov`?

The main reason I've chosen web scraping for the NWS part of the project is that `api.weather.gov` has at times given me `500: Internal Server Error` responses even when the HTML interface was still accessible. The level of information provided is essentially the same:

In [91]:
import pandas as pd
import numpy as np
import requests
import re
import datetime as dt 
import itertools
import time
from bs4 import BeautifulSoup

In [23]:
locations_df = pd.read_csv("../data/locations.csv")

In [10]:
random_location = locations_df.sample(1).iloc[0] 

print(f"{random_location}\n")

lat, lon = random_location['latitude'], random_location['longitude']  

## API results
url = f"https://api.weather.gov/points/{lat},{lon}"

response = requests.get(url)
main_data = response.json()

response = requests.get(main_data['properties']['forecastHourly'])
hourly_data = response.json()
fields = hourly_data['properties']['periods'][0]

print(f"{fields}\n") 

## Webscraping results 
url = f"https://forecast.weather.gov/MapClick.php?lat={lat}&lon={lon}&unit=0&lg=english&FcstType=digital&menu=1"

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
df = pd.read_html(str(soup.find_all("table")[5]))[0]
df = df.iloc[1:17,:]

display(df)

station_location    Deadhorse
wbanno                  26565
longitude             -148.46
latitude                70.16
Name: 13, dtype: object

{'number': 1, 'name': '', 'startTime': '2023-02-28T05:00:00-09:00', 'endTime': '2023-02-28T06:00:00-09:00', 'isDaytime': False, 'temperature': -24, 'temperatureUnit': 'F', 'temperatureTrend': None, 'probabilityOfPrecipitation': {'unitCode': 'wmoUnit:percent', 'value': 3}, 'dewpoint': {'unitCode': 'wmoUnit:degC', 'value': -33.333333333333336}, 'relativeHumidity': {'unitCode': 'wmoUnit:percent', 'value': 79}, 'windSpeed': '5 mph', 'windDirection': 'SW', 'icon': 'https://api.weather.gov/icons/land/night/sct,3?size=small', 'shortForecast': 'Partly Cloudy', 'detailedForecast': ''}



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
1,Date,02/28,,,,,,,,,...,,,,,,03/01,,,,
2,Hour (AKST),05,06,07,08,09,10,11,12,13,...,19,20,21,22,23,00,01,02,03,04
3,Temperature (°F),-24,-24,-25,-25,-24,-22,-19,-17,-16,...,-16,-17,-18,-20,-21,-22,-22,-21,-21,-21
4,Dewpoint (°F),-28,-28,-29,-30,-30,-28,-25,-22,-21,...,-20,-23,-22,-24,-24,-25,-25,-25,-26,-26
5,Wind Chill (°F),-38,-41,-41,-41,-44,-42,-39,-41,-39,...,-45,-46,-47,-49,-51,-51,-51,-50,-48,-48
6,Surface Wind (mph),5,6,6,6,9,9,9,15,15,...,22,22,22,22,22,21,21,21,18,18
7,Wind Dir,SW,S,S,S,E,E,E,E,E,...,E,E,E,E,E,E,E,E,E,E
8,Gust,,,,,,,,,,...,,,,,,,,,,
9,Sky Cover (%),28,22,22,22,14,14,14,5,5,...,32,32,28,28,28,65,65,65,73,73
10,Precipitation Potential (%),3,3,3,3,2,2,2,1,1,...,0,0,0,0,0,0,0,0,0,0


The results need cleaning (e.g. the column names are in the first row), but you can see that the information is roughly equivalent, with the API offering an `isDaytime` field and the HTML interface offering an actual percentage for sky cover (rather than a snippet like "Partly Cloudy"). Depending on how `isDaytime` is measured, it might be more accurate than any estimate of daylight hours we could make based off the scraped data...

![alaska-sun](../img/alaska_suntimes.png)

Welp -- we'll exclude it for now. Later on we can either access it separately from the API or estimate it ourselves based off standard time tables.

---

#### 1.) Scraping the Data 

For each location, the forecast for the next 48 hours is stored in a table like this:

<img src="../img/nws_p1.png" height=400px>

The rest of the forecast can be accessed by jumping ahead in 48-hour increments. We do this by appending `&AheadHour=` to the URL with the number of hours to skip (48 and 96; 0 returns the first page).
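
To make the paging concrete, here is a minimal sketch of how the three page URLs for one location could be built. `build_forecast_urls` is a hypothetical helper written for illustration (it is not the `utils.get_nws_url` used later, whose body isn't shown here); the query parameters are copied from the scraping URL used earlier in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://forecast.weather.gov/MapClick.php"

def build_forecast_urls(lat, lon, offsets=(0, 48, 96)):
    """Return one digital-forecast URL per 48-hour page (hypothetical helper)."""
    urls = []
    for hours_ahead in offsets:
        params = {
            "lat": lat,
            "lon": lon,
            "unit": 0,
            "lg": "english",
            "FcstType": "digital",
            "menu": 1,
            "AheadHour": hours_ahead,  # how many hours to skip ahead
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls

# Coordinates of the Deadhorse station from the sample output above
urls = build_forecast_urls(70.16, -148.46)
for url in urls:
    print(url)
```

Three pages of 48 hours each give the 6 days (144 hours) of forecast per location that the rest of the notebook works with.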

The series of transformations required here is quite complex -- a downside of scraping vs using the API.

In [59]:
from utils import utils # get_last_update_nws, ff_list, get_soup, flatten  -- check utils.py to see what these do

def extract_table_data(soup:BeautifulSoup, location:str) -> list:
  """
  Extracts 48hr forecast table data from a Beautiful Soup object as a list of lists

  Args: 
  soup (BeautifulSoup): Parsed forecast page; the forecast rows are read from its sixth <table> element

  location (str): The name of the place the forecast is for; used for filling out added "location" column 

  Returns:
  table (list): List of lists containing table data 
  """
  table_records = soup.find_all("table")[5].find_all("tr")

  colspan = table_records[0] # 48hr data is divided into two tables by two colspan elements
  table = [tr for  tr in table_records if tr != colspan] # vertically concat tables by removing colspan elements

  table = [[ele.getText() for ele in tr.find_all("font")] for tr in table] 

  # Add location column 
  location_col = ['location']
  location_col.extend([location]*24) # fill out to match length of other columns
  table.insert(1, location_col)  # for first half of table
  table.insert(19, location_col) # for second half of table

  # Add last_update_nws column 
  last_update_nws = ["last_update_nws"]
  last_update_nws.extend([utils.get_last_update_nws(soup)] * 24)
  table.insert(1, last_update_nws)
  table.insert(19, last_update_nws) 

  return table

def transpose_as_dict(table:list) -> dict:
  """
  Takes the list of lists generated by extract_table_data() and transposes it (flip orientation) by casting as a dictionary
  
  Args:
  table (list): list of lists of columnar data generated by extract_table_data()

  Returns: 
  data_map (dict): Dictionary representation of table, transposed and ready to be made into a dataframe
  """
  data_map = {}
  for col in table: # Table is still "landscape-oriented"
    if col[0] not in data_map.keys(): # cols from first half of table
      data_map[col[0]] = col[1:]
    else: # cols from second half
      data_map[col[0]].extend(col[1:])

  data_map['Date'] = utils.ff_list(data_map['Date'])

  return data_map

def transform_df(forecast_dict:dict) -> pd.DataFrame: 
  """Cast dictionary from transpose_as_dict() to a dataframe and transform"""
  ## Create dataframe
  df = pd.DataFrame(forecast_dict)
  
  ## Edit column headers 
  df.columns = [col.lower() for col in df.columns] 
  df.rename(columns=lambda x: re.sub(r'°|\(|\)', '', x), inplace=True)  # raw string avoids invalid-escape warnings
  df.rename(columns=lambda x: re.sub('%', 'pct', x), inplace=True)
  df.rename(columns=lambda x: re.sub(' ', '_', x.strip()), inplace=True)
  
  ## Replace missing values
  # Replace missing values in gust with zero -- gust is never *actually* 0 so no masking
  # Replace missing values in wind_chill_f with an explicit np.nan
  df.replace({'gust':{'':0}, 'wind_chill_f':{'':np.nan}}, inplace=True)

  ## Datetime Transformations
  cur_year = dt.datetime.now().year
  dt_strings = df['date'] + '/' + str(cur_year) + ' ' + df['hour_akst'] + ':00 AKST'
  # Local time (AKST)
  df['lst_datetime'] = pd.to_datetime(dt_strings, format='%m/%d/%Y %H:%M AKST')
  # UTC time
  akst_offset = dt.timedelta(hours=9)
  df['utc_datetime'] = df['lst_datetime'] + akst_offset

  ## Drop duplicates in composite key columns 
  duplicates = df.duplicated(subset=["location", "lst_datetime"], keep=False)
  duplicate_rows = df[duplicates]
  if not duplicate_rows.empty:
    print(f"Warning: {len(duplicate_rows)} rows have duplicate values in location and lst_datetime")
    print("Dropping duplicates (keeping the first occurrence)")
    df.drop_duplicates(subset=['location', 'lst_datetime'], inplace=True, ignore_index=True)

  ## Reorder columns 
  col_names = ['location', 'utc_datetime', 'lst_datetime'] + list(df.columns[4:-2]) + ["last_update_nws"]
  df = df[col_names]

  return df 
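
The transposition step is easier to follow on a toy input. The sketch below mirrors what `transpose_as_dict()` does with a four-row table: rows that share a label are concatenated, and the sparse `Date` row is forward-filled. The `ff_list` here is an assumed stand-in for `utils.ff_list` (its real implementation lives in `utils.py` and may differ), and the toy values are made up for illustration.

```python
def ff_list(values):
    """Forward-fill empty strings with the last non-empty value.

    Assumed behaviour of utils.ff_list; the real helper may differ.
    """
    filled, last = [], None
    for value in values:
        if value != "":
            last = value
        filled.append(last)
    return filled

def transpose_rows(table):
    """Merge rows that share a label (first element) into one column per label."""
    data_map = {}
    for row in table:
        # setdefault creates the column on first sight, extend appends later halves
        data_map.setdefault(row[0], []).extend(row[1:])
    return data_map

# Two "halves" of a forecast page, each with a Date row and an Hour row
toy_table = [
    ["Date", "02/28", "", "03/01"],
    ["Hour (AKST)", "22", "23", "00"],
    ["Date", "", "", ""],
    ["Hour (AKST)", "01", "02", "03"],
]

data_map = transpose_rows(toy_table)
data_map["Date"] = ff_list(data_map["Date"])
print(data_map)
```

Every hour ends up paired with the date it belongs to, which is the shape `pd.DataFrame()` expects in `transform_df()`.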

And here's the main function to scrape the data:

In [111]:
def get_forecast_df() -> pd.DataFrame:
  """Get a dataframe of NWS forecast data for the next 6 days from various points in Alaska"""

  nws_urls = locations_df.apply(utils.get_nws_url, axis=1)
  url_map = dict(zip(locations_df['station_location'], nws_urls))

  combined_table = []
  for location, url in url_map.items():
    soup_list = [utils.get_soup(url + f"&AheadHour={hr}") for hr in (0,48,96)]
    table_list = utils.flatten([extract_table_data(soup, location) for soup in soup_list])
    combined_table.extend(table_list)
  
  forecast_dict = transpose_as_dict(combined_table)

  return transform_df(forecast_dict)

In [60]:
df = get_forecast_df()
df

Unnamed: 0,location,utc_datetime,lst_datetime,temperature_f,dewpoint_f,wind_chill_f,surface_wind_mph,wind_dir,gust,sky_cover_pct,precipitation_potential_pct,relative_humidity_pct,rain,thunder,snow,freezing_rain,sleet,fog,last_update_nws,hash_id
0,Fairbanks,2023-03-01 15:00:00,2023-03-01 06:00:00,-4,-8,-17,7,E,0,82,16,82,--,--,SChc,--,--,--,2023-03-01 04:51:00,595ce723d483bcc5136b598d75c8106a
1,Fairbanks,2023-03-01 16:00:00,2023-03-01 07:00:00,-4,-9,-18,7,E,0,82,16,80,--,--,SChc,--,--,--,2023-03-01 04:51:00,eb60aefaa763787eccd79cfb3d2cb150
2,Fairbanks,2023-03-01 17:00:00,2023-03-01 08:00:00,-4,-10,-18,7,E,0,82,16,77,--,--,SChc,--,--,--,2023-03-01 04:51:00,7ac1be4ecb6e583d98328f7ee8ccd618
3,Fairbanks,2023-03-01 18:00:00,2023-03-01 09:00:00,-2,-9,-12,5,NE,0,84,21,73,--,--,SChc,--,--,--,2023-03-01 04:51:00,2edc0876c91456a32d481bcf5a91456a
4,Fairbanks,2023-03-01 19:00:00,2023-03-01 10:00:00,2,-6,-8,5,NE,0,84,21,69,--,--,SChc,--,--,--,2023-03-01 04:51:00,ab8ea23d1af79efd227d782ecc077149
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3307,Aleknagik,2023-03-07 10:00:00,2023-03-07 01:00:00,27,26,0,11,E,0,73,31,98,--,--,Chc,--,--,--,2023-03-01 04:21:00,24a8c0ca1f720e1ca2db6b2af46d7312
3308,Aleknagik,2023-03-07 11:00:00,2023-03-07 02:00:00,27,26,0,11,E,0,73,31,99,--,--,Chc,--,--,--,2023-03-01 04:21:00,354f18cf14430bca0021ec2de3cedceb
3309,Aleknagik,2023-03-07 12:00:00,2023-03-07 03:00:00,26,26,0,8,E,0,70,33,100,--,--,Chc,--,--,--,2023-03-01 04:21:00,5ee9a517e5c05a1c982429d11e8cc26d
3310,Aleknagik,2023-03-07 13:00:00,2023-03-07 04:00:00,26,26,0,8,E,0,70,33,98,--,--,Chc,--,--,--,2023-03-01 04:21:00,000fa28b4b24e3ef952ef56e8c551c95
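
One fragile spot in `transform_df()` is the datetime step: the scraped table only carries `MM/DD` dates, so the year is taken from the day the scrape runs. A scrape in late December would stamp early-January rows with the old year. The sketch below shows one possible guard, assuming the page always reports hours in AKST (UTC-9) as in the sample output; `parse_forecast_hour` is a hypothetical helper, and daylight time is not handled here.

```python
import datetime as dt

# Fixed UTC-9 offset; an assumption that matches the "Hour (AKST)" header above
AKST = dt.timezone(dt.timedelta(hours=-9), name="AKST")

def parse_forecast_hour(date_str, hour_str, scraped_at):
    """Combine an 'MM/DD' date and 'HH' hour into an aware datetime.

    The year is inferred from scraped_at; a forecast month earlier than the
    scrape month is taken to mean the forecast crossed New Year.
    """
    month, day = (int(part) for part in date_str.split("/"))
    year = scraped_at.year
    if month < scraped_at.month:
        year += 1
    return dt.datetime(year, month, day, int(hour_str), tzinfo=AKST)

scraped_at = dt.datetime(2023, 12, 29, 5, 0, tzinfo=AKST)
local = parse_forecast_hour("01/02", "04", scraped_at)
utc = local.astimezone(dt.timezone.utc)
print(local.isoformat(), "->", utc.isoformat())
```

Using timezone-aware datetimes also makes the conversion to UTC explicit instead of relying on a manually added 9-hour `timedelta`.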



#### 2.) Uploading the Data 

![TO-DO_ERD]()

The end goal of our pipeline is for the NWS forecasts to be easily evaluated against the historical data from the USCRN. We also want to make it easy for data scientists and ML engineers who build forecasts from the USCRN data to compare their models against the NWS forecasts. A key parameter for that comparison is how far in advance each NWS forecast was made (`utc_datetime - last_update_nws`).

By taking a daily snapshot of the NWS forecast data we keep the data highly denormalized, making this sort of analysis much easier and our pipeline simpler to manage. We could reduce data duplication by using a nested history column (e.g. a JSON array), but storage is less of a concern for us than ease of analysis.
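
As a rough illustration of the lead-time parameter, the sketch below computes `utc_datetime - last_update_nws` in hours for the first Fairbanks row shown earlier. It assumes both timestamps are expressed in the same timezone, which this notebook does not state for `last_update_nws`; `lead_time_hours` is a hypothetical helper.

```python
import datetime as dt

def lead_time_hours(valid_time, issued_time):
    """Hours between when a forecast was issued and the hour it is valid for."""
    return (valid_time - issued_time).total_seconds() / 3600

# Values taken from the first row of the sample dataframe above
valid = dt.datetime(2023, 3, 1, 15, 0)    # utc_datetime
issued = dt.datetime(2023, 3, 1, 4, 51)   # last_update_nws (timezone assumed to match)

hours = lead_time_hours(valid, issued)
lead_day = int(hours // 24)  # 0 = same-day forecast, 1 = one day ahead, ...
print(f"{hours:.2f} hours ahead (lead day {lead_day})")
```

On the dataframe itself the same quantity would be `(df['utc_datetime'] - df['last_update_nws']).dt.total_seconds() / 3600`, provided both columns are datetimes.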

In [61]:
from yaml import full_load
from google.cloud import bigquery 
from google.oauth2 import service_account 
from google.api_core.exceptions import NotFound

# GCP/BigQuery information
with open("../airflow/dags/config/gcp-config.yaml", "r") as fp:
  gcp_config = full_load(fp)
  
PROJECT_ID = gcp_config['project-id']
DATASET_ID = gcp_config['dataset-id']
STAGING_TABLE_ID = 'nws_staging'
MAIN_TABLE_ID = 'nws' 

# Set credentials
key_path = gcp_config['credentials']
credentials = service_account.Credentials.from_service_account_file(
  key_path, scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# Create client
client = bigquery.Client(credentials=credentials, project=PROJECT_ID)

In [67]:
def load_staging_table(df:pd.DataFrame) -> None: 
  """Upload dataframe from get_forecast_df() to BigQuery staging table"""

  # Set Schema
  schema = [
    bigquery.SchemaField("location", "STRING", mode="REQUIRED"), 
    bigquery.SchemaField("utc_datetime", "DATETIME", mode="REQUIRED"), 
    bigquery.SchemaField("lst_datetime", "DATETIME", mode="REQUIRED"), 
    bigquery.SchemaField("temperature_f", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("dewpoint_f", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("wind_chill_f", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("surface_wind_mph", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("wind_dir", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("gust", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("sky_cover_pct", "INTEGER", mode="NULLABLE"), 
    bigquery.SchemaField("precipitation_potential_pct", "FLOAT", mode="NULLABLE"), 
    bigquery.SchemaField("relative_humidity_pct", "FLOAT", mode="NULLABLE"),
    bigquery.SchemaField("rain", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("thunder", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("snow", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("freezing_rain", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("sleet", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("fog", "STRING", mode="NULLABLE"), 
    bigquery.SchemaField("last_update_nws", "DATETIME", mode="NULLABLE")
  ] 

  jc = bigquery.LoadJobConfig(
    source_format = bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=False,
    schema=schema,
    create_disposition="CREATE_IF_NEEDED",
    write_disposition="WRITE_TRUNCATE"   
  )

  # Set target table in BigQuery
  full_table_id = f"{PROJECT_ID}.{DATASET_ID}.{STAGING_TABLE_ID}"

  # Upload to BigQuery
  ## If any required columns are missing values, include name of column in error message
  try: 
    job = client.load_table_from_dataframe(df, full_table_id, job_config=jc)
    job.result()
  except Exception as e:
    error_message = str(e)
    if 'Required column value for column index' in error_message:
      start_index = error_message.index('Required column value for column index') + len('Required column value for column index: ')
      end_index = error_message.index(' is missing', start_index)
      missing_column_index = int(error_message[start_index:end_index])
      missing_column_name = list(df.columns)[missing_column_index]
      error_message = error_message[:start_index] + f'{missing_column_name} ({missing_column_index})' + error_message[end_index:]
    raise Exception(error_message) from e  # keep the original traceback attached
  
  # Log result 
  table = client.get_table(full_table_id)
  print(f"Loaded {table.num_rows} rows and {len(table.schema)} columns into {full_table_id}\n")
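
The string slicing in the `except` branch can also be written with a regular expression, which may be easier to read and leaves the message untouched when the pattern is absent. This is only a sketch: the sample message is hypothetical, assembled from the two substrings the code above searches for, and real BigQuery error text may be worded differently.

```python
import re

# Pattern built from the substrings used in load_staging_table()
PATTERN = re.compile(r"Required column value for column index: (\d+) is missing")

def name_missing_column(error_message, columns):
    """Rewrite 'column index: N' as 'column index: <name> (N)' when N is a valid index."""
    def _swap(match):
        index = int(match.group(1))
        if 0 <= index < len(columns):
            return (
                "Required column value for column index: "
                f"{columns[index]} ({index}) is missing"
            )
        return match.group(0)  # index out of range: leave the text as-is
    return PATTERN.sub(_swap, error_message)

columns = ["location", "utc_datetime", "lst_datetime"]
# Hypothetical message for illustration only
sample = "Error while reading data: Required column value for column index: 1 is missing in row 7"
print(name_missing_column(sample, columns))
```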

In [68]:
load_staging_table(df)

Loaded 3311 rows and 19 columns into alaska-scrape.weather.nws_staging



In [79]:
def insert_table() -> None: 
  """Insert staging table into the main data table -- creates the table if it doesn't exist yet"""
  
  insert_query=f"""
    INSERT INTO {DATASET_ID}.{MAIN_TABLE_ID} 
    SELECT *, CURRENT_TIMESTAMP() as date_added
    FROM {DATASET_ID}.{STAGING_TABLE_ID}
    """

  try: 
    query_job = client.query(insert_query) 
    query_job.result()
  except NotFound:
    print(f"Table {DATASET_ID}.{MAIN_TABLE_ID} does not exist. Creating.")
    create_query = f"""
      CREATE TABLE {DATASET_ID}.{MAIN_TABLE_ID}
      AS
      SELECT *, CURRENT_TIMESTAMP() as date_added
      FROM {DATASET_ID}.{STAGING_TABLE_ID}
    """
    query_job = client.query(create_query)
    query_job.result()
    
  full_table_id = f"{PROJECT_ID}.{DATASET_ID}.{MAIN_TABLE_ID}"
  table = client.get_table(full_table_id)
  print(f"Loaded {table.num_rows} rows and {len(table.schema)} columns into {full_table_id}\n")

In [80]:
insert_table()

Table weather.nws does not exist. Creating.
Loaded 3311 rows and 20 columns into alaska-scrape.weather.nws



![success](../img/success_nws_staging.png)

---


##### Automating this pipeline

**Airflow**

![nws_dag_success](../img/nws_dag_success.png)

See `../airflow/dags/` for how I ported this over to an automatable DAG in Airflow. `get_forecast_df()` does quite a lot for a single task, but we kept the port super simple since ultimately this pipeline will be run in Google Cloud Functions. Airflow logs our print statements so we don't even need to change those.

```python 
## Changes made 
from airflow.decorators import task  # TaskFlow API decorator

@task 
def get_forecast_dict() -> dict:
  """Get forecast data for next 6 days as dict"""
  #...do stuff...#
  return forecast_dict

@task
def transform_forecast(forecast_dict:dict) -> dict: 
  """Cast dictionary from get_forecast_dict() to a dataframe, transform, and cast back to dict"""
  #...do stuff...#
  return transformed_dict 

@task 
def load_staging_table(transformed_dict:dict) -> None: 
  """Read dict from transform_forecast() and load it to the staging table in BigQuery"""
```
<br>

A main restriction for Airflow is that tasks are not meant to transfer large amounts of data between each other. We can pass data as XComs, but: 
1. Dataframes are too large 

In [87]:
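import sys 
# Caution: sys.getsizeof is shallow for a dict -- it counts the container only,
# not the values it references, so the second number understates the real payload
print(f"{sys.getsizeof(df)} vs. {sys.getsizeof(df.to_dict())} bytes!!!")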

3130472 vs. 656 !!!


2. Types which are not JSON serializable (e.g. datetimes) need to be converted before they can be encoded. This makes passing around dictionaries with datetime information awkward. To keep the port simple, we will modify `get_last_update_nws()` to return a string, and convert any datetime columns to strings and back again between XCom pushes and pulls. 
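
Both points can be illustrated without Airflow. The sketch below, assuming XCom values are serialized as JSON, round-trips a dictionary with datetimes through ISO 8601 strings and compares the shallow `sys.getsizeof` of the dictionary with the length of its encoded form. `to_xcom` and `from_xcom` are hypothetical helpers, and the payload is made-up data.

```python
import datetime as dt
import json
import sys

# Made-up payload: 1000 hourly rows for one location
start = dt.datetime(2023, 3, 1, 15)
records = {
    "location": ["Fairbanks"] * 1000,
    "utc_datetime": [start + dt.timedelta(hours=i) for i in range(1000)],
}

def to_xcom(payload):
    """Encode a dict of columns as JSON, writing datetimes as ISO 8601 strings."""
    encodable = {
        key: [v.isoformat() if isinstance(v, dt.datetime) else v for v in values]
        for key, values in payload.items()
    }
    return json.dumps(encodable)

def from_xcom(text, datetime_keys=("utc_datetime",)):
    """Decode the JSON string and restore the named columns to datetimes."""
    data = json.loads(text)
    for key in datetime_keys:
        data[key] = [dt.datetime.fromisoformat(v) for v in data[key]]
    return data

encoded = to_xcom(records)
restored = from_xcom(encoded)

# getsizeof only measures the outer dict; the encoded string is what gets stored
print(f"shallow dict size: {sys.getsizeof(records)} bytes")
print(f"encoded payload:   {len(encoded)} characters")
```

The encoded length is a more realistic estimate of what each XCom push would carry than the shallow size of the dictionary.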


**Google Cloud Functions**

See `./notebooks/3_gcf_export.ipynb` for how I ported this over to a Google Cloud Function. 

<!-- ![bq_result](../img/success_nws_staging.png) -->