### NWS Forecast Data

This notebook explains how to scrape and transform the National Weather Service's (NWS) hourly weather forecast for the next 6 days** and upload it to a dataset in BigQuery. The resulting script can be scheduled to run at regular intervals as an Airflow DAG or as a function in Cloud Functions.

** *For some reason, the hourly forecast doesn't quite extend to a full week, but only 6.5 days. To keep the math easier, we will only scrape the next 6 days -- in the end, this won't affect our pipeline once we have it updating continuously.*

We will collect hourly forecasts for the locations of 23 USCRN data collection stations in Alaska. Because these stations record observed conditions, we and any other users of our dataset can evaluate the accuracy of the forecasts.

---

#### Why use scraping over `api.weather.gov`?

Generally speaking, if a website offers an API to access its data then it's a good bet to use it. So why not just use `api.weather.gov`?

There are at least two reasons I chose to scrape the forecast data for this project:

1. I've noticed that `api.weather.gov` sometimes returns a `500: Internal Server Error` response while the HTML data interface is still accessible.  
2. As far as I can tell, the API does not offer as much information as the tabular HTML interface:  

In [95]:
import pandas as pd
import numpy as np

import requests
import re
import datetime as dt 
import itertools
from bs4 import BeautifulSoup

In [4]:
locations_df = pd.read_csv("../data/locations.csv")

In [82]:
random_location = locations_df.sample(1).iloc[0] 

print(f"{random_location}\n")

lat, lon = random_location['latitude'], random_location['longitude']  

## API results
url = f"https://api.weather.gov/points/{lat},{lon}"

response = requests.get(url)
main_data = response.json()

response = requests.get(main_data['properties']['forecastHourly'])
hourly_data = response.json()
fields = hourly_data['properties']['periods'][0]

print(f"{fields}\n") 

## Webscraping results 
url = f"https://forecast.weather.gov/MapClick.php?lat={lat}&lon={lon}&unit=0&lg=english&FcstType=digital&menu=1"

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
df = pd.read_html(str(soup.find_all("table")[5]))[0]
df = df.iloc[1:16,:]

display(df)

station_location    Utqiagvik
wbanno                  27516
longitude             -156.61
latitude                71.32
Name: 1, dtype: object

{'number': 1, 'name': '', 'startTime': '2023-02-27T14:00:00-09:00', 'endTime': '2023-02-27T15:00:00-09:00', 'isDaytime': True, 'temperature': -13, 'temperatureUnit': 'F', 'temperatureTrend': None, 'probabilityOfPrecipitation': {'unitCode': 'wmoUnit:percent', 'value': 2}, 'dewpoint': {'unitCode': 'wmoUnit:degC', 'value': -28.333333333333332}, 'relativeHumidity': {'unitCode': 'wmoUnit:percent', 'value': 73}, 'windSpeed': '10 mph', 'windDirection': 'W', 'icon': 'https://api.weather.gov/icons/land/day/bkn,2?size=small', 'shortForecast': 'Mostly Cloudy', 'detailedForecast': ''}



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
1,Date,02/27,,,,,,,,,...,,,,,,,,,,
2,Hour (AKST),15,16,17,18,19,20,21,22,23,...,05,06,07,08,09,10,11,12,13,14
3,Temperature (°F),-13,-14,-16,-19,-20,-21,-22,-22,-23,...,-24,-24,-24,-24,-22,-21,-21,-21,-22,-22
4,Dewpoint (°F),-18,-19,-21,-23,-24,-24,-25,-25,-26,...,-27,-28,-28,-27,-26,-25,-24,-24,-25,-26
5,Wind Chill (°F),-31,-33,-36,-36,-37,-38,-35,-36,-36,...,-42,-45,-45,-44,-45,-44,-43,-46,-47,-48
6,Surface Wind (mph),9,9,9,7,7,7,5,5,5,...,7,9,9,9,11,11,11,15,15,15
7,Wind Dir,W,W,W,SW,SW,SW,S,S,S,...,E,E,E,E,E,E,E,E,E,E
8,Gust,,,,,,,,,,...,,,,,,,,,,
9,Sky Cover (%),74,74,74,55,55,55,55,55,55,...,19,20,20,20,21,21,21,21,21,21
10,Precipitation Potential (%),14,14,14,14,14,14,14,14,14,...,1,1,1,1,1,1,1,1,1,1


The results need cleaning (e.g. the column names are in the first row), but you can see that all the same information is present with the addition of several other fields. The API *does* provide a useful `isDaytime` field but we can calculate that ourselves.
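
As a rough illustration of calculating a daytime flag ourselves, the sketch below marks an hour as daytime when the sun is above the horizon. The function name `is_daytime` is hypothetical, the declination formula is a common textbook approximation that ignores atmospheric refraction and the equation of time, and the result is not guaranteed to match however the API defines `isDaytime`.

```python
import datetime as dt
import math

def is_daytime(utc_time: dt.datetime, lat: float, lon: float) -> bool:
    """Approximate check: True when the sun is above the horizon (assumed definition)."""
    day_of_year = utc_time.timetuple().tm_yday
    # Approximate solar declination in radians
    decl = math.radians(-23.44) * math.cos(2 * math.pi * (day_of_year + 10) / 365)
    # Approximate local solar time in hours: 15 degrees of longitude per hour
    solar_hour = utc_time.hour + utc_time.minute / 60 + lon / 15
    hour_angle = math.radians(15 * (solar_hour - 12))
    lat_r = math.radians(lat)
    # Sine of the solar elevation angle
    sin_elev = (math.sin(lat_r) * math.sin(decl)
                + math.cos(lat_r) * math.cos(decl) * math.cos(hour_angle))
    return sin_elev > 0

utqiagvik = (71.32, -156.61)  # latitude, longitude from the sample output above
print(is_daytime(dt.datetime(2023, 12, 21, 22, 0), *utqiagvik))  # False (polar night)
print(is_daytime(dt.datetime(2023, 6, 21, 10, 0), *utqiagvik))   # True (midnight sun)
```

The function expects a naive datetime that is already in UTC, which matches the `utc_datetime` column built later in this notebook.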

---

#### 1.) Scraping the Data 

For each location, the forecast for the next 48 hours is presented in an HTML table like this: 

<img src="../img/nws_p1.png" height=400px>

The rest of the forecast can be accessed by jumping ahead in 48-hour increments. We do this by appending `&AheadHour=` to the URL, followed by the number of hours to skip (48 and 96). 
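
To make the paging arithmetic concrete: 6 days is 144 hours, which is three 48-hour pages starting at hours 0, 48, and 96. Here is a minimal sketch of building those URLs. The helper name `paged_urls` is hypothetical, and the base URL is just the digital-forecast URL used earlier with Utqiagvik's coordinates filled in.

```python
def paged_urls(base_url: str, days: int = 6, page_hours: int = 48) -> list:
    """Return one URL per 48-hour page needed to cover `days` days of forecasts."""
    total_hours = days * 24
    return [f"{base_url}&AheadHour={hr}" for hr in range(0, total_hours, page_hours)]

base = ("https://forecast.weather.gov/MapClick.php"
        "?lat=71.32&lon=-156.61&unit=0&lg=english&FcstType=digital&menu=1")

for url in paged_urls(base):
    print(url.split("&")[-1])
# AheadHour=0
# AheadHour=48
# AheadHour=96
```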

The series of transformations required here is quite complex -- a downside of scraping vs using the API -- so I've broken it up into many modular functions.

In [97]:
## General Utilities 
def get_soup(url:str) -> BeautifulSoup:
  """Simple wrapper for getting beautiful soup object from url"""
  result = requests.get(url)
  return BeautifulSoup(result.content, "html.parser") 

def flatten(ls:list): 
  """Flattens/unnests a list of lists by one layer"""
  return list(itertools.chain.from_iterable(ls)) 

def ff_list(ls:list) -> list:
  """Forward fill the values in a list"""
  for i in range(len(ls)):
    if not ls[i] and i > 0:
        ls[i] = ls[i-1]
  return ls

## Specific Utilities
def get_nws_url(row:pd.Series) -> str:
  """
  Get url for the next 48 hours of forecasts from latitude and longitude columns
  
  Args: 
  row (pd.Series): The current row of the dataframe

  Returns: 
  url (str): The url for the next 48 hours of forecasts
  """
  lat, lon = row["latitude"], row["longitude"]
  url = f"https://forecast.weather.gov/MapClick.php?w0=t&w1=td&w2=wc&w3=sfcwind&w3u=1&w4=sky&w5=pop&w6=rh&w7=rain&w8=thunder&w9=snow&w10=fzg&w11=sleet&w12=fog&AheadHour=0&Submit=Submit&FcstType=digital&textField1={lat}&textField2={lon}&site=all&unit=0&dd=&bw=&menu=1"
  return url

def get_last_update(soup:BeautifulSoup) -> dt.datetime:
  """
  Find the "Last Updated" value from a BeautifulSoup object, transform to a datetime in AKST

  Args:
  soup (BeautifulSoup): A Beautiful Soup representation of a particular NWS forecast page

  Returns: 
  last_update_dt (datetime): Datetime representation of time page was last updated (AKST)
  """
  last_update_tag = soup.find('td', string=lambda text: text and 'Last Update:' in text)
  last_update_text = re.sub(r"Last Update: |\s(?=[ap]m)|AKST |,", "", last_update_tag.getText())
  last_update_dt = dt.datetime.strptime(last_update_text, "%I:%M%p %b %d %Y")
  return last_update_dt
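
To see what the "Last Update" parsing does in isolation, here is a small sketch that runs the same clean-up and `strptime` format on a sample string. The sample text is an assumption about how the page formats the value (it is consistent with the `last_update_nws` values shown later), and `parse_last_update` is a hypothetical name. This version accepts both `am` and `pm`.

```python
import datetime as dt
import re

def parse_last_update(text: str) -> dt.datetime:
    """Parse a 'Last Update: ...' string into a naive datetime (AKST)."""
    # Strip the label, the space before am/pm, the timezone name, and the comma
    cleaned = re.sub(r"Last Update: |\s(?=[ap]m)|AKST |,", "", text)
    return dt.datetime.strptime(cleaned, "%I:%M%p %b %d %Y")

sample = "Last Update: 6:47 pm AKST Feb 27, 2023"  # assumed page format
print(parse_last_update(sample))  # 2023-02-27 18:47:00
```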

In [109]:
## Core helper functions
def extract_table_data(soup:BeautifulSoup, location:str) -> list:
  """
  Extracts 48hr forecast table data from a Beautiful Soup object as a list of lists

  Args: 
  soup (BeautifulSoup): A Beautiful Soup representation of a particular NWS forecast page

  location (str): The name of the place the forecast is for; used for filling out added "location" column 

  Returns:
  table (list): List of lists containing table data 
  """
  table_records = soup.find_all("table")[5].find_all("tr")

  colspan = table_records[0] # 48hr data is divided into two tables by two colspan elements
  table = [tr for tr in table_records if tr != colspan] # vertically concat tables by removing colspan elements

  table = [[ele.getText() for ele in tr.find_all("font")] for tr in table] 

  # Add location column 
  location_col = ['location']
  location_col.extend([location]*24) # fill out to match length of other columns
  table.insert(1, location_col)  # for first half of table
  table.insert(19, location_col) # for second half of table

  # Add last_update_nws column 
  last_update_nws = ['last_update_nws']
  last_update_nws.extend([get_last_update(soup)] * 24)
  table.insert(1, last_update_nws)
  table.insert(19, last_update_nws) 

  return table

def transpose_as_dict(table:list) -> dict:
  """
  Takes the list of lists generated by extract_table_data() and transposes it (flip orientation) by casting as a dictionary
  
  Args:
  table (list): list of lists of columnar data generated by extract_table_data()

  Returns: 
  data_map (dict): Dictionary representation of table, transposed and ready to be made into a dataframe
  """
  data_map = {}
  for col in table: # Table is still "landscape-oriented"
    if col[0] not in data_map.keys(): # cols from first half of table
      data_map[col[0]] = col[1:]
    else: # cols from second half
      data_map[col[0]].extend(col[1:])
  data_map['Date'] = ff_list(data_map['Date'])
  return data_map

def transform_df(fcast_dict:dict) -> pd.DataFrame: 
  """
  Cast dictionary from transpose_as_dict() to a dataframe and transform

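  Args: 
  fcast_dict (dict): Dictionary representation of the forecast table generated by transpose_as_dict()

  Returns:
  df (pd.DataFrame): Cleaned forecast dataframe with lst_datetime, utc_datetime, and date_added_utc columns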
  """
  # Create dataframe
  df = pd.DataFrame(fcast_dict)

  # Edit column headers 
  df.columns = [col.lower() for col in df.columns] 
  df.rename(columns=lambda x: re.sub(r'°|\(|\)', '', x), inplace=True)
  df.rename(columns=lambda x: re.sub('%', 'pct', x), inplace=True)
  df.rename(columns=lambda x: re.sub(' ', '_', x.strip()), inplace=True)

  # Replace missing value indicators with NaN (np.NaN was removed in NumPy 2.0)
  df.replace({'': np.nan, '--': np.nan}, inplace=True)

  ## Datetime Transformations
  cur_year = dt.datetime.now().year
  dt_strings = df['date'] + '/' + str(cur_year) + ' ' + df['hour_akst'] + ':00 AKST'
  # Local time (AKST)
  df['lst_datetime'] = pd.to_datetime(dt_strings, format='%m/%d/%Y %H:%M AKST')
  # UTC time
  akst_offset = dt.timedelta(hours=9)
  df['utc_datetime'] = df['lst_datetime'] + akst_offset

  ## Reorder columns 
  col_names = ['location','utc_datetime','lst_datetime'] + list(df.columns)[3:-2]
  df = df[col_names]

  # Timestamp column: track when forecast was accessed
  df['date_added_utc'] = dt.datetime.utcnow()

  return df 
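
The transposition step is easier to follow on a toy table. The sketch below uses made-up rows shaped like the scraped data: each row starts with its label, and the same labels repeat for the second half of the table. The names `merge_halves` and `forward_fill` are hypothetical stand-ins for `transpose_as_dict()` and `ff_list()`, not the functions themselves.

```python
def merge_halves(rows: list) -> dict:
    """Merge rows that share a label (first element) into one column per label."""
    columns = {}
    for row in rows:
        label, values = row[0], row[1:]
        columns.setdefault(label, []).extend(values)
    return columns

def forward_fill(values: list) -> list:
    """Replace empty values with the most recent non-empty value."""
    filled = []
    for value in values:
        filled.append(value if value or not filled else filled[-1])
    return filled

# Made-up rows: the date is only printed where it changes
rows = [
    ["Date", "02/27", "", "02/28"],
    ["Hour (AKST)", "22", "23", "00"],
    ["Date", "", "", ""],
    ["Hour (AKST)", "01", "02", "03"],
]

columns = merge_halves(rows)
columns["Date"] = forward_fill(columns["Date"])
print(columns["Date"])         # ['02/27', '02/27', '02/28', '02/28', '02/28', '02/28']
print(columns["Hour (AKST)"])  # ['22', '23', '00', '01', '02', '03']
```

Keying on the row label is what lets the two halves of each 48-hour page, and later the pages themselves, stack into single columns without tracking row positions.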



In [110]:
nws_urls = locations_df.apply(get_nws_url, axis=1)
url_map = dict(zip(locations_df['station_location'], nws_urls))

combined_table = []
for location, url in url_map.items():
  soup_list = [get_soup(url + f"&AheadHour={hr}") for hr in (0,48,96)]
  table_list = flatten([extract_table_data(soup, location) for soup in soup_list])
  combined_table.extend(table_list)
  break # stops after the first location; remove to scrape every station
my_dict = transpose_as_dict(combined_table)

In [119]:

# Create dataframe
df = pd.DataFrame(my_dict)

# Edit column headers 
df.columns = [col.lower() for col in df.columns] 
df.rename(columns=lambda x: re.sub(r'°|\(|\)', '', x), inplace=True)
df.rename(columns=lambda x: re.sub('%', 'pct', x), inplace=True)
df.rename(columns=lambda x: re.sub(' ', '_', x.strip()), inplace=True)

# Replace missing value indicators with NaN (np.NaN was removed in NumPy 2.0)
df.replace({'': np.nan, '--': np.nan}, inplace=True)

## Datetime Transformations
cur_year = dt.datetime.now().year
dt_strings = df['date'] + '/' + str(cur_year) + ' ' + df['hour_akst'] + ':00 AKST'
# Local time (AKST)
df['lst_datetime'] = pd.to_datetime(dt_strings, format='%m/%d/%Y %H:%M AKST')
# UTC time
akst_offset = dt.timedelta(hours=9)
df['utc_datetime'] = df['lst_datetime'] + akst_offset

## Reorder columns 
col_names = ['location','utc_datetime','lst_datetime'] + list(df.columns)[3:-2]
df = df[col_names]

# # Timestamp column: track when forecast was accessed
# df['date_added_utc'] = dt.datetime.utcnow()

In [120]:
df.head()

Unnamed: 0,date,last_update_nws,location,hour_akst,temperature_f,dewpoint_f,wind_chill_f,surface_wind_mph,wind_dir,gust,...,precipitation_potential_pct,relative_humidity_pct,rain,thunder,snow,freezing_rain,sleet,fog,lst_datetime,utc_datetime
0,02/27,2023-02-27 18:47:00,Fairbanks,20,-20,-25,-20,2,SE,,...,1,76,,,,,,,2023-02-27 20:00:00,2023-02-28 05:00:00
1,02/27,2023-02-27 18:47:00,Fairbanks,21,-23,-29,-37,5,SE,,...,1,75,,,,,,,2023-02-27 21:00:00,2023-02-28 06:00:00
2,02/27,2023-02-27 18:47:00,Fairbanks,22,-25,-30,-39,5,SE,,...,1,75,,,,,,,2023-02-27 22:00:00,2023-02-28 07:00:00
3,02/27,2023-02-27 18:47:00,Fairbanks,23,-24,-30,-38,5,SE,,...,1,75,,,,,,,2023-02-27 23:00:00,2023-02-28 08:00:00
4,02/28,2023-02-27 18:47:00,Fairbanks,0,-24,-30,-42,7,E,,...,1,75,,,,,,,2023-02-28 00:00:00,2023-02-28 09:00:00
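
The last two columns differ by the fixed 9-hour AKST offset. As a quick sanity check, the sketch below reproduces the first row's conversion with a timezone-aware datetime from the standard library. The constant name `AKST` is my own, and the fixed offset assumes the table is always labeled in standard time; during daylight time (AKDT, UTC-8) the offset would need to change.

```python
import datetime as dt

# Fixed UTC-9 offset: no daylight saving adjustment is applied
AKST = dt.timezone(dt.timedelta(hours=-9), name="AKST")

local = dt.datetime(2023, 2, 27, 20, 0, tzinfo=AKST)  # first row's lst_datetime
utc = local.astimezone(dt.timezone.utc)

print(utc)  # 2023-02-28 05:00:00+00:00
```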


#### 2.) Uploading the Data 
