### NWS Forecast Data

This notebook explains how to scrape, transform, and upload the NWS's hourly weather forecast for the next 6 days** into a dataset in BigQuery. The resulting script can be set to run at regular intervals as an Airflow DAG or a function in Cloud Functions. 

***For some reason, the hourly forecast doesn't quite extend to a full week, but only 6.5 days. To keep the math easier, we will only scrape the next 6 days (this won't affect the end result).*

We will collect hourly forecasts for the locations of 23 USCRN data collection stations in Alaska -- the presence of these stations will enable ourselves and any other users of our dataset to evaluate the accuracy of the forecasts. 

---

#### Why use scraping over `api.weather.gov`?

Generally speaking, if a website offers an API to access its data then it's a good bet to use it. So why not just use `api.weather.gov`?

There are at least twi reasons I chose to webscrape the forecast data for this project:

1. I've noticed at times that the `api.weather.gov` can give a `500: Internal Server Error` response when the HTML data interface is still accessible.  
2. As far as I can tell, the API does not offer the same amount of information as the tabular HTML interface:  

In [36]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
locations_df = pd.read_csv("../data/locations.csv")

In [82]:
random_location = locations_df.sample(1).iloc[0] 

print(f"{random_location}\n")

lat, lon = random_location['latitude'], random_location['longitude']  

## API results
url = f"https://api.weather.gov/points/{lat},{lon}"

response = requests.get(url)
main_data = response.json()

response = requests.get(main_data['properties']['forecastHourly'])
hourly_data = response.json()
fields = hourly_data['properties']['periods'][0]

print(f"{fields}\n") 

## Webscraping results 
url = f"https://forecast.weather.gov/MapClick.php?lat={lat}&lon={lon}&unit=0&lg=english&FcstType=digital&menu=1"

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
df = pd.read_html(str(soup.find_all("table")[5]))[0]
df = df.iloc[1:16,:]

display(df)

station_location    Utqiagvik
wbanno                  27516
longitude             -156.61
latitude                71.32
Name: 1, dtype: object

{'number': 1, 'name': '', 'startTime': '2023-02-27T14:00:00-09:00', 'endTime': '2023-02-27T15:00:00-09:00', 'isDaytime': True, 'temperature': -13, 'temperatureUnit': 'F', 'temperatureTrend': None, 'probabilityOfPrecipitation': {'unitCode': 'wmoUnit:percent', 'value': 2}, 'dewpoint': {'unitCode': 'wmoUnit:degC', 'value': -28.333333333333332}, 'relativeHumidity': {'unitCode': 'wmoUnit:percent', 'value': 73}, 'windSpeed': '10 mph', 'windDirection': 'W', 'icon': 'https://api.weather.gov/icons/land/day/bkn,2?size=small', 'shortForecast': 'Mostly Cloudy', 'detailedForecast': ''}



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
1,Date,02/27,,,,,,,,,...,,,,,,,,,,
2,Hour (AKST),15,16,17,18,19,20,21,22,23,...,05,06,07,08,09,10,11,12,13,14
3,Temperature (°F),-13,-14,-16,-19,-20,-21,-22,-22,-23,...,-24,-24,-24,-24,-22,-21,-21,-21,-22,-22
4,Dewpoint (°F),-18,-19,-21,-23,-24,-24,-25,-25,-26,...,-27,-28,-28,-27,-26,-25,-24,-24,-25,-26
5,Wind Chill (°F),-31,-33,-36,-36,-37,-38,-35,-36,-36,...,-42,-45,-45,-44,-45,-44,-43,-46,-47,-48
6,Surface Wind (mph),9,9,9,7,7,7,5,5,5,...,7,9,9,9,11,11,11,15,15,15
7,Wind Dir,W,W,W,SW,SW,SW,S,S,S,...,E,E,E,E,E,E,E,E,E,E
8,Gust,,,,,,,,,,...,,,,,,,,,,
9,Sky Cover (%),74,74,74,55,55,55,55,55,55,...,19,20,20,20,21,21,21,21,21,21
10,Precipitation Potential (%),14,14,14,14,14,14,14,14,14,...,1,1,1,1,1,1,1,1,1,1


The results need cleaning, but you can see that all the same information is present with the addition of several other fields. The API *does* provide a useful `isDaytime` field but we can calculate that ourselves.

---

#### 1.) Scraping the Data 

For each location, the forecast for the next 48 hours is stored in a tabular data table like this: 

<img src="../img/nws_p1.png" height=400px>

can be accessed by "jumping ahead" in 48 hour increments. We do this by adding `&AheadHour=` on the end of the URL and specifying how many hours (48, 96, and 107). There are only 3.5 such increments in a week, so the last jump is smaller. This means that the last two tables will have some overlap:

*48*

<img src="../img/nws_p2.png" height=400px>


*96*

<img src="../img/nws_p3.png" height=400px>

*107*

<img src="../img/nws_p4.png" height=400px>

In [104]:
from datetime import datetime, timedelta

a = datetime(2023,2,27,16)
b = datetime(2023,3,6,2)
b - a

datetime.timedelta(days=6, seconds=36000)

In [97]:

def nws_url(row:pd.Series) -> tuple:
  """
  Construct NWS forecast urls from latitude and longitude columns in locations dataframe
  
  Args: 
  row (pd.Series): The current row of the dataframe

  Returns: 
  url (tuple): Tuple of four URLs one for next 48 hours,  
  """
  lat, lon = row["latitude"], row["longitude"]
  url = f"https://forecast.weather.gov/MapClick.php?w0=t&w1=td&w2=wc&w3=sfcwind&w3u=1&w4=sky&w5=pop&w6=rh&w7=rain&w8=thunder&w9=snow&w10=fzg&w11=sleet&w12=fog&AheadHour=0&Submit=Submit&FcstType=digital&textField1={lat}&textField2={lon}&site=all&unit=0&dd=&bw=&menu=1"
  week_urls = (url, url + "&AheadHour=48", url + "&AheadHour=96", url + "&AheadHour=107")
  return week_urls

In [99]:
nws_url(locations_df.iloc[0])

('https://forecast.weather.gov/MapClick.php?w0=t&w1=td&w2=wc&w3=sfcwind&w3u=1&w4=sky&w5=pop&w6=rh&w7=rain&w8=thunder&w9=snow&w10=fzg&w11=sleet&w12=fog&AheadHour=0&Submit=Submit&FcstType=digital&textField1=64.97&textField2=-147.51&site=all&unit=0&dd=&bw=&menu=1',
 'https://forecast.weather.gov/MapClick.php?w0=t&w1=td&w2=wc&w3=sfcwind&w3u=1&w4=sky&w5=pop&w6=rh&w7=rain&w8=thunder&w9=snow&w10=fzg&w11=sleet&w12=fog&AheadHour=0&Submit=Submit&FcstType=digital&textField1=64.97&textField2=-147.51&site=all&unit=0&dd=&bw=&menu=1&AheadHour=48',
 'https://forecast.weather.gov/MapClick.php?w0=t&w1=td&w2=wc&w3=sfcwind&w3u=1&w4=sky&w5=pop&w6=rh&w7=rain&w8=thunder&w9=snow&w10=fzg&w11=sleet&w12=fog&AheadHour=0&Submit=Submit&FcstType=digital&textField1=64.97&textField2=-147.51&site=all&unit=0&dd=&bw=&menu=1&AheadHour=72',
 'https://forecast.weather.gov/MapClick.php?w0=t&w1=td&w2=wc&w3=sfcwind&w3u=1&w4=sky&w5=pop&w6=rh&w7=rain&w8=thunder&w9=snow&w10=fzg&w11=sleet&w12=fog&AheadHour=0&Submit=Submit&FcstType

In [100]:
locations_df.apply()

Unnamed: 0,station_location,wbanno,longitude,latitude
0,Fairbanks,26494,-147.51,64.97
1,Utqiagvik,27516,-156.61,71.32
2,Sitka,25379,-135.33,57.06
3,St._Paul,25711,-170.21,57.16
4,Port_Alsworth,26562,-154.32,60.2
5,Sand_Point,25630,-160.47,55.35
6,Kenai,26563,-150.45,60.72
7,Red_Dog_Mine,26655,-162.92,68.03
8,Tok,96404,-141.21,62.74
9,Gustavus,25380,-135.69,58.43


#### 2.) Uploading the Data 
