### USCRN Data: Historical Baseline

This notebook explains and contains the initial scrape, transform, and upload of the USCRN weather data to BigQuery and Google Cloud Storage. This dataset will serve as the benchmark for the NWS forecast data. 

The DAG contained in `airflow/dags/uscrn_dag.py` is set to periodically update the BigQuery table created by this notebook, as is the cloud function in `notebooks/uscrn_update_gcf/`

In [2]:
import requests
import pandas as pd 
import numpy as np
import re
import itertools
from yaml import full_load
import datetime as dt
from bs4 import BeautifulSoup

with open ("../airflow/dags/config/sources.yaml", "r") as fp:
  sources = full_load(fp)

#### 1.) Column Headers and Descriptions

To save on storage space, USCRN omits column names in its main data tables and stores them in a separate text file ([headers.txt](https://www.ncei.noaa.gov/pub/data/uscrn/products/hourly02/headers.txt)). We'll scrape these first before tackling the main data.

In [2]:
url = sources['USCRN']['headers']
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

columns = str(soup).split("\n")[1].strip(" ").split(" ")
columns = [str.lower(c) for c in columns] 
columns.insert(0,'station_location')

descrip_text = str(soup).split("\n")[2] # raw text block containing column descriptions
descrip_text

"The station WBAN number. The UTC date of the observation. The UTC time of the observation. Time is the end of the observed hour, so the 0000 hour is actually the last hour of the previous day's observation (starting just after 11:00 PM through midnight). The Local Standard Time (LST) date of the observation. The Local Standard Time (LST) time of the observation. Time is the end of the observed hour (see UTC_TIME description). The version number of the station datalogger program that was in effect at the time of the observation. Note: This field should be treated as text (i.e. string). Station longitude, using WGS-84. Station latitude, using WGS-84. Average air temperature, in degrees C, during the last 5 minutes of the hour. See Note F. Average air temperature, in degrees C, for the entire hour. See Note F. Maximum air temperature, in degrees C, during the hour. See Note F. Minimum air temperature, in degrees C, during the hour. See Note F. Total amount of precipitation, in mm, record

The descriptions of the columns are quite the mess, as there is no standard separator used. We will have to work our way through it step by step: 

In [3]:
def close_parens(s:str):
    """uses regex to replace closing parenthesis ')' after it's removed from .split()"""
    unclosed_paren = re.compile(r'(\([^)]*)$') 
    return re.sub(unclosed_paren, r"\1)", s) 

first_split = map(close_parens, descrip_text.split("). "))

no_notes = [re.sub(r' See Note [A-Z]\.',"",s) for s in first_split]
no_notes

["The station WBAN number. The UTC date of the observation. The UTC time of the observation. Time is the end of the observed hour, so the 0000 hour is actually the last hour of the previous day's observation (starting just after 11:00 PM through midnight)",
 'The Local Standard Time (LST) date of the observation. The Local Standard Time (LST) time of the observation. Time is the end of the observed hour (see UTC_TIME description)',
 'The version number of the station datalogger program that was in effect at the time of the observation. Note: This field should be treated as text (i.e. string)',
 "Station longitude, using WGS-84. Station latitude, using WGS-84. Average air temperature, in degrees C, during the last 5 minutes of the hour. Average air temperature, in degrees C, for the entire hour. Maximum air temperature, in degrees C, during the hour. Minimum air temperature, in degrees C, during the hour. Total amount of precipitation, in mm, recorded during the hour. Average global sol

The third entry in `no_notes` is ready (it's a single string belonging to a single column). The last set of descriptions in `no_notes` can be split on `". "`, but the first two sets need special attention. We will pop the last set out and split it, then pop the third set out, and then address the first two sets. At that point we will recombine everything into one list while preserving the original order. 

In [4]:
last_set = no_notes.pop().strip().split(". ")
third_set = no_notes.pop() # just a string

In [5]:
def flatten(ls:list): 
  """Flattens/unnests a list of lists"""
  return list(itertools.chain.from_iterable(ls)) 

no_notes = [re.sub(". Time is", " at", s) for s in no_notes] # rephrase description so we can split on sentences

first_second = flatten([s.split(". ") for s in no_notes]) 

# Finally:
descriptions = flatten([first_second, [third_set], last_set]) 
descriptions.insert(0,"Location name for USCRN station") # Description added for "station_location" 
descriptions[0:5]

['Location name for USCRN station',
 'The station WBAN number',
 'The UTC date of the observation',
 "The UTC time of the observation at the end of the observed hour, so the 0000 hour is actually the last hour of the previous day's observation (starting just after 11:00 PM through midnight)",
 'The Local Standard Time (LST) date of the observation']

The [readme](https://www.ncei.noaa.gov/pub/data/uscrn/products/hourly02/readme.txt) also contains information on the units of each column:

In [6]:
url = sources['USCRN']['readme']
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

lines = [line.strip() for line in str(soup).split("\n")]
table_idx = lines.index("Field#  Name                           Units") # 252
table = lines[table_idx+2:table_idx+40]
table

['1    WBANNO                         XXXXX',
 '2    UTC_DATE                       YYYYMMDD',
 '3    UTC_TIME                       HHmm',
 '4    LST_DATE                       YYYYMMDD',
 '5    LST_TIME                       HHmm',
 '6    CRX_VN                         XXXXXX',
 '7    LONGITUDE                      Decimal_degrees',
 '8    LATITUDE                       Decimal_degrees',
 '9    T_CALC                         Celsius',
 '10   T_HR_AVG                       Celsius',
 '11   T_MAX                          Celsius',
 '12   T_MIN                          Celsius',
 '13   P_CALC                         mm',
 '14   SOLARAD                        W/m^2',
 '15   SOLARAD_FLAG                   X',
 '16   SOLARAD_MAX                    W/m^2',
 '17   SOLARAD_MAX_FLAG               X',
 '18   SOLARAD_MIN                    W/m^2',
 '19   SOLARAD_MIN_FLAG               X',
 '20   SUR_TEMP_TYPE                  X',
 '21   SUR_TEMP                       Celsius',
 '22   SUR_TEMP_FL

In [7]:
# Get units with regex lookbehind
regex = re.compile(r"(?<=\s{5})[^\s']+") # enough \s to exclude column names
units = re.findall(regex, str(table))

# Add unit for "station_location" column
units.insert(0, "X+ (Various Lengths)") 

header_info = {
  'name': columns,
  'description': descriptions, 
  'units': units
}
header_df = pd.DataFrame(header_info)

header_df.head(5)

Unnamed: 0,name,description,units
0,station_location,Location name for USCRN station,X+ (Various Lengths)
1,wbanno,The station WBAN number,XXXXX
2,utc_date,The UTC date of the observation,YYYYMMDD
3,utc_time,The UTC time of the observation at the end of ...,HHmm
4,lst_date,The Local Standard Time (LST) date of the obse...,YYYYMMDD


#### 2.) Main Data (>2 million rows)

You might find that trying to create a dataframe from a nested list of 2 million rows risks crashing the Jupyter IPython Kernel. I've since refactored my code with recursion and batch processing to reduce memory load.

The full script to scrape, transform, and save the data is available in `notebooks/uscrn_helper_scripts/uscrn_scrape_main.py`. I've imported the main function from there in the next cell so you can run it here. But it's likely better to execute the script in a separate terminal (`$ python3.7 uscrn_scrape.py`). 

In [1]:
import os
from utils.utils import get_station_location, get_soup, get_file_urls
# Refer to the script to see what these helper functions are doing

output_file="../data/uscrn.csv"

columns = header_df['name']

if os.path.isfile(output_file):
  raise Exception(f"{output_file} already exists")

def process_rows(file_urls, row_limit, output_file) -> None:
  """
  Processes a batch of rows from a list of URLs to extract weather station data and save it to a CSV file.

  Args:
    file_urls (list): A list of URLs where weather station data can be found.
    row_limit (int): The maximum number of rows to process per batch.
    output_file (str): The path to the output CSV file.
  Returns:
    None
  """
  # Get rows for current batch
  rows = []
  current_idx=0
  for i, url in enumerate(file_urls[current_idx:]):
    # Get location from url
    station_location = get_station_location(url)
    # Get new rows 
    soup = get_soup(url, delay=1)
    soup_lines = [station_location + " " + line for line in str(soup).strip().split("\n")]
    new_rows = [re.split('\s+', row) for row in soup_lines]
    # Add to list
    rows.extend(new_rows)
    if len(rows) >= row_limit:
      current_idx=i
      break

  # Define column names -- same as from header_df['name']
  columns = ['station_location','wbanno','utc_date','utc_time','lst_date','lst_time','crx_vn','longitude','latitude',
  't_calc','t_hr_avg','t_max','t_min','p_calc','solarad','solarad_flag','solarad_max','solarad_max_flag','solarad_min',
  'solarad_min_flag','sur_temp_type','sur_temp','sur_temp_flag','sur_temp_max','sur_temp_max_flag','sur_temp_min',
  'sur_temp_min_flag','rh_hr_avg','rh_hr_avg_flag','soil_moisture_5','soil_moisture_10','soil_moisture_20',
  'soil_moisture_50','soil_moisture_100','soil_temp_5','soil_temp_10','soil_temp_20','soil_temp_50','soil_temp_100']
  
  # Create dataframe for current batch
  df = pd.DataFrame(rows, columns=columns)

  #### --- Transform dataframe --- #### 

  df.replace([-99999,-9999], np.nan, inplace=True) 
  df.replace({'crx_vn':{-9:np.nan}}, inplace=True)

  # Drop soil columns -- vast majority have missing data 
  df = df.filter(regex="^((?!soil).)*$")

  # convert to datetimes
  df['utc_datetime'] = pd.to_datetime(df['utc_date'].astype(int).astype(str) + df['utc_time'].astype(int).astype(str).str.zfill(4), format='%Y%m%d%H%M')
  df['lst_datetime'] = pd.to_datetime(df['lst_date'].astype(int).astype(str) + df['lst_time'].astype(int).astype(str).str.zfill(4), format='%Y%m%d%H%M')

  # drop old date and time columns
  df.drop(['utc_date', 'utc_time', 'lst_date', 'lst_time'], axis=1, inplace=True)

  # reorder columns 
  cols = ['station_location','wbanno','crx_vn','utc_datetime','lst_datetime'] + list(df.columns)[3:-2]
  df = df[cols]

  #### -------------------------- #####

  # Write dataframe to CSV
  hdr = False if os.path.isfile(output_file) else True
  df.to_csv("../data/uscrn.csv", mode="a+", header=hdr, index=False)


  # if os.path.isfile(output_file):
  #     df.to_csv(output_file, mode='a', header=False, index=False)
  #     del df
  # else:
  #   with open(output_file, "w") as fp:
  #     df.to_csv(fp, index=False)
  #     del df
  
  # Recursively process remaining rows     
  if len(rows) >= row_limit:
      remaining_urls = file_urls[current_idx:]
      process_rows(remaining_urls, row_limit, output_file)
  else: 
      return 
    
process_rows(file_urls=get_file_urls("hourly02"), row_limit=100000, output_file=output_file)

Let's also save tables containing supplemental information:

*Location information*

In [9]:
location_cols = ['station_location', 'wbanno', 'longitude', 'latitude']
df = pd.read_csv("../data/uscrn.csv", usecols=location_cols)
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
df.to_csv("../data/locations.csv", index=False)

*Column descriptions*

In [10]:
names = pd.read_csv("../data/uscrn.csv", nrows=1)
df = pd.DataFrame(list(names.columns), columns=["name"]) # transposing 

# Join to column description dataframe we made in previous section
df = df.merge(header_df, how="left")

# Datetime columns were added by transform_dataframe() -- need to fill in missing values 
df['units'].fillna("YYYY-MM-DD HH:MM:SS", inplace=True) 

df['description'][df['name'] == 'utc_datetime'] = "UTC datetime of observation"
df['description'][df['name'] == 'lst_datetime'] = "Local standard datetime of observation (AKST)"
df['description'][df['name'] == 'date_added_utc'] = "Datetime added to usrcn.csv (UTC)"

# Add 'type' columns -- useful for setting table schema in next section
def map_type(unit:str):
  """Map unit to datatype"""
  if "X" in unit:
    return "STRING"
  elif "Y" in unit:
    return "DATETIME"
  else: 
    return "FLOAT"
  
df['type'] = df['units'].map(map_type)

# Write to .csv
df.to_csv("../data/column_descriptions.csv", index=False)

#### 3.) Wind Data (*20 million* rows!)

For some reason, the USCRN hourly database (`hourly02`) does not include windspeed measurements. This information is only included in the sub-hourly database (`subhourly01`). 

The amount of the data is so large here that even the batch processing method we just used isn't enough to avoid exceeding memory limits -- the garbage collector is not adequately clearing the memory on every pass. We instead write all the raw data directly to .csv before reading it back into memory and transforming it:

In [None]:
import csv

def write_raw_rows(file_urls, output_file) -> None:
  """
  Args:
    file_urls (list): List of text file urls
    output_file (str): The path to the output CSV file.

  Returns:
    None
  """

  if os.path.isfile(output_file):
    raise Exception(f"{output_file} already exists")

  for url in file_urls:
    # Get location from url
    station_location = get_station_location(url)
    # Get new rows 
    soup = get_soup(url, delay=.5)
    lines = [re.split('\s+', line) for line in str(soup).strip().splitlines()]
    # We're only scraping this data for the wind information, so we ignore rows that don't have any (i.e wind < 0)
    wind_cols = [[station_location] + line[:5] + line[-2:] for line in lines if float(line[-2]) >= 0]
    # Write rows to CSV
    if wind_cols:
      with open(output_file, "a+") as f:
        writer = csv.writer(f)
        writer.writerows(wind_cols)
      del wind_cols

subhourly_files = get_file_urls("subhourly01")

write_raw_rows(file_urls=subhourly_files, output_file="../data/uscrn_wind_raw.csv")

Since we want to aggregate the sub-hourly measurements by hour, we can't read the data in chunks (otherwise we will skew averages for hours that were partially read-in).

In [None]:
## Read the data  
colnames = ['station_location','wbanno','utc_date','utc_time',
  'lst_date','lst_time',"wind_1_5", "wind_flag"]

df = pd.read_csv("../data/uscrn_wind_raw.csv", names=colnames) 
 
## Transformations
df['wind_1_5'] = df['wind_1_5'].astype(float)

# convert to datetimes
df['utc_datetime'] = pd.to_datetime(df['utc_date'].astype(int).astype(str) + df['utc_time'].astype(int).astype(str).str.zfill(4), format='%Y%m%d%H%M')
df['lst_datetime'] = pd.to_datetime(df['lst_date'].astype(int).astype(str) + df['lst_time'].astype(int).astype(str).str.zfill(4), format='%Y%m%d%H%M')

# round to nearest hour 
df['utc_datetime'] = df['utc_datetime'].dt.floor("H")
df['lst_datetime'] = df['lst_datetime'].dt.floor("H")

# drop poor quality data (wind_flag == 3: roughly 1.9% of rows)
df = df[df['wind_flag'] == 0]
df.drop("wind_flag", axis=1, inplace=True)

# aggregate by hour
df = df.groupby(['station_location','wbanno','utc_datetime','lst_datetime','wind_flag'])['wind_1_5'].mean().reset_index()
df.rename({"wind_1_5":"wind_hr_avg"}, axis=1, inplace=True)

## Write to csv
df.to_csv("../data/uscrn_wind_agg.csv")

#### 4.) Upload to BigQuery 

**Main Data**

In [21]:
# Create dataset 
with open("../airflow/dags/config/gcp-config.yaml", "r") as fp:
  gcp_config = full_load(fp) 

# !bq mk -d --location=us-east4 {gcp_config['project-id']}:{gcp_config['dataset-id']}

In [3]:
# Set schema 
header_df = pd.read_csv("../data/column_descriptions.csv")
header_df.drop("units", axis=1, inplace=True)

required_fields = ['station_location', 'wbanno', 'utc_datetime', 'lst_datetime', 'longitude', 'latitude', 'date_added_utc']

header_df['mode'] = np.where(header_df['name'].isin(required_fields), "REQUIRED", "NULLABLE")

schema = header_df.to_dict(orient='records')

print(schema[0:6]) 

[{'name': 'station_location', 'description': 'Location name for USCRN station', 'type': 'STRING', 'mode': 'REQUIRED'}, {'name': 'wbanno', 'description': 'The station WBAN number', 'type': 'STRING', 'mode': 'REQUIRED'}, {'name': 'crx_vn', 'description': 'The version number of the station datalogger program that was in effect at the time of the observation. Note: This field should be treated as text (i.e. string)', 'type': 'STRING', 'mode': 'NULLABLE'}, {'name': 'utc_datetime', 'description': 'UTC datetime of observation', 'type': 'DATETIME', 'mode': 'REQUIRED'}, {'name': 'lst_datetime', 'description': 'Local standard datetime of observation (AKST)', 'type': 'DATETIME', 'mode': 'REQUIRED'}, {'name': 'longitude', 'description': 'Station longitude, using WGS-84', 'type': 'FLOAT', 'mode': 'REQUIRED'}]


In [22]:
from google.cloud import bigquery
from google.oauth2 import service_account

key_path = gcp_config['credentials']
credentials = service_account.Credentials.from_service_account_file(
   key_path, scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

client = bigquery.Client(credentials=credentials, project=credentials.project_id)

PROJECT_ID = gcp_config['project-id']
DATASET_ID = gcp_config['dataset-id']
table_id = f"{PROJECT_ID}.{DATASET_ID}.uscrn"

jc = bigquery.LoadJobConfig(
   source_format = bigquery.SourceFormat.CSV,
   skip_leading_rows=1,
   autodetect=False,
   schema=schema,
   create_disposition="CREATE_IF_NEEDED",
   write_disposition="WRITE_TRUNCATE", 
   destination_table_description="Historical weather data from USCRN stations in Alaska"
)

# job = client.load_table_from_dataframe(df, table_id, job_config=jc)

with open("../data/uscrn.csv", "rb") as fp: 
  job = client.load_table_from_file(fp, table_id, job_config=jc)
job.result()

job.result()

**Wind Data**

In [7]:
wind_schema = schema[0:2] + schema[3:5] + ['wind_flag', 'wind_schema']
print(wind_schema)

[{'name': 'station_location', 'description': 'Location name for USCRN station', 'type': 'STRING', 'mode': 'REQUIRED'}, {'name': 'wbanno', 'description': 'The station WBAN number', 'type': 'STRING', 'mode': 'REQUIRED'}, {'name': 'utc_datetime', 'description': 'UTC datetime of observation', 'type': 'DATETIME', 'mode': 'REQUIRED'}, {'name': 'lst_datetime', 'description': 'Local standard datetime of observation (AKST)', 'type': 'DATETIME', 'mode': 'REQUIRED'}, 'wind_flag', 'wind_schema']


In [None]:
jc = bigquery.LoadJobConfig(
   source_format = bigquery.SourceFormat.CSV,
   skip_leading_rows=1,
   autodetect=False,
   schema=wind_schema,
   create_disposition="CREATE_IF_NEEDED",
   write_disposition="WRITE_TRUNCATE", 
   destination_table_description="Hourly wind data from USCRN stations, aggregated from 5 minute measurements"
)

table_id = f"{PROJECT_ID}.{DATASET_ID}.uscrn_wind" 

with open("../data/uscrn_wind.csv", "rb") as fp:
    job = client.load_table_from_file(fp, table_id, job_config=jc)

**Supplemental Data** (Locations and Column Description tables)

In [21]:
# Locations table
table_id = f"{PROJECT_ID}.{DATASET_ID}.locations"

jc = bigquery.LoadJobConfig(
  source_format = bigquery.SourceFormat.CSV,
  autodetect=True,
  create_disposition="CREATE_IF_NEEDED",
  write_disposition="WRITE_TRUNCATE", 
  destination_table_description="Location names, WBANNO codes, and coordinates for USCRN stations in Alaska"
)

with open("../data/locations.csv", "rb") as fp: 
  job = client.load_table_from_file(fp, table_id, job_config=jc)
job.result()

LoadJob<project=alaska-scrape, location=us-east4, id=e2fe472f-ee3b-424f-ac0f-a14ccc5ff26b>

In [25]:
# Column description table 
table_id = f"{PROJECT_ID}.{DATASET_ID}.column_descriptions"

schema = [ # Col headers not being autodetected
  bigquery.SchemaField("name", "STRING", mode="REQUIRED"), 
  bigquery.SchemaField("description", "STRING", mode="REQUIRED"), 
  bigquery.SchemaField("units", "STRING", mode="REQUIRED"),
  bigquery.SchemaField("type", "STRING", mode="REQUIRED")
]

jc = bigquery.LoadJobConfig(
  source_format = bigquery.SourceFormat.CSV,
  skip_leading_rows=1,
  autodetect=False,
  create_disposition="CREATE_IF_NEEDED",
  write_disposition="WRITE_TRUNCATE", 
  destination_table_description=f"Column descriptions for fields in {DATASET_ID}.uscrn", 
  schema=schema
)

with open("../data/column_descriptions.csv", "rb") as fp: 
  job = client.load_table_from_file(fp, table_id, job_config=jc)
job.result()

LoadJob<project=alaska-scrape, location=us-east4, id=0a834079-f8cf-484b-96dc-1ee70a55c464>

Lastly, let's also upload these smaller tables to Google Cloud Storage. When we create our Google Cloud Functions (see `gcf/`) it will be easier to read them from there than from BigQuery.

In [None]:
!gsutil mb -p {PROJECT_ID} -b on -l us-east4 gs://{PROJECT_ID}-bucket
!gsutil cp ../data/locations.csv gs://{PROJECT_ID}-bucket
!gsutil cp ../data/column_descriptions.csv gs://{PROJECT_ID}-bucket

In [15]:
!gsutil ls -l gs://{PROJECT_ID}-bucket 

      2252  2023-02-26T17:52:09Z  gs://alaska-scrape-bucket/column_descriptions.csv
       706  2023-02-26T17:51:54Z  gs://alaska-scrape-bucket/locations.csv
TOTAL: 2 objects, 2958 bytes (2.89 KiB)


#### 5.) Updating Data 

So far we've scraped, transformed, and uploaded all available data from the USCRN database. For our regularly-running script we'll only want to scrape the newest data we don't have yet. See `airflow/dags/uscrn_dag.py` for an example of how to do this.

In [None]:
def get_wind_updates(url:str) -> dict: 
  """Get wind updates from the subhourly data set"""

In [8]:
url = "https://www.ncei.noaa.gov/pub/data/uscrn/products/subhourly01/2023/CRNS0101-05-2023-AK_Aleknagik_1_NNE.txt"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [57]:
def get_latest_date_hour() -> tuple: # String representation of datetime
  """Reads/returns latest 'utc_datetime' value from wind table"""
  
  query = f"""
  SELECT utc_datetime 
  FROM weather.uscrn
  ORDER BY utc_datetime DESC LIMIT 1
  """
  query_job = client.query(query)
  result = query_job.result()

  row = next(result)
  latest_datetime = row['utc_datetime']
  latest_date = dt.datetime.strftime(latest_datetime, format="%Y%m%d")
  latest_hour= dt.datetime.strftime(latest_datetime, format="%H%M")


  return int(latest_date), int(latest_hour)

In [60]:
latest_date, latest_hour = get_latest_date_hour()

In [80]:
new_rows = [line.split() for line in str(soup).strip().splitlines() if int(line[6:14]) > latest_date and int(line[15:19]) > latest_hour] 
wind_cols = [['station_location'] + line[:5] + line[-2:] for line in new_rows]


In [81]:
columns = ['station_location','wbanno','utc_date','utc_time',
'lst_date','lst_time',"wind_1_5", "wind_flag"]

pd.DataFrame(wind_cols, columns=columns)

Unnamed: 0,station_location,wbanno,utc_date,utc_time,lst_date,lst_time,wind_1_5,wind_flag
0,station_location,23583,20230226,2005,20230226,1105,1.24,0
1,station_location,23583,20230226,2010,20230226,1110,0.70,0
2,station_location,23583,20230226,2015,20230226,1115,0.07,0
3,station_location,23583,20230226,2020,20230226,1120,0.04,0
4,station_location,23583,20230226,2025,20230226,1125,0.09,0
...,...,...,...,...,...,...,...,...
242,station_location,23583,20230303,2040,20230303,1140,2.32,0
243,station_location,23583,20230303,2045,20230303,1145,1.63,0
244,station_location,23583,20230303,2050,20230303,1150,1.71,0
245,station_location,23583,20230303,2055,20230303,1155,0.49,0


In [72]:
print(new_lines[0:5])

[['23583', '20230226', '2005', '20230226', '1105', '2.514', '-158.61', '59.28', '-8.7', '0.0', '188', '0', '-12.9', 'C', '0', '90', '0', '-99.000', '-9999.0', '1243', '0', '1.24', '0'], ['23583', '20230226', '2010', '20230226', '1110', '2.514', '-158.61', '59.28', '-8.8', '0.0', '194', '0', '-12.5', 'C', '0', '90', '0', '-99.000', '-9999.0', '1245', '0', '0.70', '0'], ['23583', '20230226', '2015', '20230226', '1115', '2.514', '-158.61', '59.28', '-8.2', '0.0', '202', '0', '-12.2', 'C', '0', '91', '0', '-99.000', '-9999.0', '1245', '0', '0.07', '0'], ['23583', '20230226', '2020', '20230226', '1120', '2.514', '-158.61', '59.28', '-7.9', '0.0', '211', '0', '-12.1', 'C', '0', '92', '0', '-99.000', '-9999.0', '1245', '0', '0.04', '0'], ['23583', '20230226', '2025', '20230226', '1125', '2.514', '-158.61', '59.28', '-7.6', '0.0', '219', '0', '-11.9', 'C', '0', '91', '0', '-99.000', '-9999.0', '1247', '0', '0.09', '0']]


In [65]:
## get latest datetime from the bigquery table 


['23583 20230226 2005 20230226 1105  2.514 -158.61   59.28    -8.7     0.0    188 0   -12.9 C 0    90 0 -99.000 -9999.0  1243 0   1.24 0',
 '23583 20230226 2010 20230226 1110  2.514 -158.61   59.28    -8.8     0.0    194 0   -12.5 C 0    90 0 -99.000 -9999.0  1245 0   0.70 0',
 '23583 20230226 2015 20230226 1115  2.514 -158.61   59.28    -8.2     0.0    202 0   -12.2 C 0    91 0 -99.000 -9999.0  1245 0   0.07 0',
 '23583 20230226 2020 20230226 1120  2.514 -158.61   59.28    -7.9     0.0    211 0   -12.1 C 0    92 0 -99.000 -9999.0  1245 0   0.04 0',
 '23583 20230226 2025 20230226 1125  2.514 -158.61   59.28    -7.6     0.0    219 0   -11.9 C 0    91 0 -99.000 -9999.0  1247 0   0.09 0',
 '23583 20230226 2030 20230226 1130  2.514 -158.61   59.28    -7.4     0.0    227 0   -11.7 C 0    91 0 -99.000 -9999.0  1247 0   0.20 0',
 '23583 20230226 2035 20230226 1135  2.514 -158.61   59.28    -7.2     0.0    236 0   -11.6 C 0    90 0 -99.000 -9999.0  1246 0   0.25 0',
 '23583 20230226 2040 20230