# Accessing quality-controlled historical weather station data

The <span style="color:#FF0000">[Historical Observations Data Platform](https://eaglerockanalytics.com/project/historical-observations-data-platform/)</span> is a cloud-based historical weather observations database that enables access to high-quality, open climate and weather data. The Platform responds to a broad-scale need to understand weather and climate information, including the severity, duration, frequency, and rate of change over time of extreme weather events, and to support efforts to downscale climate projections. The Platform implements stringent, customized Quality Assurance / Quality Control (QA/QC) protocols in line with international conventions, with updates relevant to the energy sector to accurately capture extremes. The Platform has sourced publicly accessible weather observation stations from 27 networks throughout western North America, with a total of **14,927 quality-controlled and standardized stations** spanning 1980-2022.

For more information on the QA/QC and standardization process, please check out the open-access methods and code at the <span style="color:#FF0000">[Historical Observations Data Platform code repository](https://github.com/Eagle-Rock-Analytics/historical-obs-platform)</span>. A station list of all available quality-controlled and standardized stations is available in our <span style="color:#FF0000">[data bucket](https://cadcat.s3.amazonaws.com/histwxstns/historical_wx_stations.csv)</span>.

**Runtime**: < 1 min

In [None]:
import intake 
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

First, open the catalog using `intake`.

In [None]:
cat = intake.open_esm_datastore("https://cadcat.s3.amazonaws.com/histwxstns/era-hdp-collection.json")

Next, view the catalog in table format. You can inspect the first few rows by calling `.head()` on the table.

In [None]:
# Access catalog as dataframe and inspect the first few rows
cat_df = cat.df
cat_df.head()

View all the weather station networks with the following code:

In [None]:
# See all network options 
cat_df["network_id"].unique()

You can also filter the catalog to see all stations within a network:

In [None]:
my_network = "ASOSAWOS"
cat_df[cat_df["network_id"] == my_network]
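
The same table can be summarized to see how many stations each network contributes. The sketch below uses a small toy DataFrame as a stand-in for `cat.df`, so it runs without network access. It assumes only the `network_id` and `station_id` columns used in this notebook; the `TOYNET` network and its station ID are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `cat.df` with only the two columns used in this notebook.
# "TOYNET" and "TOYNET_0001" are hypothetical values for illustration.
toy_cat_df = pd.DataFrame(
    {
        "network_id": ["ASOSAWOS", "ASOSAWOS", "ASOSAWOS", "TOYNET"],
        "station_id": [
            "ASOSAWOS_72288023152",
            "ASOSAWOS_72389093193",
            "ASOSAWOS_72297603166",
            "TOYNET_0001",
        ],
    }
)

# Count the unique stations in each network
stations_per_network = toy_cat_df.groupby("network_id")["station_id"].nunique()
print(stations_per_network)
```

Swapping `toy_cat_df` for `cat_df` should give the per-network station counts for the real catalog.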

You can subset the catalog and read in the cloud-optimized data as `xarray.Dataset` objects using the method shown below. To change the data downloaded, simply modify the inputs in the dictionary `query`. These inputs must correspond to valid options in the catalog. 

In [None]:
# Set your query here 
query = {
    "network_id": "ASOSAWOS",  # Name of the network
    "station_id": [  # List of stations to get data for
        "ASOSAWOS_72288023152",
        "ASOSAWOS_72389093193",
        "ASOSAWOS_72297603166",
        "ASOSAWOS_72493023230",
    ],
}

# Subset catalog 
cat_subset = cat.search(**query)

# View the data you've selected before downloading
cat_subset.df
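
Because query values must match entries in the catalog, it can help to check a query against the catalog table before searching. The helper below, `find_invalid_query_values`, is a hypothetical function written for this example (it is not part of `intake`), and it is demonstrated on a toy table rather than the real catalog. The station ID ending in zeros is deliberately made up so the check has something to report.

```python
import pandas as pd


def find_invalid_query_values(catalog_df, query):
    """Return {column: [values not found in that catalog column]}."""
    invalid = {}
    for column, wanted in query.items():
        # Accept either a single value or a list of values
        wanted_values = list(wanted) if isinstance(wanted, (list, tuple)) else [wanted]
        if column not in catalog_df.columns:
            # The whole column is unknown, so every requested value is invalid
            invalid[column] = wanted_values
            continue
        known = set(catalog_df[column].unique())
        missing = [value for value in wanted_values if value not in known]
        if missing:
            invalid[column] = missing
    return invalid


# Toy stand-in for `cat.df`
toy_cat_df = pd.DataFrame(
    {
        "network_id": ["ASOSAWOS", "ASOSAWOS"],
        "station_id": ["ASOSAWOS_72288023152", "ASOSAWOS_72389093193"],
    }
)

# The second station ID is made up and should be flagged
toy_query = {
    "network_id": "ASOSAWOS",
    "station_id": ["ASOSAWOS_72288023152", "ASOSAWOS_00000000000"],
}

invalid = find_invalid_query_values(toy_cat_df, toy_query)
print(invalid)
```

An empty dictionary means every value in the query was found in the table.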

Then, you can download all the files. The result is a dictionary in which each key is a string ID for the data and each value is the corresponding data object.

In [None]:
# Get dataset dictionary 
dsets = cat_subset.to_dataset_dict(
    xarray_open_kwargs={'consolidated':False},
    storage_options={'anon':True}
)

To see all the string IDs for the Datasets in the dictionary, you can print them with the following code: 

In [None]:
list(dsets.keys())

You can easily access the files in the dictionary using the following format: 
```
dsets[<string ID of data>]
```
The string ID of the data is constructed using both the network ID and the station ID for each individual weather station. 

In [None]:
# Retrieve a single file
ds = dsets["ASOSAWOS.ASOSAWOS_72389093193"]
ds
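
If you want to build the string IDs yourself, for example to loop over stations in a fixed order, one option is to join the two catalog columns with a period. This is a sketch that assumes the `<network_id>.<station_id>` pattern seen in the key above; it uses a toy table in place of `cat_subset.df`, and the actual keys returned for a catalog are always available from `dsets.keys()`.

```python
import pandas as pd

# Toy stand-in for `cat_subset.df`
toy_subset_df = pd.DataFrame(
    {
        "network_id": ["ASOSAWOS", "ASOSAWOS"],
        "station_id": ["ASOSAWOS_72288023152", "ASOSAWOS_72389093193"],
    }
)

# Assumed key pattern: "<network_id>.<station_id>"
expected_keys = (toy_subset_df["network_id"] + "." + toy_subset_df["station_id"]).tolist()
print(expected_keys)
```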

## Make a quick plot of the data 
`xarray` has built-in plotting features that let you quickly generate a time series plot for a single station. This lets you get a sense for the data you read in.

In [None]:
variable_to_plot = "tas"
ds.squeeze()[variable_to_plot].plot(x="time");
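
A full station record can span decades, so it is often useful to select a shorter time window or aggregate to daily values before plotting. The sketch below shows both steps on a synthetic Dataset. The variable name `tas` and the `time` dimension come from the cell above, but the `station` dimension name, the dates, and the values themselves are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic hourly data standing in for one station's Dataset.
# The "station" dimension name and all values are made up for this example.
time = pd.date_range("2003-10-24", periods=72, freq="h")
values = 290.0 + 5.0 * np.sin(2 * np.pi * np.arange(72) / 24)
toy_ds = xr.Dataset(
    {"tas": (("station", "time"), values[np.newaxis, :])},
    coords={"station": ["TOY_STATION"], "time": time},
)

# Drop the length-1 station dimension, then keep two full days
tas_window = toy_ds["tas"].squeeze().sel(time=slice("2003-10-25", "2003-10-26"))

# Aggregate the hourly values to daily means
daily_mean = tas_window.resample(time="1D").mean()
print(daily_mean.values)
```

Calling `.plot(x="time")` on `tas_window` or `daily_mean` works the same way as in the cell above.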

## Subset the historical weather stations for a region

If you're interested in historical weather observation stations in a specific area, you can also subset the full archive of stations to identify those that you are interested in. We will read in the Historical Data Platform station list, which provides the coordinates, dates of coverage, source network, location information, and the *total number of observations* for each station. 

For this example, we will show you how to read a shapefile of historic wildfire boundaries in San Diego County using publicly available data from the [San Diego Regional Data Warehouse](https://geo.sandag.org/portal/apps/experiencebuilder/experience/?id=fad9e9c038c84f799b5378e4cc3ed068#data_s=id%3AdataSource_1-0%3A202). Next, we'll subset the station list to get all the weather stations available in the fire zone for the [Cedar fire in 2003](https://www.sandiego.gov/fire/about/major-fires-incidents/2003-cedar-fire), which is one of the largest wildfires in California's history (and, as of 2025, the largest wildfire in San Diego County). Lastly, we'll export the subsetted station list and make a plot of the fire boundary with the station locations overlaid.

You could easily modify this workflow for your own shapefiles by using a **web link to an open access shapefile** or by **uploading your own shapefile** to your Hub instance.

In [None]:
# Read in the Historical Data Platform station list
hdp_stns = pd.read_csv("https://cadcat.s3.amazonaws.com/histwxstns/historical_wx_stations.csv")

# Convert to a GeoDataFrame so we can use the geometry column for subsetting
hdp_stns = gpd.GeoDataFrame(hdp_stns, geometry=gpd.GeoSeries.from_wkt(hdp_stns["geometry"]), crs="EPSG:4326")

The open source Python library [geopandas](https://geopandas.org/en/stable/getting_started/introduction.html) makes working with shapefiles in Python easy; you can read in a shapefile using one line of code.

In [None]:
# Region of Interest shapefile
roi = gpd.read_file("https://geo.sandag.org/server/rest/directories/downloads/Fire_Burn_History_shapefile.zip") # Replace this with your own shapefile!

# Convert to projection of HDP data
roi = roi.to_crs(hdp_stns.crs) 

# Subset the shapefile to just get the Cedar fire of 2003 
cedar_fire_2003 = roi[(roi["FIRE_NAME"]=="CEDAR") & (roi["YEAR_"] == 2003)]
cedar_fire_2003

In [None]:
# Spatially join the station list to the area of interest, keeping only stations that intersect it
stns_within_area = gpd.sjoin(hdp_stns, cedar_fire_2003, how='inner', predicate='intersects').reset_index(drop=True)
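
To see what the spatial join is doing, here is a minimal sketch on toy data: three made-up points and one rectangular polygon, all in EPSG:4326. The station names, coordinates, and the `TOY` fire name are hypothetical; only the `gpd.sjoin` call mirrors the cell above.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Three made-up stations: two inside the toy polygon, one outside
toy_stns = gpd.GeoDataFrame(
    {"station": ["inside_a", "inside_b", "outside"]},
    geometry=[Point(-116.8, 32.9), Point(-116.7, 33.0), Point(-117.5, 33.5)],
    crs="EPSG:4326",
)

# A made-up rectangular "fire boundary": box(minx, miny, maxx, maxy)
toy_area = gpd.GeoDataFrame(
    {"FIRE_NAME": ["TOY"]},
    geometry=[box(-117.0, 32.8, -116.5, 33.1)],
    crs="EPSG:4326",
)

# An inner join keeps only the stations that intersect the polygon,
# and attaches the polygon's attribute columns to each matching station
toy_within = gpd.sjoin(
    toy_stns, toy_area, how="inner", predicate="intersects"
).reset_index(drop=True)
print(toy_within[["station", "FIRE_NAME"]])
```

Note that the result carries the polygon's columns (here `FIRE_NAME`) alongside the station columns, which is why the joined station list is wider than the original.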

This subset contains the stations within the area defined by your shapefile. You can now export this list for your own reference and use it to look up specific stations.

In [None]:
# Export the subset portion of station list
stns_within_area.to_csv("subset_station_list.csv")

Let's also make a quick visualization of the data to better understand the subsetting process and the geographic boundaries of the data. 

In [None]:
fig, ax = plt.subplots(figsize=(8,10))
cedar_fire_2003.plot(ax=ax, color='rosybrown')
stns_within_area.plot(ax=ax, color='darkblue', markersize=12, label="station")
ax.legend()
ax.set_title("Weather stations within the Cedar Fire boundary");

Want a more interactive way to view the data? Use `stns_within_area.explore()`, a geopandas method that generates an interactive map where you can zoom, pan, and click features to see their attributes. Note that this map may take some time to load.

In [None]:
stns_within_area.explore()