# Data Access Notebook for BEACO2N
_by Michelle H Wilkerson_

## Purpose of this Notebook

This notebook was developed as part of NSF Grant 2445609 to support accessing and processing BEACO2N data for middle and high school classroom activities. It is written to be relatively accessible to beginners, but if you have not worked with computational notebooks or Python before, you may find this tool difficult to navigate. (Check out the Show Your Work project for a gentle introduction to computational notebooks for educators!)

Our project is focused on supporting data analysis and mechanistic reasoning in science education. In other words, we want students to learn how data provides information about _how scientific mechanisms work_, and how understanding scientific mechanisms can help them to _explain and interpret patterns in data_. This builds on a long history of research on complex systems and agent-based modeling, and more closely connects that work to current expansions of data analysis across subjects.

Here, we are focused on air quality as a phenomenon. While most students understand that poor air quality can impact health, they may not know that there are many different kinds of air pollution, each caused by different processes and chemicals. These differences show up as different patterns over the course of a day or year.

This data tool lets users connect to BEACO2N, search for air quality data streams in an area of interest, and identify the data streams that record observations for _both_ PM2.5 and O3. These are two key pollutants that affect air quality and tend to behave very differently over time. Datasets like these can serve as a launching point for examining what air quality is and what mechanistic and compositional complexities underlie it.

You are welcome to modify and adapt this script. You may find the BEACO2N documentation [here](https://beacon.berkeley.edu/metadata/) helpful. BEACO2N requests are made via HTTP GET; to play around with what these look like, you can use [this](http://128.32.208.8/about) GUI tool and then inspect the link it generates. Below we use the CSV option. JSON is also available by replacing "csv" with "json" in the URL. BEACO2N data is queried on a node-by-node basis; see [here](https://beacon.berkeley.edu/get_latest_nodes/csv/) for a list of available nodes.
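
As a rough sketch of what such a request looks like, the helper below assembles a measurements URL by hand. The path and the query parameters (`name`, `interval`, `start`, `end`) mirror the request built later in this notebook. The function name `build_measurements_url`, the example node name, and the example dates are placeholders for illustration and are not taken from the BEACO2N documentation.

```python
from urllib.parse import quote

BASE_URL = "http://128.32.208.8"  # host of the GUI tool linked above


def build_measurements_url(node_id, name, start, end, interval=60, fmt="csv"):
    """Assemble a GET URL for one node's measurements (hypothetical helper)."""
    if fmt not in ("csv", "json"):
        raise ValueError("fmt must be 'csv' or 'json'")
    params = [("name", name), ("interval", interval), ("start", start), ("end", end)]
    # Percent-encode spaces as %20 but keep ':' readable in the timestamps.
    query = "&".join(f"{key}={quote(str(value), safe=':')}" for key, value in params)
    return f"{BASE_URL}/node/{node_id}/measurements/{fmt}?{query}"


# Placeholder values, purely for illustration.
example = build_measurements_url(
    63, "Example Node", "2025-08-26 00:00:00", "2025-09-02 00:00:00"
)
print(example)
print(example.replace("/csv?", "/json?"))  # same request, JSON output
```

Pasting either printed link into a browser should behave like a link generated by the GUI tool, assuming the node ID and name match a real node.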

## Part I: Caching a list of nodes with PM and O3.

We start by taking all the available nodes from the BEACO2N project and building a dataframe.

In [None]:
import pandas as pd 

# let's get all the nodes
nodes = pd.read_csv("https://beacon.berkeley.edu/get_latest_nodes/csv/")
nodes

How nice, the list of available nodes also gives us the latest readings for each measurement available. So let's start by looking at the latest PM and O3 readings from all available nodes. We could look at the "level" of the node, which is an assessment of the quality of the readings. But we want to cast a wide net, and I see that level 1 is applied to missing readings (-999). So instead, we're gonna look for readings that are simply not -999. We'll store all the nodes that have both PM and O3 readings as `nodes_with_both`... but note that this uses just the latest reading, so a small "blip" in a sensor might exclude a node unnecessarily from our list.

In [None]:
nodes_with_both = nodes[(nodes['o3_ppm'] != -999) & (nodes['PM_2.5_ug/m3'] != -999)]
nodes_with_both = nodes_with_both.set_index('id') #use id as index
nodes_with_both

So these are the nodes where the "action" is in terms of graphs that might show interesting differences in the behavior and presence of pollutants over time. Look at the datetime. That will tell you when the latest readings were. (Again, it could be that some other nodes had both readings up until the most recent one and so we don't see the juicy stuff. I'll deal with that in a later section.) Right now, CAwaterloo1 looks active so let's take a look at that one.
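
One way to check recency at a glance is to parse the `datetime` column and sort by it. This is a minimal sketch: the helper name `most_recent_nodes` is made up for illustration, and it assumes the timestamps are in a format that `pd.to_datetime` can parse.

```python
import pandas as pd


def most_recent_nodes(df, datetime_col="datetime", n=5):
    """Return the n rows with the newest timestamps (hypothetical helper)."""
    out = df.copy()  # leave the caller's DataFrame untouched
    # errors="coerce" turns unparseable timestamps into NaT instead of raising.
    out[datetime_col] = pd.to_datetime(out[datetime_col], errors="coerce")
    return out.sort_values(datetime_col, ascending=False, na_position="last").head(n)


# In this notebook: most_recent_nodes(nodes_with_both)
```

Nodes near the top of the result reported most recently, so they are reasonable candidates for a closer look.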

In [None]:
# we want data from the last available week of data, so let's datetime it
from datetime import datetime, timedelta

# we're gonna then build the url we need to make the request
import urllib.parse

# let's get the last week of data from the waterloo id=356
nodeID = 356

# this is the datetime format BEACON uses
date_format = "%Y-%m-%d %H:%M:%S"

# get the latest available datetime, and construct a week-long interval.
end_date = nodes_with_both.loc[nodeID, "datetime"]
start_date = (datetime.strptime(end_date, date_format) - timedelta(weeks=1)).strftime(date_format)

name = str(nodes_with_both.loc[nodeID, "node_name_long"])

# construct the GET request
url = "http://128.32.208.8/node/" + str(nodeID) + "/measurements/csv?name=" + name + "&interval=60&start=" + start_date + "&end=" + end_date
# str.replace returns a new string, so assign it back to keep the encoded URL
url = url.replace(" ", "%20")

## NOTE: strangely, this isn't working for node 356,
# which is supposed to have current data. But it seems to be working
# for these other nodes: 64, 63, 61...
# The URL printed below will be clickable,
# but in Chrome I have to paste it into the URL bar to download the file.
print(url)
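
Instead of downloading the file by hand, the URL can also be read straight into a dataframe. Since some requests fail (as noted for node 356 above), this sketch returns `None` rather than stopping the notebook. The helper name `load_measurements` is hypothetical, and treating -999 as the missing-value marker in the measurements file is an assumption carried over from the node list.

```python
import urllib.error

import numpy as np
import pandas as pd


def load_measurements(source, missing_value=-999):
    """Read a measurements CSV into a DataFrame, or return None on failure
    (hypothetical helper)."""
    try:
        df = pd.read_csv(source)  # accepts a URL, a file path, or a file-like object
    except (urllib.error.URLError, pd.errors.EmptyDataError, pd.errors.ParserError) as err:
        print(f"Could not load data: {err}")
        return None
    # Assumption: -999 marks a missing reading, as it does in the node list.
    return df.replace(missing_value, np.nan)


# In this notebook: measurements = load_measurements(url)
```

Converting the marker to `NaN` keeps missing readings from distorting plots or averages later on.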

Here's some code to export a CSV of the merged dataset to use in other tools. (Should be very easy to work with in CODAP from here.)

In [None]:
# NOTE: `merged` must already hold the DataFrame to export; it is not defined above.
merged.to_csv("filename.csv")

# Credit

#### BEACO2N Data
Cohen Research - University of California Berkeley
Berkeley Environmental Air-quality & CO2 Network (BEACO2N)
Available at http://beacon.berkeley.edu
Date accessed: [September 02, 2025]