<a href="https://colab.research.google.com/github/andonyns/air-quality/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 01

## Group 04
- Jorge Ignacio Chavarría Herrera - B82073
- Antonio Badilla-Olivas - B80874
- Enrique Guillermo Vílchez Lizano - C18477
- Andony Nuñez Solano - B04539

## Objectives

1. Select and collect the parameters and cities.
2. Clean and transform the data so they can be compared.
3. Univariate and multivariate analysis: analyze trends in the indicators, compare them, and look for possible correlations between variables.
4. Conclusions and recommendations in light of each country's environmental policies.

# Relevant concepts
These are the pollutants that [OpenAQ](https://openaq.org/) reports for measuring air pollution. The definitions are taken from the EPA's [criteria air pollutants](https://www.epa.gov/criteria-air-pollutants/information-pollutant) pages, which cover the pollutants regulated under the Clean Air Act:

1. PM (Particulate Matter): These particles come in many sizes and shapes and can be made up of hundreds of different chemicals. Some are emitted directly from a source, such as construction sites, unpaved roads, fields, smokestacks or fires. Most particles form in the atmosphere as a result of complex reactions of chemicals such as sulfur dioxide and nitrogen oxides, which are pollutants emitted from power plants, industries and automobiles.

   - PM₂.₅ (Particulate Matter 2.5 micrometers or smaller): fine inhalable particles, with diameters that are generally 2.5 micrometers and smaller.

   - PM₁₀ (Particulate Matter 10 micrometers or smaller): inhalable particles, with diameters that are generally 10 micrometers and smaller.

2. O₃ (Ozone): Tropospheric, or ground-level, ozone is not emitted directly into the air but is created by chemical reactions between oxides of nitrogen (NOx) and volatile organic compounds (VOC). This happens when pollutants emitted by cars, power plants, industrial boilers, refineries, chemical plants, and other sources chemically react in the presence of sunlight.

3. NO₂ (Nitrogen Dioxide): Nitrogen Dioxide (NO2) is one of a group of highly reactive gases known as oxides of nitrogen or nitrogen oxides (NOx). Other nitrogen oxides include nitrous acid and nitric acid. NO2 is used as the indicator for the larger group of nitrogen oxides. NO2 primarily gets in the air from the burning of fuel. NO2 forms from emissions from cars, trucks and buses, power plants, and off-road equipment.

4. SO₂ (Sulfur Dioxide): SO2 is the component of greatest concern and is used as the indicator for the larger group of gaseous sulfur oxides (SOx). Other gaseous SOx (such as SO3) are found in the atmosphere at concentrations much lower than SO2. The largest source of SO2 in the atmosphere is the burning of fossil fuels by power plants and other industrial facilities. Smaller sources of SO2 emissions include: industrial processes such as extracting metal from ore; natural sources such as volcanoes; and locomotives, ships and other vehicles and heavy equipment that burn fuel with a high sulfur content.

5. CO (Carbon Monoxide): CO is a colorless, odorless gas that can be harmful when inhaled in large amounts. CO is released when something is burned. The greatest sources of CO to outdoor air are cars, trucks and other vehicles or machinery that burn fossil fuels. A variety of items in your home such as unvented kerosene and gas space heaters, leaking chimneys and furnaces, and gas stoves also release CO and can affect air quality indoors.

In [2]:
%pip install dotenv

Collecting dotenv
  Downloading dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Collecting python-dotenv (from dotenv)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Downloading dotenv-0.9.9-py2.py3-none-any.whl (1.9 kB)
Downloading python_dotenv-1.1.0-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv, dotenv
Successfully installed dotenv-0.9.9 python-dotenv-1.1.0
Note: you may need to restart the kernel to use updated packages.


In [15]:
# For API requests
import requests
from urllib.parse import urljoin

# For env
import os
from dotenv import load_dotenv

# For data manipulation
from pprint import pprint
import pandas as pd

import time
import logging
import sys

from dataclasses import dataclass

# To log events
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    stream=sys.stdout
)

logger = logging.getLogger(__name__)

load_dotenv()


True

In [16]:
DATA_DIR = "data"
# Create the output directory if it does not exist yet
os.makedirs(DATA_DIR, exist_ok=True)

BASE_API_URL = "https://api.openaq.org/v3/"

HEADERS = {"X-API-Key": os.getenv("API_KEY")}

LOCATIONS_ENDPOINT = urljoin(BASE_API_URL, "locations/{location_id}")
MEASUREMENTS_ENDPOINT = urljoin(BASE_API_URL, "sensors/{sensor_id}/measurements")

In [17]:
from typing import Any


def fetch_data(
    base_url: str,
    headers: dict[str, str] | None = None,
    parameters: dict[str, Any] | None = None,
    query_parameters: dict[str, Any] | None = None,
    verbose: bool = False,
) -> requests.Response:
    """
    Fetch data from the OpenAQ API.
    :param base_url: The URL template for the API endpoint.
    :param headers: Optional headers to include in the request.
    :param parameters: Optional parameters to format into the URL template.
    :param query_parameters: Optional query parameters to include in the request.
    :param verbose: If True, print the response headers.
    :return: The `requests.Response` object; call `.json()` on it for the payload.
    """
    # Fill in path placeholders such as {location_id} or {sensor_id}
    url = base_url.format(**parameters) if parameters is not None else base_url

    # Let requests build and URL-encode the query string;
    # headers=None and params=None are both accepted
    response = requests.get(url=url, headers=headers, params=query_parameters)

    if response.status_code != 200:
        raise Exception(
            f"Request failed with status: {response.status_code}. Reason: {response.text}"
        )
    if verbose:
        pprint(response.headers)
    return response
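
To make the two kinds of parameters concrete, here is a minimal offline sketch of how a templated endpoint plus a query dictionary become the final request URL. It uses only `urllib.parse` and makes no network call. `build_url` is a hypothetical helper that is not part of the notebook, and the sensor ID is the one that appears in the Costa Rica output further down.

```python
from typing import Optional
from urllib.parse import urlencode, urljoin

BASE_API_URL = "https://api.openaq.org/v3/"
MEASUREMENTS_ENDPOINT = urljoin(BASE_API_URL, "sensors/{sensor_id}/measurements")


def build_url(template: str, path_params: dict, query_params: Optional[dict] = None) -> str:
    """Fill the path placeholders, then append a URL-encoded query string."""
    url = template.format(**path_params)
    if query_params:
        url = f"{url}?{urlencode(query_params)}"
    return url


example_url = build_url(
    MEASUREMENTS_ENDPOINT,
    path_params={"sensor_id": 10669679},
    query_params={"limit": 1000, "page": 1},
)
print(example_url)
```

Path parameters select the resource, while query parameters control paging, which is why `fetch_data` takes them as separate arguments.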

In [23]:
# Costa Rica example
cr_location_id = 3070644

cr_location_data = fetch_data(
    base_url=LOCATIONS_ENDPOINT,
    headers=HEADERS,
    parameters={"location_id": cr_location_id},
).json()
pprint(cr_location_data)

{'meta': {'found': 1,
          'limit': 100,
          'name': 'openaq-api',
          'page': 1,
          'website': '/'},
 'results': [{'bounds': [-84.0417, 9.938, -84.0417, 9.938],
              'coordinates': {'latitude': 9.938, 'longitude': -84.0417},
              'country': {'code': 'CR', 'id': 29, 'name': 'Costa Rica'},
              'datetimeFirst': {'local': '2024-09-19T14:01:34-06:00',
                                'utc': '2024-09-19T20:01:34Z'},
              'datetimeLast': {'local': '2025-04-20T13:51:58-06:00',
                               'utc': '2025-04-20T19:51:58Z'},
              'distance': None,
              'id': 3070644,
              'instruments': [{'id': 4, 'name': 'Clarity Sensor'}],
              'isMobile': False,
              'isMonitor': False,
              'licenses': [{'attribution': {'name': 'Clarity', 'url': None},
                            'dateFrom': '2021-10-20',
                            'dateTo': None,
                            'id

In [26]:
# Get the first sensor
sensor = cr_location_data["results"][0]["sensors"][0]
sensor_id = sensor["id"]
query_params = {"limit": 1000, "page": 1}

not_finished = True
while not_finished:
    sensor_data = fetch_data(
        base_url=MEASUREMENTS_ENDPOINT,
        headers=HEADERS,
        parameters={"sensor_id": sensor_id},
        query_parameters=query_params,
    ).json()
    # A page with fewer results than the limit is the last one
    if len(sensor_data["results"]) < query_params["limit"]:
        # Single quotes inside the f-string keep this valid before Python 3.12
        print(f"last page: {sensor_data['meta']['page']}")
        not_finished = False
    else:
        query_params["page"] += 1

print(f"Total pages: {query_params['page']}")

last page: 13
Total pages: 13
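
The stop condition used above (keep requesting until a page returns fewer results than the limit) can be isolated into a small generator. The following is a sketch only: `iter_pages` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for the API so the logic can be checked without network access.

```python
from typing import Callable, Iterator


def iter_pages(fetch_page: Callable[[int], dict], limit: int = 1000) -> Iterator[dict]:
    """Yield pages until one comes back with fewer than `limit` results."""
    page = 1
    while True:
        payload = fetch_page(page)
        yield payload
        if len(payload["results"]) < limit:
            break
        page += 1


def fake_fetch_page(page: int, total: int = 25, limit: int = 10) -> dict:
    """Stand-in for the API: serves `total` fake records, `limit` per page."""
    start = (page - 1) * limit
    results = list(range(start, min(start + limit, total)))
    return {"meta": {"page": page}, "results": results}


pages = list(iter_pages(fake_fetch_page, limit=10))
print(len(pages), [len(p["results"]) for p in pages])
```

One consequence of this rule is that when the total is an exact multiple of the limit, one extra request is made and returns an empty page.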


In [32]:
def set_color(text: str, color: str = "green") -> str:
    """
    Set the color of the text.
    :param text: The text to color.
    :param color: The color to set.
    :return: The colored text.
    """
    colors = {
        "green": "\033[32m",
        "yellow": "\033[33m",
        "red": "\033[31m",
        "blue": "\033[34m",
        "orange": "\033[38;5;214m",
        "reset": "\033[0m",
    }
    return f"{colors[color]}{text}{colors['reset']}"


def fetch_location_sensors_data(
    location_id: int,
) -> dict[int, list[dict]]:
    """
    Fetch every page of measurements for every sensor at a location.
    :param location_id: The OpenAQ location ID.
    :return: A dict mapping each sensor ID to its list of response pages.
    """
    logger.info(f"Fetching data for location ID: {location_id}...")
    location_data = fetch_data(
        base_url=LOCATIONS_ENDPOINT,
        headers=HEADERS,
        parameters={"location_id": location_id},
    ).json()

    logger.info(set_color(f"Finished fetch for data of location ID: {location_id}", "green"))

    # Get all sensors info
    sensors_data = {}

    # Iterate through all sensors
    for i, sensor in enumerate(location_data["results"][0]["sensors"]):
        logger.info(f"Processing sensor number {i}")
        sensor_id = sensor["id"]
        sensors_data[sensor_id] = []

        # Restart pagination for each sensor; otherwise the second sensor
        # would begin at the page where the first one stopped
        query_params = {"limit": 1000, "page": 1}

        not_finished = True
        while not_finished:

            sensor_response = fetch_data(
                base_url=MEASUREMENTS_ENDPOINT,
                headers=HEADERS,
                parameters={"sensor_id": sensor_id},
                query_parameters=query_params,
            )

            sensor_data = sensor_response.json()

            # Header values are strings, so convert before comparing
            remaining = int(sensor_response.headers["X-Ratelimit-Remaining"])

            logger.info(f"Processing sensor number {i}, page {sensor_data['meta']['page']}")
            logger.warning(set_color(f"Remaining requests: {remaining}", "orange"))

            sensors_data[sensor_id].append(sensor_data)

            if len(sensor_data["results"]) < query_params["limit"]:
                not_finished = False
            else:
                query_params["page"] += 1

            if remaining == 0:
                # The reset value lives in the response headers, not on the response itself
                reset_seconds = int(sensor_response.headers["X-Ratelimit-Reset"])
                logger.warning(set_color(f"Reached rate limit. Waiting for {reset_seconds} seconds.", "red"))
                time.sleep(reset_seconds)

        logger.info(set_color(f"Finished processing sensor number {i}", "green"))

    logger.info(set_color(f"Finished processing all sensors for location ID: {location_id}", "green"))
    return sensors_data
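
The rate-limit check depends on the fact that HTTP header values arrive as strings. The sketch below isolates that decision into a hypothetical helper, `seconds_to_wait`, that can be tested without calling the API. It assumes, as the function above does, that `X-Ratelimit-Reset` is expressed in seconds. A plain `dict` is used here for simplicity, so key lookup is case-sensitive, unlike the headers object returned by `requests`.

```python
def seconds_to_wait(headers: dict) -> int:
    """Return how long to pause before the next request, or 0 if quota remains."""
    # Header values are strings such as "0" or "59"; convert before comparing.
    remaining = int(headers.get("X-Ratelimit-Remaining", 1))
    if remaining > 0:
        return 0
    return int(headers.get("X-Ratelimit-Reset", 0))


print(seconds_to_wait({"X-Ratelimit-Remaining": "12", "X-Ratelimit-Reset": "48"}))
print(seconds_to_wait({"X-Ratelimit-Remaining": "0", "X-Ratelimit-Reset": "48"}))
```

Comparing the raw string `"0"` with the integer `0` is always false in Python, so without the conversion the pause would never trigger.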

In [30]:
class Location:
    def __init__(self, country: str, name: str, id: int, sensors_data: dict | None = None):
        self.country: str = country
        self.name: str = name
        self.id: int = id
        self.sensors_data: dict | None = sensors_data

    def pull_sensors_data(self):
        self.sensors_data = fetch_location_sensors_data(self.id)

    def save_sensors_data(self):
        if self.sensors_data is None:
            raise ValueError("No sensors data to save. Please pull the data first.")

        # Replace spaces
        country = self.country.replace(" ", "-")
        name = self.name.replace(" ", "-")

        # One directory per location
        out_dir = os.path.join(DATA_DIR, f"{country}_{name}")
        os.makedirs(out_dir, exist_ok=True)

        for sensor_id, pages in self.sensors_data.items():
            # Each entry is one API page; flatten the measurements under
            # "results" so the CSV has one row per measurement, not per page
            sensor_df = pd.json_normalize(pages, record_path="results")

            # Save to CSV
            sensor_df.to_csv(os.path.join(out_dir, f"{sensor_id}.csv"), index=False)


In [20]:
locations = [
    Location(country="Costa Rica", name="NASA GSFC Rutgers Calib. N13", id=3070644),
    Location(country="United Kingdom", name="Port Talbot Margam", id=946),
    Location(country="Spain", name="Escaldes-Engordany", id=9742)
]

In [None]:
# For every location, save all info
for location in locations:
    location.pull_sensors_data()
    location.save_sensors_data()
    # Release the data without deleting the attribute, so later access does not raise AttributeError
    location.sensors_data = None

    logger.info(set_color(f"Saved data for location {location.name}", "green"))

2025-04-20 21:21:33 - __main__ - INFO - Fetching data for location ID: 3070644...
2025-04-20 21:21:33 - __main__ - INFO - [32mFinished fetch for data of location ID: 3070644[0m
2025-04-20 21:21:33 - __main__ - INFO - Processing sensor number 0
2025-04-20 21:21:34 - __main__ - INFO - Processing sensor number 0, page 1
2025-04-20 21:21:35 - __main__ - INFO - Processing sensor number 0, page 2
2025-04-20 21:21:37 - __main__ - INFO - Processing sensor number 0, page 3
2025-04-20 21:21:38 - __main__ - INFO - Processing sensor number 0, page 4
2025-04-20 21:21:41 - __main__ - INFO - Processing sensor number 0, page 5
2025-04-20 21:21:43 - __main__ - INFO - Processing sensor number 0, page 6
2025-04-20 21:21:46 - __main__ - INFO - Processing sensor number 0, page 7
2025-04-20 21:21:50 - __main__ - INFO - Processing sensor number 0, page 8
2025-04-20 21:21:53 - __main__ - INFO - Processing sensor number 0, page 9
2025-04-20 21:21:57 - __main__ - INFO - Processing sensor number 0, page 10
202

Exception: Request failed with status: 408. Reason: {"detail":"Connection timed out: Try to provide more specific query parameters or a smaller time frame."}
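
The 408 response suggests using a smaller time frame. One possible approach, sketched below, is to split the sensor's full date range into fixed-size windows and query each window separately. The dates are the `datetimeFirst` and `datetimeLast` values from the Costa Rica location output shown earlier. `date_windows` is a hypothetical helper, and the query parameter names `datetime_from` and `datetime_to` are an assumption that should be checked against the OpenAQ v3 API documentation before use.

```python
from datetime import date, timedelta


def date_windows(start: date, end: date, days: int = 30) -> list:
    """Split [start, end) into consecutive windows of at most `days` days."""
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + timedelta(days=days), end)
        windows.append((cursor, upper))
        cursor = upper
    return windows


windows = date_windows(date(2024, 9, 19), date(2025, 4, 20), days=30)

# Hypothetical query parameters for each window (parameter names are an assumption).
queries = [
    {"datetime_from": lo.isoformat(), "datetime_to": hi.isoformat(), "limit": 1000, "page": 1}
    for lo, hi in windows
]
print(len(queries))
print(queries[0])
```

Each dictionary in `queries` could then be passed as `query_parameters` to `fetch_data`, with pagination restarted inside every window.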


# 2. Cleaning and transformation tasks:

Carry out the cleaning and transformation tasks needed to compare how the different air quality indicators have evolved in Costa Rica and in the other cities.


In [25]:
def clean_sensors_data(
    sensors_data: dict[int, list[dict]]
) -> dict[int, list[dict]]:
    
    """
    Clean the sensors data.
    :param sensors_data: The sensors data to clean.
    :return: The cleaned sensors data.
    """

    cleaned_data = {}
    for sensor_id, sensor_data in sensors_data.items():
        cleaned_data[sensor_id] = []

        for page in sensor_data:
            for measurement in page["results"]:
                # Data dict
                data_dict = {
                    "value": measurement["value"],
                    "period": measurement["period"],
                    "parameter": measurement["parameter"],
                }
                cleaned_data[sensor_id].append(data_dict)
    
    return cleaned_data
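
The cleaned structure is still nested, which makes comparisons across sensors awkward. As a possible next step, the sketch below flattens it into table-like rows with a parsed UTC timestamp. `to_rows` is a hypothetical helper, and the sample record is copied from the Costa Rica sensor output in this notebook.

```python
from datetime import datetime


def to_rows(cleaned: dict) -> list:
    """Turn nested cleaned measurements into flat, table-like rows."""
    rows = []
    for sensor_id, measurements in cleaned.items():
        for m in measurements:
            rows.append({
                "sensor_id": sensor_id,
                "parameter": m["parameter"]["name"],
                "units": m["parameter"]["units"],
                "value": m["value"],
                # Parse the UTC end of the measurement period into a datetime.
                "datetime_utc": datetime.fromisoformat(
                    m["period"]["datetimeTo"]["utc"].replace("Z", "+00:00")
                ),
            })
    return rows


sample = {
    10669679: [{
        "value": 7.11,
        "parameter": {"id": 2, "name": "pm25", "units": "µg/m³", "displayName": None},
        "period": {
            "label": "raw",
            "interval": "00:05:00",
            "datetimeFrom": {"utc": "2024-09-19T19:56:34Z", "local": "2024-09-19T13:56:34-06:00"},
            "datetimeTo": {"utc": "2024-09-19T20:01:34Z", "local": "2024-09-19T14:01:34-06:00"},
        },
    }]
}
rows = to_rows(sample)
print(rows[0])
```

A list of flat dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` for the analysis stage.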

In [26]:
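# NOTE: `cr_data` is not defined in the cells shown above; it is assumed to hold
# the result of fetch_location_sensors_data(cr_location_id) from an earlier run.
clean_cr_data = clean_sensors_data(cr_data)
pprint(clean_cr_data)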

{10669679: [{'parameter': {'displayName': None,
                           'id': 2,
                           'name': 'pm25',
                           'units': 'µg/m³'},
             'period': {'datetimeFrom': {'local': '2024-09-19T13:56:34-06:00',
                                         'utc': '2024-09-19T19:56:34Z'},
                        'datetimeTo': {'local': '2024-09-19T14:01:34-06:00',
                                       'utc': '2024-09-19T20:01:34Z'},
                        'interval': '00:05:00',
                        'label': 'raw'},
             'value': 7.11},
            {'parameter': {'displayName': None,
                           'id': 2,
                           'name': 'pm25',
                           'units': 'µg/m³'},
             'period': {'datetimeFrom': {'local': '2024-09-19T14:16:01-06:00',
                                         'utc': '2024-09-19T20:16:01Z'},
                        'datetimeTo': {'local': '2024-09-19T14:21:01-06:00',
       


# 3. Implementation in Google Colab:

Implement the work in Google Colab. If performance problems arise, another environment may be used, which must be noted in the notebook documentation as well as in the presentation.



# 4. Analysis and comparison:

Carry out an exploratory data analysis (EDA) that includes univariate and multivariate analysis.

Analyze the trends of the indicators for the different cities and compare them across countries and cities.

Include possible correlations between the air quality variables and parameters of each country/city.

Use different types of visualizations relevant to the analysis.
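
For the correlation part of the analysis, the following is a minimal sketch of the Pearson coefficient between two pollutant series. The readings are synthetic values invented purely for illustration, not measurements from OpenAQ, and `pearson` is a hypothetical helper. With the real data in a pandas DataFrame, `DataFrame.corr()` computes the same coefficient for every pair of columns.

```python
from math import sqrt


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("Both series need the same length (at least 2).")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Synthetic readings in µg/m³, invented for illustration only.
pm25 = [7.1, 8.4, 9.0, 12.3, 11.8, 6.5]
pm10 = [15.2, 17.9, 18.5, 25.1, 24.0, 13.8]

r = pearson(pm25, pm10)
print(round(r, 3))
```

The series must be aligned on the same timestamps before computing the coefficient, since sensors may report at different intervals.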



# 5. Conclusions and Recommendations:

Draw conclusions about how air quality has evolved in Costa Rica and the selected cities, explaining how the data support them.

Research the environmental policies and regulations in these cities and show how the data reflect the effect of those policies.