# Air Quality Analysis

This project examines the relationship between weather and air pollution variables in six randomly selected US cities. The main objective is to design a data-driven air quality intelligence system that integrates monitoring, explanatory analytics, and predictive modelling.

**Hypotheses:**
- **H1:** Weather conditions have a measurable effect on air pollution levels across urban locations.
- **H2:** Atmospheric and pollution conditions can be grouped into identifiable risk regimes (clusters).
- **H3:** Weather risk and pollution severity are linked by a probabilistic relationship.
- **H4:** Weather-driven patterns can be used to predict near-term pollution levels with acceptable accuracy.

## Table of Contents

1. [Data Ingestion](#1-data-ingestion)
2. [Exploratory Data Analysis (EDA)](#2-exploratory-data-analysis-eda)
3. [Extract Load Transform (ETL)](#3-extract-load-transform-etl)
4. [Data Preparation for ML](#4-data-preparation-for-ml)
5. [Prediction/Modelling](#5-predictionmodelling)
6. [Dashboard and Reporting](#6-dashboard-and-reporting)
7. [Summary and Conclusions](#7-summary-and-conclusions)



## 1. Data Ingestion
#### Weather forecast variables:
* https://open-meteo.com/en/docs?hourly=temperature_2m,relative_humidity_2m,precipitation,wind_speed_10m,surface_pressure&location_mode=csv_coordinates&csv_coordinates=34.0522,+-118.2437%29,++%0A38.5816,+-21.4944%29,+++%0A42.3314,+-83.0458%29,+++%0A29.7604,+-95.3698%29,+++%0A41.4993,+-81.6944%29,+++%0A41.8781,+-87.6298%29%0A&time_mode=time_interval&start_date=2025-11-07&end_date=2026-01-05
* These model-derived estimates provide consistent temporal coverage, but they may underrepresent localized weather variability and are subject to model biases and limited seasonal scope.
#### Air quality variables:
* https://open-meteo.com/en/docs/air-quality-api?location_mode=csv_coordinates&csv_coordinates=34.0522,+-118.2437%0A38.5816,+-21.4944%0A42.3314,+-83.0458%0A29.7604,+-95.3698%0A41.4993,+-81.6944%0A41.8781,+-87.6298&time_mode=time_interval&start_date=2025-11-07&end_date=2026-01-05&hourly=pm10,pm2_5,carbon_monoxide,carbon_dioxide,sulphur_dioxide,ozone,nitrogen_dioxide,us_aqi
* Potential biases include model representation errors, limited spatial granularity, and short temporal coverage, which may affect interpretation of meteorology–air quality relationships.
* Both datasets were retrieved through the Open-Meteo Weather Forecast API and Air Quality API.
* Six US cities were selected at random: Los Angeles, Sacramento, Detroit, Houston, Cleveland, and Chicago.
* The selected period for both weather and air quality runs from 2025-11-07 to 2026-01-05, chosen to obtain near real-time data. Responses are at hourly resolution.

### 1.1 Air quality variables ingestion pipeline

In [1]:
# Imports
import os
import warnings

import openmeteo_requests
import pandas as pd
import requests_cache
from retry_requests import retry

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [2]:
# Explicitly define the city-to-coordinate mapping
LOCATIONS = [
    {"city": "Los Angeles", "country": "US", "lat": 34.0522, "lon": -118.2437},
    {"city": "Sacramento", "country": "US", "lat": 38.5816, "lon": -121.4944},
    {"city": "Detroit", "country": "US", "lat": 42.3314, "lon": -83.0458},
    {"city": "Houston", "country": "US", "lat": 29.7604, "lon": -95.3698},
    {"city": "Cleveland", "country": "US", "lat": 41.4993, "lon": -81.6944},
    {"city": "Chicago", "country": "US", "lat": 41.8781, "lon": -87.6298},
]
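
Keeping the coordinates in a single mapping makes it possible to derive the request lists from it, so the order and values sent to the API cannot drift from the city labels attached later. The following is a minimal sketch under that assumption; `build_coordinate_params` is a hypothetical helper name, not part of the Open-Meteo client.

```python
# Same mapping as defined in the cell above
LOCATIONS = [
    {"city": "Los Angeles", "country": "US", "lat": 34.0522, "lon": -118.2437},
    {"city": "Sacramento", "country": "US", "lat": 38.5816, "lon": -121.4944},
    {"city": "Detroit", "country": "US", "lat": 42.3314, "lon": -83.0458},
    {"city": "Houston", "country": "US", "lat": 29.7604, "lon": -95.3698},
    {"city": "Cleveland", "country": "US", "lat": 41.4993, "lon": -81.6944},
    {"city": "Chicago", "country": "US", "lat": 41.8781, "lon": -87.6298},
]


def build_coordinate_params(locations):
    """Hypothetical helper: return latitude/longitude lists in mapping order."""
    return {
        "latitude": [loc["lat"] for loc in locations],
        "longitude": [loc["lon"] for loc in locations],
    }


coords = build_coordinate_params(LOCATIONS)
print(coords["latitude"])
print(coords["longitude"])
```

The resulting dictionary could be merged into the request parameters (for example with `params = {**coords, "hourly": [...]}`), which keeps the later `zip(LOCATIONS, responses)` pairing aligned with what was requested.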

#### Loading and calling the Open-Meteo Air Quality API

In [3]:
# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = 3600)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)

# Make sure all required air quality variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://air-quality-api.open-meteo.com/v1/air-quality"
params = {
	# Coordinates follow the order of LOCATIONS (Sacramento's longitude is -121.4944)
	"latitude": [34.0522, 38.5816, 42.3314, 29.7604, 41.4993, 41.8781],
	"longitude": [-118.2437, -121.4944, -83.0458, -95.3698, -81.6944, -87.6298],
	"hourly": ["pm10", "pm2_5", "carbon_monoxide", "sulphur_dioxide", "ozone", "us_aqi", "carbon_dioxide", "nitrogen_dioxide"],
	"start_date": "2025-11-07",
	"end_date": "2026-01-05",
}
responses = openmeteo.weather_api(url, params=params)

# Process 6 locations
for response in responses:
	print(f"\nCoordinates: {response.Latitude()}°N {response.Longitude()}°E")
	print(f"Elevation: {response.Elevation()} m asl")
	print(f"Timezone difference to GMT+0: {response.UtcOffsetSeconds()}s")
	
	# Process hourly data. The order of variables needs to be the same as requested.
	hourly = response.Hourly()
	hourly_pm10 = hourly.Variables(0).ValuesAsNumpy()
	hourly_pm2_5 = hourly.Variables(1).ValuesAsNumpy()
	hourly_carbon_monoxide = hourly.Variables(2).ValuesAsNumpy()
	hourly_sulphur_dioxide = hourly.Variables(3).ValuesAsNumpy()
	hourly_ozone = hourly.Variables(4).ValuesAsNumpy()
	hourly_us_aqi = hourly.Variables(5).ValuesAsNumpy()
	hourly_carbon_dioxide = hourly.Variables(6).ValuesAsNumpy()
	hourly_nitrogen_dioxide = hourly.Variables(7).ValuesAsNumpy()
	
	hourly_data = {"date": pd.date_range(
		start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
		end =  pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
		freq = pd.Timedelta(seconds = hourly.Interval()),
		inclusive = "left"
	)}
	
	hourly_data["pm10"] = hourly_pm10
	hourly_data["pm2_5"] = hourly_pm2_5
	hourly_data["carbon_monoxide"] = hourly_carbon_monoxide
	hourly_data["sulphur_dioxide"] = hourly_sulphur_dioxide
	hourly_data["ozone"] = hourly_ozone
	hourly_data["us_aqi"] = hourly_us_aqi
	hourly_data["carbon_dioxide"] = hourly_carbon_dioxide
	hourly_data["nitrogen_dioxide"] = hourly_nitrogen_dioxide
	
	hourly_dataframe = pd.DataFrame(data = hourly_data)
	print("\nHourly data\n", hourly_dataframe)


Coordinates: 34.099998474121094°N -118.19999694824219°E
Elevation: 91.0 m asl
Timezone difference to GMT+0: 0s

Hourly data
                           date       pm10  pm2_5  carbon_monoxide  \
0    2025-11-07 00:00:00+00:00  12.300000   11.8            252.0   
1    2025-11-07 01:00:00+00:00  13.500000   12.6            307.0   
2    2025-11-07 02:00:00+00:00  14.400000   13.8            388.0   
3    2025-11-07 03:00:00+00:00  17.500000   17.0            450.0   
4    2025-11-07 04:00:00+00:00  19.700001   19.1            479.0   
...                        ...        ...    ...              ...   
1435 2026-01-05 19:00:00+00:00  13.000000   11.3            206.0   
1436 2026-01-05 20:00:00+00:00  12.700000   10.9            195.0   
1437 2026-01-05 21:00:00+00:00  11.500000    9.6            191.0   
1438 2026-01-05 22:00:00+00:00  10.900000    9.1            196.0   
1439 2026-01-05 23:00:00+00:00  11.200000    9.4            207.0   

      sulphur_dioxide  ozone      us_aqi  car

In [4]:
# Initialize a list to hold all location DataFrames
dfs = []

# Loop through each location and its API response
for location, response in zip(LOCATIONS, responses):

    hourly = response.Hourly()

    hourly_data = {
        "date": pd.date_range(
            start=pd.to_datetime(hourly.Time(), unit="s", utc=True),
            end=pd.to_datetime(hourly.TimeEnd(), unit="s", utc=True),
            freq=pd.Timedelta(seconds=hourly.Interval()),
            inclusive="left"
        ),
        "pm10": hourly.Variables(0).ValuesAsNumpy(),
        "pm2_5": hourly.Variables(1).ValuesAsNumpy(),
        "carbon_monoxide": hourly.Variables(2).ValuesAsNumpy(),
        "sulphur_dioxide": hourly.Variables(3).ValuesAsNumpy(),
        "ozone": hourly.Variables(4).ValuesAsNumpy(),
        "us_aqi": hourly.Variables(5).ValuesAsNumpy(),
        "carbon_dioxide": hourly.Variables(6).ValuesAsNumpy(),
        "nitrogen_dioxide": hourly.Variables(7).ValuesAsNumpy(),
    }

    df = pd.DataFrame(hourly_data)

    # Attach city metadata
    df["city"] = location["city"]
    df["country"] = location.get("country", "")
    df["lat"] = location["lat"]
    df["lon"] = location["lon"]

    dfs.append(df)

air_quality_df = pd.concat(dfs, ignore_index=True)
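
The timestamp construction is the same in every loop of this notebook: a start time, an exclusive end time, and an interval, all in seconds. A minimal sketch of how that step could be factored out is shown below. `hourly_frame` is a hypothetical helper that takes plain numbers and arrays (not an Open-Meteo response object), so it can be tried without a network call; the values are synthetic.

```python
import numpy as np
import pandas as pd


def hourly_frame(start_s, end_s, interval_s, variables):
    """Hypothetical helper: build a frame with a UTC 'date' column.

    start_s and end_s are Unix timestamps in seconds. The end is exclusive,
    mirroring inclusive="left" in the ingestion loops.
    """
    dates = pd.date_range(
        start=pd.to_datetime(start_s, unit="s", utc=True),
        end=pd.to_datetime(end_s, unit="s", utc=True),
        freq=pd.Timedelta(seconds=interval_s),
        inclusive="left",
    )
    frame = pd.DataFrame({"date": dates})
    for name, values in variables.items():
        # Guard against a variable array that does not match the time axis
        if len(values) != len(dates):
            raise ValueError(f"{name}: expected {len(dates)} values, got {len(values)}")
        frame[name] = values
    return frame


# Synthetic example: three hourly values starting at 2025-11-07 00:00 UTC
start = int(pd.Timestamp("2025-11-07", tz="UTC").timestamp())
demo = hourly_frame(start, start + 3 * 3600, 3600, {"pm10": np.array([12.3, 13.5, 14.4])})
print(demo)
```

With a real response, the arguments would come from `hourly.Time()`, `hourly.TimeEnd()`, `hourly.Interval()`, and the `ValuesAsNumpy()` arrays used above.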

In [5]:
# Get the dataframe dimensions
air_quality_df.shape

(8640, 13)
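
The reported shape can be cross-checked from the request itself: the period covers both end dates, each day has 24 hourly rows, and there are six cities. The column count is the timestamp, the eight pollutant variables, and the four metadata columns. A short sketch of that arithmetic:

```python
from datetime import date

start, end = date(2025, 11, 7), date(2026, 1, 5)

n_days = (end - start).days + 1   # start and end dates are both included
rows_per_city = n_days * 24       # hourly resolution
expected_rows = rows_per_city * 6  # six cities
expected_cols = 1 + 8 + 4         # date + pollutants + city/country/lat/lon

print(n_days, rows_per_city, expected_rows, expected_cols)
# → 60 1440 8640 13
```

The 1440 rows per city also match the index range 0 to 1439 in the printed hourly output.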

### 1.2 Weather variables ingestion pipeline

In [6]:
# The same cities/locations apply
LOCATIONS = [
    {"city": "Los Angeles", "country": "US", "lat": 34.0522, "lon": -118.2437},
    {"city": "Sacramento", "country": "US", "lat": 38.5816, "lon": -121.4944},
    {"city": "Detroit", "country": "US", "lat": 42.3314, "lon": -83.0458},
    {"city": "Houston", "country": "US", "lat": 29.7604, "lon": -95.3698},
    {"city": "Cleveland", "country": "US", "lat": 41.4993, "lon": -81.6944},
    {"city": "Chicago", "country": "US", "lat": 41.8781, "lon": -87.6298},
]

#### Loading and calling the Open-Meteo Weather Forecast API

In [7]:
import openmeteo_requests

import pandas as pd
import requests_cache
from retry_requests import retry

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = 3600)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)

# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://api.open-meteo.com/v1/forecast"
params = {
	# Coordinates follow the order of LOCATIONS (Sacramento's longitude is -121.4944)
	"latitude": [34.0522, 38.5816, 42.3314, 29.7604, 41.4993, 41.8781],
	"longitude": [-118.2437, -121.4944, -83.0458, -95.3698, -81.6944, -87.6298],
	"hourly": ["temperature_2m", "relative_humidity_2m", "precipitation", "wind_speed_10m", "surface_pressure"],
	"start_date": "2025-11-07",
	"end_date": "2026-01-05",
}
responses = openmeteo.weather_api(url, params=params)

# Process 6 locations
for response in responses:
	print(f"\nCoordinates: {response.Latitude()}°N {response.Longitude()}°E")
	print(f"Elevation: {response.Elevation()} m asl")
	print(f"Timezone difference to GMT+0: {response.UtcOffsetSeconds()}s")
	
	# Process hourly data. The order of variables needs to be the same as requested.
	hourly = response.Hourly()
	hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
	hourly_relative_humidity_2m = hourly.Variables(1).ValuesAsNumpy()
	hourly_precipitation = hourly.Variables(2).ValuesAsNumpy()
	hourly_wind_speed_10m = hourly.Variables(3).ValuesAsNumpy()
	hourly_surface_pressure = hourly.Variables(4).ValuesAsNumpy()
	
	hourly_data = {"date": pd.date_range(
		start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
		end =  pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
		freq = pd.Timedelta(seconds = hourly.Interval()),
		inclusive = "left"
	)}
	
	hourly_data["temperature_2m"] = hourly_temperature_2m
	hourly_data["relative_humidity_2m"] = hourly_relative_humidity_2m
	hourly_data["precipitation"] = hourly_precipitation
	hourly_data["wind_speed_10m"] = hourly_wind_speed_10m
	hourly_data["surface_pressure"] = hourly_surface_pressure
	
	hourly_dataframe = pd.DataFrame(data = hourly_data)
	print("\nHourly data\n", hourly_dataframe)
	


Coordinates: 34.06025695800781°N -118.23432922363281°E
Elevation: 91.0 m asl
Timezone difference to GMT+0: 0s

Hourly data
                           date  temperature_2m  relative_humidity_2m  \
0    2025-11-07 00:00:00+00:00       18.757500                  77.0   
1    2025-11-07 01:00:00+00:00       17.607500                  84.0   
2    2025-11-07 02:00:00+00:00       16.757500                  89.0   
3    2025-11-07 03:00:00+00:00       16.307501                  92.0   
4    2025-11-07 04:00:00+00:00       15.807501                  95.0   
...                        ...             ...                   ...   
1435 2026-01-05 19:00:00+00:00       15.610500                  80.0   
1436 2026-01-05 20:00:00+00:00       16.410500                  77.0   
1437 2026-01-05 21:00:00+00:00       16.910500                  69.0   
1438 2026-01-05 22:00:00+00:00       17.410500                  63.0   
1439 2026-01-05 23:00:00+00:00       16.510500                  78.0   

      prec

In [8]:
# Initialize a list to hold all location DataFrames
dfs = []

# Loop through each location and its API response
for location, response in zip(LOCATIONS, responses):
    
    # Convert hourly data to numpy arrays
    hourly = response.Hourly()
    hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
    hourly_relative_humidity_2m = hourly.Variables(1).ValuesAsNumpy()
    hourly_precipitation = hourly.Variables(2).ValuesAsNumpy()
    hourly_wind_speed_10m = hourly.Variables(3).ValuesAsNumpy()
    hourly_surface_pressure = hourly.Variables(4).ValuesAsNumpy()
    
    # Create a pandas DataFrame
    df = pd.DataFrame({
        "date": pd.date_range(
            start=pd.to_datetime(hourly.Time(), unit="s", utc=True),
            end=pd.to_datetime(hourly.TimeEnd(), unit="s", utc=True),
            freq=pd.Timedelta(seconds=hourly.Interval()),
            inclusive="left"
        ),
        "temperature_2m": hourly_temperature_2m,
        "relative_humidity_2m": hourly_relative_humidity_2m,
        "precipitation": hourly_precipitation,
        "wind_speed_10m": hourly_wind_speed_10m,
        "surface_pressure": hourly_surface_pressure
    })
    
    # Attach city metadata
    df["city"] = location["city"]
    df["country"] = location["country"]
    df["lat"] = location["lat"]
    df["lon"] = location["lon"]
    
    # Append to the list
    dfs.append(df)

# Concatenate all locations into a single DataFrame
weather_df = pd.concat(dfs, ignore_index=True)

# Preview the final DataFrame
weather_df.head()

|   | date | temperature_2m | relative_humidity_2m | precipitation | wind_speed_10m | surface_pressure | city | country | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-11-07 00:00:00+00:00 | 18.7575 | 77.0 | 0.0 | 6.696387 | 1006.039551 | Los Angeles | US | 34.0522 | -118.2437 |
| 1 | 2025-11-07 01:00:00+00:00 | 17.6075 | 84.0 | 0.0 | 5.116561 | 1006.096191 | Los Angeles | US | 34.0522 | -118.2437 |
| 2 | 2025-11-07 02:00:00+00:00 | 16.7575 | 89.0 | 0.0 | 2.81169 | 1005.86676 | Los Angeles | US | 34.0522 | -118.2437 |
| 3 | 2025-11-07 03:00:00+00:00 | 16.307501 | 92.0 | 0.0 | 2.16 | 1006.14679 | Los Angeles | US | 34.0522 | -118.2437 |
| 4 | 2025-11-07 04:00:00+00:00 | 15.807501 | 95.0 | 0.0 | 2.27684 | 1005.534607 | Los Angeles | US | 34.0522 | -118.2437 |
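
Both frames carry the same `city` and `date` keys, so they can be joined row by row for the weather versus pollution analysis described in the introduction. The sketch below shows one possible way to do this with small synthetic frames; the later sections of the project may combine the data differently, and `air_demo` and `weather_demo` are illustrative names only.

```python
import pandas as pd

# Two hourly timestamps in UTC, as produced by the ingestion loops
dates = pd.date_range("2025-11-07", periods=2, freq=pd.Timedelta(hours=1), tz="UTC")

# Synthetic stand-ins for air_quality_df and weather_df
air_demo = pd.DataFrame({"date": dates, "city": "Los Angeles", "pm2_5": [11.8, 12.6]})
weather_demo = pd.DataFrame(
    {"date": dates, "city": "Los Angeles", "temperature_2m": [18.7575, 17.6075]}
)

# validate="one_to_one" raises if a (city, date) pair is duplicated on either side
merged = air_demo.merge(
    weather_demo, on=["city", "date"], how="inner", validate="one_to_one"
)
print(merged)
```

An inner join keeps only hours present in both sources; the `validate` argument is a cheap guard against accidental duplicate rows.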


In [9]:
# Create the output folder if it does not exist yet
os.makedirs("../Raw_data", exist_ok=True)
# Save air quality data
air_quality_df.to_csv("../Raw_data/air_quality_df.csv", index=False)
# Save weather data
weather_df.to_csv("../Raw_data/weather_df.csv", index=False)
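
One detail to keep in mind when these files are loaded again: CSV stores the `date` column as text, so the UTC-aware timestamps need to be parsed on the way back in. A minimal sketch of the round trip, using an in-memory buffer and synthetic values instead of the real files:

```python
import io

import pandas as pd

# Synthetic frame with a UTC-aware hourly 'date' column
demo = pd.DataFrame({
    "date": pd.date_range("2025-11-07", periods=3, freq=pd.Timedelta(hours=1), tz="UTC"),
    "pm10": [12.3, 13.5, 14.4],
})

# Write to an in-memory buffer instead of ../Raw_data
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Read back and restore the timezone-aware timestamps
restored = pd.read_csv(buffer)
restored["date"] = pd.to_datetime(restored["date"], utc=True)
print(restored.dtypes)
```

The same `pd.to_datetime(..., utc=True)` step would apply after reading `air_quality_df.csv` or `weather_df.csv`.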