# Merging Air Quality and Weather Data

This notebook retrieves air quality and weather data for Madrid using the [Open-Meteo APIs](https://open-meteo.com/). The goal is to:

- Extract hourly measurements of pollutants such as PM10, PM2.5, NO₂, CO, and O₃.
- Combine them with weather indicators like temperature and wind speed.
- Create a unified dataset to support exploratory data analysis (EDA) and future modeling.

This merged DataFrame will serve as the foundation for understanding relationships between pollution levels and meteorological factors.


In [10]:
# Here we call the API to create our dataset for the EDA notebook.
# EDA notebook for Madrid air pollution and weather (last 5 days)

# Imports
import requests
import pandas as pd
from datetime import datetime

In [11]:
# Coordinates for Madrid center
latitude = 40.4168
longitude = -3.7038

# Time range
start_date = "2025-05-10"
end_date = "2025-05-15"

In [None]:
# Fetch Air Quality Data
aq_url = "https://air-quality-api.open-meteo.com/v1/air-quality"
aq_params = {
    "latitude": latitude,
    "longitude": longitude,
    "hourly": "pm10,pm2_5,carbon_monoxide,nitrogen_dioxide,ozone",
    "start": f"{start_date}T00:00",
    "end": f"{end_date}T23:00",
    "timezone": "auto"
}
aq_response = requests.get(aq_url, params=aq_params)
aq_data = aq_response.json()

#print(aq_response.status_code)
#print(aq_response.url)         # See the actual request
#print(aq_response.json())      # See raw JSON (just a bit)
#print(aq_data.keys())
#for key in aq_data["hourly"]:
#    print(key, len(aq_data["hourly"][key]))


# Convert to DataFrame
df_aq = pd.DataFrame(aq_data['hourly'])
df_aq['time'] = pd.to_datetime(df_aq['time'])


For the temperature and wind speed we'll use the forecast API since the "https://archive-api.open-meteo.com/v1/archive" is providing null values
(Even though we are not really forecasting anything)

In [17]:
# Fetch Weather Data
weather_url = "https://api.open-meteo.com/v1/forecast"
params_weather = {
    "latitude": latitude,
    "longitude": longitude,
    "start_date": start_date,
    "end_date": end_date,
    "hourly": "temperature_2m,wind_speed_10m",
    "timezone": "auto"
}
response_weather = requests.get(weather_url, params=params_weather)
data_weather = response_weather.json()

# Check keys
print(data_weather.keys())

# Convert to DataFrame
df_weather = pd.DataFrame(data_weather["hourly"])
df_weather["time"] = pd.to_datetime(df_weather["time"])

# Display info
print(df_weather.info())
print(df_weather.head())
print(df_weather.isnull().sum())


dict_keys(['latitude', 'longitude', 'generationtime_ms', 'utc_offset_seconds', 'timezone', 'timezone_abbreviation', 'elevation', 'hourly_units', 'hourly'])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   time            144 non-null    datetime64[ns]
 1   temperature_2m  144 non-null    float64       
 2   wind_speed_10m  144 non-null    float64       
dtypes: datetime64[ns](1), float64(2)
memory usage: 3.5 KB
None
                 time  temperature_2m  wind_speed_10m
0 2025-05-10 00:00:00            13.7             2.2
1 2025-05-10 01:00:00            13.2             1.5
2 2025-05-10 02:00:00            13.4             0.5
3 2025-05-10 03:00:00            13.0             0.5
4 2025-05-10 04:00:00            12.4             2.2
time              0
temperature_2m    0
wind_speed_10m    0
dtype: int64


In [18]:
df_weather.head()

Unnamed: 0,time,temperature_2m,wind_speed_10m
0,2025-05-10 00:00:00,13.7,2.2
1,2025-05-10 01:00:00,13.2,1.5
2,2025-05-10 02:00:00,13.4,0.5
3,2025-05-10 03:00:00,13.0,0.5
4,2025-05-10 04:00:00,12.4,2.2


In [16]:
df_aq.head()

Unnamed: 0,time,pm10,pm2_5,carbon_monoxide,nitrogen_dioxide,ozone
0,2025-05-15 00:00:00,14.2,12.1,165.0,31.7,53.0
1,2025-05-15 01:00:00,12.0,10.5,151.0,24.0,51.0
2,2025-05-15 02:00:00,10.8,9.0,140.0,18.7,49.0
3,2025-05-15 03:00:00,9.5,8.2,130.0,14.5,47.0
4,2025-05-15 04:00:00,8.7,7.4,128.0,11.1,47.0


In [20]:
# Merge on time (inner join)
df = pd.merge(df_aq, df_weather, on='time', how='inner')

# Optional: set datetime index
#df.set_index('time', inplace=True)

# Quick look
df.head()

Unnamed: 0,time,pm10,pm2_5,carbon_monoxide,nitrogen_dioxide,ozone,temperature_2m,wind_speed_10m
0,2025-05-15 00:00:00,14.2,12.1,165.0,31.7,53.0,13.1,3.1
1,2025-05-15 01:00:00,12.0,10.5,151.0,24.0,51.0,12.8,2.5
2,2025-05-15 02:00:00,10.8,9.0,140.0,18.7,49.0,12.6,2.5
3,2025-05-15 03:00:00,9.5,8.2,130.0,14.5,47.0,12.2,3.2
4,2025-05-15 04:00:00,8.7,7.4,128.0,11.1,47.0,11.8,2.7


In [21]:
df.to_parquet("../data/merged_recent_pollution_weather.parquet", index=False)