Air pollution directly affects health and quality of life, especially in urban environments. In this personal project, I analyzed air pollution data collected from monitoring stations in New York City, focusing on nitrogen dioxide (NO₂) as a key indicator of urban pollution, and turned it into an interactive map that shows where the highest levels are concentrated within the city.

The project combines data exploration, dataset cleaning, and geospatial visualization. The resulting map shows how this pollutant is distributed by area and highlights the zones with the highest average concentrations, where the risk to respiratory health is greater.

In [2]:
# Air pollution data analysis project
# Objective: explore air pollution data and visualize its spatial distribution
# using an interactive map

import pandas as pd
import numpy as np
import folium
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Load the dataset (semicolon-delimited CSV); adjust the path to your local copy
path = r'C:\Users\dmsue\OneDrive - Universidad Complutense de Madrid (UCM)\Universidad\4ª CARRERA\Maching learning\Modulo 2\Proyecto\contaminacion.csv'
df = pd.read_csv(path, delimiter=';')

# --- Dataset overview ---
df.head()

Unnamed: 0,Unique ID,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Join ID,Geo Place Name,Time Period,Start_Date,Data Value,Message
0,172653,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,203,Bedford Stuyvesant - Crown Heights,Annual Average 2011,12/01/2010,25.3,
1,172585,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,203,Bedford Stuyvesant - Crown Heights,Annual Average 2009,12/01/2008,26.93,
2,336637,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,204,East New York,Annual Average 2015,01/01/2015,19.09,
3,336622,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,103,Fordham - Bronx Pk,Annual Average 2015,01/01/2015,19.76,
4,172582,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,104,Pelham - Throgs Neck,Annual Average 2009,12/01/2008,22.83,
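The `delimiter=';'` argument matters because pandas splits on commas by default. The following is a minimal sketch of the difference, using a two-line in-memory CSV whose values are copied from the first row shown above; the variable names `csv_text`, `parsed`, and `unparsed` are illustrative only.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real file (one row taken from the head() output)
csv_text = "Unique ID;Name;Data Value\n172653;Nitrogen dioxide (NO2);25.3\n"

# With the semicolon delimiter, each field becomes its own column
parsed = pd.read_csv(io.StringIO(csv_text), delimiter=";")

# With the default comma separator, the whole line collapses into one column
unparsed = pd.read_csv(io.StringIO(csv_text))

print(parsed.shape, unparsed.shape)
```

Checking `df.shape` or `df.columns` right after loading is a quick way to catch a wrong delimiter.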


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16218 entries, 0 to 16217
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unique ID       16218 non-null  int64  
 1   Indicator ID    16218 non-null  int64  
 2   Name            16218 non-null  object 
 3   Measure         16218 non-null  object 
 4   Measure Info    16218 non-null  object 
 5   Geo Type Name   16218 non-null  object 
 6   Geo Join ID     16218 non-null  int64  
 7   Geo Place Name  16218 non-null  object 
 8   Time Period     16218 non-null  object 
 9   Start_Date      16218 non-null  object 
 10  Data Value      16218 non-null  float64
 11  Message         0 non-null      float64
dtypes: float64(2), int64(3), object(7)
memory usage: 1.5+ MB


In [4]:
df.describe()

Unnamed: 0,Unique ID,Indicator ID,Geo Join ID,Data Value,Message
count,16218.0,16218.0,16218.0,16218.0,0.0
mean,372730.417746,427.803613,609710.3,19.975917,
std,215507.61356,110.921411,7893388.0,21.322349,
min,121644.0,365.0,1.0,0.0,
25%,173211.25,365.0,202.0,9.05,
50%,325262.5,375.0,303.0,15.3,
75%,605270.75,386.0,404.0,26.0375,
max,799868.0,661.0,105106100.0,424.7,
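In the summary above, `Indicator ID` ranges from 365 to 661, so the `Data Value` statistics pool several different indicators and are hard to interpret on their own. One possible way to get a more meaningful summary is to group by indicator first. The sketch below uses made-up sample rows (not real measurements) that only mimic the dataset's column names.

```python
import pandas as pd

# Made-up sample rows with the same column names as the dataset (not real data)
sample = pd.DataFrame({
    "Name": ["Nitrogen dioxide (NO2)", "Nitrogen dioxide (NO2)",
             "Ozone (O3)", "Ozone (O3)"],
    "Measure Info": ["ppb", "ppb", "ppb", "ppb"],
    "Data Value": [20.0, 30.0, 28.0, 32.0],
})

# One summary row per indicator and unit instead of a single pooled summary
per_indicator = (
    sample.groupby(["Name", "Measure Info"])["Data Value"]
    .agg(["count", "mean", "min", "max"])
)

print(per_indicator)
```

On the real data, the same call would be applied to `df` instead of `sample`.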


In [5]:
df.isna().sum()

Unique ID             0
Indicator ID          0
Name                  0
Measure               0
Measure Info          0
Geo Type Name         0
Geo Join ID           0
Geo Place Name        0
Time Period           0
Start_Date            0
Data Value            0
Message           16218
dtype: int64
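Since `Message` is empty in every row, it carries no information. As an alternative to naming the column explicitly, a hedged sketch of dropping every fully empty column in one step is shown here on a small made-up frame (the `sample` and `cleaned` names are illustrative).

```python
import numpy as np
import pandas as pd

# Small stand-in frame: one populated column and one column that is entirely NaN
sample = pd.DataFrame({
    "Data Value": [25.3, 26.93, 19.09],
    "Message": [np.nan, np.nan, np.nan],
})

# how="all" removes only columns where every value is missing
cleaned = sample.dropna(axis=1, how="all")

print(list(cleaned.columns))
```

Columns with only some missing values are kept, so this does not discard partially filled data.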

In [6]:
# --- Dataset cleaning ---
df = df.drop(columns=["Message"])
# --- Convert date ---
df["Start_Date"] = pd.to_datetime(df["Start_Date"], errors="coerce")
# --- Check ---
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16218 entries, 0 to 16217
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Unique ID       16218 non-null  int64         
 1   Indicator ID    16218 non-null  int64         
 2   Name            16218 non-null  object        
 3   Measure         16218 non-null  object        
 4   Measure Info    16218 non-null  object        
 5   Geo Type Name   16218 non-null  object        
 6   Geo Join ID     16218 non-null  int64         
 7   Geo Place Name  16218 non-null  object        
 8   Time Period     16218 non-null  object        
 9   Start_Date      16218 non-null  datetime64[ns]
 10  Data Value      16218 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(3), object(6)
memory usage: 1.4+ MB
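With `errors="coerce"`, unparseable dates silently become `NaT`, and without an explicit format pandas has to guess the day/month order. The sketch below assumes the dates are month-first (`%m/%d/%Y`), which the source does not state, and counts how many values failed to parse. The sample strings include one deliberately invalid entry.

```python
import pandas as pd

# Two dates as they appear in the dataset, plus one deliberately invalid value
dates = pd.Series(["12/01/2010", "01/01/2015", "not a date"])

# Assumption: month/day/year ordering; change the format if the data is day-first
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# Count values that could not be parsed instead of letting them pass unnoticed
n_failed = int(parsed.isna().sum())

print(parsed)
print("unparsed values:", n_failed)
```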


In [7]:
# --- Columns ---
df.columns

Index(['Unique ID', 'Indicator ID', 'Name', 'Measure', 'Measure Info',
       'Geo Type Name', 'Geo Join ID', 'Geo Place Name', 'Time Period',
       'Start_Date', 'Data Value'],
      dtype='object')

In [8]:
df["Name"].unique()

array(['Nitrogen dioxide (NO2)', 'Fine particles (PM 2.5)', 'Ozone (O3)',
       'Asthma emergency department visits due to PM2.5',
       'Annual vehicle miles traveled',
       'Asthma hospitalizations due to Ozone',
       'Respiratory hospitalizations due to PM2.5 (age 20+)',
       'Boiler Emissions- Total SO2 Emissions',
       'Cardiovascular hospitalizations due to PM2.5 (age 40+)',
       'Boiler Emissions- Total PM2.5 Emissions',
       'Boiler Emissions- Total NOx Emissions',
       'Annual vehicle miles travelled (cars)',
       'Annual vehicle miles travelled (trucks)',
       'Cardiac and respiratory deaths due to Ozone',
       'Asthma emergency departments visits due to Ozone',
       'Outdoor Air Toxics - Formaldehyde',
       'Outdoor Air Toxics - Benzene', 'Deaths due to PM2.5'],
      dtype=object)

In [9]:
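# Keep only the NO2 rows; .copy() avoids chained-assignment warnings later
df_no2 = df[df["Name"] == "Nitrogen dioxide (NO2)"].copy()
# Geographic levels available across the whole dataset (all indicators)
df["Geo Type Name"].value_counts()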

Geo Type Name
UHF42       6300
CD          5900
UHF34       3128
Borough      740
Citywide     150
Name: count, dtype: int64
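The counts above cover all indicators in the dataset. To see how many rows exist per geographic level for NO2 alone, the same call can be applied to the filtered subset. A minimal sketch on made-up rows (the `sample`, `no2_only`, and `no2_counts` names are illustrative):

```python
import pandas as pd

# Made-up rows mixing two indicators and two geographic levels
sample = pd.DataFrame({
    "Name": ["Nitrogen dioxide (NO2)", "Nitrogen dioxide (NO2)",
             "Ozone (O3)", "Nitrogen dioxide (NO2)"],
    "Geo Type Name": ["Borough", "UHF42", "Borough", "Borough"],
})

# Filter first, then count, so other indicators do not inflate the totals
no2_only = sample[sample["Name"] == "Nitrogen dioxide (NO2)"]
no2_counts = no2_only["Geo Type Name"].value_counts()

print(no2_counts)
```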

In [13]:
df_no2 = df_no2[df_no2["Geo Type Name"] == "Borough"]

# Group by "Geo Place Name" and calculate the average
df_map = df_no2.groupby("Geo Place Name")["Data Value"].mean().reset_index()
df_map.rename(columns={"Data Value": "Average NO2"}, inplace=True)

# --- Initialize geolocation ---
geolocator = Nominatim(user_agent='myapplication')
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# --- Function to obtain lat/lon ---
def get_lat_lon(place):
    try:
        location = geocode(f"{place}, New York")
        if location:
            return pd.Series([location.latitude, location.longitude])
        else:
            return pd.Series([None, None])
    except Exception:
        # Treat any geocoding failure as a missing location
        return pd.Series([None, None])

# --- Apply to dataframe ---
df_map[['lat','lon']] = df_map['Geo Place Name'].apply(get_lat_lon)

# --- Create basemap ---
m = folium.Map(location=[40.7, -74], zoom_start=10)

# --- Function to color according to value ---
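def get_color(value, min_val, max_val):
    # Green-to-red gradient: green at the minimum, red at the maximum
    # If all values are equal, use the midpoint to avoid dividing by zero
    ratio = (value - min_val) / (max_val - min_val) if max_val > min_val else 0.5
    r = int(255 * ratio)
    g = int(255 * (1 - ratio))
    return f'#{r:02x}{g:02x}00'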

min_val = df_map['Average NO2'].min()
max_val = df_map['Average NO2'].max()

# --- Add markers with color and size according to concentration. ---
for _, row in df_map.iterrows():
    if pd.notnull(row['lat']) and pd.notnull(row['lon']):
        folium.CircleMarker(
            location=[row['lat'], row['lon']],
            radius=7 + 3 * ((row['Average NO2'] - min_val)/(max_val - min_val)),  # size proportional to concentration
            popup=f"{row['Geo Place Name']}: {row['Average NO2']:.2f}",
            color=get_color(row['Average NO2'], min_val, max_val),
            fill=True,
            fill_opacity=0.7
        ).add_to(m)

# --- Legend using the same green-to-red scale as the markers ---
from branca.colormap import LinearColormap
colormap = LinearColormap(['#00ff00', '#ff0000'], vmin=min_val, vmax=max_val)
colormap.caption = 'Average NO2 Level'
colormap.add_to(m)

# --- Show map ---
m
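Scaling by `max_val - min_val` breaks down when every area has the same average, because the denominator becomes zero. The following is a minimal sketch of a guarded version of the same green-to-red mapping; the helper names `normalize` and `green_to_red` are hypothetical and not part of the notebook.

```python
def normalize(value, min_val, max_val):
    """Scale value into [0, 1]; fall back to 0.5 when the range is empty."""
    if max_val == min_val:
        return 0.5
    ratio = (value - min_val) / (max_val - min_val)
    # Clamp so values outside the range still yield a valid color
    return min(1.0, max(0.0, ratio))


def green_to_red(value, min_val, max_val):
    """Return a hex color: green at the minimum, red at the maximum."""
    ratio = normalize(value, min_val, max_val)
    r = int(255 * ratio)
    g = int(255 * (1 - ratio))
    return f"#{r:02x}{g:02x}00"


print(green_to_red(10, 10, 20), green_to_red(20, 10, 20), green_to_red(5, 5, 5))
```

The same `normalize` helper could also drive the marker radius, for example `7 + 3 * normalize(value, min_val, max_val)`, so that color and size stay consistent.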