## K-Means Model and Analysis

This notebook applies K-Means clustering to air quality data to identify groups of observations with similar characteristics, such as temperature, wind speed, precipitation, and fire-related features. 

The goal is to explore whether the data naturally forms distinct subgroups that might correspond to different environmental conditions or pollution patterns.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge, Lasso
import plotly.express as px
import matplotlib.pyplot as plt
import os

In [None]:
script_dir = os.getcwd()
df = pd.read_csv(f'{script_dir}/air_quality_weather_fires.csv')

# Recoding weather codes into broader/more identifiable categories
df['weather_code'] = df['weather_code'].replace(['1', 'Clear sky', 'Mainly clear'], 'clear')
df = df[df.weather_code != '2'].copy()  # copy so later column assignments do not act on a view
df['weather_code'] = df['weather_code'].replace(
    ['3', 'Overcast', 'Partly cloudy'], 'cloudy')
df['weather_code'] = df['weather_code'].replace(
    ['51', '53', '55', '61', '63', '65', 
     'Dense drizzle', 'Heavy rain', 'Light drizzle', 'Moderate drizzle', 'Moderate rain', 'Slight rain'], 'rainy')
df['weather_code'] = df['weather_code'].replace(
    ['71', '73', '75', 'Heavy snow fall', 'Moderate snow fall', 'Slight snow fall'], 'snowy')

# Defining numeric variables for clustering and prediction
X_cols = ['latitude', 'longitude', 'temperature_2m_mean', 
          'wind_speed_10m_mean', 'precipitation_sum', 
          'fires_within_50km', 'fires_within_100km', 
          'distance_to_fire_km', 'fire_brightness']

# Converting columns to numeric and dropping rows with missing values
df[X_cols] = df[X_cols].apply(pd.to_numeric, errors='coerce')
df.dropna(subset=X_cols + ['PM25'], inplace=True)

X_data = df[X_cols]
y_data = df['PM25'].values

In [19]:
# K-means clustering pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(random_state=42))
])

In [20]:
# Elbow method
K_values = list(range(1, 11))
wcss = []

for k in K_values:
    pipe.set_params(kmeans__n_clusters=k)
    pipe.fit(X_data)
    wcss.append(pipe['kmeans'].inertia_)

# Visualizing wcss versus k
fig = px.line(
    x=K_values,
    y=wcss,
    markers=True,
    title="Elbow Method Pipeline: StandardScaler + KMeans",
    labels={"x": "Number of Clusters (K)", "y": "Within-cluster Sum of Squares"}
)
fig.update_layout(height=500, width=700)
fig.show()

The plot of WCSS (within-cluster sum of squares) against the number of clusters shows a steep decrease from K=1 to K=3, followed by a slower, more gradual decline for higher K. This “elbow” suggests that a value between K=2 and K=5 captures most of the structure in the data. Increasing K beyond 5 yields little further reduction in WCSS and only adds complexity to the model.
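The elbow can also be read numerically by looking at how much WCSS falls with each additional cluster. The sketch below is a minimal illustration: `relative_wcss_drop` is a hypothetical helper, and `example_wcss` holds made-up values, not the WCSS computed in this notebook.

```python
import numpy as np

def relative_wcss_drop(wcss):
    """Fractional decrease in WCSS when moving from K to K+1."""
    wcss = np.asarray(wcss, dtype=float)
    return (wcss[:-1] - wcss[1:]) / wcss[:-1]

# Illustrative values only (assumption), not the notebook's actual WCSS.
example_wcss = [1000.0, 600.0, 400.0, 360.0, 330.0]
drops = relative_wcss_drop(example_wcss)

for k, d in enumerate(drops, start=1):
    print(f"K={k} -> K={k + 1}: {d:.1%} decrease")
```

With the notebook's own results, the same helper could be called as `relative_wcss_drop(wcss)`; the point at which the decreases become small is one way to locate the elbow.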

In [21]:
# Silhouette scores to evaluate cluster quality
sil_scores = []
for k in range(2, 11):
    pipe.set_params(kmeans__n_clusters=k)
    pipe.fit(X_data)
    labels = pipe['kmeans'].labels_
    sil_scores.append(silhouette_score(X_data, labels))

fig = px.line(
    x=list(range(2, 11)),
    y=sil_scores,
    markers=True,
    title="Silhouette Scores for different K",
    labels={"x": "Number of Clusters (K)", "y": "Silhouette Score"}
)
fig.update_layout(height=500, width=700)
fig.show()

For K=3 through K=10, the silhouette scores are below 0.25, which indicates weakly structured, overlapping clusters. K=2 has a silhouette score of 0.88, which indicates well-separated clusters.
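One caveat about the loop above: it passes the unscaled `X_data` to `silhouette_score`, while KMeans was fitted on standardized features, so the scores are measured in a different space than the one the clusters were formed in. The sketch below shows one way to score in the scaled space instead. It runs on synthetic data, and `scaled_silhouette`, `X_demo`, and `demo_pipe` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def scaled_silhouette(fitted_pipe, X):
    """Silhouette score measured in the standardized space KMeans was fitted in."""
    X_scaled = fitted_pipe["scaler"].transform(X)
    return silhouette_score(X_scaled, fitted_pipe["kmeans"].labels_)

# Synthetic stand-in for X_data (assumption): two groups, features on very different scales.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=[1.0, 100.0], size=(100, 2))
group_b = rng.normal(loc=[10.0, 0.0], scale=[1.0, 100.0], size=(100, 2))
X_demo = np.vstack([group_a, group_b])

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=42)),
])
demo_pipe.fit(X_demo)

score_scaled = scaled_silhouette(demo_pipe, X_demo)
score_raw = silhouette_score(X_demo, demo_pipe["kmeans"].labels_)
print(f"scaled space: {score_scaled:.3f}, raw space: {score_raw:.3f}")
```

Applied to this notebook, the call would be `scaled_silhouette(pipe, X_data)` after each fit. The resulting scores may differ from those reported above.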

In [7]:
# Fit the final k-means and cluster assignment
# Choosing a number of clusters based on elbow/silhouette
final_k = 2
pipe.set_params(kmeans__n_clusters=final_k)
pipe.fit(X_data)

# Assigning cluster labels to the dataframe
df['cluster'] = pipe['kmeans'].labels_

# Print cluster sizes and scaled cluster centers
print("\nCluster sizes:")
print(df['cluster'].value_counts())

print("\nCluster centers scaled features:")
cluster_centers_df = pd.DataFrame(pipe['kmeans'].cluster_centers_, columns=X_cols)
print(cluster_centers_df)


Cluster sizes:
cluster
0    19512
1        2
Name: count, dtype: int64

Cluster centers scaled features:
   relative_humidity_2m_mean  et0_fao_evapotranspiration  latitude  longitude  \
0                   0.000029                   -0.000078  0.000085   0.000121   
1                  -0.283609                    0.757936 -0.832320  -1.182579   

   temperature_2m_mean  wind_speed_10m_mean  precipitation_sum  \
0            -0.000086             0.000083           0.000040   
1             0.841281            -0.811702          -0.389057   

   fires_within_50km  fires_within_100km  distance_to_fire_km  fire_brightness  
0          -0.008588           -0.005011             0.000037         0.000051  
1          83.780167           48.888815            -0.362497        -0.496401  


Based on the silhouette scores, K=2 was selected; however, the resulting clustering is highly unbalanced. Cluster 0 contains 19,512 points, nearly the entire dataset, while Cluster 1 contains only 2 points. These two points are extreme outliers: their scaled center lies roughly 84 standard deviations above the mean for fires_within_50km and 49 for fires_within_100km.
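An imbalance like this can be flagged automatically before interpreting the clusters. The following is a minimal sketch: `cluster_fractions` and `tiny_clusters` are hypothetical helpers, and the 1% threshold is an arbitrary assumption. The labels are rebuilt from the cluster sizes reported above.

```python
import numpy as np

def cluster_fractions(labels):
    """Map each cluster label to its share of all observations."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): c / counts.sum() for v, c in zip(values, counts)}

def tiny_clusters(labels, min_fraction=0.01):
    """Labels of clusters holding less than min_fraction of the data."""
    return [k for k, frac in cluster_fractions(labels).items() if frac < min_fraction]

# Cluster sizes reported above: 19,512 points in cluster 0 and 2 in cluster 1.
labels = np.array([0] * 19512 + [1] * 2)

print(cluster_fractions(labels))
print(tiny_clusters(labels))
```

In the notebook itself, `df['cluster'].to_numpy()` could be passed in place of the reconstructed `labels`.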

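Overall, these numbers indicate that K-Means with K=2 does not find well-separated, representative clusters in this dataset: the algorithm groups the majority of the data together and isolates a few extreme cases. The high silhouette score at K=2 therefore likely reflects how far those two points lie from everything else rather than meaningful structure.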

In other words, most observations fall within average conditions, and the K-Means model does not reveal meaningful subgroups or identifiable patterns in the data.
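To see what makes an isolated cluster unusual, it can help to map the scaled centers back to the original units of each feature. The sketch below assumes a fitted pipeline with steps named `scaler` and `kmeans`, as in this notebook. `centers_in_original_units` is a hypothetical helper, and the data and column names are synthetic, for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def centers_in_original_units(fitted_pipe, columns):
    """Undo the StandardScaler so each center reads in the features' own units."""
    scaled_centers = fitted_pipe["kmeans"].cluster_centers_
    original = fitted_pipe["scaler"].inverse_transform(scaled_centers)
    return pd.DataFrame(original, columns=columns)

# Synthetic data with hypothetical column names (assumption), not the air quality dataset.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "temperature": rng.normal(15.0, 5.0, size=300),
    "wind_speed": rng.normal(10.0, 3.0, size=300),
})

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=42)),
])
demo_pipe.fit(demo)

centers = centers_in_original_units(demo_pipe, demo.columns)
print(centers)
```

For this notebook the equivalent call would be `centers_in_original_units(pipe, X_cols)`, which would express values such as the fire counts of Cluster 1 as actual counts rather than standard deviations.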

In [8]:
# Average pm25 summary per cluster
print("\nAverage PM25 per cluster:")
print(df.groupby('cluster')['PM25'].mean())


Average PM25 per cluster:
cluster
0     7.418345
1    12.375000
Name: PM25, dtype: float64


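The difference in mean PM2.5 between the clusters (12.38 versus 7.42) suggests that cluster 1 captures higher-pollution events, while cluster 0 captures typical air quality. Because cluster 1 contains only two observations, this comparison should be read with caution.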

The regression models indicate that higher wind speed, higher latitude, greater precipitation, and longer distances from fires are associated with lower PM2.5.

Higher temperature and more nearby fires (within 100 km) are associated with higher PM2.5. Ridge keeps all predictors, while Lasso removes less important ones, such as fires within 50 km, highlighting the most influential variables.

Overall, the models are robust and consistent across training, test, and validation sets, but they have limited predictive power, suggesting that other factors not included in the dataset may also affect PM2.5 levels.
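The contrast between Ridge and Lasso described above can be illustrated on synthetic data. This is a sketch under stated assumptions, not the models fitted in this notebook: the data are randomly generated so that only the first two of four predictors matter, and the `alpha` values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): y depends only on the first two predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize so the penalty treats every predictor on the same scale.
X_scaled = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=1.0).fit(X_scaled, y)   # L2 penalty: shrinks coefficients, keeps all
lasso = Lasso(alpha=0.5).fit(X_scaled, y)   # L1 penalty: can set coefficients exactly to zero

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Predictors kept by Lasso:", np.flatnonzero(lasso.coef_))
```

Ridge's L2 penalty shrinks coefficients toward zero without eliminating them, whereas Lasso's L1 penalty can drive the coefficients of weak predictors exactly to zero, which is why it acts as a form of variable selection.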