# Climate Action — Country Clustering (Colab-ready)

This notebook clusters countries by emissions and energy indicators to inform SDG 13 (Climate Action). It downloads the Our World in Data CO₂ dataset, preprocesses the selected features, runs KMeans clustering, and visualizes the clusters with PCA. Optional cells add a time-series growth feature and a map of cluster membership.

**How to run:** Open this notebook in Google Colab and run the cells sequentially. The optional map cell installs extra packages (geopandas, folium); uncomment it only if you want maps.

In [None]:
# CELL: Imports & settings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
%matplotlib inline
plt.rcParams['figure.figsize'] = (9,6)
print('Imports ready')

In [None]:
# CELL: Download Our World in Data CO2 dataset
url = 'https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv'
df = pd.read_csv(url)
print('Loaded rows,cols:', df.shape)
print('Available years:', int(df['year'].min()), 'to', int(df['year'].max()))

In [None]:
# CELL: Select year & features
latest_year = int(df['year'].max())
print('Using year:', latest_year)
year = latest_year

df_year = df[df['year']==year].copy()

candidates = [
    'co2', 'co2_per_capita', 'ghg_excluding_lucf_per_capita',
    'coal_co2_per_capita', 'gas_co2_per_capita', 'oil_co2_per_capita',
    'share_global_co2', 'share_global_coal_co2', 'renewables_share_energy'
]
features = [c for c in candidates if c in df_year.columns]
print('Using features:', features)

# Keep only rows that have an ISO code and a value for every selected feature
cols = ['country','iso_code'] + features
df_feat = df_year[cols].dropna()
print('Countries with full feature data:', df_feat.shape[0])
X = df_feat[features].values
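
The latest year in the file may have sparse coverage for some indicators, in which case `dropna()` discards most countries. A minimal sketch of checking coverage per year before fixing `year`, run on a small synthetic frame standing in for `df`. The helper `complete_rows_per_year` and the threshold `min_countries` are hypothetical names introduced for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def complete_rows_per_year(frame, feature_cols):
    """Count rows per year that have no missing value in feature_cols."""
    complete = frame[feature_cols].notna().all(axis=1)
    return complete.groupby(frame['year']).sum().astype(int)

# Synthetic stand-in for the OWID frame (values are made up for illustration).
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'year': [2021, 2021, 2021, 2022, 2022, 2022],
    'co2': [1.0, 2.0, 3.0, 1.1, 2.1, 3.1],
    'co2_per_capita': [0.5, 0.6, 0.7, 0.5, np.nan, np.nan],
})

coverage = complete_rows_per_year(toy, ['co2', 'co2_per_capita'])
print(coverage)

# Most recent year that keeps at least `min_countries` complete rows.
min_countries = 3  # hypothetical threshold
best_year = int(coverage[coverage >= min_countries].index.max())
print('Best-covered recent year:', best_year)
```

In the notebook, the same check could be applied to `df` and `features`, with `year` set to the result instead of `latest_year`.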

In [None]:
# CELL: Preprocess & PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA is already imported above. The 2-D projection is used for plotting only;
# KMeans below is fit on the full scaled feature matrix.
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)
print('Explained variance ratio (2 components):', pca.explained_variance_ratio_)
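
The explained variance ratio says how much of the spread the 2-D plot retains, but not which indicators drive each axis. A minimal sketch of reading the loadings from `components_`, using synthetic data in place of `X_scaled`; the `demo_*` names and the generated values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: 50 "countries", 3 indicators.
demo_features = ['co2_per_capita', 'coal_co2_per_capita', 'share_global_co2']
base = rng.normal(size=50)
X_demo = np.column_stack([
    base + 0.1 * rng.normal(size=50),   # strongly tied to `base`
    base + 0.1 * rng.normal(size=50),   # strongly tied to `base`
    rng.normal(size=50),                # independent of `base`
])

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Rows of components_ are principal axes; columns line up with the input features.
loadings = pd.DataFrame(pca_demo.components_.T, index=demo_features, columns=['PC1', 'PC2'])
print(loadings.round(3))
print('Cumulative explained variance:', pca_demo.explained_variance_ratio_.cumsum().round(3))
```

A feature with a large absolute loading on PC1 moves points mostly left and right in the scatter plot; the sign of a loading is arbitrary and can flip between runs or library versions.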

In [None]:
# CELL: Elbow & Silhouette diagnostics
inertia = []
silhouette = []
K = range(2,9)
for k in K:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    inertia.append(km.inertia_)
    silhouette.append(silhouette_score(X_scaled, labels))

plt.figure()
plt.subplot(1,2,1)
plt.plot(K, inertia, '-o')
plt.title('Elbow: Inertia vs k')
plt.xlabel('k')
plt.ylabel('Inertia')

plt.subplot(1,2,2)
plt.plot(K, silhouette, '-o')
plt.title('Silhouette score vs k')
plt.xlabel('k')
plt.ylabel('Silhouette')
plt.tight_layout()
plt.show()
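
The two diagnostics are read by eye in the plots; the silhouette curve can also be summarized numerically. A minimal sketch on synthetic blobs standing in for `X_scaled`, assuming "highest mean silhouette" is an acceptable selection rule. It is a heuristic and does not replace domain judgment about what `k` is interpretable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups, standing in for X_scaled.
X_blobs, _ = make_blobs(n_samples=150, centers=[[-6, -6], [0, 6], [6, -6]],
                        cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_blobs)
    scores[k] = silhouette_score(X_blobs, labels_k)

# Pick the k whose clustering has the highest mean silhouette.
best_k = max(scores, key=scores.get)
print({k: round(s, 3) for k, s in scores.items()})
print('k with highest silhouette:', best_k)
```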


In [None]:
# CELL: Fit final KMeans and visualize clusters
k_final = 4
km = KMeans(n_clusters=k_final, random_state=42, n_init=50)
labels = km.fit_predict(X_scaled)

df_feat['cluster'] = labels

plt.figure(figsize=(9,6))
for cl in sorted(df_feat['cluster'].unique()):
    idx = (df_feat['cluster'] == cl).to_numpy()  # plain boolean mask for indexing the NumPy array
    plt.scatter(X_pca[idx,0], X_pca[idx,1], label=f'Cluster {cl}', alpha=0.7)

plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title(f'Country clusters (k={k_final})')
plt.legend()
plt.show()

# Print sample countries
sort_col = 'co2_per_capita' if 'co2_per_capita' in df_feat.columns else 'co2'
for cl in sorted(df_feat['cluster'].unique()):
    print('\nCluster', cl, 'sample countries:')
    print(df_feat[df_feat['cluster']==cl].sort_values(by=sort_col, ascending=False)['country'].head(8).to_list())


In [None]:
# CELL: Cluster summary statistics
cluster_means = df_feat.groupby('cluster')[features].mean().round(4)
cluster_means
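
Raw means mix units (tonnes, tonnes per person, percent shares), which makes it hard to see which indicators separate the clusters. Because KMeans is fit on standardized data, its centroids are already expressed in z-score units and can be tabulated next to cluster sizes. A minimal sketch on synthetic data; the `*_demo` names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
demo_features = ['co2_per_capita', 'share_global_co2']
# Two synthetic groups: 20 low emitters and 10 high emitters (made-up values).
low = rng.normal(loc=[1.0, 0.1], scale=0.1, size=(20, 2))
high = rng.normal(loc=[15.0, 5.0], scale=0.1, size=(10, 2))
X_demo = np.vstack([low, high])

X_demo_scaled = StandardScaler().fit_transform(X_demo)
km_demo = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_demo_scaled)

# Centroids are already in z-score units because KMeans was fit on scaled data.
centroids_z = pd.DataFrame(km_demo.cluster_centers_, columns=demo_features).round(2)
centroids_z['n_countries'] = np.bincount(km_demo.labels_, minlength=2)
print(centroids_z)
```

In the notebook, the equivalent table would use `km.cluster_centers_`, `features`, and `labels`. Very small clusters in that table are worth inspecting by name, since a single extreme emitter can form a cluster of its own.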

In [None]:
# CELL: Save clustered results
out_csv = f'country_clusters_{year}.csv'
df_feat.to_csv(out_csv, index=False)
print('Saved', out_csv)

## Extended: Time-series features & clustering
The optional cell below derives a time-series feature, the 5-year relative change in `co2_per_capita`, and joins it to `df_feat` so that trends can be included in a later clustering run. It is lightweight and avoids heavy DTW installs.

In [None]:
# CELL: Time-series derived features (growth rates)
# Compute the 5-year relative change in co2_per_capita where both endpoint years are available
years = [year-5, year]
df_ts = df[df['country'].isin(df_feat['country']) & df['year'].isin(years)].copy()

# pivot
pv = df_ts.pivot(index='country', columns='year', values='co2_per_capita')
if (year-5) in pv.columns:
    pv['growth_5yr'] = (pv[year] - pv[year-5]) / (pv[year-5].replace(0, np.nan))
else:
    pv['growth_5yr'] = np.nan

pv_feat = pv[['growth_5yr']].dropna()
print('Countries with 5yr growth:', pv_feat.shape[0])

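# Join growth into df_feat (drop any earlier copy first so the cell can be re-run safely)
df_feat = df_feat.drop(columns='growth_5yr', errors='ignore')
df_feat = df_feat.merge(pv_feat[['growth_5yr']], left_on='country', right_index=True, how='left')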
print('df_feat now has columns:', df_feat.columns.tolist())
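
The cell above only adds `growth_5yr` as a column; it does not re-run the clustering. A minimal sketch of clustering on level plus trend, using a small synthetic frame standing in for `df_feat` after the merge. The names `toy_feat`, `trend_features`, and `trend_cluster`, and the choice of two clusters, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df_feat after the merge (values are made up).
toy_feat = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E', 'F'],
    'co2_per_capita': [10.0, 11.0, 9.5, 1.0, 1.2, 0.9],
    'growth_5yr': [-0.10, -0.12, -0.08, 0.30, 0.35, np.nan],
})

trend_features = ['co2_per_capita', 'growth_5yr']
# KMeans cannot handle missing values, so drop countries without a growth value.
toy_trend = toy_feat.dropna(subset=trend_features).copy()

X_trend = StandardScaler().fit_transform(toy_trend[trend_features])
toy_trend['trend_cluster'] = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_trend)
print(toy_trend[['country', 'trend_cluster']])
```

Relative change can be extreme for countries whose starting value is close to zero, so it may be worth inspecting or clipping `growth_5yr` before scaling.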


## Optional: Map visualization (requires extra packages)
The following cell installs geopandas & folium in Colab and produces a choropleth of cluster membership. Run only if you are in Colab and accept the package installs.

In [None]:
# CELL: Optional map (Colab users) - installs geopandas & folium
# Uncomment and run in Colab if you want maps. This cell may take time to install packages.
# !pip install geopandas folium pycountry rtree

# Example mapping code (uncomment when packages are installed):
# import geopandas as gpd
# import folium
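# # Note: gpd.datasets was deprecated in geopandas 0.14 and removed in 1.0. With geopandas >= 1.0,
# # pass gpd.read_file the path of a Natural Earth countries file you have downloaded instead.
# world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# # ISO code alignment: OWID stores ISO alpha-3 codes in 'iso_code'; naturalearth uses 'iso_a3'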
# merged = world.merge(df_feat, left_on='iso_a3', right_on='iso_code', how='left')
# m = folium.Map(location=[10,0], zoom_start=2)
# folium.Choropleth(
#     geo_data=merged.__geo_interface__,
#     name='choropleth',
#     data=merged,
#     columns=['iso_a3', 'cluster'],
#     key_on='feature.properties.iso_a3',
#     fill_opacity=0.7,
#     line_opacity=0.2,
#     legend_name='Cluster'
# ).add_to(m)
# m

## Next steps and notes
- Validate clusters against GDP per capita, population, and NDC commitments.
- Consider time-series clustering with dynamic time warping (DTW) for richer trend grouping.
- Use the saved CSV `country_clusters_<year>.csv` for creating maps or dashboards.
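
If the DTW idea in the list above is explored without installing a dedicated package, the distance itself is short to write in NumPy. The sketch below is the textbook dynamic-programming formulation with an absolute-difference cost, not an optimized or windowed variant; `dtw_distance` and the example series are hypothetical and not part of the notebook.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance with absolute-difference cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

rising = [1, 2, 3, 4, 5]
rising_late = [1, 1, 2, 3, 4, 5]   # same shape, delayed by one step
falling = [5, 4, 3, 2, 1]

print(dtw_distance(rising, rising_late))
print(dtw_distance(rising, falling))
```

A pairwise matrix of such distances between per-country `co2_per_capita` series could then feed a clustering method that accepts precomputed distances. KMeans does not, so this would mean switching algorithms, and series would likely need normalizing first so that shape rather than level drives the grouping.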