
# Global Energy & Emissions Analysis 🌍⚡

**Project goal:** analyze global energy consumption, renewable adoption, and CO2 emissions using Our World in Data (OWID) datasets.

**Highlights:**
- Clean and prepare OWID energy dataset
- Explore trends and country comparisons
- Visualize correlations and build interpretable ML models (forecasting + clustering)
- Produce a CV-ready project summary for academic applications



## Before you run
1. Download the full OWID energy dataset and place it in the same folder as this notebook with the filename **`owid-energy-data.csv`**.  
   Raw URL: `https://raw.githubusercontent.com/owid/energy-data/master/owid-energy-data.csv` (right-click -> Save As...).  
2. Also keep the codebook file as **`owid-energy-codebook.csv`** (optional but useful).  
3. Then run the cells below in order.
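
Step 1 can also be scripted. The following is a minimal sketch, assuming the notebook environment has network access; the helper name `ensure_dataset` is hypothetical and not part of this notebook. It downloads the file only when it is not already present.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Raw URL taken from step 1 above.
OWID_ENERGY_URL = "https://raw.githubusercontent.com/owid/energy-data/master/owid-energy-data.csv"


def ensure_dataset(path="owid-energy-data.csv", url=OWID_ENERGY_URL):
    """Return the local dataset path, downloading the file first if it is missing.

    Hypothetical helper for illustration only.
    """
    target = Path(path)
    if not target.exists():
        # urlretrieve writes the response body to the given filename.
        urlretrieve(url, str(target))
    return target
```

Calling `ensure_dataset()` before the load cell returns a `Path` that `pd.read_csv` accepts directly.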


In [None]:

# 1. Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Plot settings (render minus signs as ASCII hyphens so they display in any font)
plt.rcParams['axes.unicode_minus'] = False
sns.set(style='whitegrid')


In [None]:

# 2. Load datasets
energy_path = 'owid-energy-data.csv'   # make sure this file exists in the notebook folder
codebook_path = 'owid-energy-codebook.csv'  # optional

energy_df = pd.read_csv(energy_path)
try:
    codebook_df = pd.read_csv(codebook_path)
    print('Codebook loaded.')
except Exception:
    codebook_df = None
    print('Codebook not found or could not be loaded. It is optional.')

print('Energy dataset shape:', energy_df.shape)


In [None]:

# 3. Inspect column names to find CO2-related fields
all_cols = energy_df.columns.tolist()
co2_candidates_all = [c for c in all_cols if 'co2' in c.lower()]
print('Found CO2-related columns (first 30):', co2_candidates_all[:30])
print('\nSome common energy columns (first 30):', [c for c in all_cols if 'energy' in c.lower()][:30])


In [None]:

# 4. Select relevant indicators (auto-detect best CO2 column)
preferred_co2_order = ['co2_per_capita', 'consumption_co2_per_capita', 'consumption_co2', 'co2']

co2_col = None
for cand in preferred_co2_order:
    if cand in energy_df.columns:
        co2_col = cand
        break

# fallback: any column containing 'co2'
if co2_col is None:
    co2_cols = [c for c in energy_df.columns if 'co2' in c.lower()]
    if co2_cols:
        co2_col = co2_cols[0]

# Build list of columns to keep
cols_to_keep = ['country', 'year', 'primary_energy_consumption', 'renewables_share_energy']
if co2_col:
    cols_to_keep.append(co2_col)

print('Using CO2 column:', co2_col)
print('Columns to keep:', cols_to_keep)

# Subset and rename
df = energy_df[[c for c in cols_to_keep if c in energy_df.columns]].copy()
rename_map = {'country':'Country', 'year':'Year', 'primary_energy_consumption':'Energy_Use', 'renewables_share_energy':'Renewable_Pct'}
if co2_col:
    rename_map[co2_col] = 'CO2'
df.rename(columns=rename_map, inplace=True)

# Convert numeric columns and clean
numeric_cols = [c for c in ['Energy_Use','Renewable_Pct','CO2'] if c in df.columns]
for c in numeric_cols:
    df[c] = pd.to_numeric(df[c], errors='coerce')

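# Forward/backward fill per country (preserves index shape).
# Note: gaps are imputed from neighbouring years, so filled values are estimates, not observations.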
if 'Country' in df.columns:
    df[numeric_cols] = df.groupby('Country')[numeric_cols].transform(lambda g: g.ffill().bfill())

# Coerce year to integer where possible
df['Year'] = pd.to_numeric(df['Year'], errors='coerce').astype('Int64')

print('\nPreview of cleaned data:') 
display(df.head())
print('\nMissing values per column:') 
print(df.isna().sum())



## 5. Exploratory Data Analysis (EDA)

We will:
- Show top CO2 emitters in the latest available year (if CO2 exists)
- Plot Energy Use vs CO2 colored by Renewable share (if CO2 exists)
- Plot renewable share trend for a sample country (Germany if present)
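
One caveat for the top-emitters ranking: OWID's `country` column also contains aggregate entities such as `World` and continents, which would otherwise dominate the list. Below is a minimal sketch of one way to drop them. It assumes the raw file's `iso_code` column is kept (this notebook's `cols_to_keep` does not include it) and that real countries carry a three-letter code while aggregates have a missing or `OWID_`-prefixed one. The helper name `drop_aggregates` and the toy values are illustrative only.

```python
import pandas as pd


def drop_aggregates(frame, iso_col="iso_code"):
    """Keep only rows whose ISO code is exactly three characters long.

    Assumption: aggregates have a missing or longer (e.g. 'OWID_WRL') code.
    """
    iso = frame[iso_col].fillna("")
    return frame[iso.str.len() == 3]


# Made-up numbers, only to show the effect of the filter.
toy = pd.DataFrame({
    "country": ["World", "Germany", "Asia", "India"],
    "iso_code": ["OWID_WRL", "DEU", None, "IND"],
    "co2": [100.0, 2.0, 50.0, 7.0],
})

countries_only = drop_aggregates(toy)
ranking = countries_only.sort_values("co2", ascending=False)["country"].tolist()
print(ranking)  # → ['India', 'Germany']
```

Filtering on the code rather than on a hand-written list of names avoids having to track every aggregate label the dataset uses.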


In [None]:

latest_year = int(df['Year'].max()) if df['Year'].notna().any() else None
print('Latest year in dataset:', latest_year)

# Top emitters
if 'CO2' in df.columns and latest_year is not None:
    top_co2 = df[df['Year']==latest_year].sort_values('CO2', ascending=False).head(10)
    print('\nTop 10 CO2 emitters in', latest_year)
    display(top_co2[['Country','CO2']])
    plt.figure(figsize=(10,6))
    sns.barplot(data=top_co2, x='CO2', y='Country', palette='Reds_r')
    plt.title(f'Top 10 CO2 emitters in {latest_year}')
    plt.xlabel('CO2 (units depend on source)')
    plt.show()
else:
    print('\nNo CO2 column available — skipping top emitters plot.')

# Energy vs CO2 scatter
if 'CO2' in df.columns:
    plt.figure(figsize=(8,6))
    sns.scatterplot(data=df, x='Energy_Use', y='CO2', hue='Renewable_Pct', alpha=0.7)
    plt.title('Energy Use vs CO2 (all countries & years)')
    plt.xlabel('Primary Energy Consumption')
    plt.ylabel('CO2')
    plt.show()
else:
    print('\nNo CO2 column available — skipping energy vs CO2 scatter.')

# Renewable trend for a sample country (Germany or most-records)
sample_country = 'Germany' if 'Germany' in df['Country'].unique() else df['Country'].value_counts().idxmax()
print('\nSample country used for renewable trend:', sample_country)
subset = df[df['Country']==sample_country]
if not subset['Renewable_Pct'].dropna().empty:
    plt.figure(figsize=(9,5))
    sns.lineplot(data=subset, x='Year', y='Renewable_Pct', marker='o')
    plt.title(f'{sample_country} Renewable Share Over Time')
    plt.ylabel('Renewable share of primary energy (%)')
    plt.show()
else:
    print('No renewable percentage data available for', sample_country)



## 6. Forecasting CO2 Emissions (Linear Regression)

If CO2 data is available for a country, we fit a simple linear regression on Year -> CO2 and forecast up to 2030.
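
The fit described above is ordinary least squares with a single predictor. Here is a minimal sketch on made-up numbers, using `np.polyfit` instead of the notebook's `LinearRegression` (both fit the same straight line). Years are shifted to start at zero so the arithmetic stays well conditioned. A linear extrapolation assumes the historical trend continues unchanged, so the result is better read as a baseline than as a prediction.

```python
import numpy as np

# Made-up values for illustration; not taken from the OWID data.
years = np.array([2015, 2016, 2017, 2018, 2019], dtype=float)
co2 = np.array([10.0, 9.5, 9.0, 8.5, 8.0])

base_year = years[0]
# polyfit returns coefficients from highest degree to lowest: [slope, intercept].
slope, intercept = np.polyfit(years - base_year, co2, deg=1)

# Extrapolate the fitted line to 2030.
forecast_2030 = slope * (2030 - base_year) + intercept
print(f"slope per year: {slope:.2f}, forecast for 2030: {forecast_2030:.2f}")
```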


In [None]:

# Fit a per-country linear trend (Year -> CO2) and extrapolate it to 2030

if 'CO2' in df.columns:
    forecast_country = 'Germany' if 'Germany' in df['Country'].unique() else df['Country'].value_counts().idxmax()
    subset = df[df['Country']==forecast_country].dropna(subset=['Year','CO2'])
    if len(subset) >= 5:
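        X = subset[['Year']].astype(int)
        y = subset['CO2']
        model = LinearRegression()
        model.fit(X, y)
        # Forecast from the year after this country's last observation up to 2030.
        # A DataFrame with the same column name avoids sklearn's feature-name warning.
        last_obs_year = int(X['Year'].max())
        future_years = pd.DataFrame({'Year': range(last_obs_year + 1, 2031)})
        plt.figure(figsize=(8,6))
        plt.plot(X['Year'], y, 'o', label='Historical')
        if not future_years.empty:
            preds = model.predict(future_years)
            plt.plot(future_years['Year'], preds, '--', label='Forecast')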
        plt.title(f'Forecasted CO2 for {forecast_country}')
        plt.xlabel('Year'); plt.ylabel('CO2')
        plt.legend(); plt.show()
    else:
        print('Not enough historical CO2 data for', forecast_country, 'to build a reliable forecast.')
else:
    print('No CO2 column available — skipping forecasting.')



## 7. Clustering Countries by Energy & CO2

We cluster countries using the latest-year values of Energy_Use, Renewable_Pct and CO2 (if CO2 exists).
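
Because `Energy_Use` typically spans several orders of magnitude while `Renewable_Pct` stays between 0 and 100, the features are standardized before K-means so that no single column dominates the distance. Here is a minimal sketch of that step with made-up values, written in plain NumPy. It mirrors what `StandardScaler` computes: zero mean and unit variance per column, using the population standard deviation.

```python
import numpy as np

# Made-up rows: [Energy_Use, Renewable_Pct]; not taken from the OWID data.
features = np.array([
    [40000.0, 15.0],
    [4000.0, 20.0],
    [300.0, 60.0],
    [50.0, 5.0],
])

# Column-wise z-score; np.std defaults to ddof=0, matching StandardScaler.
z = (features - features.mean(axis=0)) / features.std(axis=0)

print(z.round(2))
```

After scaling, a one-unit difference means "one standard deviation" in every column, which is what makes Euclidean distance comparable across features.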


In [None]:

if 'Energy_Use' in df.columns and 'Renewable_Pct' in df.columns:
    features = ['Energy_Use','Renewable_Pct'] + (['CO2'] if 'CO2' in df.columns else [])
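    # .copy() so that adding the Cluster column below does not raise SettingWithCopyWarning
    latest_df = df[df['Year']==latest_year].dropna(subset=features).copy()
    if len(latest_df) >= 8:
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(latest_df[features])
        # n_init is set explicitly because its default changed across scikit-learn versions
        kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
        latest_df['Cluster'] = kmeans.fit_predict(X_scaled)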
        plt.figure(figsize=(8,6))
        if 'CO2' in latest_df.columns:
            sns.scatterplot(data=latest_df, x='Energy_Use', y='CO2', hue='Cluster', palette='tab10', alpha=0.8)
            plt.ylabel('CO2')
        else:
            sns.scatterplot(data=latest_df, x='Energy_Use', y='Renewable_Pct', hue='Cluster', palette='tab10', alpha=0.8)
            plt.ylabel('Renewable_Pct')
        plt.title(f'Country clusters ({latest_year})')
        plt.show()
        display(latest_df[['Country']+features+['Cluster']].head(10))
    else:
        print('Not enough complete rows in the latest year to cluster.')
else:
    print('Required features for clustering are missing.')


In [None]:

# 8. Correlation heatmap (numeric features)
corr_feats = [c for c in ['Energy_Use','Renewable_Pct','CO2'] if c in df.columns]
if len(corr_feats) >= 2:
    plt.figure(figsize=(6,4))
    sns.heatmap(df[corr_feats].corr(), annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation heatmap')
    plt.show()
else:
    print('Not enough numeric features to compute correlation heatmap.')


In [None]:

# 9. Save cleaned dataset and print CV summary
out_path = 'clean_energy_data.csv'
df.to_csv(out_path, index=False)
print('Saved cleaned data to', out_path)

summary = f"""
Project: Global Energy & Emissions Analysis
- Cleaned records: {df.shape[0]:,}
- Countries covered: {df['Country'].nunique()}
- Years: {df['Year'].min()} to {df['Year'].max()}
- Features used: {', '.join([c for c in ['Energy_Use','Renewable_Pct','CO2'] if c in df.columns])}
"""
print(summary)



## Final notes & suggestions

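- If no CO2 columns appear after loading, first re-check the file name and that you downloaded the correct file from the OWID repository. OWID also publishes a separate CO2 dataset (`owid-co2-data.csv` in the `owid/co2-data` repository), which is where fields such as `co2` and `co2_per_capita` are maintained. The notebook skips the CO2-dependent sections when no such column is found.  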
- You can extend the forecasting section using time-series models (ARIMA/Prophet) or add cross-country regression analysis (e.g., panel regression).  
- To make figures publication-ready, adjust axis scales (log), add annotations, and save high-resolution PNGs or SVGs.
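
The last suggestion can be sketched as follows. This is a minimal example on made-up points, assuming only Matplotlib; the output file names and country labels are hypothetical.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration; not taken from the OWID data.
energy = [50, 300, 4000, 40000]
co2 = [10, 80, 700, 10000]
labels = ["A", "B", "C", "D"]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(energy, co2)

# Log scales spread out values that span several orders of magnitude.
ax.set_xscale("log")
ax.set_yscale("log")

# Label each point, offset slightly so the text does not cover the marker.
for x, y, name in zip(energy, co2, labels):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("Primary Energy Consumption")
ax.set_ylabel("CO2")

# Raster output at print resolution, plus a vector copy.
fig.savefig("energy_vs_co2.png", dpi=300, bbox_inches="tight")
fig.savefig("energy_vs_co2.svg", bbox_inches="tight")
plt.close(fig)
```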

Run the notebook end-to-end to confirm that every cell executes in your environment.
