# **Prediction model**

## Objectives

* Predict future trends in access to electricity, access to clean fuels, and CO₂ emissions by 2030.

* Use linear regression models per country based on historical data (2000–2020) to estimate realistic, explainable future scenarios aligned with SDG 7 targets.

* Ensure interpretability and transparency in the prediction process to support policy communication and public understanding.

## Inputs

* Cleaned dataset from the previous ETL stage (`global-data-on-sustainable-energy-processed.csv`)

* Variables used:

    * Time series per country for each target variable

    * Historical values from 2000 to 2020 (at least 5 valid data points per country/target)

    * No socioeconomic projections beyond 2020 are assumed (trend-based forecasting only)
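
The minimum-data requirement above can be checked before any model is fitted. The following is a minimal sketch on toy data: the countries `A` and `B` and their values are hypothetical, and only the threshold of 5 and the column names come from this notebook.

```python
import numpy as np
import pandas as pd

MIN_POINTS = 5  # threshold stated in the inputs above

# Toy data: hypothetical countries, not values from the real dataset
toy = pd.DataFrame({
    "country": ["A"] * 6 + ["B"] * 6,
    "year": list(range(2000, 2006)) * 2,
    "access_to_electricity": [50, 52, np.nan, 56, 58, 60,
                              np.nan, np.nan, 70, np.nan, 72, np.nan],
})

# Count non-missing observations per country for the target column
valid_counts = toy.groupby("country")["access_to_electricity"].count()

# Countries that meet the minimum and can therefore be modelled
eligible = valid_counts[valid_counts >= MIN_POINTS].index.tolist()
print(valid_counts.to_dict(), eligible)
```

Running the same count for each target column shows in advance which country/target pairs will be skipped by the modelling loop.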

## Outputs

*   Linear Regression models trained individually per country and target variable

*   Forecasted values for the year 2030, exported to:

    * `Data/Predictions/predictions_linear_2030.csv`

*   Visualisations showing:

    * Predicted distributions per variable

    * Country-level comparison between current and projected values

   




---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory needs to be changed when running the notebook in the editor.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
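
One caveat with the cells above: re-running them moves the working directory up one more level each time. A possible alternative, sketched below, is to search upwards for a marker folder instead. The helper `find_project_root` is hypothetical, and using `Data` as the marker is an assumption based on the paths used later in this notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "Data") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    # Check the starting directory first, then each parent in turn
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' found above {start}")
```

With this helper, `os.chdir(find_project_root(Path.cwd()))` gives the same result no matter how many times the cell is run.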

# Import Libraries and Build the model

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Forward slashes avoid invalid escape sequences and work on every OS
df = pd.read_csv('Data/Processed/global-data-on-sustainable-energy-processed.csv')

In [None]:
df.dtypes

# Create a predictive model using linear regression

In [None]:
sns.set_style("whitegrid")

# Variables to predict
target_vars = [
    'access_to_electricity',
    'access_to_clean_fuels',
    'co2_emissions_kt',
    'renewable_capacity_per_capita'  
]
year_future = 2030

# Prepare predictions and metrics
predictions = []
metrics = []

# Loop through countries and predict each target
for country in df['country'].unique():
    df_country = df[df['country'] == country]

    for target in target_vars:
        df_target = df_country[['year', target]].dropna()

        if len(df_target) >= 5:  # Ensure minimum data points
            X = df_target[['year']]
            y = df_target[target]

            model = LinearRegression()
            model.fit(X, y)
            y_pred = model.predict(X)

            # Predict future
            future_input = pd.DataFrame({'year': [year_future]})
            future_pred = model.predict(future_input)[0]

            # Apply bounds for % variables
            if target in ['access_to_electricity', 'access_to_clean_fuels']:
                future_pred = min(100, max(0, future_pred))
            elif target == 'renewable_capacity_per_capita':
                future_pred = max(0, future_pred)  # prevent negative capacity

            predictions.append({
                'country': country,
                'target': target,
                'predicted_value_2030': future_pred
            })

            # Metrics
            r2 = r2_score(y, y_pred)
            mae = mean_absolute_error(y, y_pred)
            metrics.append({
                'country': country,
                'target': target,
                'r2': r2,
                'mae': mae
            })

# Convert to DataFrame
df_predictions = pd.DataFrame(predictions)
df_metrics = pd.DataFrame(metrics)

# Pivot to wide format for exporting or dashboard
df_2030 = df_predictions.pivot(index='country', columns='target', values='predicted_value_2030').reset_index()
df_2030['year'] = year_future

# Save to CSV
df_2030.to_csv("Data/Predictions/predictions_linear_2030.csv", index=False)

# Print metrics summary
print("Average performance metrics per variable:")
for target in target_vars:
    sub_df = df_metrics[df_metrics['target'] == target]

    if sub_df.empty:
        print(f"\n{target}: no metrics recorded.")
        continue

    mae_avg = sub_df['mae'].mean()
    r2_avg = sub_df['r2'].mean()

    print(f"\n{target}:")
    print(f"  Average MAE: {mae_avg:.2f}")
    print(f"  Average R² : {r2_avg:.2f}")





In [None]:
# Histograms of predictions
for target in target_vars:
    plt.figure(figsize=(10, 6))
    sns.histplot(df_2030[target], kde=True, bins=30)
    plt.title(f"Distribution of Predicted {target} in 2030")
    plt.xlabel(target)
    plt.ylabel("Count of Countries")
    plt.grid(True)
    plt.tight_layout()
    plt.show()


In [None]:
# Save results
df_pred = pd.DataFrame(predictions)
df_2030 = df_pred.pivot(index='country', columns='target', values='predicted_value_2030').reset_index()
df_2030['year'] = year_future
df_2030.to_csv("Data/Predictions/predictions_linear_2030.csv", index=False)
print("Predictions saved to Data/Predictions/predictions_linear_2030.csv")
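
The outputs section lists a country-level comparison between current and projected values. One way to assemble such a table is sketched here on toy data: the frames `history` and `projected` and all their values are hypothetical stand-ins for `df` and `df_2030`, and taking the latest non-missing observation as the "current" value is an assumption.

```python
import pandas as pd

target = "access_to_electricity"

# Hypothetical history: country B has no value for 2020
history = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2019, 2020, 2019, 2020],
    target: [80.0, 82.0, 95.0, None],
})

# Hypothetical 2030 projections in the same wide layout as df_2030
projected = pd.DataFrame({"country": ["A", "B"], target: [100.0, 98.0]})

# Latest non-missing observation per country
latest = (
    history.dropna(subset=[target])
    .sort_values("year")
    .groupby("country")
    .tail(1)[["country", "year", target]]
    .rename(columns={"year": "latest_year", target: "latest_value"})
)

# Join with the projections and compute the projected change
comparison = latest.merge(
    projected[["country", target]].rename(columns={target: "predicted_2030"}),
    on="country",
)
comparison["change"] = comparison["predicted_2030"] - comparison["latest_value"]
print(comparison)
```

The resulting `comparison` frame can be passed to a bar or scatter plot to show current against projected values per country.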


## Model Justification: Linear Regression per Country (2030 Predictions)

For this project, I used **simple linear regression models** applied **individually per country and per target variable** (`access_to_electricity`, `access_to_clean_fuels`, `co2_emissions_kt`, `renewable_capacity_per_capita`). This approach was chosen for the following reasons:

 **Interpretability**: Linear regression is transparent and easy to explain. Each country has its own model, making it straightforward to analyze and validate.

 **Data Availability**: The dataset contains historical time series per country from ~2000 to 2020, making linear trends a reasonable first assumption.

 **Project Requirements**: The model provides clear numerical predictions for 2030, aligned with the SDG 7 goal year.
 
 **Simplicity and Scalability**: Running many small linear models is computationally light and allows prediction coverage for over 170 countries.
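
To illustrate the interpretability point, the sketch below fits one model on a toy series and reads off its slope, which is the average change in the target per year. The values are hypothetical and chosen so that the raw 2030 extrapolation exceeds 100%, which is why the notebook bounds percentage variables.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy series for one hypothetical country (not real data)
toy = pd.DataFrame({
    "year": [2000, 2005, 2010, 2015, 2020],
    "access_to_electricity": [60.0, 70.0, 80.0, 90.0, 100.0],
})

model = LinearRegression().fit(toy[["year"]], toy["access_to_electricity"])

# Slope: average change in percentage points per year
slope = float(model.coef_[0])

# Raw extrapolation to 2030, then the same 0-100 bound used in the notebook
raw_2030 = float(model.predict(pd.DataFrame({"year": [2030]}))[0])
bounded_2030 = min(100.0, max(0.0, raw_2030))

print(f"slope={slope:.2f}, raw={raw_2030:.1f}, bounded={bounded_2030:.1f}")
```

Here the slope is 2 percentage points per year, so the unbounded line reaches 120% by 2030 and the bound brings it back to 100%.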


### Limitations of the Model

While linear regression offers clarity and speed, it has some key limitations:

**Oversimplification**: It assumes a linear relationship between the year and the target variable, which may not hold true for all countries or contexts.

**Univariate Dependence**: The model only uses the year as an explanatory variable, ignoring other important factors (e.g., GDP, energy policy, urbanization).

**Low Reliability in CO₂ Predictions**: The model for `co2_emissions_kt` shows weak performance (average R² ≈ 0.47), meaning the predictions should be interpreted with caution.

**Data Gaps and Quality**: Country/target pairs with fewer than 5 data points were excluded to maintain a minimal baseline of reliability, but this introduces uneven coverage.
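
The R² and MAE reported in this notebook are computed on the same points used for fitting, so they describe fit quality rather than forecasting accuracy. A temporal holdout is one way to estimate the latter. The sketch below uses a synthetic saturating curve and a cutoff year of 2015, both of which are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic series that flattens as it approaches 100% (not real data)
years = np.arange(2000, 2021)
values = 100 - 60 * np.exp(-0.15 * (years - 2000))
toy = pd.DataFrame({"year": years, "access_to_electricity": values})

CUTOFF = 2015  # hypothetical split: fit up to 2015, evaluate on 2016-2020
train = toy[toy["year"] <= CUTOFF]
holdout = toy[toy["year"] > CUTOFF]

model = LinearRegression().fit(train[["year"]], train["access_to_electricity"])

# Error on the fitted years versus error on the unseen later years
mae_in_sample = mean_absolute_error(
    train["access_to_electricity"], model.predict(train[["year"]])
)
mae_holdout = mean_absolute_error(
    holdout["access_to_electricity"], model.predict(holdout[["year"]])
)

print(f"in-sample MAE={mae_in_sample:.2f}, holdout MAE={mae_holdout:.2f}")
```

On this curve the holdout error is clearly larger than the in-sample error, because a straight line keeps rising while the series levels off. The 2030 extrapolation is exposed to the same effect.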

### Ethical Considerations

**Transparency**: The model and its metrics are shared openly, and limitations are documented. No hidden logic or black-box modeling was used.

**Fair Representation**: Each country is modeled independently to avoid biasing predictions in favor of higher-income nations.

**No Policy Forecasting**: These predictions are statistical extrapolations, **not forecasts** based on political, environmental, or technological changes. Misinterpreting them as such could lead to flawed conclusions.

**Avoiding Misuse**: The predictions should not be used to rank countries' efforts or assign blame. They reflect historical trends, not intent or capability.


---
