# Notebook 1: Data Cleaning & Common Preprocessing

**Goals:**
- Handle missing values (imputation).
- Standardize units.
- Generate lag features (common in panel-data models).
- Save the intermediate processed data.

In [1]:
import pandas as pd
import sys
import os

# Make the project's src/ directory importable
sys.path.append(os.path.abspath('../src'))
from preprocessing import load_data, basic_cleaning, handle_missing_values, create_lag_features

# Load Raw Data
data_path = '../data/raw/global-data-on-sustainable-energy.csv'
df = load_data(data_path)

# Basic Cleaning
df_clean = basic_cleaning(df)

# Imputation
df_imputed = handle_missing_values(df_clean)

# Lag Features
target = 'Value_co2_emissions_kt_by_country'
lag_features = [target, 'gdp_growth', 'gdp_per_capita', 'Primary energy consumption per capita (kWh/person)']
df_lags = create_lag_features(df_imputed, target, lag_features, shifts=[1])

# Save
output_path = '../data/processed/common_preprocessed.csv'
df_lags.to_csv(output_path, index=False)
print(f"Saved common preprocessed data to {output_path}")

Loaded data from ../data/raw/global-data-on-sustainable-energy.csv: (3649, 21)
Missing values after imputation: 0
Dropped 176 rows due to lags.
Saved common preprocessed data to ../data/processed/common_preprocessed.csv
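
The body of `create_lag_features` lives in `src/preprocessing` and is not shown in this notebook. As a rough sketch of what a per-country lag step typically looks like, the helper below (`add_lags`, a hypothetical name) sorts by group and time, shifts each column within its group, and drops rows whose lag is undefined. The toy column names (`Entity`, `Year`, `co2`) are assumptions for illustration only. With a single shift of 1, each group loses exactly its first row, which would be consistent with the "Dropped 176 rows" message if the data covers 176 entities.

```python
import pandas as pd

def add_lags(df, group_col, time_col, cols, shifts=(1,)):
    """Hypothetical stand-in for create_lag_features: lags computed within each group."""
    # Sort so that shift(k) really means "k periods earlier" inside each group
    out = df.sort_values([group_col, time_col]).copy()
    for col in cols:
        for k in shifts:
            out[f"{col}_lag{k}"] = out.groupby(group_col)[col].shift(k)
    lag_cols = [f"{c}_lag{k}" for c in cols for k in shifts]
    before = len(out)
    # The first k rows of every group have no lagged value, so they are dropped
    out = out.dropna(subset=lag_cols).reset_index(drop=True)
    print(f"Dropped {before - len(out)} rows due to lags.")
    return out

# Toy panel: two entities, a few years each (illustrative values only)
toy = pd.DataFrame({
    "Entity": ["A", "A", "A", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001],
    "co2": [1.0, 2.0, 3.0, 10.0, 20.0],
})
lagged = add_lags(toy, "Entity", "Year", ["co2"])
```

Grouping before shifting matters for panel data: a plain `df[col].shift(1)` would leak the last year of one country into the first year of the next.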


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_imputed[col].fillna(median_val, inplace=True)
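
The warning points at the chained `df_imputed[col].fillna(median_val, inplace=True)` call, presumably inside `handle_missing_values`. A minimal sketch of the assignment form that the warning itself recommends is shown below. The toy frame and its column names are assumptions, and the real function may select columns differently.

```python
import pandas as pd

# Toy frame standing in for df_imputed (illustrative values only)
df_imputed = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [None, 4.0, 6.0],
})

for col in df_imputed.columns:
    median_val = df_imputed[col].median()
    # Assign the result back instead of calling fillna(..., inplace=True)
    # on the intermediate Series returned by df_imputed[col]
    df_imputed[col] = df_imputed[col].fillna(median_val)

print(f"Missing values after imputation: {df_imputed.isna().sum().sum()}")
```

An equivalent one-liner for numeric columns is `df_imputed = df_imputed.fillna(df_imputed.median(numeric_only=True))`, which fills each column with its own median in a single call.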