## Capstone 3: Sustainable Energy Recommendation System  
### "Leveraging Neural Collaborative Filtering for Sustainable Energy Insights"
## Preprocessing and Training Data Development
Goal: Create a cleaned development dataset you can use to complete the modeling step of your project.

Steps:
- Create dummy or indicator features for categorical variables
- Standardize the magnitude of numeric features using a scaler
- Split into testing and training datasets

Neural Collaborative Filtering (NCF):

NCF is a deep-learning approach to recommendation systems. Matrix factorization scores a user-item pair with a dot product of their latent vectors, which is a linear interaction. NCF instead passes the embeddings through a neural network, so it can learn non-linear relationships.
- Learns the user-item interaction function from data rather than assuming it is linear
- Uses embeddings to encode users and items and predicts preferences from them
- Can be trained in mini-batches, which helps on large interaction datasets where neighborhood-based collaborative filtering becomes expensive
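To make the embedding idea concrete, here is a minimal sketch of an NCF-style forward pass in plain NumPy. It is illustrative only: the sizes, the single hidden layer, and the random (untrained) weights are all assumptions, and a real model would be trained in a deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration
n_users, n_items, emb_dim, hidden_dim = 5, 3, 4, 8

# Embedding tables: one vector per user (country) and per item (indicator)
user_emb = rng.normal(scale=0.1, size=(n_users, emb_dim))
item_emb = rng.normal(scale=0.1, size=(n_items, emb_dim))

# One hidden layer plus an output layer; in practice these weights are learned
W1 = rng.normal(scale=0.1, size=(2 * emb_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, 1))
b2 = np.zeros(1)


def ncf_forward(user_ids, item_ids):
    """Return a score in (0, 1) for each (user, item) pair."""
    # Look up and concatenate the two embeddings for every pair
    x = np.concatenate([user_emb[user_ids], item_emb[item_ids]], axis=1)
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer (the non-linear part)
    logits = h @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits.ravel()))  # sigmoid


scores = ncf_forward(np.array([0, 1, 4]), np.array([2, 0, 1]))
print(scores.shape)
```

Replacing the hidden layer with a plain dot product of the two embeddings would recover the matrix factorization scoring rule, which is the contrast drawn above.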

NCF requires a user-item interaction structure.

In this dataset, the roles map as follows:
- Users -> **Countries**
- Items -> Energy investments, policy impacts, or **renewable adoption rates**
- Interactions -> Energy consumption, investment ratios or **adoption success** 
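One possible way to obtain that structure is to reshape the wide country-year table into a long table with one row per (country, item) observation. The sketch below uses a tiny frame with made-up values; the choice of which columns act as items, and the names `item`, `interaction`, and `item_id`, are assumptions for illustration.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the dataset
toy = pd.DataFrame({
    "Entity": [0, 0, 1, 1],
    "Year": [2000, 2001, 2000, 2001],
    "Renewable_Share": [0.47, 0.47, 0.12, 0.15],
    "Investment_Energy_Ratio": [1.7e7, 1.4e8, 0.0, 2.0e6],
})

# Wide -> long: each selected indicator column becomes an "item"
interactions = toy.melt(
    id_vars=["Entity", "Year"],
    value_vars=["Renewable_Share", "Investment_Energy_Ratio"],
    var_name="item",
    value_name="interaction",
)

# Integer item ids, as an embedding layer would expect
interactions["item_id"] = interactions["item"].astype("category").cat.codes
print(interactions.head())
```

Because the indicators live on very different scales, the interaction values would normally be scaled per item before training.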



In [9]:
import datetime
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, scale

In [66]:
os.chdir('C:/Users/aamal/Desktop/Springboard/Springboard_DataScience/Capstone-3-Energy/data')
file_path = 'C:/Users/aamal/Desktop/Springboard/Springboard_DataScience/Capstone-3-Energy/data/energy_data_eda.csv'
df = pd.read_csv(file_path)

In [52]:
df.head()

| Unnamed: 0 | Entity | Year | Electricity_Access | Clean_Cooking_Fuels | Renewable_Capacity | Financial_Flows | Renewable_Share | Fossil_Electricity | Nuclear_Electricity | Renewable_Electricity | ... | Energy_Dependency | CO2_Intensity_Per_Capita | Renewables_to_Fossil | Energy_Efficiency | Renewable_Nuclear_Interaction | Energy_Cluster | Renewable_Adoption_Group | Investment_Energy_Ratio | Policy_Impact | Econ_Energy_Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2000 | 1.613591 | 6.2 | 9.22 | 20000.0 | 0.468451 | 0.16 | 0.0 | 0.31 | ... | 9369.013 | 5e-06 | 1.937488 | 2.300408 | 0.0 | 2 | Low Renewable Adoption | 17340550.0 | Strong Investment | 0 |
| 1 | 0 | 2001 | 4.074574 | 7.2 | 8.86 | 130000.0 | 0.474802 | 0.09 | 0.0 | 0.5 | ... | 61724.28 | 5e-06 | 5.555494 | 2.938436 | 0.0 | 2 | Low Renewable Adoption | 143940600.0 | Strong Investment | 0 |
| 2 | 0 | 2002 | 9.409158 | 8.2 | 8.47 | 3950000.0 | 0.393898 | 0.13 | 0.0 | 0.56 | ... | 1555899.0 | 7e-06 | 4.307659 | 0.68116 | 0.0 | 2 | Low Renewable Adoption | 4912800000.0 | Strong Investment | 0 |
| 3 | 0 | 2003 | 14.738506 | 9.5 | 8.09 | 25970000.0 | 0.381716 | 0.31 | 0.0 | 0.63 | ... | 9913163.0 | 8e-06 | 2.032252 | 0.728731 | 0.0 | 2 | Low Renewable Adoption | 29619630000.0 | Strong Investment | 0 |
| 4 | 0 | 2004 | 20.064968 | 10.9 | 7.75 | 0.0 | 0.460641 | 0.33 | 0.0 | 0.56 | ... | 0.0 | 7e-06 | 1.696965 | 1.036219 | 0.0 | 2 | Low Renewable Adoption | 0.0 | Low Investment | 0 |


**Dataset Structure Overview**:

✔ Economic Metrics (`GDP_Growth`, `GDP_Per_Capita`, `Financial_Flows`, `Population_Density`)

✔ Energy Infrastructure & Consumption (`Energy_Consumption`, `Energy_Efficiency`, `Electricity_Access`)

✔ Renewable vs Fossil Energy Metrics (`Renewable_Share`, `Fossil_Electricity`, `Renewable_Electricity`, `Renewables_to_Fossil`)

✔ CO₂ & Sustainability Indicators (`CO2_Emissions`, `CO2_Intensity_Per_Capita`, `Renewables_Percentage`)

✔ Geospatial Attributes (`Latitude`, `Longitude`, `Land_Area`)

✔ Clustering & Policy Metrics (`Energy_Cluster`, `Econ_Energy_Cluster`, `Policy_Impact`, `Renewable_Adoption_Group`, `Policy_Impact_Strong Investment`)


#### **Clustering Ideas for NCF:**

**1.  Energy Transition Clusters**
- Group countries based on their transition from fossil fuels to renewables
- Consider features like "Fossil_Electricity", "Renewable_Share", "CO2_Emissions"
- Helps analyze which nations are leading vs lagging in clean energy adoption
  
**2.  Economic Development & Energy Consumption Clusters**
- Classify countries based on "GDP_Per_Capita" and "Energy_Consumption"
- Identify whether wealthier nations use energy more efficiently or excessively
- Helps understand economic-energy dependencies
  
**3.  Policy-Driven Investment Clusters**
- Use "Financial_Flows" and "Investment_Energy_Ratio"
- Group regions based on strong vs weak policy investments in clean energy
- Helps governments benchmark best practices for renewable funding
  
**4. CO₂ Emissions & Sustainability Clusters** 
- Analyze "CO2_Emissions", "CO2_Intensity_Per_Capita", "Renewables_Percentage"
- Identify high-emission vs decarbonizing economies
- Useful for climate policy evaluation
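As a rough illustration of the first idea (energy transition clusters), the sketch below standardizes three features and groups rows with k-means. The data is synthetic, and both the choice of three clusters and the column name `Transition_Cluster` are assumptions rather than values taken from this project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data; column names mirror the dataset
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Fossil_Electricity": rng.uniform(0, 100, size=30),
    "Renewable_Share": rng.uniform(0, 1, size=30),
    "CO2_Emissions": rng.uniform(1e3, 1e6, size=30),
})

# k-means is distance-based, so put the features on a common scale first
scaled = StandardScaler().fit_transform(toy)

# n_clusters=3 is an arbitrary choice for this sketch
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
toy["Transition_Cluster"] = kmeans.fit_predict(scaled)
print(toy["Transition_Cluster"].value_counts())
```

The other three ideas follow the same pattern with a different feature list; the number of clusters would usually be chosen with an elbow plot or silhouette scores.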


In [54]:
df.isnull().sum()

Entity                           0
Year                             0
Electricity_Access               0
Clean_Cooking_Fuels              0
Renewable_Capacity               0
Financial_Flows                  0
Renewable_Share                  0
Fossil_Electricity               0
Nuclear_Electricity              0
Renewable_Electricity            0
Low_Carbon_Electricity           0
Energy_Consumption               0
Energy_Intensity                 0
CO2_Emissions                    0
Renewables_Percentage            0
GDP_Growth                       0
GDP_Per_Capita                   0
Population_Density               0
Land_Area                        0
Latitude                         0
Longitude                        0
Energy_Dependency                0
CO2_Intensity_Per_Capita         0
Renewables_to_Fossil             0
Energy_Efficiency                0
Renewable_Nuclear_Interaction    0
Energy_Cluster                   0
Renewable_Adoption_Group         0
Investment_Energy_Ra

In [56]:
df.dtypes

Entity                             int64
Year                               int64
Electricity_Access               float64
Clean_Cooking_Fuels              float64
Renewable_Capacity               float64
Financial_Flows                  float64
Renewable_Share                  float64
Fossil_Electricity               float64
Nuclear_Electricity              float64
Renewable_Electricity            float64
Low_Carbon_Electricity           float64
Energy_Consumption               float64
Energy_Intensity                 float64
CO2_Emissions                    float64
Renewables_Percentage            float64
GDP_Growth                       float64
GDP_Per_Capita                   float64
Population_Density               float64
Land_Area                        float64
Latitude                         float64
Longitude                        float64
Energy_Dependency                float64
CO2_Intensity_Per_Capita         float64
Renewables_to_Fossil             float64
Energy_Efficienc

### Objective 1: Create dummy or indicator features for categorical variables

In [68]:
for col in ["Policy_Impact", "Renewable_Adoption_Group"]:
    print(f"{col} Unique Values:", df[col].unique())

Policy_Impact Unique Values: ['Strong Investment' 'Low Investment']
Renewable_Adoption_Group Unique Values: ['Low Renewable Adoption']


In [70]:
# Drop Renewable_Adoption_Group: it has a single unique value, so it carries no information
df.drop(columns=["Renewable_Adoption_Group"], inplace=True)
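The same check can be automated so that any constant column is caught, not only the one spotted by eye. This is a small sketch on a toy frame; `constant_cols` is a hypothetical helper name.

```python
import pandas as pd

# Toy frame that mimics the two categorical columns inspected above
toy = pd.DataFrame({
    "Policy_Impact": ["Strong Investment", "Low Investment", "Strong Investment"],
    "Renewable_Adoption_Group": ["Low Renewable Adoption"] * 3,
})

# A column with at most one distinct value cannot help a model
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)

toy = toy.drop(columns=constant_cols)
```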

In [72]:
from sklearn.preprocessing import LabelEncoder

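# Policy_Impact has two categories, so label encoding yields a 0/1 indicator.
# LabelEncoder sorts classes: 'Low Investment' -> 0, 'Strong Investment' -> 1.
encoder = LabelEncoder()
df["Policy_Impact"] = encoder.fit_transform(df["Policy_Impact"])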
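An alternative that matches the objective's wording more literally is `pd.get_dummies`, which creates named indicator columns. The sketch below runs on a toy frame, not the project data. With `drop_first=True` the alphabetically first category is dropped, leaving a single column.

```python
import pandas as pd

# Toy frame with the two categories seen in Policy_Impact
toy = pd.DataFrame(
    {"Policy_Impact": ["Strong Investment", "Low Investment", "Strong Investment"]}
)

# drop_first=True keeps one indicator for a two-level variable
dummies = pd.get_dummies(toy, columns=["Policy_Impact"], drop_first=True, dtype=int)
print(dummies.columns.tolist())
```

For a variable with more than two levels, indicator columns avoid the artificial ordering that integer label codes would imply.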

### Objective 2: Standardize the magnitude of numeric features using a scaler

In [19]:
from sklearn.preprocessing import StandardScaler

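# Standardize the selected numeric features to zero mean and unit variance.
# Caveat: the scaler is fit on all rows before the train/test split,
# so test-set statistics influence the scaling.
scaler = StandardScaler()
num_features = ["GDP_Per_Capita", "CO2_Emissions", "Energy_Consumption", "Renewables_Percentage"]
df[num_features] = scaler.fit_transform(df[num_features])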

### Objective 3: Split into testing and training datasets

In [None]:
from sklearn.model_selection import train_test_split

# Features: every column except the two cluster labels
X = df.drop(columns=["Econ_Energy_Cluster", "Energy_Cluster"])
# Target: the energy cluster label
y = df["Energy_Cluster"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
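A common way to keep test data out of the scaling step is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that ordering on synthetic data, with two numeric features standing in for the real list; it is one option, not necessarily the order this notebook must follow.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data; column names mirror the dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "GDP_Per_Capita": rng.uniform(500, 60000, size=50),
    "Energy_Consumption": rng.uniform(100, 10000, size=50),
    "Energy_Cluster": rng.integers(0, 3, size=50),
})
num_features = ["GDP_Per_Capita", "Energy_Consumption"]

X = toy.drop(columns=["Energy_Cluster"])
y = toy["Energy_Cluster"]

# 1) Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train = X_train.copy()
X_test = X_test.copy()

# 2) Fit the scaler on training rows only, then apply it to both sets
scaler = StandardScaler()
X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_test[num_features] = scaler.transform(X_test[num_features])
```

With this ordering the test set is scaled with the training mean and standard deviation, so its own mean will generally not be exactly zero, which is expected.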