# Capstone 3: Sustainable Energy Recommendation System  
### "Leveraging Neural Collaborative Filtering for Sustainable Energy Insights"
Audrey Malloy 

Date Updated: April 15th, 2025

## Objectives for Data Wrangling
### 1. Handle Missing Data  
- Identify gaps in financial flows, renewable capacity, and energy share columns.  
- Implement imputation techniques where appropriate (mean, median, interpolation).  

### 2. Data Cleaning & Transformation  
- Convert non-numeric columns into numerical formats (e.g., encode "Entity" and parse the comma-formatted "Density" strings as numbers).  
- Standardize units across energy and economic indicators.  
- Normalize numerical features to ensure comparability.  

### 3. Outlier Detection & Removal  
- Investigate extreme values in energy consumption and CO₂ emissions.  
- Apply statistical methods (z-score, IQR) to manage outliers effectively.  

### 4. Feature Engineering  
- Create new meaningful features (e.g., "Renewables-to-Fossil Ratio").  
- Construct interaction terms that capture dependencies between electricity sources.  

### 5. Geospatial Data Preparation  
- Ensure latitude and longitude are correctly formatted for graph-based relationships.  
- Use spatial clustering methods to group regions with similar energy dynamics.  
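
As a minimal sketch of the formatting check described above, the snippet below coerces coordinates to numbers and flags rows that fall outside the valid latitude and longitude ranges. The toy frame and the helper name `validate_coordinates` are illustrative assumptions, not part of this notebook's pipeline.

```python
import pandas as pd

# Toy frame standing in for the real dataset; values are illustrative only.
coords = pd.DataFrame({
    "Entity": ["A", "B", "C"],
    "Latitude": ["33.93911", "95.0", None],   # strings and a missing value
    "Longitude": [67.709953, 20.0, 10.0],
})

def validate_coordinates(frame, lat_col="Latitude", lon_col="Longitude"):
    """Return a boolean Series that is True where both coordinates are valid."""
    # Coerce to numeric; anything unparseable becomes NaN.
    lat = pd.to_numeric(frame[lat_col], errors="coerce")
    lon = pd.to_numeric(frame[lon_col], errors="coerce")
    # NaN fails both range checks, so missing coordinates are flagged as invalid.
    return lat.between(-90, 90) & lon.between(-180, 180)

valid_mask = validate_coordinates(coords)
print(coords.loc[~valid_mask, "Entity"].tolist())  # entities needing attention
```

Rows flagged here could be inspected or dropped before any distance-based grouping is attempted.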


### Data Overview:

- **Entity**: The name of the country or region for which the data is reported.
- **Year**: The year for which the data is reported, ranging from 2000 to 2020.
- **Access to electricity (% of population)**: The percentage of population with access to electricity.
- **Access to clean fuels for cooking (% of population)**: The percentage of the population with primary reliance on clean fuels.
- **Renewable-electricity-generating-capacity-per-capita**: Installed renewable energy capacity per person.
- **Financial flows to developing countries (US $)**: Aid and assistance from developed countries for clean energy projects.
  
- **Renewable energy share in total final energy consumption (%)**: Percentage of renewable energy in final energy consumption.
- **Electricity from fossil fuels (TWh)**: Electricity generated from fossil fuels (coal, oil, gas) in terawatt-hours.
- **Electricity from nuclear (TWh)**: Electricity generated from nuclear power in terawatt-hours.
- **Electricity from renewables (TWh)**: Electricity generated from renewable sources (hydro, solar, wind, etc.) in terawatt-hours.
- **Low-carbon electricity (% electricity)**: Percentage of electricity from low-carbon sources (nuclear and renewables).
- **Primary energy consumption per capita (kWh/person)**: Energy consumption per person in kilowatt-hours.
- **Energy intensity level of primary energy (MJ/$2017 PPP GDP)**: Energy use per unit of GDP at purchasing power parity.
- **Value_co2_emissions_kt_by_country**: Carbon dioxide emissions per country in kilotons (kt).
- **Renewables (% equivalent primary energy)**: Equivalent primary energy that is derived from renewable sources.
- **GDP growth (annual %)**: Annual GDP growth rate based on constant local currency.
- **GDP per capita**: Gross domestic product per person.
- **Density (P/Km²)**: Population density in persons per square kilometer.
- **Land Area (Km²)**: Total land area in square kilometers.
- **Latitude**: Latitude of the country's centroid in decimal degrees.
- **Longitude**: Longitude of the country's centroid in decimal degrees.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
import requests
import os

In [3]:
os.chdir('C:/Users/aamal/Desktop/Springboard/Springboard_DataScience/Capstone-3-Energy/data')
file_path = 'C:/Users/aamal/Desktop/Springboard/Springboard_DataScience/Capstone-3-Energy/data/global-data-on-sustainable-energy.csv'
df = pd.read_csv(file_path)

In [4]:
df.head()

Unnamed: 0,Entity,Year,Access to electricity (% of population),Access to clean fuels for cooking,Renewable-electricity-generating-capacity-per-capita,Financial flows to developing countries (US $),Renewable energy share in the total final energy consumption (%),Electricity from fossil fuels (TWh),Electricity from nuclear (TWh),Electricity from renewables (TWh),...,Primary energy consumption per capita (kWh/person),Energy intensity level of primary energy (MJ/$2017 PPP GDP),Value_co2_emissions_kt_by_country,Renewables (% equivalent primary energy),gdp_growth,gdp_per_capita,Density\n(P/Km2),Land Area(Km2),Latitude,Longitude
0,Afghanistan,2000,1.613591,6.2,9.22,20000.0,44.99,0.16,0.0,0.31,...,302.59482,1.64,760.0,,,,60,652230.0,33.93911,67.709953
1,Afghanistan,2001,4.074574,7.2,8.86,130000.0,45.6,0.09,0.0,0.5,...,236.89185,1.74,730.0,,,,60,652230.0,33.93911,67.709953
2,Afghanistan,2002,9.409158,8.2,8.47,3950000.0,37.83,0.13,0.0,0.56,...,210.86215,1.4,1029.999971,,,179.426579,60,652230.0,33.93911,67.709953
3,Afghanistan,2003,14.738506,9.5,8.09,25970000.0,36.66,0.31,0.0,0.63,...,229.96822,1.4,1220.000029,,8.832278,190.683814,60,652230.0,33.93911,67.709953
4,Afghanistan,2004,20.064968,10.9,7.75,,44.24,0.33,0.0,0.56,...,204.23125,1.2,1029.999971,,1.414118,211.382074,60,652230.0,33.93911,67.709953


In [5]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3649 entries, 0 to 3648
Data columns (total 21 columns):
 #   Column                                                            Non-Null Count  Dtype  
---  ------                                                            --------------  -----  
 0   Entity                                                            3649 non-null   object 
 1   Year                                                              3649 non-null   int64  
 2   Access to electricity (% of population)                           3639 non-null   float64
 3   Access to clean fuels for cooking                                 3480 non-null   float64
 4   Renewable-electricity-generating-capacity-per-capita              2718 non-null   float64
 5   Financial flows to developing countries (US $)                    1560 non-null   float64
 6   Renewable energy share in the total final energy consumption (%)  3455 non-null   float64
 7   Electricity from fossil fuels (TW

In [6]:
df.rename(columns={
    "Entity": "Entity",
    "Year": "Year",
    "Access to electricity (% of population)": "Electricity_Access",
    "Access to clean fuels for cooking": "Clean_Cooking_Fuels",
    "Renewable-electricity-generating-capacity-per-capita": "Renewable_Capacity",
    "Financial flows to developing countries (US $)": "Financial_Flows",
    "Renewable energy share in the total final energy consumption (%)": "Renewable_Share",
    "Electricity from fossil fuels (TWh)": "Fossil_Electricity",
    "Electricity from nuclear (TWh)": "Nuclear_Electricity",
    "Electricity from renewables (TWh)": "Renewable_Electricity",
    "Low-carbon electricity (% electricity)": "Low_Carbon_Electricity",
    "Primary energy consumption per capita (kWh/person)": "Energy_Consumption",
    "Energy intensity level of primary energy (MJ/$2017 PPP GDP)": "Energy_Intensity",
    "Value_co2_emissions_kt_by_country": "CO2_Emissions",
    "Renewables (% equivalent primary energy)": "Renewables_Percentage",
    "gdp_growth": "GDP_Growth",
    "gdp_per_capita": "GDP_Per_Capita",
    "Density\\n(P/Km2)": "Population_Density",
    "Land Area(Km2)": "Land_Area",
    "Latitude": "Latitude",
    "Longitude": "Longitude"
}, inplace=True)

### Handle missing and NA values: 

In [7]:
print(df.isnull().sum())   

Entity                       0
Year                         0
Electricity_Access          10
Clean_Cooking_Fuels        169
Renewable_Capacity         931
Financial_Flows           2089
Renewable_Share            194
Fossil_Electricity          21
Nuclear_Electricity        126
Renewable_Electricity       21
Low_Carbon_Electricity      42
Energy_Consumption           0
Energy_Intensity           207
CO2_Emissions              428
Renewables_Percentage     2137
GDP_Growth                 317
GDP_Per_Capita             282
Population_Density           1
Land_Area                    1
Latitude                     1
Longitude                    1
dtype: int64


**TODO:** Add explanations for the missing values shown above.
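
One way to support those explanations is to look at the share of missing values per column, not only the raw counts. The sketch below uses a small toy frame (an assumption for illustration) and builds a summary table sorted by percentage missing.

```python
import numpy as np
import pandas as pd

# Toy data; the real notebook would use `df` in place of `toy`.
toy = pd.DataFrame({
    "Financial_Flows": [1.0, np.nan, np.nan, 4.0],
    "Electricity_Access": [10.0, 20.0, np.nan, 40.0],
    "Year": [2000, 2001, 2002, 2003],
})

# Count and percentage of missing values per column, largest share first.
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```

Columns with a very high missing share may warrant a different treatment than columns with only a handful of gaps.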

In [9]:
df["Electricity_Access"] = df["Electricity_Access"].fillna(df["Electricity_Access"].median())
df["Clean_Cooking_Fuels"] = df["Clean_Cooking_Fuels"].fillna(df["Clean_Cooking_Fuels"].median())
df["Fossil_Electricity"] = df["Fossil_Electricity"].fillna(df["Fossil_Electricity"].mean())

df["Renewable_Electricity"] = df["Renewable_Electricity"].fillna(df["Renewable_Electricity"].mean())

df["Low_Carbon_Electricity"] = df["Low_Carbon_Electricity"].fillna(df["Low_Carbon_Electricity"].mean())
df["Renewable_Share"] = df["Renewable_Share"].fillna(df["Renewable_Share"].median())

In [10]:
df.loc[:, "Financial_Flows"] = df["Financial_Flows"].fillna(0)
df.loc[:, "Renewables_Percentage"] = df["Renewables_Percentage"].fillna(0)

In [11]:
df["Renewable_Capacity"]= df["Renewable_Capacity"].fillna(0)

In [12]:
df["CO2_Emissions"] = df["CO2_Emissions"].interpolate(method="linear")
df["Energy_Intensity"] = df["Energy_Intensity"].interpolate(method="linear")
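
Because the rows are stacked country by country, interpolating over the whole column can blend the last value of one country into the first value of the next. A possible alternative, sketched here on toy data (the frame and the `CO2_Interpolated` column name are assumptions), is to interpolate within each `Entity` after sorting by year.

```python
import numpy as np
import pandas as pd

# Toy panel: two entities, each with one gap.
toy = pd.DataFrame({
    "Entity": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "CO2_Emissions": [10.0, np.nan, 30.0, np.nan, 200.0, 300.0],
})

# Sort so that interpolation follows time order inside each entity.
toy = toy.sort_values(["Entity", "Year"])

# Interpolate per entity; a gap at the start of a series stays NaN
# instead of borrowing a value from the previous entity.
toy["CO2_Interpolated"] = (
    toy.groupby("Entity")["CO2_Emissions"]
       .transform(lambda s: s.interpolate(method="linear"))
)

print(toy)
```

Leading gaps that remain after this step would still need a separate decision, such as a backward fill within the entity or dropping those rows.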

In [13]:
# Forward-fill the single missing value in each of these columns
fill_cols = ["Population_Density", "Land_Area"]
df[fill_cols] = df[fill_cols].ffill()

In [14]:
df.dropna(subset=["GDP_Per_Capita", "GDP_Growth", "Latitude", "Longitude"], inplace=True)

In [15]:
missing_nuclear_region = df.groupby("Entity")["Nuclear_Electricity"].apply(lambda x: x.isnull().sum())

print(missing_nuclear_region.sort_values(ascending=False))

Entity
Tuvalu          21
Kazakhstan      21
Malaysia        21
Saudi Arabia    21
Indonesia       21
                ..
Germany          0
Ghana            0
Greece           0
Grenada          0
Zimbabwe         0
Name: Nuclear_Electricity, Length: 164, dtype: int64


Countries with missing nuclear values are assumed not to generate nuclear electricity, so their NA values are filled with 0:

In [16]:
df["Nuclear_Electricity"] = df["Nuclear_Electricity"].fillna(0).infer_objects(copy=False)

In [17]:
print(df.isnull().sum())  

Entity                    0
Year                      0
Electricity_Access        0
Clean_Cooking_Fuels       0
Renewable_Capacity        0
Financial_Flows           0
Renewable_Share           0
Fossil_Electricity        0
Nuclear_Electricity       0
Renewable_Electricity     0
Low_Carbon_Electricity    0
Energy_Consumption        0
Energy_Intensity          0
CO2_Emissions             0
Renewables_Percentage     0
GDP_Growth                0
GDP_Per_Capita            0
Population_Density        0
Land_Area                 0
Latitude                  0
Longitude                 0
dtype: int64


In [18]:
df.describe()

Unnamed: 0,Year,Electricity_Access,Clean_Cooking_Fuels,Renewable_Capacity,Financial_Flows,Renewable_Share,Fossil_Electricity,Nuclear_Electricity,Renewable_Electricity,Low_Carbon_Electricity,Energy_Consumption,Energy_Intensity,CO2_Emissions,Renewables_Percentage,GDP_Growth,GDP_Per_Capita,Land_Area,Latitude,Longitude
count,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0,3327.0
mean,2010.083258,78.939515,63.923204,84.272549,42201040.0,33.020595,75.916419,13.977256,26.030736,38.235381,26316.343933,5.367094,161904.6,5.342819,3.442334,13212.059684,662658.8,18.261721,14.507692
std,6.036966,30.42909,38.270414,219.623685,205450700.0,29.290711,362.822004,75.061398,108.822168,34.074744,35716.445816,3.530948,770338.7,11.756992,5.660445,19707.295067,1646526.0,24.659895,66.244741
min,2000.0,1.252269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,10.0,0.0,-62.07592,111.927225,21.0,-40.900557,-175.198242
25%,2005.0,61.727257,26.1,0.0,0.0,7.855,0.32,0.0,0.09,5.711435,3094.1572,3.255,2290.0,0.0,1.386902,1348.51634,28051.0,1.836898,-10.940835
50%,2010.0,98.36157,83.15,8.87,0.0,23.3,3.51,0.0,1.79,32.98445,13652.776,4.39,12585.0,0.0,3.558128,4576.387617,131957.0,17.570692,19.145136
75%,2015.0,100.0,100.0,71.75,2840000.0,53.7,29.87,0.0,10.995,65.05172,33674.6035,6.105,61665.0,4.829156,5.811262,15432.02339,513120.0,39.074208,45.079162
max,2020.0,100.0,100.0,3060.19,5202310000.0,96.04,5184.13,809.41,2184.94,100.00001,262585.7,32.57,10707220.0,86.836586,123.139555,123514.1967,9984670.0,64.963051,178.065032


In [19]:
df.dtypes

Entity                     object
Year                        int64
Electricity_Access        float64
Clean_Cooking_Fuels       float64
Renewable_Capacity        float64
Financial_Flows           float64
Renewable_Share           float64
Fossil_Electricity        float64
Nuclear_Electricity       float64
Renewable_Electricity     float64
Low_Carbon_Electricity    float64
Energy_Consumption        float64
Energy_Intensity          float64
CO2_Emissions             float64
Renewables_Percentage     float64
GDP_Growth                float64
GDP_Per_Capita            float64
Population_Density         object
Land_Area                 float64
Latitude                  float64
Longitude                 float64
dtype: object

In [20]:
df["Population_Density"] = df["Population_Density"].str.replace(",", "").astype(float)

In [21]:
print(df[["Entity", "Year", "Population_Density"]].dtypes)  # Should show float for Population_Density
print(df["Population_Density"].isnull().sum())  # Should return 0

Entity                 object
Year                    int64
Population_Density    float64
dtype: object
0


In [22]:
df["Population_Density"] = pd.to_numeric(df["Population_Density"], errors="coerce")

In [23]:
print(df["Population_Density"].dtype)
print(df["Population_Density"].isnull().sum()) 

float64
0


In [24]:
entity_categories = df["Entity"].astype("category").cat.categories
entity_mapping = {name: code for code, name in enumerate(entity_categories)}
print(entity_mapping)

{'Afghanistan': 0, 'Albania': 1, 'Algeria': 2, 'Angola': 3, 'Antigua and Barbuda': 4, 'Argentina': 5, 'Armenia': 6, 'Aruba': 7, 'Australia': 8, 'Austria': 9, 'Azerbaijan': 10, 'Bahrain': 11, 'Bangladesh': 12, 'Barbados': 13, 'Belarus': 14, 'Belgium': 15, 'Belize': 16, 'Benin': 17, 'Bermuda': 18, 'Bhutan': 19, 'Bosnia and Herzegovina': 20, 'Botswana': 21, 'Brazil': 22, 'Bulgaria': 23, 'Burkina Faso': 24, 'Burundi': 25, 'Cambodia': 26, 'Cameroon': 27, 'Canada': 28, 'Cayman Islands': 29, 'Central African Republic': 30, 'Chad': 31, 'Chile': 32, 'China': 33, 'Colombia': 34, 'Comoros': 35, 'Costa Rica': 36, 'Croatia': 37, 'Cuba': 38, 'Cyprus': 39, 'Denmark': 40, 'Djibouti': 41, 'Dominica': 42, 'Dominican Republic': 43, 'Ecuador': 44, 'El Salvador': 45, 'Equatorial Guinea': 46, 'Eritrea': 47, 'Estonia': 48, 'Eswatini': 49, 'Ethiopia': 50, 'Fiji': 51, 'Finland': 52, 'France': 53, 'Gabon': 54, 'Georgia': 55, 'Germany': 56, 'Ghana': 57, 'Greece': 58, 'Grenada': 59, 'Guatemala': 60, 'Guinea': 61,

In [25]:
# df["Entity"] = df["Entity"].astype("category").cat.codes  # Convert country names to numeric codes
# df["Entity"].dtypes
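
If the numeric encoding is applied later, it may help to keep the country names and store the codes in a separate column, together with a reverse lookup. The sketch below shows one way to do that on toy data; the `Entity_Code` column and `code_to_entity` dictionary are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({"Entity": ["Ghana", "Albania", "Ghana", "Chile"]})

# Categories are sorted alphabetically, so codes are stable for a given set of names.
entity_cat = toy["Entity"].astype("category")
toy["Entity_Code"] = entity_cat.cat.codes

# Reverse lookup from integer code back to the country name.
code_to_entity = dict(enumerate(entity_cat.cat.categories))

print(toy)
print(code_to_entity)
```

Keeping the original `Entity` column alongside the codes makes later results easier to read back as country names.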

### Handling Outliers

In [27]:
# Remove extreme outliers (e.g., beyond 3 standard deviations)
for col in ["GDP_Per_Capita", "CO2_Emissions", "Renewable_Share", "Energy_Intensity"]:
    df = df[np.abs(df[col] - df[col].mean()) <= (3 * df[col].std())]
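
The objectives also mention the IQR rule as an option. As a rough sketch of that alternative (not what this notebook applies), the helper below keeps values within `k` times the interquartile range of the quartiles; the function name `iqr_mask` and the toy values are assumptions.

```python
import pandas as pd

def iqr_mask(series, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Toy column with one obvious extreme value.
toy = pd.DataFrame({"GDP_Per_Capita": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})

filtered = toy[iqr_mask(toy["GDP_Per_Capita"])]
print(filtered)
```

The IQR rule relies on quartiles rather than the mean and standard deviation, so it tends to be less influenced by the extreme values it is trying to detect.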

### Normalize Numerical Features

In [29]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[["GDP_Per_Capita", "Energy_Intensity", "CO2_Emissions", "Renewable_Share"]] = scaler.fit_transform(
    df[["GDP_Per_Capita", "Energy_Intensity", "CO2_Emissions", "Renewable_Share"]])
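
For reference, `MinMaxScaler` with its default range maps each column to [0, 1] using (x - min) / (max - min). The sketch below reproduces that arithmetic with NumPy on made-up values, purely to illustrate what the scaled columns represent.

```python
import numpy as np

# Illustrative values only.
values = np.array([10.0, 20.0, 40.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
scaled = (values - values.min()) / (values.max() - values.min())

# The transformation can be undone if the original min and max are kept.
restored = scaled * (values.max() - values.min()) + values.min()

print(scaled)
print(restored)
```

After this step the scaled columns no longer carry their original units, which matters when they are combined with unscaled columns.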

In [30]:
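# Feature engineering
# Note: Renewable_Share and CO2_Emissions were min-max scaled in the previous cell.
df["Energy_Dependency"] = df["Renewable_Share"] * df["Financial_Flows"]
# Despite its name, this divides emissions by population density, not by population.
df["CO2_Intensity_Per_Capita"] = df["CO2_Emissions"] / df["Population_Density"]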
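
The objectives list a "Renewables-to-Fossil Ratio" as an example feature. A minimal sketch of how it might be computed is shown below on toy values; the column name `Renewables_to_Fossil_Ratio` is an assumption, and rows with zero fossil generation are left as NaN rather than producing infinite values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Renewable_Electricity": [0.31, 10.0, 5.0],
    "Fossil_Electricity": [0.16, 40.0, 0.0],
})

# Replace zero denominators with NaN so the division does not yield inf.
fossil = toy["Fossil_Electricity"].replace(0, np.nan)
toy["Renewables_to_Fossil_Ratio"] = toy["Renewable_Electricity"] / fossil

print(toy)
```

How to treat the resulting NaN rows (for example, capping or a separate indicator column) is a modeling choice left open here.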

In [31]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[["Energy_Dependency", "CO2_Intensity_Per_Capita"]] = scaler.fit_transform(df[["Energy_Dependency", "CO2_Intensity_Per_Capita"]])

In [32]:
print(df.columns) 

Index(['Entity', 'Year', 'Electricity_Access', 'Clean_Cooking_Fuels',
       'Renewable_Capacity', 'Financial_Flows', 'Renewable_Share',
       'Fossil_Electricity', 'Nuclear_Electricity', 'Renewable_Electricity',
       'Low_Carbon_Electricity', 'Energy_Consumption', 'Energy_Intensity',
       'CO2_Emissions', 'Renewables_Percentage', 'GDP_Growth',
       'GDP_Per_Capita', 'Population_Density', 'Land_Area', 'Latitude',
       'Longitude', 'Energy_Dependency', 'CO2_Intensity_Per_Capita'],
      dtype='object')


In [33]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 3142 entries, 3 to 3648
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Entity                    3142 non-null   object 
 1   Year                      3142 non-null   int64  
 2   Electricity_Access        3142 non-null   float64
 3   Clean_Cooking_Fuels       3142 non-null   float64
 4   Renewable_Capacity        3142 non-null   float64
 5   Financial_Flows           3142 non-null   float64
 6   Renewable_Share           3142 non-null   float64
 7   Fossil_Electricity        3142 non-null   float64
 8   Nuclear_Electricity       3142 non-null   float64
 9   Renewable_Electricity     3142 non-null   float64
 10  Low_Carbon_Electricity    3142 non-null   float64
 11  Energy_Consumption        3142 non-null   float64
 12  Energy_Intensity          3142 non-null   float64
 13  CO2_Emissions             3142 non-null   float64
 14  Renewables_Pe

Unnamed: 0,Entity,Year,Electricity_Access,Clean_Cooking_Fuels,Renewable_Capacity,Financial_Flows,Renewable_Share,Fossil_Electricity,Nuclear_Electricity,Renewable_Electricity,...,CO2_Emissions,Renewables_Percentage,GDP_Growth,GDP_Per_Capita,Population_Density,Land_Area,Latitude,Longitude,Energy_Dependency,CO2_Intensity_Per_Capita
3,Afghanistan,2003,14.738506,9.5,8.09,25970000.0,0.381716,0.31,0.0,0.63,...,0.000493,0.0,8.832278,0.001068,60.0,652230.0,33.93911,67.709953,-0.071982,-0.201462
4,Afghanistan,2004,20.064968,10.9,7.75,0.0,0.460641,0.33,0.0,0.56,...,0.000415,0.0,1.414118,0.001354,60.0,652230.0,33.93911,67.709953,-0.144614,-0.201667
5,Afghanistan,2005,25.390894,12.2,7.51,9830000.0,0.35277,0.34,0.0,0.59,...,0.000627,0.0,11.229715,0.001779,60.0,652230.0,33.93911,67.709953,-0.119206,-0.201107
6,Afghanistan,2006,30.71869,13.85,7.4,10620000.0,0.332049,0.2,0.0,0.64,...,0.000712,0.0,5.357403,0.002079,60.0,652230.0,33.93911,67.709953,-0.118777,-0.200881
7,Afghanistan,2007,36.05101,15.3,7.25,15750000.0,0.299667,0.2,0.0,0.75,...,0.000717,0.0,13.82632,0.003408,60.0,652230.0,33.93911,67.709953,-0.110033,-0.20087


In [34]:
df.describe()

Unnamed: 0,Year,Electricity_Access,Clean_Cooking_Fuels,Renewable_Capacity,Financial_Flows,Renewable_Share,Fossil_Electricity,Nuclear_Electricity,Renewable_Electricity,Low_Carbon_Electricity,...,CO2_Emissions,Renewables_Percentage,GDP_Growth,GDP_Per_Capita,Population_Density,Land_Area,Latitude,Longitude,Energy_Dependency,CO2_Intensity_Per_Capita
count,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,...,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0
mean,2010.078931,78.172628,62.794192,86.297791,43085890.0,0.352459,40.767936,8.829548,17.454812,38.348912,...,0.034573,5.081878,3.420967,0.152783,231.314768,564933.4,17.409283,14.797663,-4.522869e-18,1.809148e-17
std,6.047104,30.573357,38.478899,223.977507,209618300.0,0.303265,115.562198,43.962331,54.425831,33.899363,...,0.085634,11.149937,5.695488,0.209929,730.175864,1333619.0,24.73419,66.327497,1.000159,1.000159
min,2000.0,1.252269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,-62.07592,0.0,2.0,21.0,-40.900557,-175.198242,-0.1446136,-0.2027639
25%,2005.0,58.394294,24.0,0.0,0.0,0.092071,0.31,0.0,0.1,5.882353,...,0.000916,0.0,1.402864,0.016748,30.0,28748.0,1.373333,-9.696645,-0.1446136,-0.2009069
50%,2010.0,97.7,83.15,11.005,0.0,0.246668,3.465,0.0,1.775,33.64052,...,0.004782,0.0,3.602344,0.058493,84.0,130370.0,15.870032,19.37439,-0.1446136,-0.189601
75%,2015.0,100.0,100.0,73.395,3427500.0,0.569398,29.3175,0.0,10.1075,65.006385,...,0.024581,4.50373,5.776738,0.189192,206.0,513120.0,39.074208,45.038189,-0.1390404,-0.1449973
max,2020.0,100.0,100.0,3060.19,5202310000.0,1.0,2431.9,789.88,821.4,100.00001,...,1.0,86.836586,123.139555,1.0,8358.0,9984670.0,64.963051,178.065032,32.52648,9.904354


In [35]:
datapath = 'C:/Users/aamal/Desktop/Springboard/Springboard_DataScience/Capstone-3-Energy/data'
energy_data_cleaned = 'energy_data_cleaned.csv'
filepath = os.path.join(datapath, energy_data_cleaned)

df.to_csv(filepath, index=False)
print(f"Data saved successfully to '{filepath}'")

Data saved successfully to 'C:/Users/aamal/Desktop/Springboard/Springboard_DataScience/Capstone-3-Energy/data\energy_data_cleaned.csv'
