# Machine Learning Models for Energy Analytics

## Overview
This notebook implements three types of machine learning models to analyse energy consumption and CO₂ emissions:
1. **Linear Regression** - Predicting energy consumption
2. **K-Means Clustering** - Grouping countries by energy profiles
3. **Decision Tree Classification** - Classifying energy categories

## Learning Objectives
- Understand supervised vs unsupervised learning
- Implement regression, clustering, and classification models
- Evaluate model performance using appropriate metrics
- Interpret model results for business insights
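
The difference between supervised and unsupervised learning shows up directly in how the models are fitted. The sketch below is illustrative only: `X_toy` and `y_toy` are made-up data, not the energy dataset. It shows that a supervised model receives a target, while an unsupervised model receives features alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Toy data (hypothetical, not the energy dataset): two well-separated groups
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                   rng.normal(5, 0.5, size=(20, 2))])
y_toy = 2.0 * X_toy[:, 0] + 1.0   # a known target for the supervised case

# Supervised: the model is given both the features and the target
supervised = LinearRegression().fit(X_toy, y_toy)

# Unsupervised: the model is given only the features and looks for structure
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

print("Learned coefficients:", supervised.coef_)
print("Cluster labels:", unsupervised.labels_)
```

Linear regression and decision tree classification in this notebook follow the first pattern; K-Means follows the second.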

## Libraries Used
- **scikit-learn**: Machine learning algorithms and tools
- **pandas/numpy**: Data manipulation
- **matplotlib/seaborn**: Visualisation

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import (mean_squared_error, r2_score, mean_absolute_error,
                             silhouette_score, classification_report, confusion_matrix,
                             accuracy_score)
import warnings
warnings.filterwarnings('ignore')

# Set visualisation style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Load Cleaned Data

In [2]:
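from pathlib import Path
import os

# Move up one level so that paths resolve from the project root.
# Note: re-running this cell moves up another level each time.
current_dir = Path.cwd()
parent = current_dir.parent

os.chdir(parent)
current_dir = str(Path.cwd())   # update the variable so future code is consistent
print("New current directory:", current_dir)

# Build the path with pathlib so it works on any operating system
processed_file_path = Path(current_dir) / 'dataset' / 'processed' / 'cleaned_energy_data.csv'
df = pd.read_csv(processed_file_path)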

New current directory: d:\Code Institute\Energy-Consumption-CO2-Emissions-Analysis


In [3]:
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
df.head()

Dataset shape: (46000, 16)

Columns: ['Country', 'Energy_type', 'Year', 'Energy_consumption', 'Energy_production', 'GDP', 'Population', 'Energy_intensity_per_capita', 'Energy_intensity_by_GDP', 'CO2_emission', 'Energy_category', 'Energy_balance', 'CO2_per_capita', 'Energy_efficiency', 'Decade', 'Energy_source_type']


| | Country | Energy_type | Year | Energy_consumption | Energy_production | GDP | Population | Energy_intensity_per_capita | Energy_intensity_by_GDP | CO2_emission | Energy_category | Energy_balance | CO2_per_capita | Energy_efficiency | Decade | Energy_source_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | coal | 1980 | 0.002479 | 0.002355 | NaN | 13356.5 | 1.990283 | 0.0 | 0.0 | coal | -0.000124 | 0.0 | NaN | 1980 | Fossil Fuel |
| 1 | Afghanistan | natural_gas | 1980 | 0.002094 | 0.06282 | NaN | 13356.5 | 1.990283 | 0.0 | 0.0 | natural gas | 0.060726 | 0.0 | NaN | 1980 | Fossil Fuel |
| 2 | Afghanistan | petroleum_n_other_liquids | 1980 | 0.014624 | 0.0 | NaN | 13356.5 | 1.990283 | 0.0 | 0.0 | petroleum | -0.014624 | 0.0 | NaN | 1980 | Fossil Fuel |
| 3 | Afghanistan | nuclear | 1980 | 0.0 | 0.0 | NaN | 13356.5 | 1.990283 | 0.0 | 0.0 | nuclear | 0.0 | 0.0 | 0.0 | 1980 | Nuclear |
| 4 | Afghanistan | renewables_n_other | 1980 | 0.007386 | 0.007386 | NaN | 13356.5 | 1.990283 | 0.0 | 0.0 | renewables | 0.0 | 0.0 | NaN | 1980 | Renewable |


## 2. Linear Regression - Predicting Energy Consumption

### Objective
Predict energy consumption based on GDP, Population, and Year.

### Why Linear Regression?
- Simple, interpretable model
- Works well for continuous target variables
- Coefficients show the direction and size of each feature's effect, and can be compared across features once the features are standardised
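
To illustrate how coefficients can be read, here is a minimal sketch on synthetic data. The features `big` and `small` and all the numbers are invented for the example and are not taken from the energy dataset. It suggests why coefficients are easier to compare after standardisation: the raw coefficient of a large-magnitude feature looks tiny even when the feature matters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): two features on very different scales
rng = np.random.default_rng(42)
big = rng.normal(1_000_000, 200_000, size=500)   # a GDP-like magnitude
small = rng.normal(10, 2, size=500)              # a small-scale feature
X_demo = np.column_stack([big, small])
y_demo = 0.00001 * big + 3.0 * small + rng.normal(0, 0.1, size=500)

# Raw coefficients: change in the target per one unit of each feature
raw_model = LinearRegression().fit(X_demo, y_demo)

# Standardised coefficients: change in the target per one standard deviation
X_std = StandardScaler().fit_transform(X_demo)
std_model = LinearRegression().fit(X_std, y_demo)

print("Raw coefficients:", raw_model.coef_)
print("Standardised coefficients:", std_model.coef_)
```

Both models make the same predictions; standardising only changes the units in which the coefficients are expressed.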

In [4]:
# Prepare data for regression
regression_df = df[['GDP', 'Population', 'Year', 'Energy_consumption']].dropna()
print(f"Dataset size: {len(regression_df):,} records")

# Features and target
X_reg = regression_df[['GDP', 'Population', 'Year']]
y_reg = regression_df['Energy_consumption']

# Split data (80% train, 20% test)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

print(f"\nTraining set: {len(X_train_reg):,} samples")
print(f"Test set: {len(X_test_reg):,} samples")

Dataset size: 33,155 records

Training set: 26,524 samples
Test set: 6,631 samples
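
The regression dataset has 33,155 records against 46,000 rows in the full dataset because `dropna()` removes every row with a missing value in any of the four selected columns. The sketch below reproduces that behaviour on a small, made-up frame (the values are hypothetical) and shows one way to count missing values per column before dropping them.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps, mirroring the missing GDP
# values visible in the data preview
toy = pd.DataFrame({
    "GDP": [np.nan, 10.0, 12.0, np.nan],
    "Population": [100.0, 110.0, np.nan, 130.0],
    "Year": [1980, 1981, 1982, 1983],
    "Energy_consumption": [1.0, 1.2, 1.1, 1.4],
})

missing_per_column = toy.isna().sum()   # number of NaN values in each column
complete_rows = toy.dropna()            # keeps only rows with no NaN at all

print(missing_per_column)
print(f"Rows kept: {len(complete_rows)} of {len(toy)}")
```

Running the same `isna().sum()` check on the real `df` would indicate which columns account for the dropped rows.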


In [6]:
# Standardise features; fit the scaler on the training set only to avoid data leakage
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)
print("✓ Features scaled using StandardScaler")
print(f"\nScaled feature means: {X_train_reg_scaled.mean(axis=0)}")
print(f"Scaled feature std devs: {X_train_reg_scaled.std(axis=0)}")

✓ Features scaled using StandardScaler

Scaled feature means: [-2.94675392e-18 -1.63410899e-17 -7.20159869e-15]
Scaled feature std devs: [1. 1. 1.]
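
The means printed above are on the order of 1e-15 to 1e-18 rather than exactly zero, which is floating-point rounding. The following sketch uses random stand-in matrices (`train` and `test` are hypothetical, not the notebook's data) to show that the scaler stores the training statistics and reuses them on the test set, so the scaled test means are close to zero but not exactly zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the training and test feature matrices
rng = np.random.default_rng(1)
train = rng.normal(50, 10, size=(200, 3))
test = rng.normal(50, 10, size=(50, 3))

scaler = StandardScaler().fit(train)      # statistics come from training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # reuses the stored training statistics

print("Stored training means:", scaler.mean_)
print("Scaled train means:", train_scaled.mean(axis=0))   # zero up to rounding error
print("Scaled test means: ", test_scaled.mean(axis=0))    # near zero, not exactly zero
```

Fitting the scaler on the training split alone keeps information from the test set out of the model, which is why the notebook calls `fit_transform` on the training data and only `transform` on the test data.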
