# **Energy Consumption & CO2 Emissions Analysis**

## About Dataset

The world grows more industrialized every year, and more polluted along with it.

This data was pulled from the U.S. Energy Information Administration (EIA) and joined into a single table for easier analysis. It collects some of the main factors behind CO2 emissions: the production and consumption of each major energy source, for each country and year, together with the resulting CO2 emissions. It also includes each country's GDP, population, energy intensity per capita (per person), and energy intensity per unit of GDP. The data spans 1980 to 2020.

### Feature Descriptions:
* **Country** - Country in question
* **Energy_type** - Type of energy source
* **Year** - Year the data was recorded
* **Energy_consumption** - Consumption of the specific energy source, measured in quad Btu
* **Energy_production** - Production of the specific energy source, measured in quad Btu
* **GDP** - Country's GDP at purchasing power parity, measured in billion 2015$ PPP
* **Population** - Population of the country, measured in Mperson
* **Energy_intensity_per_capita** - Energy intensity is a measure of the energy inefficiency of an economy. Here it is calculated as units of energy per capita (per person), measured in MMBtu/person
* **Energy_intensity_by_GDP** - Energy intensity is a measure of the energy inefficiency of an economy. Here it is calculated as units of energy per unit of GDP, measured in 1000 Btu/2015$ GDP PPP
* **CO2_emission** - The amount of CO2 emitted, measured in MMtonnes CO2
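
As a quick sanity check on the units, the two intensity columns can be reproduced from consumption, population, and GDP. The sketch below uses the World 1980 `all_energy_types` row of the dataset and assumes that `Mperson` means thousands of people (an interpretation, not something the source states). The helper names are hypothetical.

```python
# Values copied from the World / all_energy_types / 1980 row of the dataset.
consumption_quad_btu = 292.899790      # quad Btu (1 quad = 1e15 Btu)
population_mperson = 4.298127e+06      # ASSUMED to be thousands of people
gdp_billion_ppp = 27770.910281         # billion 2015$ PPP


def intensity_per_capita(consumption_quad, population_thousands):
    """Return MMBtu per person (hypothetical helper)."""
    btu = consumption_quad * 1e15
    persons = population_thousands * 1e3
    return btu / persons / 1e6          # Btu -> MMBtu


def intensity_by_gdp(consumption_quad, gdp_billion):
    """Return thousand Btu per 2015$ of GDP (hypothetical helper)."""
    btu = consumption_quad * 1e15
    dollars = gdp_billion * 1e9
    return btu / dollars / 1e3          # Btu -> 1000 Btu


per_capita = intensity_per_capita(consumption_quad_btu, population_mperson)
by_gdp = intensity_by_gdp(consumption_quad_btu, gdp_billion_ppp)
print(round(per_capita, 3), round(by_gdp, 3))
```

Under that assumption both results agree with the values recorded in the same row (68.145921 and 10.547) to about three decimal places.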

## Objectives

1. Analyze global energy consumption patterns and CO2 emissions (1980-2020)
2. Investigate relationships between GDP, population, and energy consumption
3. Identify trends in renewable vs non-renewable energy adoption
4. Test hypotheses about economic and demographic factors affecting emissions
5. Build predictive models for energy consumption and emissions
6. Create an interactive dashboard for data exploration and insights

## Inputs

* Data source: https://www.kaggle.com/datasets/lobosi/c02-emission-by-countrys-grouth-and-population

## Outputs

* Cleaned dataset to use in the next step.

---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'd:\\Code Institute\\Energy-Consumption-CO2-Emissions-Analysis\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'd:\\Code Institute\\Energy-Consumption-CO2-Emissions-Analysis'

In [5]:
input_file_path = os.path.join(current_dir, 'dataset', 'raw', 'energy.csv')
input_file_path

'd:\\Code Institute\\Energy-Consumption-CO2-Emissions-Analysis\\dataset\\raw\\energy.csv'
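
The same parent-folder and file-path logic can also be expressed with `pathlib`. This is only a sketch: it uses `PureWindowsPath` with the folder printed earlier so that it runs on any operating system, whereas a live notebook would more likely start from `Path.cwd()`.

```python
from pathlib import PureWindowsPath

# The notebook folder shown earlier; PureWindowsPath only manipulates the
# path string, so nothing on disk is touched.
notebook_dir = PureWindowsPath(
    r'd:\Code Institute\Energy-Consumption-CO2-Emissions-Analysis\jupyter_notebooks'
)

# .parent plays the role of os.path.dirname()
project_root = notebook_dir.parent

# The / operator joins path segments with the correct separator
input_file = project_root / 'dataset' / 'raw' / 'energy.csv'

print(project_root)
print(input_file)
```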

## Extraction

Import Packages

In [6]:
import numpy as np
import pandas as pd

Load the dataset

In [8]:
# Load the dataset
df = pd.read_csv(input_file_path)
# Display the DataFrame (pandas shows only the first and last rows)
df

| | Unnamed: 0 | Country | Energy_type | Year | Energy_consumption | Energy_production | GDP | Population | Energy_intensity_per_capita | Energy_intensity_by_GDP | CO2_emission |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | World | all_energy_types | 1980 | 292.899790 | 296.337228 | 27770.910281 | 4.298127e+06 | 68.145921 | 10.547000 | 4946.627130 |
| 1 | 1 | World | coal | 1980 | 78.656134 | 80.114194 | 27770.910281 | 4.298127e+06 | 68.145921 | 10.547000 | 1409.790188 |
| 2 | 2 | World | natural_gas | 1980 | 53.865223 | 54.761046 | 27770.910281 | 4.298127e+06 | 68.145921 | 10.547000 | 1081.593377 |
| 3 | 3 | World | petroleum_n_other_liquids | 1980 | 132.064019 | 133.111109 | 27770.910281 | 4.298127e+06 | 68.145921 | 10.547000 | 2455.243565 |
| 4 | 4 | World | nuclear | 1980 | 7.575700 | 7.575700 | 27770.910281 | 4.298127e+06 | 68.145921 | 10.547000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55435 | 55435 | Zimbabwe | coal | 2019 | 0.045064 | 0.075963 | 37.620400 | 1.465420e+04 | 11.508701 | 4.482962 | 4.586869 |
| 55436 | 55436 | Zimbabwe | natural_gas | 2019 | 0.000000 | 0.000000 | 37.620400 | 1.465420e+04 | 11.508701 | 4.482962 | 0.000000 |
| 55437 | 55437 | Zimbabwe | petroleum_n_other_liquids | 2019 | 0.055498 | 0.000000 | 37.620400 | 1.465420e+04 | 11.508701 | 4.482962 | 4.377890 |
| 55438 | 55438 | Zimbabwe | nuclear | 2019 | NaN | NaN | 37.620400 | 1.465420e+04 | 11.508701 | 4.482962 | 0.000000 |


---

## Transformation & Cleaning

Remove the unnamed index column left over from the CSV export

In [9]:
if 'Unnamed: 0' in df.columns or '' in df.columns:
    df = df.drop(columns=[col for col in df.columns if 'Unnamed' in str(col) or col == ''])
df.head()

| | Country | Energy_type | Year | Energy_consumption | Energy_production | GDP | Population | Energy_intensity_per_capita | Energy_intensity_by_GDP | CO2_emission |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | World | all_energy_types | 1980 | 292.89979 | 296.337228 | 27770.910281 | 4298127.0 | 68.145921 | 10.547 | 4946.62713 |
| 1 | World | coal | 1980 | 78.656134 | 80.114194 | 27770.910281 | 4298127.0 | 68.145921 | 10.547 | 1409.790188 |
| 2 | World | natural_gas | 1980 | 53.865223 | 54.761046 | 27770.910281 | 4298127.0 | 68.145921 | 10.547 | 1081.593377 |
| 3 | World | petroleum_n_other_liquids | 1980 | 132.064019 | 133.111109 | 27770.910281 | 4298127.0 | 68.145921 | 10.547 | 2455.243565 |
| 4 | World | nuclear | 1980 | 7.5757 | 7.5757 | 27770.910281 | 4298127.0 | 68.145921 | 10.547 | 0.0 |
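
An alternative to dropping the column after loading is to tell `pd.read_csv` that the first column is the index. The sketch below uses a small in-memory CSV with made-up rows (not the real file) to show the difference between the two calls.

```python
import io

import pandas as pd

# Tiny stand-in for the raw file: the first header cell is empty,
# which is what produces the 'Unnamed: 0' column on a default load.
sample_csv = (
    ",Country,Energy_type,Year\n"
    "0,World,coal,1980\n"
    "1,World,nuclear,1980\n"
)

df_default = pd.read_csv(io.StringIO(sample_csv))
df_indexed = pd.read_csv(io.StringIO(sample_csv), index_col=0)

print(list(df_default.columns))  # includes 'Unnamed: 0'
print(list(df_indexed.columns))  # only the real columns
```

Either approach ends with the same columns. Dropping afterwards, as done in this notebook, has the advantage of also working when the file contains more than one unnamed column.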


Check for missing data

In [10]:
df.isnull().sum()

Country                            0
Energy_type                        0
Year                               0
Energy_consumption             11153
Energy_production              11151
GDP                            15414
Population                      9426
Energy_intensity_per_capita     5082
Energy_intensity_by_GDP         5082
CO2_emission                    3826
dtype: int64

In [11]:
missing_counts = df.isnull().sum()
missing_pct = (missing_counts / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing_Count': missing_counts,
    'Percentage': missing_pct
})
missing_df[missing_df['Missing_Count'] > 0]

| | Missing_Count | Percentage |
|---|---|---|
| Energy_consumption | 11153 | 20.117244 |
| Energy_production | 11151 | 20.113636 |
| GDP | 15414 | 27.80303 |
| Population | 9426 | 17.002165 |
| Energy_intensity_per_capita | 5082 | 9.166667 |
| Energy_intensity_by_GDP | 5082 | 9.166667 |
| CO2_emission | 3826 | 6.901154 |


Make sure the numeric columns are stored as floats (any non-numeric entries become NaN)

In [12]:
numeric_columns = ['Energy_consumption', 'Energy_production', 'GDP', 'Population', 
                   'Energy_intensity_per_capita', 'Energy_intensity_by_GDP', 'CO2_emission']

for col in numeric_columns:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
        print(f"✓ Converted {col} to numeric")

✓ Converted Energy_consumption to numeric
✓ Converted Energy_production to numeric
✓ Converted GDP to numeric
✓ Converted Population to numeric
✓ Converted Energy_intensity_per_capita to numeric
✓ Converted Energy_intensity_by_GDP to numeric
✓ Converted CO2_emission to numeric
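
For reference, this is how `errors='coerce'` behaves on a column that mixes numeric strings with a placeholder. The series below is a made-up example, not data from the file.

```python
import pandas as pd

# Hypothetical raw column: two numeric strings and one placeholder
raw = pd.Series(['78.656134', 'n/a', '7.5757'])

# Unparseable entries are turned into NaN instead of raising an error
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)
print(converted.isna().tolist())
```

Because coercion silently turns bad values into NaN, it is worth re-running the missing-value count afterwards to see whether the conversion introduced any new gaps.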


Check Data Types

In [13]:
df.dtypes

Country                         object
Energy_type                     object
Year                             int64
Energy_consumption             float64
Energy_production              float64
GDP                            float64
Population                     float64
Energy_intensity_per_capita    float64
Energy_intensity_by_GDP        float64
CO2_emission                   float64
dtype: object

Handle missing values
- Fill CO2_emission and Energy_intensity_by_GDP with 0

In [14]:
df['CO2_emission'] = df['CO2_emission'].fillna(0)
df['Energy_intensity_by_GDP'] = df['Energy_intensity_by_GDP'].fillna(0)

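For other numeric columns, use forward fill then backward fill within each Country and Energy_type group

In [15]:
for col in numeric_columns:
    if col in df.columns and df[col].isnull().sum() > 0:
        # Apply both fills inside each group; a chained .bfill() would span the whole column
        df[col] = df.groupby(['Country', 'Energy_type'])[col].transform(lambda s: s.ffill().bfill())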
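
One thing to watch with grouped filling: `groupby(...)[col].ffill()` returns an ordinary Series, so a `.bfill()` chained onto it runs over the whole column rather than within each group. Since the rows of this dataset are ordered by year first, neighbouring rows belong to different groups and values can leak between them. The toy frame below (hypothetical values) illustrates the difference.

```python
import numpy as np
import pandas as pd

# Rows interleave two countries, as in the real file (ordered by year first)
toy = pd.DataFrame({
    'Country': ['A', 'B', 'A', 'B'],
    'Year': [1980, 1980, 1981, 1981],
    'GDP': [np.nan, 5.0, 10.0, np.nan],
})

# Chained: ffill is per group, but bfill runs over the whole column,
# so A's missing 1980 value is taken from country B's row.
chained = toy.groupby('Country')['GDP'].ffill().bfill()

# Per group: both fills stay inside each country
per_group = toy.groupby('Country')['GDP'].transform(lambda s: s.ffill().bfill())

print(chained.tolist())
print(per_group.tolist())
```

With the per-group version, a group that has no values at all stays NaN, which is usually preferable to borrowing a number from an unrelated country.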

Remove duplicates

In [16]:
initial_rows = len(df)
df = df.drop_duplicates()
print(f"✓ Removed {initial_rows - len(df)} duplicate rows")

✓ Removed 0 duplicate rows


Remove the rows where Energy_type is 'all_energy_types', since these are totals across all energy types and would double-count the individual sources.

In [17]:
rows_before_filter = len(df)
df = df[df['Energy_type'].str.lower() != 'all_energy_types']
print(f"✓ Removed {rows_before_filter - len(df)} rows with 'all_energy_types' (aggregated data)")

✓ Removed 9240 rows with 'all_energy_types' (aggregated data)
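
To see why the aggregated rows are dropped, consider a toy frame (hypothetical numbers) in which the `all_energy_types` row equals the sum of the individual sources. Summing the column with that row included counts every unit of energy twice.

```python
import pandas as pd

# Hypothetical single-country, single-year slice
toy = pd.DataFrame({
    'Energy_type': ['all_energy_types', 'coal', 'nuclear'],
    'Energy_consumption': [10.0, 7.0, 3.0],
})

total_with_aggregate = toy['Energy_consumption'].sum()

# Same filter as in the cleaning step
sources_only = toy[toy['Energy_type'].str.lower() != 'all_energy_types']
total_sources_only = sources_only['Energy_consumption'].sum()

print(total_with_aggregate, total_sources_only)
```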


### Feature Engineering - Create additional features

Decade column for trend analysis

In [21]:
df['Decade'] = (df['Year'] // 10) * 10

Energy balance (production - consumption)

In [22]:
df['Energy_balance'] = df['Energy_production'] - df['Energy_consumption']

Per capita CO2 emissions

In [23]:
df['CO2_per_capita'] = df['CO2_emission'] / df['Population']
df['CO2_per_capita'] = df['CO2_per_capita'].replace([np.inf, -np.inf], 0).fillna(0)
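
The three engineered features can be checked on a tiny frame. The sketch below borrows a few numbers from the rows shown earlier, but the years and the zero population in the second row are made up to exercise the division-by-zero guard.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Year': [1987, 2019],
    'Energy_production': [80.114194, 0.075963],
    'Energy_consumption': [78.656134, 0.045064],
    'CO2_emission': [1409.790188, 4.586869],
    'Population': [4298127.0, 0.0],   # 0.0 is hypothetical
})

# Floor each year to the start of its decade
toy['Decade'] = (toy['Year'] // 10) * 10

# Positive balance means the row produces more than it consumes
toy['Energy_balance'] = toy['Energy_production'] - toy['Energy_consumption']

# Division by zero yields inf (or NaN for 0/0); both are replaced with 0
toy['CO2_per_capita'] = toy['CO2_emission'] / toy['Population']
toy['CO2_per_capita'] = toy['CO2_per_capita'].replace([np.inf, -np.inf], 0).fillna(0)

print(toy[['Decade', 'Energy_balance', 'CO2_per_capita']])
```

Note that replacing undefined ratios with 0 makes them indistinguishable from rows with genuinely zero emissions. Keeping them as NaN would be an alternative if that distinction matters later in the analysis.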

Energy category
Creates the Energy_category column in DataFrame with the following mappings:

- "coal" → "coal"
- "natural_gas" → "natural gas"
- "petroleum_n_other_liquids" → "petroleum"
- "nuclear" → "nuclear"
- "renewables_n_other" → "renewables"
- Any other value → "Other"

In [24]:
# Map raw Energy_type labels to display categories; unmatched labels become 'Other'
category_map = {
    'coal': 'coal',
    'natural_gas': 'natural gas',
    'petroleum_n_other_liquids': 'petroleum',
    'nuclear': 'nuclear',
    'renewables_n_other': 'renewables',
}
df['Energy_category'] = (
    df['Energy_type'].astype(str).str.lower().map(category_map).fillna('Other')
)