# Global CO₂ Emissions Responsibility: Total vs Per-Capita vs Historical Accountability Dashboard

## Objectives

The objective of this notebook is to perform the ETL (Extract, Transform, Load) process for the CO₂ emissions dataset. This notebook aims to:

* Extract the raw CO₂ dataset obtained from Our World in Data
* Clean and prepare the data for analysis
* Filter the dataset to focus on relevant years (from 1990 onward for trend analysis)
* Create derived metrics such as cumulative emissions and growth rates
* Prepare structured datasets for use in the BI dashboard

This ensures the data is accurate, consistent, and ready for analysis and visualisation.

## Inputs

* Raw CO₂ emissions dataset obtained from Our World in Data, stored as DataSet/Raw/Co2_emissions.csv

## Outputs

* Cleaned dataset saved as DataSet/Cleaned/Co2_emissions_cleaned.csv, covering 1990 onward and containing only rows with a CO₂ value and an ISO code

## Additional Comments

* The year 1990 was selected as the baseline for trend analysis because it is widely used in international climate agreements and provides consistent reporting coverage.
* Only aggregated national data is used; no personal data is included.
* All data transformations are documented in this notebook to ensure transparency and reproducibility.
* Version control is used to track changes and maintain governance best practices.



---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [20]:
import os
current_dir = os.getcwd()
current_dir

'\\\\talktalk\\redirectedfolders\\F.Afolabi\\Documents\\Global Co2 Emissions project 3'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [36]:
os.chdir(r"\\talktalk\redirectedfolders\F.Afolabi\Documents\Global Co2 Emissions project 3\Global-Co2-Emissions-Responsibility")
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [37]:
current_dir = os.getcwd()
current_dir

'\\\\talktalk\\redirectedfolders\\F.Afolabi\\Documents\\Global Co2 Emissions project 3\\Global-Co2-Emissions-Responsibility'
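The cells above set the working directory with a hard-coded absolute path, which only works on one machine. As a minimal sketch of the os.path.dirname() and os.chdir() approach described in the text, the helper below moves to the parent of whatever the current directory is. The function name `change_to_parent_directory` is hypothetical and introduced here for illustration only.

```python
import os


def change_to_parent_directory() -> str:
    """Make the parent of the current working directory the new working directory.

    Hypothetical helper: it assumes the notebook is run from a subfolder
    of the project, so that the parent folder is the project root.
    """
    parent_dir = os.path.dirname(os.getcwd())  # parent of the current directory
    os.chdir(parent_dir)                       # switch to it
    return os.getcwd()                         # report where we ended up
```

Calling `change_to_parent_directory()` once at the top of the notebook would replace the hard-coded path. Running the cell a second time moves up another level, so it should be executed only once per session.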

# Import Libraries

Import the libraries used for data preparation.

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



# Data Preparation



In [53]:
# Load the dataset
df_co2_emissions = pd.read_csv("DataSet/Raw/Co2_emissions.csv")
df_co2_emissions.head()

,country,year,iso_code,population,gdp,cement_co2,cement_co2_per_capita,co2,co2_growth_abs,co2_growth_prct,...,share_global_other_co2,share_of_temperature_change_from_ghg,temperature_change_from_ch4,temperature_change_from_co2,temperature_change_from_ghg,temperature_change_from_n2o,total_ghg,total_ghg_excluding_lucf,trade_co2,trade_co2_share
0,Afghanistan,1750,AFG,2802560.0,,0.0,0.0,,,,...,,,,,,,,,,
1,Afghanistan,1751,AFG,,,0.0,,,,,...,,,,,,,,,,
2,Afghanistan,1752,AFG,,,0.0,,,,,...,,,,,,,,,,
3,Afghanistan,1753,AFG,,,0.0,,,,,...,,,,,,,,,,
4,Afghanistan,1754,AFG,,,0.0,,,,,...,,,,,,,,,,


In [56]:
# Inspect the dataset's structure to identify the relevant columns before cleaning.
df_co2_emissions.shape
df_co2_emissions.columns
df_co2_emissions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50411 entries, 0 to 50410
Data columns (total 79 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   country                                    50411 non-null  object 
 1   year                                       50411 non-null  int64  
 2   iso_code                                   42480 non-null  object 
 3   population                                 41167 non-null  float64
 4   gdp                                        15251 non-null  float64
 5   cement_co2                                 29173 non-null  float64
 6   cement_co2_per_capita                      25648 non-null  float64
 7   co2                                        29384 non-null  float64
 8   co2_growth_abs                             27216 non-null  float64
 9   co2_growth_prct                            26239 non-null  float64
 10  co2_including_luc     

In [None]:
# Columns relevant to our analysis.
keeps_columns = ["country", "iso_code", "year", "co2", "co2_per_capita", "gdp", "population", "cumulative_co2"]



In [None]:
# Keep only the selected columns and preview the result
df_co2_emissions = df_co2_emissions[keeps_columns]
df_co2_emissions.head()

,country,iso_code,year,co2,co2_per_capita,gdp,population,cumulative_co2
0,Afghanistan,AFG,1750,,,,2802560.0,
1,Afghanistan,AFG,1751,,,,,
2,Afghanistan,AFG,1752,,,,,
3,Afghanistan,AFG,1753,,,,,
4,Afghanistan,AFG,1754,,,,,
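As an alternative to loading all 79 columns and subsetting afterwards, the selection can be made at read time with the `usecols` parameter of `pd.read_csv`. The sketch below is illustrative only: the helper name `load_selected_columns` and the in-memory sample rows are assumptions made for the example, not values from the real file.

```python
import io

import pandas as pd

KEEP_COLUMNS = [
    "country", "iso_code", "year", "co2",
    "co2_per_capita", "gdp", "population", "cumulative_co2",
]


def load_selected_columns(source, columns=KEEP_COLUMNS) -> pd.DataFrame:
    """Read only the requested columns (hypothetical helper)."""
    df = pd.read_csv(source, usecols=columns)
    # usecols keeps the file's column order, so reorder to match the list.
    return df[list(columns)]


# Made-up sample rows standing in for DataSet/Raw/Co2_emissions.csv.
sample_csv = io.StringIO(
    "country,year,iso_code,population,gdp,cement_co2,co2,co2_per_capita,cumulative_co2\n"
    "Afghanistan,1990,AFG,12000000,,0.05,2.0,0.17,50.0\n"
    "World,1990,,5300000000,,500.0,22000.0,4.2,800000.0\n"
)
df_sample = load_selected_columns(sample_csv)
print(df_sample.columns.tolist())
```

Reading fewer columns reduces memory use, at the cost of not being able to inspect the dropped columns later without reloading the file.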


In [62]:
df_co2_emissions.shape

(50411, 8)

In [64]:
# Understand the dataset
df_co2_emissions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50411 entries, 0 to 50410
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         50411 non-null  object 
 1   iso_code        42480 non-null  object 
 2   year            50411 non-null  int64  
 3   co2             29384 non-null  float64
 4   co2_per_capita  26509 non-null  float64
 5   gdp             15251 non-null  float64
 6   population      41167 non-null  float64
 7   cumulative_co2  27563 non-null  float64
dtypes: float64(5), int64(1), object(2)
memory usage: 3.1+ MB


In [67]:
# Checking for missing values
df_co2_emissions.isnull().sum()

country               0
iso_code           7931
year                  0
co2               21027
co2_per_capita    23902
gdp               35160
population         9244
cumulative_co2    22848
dtype: int64

## Data Structure and Missing Value Assessment

Restricted to the selected columns, the dataset contains 50,411 records across 8 variables relevant to emissions responsibility analysis. Initial inspection revealed significant missing values in key analytical fields:

* The CO₂ emissions column contains only 29,384 non-null records, leaving 21,027 null values.
* Per-capita emissions and cumulative emissions contain 23,902 and 22,848 missing values respectively.
* GDP is missing in 35,160 records and will be treated as an optional analytical variable.
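Raw null counts are easier to compare as shares of the total row count; for example, 21,027 missing CO₂ values out of 50,411 rows is about 41.7%, and 35,160 missing GDP values is about 69.7%. The following is a minimal sketch of such a summary on a toy DataFrame; the helper name `missing_value_summary` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, worst first."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "co2": [1.0, np.nan, 2.0, np.nan],
    "gdp": [np.nan, np.nan, np.nan, 5.0],
})
summary = missing_value_summary(toy)
print(summary)
```

Applied to the notebook's data, this would be called as `missing_value_summary(df_co2_emissions)`.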

In [70]:
# Describe the dataset to understand the distribution of values and identify any anomalies or outliers.
df_co2_emissions.describe()

,year,co2,co2_per_capita,gdp,population,cumulative_co2
count,50411.0,29384.0,26509.0,15251.0,41167.0,27563.0
mean,1920.349249,420.227031,3.821372,330079400000.0,60174530.0,12492.23
std,65.859123,1972.092036,14.312865,3087720000000.0,330843300.0,73121.43
min,1750.0,0.0,0.0,49980000.0,215.0,0.0
25%,1875.0,0.381056,0.171333,7874038000.0,327214.0,4.238027
50%,1925.0,5.080755,1.023368,27438610000.0,2291594.0,80.46645
75%,1975.0,53.656342,4.327494,121000000000.0,9986553.0,1163.685
max,2024.0,38598.57813,782.743408,130000000000000.0,8161973000.0,1849124.0


## Statistical Summary of Key Variables

The dataset spans the years 1750 to 2024. CO₂ emissions are heavily right-skewed: the median is about 5.08, against a mean of about 420.23 and a maximum of about 38,598.58. Significant missing values were identified:

* Only 29,384 records contain total CO₂ emissions.
* Per-capita and cumulative emissions also show substantial null values.
* GDP data is heavily incomplete and will be treated as an optional analytical variable.

Based on this assessment, rows lacking emissions data will be removed, and the dataset will be filtered from 1990 onward to ensure consistency and policy relevance.

In [73]:
# Drop rows where the CO₂ emissions value is missing
df_co2_emissions = df_co2_emissions.dropna(subset=["co2"])

In [75]:
# Filtering the dataset to include only records from 1990 onwards, as this period is more relevant for our analysis of global CO2 emissions.
df_co2_emissions = df_co2_emissions[df_co2_emissions["year"] >= 1990]   


In [77]:
# Checking the information of the dataset after cleaning
df_co2_emissions.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8617 entries, 240 to 50410
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         8617 non-null   object 
 1   iso_code        7508 non-null   object 
 2   year            8617 non-null   int64  
 3   co2             8617 non-null   float64
 4   co2_per_capita  8068 non-null   float64
 5   gdp             5420 non-null   float64
 6   population      7893 non-null   float64
 7   cumulative_co2  8204 non-null   float64
dtypes: float64(5), int64(1), object(2)
memory usage: 605.9+ KB


In [82]:
# Checking for missing values after cleaning the dataset

df_co2_emissions.isnull().sum()

country              0
iso_code          1109
year                 0
co2                  0
co2_per_capita     549
gdp               3197
population         724
cumulative_co2     413
dtype: int64

In [None]:
# Check the ISO codes to identify the unique countries represented and spot missing values in the iso_code column.
df_co2_emissions["iso_code"].unique()

array(['AFG', nan, 'ALB', 'DZA', 'AND', 'AGO', 'AIA', 'ATA', 'ATG', 'ARG',
       'ARM', 'ABW', 'AUS', 'AUT', 'AZE', 'BHS', 'BHR', 'BGD', 'BRB',
       'BLR', 'BEL', 'BLZ', 'BEN', 'BMU', 'BTN', 'BOL', 'BES', 'BIH',
       'BWA', 'BRA', 'VGB', 'BRN', 'BGR', 'BFA', 'BDI', 'KHM', 'CMR',
       'CAN', 'CPV', 'CAF', 'TCD', 'CHL', 'CHN', 'CXR', 'COL', 'COM',
       'COG', 'COK', 'CRI', 'CIV', 'HRV', 'CUB', 'CUW', 'CYP', 'CZE',
       'COD', 'DNK', 'DJI', 'DMA', 'DOM', 'TLS', 'ECU', 'EGY', 'SLV',
       'GNQ', 'ERI', 'EST', 'SWZ', 'ETH', 'FRO', 'FJI', 'FIN', 'FRA',
       'PYF', 'GAB', 'GMB', 'GEO', 'DEU', 'GHA', 'GRC', 'GRL', 'GRD',
       'GTM', 'GIN', 'GNB', 'GUY', 'HTI', 'HND', 'HKG', 'HUN', 'ISL',
       'IND', 'IDN', 'IRN', 'IRQ', 'IRL', 'ISR', 'ITA', 'JAM', 'JPN',
       'JOR', 'KAZ', 'KEN', 'KIR', 'KWT', 'KGZ', 'LAO', 'LVA', 'LBN',
       'LSO', 'LBR', 'LBY', 'LIE', 'LTU', 'LUX', 'MAC', 'MDG', 'MWI',
       'MYS', 'MDV', 'MLI', 'MLT', 'MHL', 'MRT', 'MUS', 'MEX', 'FSM',
       'MDA', '
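Before rows without an ISO code are dropped, it can help to list which entities they belong to, so the removal is a documented decision. The sketch below uses a toy DataFrame; the helper name `entities_without_iso` is hypothetical, and the assumption that such rows are aggregates like "World" or continents should be checked against the real output.

```python
import numpy as np
import pandas as pd


def entities_without_iso(df: pd.DataFrame) -> list:
    """Return the sorted, de-duplicated names of rows that lack an ISO code."""
    mask = df["iso_code"].isna()
    return sorted(df.loc[mask, "country"].unique())


# Toy data for illustration only; the names are assumed examples.
toy = pd.DataFrame({
    "country": ["Afghanistan", "World", "Africa", "World"],
    "iso_code": ["AFG", np.nan, np.nan, np.nan],
    "year": [1990, 1990, 1990, 1991],
})
missing_iso = entities_without_iso(toy)
print(missing_iso)
```

On the notebook's data the call would be `entities_without_iso(df_co2_emissions)`, run before the filtering cell.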

In [86]:
# Remove rows with a missing (NaN) ISO code, since the ISO code is needed to identify countries in the analysis and visualisations.
df_co2_emissions = df_co2_emissions[df_co2_emissions["iso_code"].notna()]   

In [90]:
# Re-check missing values after removing rows without an ISO code
df_co2_emissions.isna().sum()


country              0
iso_code             0
year                 0
co2                  0
co2_per_capita      66
gdp               2099
population          66
cumulative_co2       0
dtype: int64
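The three cleaning steps applied so far (drop rows without a CO₂ value, keep 1990 onward, drop rows without an ISO code) can also be gathered into one function so they run in a fixed order. This is a sketch under stated assumptions, not the notebook's own method: the name `clean_emissions` and the `start_year` parameter are hypothetical, and unlike the cells above it resets the index.

```python
import numpy as np
import pandas as pd


def clean_emissions(df: pd.DataFrame, start_year: int = 1990) -> pd.DataFrame:
    """Apply the notebook's three filters in one pass (hypothetical helper)."""
    # Rows need both an emissions value and an ISO code.
    cleaned = df.dropna(subset=["co2", "iso_code"])
    # Keep the trend-analysis window only.
    cleaned = cleaned[cleaned["year"] >= start_year]
    # Assumption: a fresh 0..n-1 index is preferred for the exported file.
    return cleaned.reset_index(drop=True)


# Toy data for illustration only.
toy = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Afghanistan", "World"],
    "iso_code": ["AFG", "AFG", "AFG", np.nan],
    "year": [1989, 1990, 1991, 1990],
    "co2": [1.0, 2.0, np.nan, 100.0],
})
toy_clean = clean_emissions(toy)
print(toy_clean)
```

Only the 1990 Afghanistan row survives: the 1989 row falls outside the window, the 1991 row has no CO₂ value, and the "World" row has no ISO code.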

In [92]:
# Count the unique countries and territories remaining after cleaning
df_co2_emissions.shape
df_co2_emissions["country"].nunique()

215

## Unique Countries

Following ISO code filtering, the dataset contains 215 unique countries and territories. The count is higher than the number of sovereign states because the dataset also includes dependent territories with independently recorded emissions data.

---

In [96]:
# Save the cleaned dataset to a new CSV file for future use
df_co2_emissions_clean = df_co2_emissions.copy()
os.makedirs("DataSet/Cleaned", exist_ok=True)  # create the output folder if it does not exist yet
df_co2_emissions_clean.to_csv("DataSet/Cleaned/Co2_emissions_cleaned.csv", index=False)



The cleaned data was copied into a new DataFrame named df_co2_emissions_clean and written to DataSet/Cleaned/Co2_emissions_cleaned.csv. The raw file in DataSet/Raw is left unchanged, so the cleaning steps can be re-run from the original extract, while the cleaned file provides a dedicated input for analysis and dashboard preparation.
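The objectives list derived metrics such as growth rates and cumulative emissions. A minimal sketch of how these could be computed per country with pandas `groupby` is shown below on toy data. The function and column names are hypothetical. Note that a cumulative sum over the filtered data starts at the first year present (1990 after filtering), so it is a different quantity from the source's `cumulative_co2` column, which is assumed here to accumulate over the full historical record.

```python
import pandas as pd


def add_derived_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Add year-on-year growth (%) and a running total per country."""
    out = df.sort_values(["iso_code", "year"]).copy()
    grouped = out.groupby("iso_code")["co2"]
    growth = grouped.pct_change() * 100   # first year of each country is NaN
    running_total = grouped.cumsum()      # restarts for each country
    out["co2_growth_pct"] = growth
    out["co2_cumulative_from_first_year"] = running_total
    return out


# Toy data for illustration only, deliberately unsorted.
toy = pd.DataFrame({
    "iso_code": ["BBB", "AAA", "AAA", "BBB", "AAA"],
    "year": [1991, 1990, 1992, 1990, 1991],
    "co2": [110.0, 10.0, 12.0, 100.0, 15.0],
})
metrics = add_derived_metrics(toy)
print(metrics)
```

Sorting by country and year before grouping matters: both `pct_change` and `cumsum` follow row order within each group, so unsorted years would give misleading results.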

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [97]:
import os
try:
    # create your folder here
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid until a folder is created here
except Exception as e:
    print(e)