# **Introduction**

Emissions of carbon dioxide (CO<sub>2</sub>) and other greenhouse gases (GHG) from human activities have been shown to be the primary driver of today's climate change. One piece of supporting evidence is that global average temperatures have risen by more than 1 °C since pre-industrial times.

A changing climate has a range of potential ecological, physical, and health impacts, including extreme weather events (such as floods, droughts, storms, and heatwaves); sea-level rise; altered crop growth; and disrupted water systems (*Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC)*<sup>1</sup>).

Therefore, tracking and monitoring CO<sub>2</sub> and GHG concentrations in the atmosphere has become essential to safeguarding the future of life on Earth.

### **Main Objective:** 

The main purpose of this project is to apply simple Python processing and visualization techniques and to perform Exploratory Data Analysis (EDA) on a CO<sub>2</sub> emissions dataset in order to obtain valuable insights into a significant worldwide issue.

## **About the dataset**

This dataset was obtained from a public GitHub repository of [Our World In Data](https://github.com/owid/co2-data), a non-governmental organization (NGO) which aims to provide a wide range of insights and relevant information regarding several world problems.

A detailed description of every single feature of the dataset can be read in the "`data/owid-co2-codebook.xlsx`" file.

## **Imports**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import chart_studio.plotly as py
import plotly.express as px
import plotly.graph_objects as go
import missingno as msno
import urllib.request
import upsetplot

# Mute warnings
import warnings
warnings.filterwarnings('ignore')

# Magic function to display matplotlib figures in a jupyter notebook
%matplotlib inline

# Making plots pretty
sns.set_style("darkgrid")

## **Reading the data**

In [2]:
!wget https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv -P data/

--2023-01-11 18:32:41--  https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
Unable to establish SSL connection.


In [None]:
# Data set url
owid_co2_data_url = "https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv"
owid_co2_codebook_url = "https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-codebook.csv"

In [None]:
# Downloading data from url into ./data/ directory (created first if it does not exist)
import os
os.makedirs("./data", exist_ok=True)

urllib.request.urlretrieve(
    url = owid_co2_data_url,
    filename = "./data/owid-co2-data.csv"
)

In [None]:
# Downloading codebook from url into ./data/ directory
urllib.request.urlretrieve(
    url = owid_co2_codebook_url,
    filename = "./data/owid-co2-codebook.csv"
)
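`urlretrieve` writes straight to the given path, so it fails if the target folder is missing, and it downloads the file again on every run. The following is a minimal sketch of a wrapper that handles both cases. The helper name `download_if_missing` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
import urllib.request


def download_if_missing(url: str, destination: str) -> Path:
    """Download `url` to `destination` unless the file already exists.

    Hypothetical helper for illustration only.
    """
    path = Path(destination)
    # urlretrieve does not create missing folders, so create them first
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        urllib.request.urlretrieve(url, str(path))
    return path


# Possible usage with the URLs defined earlier in the notebook:
# download_if_missing(owid_co2_data_url, "./data/owid-co2-data.csv")
# download_if_missing(owid_co2_codebook_url, "./data/owid-co2-codebook.csv")
```

Skipping existing files keeps the analysis reproducible against one snapshot of the data. Delete the local file to force a fresh download.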

In [None]:
# Converting codebook csv file into excel file and saving it into ./data/ directory
codebook_df = pd.read_csv("./data/owid-co2-codebook.csv")
# The context manager saves and closes the file; ExcelWriter.save() was removed in pandas 2.0
with pd.ExcelWriter("./data/owid-co2-codebook.xlsx") as codebook_excel:
    codebook_df.to_excel(codebook_excel, index=False)

In [None]:
# To specify the number of rows and columns to be displayed by pandas
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

In [None]:
# Reading and visualizing the first 5 rows of the CO2 data set
data = pd.read_csv("./data/owid-co2-data.csv")

data.head()

In [None]:
data.info()

Some data sets have column names in inconsistent formats (mixed uppercase and lowercase, spaces, symbols, etc.). In this case, the column names have already been standardized, so there is no need to clean them.

In [None]:
# Checking which features are different than float type
data.dtypes[data.dtypes != "float64"]

All the columns are of float type except **`'country'`**, **`'year'`** and **`'iso_code'`**, which hold other types.

## **Using Pandas API extension for EDA**

### **Missing values analysis**

In [None]:
co2_df = data.copy()

In [None]:
%run pd-extensions.ipynb
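The `explore` accessor used in the next cells is defined in `pd-extensions.ipynb`, which is not shown here. As a rough idea of how such an extension can be built, the sketch below registers a separate, hypothetical accessor called `missing_sketch` through `pd.api.extensions.register_dataframe_accessor`. The method bodies are assumptions about what the real methods compute, not a copy of them.

```python
import pandas as pd


# Hypothetical accessor name, chosen so it does not clash with `explore`
@pd.api.extensions.register_dataframe_accessor("missing_sketch")
class MissingSketchAccessor:
    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj

    def number_missing(self) -> int:
        # Total count of NaN cells in the whole DataFrame
        return int(self._df.isna().sum().sum())

    def number_complete(self) -> int:
        # Total count of non-missing cells
        return int(self._df.size - self.number_missing())

    def missing_variable_summary(self) -> pd.DataFrame:
        # One row per column: missing count, row count and missing percentage
        n_missing = self._df.isna().sum()
        summary = n_missing.rename_axis("variable").reset_index(name="n_missing")
        summary["n_cases"] = len(self._df)
        summary["pct_missing"] = summary["n_missing"] / len(self._df) * 100
        return summary


# Toy data with made-up values
toy = pd.DataFrame({"co2": [1.0, None, 3.0], "gdp": [None, None, 2.0]})
print(toy.missing_sketch.number_missing())
print(toy.missing_sketch.missing_variable_summary())
```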

In [None]:
# Total number of missing values in the dataset
co2_df.explore.number_missing()

In [None]:
# Total number of complete values (non-missing) in the dataset
co2_df.explore.number_complete()

In [None]:
# Summary table of missing values per variable
co2_df.explore.missing_variable_summary()

In [None]:
# Visualizing the proportion of missing values per variable
co2_df.explore.missing_variable_plot()

In [None]:
co2_df.explore.missing_variable_plot_matrix()

Many columns are practically empty. Hence, for the purposes of this project, those columns will be deleted.

In [None]:
columns_to_delete = [
    'co2_including_luc', 'co2_including_luc_growth_abs',
    'co2_including_luc_growth_prct', 'co2_including_luc_per_capita',
    'co2_including_luc_per_gdp', 'co2_including_luc_per_unit_energy',
    'consumption_co2',
    'consumption_co2_per_capita',
    'consumption_co2_per_gdp',
    'cumulative_co2_including_luc',
    'cumulative_luc_co2',
    'flaring_co2',
    'flaring_co2_per_capita',
    'ghg_excluding_lucf_per_capita',
    'ghg_per_capita',
    'methane',
    'methane_per_capita',
    'nitrous_oxide',
    'nitrous_oxide_per_capita',
    'land_use_change_co2',
    'land_use_change_co2_per_capita',
    'share_global_co2_including_luc',
    'share_global_cumulative_co2_including_luc',
    'share_global_cumulative_flaring_co2',
    'share_global_cumulative_luc_co2',
    'share_global_luc_co2',
    'total_ghg',
    'total_ghg_excluding_lucf',
    'trade_co2',
    'trade_co2_share'
]

# `columns` already identifies the axis, so `axis` is not needed
co2_df.drop(
    columns = columns_to_delete,
    inplace = True
)
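The list above was written by hand. As a rough alternative, the mostly empty columns could be selected from the share of missing values in each column. The sketch below uses a hypothetical helper `mostly_empty_columns` and an arbitrary threshold. It is not the criterion used to build the list above.

```python
import pandas as pd


def mostly_empty_columns(frame: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return the columns whose share of missing values exceeds `threshold`.

    Hypothetical helper; the default threshold is an arbitrary assumption.
    """
    missing_share = frame.isna().mean()
    return missing_share[missing_share > threshold].index.tolist()


# Toy data with made-up values
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "co2": [1.0, 2.0, None, 4.0, 5.0],
    "trade_co2": [None, None, None, None, 0.3],
})

sparse = mostly_empty_columns(toy, threshold=0.7)
trimmed = toy.drop(columns=sparse)
print(sparse)
```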

# **Data Pre-processing**

### **Year column**

In [None]:
# Checking minimum and maximum year
print("min year:", co2_df["year"].min())
print("max year:", co2_df["year"].max())

In [None]:
# Checking missing values in the year column
co2_df["year"].isnull().sum()

There are no missing values in the year column.

### **ISO code column**

In [None]:
# Checking countries without an ISO code
co2_df[co2_df["iso_code"].isnull()]["country"].unique()

In [None]:
countries_to_delete = [
    'French Equatorial Africa',
    'French West Africa',
    'Kosovo',
    'Kuwaiti Oil Fires',
    'Leeward Islands',
    'Panama Canal Zone',
    'Ryukyu Islands',
    'St. Kitts-Nevis-Anguilla',
    'Bonaire Sint Eustatius and Saba',
    'Christmas Island',
    'Sint Maarten (Dutch part)',
    'Europe (excl. EU-27)',
    'Europe (excl. EU-28)',
    'International transport'
]

# Checking the amount of CO2 emitted for each country without a designated ISO code.
for country in countries_to_delete:
    co2 = co2_df[
        (co2_df["country"] == country) & ~(co2_df["co2"].isnull())
    ]["co2"].sum()
    print(f"CO2 Emissions: {country} = {np.round(co2, 3)}")

These entries have too few CO2 emissions observations to be relevant for the purposes of this project. Therefore, they will be deleted for simplification.
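To support a judgement like this, it can help to look at the number of non-missing observations per entry next to the summed emissions. The sketch below does this with a grouped aggregation on toy data. The entity names and values are made up.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "country": ["Entity A", "Entity A", "Entity B", "Entity B", "Entity B"],
    "co2": [8.0, None, 30000.0, 31000.0, 32000.0],
})

# "count" ignores NaN, so it gives the number of actual observations
summary = (
    toy.groupby("country")["co2"]
    .agg(n_observations="count", total_co2="sum")
    .sort_values("n_observations")
)
print(summary)
```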

In [None]:
# Dropping the listed entries; `df` is the working DataFrame used from here on
df = co2_df[
    ~co2_df["country"].isin(countries_to_delete)
].copy()

In [None]:
# Verifying the countries were dropped
df[df["iso_code"].isnull()]["country"].unique()

This way, the only remaining entries without an ISO code are aggregates such as the continents, the "World", and the income groups.

### **CO2 Emissions column**

In [None]:
# Dropping rows with NaN's in the co2 or co2_per_capita columns
df.dropna(
    subset = ["co2", "co2_per_capita"],
    inplace = True
)

In [None]:
msno.matrix(
    df,
    color = (0.3, 0.36, 0.44)
)

### **GDP column**

In [None]:
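# Filling GDP missing values for countries with co2 and co2_per_gdp entries.
# Assigning the result back avoids an in-place fillna on a column selection.
# Check the codebook units first: if co2 is in million tonnes and co2_per_gdp
# in kg per dollar, the ratio must be multiplied by 1e9 to obtain dollars.
df["gdp"] = df["gdp"].fillna(
    df["co2"] / df["co2_per_gdp"]
)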

In [None]:
df.groupby(["country"], as_index=False)["gdp"].apply(lambda x: x.isnull().sum())
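The fill above relies on the identity GDP = CO2 / (CO2 per GDP), which only holds when the units line up. The sketch below works through the arithmetic on toy data, assuming CO2 in million tonnes and CO2 per GDP in kilograms per dollar. Both units are assumptions that should be checked against the codebook, and the values are made up.

```python
import numpy as np
import pandas as pd

# 1 million tonnes = 1e6 tonnes * 1e3 kg per tonne
KG_PER_MILLION_TONNES = 1e9

# Toy data with made-up values
toy = pd.DataFrame({
    "co2": [100.0, 50.0],        # assumed unit: million tonnes
    "co2_per_gdp": [0.25, 0.5],  # assumed unit: kg per dollar
    "gdp": [np.nan, 1.0e11],     # dollars
})

# Convert co2 to kg, then divide by kg per dollar to get dollars
derived_gdp = toy["co2"] * KG_PER_MILLION_TONNES / toy["co2_per_gdp"]

# Only the missing entries are filled; existing values are kept
toy["gdp"] = toy["gdp"].fillna(derived_gdp)
print(toy)
```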

# **Exploratory Data Analysis**

### **1. Yearly world CO2 emissions.**

In [None]:
world_emissions = df[df["country"] == "World"][["year", "co2"]]

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Scatter(
    x = world_emissions["year"],
    y = world_emissions["co2"]
    )
)

fig.update_layout(
    title = "Yearly global CO2 emissions",
    xaxis_title = "Year",
    yaxis_title = "CO2 Emissions [Million metric-tons]"
)

# fig = px.line(world_emissions, x="year", y="co2", labels={"x": "Year", "y": "CO2 Emissions [1e6 metric-tons]"}, title="Yearly global CO2 emissions")
fig.show()

### **2. CO2 emissions by continent.**

In [None]:
continents = [
    "North America",
    "North America (excl. USA)",
    "South America",
    "Africa",
    "Europe",
    "Asia",
    "Asia (excl. China & India)",
    "Oceania"
]

# Combining both conditions in one mask avoids chained boolean indexing
emissions_by_continent = df[
    df["country"].isin(continents) & (df["year"] == 2020)
][
    ["country", "co2"]
].sort_values(
    by = "co2",
    ascending = False
)

fig, axes = plt.subplots(
    nrows = 1,
    ncols = 2,
    figsize = (12, 6)
)

sns.barplot(
    emissions_by_continent,
    x = emissions_by_continent["country"],
    y = emissions_by_continent["co2"],
    ax = axes[0]
)

axes[0].set_xlabel(
    "Continent",
    fontweight = "bold"
)

axes[0].set_xticklabels(
    labels = emissions_by_continent["country"],
    rotation = 90
)

axes[0].set_ylabel(
    "CO2 Emissions [1e6 metric-tons]",
    fontweight = "bold"
)

# Drawing the pie chart explicitly on the second subplot
axes[1].pie(
    emissions_by_continent["co2"],
    labels = emissions_by_continent["country"],
    autopct = '%.0f%%',
    explode = (0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
)

fig.suptitle(
    'CO2 Emissions per continent [2020]',
    fontsize = 20,
    fontweight = "bold"
)

plt.tight_layout()
plt.show()
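One caveat when reading the pie chart: aggregates such as "Asia (excl. China & India)" are subsets of "Asia", so including both counts part of the emissions twice. The sketch below shows one possible way to compute shares from non-overlapping aggregates only. It runs on toy data with made-up values and assumes that overlapping aggregates can be recognized by "excl." in their name.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "country": ["Asia", "Asia (excl. China & India)", "Europe", "Africa"],
    "co2": [200.0, 60.0, 50.0, 15.0],
})

# Keep only the aggregates that are not subsets of another aggregate
non_overlapping = toy[~toy["country"].str.contains("excl.", regex=False)]

# Share of each remaining aggregate, in percent
shares = (
    non_overlapping.set_index("country")["co2"]
    / non_overlapping["co2"].sum()
    * 100
).round(1)
print(shares)
```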

### **3. Yearly CO2 emissions by continent.**

In [None]:
continent_yearly_emissions = df[
    df["country"].isin([
        "North America",
        "North America (excl. USA)",
        "South America",
        "Africa",
        "Europe",
        "Asia",
        "Asia (excl. China & India)",
        "Oceania"
    ])
][
    ["country", "year", "co2"]
]

sns.lineplot(
    data = continent_yearly_emissions,
    x = continent_yearly_emissions["year"],
    y = continent_yearly_emissions["co2"],
    hue = continent_yearly_emissions["country"]
)

plt.title(
    "Yearly CO2 emissions by continent",
    fontweight = "bold",
    fontsize = 16
)

plt.xlabel("Year", weight = "bold")
plt.ylabel("CO2 Emissions [1e6 metric-tons]", weight = "bold")

plt.legend(bbox_to_anchor = (1.05, 1))

plt.show()

### **4. Top 10 countries with most CO2 emissions in 2020.**

In [None]:
# Keeping only the entries that have an ISO code (actual countries)
df_2 = df[df["iso_code"].notnull()].copy()

In [None]:
top_10_co2 = df_2[df_2["year"] == 2020][["country", "iso_code", "co2"]].sort_values(by="co2", ascending=False).head(10)
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
sns.barplot(top_10_co2, x=top_10_co2["country"], y=top_10_co2["co2"], ax=axes[0])
axes[0].set_xlabel("Country", fontweight="bold")
axes[0].set_xticklabels(labels=top_10_co2["country"], rotation=90)
axes[0].set_ylabel("CO2 Emissions [1e6 metric-tons]", fontweight="bold")
plt.pie(top_10_co2["co2"], labels=top_10_co2["iso_code"], autopct='%.0f%%', explode=(0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
plt.legend(top_10_co2["country"], bbox_to_anchor=(1.05, 1))
fig.suptitle('Top 10 Countries With Most CO2 Emissions [2020]', fontsize=20, fontweight="bold")
plt.tight_layout()
plt.show()
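The "filter by year, sort, take the first ten" pattern is repeated for several variables in the following sections. As a minimal sketch, it could be wrapped in a small function. The name `top_n` and its parameters are hypothetical, and the toy values are made up.

```python
import pandas as pd


def top_n(frame, year, column, n=10, keep=("country", "iso_code")):
    """Return the `n` rows with the largest `column` value in `year`.

    Hypothetical helper for illustration only.
    """
    columns = [*keep, column]
    return (
        frame.loc[frame["year"] == year, columns]
        .nlargest(n, column)
        .reset_index(drop=True)
    )


# Toy data with made-up values
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "iso_code": ["AAA", "BBB", "CCC", "AAA", "BBB", "CCC"],
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "co2": [5.0, 1.0, 2.0, 1.0, 9.0, 4.0],
})

top_2 = top_n(toy, year=2020, column="co2", n=2)
print(top_2)
```

With the notebook's data, a call such as `top_n(df_2, 2020, "co2")` would be expected to give the same rows as the manual chain above.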

### **5. Top 10 countries with most CO2 emissions per capita in 2020.**

In [None]:
top_10_per_capita = df_2[df_2["year"] == 2020][["country", "co2_per_capita"]].sort_values(by="co2_per_capita", ascending=False).head(10)
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
sns.barplot(top_10_per_capita, x=top_10_per_capita["country"], y=top_10_per_capita["co2_per_capita"], ax=axes[0])
axes[0].set_xlabel("Country", fontweight="bold")
axes[0].set_xticklabels(labels=top_10_per_capita["country"], rotation=90)
axes[0].set_ylabel("CO2 Emissions per capita [metric-tons per person]", fontweight="bold")
plt.pie(top_10_per_capita["co2_per_capita"], labels=top_10_per_capita["country"], autopct='%.0f%%', explode=(0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
fig.suptitle('Top 10 Countries With Most CO2 Emissions per capita [2020]', fontsize=20, fontweight="bold")
plt.tight_layout()
plt.show()

### **6. Top 10 countries by Gross Domestic Product (GDP)**

In [None]:
top_10_gdp = df_2[df_2["year"] == 2018][["country", "iso_code", "gdp"]].sort_values(by="gdp", ascending=False).head(10)
# GDP data has only been recorded until 2018

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
sns.barplot(top_10_gdp, x=top_10_gdp["country"], y=top_10_gdp["gdp"], ax=axes[0])
axes[0].set_xlabel("Country", fontweight="bold")
axes[0].set_xticklabels(labels=top_10_gdp["country"], rotation=90)
axes[0].set_ylabel("GDP [Trillion USD]", fontweight="bold")
plt.pie(top_10_gdp["gdp"], labels=top_10_gdp["iso_code"], autopct='%.0f%%', explode=(0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
plt.legend(top_10_gdp["country"], bbox_to_anchor=(1.05, 1))
fig.suptitle('Top 10 countries by GDP [2018]', fontsize=20, fontweight="bold")
plt.tight_layout()
plt.show()
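The year 2018 is hard-coded above because GDP is not available for later years. A small check like the one sketched below can find the latest year with data for any column. The helper name `last_year_with_data` is hypothetical, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def last_year_with_data(frame: pd.DataFrame, column: str) -> int:
    """Return the latest year in which `column` has a non-missing value.

    Hypothetical helper; assumes at least one non-missing value exists.
    """
    return int(frame.loc[frame[column].notna(), "year"].max())


# Toy data with made-up values
toy = pd.DataFrame({
    "year": [2017, 2018, 2019, 2020],
    "gdp": [1.0, 2.0, np.nan, np.nan],
    "co2": [1.0, 2.0, 3.0, 4.0],
})

print(last_year_with_data(toy, "gdp"))
print(last_year_with_data(toy, "co2"))
```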

### **7. Yearly CO2 emissions of the top 10 richest countries**

In [None]:
top_10_co2_growth = df[df["country"].isin(list(top_10_gdp["country"]))][["country", "year", "iso_code", "gdp", "co2", "co2_growth_prct"]].sort_values(by="co2", ascending=False)

plt.figure(figsize=(10, 6))
sns.lineplot(data=top_10_co2_growth, x=top_10_co2_growth["year"], y=top_10_co2_growth["co2"], hue=top_10_co2_growth["country"])
plt.title("Yearly CO2 emissions [Top 10 richest countries]", fontweight="bold", fontsize=16)
plt.xlabel("Year", weight="bold")
plt.ylabel("CO2 Emissions [1e6 metric-tons]", weight="bold")
plt.legend()
plt.show()

### **8. Top 10 most energy consuming countries.**

In [None]:
top_10_energy = df_2[df_2["year"] == 2020][["country", "iso_code", "primary_energy_consumption"]].sort_values(by="primary_energy_consumption", ascending=False).head(10)
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
sns.barplot(top_10_energy, x=top_10_energy["country"], y=top_10_energy["primary_energy_consumption"], ax=axes[0])
axes[0].set_xlabel("Country", fontweight="bold")
axes[0].set_xticklabels(labels=top_10_energy["country"], rotation=90)
axes[0].set_ylabel("Primary Energy Consumption [TWh/yr]", fontweight="bold")
plt.pie(top_10_energy["primary_energy_consumption"], labels=top_10_energy["iso_code"], autopct='%.0f%%', explode=(0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
plt.legend(top_10_energy["country"], bbox_to_anchor=(1.05, 1))
fig.suptitle('Top 10 Energy Consumers [2020]', fontsize=20, fontweight="bold")
plt.tight_layout()
plt.show()

### **9. Top 10 most energy consuming countries per capita.**

In [None]:
['country', 'year', 'iso_code', 'population', 'gdp', 'cement_co2',
       'cement_co2_per_capita', 'co2', 'co2_growth_abs', 'co2_growth_prct',
       'co2_per_capita', 'co2_per_gdp', 'co2_per_unit_energy', 'coal_co2',
       'coal_co2_per_capita', 'cumulative_cement_co2', 'cumulative_co2',
       'cumulative_coal_co2', 'cumulative_flaring_co2', 'cumulative_gas_co2',
       'cumulative_oil_co2', 'cumulative_other_co2', 'energy_per_capita',
       'energy_per_gdp', 'gas_co2', 'gas_co2_per_capita', 'oil_co2',
       'oil_co2_per_capita', 'other_co2_per_capita', 'other_industry_co2',
       'primary_energy_consumption', 'share_global_cement_co2',
       'share_global_co2', 'share_global_coal_co2',
       'share_global_cumulative_cement_co2', 'share_global_cumulative_co2',
       'share_global_cumulative_coal_co2', 'share_global_cumulative_gas_co2',
       'share_global_cumulative_oil_co2', 'share_global_cumulative_other_co2',
       'share_global_flaring_co2', 'share_global_gas_co2',
       'share_global_oil_co2', 'share_global_other_co2']

In [None]:
# World CO2 emissions by source in 2020, read from the original data because flaring_co2 was dropped from the working copy
emissions_per_activity = data[(data["country"] == "World") & (data["year"] == 2020)][["cement_co2", "coal_co2", "flaring_co2", "gas_co2", "oil_co2"]]
# Resetting the index and transposing so that each source becomes a row
emissions_per_activity = emissions_per_activity.reset_index(drop=True).T
emissions_per_activity
# sns.barplot(emissions_per_activity)
#plt.pie(top_10_energy["primary_energy_consumption"], labels=top_10_energy["iso_code"], autopct='%.0f%%', explode=(0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
# plt.pie(emissions_per_activity.index, labels=emissions_per_activity.values)
# plt.legend(top_10_energy["country"], bbox_to_anchor=(1.05, 1))
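The commented-out attempts above try to plot the transposed frame directly. One possible route, sketched below on toy data with made-up values, is to turn the single "World" row into a Series indexed by source. Such a Series can be passed to a bar or pie plot, and percentage shares follow from it directly.

```python
import pandas as pd

sources = ["cement_co2", "coal_co2", "flaring_co2", "gas_co2", "oil_co2"]

# Toy data with made-up values
toy = pd.DataFrame({
    "country": ["World", "World"],
    "year": [2019, 2020],
    "cement_co2": [1.0, 2.0],
    "coal_co2": [13.0, 14.0],
    "flaring_co2": [1.0, 1.0],
    "gas_co2": [6.0, 7.0],
    "oil_co2": [12.0, 11.0],
})

# Select the single World row for 2020 and turn it into a Series
row = toy[(toy["country"] == "World") & (toy["year"] == 2020)]
by_source = row[sources].iloc[0].sort_values(ascending=False)

# Share of each source, in percent
shares = by_source / by_source.sum() * 100
print(shares)
```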

In [None]:
plt.pie(top_10_energy["primary_energy_consumption"], labels=top_10_energy["iso_code"], autopct='%.0f%%', explode=(0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
plt.legend(top_10_energy["country"], bbox_to_anchor=(1.05, 1))

# **References**

1. Field, C.B., V.R. Barros, D.J. Dokken, K.J. Mach, M.D. Mastrandrea, T.E. Bilir, M. Chatterjee, K.L. Ebi, Y.O. Estrada, R.C. Genova, B. Girma, E.S. Kissel, A.N. Levy, S. MacCracken, P.R. Mastrandrea, and L.L. White (eds.) (2014). *Climate Change 2014: Impacts, Adaptation, and Vulnerability. Part A: Global and Sectoral Aspects. Contribution of Working Group II to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change*. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, 1132 pp. [Available online](https://www.ipcc.ch/report/ar5/wg2/)
