---
title: "ESS 330 Final Project: Changes in Carbon Emissions and Energy Use in the Top Five Polluting Countries During COVID-19"
author:
  - name: "Yazeed Aljohani"
    affiliation: "Colorado State University"
  - name: "Josh Puyear"
    affiliation: "Colorado State University"
  - name: "Cade Vanek"
    affiliation: "Colorado State University"
date: 2025-05-14
format:
  html:
    code-fold: true
    code-summary: "Show code"
editor: visual
bibliography: references.bib
keywords: [CO2 emissions, COVID-19, GHG, workflow, ANOVA, sustainability]
abstract: |
  This study analyzes changes in per capita greenhouse gas (GHG) emissions and emissions per unit of energy use in the top five CO₂-emitting countries (China, the United States, India, Russia, and Japan) from 2015 to 2022, a window spanning the COVID-19 pandemic. The objective is to assess how emissions patterns shifted before, during, and after the pandemic and to identify the strongest predictors of these changes.
  Using data from Our World in Data, we cleaned and filtered emissions data from 2015 to 2022 and grouped years into three periods: pre-COVID (2015 to 2019), during COVID (2020), and post-COVID (2021 to 2022). We focused on key emission sectors including coal, oil, gas, cement, and industrial processes. 
  We applied ANOVA to determine whether per capita emissions changed significantly across periods for each country. China, Japan, and the United States showed statistically significant changes (p < 0.05), while changes in India and Russia were not significant. Linear regression identified gas emissions as the strongest positive predictor of per capita GHG emissions. 
  We also modeled CO₂ emissions per unit of energy using machine learning techniques, including random forest, which performed best (R² = 0.91). GDP per capita and gas emissions per capita were the most important predictors. 
  Our findings show that emissions temporarily decreased during the pandemic but rebounded in most countries by 2022. These results suggest opportunities for long-term emission reductions lie in transforming energy systems and targeting specific high-emission sectors. The study highlights the value of combining statistical and machine learning approaches to guide sustainable policy.
---


# Introduction/Hypothesis

Climate change is one of the most urgent challenges facing the world today. A primary driver of climate change is the release of greenhouse gases, especially carbon dioxide (CO₂), which is emitted through activities such as burning fossil fuels for energy, transportation, and industrial processes. CO₂ remains the most significant contributor to global warming due to its long atmospheric lifespan and the scale of human emissions (Archer et al., 2009). Despite international agreements like the Paris Accord aimed at limiting global warming, CO₂ emissions continue to rise (UNEP, 2022).

A small number of countries—China, the United States, India, Russia, and Japan—are responsible for the majority of global carbon emissions. These nations differ in population size, energy sources, and industrial output, yet each plays a critical role in shaping emissions trends. Assessing emissions on a per capita basis provides a more equitable way to understand responsibility and consumption, offering deeper insights than total emissions alone (Ritchie et al., 2020).

The COVID-19 pandemic created an unexpected opportunity to examine how major disruptions impact emissions. In 2020, global CO₂ emissions fell by approximately 5.4 percent, the largest annual drop ever recorded (Forster et al., 2020), largely due to reduced transportation and industrial activity during lockdowns. However, as economies reopened in 2021 and 2022, emissions quickly rebounded. This raises important questions about whether these changes signal structural shifts or were merely temporary (Le Quéré et al., 2021).

Energy-related emissions are a critical factor in understanding changes in CO₂ output, especially as global energy demand continues to grow. Each country's energy mix—how it generates electricity—affects the amount of CO₂ emitted per kilowatt-hour. This makes energy-based metrics essential for comparing emissions and identifying opportunities for reductions.

While some studies have examined the short-term impacts of COVID-19 on emissions, fewer have investigated long-term trends in per capita and energy-related emissions across the highest-emitting countries. This study addresses that gap by applying both traditional statistical analysis and machine learning techniques to explore emission trends from 2015 to 2022 in China, the United States, India, Russia, and Japan.

We focus on two main indicators: greenhouse gas emissions per person and CO₂ emissions per kilowatt-hour of energy used. Data are divided into three periods: pre-COVID (2015–2019), during COVID (2020), and post-COVID (2021–2022). By comparing these periods, we aim to determine whether emission shifts were lasting or temporary and to identify the sectors that contributed most to these changes.

**Objectives** 

1.  Compare per capita greenhouse gas emissions before, during, and after COVID-19 in the five largest CO₂-emitting countries.
2.  Identify which sector-specific emissions best predict per capita and per unit energy CO₂ emissions using linear regression and machine learning.
3.  Evaluate whether emission changes during the pandemic represent meaningful shifts in national emissions behavior.

**Hypotheses** 

-   H1: Per capita greenhouse gas emissions declined during the COVID-19 pandemic and partially rebounded in the post-COVID period, with variation across countries.
-   H2: Sector-specific emissions such as gas and coal use are strong predictors of per capita emissions in all periods.
-   H3: GDP per capita and gas emissions per capita are the most important predictors of CO₂ emissions per unit of energy.

# Methods

### Study Scope and Dataset

Greenhouse gas emissions data were sourced from Our World in Data [@owid-scaling-up-ai], which compiles information from multiple sources, such as the Global Carbon Project. The dataset from Oxford's Our World in Data covers emission levels from the Industrial Revolution through 2023. We used data from 2015 to 2022, with particular attention to 2019 through 2021, because the COVID-19 lockdowns peaked in 2020 [@forster2020current].

To explain influences on total CO2, we narrowed the 78 metrics in the Oxford dataset to two response variables: CO2 per unit energy and greenhouse gas emissions per capita (excluding land use change). These response variables capture routine, widespread emissions. Comparing emissions from the top five cumulative CO2-emitting countries with those of all countries in the record for 2015 to 2022 shows how much these countries could do to curb CO2 emissions in the future. In 2022, the top five CO2 emitters were responsible for 60.9 percent of global CO2 emissions.
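
A share such as the 60.9 percent figure can be computed by ranking countries by emissions and dividing the top five total by the global total. The sketch below uses a made-up seven-country table; the names `toy`, `co2`, and `share_top5` are illustrative and not from the original analysis, and the numbers are not real emissions.

```r
# Hypothetical emissions table (illustrative values, NOT real data)
toy <- data.frame(
  country = c("A", "B", "C", "D", "E", "F", "G"),
  co2     = c(100, 50, 30, 20, 10, 25, 15)
)

# Rank countries by emissions and keep the five largest
top5 <- head(toy[order(-toy$co2), ], 5)

# Share of the total contributed by the top five, in percent
share_top5 <- 100 * sum(top5$co2) / sum(toy$co2)
share_top5
```

With the real dataset, the same calculation would be applied to a single year of country-level rows after excluding aggregate regions.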

### Data Preparation and Predictor Variables

To compare CO2 emissions before, during, and after the pandemic, we used the tidymodels and dplyr packages in R. The predictor variables `gdp_percap`, `gas_CO2_per_capita`, `share_global_coal_CO2`, `coal_CO2_per_capita`, `cumulative_luc_CO2`, `oil_CO2_per_capita`, and `share_global_luc_CO2` (Table 1) were chosen to model CO2 emissions per unit energy based on correlation tests and our judgment of which predictors were likely to be most informative. These predictors attribute the response partly to individual demand for energy (the per capita emissions variables) and partly to energy use by the whole country (the share of global coal CO2 emissions). Although electricity comes from many energy sources, coal is the most CO2-polluting, so it was one of the few fuel-specific sources included (Environmental Protection Agency, 2025).

Predictor variables for `ghg_excluding_lucf_per_capita` were `cumulative_CO2_including_luc`, `primary_energy_consumption`, `temperature_change_from_ghg`, `population`, `CO2_per_unit_energy`, `gdp_percap`, `cumulative_CO2`, `cumulative_coal_CO2`, `cumulative_luc_CO2`, and `energy_per_capita`. These predictors were intended to serve as proxies for population-based CO2 emissions.

### Statistical and Machine Learning Methods

To analyze our data, we combined ANOVA tests with machine learning. ANOVA was used to assess differences in per capita emissions among the pre-COVID, during-COVID, and post-COVID periods. A linear regression modeled per capita emissions as a function of sector-specific CO2 emissions (coal, oil, gas, cement, and other industry). Our research focuses on revealing relationships between countries and between predictor variables of greenhouse gas emissions.
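
For reference, the two classical models described above can be written out as follows. This is a sketch under the assumption that each country contributes one observation per year ($N = 8$ for 2015 to 2022) across $k = 3$ periods; the symbols are introduced here for illustration only.

$$
y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad
F = \frac{\mathrm{SS}_{\text{between}} / (k - 1)}{\mathrm{SS}_{\text{within}} / (N - k)}
$$

Here $y_{ij}$ is per capita GHG emissions in year $j$ of period $i$, $\mu$ is the overall mean, and $\alpha_i$ is the effect of period $i$. Under the stated assumption, $F$ is compared against an $F$ distribution with $2$ and $5$ degrees of freedom. The sector regression takes the form

$$
\text{GHG}_{\text{pc}} = \beta_0 + \beta_1\,\text{coal} + \beta_2\,\text{oil} + \beta_3\,\text{gas} + \beta_4\,\text{cement} + \beta_5\,\text{other industry} + \varepsilon
$$

where each $\beta$ is the change in per capita GHG emissions associated with a one-unit change in that sector's CO2 emissions, holding the other sectors fixed.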

We compared CO2 emissions per unit energy among the top five cumulative polluters. Machine learning models to predict `ghg_excluding_lucf_per_capita` (greenhouse gases, excluding land use change, emitted per person) and `CO2_per_unit_energy` (CO2 emitted per kilowatt-hour) were constructed using the rsample, parsnip, workflowsets, and baguette packages.

We fit linear regression, neural network (Spyros et al. 2024), random forest (Kjajavi et al. 2023), decision tree (Rahman 2023), and boosted tree (Si and Du 2020) models to identify the most useful predictor variables. The best model for each of the two response variables was then analyzed with the vip package to reveal which of the selected variables best explained CO2 emissions per unit energy and per capita greenhouse gas emissions.
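
A model comparison of this kind could be set up roughly as below. This is a minimal sketch, not the analysis code: it uses synthetic data in place of the Our World in Data table, includes only two of the five model types (linear regression and an rpart decision tree) to keep dependencies small, and every object name (`toy`, `rec`, `wf_set`, `ranked`) is an assumption made for illustration.

```r
library(tidymodels)

set.seed(330)

# Synthetic stand-in data (NOT the OWID data): 60 made-up country-years
toy <- tibble(
  gdp_percap          = runif(60, 2000, 60000),
  gas_co2_per_capita  = runif(60, 0.1, 5),
  coal_co2_per_capita = runif(60, 0.5, 6)
) |>
  mutate(
    co2_per_unit_energy = 0.15 + 0.02 * coal_co2_per_capita -
      0.01 * gas_co2_per_capita + rnorm(60, sd = 0.01)
  )

# Hold out a test set, then build cross-validation folds on the training data
data_split <- initial_split(toy, prop = 0.8)
train      <- training(data_split)
folds      <- vfold_cv(train, v = 5)

# Shared preprocessing: predict the response from all other columns
rec <- recipe(co2_per_unit_energy ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

# Candidate model specifications
models <- list(
  lm   = linear_reg() |> set_engine("lm"),
  tree = decision_tree() |> set_engine("rpart") |> set_mode("regression")
)

# Fit every recipe/model combination on the same resamples
wf_set <- workflow_set(preproc = list(rec = rec), models = models) |>
  workflow_map("fit_resamples", resamples = folds, seed = 330)

# Rank the workflows by cross-validated R-squared
ranked <- rank_results(wf_set, rank_metric = "rsq", select_best = TRUE)
ranked
```

Additional specifications such as `rand_forest()`, `boost_tree()`, or `mlp()` can be added to the `models` list in the same way, provided their engines are installed.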

# Results

### Data Exploration


```{r}
#| echo: false
# tidyverse attaches readr, ggplot2, tidyr, and dplyr, so one call suffices
library(tidyverse)

# Load the Our World in Data CO2 and greenhouse gas dataset
data <- read_csv("data/owid-co2-data (1).csv")
```

```{r}
#| echo: false
# Narrowing Data to 2015-2022
data_filtered <- data |> 
  filter(year >= 2015, year <= 2022) |> 
  filter(!is.na(iso_code) & nchar(iso_code) == 3)

# Getting the top five emitting countries by total GHG (excluding land use), summed over 2015-2022
top_emitters <- data_filtered |> 
  group_by(country) |> 
  summarise(total_ghg = sum(total_ghg_excluding_lucf, na.rm = TRUE)) |> 
  arrange(desc(total_ghg)) |> 
  slice_head(n = 5) |> 
  pull(country)

# Filtering data to include only those countries and select relevant variables
df <- data_filtered |> 
  filter(country %in% top_emitters) |> 
  select(country, year, ghg_excluding_lucf_per_capita, 
         coal_co2, gas_co2, oil_co2, cement_co2, other_industry_co2)

# Adding period category (pre, during, post COVID)
df <- df |> 
  mutate(period = case_when( 
    year <= 2019 ~ "pre_covid", 
    year == 2020 ~ "during_covid", 
    year >= 2021 ~ "post_covid" 
  )) |> 
  mutate(period = factor(period, levels = c("pre_covid", "during_covid", "post_covid")))
```

```{r}
#| echo: false
ggplot(df, aes(x = factor(year), y = ghg_excluding_lucf_per_capita, color = country, group = country)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Per Capita GHG Emissions (Excl. Land Use) by Country (2015–2022)",
    x = "Year", y = "GHG per Capita (tCO₂e)"
  ) +
  theme_minimal()
```


This graph shows how greenhouse gas emissions per person changed in China, the United States, India, Russia, and Japan from 2015 to 2022. A drop is visible around 2020 during COVID, especially in the United States and Russia. Emissions in some countries began rising again after 2020.


```{r}
#| echo: false
df_long <- df |> 
  pivot_longer(cols = c(coal_co2, gas_co2, oil_co2, cement_co2, other_industry_co2),
               names_to = "sector", values_to = "emissions") |> 
  drop_na(emissions)

ggplot(df_long, aes(x = year, y = emissions, color = sector, group = interaction(sector, country))) +
  geom_line() +
  geom_point() +
  facet_wrap(~ country, scales = "free_y", ncol = 2) +
  scale_x_continuous(breaks = 2015:2022) +
  labs(title = "Sector-specific CO₂ Emissions (2015–2022)", 
       x = "Year", 
       y = "Emissions (MtCO₂)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```


This graph shows where each country's emissions came from: coal, oil, gas, cement, or other industry. For example, coal accounts for a large share of China's emissions, while oil is more prominent in the United States. Emissions from some sectors dropped in 2020 and then changed again after COVID.


```{r}
#| echo: false
# 1. Prepare data
df <- data_filtered |>
  filter(country %in% top_emitters) |>
  select(country, year, ghg_excluding_lucf_per_capita,
         coal_co2, gas_co2, oil_co2, cement_co2, other_industry_co2,
         gdp, population) |>
  mutate(
    gdp_per_capita = gdp / population,
    period = case_when(
      year <= 2019 ~ "pre_covid",
      year == 2020 ~ "during_covid",
      year >= 2021 ~ "post_covid"
    )
  ) |>
  mutate(period = factor(period, levels = c("pre_covid", "during_covid", "post_covid")))

# 2. Run ANOVA for each country
# Loop variable is named `ctry` to avoid shadowing base R's c()
for (ctry in unique(df$country)) {
  df_country <- df |> filter(country == ctry)
  if (n_distinct(df_country$period) > 1) {
    cat("\n--- ANOVA for", ctry, "---\n")
    model <- aov(ghg_excluding_lucf_per_capita ~ period, data = df_country)
    print(summary(model))
  } else {
    cat("\n--- Skipped ANOVA for", ctry, ": not enough periods ---\n")
  }
}

# 3. Linear Regression
lm_model <- lm(ghg_excluding_lucf_per_capita ~ coal_co2 + oil_co2 + gas_co2 + cement_co2 + other_industry_co2, data = df)
summary(lm_model)
```


| Country       | p-value | Interpretation                    |
|:--------------|:--------|:----------------------------------|
| China         | 0.0103  | Significant change (p \< 0.05)    |
| India         | 0.142   | No significant change (p \> 0.05) |
| Japan         | 0.0329  | Significant change (p \< 0.05)    |
| Russia        | 0.0781  | No significant change (p \> 0.05) |
| United States | 0.00355 | Significant change (p \< 0.01)    |
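
A table of per-country p-values like the one above can be assembled programmatically rather than read off printed output. The sketch below is one possible approach using base R and a made-up two-country dataset; the names `toy`, `ghg_pc`, `anova_p`, and `result` are hypothetical, and the values are invented for illustration.

```r
# Hypothetical data (NOT real emissions): two countries, 2015-2022
toy <- data.frame(
  country = rep(c("A", "B"), each = 8),
  year    = rep(2015:2022, times = 2),
  ghg_pc  = c(10, 10.1, 9.9, 10, 10.2, 8.5, 9.4, 9.5,   # A: clear 2020 drop
              5, 5.1, 4.9, 5, 5.05, 5.0, 5.02, 4.98)    # B: essentially flat
)

# Same period grouping as in the analysis
toy$period <- factor(
  ifelse(toy$year <= 2019, "pre_covid",
         ifelse(toy$year == 2020, "during_covid", "post_covid")),
  levels = c("pre_covid", "during_covid", "post_covid")
)

# Fit a one-way ANOVA for one country and return the p-value for `period`
anova_p <- function(d) {
  fit <- aov(ghg_pc ~ period, data = d)
  summary(fit)[[1]][["Pr(>F)"]][1]
}

# Apply to each country and collect the results in one table
p_values <- sapply(split(toy, toy$country), anova_p)
result <- data.frame(
  country     = names(p_values),
  p_value     = unname(p_values),
  significant = unname(p_values) < 0.05
)
result
```

Because the during-COVID period contains a single year, each test rests on very few observations, so the p-values are best read as indicative rather than conclusive.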