# Predicting CO2 Emissions Per Capita for a Country Using Energy Consumption

by Tony Shum, Jing Wen, Aishwarya Nadimpally, Weilin Han

In [1]:
# Initialize packages
import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
import matplotlib.pyplot as plt

In [2]:
# Import the functions from the src folder
import sys
import os

# sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
sys.path.append(os.path.join('..'))
from src.create_scatter_plot import create_scatter_plot
from src.data_preprocessor import data_preprocessor
from src.read_melt_merge import read_melt_merge
from src.scoring_metrics import scoring_metrics

# Summary

Here we attempt to build a prediction model using the k-nearest neighbours algorithm, which uses energy consumption and energy generation measurements to predict the CO2 emissions of a given country for the following year. Our final prediction model performs well on the unseen test data set, with an $\text{R}^2$ of 0.976, close to the cross-validated $\text{R}^2$ of 0.975. However, the model predicts CO2 emissions by finding the cases in the training data set that are most similar to the unseen data. If an unseen case has measurements outside the ranges seen in the training data set (e.g. a massive increase in energy usage, a gain in energy efficiency, or a new type of clean energy), the prediction might not be accurate. We therefore recommend further study to improve this prediction model.

# Introduction


According to the Intergovernmental Panel on Climate Change (IPCC), CO2 emissions are a leading contributor to global warming and climate change (IPCC, 2014). Understanding the correlation between the consumption of different types of energy and CO2 emissions is critical for formulating policies aimed at reducing emissions and mitigating climate change impacts (IPCC, 2018).

Our project aims to build a machine learning model that uses energy consumption per capita to predict a country's CO2 emissions per capita. Such a model can be a useful tool for raising public awareness of the impact of energy consumption on CO2 emissions and for informing international agreements on emission reductions. We hope that our findings will encourage sustainable behavior, such as reducing energy consumption or opting for green energy alternatives (International Energy Agency, 2018).


# Methods



## Data

The data set used in this project comes from the World Bank via Gapminder (https://www.gapminder.org/), an independent Swedish foundation with no political, religious or economic affiliations.


##### Credit
    
FREE DATA FROM WORLD BANK VIA GAPMINDER.ORG, CC-BY LICENSE

## Analysis

The data was split with 80% used as training data and 20% as test data. For model building, we compared DummyRegressor, Ridge, SVR and KNeighborsRegressor (KNN), and chose KNN because it achieved the highest $\text{R}^2$ score. The hyperparameter $K$ was chosen using 10-fold cross-validation with $\text{R}^2$ as the regression metric.
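As a minimal sketch of the 80/20 split described above, the snippet below applies `train_test_split` to a small synthetic data frame. The toy values, the linear relationship and the `random_state` are assumptions made for illustration only; the column names `elec_c`, `oil_c` and `co2_e` are borrowed from the EDA section, and the project's real data is not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project data (values are made up).
rng = np.random.default_rng(123)
toy_df = pd.DataFrame({
    "elec_c": rng.uniform(0, 10, size=50),
    "oil_c": rng.uniform(0, 5, size=50),
})
toy_df["co2_e"] = 0.8 * toy_df["elec_c"] + 1.5 * toy_df["oil_c"]

# 80% of the rows go to training, 20% are held out for testing.
train_df, test_df = train_test_split(toy_df, test_size=0.2, random_state=123)
print(train_df.shape, test_df.shape)
```

Fixing `random_state` keeps the split reproducible between runs, which matters when cross-validation and test scores are compared later.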

# Results & Discussions

Our prediction model performed well on the test data, with a final overall $\text{R}^2$ of 0.976, which is promising for predicting a country's CO2 emissions per capita given energy generation and consumption data. The predictions deviate little from the ground truth: the RMSE is 1.34, meaning that our model is relatively accurate in terms of CO2 emission prediction.

#### Exploratory Data Analysis (EDA)

Our data has no NA or missing values, but columns 3 to 7 need to be converted to float. We also need to clean up the data and unify the units, for example by changing 20u to 20e-6 and 15.1k to 15.1e3.
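The unit clean-up described above could look roughly like the following sketch. The helper name `parse_suffixed_number` and the suffix table are hypothetical: only the `u` and `k` suffixes are confirmed by the examples in the text, and `M` and `B` are assumed here for completeness.

```python
# Hypothetical suffix table; only "u" and "k" appear in the text above,
# "M" and "B" are assumptions added for illustration.
SUFFIX_MULTIPLIERS = {"u": 1e-6, "k": 1e3, "M": 1e6, "B": 1e9}


def parse_suffixed_number(value):
    """Convert values such as '20u' or '15.1k' to a plain float."""
    if isinstance(value, (int, float)):
        return float(value)
    text = str(value).strip()
    # If the last character is a known suffix, scale the numeric part.
    if text and text[-1] in SUFFIX_MULTIPLIERS:
        return float(text[:-1]) * SUFFIX_MULTIPLIERS[text[-1]]
    return float(text)


print(parse_suffixed_number("20u"), parse_suffixed_number("15.1k"))
```

On a pandas column, a helper like this could be applied with `df[col].map(parse_suffixed_number)`, which also takes care of the conversion to float.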

The Spearman's rank correlation test below revealed some potential correlations between the following pairs of columns: co2_e vs elec_c, co2_e vs oil_c, elec_c vs oil_c, and gas_g vs oil_g.
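For readers unfamiliar with rank correlation, here is a small sketch using `DataFrame.corr(method="spearman")` from pandas. The numbers are made up for illustration and only the column names come from the text; the project's actual correlation values are not reproduced.

```python
import pandas as pd

# Toy data: oil_c is a monotonic (cubic) function of elec_c,
# while co2_e is mostly decreasing in elec_c.
toy = pd.DataFrame({
    "elec_c": [1.0, 2.0, 3.0, 4.0, 5.0],
    "oil_c": [1.0, 8.0, 27.0, 64.0, 125.0],
    "co2_e": [5.0, 3.0, 4.0, 1.0, 2.0],
})

# Spearman correlation works on ranks, so any strictly monotonic
# relationship yields a coefficient of +1 or -1.
spearman_corr = toy.corr(method="spearman")
print(spearman_corr.round(2))
```

Because it operates on ranks, Spearman's coefficient captures monotonic but non-linear relationships that Pearson's coefficient would understate.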

We further visualized the correlations between the columns of interest above in scatter plots. The plots also revealed that we have only one data point per year for 2015 to 2018, so we can consider excluding these years from the training data set.

##### EDA Conclusion

We converted each column to an appropriate data type and unified the units. We visualized the distributions of all numeric columns and explored potential correlations between columns. We split `df` into training and test data sets (8:2). For pipeline building, it will be beneficial to remove the years 2015 - 2017 because we have only one data point per year.

#### Preprocessing

Based on the nature of the data and the EDA results, the following assumption and preprocessing steps are applied:
- A **naive assumption** is made that there is no temporal dependency between observations (i.e. between observations across years). `year` is removed to prevent the model from using the temporal feature to look into the future. Temporal treatments, e.g. time series split and time series cross-validation, could be considered later.
- Scaling is applied to all numeric features to standardize them to a common scale.
- One-hot encoding is applied to the categorical feature `country`.
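The three preprocessing steps above can be sketched with `make_column_transformer`. This is an illustrative sketch on toy data, not the project's `data_preprocessor` function, whose implementation is not shown here; the numeric feature list and the `handle_unknown="ignore"` setting are assumptions.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data frame with the three kinds of columns discussed above.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "elec_c": [1.0, 2.0, 3.0, 4.0],
    "oil_c": [10.0, 20.0, 30.0, 40.0],
})

numeric_features = ["elec_c", "oil_c"]   # assumed subset of numeric columns
categorical_features = ["country"]
drop_features = ["year"]                 # dropped under the naive assumption

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ("drop", drop_features),
)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

Fitting the transformer inside a pipeline, rather than on the full data set, keeps the scaler's statistics from leaking information from validation folds into training.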

#### Model Training

We evaluated several regression models using 10-fold cross-validation with $\text{R}^2$ as the scoring metric to find the best performing model. Based on the validation results, the model using the k-nearest neighbors (KNN) algorithm performed best, with an $\text{R}^2$ of 0.949.
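The model comparison described above can be sketched as a loop over candidate regressors with `cross_validate`. The synthetic data, default hyperparameters and the scaling-only pipelines are assumptions for illustration; the ranking on this toy data says nothing about the ranking on the real data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem (made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 0.8 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 0.1, size=200)

models = {
    "dummy": DummyRegressor(),
    "ridge": make_pipeline(StandardScaler(), Ridge()),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "svr": make_pipeline(StandardScaler(), SVR()),
}

# Mean validation R^2 over 10 folds for each candidate model.
cv_r2 = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=10, scoring="r2")
    cv_r2[name] = scores["test_score"].mean()

for name, score in cv_r2.items():
    print(f"{name}: {score:.3f}")
```

The `DummyRegressor` always predicts the training mean, so its $\text{R}^2$ sits near zero and serves as the baseline every other model should beat.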

#### Hyperparameter Optimization

The hyperparameters `n_neighbors` and `max_categories` were chosen using 10-fold cross-validation with $\text{R}^2$ as the regression metric to improve model performance. Based on the validation results, the tuned KNN model achieved an $\text{R}^2$ (`mean_test_r2`) of 0.975.

#### Test Results

From the test data plot, we can see that the model under-predicts a few values. Overall, prediction errors are small: the model achieved a final $\text{R}^2$ of 0.976 on the test data, in line with the cross-validated score of about 0.975, which is promising for predicting a country's CO2 emissions per capita given energy generation and consumption data. The RMSE of 1.34 indicates that the predictions deviate little from the ground truth.
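To make the two test metrics concrete, the sketch below computes $\text{R}^2$ and RMSE on four made-up predictions. It is not the project's `scoring_metrics` function, and the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up ground truth and predictions.
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 6.5, 7.5])

r2 = r2_score(y_true, y_pred)
# RMSE is the square root of the mean squared error, so it is expressed
# in the same units as the target.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Positive residuals mark under-predictions (truth above prediction).
residuals = y_true - y_pred
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.3f}, under-predicted: {(residuals > 0).sum()}")
```

Since RMSE shares the target's units, it is easiest to judge against the spread of the observed CO2 emission values rather than in isolation.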

## Limitations and Future Direction

To further improve this model, with the hope of arriving at one that could be used in practice, we suggest several improvements for later revisions. As mentioned in Preprocessing, there may be temporal dependency between observations, and temporal treatments could be considered. In the EDA above, we discovered collinearity between `oil_c` and `elec_c`, and between `oil_g` and `gas_g`. Though it might not affect the predictive power of the models, it harms the interpretation of the coefficients of linear models. Collinearity reduction treatments, e.g. feature removal or dimension reduction techniques, could be considered. Assuming that CO2 emissions continue to increase in the future, KNN may not predict well beyond the range of values in the training data. Other models with similar predictive power that can handle out-of-range input data could be considered.
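The extrapolation limitation mentioned above can be illustrated with a small sketch. On a made-up linear relationship, KNN averages the targets of its nearest training neighbours and so cannot predict above the largest training target, whereas a linear model such as Ridge continues the trend. The data, `n_neighbors=3` and `alpha=1e-3` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

# Training data covers x in [0, 9] with y = 2x.
X_train = np.arange(0, 10, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
ridge = Ridge(alpha=1e-3).fit(X_train, y_train)

# A new observation far outside the training range.
X_new = np.array([[20.0]])
knn_pred = knn.predict(X_new)[0]      # mean target of x = 7, 8, 9
ridge_pred = ridge.predict(X_new)[0]  # follows the linear trend

print(f"KNN: {knn_pred:.1f}, Ridge: {ridge_pred:.1f}, max training target: {y_train.max():.1f}")
```

This is only a one-dimensional illustration; whether a linear or other extrapolating model would match KNN's in-range accuracy on the real data would need to be checked separately.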

# References

Morice, C.P., J.J. Kennedy, N.A. Rayner, J.P. Winn, E. Hogan, R.E. Killick, R.J.H. Dunn, T.J. Osborn, P.D. Jones and I.R. Simpson (in press) An updated assessment of near-surface temperature change from 1850: the HadCRUT5 dataset. Journal of Geophysical Research (Atmospheres)

Hannah Ritchie, Max Roser and Pablo Rosado (2020) - "CO₂ and Greenhouse Gas Emissions". Published online at OurWorldInData.org. Retrieved from: 'https://ourworldindata.org/co2-and-greenhouse-gas-emissions'

IPCC, 2014: Climate Change 2014: Synthesis Report. Contribution of Working Groups I, II and III to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. IPCC, Geneva, Switzerland, 151 pp.

IPCC, 2018: Global Warming of 1.5°C. An IPCC Special Report.

International Energy Agency, 2018: World Energy Outlook 2018.
