# Machine learning task

In the [previous notebook](http://localhost:8888/lab/tree/question1.ipynb), I answered the question: **In the present, which countries are the biggest CO2 producers on the planet? And which are the biggest CO2 producers per capita?** While that analysis is a good starting point for exploring which countries are the major contributors to a possible future climate disaster, it does not yet address the relationship between country-specific statistics and CO2 production. To explore that relationship, a simple machine learning model can be built that aims to answer the "why" behind carbon dioxide emissions and captures which factors are typically tied to high or low CO2 production. These factors could help climate scientists and lawmakers create action plans that target the factors associated with a lower rate of CO2 production.

------------
## Research question

**Can we build a machine learning model to predict the amount of CO2 that is generated by a country, given other data about the country?**

Well, let's hope so and get into it :-)

In [21]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from pathlib import Path

In [22]:
# load SDG dataset to pandas
working_dir = Path.cwd()
parent_dir = working_dir.parent
data_dir = parent_dir / 'data' / 'original raw' / 'last decade'
sdg_path = data_dir / 'sdg_data.xlsx'

print('Reading SDG dataset ...')
sdg_dataset = pd.read_excel(sdg_path)
print('Done.')

Reading SDG dataset ...
Done.


In [23]:
# explore SDG dataset - beginning
sdg_dataset.head()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,YR2010,YR2011,YR2012,YR2013,YR2014,YR2015,YR2016,YR2017,YR2018,YR2019,YR2020
0,Afghanistan,AFG,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,20,21.8,23,24.8,26.1,27.4,28.6,29.7,30.9,31.9,33.2
1,Afghanistan,AFG,Access to electricity (% of population),EG.ELC.ACCS.ZS,42.700001,43.222019,69.099998,68.290649,89.5,71.5,97.699997,97.699997,96.616135,97.699997,97.699997
2,Afghanistan,AFG,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,30.2188,29.57288,60.849155,60.566135,86.500511,64.573357,97.099358,97.091972,95.586174,97.07563,97.066711
3,Afghanistan,AFG,"Access to electricity, urban (% of urban popul...",EG.ELC.ACCS.UR.ZS,82.800003,86.56778,95,92.259048,98.699997,92.5,99.5,99.5,99.626022,99.5,99.5
4,Afghanistan,AFG,Account ownership at a financial institution o...,FX.OWN.TOTL.ZS,..,9.01,..,..,9.96,..,..,14.89,..,..,..
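
The `..` entries in the year columns above look like placeholders for missing values rather than numbers. Below is a minimal sketch of one way to turn them into `NaN`, shown on a tiny made-up frame. The `toy` frame, its values, and the `year_cols` name are illustrative assumptions and not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking the layout above; the values are illustrative only.
toy = pd.DataFrame({
    'Series Code': ['EG.ELC.ACCS.ZS', 'FX.OWN.TOTL.ZS'],
    'YR2010': [42.7, '..'],
    'YR2011': ['43.2', 9.01],
})

# Assumption: the year columns are exactly those whose name starts with 'YR'.
year_cols = [c for c in toy.columns if c.startswith('YR')]

# Coerce every year column to a numeric dtype. Anything that cannot be
# parsed as a number (including the '..' placeholder) becomes NaN.
toy[year_cols] = toy[year_cols].apply(pd.to_numeric, errors='coerce')

# Number of missing values per year column.
print(toy[year_cols].isna().sum())
```

Note that `errors='coerce'` silently turns every unparseable string into `NaN`, not only `..`, so it is worth checking beforehand that no other text values are expected in these columns. Passing `na_values='..'` to `pd.read_excel` would be another option.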


In [24]:
# explore SDG dataset - end
sdg_dataset.tail(8)

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,YR2010,YR2011,YR2012,YR2013,YR2014,YR2015,YR2016,YR2017,YR2018,YR2019,YR2020
106485,Zimbabwe,ZWE,Women making their own informed decisions rega...,SG.DMK.SRCR.FN.ZS,..,58.8,..,..,..,59.9,..,..,..,..,..
106486,Zimbabwe,ZWE,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,..,3.9,..,..,..,3.7,..,..,..,5.418352,..
106487,Zimbabwe,ZWE,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,..,30.5,..,..,33.5,32.4,..,..,..,33.658057,..
106488,,,,,,,,,,,,,,,
106489,,,,,,,,,,,,,,,
106490,,,,,,,,,,,,,,,
106491,Data from database: Sustainable Development Go...,,,,,,,,,,,,,,
106492,Last Updated: 07/22/2022,,,,,,,,,,,,,,


In [25]:
# the last five rows (blank rows and the export footer) hold no data - drop them
# .iloc slicing is positional and end-exclusive, so exactly rows_to_drop rows are removed
rows_to_drop = 5
sdg_dataset = sdg_dataset.iloc[:-rows_to_drop]
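
A hard-coded row count depends on the exact shape of the export. As a sketch of an alternative, assuming every genuine record has a `Country Code` and the footer rows do not (as the `tail` output suggests), the footer can be filtered out by that column instead. The `toy` frame and its footer strings below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: two genuine records followed by a blank row and two footer rows.
toy = pd.DataFrame({
    'Country Name': ['Zimbabwe', 'Zimbabwe', np.nan, 'footer line 1', 'footer line 2'],
    'Country Code': ['ZWE', 'ZWE', np.nan, np.nan, np.nan],
    'YR2019': [5.4, 33.7, np.nan, np.nan, np.nan],
})

# Keep only rows that carry a country code, then renumber the index.
cleaned = toy.dropna(subset=['Country Code']).reset_index(drop=True)

print(cleaned.shape)
```

This variant keeps working if the number of footer rows changes, but it would also drop any genuine record whose country code happens to be missing, so the assumption should be checked against the real data.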

In [19]:
# count the nan values for all features in 2019