## Country Emission Information

Companies' GHG emissions can depend on exogenous factors. For instance, a company that mainly operates in a country whose energy mix relies more on fossil fuels could have greater emissions than a similar company (same industry, market cap, etc.) operating in a country whose energy mix relies more on renewable energy. For this reason, I decided to add the following country-level datasets as predictors for my emissions model:

1. Emissions Factors: amount of carbon dioxide (CO2) emitted per unit of energy produced (kg/kWh)
2. Emissions per GDP: amount of CO2 emitted per unit of GDP (kg per $PPP); this metric indicates how carbon-intensive a country's economy is (often called the carbon intensity of the economy). GDP is measured in constant 2011 international dollars.
3. Emissions by Sector: tons of CO2e per sector
4. Emissions by Fuel: tons of CO2 emitted per fuel type associated with energy and industrial production
5. Total GHG Emissions: tons of CO2e per country, including emissions from land use change. GHG includes carbon dioxide, methane, nitrous oxide and F-gases.
6. Consumption-based CO2 emissions: consumption-based emissions are production-based emissions adjusted for trade. The dataset reports the share of annual CO2 emissions embedded in trade, i.e. the net trade of emissions: consumption-based emissions minus production-based emissions in a given year. Net importers of emissions have positive values; net exporters have negative values.
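
The sign convention of the last dataset can be checked with a small worked example. This is only a sketch with made-up figures; the function name `net_emissions_trade` is hypothetical, and expressing the result as a percentage of production-based emissions is an assumption about how the share is defined.

```python
def net_emissions_trade(consumption_based, production_based):
    """Net trade of emissions: positive for net importers, negative for net exporters."""
    return consumption_based - production_based

# Made-up figures (tonnes of CO2), for illustration only
importer = net_emissions_trade(consumption_based=110.0, production_based=100.0)
exporter = net_emissions_trade(consumption_based=90.0, production_based=100.0)

# Assumption: the share is expressed relative to production-based emissions
importer_share = 100 * importer / 100.0

print(importer, exporter, importer_share)
```

A country that consumes more embedded emissions than it produces ends up with a positive value, which matches the description in item 6.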



In this Jupyter Notebook I perform the following steps:

**Step 1** -  Importing libraries and country emission data
**Step 2** -  Filtering the data for the countries I am interested in
**Step 3** -  Upsampling the annual data to quarterly frequency
**Step 4** -  Filtering the data by date (from 2012 onwards)
**Step 5** -  Merging all datasets into a single dataframe
**Step 6** -  Adding the country ISIN code to the final dataframe
**Step 7** -  Filling null values through IterativeImputer


**Step 1** Importing Libraries and Country Emissions Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from functools import reduce 


from sklearn.experimental import enable_iterative_imputer 
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.model_selection import train_test_split, cross_val_score


In [2]:
emiss_factor = pd.read_csv('../data/web/country_co2_mwh.csv')
emiss_gdp = pd.read_csv('../data/web/country_co2_gdp.csv')
emiss_trade = pd.read_csv('../data/web/co2_sharedintrade.csv')
emiss_fuel = pd.read_csv('../data/web/co2_byfuel.csv')
emiss_sector = pd.read_csv('../data/web/country_sector.csv')
ghg = pd.read_csv('../data/web/country_co2eq.csv')
country_isin = pd.read_csv('../data/web/country_isin.csv')

**Step 2** Filtering original data by country

In [3]:
#this is the list of the countries I am interested in
country_region = ['Thailand', 'United Kingdom', 'South Korea', 'Brazil','Netherlands', 'United States', 
                  'Australia', 'China', 'Sweden','Switzerland', 'Germany', 'Italy', 'Indonesia', 'Japan',
                  'Hong Kong', 'South Africa', 'Philippines', 'Canada', 'Poland','Spain', 'Qatar', 'Singapore', 
                  'France', 'Finland', 'Malaysia','Taiwan', 'Denmark', 'Turkey', 'Mexico','Belgium', 'Norway', 
                  'Russia', 'New Zealand','Portugal', 'Chile', 'Czechia', 'Colombia', 'India','Austria', 
                  'Saudi Arabia', 'Greece', 'Israel','United Arab Emirates', 'Egypt', 'Hungary',
                  'Ireland', 'Pakistan','Kuwait']

In [4]:
#filtering emiss_factor; 
#solution source: https://www.adamsmith.haus/python/answers/how-to-filter-a-pandas-dataframe-with-a-list-by-%60in%60-or-%60not-in%60-in-python 
boolean_series = emiss_factor.Entity.isin(country_region)
emiss_factor = emiss_factor[boolean_series]  
#filtering emiss gdp
boolean_series = emiss_gdp.Entity.isin(country_region)
emiss_gdp = emiss_gdp[boolean_series]
#filtering emiss trade
boolean_series = emiss_trade.Entity.isin(country_region)
emiss_trade = emiss_trade[boolean_series]
#filtering emiss_fuel
boolean_series = emiss_fuel.Entity.isin(country_region)
emiss_fuel = emiss_fuel[boolean_series]
#filtering ghg
boolean_series = ghg.Entity.isin(country_region)
ghg = ghg[boolean_series]
#filtering emiss_sector
boolean_series = emiss_sector.Entity.isin(country_region)
emiss_sector = emiss_sector[boolean_series]
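
The same filter can be applied to all datasets in one pass. The sketch below is only an illustration on made-up frames; `filter_countries` and `toy_frames` are hypothetical names that do not appear in the notebook. The `.copy()` is there because assigning new columns to a filtered slice can trigger pandas' `SettingWithCopyWarning`.

```python
import pandas as pd

def filter_countries(frames, countries, entity_col='Entity'):
    """Keep only the rows whose entity is in `countries`, for every frame in the dict."""
    return {name: df[df[entity_col].isin(countries)].copy()
            for name, df in frames.items()}

# Made-up frames standing in for the country datasets
toy_frames = {
    'ghg': pd.DataFrame({'Entity': ['Italy', 'World', 'Japan'], 'Year': [2018, 2018, 2018]}),
    'emiss_gdp': pd.DataFrame({'Entity': ['Europe', 'Italy'], 'Year': [2018, 2018]}),
}
filtered = filter_countries(toy_frames, ['Italy', 'Japan'])
print(filtered['ghg'])
```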

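**Step 3** Upsampling annual data to quarterly frequency

*General Note*: the dates are set to January 1 of each year, so upsampling with `resample('Q')` stops at the first quarter end (March 31) of the last available year and does not create the remaining three quarters of that year. For this reason, I had to add those quarters manually for every dataset.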
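
A toy illustration of this behaviour, assuming a made-up series with two annual observations. `pd.offsets.QuarterEnd(startingMonth=12)` is used in place of the `'Q'` alias only so that the sketch does not depend on how a given pandas version spells the alias; reindexing on an explicit date range is one possible alternative to adding the quarters by hand, not the method used in this notebook.

```python
import pandas as pd

# Two annual observations stamped on January 1 (made-up values)
annual = pd.DataFrame(
    {'value': [100.0, 120.0]},
    index=pd.to_datetime(['2017-01-01', '2018-01-01']),
)
quarter_end = pd.offsets.QuarterEnd(startingMonth=12)

# Upsampling stops at the first quarter end after the last observation (2018-03-31)
quarterly = annual.resample(quarter_end).ffill()

# Reindexing on an explicit range up to the end of the last year adds the missing quarters
full_index = pd.date_range(quarterly.index.min(), '2018-12-31', freq=quarter_end)
quarterly_full = annual.reindex(full_index, method='ffill')

print(quarterly.index[-1], quarterly_full.index[-1])
```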

In [5]:
# 1st dataset ghg: country co2eq

In [6]:
#helper function used for all datasets: it turns a year (e.g. 2018) into the string '2018-01-01',
#which is needed to convert the field to datetime later
def date(val):
    return str(val) + '-01-01'
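
As a quick check of what this helper produces once parsed, the sketch below compares it with parsing the year directly through an explicit format. The series `years` is made up for illustration.

```python
import pandas as pd

years = pd.Series([2017, 2018, 2019], name='Year')

# Same idea as the date() helper: build 'YYYY-01-01' strings, then parse them
via_helper = pd.to_datetime(years.astype(str) + '-01-01')

# Alternative: parse the year string with an explicit format (defaults to January 1)
via_format = pd.to_datetime(years.astype(str), format='%Y')

print(via_helper.tolist())
```

Both routes stamp each annual observation on January 1 of its year.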

In [7]:
#UPSAMPLING
ghg['Year'] = ghg['Year'].map(date)
ghg['Year'] = pd.to_datetime(ghg['Year'])
ghg.set_index('Year', inplace=True) 
ghg_quarter = ghg.groupby(by='Entity').resample('Q').ffill()
ghg_quarter['emissions_lucf'] = ghg_quarter['Total including LUCF'].map(lambda x: x/4)
ghg_quarter = ghg_quarter[['Code','emissions_lucf']]
ghg_quarter = ghg_quarter.reset_index()

In [8]:
#ADDING REMAINING QUARTERS
#step 1 - Filtering the dataframe with 2018 data as I know that those are the latest (check manually)
ghg_sample = ghg_quarter[ghg_quarter['Year']=='2018-03-31']
#step 2 - creating a for loop where I am manually adding all columns and adding to a list
lst=[]
for country in ghg_sample['Entity']:
    lst.append([f'{country}','2018-06-30', 
                ghg_sample[ghg_sample['Entity']==country]['Code'].values[0], 
                ghg_sample[ghg_sample['Entity']==country]['emissions_lucf'].values[0]])
    lst.append([f'{country}','2018-09-30',
                ghg_sample[ghg_sample['Entity']==country]['Code'].values[0], 
                ghg_sample[ghg_sample['Entity']==country]['emissions_lucf'].values[0]])
    lst.append([f'{country}','2018-12-31',
                ghg_sample[ghg_sample['Entity']==country]['Code'].values[0], 
               ghg_sample[ghg_sample['Entity']==country]['emissions_lucf'].values[0]])
    
#step 3 - converting the list in a pandas dataframe and concat and convert time in datetime
ghg_additional_quarter = pd.DataFrame(lst)
ghg_additional_quarter.rename(columns={0:'Entity', 1:'Year', 2:'Code', 3:'emissions_lucf'}, inplace=True)
ghg_quarter = pd.concat([ghg_quarter, ghg_additional_quarter]).sort_values(by=['Entity', 'Year'])
ghg_quarter['Year'] = pd.to_datetime(ghg_quarter['Year'])
ghg_quarter.tail(10)



Unnamed: 0,Entity,Year,Code,emissions_lucf
5191,United States,2016-09-30,USA,1419230000.0
5192,United States,2016-12-31,USA,1419230000.0
5193,United States,2017-03-31,USA,1403410000.0
5194,United States,2017-06-30,USA,1403410000.0
5195,United States,2017-09-30,USA,1403410000.0
5196,United States,2017-12-31,USA,1403410000.0
5197,United States,2018-03-31,USA,1448588000.0
135,United States,2018-06-30,USA,1448588000.0
136,United States,2018-09-30,USA,1448588000.0
137,United States,2018-12-31,USA,1448588000.0
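
The manual loop above could also be written as a small reusable helper. This is a minimal sketch on a made-up frame, not the code used in this notebook; `add_remaining_quarters` and `toy` are hypothetical names. It assumes the date column is already a datetime and that the last row of each entity should be repeated until December 31 of its last year.

```python
import pandas as pd

def add_remaining_quarters(df_quarter, entity_col='Entity', date_col='Year'):
    """Repeat each entity's last row for the quarter ends left in its last year."""
    quarter_end = pd.offsets.QuarterEnd(startingMonth=12)
    pieces = []
    for _, grp in df_quarter.groupby(entity_col):
        grp = grp.sort_values(date_col)
        last_date = grp[date_col].max()
        year_end = pd.Timestamp(year=last_date.year, month=12, day=31)
        # Quarter ends strictly after the last observed date, up to the end of the year
        extra_dates = pd.date_range(last_date, year_end, freq=quarter_end)
        extra_dates = extra_dates[extra_dates > last_date]
        if len(extra_dates) > 0:
            extra = grp.iloc[[-1] * len(extra_dates)].copy()
            extra[date_col] = extra_dates.values
            grp = pd.concat([grp, extra])
        pieces.append(grp)
    return pd.concat(pieces, ignore_index=True)

# Made-up quarterly data: entity A ends in 2018-03-31, entity B in 2019-03-31
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'B'],
    'Year': pd.to_datetime(['2017-12-31', '2018-03-31', '2019-03-31']),
    'Code': ['AAA', 'AAA', 'BBB'],
    'value': [1.0, 2.0, 5.0],
})
extended = add_remaining_quarters(toy)
print(extended)
```

Because the last date is read from each entity, the helper does not need the latest year to be checked manually for every dataset.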


In [9]:
#2nd dataset: country emissions factor

In [10]:
#UPSAMPLING
emiss_factor['Year'] = emiss_factor['Year'].map(date)
emiss_factor['Year'] = pd.to_datetime(emiss_factor['Year'])
emiss_factor.set_index('Year', inplace=True) 
emiss_factor = emiss_factor.groupby(by='Entity').resample('Q').ffill()
emiss_factor['emissions_factor(kg/kwh)'] = emiss_factor['Annual CO2 emissions per unit energy (kg per kilowatt-hour)'].map(lambda x: x/4)
emiss_factor_quarter = emiss_factor[['Code','emissions_factor(kg/kwh)']].reset_index()

In [11]:
#ADDING REMAINING QUARTERS
#step 1 - Filtering the dataframe with 2019 data as I know that those are the latest (check manually)
emiss_factor_sample = emiss_factor_quarter[emiss_factor_quarter['Year']=='2019-03-31']

#step 2 - creating a for loop where I am manually adding all columns and adding to a list
lst=[]
for country in emiss_factor_sample['Entity']:
    lst.append([f'{country}','2019-06-30', 
                emiss_factor_sample[emiss_factor_sample['Entity']==country]['Code'].values[0], 
                emiss_factor_sample[emiss_factor_sample['Entity']==country]['emissions_factor(kg/kwh)'].values[0]])
    lst.append([f'{country}','2019-09-30',
                emiss_factor_sample[emiss_factor_sample['Entity']==country]['Code'].values[0], 
                emiss_factor_sample[emiss_factor_sample['Entity']==country]['emissions_factor(kg/kwh)'].values[0]])
    lst.append([f'{country}','2019-12-31',
                emiss_factor_sample[emiss_factor_sample['Entity']==country]['Code'].values[0], 
               emiss_factor_sample[emiss_factor_sample['Entity']==country]['emissions_factor(kg/kwh)'].values[0]])
    
#step 3 - converting the list in a pandas dataframe and concat and convert time in datetime
additional_quarter = pd.DataFrame(lst)
additional_quarter.rename(columns={0:'Entity', 1:'Year', 2:'Code', 3:'emissions_factor(kg/kwh)'}, inplace=True)
emiss_factor_quarter = pd.concat([emiss_factor_quarter, additional_quarter]).sort_values(by=['Entity', 'Year'])
emiss_factor_quarter['Year'] = pd.to_datetime(emiss_factor_quarter['Year'])
emiss_factor_quarter.tail(10)



Unnamed: 0,Entity,Year,Code,emissions_factor(kg/kwh)
10349,United States,2017-09-30,USA,0.050775
10350,United States,2017-12-31,USA,0.050775
10351,United States,2018-03-31,USA,0.0506
10352,United States,2018-06-30,USA,0.0506
10353,United States,2018-09-30,USA,0.0506
10354,United States,2018-12-31,USA,0.0506
10355,United States,2019-03-31,USA,0.049975
141,United States,2019-06-30,USA,0.049975
142,United States,2019-09-30,USA,0.049975
143,United States,2019-12-31,USA,0.049975


In [12]:
#3rd dataset:emission and gdp

In [13]:
#UPSAMPLING
emiss_gdp['Year'] = emiss_gdp['Year'].map(date)
emiss_gdp['Year'] = pd.to_datetime(emiss_gdp['Year'])
emiss_gdp.set_index('Year', inplace=True) 
emiss_gdp = emiss_gdp.groupby(by='Entity').resample('Q').ffill()
emiss_gdp['emissions_gdp(kg/$ppp)'] = emiss_gdp['Annual CO2 emissions per GDP (kg per $PPP)'].map(lambda x: x/4)
emiss_gdp_quarter = emiss_gdp[['Code','emissions_gdp(kg/$ppp)']].reset_index()

In [14]:
#ADDING REMAINING QUARTERS
#step 1 - Filtering the dataframe with 2018 data as I know that those are the latest (check manually)
sample = emiss_gdp_quarter[emiss_gdp_quarter['Year']=='2018-03-31']

#step 2 - creating a for loop where I am manually adding all columns and adding to a list
lst=[]
for country in sample['Entity']:
    lst.append([f'{country}','2018-06-30', 
                sample[sample['Entity']==country]['Code'].values[0], 
                sample[sample['Entity']==country]['emissions_gdp(kg/$ppp)'].values[0]])
    lst.append([f'{country}','2018-09-30',
                sample[sample['Entity']==country]['Code'].values[0], 
                sample[sample['Entity']==country]['emissions_gdp(kg/$ppp)'].values[0]])
    lst.append([f'{country}','2018-12-31',
                sample[sample['Entity']==country]['Code'].values[0], 
               sample[sample['Entity']==country]['emissions_gdp(kg/$ppp)'].values[0]])
    
#step 3 - converting the list in a pandas dataframe and concat and convert time in datetime
additional_quarter = pd.DataFrame(lst)
additional_quarter.rename(columns={0:'Entity', 1:'Year', 2:'Code', 3:'emissions_gdp(kg/$ppp)'}, inplace=True)
emiss_gdp_quarter = pd.concat([emiss_gdp_quarter, additional_quarter]).sort_values(by=['Entity', 'Year'])
emiss_gdp_quarter['Year'] = pd.to_datetime(emiss_gdp_quarter['Year'])
emiss_gdp_quarter.tail(10)



Unnamed: 0,Entity,Year,Code,emissions_gdp(kg/$ppp)
25169,United States,2016-09-30,USA,0.07645
25170,United States,2016-12-31,USA,0.07645
25171,United States,2017-03-31,USA,0.074
25172,United States,2017-06-30,USA,0.074
25173,United States,2017-09-30,USA,0.074
25174,United States,2017-12-31,USA,0.074
25175,United States,2018-03-31,USA,0.074075
141,United States,2018-06-30,USA,0.074075
142,United States,2018-09-30,USA,0.074075
143,United States,2018-12-31,USA,0.074075


In [15]:
#4th dataset: trade

In [16]:
#UPSAMPLING
emiss_trade['Year'] = emiss_trade['Year'].map(date)
emiss_trade['Year'] = pd.to_datetime(emiss_trade['Year'])
emiss_trade.set_index('Year', inplace=True) 
emiss_trade = emiss_trade.groupby(by='Entity').resample('Q').ffill()
emiss_trade['emissions_in_trade'] = emiss_trade['Share of annual CO2 emissions embedded in trade'].map(lambda x: x/4)
emiss_trade= emiss_trade[['Code','emissions_in_trade']]
emiss_trade_quarter = emiss_trade.reset_index()

In [17]:
#ADDING REMAINING QUARTERS
#step 1 - Filtering the dataframe with 2019 data as I know that those are the latest (check manually)
sample = emiss_trade_quarter[emiss_trade_quarter['Year']=='2019-03-31']

#step 2 - creating a for loop where I am manually adding all columns and adding to a list
lst=[]
for country in sample['Entity']:
    lst.append([f'{country}','2019-06-30', 
                sample[sample['Entity']==country]['Code'].values[0], 
                sample[sample['Entity']==country]['emissions_in_trade'].values[0]])
    lst.append([f'{country}','2019-09-30',
                sample[sample['Entity']==country]['Code'].values[0], 
                sample[sample['Entity']==country]['emissions_in_trade'].values[0]])
    lst.append([f'{country}','2019-12-31',
                sample[sample['Entity']==country]['Code'].values[0], 
               sample[sample['Entity']==country]['emissions_in_trade'].values[0]])
    
#step 3 - converting the list in a pandas dataframe and concat and convert time in datetime
additional_quarter = pd.DataFrame(lst)
additional_quarter.rename(columns={0:'Entity', 1:'Year', 2:'Code', 3:'emissions_in_trade'}, inplace=True)
emiss_trade_quarter = pd.concat([emiss_trade_quarter, additional_quarter]).sort_values(by=['Entity', 'Year'])
emiss_trade_quarter['Year'] = pd.to_datetime(emiss_trade_quarter['Year'])
emiss_trade_quarter.tail(10)



Unnamed: 0,Entity,Year,Code,emissions_in_trade
5557,United States,2017-09-30,USA,1.6475
5558,United States,2017-12-31,USA,1.6475
5559,United States,2018-03-31,USA,1.6325
5560,United States,2018-06-30,USA,1.6325
5561,United States,2018-09-30,USA,1.6325
5562,United States,2018-12-31,USA,1.6325
5563,United States,2019-03-31,USA,1.76
141,United States,2019-06-30,USA,1.76
142,United States,2019-09-30,USA,1.76
143,United States,2019-12-31,USA,1.76


In [18]:
#5th dataset: fuel
#UPSAMPLING
emiss_fuel['Year'] = emiss_fuel['Year'].map(date)
emiss_fuel['Year'] = pd.to_datetime(emiss_fuel['Year'])
emiss_fuel.set_index('Year', inplace=True) 
emiss_fuel = emiss_fuel.groupby(by='Entity').resample('Q').ffill()
#dividing the annual values by 4; missing values are read as NaN and stay NaN after the division
emiss_fuel['co2_oil'] = emiss_fuel['Annual CO2 emissions from oil'] / 4
emiss_fuel['co2_flaring'] = emiss_fuel['Annual CO2 emissions from flaring'] / 4
emiss_fuel['co2_cement'] = emiss_fuel['Annual CO2 emissions from cement'] / 4
emiss_fuel['co2_coal'] = emiss_fuel['Annual CO2 emissions from coal'] / 4
emiss_fuel['co2_gas'] = emiss_fuel['Annual CO2 emissions from gas'] / 4
emiss_fuel['co2_other_industry'] = emiss_fuel['Annual CO2 emissions from other industry'] / 4
emiss_fuel= emiss_fuel[['Code','co2_oil', 'co2_flaring', 'co2_cement', 'co2_coal', 'co2_gas', 'co2_other_industry']]
emiss_fuel_quarter = emiss_fuel.reset_index()


In [19]:
#ADDING REMAINING QUARTERS
#step 1 - Filtering the dataframe with 2020 data as I know that those are the latest (check manually)
sample = emiss_fuel_quarter[emiss_fuel_quarter['Year']=='2020-03-31']

#step 2 - creating a for loop where I am manually adding all columns and adding to a list
lst=[]
for country in sample['Entity']:
    lst.append([f'{country}','2020-06-30', 
                sample[sample['Entity']==country]['Code'].values[0], 
                sample[sample['Entity']==country]['co2_oil'].values[0], 
                sample[sample['Entity']==country]['co2_flaring'].values[0], 
                sample[sample['Entity']==country]['co2_cement'].values[0],
                sample[sample['Entity']==country]['co2_coal'].values[0],
                sample[sample['Entity']==country]['co2_gas'].values[0],
                sample[sample['Entity']==country]['co2_other_industry'].values[0],])
    lst.append([f'{country}','2020-09-30',
                sample[sample['Entity']==country]['Code'].values[0], 
                sample[sample['Entity']==country]['co2_oil'].values[0], 
                sample[sample['Entity']==country]['co2_flaring'].values[0], 
                sample[sample['Entity']==country]['co2_cement'].values[0],
                sample[sample['Entity']==country]['co2_coal'].values[0],
                sample[sample['Entity']==country]['co2_gas'].values[0],
                sample[sample['Entity']==country]['co2_other_industry'].values[0]])
    lst.append([f'{country}','2020-12-31',
                sample[sample['Entity']==country]['Code'].values[0], 
                sample[sample['Entity']==country]['co2_oil'].values[0], 
                sample[sample['Entity']==country]['co2_flaring'].values[0], 
                sample[sample['Entity']==country]['co2_cement'].values[0],
                sample[sample['Entity']==country]['co2_coal'].values[0],
                sample[sample['Entity']==country]['co2_gas'].values[0],
                sample[sample['Entity']==country]['co2_other_industry'].values[0]])
    
#step 3 - converting the list in a pandas dataframe and concat and convert time in datetime
additional_quarter = pd.DataFrame(lst)
additional_quarter.rename(columns={0:'Entity', 1:'Year', 
                                   2:'Code', 
                                   3:'co2_oil', 
                                   4:'co2_flaring', 
                                   5:'co2_cement', 
                                   6: 'co2_coal',
                                   7: 'co2_gas',
                                   8: 'co2_other_industry'}, inplace=True)
emiss_fuel_quarter = pd.concat([emiss_fuel_quarter, additional_quarter]).sort_values(by=['Entity', 'Year'])
emiss_fuel_quarter['Year'] = pd.to_datetime(emiss_fuel_quarter['Year'])
emiss_fuel_quarter.tail(10)



Unnamed: 0,Entity,Year,Code,co2_oil,co2_flaring,co2_cement,co2_coal,co2_gas,co2_other_industry
28649,United States,2018-09-30,USA,579202700.0,17751977.25,9742686.25,320883100.0,410260300.0,6031949.25
28650,United States,2018-12-31,USA,579202700.0,17751977.25,9742686.25,320883100.0,410260300.0,6031949.25
28651,United States,2019-03-31,USA,578343000.0,21127431.25,10223967.25,274713400.0,423723500.0,5822719.75
28652,United States,2019-06-30,USA,578343000.0,21127431.25,10223967.25,274713400.0,423723500.0,5822719.75
28653,United States,2019-09-30,USA,578343000.0,21127431.25,10223967.25,274713400.0,423723500.0,5822719.75
28654,United States,2019-12-31,USA,578343000.0,21127431.25,10223967.25,274713400.0,423723500.0,5822719.75
28655,United States,2020-03-31,USA,505134400.0,21127431.25,10198644.5,222162300.0,413747100.0,5822719.75
141,United States,2020-06-30,USA,505134400.0,21127431.25,10198644.5,222162300.0,413747100.0,5822719.75
142,United States,2020-09-30,USA,505134400.0,21127431.25,10198644.5,222162300.0,413747100.0,5822719.75
143,United States,2020-12-31,USA,505134400.0,21127431.25,10198644.5,222162300.0,413747100.0,5822719.75


In [20]:
#UPSAMPLING
# Emissions by sector 
emiss_sector['Year'] = emiss_sector['Year'].map(date)
emiss_sector['Year'] = pd.to_datetime(emiss_sector['Year'])
emiss_sector.set_index('Year', inplace=True) 
emiss_sector = emiss_sector.groupby(by='Entity').resample('Q').ffill()
#dividing the annual values by 4; missing values are read as NaN and stay NaN after the division
emiss_sector['agriculture'] = emiss_sector['Agriculture'] / 4
emiss_sector['land_use_forestry'] = emiss_sector['Land-use change and forestry'] / 4
emiss_sector['waste'] = emiss_sector['Waste'] / 4
emiss_sector['industry'] = emiss_sector['Industry'] / 4
emiss_sector['manufact_construction'] = emiss_sector['Manufacturing and construction'] / 4
emiss_sector['transport'] = emiss_sector['Transport'] / 4
emiss_sector['electr_heat'] = emiss_sector['Electricity and heat'] / 4
emiss_sector['buildings'] = emiss_sector['Buildings'] / 4
emiss_sector['fugitive_emission'] = emiss_sector['Fugitive emissions'] / 4
emiss_sector['other_fuel_combustion'] = emiss_sector['Other fuel combustion'] / 4
emiss_sector['aviation_shipping'] = emiss_sector['Aviation and shipping'] / 4
emiss_sector_quarter= emiss_sector[['Code','agriculture','land_use_forestry','waste','industry','manufact_construction', 
                            'transport','electr_heat','buildings','fugitive_emission','other_fuel_combustion','aviation_shipping' ]].reset_index()

In [21]:
#ADDING REMAINING QUARTERS
#step 1 - Filtering the dataframe with 2018 data as I know that those are the latest (check manually)
sample = emiss_sector_quarter[emiss_sector_quarter['Year']=='2018-03-31']
#step 2 - creating a for loop where I am manually adding all columns and adding to a list
lst=[]
for country in sample['Entity']:
    lst.append([f'{country}','2018-06-30', 
                sample[sample['Entity']==country]['Code'].values[0], 
                sample[sample['Entity']==country]['agriculture'].values[0], 
                sample[sample['Entity']==country]['land_use_forestry'].values[0], 
                sample[sample['Entity']==country]['waste'].values[0],
                sample[sample['Entity']==country]['industry'].values[0],
                sample[sample['Entity']==country]['manufact_construction'].values[0],
                sample[sample['Entity']==country]['transport'].values[0],
               sample[sample['Entity']==country]['electr_heat'].values[0],
               sample[sample['Entity']==country]['buildings'].values[0],
               sample[sample['Entity']==country]['fugitive_emission'].values[0],
               sample[sample['Entity']==country]['other_fuel_combustion'].values[0],
               sample[sample['Entity']==country]['aviation_shipping'].values[0]])
    lst.append([f'{country}','2018-09-30', 
                sample[sample['Entity']==country]['Code'].values[0], 
                sample[sample['Entity']==country]['agriculture'].values[0], 
                sample[sample['Entity']==country]['land_use_forestry'].values[0], 
                sample[sample['Entity']==country]['waste'].values[0],
                sample[sample['Entity']==country]['industry'].values[0],
                sample[sample['Entity']==country]['manufact_construction'].values[0],
                sample[sample['Entity']==country]['transport'].values[0],
               sample[sample['Entity']==country]['electr_heat'].values[0],
               sample[sample['Entity']==country]['buildings'].values[0],
               sample[sample['Entity']==country]['fugitive_emission'].values[0],
               sample[sample['Entity']==country]['other_fuel_combustion'].values[0],
               sample[sample['Entity']==country]['aviation_shipping'].values[0]])
    lst.append([f'{country}','2018-12-31', 
                sample[sample['Entity']==country]['Code'].values[0], 
                sample[sample['Entity']==country]['agriculture'].values[0], 
                sample[sample['Entity']==country]['land_use_forestry'].values[0], 
                sample[sample['Entity']==country]['waste'].values[0],
                sample[sample['Entity']==country]['industry'].values[0],
                sample[sample['Entity']==country]['manufact_construction'].values[0],
                sample[sample['Entity']==country]['transport'].values[0],
               sample[sample['Entity']==country]['electr_heat'].values[0],
               sample[sample['Entity']==country]['buildings'].values[0],
               sample[sample['Entity']==country]['fugitive_emission'].values[0],
               sample[sample['Entity']==country]['other_fuel_combustion'].values[0],
               sample[sample['Entity']==country]['aviation_shipping'].values[0]])
    
#step 3 - converting the list in a pandas dataframe and concat and convert time in datetime
additional_quarter = pd.DataFrame(lst)
additional_quarter.rename(columns={0:'Entity', 
                                   1:'Year', 
                                   2:'Code', 
                                   3:'agriculture', 
                                   4:'land_use_forestry', 
                                   5:'waste', 
                                   6: 'industry',
                                   7: 'manufact_construction',
                                   8: 'transport', 
                                  9:'electr_heat', 
                                  10:'buildings', 
                                  11:'fugitive_emission', 
                                  12:'other_fuel_combustion', 
                                  13:'aviation_shipping'}, inplace=True)
emiss_sector_quarter = pd.concat([emiss_sector_quarter, additional_quarter]).sort_values(by=['Entity', 'Year'])
emiss_sector_quarter['Year'] = pd.to_datetime(emiss_sector_quarter['Year'])
emiss_sector_quarter.tail(10)



Unnamed: 0,Entity,Year,Code,agriculture,land_use_forestry,waste,industry,manufact_construction,transport,electr_heat,buildings,fugitive_emission,other_fuel_combustion,aviation_shipping
5191,United States,2016-09-30,USA,95350000.0,-57590000.0,32757500.0,55505000.0,109755000.0,427805000.0,536377500.0,124117500.0,71527500.0,23620000.0,34212500.0
5192,United States,2016-12-31,USA,95350000.0,-57590000.0,32757500.0,55505000.0,109755000.0,427805000.0,536377500.0,124117500.0,71527500.0,23620000.0,34212500.0
5193,United States,2017-03-31,USA,95482500.0,-57290000.0,33032500.0,56990000.0,107352500.0,431005000.0,516130000.0,124475000.0,72775000.0,23457500.0,37057500.0
5194,United States,2017-06-30,USA,95482500.0,-57290000.0,33032500.0,56990000.0,107352500.0,431005000.0,516130000.0,124475000.0,72775000.0,23457500.0,37057500.0
5195,United States,2017-09-30,USA,95482500.0,-57290000.0,33032500.0,56990000.0,107352500.0,431005000.0,516130000.0,124475000.0,72775000.0,23457500.0,37057500.0
5196,United States,2017-12-31,USA,95482500.0,-57290000.0,33032500.0,56990000.0,107352500.0,431005000.0,516130000.0,124475000.0,72775000.0,23457500.0,37057500.0
5197,United States,2018-03-31,USA,96312500.0,-57317500.0,33310000.0,58477500.0,114697500.0,440560000.0,525792500.0,137670000.0,75395000.0,23687500.0,36212500.0
135,United States,2018-06-30,USA,96312500.0,-57317500.0,33310000.0,58477500.0,114697500.0,440560000.0,525792500.0,137670000.0,75395000.0,23687500.0,36212500.0
136,United States,2018-09-30,USA,96312500.0,-57317500.0,33310000.0,58477500.0,114697500.0,440560000.0,525792500.0,137670000.0,75395000.0,23687500.0,36212500.0
137,United States,2018-12-31,USA,96312500.0,-57317500.0,33310000.0,58477500.0,114697500.0,440560000.0,525792500.0,137670000.0,75395000.0,23687500.0,36212500.0


**Step 4** Filtering df by date

In [22]:
ghg_quarter = ghg_quarter [ghg_quarter ['Year']>'2012-01-01']
emiss_factor_quarter= emiss_factor_quarter[emiss_factor_quarter['Year']>'2012-01-01']
emiss_gdp_quarter= emiss_gdp_quarter[emiss_gdp_quarter['Year']>'2012-01-01']
emiss_trade_quarter= emiss_trade_quarter[emiss_trade_quarter['Year']>'2012-01-01']
emiss_fuel_quarter= emiss_fuel_quarter[emiss_fuel_quarter['Year']>'2012-01-01']
emiss_sector_quarter= emiss_sector_quarter[emiss_sector_quarter['Year']>'2012-01-01']
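
Since every quarter is stamped on its last day, a strict comparison against January 1, 2012 keeps the first quarter of 2012 (2012-03-31) and drops everything before it. A small sketch on made-up data, using an explicit `pd.Timestamp` as the cutoff:

```python
import pandas as pd

# Made-up quarterly rows around the cutoff
toy = pd.DataFrame({
    'Entity': ['A'] * 3,
    'Year': pd.to_datetime(['2011-12-31', '2012-03-31', '2012-06-30']),
    'value': [1.0, 2.0, 3.0],
})
cutoff = pd.Timestamp('2012-01-01')
kept = toy[toy['Year'] > cutoff]

print(kept['Year'].min())
```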

**Step 5** Merging all dataframes into a single dataframe

In [23]:
#Source code: https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes
data_frames=[ghg_quarter,emiss_factor_quarter, emiss_gdp_quarter,emiss_trade_quarter,emiss_fuel_quarter, emiss_sector_quarter]
#dropping 'Code' from every dataframe before merging: all of them carry it, so the repeated merges
#would otherwise create the same Code_x/Code_y columns more than once (an error in recent pandas versions)
data_frames = [df.drop(columns='Code') for df in data_frames]
country_emission = reduce(lambda  left,right: pd.merge(left,right,on=['Year', 'Entity'],
                                            how='outer'), data_frames)
country_emission.shape



(1728, 23)
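
The outer merge keeps every (`Year`, `Entity`) pair that appears in at least one dataset and fills the columns of the other datasets with NaN. Since the datasets end in different years (2018, 2019 and 2020), this is one source of the null values handled later. A minimal sketch on made-up frames; the names `a`, `b` and `c` are hypothetical:

```python
from functools import reduce

import pandas as pd

# Three made-up datasets covering different quarters for the same entity
a = pd.DataFrame({'Entity': ['X', 'X'],
                  'Year': pd.to_datetime(['2018-03-31', '2018-06-30']),
                  'a': [1.0, 2.0]})
b = pd.DataFrame({'Entity': ['X', 'X'],
                  'Year': pd.to_datetime(['2018-06-30', '2018-09-30']),
                  'b': [10.0, 20.0]})
c = pd.DataFrame({'Entity': ['X'],
                  'Year': pd.to_datetime(['2018-03-31']),
                  'c': [7.0]})

merged = reduce(lambda left, right: pd.merge(left, right, on=['Year', 'Entity'], how='outer'),
                [a, b, c])

print(merged.shape)
print(merged.isna().sum())
```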

**Step 6** Adding the Country ISIN Code

In [24]:
country_region = {'Thailand':'TH', 'United Kingdom':'GB', 'South Korea':'KR', 'Brazil':'BR',
       'Netherlands':'NL', 'United States':'US', 'Australia':'AU', 'China':'CN', 'Sweden':'SE',
       'Switzerland':'CH', 'Germany':'DE', 'Italy':'IT', 'Indonesia':'ID', 'Japan':'JP',
       'Hong Kong': 'HK', 'South Africa':'ZA', 'Philippines':'PH', 'Canada':'CA', 'Poland':'PL',
       'Spain':'ES', 'Qatar':'QA', 'Singapore':'SG', 'France':'FR', 'Finland':'FI', 'Malaysia':'MY',
       'Taiwan':'TW', 'Denmark':'DK', 'Turkey':'TR', 'Mexico':'MX',
       'Belgium':'BE', 'Norway':'NO', 'Russia':'RU', 'New Zealand':'NZ',
       'Portugal':'PT', 'Chile':'CL', 'Czechia':'CZ', 'Colombia':'CO', 'India':'IN',
       'Austria':'AT', 'Saudi Arabia':'SA', 'Greece':'GR', 'Israel':'IL',
       'United Arab Emirates':'AE', 'Egypt':'EG', 'Hungary':'HU', 'Ireland':'IE', 'Pakistan':'PK',
       'Kuwait':'KW'}

In [25]:
def map_region(country):
    #returns the two-letter country code, or NaN when the country is missing or not in the dictionary
    if country is None:
        return np.nan
    return country_region.get(country, np.nan)
country_emission['country_isin'] = country_emission['Entity'].map(map_region)

In [26]:
country_emission['Entity'].unique()

array(['Australia', 'Austria', 'Belgium', 'Brazil', 'Canada', 'Chile',
       'China', 'Colombia', 'Czechia', 'Denmark', 'Egypt', 'Finland',
       'France', 'Germany', 'Greece', 'Hungary', 'India', 'Indonesia',
       'Ireland', 'Israel', 'Italy', 'Japan', 'Kuwait', 'Malaysia',
       'Mexico', 'Netherlands', 'New Zealand', 'Norway', 'Pakistan',
       'Philippines', 'Poland', 'Portugal', 'Qatar', 'Russia',
       'Saudi Arabia', 'Singapore', 'South Africa', 'South Korea',
       'Spain', 'Sweden', 'Switzerland', 'Thailand', 'Turkey',
       'United Arab Emirates', 'United Kingdom', 'United States',
       'Hong Kong', 'Taiwan'], dtype=object)
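
A quick way to verify that every entity received a code is to list the names that the dictionary does not cover. The sketch below uses a made-up subset of the dictionary and a deliberately unknown entity ('Atlantis'); it relies on `Series.map` returning NaN for keys that are missing from a plain dictionary.

```python
import pandas as pd

codes = {'France': 'FR', 'Japan': 'JP'}  # hypothetical subset of the dictionary
entities = pd.Series(['France', 'Japan', 'Atlantis'], name='Entity')

mapped = entities.map(codes)
unmapped = entities[mapped.isna()].unique().tolist()

print(unmapped)
```

An empty list means that all entities were mapped, which is consistent with the 0.0 share of nulls reported for `country_isin` in the next step.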

**Step 7** Filling Null Values

In [27]:
#checking missing values
country_emission.isnull().mean()

Entity                      0.000000
Year                        0.000000
emissions_lucf              0.254630
emissions_factor(kg/kwh)    0.111111
emissions_gdp(kg/$ppp)      0.222222
emissions_in_trade          0.111111
co2_oil                     0.000000
co2_flaring                 0.240741
co2_cement                  0.020833
co2_coal                    0.062500
co2_gas                     0.000000
co2_other_industry          0.375000
agriculture                 0.254630
land_use_forestry           0.254630
waste                       0.254630
industry                    0.254630
manufact_construction       0.254630
transport                   0.254630
electr_heat                 0.254630
buildings                   0.254630
fugitive_emission           0.254630
other_fuel_combustion       0.254630
aviation_shipping           0.254630
country_isin                0.000000
dtype: float64

In [28]:
country_emission.shape

(1728, 24)

In [29]:
# I need to do the following:
# 1. drop co2_flaring and co2_other_industry
# 2. impute emissions lucf/factor/gdp/trade and the columns from agriculture through aviation_shipping
# 3. drop any remaining null values

In [30]:
# dropping columns
country_emission.drop(columns=['co2_flaring','co2_other_industry'], inplace=True)

In [31]:
# Instantiate IterativeImputer and build a new df from the imputed values
id_cols = ['Entity', 'Year', 'country_isin']
num_cols = country_emission.columns.drop(id_cols)
imputer = IterativeImputer(max_iter=10, random_state=0)
transf = imputer.fit_transform(country_emission[num_cols].values).round(2)
# Reuse the original index so that the concat below aligns the rows correctly
country_emission_imputed = pd.DataFrame(transf, columns=num_cols, index=country_emission.index)
country_emission = pd.concat([country_emission_imputed, country_emission[id_cols]], axis=1)
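
The imputation pattern can be illustrated on a toy frame. This is a minimal sketch, assuming scikit-learn is installed. The column names `x` and `y` and the non-default index are made up for the example. It shows that `IterativeImputer` fills only the missing cells, and that passing the original index to the new DataFrame keeps `pd.concat(..., axis=1)` from misaligning rows.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must come before the next import
from sklearn.impute import IterativeImputer

# Toy data with a non-default index (hypothetical values).
df = pd.DataFrame(
    {
        'Entity': ['A', 'A', 'B', 'B'],
        'x': [1.0, 2.0, np.nan, 4.0],
        'y': [2.0, np.nan, 6.0, 8.0],
    },
    index=[10, 11, 12, 13],
)

id_cols = ['Entity']
num_cols = [c for c in df.columns if c not in id_cols]

# Each feature with missing values is modelled from the other features.
imputer = IterativeImputer(max_iter=10, random_state=0)
values = imputer.fit_transform(df[num_cols])

# Keep the original index so the identifier columns line up row by row.
imputed = pd.DataFrame(values, columns=num_cols, index=df.index)
result = pd.concat([imputed, df[id_cols]], axis=1)
print(result)
```

Without `index=df.index`, the imputed frame would get a fresh 0..3 index and the concat would produce eight rows, half of them filled with NaN.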

In [32]:
#checking null values
country_emission.isnull().mean()

emissions_lucf              0.0
emissions_factor(kg/kwh)    0.0
emissions_gdp(kg/$ppp)      0.0
emissions_in_trade          0.0
co2_oil                     0.0
co2_cement                  0.0
co2_coal                    0.0
co2_gas                     0.0
agriculture                 0.0
land_use_forestry           0.0
waste                       0.0
industry                    0.0
manufact_construction       0.0
transport                   0.0
electr_heat                 0.0
buildings                   0.0
fugitive_emission           0.0
other_fuel_combustion       0.0
aviation_shipping           0.0
Entity                      0.0
Year                        0.0
country_isin                0.0
dtype: float64

In [33]:
#checking df
country_emission.head(3)

Unnamed: 0,emissions_lucf,emissions_factor(kg/kwh),emissions_gdp(kg/$ppp),emissions_in_trade,co2_oil,co2_cement,co2_coal,co2_gas,agriculture,land_use_forestry,...,manufact_construction,transport,electr_heat,buildings,fugitive_emission,other_fuel_combustion,aviation_shipping,Entity,Year,country_isin
0,160780000.0,0.06,0.09,0.47,33134118.75,872009.75,46878709.0,16781063.5,54752500.0,-6952500.0,...,10250000.0,22607500.0,57690000.0,3350000.0,8492500.0,3040000.0,3215000.0,Australia,2012-03-31,AU
1,160780000.0,0.06,0.09,0.47,33134118.75,872009.75,46878709.0,16781063.5,54752500.0,-6952500.0,...,10250000.0,22607500.0,57690000.0,3350000.0,8492500.0,3040000.0,3215000.0,Australia,2012-06-30,AU
2,160780000.0,0.06,0.09,0.47,33134118.75,872009.75,46878709.0,16781063.5,54752500.0,-6952500.0,...,10250000.0,22607500.0,57690000.0,3350000.0,8492500.0,3040000.0,3215000.0,Australia,2012-09-30,AU


In [34]:
#rounding the large-valued columns to 3 decimals (they are already float, so the dtype itself does not change)
for col in ['emissions_lucf', 'co2_oil', 'co2_coal', 'co2_gas']:
    country_emission[col] = country_emission[col].apply(lambda x: '%.3f' % x).astype(float)
#source: https://re-thought.com/how-to-suppress-scientific-notation-in-pandas/
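
If the aim of the cell above is only to avoid scientific notation when the frame is displayed, one alternative is to change the pandas display option and leave the stored values untouched. The sketch below assumes a toy frame with a single hypothetical column; `pd.option_context` limits the setting to the `with` block.

```python
import pandas as pd

# Toy frame with one large and one very small value (hypothetical numbers).
df = pd.DataFrame({'emissions_lucf': [160780000.0, 1.5e-7]})

# The option affects only how floats are rendered, not the underlying data.
with pd.option_context('display.float_format', '{:.3f}'.format):
    text = df.to_string()

print(text)
```

The column stays `float64` throughout, so later numeric steps are unaffected.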

In [35]:
#checking df
country_emission.head(3)

Unnamed: 0,emissions_lucf,emissions_factor(kg/kwh),emissions_gdp(kg/$ppp),emissions_in_trade,co2_oil,co2_cement,co2_coal,co2_gas,agriculture,land_use_forestry,...,manufact_construction,transport,electr_heat,buildings,fugitive_emission,other_fuel_combustion,aviation_shipping,Entity,Year,country_isin
0,160780000.0,0.06,0.09,0.47,33134118.75,872009.75,46878709.0,16781063.5,54752500.0,-6952500.0,...,10250000.0,22607500.0,57690000.0,3350000.0,8492500.0,3040000.0,3215000.0,Australia,2012-03-31,AU
1,160780000.0,0.06,0.09,0.47,33134118.75,872009.75,46878709.0,16781063.5,54752500.0,-6952500.0,...,10250000.0,22607500.0,57690000.0,3350000.0,8492500.0,3040000.0,3215000.0,Australia,2012-06-30,AU
2,160780000.0,0.06,0.09,0.47,33134118.75,872009.75,46878709.0,16781063.5,54752500.0,-6952500.0,...,10250000.0,22607500.0,57690000.0,3350000.0,8492500.0,3040000.0,3215000.0,Australia,2012-09-30,AU


In [36]:
#saving
country_emission.to_csv('../data/output/country_emission.csv', index=False)