# DOPP 2020W Exercise 3
## Group 10

**Question:**

How has the use of nuclear energy evolved over time? How well does the use of nuclear energy correlate with changes in carbon emissions? Are there characteristics of a country that correlate with increases or decreases in the use of nuclear energy?

**Members:**

* Frank Ebel 01429282
* Josef Glas 08606876
* Felix Korbelius 01526132
* Johannes Schabbauer 11776224
<br> <span style="color:red">todo_frank Name order?</span>

**Work method:**

Each person wrote small Python scripts for their respective tasks. Frank merged these scripts into this notebook and modified them where necessary. <br> <span style="color:red">todo_frank Link to GitHub repository?</span>

## Loading necessary modules

In [20]:
import requests
import re
import os

# for working with data
import numpy as np
import pandas as pd

# country manipulation
import country_converter
import pycountry
import logging  # todo_frank is it appropriate here?

# for visualization
import matplotlib.pyplot as plt
import plotly.express as px

## Finding appropriate datasets

We decided that each person should focus on a different category. Each dataset should be organized by the combination of country and year. To keep country names consistent, we decided to use ISO 3166-1 alpha-3 country codes. The categories were divided as follows:


* Josef: energy consumption and production data
* Frank: economical data (GDP, growth, ...)
* Felix: ecological data (CO$_2$-emission, pollution, ...)
* Johannes: political data (operating reactors, government, accidents, <span style="color:red">climate agreements</span>, ...)
<br> <span style="color:red">todo_frank Name order?</span>
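
The shared key convention described above can be checked mechanically before any merge. The following is a minimal sketch and not part of the group's scripts: the helper name `check_country_year_keys` and the toy frame are hypothetical, and the check only tests the shape of the code (three uppercase letters), not whether it is an assigned ISO 3166 code.

```python
import pandas as pd


def check_country_year_keys(df: pd.DataFrame) -> bool:
    """Return True if df has unique (country, year) keys and every
    country value looks like an alpha-3 code (three uppercase letters)."""
    if not {"country", "year"}.issubset(df.columns):
        return False
    looks_like_alpha3 = df["country"].astype(str).str.match(r"^[A-Z]{3}$").all()
    keys_unique = not df.duplicated(subset=["country", "year"]).any()
    return bool(looks_like_alpha3 and keys_unique)


# Toy data with made-up values.
toy = pd.DataFrame({
    "year": [1980, 1981, 1980],
    "country": ["AUT", "AUT", "FRA"],
    "GDP": [1.0, 1.1, 2.0],
})
print(check_country_year_keys(toy))
```

A check like this could run on each loader's output, so that aggregate rows or duplicated keys surface before the merge rather than after it.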

**Used datasets**

All used datasets are in the folder `./data`.

* __[U.S. Energy Information Administration](https://www.eia.gov/international/data/world)__\
`USEIA/`
* __[GDP (current USD)](https://data.worldbank.org/indicator/NY.GDP.MKTP.CD)__\
`API_NY.GDP.MKTP.CD_DS2_en_csv_v2_1740389/`
* __[GDP growth (annual \%)](https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG)__\
`API_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2_1836177/`
* __[GDP per capita (current USD)](https://data.worldbank.org/indicator/NY.GDP.PCAP.CD)__\
`API_NY.GDP.PCAP.CD_DS2_en_csv_v2_1740213/`
* __[GDP per capita growth (annual \%)](https://data.worldbank.org/indicator/NY.GDP.PCAP.KD.ZG)__\
`API_NY.GDP.PCAP.KD.ZG_DS2_en_csv_v2_1740284/`
* __[Adjusted net national income per capita (current USD)](https://data.worldbank.org/indicator/NY.ADJ.NNTY.PC.CD)__\
`API_NY.ADJ.NNTY.PC.CD_DS2_en_csv_v2_1745486/`
* __[Adjusted net national income per capita (annual \% growth)](https://data.worldbank.org/indicator/NY.ADJ.NNTY.PC.KD.ZG)__ \
`API_NY.ADJ.NNTY.PC.KD.ZG_DS2_en_csv_v2_1745488/`
* __[Data on CO2 and Greenhouse Gas Emissions by Our World in Data](https://github.com/owid/co2-data)__\
`owid-co2-data.csv`
* __[Power Reactor Information System](https://pris.iaea.org/PRIS/CountryStatistics/CountryStatisticsLandingPage.aspx)__\
`reactor_numbers_PRIS_IAEA.csv`
* __[Nuclear Warheads per country](https://data.world/datagraver/nuclear-warheads-per-country)__\
`nuclear_warheads_1945_2016.csv`
* __[Gross domestic expenditure on R&D (GERD), GERD as a percentage of GDP](http://data.uis.unesco.org/Index.aspx?DataSetCode=SCN_DS#)__\
`SCN_DS_16122020083400698.csv`
* __[Nuclear Power Accidents (Deaths and Costs)](https://data.world/rebeccaclay/nuclear-power-accidents)__\
`C_id_35_NuclearPowerAccidents2016.csv`
* __[The Global State of Democracy Indices](https://www.idea.int/gsod-indices/dataset-resources)__\
`gsodi_pv_4.csv`


<br> <span style="color:red">todo_frank datasets order?</span>

### Dataset loading by Josef

In [7]:
# initialize CountryConverter as cc and disable warnings
country_converter.logging.getLogger().setLevel(logging.CRITICAL)
cc = country_converter.CountryConverter()

def load_useia_data():
    data = pd.DataFrame()

    ### CONSUMPTION
    target = './data/USEIA/USEIA_CONSUMPTION_1980-2018.csv'

    consumption = pd.read_csv(target, sep=",", decimal=".", header=0, skiprows=1, na_values='--')

    # rename columns
    consumption = consumption.rename(columns={'Unnamed: 1': 'text'})
    consumption.insert(loc=0, column='country', value=0)
    consumption.insert(loc=0, column='check', value='X')

    consumption['country'] = consumption['API'].str[-10:-7]
    #consumption = map_historic_countries(consumption)
    consumption['check'] = cc.convert(names=consumption['country'].to_list(), to='ISO3')

    consumption = consumption.sort_values(by=['country'])

    # data cleaning

    # remove rows where country code check failed
    raw = consumption[consumption['check'] != 'not found']

    raw = raw.reset_index(drop=True)

    consumption = pd.DataFrame(columns=['year', 'country', 'cons_btu', 'coal_cons_btu', 'gas_cons_btu', \
                                        'oil_cons_btu', 'nuclear_cons_btu', 'renewables_cons_btu'])

    countries = pd.DataFrame(raw['country'])
    countries = countries.sort_values(by=['country'])
    countries.drop_duplicates(inplace=True)
    countries = countries.reset_index(drop=True)

    counter = 0
    for idx1, c in countries.iterrows():

        temp = raw[raw['country'] == c.iloc[0]]

        j = 4
        for i in range(1980, 2019):

            v_cons = 0
            v_coal = 0
            v_gas = 0
            v_oil = 0
            v_nuclear = 0
            v_renewables = 0

            new_row = {'year': i, 'country': c.iloc[0]}
            for idx2, row in temp.iterrows():

                text = row.iloc[3]

                if text.find("Consumption") != -1:
                    v_cons = row.iloc[j]
                elif text.find("Coal") != -1:
                    v_coal = row.iloc[j]
                elif text.find("Natural gas") != -1:
                    v_gas = row.iloc[j]
                elif text.find("Petroleum") != -1:
                    v_oil = row.iloc[j]
                elif text.find("Nuclear (") != -1:
                    v_nuclear = row.iloc[j]
                elif text.find("Renewables and other") != -1:
                    v_renewables = row.iloc[j]

            new_row['cons_btu'] = v_cons
            new_row['coal_cons_btu'] = v_coal
            new_row['gas_cons_btu'] = v_gas
            new_row['oil_cons_btu'] = v_oil
            new_row['nuclear_cons_btu'] = v_nuclear
            new_row['renewables_cons_btu'] = v_renewables

            consumption.loc[counter] = new_row
            counter += 1
            j += 1

    ### PRODUCTION
    target = './data/USEIA/USEIA_PRODUCTION_1980-2018.csv'

    production = pd.read_csv(target, sep=",", decimal=".", header=0, skiprows=1, na_values='--')

    # rename columns
    production = production.rename(columns={'Unnamed: 1': 'text'})
    production.insert(loc=0, column='country', value=0)
    production.insert(loc=0, column='check', value='X')

    production['country'] = production['API'].str[-10:-7]
    #production = map_historic_countries(production)
    production['check'] = cc.convert(names=production['country'].to_list(), to='ISO3')

    production = production.sort_values(by=['country'])

    # data cleaning

    # remove rows where country code check failed
    raw = production[production['check'] != 'not found']

    raw = raw.reset_index(drop=True)

    production = pd.DataFrame(columns=['year', 'country', 'prod_btu', 'coal_prod_btu', 'gas_prod_btu', \
                                       'oil_prod_btu', 'nuclear_prod_btu', 'renewables_prod_btu'])

    countries = pd.DataFrame(raw['country'])
    countries = countries.sort_values(by=['country'])
    countries.drop_duplicates(inplace=True)
    countries = countries.reset_index(drop=True)

    counter = 0
    for idx3, c in countries.iterrows():

        temp = raw[raw['country'] == c.iloc[0]]

        j = 4
        for i in range(1980, 2019):

            v_prod = 0
            v_coal = 0
            v_gas = 0
            v_oil = 0
            v_nuclear = 0
            v_renewables = 0

            new_row = {'year': i, 'country': c.iloc[0]}
            for idx4, row in temp.iterrows():

                text = row.iloc[3]

                if text.find("Production") != -1:
                    v_prod = row.iloc[j]
                elif text.find("Coal") != -1:
                    v_coal = row.iloc[j]
                elif text.find("Natural gas") != -1:
                    v_gas = row.iloc[j]
                elif text.find("Petroleum") != -1:
                    v_oil = row.iloc[j]
                elif text.find("Nuclear (") != -1:
                    v_nuclear = row.iloc[j]
                elif text.find("Renewables and other") != -1:
                    v_renewables = row.iloc[j]

            new_row['prod_btu'] = v_prod
            new_row['coal_prod_btu'] = v_coal
            new_row['gas_prod_btu'] = v_gas
            new_row['oil_prod_btu'] = v_oil
            new_row['nuclear_prod_btu'] = v_nuclear
            new_row['renewables_prod_btu'] = v_renewables

            production.loc[counter] = new_row
            counter += 1
            j += 1

    data = pd.merge(consumption, production, how="outer", on=['year', 'country'])

    data['year'] = data['year'].astype('int64')
    data['nuclear_prod_btu'] = data['nuclear_prod_btu'].astype('float64')

    return data
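
The nested loops in `load_useia_data()` reshape a wide table (one column per year) into one row per country and year. As a hedged alternative sketch, not the method used above, the same reshaping can be expressed with `melt` and `unstack`. The toy frame, its `series` column, and all values are made up for illustration. Unlike the loops above, series that are missing for a country stay `NaN` instead of being set to 0.

```python
import pandas as pd

# Toy stand-in for the wide layout: one row per (country, series),
# one column per year. Names and values are invented for illustration.
raw = pd.DataFrame({
    "country": ["AUT", "AUT", "FRA", "FRA"],
    "series": ["coal_cons_btu", "nuclear_cons_btu",
               "coal_cons_btu", "nuclear_cons_btu"],
    "1980": [0.1, 0.0, 0.9, 0.6],
    "1981": [0.2, 0.0, 0.8, 1.0],
})

# Wide -> long: one row per (country, series, year).
long = raw.melt(id_vars=["country", "series"], var_name="year", value_name="value")
long["year"] = long["year"].astype("int64")

# Long -> one row per (year, country) with one column per series.
tidy = (
    long.set_index(["year", "country", "series"])["value"]
    .unstack("series")
    .reset_index()
)
tidy.columns.name = None  # drop the leftover 'series' label on the column axis
print(tidy)
```

Because this avoids row-by-row `.loc` assignment, it should scale better with the number of countries and years, although that has not been measured on the USEIA files.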

### Dataset loading by Frank

In [12]:
def load_economical_data():
    """Load economical data into dataframe and return it.

    Common data structure:
    0: year
    1: country code
    2..: features"""

    def get_df(filepath):
        df = pd.read_csv(filepath, sep=',', skip_blank_lines=True, header=2)
        df.drop(columns_drop, axis=1, inplace=True)
        df.rename(columns={'Country Code': 'country'}, inplace=True)
        return df

    columns_drop = ['Country Name', 'Indicator Name',  'Indicator Code', 'Unnamed: 65']  # columns to drop
    dfs = []  # List of all dataframes.

    # load dataframe of GDP
    df_GDP = get_df('./data/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_1740389/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_1740389.csv')
    # melt and order to get in right format
    df_GDP = df_GDP.melt(id_vars=['country'], var_name='year', value_name='GDP')
    df_GDP['year'] = df_GDP['year'].astype('int64')
    dfs.append(df_GDP)

    # load dataframe of GDP growth
    df_GDP_growth = get_df('./data/API_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2_1836177/'
                           'API_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2_1836177.csv')
    # melt and order to get in right format
    df_GDP_growth = df_GDP_growth.melt(id_vars=['country'], var_name='year', value_name='GDP growth')
    df_GDP_growth['year'] = df_GDP_growth['year'].astype('int64')
    dfs.append(df_GDP_growth)

    # load dataframe of GDP per capita
    df_GDP_per_capita = get_df('./data/API_NY.GDP.PCAP.CD_DS2_en_csv_v2_1740213/'
                               'API_NY.GDP.PCAP.CD_DS2_en_csv_v2_1740213.csv')
    # melt and order to get in right format
    df_GDP_per_capita = df_GDP_per_capita.melt(id_vars=['country'], var_name='year', value_name='GDP per capita')
    df_GDP_per_capita['year'] = df_GDP_per_capita['year'].astype('int64')
    dfs.append(df_GDP_per_capita)

    # load dataframe of GDP per capita growth
    df_GDP_per_capita_growth = get_df('./data/API_NY.GDP.PCAP.KD.ZG_DS2_en_csv_v2_1740284/'
                                      'API_NY.GDP.PCAP.KD.ZG_DS2_en_csv_v2_1740284.csv')
    # melt and order to get in right format
    df_GDP_per_capita_growth = df_GDP_per_capita_growth.melt(id_vars=['country'], var_name='year',
                                                             value_name='GDP per capita growth')
    df_GDP_per_capita_growth['year'] = df_GDP_per_capita_growth['year'].astype('int64')
    dfs.append(df_GDP_per_capita_growth)

    # load dataframe of income per capita
    df_income_per_capita = get_df('./data/API_NY.ADJ.NNTY.PC.CD_DS2_en_csv_v2_1745486/'
                                  'API_NY.ADJ.NNTY.PC.CD_DS2_en_csv_v2_1745486.csv')
    # melt and order to get in right format
    df_income_per_capita = df_income_per_capita.melt(id_vars=['country'], var_name='year',
                                                     value_name='income per capita')
    df_income_per_capita['year'] = df_income_per_capita['year'].astype('int64')
    dfs.append(df_income_per_capita)

    # load dataframe of income per capita growth
    df_income_per_capita_growth = get_df('./data/API_NY.ADJ.NNTY.PC.KD.ZG_DS2_en_csv_v2_1745488/'
                                         'API_NY.ADJ.NNTY.PC.KD.ZG_DS2_en_csv_v2_1745488.csv')
    # melt and order to get in right format
    df_income_per_capita_growth = df_income_per_capita_growth.melt(id_vars=['country'], var_name='year',
                                                                   value_name='income per capita growth')
    df_income_per_capita_growth['year'] = df_income_per_capita_growth['year'].astype('int64')
    dfs.append(df_income_per_capita_growth)

    # merge and sort all dataframes
    result = dfs[0]
    for df in dfs[1:]:
        result = result.merge(df, how='outer', on=['country', 'year'])
    result.sort_values(['country', 'year'], inplace=True)
    result.reset_index(inplace=True, drop=True)

    # Since there are some aggregated values (e. g. WLD for world) remove all rows which don't have a valid
    # ISO 3166 Alpha-3 code.
    alpha_3_list = [country.alpha_3 for country in list(pycountry.countries)]  # all valid codes
    valid_entry = result['country'].isin(alpha_3_list)  # boolean series if each row is valid or not
    result = result.loc[valid_entry]
    # invalid = set(result.loc[~valid_entry]['country'].tolist())
    # print('invalid', invalid)

    return result

### Dataset loading by Felix

In [16]:
def load_emission_data():
    """ 
    Load all emission data files and combine them into a single Pandas DataFrame.
    Common data structure: 0-year, 1-country code, 2+-features.
    Check for correct typing.

    return:
    emission_data: data frame containing different emission data per country per year.
    """

    path = './data/owid-co2-data.csv'
    df_emission_data = pd.read_csv(path, sep=',')

    cols = ['year', 'iso_code']
    # Rearrange columns, so that year and country-code (iso-code) are the first two columns.
    new_cols = cols + df_emission_data.columns.drop(cols).tolist()
    # Drop country column.
    df_emission_data = df_emission_data[new_cols].drop(['country'], axis=1)
    # Rename iso_code to country and convert to string.
    df_emission_data[['iso_code']] = df_emission_data[['iso_code']].astype('string')
    df_emission_data = df_emission_data.rename(columns={'iso_code': 'country'})
    return df_emission_data


def resize_emission(df):
    """ Index dataframe and eliminate non-country specific data.

    Attention: When handling NaN values look at the values of a specific column, if there exists a NaN value
    above/below a 0 entry, it is highly possible that NaN are truly missing values.

    Time-range: 1980-2018

    return:
    trimmed down and somewhat ordered emission_data."""
    data_emission_i = df.copy()
    # Only keep countries (check len(country_code) == 3) - raw data contains continental data, etc. with a blank
    # country code (i.e. length 0).
    data_emission_i = data_emission_i[data_emission_i['country'].str.len() == 3]
    # Set index on country_code and year (group by country_code).
    # data_emission_i = data_emission_i.set_index(['country_code', 'year'])
    # Keep most interesting columns:
    data_emission_i = data_emission_i.drop(data_emission_i.iloc[:, -5:-2], axis=1)
    data_emission_i = data_emission_i.drop(data_emission_i.iloc[:, 16:26], axis=1)  # delete cement,... produc. emission
    data_emission_i = data_emission_i.drop(['gdp', 'trade_co2', 'trade_co2_share'], axis=1)
    return data_emission_i

### Dataset loading by Johannes

In [21]:
# initialize CountryConverter as cc and disable warnings
country_converter.logging.getLogger().setLevel(logging.CRITICAL)
cc = country_converter.CountryConverter()
# dictionary for country replacements (that cannot be read by country_converter)
# using current ones for outdated names, e.g. 'USSR' --> 'Russia'
_dict_country_repl = {'UK':'United Kingdom', 'USSR':'Russia', 'Soviet Union':'Russia', 'East Germany':'Germany',
                      'Illinois':'US', 'Tawian':'Taiwan', 'Yugoslavia':'Serbia', 'Scotland':'United Kingdom'}

###################################################################################################
def load_political_data():
    # read data from different datasets in the category 'political'
        
    # nuclear warheads
    # read file and exclude last (empty) line
    warheads = pd.read_csv('./data/nuclear_warheads_1945_2016.csv', 
                           sep=';',thousands='.', decimal=',').iloc[:-1]
    warheads['Year'] = warheads['Year'].astype('int')
    warheads = warheads.set_index('Year')
    # transform dataframe from 2D to MultiIndex
    warheads = warheads.stack()
    warheads = warheads.reset_index()
    # set column names and convert country names to ISO3
    warheads.columns = ['year', 'country', 'nuclear_warheads']
    warheads['country'] = cc.convert(warheads['country'].to_list(), to='ISO3')
    warheads = warheads.set_index(['year', 'country'])
    
    
    # research expenditure
    research = pd.read_csv('./data/SCN_DS_16122020083400698.csv')
    # choose only lines with relative expenditure (for all research categories)
    research = research.loc[research['Indicator']=="GERD as a percentage of GDP"]
    # choose relevant columns and rename them
    research = research[['Time','Country', 'Value']]
    research.columns = ['year', 'country', 'research_%GDP']
    # convert to ISO3 and exclude regions (cannot be converted to countrycode)
    research['country'] = research['country'].replace({'Oceania (Australia/New Zealand)':'not found'})
    research['country'] =  cc.convert(research['country'].to_list(), to='ISO3', not_found='not found')
    research = research[research['country'] != 'not found']
    research = research.set_index(['year', 'country'])

    
    # accidents of nuclear power plants
    accidents = pd.read_csv('./data/C_id_35_NuclearPowerAccidents2016.csv')
    accidents = accidents[['Date', 'Location', 'Cost (millions 2013US$)', 'Fatalities']]
    accidents.columns = ['year', 'country', 'accident_cost_MioUSD2013', 'accident_deaths']
    # use only year from Date column
    accidents['year'] = accidents['year'].str.slice(start=-4).astype('int')
    # use last part of Location (usually the country)
    accidents['country'] = accidents['country'].str.split(',').str[-1].str.lstrip(' ')
    # do some corrections (e.g. old country names or missing ones)
    accidents['country'] = accidents['country'].replace(_dict_country_repl)
    # conversion to ISO3
    accidents['country'] = cc.convert(accidents['country'].to_list(), to='ISO3')
    accidents = accidents.set_index(['year', 'country'])
    # sum values, if there was more than one accident per year and country
    accidents = accidents.groupby(level=['year','country']).sum()
    
    # democracy indicators
    democracy = pd.read_csv('./data/gsodi_pv_4.csv', low_memory=False)
    # choose five main categories
    democracy = democracy[['ID_year','ID_country_name','C_A1','C_A2','C_A3','C_A4','C_SD51']]
    democracy.columns = ['year', 'country', 'representative_government', 'fundamental_rights', 
                         'checks_on_gouvernment', 'impartial_administration', 'civil_society_participation']
    # avoid that 'Southern Africa' is converted to 'ZAF' and count 'East Germany' as 'Germany' 
    democracy['country'] = democracy['country'].replace(
            {'Southern Africa':' ','German Democratic Republic':'Germany'})
    democracy['country'] = cc.convert(democracy['country'].to_list(), to='ISO3')
    # exclude regions (and east germany)
    democracy = democracy[democracy['country'] != 'not found']
    democracy = democracy.set_index(['year', 'country'])
    # use mean value for duplicate values (EAST and WEST GERMANY)
    democracy = democracy.groupby(level=['year','country']).mean()

    # get number of reactors from separate function
    reactors = load_reactor_numbers()
    
    # merging and fill some of the missing values
    merge = pd.concat(
            [reactors,warheads,accidents,research,democracy],
            axis=1, join='outer')


    #merge.iloc[:,3] = merge.iloc[:,3].unstack().interpolate().stack()
    merge.iloc[:,:3] = merge.iloc[:,:3].fillna(value=0)
    #merge.iloc[:,4:6] = merge.iloc[:,4:6].fillna(value='None')
    
    merge = merge.sort_index(level=['country'])
    
    return merge

###################################################################################################
def load_reactor_numbers():
    # loading number of operational nuclear power plants from IAEA-PRIS database (public version)
    
    # if data was already loaded from webpages, read directly from saved csv file
    if os.path.isfile('./data/reactor_numbers_PRIS_IAEA.csv'):
        reactors = pd.read_csv('./data/reactor_numbers_PRIS_IAEA.csv', index_col=[0,1])
        return reactors
    
    # create containers for reactor data per country
    startup_dict=dict()
    shutdown_dict=dict()

    # fetch table for reactors from public webpage
    url = 'https://pris.iaea.org/PRIS/CountryStatistics/ReactorDetails.aspx?current='
    for num in range(1000): # manual maximal id of reactor
        page = requests.get(url+str(num))
        if page.status_code < 400: # exclude non-existing IDs
            # find country (ISO2) in html and load tables from page
            country = re.findall(r'[\d\D]*color="DarkGray"', str(page.content))[0][-26:-24]
            country = cc.convert(country, src='ISO2', to='ISO3')
            # create dict entries for new countries
            if country not in startup_dict.keys():
                startup_dict[country] = np.empty(shape=0, dtype='int')
                shutdown_dict[country] = np.empty(shape=0, dtype='int')
            page_df = pd.read_html(page.content)
            if len(page_df) < 3: # exclude reactor if never started
                continue
            # get year of startup
            if page_df[0].iloc[6,1]=='Commercial Operation Date':
                # if 'Commercial Operation Date' is not given (NaN), use 'First Grid Connection'
                # (isinstance check: comparing type(...) with the string 'str' is always unequal)
                if not isinstance(page_df[0].iloc[7,1], str):
                    startup_dict[country] = np.append(startup_dict[country], int(page_df[0].iloc[7,0][-4:]))
                else:
                    startup_dict[country] = np.append(startup_dict[country], int(page_df[0].iloc[7,1][-4:]))
            # get year of reactor shutdown (if given)
            if page_df[0].iloc[8,0]=='Permanent Shutdown Date':
                shutdown_dict[country] = np.append(shutdown_dict[country], int(page_df[0].iloc[9,0][-4:]))

    # calculate operating reactors from startup and shutdown dates
    # of each reactor (from dicts) for each country per year
    reactors = pd.DataFrame()
    for ISO in startup_dict.keys():
        if len(startup_dict[ISO])==0:
            continue
        reactors_country = pd.DataFrame()
        reactors_country['year'] = np.arange(startup_dict[ISO].min(),2021)
        reactors_country['country'] = np.full(shape=reactors_country.shape[0], fill_value=ISO)
        reactors_country['built_reactors'] = np.fromiter(
                (startup_dict[ISO][startup_dict[ISO] <= year].size for year in reactors_country['year'] )
                ,dtype='int')
        reactors_country['shutdown_reactors'] = np.fromiter(
                (shutdown_dict[ISO][shutdown_dict[ISO] <= year].size for year in reactors_country['year'] )
                ,dtype='int')
        reactors_country['operating_reactors'] = reactors_country['built_reactors'] - reactors_country['shutdown_reactors']
        reactors = pd.concat([reactors, reactors_country],axis=0)
    reactors = reactors.set_index(['year', 'country'])
    # save DataFrame to a CSV file so the data need not be fetched every time
    reactors.to_csv('./data/reactor_numbers_PRIS_IAEA.csv')
    return reactors
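
The counting step in `load_reactor_numbers()` derives the number of operating reactors per year as reactors started up to that year minus reactors shut down up to that year. The following sketch illustrates the same arithmetic on made-up startup and shutdown years for a single hypothetical country. It uses `np.searchsorted` instead of the generator expressions above, which is a swapped-in technique and not the notebook's own code.

```python
import numpy as np

# Made-up startup and shutdown years for one hypothetical country.
startup_years = np.array([1970, 1975, 1975, 1982])
shutdown_years = np.array([1990, 2005])

years = np.arange(startup_years.min(), 2011)

# On a sorted array, searchsorted(..., side="right") returns, for each year,
# how many entries are <= that year.
built = np.searchsorted(np.sort(startup_years), years, side="right")
shut = np.searchsorted(np.sort(shutdown_years), years, side="right")
operating = built - shut

for year in (1970, 1975, 1990, 2010):
    print(year, operating[years == year][0])
```

As in the original function, a reactor counts as shut down from the calendar year of its shutdown date onward.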

### Merging datasets

Frank merged the datasets; Johannes wrote the function <span style="color:blue">clean_data_after_merge()</span>. Because the next cell takes a long time to run, the merged and cleaned dataframe was written to `./data/data_merged/data.csv`. When exploring the data, loading this CSV is much faster than rerunning the cell each time.
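
The caching idea can be sketched as a small helper. This is an illustration under assumptions, not code from the notebook: `load_or_build` and `build_toy` are hypothetical names, the toy values are made up, and a temporary directory stands in for `./data/data_merged/`.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path, build):
    """Read the cached CSV at `path` if it exists; otherwise call `build()`,
    write its result to `path`, and return it."""
    if os.path.isfile(path):
        return pd.read_csv(path)
    df = build()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    df.to_csv(path, index=False)
    return df


def build_toy():
    # Stand-in for the slow merge; values are made up.
    return pd.DataFrame({"year": [1980, 1981], "country": ["AUT", "AUT"], "x": [1.0, 2.0]})


cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "data_merged", "data.csv")
first = load_or_build(cache_path, build_toy)   # builds the frame and writes the cache
second = load_or_build(cache_path, build_toy)  # reads the cached CSV
```

One caveat of any such cache is that it goes stale when a loader changes, so the CSV has to be deleted or regenerated after edits to the loading code.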

In [25]:
def clean_data_after_merge(df):
    """Fill some missing data in merged dataframe (modifies df in place)."""

    # missing reactor, warhead and accident entries are treated as zero
    for column in ['built_reactors', 'shutdown_reactors', 'operating_reactors', 'nuclear_warheads',
                   'accident_cost_MioUSD2013', 'accident_deaths']:
        # assign back instead of calling fillna(inplace=True) on a column selection
        df[column] = df[column].fillna(value=0)

        
df_energy = load_useia_data()
df_emission = resize_emission(load_emission_data())
df_economy = load_economical_data()
df_politics = load_political_data()
df_politics.reset_index(drop=False, inplace=True)

# merge all dataframes:
dataframe = df_energy
for df in [df_emission, df_economy, df_politics]:
    dataframe = dataframe.merge(df, how='left', on=['year', 'country'])

# clean up some values
clean_data_after_merge(dataframe)
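
Because the merge uses `how='left'`, rows of the energy data without a partner in another dataset are kept and silently receive `NaN`. A hedged sketch of how this could be inspected with the `validate` and `indicator` parameters of `DataFrame.merge` follows. The two toy frames and their values are made up and only mimic the key columns used above.

```python
import pandas as pd

# Toy frames with made-up values; only the key columns mirror the notebook.
energy = pd.DataFrame({"year": [1980, 1981], "country": ["AUT", "AUT"],
                       "nuclear_prod_btu": [0.0, 0.0]})
economy = pd.DataFrame({"year": [1980], "country": ["AUT"], "GDP": [1.0]})

merged = energy.merge(
    economy,
    how="left",
    on=["year", "country"],
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
    indicator=True,         # adds a `_merge` column: 'both' or 'left_only'
)

# Rows of the left frame that found no partner in the right frame.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["year", "country"]])
```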



## Observation and Comments about used datasets:

page 5 of exercise3.pdf

how to format this part? Everyone their own thoughts? All together?

For some columns ('built_reactors', 'shutdown_reactors', 'operating_reactors', 'nuclear_warheads', 'accident_cost_MioUSD2013', 'accident_deaths') we decided to fill missing values with 0. We deemed this appropriate because nuclear power plant accidents are rare and the well-known ones (Chernobyl, Fukushima) are present in the data, so a missing entry most likely means that no accident was recorded for that country and year.
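
The distinction behind this choice can be shown on a toy frame: zero-filling is applied only to columns where a missing entry plausibly means "nothing recorded", while measured quantities keep their `NaN`. This is a minimal sketch with made-up values. The list `count_like` is a hypothetical name, and `GDP` merely stands in for any measured column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["AUT", "FRA"],
    "accident_deaths": [np.nan, 2.0],  # missing plausibly means "no recorded accident"
    "GDP": [np.nan, 3.0],              # missing means "not measured"; stays NaN
})

# Fill only the count-like columns with 0.
count_like = ["accident_deaths"]
df[count_like] = df[count_like].fillna(0)
print(df)
```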

suggestion: country conversion (USSR), Felix GHG vs CO$_2$ only, Johannes fillna(0) for missing values

This part was written by all members.

* Josef
* Frank \
    Since we decided beforehand what each person had to search for, it was much easier for me to find appropriate data. I found data in .csv and .xlsx formats. Of these, I found .csv easier to work with in Python. Some web results only offer datasets behind a paywall, which could not be used for this exercise.
* Felix
* Johannes