# Can we make a change on climate change?
#### EPA1333 Final Assignment
### Introduction
Regardless of the many debates over the impact of humans on climate change, global warming is an observable fact. According to NASA, highest 16 global temperatures have been recorded since 2001, the arctic ice shrank to a minimum in 2012, and carbon dioxide levels in the air are higher than ever!

Data analysis can help us better understand how the climate has changed, why, to make predictions for the future years, and to evaluate measures on how to suppress these changes. In this assignment you receive climate change data from World Bank, and are expected to perform an original and non-trivial analysis using Python.

### Data

#### Climate change indicators
The climate change data offered by World Bank contains a large set of indicators, such as for example CO2 emissions, population growth or renewable energy output. Values of these indicators are available per country and per year. You can download the data in *.csv format from https://data.worldbank.org/topic/climate-change.

In [1]:
# import standard library
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# load world bank data
# skip the unneeded header
wb = pd.read_csv('world_bank/API_19_DS2_en_csv_v2.csv', sep=',', header=0, skiprows=3) 

# read_csv resulted in additional blank series at the last column -> drop the column
wb = wb.drop(wb.columns[[61]], 1)

# display the head of dataframe
wb.head()


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Aruba,ABW,Urban population (% of total),SP.URB.TOTL.IN.ZS,50.776,50.761,50.746,50.73,50.715,50.7,...,44.147,43.783,43.421,43.059,42.698,42.364,42.058,41.779,41.528,41.304
1,Aruba,ABW,Urban population,SP.URB.TOTL,27526.0,28141.0,28532.0,28761.0,28924.0,29082.0,...,44686.0,44375.0,44052.0,43778.0,43575.0,43456.0,43398.0,43365.0,43331.0,43296.0
2,Aruba,ABW,Urban population growth (annual %),SP.URB.GROW,3.117931,2.209658,1.379868,0.799404,0.56514,0.544773,...,-0.435429,-0.698401,-0.730549,-0.623935,-0.464782,-0.273466,-0.133557,-0.076069,-0.078435,-0.080806
3,Aruba,ABW,"Population, total",SP.POP.TOTL,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,...,101220.0,101353.0,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0
4,Aruba,ABW,Population growth (annual %),SP.POP.GROW,3.148037,2.238144,1.409622,0.832453,0.592649,0.573468,...,0.38406,0.131311,0.098616,0.21268,0.376985,0.512145,0.592914,0.587492,0.524658,0.459929


In [3]:
#load country metadata
wb_meta_country = pd.read_csv('world_bank/Metadata_Country_API_19_DS2_en_csv_v2.csv', sep=',', header=0) 

# read_csv resulted in additional blank series at the last column -> drop the column
wb_meta_country = wb_meta_country.drop(wb_meta_country.columns[[5]], 1)

# display the head of dataframe
wb_meta_country.head()


Unnamed: 0,Country Code,Region,IncomeGroup,SpecialNotes,TableName
0,ABW,Latin America & Caribbean,High income,SNA data for 2000-2011 are updated from offici...,Aruba
1,AFG,South Asia,Low income,Fiscal year end: March 20; reporting period fo...,Afghanistan
2,AGO,Sub-Saharan Africa,Lower middle income,,Angola
3,ALB,Europe & Central Asia,Upper middle income,,Albania
4,AND,Europe & Central Asia,High income,WB-3 code changed from ADO to AND to align wit...,Andorra


In [4]:
#load indicator metadata
wb_meta_indi = pd.read_csv('world_bank/Metadata_Indicator_API_19_DS2_en_csv_v2.csv', sep=',', header=0) 

# read_csv resulted in additional blank series at the last column -> drop the column
wb_meta_indi = wb_meta_indi.drop(wb_meta_indi.columns[[4]], 1)

# display the head of dataframe
wb_meta_indi.head()


Unnamed: 0,INDICATOR_CODE,INDICATOR_NAME,SOURCE_NOTE,SOURCE_ORGANIZATION
0,SP.URB.TOTL.IN.ZS,Urban population (% of total),Urban population refers to people living in ur...,The United Nations Population Divisions World ...
1,SP.URB.TOTL,Urban population,Urban population refers to people living in ur...,World Bank Staff estimates based on United Nat...
2,SP.URB.GROW,Urban population growth (annual %),Urban population refers to people living in ur...,World Bank Staff estimates based on United Nat...
3,SP.POP.TOTL,"Population, total",Total population is based on the de facto defi...,(1) United Nations Population Division. World ...
4,SP.POP.GROW,Population growth (annual %),Annual population growth rate for year t is th...,Derived from total population. Population sour...


#### Climate data API
The Climate Data API provides programmatic access to most of the climate data used on the World Bank’s Climate Change Knowledge Portal. You can access this data directly from Python using requests. In addition to what was already downloadable as csv data, with this API you are able to access temperature, precipitation and basin level data. Read about it in more detail here: https://datahelpdesk.worldbank.org/knowledgebase/articles/902061-climate-data-api

Below is an example of how to access yearly temperature historical data per country from Python. You can select a country using its ISO aplha3 code: https://unstats.un.org/unsd/methodology/m49/.


In [5]:
import requests
r = requests.get('http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/ROU')
rou = r.json()
rou[:5]

[{'data': 9.215595, 'year': 1901},
 {'data': 8.389345, 'year': 1902},
 {'data': 9.500536, 'year': 1903},
 {'data': 8.901487, 'year': 1904},
 {'data': 9.12619, 'year': 1905}]

As with most data sources, you might find that you data contains missing values. Please handle them appropriately, for example by using interpolation.

#### Other resources
You are encouraged to find more data sources that will make your analysis more meaningful. Please make sure that you document everything carefully. Only use freely available datasets.

#### Assignment
* Create a Jupyter Notebook that contains your explanations and analyses.
* Start the notebook with a clear description of the type of analysis you are going to perform.
* The conceptual contents of the Notebook should be roughly similar to a normal written report of 10-20 pages.
* Some (minimum) properties of the Notebook and your analyses on which we will grade:
    * Required: Combine different data from multiple sources in your analyses.
    * Required: Use multiple types of visualizations of your results.
    * Required: Make sure your Notebook does not generate errors!
    * Required: You should use Python to answer your research questions. Your code should read, clean and format, process and visualize the data. There should be at least some non-trivial processing involved.
    * Whenever possible, make your Notebook read the data directly from the web. This way, your notebook will always use the most up-to-date data available. If not, document carefully where the data was collected what to do when you want to use more up-to-date data.
    * Demonstrate your skills in Python by using typical Python constructs and using the appropriate data structures (lists, dictionaries, tuples, arrays, dataframes, series, recursion, etc.)
    * Write clear, understandable code:
        * Document your code! Put comments when necessary.
        * Use sensible variable names.
        * Break up your code into parts. Use (fruitful) functions.
    * Try to think of reusability of your code. How easy would it be to use your same code if we want to try to do a small variations of your analysis? Can we easily adapt/play around with your code?
    * How difficult were your analyses?
* Make your Notebook self-explanatory. So, it should contain text (with references) as well as your actual analysis code and results.
* If you want to use other libraries for your analyses or visualizations, feel free to do so. However, only use free available and well-known libraries. The ones that come standard with Anaconda are fine. If you want to use something else, that you first need to install, check with us first.
* Create a zip archive and upload it on Brightspace.

#### Example
Here are some example questions you might ask to start the analysis. Please note that this is just an example and that you are expected to come up with your own questions and analyses.
* The EU has the following goal in the Paris agreement: "At least a 40% domestic reduction in greenhouse gases by 2030 compared to 1990 levels.” How is the EU doing at the moment? If they don’t change policy (i.e extrapolation of current trends) where will they end up? [source]
* Can you classify in good/neutral/bad countries?
* What is the trend per continent with respect to gas emissions? How do countries within the same continent behave? Is it fair to make statements over whole continents?
* Where do you find the highest increase in temperatures compared to 1960?
* Suppose that each country has a % growth or reduction of CO2, where do we end up?
* What are the countries with most gas emissions? How does that change when you normalize by their size / population?
* What is the country with the best trend in renewable energy over the last 10 years?

In [7]:
# load world bank data
country_class_y = pd.read_csv('world_bank/CCPI_2017.csv', sep=',', header=0) 
country_class_y.head()

Unnamed: 0,Countries,Label
0,France,Good
1,Sweden,Good
2,United Kingdom,Good
3,Cyprus,Good
4,Luxembourg,Good


In [8]:
wb_meta_country[['Country Code', 'TableName']].head()

Unnamed: 0,Country Code,TableName
0,ABW,Aruba
1,AFG,Afghanistan
2,AGO,Angola
3,ALB,Albania
4,AND,Andorra


In [9]:
df = pd.DataFrame()
for country in country_class_y['Countries']:
    #print(country)
    df1 = wb[wb['Country Name'] == country]
    df = pd.concat( [df,df1], ignore_index=True, axis=0)
        

In [12]:
df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,France,FRA,Urban population (% of total),SP.URB.TOTL.IN.ZS,61.88,62.607,63.489,64.702,65.898,67.071,...,77.621,77.864,78.106,78.345,78.584,78.82,79.055,79.289,79.52,79.75
1,France,FRA,Urban population,SP.URB.TOTL,28968650.0,29703740.0,30550680.0,31576960.0,32586170.0,33551440.0,...,49690040.0,50124940.0,50540080.0,50945800.0,51348970.0,51753050.0,52175170.0,52593940.0,52979460.0,53349650.0
2,France,FRA,Urban population growth (annual %),SP.URB.GROW,2.424503,2.505858,2.811428,3.304057,3.146026,2.919203,...,0.9335545,0.8714278,0.8248035,0.7995637,0.7882418,0.7838471,0.8123421,0.7994288,0.7303273,0.6963089
3,France,FRA,"Population, total",SP.POP.TOTL,46814240.0,47444750.0,48119650.0,48803680.0,49449400.0,50023770.0,...,64016230.0,64374990.0,64707040.0,65027510.0,65342780.0,65659790.0,65998570.0,66331960.0,66624070.0,66896110.0
4,France,FRA,Population growth (annual %),SP.POP.GROW,1.22961,1.337853,1.41247,1.411512,1.314427,1.154839,...,0.6187115,0.5588574,0.5144864,0.4940375,0.4836449,0.4839823,0.5146361,0.5038712,0.4394107,0.407491


In [13]:
df.shape

(4480, 61)

In [19]:
df_2000 = df.iloc[:, [0,2,49]]
df_2000.head()

Unnamed: 0,Country Name,Indicator Name,2005
0,France,Urban population (% of total),77.13
1,France,Urban population,48730240.0
2,France,Urban population growth (annual %),1.074557
3,France,"Population, total",63179360.0
4,France,Population growth (annual %),0.7538056


In [None]:
#df[['label']]
country_class_y.columns = ['Country Name', 'Label']
country_class_y.head()

In [None]:
df_2000_label = pd.merge(df_2000, country_class_y, on="Country Name", how="outer" )
df_2000_label.head()
df_2000_label.shape

In [None]:
# preprocessing data
df_2000_label_good = df_2000_label[(df_2000_label['Indicator Name'] == 'Urban population (% of total)') & (df_2000_label['Label'] == 'Good')]

df_2000_label_good

In [None]:
len(wb_meta_indi['INDICATOR_NAME'])

In [None]:
list_deleted_indi = []
list_saved_indi = []
df_test = pd.DataFrame()

for label in country_class_y['Label'].unique():
    for indicator in wb_meta_indi['INDICATOR_NAME']:
        df_temp = pd.DataFrame()
        df_temp = df_2000_label[(df_2000_label['Indicator Name'] == indicator) & (df_2000_label['Label'] == label)]
        
        # df_temp = df_temp.fillna(df_temp.mean())
        df_temp.interpolate(method=’polynomial’, order=4)
        print(df_temp)
        
        if(df_temp['2000'].isnull().values.all()):
            list_deleted_indi.append(indicator) 
        else:
            list_saved_indi.append(indicator) 
            
        df_test = pd.concat([df_test,df_temp], ignore_index=True, axis=0)
        
print(list_deleted_indi)
df_test

In [None]:
len(list_deleted_indi)

In [None]:
len(list_saved_indi)

In [None]:
set_saved_indi_filter = set(wb_meta_indi['INDICATOR_NAME']) - set(list_deleted_indi)

In [None]:
len(set_saved_indi_filter)

In [None]:
df_test.isnull().values.any()

In [None]:
df_test[df_test['Indicator Name'] == 'Urban population (% of total)']
df_test.shape

In [None]:
df_test_clean = pd.DataFrame()
for indi in set_saved_indi_filter:
    df_temp = df_test[df_test['Indicator Name'] == indi]
    df_test_clean = pd.concat([df_test_clean, df_temp], ignore_index = True, axis = 0)

print(df_test_clean.shape)
df_test_clean.head()

In [None]:
3584/56

In [None]:
# reshape the matrix feature
col = list(set_saved_indi_filter)
ind = list(country_class_y['Country Name'])

Xtrain = pd.DataFrame(np.zeros(shape = (56, 64)), columns = col, index = ind)
for country in country_class_y['Country Name']:
    Xtrain.loc[country] = df_test_clean[df_test_clean['Country Name'] == country]['2000'].values
Xtrain.shape

In [None]:
ytrain = country_class_y['Label']
ytrain.head()

In [None]:
from sklearn.svm import SVC
svc = SVC(kernel='rbf', class_weight='balanced')

In [None]:
svc.get_params().keys()

In [None]:
from sklearn.grid_search import GridSearchCV
param_grid = {'C': [0.05, 0.1, 0.5, 1, 5, 10, 50], 'gamma': [1E-20 , 1E-15, 1E-12, 1E-10, 0.00001]}
grid = GridSearchCV(svc, param_grid)

%time grid.fit(Xtrain, ytrain)
print(grid.best_params_)

In [None]:
model = grid.best_estimator_


In [None]:
wb_test = wb[['Country Name','Indicator Name','2000']]
wb_test.head()

In [None]:
# Xtest

df_test_set = pd.DataFrame()
for indi in set_saved_indi_filter:
    df_temp = wb_test[wb_test['Indicator Name'] == indi]
    df_test_set = pd.concat([df_test_set, df_temp], ignore_index = True, axis = 0)

print(df_test_set.shape)
df_test_set.head()


In [None]:
len(wb['Country Name'].unique())

In [None]:
16896/264

In [None]:
set_test = set(wb['Country Name']) - set(country_class_y['Country Name'])

df_test_set_country = pd.DataFrame()
for country in set_test:
    # print(country)
    df_temp = df_test_set[df_test_set['Country Name'] == country]
    df_test_set_country = pd.concat([df_test_set_country, df_temp], ignore_index = True, axis = 0)

In [None]:
len(set_test)

In [None]:
df_test_set_country.head()

In [None]:
# reshape the matrix feature
col = list(set_saved_indi_filter)
ind = list(country_class_y['Country Name'])

for country in country_class_y['Country Name']:
    Xtrain.loc[country] = df_test_clean[df_test_clean['Country Name'] == country]['2000'].values
Xtrain.shape

Xtrain = pd.DataFrame(np.zeros(shape = (56, 64)), columns = col, index = ind)
for country in set_test:
    df_test_country[]
    

In [None]:
df_vietnam = df_test_set_country[df_test_set_country['Country Name'] == 'Vietnam']
df_vietnam = df_vietnam.fillna(0)

In [None]:
df_vietnam.isnull().sum()

In [None]:
Xtest = pd.DataFrame(df_vietnam['2000'].values.reshape(1,64))

In [None]:
ytest = model.predict(Xtest)
ytest

In [None]:
from sklearn.cross_validation import cross_val_score
cross_val_score(model, Xtrain, ytrain, cv=4)

In [None]:
list_deleted_indi = []
df_test = pd.DataFrame()

for label in country_class_y['Label'].unique():
    for indicator in wb_meta_indi['INDICATOR_NAME']:
        df_temp = pd.DataFrame()
        df_temp = df_2000[(df_2000['Indicator Name'] == indicator) & (df_2000_label['Label'] == label)]
        print(df_temp)
        #df_temp = df_temp.fillna(df_temp.mean())
        
        if(df_temp['2000'].isnull().values.all()):
            #print(df_temp)
            list_deleted_indi.append(indicator) 
        
        if((not df_temp['2000'].isnull().values.all()) & df_temp.isnull().values.any()):
            df_temp = df_temp.fillna(df_temp.mean())
            df_test = pd.concat( [df_test,df_temp], ignore_index=True, axis=0)
            print(df_temp)
            break
        
        df_test = pd.concat( [df_test,df_temp], ignore_index=True, axis=0)
        
    if((not df_temp['2000'].isnull().values.all()) & df_temp.isnull().values.any()):
        break

print(list_deleted_indi)
df_test