# 1. EDB Score 

In `data/DBData.csv`, you have the full "ease of doing business" dataset from the World Bank. Reformat it into the **Tidy Data** format, with one row per country per year and one column per indicator.

The result should look like:

![](EDB_unstack.png)

In [2]:
import pandas as pd
DBData = pd.read_csv('data/DBData.csv')
DBData = DBData.drop(["Country Code","Indicator Code"],axis = 1)
DBData

Unnamed: 0,Country Name,Indicator Name,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,Unnamed: 20
0,Afghanistan,Dealing with construction permits (DB06-15 met...,,,24.11,24.11,24.11,24.11,24.11,24.11,24.11,24.11,24.49,39.38,,,,,
1,Afghanistan,Dealing with construction permits (DB16-19 met...,,,,,,,,,,,,33.70,33.70,33.70,33.70,34.54,
2,Afghanistan,Dealing with construction permits: Building qu...,,,,,,,,,,,,2.50,2.50,2.50,2.50,3.00,
3,Afghanistan,Dealing with construction permits: Building qu...,,,,,,,,,,,,16.67,16.67,16.67,16.67,20.00,
4,Afghanistan,Dealing with construction permits: Cost (% of ...,,,166.50,154.10,160.10,112.50,97.10,85.60,80.50,71.20,66.90,59.40,61.20,66.00,71.70,73.00,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43455,Zimbabwe,Trading across borders: Time to export: Border...,,,,,,,,,,,,45.07,45.07,45.07,45.07,45.07,
43456,Zimbabwe,Trading across borders: Time to export: Docume...,,,,,,,,,,,,42.01,42.01,42.01,42.01,42.01,
43457,Zimbabwe,Trading across borders: Time to import (days) ...,,,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,,,,,
43458,Zimbabwe,Trading across borders: Time to import: Border...,,,,,,,,,,,,78.97,78.97,18.76,18.76,18.76,


In [3]:
#collecting the columns to melt: everything except the identifier columns
#('Indicator Code' was already dropped above; 'Unnamed: 20' is not excluded, so it is melted as a "Year")
DBData_name_of_columns = DBData.columns.to_list()
DBData_name_of_columns_not_wanted = ['Country Name','Indicator Name','Indicator Code']
DBData_columns = [item for item in DBData_name_of_columns if item not in DBData_name_of_columns_not_wanted]

#changing shape of the dataframe from wide (one column per year) to long
DBData_melt = pd.melt(frame = DBData,id_vars=['Country Name','Indicator Name'],
                                      value_vars=DBData_columns,
                                      value_name='Property',
                                      var_name ='Year')

#reordering and renaming columns
DBData_melt = DBData_melt[['Year','Country Name','Indicator Name','Property']]
DBData_melt = DBData_melt.rename(columns={'Indicator Name':'Measure'})
DBData_melt

Unnamed: 0,Year,Country Name,Measure,Property
0,2004,Afghanistan,Dealing with construction permits (DB06-15 met...,
1,2004,Afghanistan,Dealing with construction permits (DB16-19 met...,
2,2004,Afghanistan,Dealing with construction permits: Building qu...,
3,2004,Afghanistan,Dealing with construction permits: Building qu...,
4,2004,Afghanistan,Dealing with construction permits: Cost (% of ...,
...,...,...,...,...
738815,Unnamed: 20,Zimbabwe,Trading across borders: Time to export: Border...,
738816,Unnamed: 20,Zimbabwe,Trading across borders: Time to export: Docume...,
738817,Unnamed: 20,Zimbabwe,Trading across borders: Time to import (days) ...,
738818,Unnamed: 20,Zimbabwe,Trading across borders: Time to import: Border...,
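The `Unnamed: 20` rows in the output above come from a non-year column being melted as if it were a year; it appears to be empty in the rows shown. One possible way to avoid this, sketched here on a hypothetical miniature frame (`wide`, `year_cols`, and `tidy` are illustrative names, not part of the notebook), is to pass only the columns whose names look like years as `value_vars`:

```python
import pandas as pd

# Hypothetical miniature of the wide file, including an empty trailing column
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Indicator Name": ["Score", "Score"],
    "2018": [1.0, 2.0],
    "2019": [1.5, 2.5],
    "Unnamed: 20": [None, None],
})

# Keep only columns whose name is a four-digit number, i.e. a year
year_cols = [c for c in wide.columns if c.isdigit() and len(c) == 4]

# Melt the year columns only; "Unnamed: 20" never becomes a "Year"
tidy = wide.melt(
    id_vars=["Country Name", "Indicator Name"],
    value_vars=year_cols,
    var_name="Year",
    value_name="Property",
)
print(tidy)
```

The year labels stay as strings here, which matches how the notebook later joins on `Year`.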


In [4]:
#pivoting the melted dataframe: one row per (Year, Country Name), one column per Measure
DBData_new_frame = DBData_melt.pivot_table(index=['Year','Country Name'],columns='Measure',values = 'Property').reset_index()
DBData_new_frame

Measure,Year,Country Name,Dealing with construction permits (DB06-15 methodology) - Score,Dealing with construction permits (DB16-19 methodology) - Score,Dealing with construction permits: Building quality control index (0-15) (DB16-19 methodology),Dealing with construction permits: Building quality control index (0-15) (DB16-19 methodology) - Score,Dealing with construction permits: Cost (% of Warehouse value),Dealing with construction permits: Cost (% of Warehouse value) - Score,Dealing with construction permits: Liability and insurance regimes index (0-2) (DB16-19 methodology),Dealing with construction permits: Procedures (number),...,Trading across borders: Documents to export (number) (DB06-15 methodology),Trading across borders: Documents to export (number) (DB06-15 methodology) - Score,Trading across borders: Documents to import (number) (DB06-15 methodology),Trading across borders: Documents to import (number) (DB06-15 methodology) - Score,Trading across borders: Time to export (days) (DB06-15 methodology) - Score,Trading across borders: Time to export: Border compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to export: Documentary compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to import (days) (DB06-15 methodology) - Score,Trading across borders: Time to import: Border compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to import: Documentary compliance (hours) (DB16-19 methodology) - Score
0,2004,Albania,,,,,,,,,...,,,,,,,,,,
1,2004,Algeria,,,,,,,,,...,,,,,,,,,,
2,2004,Angola,,,,,,,,,...,,,,,,,,,,
3,2004,Argentina,,,,,,,,,...,,,,,,,,,,
4,2004,Armenia,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3020,2019,Vietnam,,79.05,12.0,80.00,0.7,96.54,0.0,10.0,...,,,,,,66.04,71.01,,80.29,68.62
3021,2019,West Bank and Gaza,,56.15,12.0,80.00,14.4,28.24,0.0,20.0,...,,,,,,96.86,57.99,,98.21,81.45
3022,2019,"Yemen, Rep.",,0.00,,0.00,,0.00,,,...,,,,,,0.00,0.00,,0.00,0.00
3023,2019,Zambia,,71.65,10.0,66.67,2.6,86.92,0.0,10.0,...,,,,,,25.16,43.79,,57.35,70.29


# 2. GDP and ease of doing business

Using the additional data in `data/GDPpc.csv`, join the clean dataset in **1** to the GDP data.

**What are the 3 Ease of Doing Business variables most closely linked to GDP?**

Answer by giving their correlation coefficients, a possible explanation, and a data visualization.

**Hint:** running `df.corr()` or `sns.pairplot()` on the whole dataset will crash most computers. Be smart about the number of columns you test at once.
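One way to follow the hint is to correlate each indicator column against GDP only, rather than building the full column-by-column matrix. The sketch below uses `DataFrame.corrwith` on simulated data; the column names `indicator_a`, `indicator_b`, and `indicator_c` are made up for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the merged frame (not the real data)
rng = np.random.default_rng(0)
n = 200
gdp = rng.normal(size=n)
df = pd.DataFrame({
    "GDP": gdp,
    "indicator_a": 2.0 * gdp + rng.normal(scale=0.1, size=n),  # strongly related
    "indicator_b": -gdp + rng.normal(scale=1.0, size=n),       # negatively related
    "indicator_c": rng.normal(size=n),                         # unrelated
})

# One correlation per indicator column, each computed against GDP only
indicators = df.drop(columns="GDP")
corr_with_gdp = indicators.corrwith(df["GDP"])
print(corr_with_gdp.sort_values(ascending=False))
```

This computes one coefficient per column instead of one per pair of columns, so the work grows linearly with the number of indicators.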

In [5]:
#sorting, cleaning columns wanted & changing shape of the dataframe
import pandas as pd
GDP = pd.read_csv('data/GDPpc.csv')

name_of_columns1 = GDP.columns.to_list()
name_of_columns_not_wanted1 = ['Country Name','Country Code','Indicator Name','Indicator Code']
new_list1 = [item for item in name_of_columns1 if item not in name_of_columns_not_wanted1]

GDP_melt = pd.melt(frame = GDP,id_vars=['Country Name'],value_vars=new_list1,value_name='GDP',var_name ='Year')
GDP_melt = GDP_melt.sort_values(by =['Country Name','Year']).reset_index(drop=True)
GDP_melt

Unnamed: 0,Country Name,Year,GDP
0,Afghanistan,1960,59.777327
1,Afghanistan,1961,59.878153
2,Afghanistan,1962,58.492874
3,Afghanistan,1963,78.782758
4,Afghanistan,1964,82.208444
...,...,...,...
15571,Zimbabwe,2014,1264.983826
15572,Zimbabwe,2015,1265.294413
15573,Zimbabwe,2016,1272.335450
15574,Zimbabwe,2017,1333.395663


In [6]:
#merging the two dataframes
merge_GDP_DBData= pd.merge(DBData_new_frame,GDP_melt,
                        on = ['Year','Country Name'],
                        how= 'inner',
                        suffixes = ['_DBData_new_frame','_GDP_melt'])
merge_GDP_DBData

Unnamed: 0,Year,Country Name,Dealing with construction permits (DB06-15 methodology) - Score,Dealing with construction permits (DB16-19 methodology) - Score,Dealing with construction permits: Building quality control index (0-15) (DB16-19 methodology),Dealing with construction permits: Building quality control index (0-15) (DB16-19 methodology) - Score,Dealing with construction permits: Cost (% of Warehouse value),Dealing with construction permits: Cost (% of Warehouse value) - Score,Dealing with construction permits: Liability and insurance regimes index (0-2) (DB16-19 methodology),Dealing with construction permits: Procedures (number),...,Trading across borders: Documents to export (number) (DB06-15 methodology) - Score,Trading across borders: Documents to import (number) (DB06-15 methodology),Trading across borders: Documents to import (number) (DB06-15 methodology) - Score,Trading across borders: Time to export (days) (DB06-15 methodology) - Score,Trading across borders: Time to export: Border compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to export: Documentary compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to import (days) (DB06-15 methodology) - Score,Trading across borders: Time to import: Border compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to import: Documentary compliance (hours) (DB16-19 methodology) - Score,GDP
0,2004,Albania,,,,,,,,,...,,,,,,,,,,2373.581292
1,2004,Algeria,,,,,,,,,...,,,,,,,,,,2598.908023
2,2004,Angola,,,,,,,,,...,,,,,,,,,,1248.404906
3,2004,Argentina,,,,,,,,,...,,,,,,,,,,4251.574348
4,2004,Armenia,,,,,,,,,...,,,,,,,,,,1191.961920
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2544,2018,Vietnam,,79.03,12.0,80.00,0.7,96.45,0.0,10.0,...,,,,,66.04,71.01,,80.29,68.62,
2545,2018,West Bank and Gaza,,56.70,12.0,80.00,13.9,30.42,0.0,20.0,...,,,,,96.86,57.99,,98.21,81.45,
2546,2018,"Yemen, Rep.",,0.00,,0.00,,0.00,,,...,,,,,0.00,0.00,,0.00,0.00,
2547,2018,Zambia,,71.04,10.0,66.67,3.1,84.45,0.0,10.0,...,,,,,25.16,43.79,,57.35,70.29,
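An inner join keeps only the `(Year, Country Name)` pairs present in both frames, so unmatched rows disappear silently. A quick way to see what would be lost, sketched on two tiny hypothetical frames (`edb`, `gdp`, and `check` are illustrative names), is an outer merge with `indicator=True`:

```python
import pandas as pd

# Tiny hypothetical frames sharing the same join keys as the notebook
edb = pd.DataFrame({
    "Year": ["2017", "2018"],
    "Country Name": ["A", "A"],
    "score": [50.0, 55.0],
})
gdp = pd.DataFrame({
    "Year": ["2017", "2016"],
    "Country Name": ["A", "A"],
    "GDP": [1000.0, 950.0],
})

# The "_merge" column records whether each key pair was found in
# the left frame only, the right frame only, or both
check = pd.merge(edb, gdp, on=["Year", "Country Name"], how="outer", indicator=True)
print(check["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are the ones an inner join would drop.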


In [7]:
#correlating the numeric columns only ('Year' and 'Country Name' are text)
corr_matrix = merge_GDP_DBData.select_dtypes(include='number').corr()

#taking the 3 largest positive correlations with GDP (GDP's correlation with itself is dropped)
GDP_compare = corr_matrix.loc['GDP'].drop('GDP').sort_values(ascending=False).head(3)
GDP_compare

Getting electricity: Total duration and frequency of outages per customer a year (0-3) (DB16-19 methodology)    0.646972
Resolving insolvency (DB04-14 methodology) - Score                                                              0.644669
Resolving insolvency: Recovery rate (cents on the dollar)                                                       0.644232
Name: GDP, dtype: float64
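Sorting in descending order returns only the largest positive correlations, so a strong negative relationship would not appear in this top 3. If "most closely linked" is read as largest in magnitude, one option is to rank by absolute value while still reporting the signed coefficient. The values below are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Hypothetical correlations with GDP, indexed by measure name
corr_with_gdp = pd.Series({
    "measure_a": 0.64,
    "measure_b": -0.71,
    "measure_c": 0.10,
    "measure_d": -0.05,
})

# Order by magnitude, then look up the signed values in that order
order = corr_with_gdp.abs().sort_values(ascending=False).index
top3 = corr_with_gdp.loc[order].head(3)
print(top3)
```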

# 3. Chocolate Nobel question

In this repository is the academic paper `chocolate_nobel.pdf`. 

Explain in 3 paragraphs why this paper's conclusions are bad statistics.

The author uses Wikipedia, Suisse Chocolate, Theobroma, and Caobisco to gather his data. Unfortunately, these are not reputable sources. More reliable sources are needed to support his hypothesis.

Neither the scope of "chocolate" nor the methodology is well defined. He speaks of chocolate in general, but there are many varieties, such as chocolate with liquor, with almonds, or with strawberry flavour. He does not explain his methods. For example, why does he count laureates per 10 million people rather than per 5 million? Where does the 10 million come from? He treats chocolate as the major factor when other factors, such as education, literacy, poverty, and the opportunity to become a laureate, could also improve or worsen cognitive function. He also does not define cognitive function.

He makes assertions without presenting evidence. His hypothesis is flawed in that the number of Nobel laureates per capita in a country does not in any way quantify the cognitive ability of any given person. He should measure cognitive aptitude directly with quantifiable metrics. There is not enough evidence to support his hypothesis.
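The point about other factors can be illustrated with a small simulation. In the sketch below, a hypothetical confounder (`wealth`) drives both simulated chocolate consumption and simulated laureate counts, which have no direct effect on each other. All numbers are simulated; none come from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
n_countries = 500

# Hypothetical confounder, e.g. national wealth in standardised units
wealth = rng.normal(size=n_countries)

# Both variables depend on wealth only, not on each other
chocolate = wealth + rng.normal(scale=0.5, size=n_countries)
laureates = wealth + rng.normal(scale=0.5, size=n_countries)

# Raw correlation between the two variables
raw_r = np.corrcoef(chocolate, laureates)[0, 1]


def residual(y, x):
    """Return y with its best-fit linear dependence on x removed."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlation that remains once wealth is accounted for
partial_r = np.corrcoef(
    residual(chocolate, wealth), residual(laureates, wealth)
)[0, 1]

print(f"raw r = {raw_r:.2f}, r after controlling for wealth = {partial_r:.2f}")
```

In this setup the raw correlation is strong while the correlation after removing the effect of `wealth` is close to zero, which is the pattern a confounded relationship would produce.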