# 1. EDB Score 

In `data/DBData.csv`, you have the full "ease of doing business" dataset from the World Bank. Reformat it into the **Tidy Data** format, so that each row holds one country in one year.

The result should look like:

![](EDB_unstack.png)

In [67]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [141]:
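# Load the raw file: one row per (country, indicator), one column per year.
edb = pd.read_csv('data/DBData.csv')
# Drop the unnamed trailing column and the indicator code, which are not needed.
edb = edb.drop(['Unnamed: 20', 'Indicator Code'], axis=1)
# Pivot indicators into the index, then transpose so each row is a (year, country) pair.
edb = edb.pivot_table(index="Indicator Name",
                      columns="Country Name").T
edb = edb.reset_index()
# The year level of the index has no name, so reset_index() labels it "level_0".
edb = edb.rename(columns={"level_0": "Year"})
edb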

Indicator Name,Year,Country Name,Dealing with construction permits (DB06-15 methodology) - Score,Dealing with construction permits (DB16-19 methodology) - Score,Dealing with construction permits: Building quality control index (0-15) (DB16-19 methodology),Dealing with construction permits: Building quality control index (0-15) (DB16-19 methodology) - Score,Dealing with construction permits: Cost (% of Warehouse value),Dealing with construction permits: Cost (% of Warehouse value) - Score,Dealing with construction permits: Liability and insurance regimes index (0-2) (DB16-19 methodology),Dealing with construction permits: Procedures (number),...,Trading across borders: Documents to export (number) (DB06-15 methodology),Trading across borders: Documents to export (number) (DB06-15 methodology) - Score,Trading across borders: Documents to import (number) (DB06-15 methodology),Trading across borders: Documents to import (number) (DB06-15 methodology) - Score,Trading across borders: Time to export (days) (DB06-15 methodology) - Score,Trading across borders: Time to export: Border compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to export: Documentary compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to import (days) (DB06-15 methodology) - Score,Trading across borders: Time to import: Border compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to import: Documentary compliance (hours) (DB16-19 methodology) - Score
0,2004,Albania,,,,,,,,,...,,,,,,,,,,
1,2004,Algeria,,,,,,,,,...,,,,,,,,,,
2,2004,Angola,,,,,,,,,...,,,,,,,,,,
3,2004,Argentina,,,,,,,,,...,,,,,,,,,,
4,2004,Armenia,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3020,2019,Vietnam,,79.05,12.0,80.00,0.7,96.54,0.0,10.0,...,,,,,,66.04,71.01,,80.29,68.62
3021,2019,West Bank and Gaza,,56.15,12.0,80.00,14.4,28.24,0.0,20.0,...,,,,,,96.86,57.99,,98.21,81.45
3022,2019,"Yemen, Rep.",,0.00,,0.00,,0.00,,,...,,,,,,0.00,0.00,,0.00,0.00
3023,2019,Zambia,,71.65,10.0,66.67,2.6,86.92,0.0,10.0,...,,,,,,25.16,43.79,,57.35,70.29
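
The same wide-to-tidy reshape can also be written as a `melt` followed by a `pivot`. The sketch below is an alternative to the `pivot_table(...).T` approach above, not a replacement for it. It runs on a made-up four-row frame: the indicator names `Score A` and `Score B` and all values are invented for illustration and do not come from `DBData.csv`.

```python
import pandas as pd

# Toy stand-in for the raw layout: one row per (country, indicator),
# one column per year. All names and values here are invented.
wide = pd.DataFrame({
    "Country Name": ["Albania", "Albania", "Zambia", "Zambia"],
    "Indicator Name": ["Score A", "Score B", "Score A", "Score B"],
    "2018": [60.0, 70.0, 50.0, 55.0],
    "2019": [62.0, 71.0, 52.0, 56.0],
})

# Step 1: move the years from column headers into a single "Year" column.
long_df = wide.melt(id_vars=["Country Name", "Indicator Name"],
                    var_name="Year", value_name="Value")

# Step 2: spread the indicators into columns, one row per (year, country).
tidy = (long_df.pivot(index=["Year", "Country Name"],
                      columns="Indicator Name", values="Value")
               .reset_index()
               .rename_axis(None, axis="columns"))

# Year labels come from column headers, so they start out as strings.
tidy["Year"] = tidy["Year"].astype(int)
print(tidy)
```

Unlike `pivot_table`, `pivot` does not aggregate, so it raises an error if a (year, country, indicator) combination appears twice. That can be a useful check that the raw data has no duplicates.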


# 2. GDP and ease of doing business

Using the additional data in `data/GDPpc.csv`, join the clean dataset in **1** to the GDP data.

**What are the 3 Ease of Doing Business variables most closely linked to GDP?**

Answer by giving their correlation coefficients, a possible explanation, and a data visualization.

**Hint:** running `df.corr()` or `sns.pairplot()` on the whole dataset will crash most computers. Be smart about the number of columns you test at once.
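
One way to follow the hint is to correlate every numeric column against the single GDP column, which costs one correlation per column instead of a full column-by-column matrix. The sketch below shows this on a toy frame: the column names `score_up`, `cost_down` and `noisy` and all values are invented. It also ranks by absolute value, on the assumption that "most closely linked" should include strong negative correlations (for example, cost or delay measures that fall as GDP rises).

```python
import pandas as pd

# Invented toy data: three candidate variables plus a text column.
toy = pd.DataFrame({
    "GDP per capita": [1.0, 2.0, 3.0, 4.0, 5.0],
    "score_up": [2.0, 4.0, 6.0, 8.0, 10.0],   # rises with GDP
    "cost_down": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as GDP rises
    "noisy": [1.0, 3.0, 2.0, 5.0, 4.0],       # loosely related
    "Country Name": list("ABCDE"),
})

target = "GDP per capita"

# Keep numeric candidates only and leave the target itself out.
candidates = toy.select_dtypes("number").drop(columns=target)

# One Pearson correlation per candidate column, not a full matrix.
corr = candidates.corrwith(toy[target])

# Rank by strength of the link, whatever its sign.
order = corr.abs().sort_values(ascending=False).index
ranked = corr.loc[order]
print(ranked)
```

Sorting the signed values in descending order would push `cost_down` to the bottom even though it is as strongly linked to the target as `score_up`.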

In [136]:
gdp = pd.read_csv('data/GDPpc.csv')
gdp = gdp.drop(['Indicator Code', 'Indicator Name'], axis=1)
# Melt the year columns into rows: one row per (country, year).
gdp = gdp.melt(id_vars=['Country Name', 'Country Code'], 
               var_name='Year', 
               value_name='GDP per capita')
# Drop country-years that have no GDP figure.
gdp = gdp.dropna()
gdp.head(5)

Unnamed: 0,Country Name,Country Code,Year,GDP per capita
1,Afghanistan,AFG,1960,59.777327
11,Australia,AUS,1960,1807.349784
12,Austria,AUT,1960,935.460427
14,Burundi,BDI,1960,70.349079
15,Belgium,BEL,1960,1273.691659


In [145]:
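# Inner join on country and year: only pairs present in both tables survive.
mrg = edb.merge(gdp, on=['Country Name','Year'])
mrg.Year = mrg.Year.astype(int)
# Pearson correlation of every numeric column with GDP per capita (text columns skipped).
df = pd.DataFrame(mrg.select_dtypes('number').corrwith(mrg['GDP per capita']))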
df.sort_values(by=0,ascending=False).head(10) 
# Top 3: electricity outages, resolving insolvency, global EDB score

Unnamed: 0,0
GDP per capita,1.0
Getting electricity: Total duration and frequency of outages per customer a year (0-3) (DB16-19 methodology),0.646972
Resolving insolvency (DB04-14 methodology) - Score,0.644669
Resolving insolvency: Recovery rate (cents on the dollar),0.644232
Resolving insolvency: Recovery rate (cents on the dollar) - Score,0.644209
Global: Ease of doing business score (DB15 methodology),0.62108
Global: Ease of doing business score (DB10-14 methodology),0.61502
Global: Ease of doing business score (DB16 methodology),0.61459
Global: Ease of doing business score (DB17-19 methodology),0.605895
Registering property: Quality of land administration index (0-30) (DB17-19 methodology),0.580885
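
Because the merge above is an inner join, any country-year that appears in only one of the two tables is dropped without warning, for instance when the two files spell a country name differently. A minimal sketch of how to check this with an outer join and `indicator=True` follows. The frames are toy stand-ins, and `Country X`, `Country Y` and every number are placeholders rather than real data.

```python
import pandas as pd

# Toy stand-ins for the tidy EDB table and the melted GDP table.
# "Country X" and "Country Y" are placeholders; all values are made up.
edb_toy = pd.DataFrame({
    "Country Name": ["Albania", "Zambia", "Country X"],
    "Year": ["2019", "2019", "2019"],
    "Global score": [67.0, 66.0, 50.0],
})
gdp_toy = pd.DataFrame({
    "Country Name": ["Albania", "Zambia", "Country Y"],
    "Year": ["2019", "2019", "2019"],
    "GDP per capita": [5000.0, 1300.0, 900.0],
})

# An outer join keeps every row; the "_merge" column records its origin.
check = edb_toy.merge(gdp_toy, on=["Country Name", "Year"],
                      how="outer", indicator=True)

# Count matched rows versus rows found on one side only.
counts = check["_merge"].astype(str).value_counts()
print(counts)

# List the countries an inner join would silently drop.
unmatched = check.loc[check["_merge"] != "both", "Country Name"].tolist()
print(sorted(unmatched))
```

If the unmatched list is long on the real data, harmonising country names (or joining on a country code, where both tables have one) before computing correlations would be worth considering.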


# 3. Chocolate Nobel question

In this repository is the academic paper `chocolate_nobel.pdf`. 

Explain in 3 paragraphs why this paper's conclusions are bad statistics.

In [None]:
# Franz H. Messerli is Swiss, and therefore probably very biased.

# The initial hypothesis seems very flawed. The number of Nobel laureates 
# is most likely not related to the cognitive performance of a population 
# in general. A better indicator would be the average IQ but even that is 
# arguable. He probably tried that but quickly realized that Asian countries 
# have higher IQ and low chocolate consumption. 

# There's no real link made between the variables. We don't have the 
# chocolate consumption of the laureates themselves and we don't even know 
# if the laureates lived/worked in their country of origin. Furthermore, 
# countries seem to have been cherry-picked. He used data that support 
# his points rather than letting the data speak for themselves.

# The author fails to give a plausible alternative explanation for his 
# conclusions. For example, developed countries with quality education and 
# high GDP are more likely to consume chocolate (luxury good) and produce 
# Nobel laureates.
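
The confounding argument in the last paragraph can be illustrated with a small simulation. This is a sketch under invented assumptions, not a reanalysis of the paper: a single hidden "wealth" variable drives both chocolate consumption and laureate counts, chocolate has no effect on laureates at all, and every number (noise levels, sample size) is made up.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What remains of ys after removing its least-squares linear fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov / sum((x - mx) ** 2 for x in xs)
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)
n_countries = 5000  # far more than real countries; keeps the estimate stable

# Hidden common cause, plus two variables that each depend only on it.
wealth = [rng.gauss(0, 1) for _ in range(n_countries)]
chocolate = [w + rng.gauss(0, 0.5) for w in wealth]
laureates = [w + rng.gauss(0, 0.5) for w in wealth]

# Raw correlation between the two effects of wealth.
raw_r = pearson(chocolate, laureates)

# Same correlation once the linear effect of wealth is removed from both.
partial_r = pearson(residuals(chocolate, wealth), residuals(laureates, wealth))

print(f"raw correlation: {raw_r:.2f}")
print(f"after controlling for wealth: {partial_r:.2f}")
```

In this setup the raw correlation is strong while the correlation after controlling for wealth is close to zero, which is the pattern one would expect if a country's prosperity, rather than its chocolate, explained the link.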