## JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with the most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

## Imports for Python and pandas

In [3]:
import json
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

## JSON file to pandas DataFrame

In [4]:
#World Bank Projects File
filename = 'data/world_bank_projects.json'

#json to pandas dataframe
data = pd.read_json(filename, encoding='utf-8')

data.head(3)

| | _id | approvalfy | board_approval_month | boardapprovaldate | borrower | closingdate | country_namecode | countrycode | countryname | countryshortname | ... | sectorcode | source | status | supplementprojectflg | theme1 | theme_namecode | themecode | totalamt | totalcommamt | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'$oid': '52b213b38594d8a2be17c780'} | 1999 | November | 2013-11-12T00:00:00Z | FEDERAL DEMOCRATIC REPUBLIC OF ETHIOPIA | 2018-07-07T00:00:00Z | Federal Democratic Republic of Ethiopia!$!ET | ET | Federal Democratic Republic of Ethiopia | Ethiopia | ... | ET,BS,ES,EP | IBRD | Active | N | {'Percent': 100, 'Name': 'Education for all'} | [{'code': '65', 'name': 'Education for all'}] | 65 | 130000000 | 130000000 | http://www.worldbank.org/projects/P129828/ethi... |
| 1 | {'$oid': '52b213b38594d8a2be17c781'} | 2015 | November | 2013-11-04T00:00:00Z | GOVERNMENT OF TUNISIA | | Republic of Tunisia!$!TN | TN | Republic of Tunisia | Tunisia | ... | BZ,BS | IBRD | Active | N | {'Percent': 30, 'Name': 'Other economic manage... | [{'code': '24', 'name': 'Other economic manage... | 5424 | 0 | 4700000 | http://www.worldbank.org/projects/P144674?lang=en |
| 2 | {'$oid': '52b213b38594d8a2be17c782'} | 2014 | November | 2013-11-01T00:00:00Z | MINISTRY OF FINANCE AND ECONOMIC DEVEL | | Tuvalu!$!TV | TV | Tuvalu | Tuvalu | ... | TI | IBRD | Active | Y | {'Percent': 46, 'Name': 'Regional integration'} | [{'code': '47', 'name': 'Regional integration'... | 52812547 | 6060000 | 6060000 | http://www.worldbank.org/projects/P145310?lang=en |


In [5]:
#Fill in some null values
data['project_abstract'] = data['project_abstract'].fillna('missing')
data['sector2'] = data['sector2'].fillna('missing')
data['sector3'] = data['sector3'].fillna('missing')
data['sector4'] = data['sector4'].fillna('missing')
data['closingdate'] = data['closingdate'].fillna('open')
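
The five assignments above could also be written as one `fillna` call that takes a column-to-value mapping. A minimal sketch on a small made-up frame (the rows are invented for illustration and are not taken from the World Bank file):

```python
import pandas as pd

# Toy frame standing in for the project data; values are invented.
toy = pd.DataFrame({
    'project_abstract': ['Roads upgrade', None],
    'sector2': [None, 'Health'],
    'closingdate': ['2018-07-07T00:00:00Z', None],
})

# One fill value per column; columns not listed are left untouched.
fill_values = {
    'project_abstract': 'missing',
    'sector2': 'missing',
    'closingdate': 'open',
}
filled = toy.fillna(value=fill_values)
print(filled)
```

Keeping the fill values in one dictionary makes it easier to see which placeholder each column receives.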

## 1. Find the 10 countries with the most projects

In [10]:
#Countries with most projects
data.countryname.value_counts().head(10)

People's Republic of China         19
Republic of Indonesia              19
Socialist Republic of Vietnam      17
Republic of India                  16
Republic of Yemen                  13
People's Republic of Bangladesh    12
Kingdom of Morocco                 12
Nepal                              12
Africa                             11
Republic of Mozambique             11
Name: countryname, dtype: int64
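
The list above includes `Africa`, which reads as a regional label rather than a single country. If only countries should be ranked, such labels can be filtered out before counting. The sketch below uses invented values, and the `regional_labels` list is a hypothetical name introduced here, not something defined in the dataset:

```python
import pandas as pd

# Invented sample of the 'countryname' column, for illustration only.
names = pd.Series(
    ['Nepal', 'Africa', 'Nepal', 'Africa', 'Africa', 'Republic of India'],
    name='countryname',
)

# Hypothetical list of labels to treat as regions rather than countries.
regional_labels = ['Africa']

# Drop the regional rows, then count what is left.
country_counts = names[~names.isin(regional_labels)].value_counts()
print(country_counts.head(10))
```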

## 2. Find the top 10 major project themes (using column 'mjtheme_namecode')

In [11]:
#load json
with open('data/world_bank_projects.json', encoding='utf-8') as f:
    world_bank_json_data = json.load(f)

In [15]:
#normalize mjtheme_namecode values
mjtheme_namecode_df = json_normalize(world_bank_json_data, 'mjtheme_namecode')
mjtheme_namecode_df.head(5)

| | code | name |
|---|---|---|
| 0 | 8 | Human development |
| 1 | 11 | |
| 2 | 1 | Economic management |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |
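
Passing `'mjtheme_namecode'` as the record path tells `json_normalize` to unpack each project's list of theme dictionaries into one row per theme. A small sketch on two made-up records shows the effect; the `project_id` field is a hypothetical name used only to show which parent each row came from:

```python
import pandas as pd

# Two invented project records shaped like the entries in the JSON file.
projects = [
    {'project_id': 'A',
     'mjtheme_namecode': [{'code': '8', 'name': 'Human development'},
                          {'code': '11', 'name': ''}]},
    {'project_id': 'B',
     'mjtheme_namecode': [{'code': '1', 'name': 'Economic management'}]},
]

# record_path names the list to unpack; meta copies a parent field onto each row.
themes = pd.json_normalize(projects, record_path='mjtheme_namecode',
                           meta=['project_id'])
print(themes)
```

A project with two themes contributes two rows, so counts taken on this frame are counts of theme entries, not of projects.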


In [20]:
#Count project theme name values
mjtheme_namecode_df.name.value_counts()

Environment and natural resources management    223
Rural development                               202
Human development                               197
Public sector governance                        184
Social protection and risk management           158
Financial and private sector development        130
                                                122
Social dev/gender/inclusion                     119
Trade and integration                            72
Urban development                                47
Economic management                              33
Rule of law                                      12
Name: name, dtype: int64
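
Because 122 entries have an empty name, the name counts above understate the named themes. Counting on `code` instead sidesteps the problem, since the code is present even when the name is missing. A sketch on invented rows:

```python
import pandas as pd

# Invented rows shaped like mjtheme_namecode_df, for illustration only.
themes = pd.DataFrame({
    'code': ['8', '11', '1', '11', '8', '11'],
    'name': ['Human development', '', 'Economic management',
             'Environment and natural resources management', '', ''],
})

# Counting names splits each theme between its real name and the empty string.
by_name = themes['name'].value_counts()

# Counting codes is unaffected by missing names.
by_code = themes['code'].value_counts()
print(by_code.head(10))
```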

## 3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [21]:
#DataFrame with code and name info
codekey = mjtheme_namecode_df[mjtheme_namecode_df.name != ''].drop_duplicates().set_index('code')
codekey

| code | name |
|---|---|
| 8 | Human development |
| 1 | Economic management |
| 6 | Social protection and risk management |
| 5 | Trade and integration |
| 2 | Public sector governance |
| 11 | Environment and natural resources management |
| 7 | Social dev/gender/inclusion |
| 4 | Financial and private sector development |
| 10 | Rural development |
| 9 | Urban development |


In [22]:
#Turn the name column into a code -> name dictionary
codekey = codekey['name'].to_dict()

In [23]:
#Look up each row's name from its code
mjtheme_namecode_df['name'] = mjtheme_namecode_df['code'].map(codekey)
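
An alternative that skips the hand-built dictionary, sketched here on a small invented frame rather than the real data: turn empty names into NaN, then let every row in a code group take that group's first non-missing name. This assumes each code maps to a single name, and it is a different technique from the lookup used in this notebook.

```python
import numpy as np
import pandas as pd

# Invented rows shaped like mjtheme_namecode_df, for illustration only.
themes = pd.DataFrame({
    'code': ['8', '11', '11', '8', '1'],
    'name': ['Human development', '',
             'Environment and natural resources management', '',
             'Economic management'],
})

# Treat empty strings as missing so that groupby can skip them.
names = themes['name'].replace('', np.nan)

# 'first' returns the first non-missing name in each code group,
# and transform() broadcasts it back to every row of that group.
themes['name'] = names.groupby(themes['code']).transform('first')
print(themes)
```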

In [24]:
#Recount theme names now that the blanks are filled in
mjtheme_namecode_df.name.value_counts()

Environment and natural resources management    250
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              38
Rule of law                                      15
Name: name, dtype: int64
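
A quick sanity check on the fill: the total number of theme entries should be the same before and after, with the 122 blank entries redistributed among the named themes. The sketch below only re-adds the counts printed in the two `value_counts()` outputs above:

```python
# Counts copied from the first value_counts() output; 122 is the blank-name row.
before = [223, 202, 197, 184, 158, 130, 122, 119, 72, 47, 33, 12]

# Counts copied from the second value_counts() output, after filling.
after = [250, 216, 210, 199, 168, 146, 130, 77, 50, 38, 15]

total_before = sum(before)
total_after = sum(after)
print(total_before, total_after)  # 1499 1499
```

Matching totals indicate that no rows were dropped or duplicated while the names were filled in.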