## JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated,
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

# Problem 1
Find the 10 countries with most projects

## importing packages

In [1]:
import pandas as pd
import json
import numpy as np
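# json_normalize is exposed at the top level since pandas 1.0;
# the pandas.io.json import path is deprecated
from pandas import json_normalize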

## open the json file and load in the dataframe

In [2]:
# load the JSON file into a DataFrame
# (path follows the exercise statement; adjust it if the file lives elsewhere)
json_df = pd.read_json('data/world_bank_projects.json')

# inspect the DataFrame
json_df.info()

## Find the most frequent countries

`countryname` is not nested, so `.value_counts()` can be applied to it directly. By default it returns the counts sorted from highest to lowest.

In [3]:
# count projects per country (value_counts sorts in descending order by default)
projects_per_country = json_df.countryname.value_counts()
projects_per_country

People's Republic of China       19
Republic of Indonesia            19
Socialist Republic of Vietnam    17
Republic of India                16
Republic of Yemen                13
                                 ..
Republic of Kiribati              1
Central African Republic          1
Kingdom of Tonga                  1
Republic of Serbia                1
Republic of Cape Verde            1
Name: countryname, Length: 118, dtype: int64
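
As a minimal sketch of this behaviour on made-up data (the single-letter country labels below are placeholders, not values from the World Bank file), `.value_counts()` orders the counts from high to low, so `.head(n)` returns the n most frequent values:

```python
import pandas as pd

# Toy stand-in for the `countryname` column (hypothetical values)
countries = pd.Series(["A", "B", "A", "C", "A", "B"], name="countryname")

# value_counts() sorts by count, descending, by default
counts = countries.value_counts()

# The first n rows are therefore the n most frequent values
top_two = counts.head(2)
print(top_two)
```

Ties are not broken by name, which is one reason two runs can list equally frequent countries in a different order.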

# Problem 2

Find the top 10 major project themes (using column 'mjtheme_namecode')



## Inspect the `mjtheme_namecode` column

In [4]:
json_df.mjtheme_namecode

0      [{'code': '8', 'name': 'Human development'}, {...
1      [{'code': '1', 'name': 'Economic management'},...
2      [{'code': '5', 'name': 'Trade and integration'...
3      [{'code': '7', 'name': 'Social dev/gender/incl...
4      [{'code': '5', 'name': 'Trade and integration'...
                             ...                        
495    [{'code': '4', 'name': 'Financial and private ...
496    [{'code': '8', 'name': 'Human development'}, {...
497    [{'code': '10', 'name': 'Rural development'}, ...
498    [{'code': '10', 'name': 'Rural development'}, ...
499    [{'code': '9', 'name': 'Urban development'}, {...
Name: mjtheme_namecode, Length: 500, dtype: object

## Load the raw JSON and flatten `mjtheme_namecode`
Each cell in the column holds a list of dicts, so the column is flattened into one row per theme entry with `json_normalize()`.

In [5]:
# load the raw records, then unpack the nested list under 'mjtheme_namecode' into rows
# (path follows the exercise statement; adjust it if the file lives elsewhere)
with open('data/world_bank_projects.json') as f:
    js = json.load(f)
themes = json_normalize(js, 'mjtheme_namecode')
themes

| | code | name |
|---|---|---|
| 0 | 8 | Human development |
| 1 | 11 | |
| 2 | 1 | Economic management |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |
| ... | ... | ... |
| 1494 | 10 | Rural development |
| 1495 | 9 | Urban development |
| 1496 | 8 | Human development |
| 1497 | 5 | Trade and integration |
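
The flattening step can be sketched on two hand-written records. The `id` field and the record contents are assumptions made for illustration; only the `mjtheme_namecode` key mirrors the exercise data. The `record_path` argument names the nested list to unpack, and `meta` optionally carries a parent field onto every resulting row:

```python
import pandas as pd

# Two hypothetical project records shaped like entries in the JSON file
records = [
    {"id": "P1", "mjtheme_namecode": [{"code": "8", "name": "Human development"},
                                      {"code": "11", "name": ""}]},
    {"id": "P2", "mjtheme_namecode": [{"code": "1", "name": "Economic management"}]},
]

# One output row per dict in the nested list; `id` is repeated for each row
flat = pd.json_normalize(records, record_path="mjtheme_namecode", meta=["id"])
print(flat)
```

Keeping a parent key this way makes it possible to trace each theme row back to its project, which the plain two-argument call does not preserve.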


## Sort the frequency of code numbers in descending order

In [6]:
top_ten_themes = themes.code.value_counts().head(10)
top_ten_themes

11    250
10    216
8     210
2     199
6     168
4     146
7     130
5      77
9      50
1      38
Name: code, dtype: int64
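
Counting by code sidesteps the missing names, but the result is easier to read with theme names on the index. One possible approach, shown here on toy rows rather than the real table, is to build a code-to-name lookup from the rows that do have a name and relabel the counts with it. The variable names are illustrative:

```python
import pandas as pd

# Toy flattened themes table (hypothetical rows; "" marks a missing name)
themes_demo = pd.DataFrame({
    "code": ["8", "10", "8", "10", "1", "8"],
    "name": ["Human development", "", "", "Rural development",
             "Economic management", "Human development"],
})

# Build a code -> name lookup from the rows that carry a name
named = themes_demo[themes_demo["name"] != ""]
lookup = named.drop_duplicates("code").set_index("code")["name"].to_dict()

# Count by code, then relabel the index with the theme names
by_code = themes_demo["code"].value_counts()
by_name = by_code.rename(index=lookup)
print(by_name)
```

This assumes every code appears with its name at least once; a code that never does keeps its numeric label.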

# Problem 3

In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

This process requires three steps:

 - Sort the rows by code and name in ascending order, so blank names come first within each code
 - Replace empty cells with NaN values
 - Backfill the NaN cells with `.bfill()` (equivalent to the older `.fillna(method='bfill')`)
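
The three steps above can be sketched end to end on a four-row toy table (the rows are made up; the column names match the flattened `themes` table):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "code": ["1", "8", "1", "8"],
    "name": ["Economic management", "", "", "Human development"],
})

# Step 1: sort so the blank name comes first within each code
step1 = toy.sort_values(["code", "name"])

# Step 2: turn blanks into NaN so pandas treats them as missing
step2 = step1.assign(name=step1["name"].replace("", np.nan))

# Step 3: backfill each NaN from the next row, which shares its code
step3 = step2.bfill()
print(step3)
```

The backfill only gives the right answer because of the sort: if some code had no named row at all, its blanks would be filled from the next code instead.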

## Sort by `code` and `name` in ascending order

In [7]:
sorted_themes = themes.sort_values(['code', 'name'])
sorted_themes

| | code | name |
|---|---|---|
| 212 | 1 | |
| 363 | 1 | |
| 1024 | 1 | |
| 1114 | 1 | |
| 1437 | 1 | |
| ... | ... | ... |
| 1426 | 9 | Urban development |
| 1428 | 9 | Urban development |
| 1470 | 9 | Urban development |
| 1473 | 9 | Urban development |


## Replace the blank names with NaN values

In [8]:
# assign the whole column back instead of using chained assignment on a slice
sorted_themes['name'] = sorted_themes['name'].replace('', np.nan)
sorted_themes.head(10)

## Use `.bfill()` to backfill NaN cells

In [89]:
# .bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas
filled_in_themes = sorted_themes.bfill()
filled_in_themes.head(10)

| | code | name |
|---|---|---|
| 212 | 1 | Economic management |
| 363 | 1 | Economic management |
| 1024 | 1 | Economic management |
| 1114 | 1 | Economic management |
| 1437 | 1 | Economic management |
| 2 | 1 | Economic management |
| 88 | 1 | Economic management |
| 175 | 1 | Economic management |
| 204 | 1 | Economic management |
| 205 | 1 | Economic management |
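
Because a backfill can silently pull a name from a neighbouring code, it may be worth checking the result. The sketch below is not part of the original notebook and runs on a hypothetical filled table; it checks that no name is left blank and that each code maps to exactly one name:

```python
import pandas as pd

# Hypothetical filled-in result shaped like `filled_in_themes`
filled_demo = pd.DataFrame({
    "code": ["1", "1", "8", "8"],
    "name": ["Economic management", "Economic management",
             "Human development", "Human development"],
})

# No name should be missing or blank after the fill
no_gaps = bool(filled_demo["name"].notna().all() and (filled_demo["name"] != "").all())

# Each code should map to exactly one distinct name
names_per_code = filled_demo.groupby("code")["name"].nunique()
one_name_each = bool((names_per_code == 1).all())
print(no_gaps, one_name_each)
```

The same two checks should apply unchanged to `filled_in_themes`.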


# Outcome:

## 1. Top 10 Countries with Most Projects

In [87]:
projects_per_country.head(10)

Republic of Indonesia            19
People's Republic of China       19
Socialist Republic of Vietnam    17
Republic of India                16
Republic of Yemen                13
                                 ..
Republic of Congo                 1
Republic of Belarus               1
Republic of Kiribati              1
Bosnia and Herzegovina            1
People's Republic of Angola       1
Name: countryname, Length: 118, dtype: int64

## 2. Top 10 Major Project Themes

In [86]:
top_ten_themes

11    250
10    216
8     210
2     199
6     168
4     146
7     130
5      77
9      50
1      38
Name: code, dtype: int64

## 3. Fill in missing names of `mjtheme_namecode`

In [90]:
filled_in_themes

| | code | name |
|---|---|---|
| 212 | 1 | Economic management |
| 363 | 1 | Economic management |
| 1024 | 1 | Economic management |
| 1114 | 1 | Economic management |
| 1437 | 1 | Economic management |
| ... | ... | ... |
| 1426 | 9 | Urban development |
| 1428 | 9 | Urban development |
| 1470 | 9 | Urban development |
| 1473 | 9 | Urban development |
