# JSON Exercise Solutions

+ Andy Pickering
+ May 23, 2017

## Summary

This notebook contains my solutions to the JSON data wrangling exercises, as part of the Springboard data science curriculum.

****
+ reference: http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-reader
+ data source: http://jsonstudio.com/resources/
****

In [1]:
# import libraries
import pandas as pd
import json
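# json_normalize is a top-level pandas function since pandas 1.0
# (older versions: from pandas.io.json import json_normalize)
from pandas import json_normalize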

****
## JSON exercise questions

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

# (1) Find the 10 countries with most projects

In [2]:
# read json file into a dataframe
df = pd.read_json('data/world_bank_projects.json')
# group by country and count # rows (since each row is a project)
df.groupby('countryshortname').size().sort_values(ascending=False).head(10)

countryshortname
Indonesia             19
China                 19
Vietnam               17
India                 16
Yemen, Republic of    13
Nepal                 12
Bangladesh            12
Morocco               12
Mozambique            11
Africa                11
dtype: int64
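
As a side note, the same count can be obtained more compactly with *value_counts*, which counts rows per value and sorts in descending order by default. The sketch below uses a small made-up dataframe (the `toy` name and its values are illustrative, not taken from the World Bank file):

```python
import pandas as pd

# Toy stand-in for the projects dataframe; the values are invented for illustration
toy = pd.DataFrame(
    {'countryshortname': ['China', 'India', 'China', 'Nepal', 'China', 'India']}
)

# value_counts() counts the rows for each country and sorts by count, largest first
top_countries = toy['countryshortname'].value_counts().head(2)
print(top_countries)
```

On the real data this would be `df['countryshortname'].value_counts().head(10)`; the ordering of tied countries may differ from the groupby version.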

# (2) Find the top 10 major project themes (using column 'mjtheme_namecode')

The dataframe read in for (1) is awkward to use here: each project can have several codes/themes stored as a nested list, and the name is missing for some of them.

In [3]:
# example:
df['mjtheme_namecode'][0]

[{'code': '8', 'name': 'Human development'}, {'code': '11', 'name': ''}]

However, we can create a separate dataframe with all the codes and names from 'mjtheme_namecode' using *json_normalize*.

In [4]:
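# load the raw JSON so the nested 'mjtheme_namecode' records can be flattened
with open('data/world_bank_projects.json') as f:
    a = json.load(f)
# one row per (code, name) entry across all projects
b = json_normalize(a, 'mjtheme_namecode')
b.head(10)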

|   | code | name |
|---|------|------|
| 0 | 8 | Human development |
| 1 | 11 |  |
| 2 | 1 | Economic management |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |
| 5 | 2 | Public sector governance |
| 6 | 11 | Environment and natural resources management |
| 7 | 6 | Social protection and risk management |
| 8 | 7 | Social dev/gender/inclusion |
| 9 | 7 | Social dev/gender/inclusion |
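
To make it clearer what *json_normalize* does with a nested list, here is a minimal sketch on two made-up records that mimic the structure of the file. The `records` and `themes` names and the country values are hypothetical; the `meta` argument is optional and is only used here to show how a parent field can be carried along with each flattened row.

```python
import pandas as pd

# Hypothetical records with the same nesting as the World Bank file
records = [
    {'countryshortname': 'Country A',
     'mjtheme_namecode': [{'code': '8', 'name': 'Human development'},
                          {'code': '11', 'name': ''}]},
    {'countryshortname': 'Country B',
     'mjtheme_namecode': [{'code': '1', 'name': 'Economic management'}]},
]

# record_path selects the nested list to flatten (one output row per list entry);
# meta repeats the chosen parent field on every row that came from that record
themes = pd.json_normalize(records,
                           record_path='mjtheme_namecode',
                           meta=['countryshortname'])
print(themes)
```

Each entry of each project's theme list becomes its own row, which is why the flattened dataframe has more rows than there are projects.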


To answer question (2), I'll use *value_counts* to find the most frequent project codes. After question (3) I'll produce the same list with project *names*, since question (3) involves filling in the missing names.

In [5]:
b.code.value_counts().head(10)

11    250
10    216
8     210
2     199
6     168
4     146
7     130
5      77
9      50
1      38
Name: code, dtype: int64

See the end of (3) for the same result with project *names*.

# (3) In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [6]:
# print the dataframe with missing names (first 10 rows)
b.head(10)

|   | code | name |
|---|------|------|
| 0 | 8 | Human development |
| 1 | 11 |  |
| 2 | 1 | Economic management |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |
| 5 | 2 | Public sector governance |
| 6 | 11 | Environment and natural resources management |
| 7 | 6 | Social protection and risk management |
| 8 | 7 | Social dev/gender/inclusion |
| 9 | 7 | Social dev/gender/inclusion |


In this case there are only 11 unique codes/project names, so I could just hard-code the names and use a loop to fill them in. But I'd like to do it in a more automatic way that would still work with many more codes. To do this, I take the rows for each unique code, find the first non-empty name among them, and assign that name to all rows with that code.

In [7]:
# loop over unique codes, look up the name, and fill in all rows for that code
for code in b.code.unique():
    code_rows = b.name[b.code == code]                 # names in rows with this code
    code_name = code_rows[code_rows != ''].values[0]   # first non-empty name
    b.loc[b.code == code, 'name'] = code_name          # .loc avoids chained assignment
    
# print the dataframe with names filled in (first 10 rows)
b.head(10)

|   | code | name |
|---|------|------|
| 0 | 8 | Human development |
| 1 | 11 | Environment and natural resources management |
| 2 | 1 | Economic management |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |
| 5 | 2 | Public sector governance |
| 6 | 11 | Environment and natural resources management |
| 7 | 6 | Social protection and risk management |
| 8 | 7 | Social dev/gender/inclusion |
| 9 | 7 | Social dev/gender/inclusion |
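
The same fill can probably be done without an explicit loop by building a code-to-name lookup and mapping it onto the code column. This is only a sketch of an alternative, not what I ran above; the `toy`, `lookup`, and `filled` names are made up for illustration, and it assumes every code has at least one row with a non-empty name.

```python
import pandas as pd

# Toy frame with the same shape as the flattened theme table
toy = pd.DataFrame({
    'code': ['8', '11', '1', '11', '8'],
    'name': ['Human development', '', 'Economic management',
             'Environment and natural resources management', ''],
})

# Build a code -> name Series from the rows that do have a name,
# keeping the first name seen for each code
lookup = (toy[toy['name'] != '']
          .drop_duplicates('code')
          .set_index('code')['name'])

# Replace the name column by looking up each row's code
filled = toy.copy()
filled['name'] = filled['code'].map(lookup)
print(filled)
```

One caveat: a code that never appears with a name would come out as NaN here, whereas the loop version would raise an IndexError for it.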


Now that all rows have names, we can answer (2) using the project *names* instead of the code numbers.

In [8]:
# top project names
b.groupby('name').size().sort_values(ascending=False).head(10)

name
Environment and natural resources management    250
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              38
dtype: int64