# Mini-Project: JSON
### Alden Chico

****
## JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with the most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

### Preliminary Setup 

In [2]:
import pandas as pd

In [3]:
# Load the JSON file into the DataFrame world_banks

world_banks = pd.read_json('data/world_bank_projects.json')

### Exercise 1: Find the 10 countries with the most projects

**Solution**: 
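Group the rows of ```world_banks``` by the ```countryshortname``` column and count the rows in each group. Sort the counts in descending order and store the top 10 as the DataFrame ```top_countries```. Note that ```countryshortname``` is not limited to individual countries: the value ```Africa``` in the result appears to be a regional label.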

In [19]:
top_countries = world_banks.groupby('countryshortname')['countryshortname'].count().sort_values(ascending=False)[:10]
top_countries = top_countries.to_frame()
top_countries.columns = ['Number of Projects']
top_countries

| countryshortname | Number of Projects |
|---|---|
| Indonesia | 19 |
| China | 19 |
| Vietnam | 17 |
| India | 16 |
| Yemen, Republic of | 13 |
| Nepal | 12 |
| Bangladesh | 12 |
| Morocco | 12 |
| Mozambique | 11 |
| Africa | 11 |
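The same group-count-sort sequence can also be written with ```value_counts()```, which counts each distinct value and returns the counts in descending order. The sketch below is only an illustration: ```sample``` is a small hypothetical stand-in for ```world_banks```, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the 'countryshortname' column of world_banks
sample = pd.DataFrame({
    'countryshortname': ['China', 'India', 'China', 'Nepal', 'India', 'China']
})

# value_counts() counts each distinct value and sorts the counts in descending order
top_sample = (
    sample['countryshortname']
    .value_counts()
    .head(2)                           # keep the two most frequent values
    .to_frame('Number of Projects')    # name the single column
)
print(top_sample)
```

With the real data, replacing ```sample``` with ```world_banks``` and ```head(2)``` with ```head(10)``` should give the same ranking as the ```groupby``` version, although the order of tied entries may differ.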


### Exercise 2: Find the top 10 major project themes

**Solution**: Each entry of the ```mjtheme_namecode``` column of ```world_banks``` is a list of dictionaries with a ```code``` key and a ```name``` key. Use a nested for loop to append every ```name``` value to the list ```project_themes```. Create a defaultdict called ```counter``` that counts how many times each theme name appears in ```project_themes```, then drop the empty-string key, which collects the entries whose name is missing. Convert ```counter``` to a DataFrame, name its single column ```Project Theme Amounts```, and sort it in descending order. Store the top ten project themes as the DataFrame ```top_projects```.

In [16]:
project_themes = []

#Append all the name values from the mjtheme_namecode column to project_themes
for data in world_banks['mjtheme_namecode']:
    for sub_data in data:
        project_themes.append(sub_data['name'])

In [17]:
#Count how many times each project theme name appears in project_themes
from collections import defaultdict
counter = defaultdict(int)
for project in project_themes:
    counter[project] += 1
    
#Remove the empty-string key, which collects entries whose name is missing
counter.pop('', None)

#Build a DataFrame from counter, with the theme names as the index
counter = pd.DataFrame.from_dict(dict(counter), orient='index')
counter.columns = ['Project Theme Amounts']
top_projects = counter.sort_values('Project Theme Amounts', ascending=False).head(10)
top_projects

| Project Theme | Project Theme Amounts |
|---|---|
| Environment and natural resources management | 223 |
| Rural development | 202 |
| Human development | 197 |
| Public sector governance | 184 |
| Social protection and risk management | 158 |
| Financial and private sector development | 130 |
| Social dev/gender/inclusion | 119 |
| Trade and integration | 72 |
| Urban development | 47 |
| Economic management | 33 |
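As an alternative to the manual defaultdict loop, the standard library's ```collections.Counter``` can do the counting and ranking in one step. This is a minimal sketch, not the method used in the solution; ```sample_rows``` is hypothetical data shaped like the entries of ```world_banks['mjtheme_namecode']```.

```python
from collections import Counter

# Hypothetical rows shaped like entries of world_banks['mjtheme_namecode']
sample_rows = [
    [{'code': '8', 'name': 'Human development'}, {'code': '11', 'name': ''}],
    [{'code': '1', 'name': 'Economic management'},
     {'code': '8', 'name': 'Human development'}],
    [{'code': '8', 'name': 'Human development'}],
]

# Count every non-empty theme name across all rows
theme_counts = Counter(
    theme['name']
    for row in sample_rows
    for theme in row
    if theme['name'] != ''
)

# most_common(n) returns the n most frequent (name, count) pairs
top_themes = theme_counts.most_common(2)
print(top_themes)
```

Skipping empty names inside the generator plays the same role as ```counter.pop('')``` in the solution above.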


### Exercise 3: Create a DataFrame with missing name entries from ```mjtheme_namecode``` filled in.

**Solution 1**: Iterate over the rows of the ```world_banks``` DataFrame with ```.iterrows()``` and loop through the list of dictionaries in each row's ```'mjtheme_namecode'``` entry. If the value of the ```name``` key is ```''```, replace it with the placeholder ```'Not Specified'``` using the string ```.replace()``` method. An ```assert``` statement raises an error if any ```name``` value is still an empty string. Finally, display the first value of ```world_banks['mjtheme_namecode']``` to confirm the replacement.

In [6]:
world_banks = pd.read_json('data/world_bank_projects.json')

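for idx, row in world_banks.iterrows():
    for item in row['mjtheme_namecode']:
        #The dictionaries are mutated in place, so the change is visible in world_banks
        if item['name'] == '':
            item['name'] = item['name'].replace('', 'Not Specified')
        #Fail loudly if any name is still empty after the replacement
        assert item['name'] != '', 'Empty string still present'
world_banks['mjtheme_namecode'][0]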

[{'code': '8', 'name': 'Human development'},
 {'code': '11', 'name': 'Not Specified'}]

**Solution 2**: Loop directly over the ```mjtheme_namecode``` column of ```world_banks``` and, for each dictionary whose ```name``` value is ```''```, assign the placeholder ```'Not Specified'``` inside an if-statement. An ```assert``` statement raises an error if any ```name``` value is still an empty string. Finally, display the first value of ```world_banks['mjtheme_namecode']``` to confirm that the missing names were replaced.

In [7]:
world_banks = pd.read_json('data/world_bank_projects.json')

for data in world_banks['mjtheme_namecode']:
    for sub_data in data:
        if sub_data['name'] == '':
            sub_data['name'] = 'Not Specified'
        assert sub_data['name'] != '', 'Empty string still present.'
world_banks['mjtheme_namecode'][0]

[{'code': '8', 'name': 'Human development'},
 {'code': '11', 'name': 'Not Specified'}]
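
Both solutions fill the gaps with a placeholder. If each ```code``` always pairs with the same ```name``` wherever the name is present (an assumption about the data, not something verified here), the actual theme names can be recovered instead. The sketch below builds a code-to-name lookup in a first pass and fills the empty names in a second pass. ```sample_rows``` and ```code_to_name``` are hypothetical names, and the rows are illustrative rather than taken from the file.

```python
# Hypothetical rows shaped like entries of world_banks['mjtheme_namecode']
sample_rows = [
    [{'code': '8', 'name': 'Human development'}, {'code': '11', 'name': ''}],
    [{'code': '11', 'name': 'Environment and natural resources management'}],
    [{'code': '8', 'name': ''}],
]

# Pass 1: collect the name for every code that has a non-empty name somewhere
code_to_name = {}
for row in sample_rows:
    for theme in row:
        if theme['name']:
            code_to_name[theme['code']] = theme['name']

# Pass 2: fill each empty name from the lookup, falling back to the
# placeholder when a code never appears with a name
for row in sample_rows:
    for theme in row:
        if not theme['name']:
            theme['name'] = code_to_name.get(theme['code'], 'Not Specified')

print(sample_rows[0])
```

Applied to the real column, the same two loops would iterate over ```world_banks['mjtheme_namecode']``` in place of ```sample_rows```.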