#### Data Wrangling Mini Project Objectives:
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

****
#### Importing necessary libraries

In [2]:
import pandas as pd
import json
import numpy as np
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
from collections import Counter

****
#### Turning the JSON into a Pandas dataframe for readability/manipulation.

In [3]:
url = 'data/world_bank_projects.json'
df = pd.read_json(url)

****
#### Finding the top 10 countries with the most projects. We have now completed part 1 of the Mini Project.

In [4]:
df['countryshortname'].value_counts().nlargest(10)

Indonesia             19
China                 19
Vietnam               17
India                 16
Yemen, Republic of    13
Bangladesh            12
Nepal                 12
Morocco               12
Mozambique            11
Africa                11
Name: countryshortname, dtype: int64

****
#### We note that each row of the mjtheme_namecode column contains a list of JSON objects, one per theme (a code/name pair). Some rows contain more than one entry. We write a for loop to iterate over the rows in the dataframe, with a nested for loop over the entries within each row, and append each entry to an initially empty list.

In [5]:
lis = []
for row in df['mjtheme_namecode']:
    for project in row:
        lis.append(project)
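
The nested loop above is a flattening step. As a minimal sketch of the same idea, assuming each row is a list of `{'code', 'name'}` dicts, `itertools.chain.from_iterable` does the flattening in one call. The `sample_rows` data is a hypothetical stand-in for `df['mjtheme_namecode']`, not the real file.

```python
from itertools import chain

# Hypothetical sample mimicking the shape of df['mjtheme_namecode']:
# each row holds a list of {'code', 'name'} dicts.
sample_rows = [
    [{'code': '8', 'name': 'Human development'}, {'code': '11', 'name': ''}],
    [{'code': '1', 'name': 'Economic management'}],
]

# Flatten the list of lists into a single list of entries,
# preserving the original order.
flat = list(chain.from_iterable(sample_rows))
print(len(flat))
```

On the real dataframe the equivalent call would be `list(chain.from_iterable(df['mjtheme_namecode']))`, which should give the same list as `lis`.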

****
#### We use a Counter to find the top 10 codes. Code 11 (Environment and natural resources management) appears most often, with 250 entries. We have now completed part 2 of the mini project.

In [6]:
count_projects = Counter()
for project in lis:
    count_projects[project['code']] += 1
count_projects.most_common(10)

[('11', 250),
 ('10', 216),
 ('8', 210),
 ('2', 199),
 ('6', 168),
 ('4', 146),
 ('7', 130),
 ('5', 77),
 ('9', 50),
 ('1', 38)]
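
Counting by code rather than by name matters here because some entries have an empty name, so counting names would split one theme across two keys. A small sketch of the same count, assuming the flattened list has the shape shown earlier (the `sample` entries are hypothetical), passes a generator straight to `Counter`:

```python
from collections import Counter

# Hypothetical flattened entries; one has an empty name.
sample = [
    {'code': '11', 'name': 'Environment and natural resources management'},
    {'code': '11', 'name': ''},
    {'code': '8', 'name': 'Human development'},
]

# Counter accepts any iterable, so the explicit loop can be a generator.
counts = Counter(entry['code'] for entry in sample)

# most_common(n) returns (code, count) pairs sorted by descending count.
top = counts.most_common(10)
print(top)
```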

****
#### We build a dictionary that maps each code to its theme name, using only the entries whose name is not empty, so that we can fill in the missing names.

In [13]:
non_empty_list = []
for project in lis:
    if project['name'] != '':
        non_empty_list.append(project)

unique_projects = {project['code']:project['name'] for project in non_empty_list}
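
The dictionary comprehension silently keeps the last name seen for each code, which is only safe if every code maps to exactly one name. A hedged sketch of a sanity check, using hypothetical `sample` entries rather than the real data, collects the distinct names per code and flags any conflicts:

```python
from collections import defaultdict

# Hypothetical flattened entries with some missing names.
sample = [
    {'code': '8', 'name': 'Human development'},
    {'code': '8', 'name': ''},
    {'code': '11', 'name': 'Environment and natural resources management'},
    {'code': '11', 'name': 'Environment and natural resources management'},
]

# Collect every distinct non-empty name seen for each code.
names_by_code = defaultdict(set)
for entry in sample:
    if entry['name']:
        names_by_code[entry['code']].add(entry['name'])

# Codes with more than one distinct name would make the lookup ambiguous.
conflicts = {code: names for code, names in names_by_code.items() if len(names) > 1}

# With no conflicts, each set holds exactly one name.
lookup = {code: next(iter(names)) for code, names in names_by_code.items()}
print(conflicts, lookup)
```

If `conflicts` is empty on the real data, the lookup is unambiguous and equivalent to `unique_projects`.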

****
#### I personally prefer working with NaN values rather than empty strings, so I turned all empty strings into NaN here.

In [14]:
projects_df = json_normalize(lis).replace('', np.nan)  # exact match on the empty string; no regex needed
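
An exact-match replacement is enough for this step, since only cells that are exactly the empty string should become NaN. The sketch below illustrates this on a hypothetical two-row frame (not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one empty name.
sample = pd.DataFrame({
    'code': ['8', '11'],
    'name': ['Human development', ''],
})

# Replace cells that are exactly '' with NaN; other strings are untouched.
cleaned = sample.replace('', np.nan)
print(cleaned['name'].isna().tolist())
```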

****
#### Use the fillna method combined with map on the 'code' column to look up each missing name, by code, in the dictionary we created. We have now completed part 3 of the mini project.

In [15]:
projects_df['name'] = projects_df.name.fillna(projects_df.code.map(unique_projects))

In [16]:
projects_df

|   | code | name |
|---|------|------|
| 0 | 8 | Human development |
| 1 | 11 | Environment and natural resources management |
| 2 | 1 | Economic management |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |
| 5 | 2 | Public sector governance |
| 6 | 11 | Environment and natural resources management |
| 7 | 6 | Social protection and risk management |
| 8 | 7 | Social dev/gender/inclusion |
| 9 | 7 | Social dev/gender/inclusion |
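
Once the names are filled in, the top themes can also be reported by name instead of by code. This is a minimal sketch on a hypothetical frame (the `sample` and `lookup` values are illustrative, not taken from the file) that applies the same `fillna` plus `map` pattern, checks that no names remain missing, and counts themes with `value_counts`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing names, shaped like projects_df.
sample = pd.DataFrame({
    'code': ['8', '11', '11', '1', '8'],
    'name': ['Human development', np.nan,
             'Environment and natural resources management',
             'Economic management', np.nan],
})

# Hypothetical code-to-name lookup, playing the role of unique_projects.
lookup = {
    '8': 'Human development',
    '11': 'Environment and natural resources management',
    '1': 'Economic management',
}

# Fill each missing name with the name looked up from its code.
sample['name'] = sample['name'].fillna(sample['code'].map(lookup))

# Confirm nothing is left unfilled, then rank themes by name.
missing = int(sample['name'].isna().sum())
top_themes = sample['name'].value_counts().head(10)
print(missing)
print(top_themes)
```

On the real data, `projects_df['name'].value_counts().head(10)` should agree with the code counts from part 2, provided every code appears in the lookup.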
