# 1. Find the 10 Countries With the Most Projects

In [19]:
# Import packages
import pandas as pd
import numpy as np
import json
from pandas import json_normalize  # top-level since pandas 1.0; older releases: pandas.io.json

In [20]:
# Load json data as pandas dataframe
df = pd.read_json('world_bank_projects.json')

In [21]:
# Inspect dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns (total 50 columns):
_id                         500 non-null object
approvalfy                  500 non-null int64
board_approval_month        500 non-null object
boardapprovaldate           500 non-null object
borrower                    485 non-null object
closingdate                 370 non-null object
country_namecode            500 non-null object
countrycode                 500 non-null object
countryname                 500 non-null object
countryshortname            500 non-null object
docty                       446 non-null object
envassesmentcategorycode    430 non-null object
grantamt                    500 non-null int64
ibrdcommamt                 500 non-null int64
id                          500 non-null object
idacommamt                  500 non-null int64
impagency                   472 non-null object
lendinginstr                495 non-null object
lendinginstrtype            495 non

<b>The dataframe info shows 500 rows and 50 columns. 'countryname' and 'project_name' are the two columns I will use to find the 10 countries with the most projects. Both have 500 non-null entries, so neither column has missing values.</b>
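
The same check can be made directly instead of reading it off the `info()` listing. The sketch below is only an illustration: `toy` is a small hypothetical stand-in for the projects dataframe, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the projects dataframe (not the real data)
toy = pd.DataFrame({
    'countryname': ['Nepal', 'Republic of India', 'Nepal'],
    'project_name': ['Project A', 'Project B', 'Project C'],
    'closingdate': ['2018-07-07', None, None],
})

# Count missing values per column; 0 means the column is complete
missing = toy[['countryname', 'project_name', 'closingdate']].isnull().sum()
print(missing)
```

On the real dataframe, the equivalent call would be `df[['countryname', 'project_name']].isnull().sum()`.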

In [22]:
# Count projects per country and keep the 10 most frequent
count = df.countryname.value_counts().head(10)
print(count)

People's Republic of China         19
Republic of Indonesia              19
Socialist Republic of Vietnam      17
Republic of India                  16
Republic of Yemen                  13
Kingdom of Morocco                 12
Nepal                              12
People's Republic of Bangladesh    12
Africa                             11
Republic of Mozambique             11
Name: countryname, dtype: int64


In [23]:
type(count)

pandas.core.series.Series

# 2. Find the Top 10 Major Project Themes

In [24]:
# Inspect 'mjtheme_namecode' column
df['mjtheme_namecode'].head()

0    [{'code': '8', 'name': 'Human development'}, {...
1    [{'code': '1', 'name': 'Economic management'},...
2    [{'code': '5', 'name': 'Trade and integration'...
3    [{'code': '7', 'name': 'Social dev/gender/incl...
4    [{'code': '5', 'name': 'Trade and integration'...
Name: mjtheme_namecode, dtype: object

<b>Since 'mjtheme_namecode' has nested elements, normalization is necessary to create a dataframe.</b>
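
To illustrate what normalization does here, the sketch below flattens a list of nested records into one row per theme entry. The `records` list and its `id` values are hypothetical and only mimic the shape of the 'mjtheme_namecode' column. It assumes pandas 1.0 or later, where `json_normalize` is available as `pd.json_normalize`.

```python
import pandas as pd

# Hypothetical records shaped like the 'mjtheme_namecode' column
records = [
    {'id': 'P1', 'mjtheme_namecode': [{'code': '8', 'name': 'Human development'},
                                      {'code': '1', 'name': ''}]},
    {'id': 'P2', 'mjtheme_namecode': [{'code': '1', 'name': 'Economic management'}]},
]

# record_path selects the nested list to unpack into rows;
# meta carries the parent record's 'id' along with each row
flat = pd.json_normalize(records, record_path='mjtheme_namecode', meta=['id'])
print(flat)
```

Each dictionary in the nested list becomes its own row, which is why the normalized frame can have more rows than the original has projects.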

In [25]:
# Load json data as a list, closing the file when done
with open('world_bank_projects.json') as f:
    data_list = json.load(f)

# Normalize data using json_normalize
df2 = json_normalize(data_list, 'mjtheme_namecode')

In [26]:
# Inspect dataframe info
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1499 entries, 0 to 1498
Data columns (total 2 columns):
code    1499 non-null object
name    1499 non-null object
dtypes: object(2)
memory usage: 23.5+ KB


In [27]:
# Count occurrences of each theme name and keep the 10 most frequent
count2 = df2.name.value_counts().head(10)
print(count2)

Environment and natural resources management    223
Rural development                               202
Human development                               197
Public sector governance                        184
Social protection and risk management           158
Financial and private sector development        130
                                                122
Social dev/gender/inclusion                     119
Trade and integration                            72
Urban development                                47
Name: name, dtype: int64


<b>According to the dataframe info, 'code' and 'name' have no missing values. However, the series above shows 122 entries whose 'name' is an empty string. These are effectively missing, but info() counts them as non-null because an empty string is not NaN.</b>
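
A small sketch of why the two views disagree, using a hypothetical `names` series rather than the real column: null checks ignore empty strings until they are converted to NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two empty-string entries
names = pd.Series(['Human development', '', 'Rural development', ''])

null_before = names.isnull().sum()   # empty strings are not counted as null
empty_count = (names == '').sum()    # count the empty strings explicitly

# After replacing '' with NaN, the null check sees them
cleaned = names.replace('', np.nan)
null_after = cleaned.isnull().sum()

print(null_before, empty_count, null_after)
```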

In [28]:
type(count2)

pandas.core.series.Series

# 3. Create a Dataframe with the Missing Names Filled In

In [29]:
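# Treat empty strings as missing, forward-fill names within each code group,
# then back-fill whatever is still missing to create a new dataframe.
# Caveat: the trailing bfill() runs on the whole frame, not per group, so a
# group whose first row is blank can pick up the name of a different code.
df3 = df2.replace('', np.nan).groupby('code').ffill().bfill()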

count2 = df3.name.value_counts().head(10)
print(count2)

Environment and natural resources management    249
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              39
Name: name, dtype: int64
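
One thing to watch in the cell above: the final `.bfill()` is applied to the whole frame rather than within each code group, so a blank name in the first row of a group could be filled from a neighboring row with a different code. A possible alternative, sketched here on a hypothetical `themes` frame (the code-to-name pairs are illustrative assumptions, not taken from the real file), is to build a code-to-name lookup and map every row through it.

```python
import pandas as pd

# Hypothetical theme rows; the first row has a blank name and is the
# first occurrence of its code, the case a forward-fill cannot handle
themes = pd.DataFrame({
    'code': ['11', '1', '11', '1', '8'],
    'name': ['', 'Economic management',
             'Environment and natural resources management', '',
             'Human development'],
})

# Build a code -> name lookup from the rows that do have a name
lookup = (themes[themes['name'] != '']
          .drop_duplicates('code')
          .set_index('code')['name'])

# Fill every row from the lookup, so no name can leak between codes
filled = themes.assign(name=themes['code'].map(lookup))
print(filled['name'].value_counts())
```

This relies on each code having at least one row with a non-empty name; a code with no named rows would map to NaN and would need separate handling.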
