## Exercises on JSON data
* To get familiar with packages for dealing with JSON
* Study examples with JSON strings and files
* Work on exercise to be completed and submitted

## Objective
Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [1]:
## Importing required packages
import pandas as pd
import numpy as np
import json
# json_normalize is available at the top level of pandas since version 1.0
from pandas import json_normalize

In [2]:
## Loading the data using context manager

with open("world_bank_projects.json") as world_bank_file:
    world_bank_data = json.load(world_bank_file)
    
## Examining the type of the loaded data.
print(type(world_bank_data))


<class 'list'>


The loaded JSON file is a Python 'list', which can hold values of any type. Data wrangling directly on this list can be tricky and slow, so later we will load the JSON file with pandas as a 'DataFrame', which gives access to the suite of pandas tools designed for data wrangling.
However, let's examine the data first.

In [3]:
world_bank_data[1]

{'_id': {'$oid': '52b213b38594d8a2be17c781'},
 'approvalfy': 2015,
 'board_approval_month': 'November',
 'boardapprovaldate': '2013-11-04T00:00:00Z',
 'borrower': 'GOVERNMENT OF TUNISIA',
 'country_namecode': 'Republic of Tunisia!$!TN',
 'countrycode': 'TN',
 'countryname': 'Republic of Tunisia',
 'countryshortname': 'Tunisia',
 'docty': 'Project Information Document,Integrated Safeguards Data Sheet,Integrated Safeguards Data Sheet,Project Information Document,Integrated Safeguards Data Sheet,Project Information Document',
 'envassesmentcategorycode': 'C',
 'grantamt': 4700000,
 'ibrdcommamt': 0,
 'id': 'P144674',
 'idacommamt': 0,
 'impagency': 'MINISTRY OF FINANCE',
 'lendinginstr': 'Specific Investment Loan',
 'lendinginstrtype': 'IN',
 'lendprojectcost': 5700000,
 'majorsector_percent': [{'Name': 'Public Administration, Law, and Justice',
   'Percent': 70},
  {'Name': 'Public Administration, Law, and Justice', 'Percent': 30}],
 'mjsector_namecode': [{'code': 'BX',
   'name': 'Publi

We see that world_bank_data is a list of dictionaries. Each dictionary describes a project granted to a country and includes information such as the project cost, duration, and major sector classification.

Now let's explore the different keys in the data to get a good understanding of what it contains. The code below prints the key-value pairs in the data. However, it is commented out to avoid filling the page with output.

In [4]:
## visualizing the keys in the data
# for k in world_bank_data:
#     for key, value in k.items():
#         print(key, value)
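
A more compact way to get an overview is to collect only the distinct keys rather than printing every value. The sketch below uses a made-up two-record list (`sample_records` is illustrative, not the World Bank data) standing in for the loaded list of dictionaries.

```python
# Hypothetical sample standing in for the list returned by json.load
sample_records = [
    {"id": "P1", "countryname": "A", "totalamt": 100},
    {"id": "P2", "countryname": "B", "url": "http://example.org"},
]

# Union of the keys over all records, sorted for readability
all_keys = sorted({key for record in sample_records for key in record})
print(all_keys)
```

The same comprehension should work on the full list, since each element is a dictionary.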

Loading the JSON file using pandas. Using pandas to load JSON produces a data frame, which is helpful later. 
The type of the loaded object is a pandas DataFrame. We also notice that different columns can have different data types, but each column holds a single data type. This is a core feature that enables fast data manipulation with pandas DataFrames.

In [5]:
## Loading json using pandas
world_bank_data = pd.read_json('world_bank_projects.json')

# Examining data type 
print(type(world_bank_data))

## Print the first 5 rows (the default for head())
print(world_bank_data.head())


<class 'pandas.core.frame.DataFrame'>
                                    _id  approvalfy board_approval_month  \
0  {'$oid': '52b213b38594d8a2be17c780'}        1999             November   
1  {'$oid': '52b213b38594d8a2be17c781'}        2015             November   
2  {'$oid': '52b213b38594d8a2be17c782'}        2014             November   
3  {'$oid': '52b213b38594d8a2be17c783'}        2014              October   
4  {'$oid': '52b213b38594d8a2be17c784'}        2014              October   

      boardapprovaldate                                 borrower  \
0  2013-11-12T00:00:00Z  FEDERAL DEMOCRATIC REPUBLIC OF ETHIOPIA   
1  2013-11-04T00:00:00Z                    GOVERNMENT OF TUNISIA   
2  2013-11-01T00:00:00Z   MINISTRY OF FINANCE AND ECONOMIC DEVEL   
3  2013-10-31T00:00:00Z   MIN. OF PLANNING AND INT'L COOPERATION   
4  2013-10-31T00:00:00Z                      MINISTRY OF FINANCE   

            closingdate                              country_namecode  \
0  2018-07-07T00:00:00Z
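
To see on a small scale what `pd.read_json` does with a list of records, here is a minimal sketch on a made-up two-record JSON string (the string and the name `sample_df` are illustrative only). The string is wrapped in `io.StringIO` because recent pandas versions discourage passing literal JSON text directly.

```python
import io

import pandas as pd

# Hypothetical JSON text: a list of two records with the same keys
sample_json = '[{"id": "P1", "approvalfy": 2014}, {"id": "P2", "approvalfy": 2015}]'

# Each record becomes a row and each key becomes a column
sample_df = pd.read_json(io.StringIO(sample_json))

print(type(sample_df))
print(sample_df.dtypes)
```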

Examining the columns, the dimensions of the data, and the column types.

There are about 50 columns, not all of which are visible in the output above.
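
One possible way to display every column of a wide frame is to lift the column limit temporarily. The sketch below uses a made-up 30-column frame (`wide` and its column names are illustrative) together with `pd.option_context`, which restores the previous setting when the block exits.

```python
import pandas as pd

# Hypothetical wide frame: one row, 30 columns named c00 ... c29
wide = pd.DataFrame([list(range(30))], columns=[f"c{i:02d}" for i in range(30)])

# Inside the block no columns are elided; the old limit is restored afterwards
with pd.option_context("display.max_columns", None):
    full_text = repr(wide)

print(full_text)
```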

In [6]:
## Printing columns names
print(world_bank_data.columns)

## seeing type of data in each of the columns
print(world_bank_data.dtypes)

## Number of rows and columns
print('no. of rows =  ',world_bank_data.shape[0])
print('no. of columns =  ',world_bank_data.shape[1])

Index(['_id', 'approvalfy', 'board_approval_month', 'boardapprovaldate',
       'borrower', 'closingdate', 'country_namecode', 'countrycode',
       'countryname', 'countryshortname', 'docty', 'envassesmentcategorycode',
       'grantamt', 'ibrdcommamt', 'id', 'idacommamt', 'impagency',
       'lendinginstr', 'lendinginstrtype', 'lendprojectcost',
       'majorsector_percent', 'mjsector_namecode', 'mjtheme',
       'mjtheme_namecode', 'mjthemecode', 'prodline', 'prodlinetext',
       'productlinetype', 'project_abstract', 'project_name', 'projectdocs',
       'projectfinancialtype', 'projectstatusdisplay', 'regionname', 'sector',
       'sector1', 'sector2', 'sector3', 'sector4', 'sector_namecode',
       'sectorcode', 'source', 'status', 'supplementprojectflg', 'theme1',
       'theme_namecode', 'themecode', 'totalamt', 'totalcommamt', 'url'],
      dtype='object')
_id                         object
approvalfy                   int64
board_approval_month        object
boardapprovaldat

### First Objective: Find the 10 countries with most projects
Each row in the data is a project sanctioned for a country. To find the 10 countries with the most projects, we count how many times each country name occurs and sort the counts in descending order. 
First, we check for NaN/missing entries in the country column.

In [7]:
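## Checking if the country column has NULL/NaN entries

print(world_bank_data[['countryname']].isnull().sum())
# any() is True if at least one entry is missing (all() would only be True if every entry were missing)
world_bank_data[['countryname']].isnull().values.any()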


countryname    0
dtype: int64


False
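
When checking for missing values, `.any()` and `.all()` answer different questions. The toy series below (made up for illustration) shows the difference: `.any()` reports whether at least one entry is missing, while `.all()` reports whether every entry is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical country column with one missing entry
toy = pd.Series(["Nepal", np.nan, "Tuvalu"], name="countryname")
missing_mask = toy.isnull()

print(missing_mask.sum())  # number of missing entries
print(missing_mask.any())  # is at least one entry missing?
print(missing_mask.all())  # is every entry missing?
```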

In [8]:
## Counting occurrences of each country. value_counts() sorts in descending order, so the first 10 entries are the countries with the most projects
Top_10_countries = world_bank_data['countryname'].value_counts()
print(Top_10_countries[0:10])

Republic of Indonesia              19
People's Republic of China         19
Socialist Republic of Vietnam      17
Republic of India                  16
Republic of Yemen                  13
Kingdom of Morocco                 12
Nepal                              12
People's Republic of Bangladesh    12
Africa                             11
Republic of Mozambique             11
Name: countryname, dtype: int64
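
The counting step can be illustrated on a small made-up series (the values below are not from the World Bank data). `value_counts()` counts each distinct value and returns the counts sorted from largest to smallest, so `head(n)` yields the top `n`.

```python
import pandas as pd

# Hypothetical country column with repeated entries
countries = pd.Series(["Nepal", "Tuvalu", "Nepal", "Kenya", "Nepal", "Tuvalu"])

# Count each distinct value; the result is sorted by count, largest first
project_counts = countries.value_counts()
top_two = project_counts.head(2)
print(top_two)
```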


### Second Objective: Find the top 10 major project themes (using column 'mjtheme_namecode')
Let's examine the 'mjtheme_namecode' column.

In [9]:
## Examining column 'mjtheme_namecode'
world_bank_data[['mjtheme_namecode']].head()

for row in world_bank_data['mjtheme_namecode'][0:5]:
    print(row)

[{'code': '8', 'name': 'Human development'}, {'code': '11', 'name': ''}]
[{'code': '1', 'name': 'Economic management'}, {'code': '6', 'name': 'Social protection and risk management'}]
[{'code': '5', 'name': 'Trade and integration'}, {'code': '2', 'name': 'Public sector governance'}, {'code': '11', 'name': 'Environment and natural resources management'}, {'code': '6', 'name': 'Social protection and risk management'}]
[{'code': '7', 'name': 'Social dev/gender/inclusion'}, {'code': '7', 'name': 'Social dev/gender/inclusion'}]
[{'code': '5', 'name': 'Trade and integration'}, {'code': '4', 'name': 'Financial and private sector development'}]


We see that each entry in the column is a list of dictionaries, and that some theme names are missing.
First we load the data as a list and then use the json_normalize function to extract the variables of interest into a dataframe.

In [10]:
## Loading the JSON file using json.load
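# Using a context manager so the file is closed after loading
with open('world_bank_projects.json') as world_bank_file:
    world_bank_data_string = json.load(world_bank_file)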
print(type(world_bank_data_string))

## Creating a dataframe 
table_code = json_normalize(world_bank_data_string, 'mjtheme_namecode',['countryname'])
print(table_code.head())
print(type(table_code))

<class 'list'>
  code                                   name  \
0    8                      Human development   
1   11                                          
2    1                    Economic management   
3    6  Social protection and risk management   
4    5                  Trade and integration   

                               countryname  
0  Federal Democratic Republic of Ethiopia  
1  Federal Democratic Republic of Ethiopia  
2                      Republic of Tunisia  
3                      Republic of Tunisia  
4                                   Tuvalu  
<class 'pandas.core.frame.DataFrame'>
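
To make the roles of the two arguments clearer, here is a small sketch on made-up records (`projects` is illustrative only). The record path names the nested list to flatten into rows, and the meta list names the top-level fields to repeat on each of those rows. It calls `pd.json_normalize`, the top-level name for the same function in pandas 1.0 and later.

```python
import pandas as pd

# Hypothetical records shaped like the World Bank entries
projects = [
    {
        "countryname": "A",
        "mjtheme_namecode": [
            {"code": "8", "name": "Human development"},
            {"code": "11", "name": ""},
        ],
    },
    {
        "countryname": "B",
        "mjtheme_namecode": [{"code": "1", "name": "Economic management"}],
    },
]

# One output row per nested theme entry, with the country repeated alongside
flat = pd.json_normalize(projects, record_path="mjtheme_namecode", meta=["countryname"])
print(flat)
```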


Since some values in the name column are missing, as observed, we cannot use the name column to find the 10 most frequent themes. So, let's find the top 10 themes using the unique code associated with each theme.

In [11]:
### Find top 10 themes by frequency of code number
# value_counts() returns the counts sorted in descending order
Top10_Themes = table_code.code.value_counts().head(10)
# Name the index 'code' and the values 'count' explicitly, so the column
# names do not depend on the pandas version
Top10_Themes = Top10_Themes.rename_axis('code').reset_index(name='count')
Top10_Themes

|   | code | count |
|---|------|-------|
| 0 | 11 | 250 |
| 1 | 10 | 216 |
| 2 | 8 | 210 |
| 3 | 2 | 199 |
| 4 | 6 | 168 |
| 5 | 4 | 146 |
| 6 | 7 | 130 |
| 7 | 5 | 77 |
| 8 | 9 | 50 |
| 9 | 1 | 38 |


In [12]:
## Examining the unique codes for themes
table_code.code.unique()

array(['8', '11', '1', '6', '5', '2', '7', '4', '10', '9', '3'], dtype=object)

Creating a data frame with unique code and theme name associated with it

In [13]:
# Look up the first non-empty theme name associated with each code
rows = []
for code in table_code.code.unique():
    named = table_code[(table_code.code == code) & (table_code.name != '')]
    rows.append({'code': code, 'name': named.iloc[0, 1]})
d = pd.DataFrame(rows, columns=['code', 'name'])

# Merging Top10_Themes with the code-to-name lookup table
# (DataFrame.sort was removed from pandas; sort_values is its replacement)
Themes = pd.merge(d, Top10_Themes, on='code').sort_values('count', ascending=False)
Themes



|   | code | name | count |
|---|------|------|-------|
| 1 | 11 | Environment and natural resources management | 250 |
| 8 | 10 | Rural development | 216 |
| 0 | 8 | Human development | 210 |
| 5 | 2 | Public sector governance | 199 |
| 3 | 6 | Social protection and risk management | 168 |
| 7 | 4 | Financial and private sector development | 146 |
| 6 | 7 | Social dev/gender/inclusion | 130 |
| 4 | 5 | Trade and integration | 77 |
| 9 | 9 | Urban development | 50 |
| 2 | 1 | Economic management | 38 |
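
As an alternative to an explicit loop, the code-to-name lookup can be sketched with `drop_duplicates` and `merge`. The example below runs on a made-up four-row table (`themes` is illustrative); the row order of the lookup may differ from a loop over the unique codes, but the code, name and count values should agree.

```python
import pandas as pd

# Hypothetical theme table with one missing name
themes = pd.DataFrame({
    "code": ["8", "11", "11", "1"],
    "name": [
        "Human development",
        "",
        "Environment and natural resources management",
        "Economic management",
    ],
})

# Keep the first non-empty name seen for each code
lookup = themes[themes["name"] != ""].drop_duplicates("code")[["code", "name"]]

# Count how often each code occurs, with explicit column names
counts = themes["code"].value_counts().rename_axis("code").reset_index(name="count")

# Attach the counts to the names and rank by frequency
ranked = lookup.merge(counts, on="code").sort_values("count", ascending=False)
print(ranked)
```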


### Third Objective:  We will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [14]:
## Examining how the missing theme names are stored in the data

print("Missing values are stored as:", repr(table_code.iloc[1, 1]))

Missing values are stored as: ''


In [15]:
## Rows with missing names (copied so that they can be modified safely)
no_name = table_code[(table_code.name == '')].copy()

# Rows with no missing names
name = table_code[(table_code.name != '')]

## Number of rows with a missing name
print('Number of rows with no name:', (no_name.shape[0]))

Number of rows with no name: 122


In [16]:
## Filling missing names by looking up each code in the rows that do have a name
# Build a code -> name lookup from the first named row for each code
code_to_name = name.drop_duplicates('code').set_index('code')['name']
# Replace the empty names through the lookup; assign() returns a new
# DataFrame, which avoids the SettingWithCopyWarning of chained assignment
no_name = no_name.assign(name=no_name['code'].map(code_to_name))


In [17]:
## Final dataframe with no Missing THEME names
Filled_missingvalues = pd.concat([no_name,name]).sort_index()
Filled_missingvalues.head(10)

|   | code | name | countryname |
|---|------|------|-------------|
| 0 | 8 | Human development | Federal Democratic Republic of Ethiopia |
| 1 | 11 | Environment and natural resources management | Federal Democratic Republic of Ethiopia |
| 2 | 1 | Economic management | Republic of Tunisia |
| 3 | 6 | Social protection and risk management | Republic of Tunisia |
| 4 | 5 | Trade and integration | Tuvalu |
| 5 | 2 | Public sector governance | Tuvalu |
| 6 | 11 | Environment and natural resources management | Tuvalu |
| 7 | 6 | Social protection and risk management | Tuvalu |
| 8 | 7 | Social dev/gender/inclusion | Republic of Yemen |
| 9 | 7 | Social dev/gender/inclusion | Republic of Yemen |
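
Another possible way to fill the missing names, without splitting the table in two, is to turn the empty strings into NaN and take the first non-missing name within each code group. This is only a sketch on made-up data (`toy` and `filled` are illustrative names), and it ends with a check that no empty name remains.

```python
import pandas as pd

# Hypothetical theme table with two missing names
toy = pd.DataFrame({
    "code": ["8", "11", "11", "6", "6"],
    "name": [
        "Human development",
        "",
        "Environment and natural resources management",
        "Social protection and risk management",
        "",
    ],
})

filled = toy.copy()
# where() keeps non-empty names and turns '' into NaN
names_or_nan = filled["name"].where(filled["name"] != "")
# 'first' picks the first non-missing name in each code group
filled["name"] = names_or_nan.groupby(filled["code"]).transform("first")

print(filled)
print("Rows still missing a name:", int(filled["name"].isnull().sum()))
```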
