<img src="https://www.uc3m.es/ss/Satellite?blobcol=urldata&blobkey=id&blobtable=MungoBlobs&blobwhere=1371573952659">


---


# **WEB ANALYTICS COURSE 4 - SEMESTER 2**
# **BACHELOR IN DATA SCIENCE AND ENGINEERING**

# **LAB 1 APIs - WORLD BANK**


## Gabriela Marin Martín, Afina Nurova, Mónica De Álvaro Mena, Daniel Kwapien | Group 97 

# **0. LAB PREPARATION**

Students must complete the following tasks before attending the lab:

1. **Read and study the API documentation to get an initial idea of how the World Bank API works. The following links point to the relevant documentation:**
- https://datahelpdesk.worldbank.org/knowledgebase/articles/898581-api-basic-call-structures
- https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information
- https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-about-the-indicators-api-documentation

2. **The key elements of the World Bank API are the "indicators". The link below offers a search tool that simplifies finding them. Once you have selected an indicator, you can find its code in the browser's URL bar.**

- https://data.worldbank.org/indicator?tab=featured
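
As a rough illustration of the call structure covered in the documentation above, the sketch below builds an indicator URL and pulls an indicator code out of an indicator page URL. The helper names `build_indicator_url` and `indicator_code_from_page` are hypothetical, and the sketch assumes the indicator code is the last path segment of the page URL.

```python
from urllib.parse import urlencode, urlparse

API_ROOT = "https://api.worldbank.org/v2"

def build_indicator_url(indicator, country="all", date=None, per_page=1000):
    """Compose a World Bank API v2 URL for one indicator (hypothetical helper)."""
    params = {}
    if date is not None:
        params["date"] = date
    params["format"] = "json"
    params["per_page"] = per_page
    return f"{API_ROOT}/country/{country}/indicator/{indicator}?{urlencode(params)}"

def indicator_code_from_page(page_url):
    """Return the last path segment of an indicator page URL (assumed to be the code)."""
    path = urlparse(page_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]

code = indicator_code_from_page("https://data.worldbank.org/indicator/SP.POP.TOTL?view=chart")
url = build_indicator_url(code, date=2022)
print(code)
print(url)
```

Keeping the query parameters in a dictionary makes it easy to add or drop filters such as `date` without editing the URL string by hand.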

# **1. INTRODUCTION**

* The goal of this lab is to gain experience with a widely used API, the World Bank API, which provides a wide range of country indicators covering the economy, health, education, agriculture, etc.

* The lab includes 5 milestones that guide students through the use of several indicators.

* The lab will be done in groups of 2-3 students.

* The lab will take two complete consecutive sessions (4 hours). Students are expected to complete the 5 milestones within these 2 sessions.

* **The final mark will be computed as a function of the number of milestones successfully completed.**

* **Each group should also upload their lab notebook in the corresponding task in Aula Global.**

* Upon completing all the milestones, students should call the professor, who will check the correctness of the solution. Partial milestone checks may be allowed in some cases.

# **2. MILESTONES**

In this section we describe the milestones one by one and leave space for students to write the code that completes each requested task.

**NOTE: Unless otherwise stated, all the milestones must deliver information about countries. Therefore, you should not consider regions or any other aggregated information in your analysis.**
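
One possible way to separate countries from aggregates is sketched below. It assumes that the API's country metadata (the `/v2/country` endpoint) labels aggregates with a region value of "Aggregates". The records are hand-written stand-ins shaped like that metadata, not real API output, and `country_only_codes` is a hypothetical helper.

```python
def country_only_codes(metadata_records):
    """Collect the ISO3 codes whose region is not labelled 'Aggregates'."""
    codes = set()
    for record in metadata_records:
        region = (record.get("region") or {}).get("value", "")
        if region.strip() != "Aggregates":
            codes.add(record["id"])
    return codes

# Hand-written stand-in records (shape assumed from the country metadata endpoint).
sample_metadata = [
    {"id": "ESP", "name": "Spain", "region": {"value": "Europe & Central Asia"}},
    {"id": "WLD", "name": "World", "region": {"value": "Aggregates"}},
    {"id": "HIC", "name": "High income", "region": {"value": "Aggregates"}},
]

print(sorted(country_only_codes(sample_metadata)))
```

The resulting set can then be used to filter indicator records by their `countryiso3code` field.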

# **2.1. MILESTONE 1: POPULATION**:
Retrieve each country's 2022 population and show the Top 10 and the Bottom 10 countries in the World Bank database.
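
The API answers with a two-element JSON list: paging metadata first, data records second. A minimal sketch of ranking such records follows. The payload is a hand-written stand-in with made-up values, and `rank_records` is a hypothetical helper rather than part of any library.

```python
def rank_records(payload, n):
    """Return (top_n, bottom_n) lists of (name, value), skipping empty values."""
    if len(payload) < 2 or not payload[1]:
        return [], []
    rows = [
        (rec["country"]["value"], rec["value"])
        for rec in payload[1]
        if rec["value"] is not None
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    # Largest first for the top list, smallest first for the bottom list.
    return rows[:n], rows[-n:][::-1]

# Stand-in payload with made-up numbers, shaped like [metadata, records].
sample_payload = [
    {"page": 1, "pages": 1, "per_page": 1000, "total": 4},
    [
        {"countryiso3code": "AAA", "country": {"value": "Alpha"}, "value": 500},
        {"countryiso3code": "BBB", "country": {"value": "Beta"}, "value": None},
        {"countryiso3code": "CCC", "country": {"value": "Gamma"}, "value": 20},
        {"countryiso3code": "DDD", "country": {"value": "Delta"}, "value": 300},
    ],
]

top, bottom = rank_records(sample_payload, 2)
print(top)
print(bottom)
```

Dropping empty values before sorting keeps missing data from appearing at either end of the ranking.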



In [1]:
import requests
import pandas as pd
import pycountry

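def get_data(url):
    """Return the list of records from a World Bank API response, or None on failure."""
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        print(f"Error fetching data: {response.status_code}")
        return None
    payload = response.json()
    # A successful response is [paging metadata, records]; an error payload may have only one element
    if len(payload) < 2:
        print(f"Unexpected response: {payload}")
        return None
    return payload[1]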

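# 2022 total population (SP.POP.TOTL) for every entity, aggregates included
url = "https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL?date=2022&format=json&per_page=1000"
# ISO 3166 alpha-3 codes, used to keep countries and drop World Bank aggregates
country_codes = [country.alpha_3 for country in pycountry.countries]
df = pd.DataFrame(get_data(url))
df = df[['countryiso3code', 'country', 'value']]
df.columns = ['CountryCode', 'Country', 'Population']
# .copy() avoids pandas' SettingWithCopyWarning when the column is reassigned below
df = df[df['CountryCode'].isin(country_codes)].copy()
# The 'country' field is a dict; keep only the country name stored under 'value'
df['Country'] = df['Country'].map(lambda x: x['value'])

df = df.dropna(subset=['Population'])
top_10_countries = df.sort_values(by='Population', ascending=False).head(10)
bottom_10_countries = df.sort_values(by='Population', ascending=True).head(10)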

print("Top 10 Countries by Population in 2022:")
print(top_10_countries)

print("\nBottom 10 Countries by Population in 2022:")
print(bottom_10_countries)


Top 10 Countries by Population in 2022:
    CountryCode             Country    Population
138         IND               India  1.417173e+09
90          CHN               China  1.412175e+09
255         USA       United States  3.332714e+08
139         IDN           Indonesia  2.755013e+08
198         PAK            Pakistan  2.358249e+08
193         NGA             Nigeria  2.185412e+08
75          BRA              Brazil  2.153135e+08
64          BGD          Bangladesh  1.711864e+08
210         RUS  Russian Federation  1.442369e+08
176         MEX              Mexico  1.275041e+08

Bottom 10 Countries by Population in 2022:
    CountryCode                    Country  Population
250         TUV                     Tuvalu     11312.0
186         NRU                      Nauru     12668.0
199         PLW                      Palau     18055.0
76          VGB     British Virgin Islands     31305.0
232         MAF   St. Martin (French part)     31791.0
124         GIB                  Gib

# **2.2. MILESTONE 2: WOMEN Vs. MEN POPULATION**:
Obtain the % of men and women for each country and compute the difference between them using the formula %women - %men. Display:

1- The number of countries with more women than men.

2- The number of countries with more men than women

3- The 10 countries with more women compared to men (ten countries with the largest positive value of the previous metric)

4- The 10 countries with more men compared to women (ten countries with the largest negative value of the previous metric).

**Note**: You can either retrieve the absolute number of men and women from the World Bank API and compute the % for each country and the difference yourself, or use the indicator that gives the % directly.
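
The two routes mentioned in the note lead to the same metric, as the small sketch below illustrates. The function names are hypothetical and the inputs are made-up numbers, not World Bank data.

```python
def gap_from_counts(women, men):
    """%women - %men computed from absolute head counts."""
    total = women + men
    return 100 * (women - men) / total

def gap_from_male_share(male_pct):
    """Same metric when only the male percentage m is known: (100 - m) - m."""
    return 100 - 2 * male_pct

# Made-up example: 55 women and 45 men, i.e. a male share of 45%.
print(gap_from_counts(55, 45))
print(gap_from_male_share(45))
```

A positive result means more women than men, and a negative result means more men than women.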



In [2]:
# Fetch the male share of the population (SP.POP.TOTL.MA.ZS, % of total)
url = "https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL.MA.ZS?date=2022&format=json&per_page=1000"

df = pd.DataFrame(get_data(url))
df = df[['countryiso3code', 'country', 'value']]
df.columns = ['CountryCode', 'Country', 'MalePopulation']
# Keep countries only; .copy() avoids pandas' SettingWithCopyWarning below
df = df[df['CountryCode'].isin(country_codes)].copy()
df['Country'] = df['Country'].map(lambda x: x['value'])
df = df.dropna(subset=['MalePopulation'])

# Both columns hold percentages; the female share is the complement of the male share
df['FemalePopulation'] = 100 - df['MalePopulation']
df['Difference'] = df['FemalePopulation'] - df['MalePopulation']

more_women = df[df['Difference'] > 0].shape[0]
more_men = df[df['Difference'] < 0].shape[0]

top_10_women = df.sort_values(by='Difference', ascending=False).head(10)
top_10_men = df.sort_values(by='Difference', ascending=True).head(10)

print(f"Number of countries with more women than men: {more_women}")
print(f"Number of countries with more men than women: {more_men}")

print("\nTop 10 Countries with more women compared to men:")
print(top_10_women)

print("\nTop 10 Countries with more men compared to women:")
print(top_10_men)


Number of countries with more women than men: 134
Number of countries with more men than women: 81

Top 10 Countries with more women compared to men:
    CountryCode                Country  MalePopulation  FemalePopulation  \
57          ARM                Armenia       44.959286         55.040714   
252         UKR                Ukraine       45.919077         54.080923   
66          BLR                Belarus       46.043914         53.956086   
135         HKG   Hong Kong SAR, China       46.073324         53.926676   
158         LVA                 Latvia       46.384705         53.615295   
210         RUS     Russian Federation       46.442472         53.557528   
261         VIR  Virgin Islands (U.S.)       46.613382         53.386618   
166         MAC       Macao SAR, China       46.921032         53.078968   
164         LTU              Lithuania       46.968479         53.031521   
121         GEO                Georgia       47.003740         52.996260   

     Differen

# **2.3. MILESTONE 3: GDP PER CAPITA ACCORDING TO INCOME LEVEL GROUPS**:

Compute the average percentage increase/decrease in GDP per capita (in US dollars) over the following two periods: 2000-2022 and 2010-2022, for the following income groups: low-income economies, lower-middle-income economies, middle-income economies, upper-middle-income economies and high-income economies. The following link provides information on the different country aggregations defined by the World Bank.

https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups

You should compute the %GDP increase as follows. Given country A with a GDP per capita of \$20000 in 2000 and \$30000 in 2022, the increase/decrease should be computed as follows:

%GDP increase = 100*(30000-20000)/20000 = 50%.
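
The formula above can be sketched in a few lines, together with one way of averaging it per group. The helper names and the sample rows are hypothetical, and the numbers other than the worked example are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def pct_change(start, end):
    """Percentage increase (positive) or decrease (negative) from start to end."""
    return 100 * (end - start) / start

def mean_change_by_group(rows):
    """rows: iterable of (group, start, end); rows with missing values are skipped."""
    changes = defaultdict(list)
    for group, start, end in rows:
        if start is None or end is None or start == 0:
            continue
        changes[group].append(pct_change(start, end))
    return {group: mean(values) for group, values in changes.items()}

# Worked example from the text: 20000 in 2000 and 30000 in 2022.
print(pct_change(20000, 30000))

# Made-up rows for two placeholder groups.
sample_rows = [
    ("Group X", 100, 150),
    ("Group X", 200, 220),
    ("Group Y", 50, None),
    ("Group Y", 80, 100),
]
print(mean_change_by_group(sample_rows))
```

This sketch gives every country equal weight within its group. Computing the change of a group-level aggregate value instead would generally give a different number.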


In [3]:
#SOLUTION MILESTONE 3
url_2000 = "https://api.worldbank.org/v2/country/all/indicator/NY.GDP.PCAP.CD?date=2000&format=json&per_page=1000"
url_2010 = "https://api.worldbank.org/v2/country/all/indicator/NY.GDP.PCAP.CD?date=2010&format=json&per_page=1000"
url_2022 = "https://api.worldbank.org/v2/country/all/indicator/NY.GDP.PCAP.CD?date=2022&format=json&per_page=1000"

gdp_2000 = pd.DataFrame(get_data(url_2000))
gdp_2010 = pd.DataFrame(get_data(url_2010))
gdp_2022 = pd.DataFrame(get_data(url_2022))

gdp_2000 = gdp_2000[['countryiso3code', 'value']]
gdp_2010 = gdp_2010[['countryiso3code', 'value']]
gdp_2022 = gdp_2022[['countryiso3code', 'value']]

gdp_2000.columns = ['CountryCode', 'GDP_2000']
gdp_2010.columns = ['CountryCode', 'GDP_2010']
gdp_2022.columns = ['CountryCode', 'GDP_2022']

gdp_df = pd.merge(gdp_2000, gdp_2022, on='CountryCode')
gdp_df = pd.merge(gdp_df, gdp_2010, on='CountryCode')
gdp_df.dropna(inplace=True)

gdp_df = gdp_df[gdp_df['CountryCode'].isin(country_codes)]

# Calculate percentage increase
gdp_df['GDP_Percentage_Increase_2000_2022'] = 100 * (gdp_df['GDP_2022'] - gdp_df['GDP_2000']) / gdp_df['GDP_2000']
gdp_df['GDP_Percentage_Increase_2010_2022'] = 100 * (gdp_df['GDP_2022'] - gdp_df['GDP_2010']) / gdp_df['GDP_2010']

# Read income group data from CLASS.csv
income_groups_df = pd.read_csv('./CLASS.csv').dropna()
income_groups_df = income_groups_df[['Code', 'Income group']]
income_groups_df.columns = ['CountryCode', 'IncomeGroup']

gdp_df = pd.merge(gdp_df, income_groups_df, on='CountryCode')

# Aggregated regions were already removed above by the ISO alpha-3 filter on CountryCode

# Group by income level and calculate mean percentage increase
income_group_gdp = gdp_df.groupby('IncomeGroup').agg({'GDP_Percentage_Increase_2000_2022': 'mean', 'GDP_Percentage_Increase_2010_2022': 'mean'})


print(income_group_gdp)

                     GDP_Percentage_Increase_2000_2022  \
IncomeGroup                                              
High income                                 411.115383   
Low income                                  177.810621   
Lower middle income                         285.925778   
Upper middle income                         370.363417   

                     GDP_Percentage_Increase_2010_2022  
IncomeGroup                                             
High income                                  73.381612  
Low income                                   30.655843  
Lower middle income                          53.724544  
Upper middle income                          43.486052  


# **2.4. MILESTONE 4: TOP 5 COUNTRIES INCREASE GDP PER INCOME-GROUP**

For each of the income groups included in Milestone 3 and the period 2010-2022, list the Top 5 countries in terms of %GDP per capita increase, along with the value.

**NOTE**: Exclude the countries for which data is missing for 2010, for 2022, or for both.
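
A minimal sketch of a "top k per group" selection that also honours the note about missing data is shown below. The helper `top_k_per_group`, the group labels, and the numbers are all hypothetical.

```python
import heapq
from collections import defaultdict

def top_k_per_group(rows, k=5):
    """rows: iterable of (group, code, increase); rows with a None increase are dropped."""
    buckets = defaultdict(list)
    for group, code, increase in rows:
        if increase is not None:
            buckets[group].append((code, increase))
    return {
        group: heapq.nlargest(k, items, key=lambda item: item[1])
        for group, items in buckets.items()
    }

# Made-up rows; "BBB" has no computable increase and is therefore excluded.
sample = [
    ("Group X", "AAA", 12.5),
    ("Group X", "BBB", None),
    ("Group X", "CCC", 40.0),
    ("Group Y", "DDD", -3.0),
    ("Group Y", "EEE", 7.0),
]
result = top_k_per_group(sample, k=1)
print(result)
```

With pandas, sorting by the increase and taking `head(k)` within each group is an equivalent route.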

In [4]:
#SOLUTION MILESTONE 4

# Sort so that, within each income group, the largest increase comes first
gdp_df_sorted = gdp_df.sort_values(by=['IncomeGroup', 'GDP_Percentage_Increase_2010_2022'], ascending=False)

for income_group, group in gdp_df_sorted.groupby('IncomeGroup'):
    # groupby keeps the sorted row order inside each group, so head(5) is the top 5
    top_5 = group[['CountryCode', 'GDP_Percentage_Increase_2010_2022']].head(5).copy()
    top_5['IncomeGroup'] = income_group
    print('Income Group: ', income_group)
    print(top_5)

Income Group:  High income
    CountryCode  GDP_Percentage_Increase_2010_2022  IncomeGroup
53          GUY                         296.513734  High income
88          NRU                         161.452207  High income
96          PAN                         113.643967  High income
18          BGR                         103.600337  High income
102         ROU                          86.861917  High income
Income Group:  Low income
    CountryCode  GDP_Percentage_Increase_2010_2022 IncomeGroup
43          ETH                         206.315061  Low income
112         SOM                         164.937788  Low income
30          COD                         104.598773  Low income
104         RWA                          62.690758  Low income
69          LBR                          51.811254  Low income
Income Group:  Lower middle income
    CountryCode  GDP_Percentage_Increase_2010_2022          IncomeGroup
8           BGD                         246.006100  Lower middle income
123   

# **2.5. MILESTONE 5: CO2 emission per capita**

Retrieve the most recent non-empty value of CO2 emissions per capita (metric tons per capita) for all the countries. Display the 30 countries with the highest CO2 emissions per capita, along with their value and the year that value refers to.

**NOTE**: You may not look up the year manually and hard-code it in your query for this milestone.
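
The API offers an `mrnev` query parameter for this purpose. The same selection can also be done on the client side when several years are downloaded at once, as the sketch below illustrates. The records are hand-written stand-ins with made-up values, and `latest_non_empty` is a hypothetical helper.

```python
def latest_non_empty(records):
    """Map each country code to (year, value) for its most recent non-empty value."""
    latest = {}
    for rec in records:
        if rec["value"] is None:
            continue
        code, year = rec["countryiso3code"], int(rec["date"])
        # Keep the entry only if it is newer than what we already hold.
        if code not in latest or year > latest[code][0]:
            latest[code] = (year, rec["value"])
    return latest

# Stand-in records with made-up values; yearly dates arrive as strings.
sample_records = [
    {"countryiso3code": "AAA", "date": "2022", "value": None},
    {"countryiso3code": "AAA", "date": "2021", "value": 4.2},
    {"countryiso3code": "AAA", "date": "2020", "value": 4.0},
    {"countryiso3code": "BBB", "date": "2019", "value": 1.5},
    {"countryiso3code": "BBB", "date": "2022", "value": 1.1},
]
latest = latest_non_empty(sample_records)
print(latest)
```

Comparing years explicitly means the result does not depend on the order in which the records arrive.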


In [5]:
#SOLUTION MILESTONE 5

import requests
import pandas as pd
import pycountry


country_codes = [country.alpha_3 for country in pycountry.countries]


# World Bank API endpoint for CO2 emissions per capita; mrnev=1 requests the
# most recent non-empty value for each country, so no year is hard-coded
api_url = "https://api.worldbank.org/v2/country/all/indicator/EN.ATM.CO2E.PC?mrnev=1&format=json&per_page=10000"

# Function to fetch CO2 emissions data per capita for all countries
def fetch_co2_data():
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()  # Stop early on HTTP errors
    data = response.json()
    return data[1]  # The second item in the response is the actual data

# Fetch CO2 emissions data
co2_data = fetch_co2_data()

# Collect country name, year, value and ISO3 code from each returned entry
country_emissions = []

for entry in co2_data:
    country = entry['country']['value']
    year = entry['date']
    value = entry['value']
    
    # Skip empty values; with mrnev=1 the API is expected to return one (most recent) entry per country
    if value is not None:
        country_emissions.append({
            'country': country,
            'year': year,
            'co2_per_capita': value,
            'code': entry['countryiso3code']
        })


# Convert the data to a pandas DataFrame
df = pd.DataFrame(country_emissions)
df = df[df['code'].isin(country_codes)]

# Sort the DataFrame by CO2 emissions per capita in descending order
df_sorted = df.sort_values(by='co2_per_capita', ascending=False)

# Display the top 30 countries with the highest CO2 emissions per capita
top_30_countries = df_sorted.head(30)
print(top_30_countries[['country', 'year', 'co2_per_capita']])

                  country  year  co2_per_capita
187                 Qatar  2020       31.726842
60                Bahrain  2020       21.976908
72      Brunei Darussalam  2020       21.705812
139                Kuwait  2020       21.169610
228  United Arab Emirates  2020       20.252272
177                  Oman  2020       15.636201
56              Australia  2020       14.776137
193          Saudi Arabia  2020       14.266585
79                 Canada  2020       13.591696
230         United States  2020       13.032222
149            Luxembourg  2020       12.456953
134            Kazakhstan  2020       11.297743
189    Russian Federation  2020       11.141653
138           Korea, Rep.  2020       10.990030
224          Turkmenistan  2020       10.184086
221   Trinidad and Tobago  2020       10.157119
179                 Palau  2020        8.802582
93                Czechia  2020        8.304017
132                 Japan  2020        8.031496
83                  China  2020        7