# Data Exploration
***
## Overview
This notebook uses APIs from the World Health Organization (WHO) and the World Bank to obtain the datasets used to answer the project's main hypothesis question: do countries with higher unemployment rates also have higher suicide rates? For each API, the list of available datasets is examined and an appropriate one is selected. Both sources provide historical data for at least 190 countries. The data is requested with `requests`, and the JSON response is refined into a dictionary using a simple for loop. The dictionary is then converted to a dataframe using the pandas library. Null values are then dropped from each dataframe and the columns are cast to the appropriate types. Lastly, the dataframes are merged so that suicide rates and unemployment rates are paired by country and year in one dataframe object.

This notebook also explores country GDP data, provided by the World Bank, which is used to answer a follow-up hypothesis: do countries with higher GDP have lower suicide rates?
### Dependencies:

In [1]:
# Dependencies
import requests
import pandas as pd

## WHO Data Exploration:
The main hypothesis is concerned with suicide rates and unemployment data. WHO provides data for the first of these through its online database, [The Global Health Observatory](https://www.who.int/data/gho), which includes thousands of datasets (indicators).
### WHO's Indicators Access Point:
To start, we look at the [indicator](https://ghoapi.azureedge.net/api/Indicator/) access point, which lists all indicators found on the Global Health Observatory page.

In [2]:
# Entry point for WHO's indicators
who_url = 'https://ghoapi.azureedge.net/api/Indicator/'

In [3]:
# Initialize a variable to loop through indicator list
index = 0

# Read API and print out the name of every indicator with the index value
who_data = requests.get(who_url).json()
for indicator in who_data['value']:
    print( index, indicator['IndicatorName'])
    index += 1

0 Ambient air pollution  attributable DALYs per 100'000 children under 5 years
1 Household air pollution attributable deaths
2 Household air pollution attributable deaths in children under 5 years
3 Household air pollution attributable deaths per 100'000 capita
4 Household air pollution  attributable deaths per 100'000 children under 5 years
5 Household air pollution attributable DALYs
6 Household air pollution attributable DALYs in children under 5 years
7 Household air pollution attributable DALYs (per 100 000 population)
8 Household air pollution  attributable DALYs per 100'000 children under 5 years
9 Household air pollution attributable DALYs (per 100 000, age-standardized)
10 Ambient air pollution attributable deaths in children under 5 years
11 Ambient air pollution attributable deaths
12 Ambient air pollution attributable death rate (per 100 000 population, age-standardized)
13 Ambient air pollution attributable DALYs
14 DALYs attributable to ambient air pollution (age-standard

1016 Law requires helmet to be fastened
1017 Existence of a road safety lead agency
1018 Existence of a national road safety strategy
1019 Availability of funding for national road safety strategy
1020 Existence of a formal pre-hospital care system
1021 Existence of a universal access telephone number for pre-hospital care
1022 Vehicle standards
1023 Distribution of road traffic deaths by type of road user (%)
1024 Gross national income per capita (Atlas method)
1025 Point prevalence (%), alcohol use disorders, 15+ years
1026 National survey on substance use among children and adolescents
1027 Government unit/official responsible for prevention for substance use
1028 Data on substance use disorders disseminated in national annual reports
1029 Ministry/office that takes primary responsibility for prevention for substance use
1030 Substance use policy at the national level
1031 Five years change in international cooperation for prevention for substance use
1032 Substance use policy at th

We find there are over 2,000 different indicators. Using a simple `ctrl-f` search, we narrow the list down to the indicators involving _suicide rates_. Based on its name, the indicator at index 665 looks like the appropriate dataset to use.
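The `ctrl-f` search can also be done in code. The sketch below is one possible way to filter an indicator list by keyword. `find_indicators` is a hypothetical helper, not part of the notebook, and the sample records are made up to mirror the `IndicatorName` field used in the loop above rather than taken from the API.

```python
def find_indicators(records, keyword, name_key="IndicatorName"):
    """Return (index, name) pairs whose name contains keyword, ignoring case."""
    keyword = keyword.lower()
    return [
        (index, record[name_key])
        for index, record in enumerate(records)
        # Records with a missing or empty name are skipped
        if keyword in (record.get(name_key) or "").lower()
    ]


# Made-up sample records shaped like entries of who_data['value']
sample_indicators = [
    {"IndicatorCode": "X_1", "IndicatorName": "Household air pollution attributable deaths"},
    {"IndicatorCode": "X_2", "IndicatorName": "Age-standardized suicide rates (per 100 000 population)"},
    {"IndicatorCode": "X_3", "IndicatorName": "Vehicle standards"},
]

matches = find_indicators(sample_indicators, "suicide")
for index, name in matches:
    print(index, name)
```

With the real response, the call would presumably be `find_indicators(who_data['value'], 'suicide')`. Passing `name_key='name'` should cover lists whose entries store the title under a `name` field instead.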

In [4]:
who_data['value'][665]

{'IndicatorCode': 'MH_12',
 'IndicatorName': 'Age-standardized suicide rates (per 100 000 population)',
 'Language': 'EN'}

Note the rates are given as [*age standardized*](https://www.who.int/data/gho/indicator-metadata-registry/imr-details/78#:~:text=The%20age%2Dstandardized%20mortality%20rate,of%20the%20WHO%20standard%20population.) and *per 100,000 people*.
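As a rough illustration of what age standardization means, the sketch below computes a directly standardized rate as a weighted average of age-specific rates. The age groups, rates, and weights are invented for the example. They are not the WHO standard population, and this is not necessarily how WHO computes `MH_12`.

```python
# Hypothetical age-specific suicide rates per 100 000 people (made-up numbers)
age_specific_rates = {"0-14": 1.0, "15-64": 12.0, "65+": 20.0}

# Hypothetical standard-population shares; they must sum to 1
standard_weights = {"0-14": 0.25, "15-64": 0.65, "65+": 0.10}


def age_standardized_rate(rates, weights):
    """Weighted average of age-specific rates using standard-population shares."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(rates[group] * weights[group] for group in rates)


standardized = age_standardized_rate(age_specific_rates, standard_weights)
print(round(standardized, 2))
```

Because every country is weighted by the same standard age structure, standardized rates can be compared across countries with different age distributions.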
### WHO Suicide Rates 

In [5]:
# Use the IndicatorCode to create entry point for suicide rates data
sui_url = 'https://ghoapi.azureedge.net/api/MH_12'

In [6]:
# Read data from API
sui_data = requests.get(sui_url).json()
sui_data

{'@odata.context': 'https://ghoapi.azureedge.net/api/$metadata#MH_12',
 'value': [{'Id': 19629382,
   'IndicatorCode': 'MH_12',
   'SpatialDimType': 'REGION',
   'SpatialDim': 'GLOBAL',
   'TimeDimType': 'YEAR',
   'TimeDim': 2016,
   'Dim1Type': 'SEX',
   'Dim1': 'BTSX',
   'Dim2Type': None,
   'Dim2': None,
   'Dim3Type': None,
   'Dim3': None,
   'DataSourceDimType': None,
   'DataSourceDim': None,
   'Value': '10.53',
   'NumericValue': 10.5328,
   'Low': None,
   'High': None,
   'Comments': None,
   'Date': '2018-07-17T08:37:08.217+02:00',
   'TimeDimensionValue': '2016',
   'TimeDimensionBegin': '2016-01-01T00:00:00+01:00',
   'TimeDimensionEnd': '2016-12-31T00:00:00+01:00'},
  {'Id': 25257364,
   'IndicatorCode': 'MH_12',
   'SpatialDimType': 'COUNTRY',
   'SpatialDim': 'AFG',
   'TimeDimType': 'YEAR',
   'TimeDim': 2003,
   'Dim1Type': 'SEX',
   'Dim1': 'FMLE',
   'Dim2Type': None,
   'Dim2': None,
   'Dim3Type': None,
   'Dim3': None,
   'DataSourceDimType': None,
   'DataSou

Note that rows of data are organized by country, year, and sex. We are concerned with data by country and year, but only want values measured for both sexes combined, so we will drop the rows with sex-specific data.

In [7]:
# Initialize dictionary
sui_dict = {'country': [], 'year': [], 'suicide rate': [], 'sex': []}

In [8]:
# Loop through json items to store data
for entry in sui_data['value']:
    sui_dict['country'].append(entry['SpatialDim'])
    sui_dict['year'].append(entry['TimeDim'])
    sui_dict['suicide rate'].append(entry['NumericValue'])
    sui_dict['sex'].append(entry['Dim1'])

In [9]:
# Make DataFrame
sui_df = pd.DataFrame(sui_dict)
sui_df.head()

| | country | year | suicide rate | sex |
|---|---|---|---|---|
| 0 | GLOBAL | 2016 | 10.5328 | BTSX |
| 1 | AFG | 2003 | 7.6 | FMLE |
| 2 | AFG | 2007 | 7.11 | FMLE |
| 3 | AFG | 2006 | 7.31 | FMLE |
| 4 | AFG | 2005 | 7.44 | FMLE |


In [10]:
# Keep only rows for both sexes combined ('BTSX'), then drop the sex column since it is no longer needed
sui_df = sui_df.loc[sui_df['sex'] == 'BTSX']
sui_df = sui_df.drop(columns = 'sex')

In [11]:
# Print data types - year is integer
sui_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3881 entries, 0 to 11640
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   country       3881 non-null   object 
 1   year          3881 non-null   int64  
 2   suicide rate  3881 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 121.3+ KB


The resulting dataframe contains 3,881 rows of clean, non-null data. This data will have to be paired by country and year with the unemployment rate data, so not all rows may find a match and therefore be used.
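How many rows survive the pairing depends on how many (country, year) keys the two sources share. A minimal sketch of that check follows, using small made-up frames in place of `sui_df` and the unemployment dataframe built later. The values are illustrative only.

```python
import pandas as pd

# Made-up stand-ins for the suicide and unemployment dataframes
sui_sample = pd.DataFrame({
    "country": ["AFG", "AFG", "ALB"],
    "year": [2015, 2016, 2016],
    "suicide rate": [6.0, 5.9, 4.3],
})
uem_sample = pd.DataFrame({
    "country": ["AFG", "ALB", "ALB"],
    "year": [2016, 2016, 2017],
    "unemployment rate": [8.0, 15.2, 13.6],
})

# Build sets of (country, year) keys and intersect them
sui_keys = set(zip(sui_sample["country"], sui_sample["year"]))
uem_keys = set(zip(uem_sample["country"], uem_sample["year"]))
shared_keys = sui_keys & uem_keys

print(len(shared_keys), "of", len(sui_keys), "suicide-rate rows have a match")
```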

## World Bank Data Exploration:
The second set of data required to answer the main hypothesis is country unemployment rates. [The World Bank's API](https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-about-the-indicators-api-documentation) provides global data on thousands of economic and financial indicators.
### World Bank's Indicator Access Point:
A similar [indicators](https://api.worldbank.org/v2/indicator?format=json&per_page=21000) access point is used to view all indicators provided by The World Bank's API.
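Setting `per_page=21000` pulls the whole list in one response. If the list ever outgrows a single page, the paging metadata (`page`, `pages`) that the World Bank API returns as the first element of each response could drive a loop instead. The sketch below assumes that `[metadata, records]` shape. `fetch_all_pages` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for a real `requests.get(...).json()` call so the example runs offline.

```python
def fetch_all_pages(fetch_page):
    """Collect records from every page.

    fetch_page(page_number) must return [metadata, records], where
    metadata contains a 'pages' count.
    """
    records = []
    page = 1
    while True:
        metadata, page_records = fetch_page(page)
        records.extend(page_records or [])
        if page >= int(metadata["pages"]):
            break
        page += 1
    return records


# Offline stand-in for the API: three made-up indicators, two per page
_FAKE_PAGES = {
    1: [{"page": 1, "pages": 2}, [{"name": "Indicator A"}, {"name": "Indicator B"}]],
    2: [{"page": 2, "pages": 2}, [{"name": "Indicator C"}]],
}


def fake_fetch(page):
    return _FAKE_PAGES[page]


all_indicators = fetch_all_pages(fake_fetch)
print([item["name"] for item in all_indicators])
```

Against the live API, `fetch_page` could be something like `lambda page: requests.get(url, params={'page': page}).json()`, with a smaller `per_page` value in `url`.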

In [12]:
# World bank url
wb_url = 'https://api.worldbank.org/v2/indicator?format=json&per_page=21000'

# Initialize variable to store index count
count = 0

# Print out name of every "indicator" in world bank api to find ones of interest
wb_data = requests.get(wb_url).json()
for indicator in wb_data[1]:
    print(count, indicator['name'])
    count += 1

0 Poverty Headcount ($1.90 a day)
1 Poverty Headcount ($2.50 a day)
2 Middle Class ($10-50 a day) Headcount
3 Official Moderate Poverty Rate-National
4 Poverty Headcount ($4 a day)
5 Vulnerable ($4-10 a day) Headcount
6 Poverty Gap ($1.90 a day)
7 Poverty Gap ($2.50 a day)
8 Poverty Gap ($4 a day)
9 Poverty Severity ($1.90 a day)
10 Poverty Severity ($2.50 a day)
11 Poverty Severity ($4 a day)
12 Poverty Headcount ($1.90 a day)-Rural
13 Poverty Headcount ($2.50 a day)-Rural
14 Middle Class ($10-50 a day) Headcount-Rural
15 Official Moderate Poverty Rate- Rural
16 Poverty Headcount ($4 a day)-Rural
17 Vulnerable ($4-10 a day) Headcount-Rural
18 Poverty Gap ($1.90 a day)-Rural
19 Poverty Gap ($2.50 a day)-Rural
20 Poverty Gap ($4 a day)-Rural
21 Poverty Severity ($1.90 a day)-Rural
22 Poverty Severity ($2.50 a day)-Rural
23 Poverty Severity ($4 a day)-Rural
24 Access to electricity (% of total population)
25 Total final energy consumption (TFEC)
26 Literacy rate, youth total (% of people

1149 Sorghum production (metric tons)
1150 Wheat production (metric tons)
1151 Barley seed quantity (FAO, metric tonnes)
1152 Cereal seed quantity (FAO, metric tonnes)
1153 Fonio seed quantity (FAO, metric tonnes)
1154 Millet seed quantity (FAO, metric tonnes)
1155 Maize seed quantity (FAO, metric tonnes)
1156 Rice seed quantity (FAO, metric tonnes)
1157 Sorghum seed quantity (FAO, metric tonnes)
1158 Wheat seed quantity (FAO, metric tonnes)
1159 Surface area (ha)
1160 Surface area (sq. km)
1161 Agricultural machinery, tractors per agricultural worker
1162 Pesticide consumption (kg per hectare)
1163 Barley yield (kg per hectare)
1164 Cereal yield (kg per hectare)
1165 Fonio yield (kg per hectare)
1166 Millet yield (kg per hectare)
1167 Maize yield (kg per hectare)
1168 Rice yield (kg per hectare)
1169 Sorghum yield (kg per hectare)
1170 Wheat yield (kg per hectare)
1171 Benefit incidence of social safety net programs to poorest quintile (% of total safety net benefits)
1172 Coverage of

2103 Emission Totals - Emissions (CO2eq) from CH4 (AR5) - Enteric Fermentation
2104 Emission Totals - Emissions (CO2eq) from CH4 (AR5) - Forest fires
2105 Emission Totals - Emissions (CO2eq) from CH4 (AR5) - Farm-gate emissions
2106 Emission Totals - Emissions (CO2eq) from CH4 (AR5) - Fires in organic soils
2107 Emission Totals - Emissions (CO2eq) from CH4 (AR5) - Fires in humid tropical forests
2108 Capacity of Liquefied Natural Gas export terminals (Mt per year) - Cancelled
2109 Capacity of Liquefied Natural Gas export terminals (Mt per year) - In Development (Proposed + Construction)
2110 Capacity of Liquefied Natural Gas export terminals (Mt per year) - Operating
2111 Capacity of Liquefied Natural Gas export terminals (Mt per year) - Shelved
2112 Capacity of Liquefied Natural Gas import terminals (Mt per year) - Cancelled
2113 Capacity of Liquefied Natural Gas import terminals (Mt per year) - In Development (Proposed + Construction)
2114 Capacity of Liquefied Natural Gas import ter

3231 Gross Ext. Debt Pmt, DI: Intercom Lending, More than 2yrs, Debt liab. to fellow ent., Principal, USD
3232 Gross Ext. Debt Pmt, DI: Intercom Lending, More than 3 to 6, Debt liab. to fellow ent., Principal, USD
3233 Gross Ext. Debt Pmt, DI: Intercom Lending, More than 6 to 9, Debt liab. to fellow ent., Principal, USD
3234 Gross Ext. Debt Pmt, DI: Intercom Lending, Immediate, Debt liab. to fellow ent., Principal, USD
3235 Principal repayments on external debt, general government sector (PPG) (AMT, current US$)
3236 Principal repayments on external debt, public sector (PPG) (AMT, current US$)
3237 Gross Ext. Debt Pmt, DI: Intercom Lending, More than 0 to 3, Debt liab. of DI ent. to dir. investors, Principal, USD
3238 Gross Ext. Debt Pmt, DI: Intercom Lending, More than 9 to 12, Debt liab. of DI ent. to dir. investors, Principal, USD
3239 Gross Ext. Debt Pmt, DI: Intercom Lending, More than 12 to 18, Debt liab. of DI ent. to dir. investors, Principal, USD
3240 Gross Ext. Debt Pmt, DI: 

4365 Gross Ext. Debt Pos., Central Bank, Short-term, Loans, Beginning of period, USD
4366 Ext. Assets in Debt Instruments, Central Bank, Short-term, Loans, USD
4367 Gross Ext. Debt Pos., Central Bank, Short-term, Loans, end of period, USD
4368 Gross Ext. Debt Pos., Central Bank, Short-term, Loans, Exchange rate chg, USD
4369 Gross Ext. Debt Pos., Central Bank, Short-term, Loans, Beginning pos., USD
4370 Net Ext. Debt Position, Central Bank, Short-term, Loans, USD
4371 Gross Ext. Debt Pos., Central Bank, Short-term, Loans, Other chg in vol., USD
4372 Gross Ext. Debt Pos., Central Bank, Short-term, Loans, Other price chg, USD
4373 Gross Ext. Debt Pos., Central Bank, Short-term, Loans, Transactions, USD
4374 Gross Ext. Debt Pos., Central Bank, Short-term, Loans, USD
4375 Gross Ext. Debt Pos., Nonfinancial corporations, Short-term, Loans, USD
4376 Gross Ext. Debt Pos., Other financial corporations, Short-term, Loans, USD
4377 Gross Ext. Debt Pos., Other Sectors, Short-term, Loans, Beginnin

5359 PNG, bonds (NTR, current US$)
5360 PNG, commercial banks and other creditors (NTR, current US$)
5361 CB, other private creditors (NTR, current US$)
5362 PPG, other private creditors (NTR, current US$)
5363 GG, other private creditors (NTR, current US$)
5364 OPS, other private creditors (NTR, current US$)
5365 PRVG, other private creditors (NTR, current US$)
5366 PS, other private creditors (NTR, current US$)
5367 Net transfers on external debt, private guaranteed by public sector (PPG) (NTR, current US$)
5368 CB, private creditors (NTR, current US$)
5369 PPG, private creditors (NTR, current US$)
5370 GG, private creditors (NTR, current US$)
5371 OPS, private creditors (NTR, current US$)
5372 PRVG, private creditors (NTR, current US$)
5373 PS, private creditors (NTR, current US$)
5374 Net official development assistance and official aid received (current US$)
5375 Net ODA received (% of GDP)
5376 Aid (% of GDI)
5377 Net official development assistance received (% of gross capital f

6578 Economy function expenditure (in IDR)
6579 Education function expenditure (in IDR)
6580 Environment function expenditure (in IDR)
6581 Health function expenditure (in IDR)
6582 Housing and public facilities function expenditure (in IDR)
6583 Infrastructure function expenditure (in IDR)
6584 Social protection function expenditure (in IDR)
6585 Public, law and order function expenditure (in IDR)
6586 Religious function expenditure (in IDR)
6587 Tourism and culture function expenditure (in IDR)
6588 Domestic credit to private sector by banks (% of GDP)
6589 Bank liquid reserves to bank assets ratio (%)
6590 Gold Holdings at London market price (US$ end period)
6591 Gold, valued at year-end London prices (current US$)
6592 Total reserves (includes gold, current US$)
6593 Total reserves including gold valued at London gold price (current US$)
6594 Total reserves includes gold (% of GDP)
6595 Total reserves (% of total external debt)
6596 Total reserves in months of imports
6597 Total r

7954 Prevalence of HIV, total (% of population ages 15-49): Q4
7955 Prevalence of HIV, total (% of population ages 15-49): Q5 (highest)
7956 Contraceptive prevalence, modern methods (% of females ages 15-49)
7957 Contraceptive prevalence, modern methods (% of females ages 15-49): Q1 (lowest)
7958 Contraceptive prevalence, modern methods (% of females ages 15-49): Q2
7959 Contraceptive prevalence, modern methods (% of females ages 15-49): Q3
7960 Contraceptive prevalence, modern methods (% of females ages 15-49): Q4
7961 Contraceptive prevalence, modern methods (% of females ages 15-49): Q5 (highest)
7962 Mortality rate, infant (per 1,000 live births)
7963 Mortality rate, infant (per 1,000 live births): Q1 (lowest)
7964 Mortality rate, infant (per 1,000 live births): Q2
7965 Mortality rate, infant (per 1,000 live births): Q3
7966 Mortality rate, infant (per 1,000 live births): Q4
7967 Mortality rate, infant (per 1,000 live births): Q5 (highest)
7968 Mortality rate, under-5 (per 1,000)
7

9204 Employment in the construction sector, aged 15-64, primary education and below (% of employed population with low education in working age)
9205 Employment in the construction sector, aged 15-64, male (% of male employed population in working age)
9206 Employment in the construction sector, aged 25-64 (% of employed population aged 25-64)
9207 Employment in the construction sector, aged 15-64, rural (% of rural employed population in working age)
9208 Employment in the construction sector, aged 15-64, urban (% of urban employed population in working age)
9209 Employment in the construction sector, aged 15-24 (% of employed population aged 15-24)
9210 Employment in the construction sector, aged 15-64, total (% of total employed population in working age)
9211 Employment in the commerce sector, aged 15-64, female (% of female employed population in working age)
9212 Employment in the commerce sector, aged 15-64, above primary education (% of employed population with high education i

10149 EGRA: Reading Comprehension - Share of students with a zero score (%). English. 6th Grade
10150 EGRA: Reading Comprehension - Share of students with a zero score (%). Ewe. 2nd Grade
10151 EGRA: Reading Comprehension - Share of students with a zero score (%). Fante. 2nd Grade
10152 EGRA: Reading Comprehension - Share of students with a zero score (%). Fulfulde. 2nd Grade
10153 EGRA: Reading Comprehension - Share of students with a zero score (%). Filipino. 3rd Grade
10154 EGRA: Reading Comprehension - Share of students with a zero score (%). French. 3rd Grade
10155 EGRA: Reading Comprehension - Share of students with a zero score (%). Ga. 2nd Grade
10156 EGRA: Reading Comprehension - Share of students with a zero score (%). Gonja. 2nd Grade
10157 EGRA: Reading Comprehension - Share of students with a zero score (%). Hararigna. 2nd Grade
10158 EGRA: Reading Comprehension - Share of students with a zero score (%). Hararigna. 3rd Grade
10159 EGRA: Reading Comprehension - Share of stu

11561 Official exchange rate (LCU per US$, period average)
11562 Official exchange rate to parallel exchange rate ratio
11563 PPP conversion factor, GDP (LCU per international $)
11564 2005 PPP conversion factor, GDP (LCU per international $)
11565 Price level ratio of PPP conversion factor (GDP) to market exchange rate
11566 PPP conversion factor, private consumption (LCU per international $)
11567 2005 PPP conversion factor, private consumption (LCU per international $)
11568 Maize price (US$ per metric ton)
11569 Maize price (local currency per metric ton)
11570 Wheat price (US$ per metric ton)
11571 Wheat price (local currency per metric ton)
11572 Palm Oil Land Area by type of condition: Damaged (in Hectares)
11573 Palm Oil Land Area by type of condition: Immature (in Hectares)
11574 Palm Oil Land Area by type of condition: Mature (in Hectares)
11575 Palm Oil Land Area by type of ownership: Private (in Hectares)
11576 Palm Oil Land Area by type of ownership: Smallholder (in Hectar

12665 Adequacy of benefits in 5th quintile (richest) (%) - All Social Assistance
12666 Adequacy of benefits in 5th quintile (richest) (%) - All Social Assistance -urban
12667 Average per capita transfer held by extreme poor (<$1.9 a day) - All Social Assistance  (preT)
12668 Average per capita transfer held by extreme poor (<$1.9 a day) - All Social Assistance
12669 Average per capita transfer - All Social Assistance (preT)
12670 Average per capita transfer - All Social Assistance -rural
12671 Average per capita transfer - All Social Assistance 
12672 Average per capita transfer - All Social Assistance -urban
12673 Average per capita transfer held by 1st quintile (poorest) - All Social Assistance (preT)
12674 Average per capita transfer held by 1st quintile (poorest) - All Social Assistance -rural
12675 Average per capita transfer held by 1st quintile (poorest) - All Social Assistance
12676 Average per capita transfer held by 1st quintile (poorest) - All Social Assistance -urban
12677 

13647 Average per capita transfer held by 1st quintile (poorest) - School feeding -rural
13648 Average per capita transfer held by 1st quintile (poorest) - School feeding
13649 Average per capita transfer held by 1st quintile (poorest) - School feeding -urban
13650 Average per capita transfer held by 2nd quintile - School Feeding (preT)
13651 Average per capita transfer held by 2nd quintile - School Feeding -rural
13652 Average per capita transfer held by 2nd quintile - School Feeding  
13653 Average per capita transfer held by 2nd quintile - School Feeding -urban
13654 Average per capita transfer held by 3rd quintile - School Feeding (preT)
13655 Average per capita transfer held by 3rd quintile - School Feeding -rural
13656 Average per capita transfer held by 3rd quintile - School Feeding  
13657 Average per capita transfer held by 3rd quintile - School Feeding -urban
13658 Average per capita transfer held by 4th quintile - School Feeding (preT)
13659 Average per capita transfer held 

14906 Length of District Road: Fair (in km) (Bina Marga Data)
14907 Length of District Road: Fair (in km) (BPS Data, Province only)
14908 Length of District Road: Good (in km) (Bina Marga Data)
14909 Length of District Road: Good (in km) (BPS Data, Province only)
14910 Length of District Road: Gravel (in km) (BPS Data, Province only)
14911 Length of District Road: Light Damage (in km) (Bina Marga Data)
14912 Length of District Road: Light Damage (in km) (BPS Data, Province only)
14913 Length of District Road: Other (in km) (BPS Data, Province only)
14914 Length of National Road: Asphalt (in km) (BPS Data, Province only)
14915 Length of National Road: Bad Damage (in km) (BPS Data, Province only)
14916 Length of National Road: Dirt (in km) (BPS Data, Province only)
14917 Length of National Road: Fair (in km) (BPS Data, Province only)
14918 Length of National Road: Good (in km) (BPS Data, Province only)
14919 Length of National Road: Gravel (in km) (BPS Data, Province only)
14920 Length o

16013 There is legislation on sexual harassment in employment (1=yes; 0=no)
16014 A woman can choose where to live in the same way as a man (1=yes; 0=no)
16015 Women and girls who participate in activities during menstrual period, rural (% of women and girls ages 15-49 living in rural areas who had a menstrual period within the last year)
16016 Women and girls who participate in activities during menstrual period, urban (% of women and girls ages 15-49 living in urban areas who had a menstrual period within the last year)
16017 Women and girls who participate in activities during menstrual period (% of women and girls ages 15-49 who had a menstrual period within the last year)
16018 Women and girls who have private places to wash and change during menstrual period, rural (% of women and girls ages 15-49 living in rural areas who had a menstrual period within the last year)
16019 Women and girls who have private places to wash and change during menstrual period, urban (% of women and gi

17203 Share of youth not in education, employment or training, female (% of female youth population)
17204 Share of youth not in education, employment or training, male (% of male youth population)
17205 Share of youth not in education, employment or training, total (% of youth population)
17206 Unemployment with primary education, female (% of female unemployment)
17207 Unemployment with primary education, male (% of male unemployment)
17208 Unemployment with primary education (% of total unemployment)
17209 Unemployment with secondary education, female (% of female unemployment)
17210 Unemployment with secondary education, male (% of male unemployment)
17211 Unemployment with secondary education (% of total unemployment)
17212 Unemployment with tertiary education, female (% of female unemployment)
17213 Unemployment with tertiary education, male (% of male unemployment)
17214 Unemployment with tertiary education (% of total unemployment)
17215 Number of people unemployed
17216 Unempl

18443 Enrolment in tertiary education, ISCED 6 programmes, both sexes (number)
18444 Enrolment in tertiary education, ISCED 6 programmes, female (number)
18445 Enrolment in tertiary education, ISCED 6 programmes, male (number)
18446 Enrolment in tertiary education, ISCED 7 programmes, both sexes (number)
18447 Enrolment in tertiary education, ISCED 7 programmes, female (number)
18448 Enrolment in tertiary education, ISCED 7 programmes, male (number)
18449 Enrolment in tertiary education, ISCED 8 programmes, both sexes (number)
18450 Enrolment in tertiary education, ISCED 8 programmes, female (number)
18451 Enrolment in tertiary education, ISCED 8 programmes, male (number)
18452 UIS: Percentage of population age 25+ whose highest level of education is primary, both sexes
18453 UIS: Percentage of population age 25+ whose highest level of education is primary, female
18454 UIS: Percentage of population age 25+ whose highest level of education is primary, male
18455 UIS: Percentage of popu

19731 Out-of-school rate for adolescents of lower secondary school age, rural, second quintile, both sexes (household survey data) (%)
19732 Out-of-school rate for adolescents of lower secondary school age, rural, second quintile, female (household survey data) (%)
19733 Out-of-school rate for adolescents of lower secondary school age, rural, second quintile, adjusted gender parity index (household survey data) (GPIA)
19734 Out-of-school rate for adolescents of lower secondary school age, rural, second quintile, male (household survey data) (%)
19735 Out-of-school rate for adolescents of lower secondary school age, rural, middle quintile, both sexes (household survey data) (%)
19736 Out-of-school rate for adolescents of lower secondary school age, rural, middle quintile, female (household survey data) (%)
19737 Out-of-school rate for adolescents of lower secondary school age, rural, middle quintile, adjusted gender parity index (household survey data) (GPIA)
19738 Out-of-school rate fo

Searching for *Unemployment Rate*, we find that the indicator at index 9746 has relevant data.

In [13]:
# Inspect the unemployment rate indicator found at that index
wb_data[1][9746]

{'id': 'JI.UEM.1564.ZS',
 'name': 'Unemployment rate, aged 15-64, total (% of total labor force in working age)',
 'unit': '',
 'source': {'id': '86', 'value': 'Global Jobs Indicators Database (JOIN)'},
 'sourceNote': '',
 'sourceOrganization': '',
 'topics': []}

Note *working age* is defined as 15-64 years of age. 
### World Bank Unemployment Rates

In [14]:
# Url for unemployment rate indicator
uem_url = 'https://api.worldbank.org/v2/country/indicator/JI.UEM.1564.ZS?format=json&per_page=10000'

# Load data as json
uem_data = requests.get(uem_url).json()
uem_data

[{'page': 1,
  'pages': 1,
  'per_page': 10000,
  'total': 8262,
  'sourceid': '86',
  'sourcename': 'Global Jobs Indicators Database (JOIN)',
  'lastupdated': '2021-09-24'},
 [{'indicator': {'id': 'JI.UEM.1564.ZS',
    'value': 'Unemployment rate, aged 15-64, total (% of total labor force in working age)'},
   'country': {'id': 'AFG', 'value': 'Afghanistan'},
   'countryiso3code': '',
   'date': '2020',
   'value': None,
   'unit': '',
   'obs_status': '',
   'decimal': 1},
  {'indicator': {'id': 'JI.UEM.1564.ZS',
    'value': 'Unemployment rate, aged 15-64, total (% of total labor force in working age)'},
   'country': {'id': 'AFG', 'value': 'Afghanistan'},
   'countryiso3code': '',
   'date': '2019',
   'value': None,
   'unit': '',
   'obs_status': '',
   'decimal': 1},
  {'indicator': {'id': 'JI.UEM.1564.ZS',
    'value': 'Unemployment rate, aged 15-64, total (% of total labor force in working age)'},
   'country': {'id': 'AFG', 'value': 'Afghanistan'},
   'countryiso3code': '',
 

In [15]:
# Initialize a dictionary that stores lists for country, year and unemployment rate
uem_dict = {'country': [], 'year': [], 'unemployment rate': []}

# Loop over entries in unemployment json data
for entry in uem_data[1]:
    
    # Append each list in the dictionary with values from current entry
    uem_dict['country'].append(entry['country']['id'])
    uem_dict['year'].append(entry['date'])
    uem_dict['unemployment rate'].append(entry['value'])

# Load the dictionary as a dataframe
uem_df = pd.DataFrame(uem_dict)
uem_df.head()

| | country | year | unemployment rate |
|---|---|---|---|
| 0 | AFG | 2020 | NaN |
| 1 | AFG | 2019 | NaN |
| 2 | AFG | 2018 | NaN |
| 3 | AFG | 2017 | NaN |
| 4 | AFG | 2016 | NaN |


In [16]:
# Check for missing data and how data is stored
uem_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8262 entries, 0 to 8261
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   country            8262 non-null   object 
 1   year               8262 non-null   object 
 2   unemployment rate  1413 non-null   float64
dtypes: float64(1), object(2)
memory usage: 193.8+ KB


Note that year is stored as a string (object). We recast it as an integer so that it can be merged with the suicide rates dataframe. Also, the unemployment rate has valid data for only 1,413 of the 8,262 entries, most likely because the JOIN database has no data for certain country/year pairs. We drop these rows and assume that the missing data does not introduce a systematic error (i.e., lack of reporting is not correlated with a country's unemployment rate).
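One way to probe that assumption is to look at how the missing values are distributed across countries before dropping them. The sketch below uses a small made-up frame with the same columns as `uem_df`. The numbers are not real data.

```python
import pandas as pd

# Made-up sample with the same columns as uem_df
uem_sample = pd.DataFrame({
    "country": ["AFG", "AFG", "ALB", "ALB", "ALB", "ARG"],
    "year": ["2016", "2015", "2016", "2015", "2014", "2016"],
    "unemployment rate": [None, None, 15.2, None, 17.5, 8.5],
})

# Share of missing unemployment values per country (1.0 = no data at all)
missing_share = (
    uem_sample["unemployment rate"]
    .isna()
    .groupby(uem_sample["country"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_share)
```

If the countries with the highest missing share turn out to differ systematically from the rest, the assumption would deserve a closer look.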

In [17]:
uem_df['year'] = uem_df['year'].astype(int)

In [18]:
uem_df.dropna(inplace=True)
uem_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1413 entries, 7 to 8230
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   country            1413 non-null   object 
 1   year               1413 non-null   int32  
 2   unemployment rate  1413 non-null   float64
dtypes: float64(1), int32(1), object(1)
memory usage: 38.6+ KB


### World Bank GDP per Capita
The same steps are repeated to obtain GDP per capita data. The indicator was found with a similar search.
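Since the same steps recur for each World Bank indicator, they could be bundled into one helper. The sketch below is a hypothetical refactoring, not code from the notebook: `wb_records_to_df` and its parameters are assumed names, and the sample records are made up to mirror the shape of the API response.

```python
import pandas as pd


def wb_records_to_df(records, value_name, use_iso3=True):
    """Turn World Bank records into a tidy country/year/value dataframe."""
    rows = [
        {
            # ISO3 code when requested, otherwise the nested country id
            "country": entry["countryiso3code"] if use_iso3 else entry["country"]["id"],
            "year": entry["date"],
            value_name: entry["value"],
        }
        for entry in records
    ]
    df = pd.DataFrame(rows, columns=["country", "year", value_name])
    df["year"] = df["year"].astype(int)  # year arrives as a string
    # Drop rows with no measurement and renumber the index
    return df.dropna(subset=[value_name]).reset_index(drop=True)


# Made-up records shaped like entries of a World Bank indicator response
sample_records = [
    {"country": {"id": "ZH"}, "countryiso3code": "AFE", "date": "2021", "value": 1500.0},
    {"country": {"id": "ZH"}, "countryiso3code": "AFE", "date": "2020", "value": None},
]

gdp_sample_df = wb_records_to_df(sample_records, "gdp per capita")
print(gdp_sample_df)
```

The `use_iso3` flag reflects a difference between the two responses: the unemployment records leave `countryiso3code` empty, so that dataframe takes its country code from `entry['country']['id']`, while the GDP records carry the three-letter code in `countryiso3code`.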

In [19]:
# Use gdp/capita id code by country to get url
gdp_url = 'https://api.worldbank.org/v2/country/indicator/NY.GDP.PCAP.CD?format=json&per_page=20000'
gdp_data = requests.get(gdp_url).json()
gdp_data

[{'page': 1,
  'pages': 1,
  'per_page': 20000,
  'total': 16492,
  'sourceid': '2',
  'sourcename': 'World Development Indicators',
  'lastupdated': '2022-07-20'},
 [{'indicator': {'id': 'NY.GDP.PCAP.CD',
    'value': 'GDP per capita (current US$)'},
   'country': {'id': 'ZH', 'value': 'Africa Eastern and Southern'},
   'countryiso3code': 'AFE',
   'date': '2021',
   'value': 1557.72268174541,
   'unit': '',
   'obs_status': '',
   'decimal': 1},
  {'indicator': {'id': 'NY.GDP.PCAP.CD',
    'value': 'GDP per capita (current US$)'},
   'country': {'id': 'ZH', 'value': 'Africa Eastern and Southern'},
   'countryiso3code': 'AFE',
   'date': '2020',
   'value': 1360.87864476883,
   'unit': '',
   'obs_status': '',
   'decimal': 1},
  {'indicator': {'id': 'NY.GDP.PCAP.CD',
    'value': 'GDP per capita (current US$)'},
   'country': {'id': 'ZH', 'value': 'Africa Eastern and Southern'},
   'countryiso3code': 'AFE',
   'date': '2019',
   'value': 1511.30925874722,
   'unit': '',
   'obs_statu

In [20]:
# Initialize dictionary to store columns of gdp_df
gdp_dict = {'country': [], 'year': [], 'gdp per capita': []}

# Loop through json and store values to dict
for entry in gdp_data[1]:
    gdp_dict['country'].append(entry['countryiso3code'])
    gdp_dict['year'].append(entry['date'])
    gdp_dict['gdp per capita'].append(entry['value'])

In [21]:
# Convert dict to df
gdp_df = pd.DataFrame(gdp_dict)
gdp_df.head()

Unnamed: 0,country,year,gdp per capita
0,AFE,2021,1557.722682
1,AFE,2020,1360.878645
2,AFE,2019,1511.309259
3,AFE,2018,1541.031661
4,AFE,2017,1629.404273


In [22]:
# Number of non-null values and data types
gdp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16492 entries, 0 to 16491
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         16492 non-null  object 
 1   year            16492 non-null  object 
 2   gdp per capita  13115 non-null  float64
dtypes: float64(1), object(2)
memory usage: 386.7+ KB


Note that the year column is stored as strings (object dtype) and that GDP per capita has values for only 13,115 of the 16,492 rows. To make merging with the suicide rates dataframe possible, year is recast as integer. The rows with null values are also dropped, which leaves fewer years of data for some countries.

In [23]:
# 'year' column is object type -> int to match other datasets
gdp_df['year'] = gdp_df['year'].astype(int)

In [24]:
gdp_df = gdp_df.dropna()
gdp_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13115 entries, 0 to 16491
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         13115 non-null  object 
 1   year            13115 non-null  int32  
 2   gdp per capita  13115 non-null  float64
dtypes: float64(1), int32(1), object(1)
memory usage: 358.6+ KB
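
To see which countries lose rows when nulls are dropped, the missing values can be counted per country before calling `dropna()`. The sketch below does this on a toy frame. `toy_gdp` and its values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Toy stand-in for gdp_df before cleaning; values are made up for illustration.
toy_gdp = pd.DataFrame(
    {
        "country": ["AAA", "AAA", "AAA", "BBB", "BBB"],
        "year": ["2000", "2001", "2002", "2000", "2001"],
        "gdp per capita": [100.0, None, None, 250.0, 260.0],
    }
)

# Count missing GDP values per country before they are dropped.
missing_per_country = (
    toy_gdp["gdp per capita"].isna().groupby(toy_gdp["country"]).sum()
)

# Same two cleaning steps as above: cast year to int, then drop null rows.
toy_clean = toy_gdp.assign(year=toy_gdp["year"].astype(int)).dropna()

print(missing_per_country.to_dict())
print(len(toy_clean))
```

Here country `AAA` loses two of its three rows while `BBB` keeps both, which is the kind of uneven loss the note above refers to.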


## Merge DataFrames
The suicide rates dataframe has 3,881 rows of data, but the unemployment rates dataframe has only 1,413 rows of clean data. To make analysis easier, we merge the two dataframes and keep only the rows where both a suicide rate and an unemployment rate are available. The result is a dataframe that pairs each country's unemployment rate with its suicide rate for every year in which both are available.

In [25]:
# Merge suicide rates and unemployment on country and year columns and only keep rows that appear in both of the original dfs.
sui_vs_uem = pd.merge(sui_df, uem_df, on=['country','year'], how='inner')
sui_vs_uem

Unnamed: 0,country,year,suicide rate,unemployment rate
0,AFG,2003,7.7200,8.392198
1,AFG,2011,6.4200,5.225274
2,AFG,2013,6.1700,9.286795
3,AFG,2007,7.4200,2.469782
4,AGO,2000,17.5554,4.996026
...,...,...,...,...
1095,ZMB,2010,19.7074,9.145213
1096,ZWE,2017,25.8514,8.197319
1097,ZWE,2001,19.5311,8.002688
1098,ZWE,2011,34.3032,5.664809


A total of 1,100 rows of data.
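
As a rough illustration of what the inner merge keeps and discards, the sketch below merges two toy frames and then repeats the merge as an outer join with `indicator=True`, which labels each row by its source. The frames `toy_sui` and `toy_uem` and their values are made up and only stand in for `sui_df` and `uem_df`.

```python
import pandas as pd

# Toy stand-ins for sui_df and uem_df; values are made up for illustration.
toy_sui = pd.DataFrame(
    {
        "country": ["AAA", "AAA", "BBB"],
        "year": [2000, 2001, 2000],
        "suicide rate": [5.0, 5.5, 12.0],
    }
)
toy_uem = pd.DataFrame(
    {
        "country": ["AAA", "CCC"],
        "year": [2000, 2000],
        "unemployment rate": [8.1, 3.3],
    }
)

# Inner join: keeps only (country, year) pairs present in both frames.
toy_inner = pd.merge(toy_sui, toy_uem, on=["country", "year"], how="inner")

# Outer join with indicator=True adds a '_merge' column saying whether a row
# came from the left frame only, the right frame only, or both.
toy_outer = pd.merge(
    toy_sui, toy_uem, on=["country", "year"], how="outer", indicator=True
)

print(len(toy_inner))
print(toy_outer["_merge"].value_counts().to_dict())
```

Rows labeled `left_only` or `right_only` are the ones an inner join leaves out, so their counts show how much data each source contributes without a match.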

In [26]:
# Drop rows with NaN values and count the rows remaining for each country
sui_vs_uem = sui_vs_uem.dropna()
sui_vs_uem['country'].value_counts()

EST    17
FRA    17
SVN    17
SVK    17
COL    17
       ..
EGY     1
DJI     1
IRN     1
SSD     1
IRQ     1
Name: country, Length: 153, dtype: int64

In [27]:
# How many countries have only 1 year's worth of data
(sui_vs_uem['country'].value_counts() == 1).sum()

24

Note that there are 24 countries where data is only available for 1 year.
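
If a later step needed more than one observation per country, the same `value_counts()` result could be used to filter. This is only a sketch: the notebook itself applies no such filter, and `toy_merged` and the `min_years` threshold are hypothetical.

```python
import pandas as pd

# Toy stand-in for sui_vs_uem; values are made up for illustration.
toy_merged = pd.DataFrame(
    {
        "country": ["AAA", "AAA", "AAA", "BBB", "CCC", "CCC"],
        "year": [2000, 2001, 2002, 2000, 2000, 2001],
    }
)

years_per_country = toy_merged["country"].value_counts()

# Same check as above: number of countries with exactly one year of data.
single_year_count = (years_per_country == 1).sum()

# Hypothetical threshold, chosen only for illustration.
min_years = 2
keep = years_per_country[years_per_country >= min_years].index
toy_filtered = toy_merged[toy_merged["country"].isin(keep)]

print(single_year_count, len(toy_filtered))
```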

The same merge is repeated for GDP per capita and suicide rates.

In [28]:
# Merge suicide rates and gdp on country and year columns, keeping only rows that appear in both dfs (inner join)
sui_vs_gdp = pd.merge(sui_df, gdp_df, on=['country','year'], how='inner')
sui_vs_gdp

Unnamed: 0,country,year,suicide rate,gdp per capita
0,AFG,2003,7.7200,190.683814
1,AFG,2009,6.8400,437.268740
2,AFG,2006,7.5600,263.733602
3,AFG,2012,6.2400,638.845852
4,AFG,2016,6.0100,512.012778
...,...,...,...,...
3593,ZWE,2011,34.3032,1093.653409
3594,ZWE,2005,21.9666,476.555403
3595,ZWE,2000,19.9764,563.057504
3596,ZWE,2007,27.2252,431.787259


In [29]:
# See how many years of data for each country - 182 countries with at least 7 years of data
sui_vs_gdp['country'].value_counts()

KWT    20
KIR    20
MOZ    20
MRT    20
MUS    20
       ..
AFG    18
VEN    15
ERI    12
SSD     8
SOM     7
Name: country, Length: 182, dtype: int64

Note that there are 182 countries in total, 180 of which have at least 12 years of data.

## Save Cleaned Data as CSV Files
The two cleaned dataframes are saved as CSV files so they can be used in main_analysis.ipynb.

In [30]:
# Write both merged dataframes to the Clean_Data folder, leaving out the index column
sui_vs_uem.to_csv('Clean_Data/suicide_vs_unemployment_clean', index=False)
sui_vs_gdp.to_csv('Clean_Data/suicide_vs_gdp_clean', index=False)
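
A quick way to confirm that a saved file reads back as expected is a round trip through `pd.read_csv`. The sketch below does this with a toy frame in a temporary directory, so nothing in `Clean_Data` is touched. The frame and the file name `toy_clean` are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Toy frame; values are made up for illustration.
toy = pd.DataFrame(
    {
        "country": ["AAA", "BBB"],
        "year": [2000, 2001],
        "suicide rate": [5.0, 12.5],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name with no extension, like the names used above.
    path = os.path.join(tmp_dir, "toy_clean")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded.shape)
```

Because the file is written with `index=False`, reading it back yields only the original columns, with no extra `Unnamed: 0` index column.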