<a href="https://colab.research.google.com/github/fayshaw/data_preprocessing/blob/main/livwell.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing
## LivWell Dataset: Women and their Well-being for 52 Countries
### <a href="https://www.womenindata.org/">Women in Data</a> and <a href="https://www.meetup.com/pyladies-boston/">PyLadies Boston</a>
#### <a href="https://www.linkedin.com/in/fayshaw/">Fay Shaw</a>
August 21, 2025

Together, we will explore the LivWell dataset from Belmin et al.'s 2022 paper in *Scientific Data* (a Nature Portfolio journal), <a href="https://www.nature.com/articles/s41597-022-01824-2">LivWell: a sub-national Dataset on the Living Conditions of Women and their Well-being for 52 Countries</a>. The authors aggregated a longitudinal dataset from Demographic and Health Surveys (DHS) for subnational regions. Much of their work went into the geographic harmonization of regional boundaries.

We will wrangle some raw data to look more like their published data set. <br>


*Figure 1: Flowchart representing the data processing steps to obtain LivWell. Orange: input data; green: indicators based on DHS data; blue: indicators based on gridded data; white: validation data.*

<img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41597-022-01824-2/MediaObjects/41597_2022_1824_Fig1_HTML.png" width="600">



In this notebook, we will look at some of the DHS STATcompiler data (which the authors used for validation) and compare it with their published output.

# Overview

1. Open LivWell data set.
2. Look at DHS STAT Compiler raw data.
3. Try to get the raw data into a comparable form.

## Read files
Read the published file directly from its URL.

In [1]:
import pandas as pd
livwell_df = pd.read_csv('https://zenodo.org/records/7277104/files/livwell.csv')

### Explore the data

<img src="https://scentla.com/wp-content/uploads/2025/02/Efficiently-Create-and-Fill-Pandas-DataFrames-in-Python-1024x399.jpg" width=600>

Figure from https://datagy.io/pandas-drop-index-column

Resources
* <a href="https://realpython.com/pandas-python-explore-dataset/">Real Python dataframe resource</a>
* <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">PyData Pandas cheat sheet</a>

DataFrame `df`
* Show `df`
* `df.head()`
* `df.describe()`
* `df.columns`
* `df['column'].unique()` (`unique()` is a method of a single column, not of the whole DataFrame)
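To see what each of these calls returns, here is a minimal sketch on a small made-up DataFrame. The name `toy_df` and its values are invented for illustration and are not part of LivWell.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2005, 2000],
    "value": [1.5, 2.5, 3.5],
})

first_rows = toy_df.head(2)             # first two rows
summary = toy_df.describe()             # summary statistics for the numeric columns
column_names = list(toy_df.columns)     # column labels
countries = toy_df["country"].unique()  # distinct values in one column

print(first_rows)
print(summary)
print(column_names)
print(countries)
```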

In [2]:
livwell_df

Unnamed: 0,country_name,country_code,year,region_num_harmonized,region_name_harmonized,SurveyId,interview_year_mean,interview_month_mean,CMC_interview_mean,DM_age_mean,...,drought_spei03_n1_share36,drought_spei03_n1_share60,drought_spei03_n1.5_share12,drought_spei03_n1.5_share36,drought_spei03_n1.5_share60,drought_spei03_n2_share12,drought_spei03_n2_share36,drought_spei03_n2_share60,hdi,gdp_pc
0,Armenia,ARM,2000,1,Aragatsotn,AM2000DHS,2000.0,11.0,1210.53,30.71,...,0.388889,0.316667,0.333333,0.250000,0.166667,0.083333,0.083333,0.050000,0.644083,2938.187500
1,Armenia,ARM,2000,2,Ararat,AM2000DHS,2000.0,11.0,1210.55,30.38,...,0.416667,0.316667,0.333333,0.277778,0.233333,0.083333,0.083333,0.050000,0.644127,3053.040283
2,Armenia,ARM,2000,3,Armavir,AM2000DHS,2000.0,10.0,1210.43,31.10,...,0.361111,0.300000,0.333333,0.250000,0.166667,0.083333,0.083333,0.050000,0.644415,3003.245605
3,Armenia,ARM,2000,4,Gegharkunik,AM2000DHS,2000.0,11.0,1210.58,30.65,...,0.416667,0.316667,0.250000,0.194444,0.166667,0.083333,0.083333,0.050000,0.643942,2945.085449
4,Armenia,ARM,2000,5,Lori,AM2000DHS,2000.0,10.0,1210.43,31.57,...,0.388889,0.316667,0.333333,0.222222,0.150000,0.083333,0.083333,0.050000,0.645256,2925.469727
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1827,Zimbabwe,ZWE,2015,6,Matabeleland South,ZW2015DHS,2015.0,9.0,1389.00,27.65,...,0.388889,0.333333,0.333333,0.250000,0.216667,0.250000,0.083333,0.066667,0.516884,1864.769000
1828,Zimbabwe,ZWE,2015,7,Midlands,ZW2015DHS,2015.0,9.0,1388.60,27.89,...,0.388889,0.316667,0.250000,0.138889,0.150000,0.250000,0.083333,0.050000,0.516000,1687.976000
1829,Zimbabwe,ZWE,2015,8,Masvingo,ZW2015DHS,2015.0,9.0,1388.91,28.69,...,0.250000,0.216667,0.166667,0.055556,0.066667,0.083333,0.027778,0.016667,0.515188,1687.113000
1830,Zimbabwe,ZWE,2015,9,Harare/Chitungwiza,ZW2015DHS,2015.0,9.0,1388.71,28.67,...,0.416667,0.333333,0.416667,0.194444,0.116667,0.250000,0.083333,0.050000,0.516000,1687.976000


In [3]:
livwell_df.columns[:50]

Index(['country_name', 'country_code', 'year', 'region_num_harmonized',
       'region_name_harmonized', 'SurveyId', 'interview_year_mean',
       'interview_month_mean', 'CMC_interview_mean', 'DM_age_mean',
       'DM_age_mean_se', 'DM_age_15.19_p', 'DM_age_15.19_p_se',
       'DM_age_20.24_p', 'DM_age_20.24_p_se', 'DM_age_25.29_p',
       'DM_age_25.29_p_se', 'DM_age_30.34_p', 'DM_age_30.34_p_se',
       'DM_age_35.39_p', 'DM_age_35.39_p_se', 'DM_age_40.44_p',
       'DM_age_40.44_p_se', 'DM_age_45.49_p', 'DM_age_45.49_p_se',
       'DM_urban_p', 'DM_urban_p_se', 'DM_born_rural_p', 'DM_born_rural_p_se',
       'DM_nvr_marr_p', 'DM_nvr_marr_p_se', 'DM_marr_p', 'DM_marr_p_se',
       'DM_age_marr_mean', 'DM_age_marr_mean_se', 'DM_age_diff_mean',
       'DM_age_diff_mean_se', 'DM_age_diff_10plus_p',
       'DM_age_diff_10plus_p_se', 'DM_age_diff_5_9_p', 'DM_age_diff_5_9_p_se',
       'DM_age_diff_5minus_p', 'DM_age_diff_5minus_p_se', 'DM_age_diff_0_p',
       'DM_age_diff_0_p_se', 'HH_w

In [4]:
indicators_df = pd.read_csv("https://zenodo.org/records/7277104/files/indicators.csv")
indicators_df.head(20)

Unnamed: 0,indicator_category,indicator_code,indicator_description
0,Individual demographic information,DM_age_mean,Average age of women
1,Individual demographic information,DM_age_15-19_p,Women in age category 15-19 (%)
2,Individual demographic information,DM_age_20-24_p,Women in age category 20-24 (%)
3,Individual demographic information,DM_age_25-29_p,Women in age category 25-29 (%)
4,Individual demographic information,DM_age_30-34_p,Women in age category 30-34 (%)
5,Individual demographic information,DM_age_35-39_p,Women in age category 35-39 (%)
6,Individual demographic information,DM_age_40-44_p,Women in age category 40-44 (%)
7,Individual demographic information,DM_age_45-49_p,Women in age category 45-49 (%)
8,Individual demographic information,DM_urban_p,Women living in urban areas (%)
9,Individual demographic information,DM_born_rural_p,Women being born at the country side (%)


Look at which countries are in this data set by selecting the column with `dataframe['column name']` and calling `.unique()` on it.

In [5]:
livwell_df['country_name'].unique()

array(['Armenia', 'Burundi', 'Benin', 'Burkina Faso', 'Bangladesh',
       'Bolivia', "Cote d'Ivoire", 'Cameroon',
       'Congo Democratic Republic', 'Colombia', 'Egypt', 'Ethiopia',
       'Gabon', 'Ghana', 'Guinea', 'Guatemala', 'Honduras', 'Haiti',
       'Indonesia', 'India', 'Jordan', 'Kenya', 'Cambodia', 'Liberia',
       'Lesotho', 'Morocco', 'Madagascar', 'Maldives', 'Mali',
       'Mozambique', 'Malawi', 'Namibia', 'Niger', 'Nigeria', 'Nicaragua',
       'Nepal', 'Pakistan', 'Peru', 'Philippines', 'Rwanda', 'Senegal',
       'Sierra Leone', 'Togo', 'Tajikistan', 'Timor-Leste', 'Turkey',
       'Tanzania', 'Uganda', 'Vietnam', 'South Africa', 'Zambia',
       'Zimbabwe'], dtype=object)

In [6]:
len(set(livwell_df['country_name']))

52
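An alternative to `len(set(...))` is `Series.nunique()`, which counts distinct values directly. A small sketch on invented values:

```python
import pandas as pd

# Toy values, invented for illustration
names = pd.Series(["Armenia", "Benin", "Armenia", "Togo"])

n_distinct = names.nunique()   # number of distinct non-null values
print(n_distinct, len(set(names)))
```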

## Filter to get data for one country
Armenia

In [7]:
# Boolean mask
livwell_df['country_name'] == 'Armenia'

Unnamed: 0,country_name
0,True
1,True
2,True
3,True
4,True
...,...
1827,False
1828,False
1829,False
1830,False


In [8]:
livwell_armenia = livwell_df[livwell_df['country_name'] == 'Armenia']

print("years: " , set(livwell_armenia['year']))
print("regions: ", set(livwell_armenia['region_name_harmonized']))
livwell_armenia.head(12)

years:  {2000, 2010, 2016, 2005}
regions:  {'Tavush', 'Armavir', 'Syunik', 'Yerevan', 'Gegharkunik', 'Aragatsotn', 'Lori', 'Shirak', 'Ararat', 'Kotayk', 'Vayots Dzor'}


Unnamed: 0,country_name,country_code,year,region_num_harmonized,region_name_harmonized,SurveyId,interview_year_mean,interview_month_mean,CMC_interview_mean,DM_age_mean,...,drought_spei03_n1_share36,drought_spei03_n1_share60,drought_spei03_n1.5_share12,drought_spei03_n1.5_share36,drought_spei03_n1.5_share60,drought_spei03_n2_share12,drought_spei03_n2_share36,drought_spei03_n2_share60,hdi,gdp_pc
0,Armenia,ARM,2000,1,Aragatsotn,AM2000DHS,2000.0,11.0,1210.53,30.71,...,0.388889,0.316667,0.333333,0.25,0.166667,0.083333,0.083333,0.05,0.644083,2938.1875
1,Armenia,ARM,2000,2,Ararat,AM2000DHS,2000.0,11.0,1210.55,30.38,...,0.416667,0.316667,0.333333,0.277778,0.233333,0.083333,0.083333,0.05,0.644127,3053.040283
2,Armenia,ARM,2000,3,Armavir,AM2000DHS,2000.0,10.0,1210.43,31.1,...,0.361111,0.3,0.333333,0.25,0.166667,0.083333,0.083333,0.05,0.644415,3003.245605
3,Armenia,ARM,2000,4,Gegharkunik,AM2000DHS,2000.0,11.0,1210.58,30.65,...,0.416667,0.316667,0.25,0.194444,0.166667,0.083333,0.083333,0.05,0.643942,2945.085449
4,Armenia,ARM,2000,5,Lori,AM2000DHS,2000.0,10.0,1210.43,31.57,...,0.388889,0.316667,0.333333,0.222222,0.15,0.083333,0.083333,0.05,0.645256,2925.469727
5,Armenia,ARM,2000,6,Kotayk,AM2000DHS,2000.0,10.0,1210.48,31.15,...,0.416667,0.316667,0.333333,0.277778,0.183333,0.083333,0.083333,0.05,0.644,2918.557617
6,Armenia,ARM,2000,7,Shirak,AM2000DHS,2000.0,10.0,1210.43,31.8,...,0.388889,0.35,0.333333,0.194444,0.133333,0.083333,0.083333,0.05,0.645674,3053.684326
7,Armenia,ARM,2000,8,Syunik,AM2000DHS,2000.0,10.0,1210.42,31.37,...,0.416667,0.316667,0.166667,0.222222,0.183333,0.083333,0.055556,0.066667,0.644479,3086.177002
8,Armenia,ARM,2000,9,Vayots Dzor,AM2000DHS,2000.0,10.0,1210.41,31.69,...,0.388889,0.3,0.333333,0.277778,0.216667,0.083333,0.194444,0.133333,0.643944,2969.26123
9,Armenia,ARM,2000,10,Tavush,AM2000DHS,2000.0,10.0,1210.46,31.28,...,0.416667,0.3,0.333333,0.194444,0.133333,0.083333,0.083333,0.05,0.644377,3000.003906
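The mask-then-index pattern used for Armenia also works with several conditions at once. The sketch below uses a made-up three-row DataFrame (`toy_df` is an assumed name) and combines two masks with `&`.

```python
import pandas as pd

# Toy data, invented for illustration
toy_df = pd.DataFrame({
    "country_name": ["Armenia", "Armenia", "Zimbabwe"],
    "year": [2000, 2005, 2015],
})

mask = toy_df["country_name"] == "Armenia"   # Series of True/False, one per row
armenia_rows = toy_df[mask]                  # keeps only the rows where mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses
armenia_2005 = toy_df[mask & (toy_df["year"] == 2005)]
print(armenia_2005)
```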


### More to explore with dataframes
pandas DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
* `df.shape`
* `df.dtypes`
* `df['column'].value_counts()`
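A short sketch of these three attributes and methods, again on invented data:

```python
import pandas as pd

# Toy data, invented for illustration
toy_df = pd.DataFrame({
    "country_name": ["Armenia", "Armenia", "Togo"],
    "year": [2000, 2005, 1998],
})

n_rows, n_cols = toy_df.shape                    # (number of rows, number of columns)
year_dtype = toy_df.dtypes["year"]               # dtypes is indexed by column name
counts = toy_df["country_name"].value_counts()   # number of rows per distinct value

print(n_rows, n_cols)
print(year_dtype)
print(counts)
```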

## Read raw education file

Manually upload the file `STATcompilerExport_education.csv`

In [9]:
from google.colab import files
uploaded = files.upload()

Saving STATcompilerExport_education.csv to STATcompilerExport_education.csv


Notice the `Unnamed` column titles in the header, the all-NaN rows at the top, and the indicator descriptions at the bottom.

In [10]:
education = pd.read_csv("STATcompilerExport_education.csv")
education

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,,,,,,,,,,,,
1,,,,,,,,,,,,
2,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate
3,Armenia,2015-16 DHS,Total 15-49,0.1,0,36.2,53.5,0.3,99.6,11.3,,
4,Armenia,2015-16 DHS,Region : Aragatsotn,0,0,57.4,34.9,0,100,9.9,,
...,...,...,...,...,...,...,...,...,...,...,...,...
785,Women with secondary or higher education,Percentage of women with secondary or higher e...,,,,,,,,,,
786,Median years of education completed [Women],Median number of years of education completed ...,,,,,,,,,,
787,Women who can read a whole sentence,Percentage of women who can read a whole sentence,,,,,,,,,,
788,Women who are literate,Percentage of women who are literate,,,,,,,,,,


How many values in the first row are null? If every value is null, the row can safely be skipped.

In [11]:
education.iloc[0].isnull().sum()

np.int64(12)
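The cell above counts the null values in the first row only. To count the rows in which every value is null, one option is to combine `isnull()` with `all(axis=1)`. This is a sketch on invented data, not the STATcompiler file itself.

```python
import numpy as np
import pandas as pd

# Toy data mimicking empty rows around a header row
toy = pd.DataFrame({
    "a": [np.nan, "Country", "Armenia", np.nan],
    "b": [np.nan, "Survey", "2000 DHS", np.nan],
})

all_null_rows = toy.isnull().all(axis=1)       # True where every column is NaN
n_all_null = int(all_null_rows.sum())          # number of completely empty rows
first_data_row = (~all_null_rows).idxmax()     # index label of the first non-empty row

print(n_all_null, first_data_row)
```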

In [12]:
education.tail(12)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
778,Togo,1998 DHS,Region : Savanes,82.7,0.2,,0.0,13.4,3.9,,,
779,,,,,,,,,,,,
780,Women with no education,Percentage of women with no education,,,,,,,,,,
781,Women with completed primary education,Percentage of women with completed primary edu...,,,,,,,,,,
782,Women with completed secondary education,Percentage of women with completed secondary e...,,,,,,,,,,
783,Women with more than secondary education,Percentage of women with more than secondary e...,,,,,,,,,,
784,Women with primary education,Percentage of women with primary education,,,,,,,,,,
785,Women with secondary or higher education,Percentage of women with secondary or higher e...,,,,,,,,,,
786,Median years of education completed [Women],Median number of years of education completed ...,,,,,,,,,,
787,Women who can read a whole sentence,Percentage of women who can read a whole sentence,,,,,,,,,,


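Read the file again, skipping the NaN rows above the header with `skiprows` and the trailing note rows with `skipfooter`. The `skipfooter` option is only supported by the Python parsing engine, hence `engine='python'`.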

In [13]:
stat_education = pd.read_csv('STATcompilerExport_education.csv', skiprows=3, skipfooter=11, engine='python')
stat_education

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate
0,Armenia,2015-16 DHS,Total 15-49,0.1,0.0,36.2,53.5,0.3,99.6,11.3,,
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,
...,...,...,...,...,...,...,...,...,...,...,...,...
771,Togo,1998 DHS,Region : ..Lomé,25.3,5.0,,2.8,38.7,36.0,4.3,,
772,Togo,1998 DHS,Region : Plateaux,43.4,3.0,,0.0,43.9,12.7,1.3,,
773,Togo,1998 DHS,Region : Centrale,52.2,1.0,,0.0,35.2,12.7,,,
774,Togo,1998 DHS,Region : Kara,52.0,1.4,,0.0,32.2,15.8,,,
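To see how `skiprows` and `skipfooter` interact, here is a self-contained sketch that reads a tiny invented CSV from a string. The file contents are an assumption that only imitates the layout of the export: two empty lines, a header, two data rows, and two footer lines.

```python
import io
import pandas as pd

# Invented file contents: 2 empty lines, a header, 2 data rows, 2 footer lines
raw_text = (
    ",,\n"
    ",,\n"
    "Country,Survey,Value\n"
    "Armenia,2000 DHS,1.0\n"
    "Togo,1998 DHS,2.0\n"
    ",,\n"
    "Value,Description of the value,\n"
)

toy = pd.read_csv(
    io.StringIO(raw_text),
    skiprows=2,        # lines to skip before the header line
    skipfooter=2,      # lines to drop from the end of the file
    engine="python",   # skipfooter is not supported by the default C engine
)
print(toy)
```

Checking `toy.tail()` after such a read is a quick way to confirm that no data row was dropped along with the footer.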


In [14]:
set(stat_education['Country'])

{'Armenia',
 'Congo Democratic Republic',
 'Ethiopia',
 'Lesotho',
 'Malawi',
 'Maldives',
 'Mozambique',
 'Namibia',
 'Nepal',
 'Nicaragua',
 'Nigeria',
 'Rwanda',
 'Tajikistan',
 'Timor-Leste',
 'Togo'}

In [15]:
len(set(stat_education['Country']))

15

### One country example

* Get data for just Armenia
* Make a deep `.copy()` so you are not operating on a view and avoid the <a href="https://realpython.com/pandas-settingwithcopywarning/">`SettingWithCopyWarning`</a>
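A minimal sketch of the filter-then-copy pattern on invented data. The copy is an independent DataFrame, so adding a column to it leaves the original untouched.

```python
import pandas as pd

# Toy data, invented for illustration
toy_df = pd.DataFrame({
    "Country": ["Armenia", "Armenia", "Togo"],
    "value": [1.0, 2.0, 3.0],
})

# Filter, then copy so the result is an independent DataFrame
armenia = toy_df[toy_df["Country"] == "Armenia"].copy()
armenia["value_doubled"] = armenia["value"] * 2   # adds a column to the copy only

print(armenia)
print(toy_df.columns.tolist())   # the original has no 'value_doubled' column
```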

In [16]:
stat_education_armenia = stat_education[stat_education['Country'] == 'Armenia'].copy()
stat_education_armenia.head(15)

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate
0,Armenia,2015-16 DHS,Total 15-49,0.1,0.0,36.2,53.5,0.3,99.6,11.3,,
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,
5,Armenia,2015-16 DHS,Region : Lori,0.0,0.0,43.3,49.3,0.0,100.0,11.1,,
6,Armenia,2015-16 DHS,Region : Kotayk,0.0,0.0,36.5,53.2,0.4,99.6,11.0,,
7,Armenia,2015-16 DHS,Region : Shirak,0.2,0.0,39.6,51.3,0.2,99.6,11.0,,
8,Armenia,2015-16 DHS,Region : Syunik,0.2,0.0,36.6,50.0,0.0,99.8,10.9,,
9,Armenia,2015-16 DHS,Region : Vayots Dzor,0.0,0.0,40.2,49.3,0.0,100.0,10.3,,


In [17]:
stat_education_armenia.columns

Index(['Country', 'Survey', 'Characteristic', 'Women with no education',
       'Women with completed primary education',
       'Women with completed secondary education',
       'Women with more than secondary education',
       'Women with primary education',
       'Women with secondary or higher education',
       'Median years of education completed [Women]',
       'Women who can read a whole sentence', 'Women who are literate'],
      dtype='object')

Rename one column

### Get the survey years

In [18]:
set(stat_education_armenia['Survey'])

{'2000 DHS', '2005 DHS', '2010 DHS', '2015-16 DHS'}

Split the survey text on whitespace to separate the year from the source. This makes two new columns, `year` and `source`, which appear at the right side.



In [19]:
stat_education_armenia[['year','source']] = stat_education_armenia.loc[:, 'Survey'].str.split(expand=True)
stat_education_armenia.head(15)

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate,year,source
0,Armenia,2015-16 DHS,Total 15-49,0.1,0.0,36.2,53.5,0.3,99.6,11.3,,,2015-16,DHS
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2015-16,DHS
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2015-16,DHS
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2015-16,DHS
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2015-16,DHS
5,Armenia,2015-16 DHS,Region : Lori,0.0,0.0,43.3,49.3,0.0,100.0,11.1,,,2015-16,DHS
6,Armenia,2015-16 DHS,Region : Kotayk,0.0,0.0,36.5,53.2,0.4,99.6,11.0,,,2015-16,DHS
7,Armenia,2015-16 DHS,Region : Shirak,0.2,0.0,39.6,51.3,0.2,99.6,11.0,,,2015-16,DHS
8,Armenia,2015-16 DHS,Region : Syunik,0.2,0.0,36.6,50.0,0.0,99.8,10.9,,,2015-16,DHS
9,Armenia,2015-16 DHS,Region : Vayots Dzor,0.0,0.0,40.2,49.3,0.0,100.0,10.3,,,2015-16,DHS
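The split used above can be tried in isolation. This sketch applies `str.split(expand=True)` to three survey labels of the kind found in the `Survey` column.

```python
import pandas as pd

# Labels in the same form as the `Survey` column
surveys = pd.Series(["2015-16 DHS", "2010 DHS", "2000 DHS"])

# With no separator given, str.split splits on whitespace;
# expand=True returns a DataFrame with one column per piece
parts = surveys.str.split(expand=True)
parts.columns = ["year", "source"]
print(parts)
```

If some labels contained more than one space, passing `n=1` would limit the split to the first space so that exactly two columns come back.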


### Rename the year text 2015-16 to 2016.

In [20]:
stat_education_armenia['year'] = stat_education_armenia['year'].replace('2015-16', '2016')
stat_education_armenia.head()

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate,year,source
0,Armenia,2015-16 DHS,Total 15-49,0.1,0.0,36.2,53.5,0.3,99.6,11.3,,,2016,DHS
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2016,DHS
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2016,DHS
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2016,DHS
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2016,DHS
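The replacement above handles the single label `2015-16`. When working with more countries, a small helper could convert any two-year label to its end year. The function `survey_end_year` below is hypothetical, and it assumes that the end year is the right match for the LivWell `year` column, which the source only shows for Armenia.

```python
import pandas as pd

def survey_end_year(year_text: str) -> int:
    """Hypothetical helper: '2015-16' -> 2016, '2010' -> 2010."""
    start_text, _, end_suffix = year_text.partition("-")
    start = int(start_text)
    if not end_suffix:
        return start                      # single-year label
    end = (start // 100) * 100 + int(end_suffix)
    if end < start:                       # e.g. '1999-00' rolls into the next century
        end += 100
    return end

years = pd.Series(["2015-16", "2010", "2000"]).map(survey_end_year)
print(years.tolist())
```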


Similarly, split `Characteristic` on the separator `" : "` to get the region name. Rows without the separator, such as `Total 15-49`, get `NaN`.

In [21]:
stat_education_armenia['region'] = stat_education_armenia.loc[:, 'Characteristic'].str.split(" : ").str[1]
stat_education_armenia.head()

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate,year,source,region
0,Armenia,2015-16 DHS,Total 15-49,0.1,0.0,36.2,53.5,0.3,99.6,11.3,,,2016,DHS,
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2016,DHS,Aragatsotn
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2016,DHS,Ararat
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2016,DHS,Armavir
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2016,DHS,Gegharkunik
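Here is the same split on three `Characteristic` values taken from the tables in this notebook. The last step, stripping leading dots as in `..Lomé`, is an assumption that the dots are formatting rather than part of the region name.

```python
import pandas as pd

characteristic = pd.Series(["Total 15-49", "Region : Ararat", "Region : ..Lomé"])

pieces = characteristic.str.split(" : ")   # each element becomes a list of strings
region = pieces.str[1]                     # second list item; NaN when there is none
region_clean = region.str.lstrip(".")      # assumption: leading dots are formatting only

print(region_clean)
```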


In [22]:
# Drop rows that are not regions
stat_education_armenia = stat_education_armenia[~stat_education_armenia['Characteristic'].str.contains('Total')]
stat_education_armenia.head()

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate,year,source,region
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2016,DHS,Aragatsotn
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2016,DHS,Ararat
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2016,DHS,Armavir
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2016,DHS,Gegharkunik
5,Armenia,2015-16 DHS,Region : Lori,0.0,0.0,43.3,49.3,0.0,100.0,11.1,,,2016,DHS,Lori


In [23]:
stat_ed_armenia = stat_education_armenia.drop(columns=['Survey', 'Characteristic'])
stat_ed_armenia.head()

Unnamed: 0,Country,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate,year,source,region
1,Armenia,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2016,DHS,Aragatsotn
2,Armenia,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2016,DHS,Ararat
3,Armenia,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2016,DHS,Armavir
4,Armenia,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2016,DHS,Gegharkunik
5,Armenia,0.0,0.0,43.3,49.3,0.0,100.0,11.1,,,2016,DHS,Lori


In [24]:
# Rename education columns to the LivWell indicator codes.
# Caution: 'Women with more than secondary education' and
# 'Women with secondary or higher education' both map to
# 'ED_attainment_secondary_higher_p', so the renamed DataFrame
# has two columns with that name.
rename_ed_cols = {
    'Women with no education' : 'ED_attainment_no_educ_p',
    'Women with completed primary education' : 'ED_attainment_primary_completed_p',
    'Women with completed secondary education' : 'ED_attainment_secondary_completed_p',
    'Women with more than secondary education' : 'ED_attainment_secondary_higher_p',
    'Women with primary education' : 'ED_attainment_primary_p',
    'Women with secondary or higher education' : 'ED_attainment_secondary_higher_p',
    'Median years of education completed [Women]' : 'ED_educ_years_median'
}

In [25]:
stat_ed_armenia = stat_ed_armenia.rename(columns=rename_ed_cols)
stat_ed_armenia.head()

Unnamed: 0,Country,ED_attainment_no_educ_p,ED_attainment_primary_completed_p,ED_attainment_secondary_completed_p,ED_attainment_secondary_higher_p,ED_attainment_primary_p,ED_attainment_secondary_higher_p.1,ED_educ_years_median,Women who can read a whole sentence,Women who are literate,year,source,region
1,Armenia,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2016,DHS,Aragatsotn
2,Armenia,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2016,DHS,Ararat
3,Armenia,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2016,DHS,Armavir
4,Armenia,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2016,DHS,Gegharkunik
5,Armenia,0.0,0.0,43.3,49.3,0.0,100.0,11.1,,,2016,DHS,Lori
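The output above contains `ED_attainment_secondary_higher_p` twice, because two source columns were renamed to the same code. A quick check for repeated column names can catch this early. The sketch below reproduces the situation on a one-row toy frame whose original column names are invented.

```python
import pandas as pd

# Toy frame; the two source column names are invented for illustration
toy = pd.DataFrame({"more_than_secondary": [34.9], "secondary_or_higher": [100.0]})

# Two source columns mapped to the same target name
renamed = toy.rename(columns={
    "more_than_secondary": "ED_attainment_secondary_higher_p",
    "secondary_or_higher": "ED_attainment_secondary_higher_p",
})

dup_mask = renamed.columns.duplicated(keep=False)       # True for every repeated label
duplicate_names = renamed.columns[dup_mask].unique().tolist()
selected = renamed["ED_attainment_secondary_higher_p"]  # a DataFrame, not a Series

print(duplicate_names)
print(selected.shape)
```

Selecting a repeated name returns every column with that label, which is why such duplicates tend to surface later as unexpected extra columns.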


### Choose and reorder columns

In [27]:
stat_ed_armenia_df = stat_ed_armenia[['Country', 'source', 'year', 'region', 'ED_educ_years_median', 'ED_attainment_secondary_completed_p']]
stat_ed_armenia_df

Unnamed: 0,Country,source,year,region,ED_educ_years_median,ED_attainment_secondary_completed_p
1,Armenia,DHS,2016,Aragatsotn,9.9,57.4
2,Armenia,DHS,2016,Ararat,9.9,51.8
3,Armenia,DHS,2016,Armavir,9.8,41.2
4,Armenia,DHS,2016,Gegharkunik,9.8,56.7
5,Armenia,DHS,2016,Lori,11.1,43.3
6,Armenia,DHS,2016,Kotayk,11.0,36.5
7,Armenia,DHS,2016,Shirak,11.0,39.6
8,Armenia,DHS,2016,Syunik,10.9,36.6
9,Armenia,DHS,2016,Vayots Dzor,10.3,40.2
10,Armenia,DHS,2016,Tavush,11.1,32.3


### LivWell Armenia education columns

In [28]:
lw_ed_cols = livwell_armenia.columns[livwell_armenia.columns.str.contains('ED')].to_list()
lw_ed_cols

['ED_educ_years_mean',
 'ED_educ_years_mean_se',
 'ED_attainment_no_educ_p',
 'ED_attainment_no_educ_p_se',
 'ED_attainment_primary_p',
 'ED_attainment_primary_p_se',
 'ED_attainment_primary_completed_p',
 'ED_attainment_primary_completed_p_se',
 'ED_attainment_secondary_p',
 'ED_attainment_secondary_p_se',
 'ED_attainment_secondary_completed_p',
 'ED_attainment_secondary_completed_p_se',
 'ED_attainment_secondary_higher_p',
 'ED_attainment_secondary_higher_p_se',
 'ED_litt_p',
 'ED_litt_p_se',
 'ED_litt_whole_p',
 'ED_litt_whole_p_se']
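Note that `str.contains('ED')` matches `ED` anywhere in a column name. If only names that begin with the `ED_` prefix are wanted, `str.startswith` is stricter. In this sketch the column `HH_ED_example` is invented to show the difference.

```python
import pandas as pd

# Empty frame with a few column names; 'HH_ED_example' is invented
toy = pd.DataFrame(columns=["year", "ED_litt_p", "HH_ED_example", "ED_educ_years_mean"])

contains_ed = toy.columns[toy.columns.str.contains("ED")].to_list()
starts_with_ed = toy.columns[toy.columns.str.startswith("ED_")].to_list()

print(contains_ed)
print(starts_with_ed)
```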

In [29]:
# Get year and country data
livwell_df.columns[:8].to_list()

['country_name',
 'country_code',
 'year',
 'region_num_harmonized',
 'region_name_harmonized',
 'SurveyId',
 'interview_year_mean',
 'interview_month_mean']

In [30]:
# Columns of interest in the LivWell data set
lw_year_ed_cols = livwell_df.columns[:8].to_list() + lw_ed_cols
lw_year_ed_cols

['country_name',
 'country_code',
 'year',
 'region_num_harmonized',
 'region_name_harmonized',
 'SurveyId',
 'interview_year_mean',
 'interview_month_mean',
 'ED_educ_years_mean',
 'ED_educ_years_mean_se',
 'ED_attainment_no_educ_p',
 'ED_attainment_no_educ_p_se',
 'ED_attainment_primary_p',
 'ED_attainment_primary_p_se',
 'ED_attainment_primary_completed_p',
 'ED_attainment_primary_completed_p_se',
 'ED_attainment_secondary_p',
 'ED_attainment_secondary_p_se',
 'ED_attainment_secondary_completed_p',
 'ED_attainment_secondary_completed_p_se',
 'ED_attainment_secondary_higher_p',
 'ED_attainment_secondary_higher_p_se',
 'ED_litt_p',
 'ED_litt_p_se',
 'ED_litt_whole_p',
 'ED_litt_whole_p_se']

In [31]:
lw_ed_arm = livwell_armenia[lw_year_ed_cols]
lw_ed_arm.head()

Unnamed: 0,country_name,country_code,year,region_num_harmonized,region_name_harmonized,SurveyId,interview_year_mean,interview_month_mean,ED_educ_years_mean,ED_educ_years_mean_se,...,ED_attainment_secondary_p,ED_attainment_secondary_p_se,ED_attainment_secondary_completed_p,ED_attainment_secondary_completed_p_se,ED_attainment_secondary_higher_p,ED_attainment_secondary_higher_p_se,ED_litt_p,ED_litt_p_se,ED_litt_whole_p,ED_litt_whole_p_se
0,Armenia,ARM,2000,1,Aragatsotn,AM2000DHS,2000.0,11.0,10.85,0.22,...,87.81,2.44,73.14,3.08,98.76,0.89,100.0,0.0,,
1,Armenia,ARM,2000,2,Ararat,AM2000DHS,2000.0,11.0,10.97,0.1,...,90.6,1.36,77.13,1.55,99.65,0.25,100.0,0.0,,
2,Armenia,ARM,2000,3,Armavir,AM2000DHS,2000.0,10.0,10.63,0.21,...,87.68,1.22,66.87,3.45,98.79,0.6,100.0,0.0,,
3,Armenia,ARM,2000,4,Gegharkunik,AM2000DHS,2000.0,11.0,10.51,0.13,...,93.25,1.43,72.39,2.27,99.59,0.28,100.0,0.0,,
4,Armenia,ARM,2000,5,Lori,AM2000DHS,2000.0,10.0,11.17,0.14,...,86.55,2.12,74.82,2.32,99.76,0.24,100.0,0.0,,


In [32]:
stat_ed_armenia.head()

Unnamed: 0,Country,ED_attainment_no_educ_p,ED_attainment_primary_completed_p,ED_attainment_secondary_completed_p,ED_attainment_secondary_higher_p,ED_attainment_primary_p,ED_attainment_secondary_higher_p.1,ED_educ_years_median,Women who can read a whole sentence,Women who are literate,year,source,region
1,Armenia,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2016,DHS,Aragatsotn
2,Armenia,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2016,DHS,Ararat
3,Armenia,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2016,DHS,Armavir
4,Armenia,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2016,DHS,Gegharkunik
5,Armenia,0.0,0.0,43.3,49.3,0.0,100.0,11.1,,,2016,DHS,Lori


In [33]:
stat_ed_armenia.columns

Index(['Country', 'ED_attainment_no_educ_p',
       'ED_attainment_primary_completed_p',
       'ED_attainment_secondary_completed_p',
       'ED_attainment_secondary_higher_p', 'ED_attainment_primary_p',
       'ED_attainment_secondary_higher_p', 'ED_educ_years_median',
       'Women who can read a whole sentence', 'Women who are literate', 'year',
       'source', 'region'],
      dtype='object')

In [36]:
# 'Median years of education completed [Women]' was already renamed earlier,
# so this rename leaves the column names unchanged
rename_cols = {'Median years of education completed [Women]' : 'ED_educ_years_median'}

stat_ed_armenia_renamed = stat_ed_armenia.rename(columns=rename_cols)

# Column order for the comparison.
# 'ED_attainment_primary_p' is listed twice, so it appears twice in the result.
cols_reorder = ['Country', 'year', 'source', 'region', 'ED_attainment_no_educ_p',
                'ED_attainment_primary_p', 'ED_attainment_primary_completed_p',
                'ED_attainment_secondary_higher_p', 'ED_attainment_primary_p',
                'ED_attainment_secondary_completed_p', 'ED_educ_years_median',
                'Women who can read a whole sentence','Women who are literate']
stat_ed_armenia_renamed = stat_ed_armenia_renamed[cols_reorder]
stat_ed_armenia_renamed.head(12)

Unnamed: 0,Country,year,source,region,ED_attainment_no_educ_p,ED_attainment_primary_p,ED_attainment_primary_completed_p,ED_attainment_secondary_higher_p,ED_attainment_secondary_higher_p.1,ED_attainment_primary_p.1,ED_attainment_secondary_completed_p,ED_educ_years_median,Women who can read a whole sentence,Women who are literate
1,Armenia,2016,DHS,Aragatsotn,0.0,0.0,0.0,34.9,100.0,0.0,57.4,9.9,,
2,Armenia,2016,DHS,Ararat,0.0,0.6,0.2,34.7,99.4,0.6,51.8,9.9,,
3,Armenia,2016,DHS,Armavir,0.5,1.4,0.2,37.4,98.2,1.4,41.2,9.8,,
4,Armenia,2016,DHS,Gegharkunik,0.0,0.0,0.0,30.6,100.0,0.0,56.7,9.8,,
5,Armenia,2016,DHS,Lori,0.0,0.0,0.0,49.3,100.0,0.0,43.3,11.1,,
6,Armenia,2016,DHS,Kotayk,0.0,0.4,0.0,53.2,99.6,0.4,36.5,11.0,,
7,Armenia,2016,DHS,Shirak,0.2,0.2,0.0,51.3,99.6,0.2,39.6,11.0,,
8,Armenia,2016,DHS,Syunik,0.2,0.0,0.0,50.0,99.8,0.0,36.6,10.9,,
9,Armenia,2016,DHS,Vayots Dzor,0.0,0.0,0.0,49.3,100.0,0.0,40.2,10.3,,
10,Armenia,2016,DHS,Tavush,0.3,0.0,0.0,53.5,99.7,0.0,32.3,11.1,,


## Get data in the same format to merge on year and region.

In [37]:
stat_ed_armenia_renamed.columns

Index(['Country', 'year', 'source', 'region', 'ED_attainment_no_educ_p',
       'ED_attainment_primary_p', 'ED_attainment_primary_completed_p',
       'ED_attainment_secondary_higher_p', 'ED_attainment_secondary_higher_p',
       'ED_attainment_primary_p', 'ED_attainment_secondary_completed_p',
       'ED_educ_years_median', 'Women who can read a whole sentence',
       'Women who are literate'],
      dtype='object')

In [38]:
lw_ed_arm.columns

Index(['country_name', 'country_code', 'year', 'region_num_harmonized',
       'region_name_harmonized', 'SurveyId', 'interview_year_mean',
       'interview_month_mean', 'ED_educ_years_mean', 'ED_educ_years_mean_se',
       'ED_attainment_no_educ_p', 'ED_attainment_no_educ_p_se',
       'ED_attainment_primary_p', 'ED_attainment_primary_p_se',
       'ED_attainment_primary_completed_p',
       'ED_attainment_primary_completed_p_se', 'ED_attainment_secondary_p',
       'ED_attainment_secondary_p_se', 'ED_attainment_secondary_completed_p',
       'ED_attainment_secondary_completed_p_se',
       'ED_attainment_secondary_higher_p',
       'ED_attainment_secondary_higher_p_se', 'ED_litt_p', 'ED_litt_p_se',
       'ED_litt_whole_p', 'ED_litt_whole_p_se'],
      dtype='object')

In [39]:
stat_ed_armenia_renamed['year'].dtypes

dtype('O')

In [40]:
stat_ed_armenia_renamed['year'] = stat_ed_armenia['year'].astype(int)
stat_ed_armenia_renamed['year'].dtype

dtype('int64')
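The merge keys need the same dtype on both sides, which is why `year` is cast to an integer here. The sketch below shows the whole pattern on two toy frames with invented values: cast the key, then merge on differently named region columns. The `how="left"` and `indicator=True` arguments are additions for illustration; they make unmatched rows visible instead of silently dropping them.

```python
import pandas as pd

# Toy frames with invented values
stat = pd.DataFrame({
    "region": ["Ararat", "Lori"],
    "year": ["2016", "2016"],              # strings, as produced by str.split
    "ED_attainment_no_educ_p": [0.0, 0.0],
})
lw = pd.DataFrame({
    "region_name_harmonized": ["Ararat", "Lori", "Tavush"],
    "year": [2016, 2016, 2016],            # integers
    "ED_attainment_no_educ_p": [0.0, 0.0, 0.26],
})

stat["year"] = stat["year"].astype(int)    # align the key dtype before merging

merged = stat.merge(
    lw,
    left_on=["region", "year"],
    right_on=["region_name_harmonized", "year"],
    how="left",                            # keep every row of `stat`
    suffixes=("_stat", "_lw"),             # disambiguate columns present in both
    indicator=True,                        # adds `_merge`: 'both' or 'left_only'
)
print(merged)
```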

In [41]:
lw_ed_arm['year'].dtypes
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is set on a slice such as livwell_armenia[lw_year_ed_cols]
lw_ed_arm = lw_ed_arm.assign(year=lw_ed_arm['year'].astype(int))


In [42]:
merged_df = stat_ed_armenia_renamed.merge(lw_ed_arm, left_on=['region', 'year'],
            right_on=['region_name_harmonized', 'year'],
            suffixes=('_stat', '_lw'))

merged_df.columns

Index(['Country', 'year', 'source', 'region', 'ED_attainment_no_educ_p_stat',
       'ED_attainment_primary_p_stat',
       'ED_attainment_primary_completed_p_stat',
       'ED_attainment_secondary_higher_p_stat',
       'ED_attainment_secondary_higher_p_stat', 'ED_attainment_primary_p_stat',
       'ED_attainment_secondary_completed_p_stat', 'ED_educ_years_median',
       'Women who can read a whole sentence', 'Women who are literate',
       'country_name', 'country_code', 'region_num_harmonized',
       'region_name_harmonized', 'SurveyId', 'interview_year_mean',
       'interview_month_mean', 'ED_educ_years_mean', 'ED_educ_years_mean_se',
       'ED_attainment_no_educ_p_lw', 'ED_attainment_no_educ_p_se',
       'ED_attainment_primary_p_lw', 'ED_attainment_primary_p_se',
       'ED_attainment_primary_completed_p_lw',
       'ED_attainment_primary_completed_p_se', 'ED_attainment_secondary_p',
       'ED_attainment_secondary_p_se',
       'ED_attainment_secondary_completed_p_lw',
    

And hopefully they match!

In [45]:
merged_df[['year', 'region', 'ED_educ_years_mean', 'ED_educ_years_median',
           'ED_attainment_secondary_completed_p_lw', 'ED_attainment_secondary_completed_p_stat',
           'ED_attainment_secondary_higher_p_lw', 'ED_attainment_secondary_higher_p_stat']]
#           'ED_attainment_secondary_higher_p_se', 'ED_attainment_secondary_higher_p_se_lw']]

Unnamed: 0,year,region,ED_educ_years_mean,ED_educ_years_median,ED_attainment_secondary_completed_p_lw,ED_attainment_secondary_completed_p_stat,ED_attainment_secondary_higher_p_lw,ED_attainment_secondary_higher_p_stat,ED_attainment_secondary_higher_p_stat.1
0,2016,Aragatsotn,11.14,9.9,57.37,57.4,100.0,34.9,100.0
1,2016,Ararat,11.16,9.9,51.76,51.8,99.44,34.7,99.4
2,2016,Armavir,10.84,9.8,41.23,41.2,98.2,37.4,98.2
3,2016,Gegharkunik,11.14,9.8,56.74,56.7,100.0,30.6,100.0
4,2016,Lori,11.68,11.1,43.25,43.3,100.0,49.3,100.0
5,2016,Kotayk,11.77,11.0,36.52,36.5,99.59,53.2,99.6
6,2016,Shirak,11.7,11.0,39.62,39.6,99.6,51.3,99.6
7,2016,Syunik,11.69,10.9,36.63,36.6,99.79,50.0,99.8
8,2016,Vayots Dzor,11.48,10.3,40.18,40.2,100.0,49.3,100.0
9,2016,Tavush,11.62,11.1,32.35,32.3,99.74,53.5,99.7
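
Rather than comparing the columns by eye, the match can be checked in code. This sketch uses three rows copied from the table above. The STAT compiler values appear to be rounded to one decimal, so it allows a small absolute difference. The tolerance of 0.06 is an assumption chosen for illustration, and `check_df` is a made-up name.

```python
import numpy as np
import pandas as pd

# Three rows copied from the comparison table above.
check_df = pd.DataFrame({
    "region": ["Aragatsotn", "Ararat", "Armavir"],
    "ED_attainment_secondary_completed_p_lw": [57.37, 51.76, 41.23],
    "ED_attainment_secondary_completed_p_stat": [57.4, 51.8, 41.2],
})

# Flag rows where the two sources agree within the assumed tolerance.
check_df["matches"] = np.isclose(
    check_df["ED_attainment_secondary_completed_p_lw"],
    check_df["ED_attainment_secondary_completed_p_stat"],
    rtol=0,
    atol=0.06,
)
print(check_df[["region", "matches"]])
```

The same pattern could be applied to `merged_df` for any pair of `_lw` and `_stat` columns.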


## Second data source: Global Data Lab Mean International Wealth Index
We will clean it and join it to the education data.

In [46]:
uploaded = files.upload()

Saving GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv to GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv


In [47]:
gdl = pd.read_csv("GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv")
gdl

Unnamed: 0,Country,ISO_Code,Level,GDLCODE,Region,1992,1993,1994,1995,1996,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Afghanistan,AFG,National,AFGt,Total,,,,,,...,,,,,51.0,,,,,
1,Afghanistan,AFG,Subnat,AFGr101,Central (Kabul Wardak Kapisa Logar Parwan Panj...,,,,,,...,,,,,58.0,,,,,
2,Afghanistan,AFG,Subnat,AFGr102,Central Highlands (Bamyan Daikundi),,,,,,...,,,,,41.8,,,,,
3,Afghanistan,AFG,Subnat,AFGr103,East (Nangarhar Kunar Laghman Nooristan),,,,,,...,,,,,41.3,,,,,
4,Afghanistan,AFG,Subnat,AFGr104,North (Samangan Sar-e-Pul Balkh Jawzjan Faryab),,,,,,...,,,,,56.5,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1578,Zimbabwe,ZWE,Subnat,ZWEr104,Mashonaland West,,,25.6,,,...,35.4,,,,42.2,,,,42.4,
1579,Zimbabwe,ZWE,Subnat,ZWEr108,Masvingo,,,19.4,,,...,28.4,,,,38.8,,,,40.3,
1580,Zimbabwe,ZWE,Subnat,ZWEr105,Matebeleland North,,,19.2,,,...,25.8,,,,33.7,,,,31.6,
1581,Zimbabwe,ZWE,Subnat,ZWEr106,Matebeleland South,,,19.3,,,...,30.1,,,,39.2,,,,42.6,


In [48]:
armenia_gdl = gdl[gdl['Country'] == "Armenia"]
armenia_gdl

Unnamed: 0,Country,ISO_Code,Level,GDLCODE,Region,1992,1993,1994,1995,1996,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
57,Armenia,ARM,National,ARMt,Total,,,,,,...,,,,,,86.2,,,,
58,Armenia,ARM,Subnat,ARMr101,Aragatsotn,,,,,,...,,,,,,83.0,,,,
59,Armenia,ARM,Subnat,ARMr102,Ararat,,,,,,...,,,,,,84.8,,,,
60,Armenia,ARM,Subnat,ARMr103,Armavir,,,,,,...,,,,,,81.1,,,,
61,Armenia,ARM,Subnat,ARMr104,Gegharkunik,,,,,,...,,,,,,79.9,,,,
62,Armenia,ARM,Subnat,ARMr106,Kotayk,,,,,,...,,,,,,89.0,,,,
63,Armenia,ARM,Subnat,ARMr105,Lori,,,,,,...,,,,,,81.8,,,,
64,Armenia,ARM,Subnat,ARMr107,Shirak,,,,,,...,,,,,,85.6,,,,
65,Armenia,ARM,Subnat,ARMr108,Syunik,,,,,,...,,,,,,85.9,,,,
66,Armenia,ARM,Subnat,ARMr110,Tavush,,,,,,...,,,,,,85.2,,,,


In [49]:
armenia_gdl.notna()

Unnamed: 0,Country,ISO_Code,Level,GDLCODE,Region,1992,1993,1994,1995,1996,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
57,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
58,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
59,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
60,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
61,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
62,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
63,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
64,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
65,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
66,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False


In [50]:
armenia_gdl[~armenia_gdl.notnull()]

Unnamed: 0,Country,ISO_Code,Level,GDLCODE,Region,1992,1993,1994,1995,1996,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
57,,,,,,,,,,,...,,,,,,,,,,
58,,,,,,,,,,,...,,,,,,,,,,
59,,,,,,,,,,,...,,,,,,,,,,
60,,,,,,,,,,,...,,,,,,,,,,
61,,,,,,,,,,,...,,,,,,,,,,
62,,,,,,,,,,,...,,,,,,,,,,
63,,,,,,,,,,,...,,,,,,,,,,
64,,,,,,,,,,,...,,,,,,,,,,
65,,,,,,,,,,,...,,,,,,,,,,
66,,,,,,,,,,,...,,,,,,,,,,
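
The two cells above show where values are present, but the True/False masks are hard to read at a glance. As a possible alternative, this sketch lists the year columns that contain at least one value. It runs on a toy frame (`toy_wide` is a made-up name with illustrative values), not on `armenia_gdl` itself.

```python
import numpy as np
import pandas as pd

# Toy wide table: most year columns are empty (illustrative values).
toy_wide = pd.DataFrame({
    "Region": ["A", "B"],
    "1999": [np.nan, np.nan],
    "2000": [55.4, 66.6],
    "2001": [np.nan, np.nan],
    "2016": [83.0, np.nan],
})

# Drop columns where every value is missing, then keep the year-like names.
non_empty = toy_wide.dropna(axis=1, how="all")
years_with_data = [c for c in non_empty.columns if c.isdigit()]
print(years_with_data)  # ['2000', '2016']
```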


In [51]:
armenia_gdl = armenia_gdl[['Country', 'ISO_Code', 'GDLCODE', 'Region', '2000', '2010', '2016']]

### DataFrame melt

One reason to melt a dataframe from wide to long is that it may be easier to plot.  We will melt our dataframe and then create a scatter plot.

<img src="https://pandas.pydata.org/pandas-docs/version/0.25.1/_images/reshaping_melt.png" width=800>

Figure from [pandas.pydata.org](https://pandas.pydata.org/pandas-docs/version/0.25.1/user_guide/reshaping.html#reshaping-by-melt)
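
Before melting the real data, here is a minimal sketch of the wide-to-long reshape on a toy table. The frame and its values are for illustration only, and the names `wide` and `long_df` are made up.

```python
import pandas as pd

# Toy wide table with one column per year (illustrative values).
wide = pd.DataFrame({
    "Region": ["A", "B"],
    "2000": [55.4, 66.6],
    "2016": [83.0, 84.8],
})

# id_vars stay as columns; every other column becomes a (Year, IWI) pair.
long_df = wide.melt(id_vars=["Region"], var_name="Year", value_name="IWI")
print(long_df.shape)  # 2 regions x 2 years = 4 rows, 3 columns
```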

In [52]:
armenia_gdl_melt = armenia_gdl.melt(id_vars=['Country', 'ISO_Code','GDLCODE', 'Region'], var_name='Year', value_name='IWI')
armenia_gdl_melt

Unnamed: 0,Country,ISO_Code,GDLCODE,Region,Year,IWI
0,Armenia,ARM,ARMt,Total,2000,71.4
1,Armenia,ARM,ARMr101,Aragatsotn,2000,55.4
2,Armenia,ARM,ARMr102,Ararat,2000,66.6
3,Armenia,ARM,ARMr103,Armavir,2000,64.0
4,Armenia,ARM,ARMr104,Gegharkunik,2000,59.0
5,Armenia,ARM,ARMr106,Kotayk,2000,75.1
6,Armenia,ARM,ARMr105,Lori,2000,64.8
7,Armenia,ARM,ARMr107,Shirak,2000,69.7
8,Armenia,ARM,ARMr108,Syunik,2000,75.0
9,Armenia,ARM,ARMr110,Tavush,2000,63.3


In [53]:
gdl = pd.read_csv("GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv")
armenia_gdl = gdl[gdl['Country'] == 'Armenia']
armenia_gdl

Unnamed: 0,Country,ISO_Code,Level,GDLCODE,Region,1992,1993,1994,1995,1996,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
57,Armenia,ARM,National,ARMt,Total,,,,,,...,,,,,,86.2,,,,
58,Armenia,ARM,Subnat,ARMr101,Aragatsotn,,,,,,...,,,,,,83.0,,,,
59,Armenia,ARM,Subnat,ARMr102,Ararat,,,,,,...,,,,,,84.8,,,,
60,Armenia,ARM,Subnat,ARMr103,Armavir,,,,,,...,,,,,,81.1,,,,
61,Armenia,ARM,Subnat,ARMr104,Gegharkunik,,,,,,...,,,,,,79.9,,,,
62,Armenia,ARM,Subnat,ARMr106,Kotayk,,,,,,...,,,,,,89.0,,,,
63,Armenia,ARM,Subnat,ARMr105,Lori,,,,,,...,,,,,,81.8,,,,
64,Armenia,ARM,Subnat,ARMr107,Shirak,,,,,,...,,,,,,85.6,,,,
65,Armenia,ARM,Subnat,ARMr108,Syunik,,,,,,...,,,,,,85.9,,,,
66,Armenia,ARM,Subnat,ARMr110,Tavush,,,,,,...,,,,,,85.2,,,,


Find the columns whose names are numbers (the year columns).

In [54]:
gdl_year_cols = [x for x in gdl.columns if x.isdigit()]
gdl_year_cols

['1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019',
 '2020']

In [55]:
gdl_data = gdl[['Country', 'GDLCODE', 'Region'] + gdl_year_cols]
gdl_data

Unnamed: 0,Country,GDLCODE,Region,1992,1993,1994,1995,1996,1997,1998,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Afghanistan,AFGt,Total,,,,,,,,...,,,,,51.0,,,,,
1,Afghanistan,AFGr101,Central (Kabul Wardak Kapisa Logar Parwan Panj...,,,,,,,,...,,,,,58.0,,,,,
2,Afghanistan,AFGr102,Central Highlands (Bamyan Daikundi),,,,,,,,...,,,,,41.8,,,,,
3,Afghanistan,AFGr103,East (Nangarhar Kunar Laghman Nooristan),,,,,,,,...,,,,,41.3,,,,,
4,Afghanistan,AFGr104,North (Samangan Sar-e-Pul Balkh Jawzjan Faryab),,,,,,,,...,,,,,56.5,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1578,Zimbabwe,ZWEr104,Mashonaland West,,,25.6,,,,,...,35.4,,,,42.2,,,,42.4,
1579,Zimbabwe,ZWEr108,Masvingo,,,19.4,,,,,...,28.4,,,,38.8,,,,40.3,
1580,Zimbabwe,ZWEr105,Matebeleland North,,,19.2,,,,,...,25.8,,,,33.7,,,,31.6,
1581,Zimbabwe,ZWEr106,Matebeleland South,,,19.3,,,,,...,30.1,,,,39.2,,,,42.6,


In [56]:
gdl_data = gdl[['Country', 'GDLCODE'] + gdl_year_cols]
gdl_data

Unnamed: 0,Country,GDLCODE,1992,1993,1994,1995,1996,1997,1998,1999,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Afghanistan,AFGt,,,,,,,,,...,,,,,51.0,,,,,
1,Afghanistan,AFGr101,,,,,,,,,...,,,,,58.0,,,,,
2,Afghanistan,AFGr102,,,,,,,,,...,,,,,41.8,,,,,
3,Afghanistan,AFGr103,,,,,,,,,...,,,,,41.3,,,,,
4,Afghanistan,AFGr104,,,,,,,,,...,,,,,56.5,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1578,Zimbabwe,ZWEr104,,,25.6,,,,,25.7,...,35.4,,,,42.2,,,,42.4,
1579,Zimbabwe,ZWEr108,,,19.4,,,,,21.7,...,28.4,,,,38.8,,,,40.3,
1580,Zimbabwe,ZWEr105,,,19.2,,,,,23.7,...,25.8,,,,33.7,,,,31.6,
1581,Zimbabwe,ZWEr106,,,19.3,,,,,20.1,...,30.1,,,,39.2,,,,42.6,


In [57]:
gdl_melt = gdl.melt(id_vars=['Country', 'ISO_Code','GDLCODE', 'Region'], var_name='Year', value_name='IWI')
gdl_melt

Unnamed: 0,Country,ISO_Code,GDLCODE,Region,Year,IWI
0,Afghanistan,AFG,AFGt,Total,Level,National
1,Afghanistan,AFG,AFGr101,Central (Kabul Wardak Kapisa Logar Parwan Panj...,Level,Subnat
2,Afghanistan,AFG,AFGr102,Central Highlands (Bamyan Daikundi),Level,Subnat
3,Afghanistan,AFG,AFGr103,East (Nangarhar Kunar Laghman Nooristan),Level,Subnat
4,Afghanistan,AFG,AFGr104,North (Samangan Sar-e-Pul Balkh Jawzjan Faryab),Level,Subnat
...,...,...,...,...,...,...
47485,Zimbabwe,ZWE,ZWEr104,Mashonaland West,2020,
47486,Zimbabwe,ZWE,ZWEr108,Masvingo,2020,
47487,Zimbabwe,ZWE,ZWEr105,Matebeleland North,2020,
47488,Zimbabwe,ZWE,ZWEr106,Matebeleland South,2020,


### Filter the data for plotting
Include only regional data. Drop the rows that have the string 'Level' in the Year column.

In [58]:
gdl_region = gdl_melt.drop(columns='Region')
gdl_region[~(gdl_region['Year'] == 'Level')]
gdl_region

Unnamed: 0,Country,ISO_Code,GDLCODE,Year,IWI
0,Afghanistan,AFG,AFGt,Level,National
1,Afghanistan,AFG,AFGr101,Level,Subnat
2,Afghanistan,AFG,AFGr102,Level,Subnat
3,Afghanistan,AFG,AFGr103,Level,Subnat
4,Afghanistan,AFG,AFGr104,Level,Subnat
...,...,...,...,...,...
47485,Zimbabwe,ZWE,ZWEr104,2020,
47486,Zimbabwe,ZWE,ZWEr108,2020,
47487,Zimbabwe,ZWE,ZWEr105,2020,
47488,Zimbabwe,ZWE,ZWEr106,2020,
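
The output above still contains the `Level` rows because boolean indexing returns a new DataFrame and the result was not stored. A minimal sketch of storing the filtered result and cleaning the column types, using a toy frame with a made-up name (`toy_long`):

```python
import pandas as pd

# Toy long table where 'Level' rows are mixed in with the year rows.
toy_long = pd.DataFrame({
    "GDLCODE": ["ARMt", "ARMr102", "ARMt", "ARMr102"],
    "Year": ["Level", "Level", "2016", "2016"],
    "IWI": ["National", "Subnat", 86.2, 84.8],
})

# Assign the filtered result back; filtering alone does not change toy_long.
toy_long = toy_long[toy_long["Year"] != "Level"].copy()

# The IWI column held strings and numbers together, so convert it to numeric.
toy_long["IWI"] = pd.to_numeric(toy_long["IWI"])
toy_long["Year"] = toy_long["Year"].astype(int)
print(toy_long.dtypes)
```

Numeric `Year` and `IWI` columns also give a plot continuous axes instead of category labels.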


# Data visualization

In [59]:
livwell_gdl_subset = ['Armenia', 'Burundi', 'Cambodia', 'Dominican Republic', 'El Salvador',
                      'Fiji', 'Gabon', 'Haiti', 'Tanzania', 'Turkey', 'Yemen', 'Zimbabwe']
livwell_gdl_countries = set(livwell_df['country_name']) & set(gdl_melt['Country'])

print("Countries in LivWell and GDL datasets: ", len(livwell_gdl_countries))
print("Number of subset countries: ", len(livwell_gdl_subset))

Countries in LivWell and GDL datasets:  51
Number of subset countries:  12


In [60]:
livwell_gdl = gdl_melt[gdl_melt['Country'].isin(livwell_gdl_subset)]

#### Pick countries where data is available.
Find data with good coverage across years.  Here, I checked which rows have a value for the International Wealth Index (IWI) in 1992.

In [61]:
gdl_region[(gdl_region['Year'] == '1992') & (gdl_region['IWI'].notna())]
print(gdl_region.head(10))

livwell_gdl = gdl_region[gdl_region['Country'].isin(livwell_gdl_subset)]

       Country ISO_Code  GDLCODE   Year       IWI
0  Afghanistan      AFG     AFGt  Level  National
1  Afghanistan      AFG  AFGr101  Level    Subnat
2  Afghanistan      AFG  AFGr102  Level    Subnat
3  Afghanistan      AFG  AFGr103  Level    Subnat
4  Afghanistan      AFG  AFGr104  Level    Subnat
5  Afghanistan      AFG  AFGr105  Level    Subnat
6  Afghanistan      AFG  AFGr106  Level    Subnat
7  Afghanistan      AFG  AFGr107  Level    Subnat
8  Afghanistan      AFG  AFGr108  Level    Subnat
9      Albania      ALB     ALBt  Level  National


Keep only the countries in the subset list, then create a scatter plot of IWI by year, colored by country.

In [62]:
import plotly.express as px

fig = px.scatter(livwell_gdl, x="Year", y="IWI", color="Country")
fig.show()