<a href="https://colab.research.google.com/github/fayshaw/data_preprocessing/blob/main/livwell.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing
## LivWell Dataset: Women and their Well-being for 52 Countries
### Women in Data and PyLadies Boston
August 21, 2025

Together, we will explore the LivWell dataset from Belmin et al.'s 2022 paper in *Scientific Data* (a Nature Portfolio journal), <a href="https://www.nature.com/articles/s41597-022-01824-2">LivWell: a sub-national Dataset on the Living Conditions of Women and their Well-being for 52 Countries</a>. The authors aggregated a longitudinal dataset for subnational regions from the Demographic and Health Surveys (DHS). Much of their work is the geographic harmonization of regional boundaries.

We will wrangle some raw data to look more like their published data set. <br>


*Figure 1: Flowchart representing the data processing steps to obtain LivWell. Orange: input data; green: indicators based on DHS data; blue: indicators based on gridded data; white: validation data.*

<img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41597-022-01824-2/MediaObjects/41597_2022_1824_Fig1_HTML.png" width="600">



In this notebook, we will look at some of the DHS STATcompiler data (which the authors used for validation) and compare it to their published output.

# Overview

1. Open LivWell data set.
2. Look at DHS STAT Compiler raw data.
3. Try to get the raw data into a comparable form.

## Read files
Read the published file using a url.

In [None]:
import pandas as pd
livwell_df = pd.read_csv('https://zenodo.org/records/7277104/files/livwell.csv')

### Explore the data

<img src="https://scentla.com/wp-content/uploads/2025/02/Efficiently-Create-and-Fill-Pandas-DataFrames-in-Python-1024x399.jpg" width=600>

<a href="https://realpython.com/pandas-python-explore-dataset/">Real Python dataframe resource</a>

DataFrame `df`
* Show `df`
* `df.head()`
* `df.describe()`
* `df.columns`

In [None]:
livwell_df

Unnamed: 0,country_name,country_code,year,region_num_harmonized,region_name_harmonized,SurveyId,interview_year_mean,interview_month_mean,CMC_interview_mean,DM_age_mean,...,drought_spei03_n1_share36,drought_spei03_n1_share60,drought_spei03_n1.5_share12,drought_spei03_n1.5_share36,drought_spei03_n1.5_share60,drought_spei03_n2_share12,drought_spei03_n2_share36,drought_spei03_n2_share60,hdi,gdp_pc
0,Armenia,ARM,2000,1,Aragatsotn,AM2000DHS,2000.0,11.0,1210.53,30.71,...,0.388889,0.316667,0.333333,0.250000,0.166667,0.083333,0.083333,0.050000,0.644083,2938.187500
1,Armenia,ARM,2000,2,Ararat,AM2000DHS,2000.0,11.0,1210.55,30.38,...,0.416667,0.316667,0.333333,0.277778,0.233333,0.083333,0.083333,0.050000,0.644127,3053.040283
2,Armenia,ARM,2000,3,Armavir,AM2000DHS,2000.0,10.0,1210.43,31.10,...,0.361111,0.300000,0.333333,0.250000,0.166667,0.083333,0.083333,0.050000,0.644415,3003.245605
3,Armenia,ARM,2000,4,Gegharkunik,AM2000DHS,2000.0,11.0,1210.58,30.65,...,0.416667,0.316667,0.250000,0.194444,0.166667,0.083333,0.083333,0.050000,0.643942,2945.085449
4,Armenia,ARM,2000,5,Lori,AM2000DHS,2000.0,10.0,1210.43,31.57,...,0.388889,0.316667,0.333333,0.222222,0.150000,0.083333,0.083333,0.050000,0.645256,2925.469727
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1827,Zimbabwe,ZWE,2015,6,Matabeleland South,ZW2015DHS,2015.0,9.0,1389.00,27.65,...,0.388889,0.333333,0.333333,0.250000,0.216667,0.250000,0.083333,0.066667,0.516884,1864.769000
1828,Zimbabwe,ZWE,2015,7,Midlands,ZW2015DHS,2015.0,9.0,1388.60,27.89,...,0.388889,0.316667,0.250000,0.138889,0.150000,0.250000,0.083333,0.050000,0.516000,1687.976000
1829,Zimbabwe,ZWE,2015,8,Masvingo,ZW2015DHS,2015.0,9.0,1388.91,28.69,...,0.250000,0.216667,0.166667,0.055556,0.066667,0.083333,0.027778,0.016667,0.515188,1687.113000
1830,Zimbabwe,ZWE,2015,9,Harare/Chitungwiza,ZW2015DHS,2015.0,9.0,1388.71,28.67,...,0.416667,0.333333,0.416667,0.194444,0.116667,0.250000,0.083333,0.050000,0.516000,1687.976000


In [None]:
livwell_df.columns[:50]

Index(['country_name', 'country_code', 'year', 'region_num_harmonized',
       'region_name_harmonized', 'SurveyId', 'interview_year_mean',
       'interview_month_mean', 'CMC_interview_mean', 'DM_age_mean',
       'DM_age_mean_se', 'DM_age_15.19_p', 'DM_age_15.19_p_se',
       'DM_age_20.24_p', 'DM_age_20.24_p_se', 'DM_age_25.29_p',
       'DM_age_25.29_p_se', 'DM_age_30.34_p', 'DM_age_30.34_p_se',
       'DM_age_35.39_p', 'DM_age_35.39_p_se', 'DM_age_40.44_p',
       'DM_age_40.44_p_se', 'DM_age_45.49_p', 'DM_age_45.49_p_se',
       'DM_urban_p', 'DM_urban_p_se', 'DM_born_rural_p', 'DM_born_rural_p_se',
       'DM_nvr_marr_p', 'DM_nvr_marr_p_se', 'DM_marr_p', 'DM_marr_p_se',
       'DM_age_marr_mean', 'DM_age_marr_mean_se', 'DM_age_diff_mean',
       'DM_age_diff_mean_se', 'DM_age_diff_10plus_p',
       'DM_age_diff_10plus_p_se', 'DM_age_diff_5_9_p', 'DM_age_diff_5_9_p_se',
       'DM_age_diff_5minus_p', 'DM_age_diff_5minus_p_se', 'DM_age_diff_0_p',
       'DM_age_diff_0_p_se', 'HH_w

In [None]:
indicators_df = pd.read_csv("https://zenodo.org/records/7277104/files/indicators.csv")
indicators_df.head(20)

Unnamed: 0,indicator_category,indicator_code,indicator_description
0,Individual demographic information,DM_age_mean,Average age of women
1,Individual demographic information,DM_age_15-19_p,Women in age category 15-19 (%)
2,Individual demographic information,DM_age_20-24_p,Women in age category 20-24 (%)
3,Individual demographic information,DM_age_25-29_p,Women in age category 25-29 (%)
4,Individual demographic information,DM_age_30-34_p,Women in age category 30-34 (%)
5,Individual demographic information,DM_age_35-39_p,Women in age category 35-39 (%)
6,Individual demographic information,DM_age_40-44_p,Women in age category 40-44 (%)
7,Individual demographic information,DM_age_45-49_p,Women in age category 45-49 (%)
8,Individual demographic information,DM_urban_p,Women living in urban areas (%)
9,Individual demographic information,DM_born_rural_p,Women being born at the country side (%)


Select a single column with `dataframe['column name']` to see which countries are in this data set.

In [None]:
set(livwell_df['country_name'])

{'Armenia',
 'Bangladesh',
 'Benin',
 'Bolivia',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Colombia',
 'Congo Democratic Republic',
 "Cote d'Ivoire",
 'Egypt',
 'Ethiopia',
 'Gabon',
 'Ghana',
 'Guatemala',
 'Guinea',
 'Haiti',
 'Honduras',
 'India',
 'Indonesia',
 'Jordan',
 'Kenya',
 'Lesotho',
 'Liberia',
 'Madagascar',
 'Malawi',
 'Maldives',
 'Mali',
 'Morocco',
 'Mozambique',
 'Namibia',
 'Nepal',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'Pakistan',
 'Peru',
 'Philippines',
 'Rwanda',
 'Senegal',
 'Sierra Leone',
 'South Africa',
 'Tajikistan',
 'Tanzania',
 'Timor-Leste',
 'Togo',
 'Turkey',
 'Uganda',
 'Vietnam',
 'Zambia',
 'Zimbabwe'}

In [None]:
len(set(livwell_df['country_name']))

52
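
As an alternative to `set(...)`, pandas has its own methods for distinct values. The sketch below uses a small made-up frame (`toy` is a stand-in, not the real `livwell_df`) to show `unique()`, `nunique()`, and `value_counts()`.

```python
import pandas as pd

# Toy stand-in for livwell_df: a few rows with made-up years.
toy = pd.DataFrame({
    "country_name": ["Armenia", "Armenia", "Zimbabwe", "Togo", "Togo"],
    "year": [2000, 2005, 2015, 1998, 2013],
})

countries = sorted(toy["country_name"].unique())       # distinct values, sorted for stable output
n_countries = toy["country_name"].nunique()            # number of distinct values
rows_per_country = toy["country_name"].value_counts()  # how many rows each country has

print(countries)
print(n_countries)
print(rows_per_country)
```

`value_counts()` is handy here because it also shows how many region-year rows each country contributes.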

## Filter to get data for one country
Armenia

In [None]:
# Boolean mask
livwell_df['country_name'] == 'Armenia'

Unnamed: 0,country_name
0,True
1,True
2,True
3,True
4,True
...,...
1827,False
1828,False
1829,False
1830,False


In [None]:
livwell_armenia = livwell_df[livwell_df['country_name'] == 'Armenia']

print("years: " , set(livwell_armenia['year']))
print("regions: ", set(livwell_armenia['region_name_harmonized']))
livwell_armenia.head(12)

years:  {2000, 2010, 2016, 2005}
regions:  {'Kotayk', 'Aragatsotn', 'Gegharkunik', 'Syunik', 'Yerevan', 'Tavush', 'Vayots Dzor', 'Armavir', 'Lori', 'Ararat', 'Shirak'}


Unnamed: 0,country_name,country_code,year,region_num_harmonized,region_name_harmonized,SurveyId,interview_year_mean,interview_month_mean,CMC_interview_mean,DM_age_mean,...,drought_spei03_n1_share36,drought_spei03_n1_share60,drought_spei03_n1.5_share12,drought_spei03_n1.5_share36,drought_spei03_n1.5_share60,drought_spei03_n2_share12,drought_spei03_n2_share36,drought_spei03_n2_share60,hdi,gdp_pc
0,Armenia,ARM,2000,1,Aragatsotn,AM2000DHS,2000.0,11.0,1210.53,30.71,...,0.388889,0.316667,0.333333,0.25,0.166667,0.083333,0.083333,0.05,0.644083,2938.1875
1,Armenia,ARM,2000,2,Ararat,AM2000DHS,2000.0,11.0,1210.55,30.38,...,0.416667,0.316667,0.333333,0.277778,0.233333,0.083333,0.083333,0.05,0.644127,3053.040283
2,Armenia,ARM,2000,3,Armavir,AM2000DHS,2000.0,10.0,1210.43,31.1,...,0.361111,0.3,0.333333,0.25,0.166667,0.083333,0.083333,0.05,0.644415,3003.245605
3,Armenia,ARM,2000,4,Gegharkunik,AM2000DHS,2000.0,11.0,1210.58,30.65,...,0.416667,0.316667,0.25,0.194444,0.166667,0.083333,0.083333,0.05,0.643942,2945.085449
4,Armenia,ARM,2000,5,Lori,AM2000DHS,2000.0,10.0,1210.43,31.57,...,0.388889,0.316667,0.333333,0.222222,0.15,0.083333,0.083333,0.05,0.645256,2925.469727
5,Armenia,ARM,2000,6,Kotayk,AM2000DHS,2000.0,10.0,1210.48,31.15,...,0.416667,0.316667,0.333333,0.277778,0.183333,0.083333,0.083333,0.05,0.644,2918.557617
6,Armenia,ARM,2000,7,Shirak,AM2000DHS,2000.0,10.0,1210.43,31.8,...,0.388889,0.35,0.333333,0.194444,0.133333,0.083333,0.083333,0.05,0.645674,3053.684326
7,Armenia,ARM,2000,8,Syunik,AM2000DHS,2000.0,10.0,1210.42,31.37,...,0.416667,0.316667,0.166667,0.222222,0.183333,0.083333,0.055556,0.066667,0.644479,3086.177002
8,Armenia,ARM,2000,9,Vayots Dzor,AM2000DHS,2000.0,10.0,1210.41,31.69,...,0.388889,0.3,0.333333,0.277778,0.216667,0.083333,0.194444,0.133333,0.643944,2969.26123
9,Armenia,ARM,2000,10,Tavush,AM2000DHS,2000.0,10.0,1210.46,31.28,...,0.416667,0.3,0.333333,0.194444,0.133333,0.083333,0.083333,0.05,0.644377,3000.003906
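
Boolean masks can be combined to filter on more than one condition. This is a minimal sketch on a made-up frame (`toy` and its rows are illustrative only) showing `&` and `isin()`.

```python
import pandas as pd

# Toy stand-in for livwell_df with only the columns needed for filtering.
toy = pd.DataFrame({
    "country_name": ["Armenia", "Armenia", "Armenia", "Zimbabwe"],
    "year": [2000, 2016, 2016, 2015],
    "region_name_harmonized": ["Lori", "Lori", "Shirak", "Midlands"],
})

# Each comparison yields a boolean Series; combine them with & (and) or | (or).
# The parentheses are required because & binds tighter than ==.
mask = (toy["country_name"] == "Armenia") & (toy["year"] == 2016)
armenia_2016 = toy.loc[mask]

# isin() tests membership in a list of values.
two_years = toy[toy["year"].isin([2000, 2015])]

print(armenia_2016)
print(two_years)
```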


### More to explore with dataframes
pandas DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
* `df.shape`
* `df.dtypes`

## Read raw education file

Manually upload the file `STATcompilerExport_education.csv`

In [None]:
from google.colab import files
uploaded = files.upload()

Saving STATcompilerExport_education.csv to STATcompilerExport_education.csv


Notice the `Unnamed` column titles, along with NaN rows at the top and bottom.

In [None]:
education = pd.read_csv("STATcompilerExport_education.csv")
education

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,,,,,,,,,,,,
1,,,,,,,,,,,,
2,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate
3,Armenia,2015-16 DHS,Total 15-49,0.1,0,36.2,53.5,0.3,99.6,11.3,,
4,Armenia,2015-16 DHS,Region : Aragatsotn,0,0,57.4,34.9,0,100,9.9,,
...,...,...,...,...,...,...,...,...,...,...,...,...
785,Women with secondary or higher education,Percentage of women with secondary or higher e...,,,,,,,,,,
786,Median years of education completed [Women],Median number of years of education completed ...,,,,,,,,,,
787,Women who can read a whole sentence,Percentage of women who can read a whole sentence,,,,,,,,,,
788,Women who are literate,Percentage of women who are literate,,,,,,,,,,


How many values in the first row are null? If all of them are, we can safely skip that row.

In [None]:
education.iloc[0].isnull().sum()

np.int64(12)
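
Rather than checking one row at a time, we could flag every all-null row at once. The sketch below assumes a tiny made-up frame (`toy`) shaped like the top of the raw export.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the top of the raw export: two empty rows, then the real header row.
toy = pd.DataFrame({
    "a": [np.nan, np.nan, "Country", "Armenia"],
    "b": [np.nan, np.nan, "Survey", "2015-16 DHS"],
})

all_null = toy.isna().all(axis=1)            # True where every value in the row is missing
n_all_null = int(all_null.sum())             # number of completely empty rows
first_data_row = all_null[~all_null].index[0]  # index label of the first row with any data

print(n_all_null, first_data_row)
```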

In [None]:
education.tail(12)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
778,Togo,1998 DHS,Region : Savanes,82.7,0.2,,0.0,13.4,3.9,,,
779,,,,,,,,,,,,
780,Women with no education,Percentage of women with no education,,,,,,,,,,
781,Women with completed primary education,Percentage of women with completed primary edu...,,,,,,,,,,
782,Women with completed secondary education,Percentage of women with completed secondary e...,,,,,,,,,,
783,Women with more than secondary education,Percentage of women with more than secondary e...,,,,,,,,,,
784,Women with primary education,Percentage of women with primary education,,,,,,,,,,
785,Women with secondary or higher education,Percentage of women with secondary or higher e...,,,,,,,,,,
786,Median years of education completed [Women],Median number of years of education completed ...,,,,,,,,,,
787,Women who can read a whole sentence,Percentage of women who can read a whole sentence,,,,,,,,,,


Read the file again, skipping the NaN rows at the top and the indicator descriptions at the bottom.

In [None]:
education = pd.read_csv('STATcompilerExport_education.csv', skiprows=3, skipfooter=11, engine='python')
education

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate
0,Armenia,2015-16 DHS,Total 15-49,0.1,0.0,36.2,53.5,0.3,99.6,11.3,,
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,
...,...,...,...,...,...,...,...,...,...,...,...,...
771,Togo,1998 DHS,Region : ..Lomé,25.3,5.0,,2.8,38.7,36.0,4.3,,
772,Togo,1998 DHS,Region : Plateaux,43.4,3.0,,0.0,43.9,12.7,1.3,,
773,Togo,1998 DHS,Region : Centrale,52.2,1.0,,0.0,35.2,12.7,,,
774,Togo,1998 DHS,Region : Kara,52.0,1.4,,0.0,32.2,15.8,,,
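
To see what `skiprows` and `skipfooter` do without the real file, here is a small sketch that reads a made-up CSV from memory. The text in `lines` is an invented miniature of the export layout, not the actual file.

```python
import io
import pandas as pd

# Invented miniature of the export: empty lines, header, data, then a footer.
lines = [
    ",,",
    ",,",
    ",,",
    "Country,Survey,Women with no education",
    "Armenia,2015-16 DHS,0.1",
    "Togo,1998 DHS,82.7",
    ",,",
    "Women with no education,Percentage of women with no education,",
]
raw = "\n".join(lines)

# skiprows drops lines from the top; skipfooter drops lines from the bottom.
# skipfooter is only supported by the slower pure-Python parser.
toy = pd.read_csv(io.StringIO(raw), skiprows=3, skipfooter=2, engine="python")
print(toy)
```

With the empty lines skipped, the fourth line becomes the header and the numeric column is parsed as floats.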


In [None]:
set(education['Country'])

{'Armenia',
 'Congo Democratic Republic',
 'Ethiopia',
 'Lesotho',
 'Malawi',
 'Maldives',
 'Mozambique',
 'Namibia',
 'Nepal',
 'Nicaragua',
 'Nigeria',
 'Rwanda',
 'Tajikistan',
 'Timor-Leste',
 'Togo'}

In [None]:
len(set(education['Country']))

15

### One country example

* Get data for just Armenia
* Make a deep `.copy()` so you are not operating on a view and avoid the <a href="https://realpython.com/pandas-settingwithcopywarning/">`SettingWithCopyWarning`</a>

In [None]:
education_armenia = education[education['Country'] == 'Armenia'].copy()
education_armenia.head(15)

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate
0,Armenia,2015-16 DHS,Total 15-49,0.1,0.0,36.2,53.5,0.3,99.6,11.3,,
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,
5,Armenia,2015-16 DHS,Region : Lori,0.0,0.0,43.3,49.3,0.0,100.0,11.1,,
6,Armenia,2015-16 DHS,Region : Kotayk,0.0,0.0,36.5,53.2,0.4,99.6,11.0,,
7,Armenia,2015-16 DHS,Region : Shirak,0.2,0.0,39.6,51.3,0.2,99.6,11.0,,
8,Armenia,2015-16 DHS,Region : Syunik,0.2,0.0,36.6,50.0,0.0,99.8,10.9,,
9,Armenia,2015-16 DHS,Region : Vayots Dzor,0.0,0.0,40.2,49.3,0.0,100.0,10.3,,
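
A quick way to convince ourselves that `.copy()` gives an independent frame is to change the copy and check the original. This sketch uses a made-up frame (`toy`) rather than the real `education` data.

```python
import pandas as pd

# Toy stand-in for the education frame.
toy = pd.DataFrame({
    "Country": ["Armenia", "Togo", "Armenia"],
    "Survey": ["2000 DHS", "1998 DHS", "2005 DHS"],
})

subset = toy[toy["Country"] == "Armenia"].copy()      # independent copy of the filtered rows
subset["year"] = subset["Survey"].str.split().str[0]  # new column is added to the copy only

print(subset)
print(toy.columns.tolist())
```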


In [None]:
education_armenia.columns

Index(['Country', 'Survey', 'Characteristic', 'Women with no education',
       'Women with completed primary education',
       'Women with completed secondary education',
       'Women with more than secondary education',
       'Women with primary education',
       'Women with secondary or higher education',
       'Median years of education completed [Women]',
       'Women who can read a whole sentence', 'Women who are literate'],
      dtype='object')

Rename one column

### Get the survey years

In [None]:
set(education_armenia['Survey'])

{'2000 DHS', '2005 DHS', '2010 DHS', '2015-16 DHS'}

Split the survey text on whitespace to get the year. This makes two new columns, `year` and `source`, which appear at the right side.



In [None]:
education_armenia[['year','source']] = education_armenia.loc[:, 'Survey'].str.split(expand=True)
education_armenia.head(15)

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate,year,source
0,Armenia,2015-16 DHS,Total 15-49,0.1,0.0,36.2,53.5,0.3,99.6,11.3,,,2015-16,DHS
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2015-16,DHS
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2015-16,DHS
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2015-16,DHS
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2015-16,DHS
5,Armenia,2015-16 DHS,Region : Lori,0.0,0.0,43.3,49.3,0.0,100.0,11.1,,,2015-16,DHS
6,Armenia,2015-16 DHS,Region : Kotayk,0.0,0.0,36.5,53.2,0.4,99.6,11.0,,,2015-16,DHS
7,Armenia,2015-16 DHS,Region : Shirak,0.2,0.0,39.6,51.3,0.2,99.6,11.0,,,2015-16,DHS
8,Armenia,2015-16 DHS,Region : Syunik,0.2,0.0,36.6,50.0,0.0,99.8,10.9,,,2015-16,DHS
9,Armenia,2015-16 DHS,Region : Vayots Dzor,0.0,0.0,40.2,49.3,0.0,100.0,10.3,,,2015-16,DHS


### Rename the year text 2015-16 to 2016.

In [None]:
education_armenia['year'] = education_armenia['year'].replace('2015-16', '2016')
education_armenia.head()

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate,year,source
0,Armenia,2015-16 DHS,Total 15-49,0.1,0.0,36.2,53.5,0.3,99.6,11.3,,,2016,DHS
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2016,DHS
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2016,DHS
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2016,DHS
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2016,DHS
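
The `year` column is still text at this point, while the LivWell `year` values printed earlier are integers. If we want to compare or merge on year, one option is to cast it. A minimal sketch on a made-up Series (`survey`):

```python
import pandas as pd

# The four survey labels seen for Armenia.
survey = pd.Series(["2000 DHS", "2005 DHS", "2010 DHS", "2015-16 DHS"])

year_text = survey.str.split().str[0].replace("2015-16", "2016")  # still strings
year = year_text.astype(int)                                      # now integers

print(year.tolist())
```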


Similarly, split `Characteristic` on the separator `" : "` to get the region name.

In [None]:
education_armenia['region'] = education_armenia.loc[:, 'Characteristic'].str.split(" : ").str[1]
education_armenia.head()

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate,year,source,region
0,Armenia,2015-16 DHS,Total 15-49,0.1,0.0,36.2,53.5,0.3,99.6,11.3,,,2016,DHS,
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2016,DHS,Aragatsotn
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2016,DHS,Ararat
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2016,DHS,Armavir
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2016,DHS,Gegharkunik
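
Two details are worth noticing: rows without the separator (such as `Total 15-49`) get `NaN`, and some region names in the raw file carry leading dots (for example `..Lomé` for Togo). The sketch below shows both on a made-up Series; stripping the dots with `str.lstrip` is one possible cleanup, not a step from the notebook.

```python
import pandas as pd

characteristic = pd.Series(["Total 15-49", "Region : Aragatsotn", "Region : ..Lomé"])

parts = characteristic.str.split(" : ")  # each value becomes a list of pieces
region = parts.str[1].str.lstrip(".")    # second piece, with leading dots removed

print(region)
```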


In [None]:
# Drop rows that are not regions
education_armenia = education_armenia[~education_armenia['Characteristic'].str.contains('Total')]
education_armenia.head()

Unnamed: 0,Country,Survey,Characteristic,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate,year,source,region
1,Armenia,2015-16 DHS,Region : Aragatsotn,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2016,DHS,Aragatsotn
2,Armenia,2015-16 DHS,Region : Ararat,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2016,DHS,Ararat
3,Armenia,2015-16 DHS,Region : Armavir,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2016,DHS,Armavir
4,Armenia,2015-16 DHS,Region : Gegharkunik,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2016,DHS,Gegharkunik
5,Armenia,2015-16 DHS,Region : Lori,0.0,0.0,43.3,49.3,0.0,100.0,11.1,,,2016,DHS,Lori


In [None]:
ed_armenia = education_armenia.drop(columns=['Survey', 'Characteristic'])
ed_armenia.head()

Unnamed: 0,Country,Women with no education,Women with completed primary education,Women with completed secondary education,Women with more than secondary education,Women with primary education,Women with secondary or higher education,Median years of education completed [Women],Women who can read a whole sentence,Women who are literate,year,source,region
1,Armenia,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2016,DHS,Aragatsotn
2,Armenia,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2016,DHS,Ararat
3,Armenia,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2016,DHS,Armavir
4,Armenia,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2016,DHS,Gegharkunik
5,Armenia,0.0,0.0,43.3,49.3,0.0,100.0,11.1,,,2016,DHS,Lori


In [None]:
# Rename education columns to the LivWell indicator names.
# The keys are STATcompiler column titles; each value must be unique,
# otherwise the renamed frame ends up with duplicate column names.
rename_ed_cols = {
    'Women with no education': 'ED_attainment_no_educ_p',
    'Women with completed primary education': 'ED_attainment_primary_completed_p',
    'Women with completed secondary education': 'ED_attainment_secondary_completed_p',
    'Women with more than secondary education': 'ED_attainment_secondary_higher_p',
    'Women with primary education': 'ED_attainment_primary_p',
    # Assumed correspondence: check the indicator descriptions in indicators_df.
    'Women with secondary or higher education': 'ED_attainment_secondary_p',
    # Note: STATcompiler reports a median, while the LivWell name says mean.
    'Median years of education completed [Women]': 'ED_educ_years_mean',
}

In [None]:
ed_armenia = ed_armenia.rename(columns=rename_ed_cols)
ed_armenia.head()

Unnamed: 0,Country,ED_attainment_no_educ_p,ED_attainment_primary_completed_p,ED_attainment_secondary_completed_p,ED_attainment_secondary_higher_p,ED_attainment_primary_p,ED_attainment_secondary_p,ED_educ_years_mean,Women who can read a whole sentence,Women who are literate,year,source,region
1,Armenia,0.0,0.0,57.4,34.9,0.0,100.0,9.9,,,2016,DHS,Aragatsotn
2,Armenia,0.0,0.2,51.8,34.7,0.6,99.4,9.9,,,2016,DHS,Ararat
3,Armenia,0.5,0.2,41.2,37.4,1.4,98.2,9.8,,,2016,DHS,Armavir
4,Armenia,0.0,0.0,56.7,30.6,0.0,100.0,9.8,,,2016,DHS,Gegharkunik
5,Armenia,0.0,0.0,43.3,49.3,0.0,100.0,11.1,,,2016,DHS,Lori


In [None]:
uploaded = files.upload()

Saving GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv to GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv


In [None]:
livwell_armenia.columns[livwell_armenia.columns.str.contains('ED')]

Index(['ED_educ_years_mean', 'ED_educ_years_mean_se',
       'ED_attainment_no_educ_p', 'ED_attainment_no_educ_p_se',
       'ED_attainment_primary_p', 'ED_attainment_primary_p_se',
       'ED_attainment_primary_completed_p',
       'ED_attainment_primary_completed_p_se', 'ED_attainment_secondary_p',
       'ED_attainment_secondary_p_se', 'ED_attainment_secondary_completed_p',
       'ED_attainment_secondary_completed_p_se',
       'ED_attainment_secondary_higher_p',
       'ED_attainment_secondary_higher_p_se', 'ED_litt_p', 'ED_litt_p_se',
       'ED_litt_whole_p', 'ED_litt_whole_p_se'],
      dtype='object')
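
Once both tables share column names, one way to compare them is to merge on year and region. This is only a sketch: the frames `stat` and `livwell` and all numbers in them are placeholders, not real survey results, and it assumes `year` holds integers in both frames.

```python
import pandas as pd

# Placeholder values only: these numbers are NOT real survey results.
stat = pd.DataFrame({
    "year": [2016, 2016],
    "region": ["Aragatsotn", "Ararat"],
    "ED_educ_years_mean": [10.0, 11.0],
})
livwell = pd.DataFrame({
    "year": [2016, 2016],
    "region_name_harmonized": ["Aragatsotn", "Ararat"],
    "ED_educ_years_mean": [10.5, 10.75],
})

# Match rows on year and region; suffixes tell the two sources apart.
merged = stat.merge(
    livwell,
    left_on=["year", "region"],
    right_on=["year", "region_name_harmonized"],
    suffixes=("_stat", "_livwell"),
)
merged["diff"] = merged["ED_educ_years_mean_stat"] - merged["ED_educ_years_mean_livwell"]

print(merged[["region", "ED_educ_years_mean_stat", "ED_educ_years_mean_livwell", "diff"]])
```

Rows that fail to match are dropped by the default inner join, so a shorter result is a hint that region names or years differ between the two sources.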

In [None]:
gdl = pd.read_csv("GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv")
gdl

Unnamed: 0,Country,ISO_Code,Level,GDLCODE,Region,1992,1993,1994,1995,1996,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Afghanistan,AFG,National,AFGt,Total,,,,,,...,,,,,51.0,,,,,
1,Afghanistan,AFG,Subnat,AFGr101,Central (Kabul Wardak Kapisa Logar Parwan Panj...,,,,,,...,,,,,58.0,,,,,
2,Afghanistan,AFG,Subnat,AFGr102,Central Highlands (Bamyan Daikundi),,,,,,...,,,,,41.8,,,,,
3,Afghanistan,AFG,Subnat,AFGr103,East (Nangarhar Kunar Laghman Nooristan),,,,,,...,,,,,41.3,,,,,
4,Afghanistan,AFG,Subnat,AFGr104,North (Samangan Sar-e-Pul Balkh Jawzjan Faryab),,,,,,...,,,,,56.5,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1578,Zimbabwe,ZWE,Subnat,ZWEr104,Mashonaland West,,,25.6,,,...,35.4,,,,42.2,,,,42.4,
1579,Zimbabwe,ZWE,Subnat,ZWEr108,Masvingo,,,19.4,,,...,28.4,,,,38.8,,,,40.3,
1580,Zimbabwe,ZWE,Subnat,ZWEr105,Matebeleland North,,,19.2,,,...,25.8,,,,33.7,,,,31.6,
1581,Zimbabwe,ZWE,Subnat,ZWEr106,Matebeleland South,,,19.3,,,...,30.1,,,,39.2,,,,42.6,


In [None]:
armenia_gdl = gdl[gdl['Country'] == "Armenia"]
armenia_gdl

Unnamed: 0,Country,ISO_Code,Level,GDLCODE,Region,1992,1993,1994,1995,1996,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
57,Armenia,ARM,National,ARMt,Total,,,,,,...,,,,,,86.2,,,,
58,Armenia,ARM,Subnat,ARMr101,Aragatsotn,,,,,,...,,,,,,83.0,,,,
59,Armenia,ARM,Subnat,ARMr102,Ararat,,,,,,...,,,,,,84.8,,,,
60,Armenia,ARM,Subnat,ARMr103,Armavir,,,,,,...,,,,,,81.1,,,,
61,Armenia,ARM,Subnat,ARMr104,Gegharkunik,,,,,,...,,,,,,79.9,,,,
62,Armenia,ARM,Subnat,ARMr106,Kotayk,,,,,,...,,,,,,89.0,,,,
63,Armenia,ARM,Subnat,ARMr105,Lori,,,,,,...,,,,,,81.8,,,,
64,Armenia,ARM,Subnat,ARMr107,Shirak,,,,,,...,,,,,,85.6,,,,
65,Armenia,ARM,Subnat,ARMr108,Syunik,,,,,,...,,,,,,85.9,,,,
66,Armenia,ARM,Subnat,ARMr110,Tavush,,,,,,...,,,,,,85.2,,,,


In [None]:
armenia_gdl.notna()

Unnamed: 0,Country,ISO_Code,Level,GDLCODE,Region,1992,1993,1994,1995,1996,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
57,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
58,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
59,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
60,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
61,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
62,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
63,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
64,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
65,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
66,True,True,True,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False


In [None]:
armenia_gdl.isnull().any()

Unnamed: 0,0
Country,False
ISO_Code,False
Level,False
GDLCODE,False
Region,False
1992,True
1993,True
1994,True
1995,True
1996,True
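
To find the columns that actually hold data, we can flip the question and ask which columns have at least one non-null value. A minimal sketch on a made-up frame (`toy`), reusing two of the 2016 values shown above:

```python
import numpy as np
import pandas as pd

# Toy stand-in for armenia_gdl: one empty year column and one with data.
toy = pd.DataFrame({
    "Region": ["Aragatsotn", "Ararat"],
    "2015": [np.nan, np.nan],
    "2016": [83.0, 84.8],
})

has_data = toy.notna().any()                     # per column: is at least one value present?
cols_with_data = toy.columns[has_data].tolist()  # names of the columns worth keeping
trimmed = toy.dropna(axis=1, how="all")          # drop columns that are entirely NaN

print(cols_with_data)
print(trimmed)
```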


In [None]:
armenia_gdl[~armenia_gdl.notnull()]

Unnamed: 0,Country,ISO_Code,Level,GDLCODE,Region,1992,1993,1994,1995,1996,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
57,,,,,,,,,,,...,,,,,,,,,,
58,,,,,,,,,,,...,,,,,,,,,,
59,,,,,,,,,,,...,,,,,,,,,,
60,,,,,,,,,,,...,,,,,,,,,,
61,,,,,,,,,,,...,,,,,,,,,,
62,,,,,,,,,,,...,,,,,,,,,,
63,,,,,,,,,,,...,,,,,,,,,,
64,,,,,,,,,,,...,,,,,,,,,,
65,,,,,,,,,,,...,,,,,,,,,,
66,,,,,,,,,,,...,,,,,,,,,,


In [None]:
armenia_gdl = armenia_gdl[['Country', 'ISO_Code', 'GDLCODE', 'Region', '2000', '2010', '2016']]

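Use `DataFrame.melt` to reshape the wide table (one column per year) into a long table (one row per region and year).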

<img src="https://pandas.pydata.org/pandas-docs/version/0.25.1/_images/reshaping_melt.png" width=600>

In [None]:
armenia_gdl_melt = armenia_gdl.melt(id_vars=['Country', 'ISO_Code','GDLCODE', 'Region'], var_name='year', value_name='count')
armenia_gdl_melt

Unnamed: 0,Country,ISO_Code,GDLCODE,Region,year,count
0,Armenia,ARM,ARMt,Total,Level,National
1,Armenia,ARM,ARMr101,Aragatsotn,Level,Subnat
2,Armenia,ARM,ARMr102,Ararat,Level,Subnat
3,Armenia,ARM,ARMr103,Armavir,Level,Subnat
4,Armenia,ARM,ARMr104,Gegharkunik,Level,Subnat
...,...,...,...,...,...,...
355,Armenia,ARM,ARMr107,Shirak,2020,
356,Armenia,ARM,ARMr108,Syunik,2020,
357,Armenia,ARM,ARMr110,Tavush,2020,
358,Armenia,ARM,ARMr109,Vayots Dzor,2020,
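
Any column not listed in `id_vars` is melted, which is why `Level` shows up in the `year` column above when it is left in the frame. The sketch below melts a small made-up frame (`wide`), drops the empty region-years, and casts the year labels to integers; the column name `iwi` is a hypothetical choice for the value column.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per region, one column per year.
wide = pd.DataFrame({
    "Region": ["Aragatsotn", "Ararat"],
    "2015": [np.nan, np.nan],
    "2016": [83.0, 84.8],
})

long = (
    wide.melt(id_vars=["Region"], var_name="year", value_name="iwi")
    .dropna(subset=["iwi"])   # keep only region-years that have a score
    .astype({"year": int})    # year labels were text column names; make them integers
    .reset_index(drop=True)
)

print(long)
```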


In [None]:
rename_cols = {'Median years of education completed [Women]' : 'ED_educ_years_mean'}

education_armenia_renamed = education_armenia.rename(columns=rename_cols)
education_armenia_renamed.head(12)

In [None]:
from PIL import Image
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
gdl = pd.read_csv("GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv")
armenia_gdl = gdl[gdl['Country'] == 'Armenia']
armenia_gdl

In [None]:
livwell_df.head()

In [None]:
livwell_df.describe()

In [None]:
livwell_df.shape

(1832, 409)

In [None]:
url = "https://gitlab.pik-potsdam.de/belmin/livwelldata-paper/-/raw/main/analysis/data/raw_data/populationWB.csv"
pd.read_csv(url)