<a href="https://colab.research.google.com/github/fayshaw/data_preprocessing/blob/main/livwell.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing
## LivWell Dataset: Women and their Well-being for 52 Countries
### <a href="https://www.womenindata.org/">Women in Data</a> and <a href="https://www.meetup.com/pyladies-boston/">PyLadies Boston</a>
#### <a href="https://www.linkedin.com/in/fayshaw/">Fay Shaw</a>
August 21, 2025

Together, we will explore the LivWell dataset from Belmin et al.'s 2022 paper in *Scientific Data* (a Nature Portfolio journal), <a href="https://www.nature.com/articles/s41597-022-01824-2">LivWell: a sub-national Dataset on the Living Conditions of Women and their Well-being for 52 Countries</a>. The authors aggregated a longitudinal dataset from Demographic and Health Surveys (DHS) for subnational regions. Much of their work is the geographic harmonization of boundaries.

We will wrangle some raw data to look more like their published data set. <br>


*Figure 1: Flowchart representing the data processing steps to obtain LivWell. Orange: input data; green: indicators based on DHS data; blue: indicators based on gridded data; white: validation data.*

<img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41597-022-01824-2/MediaObjects/41597_2022_1824_Fig1_HTML.png" width="600">



In this notebook, we will look at some of the DHS STATcompiler data (which the authors used for validation) and compare it to their published output.

# Overview

1. Open the LivWell data set.
2. Look at the raw DHS STATcompiler data.
3. Try to get the raw data into a comparable form.

## Read files
Read the published file from a URL.

In [None]:
import pandas as pd
livwell_df = pd.read_csv('https://zenodo.org/records/7277104/files/livwell.csv')

### Explore the data

<img src="https://scentla.com/wp-content/uploads/2025/02/Efficiently-Create-and-Fill-Pandas-DataFrames-in-Python-1024x399.jpg" width=600>

Figure from https://datagy.io/pandas-drop-index-column

Resources
* <a href="https://realpython.com/pandas-python-explore-dataset/">Real Python dataframe resource</a>
* <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">PyData Pandas cheat sheet</a>

DataFrame `df`
* Show `df`
* `df.head()`
* `df.describe()`
* `df.columns`
* `df['column'].unique()`
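A minimal sketch of these calls on a small DataFrame. The values are invented for illustration and are not taken from LivWell. Note that `unique()` is a Series method, so it is called on a single column rather than on the whole DataFrame.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "country_name": ["A", "A", "B", "B"],
    "year": [2000, 2005, 2000, 2005],
    "value": [1.0, 2.0, 3.0, 4.0],
})

print(toy_df.head(2))            # first two rows
print(toy_df.describe())         # summary statistics for numeric columns
print(toy_df.columns.tolist())   # column names as a plain list

# unique() works on one column (a Series)
countries = toy_df["country_name"].unique()
print(countries)
```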

In [None]:
livwell_df

In [None]:
livwell_df.columns[:50]

In [None]:
indicators_df = pd.read_csv("https://zenodo.org/records/7277104/files/indicators.csv")
indicators_df.head(20)

Look at which countries are in this data set using the dataframe and the column name: `dataframe['column name']`

In [None]:
livwell_df['country_name'].unique()

In [None]:
len(set(livwell_df['country_name']))

## Filter to get data for one country
Armenia

In [None]:
# Boolean mask
livwell_df['country_name'] == 'Armenia'

In [None]:
livwell_armenia = livwell_df[livwell_df['country_name'] == 'Armenia']

print("years: " , set(livwell_armenia['year']))
print("regions: ", set(livwell_armenia['region_name_harmonized']))
livwell_armenia.head(12)

### More to explore with dataframes
pandas DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
* `df.shape`
* `df.dtypes`
* `df['column'].value_counts()`
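As a rough illustration, here are the same three attributes on a made-up table. The column names and values are assumptions chosen only to show what each call returns.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "year": ["2000", "2000", "2005", "2005"],  # stored as text on purpose
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# dtypes reveals that 'year' is a text column, not an integer column
print(toy_df.dtypes)

# value_counts tallies how often each value appears in a column
region_counts = toy_df["region"].value_counts()
print(region_counts)
```

Checking `dtypes` early is useful here because a year stored as text will not match a year stored as an integer when two tables are merged.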

## Read raw education file

Manually upload the file `STATcompilerExport_education.csv`

In [None]:
from google.colab import files
uploaded = files.upload()

Notice the `Unnamed` column titles, along with the NaN rows at the top and bottom of the file.

In [None]:
education = pd.read_csv("STATcompilerExport_education.csv")
education

How many values in the first row are null? Can we safely skip these rows?

In [None]:
education.iloc[0].isnull().sum()

In [None]:
education.tail(12)

Read the file by skipping the top NaN rows and bottom rows
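A small sketch of how `skiprows` and `skipfooter` behave, using an in-memory text that imitates an export with title lines on top and notes at the bottom. The text content is made up; only the two parameters mirror the real call.

```python
import io
import pandas as pd

# Made-up lines imitating an export: 3 title lines, a header, 2 data rows, 2 footer lines
lines = [
    "Export title",
    "Generated by a hypothetical tool",
    "Indicators: made-up",
    "Country,Survey,Value",
    "Armenia,2010 DHS,1.5",
    "Armenia,2015-16 DHS,2.5",
    "Source: made-up footer",
    "Note: another footer line",
]
raw_text = "\n".join(lines)

# skipfooter is only supported by the Python parsing engine
toy = pd.read_csv(io.StringIO(raw_text), skiprows=3, skipfooter=2, engine="python")
print(toy)
```

The number of lines to skip depends on the file, so it is worth inspecting the head and tail of the raw file first, as done above.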

In [None]:
stat_education = pd.read_csv('STATcompilerExport_education.csv', skiprows=3, skipfooter=11, engine='python')
stat_education

In [None]:
set(stat_education['Country'])

In [None]:
len(set(stat_education['Country']))

### One country example

* Get data for just Armenia
* Make a deep `.copy()` so you are not operating on a view and avoid the <a href="https://realpython.com/pandas-settingwithcopywarning/">`SettingWithCopyWarning`</a>
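A minimal sketch of the filter-then-copy pattern on invented data. After `.copy()`, the filtered result is an independent DataFrame, so adding a column to it does not touch the original.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"Country": ["Armenia", "Armenia", "Burundi"], "Value": [1, 2, 3]})

# Filter, then copy so the result is an independent DataFrame
armenia_only = toy[toy["Country"] == "Armenia"].copy()

# Adding a column to the copy leaves the original untouched
armenia_only["Value_doubled"] = armenia_only["Value"] * 2

print(armenia_only)
print(toy.columns.tolist())
```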

In [None]:
stat_education_armenia = stat_education[stat_education['Country'] == 'Armenia'].copy()
stat_education_armenia.head(15)

In [None]:
stat_education_armenia.columns

Rename one column

### Get the survey years

In [None]:
set(stat_education_armenia['Survey'])

Split the survey text on whitespace to separate the year from the survey type. This creates two new columns, `year` and `source`, which appear at the right side of the table.
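One possible refinement, sketched on made-up labels: if a survey label ever had more than one word after the year, a plain split would produce three parts and a two-column assignment would fail. The three-word label below is a hypothetical edge case, not something observed in the export. Passing `n=1` splits only at the first space.

```python
import pandas as pd

# Made-up survey labels; the three-word label is a hypothetical edge case
surveys = pd.Series(["2010 DHS", "2015-16 DHS", "2007 Special DHS"])

# n=1 splits only at the first space, so there are always exactly two parts
parts = surveys.str.split(" ", n=1, expand=True)
parts.columns = ["year", "source"]
print(parts)
```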



In [None]:
stat_education_armenia[['year','source']] = stat_education_armenia.loc[:, 'Survey'].str.split(expand=True)
stat_education_armenia.head(15)

### Rename the year text 2015-16 to 2016.
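A single `replace` works for one country. For many countries, a small helper could convert any span to its final year. The function below is a hypothetical sketch, and it assumes the end year is the right convention, which matches the 2015-16 to 2016 choice used here.

```python
def survey_end_year(year_text: str) -> int:
    """Return the final year of a label such as '2010' or '2015-16'."""
    if "-" not in year_text:
        return int(year_text)
    start, end = year_text.split("-")
    # Replace the last digits of the start year with the short end year
    year = int(start[: len(start) - len(end)] + end)
    # Handle a span that crosses a century, e.g. '1999-00' -> 2000
    if year < int(start):
        year += 10 ** len(end)
    return year


print(survey_end_year("2010"))
print(survey_end_year("2015-16"))
print(survey_end_year("1999-00"))
```

With pandas, such a helper could be applied to a text column using `Series.map`.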

In [None]:
stat_education_armenia['year'] = stat_education_armenia['year'].replace('2015-16', '2016')
stat_education_armenia.head()

Similarly, extract the region name by splitting on the colon separator `" : "`.

In [None]:
stat_education_armenia['region'] = stat_education_armenia.loc[:, 'Characteristic'].str.split(" : ").str[1]
stat_education_armenia.head()

In [None]:
# Drop rows that are not regions
stat_education_armenia = stat_education_armenia[~stat_education_armenia['Characteristic'].str.contains('Total')]
stat_education_armenia.head()

In [None]:
stat_ed_armenia = stat_education_armenia.drop(columns=['Survey', 'Characteristic'])
stat_ed_armenia.head()

In [None]:
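# Rename education columns
# Caution: two source columns below map to 'ED_attainment_secondary_higher_p';
# if both are present in the export, the result will have duplicate column names.
rename_ed_cols = {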
    'Women with no education' : 'ED_attainment_no_educ_p',
    'Women with completed primary education' : 'ED_attainment_primary_completed_p',
    'Women with completed secondary education' : 'ED_attainment_secondary_completed_p',
    'Women with more than secondary education' : 'ED_attainment_secondary_higher_p',
    'Women with primary education' : 'ED_attainment_primary_p',
    'Women with secondary or higher education' : 'ED_attainment_secondary_higher_p',
    'Median years of education completed [Women]' : 'ED_educ_years_median'
}

In [None]:
stat_ed_armenia = stat_ed_armenia.rename(columns=rename_ed_cols)
stat_ed_armenia.head()

### Choose and reorder columns

In [None]:
stat_ed_armenia_df = stat_ed_armenia[['Country', 'source', 'year', 'region', 'ED_educ_years_median', 'ED_attainment_secondary_completed_p']]
stat_ed_armenia_df

### LivWell Armenia education columns

In [None]:
lw_ed_cols = livwell_armenia.columns[livwell_armenia.columns.str.contains('ED')].to_list()
lw_ed_cols

In [None]:
# Get year and country data
livwell_df.columns[:8].to_list()

In [None]:
# Columns of interest in the LivWell data set
lw_year_ed_cols = livwell_df.columns[:8].to_list() + lw_ed_cols
lw_year_ed_cols

In [None]:
lw_ed_arm = livwell_armenia[lw_year_ed_cols]
lw_ed_arm.head()

In [None]:
stat_ed_armenia.head()

In [None]:
stat_ed_armenia.columns

In [None]:
rename_cols = {'Median years of education completed [Women]' : 'ED_educ_years_median'}

stat_ed_armenia_renamed = stat_ed_armenia.rename(columns=rename_cols)

cols_reorder = ['Country', 'year', 'source', 'region', 'ED_attainment_no_educ_p',
                'ED_attainment_primary_p', 'ED_attainment_primary_completed_p',
                'ED_attainment_secondary_higher_p',
                'ED_attainment_secondary_completed_p', 'ED_educ_years_median',
                'Women who can read a whole sentence', 'Women who are literate']
stat_ed_armenia_renamed = stat_ed_armenia_renamed[cols_reorder]
stat_ed_armenia_renamed.head(12)

## Get data in the same format to merge on year and region.
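Merge keys need matching dtypes, which is why the year columns are converted to integers before merging. The sketch below uses two invented tables to show the conversion, plus an outer merge with `indicator=True` as one way to see which rows failed to match. The table and column names are assumptions for illustration.

```python
import pandas as pd

# Made-up tables: one stores the year as text, the other as an integer
left = pd.DataFrame({"region": ["North", "South"], "year": ["2010", "2016"], "a": [1.0, 2.0]})
right = pd.DataFrame({"region_name": ["North", "South", "East"],
                      "year": [2010, 2016, 2016], "b": [10.0, 20.0, 30.0]})

# Align the key dtype before merging
left["year"] = left["year"].astype(int)

# how="outer" keeps unmatched rows; indicator=True adds a '_merge' column
# saying whether each row came from the left table, the right table, or both
check = left.merge(right, left_on=["region", "year"], right_on=["region_name", "year"],
                   how="outer", indicator=True)
print(check["_merge"].value_counts())
```

Rows labeled `left_only` or `right_only` usually point to spelling differences in region names or to years that exist in only one source.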

In [None]:
stat_ed_armenia_renamed.columns

In [None]:
lw_ed_arm.columns

In [None]:
stat_ed_armenia_renamed['year'].dtypes

In [None]:
stat_ed_armenia_renamed['year'] = stat_ed_armenia_renamed['year'].astype(int)
stat_ed_armenia_renamed['year'].dtype

In [None]:
print(lw_ed_arm['year'].dtype)
# Work on a copy to avoid the SettingWithCopyWarning
lw_ed_arm = lw_ed_arm.copy()
lw_ed_arm['year'] = lw_ed_arm['year'].astype(int)

In [None]:
merged_df = stat_ed_armenia_renamed.merge(lw_ed_arm, left_on=['region', 'year'],
            right_on=['region_name_harmonized', 'year'],
            suffixes=('_stat', '_lw'))

merged_df.columns

And hopefully they match!
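Rather than comparing by eye, the two sources can be compared numerically. This is a sketch on invented numbers; the column names and the 0.5 tolerance are arbitrary assumptions, not values from the paper.

```python
import numpy as np
import pandas as pd

# Made-up merged table with one indicator from each source
toy_merged = pd.DataFrame({
    "region": ["North", "South", "East"],
    "value_stat": [10.0, 20.4, 30.0],
    "value_lw": [10.0, 20.0, 35.0],
})

# Absolute difference between the two sources
toy_merged["abs_diff"] = (toy_merged["value_stat"] - toy_merged["value_lw"]).abs()

# Flag rows that agree within a tolerance (0.5 is an arbitrary, assumed threshold)
toy_merged["close"] = np.isclose(toy_merged["value_stat"], toy_merged["value_lw"], atol=0.5)
print(toy_merged)
```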

In [None]:
merged_df[['year', 'region', 'ED_educ_years_mean', 'ED_educ_years_median',
           'ED_attainment_secondary_completed_p_lw', 'ED_attainment_secondary_completed_p_stat',
           'ED_attainment_secondary_higher_p_lw', 'ED_attainment_secondary_higher_p_stat']]
#           'ED_attainment_secondary_higher_p_se', 'ED_attainment_secondary_higher_p_se_lw']]

## Second data source: Global Data Lab Mean International Wealth Index
We will clean it and join it to the education data

In [None]:
uploaded = files.upload()

In [None]:
gdl = pd.read_csv("GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv")
gdl

In [None]:
armenia_gdl = gdl[gdl['Country'] == "Armenia"]
armenia_gdl

In [None]:
armenia_gdl.notna()

In [None]:
# Count the missing values in each column
armenia_gdl.isna().sum()

In [None]:
armenia_gdl = armenia_gdl[['Country', 'ISO_Code', 'GDLCODE', 'Region', '2000', '2010', '2016']]

### DataFrame melt

One reason to melt a dataframe from wide to long is that it may be easier to plot.  We will melt our dataframe and then create a scatter plot.

<img src="https://pandas.pydata.org/pandas-docs/version/0.25.1/_images/reshaping_melt.png" width=800>

Figure from [pandas.pydata.org](https://pandas.pydata.org/pandas-docs/version/0.25.1/user_guide/reshaping.html#reshaping-by-melt)
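A minimal sketch of `melt` on a made-up wide table, with one row per region and one column per year. The numbers are invented.

```python
import pandas as pd

# Made-up wide table: one row per region, one column per year
wide = pd.DataFrame({
    "Region": ["North", "South"],
    "2000": [40.0, 50.0],
    "2010": [45.0, 55.0],
})

# id_vars stay as columns; every other column becomes a (Year, IWI) pair
long = wide.melt(id_vars=["Region"], var_name="Year", value_name="IWI")
print(long)
```

The long table has one row per region and year, which is the shape that plotting libraries typically expect.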

In [None]:
# TODO: remove Total from Region
armenia_gdl_melt = armenia_gdl.melt(id_vars=['Country', 'ISO_Code','GDLCODE', 'Region'], var_name='Year', value_name='IWI')
armenia_gdl_melt

In [None]:
gdl = pd.read_csv("GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv")
armenia_gdl = gdl[gdl['Country'] == 'Armenia']
armenia_gdl

Find the columns that are numbers.

In [None]:
gdl_year_cols = [x for x in gdl.columns if str.isdigit(x)]
gdl_year_cols

In [None]:
gdl_data = gdl[['Country', 'GDLCODE', 'Region'] + gdl_year_cols]
gdl_data

In [None]:
gdl_data = gdl[['Country', 'GDLCODE'] + gdl_year_cols]
gdl_data

In [None]:
gdl_melt = gdl.melt(id_vars=['Country', 'ISO_Code','GDLCODE', 'Region'], var_name='Year', value_name='IWI')
gdl_melt

### Filter the data for plotting
Choose to include only regional data. Drop rows that have the string 'Level' in the Year column or 'National' in the IWI column.
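An alternative to filtering after the melt is to melt only the year columns, so that text columns such as `Level` never end up in the value column. The sketch below runs on an invented table. The `Level` labels (`National`, `Subnat`) are assumed for illustration and should be checked against the real file.

```python
import pandas as pd

# Made-up wide table; the 'Level' values are assumed labels for illustration
wide = pd.DataFrame({
    "Country": ["X", "X", "X"],
    "Level": ["National", "Subnat", "Subnat"],
    "Region": ["Total", "North", "South"],
    "2000": [48.0, 40.0, 50.0],
    "2010": [52.0, 45.0, None],
})

year_cols = [c for c in wide.columns if c.isdigit()]

# Melt only the year columns so 'Level' stays available for filtering
long = wide.melt(id_vars=["Country", "Level", "Region"], value_vars=year_cols,
                 var_name="Year", value_name="IWI")

# Keep sub-national rows that have a value, and make Year numeric for plotting
regional = long[(long["Level"] != "National") & long["IWI"].notna()].copy()
regional["Year"] = regional["Year"].astype(int)
print(regional)
```

With this approach the `IWI` column stays numeric, which avoids mixed text and numbers on the plot axis.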

In [None]:
gdl_region = gdl_melt.drop(columns='Region')
gdl_region = gdl_region[~((gdl_region['Year'] == 'Level') | (gdl_region['IWI'] == 'National'))]
gdl_region

# Data visualization
Plot a subset of countries

In [None]:
livwell_gdl_subset = ['Armenia', 'Burundi', 'Cambodia', 'Dominican Republic', 'El Salvador',
                      'Fiji', 'Gabon', 'Haiti', 'Tanzania', 'Turkey', 'Yemen', 'Zimbabwe']
livwell_gdl_countries = set(livwell_df['country_name']) & set(gdl_melt['Country'])

print("Countries in LivWell and GDL datasets: ", len(livwell_gdl_countries))
print("Number of subset countries: ", len(livwell_gdl_subset))

#### Pick countries where data is available.
Find data that has good coverage across years. Here, I checked for rows in 1992 that have a value for the International Wealth Index (IWI).

In [None]:
print(gdl_region[(gdl_region['Year'] == '1992') & (gdl_region['IWI'].notna())])
print(gdl_region.head(10))

livwell_gdl = gdl_region[gdl_region['Country'].isin(livwell_gdl_subset)]

Keep only the countries in the subset list, then create a scatter plot of IWI by year, colored by country.

In [None]:
import plotly.express as px

fig = px.scatter(livwell_gdl, x="Year", y="IWI", color="Country")
fig.show()

In [None]:
gdl_subset = ['Armenia', 'Burundi', 'Cambodia', 'Dominican Republic', 'El Salvador',
              'Fiji', 'Gabon', 'Haiti', 'Tanzania', 'Turkey', 'Yemen', 'Zimbabwe']
gdl_subset_data = gdl_region[gdl_region['Country'].isin(gdl_subset)]
fig = px.scatter(gdl_subset_data, x="Year", y="IWI", color="Country")
fig.show()

In [None]:
gdl_subset_data

# More to explore

The authors incorporated many approaches in their work, including:

* Analysis in R.  Check out their <a href="https://gitlab.pik-potsdam.de/belmin/livwelldata">LivWell R repository</a>. They linearly interpolated data using the R package `imputeTS`.
* Collapsed categories for modern and traditional cooking fuel:
  * Modern: electricity, liquefied petroleum gas, natural gas, kerosene and biogas
  * Traditional: biomass (firewood, charcoal, agricultural crops, coal)
  * This could also be described as recoding, label encoding or feature engineering
* Recoded drinking water quality to low, medium, high quality.  
* Geographic data. The authors harmonized variables over time and across countries.  
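Two of these ideas can be tried in pandas. The sketch below recodes fuel types into the two categories listed above and fills gaps with linear interpolation. It runs on invented data and uses pandas `interpolate` as a stand-in; it is not the authors' R workflow with `imputeTS`.

```python
import pandas as pd

# Made-up survey rows; fuel names follow the categories listed above
toy = pd.DataFrame({
    "year": [2000, 2005, 2010, 2015],
    "fuel": ["firewood", "charcoal", "kerosene", "electricity"],
    "indicator": [10.0, None, None, 25.0],
})

# Collapse detailed fuel types into two categories
fuel_groups = {
    "electricity": "modern", "liquefied petroleum gas": "modern",
    "natural gas": "modern", "kerosene": "modern", "biogas": "modern",
    "firewood": "traditional", "charcoal": "traditional",
    "agricultural crops": "traditional", "coal": "traditional",
}
toy["fuel_group"] = toy["fuel"].map(fuel_groups)

# Fill gaps with straight-line interpolation; method="linear" treats rows as
# equally spaced, which holds for the evenly spaced toy years used here
toy["indicator_filled"] = toy["indicator"].interpolate(method="linear")
print(toy)
```

Any fuel type missing from the dictionary would map to NaN, which makes unexpected categories easy to spot.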