<a href="https://colab.research.google.com/github/fayshaw/data_preprocessing/blob/main/livwell.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing Best Practices

## LivWell Dataset: Women and their Well-being for 52 Countries

### <a href="https://www.womenindata.org/">Women in Data Boston</a> and <a href="https://www.meetup.com/pyladies-boston/">PyLadies Boston</a>
#### <a href="https://www.linkedin.com/in/fayshaw/">Fay Shaw</a>
August 21, 2025

Together, we will explore the LivWell dataset from Belmin et al.'s 2022 paper in *Scientific Data* (a Nature Portfolio journal), <a href="https://www.nature.com/articles/s41597-022-01824-2">LivWell: a sub-national Dataset on the Living Conditions of Women and their Well-being for 52 Countries</a>. The authors constructed a longitudinal dataset using Demographic and Health Surveys (DHS), GDP data, and climate data for subnational regions from 1990 to 2019.

*Figure 1: Flowchart representing the data processing steps to obtain LivWell. Orange: input data; green: indicators based on DHS data; blue: indicators based on gridded data; white: validation data.*

<img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41597-022-01824-2/MediaObjects/41597_2022_1824_Fig1_HTML.png" width="600">


In this notebook, we will look at some of the DHS STAT data and compare it to the authors' data output.

🚩 <a href="https://github.com/fayshaw/data_preprocessing">GitHub repository</a>



# 🎯 Goals for this session
1. Learn the context of the data.
2. Transform a raw dataset to compare it to LivWell.
3. Visualize data.

# 📚 Goals for different learners
* 🌟 If you are new, welcome!  Learn how to run Colab notebooks and get an introduction to dataframes.
* 🧰 If you are familiar with dataframes and notebooks, here are some transformation tools.
* 🐍 If you are a savvy Pythonista, check out the 🔎 *more to explore* prompts!  Combine different datasets or challenge yourself to reproduce the authors' R analysis code in Python!

## 🗂️ Files
1. Load LivWell dataset using URLs: `livwell.csv` and `indicators.csv`
2. Load DHS STAT raw data: `STATcompilerExport_decision_power.csv`
3. Load global wealth indicators: `GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv`

Both the DHS STAT and Global Mean data are found in the authors' <a href="https://gitlab.pik-potsdam.de/belmin/livwelldata-paper/-/tree/main/analysis/data/raw_data/validation_data?ref_type=heads">validation data folder</a> on GitLab.

# 💡 1. Learn the context by exploring the data

## Read files
The first thing to do is to look at the LivWell data.  We can open it in Excel, or read the file with pandas.

In [1]:
import pandas as pd
livwell_df = pd.read_csv('https://zenodo.org/records/7277104/files/livwell.csv')

### Explore the data

<img src="https://scentla.com/wp-content/uploads/2025/02/Efficiently-Create-and-Fill-Pandas-DataFrames-in-Python-1024x399.jpg" width=600>

Figure from https://datagy.io/pandas-drop-index-column

Resources
* <a href="https://realpython.com/pandas-python-explore-dataset/">Real Python dataframe resource</a>
* <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">PyData Pandas cheat sheet</a>

DataFrame `df`
* Show `df`
* `df.head()`
* `df.describe()`
* `df.columns`
* `df['col'].unique()` (`unique()` works on a single column, not on the whole dataframe)
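
To try these calls without downloading anything, here is a minimal sketch on a tiny dataframe. The values are made up for illustration and are not taken from LivWell. Note that `unique()` is called on one column (a Series), since a DataFrame has no `unique()` method.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from LivWell
df = pd.DataFrame({
    "country_name": ["Armenia", "Armenia", "Togo"],
    "year": [2000, 2005, 2014],
    "value": [1.5, 2.5, 3.5],
})

print(df.head(2))                    # first two rows
print(df.describe())                 # summary statistics for the numeric columns
print(list(df.columns))              # column names
print(df["country_name"].unique())   # distinct values in one column
```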

In [2]:
livwell_df

Unnamed: 0,country_name,country_code,year,region_num_harmonized,region_name_harmonized,SurveyId,interview_year_mean,interview_month_mean,CMC_interview_mean,DM_age_mean,...,drought_spei03_n1_share36,drought_spei03_n1_share60,drought_spei03_n1.5_share12,drought_spei03_n1.5_share36,drought_spei03_n1.5_share60,drought_spei03_n2_share12,drought_spei03_n2_share36,drought_spei03_n2_share60,hdi,gdp_pc
0,Armenia,ARM,2000,1,Aragatsotn,AM2000DHS,2000.0,11.0,1210.53,30.71,...,0.388889,0.316667,0.333333,0.250000,0.166667,0.083333,0.083333,0.050000,0.644083,2938.187500
1,Armenia,ARM,2000,2,Ararat,AM2000DHS,2000.0,11.0,1210.55,30.38,...,0.416667,0.316667,0.333333,0.277778,0.233333,0.083333,0.083333,0.050000,0.644127,3053.040283
2,Armenia,ARM,2000,3,Armavir,AM2000DHS,2000.0,10.0,1210.43,31.10,...,0.361111,0.300000,0.333333,0.250000,0.166667,0.083333,0.083333,0.050000,0.644415,3003.245605
3,Armenia,ARM,2000,4,Gegharkunik,AM2000DHS,2000.0,11.0,1210.58,30.65,...,0.416667,0.316667,0.250000,0.194444,0.166667,0.083333,0.083333,0.050000,0.643942,2945.085449
4,Armenia,ARM,2000,5,Lori,AM2000DHS,2000.0,10.0,1210.43,31.57,...,0.388889,0.316667,0.333333,0.222222,0.150000,0.083333,0.083333,0.050000,0.645256,2925.469727
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1827,Zimbabwe,ZWE,2015,6,Matabeleland South,ZW2015DHS,2015.0,9.0,1389.00,27.65,...,0.388889,0.333333,0.333333,0.250000,0.216667,0.250000,0.083333,0.066667,0.516884,1864.769000
1828,Zimbabwe,ZWE,2015,7,Midlands,ZW2015DHS,2015.0,9.0,1388.60,27.89,...,0.388889,0.316667,0.250000,0.138889,0.150000,0.250000,0.083333,0.050000,0.516000,1687.976000
1829,Zimbabwe,ZWE,2015,8,Masvingo,ZW2015DHS,2015.0,9.0,1388.91,28.69,...,0.250000,0.216667,0.166667,0.055556,0.066667,0.083333,0.027778,0.016667,0.515188,1687.113000
1830,Zimbabwe,ZWE,2015,9,Harare/Chitungwiza,ZW2015DHS,2015.0,9.0,1388.71,28.67,...,0.416667,0.333333,0.416667,0.194444,0.116667,0.250000,0.083333,0.050000,0.516000,1687.976000


In [3]:
livwell_df.columns

Index(['country_name', 'country_code', 'year', 'region_num_harmonized',
       'region_name_harmonized', 'SurveyId', 'interview_year_mean',
       'interview_month_mean', 'CMC_interview_mean', 'DM_age_mean',
       ...
       'drought_spei03_n1_share36', 'drought_spei03_n1_share60',
       'drought_spei03_n1.5_share12', 'drought_spei03_n1.5_share36',
       'drought_spei03_n1.5_share60', 'drought_spei03_n2_share12',
       'drought_spei03_n2_share36', 'drought_spei03_n2_share60', 'hdi',
       'gdp_pc'],
      dtype='object', length=409)

In [4]:
indicators_df = pd.read_csv("https://zenodo.org/records/7277104/files/indicators.csv")
indicators_df

Unnamed: 0,indicator_category,indicator_code,indicator_description
0,Individual demographic information,DM_age_mean,Average age of women
1,Individual demographic information,DM_age_15-19_p,Women in age category 15-19 (%)
2,Individual demographic information,DM_age_20-24_p,Women in age category 20-24 (%)
3,Individual demographic information,DM_age_25-29_p,Women in age category 25-29 (%)
4,Individual demographic information,DM_age_30-34_p,Women in age category 30-34 (%)
...,...,...,...
260,Drought,drought_spei03_n1.5_share36,Share of months in the past 36 months with dro...
261,Drought,drought_spei03_n1.5_share60,Share of months in the past 60 months with dro...
262,Drought,drought_spei03_n2_share12,Share of months in the past 12 months with dro...
263,Drought,drought_spei03_n2_share36,Share of months in the past 36 months with dro...


Look at one column of data using the dataframe `df` and the column name `col`: `df['col']`.  You can use `set()` to find the set of values.

In [5]:
set(indicators_df['indicator_category'])

{'Decision power',
 'Domestic Violence',
 'Drought',
 'Education',
 'Energy and information',
 'Energy and information – Household level',
 'Energy and information – per urban/rural area',
 'Fertility preferences',
 'Fertility – complex indicators',
 'Health',
 'Health – Birth level',
 'Household Wealth',
 'Household characteristics – Household level',
 'Household characteristics – Womens level',
 'Household demographics – Household level',
 'Household household demographics – Household level',
 'Individual demographic information',
 'Nutrition',
 'Precipitation',
 'Reproductive health and fertility',
 'Socio-economic indicators',
 'Standardized Precipitation Evapotranspiration Index (SPEI)',
 'Temperature',
 'Work status'}

In [6]:
# Find unique countries
livwell_df['country_name'].unique()

array(['Armenia', 'Burundi', 'Benin', 'Burkina Faso', 'Bangladesh',
       'Bolivia', "Cote d'Ivoire", 'Cameroon',
       'Congo Democratic Republic', 'Colombia', 'Egypt', 'Ethiopia',
       'Gabon', 'Ghana', 'Guinea', 'Guatemala', 'Honduras', 'Haiti',
       'Indonesia', 'India', 'Jordan', 'Kenya', 'Cambodia', 'Liberia',
       'Lesotho', 'Morocco', 'Madagascar', 'Maldives', 'Mali',
       'Mozambique', 'Malawi', 'Namibia', 'Niger', 'Nigeria', 'Nicaragua',
       'Nepal', 'Pakistan', 'Peru', 'Philippines', 'Rwanda', 'Senegal',
       'Sierra Leone', 'Togo', 'Tajikistan', 'Timor-Leste', 'Turkey',
       'Tanzania', 'Uganda', 'Vietnam', 'South Africa', 'Zambia',
       'Zimbabwe'], dtype=object)

In [7]:
# Count the unique countries
len(livwell_df['country_name'].unique())

52
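
As a side note, pandas also offers `Series.nunique()`, which gives the same count in one call. The sketch below uses a small made-up Series rather than the LivWell column. One difference to keep in mind: `nunique()` ignores missing values by default, while `unique()` includes them.

```python
import pandas as pd

# Toy column with repeated names (hypothetical, not the LivWell data)
countries = pd.Series(["Armenia", "Armenia", "Benin", "Togo", "Togo"])

n_from_len = len(countries.unique())   # count via the array of unique values
n_from_nunique = countries.nunique()   # count directly

print(n_from_len, n_from_nunique)
```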

## Filter to get data for one country


In [8]:
# Choose one column and check if the value is Armenia.
livwell_df['country_name'] == 'Armenia'

Unnamed: 0,country_name
0,True
1,True
2,True
3,True
4,True
...,...
1827,False
1828,False
1829,False
1830,False


In [9]:
# Create a new dataframe livwell_armenia for that country
livwell_armenia = livwell_df[livwell_df['country_name'] == 'Armenia']

# Years with survey data
print("years: " , set(livwell_armenia['year']))

# Regions in Armenia
print("regions: ", set(livwell_armenia['region_name_harmonized']))
livwell_armenia.head(12)

years:  {2000, 2010, 2016, 2005}
regions:  {'Tavush', 'Syunik', 'Ararat', 'Yerevan', 'Aragatsotn', 'Armavir', 'Kotayk', 'Gegharkunik', 'Vayots Dzor', 'Shirak', 'Lori'}


Unnamed: 0,country_name,country_code,year,region_num_harmonized,region_name_harmonized,SurveyId,interview_year_mean,interview_month_mean,CMC_interview_mean,DM_age_mean,...,drought_spei03_n1_share36,drought_spei03_n1_share60,drought_spei03_n1.5_share12,drought_spei03_n1.5_share36,drought_spei03_n1.5_share60,drought_spei03_n2_share12,drought_spei03_n2_share36,drought_spei03_n2_share60,hdi,gdp_pc
0,Armenia,ARM,2000,1,Aragatsotn,AM2000DHS,2000.0,11.0,1210.53,30.71,...,0.388889,0.316667,0.333333,0.25,0.166667,0.083333,0.083333,0.05,0.644083,2938.1875
1,Armenia,ARM,2000,2,Ararat,AM2000DHS,2000.0,11.0,1210.55,30.38,...,0.416667,0.316667,0.333333,0.277778,0.233333,0.083333,0.083333,0.05,0.644127,3053.040283
2,Armenia,ARM,2000,3,Armavir,AM2000DHS,2000.0,10.0,1210.43,31.1,...,0.361111,0.3,0.333333,0.25,0.166667,0.083333,0.083333,0.05,0.644415,3003.245605
3,Armenia,ARM,2000,4,Gegharkunik,AM2000DHS,2000.0,11.0,1210.58,30.65,...,0.416667,0.316667,0.25,0.194444,0.166667,0.083333,0.083333,0.05,0.643942,2945.085449
4,Armenia,ARM,2000,5,Lori,AM2000DHS,2000.0,10.0,1210.43,31.57,...,0.388889,0.316667,0.333333,0.222222,0.15,0.083333,0.083333,0.05,0.645256,2925.469727
5,Armenia,ARM,2000,6,Kotayk,AM2000DHS,2000.0,10.0,1210.48,31.15,...,0.416667,0.316667,0.333333,0.277778,0.183333,0.083333,0.083333,0.05,0.644,2918.557617
6,Armenia,ARM,2000,7,Shirak,AM2000DHS,2000.0,10.0,1210.43,31.8,...,0.388889,0.35,0.333333,0.194444,0.133333,0.083333,0.083333,0.05,0.645674,3053.684326
7,Armenia,ARM,2000,8,Syunik,AM2000DHS,2000.0,10.0,1210.42,31.37,...,0.416667,0.316667,0.166667,0.222222,0.183333,0.083333,0.055556,0.066667,0.644479,3086.177002
8,Armenia,ARM,2000,9,Vayots Dzor,AM2000DHS,2000.0,10.0,1210.41,31.69,...,0.388889,0.3,0.333333,0.277778,0.216667,0.083333,0.194444,0.133333,0.643944,2969.26123
9,Armenia,ARM,2000,10,Tavush,AM2000DHS,2000.0,10.0,1210.46,31.28,...,0.416667,0.3,0.333333,0.194444,0.133333,0.083333,0.083333,0.05,0.644377,3000.003906


## 🔎 More to explore with dataframes
pandas DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
* `df.shape`
* `df.dtypes`
* `df['column'].value_counts()`
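
A quick sketch of these three on a toy dataframe (made-up rows, not LivWell). On the real data, `livwell_df['country_name'].value_counts()` would show how many region-year rows each country contributes.

```python
import pandas as pd

# Toy data: hypothetical rows for illustration only
df = pd.DataFrame({
    "country_name": ["Armenia", "Armenia", "Togo"],
    "year": [2000, 2005, 2014],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
counts = df["country_name"].value_counts()   # rows per country
print(counts)
```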

# 🚀 2. Transform raw dataset to compare to LivWell

## Read STAT file

Manually upload the file `STATcompilerExport_decision_power.csv`

In [10]:
from google.colab import files
uploaded = files.upload()

Saving STATcompilerExport_decision_power.csv to STATcompilerExport_decision_power.csv


🔎 Notice that there are `Unnamed` column names at the top, along with NaN rows at the top and bottom.

In [11]:
stat_power = pd.read_csv("STATcompilerExport_decision_power.csv")
stat_power

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,
2,Country,Survey,Characteristic,Family planning use decisionmaking mainly by wife,Family planning non-use decisionmaking mainly ...,LivWell own house,Do not own a house [Women],,Do not own land [Women],Decision maker about Own health care: Mainly w...,Decision maker about Major household purchases...,Decision maker about Visits to her family or r...,Final say in own health care [Women],Final say in making large purchases [Women],"Final say in visits to family, relatives, frie...",Women who decide themselves how their earnings...,Wife earns more than husband
3,Armenia,2015-16 DHS,Total,15.9,19.5,,51.5,,84.3,,,,96,80.3,92.3,27.8,8.3
4,Armenia,2015-16 DHS,Region : Aragatsotn,25.2,20.9,75.94,24.1,100.04,47.3,,,,97.9,89.9,85.8,24,15.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
723,Final say in making large purchases [Women],Percentage of women who say that they alone or...,,,,,,,,,,,,,,,
724,"Final say in visits to family, relatives, frie...",Percentage of women who say that they alone or...,,,,,,,,,,,,,,,
725,Women who decide themselves how their earnings...,Percentage of currently married or in union wo...,,,,,,,,,,,,,,,
726,Wife earns more than husband,Percentage of currently married or in union wo...,,,,,,,,,,,,,,,


🤔 How many rows are null?  Can we safely skip them?

* The top 2 rows are null, and we want the third row to supply the column titles.
* Bottom rows are not regional data.
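
One way to answer the "how many rows are null?" question for every row at once is to test whether all cells in a row are missing. This is a sketch on a small made-up frame that mimics the layout of the export (blank rows, then a title row, then data); it is not the STAT file itself.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the raw export: blank rows, a title row, data, a blank row
raw = pd.DataFrame({
    "a": [np.nan, np.nan, "Country", "Armenia", np.nan],
    "b": [np.nan, np.nan, "Survey", "2000 DHS", np.nan],
})

# True for rows where every cell is missing
all_null = raw.isnull().all(axis=1)

print(all_null.sum())              # number of fully empty rows
print(list(raw.index[all_null]))   # where those rows are
```

Knowing the positions of the empty rows helps you pick values for `skiprows` and `skipfooter` instead of guessing.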

In [12]:
stat_power.iloc[0].isnull().sum()

np.int64(17)

In [13]:
# Check out the tail end.  Rows 714-727 are not regional data.
stat_power.tail(15)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
713,Togo,2013-14 DHS,Decision : Visits to her family or relatives,,,,,,,,,17.4,,,,,
714,,,,,,,,,,,,,,,,,
715,Family planning use decisionmaking mainly by wife,Among currently married women using family pla...,,,,,,,,,,,,,,,
716,Family planning non-use decisionmaking mainly ...,Among currently married women not currently us...,,,,,,,,,,,,,,,
717,Do not own a house [Women],Percentage of women who do not own a house,,,,,,,,,,,,,,,
718,Do not own land [Women],Percentage of women who do not own land,,,,,,,,,,,,,,,
719,Decision maker about Own health care: Mainly w...,Percentage of women for whom the decision make...,,,,,,,,,,,,,,,
720,Decision maker about Major household purchases...,Percentage of women for whom the decision make...,,,,,,,,,,,,,,,
721,Decision maker about Visits to her family or r...,Percentage of women for whom the decision make...,,,,,,,,,,,,,,,
722,Final say in own health care [Women],Percentage of women who say that they alone or...,,,,,,,,,,,,,,,


In [14]:
# Read the file again by skipping the top NaN rows and bottom rows
stat_power = pd.read_csv('STATcompilerExport_decision_power.csv', skiprows=3, skipfooter=14, engine='python')
stat_power

Unnamed: 0,Country,Survey,Characteristic,Family planning use decisionmaking mainly by wife,Family planning non-use decisionmaking mainly by wife,LivWell own house,Do not own a house [Women],Unnamed: 7,Do not own land [Women],Decision maker about Own health care: Mainly wife [Women],Decision maker about Major household purchases: Mainly wife [Women],Decision maker about Visits to her family or relatives: Mainly wife [Women],Final say in own health care [Women],Final say in making large purchases [Women],"Final say in visits to family, relatives, friends [Women]",Women who decide themselves how their earnings are used,Wife earns more than husband
0,Armenia,2015-16 DHS,Total,15.9,19.5,,51.5,,84.3,,,,96.0,80.3,92.3,27.8,8.3
1,Armenia,2015-16 DHS,Region : Aragatsotn,25.2,20.9,75.94,24.1,100.04,47.3,,,,97.9,89.9,85.8,24.0,15.5
2,Armenia,2015-16 DHS,Region : Ararat,13.2,6.9,69.42,30.6,100.02,63.7,,,,87.0,53.6,81.2,13.0,8.1
3,Armenia,2015-16 DHS,Region : Armavir,7.0,15.3,22.96,77.0,99.96,90.3,,,,99.2,90.8,97.9,41.3,5.8
4,Armenia,2015-16 DHS,Region : Gegharkunik,12.9,24.7,80.78,19.2,99.98,38.2,,,,88.1,62.4,81.4,20.0,4.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
706,Togo,2013-14 DHS,Region : Kara,28.5,,,89.9,,92.4,,,,42.4,41.2,66.4,96.4,8.1
707,Togo,2013-14 DHS,Region : Savanes,30.9,,,83.8,,86.6,,,,41.0,46.4,63.4,84.4,5.8
708,Togo,2013-14 DHS,Decision : Own health care,,,,,,,11.6,,,,,,,
709,Togo,2013-14 DHS,Decision : Major household purchases,,,,,,,,14.3,,,,,,
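
To see what `skiprows` and `skipfooter` do in isolation, here is a minimal sketch that reads a small CSV from a string. The text is a made-up stand-in for the export, with `io.StringIO` playing the role of the file. `skipfooter` is only supported by the Python parsing engine, which is why `engine='python'` is passed.

```python
import io
import pandas as pd

# Made-up CSV: three empty lines, a title line, two data lines, two footer lines
csv_text = (
    ",,\n"
    ",,\n"
    ",,\n"
    "Country,Survey,Value\n"
    "Armenia,2000 DHS,1.0\n"
    "Togo,2013-14 DHS,2.0\n"
    ",,\n"
    "Value,Description of the value column,\n"
)

# Skip the three empty lines at the top and the two footer lines at the bottom
df = pd.read_csv(io.StringIO(csv_text), skiprows=3, skipfooter=2, engine="python")
print(df)
```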


In [15]:
set(stat_power['Country'])

{'Armenia',
 'Congo Democratic Republic',
 'Ethiopia',
 'Lesotho',
 'Malawi',
 'Maldives',
 'Mozambique',
 'Namibia',
 'Nepal',
 'Nicaragua',
 'Nigeria',
 'Rwanda',
 'Tajikistan',
 'Timor-Leste',
 'Togo'}

In [16]:
# Look at the length of this list
len(set(stat_power['Country']))

15

## One country example

* Get data for just Armenia
* Make a deep `.copy()` so you are not operating on a view and avoid the <a href="https://realpython.com/pandas-settingwithcopywarning/">`SettingWithCopyWarning`</a>
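
A small sketch of the same pattern on toy data (made-up values): after `.copy()`, new columns are added to the subset only, and the original dataframe is left untouched.

```python
import pandas as pd

# Toy data: hypothetical values
df = pd.DataFrame({"Country": ["Armenia", "Togo"], "value": [1.0, 2.0]})

# Filter, then copy so the subset is an independent dataframe
subset = df[df["Country"] == "Armenia"].copy()
subset["value_doubled"] = subset["value"] * 2

print(list(df.columns))       # original columns are unchanged
print(list(subset.columns))   # the subset has the extra column
```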

In [17]:
stat_armenia_power = stat_power[stat_power['Country'] == 'Armenia'].copy()
stat_armenia_power.head(15)

Unnamed: 0,Country,Survey,Characteristic,Family planning use decisionmaking mainly by wife,Family planning non-use decisionmaking mainly by wife,LivWell own house,Do not own a house [Women],Unnamed: 7,Do not own land [Women],Decision maker about Own health care: Mainly wife [Women],Decision maker about Major household purchases: Mainly wife [Women],Decision maker about Visits to her family or relatives: Mainly wife [Women],Final say in own health care [Women],Final say in making large purchases [Women],"Final say in visits to family, relatives, friends [Women]",Women who decide themselves how their earnings are used,Wife earns more than husband
0,Armenia,2015-16 DHS,Total,15.9,19.5,,51.5,,84.3,,,,96.0,80.3,92.3,27.8,8.3
1,Armenia,2015-16 DHS,Region : Aragatsotn,25.2,20.9,75.94,24.1,100.04,47.3,,,,97.9,89.9,85.8,24.0,15.5
2,Armenia,2015-16 DHS,Region : Ararat,13.2,6.9,69.42,30.6,100.02,63.7,,,,87.0,53.6,81.2,13.0,8.1
3,Armenia,2015-16 DHS,Region : Armavir,7.0,15.3,22.96,77.0,99.96,90.3,,,,99.2,90.8,97.9,41.3,5.8
4,Armenia,2015-16 DHS,Region : Gegharkunik,12.9,24.7,80.78,19.2,99.98,38.2,,,,88.1,62.4,81.4,20.0,4.7
5,Armenia,2015-16 DHS,Region : Lori,9.1,5.5,55.49,44.5,99.99,99.4,,,,99.1,76.1,96.7,16.5,14.3
6,Armenia,2015-16 DHS,Region : Kotayk,16.9,36.7,49.56,50.4,99.96,87.2,,,,98.2,80.7,95.4,29.1,2.8
7,Armenia,2015-16 DHS,Region : Shirak,11.2,23.6,19.78,80.2,99.98,96.3,,,,98.5,84.2,94.2,36.7,17.9
8,Armenia,2015-16 DHS,Region : Syunik,0.0,7.1,55.72,44.3,100.02,84.2,,,,92.9,92.1,89.2,17.7,7.2
9,Armenia,2015-16 DHS,Region : Vayots Dzor,12.0,12.2,51.36,48.6,99.96,85.0,,,,97.8,90.7,95.4,23.6,10.8


In [18]:
stat_armenia_power.columns

Index(['Country', 'Survey', 'Characteristic',
       'Family planning use decisionmaking mainly by wife',
       'Family planning non-use decisionmaking mainly by wife',
       'LivWell own house', 'Do not own a house [Women]', 'Unnamed: 7',
       'Do not own land [Women]',
       'Decision maker about Own health care: Mainly wife [Women]',
       'Decision maker about Major household purchases: Mainly wife [Women]',
       'Decision maker about Visits to her family or relatives: Mainly wife [Women]',
       'Final say in own health care [Women]',
       'Final say in making large purchases [Women]',
       'Final say in visits to family, relatives, friends [Women]',
       'Women who decide themselves how their earnings are used',
       'Wife earns more than husband'],
      dtype='object')

## 🦋 Transform data step by step
Find columns that could be in a more useful format.

### Get the survey years
Split the survey label on the space to separate the year from the survey name.

In [19]:
set(stat_armenia_power['Survey'])

{'2000 DHS', '2005 DHS', '2010 DHS', '2015-16 DHS'}

In [20]:
# Split the survey text on whitespace to separate the year from the survey name.
# Make two new columns 'year' and 'source' that appear at the right side.
stat_armenia_power[['year','source']] = stat_armenia_power.loc[:, 'Survey'].str.split(expand=True)
stat_armenia_power.head(15)

Unnamed: 0,Country,Survey,Characteristic,Family planning use decisionmaking mainly by wife,Family planning non-use decisionmaking mainly by wife,LivWell own house,Do not own a house [Women],Unnamed: 7,Do not own land [Women],Decision maker about Own health care: Mainly wife [Women],Decision maker about Major household purchases: Mainly wife [Women],Decision maker about Visits to her family or relatives: Mainly wife [Women],Final say in own health care [Women],Final say in making large purchases [Women],"Final say in visits to family, relatives, friends [Women]",Women who decide themselves how their earnings are used,Wife earns more than husband,year,source
0,Armenia,2015-16 DHS,Total,15.9,19.5,,51.5,,84.3,,,,96.0,80.3,92.3,27.8,8.3,2015-16,DHS
1,Armenia,2015-16 DHS,Region : Aragatsotn,25.2,20.9,75.94,24.1,100.04,47.3,,,,97.9,89.9,85.8,24.0,15.5,2015-16,DHS
2,Armenia,2015-16 DHS,Region : Ararat,13.2,6.9,69.42,30.6,100.02,63.7,,,,87.0,53.6,81.2,13.0,8.1,2015-16,DHS
3,Armenia,2015-16 DHS,Region : Armavir,7.0,15.3,22.96,77.0,99.96,90.3,,,,99.2,90.8,97.9,41.3,5.8,2015-16,DHS
4,Armenia,2015-16 DHS,Region : Gegharkunik,12.9,24.7,80.78,19.2,99.98,38.2,,,,88.1,62.4,81.4,20.0,4.7,2015-16,DHS
5,Armenia,2015-16 DHS,Region : Lori,9.1,5.5,55.49,44.5,99.99,99.4,,,,99.1,76.1,96.7,16.5,14.3,2015-16,DHS
6,Armenia,2015-16 DHS,Region : Kotayk,16.9,36.7,49.56,50.4,99.96,87.2,,,,98.2,80.7,95.4,29.1,2.8,2015-16,DHS
7,Armenia,2015-16 DHS,Region : Shirak,11.2,23.6,19.78,80.2,99.98,96.3,,,,98.5,84.2,94.2,36.7,17.9,2015-16,DHS
8,Armenia,2015-16 DHS,Region : Syunik,0.0,7.1,55.72,44.3,100.02,84.2,,,,92.9,92.1,89.2,17.7,7.2,2015-16,DHS
9,Armenia,2015-16 DHS,Region : Vayots Dzor,12.0,12.2,51.36,48.6,99.96,85.0,,,,97.8,90.7,95.4,23.6,10.8,2015-16,DHS


In [21]:
# Replace the year text 2015-16 with 2016 to match the LivWell year.
stat_armenia_power['year'] = stat_armenia_power['year'].replace('2015-16', '2016')
stat_armenia_power.head()

Unnamed: 0,Country,Survey,Characteristic,Family planning use decisionmaking mainly by wife,Family planning non-use decisionmaking mainly by wife,LivWell own house,Do not own a house [Women],Unnamed: 7,Do not own land [Women],Decision maker about Own health care: Mainly wife [Women],Decision maker about Major household purchases: Mainly wife [Women],Decision maker about Visits to her family or relatives: Mainly wife [Women],Final say in own health care [Women],Final say in making large purchases [Women],"Final say in visits to family, relatives, friends [Women]",Women who decide themselves how their earnings are used,Wife earns more than husband,year,source
0,Armenia,2015-16 DHS,Total,15.9,19.5,,51.5,,84.3,,,,96.0,80.3,92.3,27.8,8.3,2016,DHS
1,Armenia,2015-16 DHS,Region : Aragatsotn,25.2,20.9,75.94,24.1,100.04,47.3,,,,97.9,89.9,85.8,24.0,15.5,2016,DHS
2,Armenia,2015-16 DHS,Region : Ararat,13.2,6.9,69.42,30.6,100.02,63.7,,,,87.0,53.6,81.2,13.0,8.1,2016,DHS
3,Armenia,2015-16 DHS,Region : Armavir,7.0,15.3,22.96,77.0,99.96,90.3,,,,99.2,90.8,97.9,41.3,5.8,2016,DHS
4,Armenia,2015-16 DHS,Region : Gegharkunik,12.9,24.7,80.78,19.2,99.98,38.2,,,,88.1,62.4,81.4,20.0,4.7,2016,DHS
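
If you move on to other countries, you may meet more two-year labels (the Togo rows above use `2013-14 DHS`). The sketch below is one possible generalization, not the authors' method: `end_year` is a hypothetical helper that keeps the later year, which matches the Armenia case (`2015-16` becomes 2016). That choice is an assumption, so check each country against the `year` column in LivWell before relying on it.

```python
import pandas as pd

def end_year(label: str) -> int:
    """Hypothetical helper: return the last year of a 'YYYY' or 'YYYY-YY' label."""
    start, _, end = label.partition("-")
    start_year = int(start)
    if not end:
        return start_year
    # Rebuild the full end year from the two-digit suffix
    candidate = start_year - start_year % 100 + int(end)
    if candidate < start_year:   # e.g. '1999-00' crosses a century
        candidate += 100
    return candidate

surveys = pd.Series(["2000 DHS", "2005 DHS", "2015-16 DHS", "2013-14 DHS"])

# n=1 splits only on the first space, giving exactly two columns
parts = surveys.str.split(" ", n=1, expand=True)
years = parts[0].map(end_year)
print(years.tolist())
```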


In [22]:
# Similarly, split the characteristic on the colon " : " and keep the region name
stat_armenia_power['region'] = stat_armenia_power.loc[:, 'Characteristic'].str.split(" : ").str[1]
stat_armenia_power.head()

Unnamed: 0,Country,Survey,Characteristic,Family planning use decisionmaking mainly by wife,Family planning non-use decisionmaking mainly by wife,LivWell own house,Do not own a house [Women],Unnamed: 7,Do not own land [Women],Decision maker about Own health care: Mainly wife [Women],Decision maker about Major household purchases: Mainly wife [Women],Decision maker about Visits to her family or relatives: Mainly wife [Women],Final say in own health care [Women],Final say in making large purchases [Women],"Final say in visits to family, relatives, friends [Women]",Women who decide themselves how their earnings are used,Wife earns more than husband,year,source,region
0,Armenia,2015-16 DHS,Total,15.9,19.5,,51.5,,84.3,,,,96.0,80.3,92.3,27.8,8.3,2016,DHS,
1,Armenia,2015-16 DHS,Region : Aragatsotn,25.2,20.9,75.94,24.1,100.04,47.3,,,,97.9,89.9,85.8,24.0,15.5,2016,DHS,Aragatsotn
2,Armenia,2015-16 DHS,Region : Ararat,13.2,6.9,69.42,30.6,100.02,63.7,,,,87.0,53.6,81.2,13.0,8.1,2016,DHS,Ararat
3,Armenia,2015-16 DHS,Region : Armavir,7.0,15.3,22.96,77.0,99.96,90.3,,,,99.2,90.8,97.9,41.3,5.8,2016,DHS,Armavir
4,Armenia,2015-16 DHS,Region : Gegharkunik,12.9,24.7,80.78,19.2,99.98,38.2,,,,88.1,62.4,81.4,20.0,4.7,2016,DHS,Gegharkunik


### Drop rows that are not regions
📌 Tip: I like to name new dataframes when I've done significant operations like dropping rows.  This avoids errors if I re-run cells.

In [23]:
# Drop the country-level 'Total' rows, which are not regions
stat_armenia_power = stat_armenia_power[~stat_armenia_power['Characteristic'].str.contains('Total')]
stat_armenia_power.head()

Unnamed: 0,Country,Survey,Characteristic,Family planning use decisionmaking mainly by wife,Family planning non-use decisionmaking mainly by wife,LivWell own house,Do not own a house [Women],Unnamed: 7,Do not own land [Women],Decision maker about Own health care: Mainly wife [Women],Decision maker about Major household purchases: Mainly wife [Women],Decision maker about Visits to her family or relatives: Mainly wife [Women],Final say in own health care [Women],Final say in making large purchases [Women],"Final say in visits to family, relatives, friends [Women]",Women who decide themselves how their earnings are used,Wife earns more than husband,year,source,region
1,Armenia,2015-16 DHS,Region : Aragatsotn,25.2,20.9,75.94,24.1,100.04,47.3,,,,97.9,89.9,85.8,24.0,15.5,2016,DHS,Aragatsotn
2,Armenia,2015-16 DHS,Region : Ararat,13.2,6.9,69.42,30.6,100.02,63.7,,,,87.0,53.6,81.2,13.0,8.1,2016,DHS,Ararat
3,Armenia,2015-16 DHS,Region : Armavir,7.0,15.3,22.96,77.0,99.96,90.3,,,,99.2,90.8,97.9,41.3,5.8,2016,DHS,Armavir
4,Armenia,2015-16 DHS,Region : Gegharkunik,12.9,24.7,80.78,19.2,99.98,38.2,,,,88.1,62.4,81.4,20.0,4.7,2016,DHS,Gegharkunik
5,Armenia,2015-16 DHS,Region : Lori,9.1,5.5,55.49,44.5,99.99,99.4,,,,99.1,76.1,96.7,16.5,14.3,2016,DHS,Lori
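
The export also contains rows such as `Decision : Own health care` (see the Togo rows earlier), which are neither totals nor regions. An alternative, sketched here on toy rows, is to keep only the rows whose `Characteristic` starts with `Region`. This is a suggestion rather than what the notebook does.

```python
import pandas as pd

# Toy rows using labels that appear in the export
df = pd.DataFrame({
    "Characteristic": [
        "Total",
        "Region : Kara",
        "Decision : Own health care",
        "Region : Savanes",
    ]
})

# Keep only region rows, then extract the name after " : "
is_region = df["Characteristic"].str.startswith("Region")
regions = df[is_region].copy()
regions["region"] = regions["Characteristic"].str.split(" : ").str[1]
print(regions)
```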


In [24]:
# Drop the original text columns and store the result under a new dataframe name
stat_armenia_pow = stat_armenia_power.drop(columns=['Survey', 'Characteristic'])
stat_armenia_pow.head()

Unnamed: 0,Country,Family planning use decisionmaking mainly by wife,Family planning non-use decisionmaking mainly by wife,LivWell own house,Do not own a house [Women],Unnamed: 7,Do not own land [Women],Decision maker about Own health care: Mainly wife [Women],Decision maker about Major household purchases: Mainly wife [Women],Decision maker about Visits to her family or relatives: Mainly wife [Women],Final say in own health care [Women],Final say in making large purchases [Women],"Final say in visits to family, relatives, friends [Women]",Women who decide themselves how their earnings are used,Wife earns more than husband,year,source,region
1,Armenia,25.2,20.9,75.94,24.1,100.04,47.3,,,,97.9,89.9,85.8,24.0,15.5,2016,DHS,Aragatsotn
2,Armenia,13.2,6.9,69.42,30.6,100.02,63.7,,,,87.0,53.6,81.2,13.0,8.1,2016,DHS,Ararat
3,Armenia,7.0,15.3,22.96,77.0,99.96,90.3,,,,99.2,90.8,97.9,41.3,5.8,2016,DHS,Armavir
4,Armenia,12.9,24.7,80.78,19.2,99.98,38.2,,,,88.1,62.4,81.4,20.0,4.7,2016,DHS,Gegharkunik
5,Armenia,9.1,5.5,55.49,44.5,99.99,99.4,,,,99.1,76.1,96.7,16.5,14.3,2016,DHS,Lori


In [25]:
# Map STAT column names to LivWell-style decision power (DP) names
rename_DP_cols = {
    'Country' : 'country_name',
    'Wife earns more than husband' : 'DP_earn_more_p',
    'Women who decide themselves how their earnings are used' : 'DP_decide_money_p',
    'Do not own a house [Women]' : 'STAT_not_homeowner'
}

In [26]:
stat_armenia_DP = stat_armenia_pow.rename(columns=rename_DP_cols)
stat_armenia_DP.head()

Unnamed: 0,country_name,Family planning use decisionmaking mainly by wife,Family planning non-use decisionmaking mainly by wife,LivWell own house,STAT_not_homeowner,Unnamed: 7,Do not own land [Women],Decision maker about Own health care: Mainly wife [Women],Decision maker about Major household purchases: Mainly wife [Women],Decision maker about Visits to her family or relatives: Mainly wife [Women],Final say in own health care [Women],Final say in making large purchases [Women],"Final say in visits to family, relatives, friends [Women]",DP_decide_money_p,DP_earn_more_p,year,source,region
1,Armenia,25.2,20.9,75.94,24.1,100.04,47.3,,,,97.9,89.9,85.8,24.0,15.5,2016,DHS,Aragatsotn
2,Armenia,13.2,6.9,69.42,30.6,100.02,63.7,,,,87.0,53.6,81.2,13.0,8.1,2016,DHS,Ararat
3,Armenia,7.0,15.3,22.96,77.0,99.96,90.3,,,,99.2,90.8,97.9,41.3,5.8,2016,DHS,Armavir
4,Armenia,12.9,24.7,80.78,19.2,99.98,38.2,,,,88.1,62.4,81.4,20.0,4.7,2016,DHS,Gegharkunik
5,Armenia,9.1,5.5,55.49,44.5,99.99,99.4,,,,99.1,76.1,96.7,16.5,14.3,2016,DHS,Lori


### Choose and reorder columns

In [27]:
# Now it looks similar to the LivWell data
stat_armenia_DP = stat_armenia_DP[['country_name', 'year', 'region', 'DP_earn_more_p', 'DP_decide_money_p', 'STAT_not_homeowner']]
stat_armenia_DP.head()

Unnamed: 0,country_name,year,region,DP_earn_more_p,DP_decide_money_p,STAT_not_homeowner
1,Armenia,2016,Aragatsotn,15.5,24.0,24.1
2,Armenia,2016,Ararat,8.1,13.0,30.6
3,Armenia,2016,Armavir,5.8,41.3,77.0
4,Armenia,2016,Gegharkunik,4.7,20.0,19.2
5,Armenia,2016,Lori,14.3,16.5,44.5


## 👍  Now this looks similar to our LivWell data set.
* Survey years are single years (not `2015-16 DHS`)
* Regions are just names (no `' : '`)
* Renamed columns of interest

## Filter LivWell columns to match STAT columns
### Get LivWell Armenia decision and power columns

In [28]:
# These are all of the LivWell (lw) decision and power (DP) columns
lw_all_pow_cols = livwell_armenia.columns[livwell_armenia.columns.str.contains('DP')].to_list()
lw_all_pow_cols

['DP_decide_money_p',
 'DP_decide_money_p_se',
 'DP_decide_health_p',
 'DP_decide_health_p_se',
 'DP_decide_large_purchase_p',
 'DP_decide_large_purchase_p_se',
 'DP_decide_visits_p',
 'DP_decide_visits_p_se',
 'DP_owns_house_p',
 'DP_owns_house_p_se',
 'DP_owns_land_p',
 'DP_owns_land_p_se',
 'DP_decide_contraception_p',
 'DP_decide_contraception_p_se',
 'DP_decide_no_contraception_p',
 'DP_decide_no_contraception_p_se',
 'DP_earn_more_equal_p',
 'DP_earn_more_equal_p_se',
 'DP_earn_more_p',
 'DP_earn_more_p_se']

In [29]:
# Choose columns of interest
lw_DP_cols = ['country_name', 'country_code', 'year', 'region_num_harmonized',
       'region_name_harmonized', 'DP_earn_more_p', 'DP_owns_house_p','DP_decide_money_p']

In [30]:
# LivWell data for general and power columns
livwell_armenia_DP = livwell_armenia[lw_DP_cols]
livwell_armenia_DP

Unnamed: 0,country_name,country_code,year,region_num_harmonized,region_name_harmonized,DP_earn_more_p,DP_owns_house_p,DP_decide_money_p
0,Armenia,ARM,2000,1,Aragatsotn,,,28.0
1,Armenia,ARM,2000,2,Ararat,,,30.91
2,Armenia,ARM,2000,3,Armavir,,,32.81
3,Armenia,ARM,2000,4,Gegharkunik,,,9.62
4,Armenia,ARM,2000,5,Lori,,,35.29
5,Armenia,ARM,2000,6,Kotayk,,,49.28
6,Armenia,ARM,2000,7,Shirak,,,28.95
7,Armenia,ARM,2000,8,Syunik,,,32.97
8,Armenia,ARM,2000,9,Vayots Dzor,,,36.36
9,Armenia,ARM,2000,10,Tavush,,,32.53


In [31]:
# Check the columns of both datasets
print(stat_armenia_DP.columns)
print(livwell_armenia_DP.columns)

Index(['country_name', 'year', 'region', 'DP_earn_more_p', 'DP_decide_money_p',
       'STAT_not_homeowner'],
      dtype='object')
Index(['country_name', 'country_code', 'year', 'region_num_harmonized',
       'region_name_harmonized', 'DP_earn_more_p', 'DP_owns_house_p',
       'DP_decide_money_p'],
      dtype='object')


## Get data in the same format to merge on year and region.

In [32]:
# In the LivWell data, the year data is of type int
livwell_armenia_DP['year'].dtype

dtype('int64')

In [33]:
# In the STAT data, the year data is of type object
stat_armenia_DP['year'].dtype

dtype('O')

In [34]:
# Recast the year as an int.  We need to do this to prevent a
# ValueError: You are trying to merge on object and int64 columns for key 'year'
stat_armenia_DP['year'] = stat_armenia_DP['year'].astype(int)
stat_armenia_DP['year'].dtype

dtype('int64')
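Here is a minimal sketch of the same dtype fix on toy data. The frames `left` and `right` are made up for illustration and are not the workshop files. It uses `pd.to_numeric` before the cast, which is one possible alternative to a bare `astype(int)` if you would like an explicit error when a year string cannot be parsed.

```python
import pandas as pd

# Toy frames (hypothetical): one stores year as strings, the other as integers
left = pd.DataFrame({"year": ["2000", "2016"], "stat_value": [28.0, 24.0]})
right = pd.DataFrame({"year": [2000, 2016], "lw_value": [28.0, 23.97]})

# The key columns have different dtypes before the cast
print(left["year"].dtype, right["year"].dtype)

# Parse the strings as numbers, then store them as 64-bit integers
left["year"] = pd.to_numeric(left["year"], errors="raise").astype("int64")

# With matching dtypes, the merge on year goes through
merged = left.merge(right, on="year")
print(merged)
```

Checking `dtype` on every key column before a merge is a cheap habit that catches this kind of mismatch early.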

## 🧩 Merge the data!

Choose the key columns whose values must match in both tables.
* On the left (STAT data), those are `country_name`, `region`, and `year`.
* On the right (LivWell data), those are `country_name`, `region_name_harmonized`, and `year`.
* Non-key columns that share a name in both tables get a suffix (`_stat`, `_lw`) so you can tell them apart.

In [35]:
merged_df = stat_armenia_DP.merge(livwell_armenia_DP, left_on=['country_name', 'region', 'year'],
            right_on=['country_name', 'region_name_harmonized', 'year'],
            suffixes=('_stat', '_lw'))

merged_df.columns

Index(['country_name', 'year', 'region', 'DP_earn_more_p_stat',
       'DP_decide_money_p_stat', 'STAT_not_homeowner', 'country_code',
       'region_num_harmonized', 'region_name_harmonized', 'DP_earn_more_p_lw',
       'DP_owns_house_p', 'DP_decide_money_p_lw'],
      dtype='object')
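If you want to see which rows did or did not find a partner, one option is an outer merge with `indicator=True`. The sketch below uses two small toy frames; the region called `Unmatched region` and its value are hypothetical and only there to show a row that exists on one side.

```python
import pandas as pd

# Toy STAT-like frame; the last row is a hypothetical region with no LivWell match
stat = pd.DataFrame({
    "country_name": ["Armenia", "Armenia", "Armenia"],
    "region": ["Ararat", "Lori", "Unmatched region"],
    "year": [2016, 2016, 2016],
    "DP_decide_money_p": [13.0, 16.5, 0.0],
})

# Toy LivWell-like frame
lw = pd.DataFrame({
    "country_name": ["Armenia", "Armenia"],
    "region_name_harmonized": ["Ararat", "Lori"],
    "year": [2016, 2016],
    "DP_decide_money_p": [13.0, 16.53],
})

# how="outer" keeps rows from both sides; indicator=True adds a _merge column
check = stat.merge(
    lw,
    how="outer",
    left_on=["country_name", "region", "year"],
    right_on=["country_name", "region_name_harmonized", "year"],
    suffixes=("_stat", "_lw"),
    indicator=True,
)
print(check[["region", "region_name_harmonized", "_merge"]])

# Rows that appear in only one of the two tables
unmatched = check[check["_merge"] != "both"]
print(len(unmatched))
```

The default `how="inner"` used in the cell above silently drops rows without a match, so a check like this can be useful when region names are spelled differently between sources.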

In [36]:
merged_df

Unnamed: 0,country_name,year,region,DP_earn_more_p_stat,DP_decide_money_p_stat,STAT_not_homeowner,country_code,region_num_harmonized,region_name_harmonized,DP_earn_more_p_lw,DP_owns_house_p,DP_decide_money_p_lw
0,Armenia,2016,Aragatsotn,15.5,24.0,24.1,ARM,1,Aragatsotn,15.99,75.94,23.97
1,Armenia,2016,Ararat,8.1,13.0,30.6,ARM,2,Ararat,8.39,69.42,13.0
2,Armenia,2016,Armavir,5.8,41.3,77.0,ARM,3,Armavir,8.08,22.96,41.33
3,Armenia,2016,Gegharkunik,4.7,20.0,19.2,ARM,4,Gegharkunik,7.26,80.78,19.97
4,Armenia,2016,Lori,14.3,16.5,44.5,ARM,5,Lori,20.22,55.49,16.53
5,Armenia,2016,Kotayk,2.8,29.1,50.4,ARM,6,Kotayk,3.55,49.56,29.12
6,Armenia,2016,Shirak,17.9,36.7,80.2,ARM,7,Shirak,19.08,19.78,36.74
7,Armenia,2016,Syunik,7.2,17.7,44.3,ARM,8,Syunik,7.64,55.72,17.74
8,Armenia,2016,Vayots Dzor,10.8,23.6,48.6,ARM,9,Vayots Dzor,12.03,51.36,23.57
9,Armenia,2016,Tavush,11.5,22.9,62.4,ARM,10,Tavush,15.61,37.62,22.88


And hopefully they match! Reorder the columns to get a better look.

Note that `DP_owns_house_p` (from LivWell) is the complement of `STAT_not_homeowner` (from STAT), so the two should sum to roughly 100%.

🔎 More to explore
* Can you check that `DP_owns_house_p` and `STAT_not_homeowner` add up to 100%?
* Can you upload other STAT files to compare to LivWell?

In [37]:
merged_df[['country_name', 'year', 'region', 'DP_earn_more_p_lw', 'DP_earn_more_p_stat',
           'DP_decide_money_p_lw', 'DP_decide_money_p_stat',
           'DP_owns_house_p', 'STAT_not_homeowner']]

Unnamed: 0,country_name,year,region,DP_earn_more_p_lw,DP_earn_more_p_stat,DP_decide_money_p_lw,DP_decide_money_p_stat,DP_owns_house_p,STAT_not_homeowner
0,Armenia,2016,Aragatsotn,15.99,15.5,23.97,24.0,75.94,24.1
1,Armenia,2016,Ararat,8.39,8.1,13.0,13.0,69.42,30.6
2,Armenia,2016,Armavir,8.08,5.8,41.33,41.3,22.96,77.0
3,Armenia,2016,Gegharkunik,7.26,4.7,19.97,20.0,80.78,19.2
4,Armenia,2016,Lori,20.22,14.3,16.53,16.5,55.49,44.5
5,Armenia,2016,Kotayk,3.55,2.8,29.12,29.1,49.56,50.4
6,Armenia,2016,Shirak,19.08,17.9,36.74,36.7,19.78,80.2
7,Armenia,2016,Syunik,7.64,7.2,17.74,17.7,55.72,44.3
8,Armenia,2016,Vayots Dzor,12.03,10.8,23.57,23.6,51.36,48.6
9,Armenia,2016,Tavush,15.61,11.5,22.88,22.9,37.62,62.4
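One possible way to start on the first prompt, sketched with three rows copied from the table above. The column names `owner_total` and `decide_money_diff` and the tolerance of 0.1 percentage points are assumptions made for this example.

```python
import pandas as pd

# Three rows copied from the merged table above
df = pd.DataFrame({
    "region": ["Aragatsotn", "Ararat", "Armavir"],
    "DP_owns_house_p": [75.94, 69.42, 22.96],
    "STAT_not_homeowner": [24.1, 30.6, 77.0],
    "DP_decide_money_p_lw": [23.97, 13.0, 41.33],
    "DP_decide_money_p_stat": [24.0, 13.0, 41.3],
})

# Owners plus non-owners should be close to 100%
df["owner_total"] = df["DP_owns_house_p"] + df["STAT_not_homeowner"]

# Absolute difference between the LivWell and STAT versions of one indicator
df["decide_money_diff"] = (
    df["DP_decide_money_p_lw"] - df["DP_decide_money_p_stat"]
).abs()

tolerance = 0.1  # assumed tolerance, in percentage points
print(df[["region", "owner_total", "decide_money_diff"]].round(2))
print(((df["owner_total"] - 100).abs() <= tolerance).all())
```

Small differences are expected because the STAT values are rounded to one decimal place and the LivWell values to two.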


## 🌍 Second data source: Global Data Lab Mean International Wealth Index

Find this file to upload: `GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv`


In [38]:
uploaded = files.upload()

Saving GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv to GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv


In [39]:
gdl = pd.read_csv("GDL-Mean-International-Wealth-Index-(IWI)-score-of-region-data.csv")
gdl

Unnamed: 0,Country,ISO_Code,Level,GDLCODE,Region,1992,1993,1994,1995,1996,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Afghanistan,AFG,National,AFGt,Total,,,,,,...,,,,,51.0,,,,,
1,Afghanistan,AFG,Subnat,AFGr101,Central (Kabul Wardak Kapisa Logar Parwan Panj...,,,,,,...,,,,,58.0,,,,,
2,Afghanistan,AFG,Subnat,AFGr102,Central Highlands (Bamyan Daikundi),,,,,,...,,,,,41.8,,,,,
3,Afghanistan,AFG,Subnat,AFGr103,East (Nangarhar Kunar Laghman Nooristan),,,,,,...,,,,,41.3,,,,,
4,Afghanistan,AFG,Subnat,AFGr104,North (Samangan Sar-e-Pul Balkh Jawzjan Faryab),,,,,,...,,,,,56.5,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1578,Zimbabwe,ZWE,Subnat,ZWEr104,Mashonaland West,,,25.6,,,...,35.4,,,,42.2,,,,42.4,
1579,Zimbabwe,ZWE,Subnat,ZWEr108,Masvingo,,,19.4,,,...,28.4,,,,38.8,,,,40.3,
1580,Zimbabwe,ZWE,Subnat,ZWEr105,Matebeleland North,,,19.2,,,...,25.8,,,,33.7,,,,31.6,
1581,Zimbabwe,ZWE,Subnat,ZWEr106,Matebeleland South,,,19.3,,,...,30.1,,,,39.2,,,,42.6,


In [40]:
gdl_armenia = gdl[gdl['Country'] == "Armenia"]

In [41]:
# Drop every column (year) that contains a NaN value
# The years that remain are similar to the years in our LivWell data
gdl_armenia_years = gdl_armenia.dropna(axis=1)
gdl_armenia_years

Unnamed: 0,Country,ISO_Code,Level,GDLCODE,Region,2000,2010,2016
57,Armenia,ARM,National,ARMt,Total,71.4,81.9,86.2
58,Armenia,ARM,Subnat,ARMr101,Aragatsotn,55.4,68.4,83.0
59,Armenia,ARM,Subnat,ARMr102,Ararat,66.6,76.6,84.8
60,Armenia,ARM,Subnat,ARMr103,Armavir,64.0,72.3,81.1
61,Armenia,ARM,Subnat,ARMr104,Gegharkunik,59.0,80.9,79.9
62,Armenia,ARM,Subnat,ARMr106,Kotayk,75.1,84.2,89.0
63,Armenia,ARM,Subnat,ARMr105,Lori,64.8,82.1,81.8
64,Armenia,ARM,Subnat,ARMr107,Shirak,69.7,76.2,85.6
65,Armenia,ARM,Subnat,ARMr108,Syunik,75.0,82.5,85.9
66,Armenia,ARM,Subnat,ARMr110,Tavush,63.3,76.9,85.2
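`dropna(axis=1)` removes a column as soon as it contains a single NaN. The toy sketch below (regions `A` and `B` are hypothetical) contrasts that default with `how="all"`, which only removes columns that are entirely empty. Which one you want depends on whether partially observed years should stay in the table.

```python
import numpy as np
import pandas as pd

# Toy wide table: one fully observed year, one empty year, one partial year
wide = pd.DataFrame({
    "Region": ["A", "B"],
    "2000": [55.4, 66.6],
    "2005": [np.nan, np.nan],  # no data for any region
    "2010": [68.4, np.nan],    # data for only one region
})

# Default how="any": drop a column if any value in it is NaN
any_dropped = wide.dropna(axis=1)

# how="all": drop a column only if every value in it is NaN
all_dropped = wide.dropna(axis=1, how="all")

print(list(any_dropped.columns))
print(list(all_dropped.columns))
```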


In [42]:
# Keep rows where Region is not 'Total'
gdl_armenia_years = gdl_armenia_years[gdl_armenia_years['Region'] != 'Total']
print(gdl_armenia_years.columns)

# Reorder columns
gdl_armenia_years = gdl_armenia_years[['Country', 'ISO_Code', 'GDLCODE', 'Region', '2000', '2010', '2016']]
gdl_armenia_years

Index(['Country', 'ISO_Code', 'Level', 'GDLCODE', 'Region', '2000', '2010',
       '2016'],
      dtype='object')


Unnamed: 0,Country,ISO_Code,GDLCODE,Region,2000,2010,2016
58,Armenia,ARM,ARMr101,Aragatsotn,55.4,68.4,83.0
59,Armenia,ARM,ARMr102,Ararat,66.6,76.6,84.8
60,Armenia,ARM,ARMr103,Armavir,64.0,72.3,81.1
61,Armenia,ARM,ARMr104,Gegharkunik,59.0,80.9,79.9
62,Armenia,ARM,ARMr106,Kotayk,75.1,84.2,89.0
63,Armenia,ARM,ARMr105,Lori,64.8,82.1,81.8
64,Armenia,ARM,ARMr107,Shirak,69.7,76.2,85.6
65,Armenia,ARM,ARMr108,Syunik,75.0,82.5,85.9
66,Armenia,ARM,ARMr110,Tavush,63.3,76.9,85.2
67,Armenia,ARM,ARMr109,Vayots Dzor,70.7,78.4,85.8


## 🧊 DataFrame melt

This transformation changes our "wide" dataframe to a "long" format.  It returns an "unpivoted" dataframe.  

We will melt our dataframe and then create a scatter plot.

<img src="https://pandas.pydata.org/pandas-docs/version/0.25.1/_images/reshaping_melt.png" width=800>


Figure from [pandas.pydata.org](https://pandas.pydata.org/pandas-docs/version/0.25.1/user_guide/reshaping.html#reshaping-by-melt)

In [43]:
# In the melt command, the id_vars are the columns to keep as identifiers
# var_name names the new column that holds the old column headers (the years)
# value_name names the new column that holds the values (the International Wealth Index)
gdl_armenia_melt = gdl_armenia_years.melt(id_vars=['Country', 'ISO_Code','GDLCODE', 'Region'], var_name='Year', value_name='IWI')
gdl_armenia_melt

Unnamed: 0,Country,ISO_Code,GDLCODE,Region,Year,IWI
0,Armenia,ARM,ARMr101,Aragatsotn,2000,55.4
1,Armenia,ARM,ARMr102,Ararat,2000,66.6
2,Armenia,ARM,ARMr103,Armavir,2000,64.0
3,Armenia,ARM,ARMr104,Gegharkunik,2000,59.0
4,Armenia,ARM,ARMr106,Kotayk,2000,75.1
5,Armenia,ARM,ARMr105,Lori,2000,64.8
6,Armenia,ARM,ARMr107,Shirak,2000,69.7
7,Armenia,ARM,ARMr108,Syunik,2000,75.0
8,Armenia,ARM,ARMr110,Tavush,2000,63.3
9,Armenia,ARM,ARMr109,Vayots Dzor,2000,70.7


### 📅 Include columns that are years by checking to see that they are numerical.

In [44]:
gdl_year_cols = [x for x in gdl.columns if x.isdigit()]
gdl_year_cols

['1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019',
 '2020']

In [45]:
# Include these columns in the data
gdl_data = gdl[['Country', 'GDLCODE'] + gdl_year_cols]
gdl_data

Unnamed: 0,Country,GDLCODE,1992,1993,1994,1995,1996,1997,1998,1999,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Afghanistan,AFGt,,,,,,,,,...,,,,,51.0,,,,,
1,Afghanistan,AFGr101,,,,,,,,,...,,,,,58.0,,,,,
2,Afghanistan,AFGr102,,,,,,,,,...,,,,,41.8,,,,,
3,Afghanistan,AFGr103,,,,,,,,,...,,,,,41.3,,,,,
4,Afghanistan,AFGr104,,,,,,,,,...,,,,,56.5,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1578,Zimbabwe,ZWEr104,,,25.6,,,,,25.7,...,35.4,,,,42.2,,,,42.4,
1579,Zimbabwe,ZWEr108,,,19.4,,,,,21.7,...,28.4,,,,38.8,,,,40.3,
1580,Zimbabwe,ZWEr105,,,19.2,,,,,23.7,...,25.8,,,,33.7,,,,31.6,
1581,Zimbabwe,ZWEr106,,,19.3,,,,,20.1,...,30.1,,,,39.2,,,,42.6,


### Melt the whole GDL data to move years into one column

🔎 More to explore: How can you combine the GDL dataset with the LivWell and STAT datasets?

In [46]:
gdl_melt = gdl.melt(id_vars=['Country', 'ISO_Code','GDLCODE', 'Region'], var_name='Year', value_name='IWI')
gdl_melt

Unnamed: 0,Country,ISO_Code,GDLCODE,Region,Year,IWI
0,Afghanistan,AFG,AFGt,Total,Level,National
1,Afghanistan,AFG,AFGr101,Central (Kabul Wardak Kapisa Logar Parwan Panj...,Level,Subnat
2,Afghanistan,AFG,AFGr102,Central Highlands (Bamyan Daikundi),Level,Subnat
3,Afghanistan,AFG,AFGr103,East (Nangarhar Kunar Laghman Nooristan),Level,Subnat
4,Afghanistan,AFG,AFGr104,North (Samangan Sar-e-Pul Balkh Jawzjan Faryab),Level,Subnat
...,...,...,...,...,...,...
47485,Zimbabwe,ZWE,ZWEr104,Mashonaland West,2020,
47486,Zimbabwe,ZWE,ZWEr108,Masvingo,2020,
47487,Zimbabwe,ZWE,ZWEr105,Matebeleland North,2020,
47488,Zimbabwe,ZWE,ZWEr106,Matebeleland South,2020,
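One possible starting point for the prompt above: after melting, cast `Year` to an integer and merge on country, region name, and year. The sketch uses a few Armenia values copied from the tables in this notebook and assumes the region names are spelled the same in both sources, which may not hold for other countries.

```python
import pandas as pd

# Small GDL-like wide table (values copied from the Armenia rows above)
gdl_wide = pd.DataFrame({
    "Country": ["Armenia", "Armenia"],
    "Region": ["Aragatsotn", "Ararat"],
    "2000": [55.4, 66.6],
    "2016": [83.0, 84.8],
})

# Small LivWell-like table for the year 2000
livwell = pd.DataFrame({
    "country_name": ["Armenia", "Armenia"],
    "region_name_harmonized": ["Aragatsotn", "Ararat"],
    "year": [2000, 2000],
    "DP_decide_money_p": [28.0, 30.91],
})

# Wide to long, then make Year an integer so it matches LivWell's year column
gdl_long = gdl_wide.melt(id_vars=["Country", "Region"], var_name="Year", value_name="IWI")
gdl_long["Year"] = gdl_long["Year"].astype(int)

# Left merge keeps every LivWell row and attaches the IWI where a match exists
combined = livwell.merge(
    gdl_long,
    left_on=["country_name", "region_name_harmonized", "year"],
    right_on=["Country", "Region", "Year"],
    how="left",
)
print(combined[["region_name_harmonized", "year", "DP_decide_money_p", "IWI"]])
```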


### Filter the data for plotting
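Keep only the rows that hold IWI values. Because `Level` was not listed in `id_vars`, the melt treated it as a value column, so drop the rows that have the string 'Level' in the Year column. Note that national totals (GDLCODE values such as `AFGt`) are still present after this step.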

In [47]:
gdl_region = gdl_melt.drop(columns='Region')
gdl_region = gdl_region[~((gdl_region['Year'] == 'Level') | (gdl_region['IWI'] == 'National'))]
gdl_region

Unnamed: 0,Country,ISO_Code,GDLCODE,Year,IWI
1583,Afghanistan,AFG,AFGt,1992,
1584,Afghanistan,AFG,AFGr101,1992,
1585,Afghanistan,AFG,AFGr102,1992,
1586,Afghanistan,AFG,AFGr103,1992,
1587,Afghanistan,AFG,AFGr104,1992,
...,...,...,...,...,...
47485,Zimbabwe,ZWE,ZWEr104,2020,
47486,Zimbabwe,ZWE,ZWEr108,2020,
47487,Zimbabwe,ZWE,ZWEr105,2020,
47488,Zimbabwe,ZWE,ZWEr106,2020,
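An alternative sketch, not the approach used in the cells above: list `Level` in `id_vars` and pass only the year columns as `value_vars`. That keeps `IWI` numeric and lets you filter on `Level` directly. The frame `gdl_toy` is a small stand-in built from the Armenia rows shown earlier, with an empty `2001` column added for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the GDL table (one national row, two subnational rows)
gdl_toy = pd.DataFrame({
    "Country": ["Armenia", "Armenia", "Armenia"],
    "ISO_Code": ["ARM", "ARM", "ARM"],
    "Level": ["National", "Subnat", "Subnat"],
    "GDLCODE": ["ARMt", "ARMr101", "ARMr102"],
    "Region": ["Total", "Aragatsotn", "Ararat"],
    "2000": [71.4, 55.4, 66.6],
    "2001": [np.nan, np.nan, np.nan],
    "2010": [81.9, 68.4, 76.6],
})

# Year columns are the ones whose names are all digits
year_cols = [c for c in gdl_toy.columns if c.isdigit()]

# Level is an identifier here, so only the year columns are unpivoted
long = gdl_toy.melt(
    id_vars=["Country", "ISO_Code", "Level", "GDLCODE", "Region"],
    value_vars=year_cols,
    var_name="Year",
    value_name="IWI",
)

# Keep subnational rows that have an IWI value
subnat = long[long["Level"] == "Subnat"].dropna(subset=["IWI"])
print(subnat)
print(subnat["IWI"].dtype)
```

With this layout the national totals are excluded by the `Level` filter rather than by inspecting `GDLCODE`.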


# 📊 3. Data visualization
Plot a subset of countries

In [48]:
livwell_gdl_subset = ['Armenia', 'Burundi', 'Cambodia', 'Dominican Republic', 'El Salvador',
                      'Fiji', 'Gabon', 'Haiti', 'Tanzania', 'Turkey', 'Yemen', 'Zimbabwe']
livwell_gdl_countries = set(livwell_df['country_name']) & set(gdl_melt['Country'])

print("Countries in LivWell and GDL datasets: ", len(livwell_gdl_countries))
print("Number of subset countries: ", len(livwell_gdl_subset))

Countries in LivWell and GDL datasets:  51
Number of subset countries:  12


### Finding non null data

In [49]:
# Check whether any 1992 rows have a non-null IWI.
# The filter result is not displayed because it is not the last expression in the cell.
gdl_region[(gdl_region['Year'] == '1992') & (gdl_region['IWI'].notna())]
print(gdl_region.head(10))

          Country ISO_Code  GDLCODE  Year  IWI
1583  Afghanistan      AFG     AFGt  1992  NaN
1584  Afghanistan      AFG  AFGr101  1992  NaN
1585  Afghanistan      AFG  AFGr102  1992  NaN
1586  Afghanistan      AFG  AFGr103  1992  NaN
1587  Afghanistan      AFG  AFGr104  1992  NaN
1588  Afghanistan      AFG  AFGr105  1992  NaN
1589  Afghanistan      AFG  AFGr106  1992  NaN
1590  Afghanistan      AFG  AFGr107  1992  NaN
1591  Afghanistan      AFG  AFGr108  1992  NaN
1592      Albania      ALB     ALBt  1992  NaN
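To see at a glance which years actually have data, one option is to count the non-null values per year. The sketch below runs on a toy long-format frame; the rows and values are illustrative only and are not taken from the full GDL file.

```python
import numpy as np
import pandas as pd

# Toy long-format frame (illustrative values)
long = pd.DataFrame({
    "Country": ["Armenia"] * 4 + ["Zimbabwe"] * 4,
    "Year": ["1992", "1994", "2000", "2010"] * 2,
    "IWI": [np.nan, np.nan, 55.4, 68.4, np.nan, 25.6, np.nan, 35.4],
})

# Rows for 1992 that have a value; assigning the result makes it easy to inspect
has_1992 = long[(long["Year"] == "1992") & (long["IWI"].notna())]
print(len(has_1992))

# count() ignores NaN, so this gives the number of non-null IWI values per year
counts = long.groupby("Year")["IWI"].count()
print(counts)
```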


## Create a scatter plot of the IWI by year and colored by country.

In [50]:
import plotly.express as px

# Reuse the subset of countries defined earlier in livwell_gdl_subset
gdl_subset_data = gdl_region[gdl_region['Country'].isin(livwell_gdl_subset)]
fig = px.scatter(gdl_subset_data, x="Year", y="IWI", color="Country")
fig.show()
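After a melt, `Year` holds strings, and `IWI` can end up with an object dtype if text values were mixed into the column. A small preparation step, sketched here on a toy frame with made-up row order, converts both to numbers and drops empty rows so the plot axes are treated as numeric. The name `plot_data` is an assumption for this example.

```python
import pandas as pd

# Toy long-format frame with string years and an object-typed IWI column
long = pd.DataFrame({
    "Country": ["Armenia", "Armenia", "Armenia"],
    "Year": ["2016", "2000", "2010"],
    "IWI": pd.Series([83.0, None, 68.4], dtype="object"),
})

# Convert to numeric types, drop rows without a value, and sort by year
plot_data = (
    long.assign(
        Year=long["Year"].astype(int),
        IWI=pd.to_numeric(long["IWI"], errors="coerce"),
    )
    .dropna(subset=["IWI"])
    .sort_values("Year")
)
print(plot_data.dtypes)
print(plot_data)
```

The resulting frame could then be passed to `px.scatter(plot_data, x="Year", y="IWI", color="Country")` in the same way as in the cell above.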

# 🗺️ More to explore

The authors used several approaches in their work, including:

* Analysis in R.  Check out their <a href="https://gitlab.pik-potsdam.de/belmin/livwelldata">LivWell R repository</a>. They linearly interpolated data using the R package `imputeTS`.
* Collapsed categories for <a href="https://gitlab.pik-potsdam.de/belmin/livwelldata-paper/-/blob/main/analysis/data/raw_data/all_labels_cooking_fuel_completed_coal_as_traditional.csv?ref_type=heads">modern and traditional cooking fuel</a>
  * Modern: electricity, liquefied petroleum gas, natural gas, kerosene, and biogas
  * Traditional: biomass (firewood, charcoal, agricultural crops, coal)
  * This could also be described as recoding, label encoding, or feature engineering
* Recoded drinking water quality as low, medium, or high.  
* <a href="https://gitlab.pik-potsdam.de/belmin/livwelldata-paper/-/blob/main/analysis/data/derived_data/region_harmonization_files/dhs_region_harmonization.csv?ref_type=heads">Geographic data</a>. The authors harmonized variables over time and across countries.  
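Two of the ideas above can be tried in pandas. The sketch below is a rough Python analogue, not the authors' R code: it recodes cooking fuel labels with `Series.map`, using the groupings listed above, and uses `Series.interpolate` as a stand-in for the linear interpolation done with `imputeTS`. The series name `cooking_fuel` is hypothetical, and the two IWI values are borrowed from the Aragatsotn row earlier in this notebook purely as an illustration.

```python
import pandas as pd

# Mapping from detailed fuel labels to the two collapsed categories listed above
fuel_groups = {
    "electricity": "modern",
    "liquefied petroleum gas": "modern",
    "natural gas": "modern",
    "kerosene": "modern",
    "biogas": "modern",
    "firewood": "traditional",
    "charcoal": "traditional",
    "agricultural crops": "traditional",
    "coal": "traditional",
}

# Hypothetical survey answers
fuel = pd.Series(["electricity", "firewood", "coal", "biogas"], name="cooking_fuel")

# map() looks up each label in the dictionary; unknown labels would become NaN
fuel_recoded = fuel.map(fuel_groups)
print(fuel_recoded.tolist())

# Two observed years, indexed by year
iwi = pd.Series([55.4, 68.4], index=[2000, 2010])

# Add the missing years as NaN, then fill them on a straight line between observations
iwi_annual = iwi.reindex(range(2000, 2011)).interpolate(method="linear")
print(iwi_annual.round(2))
```

Because the reindexed years are evenly spaced, `method="linear"` gives the same result as interpolating on the year values themselves.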

# 🎉 Takeaways
* Learn the context of the data.
* Before transforming, have a target usage and format in mind.
* You don't need to know all the syntax.  Practice!