# Preprocessing Household Survey Data of Rwanda

In this Python notebook, we preprocess household survey data from Rwanda to compute two indicators: **Food Consumption Score (FCS)** and **Household Dietary Diversity Score (HDDS)**. These indicators are widely used to assess food security and nutritional status at the household level. We compute them to serve as ground truth for a research project that uses machine learning and deep learning to predict food security indicators (FCS and HDDS) from heterogeneous data.

### About Rwanda
Rwanda is a small landlocked country in East Africa, bordered by Uganda to the north, Tanzania to the east, Burundi to the south, and the Democratic Republic of the Congo to the west. Agriculture forms the backbone of Rwanda's economy, employing a large portion of the population and contributing significantly to GDP. The country is vulnerable to climate change, experiencing erratic rainfall patterns, prolonged droughts, and extreme weather events. These environmental factors disrupt agricultural productivity, leading to crop failures and food shortages.

### Data Source

We work with household survey data from Rwanda, consisting of 6 datasets collected from the [National Institute of Statistics of Rwanda (NISR)](http://microdata.statistics.gov.rw). These datasets contain information on various household characteristics, including food consumption, dietary habits, demographic details, and socio-economic factors. For our task, however, we focus only on the food consumption information and its spatial distribution.

#### The 6 datasets are:

* [2006 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/26)

* [2009 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/8)

* [2012 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/69)

* [2015 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/70)

* [2018 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/91)

* [2021 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/106)


### Import Libraries

In this section, we import the libraries used to read the survey files in their original formats (SPSS `.sav`, Stata `.dta`, and `.csv`) and to manipulate the data as DataFrames.

In [1]:
#uncomment to install savReaderWriter
#!pip install savReaderWriter

In [2]:
import pandas as pd
import pyreadstat as ps
from pandas import read_csv
import savReaderWriter as sv

### Helper Functions

In this section we define a set of helper functions designed to streamline data preprocessing tasks and facilitate the computation of Food Consumption Score (FCS) and Household Dietary Diversity Score (HDDS). These functions are designed to assist in converting dataset files from different formats to a common format (i.e., .csv), making them compatible with various data analysis tools and workflows.

In [3]:
'''
    This function decodes bytes to string and handles integer values.
    It checks if the input value is a bytes object and decodes it to a UTF-8 encoded string.
    If the value is a float and represents an integer, it converts it to an integer.
    Otherwise, it returns the original value.
    It is important while converting the .sav file to .csv.
'''

def decode_value(value):
    if isinstance(value, bytes):
        return value.decode('utf-8')
    elif isinstance(value, float) and value.is_integer():
        return int(value)
    else:
        return value
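
As a quick illustration of what `decode_value` does, the sketch below applies it to a made-up row. The row contents are hypothetical and only mimic the kind of values a `.sav` reader may return (bytes for text, floats for numbers); the function body is repeated so the example runs on its own.

```python
def decode_value(value):
    # Same logic as the helper above, repeated so this sketch is self-contained
    if isinstance(value, bytes):
        return value.decode('utf-8')
    elif isinstance(value, float) and value.is_integer():
        return int(value)
    else:
        return value

# Hypothetical raw row: bytes label, whole-number float, fractional float, missing value
raw_row = [b'EAST', 7.0, 2.5, None]
decoded_row = [decode_value(v) for v in raw_row]
print(decoded_row)  # ['EAST', 7, 2.5, None]
```

Whole-number floats become integers, while fractional values and missing values pass through unchanged.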

In [4]:
'''
    This function converts a .sav (SPSS) file to a .csv (comma-separated values) file.
    It reads the .sav file using sv.SavReader, extracts column names, and decodes values using the decode_value function.
    The data is then converted to a DataFrame and saved as a .csv file at the specified path.
'''

def sav_to_csv(sav_path,csv_path):
    with sv.SavReader(sav_path) as reader:
        # Extract the column names
        column_names = [name.decode('utf-8') for name in reader.header]
        
        # Read the data and decode values
        data = [[decode_value(value) for value in row] for row in reader]
    
    df = pd.DataFrame(data, columns=column_names)
    df.to_csv(csv_path, index=False)
    print(f"A new dataset is saved to {csv_path}")

In [5]:
'''
    This function converts a .dta (Stata) file to a .csv file.
    It reads the .dta file using pd.read_stata and loads it into a DataFrame.
    The DataFrame is then saved as a .csv file at the specified path
'''

def dta_to_csv(dta_path, csv_path):
    df = pd.read_stata(dta_path)
    df.to_csv(csv_path, index=False)
    print(f"A new dataset is saved to {csv_path}")

In [6]:
'''
    This function subsets a DataFrame to the specified columns, adds a year column,
    and saves the resulting subset to a new CSV file.
    It is helpful when your dataset has a large number of columns and you only need to work with some of them.
    
    Parameters:
        df (DataFrame): The original DataFrame.
        columns_to_keep (list): A list of column names to keep.
        output_file (str): The path to the output CSV file.
        year(int): The year associated with the dataset, which will be added as a new column.
        rename_columns (dict, optional): A dictionary where keys are original column names and values are new names.
'''

def subset_and_save(df, columns_to_keep, output_file, year, rename_columns=None):
    
    # Selecting columns to keep
    # .copy() ensures the later insert() does not act on a view of the original DataFrame
    df_subset = df[columns_to_keep].copy()

    # Optionally renaming columns
    if rename_columns:
        df_subset = df_subset.rename(columns=rename_columns)
    
    # Adding a year column
    df_subset.insert(0, 'year', year)

    # Save the subset DataFrame to a new CSV file
    df_subset.to_csv(output_file, index=False)

    print(f"A new dataset is saved to {output_file}")
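
The following is a minimal usage sketch of `subset_and_save` on a toy DataFrame. The data values, the `unused` column, and the temporary output path are invented for illustration. The helper is repeated here with two small changes that are assumptions of this sketch, not part of the notebook: a defensive `.copy()` and a return value so the result can be inspected.

```python
import os
import tempfile

import pandas as pd

def subset_and_save(df, columns_to_keep, output_file, year, rename_columns=None):
    # Same steps as the helper above; returns the subset for inspection (sketch only)
    df_subset = df[columns_to_keep].copy()
    if rename_columns:
        df_subset = df_subset.rename(columns=rename_columns)
    df_subset.insert(0, 'year', year)
    df_subset.to_csv(output_file, index=False)
    return df_subset

# Toy survey extract with invented values
toy = pd.DataFrame({
    'hid': [1, 2],
    'novprov': ['EAST', 'WEST'],
    'q9_4_1': [3, 0],
    'unused': ['x', 'y'],
})

out_path = os.path.join(tempfile.mkdtemp(), 'toy_subset.csv')
result = subset_and_save(
    toy,
    ['hid', 'novprov', 'q9_4_1'],
    out_path,
    2006,
    rename_columns={'novprov': 'province', 'q9_4_1': 'maize'},
)
print(list(result.columns))  # ['year', 'hid', 'province', 'maize']
```

The `year` column is inserted first, the renamed columns follow in their original order, and the `unused` column is dropped.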

In [7]:
#convert from .sav to csv file
#o_path= 'Rwanda/2018/rwanda_2018_preprocessed_data/1_CFSVA18_DB_HouseholdQues_Full_Annex_201904_NISR.sav'
#u_path = 'output_file.csv'
#sav_to_csv(o_path,u_path)

In [8]:
#convert from dta to csv file
#o_path= 'Rwanda/2021/rwanda_2021_preprocessed_data/CFSVA_HH_2021_MASTER_DATASET.dta'
#u_path = 'output_file.csv'
#dta_to_csv(o_path,u_path)

### Rwanda 2006 Household Data

In [9]:
#data = read_csv('Rwanda/2006/rwanda_2006_preprocessed_data/June_10_Section1_11.csv',header=0, delimiter=',')
#data.shape

In [10]:
'''
df = data
year = 2006

# Arrays containing the column names you want to keep
columns_to_keep = ['hid', 'novprov','novdistr','novsect','q9_4_1','q9_4_2','q9_4_3','q9_4_4',
                   'q9_4_5', 'q9_4_6', 'q9_4_7','q9_4_8', 'q9_4_9', 'q9_4_10', 'q9_4_11','q9_4_12',
                   'q9_4_13', 'q9_4_14', 'q9_4_15','q9_4_16', 'q9_4_17', 'q9_4_18', 'q9_4_19','q9_4_20',
                   'q9_4_21']
# Output file name
output_file = 'rwanda_2006.csv'

# Optional: Dictionary for renaming columns
rename_columns = {'novprov': 'province', 'novdistr': 'district', 'novsect': 'sector',
                  'q9_4_1': 'maize', 'q9_4_2': 'rice', 'q9_4_3': 'cereal', 'q9_4_4': 'cassava',
                  'q9_4_5': 'sweet_potato', 'q9_4_6': 'roots', 'q9_4_7': 'bread', 'q9_4_8': 'cooking_banana',
                  'q9_4_9': 'beans_peas', 'q9_4_10': 'vegetables', 'q9_4_11': 'cassava_leaves', 'q9_4_12': 'ground_nuts',
                  'q9_4_13': 'sunflower', 'q9_4_14': 'fruits', 'q9_4_15': 'fish', 'q9_4_16': 'meat',
                  'q9_4_17': 'poultry', 'q9_4_18': 'eggs', 'q9_4_19': 'oil', 'q9_4_20': 'sugar',
                  'q9_4_21': 'milk'}  

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)
'''

#### Calculate Food Consumption Score For 2006

In [433]:
data = read_csv('Rwanda/2006/rwanda_2006_preprocessed_data/rwanda_2006.csv',header=0, delimiter=',')
data.shape

(2783, 26)

In [434]:
data.columns

Index(['year', 'hid', 'province', 'district', 'sector', 'maize', 'rice',
       'cereal', 'cassava', 'sweet_potato', 'roots', 'bread', 'cooking_banana',
       'beans_peas', 'vegetables', 'cassava_leaves', 'ground_nuts',
       'sunflower', 'fruits', 'fish', 'meat', 'poultry', 'eggs', 'oil',
       'sugar', 'milk'],
      dtype='object')

In [435]:
# food group mapping
food_group_mapping = {
    'maize': 'sum_cereals_tubers',
    'rice': 'sum_cereals_tubers',
    'cereal': 'sum_cereals_tubers',
    'cassava': 'sum_cereals_tubers',
    'sweet_potato': 'sum_cereals_tubers',
    'roots': 'sum_cereals_tubers',
    'bread': 'sum_cereals_tubers',
    'cooking_banana': 'sum_cereals_tubers',
    'beans_peas': 'sum_pulses_nuts',
    'vegetables': 'sum_vegetables_leaves',
    'cassava_leaves': 'sum_vegetables_leaves',
    'ground_nuts': 'sum_pulses_nuts',
    'sunflower': 'sum_oil',
    'fruits': 'sum_fruits',
    'fish': 'sum_animal_protein',
    'meat': 'sum_animal_protein',
    'poultry': 'sum_animal_protein',
    'eggs': 'sum_animal_protein',
    'oil': 'sum_oil',
    'sugar': 'sum_sugar',
    'milk': 'sum_dairy_products'
}

# Create a new DataFrame to store the summed values for each food group
new_data = pd.DataFrame()

# Iterate through the food group mapping
for group_name in set(food_group_mapping.values()):
    # Select columns belonging to the current food group
    group_columns = [col for col in data.columns if food_group_mapping.get(col) == group_name]
    # Sum the values across the columns for each row
    group_sum = data.groupby('hid')[group_columns].sum()
    # Add the summed values to the new DataFrame
    new_data[group_name] = group_sum.sum(axis=1)

data = pd.merge(data[['year','hid','province','district','sector']], new_data, on='hid', how='left')
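
The aggregation step can also be sketched without `groupby`, assuming each household appears in exactly one row (which the notebook's later merge on `hid` also relies on). The toy values below are invented, and only a subset of the mapping is used; the idea is to invert the item-to-group mapping and sum the item columns row by row.

```python
import pandas as pd

# Toy data with invented values: days each item was eaten in the past week
toy = pd.DataFrame({
    'hid': [101, 102],
    'maize': [3, 0],
    'rice': [2, 7],
    'beans_peas': [7, 1],
    'milk': [0, 4],
})

# Subset of the item -> food group mapping used in the notebook
mapping = {
    'maize': 'sum_cereals_tubers',
    'rice': 'sum_cereals_tubers',
    'beans_peas': 'sum_pulses_nuts',
    'milk': 'sum_dairy_products',
}

# Invert the mapping: food group -> list of item columns
groups = {}
for item, group in mapping.items():
    groups.setdefault(group, []).append(item)

# Row-wise sum of the item columns in each group
sums = pd.DataFrame({group: toy[cols].sum(axis=1) for group, cols in groups.items()})
result = pd.concat([toy[['hid']], sums], axis=1)
print(result)
```

Building the groups from the dictionary keeps the column order deterministic, whereas iterating over `set(...)` can order the group columns differently from run to run.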

In [436]:
data.head()

Unnamed: 0,year,hid,province,district,sector,sum_sugar,sum_animal_protein,sum_dairy_products,sum_pulses_nuts,sum_vegetables_leaves,sum_oil,sum_cereals_tubers,sum_fruits
0,2006,120303105,EAST,NGOMA,MUGESERA,7.0,9.0,0.0,10.0,7.0,4.0,19.0,0.0
1,2006,80201303,WEST,RUBAVU,CYANZARWE,2.0,0.0,0.0,7.0,0.0,0.0,12.0,0.0
2,2006,60302903,WEST,NYAMASHEKE,KAGANO,0.0,0.0,0.0,7.0,7.0,0.0,7.0,0.0
3,2006,121009606,EAST,KIREHE,MUSAZA,0.0,0.0,7.0,5.0,4.0,0.0,11.0,0.0
4,2006,30401901,SOUTH,RUHANGO,RUHANGO,7.0,0.0,0.0,8.0,5.0,5.0,14.0,0.0


In [437]:
data.describe()

Unnamed: 0,year,hid,sum_sugar,sum_animal_protein,sum_dairy_products,sum_pulses_nuts,sum_vegetables_leaves,sum_oil,sum_cereals_tubers,sum_fruits
count,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0
mean,2006.0,73188170.0,1.215235,0.563061,0.625584,5.737334,3.655049,3.403881,10.782609,0.306504
std,0.0,33550010.0,2.396384,1.474768,1.885266,2.761096,2.689867,2.998085,5.136717,1.056892
min,2006.0,10604800.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2006.0,40752200.0,0.0,0.0,0.0,4.0,1.0,0.0,7.0,0.0
50%,2006.0,70603610.0,0.0,0.0,0.0,7.0,3.0,3.0,9.0,0.0
75%,2006.0,110200900.0,1.0,0.0,0.0,7.0,6.0,7.0,14.0,0.0
max,2006.0,121012400.0,8.0,14.0,8.0,14.0,14.0,14.0,36.0,8.0


In [439]:
# Cap each food group at 7, the maximum number of days in the one-week recall period.
data.iloc[:, 5:] = data.iloc[:, 5:].clip(upper=7)
data.describe()

Unnamed: 0,year,hid,sum_sugar,sum_animal_protein,sum_dairy_products,sum_pulses_nuts,sum_vegetables_leaves,sum_oil,sum_cereals_tubers,sum_fruits
count,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0
mean,2006.0,73188170.0,1.214876,0.539705,0.624865,5.357887,3.572404,3.353935,6.759612,0.306144
std,0.0,33550010.0,2.395441,1.333547,1.882642,2.273653,2.531126,2.900604,0.917946,1.054443
min,2006.0,10604800.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2006.0,40752200.0,0.0,0.0,0.0,4.0,1.0,0.0,7.0,0.0
50%,2006.0,70603610.0,0.0,0.0,0.0,7.0,3.0,3.0,7.0,0.0
75%,2006.0,110200900.0,1.0,0.0,0.0,7.0,6.0,7.0,7.0,0.0
max,2006.0,121012400.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0
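
Why the cap is needed: summing several items of the same group can exceed 7 days (for example, maize on 7 days plus rice on 7 days gives 14), while a food group cannot be eaten on more than 7 days of a week. A small sketch of the cap on toy values, using `DataFrame.clip`:

```python
import pandas as pd

# Toy group sums; values above 7 arise when several items of a group are added up
toy = pd.DataFrame({
    'sum_cereals_tubers': [19.0, 12.0, 7.0],
    'sum_oil': [4.0, 0.0, 14.0],
})

# Values above 7 become 7; values at or below 7 are left unchanged
capped = toy.clip(upper=7)
print(capped)
```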


In [440]:
data.head()

Unnamed: 0,year,hid,province,district,sector,sum_sugar,sum_animal_protein,sum_dairy_products,sum_pulses_nuts,sum_vegetables_leaves,sum_oil,sum_cereals_tubers,sum_fruits
0,2006,120303105,EAST,NGOMA,MUGESERA,7.0,7.0,0.0,7.0,7.0,4.0,7.0,0.0
1,2006,80201303,WEST,RUBAVU,CYANZARWE,2.0,0.0,0.0,7.0,0.0,0.0,7.0,0.0
2,2006,60302903,WEST,NYAMASHEKE,KAGANO,0.0,0.0,0.0,7.0,7.0,0.0,7.0,0.0
3,2006,121009606,EAST,KIREHE,MUSAZA,0.0,0.0,7.0,5.0,4.0,0.0,7.0,0.0
4,2006,30401901,SOUTH,RUHANGO,RUHANGO,7.0,0.0,0.0,7.0,5.0,5.0,7.0,0.0


In [441]:
# Save the intermediate data; it may be useful later
data.to_csv('rwanda_2006_with_sum_of_food_groups.csv', index=False)

##### Food Consumption Score

In [443]:
data.columns

Index(['year', 'hid', 'province', 'district', 'sector', 'sum_sugar',
       'sum_animal_protein', 'sum_dairy_products', 'sum_pulses_nuts',
       'sum_vegetables_leaves', 'sum_oil', 'sum_cereals_tubers', 'sum_fruits'],
      dtype='object')

In [444]:
# Calculate the FCS as the weighted sum of the food-group frequencies, using the WFP weights
data['fcs'] = (
    data['sum_cereals_tubers'] * 2 +
    data['sum_pulses_nuts'] * 3 +
    data['sum_vegetables_leaves'] * 1 +
    data['sum_fruits'] * 1 +
    data['sum_animal_protein'] * 4 +
    data['sum_dairy_products'] * 4 +
    data['sum_sugar'] * 0.5 +
    data['sum_oil'] * 0.5 
)
data.head()

Unnamed: 0,year,hid,province,district,sector,sum_sugar,sum_animal_protein,sum_dairy_products,sum_pulses_nuts,sum_vegetables_leaves,sum_oil,sum_cereals_tubers,sum_fruits,fcs
0,2006,120303105,EAST,NGOMA,MUGESERA,7.0,7.0,0.0,7.0,7.0,4.0,7.0,0.0,75.5
1,2006,80201303,WEST,RUBAVU,CYANZARWE,2.0,0.0,0.0,7.0,0.0,0.0,7.0,0.0,36.0
2,2006,60302903,WEST,NYAMASHEKE,KAGANO,0.0,0.0,0.0,7.0,7.0,0.0,7.0,0.0,42.0
3,2006,121009606,EAST,KIREHE,MUSAZA,0.0,0.0,7.0,5.0,4.0,0.0,7.0,0.0,61.0
4,2006,30401901,SOUTH,RUHANGO,RUHANGO,7.0,0.0,0.0,7.0,5.0,5.0,7.0,0.0,46.0
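
To make the weighting explicit, the sketch below recomputes the FCS for the first two households shown above, with the weights held in a dictionary. The variable names (`weights`, `toy`) are illustrative; the weights and the capped frequencies are the ones used in this notebook.

```python
import pandas as pd

# Food-group weights as used in the FCS cell above
weights = {
    'sum_cereals_tubers': 2,
    'sum_pulses_nuts': 3,
    'sum_vegetables_leaves': 1,
    'sum_fruits': 1,
    'sum_animal_protein': 4,
    'sum_dairy_products': 4,
    'sum_sugar': 0.5,
    'sum_oil': 0.5,
}

# Capped frequencies of the first two households in the table above
toy = pd.DataFrame({
    'sum_sugar': [7.0, 2.0],
    'sum_animal_protein': [7.0, 0.0],
    'sum_dairy_products': [0.0, 0.0],
    'sum_pulses_nuts': [7.0, 7.0],
    'sum_vegetables_leaves': [7.0, 0.0],
    'sum_oil': [4.0, 0.0],
    'sum_cereals_tubers': [7.0, 7.0],
    'sum_fruits': [0.0, 0.0],
})

# Multiply each column by its weight (aligned on column names), then sum per row
fcs = toy[list(weights)].mul(pd.Series(weights)).sum(axis=1)
print(fcs.tolist())  # [75.5, 36.0]
```

For the first household: 7×2 + 7×3 + 7×1 + 0×1 + 7×4 + 0×4 + 7×0.5 + 4×0.5 = 75.5, which matches the `fcs` column.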


In [445]:
# Save the data with the FCS; it is merged with the HDDS later
data.to_csv('rwanda_2006_with_fcs.csv', index=False)

#### Calculate Household Dietary Diversity Score (HDDS) For 2006

In [462]:
data = read_csv('Rwanda/2006/rwanda_2006_preprocessed_data/rwanda_2006.csv',header=0, delimiter=',')
data.shape

(2783, 26)

In [463]:
data.columns

Index(['year', 'hid', 'province', 'district', 'sector', 'maize', 'rice',
       'cereal', 'cassava', 'sweet_potato', 'roots', 'bread', 'cooking_banana',
       'beans_peas', 'vegetables', 'cassava_leaves', 'ground_nuts',
       'sunflower', 'fruits', 'fish', 'meat', 'poultry', 'eggs', 'oil',
       'sugar', 'milk'],
      dtype='object')

In [464]:
# food group mapping
food_group_mapping = {
    'maize': 'sum_cereals',
    'rice': 'sum_cereals',
    'cereal': 'sum_cereals',
    'cassava': 'sum_roots_tubers',
    'sweet_potato': 'sum_roots_tubers',
    'roots': 'sum_roots_tubers',
    'bread': 'sum_cereals',
    'cooking_banana': 'sum_roots_tubers',
    'beans_peas': 'sum_pulses_nuts',
    'vegetables': 'sum_vegetables_leaves',
    'cassava_leaves': 'sum_vegetables_leaves',
    'ground_nuts': 'sum_pulses_nuts',
    'sunflower': 'sum_oil',
    'fruits': 'sum_fruits',
    'fish': 'sum_fish',
    'meat': 'sum_meat',
    'poultry': 'sum_meat',
    'eggs': 'sum_eggs',
    'oil': 'sum_oil',
    'sugar': 'sum_sugar',
    'milk': 'sum_milk'
}

# Create a new DataFrame to store the summed values for each food group
new_data = pd.DataFrame()

# Iterate through the food group mapping
for group_name in set(food_group_mapping.values()):
    # Select columns belonging to the current food group
    group_columns = [col for col in data.columns if food_group_mapping.get(col) == group_name]
    # Sum the values across the columns for each row
    group_sum = data.groupby('hid')[group_columns].sum()
    # Add the summed values to the new DataFrame
    new_data[group_name] = group_sum.sum(axis=1)
data = pd.merge(data[['year','hid','province','district','sector']], new_data, on='hid', how='left')

In [466]:
data.head()

Unnamed: 0,year,hid,province,district,sector,sum_eggs,sum_sugar,sum_roots_tubers,sum_meat,sum_cereals,sum_pulses_nuts,sum_oil,sum_vegetables_leaves,sum_fish,sum_milk,sum_fruits
0,2006,120303105,EAST,NGOMA,MUGESERA,7.0,7.0,9.0,2.0,10.0,10.0,4.0,7.0,0.0,0.0,0.0
1,2006,80201303,WEST,RUBAVU,CYANZARWE,0.0,2.0,10.0,0.0,2.0,7.0,0.0,0.0,0.0,0.0,0.0
2,2006,60302903,WEST,NYAMASHEKE,KAGANO,0.0,0.0,7.0,0.0,0.0,7.0,0.0,7.0,0.0,0.0,0.0
3,2006,121009606,EAST,KIREHE,MUSAZA,0.0,0.0,6.0,0.0,5.0,5.0,0.0,4.0,0.0,7.0,0.0
4,2006,30401901,SOUTH,RUHANGO,RUHANGO,0.0,7.0,5.0,0.0,9.0,8.0,5.0,5.0,0.0,0.0,0.0


In [468]:
# Convert each food-group sum to a 0/1 indicator (1 if the group was consumed at all) so that the HDDS is a simple sum
data.iloc[:, 5:] = (data.iloc[:, 5:] > 0).astype(int)
data.describe()

Unnamed: 0,year,hid,sum_eggs,sum_sugar,sum_roots_tubers,sum_meat,sum_cereals,sum_pulses_nuts,sum_oil,sum_vegetables_leaves,sum_fish,sum_milk,sum_fruits
count,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0,2783.0
mean,2006.0,73188170.0,0.031621,0.252964,0.956162,0.107079,0.748832,0.931728,0.675889,0.821056,0.123608,0.11678,0.11678
std,0.0,33550010.0,0.175019,0.434789,0.20477,0.309269,0.433763,0.252257,0.468126,0.383374,0.329193,0.321216,0.321216
min,2006.0,10604800.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2006.0,40752200.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
50%,2006.0,70603610.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
75%,2006.0,110200900.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
max,2006.0,121012400.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [469]:
data.columns

Index(['year', 'hid', 'province', 'district', 'sector', 'sum_eggs',
       'sum_sugar', 'sum_roots_tubers', 'sum_meat', 'sum_cereals',
       'sum_pulses_nuts', 'sum_oil', 'sum_vegetables_leaves', 'sum_fish',
       'sum_milk', 'sum_fruits'],
      dtype='object')

In [470]:
#calculate the HDDS 
data['hdds']  = (
    data['sum_cereals'] + 
    data['sum_pulses_nuts'] +
    data['sum_eggs'] +
    data['sum_fish'] +
    data['sum_meat'] + 
    data['sum_milk'] + 
    data['sum_roots_tubers'] + 
    data['sum_vegetables_leaves'] + 
    data['sum_sugar'] + 
    data['sum_fruits'] +
    data['sum_oil']
)
data.head()

Unnamed: 0,year,hid,province,district,sector,sum_eggs,sum_sugar,sum_roots_tubers,sum_meat,sum_cereals,sum_pulses_nuts,sum_oil,sum_vegetables_leaves,sum_fish,sum_milk,sum_fruits,hdds
0,2006,120303105,EAST,NGOMA,MUGESERA,1,1,1,1,1,1,1,1,0,0,0,8
1,2006,80201303,WEST,RUBAVU,CYANZARWE,0,1,1,0,1,1,0,0,0,0,0,4
2,2006,60302903,WEST,NYAMASHEKE,KAGANO,0,0,1,0,0,1,0,1,0,0,0,3
3,2006,121009606,EAST,KIREHE,MUSAZA,0,0,1,0,1,1,0,1,0,1,0,5
4,2006,30401901,SOUTH,RUHANGO,RUHANGO,0,1,1,0,1,1,1,1,0,0,0,6
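
Equivalently, the HDDS as computed here is the number of food groups with a non-zero frequency. The sketch below checks this on the raw group sums of two households from the table of group sums shown earlier (rows 0 and 2); the variable names are illustrative.

```python
import pandas as pd

# Raw group sums for households 120303105 (row 0) and 60302903 (row 2)
toy = pd.DataFrame({
    'sum_eggs': [7.0, 0.0],
    'sum_sugar': [7.0, 0.0],
    'sum_roots_tubers': [9.0, 7.0],
    'sum_meat': [2.0, 0.0],
    'sum_cereals': [10.0, 0.0],
    'sum_pulses_nuts': [10.0, 7.0],
    'sum_oil': [4.0, 0.0],
    'sum_vegetables_leaves': [7.0, 7.0],
    'sum_fish': [0.0, 0.0],
    'sum_milk': [0.0, 0.0],
    'sum_fruits': [0.0, 0.0],
})

# Count the food groups consumed at least once
hdds = (toy > 0).sum(axis=1)
print(hdds.tolist())  # [8, 3]
```

Both values agree with the `hdds` column above.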


In [471]:
# Save the data with the HDDS; it is merged with the FCS below
data.to_csv('rwanda_2006_with_hdds.csv', index=False)

In [479]:
# Combine the FCS and HDDS scores into a single DataFrame, joined on the household id
df_fcs = read_csv('rwanda_2006_with_fcs.csv',header=0, delimiter=',')
df_hdds = read_csv('rwanda_2006_with_hdds.csv',header=0, delimiter=',')
df_fcs = pd.merge(df_fcs[['year','hid','province','district','sector','fcs']], df_hdds[['hid','hdds']], on='hid', how='left')

In [480]:
df_fcs.head()

Unnamed: 0,year,hid,province,district,sector,fcs,hdds
0,2006,120303105,EAST,NGOMA,MUGESERA,75.5,8
1,2006,80201303,WEST,RUBAVU,CYANZARWE,36.0,4
2,2006,60302903,WEST,NYAMASHEKE,KAGANO,42.0,3
3,2006,121009606,EAST,KIREHE,MUSAZA,61.0,5
4,2006,30401901,SOUTH,RUHANGO,RUHANGO,46.0,6
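
The merge assumes that `hid` is unique in both files; a duplicated id would silently multiply rows. One hedged way to guard against this is the `validate` argument of `pd.merge`, sketched below on two small frames built from the values above. The duplicated frame `dup` is constructed only to show the failure case.

```python
import pandas as pd

df_fcs = pd.DataFrame({'hid': [120303105, 80201303], 'fcs': [75.5, 36.0]})
df_hdds = pd.DataFrame({'hid': [120303105, 80201303], 'hdds': [8, 4]})

# validate='one_to_one' raises MergeError if hid is duplicated on either side
combined = pd.merge(df_fcs, df_hdds, on='hid', how='left', validate='one_to_one')
print(combined)

# Failure case: the same household id appears twice on the right side
dup = pd.concat([df_hdds, df_hdds.iloc[[0]]], ignore_index=True)
try:
    pd.merge(df_fcs, dup, on='hid', how='left', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)  # True
```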


In [481]:
# Save the final 2006 dataset (FCS and HDDS per household)
df_fcs.to_csv('rwanda_2006_final.csv', index=False)

### Rwanda 2009 Household Data

In [38]:
#data = read_csv('Rwanda/2009/rwanda_2009_preprocessed_data/S9_Question.csv',header=0, delimiter=',')
#data.shape

(118800, 19)

In [None]:
'''
df = data
year = 2009

# Arrays containing the column names you want to keep
columns_to_keep = ['hid', 'ID1','ID2','ID4','q9_4_1','q9_4_2','q9_4_3','q9_4_4',
                   'q9_4_5', 'q9_4_6', 'q9_4_7','q9_4_8', 'q9_4_9', 'q9_4_10', 'q9_4_11','q9_4_12',
                   'q9_4_13', 'q9_4_14', 'q9_4_15','q9_4_16', 'q9_4_17', 'q9_4_18', 'q9_4_19','q9_4_20',
                   'q9_4_21']
# Output file name
output_file = 'rwanda_2009.csv'

# Optional: Dictionary for renaming columns
rename_columns = {'ID1': 'province', 'ID2': 'district', 'ID4': 'sector',
                  'q9_4_1': 'maize', 'q9_4_2': 'rice', 'q9_4_3': 'cereal', 'q9_4_4': 'cassava',
                  'q9_4_5': 'sweet_potato', 'q9_4_6': 'roots', 'q9_4_7': 'bread', 'q9_4_8': 'cooking_banana',
                  'q9_4_9': 'beans_peas', 'q9_4_10': 'vegetables', 'q9_4_11': 'cassava_leaves', 'q9_4_12': 'ground_nuts',
                  'q9_4_13': 'sunflower', 'q9_4_14': 'fruits', 'q9_4_15': 'fish', 'q9_4_16': 'meat',
                  'q9_4_17': 'poultry', 'q9_4_18': 'eggs', 'q9_4_19': 'oil', 'q9_4_20': 'sugar',
                  'q9_4_21': 'milk'}  

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)
'''

### Rwanda 2012 Household Data

In [349]:
#data = read_csv('Rwanda/2012/rwanda_2012_preprocessed_data/cfsvans_2012_household_v01.csv',header=0, delimiter=',')
#data.shape

In [335]:
'''
df = data
year = 2012

# Arrays containing the column names you want to keep
columns_to_keep = ['hh_id', 'p_code','d_code','s_code','QA904_1','QB904_1','QC904_1','QD904_1',
                   'QE904_1', 'QF904_1', 'QG904_1','QH904_1', 'QI904_1', 'QJ904_1', 'QK904_1','QL904_1',
                   'QM904_1', 'QN904_1', 'QO904_1','QP904_1', 'QQ904_1', 'QR904_1', 'QS904_1','QT904_1',
                   'QU904_1', 'QV904_1','QW904_1', 'QX904_1', 'FCS', 'FCS_category']
# Output file name
output_file = 'rwanda_2012_0.csv'  # intermediate file; region codes are replaced with names below

# Optional: Dictionary for renaming columns
rename_columns = {'hh_id': 'hid', 'p_code': 'province','d_code': 'district', 's_code': 'sector',
                  'QA904_1': 'maize', 'QB904_1': 'sorghum', 'QC904_1': 'cereals', 'QD904_1': 'cassava',
                  'QE904_1': 'sweet_potato', 'QF904_1': 'roots', 'QG904_1': 'bread', 'QH904_1': 'carrot_tubers',
                  'QI904_1': 'cooking_banana', 'QJ904_1': 'beans_peas', 'QK904_1': 'cassava_leaves', 'QL904_1': 'vegetables',
                  'QM904_1': 'other_vegetables', 'QN904_1': 'ground_nuts', 'QO904_1': 'fruits', 'QP904_1': 'other_fruits',
                  'QQ904_1': 'fish', 'QR904_1': 'organ_meat', 'QS904_1': 'flesh_meat', 'QT904_1': 'eggs',
                  'QU904_1': 'oil', 'QV904_1': 'sugar', 'QW904_1': 'milk', 'QX904_1': 'condiments'}

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)



# Adding province name, district, and sector name and removing the id

# Read the data files

data = read_csv('rwanda_2012_0.csv',header=0, delimiter=',') #data with new columns

district_data = pd.read_csv('Rwanda/2012/rwanda_2012_preprocessed_data/district_name.csv',header=0,delimiter=',')
province_data = pd.read_csv('Rwanda/2012/rwanda_2012_preprocessed_data/province_name.csv',header=0,delimiter=',')
sector_data = pd.read_csv('Rwanda/2012/rwanda_2012_preprocessed_data/sector_name.csv',header=0,delimiter=',')

# Create a mapping dictionary from district_id to district_name
district_mapping = dict(zip(district_data['dist_id'], district_data['dist_name']))

# Create a mapping dictionary from province_id to province_name
province_mapping = dict(zip(province_data['prov_id'], province_data['prov_name']))

# Create a mapping dictionary from sector_id to sector_name
sector_mapping = dict(zip(sector_data['sect_id'], sector_data['sect_name']))

# Add district, province and sector to the data using the mapping dictionary
data['district'] = data['district'].map(district_mapping)
data['province'] = data['province'].map(province_mapping)
data['sector'] = data['sector'].map(sector_mapping)

#save the data
data.to_csv('rwanda_2012.csv', index=False)
'''

Subset DataFrame saved to rwanda_2012_0.csv


### Rwanda 2015 Household Data

In [11]:
data = read_csv('Rwanda/2015/rwanda_2015_preprocessed_data/cfsva_2015_master_DB_annex.csv',header=0, delimiter=',')
data.shape

(7500, 281)

In [12]:
df = data
year = 2015

# Arrays containing the column names you want to keep
columns_to_keep = ['KEY', 'S0_C_Prov','districts','S0_E_Sect','Starch','Pulses','Meat','Vegetables','Oil',
                   'Fruit', 'Milk', 'Sugar','FCS', 'FCG', 'FS_final', 'DDS','GDDS',
                   'HDDS_24h', 'HDDS_groups']
# Output file name
output_file = 'rwanda_2015.csv'  

# Optional: Dictionary for renaming columns
rename_columns = {'KEY': 'hid','S0_C_Prov': 'province', 'districts': 'district', 'S0_E_Sect': 'sector',
                  'Starch':'starch','Pulses': 'pulses', 'Meat': 'meat', 'Vegetables': 'vegetables', 'Oil': 'oil',
                  'Fruit': 'fruits', 'Milk': 'milk', 'Sugar': 'sugar', 'HDDS_24h': 'HDDS',
                  'HDDS_groups': 'GHDDS'}  

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)

data = read_csv('rwanda_2015.csv',header=0, delimiter=',')

# Replace the original hid with sequential integer ids starting at start_value
start_value = 1010
data['hid'] = data.index + start_value
data.to_csv('rwanda_2015_0.csv', index=False)

# Adding province name, district, and sector name and removing the id

# Read the data files

data = read_csv('rwanda_2015_0.csv',header=0, delimiter=',') #data with new columns

district_data = pd.read_csv('Rwanda/2015/rwanda_2015_preprocessed_data/district_name.csv',header=0,delimiter=',')
province_data = pd.read_csv('Rwanda/2015/rwanda_2015_preprocessed_data/province_name.csv',header=0,delimiter=',')
sector_data = pd.read_csv('Rwanda/2015/rwanda_2015_preprocessed_data/sector_name.csv',header=0,delimiter=',')

# Create a mapping dictionary from district_id to district_name
district_mapping = dict(zip(district_data['dist_id'], district_data['dist_name']))

# Create a mapping dictionary from province_id to province_name
province_mapping = dict(zip(province_data['prov_id'], province_data['prov_name']))

# Create a mapping dictionary from sector_id to sector_name
sector_mapping = dict(zip(sector_data['sect_id'], sector_data['sect_name']))

# Add district, province and sector to the data using the mapping dictionary
data['district'] = data['district'].map(district_mapping)
data['province'] = data['province'].map(province_mapping)
data['sector'] = data['sector'].map(sector_mapping)

#save the data
data.to_csv('rwanda_2015_v2.csv', index=False)


A new dataset is saved to rwanda_2015.csv
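
One caveat of replacing codes with `Series.map` and a dictionary is that any code missing from the lookup table becomes `NaN` without warning. The sketch below shows a simple check for unmapped codes before the replacement. The ids and names in `province_lookup` are invented for illustration and are not taken from the NISR files; only the column names `prov_id` and `prov_name` follow the notebook.

```python
import pandas as pd

# Toy survey rows with numeric province codes; code 9 has no entry in the lookup
data = pd.DataFrame({'hid': [1010, 1011, 1012], 'province': [1, 2, 9]})

# Hypothetical lookup table with the same column names as province_name.csv
province_lookup = pd.DataFrame({'prov_id': [1, 2], 'prov_name': ['PROVINCE_A', 'PROVINCE_B']})
province_mapping = dict(zip(province_lookup['prov_id'], province_lookup['prov_name']))

# Codes present in the data but absent from the lookup would become NaN after map()
unmapped = sorted(set(data['province']) - set(province_mapping))
print(unmapped)  # [9]

data['province'] = data['province'].map(province_mapping)
print(data)
```

Running the same check for the district and sector columns would flag lookup gaps before they turn into missing region names.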


### Rwanda 2018 Household Data

In [406]:
#data = read_csv('Rwanda/2018/rwanda_2018_preprocessed_data/1_CFSVA18_DB_HouseholdQues_Full_Annex_201904_NISR.csv',header=0, delimiter=',')
#data.shape

(9709, 343)

In [15]:
'''
df = data
year = 2018

# Arrays containing the column names you want to keep
columns_to_keep = ['PARENT_KEY', 'S0_C_Prov','S0_D_Dist','S0_E_Livezone','Starch','Pulses','Meat','Vegetables','Oil',
                   'Fruit', 'Milk', 'Sugar','FCS', 'FCG', 'FS_final', 'DDS','GDDS']
# Output file name
output_file = 'rwanda_2018.csv'  

# Optional: Dictionary for renaming columns
rename_columns = {'PARENT_KEY': 'hid','S0_C_Prov': 'province', 'S0_D_Dist': 'district', 'S0_E_Livezone': 'zone',
                  'Starch': 'starch','Pulses': 'pulses', 'Meat': 'meat', 'Vegetables': 'vegetables', 'Oil': 'oil',
                  'Fruit': 'fruits', 'Milk': 'milk', 'Sugar': 'sugar'}  

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)

data = read_csv('rwanda_2018.csv',header=0, delimiter=',')

#modifying the hid value
start_value = 3010
data['new_hid'] = data.index.to_series().apply(lambda x: start_value + x)
data['hid'] = data['new_hid']
data.drop(columns=['new_hid'], inplace=True)
data.to_csv('rwanda_2018_0.csv', index=False)

district_data = pd.read_csv('Rwanda/2018/rwanda_2018_preprocessed_data/district_name.csv',header=0,delimiter=',')
province_data = pd.read_csv('Rwanda/2018/rwanda_2018_preprocessed_data/province_name.csv',header=0,delimiter=',')
zone_data = pd.read_csv('Rwanda/2018/rwanda_2018_preprocessed_data/zone_name.csv',header=0,delimiter=',')

# Create a mapping dictionary from district_id to district_name
district_mapping = dict(zip(district_data['dist_id'], district_data['dist_name']))

# Create a mapping dictionary from province_id to province_name
province_mapping = dict(zip(province_data['prov_id'], province_data['prov_name']))

# Create a mapping dictionary from zone_id to zone_name
zone_mapping = dict(zip(zone_data['zone_id'], zone_data['zone_name']))

# Add district, province and sector to the data using the mapping dictionary
data['district'] = data['district'].map(district_mapping)
data['province'] = data['province'].map(province_mapping)
data['zone'] = data['zone'].map(zone_mapping)

#save the data
data.to_csv('rwanda_2018.csv', index=False)

'''

### Rwanda 2021 Household Data

In [37]:
#data = read_csv('Rwanda/2021/rwanda_2021_preprocessed_data/CFSVA_HH_2021_MASTER_DATASET.csv',header=0, delimiter=',')
#data.shape

In [32]:
'''
df = data
year = 2021

# Arrays containing the column names you want to keep
columns_to_keep = ['S0_B_DATE', 'S0_C_Prov','S0_D_Dist','S0_E_Livezone','Starch','Pulses','Meat','Vegetables','Oil',
                   'Fruit', 'Milk', 'Sugar','FCS', 'FCG', 'FS_final']
# Output file name
output_file = 'rwanda_2021.csv'  

# Optional: Dictionary for renaming columns
rename_columns = {'S0_B_DATE': 'hid','S0_C_Prov': 'province', 'S0_D_Dist': 'district','S0_E_Livezone': 'zone',
                  'Starch': 'starch','Pulses': 'pulses', 'Meat': 'meat', 'Vegetables': 'vegetables', 'Oil': 'oil',
                  'Fruit': 'fruits', 'Milk': 'milk', 'Sugar': 'sugar'}  

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)

#read the data file
data = read_csv('rwanda_2021.csv',header=0, delimiter=',')


#modifying the hid value
start_value = 41010
data['new_hid'] = data.index.to_series().apply(lambda x: start_value + x)
data['hid'] = data['new_hid']
data.drop(columns=['new_hid'], inplace=True)
data.to_csv('rwanda_2021.csv', index=False)
'''

A new dataset saved to rwanda_2021.csv
