# Preprocessing & Calculate FCS and HDDS From Household Survey Data Of Rwanda

In this Python notebook, we will explore the process of preprocessing household survey data from Rwanda to compute two food security indicators: **Food Consumption Score (FCS)** and **Household Dietary Diversity Score (HDDS)**. These indicators play a crucial role in assessing food security and nutritional status at the household level. The aim of calculating these indicators is to use them as ground truth data in the research project of using machine learning and deep learning for prediction for security indicators(FCS and HDDS) from heterogenous data.


### Food Security Indicators
    
These are quantitative measures used to evaluate the accessibility, availability, and utilization of food at various levels starting from households,community to the national level. These indicators provide insights into the extent and severity of food insecurity, helping policymakers, researchers, and practitioners to identify vulnerable populations and design targeted interventions. There are numbers of food security indicators however as we mentioned early we will only focus on two indicators which are **Food Consumption Score(FCS)** and **Household Dietary Diversity Score(HDDS)**. The details description of these indicators will be given to the [Computation of FCS and HDDS](#compute_fcs_hdds) section.
    


### About Rwanda
Rwanda is a small landlocked country located in the heart of East Africa. Bordered by Uganda to the north, Tanzania to the east, Burundi to the south, and the Democratic Republic of the Congo to the west. Agriculture forms the backbone of Rwanda's economy, employing a large portion of the population and contributing significantly to GDP.The country is vulnerable to climate change, experiencing erratic rainfall patterns, prolonged droughts, and extreme weather events. These environmental factors disrupt agricultural productivity, leading to crop failures and food shortages.

### Data Source

We will be working with household survey data from Rwanda which consists of 6 datasets collected from [National Institute of Statistics of Rwanda (NISR)](http://microdata.statistics.gov.rw)
. These datasets contains information on various household characteristics, including food consumption, dietary habits, demographic details, and socio-economic factors.However for our task we will only focus with information concerning food consumption and their spatial distribution.

#### These 6 datasets are includes:

* [2006 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/26)

* [2009 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/8)

* [2012 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/69)

* [2015 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/70)

* [2018 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/91)

* [2021 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/106)

## Import Libraries

In this section, we import essential libraries and modules required for data preprocessing, analysis, and visualization tasks. These libraries provide robust functionalities and tools that streamline the data analysis workflow and enable us to manipulate and explore the dataset efficiently.

In [1]:
#uncomment to install savReaderWriter
#!pip install savReaderWriter

In [2]:
import pandas as pd
import pyreadstat as ps
from pandas import read_csv
import savReaderWriter as sv

## <a id='helper_function'></a> Helper Function

In this section we define a set of helper functions designed to streamline data preprocessing tasks and facilitate the computation of Food Consumption Score (FCS) and Household Dietary Diversity Score (HDDS). These functions are designed to assist in converting dataset files from different formats to a common format (i.e., .csv), making them compatible with various data analysis tools and workflows.

In [3]:
def decode_value(value):
    
    '''
    This function decodes bytes to string and handles integer values.
    It checks if the input value is a bytes object and decodes it to a UTF-8 encoded string.
    If the value is a float and represents an integer, it converts it to an integer.
    Otherwise, it returns the original value.
    It is important while converting the .sav file to .csv.
    '''
    
    if isinstance(value, bytes):
        return value.decode('utf-8')
    elif isinstance(value, float) and value.is_integer():
        return int(value)
    else:
        return value

In [4]:
def sav_to_csv(sav_path,csv_path):
    
    '''
    This function converts a .sav (SPSS) file to a .csv (comma-separated values) file.
    It reads the .sav file using sv.SavReader, extracts column names, and decodes values using the decode_value function.
    The data is then converted to a DataFrame and saved as a .csv file at the specified path.
    '''
    
    with sv.SavReader(sav_path) as reader:
        # Extract the column names
        column_names = [name.decode('utf-8') for name in reader.header]
        
        # Read the data and decode values
        data = [[decode_value(value) for value in row] for row in reader]
    
    df = pd.DataFrame(data, columns=column_names)
    df.to_csv(csv_path, index=False)
    print(f"A new dataset is saved to {csv_path}")

In [5]:
def dta_to_csv(dta_path, csv_path):
    
    '''
    This function converts a .dta (Stata) file to a .csv file.
    It reads the .dta file using pd.read_stata and loads it into a DataFrame.
    The DataFrame is then saved as a .csv file at the specified path
    '''
    
    df = pd.read_stata(dta_path)
    df.to_csv(csv_path, index=False)
    print(f"A new dataset is saved to {csv_path}")

In [14]:
def subset_and_save(df, columns_to_keep, output_file, year, rename_columns=None):
    
    '''
    This function is designed to subset a DataFrame based on specified columns, add a year column
    and save the resulting subset to a new CSV file. 
    This function is helful when your dataset have a large number of columns and you only need to work with some of them.
    
    Parameters:
        df (DataFrame): The original DataFrame.
        columns_to_keep (list): A list of column names to keep.
        output_file (str): The path to the output CSV file.
        year(int): The year associated with the dataset, which will be added as a new column.
        rename_columns (dict, optional): A dictionary where keys are original column names and values are new names.
    '''
    
    # Selecting columns to keep
    df_subset = df[columns_to_keep]

    # Optionally renaming columns
    if rename_columns:
        df_subset = df_subset.rename(columns=rename_columns)
    
    # Adding a year column
    df_subset.insert(0, 'year', year)

    # Save the subset DataFrame to a new CSV file
    df_subset.to_csv(output_file, index=False)

    print(f"A new dataset is saved to {output_file}")

In [108]:
def merge_data(df1, df2):
    
    '''
    Merge 2 data frames based on 'hhid' column using left join.

    Parameters:
        df1 (DataFrame): First data frame should be the dataframe containing the address of the household and fcs . 
        df2 (DataFrame): Second data frame should be the dataframe that has hdds.

    Returns:
        DataFrame: Merged data frame.
    '''

    merged_df = pd.merge(df1, df2[['hhid', 'hdds']], on='hhid', how='left')
        
    return merged_df

In [None]:
def concatenate_data(*dataframes):
    
    '''
    Concatenate multiple DataFrames provided by the user.

    Parameters:
    *dataframes (DataFrame): Variable number of DataFrame objects.

    Returns:
    DataFrame: Concatenated DataFrame.
    
    '''
    concatenated_df = pd.concat(dataframes)
    return concatenated_df

In [49]:
def calculate_mean(dataframe, group_by_columns, average_columns, additional_columns=None):
    
    '''
    Aggregate data in a DataFrame by specified columns and calculate the mean of one or more columns.

    Parameters:
    - dataframe (pd.DataFrame): The DataFrame containing the data to be aggregated.
    - group_by_columns (list): The list of column names to group the data by.
    - average_columns (list): A list of column names to calculate the average.
    - additional_columns (list, optional): A list of column names to include in the output DataFrame.

    Returns:
    pd.DataFrame: A new DataFrame containing the aggregated results with the following columns:
        - [average_columns]: The average values of the specified average_columns.
        - Count: The count of rows used to calculate the mean values.
        - Additional_columns (optional): The values of the specified additional_columns if provided.
    '''

    # Group by the specified columns and calculate the mean for each average column
    aggregated_data = dataframe.groupby(group_by_columns)[average_columns].mean().reset_index()
    
    # Calculate the count based on the first average column
    count_column = dataframe.groupby(group_by_columns)[average_columns[0]].count().reset_index()
    # Rename the count column
    count_column.rename(columns={average_columns[0]: 'count'}, inplace=True)
    
    # Merge the count column with the aggregated data
    aggregated_data = pd.merge(aggregated_data, count_column, on=group_by_columns)
    
    # If additional columns are specified, merge them into the aggregated DataFrame
    if additional_columns:
        for additional_column in additional_columns:
            additional_df = dataframe[group_by_columns + [additional_column]].drop_duplicates()
            aggregated_data = pd.merge(additional_df, aggregated_data, on=group_by_columns)
            
    # Reorder columns so that additional columns appear first
    if additional_columns:
        columns_order = group_by_columns + additional_columns + average_columns + ['count']
        aggregated_data = aggregated_data[columns_order]
    
    return aggregated_data

## <a id='compute_fcs_hdds'></a> Computation of FCS and HDDS

In this section, we will explain how we will compute the **Food Consumption Score (FCS)** and the **Household Dietary Diversity Score (HDDS)**. Since we will work of multiple dataset in our study area, we will adopt a standardized approach for calculating both FCS and HDDS across all datasets. Regardless of the dataset's specific characteristics, we will apply consistent methods to ensure comparability and reliability of the results.

###  <a id='fcs'></a>1. Food Consumption Score (FCS)

The Food Consumption Score (FCS) is a food security indicator developed by the [World Food Programme (WFP)](https://resources.vam.wfp.org/data-analysis/quantitative/food-security/food-consumption-score). It serves as an essential index for assessing household food consumption patterns and nutritional adequacy. The data to compute the FCS are collected by household survey questionnaire by asking a respondent the list of food groups they have consumed for the past seven days. The FCS aggregates these data on the diversity and frequency of food groups consumed over the previous seven days **(7)** which is then weighted according to the relative nutritional value of the consumed food groups as specified in the table below.

<p><img src="images/food_weights.png"  align="centre" alt="food groups weighs" style="width:600px;height:300px;"></p>


#### Steps to Compute FCS
 1. Group food items in the specified food groups
 2. Sum all the consumption frequencies of food items within the same group but the maximum limit should be 7
 3. Multiply the value of each food group by its weight as explained in table
 4. Sum the weighted food group scores to obtain the overall FCS
 5. Determine the household's food consumption status based on the following thresholds: 
     * 0 - 21 : Poor
     * 21.5 - 35 : Borderline
     * &gt; 35 : Acceptable
     
##### Mathematically the FCS is represented as:

$$
FCS = \sum_{j=1}^{9}f_j \times x_j
$$

**Where:**

- $(f_j$) represents the frequency of consumption of food group $(j$).
- $(x_j$) represents the nutritional value (weight) of food group $(j$).
- $(j$) ranges from 1 to 9, representing the nine food groups considered in the calculation.

For more detailed description about FCS Calculation and uses in food security analysis can be found in [this Document](https://documents.wfp.org/stellent/groups/public/documents/manual_guide_proced/wfp197216.pdf). The source code and sample of the data can also be found [Here](https://resources.vam.wfp.org/data-analysis/quantitative/food-security/food-consumption-score)

#### Calculate Food Consumption Score Function (CFCSF)

This function compute the food consumption score of a given dataframe. It implements the instruction of **Step 1** to **Step 4**  to compute the overall score. The weights of each food group is defined as explained in the section above. 

In [248]:
def calculate_fcs(data, food_group_mapping):
    # Define weights for each food group
    weights = {
        'cereals_tubers': 2,
        'pulses_nuts': 3,
        'vegetables_leaves': 1,
        'fruits': 1,
        'animal_protein': 4,
        'dairy_products': 4,
        'sugar': 0.5,
        'oil': 0.5,
        'condiments': 0
    }
    
    # Create a new DataFrame to store the summed values for each food group
    new_data = pd.DataFrame()
    
    # Add 'hhid' column to new_data
    new_data['hhid'] = data['hhid']
    
    # Iterate through the food group mapping
    for group_name in set(food_group_mapping.values()):
        # Select columns belonging to the current food group
        group_columns = [col for col in data.columns if food_group_mapping.get(col) == group_name]
        # Sum all the consumption frequencies of food items within the same group but capped at 7
        group_sum = data[group_columns].sum(axis=1).clip(upper=7)
        # Set maximum value to 1 if greater than 0, otherwise set to 0
        group_sum = group_sum.apply(lambda x: 7 if x > 7 else x)
        # Add the sum of consumption frequencies to the new DataFrame without applying weights
        new_data[group_name] = group_sum

    # Compute the FCS by multiplying the sum of each food group with its corresponding weight
    fcs_weights = new_data.drop(columns=['hhid']) * new_data.drop(columns=['hhid']).apply(lambda x: weights[x.name])
    new_data['fcs'] = fcs_weights.sum(axis=1)
    
    # Merge the calculated fcs back into the original DataFrame
    #data = pd.merge(data[['year', 'hhid', 'province', 'district', 'sector']], new_data, on='hhid', how='left')
    
    #uncomment this if there is no sector in the dataset you are working with
    data = pd.merge(data[['year', 'hhid', 'province', 'district', 'zone']], new_data, on='hhid', how='left')
    #data = pd.merge(data[['year', 'hhid', 'province', 'district', 'zone', 'o_fcs']], new_data, on='hhid', how='left')
    
    #for the datasets that contain an fcs uncomment to compare the results
    #data = pd.merge(data[['year', 'hhid', 'province', 'district', 'sector','o_fcs']], new_data, on='hhid', how='left')
    
    return data


### 2. Household Dietary Diversity Score (HDDS)

Household dietary diversity Score (HDDS) is a qualitative measure of food consumption that reflects household access to a variety of foods that indicate dietary diversity and nutritional quality.The HDDS consists of a simple count of food groups that a household has consumed over the preceding 24 hours. Each food group is assigned a score of **1 (if consumed over the previous 24 hours)** or **0 (if not consumed in the last 24 hours)**. The household score will range between 0 to 12 and is equal to the total number of food groups consumed by the household.

##### The following 12 food groups are used to calculate the HDDS indicator:

<img src="images/hdds_groups.png"  align="centre" alt="food groups weighs" style="width:400px;height:200px;">

#### Steps to Compute HDDS
 1. Group food items in the specified food groups
 2. Sum all the consumption frequencies of food items within the same group but the maximum limit should be 1
 3. Sum the food group scores to obtain the overall HDDS.

##### Mathematically the HDDS is represented as: 
$$
HDDS = \sum_{j=1}^{12} x_j
$$

where $( x_j $) equals 1 if the household consumed food from group $( j $) in the past 24 hours, and 0 otherwise.



#### Calculate Household Dietary Diversity Score Function (CHDDSF)
This function compute the household dietary diversity score for a given data frame.

In [214]:
def calculate_hdds(data, food_group_mapping):
    
    '''
    Processes the given DataFrame by grouping by 'hhid' and 'food_group',
    summing up 'is_consumed', pivoting the data to get food groups as columns, and
    calculating Household Dietary Diversity Score (HDDS) by summing the values
    of all the columns present in each food group.

    Parameters:
    - data (pd.DataFrame): The input DataFrame with at least 'hhid','food_group', and 'is_consumed' columns.

    Returns:
    - pd.DataFrame: A DataFrame with 'hhid', the value of all the columns, and the summed results as HDDS
                    for each household.
    '''
    
    # Create a new DataFrame to store the summed values for each food group
    new_data = pd.DataFrame()
    
    # Add 'hhid' column to new_data
    new_data['hhid'] = data['hhid']
    
    # Iterate through the food group mapping
    for group_name in set(food_group_mapping.values()):
        # Select columns belonging to the current food group
        group_columns = [col for col in data.columns if food_group_mapping.get(col) == group_name]
        # Sum all the consumption frequencies of food items within the same group
        group_sum = data[group_columns].sum(axis=1)
        # Set maximum value to 1 if greater than 0, otherwise set to 0
        group_sum = group_sum.apply(lambda x: 1 if x > 0 else 0)
        # Add the sum of consumption frequencies to the new DataFrame 
        new_data[group_name] = group_sum
        
    # Calculate Household Dietary Diversity Score (HDDS) by summing all the food group columns across the row
    new_data['hdds'] = new_data.drop(columns=['hhid']).sum(axis=1)
    
    # Merge the calculated fcs back into the original DataFrame
    #data = pd.merge(data[['year', 'hhid', 'province', 'district', 'sector']], new_data, on='hhid', how='left')
    
    #uncomment this if there is no sector in the dataset you are working with
    data = pd.merge(data[['year', 'hhid', 'province', 'district', 'zone']], new_data, on='hhid', how='left')
    
    return data


## Comprehensive Food Security and Vulnerability Analysis (CFSVA) - 2006

This dataset is public available on [2006 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/26). The purpose of this Comprehensive Food security and Vulnerability Analysis (CFSVA) is to provide an accurate baseline and understanding of chronic food insecurity and vulnerability conditions in rural Rwanda, and how best to respond to them. For more discription about this dataset [Click Here](https://microdata.statistics.gov.rw/index.php/catalog/26/study-description). The dataset consinst of multiples files, however for our task we will mostly focus on HouseHold Questionnaire, for more description about the data definition of each files and their associate variables [Visit Here](https://microdata.statistics.gov.rw/index.php/catalog/26/data_dictionary)

#### Loading the data sources

The name of the file we have used in this data set is **June_10_Section1_11.sav** The file contains the data related to
* Demographics
* Housing and Facilities
* Household Assets and Productive Assets
* Inputs to Livelihood
* Migration & Remittances
* Sources of Credit
* Expenditure
* Food Sources and Consumption
* Shocks and Food Security and
* Programme Participation.

This file is in SPSS Statistics Data File Format, therefore we need to change them into csv format for easily manipulation

In [12]:
'''
#convert June_10_Section1_11.sav to csv file
o_path= 'Rwanda/2006/rwanda_2006_preprocessed_data/June_10_Section1_11.sav'
u_path = 'Rwanda/2006/rwanda_2006_preprocessed_data/June_10_Section1_11.csv'
sav_to_csv(o_path,u_path) #call the function to convert

data = read_csv('Rwanda/2006/rwanda_2006_preprocessed_data/June_10_Section1_11.csv',header=0, delimiter=',')
data.shape

'''

A new dataset is saved to Rwanda/2006/rwanda_2006_preprocessed_data/June_10_Section1_11.csv


The dataset contain a large number of columns therefore we will use **subset_and_save** function to extra only those columns we will use to compute FCS and HDDS.

In [20]:
'''
df = data
year = 2006

# Arrays containing the column names you want to keep
columns_to_keep = ['hid', 'novprov','novdistr','novsect','q9_4_1','q9_4_2','q9_4_3','q9_4_4',
                   'q9_4_5', 'q9_4_6', 'q9_4_7','q9_4_8', 'q9_4_9', 'q9_4_10', 'q9_4_11','q9_4_12',
                   'q9_4_13', 'q9_4_14', 'q9_4_15','q9_4_16', 'q9_4_17', 'q9_4_18', 'q9_4_19','q9_4_20',
                   'q9_4_21']
# Output file name
output_file = 'Rwanda/2006/rwanda_2006_preprocessed_data/rwanda_2006_food_consumed.csv'

# Optional: Dictionary for renaming columns
rename_columns = {'hid': 'hhid','novprov': 'province', 'novdistr': 'district', 'novsect': 'sector',
                  'q9_4_1': 'maize', 'q9_4_2': 'rice', 'q9_4_3': 'cereal', 'q9_4_4': 'cassava',
                  'q9_4_5': 'sweet_potato', 'q9_4_6': 'roots', 'q9_4_7': 'bread', 'q9_4_8': 'cooking_banana',
                  'q9_4_9': 'beans_peas', 'q9_4_10': 'vegetables', 'q9_4_11': 'cassava_leaves', 'q9_4_12': 'ground_nuts',
                  'q9_4_13': 'sunflower', 'q9_4_14': 'fruits', 'q9_4_15': 'fish', 'q9_4_16': 'meat',
                  'q9_4_17': 'poultry', 'q9_4_18': 'eggs', 'q9_4_19': 'oil', 'q9_4_20': 'sugar',
                  'q9_4_21': 'milk'}  

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)
'''

A new dataset is saved to Rwanda/2006/rwanda_2006_preprocessed_data/rwanda_2006_food_consumed.csv


In [100]:
#load the data to compute fcs
data = read_csv('Rwanda/2006/rwanda_2006_preprocessed_data/rwanda_2006_food_consumed.csv',header=0, delimiter=',')
data.head()

Unnamed: 0,year,hhid,province,district,sector,maize,rice,cereal,cassava,sweet_potato,...,ground_nuts,sunflower,fruits,fish,meat,poultry,eggs,oil,sugar,milk
0,2006,120303105,PROVINCE DE L'EST,NGOMA,MUGESERA,2.0,1.0,7.0,3.0,3.0,...,3.0,4.0,0.0,0.0,2.0,0.0,7.0,0.0,7.0,0.0
1,2006,80201303,PROVINCE DE L'OUEST,RUBAVU,CYANZARWE,0.0,0.0,2.0,1.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
2,2006,60302903,PROVINCE DE L'OUEST,NYAMASHEKE,KAGANO,0.0,0.0,0.0,0.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2006,121009606,PROVINCE DE L'EST,KIREHE,MUSAZA,0.0,1.0,4.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0
4,2006,30401901,PROVINCE DU SUD,RUHANGO,RUHANGO,7.0,2.0,0.0,2.0,1.0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,7.0,0.0


In [80]:
#check the columns name to create the mapping
data.columns

Index(['year', 'hhid', 'province', 'district', 'sector', 'maize', 'rice',
       'cereal', 'cassava', 'sweet_potato', 'roots', 'bread', 'cooking_banana',
       'beans_peas', 'vegetables', 'cassava_leaves', 'ground_nuts',
       'sunflower', 'fruits', 'fish', 'meat', 'poultry', 'eggs', 'oil',
       'sugar', 'milk'],
      dtype='object')

In [101]:
# Define the food group mapping based on the weights as defined in the 
food_group_mapping = {
    'maize': 'cereals_tubers',
    'rice': 'cereals_tubers',
    'cereal': 'cereals_tubers',
    'cassava': 'cereals_tubers',
    'sweet_potato': 'cereals_tubers',
    'roots': 'cereals_tubers',
    'bread': 'cereals_tubers',
    'cooking_banana': 'cereals_tubers',
    'beans_peas': 'pulses_nuts',
    'vegetables': 'vegetables_leaves',
    'cassava_leaves': 'vegetables_leaves',
    'ground_nuts': 'pulses_nuts',
    'sunflower': 'oil',
    'fruits': 'fruits',
    'fish': 'animal_protein',
    'meat': 'animal_protein',
    'poultry': 'animal_protein',
    'eggs': 'animal_protein',
    'oil': 'oil',
    'sugar': 'sugar',
    'milk': 'dairy_products'
}

#### Compute FCS

In [104]:
#call the function
data_fcs= calculate_fcs(data, food_group_mapping)
data_fcs.head()

Unnamed: 0,year,hhid,province,district,sector,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,fcs
0,2006,120303105,PROVINCE DE L'EST,NGOMA,MUGESERA,4.0,0.0,7.0,7.0,7.0,0.0,7.0,7.0,75.5
1,2006,80201303,PROVINCE DE L'OUEST,RUBAVU,CYANZARWE,0.0,0.0,7.0,2.0,0.0,0.0,7.0,0.0,36.0
2,2006,60302903,PROVINCE DE L'OUEST,NYAMASHEKE,KAGANO,0.0,0.0,7.0,0.0,7.0,0.0,7.0,0.0,42.0
3,2006,121009606,PROVINCE DE L'EST,KIREHE,MUSAZA,0.0,7.0,7.0,0.0,4.0,0.0,5.0,0.0,61.0
4,2006,30401901,PROVINCE DU SUD,RUHANGO,RUHANGO,5.0,0.0,7.0,7.0,5.0,0.0,7.0,0.0,46.0


In [105]:
#save the data may be useful later
data_fcs.to_csv('Rwanda/2006/rwanda_2006_preprocessed_data/rwanda_2006_with_fcs.csv', index=False)

#### Compute HDDS

In [112]:
'''
# food group mapping for HDDS with 12 groups
food_group_mapping = {
    'maize': 'cereals',
    'rice': 'cereals',
    'cereal': 'cereals',
    'cassava': 'roots_tubers',
    'sweet_potato': 'roots_tubers',
    'roots': 'roots_tubers',
    'bread': 'cereals',
    'cooking_banana': 'roots_tubers',
    'beans_peas': 'pulses_nuts',
    'vegetables': 'vegetables_leaves',
    'cassava_leaves': 'vegetables_leaves',
    'ground_nuts': 'pulses_nuts',
    'sunflower': 'oil',
    'fruits': 'fruits',
    'fish': 'fish',
    'meat': 'meat',
    'poultry': 'meat',
    'eggs': 'eggs',
    'oil': 'oil',
    'sugar': 'sugar',
    'milk': 'milk'
}
'''

In [106]:
data_hdds = calculate_hdds(data,food_group_mapping)
data_hdds.head()

Unnamed: 0,year,hhid,province,district,sector,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,hdds
0,2006,120303105,PROVINCE DE L'EST,NGOMA,MUGESERA,1,0,1,1,1,0,1,1,6
1,2006,80201303,PROVINCE DE L'OUEST,RUBAVU,CYANZARWE,0,0,1,1,0,0,1,0,3
2,2006,60302903,PROVINCE DE L'OUEST,NYAMASHEKE,KAGANO,0,0,1,0,1,0,1,0,3
3,2006,121009606,PROVINCE DE L'EST,KIREHE,MUSAZA,0,1,1,0,1,0,1,0,4
4,2006,30401901,PROVINCE DU SUD,RUHANGO,RUHANGO,1,0,1,1,1,0,1,0,5


In [107]:
#save the data may be useful later
data_hdds.to_csv('Rwanda/2006/rwanda_2006_preprocessed_data/rwanda_2006_with_hdds.csv', index=False)

#### Combine the files and save the final dataset

In [113]:
data_final = merge_data(data_fcs,data_hdds)
data_final.head()

Unnamed: 0,year,hhid,province,district,sector,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,fcs,hdds
0,2006,120303105,PROVINCE DE L'EST,NGOMA,MUGESERA,4.0,0.0,7.0,7.0,7.0,0.0,7.0,7.0,75.5,6
1,2006,80201303,PROVINCE DE L'OUEST,RUBAVU,CYANZARWE,0.0,0.0,7.0,2.0,0.0,0.0,7.0,0.0,36.0,3
2,2006,60302903,PROVINCE DE L'OUEST,NYAMASHEKE,KAGANO,0.0,0.0,7.0,0.0,7.0,0.0,7.0,0.0,42.0,3
3,2006,121009606,PROVINCE DE L'EST,KIREHE,MUSAZA,0.0,7.0,7.0,0.0,4.0,0.0,5.0,0.0,61.0,4
4,2006,30401901,PROVINCE DU SUD,RUHANGO,RUHANGO,5.0,0.0,7.0,7.0,5.0,0.0,7.0,0.0,46.0,5


In [114]:
data_final.to_csv('Rwanda/2006/rwanda_2006_preprocessed_data/rwanda_2006_final.csv', index=False)

### Rwanda 2009 Household Data

In [38]:
#data = read_csv('Rwanda/2009/rwanda_2009_preprocessed_data/S9_Question.csv',header=0, delimiter=',')
#data.shape

(118800, 19)

In [None]:
''''
df = data
#year = 2009

# Arrays containing the column names you want to keep
columns_to_keep = ['hid', 'ID1','ID2','ID4','q9_4_1','q9_4_2','q9_4_3','q9_4_4',
                   'q9_4_5', 'q9_4_6', 'q9_4_7','q9_4_8', 'q9_4_9', 'q9_4_10', 'q9_4_11','q9_4_12',
                   'q9_4_13', 'q9_4_14', 'q9_4_15','q9_4_16', 'q9_4_17', 'q9_4_18', 'q9_4_19','q9_4_20',
                   'q9_4_21']
# Output file name
output_file = 'rwanda_2009.csv'

# Optional: Dictionary for renaming columns
rename_columns = {'ID1': 'province', 'ID2': 'district', 'ID4': 'sector',
                  'q9_4_1': 'maize', 'q9_4_2': 'rice', 'q9_4_3': 'cereal', 'q9_4_4': 'cassava',
                  'q9_4_5': 'sweet_potato', 'q9_4_6': 'roots', 'q9_4_7': 'bread', 'q9_4_8': 'cooking_banana',
                  'q9_4_9': 'beans_peas', 'q9_4_10': 'vegetables', 'q9_4_11': 'cassava_leaves', 'q9_4_12': 'ground_nuts',
                  'q9_4_13': 'sunflower', 'q9_4_14': 'fruits', 'q9_4_15': 'fish', 'q9_4_16': 'meat',
                  'q9_4_17': 'poultry', 'q9_4_18': 'eggs', 'q9_4_19': 'oil', 'q9_4_20': 'sugar',
                  'q9_4_21': 'milk'}  

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)
'''

## Comprehensive Food Security and Vulnerability Analysis (CFSVA) - 2012

This dataset is public available on [2012 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/69). The purpose of this Comprehensive Food security and Vulnerability Analysis (CFSVA) is to provide an accurate baseline and understanding of chronic food insecurity and vulnerability conditions.The CFSVA and Nutrition Survey 2012 was designed to produce estimates of food security indicators at district level and covered both urban and rural households. For more discription about this dataset [Click Here](https://microdata.statistics.gov.rw/index.php/catalog/69/study-description). The dataset consinst of multiples files, however for our task we will mostly focus on HouseHold Questionnaire, for more description about the data definition of each files and their associate variables [Visit Here](https://microdata.statistics.gov.rw/index.php/catalog/69/data_dictionary)

#### Loading the data sources

The name of the file we have used in this data set is **cfsvans_2012_household_v01.sav** The file contains data related to Household Questionnaire.This file is in SPSS Statistics Data File Format, therefore we need to change them into csv format for easily manipulation

In [115]:
'''
#convert cfsvans_2012_household_v01.sav to csv file
o_path= 'Rwanda/2012/rwanda_2012_preprocessed_data/cfsvans_2012_household_v01.sav'
u_path = 'Rwanda/2012/rwanda_2012_preprocessed_data/cfsvans_2012_household_v01.csv'
sav_to_csv(o_path,u_path) #call the function to convert

data = read_csv('Rwanda/2012/rwanda_2012_preprocessed_data/cfsvans_2012_household_v01.csv',header=0, delimiter=',')
data.shape

'''

A new dataset is saved to Rwanda/2012/rwanda_2012_preprocessed_data/cfsvans_2012_household_v01.csv


In [117]:
'''
df = data
year = 2012

# Arrays containing the column names you want to keep
columns_to_keep = ['hh_id', 'p_code','d_code','s_code','QA904_1','QB904_1','QC904_1','QD904_1',
                   'QE904_1', 'QF904_1', 'QG904_1','QH904_1', 'QI904_1', 'QJ904_1', 'QK904_1','QL904_1',
                   'QM904_1', 'QN904_1', 'QO904_1','QP904_1', 'QQ904_1', 'QR904_1', 'QS904_1','QT904_1',
                   'QU904_1', 'QV904_1','QW904_1', 'QX904_1', 'FCS']
# Output file name
output_file = 'Rwanda/2012/rwanda_2012_preprocessed_data/rwanda_2012_0.csv'  # Change 'subset_data.csv' to your desired filename

# Optional: Dictionary for renaming columns
rename_columns = {'hh_id': 'hhid', 'p_code': 'province','d_code': 'district', 's_code': 'sector',
                  'QA904_1': 'maize', 'QB904_1': 'sorghum', 'QC904_1': 'cereals', 'QD904_1': 'cassava',
                  'QE904_1': 'sweet_potato', 'QF904_1': 'roots', 'QG904_1': 'bread', 'QH904_1': 'carrot_tubers',
                  'QI904_1': 'cooking_banana', 'QJ904_1': 'beans_peas', 'QK904_1': 'cassava_leaves', 'QL904_1': 'vegetables',
                  'QM904_1': 'other_vegetables', 'QN904_1': 'ground_nuts', 'QO904_1': 'fruits', 'QP904_1': 'other_fruits',
                  'QQ904_1': 'fish', 'QR904_1': 'organ_meat', 'QS904_1': 'flesh_meat', 'QT904_1': 'eggs',
                  'QU904_1': 'oil', 'QV904_1': 'sugar', 'QW904_1': 'milk', 'QX904_1': 'condiments','FCS':'o_fcs'}

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)



# Adding province name, district, and sector name and removing the id

# Read the data files

data = read_csv('Rwanda/2012/rwanda_2012_preprocessed_data/rwanda_2012_0.csv',header=0, delimiter=',') #data with new columns

district_data = pd.read_csv('Rwanda/2012/rwanda_2012_preprocessed_data/district_name.csv',header=0,delimiter=',')
province_data = pd.read_csv('Rwanda/2012/rwanda_2012_preprocessed_data/province_name.csv',header=0,delimiter=',')
sector_data = pd.read_csv('Rwanda/2012/rwanda_2012_preprocessed_data/sector_name.csv',header=0,delimiter=',')

# Create a mapping dictionary from district_id to district_name
district_mapping = dict(zip(district_data['dist_id'], district_data['dist_name']))

# Create a mapping dictionary from province_id to province_name
province_mapping = dict(zip(province_data['prov_id'], province_data['prev_name']))

# Create a mapping dictionary from sector_id to sector_name
sector_mapping = dict(zip(sector_data['sect_id'], sector_data['sect_name']))

# Add district, province and sector to the data using the mapping dictionary
data['district'] = data['district'].map(district_mapping)
data['province'] = data['province'].map(province_mapping)
data['sector'] = data['sector'].map(sector_mapping)

#save the data
data.to_csv('Rwanda/2012/rwanda_2012_preprocessed_data/rwanda_2012_food_consumed.csv', index=False)

'''

A new dataset is saved to Rwanda/2012/rwanda_2012_preprocessed_data/rwanda_2012_0.csv


In [130]:
#load the data to compute fcs
data = read_csv('Rwanda/2012/rwanda_2012_preprocessed_data/rwanda_2012_food_consumed.csv',header=0, delimiter=',')
data.head()

Unnamed: 0,year,hhid,province,district,sector,maize,sorghum,cereals,cassava,sweet_potato,...,other_fruits,fish,organ_meat,flesh_meat,eggs,oil,sugar,milk,condiments,o_fcs
0,2012,9746,EAST,BUGESERA,SHYARA,0,0,1,1,1,...,0,0,0,1,0,7,1,0,7,43.0
1,2012,9739,EAST,BUGESERA,SHYARA,0,0,0,2,0,...,0,0,0,2,0,6,1,7,7,66.5
2,2012,9744,EAST,BUGESERA,SHYARA,0,0,0,5,0,...,0,0,0,0,0,0,0,0,3,33.0
3,2012,9738,EAST,BUGESERA,SHYARA,1,0,0,3,0,...,0,0,0,0,0,0,0,0,5,29.0
4,2012,9741,EAST,BUGESERA,SHYARA,1,0,0,5,0,...,0,0,0,0,0,5,2,0,5,38.5


In [131]:
#check the column names to create the mapping
data.columns

Index(['year', 'hhid', 'province', 'district', 'sector', 'maize', 'sorghum',
       'cereals', 'cassava', 'sweet_potato', 'roots', 'bread', 'carrot_tubers',
       'cooking_banana', 'beans_peas', 'cassava_leaves', 'vegetables',
       'other_vegetables', 'ground_nuts', 'fruits', 'other_fruits', 'fish',
       'organ_meat', 'flesh_meat', 'eggs', 'oil', 'sugar', 'milk',
       'condiments', 'o_fcs'],
      dtype='object')

In [132]:
# Define the food group mapping based on the weights as defined in the 
food_group_mapping = {
    'maize': 'cereals_tubers',
    'sorghum': 'cereals_tubers',
    'cereals': 'cereals_tubers',
    'cassava': 'cereals_tubers',
    'sweet_potato': 'cereals_tubers',
    'roots': 'cereals_tubers',
    'bread': 'cereals_tubers',
    'carrot_tubers': 'cereals_tubers',
    'cooking_banana': 'cereals_tubers',
    'beans_peas': 'pulses_nuts',
    'vegetables': 'vegetables_leaves',
    'cassava_leaves': 'vegetables_leaves',
    'other_vegetables': 'vegetables_leaves',
    'ground_nuts': 'pulses_nuts',
    'fruits': 'fruits',
    'other_fruits': 'fruits',
    'fish': 'animal_protein',
    'organ_meat': 'animal_protein',
    'flesh_meat': 'animal_protein',
    'eggs': 'animal_protein',
    'oil': 'oil',
    'sugar': 'sugar',
    'milk': 'dairy_products',
    'condiments': 'condiments'
}

#### Compute FCS

In [140]:
#call the function
data_fcs= calculate_fcs(data, food_group_mapping)
data_fcs.head()

Unnamed: 0,year,hhid,province,district,sector,oil,dairy_products,condiments,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,fcs
0,2012,9746,EAST,BUGESERA,SHYARA,7,0,7,4,1,3,3,7,1,43.0
1,2012,9739,EAST,BUGESERA,SHYARA,6,7,7,3,1,0,0,7,2,66.5
2,2012,9744,EAST,BUGESERA,SHYARA,0,0,3,6,0,0,0,7,0,33.0
3,2012,9738,EAST,BUGESERA,SHYARA,0,0,5,4,0,0,0,7,0,29.0
4,2012,9741,EAST,BUGESERA,SHYARA,5,0,5,7,2,0,0,7,0,38.5


In [141]:
#save the data may be useful later
data_fcs.to_csv('Rwanda/2012/rwanda_2012_preprocessed_data/rwanda_2012_with_fcs.csv', index=False)

#### Compute HDDS

In [142]:
#call the function
data_hdds= calculate_hdds(data, food_group_mapping)
data_hdds.head()

Unnamed: 0,year,hhid,province,district,sector,oil,dairy_products,condiments,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,hdds
0,2012,9746,EAST,BUGESERA,SHYARA,1,0,1,1,1,1,1,1,1,8
1,2012,9739,EAST,BUGESERA,SHYARA,1,1,1,1,1,0,0,1,1,7
2,2012,9744,EAST,BUGESERA,SHYARA,0,0,1,1,0,0,0,1,0,3
3,2012,9738,EAST,BUGESERA,SHYARA,0,0,1,1,0,0,0,1,0,3
4,2012,9741,EAST,BUGESERA,SHYARA,1,0,1,1,1,0,0,1,0,5


In [143]:
#save the data may be useful later
data_hdds.to_csv('Rwanda/2012/rwanda_2012_preprocessed_data/rwanda_2012_with_hdds.csv', index=False)

#### Combine the files and save the final dataset

In [144]:
data_final = merge_data(data_fcs,data_hdds)
data_final.head()

Unnamed: 0,year,hhid,province,district,sector,oil,dairy_products,condiments,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,fcs,hdds
0,2012,9746,EAST,BUGESERA,SHYARA,7,0,7,4,1,3,3,7,1,43.0,8
1,2012,9739,EAST,BUGESERA,SHYARA,6,7,7,3,1,0,0,7,2,66.5,7
2,2012,9744,EAST,BUGESERA,SHYARA,0,0,3,6,0,0,0,7,0,33.0,3
3,2012,9738,EAST,BUGESERA,SHYARA,0,0,5,4,0,0,0,7,0,29.0,3
4,2012,9741,EAST,BUGESERA,SHYARA,5,0,5,7,2,0,0,7,0,38.5,5


In [145]:
#save the final data
data_final.to_csv('Rwanda/2012/rwanda_2012_preprocessed_data/rwanda_2012_final.csv', index=False)

In [51]:
#aggregated_dt= calculate_mean(data,'district','fcs','province')
#aggregated_dt=calculate_mean(data, 'district', ['fcs','hdds'], ['year', 'province'])
#aggregated_dt.head(20)

## Comprehensive Food Security and Vulnerability Analysis (CFSVA) - 2015

This dataset is public available on [2015 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/70). The objective of this Comprehensive Food Security and Vulnerability Analysis 2015 (CFSVA and Nutrition Survey 2012) is to measure the extent and depth of food and nutrition insecurity in Rwanda, analyze trends over time, and integrate the findings with those from the recent 'Fourth Integrated Household Living Conditions Survey' (EICV 4) and 'Rwanda Demographic Health Survey 2014/15 (RDHS 2014/'15). For more discription about this dataset [Click Here](https://microdata.statistics.gov.rw/index.php/catalog/70/study-description). The dataset consinst of multiples files, however for our task we will mostly focus on HouseHold Questionnaire, for more description about the data definition of each files and their associate variables [Visit Here](https://microdata.statistics.gov.rw/index.php/catalog/70/data_dictionary)

#### Loading the data sources

The name of the file we have used in this data set is **cfsva_2015_master_DB_annex.sav** The file contains data related to Household Questionnaire.This file is in SPSS Statistics Data File Format, therefore we need to change them into csv format for easily manipulation

In [146]:
'''
#convert cfsvans_2012_household_v01.sav to csv file
o_path= 'Rwanda/2015/rwanda_2015_preprocessed_data/cfsva_2015_master_DB_annex.sav'
u_path = 'Rwanda/2015/rwanda_2015_preprocessed_data/cfsva_2015_master_DB_annex.csv'
sav_to_csv(o_path,u_path) #call the function to convert

data = read_csv('Rwanda/2015/rwanda_2015_preprocessed_data/cfsva_2015_master_DB_annex.csv',header=0, delimiter=',')
data.head()

'''

A new dataset is saved to Rwanda/2015/rwanda_2015_preprocessed_data/cfsva_2015_master_DB_annex.csv


In [149]:
'''
df = data
year = 2015

# Arrays containing the column names you want to keep
columns_to_keep = ['KEY', 'S0_C_Prov','districts','S0_E_Sect','Starch','Pulses','Meat','Vegetables','Oil',
                   'Fruit', 'Milk', 'Sugar','FCS','HDDS_24h']
# Output file name
output_file = 'Rwanda/2015/rwanda_2015_preprocessed_data/rwanda_2015.csv'  

# Optional: Dictionary for renaming columns
rename_columns = {'KEY': 'hhid','S0_C_Prov': 'province', 'districts': 'district', 'S0_E_Sect': 'sector',
                  'Starch':'starch','Pulses': 'pulses', 'Meat': 'meat', 'Vegetables': 'vegetables', 'Oil': 'oil',
                  'Fruit': 'fruits', 'Milk': 'milk', 'Sugar': 'sugar', 'HDDS_24h': 'o_HDDS',
                  'FCS': 'o_fcs'}  

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)

data = read_csv('Rwanda/2015/rwanda_2015_preprocessed_data/rwanda_2015.csv',header=0, delimiter=',')

#modifying the hid value
start_value = 1010
data['new_hid'] = data.index.to_series().apply(lambda x: start_value + x)
data['hhid'] = data['new_hid']
data.drop(columns=['new_hid'], inplace=True)


# Adding province name, district, and sector name and removing the id

# Read the data files

district_data = pd.read_csv('Rwanda/2015/rwanda_2015_preprocessed_data/district_name.csv',header=0,delimiter=',')
province_data = pd.read_csv('Rwanda/2015/rwanda_2015_preprocessed_data/province_name.csv',header=0,delimiter=',')
sector_data = pd.read_csv('Rwanda/2015/rwanda_2015_preprocessed_data/sector_name.csv',header=0,delimiter=',')

# Create a mapping dictionary from district_id to district_name
district_mapping = dict(zip(district_data['dist_id'], district_data['dist_name']))

# Create a mapping dictionary from province_id to province_name
province_mapping = dict(zip(province_data['prov_id'], province_data['prov_name']))

# Create a mapping dictionary from sector_id to sector_name
sector_mapping = dict(zip(sector_data['sect_id'], sector_data['sect_name']))

# Add district, province and sector to the data using the mapping dictionary
data['district'] = data['district'].map(district_mapping)
data['province'] = data['province'].map(province_mapping)
data['sector'] = data['sector'].map(sector_mapping)

#save the data
data.to_csv('Rwanda/2015/rwanda_2015_preprocessed_data/rwanda_2015_food_consumed.csv', index=False)
'''

A new dataset is saved to Rwanda/2015/rwanda_2015_preprocessed_data/rwanda_2015.csv


In [160]:
data = read_csv('Rwanda/2015/rwanda_2015_preprocessed_data/rwanda_2015_food_consumed.csv',header=0, delimiter=',')
data.head()

Unnamed: 0,year,hhid,province,district,sector,starch,pulses,meat,vegetables,oil,fruits,milk,sugar,o_fcs,o_HDDS
0,2015,1010,KIGALI,GASABO,KIMIRONKO,7,7,5,7,7,5,7,7,102.0,9
1,2015,1011,KIGALI,KICUKIRO,KIGARAMA,7,7,7,7,7,6,3,7,95.0,11
2,2015,1012,KIGALI,GASABO,KIMIRONKO,7,7,7,7,7,7,7,7,112.0,11
3,2015,1013,KIGALI,GASABO,GISOZI,7,7,5,7,7,5,4,7,90.0,9
4,2015,1014,KIGALI,GASABO,GISOZI,7,7,7,7,7,7,3,7,96.0,12


In [161]:
data.columns

Index(['year', 'hhid', 'province', 'district', 'sector', 'starch', 'pulses',
       'meat', 'vegetables', 'oil', 'fruits', 'milk', 'sugar', 'o_fcs',
       'o_HDDS'],
      dtype='object')

In [162]:
# Define the food group mapping based on the columns
food_group_mapping = {
    'starch': 'cereals_tubers',
    'pulses': 'pulses_nuts',
    'vegetables': 'vegetables_leaves',
    'fruits': 'fruits',
    'meat': 'animal_protein',
    'oil': 'oil',
    'sugar': 'sugar',
    'milk': 'dairy_products'
}

#### Compute FCS

In [163]:
#call the function
data_fcs= calculate_fcs(data, food_group_mapping)
data_fcs.head()

Unnamed: 0,year,hhid,province,district,sector,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,fcs
0,2015,1010,KIGALI,GASABO,KIMIRONKO,7,7,7,7,7,5,7,5,102.0
1,2015,1011,KIGALI,KICUKIRO,KIGARAMA,7,3,7,7,7,6,7,7,95.0
2,2015,1012,KIGALI,GASABO,KIMIRONKO,7,7,7,7,7,7,7,7,112.0
3,2015,1013,KIGALI,GASABO,GISOZI,7,4,7,7,7,5,7,5,90.0
4,2015,1014,KIGALI,GASABO,GISOZI,7,3,7,7,7,7,7,7,96.0


In [164]:
#save the data may be useful later
data_fcs.to_csv('Rwanda/2015/rwanda_2015_preprocessed_data/rwanda_2015_with_fcs.csv', index=False)

#### Compute HDDS

In [165]:
#call the function
data_hdds= calculate_hdds(data, food_group_mapping)
data_hdds.head()

Unnamed: 0,year,hhid,province,district,sector,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,hdds
0,2015,1010,KIGALI,GASABO,KIMIRONKO,1,1,1,1,1,1,1,1,8
1,2015,1011,KIGALI,KICUKIRO,KIGARAMA,1,1,1,1,1,1,1,1,8
2,2015,1012,KIGALI,GASABO,KIMIRONKO,1,1,1,1,1,1,1,1,8
3,2015,1013,KIGALI,GASABO,GISOZI,1,1,1,1,1,1,1,1,8
4,2015,1014,KIGALI,GASABO,GISOZI,1,1,1,1,1,1,1,1,8


In [166]:
#save the data may be useful later
data_hdds.to_csv('Rwanda/2015/rwanda_2015_preprocessed_data/rwanda_2015_with_hdds.csv', index=False)

#### Combine the files and save the final dataset

In [167]:
data_final = merge_data(data_fcs,data_hdds)
data_final

Unnamed: 0,year,hhid,province,district,sector,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,fcs,hdds
0,2015,1010,KIGALI,GASABO,KIMIRONKO,7,7,7,7,7,5,7,5,102.0,8
1,2015,1011,KIGALI,KICUKIRO,KIGARAMA,7,3,7,7,7,6,7,7,95.0,8
2,2015,1012,KIGALI,GASABO,KIMIRONKO,7,7,7,7,7,7,7,7,112.0,8
3,2015,1013,KIGALI,GASABO,GISOZI,7,4,7,7,7,5,7,5,90.0,8
4,2015,1014,KIGALI,GASABO,GISOZI,7,3,7,7,7,7,7,7,96.0,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,2015,8505,SOUTH,NYARUGURU,NYABIMATA,0,0,7,0,0,0,2,0,20.0,2
7496,2015,8506,EAST,RWAMAGANA,RUBONA,0,0,7,0,5,2,7,0,42.0,4
7497,2015,8507,SOUTH,NYARUGURU,CYAHINDA,0,0,7,0,1,0,0,2,23.0,3
7498,2015,8508,WEST,NGORORERO,GATUMBA,3,0,7,0,7,0,7,0,43.5,4


In [168]:
#save the final dataset
data_final.to_csv('Rwanda/2015/rwanda_2015_preprocessed_data/rwanda_2015_final.csv', index=False)

## Comprehensive Food Security and Vulnerability Analysis (CFSVA) - 2018

This dataset is public available on [2018 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/91). This Comprehensive Food Security and Vulnerability Analysis (CFSVA) 2018 measures the extent and depth of food and nutrition insecurity in Rwanda, observes trends over time, and analyses the socioeconomic and demographic determinants linked to food and nutrition insecurity. For more discription about this dataset [Click Here](https://microdata.statistics.gov.rw/index.php/catalog/91/study-description). The dataset consinst of multiples files, however for our task we will mostly focus on HouseHold Questionnaire, for more description about the data definition of each files and their associate variables [Visit Here](https://microdata.statistics.gov.rw/index.php/catalog/91/data_dictionary)

#### Loading the data sources

The name of the file we have used in this data set is **1_CFSVA18_DB_HouseholdQues_Full_Annex_201904_NISR.sav** The file contains data related to Household Questionnaire.This file is in SPSS Statistics Data File Format, therefore we need to change them into csv format for easily manipulation

In [169]:
'''
#convert 1_CFSVA18_DB_HouseholdQues_Full_Annex_201904_NISR.sav to csv file
o_path= 'Rwanda/2018/rwanda_2018_preprocessed_data/1_CFSVA18_DB_HouseholdQues_Full_Annex_201904_NISR.sav'
u_path = 'Rwanda/2018/rwanda_2018_preprocessed_data/1_CFSVA18_DB_HouseholdQues_Full_Annex_201904_NISR.csv'
sav_to_csv(o_path,u_path) #call the function to convert

data = read_csv('Rwanda/2018/rwanda_2018_preprocessed_data/1_CFSVA18_DB_HouseholdQues_Full_Annex_201904_NISR.csv',header=0, delimiter=',')
data.shape

'''

A new dataset is saved to Rwanda/2018/rwanda_2018_preprocessed_data/1_CFSVA18_DB_HouseholdQues_Full_Annex_201904_NISR.csv


In [186]:
'''
df = data
year = 2018

# Arrays containing the column names you want to keep
columns_to_keep = ['PARENT_KEY', 'S0_C_Prov','S0_D_Dist','S0_E_Livezone','Starch','Pulses','Meat','Vegetables','Oil',
                   'Fruit', 'Milk', 'Sugar','FCS']
# Output file name
output_file = 'Rwanda/2018/rwanda_2018_preprocessed_data/rwanda_2018.csv'  

# Optional: Dictionary for renaming columns
rename_columns = {'PARENT_KEY': 'hhid','S0_C_Prov': 'province', 'S0_D_Dist': 'district', 'S0_E_Livezone': 'zone',
                  'Starch': 'starch','Pulses': 'pulses', 'Meat': 'meat', 'Vegetables': 'vegetables', 'Oil': 'oil',
                  'Fruit': 'fruits', 'Milk': 'milk', 'Sugar': 'sugar', 'FCS': 'o_fcs'}  

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)

data = read_csv('Rwanda/2018/rwanda_2018_preprocessed_data/rwanda_2018.csv',header=0, delimiter=',')

#modifying the hid value
start_value = 3010
data['new_hid'] = data.index.to_series().apply(lambda x: start_value + x)
data['hhid'] = data['new_hid']
data.drop(columns=['new_hid'], inplace=True)
data.to_csv('rwanda_2018_0.csv', index=False)

district_data = pd.read_csv('Rwanda/2018/rwanda_2018_preprocessed_data/district_name.csv',header=0,delimiter=',')
province_data = pd.read_csv('Rwanda/2018/rwanda_2018_preprocessed_data/province_name.csv',header=0,delimiter=',')
zone_data = pd.read_csv('Rwanda/2018/rwanda_2018_preprocessed_data/zone_name.csv',header=0,delimiter=',')

# Create a mapping dictionary from district_id to district_name
district_mapping = dict(zip(district_data['dist_id'], district_data['dist_name']))

# Create a mapping dictionary from province_id to province_name
province_mapping = dict(zip(province_data['prov_id'], province_data['prov_name']))

# Create a mapping dictionary from zone_id to zone_name
zone_mapping = dict(zip(zone_data['zone_id'], zone_data['zone_name']))

# Add district, province and sector to the data using the mapping dictionary
data['district'] = data['district'].map(district_mapping)
data['province'] = data['province'].map(province_mapping)
data['zone'] = data['zone'].map(zone_mapping)

#save the data
data.to_csv('Rwanda/2018/rwanda_2018_preprocessed_data/rwanda_2018_food_consumed.csv', index=False)

'''

A new dataset is saved to Rwanda/2018/rwanda_2018_preprocessed_data/rwanda_2018.csv


In [220]:
data = read_csv('Rwanda/2018/rwanda_2018_preprocessed_data/rwanda_2018_food_consumed.csv',header=0, delimiter=',')
data.head()

Unnamed: 0,year,hhid,province,district,zone,starch,pulses,meat,vegetables,oil,fruits,milk,sugar,o_fcs
0,2018,3010,WEST,KARONGI,WEST CONGO-NILE CREST TEA ZONE,7,7,0,7,7,0,0,0,45.5
1,2018,3011,NORTH,RULINDO,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",7,7,0,7,7,0,0,0,45.5
2,2018,3012,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",4,7,0,0,2,0,0,0,30.0
3,2018,3013,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",4,7,0,4,0,0,0,0,33.0
4,2018,3014,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",7,7,0,7,7,2,7,7,79.0


In [221]:
data.columns

Index(['year', 'hhid', 'province', 'district', 'zone', 'starch', 'pulses',
       'meat', 'vegetables', 'oil', 'fruits', 'milk', 'sugar', 'o_fcs'],
      dtype='object')

In [222]:
# Define the food group mapping based on the columns
food_group_mapping = {
    'starch': 'cereals_tubers',
    'pulses': 'pulses_nuts',
    'vegetables': 'vegetables_leaves',
    'fruits': 'fruits',
    'meat': 'animal_protein',
    'oil': 'oil',
    'sugar': 'sugar',
    'milk': 'dairy_products'
}

#### Compute FCS

In [223]:
#call the function
data_fcs= calculate_fcs(data, food_group_mapping)
data_fcs.head()

Unnamed: 0,year,hhid,province,district,zone,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,fcs
0,2018,3010,WEST,KARONGI,WEST CONGO-NILE CREST TEA ZONE,7,0,7,0,7,0,7,0,45.5
1,2018,3011,NORTH,RULINDO,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",7,0,7,0,7,0,7,0,45.5
2,2018,3012,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",2,0,4,0,0,0,7,0,30.0
3,2018,3013,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",0,0,4,0,4,0,7,0,33.0
4,2018,3014,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",7,7,7,7,7,2,7,0,79.0


In [None]:
#save the data may be useful later
data_fcs.to_csv('Rwanda/2018/rwanda_2018_preprocessed_data/rwanda_2018_with_fcs.csv', index=False)

#### Compute HDDS

In [224]:
#call the function
data_hdds= calculate_hdds(data, food_group_mapping)
data_hdds.head()

Unnamed: 0,year,hhid,province,district,zone,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,hdds
0,2018,3010,WEST,KARONGI,WEST CONGO-NILE CREST TEA ZONE,1,0,1,0,1,0,1,0,4
1,2018,3011,NORTH,RULINDO,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",1,0,1,0,1,0,1,0,4
2,2018,3012,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",1,0,1,0,0,0,1,0,3
3,2018,3013,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",0,0,1,0,1,0,1,0,3
4,2018,3014,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",1,1,1,1,1,1,1,0,7


In [225]:
#save the data may be useful later
data_hdds.to_csv('Rwanda/2018/rwanda_2018_preprocessed_data/rwanda_2018_with_hdds.csv', index=False)

#### Combine the files and save the final dataset

In [226]:
data_final = merge_data(data_fcs,data_hdds)
data_final.head()

Unnamed: 0,year,hhid,province,district,zone,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,fcs,hdds
0,2018,3010,WEST,KARONGI,WEST CONGO-NILE CREST TEA ZONE,7,0,7,0,7,0,7,0,45.5,4
1,2018,3011,NORTH,RULINDO,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",7,0,7,0,7,0,7,0,45.5,4
2,2018,3012,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",2,0,4,0,0,0,7,0,30.0,3
3,2018,3013,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",0,0,4,0,4,0,7,0,33.0,3
4,2018,3014,NORTH,GICUMBI,"CENTRAL-NORTHERN HIGHLAND IRISH POTATO, BEANS ...",7,7,7,7,7,2,7,0,79.0,7


In [227]:
#save the data may be useful later
data_final.to_csv('Rwanda/2018/rwanda_2018_preprocessed_data/rwanda_2018_final.csv', index=False)

## Comprehensive Food Security and Vulnerability Analysis (CFSVA) - 2021

This dataset is public available on [2021 - Comprehensive Food Security and Vulnerability Analysis (CFSVA)](https://microdata.statistics.gov.rw/index.php/catalog/106). The purpose of this Comprehensive Food security and Vulnerability Analysis (CFSVA) is to provide an accurate baseline and understanding of chronic food insecurity and vulnerability conditions.The CFSVA and Nutrition Survey 2021 was designed to produce estimates of food security indicators at district level and covered both urban and rural households. For more discription about this dataset [Click Here](https://microdata.statistics.gov.rw/index.php/catalog/106/study-description). The dataset consinst of multiples files, however for our task we will mostly focus on HouseHold Questionnaire, for more description about the data definition of each files and their associate variables [Visit Here](https://microdata.statistics.gov.rw/index.php/catalog/106/data_dictionary)

#### Loading the data sources

The name of the file we have used in this data set is **CFSVA_HH_2021_MASTER_DATASET.dta** The file contains data related to Household Questionnaire.This file is in stata Data File Format, therefore we need to change them into csv format for easily manipulation

In [231]:
'''
#convert CFSVA_HH_2021_MASTER_DATASET.dta to csv file
o_path= 'Rwanda/2021/rwanda_2021_preprocessed_data/CFSVA_HH_2021_MASTER_DATASET.dta'
u_path = 'Rwanda/2021/rwanda_2021_preprocessed_data/CFSVA_HH_2021_MASTER_DATASET.csv'
dta_to_csv(o_path,u_path) #call the function to convert

data = read_csv('Rwanda/2021/rwanda_2021_preprocessed_data/CFSVA_HH_2021_MASTER_DATASET.csv',header=0, delimiter=',')
data.shape

'''

A new dataset is saved to Rwanda/2021/rwanda_2021_preprocessed_data/CFSVA_HH_2021_MASTER_DATASET.csv


In [242]:
'''
df = data
year = 2021

# Arrays containing the column names you want to keep
columns_to_keep = ['S0_B_DATE', 'S0_C_Prov','S0_D_Dist','S0_E_Livezone','Starch','Pulses','Meat','Vegetables','Oil',
                   'Fruit', 'Milk', 'Sugar','FCS']
# Output file name
output_file = 'Rwanda/2021/rwanda_2021_preprocessed_data/rwanda_2021.csv'  

# Optional: Dictionary for renaming columns
rename_columns = {'S0_B_DATE': 'hhid','S0_C_Prov': 'province', 'S0_D_Dist': 'district','S0_E_Livezone': 'zone',
                  'Starch': 'starch','Pulses': 'pulses', 'Meat': 'meat', 'Vegetables': 'vegetables', 'Oil': 'oil',
                  'Fruit': 'fruits', 'Milk': 'milk', 'Sugar': 'sugar', 'FCS' : 'o_fcs'}  

#df, columns_to_keep, output_file,year, rename_columns=None
subset_and_save(df, columns_to_keep, output_file,year,rename_columns)

#read the data file
data = read_csv('Rwanda/2021/rwanda_2021_preprocessed_data/rwanda_2021.csv',header=0, delimiter=',')


#modifying the hid value
start_value = 41010
data['new_hid'] = data.index.to_series().apply(lambda x: start_value + x)
data['hhid'] = data['new_hid']
data.drop(columns=['new_hid'], inplace=True)
data.to_csv('Rwanda/2021/rwanda_2021_preprocessed_data/rwanda_2021_food_consumed.csv', index=False)
'''

A new dataset is saved to Rwanda/2021/rwanda_2021_preprocessed_data/rwanda_2021.csv


In [253]:
data = read_csv('Rwanda/2021/rwanda_2021_preprocessed_data/rwanda_2021_food_consumed.csv',header=0, delimiter=',')
data.head()

Unnamed: 0,year,hhid,province,district,zone,starch,pulses,meat,vegetables,oil,fruits,milk,sugar,o_fcs
0,2021,41010,Eastern,Rwamagana,Southeastern Plateau Banana Zone,7,7,3,7,5,5,4,3,79.0
1,2021,41011,Northern,Gakenke,East Congo-Nile Highland Subsistence Farming Zone,7,7,4,4,7,0,1,7,66.0
2,2021,41012,Eastern,Gatsibo,Eastern Plateau Mixed Agriculture Zone,7,7,0,3,5,0,0,2,41.5
3,2021,41013,Western,Rusizi,Lake Kivu Coffee Zone,7,7,0,7,7,0,1,0,49.5
4,2021,41014,Eastern,Kirehe,Southeastern Plateau Banana Zone,7,7,0,7,7,2,0,1,48.0


In [254]:
data.columns

Index(['year', 'hhid', 'province', 'district', 'zone', 'starch', 'pulses',
       'meat', 'vegetables', 'oil', 'fruits', 'milk', 'sugar', 'o_fcs'],
      dtype='object')

In [255]:
# Define the food group mapping based on the columns
food_group_mapping = {
    'starch': 'cereals_tubers',
    'pulses': 'pulses_nuts',
    'vegetables': 'vegetables_leaves',
    'fruits': 'fruits',
    'meat': 'animal_protein',
    'oil': 'oil',
    'sugar': 'sugar',
    'milk': 'dairy_products'
}

#### Compute FCS

In [256]:
#call the function
data_fcs= calculate_fcs(data, food_group_mapping)
data_fcs.head()

Unnamed: 0,year,hhid,province,district,zone,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,fcs
0,2021,41010,Eastern,Rwamagana,Southeastern Plateau Banana Zone,5,4,7,3,7,5,7,3,79.0
1,2021,41011,Northern,Gakenke,East Congo-Nile Highland Subsistence Farming Zone,7,1,7,7,4,0,7,4,66.0
2,2021,41012,Eastern,Gatsibo,Eastern Plateau Mixed Agriculture Zone,5,0,7,2,3,0,7,0,41.5
3,2021,41013,Western,Rusizi,Lake Kivu Coffee Zone,7,1,7,0,7,0,7,0,49.5
4,2021,41014,Eastern,Kirehe,Southeastern Plateau Banana Zone,7,0,7,1,7,2,7,0,48.0


In [257]:
#save the final dataset may be useful later
data_fcs.to_csv('Rwanda/2021/rwanda_2021_preprocessed_data/rwanda_2021_with_fcs.csv', index=False)

#### Compute HDDS

In [258]:
#call the function
data_hdds= calculate_hdds(data, food_group_mapping)
data_hdds.head()

Unnamed: 0,year,hhid,province,district,zone,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,hdds
0,2021,41010,Eastern,Rwamagana,Southeastern Plateau Banana Zone,1,1,1,1,1,1,1,1,8
1,2021,41011,Northern,Gakenke,East Congo-Nile Highland Subsistence Farming Zone,1,1,1,1,1,0,1,1,7
2,2021,41012,Eastern,Gatsibo,Eastern Plateau Mixed Agriculture Zone,1,0,1,1,1,0,1,0,5
3,2021,41013,Western,Rusizi,Lake Kivu Coffee Zone,1,1,1,0,1,0,1,0,5
4,2021,41014,Eastern,Kirehe,Southeastern Plateau Banana Zone,1,0,1,1,1,1,1,0,6


In [259]:
#save the final dataset may be useful later
data_hdds.to_csv('Rwanda/2021/rwanda_2021_preprocessed_data/rwanda_2021_with_hdds.csv', index=False)

#### Combine the files and save the final dataset

In [260]:
data_final = merge_data(data_fcs,data_hdds)
data_final.head()

Unnamed: 0,year,hhid,province,district,zone,oil,dairy_products,cereals_tubers,sugar,vegetables_leaves,fruits,pulses_nuts,animal_protein,fcs,hdds
0,2021,41010,Eastern,Rwamagana,Southeastern Plateau Banana Zone,5,4,7,3,7,5,7,3,79.0,8
1,2021,41011,Northern,Gakenke,East Congo-Nile Highland Subsistence Farming Zone,7,1,7,7,4,0,7,4,66.0,7
2,2021,41012,Eastern,Gatsibo,Eastern Plateau Mixed Agriculture Zone,5,0,7,2,3,0,7,0,41.5,5
3,2021,41013,Western,Rusizi,Lake Kivu Coffee Zone,7,1,7,0,7,0,7,0,49.5,5
4,2021,41014,Eastern,Kirehe,Southeastern Plateau Banana Zone,7,0,7,1,7,2,7,0,48.0,6


In [261]:
#save the final dataset 
data_final.to_csv('Rwanda/2021/rwanda_2021_preprocessed_data/rwanda_2021_final.csv', index=False)