# Preprocessing and Computing FCS and HDDS from Household Survey Data of Tanzania

In this Python notebook, we preprocess household survey data from Tanzania to compute two food security indicators: **Food Consumption Score (FCS)** and **Household Dietary Diversity Score (HDDS)**. These indicators play a crucial role in assessing food security and nutritional status at the household level. We compute them to serve as ground truth data in a research project that uses machine learning and deep learning to predict food security indicators (FCS and HDDS) from heterogeneous data.


### Food Security Indicators
    
These are quantitative measures used to evaluate the accessibility, availability, and utilization of food at various levels, from the household and community up to the national level. These indicators provide insights into the extent and severity of food insecurity, helping policymakers, researchers, and practitioners to identify vulnerable populations and design targeted interventions. There are many food security indicators; however, as mentioned earlier, we focus on only two: **Food Consumption Score (FCS)** and **Household Dietary Diversity Score (HDDS)**. A detailed description of these indicators is given in the [Computation of FCS and HDDS](#compute_fcs_hdds) section.
    


### About Tanzania
    
Tanzania is a country located in East Africa whose economy is primarily driven by agriculture. Despite being an agricultural country, Tanzania faces challenges with malnutrition and lack of dietary diversity, particularly in rural areas. Many households rely heavily on staple crops such as maize, rice, and cassava, leading to limited variety in diets and to micronutrient deficiencies. Poverty also remains a significant barrier, leaving many households struggling to afford an adequate and nutritious diet. In addition, limited access to markets, infrastructure, and transportation exacerbates food insecurity, particularly in remote and rural areas.
    

### Data Source

We will be working with National Panel Survey (NPS) data from Tanzania, which are publicly available from the [National Bureau of Statistics (NBS)](https://www.nbs.go.tz/index.php/en/). The NPS is a household survey conducted at the national level, offering insights into poverty levels, agricultural productivity, and other crucial development metrics. It is characterized as an "integrated" survey because it covers diverse subjects within a single questionnaire, ranging from education and healthcare to crime, gender-based violence and food security. Since our topic is food security, we focus only on the data relevant to it.

#### The following are the datasets that we will work with:

* [National Panel Survey(NPS) 2008 - 2009](https://microdata.worldbank.org/index.php/catalog/76)

* [National Panel Survey(NPS) 2010 - 2011](https://microdata.worldbank.org/index.php/catalog/1050)

* [National Panel Survey(NPS) 2012 - 2013](https://microdata.worldbank.org/index.php/catalog/2252)

* [National Panel Survey(NPS) 2014 - 2015](https://microdata.worldbank.org/index.php/catalog/2862)

* [National Panel Survey(NPS) 2016 Feed the Future Interim Supplemental Survey 2016](https://microdata.worldbank.org/index.php/catalog/2863)

* [National Panel Survey(NPS) 2019 - 2020](https://microdata.worldbank.org/index.php/catalog/3885)

* [National Panel Survey(NPS) 2020 - 2021](https://microdata.worldbank.org/index.php/catalog/5639)
* [High Frequency Welfare Monitoring Phone Survey 2021-2024](https://microdata.worldbank.org/index.php/catalog/4542)

## Import Libraries

In this section, we import essential libraries and modules required for data preprocessing, analysis, and visualization tasks. These libraries provide robust functionalities and tools that streamline the data analysis workflow and enable us to manipulate and explore the dataset efficiently.

In [16]:
#uncomment to install savReaderWriter
#!pip install savReaderWriter

In [17]:
import pandas as pd
import pyreadstat as ps
from pandas import read_csv
import savReaderWriter as sv

## <a id='helper_function'></a> Helper Function

In this section we define a set of helper functions designed to streamline data preprocessing tasks and facilitate the computation of Food Consumption Score (FCS) and Household Dietary Diversity Score (HDDS). These functions are designed to assist in converting dataset files from different formats to a common format (i.e., .csv), making them compatible with various data analysis tools and workflows.

In [18]:
'''
    This function decodes bytes to string and handles integer values.
    It checks if the input value is a bytes object and decodes it to a UTF-8 encoded string.
    If the value is a float and represents an integer, it converts it to an integer.
    Otherwise, it returns the original value.
    It is important while converting the .sav file to .csv.
'''

def decode_value(value):
    if isinstance(value, bytes):
        return value.decode('utf-8')
    elif isinstance(value, float) and value.is_integer():
        return int(value)
    else:
        return value
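
As a quick illustration of what `decode_value` does to a single row, the sketch below repeats the helper so it runs on its own and applies it to a made-up row. The row contents are hypothetical and only mimic the kind of values an SPSS reader may return (bytes for text, floats for whole numbers).

```python
def decode_value(value):
    # Same logic as the helper above, repeated so the example is self-contained
    if isinstance(value, bytes):
        return value.decode('utf-8')
    elif isinstance(value, float) and value.is_integer():
        return int(value)
    return value

# Hypothetical row: bytes text, a whole-number float, a true float, a missing value
raw_row = [b'DODOMA', 14.0, 2.5, None]
clean_row = [decode_value(v) for v in raw_row]
print(clean_row)  # ['DODOMA', 14, 2.5, None]
```

Values that are neither bytes nor whole-number floats, such as `2.5` or `None`, pass through unchanged.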

In [19]:
'''
    This function converts a .sav (SPSS) file to a .csv (comma-separated values) file.
    It reads the .sav file using sv.SavReader, extracts column names, and decodes values using the decode_value function.
    The data is then converted to a DataFrame and saved as a .csv file at the specified path.
'''

def sav_to_csv(sav_path,csv_path):
    with sv.SavReader(sav_path) as reader:
        # Extract the column names
        column_names = [name.decode('utf-8') for name in reader.header]
        
        # Read the data and decode values
        data = [[decode_value(value) for value in row] for row in reader]
    
    df = pd.DataFrame(data, columns=column_names)
    df.to_csv(csv_path, index=False)
    print(f"A new dataset is saved to {csv_path}")

In [20]:
'''
    This function converts a .dta (Stata) file to a .csv file.
    It reads the .dta file using pd.read_stata and loads it into a DataFrame.
    The DataFrame is then saved as a .csv file at the specified path
'''

def dta_to_csv(dta_path, csv_path):
    df = pd.read_stata(dta_path)
    df.to_csv(csv_path, index=False)
    print(f"A new dataset is saved to {csv_path}")

In [30]:
'''
    This function is designed to subset a DataFrame based on specified columns, optionally rename them,
    and save the resulting subset to a new CSV file. 
    This function is helpful when your dataset has a large number of columns and you only need to work with some of them.
    
    Parameters:
        df (DataFrame): The original DataFrame.
        columns_to_keep (list): A list of column names to keep.
        output_file (str): The path to the output CSV file.
        rename_columns (dict, optional): A dictionary where keys are original column names and values are new names.
'''

def subset_and_save(df, columns_to_keep, output_file, rename_columns=None):
    
    # Selecting columns to keep
    df_subset = df[columns_to_keep]

    # Optionally renaming columns
    if rename_columns:
        df_subset = df_subset.rename(columns=rename_columns)
    
    # Save the subset DataFrame to a new CSV file
    df_subset.to_csv(output_file, index=False)

    print(f"A new dataset is saved to {output_file}")

In [29]:
'''
    This function is designed to subset a DataFrame based on specified columns, add a year column,
    and save the resulting subset to a new CSV file. 
    This function is helpful when your dataset has a large number of columns and you only need to work with some of them.
    
    Parameters:
        df (DataFrame): The original DataFrame.
        columns_to_keep (list): A list of column names to keep.
        output_file (str): The path to the output CSV file.
        year(int): The year associated with the dataset, which will be added as a new column.
        rename_columns (dict, optional): A dictionary where keys are original column names and values are new names.
'''

def subset_and_save2(df, columns_to_keep, output_file, year, rename_columns=None):
    
    # Selecting columns to keep (copy so that inserting 'year' does not touch the original DataFrame)
    df_subset = df[columns_to_keep].copy()

    # Optionally renaming columns
    if rename_columns:
        df_subset = df_subset.rename(columns=rename_columns)
    
    # Adding a year column
    df_subset.insert(0, 'year', year)

    # Save the subset DataFrame to a new CSV file
    df_subset.to_csv(output_file, index=False)

    print(f"A new dataset is saved to {output_file}")

In [58]:
'''
    Merge three data frames based on 'hhid' column using left join.

    Parameters:
        df1 (DataFrame): First data frame should be the dataframe containing the address of the household. 
        df2 (DataFrame): Second data frame should be the dataframe that has fcs.
        df3 (DataFrame): Third data frame should be the dataframe  that has hdds.

    Returns:
        DataFrame: Merged data frame.
'''

def merge_data(df1, df2, df3):

    merged_df = pd.merge(df1, df2[['hhid', 'fcs']], on='hhid', how='left')
    merged_df = pd.merge(merged_df, df3[['hhid', 'hdds']], on='hhid', how='left')
    
    return merged_df
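
Because `merge_data` uses left joins, households that appear in the location file but not in the score files keep their row and receive missing values. The sketch below shows this on toy frames; the data are invented, and the `validate='one_to_one'` argument is an optional extra check that is not part of the helper above.

```python
import pandas as pd

# Hypothetical toy frames; only the column names follow the notebook
location = pd.DataFrame({'hhid': [1, 2, 3], 'region': ['DODOMA', 'DODOMA', 'ARUSHA']})
fcs = pd.DataFrame({'hhid': [1, 2], 'fcs': [17.0, 4.5]})
hdds = pd.DataFrame({'hhid': [1, 2], 'hdds': [8, 4]})

# Same two left joins as merge_data; validate raises if 'hhid' is duplicated on either side
merged = pd.merge(location, fcs[['hhid', 'fcs']], on='hhid', how='left', validate='one_to_one')
merged = pd.merge(merged, hdds[['hhid', 'hdds']], on='hhid', how='left', validate='one_to_one')

# Households present in the location file but absent from a score file
missing = merged[merged['fcs'].isna() | merged['hdds'].isna()]
print(missing['hhid'].tolist())  # [3]
```

Listing the unmatched households before saving makes it easier to decide whether they should be dropped or investigated.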


## <a id='compute_fcs_hdds'></a> Computation of FCS and HDDS

In this section, we explain how we compute the **Food Consumption Score (FCS)** and the **Household Dietary Diversity Score (HDDS)**. Since we work with multiple datasets from our study area, we adopt a standardized approach for calculating both FCS and HDDS across all datasets. Regardless of each dataset's specific characteristics, we apply consistent methods to ensure comparability and reliability of the results.

###  <a id='fcs'></a>1. Food Consumption Score (FCS)

The Food Consumption Score (FCS) is a food security indicator developed by the [World Food Programme (WFP)](https://resources.vam.wfp.org/data-analysis/quantitative/food-security/food-consumption-score). It serves as an index for assessing household food consumption patterns and nutritional adequacy. The data used to compute the FCS are collected through a household survey questionnaire that asks the respondent which food groups the household consumed over the past seven days. The FCS aggregates the diversity and frequency of the food groups consumed over the previous seven (**7**) days, weighting each group according to its relative nutritional value as specified in the table below.

<p><img src="images/food_weights.png"  align="centre" alt="food groups weighs" style="width:600px;height:300px;"></p>


#### Steps to Compute FCS
 1. Group the food items into the specified food groups.
 2. Sum the consumption frequencies of the food items within each group, capping each group total at 7.
 3. Multiply the value of each food group by its weight, as given in the table.
 4. Sum the weighted food group scores to obtain the overall FCS.
 5. Determine the household's food consumption status based on the following thresholds: 
     * 0 - 21 : Poor
     * 21.5 - 35 : Borderline
     * &gt; 35 : Acceptable
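
Step 5 can be sketched with `pd.cut`. The helper name `classify_fcs` is hypothetical and not part of this notebook; the bin edges follow the thresholds listed above, with each interval closed on the right so that 21 is Poor and 35 is Borderline.

```python
import numpy as np
import pandas as pd

def classify_fcs(fcs_values):
    # Hypothetical helper: intervals are (-inf, 21], (21, 35], (35, inf)
    bins = [-np.inf, 21, 35, np.inf]
    labels = ['Poor', 'Borderline', 'Acceptable']
    return pd.cut(pd.Series(fcs_values), bins=bins, labels=labels, right=True)

status = classify_fcs([4.5, 21, 21.5, 35, 46])
print(status.tolist())
# ['Poor', 'Poor', 'Borderline', 'Borderline', 'Acceptable']
```

Since FCS values move in steps of 0.5, the interval (21, 35] covers exactly the scores from 21.5 to 35.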
     
##### Mathematically the FCS is represented as:

$$
FCS = \sum_{j=1}^{9}f_j \times x_j
$$

**Where:**

- $f_j$ represents the frequency of consumption of food group $j$.
- $x_j$ represents the nutritional value (weight) of food group $j$.
- $j$ ranges from 1 to 9, representing the nine food groups considered in the calculation.
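
To make the formula concrete, the sketch below evaluates it for a single illustrative household. The weights are the ones used by the functions in this notebook; the frequencies are example values, already capped at 7.

```python
# Weights x_j per food group, as used in this notebook
weights = {
    'cereals_tubers': 2, 'pulses_nuts': 3, 'vegetables_leaves': 1,
    'fruits': 1, 'animal_protein': 4, 'dairy_products': 4,
    'sugar': 0.5, 'oil': 0.5, 'condiments': 0,
}

# Illustrative frequencies f_j (days consumed in the past 7 days)
frequencies = {
    'cereals_tubers': 7, 'pulses_nuts': 2, 'vegetables_leaves': 7,
    'fruits': 4, 'animal_protein': 2, 'dairy_products': 0,
    'sugar': 7, 'oil': 7, 'condiments': 7,
}

# FCS = sum over the nine groups of f_j * x_j
fcs = sum(frequencies[group] * weights[group] for group in weights)
print(fcs)  # 14 + 6 + 7 + 4 + 8 + 0 + 3.5 + 3.5 + 0 = 46.0
```

Condiments carry a weight of 0, so they never change the score regardless of how often they are consumed.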

A more detailed description of the FCS calculation and its use in food security analysis can be found in [this document](https://documents.wfp.org/stellent/groups/public/documents/manual_guide_proced/wfp197216.pdf). The source code and sample data can also be found [here](https://resources.vam.wfp.org/data-analysis/quantitative/food-security/food-consumption-score).

#### Sum Up Frequency by Food Group  Function (SFFGF)

This function implements **Step 1** and **Step 2** as explained above.
It groups the food items by their food group and sums the consumption frequencies within each group. Each group total is capped at 7, because a seven-day recall period cannot contain more than seven days of consumption. Without the cap, the score would be biased upwards as the number of food items in the questionnaire (later collapsed into food groups for the FCS calculation) increases. For example, suppose the questionnaire records maize, rice and wheat separately, and a household reports maize=3, rice=3 and wheat=3. All three fall into the cereals food group, so the summed frequency would be 9, which exceeds the seven-day window and biases the score. Capping the group frequency at 7 prevents this.
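
The maize, rice and wheat example can be reproduced in a few lines. This is a minimal sketch on invented data that uses `clip(upper=7)` for the cap; the column names match those expected by the functions in this notebook.

```python
import pandas as pd

# One hypothetical household reporting three cereal items, 3 days each
items = pd.DataFrame({
    'hhid': [1, 1, 1],
    'food_items': ['maize', 'rice', 'wheat'],
    'food_group': ['cereals_tubers'] * 3,
    'is_consumed': [3, 3, 3],
})

# Sum within the food group, then cap the group total at 7 days
raw = items.groupby(['hhid', 'food_group'])['is_consumed'].sum()
capped = raw.clip(upper=7)
print(raw.iloc[0], capped.iloc[0])  # 9 7
```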

In [22]:
def sum_up_frequency_by_food_group(data):
    
    '''
    Processes the given DataFrame by grouping by 'hhid' and 'food_group',
    summing up 'is_consumed', pivoting the data to get food groups as columns, and
    ensuring that no aggregated sum exceeds 7.

    Parameters:
    - data (pd.DataFrame): The input DataFrame with at least 'hhid', 'food_group', and 'is_consumed' columns.
    - hhid: a unique identifier of each household
    - food_group: the food group, such as cereals_tubers, pulses_nuts, sugar, oil, etc.
    - is_consumed: an integer value of either 1 (if consumed) or 0 (if not consumed)

    Returns:
    - pd.DataFrame: A pivoted DataFrame with 'hhid' as rows, 'food_group' as columns,
                    and sum of 'is_consumed' as values, capped at a maximum of 7.
                    
    '''
    # Group by 'hhid', 'food_group',then sum up 'is_consumed'
    grouped_data = data.groupby(['hhid', 'food_group'])['is_consumed'].sum().reset_index()

    # Pivot the data to get food groups as columns
    pivot_data = grouped_data.pivot_table(index='hhid', columns='food_group', values='is_consumed', aggfunc='sum', fill_value=0)

    # Reset index to make 'hhid' a column again
    pivot_data.reset_index(inplace=True)

    # Cap the number of days at 7 for all columns except 'hhid'
    pivot_data.iloc[:, 1:] = pivot_data.iloc[:, 1:].clip(upper=7)

    return pivot_data

#### Calculate Food Consumption Score Function (CFCSF)

This function computes the food consumption score for a given DataFrame. It implements **Step 1** to **Step 4** to compute the overall score. The weight of each food group is defined as explained in the section above. There are two versions of this function: **calculate_fcs_version1()** returns only the hhid and the computed food consumption score (fcs), while **calculate_fcs_version2()** returns the hhid, one column per food group with its aggregated frequency, and the computed food consumption score. Both versions produce the same fcs; they differ only in the returned DataFrame, so the choice depends on your goals.

In [23]:
def calculate_fcs_version1(data):
    '''
    Processes the given DataFrame by grouping by 'hhid', 'food_group', and 
    summing up 'is_consumed', pivoting the data to get food groups as columns, capping
    the values at a maximum of 7, assigning weights to each food group, and finally
    calculating the food consumption score for each household.

    Parameters:
    - data (pd.DataFrame): The input DataFrame with at least 'hhid', 'food_group' and 'is_consumed' columns.
    - hhid: a unique identifier of each household
    - food_group: the food group, such as cereals_tubers, pulses_nuts, sugar, oil, etc.
    - is_consumed: either a 1/0 flag (1 if consumed, 0 if not consumed) or the number of days
      the food was consumed in the past 7 days, depending on the dataset

    Returns:
    - pd.DataFrame: A DataFrame with 'hhid' and 'fcs' columns,
                    representing the calculated food consumption score for each household.
    '''
    # Define weights for each food group
    weights = {
        'cereals_tubers': 2,
        'pulses_nuts': 3,
        'vegetables_leaves': 1,
        'fruits': 1,
        'animal_protein': 4,
        'dairy_products': 4,
        'sugar': 0.5,
        'oil': 0.5,
        'condiments': 0
    }
    
    # Perform initial processing and capping
    grouped_data = data.groupby(['hhid', 'food_group'])['is_consumed'].sum().reset_index()
    pivot_data = grouped_data.pivot_table(index='hhid', columns='food_group', values='is_consumed', aggfunc='sum', fill_value=0)
    pivot_data.reset_index(inplace=True)
    pivot_data.iloc[:, 1:] = pivot_data.iloc[:, 1:].clip(upper=7)

    # Calculate the food consumption score using only the food groups that have a weight
    weighted_columns = []
    for column in pivot_data.columns[1:]:
        if column in weights:
            pivot_data[column] = pivot_data[column] * weights[column]
            weighted_columns.append(column)
        else:
            print(f"Warning: '{column}' not found in weights; it will be ignored in score calculation.")

    # Sum only the weighted columns so that unknown groups are really ignored
    pivot_data['fcs'] = pivot_data[weighted_columns].sum(axis=1)

    # Return a DataFrame with 'hhid' and 'fcs'
    return pivot_data[['hhid', 'fcs']]

In [24]:
def calculate_fcs_version2(data):
    
    '''
    Processes the given DataFrame by grouping by 'hhid', 'food_group', and 
    summing up 'is_consumed', pivoting the data to get food groups as columns, capping
    the values at a maximum of 7, assigning weights to each food group, and finally
    calculating the food consumption score for each household.

    Parameters:
    - data (pd.DataFrame): Input DataFrame with at least 'hhid', 'food_group', and 'is_consumed' columns.
    - hhid: a unique identifier of each household
    - food_items (optional, not used in the computation): the food items, such as maize, rice, etc.
    - food_group: the food group, such as cereals_tubers, pulses_nuts, sugar, oil, etc.
    - is_consumed: an integer value of either 1 (if consumed) or 0 (if not consumed)
    
    Returns:
    - pd.DataFrame: Output DataFrame with each 'hhid', the aggregated 'is_consumed' values for each food group,
                    and the calculated FCS.
    '''
    # Define weights for each food group
    weights = {
        'cereals_tubers': 2,
        'pulses_nuts': 3,
        'vegetables_leaves': 1,
        'fruits': 1,
        'animal_protein': 4,
        'dairy_products': 4,
        'sugar': 0.5,
        'oil': 0.5,
        'condiments': 0
    }
    
    # Group by 'hhid' and 'food_group', then sum 'is_consumed', capping at 7
    grouped = data.groupby(['hhid', 'food_group'])['is_consumed'].sum().reset_index()
    grouped['is_consumed'] = grouped['is_consumed'].clip(upper=7)
    
    # Pivot the data to get food groups as columns, filled with 'is_consumed' values
    pivot_data = grouped.pivot(index='hhid', columns='food_group', values='is_consumed').fillna(0)
    
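    # Compute the FCS by applying weights to the known food groups and summing them up;
    # food groups without a weight are left out of the score
    known_groups = [food_group for food_group in weights if food_group in pivot_data.columns]
    fcs = pivot_data[known_groups].copy()
    for food_group in known_groups:
        fcs[food_group] = fcs[food_group] * weights[food_group]
    
    pivot_data['fcs'] = fcs.sum(axis=1)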
    
    # Reset index to convert 'hhid' from index to a column
    pivot_data.reset_index(inplace=True)
    
    return pivot_data
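
As a cross-check on the two functions above, the same score can be sketched in vectorized form. The function name `fcs_vectorized` and the toy data are hypothetical; the weights are the ones defined above, and food groups without a weight are dropped by the `reindex` step.

```python
import pandas as pd

weights = pd.Series({
    'cereals_tubers': 2, 'pulses_nuts': 3, 'vegetables_leaves': 1,
    'fruits': 1, 'animal_protein': 4, 'dairy_products': 4,
    'sugar': 0.5, 'oil': 0.5, 'condiments': 0,
})

def fcs_vectorized(data):
    # Sum per household and food group, cap at 7 days, spread groups into columns
    capped = (data.groupby(['hhid', 'food_group'])['is_consumed']
                  .sum()
                  .clip(upper=7)
                  .unstack(fill_value=0))
    # Keep only groups that have a weight; groups never reported count as 0
    capped = capped.reindex(columns=weights.index, fill_value=0)
    # Multiply each column by its weight and sum across columns
    return capped.mul(weights, axis=1).sum(axis=1).rename('fcs').reset_index()

# Hypothetical long-format input
toy = pd.DataFrame({
    'hhid': [1, 1, 1, 2, 2],
    'food_group': ['cereals_tubers', 'cereals_tubers', 'oil', 'pulses_nuts', 'dairy_products'],
    'is_consumed': [5, 4, 7, 2, 1],
})
result = fcs_vectorized(toy)
print(result)
```

For household 1 the cereals total of 9 is capped to 7, giving 7 × 2 + 7 × 0.5 = 17.5; household 2 scores 2 × 3 + 1 × 4 = 10.0.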

### 2. Household Dietary Diversity Score (HDDS)

The Household Dietary Diversity Score (HDDS) is a qualitative measure of food consumption that reflects household access to a variety of foods and is used as an indicator of dietary diversity and nutritional quality. The HDDS is a simple count of the food groups that a household has consumed over the preceding 24 hours. Each food group is assigned a score of **1 (if consumed over the previous 24 hours)** or **0 (if not consumed in the last 24 hours)**. The household score ranges from 0 to 12 and equals the total number of food groups consumed by the household.

##### The following 12 food groups are used to calculate the HDDS indicator:

<img src="images/hdds_groups.png"  align="centre" alt="food groups weighs" style="width:400px;height:200px;">

#### Steps to Compute HDDS
 1. Group the food items into the specified food groups.
 2. Sum the consumption indicators of the food items within each group, capping each group total at 1.
 3. Sum the food group scores to obtain the overall HDDS.

##### Mathematically the HDDS is represented as: 
$$
HDDS = \sum_{j=1}^{12} x_j
$$

where $x_j$ equals 1 if the household consumed food from group $j$ in the past 24 hours, and 0 otherwise.
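
The count can be sketched on a small invented recall table. The food group names below are placeholders (apart from `cereals`, which appears in the HDDS input file), and the sketch takes the per-group maximum, which for 0/1 inputs is equivalent to summing and capping at 1.

```python
import pandas as pd

# Hypothetical 24-hour recall in long format
recall = pd.DataFrame({
    'hhid': [1, 1, 1, 1, 2, 2, 2],
    'food_group': ['cereals', 'cereals', 'vegetables', 'fruits', 'cereals', 'fish', 'oil'],
    'is_consumed': [1, 1, 1, 0, 1, 1, 1],
})

# x_j is 1 if any item in group j was consumed, 0 otherwise
x = recall.groupby(['hhid', 'food_group'])['is_consumed'].max().clip(upper=1)

# HDDS is the number of groups with x_j = 1 per household
hdds = x.groupby(level='hhid').sum()
print(hdds.to_dict())  # {1: 2, 2: 3}
```

Household 1 reports two cereal items, but the group still contributes only 1 to the score.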



#### Calculate Household Dietary Diversity Score Function (CHDDSF)
This function computes the household dietary diversity score for a given DataFrame. 

In [141]:
def calculate_hdds(data):
    """
    Processes the given DataFrame by grouping by 'hhid' and 'food_group',
    summing up 'is_consumed', pivoting the data to get food groups as columns, and
    calculating Household Dietary Diversity Score (HDDS) by summing the values
    of all the columns present in each food group.

    Parameters:
    - data (pd.DataFrame): The input DataFrame with at least 'hhid','food_group', and 'is_consumed' columns.

    Returns:
    - pd.DataFrame: A DataFrame with 'hhid', the value of all the columns, and the summed results as HDDS
                    for each household.
    """
    # Group by 'hhid', 'food_group', then sum up 'is_consumed'
    grouped_data = data.groupby(['hhid', 'food_group'])['is_consumed'].sum().reset_index()

    # Pivot the data to get food groups as columns
    pivot_data = grouped_data.pivot_table(index='hhid', columns='food_group', values='is_consumed', aggfunc='sum', fill_value=0)

    # Reset index to make 'hhid' a column again
    pivot_data.reset_index(inplace=True)
    
    # Cap the value at 1 for all columns except 'hhid'
    pivot_data.iloc[:, 1:] = pivot_data.iloc[:, 1:].clip(upper=1)

    # Calculate Household Dietary Diversity Score (HDDS) by summing the values of all columns present in each food group
    pivot_data['hdds'] = pivot_data.drop(columns=['hhid']).sum(axis=1)

    return pivot_data
    #return pivot_data[['hhid', 'hdds']]

## National Panel Survey 2008 - 2009

This dataset is publicly available at [National Panel Survey(NPS) 2008 - 2009](https://microdata.worldbank.org/index.php/catalog/76). The NPS interviewed 3,280 households spanning all regions and all districts of Tanzania, both mainland and Zanzibar. The dataset contains data from the Household, Agriculture and Community questionnaires, which cover a broad range of topics. For more detail on the topics covered in this dataset, [click here](https://microdata.worldbank.org/index.php/catalog/76/study-description). The dataset consists of multiple files; for our task we focus mainly on the files about food consumption. For the definition of each file and its associated variables, [visit here](https://microdata.worldbank.org/index.php/catalog/76/data-dictionary).

#### Loading the data sources

The data files we use from this dataset are:
* SEC_A_T.dta : Contains the address of each household, which is identified by a unique identifier (hhid)
* SEC_K1.dta : Contains the food consumption data from the 7-day recall

These files are in Stata format, so we convert them to CSV format for easier manipulation.

In [111]:
'''
#convert SEC_A_T.dta to csv file

o_path= 'Tanzania/2008_2009/tanzania_2008_2009_preprosessed_data/SEC_A_T.dta'
u_path = 'Tanzania/2008_2009/tanzania_2008_2009_preprosessed_data/SEC_A_T.csv'

dta_to_csv(o_path,u_path) #call the function to convert

'''

'''
#convert SEC_K1.dta to csv file

o_path= 'Tanzania/2008_2009/tanzania_2008_2009_preprosessed_data/SEC_K1.dta'
u_path = 'Tanzania/2008_2009/tanzania_2008_2009_preprosessed_data/SEC_K1.csv'

dta_to_csv(o_path,u_path) #call the function to convert


'''

A new dataset is saved to Tanzania/2008_2009/tanzania_2008_2009_preprosessed_data/SEC_A_T.csv


Renaming and removing columns was done in OpenRefine, but it can also be done here with the **subset_and_save** function from the Helper Function section.
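
For readers who prefer to stay in pandas, the sketch below shows one way the same preparation might look: selecting and renaming columns, then assigning each food item to a food group with a lookup dictionary. The raw column names (`item_name`, `consumed`) and the partial mapping are assumptions made for illustration; the actual names in SEC_K1.csv may differ.

```python
import pandas as pd

# Hypothetical raw extract; real column names in SEC_K1.csv may differ
raw = pd.DataFrame({
    'hhid': [1010140020171, 1010140020171, 1010140020171],
    'item_name': ['Rice (husked)', 'Maize (flour)', 'Unlisted item'],
    'consumed': [1, 0, 1],
})

# Illustrative item-to-group lookup; a full mapping would cover every item
item_to_group = {
    'Rice (husked)': 'cereals_tubers',
    'Maize (flour)': 'cereals_tubers',
}

# Keep and rename the columns expected by the FCS and HDDS functions
subset = raw[['hhid', 'item_name', 'consumed']].rename(
    columns={'item_name': 'food_items', 'consumed': 'is_consumed'})
subset['food_group'] = subset['food_items'].map(item_to_group)

# Items without a group show up as NaN and should be reviewed before scoring
unmapped = subset.loc[subset['food_group'].isna(), 'food_items'].tolist()
print(unmapped)  # ['Unlisted item']
```

Checking for unmapped items matters because rows with a missing `food_group` are silently dropped by `groupby` and would lower the scores.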

In [128]:
#Load the data file containing the food items consumed by each household the file is generated from SEC_K1.csv 
data =  read_csv('Tanzania/2008_2009/tanzania_2008_2009_preprosessed_data/tanzania_2008_2009_food_consumed_for_fcs.csv', header=0,delimiter=',')
data.head()

Unnamed: 0,hhid,food_items,is_consumed,food_group
0,1010140020171,Rice (paddy),0,cereals_tubers
1,1010140020171,Rice (husked),0,cereals_tubers
2,1010140020171,"Maize (green, cob)",0,cereals_tubers
3,1010140020171,Maize (grain),0,cereals_tubers
4,1010140020171,Maize (flour),0,cereals_tubers


In [210]:
#read the file containing the location of the household 
location =  read_csv('Tanzania/2008_2009/tanzania_2008_2009_preprosessed_data/tanzania_2008_2009_household_location.csv', header=0,delimiter=',')
location.head()

Unnamed: 0,hhid,year,region,district_id,district,ward_id,locality
0,1010140020171,2008,DODOMA,1,KONDOA,14,RURAL
1,1010140020284,2008,DODOMA,1,KONDOA,14,RURAL
2,1010140020297,2008,DODOMA,1,KONDOA,14,RURAL
3,1010140020409,2008,DODOMA,1,KONDOA,14,RURAL
4,1010140020471,2008,DODOMA,1,KONDOA,14,RURAL


#### Compute FCS

In [130]:
#call the function to compute the fcs
data_fcs = calculate_fcs_version2(data)
#data_fcs = calculate_fcs_version1(data)
data_fcs.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,fcs
0,1010140020171,2,1,2,1,0,1,0,1,2,17.0
1,1010140020284,0,1,1,0,0,1,0,0,2,4.5
2,1010140020297,3,2,2,1,0,1,1,1,2,26.0
3,1010140020409,1,2,2,0,0,1,0,1,2,11.0
4,1010140020471,0,3,2,0,0,1,1,1,2,12.0


In [114]:
#save the file may be useful later
#data_fcs.to_csv('Tanzania/2008_2009/tanzania_2008_2009_preprosessed_data/tanzania_2008_2009_fcs.csv', index=False)

#### Compute HDDS

In [131]:
data =  read_csv('Tanzania/2008_2009/tanzania_2008_2009_preprosessed_data/tanzania_2008_2009_food_consumed_for_hdds.csv', header=0,delimiter=',')
data.head()

Unnamed: 0,hhid,food_items,is_consumed,food_code,food_group
0,1010140020171,Rice (paddy),0,101,cereals
1,1010140020171,Rice (husked),0,102,cereals
2,1010140020171,"Maize (green, cob)",0,103,cereals
3,1010140020171,Maize (grain),0,104,cereals
4,1010140020171,Maize (flour),0,105,cereals


In [132]:
#call the function to compute the hdds
data_hdds = calculate_hdds(data)
data_hdds.head()

food_group,hhid,hdds
0,1010140020171,8
1,1010140020284,4
2,1010140020297,8
3,1010140020409,6
4,1010140020471,6


In [103]:
#save the file may be useful later
#data_hdds.to_csv('Tanzania/2008_2009/tanzania_2008_2009_preprosessed_data/tanzania_2008_2009_hdds.csv', index=False)

#### Combine the files and save the final dataset

In [133]:
data_final = merge_data(location,data_fcs,data_hdds)
data_final.head(10)

Unnamed: 0,hhid,year,region,district,district_name,ward,locality,fcs,hdds
0,1010140020171,2008,DODOMA,1,KONDOA,14,Rural,17.0,8
1,1010140020284,2008,DODOMA,1,KONDOA,14,Rural,4.5,4
2,1010140020297,2008,DODOMA,1,KONDOA,14,Rural,26.0,8
3,1010140020409,2008,DODOMA,1,KONDOA,14,Rural,11.0,6
4,1010140020471,2008,DODOMA,1,KONDOA,14,Rural,12.0,6
5,1010140020551,2008,DODOMA,1,KONDOA,14,Rural,18.0,8
6,1010140020761,2008,DODOMA,1,KONDOA,14,Rural,8.5,5
7,1010140020762,2008,DODOMA,1,KONDOA,14,Rural,11.0,6
8,1020030030004,2009,DODOMA,2,MPWAPWA,3,Rural,30.0,10
9,1020030030022,2009,DODOMA,2,MPWAPWA,3,Rural,3.0,3


In [134]:
#save the final dataset 
data_final.to_csv('Tanzania/2008_2009/tanzania_2008_2009_preprosessed_data/tanzania_2008_2009_final.csv', index=False)

## National Panel Survey 2010 - 2011

This dataset is publicly available at [National Panel Survey(NPS) 2010 - 2011](https://microdata.worldbank.org/index.php/catalog/1050). The total sample size was 3,265 households in 409 Enumeration Areas (2,063 households in rural areas and 1,202 in urban areas), covering Dar es Salaam, other urban areas in Mainland, rural areas in Mainland, and Zanzibar. The dataset contains data from the Household, Agriculture, Fishery and Community questionnaires, which cover a broad range of topics. For more detail on the topics covered in this dataset, [click here](https://microdata.worldbank.org/index.php/catalog/1050/study-description). The dataset consists of multiple files; for our task we focus mainly on the Household Questionnaire. For the definition of each file and its associated variables, [visit here](https://microdata.worldbank.org/index.php/catalog/1050/data-dictionary).

#### Loading the data sources

The data files we use from this dataset are:

   * HH_SEC_A.dta : Contains the address of each household, which is identified by a unique identifier (hhid)
    
   * HH_SEC_K2.dta : Contains data on the frequency of consumption of food items in the past 7 days.

These files are in Stata format, so we convert them to CSV format for easier manipulation.


In [26]:
'''
#convert HH_SEC_A.dta to csv file
o_path= 'Tanzania/2010_2011/tanzania_2010_2011_preprocessed_data/HH_SEC_A.dta'
u_path = 'Tanzania/2010_2011/tanzania_2010_2011_preprocessed_data/HH_SEC_A.csv'

dta_to_csv(o_path,u_path) #call the function to convert


#convert HH_SEC_K2.dta to csv file

o_path= 'Tanzania/2010_2011/tanzania_2010_2011_preprocessed_data/HH_SEC_K2.dta'
u_path = 'Tanzania/2010_2011/tanzania_2010_2011_preprocessed_data/HH_SEC_K2.csv'

dta_to_csv(o_path,u_path) #call the function to convert

'''

A new dataset is saved to Tanzania/2010_2011/tanzania_2010_2011_preprocessed_data/HH_SEC_A.csv


In [135]:
#Load the data file containing the food items consumed by each household the file is generated from HH_SEC_K2.csv 
data =  read_csv('Tanzania/2010_2011/tanzania_2010_2011_preprocessed_data/tanzania_2010_2011_food_consumed.csv', header=0,delimiter=',')
data.head()

Unnamed: 0,hhid,food_code,food_group,is_consumed
0,101014002017101,A,cereals_tubers,7.0
1,101014002017101,B,cereals_tubers,0.0
2,101014002017101,C,pulses_nuts,2.0
3,101014002017101,D,vegetables_leaves,7.0
4,101014002017101,E,animal_protein,2.0


In [136]:
#read the file containing the location of the household 
location =  read_csv('Tanzania/2010_2011/tanzania_2010_2011_preprocessed_data/tanzania_2010_2011_household_location.csv', header=0,delimiter=',')
location.head()

Unnamed: 0,hhid,year,region,district,ward,locality
0,101014002017101,2010,1,1,141,Rural
1,101014002028401,2010,1,1,141,Rural
2,101014002029701,2010,1,1,141,Rural
3,101014002029704,2011,7,3,72,Urban
4,101014002040901,2010,1,1,141,Rural


#### Compute FCS

In [137]:
#call the function to compute the fcs
data_fcs = calculate_fcs_version2(data)
data_fcs.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,fcs
0,101014002017101,2.0,7.0,7.0,0.0,4.0,7.0,2.0,7.0,7.0,46.0
1,101014002028401,2.0,7.0,7.0,5.0,1.0,7.0,3.0,7.0,7.0,66.0
2,101014002029701,5.0,7.0,7.0,1.0,2.0,7.0,4.0,7.0,4.0,63.0
3,101014002029704,1.0,7.0,7.0,1.0,1.0,7.0,4.0,7.0,7.0,49.0
4,101014002040901,1.0,7.0,7.0,0.0,0.0,7.0,3.0,7.0,7.0,41.0


In [138]:
#save the file may be useful later
data_fcs.to_csv('Tanzania/2010_2011/tanzania_2010_2011_preprocessed_data/tanzania_2010_2011_with_fcs.csv', index=False)

#### Compute HDDS

In [142]:
#call the function to compute the hdds
data_hdds = calculate_hdds(data)
data_hdds.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,hdds
0,101014002017101,1,1,1,0,1,1,1,1,1,8
1,101014002028401,1,1,1,1,1,1,1,1,1,9
2,101014002029701,1,1,1,1,1,1,1,1,1,9
3,101014002029704,1,1,1,1,1,1,1,1,1,9
4,101014002040901,1,1,1,0,0,1,1,1,1,7
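
In the output above, each food group is flagged 1 if the household consumed it at least once, and `hdds` is the number of flagged groups. A minimal sketch of that logic on the long-format data could look as follows. The `hdds_sketch` function and the toy values are assumptions for illustration only; the notebook's own `calculate_hdds` may differ in detail.

```python
import pandas as pd

# Toy long-format data shaped like the food-consumption file (illustrative values).
toy = pd.DataFrame({
    "hhid": ["h1", "h1", "h1", "h2", "h2", "h2"],
    "food_group": ["cereals_tubers", "cereals_tubers", "fruits",
                   "cereals_tubers", "fruits", "oil"],
    "is_consumed": [7, 0, 0, 3, 2, 7],
})


def hdds_sketch(long_df):
    """Hypothetical stand-in for calculate_hdds: count groups eaten at least once."""
    # Collapse food items into groups: a group counts if any of its items was eaten.
    days = long_df.pivot_table(index="hhid", columns="food_group",
                               values="is_consumed", aggfunc="max", fill_value=0)
    flags = (days > 0).astype(int)
    flags["hdds"] = flags.sum(axis=1)
    return flags.reset_index()


result = hdds_sketch(toy)
print(result)
```

Here household `h1` only consumed cereals or tubers, so its score is 1, while `h2` consumed three groups.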


In [143]:
#save the file; it may be useful later
data_hdds.to_csv('Tanzania/2010_2011/tanzania_2010_2011_preprocessed_data/tanzania_2010_2011_with_hdds.csv', index=False)

#### Combine the data files

In [144]:
#merging the location,fcs and hdds dataframe
data_final = merge_data(location,data_fcs,data_hdds)
data_final.head(10)

Unnamed: 0,hhid,year,region,district,ward,locality,fcs,hdds
0,101014002017101,2010,1,1,141,Rural,46.0,8
1,101014002028401,2010,1,1,141,Rural,66.0,9
2,101014002029701,2010,1,1,141,Rural,63.0,9
3,101014002029704,2011,7,3,72,Urban,49.0,9
4,101014002040901,2010,1,1,141,Rural,41.0,7
5,101014002047101,2010,1,1,141,Rural,35.5,8
6,101014002055101,2010,1,1,141,Rural,32.0,6
7,101014002076101,2010,1,1,141,Rural,24.5,4
8,101014002076201,2010,1,1,141,Rural,46.5,7
9,102003003000401,2011,1,2,31,Rural,38.0,7


In [145]:
#save the final dataset
data_final.to_csv('Tanzania/2010_2011/tanzania_2010_2011_preprocessed_data/tanzania_2010_2011_final.csv', index=False)
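
The combined table above keeps the location columns and adds `fcs` and `hdds` per household. A minimal sketch of such a merge, assuming `hhid` is the join key, is shown below. The `merge_sketch` function and the toy frames are hypothetical, and the notebook's `merge_data` may use a different join type. The last lines show one way to list households that have a location but no score before saving.

```python
import pandas as pd


def merge_sketch(location, fcs, hdds):
    """Hypothetical stand-in for merge_data: location columns plus fcs and hdds."""
    scores = fcs[["hhid", "fcs"]].merge(hdds[["hhid", "hdds"]], on="hhid", how="inner")
    return location.merge(scores, on="hhid", how="inner")


# Toy frames (illustrative values, not survey data).
location_toy = pd.DataFrame({"hhid": ["a", "b", "c"],
                             "locality": ["Rural", "Urban", "Rural"]})
fcs_toy = pd.DataFrame({"hhid": ["a", "b"], "fcs": [46.0, 66.0]})
hdds_toy = pd.DataFrame({"hhid": ["a", "b"], "hdds": [8, 9]})

merged = merge_sketch(location_toy, fcs_toy, hdds_toy)

# Households present in the location file but missing from the scores (here: "c").
check = location_toy.merge(fcs_toy[["hhid"]], on="hhid", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "hhid"].tolist()
print(merged)
print(unmatched)
```

An inner join silently drops unmatched households, so a check like the one above can help confirm that the household identifiers use the same format in every file.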

## National Panel Survey 2012 - 2013

This dataset is publicly available at [National Panel Survey (NPS) 2012 - 2013](https://microdata.worldbank.org/index.php/catalog/2252). The NPS interviewed 3,924 households spanning all regions and all districts of Tanzania, both mainland and Zanzibar. The dataset contains data from the Household, Agriculture, Fishery & Livestock, and Community questionnaires, which cover a broad range of topics. For more details on the topics covered in this dataset, [Click Here](https://microdata.worldbank.org/index.php/catalog/2252/study-description). The dataset consists of multiple files; for our task we focus mostly on the Household questionnaire. For the definition of each file and its associated variables, [Visit Here](https://microdata.worldbank.org/index.php/catalog/2252/data-dictionary).

#### Loading the data sources

The data files we use from this dataset are:
* HH_SEC_A.dta : Contains the address of each household, which is identified by a unique identifier (hhid)
* HH_SEC_J3.dta : Contains the food consumption data from the 7-day recall

These files are in Stata format, so we convert them to CSV format for easier manipulation.
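
The conversion step can be sketched with `pandas.read_stata`. The function below is a hypothetical stand-in named `dta_to_csv_sketch`; the notebook's own `dta_to_csv` is defined elsewhere and may differ. The round trip at the end uses a throwaway file with toy values rather than survey data.

```python
import os
import tempfile

import pandas as pd


def dta_to_csv_sketch(dta_path, csv_path):
    """Hypothetical stand-in for dta_to_csv: read a Stata file and write it as CSV."""
    frame = pd.read_stata(dta_path)
    frame.to_csv(csv_path, index=False)
    print(f"A new dataset is saved to {csv_path}")
    return frame


# Round-trip demonstration on a temporary file (toy values).
with tempfile.TemporaryDirectory() as tmp:
    dta_file = os.path.join(tmp, "toy.dta")
    csv_file = os.path.join(tmp, "toy.csv")
    pd.DataFrame({"hhid": [101, 102], "region": [1, 2]}).to_stata(dta_file, write_index=False)
    dta_to_csv_sketch(dta_file, csv_file)
    converted = pd.read_csv(csv_file)
```

By default `read_stata` converts labelled Stata values to pandas categoricals, which is why columns such as region can appear as names rather than codes in the converted files.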

In [86]:
'''
#convert HH_SEC_A.dta to csv file
o_path= 'Tanzania/2012_2013/tanzania_2012_2013_preprocessed_data/HH_SEC_A.dta'
u_path = 'Tanzania/2012_2013/tanzania_2012_2013_preprocessed_data/HH_SEC_A.csv'

dta_to_csv(o_path,u_path) #call the function to convert

#convert HH_SEC_J3.dta to csv file
o_path= 'Tanzania/2012_2013/tanzania_2012_2013_preprocessed_data/HH_SEC_J3.dta'
u_path = 'Tanzania/2012_2013/tanzania_2012_2013_preprocessed_data/HH_SEC_J3.csv'

dta_to_csv(o_path,u_path) #call the function to convert

'''

A new dataset is saved to Tanzania/2012_2013/tanzania_2012_2013_preprocessed_data/HH_SEC_A.csv


In [173]:
#load the food items consumed by each household (this file is generated from HH_SEC_J3.csv)
data =  read_csv('Tanzania/2012_2013/tanzania_2012_2013_preprocessed_data/tanzania_2012_2013_food_consumed.csv', header=0,delimiter=',')
data.head()

Unnamed: 0,hhid,food_code,food_group,is_consumed
0,0001-001,A,cereals_tubers,7
1,0001-001,B,cereals_tubers,0
2,0001-001,C,pulses_nuts,0
3,0001-001,D,vegetables_leaves,5
4,0001-001,E,animal_protein,3


In [156]:
#load the household addresses (this file is generated from HH_SEC_A.csv)
location =  read_csv('Tanzania/2012_2013/tanzania_2012_2013_preprocessed_data/tanzania_2012_2013_household_location.csv', header=0,delimiter=',')
location.head()

Unnamed: 0,hhid,year,region,district,district_name,ward,locality
0,0001-001,2012,DODOMA,1,KONDOA,141.0,RURAL
1,0002-001,2012,DODOMA,1,KONDOA,141.0,RURAL
2,0003-001,2012,DODOMA,1,KONDOA,141.0,RURAL
3,0003-010,2012,DAR ES SALAAM,3,TEMEKE,192.0,URBAN
4,0005-001,2012,DODOMA,1,KONDOA,141.0,RURAL


#### Compute FCS

In [151]:
#call the function to compute the fcs
data_fcs = calculate_fcs_version2(data)
data_fcs.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,fcs
0,0001-001,3,7,7,0,0,7,0,0,5,34.5
1,0002-001,0,7,7,0,0,7,0,7,7,28.0
2,0003-001,0,7,7,0,5,7,7,7,7,54.0
3,0003-010,5,7,7,0,2,7,4,7,7,62.0
4,0005-001,7,7,7,1,0,7,6,7,5,76.0


In [149]:
#save the file; it may be useful later
#data_fcs.to_csv('Tanzania/2012_2013/tanzania_2012_2013_preprocessed_data/tanzania_2012_2013_with_fcs.csv', index=False)

#### Compute HDDS

In [174]:
#call the function to compute the hdds
data_hdds = calculate_hdds(data)
data_hdds.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,hdds
0,0001-001,1,1,1,0,0,1,0,0,1,5
1,0002-001,0,1,1,0,0,1,0,1,1,5
2,0003-001,0,1,1,0,1,1,1,1,1,7
3,0003-010,1,1,1,0,1,1,1,1,1,8
4,0005-001,1,1,1,1,0,1,1,1,1,8


In [175]:
#save the file; it may be useful later
data_hdds.to_csv('Tanzania/2012_2013/tanzania_2012_2013_preprocessed_data/tanzania_2012_2013_with_hdds.csv', index=False)

## National Panel Survey 2014 - 2015

This dataset is publicly available at [National Panel Survey (NPS) 2014 - 2015](https://microdata.worldbank.org/index.php/catalog/2862). The NPS interviewed 3,360 households across Dar es Salaam, other urban areas, rural areas, and Zanzibar. The dataset contains data from the Household, Agriculture, and Community questionnaires, which cover a broad range of topics. For more details on the topics covered in this dataset, [Click Here](https://microdata.worldbank.org/index.php/catalog/2862/study-description). The dataset consists of multiple files; for our task we focus mostly on the Household questionnaire. For the definition of each file and its associated variables, [Visit Here](https://microdata.worldbank.org/index.php/catalog/2862/data-dictionary).

#### Loading the data sources

The data files we use from this dataset are:
* hh_sec_a.dta : Contains the address of each household, which is identified by a unique identifier (hhid)
* hh_sec_j3.dta : Contains the food consumption data from the 7-day recall

These files are in Stata format, so we convert them to CSV format for easier manipulation.

In [157]:
'''
#convert hh_sec_a.dta to csv file
o_path= 'Tanzania/2014_2015/tanzania_2014_2015_preprocessed_data/hh_sec_a.dta'
u_path = 'Tanzania/2014_2015/tanzania_2014_2015_preprocessed_data/hh_sec_a.csv'

dta_to_csv(o_path,u_path) #call the function to convert

#convert hh_sec_j3.dta to csv file
o_path= 'Tanzania/2014_2015/tanzania_2014_2015_preprocessed_data/hh_sec_j3.dta'
u_path = 'Tanzania/2014_2015/tanzania_2014_2015_preprocessed_data/hh_sec_j3.csv'

dta_to_csv(o_path,u_path) #call the function to convert

'''

A new dataset is saved to Tanzania/2014_2015/tanzania_2014_2015_preprocessed_data/hh_sec_a.csv


In [159]:
data =  read_csv('Tanzania/2014_2015/tanzania_2014_2015_preprocessed_data/tanzania_2014_2015_food_consumed.csv', header=0,delimiter=',')
data.head()

Unnamed: 0,hhid,food_code,food_group,is_consumed
0,1000-001,A,cereals_tubers,7.0
1,1000-001,B,cereals_tubers,2.0
2,1000-001,C,pulses_nuts,3.0
3,1000-001,D,vegetables_leaves,7.0
4,1000-001,E,animal_protein,2.0


In [160]:
#read the file containing the location of the household 
location =  read_csv('Tanzania/2014_2015/tanzania_2014_2015_preprocessed_data/tanzania_2014_2015_household_location.csv', header=0,delimiter=',')
location.head()

Unnamed: 0,hhid,year,region,district_id,district,ward_id,locality
0,1000-001,2014.0,ARUSHA,2,MERU,33,RURAL
1,1001-001,2014.0,ARUSHA,2,MERU,33,RURAL
2,1002-001,2014.0,ARUSHA,2,MERU,33,RURAL
3,1003-001,2014.0,ARUSHA,2,MERU,33,RURAL
4,1005-001,2014.0,ARUSHA,2,MERU,33,RURAL


#### Compute FCS

In [161]:
#call the function to compute the fcs
data_fcs = calculate_fcs_version2(data)
data_fcs.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,fcs
0,1000-001,2.0,7.0,7.0,1.0,0.0,7.0,3.0,7.0,7.0,49.0
1,1001-001,2.0,7.0,7.0,7.0,0.0,7.0,3.0,7.0,7.0,73.0
2,1002-001,0.0,7.0,7.0,7.0,0.0,7.0,7.0,7.0,7.0,77.0
3,1003-001,1.0,7.0,7.0,7.0,0.0,7.0,5.0,7.0,3.0,71.0
4,1005-001,2.0,7.0,7.0,7.0,0.0,7.0,1.0,7.0,7.0,67.0


In [162]:
#save the file; it may be useful later
data_fcs.to_csv('Tanzania/2014_2015/tanzania_2014_2015_preprocessed_data/tanzania_2014_2015_with_fcs.csv', index=False)

#### Compute HDDS

In [163]:
#call the function to compute the hdds
data_hdds = calculate_hdds(data)
data_hdds.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,hdds
0,1000-001,1,1,1,1,0,1,1,1,1,8
1,1001-001,1,1,1,1,0,1,1,1,1,8
2,1002-001,0,1,1,1,0,1,1,1,1,7
3,1003-001,1,1,1,1,0,1,1,1,1,8
4,1005-001,1,1,1,1,0,1,1,1,1,8


In [164]:
#save the file; it may be useful later
data_hdds.to_csv('Tanzania/2014_2015/tanzania_2014_2015_preprocessed_data/tanzania_2014_2015_with_hdds.csv', index=False)

## National Panel Survey 2016 - Feed the Future Interim Supplemental Survey

This dataset is publicly available at [National Panel Survey (NPS) 2016](https://microdata.worldbank.org/index.php/catalog/2863). The Feed the Future Interim Supplemental Survey (FTFISS) includes a Household questionnaire to measure and elaborate on consumption habits in Tanzania and to provide a more comprehensive view of the food security situation in the country. The project also expands on the food security information gathered in the Tanzania National Panel Survey (NPS), as the questionnaire themes in the FTFISS were modeled on the topics considered central to understanding food security. The survey interviewed 727 households in the regions of Dodoma, Manyara, Morogoro, Mbeya, Iringa, and all three areas of Unguja in Zanzibar. For more details, [Click Here](https://microdata.worldbank.org/index.php/catalog/2863/study-description). The definition of each file and its associated variables is explained [Here](https://microdata.worldbank.org/index.php/catalog/2863/data-dictionary).

#### Loading the data sources

The data files we use from this dataset are:
* hh_sec_a.dta : Contains the address of each household, which is identified by a unique identifier (hhid)
* hh_sec_d.dta : Contains the different types of food consumed by female members of the household on the previous day, both inside and outside the house

These files are in Stata format, so we convert them to CSV format for easier manipulation.

In [171]:
'''
#convert hh_sec_a.dta to csv file
o_path= 'Tanzania/2016/tanzania_2016_preprocessed_data/hh_sec_a.dta'
u_path = 'Tanzania/2016/tanzania_2016_preprocessed_data/hh_sec_a.csv'

dta_to_csv(o_path,u_path) #call the function to convert

#convert hh_sec_d.dta to csv file
o_path= 'Tanzania/2016/tanzania_2016_preprocessed_data/hh_sec_d.dta'
u_path = 'Tanzania/2016/tanzania_2016_preprocessed_data/hh_sec_d.csv'

dta_to_csv(o_path,u_path) #call the function to convert

'''

In [167]:
#read the file containing the food groups consumed by each household
data =  read_csv('Tanzania/2016/tanzania_2016_preprocessed_data/tanzania_2016_food_consumed.csv', header=0,delimiter=',')
data.head()

Unnamed: 0,hhid,food_items,food_group,is_consumed
0,1780-001,"RICE, MAIZE, FOOD MADE FROM MAIZE (UGALI), MIL...",cereals,1.0
1,1780-001,"BREAD, MANDAAZI, CHAPATI, MACARONI, SPAGHETTI,...",cereals,0.0
2,1780-001,"CASSAVA FRESH, CASSAVA DRY/FLOUR",roots_tubers,0.0
3,1780-001,"PUMPKIN, CARROTS, SQUASH, OR SWEET POTATOES TH...",roots_tubers,0.0
4,1780-001,"WHITE/IRISH POTATOES, WHITE YAMS, COCOYAMS, MA...",roots_tubers,1.0


In [168]:
#read the file containing the location of the household 
location =  read_csv('Tanzania/2016/tanzania_2016_preprocessed_data/tanzania_2016_household_location.csv', header=0,delimiter=',')
location.head()

Unnamed: 0,hhid,year,region,district_id,district,ward_id,locality
0,1779-001,2016,DODOMA,1,KONDOA,121,rural
1,1780-001,2016,DODOMA,1,KONDOA,121,rural
2,1781-001,2016,DODOMA,1,KONDOA,121,rural
3,1782-001,2016,DODOMA,1,KONDOA,121,rural
4,1783-001,2016,DODOMA,1,KONDOA,121,rural


#### Compute HDDS

In [169]:
#call the function to compute the hdds
data_hdds = calculate_hdds(data)
data_hdds.head()

food_group,hhid,cereals,condiments,eggs,fish_seafoods,fruits,meat,milk,oil,pulses_nuts,roots_tubers,sugar,vegetables,hdds
0,1780-001,1,1,0,1,0,0,0,1,1,1,0,1,7
1,1781-001,1,1,0,1,0,0,1,1,0,1,1,1,8
2,1783-001,1,1,0,0,0,0,1,1,1,1,1,1,8
3,1784-001,1,1,0,1,1,0,0,1,1,0,1,1,8
4,1785-001,1,1,0,0,0,1,1,1,1,1,1,1,9


In [172]:
#save the file; it may be useful later
data_hdds.to_csv('Tanzania/2016/tanzania_2016_preprocessed_data/tanzania_2016_with_hdds.csv', index=False)

## National Panel Survey 2019 - 2020
This dataset is publicly available at [National Panel Survey (NPS) 2019 - 2020](https://microdata.worldbank.org/index.php/catalog/3885). The NPS interviewed 1,184 households spanning all regions and all districts of Tanzania, both mainland and Zanzibar. The NPS-SDD 2019/20 is the first Extended Panel survey with sex-disaggregated data, collecting information on a wide range of topics including agricultural production, non-farm income generating activities, individual rights to plots, consumption expenditures, and a wealth of other socioeconomic characteristics. For more details on the topics covered in this dataset, [Click Here](https://microdata.worldbank.org/index.php/catalog/3885/study-description). For our task we focus mostly on the Household questionnaire. For the definition of each file and its associated variables, [Visit Here](https://microdata.worldbank.org/index.php/catalog/3885/data-dictionary).

#### Loading the data sources

The data files we use from this dataset are:
* HH_SEC_A.dta : Contains the address of each household, which is identified by a unique identifier (hhid)
* HH_SEC_J3.dta : Contains the food consumption data from the 7-day recall

These files are in Stata format, so we convert them to CSV format for easier manipulation.

In [176]:
'''
#convert HH_SEC_A.dta to csv file
o_path= 'Tanzania/2019_2020/tanzania_2019_2020_preprocessed_data/HH_SEC_A.dta'
u_path = 'Tanzania/2019_2020/tanzania_2019_2020_preprocessed_data/HH_SEC_A.csv'

dta_to_csv(o_path,u_path) #call the function to convert

#convert HH_SEC_J3.dta to csv file
o_path= 'Tanzania/2019_2020/tanzania_2019_2020_preprocessed_data/HH_SEC_J3.dta'
u_path = 'Tanzania/2019_2020/tanzania_2019_2020_preprocessed_data/HH_SEC_J3.csv'

dta_to_csv(o_path,u_path) #call the function to convert

'''

A new dataset is saved to Tanzania/2019_2020/tanzania_2019_2020_preprocessed_data/HH_SEC_A.csv


In [178]:
data = read_csv('Tanzania/2019_2020/tanzania_2019_2020_preprocessed_data/tanzania_2019_2020_food_consumed.csv', header=0,delimiter=',')
data.head()

Unnamed: 0,hhid,food_code,food_group,is_consumed
0,0001-001-001,A,cereals_tubers,7.0
1,0001-001-001,B,cereals_tubers,1.0
2,0001-001-001,C,pulses_nuts,0.0
3,0001-001-001,D,vegetables_leaves,5.0
4,0001-001-001,E,animal_protein,1.0


In [179]:
#read the file containing the location of the household 
location = read_csv('Tanzania/2019_2020/tanzania_2019_2020_preprocessed_data/tanzania_2019_2020_household_location.csv', header=0,delimiter=',')
location.head()

Unnamed: 0,hhid,year,region,district,ward_code,locality
0,0001-001-001,2019,DODOMA,KONDOA,14.0,RURAL
1,0001-001-003,2019,DODOMA,CHEMBA,14.0,RURAL
2,0001-001-004,2019,DAR ES SALAAM,KINONDONI,13.0,URBAN
3,0001-004-001,2019,DODOMA,KONDOA,,RURAL
4,0001-004-002,2019,DODOMA,KONDOA URBAN,221.0,URBAN


#### Compute FCS

In [180]:
#call the function to compute the fcs
data_fcs = calculate_fcs_version2(data)
data_fcs.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,fcs
0,0001-001-001,1.0,7.0,7.0,0.0,0.0,7.0,0.0,5.0,5.0,29.0
1,0001-001-003,6.0,7.0,7.0,1.0,0.0,7.0,2.0,0.0,7.0,58.5
2,0001-001-004,3.0,7.0,7.0,1.0,2.0,7.0,3.0,7.0,3.0,51.0
3,0001-004-001,2.0,7.0,7.0,5.0,0.0,7.0,3.0,7.0,5.0,63.0
4,0001-004-002,3.0,7.0,7.0,0.0,0.0,7.0,0.0,0.0,7.0,36.5


In [184]:
data_fcs.to_csv('Tanzania/2019_2020/tanzania_2019_2020_preprocessed_data/tanzania_2019_2020_with_fcs.csv',index=False)

#### Compute HDDS

In [182]:
#call the function to compute the hdds
data_hdds = calculate_hdds(data)
data_hdds.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,hdds
0,0001-001-001,1,1,1,0,0,1,0,1,1,6
1,0001-001-003,1,1,1,1,0,1,1,0,1,7
2,0001-001-004,1,1,1,1,1,1,1,1,1,9
3,0001-004-001,1,1,1,1,0,1,1,1,1,8
4,0001-004-002,1,1,1,0,0,1,0,0,1,5


In [185]:
data_hdds.to_csv('Tanzania/2019_2020/tanzania_2019_2020_preprocessed_data/tanzania_2019_2020_with_hdds.csv',index=False)

## National Panel Survey 2020 - 2021

This dataset is publicly available at [National Panel Survey (NPS) 2020 - 2021](https://microdata.worldbank.org/index.php/catalog/5639). The NPS interviewed 4,709 households across Dar es Salaam, other urban areas, rural areas, and Zanzibar. The dataset contains data from the Household, Agriculture, and Community questionnaires, which cover a broad range of topics. For more details on the topics covered in this dataset, [Click Here](https://microdata.worldbank.org/index.php/catalog/5639/study-description). The dataset consists of multiple files; for our task we focus mostly on the Household questionnaire. For the definition of each file and its associated variables, [Visit Here](https://microdata.worldbank.org/index.php/catalog/5639/data-dictionary).

#### Loading the data sources

The data files we use from this dataset are:
* hh_sec_a.dta : Contains the address of each household, which is identified by a unique identifier (hhid)
* hh_sec_j3.dta : Contains the food consumption data from the 7-day recall

These files are in Stata format, so we convert them to CSV format for easier manipulation.

In [191]:
'''
#convert hh_sec_a.dta to csv file
o_path= 'Tanzania/2020_2021/tanzania_2020_2021_preprocessed_data/hh_sec_a.dta'
u_path = 'Tanzania/2020_2021/tanzania_2020_2021_preprocessed_data/hh_sec_a.csv'
dta_to_csv(o_path,u_path) #call the function to convert


#convert hh_sec_j3.dta to csv file
o_path= 'Tanzania/2020_2021/tanzania_2020_2021_preprocessed_data/hh_sec_j3.dta'
u_path = 'Tanzania/2020_2021/tanzania_2020_2021_preprocessed_data/hh_sec_j3.csv'
dta_to_csv(o_path,u_path) #call the function to convert

'''

A new dataset is saved to Tanzania/2020_2021/tanzania_2020_2021_preprocessed_data/hh_sec_a.csv


In [201]:
data = read_csv('Tanzania/2020_2021/tanzania_2020_2021_preprocessed_data/tanzania_2020_2021_food_consumed.csv', header=0,delimiter=',')
data.head()

Unnamed: 0,hhid,food_code,food_group,is_consumed
0,1000-001-01,"A. CEREALS, GRAINS AND CEREAL PRODUCTS",cereals_tubers,7
1,1000-001-01,"B. ROOTS, TUBERS, AND PLANTAINS",cereals_tubers,4
2,1000-001-01,C. NUTS AND PULSES,pulses_nuts,7
3,1000-001-01,D. VEGETABLES,vegetables_leaves,7
4,1000-001-01,"E. MEAT, FISH AND ANIMAL PRODUCTS",animal_protein,2


In [202]:
#read the file containing the location of the household 
location =  read_csv('Tanzania/2020_2021/tanzania_2020_2021_preprocessed_data/tanzania_2020_2021_household_location.csv', header=0,delimiter=',')
location.head()

Unnamed: 0,hhid,locality,region,district,ward_id,year
0,1000-001-01,RURAL,ARUSHA,MERU,,2021
1,1000-001-02,RURAL,ARUSHA,MERU,,2021
2,1000-001-03,RURAL,ARUSHA,MERU,,2021
3,1000-001-06,RURAL,ARUSHA,MERU,,2021
4,1001-001-01,RURAL,ARUSHA,MERU,,2021


#### Compute FCS

In [204]:
#call the function to compute the fcs
data_fcs = calculate_fcs_version2(data)
data_fcs.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,fcs
0,1000-001-01,2,7,7,0,0,7,7,7,7,57.0
1,1000-001-02,0,7,0,3,0,0,0,0,7,33.0
2,1000-001-03,3,7,7,7,1,7,2,7,7,75.0
3,1000-001-06,4,7,7,2,4,7,7,7,7,77.0
4,1001-001-01,0,7,7,7,0,7,7,7,7,77.0


In [205]:
data_fcs.to_csv('Tanzania/2020_2021/tanzania_2020_2021_preprocessed_data/tanzania_2020_2021_with_fcs.csv', index = False )

#### Compute HDDS

In [199]:
#call the function to compute the hdds
data_hdds = calculate_hdds(data)
data_hdds.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,hdds
0,1000-001-01,1,1,1,0,0,1,1,1,1,7
1,1000-001-02,0,1,0,1,0,0,0,0,1,3
2,1000-001-03,1,1,1,1,1,1,1,1,1,9
3,1000-001-06,1,1,1,1,1,1,1,1,1,9
4,1001-001-01,0,1,1,1,0,1,1,1,1,7


In [200]:
#save the file; it may be useful later
data_hdds.to_csv('Tanzania/2020_2021/tanzania_2020_2021_preprocessed_data/tanzania_2020_2021_with_hdds.csv', index=False)

## High Frequency Welfare Monitoring Phone Survey 2021-2024
This dataset is publicly available at [High Frequency Welfare Monitoring Phone Survey 2021-2024](https://microdata.worldbank.org/index.php/catalog/4542). The survey aims to address the limitations of traditional large household surveys by implementing a high-frequency phone survey, which allows key information to be collected in a timely and cost-effective manner. It covers a broad range of topics including Employment, Education, Food Security, Mental Health and others. For more details on the topics covered in this dataset, [Click Here](https://microdata.worldbank.org/index.php/catalog/4542/study-description). The survey was conducted in nine rounds, each covering a particular topic. For our task we focus mostly on the Round 7 questionnaire, which contains the data about food consumption. For the definition of each file and its associated variables, [Visit Here](https://microdata.worldbank.org/index.php/catalog/4542/data-dictionary).

#### Loading the data sources

The data files we use from this dataset are:
* r7_sect_1.sav : Contains the address of each household, which is identified by a unique identifier (hhid)
* r7_sect_12b.sav : Contains the food consumption data from the 7-day recall

These files are in SPSS Statistics data file format, so we convert them to CSV format for easier manipulation.
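
The SPSS conversion can be sketched in the same way as the Stata one, using `pandas.read_spss`, which relies on the optional `pyreadstat` package. The function name `sav_to_csv_sketch` is hypothetical, and the notebook's own `sav_to_csv` may be implemented differently. The explicit existence check is an addition for illustration.

```python
from pathlib import Path

import pandas as pd


def sav_to_csv_sketch(sav_path, csv_path):
    """Hypothetical stand-in for sav_to_csv: read an SPSS file and write it as CSV."""
    if not Path(sav_path).exists():
        raise FileNotFoundError(sav_path)
    # pandas.read_spss needs the optional pyreadstat package to be installed.
    frame = pd.read_spss(sav_path)
    frame.to_csv(csv_path, index=False)
    print(f"A new dataset is saved to {csv_path}")
    return frame
```

It would be called with the same pair of paths as in the cell below, for example the `r7_sect_1.sav` input path and its `.csv` output path.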

In [None]:
'''
#convert r7_sect_1.sav to csv file
o_path= 'Tanzania/2021_2024/tanzania_2021_2024_preprocessed_data/r7_sect_1.sav'
u_path = 'Tanzania/2021_2024/tanzania_2021_2024_preprocessed_data/r7_sect_1.csv'
sav_to_csv(o_path,u_path) #call the function to convert

#convert r7_sect_12b.sav to csv file
o_path= 'Tanzania/2021_2024/tanzania_2021_2024_preprocessed_data/r7_sect_12b.sav'
u_path = 'Tanzania/2021_2024/tanzania_2021_2024_preprocessed_data/r7_sect_12b.csv'
sav_to_csv(o_path,u_path) #call the function to convert

'''

In [213]:
data = read_csv('Tanzania/2021_2024/tanzania_2021_2024_preprocessed_data/tanzania_2021_2024_food_consumed.csv', header=0,delimiter=',')
data.head()

Unnamed: 0,hhid,food_code,food_group,is_consumed
0,5770-001,10,cereals_tubers,1
1,5770-001,20,cereals_tubers,2
2,5770-001,30,pulses_nuts,1
3,5770-001,40,dairy_products,1
4,5770-001,50,animal_protein,1


In [214]:
#read the file containing the location of the household 
location =  read_csv('Tanzania/2021_2024/tanzania_2021_2024_preprocessed_data/tanzania_2021_2024_household_location.csv', header=0,delimiter=',')
location.head()

Unnamed: 0,year,hhid,region_id,region,district_id,district,ward_id,locality
0,2023,00-57-43-39,54,KASKAZINI PEMBA,542,MICHEWENI,589,RURAL
1,2023,01-04-38-95,54,KASKAZINI PEMBA,541,WETE,594,RURAL
2,2023,02-16-34-00,54,KASKAZINI PEMBA,541,WETE,888,URBAN
3,2023,03-55-54-81,51,KASKAZINI UNGUJA,512,KASKAZINI B,473,RURAL
4,2023,03-65-58-77,54,KASKAZINI PEMBA,542,MICHEWENI,816,RURAL


#### Compute FCS

In [215]:
#call the function to compute the fcs
data_fcs = calculate_fcs_version2(data)
data_fcs.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,fcs
0,03-65-58-77,0,7,7,1,0,3,2,7,4,33.0
1,06-72-00-17,3,7,0,0,4,0,0,0,3,33.0
2,08-72-39-21,2,5,2,0,3,0,1,1,5,29.5
3,09-19-46-29,7,7,7,0,4,0,2,0,0,52.0
4,1001-001,4,7,7,7,7,7,5,7,7,94.0


In [216]:
data_fcs.to_csv('Tanzania/2021_2024/tanzania_2021_2024_preprocessed_data/tanzania_2021_2024_with_fcs.csv', index = False )

#### Compute HDDS

In [217]:
#call the function to compute the hdds
data_hdds = calculate_hdds(data)
data_hdds.head()

food_group,hhid,animal_protein,cereals_tubers,condiments,dairy_products,fruits,oil,pulses_nuts,sugar,vegetables_leaves,hdds
0,03-65-58-77,0,1,1,1,0,1,1,1,1,7
1,06-72-00-17,1,1,0,0,1,0,0,0,1,4
2,08-72-39-21,1,1,1,0,1,0,1,1,1,7
3,09-19-46-29,1,1,1,0,1,0,1,0,0,5
4,1001-001,1,1,1,1,1,1,1,1,1,9


In [218]:
#save the file; it may be useful later
data_hdds.to_csv('Tanzania/2021_2024/tanzania_2021_2024_preprocessed_data/tanzania_2021_2024_with_hdds.csv', index=False)