# Introduction to CGAP data objects #

The CGAP data universe begins with the original 18 CGAP surveys (three surveys for each of six countries).  The data are cleaned and aligned, and secondary data products are built from them, as shown in the following data map.  Most of these data products are flat files that lack metadata.  This notebook describes my first attempt at a JSON-based representation of data objects that keeps data and metadata together.

In [1]:
import sys
import os, json
import copy
from types import SimpleNamespace

import numpy as np
import pandas as pd

# Change this filepath to one for your machine
sys.path.append('/Users/mordor/research/habitus_project/mycode/predictables/Data/Data Objects/Code and Notebooks')

from CGAP_JSON_Encoders_Decoders import Question_Decoder, CGAP_Encoded, CGAP_Decoded, Country_Decoded


# Change this filepath to one for your machine. The actual file is on our Box
# folder at https://pitt.app.box.com/folder/136317983622

Data = CGAP_Decoded()
Data.read_and_decode('/Users/mordor/research/habitus_project/mycode/predictables/Data/Data Objects/CGAP_JSON.txt')

countries = ['bgd','cdi','moz','nga','tan','uga']

Before you begin, go to https://pitt.app.box.com/folder/134211910534 and open one of the User Guides.  Scroll down to about page 30 (the exact page varies by country) and familiarize yourself with the three surveys (Household, Multiple-respondent, and Single-respondent).  Note that each survey item has a label such as H28 or A13.  These labels are unique: no label appears in more than one survey.  In general, though not without exception, the same question gets the same label in different countries.

The `Data` object holds the decoded JSON for each question, with data and metadata together.  `Data` is actually a Python `SimpleNamespace`, which makes it easy to access a question using dot notation (for example, `Data.bgd_A1`).  `Data.__dict__.keys()` gives you all the question keys, and iterating over `Data.__dict__.items()` prints each key followed by its question object:

In [11]:
for k,v in Data.__dict__.items(): 
    print(f"key={k}{v}")


key=bgd_A1namespace(answers={'lease_certificate': 1, 'customary_law': 2, 'communal': 3, 'state_ownership': 4, 'Kott': 5, 'other': 6}, country='bgd', df=         A1
1       1.0
2       1.0
3       NaN
4       1.0
5       1.0
...     ...
800129  1.0
800130  6.0
800131  1.0
800132  1.0
800133  1.0

[3136 rows x 1 columns], df_name='bgd_rr', label='A1', qtype='single', survey='rr', text='What is the form of ownership of your land?')
key=cdi_A1namespace(answers={'lease_certificate': 1, 'customary_law': 2, 'communal': 3, 'state_ownership': 4, 'sharecropping': 5, 'other': 6}, country='cdi', df=           A1
31416320  2.0
31416321  1.0
31416322  6.0
31416323  6.0
31416324  6.0
...       ...
31186932  2.0
31186933  2.0
31186934  2.0
31285239  2.0
31416319  2.0

[3002 rows x 1 columns], df_name='cdi_rr', label='A1', qtype='single', survey='rr', text='What is the form of ownership of your land?')
key=moz_A1namespace(answers={'lease_certificate': 1, 'customary_law': 2, 'communal': 3, 'state_owners

key=bgd_A5namespace(answers={'1': 'yes', '2': 'no'}, column_dict={'Rice': 1, 'Wheat': 2, 'Mango': 3, 'Jute': 4, 'Maize': 5, 'Tea': 6, 'Pulses': 7, 'Sugarcane': 8, 'Tobacco': 9, 'Chilies': 10, 'Onions': 11, 'Garlic': 12, 'Potato': 13, 'Rapeseed': 14, 'Mustard_seed': 15, 'Coconut': 16, 'Eggplant': 17, 'Radish': 18, 'Tomatoes': 19, 'Cauliflower': 20, 'Cabbage': 21, 'Pumpkin': 22, 'Banana': 23, 'Jackfruit': 24, 'Pineapple': 25, 'Guava': 26, 'Sesame': 27, 'Other_1': 28, 'Other_2': 29, 'Other_3': 30, 'No_crop': 31}, country='bgd', df=        Rice  Wheat  Mango  Jute  Maize  Tea  Pulses  Sugarcane  Tobacco  \
1        1.0    2.0    1.0   1.0    2.0  2.0     2.0        2.0      2.0   
2        1.0    1.0    2.0   1.0    2.0  2.0     2.0        2.0      2.0   
3        NaN    NaN    NaN   NaN    NaN  NaN     NaN        NaN      NaN   
4        1.0    2.0    2.0   2.0    2.0  2.0     1.0        2.0      2.0   
5        1.0    2.0    2.0   2.0    2.0  2.0     1.0        2.0      2.0   
...      .

key=uga_A5namespace(answers={'1': 'yes', '2': 'no'}, column_dict={'Maize': 1, 'Beans': 2, 'Sweet_potato': 3, 'Sorghum': 4, 'Rice': 5, 'Groundnuts': 6, 'Cowpea': 7, 'Millet': 8, 'Cassava': 9, 'Potato': 10, 'Pigeon_pea': 11, 'Banana': 12, 'Cotton': 13, 'Sesame': 21, 'Sugarcane': 15, 'Tobacco': 16, 'Tea': 17, 'Cocoa': 18, 'Coffee': 19, 'Field_pea': 20, 'Soybeans': 22, 'Other_1': 23, 'Other_2': 24, 'Other_3': 25, 'No_crop': 26}, country='uga', df=          Maize  Beans  Sweet_potato  Sorghum  Rice  Groundnuts  Cowpea  \
23068679    1.0    1.0           2.0      2.0   2.0         2.0     2.0   
22716424    1.0    1.0           2.0      2.0   1.0         2.0     2.0   
22716425    2.0    2.0           2.0      2.0   2.0         2.0     2.0   
23306249    1.0    1.0           2.0      2.0   2.0         2.0     2.0   
22945803    1.0    1.0           2.0      2.0   2.0         2.0     2.0   
...         ...    ...           ...      ...   ...         ...     ...   
23306221    1.0    2.0      

key=moz_A7namespace(answers={'1': 'yes', '2': 'no'}, column_dict={'Maize': 1, 'Beans': 2, 'Sweet_potato': 3, 'Sorghum': 4, 'Rice': 5, 'Groundnuts': 6, 'Cowpea': 7, 'Millet': 8, 'Cassava': 9, 'Potato': 10, 'Pigeon_pea': 11, 'Banana': 12, 'Coconut': 13, 'Cotton': 14, 'Sesame': 15, 'Mango': 16, 'Cashew': 17, 'Sugarcane': 18, 'Tobacco': 19, 'Tea': 20, 'Avocado': 21, 'Cocoa': 22, 'Sisal': 23, 'Cloves': 24, 'Coffee': 25, 'Sunflower': 26, 'Tomatoes': 27, 'Onions': 28, 'Other_1': 29, 'Other_2': 30, 'Other_3': 31, 'No_crop': 32}, country='moz', df=          Maize  Beans  Sweet_potato  Sorghum  Rice  Groundnuts  Cowpea  \
22552580    NaN    NaN           NaN      NaN   NaN         1.0     NaN   
22487045    1.0    2.0           2.0      NaN   NaN         1.0     1.0   
22159366    NaN    NaN           NaN      NaN   NaN         1.0     1.0   
22790149    2.0    1.0           1.0      NaN   NaN         NaN     1.0   
22790150    1.0    NaN           2.0      NaN   NaN         NaN     NaN   
...  

key=bgd_A9namespace(answers={'1': 'yes', '2': 'no'}, column_dict={'Rice': 1, 'Wheat': 2, 'Mango': 3, 'Jute': 4, 'Maize': 5, 'Tea': 6, 'Pulses': 7, 'Sugarcane': 8, 'Tobacco': 9, 'Chilies': 10, 'Onions': 11, 'Garlic': 12, 'Potato': 13, 'Rapeseed': 14, 'Mustard_seed': 15, 'Coconut': 16, 'Eggplant': 17, 'Radish': 18, 'Tomatoes': 19, 'Cauliflower': 20, 'Cabbage': 21, 'Pumpkin': 22, 'Banana': 23, 'Jackfruit': 24, 'Pineapple': 25, 'Guava': 26, 'Sesame': 27, 'Other_1': 28, 'Other_2': 29, 'Other_3': 30, 'No_crop': 31}, country='bgd', df=        Rice  Wheat  Mango  Jute  Maize  Tea  Pulses  Sugarcane  Tobacco  \
1        2.0    1.0    2.0   2.0    1.0  1.0     1.0        1.0      1.0   
2        2.0    2.0    1.0   2.0    1.0  1.0     1.0        1.0      1.0   
3        NaN    NaN    NaN   NaN    NaN  NaN     NaN        NaN      NaN   
4        2.0    2.0    2.0   2.0    2.0  1.0     2.0        2.0      2.0   
5        2.0    2.0    2.0   2.0    2.0  2.0     2.0        2.0      2.0   
...      .

key=tan_A9namespace(answers={'1': 'yes', '2': 'no'}, column_dict={'Maize': 1, 'Rice': 2, 'Sorghum': 3, 'Millet': 5, 'Cassava': 6, 'Sweet_potato': 7, 'Potato': 8, 'Beans': 9, 'Cowpea': 10, 'Pigeon_pea': 11, 'Sunflower': 12, 'Sesame': 13, 'Groundnuts': 14, 'Tomatoes': 15, 'Cabbage': 16, 'Onions': 17, 'Amaranth': 18, 'Cashew': 19, 'Banana': 20, 'Cotton': 21, 'Tobacco': 22, 'Pyrethrum': 23, 'Coffee': 24, 'Coconut': 25, 'Orange': 26, 'Sugarcane': 27, 'Palm_oil': 28, 'Other_1': 29, 'Other_2': 30, 'Other_3': 31, 'No_crop': 32}, country='tan', df=          Maize  Rice  Sorghum  Millet  Cassava  Sweet_potato  Potato  Beans  \
29073409    1.0   2.0      2.0     2.0      1.0           2.0     2.0    1.0   
29007876    2.0   2.0      2.0     2.0      2.0           2.0     2.0    2.0   
28852233    1.0   1.0      2.0     2.0      1.0           2.0     2.0    2.0   
28852234    1.0   2.0      2.0     2.0      1.0           1.0     2.0    2.0   
28975115    2.0   1.0      1.0     2.0      2.0        

key=moz_A25namespace(answers={'1': 'yes', '2': 'no'}, column_dict={'Maize': 1, 'Beans': 2, 'Sweet_potato': 3, 'Sorghum': 4, 'Rice': 5, 'Groundnuts': 6, 'Cowpea': 7, 'Millet': 8, 'Cassava': 9, 'Potato': 10, 'Pigeon_pea': 11, 'Banana': 12, 'Coconut': 13, 'Cotton': 14, 'Sesame': 15, 'Mango': 16, 'Cashew': 17, 'Sugarcane': 18, 'Tobacco': 19, 'Tea': 20, 'Avocado': 21, 'Cocoa': 22, 'Sisal': 23, 'Cloves': 24, 'Coffee': 25, 'Sunflower': 26, 'Tomatoes': 27, 'Onions': 28, 'Other_1': 29, 'Other_2': 30, 'Other_3': 31, 'No_crop': 32}, country='moz', df=          Maize  Beans  Sweet_potato  Sorghum  Rice  Groundnuts  Cowpea  \
22552580    NaN    NaN           NaN      NaN   NaN         2.0     NaN   
22487045    2.0    2.0           2.0      NaN   NaN         2.0     2.0   
22159366    NaN    NaN           NaN      NaN   NaN         2.0     2.0   
22790149    2.0    2.0           2.0      NaN   NaN         NaN     2.0   
22790150    2.0    NaN           2.0      NaN   NaN         NaN     NaN   
... 

key=moz_A36namespace(answers={'1': 'yes', '2': 'no'}, column_dict={'Maize': 1, 'Beans': 2, 'Sweet_potato': 3, 'Sorghum': 4, 'Rice': 5, 'Groundnuts': 6, 'Cowpea': 7, 'Millet': 8, 'Cassava': 9, 'Potato': 10, 'Pigeon_pea': 11, 'Banana': 12, 'Coconut': 13, 'Cotton': 14, 'Sesame': 15, 'Mango': 16, 'Cashew': 17, 'Sugarcane': 18, 'Tobacco': 19, 'Tea': 20, 'Avocado': 21, 'Cocoa': 22, 'Sisal': 23, 'Cloves': 24, 'Coffee': 25, 'Sunflower': 26, 'Tomatoes': 27, 'Onions': 28, 'Other_1': 29, 'Other_2': 30, 'Other_3': 31, 'No_crop': 32}, country='moz', df=          Maize  Beans  Sweet_potato  Sorghum  Rice  Groundnuts  Cowpea  \
22552580    NaN    NaN           NaN      NaN   NaN         2.0     NaN   
22487045    2.0    2.0           2.0      NaN   NaN         2.0     2.0   
22159366    NaN    NaN           NaN      NaN   NaN         2.0     2.0   
22790149    2.0    2.0           2.0      NaN   NaN         NaN     2.0   
22790150    2.0    NaN           2.0      NaN   NaN         NaN     NaN   
... 

key=bgd_A53namespace(answers={'1': 'yes', '2': 'no'}, column_dict={'Rice': 1, 'Wheat': 2, 'Mango': 3, 'Jute': 4, 'Maize': 5, 'Tea': 6, 'Pulses': 7, 'Sugarcane': 8, 'Tobacco': 9, 'Chilies': 10, 'Onions': 11, 'Garlic': 12, 'Potato': 13, 'Rapeseed': 14, 'Mustard_seed': 15, 'Coconut': 16, 'Eggplant': 17, 'Radish': 18, 'Tomatoes': 19, 'Cauliflower': 20, 'Cabbage': 21, 'Pumpkin': 22, 'Banana': 23, 'Jackfruit': 24, 'Pineapple': 25, 'Guava': 26, 'Sesame': 27, 'Other_1': 28, 'Other_2': 29, 'Other_3': 30, 'No_crop': 31}, country='bgd', df=        Rice  Wheat  Mango  Jute  Maize  Tea  Pulses  Sugarcane  Tobacco  \
1        NaN    NaN    NaN   NaN    NaN  NaN     NaN        NaN      NaN   
2        NaN    NaN    NaN   NaN    NaN  NaN     NaN        NaN      NaN   
3        NaN    NaN    NaN   NaN    NaN  NaN     NaN        NaN      NaN   
4        NaN    NaN    NaN   NaN    NaN  NaN     NaN        NaN      NaN   
5        NaN    NaN    NaN   NaN    NaN  NaN     NaN        NaN      NaN   
...      

key=tan_A53namespace(answers={'1': 'yes', '2': 'no'}, column_dict={'Maize': 1, 'Rice': 2, 'Sorghum': 3, 'Millet': 5, 'Cassava': 6, 'Sweet_potato': 7, 'Potato': 8, 'Beans': 9, 'Cowpea': 10, 'Pigeon_pea': 11, 'Sunflower': 12, 'Sesame': 13, 'Groundnuts': 14, 'Tomatoes': 15, 'Cabbage': 16, 'Onions': 17, 'Amaranth': 18, 'Cashew': 19, 'Banana': 20, 'Cotton': 21, 'Tobacco': 22, 'Pyrethrum': 23, 'Coffee': 24, 'Coconut': 25, 'Orange': 26, 'Sugarcane': 27, 'Palm_oil': 28, 'Other_1': 29, 'Other_2': 30, 'Other_3': 31, 'No_crop': 32}, country='tan', df=          Maize  Rice  Sorghum  Millet  Cassava  Sweet_potato  Potato  Beans  \
28845992    1.0   1.0      2.0     2.0      2.0           2.0     2.0    2.0   
29265802    1.0   2.0      2.0     2.0      2.0           2.0     2.0    1.0   
28901706    NaN   NaN      NaN     NaN      NaN           NaN     NaN    NaN   
29302230    1.0   1.0      2.0     2.0      2.0           2.0     2.0    2.0   
28910973    1.0   2.0      2.0     2.0      2.0       

key=moz_A26namespace(answers={'Maize': 3, 'Beans': 9, 'Sweet_potato': 17, 'Sorghum': 5, 'Rice': 1, 'Groundnuts': 8, 'Cowpea': 10, 'Millet': 4, 'Cassava': 7, 'Potato': 16, 'Pigeon_pea': 12, 'Banana': 21, 'Coconut': 44, 'Cotton': 46, 'Sesame': 55, 'Mango': 29, 'Cashew': 41, 'Sugarcane': 57, 'Tobacco': 59, 'Tea': 58, 'Avocado': 20, 'Cocoa': 43, 'Sisal': 56, 'Cloves': 42, 'Coffee': 45, 'Sunflower': 38, 'Tomatoes': 39, 'Onions': 32, 'Other_1': 60, 'Other_2': 61, 'Other_3': 62, 'No_crop': 63}, country='moz', df=           A26
22552580   NaN
22487045   NaN
22159366   NaN
22790149   NaN
22790150   NaN
...        ...
22757331   7.0
22552539   NaN
22200293   7.0
22102008   7.0
22167545  55.0

[2462 rows x 1 columns], df_name='moz_rr', label='A26', qtype='single', survey='rr', text='Which of the following crops that you make most money from?')
key=nga_A26namespace(answers={'Wheat': 2, 'Rice': 1, 'Maize': 3, 'Millet': 4, 'Sorghum': 5, 'Fonio': 6, 'Potato': 16, 'Sweet_potato': 17, 'Cassava': 7, 'Ta

key=cdi_A11namespace(column_dict={'Cattle_beef': 1, 'Cattle_dairy': 2, 'Buffalo': 3, 'Goat_meat': 4, 'Goat_dairy': 5, 'Sheep': 6, 'Chicken_broiler': 7, 'Chicken_layer': 8, 'Pig': 9, 'Duck': 10, 'Pigeon': 11, 'Fish': 12, 'Bees': 13, 'Other': 14, 'No_livestock': 15}, country='cdi', df=          Cattle_beef  Cattle_dairy  Buffalo  Goat_meat  Goat_dairy  Sheep  \
31416320          5.0           7.0      0.0        0.0         0.0    6.0   
31416321          1.0           0.0      0.0        4.0         0.0    1.0   
31416322          NaN           NaN      NaN        NaN         NaN    NaN   
31416323          NaN           NaN      NaN        NaN         NaN    NaN   
31416324          NaN           NaN      NaN        NaN         NaN    NaN   
...               ...           ...      ...        ...         ...    ...   
31186932          NaN           NaN      NaN        NaN         NaN    NaN   
31186933          2.0           0.0      0.0        0.0         0.0    0.0   
31186934      

key=moz_A12namespace(answers={'1': 'yes', '2': 'no'}, column_dict={'Cattle_beef': 1, 'Cattle_dairy': 2, 'Cattle_ind': 3, 'Sheep': 4, 'Duck': 5, 'Pig': 6, 'Goat_meat': 7, 'Goat_dairy': 8, 'Chicken_broiler': 9, 'Chicken_layer': 10, 'Fish': 11, 'Bees': 12, 'Other': 13, 'No_livestock': 14}, country='moz', df=          Cattle_beef  Cattle_dairy  Cattle_ind  Sheep  Duck  Pig  Goat_meat  \
22552580          NaN           NaN         NaN    NaN   NaN  NaN        NaN   
22487045          NaN           NaN         NaN    NaN   NaN  NaN        NaN   
22159366          NaN           NaN         NaN    NaN   NaN  NaN        NaN   
22790149          NaN           NaN         NaN    NaN   2.0  NaN        NaN   
22790150          NaN           NaN         NaN    NaN   NaN  NaN        NaN   
...               ...           ...         ...    ...   ...  ...        ...   
22757331          1.0           NaN         NaN    NaN   NaN  1.0        NaN   
22552539          NaN           NaN         NaN    NaN

key=moz_A14namespace(answers={'1': 'yes', '2': 'no'}, column_dict={'Cattle_beef': 1, 'Cattle_dairy': 2, 'Cattle_ind': 3, 'Sheep': 4, 'Duck': 5, 'Pig': 6, 'Goat_meat': 7, 'Goat_dairy': 8, 'Chicken_broiler': 9, 'Chicken_layer': 10, 'Fish': 11, 'Bees': 12, 'Other': 13, 'No_livestock': 14}, country='moz', df=          Cattle_beef  Cattle_dairy  Cattle_ind  Sheep  Duck  Pig  Goat_meat  \
22552580          NaN           NaN         NaN    NaN   NaN  NaN        NaN   
22487045          NaN           NaN         NaN    NaN   NaN  NaN        NaN   
22159366          NaN           NaN         NaN    NaN   NaN  NaN        NaN   
22790149          NaN           NaN         NaN    NaN   2.0  NaN        NaN   
22790150          NaN           NaN         NaN    NaN   NaN  NaN        NaN   
...               ...           ...         ...    ...   ...  ...        ...   
22757331          2.0           NaN         NaN    NaN   NaN  1.0        NaN   
22552539          NaN           NaN         NaN    NaN

key=nga_A15namespace(answers={'yes': 1, 'no': 2}, column_dict={'cooperative': 1, 'wholesaler': 2, ' processor': 3, 'retailer': 4, 'government': 5, 'middleman': 6, 'other': 7, 'no_purchase': 8, 'DK': 98}, country='nga', df=          cooperative  wholesaler   processor  retailer  government  \
40509448          2.0         2.0         2.0       1.0         2.0   
40509449          2.0         2.0         2.0       1.0         2.0   
40503694          2.0         2.0         2.0       2.0         2.0   
40503695          2.0         1.0         2.0       2.0         2.0   
40542247          2.0         2.0         2.0       1.0         2.0   
...               ...         ...         ...       ...         ...   
40501230          2.0         2.0         2.0       1.0         2.0   
40525806          2.0         2.0         2.0       1.0         2.0   
40304631          2.0         2.0         2.0       1.0         2.0   
40344123          2.0         2.0         2.0       1.0         2.0 

key=uga_A17namespace(answers={'yes': 1, 'no': 2}, column_dict={'cash': 1, 'cheque': 2, 'pay_cash_bank': 3, 'electronic': 4, 'mobile_banking': 5, 'in_kind': 6, 'prepaid_card': 7, 'other': 8, 'do_not_buy': 9, 'DK': 98}, country='uga', df=          cash  cheque  pay_cash_bank  electronic  mobile_banking  in_kind  \
23068679   1.0     2.0            2.0         2.0             2.0      2.0   
22716424   1.0     2.0            2.0         2.0             2.0      2.0   
22716425   1.0     2.0            2.0         2.0             2.0      2.0   
23306249   1.0     2.0            2.0         2.0             2.0      2.0   
22945803   1.0     2.0            2.0         2.0             2.0      2.0   
...        ...     ...            ...         ...             ...      ...   
23306221   NaN     NaN            NaN         NaN             NaN      NaN   
23306222   NaN     NaN            NaN         NaN             NaN      NaN   
22872056   1.0     2.0            2.0         2.0             

key=tan_A23namespace(answers={'yes': 1, 'no': 2}, column_dict={'friends_neighbors': 1, 'hired_extended_period': 2, 'family': 3, 'day_labor': 4, 'other': 5, 'no_labor': 6}, country='tan', df=          friends_neighbors  hired_extended_period  family  day_labor  other  \
29073409                1.0                    2.0     1.0        2.0    2.0   
29007876                2.0                    2.0     1.0        2.0    2.0   
28852233                2.0                    2.0     1.0        2.0    2.0   
28852234                2.0                    2.0     1.0        2.0    2.0   
28975115                2.0                    2.0     1.0        2.0    2.0   
...                     ...                    ...     ...        ...    ...   
28884943                2.0                    2.0     1.0        2.0    2.0   
28901353                2.0                    2.0     2.0        2.0    2.0   
28909557                NaN                    NaN     NaN        NaN    NaN   
28909558  

key=uga_A27namespace(answers={'yes': 1, 'no': 2}, column_dict={'cooperative': 1, 'wholesaler': 2, 'processor': 3, 'retailer': 4, 'public': 5, 'government': 6, 'middleman': 7, 'DK': 98, 'other': 10}, country='uga', df=          cooperative  wholesaler  processor  retailer  public  government  \
23068679          2.0         1.0        2.0       2.0     2.0         2.0   
22716424          2.0         2.0        2.0       2.0     1.0         2.0   
22716425          2.0         2.0        2.0       2.0     1.0         2.0   
23306249          2.0         1.0        2.0       2.0     2.0         2.0   
22945803          2.0         2.0        2.0       1.0     2.0         2.0   
...               ...         ...        ...       ...     ...         ...   
23306221          NaN         NaN        NaN       NaN     NaN         NaN   
23306222          2.0         2.0        2.0       2.0     1.0         2.0   
22872056          2.0         2.0        2.0       2.0     2.0         2.0   
229

key=tan_A29namespace(answers={'yes': 1, 'no': 2}, column_dict={'best_price': 1, 'no_transport': 2, 'poor_roads': 3, 'unaware_of_prices': 4, 'small_production': 5, 'other': 6, 'DK': 98}, country='tan', df=          best_price  no_transport  poor_roads  unaware_of_prices  \
29073409         2.0           2.0         2.0                1.0   
29007876         2.0           2.0         2.0                1.0   
28852233         2.0           2.0         2.0                1.0   
28852234         NaN           NaN         NaN                NaN   
28975115         2.0           2.0         1.0                2.0   
...              ...           ...         ...                ...   
28884943         2.0           1.0         2.0                1.0   
28901353         2.0           2.0         2.0                1.0   
28909557         NaN           NaN         NaN                NaN   
28909558         NaN           NaN         NaN                NaN   
29351928         2.0           2.0   

key=uga_A32namespace(answers={'yes': 1, 'no': 2}, country='uga', df=          A32
23068679  2.0
22716424  1.0
22716425  2.0
23306249  2.0
22945803  1.0
...       ...
23306221  NaN
23306222  2.0
22872056  1.0
22904826  2.0
22945787  2.0

[2859 rows x 1 columns], df_name='uga_rr', label='A32', qtype='single', survey='rr', text=' Do you have a contract to sell any of your crops or livestock?')
key=bgd_A33namespace(answers={'yes': 1, 'no': 2}, column_dict={'cash': 1, 'cheque': 2, 'electronic': 3, 'mobile_banking': 4, 'in_kind': 5, 'prepaid_card': 6, 'other': 7}, country='bgd', df=        cash  cheque  electronic  mobile_banking  in_kind  prepaid_card  other
1        1.0     2.0         2.0             2.0      2.0           2.0    2.0
2        1.0     2.0         2.0             2.0      2.0           2.0    2.0
3        NaN     NaN         NaN             NaN      NaN           NaN    NaN
4        NaN     NaN         NaN             NaN      NaN           NaN    NaN
5        1.0     2.0  

key=tan_A35namespace(answers={'yes': 1, 'no': 2}, column_dict={'distance': 1, 'transport': 2, 'damage': 3, 'lack_storage': 4, 'lack_refrigeration': 5, 'unreliable_middlemen': 6, 'no_challenges': 7, 'other': 8}, country='tan', df=          distance  transport  damage  lack_storage  lack_refrigeration  \
29073409       2.0        2.0     2.0           2.0                 2.0   
29007876       2.0        2.0     2.0           2.0                 2.0   
28852233       2.0        2.0     1.0           2.0                 2.0   
28852234       NaN        NaN     NaN           NaN                 NaN   
28975115       2.0        1.0     2.0           2.0                 2.0   
...            ...        ...     ...           ...                 ...   
28884943       1.0        1.0     2.0           1.0                 2.0   
28901353       1.0        1.0     1.0           2.0                 2.0   
28909557       NaN        NaN     NaN           NaN                 NaN   
28909558       NaN   

key=tan_H2Bnamespace(answers={'regular_job': 1, 'occasional_job': 2, 'retail_business': 3, 'services_business': 4, 'grant_pension': 5, 'family_friends': 6, 'growing_crops': 7, 'rearing_livestock': 8, 'other': 9}, country='tan', df=          H2B
29073409  3.0
29007876  7.0
28852233  7.0
28852234  3.0
28975115  7.0
...       ...
28884943  8.0
28901353  7.0
28909557  8.0
28909558  2.0
29351928  9.0

[2812 rows x 1 columns], df_name='tan_rr', label='H2B', qtype='single', survey='rr', text='Which of these has been your main source of income in the last year?')
key=uga_H2Bnamespace(answers={'regular_job': 1, 'occasional_job': 2, 'retail_business': 3, 'services_business': 4, 'grant_pension': 5, 'family_friends': 6, 'growing_crops': 7, 'rearing_livestock': 8, 'other': 9}, country='uga', df=          H2B
23068679  7.0
22716424  6.0
22716425  8.0
23306249  7.0
22945803  7.0
...       ...
23306221  8.0
23306222  8.0
22872056  7.0
22904826  8.0
22945787  7.0

[2859 rows x 1 columns], df_name='uga_

key=tan_H6namespace(answers={'farmer': 1, 'professional': 2, 'shop_owner': 3, 'business_owner': 4, 'laborer': 5, 'other': 6}, country='tan', df=          H6
29073409   1
29007876   1
28852233   1
28852234   1
28975115   1
...       ..
28884943   1
28901353   1
28909557   6
28909558   6
29351928   1

[2812 rows x 1 columns], df_name='tan_rr', label='H6', qtype='single', survey='rr', text='What is your primary job (i.e., the job where you spend most of your time)?')
key=uga_H6namespace(answers={'farmer': 1, 'professional': 2, 'shop_owner': 3, 'business_owner': 4, 'laborer': 5, 'other': 6}, country='uga', df=          H6
23068679   4
22716424   4
22716425   1
23306249   1
22945803   1
...       ..
23306221   1
23306222   1
22872056   2
22904826   2
22945787   1

[2859 rows x 1 columns], df_name='uga_rr', label='H6', qtype='single', survey='rr', text='What is your primary job (i.e., the job where you spend most of your time)?')
key=bgd_H7namespace(answers={'yes': 1, 'no': 2}, column_dict={

key=bgd_H9namespace(answers={'yes': 1, 'no': 2}, column_dict={'processor': 1, 'seller': 2, 'services': 3, 'rent_out_land': 4, 'other': 5, 'no_other_way': 6}, country='bgd', df=        processor  seller  services  rent_out_land  other  no_other_way
1               2       2         2              2      2             1
2               2       2         2              2      2             1
3               2       2         2              2      2             1
4               2       2         2              2      2             1
5               2       2         2              2      2             1
...           ...     ...       ...            ...    ...           ...
800129          2       2         2              2      2             1
800130          2       2         2              2      2             1
800131          1       2         2              2      2             2
800132          2       2         2              2      2             1
800133          2       2       

key=tan_H11namespace(answers={'yes': 1, 'no': 2}, column_dict={'direct_deposit': 1, 'cash': 2, 'cheque': 3, 'courier': 4, 'own_m_money': 5, 'agent_m_money': 6, 'other_m_money': 7, 'digital_card': 8, 'moneygram': 9, 'other': 10}, country='tan', df=          direct_deposit  cash  cheque  courier  own_m_money  agent_m_money  \
29073409             NaN   NaN     NaN      NaN          NaN            NaN   
29007876             NaN   NaN     NaN      NaN          NaN            NaN   
28852233             NaN   NaN     NaN      NaN          NaN            NaN   
28852234             NaN   NaN     NaN      NaN          NaN            NaN   
28975115             NaN   NaN     NaN      NaN          NaN            NaN   
...                  ...   ...     ...      ...          ...            ...   
28884943             NaN   NaN     NaN      NaN          NaN            NaN   
28901353             NaN   NaN     NaN      NaN          NaN            NaN   
28909557             NaN   NaN     NaN    

key=moz_F62namespace(answers={'cannot_read': 1, 'read_parts': 2, 'read_all': 3}, country='moz', df=          F62
22552580  NaN
22487045  1.0
22159366  2.0
22790149  2.0
22790150  6.0
...       ...
22757331  NaN
22552539  NaN
22200293  6.0
22102008  NaN
22167545  NaN

[2462 rows x 1 columns], df_name='moz_rr', label='F62', qtype='single', survey='rr', text='Can you read any part of these sentences to me?')
key=nga_F62namespace(answers={'cannot_read': 1, 'read_parts': 2, 'read_all': 3}, country='nga', df=          F62
40509448  NaN
40509449  NaN
40503694  3.0
40503695  1.0
40542247  NaN
...       ...
40501230  NaN
40525806  3.0
40304631  NaN
40344123  NaN
40344124  NaN

[2858 rows x 1 columns], df_name='nga_rr', label='F62', qtype='single', survey='rr', text='Can you read any part of these sentences to me?')
key=tan_F62namespace(answers={'cannot_read': 1, 'read_parts': 2, 'read_all': 3}, country='tan', df=          F62
29073409  3.0
29007876  2.0
28852233  1.0
28852234  3.0
28975115  2.0

key=cdi_A41namespace(answers={'agree': 1, 'disagree': 2}, column_dict={'enjoy_agriculture': 1, 'want_ag_work_only': 2, 'want_expand': 3, 'would_take_offered_job': 4, 'am_satisfied': 5, 'my_legacy': 6, 'make_ends_meet': 7, 'want_children_continue': 8}, country='cdi', df=          enjoy_agriculture  want_ag_work_only  want_expand  \
31028570                NaN                NaN          NaN   
31028571                1.0                2.0          1.0   
31036603                NaN                NaN          NaN   
31038817                1.0                2.0          1.0   
31040474                1.0                1.0          1.0   
...                     ...                ...          ...   
32261521                1.0                2.0          1.0   
32261522                1.0                2.0          1.0   
32261524                1.0                1.0          1.0   
32280062                1.0                1.0          1.0   
32280063                1.0          

key=tan_A42namespace(answers={'yes': 1, 'no': 2}, column_dict={'plant_weed_harvest': 1, 'exporting': 2, 'union': 3, 'savings': 4, 'women': 5, 'processors': 6, 'cooperative': 7, 'implements': 8, 'sacco': 9, 'other': 10}, country='tan', df=          plant_weed_harvest  exporting  union  savings  women  processors  \
28845992                 2.0        2.0    2.0      2.0    2.0         2.0   
29265802                 2.0        2.0    2.0      2.0    2.0         2.0   
28901706                 1.0        2.0    2.0      2.0    2.0         2.0   
29302230                 2.0        2.0    2.0      2.0    2.0         2.0   
28910973                 2.0        2.0    2.0      2.0    2.0         2.0   
...                      ...        ...    ...      ...    ...         ...   
29319386                 NaN        NaN    NaN      NaN    NaN         NaN   
28924835                 2.0        2.0    2.0      2.0    2.0         2.0   
29010663                 2.0        2.0    2.0      2.0    2

key=moz_A44namespace(answers={'daily': 1, 'weekly': 2, 'monthly': 3, 'less_than_monthly': 4, 'never': 5}, column_dict={'sms': 1, 'radio': 2, 'television': 3, 'internet': 4, 'print_media': 5, 'friends_family': 6, 'religious_leader': 7, 'community': 8, 'development': 9, 'teachers': 10, 'government': 11, 'suppliers': 12, 'merchants': 13, 'extension': 14, 'middlemen': 15, 'other': 16}, country='moz', df=          sms  radio  television  internet  print_media  friends_family  \
21948170  NaN    NaN         NaN       NaN          NaN             NaN   
21948180  4.0    5.0         NaN       NaN          NaN             4.0   
21948779  4.0    5.0         NaN       5.0          4.0             2.0   
21951744  1.0    5.0         1.0       5.0          5.0             1.0   
21955646  NaN    NaN         NaN       NaN          NaN             NaN   
...       ...    ...         ...       ...          ...             ...   
23315711  5.0    5.0         5.0       5.0          5.0             5.0 

key=cdi_A47namespace(answers={'very_important': 1, 'somewhat_important': 2, 'not_important': 3}, column_dict={'fertilizer': 1, 'seeds': 2, 'pesticides': 3, 'equipment': 4, 'fuel': 5, 'workers': 6, 'security': 7, 'investment': 8, 'storage': 9, 'irrigation': 10, 'transportation': 11, 'machinery': 12, 'other': 13}, country='cdi', df=          fertilizer  seeds  pesticides  equipment  fuel  workers  security  \
31028570         NaN    NaN         NaN        NaN   NaN      NaN       NaN   
31028571         1.0    1.0         1.0        1.0   NaN      1.0       1.0   
31036603         NaN    NaN         NaN        NaN   NaN      NaN       NaN   
31038817         1.0    1.0         2.0        1.0   3.0      3.0       1.0   
31040474         1.0    1.0         1.0        1.0   1.0      1.0       1.0   
...              ...    ...         ...        ...   ...      ...       ...   
32261521         3.0    1.0         1.0        1.0   1.0      1.0       1.0   
32261522         1.0    1.0         

key=moz_A48namespace(answers={'yes': 1, 'no': 2}, column_dict={'fertilizer': 1, 'seeds': 2, 'pesticides': 3, 'equipment': 4, 'fuel': 5, 'workers': 6, 'security': 7, 'investment': 8, 'storage': 9, 'irrigation': 10, 'other': 11}, country='moz', df=          fertilizer  seeds  pesticides  equipment  fuel  workers  security  \
21948170         NaN    NaN         NaN        NaN   NaN      NaN       NaN   
21948180         1.0    1.0         1.0        1.0   NaN      NaN       1.0   
21948779         2.0    2.0         2.0        2.0   2.0      2.0       2.0   
21951744         2.0    1.0         2.0        2.0   2.0      2.0       2.0   
21955646         NaN    NaN         NaN        NaN   NaN      NaN       NaN   
...              ...    ...         ...        ...   ...      ...       ...   
23315711         2.0    2.0         2.0        2.0   2.0      NaN       NaN   
23324438         1.0    1.0         1.0        NaN   2.0      NaN       2.0   
23324440         NaN    NaN         NaN    

key=nga_A49namespace(answers={'yes': 1, 'no': 2}, column_dict={'fertilizer': 1, 'seeds': 2, 'pesticides': 3, 'equipment': 4, 'fuel': 5, 'workers': 6, 'security': 7, 'investment': 8, 'storage': 9, 'irrigation': 10, 'transportation': 11, 'machinery': 12, 'other': 13}, country='nga', df=          fertilizer  seeds  pesticides  equipment  fuel  workers  security  \
40127279         1.0    1.0         1.0        1.0   1.0      1.0       1.0   
40134046         1.0    1.0         1.0        1.0   1.0      1.0       1.0   
40135889         1.0    1.0         1.0        1.0   1.0      1.0       1.0   
40143258         1.0    1.0         1.0        1.0   1.0      1.0       2.0   
40144462         2.0    2.0         2.0        2.0   NaN      2.0       2.0   
...              ...    ...         ...        ...   ...      ...       ...   
41076184         2.0    1.0         NaN        1.0   1.0      2.0       2.0   
41076185         1.0    1.0         2.0        2.0   1.0      1.0       1.0   
4107

key=bgd_A57namespace(answers={'yes': 1, 'no': 2}, column_dict={'no_storage': 1, 'expensive': 2, 'no_leftover': 3, 'bad_idea': 4, 'need_money': 5, 'other': 6}, country='bgd', df=        no_storage  expensive  no_leftover  bad_idea  need_money  other
1              NaN        NaN          NaN       NaN         NaN    NaN
2              NaN        NaN          NaN       NaN         NaN    NaN
3              NaN        NaN          NaN       NaN         NaN    NaN
4              2.0        2.0          1.0       2.0         2.0    2.0
5              2.0        2.0          1.0       2.0         2.0    2.0
...            ...        ...          ...       ...         ...    ...
800129         NaN        NaN          NaN       NaN         NaN    NaN
800130         NaN        NaN          NaN       NaN         NaN    NaN
800131         NaN        NaN          NaN       NaN         NaN    NaN
800132         2.0        2.0          1.0       2.0         2.0    2.0
800133         NaN        NaN  

key=tan_A59namespace(answers={'yes': 1, 'no': 2}, country='tan', df=          A59
28845992  NaN
29265802  NaN
28901706  NaN
29302230  NaN
28910973  NaN
...       ...
29319386  NaN
28924835  NaN
29010663  NaN
29555590  NaN
29198985  NaN

[2795 rows x 1 columns], df_name='tan_sr', label='A59', qtype='single', survey='sr', text='Do you currently have livestock that are investments?')
key=uga_A59namespace(answers={'yes': 1, 'no': 2}, country='uga', df=          A59
22678446  NaN
22678927  NaN
22679052  NaN
22679076  NaN
22679225  1.0
...       ...
23500842  NaN
23620542  NaN
23620647  NaN
23620678  NaN
23620722  NaN

[2771 rows x 1 columns], df_name='uga_sr', label='A59', qtype='single', survey='sr', text='Do you currently have livestock that are investments?')
key=bgd_A60namespace(answers={'weather': 1, 'power': 2, 'prices': 3, 'inputs': 4, 'pests_disease': 5, 'contract_broken': 6, 'no_sale': 7, 'perils_accidents': 8, 'health': 9, 'loss_of_land': 10, 'equipment_breakdown': 11, 'input_qual

key=bgd_A62namespace(answers={'temp_job': 1, 'took_loan': 2, 'borrowed': 3, 'sold_livestock': 4, 'sold_asset': 5, 'used_savings': 6, 'insurance_paid': 7, 'no_need': 8, 'did_nothing': 9}, column_dict={'weather': 1, 'pests_disease': 2, 'accident': 3, 'market_prices': 4, 'input_prices': 5, 'contract_broken': 6, 'downturn_no_sale': 7, 'equipment_breakdown': 8, 'health': 9, 'death': 10, 'unrest_or_war': 11, 'DK': 12}, country='bgd', df=        weather  pests_disease  accident  market_prices  input_prices  \
1           NaN            NaN       NaN            NaN           NaN   
2           NaN            NaN       NaN            NaN           NaN   
3           NaN            NaN       NaN            NaN           NaN   
4           NaN            9.0       NaN            NaN           NaN   
5           3.0            3.0       NaN            NaN           NaN   
...         ...            ...       ...            ...           ...   
800129      NaN            2.0       NaN            Na

key=cdi_H17namespace(answers={'very_important': 1, 'somewhat_important': 2, 'not_important': 3}, column_dict={'future_purchases': 1, 'unexpected_event': 2, 'regular_purchases': 3, 'school_fees': 4, 'marriage': 5, 'funeral': 6}, country='cdi', df=          future_purchases  unexpected_event  regular_purchases  school_fees  \
31028570               1.0               1.0                1.0          1.0   
31028571               1.0               2.0                3.0          1.0   
31036603               2.0               2.0                2.0          2.0   
31038817               1.0               1.0                1.0          3.0   
31040474               1.0               1.0                1.0          1.0   
...                    ...               ...                ...          ...   
32261521               1.0               1.0                1.0          1.0   
32261522               1.0               1.0                1.0          1.0   
32261524               1.0        

key=moz_H19namespace(answers={'very_important': 1, 'somewhat_important': 2, 'not_important': 3}, column_dict={'financial_inst': 1, 'informal_group': 2, 'at_home': 3, 'on_mobile': 4}, country='moz', df=          financial_inst  informal_group  at_home  on_mobile
21948170             1.0             NaN      2.0        1.0
21948180             1.0             3.0      3.0        3.0
21948779             2.0             3.0      1.0        NaN
21951744             1.0             3.0      2.0        NaN
21955646             1.0             2.0      1.0        3.0
...                  ...             ...      ...        ...
23315711             3.0             3.0      1.0        3.0
23324438             1.0             2.0      1.0        1.0
23324440             NaN             NaN      NaN        NaN
23324645             NaN             NaN      NaN        NaN
23324649             1.0             1.0      1.0        1.0

[2574 rows x 4 columns], df_name='moz_sr', label='H19', qtype='mul

key=nga_H25namespace(answers={'yes': 1, 'no': 2}, column_dict={'could_relatives_help': 1, 'household_skip_meal': 2, 'house_unlit': 3, 'too_sick_to_work': 4, 'receive_support': 5}, country='nga', df=          could_relatives_help  household_skip_meal  house_unlit  \
40127279                   1.0                  2.0          2.0   
40134046                   1.0                  2.0          2.0   
40135889                   1.0                  2.0          2.0   
40143258                   2.0                  2.0          2.0   
40144462                   2.0                  2.0          2.0   
...                        ...                  ...          ...   
41076184                   2.0                  2.0          2.0   
41076185                   2.0                  NaN          2.0   
41076186                   2.0                  2.0          2.0   
41076187                   2.0                  2.0          2.0   
41076188                   1.0                  2.0   

key=uga_H30namespace(answers={'always_or_mostly': 1, 'sometimes': 2, 'rarely': 3, 'never': 4}, column_dict={'income_exceeds_outgoing?': 1, 'fund_for_unplanned_expenses': 2, 'pay_bills_ontime': 3, 'savings_exceed_debts': 4}, country='uga', df=          income_exceeds_outgoing?  fund_for_unplanned_expenses  \
22678446                       3.0                          2.0   
22678927                       1.0                          4.0   
22679052                       2.0                          3.0   
22679076                       4.0                          4.0   
22679225                       2.0                          2.0   
...                            ...                          ...   
23500842                       NaN                          4.0   
23620542                       2.0                          4.0   
23620647                       2.0                          4.0   
23620678                       3.0                          4.0   
23620722             

key=cdi_H34namespace(answers={'yes': 1, 'no': 2}, column_dict={'loss_of_house': 1, 'major_medical': 2, 'bankruptcy_lost_job': 3, 'loss_of_harvest': 4, 'loss_of_property': 5, 'death': 6, 'time_without_food': 7}, country='cdi', df=          loss_of_house  major_medical  bankruptcy_lost_job  loss_of_harvest  \
31028570            2.0            2.0                  2.0              NaN   
31028571            2.0            2.0                  2.0              2.0   
31036603            NaN            NaN                  NaN              2.0   
31038817            2.0            1.0                  2.0              2.0   
31040474            2.0            2.0                  2.0              2.0   
...                 ...            ...                  ...              ...   
32261521            2.0            2.0                  2.0              2.0   
32261522            2.0            2.0                  2.0              2.0   
32261524            2.0            2.0             

key=cdi_H37namespace(answers={'agree': 1, 'disagree': 2}, column_dict={'actions_determine': 1, 'self_determine': 2, 'short_term': 3, 'live_for_today': 4, 'future_determine_itself': 5, 'work_hard': 6, 'what_happens_happens': 7, 'power_determines': 8}, country='cdi', df=          actions_determine  self_determine  short_term  live_for_today  \
31028570                1.0             2.0         2.0             2.0   
31028571                1.0             2.0         2.0             2.0   
31036603                1.0             NaN         NaN             NaN   
31038817                1.0             2.0         1.0             1.0   
31040474                1.0             1.0         1.0             1.0   
...                     ...             ...         ...             ...   
32261521                1.0             1.0         1.0             2.0   
32261522                1.0             1.0         1.0             2.0   
32261524                1.0             1.0         1.0 

key=cdi_F46namespace(answers={'yes': 1, 'no': 2}, column_dict={'VLSA': 1, 'ROSCA': 2, 'money_guard': 3, 'savings_collector': 4, 'digital_card': 5}, country='cdi', df=          VLSA  ROSCA  money_guard  savings_collector  digital_card
31028570     2      2            2                  2             2
31028571     2      2            2                  2             2
31036603     2      2            2                  2             2
31038817     2      2            2                  2             2
31040474     1      2            1                  2             1
...        ...    ...          ...                ...           ...
32261521     2      2            2                  2             2
32261522     2      2            2                  2             2
32261524     2      2            2                  2             2
32280062     2      2            2                  2             2
32280063     2      2            2                  2             2

[2949 rows x 5 co

[2795 rows x 1 columns], df_name='tan_sr', label='F50', qtype='single', survey='sr', text='Which of these service providers or services is the most important to you?')
key=uga_F50namespace(answer_dict={'VLSA': 1, 'ROSCA': 2, 'other_informal': 3, 'money_guard': 4, 'savings_collector': 5, 'shopkeeper': 6, 'digital_card': 7, 'money_lender': 8}, country='uga', df=          F50
22678446  1.0
22678927  NaN
22679052  NaN
22679076  1.0
22679225  NaN
...       ...
23500842  NaN
23620542  NaN
23620647  NaN
23620678  8.0
23620722  NaN

[2771 rows x 1 columns], df_name='uga_sr', label='F50', qtype='single', survey='sr', text='Which of these service providers or services is the most important to you?')
key=bgd_F49namespace(answer_dict={'yes': 1, 'no': 2}, column_dict={'merry_go_round': 1, 'lend_nonmembers': 2, 'lend_members': 3, 'buy_for_members': 4, 'guarantor_security': 5, 'invest': 6, 'purchase_tools': 7, 'purchase_fixed_assets': 8, 'funeral_emergency': 9, 'help_save': 10}, country='bgd', df=   

key=cdi_F51namespace(answer_dict={'yes': 1, 'no': 2}, column_dict={'have_formal_account': 1, 'have_no_money': 2, 'stealing': 3, 'unfamilar': 4, 'no_need': 5, 'no_trust': 6, 'time_meeting': 7}, country='cdi', df=          have_formal_account  have_no_money  stealing  unfamilar  no_need  \
31028570                  2.0            1.0       2.0        2.0      2.0   
31028571                  2.0            2.0       2.0        2.0      1.0   
31036603                  2.0            1.0       2.0        2.0      2.0   
31038817                  2.0            1.0       2.0        2.0      2.0   
31040474                  NaN            NaN       NaN        NaN      NaN   
...                       ...            ...       ...        ...      ...   
32261521                  2.0            1.0       2.0        2.0      2.0   
32261522                  2.0            1.0       2.0        2.0      2.0   
32261524                  2.0            1.0       2.0        2.0      2.0   
32280062 

key=moz_F54namespace(answer_dict={'very important': 1, 'somewhat_important': 2, 'not_important': 3}, column_dict={'bank': 1, 'microfinance': 2, 'cooperative': 3, 'SACCO': 4, 'moneylender': 5, 'friends_family': 6}, country='moz', df=          bank  microfinance  cooperative  SACCO  moneylender  friends_family
21948170   1.0           1.0          1.0    1.0          2.0             2.0
21948180   1.0           1.0          2.0    2.0          3.0             3.0
21948779   1.0           2.0          3.0    3.0          2.0             1.0
21951744   3.0           3.0          3.0    3.0          3.0             3.0
21955646   1.0           3.0          3.0    2.0          2.0             2.0
...        ...           ...          ...    ...          ...             ...
23315711   NaN           NaN          NaN    NaN          NaN             NaN
23324438   1.0           1.0          2.0    2.0          1.0             3.0
23324440   NaN           NaN          NaN    NaN          NaN     

key=moz_F56namespace(answer_dict={'yes': 1, 'no': 2}, column_dict={'bank': 1, 'microfinance': 2, 'cooperative': 3, 'SACCO': 4, 'moneylender': 5, 'friends_family': 6}, country='moz', df=          bank  microfinance  cooperative  SACCO  moneylender  friends_family
21948170   1.0           2.0          2.0    2.0          2.0             2.0
21948180   2.0           2.0          2.0    2.0          2.0             2.0
21948779   1.0           2.0          2.0    2.0          1.0             1.0
21951744   2.0           2.0          2.0    2.0          1.0             2.0
21955646   1.0           2.0          1.0    1.0          2.0             2.0
...        ...           ...          ...    ...          ...             ...
23315711   2.0           2.0          2.0    2.0          2.0             2.0
23324438   1.0           1.0          2.0    1.0          1.0             1.0
23324440   NaN           NaN          NaN    NaN          NaN             NaN
23324645   NaN           NaN       

key=cdi_AG_INPUTS_PLANSnamespace(country='cdi', df=          AG_INPUTS_PLANS
31028570                0
31028571                0
31036603                0
31038817                0
31040474                0
...                   ...
32261521                0
32261522                0
32261524                0
32280062                0
32280063                0

[2949 rows x 1 columns], df_name='cdi_sr', label='AG_INPUTS_PLANS', qtype='single', survey='sr', text='How many savings/payment plans for agricultural inputs do you currently have?')
key=moz_AG_INPUTS_PLANSnamespace(country='moz', df=          AG_INPUTS_PLANS
21948170                2
21948180                0
21948779                2
21951744                0
21955646                1
...                   ...
23315711                0
23324438                2
23324440                0
23324645                0
23324649                1

[2574 rows x 1 columns], df_name='moz_sr', label='AG_INPUTS_PLANS', qtype='single', surve

key=tan_COUNTRYnamespace(country='tan', df=         COUNTRY
28986883     tan
29244866     tan
29198985     tan
29555590     tan
29010663     tan
...          ...
28753779     tan
28986832     tan
29137010     tan
29348455     tan
29444692     tan

[2993 rows x 1 columns], df_name='tan_hh', label='COUNTRY', qtype='single', survey='hh', text='The country in which the survey was conducted')
key=uga_COUNTRYnamespace(country='uga', df=         COUNTRY
22678091     uga
22678446     uga
22678802     uga
22678927     uga
22679052     uga
...          ...
23500842     uga
23620542     uga
23620647     uga
23620678     uga
23620722     uga

[2870 rows x 1 columns], df_name='uga_hh', label='COUNTRY', qtype='single', survey='hh', text='The country in which the survey was conducted')
key=bgd_URnamespace(country='bgd', df=        UR
1        2
2        2
3        2
4        1
5        1
...     ..
800129   2
800130   2
800131   2
800132   2
800133   2

[3154 rows x 1 columns], df_name='bgd_hh', labe

key=nga_HH11namespace(country='nga', df=          HH11
40177914     1
40229734     2
40239588     2
40244738     2
40252832     2
...        ...
41076184     3
41076185     3
41076186     2
41076187     1
41076188     4

[3026 rows x 1 columns], df_name='nga_hh', label='HH11', qtype='single', survey='hh', text='The number of people in the household who are eligible to be interviewed')
key=tan_HH11namespace(country='tan', df=          HH11
28986883     5
29244866     2
29198985     2
29555590     1
29010663     2
...        ...
28753779     1
28986832     2
29137010     2
29348455     1
29444692     1

[2993 rows x 1 columns], df_name='tan_hh', label='HH11', qtype='single', survey='hh', text='The number of people in the household who are eligible to be interviewed')
key=uga_HH11namespace(country='uga', df=          HH11
22678091     2
22678446     2
22678802     3
22678927     2
22679052     1
...        ...
23500842     2
23620542     8
23620647     1
23620678     2
23620722     3

[28

key=tan_D19namespace(country='tan', df=               D19
28986883  100000.0
29244866   50000.0
29198985   30000.0
29555590  300000.0
29010663  100000.0
...            ...
28753779  150000.0
28986832   25000.0
29137010  100000.0
29348455  100000.0
29444692   30000.0

[2993 rows x 1 columns], df_name='tan_hh', label='D19', qtype='single', survey='hh', text='What is the minimum amount your household needs to survive per month (for personal expenses)?')
key=uga_D19namespace(country='uga', df=               D19
22678091  400000.0
22678446  200000.0
22678802  400000.0
22678927  200000.0
22679052   40000.0
...            ...
23500842  150000.0
23620542  600000.0
23620647  500000.0
23620678  280000.0
23620722  600000.0

[2870 rows x 1 columns], df_name='uga_hh', label='D19', qtype='single', survey='hh', text='What is the minimum amount your household needs to survive per month (for personal expenses)?')
key=bgd_D19_Lnamespace(country='bgd', df=            D19_L
1        9.615805
2       10.30

[3019 rows x 1 columns], df_name='cdi_hh', label='D21_L', qtype='single', survey='hh', text='Log transform of D21 (which is highly skewed)')
key=moz_D21_Lnamespace(country='moz', df=             D21_L
22251317       NaN
23135007  6.802395
22289638  7.600902
22103383  8.987197
22208086       NaN
...            ...
22227757  9.615805
22551138  8.699515
23158719       NaN
22007852       NaN
22093406  8.699515

[2574 rows x 1 columns], df_name='moz_hh', label='D21_L', qtype='single', survey='hh', text='Log transform of D21 (which is highly skewed)')
key=nga_D21_Lnamespace(country='nga', df=              D21_L
40177914   9.903488
40229734  10.308953
40239588   9.903488
40244738        NaN
40252832  10.308953
...             ...
41076184        NaN
41076185        NaN
41076186        NaN
41076187        NaN
41076188        NaN

[3026 rows x 1 columns], df_name='nga_hh', label='D21_L', qtype='single', survey='hh', text='Log transform of D21 (which is highly skewed)')
key=tan_D21_Lnamespace(co

key=bgd_D23namespace(answers={'not_important': 1, 'somewhat_important': 2, 'very_important': 3}, column_dict={'bank_account': 1, 'mobile_phone': 2, 'mobile_money': 3, 'insurance': 4, 'savings': 5, 'loan': 6, 'credit': 7}, country='bgd', df=        bank_account  mobile_phone  mobile_money  insurance  savings  loan  \
1                2.0           3.0           3.0        3.0      3.0   1.0   
2                3.0           3.0           2.0        1.0      3.0   3.0   
3                3.0           3.0           3.0        2.0      2.0   1.0   
4                2.0           3.0           3.0        2.0      3.0   2.0   
5                2.0           3.0           2.0        2.0      1.0   2.0   
...              ...           ...           ...        ...      ...   ...   
800129           3.0           3.0           1.0        3.0      2.0   3.0   
800130           2.0           3.0           3.0        3.0      3.0   3.0   
800131           2.0           2.0           2.0        2.

[3154 rows x 1 columns], df_name='bgd_hh', label='HOUSING0', qtype='single', survey='hh', text='A derived, min-max scaled variable the quality of house construction materials')
key=cdi_HOUSING0namespace(country='cdi', df=          HOUSING0
31028570       1.0
31028571       0.5
31036603       0.5
31038817       0.0
31040474       0.5
...            ...
32261521       0.0
32261522       0.5
32261524       0.0
32280062       0.0
32280063       0.5

[3019 rows x 1 columns], df_name='cdi_hh', label='HOUSING0', qtype='single', survey='hh', text='A derived, min-max scaled variable the quality of house construction materials')
key=moz_HOUSING0namespace(country='moz', df=          HOUSING0
22251317       0.0
23135007       0.0
22289638       0.0
22103383       0.0
22208086       0.0
...            ...
22227757       1.0
22551138       0.5
23158719       1.0
22007852       1.0
22093406       1.0

[2574 rows x 1 columns], df_name='moz_hh', label='HOUSING0', qtype='single', survey='hh', text='A de

key=bgd_POSSESS1namespace(country='bgd', df=        POSSESS1
1       0.333333
2       0.666667
3       0.666667
4       0.333333
5       0.666667
...          ...
800129  0.333333
800130  0.333333
800131  0.500000
800132  0.333333
800133  0.666667

[3154 rows x 1 columns], df_name='bgd_hh', label='POSSESS1', qtype='single', survey='hh', text='A version of POSSESS that includes bicycles, scooters and cars')
key=cdi_POSSESS1namespace(country='cdi', df=          POSSESS1
31028570      0.00
31028571      0.25
31036603      0.50
31038817      0.00
31040474      0.75
...            ...
32261521      0.00
32261522      0.00
32261524      0.00
32280062      0.50
32280063      0.50

[3019 rows x 1 columns], df_name='cdi_hh', label='POSSESS1', qtype='single', survey='hh', text='A version of POSSESS that includes bicycles, scooters and cars')
key=moz_POSSESS1namespace(country='moz', df=          POSSESS1
22251317       0.0
23135007       0.0
22289638       0.0
22103383       0.0
22208086       0.

key=uga_D8_MEANnamespace(country='uga', df=            D8_MEAN
22678091  10.250000
22678446   7.000000
22678802   7.285714
22678927   7.000000
22679052   3.000000
...             ...
23500842   4.200000
23620542   6.916667
23620647   6.000000
23620678   4.000000
23620722   4.000000

[2870 rows x 1 columns], df_name='uga_hh', label='D8_MEAN', note='Education levels have already been rescaled to make them comparable across countries', qtype='single', survey='hh', text='The average education level in the household')
key=bgd_NUM_KIDSnamespace(country='bgd', df=        NUM_KIDS
1              0
2              2
3              0
4              0
5              3
...          ...
800129         0
800130         1
800131         2
800132         3
800133         1

[3154 rows x 1 columns], df_name='bgd_hh', label='NUM_KIDS', qtype='single', survey='hh', text='The number of household members with age <= 15')
key=cdi_NUM_KIDSnamespace(country='cdi', df=          NUM_KIDS
31028570         1
31028

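The syntax of data object identifiers is simply `country+'_'+label`. If you ask for, say, `Data.uga_A1`, you get all the metadata for question `A1` and also the data -- the answers in the Uganda survey to question `A1`. The cell below leaves that call commented out and instead prints the number of rows of `A2` data for each country: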

In [21]:
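# Data.uga_A1   # uncomment to display the full data object for Uganda's A1
# Print the number of rows of A2 data for each country
for x in countries:
    print(len(Data.col(x, 'A2')))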

3136
3002
2462
2858
2812
2859


Question A1 asks "What is the form of ownership of your land?".  It is a single-answer question (`qtype = 'single'`), which means it expects a single answer, coded as an integer. For Uganda the codes run from 1 to 5. The mapping from answers to integers is described in the `answers` dict.

Look at question A1 in any User Guide and you'll see that the text strings in the `answers` dict are not identical to the actual answers in the User Guide; they are short mnemonics.

The other metadata fields are 

- `label`: the unique identifier of a question
- `country` : an identifier of the country from which the data is sourced
- `survey` : the intermediate data product the question belongs to, for example `hh` or `sr` (see data map, above)
- `df` : a pandas dataframe that holds the data itself
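
As a minimal sketch of how identifiers and metadata fit together, the snippet below builds a toy stand-in for `Data` with `SimpleNamespace`. The real object is decoded from the JSON file and also carries a `df` field, which is omitted here. The helper `object_key` and the name `toy_data` are hypothetical, introduced only for illustration; the metadata values are copied from the Uganda `A1` object.

```python
from types import SimpleNamespace


def object_key(country, label):
    """Hypothetical helper: build a data object identifier such as 'uga_A1'."""
    return f"{country}_{label}"


# Toy stand-in for Data (assumption: the real object also has a df field).
toy_data = SimpleNamespace(
    uga_A1=SimpleNamespace(
        label="A1",
        country="uga",
        qtype="single",
        text="What is the form of ownership of your land?",
        answers={"lease_certificate": 1, "customary_law": 2, "communal": 3,
                 "state_ownership": 4, "other": 5},
    )
)

key = object_key("uga", "A1")
question = getattr(toy_data, key)  # the same object as toy_data.uga_A1

# vars() exposes the fields of a SimpleNamespace as a plain dict
for field, value in vars(question).items():
    print(f"{field}: {value}")
```

Using `getattr` with a constructed key is handy when you loop over countries or labels, since dot notation only works with names written out literally.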

We'll look at the `df` dataframes after I introduce single- and multiple-answer questions. 


### CGAP_Decoded API ###

`CGAP_Decoded` objects have several methods for viewing data and metadata and for assembling dataframes from data objects.  For a `CGAP_Decoded` object called `Data`, these methods are:

- `decode(jstring)` : decodes a JSON string representation of data object(s)
- `read_and_decode(file)` : reads JSON from file and decodes it
- `by_name(name)` : convenience function to get a data object by its name; equivalent to `Data.__dict__.get(name)`
- `add_col(country, label, df, qtype = 'single', text = None, answers = None)` : adds a column to Data. This is used to add derived variables (e.g., a log transform of a variable, or cluster labels, etc.) See section on `add_col`, below. 
- `describe(label, country = 'bgd', display = True)` : prints a short text description of a variable given its label.  If `display=False`, it returns the description as a string instead of printing it.
- `col(country,label,*column)`: the workhorse method for assembling pandas dataframes from data objects. Given a country and a label, this returns a dataframe. The optional `*column` argument is for multi-answer questions; see below.
- `col_from_countries(var, countries)` : This assembles a dataframe for one variable from all the specified countries. var is either a label, alone, if the label denotes a single-answer question, or a (label, column_name) tuple if the label denotes a multi-answer question. For example, `col_from_countries(('A5','Rice'),['bgd','nga'])` assembles a dataframe of the `Rice` answer to question `A5` for Bangladesh and Nigeria.
- `cols_from_countries(*vars, countries)` : assembles a dataframe for all the specified variables from all the specified countries.


Most of these methods are described in some detail later in this Introduction.
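
To make the idea behind `col_from_countries` concrete, here is a rough sketch of stacking one single-answer variable across countries with `pd.concat`. This is not the actual implementation, whose internals are not shown in this notebook. `toy_data` and `toy_col_from_countries` are hypothetical names, and adding a `country` column is my own assumption about what is convenient. The four data values are the first two rows of `D19` for Tanzania and Uganda from the printed output above.

```python
import pandas as pd
from types import SimpleNamespace

# Toy stand-in for Data with two rows of D19 per country.
toy_data = SimpleNamespace(
    tan_D19=SimpleNamespace(
        country="tan", label="D19",
        df=pd.DataFrame({"D19": [100000.0, 50000.0]},
                        index=[28986883, 29244866])),
    uga_D19=SimpleNamespace(
        country="uga", label="D19",
        df=pd.DataFrame({"D19": [400000.0, 200000.0]},
                        index=[22678091, 22678446])),
)


def toy_col_from_countries(data, label, countries):
    """Hypothetical sketch: stack one single-answer variable across countries."""
    frames = []
    for country in countries:
        obj = getattr(data, f"{country}_{label}")
        frame = obj.df[[label]].copy()
        frame["country"] = country  # assumption: record each row's origin
        frames.append(frame)
    return pd.concat(frames)


stacked = toy_col_from_countries(toy_data, "D19", ["tan", "uga"])
print(stacked)
```

The household IDs stay in the index, so the stacked frame can still be joined to other variables by ID.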

### Countries don't always have the same answers to questions ###

One tricky aspect of the CGAP data is that a given question won't always have the same answers in different countries.  Here's a description of `A1` for Uganda:

In [6]:
Data.describe('A1','uga')

label : A1
 text : What is the form of ownership of your land?
 qtype : single
 survey : rr
 answers : {'lease_certificate': 1, 'customary_law': 2, 'communal': 3, 'state_ownership': 4, 'other': 5}



Bangladesh has an additional answer -- Kott -- which it codes as 5:

In [7]:
Data.describe('A1','bgd')

label : A1
 text : What is the form of ownership of your land?
 qtype : single
 survey : rr
 answers : {'lease_certificate': 1, 'customary_law': 2, 'communal': 3, 'state_ownership': 4, 'Kott': 5, 'other': 6}



So 5 means 'other' in Uganda's survey and 'Kott' in Bangladesh's.   Sometimes this variability in coding schemes is so extreme that it's worth recoding the individual country data to an inter-country scheme (crops and livestock are a good example, see below). In most cases, though, as with question `A1`, I have not tried to recode all answers to an inter-country scheme, so you should pay attention to the `answers` dict and its mapping of answers to numeric codes.
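
One way to guard against this pitfall is to always translate a numeric code back to its mnemonic through the `answers` dict of the country it came from. The sketch below does this with plain dicts copied from the two `A1` descriptions above; `code_to_mnemonic` is a hypothetical helper, not part of the `CGAP_Decoded` API.

```python
# answers dicts copied from the descriptions of A1 for Uganda and Bangladesh
answers = {
    "uga": {"lease_certificate": 1, "customary_law": 2, "communal": 3,
            "state_ownership": 4, "other": 5},
    "bgd": {"lease_certificate": 1, "customary_law": 2, "communal": 3,
            "state_ownership": 4, "Kott": 5, "other": 6},
}


def code_to_mnemonic(country, code):
    """Hypothetical helper: map a numeric code to its mnemonic for one country."""
    inverse = {value: name for name, value in answers[country].items()}
    return inverse.get(code)  # None if the country does not use this code


print(code_to_mnemonic("uga", 5))  # other
print(code_to_mnemonic("bgd", 5))  # Kott
```

Comparing mnemonics rather than raw integers keeps a cross-country analysis from silently mixing 'other' with 'Kott'.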

Sometimes, answers aren't recoded but simply recorded "as is", in which case there won't be an `answers` dict.  For example, question `A2` asks how much land is owned:

In [6]:
Data.describe('A2','bgd')

label : A2
 text : How many hectares of agricultural land do you own?
 qtype : single
 survey : rr
 answers : None



## Single- and multiple-answer questions ##

Attributes such as farm size or family size have just one answer per household, but attributes such as the crops grown by a household can have several values. There's no easy way to code multiple answers to one question in a single variable, so CGAP uses multiple variables to code the answers to multi-answer questions. We've seen examples of single-answer questions above; now let's look at a multi-answer question, `H17`, which asks how important it is to save money for various purposes:

In [7]:
Data.moz_H17

namespace(qtype='multi',
          text=' In your opinion, how important is it for your household to save for each of the following?',
          label='H17',
          answers={'very_important': 1,
                   'somewhat_important': 2,
                   'not_important': 3},
          column_dict={'future_purchases': 1,
                       'unexpected_event': 2,
                       'regular_purchases': 3,
                       'school_fees': 4},
          country='moz',
          survey='sr',
          df_name='moz_sr',
          df=          future_purchases  unexpected_event  regular_purchases  school_fees
             21948170               1.0               1.0                1.0          1.0
             21948180               2.0               1.0                2.0          1.0
             21948779               1.0               1.0                1.0          1.0
             21951744               1.0               1.0                2.0          1.0
           

The Data object for `H17` for Mozambique has a field called `column_dict` that you won't see in single-answer Data objects, and the `df` field contains a multi-column pandas dataframe. Each of these columns contains numbers that encode the answers described in the `answers` dict; that is, 1, 2 or 3 depending on whether the answer is "very_important", "somewhat_important" or "not_important", respectively.

Look at the first row of the dataframe:  It tells us that the survey respondent from the household with ID 21948170 thinks it is very important to save for all four kinds of expenses, whereas the second household thinks it is only somewhat important to save for future purchases and regular purchases.
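
To read such a dataframe in terms of mnemonics rather than numeric codes, one option is to invert the `answers` dict and map it over every column. The sketch below rebuilds the first two rows of the Mozambique `H17` dataframe by hand so that it runs without the CGAP files; `code_to_answer` and `decoded` are illustrative names, not part of the `CGAP_Decoded` API.

```python
import pandas as pd

answers = {"very_important": 1, "somewhat_important": 2, "not_important": 3}

# First two rows of the Mozambique H17 dataframe shown above
df = pd.DataFrame(
    {"future_purchases": [1.0, 2.0],
     "unexpected_event": [1.0, 1.0],
     "regular_purchases": [1.0, 2.0],
     "school_fees": [1.0, 1.0]},
    index=[21948170, 21948180],
)

# Invert the dict; keys are floats because the columns are stored as float64
code_to_answer = {float(code): name for name, code in answers.items()}

# Map every column; codes missing from the dict (including NaN) become NaN
decoded = df.apply(lambda column: column.map(code_to_answer))
print(decoded.loc[21948180])
```

The decoded row for household 21948180 reads "somewhat_important" for future and regular purchases and "very_important" for the other two columns, matching the description above.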

The dataframe for a multi-answer question has one column per entry in `column_dict`, and the names of the columns are the mnemonic strings that serve as its keys.  Thus, you can easily find out which farmers in Mozambique grow tomatoes:

In [8]:
Data.moz_A5.df.Tomatoes

22552580    1.0
22487045    2.0
22159366    2.0
22790149    1.0
22790150    2.0
           ... 
22757331    2.0
22552539    2.0
22200293    2.0
22102008    2.0
22167545    2.0
Name: Tomatoes, Length: 2462, dtype: float64

Let me break down this query: 

- `Data` is a namespace 
- `moz_A5` is a key into `Data.__dict__` that returns an object that represents question `A5` for Mozambique
- `df` is the field of the object that contains a pandas dataframe
- `Tomatoes` is a column in this dataframe

The df columns for multi-answer questions are always mnemonics for answers.  This is for two reasons:  It's easier for the user to ask for `df.Tomatoes` than, say, `df.A5_27`. More importantly, while `df.A5_27` contains data about tomato-growing in Mozambique, it contains data about growing sesame in Bangladesh and sugarcane in Tanzania.  These differences in coding have all been resolved behind the scenes.
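To make the idea concrete, here is a rough sketch of how one raw column name can be mapped to different mnemonics in different countries. The lookup table is hypothetical: only the three `A5_27` meanings come from the paragraph above, and the function `harmonize` is an illustration, not part of the CGAP code.

```python
import pandas as pd

# HYPOTHETICAL per-country lookup tables: only the A5_27 entries are taken
# from the text; the real tables live in the CGAP cleaning code.
raw_to_mnemonic = {
    'moz': {'A5_27': 'Tomatoes'},
    'bgd': {'A5_27': 'Sesame'},
    'tan': {'A5_27': 'Sugarcane'},
}

def harmonize(raw_df, country):
    """Rename raw survey columns to crop mnemonics for one country."""
    return raw_df.rename(columns=raw_to_mnemonic[country])

# Toy raw data: the same raw column name means a different crop per country
raw = pd.DataFrame({'A5_27': [1.0, 2.0]}, index=[101, 102])
print(harmonize(raw, 'moz').columns.tolist())
print(harmonize(raw, 'bgd').columns.tolist())
print(harmonize(raw, 'tan').columns.tolist())
```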

You can get the same data using the `col` method, described below.  

In [8]:
Data.col('moz','A5','Tomatoes')

22552580    1.0
22487045    2.0
22159366    2.0
22790149    1.0
22790150    2.0
           ... 
22757331    2.0
22552539    2.0
22200293    2.0
22102008    2.0
22167545    2.0
Name: Tomatoes, Length: 2462, dtype: float64

## Household IDs and DataFrame Joins ##

You'll notice that the dataframes associated with Data objects have household ids as their indices.  For example, the record for a particular household in a Mozambique Data object can be obtained in the usual pandas way:

In [9]:
Data.moz_A5.df.loc[22552580]

Maize           2.0
Beans           2.0
Sweet_potato    2.0
Sorghum         2.0
Rice            2.0
Groundnuts      1.0
Cowpea          2.0
Millet          2.0
Cassava         1.0
Potato          2.0
Pigeon_pea      2.0
Banana          2.0
Coconut         2.0
Cotton          2.0
Sesame          2.0
Mango           2.0
Cashew          2.0
Sugarcane       2.0
Tobacco         2.0
Tea             2.0
Avocado         2.0
Cocoa           2.0
Sisal           2.0
Cloves          2.0
Coffee          2.0
Sunflower       2.0
Tomatoes        1.0
Onions          2.0
Other_1         2.0
Other_2         2.0
Other_3         2.0
No_crop         1.0
Name: 22552580, dtype: float64

These are the answers to question `A5` for the household with `HHID` 22552580.  Similarly, here is the answer to the earlier question about farm size:

In [11]:
Data.moz_A2.df.loc[22552580]

A2    5.0
Name: 22552580, dtype: float64

Because Data object `df` indices are unique household IDs, it is straightforward to join `df`s:

In [12]:
x = Data.moz_A2.df.join(Data.moz_A5.df)
x

Unnamed: 0,A2,Maize,Beans,Sweet_potato,Sorghum,Rice,Groundnuts,Cowpea,Millet,Cassava,...,Sisal,Cloves,Coffee,Sunflower,Tomatoes,Onions,Other_1,Other_2,Other_3,No_crop
22552580,5.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,1.0,...,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0
22487045,2.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0
22159366,0.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0
22790149,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,1.0,...,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0
22790150,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22757331,2.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,1.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0
22552539,2.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,1.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0
22200293,1.0,1.0,2.0,1.0,2.0,2.0,2.0,1.0,2.0,1.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0
22102008,2.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,2.0,1.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0


Here, the first column, `A2`, holds the answers to question `A2`, and the rest of the dataframe holds the answers to question `A5` about crops.
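One thing to keep in mind is that `DataFrame.join` is a left join by default, so a household that appears in the first dataframe but not the second gets NaNs rather than being dropped. A small sketch on toy frames (the third household ID and its farm size are made up):

```python
import pandas as pd

# Toy frames indexed by household ID. The first two rows copy values from
# the output above; household 99999999 is invented for illustration.
a2 = pd.DataFrame({'A2': [5.0, 2.0, 1.5]},
                  index=[22552580, 22487045, 99999999])
a5 = pd.DataFrame({'Tomatoes': [1.0, 2.0]},
                  index=[22552580, 22487045])

left = a2.join(a5)                # default how='left': keeps every row of a2
inner = a2.join(a5, how='inner')  # keeps only households present in both

print(left)
print(inner)
```

Passing `how='inner'` is one way to keep only households that answered both questions.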

## Country-specific Decoded objects ##

If you want to work with data from a specific country, you can make the notation even simpler by using the `Country_Decoded` subclass of `CGAP_Decoded` that's specific to a country:

In [13]:
bgd = Country_Decoded('bgd',Data)
cdi = Country_Decoded('cdi',Data)
moz = Country_Decoded('moz',Data)
nga = Country_Decoded('nga',Data)
tan = Country_Decoded('tan',Data)
uga = Country_Decoded('uga',Data)

Now you don't have to specify the country as a string:

In [14]:
print(Data.moz_A5.df.Rice.value_counts())

print(moz.A5.df.Rice.value_counts())

print(moz.col('A5','Rice').value_counts())

pd.concat([moz.col('A5','Rice'), moz.col('H28')],axis=1)

2.0    1712
1.0     619
Name: Rice, dtype: int64
2.0    1712
1.0     619
Name: Rice, dtype: int64
2.0    1712
1.0     619
Name: Rice, dtype: int64


Unnamed: 0,Rice,H28
21948170,2.0,1.0
21948180,2.0,3.0
21948779,2.0,2.0
21951744,2.0,2.0
21955646,2.0,2.0
...,...,...
23315711,2.0,
23324438,2.0,2.0
23324440,1.0,
23324645,,


## Manipulating data ##

This section introduces three methods of `CGAP_Decoded` objects.  The workhorse is `col`, which gets a column of data given a specification, if possible.  `col` takes a country as an argument, but sometimes you will want to get a column of data for one variable for some or all countries; for this, use `col_from_countries`.  To get columns of data for several variables over some or all countries, use `cols_from_countries`.  And be sure to read the warning at the end of this section!

### col ###

`col` is a `CGAP_Decoded` method that returns a pandas Series. It is there only for convenience, as it's easy to forget how to access columns, particularly those for single-answer questions. 

`Data.col('moz','A5','Rice')` is equivalent to `Data.moz_A5.df.Rice` 

`Data.col('moz','H28')` is equivalent to `Data.moz_H28.df.H28`

In [15]:
# single-answer question:
print (Data.col('moz','H28'))
print()

21948170    1.0
21948180    3.0
21948779    2.0
21951744    2.0
21955646    2.0
           ... 
23315711    NaN
23324438    2.0
23324440    NaN
23324645    NaN
23324649    3.0
Name: H28, Length: 2574, dtype: float64



In [16]:
# multi-answer question:
print (Data.col('moz','A5','Rice'))
print()

22552580    2.0
22487045    2.0
22159366    2.0
22790149    2.0
22790150    2.0
           ... 
22757331    2.0
22552539    2.0
22200293    2.0
22102008    2.0
22167545    2.0
Name: Rice, Length: 2462, dtype: float64



`col` will provide diagnostic messages and will return `None` when it can't find what you ask for:

In [17]:
# multi-answer question but the user doesn't know it
print(Data.col('moz','A5'))


moz_A5 is a multi-answer question; please specify an answer column.

Options are ['Maize', 'Beans', 'Sweet_potato', 'Sorghum', 'Rice', 'Groundnuts', 'Cowpea', 'Millet', 'Cassava', 'Potato', 'Pigeon_pea', 'Banana', 'Coconut', 'Cotton', 'Sesame', 'Mango', 'Cashew', 'Sugarcane', 'Tobacco', 'Tea', 'Avocado', 'Cocoa', 'Sisal', 'Cloves', 'Coffee', 'Sunflower', 'Tomatoes', 'Onions', 'Other_1', 'Other_2', 'Other_3', 'No_crop']
None


In [18]:
print(Data.col('moz','A6','Wheat'))

A6 is a single-answer question; ignoring Wheat
22552580     8.0
22487045     3.0
22159366    55.0
22790149     3.0
22790150     3.0
            ... 
22757331     7.0
22552539     8.0
22200293     3.0
22102008     8.0
22167545     3.0
Name: A6, Length: 2462, dtype: float64


In [19]:
# moz doesn't list wheat as an answer to question A5
Data.col('moz','A5','Wheat') is None

moz_A5.Wheat does not exist


True

In [20]:
Data.col('moz','FooBarHaHaHa!') is None

moz_FooBarHaHaHa! does not exist


True
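The behaviour shown in the cells above can be summarized in a small stand-in function. This is only a sketch of the lookup logic as it appears from the outputs, written against a toy namespace; `col_sketch` is a hypothetical name, and the real `col` method may be implemented quite differently.

```python
from types import SimpleNamespace
import pandas as pd

def col_sketch(data, country, label, answer=None):
    """Illustrative stand-in for `col`; not the real implementation."""
    key = f"{country}_{label}"
    question = getattr(data, key, None)
    if question is None:
        print(f"{key} does not exist")
        return None
    df = question.df
    if question.qtype == 'single':
        if answer is not None:
            print(f"{label} is a single-answer question; ignoring {answer}")
        return df[label]
    if answer is None:
        print(f"{key} is a multi-answer question; please specify an answer column.")
        return None
    if answer not in df.columns:
        print(f"{key}.{answer} does not exist")
        return None
    return df[answer]

# Toy namespace with one single-answer and one multi-answer question
toy = SimpleNamespace(
    moz_H28=SimpleNamespace(
        qtype='single',
        df=pd.DataFrame({'H28': [1.0, 3.0]}, index=[21948170, 21948180])),
    moz_A5=SimpleNamespace(
        qtype='multi',
        df=pd.DataFrame({'Rice': [2.0, 2.0]}, index=[22552580, 22487045])),
)

print(col_sketch(toy, 'moz', 'H28'))
print(col_sketch(toy, 'moz', 'A5', 'Wheat') is None)
```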

### col_from_countries ###

If you need to get the value of a variable for several or all countries, use `col_from_countries`.  Here are the values of variable `A6` for Bangladesh and Cote d'Ivoire:

In [21]:
Data.col_from_countries('A6',countries=['bgd','cdi'])

Unnamed: 0,A6
1,1.0
2,1.0
3,
4,1.0
5,1.0
...,...
31186932,41.0
31186933,41.0
31186934,41.0
31285239,43.0


That's pretty straightforward, but `A6` is a single-answer question.  What if you want the values over countries for a multi-answer question such as `A5`, which asks which crops are grown? Use a tuple of the question label and the name of the column that holds a particular answer:

In [22]:
Data.col_from_countries(('A5','Rice'),countries=['bgd','cdi'])

Unnamed: 0,Rice
1,1.0
2,1.0
3,
4,1.0
5,1.0
...,...
31186932,1.0
31186933,1.0
31186934,1.0
31285239,1.0


### cols_from_countries ###

That's fine for single variables, but what if you want a dataframe of one or more variables over several countries? The method `cols_from_countries` builds a dataframe from several variables that can represent single-answer or multiple-answer questions. For example, here are four variables for Tanzania and Uganda: number of hectares (A2), whether rice is grown (A5,Rice), whether maize is grown (A5,Maize) and country (COUNTRY):

In [23]:
Data.cols_from_countries('A2',('A5','Rice'),('A5','Maize'),'COUNTRY', countries = ['tan','uga'])
    

Unnamed: 0,A2,Rice,Maize,COUNTRY
22678091,,2.0,1.0,uga
22678446,0.809717,2.0,1.0,uga
22678802,0.404858,2.0,2.0,uga
22678927,,,,uga
22679052,3.000000,2.0,1.0,uga
...,...,...,...,...
29633312,1.214575,2.0,2.0,tan
29649585,0.607287,2.0,2.0,tan
29649586,0.607287,2.0,2.0,tan
29680313,1.214575,1.0,1.0,tan


### A warning about col_from_countries and cols_from_countries ###

You might recall that `col` tries to give you what you want, but when it can't, it prints a warning and returns `None`.  `col_from_countries` and `cols_from_countries` don't fail when they get `None` from `col`. Instead, they just don't include the countries you think you're getting:

In [24]:
wheat = Data.col_from_countries(('A5','Wheat'), countries = ['bgd','cdi','moz','nga','tan','moz'])

print(f"The dataframe includes {len(wheat)} records")

cdi_A5.Wheat does not exist
moz_A5.Wheat does not exist
tan_A5.Wheat does not exist
moz_A5.Wheat does not exist
The dataframe includes 5994 records


Instead of records from all the countries, you're getting only the 5994 records from the countries whose `A5` question offers wheat as an answer (here, bgd and nga). You can think of this as a bug or a feature! A more troubling case arises when you ask for more than one variable:

In [25]:
wheat_and_rice = Data.cols_from_countries(('A5','Wheat'),('A5','Rice'), countries = ['bgd','cdi','moz','nga','tan','moz'])

print(f"The dataframe includes {len(wheat_and_rice)} records")
print(f"The Rice column includes {np.sum(np.isnan(wheat_and_rice['Rice']))} NaNs")
print(f"The Wheat column includes {np.sum(np.isnan(wheat_and_rice['Wheat']))} NaNs")

cdi_A5.Wheat does not exist
moz_A5.Wheat does not exist
tan_A5.Wheat does not exist
moz_A5.Wheat does not exist
The dataframe includes 16732 records
The Rice column includes 1026 NaNs
The Wheat column includes 11306 NaNs


What's happened here is that wheat isn't an `A5` answer in cdi, moz or tan (the message for moz appears twice because `'moz'` is listed twice in `countries`), so the records for households in those countries have NaNs for wheat.  It's the right thing to do, but be aware that it's happening!
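If silently skipped countries are a concern, one option is to collect the per-country columns yourself and report which ones came back as `None`. The helper below is a hypothetical sketch, not part of `CGAP_Decoded`; it is demonstrated on toy Series with made-up values.

```python
import pandas as pd

def stack_with_report(columns_by_country):
    """Concatenate per-country Series and list countries that returned None."""
    skipped = [c for c, s in columns_by_country.items() if s is None]
    kept = [s for s in columns_by_country.values() if s is not None]
    stacked = pd.concat(kept) if kept else pd.Series(dtype=float)
    return stacked, skipped

# Toy stand-ins for Data.col(country, 'A5', 'Wheat'); values are made up
toy_columns = {
    'bgd': pd.Series([1.0, 2.0], index=[1, 2], name='Wheat'),
    'cdi': None,  # col printed a message and returned None
    'nga': pd.Series([2.0], index=[31186932], name='Wheat'),
}

wheat, skipped = stack_with_report(toy_columns)
print(f"{len(wheat)} records; skipped countries: {skipped}")
```

With the real object, the dict could presumably be built as `{c: Data.col(c, 'A5', 'Wheat') for c in countries}`.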

## Other data manipulation ##

### Adding new data objects, temporarily ###

You can add new data objects to a CGAP_Decoded object with the `add_col` method.  Here's an example of adding a new variable derived from monthly income (`D21_LZ`) and monthly outgoings (`D19_LZ`):

In [26]:
for country in countries :
    diff = Data.col(country,'D21_LZ') - Data.col(country,'D19_LZ')
    Data.add_col (country = country, 
                  label = 'INCOME_DIFF', 
                  df = diff,
                  text = "Monthly income minus monthly outgoing",
                  qtype = 'single'
                 )

`text` and `answers` default to `None` and `qtype` defaults to `'single'`, so you can get away with specifying just the positional arguments `country`, `label` and `df`, as in:

In [27]:
# make a column of random numbers that's the same length as moz
randoms = np.random.random(len(Data.col('moz','H28')))

# positional arguments are country, label and df
Data.add_col ('moz', 'random_numbers', randoms)

print(Data.__dict__.get('moz_random_numbers'))

namespace(answer=None, country='moz', df=      random_numbers
0           0.077226
1           0.482234
2           0.584238
3           0.827122
4           0.573344
...              ...
2569        0.477449
2570        0.995842
2571        0.901037
2572        0.854165
2573        0.060126

[2574 rows x 1 columns], label='random_numbers', qtype='single', text=None)


Note that `add_col` is permissive about what kind of object is passed as `df`:  Anything that can be turned into a pandas dataframe, such as the numpy array `randoms` in the previous example, will work.
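Notice in the output above that the new dataframe is indexed 0 to 2573 rather than by household ID, because a bare numpy array carries no index. The sketch below uses toy data to show why that matters for later joins. It is an assumption, not verified here, that `add_col` would preserve the index of a Series passed as `df`.

```python
import numpy as np
import pandas as pd

# Toy household-indexed column standing in for Data.col('moz', 'H28')
h28 = pd.Series([1.0, 3.0, 2.0],
                index=[21948170, 21948180, 21948779], name='H28')

rng = np.random.default_rng(0)
randoms = rng.random(len(h28))

# A bare array gets a fresh 0..n-1 index when turned into a dataframe
bare = pd.DataFrame({'random_numbers': randoms})

# Wrapping it in a Series that reuses the household IDs keeps it joinable
aligned = pd.Series(randoms, index=h28.index, name='random_numbers')

misaligned_table = pd.concat([h28, bare['random_numbers']], axis=1)
aligned_table = pd.concat([h28, aligned], axis=1)
print(misaligned_table)  # six rows, none of them complete
print(aligned_table)     # three complete rows
```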

#### NOTE #### 

`add_col` adds a column temporarily to a CGAP_Decoded object such as `Data`.  At present there's no way to write out the addition permanently.  That's because I wrote the encoder in a way that's too specific to CGAP data.  It's a priority to fix this so that people can save derived variables for CGAP and Manobi data objects.

### Other odds and ends (unfinished) ###

Series can be concatenated column-wise; pandas aligns them on their household-ID indexes and fills in NaN wherever a household is missing from one of them:

In [28]:
df = pd.concat([
    Data.col('moz','A5','Rice'),
    Data.col('moz','H28'), # "col" version
    Data.moz_H28.df.H28,  # non-"col" version: you have to say H28 twice
        ], axis=1)
df

Unnamed: 0,Rice,H28,H28.1
21948170,2.0,1.0,1.0
21948180,2.0,3.0,3.0
21948779,2.0,2.0,2.0
21951744,2.0,2.0,2.0
21955646,2.0,2.0,2.0
...,...,...,...
23315711,2.0,,
23324438,2.0,2.0,2.0
23324440,1.0,,
23324645,,,


Series can also be concatenated with dataframes:

In [29]:
df = pd.concat([
    Data.col('moz','A5','Rice'),
    Data.moz_A61.df
        ], axis=1)
df

Unnamed: 0,Rice,weather,pests_disease,accident,market_prices,input_prices,contract_broken,downturn_no_sale,equipment_breakdown,health
21948170,2.0,,,,,,,,,
21948180,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
21948779,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
21951744,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
21955646,2.0,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
23315711,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0
23324438,2.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0
23324440,1.0,,,,,,,,,
23324645,,,,,,,,,,


## Some illustrative data analysis (unfinished) ##



First let's look at the joint distribution of rice and groundnuts in Cote d'Ivoire:

In [30]:
x = pd.crosstab(Data.col('cdi','A5','Rice'),Data.col('cdi','A5','Groundnuts'),margins=True)
x

Groundnuts,1.0,2.0,All
Rice,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1.0,562,870,1432
2.0,311,1169,1480
All,873,2039,2912


We can do the same for multiple countries:

In [31]:
df = Data.cols_from_countries(
    ('A5','Rice'),('A5','Maize'),
    countries = ['cdi','moz','nga']
)

pd.crosstab(df.Rice,df.Maize)

Maize,1.0,2.0
Rice,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,2088,612
2.0,3625,1655


This tells us -- among other things -- that growing rice and maize are not independent. For example, if you grow rice in cdi, moz or nga, then the conditional probability of growing maize is 2088/(2088+612) = .77 (so the probability of _not_ growing maize is 612/(612+2088) = .23), whereas if you _don't_ grow rice then the conditional probability of growing maize is 3625/(3625+1655) = .69.
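The conditional probabilities can be read off by dividing each row of the table of counts by its row total. Here is a short sketch using the four counts from the crosstab above, assuming (as the text does) that code 1.0 means the crop is grown and 2.0 means it is not:

```python
import pandas as pd

# Counts copied from the Rice x Maize crosstab above
# (1.0 = grown, 2.0 = not grown)
counts = pd.DataFrame(
    [[2088, 612], [3625, 1655]],
    index=pd.Index([1.0, 2.0], name='Rice'),
    columns=pd.Index([1.0, 2.0], name='Maize'),
)

# Divide each row by its total to get P(Maize | Rice)
conditional = counts.div(counts.sum(axis=1), axis=0)
print(conditional.round(2))
```

On the real data, `pd.crosstab(df.Rice, df.Maize, normalize='index')` produces the same table in one step.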

Now for something a bit more challenging:  How do respondents generate income?  Question H6 is a single-answer question that asks about a respondent's primary job, while H2B asks about sources of income:

In [32]:
print(Data.bgd_H6.text)
print(Data.bgd_H6.answers)
print()
print(Data.bgd_H2B.text)
print(Data.bgd_H2B.answers)


What is your primary job (i.e., the job where you spend most of your time)?
{'farmer': 1, 'professional': 2, 'shop_owner': 3, 'business_owner': 4, 'laborer': 5, 'other': 6}

Which of these has been your main source of income in the last year?
{'regular_job': 1, 'occasional_job': 2, 'retail_business': 3, 'services_business': 4, 'grant_pension': 5, 'family_friends': 6, 'growing_crops': 7, 'rearing_livestock': 8, 'other': 9}


In [33]:
pd.crosstab(Data.col('bgd','H6'),Data.col('bgd','H2B'), normalize='index')

H2B,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
H6,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,0.020743,0.029908,0.036179,0.008201,0.004342,0.036662,0.721659,0.115774,0.026532
2,0.691176,0.044118,0.014706,0.073529,0.0,0.014706,0.117647,0.014706,0.029412
3,0.02381,0.0,0.5,0.190476,0.02381,0.071429,0.071429,0.119048,0.0
4,0.055944,0.017483,0.527972,0.13986,0.0,0.017483,0.097902,0.132867,0.01049
5,0.208211,0.246334,0.017595,0.017595,0.0,0.026393,0.164223,0.067449,0.252199
6,0.297735,0.048544,0.009709,0.045307,0.038835,0.158576,0.087379,0.135922,0.177994


So, in Bangladesh, among farmers (H6 == 1) the main sources of income are growing crops (72.2% of farmers) or rearing livestock (11.6% of farmers).  Laborers (H6 == 5) have the most variable sources of income, with 21% saying they get income from a regular job (H2B == 1), 25% citing occasional work, 16% citing growing crops and 25% citing 'other' as a source of income.

A smaller share of farmers in Mozambique (H6 == 1) get their income primarily from growing crops (only 51% do). Nineteen percent say they get their income primarily from occasional work (H2B == 2).

In [34]:
pd.crosstab(Data.col('moz','H6'),Data.col('moz','H2B'), normalize='index')

H2B,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
H6,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,0.037545,0.186532,0.030989,0.035161,0.04112,0.087008,0.510131,0.051847,0.019666
2,0.726415,0.09434,0.018868,0.028302,0.018868,0.009434,0.103774,0.0,0.0
3,0.176471,0.117647,0.294118,0.058824,0.0,0.117647,0.117647,0.117647,0.0
4,0.083333,0.116667,0.233333,0.305556,0.005556,0.038889,0.172222,0.022222,0.022222
5,0.607477,0.242991,0.009346,0.037383,0.0,0.009346,0.065421,0.028037,0.0
6,0.281437,0.299401,0.011976,0.071856,0.041916,0.065868,0.08982,0.035928,0.101796


The raw frequencies are instructive: Among 1678 farming households (74% of the 2255 households in the table below), 313 say their primary source of income is occasional work and 146 say it is family or friends (H2B == 6).  In fact, 43.8% of the farming households say their primary source of income is neither growing crops (H2B == 7) nor rearing livestock (H2B == 8).  (Exercise for the reader: Contrast Mozambique with Nigeria.)

In [35]:
pd.crosstab(Data.col('moz','H6'),Data.col('moz','H2B'))

H2B,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
H6,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,63,313,52,59,69,146,856,87,33
2,77,10,2,3,2,1,11,0,0
3,3,2,5,1,0,2,2,2,0
4,15,21,42,55,1,7,31,4,4
5,65,26,1,4,0,1,7,3,0
6,47,50,2,12,7,11,15,6,17
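The shares for farming households can be recomputed directly from the first row of this table of raw counts. The sketch below copies that row by hand and assumes that Mozambique uses the same `H2B` answer codes as the Bangladesh `answers` dict printed earlier.

```python
import pandas as pd

# Row H6 == 1 (farmers) of the raw Mozambique crosstab, keyed by H2B code
farmers = pd.Series(
    [63, 313, 52, 59, 69, 146, 856, 87, 33],
    index=[1, 2, 3, 4, 5, 6, 7, 8, 9],
    name='farmers',
)

total = farmers.sum()
shares = farmers / total

# H2B == 7 is growing crops and H2B == 8 is rearing livestock
non_agricultural = 1 - (farmers.loc[7] + farmers.loc[8]) / total

print(f"farming households: {total}")
print(shares.round(3))
print(f"share with a non-agricultural main income: {non_agricultural:.3f}")
```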


### Building a classifier with sklearn and data objects ###

The `cols_from_countries` method of CGAP_Decoded objects makes it easy to assemble dataframes for multiple variables for multiple countries. To illustrate, the following class defines a classifier for one or more countries: 


In [36]:
import operator
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score


class Country_Classifier ():
    def __init__(self,cgap_decoded_obj, countries, X, y, model_class, report = False):
        self.Data = cgap_decoded_obj
        self.X,  self.y = self.Xy_data(countries, X, y, report)
        self.model_class = model_class
        
    def Xy_data (self,countries,X,y, report):
        # make one table of X and y so indices remain aligned 
        # when we drop rows that contain NaNs
        Xy = self.Data.cols_from_countries(*X, y, countries= countries)
        n0 = len(Xy)
        
        # drop rows that contain NaNs
        Xy.dropna(axis=0,inplace=True)
        n1 = len(Xy)
        
        if report : print(f"{countries}:  Removed {n0-n1} rows, loss = {(n0-n1)/n0:.4f}\n")
        
        # reset the index of Xy
        Xy = Xy.reset_index(drop=True)
        
        # return X and y as separate dataframes. If y denotes a column for a 
        # multi-answer then it will be a tuple and the column to drop is the 
        # second value in the tuple
        
        if isinstance(y, (tuple, list)):
            return Xy.drop(y[1],axis=1) , Xy[y[1]]
        else:
            return Xy.drop(y,axis=1) , Xy[y]
    
    def train_model (self):
        self.model = self.model_class.fit(self.X,self.y) 
        
    def score (self, X = None, y = None):
        """ Returns the model score when tested with X and y.  If these are None
        then the default is X = self.X, y = self.y. However, X and y can be new
        data (e.g., from out of sample) """
        X0 = self.X if X is None else X
        y0 = self.y if y is None else y
        return self.model.score(X0,y0)
    
    def cv_score (self, X = None, y = None, k = 5):
        """ Returns the k-fold cross_validated model score when tested with 
        X and y; also returns the scores themselves.  If X and y are None then 
        the default is X = self.X, y = self.y. """
        X0 = self.X if X is None else X
        y0 = self.y if y is None else y
        cv_scores = cross_val_score(self.model, X0, y0, cv=k)
        return np.mean(cv_scores), cv_scores
    
    def majority_class (self, y = None):
        """ Returns the majority class probability for y and the majority class 
        label.  If y is None then the default is y = self.y. """
        y0 = self.y if y is None else y
        vc = dict(y0.value_counts())
        max_key, max_count = max(vc.items(), key = operator.itemgetter(1))
        return max_count/sum(vc.values()), max_key
    
    
        
ICP = Country_Classifier(
    cgap_decoded_obj = Data,
    countries = ['cdi','nga','tan'],
    X = ['A38',('A41','my_legacy'),'NUM_KIDS'],
    y = ('A41','want_children_continue'),
    model_class = RandomForestClassifier(max_features=None),
    report=True
    )


ICP.train_model()
print(f"Model score on training data: {ICP.score()}")
print(f"Cross-validated model score: {ICP.cv_score()}")
print(f"Majority class in training data: {ICP.majority_class()}")


['cdi', 'nga', 'tan']:  Removed 1917 rows, loss = 0.2121

Model score on training data: 0.7668866732200533
Cross-validated model score: (0.763516361127538, array([0.76350877, 0.74227528, 0.70224719, 0.78160112, 0.82794944]))
Majority class in training data: (0.6501895801151524, 1.0)
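The majority-class figure is the accuracy you would get by always guessing the most common answer, so it is the baseline that the cross-validated score has to beat. The sketch below shows the calculation on a made-up target column and then compares the two figures printed above.

```python
import pandas as pd

# Toy target column; the values are made up
y = pd.Series([1.0, 1.0, 2.0, 1.0, 2.0, 1.0])

# Share of each class; the largest share is the majority-class baseline
shares = y.value_counts(normalize=True)
majority_label = shares.idxmax()
majority_share = shares.max()
print(majority_label, round(majority_share, 3))

# Figures copied from the output above
cv_mean = 0.763516361127538
baseline = 0.6501895801151524
improvement = cv_mean - baseline
print(f"Improvement over the majority-class baseline: {improvement:.3f}")
```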


Here's an analysis that compares predictions in one country given X from that country and a model from another country. 

In [37]:
def intercountry_predictions(cgap_decoded_obj, X,y,model_class, countries):
    
    results = pd.DataFrame(columns=['mc','self','self-mc','n']+countries, index=countries)
    diffs = pd.DataFrame(columns=countries, index=countries)
    country_classifiers = {}
    
    # First train models for each country on data for that country
    
    for country in countries:
        CC = Country_Classifier(
            cgap_decoded_obj = cgap_decoded_obj,
            countries = [country], 
            X = X, y = y, model_class = model_class()
        )
        CC.train_model()      
        country_classifiers[country] = CC
        
        # within-sample classifier score
        results.loc[country,'self'] = round(CC.score(),3)
        
        # majority class prediction score
        results.loc[country,'mc'] = round(CC.majority_class()[0],3)
        
        # improvement of classifier over majority class
        results.loc[country,'self-mc']=round(CC.score() - CC.majority_class()[0],3)
        
        # number of non-NaN records available to this classifier
        results.loc[country,'n'] = len(CC.y)
        
    
    # Now use these models to predict y's between countries
    for country1 in countries:
        CC1 = country_classifiers[country1]
        
        for country2 in countries:       
            if country1 == country2:
                results.loc[country1,country2] = round(CC1.score(),3)
                diffs.loc[country1,country2] = 0
                  
            else:
                # Use country1's classifier to predict other countries
                CC2 = country_classifiers[country2]
                
                # score of country1's  model on c2 data: how well the model predicts c2
                c12 =  CC1.score(CC2.X,CC2.y) 
                
                # how much worse it is to use c1 model to predict c2 than it is to use c2 model
                c12_loss = c12 - CC2.score()
        
                results.loc[country1,country2] = round(c12,3)
                diffs.loc[country1,country2] = round(c12_loss,3)
            
    return results, diffs
 
X = ['A38',('A41','my_legacy'),'NUM_KIDS']
y = ('A41','want_children_continue')

results, diffs = intercountry_predictions(Data, X, y, RandomForestClassifier, countries)
 
print ("Accuracy when row country predicts column country\n")
print(results)
print()
print ("Loss of accuracy in column country predictions when row country predicts column country\n")
print(diffs)

Accuracy when row country predicts column country

        mc   self self-mc     n    bgd    cdi    moz    nga    tan    uga
bgd  0.693  0.695   0.002  2371  0.695   0.46  0.366  0.421  0.301  0.373
cdi  0.541  0.639   0.098  2281  0.542  0.639  0.777  0.817  0.795  0.737
moz  0.645  0.797   0.152  1316  0.526  0.611  0.797  0.812  0.806  0.758
nga  0.657  0.849   0.192  2232  0.539  0.628  0.786  0.849  0.816  0.771
tan   0.74  0.819   0.079  2608  0.538   0.62  0.785  0.841  0.819   0.77
uga  0.679  0.777   0.098  2071  0.536  0.617  0.786  0.832  0.811  0.777

Loss of accuracy in column country predictions when row country predicts column country

       bgd    cdi    moz    nga    tan    uga
bgd      0 -0.179 -0.431 -0.428 -0.517 -0.404
cdi -0.153      0 -0.021 -0.032 -0.023  -0.04
moz  -0.17 -0.028      0 -0.037 -0.012 -0.019
nga -0.156 -0.011 -0.011      0 -0.003 -0.006
tan -0.158 -0.019 -0.012 -0.008      0 -0.007
uga -0.159 -0.022 -0.011 -0.017 -0.008      0
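The second table is derived from the first: each entry is the row country's accuracy on the column country's data minus the column country's own within-sample score, which sits on the diagonal. Here is a sketch of that relationship, using the bgd, cdi and moz entries copied from the results table above:

```python
import numpy as np
import pandas as pd

countries = ['bgd', 'cdi', 'moz']

# Accuracies copied from the results table above
# (row country's model scored on column country's data)
results = pd.DataFrame(
    [[0.695, 0.460, 0.366],
     [0.542, 0.639, 0.777],
     [0.526, 0.611, 0.797]],
    index=countries, columns=countries,
)

# Each country's own within-sample score sits on the diagonal
self_scores = pd.Series(np.diag(results.to_numpy()), index=countries)

# Subtract the column country's own score from every entry in that column
diffs = results.sub(self_scores, axis='columns').round(3)
print(diffs)
```

A few entries differ from the printed table in the third decimal place, because the function above subtracts the unrounded scores before rounding.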
