# Thematic groups

This is a sister notebook to Paul's that groups variables in the Manobi and CGAP data into questions about:

- Agronomic: Land, yields, management
- Infrastructure: Water, roads, challenges to trade, mobility
- Social/cognitive: Beliefs, interests, decision-making

Here I'll list the variables relevant to each group and add some code to make it easier to grab variables out of the Manobi dataset. We don't yet have a way to align Manobi and CGAP data in the same dataframes, so I've stuck to lists of Manobi variables for ease of indexing. For groups about finance, commerce and demographics, go to Paul's notebook. Also check out Paul's notebook for a useful review of how to access Manobi and CGAP data and get descriptions of the questions.

Disclaimer: I'm using the 2016-2017 Manobi data for this notebook. There's more detailed agronomic data in later years, but these are being organized and cleaned by the Manobi crew. I'll update this notebook accordingly when all the data is in the same structure.

In [34]:
import sys
import os, json
import pandas as pd 

# Change these to match yours
filepath = '/Users/Allegra/Documents/Postdoc/habitus/'

mnb_ag = pd.read_csv(filepath + 'data/senegal_0520/plot_and_season_StLouis_2016_english_2.csv')
mnb_farmer = pd.read_csv(filepath + 'data/senegal_0520/demographicfarmer_StLouis_2016_cleaned.csv')

C_IA = pd.read_csv(filepath+'data/CGAP/processed/individual_attributes.csv') 

sys.path.append('/Users/Allegra/Documents/Postdoc/habitus/data/CGAP/processed/')
filepath = '/Users/Allegra/Documents/Postdoc/habitus/data/CGAP/processed/'

from CGAP_JSON_Encoders_Decoders import Question_Decoder, CGAP_Encoded, CGAP_Decoded, Country_Decoded

Data = CGAP_Decoded()
Data.read_and_decode(filepath +'CGAP_JSON.txt')

countries = ['bgd','cdi','moz','nga','tan','uga']


In [35]:
# # Look at the CGAP questions
# for key in Data.__dict__.keys():
#     if key[0:3] == 'bgd': # Because the questions are all the same
#         print(key[3:]) # Print the variable name
#         print(Data.__dict__[key].text, "(", Data.__dict__[key].survey,")", "(", Data.__dict__[key].qtype,")", "\n") # Print the question and which survey it's in and what question type

# Agronomic

## Manobi variables

I've broken down the Manobi agronomic variables into sub-groups. All of these variables come from the plot_and_season file except for starred (\*) ones, which come from the farmer file.

Manobi collects some variables related to what's going on in the farmer's plot. These are plot specifications (area, soil characteristics) as well as whether the plot is being used and for what. Note that "dry season" questions are a bit ambiguous: are they about "season 2" or "season 1"? And what *are* "season 1" and "season 2"? In 2016, Manobi collected information related to "this" or "last" season ("season 1") and "the coming" season ("season 2"). If you check out my notebook [here](https://pitt.box.com/s/d3m2251ktfwor76ruezorphzec8fr4ut) you can see which time periods these seasons actually fall in. In general, "season 2" looks like it's the wet season, though it's not always the case.

- Variables related to the plot:
    - **plot_area:** Ask the farmer to tell you the area of the plot
    - **plot_soil_type:** Ask the farmer about the type of soil
    - **plot_soil_color:** Ask the farmer about the color of the soil
    - **plot_soil_quality:** Ask the farmer about the quality of the soil
    - **plot_topography:** Is the plot flat or sloped
    - **farmed_season1:** Did the farmer plant a crop last season
    - **crop_grown_season1:** Which crop was grown on this plot last season
    - **farmed_season2:** Is the farmer planting this coming season? (second season)
    - **crop_grown_season2:** Which crop will the farmer grow? (second season)
    - **area_sown_season2:** What area has the farmer sown? (second season)
    - **farmed_dry_season:** Does the farmer cultivate this plot during the dry season
    - **crop_grown_dry_season:** If the farmer cultivates this plot during the dry season, what crop does the farmer grow 
    
There are a lot of questions about management of a farmer's field. They come with questions about costs, which are noted in Paul's notebook. Probably the most interesting variable is **input_use** which asks whether the farmer actually applied the inputs as recommended (spoiler: not always!)
    
- Variables related to preparation and management:
    - **harrow_season1**: Did the farmer harrow the plot before planting
    - **weed_season1**: Did the farmer weed last season
    - **weed_method_season1**: If the farmer weeded last season, did he do it manually, mechanically or through herbicide
    - **ridging_season1**: Did the farmer do ridging (?!) last season
    - **used_fertilizer_season1**: Did the farmer use fertilizer last season
    - **fertilizer_num_sacks_season1**/**fertilizer_amount_season1**: How much fertilizer did the farmer use last season (kg/ha or sacks)
    - **fertilizer_type_season1**: What type of fertilizer did the farmer use last season
    - **fertilizer_formula_season1**: Fertilizer formula (I think this is like N:P:K)
    - **insecticide_season1**: Did the farmer apply insecticide last season
    - **insecticide_brand_season1**: What brand of insecticide did the farmer use last season
    - **fungicide_season1**: Did the farmer apply fungicide last season
    - **fungicide_brand_season1**: What brand of fungicide did the farmer use last season
    - **herbicide_season1**: Did the farmer apply herbicide last season
    - **herbicide_brand_season1**: What brand of herbicide did the farmer use last season
    - **fertilizer1_application_season2**: Have you already applied fertilizer? (second season)
    - **fertilizer1_type_season2**: If you already applied fertilizer, what kind did you use? (second season)
    - **fertilizer1_amount_season2**: How much fertilizer did the farmer use (second season)
    - **fertilizer2_application_season2**: Did the farmer already apply a second round of fertilizer (second season)
    - **fertilizer2_type_season2**: What type of fertilizer did the farmer use for the second round (second season)
    - **fertilizer2_amount_season2**: How much fertilizer did the farmer use for the second round (second season)
    - **fertilizer_type_dry_season**: What type of fertilizer will the farmer use for the dry season
    - **input_use**: Whether the farmer applies inputs as recommended *
    
Manobi collects quite a bit of information about sowing specifications, which ought to come in handy for DSSAT. 

- Variables related to sowing:
    - **seed_variety_season1**: The seed variety used last season (traditional, certified, or a hybrid)
    - **row_spacing_season1**: Row spacing
    - **seed_amount_season1**: How many kilograms per hectare of seeds the farmer sowed last season
    - **num_seeds_per_hole_season1**: Number of seeds per hole
    - **seed_variety_season2**: What variety of seed did the farmer sow? (second season)
    - **seed_amount_received_season2**: How much seed did the farmer receive? (second season)
    
Manobi doesn't collect much data in 2016 about harvest (note the unit difference between tons and kg below) but I think there's more information in later surveys.

- Variables related to harvest:
    - **production_season1**: How many tons of crop were produced on the plot
    - **harvest_num_sacks_season1**: How many sacks were harvested per week
    - **kg_per_sack_season1**: What was the average kilogram weight of sacks harvested
    
Manobi also collects information about the timing of events during a season. Again, what "season" means is sort of ambiguous. However, you can check the dates associated with a row of data to figure out whether you're in the dry or wet season. Depending on when the data were collected (for 2016, mostly after December), farmers may or may not have fertilized yet.

- Variables related to timing:
    - **sowing_date_season1**: The date that seeds were sown on the plot last season
    - **harvest_date_season1**: What date did the farmer begin harvesting last season
    - **harvest_duration_season1**: How long did the harvest last last season
    - **first_rainfall_date_season2**: What date was the first significant rainfall? (second season)
    - **sowing_date_season2**: What date did the farmer sow the seeds? (second season)
    - **fertilizer1_date_season2**: What date did the farmer apply the first round of fertilizer (second season)
    - **fertilizer2_date_season2**: What date did the farmer apply a second round of fertilizer (second season)



## CGAP variables

CGAP doesn't collect very much data related to specific agricultural events. The following are about what a farmer does to make a living. Many more questions about things like fertilizer are really financial questions, like "how important is it to pay for fertilizer", which you can go to Paul's notebook to find out about.

- **A2**: How many hectares of agricultural land do you own? 
- **A5**: Which of the following crops do you grow?
- **A10**: Do you have any livestock herds, other farm animals, or poultry?
- **A11**: How many of each of the following do you rear?
- **A23**: For managing the land and livestock, what types of labor do you use?
- **A38**: How many years have you been farming?

# Lists of variables

Note that some of the Manobi variables come from different files (plot_and_season vs farmer). For the CGAP variables, note that most of them are multi-answer, which requires you to select the columns you want. Go to Paul's notebooks for an intro on how to do this.

In [36]:
mnb_plot = ['plot_area', 'plot_soil_type', 'plot_soil_color', 'plot_soil_quality', 'plot_topography',
            'farmed_season1', 'crop_grown_season1', 'farmed_season2', 'crop_grown_season2', 'farmed_dry_season',
            'crop_grown_dry_season']


mnb_mgmt = ['harrowing_season1', 'weed_season1', 'weed_method_season1', 'ridging_season1', 'used_fertilizer_season1',
            'fertilizer_num_sacks_season1', 'fertilizer_type_season1', 'fertilizer_formula_season1', 
            'fertilizer_amount_season1', 'insecticide_season1', 'insecticide_brand_season1', 'fungicide_season1',
            'fungicide_brand_season1', 'herbicide_season1', 'herbicide_brand_season1', 'fertilizer1_application_season2',
            'fertilizer1_type_season2', 'fertilizer1_amount_season2', 'fertilizer2_application_season2',
            'fertilizer2_type_season2', 'fertilizer2_amount_season2', 'fertilizer_type_dry_season']

mnb_sowing = ['seed_variety_season1', 'row_spacing_season1', 'seed_amount_season1', 'num_seeds_per_hole_season1',
              'seed_variety_season2', 'seed_amount_received_season2']

mnb_harvest = ['production_season1', 'harvest_num_sacks_season1', 'kg_per_sack_season1']

mnb_timing = ['sowing_date_season1', 'harvest_date_season1', 'harvest_duration_season1',
              'first_rainfall_date_season2', 'sowing_date_season2', 'fertilizer1_date_season2', 
              'fertilizer2_date_season2']

mnb_agronomic = mnb_plot + mnb_mgmt + mnb_sowing + mnb_harvest + mnb_timing
print(mnb_ag[mnb_agronomic].head())

   plot_area plot_soil_type plot_soil_color plot_soil_quality plot_topography  \
0       2.00         Autres           black            medium            flat   
1       0.51         Autres           black            medium            flat   
2       0.72         Autres           black              good            flat   
3       0.62         Autres           black              good            flat   
4       0.40         Autres           black              good            flat   

  farmed_season1 crop_grown_season1 farmed_season2 crop_grown_season2  \
0            yes     irrigated_rice             no                NaN   
1             no                NaN             no                NaN   
2            yes     irrigated_rice            yes     irrigated_rice   
3            yes     irrigated_rice            yes     irrigated_rice   
4            yes     irrigated_rice            yes     irrigated_rice   

  farmed_dry_season  ... production_season1 harvest_num_sacks_season1  \
0

# Infrastructure 

Both Manobi and CGAP ask questions related to infrastructure, a word I'm using quite broadly. In this group I put variables that inform whether a farmer has what they need to succeed -- the right tools to plant and harvest? Available labor? Adequate transportation? A sufficient trade network? 

## Manobi variables
All of these variables come from the plot_and_season file except for starred (\*) ones. I started by listing the questions that ask a farmer which tools they have access to.

- Variables about what tools the farmer has access to for agriculture:
    - **trator_season1**: Did the farmer use a tractor
    - **combine_season1**: Did the farmer use a combine harvester
    - **baler_season1**: Did the farmer use a baler
    - **tiller_season1**: Did the farmer use a tiller
    - **seed_drill_season1**: Did the farmer use a seed drill
    - **van_season1**: Did the farmer use a van
    - **motor_pump_season1**: Did the farmer use a motor pump
    - **weeder_season1**: Did the farmer use a weeder
    - **sprayer_season1**: Did the farmer use a sprayer
    - **harrow_season1**: Did the farmer use a harrow
    - **cart_season1**: Did the farmer use a cart
    - **plow_season1**: Did the farmer use a plow
    - **used_animal_season1**: Did the farmer use animal power
    - **animal_type_season1**: What kind of animal power did the farmer use (bovine, equine)
    - **labor_soil_preparation_season1**: Did the farmer use labor to prepare the soil
    
Manobi also collects brief information about irrigation:
- Variables about irrigation:
    - **irrigation_method_dry_season**: What method of irrigation does the farmer use in the dry season
    - **irrigation_source_dry_season**: Where does the water for irrigation come from
    
These questions about what a farmer does after harvest correspond with some of the CGAP questions listed later. For example, Manobi and CGAP both ask about storage location and where the farmer sells product.

- Variables about post-harvest:
    - **storage_place_season1**: Where was the grain stored? (Store, granary, or elsewhere)
    - **harvest_sold_place_season1**: Where was the harvest sold? Market, cooperative, next to the fields, or elsewhere
    
I call the following subgroup "commerce infrastructure" because it summarizes what we know about the system around the farmer. That is, does the farmer have access to infrastructure allowing them to receive seed and credit, for example?

- Commerce infrastructure:
    - **seed_amount_demanded_season2**: How much seed did the farmer ask for?
    - **seed_provider_season2**: Who provided the seed?
    - **credit_source_season2**: Did the farmer get credit from a bank, a cooperative or a third party
    - **credit_from_coop**: Whether the farmer receives credit from a production or farmer cooperative *
    - **credit_from_third_party**: Whether the farmer receives credit from a third party lender *
    - **credit_from_bank**: Whether the farmer receives credit from a bank *
    
The rest of the variables in this list are more conventionally about infrastructure -- how a farmer gets around, what sort of house they live in and what public services they have access to.

- Variables about non-agricultural access:
    - **transportation**: What method of transportation the farmer uses to get around *
    - **housing_material**: What material the farmer's house is made out of *
    - **access_electricity**: Whether the farmer and his/her family have access to electricity / What source the farmer gets his electricity from *
    - **access_drinking_water**: Whether the farmer and his/her family have access to potable drinking water / Where the farmer gets his drinking water *
    - **access_healthcare**: Whether the farmer and his/her family have access to healthcare or a health center / Where the farmer receives health care *

## CGAP variables

CGAP also asks about crop storage, as well as questions relevant to commerce infrastructure. A lot more of CGAP's questions about things like credit are focused around finance, so I'll refer you to Paul's notebook for those. There are some other overlaps with Manobi like the question about transportation possessions (**POSSESS1**) and a question about water (**A22**). 

- Crop storage:
    - **A53**: Which crops do you normally store?
    - **A52**: Do you currently store any of your crops after the harvest?
    - **A56**: Why do you store your crops?
    - **A57**: Why do you not currently store your crops?
    - **M20**: Do you currently have any of the following abilities for your agricultural activities? (Questions include access to weather services, market pricing, farming information, transportation tracking, etc.)

- Commerce infrastructure:
    - **A15**: Who do you normally purchase your main agricultural and livestock inputs (such as seeds, fertilizer, or pesticide) from?
    - **A28**: Where do you normally sell your crops and livestock?
    - **A29**: Why do you sell your crops and livestock at this location?
    - **A35**: What challenges do you face in terms of getting your crops and livestock to your customers?
    - **F37**: What is the main reason you started using mobile money?
    - **F40**: You said you do not use a mobile money account for any payments or purchases. Please tell me why.
    
- Variables about non-agricultural access:
    - **A22**: Which of the following statements best describe your water situation?
    - **A43**: What types of services do you get from groups or associations?
    - **POSSESS1**: Do you have a bicycle, scooter or car?


# Lists of variables

In [37]:
mnb_tools = ['tractor_season1', 'combine_season1', 'baler_season1', 'tiller_season1',
             'seed_drill_season1', 'van_season1', 'motor_pump_season1', 'weeder_season1',
             'sprayer_season1', 'harrow_season1', 'cart_season1', 'plow_season1', 
             'used_animal_season1', 'animal_type_season1', 'labor_soil_preparation_season1']

mnb_irrigation = ['irrigation_method_dry_season',
                  'irrigation_source_dry_season']

mnb_postharvest = ['storage_place_season1', 'harvest_sold_place_season1']

mnb_entities = ['seed_amount_demanded_season2', 'seed_provider_season2', 'credit_source_season2']

mnb_infrastructure = mnb_tools + mnb_irrigation + mnb_postharvest + mnb_entities

mnb_ag[mnb_infrastructure].head()

Unnamed: 0,tractor_season1,combine_season1,baler_season1,tiller_season1,seed_drill_season1,van_season1,motor_pump_season1,weeder_season1,sprayer_season1,harrow_season1,...,used_animal_season1,animal_type_season1,labor_soil_preparation_season1,irrigation_method_dry_season,irrigation_source_dry_season,storage_place_season1,harvest_sold_place_season1,seed_amount_demanded_season2,seed_provider_season2,credit_source_season2
0,no,no,no,no,no,no,yes,no,yes,no,...,no,,yes,,,shop,,0,,
1,no,no,no,no,no,no,no,no,no,no,...,,,,,,,,0,,
2,no,no,no,no,no,no,no,no,yes,no,...,,,yes,,,shop,market,0,,
3,no,no,no,no,no,no,no,no,yes,no,...,,,yes,,,shop,market,0,,
4,no,no,no,no,no,no,no,no,yes,no,...,,,yes,,,shop,market,0,,


# Social/cognitive

Manobi and CGAP both ask questions about a farmer's cognitive processes, but Manobi tends to be more specific ("why did you plant this crop?") and CGAP tends to be more general ("Do you intend to keep working in agriculture?"). Part of the difference is that Manobi is interested in a farmer's productivity and provides free answer boxes for the rare cognitive question, while CGAP has a robust set of answers for general cognitive questions because they care about those. (Questions come from plot_and_season unless marked with a \*; then they come from the farmer file.)

## Manobi variables
Manobi asks specific questions about why a farmer selects crops/seeds. They also ask if the farmer will insure the plot. I also included **harvest_sold_season1** because deciding how much crop to keep for subsistence versus sell for cash is an important decision.
- Decision-making:
    - **reason_for_crop_season1/reason_for_crop_season2**: Reason for planting the crop (seed production, subsistence, cash cropping/revenue generation, or all three)
    - **reason_for_seed_variety_season1**: Why the farmer chose that seed variety for last season
    - **harvest_sold_season1**: How much of the harvest was sold
    - **plot_insurance_season2**: Is the farmer going to insure the plot for the coming season?
    
Manobi is concerned with how satisfied farmers are by the services provided to them:
- Beliefs and preferences:
    - **insurance_satisfaction_season1**: Was the farmer satisfied or unsatisfied with the insurance compensation (if received)
    - **inputs_useful**: Whether the farmer thinks agricultural inputs have been useful *
    - **interest_use_inputs**: Whether the farmer is interested in using agricultural inputs *
    - **interest_insurance**: Whether the farmer is interested in agricultural insurance *
    - **use_ag_services** : Does the farmer use agricultural services *
    - **ag_services_quality**: The quality of agricultural services provided to the farmer *
    - **ag_services_frequency**: How often the farmer receives agricultural services *
    - **interest_ag_services**: Whether the farmer is interested in agricultural services *
    
Manobi asks about the cooperative that a farmer belongs to, but doesn't ask about the services provided. Conversely, CGAP asks whether a farmer belongs to various organizations (**A42**) and the services they receive (**A43**) but no specifics about names.

- Social:
    - **coop_name**: Which farmer cooperative does the farmer belong to? *


## CGAP variables

 F49, F50, F51
CGAP asks about decision-making generally, with questions like **A39**, and also differentiates between whether a farmer *does* save money (**A48**, see Paul's notebook) versus whether they *want* to save money. **D22** is an especially interesting variable to me because it tells you whether the wife in the household makes decisions (though I could not manage to predict **D22** with any accuracy). 

- Decision-making:
    - **A39**: Do you intend to keep working in agriculture?
    - **A40**: What would make you less likely to stay in agriculture?
    - **A44**: How often do you use each of the following sources of information for agricultural activities?
    - **A49**: Do you want to keep money aside for the following agricultural needs?
    - **D22**: Generally, who makes decisions in the household?
    - **H30**: Question about how often a farmer finds themselves a) spending less than they make, b) having an emergency fund for unplanned expenses, c) paying their bills on time and d) having more savings than debt
    - **A4A**: Which of the following statements best describes your decision-making role in the crops?
    - **H42**: When it comes to household expenses, which statement best matches the decision-making role that you play?
    
Manobi asks about the value of services; CGAP focuses more on the mindset of the farmer. CGAP asks about how the farmer feels about various aspects of farming, as well as how the farmer sees the future and herself.
- Beliefs and preferences:
    - **A4**: Do you consider your farm to be a business?
    - **A31**: Why do you not get the current market price?
    - **A6**: Which of the following crops is most important to you?
    - **H3**: Which of the following income sources is most important to you?
    - **H4**: Which of the following income sources do you like getting the most?
    - **A41**: Do you agree or disagree with the following statements (feelings about farming)
    - **H37**: Do you agree or disagree with the following statements (about the future)?
    - **H38**: Do you agree or disagree with the following statements (about yourself)?
    - **F17**: Have you ever used any of the following (financial services)?
    - **F52**: On a scale from 1 to 5, where 1 means “fully distrust” and 5 means “fully trust,” how much do you trust each of the following as financial sources?
    
Like I mentioned earlier, these questions overlap a little with Manobi's **coop_name** variable (but not by much). Note that some of the questions (**F49** for example) could be considered "infrastructure" but they also have a social component (for example, in the case of savings groups).
- Social:
    - **A42**: Are you a member of any of the following groups or associations?
    - **A43**: What types of services do you get from these organizations?
    - **H16**: When it comes to financial or income-related advice, who do you regularly talk to?
    - **F47**:  Apart  from today when was the last time you used these services or service providers for any financial activity?
    - **F49**: (About financial groups) Which of the following services do these groups provide?
    - **F50**: (About financial groups) Which of these service providers or services is the most important to you?
    - **F51**: (About financial groups) Why do you not have a membership with any of these groups?


# Lists of variables

In [38]:
mnb_decisions = ['reason_for_crop_season1', 'reason_for_crop_season2', 'reason_for_seed_variety_season1', 
 'harvest_sold_season1', 'plot_insurance_season2']

mnb_beliefs = ['insurance_satisfaction_season1', 'inputs_useful', 'interest_use_inputs', 'interest_insurance',
              'ag_services_quality', 'interest_ag_services']

mnb_social = ['coop_name']

Paul's notebook reviews how to look at questions much better than mine, but note that most of the cognitive variables (and others in this notebook) are multi-answer questions, meaning that you have a yes/no column for each possible answer. 

In [39]:
cgap_cognitive = ['A4', 'A6', 'A31', 'H3', 'H4', 'A39', 'A40', 'A41', 'A42', 'A43', 'A44', 'A47', 'A49', 
                  'D22', 'H16', 'H30', 'H37', 'H38', 'F47', 'F49', 'F50', 'F51']

for var in cgap_cognitive:
    q = eval('Data.bgd_' + var)
    print(q.label, ":    ", q.text, "(", q.qtype, ")")
    if q.qtype == "single":
        print(q.answers)
    else:
        print(q.column_dict)
    print("\n")

A4 :     Do you consider your farm to be a business? ( single )
{'yes': 1, 'no': 2}


A6 :     Which of the following crops is most important to you? ( single )
{'Rice': 1, 'Wheat': 2, 'Mango': 29, 'Jute': 48, 'Maize': 3, 'Tea': 58, 'Pulses': 13, 'Sugarcane': 57, 'Tobacco': 59, 'Chilies': 24, 'Onions': 32, 'Garlic': 27, 'Potato': 16, 'Rapeseed': 36, 'Mustard_seed': 50, 'Coconut': 44, 'Eggplant': 26, 'Radish': 35, 'Tomatoes': 39, 'Cauliflower': 23, 'Cabbage': 22, 'Pumpkin': 34, 'Banana': 21, 'Jackfruit': 28, 'Pineapple': 32, 'Guava': 28, 'Sesame': 55, 'Other_1': 60, 'Other_2': 61, 'Other_3': 62, 'No_crop': 63}


A31 :     Why do you not get the current market price? ( single )
{'few_customers': 1, 'take_advantage': 2, 'high_commissions': 3, 'corruption': 4, 'no_transport': 5, 'DK': 98, 'other': 6}


H3 :     Which of the following income sources is most important to you? ( single )
{'regular_job': 1, 'occasional_job': 2, 'retail_business': 3, 'services_business': 4, 'grant_pension': 5, 

AttributeError: 'CGAP_Decoded' object has no attribute 'bgd_A40'