# Introduction to Data Objects #
Paul Cohen June, 2021

Data Objects (DOs, in contrast with don'ts) incorporate metadata, which will become increasingly necessary as our project goes forward. DOs also are searchable, are encoded in JSON, and can be assembled easily into `pandas` dataframes. I'll illustrate how to use them with both CGAP and Manobi data. 

This notebook supersedes the previous [Introduction to Data Objects](https://pitt.app.box.com/folder/136678084196).

Data Objects were designed to encode CGAP data but have since been generalized. Those of you who are already working with CGAP data should see no change in the interface, but please let me know if something stops working. 

I have tried to minimize dependencies, but to make DOs searchable I need some elementary natural language tools, specifically stemming.  So you'll need to [install `nltk`](https://www.nltk.org/install.html).  

## Data Objects API ##

Skip this section on first reading.  It is here as a quick reference for those who are already familiar with Data Objects.

There are only four classes in the Data Objects package.  

- `DO_Encoder` : Given data and metadata, this class creates a single data object and encodes it as a JSON string 
- `Encoded_DOs` : This class holds a collection of encoded DOs and has methods that operate on the collection
- `DO_Decoder` : This class decodes an encoded DO and has methods that operate on individual DOs
- `Decoded_DOs` : This class holds a collection of decoded DOs and has methods that operate on the collection

Different sources of data might require you to make subclasses and override some methods; for example, the CGAP survey data from six countries requires one additional method for assembling dataframes from sources in some or all of the countries (see below). 


### DO_Encoder methods and attributes ###

- `encode(self)` : Encodes the DO as a JSON string
- `make_search_terms (self, *attributes)` :  Extracts search terms and stemmed search terms from the specified attributes of a DO.  If no attributes are given, all the DO attributes are used.

All `DO_Encoder` objects have a `name` that uniquely identifies the DO and a `df` that holds its data.  Additionally, they might have a `column_dict` that specifies where to find data and what the columns of `df` should be called.  If the `make_search_terms` method is executed, the `DO_Encoder` will also have a `search_terms` attribute. All other attributes are passed as keyword arguments (`**kwargs`). Generally, these arguments are metadata. 

### DO_Decoder methods and attributes ###

- `describe(self, *attributes, display = True)` : This makes a tuple of f-strings that describe the decoded data object. If no attributes are given, this will describe all attributes other than `df` and `search_terms`. If `display = False`, this will return the f-strings, otherwise it prints them.
- `cols (self, *columns)` : If no columns are given, this returns all the columns of `df`, otherwise it returns the columns specified in `columns`. 
- `re_encode(self,*new_name)` : This creates a new `DO_Encoder` object from `self`.  It's useful when we want to decode a DO to modify it and then recode it. The optional argument `new_name` assigns a new name to the recoded object; it's analogous to `save as`.

Because a DO_Decoder object is just a decoded version of an encoded DO_Encoder object, it has the same attributes as the latter: `name`, `df`, optionally `column_dict`, other optional attributes that represent metadata, and, usually, `search_terms`. 

### Encoded_DOs methods and attributes ###


The `Encoded_DOs` class has just two methods:

- `add_encoded(self,encoded)` : adds an encoded `DO_Encoder` to a collection
- `write(self,filename)` : writes the collection to a file

The single attribute of the `Encoded_DOs` class is `jstrings`, which is a list of JSON strings: one for each encoded DO in the collection.

### Decoded_DOs methods and attributes ###

The `Decoded_DOs` class holds a collection of decoded DOs and serves as a workspace for data analysis, so it has more methods:

- `decode (self, jstring)` : decodes an encoded DO

- `read_and_decode (file)` : reads a batch of encoded DOs from a file and decodes them

- `build_term_index ()` : builds a dict that maps search terms to decoded DOs

- `dob (self,name)`: Returns the decoded DO with the given name if it exists, otherwise warns the user and returns `None`.

- `describe (name, *attributes, display = True)` : A convenience function to call the `DO_Decoder.describe` method on the DO with the given `name`. 

- `search (*query, display = True)`: If `display` is `True`, this `describe`s the questions that match a Boolean query; if `display` is a list or tuple of attributes, this describes just those attributes; otherwise it returns the names of those questions (see below).

- `cols (name, *columns)`: This assembles a dataframe from the columns in `*columns`. Unlike the `DO_Decoder.cols` method, which assembles a dataframe from a single DO, this one assembles dataframes from selections of DOs in the collection.  

- `re_encode (self)` : re-encodes all of the decoded DOs in the collection. Re-encoding is used after a set of decoded questions has been edited or augmented with new questions and needs to be encoded before being saved to a file.
 

In [1]:
import sys, os, json
from types import SimpleNamespace
import string, copy

import numpy as np
import pandas as pd

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

sys.path.append('/Users/mordor/research/habitus_project/clulab_repo/predictables/Data/data_objects/code_and_notebooks')
from Data_Objects import DO_Encoder, DO_Decoder, Encoded_DOs, Decoded_DOs, Decoded_CGAP_DOs


You'll need `Data_Objects.py`  and `data_objects_indexing.py`, both of which are [here](https://pitt.app.box.com/folder/136678084196) on the Pitt Habitus Box folder.

Now we'll read in two Manobi datasets, one for farms, the other for plots and seasons. These datasets can be found on the Pitt Box folder [here](https://pitt.app.box.com/folder/131436606139). 

In [2]:
# Change the filepath to whatever makes sense on your local computer
filepath = '/Users/prcohen/anaconda2/envs/aPRAM/Habitus/Data/Manobi Data/SRV/processed/'

# Manobi farm and season data
farm = pd.read_csv(filepath+'demographicfarmer_StLouis_2016_cleaned.csv')
season = pd.read_csv(filepath+'plot_and_season_StLouis_2016_english.csv')

In [3]:
print(farm.columns)

Index(['id', 'village_id', 'validation_status', 'accuracy', 'farmerDate',
       'coop_name', 'gender', 'age', 'years_farming', 'education_level',
       'num_wives', 'wife_trades', 'wife_food_crops', 'wife_crafts',
       'num_children', 'num_young_children', 'num_older_children',
       'num_children_in_school', 'children_working', 'income',
       'transportation', 'housing_material', 'access_electricity',
       'access_drinking_water', 'access_healthcare', 'wants_credit',
       'credit_from_coop', 'credit_from_third_party', 'credit_from_bank',
       'loan_amount', 'loan_rate', 'has_bridge_loan', 'bridge_loan_amount',
       'pledged_output', 'wants_bridge_loan', 'wants_bridge_loan_amount',
       'input_use', 'input_cost', 'inputs_useful', 'interest_use_inputs',
       'interest_insurance', 'use_ag_services', 'ag_services_quality',
       'ag_services_frequency', 'interest_ag_services', 'buyer_type',
       'last_price', 'num_plots', 'total_ha', 'main_crop', 'secondary_crop',
  

## DO_Encoder and Meta-data ##

To create a DO, make a DO_Encoder object.  `name` is a DO identifier; it must be unique to the DO.   `data` is a required argument but it doesn't have to be a pandas dataframe, just something that can be turned into one. (This will be useful when you want to create new derived variables from existing data; see below). 

In [4]:
q = DO_Encoder(
    name = 'farm_id',
    data = farm['id'],
    text = 'The unique id of the farm.',
    note = 'farm.id is used as a foreign key the plot and season data to link plot/season to farm',
    source = 'demographicfarmer_StLouis_2016_cleaned.csv',
    country = 'Senegal',
    region = 'SRV',
    time_period = '2016-2017',
    DO_creator = 'Paul Cohen'
    )

All other arguments are keyword arguments; they can be anything you want.  This flexibility means you can specify any meta-data you please. However, I recommend a small set to get us started in the HEURISTICS project.  You don't have to use all of them (e.g.,`note` might not be needed) but you should use the relevant ones. 

- column_dict : a dict that maps the locations of data in df to names of variables in DO (see below)
- text : a string that says what the data means. For example, the label `id` could mean anything, so say something like 'the unique id of the farm'. A good choice for this field is the text of the associated survey question.
- answers : a dict that maps the encoding of answers to their meanings (see below)
- note : anything you want to say about the data, its provenance, how it is derived, etc. that isn't in `text`.
- source : a filename or URL or some other pointer to the data
- country : the country from which the data is collected
- region : a smaller geographic area than country
- time_period : could be a date range or a year or a season identifier
- DO_creator : who made this data object


Note that encoding the DO_Encoder object to a JSON string is not automatic because you might want to do some work on it before encoding it.  Encoding is done by the `encode` method: 

In [5]:
en = q.encode()
# look at the first 400 characters:
print(en[:400])

{"text": "The unique id of the farm.", "note": "farm.id is used as a foreign key the plot and season data to link plot/season to farm", "source": "demographicfarmer_StLouis_2016_cleaned.csv", "country": "Senegal", "region": "SRV", "time_period": "2016-2017", "DO_creator": "Paul Cohen", "name": "farm_id", "column_dict": null, "df": "{\"columns\":[\"id\"],\"index\":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,1
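
Because an encoded DO is ordinary JSON, you can peek inside one with nothing but the standard `json` module. The sketch below uses a hand-made miniature string that mimics the layout printed above: metadata keys at the top level, and the dataframe stored as a nested JSON string under `df`. The three-row sample and the `data` key inside `df` are assumptions for illustration, not output from `DO_Encoder`.

```python
import json

# A miniature stand-in for an encoded DO (hypothetical sample).
# The inner "data" key is assumed; the printed output above is cut off before it.
inner_df = json.dumps(
    {"columns": ["id"], "index": [0, 1, 2], "data": [[3832], [3833], [3846]]}
)
jstring = json.dumps({
    "text": "The unique id of the farm.",
    "name": "farm_id",
    "column_dict": None,
    "df": inner_df,
})

record = json.loads(jstring)       # first pass: metadata plus the df string
frame = json.loads(record["df"])   # second pass: the nested dataframe description

print(record["name"], "-", record["text"])
print(frame["columns"], len(frame["data"]), "rows")
```

This is only handy for a quick look at metadata; for real work, let `DO_Decoder` do the decoding.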


Encoded DOs are decoded (and recoded, see below) by DO_Decoder objects.  Decoding *is* done automatically:

In [6]:
de = DO_Decoder(en)
de

namespace(text='The unique id of the farm.',
          note='farm.id is used as a foreign key the plot and season data to link plot/season to farm',
          source='demographicfarmer_StLouis_2016_cleaned.csv',
          country='Senegal',
          region='SRV',
          time_period='2016-2017',
          DO_creator='Paul Cohen',
          name='farm_id',
          column_dict=None,
          df=        id
             0     3832
             1     3833
             2     3846
             3     3847
             4     3848
             ...    ...
             3857  9421
             3858  9364
             3859  9383
             3860  9387
             3861  9419
             
             [3862 rows x 1 columns],
          search_terms=['plotseason',
                        'cohen',
                        'farmid',
                        'unique',
                        'srv',
                        'farm',
                        'Senegal',
                        'key',
   

Everything we specified in the original `DO_Encoder` is there, plus some search terms that I'll discuss later. It can be annoying to look at long lists of search terms and rows of dataframes, so the `DO_Decoder` class has a method called `describe` that lets you control what's displayed:

- `describe(*attributes, display = True)` : If no attributes are given, this will describe all attributes other than `df` and the search terms. If `display = False`, this will return the f-string representation of the description instead of printing it.



In [7]:
de.describe('name','text','note','df')


name : farm_id
text : The unique id of the farm.
note : farm.id is used as a foreign key the plot and season data to link plot/season to farm
df :         id
0     3832
1     3833
2     3846
3     3847
4     3848
...    ...
3857  9421
3858  9364
3859  9383
3860  9387
3861  9419

[3862 rows x 1 columns]



## column_dict : selecting and naming variables ##

In the previous example, `data` was a pandas Series. An alternative is to pass a dataframe and select columns from it using the `column_dict` attribute.  The following example passes the whole `farm` dataframe but selects just the `id` column.  

`column_dict` can be used both to select data columns and to give different names to DO columns than the names they had in the original data.  

The keys of `column_dict` identify particular columns in `data` and the values are the names you want them to have in the DO dataframe.  Thus `column_dict = {'id' : 'farm_id'}` says this data object will find a column called `id` in `farm` and create a column called `farm_id` in its `df`:

In [8]:
q = DO_Encoder(
    name = 'farm_id',
    data = farm,
    column_dict = {'id' : 'farm_id'},
    text = 'The unique id of the farm.',
    note = 'farm.id is used as a foreign key the plot and season data to link plot/season to farm',
    source = 'demographicfarmer_StLouis_2016_cleaned.csv',
    country = 'Senegal',
    region = 'SRV',
    time_period = '2016-2017',
    DO_creator = 'Paul Cohen'
).encode()

DO_Decoder(q).df

Unnamed: 0,farm_id
0,3832
1,3833
2,3846
3,3847
4,3848
...,...
3857,9421
3858,9364
3859,9383
3860,9387


## column_dict : when data isn't a pandas dataframe ##

As noted, the `data` argument needn't be a pandas dataframe but must be something that can be turned into one.  In such a case, `column_dict` keys might correspond with array indices and values are, as above, the names you want the columns to have in the DO dataframe:  

In [9]:
data = [[1,2,3],[4,5,6],[7,8,9],[10,11,12]]

q = DO_Encoder(
    name = 'foobar',
    data = data,
    column_dict = {0 : 'A', 1: 'B', 2: 'C'},
    text = "Example of using array indices in column_dict when data isn't a pandas dataframe",
).encode()

DO_Decoder(q).df


Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9
3,10,11,12
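
Both uses of `column_dict` shown so far come down to selecting columns and then renaming them. A minimal sketch of that idea in plain pandas follows. `apply_column_dict` is a hypothetical helper written for this illustration, not a function in `Data_Objects.py`, and the `age` values are made up.

```python
import pandas as pd

def apply_column_dict(data, column_dict=None):
    """Hypothetical helper: turn `data` into a dataframe, then select and rename columns."""
    df = pd.DataFrame(data)            # works for dataframes, Series and nested lists
    if column_dict is None:
        return df
    # Keys say where the data lives; values are the names wanted in the result.
    return df[list(column_dict)].rename(columns=column_dict)

# Case 1: data is a dataframe with named columns (toy values)
farm_like = pd.DataFrame({'id': [3832, 3833], 'age': [41, 37]})
print(apply_column_dict(farm_like, {'id': 'farm_id'}))

# Case 2: data is a nested list, so the keys are integer positions
rows = [[1, 2, 3], [4, 5, 6]]
print(apply_column_dict(rows, {0: 'A', 1: 'B', 2: 'C'}))
```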


## column_dict : multi-answer questions ##

Surveys commonly specify possible answers to questions; for example, a question about marital status might allow  'married', 'unmarried', 'cohabitating', etc., as possible answers. The respondent would typically pick one of these answers.  But if asked about the crops grown on a farm, the respondent might pick several answers, such as rice *and* tomatoes *and* mangos. Confusingly, these two types of questions are called single-answer and multi-answer questions, even though in both cases the respondent picks among multiple answers.  It would be better to call these 'pick one' and 'pick any' questions -- pick one marital status, pick any crops -- but we're stuck with the single-answer and multi-answer terminology.

The point of all this is that DOs should be able to handle multi-answer questions.  To illustrate how it works, I will use an example from Manobi data. 

The Manobi farmer data contains three columns for the activities of the spouse.  These are called `wife_trades`, `wife_food_crops` and `wife_crafts`.  Three columns are needed because the wife's activity is a multi-answer kind of question: she can do any or none of the three activities.   Each column contains either 1 or 0 (encoding 'yes' or 'no') for that activity: 

In [10]:
farm[['wife_trades','wife_food_crops','wife_crafts']]

Unnamed: 0,wife_trades,wife_food_crops,wife_crafts
0,1,0,0
1,0,1,0
2,0,1,0
3,0,1,0
4,1,0,0
...,...,...,...
3857,1,0,0
3858,0,0,0
3859,1,0,0
3860,0,0,0


Unfortunately, there's no metadata to tell us that these three columns of data are related, nor what 0 and 1 mean. Here's how I'd encode these data as a single DO:

In [11]:
wife_activity = DO_Encoder(
    name = 'wife_activity',
    data = farm,
    column_dict = {'wife_trades' : 'trade', 'wife_food_crops' : 'food_crops', 'wife_crafts' : 'crafts'},
    answers = {1 : 'yes', 2: 'no'},
    text = 'Which of the following activities does your wife engage in?',
    note = "I don't know the actual survey question text",
    source = 'demographicfarmer_StLouis_2016_cleaned.csv',
    country = 'Senegal',
    region = 'SRV',
    time_period = '2016-2017',
    DO_creator = 'Paul Cohen'
).encode()

DO_Decoder(wife_activity).describe('name','text','answers','df')


name : wife_activity
text : Which of the following activities does your wife engage in?
answers : {'1': 'yes', '2': 'no'}
df :       trade  food_crops  crafts
0         1           0       0
1         0           1       0
2         0           1       0
3         0           1       0
4         1           0       0
...     ...         ...     ...
3857      1           0       0
3858      0           0       0
3859      1           0       0
3860      0           0       0
3861      0           0       0

[3862 rows x 3 columns]
column names : ['wife_trades', 'wife_food_crops', 'wife_crafts']



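Here you see that the three columns have been grouped together as answers to a single question, which I gave the label 'wife_activity'.  Note also that the numeric coding of answers is explained by the dict called `answers`, whose integer keys come back as strings after the round trip through JSON. (Since these columns hold 0 and 1, the key for 'no' should really be 0 rather than 2.)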
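
Once the columns and the `answers` dict travel together, turning coded rows back into readable answers is straightforward. The sketch below is an illustration in plain Python, not part of the Data Objects package. It assumes the coding `{1: 'yes', 0: 'no'}` (matching the 0/1 values in the columns) and uses a few rows like those displayed above. It also shows why lookups need `str(code)`: JSON object keys are always strings.

```python
import json

answers = {1: 'yes', 0: 'no'}                       # assumed coding for the 0/1 columns
decoded_answers = json.loads(json.dumps(answers))   # what survives a JSON round trip
print(decoded_answers)                              # the keys are now strings

columns = ['trade', 'food_crops', 'crafts']
rows = [(1, 0, 0), (0, 1, 0), (0, 0, 0)]            # a few rows like those shown above

def picked(row):
    """Return the activities whose code translates to 'yes' for one respondent."""
    return [col for col, code in zip(columns, row)
            if decoded_answers[str(code)] == 'yes']

for row in rows:
    print(row, '->', picked(row))
```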

## Working with Collections of Data Objects ## 

When you work with batches of DOs, you'll want to put them somewhere, and you might want to run methods -- such as searching for DOs -- over all the DOs in a batch.  The Data Objects framework has an `Encoded_DOs` class to hold encoded questions and a `Decoded_DOs` class to hold decoded questions.  

### Batch creation of Data Objects ###

It takes very little effort to transform Manobi data into DOs, although it does require a source of metadata.  I'll use two keys - one for farm data, the other for plot and season data - developed by Allegra for the Manobi 2016 data.  (Creating DOs for CGAP data is *much* more involved because CGAP has data from six countries but the coding of data is inconsistent across countries.)

In [12]:
# Key for Manobi season data
# We'll use this to illustrate batch creation of DOs later on
season_key = pd.read_excel(filepath+'season_key_2016.xls')
farm_key = pd.read_excel(filepath+'farmer_key_2016.xls')



In [13]:
# The columns in Allegra's farmer key
farm_key[:15]

Unnamed: 0,Column name when downloaded,Units,Survey question,Assigned variable name
0,X.ID,,,id
1,Parent.ID,,,village_id
2,Status,,,validation_status
3,Accuracy,,,accuracy
4,Date,,,farmerDate
5,Groupement,,Which farmer cooperative does the farmer belon...,coop_name
6,Genre,,What gender is the farmer?,gender
7,Age,,How old is the farmer?,age
8,Nbr.annÈes.activitÈ,,How many years has the farmer been farming?,years_farming
9,Niveau.Scolaire,,What education level has the farmer attained?,education_level


In [14]:
# The columns in Allegra's season key
season_key[:15]

Unnamed: 0,Column name when downloaded,Units,Survey question,"""Season""",Assigned variable name
0,X.ID,,,,plot_ID_from_plot
1,Parent.ID,,,,farm_ID
2,Object.state,,,,Object.state
3,Status,,,,Status
4,Accuracy,,,,Accuracy
5,Date,,,,collection_date_from_plot
6,Parcelle.N.,,,,plot_number
7,Statut,,Ask the farmer to tell you the status of the p...,,plot_owner
8,Superficie.dÈclarÈe,,Ask the farmer to tell you the area of the plo...,,plot_area
9,DÈlimitation.parcelle,,Take a tour of the plot with the farmer and re...,,plot_boundary


Clearly, these keys are not complete and include some artifacts, but they are sufficient to automate the construction of DOs. 

In [15]:
# Create an Encoded_DOs object for each batch of DOs
Farm_Encoded   = Encoded_DOs()
Season_Encoded = Encoded_DOs()

# Build a data object for each item in Allegra's key and add it to Encoded

def data_object_from_key_item (key_item, dataset,source):
    
    text, label = key_item['Survey question'], key_item['Assigned variable name']
    
    if label is None or label not in dataset.columns:
        print (f"Variable {label} not found in the dataset")
        return
    
    else:
        if text is None: 
            print (f"Warning:  Variable {label} has no associated text")
            
        # build a DO_Encoder, encode it and return it
                   
        return DO_Encoder(
            name = label,
            data = dataset[label],
            text = text,
            unit = key_item['Units'],
            source = source,
            country = 'Senegal',
            region = 'SRV',
            time_period = '2016-2017',
            DO_creator = 'Paul Cohen'
        ).encode()
        

# Build a batch of encoded DOs for the farm dataset given the farm_key 
dataset, source = farm, 'demographicfarmer_StLouis_2016_cleaned.csv'
for i in range(len(farm_key)):
    enc = data_object_from_key_item(farm_key.iloc[i], dataset, source)
    if enc is not None: 
        Farm_Encoded.add_encoded(enc)
    

# Build a batch of encoded DOs for the season dataset given the season_key      
dataset, source = season, 'plot_and_season_StLouis_2016_english.csv'
for i in range(len(season_key)):
    enc = data_object_from_key_item(season_key.iloc[i], dataset, source)
    if enc is not None:
        Season_Encoded.add_encoded(enc)
    


Variable collection_date not found in the dataset
Variable ? Note: Nobody knows what this is not found in the dataset
Variable ?? not found in the dataset
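
One caveat about the `text is None` test in the function above: pandas typically represents empty spreadsheet cells as `NaN`, not `None`, which appears to be why no "has no associated text" warnings show up in this output. If you want such a check to fire, a missing-value test along the following lines should work. `is_missing` is a hypothetical helper, not part of `Data_Objects.py`; inside pandas code, `pd.isna` does the same job.

```python
import math

def is_missing(value):
    """True for None and for float NaN (what pandas usually stores for empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

empty_cell = float('nan')                     # stand-in for an empty cell in the key
print(empty_cell is None)                     # the identity test misses NaN
print(is_missing(empty_cell))
print(is_missing('How old is the farmer?'))
```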


A batch of encoded DOs isn't much to look at; in fact, it is just a list of very long JSON strings.  These can be written to a file for permanent storage and shared with your colleagues, but to do any work with these data objects you'll need to decode them.  

### Writing, Reading, Decoding and Examining Data Objects ###

To save a batch of encoded DOs to a file, use the `write` method of the `Encoded_DOs` class, and to read and decode a file of encoded DOs, use the `read_and_decode` method of the `Decoded_DOs` class:


In [16]:
# You can write the encoded objects to a file (change the filepath to suit):
Season_Encoded.write(filepath + 'season_data_objects.txt')
Farm_Encoded.write(filepath + 'farm_data_objects.txt')

In [17]:
# You can read a file of encoded objects and decode them:
# First create Decoded_DOs objects to "hold" the DOs

Season_Decoded = Decoded_DOs()
Farm_Decoded   = Decoded_DOs()

# Then read and decode
Season_Decoded.read_and_decode(filepath + 'season_data_objects.txt')
Farm_Decoded.read_and_decode(filepath + 'farm_data_objects.txt')
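
If you are curious what a write-then-read round trip involves, here is a minimal sketch using only the standard library. It assumes one JSON string per line, which may or may not be the layout that `Encoded_DOs.write` actually uses, and `write_jstrings` and `read_and_index` are hypothetical names. The records are toy examples. Indexing the decoded records by `name` mirrors the way decoded DOs are looked up by name.

```python
import json
import os
import tempfile

def write_jstrings(jstrings, filename):
    """Write one JSON string per line (an assumed layout)."""
    with open(filename, "w", encoding="utf-8") as f:
        for jstring in jstrings:
            f.write(jstring + "\n")

def read_and_index(filename):
    """Read the strings back and index the decoded records by their name."""
    with open(filename, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return {record["name"]: record for record in records}

# Two toy records standing in for encoded DOs
jstrings = [
    json.dumps({"name": "plot_area", "text": "Area of the plot"}),
    json.dumps({"name": "income", "text": "The yearly income of the farmer"}),
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_data_objects.txt")
    write_jstrings(jstrings, path)
    by_name = read_and_index(path)

print(sorted(by_name))
print(by_name["income"]["text"])
```

`json.dumps` escapes newlines inside strings, so each encoded record stays on a single line.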

Now we can examine any of these decoded data objects and, as you'll see shortly, edit them.  Here are three ways to get at decoded DOs: 

In [18]:
# get the DO directly from the Decoded_DOs dict
# Here we'll look at the text of the DO

Season_Decoded.__dict__.get('plot_area').text

'Ask the farmer to tell you the area of the plot (note: not necessarily the area planted)'

In [19]:
# A slightly easier way is to use the Decoded_DOs `dob` method
# It checks whether the name is known. If so, it returns the DO itself, 
# otherwise it warns and returns None

Season_Decoded.dob('plot_perimeter')
Season_Decoded.dob('plot_area').text
    

plot_perimeter cannot be found


'Ask the farmer to tell you the area of the plot (note: not necessarily the area planted)'

In [20]:
# A third way is to use the Decoded_DOs `describe` method
# It does not return the DO itself

# The first argument is the name, the rest are fields to describe.
# If the name is not known, this method warns and returns None

Farm_Decoded.describe('log_income','text','df')
Farm_Decoded.describe('income','text','df')
Season_Decoded.describe('plot_area','text','unit')

log_income cannot be found
log_income does not exist

text : The yearly income of the farmer
df :          income
0     1000680.0
1      798000.0
2      375000.0
3      100000.0
4           0.0
...         ...
3857        NaN
3858   107000.0
3859   500000.0
3860   750000.0
3861   400000.0

[3862 rows x 1 columns]


text : Ask the farmer to tell you the area of the plot (note: not necessarily the area planted)
unit : nan



## Searching for Data Objects ##

`DO_Encoder` objects have a method called `make_search_terms` that extracts all the content words from a DO's attributes  -- the text, notes, answers, column_dict, etc. -- and runs some rudimentary text processing over them.  This yields the search terms that you see, above.  This work is done with *un*-encoded `DO_Encoder` objects.  You can run `make_search_terms` manually if you like, but you needn't, as it's done automatically by the `DO_Encoder.encode` method.
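
To make the idea concrete, here is a rough sketch of the kind of processing involved, inferred from search terms printed in this notebook (for example `plotseason`, `farmid` and `cohen`). It is a guess at the behavior, not the code in `Data_Objects.py`: `toy_search_terms` and the tiny stopword list are hypothetical, and the stemmer falls back to plain lowercasing if `nltk` is not installed.

```python
import string

try:
    from nltk.stem import PorterStemmer
    _stem = PorterStemmer().stem          # the stemmer this notebook imports
except ImportError:
    _stem = str.lower                     # crude fallback so the sketch still runs

STOPWORDS = {'a', 'an', 'and', 'as', 'is', 'of', 'the', 'to'}   # hypothetical list
_DROP_PUNCTUATION = str.maketrans('', '', string.punctuation)

def toy_search_terms(*values):
    """Collect raw and stemmed content words from some attribute values."""
    terms = set()
    for value in values:
        # Underscores separate words; other punctuation is simply deleted.
        cleaned = str(value).replace('_', ' ').translate(_DROP_PUNCTUATION)
        for word in cleaned.split():
            if word.lower() in STOPWORDS:
                continue
            terms.add(word)               # the word as written
            terms.add(_stem(word))        # and its stem
    return terms

terms = toy_search_terms(
    'farm.id is used as a foreign key the plot and season data to link plot/season to farm',
    'Paul Cohen',
)
print(sorted(terms))
```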

The terms in a query are disjunctive unless they are tuples, in which case terms in the tuples are conjunctive. For example, `search('sowed',('seed','receive'))` finds all questions that talk about `sowed` OR (`seed` AND `receive`): 


In [21]:
Season_Decoded.search('sowed',('seeding','receiving'))

Building a term index...

text : The date that seeds were sown on the plot last season
unit : nan
source : plot_and_season_StLouis_2016_english.csv
country : Senegal
region : SRV
time_period : 2016-2017
DO_creator : Paul Cohen
name : sowing_date_season1
column_dict : None


text : How much seed did the farmer receive?
unit : kg
source : plot_and_season_StLouis_2016_english.csv
country : Senegal
region : SRV
time_period : 2016-2017
DO_creator : Paul Cohen
name : seed_amount_received_season2
column_dict : None


text : How many kilograms per hectare of seeds the farmer sowed last season
unit : kg/ha
source : plot_and_season_StLouis_2016_english.csv
country : Senegal
region : SRV
time_period : 2016-2017
DO_creator : Paul Cohen
name : seed_amount_season1
column_dict : None


text : What variety of seed did the farmer sow?
unit : nan
source : plot_and_season_StLouis_2016_english.csv
country : Senegal
region : SRV
time_period : 2016-2017
DO_creator : Paul Cohen
name : seed_variety_season2
co

The word `sowed` appears in only one of these DOs and `seeding` and `receiving` appear in none of them. The query returns several DOs because query terms are stemmed; `sowed` becomes `sow`, `seeding` becomes `seed` and `receiving` becomes `receiv`. 
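
The query semantics can be sketched with ordinary set operations over a term index. Everything in the block below is a toy: the index is built by hand from a few of the question texts shown above, `toy_stem` is a lookup table standing in for the Porter stemmer, and `toy_search` is a hypothetical function, not the `Decoded_DOs.search` method.

```python
# Stand-in for stemming: only the three query words used in this notebook
TOY_STEMS = {'sowed': 'sow', 'seeding': 'seed', 'receiving': 'receiv'}

def toy_stem(word):
    return TOY_STEMS.get(word.lower(), word.lower())

# Hand-built index: stemmed term -> names of DOs that mention it
index = {
    'sow': {'seed_amount_season1', 'seed_variety_season2', 'sowing_date_season2'},
    'seed': {'seed_amount_season1', 'seed_variety_season2', 'sowing_date_season2',
             'seed_amount_received_season2', 'seed_value_season2'},
    'receiv': {'seed_amount_received_season2', 'seed_value_season2'},
}

def toy_search(index, *query):
    """Top-level terms are OR-ed together; terms inside a tuple are AND-ed."""
    hits = set()
    for term in query:
        if isinstance(term, tuple):
            sets = [index.get(toy_stem(t), set()) for t in term]
            matched = set.intersection(*sets) if sets else set()
        else:
            matched = index.get(toy_stem(term), set())
        hits |= matched
    return hits

print(sorted(toy_search(index, 'sowed', ('seeding', 'receiving'))))
print(toy_search(index, ('sowed', 'receiving')))   # nothing mentions both
```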

If you don't want the entire question description, you can get just the names, and if you want to describe just part of the question, you can tell `describe` which parts:

In [22]:
# Set display = False to get just the names
print(Season_Decoded.search('sowed',('seeding','receiving'),display=False))
print()

#Set display to a list or tuple to get particular fields
print(Season_Decoded.search('sowed',('seeding','receiving'),display=['name','text']))


{'sowing_date_season1', 'seed_amount_received_season2', 'seed_amount_season1', 'seed_variety_season2', 'sowing_date_season2', 'seed_value_season2'}


name : sowing_date_season1
text : The date that seeds were sown on the plot last season


name : seed_amount_received_season2
text : How much seed did the farmer receive?


name : seed_amount_season1
text : How many kilograms per hectare of seeds the farmer sowed last season


name : seed_variety_season2
text : What variety of seed did the farmer sow?


name : sowing_date_season2
text : What date did the farmer sow the seeds?


name : seed_value_season2
text : What was the monetary value of the seed the farmer received?

None


## Creating, Editing and Saving Data Objects ##

Earlier, I showed how to create a batch of DOs, store them in an `Encoded_DOs` object, and then write them to a file.  

Another use case, perhaps a more common one, involves reading a file of encoded DOs, decoding them, editing some and adding others, and then writing the original and newly edited and created DOs back to a file.  The only tricky part of this process is *re-encoding* the decoded data objects. 

The `DO_Decoder` class has a method called `re_encode` that re-encodes itself, and the `Decoded_DOs` class has a method called `re_encode` that re-encodes all the decoded questions it contains.  Re-encoding differs from encoding in two ways:

1) Encoding, when done by the `encode` method of a `DO_Encoder`, builds a dataframe from supplied data.  Re-encoding, when done by a `DO_Decoder` object, does not build a dataframe but simply copies the dataframe in the object.  

2) Re-encoding does not automatically update the search terms associated with a DO.  A simple workaround is shown below. 

The following code block reads and decodes DOs from a file, then edits one and creates another, then re-encodes and writes all of the decoded DOs back to a file. 

In [23]:
# Read in the previously created data objects
Farm_Decoded.read_and_decode(filepath + 'farm_data_objects.txt')

# Edit a decoded data object. 
# ==========================
# We'll add a text string to a DO that's missing one

d = Farm_Decoded.dob('income_binned')
print (d.text)

# Add some metadata that describes how income_binned was derived
d.text = 'the bins are the quartile boundaries of income'

# Check whether it worked
Farm_Decoded.describe('income_binned','name','text')

# Create new data objects
#=========================
# We'll create a multi-answer question DO for the spouse's activities, as shown earlier.
# Note that we must encode these DOs if we want to create search terms for them

wife_activity = DO_Encoder(
    name = 'wife_activity',
    data = farm,
    column_dict = {'wife_trades' : 'trade', 'wife_food_crops' : 'food_crops', 'wife_crafts' : 'crafts'},
    answers = {1 : 'yes', 2: 'no'},
    text = 'Which of the following activities does your wife engage in?',
    note = "I don't know the actual survey question text",
    source = 'demographicfarmer_StLouis_2016_cleaned.csv',
    country = 'Senegal',
    region = 'SRV',
    time_period = '2016-2017',
    DO_creator = 'Paul Cohen'
).encode()

# By decoding the encoded DO, we add it to Farm_Decoded, along with search terms
Farm_Decoded.decode(wife_activity)

# We want to write all the decoded DOs in Farm_Decoded to a file, but this 
# requires re-encoding all the decoded objects:

Farm_Re_Encoded = Farm_Decoded.re_encode()

# Now write 
# Change the filepath for your local machine and remember it; you'll need it later.
Farm_Re_Encoded.write(filepath + 'farm_data_objects_new.txt')

# The Decoded_DOs method re_encode returns an Encoded_DOs object

print(type(Farm_Re_Encoded))

nan

name : income_binned
text : the bins are the quartile boundaries of income

<class 'Data_Objects.Encoded_DOs'>


`Farm_Re_Encoded` has been written to a file and now can be read in and decoded to see whether our previous work did what it was supposed to do, but here's a simpler approach:

In [24]:
for jstring in Farm_Re_Encoded.jstrings:
    decoded = DO_Decoder(jstring) # decode each re_encoded DO
    if decoded.name == 'wife_activity':  # grab the one that we just created
        print(decoded) 

DO_Decoder(DO_creator='Paul Cohen', answers={'1': 'yes', '2': 'no'}, column_dict=None, country='Senegal', df=      trade  food_crops  crafts
0         1           0       0
1         0           1       0
2         0           1       0
3         0           1       0
4         1           0       0
...     ...         ...     ...
3857      1           0       0
3858      0           0       0
3859      1           0       0
3860      0           0       0
3861      0           0       0

[3862 rows x 3 columns], name='wife_activity', note="I don't know the actual survey question text", region='SRV', search_terms=['cohen', 'text', 'srv', 'crop', 'engage', 'Senegal', 'activity', 'craft', 'cleanedcsv', '2016', 'demographicfarmer', 'activities', 'actual', 'SRV', 'trades', 'follow', 'following', 'Paul', 'trade', 'engag', 'demographicfarm', 'know', 'survey', 'Cohen', 'StLouis', 'food', 'wife', 'question', 'seneg', 'activ', 'paul', 'dont', 'crafts', 'stloui', 'crops', '20162017'], source='de

### Editing vs. creating new Data Objects ###

As a general rule, you shouldn't make permanent changes to DOs:  Other people might be using them!  If you want to change something, it's usually best to create a new DO and leave the old one be.  That said, you'll sometimes want to make minor changes or augmentations to decoded DOs; for example, earlier we added a `text` field to `income_binned`, which didn't have one. Another example is translating the strings `yes` and `no` to numeric codes 1 and 2, respectively.  Neither of these examples *loses* information. 

In general, if an edit to a DO loses information, you shouldn't do it.  Instead, create a new DO. 
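To make the distinction concrete, here is a minimal sketch in plain pandas (it does not use the Data Objects classes, and the data and the names `owns_phone`, `to_code`, and `to_label` are made up for illustration). A one-to-one recoding such as `yes`/`no` to 1/2 can be inverted, so nothing is lost; binning a numeric variable cannot be inverted, so the binned version belongs in a new DO.

```python
import pandas as pd

# Made-up answers, for illustration only
answers = pd.Series(["yes", "no", "no", "yes"], name="owns_phone")

# A one-to-one code map can be inverted, so the recoding loses nothing
to_code = {"yes": 1, "no": 2}
to_label = {code: label for label, code in to_code.items()}

coded = answers.map(to_code)
restored = coded.map(to_label)
recoding_is_lossless = restored.tolist() == answers.tolist()

# Binning collapses distinct incomes into the same bin, so it cannot be undone
income = pd.Series([0.0, 100000.0, 375000.0, 798000.0], name="income")  # made up
binned = pd.cut(income, bins=[-1, 200000, 1000000], labels=False)
binning_is_lossless = binned.nunique() == income.nunique()

print(recoding_is_lossless, binning_is_lossless)
```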

## Accessing and Manipulating data ##






This section describes how to access and build dataframes from the data in DOs.  To illustrate the methods, I'll set up two `Decoded_DOs` objects, one for the newly edited Manobi farm data, the other for CGAP data. The CGAP Data Objects code and the file `CGAP_JSON.txt` live [here](https://pitt.app.box.com/folder/136317983622) on the Pitt Box folder. 

Note that while the farm DOs are read into a `Decoded_DOs` object, the CGAP DOs are read into a `Decoded_CGAP_DOs` object.  The latter is a subclass of the former.  It is identical but for one additional method for assembling data from different countries, described below. 

In [37]:
# Corpus of Manobi farm Data Objects
filepath = '/Users/prcohen/anaconda2/envs/aPRAM/Habitus/Data/Manobi Data/SRV/processed/'
Farm = Decoded_DOs()
Farm.read_and_decode(filepath + 'farm_data_objects_new.txt')


# Corpus of CGAP Data Objects 
filepath = '/Users/prcohen/anaconda2/envs/aPRAM/Habitus/Data/Data Objects/'
CGAP = Decoded_CGAP_DOs()
CGAP.read_and_decode(filepath+'CGAP_JSON.txt')

The `DO_Decoder` class and the `Decoded_DOs` class each have a method called `cols` for getting columns of data. These methods have slightly different syntactic forms because, in the `DO_Decoder` class, `cols` is getting data for `self`, but in the `Decoded_DOs` class, `cols` needs to be told which questions to get data for. This is easiest to see in some examples. 

In [38]:
# Farm.dob returns a data object identified by a question id
# For a single-answer question, cols returns a single column of data

Farm.dob('income_binned').cols()

Unnamed: 0,income_binned
0,3.0
1,3.0
2,3.0
3,1.0
4,0.0
...,...
3857,
3858,1.0
3859,3.0
3860,3.0


In [39]:
# For a multi-answer question, cols with no argument returns all the answer columns
Farm.dob('wife_activity').cols()

Unnamed: 0,trade,food_crops,crafts
0,1,0,0
1,0,1,0
2,0,1,0
3,0,1,0
4,1,0,0
...,...,...,...
3857,1,0,0
3858,0,0,0
3859,1,0,0
3860,0,0,0


In [40]:
# Specify which answer columns to return from a multi-answer question:
Farm.dob('wife_activity').cols('trade','crafts')

Unnamed: 0,trade,crafts
0,1,0
1,0,0
2,0,0
3,0,0
4,1,0
...,...,...
3857,1,0
3858,0,0
3859,1,0
3860,0,0


The preceding examples show how to get data from individual data objects, but if you want to join data from different DOs, then use the `Decoded_DOs.cols` method.  Its syntax is slightly different because you have to say which DOs you want:

In [41]:
# Build one dataframe from two single-answer questions
Farm.cols('income_binned','income')


Unnamed: 0,income_binned,income
0,3.0,1000680.0
1,3.0,798000.0
2,3.0,375000.0
3,1.0,100000.0
4,0.0,0.0
...,...,...
3857,,
3858,1.0,107000.0
3859,3.0,500000.0
3860,3.0,750000.0


In [42]:
# Build one dataframe from a single-answer question and one column 
# of a multi-answer question.  Note the syntax: (x,*args) means 
# data object with name == x and answer columns *args

Farm.cols('income',('wife_activity','trade'))

Unnamed: 0,income,trade
0,1000680.0,1
1,798000.0,0
2,375000.0,0
3,100000.0,0
4,0.0,1
...,...,...
3857,,1
3858,107000.0,0
3859,500000.0,1
3860,750000.0,0
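The mixed argument syntax is easier to remember once you see it reduced to a uniform shape. The helper below, `normalize_requests`, is hypothetical and is not part of the Data Objects package; it is only a sketch of one way a list of names and `(name, *answer_columns)` tuples could be interpreted.

```python
def normalize_requests(*requests):
    """Turn a mix of names and (name, *answer_columns) tuples into uniform pairs."""
    normalized = []
    for request in requests:
        if isinstance(request, str):
            # A bare name means "all columns of this data object"
            normalized.append((request, ()))
        else:
            # A tuple means "this data object, restricted to these answer columns"
            name, *answer_columns = request
            normalized.append((name, tuple(answer_columns)))
    return normalized


parsed = normalize_requests('income', ('wife_activity', 'trade'))
print(parsed)
```

Under this reading, an empty tuple of answer columns stands for "return everything", which matches how `cols()` behaves with no arguments in the examples above.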


### Beware: Behind-the-scenes DataFrame Joins ###

The `Decoded_DOs.cols` method performs inner joins on dataframe indices.  If the columns don't share the same index values, the join still happens, but any row whose index value is missing from one of the columns is silently dropped, so the result may be smaller than you expect, or empty.  For example, in the CGAP data, the indices are household IDs.  You can join variables within a country because the variables have the same indices.  For instance, you can assemble a dataframe from the single-answer question H6 and the multi-answer question H7 in Mozambique because each question has Mozambique's household IDs as an index:

In [43]:
CGAP.cols('moz_H6','moz_H7')

Unnamed: 0,H6,farmer,professional,shop_owner,business_owner,laborer,other,no_secondary_job
22552580,1,1,2,2,2,2,2,1
22487045,1,1,2,2,2,2,2,1
22159366,4,2,2,2,2,2,2,2
22790149,1,1,2,2,2,2,2,1
22790150,1,1,2,2,2,2,2,1
...,...,...,...,...,...,...,...,...
22757331,1,1,2,2,2,2,2,1
22552539,1,2,2,2,2,2,2,2
22200293,4,1,2,2,2,2,2,1
22102008,2,2,2,2,2,2,1,1


But if you try to assemble a dataframe from Mozambique and Bangladesh data, you'll get an empty dataframe because the indices of these two countries have an empty intersection (i.e., household IDs are unique not only within but also across countries).  

In [44]:
CGAP.cols('moz_H6','bgd_H7')

Unnamed: 0,H6,farmer,professional,shop_owner,business_owner,laborer,other,no_secondary_job


This behavior is correct: Our implicit assumption is that a row of data comes from one source, so a row should not include data from two different households.
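The same effect can be reproduced in plain pandas. This is only a sketch with made-up household IDs and values, and it uses `DataFrame.join(how="inner")` as a stand-in; it is not necessarily how `Decoded_DOs.cols` is implemented internally.

```python
import pandas as pd

# Made-up household IDs: the two "countries" share no IDs
moz_h6 = pd.DataFrame({"H6": [1, 4, 1]}, index=[101, 102, 103])
moz_h7 = pd.DataFrame({"farmer": [1, 1, 2]}, index=[101, 102, 103])
bgd_h7 = pd.DataFrame({"farmer": [2, 1]}, index=[201, 202])

# Same index values: every household survives the inner join
same_country = moz_h6.join(moz_h7, how="inner")

# Disjoint index values: the columns are there, but no rows survive
cross_country = moz_h6.join(bgd_h7, how="inner")

print(len(same_country), len(cross_country))
```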

So, pay attention to the indices of dataframes. Personally, I don't like pandas' implicit indexing; I'd rather build the indices for dataframes myself so that they mirror something that matters, such as household ID or farm ID.  That way, I can be sure that when variables are joined, they are joined on household ID, farm ID, or some other indicator of where the data comes from. 
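As a sketch of that habit in plain pandas (the column name `household_id` and all values are made up), you can set a meaningful index explicitly and check how much the indices overlap before joining:

```python
import pandas as pd

# Made-up data; note that the rows are in a different order in each frame
households = pd.DataFrame({"household_id": [101, 102, 103], "A1": [1.0, 4.0, 2.0]})
crops = pd.DataFrame({"household_id": [103, 101, 104], "Rice": [2.0, 1.0, 1.0]})

# Index on something that matters rather than on row position
households = households.set_index("household_id")
crops = crops.set_index("household_id")

# Inspect the overlap before joining
shared = households.index.intersection(crops.index)
unmatched = households.index.symmetric_difference(crops.index)

# Rows are matched on household ID, regardless of their order in each frame
joined = households.join(crops, how="inner")
print(joined)
```

Checking `unmatched` first tells you which households an inner join will drop, instead of discovering it from a dataframe that is shorter than expected.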

### CGAP Data:  Appending columns from different countries ###

The CGAP dataset contains data from six countries.  You might want to get the same columns from two or more countries and append them into a single dataframe.  Question ids in CGAP data have the country code as a prefix (e.g., question H6 for Mozambique is named `moz_H6`), but if you want to assemble, say, five variables for six countries, you'd have to write out 5 × 6 = 30 unique names, which is annoying! 

Instead, use the `Decoded_CGAP_DOs.cols_from_countries` method. It has exactly the same syntax as the `cols` method but has an additional keyword argument for specifying which countries you want:

In [45]:
CGAP.cols_from_countries('A6',('A5','Maize'),'A8',countries=['bgd','cdi'])

Unnamed: 0,A6,Maize,A8
1,1.0,2.0,1.0
2,1.0,2.0,1.0
3,,,
4,1.0,2.0,1.0
5,1.0,2.0,1.0
...,...,...,...
31186932,7.0,1.0,1.0
31186933,7.0,1.0,1.0
31186934,7.0,1.0,1.0
31285239,3.0,1.0,1.0
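For intuition, here is a rough sketch of the bookkeeping that `cols_from_countries` saves you. It uses plain pandas with made-up data, the helper `prefixed_names` is hypothetical, and it is not the package's implementation. It uses only the five country codes that appear in this notebook.

```python
import pandas as pd


def prefixed_names(questions, countries):
    """Build the country-prefixed question ids, e.g. 'moz_H6'."""
    return {c: [f"{c}_{q}" for q in questions] for c in countries}


names = prefixed_names(["A1", "A6", "A8", "H6", "H7"],
                       ["bgd", "cdi", "moz", "nga", "tan"])
print(sum(len(v) for v in names.values()), "names to type by hand")

# Made-up per-country frames with the same columns but disjoint household IDs
frames = {
    "bgd": pd.DataFrame({"A6": [1.0, 1.0], "A8": [1.0, 1.0]}, index=[1, 2]),
    "cdi": pd.DataFrame({"A6": [7.0, 3.0], "A8": [1.0, 1.0]}, index=[901, 902]),
}

# Appending stacks rows from each country; it is not a join on the index
appended = pd.concat([frames[c] for c in ["bgd", "cdi"]])
print(appended)
```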


### Warning: cols and cols_from_countries return as much as possible ###

If you ask for something that `cols` or `cols_from_countries` cannot provide, they will warn you and then they'll provide whatever they can.  For example, if you ask for two columns of data but only one can be found, that's what you'll get: 

In [46]:
CGAP.cols('moz_A1','foo')

foo cannot be found


Unnamed: 0,A1
22552580,1.0
22487045,4.0
22159366,2.0
22790149,3.0
22790150,3.0
...,...
22757331,2.0
22552539,5.0
22200293,1.0
22102008,1.0


Similarly, if you ask for answer columns for multi-answer questions, you'll get any that exist and you'll be warned about those that don't:

In [47]:
CGAP.cols('moz_A1',('moz_A5', 'foo', 'Rice', 'Beans'))

At least one of the columns you requested isn't in moz_A5 dataframe


Unnamed: 0,A1,Rice,Beans
22552580,1.0,2.0,2.0
22487045,4.0,2.0,2.0
22159366,2.0,2.0,2.0
22790149,3.0,2.0,1.0
22790150,3.0,2.0,2.0
...,...,...,...
22757331,2.0,2.0,2.0
22552539,5.0,2.0,2.0
22200293,1.0,2.0,2.0
22102008,1.0,2.0,1.0
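The "warn, then return what exists" behavior can be sketched in plain pandas as follows. The function `available_cols` and the data are made up for illustration, and the warning text is not the package's.

```python
import pandas as pd


def available_cols(df, *wanted, label="dataframe"):
    """Return the requested columns that exist; report the ones that don't."""
    found = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    if missing:
        print(f"Not found in {label}: {', '.join(missing)}")
    return df[found]


# Made-up answer columns for a multi-answer question
a5 = pd.DataFrame(
    {"Rice": [2.0, 2.0], "Beans": [2.0, 1.0], "Maize": [1.0, 1.0]},
    index=[101, 102],
)

subset = available_cols(a5, "foo", "Rice", "Beans", label="A5")
print(subset)
```

By contrast, ordinary pandas selection such as `a5[["foo", "Rice"]]` raises a `KeyError` when any requested column is missing, so the best-effort approach trades strictness for convenience.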


However, this helpful inclination to provide whatever exists can yield surprising results. If you ask for data that exists in some countries but not others, you'll get the data that exists and NaNs for the countries that don't have it.  For example, the CGAP data doesn't have an answer column called 'Wheat' for four countries, but 'Wheat' is an answer for two other countries:

In [48]:
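# Note: 'moz' appears twice in this list, so moz_A5 is requested twice
# (see the repeated warning in the output below)
wheat = CGAP.cols_from_countries(('A5','Wheat'), countries = ['bgd','cdi','moz','nga','tan','moz'])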
print(f"\nThe dataframe includes {len(wheat)} records of which {wheat['Wheat'].isna().sum()} are NaN")

At least one of the columns you requested isn't in cdi_A5 dataframe
At least one of the columns you requested isn't in moz_A5 dataframe
At least one of the columns you requested isn't in tan_A5 dataframe
At least one of the columns you requested isn't in moz_A5 dataframe

The dataframe includes 16732 records of which 11306 are NaN
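To see where such NaNs come from and how to trace them, here is a small sketch in plain pandas with made-up data (the real CGAP frames are much larger, and this is not the package's code). Appending frames that don't all have the same columns fills the gaps with NaN, and counting NaNs per country shows which sources lacked the column.

```python
import pandas as pd

# Made-up frames: 'moz' has no 'Wheat' answer column
frames = {
    "bgd": pd.DataFrame({"Maize": [2.0, 2.0], "Wheat": [1.0, 2.0]}, index=[1, 2]),
    "nga": pd.DataFrame({"Maize": [1.0], "Wheat": [2.0]}, index=[501]),
    "moz": pd.DataFrame({"Maize": [1.0, 1.0, 2.0]}, index=[701, 702, 703]),
}

# Passing a dict keeps the country as an extra index level
wheat = pd.concat(frames, names=["country", "household_id"])

# Count the NaNs contributed by each country
nan_by_country = wheat["Wheat"].isna().groupby(level="country").sum()

print(f"{len(wheat)} records, {wheat['Wheat'].isna().sum()} NaN")
print(nan_by_country)
```

A per-country count like this makes it easy to tell NaNs that mean "this country never had the answer column" from NaNs that mean "this household didn't answer".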
