In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import convokit
from convokit import Corpus, Conversation, Speaker, Utterance, StorageManager
import sys

print('Which system install of python is running? :', sys.executable)
print('Which version of convokit is imported? :', convokit.__file__)


Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


Which system install of python is running? : /reef/jes543_test_convokit_dbstorage/convokit_db_test/bin/python
Which version of convokit is imported? : /reef/jes543_test_convokit_dbstorage/Cornell-Conversational-Analysis-Toolkit/convokit/__init__.py


In [4]:
existing_corpus_db = Corpus(corpus_id='test_corpus', storage_type='db', in_place=True)

assert existing_corpus_db.get_utterance('11').text == 'Changing the text of utterance 11 in corpus1.'
assert existing_corpus_db.get_utterance('12').meta == {'favorite': True}

for utt in existing_corpus_db.iter_utterances():
    print(f'{utt.speaker}: {utt.text}')

Corpus test_corpus_v0 not found in the DB; building new corpus
Speaker(id: Bob): This is the 0th utterance.
Speaker(id: Jim): This is the 1th utterance.
Speaker(id: Rachel): This is the 2th utterance.
Speaker(id: Bob): This is the 3th utterance.
Speaker(id: Jim): This is the 4th utterance.
Speaker(id: Rachel): This is the 5th utterance.
Speaker(id: Bob): This is the 6th utterance.
Speaker(id: Jim): This is the 7th utterance.
Speaker(id: Rachel): This is the 8th utterance.
Speaker(id: Bob): This is the 9th utterance.
Speaker(id: Jim): This is the 10th utterance.
Speaker(id: Rachel): Changing the text of utterance 11 in corpus1.
Speaker(id: Bob): This is the 12th utterance.
Speaker(id: Jim): This is the 13th utterance.
Speaker(id: Rachel): This is the 14th utterance.
Speaker(id: Bob): This is the 15th utterance.
Speaker(id: Jim): This is the 16th utterance.
Speaker(id: Rachel): This is the 17th utterance.
Speaker(id: Bob): This is the 18th utterance.
Speaker(id: Jim): This is the 19th ut

## Traditional ConvoKit Corpora
Traditionally, ConvoKit corpora exist only in RAM.

In [4]:
speakers = {0: Speaker(id='Bob'),1: Speaker(id='Jim'),2: Speaker(id='Rachel')}
corpus1_mem = Corpus(utterances=[Utterance(id=str(i), 
                                text=f'This is the {i}th utterance.', 
                                reply_to=i-1 if i > 0 else None, 
                                speaker=speakers[i % 3]) for i in range(100)],
                    storage_type='mem')

# Once a program exits, the data stored in RAM Corpus is no longer around unless 
# we explicitly dump its contents to disk for long term storage
corpus1_mem.dump('test_corpus')

# We can later access the same data in convokit by loading this data from disk into
# RAM with a new Corpus object. 
corpus2_mem = Corpus(corpus_id='test_corpus',
                    storage_type='mem')

# These two corpora contain the same data
assert corpus1_mem.get_utterance('10').text == corpus2_mem.get_utterance('10').text

# Modifications to one corpus are not reflected in the other corpus - they are distinct copies
corpus1_mem.get_utterance('11').text = 'Changing the text of utterance 11 in corpus1.'
assert corpus2_mem.get_utterance('11').text == 'This is the 11th utterance.'

corpus2_mem.get_utterance('12').meta['favorite'] = True
assert corpus1_mem.get_utterance('12').meta == {}

True


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 241051.95it/s]


dump to  /home/jes543/.convokit/saved-corpora/test_corpus
True
Loading corpus test_corpus from disk at /home/jes543/.convokit/saved-corpora/test_corpus


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 415689.20it/s]


## Introducing DB Corpora
ConvoKit now also offers database-backed storage for corpora.

In [5]:
# Clearing the database for repeatability 
StorageManager('db', corpus_id='test_corpus').purge_all_collections()



In [6]:
corpus1_db = Corpus(utterances=[Utterance(id=str(i), 
                                text=f'This is the {i}th utterance.', 
                                reply_to=i-1 if i > 0 else None, 
                                speaker=speakers[i % 3]) for i in range(100)],
                    storage_type='db', 
                    corpus_id='test_corpus')

Corpus test_corpus_v0 not found in the DB; building new corpus


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 624.63it/s]


Once a program exits, the data stored in a DB corpus remains in the database, with no concept of, or need for, a dump.

We can later access the same data in ConvoKit by constructing a corpus object with the same id. By default, this creates a copy of the accessed corpus, giving ConvoKit users the behavior they expect: two distinct corpora to work with.

In [7]:
corpus2_db = Corpus(corpus_id='test_corpus', storage_type='db')

# corpus1_db and corpus2_db contain the same data
assert corpus1_db.get_utterance('10').text == corpus2_db.get_utterance('10').text

# However, modifications to one corpus are not reflected in the other corpus - they are distinct copies
corpus1_db.get_utterance('11').text = 'Changing the text of utterance 11 in corpus1.'
assert corpus2_db.get_utterance('11').text == 'This is the 11th utterance.'

corpus1_db.get_utterance('12').meta['favorite'] = True
assert corpus2_db.get_utterance('12').meta == {}

Copying corpus test_corpus v0to corpus test_corpus_v0.1


On the other hand, we can specify `in_place=True` when constructing a DB corpus to connect directly to an existing dataset in the database with read/write access.

In [8]:
corpus3_db = Corpus(corpus_id='test_corpus', storage_type='db', in_place=True)

assert corpus3_db.get_utterance('11').text == 'Changing the text of utterance 11 in corpus1.'
assert corpus3_db.get_utterance('12').meta == {'favorite': True}

# Similarly, modifications to corpus3_db will be reflected in corpus1_db, but not in corpus2_db
corpus3_db.get_speaker('Bob').meta['height'] = '6\'2"'
assert corpus1_db.get_speaker('Bob').meta == {'height': '6\'2"'}
assert corpus2_db.get_speaker('Bob').meta == {}

Corpus test_corpus_v0 not found in the DB; building new corpus


## Naming Conventions for DB Corpora

As demonstrated above, DB corpora should be initialized with a `corpus_id` to set the long-term storage name for a corpus.

If a corpus is initialized with `in_place=False` (the default value) and with a `corpus_id` that is already taken in the database, then ConvoKit will automatically pick a similar but unique `corpus_id` for the new corpus, which tracks a copy of the original data referred to by that `corpus_id`.

For example, when `corpus2_db` was initialized above with `corpus_id='test_corpus'` and `in_place=False`, it was assigned a new unique `corpus_id`, since `corpus1_db` had already taken the id `'test_corpus'`. Moreover, `corpus2_db` was initialized to hold the contents of the data identified by the id `'test_corpus'` (i.e. what is in `corpus1_db`), but its unique effective id ensures that modifications to `corpus2_db` are not reflected back in `corpus1_db`.

After initializing a corpus with `in_place=False`, `corpus.id == corpus.storage.corpus_id` holds the unique 'effective id', which ConvoKit may have modified and which actually identifies the data in the database, while `corpus.storage.raw_corpus_id` holds the original, unmodified id supplied by the user, identifying the data this corpus was initialized from.

For example:

In [21]:
print('variable name\t\tcorpus.storage.corpus_id\tcorpus.storage.version\tcorpus.storage.full_name')
print('-----------------------------------------------------------------------------------------------')
print('corpus1_db\t\t', corpus1_db.storage.corpus_id, 
      '\t\t\t',corpus1_db.storage.version, '\t\t\t',corpus1_db.storage.full_name)
print('corpus2_db\t\t', corpus2_db.storage.corpus_id, 
      '\t\t\t',corpus2_db.storage.version, '\t\t\t',corpus2_db.storage.full_name)



variable name		corpus.storage.corpus_id	corpus.storage.version	corpus.storage.full_name
-----------------------------------------------------------------------------------------------
corpus1_db		 test_corpus 			 0 			 test_corpus_v0
corpus2_db		 test_corpus 			 0.1 			 test_corpus_v0.1


On the other hand, initializing a corpus with `in_place=True` connects the new corpus directly to the database using the identifier provided if that identifier refers to an existing dataset, or creates a new empty corpus with that name if no such dataset exists. In this case, `corpus.id == corpus.storage.raw_corpus_id`.

In [22]:
print('variable name\t\tcorpus.storage.corpus_id\tcorpus.storage.version\tcorpus.storage.full_name')
print('-----------------------------------------------------------------------------------------------')
print('corpus3_db\t\t', corpus3_db.storage.corpus_id, 
      '\t\t\t',corpus3_db.storage.version, '\t\t\t',corpus3_db.storage.full_name)

variable name		corpus.storage.corpus_id	corpus.storage.version	corpus.storage.full_name
-----------------------------------------------------------------------------------------------
corpus3_db		 test_corpus 			 0 			 test_corpus_v0
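
The difference between the default copy behavior and `in_place=True` can be illustrated with a toy, in-memory stand-in for the database. This is only a sketch: `fake_db`, `full_name`, `connect`, and the rule used to pick the copy's version (`0.1`, `0.2`, ...) are assumptions made for illustration, not ConvoKit's actual implementation.

```python
import copy

# Toy stand-in for the database: maps a full name such as 'test_corpus_v0'
# to the stored data. Everything in this block is hypothetical.
fake_db = {}


def full_name(corpus_id, version):
    return f'{corpus_id}_v{version}'


def connect(corpus_id, in_place=False):
    """Return (version, data) for a corpus opened under corpus_id."""
    base = full_name(corpus_id, '0')
    if base not in fake_db:
        # Nothing is stored under this id yet: start a new, empty dataset.
        fake_db[base] = {}
        return '0', fake_db[base]
    if in_place:
        # Attach directly to the existing data (shared reads and writes).
        return '0', fake_db[base]
    # Otherwise copy the data under a fresh version so edits stay separate.
    # Assumed scheme: try 0.1, 0.2, ... until an unused name is found.
    n = 1
    while full_name(corpus_id, f'0.{n}') in fake_db:
        n += 1
    version = f'0.{n}'
    fake_db[full_name(corpus_id, version)] = copy.deepcopy(fake_db[base])
    return version, fake_db[full_name(corpus_id, version)]


v1, data1 = connect('test_corpus')                  # builds a new dataset
data1['11'] = 'original text'
v2, data2 = connect('test_corpus')                  # distinct copy
v3, data3 = connect('test_corpus', in_place=True)   # same data as data1
data1['11'] = 'edited text'
```

In this sketch, the edit made through `data1` is visible through `data3` but not through `data2`, mirroring the relationship between `corpus1_db`, `corpus3_db`, and `corpus2_db` above.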


Because a DB-based corpus needs an identifier to connect to the database, a `corpus_id` is picked automatically if none is specified at initialization.

In [23]:
unnamed_db_corpus = Corpus(storage_type='db')
print('variable name\t\tcorpus.storage.corpus_id\tcorpus.storage.version\tcorpus.storage.full_name')
print('-----------------------------------------------------------------------------------------------')
print('unnamed_db_corpus\t\t', unnamed_db_corpus.storage.corpus_id, 
      '\t\t\t',unnamed_db_corpus.storage.version, '\t\t\t',unnamed_db_corpus.storage.full_name)

No filename or corpus name specified for DB storage; using name 469355
Corpus 469355_v0 not found in the DB; building new corpus
variable name		corpus.storage.corpus_id	corpus.storage.version	corpus.storage.full_name
-----------------------------------------------------------------------------------------------
unnamed_db_corpus		 469355 			 0 			 469355_v0


## Converting Between Storage Formats
We can use the `corpus.copy_as` function to convert between mem and DB storage modes.

In [24]:
# Mem -> DB
mem_corpus = Corpus(utterances=[Utterance(id=str(i), 
                                text=f'Memory Utterance #{i}', 
                                reply_to=i-1 if i > 0 else None, 
                                speaker=speakers[i % 3]) for i in range(100)],
                    storage_type='mem')
db_corpus_from_mem = mem_corpus.copy_as('db', corpus_id='dbcorpus_from_memcorpus')
print(mem_corpus.storage.storage_type)
print(db_corpus_from_mem.storage.storage_type)
assert mem_corpus.get_utterance('30') == db_corpus_from_mem.get_utterance('30')

# DB -> mem
db_corpus = Corpus(utterances=[Utterance(id=str(i), 
                                text=f'DB Utterance #{i}', 
                                reply_to=i-1 if i > 0 else None, 
                                speaker=speakers[i % 3]) for i in range(100)],
                    storage_type='db')
mem_corpus_from_db = db_corpus.copy_as('mem', corpus_id='dbcorpus_from_memcorpus')
print(mem_corpus_from_db.storage.storage_type)
print(db_corpus.storage.storage_type)
assert mem_corpus_from_db.get_utterance('30') == db_corpus.get_utterance('30')

True


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 122104.92it/s]

copy as  dbcorpus_from_memcorpus





mem
db
No filename or corpus name specified for DB storage; using name 822848
Corpus 822848_v0 not found in the DB; building new corpus


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 597.69it/s]


copy as  dbcorpus_from_memcorpus
mem
db


## Default Values
I have introduced a configuration file, which should live at `~/.convokit/config.yml`, to control ConvoKit's behavior. These are the default contents of the file:
```
# Default Storage Parameters
db_host : localhost:27017 
data_dir: ~/.convokit/saved-corpora
default_storage_mode: mem
```

These defaults can all be overridden in code when initializing `Corpus` objects or calling functions such as `dump` (for `data_dir`).

When a corpus is initialized by `corpus_id` with mem storage, it will try to load that `corpus_id` from the `data_dir` (the given value, or the default if none is specified). Likewise, when a corpus is initialized by `corpus_id` with DB storage, it will try to load that `corpus_id` from the database at `db_host` (the given value, or the default if none is specified). Finally, a `dump` operation will try to dump a corpus into a folder identified by the `corpus_id` within the `data_dir`.

A memory corpus can still be loaded from disk using the `filename` parameter for backwards compatibility. To load a corpus from JSON lists on disk into database storage, you should first load the dataset from disk as a mem corpus, then convert it to a DB corpus.
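
The override rule described above (a value passed in code wins, otherwise the configured default applies) can be sketched as follows. The helpers `parse_simple_config` and `resolve` are hypothetical names introduced for this illustration, and the hand-rolled parser only handles flat `key: value` lines; this is not how ConvoKit itself reads the file, and a real YAML parser would normally be used instead.

```python
DEFAULT_CONFIG_TEXT = """\
# Default Storage Parameters
db_host : localhost:27017
data_dir: ~/.convokit/saved-corpora
default_storage_mode: mem
"""


def parse_simple_config(text):
    """Parse flat 'key: value' lines, splitting on the first colon only."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        key, _, value = line.partition(':')
        settings[key.strip()] = value.strip()
    return settings


def resolve(name, explicit, config):
    """An explicitly passed value wins; otherwise use the configured default."""
    return explicit if explicit is not None else config[name]


config = parse_simple_config(DEFAULT_CONFIG_TEXT)
storage_type = resolve('default_storage_mode', None, config)  # falls back to the file
data_dir = resolve('data_dir', '/tmp/my-corpora', config)      # overridden in code
db_host = resolve('db_host', None, config)
```

Splitting on the first colon only matters here because the `db_host` value itself contains a colon.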


## Implementation details: Abstract Storage
To implement abstract storage, I introduced the `convokit.storage` submodule. This submodule introduces the `StorageManager`, `DBCollectionMapping`, `DBDocumentMapping`, `MemCollectionMapping`, and `MemDocumentMapping` classes; I also moved `ConvoKitIndex` into the storage submodule.

In my implementation, every instance of a `Corpus`, `CorpusComponent`, or `ConvoKitMeta` class has an instance variable `self.storage : StorageManager`. This functions as a centralized storage location: `CorpusComponent` and `ConvoKitMeta` objects within a corpus are internally given the same `StorageManager` as their owner corpus, which is used to implement operations between corpus components such as `utterance.get_speaker`, `conversation.iter_utterances`, and `speaker.iter_conversations`. All references between corpus components are stored as object ids, with the real data held by the `StorageManager`. For example, a `Conversation` object now stores only a list of `utterance_ids` and uses the `StorageManager` to get the actual utterances.
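
The id-based referencing described above can be sketched with plain Python objects. The classes `ToyStorage`, `ToyUtterance`, and `ToyConversation` are hypothetical and much simpler than the real `StorageManager` and corpus components; the sketch only shows the idea of components holding ids and resolving them through one shared storage object.

```python
class ToyStorage:
    """Hypothetical stand-in for a shared storage manager."""

    def __init__(self):
        self.utterances = {}      # utterance id -> ToyUtterance
        self.conversations = {}   # conversation id -> ToyConversation


class ToyUtterance:
    def __init__(self, storage, utt_id, text):
        self.storage = storage
        self.id = utt_id
        self.text = text
        storage.utterances[utt_id] = self


class ToyConversation:
    def __init__(self, storage, convo_id, utterance_ids):
        self.storage = storage
        self.id = convo_id
        # Only ids are kept here; the utterances live in the shared storage.
        self.utterance_ids = list(utterance_ids)
        storage.conversations[convo_id] = self

    def iter_utterances(self):
        # Resolve each stored id through the shared storage on demand.
        for utt_id in self.utterance_ids:
            yield self.storage.utterances[utt_id]


storage = ToyStorage()
ToyUtterance(storage, '0', 'This is the 0th utterance.')
ToyUtterance(storage, '1', 'This is the 1th utterance.')
convo = ToyConversation(storage, 'c0', ['0', '1'])
texts = [utt.text for utt in convo.iter_utterances()]
```

Because the conversation never holds utterance objects directly, swapping the dictionaries in `ToyStorage` for database-backed mappings would not require changing `ToyConversation` at all.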

A `StorageManager` contains the instance variables `_utterances`, `_conversations`, `_speakers`, `_metas`.
Each of these is a `MutableMapping` from object ids to objects of that type.
If this instance of `StorageManager` is implementing DB storage, then each of these instance variables will be a `DBCollectionMapping`; if it is implementing in-memory storage, then each of them will be a `MemCollectionMapping`.

In [25]:
print('Corpus initialized for mem storage')
print('storage_type:',mem_corpus.storage.storage_type)
print('Collections Type:',type(mem_corpus.storage._utterances))

print('\nCorpus initialized for db storage')
print('storage_type:',db_corpus.storage.storage_type)
print('Collections Type:',type(db_corpus.storage._utterances))

print('\nCorpus initialized for mem storage as a copy from a db corpus')
print('storage_type:',mem_corpus_from_db.storage.storage_type)
print('Collections Type:',type(mem_corpus_from_db.storage._utterances))

print('\nCorpus initialized for db storage as a copy from a mem corpus')
print('storage_type:',db_corpus_from_mem.storage.storage_type)
print('Collections Type:',type(db_corpus_from_mem.storage._utterances))


Corpus initialized for mem storage
storage_type: mem
Collections Type: <class 'convokit.storage.memMappings.MemCollectionMapping'>

Corpus initialized for db storage
storage_type: db
Collections Type: <class 'convokit.storage.dbMappings.DBCollectionMapping'>

Corpus initialized for mem storage as a copy from a db corpus
storage_type: mem
Collections Type: <class 'convokit.storage.memMappings.MemCollectionMapping'>

Corpus initialized for db storage as a copy from a mem corpus
storage_type: db
Collections Type: <class 'convokit.storage.dbMappings.DBCollectionMapping'>


A `MemCollectionMapping` is essentially a wrapper around a Python dict, directly implementing the mapping from object ids to objects in program memory.

A `DBCollectionMapping`, on the other hand, appears to offer the same functionality but does a lot more under the hood. When an object is inserted into a `DBCollectionMapping`, a database document representing the object is pushed into the database. When reading from a `DBCollectionMapping` by object id, the stored data is retrieved from the database and the object is automatically reconstructed from the retrieved database document. To use this automatic reconstruction, the `DBCollectionMapping` must be initialized with a type declaration, and that type must provide a class method `from_dbdoc` that constructs objects of that type from database documents.
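
The insert-as-document, read-and-reconstruct behavior can be sketched with a `MutableMapping` whose backing store is an ordinary dict standing in for a database collection. `ToyDBCollectionMapping`, `ToyUtterance`, and the `to_dbdoc` method are hypothetical names chosen for this illustration; only `from_dbdoc` is named in the text above, and the real class talks to an actual database.

```python
from collections.abc import MutableMapping


class ToyUtterance:
    def __init__(self, utt_id, text):
        self.id = utt_id
        self.text = text

    def to_dbdoc(self):
        # Hypothetical serializer; the text above only names from_dbdoc.
        return {'id': self.id, 'text': self.text}

    @classmethod
    def from_dbdoc(cls, doc):
        # Rebuild an object of this type from a stored document.
        return cls(doc['id'], doc['text'])


class ToyDBCollectionMapping(MutableMapping):
    def __init__(self, item_type, collection):
        self.item_type = item_type     # type used to reconstruct objects
        self.collection = collection   # dict standing in for a DB collection

    def __setitem__(self, key, value):
        # Inserting an object stores a document, not the object itself.
        self.collection[key] = value.to_dbdoc()

    def __getitem__(self, key):
        # Reading fetches the document and reconstructs the object.
        return self.item_type.from_dbdoc(self.collection[key])

    def __delitem__(self, key):
        del self.collection[key]

    def __iter__(self):
        return iter(self.collection)

    def __len__(self):
        return len(self.collection)


backing = {}
utterances = ToyDBCollectionMapping(ToyUtterance, backing)
utterances['0'] = ToyUtterance('0', 'This is the 0th utterance.')
restored = utterances['0']
```

Note that `restored` is a freshly constructed object on every read in this sketch, which is why the type declaration and `from_dbdoc` are needed.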


In addition to the abstraction for collections of objects described above, I also introduce an abstraction for storing data within objects: any object that can be stored in a collection (any `CorpusComponent` or `ConvoKitMeta`) now has an instance variable `fields`, a `MutableMapping` from field names to the data held in each field. Essentially, what these objects used to store in instance variables now lives inside `fields` under the same name. I use Python properties to hide this abstraction from users; for example, consider this excerpt from corpusComponent.py:
```
    @property
    def utterance_ids(self):
        return self.fields['utterance_ids']

    @utterance_ids.setter
    def utterance_ids(self, new_utterance_ids):
        self.fields['utterance_ids'] = new_utterance_ids

```
If an object is stored in memory, its fields will be a `MemDocumentMapping`; if an object is stored in a database, its fields will be a `DBDocumentMapping`. For example:

In [26]:
print('Utterance from a Mem Corpus')
mem_utt = mem_corpus.get_utterance('0')
print('Fields Type:',type(mem_utt.fields))

print('Utterance from a DB Corpus')
db_utt = db_corpus.get_utterance('0')
print('Fields Type:',type(db_utt.fields))

Utterance from a Mem Corpus
Fields Type: <class 'convokit.storage.memMappings.MemDocumentMapping'>
Utterance from a DB Corpus
Fields Type: <class 'convokit.storage.dbMappings.DBDocumentMapping'>


As with collections, a `MemDocumentMapping` is essentially a wrapper around a Python dict, directly implementing the mapping from field names to data in program memory.

A `DBDocumentMapping`, on the other hand, appears to offer the same functionality but does a lot more under the hood. When a field in a `DBDocumentMapping` is inserted or modified, the underlying database document is updated automatically; moreover, when a field is read by name, the freshest version of the stored data is retrieved from the database.
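
The write-through and read-fresh behavior can be sketched in the same style, again using a plain dict in place of a database collection. `ToyDBDocumentMapping` is a hypothetical class written for this illustration and is not the real `DBDocumentMapping`.

```python
from collections.abc import MutableMapping


class ToyDBDocumentMapping(MutableMapping):
    def __init__(self, collection, doc_id):
        self.collection = collection   # dict standing in for a DB collection
        self.doc_id = doc_id
        # Create an empty document if none exists under this id yet.
        self.collection.setdefault(doc_id, {})

    def __getitem__(self, field):
        # Every read goes back to the store, so it sees the latest value.
        return self.collection[self.doc_id][field]

    def __setitem__(self, field, value):
        # Every write updates the stored document immediately.
        self.collection[self.doc_id][field] = value

    def __delitem__(self, field):
        del self.collection[self.doc_id][field]

    def __iter__(self):
        return iter(self.collection[self.doc_id])

    def __len__(self):
        return len(self.collection[self.doc_id])


store = {}
fields_a = ToyDBDocumentMapping(store, 'utt_11')
fields_b = ToyDBDocumentMapping(store, 'utt_11')   # second handle, same document
fields_a['text'] = 'Changing the text of utterance 11.'
seen_by_b = fields_b['text']
```

Because neither handle caches anything locally, a write through `fields_a` is immediately visible through `fields_b`, which is the property that lets two corpora connected with `in_place=True` observe each other's changes.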
