# Converting between memory and database storage

This notebook demonstrates how to convert between the two storage formats ConvoKit offers: in program memory and in a MongoDB database.

In [1]:
from convokit import Corpus, Speaker, Utterance
from time import time

First, let's create a Corpus that uses DB storage and populate it with some data. This demo uses only a few hundred utterances so that the notebook runs in a reasonable amount of time, but the tradeoffs we will explore become more pertinent with larger quantities of data.

In [2]:
db_corpus = Corpus(corpus_id='big_test_corpus', storage_type='db')

# Initializing Speakers
jim = Speaker(id="Jim")
bob = Speaker(id="Bob")
kate = Speaker(id="Kate")

# Initializing Utterances
utts = []

# First, everything Jim said:
for i in range(100):
    utts.append(Utterance(id=f"jim.{i}", text=f"I am the {i}'th comment!", speaker=jim))
                    
# Then, all of Bob's replies to Jim's comments:
for i in range(100):
    utts.append(Utterance(id=f"bob.{i}", text=f"Wow Jim, your {i}'th comment is so interesting!", reply_to=f"jim.{i}", speaker=bob))
                            
# Then, all of Kate's replies to Bob; however, Kate only responds to Bob's even-numbered comments:
for i in range(100):
    if i % 2 == 0:
        utts.append(Utterance(id=f"kate.{i}", text=f"I totally agree Bob! Jim's {i}'th comment is insightful", reply_to=f"bob.{i}", speaker=kate))
        
# Finally, we actually add these utterances to the DB corpus:
db_corpus = db_corpus.add_utterances(utts)

Corpus big_test_corpus_v0 not found in the DB; building new corpus
No filename or corpus name specified for DB storage; using name 710306
Corpus 710306_v0 not found in the DB; building new corpus


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [00:01<00:00, 195.32it/s]
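As a sanity check on the count of 250 in the progress bar above, the reply structure can be reproduced with plain Python. This is only an illustrative sketch: it uses a dictionary of ids rather than ConvoKit objects, and the helper `thread_root` is a hypothetical name introduced here, not part of ConvoKit.

```python
# Plain id -> reply_to mapping mirroring the ids built above; no ConvoKit objects involved.
reply_to = {}
for i in range(100):
    reply_to[f"jim.{i}"] = None              # Jim's comments start each thread
    reply_to[f"bob.{i}"] = f"jim.{i}"        # Bob replies to every comment by Jim
    if i % 2 == 0:
        reply_to[f"kate.{i}"] = f"bob.{i}"   # Kate replies to Bob's even-numbered comments


def thread_root(utt_id):
    """Follow reply_to links upward until reaching an utterance with no parent."""
    while reply_to[utt_id] is not None:
        utt_id = reply_to[utt_id]
    return utt_id


print(len(reply_to))           # 100 + 100 + 50 = 250
print(thread_root("kate.34"))  # jim.34
```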


We can work with this data directly in db_corpus. If we primarily want to store the data long term while keeping the ability to occasionally read and modify individual items quickly, then database storage is a good option. For example, if you want to use the Corpus as the backbone of a web server that stores conversational data, database storage is the better fit. Here is an example of this type of use case:

In [3]:
# Example of responding to an incoming request for comment kate.34
print(db_corpus.get_utterance('kate.34'))


# Example of Bob viewing, then editing, his 68th comment
print(db_corpus.get_utterance('bob.68'))
db_corpus.get_utterance('bob.68').text = "Actually, on second thought Jim, your 68th comment is slightly less interesting than the other ones"
print(db_corpus.get_utterance('bob.68'))


Utterance(id: 'kate.34', conversation_id: None, reply-to: bob.34, speaker: Speaker(id: Kate), timestamp: None, text: "I totally agree Bob! Jim's 34'th comment is insightful", vectors: [], meta: {})
Utterance(id: 'bob.68', conversation_id: None, reply-to: jim.68, speaker: Speaker(id: Bob), timestamp: None, text: "Wow Jim, your 68'th comment is so interesting!", vectors: [], meta: {})
Utterance(id: 'bob.68', conversation_id: None, reply-to: jim.68, speaker: Speaker(id: Bob), timestamp: None, text: 'Actually, on second thought Jim, your 68th comment is slightly less interesting than the other ones', vectors: [], meta: {})


On the other hand, imagine that you want to take all the data on your web server and run the latest Natural Language Processing algorithms on it to learn what people are talking about on your website. This is a computationally intensive task that will likely require reading the data many times in a short period. Doing this kind of work (computationally expensive operations that only need the data for a short time) against the database would be inefficient, because it requires repeated reads from and writes back to the database; it would be more efficient to work with the data in memory. Thus, we can pull all the data into memory at once by creating a copy of the corpus that uses the memory storage format, do our expensive computations, then write the results back into the database all at once.
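The pattern described here (bulk read, compute in memory, bulk write) can be sketched independently of ConvoKit. The example below is a minimal illustration under stated assumptions: it uses the standard library's `sqlite3` as a stand-in for MongoDB so that it runs anywhere, the table layout and the `n_tokens` column are made up for the example, and it does not reflect how ConvoKit implements DB storage.

```python
import sqlite3

# Stand-in for a database-backed corpus: an in-memory SQLite table.
# (Assumption: sqlite3 replaces MongoDB purely so this sketch is self-contained.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE utterances (id TEXT PRIMARY KEY, text TEXT, n_tokens INTEGER)")
conn.executemany(
    "INSERT INTO utterances (id, text) VALUES (?, ?)",
    [(f"jim.{i}", f"I am the {i}'th comment!") for i in range(100)],
)
conn.commit()

# 1. Pull everything into memory with a single query.
in_memory = {utt_id: text for utt_id, text in conn.execute("SELECT id, text FROM utterances")}

# 2. Do the computation purely in memory (a trivial whitespace token count here).
token_counts = {utt_id: len(text.split()) for utt_id, text in in_memory.items()}

# 3. Write all results back to the database in one batch.
conn.executemany(
    "UPDATE utterances SET n_tokens = ? WHERE id = ?",
    [(n, utt_id) for utt_id, n in token_counts.items()],
)
conn.commit()

total_tokens = conn.execute("SELECT SUM(n_tokens) FROM utterances").fetchone()[0]
print(total_tokens)
```

The database is touched only in steps 1 and 3, so the number of round trips stays fixed no matter how many passes the computation in step 2 makes over the data.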

In [4]:
from convokit import TextParser
from convokit import PolitenessStrategies

parser = TextParser(verbosity=1000)
ps = PolitenessStrategies()


In [10]:
mem_start = time()

mem_corpus = Corpus(storage_type='mem', from_corpus=db_corpus)

mem_corpus = parser.transform(mem_corpus)
mem_corpus = ps.transform(mem_corpus, markers=True)

mem_end = time()

250/250 utterances processed


Let's try running the same computations directly on db_corpus to compare how long each strategy takes. Note that the in-memory timing above includes the time to convert from database to in-memory storage.

In [6]:
db_start = time()

db_corpus = parser.transform(db_corpus)
db_corpus = ps.transform(db_corpus, markers=True)

db_end = time()

250/250 utterances processed


In [11]:
print(f"Converting from db -> mem storage + doing computations in memory took {mem_end - mem_start} ")
print(f"Doing computations with the DB corpus directly took {db_end - db_start} ")

Converting from db -> mem storage + doing computations in memory took 13.639177083969116 
Doing computations with the DB corpus directly took 7.874305009841919 
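Each strategy above was timed once with `time()`, and a single wall-clock measurement can be noisy. A small helper can make repeated measurements less error-prone. The sketch below is one possible approach, not part of ConvoKit: `timed` is a hypothetical helper, the workload is a toy stand-in for the `transform` calls, and it uses `time.perf_counter`, a clock intended for measuring short intervals.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = perf_counter()
    try:
        yield
    finally:
        results[label] = perf_counter() - start


timings = {}
with timed("toy workload", timings):
    # Placeholder computation; in the notebook this would be the transform calls.
    total = sum(i * i for i in range(100_000))

print(f"toy workload took {timings['toy workload']:.4f} seconds")
```

Running each strategy several times with such a helper and comparing the spread of durations gives a more reliable picture than one run per strategy.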
