# Tweet Turing Test: PyMongo Edition

| **Notebook** | `01_merge.ipynb`                                               |
|-------------:|----------------------------------------------------------------|
|  **Purpose** | Integrate raw data from multiple sources into a common schema. |
|     **Team** | John Johnson, Justin Minnion, Srinivas Pai                     |
|   **Course** | INFO 607 "Applied Database Technologies"                       |

# 0 - Prerequisites

Prior to executing code within this Jupyter notebook, the following prerequisites must be met.

- A MongoDB server is running and accessible by this notebook.
- The initial loading of raw data (CSV and JSON files) has been performed using the `load_raw_data(...)` function included in `utils.py`.
    - Note: this function can be invoked from the command line:  
      `python main.py --load-data`
- A Python environment is available and the packages in `requirements.txt` (including their respective dependencies) have been installed.

# 1 - Setup

## 1.1 - Imports

In [1]:
# imports from Python standard library
import math

# local imports
import utils
from utils import TweetDB

# pymongo
import pymongo
import pymongo.cursor
from bson.objectid import ObjectId

# other packages
import pandas as pd

## 1.2 - Options / Constants

In [2]:
CHUNK_SIZE_DEFAULT = utils.CHUNK_SIZE_DEFAULT

# Collection Names
COLLECTION_RAW = utils.COLLECTION_NAMES['raw']

## 1.3 - Make Database Connection

In [3]:
db = utils.TweetDB()

# 2 - Label source of data points

Modify the `raw` collection to add a data source for each tweet. Because this is non-destructive, we'll make the edit in-place to the existing documents.

## 2.1 - Verified users  

Data for these users were obtained from the Twitter API as nested JSON data. Compared to the CSV-format data from FiveThirtyEight, these tweets contain more fields and more data. We'll work to integrate the two data sources into one federated dataset, but to do so we'll need to apply different transformations to each source's data. The added data source label will help streamline that.

Verified user tweets will be identified as follows:
 - They contain a top-level (in the JSON hierarchy) field called "`created_at`".
 - Tweets from the FiveThirtyEight dataset do not contain this field.

Verified user tweets will be modified to:
 - Add a new field "`data_source`" with value "`verified`".

**_Note on Batching_**

This step also demonstrates our batching method. We divide the update operation into batches both to prepare the task for future parallelization and to reduce the memory footprint required to run this proof-of-concept code on a single compute resource.

General approach for batching:
 1. Create a query dictionary to identify which tweets to grab.
 2. Apply the `TweetDB.query()` function using this query, but only retrieve the MongoDB "`_id`" field. The `_id` field is autogenerated by MongoDB and contains a unique, 12-byte, surrogate primary key for each document (tweet).
 3. The `query()` function returns a PyMongo `cursor` object, which behaves like a lazily evaluated Python generator. We could iterate over the cursor directly, but by default it yields one document at a time, so we'd lose the ability to batch/parallelize. Instead, we can use a Python `itertools` iterator to divide the cursor into batches.
 4. Feed the returned cursor to the `utils.batched()` function. This wraps the original cursor in another generator (built on Python's `itertools.islice`) that lets us iterate over batches. It also handles the case where the batch size does not divide evenly into the total number of documents, so the last batch has fewer elements than the target batch size.

We attempted a few other approaches to batching the PyMongo `cursor`, and also considered the built-in `find_raw_batches()` method of `pymongo.collection.Collection`, but none matched the simplicity/readability of the Python `itertools.islice` approach.
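The `utils.batched()` source is not reproduced in this notebook, so the following is only a minimal sketch of the `itertools.islice` batching idea described above. The name `batched_sketch` and its signature are hypothetical, the progress-bar options are omitted, and a plain generator of dicts stands in for the PyMongo cursor.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched_sketch(iterable: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield lists of up to `chunk_size` items (hypothetical stand-in for utils.batched)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(iterable)  # a cursor is itself an iterator, so this is a no-op there
    while True:
        # islice pulls at most `chunk_size` items without exhausting the source
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # source exhausted
        yield chunk


# Stand-in for a cursor that returns only the `_id` field of each document
fake_cursor = ({"_id": i} for i in range(7))
chunk_lengths = [len(chunk) for chunk in batched_sketch(fake_cursor, chunk_size=3)]
print(chunk_lengths)  # the final chunk is shorter than chunk_size
```

Only one chunk is materialized at a time, which is what keeps the memory footprint bounded regardless of how many documents the query matches.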

In [4]:
source_collection = COLLECTION_RAW
dest_collection = COLLECTION_RAW    # in-place modification

# setup query
query_dict = {
    'created_at': {         # look for field "created_at"
        '$exists': True,    # ... and check if it is present in a record
    },
}

# get a sense of how many tweets will be modified
n_tweets: int = db.count_tweets_by_filter(
    collection=source_collection,
    query_dict=query_dict,
    approximate=False
)

print(f"Number of tweets to be modified:  {n_tweets:,}")

Number of tweets to be modified:  1,508,028


In [6]:
# make initial query to pull `_id` values
return_fields = ['_id']

tweet_id_cursor: pymongo.cursor.Cursor = db.query(
    collection=source_collection,
    query_dict=query_dict,
    return_fields=return_fields
)

# setup dict for the update to be made
update_dict = {
    '$set': {
        'data_source': 'verified',
    },
}

# update in batches
chunk_size = CHUNK_SIZE_DEFAULT
n_chunks = math.ceil(n_tweets / chunk_size)

#   outer loop iterates over chunks (batches)
for chunk in utils.batched(cursor=tweet_id_cursor, chunk_size=chunk_size, 
                           show_progress_bar=True, progress_bar_n_chunks=n_chunks):
    # inner loop iterates over tweets (documents) within a chunk
    for doc in chunk:
        # `doc` is a dict with key=document field (str), value=value (Any)
        #   example value for `doc`: 
        #       {'_id': ObjectId('6458645e09e423ae6e15d8e4')}
        doc_query_dict = {'_id': doc['_id']}
        doc_update_dict = update_dict

        # make the update to this doc
        db.update_tweets(
            collection=dest_collection,
            query_dict=doc_query_dict,
            update_dict=doc_update_dict,
            verbose=False   # reiterating the default value
        )
    # </inner loop>
# </outer loop>

Number of batches: 100%|██████████| 31/31 [09:50<00:00, 19.03s/it]


For the sake of comparison, we also wanted to try that same operation without batching. We're interested to see whether MongoDB's internal optimization can accomplish this task without 1) taking longer than our batched approach, or 2) crashing the Jupyter kernel / host computer.

First we'll unset the new field we just created, deleting it from the ~1.5 million tweets previously modified by the batched edit.

In [11]:
# first remove the field we just added
revert_update_dict = {
    '$unset': {
        'data_source': "",
    },
}

db.update_tweets(
    collection=dest_collection,
    query_dict=query_dict,
    update_dict=revert_update_dict,
    verbose=True
)

update_tweets: update was acknowledged 	number of tweets modified: 1508028 	number of tweets matched:  1508028


Next, apply the same `update_dict` from the batched edit, but do so to the entire result of `query_dict` rather than batched subsets.

In [12]:
db.update_tweets(
    collection=dest_collection,
    query_dict=query_dict,
    update_dict=update_dict,
    verbose=True
)

update_tweets: update was acknowledged 	number of tweets modified: 1508028 	number of tweets matched:  1508028


As it turns out, that operation completed far faster. The batched loop issues one update call per document (~1.5 million calls in total), and that overhead likely outweighs any benefit from the reduced memory footprint.
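One possible middle ground, which we did not test, is to keep the batches but issue a single `update_many` call per batch using an `$in` filter on `_id`, rather than one call per document. The sketch below assumes direct access to an object exposing `update_many(filter, update)`, as a PyMongo collection does; the helper names and the `FakeCollection` stand-in are hypothetical and exist only so the example runs without a database.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def id_batches(docs: Iterable[Dict[str, Any]], chunk_size: int) -> Iterator[List[Any]]:
    """Yield lists of `_id` values, up to `chunk_size` per list (hypothetical helper)."""
    iterator = iter(docs)
    while True:
        ids = [doc["_id"] for doc in islice(iterator, chunk_size)]
        if not ids:
            return
        yield ids


def update_in_batches(collection: Any, docs: Iterable[Dict[str, Any]],
                      update: Dict[str, Any], chunk_size: int) -> int:
    """Issue one update_many call per batch; return the number of calls made."""
    n_calls = 0
    for ids in id_batches(docs, chunk_size):
        # one round trip covers every document in this batch
        collection.update_many({"_id": {"$in": ids}}, update)
        n_calls += 1
    return n_calls


class FakeCollection:
    """Hypothetical stand-in that records calls instead of contacting MongoDB."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def update_many(self, filter: Dict[str, Any], update: Dict[str, Any]) -> None:
        self.calls.append((filter, update))


fake = FakeCollection()
docs = ({"_id": i} for i in range(5))  # stand-in for the `_id`-only cursor
n_calls = update_in_batches(fake, docs, {"$set": {"data_source": "verified"}}, chunk_size=2)
print(n_calls, fake.calls[0][0])
```

With this shape the number of round trips scales with the number of batches rather than the number of documents. Whether it would close the gap with the single unbatched call is an open question for this dataset.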

To summarize the results:
 - Both approaches applied identical edits to the same corpus of ~1.5 million tweets.
 - Using the same computing environment:
    - Batching/chunking manually required ~10 minutes to complete the operation.
    - Allowing PyMongo/MongoDB to handle the batching/chunking required ~0.5 minutes to complete the operation (about 1/20th of the time).
    - Informal monitoring of RAM usage for both operations did not show any significant difference.
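To express these timings per document, here is a back-of-the-envelope calculation using the figures reported above: the 09:50 elapsed time shown by the progress bar and the approximate 0.5 minutes for the single call. Because the second figure is only approximate, the resulting rates are rough estimates rather than measurements.

```python
n_tweets = 1_508_028             # tweets matched by the query
batched_seconds = 9 * 60 + 50    # 09:50 elapsed, from the progress bar output
single_call_seconds = 0.5 * 60   # approximate figure for the unbatched update

batched_rate = n_tweets / batched_seconds          # documents per second
single_call_rate = n_tweets / single_call_seconds  # documents per second
speedup = batched_seconds / single_call_seconds

print(f"batched:     {batched_rate:,.0f} docs/s")
print(f"single call: {single_call_rate:,.0f} docs/s")
print(f"speedup:     {speedup:.1f}x")
```

This works out to roughly 2,600 documents per second for the per-document loop versus about 50,000 per second for the single call, consistent with the ~20x difference noted above.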

For reference, the above tests were conducted with a Windows 10 PC equipped with an AMD Ryzen 7 5800X (8-core/16-thread) CPU and 64 GB RAM. The running MongoDB service did not exceed ~7.3 GB of RAM usage during these tests.