In [1]:
import json
import os
import re
import sys
from tqdm import tqdm

module_path1 = os.path.abspath(os.path.join('../modules/'))
if module_path1 not in sys.path:
    sys.path.append(module_path1)

import mongo_db_interface as mongo

### This notebook demonstrates some of the functions of the BETO_NLP module `mongo_db_interface`

The module's `MongoDBHandler` class manages connections to the MongoDB Atlas database that holds the corpus. Before connecting to the DB, you need to:

 - Ensure you are a registered user of the cluster and collection with a password
 - Ensure your IP address is whitelisted. If you use a VPN, make sure that IP address is also whitelisted
 
On initialization of the `MongoDBHandler()` class, you need to pass the correct username and password as strings. These are inserted into the client connection string. If you are using a MongoDB cluster and/or project that is different from the default, make sure to modify the names in the client connection string as shown below:



In MongoDBHandler.__init__(), replace `practice-general-corpus` with your preferred cluster and
replace `BETO_corpus_practice` with your preferred project

`client = MongoClient('mongodb+srv://' + user + ':' + password + '@practice-general-corpus.qt2hh.mongodb.net/BETO_corpus_practice?retryWrites=true&w=majority')`
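If the username or password contains reserved characters such as `@` or `/`, the concatenated connection string above will not parse correctly. A minimal sketch of one way to guard against this, assuming the same connection string layout as above; `build_atlas_uri` is a hypothetical helper and is not part of `MongoDBHandler`:

```python
from urllib.parse import quote_plus


def build_atlas_uri(user, password,
                    cluster_host='practice-general-corpus.qt2hh.mongodb.net',
                    database='BETO_corpus_practice'):
    """Hypothetical helper: assemble the connection string shown above.

    The user and password are percent-encoded so that characters such as
    '@' or '/' cannot be mistaken for URI delimiters.
    """
    return ('mongodb+srv://' + quote_plus(user) + ':' + quote_plus(password)
            + '@' + cluster_host + '/' + database
            + '?retryWrites=true&w=majority')


# Made-up credentials, for illustration only
uri = build_atlas_uri('example_user', 'p@ss/word')
print(uri)
```

The resulting string can be passed to `MongoClient` in place of the hand-built one.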

In [2]:
#load user and password from local file
with open('/Users/wesleytatum/Desktop/post_doc/BETO/mongo_passwords.json', 'r') as f:
    #the with block closes the file automatically
    mongo_passwords = json.load(f)
    
user = mongo_passwords['user']
password = mongo_passwords['password']
    
#initialize MongoDBHandler()
mongo_handler = mongo.MongoDBHandler(user, password)

In this tutorial, we will access all of the articles saved in a practice corpus. They were all scraped from the Royal Society of Chemistry (RSC) using the `rsc_corpus_gen` module in this repo. Originally, the articles have the following 'blobs' of data associated with them:

 - '_id': Their unique ObjectID. In the case of articles, this corresponds to the DOI
 - 'doi': This is a redundant datafield that arises due to scraping
 - 'title': The title of the article
 - 'abstract': If the article has an abstract, it is listed in a string here. Otherwise the string states "no abstract"
 - 'html': The raw HTML of the article saved as a string. This allows access to figures and tables in the future
 
Because we are expecting our corpus to have articles from multiple publishers, we would like to add an additional field:

 - 'publisher': In this case, the value will be 'RSC'

The process to add this new field and update the database is outlined below. Its general structure applies to adding any new field to the article objects:

 1. Obtain all objects you wish to add the field to. Their `_id` values are returned as a list.
     - This can be done to all articles or any subset by filtering with a particular field (_e.g._ publisher, keyword)
 2. Access a single object at a time and perform the desired function on a field of the object and add the results to the object.
     - In this case we add the 'publisher' field. Instead, we could add fields for 'keywords', 'article_string', or 'compounds'
 3. Update and upload the object in the MongoDB

In [3]:
#1. Get a list of all articles in the DB
doi_list = mongo_handler.retrieve_all_article_doi()

#2. Iterate through articles and add new field

pbar = tqdm(total = len(doi_list), position = 0)

for doi in doi_list:
    article = mongo_handler.retrieve_doc_by_doi(doi)
    
    #3. Update and upload the article
    article['publisher'] = 'RSC'
    mongo_handler.upload_article_document(article)
    
    pbar.update()

100%|██████████| 318/318 [05:06<00:00,  1.29it/s]

Now we see that the documents have the new field added to them:

In [6]:
article = mongo_handler.retrieve_doc_by_doi(doi_list[28])
print(article.keys())
article['publisher']

dict_keys(['_id', 'doi', 'title', 'abstract', 'html', 'upload_timestamp', 'publisher'])


'RSC'

Iterating through all of the documents is a slow task, though. The loop above ran at 1.29 iterations per second, or roughly 0.96 seconds per document. Luckily we're working with a small corpus. In the future, though, we want to be working with >30,000 documents, which at this rate would take about 8 hours just to add a single new field. This is prohibitively slow.
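The extrapolation can be checked directly from the progress bar output above (318 documents in 05:06). The 30,000 document target is the figure quoted in the text; the variable names are only illustrative:

```python
# Figures read off the tqdm progress bar above
n_docs_done = 318
elapsed_s = 5 * 60 + 6                      # 05:06 expressed in seconds

seconds_per_doc = elapsed_s / n_docs_done   # average time per document

# Extrapolate to the corpus size mentioned in the text
target_docs = 30_000
estimated_hours = target_docs * seconds_per_doc / 3600

print(f"{seconds_per_doc:.2f} s per document")
print(f"{estimated_hours:.1f} hours for {target_docs} documents")
```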

MongoDB has functions to address this, such as `update_many()`, which finds all documents that match a filter and updates them in a single operation. This is typically used to set a field to a single predetermined value for every document matched by a query filter (like `corpus.find({'publisher':'RSC'})`). However, applying a custom function to a field of each document and adding the result as a new field is more complicated. There are a few different ways to do this, but the fastest appears to be writing the custom function in JavaScript and using the `forEach()` function. Unfortunately, the cursor `forEach()` and `map()` functions are only available in the MongoDB shell, and not through the pymongo driver interface. For those interested, this approach is shown below for adding the field `keywords`.
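For the simpler constant-value case, such as the 'publisher' field added earlier, a single `update_many()` call could replace the whole loop. The following is a sketch only: `set_constant_field` is a hypothetical helper, and `collection` is assumed to be a pymongo collection object such as the one returned by `MongoDBHandler.return_client_corpus_object()`.

```python
def set_constant_field(collection, field, value):
    """Set `field` to `value` on every document that does not have it yet.

    `collection` is assumed to expose pymongo's
    `update_many(filter, update)` method. The server applies the update
    to all matching documents, so no document is downloaded.
    """
    query = {field: {'$exists': False}}     # only documents missing the field
    update = {'$set': {field: value}}       # add the field with a fixed value
    return collection.update_many(query, update)
```

With a live connection, the call would look like `set_constant_field(corpus, 'publisher', 'RSC')`.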

To do this, we will use a pre-determined list of keywords and add a new field containing all keywords that are found in the `article['html']` value. This example corpus was scraped by searching for 'conjugated polymers', so the selected keywords correspond to these materials and their applications.

To access the object that interfaces with the MongoDB database for these custom queries and operations, assign the return of `MongoDBHandler.return_client_corpus_object()` to a variable. This allows collection-level operations to be performed.

In [16]:
#define our keywords that we are searching for
keywords = ['chemistry', 'polymer', 'photovoltaic', 'OPV', 'semiconductor', 'transistor',
            'OFET', 'OTFT', 'ternary blend', 'nonfullerene acceptor', 'non-fullerene acceptor',
            'thermoelectric', 'LED', 'sensor', 'donor', 'acceptor', 'copolymer']

#This is the Python version of the custom function that finds keyword matches
def re_keyword_match(keyword_list, html):
    #the lookahead makes matches zero-width, so overlapping keywords are all reported
    #re.escape keeps any special characters in a keyword from being read as regex syntax
    pattern = r"(?=(" + '|'.join(re.escape(kw) for kw in keyword_list) + r"))"
    matches = re.findall(pattern, html)
    return matches

test_string = 'the polymers were used in photovoltaic devices'

kws = re_keyword_match(keywords, test_string)

kws

['polymer', 'photovoltaic']
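Because the pattern uses a lookahead, overlapping keywords are all reported ('copolymer' also yields 'polymer'), and a keyword is listed once per occurrence. If the stored field should hold each keyword only once, the matches can be deduplicated. This is a small sketch with a shortened keyword list and a made-up sample string; `unique_keyword_matches` is a hypothetical name:

```python
import re

# Shortened, illustrative keyword list
sample_keywords = ['polymer', 'copolymer', 'photovoltaic', 'LED']


def unique_keyword_matches(keyword_list, text):
    """Return the distinct keywords found in `text`, sorted alphabetically."""
    pattern = r"(?=(" + '|'.join(re.escape(k) for k in keyword_list) + r"))"
    return sorted(set(re.findall(pattern, text)))


sample = 'copolymer and polymer films for photovoltaic LED devices'

raw = re.findall(r"(?=(" + '|'.join(sample_keywords) + r"))", sample)
print(raw)
# ['copolymer', 'polymer', 'polymer', 'photovoltaic', 'LED']

print(unique_keyword_matches(sample_keywords, sample))
# ['LED', 'copolymer', 'photovoltaic', 'polymer']
```

Note that the match is case-sensitive, so 'LED' will not match 'led'.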

Converting the Python function to a JavaScript function allows the MongoDB servers themselves to apply the function, eliminating the need to download each document into local memory and re-upload the updated object. Shown below is the JavaScript translation of the `re_keyword_match` function defined above:

```
function (doc) {
        var keyword_list = ['chemistry', 'polymer', 'photovoltaic', 'OPV', 'semiconductor', 'transistor', 'OFET', 'OTFT', 'ternary blend', 'nonfullerene acceptor', 'non-fullerene acceptor', 'thermoelectric', 'LED', 'sensor', 'donor', 'acceptor', 'copolymer'];
        // the flags are an argument of RegExp(), not of join()
        const regexp = new RegExp(keyword_list.join("|"), 'gi');
        const str = doc.html;
        // matchAll() returns an iterator; keep only the matched strings
        const matches = Array.from(str.matchAll(regexp), m => m[0]);
        // corpus is assumed to be the collection object in the shell session
        corpus.update(
            {'_id': doc._id},
            {'$set':{'keywords':matches}}
        );
    }
```

In [7]:
#get the corpus collection object
corpus = mongo_handler.return_client_corpus_object()

#perform our query and apply the function
corpus.find({'publisher':'RSC'}).map(
    function (doc) {
        var keyword_list = ['chemistry', 'polymer', 'photovoltaic', 'OPV', 'semiconductor', 'transister', 'OFET', 'OTFT', 'ternary blend', 'nonfullerene acceptor', 'non-fullerene acceptor', 'thermoelectric', 'LED', 'sensor', 'donor', 'acceptor', 'copolymer'];
        const regexp = new RegExp(keyword_list.join("|", 'gi'));
        const str = doc.html;
        const matches = str.matchAll(regexp);
        corpus.update(
            {'_id':'doc._id'},
            {'$set':{'keywords':matches}}
        );
    })

AttributeError: 'Cursor' object has no attribute 'map'

pymongo cursors expose far fewer methods than cursors in the MongoDB shell, so the approach above only works in the shell. From Python, multi-threading is probably the better option.

To perform these operations purely through the Python interface, you can use a multi-threading approach to speed up the process. Running the iterations in parallel threads lets several database requests be in flight at once instead of waiting on each one in turn. The attempt below relies on the pymongo function `parallel_scan()`, which is deprecated in later pymongo 3.x releases and absent from pymongo 4.

To demonstrate this approach, we add the `keywords` field again.

In [8]:
import threading

#define our keywords that we are searching for
keywords = ['chemistry', 'polymer', 'photovoltaic', 'OPV', 'semiconductor', 'transistor',
            'OFET', 'OTFT', 'ternary blend', 'nonfullerene acceptor', 'non-fullerene acceptor',
            'thermoelectric', 'LED', 'sensor', 'donor', 'acceptor', 'copolymer']


def keyword_match(cursor):
    #each thread consumes its own cursor and writes the matches back to the corpus
    pattern = r"(?=("+'|'.join(keywords)+r"))"
    
    for row in cursor.batch_size(200):
        matches = re.findall(pattern, row['html'])
        corpus.update_one({'_id': row['_id']}, 
                          {'$set': {'keywords': matches}},
                          upsert=True)


def add_keywords(num_threads=4):

    # Get up to max 'num_threads' cursors.
    cursors = corpus.parallel_scan(num_threads)
    threads = [threading.Thread(target=keyword_match, args=(cursor,)) for cursor in cursors]

    for thread in threads:
        thread.start()

    for thread in threads:
        thread.join()

In [9]:
#get the corpus collection object
corpus = mongo_handler.return_client_corpus_object()

#call the above functions
add_keywords()

  cursors = corpus.parallel_scan(num_threads)


OperationFailure: CMD_NOT_ALLOWED: parallelCollectionScan, full error: {'ok': 0, 'errmsg': 'CMD_NOT_ALLOWED: parallelCollectionScan', 'code': 8000, 'codeName': 'AtlasError'}
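The `CMD_NOT_ALLOWED` error above indicates that this Atlas cluster rejects the `parallelCollectionScan` command, so `parallel_scan()` cannot be used here. One possible alternative, sketched below under stated assumptions, is to split the list of `_id` values into chunks and let a thread pool process the chunks with ordinary `find_one()` and `update_one()` calls. The names `chunked`, `tag_chunk` and `add_keywords_threaded` are hypothetical, `collection` is assumed to be the pymongo collection returned by `return_client_corpus_object()`, and the `_id` values are assumed to be the DOIs returned by `retrieve_all_article_doi()`, as described earlier. This has not been timed against the corpus.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def tag_chunk(collection, ids, pattern):
    """Fetch each document by _id, find keyword matches, and store them."""
    updated = 0
    for _id in ids:
        # Only the 'html' field is requested, to limit the data transferred
        doc = collection.find_one({'_id': _id}, {'html': 1})
        if doc is None:
            continue                        # skip ids that are not in the corpus
        matches = pattern.findall(doc.get('html', ''))
        collection.update_one({'_id': _id}, {'$set': {'keywords': matches}})
        updated += 1
    return updated


def add_keywords_threaded(collection, ids, keyword_list,
                          num_threads=4, chunk_size=50):
    """Tag documents in parallel threads; returns the number of documents updated."""
    pattern = re.compile(
        r"(?=(" + '|'.join(re.escape(k) for k in keyword_list) + r"))")
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(tag_chunk, collection, chunk, pattern)
                   for chunk in chunked(ids, chunk_size)]
        return sum(f.result() for f in futures)
```

With a live connection, the call would look like `add_keywords_threaded(corpus, doi_list, keywords)`. The work is dominated by waiting on the network, so threads can overlap those waits even though only one thread executes Python code at a time.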