## CORD-19-Research-Data-Set-Atlas

This workbook walks through querying data from a MongoDB database hosted on Atlas.

For this code to run, you'll need to create a MongoDB database on Atlas named "covid-19" with a collection named "noncomm-subset".

### Download data

You can download the data from:

https://pages.semanticscholar.org/coronavirus-research

I'm using the non-commercial use subset (76 MB download). To keep the example size manageable, I moved the first 70 records into the data directory of this repository.

### Install Mongo

https://docs.mongodb.com/manual/installation/

### Mongo Client

You may want to get familiar with the MongoDB client and CRUD operations before working with Python.

### Build an Atlas MongoDB from the CORD-19 Subset

Build a database named "covid-19" and a collection named "noncomm-subset".

Add records with mongoimport (substitute your username, password, and Atlas URI):

mongoimport --uri "mongodb+srv://your_username:your_password@your_atlas_uri/covid-19" --collection noncomm-subset --drop --file filename.json

For example, to upload one file, mine would be (username and password still redacted):

mongoimport --uri "mongodb+srv://username:password@python-mongodb-workshop.unmjr.gcp.mongodb.net/covid-19" --collection noncomm-subset --drop --file 1b58422e266ab9339c919119923229d080f27360.json 
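
To load several files rather than one, the command can be generated per file. The sketch below only builds the argument lists and runs nothing. The helper name `build_import_commands` and the second file name are assumptions for illustration. Because `--drop` removes the existing collection before importing, it is passed for the first file only; otherwise each import would wipe out the previous one.

```python
def build_import_commands(json_files, uri, collection="noncomm-subset"):
    """Return one mongoimport argument list per JSON file.

    Hypothetical helper: --drop is added to the first command only, so the
    collection is emptied once and later files are appended to it.
    """
    commands = []
    for position, json_file in enumerate(sorted(json_files)):
        command = ["mongoimport", "--uri", uri, "--collection", collection]
        if position == 0:
            command.append("--drop")
        command += ["--file", str(json_file)]
        commands.append(command)
    return commands


# Placeholder URI, same shape as the command above.
uri = "mongodb+srv://your_username:your_password@your_atlas_uri/covid-19"

# Example file names; a directory listing such as
# pathlib.Path("data").glob("*.json") could be passed instead.
files = [
    "1b58422e266ab9339c919119923229d080f27360.json",
    "00a00d0edc750db4a0c299dd1ec0c6871f5a4f24.json",
]

for command in build_import_commands(files, uri):
    print(" ".join(command))
```

Each argument list can be handed to `subprocess.run(command, check=True)` once `mongoimport` is on the PATH.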


In [1]:
import json
from pymongo import MongoClient

In [2]:
client = MongoClient()

In [3]:
MDB_URL = "mongodb+srv://python_workshop_user:python_workshop_pwd@python-mongodb-workshop-unmjr.gcp.mongodb.net/test"
client = MongoClient(MDB_URL)
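
Passwords that contain characters such as `@` or `/` need to be percent-encoded before they go into the connection string. A minimal sketch, assuming the credentials live in environment variables called `MDB_USER` and `MDB_PASSWORD` (hypothetical names) and using the standard library's `quote_plus`:

```python
import os
from urllib.parse import quote_plus


def build_atlas_uri(username, password, host, database):
    """Assemble a mongodb+srv URI with percent-encoded credentials."""
    return "mongodb+srv://{}:{}@{}/{}".format(
        quote_plus(username), quote_plus(password), host, database
    )


# MDB_USER and MDB_PASSWORD are assumed variable names, not part of the
# workshop; the fallbacks are the placeholders used earlier in this document.
MDB_URL = build_atlas_uri(
    os.environ.get("MDB_USER", "your_username"),
    os.environ.get("MDB_PASSWORD", "your_password"),
    "your_atlas_uri",
    "covid-19",
)
# MDB_URL can then be passed to MongoClient as in the cell above.
```

Reading the credentials from the environment also keeps them out of the notebook itself.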

In [4]:
client.list_database_names()

['covid-19',
 'sample_airbnb',
 'sample_analytics',
 'sample_geospatial',
 'sample_mflix',
 'sample_restaurants',
 'sample_supplies',
 'sample_training',
 'sample_weatherdata',
 'admin',
 'local']

In [5]:
db = client.get_database("covid-19")

In [6]:
db.list_collection_names()

['noncomm-subset']

In [7]:
pmc_content = db['noncomm-subset']

In [8]:
for c in db['noncomm-subset'].find().limit(10):
    print(c)

{'_id': ObjectId('5f4ecb5cb752ba3f43f46c36'), 'paper_id': '00a00d0edc750db4a0c299dd1ec0c6871f5a4f24', 'metadata': {'title': '', 'authors': []}, 'abstract': [], 'body_text': [{'text': 'This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/bync/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.', 'cite_spans': [], 'ref_spans': [], 'section': 'Editorial'}, {'text': "The Middle East Respiratory Syndrome (MERS) crisis in Korea is coming to an end, but that's not the end of the story. The World Health Organization (WHO) has warned that the MERS crisis was a 'wakeup call. ' Like Severe Acute Respiratory Syndrome (SARS) in 2003 and swine flu in 2009, epidemic outbreaks will continue to happen anywhere. Korea was simply unlucky. In retrospect, it is regrettable to say that we could have done much better

In [9]:
# to print the titles only and suppress the id

for c in db['noncomm-subset'].find({},{ 'metadata.title': 1, '_id': 0 }):
    print(c)

{'metadata': {'title': ''}}
{'metadata': {'title': ''}}
{'metadata': {'title': 'Comparative Evaluation of Three Homogenization Methods for Isolating Middle East Respiratory Syndrome Coronavirus Nucleic Acids From Sputum Samples for Real-Time Reverse Transcription PCR'}}
{'metadata': {'title': 'Chest Computed Tomography Abnormalities and Their Relationship to the Clinical Manifestation of Respiratory Syncytial Virus Infection in a Genetically Confirmed Outbreak'}}
{'metadata': {'title': 'Functional analysis of the SRV-1 RNA frameshifting pseudoknot'}}
{'metadata': {'title': 'Open Forum Infectious Diseases Open Forum Infectious Diseases ® Outpatient Antibiotic Stewardship: A Growing Frontier-Combining Myxovirus Resistance Protein A With Other Biomarkers to Improve Antibiotic Use'}}
{'metadata': {'title': ''}}
{'metadata': {'title': 'VIRUS ON T H E CEREBRAL ACTIVITY OF P L E U R O P N E U M O N I A -L I K E ORGANISMS I N M I C E'}}
{'metadata': {'title': 'Molecular epidemiology and phylog
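
Several of the records above have an empty title, and the projected documents keep their nesting, so each title sits under `metadata`. A small sketch of pulling the non-empty titles out of such results. The `docs` list is a hand-copied sample of the output above standing in for the cursor, and `nonempty_titles` is a hypothetical helper:

```python
def nonempty_titles(docs):
    """Return the title strings from projected documents, skipping empty ones."""
    titles = []
    for doc in docs:
        title = doc.get("metadata", {}).get("title", "")
        if title:
            titles.append(title)
    return titles


# Sample copied from the projection output; in the notebook this would be the
# cursor returned by find({}, {'metadata.title': 1, '_id': 0}).
docs = [
    {"metadata": {"title": ""}},
    {"metadata": {"title": "Functional analysis of the SRV-1 RNA frameshifting pseudoknot"}},
    {"metadata": {"title": ""}},
]
print(nonempty_titles(docs))
```

The same filtering can be pushed to the server by passing `{'metadata.title': {'$ne': ''}}` as the first argument to `find`.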

In [10]:
# find one paper by paper_id

for c in db['noncomm-subset'].find({'paper_id': '00a00d0edc750db4a0c299dd1ec0c6871f5a4f24'}):
    print(c)

{'_id': ObjectId('5f4ecb5cb752ba3f43f46c36'), 'paper_id': '00a00d0edc750db4a0c299dd1ec0c6871f5a4f24', 'metadata': {'title': '', 'authors': []}, 'abstract': [], 'body_text': [{'text': 'This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/bync/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.', 'cite_spans': [], 'ref_spans': [], 'section': 'Editorial'}, {'text': "The Middle East Respiratory Syndrome (MERS) crisis in Korea is coming to an end, but that's not the end of the story. The World Health Organization (WHO) has warned that the MERS crisis was a 'wakeup call. ' Like Severe Acute Respiratory Syndrome (SARS) in 2003 and swine flu in 2009, epidemic outbreaks will continue to happen anywhere. Korea was simply unlucky. In retrospect, it is regrettable to say that we could have done much better

In [11]:
# query on nested documents
# see: https://docs.mongodb.com/manual/tutorial/query-embedded-documents/

for c in db['noncomm-subset'].find({'metadata.title': 'ACE/ACE2 Ratio and MMP-9 Activity as Potential Biomarkers in Tuberculous Pleural Effusions'}):
    print(c)

{'_id': ObjectId('5f4ecbc6f7c2f0238d4cbc14'), 'paper_id': '1b58422e266ab9339c919119923229d080f27360', 'metadata': {'title': 'ACE/ACE2 Ratio and MMP-9 Activity as Potential Biomarkers in Tuberculous Pleural Effusions', 'authors': [{'first': 'Wen-Yeh', 'middle': [], 'last': 'Hsieh', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'National Chiao Tung University', 'location': {'settlement': 'Hsinchu', 'country': 'Taiwan'}}, 'email': ''}, {'first': 'Tang-Ching', 'middle': [], 'last': 'Kuan', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'National Chiao Tung University', 'location': {'settlement': 'Hsinchu', 'country': 'Taiwan'}}, 'email': ''}, {'first': 'Kun-Shan', 'middle': [], 'last': 'Cheng', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'National Chiao Tung University', 'location': {'settlement': 'Hsinchu', 'country': 'Taiwan'}}, 'email': ''}, {'first': 'Yan-Chiou', 'middle': [], 'last': 'Liao', 'suffix': '', 'affiliation': {'laboratory'

In [12]:
# query on an array of embedded documents
# see: https://docs.mongodb.com/manual/tutorial/query-array-of-documents/

for c in db['noncomm-subset'].find({'metadata.authors.first': 'Wen-Yeh'}):
    print(c)

{'_id': ObjectId('5f4ecbc6f7c2f0238d4cbc14'), 'paper_id': '1b58422e266ab9339c919119923229d080f27360', 'metadata': {'title': 'ACE/ACE2 Ratio and MMP-9 Activity as Potential Biomarkers in Tuberculous Pleural Effusions', 'authors': [{'first': 'Wen-Yeh', 'middle': [], 'last': 'Hsieh', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'National Chiao Tung University', 'location': {'settlement': 'Hsinchu', 'country': 'Taiwan'}}, 'email': ''}, {'first': 'Tang-Ching', 'middle': [], 'last': 'Kuan', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'National Chiao Tung University', 'location': {'settlement': 'Hsinchu', 'country': 'Taiwan'}}, 'email': ''}, {'first': 'Kun-Shan', 'middle': [], 'last': 'Cheng', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'National Chiao Tung University', 'location': {'settlement': 'Hsinchu', 'country': 'Taiwan'}}, 'email': ''}, {'first': 'Yan-Chiou', 'middle': [], 'last': 'Liao', 'suffix': '', 'affiliation': {'laboratory'
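
The dotted path `metadata.authors.first` crosses an array: the query matches a document when any element of `authors` has the given `first` value. The sketch below imitates that lookup in plain Python to make the behaviour concrete. It is a simplified illustration, not MongoDB's implementation, and `resolve_path` and `matches` are hypothetical helper names.

```python
def resolve_path(value, path):
    """Collect every value reachable along a dotted path, fanning out over lists."""
    if not path:
        return [value]
    if isinstance(value, list):
        found = []
        for item in value:
            found.extend(resolve_path(item, path))
        return found
    if isinstance(value, dict):
        key, _, rest = path.partition(".")
        if key in value:
            return resolve_path(value[key], rest)
    return []


def matches(doc, path, expected):
    """True when any value found along the path equals the expected value."""
    return expected in resolve_path(doc, path)


# Shortened copy of the document printed above (two of its authors only).
paper = {
    "paper_id": "1b58422e266ab9339c919119923229d080f27360",
    "metadata": {
        "authors": [
            {"first": "Wen-Yeh", "last": "Hsieh"},
            {"first": "Tang-Ching", "last": "Kuan"},
        ]
    },
}
print(resolve_path(paper, "metadata.authors.first"))
print(matches(paper, "metadata.authors.first", "Tang-Ching"))
```

The second call is true even though "Tang-Ching" is not the first author, which mirrors how the query above would also find this paper when searching for any of its other authors.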

### Query on a text index

To query on a search phrase or word, you'll need to build a text index on the fields you want to search. For this tutorial, we'll do this with the MongoDB shell. 

Note: if the field is nested, you'll need to put it in quotation marks when you build the text index.

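To build the index (the collection name contains a hyphen, so in the shell it is reached through `getCollection` rather than dot notation):
```
db.getCollection("noncomm-subset").createIndex( { "body_text.text": "text" } )
```

To list all the indexes you have on a collection:
```
db.getCollection("noncomm-subset").getIndexes()
```

To remove the index:
```
db.getCollection("noncomm-subset").dropIndex("body_text.text_text")
```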
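
The same index operations are available from Python. This is a sketch assuming pymongo's `create_index`, `index_information`, and `drop_index` collection methods. The wrapper function names are made up for this example, and each one takes the collection object (`db['noncomm-subset']` in the cells above).

```python
def build_text_index(collection, field="body_text.text"):
    """Create a text index on the given field and return the index name."""
    # "text" is the index type; pymongo also exposes it as pymongo.TEXT.
    return collection.create_index([(field, "text")])


def list_index_names(collection):
    """Return the names of all indexes on the collection, sorted."""
    return sorted(collection.index_information())


def remove_text_index(collection, field="body_text.text"):
    """Drop the text index, using MongoDB's default <field>_text name."""
    collection.drop_index(field + "_text")
```

A typical call would be `build_text_index(db['noncomm-subset'])`. A collection can hold at most one text index, so an existing one has to be dropped before a text index on different fields is built.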

In [13]:
import re
regx = re.compile("pleural", re.IGNORECASE)
for r in db['noncomm-subset'].find({'body_text.text': { '$regex': regx } }):
    print(r)

{'_id': ObjectId('5f4ecb9678b041422935980e'), 'paper_id': '0cf8f70367fc77ea444519b2d9ed108765edd25f', 'metadata': {'title': 'case report', 'authors': [{'first': 'Ann', 'middle': [], 'last': '', 'suffix': '', 'affiliation': {}, 'email': ''}]}, 'abstract': [], 'body_text': [{'text': "A cute fibrinous and organizing pneumonia (AFOP) was first described by Beasley et al in 2002 , as a new pattern of lung injury, with histological similarities to organizing pneumonia (OP), diffuse alveolar damage (DAD), and eosinophilic pneumonia (EP). However, it has a distinct overall histological pattern, characterized by intra-alveolar fibrin associated with OP in a patchy distribution. Since Beasley' s initial description, 14 individual case reports of AFOP have been published in English research papers. Although the histopathological features are well described, the clinical manifestations, course, and treatment of AFOP are not characterized. We report a case of a male with bilateral patchy lower lobe
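
The regex query returns whole documents, even when only one paragraph of `body_text` contains the term. A small sketch for locating the matching paragraphs inside a returned document. `matching_paragraphs` is a hypothetical helper, and the sample document uses made-up placeholder text rather than text from the data set.

```python
import re


def matching_paragraphs(doc, pattern):
    """Return (index, section) for each body_text entry whose text matches."""
    hits = []
    for index, paragraph in enumerate(doc.get("body_text", [])):
        if pattern.search(paragraph.get("text", "")):
            hits.append((index, paragraph.get("section", "")))
    return hits


regx = re.compile("pleural", re.IGNORECASE)

# Made-up stand-in for one result of the regex query above.
sample = {
    "body_text": [
        {"text": "Placeholder paragraph with no match.", "section": "Introduction"},
        {"text": "Placeholder paragraph mentioning Pleural fluid.", "section": "Results"},
    ]
}
print(matching_paragraphs(sample, regx))
```

In the notebook, the helper would be applied to each `r` yielded by the `find` loop above.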

In [14]:
# query the text index created on the body_text.text field
# in the browser use { "body_text.text": "text" } to define the index

In [18]:
for c in db['noncomm-subset'].find({'$text':{'$search':'Pleural'}}):
    print(c)

{'_id': ObjectId('5f4ecbc6f7c2f0238d4cbc14'), 'paper_id': '1b58422e266ab9339c919119923229d080f27360', 'metadata': {'title': 'ACE/ACE2 Ratio and MMP-9 Activity as Potential Biomarkers in Tuberculous Pleural Effusions', 'authors': [{'first': 'Wen-Yeh', 'middle': [], 'last': 'Hsieh', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'National Chiao Tung University', 'location': {'settlement': 'Hsinchu', 'country': 'Taiwan'}}, 'email': ''}, {'first': 'Tang-Ching', 'middle': [], 'last': 'Kuan', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'National Chiao Tung University', 'location': {'settlement': 'Hsinchu', 'country': 'Taiwan'}}, 'email': ''}, {'first': 'Kun-Shan', 'middle': [], 'last': 'Cheng', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'National Chiao Tung University', 'location': {'settlement': 'Hsinchu', 'country': 'Taiwan'}}, 'email': ''}, {'first': 'Yan-Chiou', 'middle': [], 'last': 'Liao', 'suffix': '', 'affiliation': {'laboratory'

In [35]:
for a in db['noncomm-subset'].aggregate([{'$group':{'_id':{'$arrayElemAt': ['$metadata.authors.affiliation.location.country', 0]},
                                                    'count':{'$sum': 1}}}]):
    print(a)

{'_id': 'United States', 'count': 1}
{'_id': "People's Republic of China", 'count': 1}
{'_id': 'UK', 'count': 2}
{'_id': 'Brazil, Brazil, Brazil (', 'count': 1}
{'_id': 'P.R. China, P.R. China', 'count': 1}
{'_id': 'Italy', 'count': 1}
{'_id': 'Hungary', 'count': 1}
{'_id': 'USA', 'count': 14}
{'_id': 'Denmark', 'count': 1}
{'_id': 'Switzerland', 'count': 1}
{'_id': 'Japan', 'count': 1}
{'_id': None, 'count': 27}
{'_id': 'The Netherlands', 'count': 2}
{'_id': 'Paraguay', 'count': 1}
{'_id': 'Korea', 'count': 4}
{'_id': 'France', 'count': 1}
{'_id': 'China', 'count': 6}
{'_id': 'Israel', 'count': 1}
{'_id': 'Ann Arbor', 'count': 1}
{'_id': 'Taiwan', 'count': 3}
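
The country labels are free text, so the same country shows up under several spellings ('USA' and 'United States', or 'China' and 'P.R. China, P.R. China'). A sketch of merging such variants client-side with a `Counter`. The alias table is an assumption written for this sample, not an exhaustive mapping, and the rows are copied from the output above.

```python
from collections import Counter

# Hypothetical alias table covering only some variants seen in the output.
COUNTRY_ALIASES = {
    "United States": "USA",
    "People's Republic of China": "China",
    "P.R. China, P.R. China": "China",
}


def merge_country_counts(rows):
    """Sum the counts of rows whose _id values are aliases of one country."""
    merged = Counter()
    for row in rows:
        label = row["_id"]
        merged[COUNTRY_ALIASES.get(label, label)] += row["count"]
    return merged


# A subset of the aggregation output shown above.
rows = [
    {"_id": "United States", "count": 1},
    {"_id": "People's Republic of China", "count": 1},
    {"_id": "P.R. China, P.R. China", "count": 1},
    {"_id": "USA", "count": 14},
    {"_id": None, "count": 27},
    {"_id": "China", "count": 6},
]
for country, count in merge_country_counts(rows).most_common():
    print(country, count)
```

Labels that are not in the alias table, including `None`, pass through unchanged.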



In [17]:
for a in db['noncomm-subset'].aggregate([{'$group':{'_id':'$paper_id','count':{'$sum': 1}}}]):
    print(a)

{'_id': '0dc475f4419836c2dc352073498f56ae982170a0', 'count': 1}
{'_id': '0adc0c7498f0ebbb3c8f1c662da34605d5f91fe2', 'count': 1}
{'_id': '01af2562df4acf3113843d039c3b82bb801b1427', 'count': 1}
{'_id': '0e63d19fa3d2ebfcc7cdb06f85225ae0489acf8e', 'count': 1}
{'_id': '0f90aed025cdf162b89014244740266ce67a3f47', 'count': 1}
{'_id': '1a2900a53677a6c68ff37167c31ca33ad219acdb', 'count': 1}
{'_id': '0c3068b22cb2cb50114316f9fed2738a943d2435', 'count': 1}
{'_id': '0dc8d11784da63b899dbb2b404be4efd330e4ac3', 'count': 1}
{'_id': '0f0bb7346d45679cc1bb2435c66d5ad3ef52c108', 'count': 1}
{'_id': '01bc7fe59fc7feb0e3d23c716aa23a694a4362a2', 'count': 1}
{'_id': '1b7520912fbd483ef60014fe0e7f0d0c2df1d07e', 'count': 1}
{'_id': '0b90f302993b075229dbe5f9bc03a180b2e2632f', 'count': 1}
{'_id': '0a098b3876e799d9b21608e26e80d05aa9ec6475', 'count': 1}
{'_id': '0c92f5b237572a3461ae2205a62ba7622c07a6ab', 'count': 1}
{'_id': '0ad29b032fe87db2641893f4cc880c22f2107712', 'count': 1}
{'_id': '0a19aacc124c9c42d66d481a9d0c837
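
Every count visible above is 1, which is what one would expect if `paper_id` is unique. To have the server report only repeated ids, a `$match` stage can follow the `$group`. A sketch of that pipeline as plain Python data (the variable name is an assumption); it would be passed to `aggregate` in the same way as the list above.

```python
# Group by paper_id, keep only the groups that occur more than once,
# and list the most frequent ids first.
duplicate_ids_pipeline = [
    {"$group": {"_id": "$paper_id", "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
    {"$sort": {"count": -1}},
]

for stage in duplicate_ids_pipeline:
    print(stage)
```

Running `db['noncomm-subset'].aggregate(duplicate_ids_pipeline)` and getting no rows back would mean that no `paper_id` is repeated.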