## CORD-19-Research-Data-Set

This workbook covers the process of querying data from a local MongoDB.

For this code to run on your local system, you'll need to download the CORD-19 data set and build a local MongoDB database from a subset of the JSON files.

### Download data

You can download the data from:

https://pages.semanticscholar.org/coronavirus-research

I'm using the non-commercial use subset (76 MB download). To keep the example size manageable, I moved the first twenty records into the data directory of this repository.

### Install Mongo

Installation instructions are available in the MongoDB Manual at:

https://docs.mongodb.com/manual/installation/

### Mongo Client

You may want to get familiar with the MongoDB client and CRUD operations before working with Python.

### Build a MongoDB from the CORD-19 Subset

Start the MongoDB shell:
```
mongo
```

Switch to a database named "covid-noncomm-use-subset" (MongoDB creates it when data is first written to it):
```
use covid-noncomm-use-subset
```

To import the CORD-19 JSON files into a new collection named "pmc_content", run this script in your system shell (not the mongo shell) from the data directory that contains the JSON files:
```
for f in *.json
do
	mongoimport --db=covid-noncomm-use-subset --collection=pmc_content --file="$f"
done
```
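
If you would rather stay in Python, the same import can be sketched with pymongo. This is a minimal sketch and not part of the original workflow: the function names `load_papers` and `import_papers` are hypothetical, the `"data"` path in the usage comment is an assumption, and it assumes each `.json` file holds exactly one paper document.

```python
import json
from pathlib import Path


def load_papers(data_dir):
    """Read every *.json file in data_dir and return the parsed documents."""
    papers = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            papers.append(json.load(handle))
    return papers


def import_papers(collection, data_dir):
    """Insert all papers from data_dir into a pymongo collection.

    Returns the number of documents inserted.
    """
    papers = load_papers(data_dir)
    if not papers:
        # insert_many rejects an empty list, so return early
        return 0
    result = collection.insert_many(papers)
    return len(result.inserted_ids)


# Example (requires a running MongoDB server and pymongo; path is an assumption):
#   from pymongo import MongoClient
#   db = MongoClient()["covid-noncomm-use-subset"]
#   import_papers(db.pmc_content, "data")
```

Passing the collection in as an argument keeps the file handling separate from the database connection, so the loader can be checked without a server.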


In [1]:
import json
from pymongo import MongoClient

In [2]:
client = MongoClient()

In [3]:
db = client['covid-noncomm-use-subset']

In [4]:
for c in db.pmc_content.find():
    print(c)

{'_id': ObjectId('5ea71355b3c4f60e9b93e8f9'), 'paper_id': '00a00d0edc750db4a0c299dd1ec0c6871f5a4f24', 'metadata': {'title': '', 'authors': []}, 'abstract': [], 'body_text': [{'text': 'This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/bync/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.', 'cite_spans': [], 'ref_spans': [], 'section': 'Editorial'}, {'text': "The Middle East Respiratory Syndrome (MERS) crisis in Korea is coming to an end, but that's not the end of the story. The World Health Organization (WHO) has warned that the MERS crisis was a 'wakeup call. ' Like Severe Acute Respiratory Syndrome (SARS) in 2003 and swine flu in 2009, epidemic outbreaks will continue to happen anywhere. Korea was simply unlucky. In retrospect, it is regrettable to say that we could have done much better

In [5]:
# to print the titles only and suppress the id

for c in db.pmc_content.find({},{ 'metadata.title': 1, '_id': 0 }):
    print(c)

{'metadata': {'title': ''}}
{'metadata': {'title': 'Identification of Leukotoxin and other vaccine candidate proteins in a Mannheimia haemolytica commercial antigen'}}
{'metadata': {'title': 'Bone Marrow Dendritic Cells from Mice with an Altered Microbiota Provide Interleukin 17A-Dependent Protection against Entamoeba histolytica Colitis'}}
{'metadata': {'title': ''}}
{'metadata': {'title': 'Identification of Leukotoxin and other vaccine candidate proteins in a Mannheimia haemolytica commercial antigen'}}
{'metadata': {'title': ''}}
{'metadata': {'title': 'The outbreak of COVID-19: An overview'}}
{'metadata': {'title': 'VIRUS ON T H E CEREBRAL ACTIVITY OF P L E U R O P N E U M O N I A -L I K E ORGANISMS I N M I C E'}}
{'metadata': {'title': 'TNF-α − 308 G N A and IFN-γ + 874 A N T gene polymorphisms in Egyptian patients with lupus erythematosus'}}
{'metadata': {'title': 'Molecular epidemiology and phylogenetic analysis of diverse bovine astroviruses associated with diarrhea in cattle a
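
The projection output above contains empty titles and at least one repeated title. A small helper can tidy this up on the Python side. This is only a sketch: `unique_titles` is a hypothetical name, and the sample documents are hand-written to mirror the shape of the projected results.

```python
def unique_titles(docs):
    """Return non-empty titles in first-seen order, without duplicates."""
    seen = set()
    titles = []
    for doc in docs:
        title = doc.get("metadata", {}).get("title", "").strip()
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


# Sample documents shaped like the projection output above
sample = [
    {"metadata": {"title": ""}},
    {"metadata": {"title": "The outbreak of COVID-19: An overview"}},
    {"metadata": {"title": "The outbreak of COVID-19: An overview"}},
]
print(unique_titles(sample))
# prints ['The outbreak of COVID-19: An overview']
```

With a live connection, the same helper should accept the cursor directly, for example `unique_titles(db.pmc_content.find({}, {'metadata.title': 1, '_id': 0}))`.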

In [6]:
# find one paper by paper_id

for c in db.pmc_content.find({'paper_id': '00a00d0edc750db4a0c299dd1ec0c6871f5a4f24'}):
    print(c)

{'_id': ObjectId('5ea71355b3c4f60e9b93e8f9'), 'paper_id': '00a00d0edc750db4a0c299dd1ec0c6871f5a4f24', 'metadata': {'title': '', 'authors': []}, 'abstract': [], 'body_text': [{'text': 'This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/bync/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.', 'cite_spans': [], 'ref_spans': [], 'section': 'Editorial'}, {'text': "The Middle East Respiratory Syndrome (MERS) crisis in Korea is coming to an end, but that's not the end of the story. The World Health Organization (WHO) has warned that the MERS crisis was a 'wakeup call. ' Like Severe Acute Respiratory Syndrome (SARS) in 2003 and swine flu in 2009, epidemic outbreaks will continue to happen anywhere. Korea was simply unlucky. In retrospect, it is regrettable to say that we could have done much better
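
Printing a whole record is hard to read. When you expect a single match, pymongo's `find_one` returns the document itself (or `None`), and the `body_text` paragraphs can then be joined into plain text. The helper below is a minimal sketch under that assumption: `body_as_text` is a hypothetical name, and the sample record is hand-written to mirror the `body_text` structure shown above.

```python
def body_as_text(paper):
    """Join a paper's body_text paragraphs, with a heading per section."""
    lines = []
    current_section = None
    for paragraph in paper.get("body_text", []):
        section = paragraph.get("section", "")
        if section != current_section:
            # Emit the section name once, each time it changes
            lines.append(f"[{section}]")
            current_section = section
        lines.append(paragraph["text"])
    return "\n".join(lines)


# Hypothetical sample shaped like the record printed above
sample_paper = {
    "paper_id": "example",
    "body_text": [
        {"text": "First paragraph.", "section": "Editorial"},
        {"text": "Second paragraph.", "section": "Editorial"},
    ],
}
print(body_as_text(sample_paper))
```

Against the database, this could be used as `paper = db.pmc_content.find_one({'paper_id': '00a00d0edc750db4a0c299dd1ec0c6871f5a4f24'})`, followed by `body_as_text(paper)` once you have checked that `paper` is not `None`.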

In [7]:
# query on nested documents
# see: https://docs.mongodb.com/manual/tutorial/query-embedded-documents/

for c in db.pmc_content.find({'metadata.title': 'Identification of Leukotoxin and other vaccine candidate proteins in a Mannheimia haemolytica commercial antigen'}):
    print(c)

{'_id': ObjectId('5ea713694867d6c5a9bc5a31'), 'paper_id': '0a003aa69f43cee4357f1e943df79a8b87c0a88e', 'metadata': {'title': 'Identification of Leukotoxin and other vaccine candidate proteins in a Mannheimia haemolytica commercial antigen', 'authors': [{'first': 'Paula', 'middle': [], 'last': 'Tucci', 'suffix': '', 'affiliation': {'laboratory': 'Laboratorios Celsius', 'institution': '', 'location': {'postCode': '6201', 'settlement': 'Montevideo', 'country': 'S.A. Avenida Italia, Uruguay'}}, 'email': ''}, {'first': 'Verónica', 'middle': [], 'last': 'Estevez', 'suffix': '', 'affiliation': {'laboratory': 'Laboratorios Celsius', 'institution': '', 'location': {'postCode': '6201', 'settlement': 'Montevideo', 'country': 'S.A. Avenida Italia, Uruguay'}}, 'email': ''}, {'first': 'Lorena', 'middle': [], 'last': 'Becco', 'suffix': '', 'affiliation': {'laboratory': 'Laboratorios Celsius', 'institution': '', 'location': {'postCode': '6201', 'settlement': 'Montevideo', 'country': 'S.A. Avenida Itali

In [8]:
# query on an array of embedded documents
# see: https://docs.mongodb.com/manual/tutorial/query-array-of-documents/

for c in db.pmc_content.find({'metadata.authors.first': 'Florencia'}):
    print(c)

{'_id': ObjectId('5ea713694867d6c5a9bc5a31'), 'paper_id': '0a003aa69f43cee4357f1e943df79a8b87c0a88e', 'metadata': {'title': 'Identification of Leukotoxin and other vaccine candidate proteins in a Mannheimia haemolytica commercial antigen', 'authors': [{'first': 'Paula', 'middle': [], 'last': 'Tucci', 'suffix': '', 'affiliation': {'laboratory': 'Laboratorios Celsius', 'institution': '', 'location': {'postCode': '6201', 'settlement': 'Montevideo', 'country': 'S.A. Avenida Italia, Uruguay'}}, 'email': ''}, {'first': 'Verónica', 'middle': [], 'last': 'Estevez', 'suffix': '', 'affiliation': {'laboratory': 'Laboratorios Celsius', 'institution': '', 'location': {'postCode': '6201', 'settlement': 'Montevideo', 'country': 'S.A. Avenida Italia, Uruguay'}}, 'email': ''}, {'first': 'Lorena', 'middle': [], 'last': 'Becco', 'suffix': '', 'affiliation': {'laboratory': 'Laboratorios Celsius', 'institution': '', 'location': {'postCode': '6201', 'settlement': 'Montevideo', 'country': 'S.A. Avenida Itali
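
The dot-notation query above matches a paper if any author has the given first name. If you add a second condition the same way, each condition may be satisfied by a different author. To require that both conditions hold for the same author, MongoDB provides the `$elemMatch` operator. The sketch below only builds the filter documents; `author_filter` is a hypothetical helper name, and the author names are taken from the output above.

```python
def author_filter(first, last):
    """Build a filter matching papers where ONE author has both names."""
    return {"metadata.authors": {"$elemMatch": {"first": first, "last": last}}}


# Dot notation alone lets each condition match a different author, so this
# filter would also match a paper written by "Paula Tucci" and "Lorena Becco":
loose = {"metadata.authors.first": "Paula", "metadata.authors.last": "Becco"}

# $elemMatch requires both conditions to hold for the same array element:
strict = author_filter("Paula", "Tucci")
print(strict)
```

Either dictionary can be passed as the first argument to `db.pmc_content.find(...)`.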

### Query on a text index

To query on a search phrase or word, you'll need to build a text index on the fields you want to search. For this tutorial, we'll do this with the MongoDB shell. 

Note: if the field is nested, you'll need to put it in quotation marks when you build the text index.

To build the index:
```
db.pmc_content.createIndex( { "body_text.text": "text" } )
```

To list all the indexes you have on a collection:
```
db.pmc_content.getIndexes()
```

To remove the index:
```
db.pmc_content.dropIndex("body_text.text_text")
```
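
The same index operations are also available from pymongo, which may be convenient if you want to keep everything in the notebook. The following is a minimal sketch rather than the method used in this tutorial: `ensure_text_index` and `list_index_names` are hypothetical wrapper names around pymongo's `create_index` and `index_information`.

```python
TEXT_FIELD = "body_text.text"


def ensure_text_index(collection, field=TEXT_FIELD):
    """Create a text index on field and return the index name."""
    # "text" is the index type; pymongo also exposes it as pymongo.TEXT
    return collection.create_index([(field, "text")])


def list_index_names(collection):
    """Return the names of all indexes on the collection, sorted."""
    # index_information() returns a dict keyed by index name
    return sorted(collection.index_information())


# Example (requires a running MongoDB server and pymongo):
#   from pymongo import MongoClient
#   db = MongoClient()["covid-noncomm-use-subset"]
#   name = ensure_text_index(db.pmc_content)
#   print(list_index_names(db.pmc_content))
#   db.pmc_content.drop_index(name)
```

Keeping the returned index name around avoids having to type it by hand when dropping the index later.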

In [9]:
# query the text index created on the body_text.text field

for c in db.pmc_content.find({'$text':{'$search':'Ebola'}}):
    print(c)

{'_id': ObjectId('5ea715ef19a2fa9193482bdf'), 'paper_id': '0a17c029515e527bfad8a67810ab82f4a3d6a299', 'metadata': {'title': 'The outbreak of COVID-19: An overview', 'authors': [{'first': 'Yi-Chi', 'middle': [], 'last': 'Wu', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'Taipei Veterans General Hospital', 'location': {'settlement': 'Taipei', 'country': 'Taiwan, ROC'}}, 'email': ''}, {'first': 'Ching-Sung', 'middle': [], 'last': 'Chen', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'Taipei Veterans General Hospital', 'location': {'settlement': 'Taipei', 'country': 'Taiwan, ROC'}}, 'email': ''}, {'first': 'Yu-Jiun', 'middle': [], 'last': 'Chan', 'suffix': '', 'affiliation': {'laboratory': '', 'institution': 'Taipei Veterans General Hospital', 'location': {'settlement': 'Taipei', 'country': 'Taiwan, ROC'}}, 'email': 'yjchan@vghtpe.gov.twy.-j.chan.'}]}, 'abstract': [], 'body_text': [{'text': 'In late December 2019, an outbreak of a mysterious pneumonia cha
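
A text search can also return a relevance score for each match, which makes it possible to list the best-matching titles instead of printing whole records. The sketch below assumes the text index on `body_text.text` already exists; `search_titles` is a hypothetical helper name, and the `score` field name is an arbitrary choice.

```python
def search_titles(collection, term, limit=5):
    """Return (score, title) pairs for a text search, best matches first."""
    score = {"$meta": "textScore"}
    cursor = (
        collection.find(
            {"$text": {"$search": term}},
            # Project the relevance score and the title only
            {"score": score, "metadata.title": 1, "_id": 0},
        )
        .sort([("score", score)])
        .limit(limit)
    )
    return [(doc["score"], doc["metadata"]["title"]) for doc in cursor]


# Example (requires a running MongoDB server and pymongo):
#   from pymongo import MongoClient
#   db = MongoClient()["covid-noncomm-use-subset"]
#   for score, title in search_titles(db.pmc_content, "Ebola"):
#       print(round(score, 2), title)
```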