## CORD-19-Research-Data-Set

This notebook shows how to query CORD-19 data stored in a local MongoDB database from Python.

For this code to run on your local system, you'll need to download the CORD-19 data set and build a local MongoDB database from a subset of the JSON files.

### Download data

You can download the data from:

https://pages.semanticscholar.org/coronavirus-research

I'm using the non-commercial use subset (76 MB download). To keep the example size manageable, I moved the first twenty records into the data directory of this repository.

### Install Mongo

Installation instructions are available in the MongoDB Manual at:

https://docs.mongodb.com/manual/installation/

### Mongo Client

You may want to get familiar with the MongoDB client and CRUD operations before working with Python.

### Build a MongoDB from the CORD-19 Subset

Start the `mongo` shell:
```
mongo
```

Switch to a database named "covid-noncomm-use-subset" (MongoDB creates it when the first document is inserted):
```
use covid-noncomm-use-subset
```

To import the CORD-19 JSON files into a new collection named "pmc_content", run the following loop from your system shell (not the mongo shell), in the data directory that holds the JSON files:
```
for f in *.json
do
	mongoimport --db=covid-noncomm-use-subset --collection=pmc_content --file="$f"
done
```
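
If you would rather stay in Python, the same import could be sketched with the standard `json` module and a collection's `insert_many` method. This is a minimal sketch, not part of the original workflow: `read_json_documents` and `import_documents` are hypothetical helper names, and the code assumes each `.json` file holds exactly one document.

```python
import json
from pathlib import Path


def read_json_documents(data_dir):
    """Return one dict per *.json file in data_dir, in sorted filename order.

    Assumes each file contains a single JSON document.
    """
    documents = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            documents.append(json.load(handle))
    return documents


def import_documents(collection, documents):
    """Insert the documents into a pymongo collection; return how many were inserted."""
    if not documents:
        # insert_many raises on an empty list, so skip the call entirely
        return 0
    result = collection.insert_many(documents)
    return len(result.inserted_ids)
```

With the server running, a call such as `import_documents(MongoClient()['covid-noncomm-use-subset'].pmc_content, read_json_documents('.'))` would mirror the shell loop above.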


In [None]:
import json
from pymongo import MongoClient

In [None]:
client = MongoClient()

In [None]:
db = client['covid-noncomm-use-subset']

In [None]:
for c in db.pmc_content.find():
    print(c)

In [None]:
# to print the titles only and suppress the id

for c in db.pmc_content.find({},{ 'metadata.title': 1, '_id': 0 }):
    print(c)
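
Because the projection names a dotted path, each result keeps its nesting and comes back as `{'metadata': {'title': ...}}` rather than as a bare string. A small helper can flatten that into a list of titles. This is only a sketch: `extract_titles` is a hypothetical name, and the `sample` list stands in for the cursor so the snippet runs without a database.

```python
def extract_titles(documents):
    """Collect metadata.title from documents projected with {'metadata.title': 1}.

    Documents without a title are skipped.
    """
    titles = []
    for doc in documents:
        title = doc.get("metadata", {}).get("title")
        if title:
            titles.append(title)
    return titles


# Stand-in for the cursor returned by db.pmc_content.find(...).
# The second title is a made-up placeholder; the third document has no title.
sample = [
    {"metadata": {"title": "Identification of Leukotoxin and other vaccine "
                           "candidate proteins in a Mannheimia haemolytica "
                           "commercial antigen"}},
    {"metadata": {"title": "Placeholder title"}},
    {"metadata": {}},
]
print(extract_titles(sample))
```

Against the real collection, the call would be `extract_titles(db.pmc_content.find({}, {'metadata.title': 1, '_id': 0}))`.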

In [None]:
# find one paper by paper_id

for c in db.pmc_content.find({'paper_id': '00a00d0edc750db4a0c299dd1ec0c6871f5a4f24'}):
    print(c)
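
When you expect at most one match, pymongo's `find_one` returns a single dict (or `None`) instead of a cursor, which avoids the loop. The sketch below assumes `paper_id` values are unique within the collection; `get_paper_title` is a hypothetical helper name.

```python
def get_paper_title(collection, paper_id):
    """Return the title of the paper with this paper_id, or None if not found."""
    doc = collection.find_one(
        {"paper_id": paper_id},
        {"metadata.title": 1, "_id": 0},  # fetch only the title
    )
    if doc is None:
        return None
    return doc.get("metadata", {}).get("title")
```

For example, `get_paper_title(db.pmc_content, '00a00d0edc750db4a0c299dd1ec0c6871f5a4f24')` would look up the paper queried above.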

In [None]:
# query on nested documents
# see: https://docs.mongodb.com/manual/tutorial/query-embedded-documents/

for c in db.pmc_content.find({'metadata.title': 'Identification of Leukotoxin and other vaccine candidate proteins in a Mannheimia haemolytica commercial antigen'}):
    print(c)

In [None]:
# query on an array of embedded documents
# see: https://docs.mongodb.com/manual/tutorial/query-array-of-documents/

for c in db.pmc_content.find({'metadata.authors.first': 'Florencia'}):
    print(c)
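
The dotted query above matches a paper if any of its authors has that first name. To require several conditions on the same author entry, MongoDB provides the `$elemMatch` operator. The sketch below assumes each author document also carries a `last` field next to `first` (only `first` appears in this notebook); `author_filter` is a hypothetical helper name.

```python
def author_filter(first, last):
    """Build a filter matching papers where ONE author entry has both names.

    Assumes author documents have 'first' and 'last' fields.
    """
    return {
        "metadata.authors": {
            "$elemMatch": {"first": first, "last": last}
        }
    }
```

The result can be passed straight to `db.pmc_content.find(...)`. Without `$elemMatch`, a filter on `'metadata.authors.first'` and `'metadata.authors.last'` could be satisfied by two different authors of the same paper.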

### Query on a text index

To query on a search phrase or word, you'll need to build a text index on the fields you want to search. For this tutorial, we'll do this in the MongoDB shell, after switching to the database with `use covid-noncomm-use-subset`.

Note: if the field is nested, you'll need to put its dotted path in quotation marks when you build the text index.

To build the index:
```
db.pmc_content.createIndex( { "body_text.text": "text" } )
```

To list all the indexes you have on a collection:
```
db.pmc_content.getIndexes()
```

To remove the index:
```
db.pmc_content.dropIndex("body_text.text_text")
```
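
The same three index operations are also available from Python through pymongo's `create_index`, `index_information`, and `drop_index` collection methods. The following is a rough sketch rather than part of the tutorial's workflow; the three function names are hypothetical, and the drop helper assumes the index kept MongoDB's default name of `<field>_text`.

```python
def ensure_text_index(collection, field="body_text.text"):
    """Create a text index on field and return the index name."""
    # "text" is the index type; pymongo also exposes it as pymongo.TEXT
    return collection.create_index([(field, "text")])


def list_index_names(collection):
    """Return the names of all indexes on the collection, sorted."""
    return sorted(collection.index_information())


def drop_text_index(collection, field="body_text.text"):
    """Drop the text index, assuming the default name <field>_text."""
    collection.drop_index(f"{field}_text")
```

Each function takes a collection such as `db.pmc_content`, so `ensure_text_index(db.pmc_content)` would correspond to the `createIndex` call above.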

In [None]:
# query the text index created on the body_text.text field

for c in db.pmc_content.find({'$text':{'$search':'Ebola'}}):
    print(c)
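
Text searches return whole documents in no particular order. To rank matches by relevance, MongoDB can project and sort on the `textScore` metadata. The sketch below shows one way to do that with pymongo; `search_titles` is a hypothetical helper, and `score` is just the field name chosen here for the projected relevance value.

```python
def search_titles(collection, phrase, limit=5):
    """Return (score, title) pairs for the most relevant text-index matches."""
    cursor = (
        collection.find(
            {"$text": {"$search": phrase}},
            # project the relevance score and the title only
            {"score": {"$meta": "textScore"}, "metadata.title": 1, "_id": 0},
        )
        .sort([("score", {"$meta": "textScore"})])  # highest score first
        .limit(limit)
    )
    return [
        (doc["score"], doc.get("metadata", {}).get("title"))
        for doc in cursor
    ]
```

For example, `search_titles(db.pmc_content, 'Ebola')` would list up to five titles with their scores, provided the text index exists.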