# BRAINWORKS - Data Exploration Tutorial
[Mohammad M. Ghassemi](https://ghassemi.xyz), DATA Scholar, 2021


## About
This notebook provides a gentle overview of the code utilities that power the BRAINWORKS data collection engine. In this tutorial, we will cover how to use the tools to query and analyze data from the PubMed API via the [Entrez Programming Utilities](https://www.ncbi.nlm.nih.gov/books/NBK25501/).

<hr>

## 0. Install Dependencies:
To begin, please import the following external and internal Python libraries:

In [1]:
# External Libraries
from   pprint import pprint
import importlib
import json
import datetime
import time
import glob

import os
import sys
currentdir = os.getcwd()
parentdir  = os.path.dirname(currentdir)
sys.path.insert(0, parentdir)

# Internal Libraries
from configuration.config import config
from utils.documentCollector.pubmed import pubmed



<br><br>
## 1. Obtain an NCBI Key
This tool assumes access to the [NCBI Entrez Programming Utilities](https://www.ncbi.nlm.nih.gov/books/NBK25497/). To ensure proper functionality, please:
1. [Register](https://www.ncbi.nlm.nih.gov/account/) for an NCBI account.
2. Navigate to the **settings** page by clicking your *username* in the top-right corner.
3. Under the **API Key Management** section, click *Create an API Key*.
4. Copy the resulting key into the configuration variable in `/configuration/config.py`.
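If you would rather not store the key in a file that may end up under version control, one option is to read it from an environment variable and fall back to the configuration dictionary. The sketch below is only an illustration: the helper `load_ncbi_key`, the environment variable name `NCBI_API_KEY`, and the config field of the same name are all assumptions, not names taken from the BRAINWORKS code.

```python
import os

def load_ncbi_key(config, env_var="NCBI_API_KEY", config_field="NCBI_API_KEY"):
    """Return the API key from the environment if set, else from the config dict."""
    # Both the variable name and the config field are hypothetical placeholders.
    return os.environ.get(env_var) or config.get(config_field)

# Stand-in for the real configuration dictionary.
example_config = {"NCBI_API_KEY": "paste-your-key-here"}
api_key = load_ncbi_key(example_config)
print("key loaded:", bool(api_key))
```

The environment variable takes precedence here so that a key can be supplied per machine without editing the shared configuration.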


<br><br>
## 2. Searching for Papers

Before downloading any papers, we may first want to search the PubMed archive for papers with properties of interest. For instance, we may want to pull the document identifiers of all PubMed papers (`'db':'pubmed'`) containing the term brain (`'term':'brain'`) that were published on January 15, 2020 (`'date':'2020/01/15'`). To do this, we would specify the following query parameter object:

In [14]:
# Specification of query parameters for a PubMed id search
query_params = {'db'       : 'pubmed',         # Database: 'pubmed', 'pmc', 'nlmcatalog'     
                'term'     : 'brain',          # Search term, e.g. 'brain', use None to get all records on a specific date.
                'date'     : '2020/01/15',     # The date that you want to pull papers from                                  
                }

With these parameters specified, we can call the `collect` method of the `pubmed` class (imported from the `documentCollector` utilities) to fetch the document ids that satisfy our query. More specifically, we perform a document search (`action = 'search'`) and store the results under `../data/brain/`. We can ask the tool to overwrite results if they already exist (`replace_existing = True`) and to display the GET request used to obtain them (`show_query = True`); you can click the resulting hyperlink to inspect the raw response.

In [16]:
# Execution of an id search
pm = pubmed()
pm.collect(action           = 'search',        # The action we want to perform: `search` or `fetch`
           write_location   = '../data/brain/',# The location we want to store the result
           query_params     = query_params,    # The query parameters
           replace_existing = True,            # `True` replace existing result, `False` do not replace existing results
           show_query       = True)            # `True` show the API call, `False` don't show the API call.

See Results Here:
 https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=brain+AND+2020/01/15[pdat]&api_key=26000c321e5d45fa14e115f859303a77e808&retmode=xml&retstart=0&retmax=10000& 



{'status': 'download',
 'location': '../data/brain/2020/01/15/search/json/db-pubmed_term-brain_date-20200115_retmode-xml_retstart-0_retmax-10000.json'}
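To make the displayed request less opaque, here is a minimal sketch of how query parameters like the ones above could be assembled into an ESearch URL with the standard library. The helper name `build_esearch_url` and its defaults are assumptions for illustration and are not part of the BRAINWORKS code; the parameter names (`db`, `term`, `retmode`, `retstart`, `retmax`, `api_key`) and the `[pdat]` date tag are copied from the URL printed above.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(query_params, api_key=None, retstart=0, retmax=10000):
    """Hypothetical helper: turn tutorial-style query_params into an ESearch URL."""
    # Restrict the search to a publication date, as in the printed request.
    date_filter = f"{query_params['date']}[pdat]"
    term = query_params.get("term")
    full_term = f"{term} AND {date_filter}" if term else date_filter
    args = {
        "db": query_params["db"],
        "term": full_term,
        "retmode": "xml",
        "retstart": retstart,
        "retmax": retmax,
    }
    if api_key:
        args["api_key"] = api_key
    # Keep '/', '[' and ']' unescaped so the URL resembles the one shown above.
    return ESEARCH + "?" + urlencode(args, safe="/[]")

url = build_esearch_url({"db": "pubmed", "term": "brain", "date": "2020/01/15"})
print(url)
```

Passing `term=None` in this sketch leaves only the date filter, which mirrors the "all records on a specific date" behavior described in the parameter comments.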

Clicking the hyperlink above, we can navigate through the XML to find the `<Id>` of each paper that meets our search criteria. The first item in the list is `6782223`.
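Rather than reading the XML by eye, the ids can be pulled out programmatically. The sketch below parses an abbreviated, made-up ESearch response (real responses carry additional fields, and the ids here are placeholders) and collects every `<Id>` under `<IdList>`. The function name `extract_ids` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated, made-up ESearch response; the ids are placeholders.
SAMPLE_XML = """<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""

def extract_ids(xml_text):
    """Return the text of every <Id> element inside <IdList>."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]

ids = extract_ids(SAMPLE_XML)
print(ids)
```

A list produced this way has the same shape as the `'id'` entry expected by a fetch query.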

<br><br>

## 3. Fetching a few specific papers
Now that we have the document ids of some papers that satisfy our search criteria, our next step might be to fetch the records for these papers and store them for later analysis. We can fetch papers from PubMed by specifying a list of ids (`'id' : ['4534530', '33324907', '33062166']`), the same kind of identifiers that a search returns.

In [17]:
# Specification of query parameters to collect specific documents by id.
query_params = {'db'      : 'pubmed',                                # Database: 'pubmed', 'pmc', 'nlmcatalog'  
                'id'      : ['4534530', '33324907', '33062166'],     # ids of the papers
                }

<br>As before, we can pass this query to the `collect` method, this time selecting a `fetch` action.

In [18]:
# Collecting the documents listed in query_params
c = pubmed()
c.collect(action           = 'fetch',                # The action you want to perform: `search` or `fetch`
          write_location   = '../data/brain',        # Where you want to write the results of your collection
          query_params     = query_params,           # Your query parameters
          replace_existing = True,                   # Replaces file if replace_existing == True
          show_query       = True)                   # Displays the API call used to fetch the result

See Results Here:
 https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=4534530,33324907,33062166&api_key=26000c321e5d45fa14e115f859303a77e808&retmode=xml& 



{'status': 'download',
 'location': ['../data/brain/1974/01/01/fetch/json/db-pubmed_id-4534530_retmode-xml.json',
  '../data/brain/2020/01/01/fetch/json/db-pubmed_id-33324907_retmode-xml.json',
  '../data/brain/2020/10/08/fetch/json/db-pubmed_id-33062166_retmode-xml.json']}
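The returned locations suggest that each record is filed under a date folder, followed by the action and the file format. Assuming that layout (`<root>/<YYYY>/<MM>/<DD>/fetch/json/db-<db>_id-<id>_retmode-<mode>.json`, inferred only from the paths above), the following sketch recovers the date and id from a stored path. The helper `parse_fetch_path` is hypothetical and not part of the collector.

```python
import re
from datetime import date

# Pattern inferred from the paths printed above; it is an assumption, not a spec.
PATH_PATTERN = re.compile(
    r"/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/fetch/json/"
    r"db-(?P<db>[^_]+)_id-(?P<pmid>\d+)_retmode-(?P<retmode>[^.]+)\.json$"
)

def parse_fetch_path(path):
    """Return database, id, and folder date for a stored fetch result, or None."""
    match = PATH_PATTERN.search(path)
    if match is None:
        return None
    return {
        "db": match["db"],
        "pmid": match["pmid"],
        "date": date(int(match["year"]), int(match["month"]), int(match["day"])),
    }

info = parse_fetch_path(
    "../data/brain/1974/01/01/fetch/json/db-pubmed_id-4534530_retmode-xml.json"
)
print(info)
```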

<br><br>
## 4. Bulk Downloading Data

The two functionalities can be combined to collect all documents published within a date range, as shown below:

In [20]:
# Setting the pubmed Bulk Collector Parameters
parameters = {'start_date'       : '2005/01/03',       # The start-date that you would like to collect papers against 
              'end_date'         : '2005/01/05',       # The end-date that you would like to collect papers against        
              'database'         : 'pubmed',           # The database you want to search, e.g. PubMed
              'search_term'      : 'brain',            # Search term, e.g. 'brain', use None to get *all records* in a date range.
              'save_directory'   : '../data/brain/',   # The root directory that you will save the results to.
              'replace_existing' : False               # `True`: download and replace existing files, `False`: only download if the file doesn't exist.
             }

# Starting the Collector
pm = pubmed()
pm.bulkCollect(parameters=parameters)

------------------------------------------------
Starting Bulk Paper Collection
------------------------------------------------
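Conceptually, a bulk collection over a date range can be thought of as one search per day followed by a fetch of the returned ids. The sketch below illustrates only the date iteration, producing one set of search parameters per day. `daily_query_params` is a hypothetical helper, the end date is assumed to be inclusive, and the actual `bulkCollect` implementation may work differently.

```python
from datetime import datetime, timedelta

def daily_query_params(start_date, end_date, database, search_term):
    """Yield one search query_params dict per day, end date included (assumed)."""
    fmt = "%Y/%m/%d"
    day = datetime.strptime(start_date, fmt).date()
    last = datetime.strptime(end_date, fmt).date()
    while day <= last:
        yield {"db": database, "term": search_term, "date": day.strftime(fmt)}
        day += timedelta(days=1)

queries = list(daily_query_params("2005/01/03", "2005/01/05", "pubmed", "brain"))
for q in queries:
    print(q)
```

Each dictionary has the same shape as the `query_params` object used for the single-day search earlier in this tutorial.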


**PLEASE NOTE**: If you are interested in keeping your record collection up-to-date on a daily basis without making multiple API calls, you can use the National Library of Medicine's [daily pubmed article dump](https://www.nlm.nih.gov/databases/download/pubmed_medline.html).

<br><br>
## 5. Processing the Files
After the PubMed data has been downloaded, we can process the files and ingest them into the database. To start, let's list the downloaded files.

In [2]:
# Collecting a list of stored documents in the root directory `brain/`
pm = pubmed()
pm.getStoredDocumentList(data_path = config['data_directory'] + 'brain/')
print(pm.document_list[1:10], '...')

.... Identified 94 papers.
['../data/brain/2020/10/08/fetch/json/db-pubmed_id-33062166_retmode-xml.json', '../data/brain/2020/01/01/fetch/json/db-pubmed_id-33324907_retmode-xml.json', '../data/brain/2005/03/04/fetch/json/db-pubmed_id-15632190_retmode-xml.json', '../data/brain/2005/03/18/fetch/json/db-pubmed_id-15632154_retmode-xml.json', '../data/brain/2005/03/11/fetch/json/db-pubmed_id-15632127_retmode-xml.json', '../data/brain/2005/03/11/fetch/json/db-pubmed_id-15632119_retmode-xml.json', '../data/brain/2005/03/11/fetch/json/db-pubmed_id-15632147_retmode-xml.json', '../data/brain/2005/03/11/fetch/json/db-pubmed_id-15632143_retmode-xml.json', '../data/brain/2005/03/11/fetch/json/db-pubmed_id-15632188_retmode-xml.json'] ...
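If you want to inspect the downloaded files without the collector, a recursive glob produces a similar listing. This is a sketch under the assumption that fetched records live in `fetch/json/` folders somewhere below the data directory, as the paths above suggest; `list_fetched_json` is a hypothetical helper and not how `getStoredDocumentList` is necessarily implemented. The demo builds a throwaway directory so it can run anywhere.

```python
import glob
import os
import tempfile

def list_fetched_json(data_path):
    """Return all .json files stored in any fetch/json/ folder below data_path."""
    pattern = os.path.join(data_path, "**", "fetch", "json", "*.json")
    return sorted(glob.glob(pattern, recursive=True))

# Demo on a temporary directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "2005", "03", "11", "fetch", "json")
    os.makedirs(folder)
    open(os.path.join(folder, "db-pubmed_id-15632119_retmode-xml.json"), "w").close()
    found = list_fetched_json(root)
    print(len(found), "file(s) found")
```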


<br> Next, generate the tables needed to store the PubMed data:

In [None]:
pm.generateTables()      # Generate the required tables for the Pubmed Data

<br>
We can now process these records and import them into the database (set `db_insert = True` to actually write the results):

In [4]:
# Processing the papers in the root directory, and storing the processing log in the `brain-log` folder
pm.processPapers(paper_list          = pm.document_list,  # A list of .json pubmed papers on the disk 
                 log_folder          = 'brain-log',       # The log folder
                 showtime            = False,             # `True`: show the amount of time it takes to insert the records into the database.
                 db_insert           = False,             # `True`: insert the processed papers into the database; `False`: process papers, but don't insert results into the database.
                 purge_logs          = True,              # `True`: we will purge the logs that keep track of the paper processing; `False`: keep the logs, we will pick up processing from where we left off in the list.
                 prevent_duplication = False              # `True`: we will skip any pmids that are already in the `documents` table.
                )

....purging logs
batch complete 0


tail: logs/brain-log/processed.log: No such file or directory


Completing Final Batch


<br>

## 6. Query the Database
The `database` utility provides a helper function `getTableInfo()` that returns the properties of each table's columns (data type, size, comment, and keys) in the database; the returned object is a python `dictionary`, with one entry for each table name. We may, for instance, access more information on the `affiliations` table as follows:

In [5]:
from utils.database.database import database   # import the utility
db = database()   

In [6]:
table_info = db.getTableInfo() 
pprint(table_info.get('affiliations'))

{'affiliation': {'column_key': '',
                 'column_length': 65535,
                 'comment': 'The full affiliation string provided by PubMed.',
                 'data_type': 'text'},
 'affiliation_num': {'column_key': 'PRI',
                     'column_length': None,
                     'comment': '',
                     'data_type': 'bigint'},
 'country': {'column_key': '',
             'column_length': 250,
             'comment': 'The inferred country of the author.',
             'data_type': 'varchar'},
 'department': {'column_key': '',
                'column_length': 250,
                'comment': 'The inferred department of the author.',
                'data_type': 'varchar'},
 'email': {'column_key': '',
           'column_length': 250,
           'comment': 'The inferred email address of the author.',
           'data_type': 'varchar'},
 'first_name': {'column_key': '',
                'column_length': 100,
                'comment': 'The first name of the aut
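Because `getTableInfo()` returns nested dictionaries, ordinary dictionary operations are enough to summarize a table. The sketch below works on a hand-copied excerpt of the output above rather than a live database connection; the helpers `column_types` and `primary_keys` are hypothetical names introduced for illustration.

```python
# Excerpt copied from the output above; a real call returns every column.
sample_table_info = {
    "affiliations": {
        "affiliation": {"column_key": "", "column_length": 65535,
                        "comment": "The full affiliation string provided by PubMed.",
                        "data_type": "text"},
        "affiliation_num": {"column_key": "PRI", "column_length": None,
                            "comment": "", "data_type": "bigint"},
        "country": {"column_key": "", "column_length": 250,
                    "comment": "The inferred country of the author.",
                    "data_type": "varchar"},
    }
}

def column_types(table_info, table):
    """Map each column name of a table to its data type."""
    return {name: props["data_type"]
            for name, props in table_info.get(table, {}).items()}

def primary_keys(table_info, table):
    """List the columns flagged as primary key ('PRI') for a table."""
    return [name for name, props in table_info.get(table, {}).items()
            if props["column_key"] == "PRI"]

print(column_types(sample_table_info, "affiliations"))
print(primary_keys(sample_table_info, "affiliations"))
```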

<br>

The database object also has a static `id_map` variable that indicates which columns can be used to join tables. We can look up the columns that join the `affiliations` table to other tables, for instance, by calling:

In [7]:
pprint(db.id_map.get('affiliations'))

{'citations': ['citations.pmid', 'affiliations.pmid'],
 'documents': ['documents.pmid', 'affiliations.pmid'],
 'grants': ['grants.pmid', 'affiliations.pmid'],
 'id_map': ['id_map.pmid', 'affiliations.pmid'],
 'link_tables': ['link_tables.pmid', 'affiliations.pmid'],
 'publications': ['publications.pmid', 'affiliations.pmid'],
 'qualifiers': ['qualifiers.pmid', 'affiliations.pmid'],
 'topics': ['topics.pmid', 'affiliations.pmid']}


<br>

The output above shows the set of tables that can be joined with `affiliations` (`publications`, `citations` ... `topics`), and which columns specifically must be used to perform the join. To join information from the `publications` and `grants` tables, for instance, we would use the `pmid` column in each table. Let's run a simple query on the database to demonstrate:

In [8]:
db.query("""SELECT publications.*, 
                   documents.content 
              FROM publications 
              LEFT JOIN documents ON documents.pmid = publications.pmid
             WHERE publications.pmid = 15632119
               AND documents.content_type = 'abstract'
          """)

[{'pmid': 15632119,
  'pub_date': datetime.date(2005, 3, 11),
  'pub_title': 'Polymorphisms in human organic anion-transporting polypeptide 1A2 (OATP1A2): implications for altered drug disposition and central nervous system drug entry.',
  'country': 'United States',
  'issn': '0021-9258',
  'journal_issue': '10',
  'journal_title': 'The Journal of biological chemistry',
  'journal_title_abbr': 'J Biol Chem',
  'journal_volume': 280,
  'lang': 'eng',
  'page_number': '9610-7',
  'content': 'Organic anion-transporting polypeptide 1A2 (OATP1A2) is a drug uptake transporter known for broad substrate specificity, including many drugs in clinical use. Therefore, genetic variation in SLCO1A2 may have important implications to the disposition and tissue penetration of substrate drugs. In the present study, we demonstrate OATP1A2 protein expression in human brain capillary and renal distal nephron using immunohistochemistry. We also determined the extent of single nucleotide polymorphisms in S
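The join columns stored in `id_map` can also be used to assemble JOIN clauses programmatically instead of typing them by hand. The sketch below uses a small excerpt of the `id_map` output shown earlier; `join_clause` is a hypothetical helper, not part of the `database` utility, and it only builds a string without executing anything.

```python
# Excerpt of the id_map structure printed earlier in this section.
sample_id_map = {
    "affiliations": {
        "documents": ["documents.pmid", "affiliations.pmid"],
        "publications": ["publications.pmid", "affiliations.pmid"],
    }
}

def join_clause(id_map, base_table, other_table, how="JOIN"):
    """Build a JOIN clause linking other_table to base_table via id_map."""
    left, right = id_map[base_table][other_table]
    return f"{how} {other_table} ON {left} = {right}"

sql = "SELECT * FROM affiliations " + join_clause(
    sample_id_map, "affiliations", "publications"
)
print(sql)
```

Because the table and column names come from `id_map` rather than user input, plain string formatting is acceptable in this sketch; values in a WHERE clause should still be passed with care.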

<br><br>
## Appendix

#### A.1. Joining all the data
Here is an example query that joins every table in the database (note that this assumes you've imported grant data as well; see part 2 for that):

In [12]:
from   pprint import pprint
all_data = db.query("""SELECT *
                        FROM  documents 
                        JOIN  elements       ON  elements.element_id        = documents.element_id
                        JOIN  publications   ON  documents.pmid             = publications.pmid 
                        JOIN  grants         ON  documents.pmid             = grants.pmid
                        JOIN  id_map         ON  documents.pmid             = id_map.pmid
                        JOIN  topics         ON  documents.pmid             = topics.pmid
                        JOIN  qualifiers     ON  topics.pmid                = qualifiers.pmid 
                                             AND topics.topic_id            = qualifiers.topic_id
                        JOIN  affiliations   ON  documents.pmid             = affiliations.pmid
                        JOIN  citations      ON  documents.pmid             = citations.pmid 
                        JOIN  link_tables    ON  documents.pmid             = link_tables.pmid
                        JOIN  patents        ON  patents.project_id         = link_tables.project_number
                        JOIN  projects       ON  link_tables.project_number = projects.core_project_num
                        JOIN  abstracts      ON  projects.application_id    = abstracts.application_id
                        WHERE   documents.element_type LIKE 'abstract'
                          AND   topics.description     LIKE 'Neuronal Plasticity'
                          AND   topics.class           LIKE 'major'
                          AND   publications.pub_date  > '2012-1-1'
                          AND   publications.pub_date  < '2012-12-31'
                        LIMIT 1
                        """)
pprint(all_data)

[{'abstract_text': '  [unreadable] Description (provided by applicant): ALS is '
                   'a devastating disease causing progressive motor neuron '
                   'degeneration and death. Most ALS patients develop severe '
                   'respiratory insufficiency and, ultimately, die from '
                   'ventilatory failure. Despite its fundamental mportance, '
                   'respiratory function has seldom been studied in any ALS '
                   'model. In this revised application, we focus attention on '
                   'respiratory motor function in a rodent model of familial '
                   'ALS, the transgenic rat overexpressing mutated superoxide '
                   'dismutase-1 (SOD1G93A rat). The fundamental hypothesis '
                   'guiding this proposal is that compensatory spinal '
                   'neuroplasticity offsets severe motor neuron degeneration, '
                   'preserving the ability to breathe until late 