# BRAINWORKS - Data Collection Tutorial
[Mohammad M. Ghassemi](https://ghassemi.xyz), DATA Scholar, 2021

## About
The Data Generation Phase involved the creation of several software utilities to [Extract, Transform and Load (ETL)](https://en.wikipedia.org/wiki/Extract,_transform,_load#:~:text=In%20computing%2C%20extract%2C%20transform%2C,than%20the%20source(s)) publicly available data assets into a read-optimized MySQL database that powers downstream BRAINWORKS analytic functions. More specifically, we developed three tools to ingest data from: (1) the [NIH ExPORTER Asset](https://exporter.nih.gov/ExPORTER_Catalog.aspx), (2) the PubMed API via the [Entrez Programming Utilities](https://www.ncbi.nlm.nih.gov/books/NBK25501/), and (3) the [Medical Subject Headings](https://www.nlm.nih.gov/databases/download/mesh.html) assets.

<br>This iPython notebook provides an interactive overview of these software tools and shows how they can be used to recreate the data collected for BRAINWORKS.
<hr>

## 0. Configuration and Utility Import:

To use this notebook, you will need to update the configuration in `/configuration/config.py`. More specifically, you must update: `database.base_dir`, `database.configuration_file`, `data_directory`, `NCBI_API.NCBI_API_KEY` and `NCBI_API.rate_limit`. The other fields are optional. Following configuration, we will import several code utilities that are shipped with this repository; the source code for these utilities may be found in the `/utils` directory of this repository.
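Before running the import cell, it can help to confirm that the required fields are filled in. The check below is a minimal sketch, assuming the configuration is a nested dictionary whose keys mirror the dotted names above; `example_config`, its values, `REQUIRED_FIELDS` and `missing_config_fields` are hypothetical and not part of the repository.

```python
# Hypothetical layout: the real /configuration/config.py may be organized differently.
example_config = {
    "data_directory": "../data/",
    "database": {"base_dir": "", "configuration_file": "../configuration/db.cnf"},
    "NCBI_API": {"NCBI_API_KEY": "your-key-here", "rate_limit": 3},
}

# The five fields this tutorial lists as mandatory.
REQUIRED_FIELDS = [
    "database.base_dir",
    "database.configuration_file",
    "data_directory",
    "NCBI_API.NCBI_API_KEY",
    "NCBI_API.rate_limit",
]


def missing_config_fields(cfg, required=REQUIRED_FIELDS):
    """Return the dotted field names that are absent or empty in cfg."""
    missing = []
    for dotted in required:
        node = cfg
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if node is None or node == "":
            missing.append(dotted)
    return missing


# `database.base_dir` is left empty in the example, so it is reported.
print(missing_config_fields(example_config))
```

An empty list means every required field has a value; the check does not verify that the values themselves are valid.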

In [1]:
import os
import sys
currentdir = os.getcwd()
parentdir  = os.path.dirname(currentdir)
sys.path.insert(0, parentdir)


from utils.cloudComputing.storage        import storage
from utils.documentCollector.exporter    import exporter
from utils.documentCollector.pubmed      import pubmed
from utils.database.database             import database
from utils.generalPurpose                import generalPurpose as gp
from configuration.config                import config
from pprint                              import pprint     
from utils.documentCollector.grid        import grid

Your CPU supports instructions that this binary was not compiled to use: AVX2
For maximum performance, you can install NMSLIB from sources 
pip install --no-binary :all: nmslib


<br>
Next, we will initialize instances of several code utilities that will be used in the remainder of this notebook. 

In [2]:
db = database()          # The database utility handles the connection to the database, and interactions with it.
ex = exporter()          # The exporter utility collects data from the NIH ExPORTER catalog
pm = pubmed()            # The pubmed utility collects data from the PubMed API
cs = storage()           # The storage utility backs up downloaded data to an AWS S3 Bucket
g  = grid()              # The grid utility downloads and imports the GRID data

<br><br>
## 1. SQL Table Generation
To begin, we will create a set of MySQL tables that will be used to store the data collected from the ExPORTER and PubMed data sources. Once the configuration is in place, the `generateTables()` functions in the `exporter` and `pubmed` utilities automatically create all tables required to ingest data from the public sources. We also use the [Global Research Identifier Database (GRID)](https://www.grid.ac/), which the `updateGRID()` function downloads and imports:

In [3]:
ex.generateTables()      # Generate the required tables for the Exporter Data
pm.generateTables()      # Generate the required tables for the Pubmed Data
g.updateGRID()           # Generate the required tables for the GRID Data

------------------------------------------------
 Creating ExPORTER Tables                       
------------------------------------------------
....`abstracts` table already exists; skipping creation
....`link_tables` table already exists; skipping creation
....`patents` table already exists; skipping creation
....`projects` table already exists; skipping creation
------------------------------------------------
 Creating PubMed Tables                         
------------------------------------------------
.... `application_types` table created
.... `citations` table created
.... `documents` table created
.... `grants` table created
.... `id_map` table created
.... `qualifiers` table created
.... `topics` table created
.... `affiliations` table created
.... `publications` table created
------------------------------------------------
Downloading latest GRID file from https://grid.ac/downloads
.... files will be saved to ../data/GRID/
-----------------------------------------------

<br>
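The "already exists; skipping creation" messages in the output above suggest that table generation is safe to re-run. The following is a minimal sketch of that idempotent pattern, assuming nothing about how `generateTables()` is actually implemented. It uses Python's built-in `sqlite3` as a stand-in for MySQL, and the `patents` schema and the helper name `create_table_if_missing` are hypothetical.

```python
import sqlite3


def create_table_if_missing(conn, name, ddl):
    """Create table `name` using `ddl` unless it already exists.

    Returns True if the table was created, False if it was skipped.
    """
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    if exists:
        print(f"....`{name}` table already exists; skipping creation")
        return False
    conn.execute(ddl)
    print(f"....`{name}` table created")
    return True


conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
ddl = "CREATE TABLE patents (patent_id TEXT PRIMARY KEY, title TEXT)"

first = create_table_if_missing(conn, "patents", ddl)   # creates the table
second = create_table_if_missing(conn, "patents", ddl)  # skips creation
```

In MySQL, the same effect is available through `CREATE TABLE IF NOT EXISTS` or by querying `information_schema.tables` before issuing the DDL.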

<br><br>
## 2. ExPORTER Data Collection
With the database initialized, we can begin collecting publicly available data on grants, patents, and affiliated papers from the [NIH ExPORTER Catalog](https://exporter.nih.gov/ExPORTER_Catalog.aspx). The `exporter` utility contains a function `collect()` that may be used to download, decompress and convert all publicly available `.csv` data files into `JSON` format. Please note that the `collect()` function can be run periodically to collect any new files posted by the NIH. That is, re-running the tool will _only fetch and process any new files that you have not previously collected_.

In [3]:
ex.collect(limit_to_tables = ['abstracts', 'projects', 'patents', 'link_tables'])    # Collect any new data on abstracts, projects, patents and link_tables.

------------------------------------------------
Downloading data from https://exporter.nih.gov/ 
.... files will be saved to data/ExPORTER/
------------------------------------------------
downloading `projects`CSV data from https://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0 ...
.... 72 / 82 previously downloaded
.... downloading 10 new files
downloading `abstracts`CSV data from https://exporter.nih.gov/ExPORTER_Catalog.aspx?index=1 ...
.... 72 / 82 previously downloaded
.... downloading 10 new files
downloading `patents`CSV data from https://exporter.nih.gov/ExPORTER_Catalog.aspx?index=3 ...
.... 0 / 1 previously downloaded
.... downloading 1 new files
downloading `link_tables`CSV data from https://exporter.nih.gov/ExPORTER_Catalog.aspx?index=5 ...
.... 41 / 41 previously downloaded
------------------------------------------------
 Unzipping data                                 
------------------------------------------------
patents ...
.... Starting update
projects ...
.... St

<br>
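The incremental behavior described above (only new files are fetched) can be pictured as a comparison between the file names listed remotely and those already on disk. This is a minimal sketch under that assumption; the notebook does not show how `collect()` tracks previous downloads, and the function name and file names below are hypothetical.

```python
import tempfile
from pathlib import Path


def files_to_download(remote_names, local_dir):
    """Return the remote file names that have no same-named file in local_dir."""
    have = {p.name for p in Path(local_dir).iterdir() if p.is_file()}
    return [name for name in remote_names if name not in have]


with tempfile.TemporaryDirectory() as tmp:
    # Pretend two of the three files were fetched on an earlier run.
    for name in ("FY2021_001.zip", "FY2021_002.zip"):
        (Path(tmp) / name).touch()

    remote = ["FY2021_001.zip", "FY2021_002.zip", "FY2021_003.zip"]
    new_files = files_to_download(remote, tmp)

    print(f".... {len(remote) - len(new_files)} / {len(remote)} previously downloaded")
    print(f".... downloading {len(new_files)} new files")
```

Matching on file name alone would miss a remote file that was updated in place, so a production tool may also compare sizes or timestamps.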

Following download, we can import the data into the MySQL Tables using the `exporter` object's `importData()` function.

In [4]:
ex.importData(limit_to_tables = ['abstracts', 'projects', 'patents', 'link_tables'],  # Specifies the set of tables we want to update, 
              batch_size      = 10000) 

------------------------------------------------
 Importing Data into SQL Database               
------------------------------------------------
Collecting list of files previously imported
.... This may take a while depending on the size of your database
Importing data into patents
.... (previously imported data will be skipped unless replace_existing is True)
Deleting patents - these records are replaced, not augmented...
....Importing RePORTER_PATENTS_C_ALL
Importing data into projects
.... (previously imported data will be skipped unless replace_existing is True)
....Importing RePORTER_PRJ_C_FY2021_037
....Importing RePORTER_PRJ_C_FY2021_038
....Importing RePORTER_PRJ_C_FY2021_039
....Importing RePORTER_PRJ_C_FY2021_040
....Importing RePORTER_PRJ_C_FY2021_041
....Importing RePORTER_PRJ_C_FY2021_042
....Importing RePORTER_PRJ_C_FY2021_043
....Importing RePORTER_PRJ_C_FY2021_044
....Importing RePORTER_PRJ_C_FY2021_045
....Importing RePORTER_PRJ_C_FY2021_046
....Importing RePORTER_P
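The `batch_size` argument above suggests that records are written to the database in groups rather than one at a time. As an illustration of that general technique (not the `importData()` implementation, which is not shown here), the sketch below inserts rows in fixed-size batches, using `sqlite3` in place of MySQL. The `projects` schema and the `batched` helper are hypothetical.

```python
import sqlite3
from itertools import islice


def batched(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
conn.execute("CREATE TABLE projects (application_id INTEGER, title TEXT)")
rows = [(i, f"project {i}") for i in range(25)]

batch_count = 0
for batch in batched(rows, batch_size=10):
    conn.executemany("INSERT INTO projects VALUES (?, ?)", batch)
    conn.commit()  # one commit per batch keeps each transaction bounded
    batch_count += 1

row_count = conn.execute("SELECT COUNT(*) FROM projects").fetchone()[0]
```

Larger batches reduce round trips to the database at the cost of more memory per transaction.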

<br><br>

## 3. PubMed Data Collection
The data we collected using the `exporter` utility contains information on grants (see `abstracts` and `projects` tables), patents (see `patents` table), and the publication ids that resulted from grant funding (see `link_tables` tables). 

The `link_tables` contain the PubMed identification numbers of papers that are linked to the patents and grants, but do not contain information on the publications themselves. To collect this data, we will make use of the `pubmed` utility. More specifically, we will use the `downloadExporterPapersByPubmedId()` function, which collects all PubMed papers that appear in the `link_tables` and that we have not already imported.

In [None]:
pm.getStoredDocumentList(data_path = config['data_directory'] + 'PubMed/')
pm.downloadExporterPapersByPubmedId(write_location= '../data/PubMed', batch_size = 100)

<br>
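To make the two ideas in this step concrete (skip PubMed IDs that are already stored, then request the rest in batches), here is a minimal sketch that only builds request URLs for the Entrez EFetch endpoint and performs no network calls. It assumes nothing about the internals of `downloadExporterPapersByPubmedId()`; the helper names `pending_pmids` and `efetch_urls` and the example IDs are hypothetical.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def pending_pmids(linked_pmids, imported_pmids):
    """PMIDs referenced by the link tables that are not yet stored (sorted for stable batching)."""
    return sorted(set(linked_pmids) - set(imported_pmids))


def efetch_urls(pmids, batch_size=100, api_key=None):
    """Build one EFetch URL per batch of PMIDs. No request is sent here."""
    urls = []
    for start in range(0, len(pmids), batch_size):
        params = {
            "db": "pubmed",
            "id": ",".join(str(p) for p in pmids[start:start + batch_size]),
            "retmode": "xml",
        }
        if api_key:
            params["api_key"] = api_key
        urls.append(f"{EFETCH_URL}?{urlencode(params)}")
    return urls


# 250 linked papers, 3 of which were imported earlier (made-up IDs).
todo = pending_pmids(linked_pmids=range(1, 251), imported_pmids=[5, 6, 7])
urls = efetch_urls(todo, batch_size=100)
print(len(todo), "papers to fetch in", len(urls), "requests")
```

A real downloader would also pause between requests to respect the configured `NCBI_API.rate_limit`; NCBI permits up to 3 requests per second without an API key and up to 10 with one.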

Once the papers are collected, we can import them into the database in two steps: collect the list of stored documents we want to ingest using `getStoredDocumentList()`, then import those papers into a MySQL instance using the `processPapers()` function. For instance, the cell below works backward through the years 2021 to 2019, collecting and processing each year's documents in turn. The list of documents is stored internally within the object in `pm.document_list`. We can inspect this list and/or pass it to the `processPapers()` function to store the papers in the database.

In [None]:
from utils.documentCollector.pubmed      import pubmed
from configuration.config                import config
pm = pubmed()

for year in range(2021, 2018, -1):                                           # Iterates over 2021, 2020, 2019
    print('Importing Data From', year)
    year_str = str(year)
    pm.getStoredDocumentList(data_path    = config['data_directory'] + 'PubMed/' + year_str + '/')
    pm.processPapers( paper_list          = pm.document_list,                # The list of all documents from the data/PubMed/<year>/ directory
                      log_folder          = 'ingest-pubmed-' + year_str,     # The log folder that keeps track of the ingestion process.
                      prevent_duplication = False,                           # `True`: we will skip any pmids that are already in the `publications` table.
                      purge_logs          = False,                           # `True`: logs will be purged; logs are how we keep track of what was already processed.
                      db_insert           = True,                            # `True`: values are inserted into the database
                      batch_size          = 1000,                            # Recommended size: 10000
                      limit_to_tables     = ['affiliations','documents','id_map','grants','topics','qualifiers','citations','publications','triples','concepts']  # The tables to populate
                    )
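The comments on `log_folder` and `purge_logs` indicate that ingestion progress is tracked through logs, so that an interrupted run can resume without reprocessing. The sketch below illustrates that general pattern with a plain text log of processed IDs. It is an assumption-laden illustration rather than the `processPapers()` implementation; the function `process_with_log`, the log file name and the paper IDs are hypothetical.

```python
import tempfile
from pathlib import Path


def process_with_log(paper_ids, log_file, handle, purge_log=False):
    """Run handle(pid) for ids not yet recorded in log_file.

    Each id is appended to the log once handle() returns, so a later
    call skips it. purge_log=True discards the log and starts over.
    """
    log_file = Path(log_file)
    if purge_log and log_file.exists():
        log_file.unlink()
    done = set(log_file.read_text().split()) if log_file.exists() else set()

    processed = []
    with log_file.open("a") as log:
        for pid in paper_ids:
            if pid in done:
                continue
            handle(pid)
            log.write(pid + "\n")
            processed.append(pid)
    return processed


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "ingest-pubmed-2021.log"
    no_op = lambda pid: None  # stands in for the real per-paper work

    first_run = process_with_log(["111", "222"], log_path, no_op)
    second_run = process_with_log(["111", "222", "333"], log_path, no_op)
    third_run = process_with_log(["111"], log_path, no_op, purge_log=True)
```

Here the second call only handles `"333"`, while the third call handles `"111"` again because the log was purged first.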

<br><br> If you are processing `triples` and `concepts` for a large number of papers, you may require parallel computing resources to accomplish the task in a reasonable timeframe. Please see the parallel computing module in the `/cluster` directory for instructions on how to configure and run the parallel computing cluster to extract paper information in parallel.

<br><br>
## 4. Download Papers Citing Papers
We can now download the papers that cite the papers we have already collected.

In [None]:
pm.getStoredDocumentList(data_path = config['data_directory'] + 'PubMed/')
pm.downloadCitedPapersByPubmedId(write_location= '../data/PubMed', batch_size = 100)
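One way to look up citing papers through the Entrez Programming Utilities is the ELink endpoint with the `pubmed_pubmed_citedin` link name. Whether `downloadCitedPapersByPubmedId()` uses this route is not stated in this notebook, so the following is only a sketch of how such a request URL could be assembled. The helper `cited_in_url` and the example PubMed IDs are hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def cited_in_url(pmids, api_key=None):
    """Build an ELink URL asking which PubMed records cite each given PMID.

    Repeating the `id` parameter requests a separate link set per PMID.
    """
    params = [("dbfrom", "pubmed"), ("linkname", "pubmed_pubmed_citedin")]
    params += [("id", str(p)) for p in pmids]
    if api_key:
        params.append(("api_key", api_key))
    return f"{ELINK_URL}?{urlencode(params)}"


url = cited_in_url([111, 222])  # made-up PMIDs
print(url)
```

The PubMed IDs returned by such a query could then be fetched in batches in the same way as the ExPORTER-linked papers in Section 3.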

<br><br>

## Appendix A
Below are some additional features that were developed, but are not critical to the core data collection procedure.

#### A.1 Compute Field Statistics
You can compute field-level statistics on your JSON document store using the `getStoredDocumentFieldStats()` function. The function will capture all unique JSON fields that appear across all records in the set, and the percentage of records that contain a given field.

In [None]:
pm.getStoredDocumentFieldStats(data_path = config['data_directory'] + 'PubMed/', 
                               savename  = config['data_directory'] + 'PubMed/stats/pubmed-field-statistics.stats')
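The statistic described above can be illustrated on a handful of in-memory records. This is a minimal sketch of the idea, assuming nested fields are reported as dotted paths; the actual output format of `getStoredDocumentFieldStats()` is not shown in this notebook, and the helper names and sample records are hypothetical.

```python
from collections import Counter


def field_paths(record, prefix=""):
    """Yield the dotted path of every key in a (possibly nested) dict."""
    for key, value in record.items():
        path = f"{prefix}{key}"
        yield path
        if isinstance(value, dict):
            yield from field_paths(value, prefix=path + ".")


def field_statistics(records):
    """Map each field path to the percentage of records that contain it."""
    counts = Counter()
    for record in records:
        # A set ensures each path is counted at most once per record.
        counts.update(set(field_paths(record)))
    total = len(records)
    if total == 0:
        return {}
    return {path: 100.0 * n / total for path, n in counts.items()}


records = [
    {"pmid": 1, "title": "A", "journal": {"issn": "1234-5678"}},
    {"pmid": 2, "title": "B"},
    {"pmid": 3, "journal": {}},
    {"pmid": 4},
]
stats = field_statistics(records)
for path in sorted(stats):
    print(f"{path}: {stats[path]:.1f}%")
```

Fields with low coverage are good candidates for optional columns or for exclusion from downstream analysis.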

#### A.2 Backup Files to S3 Bucket
You may back up the contents of a directory to an S3 bucket using the `cs.backup()` function.

In [None]:
files          = gp.getDirectoryContents(data_path = config['data_directory'] + 'ExPORTER/')
failed_uploads = cs.backup(files)

In [None]:
files          = gp.getDirectoryContents(data_path = config['data_directory'] + 'PubMed/')
failed_uploads = cs.backup(files)

Uploading 10525517 files


#### A.3 Purge Data
Optionally, you can remove a set of documents from the database. In the example below, we remove the data from 2021.

In [None]:
pm.getStoredDocumentList(data_path = config['data_directory'] + 'PubMed/2021/')
db.purgeDocumentsfromDatabase(pm.document_list)