<h1><center>Analyzing food and drug interactions through PubMed abstracts</center></h1>
<h2><center>PHASE 1: Selection of Parser</center></h2>

## Step 1 - Download data

In [1]:
# To use PubMed API
import pubmed.utils as pb

import json
from pprint import pprint

# Split abstracts to sentences
from nltk.tokenize import sent_tokenize

# UTF-8 support
import codecs

The _pubmed.utils_ module implements a _PubMedQuery_ class which searches the PubMed database for a search term and returns at most _max_\__results_ abstracts. 
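
The internals of `pubmed.utils` are not shown in this notebook. As a rough sketch of what such a search involves, the snippet below builds a request URL for NCBI's E-utilities `esearch` endpoint, which returns article IDs for a search term. The helper name `build_esearch_url` is hypothetical, and it is only an assumption that `PubMedQuery` talks to this endpoint.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(search_term, max_results):
    # Hypothetical helper (not part of pubmed.utils): builds the URL only,
    # it does not send the request.
    params = [
        ("db", "pubmed"),          # search the PubMed database
        ("term", search_term),     # free-text query
        ("retmax", max_results),   # upper bound on the number of IDs returned
        ("retmode", "json"),       # ask for a JSON response
    ]
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url("ACE inhibitor", 3)
print(url)
```

Keeping the URL construction separate from the network call makes it possible to inspect a query before sending it.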

### Step 1a) Testing search in PubMed - 3 samples

In [2]:
search_term = 'ACE inhibitor'
max_results = 3
query = pb.PubMedQuery(search_term, max_results)

After instantiating a _PubMedQuery_ instance, _PubMedQuery.id_\__getter_ should be called to read the ids of the articles of the search results.

In [3]:
ids = query.id_getter()
print ids

28687104,28682034,28682025


In [4]:
query.abstract_getter(ids)

{0: u'Nonsteroidal antiinflammatory agents, \u03b2-lactam antibiotics, non-\u03b2 lactam antibiotics, and angiotensin-converting enzyme inhibitors are the most common classes of drugs that cause angioedema. Drug-induced angioedema is known to occur via mechanisms mediated by histamine, bradykinin, or leukotriene, and an understanding of these mechanisms is crucial in guiding therapeutic decisions. Nonallergic angioedema occurs in patients with genetic variants that affect metabolism or synthesis of bradykinin, substance P, prostaglandins, or leukotrienes, or when patients are taking drugs that have synergistic mechanisms. The mainstay in treatment of nonallergic drug-induced angioedema is cessation of the offending agents.',
 1: "Fabry's disease (FD) is a severe congenital metabolic disorder characterized by the deficient activity of lysosomal exoglycohydrolase alpha-galactosidase, characterized by glycosphingolipid deposition in several cells, such as capillary endothelial cells, rena

Because pairing IDs with abstracts is tricky (some search result files are formatted differently from others), _abstract_\__getter_ simply assigns a generic ID (a counter) to each abstract.

Fetching the IDs one by one would let us pair each ID with its abstract text correctly, but it would take much longer, and we are unlikely to use this information.
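
If the pairing were ever needed, a cheap approximation is possible when the number of abstracts equals the number of IDs. The sketch below assumes that abstracts come back in the same order as the IDs, which this notebook does not verify, and it refuses to pair anything when the counts differ. The function name `pair_ids_with_abstracts` is hypothetical.

```python
def pair_ids_with_abstracts(id_string, abstracts):
    """Return {pubmed_id: abstract}, or None when the counts do not match.

    Assumption: abstracts are keyed by a counter (0, 1, 2, ...) and are
    listed in the same order as the comma-separated IDs.
    """
    ids = [part.strip() for part in id_string.split(",") if part.strip()]
    if len(ids) != len(abstracts):
        return None  # at least one abstract is missing, so positions are unreliable
    ordered = [abstracts[key] for key in sorted(abstracts, key=int)]
    return dict(zip(ids, ordered))


paired = pair_ids_with_abstracts(
    "28687104,28682034,28682025",
    {0: "first abstract", 1: "second abstract", 2: "third abstract"},
)
```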

### Step 1b) Downloading abstracts to local jsons

#### Download only a few abstracts

In [3]:
%%time
# Test: download a few only
# If you don't want to download all abstracts, please do not use the download_all_abstracts() method, 
# but rather the abstract_getter() as below
pb.PubMedQuery.COUNT = 0
max_results = 10

# Specify just few queries
search_term = 'ACE inhibitor'
query = pb.PubMedQuery(search_term, max_results)
ids = query.id_getter()
abstracts = query.abstract_getter(ids)

CPU times: user 102 ms, sys: 10.9 ms, total: 113 ms
Wall time: 1.05 s


In [4]:
print abstracts

{0: u'Nonsteroidal antiinflammatory agents, \u03b2-lactam antibiotics, non-\u03b2 lactam antibiotics, and angiotensin-converting enzyme inhibitors are the most common classes of drugs that cause angioedema. Drug-induced angioedema is known to occur via mechanisms mediated by histamine, bradykinin, or leukotriene, and an understanding of these mechanisms is crucial in guiding therapeutic decisions. Nonallergic angioedema occurs in patients with genetic variants that affect metabolism or synthesis of bradykinin, substance P, prostaglandins, or leukotrienes, or when patients are taking drugs that have synergistic mechanisms. The mainstay in treatment of nonallergic drug-induced angioedema is cessation of the offending agents.', 1: "Fabry's disease (FD) is a severe congenital metabolic disorder characterized by the deficient activity of lysosomal exoglycohydrolase alpha-galactosidase, characterized by glycosphingolipid deposition in several cells, such as capillary endothelial cells, renal

In [5]:
# Write file with the 10 examples of queries
json_file = 'my_ten_abstracts.json'
print 'Saving to ' + json_file
with open(json_file, 'w') as outfile:
    json.dump(abstracts, outfile, indent=4)

Saving to my_ten_abstracts.json


#### Download all abstracts
The _download_\__all_\__abstracts_ function saves the abstracts to JSON files, _max_\__results_ abstracts per file, as shown below.

In [5]:
%%time
# reset the counter before calling the download_all_abstracts method as we have already called it once above
pb.PubMedQuery.COUNT = 0
max_results = 500
pb.download_all_abstracts(search_term, max_results)

Saving to pbabstract1.json
500/50539 downloaded
Saving to pbabstract2.json
1000/50539 downloaded
Saving to pbabstract3.json
1500/50539 downloaded
Saving to pbabstract4.json
2000/50539 downloaded
Saving to pbabstract5.json
2500/50539 downloaded
Saving to pbabstract6.json
3000/50539 downloaded
Saving to pbabstract7.json
3500/50539 downloaded
Saving to pbabstract8.json
4000/50539 downloaded
Saving to pbabstract9.json
4500/50539 downloaded
Saving to pbabstract10.json
5000/50539 downloaded
Saving to pbabstract11.json
5500/50539 downloaded
Saving to pbabstract12.json
6000/50539 downloaded
Saving to pbabstract13.json
6500/50539 downloaded
Saving to pbabstract14.json
7000/50539 downloaded
Saving to pbabstract15.json
7500/50539 downloaded
Saving to pbabstract16.json
8000/50539 downloaded
Saving to pbabstract17.json
8500/50539 downloaded
Saving to pbabstract18.json
9000/50539 downloaded
Saving to pbabstract19.json
9500/50539 downloaded
Saving to pbabstract20.json
10000/50539 downloaded
Saving to

## Step 2 - Adding Code for Parser


1. Download the parser from https://stanfordnlp.github.io/CoreNLP/
2. Unpack into a local dir
3. Put the path to englishPCFG.ser.gz as an arg to StanfordParser

## Step 2a) Pre-processing of downloaded data

### Dictionary of sentences from all abstracts, keyed by unique IDs and filtered to sentences containing a term related to the keyword ACEI

In [2]:
# Split abstracts into sentences and keep only sentences that mention the keyword "ACEI"
data_dict_ACE = {}
keyword = "ACEI"

for i in range(1, 102):  # files pbabstract1.json ... pbabstract101.json
    name = "pbabstract"+str(i)+".json"
    with codecs.open(name,"r","utf-8") as data_file:
        data = json.load(data_file)  # dictionary with abstract ID as key
        keys = list(data.keys())  # copy of the keys, since data is modified inside the loop
        for old_key in keys:
            try:
                temp_list = sent_tokenize(data.pop(old_key))  # convert abstract into a list of sentences
                for j in range(len(temp_list)):
                    new_key = str(i)+"_"+str(old_key)+"_"+str(j)  # file_abstract_sentence: unique across json files
                    transformed_sentence = pb.ace_substitutor(temp_list[j], keyword)
                    if keyword in transformed_sentence:  # filter by keyword
                        data[new_key] = transformed_sentence
            except Exception:
                continue  # skip abstracts that fail to tokenize or transform
        data_dict_ACE.update(data)

In [3]:
print len(data_dict_ACE), "sentences in all abstracts with keyword"

65387 sentences in all abstracts with keyword
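
The loop above hard-codes `range(1, 102)`, so it has to be edited whenever the number of downloaded files changes. A possible alternative, sketched here, is to discover the `pbabstract<N>.json` files on disk and sort them by their numeric suffix. The function name `find_abstract_files` is hypothetical.

```python
import glob
import os
import re


def find_abstract_files(directory="."):
    # Return (index, path) pairs for files named pbabstract<N>.json,
    # sorted numerically so that file 10 comes after file 2.
    pattern = re.compile(r"^pbabstract(\d+)\.json$")
    found = []
    for path in glob.glob(os.path.join(directory, "pbabstract*.json")):
        match = pattern.match(os.path.basename(path))
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found)
```

The index in each pair can be used as the `i` prefix when building the sentence keys.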


Note: filtering only on the literal keyword "ACE", instead of on all terms related to ACE inhibitors, yields 54239 sentences.
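
The difference in counts comes from sentences that name the drug class without the literal string being searched for. The actual rules in `pb.ace_substitutor` are not shown in this notebook; the sketch below is a hypothetical stand-in (`normalize_ace_terms`) that rewrites a few common spellings to a single token so that one keyword test catches them all.

```python
import re

# Hypothetical list of variants; the real substitutor may cover more or fewer.
ACE_VARIANTS = re.compile(
    r"angiotensin[- ]converting enzyme inhibitors?"
    r"|\bace[- ]inhibitors?\b"
    r"|\baceis?\b",
    re.IGNORECASE,
)


def normalize_ace_terms(sentence, keyword="ACEI"):
    # Replace every recognized variant with the keyword.
    return ACE_VARIANTS.sub(keyword, sentence)


example = normalize_ace_terms(
    "Angiotensin-converting enzyme inhibitors (ACEIs) are commonly prescribed."
)
```

The word boundaries keep unrelated words that merely contain "ace", such as "placebo", from being rewritten.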
  
  

In [4]:
# Print a few example sentences to check whether sent_tokenize() works well
i = 0
for key, sentence in data_dict_ACE.iteritems():
    print key, ": ", sentence
    i += 1
    if i  > 5:
        break
        
# Observations:
# Sentences containing abbreviations are not always split correctly by sent_tokenize() - see second example without keyword above

4_490_1 :  increase in all-cause mortality was associated with higher age and ACEI prescription.
4_490_2 :  higher risk of cardiovascular mortality was associated with increasing age, prescriptions for ACEI, and diagnosis of myocardial infarction or angina as compared with the other diagnoses.
21_712_0 :  the clinical benefits of ACEI (ACEI) inhibitors and angiotensin ii receptor blockers (arb) are well established in chronic kidney disease (ckd) patients with diabetic and non-diabetic nephropathies.
7_789_0 :  angiotensin ii receptor blockers (arbs), ACEI are some of the most commonly prescribed medications for hypertension.
21_712_2 :  given that the single agents can achieve only partial and not durable suppression of the renin-angiotensin system (ras), it has been hypothesized that dual blockage with ACEI and arbs would be most beneficial in the management of progressive ckd than either agent alone.
13_206_0 :  to examine the association between ACEI (ACEI) inhibitor use and clinic

## Step 2b) Write text file to run with Stanford RNN parser

In [5]:
def sample_sentences(sample_size, dictname):
    # Take the first sample_size sentences (in dictionary iteration order) for testing
    data_dict_sample = {}

    for id_num, sentence in dictname.iteritems():
        if len(data_dict_sample) >= sample_size:
            break
        data_dict_sample[id_num] = sentence

    return data_dict_sample
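
`sample_sentences` takes the first entries in iteration order rather than a random draw. If a random and reproducible sample is preferred for comparing parsers, something like the following sketch could be used. The function name `random_sample_sentences` and the default seed are assumptions.

```python
import random


def random_sample_sentences(sample_size, sentences, seed=0):
    # Draw up to sample_size entries uniformly at random.
    # Sorting the keys first makes the draw reproducible for a given seed.
    rng = random.Random(seed)
    keys = sorted(sentences)
    chosen = rng.sample(keys, min(sample_size, len(keys)))
    return {key: sentences[key] for key in chosen}
```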

In [6]:
data_dict_ACE_sample = sample_sentences(10, data_dict_ACE)

In [23]:
# json.dump is fast, but it writes the keys and the JSON syntax into the text file - we don't want that!
#json.dump(data_dict_ACE_sample, open("ACEsentences_sample.txt",'w'))

In [7]:
def dict_to_file(filename, dictname):
    # Write one sentence per line
    with open(filename,'w') as f:
        for value in dictname.values():
            try:
                f.write('{}\n'.format(value))
            except UnicodeError:
                continue  # sentences with characters that cannot be encoded as ASCII are dropped
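
The writer above formats each sentence as a byte string, which under Python 2 can fail for non-ASCII characters such as the β in the abstracts shown earlier. A sketch of an alternative that keeps those sentences is to open the file with an explicit UTF-8 encoding. The function name `dict_to_utf8_file` is hypothetical, and whether the parser script expects UTF-8 input is an assumption to check.

```python
import io


def dict_to_utf8_file(filename, sentences):
    # Write one sentence per line, encoded as UTF-8, without dropping any.
    with io.open(filename, "w", encoding="utf-8") as f:
        for value in sentences.values():
            f.write(u"{}\n".format(value))
```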

In [8]:
dict_to_file("ACEsentences_sample2.txt", data_dict_ACE_sample)

In [8]:
!cat ACEsentences_sample2.txt

Increase in all-cause mortality was associated with higher age and ACE inhibitors prescription.
Higher risk of cardiovascular mortality was associated with increasing age, prescriptions for ACE inhibitor, and diagnosis of myocardial infarction or angina as compared with the other diagnoses.
The clinical benefits of angiotensin-converting enzyme (ACE) inhibitors and angiotensin II receptor blockers (ARB) are well established in chronic kidney disease (CKD) patients with diabetic and non-diabetic nephropathies.
Angiotensin II receptor blockers (ARBs), angiotensin-converting enzyme inhibitors (ACEIs) are some of the most commonly prescribed medications for hypertension.
Given that the single agents can achieve only partial and not durable suppression of the renin-angiotensin system (RAS), it has been hypothesized that dual blockage with ACE inhibitors and ARBs would be most beneficial in the management of progressive CKD than either agent alone.
To examine the association between ang

## Step 2c) Run Stanford parser models with sample set

### Stanford models API download instructions:

We first downloaded the files from this link <http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip> 

Then we moved the pubmed folder into the unpacked folder, together with the already downloaded abstracts (so that they do not have to be downloaded again).

There are a few files you will need to make sure are present:

- `lexparser-gui.bat`
- `lexparser-gui.command`
- `lexparser-gui.sh`
- `lexparser-lang-train-test.sh`
- `lexparser-lang.sh`
- `lexparser.bat`
- `lexparser.sh`

You will also need to add the `edu` folder that can be found here:
<https://www.dropbox.com/s/t9uk4z1xznpo0jz/jars.zip?dl=0>

Rename `stanford-corenlp-3.8.0-models.jar` so that it has a `.zip` extension, and unzip it. Then copy the resulting `edu` folder into your home directory.
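
Since a `.jar` file is a zip archive, the rename-and-unzip step can also be scripted. The following is a minimal sketch that extracts only one top-level folder from the archive; the function name `extract_folder_from_jar` and the destination argument are assumptions.

```python
import zipfile


def extract_folder_from_jar(jar_path, folder, destination):
    # Extract only the entries under `folder` (for example "edu") into destination.
    prefix = folder.rstrip("/") + "/"
    with zipfile.ZipFile(jar_path) as jar:
        members = [name for name in jar.namelist() if name.startswith(prefix)]
        jar.extractall(destination, members=members)
    return members
```

For example, `extract_folder_from_jar("stanford-corenlp-3.8.0-models.jar", "edu", ".")` would place the `edu` folder in the current directory.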

### Testing with default parser (lexical parser)

In [9]:
!chmod a+x lexparser.sh

In [10]:
! ./lexparser.sh ACEsentences_sample2.txt

[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.5 sec].
Parsing file: ACEsentences_sample2.txt
Parsing [sent. 1 len. 14]: Increase in all-cause mortality was associated with higher age and ACE inhibitors prescription .
(ROOT
  (S
    (NP
      (NP (NNP Increase))
      (PP (IN in)
        (NP (JJ all-cause) (NN mortality))))
    (VP (VBD was)
      (VP (VBN associated)
        (PP (IN with)
          (NP
            (NP (JJR higher) (NN age))
            (CC and)
            (NP (NNP ACE) (NNS inhibitors) (NN prescription))))))
    (. .)))

nsubjpass(associated-6, Increase-1)
case(mortality-4, in-2)
amod(mortality-4, all-cause-3)
nmod:in(Increase-1, mortality-4)
auxpass(associated-6, was-5)
root(ROOT-0, associated-6)
case(age-9, with-7)
amod(age-9, higher-8)
nmod:with(associated-6, age-9)
cc(age-9, and-10)
compound(prescription-13, ACE-11)
compound(prescription-13, inh

(ROOT
  (S
    (S
      (VP (TO To)
        (VP (VB examine)
          (NP
            (NP (DT the) (NN association))
            (PP (IN between)
              (NP
                (NP (JJ angiotensin-converting) (NN enzyme))
                (PRN (-LRB- -LRB-)
                  (NP (NNP ACE))
                  (-RRB- -RRB-))))))))
    (VP (VBP inhibitor)
      (NP (NN use)
        (CC and)
        (JJ clinical) (NN outcome))
      (PP (IN after)
        (NP
          (NP (JJ primary) (NN vascular) (NN reconstruction))
          (PP (IN in)
            (NP (DT a) (JJ population-based) (NN follow-up) (NN study))))))
    (. .)))

mark(examine-2, To-1)
csubj(inhibitor-11, examine-2)
det(association-4, the-3)
dobj(examine-2, association-4)
case(enzyme-7, between-5)
amod(enzyme-7, angiotensin-converting-6)
nmod:between(association-4, enzyme-7)
appos(enzyme-7, ACE-9)
root(ROOT-0, inhibitor-11)
compound(outcome-15, use-12)
cc(use-12, and-13)
conj:and(use-12, clinical-14)
compound(outcome-15, c
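
The typed dependency lines printed by the parser follow the pattern `relation(governor-index, dependent-index)`. To work with them programmatically, they can be turned into tuples, as in this sketch. The names `Dependency` and `parse_dependency` are hypothetical, and lines that do not fit the pattern (including truncated ones) are returned as `None`.

```python
import re
from collections import namedtuple

Dependency = namedtuple(
    "Dependency", "relation governor governor_index dependent dependent_index"
)

# The word may itself contain hyphens ("all-cause-3"), so the index is
# taken from the last hyphen before the separator.
DEP_LINE = re.compile(r"^([^\s(]+)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dependency(line):
    match = DEP_LINE.match(line.strip())
    if match is None:
        return None
    relation, governor, gov_index, dependent, dep_index = match.groups()
    return Dependency(relation, governor, int(gov_index), dependent, int(dep_index))


dep = parse_dependency("nmod:in(Increase-1, mortality-4)")
```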

### Testing with RNN parser

To run the RNN model, copy `lexparser.sh` to a new script (here `lexparserRNN.sh`) and update two settings in it: the Java memory limit, which needs to be increased, and the model, by changing the `ser.gz` filename.
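
As an alternative to maintaining a second shell script, the Java command can be assembled from Python. This is a sketch under assumptions: the class name and model path are the ones that appear in the parser log output in this notebook, the `-outputFormat` value is chosen to match the tree-plus-dependencies output shown here, and the `2g` memory limit and the helper name `build_parser_command` are hypothetical.

```python
import os

PARSER_CLASS = "edu.stanford.nlp.parser.lexparser.LexicalizedParser"


def build_parser_command(model_path, input_file, memory="2g", jar_dir="."):
    # Classpath: every jar in jar_dir, plus the current directory.
    classpath = os.path.join(jar_dir, "*") + os.pathsep
    return [
        "java",
        "-Xmx" + memory,  # JVM heap limit (assumed value, tune as needed)
        "-cp", classpath,
        PARSER_CLASS,
        "-outputFormat", "penn,typedDependencies",
        model_path,
        input_file,
    ]


cmd = build_parser_command(
    "edu/stanford/nlp/models/lexparser/englishRNN.ser.gz",
    "ACEsentences_sample2.txt",
)
```

The resulting list can be passed to `subprocess.check_output(cmd)` to run the parser and capture its output.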

In [11]:
!chmod a+x lexparserRNN.sh
! ./lexparserRNN.sh ACEsentences_sample2.txt

[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishRNN.ser.gz ... done [1.9 sec].
Parsing file: ACEsentences_sample2.txt
Parsing [sent. 1 len. 14]: Increase in all-cause mortality was associated with higher age and ACE inhibitors prescription .
(ROOT
  (S
    (NP
      (NP (JJ Increase))
      (PP (IN in)
        (NP (JJ all-cause) (NN mortality))))
    (VP (VBD was)
      (VP (VBN associated)
        (PP (IN with)
          (NP
            (NP (JJR higher) (NN age))
            (CC and)
            (NP (NNP ACE) (NN inhibitors) (NN prescription))))))
    (. .)))

nsubjpass(associated-6, Increase-1)
case(mortality-4, in-2)
amod(mortality-4, all-cause-3)
nmod:in(Increase-1, mortality-4)
auxpass(associated-6, was-5)
root(ROOT-0, associated-6)
case(age-9, with-7)
amod(age-9, higher-8)
nmod:with(associated-6, age-9)
cc(age-9, and-10)
compound(prescription-13, ACE-11)
compound(prescription-13, inhibi

(ROOT
  (S
    (VP (TO To)
      (VP (VB examine)
        (NP (DT the) (NN association))
        (PP (IN between)
          (S
            (VP (VBG angiotensin-converting)
              (NP
                (NP (JJ enzyme)
                  (PRN (-LRB- -LRB-)
                    (NP (NNP ACE))
                    (-RRB- -RRB-))
                  (NN inhibitor) (NN use))
                (CC and)
                (NP (JJ clinical) (NN outcome)))
              (PP (IN after)
                (NP
                  (NP (JJ primary) (NN vascular) (NN reconstruction))
                  (PP (IN in)
                    (NP (DT a) (JJ population-based) (NN follow-up) (NN study))))))))))
    (. .)))

mark(examine-2, To-1)
root(ROOT-0, examine-2)
det(association-4, the-3)
dobj(examine-2, association-4)
mark(angiotensin-converting-6, between-5)
advcl(examine-2, angiotensin-converting-6)
amod(use-12, enzyme-7)
appos(use-12, ACE-9)
compound(use-12, inhibitor-11)
dobj(angiotensin-converting-6, use-12)
cc
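
The two models do not always agree on part-of-speech tags; in the first sentence, for instance, the PCFG output tags "inhibitors" as NNS while the RNN output tags it as NN. A quick way to list such disagreements is to extract the `(TAG word)` leaves from both bracketed trees and compare them position by position. This sketch assumes both parsers tokenized the sentence identically; the function names are hypothetical.

```python
import re

# A leaf is "(TAG word)" with no nested parentheses inside.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def tagged_words(tree_text):
    # Return (word, tag) pairs in sentence order.
    return [(word, tag) for tag, word in LEAF.findall(tree_text)]


def tag_differences(tree_a, tree_b):
    # Return (word, tag_a, tag_b) wherever the same word gets different tags.
    pairs = zip(tagged_words(tree_a), tagged_words(tree_b))
    return [
        (word_a, tag_a, tag_b)
        for (word_a, tag_a), (word_b, tag_b) in pairs
        if word_a == word_b and tag_a != tag_b
    ]


pcfg = "(NP (NNP ACE) (NNS inhibitors) (NN prescription))"
rnn = "(NP (NNP ACE) (NN inhibitors) (NN prescription))"
diffs = tag_differences(pcfg, rnn)
```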

In [7]:
!java -version

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)


## Step 2d) Run Stanford parser models with entire set

In [12]:
dict_to_file("ACEsentences_full.txt", data_dict_ACE)

In [None]:
! ./lexparserRNN.sh ACEsentences_full.txt