## Instructions

We first downloaded the files from this link <http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip> 

Then we dragged the pubmed folder into that same folder, with the copied abstracts (if you don't want to re-download them all again).

There are a few files you will need to make sure are present:

`lexparser-gui.bat                  
lexparser-gui.command              
lexparser-gui.sh                   
lexparser-lang-train-test.sh       
lexparser-lang.sh                  
lexparser.bat                      
lexparser.sh                       `

You will also need to add the `edu` folder that can be found here:
<https://www.dropbox.com/s/t9uk4z1xznpo0jz/jars.zip?dl=0>

Add the .zip extension to the `stanford-corenlp-3.8.0-models.jar` file, and unzip it. Copy that `edu` folder and paste it in to your home directory.

## Here is a list of some of the files you should see in your home folder...

In [1]:
!ls

[31mCoreNLP-to-HTML.xsl[m[m                pbabstract311.json
KeywordSentences.py                pbabstract312.json
[31mKeywordSentences.txt[m[m               pbabstract313.json
KeywordSentences.txt.out           pbabstract314.json
KeywordSentences.txt.xml           pbabstract315.json
KeywordSentences2.py               pbabstract316.json
KeywordSentences2.txt              pbabstract317.json
KeywordSentences2.txt.out          pbabstract318.json
KeywordSentences_output.txt        pbabstract319.json
LIBRARY-LICENSES                   pbabstract32.json
LICENSE.txt                        pbabstract320.json
Makefile                           pbabstract321.json
README.txt                         pbabstract322.json
SemgrexDemo.java                   pbabstract323.json
ShiftReduceDemo.java               pbabstract324.json
StanfordCoreNlpDemo.java           pbabstract325.json
StanfordDependenciesManual.pdf     pbabstract326.json
baseline-sparkless.ipynb           pbabstrac

# Parsing the Pubmed Abstracts

In [1]:
import pubmed.utils as pb
import json
import re
from collections import defaultdict
from pprint import pprint
import string
# utf-8 support
import codecs
# spit abstracts to sentences
from nltk.tokenize import sent_tokenize

In [7]:
#search_term = 'ACE inhibitor'
search_term = 'statin'
max_results = 100
query = pb.PubMedQuery(search_term, max_results)

In [8]:
ids = query.id_getter()
print "Num abstracts: ", len(ids)

Num abstracts:  899


## This is to just download a few of the abstracts

In [26]:
%%time
#I just tried to download a few 
# if you don't want to download all abstracts, please do not use the download_all_abstracts() method, but rather the 
# abstract_getter() as below
pb.PubMedQuery.COUNT = 0
max_results = 25
query = pb.PubMedQuery(search_term, max_results)
ids = query.id_getter()
abstracts = query.abstract_getter(ids)

CPU times: user 160 ms, sys: 0 ns, total: 160 ms
Wall time: 2.12 s


## This is to download a TON of the abstracts

In [70]:
# %%time
# pb.PubMedQuery.COUNT = 0
# max_results = 100
#full_query = pb.download_all_abstracts(search_term, max_results)
#ids = full_query.id_getter()
#abstracts = full_query.id_getter(ids)

## Save abstracts to JSON file

In [30]:
json_file = 'my_ten_abstracts.json'
#json_file = 'more_abstracts.json'
print 'Saving to ' + json_file
with codecs.open(json_file, 'w','utf-8') as outfile:
    json.dump(abstracts, outfile, indent=4)

Saving to my_ten_abstracts.json


## We might not need all the sentences in the abstract, so I am seeing what would happen if we just parse out the ones with key words, and write them to a textfile, and then parse the resulting file.

Keyword selection could happen here in the sentence tokenizer.

In [32]:
sentences = []
with codecs.open('my_ten_abstracts.json','r','utf-8') as data_file:    
    data = json.load(data_file)
    for abstract in data.itervalues():
        sentences.append(sent_tokenize(abstract))
    # flatten the list of abstracts into one long list of sentences
    sentences = [sent for s in sentences for sent in s]
    print "Sentences: ",len(sentences)

Sentences:  234


In [33]:
keyword = 'ACEI'
key_sentences = []
transformed_sentences = pb.ace_substitutor(sentences, 'ACEI')
for sent in transformed_sentences:
    if keyword in sent:
        key_sentences.append(sent)

In [69]:
%%writefile KeywordSentences.py
import json
import codecs
import pubmed.utils as pb

#change keyword to ACE
keyword = 'ACEI'
with codecs.open('my_ten_abstracts.json','r','utf-8') as data_file:    
    data = json.load(data_file)
    #pick snippets related to ACE inhibitors
    for abstract in data.itervalues():
        transformed_abstract = pb.ace_substitutor(abstract, 'ACEI') 
        if keyword in transformed_abstract:
            print transformed_abstract.encode('utf-8') + '\n'



Overwriting KeywordSentences.py


In [70]:
!python2 KeywordSentences.py > KeywordSentences.txt

In [71]:
#This mimics the format that the example parser file has
!head KeywordSentences.txt

renin-angiotensin-aldosterone system (raas) antagonists, including ACEI, angiotensin receptor blockers (arb), and mineralocorticoid receptor antagonists (mra) decrease mortality and morbidity in heart failure but increase the risk of hyperkalemia, especially when used in combination. prevention of hyperkalemia and its associated complications requires careful patient selection, counseling regarding dietary potassium intake, awareness of drug interactions, and regular laboratory surveillance. recent data suggests that the risk of hyperkalemia may be further moderated through the use of combined angiotensin-neprilysin inhibitors, novel mras, and novel potassium binding agents. clinicians should be mindful of the risk of hyperkalemia when prescribing raas inhibitors to patients with heart failure. in patients at highest risk, such as those with diabetes, the elderly, and advanced chronic kidney disease, more intensive laboratory surveillance of potassium and creatinine may be required. no

In [38]:
!chmod a+x lexparser.sh

# Testing with PCFG Model*

*notice I made a change to the lexparser file to allow for more memory

In [3]:
!cat ./lexparser.sh

#!/usr/bin/env bash
#
# Runs the English PCFG parser on one or more files, printing trees only

if [ ! $# -ge 1 ]; then
  echo Usage: `basename $0` 'file(s)'
  echo
  exit
fi

scriptdir=`dirname $0`

java -mx500m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
 -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz $*


In [4]:
%%timeit
! ./lexparser.sh  KeywordSentences.txt

[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.6 sec].
Parsing file: KeywordSentences.txt
Parsing [sent. 1 len. 52]: RAAS , a major pharmacological target in cardiovascular medicine , is inhibited by pharmacological classes including angiotensin converting enzyme -LRB- ACE -RRB- inhibitors -LRB- ACEIs -RRB- , angiotensin-II type 1 blockers -LRB- ARBs -RRB- and aldosterone receptors antagonists , in addition to the recently introduced direct renin inhibitors -LRB- DRIs -RRB- .
(ROOT
  (S
    (NP
      (NP (NNS RAAS))
      (, ,)
      (NP
        (NP (DT a) (JJ major) (JJ pharmacological) (NN target))
        (PP (IN in)
          (NP (JJ cardiovascular) (NN medicine))))
      (, ,))
    (VP (VBZ is)
      (ADJP (JJ inhibited)
        (PP (IN by)
          (NP
            (NP (JJ pharmacological) (NNS classes))
            (PP (VBG including)
              (NP
        

## Testing with RNN Model

In [14]:
!cat ./lexparser_rnn.sh

#!/usr/bin/env bash
#
# Runs the English PCFG parser on one or more files, printing trees only

if [ ! $# -ge 1 ]; then
  echo Usage: `basename $0` 'file(s)'
  echo
  exit
fi

scriptdir=`dirname $0`

java -mx500m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
 -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz $*


In [15]:
%%timeit
! ./lexparser_rnn.sh  KeywordSentences.txt

[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.5 sec].
Parsing file: KeywordSentences.txt
Parsed file: KeywordSentences.txt [0 sentences].
Parsed 0 words in 0 sentences (0.00 wds/sec; 0.00 sents/sec).
[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.5 sec].
Parsing file: KeywordSentences.txt
Parsed file: KeywordSentences.txt [0 sentences].
Parsed 0 words in 0 sentences (0.00 wds/sec; 0.00 sents/sec).
[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.5 sec].
Parsing file: KeywordSentences.txt
Parsed file: KeywordSentences.txt [0 sentences].
Parsed 0 words in 0 sentences (0.00 wds/sec; 0.00 sents/sec).
[main] INFO edu.stanford.nlp.parser.l

## Test with Caseless PCFG  Model

In [30]:
!cat ./lexparser_caseless.sh

#!/usr/bin/env bash
#
# Runs the English PCFG parser on one or more files, printing trees only

if [ ! $# -ge 1 ]; then
  echo Usage: `basename $0` 'file(s)'
  echo
  exit
fi

scriptdir=`dirname $0`

java -mx500m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
 -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz $*


In [19]:
%%timeit
! ./lexparser_caseless.sh  KeywordSentences.txt

[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz ... done [0.5 sec].
Parsing file: KeywordSentences.txt
Parsed file: KeywordSentences.txt [0 sentences].
Parsed 0 words in 0 sentences (0.00 wds/sec; 0.00 sents/sec).
[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz ... done [0.5 sec].
Parsing file: KeywordSentences.txt
Parsed file: KeywordSentences.txt [0 sentences].
Parsed 0 words in 0 sentences (0.00 wds/sec; 0.00 sents/sec).
[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz ... done [0.6 sec].
Parsing file: KeywordSentences.txt
Parsed file: KeywordSentences.txt [0 sentences].
Parsed 0 words in 0 sentences (0.00 wds/sec; 0.00 sents/sec).
[main] INF

## Command Line Sentiment Analysis

This creates an output file with tuples and sentiments! We can modify which annotators we use. I think maybe just the sentiment and the ner is all we need.

In [20]:
!java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref,sentiment -file KeywordSentences.txt -outputFormat text

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.8 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [2.2 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7cla

In [24]:
!cat *.out >> output_file.json

In [23]:
!cat output_file.json

## Sentences of note:

`Plasma concentrations of AII were significantly decreased by captopril and increased by losartan`.
`nmod:of(concentrations-2, AII-4)`
`nmod:agent(decreased-7, captopril-9)`
`nmod:by(increased-11, losartan-13)`


## Experimenting with Selected Abstracts

In [161]:
#From <https://www.ncbi.nlm.nih.gov/pubmed/28656517>

!echo "Renin-angiotensin-aldosterone system (RAAS) antagonists, including angiotensin-converting enzyme inhibitors (ACEI), angiotensin receptor blockers (ARB), and mineralocorticoid receptor antagonists (MRA) decrease mortality and morbidity in heart failure but increase the risk of hyperkalemia, especially when used in combination. Prevention of hyperkalemia and its associated complications requires careful patient selection, counseling regarding dietary potassium intake, awareness of drug interactions, and regular laboratory surveillance. Recent data suggests that the risk of hyperkalemia may be further moderated through the use of combined angiotensin-neprilysin inhibitors, novel MRAs, and novel potassium binding agents. Clinicians should be mindful of the risk of hyperkalemia when prescribing RAAS inhibitors to patients with heart failure. In patients at highest risk, such as those with diabetes, the elderly, and advanced chronic kidney disease, more intensive laboratory surveillance of potassium and creatinine may be required. Novel therapies hold promise for reducing the risk of hyperkalemia and enhancing the tolerability of RAAS antagonists." > sample_abstract.txt

In [162]:
!head sample_abstract.txt

Renin-angiotensin-aldosterone system (RAAS) antagonists, including angiotensin-converting enzyme inhibitors (ACEI), angiotensin receptor blockers (ARB), and mineralocorticoid receptor antagonists (MRA) decrease mortality and morbidity in heart failure but increase the risk of hyperkalemia, especially when used in combination. Prevention of hyperkalemia and its associated complications requires careful patient selection, counseling regarding dietary potassium intake, awareness of drug interactions, and regular laboratory surveillance. Recent data suggests that the risk of hyperkalemia may be further moderated through the use of combined angiotensin-neprilysin inhibitors, novel MRAs, and novel potassium binding agents. Clinicians should be mindful of the risk of hyperkalemia when prescribing RAAS inhibitors to patients with heart failure. In patients at highest risk, such as those with diabetes, the elderly, and advanced chronic kidney disease, more intensive laboratory surveillance of p

In [163]:
!java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref,sentiment -file sample_abstract.txt -outputFormat text

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.4 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [2.2 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7cla

In [164]:
!cat sample_abstract.txt.out

Sentence #1 (49 tokens, sentiment: Very negative):
Renin-angiotensin-aldosterone system (RAAS) antagonists, including angiotensin-converting enzyme inhibitors (ACEI), angiotensin receptor blockers (ARB), and mineralocorticoid receptor antagonists (MRA) decrease mortality and morbidity in heart failure but increase the risk of hyperkalemia, especially when used in combination.
[Text=Renin-angiotensin-aldosterone CharacterOffsetBegin=0 CharacterOffsetEnd=29 PartOfSpeech=NN Lemma=renin-angiotensin-aldosterone NamedEntityTag=O SentimentClass=Neutral]
[Text=system CharacterOffsetBegin=30 CharacterOffsetEnd=36 PartOfSpeech=NN Lemma=system NamedEntityTag=O SentimentClass=Neutral]
[Text=-LRB- CharacterOffsetBegin=37 CharacterOffsetEnd=38 PartOfSpeech=-LRB- Lemma=-lrb- NamedEntityTag=O SentimentClass=Neutral]
[Text=RAAS CharacterOffsetBegin=38 CharacterOffsetEnd=42 PartOfSpeech=NN Lemma=raas NamedEntityTag=O SentimentClass=Neutral]
[Text=-RRB- CharacterOffsetBegin=42 CharacterOffsetEnd=43

Our takeaway sentence is that "ACE Inhibitors (ACEI) [...] decrease mortality and morbidity in heart failure but increase the risk of hyperkalemia...".

`amod(inhibitors-11, angiotensin-converting-9)`

`dobj(increase-38, risk-40)`

`nmod:of(risk-40, hyperkalemia-42)`


## Experimenting with specific drug names instead of a drug class

It didn't look like we had a lot of interactions between drugs and molecules, so I wonder if we change our serach terms a little bit to include the drug names. The search terms are not case sensitive.

`Drugs in class: Captopril, Lisinopril, Ramipril, Benazepril`

In [165]:
search_term = ['Captopril', 'Lisinopril','Ramipril','Benazepril']
max_results = 100
for i in search_term:
    query2 = pb.PubMedQuery(i, max_results)
    
ids2 = query2.id_getter()
print "Num abstracts: ", len(ids2)

abstracts2 = query.abstract_getter(ids2)

json_file2 = 'more_abstracts2.json'
print 'Saving to ' + json_file2
with open(json_file2, 'w') as outfile:
    json.dump(abstracts2, outfile, indent=4)
    
    
sentences2 = []
with open('more_abstracts2.json') as data_file:    
    data = json.load(data_file)
    for abstract in data.itervalues():
        sentences2.append(sent_tokenize(abstract))
    # flatten the list of abstracts into one long list of sentences
    sentences2 = [sent for s in sentences2 for sent in s]
    print "Sentences: ",len(sentences2)

Num abstracts:  899
Saving to more_abstracts2.json
Sentences:  938


In [169]:
#Just looking at some sentences
sentences2[:5]

[u'Benazepril plays an important role in down-regulating the expression of TGFbeta1 and decreasing the accumulation of ECM by blocking intrarenal renin-angiotensin system.',
 u'To study the expression of type I transforming growth factor beta receptor (TGFbetaRI) in renal cortex in streptozotocin-induced diabetic rats and the regulation of benazepril.',
 u'Experimental glomerulosclerosis in rats was induced by adriamycin.',
 u'The treated group was given benazepril (4 mg x kg(-1) x d(-1)).',
 u'Intrarenal angiotensin converting enzyme (ACE) activity and angiotensin II (AngII) concentration were measured with colorimetry and radioimmunoassay respectively.']

In [218]:
%%writefile KeywordSentences2.py
import json

#change keyword to ACE
keywords = ['Captopril', 'Lisinopril','Ramipril','Benazepril']
with open('more_abstracts2.json') as data_file:    
    data = json.load(data_file)
    #pick snippets related to ACE inhibitors
    for i in range(len(data.keys())):
        try:
            if any(k in data[str(i)] for k in keywords):
                print data[str(i)] + '\n'
            else:
                next
        except:
            next
            

Overwriting KeywordSentences2.py


In [219]:
!python KeywordSentences2.py > KeywordSentences2.txt

In [220]:
!cat KeywordSentences2.txt

A number of experimental and clinical investigations support the notion that angiotensin-converting enzyme inhibitor (ACEi) and angiotensin II type 1 receptor blocker (ARB) compounds attenuate renal fibrosis. Fibrosis can be attenuated by either suppressing matrix formation or facilitating matrix degradation. In this study, drugs of ACEi and ARB classes were tested for their ability to facilitate matrix degradation in the kidney. A murine model system in which cyclosporin A (CsA) treatment for a specified period caused interstitial matrix deposition in the kidney was used. CsA was then discontinued, and experimental procedures were initiated to investigate matrix degradation. Benazepril, an ACEi, facilitated matrix degradation via the bradykinin (BK) B2 receptor on tubular epithelial cells in the kidney, whereas CGP-48933, an ARB, did not. In this murine model of CsA nephropathy under ACE blockade, plasminogen activator inhibitor-1 (PAI-1) expression was decreased in tubular epithelial

## Experimenting with Sentiment Analysis

In [222]:
!java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref,sentiment -file KeywordSentences2.txt -outputFormat text

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.4 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [2.0 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7cla

In [223]:
!cat KeywordSentences2.txt.out

Sentence #1 (32 tokens, sentiment: Negative):
A number of experimental and clinical investigations support the notion that angiotensin-converting enzyme inhibitor (ACEi) and angiotensin II type 1 receptor blocker (ARB) compounds attenuate renal fibrosis.
[Text=A CharacterOffsetBegin=0 CharacterOffsetEnd=1 PartOfSpeech=DT Lemma=a NamedEntityTag=O SentimentClass=Neutral]
[Text=number CharacterOffsetBegin=2 CharacterOffsetEnd=8 PartOfSpeech=NN Lemma=number NamedEntityTag=O SentimentClass=Neutral]
[Text=of CharacterOffsetBegin=9 CharacterOffsetEnd=11 PartOfSpeech=IN Lemma=of NamedEntityTag=O SentimentClass=Neutral]
[Text=experimental CharacterOffsetBegin=12 CharacterOffsetEnd=24 PartOfSpeech=JJ Lemma=experimental NamedEntityTag=O SentimentClass=Neutral]
[Text=and CharacterOffsetBegin=25 CharacterOffsetEnd=28 PartOfSpeech=CC Lemma=and NamedEntityTag=O SentimentClass=Neutral]
[Text=clinical CharacterOffsetBegin=29 CharacterOffsetEnd=37 PartOfSpeech=JJ Lemma=clinical NamedEntityTag=O Sentimen

## Sentiment Experiment with 50 Sentences

In [7]:
!head -50 data/FilteredSentences.txt > data/Filtered50.txt

In [9]:
%%timeit
!java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref,sentiment -file data/Filtered50.txt -outputFormat text

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.2 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.8 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7cla

In [10]:
!head Filtered50.txt.out

Sentence #1 (34 tokens, sentiment: Negative):
[u'casein', u'whey']in this study, we examined the separated caseins and whey proteins of goat milk for the presence of ACEI inhibitory peptides.
[Text=-LSB- CharacterOffsetBegin=0 CharacterOffsetEnd=1 PartOfSpeech=-LRB- Lemma=-lsb- NamedEntityTag=O SentimentClass=Neutral]
[Text=u CharacterOffsetBegin=1 CharacterOffsetEnd=2 PartOfSpeech=FW Lemma=u NamedEntityTag=O SentimentClass=Neutral]
[Text=` CharacterOffsetBegin=2 CharacterOffsetEnd=3 PartOfSpeech=`` Lemma=` NamedEntityTag=O SentimentClass=Neutral]
[Text=casein CharacterOffsetBegin=3 CharacterOffsetEnd=9 PartOfSpeech=NN Lemma=casein NamedEntityTag=O SentimentClass=Neutral]
[Text=' CharacterOffsetBegin=9 CharacterOffsetEnd=10 PartOfSpeech='' Lemma=' NamedEntityTag=O SentimentClass=Neutral]
[Text=, CharacterOffsetBegin=10 CharacterOffsetEnd=11 PartOfSpeech=, Lemma=, NamedEntityTag=O SentimentClass=Neutral]
[Text=u CharacterOffsetBegin=12 CharacterOffsetEnd=13 PartOfSpeech=FW Lemma

In [25]:
co_occurrence_dict = defaultdict(list)

for line in open('Filtered50.txt.out').readlines():
    if 'Sentence #' in line:
        sentence = str(line.strip('\n')).split(" ")[1]
        sentiment = str(line.strip('\n')).split(":")[1]
        sentiment = re.sub("\d+","",re.sub(r'[^\w\s]','',sentiment))
    elif line[0:5] == '[Text':
        word = str(line.split("=")[1]).split(" ")[0].lower()
        pos = str(line.split("=")[4]).split(" ")[0]
        co_occurrence_dict[(sentiment, sentence)].append(word)

In [37]:
#Example of one of the entries in the dictionary

from itertools import islice
import  pprint
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

n_items = take(1, co_occurrence_dict.iteritems())
pp = pprint.PrettyPrinter(depth=6)
pp.pprint(n_items)

[((' Very negative', '#27'),
  ['-lsb-',
   'u',
   '`',
   'barley',
   "'",
   '-rsb-',
   'papain',
   'was',
   'the',
   'enzyme',
   'of',
   'choice',
   ',',
   'based',
   'on',
   'in',
   'silico',
   'analysis',
   ',',
   'for',
   'experimental',
   'hydrolysis',
   'of',
   'barley',
   'protein',
   'concentrate',
   ',',
   'which',
   'was',
   'performed',
   'at',
   'the',
   'enzyme',
   "'s",
   'optimum',
   'conditions',
   '-lrb-',
   '60',
   'unkc',
   ',',
   'ph',
   '6.0',
   '-rrb-',
   'for',
   '24',
   'h.',
   'the',
   'generated',
   'hydrolysate',
   'was',
   'subjected',
   'to',
   'molecular',
   'weight',
   'cut-off',
   '-lrb-',
   'mwco',
   '-rrb-',
   'filtration',
   ',',
   'following',
   'which',
   'the',
   'non-ultrafiltered',
   'hydrolysate',
   '-lrb-',
   'nufh',
   '-rrb-',
   ',',
   'and',
   'the',
   'generated',
   '3',
   'kda',
   'and',
   '10',
   'kda',
   'mwco',
   'filtrates',
   'were',
   'assessed',
   'for',


In [36]:
#Assuming these are the key words we are curious about.
search_phrase = ['barley','acei-i']

#Searching through the dictionary
for k,v in co_occurrence_dict.iteritems():
#     if any(k in search_phrase for k in v):
    if all(k in v for k in search_phrase):
        #print k, v
        print "There is a", k[0], "relationship between ", search_phrase, ".\nLiterature: ", " ".join([i for i in v])

There is a  Very negative relationship between  ['barley', 'acei-i'] .
Literature:  -lsb- u ` barley ' -rsb- papain was the enzyme of choice , based on in silico analysis , for experimental hydrolysis of barley protein concentrate , which was performed at the enzyme 's optimum conditions -lrb- 60 unkc , ph 6.0 -rrb- for 24 h. the generated hydrolysate was subjected to molecular weight cut-off -lrb- mwco -rrb- filtration , following which the non-ultrafiltered hydrolysate -lrb- nufh -rrb- , and the generated 3 kda and 10 kda mwco filtrates were assessed for their in vitro acei-i inhibitory activities .
There is a  Negative relationship between  ['barley', 'acei-i'] .
Literature:  -lsb- u ` barley ' -rsb- our current work utilised in silico methodologies and peptide databases as tools for predicting release of acei-i inhibitory peptides from barley proteins .
There is a  Very negative relationship between  ['barley', 'acei-i'] .
Literature:  -lsb- u ` barley ' -rsb- the 3 kda filtrate -l

## EXPERIMENT ENDING

## Experimenting with Different Models

Per instructions at <https://nlp.stanford.edu/software/nndep.shtml>

I opened up the `stanford-corenlp-3.8.0.jar` file and added all the folders inside the `edu/stanford/nlp/` folder!

In [6]:
!java edu.stanford.nlp.parser.nndep.DependencyParser -model edu/stanford/nlp/models/parser/nndep/PTB_CoNLL_params.txt.gz -textFile KeywordSentences.txt -outFile KeywordSentences_output.txt

Loading depparse model file: edu/stanford/nlp/models/parser/nndep/PTB_CoNLL_params.txt.gz ... 
###################
#Transitions: 35
#Labels: 17
ROOTLABEL: ROOT
PreComputed 100000, Elapsed Time: 1.664 (s)
Initializing dependency parser ... done [2.6 sec].
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec].
Tagging completed in 1.09 sec.
Parsed 12 sentences in 0.52 seconds (23.30 sents/sec).


In [7]:
!head KeywordSentences_output.txt

VMOD(is-11, RAAS-1)
P(RAAS-1, ,-2)
NMOD(target-6, a-3)
NMOD(target-6, major-4)
NMOD(target-6, pharmacological-5)
APPO(RAAS-1, target-6)
NMOD(target-6, in-7)
NMOD(medicine-9, cardiovascular-8)
PMOD(in-7, medicine-9)
P(RAAS-1, ,-10)


In [10]:
!java edu.stanford.nlp.parser.nndep.DependencyParser -model edu/stanford/nlp/models/parser/nndep/english_UD.gz -textFile KeywordSentences.txt -outFile KeywordSentences_output.txt

Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... 
###################
#Transitions: 81
#Labels: 40
ROOTLABEL: root
PreComputed 99996, Elapsed Time: 12.48 (s)
Initializing dependency parser ... done [13.8 sec].
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.2 sec].
Tagging completed in 2.43 sec.
Parsed 12 sentences in 0.68 seconds (17.67 sents/sec).


In [11]:
!head KeywordSentences_output.txt

nsubjpass(inhibited-12, RAAS-1)
punct(RAAS-1, ,-2)
det(target-6, a-3)
amod(target-6, major-4)
amod(target-6, pharmacological-5)
appos(RAAS-1, target-6)
case(medicine-9, in-7)
amod(medicine-9, cardiovascular-8)
nmod(target-6, medicine-9)
punct(RAAS-1, ,-10)


## Hypothesis testing....

I found an example abstract <https://www.ncbi.nlm.nih.gov/pubmed/28702139> that pretty clearly makes a connection between an ACEI and liver toxicity and side effects.... now I want to see if we could pull out any interactions.

In [15]:
!echo "Aim: Angiotensin-converting enzyme inhibitors (ACEIs) are commonly used to treat hypertension. Although generally well tolerated, the adverse effects of ACEIs include hypotension, cough, acute kidney injury and hyperkalemia. Rare reports of ACEI-induced hepatotoxicity have been described, most notably a cholestatic pattern of injury related to captopril. A 67-year-old male presented to the emergency department with a three-week history of jaundice, pruritis and weakness. Eight weeks before, he began taking ramipril and clopidogrel. His past medical history was significant for previous acute cholestatic liver injury approximately 20 years earlier, which was attributed to methimazole. Abnormal blood work demonstrated aspartate aminotransferase (AST) 47 U/L, alanine aminotransferase (ALT) 46 U/L, total bilirubin 230 µmol/L, direct bilirubin 176 µmol/L, and alkaline phosphatase (ALP) 470 U/L. Abdominal ultrasound and magnetic resonance cholangiopancreatography showed no bile duct obstruction. Further work-up was negative for infectious, autoimmune, or other causes. Percutaneous liver biopsy showed marked cholestasis. With discontinuation of ramipril, the patient demonstrated prolonged cholestasis with partial biochemical improvement and was discharged after six weeks in hospital. This case represents the first described cross reactivity between ramipril and methimazole, illustrating the complex and poorly understood nature of DILI. Despite the relatively few instances of ACEI-induced liver hepatotoxicity, consideration should be given to discontinuation of ramipril in situations of unknown liver damage." > sample_sentence.txt

In [16]:
!head sample_sentence.txt

Aim: Angiotensin-converting enzyme inhibitors (ACEIs) are commonly used to treat hypertension. Although generally well tolerated, the adverse effects of ACEIs include hypotension, cough, acute kidney injury and hyperkalemia. Rare reports of ACEI-induced hepatotoxicity have been described, most notably a cholestatic pattern of injury related to captopril. A 67-year-old male presented to the emergency department with a three-week history of jaundice, pruritis and weakness. Eight weeks before, he began taking ramipril and clopidogrel. His past medical history was significant for previous acute cholestatic liver injury approximately 20 years earlier, which was attributed to methimazole. Abnormal blood work demonstrated aspartate aminotransferase (AST) 47 U/L, alanine aminotransferase (ALT) 46 U/L, total bilirubin 230 µmol/L, direct bilirubin 176 µmol/L, and alkaline phosphatase (ALP) 470 U/L. Abdominal ultrasound and magnetic resonance cholangiopancreatography showed no bile duct obstructi

## Wanting to check Sentiment Analysis

What if I grouped all the words together based on the sentiment of the sentence they're found in, and then allowed us to search for a couple of key words, and then the sentiment analysis would determine the relationship between the two items: compound and drug, compound and organ, organ and drug, etc?

### Running a parser and annotator to get the sentiment...

In [60]:
!java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref,sentiment -file sample_sentence.txt -outputFormat text

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.2 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.8 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7cla

### We can take a look at the output...

It's pretty messy, so I just wanted to pull out the important stuff.

In [180]:
# !cat sample_sentence.txt.out

### Made a little dictionary that would group words by sentiment, and assign the appropriate sentiment to them. 

In [181]:
co_occurrence_dict = defaultdict(list)

for line in open('sample_sentence.txt.out').readlines():
    if 'Sentence #' in line:
        sentence = str(line.strip('\n')).split(" ")[1]
        sentiment = str(line.strip('\n')).split(":")[1]
        sentiment = re.sub("\d+","",re.sub(r'[^\w\s]','',sentiment))
    elif line[0] == '[':
        word = str(line.split("=")[1]).split(" ")[0].lower()
        pos = str(line.split("=")[4]).split(" ")[0]
        co_occurrence_dict[(sentiment, sentence)].append(word)

In [198]:
#Example of one of the entries in the dictionary

from itertools import islice
import  pprint
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

n_items = take(1, co_occurrence_dict.iteritems())
pp = pprint.PrettyPrinter(depth=6)
pp.pprint(n_items)

[((' Negative', '#10'),
  ['percutaneous', 'liver', 'biopsy', 'showed', 'marked', 'cholestasis', '.'])]


### This could be a way that we search our dictionary to find the relationship between the words.
Look at a few key words, and see what results. Notice the different types of searches we could do. `if any...` vs `if all...`!

In [199]:
#Assuming these are the key words we are curious about.
search_phrase = ['ramipril','liver']

#Searching through the dictionary
for k,v in co_occurrence_dict.iteritems():
#     if any(k in search_phrase for k in v):
    if all(k in v for k in search_phrase):
        #print k, v
        print "There is a", k[0], "relationship between ", search_phrase, ".\nLiterature: ", " ".join([i for i in v])

There is a  Negative relationship between  ['ramipril', 'liver'] .
Literature:  despite the relatively few instances of acei-induced liver hepatotoxicity , consideration should be given to discontinuation of ramipril in situations of unknown liver damage .


## POS Tagging and Caching

Another way of looking at it... We could just group the words together by POS tag? Not sure if this could be more helpful.

In [17]:
!java edu.stanford.nlp.parser.nndep.DependencyParser -model edu/stanford/nlp/models/parser/nndep/PTB_CoNLL_params.txt.gz -textFile sample_sentence.txt -outFile sample_sentence_output.txt

Loading depparse model file: edu/stanford/nlp/models/parser/nndep/PTB_CoNLL_params.txt.gz ... 
###################
#Transitions: 35
#Labels: 17
ROOTLABEL: ROOT
PreComputed 100000, Elapsed Time: 1.804 (s)
Initializing dependency parser ... done [2.8 sec].
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
Tagging completed in 0.90 sec.
Parsed 13 sentences in 0.48 seconds (27.31 sents/sec).


In [19]:
!cat sample_sentence_output.txt

root(ROOT-0, Aim-1)
P(Aim-1, :-2)
NMOD(inhibitors-5, Angiotensin-converting-3)
NMOD(inhibitors-5, enzyme-4)
VMOD(are-9, inhibitors-5)
NMOD(-RRB--8, -LRB--6)
NMOD(-RRB--8, ACEIs-7)
APPO(inhibitors-5, -RRB--8)
NMOD(Aim-1, are-9)
VMOD(are-9, commonly-10)
VC(are-9, used-11)
VMOD(used-11, to-12)
IM(to-12, treat-13)
VMOD(treat-13, hypertension-14)
P(Aim-1, .-15)

VMOD(include-11, Although-1)
VMOD(tolerated-4, generally-2)
AMOD(generally-2, well-3)
SUB(Although-1, tolerated-4)
P(include-11, ,-5)
NMOD(effects-8, the-6)
NMOD(effects-8, adverse-7)
VMOD(include-11, effects-8)
NMOD(effects-8, of-9)
PMOD(of-9, ACEIs-10)
root(ROOT-0, include-11)
VMOD(include-11, hypotension-12)
P(hypotension-12, ,-13)
COORD(hypotension-12, cough-14)
P(cough-14, ,-15)
NMOD(injury-18, acute-16)
NMOD(injury-18, kidney-17)
COORD(cough-14, injury-18)
COORD(injury-18, and-19)
CONJ(and-19, hyperkalemia-20)
P(include-11, .-21)

NMOD(reports-2, Rare-1)
VMOD(have-6, reports-2)
NMOD(repo

In [93]:
POS_dict = defaultdict(list)
words_dict = defaultdict(list)
for line in open('sample_sentence_output.txt').readlines():
    pos = line.split("(")[0]
    try:
        word_1, word_2 = str(line.split("(")[1]).split(",")[0], str(line.split("(")[1]).split(",")[1]
        word_1 = re.sub("\d+","",re.sub(r'[^\w\s]','',word_1))
        word_2 = re.sub("\d+","",re.sub(r'[^\w\s]','',word_2))
        
        #Dictionary of words and their parts of speech?
        words_dict[str(word_1).strip(' \n')].append(str(pos).strip('\n'))
        words_dict[str(word_2).strip(' \n')].append(str(pos).strip('\n'))
        
        #Dictionary of POS with corresponding words?
        POS_dict[str(pos)].append(((str(word_1).strip('\n')),str(word_2).strip('\n')))
        
    except:
        next

print POS_dict.keys()
print words_dict.keys()

['VC', 'SUB', 'VMOD', 'PMOD', 'DEP', 'NMOD', 'AMOD', 'APPO', 'P', 'IM', 'COORD', 'CONJ', 'root']
['', 'Percutaneous', 'partial', 'magnetic', 'years', 'obstruction', 'aspartate', 'DILI', 'causes', 'before', 'His', 'situations', 'cholestasis', 'reactivity', 'improvement', 'to', 'enzyme', 'complex', 'Although', 'Despite', 'weeks', 'include', 'pruritis', 'ACEIinduced', 'emergency', 'represents', 'weakness', 'biopsy', 'L', 'Aim', 'effects', 'mol', 'damage', 'acute', 'jaundice', 'presented', 'should', 'prolonged', 'methimazole', 'Further', 'alanine', 'illustrating', 'ramipril', 'direct', 'related', 'past', 'understood', 'are', 'instances', 'for', 'poorly', 'pattern', 'inhibitors', 'Eight', 'phosphatase', 'between', 'previous', 'ROOT', 'demonstrated', 'be', 'after', 'patient', 'nature', 'This', 'alkaline', 'notably', 'cholestatic', 'cough', 'discharged', 'LRB', 'of', 'taking', 'Abnormal', 'attributed', 'UL', 'discontinuation', 'resonance', 'or', 'first', 'Abdominal', 'hepatotoxicity', 'duct',