## Instructions

- First, I downloaded and unzipped Stanford CoreNLP from <http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip>.
- Then I dragged the pubmed folder into the unzipped CoreNLP folder.
- Then I ran the following.

# Parsing the PubMed Abstracts

In [2]:
import pubmed.downloader as pb
import json
from pprint import pprint
# split abstracts into sentences
from nltk.tokenize import sent_tokenize

In [3]:
search_term = 'ACE inhibitor'
max_results = 3
query = pb.PubMedQuery(search_term, max_results)

In [4]:
ids = query.id_getter()
print ids

28687104,28682034,28682025


In [5]:
query.abstract_getter(ids)

{0: u'Nonsteroidal antiinflammatory agents, \u03b2-lactam antibiotics, non-\u03b2 lactam antibiotics, and angiotensin-converting enzyme inhibitors are the most common classes of drugs that cause angioedema. Drug-induced angioedema is known to occur via mechanisms mediated by histamine, bradykinin, or leukotriene, and an understanding of these mechanisms is crucial in guiding therapeutic decisions. Nonallergic angioedema occurs in patients with genetic variants that affect metabolism or synthesis of bradykinin, substance P, prostaglandins, or leukotrienes, or when patients are taking drugs that have synergistic mechanisms. The mainstay in treatment of nonallergic drug-induced angioedema is cessation of the offending agents.',
 1: "Fabry's disease (FD) is a severe congenital metabolic disorder characterized by the deficient activity of lysosomal exoglycohydrolase alpha-galactosidase, characterized by glycosphingolipid deposition in several cells, such as capillary endothelial cells, rena

In [13]:
%%time
# I only downloaded a few abstracts here.
# If you don't want to download all abstracts, use abstract_getter() as below
# rather than the download_all_abstracts() method.
pb.PubMedQuery.COUNT = 0
max_results = 10
query = pb.PubMedQuery(search_term, max_results)
ids = query.id_getter()
abstracts = query.abstract_getter(ids)

CPU times: user 152 ms, sys: 4 ms, total: 156 ms
Wall time: 8.98 s


In [6]:
abstracts = query.abstract_getter(ids)
abstracts

{0: u'Nonsteroidal antiinflammatory agents, \u03b2-lactam antibiotics, non-\u03b2 lactam antibiotics, and angiotensin-converting enzyme inhibitors are the most common classes of drugs that cause angioedema. Drug-induced angioedema is known to occur via mechanisms mediated by histamine, bradykinin, or leukotriene, and an understanding of these mechanisms is crucial in guiding therapeutic decisions. Nonallergic angioedema occurs in patients with genetic variants that affect metabolism or synthesis of bradykinin, substance P, prostaglandins, or leukotrienes, or when patients are taking drugs that have synergistic mechanisms. The mainstay in treatment of nonallergic drug-induced angioedema is cessation of the offending agents.',
 1: "Fabry's disease (FD) is a severe congenital metabolic disorder characterized by the deficient activity of lysosomal exoglycohydrolase alpha-galactosidase, characterized by glycosphingolipid deposition in several cells, such as capillary endothelial cells, rena

In [7]:
json_file = 'my_ten_abstracts.json'
print 'Saving to ' + json_file
with open(json_file, 'w') as outfile:
    json.dump(abstracts, outfile, indent=4)

Saving to my_ten_abstracts.json


## We might not need every sentence in each abstract, so I am testing what happens if we keep only the ones containing keywords, write them to a text file, and then parse that file.

Keyword selection could happen here in the sentence tokenizer.

In [8]:
sentences = []
with open('my_ten_abstracts.json') as data_file:
    data = json.load(data_file)
# split each abstract and collect everything in one flat list of sentences
for abstract in data.itervalues():
    sentences.extend(sent_tokenize(abstract))
print "Sentences: ", len(sentences)

Sentences:  16
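
As a rough sketch of the keyword selection mentioned above, the sentence list could be filtered right after tokenizing. The function name `filter_sentences`, the sample sentences, and the keyword list are all hypothetical and not part of the notebook. Matching whole words rather than substrings is a design choice, assumed here to be closer to what we want.

```python
import re


def filter_sentences(sentences, keywords):
    """Return the sentences that contain at least one keyword as a whole word."""
    if not keywords:
        return []
    # \b anchors keep a keyword such as 'ACE' from matching inside 'replacement'
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [s for s in sentences if pattern.search(s)]


# Hypothetical sentences and keywords, for illustration only
example = [
    "ACE inhibitors can cause angioedema.",
    "The replacement therapy was well tolerated.",
]
print(filter_sentences(example, ["ACE", "angiotensin"]))
```

In the notebook this would be called on the `sentences` list built in the cell above.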


In [10]:
%%writefile KeywordSentences.py
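import json

# NOTE: this is a plain substring test on whole abstracts, so a short keyword
# like 'in' matches almost every abstract
keyword = 'in'
with open('my_ten_abstracts.json') as data_file:
    data = json.load(data_file)
    # pick abstracts that contain the keyword (JSON keys are '0', '1', ...)
    for i in range(len(data)):
        try:
            if keyword in data[str(i)]:
                print data[str(i)] + '\n'
        except Exception:
            # skip any abstract that cannot be read or printed
            continue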

Writing KeywordSentences.py


In [12]:
!python KeywordSentences.py > KeywordSentences.txt

In [13]:
# This mimics the format of the example parser file
!cat KeywordSentences.txt

Fabry's disease (FD) is a severe congenital metabolic disorder characterized by the deficient activity of lysosomal exoglycohydrolase alpha-galactosidase, characterized by glycosphingolipid deposition in several cells, such as capillary endothelial cells, renal, cardiac, and nerve cells. As a systemic disease leading to a contemporaneous myocardial and renal dysfunction, FD might be an example of cardiorenal syndrome type 5 (CRS-5). Kidney damage is commonly characterized by proteinuria, isosthenuria and altered tubular function when occurs at the second-third decade, azotemia and end-stage renal disease in third-fifth decade. Beyond the irreversible glomerular, tubular and vascular damages, the podocytes foot process effacement is the major cause of kidney dysfunction. Myocardial damage is usually observed with right and left ventricular hypertrophy, arrhythmias (due to sinus node and conduction system impairment), diastolic dysfunction, congestive heart failure, myocardial ischemia, 
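
One caveat with the redirect approach: under Python 2, printing non-ASCII text (such as the β in the first abstract) to a redirected stdout can raise an encoding error, which the broad except clause in the script would silently skip. A possible alternative, sketched below, writes the file directly with an explicit UTF-8 encoding. The function name `write_keyword_abstracts` and its parameters are hypothetical and not part of the notebook.

```python
import io
import json


def write_keyword_abstracts(json_path, keyword, out_path):
    """Write every abstract containing `keyword` to `out_path`, blank-line separated."""
    with io.open(json_path, encoding="utf-8") as data_file:
        data = json.load(data_file)
    # JSON object keys are strings ('0', '1', ...), so sort them numerically
    kept = [data[k] for k in sorted(data, key=int) if keyword in data[k]]
    # Encode explicitly as UTF-8 so characters such as the Greek beta survive
    with io.open(out_path, "w", encoding="utf-8") as out_file:
        for abstract in kept:
            out_file.write(abstract + u"\n\n")
    return len(kept)


# Example call (assumes the JSON file saved earlier exists):
# write_keyword_abstracts('my_ten_abstracts.json', 'in', 'KeywordSentences.txt')
```

The function returns the number of abstracts written, which makes it easy to spot when one was dropped.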

## Running the Stanford Parser

We first have to set up a configuration file. The run below fails because sampleProps.properties does not exist yet.

In [22]:
!java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props sampleProps.properties

Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: argsToProperties could not read properties file: sampleProps.properties
	at edu.stanford.nlp.util.StringUtils.argsToProperties(StringUtils.java:1013)
	at edu.stanford.nlp.util.StringUtils.argsToProperties(StringUtils.java:929)
	at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1334)
Caused by: java.io.IOException: Unable to open "sampleProps.properties" as class path, filename or URL
	at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:480)
	at edu.stanford.nlp.io.IOUtils.readerFromString(IOUtils.java:617)
	at edu.stanford.nlp.util.StringUtils.argsToProperties(StringUtils.java:1004)
	... 2 more
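
The error above only says that the properties file could not be found. A minimal sketch of creating one from Python follows. It assumes the command-line flags `-annotators` and `-outputFormat` map to properties of the same name; the helper name `write_corenlp_props` is hypothetical.

```python
def write_corenlp_props(path, annotators, output_format="text"):
    """Write a minimal Java .properties file for the CoreNLP pipeline."""
    lines = [
        # Assumed property names, mirroring the -annotators / -outputFormat flags
        "annotators = " + ", ".join(annotators),
        "outputFormat = " + output_format,
    ]
    with open(path, "w") as props_file:
        props_file.write("\n".join(lines) + "\n")
    return lines


# Example call (creates the file the failed command was looking for):
# write_corenlp_props('sampleProps.properties', ['tokenize', 'ssplit', 'pos'])
```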


## Pulling from the README file, this is what I am going to run:

Runs Stanford CoreNLP.

Simple uses for xml and plain text output to files are:

`./corenlp.sh -file filename`

`./corenlp.sh -file filename -outputFormat text`

`-outputFormat`: Different methods for outputting results. Can be:

- “text”: An ad hoc human-readable text format. Tokens, s-expression parse trees, relation(head, dep) dependencies. Output file extension is .out. This is the default output format only if the XMLOutputter is unavailable.
- “xml”: An XML format with accompanying XSLT stylesheet, which allows web browser rendering. Output file extension is .xml. This is the default output format, unless the XMLOutputter is unavailable.
- “json”: JSON. Output file extension is .json. ‘Nuf said.
- “conll”: A tab-separated values (TSV) format. Output extension is .conll. This representation may give only a partial view of an Annotation and doesn’t correspond to any particular CoNLL format. Columns are: wordIndex, token, lemma, POS, NER, head, depRel.
- “conllu”: CoNLL-U output format, another tab-separated values (TSV) format. Output extension is .conllu. This representation may give only a partial view of an Annotation.
- “serialized”: Produces some serialized version of each Annotation. May or may not be lossy. What you actually get depends on the outputSerializer property, which you should also set. The default is the GenericAnnotationSerializer, which uses the built-in Java object serialization and writes a file with extension .ser.gz.
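
If the “json” format is chosen instead of “text”, the result can be loaded directly with Python's `json` module. The sketch below assumes the output has a top-level `sentences` list whose entries each carry a `tokens` list with `word` and `pos` fields. That layout is an assumption about the JSON outputter and should be checked against a real output file; the function name and sample string are hypothetical.

```python
import json


def tokens_from_corenlp_json(json_text):
    """Return (word, pos) pairs for every token in a CoreNLP JSON document."""
    doc = json.loads(json_text)
    pairs = []
    for sentence in doc.get("sentences", []):
        for token in sentence.get("tokens", []):
            # .get() keeps this tolerant of annotators that were not run
            pairs.append((token.get("word"), token.get("pos")))
    return pairs


# Hand-written stand-in for a real output file, using the assumed layout
sample = '{"sentences": [{"index": 0, "tokens": [{"index": 1, "word": "Fabry", "pos": "NN"}]}]}'
print(tokens_from_corenlp_json(sample))
```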



Split into sentences, run POS tagger and NER, write CoNLL-style TSV file:
`./corenlp.sh -annotators tokenize,ssplit,pos,lemma,ner -outputFormat conll -file input.txt`

You can also start a simple shell where you can enter sentences to be processed:
`./corenlp.sh`
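
The invocations above can also be assembled in Python, which helps when the same run is repeated over several files. This sketch only builds the argument list from the flags shown in the README excerpt (`-annotators`, `-file`, `-outputFormat`). The helper name `build_corenlp_command` is hypothetical.

```python
def build_corenlp_command(input_file, annotators=None, output_format="text"):
    """Build the argument list for ./corenlp.sh using the README's flags."""
    cmd = ["./corenlp.sh"]
    if annotators:
        # Annotators are passed as one comma-separated argument
        cmd += ["-annotators", ",".join(annotators)]
    cmd += ["-file", input_file, "-outputFormat", output_format]
    return cmd


print(build_corenlp_command("input.txt", ["tokenize", "ssplit", "pos"], "conll"))
# To actually run it from the CoreNLP folder (requires Java):
# import subprocess
# subprocess.check_call(build_corenlp_command("input.txt"))
```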

## I am going to simply run it on a text file, with text output. It keeps the same file name and appends ".out".

In [23]:
!./corenlp.sh -annotators tokenize,ssplit,pos -file KeywordSentences.txt -outputFormat text

java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file KeywordSentences.txt -outputFormat text
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.4 sec].

Processing file /Users/lisabarcelo/Desktop/W266/stanford-corenlp-full-2017-06-09/KeywordSentences.txt ... writing to /Users/lisabarcelo/Desktop/W266/stanford-corenlp-full-2017-06-09/KeywordSentences.txt.out
Annotating file /Users/lisabarcelo/Desktop/W266/stanford-corenlp-full-2017-06-09/KeywordSentences.txt ... d

## Now we can see the output of the file.

In [27]:
!head KeywordSentences.txt.out

Sentence #1 (44 tokens):
Fabry's disease (FD) is a severe congenital metabolic disorder characterized by the deficient activity of lysosomal exoglycohydrolase alpha-galactosidase, characterized by glycosphingolipid deposition in several cells, such as capillary endothelial cells, renal, cardiac, and nerve cells.
[Text=Fabry CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NN]
[Text='s CharacterOffsetBegin=5 CharacterOffsetEnd=7 PartOfSpeech=POS]
[Text=disease CharacterOffsetBegin=8 CharacterOffsetEnd=15 PartOfSpeech=NN]
[Text=-LRB- CharacterOffsetBegin=16 CharacterOffsetEnd=17 PartOfSpeech=-LRB-]
[Text=FD CharacterOffsetBegin=17 CharacterOffsetEnd=19 PartOfSpeech=NN]
[Text=-RRB- CharacterOffsetBegin=19 CharacterOffsetEnd=20 PartOfSpeech=-RRB-]
[Text=is CharacterOffsetBegin=21 CharacterOffsetEnd=23 PartOfSpeech=VBZ]
[Text=a CharacterOffsetBegin=24 CharacterOffsetEnd=25 PartOfSpeech=DT]
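
To work with this output programmatically, the bracketed token lines can be turned into dictionaries. The sketch below is based only on the lines shown above and assumes that token text contains no spaces. The function name `parse_token_line` is hypothetical.

```python
def parse_token_line(line):
    """Turn '[Text=is CharacterOffsetBegin=21 ...]' into a dict of its fields."""
    line = line.strip()
    if not (line.startswith("[") and line.endswith("]")):
        return None  # sentence headers and raw sentence text are skipped
    fields = {}
    for part in line[1:-1].split(" "):
        # Split on the first '=' only, so a value may itself contain '='
        key, sep, value = part.partition("=")
        if sep:
            fields[key] = value
    return fields


# Lines copied from the output above
lines = [
    "Sentence #1 (44 tokens):",
    "[Text=Fabry CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NN]",
    "[Text='s CharacterOffsetBegin=5 CharacterOffsetEnd=7 PartOfSpeech=POS]",
]
tokens = [t for t in map(parse_token_line, lines) if t]
print([(t["Text"], t["PartOfSpeech"]) for t in tokens])
```

All values come back as strings, so the character offsets would need an `int()` conversion before any arithmetic.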


## Experimenting... what if I don't feed in the annotators?

In [28]:
!./corenlp.sh -file KeywordSentences.txt -outputFormat text

java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file KeywordSentences.txt -outputFormat text
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Searching for resource: StanfordCoreNLP.properties ... not found.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Searching for resource: edu/stanford/nlp/pipeline/StanfordCoreNLP.properties ... found.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.4 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.Abstract

In [30]:
!head KeywordSentences.txt.out

Sentence #1 (44 tokens):
Fabry's disease (FD) is a severe congenital metabolic disorder characterized by the deficient activity of lysosomal exoglycohydrolase alpha-galactosidase, characterized by glycosphingolipid deposition in several cells, such as capillary endothelial cells, renal, cardiac, and nerve cells.
[Text=Fabry CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NN Lemma=fabry NamedEntityTag=O]
[Text='s CharacterOffsetBegin=5 CharacterOffsetEnd=7 PartOfSpeech=POS Lemma='s NamedEntityTag=O]
[Text=disease CharacterOffsetBegin=8 CharacterOffsetEnd=15 PartOfSpeech=NN Lemma=disease NamedEntityTag=O]
[Text=-LRB- CharacterOffsetBegin=16 CharacterOffsetEnd=17 PartOfSpeech=-LRB- Lemma=-lrb- NamedEntityTag=O]
[Text=FD CharacterOffsetBegin=17 CharacterOffsetEnd=19 PartOfSpeech=NN Lemma=fd NamedEntityTag=O]
[Text=-RRB- CharacterOffsetBegin=19 CharacterOffsetEnd=20 PartOfSpeech=-RRB- Lemma=-rrb- NamedEntityTag=O]
[Text=is CharacterOffsetBegin=21 CharacterOffsetEnd=23 PartO

Without `-annotators`, the log shows CoreNLP falling back to its bundled StanfordCoreNLP.properties, which loads at least lemma and ner in addition to tokenize, ssplit, and pos (the log is cut off after that). As a result, each token line now also carries Lemma and NamedEntityTag fields.

## Next step: Sentiment Analysis! To be continued...
<https://stanfordnlp.github.io/CoreNLP/>