## Instructions

We first downloaded and unzipped Stanford CoreNLP from <http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip>.

Then we dragged the pubmed folder into that same folder, together with the copied abstracts (so that you don't have to download them all again).

There are a few files you will need to make sure are present:

- `lexparser-gui.bat`
- `lexparser-gui.command`
- `lexparser-gui.sh`
- `lexparser-lang-train-test.sh`
- `lexparser-lang.sh`
- `lexparser.bat`
- `lexparser.sh`
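
A quick way to confirm that these scripts are present is sketched below. The helper name `missing_files` is an assumption made for illustration; it is not part of the CoreNLP distribution.

```python
import os

# File names taken from the list above.
REQUIRED_FILES = [
    "lexparser-gui.bat",
    "lexparser-gui.command",
    "lexparser-gui.sh",
    "lexparser-lang-train-test.sh",
    "lexparser-lang.sh",
    "lexparser.bat",
    "lexparser.sh",
]


def missing_files(folder, required=REQUIRED_FILES):
    """Return the names in `required` that are not regular files in `folder`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(folder, name))]
```

Calling `missing_files(".")` from inside the unzipped CoreNLP folder should return an empty list when everything is in place.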

You will also need to add the `edu` folder, which can be found here:
<https://www.dropbox.com/s/t9uk4z1xznpo0jz/jars.zip?dl=0>

Add the `.zip` extension to the `stanford-corenlp-3.8.0-models.jar` file and unzip it. Then copy the resulting `edu` folder and paste it into your home directory.
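
A `.jar` file is an ordinary zip archive, so the renaming step is only a convenience for desktop unzip tools. As a rough alternative, Python's `zipfile` module can read the jar directly. The function name `extract_edu_folder` and its arguments are assumptions for illustration.

```python
import zipfile


def extract_edu_folder(jar_path, dest_dir):
    """Extract every entry under `edu/` from the jar into `dest_dir`.

    Returns the list of archive entries that were extracted.
    """
    with zipfile.ZipFile(jar_path) as jar:
        # Keep only the model files; skip META-INF and anything else.
        members = [name for name in jar.namelist() if name.startswith("edu/")]
        jar.extractall(dest_dir, members=members)
    return members
```

For example, `extract_edu_folder("stanford-corenlp-3.8.0-models.jar", ".")` would create an `edu` folder in the current directory.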

# Parsing the PubMed Abstracts

In [1]:
import pubmed.utils as pb
import json
import re
from collections import defaultdict
from pprint import pprint
import string
# utf-8 support
import codecs
import nltk
# split abstracts into sentences
from nltk.tokenize import sent_tokenize
import ast

#pandas!
import pandas as pd
import numpy as np

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/lisabarcelo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
!curl "https://docs.google.com/spreadsheets/d/1CFBcf_vvv3G-ucgQA-vR9LgUVOFkhr5Qnun6k8XzkwE/pub?output=tsv" -o sent_files/acet_sent.tsv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  207k    0  207k    0     0   242k      0 --:--:-- --:--:-- --:--:--  243k


In [9]:
!head sent_files/acet_sent.tsv

PMID	label	compound	sentence			
1308788	neutral	dehydroepiandrosterone	1-Naphthol and estrone were extensively sulfated, whereas paracetamol and dehydroepiandrosterone were not good substrates for the pulmonary enzyme.			
15204697	neutral	1-chloro-2,4-dinitrobenzene	1: The metabolism by HepG2 cell from two sources (M1, M2) of 12 substrates is reported: ethoxyresorufin, ethoxycoumarin, testosterone, tolbutamide, chlorzoxazone, dextromethorphan, phenacetin, midazolam, acetaminophen, hydroxycoumarin, p-nitrophenol and 1-chloro-2,4-dinitrobenzene (CDNB), and a pharmaceutical compound, EMD68843.			
20732160	neutral	rotenone	45, 36-41), isopropanol, 1,4-dinitrophenol (DNP), diethylstilbestrol (DES), carbonylcyanide-m-chlorophenylhydrazone (CCCP), rotenone, paracetamol and acetyl salicylic acid (ASA) induced HSP synthesis after a 1-h incubation at a substance-specific concentration.			
18968415	neutral	persea americana	A biosensor based on vaseline/graphite modified with avocado tissu

In [10]:
%%writefile sent_files/for_SSA.py

for item in open("sent_files/acet_sent.tsv","r"):
    sent_id, sentiment, compound, sentence = item.split("\t",3)
    print sentence.encode('utf-8')

Writing sent_files/for_SSA.py


In [11]:
!python2 sent_files/for_SSA.py > sent_files/ssa_acetaminophen.txt

In [12]:
!head sent_files/ssa_acetaminophen.txt

sentence			

1-Naphthol and estrone were extensively sulfated, whereas paracetamol and dehydroepiandrosterone were not good substrates for the pulmonary enzyme.			

1: The metabolism by HepG2 cell from two sources (M1, M2) of 12 substrates is reported: ethoxyresorufin, ethoxycoumarin, testosterone, tolbutamide, chlorzoxazone, dextromethorphan, phenacetin, midazolam, acetaminophen, hydroxycoumarin, p-nitrophenol and 1-chloro-2,4-dinitrobenzene (CDNB), and a pharmaceutical compound, EMD68843.			

45, 36-41), isopropanol, 1,4-dinitrophenol (DNP), diethylstilbestrol (DES), carbonylcyanide-m-chlorophenylhydrazone (CCCP), rotenone, paracetamol and acetyl salicylic acid (ASA) induced HSP synthesis after a 1-h incubation at a substance-specific concentration.			

A biosensor based on vaseline/graphite modified with avocado tissue (Persea americana) as the source of polyphenol oxidase was developed and used for the chronoamperometric determination of paracetamol in pharmaceutical fo
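
The output above still contains the header row (`sentence`), trailing tab characters, and a blank line after every sentence, because each printed field keeps the trailing tabs and newline of its TSV row. The sketch below shows one way to avoid that. The function names `extract_sentences` and `write_sentences` are hypothetical, and the code assumes the four-column layout shown by `head` earlier.

```python
import io


def extract_sentences(lines, skip_header=True):
    """Yield the fourth tab-separated field of each row, stripped of whitespace."""
    for i, line in enumerate(lines):
        if skip_header and i == 0:
            continue  # first row holds the column names
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            continue  # malformed or empty row
        sentence = fields[3].strip()
        if sentence:
            yield sentence


def write_sentences(tsv_path, out_path):
    """Write one cleaned sentence per line, reading and writing UTF-8."""
    with io.open(tsv_path, encoding="utf-8") as src, \
            io.open(out_path, "w", encoding="utf-8") as dst:
        for sentence in extract_sentences(src):
            dst.write(sentence + u"\n")
```

Using `io.open` with an explicit encoding keeps the behaviour the same under Python 2 and Python 3, which sidesteps the implicit ASCII decoding that `str.encode` performs in Python 2.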

In [13]:
!chmod a+x lexparser.sh

## Testing with PCFG Model*

*Note that I changed the `lexparser.sh` file to allow for more memory.

In [14]:
!cat ./lexparser.sh

#!/usr/bin/env bash
#
# Runs the English PCFG parser on one or more files, printing trees only

if [ ! $# -ge 1 ]; then
  echo Usage: `basename $0` 'file(s)'
  echo
  exit
fi

scriptdir=`dirname $0`

java -mx500m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
 -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz $*


In [15]:
# %%timeit
# ! ./lexparser.sh  sent_files/ssa_acetaminophen.txt

## Testing with RNN Model

In [16]:
!cat ./lexparser_rnn.sh

#!/usr/bin/env bash
#
# Runs the English PCFG parser on one or more files, printing trees only

if [ ! $# -ge 1 ]; then
  echo Usage: `basename $0` 'file(s)'
  echo
  exit
fi

scriptdir=`dirname $0`

java -mx500m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
 -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz $*


In [17]:
# %%timeit
# ! ./lexparser_rnn.sh  sent_files/ssa_acetaminophen.txt

## Testing with Caseless PCFG Model

In [18]:
!cat ./lexparser_caseless.sh

#!/usr/bin/env bash
#
# Runs the English PCFG parser on one or more files, printing trees only

if [ ! $# -ge 1 ]; then
  echo Usage: `basename $0` 'file(s)'
  echo
  exit
fi

scriptdir=`dirname $0`

java -mx500m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
 -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz $*


In [19]:
# %%timeit
# ! ./lexparser_caseless.sh  sent_files/ssa_acetaminophen.txt
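
The wrapper scripts above differ only in the model file they pass to the parser, so a single parameterised helper could stand in for all of them. The sketch below only builds the `java` argument list. The function name `lexparser_command`, its parameters, and the `2g` memory value in the usage note are assumptions for illustration; the class name, flags, and model paths are copied from the scripts shown above.

```python
import os

PARSER_CLASS = "edu.stanford.nlp.parser.lexparser.LexicalizedParser"
MODEL_DIR = "edu/stanford/nlp/models/lexparser"


def lexparser_command(files, model="englishPCFG.ser.gz", memory="500m",
                      script_dir="."):
    """Build the java command line that the wrapper scripts above run."""
    # Mirrors "$scriptdir/*:" from the shell scripts; java expands the
    # classpath wildcard itself, so no shell globbing is needed.
    classpath = os.path.join(script_dir, "*") + os.pathsep
    return ["java", "-mx" + memory, "-cp", classpath, PARSER_CLASS,
            "-outputFormat", "penn,typedDependencies",
            MODEL_DIR + "/" + model] + list(files)
```

The result can be passed to `subprocess.check_call`, for example `lexparser_command(["sent_files/ssa_acetaminophen.txt"], model="englishPCFG.caseless.ser.gz", memory="2g")`.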

## Command Line Sentiment Analysis

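Running the CoreNLP pipeline with the `sentiment` annotator creates an output file (named after the input file, with `.out` appended) that lists the per-token annotations and a sentiment label for each sentence.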

In [28]:
!head sent_files/ssa_acetaminophen.txt

sentence			

1-Naphthol and estrone were extensively sulfated, whereas paracetamol and dehydroepiandrosterone were not good substrates for the pulmonary enzyme.			

1: The metabolism by HepG2 cell from two sources (M1, M2) of 12 substrates is reported: ethoxyresorufin, ethoxycoumarin, testosterone, tolbutamide, chlorzoxazone, dextromethorphan, phenacetin, midazolam, acetaminophen, hydroxycoumarin, p-nitrophenol and 1-chloro-2,4-dinitrobenzene (CDNB), and a pharmaceutical compound, EMD68843.			

45, 36-41), isopropanol, 1,4-dinitrophenol (DNP), diethylstilbestrol (DES), carbonylcyanide-m-chlorophenylhydrazone (CCCP), rotenone, paracetamol and acetyl salicylic acid (ASA) induced HSP synthesis after a 1-h incubation at a substance-specific concentration.			

A biosensor based on vaseline/graphite modified with avocado tissue (Persea americana) as the source of polyphenol oxidase was developed and used for the chronoamperometric determination of paracetamol in pharmaceutical fo

In [21]:
!java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref,sentiment \
-file sent_files/ssa_acetaminophen.txt \
-outputFormat text

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.9 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [2.5 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7cla

In [25]:
!cat parsing/ssa_acetaminophen.txt.out > sent_files/output_file.json

cat: parsing/ssa_acetaminophen.txt.out: No such file or directory


In [12]:
!head output_file.json

Sentence #1 (42 tokens, sentiment: Negative):
clinically stable patients who underwent des implantation 12 months previously and received aspirin monotherapy were randomly assigned to receive either high-intensity (40mg atorvastatin, n = 1000) or low-intensity (20mg pravastatin, n = 1000) statin treatment.
[Text=clinically CharacterOffsetBegin=0 CharacterOffsetEnd=10 PartOfSpeech=RB Lemma=clinically NamedEntityTag=O SentimentClass=Neutral]
[Text=stable CharacterOffsetBegin=11 CharacterOffsetEnd=17 PartOfSpeech=JJ Lemma=stable NamedEntityTag=O SentimentClass=Neutral]
[Text=patients CharacterOffsetBegin=18 CharacterOffsetEnd=26 PartOfSpeech=NNS Lemma=patient NamedEntityTag=O SentimentClass=Neutral]
[Text=who CharacterOffsetBegin=27 CharacterOffsetEnd=30 PartOfSpeech=WP Lemma=who NamedEntityTag=O SentimentClass=Neutral]
[Text=underwent CharacterOffsetBegin=31 CharacterOffsetEnd=40 PartOfSpeech=VBD Lemma=undergo NamedEntityTag=O SentimentClass=Neutral]
[Text=des CharacterOffsetBegin

## Match with label

## Creating Dictionaries

In [31]:
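co_occurrence_dict = defaultdict(list)

# Map (sentiment, sentence id) -> lowercased tokens of that sentence.
with open('KeywordSentences.txt.out') as f:
    for line in f:
        if 'Sentence #' in line:
            # Header line, e.g. "Sentence #1 (42 tokens, sentiment: Negative):"
            sentence = line.strip('\n').split(" ")[1]
            sentiment = line.strip('\n').split(":")[1]
            # Drop punctuation and digits; the leading space is kept (' Negative').
            sentiment = re.sub(r"\d+", "", re.sub(r'[^\w\s]', '', sentiment))
        elif line[0:5] == '[Text':
            # Token line, e.g. "[Text=stable ... PartOfSpeech=JJ ...]"
            word = line.split("=")[1].split(" ")[0].lower()
            pos = line.split("=")[4].split(" ")[0]  # POS tag (not used below)
            co_occurrence_dict[(sentiment, sentence)].append(word)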
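
The `split`-based parsing above depends on field positions and leaves a leading space in the sentiment label. A regular-expression version is sketched below as a possible alternative. The names `parse_header` and `parse_token` are hypothetical, and the patterns assume the text output format shown by `head output_file.json` earlier.

```python
import re

# Matches e.g. "Sentence #1 (42 tokens, sentiment: Negative):"
HEADER_RE = re.compile(r"^Sentence #(\d+) \((\d+) tokens, sentiment: ([^)]+)\):")
# Matches key=value pairs; a value runs until the next " key=" or end of line,
# so multi-word values such as "Very negative" stay intact.
FIELD_RE = re.compile(r"(\w+)=(.*?)(?= \w+=|$)")


def parse_header(line):
    """Return (sentence number, token count, sentiment) or None."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3)


def parse_token(line):
    """Return a dict of the fields on a complete "[Text=... ]" line, or None."""
    line = line.strip()
    if not (line.startswith("[Text=") and line.endswith("]")):
        return None
    return dict(FIELD_RE.findall(line[1:-1]))
```

With these helpers, the part-of-speech tag is available as `parse_token(line)["PartOfSpeech"]` rather than by counting `=` signs.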

In [32]:
#Example of one of the entries in the dictionary

from itertools import islice
import pprint
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

n_items = take(1, co_occurrence_dict.iteritems())
pp = pprint.PrettyPrinter(depth=6)
pp.pprint(n_items)

[((' Neutral', '#248'),
  ['specifically',
   ',',
   'one',
   'needs',
   'to',
   '-lrb-',
   'i',
   '-rrb-',
   'recognize',
   'the',
   'types',
   'of',
   'biochemical',
   'events',
   'that',
   'change',
   'isotopic',
   'enrichments',
   ',',
   '-lrb-',
   'ii',
   '-rrb-',
   'appreciate',
   'the',
   'distinction',
   'between',
   'fractional',
   'turnover',
   'and',
   'flux',
   'rate',
   'and',
   '-lrb-',
   'iii',
   '-rrb-',
   'be',
   'aware',
   'of',
   'the',
   'subtle',
   'differences',
   'between',
   'tracer',
   'kinetics',
   'and',
   'pharmacokinetics',
   '.'])]


In [22]:
# for k,v in co_occurrence_dict.iteritems():
#     print v

In [85]:
# h = {('-lsb-', 'statin'): [' Very negative',
#               ' Very negative',
#               ' Negative',
#               ' Negative']}

# for k,v in h.iteritems():
#     for vv in v:
#         print k, vv, v.count(vv)

('-lsb-', 'statin')  Very negative 2
('-lsb-', 'statin')  Very negative 2
('-lsb-', 'statin')  Negative 2
('-lsb-', 'statin')  Negative 2


In [118]:
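sent_dict = defaultdict(list)
words_dict = defaultdict(list)


#Assuming these are the key words we are curious about.
#(Not used below; the keyword is hard-coded in the loop.)
search_phrase = ['australian','statin']

#Searching through the dictionary
for k,v in co_occurrence_dict.iteritems():
    # Only sentences that mention the keyword contribute co-occurrences.
    if "acetaminophen" in v:
        for vv in v:
            # Key: (word, keyword); value: sentiment of each sentence they share.
            words_dict[(vv, "acetaminophen")].append(k[0])

# Count how often each sentiment label occurs for each (word, keyword) pair.
for k, v in words_dict.iteritems():
    for vv in set(v):
        if "-" not in vv:
            sent_dict[k].append({vv: v.count(vv)})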
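
The counting step above can also be written with `collections.Counter`, which avoids calling `list.count` once per label. This is a minimal sketch under stated assumptions: the function name `sentiment_counts` is hypothetical, the input is a dictionary shaped like `co_occurrence_dict`, and, unlike the cell above, the labels are stripped of their leading space.

```python
from collections import Counter, defaultdict


def sentiment_counts(co_occurrence, keyword):
    """Map each word that shares a sentence with `keyword` to a Counter
    of the sentiment labels of those sentences."""
    counts = defaultdict(Counter)
    for (sentiment, _sentence_id), words in co_occurrence.items():
        if keyword not in words:
            continue  # sentence does not mention the keyword
        for word in words:
            counts[word][sentiment.strip()] += 1
    return counts
```

For example, `sentiment_counts(co_occurrence_dict, "acetaminophen")["patients"]` would give the label counts for sentences in which both words occur. Using `.items()` keeps the sketch runnable under both Python 2 and Python 3.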

In [120]:
for k, v in sent_dict.iteritems():
    for vv in v:
        print k, vv

('connective', 'statin') {' Negative': 1}
('-rrb-', 'statin') {' Negative': 67}
('-rrb-', 'statin') {' Very negative': 20}
('0.029', 'statin') {' Negative': 1}
('particles', 'statin') {' Negative': 2}
('factor-15', 'statin') {' Negative': 1}
('consensus', 'statin') {' Negative': 1}
('hypertension', 'statin') {' Negative': 3}
('change', 'statin') {' Negative': 3}
('selected', 'statin') {' Negative': 1}
('might', 'statin') {' Negative': 1}
('v', 'statin') {' Negative': 1}
('considered', 'statin') {' Negative': 3}
('prediagnostic', 'statin') {' Negative': 1}
('low-dose', 'statin') {' Negative': 1}
('0.4', 'statin') {' Negative': 1}
('guideline-based', 'statin') {' Negative': 1}
('rural', 'statin') {' Negative': 1}
('decreased', 'statin') {' Negative': 2}
('maximal-tolerated', 'statin') {' Very negative': 1}
('bodies', 'statin') {' Negative': 1}
('men', 'statin') {' Negative': 1}
('pitavastatin', 'statin') {' Negative': 1}
('homologous', 'statin') {' Negative': 1}
('patients', 'statin') {'