## Instructions

We first downloaded the files from <http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip>.

Then we moved the `pubmed` folder into that same folder, together with the already-downloaded abstracts (so you don't have to download them all again).

There are a few files you will need to make sure are present:

`lexparser-gui.bat
lexparser-gui.command
lexparser-gui.sh
lexparser-lang-train-test.sh
lexparser-lang.sh
lexparser.bat
lexparser.sh`
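
A quick way to confirm the files are in place is a small check like the sketch below. The file names are copied from the list above; the directory name is only a guess at where the download was unzipped, so adjust it to your setup.

```python
import os

# File names copied from the list above.
REQUIRED_FILES = [
    "lexparser-gui.bat",
    "lexparser-gui.command",
    "lexparser-gui.sh",
    "lexparser-lang-train-test.sh",
    "lexparser-lang.sh",
    "lexparser.bat",
    "lexparser.sh",
]


def missing_files(directory, required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(directory, name))]


# Hypothetical location of the unzipped download; adjust to your setup.
corenlp_dir = "stanford-corenlp-full-2017-06-09"
if os.path.isdir(corenlp_dir):
    print("Missing: %s" % missing_files(corenlp_dir))
```

An empty list means every file was found.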

You will also need the `edu` folder, which can be found here:
<https://www.dropbox.com/s/t9uk4z1xznpo0jz/jars.zip?dl=0>

Add the `.zip` extension to the `stanford-corenlp-3.8.0-models.jar` file and unzip it. Then copy the resulting `edu` folder into your home directory.
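
Since a `.jar` file is a zip archive, the rename-and-unzip step can also be done from Python. This is a minimal sketch, not part of the original setup; `extract_edu_folder` is a hypothetical helper name and the destination is whatever directory you want the `edu` folder to end up in.

```python
import zipfile


def extract_edu_folder(jar_path, destination):
    """Extract only the entries under edu/ from a .jar (zip) archive."""
    with zipfile.ZipFile(jar_path) as archive:
        members = [name for name in archive.namelist()
                   if name.startswith("edu/")]
        archive.extractall(destination, members=members)
    return members
```

For example, `extract_edu_folder("stanford-corenlp-3.8.0-models.jar", ".")` would place `edu` in the current directory.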

# Parsing the PubMed Abstracts

In [6]:
import pubmed.utils as pb
import json
import re
from collections import defaultdict
from pprint import pprint
import string
# utf-8 support
import codecs
import nltk
# split abstracts into sentences
from nltk.tokenize import sent_tokenize
import ast

In [9]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/barcelise/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
#search_term = 'ACE inhibitor'
search_term = 'statin'
max_results = 100
query = pb.PubMedQuery(search_term, max_results)

In [7]:
ids = query.id_getter()
print "Num abstracts: ", len(ids)

NameError: name 'query' is not defined

## Download just a few of the abstracts

In [10]:
%%time
# Download only a few abstracts.
# If you don't want to download all abstracts, use abstract_getter() as below
# rather than the download_all_abstracts() method.
pb.PubMedQuery.COUNT = 0
max_results = 100
query = pb.PubMedQuery(search_term, max_results)
ids = query.id_getter()
abstracts = query.abstract_getter(ids)

CPU times: user 316 ms, sys: 8 ms, total: 324 ms
Wall time: 1.31 s


## Download a TON of the abstracts

In [6]:
# %%time
# pb.PubMedQuery.COUNT = 0
# max_results = 100
# full_query = pb.download_all_abstracts(search_term, max_results)
# ids = full_query.id_getter()
# abstracts = full_query.abstract_getter(ids)

## Save all abstracts to JSON file

In [5]:
len(ids)

NameError: name 'ids' is not defined

In [13]:
json_file = 'statins_abstracts.json'
#json_file = 'more_abstracts.json'
print 'Saving to ' + json_file
with codecs.open(json_file, 'w','utf-8') as outfile:
    json.dump(abstracts, outfile, indent=4)

Saving to statins_abstracts.json


## Filtering sentences by keyword

We might not need every sentence in each abstract, so I am testing what happens if we pull out only the sentences that contain keywords, write them to a text file, and then parse that file.

Keyword selection could happen here in the sentence tokenizer.
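
As a rough illustration of filtering at the sentence level, the sketch below keeps only the sentences that mention a keyword. It is an assumption-laden example: `naive_sent_tokenize` is a crude stand-in so the snippet runs without NLTK data, `keyword_sentences` is a hypothetical helper, and the example abstracts are made up. In the notebook, `sent_tokenize` could be passed as the `tokenizer` argument instead.

```python
import re


def naive_sent_tokenize(text):
    """Very rough stand-in for nltk's sent_tokenize: split after ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def keyword_sentences(abstracts, keyword, tokenizer=naive_sent_tokenize):
    """Return the sentences, across all abstracts, that mention `keyword`."""
    keyword = keyword.lower()
    selected = []
    for abstract in abstracts.values():
        for sentence in tokenizer(abstract):
            # compare in lower case so 'Statin' and 'statin' both match
            if keyword in sentence.lower():
                selected.append(sentence)
    return selected


# Made-up abstracts in a dict, similar in shape to the JSON file.
example = {
    "1": "Statin therapy lowered LDL. Diet was not recorded.",
    "2": "No lipid drugs were used.",
}
print(keyword_sentences(example, "statin"))
```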

In [3]:
sentences = []
with codecs.open('statins_pbabstract/statins_abstracts.json','r','utf-8') as data_file:    
    data = json.load(data_file)
    for abstract in data.itervalues():
        sentences.append(sent_tokenize(abstract))
    # flatten the list of abstracts into one long list of sentences
    sentences = [sent for s in sentences for sent in s]
    print "Sentences: ",len(sentences)

Sentences:  970


In [21]:
# keyword = 'statin'
# key_sentences = []
# transformed_sentences = pb.ace_substitutor(sentences, 'ACEI')
# for sent in transformed_sentences:
#     if keyword in sent:
#         key_sentences.append(sent)

In [4]:
%%writefile KeywordSentences.py
import json
import codecs
import pubmed.utils as pb

# change the keyword (e.g. to 'ACE') to filter for a different drug class
keyword = 'statin'
with codecs.open('statins_pbabstract/statins_abstracts.json','r','utf-8') as data_file:    
    data = json.load(data_file)
    # pick the abstracts that mention the keyword
    for abstract in data.itervalues():
        transformed_abstract = pb.ace_substitutor(abstract, 'statin') 
        if keyword in transformed_abstract:
            print transformed_abstract.encode('utf-8') + '\n'

Overwriting KeywordSentences.py


In [5]:
!python2 KeywordSentences.py > KeywordSentences.txt

In [6]:
#This mimics the format that the example parser file has
!head KeywordSentences.txt

clinically stable patients who underwent des implantation 12 months previously and received aspirin monotherapy were randomly assigned to receive either high-intensity (40mg atorvastatin, n = 1000) or low-intensity (20mg pravastatin, n = 1000) statin treatment. the primary endpoint was adverse clinical events at 12-month follow-up (a composite of all death, myocardial infarction, revascularization, stent thrombosis, stroke, renal deterioration, intervention for peripheral artery disease, and admission for cardiac events).

the primary endpoint at 12-month follow-up occurred in 25 patients (2.5%) receiving high-intensity statin treatment and in 40 patients (4.1%) receiving low-intensity statin treatment (hr, 0.58; 95%ci, 0.36-0.92; p = .018). this difference was mainly driven by a lower rate of cardiac death (0 vs 0.4%, p = .025) and nontarget vessel myocardial infarction (0.1 vs 0.7%, p = .033) in the high-intensity statin treatment group.

patients with heterozygous familial hyper

In [7]:
!chmod a+x lexparser.sh

# Testing with PCFG Model*

*Note: I changed the `lexparser.sh` file to allow for more memory.

In [1]:
!cat ./lexparser.sh

#!/usr/bin/env bash
#
# Runs the English PCFG parser on one or more files, printing trees only

if [ ! $# -ge 1 ]; then
  echo Usage: `basename $0` 'file(s)'
  echo
  exit
fi

scriptdir=`dirname $0`

java -mx500m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
 -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz $*


In [2]:
%%timeit
! ./lexparser.sh  data/FilteredSentences_Statin_commonFood_clean.txt

[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.8 sec].
Parsing file: data/FilteredSentences_Statin_commonFood_clean.txt
Parsing [sent. 1 len. 21]: -LSB- u ` rice ' -RSB- 18 samples of red yeast rice powder and 18 samples of lovastatin were collected .
(ROOT
  (S
    (S
      (NP (JJ -LSB-) (NN u) (`` `) (NN rice) ('' '))
      (VP (VBZ -RSB-)
        (NP
          (NP (CD 18) (NNS samples))
          (PP (IN of)
            (NP (JJ red) (NN yeast) (NN rice) (NN powder))))))
    (CC and)
    (S
      (NP
        (NP (CD 18) (NNS samples))
        (PP (IN of)
          (NP (NN lovastatin))))
      (VP (VBD were)
        (VP (VBN collected))))
    (. .)))

amod(rice-4, -LSB--1)
compound(rice-4, u-2)
nsubj(-RSB--6, rice-4)
root(ROOT-0, -RSB--6)
nummod(samples-8, 18-7)
dobj(-RSB--6, samples-8)
case(powder-13, of-9)
amod(powder-13, red-10)
compound(powder-13, yeast-11)
comp

## Testing with RNN Model

In [None]:
!cat ./lexparser_rnn.sh

In [24]:
%%timeit
! ./lexparser_rnn.sh  KeywordSentences.txt

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Parsing file: KeywordSentences.txt
Parsing [sent. 1 len. 52]: RAAS , a major pharmacological target in cardiovascular medicine , is inhibited by pharmacological classes including angiotensin converting enzyme -LRB- ACE -RRB- inhibitors -LRB- ACEIs -RRB- , angiotensin-II type 1 blockers -LRB- ARBs -RRB- and aldosterone receptors antagonists , in addition to the recently introduced direct renin inhibitors -LRB- DRIs -RRB- .
(ROOT
  (S
    (NP
      (NP (NNS RAAS))
      (, ,)
      (NP
        (NP (DT a) (JJ major) (JJ pharmacological) (NN target))
        (PP (IN in)
          (NP (JJ cardiovascular) (NN medicine))))
      (, ,))
    (VP (VBZ is)
      (ADJP (JJ inhibited)
        (PP (IN by)
          (NP
            (NP (JJ pharmacological) (NNS classes))
            (PP (VBG in

## Testing with Caseless PCFG Model

In [25]:
!cat ./lexparser_caseless.sh

#!/usr/bin/env bash
#
# Runs the English PCFG parser on one or more files, printing trees only

if [ ! $# -ge 1 ]; then
  echo Usage: `basename $0` 'file(s)'
  echo
  exit
fi

scriptdir=`dirname $0`

java -mx500m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
 -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz $*


In [26]:
%%timeit
! ./lexparser_caseless.sh  KeywordSentences.txt

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Parsing file: KeywordSentences.txt
Parsing [sent. 1 len. 52]: RAAS , a major pharmacological target in cardiovascular medicine , is inhibited by pharmacological classes including angiotensin converting enzyme -LRB- ACE -RRB- inhibitors -LRB- ACEIs -RRB- , angiotensin-II type 1 blockers -LRB- ARBs -RRB- and aldosterone receptors antagonists , in addition to the recently introduced direct renin inhibitors -LRB- DRIs -RRB- .
(ROOT
  (S
    (NP
      (NP (NNS RAAS))
      (, ,)
      (NP
        (NP (DT a) (JJ major) (JJ pharmacological) (NN target))
        (PP (IN in)
          (NP (JJ cardiovascular) (NN medicine))))
      (, ,))
    (VP (VBZ is)
      (ADJP (JJ inhibited)
        (PP (IN by)
          (NP
            (NP (JJ pharmacological) (NNS classes))
            (PP (VBG in

## Command Line Sentiment Analysis

This creates an output file with tuples and sentiments. We can change which annotators we use; I think `sentiment` and `ner` (plus the annotators they depend on) may be all we need.
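
To make it easier to try different annotator lists, the command line can be assembled from a Python list. This is only a sketch: `corenlp_command` is a hypothetical helper, the flags are the ones used in this notebook's own command, and the reduced list assumes that `dcoref` can be dropped while `parse` stays because, as far as I know, `sentiment` relies on the parse trees.

```python
def corenlp_command(input_file, annotators, heap="2g"):
    """Build the StanfordCoreNLP command line as an argument list."""
    return [
        "java", "-cp", "*", "-Xmx" + heap,
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", ",".join(annotators),
        "-file", input_file,
        "-outputFormat", "text",
    ]


full = ["tokenize", "ssplit", "pos", "lemma", "ner", "parse", "dcoref", "sentiment"]
# A shorter list to experiment with (assumption: dcoref is not needed here).
reduced = ["tokenize", "ssplit", "pos", "lemma", "ner", "parse", "sentiment"]
print(" ".join(corenlp_command("KeywordSentences.txt", reduced)))
```

The returned list could be handed to `subprocess.call` from inside the CoreNLP folder.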

In [6]:
!head KeywordSentences.txt

clinically stable patients who underwent des implantation 12 months previously and received aspirin monotherapy were randomly assigned to receive either high-intensity (40mg atorvastatin, n = 1000) or low-intensity (20mg pravastatin, n = 1000) statin treatment. the primary endpoint was adverse clinical events at 12-month follow-up (a composite of all death, myocardial infarction, revascularization, stent thrombosis, stroke, renal deterioration, intervention for peripheral artery disease, and admission for cardiac events).

the primary endpoint at 12-month follow-up occurred in 25 patients (2.5%) receiving high-intensity statin treatment and in 40 patients (4.1%) receiving low-intensity statin treatment (hr, 0.58; 95%ci, 0.36-0.92; p = .018). this difference was mainly driven by a lower rate of cardiac death (0 vs 0.4%, p = .025) and nontarget vessel myocardial infarction (0.1 vs 0.7%, p = .033) in the high-intensity statin treatment group.

patients with heterozygous familial hyper

In [2]:
!java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref,sentiment \
-file KeywordSentences.txt \
-outputFormat text

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.3 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.9 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7cla

In [11]:
!cat KeywordSentences.txt.out > output_file.json

In [12]:
!head output_file.json

Sentence #1 (42 tokens, sentiment: Negative):
clinically stable patients who underwent des implantation 12 months previously and received aspirin monotherapy were randomly assigned to receive either high-intensity (40mg atorvastatin, n = 1000) or low-intensity (20mg pravastatin, n = 1000) statin treatment.
[Text=clinically CharacterOffsetBegin=0 CharacterOffsetEnd=10 PartOfSpeech=RB Lemma=clinically NamedEntityTag=O SentimentClass=Neutral]
[Text=stable CharacterOffsetBegin=11 CharacterOffsetEnd=17 PartOfSpeech=JJ Lemma=stable NamedEntityTag=O SentimentClass=Neutral]
[Text=patients CharacterOffsetBegin=18 CharacterOffsetEnd=26 PartOfSpeech=NNS Lemma=patient NamedEntityTag=O SentimentClass=Neutral]
[Text=who CharacterOffsetBegin=27 CharacterOffsetEnd=30 PartOfSpeech=WP Lemma=who NamedEntityTag=O SentimentClass=Neutral]
[Text=underwent CharacterOffsetBegin=31 CharacterOffsetEnd=40 PartOfSpeech=VBD Lemma=undergo NamedEntityTag=O SentimentClass=Neutral]
[Text=des CharacterOffsetBegin

## Sentences of note:

`Plasma concentrations of AII were significantly decreased by captopril and increased by losartan`.

`nmod:of(concentrations-2, AII-4)`

`nmod:agent(decreased-7, captopril-9)`

`nmod:by(increased-11, losartan-13)`
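
Typed-dependency lines like these can be turned into structured records with a regular expression. The sketch below is one possible approach, with hypothetical names (`Dependency`, `parse_dependency`); it handles tokens that contain hyphens themselves, such as `-RSB-`, and returns `None` for lines that do not fit the pattern.

```python
import re
from collections import namedtuple

Dependency = namedtuple(
    "Dependency", "relation governor governor_index dependent dependent_index")

# relation(governor-INDEX, dependent-INDEX); tokens may themselves contain '-'
DEP_PATTERN = re.compile(r"^(\S+?)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dependency(line):
    """Parse one typed-dependency line, or return None if it does not match."""
    match = DEP_PATTERN.match(line.strip())
    if match is None:
        return None
    relation, governor, gov_idx, dependent, dep_idx = match.groups()
    return Dependency(relation, governor, int(gov_idx), dependent, int(dep_idx))


dep = parse_dependency("nmod:agent(decreased-7, captopril-9)")
print(dep.relation + ": " + dep.governor + " <- " + dep.dependent)
```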


## Experimenting with Selected Abstracts

In [13]:
# From <https://www.ncbi.nlm.nih.gov/pubmed/28656517>

# !echo "Renin-angiotensin-aldosterone system (RAAS) antagonists, including angiotensin-converting enzyme inhibitors (ACEI), angiotensin receptor blockers (ARB), and mineralocorticoid receptor antagonists (MRA) decrease mortality and morbidity in heart failure but increase the risk of hyperkalemia, especially when used in combination. Prevention of hyperkalemia and its associated complications requires careful patient selection, counseling regarding dietary potassium intake, awareness of drug interactions, and regular laboratory surveillance. Recent data suggests that the risk of hyperkalemia may be further moderated through the use of combined angiotensin-neprilysin inhibitors, novel MRAs, and novel potassium binding agents. Clinicians should be mindful of the risk of hyperkalemia when prescribing RAAS inhibitors to patients with heart failure. In patients at highest risk, such as those with diabetes, the elderly, and advanced chronic kidney disease, more intensive laboratory surveillance of potassium and creatinine may be required. Novel therapies hold promise for reducing the risk of hyperkalemia and enhancing the tolerability of RAAS antagonists." > sample_abstract.txt

In [14]:
# !head sample_abstract.txt

In [15]:
# !java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref,sentiment -file sample_abstract.txt -outputFormat text

In [16]:
# !cat sample_abstract.txt.out

Our takeaway sentence is that "ACE Inhibitors (ACEI) [...] decrease mortality and morbidity in heart failure but increase the risk of hyperkalemia...".

`amod(inhibitors-11, angiotensin-converting-9)`

`dobj(increase-38, risk-40)`

`nmod:of(risk-40, hyperkalemia-42)`


## Experimenting with specific drug names instead of a drug class

It didn't look like we had many interactions between drugs and molecules, so I wonder what happens if we change our search terms a little to include the drug names. The search terms are not case sensitive.

`Drugs in class: Captopril, Lisinopril, Ramipril, Benazepril`
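
The PubMed search may ignore case, but a plain Python `in` check does not, so any local filtering on drug names needs to handle case itself. A minimal sketch, with `drugs_mentioned` as a hypothetical helper that matches whole words only:

```python
import re

DRUG_NAMES = ["Captopril", "Lisinopril", "Ramipril", "Benazepril"]


def drugs_mentioned(text, names=DRUG_NAMES):
    """Return the drug names that occur in `text` as whole words, ignoring case."""
    found = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


sentence = ("Plasma concentrations of AII were significantly decreased "
            "by captopril and increased by losartan")
print(drugs_mentioned(sentence))
```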

In [17]:
# search_terms = ['Captopril', 'Lisinopril','Ramipril','Benazepril']
# max_results = 100
# ids2 = []
# for term in search_terms:
#     query2 = pb.PubMedQuery(term, max_results)
#     # collect the ids for every drug, not just the last one
#     ids2.extend(query2.id_getter())
#
# print "Num abstracts: ", len(ids2)
#
# abstracts2 = query2.abstract_getter(ids2)

# json_file2 = 'more_abstracts2.json'
# print 'Saving to ' + json_file2
# with open(json_file2, 'w') as outfile:
#     json.dump(abstracts2, outfile, indent=4)
    
    
# sentences2 = []
# with open('more_abstracts2.json') as data_file:    
#     data = json.load(data_file)
#     for abstract in data.itervalues():
#         sentences2.append(sent_tokenize(abstract))
#     # flatten the list of abstracts into one long list of sentences
#     sentences2 = [sent for s in sentences2 for sent in s]
#     print "Sentences: ",len(sentences2)

In [18]:
# Just looking at some sentences
# sentences2[:5]

In [19]:
# %%writefile KeywordSentences2.py
# import json

# # drug names to look for
# keywords = ['Captopril', 'Lisinopril','Ramipril','Benazepril']
# with open('more_abstracts2.json') as data_file:
#     data = json.load(data_file)
#     # print the abstracts that mention any of the drug names
#     for abstract in data.itervalues():
#         if any(k in abstract for k in keywords):
#             print abstract.encode('utf-8') + '\n'
            

In [20]:
# !python KeywordSentences2.py > KeywordSentences2.txt

In [21]:
# !cat KeywordSentences2.txt

## Experimenting with Sentiment Analysis

In [22]:
# !java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref,sentiment -file KeywordSentences2.txt -outputFormat text

In [23]:
# !cat KeywordSentences2.txt.out

## Sentiment Experiment with 50 Sentences

In [24]:
# !head -50 data/FilteredSentences.txt > data/Filtered50.txt

In [25]:
# %%timeit
# !java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref,sentiment -file data/Filtered50.txt -outputFormat text

In [26]:
# !head Filtered50.txt.out

## Creating Dictionaries

In [11]:
co_occurrence_dict = defaultdict(list)

# Group the tokens of each sentence under a (sentiment, sentence number) key
with open('KeywordSentences.txt.out') as parsed_file:
    for line in parsed_file:
        if 'Sentence #' in line:
            # header looks like "Sentence #1 (42 tokens, sentiment: Negative):"
            sentence = line.strip('\n').split(" ")[1]
            sentiment = line.strip('\n').split(":")[1]
            # drop punctuation and digits; the leading space is kept (' Negative')
            sentiment = re.sub(r"\d+", "", re.sub(r'[^\w\s]', '', sentiment))
        elif line[0:5] == '[Text':
            word = line.split("=")[1].split(" ")[0].lower()
            co_occurrence_dict[(sentiment, sentence)].append(word)
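
The same header and token lines could also be parsed with regular expressions instead of positional `split` calls, which avoids relying on the order of the fields. This is a sketch under the assumption that the lines follow the format shown in the output earlier; `parse_header` and `parse_token` are hypothetical helper names.

```python
import re

HEADER = re.compile(r"^Sentence #(\d+) \((\d+) tokens, sentiment: ([^)]+)\):")
# a field runs until the next " Name=" or the end of the line
FIELD = re.compile(r"(\w+)=(.*?)(?= \w+=|$)")


def parse_header(line):
    """Return (sentence number, token count, sentiment) or None."""
    match = HEADER.match(line)
    if match is None:
        return None
    number, tokens, sentiment = match.groups()
    return int(number), int(tokens), sentiment


def parse_token(line):
    """Turn a '[Text=... PartOfSpeech=...]' line into a dict of its fields."""
    return dict(FIELD.findall(line.strip().strip("[]")))


print(parse_header("Sentence #1 (42 tokens, sentiment: Negative):"))
```

With this, the part of speech is available as `parse_token(line)["PartOfSpeech"]`.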

In [12]:
#Example of one of the entries in the dictionary

from itertools import islice
import pprint
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

n_items = take(1, co_occurrence_dict.iteritems())
pp = pprint.PrettyPrinter(depth=6)
pp.pprint(n_items)

[((' Neutral', '#248'),
  ['specifically',
   ',',
   'one',
   'needs',
   'to',
   '-lrb-',
   'i',
   '-rrb-',
   'recognize',
   'the',
   'types',
   'of',
   'biochemical',
   'events',
   'that',
   'change',
   'isotopic',
   'enrichments',
   ',',
   '-lrb-',
   'ii',
   '-rrb-',
   'appreciate',
   'the',
   'distinction',
   'between',
   'fractional',
   'turnover',
   'and',
   'flux',
   'rate',
   'and',
   '-lrb-',
   'iii',
   '-rrb-',
   'be',
   'aware',
   'of',
   'the',
   'subtle',
   'differences',
   'between',
   'tracer',
   'kinetics',
   'and',
   'pharmacokinetics',
   '.'])]


In [13]:
for k,v in co_occurrence_dict.iteritems():
    print v

['specifically', ',', 'one', 'needs', 'to', '-lrb-', 'i', '-rrb-', 'recognize', 'the', 'types', 'of', 'biochemical', 'events', 'that', 'change', 'isotopic', 'enrichments', ',', '-lrb-', 'ii', '-rrb-', 'appreciate', 'the', 'distinction', 'between', 'fractional', 'turnover', 'and', 'flux', 'rate', 'and', '-lrb-', 'iii', '-rrb-', 'be', 'aware', 'of', 'the', 'subtle', 'differences', 'between', 'tracer', 'kinetics', 'and', 'pharmacokinetics', '.']
['the', 'final', 'sample', 'consisted', 'of', '13,947', 'individuals', '-lrb-', '48', '\xc2\xb1', '6', 'years', ',', '71', '%', 'men', '-rrb-', '.']
['nevertheless', ',', 'the', 'potential', 'risk', 'of', 'an', 'adverse', 'event', 'occurring', 'must', 'be', 'considered', 'when', 'prescribing', 'and', 'monitoring', 'statin', 'therapy', 'to', 'individual', 'patients', '.']
['abiraterone', 'acetate', '-lrb-', 'aa', '-rrb-', 'may', 'also', 'undergo', 'slco-mediated', 'transport', '.']
['the', 'primary', 'endpoint', 'at', '12-month', 'follow-up', 'occu

In [84]:
h = {('-lsb-', 'statin'): [' Very negative',
              ' Very negative',
              ' Negative',
              ' Negative']}

hh = defaultdict(list)

for k,v in h.iteritems():
    for vv in v:
        hh[k].append({vv: v.count(vv)})
hh

defaultdict(list,
            {('-lsb-', 'statin'): [{' Very negative': 2},
              {' Very negative': 2},
              {' Negative': 2},
              {' Negative': 2}]})
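
The result above repeats each count once per occurrence of the label. If one entry per distinct label is what is wanted, `collections.Counter` gives that directly. A small sketch using the same toy data; stripping the leading space from the labels is an extra choice made here, not something the cell above does.

```python
from collections import Counter

labels_by_pair = {
    ("-lsb-", "statin"): [" Very negative", " Very negative",
                          " Negative", " Negative"],
}

counts_by_pair = {}
for pair, labels in labels_by_pair.items():
    # Counter tallies each distinct label once; strip() drops the leading space.
    counts_by_pair[pair] = dict(Counter(label.strip() for label in labels))

print(counts_by_pair)
```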

In [76]:
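sent_dict = defaultdict(list)
words_dict = defaultdict(list)


# Key words we are curious about (unused for now; 'statin' is hard-coded below).
search_phrase = ['australian','statin']

# For every sentence that mentions "statin", pair each of its words with
# "statin" and record the sentence's sentiment.
for k, v in co_occurrence_dict.iteritems():
    if "statin" in v:
        for vv in v:
            words_dict[(vv, "statin")].append(k[0])

# Count how often each sentiment occurs for each (word, "statin") pair.
# NOTE: vv is a sentiment label such as ' Negative', not the word k[0], so in
# practice this check never skips anything; numbers and punctuation still
# reach sent_dict.
for k, v in words_dict.iteritems():
    for vv in set(v):
        if vv.isdigit() or len(vv) < 3:
            continue
        sent_dict[k].append({vv: v.count(vv)})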

In [139]:
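clean_dict = defaultdict(list)
final_dict = defaultdict(list)

# Copy every {sentiment: count} entry; the word-level filter stays disabled.
for k, v in sent_dict.iteritems():
    for vv in v:
        # if any(c.isdigit() for c in k[0]) or len(k[0]) < 3:
        #     continue
        # if any(c in "[]\\/?<>-." for c in k[0]):
        #     continue
        clean_dict[k].append(vv)

# For pairs with more than two distinct sentiments, print one of them.
# NOTE: max() compares the dicts themselves, not their counts, so this is not
# necessarily the most frequent sentiment.
for k, v in clean_dict.iteritems():
    if len(v) > 2:
        print k, max(v).keys()[0]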

('evaluated', 'statin') [' Very negative']
('treatment', 'statin') [' Very negative']
('use', 'statin') [' Very negative']
('statin', 'statin') [' Very negative']
('associated', 'statin') [' Very negative']
('-lrb-', 'statin') [' Very negative']
('for', 'statin') [' Very negative']
('risk', 'statin') [' Very negative']
('is', 'statin') [' Very negative']
('12', 'statin') [' Very negative']
('1', 'statin') [' Very negative']
(':', 'statin') [' Very negative']
('of', 'statin') [' Very negative']
('on', 'statin') [' Positive']
('statins', 'statin') [' Very negative']
('-rrb-', 'statin') [' Very negative']
('patients', 'statin') [' Very negative']
(',', 'statin') [' Very negative']
('at', 'statin') [' Very negative']
('users', 'statin') [' Very negative']
('the', 'statin') [' Very negative']
('in', 'statin') [' Very negative']
('as', 'statin') [' Very negative']
('to', 'statin') [' Very negative']
('study', 'statin') [' Positive']
('were', 'statin') [' Very negative']
('and', 'statin') [' 
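
The list above still contains numbers, punctuation and stop words. One possible way to skip such tokens and to pick the most frequent sentiment by count is sketched below. Both helpers are hypothetical, and the filter is deliberately strict: it also drops hyphenated words such as `follow-up`.

```python
import string
from collections import Counter


def is_content_word(word):
    """Keep words of 3+ characters that contain no digits or punctuation."""
    if len(word) < 3:
        return False
    return not any(ch.isdigit() or ch in string.punctuation for ch in word)


def dominant_sentiment(labels):
    """Return the most frequent label (ties are broken arbitrarily)."""
    return Counter(label.strip() for label in labels).most_common(1)[0][0]


# Made-up labels for one (word, 'statin') pair.
labels = [" Negative", " Very negative", " Negative"]
if is_content_word("treatment"):
    print(dominant_sentiment(labels))
```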

In [78]:
treeData = defaultdict(list)

treeData["name"] =  "Statins"
treeData["parent"] = "null"

for k, v in sent_dict.iteritems():
    treeData["children"].append({"parent":"Statins",
        "name":str(v[0].keys()[0]).strip(" ")})
    for vv in v:
        treeData["children"].append({"name":k[0],
        "parent":str(v[0].keys()[0]).strip(" ")})

        
treeData

defaultdict(list,
            {'children': [{'name': 'Negative', 'parent': 'Statins'},
              {'name': '`', 'parent': 'Negative'},
              {'name': 'Negative', 'parent': 'Statins'},
              {'name': 'modified', 'parent': 'Negative'},
              {'name': 'modified', 'parent': 'Negative'},
              {'name': 'Negative', 'parent': 'Statins'},
              {'name': '0.24-0', 'parent': 'Negative'},
              {'name': 'Negative', 'parent': 'Statins'},
              {'name': 'variations', 'parent': 'Negative'},
              {'name': 'Negative', 'parent': 'Statins'},
              {'name': 'alone', 'parent': 'Negative'},
              {'name': 'Negative', 'parent': 'Statins'},
              {'name': 'men', 'parent': 'Negative'},
              {'name': 'Negative', 'parent': 'Statins'},
              {'name': 'homologous', 'parent': 'Negative'},
              {'name': 'Negative', 'parent': 'Statins'},
              {'name': 'outcomes', 'parent': 'Negative'},
     

In [None]:
var treeData = [
  {
    "name": "Statins",
    "parent": "null",
    "children": [

      {
        "name": "Negative",
        "parent": "Statins",
        "children": [
          {
            "name": "casein",
            "parent": "Negative"
          },
          {
            "name": "whey",
            "parent": "Negative"
          }
        ]
      },

      {
        "name": "Very Negative",
        "parent": "Statins",

        "children": [
          {
            "name": "grapes",
            "parent": "Very Negative"
          },
          {
            "name": "dates",
            "parent": "Very Negative"
          }
        ]
      },
       
       {
        "name": "Positive",
        "parent": "Statins",

        "children": [
          {
            "name": "asparagus",
            "parent": "Positive"
          },
          {
            "name": "kale",
            "parent": "Positive"
          }
        ]
      },
    ]
  }
];
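
A nested structure of this shape could be generated from Python rather than written by hand. The sketch below assumes a plain dict that maps each word to a single sentiment label; `build_tree` is a hypothetical helper, and the words are the placeholder names from the mock-up, not real results.

```python
import json
from collections import defaultdict


def build_tree(root_name, sentiment_by_word):
    """Group words under their sentiment, and sentiments under the root."""
    words_by_sentiment = defaultdict(list)
    for word, sentiment in sorted(sentiment_by_word.items()):
        words_by_sentiment[sentiment].append(word)

    children = []
    for sentiment in sorted(words_by_sentiment):
        children.append({
            "name": sentiment,
            "parent": root_name,
            "children": [{"name": word, "parent": sentiment}
                         for word in words_by_sentiment[sentiment]],
        })
    return [{"name": root_name, "parent": "null", "children": children}]


# Placeholder words taken from the mock-up, not real results.
example = {"casein": "Negative", "whey": "Negative", "kale": "Positive"}
print(json.dumps(build_tree("Statins", example), indent=2))
```

The JSON string can then be pasted or written into the JavaScript as the value of `treeData`.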