### A notebook to support academic question exploration and literature search / what do you do when you search the literature?

0. Define a research question, e.g. Is occupational **asbestos exposure** an under-recognised **cause** of IPF?
1. Consider the different possible ways of answering the question (methods), i.e. the different study designs and ways of measuring asbestos exposure, e.g. epidemiological, observational, cross-sectional, cohort, case-control, post-mortem and explant studies, ecological, toxicology, animal models, molecular disease models, exposure assessment, occupational hygienist measurements, mineralogic analysis (tissue, BAL etc.).
2. Generate search terms, e.g. "IPF", "case-control", "occupational", "asbestos" (? && MeSH terms).
3. Carry out the search using the search terms and e.g. PubMed, Embase, Google Scholar, Scopus, bioRxiv, Web of Science, clinicaltrials.gov, ?Google Books, prepubmed.org.
4. Search results == Candidate Papers.
5. Extract title | journal | author | location | year | abstract | key words | full text && save the result (as .bib; probably want to export to JabRef). See https://stackoverflow.com/questions/30768745/is-there-a-reliable-python-library-for-taking-a-bibtex-entry-and-outputting-it-i
6. Review Candidate Papers to identify Relevant Papers.
7. Use Relevant Papers to identify more Candidate Papers. Search also by author, cited by, cites, [triangle closing](https://en.wikipedia.org/wiki/Triadic_closure), e.g. https://github.com/hinnefe2/bibcheck.py, and by other means (?tensorflow/scikit-learn, nltk), https://en.wikipedia.org/wiki/Jaccard_index, networkx?
8. Use the Relevant Papers collected for whatever it is they are relevant for (usually to help compose a written document in which they are cited), including exporting to a nice table for LaTeX.
9. ?django app times
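
Step 7 mentions the Jaccard index as one way of spotting related papers. A minimal sketch, assuming each paper is represented by the set of PMIDs it cites; the `jaccard` helper and the sample reference lists are hypothetical and not part of the notebook code:

```python
def jaccard(a, b):
    """Jaccard index |A intersect B| / |A union B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets score 0
    return len(a & b) / len(union)


# Made-up reference lists, keyed by made-up paper labels
refs = {
    "paper_a": ["1", "2", "3", "4"],
    "paper_b": ["3", "4", "5"],
    "paper_c": ["9"],
}

similarity_ab = jaccard(refs["paper_a"], refs["paper_b"])  # 2 shared out of 5 distinct
similarity_ac = jaccard(refs["paper_a"], refs["paper_c"])  # nothing shared
```

Pairs of papers with a high score share most of their references, which is one possible signal that a Candidate Paper belongs with the Relevant Papers.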

General discussion of the problem: 1. http://drugmonkey.scientopia.org/2010/09/28/on-keeping-abreast-of-the-literature-a-highly-loaded-poll-question/ 2. http://www.sciencemag.org/careers/2016/11/how-keep-scientific-literature

Meta: check GitHub, Stack Exchange etc. for other people's search strategies. This has probably been formulated as a machine learning problem somewhere.

#### Interesting related work I found includes: https://www.projectcredo.com/, http://citationexplorer.hoppmann.me/, lict from a previous nhshackday, https://github.com/jvoytek/pubmedbrain/blob/f5170a2e3540e0c2aa665559c86048dfb1583f16/documents/Voytek-brainSCANrPreprint.pdf, https://github.com/graeham/hackathon/blob/master/paperGraph.py

### search github for relevant stuff with the following 'webbit' 
> https://github.com/search?l=Python&q=http%3A%2F%2Feutils.ncbi.nlm.nih.gov%2Fentrez%2Feutils%2Fesearch.fcgi++stars%3A%3E5&ref=advsearch&type=Code&utf8=%E2%9C%93

Gists and the wider web, including Stack Overflow, are also helpful.

It is tempting to dive into Django a la https://github.com/afouchet/OpenReview, but it is probably not essential and now is not the best time.

https://github.com/gui11aume looks well documented and is possibly a useful template.

https://github.com/swcarpentry/2013-08-23-harvard/blob/b2097bc20833e0a58b2e73eecd1227d61bd5a00a/lessons/misc-biopython/eutils.md looks like a nice intro to the Biopython E-utilities. See also https://gist.github.com/bonzanini/5a4c39e4c02502a8451d, https://gist.github.com/ehazlett/1104507, https://gist.github.com/vtrubets/ef1dabb397ea6a05ce5b4e767ed15af9 (for use of iCite), https://gist.github.com/mcfrank/c1ec74df1427278cbe53, http://stackoverflow.com/questions/17409107/obtaining-data-from-pubmed-using-python, https://github.com/bwallace/abstrackr-web/tree/master/abstrackr, http://www.billconnelly.net/?p=44

### let's tackle pubmed first

In [None]:
"""
Notebook to support academic question exploration and literature search.

Thanks to https://marcobonzanini.wordpress.com/2015/01/12/searching-pubmed-with-python/ and 
http://www.fredtrotter.com/2014/11/14/hacking-on-the-pubmed-api/ and vtrubets
https://gist.github.com/vtrubets/ef1dabb397ea6a05ce5b4e767ed15af9 

Pubmed advanced search is helpful for designing search/experimenting 
https://www.ncbi.nlm.nih.gov/pubmed/advanced

Docs for NCBI esearch:
https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
https://www.nlm.nih.gov/bsd/mms/medlineelements.html
"""


In [3]:
from Bio import Entrez
from Bio import Medline
from tqdm import tqdm
from collections import Counter
import json
import requests
import re
import pandas as pd
import pickle 
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [4]:
def get_chunked_pmids(term, chunksize=50):
    """
    Yield lists of Pubmed ids from a pubmed search, one chunk at a time
    """
    Entrez.email = "carl.reynolds@imperial.ac.uk"
    count_handle = Entrez.esearch(db="pubmed",
                                  term=term,
                                  retmode="xml",
                                  rettype="count")
    count_results = Entrez.read(count_handle)
    count = int(count_results["Count"])

    # retstart is the offset of the first record in each chunk
    # (0, chunksize, 2 * chunksize, ...), so the first chunk is included
    # and searches with fewer than chunksize results are handled.
    for retstart in range(0, count, chunksize):
        pmid_handle = Entrez.esearch(db="pubmed",
                                     term=term,
                                     sort="relevance",
                                     retmode="xml",
                                     usehistory='y',
                                     retstart=retstart,
                                     retmax=chunksize)
        results = Entrez.read(pmid_handle)
        yield results["IdList"]
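
The paging arithmetic for `retstart` can be checked without calling the API. A minimal sketch; `chunk_offsets` is a hypothetical helper for illustration and is not part of Biopython:

```python
def chunk_offsets(count, chunksize):
    """Return the retstart offsets needed to page through `count` results."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    return list(range(0, count, chunksize))


# 450 results in pages of 200 need three requests: two full pages and one of 50
offsets = chunk_offsets(count=450, chunksize=200)
```

Each offset is the index of the first record in a page, so the first page always starts at 0 and an empty search yields no requests at all.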

In [5]:
def get_pubmed_summaries(pubmed_id):
    """
    Use the Pubmed API to return a json summary of a list of pmid strings 
    """
    pubmed_id = ','.join(pubmed_id) # citation api likes to take a single string

    Entrez.email = "carl.reynolds@imperial.ac.uk"
    handle = Entrez.esummary(db='pubmed', 
                             id=pubmed_id, 
                             retmode='json',
                             rettype='abstract')
    return json.loads(handle.read())['result']
  

In [6]:
def get_pubmed_keywords(pubmed_id):
    """
    Use the Pubmed API to return the medline record and extract the key words (MeSH headings)
    for each pmid in a list of pmid strings. Returns {pmid:[list of keywords]}.
    """
    Entrez.email = "carl.reynolds@imperial.ac.uk"
    handle = Entrez.efetch(db='pubmed',
                           id=pubmed_id,
                           rettype='medline',
                           retmode='text')
    records = Medline.parse(handle)
    keywords = {}
    for record in records:
        pmid = record.get('PMID','?')
        mh = record.get('MH','?')
        keywords[pmid] = mh
    return keywords

In [7]:
def get_pubmed_abstracts(pubmed_id):
    """
    Use the Pubmed API to return the medline record and extract the abstract for each pmid in a list
    of pmid strings. Return {pmid:abstract}.
    """
    Entrez.email = "carl.reynolds@imperial.ac.uk"
    handle = Entrez.efetch(db='pubmed',
                           id=pubmed_id,
                           rettype='medline',
                           retmode='text')
    records = Medline.parse(handle)
    abstracts = {}
    for record in records:
        pmid = record.get('PMID','?')
        ab = record.get('AB','?')
        abstracts[pmid] = ab
    return abstracts

In [8]:
def get_pubmed_pubtypes(pubmed_id):
    """
    Use the Pubmed API to return the medline record and extract the publication type for each pmid
    in a list of pmid strings. Return {pmid:pubtype}.
    """
    Entrez.email = "carl.reynolds@imperial.ac.uk"
    handle = Entrez.efetch(db='pubmed',
                           id=pubmed_id,
                           rettype='medline',
                           retmode='text')
    records = Medline.parse(handle)
    pubtypes = {}
    for record in records:
        pmid = record.get('PMID','?')
        pt = record.get('PT','?')
        pubtypes[pmid] = pt
    return pubtypes

In [9]:
def get_citation_information(pubmed_id):
    """
    Use the special citation api to return relative citation ratios
    Takes a list of pmid strings. Returns {pmid:rcr}
    """
    pubmed_id = ','.join(pubmed_id) # citation api likes to take a single string
    citation_search = 'https://icite.od.nih.gov/api/pubs?pmids={0}'.format(pubmed_id)
    response = requests.get(citation_search).content
    str_response = response.decode('utf-8')
    
    try:
        data = json.loads(str_response)['data']
    except KeyError:
        data = False
         
    citations = {}
    
    if data:
        for record in data:
            pmid = record.get('pmid')
            rcr = record.get('relative_citation_ratio')
            citations[pmid] = rcr
    return citations
        

In [10]:
def get_pmids_for_papers_citing(pubmed_id):
    """
    Use the Pubmed API to look up the papers citing a *single pmid*.
    Return list of citing pmids.
    see also https://www.ncbi.nlm.nih.gov/pmc/tools/cites-citedby/
    """
    Entrez.email = "carl.reynolds@imperial.ac.uk"
    handle = Entrez.elink(dbfrom="pubmed",
                          id=pubmed_id,
                          linkname="pubmed_pubmed_citedin")
    records = Entrez.read(handle)
    list_of_pmids_citing = []
    
    if records[0]["LinkSetDb"]:
        for link in records[0]["LinkSetDb"][0]["Link"]:
            list_of_pmids_citing.append(link["Id"]) 
  
    return list_of_pmids_citing 

In [11]:
def get_pmids_for_papers_cited(pubmed_id):
    """
    Use the Pubmed API to return a list of pmids for papers cited by a particular pmid
    Return list of pmids. *only works for papers in pubmed central*
    see also https://www.ncbi.nlm.nih.gov/pmc/tools/cites-citedby/
    """
    Entrez.email = "carl.reynolds@imperial.ac.uk"
    handle = Entrez.elink(dbfrom="pubmed",
                          id=pubmed_id,
                          linkname="pubmed_pubmed_refs")
    records = Entrez.read(handle)
    list_of_pmids_cited = []
    
    if records[0]["LinkSetDb"]:
        for link in records[0]["LinkSetDb"][0]["Link"]:
            list_of_pmids_cited.append(link["Id"]) 
    return list_of_pmids_cited

In [12]:
def get_cited_pmids(pubmed_id):
    """
    Use the Pubmed API to look up the papers cited by each pmid
    in a list of pmids. Return {pmid:list_of_cited_pmids}
    Only works for papers in PMC
    """
    cited = {}
    for pmid in pubmed_id:
        cited[pmid] = get_pmids_for_papers_cited(pmid)
    return cited

In [13]:
def get_citing_pmids(pubmed_id):
    """
    Use the Pubmed API to look up the papers citing each pmid
    in a list of pmids
    Return {pmid:list_of_citing_pmids}
    """
    citing = {}
    for pmid in pubmed_id:
        citing[pmid] = get_pmids_for_papers_citing(pmid)
  
    return citing

In [14]:
def lit_fetch(pubmed_id):
    """
    Search pubmed for a list of pmid strings and return information about the results
    """
    result = {}
    result['summaries'] = get_pubmed_summaries(pubmed_id)
    result['pubtypes'] = get_pubmed_pubtypes(pubmed_id)
    result['abstracts'] = get_pubmed_abstracts(pubmed_id)
    result['keywords'] = get_pubmed_keywords(pubmed_id)
    result['rcrs'] = get_citation_information(pubmed_id)
    result['citing'] = get_citing_pmids(pubmed_id)
    result['cited'] = get_cited_pmids(pubmed_id)
    return result
        

In [15]:
def resultdf(r):
    """
    Build a dataframe with one row per pmid from the dict returned by lit_fetch
    """
    df = pd.DataFrame(r['summaries']['uids'], columns=['pmid'])
    df['title'] = df['pmid'].map(lambda x: r['summaries'].get(x)['title'])
    df['firstauthor'] = df['pmid'].map(lambda x: r['summaries'].get(x)['sortfirstauthor'])
    df['lastauthor'] = df['pmid'].map(lambda x: r['summaries'].get(x)['lastauthor'])
    df['journal'] = df['pmid'].map(lambda x: r['summaries'].get(x)['source'])
    df['pubdate'] = df['pmid'].map(lambda x: r['summaries'].get(x)['sortpubdate'])
    df['pubtype'] = df['pmid'].map(lambda x: r['pubtypes'].get(x))
    df['abstract'] = df['pmid'].map(lambda x: r['abstracts'].get(x))
    df['keywords'] = df['pmid'].map(lambda x: r['keywords'].get(x))
    df['rcr'] = df['pmid'].astype(int).map(lambda x: r['rcrs'].get(x))
    df['citedby'] =  df['pmid'].map(lambda x: r['citing'].get(x))
    df['cites'] = df['pmid'].map(lambda x: r['cited'].get(x))
    pd.set_option('max_colwidth',300)
    return df

In [57]:
def save_df_as_csv(term, df):
    name = term.replace(" ", "-")
    df.to_csv(name+".csv")
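
Step 5 of the plan asks for results to be saved as .bib as well as .csv. A minimal sketch of formatting one row as a BibTeX entry, assuming a record with the same field names as the dataframe columns; `to_bibtex` and the example record are hypothetical, only the first author is written, and BibTeX special characters are not escaped:

```python
def to_bibtex(record):
    """Format one paper (a dict) as a BibTeX @article entry keyed by its PMID."""
    year = str(record["pubdate"])[:4]  # sortpubdate-style strings start with the year
    fields = [
        ("title", record["title"]),
        ("author", record["firstauthor"]),
        ("journal", record["journal"]),
        ("year", year),
        ("note", "PMID: {0}".format(record["pmid"])),
    ]
    body = ",\n".join("  {0} = {{{1}}}".format(key, value) for key, value in fields)
    return "@article{{pmid{0},\n{1}\n}}".format(record["pmid"], body)


# Made-up record for illustration only
example = {"pmid": "0000000", "title": "An example title", "firstauthor": "Doe J",
           "journal": "Example J", "pubdate": "2001/01/01 00:00"}
entry = to_bibtex(example)
```

Writing one such entry per dataframe row to a text file would give something JabRef can open; a dedicated BibTeX library is probably the better choice for anything beyond a quick export.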

In [56]:
def lit_search(term):
    """
    Search pubmed for a term and collect information about the results
    """
    pmid_blocks = get_chunked_pmids(term, chunksize=200)
    df_list = []
    for i, block in enumerate(pmid_blocks):
        # NOTE: only the first block is processed; remove this check to fetch every block
        if i < 1:
            print("Beginning block processing")
            result = lit_fetch(block)
            df = resultdf(result)
            df_list.append(df)
            print("Processed block {0}".format(i))
    df = pd.concat(df_list)
    save_df_as_csv(term, df)
    return df
        

In [47]:
def explore_result_dataframe(df):
    """
    Print some simple stats for result dataframe
    """
    print('Top first authors\n')
    print(df.firstauthor.value_counts().head())
    print('\nTop last authors\n')
    print(df.lastauthor.value_counts().head())
    print('\nTop journals\n')
    print(df.journal.value_counts().head())
    print('\nTop publication types\n')
    print(df.pubtype.astype(str).value_counts().head())
    print('\nPublications per year\n')
    df.pubdate = pd.to_datetime(df.pubdate)
    df.index = df.pubdate
    df.groupby(df.pubdate.map(lambda x: x.year)).pmid.count().plot(kind='bar')


In [None]:
df = lit_search('idiopathic pulmonary fibrosis')
explore_result_dataframe(df)

In [63]:
pmid_blocks = get_chunked_pmids('idiopathic pulmonary fibrosis', chunksize=200)


In [55]:
terms = ['idiopathic pulmonary fibrosis', 'cryptogenic fibrosing alveolitis',
        'usual interstitial pneumonia', 'asbestosis']

#?(idiopathic pulmonary fibrosis) AND (asbestos) 

def litsearch_terms(terms):
    """
    Search pubmed for the terms and return a df
    """
    df_list = []
    for term in terms:
        print('searching for {0}'.format(term))
        # lit_search already saves a .csv for each term, so no second save is needed
        df = lit_search(term)
        df_list.append(df)
    df = pd.concat(df_list)
    return df
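
The commented-out query `(idiopathic pulmonary fibrosis) AND (asbestos)` hints at combining concepts rather than searching each term on its own. A minimal sketch of building such a boolean query string, OR within a group of synonyms and AND between groups; `build_query` is a hypothetical helper and the grouping of terms is only an example:

```python
def build_query(*concepts):
    """Combine groups of synonyms: OR within a group, AND between groups."""
    groups = []
    for synonyms in concepts:
        quoted = ['"{0}"'.format(s) for s in synonyms]  # quote each phrase
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


disease = ["idiopathic pulmonary fibrosis", "cryptogenic fibrosing alveolitis"]
exposure = ["asbestos"]
query = build_query(disease, exposure)
```

The resulting string could be passed as the `term` argument of `lit_search`; it is worth pasting it into the PubMed advanced search page first to see how the query is interpreted.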

In [99]:
topic_collection = {"ipfjes_case_control_studies" : ['23022860', '10968375', '24413348', '19782552', '17628464', 
                                                     '10841131', '8569361', '8087336', '15640309', '9571528', 
                                                     '18507288', '2249047'], 
                    "ipfjes_reviews" : ['25621562', '24348069', '10193340', '11816818', '15331187', '16733403']}


def litfetch_topics(topic_collection):
    """
    fetch info for pmids in a topic collection and save the result as a .csv
    """
    for topic in topic_collection:
        print('beginning topic {0} analysis'.format(topic))
        result = lit_fetch(topic_collection[topic])
        df = resultdf(result)
        save_df_as_csv(topic, df)
        print('csv of topic {0} saved'.format(topic))

In [None]:
df = pd.read_csv('idiopathic-pulmonary-fibrosis.csv', usecols=['pmid','title','firstauthor','lastauthor',
                                                             'journal','pubdate','pubtype',
                                                             'abstract', 'keywords', 'rcr'])

In [None]:
df = pd.read_csv('ipfjes_case_control_studies.csv', usecols=['pmid','title','firstauthor','lastauthor',
                                                             'journal','pubdate','pubtype',
                                                             'abstract', 'keywords', 'rcr', 'citedby', 'cites'])


In [None]:
df1 = pd.read_csv('idiopathic-pulmonary-fibrosis.csv', usecols=['pmid','title','firstauthor','lastauthor',
                                                             'journal','pubdate','pubtype',
                                                             'abstract', 'keywords', 'rcr', 'citedby'])

In [100]:
keywords = get_pubmed_keywords(topic_collection['ipfjes_case_control_studies'])

print ('Top pubmed keywords (mesh headings)',
       'for the {0} occupational IPF case-control studies found'.format(len(keywords)))

corpus = []
       
for record in keywords:
    corpus.append(keywords[record])
    
corpus = [item for sublist in corpus for item in sublist]

x = Counter(corpus)

top_keywords = [(l,k) for k,l in sorted([(j,i) for i,j in x.items()], reverse=True)]
top_keywords[:5]

Top pubmed keywords (mesh headings) for the 12 occupational IPF case-control studies found


[('Male', 12),
 ('Humans', 12),
 ('Female', 12),
 ('Case-Control Studies', 10),
 ('Aged', 8),
 ('Middle Aged', 7),
 ('Risk Factors', 6),
 ('Adult', 5),
 ('Pulmonary Fibrosis/*etiology', 4),
 ('Wood', 3),
 ('Surveys and Questionnaires', 3),
 ('Odds Ratio', 3),
 ('Metals', 3),
 ('Dust', 3),
 ('*Occupational Exposure', 3),
 ('Smoking/adverse effects', 2),
 ('Smoking/*adverse effects', 2),
 ('Sex Factors', 2),
 ('Pulmonary Fibrosis/*mortality', 2),
 ('Occupational Exposure/*adverse effects/statistics & numerical data', 2),
 ('Occupational Exposure', 2),
 ('Dust/*adverse effects', 2),
 ('Aged, 80 and over', 2),
 ('Adolescent', 2),
 ('*Environmental Exposure', 2),
 ('Wood/*adverse effects', 1),
 ('United States/epidemiology', 1),
 ('United Kingdom/epidemiology', 1),
 ('Time Factors', 1),
 ('Sweden/epidemiology', 1),
 ('Sex Distribution', 1),
 ('Severity of Illness Index', 1),
 ('Risk Assessment', 1),
 ('Risk', 1),
 ('Retrospective Studies', 1),
 ('Regression Analysis', 1),
 ('Pulmonary Fibro
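
In the counts above, 'Smoking/adverse effects' and 'Smoking/*adverse effects' are tallied separately: in MEDLINE records an asterisk marks a major topic and the text after a slash is a subheading. A minimal sketch of normalising headings before counting; `normalise_heading` is a hypothetical helper and the sample list reuses a few headings from the output:

```python
from collections import Counter


def normalise_heading(heading, keep_subheadings=False):
    """Strip major-topic asterisks; optionally drop the '/subheading' qualifiers."""
    cleaned = heading.replace("*", "")
    if not keep_subheadings:
        cleaned = cleaned.split("/")[0]
    return cleaned


sample = ["Smoking/adverse effects", "Smoking/*adverse effects",
          "*Occupational Exposure", "Occupational Exposure",
          "Dust/*adverse effects"]
counts = Counter(normalise_heading(h) for h in sample)
```

Whether to collapse subheadings depends on the question: keeping them separates aetiology from mortality studies, dropping them gives a cleaner view of the main topics.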

In [117]:
def count_citing_papers(pubmed_id):
    '''
    Looks up papers citing a list of pubmed_id strings and returns a dict
    {citing_paper:number_of_times_cites_list}
    '''
    citing = get_citing_pmids(pubmed_id)
    corpus = []
    for record in citing:
        corpus.append(citing[record])
    # flatten the per-paper lists of citing pmids, then count each citing paper
    corpus = [item for sublist in corpus for item in sublist]
    x = Counter(corpus)
    citingpapers = [(l,k) for k,l in sorted([(j,i) for i,j in x.items()], reverse=True)]
    return dict(citingpapers)

In [118]:
# the function has its own name, so the result no longer shadows it when the cell is re-run
citing_count = count_citing_papers(topic_collection['ipfjes_case_control_studies'])

In [126]:
result = lit_fetch(list(citing_count.keys()))


RemoteDisconnected: Remote end closed connection without response
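
The `RemoteDisconnected` error may come from asking for too many ids in one go, although the cause was not diagnosed here. One possible workaround is to split the id list into batches and retry failed calls. A minimal sketch; `batched`, `with_retries` and the simulated `flaky_fetch` are hypothetical helpers, not part of Biopython or requests:

```python
import time


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_retries(func, attempts=3, wait=1.0):
    """Call func(); if it raises, wait and try again, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(wait * attempt)  # simple linear back-off


# Stand-in for a flaky network call: fails twice, then succeeds
calls = []


def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("simulated dropped connection")
    return "ok"


batches = list(batched(["1", "2", "3", "4", "5"], size=2))
outcome = with_retries(flaky_fetch, attempts=3, wait=0)
```

In the notebook this might look like `with_retries(lambda: lit_fetch(batch))` for each `batch` from `batched(pmids, 100)`; the batch size of 100 is a guess, not a documented limit.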

In [None]:
df = resultdf(result)

In [128]:
citing_count.get('10193340')

3

In [None]:
def dataframe_from_citingpapers(citingpapers):
    """
    takes our list of tuples showing which pmid cite most of our pmidgrp
    makes a dataframe showing info for each pmid
    """
    df = pd.DataFrame(citingpapers)
    df.columns = ['pmid', 'citations of pmidgrp']
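    # lit_search_for_topic is not defined in this notebook; lit_fetch returns the same pieces as a dict
    result = lit_fetch(list(df.pmid.astype('str')))
    summaries, pubtypes = result['summaries'], result['pubtypes']
    abstracts, keywords, rcrs = result['abstracts'], result['keywords'], result['rcrs']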
    df['title'] = df['pmid'].map(lambda x: summaries.get(x)['title'])
    df['firstauthor'] = df['pmid'].map(lambda x: summaries.get(x)['sortfirstauthor'])
    df['lastauthor'] = df['pmid'].map(lambda x: summaries.get(x)['lastauthor'])
    df['journal'] = df['pmid'].map(lambda x: summaries.get(x)['source'])
    df['pubdate'] = df['pmid'].map(lambda x: summaries.get(x)['sortpubdate'])
    df['pubtype'] = df['pmid'].map(lambda x: pubtypes.get(x))
    df['abstract'] = df['pmid'].map(lambda x: abstracts.get(x))
    df['keywords'] = df['pmid'].map(lambda x: keywords.get(x))
    df['rcr'] = df['pmid'].astype(int).map(lambda x: rcrs.get(x))
    return df
    

In [None]:
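# citingpapers only exists inside the counting function, so rebuild the list of tuples from the dict
df = dataframe_from_citingpapers(list(citing_count.items()))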

In [None]:
df.pubdate = pd.to_datetime(df.pubdate)
df.index = df.pubdate.map(lambda x: x.year)
df = df.sort_values(by = 'pubdate', ascending=True)
df = df[['title','firstauthor', 'lastauthor', 'journal', 'pubtype', 'rcr', 'citations of pmidgrp']]
df

In [None]:
df.to_csv('papers_citing_ipf_occupational_dust_case_control_studies.csv')
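
Step 8 of the plan mentions exporting to a table for LaTeX. A minimal pure-Python sketch that renders rows as a `tabular` environment and escapes special characters; `latex_escape`, `to_latex_table` and the sample row are hypothetical:

```python
def latex_escape(text):
    """Escape the characters LaTeX treats specially in running text."""
    replacements = {"\\": r"\textbackslash{}", "&": r"\&", "%": r"\%",
                    "$": r"\$", "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
                    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}"}
    # escape character by character so replacements are never escaped twice
    return "".join(replacements.get(ch, ch) for ch in str(text))


def to_latex_table(rows, columns):
    """Render a list of dicts as a tabular with one left-aligned column per key."""
    lines = ["\\begin{tabular}{" + "l" * len(columns) + "}", "\\hline"]
    lines.append(" & ".join(latex_escape(c) for c in columns) + " \\\\")
    lines.append("\\hline")
    for row in rows:
        cells = [latex_escape(row.get(c, "")) for c in columns]
        lines.append(" & ".join(cells) + " \\\\")
    lines += ["\\hline", "\\end{tabular}"]
    return "\n".join(lines)


# Made-up row for illustration only
rows = [{"title": "Dust & metals", "journal": "Example J", "rcr": 1.5}]
table = to_latex_table(rows, ["title", "journal", "rcr"])
```

With a dataframe, `df.to_dict('records')` would supply the rows. pandas also has its own `DataFrame.to_latex`, which may be simpler if its defaults suit the document.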

In [None]:
df.sort_values(by='citations of pmidgrp', ascending=False)

In [None]:
# make something to show new things
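
One way to "show new things" is to keep only citing papers that are not already in a known collection and that cite several of the seed papers, in the spirit of triadic closure. A minimal sketch; `new_candidates`, its `min_links` threshold and the sample ids are hypothetical:

```python
def new_candidates(citing_counts, known_pmids, min_links=2):
    """Return (pmid, count) pairs for citing papers that are not already known
    and that cite at least `min_links` of the seed papers, most-linked first."""
    known = set(known_pmids)
    fresh = [(pmid, n) for pmid, n in citing_counts.items()
             if pmid not in known and n >= min_links]
    return sorted(fresh, key=lambda pair: (-pair[1], pair[0]))


# Made-up ids for illustration only
example_counts = {"111": 3, "222": 1, "333": 2, "444": 5}
already_known = ["444"]
suggestions = new_candidates(example_counts, already_known)
```

Here the known ids would be every pmid in `topic_collection`, and the counts would be the dict of citing papers built earlier.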

In [None]:
cited = get_cited_pmids(topic_collection['ipfjes_case_control_studies'])

print ('Top cited papers (from pubmed)',
       'for the {0} occupational IPF case-control studies found'.format(len(cited)))

corpus = []
       
for record in cited:
    corpus.append(cited[record])
    
corpus = [item for sublist in corpus for item in sublist]

x = Counter(corpus)

# [(l,k) for k,l in sorted([(j,i) for i,j in x.items()], reverse=True)] not that interesting because most
# of our papers aren't in PMC and so citations not available

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

In [106]:
citing

{'10841131': ['27630174',
  '27266705',
  '26511746',
  '25927611',
  '25621562',
  '24348069',
  '23166540',
  '23153608',
  '15340372'],
 '10968375': ['27630174',
  '27266705',
  '26893575',
  '26761627',
  '25933009',
  '25927611',
  '25621562',
  '25208940',
  '25022318',
  '24901704',
  '24413348',
  '24348069',
  '24015331',
  '23716070',
  '23153608',
  '22917154',
  '22448328',
  '22429962',
  '22268124',
  '22029812',
  '21850202',
  '21743030',
  '21668038',
  '20070197',
  '19740254',
  '19590695',
  '19223434',
  '19129758',
  '18366757',
  '17999627',
  '17897991',
  '17710235',
  '16928146',
  '16844727',
  '16324212',
  '15340372',
  '15262886',
  '14508706',
  '11686865'],
 '15640309': ['27630174',
  '27266705',
  '26934369',
  '26893575',
  '24746629',
  '24413348',
  '24348069',
  '24285090',
  '21799956',
  '21668038',
  '19542480'],
 '17628464': ['25927611',
  '25621562',
  '24413348',
  '24348069',
  '23153608',
  '21668038'],
 '18507288': ['28096793',
  '27175674'

In [None]:
# G = nx.DiGraph(nx.from_dict_of_lists(citing))

G = nx.from_dict_of_lists(citing)

pos=nx.spring_layout(G)
  
nx.draw_networkx_nodes(G,pos,
                       nodelist=citing.keys(),
                       node_color='r',
                       node_size=[len(v) * 50 for v in citing.values()],
                       alpha=0.8)

nx.draw_networkx_nodes(G,pos,
                       nodelist=[item for sublist in citing.values() for item in sublist],
                       node_color='b',
                       node_size=50,
                       alpha=0.8)

nx.draw_networkx_edges(G,pos,width=1.0,alpha=0.5)

print ('IPF case-control studies and their citing papers\n')

labels = {}

for i, item in enumerate(citing):
    labels[item] = i
    
    
print ('Key : pubmed id : title author year')
for i, item in enumerate(labels):
    summary = get_pubmed_summaries([item])[item]  # one API call per paper instead of three
    print(i, ':', item, summary['title'],
          summary['sortfirstauthor'],
          summary['sortpubdate'][:4])
    
nx.draw_networkx_labels(G,pos,labels,font_size=8)
        
plt.savefig('PapersThatIPFCaseControlStudies.png')

In [None]:
# need to sort paper (and label order)
# consider weighting size of citing papers by RCR (relative citation ratio)
# classification
# flask app
# ?hover over labels in networkx
# http://nbviewer.jupyter.org/urls/ep2016.europython.eu/media/conference/slides/networkx-visualization-powered-by-bokeh.ipynb
# https://github.com/bokeh/bokeh/blob/master/examples/plotting/file/graphs.py
# https://andrewmellor.co.uk/blog/articles/2014/12/14/d3-networks/