### A notebook to support academic question exploration and literature search / what do you do when you search the literature?

0. Define a research question, e.g. Is occupational **asbestos exposure** an under-recognised **cause** of IPF?
1. Consider the different possible ways of answering the question (methods): different study designs and ways of measuring asbestos exposure, e.g. epidemiological, observational, cross-sectional, cohort, case-control, post-mortem and explant studies, ecological, toxicology, animal models, molecular disease models, exposure assessment, occupational hygienist measurements, mineralogic analysis (tissue, BAL etc.)
2. Generate search terms, e.g. "IPF", "case-control", "occupational", "asbestos" (? && MeSH terms)
3. Carry out search using search terms and e.g. PubMed, Google Scholar, Scopus, bioRxiv, Web of Science, ClinicalTrials.gov, ?Google Books
4. Search results == Candidate Papers
5. Extract title | journal | author | location | year | abstract | key words | full text && save result (as .bib) (probably want to export to JabRef)
6. Review Candidate Papers to identify Relevant Papers 
7. Use Relevant Papers to identify more Candidate Papers. Search also by author, cited by, cites, [triangle closing](https://en.wikipedia.org/wiki/Triadic_closure), e.g. https://github.com/hinnefe2/bibcheck.py, and other means (?tensorflow)
8. Use the Relevant Papers collected for whatever it is they are relevant for (usually to help compose a written document in which they are cited)
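
Step 7's triangle closing can be sketched on a toy citation graph: a paper that is linked (citing or cited) to several papers already judged relevant is a good next candidate. This is only an illustration. The function name `triangle_candidates`, the `min_links` threshold, and the dict-of-sets input format are assumptions, and the paper ids are made up.

```python
from collections import Counter


def triangle_candidates(cites, relevant, min_links=2):
    """Rank papers not yet marked relevant by their links to relevant papers.

    cites: dict mapping a paper id to the set of paper ids it cites.
    relevant: set of paper ids already judged relevant.
    """
    links = Counter()
    for paper, cited in cites.items():
        for other in cited:
            # a citation in either direction counts as one link
            if paper in relevant and other not in relevant:
                links[other] += 1
            elif other in relevant and paper not in relevant:
                links[paper] += 1
    ranked = [(p, n) for p, n in links.items() if n >= min_links]
    # most links first, ties broken by paper id
    return sorted(ranked, key=lambda pn: (-pn[1], pn[0]))


# toy graph with made-up ids: A and B are the Relevant Papers
cites = {
    "A": {"C", "D"},
    "B": {"C"},
    "E": {"A", "B"},
    "F": {"A"},
}
candidates = triangle_candidates(cites, relevant={"A", "B"})
print(candidates)
```

Here C (cited by both A and B) and E (citing both A and B) come out as candidates, while D and F each have only one link and are dropped.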

meta: check GitHub/Stack Exchange etc. for other people's search strategies. This has likely been formulated as a machine learning problem somewhere.

#### interesting related work I found includes: https://www.projectcredo.com/, http://citationexplorer.hoppmann.me/, lict from a previous nhshackday, https://github.com/jvoytek/pubmedbrain/blob/f5170a2e3540e0c2aa665559c86048dfb1583f16/documents/Voytek-brainSCANrPreprint.pdf, https://github.com/graeham/hackathon/blob/master/paperGraph.py

### search github for relevant stuff with the following 'webbit' 
> https://github.com/search?l=Python&q=http%3A%2F%2Feutils.ncbi.nlm.nih.gov%2Fentrez%2Feutils%2Fesearch.fcgi++stars%3A%3E5&ref=advsearch&type=Code&utf8=%E2%9C%93

gists and interwebs inc stackoverflow also helpful

tempting to dive into django a la https://github.com/afouchet/OpenReview but probably not essential and now is not optimal timing

https://github.com/gui11aume looks well documented, poss useful template
https://github.com/swcarpentry/2013-08-23-harvard/blob/b2097bc20833e0a58b2e73eecd1227d61bd5a00a/lessons/misc-biopython/eutils.md looks like nice intro to biopython utils and https://gist.github.com/bonzanini/5a4c39e4c02502a8451d, https://gist.github.com/ehazlett/1104507, https://gist.github.com/vtrubets/ef1dabb397ea6a05ce5b4e767ed15af9 (for use of icite), https://gist.github.com/mcfrank/c1ec74df1427278cbe53, http://stackoverflow.com/questions/17409107/obtaining-data-from-pubmed-using-python, https://github.com/bwallace/abstrackr-web/tree/master/abstrackr

### let's tackle pubmed first
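
Before calling the API it helps to turn the search terms from step 2 into a single PubMed query string. The sketch below ORs synonyms within a concept and ANDs the concepts together. The helper names `format_term` and `build_pubmed_query` are hypothetical, and the field tags used in the example (`MeSH Terms`, `Title/Abstract`) should be checked in PubMed advanced search before relying on them.

```python
def format_term(text, field=None):
    """Quote multi-word terms and append an optional PubMed field tag."""
    term = '"{0}"'.format(text) if " " in text else text
    return "{0}[{1}]".format(term, field) if field else term


def build_pubmed_query(concepts):
    """Combine terms into one query: OR within a concept, AND between concepts.

    concepts: list of concepts, each a list of (text,) or (text, field) tuples.
    """
    groups = []
    for synonyms in concepts:
        groups.append("(" + " OR ".join(format_term(*t) for t in synonyms) + ")")
    return " AND ".join(groups)


query = build_pubmed_query([
    [("idiopathic pulmonary fibrosis", "MeSH Terms"), ("IPF", "Title/Abstract")],
    [("asbestos",), ("occupational exposure", "MeSH Terms")],
])
print(query)
```

The resulting string can be pasted into PubMed advanced search to inspect the hit count, or passed as the `term` argument of an esearch call.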

In [1]:
"""
Notebook to support academic question exploration and literature search.

Thanks to https://marcobonzanini.wordpress.com/2015/01/12/searching-pubmed-with-python/ and 
http://www.fredtrotter.com/2014/11/14/hacking-on-the-pubmed-api/

Pubmed advanced search is helpful for designing search/experimenting https://www.ncbi.nlm.nih.gov/pubmed/advanced

Docs for NCBI esearch:
https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
https://www.nlm.nih.gov/bsd/mms/medlineelements.html
"""



In [2]:
from Bio import Entrez
from Bio import Medline
from tqdm import tqdm
import json
import requests
import pandas as pd
import urllib
import pickle 

In [3]:
def get_chunked_pmids(term, chunksize=50):
    """
    Yield lists of Pubmed ids from a pubmed search, chunksize ids at a time
    """
    Entrez.email = "carl.reynolds@imperial.ac.uk"
    # first ask only for the number of matching records
    count_handle = Entrez.esearch(db="pubmed",
                                  term=term,
                                  sort="relevance",
                                  retmode="xml",
                                  rettype="count")
    count_results = Entrez.read(count_handle)
    count_handle.close()
    count = int(count_results["Count"])

    # retstart is the zero-based offset of the first id in each chunk;
    # the last chunk is simply shorter when count is not a multiple of chunksize
    for retstart in range(0, count, chunksize):
        pmid_handle = Entrez.esearch(db="pubmed",
                                     term=term,
                                     sort="relevance",
                                     retmode="xml",
                                     usehistory='y',
                                     retstart=retstart,
                                     retmax=chunksize)
        results = Entrez.read(pmid_handle)
        pmid_handle.close()
        yield list(results["IdList"])
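
The paging arithmetic behind a chunked search can be checked without calling NCBI. esearch takes `retstart` (the zero-based offset of the first record to return) and `retmax` (how many records to return), so a full result set is covered by stepping `retstart` from 0 in steps of the chunk size. A minimal sketch, with `chunk_offsets` as a hypothetical helper name:

```python
def chunk_offsets(count, chunksize):
    """Return (retstart, retmax) pairs that together cover `count` results."""
    return [(start, min(chunksize, count - start))
            for start in range(0, count, chunksize)]


offsets = chunk_offsets(count=120, chunksize=50)
print(offsets)  # [(0, 50), (50, 50), (100, 20)]
```

Starting at offset 0 matters: starting at `chunksize` would silently skip the first (most relevant) chunk of results.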

In [4]:
def get_pubmed_summary(pubmed_id):
    """
    Use the Pubmed API to return the summaries for a list of pubmed ids
    """
    Entrez.email = "carl.reynolds@imperial.ac.uk"
    # esummary expects the ids as one comma-separated string
    pubmed_id = ','.join(map(str, pubmed_id))
    handle = Entrez.esummary(db='pubmed',
                             id=pubmed_id,
                             retmode='json')
    return json.loads(handle.read())['result']
  

In [5]:
def get_pubmed_abstract(pubmed_id):
    """
    Use the Pubmed API to return the abstract of a pubmed article
    """
    Entrez.email = "carl.reynolds@imperial.ac.uk"
    handle = Entrez.efetch(db='pubmed',
                           id=pubmed_id,
                           retmode='text',
                           rettype='abstract')
    return handle.read()
  

In [6]:
def get_pubmed_keywords(pubmed_id):
    """
    Use the Pubmed API to return the medline record and extract the key words (MeSH headings) of a pubmed article
    """
    Entrez.email = "carl.reynolds@imperial.ac.uk"
    handle = Entrez.efetch(db='pubmed',
                           id=pubmed_id,
                           rettype='medline',
                           retmode='text')
    records = Medline.parse(handle)
    keywords = set()
    for record in records:
        # MH holds the MeSH headings; records without any contribute nothing
        keywords.update(record.get('MH', []))
    handle.close()
    return sorted(keywords)

In [7]:
def get_citation_information(pubmed_id):
    """
    Use the special citation api to return relative citation ratios
    """
    citation_search = 'https://icite.od.nih.gov/api/pubs/{0}'.format(pubmed_id)
    response = requests.get(citation_search).content
    str_response = response.decode('utf-8')
    try:
        return json.loads(str_response)['relative_citation_ratio']
    except KeyError:
        return None
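
Looking up one pmid per request gets slow for thousands of results. The sketch below assumes that the iCite API also accepts a comma-separated `pmids` query parameter and returns its records in a list under a `data` key. Both are assumptions to verify against the iCite API documentation, and `icite_batch_url` and `rcr_by_pmid` are hypothetical helper names. Only the URL building and response parsing are shown, so nothing here touches the network.

```python
def icite_batch_url(pmids):
    """Build a batch lookup URL (assumed `pmids` query parameter)."""
    return ("https://icite.od.nih.gov/api/pubs?pmids="
            + ",".join(str(p) for p in pmids))


def rcr_by_pmid(payload):
    """Map pmid -> relative citation ratio from a decoded batch response.

    payload: dict assumed to hold a list of records under 'data'.
    """
    return {str(rec["pmid"]): rec.get("relative_citation_ratio")
            for rec in payload.get("data", [])}


# made-up payload in the assumed shape, not real iCite output
fake_payload = {"data": [{"pmid": 111, "relative_citation_ratio": 1.5},
                         {"pmid": 222, "relative_citation_ratio": None}]}
batch_rcrs = rcr_by_pmid(fake_payload)
print(icite_batch_url([111, 222]))
print(batch_rcrs)
```

In use, the payload would come from something like `requests.get(icite_batch_url(block)).json()`, one call per block of pmids.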
        

In [8]:
def lit_search(term):
    """
    Search pubmed for a term and collect information about the results
    """
    pmid_blocks = get_chunked_pmids(term, chunksize=200)
    summaries = []
    abstracts = []
    keywords = []
    rcrs = []
    litsearch_results = [summaries, abstracts, keywords, rcrs]
    for i, block in enumerate(pmid_blocks):
        summaries.append(get_pubmed_summary(block))
        abstracts.append(get_pubmed_abstract(block))
        keywords.append(get_pubmed_keywords(block))
        # the citation helper takes a single pmid, so call it once per id
        rcrs.append({pmid: get_citation_information(pmid) for pmid in block})
        print("Processed block {0}".format(i))
    # save the collected results so the search need not be repeated
    with open("litsearch_results{0}.p".format(term), "wb") as f:
        pickle.dump(litsearch_results, f)
    return litsearch_results
        

Processed block 0
Processed block 1
Processed block 2
Processed block 3
Processed block 4
Processed block 5
Processed block 6
Processed block 7
Processed block 8
Processed block 9
Processed block 10
Processed block 11
Processed block 12
Processed block 13
Processed block 14
Processed block 15
Processed block 16
Processed block 17
Processed block 18
Processed block 19
Processed block 20
Processed block 21
Processed block 22
Processed block 23
Processed block 24
Processed block 25
Processed block 26
Processed block 27
Processed block 28
Processed block 29
Processed block 30
Processed block 31
Processed block 32
Processed block 33


In [12]:
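term = 'idiopathic pulmonary fibrosis'
# lit_search returns [summaries, abstracts, keywords, rcrs]
summaries, abstracts, keywords, rcrs = lit_search(term)
list_of_lists = [summaries, abstracts, keywords, rcrs]
with open("list_of_lists_{0}.p".format(term), "wb") as f:
    pickle.dump(list_of_lists, f)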
#can make resume?
#can support export to .bib?
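
On the .bib question: a minimal sketch of turning one summary record into a BibTeX entry. The `title`, `source` and `sortpubdate` fields are the ones this notebook already reads from the summaries. The `authors` list of dicts with a `name` key is an assumption about the summary JSON, `summary_to_bibtex` is a hypothetical helper name, and the record below is made up. Special characters are not escaped, so real titles may need cleaning before JabRef accepts them.

```python
def summary_to_bibtex(pmid, summary):
    """Format one summary-style record as a BibTeX @article entry."""
    authors = " and ".join(a["name"] for a in summary.get("authors", []))
    year = summary.get("sortpubdate", "")[:4]  # e.g. '2016/01/01 00:00' -> '2016'
    fields = [
        ("title", summary.get("title", "")),
        ("author", authors),
        ("journal", summary.get("source", "")),
        ("year", year),
    ]
    # skip empty fields rather than writing blank values
    body = ",\n".join("  {0} = {{{1}}}".format(k, v) for k, v in fields if v)
    return "@article{{pmid{0},\n{1}\n}}".format(pmid, body)


# made-up record for illustration only
record = {
    "title": "An example title",
    "source": "Example J",
    "sortpubdate": "2016/01/01 00:00",
    "authors": [{"name": "Smith J"}, {"name": "Jones A"}],
}
entry = summary_to_bibtex("12345", record)
print(entry)
```

Writing the entries for all pmids to one file, separated by blank lines, would give a .bib file that a reference manager can import.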

In [4]:
term = 'idiopathic pulmonary fibrosis'
with open("list_of_lists_{0}.p".format(term), "rb") as f:
    list_of_lists = pickle.load(f)

In [10]:
#records = pickle.load( open( "records.p", "rb" ) )
# write update function
# add pub type

In [7]:
summaries, abstracts, keywords, rcrs = list_of_lists
# each block of summaries maps pmid -> record and lists its pmids under 'uids'
records = {}
for block in summaries:
    for pmid in block['uids']:
        records[pmid] = block[pmid]
pubmed_ids = list(records)

In [11]:
df = pd.DataFrame(pubmed_ids, columns=['pmid'])
df['title'] = df['pmid'].map(lambda x: records[x].get('title'))
df['firstauthor'] = df['pmid'].map(lambda x: records[x].get('sortfirstauthor'))
df['lastauthor'] = df['pmid'].map(lambda x: records[x].get('lastauthor'))
df['journal'] = df['pmid'].map(lambda x: records[x].get('source'))
df['pubdate'] = df['pmid'].map(lambda x: records[x].get('sortpubdate'))
# keywords were saved per block rather than per pmid, so keywords and rcr are
# looked up again here (one request per pmid, slow for large result sets;
# needs the helper function cells above to have been run)
df['keywords'] = df['pmid'].map(get_pubmed_keywords)
df['rcr'] = df['pmid'].map(get_citation_information)

In [None]:
pd.set_option('display.max_colwidth', 300)
df.tail()

In [None]:
df2 = pd.DataFrame(list_of_dicts)
df2.transpose().head()