# Automatically annotate life science literature with named entities

### Objectives:
    Map terms to classes:
        chemical
        biomolecule
        drug
        process
        anatomy
        species
        disease
        method

### Sources:
    1. Pubtator
    2. DBpedia, DBpedia spotlight
    3. BIO2RDF

In [55]:
# get pubtator annotations
import requests
import json
import pandas as pd
import numpy as np
from IPython.display import display
base = 'https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/RESTful/tmTool.cgi/BioConcept'
retmode = 'json'
pmid = '29969095'
url = '{0}/{1}/{2}' .format(base, pmid, retmode)
r = requests.get(url).text
j = json.loads(r)
text = j[0]['text']
print(text)
pbt_ann = []
pbt2class = {
    'gene': 'biomolecule',
    'species': 'species',
    'chemical': 'chemical',
    'disease': 'disease',
}
for a in j[0]['denotations']:
    pbt_class = a['obj'].split(':')[0].lower()
    try:
        mapped_class = pbt2class[pbt_class]
        pass
    except:
        print('Unmapped class: ', pbt_class)
        mapped_class = np.nan
    strt = a['span']['begin']
    end = a['span']['end']
    s = text[strt:end]
    pbt_ann.append((s, pbt_class, mapped_class))
pbt = pd.DataFrame(list(pbt_ann), columns=['term', 'pubtator_class', 'pubtator_mapped_class']).set_index('term')
pbt.drop_duplicates(inplace=True)
display(pbt)

Neurogenic decisions require a cell cycle independent function of the CDC25B phosphatase. A fundamental issue in developmental biology and in organ homeostasis is understanding the molecular mechanisms governing the balance between stem cell maintenance and differentiation into a specific lineage. Accumulating data suggest that cell cycle dynamics play a major role in the regulation of this balance. Here we show that the G2/M cell cycle regulator CDC25B phosphatase is required in mammals to finely tune neuronal production in the neural tube. We show that in chick neural progenitors, CDC25B activity favors fast nuclei departure from the apical surface in early G1, stimulates neurogenic divisions and promotes neuronal differentiation. We design a mathematical model showing that within a limited period of time, cell cycle length modifications cannot account for changes in the ratio of the mode of division. Using a CDC25B point mutation that cannot interact with CDK, we show that part of C

Unnamed: 0_level_0,pubtator_class,pubtator_mapped_class
term,Unnamed: 1_level_1,Unnamed: 2_level_1
Neurogenic decisions,disease,disease
CDC25B,gene,biomolecule
chick,species,species


In [51]:
import spotlight

base = 'https://api.dbpedia-spotlight.org/en/annotate'
annotations = spotlight.annotate(base, text, confidence=0.4, support=5)
annotations

[{'URI': 'http://dbpedia.org/resource/Cell_cycle',
  'support': 1839,
  'types': '',
  'surfaceForm': 'cell cycle',
  'offset': 31,
  'similarityScore': 1.0,
  'percentageOfSecondRank': 6.106436594423217e-20},
 {'URI': 'http://dbpedia.org/resource/Subroutine',
  'support': 3978,
  'types': '',
  'surfaceForm': 'function',
  'offset': 54,
  'similarityScore': 0.893844864875603,
  'percentageOfSecondRank': 0.11365424768746649},
 {'URI': 'http://dbpedia.org/resource/CDC25B',
  'support': 12,
  'types': '',
  'surfaceForm': 'CDC25B',
  'offset': 70,
  'similarityScore': 0.9999999999889724,
  'percentageOfSecondRank': 0.0},
 {'URI': 'http://dbpedia.org/resource/Phosphatase',
  'support': 494,
  'types': '',
  'surfaceForm': 'phosphatase',
  'offset': 77,
  'similarityScore': 0.9999999999998863,
  'percentageOfSecondRank': 1.580676278011721e-13},
 {'URI': 'http://dbpedia.org/resource/Molecular_biology',
  'support': 2760,
  'types': '',
  'surfaceForm': 'biology',
  'offset': 127,
  'similar

In [56]:
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper('http://dbpedia.org/sparql')
sparql.setReturnFormat(JSON)

def check_dbp_redirect(uri):
    q = 'SELECT * WHERE {{ {0} <http://dbpedia.org/ontology/wikiPageRedirects> ?redirect}}'.format(uri)
    sparql.setQuery(q)
    results = sparql.query().convert()
    try:
        uri = '<{0}>' .format(results['results']['bindings'][0]['redirect']['value'])
    except:
        pass
    return uri

dbp2class = {  # Map dbpedia types to our standard classes
    'biomolecule': 'biomolecule',
    'protein': 'biomolecule',
    'species': 'species',
    'eukaryote': 'species',
    'animal': 'species',
    'mammal': 'species',
    'disease': 'disease',
    'anatomicalstructure': 'anatomy'
}

dbp_ann = {}
for a in annotations:
    term = a['surfaceForm']
    types = a['types'].lower().replace('dbpedia:', '').split(',')
    uri = '<{0}>' .format(a['URI'])
    uri = check_dbp_redirect(uri)
    types = [t for t in types if t in dbp2class.keys()]
    classes = [dbp2class[t] for t in types]
    dbp_ann[term] = {
        'dbpedia_types': a['types'],
        'dbpedia_mapped_classes': classes,
        'dbpedia_resource': uri,   
    }

dbp = pd.DataFrame(dbp_ann).transpose()
dbp

Unnamed: 0,dbpedia_mapped_classes,dbpedia_resource,dbpedia_types
cell cycle,[],<http://dbpedia.org/resource/Cell_cycle>,
function,[],<http://dbpedia.org/resource/Subroutine>,
CDC25B,[],<http://dbpedia.org/resource/CDC25B>,
phosphatase,[],<http://dbpedia.org/resource/Phosphatase>,
biology,[],<http://dbpedia.org/resource/Molecular_biology>,
homeostasis,[],<http://dbpedia.org/resource/Homeostasis>,
stem cell,[anatomy],<http://dbpedia.org/resource/Stem_cell>,"Wikidata:Q4936952,DBpedia:AnatomicalStructure"
differentiation,[],<http://dbpedia.org/resource/Cellular_differen...,
G2/M,[],<http://dbpedia.org/resource/Cell_cycle_checkp...,
neural tube,[anatomy],<http://dbpedia.org/resource/Neural_tube>,"Wikidata:Q4936952,DBpedia:Embryology,DBpedia:A..."


In [30]:
# create lists of dbpedia URIs to assign dbpedia annotations without classes
class_props = {
    'method': [
        'rdf:type yago:WikicatBiologicalTechniquesAndTools',
        'rdf:type yago:WikicatLaboratoryTechniques',
        'rdf:type yago:WikicatMolecularBiologyTechniques',
        'rdf:type yago:Method105660268',
        'rdf:type yago:Invention105633385',
        'rdf:type yago:Technique105665146',
        'rdf:type yago:WikicatBiochemistryMethods',
        'rdf:type yago:WikicatProteinMethods',
        'dct:subject dbc:Laboratory_techniques',
        'dct:subject dbc:Molecular_biology_techniques',
        'dct:subject dbc:Protein_methods',
    ],
    'pathway_process': [
        'dct:subject <http://dbpedia.org/resource/Category:Cellular_processes>',
        'rdf:type yago:WikicatCellularProcesses',
    ]
}

base = '''
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dct: <http://purl.org/dc/terms/subject>
    PREFIX yago: <http://dbpedia.org/class/yago/>
    PREFIX dbc: <http://dbpedia.org/resource/>
    SELECT ?resource WHERE {{
        ?resource {0}
    }}
'''

class_URIs = {}
for c, pr in class_props.items():
    uris = []
    for p in pr:
        q = base.format(p)
        sparql.setQuery(q)
        r = sparql.query().convert()
        uris += ['<{0}>' .format(d['resource']['value']) for d in r['results']['bindings']]
    class_URIs[c] = uris

for c, ur in class_URIs.items():
    print(len(ur), c)

9744 method
65 pathway_process


In [53]:
# use other dbpedia data to assign classes
classless = dbp.loc[dbp['dbpedia_classes'].isnull()]['dbpedia_resource'].values

for c, ur in class_URIs.items():
    m = map(lambda x: x if x in ur else False, classless)
#     f = filter(lambda x: x, m)
    for uri in f:
        idx = dbp.loc[dbp['dbpedia_resource']==uri].index[0]
        dbp.loc[idx, 'dbpedia_mapped_classes'] = c

display(dbp)

Unnamed: 0,dbpedia_classes,dbpedia_resource,dbpedia_types
cell cycle,[],<http://dbpedia.org/resource/Cell_cycle>,
function,[],<http://dbpedia.org/resource/Subroutine>,
CDC25B,[],<http://dbpedia.org/resource/CDC25B>,
phosphatase,[],<http://dbpedia.org/resource/Phosphatase>,
biology,[],<http://dbpedia.org/resource/Molecular_biology>,
homeostasis,[],<http://dbpedia.org/resource/Homeostasis>,
stem cell,[anatomy],<http://dbpedia.org/resource/Stem_cell>,"Wikidata:Q4936952,DBpedia:AnatomicalStructure"
differentiation,[],<http://dbpedia.org/resource/Cellular_differen...,
G2/M,[],<http://dbpedia.org/resource/Cell_cycle_checkp...,
neural tube,[anatomy],<http://dbpedia.org/resource/Neural_tube>,"Wikidata:Q4936952,DBpedia:Embryology,DBpedia:A..."


In [60]:
# join dfs
df = pbt.join(dbp, how='outer')

In [72]:
# bio2rdf
sparql = SPARQLWrapper('http://pubmed.bio2rdf.org/sparql')
sparql.setReturnFormat(JSON)

base = '''
SELECT DISTINCT ?concept ?type WHERE {{
?concept dcterms:title "{0}"@en . 
?concept rdf:type ?type .
}}
'''

bio2rdf = {}
for t in df.index:
    q = base.format(t)
    sparql.setQuery(q)
    results = sparql.query().convert()
    if not results:
        continue
    bio2rdf[t] = {
        'bio2rdf_uris': [r['concept']['value'] for r in results['results']['bindings']],
        'bio2rdf_classes': [r['type']['value'] for r in results['results']['bindings']],
    }
bio2rdf = pd.DataFrame(bio2rdf).transpose()
display(bio2rdf)

Unnamed: 0,bio2rdf_class,bio2rdf_uri
CDC25B,[http://bio2rdf.org/hgnc.symbol_vocabulary:Res...,"[http://bio2rdf.org/hgnc.symbol:CDC25B, http:/..."
CDK,[],[]
G1,[http://bio2rdf.org/clinicaltrials_vocabulary:...,[http://bio2rdf.org/clinicaltrials_resource:NC...
G2/M,[],[]
Neurogenic decisions,[],[]
apical surface,[],[]
biology,[],[]
cell cycle,"[http://www.w3.org/2002/07/owl#Class, http://b...","[http://bio2rdf.org/go:0007049, http://bio2rdf..."
chick,[],[]
differentiation,[],[]


In [65]:
df = df.join(goa)
display(df)

Unnamed: 0,pubtator_class,pubtator_mapped_class,dbpedia_mapped_classes,dbpedia_resource,dbpedia_types,goa_class,goa_uri
CDC25B,gene,biomolecule,[],<http://dbpedia.org/resource/CDC25B>,,,
CDK,,,[biomolecule],<http://dbpedia.org/resource/Cyclin-dependent_...,"Wikidata:Q8047,Wikidata:Q206229,DBpedia:Enzyme...",,
G1,,,[],<http://dbpedia.org/resource/G1_phase>,,,
G2/M,,,[],<http://dbpedia.org/resource/Cell_cycle_checkp...,,,
Neurogenic decisions,disease,disease,,,,,
apical surface,,,[],<http://dbpedia.org/resource/Cell_membrane>,,,
biology,,,[],<http://dbpedia.org/resource/Molecular_biology>,,,
cell cycle,,,[],<http://dbpedia.org/resource/Cell_cycle>,,biological_process,http://bio2rdf.org/go:0007049
chick,species,species,,,,,
differentiation,,,[],<http://dbpedia.org/resource/Cellular_differen...,,,


In [None]:
# get meta data for terms with classes
df = df.loc[
    (df['pubtator_class'].notnull()) |
    (df['dbpedia_class'].notnull()) |
    (df['goa_class'].notnull())
]
display(df)
# get wikipedia info