<a href="https://colab.research.google.com/github/alisonmitchell/Biomedical-Knowledge-Graph/blob/main/04_Named_Entity_Recognition/spaCy_scispaCy.ipynb"
   target="_parent">
   <img src="https://colab.research.google.com/assets/colab-badge.svg"
      alt="Open in Colab">
</a>

# spaCy and scispaCy

## 1. Introduction

The second step in the information extraction pipeline, after coreference resolution, is named entity recognition (NER), a subtask that identifies named entities in unstructured text and classifies them into predefined categories. In the biomedical domain, these categories include drugs, diseases, genes and proteins.
We will use the scispaCy NER models, which are trained on biomedical corpora, to extract and label entities.

The next problem to solve is entity disambiguation: accurately identifying and distinguishing between entities with similar names or references so that the correct entity is recognised in a given context. We will do this using entity linking, the next subtask in the pipeline, which detects relevant entities and maps them to concepts in a target knowledge base. scispaCy provides an `EntityLinker` pipeline component with five supported linkers to biomedical knowledge bases, which we will use to resolve entities to concept unique identifiers and to return match scores and descriptions as ground truth.

## 2. Install/import libraries

In [None]:
!pip install spacy scispacy swifter

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
# scispaCy small model
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz
# scispaCy NER models
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bionlp13cg_md-0.5.4.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_jnlpba_md-0.5.4.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_craft_md-0.5.4.tar.gz

In [None]:
import pandas as pd
import pickle
import spacy
import scispacy
import itertools
import json
import swifter
import warnings
warnings.filterwarnings("ignore")

from collections import Counter, defaultdict
from spacy import displacy
from scispacy.abbreviation import AbbreviationDetector
from scispacy.linking import EntityLinker
from scispacy.hyponym_detector import HyponymDetector

## 3. Load data

We will load the dataset containing the columns added after coreference resolution.

In [None]:
with open('2024-07-25_pmc_arxiv_full_sent_text_spacy_sent_coref_df.pickle', 'rb') as f:
  pmc_arxiv_full_sent_text_spacy_fastcoref = pickle.load(f)

In [None]:
len(pmc_arxiv_full_sent_text_spacy_fastcoref)

10

In [None]:
# convert sentence-tokenised coreference resolved text column to a list
all_sent_coref_text = pmc_arxiv_full_sent_text_spacy_fastcoref.sent_coref_text.tolist()

In [None]:
# print number of sentences in each article
for i in all_sent_coref_text:
    print(len(i))

242
200
138
338
223
178
127
484
169
251


In [None]:
# first article in dataset
pmc_arxiv_full_sent_text_spacy_fastcoref.sent_coref_text[0]

['Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery.',
 'In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets.',
 'molecules that are waiting for approval for new pathways of action and targets are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted.',
 'The definition of the term drug repurposing has been endorsed by scholars and used by scholars.',
 'It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switchin

## 4. spaCy

We will load a spaCy model to visualise the dependency parse and named entities, and to extract noun phrases and verbs from the text. This will give a high-level overview of potential biomedical named entities and the relations between them.

In [None]:
# load small English spaCy model
nlp = spacy.load("en_core_web_sm")

In [None]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

We will iterate over the sentence-tokenised text and append the processed Doc objects to a list.

In [None]:
%%time

doc_list = []

for sents in all_sent_coref_text:
    docs = nlp.pipe(sents)
    doc_list.append(list(docs))

CPU times: user 11.4 s, sys: 320 ms, total: 11.7 s
Wall time: 12 s


In [None]:
len(doc_list)

10

### 4.1 Dependency parse

spaCy features a fast and accurate syntactic dependency parser and comes with the built-in displaCy dependency visualiser to which you can pass one or more Doc objects and view the visualisation.

The dependency parse can be a useful tool for information extraction, especially when combined with other predictions like named entities. It is possible to extract labelled entities and then use the dependency parse to find the noun phrase they are referring to.






In [None]:
# visualise dependency parse for first sentence
displacy.render(next(doc_list[0][0].sents), style='dep', jupyter=True)

The visualisation shows the syntactic dependencies clearly. We can also iterate over the noun phrases and verbs in the sentence and extract them programmatically.

In [None]:
# Analyse syntax
print("Noun phrases:", [chunk.text for chunk in doc_list[0][0].noun_chunks])
print("Verbs:", [token.lemma_ for token in doc_list[0][0] if token.pos_ == "VERB"])

Noun phrases: ['Sir James Black', 'a winner', 'the 1988 Nobel Prize', 'the 21st century', 'drug repurposing strategies', 'an important place', 'the future', 'new drug discovery']
Verbs: ['recognize', 'occupy']


### 4.2 Noun phrases

We can extract the base noun phrases, or [noun chunks](https://spacy.io/usage/linguistic-features#noun-chunks), from the list of Doc objects by iterating over the `Doc.noun_chunks` property.

A noun chunk is a noun plus the words describing the noun.





In [None]:
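def extract_noun_phrases(docs):
    # collect the noun chunks of each Doc as a list of strings
    noun_phrase_list = []
    for doc in docs:
        nouns = [chunk.text for chunk in doc.noun_chunks]
        noun_phrase_list.append(nouns)
    return noun_phrase_list

In [None]:
# the result has a different name from the function so that this cell can be re-run
noun_phrases = list(map(extract_noun_phrases, doc_list))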

In [None]:
noun_phrases

[[['Sir James Black',
   'a winner',
   'the 1988 Nobel Prize',
   'the 21st century',
   'drug repurposing strategies',
   'an important place',
   'the future',
   'new drug discovery'],
  ['In 2004, Ted T. Ashburn',
   '.',
   'previous research',
   'a general approach',
   'drug development',
   'new indications',
   'approved drugs',
   'molecules',
   'that',
   'approval',
   'new pathways',
   'action',
   'targets'],
  ['molecules',
   'that',
   'approval',
   'new pathways',
   'action',
   'targets',
   'clinical trials',
   'sufficient efficacy',
   'the treatment',
   'the disease'],
  ['The definition', 'the term drug repurposing', 'scholars', 'scholars'],
  ['It',
   'the synonyms',
   'academics',
   'drug repositioning',
   'drug rediscovery',
   'drug retasking',
   'therapeutic switching'],
  ['the research study',
   'Ted T. Ashburn',
   '.',
   'Allarakhia et al',
   '.',
   'the starting materials',
   'drug',
   'products',
   'that',
   'commercial reasons',
 

In [None]:
with open('2023-06-03_all_sent_coref_text_noun_phrases_0-10.pickle', 'wb') as f:
  pickle.dump(noun_phrases, f)

### 4.3 Verbs

After tokenisation, spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context.

We can iterate over the tokens in the Doc objects, keep those tagged as verbs, and extract the lemma, or base form, of each word. The lemmatiser will reduce 'repurposing' to 'repurpose', for example, but the output still gives an indication of the verbs in the text. These may inform the later Relation Extraction step in the pipeline, when it comes to extracting subject-verb-object triples.



In [None]:
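def extract_verbs(docs):
    # collect the lemmas of all tokens tagged VERB in each Doc
    verb_list = []
    for doc in docs:
        doc_verbs = [token.lemma_ for token in doc if token.pos_ == "VERB"]
        verb_list.append(doc_verbs)
    return verb_list

In [None]:
# the result has a different name from the function so that this cell can be re-run
verbs = list(map(extract_verbs, doc_list))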

In [None]:
verbs

[[['recognize', 'repurpose', 'occupy'],
  ['summarize', 'develop', 'use', 'repurpose', 'look', 'approve', 'wait'],
  ['wait', 'show', 'target'],
  ['endorse', 'use'],
  ['point', 'repurpose', 'use', 'include'],
  ['expand', 'reposition', 'include', 'discontinue', 'expire'],
  ['lie'],
  ['use',
   'exist',
   'know',
   'develop',
   'use',
   'reduce',
   'increase',
   'provide',
   'make',
   'win'],
  ['generate'],
  ['rely', 'base'],
  ['fragment'],
  ['combine', 'develop'],
  ['promote', 'make'],
  ['classify', 'depend'],
  ['rely', 'provide'],
  ['make', 'include', 'develop', 'report'],
  ['find', 'carry', 'hold'],
  ['perform'],
  ['carry', 'include'],
  ['reposition', 'transfer'],
  ['require'],
  ['include', 'aid'],
  [],
  ['increase', 'establish'],
  ['translate', 'launch', 'build', 'explore'],
  ['reposition', 'present', 'need', 'improve', 'remain', 'deny'],
  ['repurpose', 'provide', 'repurpose'],
  ['include',
   'seek',
   'repurpose',
   'seek',
   'obtain',
   'aid',


In [None]:
with open('2023-06-03_all_sent_coref_text_verbs_0-10.pickle', 'wb') as f:
  pickle.dump(verbs, f)

We will do some basic statistical analysis and see which are the most common verbs by frequency in each article.

In [None]:
def get_num_unique_verbs(verb_list):

    num_unique_verbs = Counter([verb for verbs in verb_list for verb in verbs]).most_common()

    return num_unique_verbs

In [None]:
num_unique_verbs = list(map(get_num_unique_verbs, verbs))
num_unique_verbs

[[('use', 28),
  ('publish', 27),
  ('show', 26),
  ('have', 25),
  ('follow', 21),
  ('relate', 18),
  ('include', 17),
  ('reposition', 17),
  ('repurpose', 16),
  ('base', 16),
  ('cite', 16),
  ('develop', 14),
  ('combine', 13),
  ('remain', 12),
  ('provide', 9),
  ('obtain', 9),
  ('be', 9),
  ('increase', 8),
  ('make', 8),
  ('analyze', 8),
  ('rank', 8),
  ('appear', 8),
  ('find', 7),
  ('become', 7),
  ('indicate', 7),
  ('seek', 6),
  ('study', 6),
  ('note', 6),
  ('perform', 5),
  ('establish', 5),
  ('need', 5),
  ('improve', 5),
  ('produce', 5),
  ('take', 5),
  ('represent', 5),
  ('target', 4),
  ('reveal', 4),
  ('contribute', 4),
  ('divide', 4),
  ('reflect', 4),
  ('suggest', 4),
  ('correspond', 4),
  ('recognize', 3),
  ('know', 3),
  ('reduce', 3),
  ('classify', 3),
  ('report', 3),
  ('explore', 3),
  ('aforementione', 3),
  ('assess', 3),
  ('cover', 3),
  ('search', 3),
  ('define', 3),
  ('enter', 3),
  ('merge', 3),
  ('address', 3),
  ('identify', 3),


In [None]:
with open('2023-06-03_all_sent_coref_text_verbs_most_common_0-10.pickle', 'wb') as f:
  pickle.dump(num_unique_verbs, f)
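As a minimal sketch of the counting step, the toy example below applies the same `Counter(...).most_common()` pattern to a small hand-made nested list with one list of verb lemmas per sentence. The lemmas are illustrative placeholders, not values taken from the corpus.

```python
from collections import Counter

# Toy stand-in for one article: a list of sentences, each a list of verb lemmas.
article_verbs = [
    ["use", "show"],
    ["use", "develop", "use"],
    [],  # a sentence without verbs contributes nothing
    ["show"],
]

# Flatten the sentences and count each lemma, most frequent first.
verb_counts = Counter(verb for sent in article_verbs for verb in sent).most_common()
print(verb_counts)  # [('use', 3), ('show', 2), ('develop', 1)]
```

Because `most_common()` returns `(lemma, count)` tuples already sorted by count, slicing the result (for example `verb_counts[:10]`) gives the top verbs for an article.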

We will flatten the nested list of verbs into a single list.

In [None]:
# flatten two levels of nesting: articles -> sentences -> verbs
verb_list_merged = list(itertools.chain.from_iterable(itertools.chain.from_iterable(verbs)))
verb_list_merged

['recognize',
 'repurpose',
 'occupy',
 'summarize',
 'develop',
 'use',
 'repurpose',
 'look',
 'approve',
 'wait',
 'wait',
 'show',
 'target',
 'endorse',
 'use',
 'point',
 'repurpose',
 'use',
 'include',
 'expand',
 'reposition',
 'include',
 'discontinue',
 'expire',
 'lie',
 'use',
 'exist',
 'know',
 'develop',
 'use',
 'reduce',
 'increase',
 'provide',
 'make',
 'win',
 'generate',
 'rely',
 'base',
 'fragment',
 'combine',
 'develop',
 'promote',
 'make',
 'classify',
 'depend',
 'rely',
 'provide',
 'make',
 'include',
 'develop',
 'report',
 'find',
 'carry',
 'hold',
 'perform',
 'carry',
 'include',
 'reposition',
 'transfer',
 'require',
 'include',
 'aid',
 'increase',
 'establish',
 'translate',
 'launch',
 'build',
 'explore',
 'reposition',
 'present',
 'need',
 'improve',
 'remain',
 'deny',
 'repurpose',
 'provide',
 'repurpose',
 'include',
 'seek',
 'repurpose',
 'seek',
 'obtain',
 'aid',
 'make',
 'convolute',
 'consume',
 'solve',
 'aforementione',
 'study',

In [None]:
len(verb_list_merged)

7418
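Two passes of `chain.from_iterable` are needed because the verbs are nested two levels deep: articles contain sentences, and sentences contain lemmas. A minimal sketch on toy data (the lemmas are placeholders chosen for illustration):

```python
from itertools import chain

# Toy stand-in for `verbs`: articles -> sentences -> verb lemmas.
nested = [
    [["recognize", "occupy"], ["summarize"]],  # article 1
    [["bind"], [], ["inhibit", "bind"]],       # article 2
]

# The first pass yields every sentence across all articles;
# the second pass yields every lemma across all sentences.
sentences = chain.from_iterable(nested)
flat = list(chain.from_iterable(sentences))

print(flat)       # ['recognize', 'occupy', 'summarize', 'bind', 'inhibit', 'bind']
print(len(flat))  # 6
```

Duplicates are kept at this stage, which is why the merged list is much longer than the list of unique verbs.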

We will then remove duplicates and sort the unique verbs alphabetically.

In [None]:
verb_list_sorted = sorted(set(verb_list_merged))
verb_list_sorted

['-approve',
 '-cause',
 '-initiate',
 'E484K.',
 'KEGG',
 'abbreviate',
 'abovementione',
 'absorb',
 'accelerate',
 'accentuate',
 'accept',
 'access',
 'accompany',
 'accomplish',
 'accord',
 'account',
 'accrue',
 'achieve',
 'acknowledge',
 'acquire',
 'act',
 'activate',
 'add',
 'address',
 'adjust',
 'administer',
 'admit',
 'adopt',
 'affect',
 'aforementione',
 'age',
 'agglomerate',
 'aggregate',
 'agree',
 'aid',
 'aim',
 'algorithm',
 'align',
 'alleviate',
 'allocate',
 'allow',
 'allude',
 'alter',
 'amalgamate',
 'ameliorate',
 'amplify',
 'analyse',
 'analyze',
 'anchor',
 'angiotensin',
 'angiotensin‐converte',
 'animal',
 'ankylose',
 'annotate',
 'announce',
 'antagonize',
 'antibodie',
 'anticipate',
 'appeal',
 'appear',
 'apply',
 'approach',
 'approve',
 'arise',
 'arrive',
 'ask',
 'assay',
 'assemble',
 'assess',
 'assign',
 'assist',
 'associate',
 'assume',
 'assure',
 'attach',
 'attack',
 'attain',
 'attempt',
 'attenuate',
 'attract',
 'augment',
 'author

In [None]:
len(verb_list_sorted)

837

In [None]:
with open('2023-06-03_all_sent_coref_text_verbs_0-10_merged_sorted.pickle', 'wb') as f:
  pickle.dump(verb_list_sorted, f)
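The sorted list starts with entries such as '-approve', 'E484K.' and 'KEGG' because Python orders strings by code point, so punctuation and upper-case letters sort before lower-case letters. The sketch below reproduces that ordering on a few lemmas from the output above and shows one possible way to drop the most obvious noise. The helper `is_plausible_verb` is hypothetical and is not part of the original notebook.

```python
def is_plausible_verb(lemma: str) -> bool:
    # Hypothetical heuristic: keep lemmas made only of lower-case letters.
    return lemma.isalpha() and lemma.islower()


sample = ["use", "KEGG", "-approve", "E484K.", "bind", "use", "angiotensin"]

# Deduplicate, then sort by code point (punctuation < upper case < lower case).
ordered = sorted(set(sample))
print(ordered)
# ['-approve', 'E484K.', 'KEGG', 'angiotensin', 'bind', 'use']

filtered = [lemma for lemma in ordered if is_plausible_verb(lemma)]
print(filtered)
# ['angiotensin', 'bind', 'use']
```

A character-level filter of this kind cannot remove mis-tagged nouns such as 'angiotensin', so the list would still need manual review before being used for relation extraction.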

We can now extract the most common verbs by frequency across the entire corpus.

In [None]:
def get_num_unique_verbs(verb_list):

    num_unique_verbs = Counter(verb_list).most_common()

    return num_unique_verbs

In [None]:
num_unique_verbs = get_num_unique_verbs(verb_list_merged)
num_unique_verbs

[('use', 388),
 ('base', 279),
 ('repurpose', 253),
 ('show', 187),
 ('have', 185),
 ('identify', 164),
 ('bind', 127),
 ('include', 113),
 ('approve', 102),
 ('find', 96),
 ('be', 92),
 ('follow', 83),
 ('predict', 81),
 ('develop', 66),
 ('provide', 65),
 ('involve', 65),
 ('know', 64),
 ('target', 60),
 ('relate', 58),
 ('lead', 57),
 ('take', 57),
 ('exist', 56),
 ('reposition', 54),
 ('make', 51),
 ('screen', 50),
 ('report', 49),
 ('obtain', 49),
 ('perform', 48),
 ('compare', 46),
 ('utilize', 46),
 ('suggest', 45),
 ('apply', 45),
 ('treat', 45),
 ('consider', 43),
 ('present', 42),
 ('cause', 39),
 ('study', 38),
 ('select', 38),
 ('interact', 38),
 ('enrich', 37),
 ('inhibit', 36),
 ('do', 36),
 ('evaluate', 34),
 ('publish', 34),
 ('combine', 33),
 ('analyze', 33),
 ('associate', 33),
 ('employ', 33),
 ('reveal', 32),
 ('determine', 30),
 ('result', 30),
 ('increase', 29),
 ('contain', 29),
 ('need', 28),
 ('play', 28),
 ('discover', 28),
 ('reduce', 27),
 ('represent', 27),

In [None]:
with open('2023-06-03_all_sent_coref_text_verbs_most_common_0-10_merged.pickle', 'wb') as f:
  pickle.dump(num_unique_verbs, f)

### 4.4 Named Entity Recognition

The [spaCy documentation](https://spacy.io/usage/linguistic-features#named-entities-101) defines a named entity as a "real-world object" that's assigned a name - for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction.

Named entities are available as the `ents` property of a Doc. We will iterate over the entities in the list of Doc objects and extract each named entity together with its entity type, using the `label_` attribute.




In [None]:
# Find named entities and entity types

def find_entities(docs):
    found_entities = []
    for doc in docs:
        for entity in doc.ents:
            found_entity = entity.text, entity.label_
            found_entities.append(found_entity)
    return found_entities

In [None]:
entities = list(map(find_entities, doc_list))

In [None]:
entities

[[('James Black', 'PERSON'),
  ('1988', 'DATE'),
  ('Nobel Prize', 'WORK_OF_ART'),
  ('the 21st century', 'DATE'),
  ('2004', 'DATE'),
  ('Ted T. Ashburn', 'PERSON'),
  ('Ted T. Ashburn', 'PERSON'),
  ('Allarakhia', 'NORP'),
  ('the 1990s', 'DATE'),
  ('three', 'CARDINAL'),
  ('1', 'CARDINAL'),
  ('2', 'CARDINAL'),
  ('3', 'CARDINAL'),
  ('the beginning of the 21st century', 'DATE'),
  ('DrugBank', 'ORG'),
  ('Cmap', 'GPE'),
  ('PDB', 'ORG'),
  ('EK-DRD', 'ORG'),
  ('DREIMT', 'ORG'),
  ('DrugSig', 'ORG'),
  ('2.0', 'CARDINAL'),
  ('the last few years', 'DATE'),
  ('only 10%', 'PERCENT'),
  ('hundreds of millions', 'CARDINAL'),
  ('hundreds of millions', 'CARDINAL'),
  ('one', 'CARDINAL'),
  ('two', 'CARDINAL'),
  ('Today', 'DATE'),
  ('the United Kingdom', 'GPE'),
  ('the United States', 'GPE'),
  ('Netherlands', 'GPE'),
  ('Bibliometrics', 'ORG'),
  ('1', 'CARDINAL'),
  ('2', 'CARDINAL'),
  ('3', 'CARDINAL'),
  ('4', 'CARDINAL'),
  ('5', 'CARDINAL'),
  ('Essential Science Indicators',

In [None]:
with open('2023-06-03_all_sent_coref_text_entities_0-10.pickle', 'wb') as f:
  pickle.dump(entities, f)

We will use spaCy's built-in displaCy named entity visualiser to highlight the named entities and their labels in the text.

In [None]:
spacy.displacy.render(doc_list[0][0], style="ent",jupyter=True)

We used the small web model which was not trained on biomedical data so it has not labelled domain-specific terms such as 'drug', 'drug repurposing' or 'drug discovery'. It has also labelled 'Nobel Prize' as `WORK_OF_ART`.

The documentation advises that because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.
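To see which entity types dominate the general-purpose model's output, the `(text, label)` tuples can be tallied by label. The sketch below does this for the first ten tuples printed above; running it on the full `entities` list would require flattening the per-article lists first.

```python
from collections import Counter

# First ten (text, label) tuples from the en_core_web_sm output shown above.
sample_entities = [
    ("James Black", "PERSON"),
    ("1988", "DATE"),
    ("Nobel Prize", "WORK_OF_ART"),
    ("the 21st century", "DATE"),
    ("2004", "DATE"),
    ("Ted T. Ashburn", "PERSON"),
    ("Ted T. Ashburn", "PERSON"),
    ("Allarakhia", "NORP"),
    ("the 1990s", "DATE"),
    ("three", "CARDINAL"),
]

# Count how often each label occurs, ignoring the entity text.
label_counts = Counter(label for _, label in sample_entities)
print(label_counts.most_common(2))  # [('DATE', 4), ('PERSON', 3)]
```

In this sample every label is a general-purpose type (dates, people, numbers), none of which corresponds to a biomedical category.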

## 5. scispaCy

scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text. The models are built on spaCy and trained on biomedical data for POS tagging, dependency parsing and NER. We will load the `en_core_sci_sm` model to extract noun phrases and named entities.

In [None]:
# Load small English scispaCy model
nlp = spacy.load("en_core_sci_sm")

In [None]:
nlp.pipe_names

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner']

We will iterate over the sentence-tokenised text and append the processed Doc objects to a list.

In [None]:
%%time

doc_list = []

for sents in all_sent_coref_text:
    docs = nlp.pipe(sents)
    doc_list.append(list(docs))

CPU times: user 13.5 s, sys: 345 ms, total: 13.8 s
Wall time: 17.1 s


In [None]:
len(doc_list)

10

In [None]:
with open('2024-07-31_scispacy_fastcoref_sent_coref_text_doc_list.pickle', 'wb') as f:
    pickle.dump(doc_list, f)

### 5.1 Noun phrases

In [None]:
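def extract_noun_phrases(docs):
    # collect the noun chunks of each Doc as a list of strings
    noun_phrase_list = []
    for doc in docs:
        nouns = [chunk.text for chunk in doc.noun_chunks]
        noun_phrase_list.append(nouns)
    return noun_phrase_list

In [None]:
# the result has a different name from the function so that this cell can be re-run
noun_phrases = list(map(extract_noun_phrases, doc_list))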

In [None]:
# scispaCy en_core_sci_sm output
noun_phrases

[[['Sir James Black',
   'a winner',
   'drug repurposing strategies',
   'an important place'],
  ['Ted T. Ashburn',
   'a general approach',
   'development',
   'drug repurposing',
   'that'],
  ['molecules', 'that', 'sufficient efficacy'],
  ['The definition'],
  ['It',
   'the synonyms',
   'drug repositioning',
   'drug rediscovery',
   'drug redirecting',
   'drug retasking',
   'therapeutic switching'],
  ['the starting materials', 'products', 'that'],
  ['the difficulty'],
  ['existing knowledge',
   'the time',
   'risk',
   'cost',
   'a drug',
   'drug repositioning',
   'the efficiency',
   'economics',
   'a better risk–reward trade-off',
   'it',
   'the favor'],
  ['the repositioning',
   'the development',
   'cessation',
   'new applications',
   'chronic graft-versus-host disease',
   'intense interest'],
  ['These classic success stories'],
  ['drug repositioning', 'success'],
  [],
  ['cheminformatics',
   'bioinformatics',
   'systems biology',
   'genomics',
   '

In [None]:
with open('2024-07-31_scispacy_all_sent_coref_text_noun_phrases_0-10.pickle', 'wb') as f:
  pickle.dump(noun_phrases, f)

The output differs noticeably from the noun phrases extracted previously using the spaCy `en_core_web_sm` model: for the first sentence, scispaCy returns four noun phrases where the web model returned eight. We will include the latter again below for comparison.

In [None]:
# spaCy en_core_web_sm output
noun_phrases

[[['Sir James Black',
   'a winner',
   'the 1988 Nobel Prize',
   'the 21st century',
   'drug repurposing strategies',
   'an important place',
   'the future',
   'new drug discovery'],
  ['In 2004, Ted T. Ashburn',
   '.',
   'previous research',
   'a general approach',
   'drug development',
   'new indications',
   'approved drugs',
   'molecules',
   'that',
   'approval',
   'new pathways',
   'action',
   'targets'],
  ['molecules',
   'that',
   'approval',
   'new pathways',
   'action',
   'targets',
   'clinical trials',
   'sufficient efficacy',
   'the treatment',
   'the disease'],
  ['The definition', 'the term drug repurposing', 'scholars', 'scholars'],
  ['It',
   'the synonyms',
   'academics',
   'drug repositioning',
   'drug rediscovery',
   'drug retasking',
   'therapeutic switching'],
  ['the research study',
   'Ted T. Ashburn',
   '.',
   'Allarakhia et al',
   '.',
   'the starting materials',
   'drug',
   'products',
   'that',
   'commercial reasons',
 

The scispaCy model output omits several noun phrases extracted by the spaCy web model, including date expressions such as 'the 1988 Nobel Prize' and 'the 21st century', but it does include 'drug redirecting' and 'chronic graft-versus-host disease', which the web model omitted.
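The difference for a single sentence can be made explicit with a membership test. A minimal sketch, using the noun chunks of the first sentence exactly as printed in the two outputs above:

```python
# Noun chunks for the first sentence, copied from the two outputs above.
spacy_chunks = [
    "Sir James Black", "a winner", "the 1988 Nobel Prize", "the 21st century",
    "drug repurposing strategies", "an important place", "the future",
    "new drug discovery",
]
scispacy_chunks = [
    "Sir James Black", "a winner", "drug repurposing strategies",
    "an important place",
]

# Chunks found by one model but not the other, in their original order.
only_spacy = [c for c in spacy_chunks if c not in set(scispacy_chunks)]
only_scispacy = [c for c in scispacy_chunks if c not in set(spacy_chunks)]

print(only_spacy)
# ['the 1988 Nobel Prize', 'the 21st century', 'the future', 'new drug discovery']
print(only_scispacy)
# []
```

For this sentence the scispaCy chunks are a strict subset of the spaCy chunks. The comparison relies on exact string equality, so chunks that differ only in a determiner or modifier would be reported as different.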

### 5.2 Named Entity Recognition

We will extract the entity annotations and labels using the scispaCy `en_core_sci_sm` model.

In [None]:
def find_entities(docs):
    found_entities = []
    for doc in docs:
        for entity in doc.ents:
            found_entity = entity.text, entity.label_
            found_entities.append(found_entity)
    return found_entities

In [None]:
entities = list(map(find_entities, doc_list))

In [None]:
# scispaCy en_core_sci_sm output
entities

[[('Sir James Black', 'ENTITY'),
  ('winner', 'ENTITY'),
  ('Nobel Prize', 'ENTITY'),
  ('drug repurposing strategies', 'ENTITY'),
  ('drug discovery', 'ENTITY'),
  ('Ted', 'ENTITY'),
  ('research', 'ENTITY'),
  ('general approach', 'ENTITY'),
  ('drug development', 'ENTITY'),
  ('drug repurposing', 'ENTITY'),
  ('retrospectively looking', 'ENTITY'),
  ('indications', 'ENTITY'),
  ('drugs', 'ENTITY'),
  ('molecules', 'ENTITY'),
  ('waiting', 'ENTITY'),
  ('approval', 'ENTITY'),
  ('pathways', 'ENTITY'),
  ('action', 'ENTITY'),
  ('targets', 'ENTITY'),
  ('molecules', 'ENTITY'),
  ('waiting', 'ENTITY'),
  ('approval', 'ENTITY'),
  ('pathways', 'ENTITY'),
  ('action', 'ENTITY'),
  ('targets', 'ENTITY'),
  ('clinical trials', 'ENTITY'),
  ('efficacy', 'ENTITY'),
  ('treatment', 'ENTITY'),
  ('disease', 'ENTITY'),
  ('definition', 'ENTITY'),
  ('term', 'ENTITY'),
  ('drug repurposing', 'ENTITY'),
  ('scholars', 'ENTITY'),
  ('scholars', 'ENTITY'),
  ('drug repurposing', 'ENTITY'),
  ('acad

In [None]:
with open('2024-07-31_scispacy_all_sent_coref_text_entities_0-10.pickle', 'wb') as f:
  pickle.dump(entities, f)

The model identifies entity spans but assigns them all the generic label `ENTITY` rather than a specific entity type, unlike the spaCy web model, whose output is shown below for comparison.

In [None]:
# spaCy en_core_web_sm output
entities

[[('James Black', 'PERSON'),
  ('1988', 'DATE'),
  ('Nobel Prize', 'WORK_OF_ART'),
  ('the 21st century', 'DATE'),
  ('2004', 'DATE'),
  ('Ted T. Ashburn', 'PERSON'),
  ('Ted T. Ashburn', 'PERSON'),
  ('Allarakhia', 'NORP'),
  ('the 1990s', 'DATE'),
  ('three', 'CARDINAL'),
  ('1', 'CARDINAL'),
  ('2', 'CARDINAL'),
  ('3', 'CARDINAL'),
  ('the beginning of the 21st century', 'DATE'),
  ('DrugBank', 'ORG'),
  ('Cmap', 'GPE'),
  ('PDB', 'ORG'),
  ('EK-DRD', 'ORG'),
  ('DREIMT', 'ORG'),
  ('DrugSig', 'ORG'),
  ('2.0', 'CARDINAL'),
  ('the last few years', 'DATE'),
  ('only 10%', 'PERCENT'),
  ('hundreds of millions', 'CARDINAL'),
  ('hundreds of millions', 'CARDINAL'),
  ('one', 'CARDINAL'),
  ('two', 'CARDINAL'),
  ('Today', 'DATE'),
  ('the United Kingdom', 'GPE'),
  ('the United States', 'GPE'),
  ('Netherlands', 'GPE'),
  ('Bibliometrics', 'ORG'),
  ('1', 'CARDINAL'),
  ('2', 'CARDINAL'),
  ('3', 'CARDINAL'),
  ('4', 'CARDINAL'),
  ('5', 'CARDINAL'),
  ('Essential Science Indicators',

We will use spaCy's built-in displaCy named entity visualiser to highlight the named entities and their labels in the text.

We can see in the first sentence that, although the entity types are not indicated, the scispaCy model has identified 'drug repurposing strategies' and 'drug discovery' as entities.

In [None]:
# entities for first sentence
spacy.displacy.render(doc_list[0][0], style="ent",jupyter=True)

In [None]:
# entities for first article
spacy.displacy.render(doc_list[0], style="ent",jupyter=True)

### 5.3 scispaCy NER models

We will load one of the scispaCy NER models, trained on the BC5CDR corpus, to identify `DISEASE` and `CHEMICAL` entity types.

In [None]:
# load scispaCy NER model trained on the BC5CDR corpus
bc5cdr = spacy.load("en_ner_bc5cdr_md")

In [None]:
bc5cdr.pipe_names

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner']

We will iterate over the sentence-tokenised text and append the processed Doc objects to a list.

In [None]:
%%time

doc_list = []

for sents in all_sent_coref_text:
    docs = bc5cdr.pipe(sents)
    doc_list.append(list(docs))

CPU times: user 11.8 s, sys: 318 ms, total: 12.1 s
Wall time: 12.2 s


In [None]:
len(doc_list)

10

In [None]:
with open('2024-10-15_scispacy_fastcoref_sent_coref_text_doc_list_bc5cdr.pickle', 'wb') as f:
    pickle.dump(doc_list, f)

We will use the displaCy named entity visualiser to highlight the named entities and their labels in the text. The model should be able to identify drugs and diseases.

In [None]:
spacy.displacy.render(doc_list[0], style="ent",jupyter=True)

If we focus on a section that contains COVID-19 entities we can see that it has not identified them at all.

In [None]:
def get_entity_options():
    # restrict the visualiser to the two BC5CDR labels and give each a colour gradient
    entities = ["DISEASE", "CHEMICAL"]
    colors = {'DISEASE': 'linear-gradient(180deg, #66ffcc, #abf763)', 'CHEMICAL': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)'}
    options = {"ents": entities, "colors": colors}
    return options

options = get_entity_options()

displacy.render(doc_list[0][157:159], style='ent', options=options, jupyter=True)

The model was trained on the BC5CDR corpus, which was released in 2016 following the [BioCreative V](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4860626/) challenge in 2015, organised for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. The corpus consists of 1500 PubMed abstracts, all of which predate the pandemic, so the model is unable to annotate the COVID-19-related entities.

We will try separating the terms with commas to see if this makes a difference. The model might be able to identify 'Coronavirus' as a disease.

In [None]:
text = 'the top 30 most used author keywords include COVID-19, SARS-CoV-2, Coronavirus'

In [None]:
doc = bc5cdr(text)

In [None]:
displacy.render(doc, style='ent', options=options, jupyter=True)

Unsurprisingly, the model does not identify 'COVID-19' or 'SARS-CoV-2', and it does not label 'Coronavirus' either.

We will load the other available [scispaCy NER models](https://allenai.github.io/scispacy/), each of which is trained on a different corpus and identifies different entity types.

In [None]:
bionlp = spacy.load('en_ner_bionlp13cg_md')
jnlpba = spacy.load('en_ner_jnlpba_md')
craft = spacy.load('en_ner_craft_md')

In [None]:
models = {"bc5cdr": bc5cdr, "bionlp": bionlp, "jnlpba": jnlpba, "craft": craft}

For each model we will iterate over the NER labels to see which entity types it recognises.

In [None]:
for key, model in models.items():
    print(key)
    # list the entity labels that the model's NER component can predict
    for label in model.get_pipe('ner').labels:
        print(label)
    print("\n")

bc5cdr
CHEMICAL
DISEASE


bionlp
AMINO_ACID
ANATOMICAL_SYSTEM
CANCER
CELL
CELLULAR_COMPONENT
DEVELOPING_ANATOMICAL_STRUCTURE
GENE_OR_GENE_PRODUCT
IMMATERIAL_ANATOMICAL_ENTITY
MULTI_TISSUE_STRUCTURE
ORGAN
ORGANISM
ORGANISM_SUBDIVISION
ORGANISM_SUBSTANCE
PATHOLOGICAL_FORMATION
SIMPLE_CHEMICAL
TISSUE


jnlpba
CELL_LINE
CELL_TYPE
DNA
PROTEIN
RNA


craft
CHEBI
CL
GGP
GO
SO
TAXON
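The `options` dictionary defined earlier for `bc5cdr` names only `DISEASE` and `CHEMICAL`, so visualising the other models with distinct colours would need a mapping for their labels as well. The sketch below shows one possible way to build such a mapping for any label list. The helper `build_entity_options` and the palette are hypothetical; only the `{"ents": ..., "colors": ...}` layout is taken from the code above.

```python
from itertools import cycle


def build_entity_options(labels, palette):
    # Hypothetical helper: pair each label with a colour, reusing the
    # palette from the start when there are more labels than colours.
    colors = dict(zip(labels, cycle(palette)))
    return {"ents": list(labels), "colors": colors}


# Labels printed above for the jnlpba model.
jnlpba_labels = ["CELL_LINE", "CELL_TYPE", "DNA", "PROTEIN", "RNA"]
palette = ["#66ffcc", "#aa9cfc", "#fc9ce7"]  # illustrative colours

jnlpba_options = build_entity_options(jnlpba_labels, palette)
print(jnlpba_options["colors"]["PROTEIN"])  # #66ffcc
```

The resulting dictionary could then be passed as `options=jnlpba_options` to `displacy.render`, in the same way as the `bc5cdr` options.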




In [None]:
# view a sample of what each model identifies
text_sample = pmc_arxiv_full_sent_text_spacy_fastcoref.coref_text[4][:2500]
print(text_sample)

The 2019 novel coronavirus, now dubbed SARS-CoV-2, has led to a global pandemic as declared by the World Health Organization WHO on 11 March 2020. Many studies have described the complex immune response associated with viral infection, leading to the identification of several clinical and immunological features. The 2019 novel coronavirus, now dubbed SARS-CoV-2 shares The 2019 novel coronavirus, now dubbed SARS-CoV-2's mechanism of viral entry with other viruses of the Coronaviridae family as its mechanism of viral entry is mediated by the spike S glycoprotein which binds to angiotensin converting enzyme 2 ACE2 receptors that are localized in a variety of cell types, such as in the heart, liver, kidney, but most abundantly in the lungs and respiratory system i.e., alveolar epithelial cells and capillary endothelial cells, leading to a wide range of various symptoms experienced by COVID-19 patients. Despite this variance, it is of note that The 2019 novel coronavirus, now dubbed SARS-Co

For each model we will iterate over the entities in the Doc object and extract entity and entity type for the sample text.

In [None]:
for key, model in models.items():
    doc = model(text_sample)
    ents = list(doc.ents)
    print(key)
    for ent in ents:
        print(f"{ent.label_}: {ent.text}")
    print("\n")

bc5cdr
DISEASE: viral infection
CHEMICAL: angiotensin
DISEASE: infections
DISEASE: SIRS
DISEASE: acute respiratory distress syndrome
DISEASE: ARDS
DISEASE: COVID-19 infections
DISEASE: infections
DISEASE: infections
DISEASE: viral infections
DISEASE: SARS
DISEASE: respiratory distress
DISEASE: ARDS


bionlp
ORGANISM: coronavirus
ORGANISM: coronavirus
ORGANISM: coronavirus
ORGANISM: Coronaviridae
CELL: cell
ORGAN: heart
ORGAN: liver
ORGAN: kidney
ORGAN: lungs
CELL: alveolar epithelial cells
CELL: capillary endothelial cells
ORGANISM: patients
ORGANISM: coronavirus
SIMPLE_CHEMICAL: COVID-19
SIMPLE_CHEMICAL: COVID-19
SIMPLE_CHEMICAL: COVID-19
CANCER: acute respiratory distress
SIMPLE_CHEMICAL: COVID-19


jnlpba
PROTEIN: spike S glycoprotein
PROTEIN: angiotensin converting enzyme 2 ACE2 receptors
CELL_TYPE: alveolar epithelial cells
CELL_TYPE: capillary endothelial cells
PROTEIN: pro-inflammatory cytokines
PROTEIN: cytokine
PROTEIN: cytokine
PROTEIN: pro-inflammatory cytokines
PROTEIN: cyt

We can see that the models do not always label entities correctly, and that different models label the same entity differently. spaCy recognises entity types by asking the model for a prediction but, because statistical models depend on their training data, the predictions will not always be accurate without fine-tuning to a specific use case.

This time `bc5cdr` has correctly labelled the noun phrase 'COVID-19 infections' as `DISEASE` but again does not label 'COVID-19' or 'coronavirus'. The `bionlp` model has labelled 'COVID-19' as `SIMPLE_CHEMICAL` and 'acute respiratory distress' as `CANCER`.
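Disagreements between models can be found by grouping the labels each model assigns to the same entity text. A minimal sketch, using a handful of `(label, text)` pairs copied from the output above in place of real Doc objects:

```python
from collections import defaultdict

# A few (label, text) pairs per model, copied from the printed output above.
model_entities = {
    "bc5cdr": [("DISEASE", "viral infection"), ("CHEMICAL", "angiotensin")],
    "bionlp": [("CELL", "alveolar epithelial cells"), ("SIMPLE_CHEMICAL", "COVID-19")],
    "jnlpba": [("CELL_TYPE", "alveolar epithelial cells"), ("PROTEIN", "spike S glycoprotein")],
}

# Map each entity text to the label that each model gave it.
labels_by_text = defaultdict(dict)
for model_name, ents in model_entities.items():
    for label, text in ents:
        labels_by_text[text][model_name] = label

# Keep only the texts that were labelled by more than one model.
overlaps = {text: labels for text, labels in labels_by_text.items() if len(labels) > 1}
print(overlaps)
# {'alveolar epithelial cells': {'bionlp': 'CELL', 'jnlpba': 'CELL_TYPE'}}
```

The grouping uses exact string matching, so overlapping spans such as 'acute respiratory distress' and 'acute respiratory distress syndrome' are treated as different entities.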

We will extract the entities and labels for all models from all articles.





In [None]:
# Initialise a new DataFrame with the 'article_id' column
df_final = pmc_arxiv_full_sent_text_spacy_fastcoref[['article_id']].copy()

# Iterate over the models
for name, model in models.items():
    print(f"Processing model: {name}")

    # Apply the model to each article's text
    docs = pmc_arxiv_full_sent_text_spacy_fastcoref['coref_text'].apply(model)

    # Create a list to store entities for each document
    entities = []

    # Iterate over the processed docs
    for doc in docs:
        doc_entities = defaultdict(set)

        # Extract entities from the document
        for ent in doc.ents:
            doc_entities[ent.label_].add(ent.text)

        # Convert the set of entities into a list for each entity type
        for key, val in doc_entities.items():
            doc_entities[key] = list(val)

        entities.append(doc_entities)

    # Convert the list of dictionaries into a DataFrame
    entity_df = pd.DataFrame(entities)

    # Concatenate column-wise; rows are aligned on the index, not on article_id
    df_final = pd.concat([df_final, entity_df], axis=1)

    print(f"Finished processing {name}, columns added: {entity_df.columns}")

# df_final now contains the article_id and all the entity types extracted by each model
print(f"Final DataFrame columns: {df_final.columns}")

Processing model: bc5cdr
Finished processing bc5cdr, columns added: Index(['CHEMICAL', 'DISEASE'], dtype='object')
Processing model: bionlp
Finished processing bionlp, columns added: Index(['ORGANISM_SUBSTANCE', 'ORGAN', 'SIMPLE_CHEMICAL', 'CANCER',
       'CELLULAR_COMPONENT', 'MULTI_TISSUE_STRUCTURE', 'GENE_OR_GENE_PRODUCT',
       'CELL', 'ORGANISM', 'ANATOMICAL_SYSTEM', 'PATHOLOGICAL_FORMATION',
       'ORGANISM_SUBDIVISION', 'TISSUE', 'AMINO_ACID',
       'IMMATERIAL_ANATOMICAL_ENTITY'],
      dtype='object')
Processing model: jnlpba
Finished processing jnlpba, columns added: Index(['PROTEIN', 'DNA', 'CELL_LINE', 'RNA', 'CELL_TYPE'], dtype='object')
Processing model: craft
Finished processing craft, columns added: Index(['CHEBI', 'GGP', 'SO', 'CL', 'GO', 'TAXON'], dtype='object')
Final DataFrame columns: Index(['article_id', 'CHEMICAL', 'DISEASE', 'ORGANISM_SUBSTANCE', 'ORGAN',
       'SIMPLE_CHEMICAL', 'CANCER', 'CELLULAR_COMPONENT',
       'MULTI_TISSUE_STRUCTURE', 'GENE_OR_GENE

In [None]:
df_final

Unnamed: 0,article_id,CHEMICAL,DISEASE,ORGANISM_SUBSTANCE,ORGAN,SIMPLE_CHEMICAL,CANCER,CELLULAR_COMPONENT,MULTI_TISSUE_STRUCTURE,GENE_OR_GENE_PRODUCT,...,DNA,CELL_LINE,RNA,CELL_TYPE,CHEBI,GGP,SO,CL,GO,TAXON
0,PMC9549161,"[copublications, chloroquine, Statins, raltegr...","[drug–disease, neurological diseases, Cancer, ...",[Ted T.],"[organs, erectile, pulmonary]","[DTIs, hydroxychloroquine, technology-other, d...","[NIH, myeloma, Cancers, disease, cancer, tumor...","[matrix, EK-DRD, ESI]","[network, nodal]","[ACPP, V10, AJ, Talevi, A, Mt, IF, FX, p97 seg...",...,"[WOS Core Collection Database Citation, molecu...","[NIH, connecting lines, HAM, 10–20, Karolinska...",,,"[compounds, drug, chloroquine, hydroxychloroqu...","[ACPP, A SARS-CoV-2 protein, NPL4, Metformin, ...","[consensus, genetically, consensus docking-bas...","[cell, antitoxic]",[AJ],"[Ebola, coronavirus, coronaviruses drug, human..."
1,PMC9539342,"[haloperidol, hydroxychloroquine, carfilzomib,...","[SARS‐CoV‐2 infection, breast cancer Li, organ...","[broad‐spectrum, extracts]","[erectile, pulmonary, organoid, lung]","[anti‐HCoV, DTIs, haloperidol, hydroxychloroqu...","[antitumoral, NCATS, disease, 3D, colorectal c...","[DNA, genome]",,"[receptor–ligand, PARP1, Gln166, spike S recep...",...,"[HIV‐1 DNA, knowledge‐graph‐based DR, mol2vec,...","[human lung cell line, multi‐well cell culture...","[RNA‐dependent RNA polymerase, viral RNA]","[naïve T cells, human cells, host cells]","[haloperidol, compounds, acid, mol2vec, hydrox...","[Cathepsin L, SARS‐CoV‐2 Abdellatiif, SARS‐CoV...","[genes, DNA, synthetic compound, protease bind...","[cellular, cell, T cells, hepatitis C virus, m...","[membrane M protein, replicase complex]","[human protein, Ebola viral, human cells, C., ..."
2,PMC9357751,"[Molnupiravir, nucleoside, SARS-CoV-2, gamma, ...","[viral infection, infections, HHS, SARS-CoV-2 ...","[SARSCoV-2, serum]",,"[isopropyl, antiviral nucleoside, FDA, Food, C...","[disease, nasopharyngeal, NCT04746183, MERS co...","[EIDD-1931, NCT04746183]",,"[DR, MK-4482-002, delta]",...,,[Molnupiravir population],[viral RNA],,"[electron acceptor, drug, nucleoside, toxic, f...","[SARSCoV-2, SARSCoV-2 viral]","[RdRp enzyme, RdRp, part of the, RNA mutagenes...",[enterocytes],,"[nonhuman, individuals, remdesivir-resistant, ..."
3,PMC9346052,"[indinavir, Fostamatinib disodium, lopinavir c...","[viral infection, tumor necrosis, pulmonary le...","[plasma, cytoplasmic]","[heart, gastric]","[indinavir, infliximab, Fostamatinib disodium,...","[multilayer, T, myeloma, sections, NCT04315948...","[matrix, integral membrane, DNA, nucleoprotein...","[drug-disease network, neural network]","[casein kinase 2, Wuhan, CD147, glycogen synth...",...,"[beta coronaviruses, Lineage B.1.351, viral RN...","[Yuce, P.1, HeLa-ACE2 cells, Calu-3, Interfero...","[Single-stranded RNA ssRNA, genomic RNA, mRNAs...","[epithelial cells, B.1.526 lineage, human resp...","[compounds, pivoxetil, mRNAs, hydroxychloroqui...","[casein kinase 2, CD147, p38 MAPK mitogen-acti...","[confirmed Coronavirus, features, genes, domai...","[immune check-point inhibitors, cell, vascular...","[antibodies, operationsSome, envelope protein,...","[Ebola virus, nucleocapsid proteins, CoVs, vir..."
4,PMC9775208,"[histamine, loratadine, non-protein, antihista...","[viral infection, neurogenerative diseases, ac...",[blood],"[lung, heart, lungs, kidney, liver, brain]","[histamine, loratadine, histamine antagonists,...","[WGCNA, disease, cancer, CCs, transcriptome-dr...","[matrix, mitophagy-animal]","[PPI network, network, airway, PPI networks]","[M33, Genbank IDs, platelet-derived extracellu...",...,"[M1 genes, module hub genes, empty gene, term ...","[M46, M44, Th1, M1, Th1/Th2]",,"[alveolar epithelial cells, alveolar cells, Th...","[histamine, antimalarials, drug, loratadine, c...","[M33, platelet-derived extracellular vesicles,...","[genes, construct gene, sites, gene modules, M...","[cellular, Th cells, T-cell, platelet, cell, e...",[extracellular],"[individuals, viruses, C., murine, mouse, mice..."
5,PMC9527439,"[epigallocatechin gallate, Nitric oxide, hepar...","[viral infection, cancer and infectious diseas...","[jensenone, capmatinib, plasma, extracts]","[lung, pulmonary]","[Nitric oxide, heparin, baicalin, chlorpromazi...","[cancer, 3D, biomolecules]","[virus-cell membrane, surface, crown-like, mem...",,"[protease N 154, drug-receptor, nonstructural ...",...,[viral targets 34],[Vero E6 cell line],[viral RNA],[host cells],"[epigallocatechin gallate, compounds, acid, bi...","[ACE2, benzoylpinostrobin, chymotrypsin-like p...","[enzyme, natural product, matching, experiment...","[cell, cells, pro-inflammatory cytokines]","[membrane M protein, antibodies]","[Coronaviruses, viral RNA, mouse, plant metabo..."
6,PMC9729590,"[luminal, MGL, Bexarotene, abiraterone acetate...","[respiratory syndromes SARS, toxicity, obesity...",[gmmpbsa],,"[TYR129, ALA125, LEU207, abiraterone acetate, ...","[cutaneous T-cell lymphoma, non-small cell lun...",[nucleocapsids],,"[Wuhan, CYP17A1, MGL, Androgen, Oct, TYR732 am...",...,"[viral genome single-stranded RNA, A0, grid bo...","[6NUR, A0, HIS133, Oct 13]",[viral RNA],[human cells],"[compounds, bexarotene anticancer, Na+ ions, m...","[6NUR, RdRp-RNA protein, Androgen, abiraterone...","[RNA nucleotides, RdRp inhibitor drug, beta-sh...","[cell, nucleocapsids, T-cell, cells]",,"[Coronaviruses, human cells, viral genome sing..."
7,PMC9236981,"[Ebselen, hydroxychloroquine, AI, azithromycin...","[autoimmune, malignant, cancer, cardiovascular...",,"[lungs, renal, gape, pulmonary]","[DTIs, Ebselen, hydroxychloroquine, azithromyc...","[colon cancer, MCC, networks, PPIs, drug-targe...","[matrix, AUC, membrane, self-membrane]","[nodes, DTI networks, networks nodes, node, ne...","[PDBbind, UniHI, SARS2-DEG, PARP1, Arun Asif, ...",...,"[drug-virus pairs, Modified genes, E, maps hum...","[cultured non-human primate cells, human-deriv...","[enveloped single-stranded RNA, RNA transcript]","[alveolar epithelial cells, Covid-19, 25 poten...","[compounds, Ebselen, hydroxychloroquine, metab...","[ACE2, neurofibromin, PPIs, Few-Shot, Anti-Cov...","[gene products, consensus, genes, known, posit...","[human-derived cells, cellular, epithelial cel...","[complexes, self-membrane protein, membrane pr...","[human cells, human coronavirus, zoonotic, org..."
8,PMC9694939,"[nucleoside, 37542, NSP-12/Sofosbuvir, SARS-Co...","[Mouth Disease Virus FMDV, and Encephalomyocar...",,[cavity],"[−6.2, nucleoside, Sofosbuvir, NVT, Sofosbuvir...","[Mouth Disease Virus FMDV, −9.1, B.The, websit...","[matrix, neighbor, −6.5, genome]","[non-polar, Wall]","[−6.35, PROMALS3D, PolV, −8.2, C-terminal RdRp...",...,"[Coronaviruses's genome, conserved active site...",[T556],,[1.13 pairs],"[compounds, residue, nucleoside, protein, mole...","[Sofosbuvir and the, -C, Mg2+, -D, Sofosbuvir ...","[conserved active site, conserved domains, con...",[cellular],"[inside, complexes]","[Coronaviruses, Virus JEV, Human Enterovirus 7..."
9,PMC9556799,"[nucleoside, hydroxychloroquine, chlorpromazin...","[viral infection, acute respiratory syndrome c...",[blood],"[lung, lungs, heart, testis, pulmonary, kidney...","[nucleoside, hydroxychloroquine, CoV-host, chl...","[disease, CMAP, A., chronic myeloid leukemia, ...","[cell surface, CRISPR, virus-membrane, endosom...","[coronary heart, nodes, node, cerebrovascular,...","[GTEx, PLCB4, BioGRID, TYK2, OAS3, low-density...",...,"[10557 variants, 16 genes, interacting genes, ...",[CoV-infected cells],[degrades viral RNA],"[human cells, infected cells, alveolar type II...","[hydroxychloroquine, chlorpromazine, protein, ...","[PLCB4, TYK2, OAS3, low-density lipoprotein re...","[GTEx, genes, quantitative trait locus, known,...","[epithelial cells, cell, CoV-infected cells, e...","[cell surface, chromosome, virus-membrane fusi...","[virus-based, human protein, Vibrio cholerae, ..."


In [None]:
with open('2024-10-15_scispacy_fastcoref_sent_coref_text_model_labels_only.pickle', 'wb') as f:
    pickle.dump(df_final, f)

We will add a line of code that prefixes each entity label column with the model name, should this be required in future.

In [None]:
# Initialise a new DataFrame with the 'article_id' column
df_final = pmc_arxiv_full_sent_text_spacy_fastcoref[['article_id']].copy()

# Iterate over the models
for name, model in models.items():
    print(f"Processing model: {name}")

    # Apply the model to each article's text
    docs = pmc_arxiv_full_sent_text_spacy_fastcoref['coref_text'].apply(model)

    # Create a list to store entities for each document
    entities = []

    # Iterate over the processed docs
    for doc in docs:
        doc_entities = defaultdict(set)

        # Extract entities from the document
        for ent in doc.ents:
            doc_entities[ent.label_].add(ent.text)

        # Convert the set of entities into a list for each entity type
        for key, val in doc_entities.items():
            doc_entities[key] = list(val)

        entities.append(doc_entities)

    # Convert the list of dictionaries into a DataFrame
    entity_df = pd.DataFrame(entities)

    # Prefix entity columns with model when joining with df_final
    entity_df = entity_df.add_prefix(f"{name}_")

    # Concatenate column-wise; rows are aligned on the index, not on article_id
    df_final = pd.concat([df_final, entity_df], axis=1)

    print(f"Finished processing {name}, columns added: {entity_df.columns}")

# df_final now contains the article_id and all the entity types extracted by each model
print(f"Final DataFrame columns: {df_final.columns}")

Processing model: bc5cdr
Finished processing bc5cdr, columns added: Index(['bc5cdr_CHEMICAL', 'bc5cdr_DISEASE'], dtype='object')
Processing model: bionlp
Finished processing bionlp, columns added: Index(['bionlp_ORGANISM_SUBSTANCE', 'bionlp_ORGAN', 'bionlp_SIMPLE_CHEMICAL',
       'bionlp_CANCER', 'bionlp_CELLULAR_COMPONENT',
       'bionlp_MULTI_TISSUE_STRUCTURE', 'bionlp_GENE_OR_GENE_PRODUCT',
       'bionlp_CELL', 'bionlp_ORGANISM', 'bionlp_ANATOMICAL_SYSTEM',
       'bionlp_PATHOLOGICAL_FORMATION', 'bionlp_ORGANISM_SUBDIVISION',
       'bionlp_TISSUE', 'bionlp_AMINO_ACID',
       'bionlp_IMMATERIAL_ANATOMICAL_ENTITY'],
      dtype='object')
Processing model: jnlpba
Finished processing jnlpba, columns added: Index(['jnlpba_PROTEIN', 'jnlpba_DNA', 'jnlpba_CELL_LINE', 'jnlpba_RNA',
       'jnlpba_CELL_TYPE'],
      dtype='object')
Processing model: craft
Finished processing craft, columns added: Index(['craft_CHEBI', 'craft_GGP', 'craft_SO', 'craft_CL', 'craft_GO',
       'craft_TAXON

In [None]:
df_final

Unnamed: 0,article_id,bc5cdr_CHEMICAL,bc5cdr_DISEASE,bionlp_ORGANISM_SUBSTANCE,bionlp_ORGAN,bionlp_SIMPLE_CHEMICAL,bionlp_CANCER,bionlp_CELLULAR_COMPONENT,bionlp_MULTI_TISSUE_STRUCTURE,bionlp_GENE_OR_GENE_PRODUCT,...,jnlpba_DNA,jnlpba_CELL_LINE,jnlpba_RNA,jnlpba_CELL_TYPE,craft_CHEBI,craft_GGP,craft_SO,craft_CL,craft_GO,craft_TAXON
0,PMC9549161,"[copublications, chloroquine, Statins, raltegr...","[drug–disease, neurological diseases, Cancer, ...",[Ted T.],"[organs, erectile, pulmonary]","[DTIs, hydroxychloroquine, technology-other, d...","[NIH, myeloma, Cancers, disease, cancer, tumor...","[matrix, EK-DRD, ESI]","[network, nodal]","[ACPP, V10, AJ, Talevi, A, Mt, IF, FX, p97 seg...",...,"[WOS Core Collection Database Citation, molecu...","[NIH, connecting lines, HAM, 10–20, Karolinska...",,,"[compounds, drug, chloroquine, hydroxychloroqu...","[ACPP, A SARS-CoV-2 protein, NPL4, Metformin, ...","[consensus, genetically, consensus docking-bas...","[cell, antitoxic]",[AJ],"[Ebola, coronavirus, coronaviruses drug, human..."
1,PMC9539342,"[haloperidol, hydroxychloroquine, carfilzomib,...","[SARS‐CoV‐2 infection, breast cancer Li, organ...","[broad‐spectrum, extracts]","[erectile, pulmonary, organoid, lung]","[anti‐HCoV, DTIs, haloperidol, hydroxychloroqu...","[antitumoral, NCATS, disease, 3D, colorectal c...","[DNA, genome]",,"[receptor–ligand, PARP1, Gln166, spike S recep...",...,"[HIV‐1 DNA, knowledge‐graph‐based DR, mol2vec,...","[human lung cell line, multi‐well cell culture...","[RNA‐dependent RNA polymerase, viral RNA]","[naïve T cells, human cells, host cells]","[haloperidol, compounds, acid, mol2vec, hydrox...","[Cathepsin L, SARS‐CoV‐2 Abdellatiif, SARS‐CoV...","[genes, DNA, synthetic compound, protease bind...","[cellular, cell, T cells, hepatitis C virus, m...","[membrane M protein, replicase complex]","[human protein, Ebola viral, human cells, C., ..."
2,PMC9357751,"[Molnupiravir, nucleoside, SARS-CoV-2, gamma, ...","[viral infection, infections, HHS, SARS-CoV-2 ...","[SARSCoV-2, serum]",,"[isopropyl, antiviral nucleoside, FDA, Food, C...","[disease, nasopharyngeal, NCT04746183, MERS co...","[EIDD-1931, NCT04746183]",,"[DR, MK-4482-002, delta]",...,,[Molnupiravir population],[viral RNA],,"[electron acceptor, drug, nucleoside, toxic, f...","[SARSCoV-2, SARSCoV-2 viral]","[RdRp enzyme, RdRp, part of the, RNA mutagenes...",[enterocytes],,"[nonhuman, individuals, remdesivir-resistant, ..."
3,PMC9346052,"[indinavir, Fostamatinib disodium, lopinavir c...","[viral infection, tumor necrosis, pulmonary le...","[plasma, cytoplasmic]","[heart, gastric]","[indinavir, infliximab, Fostamatinib disodium,...","[multilayer, T, myeloma, sections, NCT04315948...","[matrix, integral membrane, DNA, nucleoprotein...","[drug-disease network, neural network]","[casein kinase 2, Wuhan, CD147, glycogen synth...",...,"[beta coronaviruses, Lineage B.1.351, viral RN...","[Yuce, P.1, HeLa-ACE2 cells, Calu-3, Interfero...","[Single-stranded RNA ssRNA, genomic RNA, mRNAs...","[epithelial cells, B.1.526 lineage, human resp...","[compounds, pivoxetil, mRNAs, hydroxychloroqui...","[casein kinase 2, CD147, p38 MAPK mitogen-acti...","[confirmed Coronavirus, features, genes, domai...","[immune check-point inhibitors, cell, vascular...","[antibodies, operationsSome, envelope protein,...","[Ebola virus, nucleocapsid proteins, CoVs, vir..."
4,PMC9775208,"[histamine, loratadine, non-protein, antihista...","[viral infection, neurogenerative diseases, ac...",[blood],"[lung, heart, lungs, kidney, liver, brain]","[histamine, loratadine, histamine antagonists,...","[WGCNA, disease, cancer, CCs, transcriptome-dr...","[matrix, mitophagy-animal]","[PPI network, network, airway, PPI networks]","[M33, Genbank IDs, platelet-derived extracellu...",...,"[M1 genes, module hub genes, empty gene, term ...","[M46, M44, Th1, M1, Th1/Th2]",,"[alveolar epithelial cells, alveolar cells, Th...","[histamine, antimalarials, drug, loratadine, c...","[M33, platelet-derived extracellular vesicles,...","[genes, construct gene, sites, gene modules, M...","[cellular, Th cells, T-cell, platelet, cell, e...",[extracellular],"[individuals, viruses, C., murine, mouse, mice..."
5,PMC9527439,"[epigallocatechin gallate, Nitric oxide, hepar...","[viral infection, cancer and infectious diseas...","[jensenone, capmatinib, plasma, extracts]","[lung, pulmonary]","[Nitric oxide, heparin, baicalin, chlorpromazi...","[cancer, 3D, biomolecules]","[virus-cell membrane, surface, crown-like, mem...",,"[protease N 154, drug-receptor, nonstructural ...",...,[viral targets 34],[Vero E6 cell line],[viral RNA],[host cells],"[epigallocatechin gallate, compounds, acid, bi...","[ACE2, benzoylpinostrobin, chymotrypsin-like p...","[enzyme, natural product, matching, experiment...","[cell, cells, pro-inflammatory cytokines]","[membrane M protein, antibodies]","[Coronaviruses, viral RNA, mouse, plant metabo..."
6,PMC9729590,"[luminal, MGL, Bexarotene, abiraterone acetate...","[respiratory syndromes SARS, toxicity, obesity...",[gmmpbsa],,"[TYR129, ALA125, LEU207, abiraterone acetate, ...","[cutaneous T-cell lymphoma, non-small cell lun...",[nucleocapsids],,"[Wuhan, CYP17A1, MGL, Androgen, Oct, TYR732 am...",...,"[viral genome single-stranded RNA, A0, grid bo...","[6NUR, A0, HIS133, Oct 13]",[viral RNA],[human cells],"[compounds, bexarotene anticancer, Na+ ions, m...","[6NUR, RdRp-RNA protein, Androgen, abiraterone...","[RNA nucleotides, RdRp inhibitor drug, beta-sh...","[cell, nucleocapsids, T-cell, cells]",,"[Coronaviruses, human cells, viral genome sing..."
7,PMC9236981,"[Ebselen, hydroxychloroquine, AI, azithromycin...","[autoimmune, malignant, cancer, cardiovascular...",,"[lungs, renal, gape, pulmonary]","[DTIs, Ebselen, hydroxychloroquine, azithromyc...","[colon cancer, MCC, networks, PPIs, drug-targe...","[matrix, AUC, membrane, self-membrane]","[nodes, DTI networks, networks nodes, node, ne...","[PDBbind, UniHI, SARS2-DEG, PARP1, Arun Asif, ...",...,"[drug-virus pairs, Modified genes, E, maps hum...","[cultured non-human primate cells, human-deriv...","[enveloped single-stranded RNA, RNA transcript]","[alveolar epithelial cells, Covid-19, 25 poten...","[compounds, Ebselen, hydroxychloroquine, metab...","[ACE2, neurofibromin, PPIs, Few-Shot, Anti-Cov...","[gene products, consensus, genes, known, posit...","[human-derived cells, cellular, epithelial cel...","[complexes, self-membrane protein, membrane pr...","[human cells, human coronavirus, zoonotic, org..."
8,PMC9694939,"[nucleoside, 37542, NSP-12/Sofosbuvir, SARS-Co...","[Mouth Disease Virus FMDV, and Encephalomyocar...",,[cavity],"[−6.2, nucleoside, Sofosbuvir, NVT, Sofosbuvir...","[Mouth Disease Virus FMDV, −9.1, B.The, websit...","[matrix, neighbor, −6.5, genome]","[non-polar, Wall]","[−6.35, PROMALS3D, PolV, −8.2, C-terminal RdRp...",...,"[Coronaviruses's genome, conserved active site...",[T556],,[1.13 pairs],"[compounds, residue, nucleoside, protein, mole...","[Sofosbuvir and the, -C, Mg2+, -D, Sofosbuvir ...","[conserved active site, conserved domains, con...",[cellular],"[inside, complexes]","[Coronaviruses, Virus JEV, Human Enterovirus 7..."
9,PMC9556799,"[nucleoside, hydroxychloroquine, chlorpromazin...","[viral infection, acute respiratory syndrome c...",[blood],"[lung, lungs, heart, testis, pulmonary, kidney...","[nucleoside, hydroxychloroquine, CoV-host, chl...","[disease, CMAP, A., chronic myeloid leukemia, ...","[cell surface, CRISPR, virus-membrane, endosom...","[coronary heart, nodes, node, cerebrovascular,...","[GTEx, PLCB4, BioGRID, TYK2, OAS3, low-density...",...,"[10557 variants, 16 genes, interacting genes, ...",[CoV-infected cells],[degrades viral RNA],"[human cells, infected cells, alveolar type II...","[hydroxychloroquine, chlorpromazine, protein, ...","[PLCB4, TYK2, OAS3, low-density lipoprotein re...","[GTEx, genes, quantitative trait locus, known,...","[epithelial cells, cell, CoV-infected cells, e...","[cell surface, chromosome, virus-membrane fusi...","[virus-based, human protein, Vibrio cholerae, ..."


In [None]:
with open('2024-10-15_scispacy_fastcoref_sent_coref_text_model_labels.pickle', 'wb') as f:
    pickle.dump(df_final, f)

The entities for each entity type are in lists with each article as a separate row. We will also create a DataFrame with each entity as a separate row.
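The reshaping from one row per article to one row per entity can also be sketched directly on the wide table, without re-running the models. This is a plain-Python illustration under stated assumptions: each wide row is a dictionary mapping a label column to a list of entities, empty cells are `None` (pandas would show them as `NaN`), and the helper name `wide_to_long` is hypothetical. The sample values are taken from the table above.

```python
def wide_to_long(rows, id_key="article_id"):
    """Turn rows of {label: [entities]} into one row per (article, entity, label)."""
    long_rows = []
    for row in rows:
        for label, entities in row.items():
            # Skip the identifier column and empty cells (None or empty list)
            if label == id_key or not entities:
                continue
            for entity in entities:
                long_rows.append({id_key: row[id_key], "entity": entity, "label": label})
    return long_rows


wide_rows = [
    {
        "article_id": "PMC9357751",
        "bc5cdr_CHEMICAL": ["Molnupiravir", "nucleoside"],
        "bionlp_ORGAN": None,  # no ORGAN entities were found for this article
        "jnlpba_RNA": ["viral RNA"],
    }
]
long_rows = wide_to_long(wide_rows)
for r in long_rows:
    print(r)
```

Because the wide table stores each entity once per article, this route yields unique entities only, whereas iterating over `doc.ents` keeps every mention.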

In [None]:
data = []  # List to hold the rows for the final DataFrame

# Iterate through each model
for name, model in models.items():
    # Apply the NER model to each article's 'coref_text'
    for article_id, text in pmc_arxiv_full_sent_text_spacy_fastcoref[['article_id', 'coref_text']].itertuples(index=False):
        doc = model(text)
        # Extract the entities from the processed doc
        for ent in doc.ents:
            # Append the article_id, entity text, entity label, and model name to the data list
            data.append({'article_id': article_id, 'entity': ent.text, 'label': ent.label_, 'model': name})

# Convert the list of dictionaries into a DataFrame
df_entities = pd.DataFrame(data)

In [None]:
df_entities

Unnamed: 0,article_id,entity,label,model
0,PMC9549161,sildenafil,CHEMICAL,bc5cdr
1,PMC9549161,erectile dysfunction,DISEASE,bc5cdr
2,PMC9549161,pulmonary hypertension,DISEASE,bc5cdr
3,PMC9549161,bupropion,CHEMICAL,bc5cdr
4,PMC9549161,thalidomide,CHEMICAL,bc5cdr
...,...,...,...,...
8816,PMC9556799,gene,SO,craft
8817,PMC9556799,cells,CL,craft
8818,PMC9556799,African green monkey,TAXON,craft
8819,PMC9556799,genes,SO,craft


In [None]:
with open('2024-10-15_scispacy_fastcoref_sent_coref_text_model_ents_labels.pickle', 'wb') as f:
    pickle.dump(df_entities, f)

In [None]:
df_entities.to_csv('2024-10-15_scispacy_fastcoref_sent_coref_text_model_ents_labels.csv', index=False)

### 5.4 AbbreviationDetector

The `AbbreviationDetector` is an additional scispaCy pipeline component which implements the abbreviation detection algorithm introduced by Schwartz & Hearst, 2003 in the paper [A simple algorithm for identifying abbreviation definitions in biomedical text.](https://pubmed.ncbi.nlm.nih.gov/12603049/)

In [None]:
# Load small English scispaCy model
nlp = spacy.load("en_core_sci_sm")

In [None]:
nlp.pipe_names

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner']

In [None]:
# add abbreviation_detector component
nlp.add_pipe("abbreviation_detector")

<scispacy.abbreviation.AbbreviationDetector at 0x782b92c25c90>

In [None]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'parser',
 'ner',
 'abbreviation_detector']

We will try the Abbreviation Detector on a test sentence. The text must contain at least one <long-form, short-form> or <short-form, long-form> pair in which the definition or the abbreviation appears in parentheses next to its counterpart.
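The core matching step of the Schwartz & Hearst algorithm can be sketched in a few lines. This is a simplified Python port of the long-form search described in the paper, not the scispaCy implementation: the function name `find_best_long_form` is an assumption, candidate extraction around the parentheses and the paper's validity filters are omitted, and the detector's behaviour on real text may differ.

```python
def find_best_long_form(short_form, candidate):
    """Scan both strings right to left, matching each alphanumeric character
    of the short form in the candidate text. The first character of the short
    form must match at the start of a word. Returns the long form or None.
    """
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            # Hyphens and other punctuation in the short form are ignored
            s_idx -= 1
            continue
        # Move left until the character matches; for the first character of
        # the short form, additionally require a word-initial position
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    # Start the long form at the beginning of the word that was matched last
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


print(find_best_long_form("COVID-19", "caused Coronavirus Disease 2019"))
print(find_best_long_form("HHS", "Secretary of Department of Health and Human Services"))
```

The candidate string stands for the words that precede the parenthesised short form; choosing that window is part of the full algorithm and is left out here.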

In [None]:
test1 = "Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) caused Coronavirus Disease 2019 (COVID-19), \
 the provisional name was 2019 novel coronavirus (2019-nCoV)"

In [None]:
doc = nlp(test1)

In [None]:
for abrv in doc._.abbreviations:
    print(f"{abrv} \t {abrv._.long_form} \t ({abrv.start}, {abrv.end})")

COVID-19 	 Coronavirus Disease 2019 	 (13, 14)


'COVID-19' is recognised as an abbreviation but not 'SARS-CoV-2' or '2019-nCoV'.

In [None]:
[(ent.text, ent.label_) for ent in doc.ents]

[('Severe', 'ENTITY'),
 ('acute respiratory syndrome', 'ENTITY'),
 ('coronavirus-2', 'ENTITY'),
 ('SARS-CoV-2', 'ENTITY'),
 ('Coronavirus Disease', 'ENTITY'),
 ('COVID-19', 'ENTITY'),
 ('provisional name', 'ENTITY'),
 ('coronavirus', 'ENTITY')]

'Severe acute respiratory syndrome coronavirus-2' is split into three entities, which probably explains the inability of the model to recognise 'SARS-CoV-2' as its abbreviation. '2019 novel coronavirus' is also not recognised as an entity.



We will create a pipeline for a small batch of texts.

In [None]:
text0 = "Under agency from the Secretary of Department of Health and Human Services (HHS), the Food and Drug Administration (FDA) may issue an Emergency Use Authorization (EUA) approving the urgent use of an unauthorized medicine, unlicensed or uncleared instrument, or unregistered biological component or a non - authorized use of an effective drug, authorized or approved machine, or registered biological drug. In other texts, when no appropriate, authorized, and accessible approaches exist, an EUA can enable healthcare countermeasures e.g. prescription medications, vaccines to be included during an announced outbreak to detect, treat, or avoid potential or life-threatening illnesses associated with biological as well as other entities. An EUA differs from a marketing authorization in that An EUA is dependent on a higher standard of proof. In order for an EUA to be granted, the Food and Drug Administration FDA must determine, based on existing data, that the product is potentially effective for the intended purpose and that the product known and prospective benefits exceed the product's known and potential dangers. the Food and Drug Administration FDA grants EUAs, which represent the Food and Drug Administration FDA's goal to safeguard public’s health by guaranteeing the safety, effectiveness, and integrity of human and veterinary pharmaceuticals, biologicals, medical instruments, the country’s food system, cosmetics, and radiation-emitting goods."
text1 = "The severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) -caused Coronavirus Disease 2019 (COVID-19) has resulted in a significant increase in morbidity and mortality around the world. Finding therapies for The severe acute respiratory syndrome coronavirus-2 SARS-CoV-2 -caused Coronavirus Disease 2019 COVID-19 has taken a significant amount of time and effort. But besides advancements in technology and increased understanding of human related illness, therapeutic developments have become much slower than anticipated. Drug discovery can take decades and is complex and expensive. To bring a medicine to market, it ends up taking an average of 10 years and at least 1 billion. The The severe acute respiratory syndrome coronavirus-2 SARS-CoV-2 -caused Coronavirus Disease 2019 COVID-19 pandemic has prompted the researchers and doctors to repurpose antiviral medications to combat SARSCoV-2 infection. Drug repurposing DR, also known as drug rescuing, drug redirection, drug repositioning, therapeutic switching, drug reprofiling, drug recycling and drug re-tasking, is a method of recognizing novel therapeutic evidence from Investigational/pro-drugs/old/already marketed/existing/FDA approved/failed drugs, and applying the new advanced medicine to the management of diseases other than the ones for which the new advanced medicine were originally developed. Regulatory authorities throughout the world have established fast-track methods to speed up the research and approval of The severe acute respiratory syndrome coronavirus-2 SARS-CoV-2 -caused Coronavirus Disease 2019 COVID-19 therapeutics. Some of the recommended approaches include the use of antiviral medicines or immune function modulators. \nMany medicines have demonstrated potent activity against The SARS-CoV-2 -caused Coronavirus Disease 2019 COVID-19 in animal/preclinical investigations and have progressed to human clinical trials, among which Molnupiravir being authorized for treatment of The severe acute respiratory syndrome coronavirus-2 SARS-CoV-2 -caused Coronavirus Disease 2019 COVID-19 in the United States, India, the United Kingdom, and other countries."
text2 = "De novo identification and development of new molecular entities (NME) is a classic strategy to drug discovery that comprises five phases discovery and preclinical, safety evaluation, clinical trials, FDA clearance, and FDA post-market safety monitoring. Because of the specific characteristics of the medicine for a mechanism, De novo identification and development of new molecular entities NME is arduous, time-consuming, and costly, and De novo identification and development of new molecular entities NME comes with a significant risk of failure."

In [None]:
texts = [text0, text1, text2]

In [None]:
doc_list = []

docs = nlp.pipe(texts)
doc_list.append(list(docs))

In [None]:
def get_abbrevs(docs):
    abbrev_list = []

    for index, doc in enumerate(docs):
        abbrevs = doc._.abbreviations
        for abrv in abbrevs:
            abbrev_list.append(f"{abrv.text}, {abrv._.long_form}, ({abrv.start}, {abrv.end})")
    return abbrev_list

In [None]:
abbrev_texts = list(map(get_abbrevs, doc_list))

In [None]:
abbrev_texts

[['HHS, Health and Human Services, (13, 14)',
  'FDA, Food and Drug Administration, (22, 23)',
  'FDA, Food and Drug Administration, (211, 212)',
  'FDA, Food and Drug Administration, (159, 160)',
  'FDA, Food and Drug Administration, (200, 201)',
  'EUA, Emergency Use Authorization, (135, 136)',
  'EUA, Emergency Use Authorization, (31, 32)',
  'EUA, Emergency Use Authorization, (149, 150)',
  'EUA, Emergency Use Authorization, (126, 127)',
  'EUA, Emergency Use Authorization, (88, 89)',
  'COVID-19, Coronavirus Disease 2019, (14, 15)',
  'COVID-19, Coronavirus Disease 2019, (122, 123)',
  'COVID-19, Coronavirus Disease 2019, (234, 235)',
  'COVID-19, Coronavirus Disease 2019, (266, 267)',
  'COVID-19, Coronavirus Disease 2019, (297, 298)',
  'COVID-19, Coronavirus Disease 2019, (44, 45)',
  'NME, new molecular entities, (10, 11)',
  'NME, new molecular entities, (82, 83)',
  'NME, new molecular entities, (63, 64)']]

The presence of just one <long form, short form> pair does seem to work in the case of COVID-19 as all six references are identified.




Parentheses were removed from the coreference-resolved text of the full articles in an earlier preprocessing step, and the algorithm does not always recognise abbreviations even when they appear in adjacent parentheses, as we saw with SARS-CoV-2. We will therefore continue to use the text without abbreviation detection for now.
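Whether a text still contains parenthesised short forms can be checked cheaply before running the detector. The sketch below uses a regular expression of our own choosing (2 to 10 letters, digits or hyphens inside parentheses) and a hypothetical helper name; it only covers the "long form (short form)" pattern and is not part of scispaCy.

```python
import re

# Assumed pattern: a short token of 2-10 alphanumeric/hyphen characters in parentheses
SHORT_FORM_IN_PARENS = re.compile(r"\(([A-Za-z0-9][A-Za-z0-9-]{1,9})\)")


def parenthesised_short_forms(text):
    """Return the candidate short forms that appear inside parentheses."""
    return SHORT_FORM_IN_PARENS.findall(text)


with_parens = (
    "Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) "
    "caused Coronavirus Disease 2019 (COVID-19)"
)
without_parens = (
    "The severe acute respiratory syndrome coronavirus-2 SARS-CoV-2 "
    "-caused Coronavirus Disease 2019 COVID-19"
)
print(parenthesised_short_forms(with_parens))
print(parenthesised_short_forms(without_parens))
```

An empty result for a document indicates that the detector has no parenthesised definitions to work from in that text.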

### 5.5 EntityLinker

The `EntityLinker` is a scispaCy component which performs linking to a knowledge base. The linker simply performs a string overlap-based search (char-3grams) on named entities, comparing them with the concepts in a knowledge base using an approximate nearest neighbours search.
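The idea of character 3-gram overlap can be illustrated with a toy example. This is a sketch under stated assumptions and not the scispaCy candidate generator: it uses raw 3-gram counts with cosine similarity and a brute-force search instead of an approximate nearest neighbours index, and the knowledge base, its concept identifiers (`DRUG_A`, `DRUG_B`, `DISEASE_A`) and the function names are all hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams of a lowercased string."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical knowledge base: concept id -> list of aliases
TOY_KB = {
    "DRUG_A": ["hydroxychloroquine", "HCQ"],
    "DRUG_B": ["chloroquine"],
    "DISEASE_A": ["COVID-19", "coronavirus disease 2019"],
}


def link(mention, kb, k=2):
    """Return the k concepts whose best-matching alias is most similar to the mention."""
    mention_vec = char_ngrams(mention)
    scored = []
    for concept_id, aliases in kb.items():
        best = max(cosine(mention_vec, char_ngrams(alias)) for alias in aliases)
        scored.append((concept_id, best))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


candidates = link("hydroxychloroquine sulfate", TOY_KB)
print(candidates)
```

Because the score depends only on surface overlap, 'chloroquine' is also returned as a candidate for a hydroxychloroquine mention, which is why match scores need to be inspected before accepting a link.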

The five linkers currently supported are:

* `umls` - [Unified Medical Language System](https://www.nlm.nih.gov/research/umls/index.html)

* `mesh` - [Medical Subject Headings](https://www.nlm.nih.gov/mesh/meshhome.html)

* `rxnorm` - [RxNorm](https://www.nlm.nih.gov/research/umls/rxnorm/index.html) ontology containing normalised names for clinical drugs

* `go` - [Gene Ontology](http://geneontology.org/)

* `hpo` - [Human Phenotype Ontology](https://hpo.jax.org/app/)

In [None]:
# convert coreference resolved articles to list
all_articles = pmc_arxiv_full_sent_text_spacy_fastcoref.coref_text.tolist()

In [None]:
len(all_articles)

10

In [None]:
# iterate over the text and append the processed Doc objects to a list
def get_docs(docs):
    doc_list = []
    docs = nlp.pipe(docs)
    doc_list.append(list(docs))
    return doc_list

In [None]:
%%time

doc_list = get_docs(all_articles)

CPU times: user 1min 28s, sys: 3.74 s, total: 1min 31s
Wall time: 1min 4s


In [None]:
with open('2024-10-13_articles_0-10_doclist_coref_text_tolist.pickle', 'wb') as f:
   pickle.dump(doc_list, f)

We will add the `EntityLinker` component to the pipeline configured with the `umls` linker to compare concepts in the UMLS knowledge base.

In [None]:
nlp.add_pipe("scispacy_linker", config={"linker_name": "umls", "resolve_abbreviations": True})

https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/linkers/2023-04-23/umls/tfidf_vectors_sparse.npz not found in cache, downloading to /tmp/tmpluaffjou


100%|██████████| 492M/492M [00:23<00:00, 22.2MiB/s]


Finished download, copying /tmp/tmpluaffjou to cache at /root/.scispacy/datasets/2b79923846fb52e62d686f2db846392575c8eb5b732d9d26cd3ca9378c622d40.87bd52d0f0ee055c1e455ef54ba45149d188552f07991b765da256a1b512ca0b.tfidf_vectors_sparse.npz
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/linkers/2023-04-23/umls/nmslib_index.bin not found in cache, downloading to /tmp/tmp3had0ylb


100%|██████████| 724M/724M [00:23<00:00, 32.4MiB/s]


Finished download, copying /tmp/tmp3had0ylb to cache at /root/.scispacy/datasets/7e8e091ec80370b87b1652f461eae9d926e543a403a69c1f0968f71157322c25.6d801a1e14867953e36258b0e19a23723ae84b0abd2a723bdd3574c3e0c873b4.nmslib_index.bin
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/linkers/2023-04-23/umls/tfidf_vectorizer.joblib not found in cache, downloading to /tmp/tmpwpj6iumv


100%|██████████| 1.32M/1.32M [00:00<00:00, 4.63MiB/s]


Finished download, copying /tmp/tmpwpj6iumv to cache at /root/.scispacy/datasets/37bc06bb7ce30de7251db5f5cbac788998e33b3984410caed2d0083187e01d38.f0994c1b61cc70d0eb96dea4947dddcb37460fb5ae60975013711228c8fe3fba.tfidf_vectorizer.joblib
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/linkers/2023-04-23/umls/concept_aliases.json not found in cache, downloading to /tmp/tmpb8fjsly0


100%|██████████| 264M/264M [00:09<00:00, 28.1MiB/s]


Finished download, copying /tmp/tmpb8fjsly0 to cache at /root/.scispacy/datasets/6238f505f56aca33290aab44097f67dd1b88880e3be6d6dcce65e56e9255b7d4.d7f77b1629001b40f1b1bc951f3a890ff2d516fb8fbae3111b236b31b33d6dcf.concept_aliases.json
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/kbs/2023-04-23/umls_2022_ab_cat0129.jsonl not found in cache, downloading to /tmp/tmpwpf2qgf6


100%|██████████| 628M/628M [00:16<00:00, 39.6MiB/s]


Finished download, copying /tmp/tmpwpf2qgf6 to cache at /root/.scispacy/datasets/d5e593bc2d8adeee7754be423cd64f5d331ebf26272074a2575616be55697632.0660f30a60ad00fffd8bbf084a18eb3f462fd192ac5563bf50940fc32a850a3c.umls_2022_ab_cat0129.jsonl
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/data/umls_semantic_type_tree.tsv not found in cache, downloading to /tmp/tmpgfal2x3m


100%|██████████| 4.26k/4.26k [00:00<00:00, 2.51MiB/s]

Finished download, copying /tmp/tmpgfal2x3m to cache at /root/.scispacy/datasets/21a1012c532c3a431d60895c509f5b4d45b0f8966c4178b892190a302b21836f.330707f4efe774134872b9f77f0e3208c1d30f50800b3b39a6b8ec21d9adf1b7.umls_semantic_type_tree.tsv





<scispacy.linking.EntityLinker at 0x782b92c27cd0>

In [None]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'parser',
 'ner',
 'abbreviation_detector',
 'scispacy_linker']

In [None]:
# iterate over list of Doc objects and extract entities
def get_entities(docs):
    ent_list = []
    for doc in docs:
        for ent in doc.ents:
            ent_list.append(f"({ent.text}, {ent.label_})")
    return ent_list

In [None]:
%%time

ents = list(map(get_entities, doc_list))

CPU times: user 764 ms, sys: 4.71 ms, total: 768 ms
Wall time: 1.29 s


In [None]:
ents

[['(Sir James Black, ENTITY)',
  '(winner, ENTITY)',
  '(Nobel Prize, ENTITY)',
  '(drug repurposing strategies, ENTITY)',
  '(drug discovery, ENTITY)',
  '(Ted, ENTITY)',
  '(research, ENTITY)',
  '(general approach, ENTITY)',
  '(drug development, ENTITY)',
  '(drug repurposing, ENTITY)',
  '(retrospectively looking, ENTITY)',
  '(indications, ENTITY)',
  '(drugs, ENTITY)',
  '(molecules, ENTITY)',
  '(waiting, ENTITY)',
  '(approval, ENTITY)',
  '(pathways, ENTITY)',
  '(action, ENTITY)',
  '(targets, ENTITY)',
  '(molecules, ENTITY)',
  '(waiting, ENTITY)',
  '(approval, ENTITY)',
  '(pathways, ENTITY)',
  '(action, ENTITY)',
  '(targets, ENTITY)',
  '(clinical trials, ENTITY)',
  '(efficacy, ENTITY)',
  '(treatment, ENTITY)',
  '(disease, ENTITY)',
  '(definition, ENTITY)',
  '(term, ENTITY)',
  '(drug repurposing, ENTITY)',
  '(scholars, ENTITY)',
  '(scholars, ENTITY)',
  '(drug repurposing, ENTITY)',
  '(academics, ENTITY)',
  '(drug repositioning, ENTITY)',
  '(drug rediscover

In [None]:
for ent in ents:
    print(len(ent))

18812
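A single count is printed because `get_docs` appends the whole list of `Doc` objects as one element, so `doc_list` has the shape `[[doc_0, ..., doc_9]]` and `map` calls `get_entities` only once. The stand-in below is a minimal sketch of that shape, using plain strings in place of `Doc` objects and a hypothetical `get_docs_stub` helper, and shows how per-article counts could be recovered by iterating over the inner list.

```python
def get_docs_stub(texts):
    """Mimic get_docs: append the full list as one element of the outer list."""
    doc_list = []
    doc_list.append(list(texts))
    return doc_list

# Plain strings stand in for spaCy Doc objects (illustrative data only).
nested = get_docs_stub(["first article", "second article text", "third"])

outer_length = len(nested)      # one element: the inner list
inner_length = len(nested[0])   # one entry per article

# Iterating over the inner list gives one result per article instead of one total.
per_article_counts = [len(text.split()) for text in nested[0]]
```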


In [None]:
with open('2024-10-14_articles_0-10_doclist_all_entities_coref_text_18812.pickle', 'wb') as f:
   pickle.dump(ents, f)

#### 5.5.1 EntityLinker results - UMLS

We will iterate over the entities in `doc_list` and return a list of strings, one per candidate match, each containing the entity text, concept unique identifier (CUI) and match score for an entity matched to a UMLS knowledge base entry.

In [None]:
def get_kb_ents(docs):
    kb_ent_list = []
    for doc in docs:
        for entity in doc.ents:
            # each kb_ents item is a (CUI, match score) tuple
            for cui, match_score in entity._.kb_ents:
                kb_ent_list.append(f"{entity.text} {cui} {match_score}")
    return kb_ent_list

In [None]:
%%time

kb_ents = list(map(get_kb_ents, doc_list))

CPU times: user 274 ms, sys: 0 ns, total: 274 ms
Wall time: 275 ms


In [None]:
kb_ents

[['winner C0205102 0.761066734790802',
  'winner C4048877 0.7028506398200989',
  'Nobel Prize C0028236 0.9771338701248169',
  'drug repurposing strategies C2936405 0.7462714910507202',
  'drug discovery C0920472 0.9833343625068665',
  'drug discovery C1880355 0.8152775764465332',
  'drug discovery C1512071 0.7712427973747253',
  'drug discovery C2717881 0.728194534778595',
  'Ted C1539997 0.9935382604598999',
  'research C0035168 0.985704243183136',
  'research C0242481 0.985704243183136',
  'research C1548287 0.985704243183136',
  'research C3245477 0.985704243183136',
  'research C1518856 0.8372102975845337',
  'general approach C0449445 0.7233768105506897',
  'general approach C5445118 0.7233768105506897',
  'drug development C0872152 0.9743182063102722',
  'drug development C4684648 0.8551326990127563',
  'drug development C0678723 0.8284767270088196',
  'drug development C1527148 0.8284767270088196',
  'drug development C0011435 0.7768809199333191',
  'drug repurposing C2936405 0.

In [None]:
for kb_ent in kb_ents:
    print(len(kb_ent))

60010


In [None]:
with open('2024-10-14_coref_text_0_10_kb_ents_60010.pickle', 'wb') as f:
   pickle.dump(kb_ents, f)
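Because entity text can itself contain spaces, the strings in `kb_ents` are easiest to split from the right. The sketch below assumes the `"{text} {cui} {score}"` layout produced by `get_kb_ents`; `parse_kb_ent` and `best_candidates` are hypothetical helper names, and the sample lines are copied from the output above.

```python
def parse_kb_ent(line):
    """Split '<entity text> <CUI> <score>' from the right into a record."""
    text, cui, score = line.rsplit(" ", 2)
    return {"entity": text, "cui": cui, "score": float(score)}

def best_candidates(lines):
    """Keep the highest-scoring candidate for each distinct entity text."""
    best = {}
    for record in map(parse_kb_ent, lines):
        current = best.get(record["entity"])
        if current is None or record["score"] > current["score"]:
            best[record["entity"]] = record
    return best

sample = [
    "drug discovery C0920472 0.9833343625068665",
    "drug discovery C1880355 0.8152775764465332",
    "Nobel Prize C0028236 0.9771338701248169",
]
best = best_candidates(sample)
```

Storing tuples or dictionaries in the first place would avoid the parsing step altogether; the string format is kept here only to match the pickled output.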



We will access one entity from a test document to view CUIs and match scores. By default `max_entities_per_mention` is 5, so the `EntityLinker` returns at most five candidates per mention regardless of how many nearest neighbours are found.



In [None]:
doc = nlp("Today, drug repositioning is increasingly prominent in the development of drugs for a variety of neurological diseases, cancer, rare diseases, and infectious diseases.")

fmt_str = "{:<20}| {:<11}| {:<6}"
print(fmt_str.format("Entity", "Concept ID", "Score"))

entity = doc.ents[4]

for kb_entry in entity._.kb_ents:
    cui = kb_entry[0]
    match_score = kb_entry[1]

    print(fmt_str.format(entity.text, cui, match_score))

Entity              | Concept ID | Score 
neurological diseases| C0027765   | 0.9158749580383301
neurological diseases| C0042075   | 0.8679414987564087
neurological diseases| C0752235   | 0.8057830929756165
neurological diseases| C0205494   | 0.7591356635093689
neurological diseases| C0013447   | 0.7429007887840271
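The cap of five candidates seen above can be pictured as a filter-sort-truncate step. The following is a sketch under stated assumptions rather than the linker's own code: `select_candidates` is a hypothetical helper, the `threshold` of 0.7 is assumed to be the default, and the last two candidates are invented placeholders added only to show the filtering.

```python
def select_candidates(candidates, threshold=0.7, max_entities_per_mention=5):
    """Keep candidates at or above the threshold, best first, capped at the maximum."""
    kept = [(cui, score) for cui, score in candidates if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_entities_per_mention]

candidates = [
    ("C0027765", 0.9158749580383301),
    ("C0042075", 0.8679414987564087),
    ("C0752235", 0.8057830929756165),
    ("C0205494", 0.7591356635093689),
    ("C0013447", 0.7429007887840271),
    ("HYPOTHETICAL_1", 0.71),  # invented sixth neighbour, dropped by the cap
    ("HYPOTHETICAL_2", 0.65),  # invented neighbour, dropped by the threshold
]

top_five = select_candidates(candidates)
top_two = select_candidates(candidates, max_entities_per_mention=2)
```

Lowering `max_entities_per_mention`, as done later in this section, trades recall of alternative concepts for a shorter list to review.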


#### 5.5.2 Querying knowledge base entries - UMLS

We will query the knowledge base for more detail by retrieving the `EntityLinker` component from the pipeline using the `get_pipe()` method. We can then access the canonical name and definition of a concept by using its CUI as the key in the `linker.kb.cui_to_entity` dictionary.



In [None]:
doc = nlp("Today, drug repositioning is increasingly prominent in the development of drugs for a variety of neurological diseases, cancer, rare diseases, and infectious diseases.")

linker = nlp.get_pipe("scispacy_linker")

def test_query_kb(doc):
    kb_ent_list = []
    for entity in doc.ents:
        # skip entities with no knowledge base match to avoid an IndexError
        if not entity._.kb_ents:
            continue
        first_cuid = entity._.kb_ents[0][0]
        kb_entry = linker.kb.cui_to_entity[first_cuid]
        kb_ent_list.append(f"{entity.text}, {first_cuid}, {kb_entry.canonical_name}, {kb_entry.definition}")
    return kb_ent_list

In [None]:
%%time

test_query = test_query_kb(doc)

CPU times: user 239 µs, sys: 0 ns, total: 239 µs
Wall time: 246 µs


In [None]:
test_query

['Today, C0310367, Today, None',
 'drug repositioning, C2936405, Drug Repositioning, The deliberate and methodical practice of finding new applications for existing drugs.',
 'development, C0243107, development aspects, None',
 'drugs, C0013227, Pharmaceutical Preparations, Drugs intended for human or veterinary use, presented in their finished dosage form. Included here are materials used in the preparation and/or formulation of the finished dosage form.',
 'neurological diseases, C0027765, nervous system disorder, Diseases of the central and peripheral nervous system. This includes disorders of the brain, spinal cord, cranial nerves, peripheral nerves, nerve roots, autonomic nervous system, neuromuscular junction, and muscle.',
 'cancer, C0006826, Malignant Neoplasms, A tumor composed of atypical neoplastic, often pleomorphic cells that invade other tissues. Malignant neoplasms often metastasize to distant anatomic sites and may recur after excision. The most common malignant neoplas
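Several entries above have `None` as their definition, so any downstream code should expect missing values. The sketch below uses a small stand-in dictionary in place of `linker.kb.cui_to_entity`; the `KBEntry` tuple and the `describe` helper are hypothetical names, and the two entries are copied from the output above.

```python
from typing import NamedTuple, Optional

class KBEntry(NamedTuple):
    """Stand-in for a knowledge base entry with the fields used in this section."""
    concept_id: str
    canonical_name: str
    definition: Optional[str]

cui_to_entity = {
    "C2936405": KBEntry(
        "C2936405",
        "Drug Repositioning",
        "The deliberate and methodical practice of finding new applications for existing drugs.",
    ),
    "C0310367": KBEntry("C0310367", "Today", None),
}

def describe(cui, kb):
    """Return a one-line description, tolerating unknown CUIs and missing definitions."""
    entry = kb.get(cui)
    if entry is None:
        return f"{cui}: not in knowledge base"
    definition = entry.definition or "no definition available"
    return f"{cui}: {entry.canonical_name} ({definition})"

descriptions = [describe(cui, cui_to_entity) for cui in ("C2936405", "C0310367", "C0000000")]
```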

We will query the knowledge base for the top-ranked candidate per mention and return the entity, its first CUI, canonical name and definition for all 10 articles.



In [None]:
def get_detailed_kb_ents(docs):
    kb_ent_list = []
    for index, doc in enumerate(docs):
        entities = doc.ents
        for entity in entities:
            if entity._.kb_ents:
                first_cuid = entity._.kb_ents[0][0]
                kb_entry = linker.kb.cui_to_entity[first_cuid]
                kb_ent_list.append(f"{entity.text}, {first_cuid}, {kb_entry.canonical_name}, {kb_entry.definition}")
    return kb_ent_list

In [None]:
detailed_kb_ents = list(map(get_detailed_kb_ents, doc_list))

In [None]:
detailed_kb_ents

[['winner, C0205102, Internal, Happening or arising or located within some limits, or especially, within some surface.',
  'Nobel Prize, C0028236, Nobel Prize, Any of six international prizes awarded annually for outstanding work in physics, chemistry, physiology or medicine, literature, economics and the promotion of peace.',
  'drug repurposing strategies, C2936405, Drug Repositioning, The deliberate and methodical practice of finding new applications for existing drugs.',
  'drug discovery, C0920472, Drug Discovery, The process of finding chemicals for potential therapeutic use.',
  'Ted, C1539997, NALF2 gene, None',
  'research, C0035168, research, Critical and exhaustive investigation or experimentation, having for its aim the discovery of new facts and their correct interpretation, the revision of accepted conclusions, theories, or laws in the light of newly discovered facts, or the practical application of such new or revised conclusions, theories, or laws. (Webster, 3d ed)',
  

In [None]:
with open('2024-10-14_articles_0-10_doclist_detailed_kb_ents.pickle', 'wb') as f:
   pickle.dump(detailed_kb_ents, f)
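The comma-joined strings above cannot be split back reliably, because many definitions contain commas themselves. One alternative, sketched here with the standard `csv` module and hypothetical helper names, is to keep each match as a record and let the writer handle quoting. The two records are taken from the output above.

```python
import csv
import io

FIELDS = ["entity", "cui", "canonical_name", "definition"]

records = [
    {
        "entity": "Nobel Prize",
        "cui": "C0028236",
        "canonical_name": "Nobel Prize",
        "definition": "Any of six international prizes awarded annually for outstanding work "
                      "in physics, chemistry, physiology or medicine, literature, economics "
                      "and the promotion of peace.",
    },
    {"entity": "Ted", "cui": "C1539997", "canonical_name": "NALF2 gene", "definition": None},
]

def records_to_csv(rows):
    """Serialise records to CSV text; fields containing commas are quoted automatically."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def csv_to_records(text):
    """Read the CSV text back into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

round_trip = csv_to_records(records_to_csv(records))
```

Note that a `None` definition is written as an empty field and read back as an empty string, so missing definitions need to be checked with a truthiness test rather than `is None`.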

#### 5.5.3 Querying knowledge base entries - all linkers

We will now look up entity definitions in all five knowledge bases for the same entities, starting with 'COVID-19' as a test entity.



In [None]:
def entity_linker(linker_name, document):
    # remove any existing linker so the pipeline holds one linker at a time
    if "scispacy_linker" in nlp.pipe_names:
        nlp.remove_pipe("scispacy_linker")

    nlp.add_pipe(
        "scispacy_linker",
        config={
            "linker_name": linker_name,
            "resolve_abbreviations": True,
            "k": 10,  # Number of top candidates to consider for linking
            "max_entities_per_mention": 2  # Maximum number of entities to link per mention
        }
    )
    linker = nlp.get_pipe("scispacy_linker")
    doc = nlp(document)
    try:
        entity = doc.ents[0]
    except IndexError:
        entity = 'Nan'  # placeholder when no entity is detected
    entity_details = []
    entity_details.append(entity)
    try:
        for linker_ent in entity._.kb_ents:
            Concept_Id, Score = linker_ent
            entity_details.append(f'Entity_Matching_Score :{Score}')
            entity_details.append(linker.kb.cui_to_entity[linker_ent[0]])
    except AttributeError:
        # the 'Nan' placeholder is a plain string with no `_` attribute
        pass
    return entity_details

In [None]:
test = 'COVID-19'

In [None]:
entity_linker('umls', test)

[COVID-19,
 'Entity_Matching_Score :0.9786118268966675',
 CUI: C5203670, Name: COVID19 (disease)
 Definition: A viral disorder generally characterized by high FEVER; COUGH; DYSPNEA; CHILLS; PERSISTENT TREMOR; MUSCLE PAIN; HEADACHE; SORE THROAT; a new loss of taste and/or smell (see AGEUSIA and ANOSMIA) and other symptoms of a VIRAL PNEUMONIA. In severe cases, a myriad of coagulopathy associated symptoms often correlating with COVID-19 severity is seen (e.g., BLOOD COAGULATION; THROMBOSIS; ACUTE RESPIRATORY DISTRESS SYNDROME; SEIZURES; HEART ATTACK; STROKE; multiple CEREBRAL INFARCTIONS; KIDNEY FAILURE; catastrophic ANTIPHOSPHOLIPID ANTIBODY SYNDROME and/or DISSEMINATED INTRAVASCULAR COAGULATION). In younger patients, rare inflammatory syndromes are sometimes associated with COVID-19 (e.g., atypical KAWASAKI SYNDROME; TOXIC SHOCK SYNDROME; pediatric multisystem inflammatory disease; and CYTOKINE STORM SYNDROME). A coronavirus, SARS-CoV-2, in the genus BETACORONAVIRUS is the causative ag

In [None]:
entity_linker('mesh', test)

https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/mesh/tfidf_vectors_sparse.npz not found in cache, downloading to /tmp/tmpdy2gch_6


100%|██████████| 68.1M/68.1M [00:08<00:00, 8.28MiB/s]


Finished download, copying /tmp/tmpdy2gch_6 to cache at /root/.scispacy/datasets/0acb1f67e1908d2211efb5291880a946e905e1a14a87c10cfc640d0711f914c7.e4877c46bb5a882e9729b6abe799b33f195067557a3c0c15086a50471f29b985.tfidf_vectors_sparse.npz
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/mesh/nmslib_index.bin not found in cache, downloading to /tmp/tmp87go1aoj


100%|██████████| 146M/146M [00:22<00:00, 6.75MiB/s]


Finished download, copying /tmp/tmp87go1aoj to cache at /root/.scispacy/datasets/7bad4a37e60db48ee4b5b03dfaa61b195af5b4c69a6850fa5b466103229c263d.4952ca58f4ed53ad673bb387c8f203d92f422dbcc8cfb673ffed9720e7c0af68.nmslib_index.bin
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/mesh/tfidf_vectorizer.joblib not found in cache, downloading to /tmp/tmpb03vc7a7


100%|██████████| 674k/674k [00:00<00:00, 976kiB/s] 


Finished download, copying /tmp/tmpb03vc7a7 to cache at /root/.scispacy/datasets/6a0e66a77d89d942876d1b853abf461e6b16edaef65bebfa7f3a8dd99ff6553b.6eb17e1805a69a55fc151aa59fe42343d2b4be81405127043fd065bf5f9620e0.tfidf_vectorizer.joblib
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/mesh/concept_aliases.json not found in cache, downloading to /tmp/tmpz018e6ly


100%|██████████| 29.3M/29.3M [00:02<00:00, 11.4MiB/s]


Finished download, copying /tmp/tmpz018e6ly to cache at /root/.scispacy/datasets/ccb3a55e3a37984902cc7de591d37d56d90eb0962d128536512b8d1219e71bcb.89e92a904a5ccc051bcba6ee26c5744e183dee7197cc835cfeb152b330b44559.concept_aliases.json
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/kbs/2023-04-23/umls_mesh_2022.jsonl not found in cache, downloading to /tmp/tmp_l709u0l


100%|██████████| 76.4M/76.4M [00:05<00:00, 14.1MiB/s]


Finished download, copying /tmp/tmp_l709u0l to cache at /root/.scispacy/datasets/5541a1df25533cfafec1fdcf0446c761f998591519c8ad4a73876f48d7e0a224.c4d6e393746f18aaf6eafff94fe1782cebf29ef535b101501e66f1e3462cdb09.umls_mesh_2022.jsonl


[COVID-19,
 'Entity_Matching_Score :0.9856609106063843',
 CUI: C5203670, Name: 2019-nCoV Infection
 Definition: A viral disorder generally characterized by high FEVER; COUGH; DYSPNEA; CHILLS; PERSISTENT TREMOR; MUSCLE PAIN; HEADACHE; SORE THROAT; a new loss of taste and/or smell (see AGEUSIA and ANOSMIA) and other symptoms of a VIRAL PNEUMONIA. In severe cases, a myriad of coagulopathy associated symptoms often correlating with COVID-19 severity is seen (e.g., BLOOD COAGULATION; THROMBOSIS; ACUTE RESPIRATORY DISTRESS SYNDROME; SEIZURES; HEART ATTACK; STROKE; multiple CEREBRAL INFARCTIONS; KIDNEY FAILURE; catastrophic ANTIPHOSPHOLIPID ANTIBODY SYNDROME and/or DISSEMINATED INTRAVASCULAR COAGULATION). In younger patients, rare inflammatory syndromes are sometimes associated with COVID-19 (e.g., atypical KAWASAKI SYNDROME; TOXIC SHOCK SYNDROME; pediatric multisystem inflammatory disease; and CYTOKINE STORM SYNDROME). A coronavirus, SARS-CoV-2, in the genus BETACORONAVIRUS is the causative 

In [None]:
entity_linker('rxnorm', test)

https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/rxnorm/tfidf_vectors_sparse.npz not found in cache, downloading to /tmp/tmpbzx0bcf7


100%|██████████| 12.9M/12.9M [00:01<00:00, 8.57MiB/s]


Finished download, copying /tmp/tmpbzx0bcf7 to cache at /root/.scispacy/datasets/68e7f1197d5852698808a5f9d694026c210e4b93a7e496dea608a46fff914774.e9a1075d5c32b5e7a180b60a96b15fc072ea714b95dd458047a48ccf2bb065be.tfidf_vectors_sparse.npz
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/rxnorm/nmslib_index.bin not found in cache, downloading to /tmp/tmp4_sooija


100%|██████████| 16.9M/16.9M [00:01<00:00, 9.53MiB/s]


Finished download, copying /tmp/tmp4_sooija to cache at /root/.scispacy/datasets/3742ff1d61c637ce7dc935674fa4199810af16978f9a10088d71d37bba16203a.8f798c6f751125a0d68f8b4e82ecfba4fd37bfb2a447d61fba584e208e6af9d3.nmslib_index.bin
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/rxnorm/tfidf_vectorizer.joblib not found in cache, downloading to /tmp/tmphqpcdbio


100%|██████████| 200k/200k [00:00<00:00, 480kiB/s]


Finished download, copying /tmp/tmphqpcdbio to cache at /root/.scispacy/datasets/e6db3b626658739bfbd89a4695141db556c21cb8b915a8e7de00650992529158.2bf384392e4cece70fca03154737daf5a4e8a43fcab3fe83bb8e5d3467ccaff1.tfidf_vectorizer.joblib
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/rxnorm/concept_aliases.json not found in cache, downloading to /tmp/tmpme1crp81


100%|██████████| 7.63M/7.63M [00:01<00:00, 6.26MiB/s]


Finished download, copying /tmp/tmpme1crp81 to cache at /root/.scispacy/datasets/54a3afac2f157748a3326a13e59ffe165fcc40ce0cceab6dc47303965dc3c0ed.71746c536649e7ba8a47b6cb7a3a7c8e0c447e022bdf819e69fbb1de9276d411.concept_aliases.json
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/kbs/2023-04-23/umls_rxnorm_2022.jsonl not found in cache, downloading to /tmp/tmpxi80nwwt


100%|██████████| 17.5M/17.5M [00:01<00:00, 9.28MiB/s]


Finished download, copying /tmp/tmpxi80nwwt to cache at /root/.scispacy/datasets/afd8034c6b1a9b6e9eb94a5688ab043023fb450ddf36c88b9f78efa21c5b2d0a.7afae38a116c40277e6052ddcfcd0013fb8136a6d4f96d965ccc7689e8543712.umls_rxnorm_2022.jsonl


[COVID-19]

In [None]:
entity_linker('go', test)

https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/go/tfidf_vectors_sparse.npz not found in cache, downloading to /tmp/tmpff3phir9


100%|██████████| 13.8M/13.8M [00:01<00:00, 8.41MiB/s]


Finished download, copying /tmp/tmpff3phir9 to cache at /root/.scispacy/datasets/98b21d1968addfd51eceee816a491b7b10de52fbc8f11f22fbf8374d9f881229.0a8a2035151feef72cf9dc0bcda27bda35e86771810a2a4523bae7ea337ae7bb.tfidf_vectors_sparse.npz
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/go/nmslib_index.bin not found in cache, downloading to /tmp/tmp1hpf7lnw


100%|██████████| 15.0M/15.0M [00:01<00:00, 8.54MiB/s]


Finished download, copying /tmp/tmp1hpf7lnw to cache at /root/.scispacy/datasets/3ed448934f89223c37be21a402a665d6e3dfcbea9bfd87b1fcd68dbb2f850760.40c7e42a18bea0b2f632b9ec6c299545f1f7d91b2187158ee03380d639eb867f.nmslib_index.bin
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/go/tfidf_vectorizer.joblib not found in cache, downloading to /tmp/tmp9ov7c3ma


100%|██████████| 198k/198k [00:00<00:00, 483kiB/s]


Finished download, copying /tmp/tmp9ov7c3ma to cache at /root/.scispacy/datasets/2be1d8fd599f1a1d6140e5af989f4f512613b42a0b74a603aa599c038fc939b5.716eb257880cc667cc906fd73f357c3fccf3abec701904072f3b53b02a45deba.tfidf_vectorizer.joblib
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/go/concept_aliases.json not found in cache, downloading to /tmp/tmp7qjfcw2d


100%|██████████| 7.50M/7.50M [00:01<00:00, 5.61MiB/s]


Finished download, copying /tmp/tmp7qjfcw2d to cache at /root/.scispacy/datasets/e4e99357becdaacb55a07f8b1bcee8d7f6a634ab41db03ab28182a2166f24d4c.5b185b2b9139bc990299750dd4c87979e814ffee13ae3c650bc218c96dbc63ae.concept_aliases.json
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/kbs/2023-04-23/umls_go_2022.jsonl not found in cache, downloading to /tmp/tmp35krrnkh


100%|██████████| 21.0M/21.0M [00:02<00:00, 10.4MiB/s]


Finished download, copying /tmp/tmp35krrnkh to cache at /root/.scispacy/datasets/f2fae68affc838ddf0a87884154533ce359bda3c7d430bb7aa21ae851bee639d.0f776e01d8c81b2c7a6b9b8ffeff2bd7dc23c2b06fdc7719513bd10f1cff9c5a.umls_go_2022.jsonl


[COVID-19]

In [None]:
entity_linker('hpo', test)

https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/hpo/tfidf_vectors_sparse.npz not found in cache, downloading to /tmp/tmp26y_bwod


100%|██████████| 2.45M/2.45M [00:00<00:00, 2.98MiB/s]


Finished download, copying /tmp/tmp26y_bwod to cache at /root/.scispacy/datasets/ce11d8a176fa1830308fc265ab8845ca877f10c70fa3f74212ff2d9fdd97ab96.029e8ca566e1b5d6ab99138a96aa1c7b050565132aabb6b296a1c870c64d6f9b.tfidf_vectors_sparse.npz
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/hpo/nmslib_index.bin not found in cache, downloading to /tmp/tmp8zo3v74n


100%|██████████| 5.86M/5.86M [00:01<00:00, 5.28MiB/s]


Finished download, copying /tmp/tmp8zo3v74n to cache at /root/.scispacy/datasets/066d3db776b9acaff67728a857a1d6625f4c86194a70804ffd5399fa738caa4e.ecc1ac28794235140b2bafbbf81ce0454219cd1e05056786dce65ab17fee53b2.nmslib_index.bin
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/hpo/tfidf_vectorizer.joblib not found in cache, downloading to /tmp/tmpay9v2bmf


100%|██████████| 92.6k/92.6k [00:00<00:00, 340kiB/s]


Finished download, copying /tmp/tmpay9v2bmf to cache at /root/.scispacy/datasets/cdf67b07073317b3e1b0773eee16408261d3748e12bacb00a05eedf8de0daca5.8b3ec883bcab0da944125277caaacbcf72880dba390effff464cda4d2dcded62.tfidf_vectorizer.joblib
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/linkers/2023-04-23/hpo/concept_aliases.json not found in cache, downloading to /tmp/tmpyk5oyx2i


100%|██████████| 1.17M/1.17M [00:03<00:00, 400kiB/s]


Finished download, copying /tmp/tmpyk5oyx2i to cache at /root/.scispacy/datasets/092c266817935c16682d3a1511bad5bdb7e3665d93da4d2eb21d42fa6b2f4100.298fa9e3ef85c61367c35b4240deae6f06545e2cb68659bbad65602be2dfefab.concept_aliases.json
https://ai2-s2-scispacy.s3-us-west-2.amazonaws.com/data/kbs/2023-04-23/umls_hpo_2022.jsonl not found in cache, downloading to /tmp/tmpagnr5lei


100%|██████████| 4.95M/4.95M [00:01<00:00, 4.45MiB/s]


Finished download, copying /tmp/tmpagnr5lei to cache at /root/.scispacy/datasets/4acfb77195a577a57a9791f9627dcc8c47561d8c2fa4671b9a5ca0e494970e87.b703e72c55ea536eac9c2fcb2d63553d36ea0aadec6a3f525a9eb21998302bc7.umls_hpo_2022.jsonl


[COVID-19]

We will load the dataset we created earlier using the four scispaCy NER models and, for each entity, extract the CUI and definition from each of the five linkers where this information is available in the knowledge base.

In [None]:
df_entities = pd.read_csv('2024-10-15_scispacy_fastcoref_sent_coref_text_model_ents_labels.csv')

In [None]:
df_entities

Unnamed: 0,article_id,entity,label,model
0,PMC9549161,sildenafil,CHEMICAL,bc5cdr
1,PMC9549161,erectile dysfunction,DISEASE,bc5cdr
2,PMC9549161,pulmonary hypertension,DISEASE,bc5cdr
3,PMC9549161,bupropion,CHEMICAL,bc5cdr
4,PMC9549161,thalidomide,CHEMICAL,bc5cdr
...,...,...,...,...
8816,PMC9556799,gene,SO,craft
8817,PMC9556799,cells,CL,craft
8818,PMC9556799,African green monkey,TAXON,craft
8819,PMC9556799,genes,SO,craft


In [None]:
if "scispacy_linker" in nlp.pipe_names:
    nlp.remove_pipe("scispacy_linker")

nlp.add_pipe(
    "scispacy_linker",
    config={
        "linker_name": "umls",
        "resolve_abbreviations": True,
        "k": 10,  # Number of top candidates to consider for linking
        "max_entities_per_mention": 2  # Maximum number of entities to link per mention
    }
)
linker = nlp.get_pipe("scispacy_linker")

def umls_entity_linker(document):
    doc = nlp(document)
    try:
        entity = doc.ents[0]
    except IndexError:
        entity = 'Nan'
    entity_details = []
    entity_details.append(entity)
    try:
        for linker_ent in entity._.kb_ents:
            Concept_Id, Score = linker_ent
            entity_details.append(f'Entity_Matching_Score :{Score}')
            entity_details.append(linker.kb.cui_to_entity[linker_ent[0]])
    except AttributeError:
        pass
    return entity_details

In [None]:
%%time
df_entities['umls'] = df_entities['entity'].swifter.apply(lambda x : umls_entity_linker(x))

Pandas Apply:   0%|          | 0/8821 [00:00<?, ?it/s]

CPU times: user 2min 17s, sys: 3.42 s, total: 2min 20s
Wall time: 2min 29s


In [None]:
if "scispacy_linker" in nlp.pipe_names:
    nlp.remove_pipe("scispacy_linker")

nlp.add_pipe(
    "scispacy_linker",
    config={
        "linker_name": "mesh",
        "resolve_abbreviations": True,
        "k": 10,  # Number of top candidates to consider for linking
        "max_entities_per_mention": 2  # Maximum number of entities to link per mention
    }
)
linker = nlp.get_pipe("scispacy_linker")

def mesh_entity_linker(document):
    doc = nlp(document)
    try:
        entity = doc.ents[0]
    except IndexError:
        entity = 'Nan'
    entity_details = []
    entity_details.append(entity)
    try:
        for linker_ent in entity._.kb_ents:
            Concept_Id, Score = linker_ent
            entity_details.append(f'Entity_Matching_Score :{Score}')
            entity_details.append(linker.kb.cui_to_entity[linker_ent[0]])
    except AttributeError:
        pass
    return entity_details

In [None]:
%%time
df_entities['mesh'] = df_entities['entity'].swifter.apply(lambda x : mesh_entity_linker(x))

Pandas Apply:   0%|          | 0/8821 [00:00<?, ?it/s]

CPU times: user 2min 16s, sys: 3.71 s, total: 2min 20s
Wall time: 2min 28s


In [None]:
if "scispacy_linker" in nlp.pipe_names:
    nlp.remove_pipe("scispacy_linker")

nlp.add_pipe(
    "scispacy_linker",
    config={
        "linker_name": "rxnorm",
        "resolve_abbreviations": True,
        "k": 10,  # Number of top candidates to consider for linking
        "max_entities_per_mention": 2  # Maximum number of entities to link per mention
    }
)
linker = nlp.get_pipe("scispacy_linker")

def rxnorm_entity_linker(document):
    doc = nlp(document)
    try:
        entity = doc.ents[0]
    except IndexError:
        entity = 'Nan'
    entity_details = []
    entity_details.append(entity)
    try:
        for linker_ent in entity._.kb_ents:
            Concept_Id, Score = linker_ent
            entity_details.append(f'Entity_Matching_Score :{Score}')
            entity_details.append(linker.kb.cui_to_entity[linker_ent[0]])
    except AttributeError:
        pass
    return entity_details

In [None]:
%%time
df_entities['rxnorm'] = df_entities['entity'].swifter.apply(lambda x : rxnorm_entity_linker(x))

Pandas Apply:   0%|          | 0/8821 [00:00<?, ?it/s]

CPU times: user 2min 6s, sys: 3.41 s, total: 2min 10s
Wall time: 2min 19s


In [None]:
if "scispacy_linker" in nlp.pipe_names:
    nlp.remove_pipe("scispacy_linker")

nlp.add_pipe(
    "scispacy_linker",
    config={
        "linker_name": "go",
        "resolve_abbreviations": True,
        "k": 10,  # Number of top candidates to consider for linking
        "max_entities_per_mention": 2  # Maximum number of entities to link per mention
    }
)
linker = nlp.get_pipe("scispacy_linker")

def go_entity_linker(document):
    doc = nlp(document)
    try:
        entity = doc.ents[0]
    except IndexError:
        entity = 'Nan'
    entity_details = []
    entity_details.append(entity)
    try:
        for linker_ent in entity._.kb_ents:
            Concept_Id, Score = linker_ent
            entity_details.append(f'Entity_Matching_Score :{Score}')
            entity_details.append(linker.kb.cui_to_entity[linker_ent[0]])
    except AttributeError:
        pass
    return entity_details

In [None]:
%%time
df_entities['go'] = df_entities['entity'].swifter.apply(lambda x : go_entity_linker(x))

Pandas Apply:   0%|          | 0/8821 [00:00<?, ?it/s]

CPU times: user 2min 7s, sys: 3.83 s, total: 2min 11s
Wall time: 2min 24s


In [None]:
if "scispacy_linker" in nlp.pipe_names:
    nlp.remove_pipe("scispacy_linker")

nlp.add_pipe(
    "scispacy_linker",
    config={
        "linker_name": "hpo",
        "resolve_abbreviations": True,
        "k": 10,  # Number of top candidates to consider for linking
        "max_entities_per_mention": 2  # Maximum number of entities to link per mention
    }
)
linker = nlp.get_pipe("scispacy_linker")

def hpo_entity_linker(document):
    doc = nlp(document)
    try:
        entity = doc.ents[0]
    except IndexError:
        entity = 'Nan'
    entity_details = []
    entity_details.append(entity)
    try:
        for linker_ent in entity._.kb_ents:
            Concept_Id, Score = linker_ent
            entity_details.append(f'Entity_Matching_Score :{Score}')
            entity_details.append(linker.kb.cui_to_entity[linker_ent[0]])
    except AttributeError:
        pass
    return entity_details

In [None]:
%%time
df_entities['hpo'] = df_entities['entity'].swifter.apply(lambda x : hpo_entity_linker(x))

Pandas Apply:   0%|          | 0/8821 [00:00<?, ?it/s]

CPU times: user 2min 16s, sys: 3.91 s, total: 2min 20s
Wall time: 2min 35s


In [None]:
df_entities

Unnamed: 0,article_id,entity,label,model,umls,mesh,rxnorm,go,hpo
0,PMC9549161,sildenafil,CHEMICAL,bc5cdr,"[(sildenafil), Entity_Matching_Score :0.983729...","[(sildenafil), Entity_Matching_Score :0.985787...","[(sildenafil), Entity_Matching_Score :0.988025...",[(sildenafil)],[(sildenafil)]
1,PMC9549161,erectile dysfunction,DISEASE,bc5cdr,"[(erectile, dysfunction), Entity_Matching_Scor...","[(erectile, dysfunction), Entity_Matching_Scor...","[(erectile, dysfunction)]","[(erectile, dysfunction)]","[(erectile, dysfunction), Entity_Matching_Scor..."
2,PMC9549161,pulmonary hypertension,DISEASE,bc5cdr,"[(pulmonary, hypertension), Entity_Matching_Sc...","[(pulmonary, hypertension), Entity_Matching_Sc...","[(pulmonary, hypertension)]","[(pulmonary, hypertension)]","[(pulmonary, hypertension), Entity_Matching_Sc..."
3,PMC9549161,bupropion,CHEMICAL,bc5cdr,"[(bupropion), Entity_Matching_Score :0.9431696...","[(bupropion), Entity_Matching_Score :0.9397377...","[(bupropion), Entity_Matching_Score :0.9476281...",[(bupropion)],[(bupropion)]
4,PMC9549161,thalidomide,CHEMICAL,bc5cdr,"[(thalidomide), Entity_Matching_Score :0.99491...","[(thalidomide), Entity_Matching_Score :0.98994...","[(thalidomide), Entity_Matching_Score :0.97174...",[(thalidomide)],[(thalidomide)]
...,...,...,...,...,...,...,...,...,...
8816,PMC9556799,gene,SO,craft,"[(gene), Entity_Matching_Score :0.994082808494...","[(gene), Entity_Matching_Score :0.978798389434...",[(gene)],"[(gene), Entity_Matching_Score :0.792613029479...",[(gene)]
8817,PMC9556799,cells,CL,craft,[Nan],[Nan],[Nan],[Nan],[Nan]
8818,PMC9556799,African green monkey,TAXON,craft,"[(African, green, monkey), Entity_Matching_Sco...","[(African, green, monkey), Entity_Matching_Sco...","[(African, green, monkey)]","[(African, green, monkey)]","[(African, green, monkey)]"
8819,PMC9556799,genes,SO,craft,"[(genes), Entity_Matching_Score :0.99266731739...","[(genes), Entity_Matching_Score :0.99184966087...",[(genes)],[(genes)],[(genes)]


In [None]:
df_entities.to_csv('2024-10-15_scispacy_fastcoref_sent_coref_text_model_ents_labels_5_linkers.csv', index=False)
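
Each linker column above stores a mixed list: the entity span followed by alternating score strings and knowledge base entries. As a hedged sketch only, the same information could be kept as one dictionary per candidate, which is easier to filter or expand into rows later. The `KBEntity` class, the `linker_records` helper, and the concept identifier and score in the usage lines are illustrative stand-ins, not part of scispaCy or of this notebook's data. The helper only assumes that candidates arrive as `(concept_id, score)` pairs, as they do in `kb_ents`.

```python
from typing import NamedTuple, Optional


class KBEntity(NamedTuple):
    """Hypothetical stand-in for a knowledge base entry (reduced fields)."""
    concept_id: str
    canonical_name: str
    definition: Optional[str] = None


def linker_records(mention, kb_ents, cui_to_entity):
    """Turn (concept_id, score) candidates into one dict per candidate."""
    records = []
    for concept_id, score in kb_ents:
        entry = cui_to_entity[concept_id]
        records.append({
            "mention": mention,
            "concept_id": concept_id,
            "score": round(score, 4),
            "canonical_name": entry.canonical_name,
            "definition": entry.definition,
        })
    return records


# Made-up identifier and score, for illustration only
toy_kb = {"CUI_0001": KBEntity("CUI_0001", "sildenafil")}
records = linker_records("sildenafil", [("CUI_0001", 0.98372)], toy_kb)
print(records)
```

A mention with no candidates simply yields an empty list, so no placeholder value is needed for unlinked entities.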

### 5.6 Hearst Patterns

The `HyponymDetector` component implements the patterns from Marti Hearst's 1992 paper, [Automatic Acquisition of Hyponyms from Large Text Corpora](https://www.aclweb.org/anthology/C92-2082.pdf), using the spaCy rule-based `Matcher` component. Hearst described a method for discovering lexico-syntactic patterns that are easily recognisable, occur frequently across text genre boundaries, and indisputably indicate the lexical relation of interest (here hyponymy, the relation between a specific term and a more general one).
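
To make the idea of a lexico-syntactic pattern concrete, here is a minimal sketch of a single pattern, "X such as Y1, Y2 and Y3", written as a plain regular expression. This is not how scispaCy implements it: the library matches on token attributes and noun phrases through the spaCy `Matcher`, whereas this simplified stand-in assumes every noun phrase is a single word. The names `SUCH_AS` and `such_as_pairs` are hypothetical.

```python
import re

# Simplified stand-in for one Hearst pattern: "NP such as NP, NP and NP".
# Assumption: each noun phrase is a single word.
SUCH_AS = re.compile(
    r"\b(?P<general>\w+) such as "
    r"(?P<specific>\w+(?:, (?!and\b|or\b)\w+)*(?:,? (?:and|or) \w+)?)"
)


def such_as_pairs(sentence):
    """Return (rule, general term, specific term) tuples for 'such as'."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        general = match.group("general")
        # Split the enumeration on commas and on the final conjunction
        specifics = re.split(r",? (?:and|or) |, ", match.group("specific"))
        pairs.extend(("such_as", general, s) for s in specifics if s)
    return pairs


sentence = "Drugs such as chloroquine and hydroxychloroquine have immunomodulatory effects."
print(such_as_pairs(sentence))
```

Raw-text patterns like this break down on multi-word terms, which is one reason to match on parsed tokens instead.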




In [None]:
# convert coreference resolved text to list
all_article_text = pmc_arxiv_full_sent_text_spacy_fastcoref.sent_coref_text.tolist()

In [None]:
all_article_text

[['Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery.',
  'In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets.',
  'molecules that are waiting for approval for new pathways of action and targets are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted.',
  'The definition of the term drug repurposing has been endorsed by scholars and used by scholars.',
  'It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic swi

We will add the `HyponymDetector` component to the pipeline configured with `"extended": False`.

As per the [scispaCy documentation](https://github.com/allenai/scispacy?tab=readme-ov-file#hearst-patterns-v030-and-up), passing `"extended": True` uses the extended set of Hearst patterns, which adds higher-recall but lower-precision hyponymy relations (e.g. X compared to Y, X similar to Y).



In [None]:
nlp.add_pipe("hyponym_detector", last=True, config={"extended": False})

<scispacy.hyponym_detector.HyponymDetector at 0x7eebd38d0af0>

In [None]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'parser',
 'ner',
 'abbreviation_detector',
 'scispacy_linker',
 'hyponym_detector']

The `HyponymDetector` component produces a doc-level attribute on the spaCy doc, `doc._.hearst_patterns`, which is a list of tuples of extracted hyponym pairs. Each tuple contains:

* The relation rule used to extract the hyponym (type: `str`)
* The more general concept (type: `spacy.tokens.Span`)
* The more specific concept (type: `spacy.tokens.Span`)
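
As a small illustration of working with this tuple structure, the sketch below groups the specific concepts under their general concept. It assumes the spans have already been converted to strings (calling `str()` on a span returns its text), and the helper name `group_by_hypernym` is hypothetical. The sample triples are copied from the output shown later in this section.

```python
from collections import defaultdict


def group_by_hypernym(patterns):
    """Map each general concept to the sorted set of its specific concepts."""
    grouped = defaultdict(set)
    for _rule, general, specific in patterns:
        grouped[str(general)].add(str(specific))
    return {general: sorted(specifics) for general, specifics in grouped.items()}


sample = [
    ("such_as", "databases", "DrugBank"),
    ("such_as", "databases", "ChEMBL"),
    ("such_as", "databases", "DrugBank"),
    ("such_as", "factors", "price"),
]
print(group_by_hypernym(sample))
```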



In [None]:
# run the pipeline on every sentence of every article and collect the Hearst patterns
def get_hearst_patterns(docs):
    pattern_list = []
    for article_list in docs:
        for doc_text in article_list:
            doc = nlp(doc_text)
            hearst_patterns = doc._.hearst_patterns
            pattern_list.extend(hearst_patterns)
    return pattern_list

In [None]:
%%time

hearst_patterns = get_hearst_patterns(all_article_text)

CPU times: user 1min 49s, sys: 1.22 s, total: 1min 50s
Wall time: 1min 30s


In [None]:
len(hearst_patterns)

409

In [None]:
hearst_patterns

[('other', disciplines, precision medicine),
 ('other', disciplines, systems biology),
 ('other', disciplines, genomics),
 ('other', disciplines, polypharmacology),
 ('other', disciplines, precision medicine),
 ('other', disciplines, cheminformatics),
 ('other', disciplines, bioinformatics),
 ('other', disciplines, systems biology),
 ('other', disciplines, genomics),
 ('other', disciplines, polypharmacology),
 ('such_as', databases, DrugBank),
 ('such_as', databases, ChEMBL),
 ('such_as', databases, Cmap),
 ('such_as', databases, PDB),
 ('such_as', databases, OMIM),
 ('such_as', databases, etc),
 ('such_as', drug repurposing, EK-DRD),
 ('such_as', drug repurposing, DREIMT),
 ('such_as', drug repurposing, DrugSig),
 ('such_as', drug repurposing, RepoDB),
 ('include', compounds, machine learning),
 ('include', compounds, network modeling),
 ('include', compounds, reasoning),
 ('include', compounds, text mining),
 ('such_as', factors, price),
 ('such_as', factors, toxicity levels),
 ('suc

In [None]:
# Specify the output file path
hearst_patterns_2024_10_15_articles_0_10_409 = 'hearst_patterns_409.json'

In [None]:
# Convert the hearst_patterns list to JSON-serialisable format
serialisable_patterns = [(rel, str(ent1), str(ent2)) for rel, ent1, ent2 in hearst_patterns]

In [None]:
import json

# Save the patterns to the JSON file
with open(hearst_patterns_2024_10_15_articles_0_10_409, 'w') as f:
    json.dump(serialisable_patterns, f)

In [None]:
# Specify the path to the JSON file
hearst_patterns_2024_10_15_articles_0_10_409 = 'hearst_patterns_409.json'

In [None]:
# Load the hearst_patterns from the JSON file
with open(hearst_patterns_2024_10_15_articles_0_10_409, 'r') as f:
    hearst_patterns = json.load(f)

In [None]:
# Print the loaded patterns
print(hearst_patterns)

[['other', 'disciplines', 'precision medicine'], ['other', 'disciplines', 'systems biology'], ['other', 'disciplines', 'genomics'], ['other', 'disciplines', 'polypharmacology'], ['other', 'disciplines', 'precision medicine'], ['other', 'disciplines', 'cheminformatics'], ['other', 'disciplines', 'bioinformatics'], ['other', 'disciplines', 'systems biology'], ['other', 'disciplines', 'genomics'], ['other', 'disciplines', 'polypharmacology'], ['such_as', 'databases', 'DrugBank'], ['such_as', 'databases', 'ChEMBL'], ['such_as', 'databases', 'Cmap'], ['such_as', 'databases', 'PDB'], ['such_as', 'databases', 'OMIM'], ['such_as', 'databases', 'etc'], ['such_as', 'drug repurposing', 'EK-DRD'], ['such_as', 'drug repurposing', 'DREIMT'], ['such_as', 'drug repurposing', 'DrugSig'], ['such_as', 'drug repurposing', 'RepoDB'], ['include', 'compounds', 'machine learning'], ['include', 'compounds', 'network modeling'], ['include', 'compounds', 'reasoning'], ['include', 'compounds', 'text mining'], ['s
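
Note that the reloaded patterns are lists of strings rather than tuples of spans. The spans were converted with `str()` before saving, and JSON has no tuple type, so tuples are written as arrays and read back as lists. A minimal sketch of the round trip, using two triples from the output above and converting back to tuples where that matters (for example, for use as dictionary keys):

```python
import json

patterns = [
    ("such_as", "databases", "DrugBank"),
    ("include", "compounds", "text mining"),
]

# Tuples are serialised as JSON arrays and come back as Python lists
reloaded = json.loads(json.dumps(patterns))

# Convert each inner list back to a tuple if hashable triples are needed
restored = [tuple(item) for item in reloaded]
print(reloaded[0], restored[0])
```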

We will try a couple of test sentences.

In [None]:
doc = nlp("Drugs such as chloroquine and hydroxychloroquine have immunomodulatory effects.")

In [None]:
print(doc._.hearst_patterns)

[('such_as', Drugs, chloroquine), ('such_as', Drugs, hydroxychloroquine)]


In [None]:
doc = nlp("Drugs including chloroquine and hydroxychloroquine have immunomodulatory effects.")

In [None]:
print(doc._.hearst_patterns)

[('include', Drugs, chloroquine), ('include', Drugs, hydroxychloroquine)]


We will remove the `HyponymDetector` component from the pipeline and add it back again configured with `"extended": True`.

In [None]:
nlp.remove_pipe("hyponym_detector")

('hyponym_detector',
 <scispacy.hyponym_detector.HyponymDetector at 0x7eebd38d0af0>)

In [None]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'parser',
 'ner',
 'abbreviation_detector',
 'scispacy_linker']

In [None]:
nlp.add_pipe("hyponym_detector", last=True, config={"extended": True})

<scispacy.hyponym_detector.HyponymDetector at 0x7eeb938af730>

In [None]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'parser',
 'ner',
 'abbreviation_detector',
 'scispacy_linker',
 'hyponym_detector']

In [None]:
def get_hearst_patterns(docs):
    pattern_list = []
    for article_list in docs:
        for doc_text in article_list:
            doc = nlp(doc_text)
            hearst_patterns = doc._.hearst_patterns
            pattern_list.extend(hearst_patterns)
    return pattern_list

In [None]:
%%time

hearst_patterns_ext = get_hearst_patterns(all_article_text)

CPU times: user 1min 40s, sys: 1.09 s, total: 1min 41s
Wall time: 1min 22s


In [None]:
len(hearst_patterns_ext)

464

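As expected, with `"extended": True` we extract more pairs (464 compared with 409), consistent with the higher recall of the extended pattern set. The documentation notes that this comes at the cost of lower precision.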
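
One way to inspect what the extended set contributes is to compare the two result lists as multisets and count pairs per relation rule. The sketch below is an assumption about how that comparison could be done, not part of the original analysis. The helper names are hypothetical, it works on the string form of the triples, and the sample data (including the placeholder rule name) is made up for illustration.

```python
from collections import Counter


def extra_patterns(base, extended):
    """Return the triples found only in the extended run, with multiplicity."""
    base_counts = Counter(tuple(p) for p in base)
    ext_counts = Counter(tuple(p) for p in extended)
    return list((ext_counts - base_counts).elements())


def rule_counts(patterns):
    """Count how many pairs each relation rule produced."""
    return Counter(rule for rule, _general, _specific in patterns)


# Illustrative data only: the last rule name is a made-up placeholder
base = [["such_as", "databases", "DrugBank"], ["include", "compounds", "text mining"]]
extended = base + [["placeholder_extended_rule", "X", "Y"]]

print(extra_patterns(base, extended))
print(rule_counts(extended).most_common())
```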

In [None]:
hearst_patterns_ext

[('other', disciplines, precision medicine),
 ('other', disciplines, systems biology),
 ('other', disciplines, genomics),
 ('other', disciplines, polypharmacology),
 ('other', disciplines, precision medicine),
 ('other', disciplines, cheminformatics),
 ('other', disciplines, bioinformatics),
 ('other', disciplines, systems biology),
 ('other', disciplines, genomics),
 ('other', disciplines, polypharmacology),
 ('such_as', databases, DrugBank),
 ('such_as', databases, ChEMBL),
 ('such_as', databases, Cmap),
 ('such_as', databases, PDB),
 ('such_as', databases, OMIM),
 ('such_as', databases, etc),
 ('such_as', drug repurposing, EK-DRD),
 ('such_as', drug repurposing, DREIMT),
 ('such_as', drug repurposing, DrugSig),
 ('such_as', drug repurposing, RepoDB),
 ('include', compounds, machine learning),
 ('include', compounds, network modeling),
 ('include', compounds, reasoning),
 ('include', compounds, text mining),
 ('such_as', factors, price),
 ('such_as', factors, toxicity levels),
 ('suc

In [None]:
# Specify the output file path
hearst_patterns_2024_10_15_ext_articles_0_10_464 = 'hearst_patterns_ext_464.json'

In [None]:
# Convert the hearst_patterns list to JSON-serialisable format
serialisable_patterns_ext = [(rel, str(ent1), str(ent2)) for rel, ent1, ent2 in hearst_patterns_ext]

In [None]:
# Save the patterns to the JSON file
with open(hearst_patterns_2024_10_15_ext_articles_0_10_464, 'w') as f:
    json.dump(serialisable_patterns_ext, f)

In [None]:
# Specify the path to the JSON file
hearst_patterns_2024_10_15_ext_articles_0_10_464 = 'hearst_patterns_ext_464.json'

In [None]:
# Load the hearst_patterns from the JSON file
with open(hearst_patterns_2024_10_15_ext_articles_0_10_464, 'r') as f:
    hearst_patterns_ext = json.load(f)

In [None]:
# Print the loaded patterns
print(hearst_patterns_ext)

[['other', 'disciplines', 'precision medicine'], ['other', 'disciplines', 'systems biology'], ['other', 'disciplines', 'genomics'], ['other', 'disciplines', 'polypharmacology'], ['other', 'disciplines', 'precision medicine'], ['other', 'disciplines', 'cheminformatics'], ['other', 'disciplines', 'bioinformatics'], ['other', 'disciplines', 'systems biology'], ['other', 'disciplines', 'genomics'], ['other', 'disciplines', 'polypharmacology'], ['such_as', 'databases', 'DrugBank'], ['such_as', 'databases', 'ChEMBL'], ['such_as', 'databases', 'Cmap'], ['such_as', 'databases', 'PDB'], ['such_as', 'databases', 'OMIM'], ['such_as', 'databases', 'etc'], ['such_as', 'drug repurposing', 'EK-DRD'], ['such_as', 'drug repurposing', 'DREIMT'], ['such_as', 'drug repurposing', 'DrugSig'], ['such_as', 'drug repurposing', 'RepoDB'], ['include', 'compounds', 'machine learning'], ['include', 'compounds', 'network modeling'], ['include', 'compounds', 'reasoning'], ['include', 'compounds', 'text mining'], ['s

We will try a test sentence with the extended patterns.

In [None]:
doc = nlp("To predict the impact of repurposed drugs on comorbidities, we first selected the largest GWAS for 17 complex diseases including diabetes mellitus, cardiovascular disease, cerebrovascular disease, chronic liver disease, chronic kidney injury, autoimmune disease, and many cancers from CAUSALdb.")

In [None]:
print(doc._.hearst_patterns)

[('include', complex diseases, diabetes mellitus), ('include', complex diseases, disease), ('include', complex diseases, liver disease), ('include', complex diseases, kidney injury), ('include', complex diseases, cancers), ('include', complex diseases, cerebrovascular disease), ('include', complex diseases, disease)]
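
The output above contains the pair `('include', complex diseases, disease)` twice, and the article-level lists earlier in this section also show repeated triples. If duplicates are not wanted downstream, a minimal sketch of order-preserving deduplication on the string form of the triples could look like this (the helper name `unique_patterns` is hypothetical):

```python
def unique_patterns(patterns):
    """Drop repeated (rule, general, specific) triples, keeping first-seen order."""
    # dict keys preserve insertion order, so this deduplicates stably
    keys = ((rule, str(general), str(specific)) for rule, general, specific in patterns)
    return list(dict.fromkeys(keys))


sample = [
    ("include", "complex diseases", "diabetes mellitus"),
    ("include", "complex diseases", "disease"),
    ("include", "complex diseases", "liver disease"),
    ("include", "complex diseases", "disease"),
]
print(unique_patterns(sample))
```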


### References

* https://github.com/explosion/spaCy

* https://spacy.io/

* Neumann, M. et al. (2019). ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. [arXiv:1902.07669](https://arxiv.org/pdf/1902.07669)

* https://github.com/allenai/scispacy

* Hearst, M. (1992). [Automatic Acquisition of Hyponyms from Large Text Corpora.](https://aclanthology.org/C92-2082.pdf) In *COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics*, 539-545.

* Schwartz, A. & Hearst, M. (2003). [A simple algorithm for identifying abbreviation definitions in biomedical text.](https://pubmed.ncbi.nlm.nih.gov/12603049/) In *Pacific Symposium on Biocomputing 2003*, 451-62. PMID: 12603049  

* https://oyewusiwuraola.medium.com/how-to-use-scispacy-for-biomedical-named-entity-recognition-abbreviation-resolution-and-link-umls-87d3f7c08db2

* https://oyewusiwuraola.medium.com/how-to-use-scispacy-entity-linkers-for-biomedical-named-entities-7cf13b29ef67

* https://gbnegrini.com/post/biomedical-text-nlp-scispacy-named-entity-recognition-medical-records/

* https://kristinelpetrosyan.medium.com/ner-and-ned-with-spacy-dd847800b7d9

* https://kristinelpetrosyan.medium.com/named-entity-linking-7760b21e32f

* https://www.andrewvillazon.com/clinical-natural-language-processing-python/

* https://github.com/nasa-jpl-cord-19/Biomolecular-Named-Entities/blob/master/SciSpacy%20NER.ipynb


* https://towardsdatascience.com/construct-a-biomedical-knowledge-graph-with-nlp-1f25eddc54a0

* https://bratanic-tomaz.medium.com/constructing-knowledge-graphs-from-text-using-openai-functions-096a6d010c17

* Bratanič, T. (2024). [Graph Algorithms for Data Science](https://www.manning.com/books/graph-algorithms-for-data-science)

* https://www.analyticsvidhya.com/blog/2019/09/introduction-information-extraction-python-spacy/

* https://towardsdatascience.com/implementing-hearst-patterns-with-spacy-216e585f61f8