# Coreference Resolution - fastcoref

## 1. Introduction

Information Extraction is the higher-level NLP task of extracting structured information from unstructured text and traditionally comprises a pipeline of subtasks: Coreference Resolution, Named Entity Recognition, Entity Disambiguation and Linking, and Relation Extraction.

Coreference resolution consists of two steps: *mention detection*, which identifies all textual mentions of entities, including those that refer to an entity not by its proper name but through anaphora, cataphora, split reference, or a compound noun phrase; and *mention clustering*, which groups the mentions that refer to the same entity into a single cluster. It is most often presented as the task of replacing pronouns with the entities they refer to.
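To make the "replacing pronouns" view concrete, here is a minimal sketch that rewrites a text given clusters of character spans. The helper `resolve_with_first_mention` and its rule of always substituting the first mention of a cluster are assumptions made for illustration; real resolvers choose the replacement mention more carefully.

```python
def resolve_with_first_mention(text, clusters):
    """Replace every later mention in a cluster with the cluster's first mention.

    `clusters` is a list of clusters; each cluster is a list of
    (start, end) character spans with an exclusive end index.
    """
    replacements = []
    for cluster in clusters:
        first_start, first_end = cluster[0]
        antecedent = text[first_start:first_end]
        for start, end in cluster[1:]:
            replacements.append((start, end, antecedent))

    # Apply replacements from right to left so earlier offsets stay valid.
    for start, end, antecedent in sorted(replacements, reverse=True):
        text = text[:start] + antecedent + text[end:]
    return text


example = "Alice said she was tired."
resolved = resolve_with_first_mention(example, [[(0, 5), (11, 14)]])
print(resolved)  # Alice said Alice was tired.
```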

It could be considered an advanced preprocessing task and an important first step in an information extraction pipeline. The two main coreference resolution approaches are rule-based, using a set of predefined rules, and neural, which learn patterns in the data and include mention-pair, mention-ranking, and clustering-based models. Traditional domain-customised pipelines of separate models have been superseded by end-to-end neural coreference models; more recently, the rise of LLMs has opened up the potential for unified solutions that use prompt-based text generation for NLP tasks.

We will explore neural models using the [`fastcoref`](https://github.com/shon-otmazgin/fastcoref) library, which provides an easy and fast API for coreference resolution with only a few lines of code and without any preprocessing steps. It provides state-of-the-art coreference accuracy with the LingMess model, and a substantially faster F-Coref model.

## 2. Install/import libraries

In [None]:
!pip install fastcoref spacy scispacy

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz

In [None]:
import pandas as pd
import pickle
import spacy
import scispacy
import itertools
import warnings
warnings.filterwarnings("ignore")

from fastcoref import FCoref, spacy_component, LingMessCoref

## 3. F-Coref

The [`fastcoref`](https://github.com/shon-otmazgin/fastcoref) package was introduced by [Otmazgin et al.](https://arxiv.org/abs/2209.04280) in 2022 to address the lack of easy-to-use coreference packages implementing SOTA models, and the challenges of memory consumption and inference time.

We will compare the F-Coref and LingMess models for accuracy which, given that we only have a small dataset, will be more critical than speed.

The F-Coref model is based on the architecture of the start-to-end (s2e) model ([Kirstain et al., 2021](https://aclanthology.org/2021.acl-short.3/)), which scores each pair of spans in the text for coreference and represents each span as a function of its start and end tokens rather than constructing explicit span representations, thereby reducing the memory footprint and allowing it to handle longer sequences.
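A rough way to see where the memory saving comes from is to count vectors. The sketch below is only back-of-the-envelope arithmetic: the sequence length, the maximum span width, and the helper `count_candidate_spans` are illustrative assumptions, not values or code from the paper.

```python
def count_candidate_spans(n_tokens, max_span_width):
    """Count spans (start, end) whose width is at most max_span_width tokens."""
    return sum(min(max_span_width, n_tokens - start) for start in range(n_tokens))


# Illustrative values only (assumptions, not taken from the s2e paper).
n_tokens, max_span_width = 512, 30

# One vector per candidate span if span representations are materialised.
explicit_span_vectors = count_candidate_spans(n_tokens, max_span_width)

# One "start" and one "end" vector per token if spans are scored from boundaries.
start_end_vectors = 2 * n_tokens

print(explicit_span_vectors, start_end_vectors)
```

The number of candidate spans grows with both the sequence length and the maximum span width, whereas the number of boundary vectors grows only with the sequence length.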

Knowledge Distillation is the process of learning a small student model from a large teacher model. Here the F-Coref model is the student model learning coreference knowledge transferred from the LingMess teacher model.



In [None]:
# Initialise the pretrained FCoref model on CPU
model = FCoref(device='cpu')  # use FCoref(device='cuda:0') for GPU

config.json:   0%|          | 0.00/819 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/393 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/362M [00:00<?, ?B/s]

As described in the paper and documentation, the main functionality of `fastcoref` is the `predict` function, which accepts a list of texts. The model tries to predict whether spans are coreferent, i.e. refer to the same entity.

In [None]:
%%time

preds = model.predict(
   texts=['In the context of the pandemic in 2020, there was a surge in research on the subject, with COVID-19/SARS-CoV-2/Coronavirus topping the list of keywords as soon as they appeared that year.']
)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Inference:   0%|          | 0/1 [00:00<?, ?it/s]

CPU times: user 1.4 s, sys: 198 ms, total: 1.6 s
Wall time: 2.12 s


The return value of the function is a list of `CorefResult` objects, from which one can extract the coreference clusters.


We can extract the clusters as character indices over the original text.

In [None]:
preds[0].get_clusters(as_strings=False)

[[(91, 122), (163, 167)], [(34, 38), (177, 186)]]

We can also extract the clusters as strings.

In [None]:
preds[0].get_clusters()

[['COVID-19/SARS-CoV-2/Coronavirus', 'they'], ['2020', 'that year']]
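The two views are related by plain string slicing. The sketch below assumes, as the outputs above suggest, that each span is a `(start, end)` pair of character offsets with an exclusive end; it uses only the text and the spans already shown and does not call `fastcoref`.

```python
text = ('In the context of the pandemic in 2020, there was a surge in research '
        'on the subject, with COVID-19/SARS-CoV-2/Coronavirus topping the list '
        'of keywords as soon as they appeared that year.')

# Character-level clusters as returned by get_clusters(as_strings=False) above.
span_clusters = [[(91, 122), (163, 167)], [(34, 38), (177, 186)]]

# Slice the original text to recover the mention strings of each cluster.
string_clusters = [[text[start:end] for start, end in cluster]
                   for cluster in span_clusters]
print(string_clusters)
```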

The logit for each coreferring pair is the output of the neural network before any activation function is applied. Logits are raw, unnormalised scores rather than probabilities; here, a higher logit indicates that the model believes more strongly that the two spans are coreferent.


We will extract the logits for the first coreferring entity pair.

In [None]:
preds[0].get_logit(
   span_i=(91, 122), span_j=(163, 167)
)

4.59851

Logits for the second coreferring entity pair.

In [None]:
preds[0].get_logit(
   span_i=(34, 38), span_j=(177, 186)
)

7.3609405

The score for the second pair (7.36) is higher than for the first (4.60), which means that the model is more confident that the second pair is coreferent.

To make logits easier to compare, they can be normalised, for example by applying a softmax function over a set of scores so that the results sum to 1 and can be read as probabilities: the closer to 1, the higher the probability.
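As a minimal sketch of that normalisation, the snippet below applies a softmax to the two logits reported above. Treating these two pairs as competing candidates is an assumption made purely for illustration; `fastcoref` is not shown doing this, and the `softmax` helper is written here by hand.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Logits of the two coreferring pairs from the outputs above.
logits = [4.59851, 7.3609405]
probs = softmax(logits)
print(probs)  # two values that sum to 1; the larger logit gets the larger share
```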

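If the input text is already tokenised into words, you can pass `is_split_into_words=True`. The clusters are then returned as token indices and lists of tokens rather than character offsets and strings.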

In [None]:
%%time

preds = model.predict(
   texts = [["In", "the", "context", "of", "the", "pandemic", "in", "2020", ",",
             "there", "was", "a", "surge", "in", "research", "on", "the", "subject", ",",
             "with", "COVID-19/SARS-CoV-2/Coronavirus", "topping", "the", "list", "of", "keywords",
             "as", "soon", "as", "they", "appeared", "that", "year", "."]],
   is_split_into_words=True
)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Inference:   0%|          | 0/1 [00:00<?, ?it/s]

CPU times: user 368 ms, sys: 5.17 ms, total: 373 ms
Wall time: 402 ms


In [None]:
preds[0].get_clusters(as_strings=False)

[[(20, 21), (29, 30)], [(7, 8), (31, 33)]]

In [None]:
preds[0].get_clusters()

[[['COVID-19/SARS-CoV-2/Coronavirus'], ['they']], [['2020'], ['that', 'year']]]

In [None]:
preds[0].get_logit(
   span_i=(20, 21), span_j=(29, 30)
)

3.725453

In [None]:
preds[0].get_logit(
   span_i=(7, 8), span_j=(31, 33)
)

7.3277707

Processing can be applied to a collection of texts of any length in a batched and parallel fashion.

We will try a list of test sentences first using the faster F-Coref model.

In [None]:
text0 = "Autophagy maintains tumour growth through circulating the great arginine."
text1 = "It is a skin disease causing much itchiness. Scratching leads to redness, swelling, cracking, weeping clear fluid, crusting, and scaling."
text2 = "An outbreak of infection is considered a pandemic when it grows very fast. It starts out slowly at first and then takes off."
text3 = "Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) originated in Wuhan, China in late December 2019.  This outbreak began spreading at an alarming rate, and unleashed a severe health crisis around the globe."

texts = [text0, text1, text2, text3]


The `max_tokens_in_batch` parameter can be used to control the speed vs. memory consumption (as well as speed vs. accuracy) tradeoff, and can be tuned to maximise the utilisation of the associated hardware.
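To illustrate what a token budget per batch means, here is a simplified sketch that groups texts greedily so that each batch stays within a maximum number of tokens. It is not `fastcoref`'s implementation: the helper `batch_by_token_budget` is hypothetical, and whitespace splitting stands in for the model's real tokeniser.

```python
def batch_by_token_budget(texts, max_tokens_in_batch):
    """Greedily group texts so each batch holds at most max_tokens_in_batch tokens.

    A single text longer than the budget still gets a batch of its own.
    """
    batches, current, current_tokens = [], [], 0
    for text in texts:
        n_tokens = len(text.split())  # crude whitespace token count (assumption)
        if current and current_tokens + n_tokens > max_tokens_in_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches


batches = batch_by_token_budget(["a b c", "d e", "f g h i", "j"], max_tokens_in_batch=5)
print(batches)  # [['a b c', 'd e'], ['f g h i', 'j']]
```

A larger budget means fewer, bigger batches (more memory per step), while a smaller budget means more, smaller batches.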

In [None]:
%%time

# control the batch size
# with max_tokens_in_batch parameter

preds = model.predict(
    texts=texts, max_tokens_in_batch=100
)

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Inference:   0%|          | 0/4 [00:00<?, ?it/s]

CPU times: user 701 ms, sys: 20.1 ms, total: 721 ms
Wall time: 741 ms


We will show the list of `CorefResult` objects returned by the `predict` function.

In [None]:
preds

[CorefResult(text="Autophagy maintains tumour growth through circulat...", clusters=[]),
 CorefResult(text="It is a skin disease causing much itchiness. Scrat...", clusters=[['It', 'Scratching']]),
 CorefResult(text="An outbreak of infection is considered a pandemic ...", clusters=[['An outbreak of infection', 'it', 'It']]),
 CorefResult(text="Severe acute respiratory syndrome coronavirus 2 (S...", clusters=[['originated', 'This outbreak']])]

We can see that the `FCoref` model has identified no coreferring entities in the first text and returned an empty list of clusters, which is correct.

In [None]:
preds[1].get_clusters(as_strings=False)

[[(0, 2), (45, 55)]]

In [None]:
preds[1].get_clusters()

[['It', 'Scratching']]

The cluster for the second text is incorrect.

In [None]:
preds[1].get_logit(
   span_i=(0, 2), span_j=(45, 55)
)

0.037763357

Consistent with this, the logit value is very low, indicating that the model is not very confident.

In [None]:
preds[2].get_clusters(as_strings=False)

[[(0, 24), (55, 57), (75, 77)]]

In [None]:
preds[2].get_clusters()

[['An outbreak of infection', 'it', 'It']]

The cluster for the third text, which has two coreferring pairs sharing the same antecedent, is correct.

In [None]:
preds[2].get_logit(
   span_i=(0, 24), span_j=(55, 57)
)

8.613898

In [None]:
preds[2].get_logit(
   span_i=(0, 24), span_j=(75, 77)
)

9.424862

Interestingly, the logit score for the first coreferring mention is slightly lower than the second even though it is in the same sentence as the antecedent. This is an example of anaphora.

In [None]:
preds[3].get_clusters(as_strings=False)

[[(61, 71), (112, 125)]]

In [None]:
preds[3].get_clusters()

[['originated', 'This outbreak']]

The cluster for the fourth text is incorrect.

In [None]:
preds[3].get_logit(
   span_i=(61, 71), span_j=(112, 125)
)

0.24184513

And again, like the second text, it has a correspondingly low score.
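Since the incorrect clusters above have logits close to zero, one possible use of the scores is to flag low-confidence pairs for review. The sketch below uses the `FCoref` logits reported above; the cut-off `MIN_LOGIT` is an arbitrary assumption chosen for illustration, not a recommended or library-provided value.

```python
# (text name, mention pair, logit) copied from the FCoref outputs above.
scored_pairs = [
    ("text1", ("It", "Scratching"), 0.037763357),
    ("text2", ("An outbreak of infection", "it"), 8.613898),
    ("text2", ("An outbreak of infection", "It"), 9.424862),
    ("text3", ("originated", "This outbreak"), 0.24184513),
]

MIN_LOGIT = 1.0  # hypothetical threshold, would need tuning on labelled data

kept = [(name, pair) for name, pair, logit in scored_pairs if logit >= MIN_LOGIT]
flagged = [(name, pair) for name, pair, logit in scored_pairs if logit < MIN_LOGIT]

print("kept:", kept)
print("flagged:", flagged)
```

With this threshold, only the two pairs from the third text are kept, which matches the manual assessment above.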

## 4. LingMess

To use the larger but more accurate LingMess model, we can simply import `LingMessCoref` instead of `FCoref`.

LingMess (Linguistically Informed Multi Expert Scorers for Coreference Resolution) was also introduced by [Otmazgin et al.](https://arxiv.org/abs/2205.12644), who observed that different kinds of mention pairs require different information sources to assess their score, but existing coreference systems were based on a single pairwise scoring component.

LingMess significantly improves accuracy by splitting the scoring function into different categories, and routing each scoring decision to its own category based on a deterministic, linguistically informed heuristic.



In [None]:
# Initialise the pretrained LingMessCoref model on CPU
model = LingMessCoref(device='cpu')

config.json:   0%|          | 0.00/1.22k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/361 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

We will use the same test sentences as above and compare the output with the F-Coref model.

In [None]:
%%time

# control the batch size
# with max_tokens_in_batch parameter

preds = model.predict(
    texts=texts, max_tokens_in_batch=100
)

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Inference:   0%|          | 0/4 [00:00<?, ?it/s]

CPU times: user 35 s, sys: 828 ms, total: 35.8 s
Wall time: 36 s


We will show the list of `CorefResult` objects returned by the `predict` function.

In [None]:
preds

[CorefResult(text="Autophagy maintains tumour growth through circulat...", clusters=[]),
 CorefResult(text="It is a skin disease causing much itchiness. Scrat...", clusters=[]),
 CorefResult(text="An outbreak of infection is considered a pandemic ...", clusters=[['An outbreak of infection', 'it', 'It']]),
 CorefResult(text="Severe acute respiratory syndrome coronavirus 2 (S...", clusters=[['Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)', 'This outbreak']])]

On these four texts, the `LingMessCoref` model is more accurate than `FCoref`, which got text1 and text3 wrong, but it is roughly 49 times slower on CPU (36 s vs 741 ms wall time).

In [None]:
preds[2].get_clusters(as_strings=False)

[[(0, 24), (55, 57), (75, 77)]]

In [None]:
preds[2].get_clusters()

[['An outbreak of infection', 'it', 'It']]

In [None]:
# compared with 8.613898 for FCoref

preds[2].get_logit(
   span_i=(0, 24), span_j=(55, 57)
)

50.868378

In [None]:
# compared with 9.424862 for FCoref

preds[2].get_logit(
   span_i=(0, 24), span_j=(75, 77)
)

33.16862

The clusters for the third text are correct, and the logits are higher than those from `FCoref`, which scored 8.613898 and 9.424862 respectively, although logits from two different models are not necessarily on the same scale. The `LingMessCoref` model also gave the first coreferring pair the higher score, which seems more logical given that this mention is in the same sentence as the antecedent.

In [None]:
preds[3].get_clusters(as_strings=False)

[[(0, 60), (112, 125)]]

In [None]:
preds[3].get_clusters()

[['Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)',
  'This outbreak']]

The cluster for the fourth text is far more accurate than the `FCoref` cluster.

In [None]:
# compared with 0.24184513 for FCoref

preds[3].get_logit(
   span_i=(0, 60), span_j=(112, 125)
)

0.6793537

`LingMessCoref` also gives a higher logit score for the more accurate coreferring pair, although it is still quite low compared with other texts.

## 5. spaCy component

On top of the provided models, the package also provides a custom spaCy component that can be plugged into a spaCy (v3) pipeline.



### 5.1 F-Coref

We will try this using the pretrained `FCoref` model.

In [None]:
model = FCoref(device='cpu')

In [None]:
text = 'In the context of the pandemic in 2020, there was a surge in research on the subject, with COVID-19/SARS-CoV-2/Coronavirus topping the list of keywords as soon as they appeared that year.'

We pass `exclude=["parser", "lemmatizer", "ner", "textcat"]` to `spacy.load`, since the `fastcoref` component only relies on POS tagging, and then add the component to the pipeline to use `FCoref`.

In [None]:
# Load spaCy small model and exclude unused pipeline components
coref = spacy.load("en_core_web_sm", exclude=["parser", "lemmatizer", "ner", "textcat"])
coref.add_pipe("fastcoref")

<fastcoref.spacy_component.spacy_component.FastCorefResolver at 0x79cc1d045840>

In [None]:
coref.pipe_names

['tok2vec', 'tagger', 'attribute_ruler', 'fastcoref']

In [None]:
%%time

doc = coref(text)
doc._.coref_clusters

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Inference:   0%|          | 0/1 [00:00<?, ?it/s]

CPU times: user 448 ms, sys: 0 ns, total: 448 ms
Wall time: 462 ms


[[(91, 122), (163, 167)], [(34, 38), (177, 186)]]


By setting `'resolve_text': True` for the `fastcoref` component via `component_cfg` in the pipe call, we can get the resolved text, in which the mentions of each cluster are replaced.

In [None]:
%%time

doc = coref(
   text,
   component_cfg={"fastcoref": {'resolve_text': True}}
)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Inference:   0%|          | 0/1 [00:00<?, ?it/s]

CPU times: user 426 ms, sys: 0 ns, total: 426 ms
Wall time: 448 ms


In [None]:
doc._.resolved_text

'In the context of the pandemic in that year, there was a surge in research on the subject, with COVID-19/SARS-CoV-2/Coronavirus topping the list of keywords as soon as COVID-19/SARS-CoV-2/Coronavirus appeared that year.'

`[[(91, 122), (163, 167)], [(34, 38), (177, 186)]]` are the cluster indices for the following strings:

`[['COVID-19/SARS-CoV-2/Coronavirus', 'they'], ['2020', 'that year']]`

These look accurate, although it might have been more logical to have 'in 2020' as the entity. Note that the resolved text replaces '2020' with 'that year' rather than the other way round.

### 5.2 LingMess

We can also load other models, such as the more accurate `LingMessCoref` model.

In [None]:
model = LingMessCoref(device='cpu')

In [None]:
text = 'In the context of the pandemic in 2020, there was a surge in research on the subject, with COVID-19/SARS-CoV-2/Coronavirus topping the list of keywords as soon as they appeared that year.'

In [None]:
coref = spacy.load("en_core_web_sm", exclude=["parser", "lemmatizer", "ner", "textcat"])
coref.add_pipe(
   "fastcoref",
   config={'model_architecture': 'LingMessCoref', 'model_path': 'biu-nlp/lingmess-coref', 'device': 'cpu'}
)

<fastcoref.spacy_component.spacy_component.FastCorefResolver at 0x79cc2133cfd0>

In [None]:
coref.pipe_names

['tok2vec', 'tagger', 'attribute_ruler', 'fastcoref']

In [None]:
%%time

doc = coref(text)
doc._.coref_clusters

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Inference:   0%|          | 0/1 [00:00<?, ?it/s]

CPU times: user 10.8 s, sys: 313 ms, total: 11.1 s
Wall time: 13.9 s


[[(91, 122), (163, 167)], [(34, 38), (177, 186)]]

In [None]:
%%time

doc = coref(
   text,
   component_cfg={"fastcoref": {'resolve_text': True}}
)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Inference:   0%|          | 0/1 [00:00<?, ?it/s]

CPU times: user 9.72 s, sys: 490 ms, total: 10.2 s
Wall time: 10.4 s


In [None]:
doc._.coref_clusters

[[(91, 122), (163, 167)], [(34, 38), (177, 186)]]

In [None]:
doc._.resolved_text

'In the context of the pandemic in that year, there was a surge in research on the subject, with COVID-19/SARS-CoV-2/Coronavirus topping the list of keywords as soon as COVID-19/SARS-CoV-2/Coronavirus appeared that year.'

The clusters and resolved text are the same as for the `FCoref` model.

### 5.3 Batching texts

According to [spaCy's documentation](https://spacy.io/usage/processing-pipelines#processing), when processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts. spaCy's `nlp.pipe` method takes an iterable of texts and yields processed Doc objects. The batching is done internally.

We will try this out on the same list of texts.

In [None]:
text0 = "Autophagy maintains tumour growth through circulating the great arginine."
text1 = "It is a skin disease causing much itchiness. Scratching leads to redness, swelling, cracking, weeping clear fluid, crusting, and scaling."
text2 = "An outbreak of infection is considered a pandemic when it grows very fast. It starts out slowly at first and then takes off."
text3 = "Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) originated in Wuhan, China in late December 2019.  This outbreak began spreading at an alarming rate, and unleashed a severe health crisis around the globe."

texts = [text0, text1, text2, text3]

In [None]:
%%time

docs = coref.pipe(
   texts,
   component_cfg={"fastcoref": {'resolve_text': True}}
)


CPU times: user 16 µs, sys: 0 ns, total: 16 µs
Wall time: 21.2 µs


`docs` is a generator so we have to convert it to a list before we can access the resolved text.
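This also explains why the call above returned in microseconds: a generator does no work until it is consumed. The sketch below shows the same behaviour with a stand-in function; `fake_pipe` is hypothetical and only mimics lazy, per-item processing, it is not spaCy's `nlp.pipe`.

```python
import time


def fake_pipe(texts):
    """Stand-in for a lazy pipeline: yields one processed item at a time."""
    for text in texts:
        time.sleep(0.01)  # pretend this is expensive model inference
        yield text.upper()


start = time.perf_counter()
results = fake_pipe(["a", "b", "c"])  # returns immediately, nothing processed yet
creation_time = time.perf_counter() - start

start = time.perf_counter()
result_list = list(results)  # the actual work happens here
consume_time = time.perf_counter() - start

print(f"create: {creation_time:.6f}s, consume: {consume_time:.6f}s")
```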

In [None]:
%%time

doc_list = list(docs)


Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Inference:   0%|          | 0/4 [00:00<?, ?it/s]

CPU times: user 35.1 s, sys: 10.1 s, total: 45.2 s
Wall time: 48.8 s


Access the resolved text for a sample text in the list.

In [None]:
doc_list[3]._.resolved_text

'Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) originated in Wuhan, China in late December 2019.  Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) began spreading at an alarming rate, and unleashed a severe health crisis around the globe.'

Iterate over the list of Doc objects and access the resolved text.

In [None]:
for doc in doc_list:
    print(doc._.resolved_text)

Autophagy maintains tumour growth through circulating the great arginine.
It is a skin disease causing much itchiness. Scratching leads to redness, swelling, cracking, weeping clear fluid, crusting, and scaling.
An outbreak of infection is considered a pandemic when An outbreak of infection grows very fast. An outbreak of infection starts out slowly at first and then takes off.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) originated in Wuhan, China in late December 2019.  Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) began spreading at an alarming rate, and unleashed a severe health crisis around the globe.


Alternatively, we can collect the resolved texts into a list using the `map` function.

In [None]:
def resolve_text(doc):
    return doc._.resolved_text

In [None]:
resolved_texts = list(map(resolve_text, doc_list))

In [None]:
resolved_texts

['Autophagy maintains tumour growth through circulating the great arginine.',
 'It is a skin disease causing much itchiness. Scratching leads to redness, swelling, cracking, weeping clear fluid, crusting, and scaling.',
 'An outbreak of infection is considered a pandemic when An outbreak of infection grows very fast. An outbreak of infection starts out slowly at first and then takes off.',
 'Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) originated in Wuhan, China in late December 2019.  Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) began spreading at an alarming rate, and unleashed a severe health crisis around the globe.']

## 6. LingMess model on 10 articles

We will use the `LingMessCoref` model as part of a spaCy pipeline to perform coreference resolution on the full text of 10 articles.

In [None]:
# load data
with open('2024-07-24_pmc_arxiv_full_text_merged_plus_cleaned_test_sent_text.pickle', 'rb') as f:
    pmc_arxiv_full_text_merged_plus_cleaned_test = pickle.load(f)

In [None]:
# copy of the DataFrame
pmc_arxiv_full_sent_text_spacy_fastcoref = pmc_arxiv_full_text_merged_plus_cleaned_test.copy()

In [None]:
pmc_arxiv_full_sent_text_spacy_fastcoref

Unnamed: 0,article_id,published,revised,title,journal,authors,doi,pdf_url,text,sent_text
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...,"Sir James Black, a winner of the 1988 Nobel Pr...","[Sir James Black, a winner of the 1988 Nobel P..."
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...,Tight and selective interaction between ligand...,[Tight and selective interaction between ligan...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...,The severe acute respiratory syndrome coronavi...,[The severe acute respiratory syndrome coronav...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...,The sudden outbreak of SARS-CoV-2 in 2019 took...,[The sudden outbreak of SARS-CoV-2 in 2019 too...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...,"The 2019 novel coronavirus, now dubbed SARS-Co...","[The 2019 novel coronavirus, now dubbed SARS-C..."
5,PMC9527439,2022-09-25,2022-10-07,Novel Drug Design for Treatment of COVID-19: A...,The Canadian journal of infectious diseases & ...,"Mousavi S, Zare S, Mirzaei M, Feizi A.",10.1155/2022/2044282,https://europepmc.org/articles/PMC9527439?pdf=...,"Coronavirus disease 2019 COVID-19, which was f...","[Coronavirus disease 2019 COVID-19, which was ..."
6,PMC9729590,2022-12-08,2023-01-03,"Repurposing FDA-approved drugs cetilistat, abi...",Informatics in medicine unlocked,"Shahabadi N, Zendehcheshm S, Mahdavi M, Khadem...",10.1016/j.imu.2022.101147,https://europepmc.org/articles/PMC9729590?pdf=...,COVID-19 is an infectious disease caused by Co...,[COVID-19 is an infectious disease caused by C...
7,PMC9236981,2022-06-28,2022-12-21,A comprehensive review of artificial intellige...,Biomedicine & pharmacotherapy = Biomedecine & ...,"Ahmed F, Soomro AM, Chethikkattuveli Salih AR,...",10.1016/j.biopha.2022.113350,https://europepmc.org/articles/PMC9236981?pdf=...,A novel coronavirus CoV first appeared by the ...,[A novel coronavirus CoV first appeared by the...
8,PMC9694939,2022-11-10,2022-12-13,Structural Homology-Based Drug Repurposing App...,"Molecules (Basel, Switzerland)","Aljuaid A, Salam A, Almehmadi M, Baammi S, Als...",10.3390/molecules27227732,https://europepmc.org/articles/PMC9694939?pdf=...,Drug discovery is a time-consuming and costly ...,[Drug discovery is a time-consuming and costly...
9,PMC9556799,2022-10-13,2022-11-01,Rational drug repositioning for coronavirus-as...,iScience,"Wang J, Liu J, Luo M, Cui H, Zhang W, Zhao K, ...",10.1016/j.isci.2022.105348,https://europepmc.org/articles/PMC9556799?pdf=...,"Coronavirus disease 2019 COVID-19, caused by s...","[Coronavirus disease 2019 COVID-19, caused by ..."


We will test by performing inference on the first sentence-tokenised article in the dataset after converting it into a list.




In [None]:
# convert sentence-tokenised text column to a list
sent_texts = pmc_arxiv_full_sent_text_spacy_fastcoref.sent_text.tolist()

In [None]:
len(sent_texts)

10

In [None]:
# print number of sentences in each article
for text in sent_texts:
    print(len(text))

244
200
139
338
224
175
126
485
169
253


First sentence-tokenised article in dataset

In [None]:
texts = sent_texts[0]

In [None]:
texts

['Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery.',
 'In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets.',
 'These molecules are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted.',
 'The definition of the term drug repurposing has been endorsed by scholars and used by them.',
 'It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching.',
 'After the research study by Ashburn et al., Allarakhia et al

In [None]:
model = LingMessCoref(device='cpu')

In [None]:
coref = spacy.load("en_core_web_sm", exclude=["parser", "lemmatizer", "ner", "textcat"])
coref.add_pipe(
   "fastcoref",
   config={'model_architecture': 'LingMessCoref', 'model_path': 'biu-nlp/lingmess-coref', 'device': 'cpu'}
)

<fastcoref.spacy_component.spacy_component.FastCorefResolver at 0x7ed874cbbe50>

In [None]:
docs = coref.pipe(
   texts,
   component_cfg={"fastcoref": {'resolve_text': True}}
)


In [None]:
%%time

doc_list = list(docs)

Map:   0%|          | 0/244 [00:00<?, ? examples/s]

Inference:   0%|          | 0/244 [00:00<?, ?it/s]

CPU times: user 33min 16s, sys: 13min 24s, total: 46min 40s
Wall time: 47min 9s


In [None]:
def resolve_text(doc):
    return doc._.resolved_text

In [None]:
resolved_texts = list(map(resolve_text, doc_list))

In [None]:
resolved_texts

['Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery.',
 'In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets.',
 'These molecules are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted.',
 'The definition of the term drug repurposing has been endorsed by scholars and used by scholars.',
 'It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching.',
 'After the research study by Ashburn et al., Allarakhia e

Coreference is not resolved across sentences because each sentence is passed to the pipeline as a separate document, e.g. between the second and third sentences, '...approved drugs and molecules that are waiting for approval for new pathways of action and targets.' and 'These molecules are usually safe in clinical trials...'.

We will divide the article into larger chunks and retry with `'max_tokens_in_batch': 10000`.
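One way to build such chunks programmatically, rather than by hand, is to join consecutive sentences until a size limit is reached. This is a sketch under stated assumptions: `chunk_sentences` is a hypothetical helper, and the limit is measured in characters purely for simplicity, not because the model requires it.

```python
def chunk_sentences(sentences, max_chars):
    """Join consecutive sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


demo_chunks = chunk_sentences(
    ["These molecules are safe.", "They lack efficacy.", "A new sentence."],
    max_chars=50,
)
print(demo_chunks)
```

Because each chunk keeps neighbouring sentences together, a mention such as 'These molecules' has a chance of being linked to an antecedent in the preceding sentence.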

In [None]:
# convert text column into a list
texts = pmc_arxiv_full_sent_text_spacy_fastcoref.text.tolist()

In [None]:
len(texts)

10

In [None]:
texts[0]

'Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery. In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. These molecules are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted. The definition of the term drug repurposing has been endorsed by scholars and used by them. It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching. After the research study by Ashburn et al., Allarakhia et al. expanded the starti

In [None]:
text = texts[0]

In [None]:
len(text)

45830

We will split the text into six chunks.

In [None]:
text0 = "Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery. In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. These molecules are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted. The definition of the term drug repurposing has been endorsed by scholars and used by them. It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching. After the research study by Ashburn et al., Allarakhia et al. expanded the starting materials for drug repositioning to include products that were discontinued for commercial reasons, expired patents, and candidates for laboratory testing. In the discovery process of a completely new drug, the difficulty usually lies in its safety and efficacy, which are the main potential causes of failure of most drugs in the approval or clinical development stage. Using existing knowledge about a drug or known target, the time, risk, and cost of developing a drug using drug repositioning are reduced, thereby greatly increasing the efficiency and economics of drug development, providing a better risk–reward trade-off, and making it easier to win the favor of venture capital firms. 
Since the 1990s, the repositioning of sildenafil for male erectile dysfunction and pulmonary hypertension, the development of a new efficacy of bupropion for smoking cessation, new applications of thalidomide for multiple myeloma, and chronic graft-versus-host disease have generated intense interest from pharmaceutical companies and academics. These classic success stories rely on three traditional approaches 1 molecular biology approaches, 2 in vivo and ex vivo experimental approaches, and 3 expert knowledge-based approaches. Due to the unknown, complex, and information-fragmented nature of drug candidates and potential new mechanisms of action, this activity is dependent on multiple factors, and success is often fortuitous. At the beginning of the 21st century, cheminformatics, bioinformatics, systems biology, genomics, polypharmacology, precision medicine, and other disciplines, combined with artificial intelligence, have developed rapidly. These rapidly growing disciplines have promoted the generation of systematic computer methods to make the drug repositioning process cheaper and shorter. Computational drug repositioning is classified as disease-centric or target/gene-centric or drug-centric depending on the source of discovery. This process relies on public biochemical databases such as DrugBank, ChEMBL, Cmap, PDB, OMIM, etc., to provide the appropriate information. In fact, to make the computational drug repurposing process, including the molecular docking and virtual screening steps, more convenient, database tools specifically developed for drug repurposing, such as EK-DRD, DREIMT, DrugSig, RepoDB, Promiscuous 2.0, etc., have been reported in the last few years. In addition, it has been found in the literature that only 10% of the research results have been carried out in the drug-centric pathway, which holds great prospects for future development. 
With the help of database tools, it is now possible to perform computational screening of even a staggering number of hundreds of millions of compounds. Computer methods to carry out this screening include machine learning, network modeling, text mining, and semantic reasoning, among others. The ultimate objective of repositioning is to transfer one or two of the most relevant results to clinical applications. Therefore, validation is quite important and requires consideration of multiple factors, such as price, toxicity levels, bioavailability, and differences between validated and computational models. Current validation methods include experimental validation, electronic health records to aid validation, cross-validation, gold standard dataset evaluation, literature citation validation, and expert consultation. Today, drug repositioning is increasingly prominent in the development of drugs for a variety of neurological diseases, cancer, rare diseases, and infectious diseases. An increasing number of pharmaceutical companies are also establishing relevant R&D programs or funding support. To translate relevant research results efficiently and smoothly, national departments within the United Kingdom, the United States, and the Netherlands have launched initiatives or programs to build partnerships between pharmaceutical companies and academia and to further explore scientific and commercial opportunities. It is certain that drug repositioning currently presents several dilemmas, such as intellectual property challenges, data platforms, and analytical techniques that need to be improved, that financial support remains important for technology development and clinical trials, and that some scientists deny the practical utility of the approach. 
There have been systematic analyses of terminology in the drug repurposing literature, text mining of drug–disease combinations, and the progression of a particular drug, but no studies have yet provided a broad overview of publications on the topic of drug repurposing research. When independent researchers or collectives including pharmaceutical companies, academia, and government departments seek drug repurposing partnership partners and seek to obtain a concise overview of comprehensive current research hotspots, the lack of relevant intelligence analysis to aid decision-making often makes the process convoluted and time-consuming. The bibliometric approach can solve the aforementioned problems relatively fairly, but at present, scholars have only studied the bibliometrics of aspirin, a drug repurposing there has not been a panoramic study of drug repurposing, and therefore, this study is necessary. Bibliometrics is a useful tool combining multiple parameters for the quantitative analysis of scholarly publications and is currently used to assess research hotspots and trends in a wide range of disciplines and industries, such as management, sociology, economics, medicine, environmental engineering, and agronomy. Therefore, this study uses bibliometric methods to quantitatively assess the following elements of drug repositioning publications 1 major contributors countries, research institutions, and authors 2 modes of collaboration intercountry collaborations 3 the most productive journals 4 the most frequently used disciplinary knowledge and 5 research trends, judged by analyzing author keywords, Essential Science Indicators ESI high citations, and hot research studies. We use the Web of Science core database, an authoritative academic information data service platform produced by Clarivate version 2021 Clarivate."
text1 = "Due to its rigorous selection of journals, the Web of Science WOS Core Collection Database is now internationally recognized as a database for evaluating the scientific output or disciplinary development of scholars and institutions. Among the subdatabases, SCI-E mainly includes global journals in basic science research, covering basic pharmacological and medical research related to the theme of this study, drug repositioning, while SSCI includes social science, covering ethical, nursing, psychological, and other social science research related to this study. The data were obtained on 25 October 2021 through the WOS Core Collection Database Citation Indexes SCI-E and SSCI, using the formula drug repurposing OR drug repositioning OR drug rediscovery OR therapeutic switching OR drug redirecting OR drug rediscovery OR drug retasking search query, searching in the subject field and defining the document type as Article and Review. The publication time parameters were initially limited to publications related to drug repositioning published between 1990 and 2020. A total of 3009 documents were obtained, of which only 31 were published in two decades from 1990 to 2009. Of these 31 documents, except for one document that is still frequently used by scholars as a retrospective source for drug repurposing definitions in these years, the remaining 30 were cited by other authors during the period of 2010–2020 as shown in. The overall level of interest in these studies shows a fluctuating downward trend as opposed to the rising citation fervor for drug repurposing, entering a stage of decline even under the less-demanding evaluation criterion of a 5-year maturation window. As the literature ages, its content becomes stale and obsolete in the perspective of intelligence sources, and the value of the metrics for judging current research trends is low. Therefore, we further narrowed the study to 2978 publications published from 2010 to 2020. 
The complete records of all retrieved documents are downloaded and imported for processing into Derwent Data Analyzer DDA version V10, a data cleaning, multiperspective data mining, and visualization software from Clarivate that improves data analysis efficiency and reduces labor costs. After importing all records of WOS documents into DDA, they are classified and measured according to a list of fields such as keyword, country/region, institution, author, research field, journal, etc. For each item in the list fieldset, DDA has a built-in data cleaning tool for automatic data deduplication. After the machine has removed duplicates, the items in the set of fields still need to be manually verified for splitting or merging. It is to be noted that the regions of certain countries are presented separately, while they are usually considered as a single country internationally. Therefore, we need to perform merging, such as combining Wales, Scotland, England, and Northern Ireland into the United Kingdom column and combining Hong Kong and Macau regions into the China column. 
To address the possible problem of different authors with the same name, the following two main verification steps were performed 1 returning to the WOS database to search for publications under that author’s name under the original search formula conditions and 2 for authors whose publications provide disputed information this also includes three cases first, two or more authors with the same name but not the same person second, two or more authors with the same abbreviated name, but the full names were found to be different after a search and third, similar signatures being different variants of the same author’s name, in addition to searching the ORCID-related information of the authors concerned for judgment, we checked different institutional websites as well as encyclopedias to look for changes in the study and work history of authors with the same or similar names from 2010 to 2020 to determine whether they were the same person. Based on the verification, we then split or merged the results. After data cleaning and matrix analysis by DDA, various types of cluster plots and bubble plots can be obtained to reveal the useful information behind the data. The bibliometric fields of publication volumes, countries, international collaborations, institutions, research areas, journals, authors, highly cited research studies, and author keywords were analyzed in this study. It should be noted that because some studies were published online ahead of time and the study publication date was a year or two behind, for statistical purposes, the year of publication of such research studies was included as the year of online publication. e.g., a study shown in the reference as published in 2022 may have been published online in 2020. Of the 2978 papers obtained using the search criteria mentioned previously, the main ones were research studies and reviews. 
Furthermore, individual publications are not only classified by journals in the single category of research studies or reviews but also belong to other categories. These publications were also related to proceeding studies, early access, book chapters, data studies, and retracted publications. The vast majority of research studies and reviews were published in English, with the remainder in Japanese, Chinese, Czech, French, German, Hungarian, Korean, and Portuguese. Ninety were from SSCI, and the remaining 2888 were from SCI-E. Further, 1996 were from Open Access. An annual analysis of published research studies is shown in. The number of publications for every year expanded from 17 in 2010 to 970 in 2020. Annual publications on the subject have increased by more than 64 times. The number of annual publications has been increasing at a relatively high rate since 2015, while in 2020, there was a spike in the number of publications and annual citations, probably due to the COVID-19 pandemic, a global public health emergency that prompted special attention from scientists. Among the four countries with the highest number of publications the United States, China, the United Kingdom, and India, the United States has maintained a high growth volume since 2010, while China was the fastest in terms of average annual growth rate in the last three years."
text2 = "In 2020, the number of publications in India surged and surpassed the production of the United Kingdom. With respect to the 2978 publications related to drug repositioning research, 89 countries contributed to the field of drug repositioning research. The number of publications and citations from the 20 most productive countries/regions is shown in. There are nine countries/regions in Europe, five in the Americas, five in Asia, and one in Oceania. The four most productive countries/regions are, in order, the United States, China, the United Kingdom, and India. The United States is the absolute leader in this field, with 918 research studies on drug repositioning published since 2010, which is already more than the next highest number of publications in China and the United Kingdom combined. This is followed by India, Italy, Germany, South Korea, and Japan. Other productive countries include Brazil, France, Canada, Spain, Australia, the Netherlands, and Switzerland. In terms of publication impact, the United States led the Total citations TC rankings with 27355, twice as many as that of China, which ranked second. We also included the average citations per publication ACPP in the comparison, which is calculated by dividing the TC by the TP total papers value and is a relative number that may better reflect the individual or collective level of attention than the individual TC and TP values. Canada ranked first in ACPP at 41.81, closely followed by the United Kingdom and Germany. In addition, the h-index was originally proposed as a simple quantification that a researcher had at least h publications cited h times, reflecting to a certain extent the research results of the researcher as an individual. Later, the word researcher in the definition began to be replaced by collective words such as academic group or institution, journal, and country, becoming an indicator of the level of collective research to some extent. 
Undoubtedly, the h-index of the United States ranks first in this field with 74 times. Taking all parameters into account, we find that publications in the United Kingdom, the United States, and Canada perform better on average. While the number of publications in China and India is significant, they have received low levels of attention. It should be noted that DDA analysis software is nationally identified based on the location of each researcher’s institution address provided in the publication. If a publication is coauthored by institutions from more than two countries, the publication is defined as the result of an international collaboration. Whether there is some affiliation between the various institutions of the research group that produces the multicountry collaboration is not taken into account. As shown in, among the publications of the top 20 countries and regions, the proportion of international collaborations is quite high in European countries, especially in Sweden and Belgium. Asian and Latin American countries are generally underrepresented. In addition, the United States, the most active country in publishing and the country with the most collaborations—with 59 countries or regions—still has over 50% of the studies published overall. depicts the academic collaboration network for the top 20 countries and regions in terms of productivity. Using DDA software, the network was mapped using a co-occurrence matrix. The size of the circles is proportional to the extent of each country’s contribution, the lines between the circles represent the collaboration between countries/regions, and the thickness of the connecting lines indicates the frequency of collaboration. The results show that the United States cooperates most frequently with China and the United Kingdom and has the closest cooperation with them. 
In addition, Mexico, Belgium, Argentina, Taiwan, Japan, and Korea have slightly sparser cooperation networks among the 20 most productive countries/regions, while the remaining countries have more extensive cooperation networks among themselves. A total of 3530 institutes were involved in drug repositioning research. The top 20 productive institutes are shown in. Eight of the top 20 institutions are from the United States, again indicating the dominance of the United States in drug repositioning research three are from the United Kingdom two are from China and Brazil, France, Mexico, Canada, the Netherlands, Austria, and Sweden each have one. The Chinese Academy of Science ranked first in terms of the number of research studies, followed by Case Western Reserve University and the NIH. The Icahn School of Medicine at Mt Sinai ranked first in ACPP at 77.32. The NIH had the highest h-index value of 22. The Icahn School of Medicine at Mt Sinai was the best performer in ACPP at 77.32, followed by the University of California, San Francisco and Johns Hopkins University, both of which are US-based research institutions. Compared with US research institutions, Chinese research institutions are at the back of the pack in terms of ACPP, and their impact needs to be improved. The collaboration network between the 15 largest institutions in 2010–2020 is shown in. The collaboration network provides a more visual view of the collaboration with different institutions and thus helps in the search for more beneficial collaborations. Next to the name of each institution is its total number of publications. At the intersections of these institutions, yellow dots indicate collaborations with the other top 10 research institutions. It should be noted that the number of yellow dots can indicate the output of cooperation and the strength of interagency cooperation. 
The nodal data with no crossover points represent the number of publications produced by the institute, either by its independent work or in collaboration with research institutions outside the top 15. From, we see that the University of Cambridge established the largest collaborative network, followed by the large network established by four institutions, the NIH, the Icahn School of Medicine at Mt Sinai, Karolinska Institute, and King’s College London. In terms of the number of copublications with established institutions, the Chinese Academy of Science and Shanghai Jiao Tong University copublished as many as six, followed by the University of Cambridge and King’s College London and the NIH and the Icahn School of Medicine at Mt Sinai. Analyzing the aforementioned three pairs of institutional combinations, King’s College London has two publications that are the product of collaboration between the three research institutions. The University of Sao Paulo and Aix-Marseille University are relatively independent in this research area. Combining the ranking of multiple parameters, we found that the NIH and Icahn Sch Med Mt Sinai in the United States are the most vocal institutions in terms of academic research result perspective on the topic. An analysis of research areas gives a good indication of the scope of application of the research topic, with an unrestrained number of 74 areas covered, with the top 20 areas of research in terms of publication volume shown in. Briefly, pharmacology and pharmacy took first place with 962 articles, followed by biochemistry and molecular biology, and for ACPP, the top three were science and technology-other topics, mathematics, and cell biology. shows a bubble graph of the top 20 drug repositioning research areas. The bubble plot shows three dimensions of the data, namely, research area, year of publication, and the number of publications. 
The horizontal change in bubble size illustrates the growing trend of research areas over time, the vertical size of the bubble shows the most popular research areas in that year, and the number in the bubble indicates the frequency of the topic in the research area and the number of publications in that year. The number of research results in each relevant field is increasing year by year. Biophysics increased from five in 2019 to 77 in 2020, a more than 15-fold increase, suggesting that drug repositioning may have made a breakthrough or become widely used in this field. The field of virology was in a downturn from 2010 to 2014, with only one publication, with a gradual increase in relevant studies after 2015."
text3 = "For scholars studying drug repositioning-related topics, knowing which journals publish relevant research is important in deciding which journals to read or submit their research studies to. A total of 2988 publications related to drug repositioning research were published in 845 journals during the period of 2010–2020. The top 25 journals in terms of a total number of studies published are shown in Sci Rep topped the list with 75 studies published, followed by PLoS One, J. Biomol. Struct. Dyn, Bioinformatics, and BMC bioinformatics 50 articles 1.78%. The rest of the journals had a share of less than 1.5%. In terms of total citations TC, at present, studies in Drug Discov. have been cited a total of 2119 times over the past 10 years, followed in rank by PLoS One and Bioinformatics. For ACPP, Drug Discov. Today still holds first place with a high frequency of 50.45 times, followed by PLoS Comput 33.14 times. The impact factor IF of a journal is calculated by dividing the total number of citations of all publications in the journal in the previous two years by the number of publications. Thus, shows that the ACPP of drug repurposing publications included in most journals is much higher than that of IF, which roughly verifies that the number of scholars interested in drug repurposing is relatively high. In terms of the impact factor IF of specific journals, except for Oncotarget and BMC Syst. Biol., which have not been included in SCI since 2018 and 2020, Brief. Bioinform. has the highest value of 11.622, followed by Drug Discov. Today, Bioinformatics, Cancers, Eur. J. Med. Chem, and Expert. Opin. Drug Discov. The bubble chart shows that J. Biomol. Struct. Dyn. featured 64 publications in 2020, compared to a combined total of only four publications in the previous ten years the Oncotarget journal inclusion in this category peaked in 2016–2017. 
For scholars interested in the topic of drug repositioning, it is useful to know how other researchers are working on the issue to facilitate communication and collaboration between scholars. A total of 15620 authors contributed to studies within our measurement consideration, and shows the top 20 prolific authors by a number of publications. Of these 20 highly productive authors, seven were from the United States, three were from Argentina, and two were from Germany, indicating a relatively high concentration of drug repositioning research in certain countries. In addition, the NIH United States, Case Western Reserve University United States, Tech University Dresden Germany, and the National University of La Plata Argentina each have two of these academics. Cheng, FX leads the list with 25 research studies, followed by Talevi, A and Mucke, HAM. For the list of corresponding authors, the top three remain, in order, Mucke, HAM, Cheng, FX, and Talevi, A. In terms of ACPP ranking, Butte, AJ was ranked first with 154.33 points, followed by Cheng, FX, Tang, Y, and Dudley, JT. Cheng, FX still has the highest h-index at 21, followed by Zheng, W, Talevi, A, Xu, R, and Schroeder, M. The h-index has two drawbacks when researchers of the same topic are compared with each other. One is that the scholar’s h-index does not decrease over time but only grows or stays the same, and it is not possible to obtain information on whether the scholar is still in an academic career. In this study, we narrow the study to the most recent publications from 2010 to 2020, taking into account the timeliness of the h-index response information. Second, older scholars usually enter academia earlier and have an advantage in their h-indexes in comparison with those of younger scholars. Therefore, this phenomenon must be targeted for analysis or illustration. 
Thus, by combining the authors’ educational experiences and employment relationship changes that were recorded in the WOS database and ORCID business cards, we inferred that more than half of the scholars in the top 20 in terms of the number of publications received their Ph. D. before 2008, and two scholars, Mucke, HAM and Zheng, Wei, are older. In contrast, Cheng, FX, a scholar from Case Western Reserve Univ, completed his Ph. D. without a gap in 2013 and may have a longer academic career in the future therefore, Cheng, FX’s h-index in the field of drug repositioning is likely to grow more in the future and Cheng, FX is likely to have more academic influence. To reveal the focus of drug repositioning research and research trends, the author keywords and the highly cited and hot research topics of the ESI for each of the 2978 publications were analyzed, which were also derived from the core database of the WOS database SCI-E/SSCI. Highly cited studies were defined as studies in the top 1% of the citations for all studies in the same ESI discipline within the 10-year range of inclusion of ESI inclusion. A hot research topic of the ESI refers to a study published in two years with a citation frequency within one of the corresponding disciplines in the world in the last two months. Author keywords tend to provide more information and have thus become a widespread focus. The data of 6083 author keywords in the search results were merged to make keywords with the same meaning represented by a single unified word. In the end, 5616 author keywords were obtained. It should be specified that some publications without author keywords were excluded from the statistical analysis. Of these author keywords, 4296 were used only once, representing 76.50% of the total. A total of 1216 appeared 2–10 times, 79 appeared 10–20 times, 37 appeared 21–50 times, and the remaining eight were used between 51 and 1500 times. 
All keywords cumulatively appear a total of 12400 times, while the top 30 most used author keywords appear 2967 times alone, or approximately 23.93%, as shown in. The comparison of keywords in recent years allows for tracking the frontiers of research and predicting hotspots and trends in drug repositioning research. The bubble plots show the three dimensions of the data, namely, the year of publication, the author’s keywords, and the number of corresponding publications. The horizontal change in the size of the bubble illustrates the increasing trend of author keywords over time, the vertical size of the bubble shows the most popular keywords in that year, and the numbers in the bubble indicate the frequency of author keywords and the number of publications. The top 30 keywords include five diseases COVID-19/SARS-CoV-2/Coronavirus, Cancer, Alzheimer’s disease, Breast cancer, and Chagas disease. Drug names appear four times, Antiviral drugs, Metformin, Statins, and Antibiotics, which reveal the diseases and applications to which drugs were often repositioned during these 11 years. There were four subject categories, Bioinformatics, Polypharmacology, Systems biology, and Precision medicine and eight occurrences of research methods, namely, Virtual screening, Molecular docking, Machine learning 54, eighth, Clinical trials, High-throughput screening, Connectivity map, and Molecular dynamics simulation. In the context of the pandemic in 2020, there was a surge in research on the subject, with COVID-19/SARS-CoV-2/Coronavirus topping the list of keywords as soon as they appeared that year. Virtual screening is a research method that appeared seven times more frequently in 2020 than in the previous year. Since Drug repurposing/repositioning is a subject matter and a strategy for drug discovery/drug development, it would not make much sense to analyze these two keywords. 
Molecular docking is one of the core steps of virtual screening, and the COVID-19 pandemic generated many opportunities for the practice of drug repositioning. Therefore, high-quality studies of the keywords COVID-19/SARS-CoV-2/Coronavirus, Virtual screening, and Molecular docking were surveyed, as shown in the bubble chart, in the past two years, reflecting the relevant research trend in recent years. Wang, F et al. developed a new free reverse docking server based on a consensus algorithm combining several docking algorithm strategies to address the original shortcomings of computational molecular docking in drug repositioning, such as a low success rate, cumbersome operational steps, and reliance on code writing. M Lapillo et al. extensively evaluated the performance assessment of docking-based target fishing methods and developed a consensus docking-based target fishing tactic."
text4 = "In a virtual screening process, Gervasoni, S. conducted a literature search for molecular binding sites for SARS-CoV-2-associated protein targets while combining pocket and docking searches to propose a new pocket mapping strategy that identifies binding cavities with significantly better performance than pocket detection alone. Xie, L et al. screened antitoxic drugs based on the multitarget structure of the pathway center and stated that this inhibition of multiple targets in one pathway would be more effective than targeting a single protein, and the chance of drug resistance was smaller, which could be applied to other pathways. Li, Z et al. reported a virtual screening method based on accelerated free energy perturbation absolute binding free energy FEP-ABFE prediction and stated that the virtual screening method based on the prediction of FEP-ABFE will play a role in many other drug repositioning studies. After a series of drug repurposing computational screens and various validation activities by several scientists, it was agreed that raltegravir, clonidine, chloroquine and hydroxychloroquine have therapeutic effects in the treatment of novel coronavirus. In addition, from the studies on the keyword Machine learning over the 10-year period shown in the bubble chart, it was found that the classical machine learning algorithms of support vector machines, regularized least squares, logistic regression, and random forests have been widely used in inferring drug–target and drug–disease interactions. While the level of influence of a study is reflected by a combination of many aspects, the number of citations remains an important indicator. Based on the definition of highly cited and hot ESI papers in of this study, a total of 108 highly cited studies were obtained, of which 11 were hot research topics. Hot research topics are shown in. 
It should be noted that the first-ranked author is used here as a representative, and the corresponding institution is shown. This rule is followed in of this study. All hot research topics were published in 2020, and with the exception of an article describing the damage caused by nonsteroidal anti-inflammatory drugs NSAIDs to multiple organs and new information on drug repurposing, the remaining studies focused on drug repositioning therapeutic target studies in novel coronavirus pneumonia, screening drug studies, reviews of clinical trials, and reports of other coronavirus therapeutic agents and vaccine studies. From the perspective of cooperation, most of them were completed by a country’s independent agency. In terms of the countries and regions studied, four studies involved US scholars, five studies involved Asian scholars, and one contribution was from an African scholar. In addition, A SARS-CoV-2 protein interaction map reveals targets for drug repurposing, published in Nature by Gordon, DE with a total of 125 scholars from the United States, the United Kingdom, and France was the most cited publication with 952 citations. Eleven hot research topics were removed from the 108 highly cited ESI studies, and the top 20 most cited studies were selected from the remaining highly cited studies for analysis. In terms of year of publication, the study by Dudley, JT et al. published in NUCLEIC ACIDS RESEARCH in February 2011 was the earliest of these studies. Five highly cited studies were published in 2013, and three studies were published as recently as 2020. Two studies were published in Nature, and one each was published in Nat. Rev. Drug Discov. and Nat. Med. subj. of Nature E J. Med. Chem. L was next with two studies. There were 12 studies with the first author or coauthor from the United States, representing more than half of those in, followed by China, Canada, England, Germany, Japan, and Switzerland in order of contribution of two or more studies. 
Nine studies were based on collaborations between different institutions in multiple countries. One of them, entitled Alcohol-abuse drug disulfiram targets cancer via p97 segregase adapter NPL4, published in Nature in 2017 by Skrott, Z et al. is a collaboration between scholars from six countries Czech Republic, the United States, Denmark, Sweden, Switzerland, and China. In TC, DrugBank 5.0 a major update to the DrugBank database for 2018 by Canadian University of Alberta scientists Wishart, DS et al. ranked first 1820 total citations. The most cited publication on an annual basis was Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2, published in 2020, which was authored by Zhou, YD et al. and was the highest annual average cited publication with 609 citations. The scientists Cheng, FX and Dudley, JT, contributed to two of these 20 publications and are important influencers in the field. The three studies published in 2020 focus on novel coronavirus-related drug rediscovery activities and the use of deep learning techniques. and Pushpakom, S et al. provided systematic reviews of the methods and challenges of drug repositioning at that time. Initially, Sirota, M et al. explored the role of integrating genome-wide computational approaches for predicting reusable drugs, while from 2013 onward, Peters, JU et al., Medina-Franco et al., JL et al., Reddy, AS et al., and Anighoro, A et al. generally recognized the importance of combining multiple points of pharmacological knowledge for drug repositioning studies. In the face of a worldwide health emergency caused by the Zika virus epidemic, used drug repositioning to identify lead compounds for drug development. 
Of course, techniques related to the mining of repositionable drugs through experimental high-throughput screening, a traditional experimental approach, are not without progress for example, explored the challenges of transferring 3D cell culture technology to the use of high-throughput screening HTS. In 1995, Mchugh et al. investigated the immunomodulatory action mechanism of thalidomide in humans, which was the first relevant publication on drug repositioning. The publication time can be divided into three phases the growth period of 1995–2009, the steady growth period of 2010–2018, and the rapid rise from 2019 and beyond. The 2978 publications studied between 2010 and 2020 were completed by 15338 authors from 3530 research institutions in 89 countries, and at the time of this study’s completion, the WOS database had surpassed more than 1400 publications in 2021 under the same search restrictions for the topic, with more than 31000 citations for the year, supporting further evidence that the topic is still gaining momentum worldwide. The publication countries/regions are divided into three types first, countries with a traditionally developed medical level, mainly developed countries in Western Europe, North America, and Oceania second, countries with a developed pharmaceutical manufacturing industry, such as India and Japan in Asia and third, developing countries with some research potential, such as China, Brazil, Argentina, and Mexico. In terms of national cooperation, Western European countries have shown a high degree of cooperation, with the United States, China, and the United Kingdom cooperating more frequently. This may be because Western European countries have a tradition of cooperation in the field of research, and the United States, China, and the United Kingdom are the most powerful countries in terms of drug repositioning publications and therefore cooperate more with each other. 
The United States accounts for half of the 20 most productive institutions, which may explain why the United States still publishes more than 50% of its studies independently, despite having the largest international collaborative network base, because it already has the most active and high-quality producing institutions within the country for research institutions seeking collaboration. Furthermore, 19 of these 20 institutions are universities and research institutes, and one is a company, HM Pharma Consultancy, which was established in 2000 to focus on drug repositioning for the development of new drugs. This evidence suggests that the topic of drug repositioning is not only widely studied in academia but also has a place in the industry. The 2978 studies are spread across 74 research areas, but pharmacology and pharmacy and biochemistry and molecular biology account for a larger proportion of the total number of studies. It is quite notable that the majority of studies reported in biophysics did not rise significantly until 2020. The reasons for this may be the following first, there was a breakthrough in basic research in this field in 2020 and second, due to the novel coronavirus, research in this direction has increased its application for the prevention and control of the pandemic. In terms of journals, Sci Rep ranked first, followed by PLoS One and J. Biomol. Struct. Dyn. In terms of lead authors, three have the most productive and influential positions Cheng, FX is the most prolific author, based on the number of papers and h-index Mucke, HAM is the most frequent corresponding author and Butte, AJ is the top author in terms of ACPP ranking. Even though Latin American countries do not have an advantage in terms of national cooperation or the total number of institutional funding units, Latin American scholars have overcome many obstacles and are actively at the forefront of scholarship, contributing significantly to the field. 
Through the analysis of the authors’ keywords, cancer has been the main disease addressed by this method. Metformin has been found by many scientists to have a good inhibitory effect on various tumors, mainly in gynecology, and it has become a specific drug that has been most frequently mentioned in drug reuse in recent years."
text5 = 'In terms of antiviral drugs, scholars not only use drug repositioning to find antiviral drugs to treat diseases, such as Ebola and HIV, that have plagued humans for a long time but also use this method to seek treatments for infectious diseases, such as Zika virus and novel coronaviruses that have threatened several countries and even the world. For these diseases, emergency research on drug repositioning has played an important role in reducing mortality, calming patient fears, and restoring economic production when no specific drugs or vaccines were initially available during the pandemic. The combination of precision medicine and drug repositioning studies, often used to seek treatments for rare diseases and, in particular, genetically related diseases, is expected to be fully developed in the future. In the past 2 years, Virtual screening, together with Molecular docking and Machine learning, has become the most cutting-edge and important research methods in related technology fields, constantly improving the accuracy of drug reuse and screening. Currently, to develop more efficient and accurate research, there are two trends in the use of drug repositioning. One is the combination of various methods, such as the use of text mining and network analysis, and the creation of statistical models for predicting semantic link association to assess the relationship between pharmacological target pairings text analysis combined with machine learning to develop drugs for Parkinson’s disease prediction of new DTIs using data from multiple databases and the obtained relocated anticancer drugs were verified by cross-validation, literature, and experimental verification. Second, the most advanced algorithms are applied and improved, such as matrix decomposition and matrix completion and deep learning. 
In fact, some of the studies in the list of highly cited research topics on novel coronaviruses drug repurposing studies are currently approaching 3000 citations on Google Scholar. The percentage of highly cited studies and hot research topics related to novel coronaviruses is also a good indication that the method has made an indelible contribution to the study of novel coronaviruses and similar infectious diseases. Auxiliary technology for the experimental screening of traditional drugs is also developing, which also promotes drug repositioning or other drug development processes. Furthermore, the high-quality results of Elfiky, AA, a scientist from Cairo University, Egypt, suggest that relevant research in some economically underdeveloped countries may reach top levels worldwide due to the return or affiliation of some prominent scientists. For this research, the literature on drug repositioning research published in the SCI-E and SSCI sections of WOS core journals from 2010 to 2020 was analyzed based on bibliometrics and DDA software. This area has been of interest to scientists since the end of the 20th century and entered a period of rapid growth in 2019, with the peak far from being reached. Using bibliometrics as a tool, the United States has become a world leader in terms of the number of submissions, number of high-quality studies, funding support, strength of research institutions, and number of top scholars, followed by China and the United Kingdom, where more research is being performed in this area. As a method of drug discovery, drug repurposing is closely related to the development of various biomedical disciplines, and computer-related disciplinary methods, such as mathematical computational biology and computer science, have taken an important place in the research of this field in the last decade. 
The authors’ keyword analysis suggests that research in the field of the novel coronavirus will remain valuable until the associated pandemic is completely contained. Virtual screening, molecular docking, machine learning, and other related technical fields still need long-term development to achieve efficient and accurate repositioning of drugs. Precision medicine, combined with drug repositioning, is the most promising direction for the future. In conclusion, drug repositioning can help to treat more diseases, such as drug resistance, poor drug selectivity, and limited therapeutic options. This study may help some scholars with an initial interest in drug repositioning-related research to gain a concise and rapid understanding of the current state of global research, as well as offer some relevant information to institutions or groups seeking collaboration. It is worth noting that this study has some biases and limitations. First, there are still some issues with the publications included in the study based on subject terms 1 some relevant publications that do not use the search formula in this study may have been excluded from this study and 2 there may also be a small number of articles whose use of some of the aforementioned search terms deviates significantly from the general understanding yet, such publications are included in this study. Second, some extraneous factors distort the credibility of the bibliometric statistics. 1 When analyzing the keywords of publications, some publications are excluded from the statistical analysis because they do not list author keywords e.g. 2 Excessive self-citation by some authors inflates the actual level of interest in the publication. 3 for a publication, when an author submits more than one institution’s address information, this publication is counted as research results by each institution. 
Finally, in future work, patents from the WOS database associated with the topic of drug repositioning will be analyzed to provide another perspective on the situation of the topic in terms of applications and technological innovations.'

texts = [text0, text1, text2, text3, text4, text5]

In [None]:
len(texts)

6

In [None]:
model = LingMessCoref(device='cpu')

config.json:   0%|          | 0.00/1.22k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/361 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

We have set `'max_tokens_in_batch': 10000`, although this setting is probably more useful for the `FCoref` model: `LingMessCoref` appears unable to handle long texts because of an apparent [context window](https://github.com/shon-otmazgin/fastcoref/issues/47) of 4096 tokens.
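
One way to guard against this limit is to check the size of each text before inference. The sketch below is illustrative only: `approx_token_count` and `flag_long_texts` are hypothetical helpers, not part of `fastcoref`, and whitespace splitting is only a rough proxy that usually undercounts compared with the model's subword tokenizer. In practice the counter would be replaced with the tokenizer the model actually uses.

```python
def approx_token_count(text):
    # rough proxy (assumption): whitespace-separated words;
    # a subword tokenizer will usually produce more tokens than this
    return len(text.split())


def flag_long_texts(texts, limit=4096, count_tokens=approx_token_count):
    # return (index, token_count) for every text whose count exceeds the limit
    flagged = []
    for index, text in enumerate(texts):
        n_tokens = count_tokens(text)
        if n_tokens > limit:
            flagged.append((index, n_tokens))
    return flagged


sample_texts = ["short text", "word " * 5000]
print(flag_long_texts(sample_texts))  # [(1, 5000)]
```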



In [None]:
coref = spacy.load("en_core_web_sm", exclude=["parser", "lemmatizer", "ner", "textcat"])
coref.add_pipe(
   "fastcoref",
   config={'model_architecture': 'LingMessCoref', 'model_path': 'biu-nlp/lingmess-coref',  'max_tokens_in_batch': 10000, 'device': 'cpu'}
)

<fastcoref.spacy_component.spacy_component.FastCorefResolver at 0x7ed874cb9330>

In [None]:
docs = coref.pipe(
   texts,
   component_cfg={"fastcoref": {'resolve_text': True}}
)


In [None]:
%%time

doc_list = list(docs)

Map:   0%|          | 0/6 [00:00<?, ? examples/s]

Inference:   0%|          | 0/6 [00:00<?, ?it/s]

CPU times: user 4min 19s, sys: 1min 29s, total: 5min 48s
Wall time: 6min 11s


Running inference on the text batched into six chunks is much faster, at 6min 11s, than running it on the sentence-tokenised article, which took 47min 9s. Inference would likely be faster still on a GPU.
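
To put the two wall times quoted above side by side, the short calculation below converts them to seconds and computes the ratio. The helper name `to_seconds` is only for illustration.

```python
def to_seconds(minutes, seconds):
    # convert a "min s" wall time into seconds
    return minutes * 60 + seconds


batched = to_seconds(6, 11)         # six chunks: 6min 11s
per_sentence = to_seconds(47, 9)    # sentence-tokenised article: 47min 9s

speedup = per_sentence / batched
print(f"{speedup:.1f}x faster")     # 7.6x faster
```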

In [None]:
def resolve_text(doc):
    return doc._.resolved_text

In [None]:
resolved_texts = list(map(resolve_text, doc_list))

In [None]:
resolved_texts

["Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery. In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. molecules that are waiting for approval for new pathways of action and targets are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted. The definition of the term drug repurposing has been endorsed by scholars and used by scholars. It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching. After the res

If we look at the resolved text we can see that for the second and third sentences in the first article the coreference resolution has worked:

'...approved drugs and molecules that are waiting for approval for new pathways of action and targets. molecules that are waiting for approval for new pathways of action and targets are usually safe in clinical trials...'

This resolves the original third sentence: 'These molecules are usually safe in clinical trials...'
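
Rather than scanning the resolved text by eye, the rewritten sentences can be listed programmatically. The sketch below is a minimal illustration with a hypothetical helper, `changed_sentences`, and it assumes the original and resolved texts have already been split into the same number of sentences in the same order. The two toy sentences are shortened stand-ins, not actual model output.

```python
def changed_sentences(original, resolved):
    # pair sentences by position and keep only those the resolver rewrote
    return [
        (before, after)
        for before, after in zip(original, resolved)
        if before != after
    ]


# shortened, illustrative sentences (not real model output)
original = [
    "New indications are sought for approved drugs and molecules.",
    "These molecules are usually safe in clinical trials.",
]
resolved = [
    "New indications are sought for approved drugs and molecules.",
    "molecules are usually safe in clinical trials.",
]

for before, after in changed_sentences(original, resolved):
    print(f"{before!r} -> {after!r}")
```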

We will now perform coreference resolution on the first 10 articles.

In [None]:
model = LingMessCoref(device='cpu')

In [None]:
# convert text for 10 articles into a list
all_sent_text = pmc_arxiv_full_sent_text_spacy_fastcoref.sent_text.tolist()

In [None]:
# number of sentences in each article
for i in all_sent_text:
    print(len(i))

244
200
139
338
224
175
126
485
169
253


We will divide each sentence-tokenised article into chunks of 50 sentences.

In [None]:
slice_size = 50

def chunk_sentences(sentences, size=slice_size):
    # split one article's sentences into consecutive chunks of `size`;
    # the last chunk holds any remainder
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

In [None]:
# the function is named chunk_sentences so that assigning the result to
# chunks_list does not overwrite it, and the cell can be re-run safely
chunks_list = list(map(chunk_sentences, all_sent_text))

In [None]:
len(chunks_list)

10

In [None]:
# print number of 50-sentence chunks in each article
for i in chunks_list:
    print(len(i))

5
4
3
7
5
4
3
10
4
6
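
As a quick sanity check, the number of chunks per article should equal the sentence count divided by 50 and rounded up. The sketch below recomputes this from the sentence counts printed earlier; the variable names are only for illustration.

```python
import math

# sentence counts per article, as printed above
sentence_counts = [244, 200, 139, 338, 224, 175, 126, 485, 169, 253]
slice_size = 50

expected_chunks = [math.ceil(n / slice_size) for n in sentence_counts]
print(expected_chunks)       # [5, 4, 3, 7, 5, 4, 3, 10, 4, 6]
print(sum(expected_chunks))  # 51
```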


In [None]:
# print number of sentences in each chunk - the last chunk will have the remainder

for i in chunks_list:
    for j in i:
        print(len(j))

50
50
50
50
44
50
50
50
50
50
50
39
50
50
50
50
50
50
38
50
50
50
50
24
50
50
50
25
50
50
26
50
50
50
50
50
50
50
50
50
35
50
50
50
19
50
50
50
50
50
3


In [None]:
for i in chunks_list:
    for j in i:
        print(j)

['Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery.', 'In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets.', 'These molecules are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted.', 'The definition of the term drug repurposing has been endorsed by scholars and used by them.', 'It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching.', 'After the research study by Ashburn et al., Allarakhia et al. exp

In [None]:
with open('2024-07-24_sent_text_batches_of_50.pickle', 'wb') as f:
  pickle.dump(chunks_list, f)


Create batches of chunks, one batch of 50-sentence chunks for each article.

In [None]:
def return_batches(lst):

    batches = []

    for i in lst:
        batches.append(' '.join(map(str, i)))

    return batches

In [None]:
batches = list(map(return_batches, chunks_list))

In [None]:
with open('2024-07-24_sent_text_joined_batches_of_50.pickle', 'wb') as f:
  pickle.dump(batches, f)

In [None]:
len(batches)

10

In [None]:
# print number of characters in each 50-sentence chunk in the first batch (i.e. first article)
for batch in batches[0]:
    print(len(batch))

11427
7743
7748
9366
9542
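
Fixed 50-sentence chunks vary a lot in length, as the character counts above show. An alternative, sketched below under stated assumptions, is to pack sentences greedily up to a token budget. `chunk_by_budget` is a hypothetical helper, not part of `fastcoref`, and the default whitespace counter is only a rough stand-in for the model's own tokenizer.

```python
def chunk_by_budget(sentences, max_tokens=4096,
                    count_tokens=lambda s: len(s.split())):
    # greedily pack consecutive sentences, starting a new chunk
    # whenever adding the next sentence would exceed the budget
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n_tokens = count_tokens(sentence)
        if current and current_tokens + n_tokens > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n_tokens
    if current:
        chunks.append(current)
    return chunks


sentences = ["one two three", "four five", "six seven eight nine", "ten"]
print(chunk_by_budget(sentences, max_tokens=5))
# [['one two three', 'four five'], ['six seven eight nine', 'ten']]
```

Note that a single sentence longer than the budget still ends up in a chunk of its own, so it would need separate handling.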


In [None]:
# print text in each 50-sentence chunk in first batch
for batch in batches[0]:
    print(batch)

Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery. In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. These molecules are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted. The definition of the term drug repurposing has been endorsed by scholars and used by them. It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching. After the research study by Ashburn et al., Allarakhia et al. expanded the startin

In [None]:
# Create spaCy pipeline with LingMessCoref component.
coref = spacy.load("en_core_web_sm", exclude=["parser", "lemmatizer", "ner", "textcat"])
coref.add_pipe(
   "fastcoref",
   config={'model_architecture': 'LingMessCoref', 'model_path': 'biu-nlp/lingmess-coref', 'max_tokens_in_batch': 10000, 'device': 'cpu'}
)

config.json:   0%|          | 0.00/1.22k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/361 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

<fastcoref.spacy_component.spacy_component.FastCorefResolver at 0x7c0ed3bdad70>

In [None]:
coref.pipe_names

['tok2vec', 'tagger', 'attribute_ruler', 'fastcoref']

We will perform inference on the batches of 50-sentence chunks for the 10 articles.

In [None]:
%%time

doc_list = []

for batch in batches:
    # each batch holds the joined 50-sentence chunks of one article
    docs = coref.pipe(
        batch,
        component_cfg={"fastcoref": {'resolve_text': True}}
    )
    doc_list.append(list(docs))

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Inference:   0%|          | 0/5 [00:00<?, ?it/s]

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Inference:   0%|          | 0/4 [00:00<?, ?it/s]

Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Inference:   0%|          | 0/3 [00:00<?, ?it/s]

Map:   0%|          | 0/7 [00:00<?, ? examples/s]

Inference:   0%|          | 0/7 [00:00<?, ?it/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Inference:   0%|          | 0/5 [00:00<?, ?it/s]

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Inference:   0%|          | 0/4 [00:00<?, ?it/s]

Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Inference:   0%|          | 0/3 [00:00<?, ?it/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Inference:   0%|          | 0/10 [00:00<?, ?it/s]

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Inference:   0%|          | 0/4 [00:00<?, ?it/s]

Map:   0%|          | 0/6 [00:00<?, ? examples/s]

Inference:   0%|          | 0/6 [00:00<?, ?it/s]

CPU times: user 33min 11s, sys: 12min 57s, total: 46min 8s
Wall time: 47min 24s


In [None]:
len(doc_list)

10

In [None]:
doc_list

[[Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery. In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. These molecules are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted. The definition of the term drug repurposing has been endorsed by scholars and used by them. It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching. After the research study by Ashburn et al., Allarakhia et al. expanded the start

In [None]:
with open('2024-07-25_fastcoref_lingmesscoref_spacy_articles_sent_text_batches_of_50_df.pickle', 'wb') as f:
  pickle.dump(doc_list, f)

Resolved text for the first chunk of the first article.

In [None]:
doc_list[0][0]._.resolved_text

"Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery. In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. molecules that are waiting for approval for new pathways of action and targets are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted. The definition of the term drug repurposing has been endorsed by scholars and used by scholars. It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching. After the rese

We can see again that the coreference resolution has worked for the second and third sentences with '...molecules that are waiting for approval...'

We will get the resolved text for all 10 articles.

In [None]:
def resolve_text(docs):
    # collect the resolved text of every chunk (spaCy Doc) in one article
    return [doc._.resolved_text for doc in docs]

In [None]:
resolved_texts = list(map(resolve_text, doc_list))

In [None]:
len(resolved_texts)

10

In [None]:
resolved_texts

[["Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery. In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. molecules that are waiting for approval for new pathways of action and targets are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted. The definition of the term drug repurposing has been endorsed by scholars and used by scholars. It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching. After the re

In [None]:
with open('2024-07-25_resolved_fastcoref_lingmesscoref_spacy_articles_sent_text_all_batches_of_50.pickle', 'wb') as f:
  pickle.dump(resolved_texts, f)

We will join the resolved chunks together as full articles.

In [None]:
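def join_docs(lst):
    # join the resolved chunks of one article into a single string;
    # the result is a one-element list, which is flattened in a later cell
    joined_docs = []

    joined_docs.append(' '.join(map(str, lst)))

    return joined_docs

In [None]:
joined_docs = list(map(join_docs, resolved_texts))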

In [None]:
len(joined_docs)

10

In [None]:
joined_docs

[["Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery. In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. molecules that are waiting for approval for new pathways of action and targets are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted. The definition of the term drug repurposing has been endorsed by scholars and used by scholars. It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching. After the re

In [None]:
with open('2024-07-25_resolved_fastcoref_lingmesscoref_spacy_articles_sent_text_joined_docs.pickle', 'wb') as f:
  pickle.dump(joined_docs, f)

In [None]:
# flatten the list of joined articles
joined_docs_flat = list(itertools.chain.from_iterable(joined_docs))

In [None]:
joined_docs_flat

["Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery. In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. molecules that are waiting for approval for new pathways of action and targets are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted. The definition of the term drug repurposing has been endorsed by scholars and used by scholars. It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic switching. After the res

In [None]:
len(joined_docs_flat)

10
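
The join and flatten steps above can also be expressed as a single pass. The sketch below is an equivalent formulation on toy data, assuming the input is a list of articles where each article is a list of resolved chunk strings; `join_chunks` is a hypothetical name.

```python
def join_chunks(resolved_texts):
    # one string per article: join that article's resolved chunks with a space
    return [' '.join(chunks) for chunks in resolved_texts]


toy_resolved = [["chunk A1.", "chunk A2."], ["chunk B1."]]
print(join_chunks(toy_resolved))  # ['chunk A1. chunk A2.', 'chunk B1.']
```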

In [None]:
with open('2024-07-25_resolved_fastcoref_lingmesscoref_spacy_articles_sent_text_joined_docs_flat.pickle', 'wb') as f:
  pickle.dump(joined_docs_flat, f)

Add list of resolved articles as a column in the DataFrame.

In [None]:
with open('2024-07-24_pmc_arxiv_full_text_merged_plus_cleaned_test_sent_text.pickle', 'rb') as f:
    pmc_arxiv_full_sent_text_spacy = pickle.load(f)

In [None]:
pmc_arxiv_full_sent_text_spacy_fastcoref = pmc_arxiv_full_sent_text_spacy.copy()

In [None]:
pmc_arxiv_full_sent_text_spacy_fastcoref.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   article_id  10 non-null     object
 1   published   10 non-null     object
 2   revised     10 non-null     object
 3   title       10 non-null     object
 4   journal     10 non-null     object
 5   authors     10 non-null     object
 6   doi         10 non-null     object
 7   pdf_url     10 non-null     object
 8   text        10 non-null     object
 9   sent_text   10 non-null     object
dtypes: object(10)
memory usage: 928.0+ bytes


In [None]:
pmc_arxiv_full_sent_text_spacy_fastcoref['coref_text'] = joined_docs_flat

In [None]:
pmc_arxiv_full_sent_text_spacy_fastcoref.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   article_id  10 non-null     object
 1   published   10 non-null     object
 2   revised     10 non-null     object
 3   title       10 non-null     object
 4   journal     10 non-null     object
 5   authors     10 non-null     object
 6   doi         10 non-null     object
 7   pdf_url     10 non-null     object
 8   text        10 non-null     object
 9   sent_text   10 non-null     object
 10  coref_text  10 non-null     object
dtypes: object(11)
memory usage: 1008.0+ bytes


In [None]:
with open('2024-07-25_pmc_arxiv_full_sent_text_spacy_fastcoref_df.pickle', 'wb') as f:
  pickle.dump(pmc_arxiv_full_sent_text_spacy_fastcoref, f)

We will sentence-tokenise the coreference-resolved text and add it as a column in the DataFrame for the next stage in the information extraction pipeline, Named Entity Recognition (NER).

The default spaCy models are mostly trained on newspaper and web text, so we will load a scispaCy model for biomedical data to perform sentence tokenisation.

In [None]:
nlp = spacy.load('en_core_sci_sm')

In [None]:
sentencizer = nlp.add_pipe("sentencizer")

In [None]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'parser',
 'ner',
 'sentencizer']

In [None]:
# sentence-tokenise the coreference resolved text column

def sent_pipe(text):
    sentence_list = []
    for doc in nlp.pipe(iter(text)):
        sentences = [sentence.text for sentence in doc.sents]
        sentence_list.append(sentences)
    return sentence_list

In [None]:
%%time

resolved_coref_text_sent = sent_pipe(pmc_arxiv_full_sent_text_spacy_fastcoref.coref_text)

CPU times: user 12.2 s, sys: 2.67 s, total: 14.8 s
Wall time: 16.3 s


In [None]:
resolved_coref_text_sent

[['Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery.',
  'In 2004, Ted T. Ashburn et al. summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets.',
  'molecules that are waiting for approval for new pathways of action and targets are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted.',
  'The definition of the term drug repurposing has been endorsed by scholars and used by scholars.',
  'It should be pointed out that the synonyms of drug repurposing often used by academics also include drug repositioning, drug rediscovery, drug redirecting, drug retasking, and therapeutic swi

In [None]:
len(resolved_coref_text_sent)

10

In [None]:
pmc_arxiv_full_sent_text_spacy_fastcoref['sent_coref_text'] = resolved_coref_text_sent
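Assigning a plain list as a column relies on position, so the list must hold exactly one entry per row. The sketch below checks that on a small stand-in DataFrame and then reshapes the result to one sentence per row, a layout the NER stage might find convenient. The two-row data and the name `sentences_long` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical two-row stand-in for the notebook's DataFrame.
df = pd.DataFrame({
    "article_id": ["A1", "A2"],
    "coref_text": ["First sentence. Second sentence.", "Only sentence."],
})
sentence_lists = [["First sentence.", "Second sentence."], ["Only sentence."]]

# A list is assigned by position, so its length must match the row count.
assert len(sentence_lists) == len(df)
df["sent_coref_text"] = sentence_lists

# One row per sentence, keeping the article identifier next to each sentence.
sentences_long = (
    df[["article_id", "sent_coref_text"]]
    .explode("sent_coref_text", ignore_index=True)
    .rename(columns={"sent_coref_text": "sentence"})
)
print(sentences_long)
```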

In [None]:
pmc_arxiv_full_sent_text_spacy_fastcoref.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   article_id       10 non-null     object
 1   published        10 non-null     object
 2   revised          10 non-null     object
 3   title            10 non-null     object
 4   journal          10 non-null     object
 5   authors          10 non-null     object
 6   doi              10 non-null     object
 7   pdf_url          10 non-null     object
 8   text             10 non-null     object
 9   sent_text        10 non-null     object
 10  coref_text       10 non-null     object
 11  sent_coref_text  10 non-null     object
dtypes: object(12)
memory usage: 1.1+ KB


In [None]:
pmc_arxiv_full_sent_text_spacy_fastcoref

Unnamed: 0,article_id,published,revised,title,journal,authors,doi,pdf_url,text,sent_text,coref_text,sent_coref_text
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...,"Sir James Black, a winner of the 1988 Nobel Pr...","[Sir James Black, a winner of the 1988 Nobel P...","Sir James Black, a winner of the 1988 Nobel Pr...","[Sir James Black, a winner of the 1988 Nobel P..."
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...,Tight and selective interaction between ligand...,[Tight and selective interaction between ligan...,Tight and selective interaction between ligand...,[Tight and selective interaction between ligan...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...,The severe acute respiratory syndrome coronavi...,[The severe acute respiratory syndrome coronav...,The severe acute respiratory syndrome coronavi...,[The severe acute respiratory syndrome coronav...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...,The sudden outbreak of SARS-CoV-2 in 2019 took...,[The sudden outbreak of SARS-CoV-2 in 2019 too...,The sudden outbreak of SARS-CoV-2 in 2019 took...,[The sudden outbreak of SARS-CoV-2 in 2019 too...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...,"The 2019 novel coronavirus, now dubbed SARS-Co...","[The 2019 novel coronavirus, now dubbed SARS-C...","The 2019 novel coronavirus, now dubbed SARS-Co...","[The 2019 novel coronavirus, now dubbed SARS-C..."
5,PMC9527439,2022-09-25,2022-10-07,Novel Drug Design for Treatment of COVID-19: A...,The Canadian journal of infectious diseases & ...,"Mousavi S, Zare S, Mirzaei M, Feizi A.",10.1155/2022/2044282,https://europepmc.org/articles/PMC9527439?pdf=...,"Coronavirus disease 2019 COVID-19, which was f...","[Coronavirus disease 2019 COVID-19, which was ...","Coronavirus disease 2019 COVID-19, which was f...","[Coronavirus disease 2019 COVID-19, which was ..."
6,PMC9729590,2022-12-08,2023-01-03,"Repurposing FDA-approved drugs cetilistat, abi...",Informatics in medicine unlocked,"Shahabadi N, Zendehcheshm S, Mahdavi M, Khadem...",10.1016/j.imu.2022.101147,https://europepmc.org/articles/PMC9729590?pdf=...,COVID-19 is an infectious disease caused by Co...,[COVID-19 is an infectious disease caused by C...,COVID-19 is an infectious disease caused by Co...,[COVID-19 is an infectious disease caused by C...
7,PMC9236981,2022-06-28,2022-12-21,A comprehensive review of artificial intellige...,Biomedicine & pharmacotherapy = Biomedecine & ...,"Ahmed F, Soomro AM, Chethikkattuveli Salih AR,...",10.1016/j.biopha.2022.113350,https://europepmc.org/articles/PMC9236981?pdf=...,A novel coronavirus CoV first appeared by the ...,[A novel coronavirus CoV first appeared by the...,A novel coronavirus CoV first appeared by the ...,[A novel coronavirus CoV first appeared by the...
8,PMC9694939,2022-11-10,2022-12-13,Structural Homology-Based Drug Repurposing App...,"Molecules (Basel, Switzerland)","Aljuaid A, Salam A, Almehmadi M, Baammi S, Als...",10.3390/molecules27227732,https://europepmc.org/articles/PMC9694939?pdf=...,Drug discovery is a time-consuming and costly ...,[Drug discovery is a time-consuming and costly...,Drug discovery is a time-consuming and costly ...,[Drug discovery is a time-consuming and costly...
9,PMC9556799,2022-10-13,2022-11-01,Rational drug repositioning for coronavirus-as...,iScience,"Wang J, Liu J, Luo M, Cui H, Zhang W, Zhao K, ...",10.1016/j.isci.2022.105348,https://europepmc.org/articles/PMC9556799?pdf=...,"Coronavirus disease 2019 COVID-19, caused by s...","[Coronavirus disease 2019 COVID-19, caused by ...","Coronavirus disease 2019 COVID-19, caused by s...","[Coronavirus disease 2019 COVID-19, caused by ..."


In [None]:
with open('2024-07-25_pmc_arxiv_full_sent_text_spacy_sent_coref_df.pickle', 'wb') as f:
  pickle.dump(pmc_arxiv_full_sent_text_spacy_fastcoref, f)
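A later notebook can read the pickle back with `pickle.load`. Pickle keeps the list-valued `sent_coref_text` cells as Python lists, which a CSV export would flatten to strings. The sketch below round-trips a small stand-in DataFrame through a temporary file; the data and the file name are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the DataFrame saved above.
df = pd.DataFrame({
    "article_id": ["A1", "A2"],
    "sent_coref_text": [["First sentence.", "Second sentence."], ["Only sentence."]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example_sent_coref_df.pickle"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(df, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# List-valued cells come back as Python lists rather than strings.
print(type(restored.loc[0, "sent_coref_text"]))
```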

### References

* Otmazgin, S. et al. (2022). F-coref: Fast, Accurate and Easy to Use Coreference Resolution. [arXiv:2209.04280](https://arxiv.org/pdf/2209.04280)

* fastcoref. [GitHub repository](https://github.com/shon-otmazgin/fastcoref)

* Otmazgin, S. et al. (2022). LINGMESS: Linguistically Informed Multi Expert Scorers for Coreference Resolution. [arXiv:2205.12644](https://arxiv.org/pdf/2205.12644)

* LingMess. [GitHub repository](https://github.com/shon-otmazgin/lingmess-coref)

* https://towardsdatascience.com/fastcoref-a-practical-package-for-coreference-resolution-bfbcb749e464

* https://galhever.medium.com/a-quick-review-of-coreference-resolution-task-14358f27b931

* https://galhever.medium.com/a-review-to-co-reference-resolution-models-f44b4360a00

* https://medium.com/@michaelwood33311/creating-accurate-ai-coreference-resolution-with-fastcoref-20f06044bdf9

* Bratanič, T. (2024). [Graph Algorithms for Data Science](https://www.manning.com/books/graph-algorithms-for-data-science)