# SpaCy Crosslingual Coreference
This notebook checks whether the crosslingual_coreference model can reliably resolve coreferences in the articles of the extracted dataset.

Source: https://github.com/Pandora-Intelligence/crosslingual-coreference

In [2]:
import scispacy
import spacy
import crosslingual_coreference
from crosslingual_coreference import Predictor
import pandas as pd

[nltk_data] Downloading package omw-1.4 to /home/florian/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
# reads the csv dataset and stores the non-missing values of the column 'fulltext' in a list
df = pd.read_csv('cleaned_fulltext_articles.csv')
list_text = df['fulltext'].dropna().tolist()
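
The loading step keeps every non-missing full text, so identical articles would stay in the list. If duplicates should also be dropped, one order-preserving option on a plain list is sketched below. The sample strings are placeholders, not data from the CSV, and `deduplicate` is a hypothetical helper name.

```python
def deduplicate(texts):
    """Return texts without repeats, keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(texts))


# placeholder strings standing in for list_text
sample_texts = ['article A', 'article B', 'article A', 'article C']
unique_texts = deduplicate(sample_texts)
print(unique_texts)
```

Removing duplicates can shift list positions, so an index such as `list_text[2]` may then refer to a different article.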

In [4]:
# loads the spaCy model and adds the coreference model to the pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(
    'xx_coref',
    # text is processed in chunks so the device is not overloaded; device -1 runs on the CPU
    config={'chunk_size': 2500, 'chunk_overlap': 2, 'device': -1})

Some weights of the model checkpoint at nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaModel were not initialized from the model checkpoint at nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-st

<crosslingual_coreference.CrossLingualPredictorSpacy.CrossLingualPredictorSpacy at 0x7fbf8fbf9d90>

In [5]:
# processes an article with nlp to doc object
doc = nlp(list_text[2])

  num_effective_segments = (seq_lengths + self._max_length - 1) // self._max_length


In [6]:
print(doc._.coref_clusters)

[[[34, 35], [83, 84], [281, 282]], [[50, 50], [399, 400]], [[106, 112], [121, 121]], [[108, 110], [131, 133], [163, 165]], [[108, 112], [244, 248]], [[112, 112], [248, 248]], [[234, 235], [241, 248], [260, 260]], [[293, 296], [309, 332], [347, 352], [364, 367], [377, 383]], [[293, 296], [309, 332], [364, 367], [377, 383]], [[468, 486], [485, 485]], [[501, 501], [572, 572]], [[719, 721], [771, 772]], [[795, 797], [806, 808]], [[806, 811], [1107, 1114], [1122, 1129], [1136, 1136], [1141, 1141]], [[806, 811], [1107, 1114], [1122, 1129], [1136, 1136], [1141, 1141], [1188, 1196], [1333, 1341], [1352, 1360]], [[810, 811], [1093, 1096], [1111, 1114], [1126, 1129]], [[810, 811], [1093, 1096], [1111, 1114], [1126, 1129], [1162, 1166], [1192, 1196], [1337, 1341], [1356, 1360], [1375, 1378], [1380, 1380], [1420, 1424]], [[868, 873], [905, 910], [950, 951], [970, 975], [1002, 1007]], [[871, 873], [897, 899], [908, 910], [973, 975], [1005, 1007], [1107, 1109], [1122, 1124]], [[871, 873], [897, 899]
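
Each cluster in this output is a list of `[start, end]` token index pairs, one pair per mention. The first pair, `[34, 35]`, later prints as the two tokens "this work", which suggests that the end index is inclusive. A minimal sketch of that convention on a plain token list follows; the tokens, the cluster and the `mention_texts` helper are made up for illustration and are not taken from the article or the library.

```python
def mention_texts(tokens, cluster):
    """Join the tokens of each [start, end] mention, treating end as inclusive."""
    return [' '.join(tokens[start:end + 1]) for start, end in cluster]


# toy tokens and a toy cluster, not model output
tokens = ['Niabella', 'species', 'are', 'aerobic', 'and', 'they', 'grow', 'slowly']
cluster = [[0, 1], [5, 5]]
print(mention_texts(tokens, cluster))
```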

In [10]:
doc

ANI, average nucleotide identity; dDDH, digital DNA–DNA hybridization; KEGG, Kyoto Encyclopedia of Genes and Genomes; R2A, Reasoner's 2A. The newly sequenced data included in this work are deposited under the nucleotide accession numbers: MZ919349 and MZ920050 and under the Bioproject accession numbers JAIQDI000000000, JAJNEC000000000 and JAIQDJ000000000 at a public domain server in the National Center for Biotechnology Information database. All supporting data, code and protocols have been provided within the article or through supplementary data files. A supplementary table and further supplementary files can be found at: https://doi.org/10.6084/m9.figshare.18220907.v1. Although members of the genera Niabella and Thermomonas are often isolated from similar environmental samples, they belong to the Bacteroidetes and Proteobacteria, respectively. The genus Niabella is a member of the family Chitinophagaceae. Cells of Niabella species are Gram-stain-negative, aerobic, non-flagellated an

In [9]:
# prints the mentions of each coreference cluster
for i, cluster in enumerate(doc._.coref_clusters, start=1):
    print(f'Cluster {i}')
    for start, end in cluster:
        # the end index is inclusive, hence end + 1
        print(doc[start:end + 1])
    print()

Cluster 1
this work
the article
this report

Cluster 2
Bioproject
Reasoner's

Cluster 3
members of the genera Niabella and Thermomonas
they

Cluster 4
the genera Niabella
The genus Niabella
the genus Niabella

Cluster 5
the genera Niabella and Thermomonas
the genera Niabella and Thermomonas

Cluster 6
Thermomonas
Thermomonas

Cluster 7
seven species
The species of the genera Niabella and Thermomonas
they

Cluster 8
a constructed wetland system
The constructed wetland system, also called the Dragon-shaped Wetland, located in the central area of the Beijing Olympic Park area
the Dragon-shaped Water System
the entire water system
the Dragon-shaped Wetland water system

Cluster 9
a constructed wetland system
The constructed wetland system, also called the Dragon-shaped Wetland, located in the central area of the Beijing Olympic Park area
the entire water system
the Dragon-shaped Wetland water system

Cluster 10
PCR with universal bacterial primers 27 F and 1492R, which was also used for se
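
Clusters 8 and 9 above list almost the same mentions; cluster 9 only lacks "the Dragon-shaped Water System". Such nested clusters can be flagged automatically. The sketch below uses the first nine clusters from the `doc._.coref_clusters` output; `find_nested_clusters` is a hypothetical helper, not part of crosslingual_coreference.

```python
def find_nested_clusters(clusters):
    """Return (inner, outer) cluster numbers (1-based) where every mention
    of the inner cluster also appears in the outer cluster."""
    mention_sets = [{tuple(span) for span in cluster} for cluster in clusters]
    nested = []
    for i, inner in enumerate(mention_sets, start=1):
        for j, outer in enumerate(mention_sets, start=1):
            if i != j and inner <= outer:
                nested.append((i, j))
    return nested


# first nine clusters copied from the doc._.coref_clusters output
clusters = [
    [[34, 35], [83, 84], [281, 282]],
    [[50, 50], [399, 400]],
    [[106, 112], [121, 121]],
    [[108, 110], [131, 133], [163, 165]],
    [[108, 112], [244, 248]],
    [[112, 112], [248, 248]],
    [[234, 235], [241, 248], [260, 260]],
    [[293, 296], [309, 332], [347, 352], [364, 367], [377, 383]],
    [[293, 296], [309, 332], [364, 367], [377, 383]],
]
print(find_nested_clusters(clusters))  # → [(9, 8)]
```

The numbering is 1-based so that the pairs match the `Cluster N` labels in the printout.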

## Result
crosslingual_coreference is not a good solution for scientific articles.
The model would need to be trained on the specific domain, and it is weak on longer sentences and texts, as the clusters above show. The accuracy is about 0.8.
