# Getting Started

<a target="_blank" href="https://colab.research.google.com/github/centre-for-humanities-computing/conspiracies/blob/main/docs/tutorials/overview.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Coreference model
A short example of using the coreference component in a spaCy pipeline.

In [None]:
import spacy
from spacy.tokens import Span

# This import is assumed to make the "allennlp_coref" factory available to spaCy.
from conspiracies.coref import CoreferenceComponent  # noqa: F401

nlp = spacy.blank("da")
nlp.add_pipe("sentencizer")
nlp.add_pipe("allennlp_coref")  # downloads the model if you haven't already

In [1]:
doc = nlp("Do you see Julie over there? She is really into programming!")

assert isinstance(doc._.coref_clusters, list)

for sent in doc.sents:
    assert isinstance(sent._.coref_clusters, list)
    assert isinstance(sent._.coref_clusters[0], tuple)
    assert isinstance(sent._.coref_clusters[0][0], int)
    assert isinstance(sent._.coref_clusters[0][1], Span)
    sent._.resolve_coref  # the sentence with coreferences resolved

Examining the output a bit further:

In [2]:
print("DOC LEVEL (Coref clusters)")
print(doc._.coref_clusters)
print("-----\n\nSPAN LEVEL (sentences)")
for sent in doc.sents:
    print(sent._.coref_clusters)
print("-----\n\nSPAN LEVEL (entities)\n")
for sent in doc.sents:
    for i, coref_entity in sent._.coref_clusters:
        print(f"Coref Entity: {coref_entity} \nAntecedent: {coref_entity._.antecedent}")
        print("\n")

DOC LEVEL (Coref clusters)
[(0, [Julie, She])]
-----

SPAN LEVEL (sentences)
[(0, Julie)]
[(0, She)]
-----

SPAN LEVEL (entities)

Coref Entity: Julie 
Antecedent: Julie


Coref Entity: She 
Antecedent: Julie
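
To make the idea of resolving coreferences concrete, the sketch below replaces every later mention in a cluster with the cluster's first mention. It is a plain-Python illustration under stated assumptions, not the library's implementation: the `resolve` helper, the `(start, end)` token offsets, and the choice of the first mention as the antecedent are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: a cluster is a list of (start, end) token
# offsets (end exclusive), ordered by position in the text.
Cluster = List[Tuple[int, int]]


def resolve(tokens: List[str], clusters: List[Cluster]) -> List[str]:
    """Replace every later mention in a cluster with the cluster's first mention."""
    replacements = {}
    for cluster in clusters:
        first_start, first_end = cluster[0]
        antecedent = tokens[first_start:first_end]
        for start, end in cluster[1:]:
            replacements[start] = (end, antecedent)

    resolved = []
    i = 0
    while i < len(tokens):
        if i in replacements:
            # Swap the whole mention for the antecedent and skip past it.
            end, antecedent = replacements[i]
            resolved.extend(antecedent)
            i = end
        else:
            resolved.append(tokens[i])
            i += 1
    return resolved


tokens = "Do you see Julie over there ? She is really into programming !".split()
clusters = [[(3, 4), (7, 8)]]  # "Julie" and "She"
resolved_tokens = resolve(tokens, clusters)
print(" ".join(resolved_tokens))
```

Here "She" is replaced by "Julie", which mirrors the antecedent shown in the output above.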




## Headword Extraction
A short example of extracting headwords with the headword extraction component.

````{note}
This example uses the spaCy pipeline `en_core_web_sm`. If you don't have it installed, you can install it by running the following command in your terminal:

```bash
spacy download en_core_web_sm
```
````

In [3]:
import spacy
from conspiracies.HeadWordExtractionComponent import contains_ents

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("heads_extraction", config={"normalize_to_entity": True, "normalize_to_noun_chunk": True})

doc = nlp("Mette Frederiksen is the Danish politician.")

print(doc._.most_common_ancestor)
the_danish = doc[3:5]
print(the_danish._.most_common_ancestor)  # normalized to the enclosing noun chunk


is
the Danish politician
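
One way to read the output above is as a common-ancestor lookup in the dependency tree: the whole document resolves to its root ("is"), while "the Danish" resolves to "politician" before being widened to the noun chunk. The sketch below computes the lowest common ancestor of a token span from a list of head indices. It is an illustration only: the `ancestor_chain` and `lowest_common_ancestor` helpers are hypothetical, the head indices are written by hand, and the component's actual selection rule may differ.

```python
from typing import List


def ancestor_chain(index: int, heads: List[int]) -> List[int]:
    """Return the token itself followed by its ancestors up to the root."""
    chain = [index]
    while heads[index] != index:  # the root points to itself, as in spaCy
        index = heads[index]
        chain.append(index)
    return chain


def lowest_common_ancestor(start: int, end: int, heads: List[int]) -> int:
    """Return the index of the deepest token that dominates tokens[start:end]."""
    chains = [ancestor_chain(i, heads) for i in range(start, end)]
    shared = set(chains[0]).intersection(*chains[1:])
    # Walking up from the first token, the first shared index is the deepest one.
    return next(i for i in chains[0] if i in shared)


tokens = ["Mette", "Frederiksen", "is", "the", "Danish", "politician", "."]
# Hand-written head indices for illustration (an assumed parse, not model output).
heads = [1, 2, 2, 5, 5, 2, 2]

doc_head = tokens[lowest_common_ancestor(0, len(tokens), heads)]
span_head = tokens[lowest_common_ancestor(3, 5, heads)]
print(doc_head, span_head)
```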


## Wordpiece length normalization
A short example of using wordpiece length normalization to split texts so that each
piece stays within a maximum length when you apply transformer-based pipelines.

````{note}
This example uses the spaCy pipeline `da_core_news_sm`. If you don't have it installed, you can install it by running the following command in your terminal:

```bash
spacy download da_core_news_sm
```
````

In [4]:
import spacy
from transformers import AutoTokenizer

from conspiracies import wordpiece_length_normalization

# Load a spaCy pipeline (we don't recommend a transformer-based spaCy model, as it is too slow)
nlp = spacy.load("da_core_news_sm")
# Load the Hugging Face tokenizer; it should match the model you wish to apply later
tokenizer_name = "alexandrainst/da-sentiment-base"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# An example with a very long text
long_text = ["Hej mit navn er Kenneth. " * 200]
normalized_text = wordpiece_length_normalization(long_text, nlp, tokenizer, max_length=500)
normalized_text = list(normalized_text)
assert len(normalized_text) > 1, "a long text should be split into multiple texts"
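
The splitting can be pictured as greedy packing: sentences are added to a chunk until the next one would push it past `max_length` pieces. The sketch below shows that idea in plain Python. It is not the library's implementation: `split_by_length` and `toy_count` are hypothetical helpers, and the toy counter treats each whitespace-separated word as one piece, whereas in practice the count would come from the Hugging Face tokenizer (for example `len(tokenizer.tokenize(sentence))`).

```python
from typing import Callable, Iterable, Iterator, List


def split_by_length(
    sentences: Iterable[str],
    count_pieces: Callable[[str], int],
    max_length: int,
) -> Iterator[str]:
    """Greedily pack sentences into chunks of at most max_length pieces.

    A single sentence longer than max_length is still emitted as its own chunk.
    """
    chunk: List[str] = []
    chunk_length = 0
    for sentence in sentences:
        n_pieces = count_pieces(sentence)
        # Start a new chunk when adding this sentence would exceed the limit.
        if chunk and chunk_length + n_pieces > max_length:
            yield " ".join(chunk)
            chunk, chunk_length = [], 0
        chunk.append(sentence)
        chunk_length += n_pieces
    if chunk:
        yield " ".join(chunk)


def toy_count(sentence: str) -> int:
    """Stand-in for a real tokenizer: one piece per whitespace-separated word."""
    return len(sentence.split())


sentences = ["Hej mit navn er Kenneth."] * 200  # 5 toy pieces each
chunks = list(split_by_length(sentences, toy_count, max_length=500))
print(len(chunks), [toy_count(c) for c in chunks])
```

Splitting on sentence boundaries keeps each chunk readable on its own, at the cost of chunks that are sometimes shorter than the limit.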


## Relation Extraction

A component for extracting knowledge triplets from a text document.

````{note}
This example uses the spaCy pipeline `da_core_news_sm`. If you don't have it installed, you can install it by running the following command in your terminal:

```bash
spacy download da_core_news_sm
```
````

In [5]:
from conspiracies.relationextraction import SpacyRelationExtractor
import spacy


nlp = spacy.load("da_core_news_sm")

test_sents = [
    "Pernille Blume vinder delt EM-sølv i Ungarn.",
    "Pernille Blume blev nummer to ved EM på langbane i disciplinen 50 meter fri.",
    "Hurtigst var til gengæld hollænderen Ranomi Kromowidjojo, der sikrede sig guldet i tiden 23,97 sekunder.",
    "Og at formen er til en EM-sølvmedalje tegner godt, siger Pernille Blume med tanke på, at hun få uger siden var smittet med corona.",
    "Ved EM tirsdag blev det ikke til medalje for den danske medley for mixede hold i 4 x 200 meter fri.",
    "In a phone call on Monday, Mr. Biden warned Mr. Netanyahu that he could fend off criticism of the Gaza strikes for only so long, according to two people familiar with the call",
    "That phone call and others since the fighting started last week reflect Mr. Biden and Mr. Netanyahu's complicated 40-year relationship.",
    "Politiet skal etterforske Siv Jensen etter mulig smittevernsbrudd.",
    "En av Belgiens mest framträdande virusexperter har flyttats med sin familj till skyddat boende efter hot från en beväpnad högerextremist.",
]


# Adjust these to your needs. 2.7 is the default confidence threshold: it discards the bulk of bad relations and keeps the majority of correct ones.
# Set batch_size according to your device; it can most likely be increased a fair bit.
config = {"confidence_threshold": 2.7, "model_args": {"batch_size": 10}}
nlp.add_pipe("relation_extractor", config=config)

pipe = nlp.pipe(test_sents)

for d in pipe:
    print(d.text, "\n", d._.relation_triplets)

Pernille Blume vinder delt EM-sølv i Ungarn. 
 []


Pernille Blume blev nummer to ved EM på langbane i disciplinen 50 meter fri. 
 [(i, disciplinen, 50 meter fri)]


Hurtigst var til gengæld hollænderen Ranomi Kromowidjojo, der sikrede sig guldet i tiden 23,97 sekunder. 
 [(i, tiden, 23,97 sekunder)]


Og at formen er til en EM-sølvmedalje tegner godt, siger Pernille Blume med tanke på, at hun få uger siden var smittet med corona. 
 []


Ved EM tirsdag blev det ikke til medalje for den danske medley for mixede hold i 4 x 200 meter fri. 
 []


In a phone call on Monday, Mr. Biden warned Mr. Netanyahu that he could fend off criticism of the Gaza strikes for only so long, according to two people familiar with the call 
 [(Mr. Biden, warned, Mr. Netanyahu), (he, could fend off, criticism of the Gaza strikes)]


That phone call and others since the fighting started last week reflect Mr. Biden and Mr. Netanyahu's complicated 40-year relationship. 
 [(the fighting, started, last week), (That phone call and others since the fighting, reflect, Mr. Biden and Mr. Netanyahu's complicated 40-year relationship)]


Politiet skal etterforske Siv Jensen etter mulig smittevernsbrudd. 
 [(Politiet, skal etterforske, Siv Jensen etter mulig smittevernsbrudd)]


En av Belgiens mest framträdande virusexperter har flyttats med sin familj till skyddat boende efter hot från en beväpnad högerextremist. 
 []
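
The empty lists above are sentences where no candidate relation passed the confidence threshold. The sketch below illustrates that kind of filtering in plain Python. Everything in it is an assumption made for illustration: the `ScoredTriplet` type, the `filter_triplets` helper, the made-up scores, and the use of `>=` for the comparison are not taken from the library.

```python
from typing import List, NamedTuple


class ScoredTriplet(NamedTuple):
    subject: str
    predicate: str
    obj: str
    confidence: float  # hypothetical score; higher means more reliable


def filter_triplets(
    triplets: List[ScoredTriplet], threshold: float
) -> List[ScoredTriplet]:
    """Keep only the triplets whose confidence reaches the threshold."""
    return [t for t in triplets if t.confidence >= threshold]


# Made-up scores for illustration only; they are not model output.
candidates = [
    ScoredTriplet("Mr. Biden", "warned", "Mr. Netanyahu", 3.4),
    ScoredTriplet("he", "could fend off", "criticism of the Gaza strikes", 2.9),
    ScoredTriplet("Monday", "warned", "two people", 1.2),
]

kept = filter_triplets(candidates, threshold=2.7)
for triplet in kept:
    print(triplet.subject, "|", triplet.predicate, "|", triplet.obj)
```

Raising the threshold trades recall for precision: fewer triplets survive, but the ones that do are more likely to be correct.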


### Relation extraction without spaCy


In [7]:
from conspiracies.relationextraction import KnowledgeTriplets


test_sents = [
    "Lasse er en dreng på 26 år.",
    "Jeg arbejder som tømrer",
    "Albert var videnskabsmand og døde i 1921",
    "Lasse lives in Denmark and owns two cats",
]


test_sents = [
    "Pernille Blume vinder delt EM-sølv i Ungarn.",
    "Pernille Blume blev nummer to ved EM på langbane i disciplinen 50 meter fri.",
    "Hurtigst var til gengæld hollænderen Ranomi Kromowidjojo, der sikrede sig guldet i tiden 23,97 sekunder.",
    "Og at formen er til en EM-sølvmedalje tegner godt, siger Pernille Blume med tanke på, at hun få uger siden var smittet med corona.",
    "Ved EM tirsdag blev det ikke til medalje for den danske medley for mixede hold i 4 x 200 meter fri.",
    "In a phone call on Monday, Mr. Biden warned Mr. Netanyahu that he could fend off criticism of the Gaza strikes for only so long, according to two people familiar with the call",
    "That phone call and others since the fighting started last week reflect Mr. Biden and Mr. Netanyahu's complicated 40-year relationship.",
    "Politiet skal etterforske Siv Jensen etter mulig smittevernsbrudd.",
    "En av Belgiens mest framträdande virusexperter har flyttats med sin familj till skyddat boende efter hot från en beväpnad högerextremist.",
]


# Create a KnowledgeTriplets instance, then extract triplets from the list of sentences
relations = KnowledgeTriplets()
final_result = relations.extract_relations(test_sents)


print(final_result["sentence"])
print(final_result)


['Pernille Blume vinder delt EM-sølv i Ungarn.', 'Pernille Blume blev nummer to ved EM på langbane i disciplinen 50 meter fri.', 'Hurtigst var til gengæld hollænderen Ranomi Kromowidjojo, der sikrede sig guldet i tiden 23,97 sekunder.', 'Og at formen er til en EM-sølvmedalje tegner godt, siger Pernille Blume med tanke på, at hun få uger siden var smittet med corona.', 'Ved EM tirsdag blev det ikke til medalje for den danske medley for mixede hold i 4 x 200 meter fri.', 'In a phone call on Monday, Mr. Biden warned Mr. Netanyahu that he could fend off criticism of the Gaza strikes for only so long, according to two people familiar with the call', "That phone call and others since the fighting started last week reflect Mr. Biden and Mr. Netanyahu's complicated 40-year relationship.", 'Politiet skal etterforske Siv Jensen etter mulig smittevernsbrudd.', 'En av Belgiens mest framträdande virusexperter har flyttats med sin familj till skyddat boende efter hot från en beväpnad högerextremist.