# An introduction to `relatio` 
**Runtime $\sim$ 5min**

Original paper: ["Text Semantics Capture Political and Economic Narratives"](https://arxiv.org/abs/2108.01720)

----------------------------

This is a short demo of the `relatio` package. It takes a text corpus as input and outputs a list of narrative statements. The pipeline is unsupervised: the user does not need to specify narratives beforehand. Narrative statements are defined as tuples of semantic roles with an (agent, verb, patient, attribute) structure.
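
To make the tuple structure concrete, here is a minimal sketch of a single narrative statement stored as a plain Python dictionary. The role keys (`ARG0`, `B-V`, `B-ARGM-NEG`, `ARG1`) are the ones printed later in this tutorial; the example content and the `render_statement` helper are hypothetical and not part of `relatio`.

```python
# One narrative statement as a dictionary of semantic roles.
# The content below is made up for illustration.
statement = {
    "ARG0": "gouvernement",  # agent
    "B-V": "augmenter",      # verb
    "B-ARGM-NEG": False,     # True when the verb is negated
    "ARG1": "impôt",         # patient
}


def render_statement(roles):
    """Join the available roles into a readable string (hypothetical helper)."""
    verb = roles.get("B-V", "")
    if verb and roles.get("B-ARGM-NEG"):
        verb = "not-" + verb
    parts = [roles.get("ARG0", ""), verb, roles.get("ARG1", "")]
    # Skip roles that are missing or empty
    return " ".join(part for part in parts if part)


print(render_statement(statement))
```

Roles can be missing from a statement, which is why the helper reads them with `.get` rather than indexing directly.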

Here, we present the main wrapper functions to quickly obtain narrative statements from a corpus.

----------------------------

In this tutorial, we work with tweets from candidates in the 2022 French presidential election.

----------------------------

In [1]:
# Catch warnings for an easy ride
from relatio import FileLogger
logger = FileLogger(level = 'WARNING')



In [2]:
from relatio import list_data
list_data()


{
    "trump_tweet_archive": 
    {
        "description": "Tweets from the Trump Tweet Archives (https://www.thetrumparchive.com/)",
        "language": "english",
        "srl_model": "allennlp v0.9 -- srl-model-2018.05.25.tar.gz",
        "links": 
        {
            "raw": "https://www.dropbox.com/s/lxqz454n29iqktn/trump_archive.csv?dl=1",
            "sentences": "https://www.dropbox.com/s/coh4ergyrjeolen/split_sentences.json?dl=1",
            "srl_res": "https://www.dropbox.com/s/54lloy84ka8mycp/srl_res.json?dl=1"
        }
    },
    "tweets_candidates_french_elections": 
    {
        "description": "Tweets of candidates at the French presidential elections (2022)",
        "language": "french",
        "srl_model": "",
        "links": 
        {
            "raw": "https://www.dropbox.com/s/qqlq8xn9x645f79/tweets_candidates_french_elections.csv?dl=1"
        }
    }
}



In [3]:
from relatio import load_data
df = load_data(dataset = "tweets_candidates_french_elections", content = "raw")
df = df[df['candidate'] == 'yjadot']
df.head()

Unnamed: 0,id,doc,date,candidate
29238,29237,"Hier, nous étions à #Rennes, place Hoche, pour...",2022-02-09T16:01:00.000Z,yjadot
29239,29238,Pensées à ses proches,2022-02-09T15:31:36.000Z,yjadot
29240,29239,"Un an déjà que Guillaume, militant communiste ...",2022-02-09T15:31:16.000Z,yjadot
29241,29240,Le #OneOceanSummit s'ouvre aujourd'hui à #Bres...,2022-02-09T15:03:00.000Z,yjadot
29242,29241,Pour revoir l'intégralité de mon passage sur @...,2022-02-09T14:00:03.000Z,yjadot


In [4]:
from relatio import Preprocessor

import string
alphabet_string = string.ascii_lowercase
alphabet_list = list(alphabet_string) + ['rt']

p = Preprocessor(
    spacy_model = "fr_core_news_sm",
    remove_punctuation = True,
    remove_digits = True,
    lowercase = True,
    lemmatize = True,
    remove_chars = ["\"",'-',"^",".","?","!",";","(",")",",",":","\'","+","&","|","/","{","}",
                    "~","_","`","[","]",">","<","=","*","%","$","@","#","’"],
    stop_words = alphabet_list,
    n_process = -1,
    batch_size = 100
)

df = p.split_into_sentences(
    df, output_path = None, progress_bar = True
)

Splitting into sentences...


100%|█████████████████████████████████████| 3432/3432 [00:02<00:00, 1172.62it/s]
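
The cleaning options passed to `Preprocessor` above (lowercasing, character removal, digit removal, stop words) can be pictured with a small standard-library sketch. This is not `relatio`'s implementation: the `clean_text` function, the shortened character list, and the example tweet are assumptions made for illustration, and lemmatization is left out because it needs a spaCy model.

```python
import string

# Shortened, illustrative versions of the settings used above
REMOVE_CHARS = ["#", "@", ",", ".", "!", "?", "'", ":"]
STOP_WORDS = set(string.ascii_lowercase) | {"rt"}


def clean_text(text, remove_chars=REMOVE_CHARS, stop_words=STOP_WORDS):
    """Lowercase, strip unwanted characters and digits, drop stop words."""
    text = text.lower()
    for ch in remove_chars:
        # Replace with a space so neighbouring words do not merge
        text = text.replace(ch, " ")
    text = "".join(c for c in text if not c.isdigit())
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(tokens)


# Made-up tweet, not taken from the dataset
print(clean_text("RT @user: Le #OneOceanSummit s'ouvre à Brest en 2022!"))
```

Treating single ASCII letters and `rt` as stop words removes fragments such as the `s` left over from `s'ouvre` and the retweet marker, while accented words such as `à` are kept.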


In [5]:
sentence_index, roles = p.extract_svos(df['sentence'], progress_bar = True)

for svo in roles[0:5]: print(svo)

Extracting SVOs...


100%|█████████████████████████████████████| 7661/7661 [00:03<00:00, 2055.81it/s]

{'ARG0': 'nous', 'B-ARGM-NEG': False, 'B-V': 'étions', 'ARG1': ''}
{'ARG0': 'nous', 'B-ARGM-NEG': False, 'B-V': 'serons', 'ARG1': ''}
{'ARG0': 'vous', 'B-ARGM-NEG': True, 'B-V': 'pouvez', 'ARG1': ''}
{'ARG0': 'vous', 'B-ARGM-NEG': False, 'B-V': 'voici', 'ARG1': ''}
{'ARG0': 'vous', 'B-ARGM-NEG': False, 'B-V': 'suivre', 'ARG1': ''}





In [6]:
postproc_roles = p.process_roles(roles, 
                                 dict_of_pos_tags_to_keep = {
                                     "ARG0": ['PRON', 'NOUN', 'PROPN'],
                                     "B-V": ['VERB'],
                                     "ARG1": ['NOUN', 'PROPN', 'PRON']
                                 }, 
                                 max_length = 50,
                                 progress_bar = True,
                                 output_path = 'postproc_roles.json')

from relatio.utils import load_roles
postproc_roles = load_roles('postproc_roles.json')

for d in postproc_roles[0:5]: print(d)

Cleaning phrases for role ARG0...


100%|███████████████████████████████████| 10054/10054 [00:04<00:00, 2254.06it/s]


Cleaning phrases for role B-V...


100%|███████████████████████████████████| 10054/10054 [00:04<00:00, 2292.06it/s]


Cleaning phrases for role B-ARGM-MOD...


0it [00:00, ?it/s]


Cleaning phrases for role ARG1...


100%|███████████████████████████████████| 10054/10054 [00:05<00:00, 1884.66it/s]


Cleaning phrases for role ARG2...


0it [00:00, ?it/s]


{'ARG0': 'nous', 'B-ARGM-NEG': False}
{'ARG0': 'nous', 'B-ARGM-NEG': False}
{'ARG0': 'vous', 'B-V': 'pouvoir', 'B-ARGM-NEG': True}
{'ARG0': 'vous', 'B-V': 'voici', 'B-ARGM-NEG': False}
{'ARG0': 'vous', 'B-V': 'suivre', 'B-ARGM-NEG': False}
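
Note that the first two statements no longer have a `B-V` entry after `process_roles`. One plausible explanation, stated here as an assumption rather than something checked against `relatio`'s source, is that forms of "être" are tagged as auxiliaries (`AUX`), which is not in the list of tags kept for `B-V`. The sketch below mimics that filtering with hand-written (lemma, POS) pairs; `filter_by_pos` is a hypothetical helper.

```python
def filter_by_pos(tagged_tokens, allowed_pos):
    """Keep the lemmas whose part-of-speech tag is in allowed_pos."""
    return " ".join(lemma for lemma, pos in tagged_tokens if pos in allowed_pos)


# Hand-written (lemma, POS) pairs; a real pipeline would get these from spaCy
tagged_roles = {
    "ARG0": [("nous", "PRON")],
    "B-V": [("être", "AUX")],
}
pos_to_keep = {"ARG0": ["PRON", "NOUN", "PROPN"], "B-V": ["VERB"]}

cleaned = {}
for role, tokens in tagged_roles.items():
    phrase = filter_by_pos(tokens, pos_to_keep[role])
    if phrase:  # roles that end up empty are dropped
        cleaned[role] = phrase

print(cleaned)
```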


In [7]:
known_entities = p.mine_entities(
    df['sentence'], 
    clean_entities = True, 
    progress_bar = True,
    output_path = 'entities.pkl'
)

from relatio.utils import load_entities
known_entities = load_entities('entities.pkl')

for n in known_entities.most_common(10): print(n)

Mining named entities...


100%|██████████████████████████████████████| 7661/7661 [00:11<00:00, 673.04it/s]

('', 521)
('écologie', 92)
('parlement européen', 27)
('écologi', 21)
('primaireecologist', 19)
('ue', 17)
('union européen', 14)
('humanité', 13)
('avenir', 13)
('tva', 11)





In [8]:
top_known_entities = [e[0] for e in list(known_entities.most_common(100)) if e[0] != '']

In [9]:
from relatio import Embeddings
nlp_model = Embeddings("spaCy", "fr_core_news_sm", sentences=df['sentence']) 

In [10]:
from relatio import NarrativeModel
from relatio.utils import prettify
from collections import Counter

In [22]:
m1 = NarrativeModel(clustering = 'hdbscan',
                    PCA = False,
                    UMAP = False,
                    roles_considered = ['ARG0', 'B-V', 'B-ARGM-NEG', 'ARG1'],
                    roles_with_known_entities = ['ARG0','ARG1'],
                    known_entities = top_known_entities,
                    assignment_to_known_entities = 'character_matching',
                    roles_with_unknown_entities = ['ARG0','ARG1'],
                    embeddings_model = nlp_model,
                    threshold = 0.3)    

m1.fit(postproc_roles, weight_by_frequency = True)

Matching known entities (with character matching)...


100%|████████████████████████████████████| 1282/1282 [00:00<00:00, 13759.52it/s]


Matching known entities (with character matching)...


100%|████████████████████████████████████| 2429/2429 [00:00<00:00, 14897.76it/s]


Computing phrase embeddings...


100%|██████████████████████████████████████| 3176/3176 [00:12<00:00, 255.76it/s]


Clustering phrases into clusters...
Clustering parameters chosen in this range:
{'min_cluster_size': [13, 27, 41, 56], 'min_samples': [1, 10, 20], 'cluster_selection_method': ['eom'], 'gen_min_span_tree': True, 'approx_min_span_tree': False, 'prediction_data': True}
Labeling the clusters by the most frequent phrases...
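
The model above is configured with `assignment_to_known_entities = 'character_matching'`. As a rough picture of what string-based matching can look like, the sketch below checks whether the tokens of a known entity appear consecutively in a phrase. The `match_known_entities` function and the test phrases are hypothetical, and the exact matching rule used by `relatio` may differ.

```python
def match_known_entities(phrase, known_entities):
    """Return the known entities whose tokens appear consecutively in phrase."""
    tokens = phrase.split()
    matches = []
    for entity in known_entities:
        ent_tokens = entity.split()
        n = len(ent_tokens)
        # Slide a window of n tokens over the phrase
        if any(tokens[i:i + n] == ent_tokens for i in range(len(tokens) - n + 1)):
            matches.append(entity)
    return matches


# Entities taken from the mined list above; the phrases are made up
known = ["écologie", "parlement européen", "ue"]
print(match_known_entities("député parlement européen", known))
print(match_known_entities("vue mer", known))
```

Matching on whole tokens rather than raw substrings is a design choice in this sketch: it keeps a short entity such as `ue` from matching inside an unrelated word such as `vue`.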


In [23]:
narratives = m1.predict(postproc_roles, progress_bar = True)


Predicting entities for role: ARG0...
Matching known entities (with character matching)...


100%|████████████████████████████████████| 5657/5657 [00:00<00:00, 17165.10it/s]


Matching unknown entities (with clustering model)...
Computing phrase embeddings...


100%|██████████████████████████████████████| 5657/5657 [00:15<00:00, 366.98it/s]


Assignment to clusters...
Assigning labels to matches...

Predicting entities for role: ARG1...
Matching known entities (with character matching)...


100%|████████████████████████████████████| 3556/3556 [00:00<00:00, 15938.33it/s]


Matching unknown entities (with clustering model)...
Computing phrase embeddings...


100%|██████████████████████████████████████| 3556/3556 [00:11<00:00, 318.15it/s]


Assignment to clusters...
Assigning labels to matches...


In [None]:
from relatio.utils import prettify_narratives
pretty_narratives = prettify_narratives(narratives, fix_grammar=False)
for t in pretty_narratives.most_common(10): print(t)

In [21]:
m1.inspect_cluster("nous")

[('nous', 917),
 ('je', 735),
 ('on', 617),
 ('il', 281),
 ('vous', 149),
 ('lui', 116),
 ('cela', 94),
 ('jadot2022', 58),
 ('europe', 26),
 ('qui', 22)]
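
The fit log reports that clusters are labeled by their most frequent phrase, which is why this cluster is addressed as `"nous"`. A minimal sketch of that labeling rule with `collections.Counter`, using the first five counts printed above (the variable names are illustrative):

```python
from collections import Counter

# First five phrase counts from the inspect_cluster output above
cluster_counts = Counter(
    {"nous": 917, "je": 735, "on": 617, "il": 281, "vous": 149}
)

# The label is the single most frequent phrase in the cluster
label, count = cluster_counts.most_common(1)[0]

# Share of these five phrases accounted for by the label
share = count / sum(cluster_counts.values())

print(label, count, round(share, 2))
```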

In [None]:
m2 = NarrativeModel(clustering = 'kmeans',
                    PCA = True,
                    UMAP = True,
                    roles_considered = ['ARG0', 'B-V', 'B-ARGM-NEG', 'ARG1'],
                    roles_with_known_entities = ['ARG0','ARG1'],
                    known_entities = top_known_entities,
                    assignment_to_known_entities = 'character_matching',
                    roles_with_unknown_entities = ['ARG0','ARG1'],
                    embeddings_model = nlp_model,
                    threshold = 0.3)    

m2.fit(postproc_roles, weight_by_frequency = True, progress_bar = True)

In [None]:
narratives = m2.predict(postproc_roles, progress_bar = True)

In [None]:
# Keep only complete narratives (agent, verb, and patient all present)
required_roles = ('ARG0', 'B-V', 'ARG1')
pretty_narratives = [
    prettify(n) for n in narratives
    if all(n.get(role) is not None for role in required_roles)
]

pretty_narratives = Counter(pretty_narratives)
for t in pretty_narratives.most_common(10): print(t)

In [None]:
from relatio import build_graph, draw_graph

G = build_graph(
    narratives, 
    top_n = 100, 
    prune_network = True
)

draw_graph(
    G,
    notebook = False,
    show_buttons = False,
    width="1600px",
    height="1000px",
    output_filename = 'example.html'
)
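
One way to think about the graph is as a set of directed edges from agent to patient, labeled by the verb, with only the most frequent narratives kept. The sketch below counts such triples with the standard library. It assumes that this is roughly what `top_n` selects; `top_edges` and the toy narratives are hypothetical and do not reproduce `build_graph`.

```python
from collections import Counter


def top_edges(narratives, top_n):
    """Count complete (agent, verb, patient) triples and keep the top_n."""
    triples = Counter()
    for n in narratives:
        # Skip narratives with a missing or empty role
        if all(n.get(role) for role in ("ARG0", "B-V", "ARG1")):
            triples[(n["ARG0"], n["B-V"], n["ARG1"])] += 1
    return triples.most_common(top_n)


# Toy narratives, made up for illustration
toy = [
    {"ARG0": "nous", "B-V": "défendre", "ARG1": "écologie"},
    {"ARG0": "nous", "B-V": "défendre", "ARG1": "écologie"},
    {"ARG0": "ue", "B-V": "protéger", "ARG1": "avenir"},
    {"ARG0": "nous", "B-V": "suivre"},  # incomplete, ignored
]

edges = top_edges(toy, top_n=1)
print(edges)
```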