# The social network of Caesar's *Bellum Gallicum*

## Data Flow

The data will be collected from a reduced version (extracts) of the *Bellum Gallicum* that has been manually annotated for morphology and syntax according to the Universal Dependencies model and is distributed in the CoNLL-U format.

See the [GitHub repository](https://github.com/proiel/proiel-treebank/blob/master/caes-gal.conll).

### The components of the CoNLL-U annotation schema

CoNLL-U annotations are distributed as plain text files.

The annotation files contain three types of lines: **comment lines**, **word lines** and **blank lines**.

**Comment lines** precede word lines and start with a hash character (#). These lines can be used to provide metadata about the word lines that follow.

Each **word line** contains annotations for a single word or token. Larger linguistic units are represented by subsequent word lines.

The annotations for a word line are provided in the following fields, separated by tab characters:

```console
ID	FORM	LEMMA	UPOS	XPOS	FEATS	HEAD	DEPREL	DEPS	MISC
```

 1. `ID`: Index of the word in sequence
 2. `FORM`: The form of a word or punctuation symbol
 3. `LEMMA`: Lemma or the base form of a word
 4. `UPOS`: [Universal part-of-speech tag](https://universaldependencies.org/u/pos/)
 5. `XPOS`: Language-specific part-of-speech tag
 6. `FEATS`: [Morphological features](https://universaldependencies.org/u/feat/index.html)
 7. `HEAD`: Syntactic head of the current word
 8. `DEPREL`: Universal dependency relation to the `HEAD`
 9. `DEPS`: [Enhanced dependency relations](https://universaldependencies.org/u/overview/enhanced-syntax.html)
 10. `MISC`: Any additional annotations

Finally, a **blank line** after word lines is used to separate sentences.
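
To make the three line types concrete, here is a minimal sketch that splits a CoNLL-U string by hand, using only the standard library. It is an illustration of the format, not how the `conllu` library works internally. The two word lines are rebuilt from token values shown later in this notebook; the comment lines and the helper name `split_conllu` are assumptions made up for the example.

```python
# Field names in the order defined by the CoNLL-U schema
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Two one-token fragments; the comment lines are hypothetical
SAMPLE = (
    "# fragment 1 (hypothetical comment line)\n"
    "1\tGallia\tGallia\tPROPN\tNe\tCase=Nom|Gender=Fem|Number=Sing\t4\tnsubj:pass\t_\tRef=1.1.1\n"
    "\n"
    "# fragment 2 (hypothetical comment line)\n"
    "4\tGarumna\tGarumna\tPROPN\tNe\tCase=Nom|Gender=Masc|Number=Sing\t1\tnsubj\t_\tRef=1.1.2\n"
    "\n"
)

def split_conllu(text):
    """Split raw CoNLL-U text into sentences of comments and word dicts."""
    sentences, comments, words = [], [], []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the current sentence, if any
            if words:
                sentences.append({"comments": comments, "words": words})
            comments, words = [], []
        elif line.startswith("#"):
            # Comment line: metadata for the word lines that follow
            comments.append(line[1:].strip())
        else:
            # Word line: ten tab-separated fields (values kept as strings)
            words.append(dict(zip(FIELDS, line.split("\t"))))
    if words:
        # Flush a final sentence that is not followed by a blank line
        sentences.append({"comments": comments, "words": words})
    return sentences

parsed = split_conllu(SAMPLE)
print(len(parsed))
print(parsed[0]["words"][0]["form"], parsed[0]["words"][0]["deprel"])
```

Unlike this sketch, a real parser also converts `ID` and `HEAD` to integers and unpacks `FEATS` and `MISC` into dictionaries.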

To read and manage the CoNLL-U formatted file, we will use the Python [conllu](https://pypi.org/project/conllu/) library.

In [1]:
# Import the conllu library
import conllu

In [2]:
# Open the plain text file for reading; assign under 'data'
with open('caes_gal.conllu', mode="r", encoding="utf-8") as data:

    # Read the file contents and assign under 'annotations'
    annotations = data.read()
# Use the parse() function to parse the annotations; store under 'sentences'
sentences = conllu.parse(annotations)

The parse() function returns a Python list of TokenList objects, a class provided by the conllu library. Each TokenList holds the tokens of one sentence.

In [3]:
# Print the first token of the first sentence (a TokenList object)
sentences[0][0]

{'id': 1,
 'form': 'Gallia',
 'lemma': 'Gallia',
 'upos': 'PROPN',
 'xpos': 'Ne',
 'feats': {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'},
 'head': 4,
 'deprel': 'nsubj:pass',
 'deps': None,
 'misc': {'Ref': '1.1.1'}}

### Named Entities extraction

First of all, we want to extract the named entities (NE) from the annotations. Since proper nouns are already marked as such in `annotations`, we don't need to perform actual named entity recognition. However, because sentences are annotated token by token, multi-token names require us to group the tokens that belong to the same entity, i.e. that are part of a character's full name. To do so, we can rely on the consistency of the annotation, in which the "second names" are marked as direct dependants of the "first name" token through a 'flat:name' relation.
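
Before applying the idea to the treebank, the grouping can be tried on a toy sentence. The sketch below uses plain dictionaries rather than `conllu` tokens, and the sentence itself (ids, heads and the verb) is a hypothetical example; only the 'flat:name' convention comes from the annotation described above. The helper name `group_names` is also an assumption.

```python
from collections import defaultdict

# Hypothetical toy sentence: "Marcus Messala" is a two-token name
toy_sentence = [
    {"id": 1, "lemma": "Marcus", "upos": "PROPN", "head": 4, "deprel": "nsubj"},
    {"id": 2, "lemma": "Messala", "upos": "PROPN", "head": 1, "deprel": "flat:name"},
    {"id": 3, "lemma": "Caesar", "upos": "PROPN", "head": 4, "deprel": "obj"},
    {"id": 4, "lemma": "video", "upos": "VERB", "head": 0, "deprel": "root"},
]

def group_names(sent):
    """Group each proper noun with its direct 'flat:name' dependants."""
    # Index the 'flat:name' tokens by the id of their head
    dependants = defaultdict(list)
    for tok in sent:
        if tok["deprel"] == "flat:name":
            dependants[tok["head"]].append(tok)

    names = []
    for tok in sent:
        # A "first name" is a proper noun that is not itself a 'flat:name' dependant
        if tok["upos"] == "PROPN" and tok["deprel"] != "flat:name":
            names.append([tok, *dependants.get(tok["id"], [])])
    return names

full_names = [" ".join(t["lemma"] for t in name) for name in group_names(toy_sentence)]
print(full_names)
```

The function defined in the next cell reaches the same grouping with a list of already visited tokens instead of an index of dependants.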

In [4]:
def get_ne(sentences:list):
    full_names = []
    for sent in sentences:
        visited_tokens = []
        for token in sent:
            if token['xpos'] == 'Ne' and token not in visited_tokens:
                second_names = [t for t in sent if t['deprel'] == 'flat:name' and t['head'] == token['id'] and t not in visited_tokens]
                visited_tokens.append(token)
                if second_names:
                    multi_token_name = [token, *second_names] # the token corresponding to the "first name" at the beginning of the list
                    full_names.append(multi_token_name)
                    for t in second_names:
                        visited_tokens.append(t)
                else:
                    full_names.append([token])
    return full_names

In [5]:
ne_instances = get_ne(sentences)

In [6]:
ne_tok_instances_set = [l[0] for l in ne_instances]


In [7]:
ne_tok_instances_set

[{'id': 1,
  'form': 'Gallia',
  'lemma': 'Gallia',
  'upos': 'PROPN',
  'xpos': 'Ne',
  'feats': {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'},
  'head': 4,
  'deprel': 'nsubj:pass',
  'deps': None,
  'misc': {'Ref': '1.1.1'}},
 {'id': 4,
  'form': 'Garumna',
  'lemma': 'Garumna',
  'upos': 'PROPN',
  'xpos': 'Ne',
  'feats': {'Case': 'Nom', 'Gender': 'Masc', 'Number': 'Sing'},
  'head': 1,
  'deprel': 'nsubj',
  'deps': None,
  'misc': {'Ref': '1.1.2'}},
 {'id': 8,
  'form': 'Matrona',
  'lemma': 'Matrona',
  'upos': 'PROPN',
  'xpos': 'Ne',
  'feats': {'Case': 'Nom', 'Gender': 'Masc', 'Number': 'Sing'},
  'head': 11,
  'deprel': 'nsubj',
  'deps': None,
  'misc': {'Ref': '1.1.2'}},
 {'id': 10,
  'form': 'Sequana',
  'lemma': 'Sequana',
  'upos': 'PROPN',
  'xpos': 'Ne',
  'feats': {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'},
  'head': 8,
  'deprel': 'conj',
  'deps': None,
  'misc': {'Ref': '1.1.2'}},
 {'id': 36,
  'form': 'Rhenum',
  'lemma': 'Rhenus',
  'upos': 'PROPN'

In [8]:
import itertools
from pprint import pprint
from collections import defaultdict

In [9]:
def get_relations(sentences:list):

    global_relations = []
    # all_full_names = []
    for sent in sentences:
        relations_in_sentence = []
        entities_in_sent = []
        visited_tokens = []
        for token in sent:
            if token['upos'] == 'PROPN' and token not in visited_tokens:
                second_names = [t for t in sent if t['deprel'] == 'flat:name' and t['head'] == token['id'] and t not in visited_tokens]
                visited_tokens.append(token)
                if second_names:
                    fullname = [token, *second_names] # the token corresponding to the "first name" at the beginning of the list
                    # all_full_names.append(fullname)
                    entities_in_sent.append(fullname)
                    for t in second_names:
                        visited_tokens.append(t)
                else:
                    fullname = [token]
                    # all_full_names.append(fullname)
                    entities_in_sent.append(fullname)

        pairs = list(itertools.combinations(entities_in_sent, 2))

        for pair in pairs:
            ent1 = pair[0]
            ent2 = pair[1]
            # Default weight for two entities that merely co-occur in the sentence
            distance = 0.5

            # Higher weight when the first tokens share a head or one is the head of the other
            if ent1[0]['head'] == ent2[0]['head'] or (ent1[0]['head'] == ent2[0]['id'] or ent2[0]['head'] == ent1[0]['id']):
                distance = 1

            relation = {'ent1':ent1, 'ent2': ent2, 'distance': distance}
            relations_in_sentence.append(relation)

        if relations_in_sentence:
            global_relations.append(relations_in_sentence)

    return global_relations

In [10]:
relations_instances = get_relations(sentences)
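
The weighting rule used in `get_relations` can be checked in isolation. The sketch below applies it to three river names whose `id` and `head` values are taken from the token output shown earlier; the helper name `pair_weight` is an assumption introduced for the example, and the tokens are reduced to plain dictionaries.

```python
import itertools

# id / head values copied from the token output above
garumna = {"id": 4, "lemma": "Garumna", "head": 1}
matrona = {"id": 8, "lemma": "Matrona", "head": 11}
sequana = {"id": 10, "lemma": "Sequana", "head": 8}

def pair_weight(tok1, tok2):
    """Return 1 for syntactically linked tokens, 0.5 for plain co-occurrence."""
    same_head = tok1["head"] == tok2["head"]
    direct_link = tok1["head"] == tok2["id"] or tok2["head"] == tok1["id"]
    return 1 if same_head or direct_link else 0.5

weights = {
    (a["lemma"], b["lemma"]): pair_weight(a, b)
    for a, b in itertools.combinations([garumna, matrona, sequana], 2)
}
for pair, weight in weights.items():
    print(pair, weight)
```

Only the pair Matrona and Sequana gets the higher weight, because the head of `Sequana` is the token `Matrona`.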

## Process relations instances

In the script below, the list of relations, in which each occurrence of an entity is represented by a list of its tokens, is processed to obtain a list of tuples, `final_relations`. Each tuple is a weighted edge of the graph that will be built.

In [11]:
tmp_relations = []
for rels_in_sent in relations_instances:

    for rel in rels_in_sent:

        ent1 = ' '.join([i['lemma'] for i in rel['ent1']])
        ent2 = ' '.join([i['lemma'] for i in rel['ent2']])

        distance = rel['distance']

        row = (ent1, ent2, distance)
        tmp_relations.append(row)

# create a defaultdict to store the sums
sums = defaultdict(int)

# loop through the tuples and add the third element to the sum
# for the corresponding first two elements
for t in tmp_relations:
    sums[(t[0], t[1])] += t[2]

# create a new list of tuples with the first two elements as the key
# and the sum as the value
final_relations = [(k[0], k[1], v) for k, v in sums.items()]
final_relations = [rel for rel in final_relations if rel[0] != rel[1]]

In [12]:
final_relations

[('Garumna', 'Matrona', 0.5),
 ('Garumna', 'Sequana', 0.5),
 ('Matrona', 'Sequana', 1),
 ('Marcus Messala', 'Marcus Piso', 2),
 ('Rhenus', 'Iura', 0.5),
 ('Rhenus', 'Lemannus', 0.5),
 ('Rhenus', 'Rhodanus', 0.5),
 ('Iura', 'Lemannus', 0.5),
 ('Iura', 'Rhodanus', 1.0),
 ('Lemannus', 'Rhodanus', 1.5),
 ('Casticus', 'Catamantaloedes', 0.5),
 ('Dumnorix', 'Diviciacus', 1.0),
 ('Rhenus', 'Noreia', 0.5),
 ('Lucius Piso', 'Aulus Gabinius', 1),
 ('Caesar', 'Gallia', 7.0),
 ('Caesar', 'Genava', 0.5),
 ('Gallia', 'Genava', 1.0),
 ('Nammeius', 'Verucloetius', 1),
 ('Caesar', 'Lucius Cassius', 0.5),
 ('Lemannus', 'Iura', 0.5),
 ('Rhodanus', 'Iura', 0.5),
 ('Dumnorix', 'Orgetorix', 0.5),
 ('Italia', 'Aquileia', 0.5),
 ('Italia', 'Gallia', 0.5),
 ('Italia', 'Alpis', 0.5),
 ('Aquileia', 'Gallia', 0.5),
 ('Aquileia', 'Alpis', 0.5),
 ('Gallia', 'Alpis', 1),
 ('Rhodanus', 'Caesar', 0.5),
 ('Arar', 'Rhodanus', 0.5),
 ('Caesar', 'Arar', 0.5),
 ('Caesar', 'Lucius Piso', 1.0),
 ('Caesar', 'Tigurinus', 0.5),
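
Note that the output lists some pairs in both orders, for example `('Iura', 'Lemannus', 0.5)` and `('Lemannus', 'Iura', 0.5)`. In an undirected `networkx.Graph`, adding an edge that already exists updates its attributes, so the weight of the reversed pair would replace the earlier one instead of being added to it. One possible way to avoid this, sketched here as an assumption rather than as part of the original pipeline, is to sort each pair before summing. The function name `merge_undirected` and the three sample rows are chosen for illustration.

```python
from collections import defaultdict

# Sample rows in the same (ent1, ent2, weight) shape as tmp_relations
rows = [
    ("Iura", "Lemannus", 0.5),
    ("Lemannus", "Iura", 0.5),
    ("Matrona", "Sequana", 1),
]

def merge_undirected(rows):
    """Sum weights per unordered pair, skipping self-relations."""
    sums = defaultdict(float)
    for ent1, ent2, weight in rows:
        if ent1 == ent2:
            continue
        # Sorting the names makes (a, b) and (b, a) share one key
        sums[tuple(sorted((ent1, ent2)))] += weight
    return [(a, b, w) for (a, b), w in sums.items()]

merged = merge_undirected(rows)
print(merged)
```

With this variant each pair appears once, and its weight is the total over both orders.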

#### Create and process the entities that appear in at least one interaction and are actual people

In [13]:
import json

# Create the sorted list of entity names involved in at least one relation
entities = sorted({name for rel in final_relations for name in rel if isinstance(name, str)})

# Load the list of all places in Latin literature
with open('places.json', encoding='utf-8') as json_file:
    places_list = json.load(json_file)

# Drop the places from the entities list
entities = [name for name in entities if name not in places_list]

In [14]:
entities

['Acco',
 'Adiatunnus',
 'Alpis',
 'Ambiorix',
 'Andes',
 'Appius Claudius',
 'Arduenna',
 'Ariovistus',
 'Aulus Gabinius',
 'Aurunculeius',
 'Boduognatus',
 'Bratuspantium',
 'Caesar',
 'Carcaso',
 'Cassius',
 'Casticus',
 'Catamantaloedes',
 'Catuvolcus',
 'Cicero',
 'Cimberius',
 'Cingetorix',
 'Commius Atrebas',
 'Considius',
 'Coriosolites',
 'Cotta',
 'Crassus',
 'Diviciacus',
 'Dumnorix',
 'Durocortorum',
 'Eratosthenes',
 'Fabius',
 'Gaius Antistius Reginus',
 'Gaius Fabius',
 'Gaius Valerius',
 'Gaius Valerius Caburus',
 'Gaius Valerius Procillus',
 'Gaius Valerius Troucillus',
 'Gaius Volusenus',
 'Galba',
 'Garumna',
 'Genava',
 'Gnaeus Pompeius',
 'Hercynia',
 'Iccius',
 'Illyricum',
 'Indutiomarus',
 'Itius',
 'Iura',
 'Labienus',
 'Legio',
 'Lemannus',
 'Liger',
 'Liscus',
 'Lucius Aurunculeius Cotta',
 'Lucius Cassius',
 'Lucius Cotta',
 'Lucius Domitius',
 'Lucius Manlius',
 'Lucius Minucius Basilus',
 'Lucius Piso',
 'Lucius Sulla',
 'Lucius Valerius Praeconinus',
 'Lu

## Let's build the graph!

In [15]:
import networkx as nx
from get_relations import final_relations
import matplotlib.pyplot as plt
from networkx.algorithms import community
from pprint import pprint

edges = final_relations
nodes = entities

G = nx.Graph()

# Add one node per entity
G.add_nodes_from(nodes)

# Add the weighted edges, given as (node, node, weight) tuples
G.add_weighted_edges_from(edges)

In [16]:
print(G.edges().data())

[('Acco', 'Caesar', {'weight': 0.5}), ('Acco', 'Durocortorum', {'weight': 0.5}), ('Acco', 'Gallia', {'weight': 0.5}), ('Adiatunnus', 'Crassus', {'weight': 1}), ('Alpis', 'Italia', {'weight': 0.5}), ('Alpis', 'Aquileia', {'weight': 0.5}), ('Alpis', 'Gallia', {'weight': 1}), ('Ambiorix', 'Gallia', {'weight': 1.0}), ('Ambiorix', 'Arduenna', {'weight': 0.5}), ('Ambiorix', 'Lucius', {'weight': 0.5}), ('Ambiorix', 'Catuvolcus', {'weight': 1.0}), ('Ambiorix', 'Germania', {'weight': 1.0}), ('Ambiorix', 'Scaldis', {'weight': 0.5}), ('Ambiorix', 'Mosa', {'weight': 0.5}), ('Ambiorix', 'Rhenus', {'weight': 1.0}), ('Ambiorix', 'Caesar', {'weight': 0.5}), ('Andes', 'Publius', {'weight': 1}), ('Andes', 'Oceanus', {'weight': 0.5}), ('Arduenna', 'Indutiomarus', {'weight': 0.5}), ('Arduenna', 'Rhenus', {'weight': 0.5}), ('Arduenna', 'Lucius', {'weight': 0.5}), ('Arduenna', 'Scaldis', {'weight': 0.5}), ('Arduenna', 'Mosa', {'weight': 0.5}), ('Ariovistus', 'Caesar', {'weight': 2.0}), ('Ariovistus', 'Vesen
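
The edge list above still contains place names such as `Gallia`, because edges are added from `final_relations` and adding an edge to a `networkx` graph creates any endpoint that is not yet a node. If the graph should contain people only, the edges can be filtered before they are added. The following is a minimal sketch under that assumption; the helper name `keep_people_edges` and the small `people` set are illustrative, and the three edges are taken from the relations shown earlier.

```python
# A few weighted edges from final_relations
edges = [
    ("Caesar", "Gallia", 7.0),
    ("Dumnorix", "Diviciacus", 1.0),
    ("Caesar", "Genava", 0.5),
]

# Illustrative subset of the filtered entities list (people only)
people = {"Caesar", "Dumnorix", "Diviciacus"}

def keep_people_edges(edges, people):
    """Keep only the edges whose two endpoints are both in `people`."""
    return [(a, b, w) for a, b, w in edges if a in people and b in people]

people_edges = keep_people_edges(edges, people)
print(people_edges)
```

The filtered list can then be passed to the graph in place of the full edge list.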

In [17]:
# VISUALIZATION
# plt.figure(figsize=(15,15))
# nx.draw(G, with_labels=True) # with_labels=True for showing node's label
# plt.show()

In [18]:
from pyvis.network import Network

net = Network(notebook=True, width="1000px", height="700px", bgcolor="#222222", font_color='white')

net.from_nx(G)
net.show("caes_net.html")

Local cdn resources have problems on chrome/safari when used in jupyter-notebook. 
