# Session 7 - Collate with CollateX: Input Formats

## Summary 

- Plain text
- JSON (simple)
- JSON (pre-tokenized)
  - Jupyter trick: displaying JSON
  - Jupyter trick: saving an HTML alignment table
- XML

In this exercise, we will look at the possible input formats for CollateX.

As before, delete the outputs to start with a fresh notebook (Kernel > Restart & Clear Outputs).

Again, we import `collatex` and a new module called `json` that we will need for one of the input formats.

In [1]:
from collatex import *
import json

## Input 1 - Plain text

We have already seen how to use a plain text input in the previous sessions [5a](https://github.com/automaticCollationLausanne2020/Materials/blob/98b873a89b9bdb0c152f2ac7f06a899a71fda120/session5/Session05_PlainTextCollation.ipynb) and [5b](). ADD LINKS

Plain text means that it is a string data type: a string is the most basic way to represent text (hence the name "plain").

But it is possible to have a more complex representation of text. 

## Input 2 - JSON (simple) 

You have seen what JSON is in [session 6](), so you should not be too surprised by this input format! ADD LINKS

There are two possible JSON inputs. The first one is a simple version that is very similar to the plain text collation, except that all the witnesses and their sigils are gathered in one single variable.

In [102]:
json_input = """{
  "witnesses" : [
    {
      "id" : "UK",
      "content" : "George reckons he had a bogey-flavoured one once."
    },
    {
      "id" : "US",
      "content" : "George reckons he had a booger-flavored one once."
    },
    {
      "id" : "Film",
      "content" : "George sweared he got a bogey-flavoured one once."
    }
  ]
}""" # three """ is for a string spanning multiple lines

The `json_input` variable is a string, but it contains `{}` and `[]`, which mark objects and arrays in JSON and correspond to dictionaries and lists in Python.

It is very straightforward to convert this string into a dictionary that can be processed in Python:

In [109]:
# we are using the loads() method
witnesses_simple = json.loads(json_input)

# we can check that we have the correct data type
print("json_input: ", type(json_input))
print("witnesses: ", type(witnesses_simple))

json_input:  <class 'str'>
witnesses:  <class 'dict'>


Since we already have all our witnesses gathered into a single variable, we do not need to create a Collation instance with the line:

`collation = Collation()`

We can directly collate the witnesses.

In [110]:
result = collate(witnesses_simple) # of course we could have more options (output, layout, segmentation, indent)
print(result)

+------+--------+---------+----+-----+---+--------+---+-----------+-----------+
| UK   | George | reckons | he | had | a | bogey  | - | flavoured | one once. |
| US   | George | reckons | he | had | a | booger | - | flavored  | one once. |
| Film | George | sweared | he | got | a | bogey  | - | flavoured | one once. |
+------+--------+---------+----+-----+---+--------+---+-----------+-----------+
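
If your witnesses already live in Python strings, the simple JSON structure can also be built programmatically rather than typed by hand. The sketch below is one possible way to do it; the helper name `build_simple_input` is made up for this illustration and is not part of CollateX.

```python
import json


def build_simple_input(texts):
    """Turn a {sigil: text} dictionary into the simple 'witnesses' structure."""
    # build_simple_input is a hypothetical helper, not a CollateX function
    return {
        "witnesses": [
            {"id": sigil, "content": text} for sigil, text in texts.items()
        ]
    }


texts = {
    "UK": "George reckons he had a bogey-flavoured one once.",
    "US": "George reckons he had a booger-flavored one once.",
}
witnesses_built = build_simple_input(texts)

# json.dumps() turns the dictionary back into a JSON string, e.g. to save it
print(json.dumps(witnesses_built, indent=2))
```

The resulting dictionary has the same shape as `witnesses_simple`, so it should be usable with `collate()` in the same way.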


## Input 3 - JSON (pre-tokenized)

For this input format, the witnesses must be pre-tokenized before we can collate (see the [Gothenburg model]() about tokenization). ADD LINKS

In JSON, the tokens are like a Python dictionary: they have properties expressed as key/value pairs and are surrounded by curly braces (see [session 2]()). ADD LINKS

Tokens have the following properties:
- **t** : a mandatory property with the actual text of the transcribed document (technically, it is called a 'reading')
- **n** : an optional normalized form of the reading, which can help CollateX with the alignment (we will look at the normalization in a few minutes during [session 7b]()). ADD LINKS
- Anything that you feel like including! For instance part-of-speech tagging, or the location of the reading in the document, an editorial note or comment...

In [89]:
json_sample = {"witnesses": [ # we have a list of witnesses
    {"id": "A", # each witness has a name
    "tokens": [ # and a list of tokens
        {"t":"Bladorthin,"}, # mandatory 't'
        {"t":"Dwarves,", "n":"dwarves,"}, # optional 'n'
        {"t":"and", "pos":"conjunction"}, # and whatever you want
        {"t":"Mr"},
        {"t":"Baggins."} # last element: no comma after
        ]
    }, 
    #... here you would have the other witnesses
]}

If this looks scary, you can take some time to have a good look at it again. It is a very complex object, made of several nested lists and dictionaries:

1. dictionary: the whole object is in curly braces `{}`
2. the dictionary has one key called 'witnesses'
3. the value of 'witnesses' is a list of dictionaries!
4. each witness is a dictionary with two keys ('id' and 'tokens')
5. the value of 'tokens' is a list of dictionaries again!
6. each token has at least one key/value ('t')
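
To make the nesting more concrete, here is a small sketch that walks through such an object step by step. It uses a shortened copy of the sample above; falling back on `t` when a token has no `n` is only a display choice made for this illustration.

```python
json_sample = {"witnesses": [
    {"id": "A",
     "tokens": [
         {"t": "Bladorthin,"},
         {"t": "Dwarves,", "n": "dwarves,"},
         {"t": "and", "pos": "conjunction"},
     ]},
]}

for witness in json_sample["witnesses"]:       # step 3: a list of witnesses
    print("Witness:", witness["id"])           # step 4: keys 'id' and 'tokens'
    for token in witness["tokens"]:            # step 5: a list of tokens
        reading = token["t"]                   # step 6: 't' is always there
        normalized = token.get("n", reading)   # 'n' is optional: fall back on 't'
        print("  ", reading, "->", normalized)

# collect all the readings of the first witness in a plain list
first_tokens = [token["t"] for token in json_sample["witnesses"][0]["tokens"]]
```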

### Reading a JSON file
Usually, we don't want to write an entire JSON file ourselves in Python. For now we are going to read it from a file with the `json.load()` function (we do not use `read()` for JSON).

**ATTENTION** 

There are two json methods, `json.load()` and `json.loads()`. This can be confusing!

- `json.loads()` is for "load string", it converts a string into a dictionary
- `json.load()` is for reading JSON from a file
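
The difference can be tried out on a tiny example. This is only a sketch: the JSON string and the temporary file `sample.json` are invented for the illustration, so that the cell runs without any data file.

```python
import json
import os
import tempfile

json_string = '{"witnesses": [{"id": "A", "content": "Passer delicie mee puelle"}]}'

# json.loads(): the input is a string
from_string = json.loads(json_string)

# json.load(): the input is an open file, so we first write a small temporary file
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.json")
    with open(path, "w", encoding="utf-8") as file:
        file.write(json_string)
    with open(path, "r", encoding="utf-8") as file:
        from_file = json.load(file)

# both functions give us the same kind of object
print(type(from_string), type(from_file))
print(from_string == from_file)
```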

In [111]:
# this is how to read a json file
with open("../data/Catullus/catullus-tokenized-OGBodmer47.json", "r") as file:
    witnesses = json.load(file)

In [112]:
# let us look at the file we just read: what properties can tokens have here?
witnesses

{'witnesses': [{'id': 'G1',
   'tokens': [{'t': 'fletus', 'locus': '1r:12'},
    {'t': 'passeris', 'locus': '1r:12'},
    {'t': 'lesbie', 'locus': '1r:12'},
    {'t': 'Passer', 'locus': '1r:13'},
    {'t': 'delicie', 'n': 'deliciae', 'locus': '1r:13'},
    {'t': 'mee', 'n': 'meae', 'locus': '1r:13'},
    {'t': 'puelle', 'n': 'puellae', 'locus': '1r:13'},
    {'t': 'Qui', 'locus': '1r:14'},
    {'t': 'cum', 'locus': '1r:14'},
    {'t': 'ludere', 'locus': '1r:14'},
    {'t': 'quem', 'locus': '1r:14'},
    {'t': 'in', 'locus': '1r:14'},
    {'t': 'sinu', 'locus': '1r:14'},
    {'t': 'tenere', 'locus': '1r:14'},
    {'t': 'Qui', 'locus': '1r:15'},
    {'t': 'primum', 'locus': '1r:15'},
    {'t': 'digitum', 'locus': '1r:15'},
    {'t': 'dare', 'locus': '1r:15'},
    {'t': 'at', 'locus': '1r:15'},
    {'t': 'petenti', 'locus': '1r:15'},
    {'t': 'Et', 'locus': '1r:16'},
    {'t': 'acris', 'locus': '1r:16'},
    {'t': 'solet', 'locus': '1r:16'},
    {'t': 'incitare', 'locus': '1r:16'},
    {

### Collating JSON

Again, since we already have our witnesses in a variable, we do not need to create a `Collation()` instance.

In [119]:
# since the collate function does not return anything when the output is 'html',
# we don't need a variable
collate(witnesses, output="html", layout="vertical")

| G1 | G2 | O1 | O2 | Bodmer47/1 | Bodmer47/2 |
|---|---|---|---|---|---|
| fletus | fletus | - | - | - | - |
| passeris | passeris | - | - | passeris | passeris |
| lesbie | lesbie | - | - | appelatio | appelatio |
| Passerdeliciemeepuel leQuicumluderequemin sinutenere | Passerdeliciemeepuel leQuicumluderequemin sinutenere | Passerdeliciemeepuel leQuicumluderequemin sinutenere | Passerdeliciemeepuel leQuicumluderequemin sinutenere | Passerdelitiaemeaepu ellaeQuicumludereque minsinutenere | Passerdelitiaemeaepu ellaeQuicumludereque minsinutenere |
| Qui | Qui | Qui | Cui | Quoi | Quoi |
| primumdigitumdare | primumdigitumdare | primumdigitumdare | primumdigitumdare | primumdigitumdare | primumdigitumdare |
| at | at | at | at | appetentiAtque | appetentiAtque |
| petenti | patenti | petenti | petenti | - | - |
| Et | Et | Ea | Ea | - | - |
| acrissoletincitaremo rsusCumdesideriomeon itentiCarumnescioqui d | acrissoletincitaremo rsusCumdesideriomeon itentiCarumnescioqui d | acrissoletincitaremo rsusCumdesideriomeon itentiKarumnescioqui d | acrissoletincitaremo rsusCumdesideriomeon itentiKarumnescioqui d | acrissoletincitaremo rsusCumdesyderiomeon itentiCharumnescioqu id | acrissoletincitaremo rsusCumdesyderiomeon itentiCharumnescioqu id |


**A couple of remarks**

1. Do you see weird characters in the collation result? If this ever happens to you, the most likely explanation is a file encoding issue. Go back to the `open()` call for the JSON file and add `, encoding='utf-8'` after `"r"`, then collate again. That's better...

2. BUT, you see how the text is completely concatenated in some cells of the table. This is a result of the JSON input: there is no whitespace to separate the tokens, as there was in plain text. Go back to the `collate()` function and add `, segmentation=False` to the arguments.

Now your collation result should be readable!

That is nice, but what happened to our token properties, the **n**, **locus**, and **note** that we could see in the JSON file?

CollateX does not keep them for most outputs: if you try to modify the output in the next code cell, you will see that `json` is the only one where you can find them again.

In [120]:
with open("../data/Catullus/catullus-tokenized-OGBodmer47.json", "r", encoding="utf-8") as file:
    witnesses = json.load(file)

collation = Collation.create_from_dict(witnesses)
result = collate(collation, output="json", layout="vertical", segmentation=False)
print(result)

{"table": [[[{"_sigil": "G1", "_token_array_position": 0, "locus": "1r:12", "t": "fletus"}], [{"_sigil": "G1", "_token_array_position": 1, "locus": "1r:12", "t": "passeris"}], [{"_sigil": "G1", "_token_array_position": 2, "locus": "1r:12", "t": "lesbie"}], [{"_sigil": "G1", "_token_array_position": 3, "locus": "1r:13", "t": "Passer"}], [{"_sigil": "G1", "_token_array_position": 4, "locus": "1r:13", "n": "deliciae", "t": "delicie"}], [{"_sigil": "G1", "_token_array_position": 5, "locus": "1r:13", "n": "meae", "t": "mee"}], [{"_sigil": "G1", "_token_array_position": 6, "locus": "1r:13", "n": "puellae", "t": "puelle"}], [{"_sigil": "G1", "_token_array_position": 7, "locus": "1r:14", "t": "Qui"}], [{"_sigil": "G1", "_token_array_position": 8, "locus": "1r:14", "t": "cum"}], [{"_sigil": "G1", "_token_array_position": 9, "locus": "1r:14", "t": "ludere"}], [{"_sigil": "G1", "_token_array_position": 10, "locus": "1r:14", "t": "quem"}], [{"_sigil": "G1", "_token_array_position": 11, "locus": "1

### Jupyter trick: Displaying JSON 


The JSON result of CollateX is not very easy to read. It is actually a string, a sequence of characters. The computer does not interpret the brackets as indications of the structure of the result. 

So, if we want a better display of the JSON result (like what we saw when we opened the input file), we need to follow a few steps:
1. convert the string to a dictionary object
2. "prettify" the dictionary
3. print the pretty dictionary

In [7]:
type(result) # => it is a string. To display it nicely, it would be better to have a json object

str

In [8]:
dictionary = json.loads(result) # we convert the JSON string to a Python dictionary
type(dictionary) # => now we have a dictionary

dict

In [124]:
# we can "prettify" json with json.dumps()
pretty_dict = json.dumps(dictionary, indent=1) 
print(pretty_dict) 

{
 "table": [
  [
   [
    {
     "_sigil": "G1",
     "_token_array_position": 0,
     "locus": "1r:12",
     "t": "fletus"
    }
   ],
   [
    {
     "_sigil": "G1",
     "_token_array_position": 1,
     "locus": "1r:12",
     "t": "passeris"
    }
   ],
   [
    {
     "_sigil": "G1",
     "_token_array_position": 2,
     "locus": "1r:12",
     "t": "lesbie"
    }
   ],
   [
    {
     "_sigil": "G1",
     "_token_array_position": 3,
     "locus": "1r:13",
     "t": "Passer"
    }
   ],
   [
    {
     "_sigil": "G1",
     "_token_array_position": 4,
     "locus": "1r:13",
     "n": "deliciae",
     "t": "delicie"
    }
   ],
   [
    {
     "_sigil": "G1",
     "_token_array_position": 5,
     "locus": "1r:13",
     "n": "meae",
     "t": "mee"
    }
   ],
   [
    {
     "_sigil": "G1",
     "_token_array_position": 6,
     "locus": "1r:13",
     "n": "puellae",
     "t": "puelle"
    }
   ],
   [
    {
     "_sigil": "G1",
     "_token_array_position": 7,
     "locus": "1r:14",


In [100]:
# and of course we can save it as a file
with open('collatex_json_result.json', 'w', encoding='utf-8') as file:
    file.write(pretty_dict)
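
Once the result is a dictionary, the token properties can be read back out of it. The sketch below assumes the structure visible in the output above: a `table` list with one row per witness, where each cell is a list of token dictionaries. The sample string is hand-made and much shorter than a real result (witness `X` is invented), and treating an empty cell as `null` is an assumption of this sketch.

```python
import json

# hand-made sample imitating the shape of the CollateX JSON output
sample_result = """{"table": [
  [[{"_sigil": "G1", "locus": "1r:13", "n": "deliciae", "t": "delicie"}],
   [{"_sigil": "G1", "locus": "1r:13", "n": "meae", "t": "mee"}]],
  [[{"_sigil": "X", "t": "delicie"}],
   null]
]}"""

table = json.loads(sample_result)["table"]

normalized_forms = []
for row in table:              # one row per witness (assumption)
    for cell in row:           # one cell per aligned position
        if cell is None:       # assumed marker for a gap in this witness
            continue
        for token in cell:     # a cell can hold several tokens
            if "n" in token:   # keep only the tokens that have a normalized form
                normalized_forms.append((token["_sigil"], token["t"], token["n"]))

print(normalized_forms)
```

The same loop could collect `locus` or any other property that you included in your tokens.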

### Jupyter trick: saving your html table 

As we have seen, the HTML table and the graph output are displayed directly in Jupyter, but you cannot save them in a variable or in a file. That is sometimes inconvenient, since you may not always want to open a notebook to see your results.

There is a Jupyter trick, a magic command, that will capture the output of the cell (what you see after `Out [a number]:`).

In [122]:
%%capture result 

# the first line is a Jupyter magic command! It must always be the first line in the cell

collation = Collation.create_from_dict(witnesses)

collate(collation, output="html", layout="vertical")

In [123]:
# since we captured the previous cell's output, it did not appear
# if we want to see the result we have to display it in a new cell with the method show()
result.show()

| G1 | G2 | O1 | O2 | Bodmer47/1 | Bodmer47/2 |
|---|---|---|---|---|---|
| fletus | fletus | - | - | - | - |
| passeris | passeris | - | - | passeris | passeris |
| lesbie | lesbie | - | - | appelatio | appelatio |
| Passerdeliciemeepuel leQuicumluderequemin sinutenere | Passerdeliciemeepuel leQuicumluderequemin sinutenere | Passerdeliciemeepuel leQuicumluderequemin sinutenere | Passerdeliciemeepuel leQuicumluderequemin sinutenere | Passerdelitiaemeaepu ellaeQuicumludereque minsinutenere | Passerdelitiaemeaepu ellaeQuicumludereque minsinutenere |
| Qui | Qui | Qui | Cui | Quoi | Quoi |
| primumdigitumdare | primumdigitumdare | primumdigitumdare | primumdigitumdare | primumdigitumdare | primumdigitumdare |
| at | at | at | at | appetentiAtque | appetentiAtque |
| petenti | patenti | petenti | petenti | - | - |
| Et | Et | Ea | Ea | - | - |
| acrissoletincitaremo rsusCumdesideriomeon itentiCarumnescioqui d | acrissoletincitaremo rsusCumdesideriomeon itentiCarumnescioqui d | acrissoletincitaremo rsusCumdesideriomeon itentiKarumnescioqui d | acrissoletincitaremo rsusCumdesideriomeon itentiKarumnescioqui d | acrissoletincitaremo rsusCumdesyderiomeon itentiCharumnescioqu id | acrissoletincitaremo rsusCumdesyderiomeon itentiCharumnescioqu id |


In [97]:
# we save the html part of the output in a variable
# this is tricky! You just have to believe that it will work for now
# Attention: it could stop working later with new versions of Jupyter
html = result.outputs[0].data["text/html"]
html

'<table>\n    <tr>\n        <th>G1</th>\n        <th>G2</th>\n        <th>O1</th>\n        <th>O2</th>\n        <th>Bodmer47/1</th>\n        <th>Bodmer47/2</th>\n    </tr>\n    <tr>\n        <td>fletus</td>\n        <td>fletus</td>\n        <td>-</td>\n        <td>-</td>\n        <td>-</td>\n        <td>-</td>\n    </tr>\n    <tr>\n        <td>passeris</td>\n        <td>passeris</td>\n        <td>-</td>\n        <td>-</td>\n        <td>passeris</td>\n        <td>passeris</td>\n    </tr>\n    <tr>\n        <td>lesbie</td>\n        <td>lesbie</td>\n        <td>-</td>\n        <td>-</td>\n        <td>appelatio</td>\n        <td>appelatio</td>\n    </tr>\n    <tr>\n        <td>Passerdeliciemeepuel<br>leQuicumluderequemin<br>sinutenere</td>\n        <td>Passerdeliciemeepuel<br>leQuicumluderequemin<br>sinutenere</td>\n        <td>Passerdeliciemeepuel<br>leQuicumluderequemin<br>sinutenere</td>\n        <td>Passerdeliciemeepuel<br>leQuicumluderequemin<br>sinutenere</td>\n        <td>Passerdeli

In [96]:
# and finally we can save our table in a file for later
with open('collatex_html_result.html', 'w', encoding='utf-8') as file:
    file.write(html)

Of course, we only save the HTML, not the CSS, which is what makes alternate lines darker and highlights the rows of the table in blue when you hover over them with your mouse.

This can be fixed by adding [CSS code](https://www.w3schools.com/css/css_table.asp) to your HTML file.
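
As an illustration, here is one possible way to add some CSS from Python before saving. This is only a sketch: the helper `wrap_table`, the colours and the sample table are all invented for this example, and the styling is not the one Jupyter uses.

```python
def wrap_table(table_html, title="Collation result"):
    """Return a complete HTML page containing the table and a little CSS."""
    # wrap_table is a hypothetical helper; the CSS rules are just an example
    css = """
    table { border-collapse: collapse; }
    th, td { border: 1px solid #ccc; padding: 4px 8px; }
    tr:nth-child(even) { background-color: #f2f2f2; }
    tr:hover { background-color: #d9e8ff; }
    """
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{title}</title>\n<style>{css}</style>\n"
        "</head>\n<body>\n" + table_html + "\n</body>\n</html>\n"
    )


# a tiny stand-in for the captured HTML table
sample_table = "<table><tr><th>G1</th></tr><tr><td>fletus</td></tr></table>"
page = wrap_table(sample_table)
print(page)
```

The returned string can be written to a file in the same way as the `html` variable in the previous cell.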

## Input 4 - XML

The [TEI Guidelines](https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html) define the standard format for encoding transcriptions. It is very likely that your transcriptions will use TEI XML, as plain text is rarely suitable to record the text of a document.

Here is a short video to watch later, if you want to know what TEI is and why it is a good idea to use it: https://www.youtube.com/watch?v=VvSQ530gxPM

Unfortunately, XML is not an input option for the Python version of CollateX.

### Why no XML input?

If we try to collate XML directly we will have problems: CollateX cannot tell the difference between the tags and the text! We need to remove the tags, but there are two problems with that.

1. There can be many different ways of encoding the same thing in TEI: it is a flexible format that can be adapted to the needs of researchers and to their sources. Therefore it is difficult for CollateX to anticipate all possible TEI input formats!

2. There can be several layers of text encoded in a single file. In that case, we would need to make one witness for each layer of text.

### What can I do?

If you need to collate from XML encoded files, there are different solutions depending on the complexity of your encoding and what you are trying to achieve. Therefore it is important from the beginning of your project, if possible even before you start transcribing, to think about the **purpose** of your collation project, and **what information you will need later on for your analysis.**

If you have a relatively simple encoding, and you don't need more than the text of the token, you can extract a plain text version by removing the tags (use XSLT or regex) fairly easily.
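
For a simple case, removing the tags can be sketched in a few lines with Python's built-in XML parser instead of XSLT or regex. The XML fragment below is invented for the illustration and is much simpler than a real TEI file.

```python
import xml.etree.ElementTree as ET

# a small hand-made fragment, not taken from a real transcription
tei_sample = """<l n="1">Passer <hi rend="red">delicie</hi> mee
    <abbr>puelle</abbr></l>"""

root = ET.fromstring(tei_sample)

# itertext() yields all the text nodes in document order, without the tags
raw_text = "".join(root.itertext())

# collapse line breaks and repeated spaces into single spaces
plain_text = " ".join(raw_text.split())
print(plain_text)
```

On a complete TEI file you would probably start from the `<text>` element, so that the content of the header is left out.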

If you need to include some normalized forms for a better collation, or other kinds of notes (e.g. part-of-speech tagging), then it will be worth converting your XML to JSON. For that you can use XSLT; here are some examples online:

- https://github.com/enury/phd-automated-collation/blob/master/XSLT/witnesses-to-json.xsl (Elisa Nury, it is the transformation I have used for the Catullus example, and the Calpurnius App)
- https://gitlab.huma-num.fr/mgillelevenson/tei_collator/-/blob/master/xsl/pre_alignement/transformation_json.xsl (Matthias Gille [Levenson](http://perso.ens-lyon.fr/matthias.gille-levenson/accueil.html), PhD student, Lyon)
- https://github.com/CondorCompPhil/falcon/blob/656a1ed8988bf8b3f60f31c3445af89cb97fba61/falcon/collation.py#L8-L68 (Jean-Baptiste [Camps](http://www.chartes.psl.eu/fr/jean-baptiste-camps), for the [*Falcon*](https://github.com/CondorCompPhil/falcon) project)
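
To give an idea of what such a transformation does, here is a minimal Python sketch. It does not reproduce any of the transformations linked above. It assumes a deliberately simplified, hypothetical encoding in which every word is a `<w>` element with an optional `norm` attribute; the `<witness>` wrapper and the function name `witness_to_tokens` are made up for this example.

```python
import json
import xml.etree.ElementTree as ET

# hypothetical, simplified encoding: one <w> per word, optional @norm
xml_sample = """<witness id="G1">
  <w>Passer</w>
  <w norm="deliciae">delicie</w>
  <w norm="meae">mee</w>
</witness>"""


def witness_to_tokens(xml_string):
    """Convert one XML witness into the pre-tokenized dictionary structure."""
    root = ET.fromstring(xml_string)
    tokens = []
    for word in root.iter("w"):
        token = {"t": word.text}           # 't' is mandatory
        if word.get("norm"):               # 'n' only when a normalized form exists
            token["n"] = word.get("norm")
        tokens.append(token)
    return {"id": root.get("id"), "tokens": tokens}


witnesses_from_xml = {"witnesses": [witness_to_tokens(xml_sample)]}
print(json.dumps(witnesses_from_xml, indent=1))
```

A real project would need to adapt the element and attribute names to its own encoding, which is exactly why a generic XML input is difficult to provide.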

### Other options?

1. HyperCollate? https://github.com/HuygensING/hyper-collate
2. The Java version of CollateX has limited support for XML - https://collatex.net/doc/#xml-input
3. Another tool...

## Recap and Exercise 
