# Session 7 - Collate with CollateX: Input Formats

## Summary 

- Plain text
- JSON (simple)
- JSON (tokenized)
  - jupyter trick: displaying JSON
  - jupyter trick: saving an HTML alignment table
- XML

In this exercise, we will look at the possible input formats for CollateX.

As before, delete the outputs to start with a fresh notebook (Kernel > Restart & Clear Outputs).

Again, we import `collatex` and a new module called `json` that we will need for one of the input formats.

In [None]:
from collatex import *
import json

## Input 1 - Plain text

We have already seen how to use plain text input in the previous sessions [5a](https://github.com/automaticCollationLausanne2020/Materials/blob/98b873a89b9bdb0c152f2ac7f06a899a71fda120/session5/Session05_PlainTextCollation.ipynb) and [5b](https://github.com/automaticCollationLausanne2020/Materials/blob/master/session5/Session05b_collateFiles.ipynb).

**Plain text** means that it is a **string** data type: a string is the most basic way to represent text (hence the name "plain").

But it is possible to have a more complex representation of text. 

## Input 2 - JSON (simple) 

You have seen what JSON is in [session 6](https://github.com/automaticCollationLausanne2020/Materials/blob/master/session6/collation-outputs.ipynb), so you should not be too surprised by this input format!

There are two possible JSON inputs. The first one is a simple version that is very similar to the plain text collation, except that all the witnesses and their sigils are gathered together in one single variable.

In [None]:
json_input = """{
  "witnesses" : [
    {
      "id" : "UK",
      "content" : "George reckons he had a bogey-flavoured one once."
    },
    {
      "id" : "US",
      "content" : "George reckons he had a booger-flavored one once."
    },
    {
      "id" : "Film",
      "content" : "George sweared he got a bogey-flavoured one once."
    }
  ]
}""" # three """ is for a string spanning multiple lines

The `json_input` variable is a string, but it contains `{}` and `[]`: curly braces mark a JSON object (a Python dictionary), and square brackets mark a JSON array (a Python list).

It is very straightforward to convert this string into a dictionary that can be processed in Python:

In [None]:
# we are using the json.loads() function
witnesses_simple = json.loads(json_input)

# we can check that we have the correct data type
print("json_input: ", type(json_input))
print("witnesses: ", type(witnesses_simple))
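
Once the string has been converted, the result can be explored like any other Python dictionary. The following is a minimal sketch, using a shortened copy of the JSON string above (two witnesses only); the variable name `sigla` is introduced here for illustration.

```python
import json

# shortened copy of the JSON string above (two witnesses only)
json_input = """{
  "witnesses" : [
    {"id" : "UK", "content" : "George reckons he had a bogey-flavoured one once."},
    {"id" : "US", "content" : "George reckons he had a booger-flavored one once."}
  ]
}"""

witnesses_simple = json.loads(json_input)

# the outer {} became a dictionary, the [] became a list
print(type(witnesses_simple))
print(type(witnesses_simple["witnesses"]))

# each element of the list is a dictionary with the keys 'id' and 'content'
sigla = [witness["id"] for witness in witnesses_simple["witnesses"]]
for witness in witnesses_simple["witnesses"]:
    print(witness["id"], "->", witness["content"])
```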

Since we already have all our witnesses gathered into a single variable, we do not need to create a Collation instance with the line:

`collation = Collation()`

We can directly collate the witnesses.

In [None]:
# of course we could have more options (output, layout, segmentation, indent)
result = collate(witnesses_simple)
print(result)

## Input 3 - JSON (tokenized)

For this input format, the witnesses must be pre-tokenized before we can collate (see the [Gothenburg model](https://automaticcollationlausanne2020.github.io/session4a.html) about tokenization).

In JSON, the tokens are like Python dictionaries: they have properties expressed as key/value pairs and are surrounded by curly braces (see [session 2](https://github.com/automaticCollationLausanne2020/Materials/blob/master/session2/Session02_python_introduction.ipynb)).

Tokens have the following properties:
- **t** : a mandatory property with the actual text of the transcribed document (technically, it is called a 'reading')
- **n** : an optional normalized form of the reading, which can help CollateX with the alignment (we will look at the normalization in a few minutes during [session 7b](https://github.com/automaticCollationLausanne2020/Materials/blob/master/session7/Session07b_Normalization.ipynb)).
- Anything that you feel like including! For instance part-of-speech tagging, or the location of the reading in the document, an editorial note or comment...

In [None]:
json_sample = {"witnesses": [ # we have a list of witnesses
    {"id": "A", # each witness has a name
    "tokens": [ # and a list of tokens
        {"t":"Bladorthin,"}, # mandatory 't'
        {"t":"Dwarves,", "n":"dwarves,"}, # optional 'n'
        {"t":"and", "pos":"conjunction"}, # and whatever you want
        {"t":"Mr"},
        {"t":"Baggins."} # last element: no comma after
        ]
    }, 
    #... here you would have the other witnesses
]}

If this looks scary, you can take some time to have a good look at it again. It is a very complex object, made of several nested lists and dictionaries:

1. dictionary: the whole object is in curly braces `{}`
2. the dictionary has one key called 'witnesses'
3. the value of 'witnesses' is a list of dictionaries!
4. each witness is a dictionary with two keys ('id' and 'tokens')
5. the value of 'tokens' is a list of dictionaries again!
6. each token has at least one key/value ('t')
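
To see how the six levels fit together, here is a minimal sketch that builds the same kind of object from plain text and then walks through it. It assumes a very naive tokenization (splitting on whitespace) and a lowercase normalization; `tokenize_witness` is a hypothetical helper, not a CollateX function, and witness "B" is invented for the example.

```python
import json

def tokenize_witness(sigil, text):
    """Hypothetical helper: split a text on whitespace and build a witness dictionary."""
    tokens = []
    for word in text.split():
        # 't' is the reading, 'n' a (very naive) normalized form
        tokens.append({"t": word, "n": word.lower()})
    return {"id": sigil, "tokens": tokens}

tokenized = {"witnesses": [
    tokenize_witness("A", "Bladorthin, Dwarves, and Mr Baggins."),
    tokenize_witness("B", "Gandalf, Dwarves, and Mr Baggins."),  # invented witness
]}

# walk the nested structure: dictionary > list > dictionary > list > dictionary
for witness in tokenized["witnesses"]:
    for token in witness["tokens"]:
        if "t" not in token:
            print("missing 't' in witness", witness["id"])

# display the first two tokens of witness A
print(json.dumps(tokenized["witnesses"][0]["tokens"][:2]))
```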

### Reading a JSON file
Usually, we don't want to write an entire JSON file ourselves in Python. For now we are going to read it from a file with the `json.load()` function.

**Attention!** There are two `json` functions with very similar names:

- `json.loads()` is for "load string", it converts a string into a dictionary
- `json.load()` is for reading JSON from a file
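
The difference can be tried out with a small round trip. This is only a sketch: the data and the file name `sample.json` are made up, and the file is written to a temporary folder so that nothing in the course materials is touched. The counterparts `json.dumps()` and `json.dump()` do the reverse conversions.

```python
import json
import os
import tempfile

data = {"witnesses": [{"id": "A", "content": "some text"}]}

# string round trip: json.dumps() creates a string, json.loads() reads it back
as_string = json.dumps(data)
from_string = json.loads(as_string)

# file round trip: json.dump() writes to an open file, json.load() reads from one
path = os.path.join(tempfile.mkdtemp(), "sample.json")  # hypothetical file name
with open(path, "w", encoding="utf-8") as file:
    json.dump(data, file)
with open(path, "r", encoding="utf-8") as file:
    from_file = json.load(file)

# both routes give back the same dictionary
print(from_string == data, from_file == data)
```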

In [None]:
# this is how to read a json file
with open("../data/Catullus/catullus-tokenized-OGBodmer47.json", "r") as file:
    witnesses = json.load(file)

In [None]:
# let us look at the file we just read: what properties can tokens have here?
witnesses

### Collating JSON

Again, since we already have our witnesses in a variable, we do not need to create a `Collation()` instance.

In [None]:
# the collate function does not return anything when the output is 'html',
# therefore we don't need a variable to store the result
collate(witnesses, output="html", layout="vertical")

**A couple of remarks**

Do you see weird characters in the collation result? If this ever happens to you, the most likely explanation is a file encoding issue. Go back to the `open()` call for the JSON file, add `, encoding='utf-8'` after `"r"`, and collate again. That's better...

You can also see that the text is completely concatenated in some cells of the table. This is a consequence of the JSON input: unlike in plain text, there is no whitespace to separate the tokens. Go back to the `collate()` function and add `, segmentation=False` to the arguments.

Now your collation result should be readable. That is nice! But what happened to our token properties, the **n**, **locus**, and **note** that we could see in the JSON file?

CollateX does not keep them for most outputs: if you try to modify the output in the next code cell, you will see that `json` is the only one where you can find them again.

In [None]:
with open("../data/Catullus/catullus-tokenized-OGBodmer47.json", "r", encoding="utf-8") as file:
    witnesses = json.load(file)

result = collate(witnesses, output="json", layout="vertical", segmentation=False)
print(result)

### Jupyter trick: Displaying JSON 


The JSON result of CollateX is not very easy to read. It is actually a string, a sequence of characters. The computer does not interpret the brackets as indications of the structure of the result. 

So, if we want a better display of the JSON result (like what we saw when we opened the input file), we need to follow a few steps:
1. convert the string to a dictionary object
2. "prettify" the dictionary
3. print the pretty dictionary

In [None]:
type(result) # => it is a string. 

In [None]:
# To display it nicely, it would be better to have a Python object

dictionary = json.loads(result) # we convert the JSON string to a dictionary
type(dictionary) # => now we have a dictionary

In [None]:
# we can "prettify" json with json.dumps()
pretty_dict = json.dumps(dictionary, indent=1) 
print(pretty_dict) 

In [None]:
# and of course we can save it as a file
with open('collatex_json_result.json', 'w', encoding='utf-8') as file:
    file.write(pretty_dict)
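
Once the result is a dictionary, the token properties can be looked up with ordinary loops. The sketch below does not call CollateX: it uses a small hand-written stand-in for `result`, with invented values, and it assumes that the output contains a `witnesses` list and a `table` with one row per witness, where each cell is either a list of tokens or `null`. Check this against your own output before relying on it.

```python
import json

# hand-written stand-in for a CollateX JSON result (assumed shape, invented values)
result = """{
  "witnesses": ["A", "B"],
  "table": [
    [[{"t": "Corde", "n": "corde", "note": "second hand"}], [{"t": "ut", "n": "ut"}]],
    [[{"t": "Credo", "n": "credo"}], null]
  ]
}"""

dictionary = json.loads(result)

# collect every token that carries a 'note' property
notes = []
for sigil, row in zip(dictionary["witnesses"], dictionary["table"]):
    for cell in row:
        if cell is None:  # an empty cell: this witness has no reading here
            continue
        for token in cell:
            if "note" in token:
                notes.append((sigil, token["t"], token["note"]))

print(notes)
```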

### Jupyter trick: saving your html table 

As we have seen, the HTML table and the graph output are displayed directly in Jupyter, but you cannot save them in a variable or in a file. That is sometimes inconvenient, since you may not always want to open a notebook to see your results.

There is a Jupyter trick, a magic command, that will capture the output of the cell (what you see after `out [xx]:`).

In [None]:
%%capture result 

# %%capture is a jupyter magic command! 
# It must always be the first line in a cell

collate(witnesses, output="html", layout="vertical", segmentation=False)

In [None]:
# since we captured the previous cell's output, it did not appear
# if we want to see the result we have to display it in a new cell with the method show()
result.show()

In [None]:
# we save the html part of the output in a variable
# this is tricky! You just have to believe that it will work for now
# Attention: it could stop working later with new versions of Jupyter
html = result.outputs[0].data["text/html"]
html

In [None]:
# and finally we can save our table in a file for later
with open('collatex_html_result.html', 'w', encoding="utf-8") as file:
    file.write(html)

Of course, we only save the HTML, not the CSS, which is what makes alternate lines darker, and highlights the rows of the table when you hover over it with your mouse.

This can be fixed by adding [CSS code](https://www.w3schools.com/css/css_table.asp) to your HTML file.
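
As a rough sketch of that fix, the table can be wrapped in a complete HTML page with a `<style>` element before saving. The table below is a short stand-in for the `html` variable captured earlier, and the CSS rules and the file name `collatex_html_result_styled.html` are only suggestions.

```python
# stand-in for the captured table; in the notebook, use the `html` variable instead
html = "<table><tr><th>A</th><th>B</th></tr><tr><td>Corde</td><td>Credo</td></tr></table>"

# a few CSS rules: borders, darker alternate rows, highlight on hover
css = """
table { border-collapse: collapse; }
th, td { border: 1px solid #999; padding: 4px 8px; }
tr:nth-child(even) { background-color: #f2f2f2; }
tr:hover { background-color: #ddd; }
"""

# wrap the table and the CSS in a complete HTML page
page = (
    "<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n"
    f"<style>{css}</style>\n</head>\n<body>\n{html}\n</body>\n</html>\n"
)

with open("collatex_html_result_styled.html", "w", encoding="utf-8") as file:
    file.write(page)
```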

## Input 4 - XML TEI

The [TEI Guidelines](https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html) are the standard for transcription encoding. It is very likely that your transcriptions will use XML TEI, as plain text is rarely suitable to record the text of a document in all its complexity.

Here is a short video to watch later, if you want to know what TEI is and why it is a good idea to use it: https://www.youtube.com/watch?v=VvSQ530gxPM

Unfortunately, XML is not an input option for the Python version of CollateX.

### Why no XML input?

If we try to collate XML directly we will have problems: CollateX cannot tell the difference between the tags and the text! We need to remove the tags, but there are two problems with that.

1. There can be many different ways of encoding the same thing in TEI: it is a flexible format that researchers can adapt to their sources and their particular needs. It is therefore difficult for CollateX to anticipate all possible TEI input formats!

2. There can be several layers of text encoded in a single file. For Catullus, we have two separate files for the first hand text and the second hand corrections. But you could very well have both in a single file.

Consider the following example from codex [Bodmer 47](https://www.e-codices.unifr.ch/fr/fmb/cb-0047/1v/):

<img src="../ancillary/catullus2_secondHandCorr.jpg" width="75%"/>

You could have this encoding:

`<subst><del>Credo</del><add>Corde</add></subst> ut...`

If you simply remove the tags, your text will be "Credo Corde ut..." which does not make much sense! You should have either *Credo* or *Corde* but not both...
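
One way to choose between the two layers is to skip either the `<add>` or the `<del>` elements while extracting the text. The following is a minimal sketch with Python's built-in `xml.etree.ElementTree`; it assumes that the fragment is wrapped in an `<l>` element (added here so that the XML is well-formed) and that there is no TEI namespace. `extract_text` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

# the fragment above, wrapped in an <l> element to make it well-formed XML
xml = "<l><subst><del>Credo</del><add>Corde</add></subst> ut...</l>"

def extract_text(element, skip):
    """Return the text of `element`, ignoring every child whose tag is `skip`."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != skip:
            parts.append(extract_text(child, skip))
        # the text that follows a child always belongs to the parent
        parts.append(child.tail or "")
    return "".join(parts)

line = ET.fromstring(xml)
first_hand = extract_text(line, skip="add")  # keep the deleted reading
corrected = extract_text(line, skip="del")   # keep the added reading

print(first_hand)
print(corrected)
```

This prints "Credo ut..." for the first hand and "Corde ut..." for the corrected text, so each layer can be collated as a separate witness.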

### What can I do?

If you need to collate from XML encoded files, there will be different solutions depending on the complexity of your encoding and what you are trying to achieve. 

Therefore it is important from the beginning of your project, if possible even before you start transcribing, to think about the **purpose** of your collation project, and **what information you will need later on for your analysis.**

If you have a relatively simple encoding, and you don't need more than simple tokens **t**, you can fairly easily extract a plain text version by removing the tags (using XSLT or even regular expressions). There is an example script in the ancillary folder, [tei2txt.xsl](https://github.com/automaticCollationLausanne2020/Materials/blob/master/ancillary/tei2txt.xsl).

If you need to include normalized forms for a better collation, or other kinds of annotation (e.g. part-of-speech tagging), then it will be worth converting your XML to JSON.

For the conversion from XML to JSON, here are a few examples online:

- [example 1](https://github.com/enury/phd-automated-collation/blob/master/XSLT/witnesses-to-json.xsl) (Elisa Nury, it is the transformation I have used for the Catullus example and the Calpurnius App)
- [example 2](https://gitlab.huma-num.fr/mgillelevenson/tei_collator/-/blob/master/xsl/pre_alignement/transformation_json.xsl) (Matthias Gille [Levenson](http://perso.ens-lyon.fr/matthias.gille-levenson/accueil.html), PhD student, Lyon)
- [example 3](https://github.com/CondorCompPhil/falcon/blob/656a1ed8988bf8b3f60f31c3445af89cb97fba61/falcon/collation.py#L8-L68) (Jean-Baptiste [Camps](http://www.chartes.psl.eu/fr/jean-baptiste-camps), for the [*Falcon*](https://github.com/CondorCompPhil/falcon) project)
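
The examples above are written for real projects. For a first impression of what such a conversion involves, here is a minimal Python sketch. It assumes a deliberately simplified, hypothetical encoding (one `<w>` element per word inside numbered `<l>` elements, no TEI namespace), a lowercase normalization for **n**, and a `locus` property taken from the line number; `witness_from_xml` is not part of CollateX.

```python
import json
import xml.etree.ElementTree as ET

def witness_from_xml(sigil, xml_string):
    """Hypothetical helper: turn simplified XML into a tokenized witness."""
    root = ET.fromstring(xml_string)
    tokens = []
    for line in root.iter("l"):
        for word in line.iter("w"):
            tokens.append({
                "t": word.text,           # the reading
                "n": word.text.lower(),   # naive normalization
                "locus": line.get("n"),   # line number as location
            })
    return {"id": sigil, "tokens": tokens}

# simplified, hypothetical encoding of two words
xml = '<text><l n="1"><w>Corde</w> <w>ut</w></l></text>'

witness = witness_from_xml("A", xml)
json_witnesses = {"witnesses": [witness]}
print(json.dumps(json_witnesses, indent=1))
```

A real transformation would also have to decide what to do with corrections, abbreviations, punctuation and so on, which is why the choices depend so much on each project.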

### Other options?

1. HyperCollate? https://github.com/HuygensING/hyper-collate
2. The Java version of CollateX has limited support for XML - https://collatex.net/doc/#xml-input
3. Another collation tool...

## Recap and Exercise 
There are three input formats for the Python version of CollateX: plain text, simple JSON and tokenized JSON.
 
Plain text witnesses are read from text files with `read()`. They must be added to a `Collation()` instance before collating.
 
JSON inputs are loaded either from a string with `json.loads()` or from a file with `json.load()`. You do not need to add JSON inputs to a `Collation()` instance.
 
The tokenized JSON input can include additional information about tokens. This additional information can be accessed again from the JSON output.
 
Both plain text and simple JSON let you get the same outputs. In the case of tokenized JSON, you may need to use the option `segmentation=False`.
 
**Exercise:** start the notebook again and collate both the simple JSON and the tokenized JSON. When you feel confident, create your own code cell and try to do the following again:
 - load the Catullus tokenized JSON 
 - collate
 - try various combinations of options (output, layout, segmentation, indent)