# Working with (larger) files

In this tutorial you'll learn how to import files that are located elsewhere and collate them with CollateX. We will use a dataset placed in this binder, so you do not have to download anything. The witnesses are three editions of _To the Lighthouse_, a novel by Virginia Woolf first published in England in 1927.

The witnesses in this dataset are larger than those in the earlier tutorials, which gives you the chance to experiment with different visualisations. In the final part of the tutorial you'll learn how to save the HTML output visualisation that was created in the Jupyter notebook.

## Before you begin

Make sure that the code cells in this Jupyter notebook have no output yet. If they do, clear the output:  

Go to `Kernel` >> `Restart & Clear Output`. The notebook will ask you to confirm; once you do, it will restart and clear all output. This may take a little while.

## 1. Reading the files

First, we will create three variables that will serve as containers for the three witnesses of _To the Lighthouse_. We will give the three witnesses a sigil that corresponds to the edition they are taken from:  
1. TtL_UK = _To the Lighthouse_, the first UK edition (1927);
2. TtL_US = _To the Lighthouse_, the first US edition (1927);
3. TtL_EM = _To the Lighthouse_, the Everyman edition (1938).

Using a line of Python code, we will tell the computer where to find each file and to store its contents in a variable. In this case, we will point the computer to the folder in the binder. If you do not work in the binder or you'd like to work with other witnesses, make sure to adjust the path!

In [None]:
TtL_UK = open("../data/Woolf/Lighthouse-1/Lighthouse-1-UK.txt", encoding='utf-8').read()
TtL_US = open("../data/Woolf/Lighthouse-1/Lighthouse-1-USA.txt", encoding='utf-8').read()
TtL_EM = open("../data/Woolf/Lighthouse-1/Lighthouse-1-EM.txt", encoding='utf-8').read()
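The cell above leaves each file open until Python gets around to cleaning it up. A slightly more careful variant, sketched below, uses a `with` block so the file is closed as soon as it has been read. The helper name `read_witness` is made up for this illustration, and the demonstration writes a small temporary file so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def read_witness(path):
    """Return the full text of a UTF-8 encoded file, closing it afterwards."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Demonstration with a throwaway file (a stand-in, not part of the dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo-witness.txt"
    demo_path.write_text("A stand-in witness text.", encoding="utf-8")
    demo_text = read_witness(demo_path)

print(demo_text)
```

In the binder you would call it with one of the paths from the cell above, for example `TtL_UK = read_witness("../data/Woolf/Lighthouse-1/Lighthouse-1-UK.txt")`.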

If you didn't get any output, that's good: you didn't ask for it! Let's check whether the variables (the "empty containers") are no longer empty:

In [None]:
print(TtL_UK)

Great, that works! You see that the text of this witness is longer than what we have experimented with thus far.

If you like, you can check whether the other witnesses are also stored in their variables by changing the name in the parentheses that follow the `print` function: `print(TtL_US)` and `print(TtL_EM)`.
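Printing a whole witness fills the screen quickly. If you only want to confirm that a variable holds text, a short summary may be enough. The sketch below is one possible way to do that; the function name `summarize` and the placeholder string are illustrative assumptions, not part of the tutorial's dataset.

```python
def summarize(text, n_chars=60):
    """Return the character count, word count, and opening of a text."""
    return {
        "characters": len(text),
        "words": len(text.split()),  # words are split on whitespace
        "start": text[:n_chars],
    }


# A stand-in string; in the notebook you would pass TtL_UK, TtL_US or TtL_EM.
sample = "This placeholder sentence stands in for a much longer witness."
print(summarize(sample))
```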

## 2. Collating the files

Now that the text of each witness is stored in a variable, we can collate them.  

As per usual, we first check whether we still have the latest version of CollateX by running the code in the cell below:

In [None]:
!pip install --upgrade collatex
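If you are curious which version ended up installed, Python's standard library can report it. This is a small sketch rather than a required step; the helper name `installed_version` is an assumption made for this example, and it returns `None` when the package is not installed at all.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


collatex_version = installed_version("collatex")
print(collatex_version if collatex_version else "collatex is not installed")
```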

We then import CollateX, create a variable, and assign a collation object to it. Remember, the newly created collation object is empty until you add witnesses to it. You can picture it as an empty bucket to which you will add the witnesses one by one. You can name the variable/empty bucket as you like; here I've named it `Woolf_coll`.

In [None]:
from collatex import * 

In [None]:
Woolf_coll = Collation()

In [None]:
Woolf_coll.add_plain_witness("UK", TtL_UK)
Woolf_coll.add_plain_witness("US", TtL_US)
Woolf_coll.add_plain_witness("EM", TtL_EM)
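With three witnesses, three lines are easy enough to type. With more witnesses, a loop over a dictionary of sigla and texts may be tidier. The sketch below relies only on the `add_plain_witness` method used above; the helper name `add_witnesses` is hypothetical.

```python
def add_witnesses(collation, witnesses):
    """Add each (sigil, text) pair to a collation object, in dictionary order."""
    for sigil, text in witnesses.items():
        collation.add_plain_witness(sigil, text)
    return collation


# Possible use in this notebook, on a fresh Collation() object:
# add_witnesses(Collation(), {"UK": TtL_UK, "US": TtL_US, "EM": TtL_EM})
```

Python dictionaries keep their insertion order, so the witnesses are added in the order in which you list them.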

Note that in contrast to what you did in the earlier tutorials, you do not have to type out the whole text of each witness. Because we have just stored the text of the witnesses in variables, we can simply point to the variables!  

Now you can collate. Do play around with the different visualisation options (HTML alignment tables, variant graphs) and note that some visualisations work better than others.

In [None]:
collate(Woolf_coll, output='html', layout='vertical')

In [None]:
collate(Woolf_coll, output='html2')

In [None]:
collate(Woolf_coll, output='svg_simple')

## 3. More files!

You may have noticed there is another folder in the data set, called "Lighthouse-2". If you like, you can also experiment with those witnesses. Make sure to adjust the path so that it points to the right folder.
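If you do not want to type out every file name, you could read all text files in a folder at once. The sketch below assumes the witnesses are stored as `.txt` files, as they are in "Lighthouse-1"; the helper name `read_folder` is hypothetical, and the demonstration uses throwaway files rather than the real dataset.

```python
import tempfile
from pathlib import Path


def read_folder(folder, pattern="*.txt"):
    """Map each matching file's stem (name without extension) to its text."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob(pattern))
    }


# Demonstration with throwaway files (stand-ins, not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("witness-A", "witness-B"):
        (Path(tmp) / f"{name}.txt").write_text(f"Text of {name}.", encoding="utf-8")
    demo = read_folder(tmp)

print(list(demo))
```

In the binder you would pass the path of the folder you want to read. The keys of the resulting dictionary can then serve as sigla when you add the witnesses to a collation.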

## 4. Store the HTML visualisation

One of the great things about Jupyter notebooks is that they display the visualizations of the alignment tables and variant graphs directly in the notebook. However, in some cases you may want to save them. Unfortunately, that is not possible directly, although it *is* possible to capture the output of a Jupyter cell and then to store that capture on your computer.

### A few disclaimers
- First of all, this approach requires some more advanced Python code. If you don't understand everything, don't worry, as long as you understand the process.
- The `%%capture` command only works for the two HTML alignment table visualizations, not for the variant graph.
- The command is a workaround and may not be available in the future, for instance because Jupyter Notebook will have been updated. But by then you'll be so adroit with CollateX you won't be needing the Notebooks or the Binder anymore.

The capture command, `%%capture result`, needs to be the first line of the cell.

In [None]:
%%capture result 

collate(Woolf_coll, output='html', layout='vertical')

Let's see if that has worked, by asking to see the result we just captured:

In [None]:
result.show()

To be clear: we have now captured the collation output, visualized as an HTML table with vertical layout, in a variable called `result`. 

If you like, you can visualize the collation output as a colored HTML table (by changing the `output` parameter to `html2`) or in a horizontal layout (by removing the parameter `layout='vertical'`). You can also experiment with turning off the segmentation, using the parameter `segmentation=False`.

Now let's take the HTML part of the captured output and store it in a variable, which we'll call `html`, by running the following code:

In [None]:
html = result.outputs[0].data["text/html"]
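The line above assumes that the first captured output contains HTML. Each captured output carries a `data` dictionary keyed by format name, such as `"text/html"`. If you are unsure which output holds the table, a small loop can look for it. This is a defensive sketch: the helper name `first_html` is hypothetical, and the demonstration uses stand-in objects that mimic the shape of a captured result instead of a real one.

```python
from types import SimpleNamespace


def first_html(captured):
    """Return the first text/html payload among captured outputs, or None."""
    for output in captured.outputs:
        data = getattr(output, "data", None) or {}
        if "text/html" in data:
            return data["text/html"]
    return None


# Stand-in for a captured result: one plain-text output, then one HTML output.
fake_result = SimpleNamespace(
    outputs=[
        SimpleNamespace(data={"text/plain": "not a table"}),
        SimpleNamespace(data={"text/html": "<table></table>"}),
    ]
)
print(first_html(fake_result))
```

In the notebook you would call `first_html(result)` with the variable created by `%%capture result`.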

The final step is to save the table in a file:

In [None]:
with open('collatex_html_result2.html', 'w', encoding="utf-8") as file:
    file.write(html)

The file is saved in the same location as your notebook. If you work in a binder, it will be stored there. Note that when you close the binder, you will lose all your edits! So make sure to download your work as well.
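To see exactly where a file ended up, you can ask Python for its absolute path after writing it. The sketch below wraps the saving step in a helper; the name `save_html` is hypothetical, and the demonstration writes a stand-in table to a temporary folder and reads it back.

```python
import tempfile
from pathlib import Path


def save_html(html, path):
    """Write an HTML string to path as UTF-8 and return the absolute path."""
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target.resolve()


# Demonstration in a temporary folder with a stand-in table.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_html("<table><tr><td>demo</td></tr></table>", Path(tmp) / "demo.html")
    print(saved)
    round_trip = saved.read_text(encoding="utf-8")
```

In the notebook, `save_html(html, "collatex_html_result2.html")` would write the captured table next to the notebook and return the full path to the file.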