# Text Processing

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Text-Processing" data-toc-modified-id="Text-Processing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Text Processing</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li></ul></li><li><span><a href="#Preparations" data-toc-modified-id="Preparations-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparations</a></span></li><li><span><a href="#TF/IDF-" data-toc-modified-id="TF/IDF--3"><span class="toc-item-num">3&nbsp;&nbsp;</span>TF/IDF <a name="tfidf"></a></a></span></li><li><span><a href="#Translations?" data-toc-modified-id="Translations?-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Translations?</a></span></li><li><span><a href="#Similarity-" data-toc-modified-id="Similarity--5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Similarity <a name="DocumentSimilarity"></a></a></span></li><li><span><a href="#Word-Clouds-" data-toc-modified-id="Word-Clouds--6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Word Clouds <a name="WordClouds"></a></a></span></li></ul></div>

## Introduction

This file is the continuation of preceding work. Previously, I have worked my way through a couple of text-analysing approaches - such as tf/idf frequencies, n-grams and the like - in the context of a project concerned with Juan de Solórzano Pereira's *Politica Indiana*. This can be seen [here](TextProcessing_Solorzano.ipynb).

In the former context, I got somewhat stuck when I was trying to automatically align corresponding passages of two editions of the same work ... where the one edition would be a **translation** of the other and thus we would have two different languages.

The present file takes this up, tries to refine an approach taken there and to find alternative ways of analysing a text across several languages. This time, the work concerned is Martín de Azpilcueta's *Manual de confesores*, a work of the 16th century that has seen very many editions and translations, quite a few of them even by the work's original author. It is the subject of the research project ["Martín de Azpilcueta’s Manual for Confessors and the Phenomenon of Epitomisation"](http://www.rg.mpg.de/research/martin-de-azpilcuetas-manual-for-confessors) by Manuela Bragagnolo.

(There are a few DH-ey things about the project that are not directly of concern here, like a synoptic display of several editions or the presentation of the divergence of many actual translations of a given term. Such aspects are being treated with other software, like [HyperMachiavel](http://hyperprince.ens-lyon.fr/hypermachiavel) or [Lera](http://lera.uzi.uni-halle.de/).)

As in the previous case, the programming language used in the following examples is called "python" and the tool used to get prose discussion and code samples together is called ["jupyter"](http://jupyter.org/). (A common way of installing both the language and the jupyter software, especially on Windows, is by installing a python "distribution" like [Anaconda](https://www.anaconda.com/what-is-anaconda/).) In jupyter, you have a "notebook" that you can populate with text (jupyter understands [markdown](http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html) formatting, if you want to use it) or with code, and a program that pipes a nice rendering of the notebook to a web browser, which is what you are reading right now. In many places in such a notebook, the output that the code samples produce is printed right below the code itself. Sometimes this can be quite a lot of output and, depending on your viewing environment, you might have to scroll quite some way to get to the continuation of the discussion.

You can save your notebook online (the current one is [here at github](https://github.com/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb)) and there is an online service, nbviewer, able to render any notebook that it can access online. So chances are you are reading this present notebook at the web address [https://nbviewer.jupyter.org/github/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb](https://nbviewer.jupyter.org/github/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb).

A final word about the elements of this notebook:

<div class="alert alertbox alert-success">At some points I am mentioning things I consider to be important decisions or take-away messages for scholarly readers. E.g. whether or not to insert certain artefacts into the very transcription of your text, what the methodological ramifications of a certain approach or parameter are, what the implications of an example solution are, or what a possible interpretation of a certain result might be. I am highlighting these things in a block like this one here or at least in <font color="green">**green bold font**</font>.</div>

<div class="alert alertbox alert-danger">**NOTE:** As I am continually improving the notebook on the side of the source text, wordlists and other parameters, it is sometimes hard to keep the prose description in sync. So while the actual descriptions still apply, the numbers that are mentioned in the prose (e.g. where we have a "table with 20 rows and 1.672 columns") might no longer reflect the latest state of the sources, auxiliary files and parameters, and you should take these with a grain of salt. Best double-check them by reading the actual code ;-)

I apologize for the inconsistency.</div>

# Preparations

Unlike in the previous case, where we had word files that we could export as plaintext, in this case Manuela has prepared a sample chapter with four editions transcribed *in parallel* in an office spreadsheet. So we first of all make sure that we have good **UTF-8** comma-separated-value files, e.g. by uploading a **csv** export of our office program of choice to [a CSV Linting service](https://csvlint.io/). (As a side remark, in my case, exporting with LibreOffice provided me with options to select UTF-8 encoding and choose the field delimiter and resulted in a valid csv file. MS Excel did neither of those.) We expect the file at the following position:

In [1]:
sourcePath = 'Azpilcueta/cap6_align_-_2018-01.csv'
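
Before opening the file, a quick local check can complement the online linting service. The following is only a sketch: the helper `check_csv_bytes` and its `expected_columns` parameter are illustrative names of my own, not part of any library, and the sample data is made up. It merely verifies that the bytes decode as UTF-8 and that every row has the expected number of fields.

```python
import csv
import io

def check_csv_bytes(raw, expected_columns):
    """Return a list of problems found in raw CSV bytes (an empty list means it looks fine)."""
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError as err:
        return ["not valid UTF-8: " + str(err)]
    problems = []
    # newline='' lets the csv module itself handle line breaks inside quoted fields
    reader = csv.reader(io.StringIO(text, newline=''))
    for row in reader:
        if len(row) != expected_columns:
            problems.append("line " + str(reader.line_num) + ": " + str(len(row)) + " fields")
    return problems

# Made-up sample with two columns
sample = "edition A,edition B\n¶ Capitolo VI.,¶ Capitulo VI.\n".encode('utf-8')
print(check_csv_bytes(sample, expected_columns=2))
```

For the real file, one could pass `open(sourcePath, 'rb').read()` and `expected_columns=4`.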

Then, we can go ahead and open the file:

In [2]:
import csv

sourceFile = open(sourcePath, newline='', encoding='utf-8')
sourceTable = csv.reader(sourceFile)

And we read each line into new elements of four respective arrays (since we're dealing with one sample chapter, we try to handle it all in memory first and see if we run into problems):

*(Note here and in the following that in most cases, when the program is counting, it does so beginning with zero. Which means that if we end up with 20 segments, they are going to be called segment 0, segment 1, ..., segment 19. There is not going to be a segment bearing the number twenty, although we do have twenty segments. The first one has the number zero and the twentieth one has the number nineteen. Even for more experienced coders, this sometimes leads to mistakes, called "off-by-one errors".)*
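
The zero-based counting described in this note can be tried out directly. This is a minimal sketch with twenty made-up placeholder segments, not the actual chapter:

```python
# Twenty placeholder segments, numbered by the program from 0 to 19
segments = ["segment text " + str(n) for n in range(20)]

print(len(segments))   # the number of segments
print(segments[0])     # the first segment has the number zero
print(segments[19])    # the twentieth (and last) segment has the number nineteen

# Asking for "segment 20" is the classic off-by-one error
try:
    segments[20]
except IndexError as err:
    print("There is no segment 20:", err)
```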

In [3]:
# Initialize a two-dimensional array with four rows, one per edition
Editions = [[] for _ in range(4)]

# Now populate it from our sourceTable: the i-th field of each row
# goes to the i-th edition (any columns beyond the fourth are ignored)
for row in sourceTable:
    for i, field in enumerate(row[:4]):
        Editions[i].append(field)

print(str(sourceTable.line_num) + " rows read.")
sourceFile.close()   # everything is in memory now

# As an example, see the first seven segments of the second edition:
for field in range(7):
    print(Editions[1][field])

41 rows read.
1552 por
¶ Capitolo VI. Das circunstancias.


1.     [1] Pera fundamento disto diremos : lho primeiro, que a circumstancia do peccado, segundo a mente de S. Tho P & outros, he hum accidente daquilo, que he peccado. Dissemos (he accidente) porque nenhũa circũstãcia da obra, he a substancia della. Dissemos (da quilo que he peccado) & nã do peccado : porque muytas vezes a obra em si não he peccado, & pola circũ se faz peccado : & como então ella he aquilo, em que consiste ho peccado, não he tãto accidente do peccado, quanto da quilo que he peccado : segundo que ho declaramos em outra parteq (q. in d. c. Consideret n. 3), seguindo a Alex. de Ales r (r in. 4 pt. q. 77 ar.z. co.la.z).
2.     ¶ [2] Ho II. Que a circunstancia se parte em sete species, que se conte nem aquelle verso : Quis, quid, ubi, quibus auxiliis, cur, quomodo, quando : Referido por S. Tho.s (in d. q. 7. ar 3) . Quem, Que, Onde, Com que ajudas, Porque, Em que maneira, Quando. O qual verso temos por melhor, que

# TF/IDF <a name="tfidf"></a>

In the previous (i.e. Solórzano) analyses, things like tokenization, lemmatization and stop-word filtering are explained step by step. Here, we rely on what we have found there and feed it all into functions that are ready-made and available in suitable libraries...

First, we build our lemmatization resource and "function":

In [4]:
lemma = {}    # we build a so-called dictionary for the lookups

wordfile_path = 'Azpilcueta/wordforms-lat-full.txt'

# open the wordfile (defined above) for reading; the "with" block
# closes it again when we are done
with open(wordfile_path, encoding='utf-8') as wordfile:
    for line in wordfile:
        if '>' not in line:
            continue                # skip lines that do not hold a "wordform > lemma" pair
        # we split each line at the first ">" into wordform and lemma, ...
        wordform, _, baseform = line.partition('>')
        # ... strip whitespace and add a key-value pair to our "lemma" dictionary
        lemma[wordform.strip()] = baseform.strip()

print(str(len(lemma)) + ' wordforms known to the system.')

2131211 wordforms known to the system.


Again, a quick test: Let's see with which "lemma"/basic word the particular wordform "fidem" is associated, or, in other words, what *value* our lemma variable returns when we query for the *key* "fidem":

In [5]:
lemma['fidem']

'fides'
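
What should happen with wordforms that are *not* in the list? The lemmatising function further below simply keeps such a wordform as it is, only lowercased. A minimal sketch of that fallback, using a toy dictionary and a helper name (`lookup`) that are illustrative only:

```python
# Toy entries for illustration only; the real list is read from the wordforms file
toy_lemma = {'fidem': 'fides', 'fidei': 'fides'}

def lookup(wordform, table):
    # dict.get returns its second argument when the key is missing,
    # so unknown wordforms are passed through unchanged
    return table.get(wordform, wordform).lower()

print(lookup('fidem', toy_lemma))       # known wordform
print(lookup('Quebrantar', toy_lemma))  # unknown wordform, only lowercased
```

Note that the lookup itself is case-sensitive: a capitalised form is only mapped to a lemma if the list contains it in exactly that spelling.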

And we are going to need the stopwords list:

In [6]:
stopwords_path = 'Azpilcueta/stopwords-lat.txt'
stopwords = open(stopwords_path, encoding='utf-8').read().splitlines()

print(str(len(stopwords)) + ' stopwords known to the system, e.g.: ' + str(stopwords[95:170]))

389 stopwords known to the system, e.g.: ['a', 'ab', 'ac', 'ad', 'adhic', 'adhuc', 'ae', 'ait', 'ali', 'alii', 'aliis', 'alio', 'aliqua', 'aliqui', 'aliquid', 'aliquis', 'aliquo', 'am', 'an', 'ante', 'apud', 'ar', 'at', 'atque', 'au', 'aut', 'autem', 'bus', 'c', 'ca', 'cap', 'ceptum', 'co', 'con', 'cons', 'cui', 'cum', 'cur', 'cùm', 'd', 'da', 'de', 'deinde', 'detur', 'di', 'diu', 'do', 'dum', 'e', 'ea', 'eadem', 'ec', 'eccle', 'ego', 'ei', 'eis', 'eius', 'el', 'em', 'en', 'enim', 'eo', 'eos', 'er', 'erat', 'ergo', 'erit', 'es', 'esse', 'essent', 'esset', 'est', 'et', 'etenim', 'eti']


(In contrast to simpler numbers that have been filtered out by our stopwords filter, I have left numbers representing years like "1610" in place.)

Next, we should find some very characteristic words for each segment for each edition. (Let's say we are looking for the "Top 20".) We should build a vocabulary for each edition individually and only afterwards work towards a common vocabulary of several "Top n" sets.
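
To make the idea of "characteristic words" concrete, here is a small self-contained sketch that does not depend on the libraries used in this notebook. It assumes raw term counts multiplied by a smoothed inverse document frequency, followed by L2 normalisation; as far as I can tell, this corresponds to the documented defaults of scikit-learn's vectorizer, but the function `top_terms` and the toy segments are purely illustrative.

```python
import math
from collections import Counter

def top_terms(segments, n):
    """For each tokenised segment, return its n highest-weighted (term, weight) pairs."""
    n_docs = len(segments)
    # document frequency: in how many segments does each term occur?
    df = Counter(term for seg in segments for term in set(seg))
    result = []
    for seg in segments:
        tf = Counter(seg)
        # smoothed idf: ln((1 + n_docs) / (1 + df)) + 1
        weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        # L2 normalisation, so that long and short segments become comparable
        norm = math.sqrt(sum(w * w for w in weights.values()))
        ranked = sorted(((t, w / norm) for t, w in weights.items()),
                        key=lambda pair: pair[1], reverse=True)
        result.append(ranked[:n])
    return result

# Toy corpus: "dies" occurs everywhere and is therefore not very characteristic
toy_segments = [
    ['dies', 'ieiunium', 'oratio'],
    ['dies', 'peccatum', 'peccatum'],
    ['dies', 'confessio'],
]
for ranked in top_terms(toy_segments, 2):
    print(ranked)
```

A word that occurs in every segment gets the lowest possible idf, so it only makes it into a "Top n" list when a segment has little else to offer.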

In [7]:
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

numTopTerms = 20

# So first we build a tokenising and lemmatising function to work as an input filter
# to the TfidfVectorizer function
def ourLemmatiser(str_input):
    wordforms = re.split(r'\W+', str_input)
    return [lemma[wordform].lower().strip() if wordform in lemma else wordform.lower().strip() for wordform in wordforms ]

# !!!!
# TODO: The above pipes all the tokens through the latin lemmatizer.
# We should lemmatize Spanish and Portuguese differently!
# !!!!

topTerms = []
for i in range(4):

    topTermsEd = []
    # Initialize the library's function, specifying our
    # tokenizing function from above and our stopwords list.
    tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords, use_idf=True, tokenizer=ourLemmatiser, norm='l2')

    # Finally, we feed our corpus to the function to build a new "tfidf_matrix" object
    tfidf_matrix = tfidf_vectorizer.fit_transform(Editions[i])

    # convert your matrix to an array to loop over it
    mx_array = tfidf_matrix.toarray()

    # get your feature names (scikit-learn >= 1.0; older versions call this get_feature_names())
    fn = tfidf_vectorizer.get_feature_names_out()

    # now loop through all segments and get the respective top n words.
    for pos, j in enumerate(mx_array):
        # We have empty segments, i.e. none of the words in our vocabulary has any tf/idf score > 0
        if j.max() == 0:
            topTermsEd.append([("", 0)])
        # otherwise append (present) lemmatised words until numTopTerms or the number of words (-stopwords) is reached
        else:
            # ourLemmatiser returns a list, so we test each of its tokens against the stopwords
            numWords = len([w for w in ourLemmatiser(Editions[i][pos]) if w and w not in stopwords])
            ranked = [(fn[x], j[x]) for x in (-j).argsort() if j[x] > 0]
            topTermsEd.append(ranked[:min(numTopTerms, numWords)])
    topTerms.append(topTermsEd)

# Translations?

Maybe there is an approach to inter-lingual comparison after all. After a first unsuccessful try with [conceptnet.io](http://conceptnet.io), I next want to try [Babelnet](http://babelnet.org) in order to look up synonyms, related terms and translations. I still have to study the [API](http://babelnet.org/guide)...



For example, let's take this single segment 19:

In [8]:
startEd = 3
segment_no = 19 

print("Comparing words from segments " + str(segment_no) + " ...")

print(" ")
print("Here is the segment in the four editions:")
print(" ")
for i in range(4):
    print("Ed. " + str(i) + ":")
    print("------")
    print(Editions[i][segment_no])
    print(" ")

print(" ")
print(" ")

# Build List of most significant words for a segment

print("Most significant words in the segment:")
print(" ")
for i in range(4):
    print("Ed. " + str(i) + ":")
    print("------")
    print(topTerms[i][segment_no])
    print(" ")

Comparing words from segments 19 ...
 
Here is the segment in the four editions:
 
Ed. 0:
------
¶A circunstancia do dia deputado a jejuum, ou oraçam : nam he de necessidade confessata : salvo quando fizesse peccado com proposito de ho quebrantar como acima do dia da festa. Segundo Navarro, ubi supra.
 
Ed. 1:
------
 ¶ A IX que a circunstancia do dia de jejuũ, ou de oração, não se ha de confessar necessariamente, se não quando se pecca com proposito de ho quebrãtar : porque nã faz algũa das ditas tres cousas, segũdo em outra parte ho provamoss (s : f. in d. c. Consideret n. 32 vers. sic. Ad primum) 
 
Ed. 2:
------
17 ¶  El X que la circunstancia del dia de ayuno, o de oracion, no se ha de confessar necessariamente, fino quando se peca con proposito delo quebrantar, por ello por que no haze alguna delas dichas tres cosas, segun lo provamos alibim (m : in d. c. Consideret nu. 32 ver. Ad primum) 
 
Ed. 3:
------
17Decimo. Quod circunstantia diei ieiunio vel orationi consecrati, licet vi

Now we look up the "concepts" associated to those words in babelnet. Then we look up the concepts associated with the words of the present segment from another edition/language, and see if the concepts are the same.
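
The comparison itself boils down to set intersection. Here is a minimal sketch of it that works without any network access: the synset IDs are short placeholders rather than real Babelnet lookups, and the helper `shared_concepts` is an illustrative name of my own.

```python
def shared_concepts(concepts_a, concepts_b):
    """Map each (word_a, word_b) pair to the set of synset IDs both words have in common."""
    pairs = {}
    for word_a, ids_a in concepts_a.items():
        for word_b, ids_b in concepts_b.items():
            common = set(ids_a) & set(ids_b)
            if common:
                pairs[(word_a, word_b)] = common
    return pairs

# Toy data: the IDs are placeholders, not results of real lookups
latin = {'dies': ['bn:1', 'bn:2'], 'ieiunium': ['bn:3']}
spanish = {'dia': ['bn:2', 'bn:1', 'bn:9'], 'ayuno': ['bn:3'], 'fino': ['bn:7']}

for pair, common in sorted(shared_concepts(latin, spanish).items()):
    print(pair, sorted(common))
```

Keeping track of *which* word pair shares a concept, rather than only pooling all IDs of a segment, would later allow ranking the pairs, e.g. by the number of shared concepts.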

In [9]:
import urllib.request
import urllib.parse
import json
from collections import defaultdict

babelAPIKey = '18546fd3-8999-43db-ac31-dc113506f825'
babelGetSynsetIdsURL = "https://babelnet.io/v4/getSynsetIds?" + \
                       "langs=LA&langs=ES&langs=PT" + \
                       "&filterLangs=LA&filterLangs=ES&filterLangs=PT" + \
                       "&key=" + babelAPIKey + \
                       "&word="

def getConceptIDs(terms):
    """Look up the possible synset IDs for each word in a list of (word, weight) pairs."""
    possible_conceptIDs = defaultdict(list)
    for (word, val) in terms:
        if not word:
            continue        # empty segments have no words to look up
        # words may contain non-ASCII characters, so we percent-encode them
        concepts_uri = babelGetSynsetIdsURL + urllib.parse.quote(word)
        response = urllib.request.urlopen(concepts_uri)
        conceptIDs = json.loads(response.read().decode(response.info().get_param('charset') or 'utf-8'))
        for rel in conceptIDs:
            possible_conceptIDs[word].append(rel.get("id"))
    return possible_conceptIDs

def printConceptIDs(language, possible_conceptIDs):
    print(" ")
    print("For each of the " + language + " words, here are possible synsets:")
    print(" ")
    for word in possible_conceptIDs:
        print(word + ":" + " " + ', '.join(possible_conceptIDs[word]))
        print(" ")

# Build lists of possible concepts for the most significant words of the start edition (Latin) ...
top_possible_conceptIDs = getConceptIDs(topTerms[startEd][segment_no])
printConceptIDs("latin", top_possible_conceptIDs)

print(" ")
print(" ")
print(" ")
# ... and for the most significant words in the second language (Spanish, edition 2)
top_possible_conceptIDs_2 = getConceptIDs(topTerms[2][segment_no])
printConceptIDs("Spanish", top_possible_conceptIDs_2)

 
For each of the latin words, here are possible synsets:
 
dies: bn:00025419n, bn:00025341n, bn:00000086n, bn:00025422n, bn:00025417n
 
oratio: bn:00027523n, bn:16473014n
 
ieiunium: bn:00033737n
 
novo: bn:00105752a, bn:00107274a, bn:00107275a, bn:00109521a, bn:00107267a, bn:00103327a, bn:00103328a, bn:02989482n, bn:06978195n
 
necessitas: bn:03277980n, bn:00046093n
 
nu: bn:00058225n, bn:00058342n, bn:00351758n, bn:00058316n, bn:10185248n, bn:08302435n, bn:03284263n, bn:00097745a, bn:03295401n, bn:06965733n, bn:00056748n, bn:00078931n, bn:14479066n, bn:00056688n
 
video: bn:00079978n, bn:00062276n, bn:00079972n, bn:08263856n, bn:00093430v, bn:00091096v, bn:00076373n, bn:00085496v, bn:00093435v, bn:08263853n, bn:00085652v, bn:00082727v, bn:00092640v
 
dico: bn:00082800v, bn:00093292v, bn:00093287v
 
probo: bn:00086567v, bn:03617190n
 
 
 
 
 
For each of the Spanish words, here are possible synsets:
 
fino: bn:00105687a, bn:00100876a, bn:00102975a, bn:00098683a, bn:00034612n, bn:0009

In [10]:
# calculate number of overlapping terms
values_a = {item for sublist in top_possible_conceptIDs.values() for item in sublist}
values_b = {item for sublist in top_possible_conceptIDs_2.values() for item in sublist}
overlaps = values_a & values_b
print("Overlaps:")

babelGetSynsetInfoURL = "https://babelnet.io/v4/getSynset?key=" + babelAPIKey + \
                        "&filterLangs=LA&filterLangs=ES&filterLangs=PT" + \
                        "&id="

for c in overlaps:
    info_uri = babelGetSynsetInfoURL + c
    response = urllib.request.urlopen(info_uri)
    words = json.loads(response.read().decode(response.info().get_param('charset') or 'utf-8'))

    senses = words['senses']
    for result in senses[:1]:
        # NB: we must not call this variable "lemma", or we would overwrite
        # our lemmatisation dictionary from above
        sense_lemma = result.get('lemma')
        language = result.get('language')
        print(c + ": " + sense_lemma + " (" + language.lower() + ")")

# TODO: do a nifty ranking

Overlaps:
bn:00025422n: día (es)
bn:00093430v: ver (es)
bn:00000086n: día (es)
bn:00085496v: percibir (es)
bn:00025419n: día (es)
bn:00093435v: ver (es)
bn:00025417n: día (es)
bn:00033737n: ayuno (es)
bn:00091096v: notar (es)
bn:00085652v: considerar (es)


# Similarity <a name="DocumentSimilarity"></a>

It seems we could now create another matrix replacing lemmata with concepts and retaining the tf/idf values (so as to keep a weight coefficient to the concepts). Then we should be able to calculate similarity measures across the same concepts...
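
As a thought experiment, the replacement of lemmata by concepts could look like the following sketch. It rests on an assumption of mine that would need scholarly scrutiny: a lemma with several candidate concepts spreads its tf/idf weight evenly over all of them, and lemmata without any known concept are dropped. The function name `concept_vector`, the weights and the placeholder IDs are illustrative only.

```python
from collections import defaultdict

def concept_vector(term_weights, term_concepts):
    """Turn {lemma: tf/idf weight} into {concept ID: accumulated weight}."""
    vector = defaultdict(float)
    for term, weight in term_weights.items():
        concepts = term_concepts.get(term, [])
        for concept in concepts:
            # assumption: an ambiguous lemma spreads its weight evenly
            # over its candidate concepts
            vector[concept] += weight / len(concepts)
    return dict(vector)

# Toy data: weights and IDs are placeholders
weights = {'dies': 0.5, 'ieiunium': 0.25, 'nu': 0.25}
concepts = {'dies': ['bn:1', 'bn:2'], 'ieiunium': ['bn:3']}
print(concept_vector(weights, concepts))
```

Two segments in different languages would then live in the same "concept space", whatever the words they actually use.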

The approach to choose would probably be the "cosine similarity" of concept vector spaces. Again, there is a library ready for us to use (but you can find some documentation [here](http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/), [here](http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) and [here](https://en.wikipedia.org/wiki/Cosine_similarity).)
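
What the library calculates can be spelled out in a few lines. The following sketch is a plain-python version for sparse vectors given as dictionaries; the function name `cosine` and the toy vectors are illustrative, and the convention of returning 0 for empty vectors is my own choice.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as {dimension: weight} dictionaries."""
    # dot product over the dimensions the two vectors share
    dot = sum(w * vec_b.get(k, 0.0) for k, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0      # convention chosen here: empty vectors are similar to nothing
    return dot / (norm_a * norm_b)

print(cosine({'x': 1.0}, {'x': 2.0}))            # same direction
print(cosine({'x': 1.0}, {'y': 1.0}))            # nothing in common
print(cosine({'x': 1.0, 'y': 1.0}, {'x': 1.0}))  # partial overlap
```

The measure only looks at the direction of the vectors, not at their length, which is why a short and a long segment with the same vocabulary profile can come out as very similar.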

**However, this is where I have to take a break now. I will return to here soon...**

In [11]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# NB: tfidf_matrix still holds the values of the edition processed last in the loop above
sim = cosine_similarity(tfidf_matrix)
np.fill_diagonal(sim, 0)   # Suppress a document's similarity to itself
similarities = pd.DataFrame(sim)
print("Pairwise similarities:")
print(similarities)

Pairwise similarities:
     0         1    2         3         4         5         6         7   \
0   0.0  0.000000  0.0  0.000000  0.000000  0.000000  0.000000  0.000000   
1   0.0  0.000000  0.0  0.021730  0.051073  0.009358  0.051261  0.013045   
2   0.0  0.000000  0.0  0.000000  0.000000  0.000000  0.000000  0.000000   
3   0.0  0.021730  0.0  0.000000  0.128698  0.201813  0.205456  0.066350   
4   0.0  0.051073  0.0  0.128698  0.000000  0.148800  0.176078  0.045951   
5   0.0  0.009358  0.0  0.201813  0.148800  0.000000  0.100226  0.039999   
6   0.0  0.051261  0.0  0.205456  0.176078  0.100226  0.000000  0.139324   
7   0.0  0.013045  0.0  0.066350  0.045951  0.039999  0.139324  0.000000   
8   0.0  0.013191  0.0  0.086402  0.041347  0.083278  0.043033  0.049024   
9   0.0  0.000000  0.0  0.155631  0.064009  0.079912  0.165588  0.063541   
10  0.0  0.000000  0.0  0.123848  0.070881  0.070336  0.197763  0.081927   
11  0.0  0.030417  0.0  0.098265  0.090397  0.039227  0.187536  0

In [12]:
import numpy as np

# find the position (row, column) of the highest value in the similarity table
seg_a, seg_b = np.unravel_index(similarities.values.argmax(), similarities.shape)
print("The two most similar segments in the corpus are")
print("segments", seg_a, "and", seg_b, ".")
print("They have a similarity score of")
print(similarities.values.max())

The two most similar segments in the corpus are
segments 37 and 38 .
They have a similarity score of
0.371503404623


<div class="alert alertbox alert-success">Of course, in every set of documents, we will always find two that are similar in the sense of them being more similar to each other than to the other ones. Whether or not this actually *means* anything in terms of content is still up to scholarly interpretation. But at least it means that a scholar can look at the two documents and when she determines that they are not so similar after all, then perhaps there is something interesting to say about similar vocabulary used for different purposes. Or the other way round: When the scholar knows that two passages are similar, but they have a low "similarity score", shouldn't that say something about the text's rhetoric?</div>

# Word Clouds <a name="WordClouds"></a>

We can use a library that takes word frequencies like above, calculates corresponding relative sizes of words and creates nice wordcloud images for our sections (again, taking the fourth segment as an example) like this:
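
The conversion from tf/idf weights to integer "pseudo-frequencies" is a simple scaling step. Here is a minimal sketch of it, independent of the wordcloud library; the helper name `pseudo_frequencies`, the scaling factor as a parameter and the toy values are illustrative, and dropping words that end up with a frequency of zero is my own addition.

```python
def pseudo_frequencies(terms, weights, scale=100000):
    """Scale tf/idf weights to integers and drop words that end up with zero."""
    scaled = ((t, int(round(w * scale))) for t, w in zip(terms, weights))
    return {t: f for t, f in scaled if f > 0}

# Toy values: "ieiunium" does not occur in this imaginary segment
print(pseudo_frequencies(['dies', 'ieiunium', 'oratio'], [0.5, 0.0, 0.25]))
```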

In [13]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# We make pairs of (lemma, tf/idf score) for one of our segments (segment 3 of the
# edition processed last, whose values mx_array and fn still hold).
# But we have to convert our tf/idf weights to pseudo-frequencies (i.e. integer numbers)
frq = [ int(round(x * 100000, 0)) for x in mx_array[3]]
freq = dict(zip(fn, frq))

wc = WordCloud(background_color=None, mode="RGBA", max_font_size=40, relative_scaling=1).fit_words(freq)

# Now show/plot the wordcloud
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

ModuleNotFoundError: No module named 'wordcloud'

In order to have a nicer overview over the many segments than is possible in this notebook, let's create a new html file listing some of the characteristics that we have found so far...

In [None]:
import re

outputDir = "Azpilcueta"
htmlfile = open(outputDir + '/Overview.html', encoding='utf-8', mode='w')

# Write the html header and the opening of a layout table
htmlfile.write("""<!DOCTYPE html>
<html>
    <head>
        <title>Section Characteristics</title>
        <meta charset="utf-8"/>
    </head>
    <body>
        <table>
""")

# NB: mx_array and fn still hold the values of the edition processed last (edition 3).
# "label" and "tokenised" are not defined in this notebook, so as a stand-in
# we only report the segment number and a simple word count.
# For each segment, create a wordcloud and write it along with
# this information into a new row of the html table
for i in range(len(mx_array)):
    # Segments without any weighted words cannot be drawn
    if mx_array[i].max() == 0:
        continue
    # this is like above in the single-segment example...
    freq = dict(zip(fn, [ int(round(x * 100000, 0)) for x in mx_array[i]]))
    wc = WordCloud(background_color=None, mode="RGBA", \
                   max_font_size=40, min_font_size=10, \
                   max_words=60, relative_scaling=0.8).fit_words(freq)
    # We write the wordcloud image to a file
    wc.to_file(outputDir + '/wc_' + str(i) + '.png')
    numWords = len([w for w in re.split(r'\W+', Editions[3][i]) if w])
    # Finally we write the table row
    htmlfile.write("""
            <tr>
                <td>
                    <b>Section {a}</b><br/>
                    <img src="./wc_{a}.png"/><br/>
                    <small><i>length: {c} words</i></small>
                </td>
            </tr>
            <tr><td>&nbsp;</td></tr>
""".format(a = str(i), c = numWords))

# And then we write the end of the html file.
htmlfile.write("""
        </table>
    </body>
</html>
""")
htmlfile.close()

This should have created a nice html file which we can open [here](./Azpilcueta/Overview.html).