# Text Processing

# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Text-Processing" data-toc-modified-id="Text-Processing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Text Processing</a></div><div class="lev2 toc-item"><a href="#Introduction" data-toc-modified-id="Introduction-11"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></div><div class="lev1 toc-item"><a href="#Preparations" data-toc-modified-id="Preparations-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparations</a></div><div class="lev1 toc-item"><a href="#TF/IDF-" data-toc-modified-id="TF/IDF--3"><span class="toc-item-num">3&nbsp;&nbsp;</span>TF/IDF </a></div><div class="lev1 toc-item"><a href="#Translations?" data-toc-modified-id="Translations?-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Translations?</a></div><div class="lev1 toc-item"><a href="#Similarity-" data-toc-modified-id="Similarity--5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Similarity </a></div><div class="lev1 toc-item"><a href="#Word-Clouds-" data-toc-modified-id="Word-Clouds--6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Word Clouds </a></div>

## Introduction

This file is the continuation of preceding work. Previously, I have worked my way through a couple of text-analysing approaches - such as tf/idf frequencies, n-grams and the like - in the context of a project concerned with Juan de Solórzano Pereira's *Politica Indiana*. This can be seen [here](TextProcessing_Solorzano.ipynb).

In the former context, I got somewhat stuck when I was trying to automatically align corresponding passages of two editions of the same work ... where the one edition would be a **translation** of the other and thus we would have two different languages. In vector terminology, two languages means two almost orthogonal vectors and it makes little sense to search for similarities there.

The present file takes this up, tries to refine an approach taken there and to find alternative ways of analysing a text across several languages. This time, the work concerned is Martín de Azpilcueta's *Manual de confesores*, a work of the 16th century that has seen very many editions and translations, quite a few of them even by the work's original author and it is the subject of the research project ["Martín de Azpilcueta’s Manual for Confessors and the Phenomenon of Epitomisation"](http://www.rg.mpg.de/research/martin-de-azpilcuetas-manual-for-confessors) by Manuela Bragagnolo. 

(There are a few DH-ey things about the project that are not directly of concern here, like a synoptic display of several editions or the presentation of the divergence of many actual translations of a given term. Such aspects are being treated with other software, like [HyperMachiavel](http://hyperprince.ens-lyon.fr/hypermachiavel) or [Lera](http://lera.uzi.uni-halle.de/).)

As in the previous case, the programming language used in the following examples is "python" and the tool used to get prose discussion and code samples together is called ["jupyter"](http://jupyter.org/). (A common way of installing both the language and the jupyter software, especially in windows, is by installing a python "distribution" like [Anaconda](https://www.anaconda.com/what-is-anaconda/).) In jupyter, you have a "notebook" that you can populate with text (if you want to use it, jupyter understands [markdown](http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html) code formatting) or code, and a program that pipes a nice rendering of the notebook to a web browser as you are reading right now. In many places in such a notebook, the output that the code samples produce is printed right below the code itself. Sometimes this can be quite a lot of output and depending on your viewing environment you might have to scroll quite some way to get to the continuation of the discussion.

You can save your notebook online (the current one is [here at github](https://github.com/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb)) and there is an online service, nbviewer, able to render any notebook that it can access online. So chances are you are reading this present notebook at the web address [https://nbviewer.jupyter.org/github/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb](https://nbviewer.jupyter.org/github/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb).

A final word about the elements of this notebook:

<div class="alert alertbox alert-success">At some points I am mentioning things I consider to be important decisions or take-away messages for scholarly readers. E.g. whether or not to insert certain artefacts into the very transcription of your text, what the methodological ramifications of a certain approach or parameter are, what the implications of an example solution are, or what a possible interpretation of a certain result might be. I am highlighting these things in a block like this one here or at least in <font color="green">**green bold font**</font>.</div>

<div class="alert alertbox alert-danger">**NOTE:** As I am continually improving the notebook on the side of the source text, wordlists and other parameters, it is sometimes hard to keep the prose description in sync. So while the actual descriptions still apply, the numbers that are mentioned in the prose (as where we have e.g. a "table with 20 rows and 1.672 columns") might no longer reflect the latest state of the sources, auxiliary files and parameters and you should take these with a grain of salt. Best double check them by reading the actual code ;-)

I apologize for the inconsistency.</div>

# Preparations

Unlike in the previous case, where we had word files that we could export as plaintext, in this case Manuela has prepared a sample chapter with four editions transcribed *in parallel* in an office spreadsheet. So we first of all make sure that we have good **UTF-8** comma-separated-value files, e.g. by uploading a **csv** export of our office program of choice to [a CSV Linting service](https://csvlint.io/). (As a side remark, in my case, exporting with LibreOffice provided me with options to select UTF-8 encoding and choose the field delimiter and resulted in a valid csv file. MS Excel did neither of those.) Below, we expect the file at the following position:

In [119]:
sourcePath = 'Azpilcueta/cap6_align_-_2018-01.csv'

Then, we can go ahead and open the file in python's csv reader:

In [120]:
import csv

sourceFile = open(sourcePath, newline='', encoding='utf-8')
sourceTable = csv.reader(sourceFile)

And next, we read each line into new elements of four respective lists (since we're dealing with one sample chapter, we try to handle it all in memory first and see if we run into problems):

*(Note here and in the following that in most cases, when the program is counting, it does so beginning with zero. Which means that if we end up with 20 segments, they are going to be called segment 0, segment 1, ..., segment 19. There is not going to be a segment bearing the number twenty, although we do have twenty segments. The first one has the number zero and the twentieth one has the number nineteen. Even for more experienced coders, this sometimes leads to mistakes, called "off-by-one errors".)*

In [121]:
    # Initialize a list of lists, or two-dimensional list ...
    Editions = [[]]

    # ...with four sub-lists 0 to 3
    for i in range(3):
        a = []
        Editions.append(a)

    # Now populate it from our sourceTable
    sourceFile.seek(0)             # in repeated runs, restart from the beginning of the file
    for row in sourceTable:
        for i, field in enumerate(row):
            Editions[i].append(field)

    print(str(len(Editions[0])) + " rows read.\n")

    # As an example, see the first seven sections of the third edition (1556 SPA):
    for field in range(6):
        print(Editions[2][field])

41 rows read.

1556 SPA
¶ Capitulo.6. De las circunstancias del pecado.
Sumario.
1Circunstancia que es? nu. I. y que ay siete especies della.nu.2. Y que se ha de confessar de necessidad, la que muda la especie. nu. 3. Pero no la de aver pecado en confinança de se confessar.n.4./Circunstancia de homicidio, y de fornicacion en lugar sagrado se ha de confessar, y la vedada por otra ley diversa &c. nu. 5/Circunstancia de mentira iocosa, y la que alivia el pecado quando se ha de confessar.nu.6.7.  & 8. Y quando la del dia de fiesta, de ayuno, o de oracion, o del lugar sagrado. nu. 9 & 10. Y la de la proprioa persona, y de la religion. nu. 11. Y ha de pecar contra consciencia. nume.12/[p. 32,corretto 31; 24 pdf] Circunstancia como no es el numero de los pecados nu. 14. Pecaodo multipliarse tantas vezes, quantas se itera, como se ha de entender, y si crece el numero de los pecados por se interpolar la voluntad. nu. 16. Y por mudar el proposito, para no acabar el pecado con otras muchas consid

Actually, let's define two more list variables to hold information about the different editions - language and year of print:

In [122]:
numOfEds = 4
language = ["PT", "PT", "ES", "LA"] # I am using language codes that later on can be used in babelnet
year = [1549, 1552, 1556, 1573]

# TF/IDF <a name="tfidf"></a>

In the previous (i.e. Solórzano) analyses, things like tokenization, lemmatization and stop-word lists filtering are explained step by step. Here, we rely on what we have found there and feed it all into functions that are ready-made and available in suitable libraries...

First, we build our lemmatization resource and "function":

In [123]:
lemma = [{} for i in range(numOfEds)]
# lemma    = {}    # we build a so-called dictionary for the lookups

for i in range(numOfEds):
    
    wordfile_path = 'Azpilcueta/wordforms-' + language[i].lower() + '.txt'

    # open the wordfile (defined above) for reading
    wordfile = open(wordfile_path, encoding='utf-8')

    tempdict = []
    for line in wordfile.readlines():
        tempdict.append(tuple(line.split('>'))) # we split each line by ">" and append
                                                # a tuple to a temporary list.

    lemma[i] = {k.strip(): v.strip() for k, v in tempdict} # for every tuple in the temp. list,
                                                    # we strip whitespace and make a key-value
                                                    # pair, appending it to our "lemma"
                                                    # dictionary
    wordfile.close

    print(str(len(lemma[i])) + ' ' + language[i] + ' wordforms known to the system.')


614729 PT wordforms known to the system.
614729 PT wordforms known to the system.
614701 ES wordforms known to the system.
1709 LA wordforms known to the system.


Again, a quick test: Let's see with which "lemma"/basic word the particular wordform "diremos" is associated, or, in other words, what *value* our lemma variable returns when we query for the *key* "diremos":

In [124]:
lemma[language.index("ES")]['diremos']

'decir'

And we are going to need the stopwords lists:

In [125]:
for i in range(numOfEds):
    
    stopwords_path = 'Azpilcueta/stopwords-' + language[i].lower() + '.txt'
    stopwords[i] = open(stopwords_path, encoding='utf-8').read().splitlines()

    print(str(len(stopwords[i])) + ' ' + language[i]
          + ' stopwords known to the system, e.g.: ' + str(stopwords[i][100:119]) + '\n')

746 PT stopwords known to the system, e.g.: ['ciertos', 'cinco', 'claro', 'comentó', 'como', 'cómo', 'con', 'conmigo', 'conocer', 'conseguimos', 'conseguir', 'considera', 'consideró', 'consigo', 'consigue', 'consiguen', 'consigues', 'contigo', 'contra']

746 PT stopwords known to the system, e.g.: ['ciertos', 'cinco', 'claro', 'comentó', 'como', 'cómo', 'con', 'conmigo', 'conocer', 'conseguimos', 'conseguir', 'considera', 'consideró', 'consigo', 'consigue', 'consiguen', 'consigues', 'contigo', 'contra']

756 ES stopwords known to the system, e.g.: ['cierta', 'ciertas', 'cierto', 'ciertos', 'cinco', 'claro', 'comentó', 'como', 'cómo', 'con', 'conmigo', 'conocer', 'conseguimos', 'conseguir', 'considera', 'consideró', 'consigo', 'consigue', 'consiguen']

395 LA stopwords known to the system, e.g.: ['ac', 'ad', 'adhic', 'adhuc', 'ae', 'ait', 'ali', 'alii', 'aliis', 'alio', 'aliqua', 'aliqui', 'aliquid', 'aliquis', 'aliquo', 'am', 'an', 'ante', 'apud']



(In contrast to simpler numbers that have been filtered out by the stopwords filter, I have left numbers representing years like "1610" in place.)

Next, we should find some very characteristic words for each segment for each edition. (Let's say we are looking for the "Top 20".) We should build a vocabulary for each edition individually and only afterwards work towards a common vocabulary of several "Top n" sets.

In [126]:
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

numTopTerms = 20

# So first we build a tokenising and lemmatising function (per language) to work as
# an input filter to the CountVectorizer function
def ourLaLemmatiser(str_input):
    wordforms = re.split('\W+', str_input)
    return [lemma[language.index("LA")][wordform].lower().strip() if wordform in lemma[language.index("LA")] else wordform.lower().strip() for wordform in wordforms ]
def ourEsLemmatiser(str_input):
    wordforms = re.split('\W+', str_input)
    return [lemma[language.index("ES")][wordform].lower().strip() if wordform in lemma[language.index("ES")] else wordform.lower().strip() for wordform in wordforms ]
def ourPtLemmatiser(str_input):
    wordforms = re.split('\W+', str_input)
    return [lemma[language.index("PT")][wordform].lower().strip() if wordform in lemma[language.index("PT")] else wordform.lower().strip() for wordform in wordforms ]

def ourLemmatiser(lang):
    if (lang == "LA"):
        return ourLaLemmatiser
    if (lang == "ES"):
        return ourEsLemmatiser
    if (lang == "PT"):
        return ourPtLemmatiser

def ourStopwords(lang):
    if (lang == "LA"):
        return stopwords[language.index("LA")]
    if (lang == "ES"):
        return stopwords[language.index("ES")]
    if (lang == "PT"):
        return stopwords[language.index("PT")]

topTerms = []
for i in range(numOfEds):

    topTermsEd = []
    # Initialize the library's function, specifying our
    # tokenizing function from above and our stopwords list.
    tfidf_vectorizer = TfidfVectorizer(stop_words=ourStopwords(language[i]), use_idf=True, tokenizer=ourLemmatiser(language[i]), norm='l2')

    # Finally, we feed our corpus to the function to build a new "tfidf_matrix" object
    tfidf_matrix = tfidf_vectorizer.fit_transform(Editions[i])

    # convert your matrix to an array to loop over it
    mx_array = tfidf_matrix.toarray()

    # get your feature names
    fn = tfidf_vectorizer.get_feature_names()

    # now loop through all segments and get the respective top n words.
    pos = 0
    for j in mx_array:
        # We have empty segments, i.e. none of the words in our vocabulary has any tf/idf score > 0
        if (j.max() == 0):
            topTermsEd.append([("", 0)])
        # otherwise append (present) lemmatised words until numTopTerms or the number of words (-stopwords) is reached
        else:
            topTermsEd.append(
                [(fn[x], j[x]) for x in ((j*-1).argsort()) if j[x] > 0] \
                [:min(numTopTerms, len(
                    [word for word in re.split('\W+', Editions[i][pos]) if ourLemmatiser(language[i])(word) not in stopwords]
                ))])
        pos += 1
    topTerms.append(topTermsEd)

# Translations?

Maybe there is an approach to inter-lingual comparison after all. After a first unsuccessful try with [conceptnet.io](http://conceptnet.io), I next want to try [Babelnet](http://babelnet.org) in order to lookup synonyms, related terms and translations. I still have to study the [API](http://babelnet.org/guide)...



For example, let's take this single segment 19:

In [127]:
segment_no = 19 

And then first let's see how this segment compares in the different editions:

In [128]:
print("Comparing words from segments " + str(segment_no) + " ...")

print(" ")
print("Here is the segment in the four editions:")
print(" ")
for i in range(numOfEds):
    print("Ed. " + str(i) + ":")
    print("------")
    print(Editions[i][segment_no])
    print(" ")

print(" ")
print(" ")

# Build List of most significant words for a segment

print("Most significant words in the segment:")
print(" ")
for i in range(numOfEds):
    print("Ed. " + str(i) + ":")
    print("------")
    print(topTerms[i][segment_no])
    print(" ")

Comparing words from segments 19 ...
 
Here is the segment in the four editions:
 
Ed. 0:
------
¶A circunstancia do dia deputado a jejuum, ou oraçam : nam he de necessidade confessata : salvo quando fizesse peccado com proposito de ho quebrantar como acima do dia da festa. Segundo Navarro, ubi supra.
 
Ed. 1:
------
 ¶ A IX que a circunstancia do dia de jejuũ, ou de oração, não se ha de confessar necessariamente, se não quando se pecca com proposito de ho quebrãtar : porque nã faz algũa das ditas tres cousas, segũdo em outra parte ho provamoss (s : f. in d. c. Consideret n. 32 vers. sic. Ad primum) 
 
Ed. 2:
------
17 ¶  El X que la circunstancia del dia de ayuno, o de oracion, no se ha de confessar necessariamente, fino quando se peca con proposito delo quebrantar, por ello por que no haze alguna delas dichas tres cosas, segun lo provamos alibim (m : in d. c. Consideret nu. 32 ver. Ad primum) 
 
Ed. 3:
------
17Decimo. Quod circunstantia diei ieiunio vel orationi consecrati, licet vi

Now we look up the "concepts" associated to those words in babelnet. Then we look up the concepts associated with the words of the present segment from another edition/language, and see if the concepts are the same.

But we have to decide on some particular editions to get things started. Let's take the Spanish and Latin ones:

In [129]:
startEd = 2
secondEd = 3

And then we can continue...

In [130]:
import urllib
import json
from collections import defaultdict

babelAPIKey = '18546fd3-8999-43db-ac31-dc113506f825'
babelGetSynsetIdsURL = "https://babelnet.io/v5/getSynsetIds?" + \
                       "targetLang=LA&targetLang=ES&targetLang=PT" + \
                       "&searchLang=" + language[startEd] + \
                       "&key=" + babelAPIKey + \
                       "&lemma="

# Build lists of possible concepts
top_possible_conceptIDs = defaultdict(list)
for (word, val) in topTerms[startEd][segment_no]:
    concepts_uri = babelGetSynsetIdsURL + urllib.parse.quote(word)
    response = urllib.request.urlopen(concepts_uri)
    conceptIDs = json.loads(response.read().decode(response.info().get_param('charset') or 'utf-8'))
    for rel in conceptIDs:
        top_possible_conceptIDs[word].append(rel.get("id"))

print(" ")
print("For each of the '" + language[startEd] + "' words, here are possible synsets:")
print(" ")

for word in top_possible_conceptIDs:
    print(word + ":" + " " + ', '.join(c for c in top_possible_conceptIDs[word]))
    print(" ")

print(" ")
print(" ")
print(" ")

babelGetSynsetIdsURL2 = "https://babelnet.io/v5/getSynsetIds?" + \
                        "targetLang=LA&targetLang=ES&targetLang=PT" + \
                        "&searchLang=" + language[secondEd] + \
                        "&key=" + babelAPIKey + \
                        "&lemma="

# Build list of 10 most significant words in the second language
top_possible_conceptIDs_2 = defaultdict(list)
for (word, val) in topTerms[secondEd][segment_no]:
    concepts_uri = babelGetSynsetIdsURL + urllib.parse.quote(word)
    response = urllib.request.urlopen(concepts_uri)
    conceptIDs = json.loads(response.read().decode(response.info().get_param('charset') or 'utf-8'))
    for rel in conceptIDs:
        top_possible_conceptIDs_2[word].append(rel.get("id"))

print(" ")
print("For each of the '" + language[secondEd] + "' words, here are possible synsets:")
print(" ")
for word in top_possible_conceptIDs_2:
    print(word + ":" + " " + ', '.join(c for c in top_possible_conceptIDs_2[word]))
    print(" ")

 
For each of the 'ES' words, here are possible synsets:
 
finar: bn:00084343v
 
oración: bn:00064039n, bn:00059529n, bn:00070528n, bn:00059274n, bn:00064040n, bn:08296656n, bn:00059276n, bn:00059244n
 
ayunar: bn:00088050v, bn:00033737n, bn:00088049v
 
quebrantar: bn:00083911v, bn:00093586v, bn:00083904v
 
propósito: bn:00074900n, bn:00030721n, bn:00002178n, bn:00026651n, bn:00036822n, bn:00047046n, bn:00028724n
 
peca: bn:00036309n
 
probar: bn:00092112v, bn:00086567v, bn:00088852v, bn:00094413v, bn:00094833v, bn:00094834v, bn:00093251v, bn:00092109v, bn:00087867v, bn:00087731v, bn:00095207v, bn:00082844v, bn:00083245v, bn:00092110v, bn:00086679v
 
cosa: bn:00076928n, bn:16956461n, bn:00076927n, bn:00058442n, bn:00076925n, bn:11368151n, bn:00076924n, bn:00074798n, bn:16512490n, bn:00076923n, bn:00076922n, bn:00076921n, bn:00076920n, bn:01260156n, bn:00028240n, bn:00053801n, bn:00001734n, bn:11389907n, bn:03630544n, bn:00750164n, bn:16854696n, bn:00076919n, bn:00076918n
 
confesar: bn

In [134]:
# calculate number of overlapping terms
values_a = set([item for sublist in top_possible_conceptIDs.values() for item in sublist])
values_b = set([item for sublist in top_possible_conceptIDs_2.values() for item in sublist])
overlaps = values_a & values_b
print("Overlaps: " + str(overlaps))

babelGetSynsetInfoURL = "https://babelnet.io/v5/getSynset?key=" + babelAPIKey + \
                        "&targetLang=LA&targetLang=ES&targetLang=PT" + \
                        "&id="

for c in overlaps:
    info_uri = babelGetSynsetInfoURL + c
    response = urllib.request.urlopen(info_uri)
    words = json.loads(response.read().decode(response.info().get_param('charset') or 'utf-8'))
    
    senses = words['senses']
    for result in senses[:1]:
        lemma = result['properties'].get('fullLemma')
        language = result['properties'].get('language')
        print(c + ": " + lemma + " (" + language.lower() + ")")

# do a nifty ranking

Overlaps: {'bn:00059529n'}
https://babelnet.io/v5/getSynset?key=18546fd3-8999-43db-ac31-dc113506f825&targetLang=LA&targetLang=ES&targetLang=PT&id=bn:00059529n
bn:00059529n: comunión (es)


# Similarity <a name="DocumentSimilarity"/>

It seems we could now create another matrix replacing lemmata with concepts and retaining the tf/idf values (so as to keep a weight coefficient to the concepts). Then we should be able to calculate similarity measures across the same concepts...

The approach to choose would probably be the "cosine similarity" of concept vector spaces. Again, there is a library ready for us to use (but you can find some documentation [here](http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/), [here](http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) and [here](https://en.wikipedia.org/wiki/Cosine_similarity).)

**However, this is where I have to take a break now. I will return to here soon...**

In [135]:
from sklearn.metrics.pairwise import cosine_similarity

similarities = pd.DataFrame(cosine_similarity(tfidf_matrix))
similarities[round(similarities, 0) == 1] = 0 # Suppress a document's similarity to itself
print("Pairwise similarities:")
print(similarities)

Pairwise similarities:
     0         1    2         3         4         5         6         7   \
0   0.0  0.000000  0.0  0.000000  0.000000  0.000000  0.000000  0.000000   
1   0.0  0.000000  0.0  0.027305  0.042553  0.021084  0.050231  0.000000   
2   0.0  0.000000  0.0  0.000000  0.000000  0.000000  0.000000  0.000000   
3   0.0  0.027305  0.0  0.000000  0.090117  0.141109  0.109094  0.017595   
4   0.0  0.042553  0.0  0.090117  0.000000  0.125574  0.125320  0.046224   
5   0.0  0.021084  0.0  0.141109  0.125574  0.000000  0.059634  0.000000   
6   0.0  0.050231  0.0  0.109094  0.125320  0.059634  0.000000  0.049117   
7   0.0  0.000000  0.0  0.017595  0.046224  0.000000  0.049117  0.000000   
8   0.0  0.000000  0.0  0.059207  0.033151  0.075654  0.015600  0.008134   
9   0.0  0.000000  0.0  0.032621  0.043815  0.065289  0.069260  0.068591   
10  0.0  0.000000  0.0  0.016428  0.036763  0.036655  0.058454  0.062323   
11  0.0  0.011712  0.0  0.067984  0.072415  0.036169  0.122698  0

In [136]:
print("The two most similar segments in the corpus are")
print("segments", \
      similarities[similarities == similarities.values.max()].idxmax(axis=0).idxmax(axis=1), \
      "and", \
      similarities[similarities == similarities.values.max()].idxmax(axis=0)[ similarities[similarities == similarities.values.max()].idxmax(axis=0).idxmax(axis=1) ].astype(int), \
      ".")
print("They have a similarity score of")
print(similarities.values.max())

The two most similar segments in the corpus are
segments 37 and 38 .
They have a similarity score of
0.3330275428005039


<div class="alert alertbox alert-success">Of course, in every set of documents, we will always find two that are similar in the sense of them being more similar to each other than to the other ones. Whether or not this actually *means* anything in terms of content is still up to scholarly interpretation. But at least it means that a scholar can look at the two documents and when she determines that they are not so similar after all, then perhaps there is something interesting to say about similar vocabulary used for different puproses. Or the other way round: When the scholar knows that two passages are similar, but they have a low "similarity score", shouldn't that say something about the texts's rhetorics?</div>

# Word Clouds <a name="WordClouds"/>

We can use a library that takes word frequencies like above, calculates corresponding relative sizes of words and creates nice wordcloud images for our sections (again, taking the fourth segment as an example) like this:

In [14]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# We make tuples of (lemma, tf/idf score) for one of our segments
# But we have to convert our tf/idf weights to pseudo-frequencies (i.e. integer numbers)
frq = [ int(round(x * 100000, 0)) for x in Editions[1][3]]
freq = dict(zip(fn, frq))

wc = WordCloud(background_color=None, mode="RGBA", max_font_size=40, relative_scaling=1).fit_words(freq)

# Now show/plot the wordcloud
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

ModuleNotFoundError: No module named 'wordcloud'

In order to have a nicer overview over the many segments than is possible in this notebook, let's create a new html file listing some of the characteristics that we have found so far...

In [None]:
outputDir = "Azpilcueta"
htmlfile = open(outputDir + '/Overview.html', encoding='utf-8', mode='w')

# Write the html header and the opening of a layout table
htmlfile.write("""<!DOCTYPE html>
<html>
    <head>
        <title>Section Characteristics</title>
        <meta charset="utf-8"/>
    </head>
    <body>
        <table>
""")

a = [[]]
a.clear()
dicts = []
w = []

# For each segment, create a wordcloud and write it along with label and
# other information into a new row of the html table
for i in range(len(mx_array)):
    # this is like above in the single-segment example...
    a.append([ int(round(x * 100000, 0)) for x in mx_array[i]])
    dicts.append(dict(zip(fn, a[i])))
    w.append(WordCloud(background_color=None, mode="RGBA", \
                       max_font_size=40, min_font_size=10, \
                       max_words=60, relative_scaling=0.8).fit_words(dicts[i]))
    # We write the wordcloud image to a file
    w[i].to_file(outputDir + '/wc_' + str(i) + '.png')
    # Finally we write the column row
    htmlfile.write("""
            <tr>
                <td>
                    <head>Section {a}: <b>{b}</b></head><br/>
                    <img src="./wc_{a}.png"/><br/>
                    <small><i>length: {c} words</i></small>
                </td>
            </tr>
            <tr><td>&nbsp;</td></tr>
""".format(a = str(i), b = label[i], c = len(tokenised[i])))

# And then we write the end of the html file.
htmlfile.write("""
        </table>
    </body>
</html>
""")
htmlfile.close()

This should have created a nice html file which we can open [here](./Solorzano/Overview.html).