# Text Processing

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Text-Processing" data-toc-modified-id="Text-Processing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Text Processing</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li></ul></li><li><span><a href="#Preparations" data-toc-modified-id="Preparations-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparations</a></span><ul class="toc-item"><li><span><a href="#TF/IDF-" data-toc-modified-id="TF/IDF--2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>TF/IDF <a name="tfidf"></a></a></span></li></ul></li><li><span><a href="#Vector-Space-Model-of-the-text-" data-toc-modified-id="Vector-Space-Model-of-the-text--3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Vector Space Model of the text <a name="#VectorSpaceModel"></a></a></span><ul class="toc-item"><li><span><a href="#Another-method-to-generate-the-dimensions:-n-grams-" data-toc-modified-id="Another-method-to-generate-the-dimensions:-n-grams--3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Another method to generate the dimensions: n-grams <a name="N-Grams"></a></a></span></li><li><span><a href="#Extending-the-dimensions-" data-toc-modified-id="Extending-the-dimensions--3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Extending the dimensions <a name="AddDimensions"></a></a></span></li><li><span><a href="#Word-Clouds-" data-toc-modified-id="Word-Clouds--3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Word Clouds <a name="WordClouds"></a></a></span></li><li><span><a href="#Similarity-" data-toc-modified-id="Similarity--3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Similarity <a name="DocumentSimilarity"></a></a></span></li><li><span><a href="#Clustering-" data-toc-modified-id="Clustering--3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Clustering <a 
name="DocumentClustering"></a></a></span></li></ul></li><li><span><a href="#Working-with-several-languages" data-toc-modified-id="Working-with-several-languages-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Working with several languages</a></span><ul class="toc-item"><li><span><a href="#Translations?" data-toc-modified-id="Translations?-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Translations?</a></span></li></ul></li><li><span><a href="#Graph-based-NLP" data-toc-modified-id="Graph-based-NLP-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Graph-based NLP</a></span></li><li><span><a href="#Topic-Modelling" data-toc-modified-id="Topic-Modelling-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Topic Modelling</a></span></li><li><span><a href="#Manual-Annotation" data-toc-modified-id="Manual-Annotation-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Manual Annotation</a></span></li><li><span><a href="#Further-information" data-toc-modified-id="Further-information-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Further information</a></span></li></ul></div>

## Introduction

This file is the continuation of preceding work. Previously, I have worked my way through a couple of text-analysing approaches - such as tf/idf frequencies, n-grams and the like - in the context of a project concerned with Juan de Solórzano Pereira's *Politica Indiana*. This can be seen [here](TextProcessing_Solorzano.ipynb).

In the former context, I got somewhat stuck when I was trying to automatically align corresponding passages of two editions of the same work ... where the one edition would be a **translation** of the other and thus we would have two different languages.

The present file takes this up, tries to refine an approach taken there and to find alternative ways of analysing a text across several languages. This time, the work concerned is Martín de Azpilcueta's *Manual de confesores*, a work of the 16th century that has seen very many editions and translations, quite a few of them even by the work's original author. It is the subject of the research project ["Martín de Azpilcueta’s Manual for Confessors and the Phenomenon of Epitomisation"](http://www.rg.mpg.de/research/martin-de-azpilcuetas-manual-for-confessors) by Manuela Bragagnolo.

(There are a few DH-ey things about the project that are not directly of concern here, like a synoptic display of several editions or the presentation of the divergence of many actual translations of a given term. Such aspects are being treated with other software, like [HyperMachiavel](http://hyperprince.ens-lyon.fr/hypermachiavel) or [Lera](http://lera.uzi.uni-halle.de/).)

As in the previous case, the programming language used in the following examples is called "python" and the tool used to get prose discussion and code samples together is called ["jupyter"](http://jupyter.org/). (A common way of installing both the language and the jupyter software, especially on Windows, is by installing a python "distribution" like [Anaconda](https://www.anaconda.com/what-is-anaconda/).) In jupyter, you have a "notebook" that you can populate with text (if you want to use it, jupyter understands [markdown](http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html) formatting) or with code, and a program that pipes a nice rendering of the notebook to a web browser, which is what you are reading right now. In many places in such a notebook, the output that the code samples produce is printed right below the code itself. Sometimes this can be quite a lot of output and, depending on your viewing environment, you might have to scroll quite some way to get to the continuation of the discussion.

You can save your notebook online (the current one is [here at github](https://github.com/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb)) and there is an online service, nbviewer, able to render any notebook that it can access online. So chances are you are reading this present notebook at the web address [https://nbviewer.jupyter.org/github/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb](https://nbviewer.jupyter.org/github/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Azpilcueta.ipynb).

A final word about the elements of this notebook:

<div class="alert alertbox alert-success">At some points I am mentioning things I consider to be important decisions or take-away messages for scholarly readers. E.g. whether or not to insert certain artefacts into the very transcription of your text, what the methodological ramifications of a certain approach or parameter are, what the implications of an example solution are, or what a possible interpretation of a certain result might be. I am highlighting these things in a block like this one here or at least in <font color="green">**green bold font**</font>.</div>

<div class="alert alertbox alert-danger">**NOTE:** As I am continually improving the notebook on the side of the source text, wordlists and other parameters, it is sometimes hard to keep the prose description in sync. So while the actual descriptions still apply, the numbers that are mentioned in the prose (e.g. where we have a "table with 20 rows and 1.672 columns") might no longer reflect the latest state of the sources, auxiliary files and parameters, and you should take these with a grain of salt. Best double-check them by reading the actual code ;-)

I apologize for the inconsistency.</div>

# Preparations

Unlike in the previous case, where we had word files that we could export as plaintext, in this case Manuela has prepared a sample chapter with four editions transcribed *in parallel* in an office spreadsheet. So we first of all make sure that we have good **UTF-8** comma-separated-value files, e.g. by uploading a **csv** export from our office program of choice to [a CSV Linting service](https://csvlint.io/). (As a side remark, in my case, exporting with LibreOffice provided me with options to select UTF-8 encoding and choose the field delimiter and resulted in a valid csv file. MS Excel did neither of those.) We expect the file at the following position:

In [83]:
sourcePath = 'Azpilcueta/cap6_align_-_2018-01.csv'
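Besides the online linting service, a quick local sanity check is possible with python's standard library alone. The following is only a minimal sketch under stated assumptions: the sample data and the helper names `is_valid_utf8` and `column_counts` are made up for illustration and are not part of the project files.

```python
import csv
import io

# Toy stand-in for the exported spreadsheet (hypothetical content):
# one row per segment, one column per edition.
sample = 'ed_a,ed_b,ed_c,ed_d\n"first, with comma",b1,c1,d1\n'

def is_valid_utf8(raw_bytes):
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        raw_bytes.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

def column_counts(text):
    """Return the set of distinct field counts found in the csv text."""
    return {len(row) for row in csv.reader(io.StringIO(text))}

print(is_valid_utf8(sample.encode('utf-8')))
print(column_counts(sample))
```

If `column_counts` returns more than one number, some rows have more or fewer fields than others, which usually points to a delimiter or quoting problem in the export.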

Then, we can go ahead and open the file:

In [96]:
import csv

sourceFile = open(sourcePath, newline='', encoding='utf-8')
sourceTable = csv.reader(sourceFile)

And we read each line into new elements of four respective arrays (since we're dealing with one sample chapter, we try to handle it all in memory first and see if we run into problems):

*(Note here and in the following that in most cases, when the program is counting, it does so beginning with zero. Which means that if we end up with 20 segments, they are going to be called segment 0, segment 1, ..., segment 19. There is not going to be a segment bearing the number twenty, although we do have twenty segments. The first one has the number zero and the twentieth one has the number nineteen. Even for more experienced coders, this sometimes leads to mistakes, called "off-by-one errors".)*
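To make the zero-based counting concrete, here is a tiny illustration with a made-up list of three segments. Note in particular that `range(0, 6)` yields six numbers, 0 to 5, and never reaches 6.

```python
# A made-up list of three segments, for illustration only
segments = ["first", "second", "third"]

# The valid positions run from 0 to len(segments) - 1
indices = list(range(0, len(segments)))
print(indices)

# The first segment has the number zero, the last one the number two
print(segments[0], segments[len(segments) - 1])

# range(0, 6) counts six items: 0, 1, 2, 3, 4, 5
print(len(range(0, 6)))
```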

In [97]:
    # Initialize a two-dimensional array with four rows, one per edition
    Editions = [[] for _ in range(4)]

    # Now populate it from our sourceTable: the i-th field of every
    # row goes to the i-th edition
    for row in sourceTable:
        for i, field in enumerate(row):
            Editions[i].append(field)

    print(str(sourceTable.line_num) + " rows read.")
    sourceFile.close()

    # As an example, see the first six sections of the second edition:
    for field in range(0,6):
        print(Editions[1][field])

41 rows read.
1552 por
¶ Capitolo VI. Das circunstancias.


1.     [1] Pera fundamento disto diremos : lho primeiro, que a circumstancia do peccado, segundo a mente de S. Tho P & outros, he hum accidente daquilo, que he peccado. Dissemos (he accidente) porque nenhũa circũstãcia da obra, he a substancia della. Dissemos (da quilo que he peccado) & nã do peccado : porque muytas vezes a obra em si não he peccado, & pola circũ se faz peccado : & como então ella he aquilo, em que consiste ho peccado, não he tãto accidente do peccado, quanto da quilo que he peccado : segundo que ho declaramos em outra parteq (q. in d. c. Consideret n. 3), seguindo a Alex. de Ales r (r in. 4 pt. q. 77 ar.z. co.la.z).
2.     ¶ [2] Ho II. Que a circunstancia se parte em sete species, que se conte nem aquelle verso : Quis, quid, ubi, quibus auxiliis, cur, quomodo, quando : Referido por S. Tho.s (in d. q. 7. ar 3) . Quem, Que, Onde, Com que ajudas, Porque, Em que maneira, Quando. O qual verso temos por melhor, que

  1. [TF/IDF](#tfidf)
  
  2. [Segment source text](#SegmentSourceText) 
  3. [Read segments into Variable/List](#ReadSegmentsIntoVariable)
  4. [Tokenising](#Tokenising)
  5. [Stemming/Lemmatising](#StemmingLemmatising)
  6. [Eliminate stopwords](#EliminateStopwords)

## TF/IDF <a name="tfidf"></a>

In the previous (i.e. Solórzano) analyses, things like tokenization, lemmatization and stop-word filtering were explained step by step. Here, we rely on what we have found there and feed it all into functions that are ready-made and available in suitable libraries...

First, we build our lemmatization resource and "function":

In [86]:
lemma    = {}    # we build a so-called dictionary for the lookups
tempdict = []

wordfile_path = 'Azpilcueta/wordforms-lat.txt'

# open the wordfile (defined above) for reading; the "with" block
# closes the file again once we are done with it
with open(wordfile_path, encoding='utf-8') as wordfile:
    for line in wordfile:
        tempdict.append(tuple(line.split('>'))) # we split each line by ">" and append a tuple to a
                                                # temporary list.

lemma = {k.strip(): v.strip() for k, v in tempdict} # for every tuple in the list,
                                                    # we strip whitespace and make a key-value
                                                    # pair, appending it to our "lemma" dictionary
print(str(len(lemma)) + ' wordforms known to the system.')

1706 wordforms known to the system.


Again, a quick test: Let's see with which "lemma"/basic word the particular wordform "fidem" is associated, or, in other words, what *value* our lemma variable returns when we query for the *key* "fidem":

In [87]:
lemma['fidem']

'fides'
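Querying the dictionary directly with square brackets raises a `KeyError` for any wordform that is not in the list. A more forgiving lookup falls back to the wordform itself. The sketch below uses a two-entry toy dictionary (`toy_lemma`, a hypothetical excerpt) and a hypothetical helper `lemmatise`; it also shows that the lookup is case-sensitive, so a capitalised "Fidem" is not recognised unless it is listed as such.

```python
# Hypothetical two-entry excerpt of a wordform list, for illustration only
toy_lemma = {"fidem": "fides", "fidei": "fides"}

def lemmatise(wordform, table):
    """Return the lemma for a known wordform, else the wordform itself (lowercased)."""
    return table.get(wordform, wordform).lower()

print(lemmatise("fidem", toy_lemma))          # known wordform
print(lemmatise("circumstantia", toy_lemma))  # unknown: falls back to itself
print(lemmatise("Fidem", toy_lemma))          # capitalised: not found, only lowercased
```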

And we are going to need the stopwords list:

In [88]:
stopwords_path = 'Azpilcueta/stopwords-lat.txt'
stopwords = open(stopwords_path, encoding='utf-8').read().splitlines()

print(str(len(stopwords)) + ' stopwords known to the system, e.g.: ' + str(stopwords[95:170]))

388 stopwords known to the system, e.g.: ['a', 'ab', 'ac', 'ad', 'adhic', 'adhuc', 'ae', 'ait', 'ali', 'alii', 'aliis', 'alio', 'aliqua', 'aliqui', 'aliquid', 'aliquis', 'aliquo', 'am', 'an', 'ante', 'apud', 'ar', 'at', 'atque', 'au', 'aut', 'autem', 'bus', 'c', 'ca', 'cap', 'ceptum', 'co', 'con', 'cons', 'cui', 'cum', 'cur', 'cùm', 'd', 'da', 'de', 'deinde', 'detur', 'di', 'diu', 'do', 'dum', 'e', 'ea', 'eadem', 'ec', 'eccle', 'ego', 'ei', 'eis', 'eius', 'el', 'em', 'en', 'enim', 'eo', 'eos', 'er', 'erat', 'ergo', 'erit', 'es', 'esse', 'essent', 'esset', 'est', 'et', 'etenim', 'eti']


You can see how our corpus of four thousand "tokens" actually contains only one and a half thousand different words (plus stopwords, but these are at maximum 388). And, in contrast to simpler numbers that have been filtered out by our stopwords filter, I have left years like "1610" in place.

In [93]:
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# So first we build a tokenising and lemmatising function to work as an input filter
# to the TfidfVectorizer function
def ourLemmatiser(str_input):
    wordforms = re.split(r'\W+', str_input)
    return [lemma[wordform].lower().strip() if wordform in lemma else wordform.lower().strip() for wordform in wordforms ]

# !!!!
# TODO: The above pipes all the tokens through the latin lemmatizer. We should lemmatize Spanish and Portuguese differently!
# !!!!

# Initialize the library's function
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords, use_idf=True, tokenizer=ourLemmatiser, norm='l2')

# Finally, we feed our corpus to the function to build a new "tfidf_matrix" object
tfidf_matrix = tfidf_vectorizer.fit_transform(Editions[0])

# Keep the weights and the vocabulary around for the cells further below
mx_array = tfidf_matrix.toarray()
fn = tfidf_vectorizer.get_feature_names_out() # use get_feature_names() with scikit-learn versions before 1.0

# Print some results
tfidf_matrix_frame = pd.DataFrame(mx_array, columns=fn)

print(len(tfidf_matrix_frame))

41
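To see what the vectorizer does with the tokens, the weighting can be reproduced by hand on a toy corpus. This is a sketch, not the library's code: it assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by length ("l2") normalisation, which to my knowledge corresponds to scikit-learn's default settings. The three toy documents and the function name `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Three made-up, already tokenised and lemmatised "documents"
docs = [
    ["circumstantia", "peccatum", "peccatum"],
    ["circumstantia", "fides"],
    ["fides", "opus"],
]

def tfidf(docs):
    """Return (vocabulary, rows of l2-normalised tf/idf weights)."""
    n = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 3))
```

In the first toy document, "peccatum" ends up with a higher weight than "circumstantia": it occurs twice there and in no other document, whereas "circumstantia" is shared with the second document.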


## Word Clouds <a name="WordClouds"/>

We can use a library that takes word frequencies like above, calculates corresponding relative sizes of words and creates nice wordcloud images for our sections (again, taking the fourth segment as an example) like this:

In [99]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# We make tuples of (lemma, tf/idf score) for one of our segments
# But we have to convert our tf/idf weights to pseudo-frequencies (i.e. integer numbers)
frq = [ int(round(x * 100000, 0)) for x in mx_array[3]]
freq = dict(zip(fn, frq))

wc = WordCloud(background_color=None, mode="RGBA", max_font_size=40, relative_scaling=1).fit_words(freq)

# Now show/plot the wordcloud
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

ModuleNotFoundError: No module named 'wordcloud'
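Independently of the missing wordcloud module, the conversion of tf/idf weights into integer pseudo-frequencies can be tried on toy data. The weights below are invented for illustration, and dropping the zero entries is an extra step I am assuming to be useful, since terms that do not occur in a segment have nothing to contribute to its image.

```python
# Invented tf/idf weights of three terms in one segment
weights = {"peccatum": 0.39953, "circumstantia": 0.2, "fides": 0.0}

# Scale up and round, as in the cell above
pseudo = {term: int(round(w * 100000)) for term, w in weights.items()}

# Assumption: terms with a weight of zero are left out
nonzero = {term: freq for term, freq in pseudo.items() if freq > 0}
print(nonzero)
```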

In order to have a nicer overview over the many segments than is possible in this notebook, let's create a new html file listing some of the characteristics that we have found so far...

In [98]:
outputDir = "Azpilcueta"
htmlfile = open(outputDir + '/Overview.html', encoding='utf-8', mode='w')

# Write the html header and the opening of a layout table
htmlfile.write("""<!DOCTYPE html>
<html>
    <head>
        <title>Section Characteristics</title>
        <meta charset="utf-8"/>
    </head>
    <body>
        <table>
""")

a = []
dicts = []
w = []

# For each segment, create a wordcloud and write it along with label and
# other information into a new row of the html table
for i in range(0, len(mx_array)):
    # this is like above in the single-segment example...
    a.append([ int(round(x * 100000, 0)) for x in mx_array[i]])
    dicts.append(dict(zip(fn, a[i])))
    w.append(WordCloud(background_color=None, mode="RGBA", \
                       max_font_size=40, min_font_size=10, \
                       max_words=60, relative_scaling=0.8).fit_words(dicts[i]))
    # We write the wordcloud image to a file
    w[i].to_file(outputDir + '/wc_' + str(i) + '.png')
    # Finally we write the table row
    htmlfile.write("""
            <tr>
                <td>
                    <head>Section {a}: <b>{b}</b></head><br/>
                    <img src="./wc_{a}.png"/><br/>
                    <small><i>length: {c} words</i></small>
                </td>
            </tr>
            <tr><td>&nbsp;</td></tr>
""".format(a = str(i), b = label[i], c = len(tokenised[i])))

# And then we write the end of the html file.
htmlfile.write("""
        </table>
    </body>
</html>
""")
htmlfile.close()

NameError: name 'WordCloud' is not defined

This should have created a nice html file which we can open [here](./Azpilcueta/Overview.html).

## Similarity <a name="DocumentSimilarity"/>

Also, once we have a representation of our text as a vector - which we can imagine as an arrow that goes a certain distance in one direction, another distance in another direction and so on - we can compare the different arrows. Do they go the same distance in a particular direction? And maybe almost the same in another direction? This would mean that one of the terms of our vocabulary has the same weight in both texts. Comparing the weight of our many, many dimensions, we can develop a measure for the similarity of the texts.

(Probably, similarity in words that are occurring all over the place in the corpus should not count so much, and in fact it is attenuated by our arrows being made up of tf/idf weights.)

Comparing arrows means calculating with angles and technically, what we are computing is the "cosine similarity" of texts. Again, there is a library ready for us to use (but you can find some documentation [here](http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/), [here](http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) and [here](https://en.wikipedia.org/wiki/Cosine_similarity).)
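For intuition, the cosine of the angle between two arrows can also be computed by hand. The following sketch uses three made-up vectors and a hypothetical helper `cosine`; it is meant to illustrate the measure, not to replace the library function used below.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if one of them is all zeroes)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

vectors = {
    "a": [1.0, 2.0, 0.0],
    "b": [2.0, 4.0, 0.0],   # same direction as a, only twice as long
    "c": [0.0, 0.0, 3.0],   # shares no dimension with a or b
}

for first, second in combinations(vectors, 2):
    print(first, second, round(cosine(vectors[first], vectors[second]), 3))

# The most similar pair is the one with the highest cosine
best = max(combinations(vectors, 2),
           key=lambda pair: cosine(vectors[pair[0]], vectors[pair[1]]))
print(best)
```

The length of the arrows does not matter, only their direction: "a" and "b" get the highest possible score although "b" is twice as long, and "c" scores zero against both because it has no dimension in common with them.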

In [141]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tfidf_matrix)
np.fill_diagonal(similarity_matrix, 0) # Suppress a document's similarity to itself
similarities = pd.DataFrame(similarity_matrix)
print("Pairwise similarities:")
print(similarities)

Pairwise similarities:
          0         1         2         3         4         5         6   \
0   0.000000  0.222897  0.120293  0.147745  0.130148  0.081761  0.094125   
1   0.222897  0.000000  0.092630  0.072586  0.082198  0.029001  0.030976   
2   0.120293  0.092630  0.000000  0.057610  0.120441  0.043699  0.041051   
3   0.147745  0.072586  0.057610  0.000000  0.131746  0.097417  0.056908   
4   0.130148  0.082198  0.120441  0.131746  0.000000  0.221282  0.181299   
5   0.081761  0.029001  0.043699  0.097417  0.221282  0.000000  0.141404   
6   0.094125  0.030976  0.041051  0.056908  0.181299  0.141404  0.000000   
7   0.035931  0.067855  0.035364  0.082660  0.132764  0.162764  0.081296   
8   0.100506  0.166803  0.060454  0.083347  0.078576  0.039125  0.044615   
9   0.044722  0.086246  0.019550  0.102677  0.076310  0.039573  0.020726   
10  0.025080  0.067164  0.076931  0.052992  0.110786  0.050311  0.022992   
11  0.000000  0.005779  0.000000  0.026106  0.041592  0.008422  0

In [142]:
import numpy as np

# Locate the largest entry of the similarity table; its row and column
# numbers are the two segments we are looking for
first, second = np.unravel_index(similarities.values.argmax(), similarities.shape)

print("The two most similar segments in the corpus are")
print("segments", int(first), "and", int(second), ".")
print("They have a similarity score of")
print(similarities.values.max())

The two most similar segments in the corpus are
segments 0 and 1 .
They have a similarity score of
0.222896735543


<div class="alert alertbox alert-success">Of course, in every set of documents, we will always find two that are similar in the sense of them being more similar to each other than to the other ones. Whether or not this actually *means* anything in terms of content is still up to scholarly interpretation. But at least it means that a scholar can look at the two documents and when she determines that they are not so similar after all, then perhaps there is something interesting to say about similar vocabulary used for different purposes. Or the other way round: When the scholar knows that two passages are similar, but they have a low "similarity score", shouldn't that say something about the texts' rhetorics?</div>

## Clustering <a name="DocumentClustering"/>

Clustering is a method to find ways of grouping data into subsets, so that each subset has some cohesion. Sentences that are more similar to a particular "paradigm" sentence than to another one are grouped with the first one, others are grouped with their respective "paradigm" sentence. Of course, one of the challenges is finding sentences that work well as such paradigm sentences. So we have two (or even three) stages: Find paradigms, group data accordingly. (And learn how many groups there are.)<img src="http://practicalcryptography.com/media/miscellaneous/files/k_mean_send.gif"/>

I hope to be able to add a discussion of this subject soon. For now, here are nice tutorials for the process:
  - [http://brandonrose.org/clustering](http://brandonrose.org/clustering)
  - [https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/](https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/)
  - [https://de.dariah.eu/tatom/working_with_text.html](https://de.dariah.eu/tatom/working_with_text.html)
  - [http://jonathansoma.com/lede/foundations/classes/text%20processing/tf-idf/](http://jonathansoma.com/lede/foundations/classes/text%20processing/tf-idf/)

  - Find good measure (word vectors, authorities cited, style, ...)
  - Find starting centroids
  - Find good K value
  - K-Means clustering
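Until that discussion is written, the steps listed above can at least be sketched in plain python. This is a deliberately naive illustration and not an implementation used in this notebook: the four two-dimensional points stand in for rows of a tf/idf matrix, the value of K is simply given, and the starting centroids are just the first K points. The function name `kmeans` and all data are made up.

```python
import math

def kmeans(points, k, iterations=20):
    """Naive k-means; returns (centroids, cluster number of each point)."""
    centroids = [list(p) for p in points[:k]]   # naive start: the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:   # keep the old centroid if a cluster runs empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_assignments == assignments:
            break         # nothing has changed any more
        assignments = new_assignments
    return centroids, assignments

# Made-up points: two near the origin, two near (1, 1)
points = [[0.0, 0.1], [0.9, 1.0], [0.1, 0.0], [1.0, 0.9]]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

The result depends on the starting centroids, which is why finding good ones is an item of its own in the list above. The same holds for the choice of K, which this sketch takes for granted.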


# Working with several languages

Let us prepare a second text, this time in Spanish, and see how they compare...

In [143]:
bigspanishfile = 'Solorzano/Sections_II.2_PI.txt'
spInput = open(bigspanishfile, encoding='utf-8').readlines()

spAt    = -1
spDest  = None

for line in spInput:
    if line[0:3] == '€€€':
        if spDest:
            spDest.close()
        spAt += 1
        spDest = open(outputBase + '.' + str(spAt) +
                    '.spanish.txt', encoding='utf-8', mode='w')
    else:
        spDest.write(line.strip())

spAt += 1
spDest.close()
print(str(spAt) + ' files written.')

import errno

# (path and filename are not set anywhere in this notebook and
# have to be defined before running this cell)
spSuffix = '.spanish.txt'
spCorpus = []
for i in range(0, spAt):
    try:
        with open(path + '/' + filename + str(i) + spSuffix, encoding='utf-8') as f:
            spCorpus.append(f.read())  # the "with" block closes the file for us
    except IOError as exc:
        if exc.errno != errno.EISDIR:  # Do not fail if a directory is found, just ignore it.
            raise                      # Propagate other kinds of IOError.

print(str(len(spCorpus)) + ' files read.')

# Labels
spLabel = []
for spLine in spInput:
    if spLine[0:3] == '€€€':
        spLabel.append(spLine[6:].strip())
print(str(len(spLabel)) + ' labels found.')

# Tokens
spTokenised = []
for spSegment in spCorpus:
    spTokenised.append(list(filter(None, (spWord.lower()
                                        for spWord in re.split(r'\W+', spSegment)))))

# Lemmata
spLemma    = {}
spTempdict = []
spWordfile_path = 'Solorzano/wordforms-es.txt'

# the "with" block closes the file again once we are done with it
with open(spWordfile_path, encoding='utf-8') as spWordfile:
    for spLine in spWordfile:
        spTempdict.append(tuple(spLine.split('>')))

spLemma = {k.strip(): v.strip() for k, v in spTempdict}
print(str(len(spLemma)) + ' spanish wordforms known to the system.')

# Stopwords
spStopwords_path = 'Solorzano/stopwords-es.txt'
spStopwords = open(spStopwords_path, encoding='utf-8').read().splitlines()
print(str(len(spStopwords)) + ' spanish stopwords known to the system.')

print(' ')
print('Significant words in the spanish text:')

# tokenising and lemmatising function
def spOurLemmatiser(str_input):
    spWordforms = re.split(r'\W+', str_input)
    return [spLemma[spWordform].lower() if spWordform in spLemma else spWordform.lower() for spWordform in spWordforms ]

spTfidf_vectorizer = TfidfVectorizer(stop_words=spStopwords, use_idf=True, tokenizer=spOurLemmatiser, norm='l2')
spTfidf_matrix = spTfidf_vectorizer.fit_transform(spCorpus)

spMx_array = spTfidf_matrix.toarray()
spFn = spTfidf_vectorizer.get_feature_names_out() # use get_feature_names() with scikit-learn versions before 1.0

for pos, weights in enumerate(spMx_array, start=1):
    print(' ')
    print(' Most significant words in the ' + str(pos) + '. segment:')
    top = (-weights).argsort()[:10]  # positions of the ten highest weights
    print(pd.DataFrame([(spFn[x], weights[x]) for x in top], columns=['lemma', 'tf/idf value']))

18 files written.
18 files read.
18 labels found.
614725 spanish wordforms known to the system.
743 spanish stopwords known to the system.
 
Significant words in the spanish text:
 
 Most significant words in the 1. segment:
        lemma  tf/idf value
0    capitvlo      0.399528
1  totalmente      0.399528
2     español      0.349703
3        casa      0.286932
4    tributar      0.286932
5  particular      0.264527
6      llamar      0.264527
7        cosa      0.245585
8    prohibir      0.245585
9    personal      0.201756
 
 Most significant words in the 2. segment:
        lemma  tf/idf value
0       indio      0.176840
1    servicio      0.158496
2  famulicios      0.138930
3     público      0.138930
4  domesticos      0.138930
5       carga      0.138930
6    reservar      0.138930
7       color      0.138930
8    reperida      0.138930
9      cobrar      0.138930
 
 Most significant words in the 3. segment:
          lemma  tf/idf value
0        hombre      0.382174
1        

<div class="alert alertbox alert-success">Our Spanish wordfiles ([lemmata list](Solorzano/wordforms-es.txt) and [stopwords list](Solorzano/stopwords-es.txt)) are quite large and generous - they spare us some work of resolving quite a lot of abbreviations. However, since they originate from a completely different project, it is very unlikely that they are free of mistakes. Also, some lemmata (like "de+el" in the eighth segment) are not really lemmata at all. So we urgently need to clean our wordlists and adapt them to the current text material!</div>

Now imagine how we would bring the two documents together in a vector space. We would generate dimensions for all the words of our Spanish vocabulary and would end up with a common space of roughly twice as many dimensions as before - and the Latin work would be only in the first half of the dimensions and the Spanish work only in the second half. The respective other half would be populated with only zeroes. So in effect, we would not really have a *common* space or something on the basis of which we could compare the two works. :-(
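The effect can be shown with two tiny, invented vocabularies. In the sketch below, the weights and the words are made up for illustration. The dimensions are sorted alphabetically, so they interleave instead of forming two halves, but the outcome is the same: wherever one text has a weight, the other has a zero.

```python
# Invented weights for one Latin and one Spanish text
latin   = {"rex": 0.8, "servitium": 0.6}
spanish = {"rey": 0.8, "servicio": 0.6}

# One dimension per word of the combined vocabulary
vocab = sorted(set(latin) | set(spanish))
lat_vec = [latin.get(term, 0.0) for term in vocab]
spa_vec = [spanish.get(term, 0.0) for term in vocab]

print(vocab)
print(lat_vec)
print(spa_vec)

# The dot product (and with it the cosine similarity) is zero
dot = sum(a * b for a, b in zip(lat_vec, spa_vec))
print(dot)
```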

What might be an interesting perspective, however - since in this case, the second text is a translation of the first one - is a parallel, synoptic overview of both texts. So, let's at least add the second text to our html overview with the wordclouds:

In [144]:
htmlfile2 = open(outputDir + '/Synopsis.html', encoding='utf-8', mode='w')

htmlfile2.write("""<!DOCTYPE html>
<html>
    <head>
        <title>Section Characteristics, parallel view</title>
        <meta charset="utf-8"/>
    </head>
    <body>
        <table>
""")
spA = []
spDicts = []
spW = []
for i in range(0, max(len(mx_array), len(spMx_array))):
    if (i > len(mx_array) - 1):
        htmlfile2.write("""
            <tr>
                <td>
                    <head>Section {a}: n/a</head>
                </td>""".format(a = str(i)))
    else:
        htmlfile2.write("""
            <tr>
                <td>
                    <head>Section {a}: <b>{b}</b></head><br/>
                    <img src="./wc_{a}.png"/><br/>
                    <small><i>length: {c} words</i></small>
                </td>""".format(a = str(i), b = label[i], c = len(tokenised[i])))
    if (i > len(spMx_array) - 1):
        htmlfile2.write("""
                <td>
                    <head>Section {a}: n/a</head>
                </td>
            </tr><tr><td>&nbsp;</td></tr>""".format(a = str(i)))
    else:
        spA.append([ int(round(x * 100000, 0)) for x in spMx_array[i]])
        spDicts.append(dict(zip(spFn, spA[i])))
        spW.append(WordCloud(background_color=None, mode="RGBA", \
                           max_font_size=40, min_font_size=10, \
                           max_words=60, relative_scaling=0.8).fit_words(spDicts[i]))
        spW[i].to_file(outputDir + '/wc_' + str(i) + '_sp.png')
        htmlfile2.write("""
                <td>
                    <head>Section {d}: <b>{e}</b></head><br/>
                    <img src="./wc_{d}_sp.png"/><br/>
                    <small><i>length: {f} words</i></small>
                </td>
            </tr>
            <tr><td>&nbsp;</td></tr>""".format(d = str(i), e = spLabel[i], f = len(spTokenised[i])))
    
htmlfile2.write("""
        </table>
    </body>
</html>
""")
htmlfile2.close()

Again, the resulting file can be opened [here](Azpilcueta/Synopsis.html).

## Translations?

Maybe there is an approach to inter-lingual comparison after all. Here is the [API documentation](https://github.com/commonsense/conceptnet5/wiki/API) of [conceptnet.io](http://conceptnet.io), which we can use to look up synonyms, related terms and translations, for instance with a URI like this one:

[http://api.conceptnet.io/related/c/la/rex?filter=/c/es](http://api.conceptnet.io/related/c/la/rex?filter=/c/es)

We can get an identifier for a word and many possible translations for this word. So, we could - this remains to be tested in practice - look up our ten (or so) most frequent words in one language and collect all possible translations in the second language. Then we could compare these with what we actually find in the second work. How much overlap there is going to be and how univocal it is going to be remains to be seen, however...

For example, with a single segment, we could do something like this:

In [159]:
import urllib.request
import json
from collections import defaultdict

segment_no = 6
spSegment_no = 8

print("Comparing words from segments " + str(segment_no) + " (latin) and " + str(spSegment_no) + " (spanish)...")

print(" ")
# Build list of the most significant words for a segment
# (despite the variable name, we take the top 12)
top10a = ([fn[x] for x in (mx_array[segment_no]*-1).argsort()][:12])
print("Most significant words in the latin text:")
print(top10a)

print(" ")
# Build lists of possible translations (the 15 most closely related ones)
top10a_possible_translations = defaultdict(list)
for word in top10a:
    concepts_uri = "http://api.conceptnet.io/related/c/la/" + word + "?filter=/c/es"
    response = urllib.request.urlopen(concepts_uri)
    concepts = json.loads(response.read().decode(response.info().get_param('charset') or 'utf-8'))
    for rel in concepts["related"][0:15]:
        top10a_possible_translations[word].append(rel.get("@id").split('/')[-1])

print(" ")
print("For each of the latin words, here are possible translations:")
for word in top10a_possible_translations:
    print(word + ":")
    print(', '.join(trans for trans in top10a_possible_translations[word]))

print(" ")
print(" ")
# Build list of the 12 most significant words in the second language
top10b = ([spFn[x] for x in (spMx_array[spSegment_no]*-1).argsort()][:12])
print("Most significant words in the spanish text:")
print(top10b)

# calculate number of overlapping terms
print(" ")
print(" ")
print("Overlaps:")
for word in top10a_possible_translations:
    print(', '.join(trans for trans in top10a_possible_translations[word] if (trans in top10b or trans == word)))

# do a nifty ranking


Comparing words from segments 6 (latin) and 8 (spanish)...
 
Most significant words in the latin text:
['semi', 'haya', 'por', 'casso', 'pario', 'volo', 'paro', 'servicios', 'servicio', 'personal', 'indios', 'tribuo']
 
 
For each of the latin words, here are possible translations:
semi:
mitad, semi, medio, parcialmente, media, parte, mediano, cora, intermedio, parcial, tercio, semifinal, mediana, cuasi, cuarta
haya:
haya, hamás, ele, jeque, alteza, cordobés, mahoma, córdoba, tanzania, princesa, árabe, israel, tablón, malentendido, palestina
por:
veces, ésos, ele, aquéllos, aquéllas, ésas, éste, doña, aquél, por, hai, éstas, ia, ése, favor
casso:
caer, caída, recaer, caerse, caído, comenzar, empezar, empiece, empiezo, comienzo, empezado, inicio, iniciar, iniciarse, vacilar
volo:
vuelo, volando, volar, avión, copiloto, chicago, paloma, palomar, milán, volador, mosca, pájaro, piloto, aves, aviación
paro:
huelga, paro, nepal, desempleo, perú, pelotudo, desempleado, paraguay, laburo, desoc
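The overlap step, and the ranking that the cell above leaves as a to-do, can be tried offline without calling the API. The following is a sketch under stated assumptions: the candidate translations and the observed words are invented for illustration and do not come from ConceptNet, and `rank_overlaps` is a hypothetical helper that simply ranks the words of the first language by the number of candidates actually found in the second text.

```python
# Invented translation candidates per Latin lemma (not from ConceptNet)
candidates = {
    "rex": ["rey", "monarca", "soberano"],
    "servitium": ["servicio", "servidumbre"],
    "tribuo": ["dar", "atribuir"],
}

# Invented list of significant words found in the Spanish segment
observed = ["rey", "servicio", "indio", "personal"]

def rank_overlaps(candidates, observed):
    """Return (word, matching translations) pairs, most matches first."""
    observed_set = set(observed)
    hits = {word: [t for t in translations if t in observed_set]
            for word, translations in candidates.items()}
    return sorted(((word, found) for word, found in hits.items() if found),
                  key=lambda item: (-len(item[1]), item[0]))

for word, found in rank_overlaps(candidates, observed):
    print(word + ": " + ', '.join(found))
```

Counting matches is only one possible ranking. Weighting each match by the tf/idf values on both sides would be another option that remains to be tested.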

# Graph-based NLP

  - [Unsupervised keywords extraction using graphs](https://graphaware.com/neo4j/2017/10/03/efficient-unsupervised-topic-extraction-nlp-neo4j.html)
  - [Reverse Engineering Book Stories with Neo4j and GraphAware NLP](https://graphaware.com/neo4j/2017/07/24/reverse-engineering-book-stories-nlp.html)


# Topic Modelling

...


# Manual Annotation

...


# Further information

  - http://jonathansoma.com/lede/foundations/classes/text%20processing/tf-idf/
  - http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/
  - http://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/
  - https://de.dariah.eu/tatom/index.html
  - https://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/tfidf.html
  - http://takwatanabe.me/data_science/pyspark/cs110_lab3b.html
  - https://github.com/mccurdyc/tf-idf/blob/master/README.md
  - https://people.duke.edu/~ccc14/sta-663/TextProcessingExtras.html
  