# Text Reuse
#### Block seminar, degree programme "Digitale Methodik in den Geistes- und Kulturwissenschaften" (18 January, 8 February, and 15 February 2020)

- lxml

Many of the subsequent steps require a language toolkit such as [NLTK](http://www.nltk.org/) or [spaCy](https://spacy.io/) (to mention just the two best-known ones) anyway, and most of these toolkits also provide tokenization. We therefore refrain from using a homebrew tokenizer like the one listed in the appendix.

There are many Python lemmatizers, but a good number of them support only English. Some depend on WordNet resources and may be able to load wordnet data for other languages as well. (I have found an online [comparison](https://lars76.github.io/nlp/lemmatize-portuguese/) of Python lemmatizers for Portuguese, but I cannot tell how reliable it is.) None of the options seems really up to the task, though, especially (this is my contention for now) for historical language variants.

Here is a list of toolkits and wordnets that I found:

- [FreeLing](http://nlp.lsi.upc.edu/freeling/)
- [spaCy](https://spacy.io/)
- [NLTK](http://www.nltk.org/)
- [Pattern](https://www.clips.uantwerpen.be/pattern)
- [RDRPoSTagger](https://github.com/datquocnguyen/RDRPOSTagger)
- [TreeTagger for Python](https://github.com/miotto/treetagger-python)
- [TextBlob](https://textblob.readthedocs.io/en/dev/)
- [StanfordNLP](https://stanfordnlp.github.io/stanfordnlp/)
- [Polyglot](https://polyglot.readthedocs.io/en/latest/)


- [WordNet](https://wordnet.princeton.edu/)
- [Open Multilingual Wordnet](http://compling.hss.ntu.edu.sg/omw/)
- [MultiWordnet](http://multiwordnet.fbk.eu/english/home.php)
- [OpenWordnet-PT](https://github.com/own-pt/openWordnet-PT) for Portuguese
- [Multilingual Central Repository](http://adimen.si.ehu.es/web/MCR/)
- [BabelNet](https://babelnet.org/)
- [ConceptNet](http://conceptnet.io/)

However, one toolkit that is often overlooked is [FreeLing](http://nlp.lsi.upc.edu/freeling/) \[Padró/Stanilovsky 2012\]. I have used its dictionary of word forms for historical Spanish in the past with some satisfaction \[see also Sanchez-Marco/Boleda/Padró 2011\]. We will use this toolkit and, besides its dictionary resources, also some of its more advanced methods. (For sense annotation, FreeLing relies on WordNet as well, but not for lemmatization.) To learn about its API and how to use it, you could start [here](https://talp-upc.gitbook.io/freeling-4-1-user-manual/installation/calling-freeling-library-from-languages-other-than-c++). In the appendix, you will find example code for how to use its Python 3 API.

<div class="alert alertbox alert-danger">
<p>Although I tried for hours to build the Python 3 interface for FreeLing on Windows, I was not successful. So for the rest of this notebook, assume that it only works under Linux!</p>
</div>



## Preprocessing

Linguistic preprocessing adds the following steps:

- Tokenisation
- (Normalisation?!)
- Lemmatisation

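
The three steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the lemma dictionary `TOY_LEMMAS` and all function names are hypothetical, the normalisation shown (lowercasing and stripping diacritics) is just one possible choice, and a real run would take its word forms from a resource such as the FreeLing dictionaries rather than from a hand-made table.

```python
import re
import unicodedata

# Hypothetical toy dictionary (word form -> lemma), for illustration only.
TOY_LEMMAS = {"dixo": "decir", "leyes": "ley", "reyes": "rey"}


def tokenise(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def normalise(token):
    """Lowercase the token and strip diacritics (one possible normalisation)."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lemmatise(token, lemmas=TOY_LEMMAS):
    """Look the token up in the dictionary; fall back to the token itself."""
    return lemmas.get(token, token)


def preprocess(text):
    """Tokenise, normalise, and lemmatise a text, in that order."""
    return [lemmatise(normalise(token)) for token in tokenise(text)]
```

Calling `preprocess("Dixo el rey: las Leyes.")` yields a list of lemmas and punctuation marks. Note that stripping diacritics also turns `ñ` into `n`, which may or may not be desirable for Spanish or Portuguese sources.
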
## Cosine similarity

### Filters

Filter out things that are probably irrelevant for characterizing a segment, such as stopwords or everything but the top tf-idf words.

(And then compute cosine similarity again.)
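
To make the filtering idea concrete, here is a dependency-free sketch. It assumes that segments are already tokenised lists of strings; the stopword list, the smoothed idf formula, the cut-off `k`, and all function names are my own choices for the illustration, not something fixed by this notebook. The scikit-learn vectorizers imported further down could serve the same purpose.

```python
import math
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "in"}  # hypothetical toy stopword list


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_vectors(segments):
    """Weight each token by term frequency times a smoothed idf."""
    n = len(segments)
    doc_freq = Counter(token for seg in segments for token in set(seg))
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(seg).items()} for seg in segments]


def top_k(vector, k):
    """Keep only the k highest-weighted tokens (ties broken alphabetically)."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])


def filtered_similarity(seg_a, seg_b, corpus, k=5):
    """Drop stopwords, keep the tf-idf top-k tokens, then compare by cosine."""
    cleaned = [[t for t in seg if t not in STOPWORDS]
               for seg in (seg_a, seg_b, *corpus)]
    vec_a, vec_b = (top_k(vec, k) for vec in tfidf_vectors(cleaned)[:2])
    return cosine(vec_a, vec_b)
```

With the function words removed, two segments that differ only in their stopwords end up with identical vectors, whereas a cosine over raw counts would still see a difference.
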

### Boosters

Add weight to overlap in marginal numbers, question marks, (long) homographs, (long) numbers, and proper names.

(And then compute cosine similarity again.)
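
A booster can be sketched as a per-token weight applied before the cosine is computed. Everything specific here is an assumption made for the illustration: the weights, the minimum number length, and above all the crude proxy that treats any capitalised token as a proper name (which will also catch sentence-initial words). Homographs and marginal numbers are left out of this sketch.

```python
import math
from collections import Counter


def booster_weight(token, number_boost=3.0, mark_boost=2.0, name_boost=2.0,
                   min_digits=3):
    """Return a weight above 1 for tokens that are likely alignment anchors."""
    if token.isdigit() and len(token) >= min_digits:
        return number_boost          # (long) numbers
    if token == "?":
        return mark_boost            # question marks
    if token[:1].isupper():
        return name_boost            # crude proper-name proxy (assumption)
    return 1.0


def boosted_vector(tokens):
    """Count tokens and multiply each count by its booster weight."""
    return {t: count * booster_weight(t) for t, count in Counter(tokens).items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

For two segments that share a place name and a year but differ in one ordinary word, the boosted cosine comes out higher than the plain one, which is the intended effect.
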

## Postponed

### Champollion/BSA

Since Champollion [Ma 2006] and the Microsoft Bilingual Sentence Aligner [Moore 2002] are available in Perl implementations only (and we have enough alternatives), we postpone analyses with them for now.

### Bleualign

Sennrich/Volk 2010 use a machine translation (e.g. by Google or DeepL) of the source into the target language and then perform intra-language alignment. We postpone this, too.

### Cognate alignment

Darriba Bilbao/Pereira Lopes/Ildefonso 2005 align via Longest Sorted Sequence and recognize cognates from language resources.
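
As a rough illustration of the idea, and not a reimplementation of the cited paper, the sketch below reads "Longest Sorted Sequence" as the longest chain of candidate pairs whose source and target positions both increase strictly. That reading is my assumption. The cognate lexicon `COGNATES` is a hypothetical toy table standing in for the language resources mentioned above.

```python
from bisect import bisect_left

# Hypothetical toy cognate lexicon (source word -> target word).
COGNATES = {"ley": "lei", "rey": "rei", "ciudad": "cidade"}


def candidate_pairs(source, target, cognates=COGNATES):
    """All (i, j) such that target[j] is the listed cognate of source[i]."""
    return [(i, j)
            for i, word in enumerate(source)
            for j, other in enumerate(target)
            if cognates.get(word) == other]


def longest_sorted_sequence(pairs):
    """Longest chain of pairs that is strictly increasing in both positions."""
    # Sorting by (i, -j) ensures two pairs with the same i never chain together.
    ordered = sorted(pairs, key=lambda pair: (pair[0], -pair[1]))
    tails = []      # tails[k]: index of the pair ending the best chain of length k+1
    tail_js = []    # target positions of those pairs, kept sorted for bisect
    prev = [None] * len(ordered)
    for idx, (_, j) in enumerate(ordered):
        k = bisect_left(tail_js, j)
        if k == len(tails):
            tails.append(idx)
            tail_js.append(j)
        else:
            tails[k] = idx
            tail_js[k] = j
        prev[idx] = tails[k - 1] if k > 0 else None
    # Walk the back-pointers from the end of the longest chain.
    chain = []
    idx = tails[-1] if tails else None
    while idx is not None:
        chain.append(ordered[idx])
        idx = prev[idx]
    return chain[::-1]
```

The surviving pairs can then serve as anchor points between which the remaining tokens or segments are aligned.
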

### Embeddings

Try to align by recognizing similarities in word-context/word-document vectors, following Bizzoni/Reboul 2016, Bouamor/Sajjad 2018, Guo/Shen/Xang et al. 2018.
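
The general mechanism, independent of any of the cited papers, can be sketched as follows: represent each segment by the average of its word vectors and pair each source segment with the most similar target segment. The three-dimensional vectors in `TOY_VECTORS` are invented for the illustration and merely assume that both languages live in one shared vector space. The function names are hypothetical as well.

```python
import math

# Invented toy vectors in a shared cross-lingual space (assumption).
TOY_VECTORS = {
    "rey": (0.9, 0.1, 0.0), "king": (0.8, 0.2, 0.0),
    "ley": (0.1, 0.9, 0.0), "law": (0.2, 0.8, 0.1),
    "pan": (0.0, 0.1, 0.9), "bread": (0.1, 0.0, 0.9),
}


def segment_vector(tokens, vectors=TOY_VECTORS):
    """Average the vectors of all known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return tuple(sum(dim) / len(known) for dim in zip(*known))


def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_matches(source_segments, target_segments):
    """For each source segment, the index of the most similar target segment."""
    targets = [segment_vector(seg) for seg in target_segments]
    matches = []
    for seg in source_segments:
        vec = segment_vector(seg)
        scores = [cosine(vec, t) if vec is not None and t is not None else 0.0
                  for t in targets]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches
```

This greedy matching ignores segment order and allows several source segments to pick the same target. It also expects at least one target segment.
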



### Knowledge-based

To make use of knowledge graphs, we rely on the linguistic preprocessing and add a bit more graph-oriented preprocessing.

Then, for the knowledge-based mode of alignment, we use the following approaches:

- graph similarity according to (Franco-Salvador/Rosso/Montes-y-Gómez 2016)

#### Literature

- Franco-Salvador/Rosso/Montes-y-Gómez 2016: A systematic study of knowledge graph analysis for cross-language plagiarism detection
- Mohamed/Oussalah 2018: A Hybrid Approach for Paraphrase Identification Based on Knowledge-enriched Semantic Heuristics
- Paul/Rettinger et al. 2016: Efficient Graph-based Document Similarity
- Speer/Chin/Havasi 2017: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge
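
To give a first intuition of what comparing segments via knowledge graphs can look like, here is a deliberately simplified sketch. It is not the method of Franco-Salvador/Rosso/Montes-y-Gómez 2016: each segment is expanded into a small graph using a hypothetical relation table (`TOY_RELATIONS`, standing in for a resource such as BabelNet or ConceptNet), and two graphs are compared by a weighted Jaccard overlap of their nodes and edges. The weighting parameter and all names are assumptions.

```python
# Hypothetical toy knowledge base: concept -> related concepts.
TOY_RELATIONS = {
    "king": {"monarch", "ruler"},
    "queen": {"monarch", "ruler"},
    "law": {"rule", "statute"},
    "bread": {"food"},
}


def segment_graph(concepts, relations=TOY_RELATIONS):
    """Undirected edges linking each concept of a segment to its neighbours."""
    return {frozenset((concept, neighbour))
            for concept in concepts
            for neighbour in relations.get(concept, ())}


def graph_nodes(edges):
    """All nodes that occur in a set of edges."""
    return set().union(*edges) if edges else set()


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def graph_similarity(concepts_a, concepts_b, node_weight=0.5):
    """Weighted mean of node overlap and edge overlap of two segment graphs."""
    edges_a = segment_graph(concepts_a)
    edges_b = segment_graph(concepts_b)
    node_score = jaccard(graph_nodes(edges_a), graph_nodes(edges_b))
    edge_score = jaccard(edges_a, edges_b)
    return node_weight * node_score + (1 - node_weight) * edge_score
```

The point of the graph expansion is that `king` and `queen` share no tokens but still receive a non-zero similarity through their common neighbours.
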

In [None]:
# Standard library
import os
import sys
import glob
import re
import csv
import json
import locale
import ctypes
from collections import OrderedDict
from decimal import Decimal
from functools import partial
from itertools import chain

locale.setlocale(locale.LC_ALL, '')  # Use '' for auto, or force e.g. to 'en_US.UTF-8'

# XML parsing
import lxml
from lxml import etree

# Notebook display
from IPython.display import HTML, display
import tabulate

# Sentence alignment
import nltk
import nltk.translate.gale_church
import bleualign.gale_church   # from Rico Sennrich's Bleualign: https://github.com/rsennrich/Bleualign
# import _align from gale-church   # from Li Ling Tan's https://github.com/alvations/gachalign

# Vectorization and similarity
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import Pipeline, FeatureUnion


# -- Freeling

aux_dir  = "auxiliary_files"
nb_dir   = os.path.join(os.getcwd(), aux_dir)   # portable: no hard-coded Windows separators

if nb_dir not in sys.path:
    sys.path.append(nb_dir)

print(sys.path)

from auxiliary_files import pyfreeling

## Check whether we know where to find FreeLing data files
if "FREELINGDIR" not in os.environ:
    # sys.platform is "win32" on 32-bit and 64-bit Windows alike
    if sys.platform == "win32":
        os.environ["FREELINGDIR"] = "C:\\Program Files"
    else:
        os.environ["FREELINGDIR"] = "/usr"

if not os.path.exists(os.environ["FREELINGDIR"] + "/share/freeling"):
    print("Folder", os.environ["FREELINGDIR"] + "/share/freeling",
          "not found.\nPlease set FREELINGDIR environment variable to FreeLing installation directory",
          file=sys.stderr)
    sys.exit(1)

# Location of FreeLing configuration files.
DATA = os.environ["FREELINGDIR"]+"/share/freeling/"

# Init locales
pyfreeling.util_init_locale("default")

# -- graph-based

import networkx as nx
import matplotlib.pyplot as plt
