# Import dependencies

In [62]:
import pandas as pd
import requests
import os
import json
from typing import List
from util.webscraper import WebScraper
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import string
import numpy as np
import gensim.downloader as api
import gensim
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
#import spacy

https://python.langchain.com/v0.2/docs/tutorials/rag/

# Data Collection

First, we get the data from the API. As the API is not yet published, both the API URL and the query for retrieving information on edition software need to be specified in your `.env` file (see the README for details).

In [28]:
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [29]:
# get api_url and query
api_url = os.environ['API_URL']
query = os.environ['QUERY']

# get data from api
api_response = requests.get(api_url + query)
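
If one of the two variables is missing from the `.env` file, `os.environ[...]` raises a bare `KeyError`. A minimal sketch of a more explicit check is shown below. The helper name `require_env` and the variable `EXAMPLE_API_URL` are hypothetical and only serve as an illustration; they are not part of the project.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value


# Demo with a placeholder value; the real values come from the .env file.
os.environ.setdefault("EXAMPLE_API_URL", "https://example.org/api/")
print(require_env("EXAMPLE_API_URL"))
```

In the notebook, the same helper could be called with `'API_URL'` and `'QUERY'` in place of the direct dictionary lookups.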

Now that we have the data from the API, we can load it into a dataframe and prepare it for use as a knowledge base for RAG.

In [30]:
edition_software_info = json.loads(api_response.text)
edition_software_info = pd.DataFrame(edition_software_info)
edition_software_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                43 non-null     object
 1   slug              43 non-null     object
 2   brand_name        43 non-null     object
 3   concept_doi       0 non-null      object
 4   description       40 non-null     object
 5   description_url   3 non-null      object
 6   description_type  43 non-null     object
 7   get_started_url   41 non-null     object
 8   image_id          31 non-null     object
 9   is_published      43 non-null     bool  
 10  short_statement   43 non-null     object
 11  created_at        43 non-null     object
 12  updated_at        43 non-null     object
 13  closed_source     43 non-null     bool  
dtypes: bool(2), object(12)
memory usage: 4.2+ KB


A brief inspection allows us to formulate some initial tasks and questions for this experiment.

- **Preprocessing:** As we can see, not a single entry contains an associated `concept_doi`. We might consider dropping the column.
- **Impact of using short descriptions only:** Three entries are missing the in-depth description. We can assume that RAG won't be too useful for these entries.
- **Impact of additional information:** Only three entries have a `description_url`. Down the road, we need to evaluate whether adding information from this source improves the performance of the RAG system.

# Data Cleaning

### 1. Remove Artefacts

Both the `description` and `short_statement` columns seem to be of particular interest for the task at hand. To assess the necessary preprocessing steps, we'll need to take a closer look at them.

In [31]:
descriptions = edition_software_info[["description", "short_statement"]]
with pd.option_context('display.max_colwidth', None):
    display(descriptions.head())

Unnamed: 0,description,short_statement
0,"# Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI\n\n- Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen.\n\n- KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung.\n\n- Manuelles Transkribieren im Transkriptionseditor\nKI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle\n\n- Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern.\n\n\n- Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen\n\n- Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML).\n\n- Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform.","Transkribus ist eine umfassende Plattform für die Digitalisierung, Texterkennung mithilfe Künstlicher Intelligenz, Transkription und das Durchsuchen von historischen Dokumenten."
1,"autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users.\n\nSpecial features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future.\n\nautodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage.\n\n(quoted from: https://autodone.idh.uni-koeln.de/about, 19.04.2024)\n\n--- \n## Official Site:\n[https://autodone.idh.uni-koeln.de/](https://autodone.idh.uni-koeln.de/)\n\n---\n## Usage Instructions\n[https://autodone.idh.uni-koeln.de/usage](https://autodone.idh.uni-koeln.de/usage)\n","Autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users."
2,"[CollateX](http://collatex.net/) is a software to\n\n 1. read **multiple (≥ 2) versions of a text**, splitting each version into parts (tokens) to be compared,\n 1. **identify similarities of and differences between the versions** (including moved/transposed segments) by aligning tokens, and\n 1. output the alignment results in a **variety of formats for further processing**, for instance\n 1. to support **the production of a critical apparatus** or the stemmatical analysis of a text's genesis.\n\nIt resembles software used to compute differences between files (e.g. [diff](http://en.wikipedia.org/wiki/Diff)) or tools for [sequence alignment](http://en.wikipedia.org/wiki/Sequence_alignment) which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results.\n\nAs such it is primarily designed for use cases in disciplines like [Philology](http://en.wikipedia.org/wiki/Philology) or – more specifically – the field of [Textual Criticism](http://en.wikipedia.org/wiki/Textual_criticism) where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable.\n\nPlease go to <http://collatex.net/> for further information.","CollateX is a software to (a.) read multiple versions of a text, (b.) identify differences by aligning tokens, and (c.) output the alignment results for further processing, for instance (d.) to support the production of a critical apparatus or the stemmatical analysis of a text's genesis."
3,"Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache.\n\nMit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket","LaTeX (gesprochen “Lah-tech” oder “Lay-tech”), ist eine Textsatz*sprache* und ein *Programm* für die Erstellung qualitativ hochwertiger Druckausgaben. Ursprünglich entwickelt für mathematischen Textsatz wird es heute für alle Arten von wissenschaftlichen Texten und auch darüber hinaus eingesetzt."
4,,The Research Software Directory is a content management system that is tailored to research software.


As we can see, the `description` column contains some formatting artefacts like `\n` and markdown syntax like `**` and `#`. Let's clean them up.
Each artefact is replaced by a single space, so the cleaned text can still contain runs of whitespace.

In [32]:
pattern = '\\n+'
edition_software_info["description_clean"] = edition_software_info["description"].str.replace(pattern, ' ', regex=True)

pattern = r'[*#]+|\s-+\s|]]' #\[\]()<>
edition_software_info["description_clean"] = edition_software_info["description_clean"].str.replace(pattern, ' ', regex=True)

with pd.option_context('display.max_colwidth', None):
    display(edition_software_info[["brand_name", "description", "description_clean"]].head())

Unnamed: 0,brand_name,description,description_clean
0,Transkribus,"# Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI\n\n- Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen.\n\n- KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung.\n\n- Manuelles Transkribieren im Transkriptionseditor\nKI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle\n\n- Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern.\n\n\n- Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen\n\n- Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML).\n\n- Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform.","Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen. KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung. Manuelles Transkribieren im Transkriptionseditor KI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern. Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML). Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform."
1,Autodone,"autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users.\n\nSpecial features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future.\n\nautodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage.\n\n(quoted from: https://autodone.idh.uni-koeln.de/about, 19.04.2024)\n\n--- \n## Official Site:\n[https://autodone.idh.uni-koeln.de/](https://autodone.idh.uni-koeln.de/)\n\n---\n## Usage Instructions\n[https://autodone.idh.uni-koeln.de/usage](https://autodone.idh.uni-koeln.de/usage)\n","autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users. Special features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future. autodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage. (quoted from: https://autodone.idh.uni-koeln.de/about, 19.04.2024) Official Site: [https://autodone.idh.uni-koeln.de/](https://autodone.idh.uni-koeln.de/) Usage Instructions [https://autodone.idh.uni-koeln.de/usage](https://autodone.idh.uni-koeln.de/usage)"
2,CollateX,"[CollateX](http://collatex.net/) is a software to\n\n 1. read **multiple (≥ 2) versions of a text**, splitting each version into parts (tokens) to be compared,\n 1. **identify similarities of and differences between the versions** (including moved/transposed segments) by aligning tokens, and\n 1. output the alignment results in a **variety of formats for further processing**, for instance\n 1. to support **the production of a critical apparatus** or the stemmatical analysis of a text's genesis.\n\nIt resembles software used to compute differences between files (e.g. [diff](http://en.wikipedia.org/wiki/Diff)) or tools for [sequence alignment](http://en.wikipedia.org/wiki/Sequence_alignment) which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results.\n\nAs such it is primarily designed for use cases in disciplines like [Philology](http://en.wikipedia.org/wiki/Philology) or – more specifically – the field of [Textual Criticism](http://en.wikipedia.org/wiki/Textual_criticism) where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable.\n\nPlease go to <http://collatex.net/> for further information.","[CollateX](http://collatex.net/) is a software to 1. read multiple (≥ 2) versions of a text , splitting each version into parts (tokens) to be compared, 1. identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and 1. output the alignment results in a variety of formats for further processing , for instance 1. to support the production of a critical apparatus or the stemmatical analysis of a text's genesis. 
It resembles software used to compute differences between files (e.g. [diff](http://en.wikipedia.org/wiki/Diff)) or tools for [sequence alignment](http://en.wikipedia.org/wiki/Sequence_alignment) which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results. As such it is primarily designed for use cases in disciplines like [Philology](http://en.wikipedia.org/wiki/Philology) or – more specifically – the field of [Textual Criticism](http://en.wikipedia.org/wiki/Textual_criticism) where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable. Please go to <http://collatex.net/> for further information."
3,LaTeX,"Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache.\n\nMit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket","Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache. Mit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket"
4,Research Software Directory,,
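
Because the cleaning step above replaces each artefact with a space, the result can contain double spaces or a space before punctuation. A minimal sketch of how runs of whitespace could be collapsed afterwards is given below. The function name `collapse_whitespace` and the sample string are assumptions made for illustration.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


sample = "read  multiple versions \n\n of a text ,  splitting"
print(collapse_whitespace(sample))
```

On the dataframe, the same idea could be expressed with `.str.replace(r"\s+", " ", regex=True).str.strip()` on the `description_clean` column.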


### 2. Fill nan values

Before we continue preprocessing the data for later vectorization, we need to check for missing values and replace them with empty strings.

In [33]:
# assign the result back instead of using inplace=True on a column selection,
# which triggers a FutureWarning and will stop working in pandas 3.0
edition_software_info["description_clean"] = edition_software_info["description_clean"].fillna("")


# Scrape Webpages

To provide additional context-information for the retrieval process, we'll scrape all webpages referenced in the software-description.

### 1. Get urls

First, we isolate the urls from our description.

In [34]:
pattern = r"((?:https?:\/\/|w{3}\.)[\w\d%/.-]+)"  # escaped dot: match a literal "www." only

urls = edition_software_info["description"].str.extractall(pattern)
urls = urls.droplevel(1)
urls_grouped = urls.groupby(urls.index).agg((lambda x: ','.join(set(x))))
edition_software_info["urls"] = urls_grouped

with pd.option_context('display.max_colwidth', None):
    display(edition_software_info[["description_clean", "urls"]].head())

Unnamed: 0,description_clean,urls
0,"Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen. KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung. Manuelles Transkribieren im Transkriptionseditor KI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern. Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML). Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform.",
1,"autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users. Special features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future. autodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage. (quoted from: https://autodone.idh.uni-koeln.de/about, 19.04.2024) Official Site: [https://autodone.idh.uni-koeln.de/](https://autodone.idh.uni-koeln.de/) Usage Instructions [https://autodone.idh.uni-koeln.de/usage](https://autodone.idh.uni-koeln.de/usage)","https://autodone.idh.uni-koeln.de/usage,https://autodone.idh.uni-koeln.de/about,https://autodone.idh.uni-koeln.de/"
2,"[CollateX](http://collatex.net/) is a software to 1. read multiple (≥ 2) versions of a text , splitting each version into parts (tokens) to be compared, 1. identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and 1. output the alignment results in a variety of formats for further processing , for instance 1. to support the production of a critical apparatus or the stemmatical analysis of a text's genesis. It resembles software used to compute differences between files (e.g. [diff](http://en.wikipedia.org/wiki/Diff)) or tools for [sequence alignment](http://en.wikipedia.org/wiki/Sequence_alignment) which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results. As such it is primarily designed for use cases in disciplines like [Philology](http://en.wikipedia.org/wiki/Philology) or – more specifically – the field of [Textual Criticism](http://en.wikipedia.org/wiki/Textual_criticism) where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable. Please go to <http://collatex.net/> for further information.","http://collatex.net/,http://en.wikipedia.org/wiki/Diff,http://en.wikipedia.org/wiki/Textual_criticism,http://en.wikipedia.org/wiki/Philology,http://en.wikipedia.org/wiki/Sequence_alignment"
3,"Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache. Mit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket",
4,,
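
To see what the URL pattern does on a single string, the extraction can be sketched with the standard `re` module. This is an illustration under assumptions, not the notebook's method: the helper name `extract_urls` is hypothetical, the trailing-dot stripping is an addition, and duplicates are removed in order of first appearance (the `set` used above does not guarantee an order).

```python
import re

# Same idea as the pattern above: a scheme or "www." followed by URL characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[\w%/.-]+")


def extract_urls(text: str) -> list:
    """Return the unique URLs in `text`, in order of first appearance."""
    found = URL_PATTERN.findall(text)
    # A URL at the end of a sentence would otherwise keep the full stop.
    cleaned = [url.rstrip(".") for url in found]
    return list(dict.fromkeys(cleaned))


sample = (
    "Official Site: [https://autodone.idh.uni-koeln.de/](https://autodone.idh.uni-koeln.de/) "
    "Usage: https://autodone.idh.uni-koeln.de/usage."
)
print(extract_urls(sample))
```

The markdown link contributes the same URL twice (label and target), which is why deduplication is needed before scraping.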


### 2. Scrape Webpages

Now we scrape the paragraphs from the webpages we found.
The web scraper takes the comma-separated URLs associated with an entry and stores the paragraphs (`<p>` tags) from all of those pages as a single string in a new column of our dataframe. Wikipedia pages are excluded.

**This might take some time**

In [35]:
webscraper = WebScraper(tags = ["p"], exclude = ["wikipedia"])
edition_software_info["webpages_text"] = edition_software_info["urls"].apply(lambda x: webscraper.scrape(x))

Scraping https://autodone.idh.uni-koeln.de/usage with parameters tags = ['p']
Scraping https://autodone.idh.uni-koeln.de/about with parameters tags = ['p']
Scraping https://autodone.idh.uni-koeln.de/ with parameters tags = ['p']
Scraping http://collatex.net/ with parameters tags = ['p']
Scraping http://www.tei-c.org/ with parameters tags = ['p']
Scraping http://vbd.humnet.unipi.it/ with parameters tags = ['p']
Scraping https://sites.fastspring.com/stefanhagel/product/cte with parameters tags = ['p']
Scraping http://csel.at/ with parameters tags = ['p']
Scraping https://www.oeaw.ac.at/oesterreichische-akademie-der-wissenschaften with parameters tags = ['p']
Scraping https://cte.oeaw.ac.at/ with parameters tags = ['p']
Scraping https://opensource.org/licenses/EUPL-1.2 with parameters tags = ['p']
Scraping https://phylipweb.github.io/phylip/general.html with parameters tags = ['p']
Scraping https://www.sglp.uzh.ch/static/MLS/stemmatology/PAUP_229150101.html with parameters tags = ['p']

### 3. Inspect data

In [36]:
edition_software_info[["urls", "webpages_text"]].head()

Unnamed: 0,urls,webpages_text
0,,
1,"https://autodone.idh.uni-koeln.de/usage,https:...",On this page you will find instructions on how...
2,"http://collatex.net/,http://en.wikipedia.org/w...","“In a language, in the system of language, the..."
3,,
4,,


Now that the data is collected from the webpages, we can take a look at the average length of the texts received for each entry.

In [37]:
length = edition_software_info["webpages_text"].apply(lambda x: len(x) if not pd.isna(x) else 0)
length[length>0].describe()

count       13.000000
mean     15504.153846
std      17593.002197
min        640.000000
25%       2300.000000
50%       6739.000000
75%      23578.000000
max      55063.000000
Name: webpages_text, dtype: float64

Looking only at the 13 entries for which we were able to collect webpage text, we have an average character count of about 15,500 per entry.
The standard deviation (about 17,600) is larger than the mean, indicating a high degree of variability in character counts.

The distribution is right-skewed: most entries have comparatively short texts (the median is 6,739 characters), while a few outliers with high character counts (up to 55,063) pull the mean upwards.
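
The two observations above can be made concrete with a little arithmetic on the summary statistics. The sketch below only reuses the numbers printed by `describe()`; the variable names are illustrative.

```python
# Values copied from the describe() output above.
mean_chars = 15504.153846
std_chars = 17593.002197
median_chars = 6739.0

# Coefficient of variation: standard deviation relative to the mean.
cv = std_chars / mean_chars
# A mean well above the median is a sign of right skew.
mean_to_median = mean_chars / median_chars

print(f"coefficient of variation: {cv:.2f}")
print(f"mean / median: {mean_to_median:.2f}")
```

A coefficient of variation above 1 and a mean more than twice the median both point to a few very long texts dominating the average.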



# Export dataset

In [38]:
current_dir = os.getcwd()
path = os.path.join(current_dir, 'data/edition_software_info.csv')
# index=False avoids an extra "Unnamed: 0" column when the file is read back in
edition_software_info.to_csv(path, index=False)

# Preprocessing

Now we can preprocess the collected text using a small helper function (`preprocess`, defined below). Again, we have to replace missing values with empty strings.

First, let us reimport the data.

In [39]:
current_dir = os.getcwd()
path = os.path.join(current_dir, "data/edition_software_info.csv")
edition_software_info = pd.read_csv(path)
edition_software_info[["description", "description_clean", "webpages_text", "urls"]] = edition_software_info[["description", "description_clean", "webpages_text", "urls"]].fillna('')

### 1. Remove links

First, we'll remove all links from the descriptions.

In [40]:
pattern = r"((?:https?:\/\/|w{3}\.)[\w\d%/.-]+)"  # escaped dot: match a literal "www." only
edition_software_info["description_preprocessed"] = edition_software_info["description_clean"].str.replace(pattern, '', regex=True)
edition_software_info["description_preprocessed"].head(3)

0      Erkennen, Transkribieren und Durchsuchen von...
1    autodone is a service for the automated, time-...
2    [CollateX]() is a software to  1. read  multip...
Name: description_preprocessed, dtype: object

### 2. Remove Stopwords and Punctuation

For the later vectorization of the texts, we lowercase them, strip punctuation, and remove both German and English stopwords.

In [41]:
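def preprocess(stopwords: List[str], text: str) -> str:
    """Lowercase `text`, strip ASCII punctuation and drop all words found in `stopwords`."""
    text = text.lower()
    # string.punctuation only covers ASCII, so characters such as “ or ≥ are kept
    text = text.translate(str.maketrans('', '', string.punctuation))    
    text = ' '.join([word for word in text.split() if word not in stopwords])
    return text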

In [42]:
# get stopwords
stopwords_english = set(stopwords.words('english'))
stopwords_german = set(stopwords.words('german'))
stopwords_combined = stopwords_german.union(stopwords_english)

edition_software_info["description_preprocessed"] = edition_software_info["description_preprocessed"].apply(lambda x: preprocess(stopwords_combined, x))
edition_software_info["description_preprocessed"].head()


0    erkennen transkribieren durchsuchen historisch...
1    autodone service automated timecontrolled publ...
2    collatex software 1 read multiple ≥ 2 versions...
3    mathematiker donald e knuth entwickelte ende s...
4                                                     
Name: description_preprocessed, dtype: object

In [43]:
edition_software_info["webpages_text_preprocessed"] = edition_software_info["webpages_text"].apply(lambda x: preprocess(stopwords_combined, x) if not pd.isna(x) else "")
edition_software_info["webpages_text_preprocessed"].head()

0                                                     
1    page find instructions use autodone based scre...
2    “in language system language differences”– jac...
3                                                     
4                                                     
Name: webpages_text_preprocessed, dtype: object
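
To make the effect of the `preprocess` helper easier to follow, here is a self-contained sketch that runs the same three steps on single sentences. The tiny stopword set is an assumption for illustration; the notebook uses the NLTK lists instead.

```python
import string
from typing import Iterable


def preprocess(stopwords: Iterable[str], text: str) -> str:
    """Lowercase, strip ASCII punctuation, then drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in stopwords)


# Toy stopword set, for illustration only.
toy_stopwords = {"is", "a", "for", "the", "don't"}

first = preprocess(
    toy_stopwords,
    "autodone is a service for the automated, time-controlled publication.",
)
second = preprocess(toy_stopwords, "Don't panic")
print(first)
print(second)
```

Two side effects are worth keeping in mind. Hyphenated words are fused (`time-controlled` becomes `timecontrolled`, as in the output above), and because punctuation is removed before the stopword lookup, a stopword containing an apostrophe such as `don't` turns into `dont` and is no longer filtered out.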

### 3. Lemmatize

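Finally, we could lemmatize our texts. The spaCy-based code below is currently disabled (it is wrapped in a string literal and the `spacy` import is commented out), so this step is skipped for now.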

In [44]:
"""
def lemmatize_english_text(text):
    doc = nlp_en(text)
    return ' '.join([token.lemma_ for token in doc])

def lemmatize_german_text(text):
    doc = nlp_de(text)
    return ' '.join([token.lemma_ for token in doc])

edition_software_info["description_preprocessed"] = edition_software_info["description_preprocessed"].apply(lambda x: lemmatize_english_text(x))
edition_software_info["description_preprocessed"] = edition_software_info["description_preprocessed"].apply(lambda x: lemmatize_german_text(x))
edition_software_info["description_preprocessed"].head()
"""


# Vectorization 1: TF-IDF

We'll start off with a simple TF-IDF vectorization.

**Term Frequency-Inverse Document Frequency (TF-IDF)** is a weighting scheme that weights the cells of a term-document matrix by their potential to be discriminatory.

To do so, we first calculate the **term frequency (TF)**. The term frequency is the number of occurrences of a given word $t$ in a document $d$, normalized by the total number of words in $d$.

$$
\text{TF}(t, d) = \frac{\text{Count of } t \text{ in } d}{\text{Total number of words in } d}
$$

This term frequency is then multiplied by the **inverse document frequency (IDF)**. The IDF is calculated by counting all documents that contain a term $t$ (the document frequency $\text{df}(t)$). Then, we divide the total number of documents $N$ in the corpus by $\text{df}(t)$.

This inverse frequency is chosen over the regular frequency to **downweight** terms that appear in many documents, since these terms are less likely to be useful for distinguishing between documents.

Usually, we also take the logarithm of this ratio to dampen the very large values that can occur when a term appears in only a few documents. This ensures that rare terms are not excessively weighted.

$$
\text{df}(t) = \text{Document frequency of a term } t
$$
$$
N = \text{Number of documents}
$$
$$
\text{IDF}(t) = \log\left(\frac{N}{\text{df}(t)}\right)
$$

Finally, we calculate the **TF-IDF** by multiplying the term frequency $\text{TF}(t, d)$ with the inverse document frequency $\text{IDF}(t)$.

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$

The resulting value can be interpreted as a measure of the importance of the term in a document relative to the entire corpus. Terms that are frequent in a document but rare across the corpus will have higher TF-IDF scores, indicating their importance.
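
The formulas above can be checked by hand on a toy corpus. The following sketch implements them literally in plain Python. The three example documents are made up for illustration, and `idf` assumes the term occurs in at least one document (otherwise $\text{df}(t)$ would be zero).

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative data only).
docs = [
    "edition software transcription".split(),
    "edition software collation".split(),
    "typesetting software".split(),
]


def tf(term, doc):
    """Count of the term in the document divided by the document length."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """log(N / df(t)); assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


for term in ("software", "edition", "transcription"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

`software` occurs in every document, so its IDF is $\log(3/3) = 0$ and its score vanishes. `transcription` occurs in only one document and receives the highest score of the three.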


**N-grams:**

To capture not just the importance of single words but also some of the **context** in which they are used, we can apply TF-IDF to **n-grams**. N-grams are contiguous sequences of $n$ words that appear together in a text. The size of the sequence, $n$, is a hyperparameter that can be adjusted depending on the specific task. 
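
As a small illustration of what an n-gram is, the sketch below lists all word n-grams of a token sequence. The helper name `ngrams` and the example tokens are assumptions. The range 1 to 4 mirrors the `ngram_range=(1,4)` used for the vectorizer in the next step.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["digital", "scholarly", "edition", "software"]

# Collect unigrams up to 4-grams.
all_ngrams = [gram for n in range(1, 5) for gram in ngrams(tokens, n)]
for gram in all_ngrams:
    print(gram)
```

Four tokens yield 4 + 3 + 2 + 1 = 10 n-grams, which shows how quickly the vocabulary grows when longer n-grams are included.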


### 1: Fit TF-IDF Vectorizer
First, we fit the vectorizer on the preprocessed descriptions. 
This way, the vectorizer can transform text into numerical feature vectors based on the learned vocabulary and its distribution over documents.

In [45]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,4))
tfidf_matrix = tfidf_vectorizer.fit_transform(edition_software_info["description_preprocessed"])

# display the resulting matrix
tfidf_matrix_beautify = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_matrix_beautify.head()

Unnamed: 0,12,12 languages,12 languages comprehensive,12 languages comprehensive righttoleft,1514,1514 digital,1514 digital marburg,1514 digital marburg büchner,19042024,19042024 official,...,überführt neben klassischen,überführt neben klassischen pdfformat,überschaubaren,überschaubaren teil,überschaubaren teil funktionen,überschaubaren teil funktionen aufgrund,überwiegende,überwiegende teil,überwiegende teil funktionen,überwiegende teil funktionen editionen
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064201,0.064201,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
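
The values in this matrix will not match the textbook formula exactly. With its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), scikit-learn's `TfidfVectorizer` uses raw counts as term frequency, the smoothed IDF $\ln\frac{1+N}{1+\text{df}(t)} + 1$, and then scales every row to unit length. The following plain-Python sketch re-implements that weighting for illustration. The function name and the toy documents are assumptions, and tokenization is left out entirely.

```python
import math
from collections import Counter


def sklearn_style_tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows (scikit-learn defaults)."""
    n_docs = len(docs)
    vocabulary = sorted({term for doc in docs for term in doc})
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        # An empty document has norm 0 and stays an all-zero row.
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


docs = [
    ["edition", "software", "transcription"],
    ["edition", "software", "collation"],
    ["typesetting", "software"],
    [],  # stands in for an entry without a description
]
vocabulary, rows = sklearn_style_tfidf(docs)
print(vocabulary)
for row in rows:
    print([round(w, 3) for w in row])
```

Two consequences are visible in the matrix above: entries without a description (such as row 4) are all zeros, and a term that occurs in every document still receives a small positive weight instead of zero.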


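### 2: Inspect the TF-IDF representations

Each column in this dataframe is a unique n-gram (one to four consecutive words, as set by `ngram_range=(1,4)`), while each row is a document. Each cell holds the TF-IDF score of that n-gram in that document: its frequency in the document, weighted by how distinctive the n-gram is across the corpus. Note that `TfidfVectorizer` additionally L2-normalizes each row by default.

Let's take a look at the terms whose TF-IDF score exceeds a threshold for each description. You can find them in the column "tf_idf_filtered_words".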

In [46]:
# Define the threshold for TF-IDF scores
threshold = 0.1

# Filter words with TF-IDF scores greater than the threshold for each document
def filter_words_by_threshold(row, threshold):
    filtered_words = [(word, score) for word, score in zip(tfidf_matrix_beautify.columns, row) if score > threshold]
    return sorted(filtered_words, key=lambda x: x[1], reverse=True)


# Apply the function to each row of the TF-IDF DataFrame
filtered_words = tfidf_matrix_beautify.apply(lambda row: filter_words_by_threshold(row, threshold), axis=1)

# Create a dataframe to display the filtered words
filtered_words_df = pd.DataFrame(filtered_words, columns=["tf_idf_filtered_words"])
tfidf_display = pd.concat([edition_software_info["description_clean"], filtered_words_df], axis=1)
    
with pd.option_context('display.max_colwidth', None):
    display(tfidf_display[["description_clean", "tf_idf_filtered_words"]].head(5))

Unnamed: 0,description_clean,tf_idf_filtered_words
0,"Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen. KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung. Manuelles Transkribieren im Transkriptionseditor KI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern. Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML). Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform.","[(dokumenten, 0.24192478472674683), (durchsuchen, 0.12096239236337342), (erkennen, 0.12096239236337342), (erkennung, 0.12096239236337342), (transkribieren, 0.12096239236337342)]"
1,"autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users. Special features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future. autodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage. (quoted from: https://autodone.idh.uni-koeln.de/about, 19.04.2024) Official Site: [https://autodone.idh.uni-koeln.de/](https://autodone.idh.uni-koeln.de/) Usage Instructions [https://autodone.idh.uni-koeln.de/usage](https://autodone.idh.uni-koeln.de/usage)","[(twitter, 0.1926029459944442), (autodone, 0.1284019639962961), (usage, 0.11567598600740456), (service, 0.10664676093079592)]"
2,"[CollateX](http://collatex.net/) is a software to 1. read multiple (≥ 2) versions of a text , splitting each version into parts (tokens) to be compared, 1. identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and 1. output the alignment results in a variety of formats for further processing , for instance 1. to support the production of a critical apparatus or the stemmatical analysis of a text's genesis. It resembles software used to compute differences between files (e.g. [diff](http://en.wikipedia.org/wiki/Diff)) or tools for [sequence alignment](http://en.wikipedia.org/wiki/Sequence_alignment) which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results. As such it is primarily designed for use cases in disciplines like [Philology](http://en.wikipedia.org/wiki/Philology) or – more specifically – the field of [Textual Criticism](http://en.wikipedia.org/wiki/Textual_criticism) where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable. Please go to <http://collatex.net/> for further information.","[(differences, 0.12889590979174362), (computational, 0.10345985709520088), (similarities, 0.10345985709520088), (similarities differences, 0.10345985709520088), (tokens, 0.10345985709520088)]"
3,"Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache. Mit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket","[(jahre, 0.1928958299837409)]"
4,,[]


### 3: Test the TF-IDF representation

This cell returns the most relevant documents from our dataset, ranked by the cosine similarity between each document's TF-IDF representation and that of a query. The query can be changed.

In [47]:
query = 'I am looking for a transcription tool'
# Preprocess the query
query = preprocess(stopwords_combined, query)
# Transform the query to TF-IDF space
query_tfidf = tfidf_vectorizer.transform([query]) 

# Compute cosine similarity between the query and the documents
similarities = cosine_similarity(query_tfidf, tfidf_matrix)

similarity_df = pd.DataFrame({
    'similarity_score': similarities[0]
})

result_df = pd.concat([edition_software_info[["description_clean"]], similarity_df], axis=1)
result_df_sorted = result_df.sort_values(by='similarity_score', ascending=False)

# Display the top 3 descriptions that might be relevant to our query
with pd.option_context('display.max_colwidth', None):
    display(result_df_sorted[['description_clean', 'similarity_score']].head(3))


Unnamed: 0,description_clean,similarity_score
22,"[Transcribo](https://tcdh.uni-trier.de/en/projekt/transcribo) is an editing tool developed by the Trier Center for Digital Humanities as part of the project “Arthur Schnitzler: Digital Historical-Critical Edition”. The digital tool can support users in transcribing texts in various fields. In addition to the actual transcription, the texts can also be annotated and text-genetic facts can be marked up. Transcribo offers the possibility to transcribe texts productively and in a time-saving way, using the same intuitive approach as you are used to from manual transcription. The tool offers all the subtleties needed for a differentiated transcription and supports the working process both efficiently and easily comprehensible. Transcribe almost like with sheet and pencil Transcribing is done in a graphical editor directly on the facsimile, simulating the procedure of analog transcribing with sheet and pencil. This intuitive approach is an advantage especially when transcribing difficult to decipher manuscripts or heavily revised typescript pages and additionally reduces the learning time for the tool, since no markup language (e.g. TEI/XML) has to be learned. Another advantage of this method is that by working on the facsimile, a positional link between the transcribed text and the image is automatically stored. This positional data can be used in the context of the electronic presentation to visualize the link between text and facsimile for the user in a sustainable way, e.g. in the form of cross-fades or mouse-over effects. Can be used both locally and with FuD Transcribo can be used both locally and in conjunction with our FuD system. In local operation, facsimiles located on the terminal device can be opened and edited. All processed information is stored locally in the form of XML files. When FuD is used in addition, the data is stored in FuD's own relational database, which enables collaborative work. 
In addition, some functionalities, such as the recording of cross-page phenomena, the correction tool or the storage of text states, are only available in this operating mode. The many years of development work accompanying the project have paid off. Thus, adaptations and ideas from numerous humanities projects at TCDH have been incorporated into the work and have resulted in a very differentiated and practically tested tool. Some of the most important features of Transcribo 1. Fine-grained markup of microgenetic facts (at document, page, sentence/line, word, and graph levels) 2. Markup of complex interrelated phenomena/change operations via relations (page and cross-page) 3. Correction function for automated comparison of A and B files from different users to ensure error-free transcription and annotation ('double-blind process') 4. OCR functionality for automatic reading of typescripts or prints (Tesseract) 5. Navigation perspective: interface to the FuD database (linking facsimiles with metadata recorded in FuD) 6. Structural perspective: overview of a text section of any size (synopsis of different graphical representations: Facsimile, transcription, annotations) and possibility of depositing cross-page phenomena. 7. Module for defining and labeling text states or text layers. Technical requirements The application was implemented in the Eclipse integrated development environment. It is a rich client application that reuses parts of the development environment and was developed in the Java programming language. We currently provide a version for Windows and macOS operating systems. 
Projects using Transcribo [Arthur Schnitzler Digital](https://tcdh.uni-trier.de/en/projekt/arthur-schnitzler-digital) [The Augsburg Master Builder’s Ledgers](https://tcdh.uni-trier.de/en/projekt/augsburg-master-builders-ledgers) [Digitale Edition and Analysis of the Medulla Gestorum Treverensium by Johann Enen (1514)](https://tcdh.uni-trier.de/en/projekt/digitale-edition-and-analysis-medulla-gestorum-treverensium-johann-enen-1514) [Digital Marburg Büchner Edition](https://tcdh.uni-trier.de/en/projekt/digital-marburg-buchner-edition) [Johann Caspar Lavater](https://tcdh.uni-trier.de/en/projekt/johann-caspar-lavater) [Kurt Schwitters' Intermedia Networks of the Avant-garde](https://tcdh.uni-trier.de/en/projekt/kurt-schwitters-intermedia-networks-avant-garde) [Stefan Heym: “Ahasver”](https://tcdh.uni-trier.de/en/projekt/stefan-heym-ahasver) [Digitalization of the Plock Bible, Old High German Dictionary](https://tcdh.uni-trier.de/en/projekt/digitalization-plock-bible) [Old High German Dictionary](https://tcdh.uni-trier.de/en/projekt/old-high-german-dictionary)",0.091537
38,"Scripto is an open-source tool that permits registered users to view digital files and transcribe them with an easy-to-use toolbar, rendering that text searchable. The tool includes a versioning history and editorial controls to make public contributions more manageable, and supports the transcription of a wide range of file types (both images and documents). There are two versions of Scripto, each of which works with a different version of Omeka. Scripto for Omeka Classic creates a single transcription project for the content of your Omeka Classic site. Scripto for Omeka S enables the creation of multiple projects built from shared items in your Omeka S installation.",0.066523
42,"The Virtual Manuscript Room Collaborative Research Environment (VMR CRE) brings community and a toolbox of powerful research components to support all stages of research and production of a digital edition. Beginning with the popular open-source portal, Liferay, the VMR CRE integrates 30 DH components to naturally support: Cataloging witnesses; managing and displaying images; producing well-formed TEI transcriptions using a web-based WYSIWYM editor and storing those transcriptions to a versioned transcription repository; community volunteer task assignment and project management; automatic realtime collation of witnesses; regularization and apparatus editing; online publishing of the final results-- as a traditional apparatus, or with interactive tools which let users choose different ways to visualize the data produced in the edition. Overview A walk through the workflow at the Institut für neutestamentliche Textforschung (INTF), in their efforts to edit the Editio Critica Maior (ECM), provides opportunity to touch on many components available in the VMR CRE. Work can be divided into 9 discrete stages, progressively: 1) witness cataloging; 2) witness selection; 3) image management; 4) indexing of folio content; 5) transcribing; 6) collating; 7) regularizing; 8) editing an apparatus; 9) genealogical analysis of the witness corpus. Metadata and Feature Tagging The VMR CRE stores with each manuscript a very limited set of descriptive data, reserving the primary metadata capture for a dynamic tagging facility called Feature Tagging. A Feature is any defined metadata information which might be captured for a manuscript or manuscript page. For example, an alternative catalog identifier, an external image repository, the canvas material type, the ink type, the script type; these are all Features which might be tagged on a manuscript; For individual pages: an illumination, a canon table, or even individual sample script characters might be tagged as Features. 
These Features must first be defined in the system, and the VMR CRE comes by default with a predefined set of Feature Definitions used at the INTF. A Feature Definition can specify that zero or more values should be captured with the Feature tag and what those value types and value domains should be. Once a Feature is defined, it can be used to tag manuscripts or manuscript pages, capturing individual Feature values for each tag, if necessary. Every Feature Definition adds to the number of facets available in the catalog search facility. For example, one might search for all manuscript folio sides from Egypt which include Illuminations and any part of the Gospel of John. A Feature tagged to a manuscript page can also include a region box, marking the area on a folio image where the Feature is present. If a region box is captured, a search query can specify to show the region box clips in the result. For example, a paleographer might choose to capture a set of representative letters for each manuscript and then perform a search for all double column manuscripts with a height of at least 20cm between the II and V centuries, and to ask the query results to show the representative α (alpha) clips. Transcription and Reconciliation Transcription work in the VMR CRE is done using a What You See Is What You Mean (WYSIWYM) web-based editor originally developed by the University of Trier in collaboration with the INTF and ITSEE in Birmingham. This transcription editor has been developed as a plugin for the popular TinyMCE HTML editor component. The editor includes menus and dialogs to assist the researcher with composing a transcription, without asking the transcriber to learn special markup codes. The content may then be obtained as EpiDoc influenced TEI. The VMR CRE saves content in a versioned transcription repository backed by Git. 
A user may have access to create and edit their own personal transcription, a project-wide transcription, or a site-wide (= published) transcription-- each having version history. The VMR CRE also includes a palaeography tool to assist a transcriber when encountering rare symbols, abbreviations, or ligatures. If a portion of the unknown text can be identified, the researcher can enter one or more letters and will be presented with images of text instances elsewhere which include these letters, offering possibilities. As more and more rare text items are tagged, the system grows more helpful. Quality assurance for the ECM requires that a transcription for a manuscript be produced independently by two transcribers. The products are then compared to each other and differences are reviewed by a manager and reconciled to produce a final transcription. The VMR CRE provides tools to facilitate this reconciliation work. Collation and Visualization Collation is a key component to find differences in text witnesses when producing a critical edition. Collation facilities in the VMR CRE are performed by CollateX. Collation and regularization of uninteresting differences is an iterative cycle in digital editing and the VMR CRE ties these two actions together with an intuitive visual interface. Visualization of a collation, either during the editing process or for the reader, can be rendered as a variant graph, an alignment table, or as a traditional negative apparatus. Web Services, Open Programmatic Access The VMR CRE Web Services API layer is primarily useful for exposing the functionality of the VMR CRE to other research projects wishing to access the functionality or contribute to the dataset through their own systems and tools. The VMR CRE Web Services API generally uses noun/verb nomenclature organized by category. 
This means that the last 2 segments of an API URL will consist first of the type of object the call will affect, and second, of the action to be performed on the object. Any path before the final two segments are merely for organizational purposes. This is different from a strict REST convention which confines the action to one of 6 HTTP verbs. The VMR CRE places no semantic meaning on the HTTP verb. Both GET and POST HTTP verbs are accepted as identical, relegating the verb, or semantic action to the final segment of the URL. This allows easy testing and examples for every action directly within a web browser. Parameters to a service request are passed as standard HTTP FORM POST parameters or as query string parameters.",0.065983
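The ranking step used above (compute cosine similarities, sort, keep the best matches) can be sketched as a small reusable function. This is only an illustration on dense NumPy arrays: `top_k_by_cosine` and the toy vectors are hypothetical names and data, and for the sparse `tfidf_matrix` one would either keep using scikit-learn's `cosine_similarity` or convert with `.toarray()` first.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_matrix, k=3):
    """Return (row index, score) pairs for the k rows most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float).ravel()
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Cosine similarity = dot product divided by the product of the vector norms
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # All-zero vectors (e.g. empty descriptions) get a similarity of 0
    safe_norms = np.where(norms == 0, 1.0, norms)
    scores = np.where(norms == 0, 0.0, (doc_matrix @ query_vec) / safe_norms)
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

# Toy document vectors; the last row mimics an empty description
toy_docs = np.array([[1.0, 0.0, 0.0],
                     [0.5, 0.5, 0.0],
                     [0.0, 0.0, 0.0]])
ranking = top_k_by_cosine([1.0, 0.0, 0.0], toy_docs, k=2)
print(ranking)
```

Because cosine similarity ignores vector length, long descriptions are not favoured merely for containing more words; only the direction of the vector matters.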


# Vectorization 2: Aggregated Word2Vec


Next, we'll create document representations by aggregating the word2vec embeddings of each word in a description. 

Word2vec encodes the meaning of the words by capturing their semantic relationships based on the context in which they appear. By aggregating the word2vec embeddings of each word in a description, we can create a document representation that retains the semantic information and provides a more nuanced understanding of the content.

From a computational perspective, these representations are much shorter and denser than TF-IDF representations: with the model used here, each document becomes a single 300-dimensional vector instead of a sparse vector with one entry per n-gram in the vocabulary. This makes them compact to store and convenient for similarity measures, clustering, or classification tasks. Additionally, because word2vec places words that occur in similar contexts close to each other, a query can match a document even when the two share no exact terms.
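The aggregation idea can be illustrated with a toy example. The three-dimensional embeddings below are invented for illustration (real word2vec vectors have hundreds of dimensions), and `average_embedding` is a hypothetical helper that assumes at least one word of the text is known.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only
toy_embeddings = {
    "editor": np.array([0.9, 0.1, 0.0]),
    "transcribes": np.array([0.2, 0.8, 0.0]),
    "manuscript": np.array([0.7, 0.3, 0.0]),
}

def average_embedding(text, embeddings):
    """Mean of the embeddings of all known words in `text` (assumes at least one)."""
    vectors = [embeddings[word] for word in text.split() if word in embeddings]
    return np.mean(vectors, axis=0)

vec_a = average_embedding("editor transcribes manuscript", toy_embeddings)
vec_b = average_embedding("manuscript transcribes editor", toy_embeddings)

print(vec_a)
# Averaging ignores word order, so both texts get the same vector
print(np.allclose(vec_a, vec_b))
```

One consequence worth keeping in mind: a mean over word vectors is a bag-of-words representation, so two descriptions with the same words in a different order are indistinguishable.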


In [60]:
current_path = os.getcwd()
path = os.path.join(current_path, "models/word2vec-google-news-300.bin")

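# Load the model if it is already in our project. If not, download it.
# Note: save()/load() use gensim's native format, not the original
# word2vec binary format, despite the ".bin" file extension.
if os.path.isfile(path):
    print("Model found. Loading...")
    word2vec_model = KeyedVectors.load(path)

else:
    print("Model not found. Downloading...")
    word2vec_model = api.load("word2vec-google-news-300")
    # Make sure the target directory exists before saving
    os.makedirs(os.path.dirname(path), exist_ok=True)
    word2vec_model.save(path)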
    


Model found. Loading...


In [53]:
""" #TODO: Add some preprocessing
def preprocess_word2vec(text: str) -> str:
    pass
"""

' #TODO: Add some preprocessing\ndef preprocess_word2vec(text: str) -> str:\n    pass\n'

In [64]:
def get_word2vec_vector(words, model):
    words = words.split()
    # Filter words that are in the model's vocabulary
    valid_words = [word for word in words if word in model]
    
    if not valid_words:
        # Return a zero vector if no valid words are found
        return np.zeros(model.vector_size)
    
    # Average the vectors of the valid words to create a document representation
    vectors = [model[word] for word in valid_words]
    return np.mean(vectors, axis=0)

# Apply the function to create aggregated vectors
word2vec = edition_software_info['description_preprocessed'].apply(lambda x: get_word2vec_vector(x, word2vec_model))

# Convert the Series of 1D arrays to a 2D numpy array (to calculate the cosine similarity later on)
word2vec_array = np.array(word2vec.tolist())
len(word2vec_array)

43
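Since words missing from the model's vocabulary are silently dropped before averaging, it can be useful to check how much of each description actually contributes to its vector. The sketch below uses a hypothetical helper `vocabulary_coverage` and a small stand-in vocabulary; with the notebook's objects one could presumably pass `word2vec_model` instead, since it supports the same `in` test used in `get_word2vec_vector`.

```python
def vocabulary_coverage(text, vocabulary):
    """Fraction of whitespace-separated tokens that are found in `vocabulary`."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)

# Stand-in vocabulary (assumption, for illustration only)
toy_vocabulary = {"transcription", "tool", "documents"}

coverage = vocabulary_coverage(
    "transcription tool historischen dokumenten", toy_vocabulary
)
print(coverage)
```

A low coverage value signals that the averaged vector rests on only a few words, which is worth checking for the German descriptions in particular.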

#### Test Word2Vec Representation

In [26]:
query = 'I need help digitizing a document'
# Preprocess the query
query = preprocess(stopwords_combined, query)
# get vector representation of the query using word2vec
query_word2vec = get_word2vec_vector(query, word2vec_model)
# Reshape the query vector to be a 2D array with one row
query_word2vec = query_word2vec.reshape(1, -1)

# Compute cosine similarity between the query and the documents
similarities = cosine_similarity(query_word2vec, word2vec_array)

similarity_df = pd.DataFrame({
    'similarity_score': similarities.flatten()
})

result_df = pd.concat([edition_software_info[["brand_name","description_clean"]], similarity_df], axis=1)
result_df_sorted = result_df.sort_values(by='similarity_score', ascending=False)

# Display the top 5 descriptions that might be relevant to our query
with pd.option_context('display.max_colwidth', None):
    display(result_df_sorted[['brand_name','description_clean', 'similarity_score']].head(5))

Unnamed: 0,brand_name,description_clean,similarity_score
27,T-Pen,"T‑PEN... Is an open and general tool for scholars of any technical expertise level Allows transcriptions to be created, manipulated, and viewed in many ways Collaborate with others through simple project management Exports transcriptions as a pdf, XML(plaintext) for further processing, or contribute to a collaborating institution with a click Respects existing and emerging standards for text, image, and annotation data storage Avoids prejudice in data, allowing users to find new ways to work The Transcription for Paleographical and Editorial Notation (T‑PEN) project was coordinated by the Center for Digital Theology at Saint Louis University (SLU) and funded by the Andrew W. Mellon Foundation and the NEH. The Electronic Norman Anonymous Project developed several abilities at the core of this project's functionality.",0.655957
22,Transcribo,"[Transcribo](https://tcdh.uni-trier.de/en/projekt/transcribo) is an editing tool developed by the Trier Center for Digital Humanities as part of the project “Arthur Schnitzler: Digital Historical-Critical Edition”. The digital tool can support users in transcribing texts in various fields. In addition to the actual transcription, the texts can also be annotated and text-genetic facts can be marked up. Transcribo offers the possibility to transcribe texts productively and in a time-saving way, using the same intuitive approach as you are used to from manual transcription. The tool offers all the subtleties needed for a differentiated transcription and supports the working process both efficiently and easily comprehensible. Transcribe almost like with sheet and pencil Transcribing is done in a graphical editor directly on the facsimile, simulating the procedure of analog transcribing with sheet and pencil. This intuitive approach is an advantage especially when transcribing difficult to decipher manuscripts or heavily revised typescript pages and additionally reduces the learning time for the tool, since no markup language (e.g. TEI/XML) has to be learned. Another advantage of this method is that by working on the facsimile, a positional link between the transcribed text and the image is automatically stored. This positional data can be used in the context of the electronic presentation to visualize the link between text and facsimile for the user in a sustainable way, e.g. in the form of cross-fades or mouse-over effects. Can be used both locally and with FuD Transcribo can be used both locally and in conjunction with our FuD system. In local operation, facsimiles located on the terminal device can be opened and edited. All processed information is stored locally in the form of XML files. When FuD is used in addition, the data is stored in FuD's own relational database, which enables collaborative work. 
In addition, some functionalities, such as the recording of cross-page phenomena, the correction tool or the storage of text states, are only available in this operating mode. The many years of development work accompanying the project have paid off. Thus, adaptations and ideas from numerous humanities projects at TCDH have been incorporated into the work and have resulted in a very differentiated and practically tested tool. Some of the most important features of Transcribo 1. Fine-grained markup of microgenetic facts (at document, page, sentence/line, word, and graph levels) 2. Markup of complex interrelated phenomena/change operations via relations (page and cross-page) 3. Correction function for automated comparison of A and B files from different users to ensure error-free transcription and annotation ('double-blind process') 4. OCR functionality for automatic reading of typescripts or prints (Tesseract) 5. Navigation perspective: interface to the FuD database (linking facsimiles with metadata recorded in FuD) 6. Structural perspective: overview of a text section of any size (synopsis of different graphical representations: Facsimile, transcription, annotations) and possibility of depositing cross-page phenomena. 7. Module for defining and labeling text states or text layers. Technical requirements The application was implemented in the Eclipse integrated development environment. It is a rich client application that reuses parts of the development environment and was developed in the Java programming language. We currently provide a version for Windows and macOS operating systems. 
Projects using Transcribo [Arthur Schnitzler Digital](https://tcdh.uni-trier.de/en/projekt/arthur-schnitzler-digital) [The Augsburg Master Builder’s Ledgers](https://tcdh.uni-trier.de/en/projekt/augsburg-master-builders-ledgers) [Digitale Edition and Analysis of the Medulla Gestorum Treverensium by Johann Enen (1514)](https://tcdh.uni-trier.de/en/projekt/digitale-edition-and-analysis-medulla-gestorum-treverensium-johann-enen-1514) [Digital Marburg Büchner Edition](https://tcdh.uni-trier.de/en/projekt/digital-marburg-buchner-edition) [Johann Caspar Lavater](https://tcdh.uni-trier.de/en/projekt/johann-caspar-lavater) [Kurt Schwitters' Intermedia Networks of the Avant-garde](https://tcdh.uni-trier.de/en/projekt/kurt-schwitters-intermedia-networks-avant-garde) [Stefan Heym: “Ahasver”](https://tcdh.uni-trier.de/en/projekt/stefan-heym-ahasver) [Digitalization of the Plock Bible, Old High German Dictionary](https://tcdh.uni-trier.de/en/projekt/digitalization-plock-bible) [Old High German Dictionary](https://tcdh.uni-trier.de/en/projekt/old-high-german-dictionary)",0.652115
31,TEITOK,"TEITOK is a web-based platform for viewing, creating, and editing corpora with both rich textual mark-up and linguistic annotation, initially developed at the Centro de Linguística da Universidade de Lisboa, later at CELGA-ILTEC, and currently maintained at the ÚFAL institute of Charles University, Prague. The system has a modular design with numerous modules making serving a wide range of different corpus types. Below are some examples of some of those, and the type of corpora TEITOK can deal with. More modules are added frequently, and it is possible to add custom modules as well. Historical Corpora For historical corpora, TEITOK provides the option to have an alignment between the transcription and the facsimile image, it provides the option to work with multiple orthographic realizations to combine several editions of a text into a single XML file, and it provides the option to create a searchable document map to see where in the world several phenomena are more frequent. TEITOK is freely available for anybody who wishes to create richly annotated textual corpora, and runs on any LINUX based web server. 
Features Manuscript-based corpora Align your manuscript with your transcript Display each manuscript line with its transcription Transcribe directly from the manuscript Search directly for manuscript fragments Keep multiple editions within the same environment Stand-off Annotations Adds stand-off annotations to any corpus file Edit using an efficient interface Annotate over discontinuous regions Incorporate annotations into the CQP corpus Audio-based corpora Align your audio with your transcription Transcribe directly from the audio file Scroll transcription vertical with wave function horizontal Search directly for audio segments Dependency Grammar Keep dependency relations inside any corpus type Visualize dependency trees for any sentence Edit trees easily Search using dependency relations Geolocation Coordinates Map documents onto the world map Document are clustered into counted groups Access the documents from the map Compare corpus queries on the world map Edit from CQP Query Search for words often incorrectly annotated Click on any token in a KWIC list to edit it Edit all results in a systematic way Edit each results individually in a list Pre-modify each result by a regular expression Search The rich XML format used in TEITOK is hard to search through. For easier access, all corpora are therefore indexed using the Corpus WorkBench (CWB), allowing texts to be search efficiently, and with the rich query language that CWB provides. Words are indexed in the CWB with various orthographic forms, providing many ways to search through the data. The type of corpora that TEITOK is meant for are very labour-intensive: for ancient texts, hardly any of the data will be available in digital format, and have to be scanned. In many cases, OCR will not work and even for human readers the texts are often very hard to read. 
And the data will display a lot of orthographic variation in which a lot of the linguistic annotation, including normalization, will have to be done by hand. As a result, most corpora created with TEITOK will have a limited size, and searching for linguistic properties in them will not yield a lot of results. Therefore, TEITOK offers the option to index the corpus in a central database, which can be searched via this site. Each search result will only display the direct context of the word, and will link directly to the word in the original text on the site of the project it originated from. This way, it is possible to search through multiple corpora at the same time, and get access to the full original data in a way that prominently features the original project.",0.6359
38,Scripto,"Scripto is an open-source tool that permits registered users to view digital files and transcribe them with an easy-to-use toolbar, rendering that text searchable. The tool includes a versioning history and editorial controls to make public contributions more manageable, and supports the transcription of a wide range of file types (both images and documents). There are two versions of Scripto, each of which works with a different version of Omeka. Scripto for Omeka Classic creates a single transcription project for the content of your Omeka Classic site. Scripto for Omeka S enables the creation of multiple projects built from shared items in your Omeka S installation.",0.631786
40,Monasterium - Collaborative Archive,"MOM-CA is a collaborative database system which allows for the research and editing of medieval and early modern charters. By and large you can find the charters organized according to archival ""fonds""; charters that originate from different sources are subsumed under ""Collections"". General users can: browse search Registered users can: bookmark documents save and edit documents (description, transcription) with EditMOM3 create indices manipulate images (annotate images, create collections of image fragments, manipulate image fragments) create own charter collections Special users can: review user changes (moderators only) manage metadata on archives and collections (metadata manager only) import new data (metadata manager only) translate system messages (translators only) manage users (user manager only) manage static texts (like help, introduction etc., html-authors only)",0.62968


# Vectorization 3: FastText

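One downside of pretrained Word2Vec representations is their inability to handle unknown words.
FastText models close this gap by representing each word not only as a whole-word embedding but also as a collection of character n-grams, so that a vector can be composed even for words that were never seen during training.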
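To give an intuition for the character n-grams, the sketch below extracts them in the style described for FastText: the word is wrapped in the boundary markers `<` and `>` and then cut into overlapping substrings. The helper `char_ngrams` is hypothetical and only mimics the extraction step; it does not compute any embeddings, and the default range of 3 to 6 characters is an assumption based on the commonly cited FastText defaults.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

trigrams = char_ngrams("where", 3, 3)
print(trigrams)

# Related word forms share many n-grams, which is what lets a subword model
# build a plausible vector for a form it has never seen
shared = set(char_ngrams("transcribe")) & set(char_ngrams("transcribing"))
print(sorted(shared))
```

Because an unseen word still decomposes into n-grams that were seen during training, its vector can be assembled from those pieces instead of being dropped.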