# Import dependencies and create frequently used functions

In [1]:
import pandas as pd
import requests
import os
import json
from typing import List
from util.webscraper import WebScraper
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import string
import numpy as np
import gensim.downloader as api
import gensim
from gensim.models import KeyedVectors
from gensim.models import FastText
from sklearn.metrics.pairwise import cosine_similarity
import fasttext
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from joblib import Parallel, delayed
import sentence_transformers
import nltk
nltk.download('stopwords')


  from tqdm.autonotebook import tqdm, trange


In [6]:
# define functions to load the 
# 
# models
def download_file(url: str, file_path: str) -> None:
    """Download a file from a URL and save it locally."""
    try:
        
        if os.path.isfile(file_path):
            print("File was already downloaded.")
            return None
        
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Check if the request was successful
        with open(file_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    file.write(chunk)
        print(f"The file has been downloaded and saved as: {file_path}")
    except requests.RequestException as e:
        print(f"An error occurred while downloading the file: {e}")
        
def load_word_vectors(file_path: str):
    """Load word vectors from a file."""
    try:
        model = gensim.models.KeyedVectors.load_word2vec_format(file_path)
        print("Vectors loaded successfully.")
        return model
    except Exception as e:
        print(f"An error occurred while loading the vectors: {e}")
        return None


# Data Collection

First, we get the data from the API. As the API is not yet published, both the API-Url and the query to get information on edition-software need to be specified in your .env file. (consult the README for more information)

In [7]:
%load_ext dotenv
%dotenv

In [8]:
# get api_url and query
api_url = os.environ['API_URL']
query = os.environ['QUERY']

# get data from api
api_response = requests.get(api_url + query)

Now that we got the data from the API, we can load it into a dataframe to prepare it to be used as a knowledge base for rag. 

In [9]:
edition_software_info = json.loads(api_response.text)
edition_software_info = pd.DataFrame(edition_software_info)
edition_software_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                43 non-null     object
 1   slug              43 non-null     object
 2   brand_name        43 non-null     object
 3   concept_doi       0 non-null      object
 4   description       40 non-null     object
 5   description_url   3 non-null      object
 6   description_type  43 non-null     object
 7   get_started_url   41 non-null     object
 8   image_id          31 non-null     object
 9   is_published      43 non-null     bool  
 10  short_statement   43 non-null     object
 11  created_at        43 non-null     object
 12  updated_at        43 non-null     object
 13  closed_source     43 non-null     bool  
dtypes: bool(2), object(12)
memory usage: 4.2+ KB


A brief inspection allows us to formulate some initial tasks and questions for this experiment.

- **Preprocessing:** As we can see, not a single entry contains a associated concept_doi. We might consider dropping the column.
- **Impact of using short descriptions only:** Three entries are missing the in depth description. We can assume that rag won't be too useful for these entries. 
- **Impact of additional information:** Only three have a description-url. Down the road, we need to evaluate, if adding info from this source improves the performance of the rag-system.

# Data Cleaning

### 1. Remove Artefacts

Both the `description` and `short_statement` columns seem to be of particular interest for the task at hand. To asses necessary preprocessing step, we'll need to take a closer look at them.

In [10]:
descriptions = edition_software_info[["description", "short_statement"]]
with pd.option_context('display.max_colwidth', None):
    display(descriptions.head())

Unnamed: 0,description,short_statement
0,"# Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI\n\n- Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen.\n\n- KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung.\n\n- Manuelles Transkribieren im Transkriptionseditor\nKI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle\n\n- Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern.\n\n\n- Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen\n\n- Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML).\n\n- Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform.","Transkribus ist eine umfassende Plattform für die Digitalisierung, Texterkennung mithilfe Künstlicher Intelligenz, Transkription und das Durchsuchen von historischen Dokumenten."
1,"autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users.\n\nSpecial features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future.\n\nautodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage.\n\n(quoted from: https://autodone.idh.uni-koeln.de/about, 19.04.2024)\n\n--- \n## Official Site:\n[https://autodone.idh.uni-koeln.de/](https://autodone.idh.uni-koeln.de/)\n\n---\n## Usage Instructions\n[https://autodone.idh.uni-koeln.de/usage](https://autodone.idh.uni-koeln.de/usage)\n","Autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users."
2,"[CollateX](http://collatex.net/) is a software to\n\n 1. read **multiple (≥ 2) versions of a text**, splitting each version into parts (tokens) to be compared,\n 1. **identify similarities of and differences between the versions** (including moved/transposed segments) by aligning tokens, and\n 1. output the alignment results in a **variety of formats for further processing**, for instance\n 1. to support **the production of a critical apparatus** or the stemmatical analysis of a text's genesis.\n\nIt resembles software used to compute differences between files (e.g. [diff](http://en.wikipedia.org/wiki/Diff)) or tools for [sequence alignment](http://en.wikipedia.org/wiki/Sequence_alignment) which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results.\n\nAs such it is primarily designed for use cases in disciplines like [Philology](http://en.wikipedia.org/wiki/Philology) or – more specifically – the field of [Textual Criticism](http://en.wikipedia.org/wiki/Textual_criticism) where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable.\n\nPlease go to <http://collatex.net/> for further information.","CollateX is a software to (a.) read multiple versions of a text, (b.) identify differences by aligning tokens, and (c.) output the alignment results for further processing, for instance (d.) to support the production of a critical apparatus or the stemmatical analysis of a text's genesis."
3,"Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache.\n\nMit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket","LaTeX (gesprochen “Lah-tech” oder “Lay-tech”), ist eine Textsatz*sprache* und ein *Programm* für die Erstellung qualitativ hochwertiger Druckausgaben. Ursprünglich entwickelt für mathematischen Textsatz wird es heute für alle Arten von wissenschaftlichen Texten und auch darüber hinaus eingesetzt."
4,,The Research Software Directory is a content management system that is tailored to research software.


As we can see, the `description` column contains some formatting artefacts like `\n` and markdown syntax like `**` and `#`. Let's clean them up.
While we're at it, we can also remove double whitespaces etc.

In [11]:
pattern = '\\n+'
edition_software_info["description_clean"] = edition_software_info["description"].str.replace(pattern, ' ', regex=True)

pattern = r'[*#]+|\s-+\s|]]' #\[\]()<>
edition_software_info["description_clean"] = edition_software_info["description_clean"].str.replace(pattern, ' ', regex=True)

with pd.option_context('display.max_colwidth', None):
    display(edition_software_info[["brand_name", "description", "description_clean"]].head())

Unnamed: 0,brand_name,description,description_clean
0,Transkribus,"# Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI\n\n- Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen.\n\n- KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung.\n\n- Manuelles Transkribieren im Transkriptionseditor\nKI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle\n\n- Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern.\n\n\n- Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen\n\n- Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML).\n\n- Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform.","Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen. KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung. Manuelles Transkribieren im Transkriptionseditor KI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern. Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML). Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform."
1,Autodone,"autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users.\n\nSpecial features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future.\n\nautodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage.\n\n(quoted from: https://autodone.idh.uni-koeln.de/about, 19.04.2024)\n\n--- \n## Official Site:\n[https://autodone.idh.uni-koeln.de/](https://autodone.idh.uni-koeln.de/)\n\n---\n## Usage Instructions\n[https://autodone.idh.uni-koeln.de/usage](https://autodone.idh.uni-koeln.de/usage)\n","autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users. Special features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future. autodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage. (quoted from: https://autodone.idh.uni-koeln.de/about, 19.04.2024) Official Site: [https://autodone.idh.uni-koeln.de/](https://autodone.idh.uni-koeln.de/) Usage Instructions [https://autodone.idh.uni-koeln.de/usage](https://autodone.idh.uni-koeln.de/usage)"
2,CollateX,"[CollateX](http://collatex.net/) is a software to\n\n 1. read **multiple (≥ 2) versions of a text**, splitting each version into parts (tokens) to be compared,\n 1. **identify similarities of and differences between the versions** (including moved/transposed segments) by aligning tokens, and\n 1. output the alignment results in a **variety of formats for further processing**, for instance\n 1. to support **the production of a critical apparatus** or the stemmatical analysis of a text's genesis.\n\nIt resembles software used to compute differences between files (e.g. [diff](http://en.wikipedia.org/wiki/Diff)) or tools for [sequence alignment](http://en.wikipedia.org/wiki/Sequence_alignment) which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results.\n\nAs such it is primarily designed for use cases in disciplines like [Philology](http://en.wikipedia.org/wiki/Philology) or – more specifically – the field of [Textual Criticism](http://en.wikipedia.org/wiki/Textual_criticism) where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable.\n\nPlease go to <http://collatex.net/> for further information.","[CollateX](http://collatex.net/) is a software to 1. read multiple (≥ 2) versions of a text , splitting each version into parts (tokens) to be compared, 1. identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and 1. output the alignment results in a variety of formats for further processing , for instance 1. to support the production of a critical apparatus or the stemmatical analysis of a text's genesis. It resembles software used to compute differences between files (e.g. [diff](http://en.wikipedia.org/wiki/Diff)) or tools for [sequence alignment](http://en.wikipedia.org/wiki/Sequence_alignment) which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results. As such it is primarily designed for use cases in disciplines like [Philology](http://en.wikipedia.org/wiki/Philology) or – more specifically – the field of [Textual Criticism](http://en.wikipedia.org/wiki/Textual_criticism) where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable. Please go to <http://collatex.net/> for further information."
3,LaTeX,"Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache.\n\nMit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket","Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache. Mit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket"
4,Research Software Directory,,


### 2. Fill nan values

Before we continue preprocessing the data for later vectorization, we need to check for missing values and replace them with empty strings.

In [12]:
edition_software_info["description_clean"].fillna("", inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  edition_software_info["description_clean"].fillna("", inplace=True)


# Scrape Webpages

To provide additional context-information for the retrieval process, we'll scrape all webpages referenced in the software-description.

### 1. Get urls

First, we isolate the urls from our description.

In [13]:
pattern = r"((?:https?:\/\/|w{3}.)[\w\d%/.-]+)"

urls = edition_software_info["description"].str.extractall(pattern)
urls = urls.droplevel(1)
urls_grouped = urls.groupby(urls.index).agg((lambda x: ','.join(set(x))))
edition_software_info["urls"] = urls_grouped

with pd.option_context('display.max_colwidth', None):
    display(edition_software_info[["description_clean", "urls"]].head())

Unnamed: 0,description_clean,urls
0,"Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen. KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung. Manuelles Transkribieren im Transkriptionseditor KI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern. Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML). Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform.",
1,"autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users. Special features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future. autodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage. (quoted from: https://autodone.idh.uni-koeln.de/about, 19.04.2024) Official Site: [https://autodone.idh.uni-koeln.de/](https://autodone.idh.uni-koeln.de/) Usage Instructions [https://autodone.idh.uni-koeln.de/usage](https://autodone.idh.uni-koeln.de/usage)","https://autodone.idh.uni-koeln.de/usage,https://autodone.idh.uni-koeln.de/about,https://autodone.idh.uni-koeln.de/"
2,"[CollateX](http://collatex.net/) is a software to 1. read multiple (≥ 2) versions of a text , splitting each version into parts (tokens) to be compared, 1. identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and 1. output the alignment results in a variety of formats for further processing , for instance 1. to support the production of a critical apparatus or the stemmatical analysis of a text's genesis. It resembles software used to compute differences between files (e.g. [diff](http://en.wikipedia.org/wiki/Diff)) or tools for [sequence alignment](http://en.wikipedia.org/wiki/Sequence_alignment) which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results. As such it is primarily designed for use cases in disciplines like [Philology](http://en.wikipedia.org/wiki/Philology) or – more specifically – the field of [Textual Criticism](http://en.wikipedia.org/wiki/Textual_criticism) where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable. Please go to <http://collatex.net/> for further information.","http://collatex.net/,http://en.wikipedia.org/wiki/Philology,http://en.wikipedia.org/wiki/Sequence_alignment,http://en.wikipedia.org/wiki/Textual_criticism,http://en.wikipedia.org/wiki/Diff"
3,"Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache. Mit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket",
4,,


### 2. Scrape Webpages

Now we scrape the paragraphs from the webpages we found. 
The webscraper will take the list of urls associated with an entry and will save paragraphs from all webpages as a string in a column of our dataframe. 

**This might take some time**

In [14]:
webscraper = WebScraper(tags = ["p"], exclude = ["wikipedia"])
edition_software_info["webpages_text"] = edition_software_info["urls"].apply(lambda x: webscraper.scrape(x))

Scraping https://autodone.idh.uni-koeln.de/usage with parameters tags = ['p']
Scraping https://autodone.idh.uni-koeln.de/about with parameters tags = ['p']
Scraping https://autodone.idh.uni-koeln.de/ with parameters tags = ['p']
Scraping http://collatex.net/ with parameters tags = ['p']
Scraping http://www.tei-c.org/ with parameters tags = ['p']
Scraping http://vbd.humnet.unipi.it/ with parameters tags = ['p']
Scraping https://www.oeaw.ac.at/oesterreichische-akademie-der-wissenschaften with parameters tags = ['p']
Scraping https://cte.oeaw.ac.at/ with parameters tags = ['p']
Scraping https://opensource.org/licenses/EUPL-1.2 with parameters tags = ['p']
Scraping https://sites.fastspring.com/stefanhagel/product/cte with parameters tags = ['p']
Scraping http://csel.at/ with parameters tags = ['p']
Scraping https://phylipweb.github.io/phylip/general.html with parameters tags = ['p']
Scraping https://www.sglp.uzh.ch/static/MLS/stemmatology/PAUP_229150101.html with parameters tags = ['p']
HT

### 3. Inspect data

In [15]:
edition_software_info[["urls", "webpages_text"]].head()

Unnamed: 0,urls,webpages_text
0,,
1,"https://autodone.idh.uni-koeln.de/usage,https:...",On this page you will find instructions on how...
2,"http://collatex.net/,http://en.wikipedia.org/w...","“In a language, in the system of language, the..."
3,,
4,,


Now that the data is collected from the webpages, we can take a look at the average length of the texts received for each entry.

In [16]:
length = edition_software_info["webpages_text"].apply(lambda x: len(x) if not pd.isna(x) else 0)
length[length>0].describe()

count       13.000000
mean     15479.153846
std      17608.288786
min       1114.000000
25%       2300.000000
50%       6739.000000
75%      23657.000000
max      55063.000000
Name: webpages_text, dtype: float64

Looking only at entries, that we were able to collected webpage text for, we have an average character count of about 15.000 per entry. 
The standard deviation is quite large compared to the mean, indicating that there is a high degree of variability in character counts.

The distribution is skewed towards entries with lower character counts, while some outliers with a high character counts pull the mean upwards.



# Data Augmentation

We can combine some information to increase the information density in our description. We'll start off by simply appending the `short statement` to the front of the description. 
In later steps we might add other information. 

In [17]:
edition_software_info["description_clean"] = edition_software_info["short_statement"] + edition_software_info["description_clean"]

# Preprocessing

Next, we can prepare our data for the information retrieval process. We'll focus on 'descriptions' for now. However, this process could be easily expanded to the webpage text we collected earlier.

### 1. Remove links

First, we'll remove all links from the descriptions.

In [18]:
pattern = r"((?:https?:\/\/|w{3}.)[\w\d%/.-]+)"
edition_software_info["description_clean"] = edition_software_info["description_clean"].str.replace(pattern, '', regex=True)
edition_software_info["description_clean"].head(3)

0    Transkribus ist eine umfassende Plattform für ...
1    Autodone is a service for the automated, time-...
2    CollateX is a software to (a.) read multiple v...
Name: description_clean, dtype: object

### 2. Replace missing descriptions

To allow for smoother text processing, we'll replace all NaN values in the relevant text columns with an empty string.

In [19]:
edition_software_info["description_clean"] = edition_software_info["description_clean"].fillna('')
null = edition_software_info["description_clean"].isna().sum()
print(f"NaN remaining: {null}")

NaN remaining: 0


#### 4. Add Language Information

As the entries in our repository are in both english and german, we add information on the texts language to the dataset.

To do so, we'll use a fasttext-model for language identification, which can be found [here](https://fasttext.cc/docs/en/language-identification.html).

As the model was trained on UTF-8 data, it expects UTF-8 as input. This sould be the case, as pandas `read_csv`-function imports text in UTF-8 by default.



In [20]:
# download the fasttext model

fasttext_path = os.path.join(os.getcwd(), 'models/lid.176.bin')

if os.path.isfile(fasttext_path):
    print("Model found.")
    language_detection_model = fasttext.load_model(fasttext_path)
    
else:
    print("Model not found. Downloading...")
    url = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"
    # download the model
    download_file(url, fasttext_path)
    # load the model
    print("Loading model from file...")
    language_detection_model = fasttext.load_model(fasttext_path)


Model found.


Now we can use the model to identify the languages of our entries. This might take some time. 

In [21]:

fasttext_model = fasttext.load_model(fasttext_path)

def identify_language(text):
    lang_detected = fasttext_model.predict(text)
    return lang_detected[0][0]

# clean the webpage text, as the model expect text without newlines
edition_software_info.loc[:,"description_clean"] = edition_software_info["description_clean"].str.replace("\n"," ")

# detect message-languages. If the column contains empty text, the language is set to nan
edition_software_info.loc[:,"description_lang"] = edition_software_info["description_clean"].apply(lambda x: identify_language(x) if not x == '' else np.nan)

# clean the output
edition_software_info.loc[:,"description_lang"] = edition_software_info["description_lang"].str.replace("__label__","")

# print the new columns
edition_software_info[["description","description_lang", "short_statement"]].head(10)

Unnamed: 0,description,description_lang,short_statement
0,"# Erkennen, Transkribieren und Durchsuchen von...",de,Transkribus ist eine umfassende Plattform für ...
1,"autodone is a service for the automated, time-...",en,"Autodone is a service for the automated, time-..."
2,[CollateX](http://collatex.net/) is a software...,en,CollateX is a software to (a.) read multiple v...
3,Der Mathematiker Donald E. Knuth entwickelte E...,de,"LaTeX (gesprochen “Lah-tech” oder “Lay-tech”),..."
4,,en,The Research Software Directory is a content m...
5,,de,Tesseract ist eine Software zur Texterkennung....
6,EVT (Edition Visualization Technology) is a so...,en,"A light-weight, open source tool specifically ..."
7,Der Classical Text Editor (CTE) wird auf Initi...,de,Der CTE ist ein Spezialwerkzeug für die Erstel...
8,In vielen Editionsprojekten wird die Datenbank...,de,Der TEI Publisher ist eine eXist-db Applikatio...
9,\n# Benutzerfreundliches Arbeiten\n\nAls zentr...,de,ediarum is a digital working environment consi...


#### 5. Chunking

Finally, we chunk longer entries into smaller paragraphs. We'lle leave some overlap between chunks, as to retain some context between chunks.

We start off with chunk sizes of 108 token and an overlap of 20, as 128 was identified to be a effective chunk size for RAG in this [blogpost](https://www.mattambrogi.com/posts/chunk-size-matters/) by Matt Ambrogi.

**Both the chunk and overlap size are hyperparamters and should be fine tuned for the task at hand.**


In [22]:
def chunk(text, chunk_size=226, overlap_size=30):
    """
    Splits the given text into chunks with overlap.
    Args:
        text (str): The input text to be chunked.
        chunk_size (int, optional): The size of each chunk. Defaults to 108.
        overlap_size (int, optional): The size of overlap between chunks. Defaults to 20.
    Returns:
        list: A list of chunks with overlap.
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap_size):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
        if i + chunk_size >= len(words):
            break
    return chunks

# Apply the chunking function and explode the chunks, to give every chunk its own row while retaining all other data
edition_software_info_chunked = edition_software_info.copy()
edition_software_info_chunked['description_clean_chunks'] = edition_software_info_chunked['description_clean'].apply(lambda x: chunk(x))
edition_software_info_chunked = edition_software_info_chunked.explode('description_clean_chunks') 

# reindex the dataframe
edition_software_info_chunked = edition_software_info_chunked.reset_index(drop=True)

with pd.option_context('display.max_colwidth', None):
    display(edition_software_info_chunked[["brand_name", "description_clean", "description_clean_chunks"]].head(10))


Unnamed: 0,brand_name,description_clean,description_clean_chunks
0,Transkribus,"Transkribus ist eine umfassende Plattform für die Digitalisierung, Texterkennung mithilfe Künstlicher Intelligenz, Transkription und das Durchsuchen von historischen Dokumenten. Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen. KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung. Manuelles Transkribieren im Transkriptionseditor KI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern. Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML). Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform.","Transkribus ist eine umfassende Plattform für die Digitalisierung, Texterkennung mithilfe Künstlicher Intelligenz, Transkription und das Durchsuchen von historischen Dokumenten. Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen. KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung. Manuelles Transkribieren im Transkriptionseditor KI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern. Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML). Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform."
1,Autodone,"Autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users.autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users. Special features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future. autodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage. (quoted from: , 19.04.2024) Official Site: []() Usage Instructions []()","Autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users.autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users. Special features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future. autodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage. (quoted from: , 19.04.2024) Official Site: []() Usage Instructions []()"
2,CollateX,"CollateX is a software to (a.) read multiple versions of a text, (b.) identify differences by aligning tokens, and (c.) output the alignment results for further processing, for instance (d.) to support the production of a critical apparatus or the stemmatical analysis of a text's genesis.[CollateX]() is a software to 1. read multiple (≥ 2) versions of a text , splitting each version into parts (tokens) to be compared, 1. identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and 1. output the alignment results in a variety of formats for further processing , for instance 1. to support the production of a critical apparatus or the stemmatical analysis of a text's genesis. It resembles software used to compute differences between files (e.g. [diff]()) or tools for [sequence alignment]() which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results. As such it is primarily designed for use cases in disciplines like [Philology]() or – more specifically – the field of [Textual Criticism]() where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable. Please go to <> for further information.","CollateX is a software to (a.) read multiple versions of a text, (b.) identify differences by aligning tokens, and (c.) output the alignment results for further processing, for instance (d.) to support the production of a critical apparatus or the stemmatical analysis of a text's genesis.[CollateX]() is a software to 1. read multiple (≥ 2) versions of a text , splitting each version into parts (tokens) to be compared, 1. identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and 1. output the alignment results in a variety of formats for further processing , for instance 1. to support the production of a critical apparatus or the stemmatical analysis of a text's genesis. It resembles software used to compute differences between files (e.g. [diff]()) or tools for [sequence alignment]() which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results. As such it is primarily designed for use cases in disciplines like [Philology]() or – more specifically – the field of [Textual Criticism]() where the assessment of findings is based on interpretation and therefore can be supported by computational means but is"
3,CollateX,"CollateX is a software to (a.) read multiple versions of a text, (b.) identify differences by aligning tokens, and (c.) output the alignment results for further processing, for instance (d.) to support the production of a critical apparatus or the stemmatical analysis of a text's genesis.[CollateX]() is a software to 1. read multiple (≥ 2) versions of a text , splitting each version into parts (tokens) to be compared, 1. identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and 1. output the alignment results in a variety of formats for further processing , for instance 1. to support the production of a critical apparatus or the stemmatical analysis of a text's genesis. It resembles software used to compute differences between files (e.g. [diff]()) or tools for [sequence alignment]() which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results. As such it is primarily designed for use cases in disciplines like [Philology]() or – more specifically – the field of [Textual Criticism]() where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable. Please go to <> for further information.",[Philology]() or – more specifically – the field of [Textual Criticism]() where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable. Please go to <> for further information.
4,LaTeX,"LaTeX (gesprochen “Lah-tech” oder “Lay-tech”), ist eine Textsatz*sprache* und ein *Programm* für die Erstellung qualitativ hochwertiger Druckausgaben. Ursprünglich entwickelt für mathematischen Textsatz wird es heute für alle Arten von wissenschaftlichen Texten und auch darüber hinaus eingesetzt.Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache. Mit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket","LaTeX (gesprochen “Lah-tech” oder “Lay-tech”), ist eine Textsatz*sprache* und ein *Programm* für die Erstellung qualitativ hochwertiger Druckausgaben. Ursprünglich entwickelt für mathematischen Textsatz wird es heute für alle Arten von wissenschaftlichen Texten und auch darüber hinaus eingesetzt.Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache. Mit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket"
5,Research Software Directory,The Research Software Directory is a content management system that is tailored to research software.,The Research Software Directory is a content management system that is tailored to research software.
6,Tesseract OCR,"Tesseract ist eine Software zur Texterkennung. Mehr als 100 Sprachen und Sprachvarianten werden unterstützt, zudem verschiedene Schriften / Schriftsysteme: lateinische Antiqua, Fraktur, Devanagari (indische Schrift), chinesische, arabische, griechische, hebräische, kyrillische Schrift.","Tesseract ist eine Software zur Texterkennung. Mehr als 100 Sprachen und Sprachvarianten werden unterstützt, zudem verschiedene Schriften / Schriftsysteme: lateinische Antiqua, Fraktur, Devanagari (indische Schrift), chinesische, arabische, griechische, hebräische, kyrillische Schrift."
7,EVT,"A light-weight, open source tool specifically designed to create digital editions from XML-encoded texts, freeing the scholar from the burden of web programming and enabling the final user to browse, explore and study digital editions by means of a user-friendly interface.EVT (Edition Visualization Technology) is a software for creating and browsing digital editions of manuscripts based on text encoded according to the [TEI XML]() schemas and Guidelines. This tool was born as part of the [Digital Vercelli Book]() project in order to allow the creation of a digital edition of the Vercelli Book, a parchment codex of the late tenth century, now preserved in the Archivio e Biblioteca Capitolare of Vercelli and regarded as one of the four most important manuscripts of the Anglo-Saxon period as regards the transmission of poetic texts in the Old English language. To ensure that it will be working on all the most recent web browsers, and for as long as possible on the World Wide Web itself, EVT is built on open and standard web technologies such as HTML, CSS and JavaScript. Specific features, such as the magnifying lens, are entrusted to jQuery plug-ins, again chosen among the open source and best supported ones to reduce the risk of future incompatibilities. The general architecture of the software, in any case, is modular, so that any component which may cause trouble or turn out to be not completely up to the task can be replaced easily.","A light-weight, open source tool specifically designed to create digital editions from XML-encoded texts, freeing the scholar from the burden of web programming and enabling the final user to browse, explore and study digital editions by means of a user-friendly interface.EVT (Edition Visualization Technology) is a software for creating and browsing digital editions of manuscripts based on text encoded according to the [TEI XML]() schemas and Guidelines. This tool was born as part of the [Digital Vercelli Book]() project in order to allow the creation of a digital edition of the Vercelli Book, a parchment codex of the late tenth century, now preserved in the Archivio e Biblioteca Capitolare of Vercelli and regarded as one of the four most important manuscripts of the Anglo-Saxon period as regards the transmission of poetic texts in the Old English language. To ensure that it will be working on all the most recent web browsers, and for as long as possible on the World Wide Web itself, EVT is built on open and standard web technologies such as HTML, CSS and JavaScript. Specific features, such as the magnifying lens, are entrusted to jQuery plug-ins, again chosen among the open source and best supported ones to reduce the risk of future incompatibilities. The general architecture of the software, in any case, is modular, so that any component which may cause trouble"
8,EVT,"A light-weight, open source tool specifically designed to create digital editions from XML-encoded texts, freeing the scholar from the burden of web programming and enabling the final user to browse, explore and study digital editions by means of a user-friendly interface.EVT (Edition Visualization Technology) is a software for creating and browsing digital editions of manuscripts based on text encoded according to the [TEI XML]() schemas and Guidelines. This tool was born as part of the [Digital Vercelli Book]() project in order to allow the creation of a digital edition of the Vercelli Book, a parchment codex of the late tenth century, now preserved in the Archivio e Biblioteca Capitolare of Vercelli and regarded as one of the four most important manuscripts of the Anglo-Saxon period as regards the transmission of poetic texts in the Old English language. To ensure that it will be working on all the most recent web browsers, and for as long as possible on the World Wide Web itself, EVT is built on open and standard web technologies such as HTML, CSS and JavaScript. Specific features, such as the magnifying lens, are entrusted to jQuery plug-ins, again chosen among the open source and best supported ones to reduce the risk of future incompatibilities. The general architecture of the software, in any case, is modular, so that any component which may cause trouble or turn out to be not completely up to the task can be replaced easily.","and best supported ones to reduce the risk of future incompatibilities. The general architecture of the software, in any case, is modular, so that any component which may cause trouble or turn out to be not completely up to the task can be replaced easily."
9,Classical Text Editor (CTE),"Der CTE ist ein Spezialwerkzeug für die Erstellung einer kritischen Ausgabe bzw. eines Texts mit Kommentar oder Übersetzung. Er dient der einfachen Erzeugung einer druckfertigen PDF- oder elektronischen Ausgabe.Der Classical Text Editor (CTE) wird auf Initiative der [Österreichischen Akademie der Wissenschaften]() und des Editionsprojekts [Corpus Scriptorum Ecclesiasticorum Latinorum (CSEL)]() seit 1997 entwickelt. Für die Nutzung des CTE muss eine [Lizenz erworben]() werden, da das Projekt keine öffentliche Förderung erhält. Perspektivisch wird eine Veröffentlichung gemäß der Opensource-Lizenz [EUPL-1.2]() angestrebt. Der CTE schließt für den Bereich der kritischen Ausgaben die Lücke zwischen klassischen Text verarbeitungs programmen (Open Office Writer, Microsoft Word u.a.) auf der einen und Text satz programmen (TeX, TUSTEP Satz) auf der anderen Seite. Autor:innen können mit Hilfe einer graphischen Oberfläche Publikationen erstellen und gleichzeitig die gerade für kritische Ausgaben nötigen typographischen Anforderungen umsetzen. Neben umfangreichen Funktionen im Bereich kritischer Editionen (verschiedene Layouts, beliebig viele Apparate, Varianten etc.) gehören daher auch die Unterstützung von Unicode, komplexen Skripten sowie die Einbindung verschiedener Referenzsysteme zum Funktionsumfang des CTE. Das Ergebnis kann sowohl für die Drucklegung im PDF-Format als auch für eine elektronische Fassung als HTML- oder XML-Publikation exportiert werden. Umgekehrt können Texte in verschiedenen Formaten importiert werden. Eine detaillierte Auflistung aller Feature des CTE findet man unter ?id0=features.","Der CTE ist ein Spezialwerkzeug für die Erstellung einer kritischen Ausgabe bzw. eines Texts mit Kommentar oder Übersetzung. Er dient der einfachen Erzeugung einer druckfertigen PDF- oder elektronischen Ausgabe.Der Classical Text Editor (CTE) wird auf Initiative der [Österreichischen Akademie der Wissenschaften]() und des Editionsprojekts [Corpus Scriptorum Ecclesiasticorum Latinorum (CSEL)]() seit 1997 entwickelt. Für die Nutzung des CTE muss eine [Lizenz erworben]() werden, da das Projekt keine öffentliche Förderung erhält. Perspektivisch wird eine Veröffentlichung gemäß der Opensource-Lizenz [EUPL-1.2]() angestrebt. Der CTE schließt für den Bereich der kritischen Ausgaben die Lücke zwischen klassischen Text verarbeitungs programmen (Open Office Writer, Microsoft Word u.a.) auf der einen und Text satz programmen (TeX, TUSTEP Satz) auf der anderen Seite. Autor:innen können mit Hilfe einer graphischen Oberfläche Publikationen erstellen und gleichzeitig die gerade für kritische Ausgaben nötigen typographischen Anforderungen umsetzen. Neben umfangreichen Funktionen im Bereich kritischer Editionen (verschiedene Layouts, beliebig viele Apparate, Varianten etc.) gehören daher auch die Unterstützung von Unicode, komplexen Skripten sowie die Einbindung verschiedener Referenzsysteme zum Funktionsumfang des CTE. Das Ergebnis kann sowohl für die Drucklegung im PDF-Format als auch für eine elektronische Fassung als HTML- oder XML-Publikation exportiert werden. Umgekehrt können Texte in verschiedenen Formaten importiert werden. Eine detaillierte Auflistung aller Feature des CTE findet man unter ?id0=features."


#### 5. Removing punctiation, stopwords and upper case letters.

As some vectorization methods need additional preprocessing steps, we'll create a new column for description with their punctiation, stopwords and upper case letters removed. As we have both english and german texts, we'll need to account for stopwords in both languages. 

In [23]:
def preprocess(stopwords: List[str], text: str) -> str:
    """
    Preprocesses the given text by converting it to lowercase, removing punctuation, and filtering out stopwords.
    Args:
        stopwords (List[str]): A list of stopwords to be filtered out from the text.
        text (str): The input text to be preprocessed.
    """
    
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))    
    text = ' '.join([word for word in text.split() if word not in stopwords])
    return text

# get stopwords
stopwords_english = set(stopwords.words('english'))
stopwords_german = set(stopwords.words('german'))
stopwords_combined = stopwords_german.union(stopwords_english)

In [24]:
edition_software_info_chunked["description_preprocessed_chunks"] = edition_software_info_chunked["description_clean_chunks"].apply(lambda x: preprocess(stopwords_combined, x))
edition_software_info_chunked["description_preprocessed_chunks"].head()

0    transkribus umfassende plattform digitalisieru...
1    autodone service automated timecontrolled publ...
2    collatex software read multiple versions text ...
3    philology – specifically – field textual criti...
4    latex gesprochen “lahtech” “laytech” textsatzs...
Name: description_preprocessed_chunks, dtype: object

# Export dataset

Before we move on to the vectorisations, we'll save the dataset to our drive.

In [25]:
current_dir = os.getcwd()
path = os.path.join(current_dir, 'data/edition_software_info_chunked.csv')
edition_software_info_chunked.to_csv(path)

# Vectorization 1: TFIDF

We'll start off with a simple TF-IDF vectorization.

**Term Frequency-Inverse Document Frequency (TF-IDF)** is a weighting scheme that weights the cells of a term-document matrix by their potential to be discriminatory.

To do so, we first calculate the **term frequency (TF)**. The term frequency represents the number of instances of a given word $t$ in a document $d$.

$$
\text{TF}(t, d) = \frac{\text{Count of } t \text{ in } d}{\text{Total number of words in } d}
$$

This term frequency is then multiplied by the **inverse document frequency (IDF)**. The IDF is calculated by counting all documents that contain a term $t$ (the document frequency $\text{df}(t)$). Then, we divide the total number of documents $N$ in the corpus by $\text{df}(t)$.

This inverse frequency is chosen over the regular frequency to **downweight** terms that appear in many documents, since these terms are less likely to be useful for distinguishing between documents.

Usually, we also take the logarithm of the IDF to smooth out the very large values that can occur when a term appears in only a few documents. This ensures that rare terms are not excessively weighted.

$$
\text{df}(t) = \text{Document frequency of a term } t
$$
$$
N = \text{Number of documents}
$$
$$
\text{IDF}(t) = \log\left(\frac{N}{\text{df}(t)}\right)
$$

Finally, we calculate the **TF-IDF** by multiplying the term frequency $\text{TF}(t, d)$ with the inverse document frequency $\text{IDF}(t)$.

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$

The resulting value can be interpreted as a measure of the importance of the term in a document relative to the entire corpus. Terms that are frequent in a document but rare across the corpus will have higher TF-IDF scores, indicating their importance.


**N-grams:**

To capture not just the importance of single words but also some of the **context** in which they are used, we can apply TF-IDF to **n-grams**. N-grams are contiguous sequences of $n$ words that appear together in a text. The size of the sequence, $n$, is a hyperparameter that can be adjusted depending on the specific task. 


### 1: Fit TF-IDF Vectorizer
First, we fit the vectorizer on the preprocessed descriptions. 
This way, the vectorizer can transform text into numerical feature vectors based on the learned vocabulary and its distribution over documents.

In [26]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,4))
tfidf_matrix = tfidf_vectorizer.fit_transform(edition_software_info_chunked['description_preprocessed_chunks'])

# display the resulting matrix
tfidf_matrix_beautify = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_matrix_beautify

Unnamed: 0,100,100 sprachen,100 sprachen sprachvarianten,100 sprachen sprachvarianten unterstützt,12,12 languages,12 languages comprehensive,12 languages comprehensive righttoleft,1514,1514 digital,...,überschaubaren teil funktionen,überschaubaren teil funktionen aufgrund,übersetzung,übersetzung dient,übersetzung dient einfachen,übersetzung dient einfachen erzeugung,überwiegende,überwiegende teil,überwiegende teil funktionen,überwiegende teil funktionen editionen
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
66,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
67,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
68,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 2: Inspect the tf-idf representations

Each column in this dataframe is a unique word, while each row is a document. The cells denote the number of occurances of a word in a document, weighted by the words potential to be distinctive.

Let's take a look at the tf-idf filtered words for each description (You can find them in the column "tf_idf_filtered_words")

In [27]:
# Define the threshold for TF-IDF scores
threshold = 0.11

# Filter words with TF-IDF scores greater than the threshold for each document
def filter_words_by_threshold(row, threshold):
    filtered_words = [(word, score) for word, score in zip(tfidf_matrix_beautify.columns, row) if score > threshold]
    return sorted(filtered_words, key=lambda x: x[1], reverse=True)


# Apply the function to each row of the TF-IDF DataFrame
filtered_words = tfidf_matrix_beautify.apply(lambda row: filter_words_by_threshold(row, threshold), axis=1)

# Create a dataframe to display the filtered words
filtered_words_df = pd.DataFrame(filtered_words, columns=["tf_idf_filtered_words"])
tfidf_display = pd.concat([edition_software_info_chunked[["brand_name","description_clean_chunks"]], filtered_words_df], axis=1)
    
with pd.option_context('display.max_colwidth', None):
    display(tfidf_display[["brand_name", "description_clean_chunks", "tf_idf_filtered_words"]].head(5))

Unnamed: 0,brand_name,description_clean_chunks,tf_idf_filtered_words
0,Transkribus,"Transkribus ist eine umfassende Plattform für die Digitalisierung, Texterkennung mithilfe Künstlicher Intelligenz, Transkription und das Durchsuchen von historischen Dokumenten. Erkennen, Transkribieren und Durchsuchen von historischen Dokumenten mitttels KI Trainieren von spezifischen Texterkennungsmodellen, die in der Lage sind, handschriftliche, maschinengeschriebene oder gedruckte Dokumente zu erkennen. KI-gestützte Erkennung von handgeschriebenem Text, Layout-Analyse und Strukturerkennung. Manuelles Transkribieren im Transkriptionseditor KI-gestützten Erkennung mittels öffentlicher oder selbst trainierter KI-Modelle Durchsuchen von Dokumenten mit erweiterten Suchoptionen, wie z. B. dem Tool zum Aufspüren von Schlüsselwörtern. Gemeinsames Arbeiten an Dokumenten, Organisation in Sammlungen Teilen von Dokumenten durch eine read&search Website oder Export als PDF oder ALTO (XML). Alle Transkribus-Inhalte, d.h. hochgeladene Bilder, erkannte Texte, trainierte Erkennungsmodelle und eingegebene Metadaten, werden innerhalb der EU gehostet und sind GDPR konform.","[(dokumenten, 0.2706568514645822), (durchsuchen, 0.16239411087874928)]"
1,Autodone,"Autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users.autodone is a service for the automated, time-controlled publication of status updates on any Mastodon instance. The codebase is developed under a free license by the Department of Digital Humanities at the University of Cologne and is open to all interested users. Special features of the service include the ability to upload content in tabular format (tsv files) and the ability to publish posts as a thread. In addition to these basic functionalities, more features will be developed in the future. autodone replaces autoChirp, which offered the same functionality for Twitter before the Twitter API and Twitter itself was massively restricted regarding free and ethical usage. (quoted from: , 19.04.2024) Official Site: []() Usage Instructions []()","[(twitter, 0.14227410099289636), (service, 0.12069270558109332)]"
2,CollateX,"CollateX is a software to (a.) read multiple versions of a text, (b.) identify differences by aligning tokens, and (c.) output the alignment results for further processing, for instance (d.) to support the production of a critical apparatus or the stemmatical analysis of a text's genesis.[CollateX]() is a software to 1. read multiple (≥ 2) versions of a text , splitting each version into parts (tokens) to be compared, 1. identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and 1. output the alignment results in a variety of formats for further processing , for instance 1. to support the production of a critical apparatus or the stemmatical analysis of a text's genesis. It resembles software used to compute differences between files (e.g. [diff]()) or tools for [sequence alignment]() which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results. As such it is primarily designed for use cases in disciplines like [Philology]() or – more specifically – the field of [Textual Criticism]() where the assessment of findings is based on interpretation and therefore can be supported by computational means but is","[(differences, 0.13884687211434793), (tokens, 0.1302538939544816)]"
3,CollateX,[Philology]() or – more specifically – the field of [Textual Criticism]() where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable. Please go to <> for further information.,"[(computable, 0.13397024401843652), (computable please, 0.13397024401843652), (computable please go, 0.13397024401843652), (computable please go information, 0.13397024401843652), (computational means necessarily, 0.13397024401843652), (computational means necessarily computable, 0.13397024401843652), (go, 0.13397024401843652), (go information, 0.13397024401843652), (means necessarily, 0.13397024401843652), (means necessarily computable, 0.13397024401843652), (means necessarily computable please, 0.13397024401843652), (necessarily, 0.13397024401843652), (necessarily computable, 0.13397024401843652), (necessarily computable please, 0.13397024401843652), (necessarily computable please go, 0.13397024401843652), (please, 0.13397024401843652), (please go, 0.13397024401843652), (please go information, 0.13397024401843652), (supported computational means necessarily, 0.13397024401843652), (assessment, 0.12208275724850044), (assessment findings, 0.12208275724850044), (assessment findings based, 0.12208275724850044), (assessment findings based interpretation, 0.12208275724850044), (based interpretation, 0.12208275724850044), (based interpretation therefore, 0.12208275724850044), (based interpretation therefore supported, 0.12208275724850044), (computational means, 0.12208275724850044), (criticism, 0.12208275724850044), (criticism assessment, 0.12208275724850044), (criticism assessment findings, 0.12208275724850044), (criticism assessment findings based, 0.12208275724850044), (field, 0.12208275724850044), (field textual, 0.12208275724850044), (field textual criticism, 0.12208275724850044), (field textual criticism assessment, 0.12208275724850044), (findings, 0.12208275724850044), (findings based, 0.12208275724850044), (findings based interpretation, 0.12208275724850044), (findings based interpretation therefore, 0.12208275724850044), (interpretation, 0.12208275724850044), (interpretation therefore, 0.12208275724850044), (interpretation therefore supported, 0.12208275724850044), (interpretation therefore supported computational, 0.12208275724850044), (philology specifically, 0.12208275724850044), (philology specifically field, 0.12208275724850044), (philology specifically field textual, 0.12208275724850044), (specifically field, 0.12208275724850044), (specifically field textual, 0.12208275724850044), (specifically field textual criticism, 0.12208275724850044), (supported computational, 0.12208275724850044), (supported computational means, 0.12208275724850044), (textual criticism, 0.12208275724850044), (textual criticism assessment, 0.12208275724850044), (textual criticism assessment findings, 0.12208275724850044), (therefore supported, 0.12208275724850044), (therefore supported computational, 0.12208275724850044), (therefore supported computational means, 0.12208275724850044), (computational, 0.11364845115943978), (philology, 0.11364845115943978), (specifically, 0.11364845115943978)]"
4,LaTeX,"LaTeX (gesprochen “Lah-tech” oder “Lay-tech”), ist eine Textsatz*sprache* und ein *Programm* für die Erstellung qualitativ hochwertiger Druckausgaben. Ursprünglich entwickelt für mathematischen Textsatz wird es heute für alle Arten von wissenschaftlichen Texten und auch darüber hinaus eingesetzt.Der Mathematiker Donald E. Knuth entwickelte Ende der Siebziger Jahre ein Textsatzprogramm, um seine Bücher schöner setzen zu können. Das so entstandene TeX-System verbreitete sich recht schnell, erforderte aber eine intensive Einarbeitung in die zugehörige Programmiersprache. Mit LaTeX 2e, dem Anfang der Neunziger Jahre entwickelten Makropaket","[(jahre, 0.14427051575072258), (latex, 0.12238628647109732)]"


### 3. Test tfidf-representation

This cell will return the most relevant documents from our dataset based on a comparison of their tf-idf representations and a query. The query can be changed.

In [28]:
query = 'I want to transcribe and annotate a manuscript'
# Preprocess the query
query = preprocess(stopwords_combined, query)
# Transform the query to TF-IDF space
query_tfidf = tfidf_vectorizer.transform([query]) 

# Compute cosine similarity between the query and the documents
similarities = cosine_similarity(query_tfidf, tfidf_matrix)

similarity_df = pd.DataFrame({
    'similarity_score': similarities[0]
})

result_df = pd.concat([edition_software_info_chunked[["brand_name", "description_clean_chunks"]], similarity_df], axis=1)
result_df_sorted = result_df.sort_values(by='similarity_score', ascending=False)

# print the top 3 description, that might be relevant to our query
with pd.option_context('display.max_colwidth', None):
    display(result_df_sorted[["brand_name", 'description_clean_chunks', 'similarity_score']].head(3))

Unnamed: 0,brand_name,description_clean_chunks,similarity_score
48,TEITOK,"based web server. Features Manuscript-based corpora Align your manuscript with your transcript Display each manuscript line with its transcription Transcribe directly from the manuscript Search directly for manuscript fragments Keep multiple editions within the same environment Stand-off Annotations Adds stand-off annotations to any corpus file Edit using an efficient interface Annotate over discontinuous regions Incorporate annotations into the CQP corpus Audio-based corpora Align your audio with your transcription Transcribe directly from the audio file Scroll transcription vertical with wave function horizontal Search directly for audio segments Dependency Grammar Keep dependency relations inside any corpus type Visualize dependency trees for any sentence Edit trees easily Search using dependency relations Geolocation Coordinates Map documents onto the world map Document are clustered into counted groups Access the documents from the map Compare corpus queries on the world map Edit from CQP Query Search for words often incorrectly annotated Click on any token in a KWIC list to edit it Edit all results in a systematic way Edit each results individually in a list Pre-modify each result by a regular expression Search The rich XML format used in TEITOK is hard to search through. For easier access, all corpora are therefore indexed using the Corpus WorkBench (CWB), allowing texts to be search efficiently, and with the rich query language that CWB provides. Words are indexed in the CWB with various",0.086888
65,VMR CRE,"management; 4) indexing of folio content; 5) transcribing; 6) collating; 7) regularizing; 8) editing an apparatus; 9) genealogical analysis of the witness corpus. Metadata and Feature Tagging The VMR CRE stores with each manuscript a very limited set of descriptive data, reserving the primary metadata capture for a dynamic tagging facility called Feature Tagging. A Feature is any defined metadata information which might be captured for a manuscript or manuscript page. For example, an alternative catalog identifier, an external image repository, the canvas material type, the ink type, the script type; these are all Features which might be tagged on a manuscript; For individual pages: an illumination, a canon table, or even individual sample script characters might be tagged as Features. These Features must first be defined in the system, and the VMR CRE comes by default with a predefined set of Feature Definitions used at the INTF. A Feature Definition can specify that zero or more values should be captured with the Feature tag and what those value types and value domains should be. Once a Feature is defined, it can be used to tag manuscripts or manuscript pages, capturing individual Feature values for each tag, if necessary. Every Feature Definition adds to the number of facets available in the catalog search facility. For example, one might search for all manuscript folio sides from Egypt",0.06438
47,TEITOK,"TEITOK is a web-based platform for viewing, creating, and editing corpora with both rich textual mark-up and linguistic annotation. TEITOK is a web-based platform for viewing, creating, and editing corpora with both rich textual mark-up and linguistic annotation, initially developed at the Centro de Linguística da Universidade de Lisboa, later at CELGA-ILTEC, and currently maintained at the ÚFAL institute of Charles University, Prague. The system has a modular design with numerous modules making serving a wide range of different corpus types. Below are some examples of some of those, and the type of corpora TEITOK can deal with. More modules are added frequently, and it is possible to add custom modules as well. Historical Corpora For historical corpora, TEITOK provides the option to have an alignment between the transcription and the facsimile image, it provides the option to work with multiple orthographic realizations to combine several editions of a text into a single XML file, and it provides the option to create a searchable document map to see where in the world several phenomena are more frequent. TEITOK is freely available for anybody who wishes to create richly annotated textual corpora, and runs on any LINUX based web server. Features Manuscript-based corpora Align your manuscript with your transcript Display each manuscript line with its transcription Transcribe directly from the manuscript Search directly for manuscript fragments Keep",0.05822


#### 4. Save vectorisations and the TFIDF-vectorizer

Now we can save the vectorisations and the vectorizer to be later used in our RAG-Pipeline.

In [29]:
path = "vectorisations/tfidf.npy"
np.save(path, tfidf_matrix)

In [30]:
import pickle
path = os.path.join(os.getcwd(), "models/tfidf_vectorizer.pickle")
pickle.dump(tfidf_vectorizer, open(path, "wb"))

# Vectorization 2: Aggregated Word2Vec


Next, we'll create document representations by aggregating the word2vec embeddings of each word in a description. 

Word2vec encodes the meaning of the words by capturing their semantic relationships based on the context in which they appear. By aggregating the word2vec embeddings of each word in a description, we can create a document representation that retains the semantic information and provides a more nuanced understanding of the content.

From a computational perspective, these representations are shorter and denser than tf-idf representations, making them more suitable for computations such as similarity measures, clustering, or classification tasks. The dense nature of word2vec embeddings allows for efficient storage and faster processing compared to sparse representations like tf-idf. Additionally, because word2vec captures the meaning and context of words, it can provide more meaningful insights into the relationships between different documents or terms.


#### 1. Load the pretrained model

In [31]:
current_path = os.getcwd()
path = os.path.join(current_path, "models/word2vec-google-news-300.bin")

# Load the model if it is already in our project. If not, download it.
if os.path.isfile(path):
    print("Model found. Loading...")
    word2vec_model = KeyedVectors.load(path)
    
else:
    print("Model not found. Downloading...")
    word2vec_model = api.load("word2vec-google-news-300")
    word2vec_model.save(path)
    


Model found. Loading...


In [32]:
""" #TODO: Add some preprocessing
def preprocess_word2vec(text: str) -> str:
    pass
"""

' #TODO: Add some preprocessing\ndef preprocess_word2vec(text: str) -> str:\n    pass\n'

#### 2. Create the document representations

In [33]:
def get_word2vec_vector(words, model):
    words = words.split()
    # Filter words that are in the model's vocabulary
    valid_words = [word for word in words if word in model]
    
    if not valid_words:
        # Return a zero vector if no valid words are found
        return np.zeros(model.vector_size)
    
    # Average the vectors of the valid words to create a document representation
    vectors = [model[word] for word in valid_words]
    return np.mean(vectors, axis=0)

# Apply the function to create aggregated vectors
word2vec = edition_software_info_chunked['description_preprocessed_chunks'].apply(lambda x: get_word2vec_vector(x, word2vec_model))

# Convert the Series of 1D arrays to a 2D numpy array (to calculate the cosine similarity later on)
word2vec_array = np.array(word2vec.tolist())
len(word2vec_array)

70

#### 3. Test Word2Vec Representation

In [34]:
query = 'I need a search capability for my edition of letters'
# Preprocess the query
query = preprocess(stopwords_combined, query)
# get vector representation of the query using word2vec
query_word2vec = get_word2vec_vector(query, word2vec_model)
# Reshape the query vector to be a 2D array with one row
query_word2vec = query_word2vec.reshape(1, -1)

# Compute cosine similarity between the query and the documents
similarities = cosine_similarity(query_word2vec, word2vec_array)

similarity_df = pd.DataFrame({
    'similarity_score': similarities.flatten()
})

result_df = pd.concat([edition_software_info_chunked[["brand_name","description_clean_chunks"]], similarity_df], axis=1)
result_df_sorted = result_df.sort_values(by='similarity_score', ascending=False)

# print the top 3 description, that might be relevant to our query
with pd.option_context('display.max_colwidth', None):
    display(result_df_sorted[['brand_name','description_clean_chunks', 'similarity_score']].head(5))

Unnamed: 0,brand_name,description_clean_chunks,similarity_score
52,TEI Critical Apparatus Toolbox,"A visualization and quality control tool for people preparing a natively digital TEI critical edition.The TEI Critical Apparatus Toolbox (TEI CAT) is a simple tool offering an easy visualization for TEI XML critical editions. It especially targets the needs of people working on natively-digital editions. Its main purpose is to provide editors with an easy way of visualizing their ongoing work before it is finalized, and also to perform some automatic quality checks on their encoding. Features The Toolbox lets you... [Check your encoding](): offers facilities to display your edition while it is still in the making, and check the consistency of your encoding [Display parallel versions](): choose the sigla of the witnesses, and the different versions of the text, following each chosen witness, will be displayed in parallel columns. [Print an edition]() of a TEI XML edition, with a TEI-to-LateX and PDF transformation [Annotate an image](): lets you easily trace zones on an image to prepare a documentary edition [Get statistics]() on the XML tags effectively used in different parts of your edition, and some word count. A [downloadable version]() is coming soon.",0.649751
27,Publex,"online version. Upon import, Publex parses the data and captures all tags, attributes, and associated attribute values with which the resource has been tagged. For each of these elements and specific combinations, properties such as font style, font size, color, letterspacing, or text indent can now be specified. In addition, elements can be defined as search categories and added to the lemma list. Special characters can be included and displayed via the [KompLett]() font. For this purpose, the TCDH provides a file with all the [non-Unicode characters defined for the Trier dictionaries](). In parallel to the creation of the display rules, the user can look at the Dictionary Preview at any time to see how the defined styling rules are implemented and how the dictionary will look in the publication version. Publex provides different accesses to the dictionary: a lemma list and different search options. The items in the lemma list are sorted alphabetically and can be searched using a search box. Clicking on them displays the linked dictionary item on the screen. The general search can be used to search the dictionary contents in full text, the advanced search offers an AND-linked search over the full text and any number of information fields previously defined as search categories in the styling rules. Once published, the dictionary receives its own URI and also appears, along with",0.637895
14,correspSearch,"With correspSearch you can search within the metadata of diverse scholarly editions of letters. One can search according to the letter's sender, adressee, as well as place and date of the letter's creation. With correspSearch you can search through indexes of different letter collections (digital or print) by sender, addressee, location written, location sent, and date. To this purpose a website and a technical interface are provided. The web service collects and evaluates TEI-XML data in the ‘Correspondence Metadata Interchange’ format. The web service correspSearch is operated and developed according to the following principles: Reference System : The web service aims to help users with their research by offering a central location to search for letters, and by guiding them to the original publication. Academic Data : The web service is based on the data from letter-indexes of editions or repositories that are edited according to academic criteria. Conceptionaly Open : There is no focus on a particular time period or place. This allows for new kinds of research questions to be explored. Open Access : Data is only collected that is under a free license, and the data from the web service continues to be under a free license and is thus available for further use. Open Interfaces : correspSearch offers technical interfaces that are open and well documented. Other projects can easily query and",0.635463
66,VMR CRE,"tag, if necessary. Every Feature Definition adds to the number of facets available in the catalog search facility. For example, one might search for all manuscript folio sides from Egypt which include Illuminations and any part of the Gospel of John. A Feature tagged to a manuscript page can also include a region box, marking the area on a folio image where the Feature is present. If a region box is captured, a search query can specify to show the region box clips in the result. For example, a paleographer might choose to capture a set of representative letters for each manuscript and then perform a search for all double column manuscripts with a height of at least 20cm between the II and V centuries, and to ask the query results to show the representative α (alpha) clips. Transcription and Reconciliation Transcription work in the VMR CRE is done using a What You See Is What You Mean (WYSIWYM) web-based editor originally developed by the University of Trier in collaboration with the INTF and ITSEE in Birmingham. This transcription editor has been developed as a plugin for the popular TinyMCE HTML editor component. The editor includes menus and dialogs to assist the researcher with composing a transcription, without asking the transcriber to learn special markup codes. The content may then be obtained as EpiDoc influenced TEI.",0.632575
49,TEITOK,"therefore indexed using the Corpus WorkBench (CWB), allowing texts to be search efficiently, and with the rich query language that CWB provides. Words are indexed in the CWB with various orthographic forms, providing many ways to search through the data. The type of corpora that TEITOK is meant for are very labour-intensive: for ancient texts, hardly any of the data will be available in digital format, and have to be scanned. In many cases, OCR will not work and even for human readers the texts are often very hard to read. And the data will display a lot of orthographic variation in which a lot of the linguistic annotation, including normalization, will have to be done by hand. As a result, most corpora created with TEITOK will have a limited size, and searching for linguistic properties in them will not yield a lot of results. Therefore, TEITOK offers the option to index the corpus in a central database, which can be searched via this site. Each search result will only display the direct context of the word, and will link directly to the word in the original text on the site of the project it originated from. This way, it is possible to search through multiple corpora at the same time, and get access to the full original data in a way that prominently features the",0.623739


#### 4. Save the vectorisations

Now we can save the vectorisations for later use in RAG.

In [35]:
path = "vectorisations/word2vec.npy"
np.save(path, np.array(word2vec_array))

# Vectorization 3: Aggregated FastText


One downside of pretrained Word2Vec representations is their inability to handle words not contained in their vocabulary. 

FastText overcomes this issue by representing words not only as embeddings but also as collections of embedded character n-grams. This approach allows FastText to generate meaningful word vectors for previously unseen words, which is particularly useful when dealing with highly specialized terminologies. In our dataset, which contains such specialized language, FastText may therefore offer more reliable performance compared to traditional Word2Vec models.


### 1. Load the models

FastText provides pre-aligned word vectors, meaning that word vectors for different languages (like German and English) have already been mapped into a common vector space. This allows words with similar meanings across different languages to have similar vector representations, which is crucial when working with multilingual datasets.

Since our dataset contains both German and English texts, we need to download the pre-aligned FastText models for these two languages.


In [36]:
import zipfile

def unzip_file(zip_file_path: str, extract_to: str) -> None:
    """Unzip a file to a target directory."""
    try:
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            zip_ref.extractall(extract_to)
        print(f"Unzipped {zip_file_path} to {extract_to}")
    except zipfile.BadZipFile as e:
        print(f"Error while unzipping the file: {e}")

In [7]:
# Check if the english model file exists. If so, load it. If not, download it and convert it to .bin for faster loading in the future. 
# This might take a while

current_path = os.getcwd()
models_dir = os.path.join(current_path, "models")
fasttext_eng_zip_path = os.path.join(models_dir, "wiki.en.zip")
fasttext_eng_path_vec = os.path.join(models_dir, "wiki.en.vec")
fasttext_eng_path_bin = os.path.join(models_dir, "wiki.en.bin")

if os.path.isfile(fasttext_eng_path_bin):
    print("Model found. Loading...")
    aligned_vectors_eng = gensim.models.fasttext.load_facebook_model(fasttext_eng_path_bin) #load the full model, including subword information.
    
else:
    print("Model not found. Downloading...")
    url = "https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.zip" 
    # download the models
    download_file(url, fasttext_eng_zip_path)
    
    print("Unzipping the file...")
    unzip_file(fasttext_eng_zip_path, models_dir)    

    # load the model
    print("Loading model from file...")
    aligned_vectors_eng = gensim.models.fasttext.load_facebook_model(fasttext_eng_path_bin)
    # save the model as binary to reduce loading time in the future
    aligned_vectors_eng.save(fasttext_eng_path_bin)

    
if aligned_vectors_eng is None:
    raise ValueError("The FastText model was not loaded properly.")

Model found. Loading...


In [38]:
# Check if the german model file exists. If so, load it. If not, download it and convert it to .bin for faster loading in the future.
# This might take a while

fasttext_de_path_bin = os.path.join(current_path, "models/wiki_de_align.bin")
fasttext_de_path_vec = os.path.join(current_path, "models/wiki_de_align.vec")

if os.path.isfile(fasttext_de_path_bin):
    print("Model found. Loading...")
    aligned_vectors_de = KeyedVectors.load(fasttext_de_path_bin)
    
else:
    print("Model not found. Downloading...")
    url = "https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.de.align.vec"
    # download the model
    download_file(url, fasttext_de_path_vec)
    # load the model
    print("Loading model from file...")
    aligned_vectors_de = load_word_vectors(fasttext_de_path_vec)
    # save the model as binary to reduce loading time in the future
    aligned_vectors_de.save(fasttext_de_path_bin)
    
if aligned_vectors_de is None:
    raise ValueError("The FastText model or vectors were not loaded properly.")


Model found. Loading...


Now that we loaded both models, let's check if they are properly aligned. 

To do so, we'll pick an english word, get its english vector representation and return the most similar word vector in the german vector space.

In [39]:
# get the vector-representation of a random english word
word_english = 'skyscraper'
word_vector_in_english = aligned_vectors_eng.wv[word_english]

print(f"German word vector closest to {word_english}:", aligned_vectors_de.most_similar(positive=[word_vector_in_english]))

German word vector closest to skyscraper: [('hochhaus', 0.5159210562705994), ('wolkenkratzer', 0.4891827404499054), ('bürohochhaus', 0.4611120820045471), ('hochhausturm', 0.43771031498908997), ('wolkenkratzern', 0.43421125411987305), ('hochhäuser', 0.4312896728515625), ('bürohochhäuser', 0.4262181520462036), ('wolkenkratzers', 0.4224645495414734), ('hochhausbau', 0.4211919605731964), ('„wolkenkratzer', 0.42107880115509033)]


#### 2. Create Document Representations

Now that we confirmed, that the vector spaces are properly aligned, we can create the document representations. 
As the pre-aligned german model does not contain subword-infomation, we'll use the subword-information contained in the english model to embedd unknown words in both languages.

We'll use the preprocessed descriptions we created earlier.

Additionaly, we'll print the words we created new wod vectors using the english subword information. 

In [40]:
def get_fasttext_vector(row, aligned_vectors_de=None, aligned_vectors_eng=None):
    """
    Calculates the FastText vector representation for a given row.
    Parameters:
    - row: A row of data.
    - aligned_vectors_de: Aligned FastText vectors for the German language. Default is None.
    - aligned_vectors_eng: Aligned FastText vectors for the English language. Default is None.
    Note:
    - If the language is not specified or not supported (only "en" and "de" are supported), it returns a zero vector.
    - If a word in the row's description is not found in the aligned vectors, it tries to create a vector based on english subword information.
    - If no vectors are found, it returns a zero vector.
    """
    
    
    # default size to avoid errors if vectors are None
    vector_size = aligned_vectors_de.vector_size if aligned_vectors_de else 300
    
    # check if language is valid
    lang = row.get("description_lang")
    if pd.isna(lang) or lang not in ["en", "de"]:
        return np.zeros(vector_size) #Maybe rather use none?
    
    words = row.get("description_preprocessed_chunks", "").split()
    vectors = []

    # process based on language
    if lang == "de" and aligned_vectors_de:
        for word in map(str.lower, words):
            try:
                vectors.append(aligned_vectors_de[word])
            except KeyError:
                print(f"Created Vector based on Subword Information for: {word}")                
                vectors.append(aligned_vectors_eng.wv[word])
                #vectors.append(np.zeros(vector_size))
                
    elif lang == "en" and aligned_vectors_eng:
        for word in map(str.lower, words):
            try:
                vectors.append(aligned_vectors_eng.wv[word])
            except KeyError:
                print(f"Missing Vector for: {word}")
                vectors.append(aligned_vectors_eng.wv[word])
    
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros


# Apply the function to create aggregated vectors
fasttext = edition_software_info_chunked.apply(lambda x: get_fasttext_vector(x, aligned_vectors_de, aligned_vectors_eng), axis=1)

# Convert the Series of 1D arrays to a 2D numpy array (to calculate the cosine similarity later on)
fasttext_array = np.array(word2vec.tolist())
len(fasttext)

Created Vector based on Subword Information for: mitttels
Created Vector based on Subword Information for: texterkennungsmodellen
Created Vector based on Subword Information for: kigestützte
Created Vector based on Subword Information for: layoutanalyse
Created Vector based on Subword Information for: strukturerkennung
Created Vector based on Subword Information for: transkriptionseditor
Created Vector based on Subword Information for: kigestützten
Created Vector based on Subword Information for: kimodelle
Created Vector based on Subword Information for: readsearch
Created Vector based on Subword Information for: transkribusinhalte
Created Vector based on Subword Information for: erkennungsmodelle
Created Vector based on Subword Information for: gdpr
Created Vector based on Subword Information for: “lahtech”
Created Vector based on Subword Information for: “laytech”
Created Vector based on Subword Information for: textsatzsprache
Created Vector based on Subword Information for: eingese

70

#### 3. Test the FastText Representations

In [41]:
query = 'I want to transcribe a manuscript'

# check the queries language, to decide which model to use.
query_lang = identify_language(query).replace("__label__","") if not query == '' else np.nan

# preprocess the query
query = preprocess(stopwords_combined, query)

# create a series that can be used as an input in the get_fasttext_vector function. Ignore the key names.
query_information = pd.Series({
    "description_lang": query_lang,
    "description_preprocessed_chunks": query
})

# get vector representation of the query using fasttext
query_fasttext= get_fasttext_vector(query_information, aligned_vectors_de, aligned_vectors_eng)

# Reshape the query vector to be a 2D array with one row
query_fasttext = query_fasttext.reshape(1, -1)

# Compute cosine similarity between the query and the documents
similarities = cosine_similarity(query_fasttext, fasttext_array)

similarity_df = pd.DataFrame({
    'similarity_score': similarities.flatten()
})

result_df = pd.concat([edition_software_info_chunked[["brand_name","description_clean_chunks"]], similarity_df], axis=1)
result_df_sorted = result_df.sort_values(by='similarity_score', ascending=False)

# print the top 3 description, that might be relevant to our query
with pd.option_context('display.max_colwidth', None):
    display(result_df_sorted[['brand_name','description_clean_chunks', 'similarity_score']].head(5))

Unnamed: 0,brand_name,description_clean_chunks,similarity_score
50,TEITOK,"from. This way, it is possible to search through multiple corpora at the same time, and get access to the full original data in a way that prominently features the original project.",0.147178
34,Comparo,"(Werke 1905–1931)“]() and is used there to create a 'microgenesis' view, which arranges all text-genetically relevant documents of a work by Arthur Schnitzler in a single view in the order in which they were created and parallels the corresponding sentences in different versions with the connections which have been stored via Comparo. This way, a highly complex and yet clear view of all changes that the text has undergone in the course of its creation is created and thus not only answers exciting questions from a philological point of view, such as: Which passages were essentially already present in the first sketch of a work? Which sentences have meanwhile been planned elsewhere? Which sentences have been dropped or newly added in the course of the text's genesis and which have meanwhile been discarded and then restituted? Which passages were once planned completely differently in terms of content (e.g. alternative endings)? In which areas did Schnitzler change and file a lot and which areas have remained almost unchanged from the first note to the later print? The information generated in Comparo is stored in an [FuD]() database. With the help of this database it is possible to work out a detailed presentation in the form of a web view. In the website created for the project ['Arthur Schnitzler digital']() , further settings can be made in order",0.135922
60,CATview,the respective text segment. The current scroll position in the text is marked by a scroll spy in the overview bar. Highlighting Search Results CATview visualizes search hits by coloring (yellow) the segments matching the search request. Easy Restriction of the Text under Consideration CATview offers a comfortable selection of consecutive text segments by drawing a box around them. Statistical Analysis CATview allows statistical analysis with respect to text excerpts.,0.108582
49,TEITOK,"therefore indexed using the Corpus WorkBench (CWB), allowing texts to be search efficiently, and with the rich query language that CWB provides. Words are indexed in the CWB with various orthographic forms, providing many ways to search through the data. The type of corpora that TEITOK is meant for are very labour-intensive: for ancient texts, hardly any of the data will be available in digital format, and have to be scanned. In many cases, OCR will not work and even for human readers the texts are often very hard to read. And the data will display a lot of orthographic variation in which a lot of the linguistic annotation, including normalization, will have to be done by hand. As a result, most corpora created with TEITOK will have a limited size, and searching for linguistic properties in them will not yield a lot of results. Therefore, TEITOK offers the option to index the corpus in a central database, which can be searched via this site. Each search result will only display the direct context of the word, and will link directly to the word in the original text on the site of the project it originated from. This way, it is possible to search through multiple corpora at the same time, and get access to the full original data in a way that prominently features the",0.108265
48,TEITOK,"based web server. Features Manuscript-based corpora Align your manuscript with your transcript Display each manuscript line with its transcription Transcribe directly from the manuscript Search directly for manuscript fragments Keep multiple editions within the same environment Stand-off Annotations Adds stand-off annotations to any corpus file Edit using an efficient interface Annotate over discontinuous regions Incorporate annotations into the CQP corpus Audio-based corpora Align your audio with your transcription Transcribe directly from the audio file Scroll transcription vertical with wave function horizontal Search directly for audio segments Dependency Grammar Keep dependency relations inside any corpus type Visualize dependency trees for any sentence Edit trees easily Search using dependency relations Geolocation Coordinates Map documents onto the world map Document are clustered into counted groups Access the documents from the map Compare corpus queries on the world map Edit from CQP Query Search for words often incorrectly annotated Click on any token in a KWIC list to edit it Edit all results in a systematic way Edit each results individually in a list Pre-modify each result by a regular expression Search The rich XML format used in TEITOK is hard to search through. For easier access, all corpora are therefore indexed using the Corpus WorkBench (CWB), allowing texts to be search efficiently, and with the rich query language that CWB provides. Words are indexed in the CWB with various",0.100146


#### 5. Save the vectorisations

Now we save the vectors we created for later use in RAG.

In [42]:
path = "vectorisations/fasttext.npy"
np.save(path, np.array(fasttext_array))

# Vectorization 4: SBERT

**SBERT**: 

In contrast to the output of regular embedding models, word vectors created using BERT models are contextualized. This means that they can generate multiple word embeddings for the same word, depending on the meaning it takes on in a certain context. "Bank," for example, will have two different representations—one as a river's shore and another as a place to store money.

Since BERT models by themselves are notoriously bad for semantic similarity tasks applied to sentence- or paragraph-level vectors, we will use  SBERT to create document representations instead. These models also create contextualized embeddings but are specifically trained with semantic similarity in mind: A triplet loss function is used to minimize the distance between an "anchor point" and a positive sample while maximizing the distance to a negative sample. This forces sentence transformers to produce a vector space where semantically similar sentences are close together, while sentence embeddings of semantically dissimilar sentences are far apart.

Specifically, we will use the sentence-transformers library, which builds on the original [SBERT paper](https://arxiv.org/abs/1908.10084).

**Reranking:**

As it is suggested in the documentation of the sentence transformer module, the similarities calculated using the SBERT-representations will be reranked using a Cross-Encoder. 
Cross-Encoders tackle the task of calculating similarity as a classification task, classifying two sentences as either "relevant" or "not relevant" in relation to one another. We'll not rerank the all similarity scores, but only the entries most similar to our query.

Detailed information on this retrieve & re-rank process and its benefits can be found [here](https://www.sbert.net/examples/applications/retrieve_rerank/README.html).


#### 1. Load the models

In [60]:
import os
import sentence_transformers
# Download SBERT model or load it from drive
sbert_path = os.path.join(os.getcwd(),"models/sbert")
downloaded = os.path.isdir(sbert_path)

if not downloaded:
    print("Downloading Sentence Transformer...")
    sbert_model = sentence_transformers.SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
    sbert_model.save(sbert_path)
else:
    print("Load Sentence Transformer from drive...")
    sbert_model = sentence_transformers.SentenceTransformer(sbert_path)
    print("Success")

Load Sentence Transformer from drive...
Success


In [64]:
# Download Cross Encoder or load it from drive

cross_encoder_path = os.path.join(os.getcwd(), "models/cross")
downloaded = os.path.isdir(cross_encoder_path)

if not downloaded:
    print("Downloading Cross Encoder...")
    cross_encoder_model = sentence_transformers.CrossEncoder("corrius/cross-encoder-mmarco-mMiniLMv2-L12-H384-v1")
    cross_encoder_model.save(cross_encoder_path)
else:
    print("Load Cross Encoder from drive...")
    cross_encoder_model = sentence_transformers.CrossEncoder(cross_encoder_path)
    print("Success")

Load Cross Encoder from drive...
Success


#### 2. Create Document Representations

As creating SBERT embeddings may take some time, we'll first create a function to parallelize the process. The parameter num_chunks can be increased to control the number of concurrent embedding processes.

In [44]:
# Set environment variable to control tokenizers parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "true"

def get_sbert_embeddings(text, model):
    """
    Get Sentence-BERT embeddings for a given text using a specified model.
    Parameters:
    text (str): The input text to encode.
    model: The Sentence-BERT model to use for encoding.
    Returns:
    numpy.ndarray: The Sentence-BERT embeddings for the input text.
    """
    default_embedding = np.zeros((model.get_sentence_embedding_dimension(),))
    
    if pd.isna(text) or text.strip() == '':
        return default_embedding    
    return model.encode(text, convert_to_tensor=False)


def compute_sbert_embeddings_in_parallel(df, text_column, model, num_chunks=3):
    """
    Compute SBERT embeddings in parallel for the given text column in the DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame containing the text data.
        text_column (str): The column name of the text data.
        model (Any): SBERT model for generating embeddings.
        num_chunks (int): The number of chunks to split the data for parallel processing.

    Returns:
        list: A list containing the computed SBERT embeddings.
    """

    # Function to process each chunk of data
    def process_chunk(chunk):
        embeddings = []
        
        for text in chunk:
            embeddings.append(get_sbert_embeddings(text, model))
        
        return embeddings

    # Split DataFrame into chunks for parallel processing
    df_chunks = np.array_split(df[text_column], num_chunks)

    # Process each chunk in parallel
    results = Parallel(n_jobs=num_chunks)(
        delayed(process_chunk)(chunk) for chunk in df_chunks
    )

    # Combine and return the results
    return np.concatenate(results).tolist()

In [45]:
# get vector representation of the query using the sbert model
sbert_array = compute_sbert_embeddings_in_parallel(edition_software_info_chunked, "description_clean_chunks", sbert_model, num_chunks=3)

  return bound(*args, **kwds)


#### 3. Test SBERT-Embeddings

In this cell, we can test similarities between a query and the SBERT-Embeddings, without reranking of the results.

In [46]:
#query = 'Ich muss verschiedene Stadien eines Manuskripts vergleichen und verschiedene Versionen desselben Schriftstückes nebeneinander darstellen'
query = 'I want to compare different versions of a text to see how it developed over time'

query_sbert = get_sbert_embeddings(query, sbert_model)

# Reshape the query vector to be a 2D array with one row
query_sbert = query_sbert.reshape(1, -1)

# Compute cosine similarity between the query and the documents
similarities = cosine_similarity(query_sbert, sbert_array)

similarity_df = pd.DataFrame({
    'similarity_score': similarities.flatten()
})

result_df = pd.concat([edition_software_info_chunked[["brand_name","description_clean_chunks"]], similarity_df], axis=1)
result_df_sorted = result_df.sort_values(by='similarity_score', ascending=False)

# print the top 5 descriptions, that might be relevant to our query
with pd.option_context('display.max_colwidth', None):
    display(result_df_sorted[['brand_name','description_clean_chunks', 'similarity_score']].head(5))

Unnamed: 0,brand_name,description_clean_chunks,similarity_score
33,Comparo,"A tool for synchronously comparing an almost unlimited number of versions of a text. Comparo is a tool for synchronously comparing an almost unlimited number of versions of a text. The comparison unit is the 'sentence'. Since all versions of a text are compared with each other in a previously defined order at the same time, Comparo is a very clear tool that can be used for collation in a time-efficient manner. The comparison parameters can also be comprehensively configured by the user so that they can be individually tailored to the respective text type and readjusted if necessary in order to optimize the comparison result (for example, individual 'high-frequency' words can be defined to be explicitly excluded from the comparison). Following the suggestion that is automatically generated on the basis of the default settings, the user can manually modify, add or remove each assignment. Features such as 'Bookmark', 'Comment', 'Search', 'Mark as done' as well as various evaluation options in clear list form (e.g. display of all unconnected elements) also increase usability. Use in philological research Comparo was developed by the TCDH as part of the binational research project [“Arthur Schnitzler digital. Digitale historisch-kritische Edition (Werke 1905–1931)“]() and is used there to create a 'microgenesis' view, which arranges all text-genetically relevant documents of a work by Arthur Schnitzler in a single view in the order",0.583317
34,Comparo,"(Werke 1905–1931)“]() and is used there to create a 'microgenesis' view, which arranges all text-genetically relevant documents of a work by Arthur Schnitzler in a single view in the order in which they were created and parallels the corresponding sentences in different versions with the connections which have been stored via Comparo. This way, a highly complex and yet clear view of all changes that the text has undergone in the course of its creation is created and thus not only answers exciting questions from a philological point of view, such as: Which passages were essentially already present in the first sketch of a work? Which sentences have meanwhile been planned elsewhere? Which sentences have been dropped or newly added in the course of the text's genesis and which have meanwhile been discarded and then restituted? Which passages were once planned completely differently in terms of content (e.g. alternative endings)? In which areas did Schnitzler change and file a lot and which areas have remained almost unchanged from the first note to the later print? The information generated in Comparo is stored in an [FuD]() database. With the help of this database it is possible to work out a detailed presentation in the form of a web view. In the website created for the project ['Arthur Schnitzler digital']() , further settings can be made in order",0.454486
31,Transcribo,"and have resulted in a very differentiated and practically tested tool. Some of the most important features of Transcribo 1. Fine-grained markup of microgenetic facts (at document, page, sentence/line, word, and graph levels) 2. Markup of complex interrelated phenomena/change operations via relations (page and cross-page) 3. Correction function for automated comparison of A and B files from different users to ensure error-free transcription and annotation ('double-blind process') 4. OCR functionality for automatic reading of typescripts or prints (Tesseract) 5. Navigation perspective: interface to the FuD database (linking facsimiles with metadata recorded in FuD) 6. Structural perspective: overview of a text section of any size (synopsis of different graphical representations: Facsimile, transcription, annotations) and possibility of depositing cross-page phenomena. 7. Module for defining and labeling text states or text layers. Technical requirements The application was implemented in the Eclipse integrated development environment. It is a rich client application that reuses parts of the development environment and was developed in the Java programming language. We currently provide a version for Windows and macOS operating systems. Projects using Transcribo [Arthur Schnitzler Digital]() [The Augsburg Master Builder’s Ledgers]() [Digitale Edition and Analysis of the Medulla Gestorum Treverensium by Johann Enen (1514)]() [Digital Marburg Büchner Edition]() [Johann Caspar Lavater]() [Kurt Schwitters' Intermedia Networks of the Avant-garde]() [Stefan Heym: “Ahasver”]() [Digitalization of the Plock Bible, Old High German Dictionary]() [Old High German",0.432837
6,Tesseract OCR,"Tesseract ist eine Software zur Texterkennung. Mehr als 100 Sprachen und Sprachvarianten werden unterstützt, zudem verschiedene Schriften / Schriftsysteme: lateinische Antiqua, Fraktur, Devanagari (indische Schrift), chinesische, arabische, griechische, hebräische, kyrillische Schrift.",0.427719
62,Versioning Machine,"A framework and an interface for displaying multiple versions of text encoded according to the Text Encoding Initiative (TEI)The Versioning Machine is a framework and an interface for displaying multiple versions of text encoded according to the Text Encoding Initiative (TEI) Guidelines, and is P5 compatible. While the VM provides for features typically found in critical editions, such as annotation and introductory material, it also takes advantage of the opportunities afforded by electronic publication to allow for the comparison diplomatic versions of witnesses, and the ability to easily compare an image of the manuscript with a diplomatic version. VM 5.0 adds a number of new features, including the ability to resize and reorganize text panels, panning and zooming in the image viewer, and text-audio interlinking. The Versioning Machine’s underlying code has also been completely revised to support enhanced features. The Versioning Machine is a useful tool for textual editors, providing an environment that allows editors to immediately see the consequences of their editorial decisions. The platform also has applications in teaching, translation, and digital publication. The many uses of the Versioning Machine are illustrated in the new VM IN USE section. The Versioning Machine can be used locally on a Mac or a PC, or it can be mounted on the Internet for public access. The documentation provided with the software not only provides information about",0.421393


#### 4. Test reranked SBERT-Representations

Now, we can apply reranking to the similariites calculated above and compare the results.

In [65]:
# Sort the results by similarity scores calculated using the sbert-representations in descending order
result_df_sorted = result_df.sort_values(by='similarity_score', ascending=False)

# Select the top N (e.g., top 20) results for reranking
n = 20
top_n_results = result_df_sorted.head(n)

# Prepare the input pairs (query, document description) for the reranking model
model_inputs = [[query, description] for description in top_n_results["description_clean_chunks"]]

# Predict the relevance scores using the CrossEncoder model
scores = cross_encoder_model.predict(model_inputs)

# Create a DataFrame for reranked results
reranked_results = pd.DataFrame({
    "brand_name": top_n_results["brand_name"].values,
    "description_clean_chunks": top_n_results["description_clean_chunks"].values,
    "initial_similarity_score": top_n_results["similarity_score"].values,
    "rerank_score": scores
})

# Sort the results by rerank score in descending order
reranked_results_sorted = reranked_results.sort_values(by='rerank_score', ascending=False)

# Display the top reranked results
with pd.option_context('display.max_colwidth', None):
    display(reranked_results_sorted[['brand_name', 'description_clean_chunks', 'initial_similarity_score', 'rerank_score']].head(5))

Unnamed: 0,brand_name,description_clean_chunks,initial_similarity_score,rerank_score
0,Comparo,"A tool for synchronously comparing an almost unlimited number of versions of a text. Comparo is a tool for synchronously comparing an almost unlimited number of versions of a text. The comparison unit is the 'sentence'. Since all versions of a text are compared with each other in a previously defined order at the same time, Comparo is a very clear tool that can be used for collation in a time-efficient manner. The comparison parameters can also be comprehensively configured by the user so that they can be individually tailored to the respective text type and readjusted if necessary in order to optimize the comparison result (for example, individual 'high-frequency' words can be defined to be explicitly excluded from the comparison). Following the suggestion that is automatically generated on the basis of the default settings, the user can manually modify, add or remove each assignment. Features such as 'Bookmark', 'Comment', 'Search', 'Mark as done' as well as various evaluation options in clear list form (e.g. display of all unconnected elements) also increase usability. Use in philological research Comparo was developed by the TCDH as part of the binational research project [“Arthur Schnitzler digital. Digitale historisch-kritische Edition (Werke 1905–1931)“]() and is used there to create a 'microgenesis' view, which arranges all text-genetically relevant documents of a work by Arthur Schnitzler in a single view in the order",0.583317,3.694656
1,Comparo,"(Werke 1905–1931)“]() and is used there to create a 'microgenesis' view, which arranges all text-genetically relevant documents of a work by Arthur Schnitzler in a single view in the order in which they were created and parallels the corresponding sentences in different versions with the connections which have been stored via Comparo. This way, a highly complex and yet clear view of all changes that the text has undergone in the course of its creation is created and thus not only answers exciting questions from a philological point of view, such as: Which passages were essentially already present in the first sketch of a work? Which sentences have meanwhile been planned elsewhere? Which sentences have been dropped or newly added in the course of the text's genesis and which have meanwhile been discarded and then restituted? Which passages were once planned completely differently in terms of content (e.g. alternative endings)? In which areas did Schnitzler change and file a lot and which areas have remained almost unchanged from the first note to the later print? The information generated in Comparo is stored in an [FuD]() database. With the help of this database it is possible to work out a detailed presentation in the form of a web view. In the website created for the project ['Arthur Schnitzler digital']() , further settings can be made in order",0.454486,1.215679
14,CollateX,"CollateX is a software to (a.) read multiple versions of a text, (b.) identify differences by aligning tokens, and (c.) output the alignment results for further processing, for instance (d.) to support the production of a critical apparatus or the stemmatical analysis of a text's genesis.[CollateX]() is a software to 1. read multiple (≥ 2) versions of a text , splitting each version into parts (tokens) to be compared, 1. identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and 1. output the alignment results in a variety of formats for further processing , for instance 1. to support the production of a critical apparatus or the stemmatical analysis of a text's genesis. It resembles software used to compute differences between files (e.g. [diff]()) or tools for [sequence alignment]() which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results. As such it is primarily designed for use cases in disciplines like [Philology]() or – more specifically – the field of [Textual Criticism]() where the assessment of findings is based on interpretation and therefore can be supported by computational means but is",0.337188,-0.642481
4,Versioning Machine,"A framework and an interface for displaying multiple versions of text encoded according to the Text Encoding Initiative (TEI)The Versioning Machine is a framework and an interface for displaying multiple versions of text encoded according to the Text Encoding Initiative (TEI) Guidelines, and is P5 compatible. While the VM provides for features typically found in critical editions, such as annotation and introductory material, it also takes advantage of the opportunities afforded by electronic publication to allow for the comparison diplomatic versions of witnesses, and the ability to easily compare an image of the manuscript with a diplomatic version. VM 5.0 adds a number of new features, including the ability to resize and reorganize text panels, panning and zooming in the image viewer, and text-audio interlinking. The Versioning Machine’s underlying code has also been completely revised to support enhanced features. The Versioning Machine is a useful tool for textual editors, providing an environment that allows editors to immediately see the consequences of their editorial decisions. The platform also has applications in teaching, translation, and digital publication. The many uses of the Versioning Machine are illustrated in the new VM IN USE section. The Versioning Machine can be used locally on a Mac or a PC, or it can be mounted on the Internet for public access. The documentation provided with the software not only provides information about",0.421393,-1.684034
3,Tesseract OCR,"Tesseract ist eine Software zur Texterkennung. Mehr als 100 Sprachen und Sprachvarianten werden unterstützt, zudem verschiedene Schriften / Schriftsysteme: lateinische Antiqua, Fraktur, Devanagari (indische Schrift), chinesische, arabische, griechische, hebräische, kyrillische Schrift.",0.427719,-2.073432


Now, we can manually inspect the results. 

After testing it using different queries, we can conclude that the reranking provides more relevant results, while taking a lot longer than just  calculating the similarities across vectors. 

Further testing should be conducted by domain experts to evaluate, if the trade-off between performance and quality is justifyable.

#### 5. Save the vectorisations

Now we save the vectors we created for later use in RAG.

In [None]:
path = "vectorisations/sbert.npy"
np.save(path, np.array(sbert_array))