# Reproducible Research with Docker

A [*DraCorOS Training Session*](https://summit.dracor.org/dracoros_training_sessions) at the [DraCor Summit 2025](https://summit.dracor.org/) by [Ingo Börner](mailto:ingo.boerner@uni-potsdam.de)

This training session demonstrates how to establish reproducible research workflows using Docker to create local instances of the DraCor infrastructure, enabling researchers to work with fixed, versioned corpus data. Participants will learn to set up and populate a containerized DraCor environment with specific versions of custom corpora, addressing the fundamental challenge that DraCor’s “living corpora” continuously evolve over time, making repeating research difficult.

**Please make sure you install [Docker Desktop](https://www.docker.com/products/docker-desktop) on your machine!**
This notebook should be executed locally. Using Binder or Colab will probably not work well because Docker is not installed there. Preferably use a local [Jupyter Lab](https://jupyter.org/install) instance:

From within the cloned folder of this notebook execute the following commands:

```
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
```

With the environment activated, add the matching kernel to your Jupyter Lab instance:

```
python3 -m ipykernel install --user --name reproducible-research-with-docker
```
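
To confirm that a notebook really runs on the interpreter of the virtual environment created above, a quick check can be run in a cell. This is a minimal sketch that relies only on the standard library; the helper name `in_virtualenv` is made up for this illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True if the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points to the environment,
    # while sys.base_prefix points to the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtualenv())
```

If the second line prints `False`, the notebook is probably using a different kernel than the one registered above.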

If you encounter any problems during the workshop please don't hesitate to ask. You can use the designated [Mattermost channel](https://dh-up.uni-potsdam.de/dracor-community/channels/summit-ts-reproducible-research) to share error messages or code snippets.

## Challenges in Repeating (DraCor) Research

### What do we mean by "Repeating Research"?

The concept of *repeating research* encompasses various scholarly activities that build upon or verify previous work. One might think of the re-implementation of methods and scripts in new research projects; of the re-analysis of data sets with optimized tools; or of the exact re-production of analyses in the course of scientific quality assurance, for example in peer review. In these and many other respects, Computational Literary Studies (but also Computational Humanities and Digital Humanities in general) are facing the demand for **reproducibility**.

### Do we have a "Reproducibility Problem"?

However, according to critical voices, research in the humanities has not adequately met the demand for reproducibility. Alluding to the so-called "replication crisis" ([Open Science Collaboration 2015](https://doi.org/10.1126/science.aac4716)) in some empirical sciences (particularly in psychology and medicine), James O'Sullivan, for example, already stated in 2019 that "the humanities have a 'reproducibility' problem" ([O'Sullivan 2019](https://talkinghumanities.blogs.sas.ac.uk/2019/07/09/the-humanities-have-a-reproducibility-problem)).

In her widely discussed critique of CLS, Nan Z Da pointed out that in several cases it was not possible to reproduce the results of research in this field ([Da 2019](https://doi.org/10.1086/702594)). And in a paper as relevant as it is comprehensive, Christof Schöch recently concluded that when it comes to "reproducibility" there are "serious and relevant challenges for the field of CLS", "starting with issues of access to data and code, but also concerning questions of lacking reporting standards, limited scholarly recognition, and missing community commitment and capacity that would all be needed to foster a culture of \[repetitive research\] in CLS and beyond" ([Schöch 2023](https://doi.org/10.1007/s42803-023-00073-y): 379).

### Schöch's Conceptual Framework for Repeating Research

To address these challenges systematically, Christof Schöch (2023) has developed a comprehensive framework that provides both a theoretical foundation and practical terminology for understanding different types of research repetition.

#### The Five Dimensions of Repeating Research

Schöch's model identifies five key dimensions that characterize any research endeavor:

- **(Q) Research Question**: The key research question being studied (or the key hypothesis or claim to be verified)
- **(D) Dataset**: The dataset used (or more generally, the empirical basis of enquiry)
- **(M) Method**: The research method employed (and its implementation, e.g. in a code-based algorithm or tool)
- **(T) Team**: The team performing the research (including, of course, the case of a one-person team)
- **(R) Result**: The result of the research (and the claims or conclusions supported by the results)

#### Three Qualities of Similarity

The relationship between an earlier study and a later study that repeats it can be described along each dimension using three types of similarity:

- **(1) Identical**: exactly or virtually the same
- **(2) Similar**: more or less closely related
- **(3) Unrelated**: largely dissimilar or entirely different

#### The Conceptual Space of Repeating Research and Common Scenarios

For practical analysis, Schöch focuses on the three most operationally relevant dimensions: **Method (M)**, **Data (D)**, and **Question (Q)**. Setting aside the team and results dimensions—important descriptive aspects that do not require inclusion in distinctions between recurring scenarios—we obtain a three-dimensional conceptual space that can be visualized as a cube (Schöch 2023: 385). 

![Conceptual Space of RR (Schöch 2023)](img/schoech-conceptual-space-cube-RR.png "Conceptual Space of RR visualized as a cube (Schöch 2023)")

From this conceptual framework, Schöch identifies several recurring scenarios that can be grouped into three categories, as can be seen in the following table from the article (Schöch 2023: 386):

![Common Scenarios of Repeating Research (Schöch 2023)](img/schoech-common-scenarios-table.png "Common Scenarios of Repeating Research (Schöch 2023)")

We will try to keep this framework and the resulting terminology in mind when we tackle the specific reproducibility challenges we encounter when working with evolving digital resources like DraCor's "Living Corpora".

### DraCor Corpora as *Living Corpora*

This reproducibility challenge becomes even more complex when we consider the dynamic nature of the data itself. In the CLS INFRA Report "On Versioning Living and Programmable Corpora" ([Börner/Trilcke 2024](https://doi.org/10.5281/zenodo.11081934)) we have introduced the concept of "living corpora" to describe a fundamental characteristic of many digital humanities resources and DraCor corpora in particular. In Computational Literary Studies, this epistemic object is regularly no longer just an individual text or a small group of individual texts, but a "corpus" that has emerged "across many research domains in the humanities and social sciences" as "a major genre of cultural and scientific knowledge" (Gavin 2023: 4).

While there are corpora that can be fully digitized with manageable resources—such as complete author corpora like all of Henrik Ibsen's plays (see the new [IbsDraCor](https://staging.dracor.org/ibs))—there is also a large number of epistemic objects that cannot be made digitally available so easily. In many cases, we don't even know exactly which texts would have to be included in such corpora, and some texts aren't available in digital form at all.

In these cases, we must assume that the epistemic object of CLS is currently (and presumably for a long time to come) in the making—in the process of becoming, of growing and thus, in a certain sense, "living." 

Therefore, speaking of "living corpora" emphasizes that the digitization of our cultural and literary heritage is not so much a state that is or could be achieved, but rather a process, a (permanent) mode of transformation that we have entered. One of the consequences is that these epistemic objects of CLS must be conceptualized as dynamic ([Trilcke/Börner 2023](https://doi.org/10.5281/zenodo.7664964)).

This dynamic nature of "living corpora" creates a particular challenge for reproducible research: how can we ensure that analyses remain replicable when the underlying corpus itself is continuously evolving? DraCor corpora are a prime example of such living datasets: they grow through new additions, undergo corrections and improvements, and evolve their encoding standards over time.

### Citing "Living Corpora"

We will take a brief look at an actual CLS research project as an example of how research deals with the living corpora of DraCor. Our aim is to show that the way DraCor is commonly cited is insufficient to enable reproducibility of the research.

It has become quite common for research that uses DraCor corpora to

1. cite the paper [Fischer et al. 2019](https://doi.org/10.5281/ZENODO.4284002)
2. include the information on how many plays are in the corpus used.

Plays used as examples are mostly referenced by author and title (and not, as we would recommend, by their DraCor ID).

This can be observed, for example, in the following quotations from a research paper that uses GerDraCor to develop and test a machine learning tool for detecting chiasmi in literary texts:

> We perform two types of experiments. \[...\] In the second experiment we evaluate how well our model generalizes to texts from different authors not included in the training data. To this end we extract PoS tag inversions from the **GerDraCor corpus (Fischer et al., 2019)** \[...\]” The **training data set** (https://git.io/ChiasmusData) “consists of **four
annotated texts by Friedrich Schiller** *Die Piccolomini*, *Wallensteins Lager*, *Wallensteins Tod* and *Wilhelm Tell*. We annotated the whole texts, finding 45 general chiasmi and 9 antimetaboles. ([Schneider et al. 2021](https://doi.org/10.18653/v1/2021.latechclfl-1.11): 98; emphasis \[bold\] by us)

And further

> \[...\] we evaluate the generalization performance of our chiasmus classifier trained on the
four annotated Schiller dramas to other texts. The **first set of texts comprises seven
other dramas by Friedrich Schiller** \[...\]. To see how well our method generalizes to
different authors, we tested it on the remaining **493 documents from GerDraCor**. (Schneider et al. 2021: 99; emphasis \[bold\] by us)

We can also see this method of referencing a certain "version" of the corpus by number of plays in the [reader](https://zenodo.org/records/16936633) of the upcoming *Computational Drama Analysis Workshop*:

> We use GerDraCor \[...\] in a downloaded version from March 2025. It includes 732 German dramas in TEI-XML-format. Out of technical reasons we excluded 52 texts. This leaves us with an analysis corpus of 680 texts covering a timespan from 1510 to 1947. Most texts stem from 1750–1950. ([Schuhmacher et al. 2025](https://doi.org/10.5281/zenodo.16936633): 110)

Meanwhile GerDraCor already includes 742 plays as can be seen on the front end of [dracor.org](https://dracor.org). 

![GerDraCor Card on dracor.org](img/gerdracor-card-frontend.png "GerDraCor Card on dracor.org")

Based on the information given in these (and many other) papers, it is therefore not clear exactly what data was used in the study. This is a problem for some scenarios of repeating research.

This observation about researchers referencing corpus versions the way described above is merely a neutral finding, and researchers should not be blamed for this practice. In fact, DraCor itself has made and continues to make it challenging to provide proper citations of the data used. The paper by Fischer et al. (2019) is included as a citation recommendation on the DraCor website, and the corpora themselves lack explicit version information—neither as "releases" on GitHub nor as DOIs for stable datasets ingested into repositories like Zenodo, for example. In this regard, DraCor could indeed still improve its approach to data citation and versioning to better support reproducible research practices.

But how could the problem be solved?

### Git commit as a solution?

The platform GitHub serves as a "key infrastructural component" in developing the DraCor toolset as well as in curating and hosting DraCor corpora. We can also rely on GitHub to effectively manage datasets that are constantly in flux. Because DraCor uses Git (and GitHub respectively) for publishing corpora, the process of creating and maintaining a corpus is fully transparent and traceable. As we will show, this also opens up unrivaled possibilities for versioning and the corresponding referencing of living corpora.

Unlike the repositories of DraCor software components (cf. the repository of the DraCor API), for which [releases](https://github.com/dracor-org/dracor-api/releases) are published, the corpus repositories (currently) do not use this feature. However, it is still possible to point very precisely to a single “version” (or “snapshot”) of the data set by referring to an individual commit. Because all editing operations are “recorded” or “logged” when committed, the commits can be used to reconstruct the state of a corpus at a given point in time. We can consider the **commits the “implicit versions”** of DraCor corpora.

![Commits displayed on GitHub](img/github-frontend-gerdracor-commits-highlighted.png "Commits displayed on GitHub")

The commit history on GitHub allows for filtering commits by a certain date range, e.g. it is possible to display commits dating from February 2018:

[https://github.com/dracor-org/gerdracor/commits/main/?since=2018-02-01&until=2018-02-28](https://github.com/dracor-org/gerdracor/commits/main/?since=2018-02-01&until=2018-02-28)

![Select Commits by Date on GitHub](img/github-select-commits-by-daterange.png "Select Commits by Date on GitHub")

From this list a single commit can be explored, e.g. from February 14th 2018:

[https://github.com/dracor-org/gerdracor/commit/30760ec3ff4aa340f785bcc17bfd3ca81e7e2d06](https://github.com/dracor-org/gerdracor/commit/30760ec3ff4aa340f785bcc17bfd3ca81e7e2d06)

This commit is identified by the SHA value (as “commit identifier”) of `30760ec3ff4aa340f785bcc17bfd3ca81e7e2d06`, which can also be found as part of the URL in the address bar of the browser.

From the single commit view it is possible to get to all TEI-XML files of the plays in the corpus at that point in time. This can either be done by clicking on the button “Browse files” in the upper right corner of the gray commit page header and then, on the landing page, by navigating to the folder `tei`; or, as a shortcut, by directly changing the URL in the address bar of the browser: 

To address the TEI files in the state of February 2018, the path `/tree/{commit SHA}/tei` can be appended to the URL of the GerDraCor repository `https://github.com/dracor-org/gerdracor`, resulting in:

[https://github.com/dracor-org/gerdracor/tree/30760ec3ff4aa340f785bcc17bfd3ca81e7e2d06/tei](https://github.com/dracor-org/gerdracor/tree/30760ec3ff4aa340f785bcc17bfd3ca81e7e2d06/tei)

This example demonstrates that even without specialized tools and just by using the GitHub Web Interface it is straightforward to precisely retrieve a dated “version” of the corpus files. The only requirement is that the commit, or at least, the precise date or the date range in which the corpus was used is known.
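
The URL patterns described above can also be assembled programmatically. The following is a minimal sketch using only the standard library; the helper names `tei_tree_url` and `commits_in_range_url` are made up for this illustration, and the defaults (`dracor-org`, `main`, `tei`) are taken from the GerDraCor links shown above.

```python
from urllib.parse import urlencode

GITHUB_BASE = "https://github.com"


def tei_tree_url(repo, commit_sha, owner="dracor-org", folder="tei"):
    """Build the URL of a folder as it existed at a given commit."""
    return f"{GITHUB_BASE}/{owner}/{repo}/tree/{commit_sha}/{folder}"


def commits_in_range_url(repo, since, until, owner="dracor-org", branch="main"):
    """Build the URL of the commit list filtered by a date range (YYYY-MM-DD)."""
    query = urlencode({"since": since, "until": until})
    return f"{GITHUB_BASE}/{owner}/{repo}/commits/{branch}/?{query}"


# The commit from February 14th 2018 discussed above
sha = "30760ec3ff4aa340f785bcc17bfd3ca81e7e2d06"
print(tei_tree_url("gerdracor", sha))
print(commits_in_range_url("gerdracor", "2018-02-01", "2018-02-28"))
```

Storing such a URL (or just the commit SHA) alongside the analysis code is enough to point readers to exactly the files that were used.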

With the release of version 1.1 of the DraCor API and the corresponding front end versions, the commit SHA of the corpus loaded into the database is made explicit (see also [Börner et al. 2025](https://doi.org/10.5281/zenodo.15301341): 28).

![Truncated Commit SHA on DraCor frontend](img/commit-sha-on-dracor-frontend.png "Truncated Commit SHA on DraCor frontend")

This data is also available via the API from the `/corpora`, the `/corpora/{corpusname}` and the `/corpora/{corpusname}/plays/{playname}` endpoints.

The following code cell shows how to access this information:

In [None]:
# Get the commit sha of a corpus via the DraCor API

import requests
corpusname = "ger"
request_url = f"https://dracor.org/api/v1/corpora/{corpusname}"
r = requests.get(request_url)
# Single quotes inside the f-string keep the line valid on Python versions before 3.12
print(f"Commit of the corpus with the identifier {corpusname} is {r.json()['commit']}.")
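
Once the commit SHA is known, it can be stored together with the analysis so that the corpus version can be cited precisely. The following is a minimal sketch of such a provenance record. The helper `corpus_provenance` is made up for this illustration, the field names `name` and `repository` in the response are assumptions (only `commit` is confirmed by the cell above), and the example values are placeholders rather than real API output.

```python
import json
from datetime import date


def corpus_provenance(corpus_info, api_base, retrieved):
    """Collect the information needed to cite a corpus version precisely."""
    return {
        "corpus": corpus_info.get("name"),              # assumed field name
        "repository": corpus_info.get("repository"),    # assumed field name
        "commit": corpus_info["commit"],                # field used in the cell above
        "api_base": api_base,
        "retrieved": retrieved.isoformat(),
    }


# Placeholder data standing in for r.json(); the SHA below is not a real commit
example_info = {
    "name": "ger",
    "repository": "https://github.com/dracor-org/gerdracor",
    "commit": "0000000000000000000000000000000000000000",
}

# The retrieval date is a hypothetical example value
record = corpus_provenance(example_info, "https://dracor.org/api/v1", date(2025, 9, 1))
print(json.dumps(record, indent=2))
```

In the notebook, `r.json()` and `date.today()` could be passed instead of the placeholder values.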

### Corpus Archaeology

But what can be done if the commit SHA is not known and only the number of plays present in the corpus at a certain time serves as the sole indicator of the data version used? We can use the GitHub API to answer this question through a process we have called "Corpus Archaeology" [elsewhere](https://versioning-living-corpora.clsinfra.io/3-2_gerdracor_corpus_archeology.html).

The following code cells demonstrate how the versions of the GerDraCor corpus used for the studies mentioned above can be pinned down. For this analysis, we developed a set of functions written in Python. The functionality of this prototype tool is bundled as methods of the class `GitHubRepo` contained in the module `github_utils` ([view on GitHub](https://github.com/dh-network/clsinfra-d73/blob/main/report/github_utils.py)). We will first briefly introduce some of its functionality and then return to identifying the corpus version used in the paper quoted above.

In [None]:
# Add the file github_utils.py to the folder

!curl -o github_utils.py https://raw.githubusercontent.com/dh-network/clsinfra-d73/refs/heads/main/report/github_utils.py

In [None]:
# The methods needed for the following analysis are bundled as the class "GitHubRepo",
# which we import with the following line

from github_utils import GitHubRepo

# The number of requests that can be sent to the GitHub API anonymously is very limited 
# see https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28 
# We need to send a token (stored in an environment variable here) 
# by supplying the token with each request to identify ourselves to the API 
# and thus having a higher limit of requests. 

import os
github_token = os.environ.get("GITHUB_TOKEN")
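
How `GitHubRepo` passes the token internally is not shown here. As a general illustration, an authenticated request to the GitHub REST API carries the token in an `Authorization` header. The following is a minimal sketch; the helper name `github_headers` is made up for this illustration.

```python
import os


def github_headers(token=None):
    """Build request headers for the GitHub REST API, with or without a token."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        # Without this header the request counts against the low anonymous limit
        headers["Authorization"] = f"Bearer {token}"
    return headers


headers = github_headers(os.environ.get("GITHUB_TOKEN"))
print("Authenticated requests:", "Authorization" in headers)

# The remaining quota could then be inspected with, for example:
# requests.get("https://api.github.com/rate_limit", headers=headers).json()
```

If the first line prints `False`, the environment variable `GITHUB_TOKEN` is not set and the anonymous rate limit applies.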

The German Drama Corpus (GerDraCor) will serve as a test case. The corpus’ repository is available at `dracor-org/gerdracor`.

In [None]:
# we have to provide the repository name

repository_name = "gerdracor"

In [None]:
#%%time
# Uncomment the Jupyter magic keyword above to have the operation timed

# The following line of code downloads and prepares the data
# when initializing a new instance of the class "GitHubRepo",
# which provides the methods to analyze a corpus repository
# DON'T RUN IT HERE!!!

#repo = GitHubRepo(repository_name=repository_name, github_access_token=github_token, download_and_prepare_data=True)

The first step in the analysis consists of downloading the data on all commits from GitHub. Depending on the overall number of commits, this can take a long time. In a previous attempt, fetching and preparing the GerDraCor data from GitHub with the code in the cell above took 53min 29s.

You can download pre-generated commit histories, e.g. for GerDraCor (`gerdracor.zip`), from [https://boxup.uni-potsdam.de/s/fsy6jxfyogJnREX](https://boxup.uni-potsdam.de/s/fsy6jxfyogJnREX) (Password: `reproducible-research`) and use them in the following analysis step. Unfortunately, for bigger corpora, the files are quite large and cannot easily be versioned on GitHub. Please keep in mind that these files date from February 2025 and do not reflect the latest developments of the corpora. The proposed API-based process of analyzing the genesis of corpora can and must still be optimized.

Put the `.zip` file into the notebook folder and unzip it to the folder `tmp`. Run the following code cell to execute the command. By adding `!` before the command, it will be executed in your terminal in the background and you will see the results in the notebook.

In [None]:
# Extract the files to a folder tmp
!unzip -o gerdracor.zip -d ./tmp
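
Where the `unzip` command is not available (for example on some Windows setups), the archive can also be extracted with Python's standard library. The following is a minimal sketch; the helper names are made up for this illustration, and it assumes that the four JSON files used in the next step sit at the top level of the archive.

```python
import zipfile
from pathlib import Path

# File names as used when initializing GitHubRepo in the next step
EXPECTED_FILES = [
    "gerdracor_commits.json",
    "gerdracor_commits_detailed.json",
    "gerdracor_data_folder_objects.json",
    "gerdracor_corpus_versions.json",
]


def extract_archive(archive, target="tmp"):
    """Extract a zip archive into the target folder and return that folder."""
    target_dir = Path(target)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
    return target_dir


def missing_files(folder, expected=EXPECTED_FILES):
    """List the expected files that are not present in the folder."""
    return [name for name in expected if not (Path(folder) / name).exists()]


# Only runs if the archive has been placed in the notebook folder
if Path("gerdracor.zip").exists():
    folder = extract_archive("gerdracor.zip")
    print("Missing files:", missing_files(folder))
```

An empty list of missing files indicates that the import in the next cell should find everything it needs.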

In [None]:
# Start the analysis with previously downloaded data

repo = GitHubRepo(repository_name=repository_name, 
                  github_access_token=github_token,
                  import_commit_list="tmp/gerdracor_commits.json",
                  import_commit_details="tmp/gerdracor_commits_detailed.json",
                  import_data_folder_objects="tmp/gerdracor_data_folder_objects.json",
                  import_corpus_versions="tmp/gerdracor_corpus_versions.json")

After this initialization step you can run several analyses. For example, you can visualize the "growth" of the corpus with the code in the following code cell:

In [None]:
# Quickly plot the growth of the Corpus over time

repo.plot_documents_in_corpus_versions()

Or visualize how the distribution of the sources of the corpus changes over time:

In [None]:
# Plot the distribution of sources over time

repo.plot_source_distribution_of_corpus_versions()

For some additional examples see the chapter "An Algorithmic Archaeology of a Living Corpus: GerDraCor as a Dynamic Epistemic Object" in the executable version of the report [On Versioning Living and Programmable Corpora](https://versioning-living-corpora.clsinfra.io/3-2_gerdracor_corpus_archeology.html).

In [None]:
# Get some help about the class
#?GitHubRepo

# List methods
#dir(GitHubRepo)

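To identify the corpus version used in the "Chiasmus Detection" paper mentioned above, we are looking for a version of GerDraCor that contains 504 plays: the 4 annotated Schiller plays used for training, the 7 other Schiller dramas, and the remaining 493 documents (4 + 7 + 493 = 504).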

In [None]:
# To get the version with 504 plays:
# Get the versions as a dataframe containing the number of plays included ("document_count")

play_counts_df = repo.get_corpus_versions_as_df(columns=["id","date_from","document_count"])

# Filter the dataframe on versions that have exactly 504 plays
play_counts_df[play_counts_df["document_count"] == 504]

The most probable version of the GerDraCor data used is identified by the SHA value `6e1020dcfcb98a0d027ceb401a6a5fbd4537fe29`  and dates from `2020-09-26`.
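
A play count alone does not always identify a single version, because several commits can share the same count. Combining the count with a date constraint narrows the candidates. The following is a minimal sketch on toy data: only the row with the SHA `6e1020dc…` is taken from the text above; the other rows and the cutoff date are invented placeholders, and the keys mirror the dataframe columns `id`, `date_from` and `document_count` (assumed to hold ISO-formatted dates).

```python
from datetime import date

# Toy data; the placeholder rows are invented for illustration only
versions = [
    {"id": "placeholder-a", "date_from": "2020-09-20", "document_count": 503},
    {"id": "6e1020dcfcb98a0d027ceb401a6a5fbd4537fe29", "date_from": "2020-09-26", "document_count": 504},
    {"id": "placeholder-b", "date_from": "2022-01-15", "document_count": 504},
]


def candidate_versions(versions, document_count, not_after):
    """Return versions with the given play count that existed by the cutoff date."""
    matches = [
        v for v in versions
        if v["document_count"] == document_count
        and date.fromisoformat(v["date_from"][:10]) <= not_after
    ]
    # ISO dates sort chronologically as strings
    return sorted(matches, key=lambda v: v["date_from"])


# Hypothetical cutoff: a version used in a paper cannot postdate its publication
cutoff = date(2021, 12, 31)
for v in candidate_versions(versions, 504, cutoff):
    print(v["id"], v["date_from"])
```

With the real dataframe, the same filter could be expressed by adding a date condition to the boolean mask used in the cell above.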

In [None]:
# Get a URL to view this version of the corpus on GitHub

repo.get_github_commit_url_of_version(version="6e1020dcfcb98a0d027ceb401a6a5fbd4537fe29")

We have now seen that even with limited information—for example, the number of documents included at a given time—we can more or less reconstruct the version and then use the commit SHA to unambiguously address it.

After this brief excursus into "Corpus Archaeology," we can shift the focus back to ways of conducting research with DraCor that render such reconstruction attempts unnecessary.

### Docker as a solution to the "Reproducibility Problem"?

The challenge of reproducibility in Computational Literary Studies is further complicated by the temporal dimension of digital research environments. Andrew Piper, in his work "Enumerations," addresses this issue when discussing his own reproducibility efforts: 

> I am trying to set a standard of reproducibility that will, I hope, gradually become more of a norm. Throughout, I have adopted the convention of describing each model or calculation referenced in the text in the notes \[...\] I am trying to strike a balance between the conventions of the humanities, which emphasize reading as a form of knowledge in its own right, and those of more quantitative disciplines, which put all the formulas and tables up front. \[...\] You are free to use the code for your own purposes or to try to reproduce the results I put forth here. I make no claims to elegance in programming, but **I am confident that the scripts work, at least as of today** \[emphasis added\]. Durability has taken on a new scale of meaning when seen against the long timescales of bibliographic preservation (Piper 2018: xii).

Piper's emphasis on "at least as of today" captures a fundamental tension in digital humanities research: while traditional humanities scholarship aims for enduring insights that remain valid across centuries, computational work operates within rapidly changing technological ecosystems where code, data formats, and digital infrastructures evolve continuously. This temporal fragility of digital research environments makes the reproducibility challenge in computational humanities particularly acute—and makes containerized solutions like Docker an interesting technology for preserving not just the data, but the entire computational environment in which research was conducted.

## Getting started with Docker

In this part of the notebook we will at first not write much Python code but use the notebook to execute various commands in the shell/terminal of your machine. By adding `!` before a command, it will be executed in your terminal in the background and you will see the results in the notebook. For example, to test whether Docker is installed (if you are on a Linux/Mac machine), you can use the command `which` to get the location of the program.

If you are working on Windows and sending the commands directly from this notebook does not work, you can open a terminal tab within Jupyter Lab (File > New > Terminal) and execute the commands there. In this case, do not prepend them with a `!`. 
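
A platform-independent way to locate the Docker executable is `shutil.which` from the Python standard library, which also works on Windows. A minimal sketch (the wrapper name `find_executable` is made up for this illustration):

```python
import shutil


def find_executable(name):
    """Return the full path of a program on the PATH, or None if it is not found."""
    return shutil.which(name)


docker_path = find_executable("docker")
if docker_path is None:
    print("Docker was not found on the PATH.")
else:
    print("Docker found at", docker_path)
```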

![Open new Terminal tab in Jupyter Lab](img/open-new-terminal.png "Open Terminal in Jupyter Lab")

In [None]:
!which docker

Or use the command `docker --version` to get the current version of Docker installed.

In [None]:
!docker --version

You can list running Docker containers with the command `docker ps`. As we have not started anything yet the table should be empty.

In [None]:
!docker ps

## Setting up a local DraCor Environment with Docker (the canonical way)

The README file in the GitHub [Repository of the DraCor API eXist-DB Application](https://github.com/dracor-org/dracor-api) contains [instructions](https://github.com/dracor-org/dracor-api?tab=readme-ov-file#getting-started) on how to set up a local DraCor environment with the help of Docker.

You can clone the repository and follow the instructions as detailed in the README, but if you are not planning to change the XQuery code of the API, all you really need are two compose files:

* [compose.yml](https://github.com/dracor-org/dracor-api/blob/main/compose.yml) contains your setup and specifies the images you are using, e.g. `dracor/api:1.1.0`. The default compose file uses the images that are available on [Docker Hub](https://hub.docker.com/u/dracor).
* [compose.override.yml](https://github.com/dracor-org/dracor-api/blob/main/compose.override.yml) opens the ports of the containers so that you can access them locally at `http://localhost` at the predefined ports, e.g. `8088` for the DraCor front end at [http://localhost:8088](http://localhost:8088). If you are not using the `compose.override.yml`, you will not be able to access your local instance in your web browser (unless you get the NGINX server or some other fancy reverse proxy, e.g. Traefik, to run).

In the following code cells we will get these two files:

In [None]:
# Download the compose.yml file from the dracor-api repository and put it into the folder

!curl -o compose.yml https://raw.githubusercontent.com/dracor-org/dracor-api/refs/heads/main/compose.yml

In [None]:
# Check if the file is there

!ls | grep compose.yml

In [None]:
# Print the file contents of the compose.yml file

!cat compose.yml
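
For reproducibility it is worth recording which image versions a compose file refers to. The following is a minimal sketch that scans compose text for `image:` lines with a regular expression instead of a full YAML parser, which is a simplification. The helper names are made up for this illustration, and the sample text is a hypothetical stand-in, not the actual content of `compose.yml`.

```python
import re
from pathlib import Path

# Matches lines such as "    image: dracor/api:1.1.0" (optionally quoted)
IMAGE_LINE = re.compile(r"^\s*image:\s*['\"]?([^\s'\"#]+)")


def list_images(compose_text):
    """Return all image references found in the text of a compose file."""
    images = []
    for line in compose_text.splitlines():
        match = IMAGE_LINE.match(line)
        if match:
            images.append(match.group(1))
    return images


def unpinned(images):
    """Return the images that have no explicit tag or use the tag 'latest'."""
    result = []
    for image in images:
        name_part = image.rsplit("/", 1)[-1]
        tag = name_part.split(":", 1)[1] if ":" in name_part else None
        if tag in (None, "latest"):
            result.append(image)
    return result


# Hypothetical sample for illustration only
sample = """
services:
  api:
    image: dracor/api:1.1.0
  frontend:
    image: example/frontend
"""
print(list_images(sample))
print(unpinned(list_images(sample)))

# With the downloaded file in place, the real references can be listed as well
if Path("compose.yml").exists():
    print(list_images(Path("compose.yml").read_text()))
```

Images without a fixed tag may resolve to different versions over time, which works against the goal of a stable research environment.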

You also need the `compose.override.yml` file.

In [None]:
!curl -o compose.override.yml https://raw.githubusercontent.com/dracor-org/dracor-api/refs/heads/main/compose.override.yml

In [None]:
!cat compose.override.yml

When the `compose.yml` and the `compose.override.yml` files have been placed in the folder you can use the following command to launch your local DraCor. We set an empty password for the eXist database (do not do that in production!) with the environment variable `EXIST_PASSWORD`:

```
EXIST_PASSWORD= docker compose -f compose.yml -f compose.override.yml up
```

Please don't run the command this way from the notebook because it will produce lengthy output and block further execution. You can try the command in a terminal tab of Jupyter Lab if you want.

## Running DraCor from a Jupyter Notebook in the background using *subprocess* (the non-canonical way)

To fully document our work with a local dockerized DraCor in a Jupyter Notebook (to foster reproducibility!), we can document the setting up of the infrastructure here too.

However, we cannot use `&` to send a process to the background from within the Jupyter Notebook. Still, we can run processes in the background with a workaround based on *subprocess*:

In [None]:
import subprocess

# Store the command in the variable cmd
cmd = "EXIST_PASSWORD= docker compose -f compose.yml -f compose.override.yml up&"

# Run the process and prevent the cell outputting the log. 
subprocess.run(cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
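
An alternative that avoids shell syntax altogether is Docker Compose's detached mode (`up --detach`, short `-d`), which starts the containers and returns immediately. The following is a minimal sketch, not the workflow used in this notebook; the helper names are made up for this illustration. The password is passed through the environment of the child process instead of a shell prefix, which should also work on Windows.

```python
import os
import subprocess


def compose_up_command(files=("compose.yml", "compose.override.yml")):
    """Build the argument list for starting the containers in detached mode."""
    cmd = ["docker", "compose"]
    for compose_file in files:
        cmd += ["-f", compose_file]
    cmd += ["up", "--detach"]
    return cmd


def start_dracor(exist_password=""):
    """Run docker compose with EXIST_PASSWORD set for the child process only."""
    env = {**os.environ, "EXIST_PASSWORD": exist_password}
    return subprocess.run(compose_up_command(), env=env, capture_output=True, text=True)


# Show the command without running it; call start_dracor() to actually launch
print(" ".join(compose_up_command()))
```

The returned object exposes `returncode`, `stdout` and `stderr`, which can help with diagnosing a failed start.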

You can use the docker command `docker ps` to see your running containers:

In [None]:
# List the running containers

!docker ps

Check if you can access your local DraCor instance at [http://localhost:8088](http://localhost:8088).
You can also use the local API at `http://localhost:8088/api/v1/`.
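
Right after starting the containers, the API may need some time before it responds. The following is a minimal sketch of a polling helper using only the standard library; the function names, the timeout and the interval are assumptions made for this illustration.

```python
import time
import urllib.error
import urllib.request


def api_is_up(url):
    """Return True if a GET request to the URL succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and HTTP errors all count as "not ready"
        return False


def wait_for_api(url, timeout=120.0, interval=2.0, probe=api_is_up):
    """Poll the URL until the probe succeeds or the timeout (in seconds) expires."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_api("http://localhost:8088/api/v1/info")` before sending further requests would return `True` once the local instance answers, or `False` after the timeout.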

In [None]:
# Test the API by querying the info endpoint
import requests

apibase = "http://localhost:8088/api/v1/"
request_url = apibase + "info"
r = requests.get(request_url)
r.json()

## Add a corpus and load the data

The [documentation in the README.md](https://github.com/dracor-org/dracor-api?tab=readme-ov-file#load-data) explains how corpora can be added and loaded by using curl in the command line.

Adding a corpus is a two-step process:

* in a first step, a corpus needs to be added to the database. This step will only add a few metadata: a `name`, a `title` and a link to the repository from which the data can be retrieved. Alternatively, the `corpus.xml` can be posted to the endpoint `/corpora`;
* in a second step the TEI files of the plays are loaded from the repository by posting a load command.

We will add the Test Drama Corpus "testdracor" as explained in the README:

In [None]:
!curl https://raw.githubusercontent.com/dracor-org/testdracor/main/corpus.xml | \
curl -X POST \
  -u admin: \
  -d@- \
  -H 'Content-type: text/xml' \
  http://localhost:8088/api/v1/corpora

If we access our local instance at [http://localhost:8088](http://localhost:8088) we see an empty corpus card for the new corpus "TestDraCor".

As described in the README, the plays can be loaded by sending a POST request with the payload `{"load":true}` to the `corpora/{corpusname}` endpoint where the URL parameter `corpusname` is the name of the newly added corpus:

In [None]:
!curl -X POST \
  -u admin: \
  -H 'Content-type: application/json' \
  -d '{"load":true}' \
  http://localhost:8088/api/v1/corpora/test

If we now access the local instance at [http://localhost:8088](http://localhost:8088) again, we see that the corpus is being populated with the plays from TestDraCor.

We can also use Python commands to add data, which opens up a lot of possibilities, e.g., populating the database in a loop, or whatever you can think of. 

The endpoints in the "Admin" section (see [Documentation](https://dracor.org/doc/api#/admin)) are only available for authorized users with admin rights. The default user of the eXist-DB is `admin` and in our case the password is an empty string. This should, of course, be changed for production use. We will assign username and password to the variables `usr` and `pwd`. To be able to include this information in the request, we need to import the class `HTTPBasicAuth` from the `requests` library first:

In [None]:
#needed for authorization
from requests.auth import HTTPBasicAuth

#Username of the local instance
usr = "admin"
#Password of the admin user
pwd = ""

We also have to construct the metadata of our corpus:

In [None]:
#construct the payload
bashdracor_metadata = {
  "name": "bash",
  "title": "Bashkir Drama Corpus",
  "repository": "https://github.com/dracor-org/bashdracor"
}

We can then send the `POST` request to the `/corpora` endpoint, supply the metadata and also include the credentials of the admin user:

In [None]:
# set the URL of the /corpora endpoint
request_url = apibase + "corpora"

# send the POST request with the payload and provide the credentials
r = requests.post(request_url, json = bashdracor_metadata, auth=HTTPBasicAuth(usr, pwd))

# output the status code returned by the server
r.status_code

When run for the first time, the API should return an HTTP status code of `200`. For other status codes, please check the [documentation](https://dracor.org/doc/api#/admin/post-corpora). For example, if a corpus already exists, the API will return a status code of `409`. To get the status code, we can read the attribute `status_code` of the response object, as demonstrated above.
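
When adding several corpora, it can help to turn the status code into a readable message. The helper below is only a sketch: it covers the two codes mentioned above and falls back to the standard HTTP reason phrase for everything else. The function name `describe_creation_status` is made up for this example.

```python
from http import HTTPStatus


def describe_creation_status(status_code):
    """Translate the status code of the corpus creation request into text."""
    if status_code == 200:
        return "corpus created"
    if status_code == 409:
        return "corpus already exists"
    # Any other code: report it together with the standard reason phrase
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        phrase = "unknown status"
    return f"unexpected response: {status_code} {phrase}"
```

In the notebook this could be called as `describe_creation_status(r.status_code)` right after sending the request.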

To trigger the loading process, we have to send the JSON object `{"load": true}` (in a Python dictionary, the Boolean value is written `True`) to the `/corpora/{corpusname}` endpoint.

In [None]:
#construct the url
load_bash_endpoint_url = apibase + "corpora/bash"

#construct the payload to be sent to the endpoint
load_cmd_payload = {"load" : True}

#send the POST request using library requests
r = requests.post(load_bash_endpoint_url, json = load_cmd_payload, auth=HTTPBasicAuth(usr, pwd))

If a corpus update was scheduled successfully, the API returns a `202` status code:

In [None]:
r.status_code

## Using *stabledracor* (the non-canonical, very experimental, yet comfortable way)

The workflow presented above is still quite complex. There are multiple steps involved in setting up the locally running infrastructure, some of which need to be run from the command line. Using a notebook for the setup and management of the corpus already makes the whole process more transparent. Still, there is a considerable need to make the process more user-friendly.

Our approach to simplifying the process is a Python package which, for lack of a better name, is called "StableDraCor" (it could also be called "DraCor Freezer"). It makes setting up local DraCor instances and populating them with data easier by hiding some of the complexity of the Docker commands. There is no real need for a generic tool for managing containers and images, because this can be done with Docker Desktop. "StableDraCor" instead addresses the complexity of setting up the specific DraCor infrastructure components and loading DraCor corpora (or subsets thereof). For more information see the section of the report [On Versioning Living and Programmable Corpora](https://versioning-living-corpora.clsinfra.io/4_dockerizing_dracor.html#simplifying-the-workflow-stabledracor).

We will not install the package from its repository but just use the adapted `stabledracor.py` from our working directory. The code in the stable-dracor repository still needs to be updated to Version 1.0 of the DraCor API. The version in the notebook folder has been patched to some extent.

In [None]:
# Import the StableDraCor class
from stabledracor import StableDraCor

In [None]:
# Get some info
#?StableDraCor

# list all methods
#dir(StableDraCor)

For working with the GitHub API you really should create a Personal Access Token at this point. For more details see the [FAQ](https://github.com/ingoboerner/stable-dracor/blob/e300d77c419537538b4d491a8bbe2b9449123131/notebooks/03_faq.ipynb) or the Readme in the Notebook Repository. 

An easy way would have been to start Jupyter Lab with your token set as an environment variable:

```
GITHUB_TOKEN=yourtoken jupyter lab
```

If you have not done so, you can still set the environment variable as described below. Please run the cell once and then **REMOVE** your token again!

In [None]:
# Magic Command to print all environment variables
#%env

In [None]:
# Magic command to set the Environment Variable
# add it, run it, and then delete your token and run again (it should not stay in the notebook)
#%env GITHUB_TOKEN=your token

In [None]:
# Run this cell to check if a token is set.
import os
github_token = os.environ.get("GITHUB_TOKEN")
if github_token is not None:
    print("A GitHub Access Token is set.")
else:
    print("You have to set a token. Follow the instructions in the FAQ.")

In [None]:
# Check if anything is running
!docker ps

After successfully importing the class, we can set up an instance `local_dracor` of the `StableDraCor` class that we will use to "control" our local DraCor system.

In [None]:
local_dracor = StableDraCor(
    name="my_local_dracor", 
    description="My local demo DraCor system",
    github_access_token=github_token
)

The package supports loading corpora by copying a corpus, or parts thereof, from any running DraCor system, for example the production system or the staging server, which contains additional corpora that are currently being prepared for publication.

In the following code cell we add the “Tatar Drama Corpus” from the production instance of DraCor (which is the default) to the local database:

In [None]:
local_dracor.copy_corpus(source_corpusname="tat")

We can also add a corpus from the staging server by explicitly setting the argument `source_api_url` to the URL of the staging server `https://staging.dracor.org/api/v1/`. In the following code cell we import the "Ibsen Drama Corpus":

In [None]:
local_dracor.copy_corpus(source_corpusname="ibs", source_api_url="https://staging.dracor.org/api/v1/")

We can always delete a corpus:

In [None]:
local_dracor.remove_corpus(corpusname="test")

It is also possible to directly add TEI files from the local filesystem, which allows us to even use the DraCor environment with data not published on dracor.org or a public GitHub repository. When adding data to a local Docker container with the help of the “StableDraCor” package, the program keeps track of the constitution of the corpora and the sources used.

In the following code cell a single file is imported into the custom local corpus “FilesDraCor”.

For demonstration purposes, we use a play already available in DraCor, but feel free to add another play that is not available anywhere. Just stick to the naming convention—the "slug" of `{author-name}-{title}.xml`. If you are too creative with naming your file, the import will fail.
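
A quick check of the file name before importing can save a failed import. The sketch below encodes one plausible reading of the naming convention: lowercase ASCII letters and digits, grouped by single hyphens, with at least one hyphen and the extension `.xml`. This pattern is an assumption made for illustration; the rule actually enforced during import may be stricter or looser. The names `SLUG_FILENAME` and `looks_like_dracor_filename` are hypothetical.

```python
import re
from pathlib import Path

# Assumed pattern: "author-title.xml" made of lowercase letters, digits and hyphens
SLUG_FILENAME = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)+\.xml$")


def looks_like_dracor_filename(path):
    """Return True if the file name matches the assumed slug pattern."""
    return SLUG_FILENAME.fullmatch(Path(path).name) is not None


for candidate in ["import/alberti-brot.xml", "import/My Play (final).xml"]:
    verdict = "ok" if looks_like_dracor_filename(candidate) else "rename before import"
    print(candidate, "->", verdict)
```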

In [None]:
# create a folder "import"
# (-p avoids an error if the folder already exists)
!mkdir -p import

In [None]:
# Add a file to the import folder
!curl -o import/alberti-brot.xml https://raw.githubusercontent.com/dracor-org/gerdracor/refs/heads/main/tei/alberti-brot.xml

In [None]:
# Create a corpus "FilesDraCor" and add a single play from the folder "import" to it

local_dracor.add_plays_from_directory(
    corpusname="files",
    directory="./import/"
)

To allow for better reproducibility of the local infrastructure, it is recommended to use the functionality to directly load corpora or parts thereof from a GitHub repository. This method of adding data allows specifying the "version" of the data in the corpus compilation process at a given point in time by referring to a single GitHub commit. As mentioned above, because DraCor corpora are "living corpora," it is not guaranteed that corpora available on the web platform do not change. Therefore, it would not be a good idea to base research aiming to be repeatable on the data in the live system. By using data directly from GitHub with StableDraCor, it is possible to include only the plays that were available, let's say, two years ago and in the encoding state they were in at that time.
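
To illustrate what referring to a single commit means in practice, the sketch below builds raw GitHub URLs for a TEI file. It is not how StableDraCor is implemented; it only shows the difference between a moving branch reference and a fixed commit. The function names are hypothetical, and the defaults (owner `dracor-org`, folder `tei`) are assumptions taken from the GerDraCor URL used earlier. The commit SHA is the GerDraCor commit used in the chiasmus example later in this notebook.

```python
from urllib.request import urlopen


def raw_tei_url(repository_name, playname, ref, owner="dracor-org", tei_dir="tei"):
    """Build the raw.githubusercontent.com URL of a TEI file at a given ref.

    `ref` can be a branch name (moving target) or a full commit SHA (fixed).
    """
    return (
        f"https://raw.githubusercontent.com/{owner}/{repository_name}"
        f"/{ref}/{tei_dir}/{playname}.xml"
    )


def download_tei(url):
    """Download the file and return its content as bytes (needs network access)."""
    with urlopen(url) as response:
        return response.read()


# A branch name follows the repository as it changes over time ...
moving = raw_tei_url("gerdracor", "alberti-brot", "main")
# ... whereas a full commit SHA always addresses the same snapshot.
pinned = raw_tei_url(
    "gerdracor",
    "schiller-wilhelm-tell",
    "6e1020dcfcb98a0d027ceb401a6a5fbd4537fe29",
)
print(moving)
print(pinned)
```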

In the following code cell the “Yiddish Drama Corpus” is added to the local database directly from its GitHub repository, using an early version from 2024 (commit `5ca48607e7c13173d8a482ba9e8790dfccf66a95`):

In [None]:
local_dracor.add_corpus_from_repo(
    repository_name="yidracor", 
    commit="5ca48607e7c13173d8a482ba9e8790dfccf66a95")

You can also set up a custom corpus and add files in a given version directly from a GitHub repository. The following section shows that and also hints at how a DraCor environment supporting reproducible research can be set up.

## Example: Reconstructing (and Stabilizing) a Corpus Used to Train and Evaluate a Classifier for Chiasmus Detection

We already discussed the citation practices in the paper "Data-Driven Detection of General Chiasmi Using Lexical and Semantic Features" (Schneider et al. 2021) as an example of research that re-uses the German Drama Corpus. The authors do not use the DraCor API for their study but download data directly from the GerDraCor GitHub Repository. The only information that hints at which version of the corpus was used to train and test the classifier is the number of plays that were included in the corpus at the time. The authors report that GerDraCor included 504 plays.

With the help of the “corpus archeology script” described above we could identify the actual version of the German Drama Corpus. For training the classifier, a manually annotated data set consisting of four plays by Friedrich Schiller is used (cf. Schneider et al. 2021: 98). The paper lists the titles of these plays; in the following list, the DraCor identifiers are added:

* Die Piccolomini (schiller-die-piccolomini, ger000086)
* Wallensteins Lager (schiller-wallensteins-lager, ger000025)
* Wallensteins Tod (schiller-wallensteins-tod, ger000058)
* Wilhelm Tell (schiller-wilhelm-tell, ger000452)

In the following we will set up a corpus and load the plays in the versions that were most likely used in the study:

In [None]:
play_counts_df[play_counts_df["document_count"] == 504]

In [None]:
chiasmus_version_commit_id = "6e1020dcfcb98a0d027ceb401a6a5fbd4537fe29"

To add a single play in a specific version from a repository to a corpus, use the method `add_play_version_to_corpus`, as we do in the loop below:

```
local_dracor.add_play_version_to_corpus(
        filename=playname,
        repository_name="gerdracor",
        commit=chiasmus_version_commit_id,
        corpusname="training")
```

In [None]:
# [...] four annotated texts by Friedrich Schiller
# Die Piccolomini, 
# Wallensteins Lager, 
# Wallensteins Tod 
# and Wilhelm Tell.

# Add an empty new corpus "training" with the following metadata

chiasmus_annotated_corpus_metadata = {
    "name" : "training", 
    "title": "Schiller Training Corpus",
    "description": "Corpus of four plays by Friedrich Schiller used to train the Chiasmus Classifier"
}

local_dracor.add_corpus(corpus_metadata=chiasmus_annotated_corpus_metadata)

# Create a list with the playnames/filenames of the plays to add

chiasmus_annotated_schiller_corpus_playnames = [
    "schiller-die-piccolomini",
    "schiller-wallensteins-lager",
    "schiller-wallensteins-tod",
    "schiller-wilhelm-tell"]

# Add each play in the respective version to the previously created corpus

for playname in chiasmus_annotated_schiller_corpus_playnames:
    local_dracor.add_play_version_to_corpus(
        filename=playname,
        repository_name="gerdracor",
        commit=chiasmus_version_commit_id,
        corpusname="training")

After we set up our infrastructure, we could now "freeze" it in this state by using `docker commit`, which creates a Docker image that could be shared. At some point, creating a Docker image with the package worked (`create_docker_image_of_service`), and even pushing to Docker Hub might still work (`publish_docker_image`), but this was not tested while preparing for this workshop.
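
Since the package methods were not re-tested, the freezing step can also be done by calling the Docker CLI directly. The sketch below assembles and runs a `docker commit` command from Python. The container name and image tag in the usage comment are placeholders, not values from this workshop, and the helper names are made up for this example.

```python
import shlex
import subprocess


def docker_commit_command(container, image, message):
    """Assemble the argument list for `docker commit`."""
    return ["docker", "commit", "-m", message, container, image]


def freeze_container(container, image, message):
    """Run `docker commit`; raises CalledProcessError if Docker reports an error."""
    subprocess.run(docker_commit_command(container, image, message), check=True)


# Show the command line that would be executed (placeholder names)
command = docker_commit_command(
    container="my-dracor-api-container",
    image="myuser/my-local-dracor:1.0",
    message="corpora loaded",
)
print(shlex.join(command))

# To actually create the image, with Docker running:
# freeze_container("my-dracor-api-container", "myuser/my-local-dracor:1.0", "corpora loaded")
```

The name or ID of the running container can be looked up with `docker ps`.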

The following section reports on an infrastructure experiment in which we dockerized a whole research environment to make a study fully replicable (same data, same code).

## Dockerizing a whole research environment: Small World Paper

We exemplify the benefits of a Docker-based research workflow by referring to our study "Detecting Small Worlds in a Corpus of Thousands of Theater Plays" (Trilcke et al., 2024). In this study, we tested different operationalizations of the so-called "Small World" concept based on a multilingual ["Very Big Drama Corpus" (VeBiDraCor)](https://github.com/dracor-org/vebidracor) of almost 3,000 theater plays. As explained above, the corpora available on DraCor are "living corpora"—which means that both the number of text files contained and the information contained in the text files changes (e.g., with regard to metadata or markup). This poses an additional challenge for reproducing our study. Furthermore, our [analysis script](https://github.com/dracor-org/small-world-paper/blob/publication-version/smallworlds-script.R) (written in R) retrieves metadata and network metrics from the REST API of the "programmable corpus." Thus, we had to devise a way of not only stabilizing the corpus but also the API.

For VeBiDraCor, we devised a workflow that spins up a Docker container from a versioned bare Docker image of the DraCor database and ingests the data of the plays downloaded ("pulled") from specified GitHub commits using a [Python script](https://github.com/dracor-org/vebidracor/blob/main/vebidracor-workflow.ipynb). We then committed this container with [docker commit](https://docs.docker.com/reference/cli/docker/container/commit/) to create a ready-to-use Docker image of the populated database and API.

```
!docker commit -m "prepared VeBiDraCor based on pre-built images, loaded corpora, added metrics" $vebidracor_api_container ingoboerner/vebidracor-api:3.0.0
```

Because the assembly of the infrastructure is made transparent by the [Docker Compose file](https://github.com/dracor-org/vebidracor/blob/main/docker-compose.empty.yml) and documented in a Jupyter Notebook, it is also possible to quickly change the API's base image or the composition of the corpus by editing a manifest file that controls which plays from which repositories at which state are included.
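
The idea of such a manifest can be sketched in a few lines. The layout below is purely illustrative and is not the format used by VeBiDraCor: each entry names a target corpus, a source repository, a commit, and the plays to include. As example data it reuses the Schiller training corpus from the previous section. The function name `expand_manifest` is hypothetical.

```python
# Hypothetical manifest: one entry per source repository, pinned to a commit.
manifest = [
    {
        "corpusname": "training",
        "repository": "gerdracor",
        "commit": "6e1020dcfcb98a0d027ceb401a6a5fbd4537fe29",
        "plays": [
            "schiller-die-piccolomini",
            "schiller-wallensteins-lager",
            "schiller-wallensteins-tod",
            "schiller-wilhelm-tell",
        ],
    },
]


def expand_manifest(entries):
    """Yield one dict of keyword arguments per play listed in the manifest."""
    for entry in entries:
        for playname in entry["plays"]:
            yield {
                "filename": playname,
                "repository_name": entry["repository"],
                "commit": entry["commit"],
                "corpusname": entry["corpusname"],
            }


jobs = list(expand_manifest(manifest))
print(len(jobs), "plays pinned to commit", jobs[0]["commit"][:7])
```

Each item has the same keys as the arguments of `add_play_version_to_corpus`, so a loop such as `for job in jobs: local_dracor.add_play_version_to_corpus(**job)` could rebuild the corpus from the manifest.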

In a second step, we also dockerized the whole research environment: a Docker container running [RStudio](https://posit.co/products/open-source/rstudio) to which we added our analysis script. The preparation of this image is documented in a [Dockerfile](https://github.com/dracor-org/small-world-paper/blob/publication-version/Dockerfile). As a base image, we used an image from the [rocker project](https://rocker-project.org/). We used `docker commit` to "freeze" this state of our system and published all images. We call this state the "pre-analysis state," which is documented in a [Docker Compose file](https://github.com/dracor-org/small-world-paper/blob/publication-version/docker-compose.pre.yml).

After we ran the analysis, we again created an image of the RStudio container with `docker commit`, thus turning it into a Docker image in which we basically "froze" the state of the research environment after the R script was run. The [image](https://hub.docker.com/layers/ingoboerner/smallworld-rstudio/dcac262/images/sha256-03bec767bdc213a002783e2d0b34d896dce308ddfc243b90ef13b9292a972c54?context=explore) of this "post-analysis state" was also published on the Docker Hub repository. It allows for inspection and verification.

In the code cell below we demonstrate how to return to this "post-analysis state" with the help of Docker and the published Docker image. Running the code cell might take a while because the images defined in the Docker Compose file need to be downloaded. It is recommended to run the command `docker compose -f docker-compose.post.yml up` in a terminal from within the cloned repository directory.

It is also possible to first only pull the relevant images using `docker pull`, e.g. `docker pull ingoboerner/vebidracor-api:3.0.0`, `docker pull ingoboerner/dracor-frontend:v1.4.3_local` and `docker pull ingoboerner/smallworld-rstudio:dcac262`.
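
If you prefer to stay inside the notebook, the three pulls can be scripted. This is only a convenience sketch with made-up helper names; it does nothing more than call `docker pull` once per image.

```python
import subprocess

# The three images named above
IMAGES = [
    "ingoboerner/vebidracor-api:3.0.0",
    "ingoboerner/dracor-frontend:v1.4.3_local",
    "ingoboerner/smallworld-rstudio:dcac262",
]


def pull_commands(images):
    """Return one `docker pull` argument list per image."""
    return [["docker", "pull", image] for image in images]


def pull_all(images):
    """Pull the images one after another; stops at the first failure."""
    for command in pull_commands(images):
        subprocess.run(command, check=True)


# With Docker running, download all three images:
# pull_all(IMAGES)
```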

In [None]:
%%script bash --bg

# Clone the GitHub repository containing the data of the study

git clone https://github.com/dracor-org/small-world-paper.git

# Go into the just downloaded repository and switch to the branch "publication-version"

cd small-world-paper
git checkout publication-version

# Start the infrastructure in the "post-analysis-state" as defined 
# in the Compose file "docker-compose.post.yml"

docker compose -f docker-compose.post.yml up 

## Outlook: LLM-assisted interaction with a local DraCor instance

Recently, we have been developing the [DraCor MCP Server](https://github.com/dracor-org/dracor-mcp) (Börner 2025), which we will present on Wednesday. This server is based on the Model Context Protocol (MCP) and provides LLM-based access to DraCor instances. Apart from providing tools for querying the DraCor API, it also facilitates the management of local instances to a certain degree. The MCP server enables users to interact with DraCor data and functionality through natural language queries, making the platform more accessible to researchers who may not be familiar with direct API calls or technical implementation details.

![Manage local DraCor via MCP with Claude](img/claude-mcp-to-local-dracor.png "Manage local DraCor via MCP with Claude")


## References

Ingo Börner. DraCor MCP Server, 2025. URL [https://github.com/dracor-org/dracor-mcp](https://github.com/dracor-org/dracor-mcp).

Ingo Börner and Peer Trilcke. CLS INFRA D7.1 On Programmable Corpora, 2023. DOI [https://doi.org/10.5281/zenodo.7664964](https://doi.org/10.5281/zenodo.7664964).

Ingo Börner and Peer Trilcke. CLS INFRA D7.3 On Versioning Living and Programmable Corpora, 2024.
DOI [https://doi.org/10.5281/zenodo.11081934](https://doi.org/10.5281/zenodo.11081934).

Ingo Börner, Peer Trilcke, Daniil Skorinkin, and Luca Giovannini. CLS INFRA D7.4 Report on the
Implementation of Programmable Corpora, 2025. DOI [https://doi.org/10.5281/zenodo.15301341](https://doi.org/10.5281/zenodo.15301341).

Nan Z. Da. "The Computational Case against Computational Literary Studies". Critical Inquiry 45, no. 3 (March 2019): 601–39. DOI: [https://doi.org/10.1086/702594](https://doi.org/10.1086/702594).

Frank Fischer, Ingo Börner, Mathias Göbel, Angelika Hechtl, Christopher Kittel, Carsten Milling, and
Peer Trilcke. Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on
European Drama. In DH2019: »Complexities«. 9–12 July 2019. Book of Abstracts, Utrecht, 2019. Utrecht
University. DOI: [10.5281/ZENODO.4284002](https://doi.org/10.5281/ZENODO.4284002).

Michael Gavin. Literary mathematics: quantitative theory for textual studies. Stanford text technologies. Stanford University Press, Stanford, California, 2023. 

Open Science Collaboration. "Estimating the Reproducibility of Psychological Science". Science 349, no. 6251 (28 August 2015): aac4716. DOI: [https://doi.org/10.1126/science.aac4716](https://doi.org/10.1126/science.aac4716).

James O’Sullivan. "The humanities have a ‘reproducibility’ problem". Talking Humanities (blog), 9 July 2019.
URL [https://talkinghumanities.blogs.sas.ac.uk/2019/07/09/the-humanities-have-a-reproducibility-problem](https://talkinghumanities.blogs.sas.ac.uk/2019/07/09/the-humanities-have-a-reproducibility-problem).

Felix Schneider, Björn Barz, Phillip Brandes, Sophie Marshall, and Joachim Denzler. Data-Driven Detection of General Chiasmi Using Lexical and Semantic Features. In Stefania Degaetano-Ortlieb, Anna Kazantseva, Nils Reiter, and Stan Szpakowicz, editors, Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 96–100. Punta Cana, Dominican Republic (online), November 2021. Association for Computational Linguistics. DOI: [10.18653/v1/2021.latechclfl-1.11](https://doi.org/10.18653/v1/2021.latechclfl-1.11).

Christof Schöch. "Repetitive Research: A Conceptual Space and Terminology of Replication, Reproduction, Revision, Reanalysis, Reinvestigation and Reuse in Digital Humanities". International Journal of Digital Humanities 5, no. 2–3 (6 November 2023): 373–403. DOI: [https://doi.org/10.1007/s42803-023-00073-y](https://doi.org/10.1007/s42803-023-00073-y).

Mareike Schumacher, Marie Flüh, and Felix Lempp. Ecologies on stage. In Luca Giovannini and Daniil Skorinkin, editors, Conference Reader: Second Workshop on Computational Drama Analysis (Berlin, 03.09.2025). 2025: 107-125. DOI: [https://doi.org/10.5281/zenodo.16936633](https://doi.org/10.5281/zenodo.16936633).

Peer Trilcke, Eugenia Ustinova, Ingo Börner, Frank Fischer, and Carsten Milling. Detecting Small Worlds in a Corpus of Thousands of Theater Plays. A DraCor Study in Comparative Literary Network Analysis. In Melanie Andresen and Nils Reiter, editors, Computational Drama Analysis: Reflecting Methods and Interpretations. De Gruyter, Boston, 2024.

## Note on AI-Assisted Content Development and Material Reuse
This notebook is based on deliverables created for the CLS INFRA project. Claude Sonnet 4 was used for summarization, text generation and proofreading while adapting the material.

## Acknowledgements

In the context of CLS INFRA, the project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 101004984.

We acknowledge the OSCARS project, which has received funding from the European Commission’s Horizon Europe Research and Innovation programme under grant agreement No. 101129751.