Lorenz Stangier*, Ji-Ung Lee*, Yuxi Wang, Marvin Müller, Nicholas Frick, Joachim Metternich, and Iryna Gurevych

*Both authors contributed equally.
This repository contains code and data from our TexPrax demo paper published at AACL 2022.
Abstract: Collecting and annotating task-oriented dialog data is difficult, especially for highly specific domains that require expert knowledge. At the same time, informal communication channels such as instant messengers are increasingly being used at work. This has led to a lot of work-relevant information that is disseminated through those channels and needs to be post-processed manually by the employees. To alleviate this problem, we present TexPrax, a messaging system to collect and annotate problems, causes, and solutions that occur in work-related chats. TexPrax uses a chatbot to directly engage the employees to provide lightweight annotations on their conversations and to ease their documentation work. To comply with data privacy and security regulations, we use end-to-end message encryption and give our users full control over their data, which has various advantages over conventional annotation tools. We evaluate TexPrax in a user study with German factory employees who ask their colleagues for solutions to problems that arise during their daily work. Overall, we collect 202 task-oriented German dialogues containing 1,027 sentences with sentence-level expert annotations. Our data analysis also reveals that real-world conversations frequently contain instances of code-switching, varying abbreviations for the same entity, and dialects, which NLP systems should be able to handle.
- Contact
- Ji-Ung Lee (ji-ung.lee@tu-darmstadt.de)
- UKP Lab: http://www.ukp.tu-darmstadt.de/
- PTW: https://www.ptw.tu-darmstadt.de/
- TU Darmstadt: http://www.tu-darmstadt.de/
Drop us a line or report an issue if something is broken (and shouldn't be) or if you have any questions.
For license information, please see the LICENSE and README files.
The code for the TexPrax project consists of three components:
- recorder-bot
- texpraxconnector
- examples
The modification of the matrix-synapse server (synapserecording) was removed from the main branch with the port to Python 3.10. It is still available in the python3.7 branch.
A detailed description and installation instructions can be found below.
A demo video of the project can be found here.
Disclaimer: This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
@inproceedings{stangier-etal-2022-texprax,
title = "{T}ex{P}rax: A Messaging Application for Ethical, Real-time Data Collection and Annotation",
author = {Stangier, Lorenz and
Lee, Ji-Ung and
Wang, Yuxi and
M{\"u}ller, Marvin and
Frick, Nicholas and
Metternich, Joachim and
Gurevych, Iryna},
booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations",
month = nov,
year = "2022",
address = "Taipei, Taiwan",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.aacl-demo.2",
pages = "9--16",
}
An anonymized version of the collected data including annotations can be downloaded from tudatalib or via huggingface-datasets (CC BY-NC).
The chatbot that keeps track of messages, provides label suggestions, and collects feedback via reactions.
Example code to exchange data with an external dashboard via HTTP requests.
Please check the remote-storage branch for an implementation that utilizes remote storage.
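As a rough sketch of such an HTTP exchange with a dashboard (the endpoint URL and the JSON field names below are illustrative assumptions, not the actual TexPrax dashboard API):

```shell
# Hypothetical payload; the field names are assumptions, not the real
# dashboard schema -- adapt them to your dashboard's API.
PAYLOAD='{"room": "!abc:texprax-demo", "sentence": "Die Maschine steht.", "label": "problem"}'

# Validate the JSON locally before sending it.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# POST it to the dashboard (assumed endpoint, commented out here):
# curl -X POST -H "Content-Type: application/json" \
#      -d "$PAYLOAD" http://localhost:5000/api/messages
```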
Detailed instructions on how to set up the TexPrax messaging and recording system.
Clone the repository:
git clone https://github.com/UKPLab/TexPrax.git
Setup your python environment.
conda create --name=texprax-demo python=3.10
conda activate texprax-demo
Install the synapse server first:
pip install matrix-synapse
Now we need to create a config file via:
python -m synapse.app.homeserver -c homeserver.yaml --generate-config --server-name=<server-name> --report-stats=<yes|no>
This creates a homeserver.yaml file. You can now start the homeserver via:
synctl start
You can check if the installation is running by going to http://localhost:8008 in your browser. For further steps, we ask you to follow the instructions in the official synapse documentation.
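Alternatively, you can check from the command line; the /_matrix/client/versions endpoint is part of the standard Matrix client-server API and a running homeserver answers it with a JSON list of supported spec versions:

```shell
# Query the client API of the local homeserver; if nothing is
# listening on port 8008 yet, print a hint instead of failing.
curl -s http://localhost:8008/_matrix/client/versions || echo "homeserver not reachable"
```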
- Go to your homeserver.yaml location.
- Add a new user via register_new_matrix_user -c homeserver.yaml http://localhost:8008 (Note: make sure that you are in the correct python environment, e.g., conda activate texprax-demo).
- Go to Element.
- Go to Sign In and edit the homeserver from matrix.org to http://localhost:8008.
- Sign in with your credentials.
Note: You can set up the bot independently of your synapse server, for instance, using a new environment:
conda create --name=texprax-bot python=3.10
conda activate texprax-bot
OLM is required for encryption. Install it via:
git clone https://gitlab.matrix.org/matrix-org/olm.git olm
cd olm
cmake . -Bbuild
cmake --build build
Now go to the recorder-bot folder:
cd recorder-bot
and install the requirements:
pip install -r requirements.txt
Make sure that you are in the correct python environment, e.g., conda activate texprax-bot. If there are issues with python-olm, try this:
pip install python-olm --extra-index-url https://gitlab.matrix.org/api/v4/projects/27/packages/pypi/simple
Now we need to create a config file with the respective paths, etc. You can use sample.config.yaml as your base file.
We also need to add a new account for the bot (follow the steps above to create a new account).
As an example, we will use the username bot with the password bot.
Set the bot credentials in config.yaml:
matrix:
  user_id: "@bot:texprax-demo"
  user_password: "bot"
  homeserver_url: "http://localhost:8008"
The default storage location of your messages will be ./store.
You will also have to supply a message_path (line 34 in config.yaml):
message_path: ".store/messages.json"
To use the models finetuned on German dialog data, download them from tudatalib and put them into a models folder:
mkdir models
cd models
wget -q --show-progress https://tudatalib.ulb.tu-darmstadt.de/bitstream/handle/tudatalib/3534/sequence_classification_model.zip
wget -q --show-progress https://tudatalib.ulb.tu-darmstadt.de/bitstream/handle/tudatalib/3534/token_classification_model.zip
unzip sequence_classification_model.zip
unzip token_classification_model.zip
Now add them to the config.yaml:
sequence_model_path: "models/sequence_classification_model"
token_model_path: "models/token_classification_model"
We further set the language of the bot to German by setting:
language_file_path: "language_files/DE.txt"
Finally, run the bot via:
LD_LIBRARY_PATH=<path-to-olm>/olm/build/ python autorecorderbot_start
After the bot is running, you can add it like any user to your room. The bot's id in this example will be: @bot:texprax-demo
The modified Synapse instance to automatically invite the bot into newly created rooms.
Important: This requires some features that are only available in an older (deprecated) version that uses python 3.7.
Please switch to the python3.7 branch for this.