Authors: Nele Mastracchio, Wiebke Petersen, Anna Stein, Cornelia Genz, Hanxin Xia, Vittorio Ciccarelli
Affiliation: Heinrich Heine University Düsseldorf
This repository provides the code and data for our solution to Subtask A of SemEval 2024 Task 8: classifying human- and machine-written texts in English across multiple domains. We propose a fusion model consisting of a RoBERTa-based pre-classifier and two MLPs that are trained to correct the pre-classifier using linguistic features. Our model achieves an accuracy of 85% and ranks 26th out of 141 participants.
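The exact layer sizes and training details are defined in the notebooks referenced below; as a rough, hedged sketch of the fusion idea only (module names, dimensions, and the additive fusion rule are assumptions, not the repository's actual code), the correction of the pre-classifier's logits by two feature-based MLPs could look like this:

```python
import torch
import torch.nn as nn


class FusionClassifier(nn.Module):
    """Illustrative sketch: the pre-classifier's logits are corrected by two
    MLPs operating on hand-crafted linguistic features. Dimensions and the
    fusion rule are assumptions, not the exact implementation in this repo."""

    def __init__(self, n_linguistic_features: int, hidden_dim: int = 64):
        super().__init__()
        # MLP 1: predicts a correction to the "human" logit
        self.mlp_human = nn.Sequential(
            nn.Linear(n_linguistic_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # MLP 2: predicts a correction to the "machine" logit
        self.mlp_machine = nn.Sequential(
            nn.Linear(n_linguistic_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, roberta_logits: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        # roberta_logits: (batch, 2) from the fine-tuned RoBERTa pre-classifier
        # features:       (batch, n_linguistic_features) linguistic features
        correction = torch.cat(
            [self.mlp_human(features), self.mlp_machine(features)], dim=-1
        )
        return roberta_logits + correction  # corrected logits, shape (batch, 2)
```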
The code is written for Python 3.11. The required packages can be installed via pip using the provided `requirements.txt` file:
pip install -r requirements.txt
The data for the task can be downloaded from the official task repository and should be placed in the `data/` directory. The test set must be renamed so that its filename contains the word *test*, matching the dev and train files.
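If you want to script the renaming, something along these lines works (the original filename below is an assumption; use whatever the task repository actually ships):

```python
from pathlib import Path

data_dir = Path("data")
# Assumed original filename from the task repository; adjust if yours differs.
src = data_dir / "subtaskA_monolingual.jsonl"
dst = data_dir / "subtaskA_test_monolingual.jsonl"

if src.exists():
    src.rename(dst)  # filename now contains "test", matching the dev/train files
```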
Additional word lists need to be downloaded into the `data/` folder:
- Use the scripts in `code/features/` to compute the features for the training, dev, and test data. Optionally, merge them using `code/merge_features.py` (see the sketch after this list for the kind of feature table this produces).
- Run `code/finetune_transformer.ipynb` to fine-tune the RoBERTa model (pre-classifier) and obtain predictions, logits, and hidden states.
- Finally, use `code/submission.ipynb` to replicate the final submission results.
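The specific linguistic features are documented in the scripts under `code/features/`. Purely as an illustration of the tabular format the later steps consume (the feature names here are made up, not the repository's feature set), feature extraction yields one row of numeric values per text:

```python
import pandas as pd


def toy_features(text: str) -> dict:
    """Illustrative only: two generic surface features, not the repo's features."""
    tokens = text.split()
    return {
        "n_tokens": len(tokens),
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),
    }


texts = ["An example document.", "Another one."]
df = pd.DataFrame([toy_features(t) for t in texts])
print(df)  # one row of numeric features per input text
```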
This project is licensed under the MIT License - see the LICENSE file for details.