Code for team art.-nat.'s submission to SemEval 2024 task 8 on multi-domain, multi-generator human vs. machine generated text classification.

ansost/art-nat-HHU-semeval2024

Submission of team "Artificially Natural HHU" for SemEval 2024 task 8

Authors: Nele Mastracchio, Wiebke Petersen, Anna Stein, Cornelia Genz, Hanxin Xia, Vittorio Ciccarelli

Affiliation: Heinrich Heine University Düsseldorf

This repository provides the code and data for our solution to subtask A of SemEval 2024 task 8: classifying human- and machine-written English texts across multiple domains. We propose a fusion model consisting of a RoBERTa-based pre-classifier and two MLPs trained to correct the pre-classifier using linguistic features. Our model achieves an accuracy of 85% and ranks 26th out of 141 participants.
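
The fusion idea can be illustrated with a minimal sketch. This is not the repository's implementation: the README does not say how the two MLPs are combined with the pre-classifier, so the sketch assumes one plausible reading, namely one corrector per predicted class that decides whether to flip the pre-classifier's label. The names `fuse`, `Corrector`, `corrector_human` and `corrector_machine` are hypothetical, and the correctors are plain callables standing in for trained MLPs.

```python
from typing import Callable, Sequence

# Hypothetical type: a corrector looks at the linguistic features of one text
# and returns True if it believes the pre-classifier's label should be flipped.
Corrector = Callable[[Sequence[float]], bool]


def fuse(
    pre_label: int,
    features: Sequence[float],
    corrector_human: Corrector,
    corrector_machine: Corrector,
) -> int:
    """Return the final label (0 = human, 1 = machine).

    ASSUMPTION: each corrector handles the texts that the pre-classifier
    assigned to one class. The actual routing may differ.
    """
    corrector = corrector_machine if pre_label == 1 else corrector_human
    # Flip the binary label only when the selected corrector says so.
    return 1 - pre_label if corrector(features) else pre_label
```

In this reading, the pre-classifier's prediction is kept unless the corrector responsible for that class overrides it.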

Requirements

The code is written for Python 3.11. The required packages can be installed via pip using the provided requirements.txt file:

pip install -r requirements.txt

Data

The data for the task can be downloaded from the official task repository and should be placed in the data/ directory. The test set must be renamed so that its file name includes the word test, matching the naming of the dev and train files.
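
The renaming step could be scripted roughly as follows. This is a sketch under assumptions: the helper name `add_test_to_name` and the `_test` suffix are hypothetical, and the exact file name the scripts expect is not stated here, so check it against the dev and train file names before relying on this.

```python
from pathlib import Path


def add_test_to_name(path: Path) -> Path:
    """Rename a file so that its name contains 'test'; return the new path.

    ASSUMPTION: appending '_test' before the extension is enough. The
    expected name may follow a different pattern.
    """
    if "test" in path.stem:
        return path  # already named as required, nothing to do
    target = path.with_name(f"{path.stem}_test{path.suffix}")
    path.rename(target)
    return target
```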

Additional word lists need to be downloaded into the data/ directory:

Usage

  1. Use the scripts in code/features/ to compute the features for the training, dev, and test data. Optionally, merge them using code/merge_features.py.
  2. Run code/finetune_transformer.ipynb to fine-tune the RoBERTa model (pre-classifier) and obtain predictions, logits and hidden states.
  3. Finally, use code/submission.ipynb to replicate the final submission results.
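
The optional merge in step 1 can be pictured with the sketch below. It does not reproduce code/merge_features.py, whose input format is not described here. It assumes each feature script yields a mapping from a text id to named feature values, and the function name `merge_features` and the feature names in the usage example are hypothetical.

```python
def merge_features(
    *tables: dict[int, dict[str, float]],
) -> dict[int, dict[str, float]]:
    """Combine several per-text feature tables into one table keyed by text id.

    ASSUMPTION: every table maps a text id to a dict of feature values.
    If two tables share a feature name, the later table wins.
    """
    merged: dict[int, dict[str, float]] = {}
    for table in tables:
        for text_id, feats in table.items():
            # Copy values into a fresh dict so the inputs stay unmodified.
            merged.setdefault(text_id, {}).update(feats)
    return merged


# Usage with two made-up feature tables:
lengths = {1: {"n_tokens": 10.0}, 2: {"n_tokens": 5.0}}
ratios = {1: {"type_token_ratio": 0.5}}
combined = merge_features(lengths, ratios)
```

Texts missing from one table simply keep the features they have, so a later step would need to decide how to handle the gaps.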

License

This project is licensed under the MIT License - see the LICENSE file for details.
