
Constrained C-Test Generation via Mixed-Integer Programming

This repository includes the code for generating C-Tests with Mixed-Integer Programming (click here for the paper). The implementation includes our feature extraction pipeline, various gap difficulty prediction models, the C-Test generation code, and our user study interface. Our research data described here can be found at tu datalib.

Abstract: This work proposes a novel method to generate C-Tests; a deviated form of cloze tests (a gap filling exercise) where only the last part of a word is turned into a gap. In contrast to previous works that only consider varying the gap size or gap placement to achieve locally optimal solutions, we propose a mixed-integer programming (MIP) approach. This allows us to consider gap size and placement simultaneously, achieving globally optimal solutions, and to directly integrate state-of-the-art models for gap difficulty prediction into the optimization problem. A user study with 40 participants across four C-Test generation strategies (including GPT-4) shows that our approach (MIP) significantly outperforms two of the baseline strategies (based on gap placement and GPT-4) and performs on par with the third (based on gap size). Our analysis shows that GPT-4 still struggles to fulfill explicit constraints during generation and that MIP produces C-Tests that correlate best with the perceived difficulty. We publish our code, model, and collected data consisting of 32 English C-Tests with 20 gaps each (3,200 in total) under an open source license.

General information

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Code

The code is split into four major components. Before running any of the components, first set up an appropriate virtual environment (for instance, using conda) and install all packages:

conda create --name=<envname> python=3.10
conda activate <envname>
pip install -r requirements.txt

We run all our experiments with Python 3.10.

Data

We make our data available at tu datalib. We provide three kinds of research data:

  1. User Study Data: All detailed results are provided in the folder csv. This folder consists of five .csv files containing the C-Tests, the users (names obfuscated), and their questionnaire responses. strategy.csv denotes the order in which each C-Test was seen, and ctest_user_mapping.csv contains the individual user responses for each C-Test. In addition, we provide the C-Tests generated with the different strategies and their respective aggregated error rates under aggregated_ctests.
  2. GPT-4 Prompts and Responses: The full input prompts (full_prompts) and responses (full_responses) generated by GPT-4. If a regeneration was necessary due to a lack of gaps, this is indicated in the first line of the response file, e.g., 2nd Try: . c_tests provides the format used in the user study, consisting of the C-Test where #GAP# indicates a gap, the solutions, and the tokenized and original texts.
  3. Variability Data: Contains 100 preprocessed C-Tests for the variability experiments from the GUM corpus. The text passages were randomly sampled according to the criteria mentioned in the paper.

Models

The trained XGB model we use for MIP and the reimplemented baselines is available on tu datalib (XGB Model.zip). After installing the respective Python package (pip install xgboost), you can use the model as follows:

import numpy as np
import xgboost

xgb_model = xgboost.Booster()
xgb_model.load_model(path_to_model)  # e.g., "XGB_42_model.json"
# Prediction for a single instance (one feature vector, reshaped to a 1-row matrix)
xgb_model.predict(xgboost.DMatrix(features_vector.reshape(1, -1)))

The tu datalib repository also contains fine-tuned checkpoints of the best-performing models (MLP, BERT-base/large, RoBERTa-base/large, DeBERTa-base/large) for gap difficulty prediction. We provide three variants of the transformer-based models:

  • MR: masked regression
  • CLS: CLS-token prediction
  • CLS-F: CLS-token + feature-vector prediction
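
As a minimal sketch of how such a checkpoint could be used for inference, assuming the checkpoints follow the standard Hugging Face format with a single regression output (this is our illustration, not the repository's loading code; paths and the example input are placeholders):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a CLS-variant checkpoint (assumed to be a standard Hugging Face directory).
tokenizer = AutoTokenizer.from_pretrained(path_to_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(path_to_checkpoint, num_labels=1)
model.eval()

# Predict a difficulty score for a single input text.
inputs = tokenizer("A C-Test sentence with a ga_ .", return_tensors="pt")
with torch.no_grad():
    difficulty = model(**inputs).logits.squeeze().item()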

Project structure

This project is structured as follows:

  1. Feature Extraction
  2. Gap Difficulty Model
    1. Transformer
    2. Feature-based
  3. C-Test Generation
    1. MIP
    2. Reimplemented Baseline
  4. User Study
    1. Study Interface
    2. Latin Hypercube Design
    3. Analysis

Here, we only provide a brief overview; detailed instructions on installation and usage can be found in each of the subprojects.

Feature Extraction

Feature Extraction contains our feature extraction pipeline, derived from Beinborn (2016) and Lee et al. (2020), which extracts 61 features for a given input text in several successive steps. The feature extraction pipeline proposed by Beinborn (2016) requires a working DKPro-Core environment that includes some licensed resources. Lee et al. (2019) provide explicit instructions on setting up the DKPro-Core environment.

For easier processing, we compiled the whole pipeline of Beinborn (2016) into two executable .jar files; one for sentence scoring (sentence_scoring.jar) and one for feature extraction (feature_extraction.jar). The .jar files can be found on tu datalib. The required file format for the pipelines is the tc format used by DKPro TC. This format is comparable to the CoNLL-U format, with a single token per line and additional annotations added via tab separators. In addition, sentence endings are explicitly marked via ---- .
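
For illustration, a single sentence in this format looks roughly as follows (the annotation column is a hypothetical placeholder, not the exact annotation scheme):

This	<annotation>
is	<annotation>
a	<annotation>
C-Test	<annotation>
.	<annotation>
----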

Our whole extraction pipeline consists of five parts:

  1. Sentence Scoring
  2. Feature Extraction (Beinborn, 2016)
  3. Feature Extraction (Lee, 2020)
  4. Feature Imputing
  5. Aggregating and Re-indexing

First, set the DKPro environment via:

export "DKPRO_HOME=<path-to-the-project>/resources/DKPro"

In addition, we need to explicitly set the path to the TreeTagger library:

export "TREETAGGER_HOME=<path-to-the-project>/resources/DKPro/treetagger/lib"

Finally, you can run the full feature extraction pipeline via:

run_feature_extraction.sh <input-folder> <tmp-folder> <output-folder>

Note that the tmp-folder is only used for storing intermediate outputs and will be deleted afterwards. The C-Tests in the tc format are generated using the default strategy. An appropriate generator is provided by Lee et al. (2019).

Gap Difficulty Model

For the gap difficulty prediction models, we provide the code in two separate projects for MLM models and Feature-based models.

Transformer

Implementation for using BERT-like models in a regression setup with masks. The major changes compared to a default MLM classification setup are:

  • Added an ignore_index (int -100) for the MSE (mse) and L1 (l1) losses in PyTorch (see the sketch below).
  • Extended the data collator to return padded float values (required for regression).
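
A minimal sketch of the masking idea (our illustration, not the repository's exact code): padded positions carry the ignore index as their label and are excluded from the loss.

import torch
import torch.nn.functional as F

IGNORE_INDEX = -100.0  # label value assigned to padded positions

def masked_mse_loss(predictions: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Compute the MSE only over positions whose label is not the ignore index.
    mask = labels != IGNORE_INDEX
    return F.mse_loss(predictions[mask], labels[mask])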

There are two scripts provided, one for training (main_bert_tokenregression.py) and one for testing (inference_only.py). You can run them via:

python main_bert_tokenregression.py --train-folder <training-data> --test-folder <test-data> --seed 42 --epochs 250 --batch-size 5 --result-path <results-folder> --model microsoft/deberta-v3-base --model-path models --loss mse --max-len 512
python inference_only.py --test-folder <test-data> --seed 42 --batch-size 5 --result-path <results-folder> --model <model-type> --model-path <path-to-trained-model> --max-len 512

UPDATE: Two additional scripts have been added for training BERT-like models using the [CLS] token prediction and the hand-crafted features used by Beinborn (2016) (--features True). For example use cases, please see the respective bash scripts train_<model>_features.sh.

Feature-based

This folder provides code for training and testing the feature-based models used in our work. We investigate the following models:

  • XGBoost (XGB)
  • Multi-layer Perceptrons (MLP)
  • Linear Regression (LR)
  • Support Vector Machines (SVM)

We tune the MLP over 100 randomly generated configurations, found in configs and generated with python create_configs.py; a sketch of such configuration sampling follows below. We further tune the activation (relu or linear) and the parameter c for our SVM (in the .sh files). XGB and LR are not further tuned.
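
A hypothetical sketch of such random configuration generation (the actual search space of create_configs.py may differ):

import json
import random

random.seed(42)
# Sample 100 random MLP configurations; hyperparameter ranges are illustrative.
configs = [
    {
        "hidden_layers": random.choice([1, 2, 3]),
        "hidden_size": random.choice([64, 128, 256, 512]),
        "activation": random.choice(["relu", "linear"]),
        "learning_rate": 10 ** random.uniform(-4, -2),
    }
    for _ in range(100)
]
with open("mlp_configs.json", "w") as f:
    json.dump(configs, f, indent=2)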

C-Test Generation

We provide code for three C-Test generation strategies: our MIP-based approach (MIP) and the reimplemented baselines SEL and SIZE using the XGB model (Reimplemented Baseline). We provide preprocessed data for running the variability experiments under data/Variability Data. To run the models, add a respective data folder containing the preprocessed data as well as a model folder with the trained model. Results will be output in a respective results folder.

To process the gap size features (MIP and SIZE generation strategies), we require the following spaCy model:

python -m spacy download en_core_web_sm

We need the pyphen package for hyphenation and the standard Ubuntu dictionary for American English to check for compound breaks (found in the resources folder).
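
For reference, pyphen's hyphenation API can be used as follows (a generic usage example, not code from this repository):

import pyphen

dic = pyphen.Pyphen(lang="en_US")
print(dic.inserted("hyphenation"))   # e.g., "hy-phen-ation"
print(dic.positions("hyphenation"))  # valid hyphenation points as character offsets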

To run the SEL and SIZE baselines, please follow the instructions provided by Lee et al. (2019).

MIP

This folder provides the implementation of our MIP generation strategy using the XGB model. We use Gurobi as the solver for our optimization problem. It is likely that the optimization model has more parameters than the default license of Gurobi allows. Gurobi offers free licenses for academic and educational purposes. After registration, you will obtain a license file you can reference in your virtual environment for running the experiments:

export GRB_LICENSE_FILE=<path-to-license>

The generation can be run via:

python main_MIP_XGB.py --model models/XGB_42_model.json --output-folder results --input-file example_data/GUM_academic_art_4.txt --tau 0.1 --update-bert True

Here, --tau is the target difficulty (a value in [0, 1]) and --update-bert indicates whether the BERT-base features should be updated (adds overhead). A prefix for the resulting output can be added via the option --output-name.

For debugging the MIP model, the option --debug-mip should be set to True and a path for the logging file provided via --debug-log.

Finally, we provide additional implementations of the optimization objective using indicator functions and piecewise linear approximations. You can find the implementations in MIP Objective Functions. They can be run using the same setup as the default one.
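
To illustrate the piecewise-linear idea in isolation, here is a generic Gurobi sketch (a toy model of ours, not the repository's objective function): a nonlinear term y = x^2 is approximated by a piecewise-linear function over fixed breakpoints.

import gurobipy as gp
from gurobipy import GRB

m = gp.Model("pwl_demo")
x = m.addVar(lb=0.0, ub=1.0, name="x")
y = m.addVar(lb=-GRB.INFINITY, name="y")

# Approximate y = x^2 on [0, 1] with a piecewise-linear function over 11 breakpoints.
breakpoints = [i / 10 for i in range(11)]
m.addGenConstrPWL(x, y, breakpoints, [p * p for p in breakpoints], "pwl_square")

m.setObjective(y, GRB.MINIMIZE)
m.optimize()
print(x.X, y.X)  # expect x = 0, since x^2 is minimized at 0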

Reimplemented Baseline

We implement both baselines, SEL and SIZE, proposed by Lee et al. (2019) using the trained XGB model. Run scripts for the variability experiments, which assess the performance against the original implementations, are provided as shell scripts. The data to run these experiments is provided at tu datalib (Variability Data.zip). To run the models, add a respective data folder containing the preprocessed data as well as a model folder with the trained model. Results will be output in a respective results folder.

User Study

Code regarding the user study consists of three parts. First, the interface we implemented and used, written in Flask with SQLAlchemy as the backend for our (MySQL) database. Second, the Latin Hypercube Sampling, implemented as a constrained optimization problem. Third, the R scripts implementing the GAMM we use for our analysis.

Interface

This implements the interface used in our user study. For reviewing, all links and names that may lead to deanonymization have been removed. The user study runs as a Flask application with a database connected via SQLAlchemy.

After setting up the virtual environment, create a database (with an example user admin with the password admin):

mysql -u admin -p
CREATE DATABASE `c-test`;

Then import the database structure (including c-tests and selection):

mysql -u admin -p c-test < c-test.sql

The application can be started via:

cd c_test
python __init__.py

Exporting data from the database can be done via:

mysqldump -u admin -p c-test --add-drop-table > c-test.sql

Sampling

This script generates user study configurations with maximal distance and equal distribution. The code can be run via:

python fetch_minizinc_solutions.py

This generates several files in the output folder: first, a file containing all distances and possible combinations (all_distances.dzn), and second, one valid configuration (final_combinations.json). The code relies upon a constrained optimization model found in the minizinc folder. For more information on MiniZinc, an open-source constraint modeling language, please check their documentation.

Analysis

We provide simple preprocessing scripts for R as well as the respective R scripts for the significance analysis conducted in the paper.

After setting up the virtual environment, install R and RStudio (RStudio is not required, but convenient). The R scripts use the following packages:

  • mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation
  • itsadug: Interpreting Time Series and Autocorrelated Data Using GAMMs
  • report: Automated Reporting of Results and Statistical Models
  • xtable: Export Tables to LaTeX or HTML
  • hash: Full Featured Implementation of Hash Tables/Associative Arrays/Dictionaries

The raw data found in study_data_raw can be converted via format_data_for_r.py and format_data_for_r_feedback.py, which will be written into the r_data folder. The statistical models are provided in r_models.

Citing

Please use the following citation (to be updated):

@misc{Lee:2024:CTestArxiv,
  author        = {Lee, Ji-Ung and Pfetsch, Marc E. and Gurevych, Iryna},
  title         = {Constrained C-Test Generation via Mixed-Integer Programming},
  month         = {April},
  year          = {2024},
  pages         = {1--32},
  eprint        = {2404.08821},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2404.08821}
}
