This repository contains the public code used to submit runs to SemEval-2024, specifically in the context of Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. It is the full implementation of the system we submitted to the task leaderboards under the username "araag2".
In order to fully understand the scope of our work, we recommend reading our System Paper (TODO: insert link).
├── corpus/ # Contains the CT corpus
├── prompts/ # Contains all used prompts
├── qrels/ # Contains all qrels files
├── queries/ # Contains all query files
├── eval_prompt.py # Contains all functions to generate text and evaluate a given prompt
├── finetune_Mistral.py # Training functions for Mistral-7b
├── label_prompt_funcs.py # Contains all functions that format queries, outputted labels and prompts
├── parsel_qrels2queries # Script to parse queries into the intended qrel form
├── README.md
├── run_inference.py # Script to use the Mistral model to run inference
├── utils.py # General purpose util functions
└── .gitignore
TODO: for now, this information is in the System Description Paper.
The Train and Dev sets are label-balanced, with 50% Entailment and 50% Contradiction samples.
Set | #Samples | Single | Comparison |
---|---|---|---|
Train | 1700 | 1035 | 665 |
Dev | 200 | 140 | 60 |
The Practice-Test and Test sets are not balanced, being heavily skewed (65.92% and 66.53% respectively) towards Contradiction.
Set | #Samples | Single | Comparison | Entailment | Contradiction |
---|---|---|---|---|---|
Practice-Test | 2142 | 1526 | 616 | 730 | 1412 |
Test | 5500 | 2553 | 2947 | 1841 | 3659 |
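As a sanity check, the label skew quoted above can be recomputed directly from the table (counts taken from the rows above):

```python
# Contradiction counts and totals from the Practice-Test and Test tables above.
sets = {
    "Practice-Test": {"total": 2142, "contradiction": 1412},
    "Test": {"total": 5500, "contradiction": 3659},
}

for name, counts in sets.items():
    share = 100 * counts["contradiction"] / counts["total"]
    print(f"{name}: {share:.2f}% Contradiction")
# Practice-Test: 65.92% Contradiction
# Test: 66.53% Contradiction
```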
Both the Practice-Test and Test sets include data augmentation of their original queries through textual alteration techniques, some of which preserve the intended label of the query and some of which alter it.
Set | Paraphrase | Contradiction | Text_App | Num_contra | Num_para |
---|---|---|---|---|---|
Practice-Test | 600 | 600 | 600 | 78 | 64 |
Test | 1500 | 1500 | 1500 | 276 | 224 |
Set | #Total Alterations | Label-Preserving | Label-Altering |
---|---|---|---|
Practice-Test | 1942 | 1606 | 336 |
Test | 5000 | 4136 | 864 |
The Practice-Test also includes the Dev set, whilst the Test set includes several rephrasings of the same queries, in order to test Faithfulness and Consistency.
We also created expanded sets in order to train the model on additional data:
Set | #Samples | Single | Comparison | Entailment | Contradiction |
---|---|---|---|---|---|
TREC-synthetic | 1630 | 1542 | 88 | 815 | 815 |
Train-manual | 2334 | 1380 | 854 | 1167 | 1167 |
Train-manual-synthetic | 3720 | 2368 | 1352 | 1860 | 1860 |
Train-full-synthetic | 11011 | 6705 | 4306 | 5098 | 5913 |
TODO: for now, this information is in the System Description Paper.
The model we chose to conduct most experiments on was Mistral-7B-Instruct-v0.2, using the publicly available huggingface weights and the Python libraries torch, transformers and peft.
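As an illustration of the instruction format this model expects, the sketch below builds a Mistral-style `[INST] ... [/INST]` prompt explicitly. In practice this is normally produced by the tokenizer's chat template (`tokenizer.apply_chat_template` in transformers); the helper name here is ours, not part of the repository.

```python
def build_mistral_prompt(instruction: str) -> str:
    """Wrap a single-turn instruction in the Mistral-Instruct chat format.

    Mistral-7B-Instruct delimits user turns with [INST] ... [/INST] after the
    BOS token; shown here explicitly for clarity rather than via the
    tokenizer's chat template.
    """
    return f"<s>[INST] {instruction.strip()} [/INST]"

# Hypothetical single-turn NLI-style instruction:
prompt = build_mistral_prompt(
    "Does the clinical trial report entail or contradict the statement?"
)
```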
TODO: for now, this information is in the System Description Paper.
We also obtained baseline results using Flan-T5 from huggingface, with the generation prompt `$premise \n Question: Does this imply that $hypothesis? $options`, checking the outputs for "Entailment" or "Contradiction".
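A minimal sketch of this baseline's prompt construction and output parsing (helper names and the keyword-matching order are ours; the generation call itself would use a Flan-T5 checkpoint from huggingface):

```python
def build_flan_prompt(premise: str, hypothesis: str,
                      options: str = "OPTIONS: Entailment or Contradiction") -> str:
    """Instantiate the Flan-T5 generation prompt used for the baseline."""
    return f"{premise}\nQuestion: Does this imply that {hypothesis}? {options}"

def parse_label(generated: str) -> str:
    """Map free-form model output to a task label by keyword matching."""
    text = generated.lower()
    if "entailment" in text:
        return "Entailment"
    if "contradiction" in text:
        return "Contradiction"
    return "Unknown"  # neither keyword found in the generated text
```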
Model | F1-score | Precision | Recall | Notes |
---|---|---|---|---|
flanT5-small | - | - | - | Always Contradiction |
flanT5-base | 0.32 | 0.50 | 0.23 | - |
flanT5-large | 0.53 | 0.56 | 0.49 | - |
flanT5-xl | 0.67 | 0.59 | 0.77 | - |
flanT5-xxl | 0.69 | 0.61 | 0.79 | - |
Model | F1-score | Precision | Recall | Notes |
---|---|---|---|---|
flanT5-small | - | - | - | Always Contradiction |
flanT5-base | 0.34 | 0.55 | 0.25 | - |
flanT5-large | 0.57 | 0.61 | 0.53 | - |
flanT5-xl | 0.69 | 0.61 | 0.79 | - |
flanT5-xxl | 0.71 | 0.59 | 0.88 | - |
Model | F1-score | Precision | Recall | Notes |
---|---|---|---|---|
flanT5-xl | 0.754 | 0.59 | 0.831 | - |