# Quick Startup

```Bash
# enter runtime dir
cd code/

# install deps
pip3 install -r requirements.txt

# edit parameters (see Code and Experiment Setup section)
vim main.py

# run
python3 main.py
```

# Motivation
Clinical trials are essential to experimental medicine because they assess the safety and efficacy
of novel medical treatments and ensure their compliance with regulations [1][4]. Clinical Trial
Reports (CTRs) record trial methodologies and results, and are pivotal
in evaluating trial effectiveness [1][2]. As of 2020, there were over 400,000 public CTRs, with 130
CTRs published daily, making it impractical for medical practitioners to perform a comprehensive
review before prescribing new treatments to patients [5].

Natural Language Inference (NLI), which uses language models to understand documents and perform
downstream tasks, is one appealing solution. It was the topic of SemEval-2023 Task 7:
Multi-evidence Natural Language Inference for Clinical Trial Data (NLI4CT), where different models
were proposed to solve a textual entailment task and then an evidence retrieval task on a set of breast
cancer CTRs with expert-annotated labels [5]. The entailment task was considered more challenging because
models needed to perform multi-hop reasoning, numerical reasoning, biomedical understanding, and
common-sense reasoning. Out of 643 submissions for the entailment task, a fine-tuned *Flan-T5* LLM
placed second [5].

### Task Description

Figure 1 illustrates the entailment task targeted in this project. Given a CTR premise and an expert-prepared
medical statement, the model must predict the inference relation between the premise and the
statement (hypothesis) as either *Contradiction* or *Entailment*. In the original task (2023), a Section ID was
also provided so that only the relevant section of the CTR premise was used. In our project, no
Section ID is provided, so the entire CTR premise is used.

<div style="display:inline-block;">
    <img src="imgs/task_description.png" alt="Task Description" style="width: 400px;"/>
    <div><text>Figure 1. Task Description</text></div>
    <br/>
</div>

We remove the Section IDs because annotating them is expensive
and they are not always available in real-life scenarios. Without Section IDs, our
approaches generalize to more realistic use cases.

Removing Section IDs introduces new challenges, which we frame as the following questions:

1. **Is Section ID necessary to achieve good accuracy?**
2. **Can we improve accuracy using full CTR as premise?**
3. **Can we operate using limited input token length?**

# Approach

Our approach is divided into two major categories:

1. Section ID Baselines
2. Full CTR Text Methods

The Section ID baselines follow the original task workflow described in Figure 1, where a (Section Premise, Hypothesis) pair
is combined into a prompt and fed into an LLM to make an entailment prediction. This method is used to establish a Section ID
baseline and answer question 1. It is also shown in Figure 2 (*).

The Full CTR Text methods comprise 5 different ways of handling a full CTR text as the premise, all of which try to either compress
or extract the important information from the CTR. These methods are shown in Figure 2 (A)-(E). A baseline that uses the Full CTR text
directly as the premise is also established and shown in Figure 2 (*).

A detailed explanation for each method is provided in the next sub-section.

<div style="display:inline-block;">
    <img src="imgs/methods_description.png" alt="Method Description" style="width: 800px;"/>
    <div><text>Figure 2. Methods Overview</text></div>
    <br/>
</div>

## Methods Explanation

**(*) Baselines:** A premise (either Section Extraction or Full Text) is combined with the hypothesis and fed into the
LLM using a prompt (see the Prompt Template section). The LLM then makes a zero-shot inference, predicting the inference relation
between the premise and the hypothesis as either *Contradiction* or *Entailment*.

$$\begin{aligned}
Prompt &= Prompt\_Template(premise, hypothesis)\\
Prediction &= LLM(Prompt)
\end{aligned}$$

**(A) Top K Sentence Retrieval:** Sentence embeddings for all CTRs in the dataset are pre-calculated using the *all-MiniLM-L6-v2*
model and stored in a vector database. When a (CTR, hypothesis) pair is provided, the hypothesis is first converted into a sentence embedding
using the same embedding model. This embedding is used to search the CTR vector database for the K most relevant CTR sentences
by cosine similarity. The top K sentences are then retrieved and concatenated to form a *new premise*. The new premise
and the hypothesis are then combined into a prompt and fed into the LLM for entailment.

$$\begin{aligned}
\vec{h} &= Embed(Hypothesis)\\
Similarity\_matrix &= cosine\_similarity(\vec{h}, all\_CTR\_sentence\_vectors)\\
Top\_K\_sentences &= Find\_max(Similarity\_matrix, K, all\_CTR\_sentences)\\
premise &= concat(Top\_K\_sentences)\\
Prediction &= LLM(Prompt\_Template(premise, hypothesis))
\end{aligned}$$
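
The retrieval step above can be sketched in plain Python. This is a minimal illustration, not the project's `topksearcher.py`: the toy 3-dimensional vectors stand in for *all-MiniLM-L6-v2* embeddings, `cosine_similarity` and `top_k_sentences` are hypothetical helper names, and joining the retrieved sentences in similarity order is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_sentences(hypothesis_vec, sentence_vecs, sentences, k):
    """Return the k sentences most similar to the hypothesis, best first."""
    scored = [
        (cosine_similarity(hypothesis_vec, vec), idx)
        for idx, vec in enumerate(sentence_vecs)
    ]
    # Sort by descending similarity; ties keep the original sentence order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sentences[idx] for _, idx in scored[:k]]


# Toy data: made-up vectors standing in for real sentence embeddings.
sentences = ["Adverse Events 1:", "Total: 0/13 (0.00%)", "Inclusion Criteria:"]
sentence_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
hypothesis_vec = [1.0, 0.0, 0.0]

premise = "\n".join(top_k_sentences(hypothesis_vec, sentence_vecs, sentences, k=2))
print(premise)
```

With real embeddings, the same ranking would usually be done with a vectorized matrix product rather than a Python loop.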

**(B) Sliding Window:** Using *MODEL_TOKEN_LIMIT* as the window size and *STRIDE* as the stride size, the full CTR text is split into chunks by sliding a fixed-size window along it. Each chunk is treated as a *chunk premise* and combined with the hypothesis to make a *chunk prediction* using the prompt and LLM. After obtaining all chunk predictions, we count the predictions for each class (*Contradiction*, *Entailment*) and use the class with the highest count as the final entailment label.

$$\begin{aligned}
&chunks = Split(fullCTRPremise, window\_size, stride\_size)\\
&chunk\_predictions = [\quad]\\
&for\quad chunk\_premise\quad in\quad chunks:\\
&\quad chunk\_predictions.add(LLM(Prompt\_Template(chunk\_premise, hypothesis)))\\
&Prediction = argmax(count(chunk\_predictions))
\end{aligned}$$
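
As a rough sketch of the windowing and voting steps, the snippet below splits a token list into overlapping chunks and takes a majority vote. The helper names `split_into_windows` and `majority_vote` are hypothetical, the integer list stands in for tokenizer output, and the tie-breaking rule (first label seen wins) is an assumption rather than the project's documented behaviour.

```python
from collections import Counter


def split_into_windows(tokens, window_size, stride):
    """Slide a fixed-size window over `tokens`, advancing by `stride`."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the last window already reaches the end of the text
        start += stride
    return chunks


def majority_vote(chunk_predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(chunk_predictions).most_common(1)[0][0]


tokens = list(range(10))  # stand-in for a tokenized full CTR premise
chunks = split_into_windows(tokens, window_size=4, stride=3)
print(chunks)

votes = ["Entailment", "Contradiction", "Entailment"]  # made-up chunk predictions
print(majority_vote(votes))
```

Note that a stride larger than the window size leaves gaps between chunks, so some tokens are never seen by the model.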

**(C) LLM Summarization:** Given a *SUMMARY_GEN_MAX_TOKEN* and a *MODEL_TOKEN_LIMIT*, we use the same LLM to generate a summary of
at most *SUMMARY_GEN_MAX_TOKEN* tokens for each *SECTION* of the full CTR premise. The section summaries are then concatenated
to form the final premise, which is used with the hypothesis to generate the entailment prediction.

When generating a section summary, the section text is split into chunks if its size exceeds *MODEL_TOKEN_LIMIT*. Each chunk is used to generate
a chunk summary, and the chunk summaries are combined either by *concatenation* or by *a secondary LLM summary of the concatenated chunk summaries* to produce the final *section summary*.

$$\begin{aligned}
&section\_summaries = [\quad]\\
&for\quad section\_id\quad in\quad fullCTRPremise:\\
&\quad section\_text = extract(fullCTRPremise, section\_id)\\
&\quad if\quad len(section\_text) > MODEL\_TOKEN\_LIMIT:\\
&\quad\quad chunks = Split(section\_text, MODEL\_TOKEN\_LIMIT)\\
&\quad\quad chunk\_section\_summaries = [LLM(Summary\_Prompt\_Template(chunk\_text))\quad for\quad chunk\_text\quad in\quad chunks]\\
&\quad\quad section\_summary = concat(chunk\_section\_summaries)\\
&\quad\quad\quad\quad\quad OR\\
&\quad\quad section\_summary = LLM(Summary\_Prompt\_Template(concat(chunk\_section\_summaries)))\\
&\quad else:\\
&\quad\quad section\_summary = LLM(Summary\_Prompt\_Template(section\_text))\\
&\quad section\_summaries.add(section\_summary)\\
&CTR\_Summary = concat(section\_summaries)\\
&Prediction = LLM(Prompt\_Template(CTR\_Summary, hypothesis))
\end{aligned}$$
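
The control flow above can be illustrated with a stand-in summarizer. In this sketch `summarize_section`, `summarize_ctr`, and `fake_summarize` are hypothetical names, token lists stand in for text, and the `hierarchical` flag is an assumed way of switching between the concatenation variant and the secondary-summary variant. It does not check whether the concatenated chunk summaries themselves fit the token limit.

```python
def summarize_section(section_tokens, summarize, token_limit, hierarchical=False):
    """Summarize one section, chunking it first if it exceeds `token_limit`."""
    if len(section_tokens) <= token_limit:
        return summarize(section_tokens)
    chunks = [
        section_tokens[i:i + token_limit]
        for i in range(0, len(section_tokens), token_limit)
    ]
    chunk_summaries = [summarize(chunk) for chunk in chunks]
    combined = [tok for summary in chunk_summaries for tok in summary]
    # Concatenation stops here; the hierarchical variant summarizes once more.
    return summarize(combined) if hierarchical else combined


def summarize_ctr(sections, summarize, token_limit, hierarchical=False):
    """Concatenate per-section summaries into one premise (a token list)."""
    premise = []
    for section_tokens in sections.values():
        premise.extend(
            summarize_section(section_tokens, summarize, token_limit, hierarchical)
        )
    return premise


# Stand-in "LLM": keeps the first two tokens of whatever it is given.
def fake_summarize(tokens):
    return tokens[:2]


sections = {"intervention": list("abcdefg"), "results": list("xy")}
concat_premise = summarize_ctr(sections, fake_summarize, token_limit=3)
hier_premise = summarize_ctr(sections, fake_summarize, token_limit=3, hierarchical=True)
print(concat_premise)
print(hier_premise)
```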

**(D) Premise Truncation:** Given a *MODEL_TOKEN_LIMIT*, we keep only the beginning of the full CTR premise, up to the token limit, and use the truncated premise and the hypothesis to predict the entailment label.

$$\begin{aligned}
Truncated\_Premise &= Truncate(fullCTRPremise, MODEL\_TOKEN\_LIMIT)\\
Prediction &= LLM(Prompt\_Template(Truncated\_Premise, hypothesis))
\end{aligned}$$
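
A minimal sketch of truncation, assuming whitespace-separated words as a stand-in for tokenizer output. `truncate_premise` is a hypothetical helper, and whether the limit applies to the premise alone or to the whole prompt is not specified here.

```python
def truncate_premise(tokens, token_limit):
    """Keep only the first `token_limit` tokens of the premise."""
    return tokens[:token_limit]


# Made-up premise text; real code would use the model tokenizer instead of split().
tokens = "intervention Group I receives topical keratin eligibility Inclusion Criteria".split()
truncated = truncate_premise(tokens, token_limit=5)
print(" ".join(truncated))
```

Because sections are concatenated in a fixed order, everything past the limit (here, the eligibility text) is dropped.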

**(E) Section ID Retrieval:** Similar to the original Section ID workflow, we use the same LLM to first predict the relevant Section ID given the full CTR premise and the hypothesis. We then retrieve the section premise using the predicted Section ID and use it together with the hypothesis to predict the entailment label.

$$\begin{aligned}
Section\_ID &= LLM(Predict\_SECTION\_Prompt\_Template(fullCTRPremise, hypothesis))\\
Premise &= Extract(fullCTRPremise, Section\_ID)\\
Prediction &= LLM(Prompt\_Template(Premise, hypothesis))
\end{aligned}$$
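
The extraction step depends on mapping free-form LLM output back to a section name. The sketch below shows one way this could look; `parse_section_id` and `retrieve_section_premise` are hypothetical helpers, and falling back to the full CTR when no section name is recognized is an assumption of this example.

```python
SECTIONS = ("intervention", "eligibility", "adverse events", "results")


def parse_section_id(llm_output, default=None):
    """Map free-form LLM output to a known section name, or `default`."""
    normalized = llm_output.lower().replace("_", " ")
    for section in SECTIONS:
        if section in normalized:
            return section
    return default


def retrieve_section_premise(ctr_sections, llm_output):
    """Return the predicted section's text, or the full CTR if parsing fails."""
    section = parse_section_id(llm_output)
    if section is None:
        # Fallback is an assumption of this sketch: use every section.
        return "\n".join("\n".join(s) for s in ctr_sections.values())
    return "\n".join(ctr_sections[section])


ctr_sections = {
    "intervention": ["INTERVENTION 1:"],
    "eligibility": ["Inclusion Criteria:"],
    "adverse events": ["Adverse Events 1:", "Total: 0/13 (0.00%)"],
    "results": ["Outcome Measurement:"],
}
section_premise = retrieve_section_premise(ctr_sections, "Adverse_Events")
print(section_premise)
```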

**Some important conditions:**

1. Unless otherwise mentioned, the full CTR premise is created by concatenating the sentences from each section, with the Section ID added as a title before each section. The concatenation order is always "intervention", "eligibility", "adverse events", and "results", the same order presented in the original CTR file. A glimpse of the full CTR premise is shown below:
```Python
"""
  intervention\nIntervention Sentence 1\nIntervention Sentence 2\nIntervention Sentence 3\n...\n
  eligibility\nEligibility Sentence 1\nEligibility Sentence 2\nEligibility Sentence 3\n
  adverse events\nAdverse Events Sentence 1\nAdverse Events Sentence 2\nAdverse Events Sentence 3\n
  results\nResults Sentence 1\nResults Sentence 2\nResults Sentence 3\n
"""
```
2. In *Top K Sentence Retrieval*, the sentence embeddings are created in the same order as the sentences appear in the full CTR premise. However, in the Top K search, all sentences from all sections are compared with the hypothesis, and any sentence can be retrieved.

3. In *Top K Sentence Retrieval*, the pre-calculated vector database (pickle files) must be provided and loaded into the program using the parameter `TOPK_DB` in `main.py, line 255`. This should be the relative path to the location of two files: `raw_text_db.pickle` and `annotations_db_val.pickle`. Both must exist with exactly these file names. Please refer to the Setup section for more information.

4. While we use the same LLM for entailment, summarization, and section prediction, you can substitute a different model for each step by changing the code.
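
The premise construction described in condition 1 can be sketched as follows. `build_full_premise` is a hypothetical helper, not the function in `functions.py`; the `include_id` argument mirrors the `INCLUDE_ID` parameter, and stripping whitespace from each sentence is an assumption.

```python
SECTION_ORDER = ("Intervention", "Eligibility", "Adverse Events", "Results")


def build_full_premise(ctr, include_id=True):
    """Concatenate all CTR sections in a fixed order, one sentence per line."""
    lines = []
    for section in SECTION_ORDER:
        if include_id:
            lines.append(section.lower())  # section name used as a title line
        lines.extend(sentence.strip() for sentence in ctr.get(section, []))
    return "\n".join(lines) + "\n"


# Shortened record in the shape of the CTR files.
ctr = {
    "Clinical Trial ID": "NCT03374995",
    "Intervention": ["INTERVENTION 1: ", "  Group I (Topical Keratin)"],
    "Eligibility": ["Inclusion Criteria:"],
    "Results": ["Outcome Measurement: "],
    "Adverse Events": ["Adverse Events 1:", "  Total: 0/13 (0.00%)"],
}
full_premise = build_full_premise(ctr)
print(full_premise)
```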

## Model Choice
**Flan-T5** is chosen as our base inference model for three reasons: 1. It was shown to work well on
the NLI4CT entailment task compared to other LLMs [6]. 2. It is relatively small (780M parameters for large and 3B
for XL vs 7B and 11B for LLaMA). 3. It has been instruction fine-tuned to work well with zero-shot inference [3].

**all-MiniLM-L6-v2** is chosen for sentence embedding for one of our methods (Top K). It is a well-known sentence transformer
pretrained model with fast inference speed and good quality.

## Prompt Template

Three prompt templates will be used throughout our experiments. 

### Entailment Prompt

This prompt is taken from the top LLM paper in the 2023 NLI4CT challenge [6]. The authors experimented with multiple prompt templates for Flan-T5 and published the best-performing one [6]. The prompt takes two parameters: a **CTR Premise**, which is either the section premise, the full CTR premise, or a premise augmented by one of the five methods we experiment with, and a **statement**, which is taken directly from each sample. The *Options* line is added to guide the instruction-tuned model to produce only the expected output.

```Python
"""
{CTR Premise}
Question: Does this imply { statement }?
Options: entailment , contradiction"
"""
```
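
For illustration, the template can be filled and the model output normalized as sketched below. `build_entailment_prompt` and `parse_label` are hypothetical helpers, the statement is made up, and the exact whitespace and trailing characters of the template used in `functions.py` may differ from this version.

```python
ENTAILMENT_TEMPLATE = (
    "{premise}\n"
    "Question: Does this imply {statement}?\n"
    "Options: entailment , contradiction"
)


def build_entailment_prompt(premise, statement):
    """Fill the entailment template with a premise and a statement."""
    return ENTAILMENT_TEMPLATE.format(
        premise=premise.strip(), statement=statement.strip()
    )


def parse_label(llm_output):
    """Normalize raw model output to a dataset label, or None if unrecognized."""
    text = llm_output.strip().lower()
    if text.startswith("entail"):
        return "Entailment"
    if text.startswith("contradict"):
        return "Contradiction"
    return None


prompt = build_entailment_prompt(
    "Adverse Events 1:\nTotal: 0/13 (0.00%)",
    "no adverse events were recorded in cohort 1",  # made-up statement
)
print(prompt)
print(parse_label(" entailment"))
```

Returning `None` for unrecognized output keeps malformed generations visible instead of silently counting them as one of the two classes.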

### Summarization Prompt

We crafted the following prompt for Flan-T5 to generate a text summary given a premise (long text). We ask the LLM to include details and control the length of maximum generation using a parameter.

```Python
"""
Please summarize the following and include important details as much as possible: {}
"""
```

### Section ID Prediction Prompt

We crafted a prompt for Flan-T5 to predict the Section ID given the full CTR text and the paired hypothesis. The prompt directs the LLM to the four sections and provides a template for generating a relevant response.

```Python
"""
A premise contains four sections: {}\nA hypothesis describes one of the sections: {}\nDetermine the most relevant section from the four options: intervention, results, eligibility, adverse_events".
"""
```

# Data

*Details on how to load and use the dataset are provided in the Experimental Setup section.*

## Source
In our project, we use the data from [NLI4CT 2024](https://sites.google.com/view/nli4ct/semeval-2024). The official dataset can be found at this [Github Link](https://github.com/ai-systems/Task-2-SemEval-2024/tree/main). A community-built HuggingFace dataset is also available at [Huggingface Link](https://huggingface.co/datasets/bigbio/sem_eval_2024_task_2/tree/main). We used the latter for convenience.

In case the community dataset is no longer available, please go to the official link and prepare your data into (annotations, id_to_clinical_trial_record) tuple. For more information, read `code/main.py/load_data()` function.

In case you want to try the dataset yourself:
```Python
import datasets
annotations = datasets.load_dataset("bigbio/sem_eval_2024_task_2", name="sem_eval_2024_task_2_source")
raw_texts = datasets.load_dataset("bigbio/sem_eval_2024_task_2", name="sem_eval_2024_task_2_ct")["train"]
```

## Dataset Description
The NLI4CT dataset is based on a collection of 1000 publicly available breast cancer **CTRs** on [ClinicalTrials.gov](https://clinicaltrials.gov/ct2/home), expert annotated **statements**, and **labels**[5]. 

Each CTR (premise) is separated into four sections: intervention, eligibility, results and adverse events. Each section consists of a list of medical sentences. Each statement (hypothesis) is a single sentence making a claim about one or two CTRs, depending on whether the annotation type is *Single* or *Comparison*. Each label states the inference relation between the statement and the CTRs. The entire dataset is split into train, dev, and test sets of 1700, 200, and 500 samples, respectively [5].

**In our project, we use only the `dev` set because `test` set is not public and `train` set is too large for our limited resources.**

When working with the dataset, you will have two files: the CTR file and the annotation file. We provide examples for each of them below.

### CTR Sample
```json
{
    "Clinical Trial ID": "NCT03374995",
    "Intervention": [
        "INTERVENTION 1: ",
        "  Group I (Topical Keratin)",
        "  Patients receive topical keratin topically at least BID until the end of radiation therapy (approximately 3-6 weeks)."
    ],
    "Eligibility": [
        "Inclusion Criteria:",
        "  Area to be irradiated representing 1-10% of total body surface area (TBSA)",
        "Exclusion Criteria:"
    ],
    "Results": [
        "Outcome Measurement: ",
        "  Incidence of Early Adverse Skin Reactions (EASRs)",
        "Results 1: ",
        "  Arm/Group Title: Group I (Topical Keratin)"
    ],
    "Adverse Events": [
        "Adverse Events 1:",
        "  Total: 0/13 (0.00%)",
        "Adverse Events 2:",
        "  Total: 0/11 (0.00%)"
    ]
}
```

Here, "Clinical Trial ID" refers to the unique ID of this CTR file. Each section is represented as a list of sentences. The number of sentences ranges from very few (2 or 3) to many (over 30). Each section contains on average 5-500 tokens [5].

### Annotation Sample
```json
{
    "1adc970c-d433-44d0-aa09-d3834986f7a2": {
        "Type": "Single",
        "Section_id": "Results",
        "Primary_id": "NCT00066573",
        "Statement": "there is a 13.2% difference between the results from the two the primary trial cohorts",
        "Label": "Contradiction",
        "Primary_evidence_index": [0, 1, 23, 4, 23]
    }
}
```

Each annotation represents a data sample for the entailment task. It contains the expert-annotated class label, the Section ID, and the statement (hypothesis). Here, "Type" indicates whether the annotation looks at a single CTR or a pair of CTRs. When looking at a pair, the statement makes a claim about both CTRs (usually as a comparison). The "Primary_id" and "Secondary_id" (if applicable) are the "Clinical Trial ID" values of the relevant CTRs. The "Primary_evidence_index" and "Secondary_evidence_index" (if applicable) are the indices of the correct sentences in the relevant section for the evidence retrieval task, which is not used in our project. "Section_id", "Label" and "Statement" are self-explanatory.

### Data Balance

The dataset (dev) contains 200 samples and is overall balanced. 

+ There are exactly 100 Contradiction samples and 100 Entailment samples.
+ There is a similar number of samples for each annotated section.
+ There are more Single samples than Comparison samples.

| Type | Count | Section: Intervention | Section: Eligibility | Section: Adverse Events | Section: Results | Label: Contradiction | Label: Entailment |
|---|---|---|---|---|---|---|---|
| Single | 140 | 26 | 44 | 32 | 38 | 70 | 70 |
| Comparison | 60 | 10 | 12 | 20 | 18 | 30 | 30 |
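
Counts like those in the table can be reproduced from the annotation file with a few lines of Python. The snippet below is a sketch on three made-up annotations in the format shown earlier; `balance_counts` is a hypothetical helper.

```python
from collections import Counter

# Toy annotations in the documented format; real data would come from the dev set.
annotations = {
    "id-1": {"Type": "Single", "Section_id": "Results", "Label": "Contradiction"},
    "id-2": {"Type": "Single", "Section_id": "Eligibility", "Label": "Entailment"},
    "id-3": {"Type": "Comparison", "Section_id": "Results", "Label": "Entailment"},
}


def balance_counts(annotations, field):
    """Count samples per (Type, field value) pair."""
    return Counter((a["Type"], a[field]) for a in annotations.values())


by_label = balance_counts(annotations, "Label")
by_section = balance_counts(annotations, "Section_id")
print(by_label)
print(by_section)
```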

# Code 

**We wrote our own inference / evaluation code from scratch** using the PyTorch and HuggingFace Transformers packages. Our code supports **command-line execution**, with parameters set inside the executable script. The code directory structure is shown below:

```
- Root Directory/
  - code/
    - main.py               -- IMPORTANT: executable eval script
    - functions.py          -- IMPORTANT: primary model class and helper functions
    - topksearcher.py       -- IMPORTANT: top k search class
    - vectordb/              -- IMPORTANT: location of pre-calculated embedding binaries
      - allMiniLML6V2
        - annotations_db_val.pickle  -- IMPORTANT: embedding binary for all annotation statements
        - raw_text_db.pickle          -- IMPORTANT: embedding binary for all CTR files
    - prepare-topk-embeddings.ipynb  -- IMPORTANT: use this to generate embedding binaries
    - f1-calc.ipynb                  -- Script to calculate precision, recall, F1
    - prepare-section-id.ipynb       -- Script to test section id prediction
    - requirements.txt
    - Final_Project.ipynb            -- Playground notebook for code development.

  - large_results/         -- experiment result logs for large model
  - xl_results/            -- experiment result logs for xl model
  - f1_calculation/        -- sanitized logs prepared for precision-recall calculation for project submission
```

## Environment Setup
```bash
cd code
pip install -r requirements.txt
```

We recommend using `conda` to create a fresh `python3.11` environment first.

You will need enough disk space to download datasets and models as well.

## Run Inference

To run the experiment, all you need to do is set the parameters in `main.py`. If you want to use the `topk` method, make sure you have correctly prepared the embedding database (see the Build Embedding Binaries section).

```bash
python3 main.py
```

### Experiment Parameters
Our `main.py` script supports all methods listed in the Approach section, and their associated parameters are tuned directly in the file. The setup section starts at `LINE 274`. To understand these parameters, please refer to the comments as well as the docstring at `LINE 70`.

At a high level, you can choose whether to use the Section ID or not (full CTR) when building the premise, and among 6 different ways of augmenting the premise. A compatibility matrix is provided below:

|  | base | truncate | sliding_window | summarize-concat | summarize | topk | autosection |
|---|---|---|---|---|---|---|---|
| Section Mode | Yes | Yes | Yes | No | No | No | Yes |
| Full CTR Mode | Yes | Yes | Yes | Yes | Yes | Yes | Yes |

*autosection is the code name for the section retrieval method.*

*summarize-concat and summarize are both summarization methods. The difference is how chunks are combined when building the section summary.*

*topk requires existence of embedding binaries.*

### Logging

**Logging is an important part of the program and required for successful recording of experiment results.**

Logs are created automatically when running the script, in two log files: `{expname}_result.log` contains prediction results and `{expname}.log` contains runtime logs. The result log is a dictionary file:

```python
{'acc': pred_accuracy, 'result': [[pred_label, true_label]], 'logs': "additional logs for some methods"}
```
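
As a sketch of how such a result log might be read back, the snippet below parses a log string and recomputes accuracy from the prediction pairs. It assumes the file content is a Python dict literal that `ast.literal_eval` can parse, which is an assumption based on the format shown above; `load_result_log` and `accuracy` are hypothetical helpers and the sample content is made up.

```python
import ast


def load_result_log(text):
    """Parse a result log written as a Python dict literal (an assumption)."""
    log = ast.literal_eval(text)
    pairs = [tuple(pair) for pair in log["result"]]
    return log["acc"], pairs


def accuracy(pairs):
    """Fraction of (predicted, true) pairs that match."""
    return sum(pred == true for pred, true in pairs) / len(pairs)


# Made-up log content in the documented shape.
sample = (
    "{'acc': 0.5, 'result': [['Entailment', 'Entailment'], "
    "['Entailment', 'Contradiction']], 'logs': ''}"
)
logged_acc, pairs = load_result_log(sample)
print(logged_acc, accuracy(pairs))
```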

Set `expname` in `main.py, LINE 298`, `logname = f"{expname}"`. If you reuse a name from a previous experiment, its logs will be overwritten. By default, the name is interpreted as a path relative to `main.py`.

Cmd-line output is the same as from `{expname}.log`.

### Parameter Examples For Different Experiments

**Shared Parameters**

These parameters are used with all methods.

```python
MODEL_NAME = "xl" # Flan-T5 model size. Check this page for available models: https://huggingface.co/docs/transformers/model_doc/flan-t5
INCLUDE_ID = True # Whether to include section ID names when building the full CTR premise. Refer to the Methods Explanation section. We use True for all our experiments.
RANDOM_SECTION = False # (Experimental) Whether to choose a random section (Section mode) or a random section order (Full CTR mode) when building the premise.
USE_SECTION = True # Whether to use Section mode or Full CTR mode. Each mode is compatible with a subset of methods.
```

**Section ID Baseline**
```python
USE_SECTION = True
PREMISE_METHOD = "base" # Chooses what type of method to handle premise. For section mode, only supports "base", "sliding_window", "truncate", "autosection". "autosection" is the name for section retrieval.
```

**Full CTR Baseline**
```python
USE_SECTION = False
PREMISE_METHOD = "base"
```

**Full CTR Truncation**
```python
USE_SECTION = False
PREMISE_METHOD = "truncate"
MODEL_TOKEN_LIMIT = 512 # unit: token
```

**Full CTR Sliding Window**
```python
USE_SECTION = False
PREMISE_METHOD = "sliding_window"
MODEL_TOKEN_LIMIT = 512 # unit: token
STRIDE = 256 # unit: token
```

**Full CTR Summarization with Concat**
```python
USE_SECTION = False
PREMISE_METHOD = "summarize-concat"
MODEL_TOKEN_LIMIT = 512 # unit: token
SUMMARY_GEN_MAX_TOKEN = 380 # unit: token
```

**Full CTR Hierarchical Summarization**
```python
USE_SECTION = False
PREMISE_METHOD = "summarize"
MODEL_TOKEN_LIMIT = 512 # unit: token
SUMMARY_GEN_MAX_TOKEN = 380 # unit: token
```

**Full CTR Top K**
```python
USE_SECTION = False
PREMISE_METHOD = "topk"
TOPK = 50
TOPK_DB = "./vectordb/allMiniLML6V2/" # important, must exist with two pickle files: raw_text_db.pickle, annotations_db_val.pickle
```

**Full CTR Section Retrieval**
```python
USE_SECTION = False
PREMISE_METHOD = "autosection"
```

### (IMPORTANT) Build Embedding Binaries
To build embedding binaries, run `prepare-topk-embeddings.ipynb`.

By default, it will use `sentence-transformers/all-MiniLM-L6-v2` model from HuggingFace. 

Run the section `embed raw_text and annotation sentences`.

At the end of the section, two pickle files are saved to the current directory: `raw_text_db.pickle` and `annotations_db_val.pickle`. DO NOT change their names, but do move them to wherever you want to store them. By default, we put them under `code/vectordb/allMiniLML6V2`. If you change the location, make sure you update the `TOPK_DB` parameter in `main.py`.

### Calculate Precision-Recall
We also provide a notebook, `f1-calc.ipynb`, to help you process the experiment result log files and calculate F1 scores.

# Experimental Setup

We ran the following sets of experiments to compare 1) the Section baseline against the Full CTR Text baseline, 2) the five full-text handling methods against the Full CTR Text baseline, and 3) the methods against each other. Our primary metric is accuracy; we also calculated precision and recall for some experiments. Below is the list of experiments we ran:

<div style="display:inline-block;">
    <img src="imgs/list_of_experiments.png" alt="" style="width: 600px;"/>
    <br/>
</div>

## Evaluation Metrics
Accuracy is used as the primary metric. We calculate precision, recall and macro-F1 scores for some cases as well.

$Accuracy = {True\ Positive\ +\ True\ Negative \over (True\ Positive\ +\ True\ Negative\ +\ False\ Positive\ +\ False\ Negative)}$

$Precision = \frac{\text{True Positives}}{\text{True Positives + False Positives}}$

$Recall = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}$

$F1\ Score = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision + Recall}}$
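
The formulas above can be applied to a list of (predicted, true) pairs as sketched below. This is not the code in `f1-calc.ipynb`; `per_class_metrics` and `macro_f1` are hypothetical helpers, the pairs are made up, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def per_class_metrics(pairs, positive):
    """Precision, recall and F1 treating `positive` as the positive class."""
    tp = sum(1 for pred, true in pairs if pred == positive and true == positive)
    fp = sum(1 for pred, true in pairs if pred == positive and true != positive)
    fn = sum(1 for pred, true in pairs if pred != positive and true == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(pairs, labels=("Entailment", "Contradiction")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_metrics(pairs, label)[2] for label in labels) / len(labels)


# Made-up (predicted, true) pairs.
pairs = [
    ("Entailment", "Entailment"),
    ("Entailment", "Contradiction"),
    ("Entailment", "Entailment"),
    ("Contradiction", "Contradiction"),
]
print(per_class_metrics(pairs, "Entailment"))
print(per_class_metrics(pairs, "Contradiction"))
print(macro_f1(pairs))
```

In this toy example the Entailment class has higher recall than precision because the model over-predicts it, while the Contradiction class shows the opposite pattern.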


# Results
## 1. Baseline Comparison
<div style="display:inline-block;">
    <img src="imgs/result_sectionID_effectiveness.png" alt="Effectiveness of Section ID" style="width: 400px;"/>
    <div><text>Figure 3: Zero-shot performance of FlanT5-large and XL on premise with a) only true
section, b) full CTR, and c) a random section, compared against previous work with true
section on FlanT5-XXL</text></div>
    <br/>
</div>

## 2. Optimizing Full Text Prediction 
We tried 5 methods to handle the lengthy text by reducing the premise. Their performance compared with the baseline is shown below:
<div style="display:inline-block;">
    <img src="imgs/results_performance.png" alt="Performance of Full Text Handling Methods" style="width: 400px;"/>
    <div><text>With Section Retrieval Success Rate: 0.88</text></div>
    <div><text>Figure 4: Performance of Full Text Handling Methods</text></div>
    <br/>
</div>

Their precision and recall for both tasks and sizes of models are compared:

<div style="display:inline-block;">
    <img src="imgs/results_precision_recall.png" alt="Performance of Full Text Handling Methods" style="width: 400px;"/>
    <div><text>Figure 5: Precision and Recall of Full Text Handling Methods</text></div>
    <br/>
</div>


# Analysis of the Results


## Effectiveness of Section ID
In Figure 3, to evaluate the effectiveness of the Section ID, we measured the baseline performance of zero-shot Flan-T5 on the task when the premise is *a) the correct section, b) the full text, and c) a random section*.
* The baseline model **with Section ID** performs best overall for both large and XL, at **58.5%** and **65.5%**, respectively.
* Accuracy with the full text or a random section as the premise is **lower, especially for the random section**, which reduced accuracy from **58.5%** to **54.3%** for the large model and from **65.5%** to **56%** for the XL model.
* This demonstrates that the Section ID has a positive impact on model performance.

*Note that previous work reported an accuracy of 74.3% when using the XXL model on the premise with the correct Section ID, which we did not reproduce due to model size limitations.*


## Performance of Full-text Handling Methods
Of the 5 methods for handling lengthy text as the premise, **Truncation and Section Retrieval** outperformed the other methods and improved accuracy over the full-text baseline.

In Figure 4:
* For the large model, truncation improved accuracy by 3 percentage points, **from 58.5% to 61.5%**.
* For the XL model, section retrieval improved accuracy by 3 percentage points, **from 62% to 65%**. This is almost on par with our baseline performance when the section is given (65.5%).
* Note that the success rate of section retrieval is 88%, meaning 88% of the predicted Section IDs were correct; the reported accuracy reflects this success rate.

In Figure 5, precision and recall are compared for all methods for each class (Contradiction and Entailment):
* The XL model outperforms the large model in precision and recall, which is consistent with their accuracy.
* The separation between the Contradiction and Entailment classes may give insight into Flan-T5's behaviour: it tends to predict contradiction cautiously and entailment liberally.


# Conclusion 

To answer the 3 questions we raised in Motivation:
1. **Is Section ID necessary to achieve good accuracy?**

Section ID positively impacts model performance, but alternative methods can compensate for the lack of Section IDs.

2. **Can we improve accuracy using full CTR as premise?**

Yes, some of the methods, especially section retrieval and truncation, visibly improved on the performance of the full-text baseline model.

3. **Can we operate using limited input token length?**

Yes, the methods also allow us to operate with a limited input token length, if that is a constraint.

### Why do we care about limited input token size?
1. Most large language models have a fixed maximum number of tokens they can consider at a time, commonly referred to as the model’s “context window.”
2. If a conversation exceeds the model’s token limit, the model may forget critical information discussed earlier, resulting in out-of-context or irrelevant responses. The same applies to long documents, where the model might lose the thread of the narrative or forget key details from earlier sections. Seq2seq models might also forget critical information in the middle of the input.
3. As the length of the sequence increases, so does the computational burden. Flan-T5's memory use grows quadratically with input length.

### Future Work

#### Continuous Improvement of Accuracy Through Hyperparameter Tuning

One potential avenue for future research could center on enhancing prediction accuracy. In our ongoing investigation, we've delved into certain parameter ranges and achieved favorable outcomes through manual tuning. We posit that employing more systematic tools, such as Weights and Biases, could enable a more exhaustive exploration of parameters, leading to further enhancements in model performance.

#### Leveraging Entailment Pipeline to Perform Evidence Retrieval

A natural follow-up study after our research would be using the entailment relations to extract relevant sentences from chosen CTR and support downstream tasks. 

For instance, we might modify the prompt to instruct the Language Model (LLM) to extract sentences that inherently support the hypothesis. Alternatively, employing a two-step approach, we could instruct the LLM to first extract relevant sections and then carry out sentence retrieval.

#### Model Explainability

Based on our current findings, we have made several noteworthy observations regarding model explainability that warrant further investigation. These observations include:

1. Truncation Method Insights:

When employing the Truncation Method, the method keeps the beginning of the full CTR text and discards the rest. Since our complete CTR text is a concatenation of report sections following a fixed order ("intervention," "eligibility," "adverse events," "results"), achieving high accuracy implies effective comprehension by the model, even when focusing solely on the initial sections ("intervention" and "eligibility"). This insight suggests two possibilities: Firstly, the "intervention" segment may contain crucial information regardless of the annotated statement's source section. Secondly, there may be inherent biases among human annotators when formulating statements.

2. Sliding Window Method Optimization:

Notably, one of the best results with the Sliding Window Method was observed when using a larger stride size and a smaller window size. This observation indicates that predictions could improve when certain information is strategically skipped. We are intrigued by the prospect of distilling this skipping approach into a separate method to systematically improve accuracy.

3. Precision and Recall Disparities across Label Classes:

A comprehensive comparison of precision and recall for different label classes (contradiction and entailment) across all methods revealed a consistent pattern. On average, the entailment class exhibited higher recall than precision, suggesting that the model predicts entailment liberally. In contrast, the contradiction class displayed higher precision than recall, indicating a more cautious tendency when predicting contradiction. We are keen to determine whether these effects are intrinsic to the dataset or specific to the Flan-T5 model.

These insights shed light on the inner workings of the underlying model within the specific dataset, presenting compelling avenues for further exploration in explainability studies.


# References

[1] Nancy E. Avis, Kevin W. Smith, Carol L. Link, Gabriel N. Hortobagyi, and Edgardo Rivera.
Factors associated with participation in breast cancer treatment clinical trials. Journal of
Clinical Oncology, 24(12):1860–1867, April 2006.


[2] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large
annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference
on Empirical Methods in Natural Language Processing. Association for Computational
Linguistics, 2015.

[3] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan
Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu,
Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie
Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent
Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob
Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned
language models, 2022.

[4] Jay DeYoung, Eric Lehman, Benjamin Nye, Iain Marshall, and Byron C. Wallace. Evidence
inference 2.0: More data, better models. In Proceedings of the 19th SIGBioMed Workshop on
Biomedical Language Processing. Association for Computational Linguistics, 2020.

[5] Maël Jullien, Marco Valentino, Hannah Frost, Paul O'Regan, Donal Landers, and André Freitas.
SemEval-2023 task 7: Multi-evidence natural language inference for clinical trial data. In
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023).
Association for Computational Linguistics, 2023.

[6] Kamal Raj Kanakarajan and Malaikannan Sankarasubbu. Saama AI research at SemEval-2023
task 7: Exploring the capabilities of flan-t5 for multi-evidence natural language inference
in clinical trial data. In Proceedings of the 17th International Workshop on Semantic
Evaluation (SemEval-2023). Association for Computational Linguistics, 2023.

[7] Jiongnan Liu, Jiajie Jin, Zihan Wang, Jiehan Cheng, Zhicheng Dou, and Ji-Rong Wen.
Reta-llm: A retrieval-augmented large language model toolkit, 2023.

[8] Keivalya Pandya and Mehfuza Holia. Automating customer service using langchain: Building
custom open-source gpt chatbot for organizations, 2023.

[9] Yuxuan Zhou, Ziyu Jin, Meiwei Li, Miao Li, Xien Liu, Xinxin You, and Ji Wu. THiFLY
research at SemEval-2023 task 7: A multi-granularity system for CTR-based textual entailment
and evidence retrieval. In Proceedings of the 17th International Workshop on Semantic
Evaluation (SemEval-2023). Association for Computational Linguistics, 2023.

[10] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of
catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747, 2023.

# Appendix

## Additional Results

#### Inference Speed Comparison@Large Model

<div style="display:inline-block;">
    <img src="imgs/results_speed_large.png" alt="Performance of Full Text Handling Methods" style="width: 400px;"/>
    <div><text>Figure 6: Inference Speed@Large</text></div>
    <br/>
</div>

#### Increased Accuracy When Sliding Over Some Gap@XL Model?
<div style="display:inline-block;">
    <img src="imgs/results_sliding_XL.png" alt="Performance of Full Text Handling Methods" style="width: 400px;"/>
    <div><text>Figure 7: Sliding Window Performance@XL</text></div>
    <br/>
</div>

Comment: Interestingly, accuracy can improve when the window size is smaller than the stride size, meaning the window skips some information. This effect might indicate that either 1) some information in the original CTR text actually confuses the model's entailment prediction, or 2) the model happens to "focus" on some important information.
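The skipping effect can be made concrete with a small sketch. It is illustrative only and assumes the text is already a list of tokens; the helper names `sliding_windows` and `coverage` are hypothetical, and how per-window predictions are combined is outside its scope.

```python
def sliding_windows(tokens, window_size, stride):
    """Return windows of up to `window_size` tokens, one starting every `stride` tokens."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    return [tokens[start:start + window_size] for start in range(0, len(tokens), stride)]


def coverage(num_tokens, window_size, stride):
    """Fraction of token positions that fall inside at least one window."""
    covered = set()
    for start in range(0, num_tokens, stride):
        covered.update(range(start, min(start + window_size, num_tokens)))
    return len(covered) / num_tokens


tokens = list(range(10))  # stand-in for 10 token ids

# Stride larger than window: tokens 2, 3, 6 and 7 are never seen by the model.
gapped = sliding_windows(tokens, window_size=2, stride=4)
print(gapped, coverage(len(tokens), window_size=2, stride=4))

# Stride smaller than window: overlapping windows, every token is seen.
overlapping = sliding_windows(tokens, window_size=4, stride=2)
print(overlapping[:2], coverage(len(tokens), window_size=4, stride=2))
```

Logging the coverage value next to accuracy for each (window size, stride) pair would help separate the two explanations above: if accuracy rises as coverage falls across many settings, skipped text is more likely to be confusing the model than a single lucky configuration.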

#### Accuracy Fluctuation When Truncating at Different Size@XL Model
<div style="display:inline-block;">
    <img src="imgs/results_truncation_large.png" alt="Performance of Full Text Handling Methods" style="width: 400px;"/>
    <div><text>Figure 8: Truncation Performance@XL</text></div>
    <br/>
</div>

Comment: Accuracy is higher at a small truncation size (768) and at a large truncation size (2000@XL, 4000@large), so the relationship between truncation size and accuracy is not monotonic. Also, when the truncation size exceeds 4000, which is effectively the base case with no truncation, the accuracy is still higher than the base (@XL). Inspecting the tokenizer output shows that even when the truncation size exceeds the text size, the encoded-then-decoded text is still slightly shorter than the original. This might contribute to the performance difference and is likely related to how the Hugging Face Transformers encoder/decoder works internally.
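To relate a truncation size to the report sections it actually keeps, one can compute the surviving fraction of each section. The sketch below is a rough approximation under stated assumptions: it counts whitespace-separated words rather than the model's subword tokens, the function name `kept_fraction_per_section` and the toy report are hypothetical, and the section order is the fixed concatenation order described in the Truncation Method observation.

```python
SECTION_ORDER = ("intervention", "eligibility", "adverse events", "results")


def kept_fraction_per_section(sections, max_tokens):
    """Map each section name to the fraction of its tokens kept by truncation.

    `sections` maps section name -> text. Truncation keeps the first
    `max_tokens` tokens of the sections concatenated in SECTION_ORDER.
    Tokens are approximated by whitespace splitting (an assumption).
    """
    remaining = max(max_tokens, 0)
    kept = {}
    for name in SECTION_ORDER:
        length = len(sections.get(name, "").split())
        take = min(length, remaining)
        kept[name] = take / length if length else 0.0
        remaining -= take
    return kept


# Hypothetical toy report with 3, 5, 4 and 4 words per section.
toy_ctr = {
    "intervention": "drug A daily",
    "eligibility": "adults with stage II disease",
    "adverse events": "nausea in two patients",
    "results": "response rate was 40%",
}

kept = kept_fraction_per_section(toy_ctr, max_tokens=6)
print(kept)
```

With a budget of 6 words, the toy report keeps all of "intervention", part of "eligibility", and nothing from the later sections. Applying the same bookkeeping with real token counts would show which sections the model sees at each truncation size in Figure 8.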