From the provided text, it appears that there are two subtasks within the larger task of "Multi-Evidence Natural Language Inference for Clinical Trial Data (NLI4CT)":

**1. Entailment and Evidence Selection Subtask:**

This subtask involves determining the inference relation (entailment vs. contradiction) between Clinical Trial Reports (CTR) and statements.
The statements may make claims about a single CTR or compare two CTRs.
The task is to determine the inference relation between CTR - statement pairs.
The training set includes annotated statements that make claims about the information contained in different sections of the CTR premise.

**2. Intervention Analysis Subtask:**

This subtask focuses on analyzing the robustness, consistency, and faithfulness of Natural Language Inference (NLI) models, particularly in the context of clinical NLI settings.
The goal is to investigate the behavior of NLI models in their representation of semantic phenomena necessary for complex inference in clinical NLI settings.
The analysis includes exploring the ability of clinical NLI models to perform faithful reasoning, making correct predictions for the correct reasons.
Interventions are applied to the test set and development set statements, targeting specific aspects such as numerical reasoning, vocabulary and syntax, semantics, and notes.
The interventions are designed to assess the models' performance under different challenges and to enrich the dataset with a contrast set for evaluation.
The specific type of intervention performed on a statement is not disclosed during test or training time, emphasizing the need for robust and generalizable approaches.

For the sake of this project, we chose to focus ont he former. According to our understanding, 

In [1]:
!pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
import numpy
from sklearn.metrics import f1_score,precision_score,recall_score



##1) Installations and imports


a. Mount drive (if you are running on colab)

In [2]:
'''from google.colab import drive
drive.mount('/content/drive')'''

"from google.colab import drive\ndrive.mount('/content/drive')"

b. Clone or update competition repository
After cloning, under MyDrive, you will see NLI4CT-semeval-2023 folder with the training and dev set, aswell as the full list of CTRs.

In [3]:
%cd /content/drive/MyDrive

import os

PROJECT_DIR = '/content/drive/MyDrive/semeval-2024'
PROJECT_GITHUB_URL = 'https://github.com/ai-systems/Task-2-SemEval-2024.git'

if not os.path.isdir(PROJECT_DIR):
  !git clone {PROJECT_GITHUB_URL}
else:
  %cd {PROJECT_DIR}
  !git pull {PROJECT_GITHUB_URL}

[WinError 3] The system cannot find the path specified: '/content/drive/MyDrive'
C:\Users\Srija Vakiti


Cloning into 'Task-2-SemEval-2024'...


##2) Dataset

In [4]:
# Training data
#!unzip /content/drive/MyDrive/Task-2-SemEval-2024/training_data.zip

In [9]:
# Dev set
import json

dev_path = r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\dev.json"
with open(dev_path) as json_file:
    dev = json.load(json_file)

# Example instance
print(dev[list(dev.keys())[1]])

{'Type': 'Comparison', 'Section_id': 'Eligibility', 'Primary_id': 'NCT00425854', 'Secondary_id': 'NCT01224678', 'Statement': 'Patients with significantly elevated ejection fraction are excluded from the primary trial, but can still be eligible for the secondary trial if they are 55 years of age or over', 'Label': 'Contradiction'}


In [10]:
uuid_list = list(dev.keys())
statements = []
gold_dev_primary_evidence = []
gold_dev_secondary_evidence = []
for i in range(len(uuid_list)):
  #Retrieve all statements from the development set
  statements.append(dev[uuid_list[i]]["Statement"])

##3) TF-IDF Entailment prediction baseline

In [13]:
import os
import json
import numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

Results = {}

for i in range(len(uuid_list)):
    primary_ctr_path = os.path.join(r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\CT json", dev[uuid_list[i]]["Primary_id"] + ".json")
    with open(primary_ctr_path) as json_file:
        primary_ctr = json.load(json_file)

    # Retrieve the full section from the primary trial
    primary_section = primary_ctr[dev[uuid_list[i]]["Section_id"]]

    # Convert a primary section entries to a matrix of TF-IDF features.
    vectorizer = TfidfVectorizer().fit(primary_section)
    X_s = vectorizer.transform([statements[i]])
    X_p = vectorizer.transform(primary_section)

    # Compute the cosine similarity between the primary section entries and the statement
    primary_scores = cosine_distances(X_s, X_p)

    # Repeat for the secondary trial
    if dev[uuid_list[i]]["Type"] == "Comparison":
        secondary_ctr_path = os.path.join(r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\CT json", dev[uuid_list[i]]["Secondary_id"] + ".json")
        with open(secondary_ctr_path) as json_file:
            secondary_ctr = json.load(json_file)
        secondary_section = secondary_ctr[dev[uuid_list[i]]["Section_id"]]
        vectorizer = TfidfVectorizer().fit(secondary_section)
        X_s = vectorizer.transform([statements[i]])
        X_p = vectorizer.transform(secondary_section)
        secondary_scores = cosine_distances(X_s, X_p)

        # Combine and average the cosine distances of all entries from the relevant section of the primary and secondary trial
        combined_scores = []
        combined_scores.extend(secondary_scores[0])
        combined_scores.extend(primary_scores[0])
        score = numpy.average(combined_scores)

        # If the cosine distance is greater than 0.9 the prediction is contradiction
        if score > 0.9:
            Prediction = "Contradiction"
        else:
            Prediction = "Entailment"
        Results[str(uuid_list[i])] = {"Prediction": Prediction}
    else:
        # If the cosine distance is greater than 0.9 the prediction is contradiction
        score = numpy.average(primary_scores)
        if score > 0.9:
            Prediction = "Contradiction"
        else:
            Prediction = "Entailment"
        Results[str(uuid_list[i])] = {"Prediction": Prediction}


## Save the results in the submission format.

In [14]:
print(Results)
with open(r"C:\Users\asuri\OneDrive\Desktop\results.json",'w') as jsonFile:
    jsonFile.write(json.dumps(Results,indent=4))

{'1adc970c-d433-44d0-aa09-d3834986f7a2': {'Prediction': 'Contradiction'}, '6b9162d0-0816-46d4-81af-c60028dcc63b': {'Prediction': 'Entailment'}, '0b6cc8e3-69ee-4a91-b93d-2ad3fddce65f': {'Prediction': 'Contradiction'}, 'cc1f712a-2116-4e40-9810-f315e3fa5ff8': {'Prediction': 'Entailment'}, '904061c0-14fa-4f13-9118-9a41e24fa8eb': {'Prediction': 'Entailment'}, '43ee7645-ce1e-42d5-9a74-3e379f6f367b': {'Prediction': 'Contradiction'}, '0cef8c8e-7986-46c7-a597-c5733a9899c0': {'Prediction': 'Contradiction'}, '43ce26e5-03fa-4e9d-b0eb-6ea356295753': {'Prediction': 'Contradiction'}, '3facad41-0221-42f8-834d-470e65c4aad5': {'Prediction': 'Entailment'}, '9cbc00e9-3a2d-4471-a93e-72c95132fb6a': {'Prediction': 'Entailment'}, '8b91cab9-d858-45f3-bf8d-3d6fc55b4818': {'Prediction': 'Entailment'}, '4a75574c-fa86-4e62-a210-81c7b98a3807': {'Prediction': 'Contradiction'}, 'd0b50aeb-aad8-4a8d-aae6-5c58a7d382c7': {'Prediction': 'Entailment'}, 'b0b61978-57db-4a1c-812c-509e8b05f2dc': {'Prediction': 'Contradiction'}

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\asuri\\OneDrive\\Desktop\\results.json'

##4) Evaluation

Compute F1 score, Precision, and Recall. Note that in the final evaluation systems will be ranked by Faithfulness and Consistency, which cannot be computed on the training and development set.

In [15]:
def main():

    gold = dev
    results = Results
    uuid_list = list(results.keys())

    results_pred = []
    gold_labels = []
    for i in range(len(uuid_list)):
        if results[uuid_list[i]]["Prediction"] == "Entailment":
            results_pred.append(1)
        else:
            results_pred.append(0)
        if gold[uuid_list[i]]["Label"] == "Entailment":
            gold_labels.append(1)
        else:
            gold_labels.append(0)

    f_score = f1_score(gold_labels,results_pred)
    p_score = precision_score(gold_labels,results_pred)
    r_score = recall_score(gold_labels,results_pred)

    print('F1:{:f}'.format(f_score))
    print('precision_score:{:f}'.format(p_score))
    print('recall_score:{:f}'.format(r_score))

if '__main__' == __name__:
    main()

F1:0.502415
precision_score:0.485981
recall_score:0.520000


## Our approach

The submitted systems for this task use various techniques and models, including:

1. Generative LLMs: 8 submissions
2. Discriminative LLMs: 16 submissions
3. Ontology-based: 1 submission
4. Semantic rule-based: 1 submission
5. Biomedical Pre-training: 12 submissions

Discriminative transformers, especially those fine-tuned for specific tasks, have shown strong performance in various natural language processing (NLP) tasks. In addition to this, pre-training on large biomedical datasets can help models capture domain-specific features, which is crucial when dealing with clinical trial reports. 


There are 3 main tasks involved in this:


**A. Choose a Discriminative Transformer Model:** Select a pre-trained transformer model suitable for your task. Models like BERT, RoBERTa, or BioBERT are good choices. We can leverage models from the Hugging Face Model Hub.

**B. Biomedical Pre-training:** Fine-tune our chosen discriminative transformer on a large biomedical dataset. We can use publicly available biomedical corpora or create your own dataset by scraping relevant biomedical texts.

**C. Hybrid Model Architecture:** Build a hybrid model architecture that combines the discriminative transformer with the biomedical pre-trained features. You may append additional layers to the pre-trained model to adapt it to your specific task.


In [16]:
# Install necessary libraries
#!pip install transformers scikit-learn
!pip install torch torchvision torchaudio

import os
import json
import numpy as np
import torch
from sklearn.metrics import f1_score, precision_score, recall_score
from transformers import BertTokenizer, BertForSequenceClassification




In [22]:
# Load the training set
train_path = r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\train.json"
with open(train_path) as json_file:
    train_data = json.load(json_file)

# Load the development set
dev_path = r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\dev.json"
with open(dev_path) as json_file1:
    dev_data = json.load(json_file1)

In [23]:
train_data

{'5bc844fc-e852-4270-bfaf-36ea9eface3d': {'Type': 'Comparison',
  'Section_id': 'Intervention',
  'Primary_id': 'NCT01928186',
  'Secondary_id': 'NCT00684983',
  'Statement': 'All the primary trial participants do not receive any oral capecitabine, oral lapatinib ditosylate or cixutumumab IV, in conrast all the secondary trial subjects receive these.',
  'Label': 'Contradiction'},
 '86b7cb3d-6186-4a04-9aa6-b174ab764eed': {'Type': 'Single',
  'Section_id': 'Eligibility',
  'Primary_id': 'NCT00662129',
  'Statement': 'Patients with Platelet count over 100,000/mm¬¨‚â•, ANC <  1,700/mm¬¨‚â• and Hemoglobin between 4 to 5 grams per deciliter are eligible for the primary trial.',
  'Label': 'Contradiction'},
 'dbed5471-c2fc-45b5-b26f-430c9fa37a37': {'Type': 'Comparison',
  'Section_id': 'Adverse Events',
  'Primary_id': 'NCT00093145',
  'Secondary_id': 'NCT00703326',
  'Statement': 'Heart-related adverse events were recorded in both the primary trial and the secondary trial.',
  'Label': 'Ent

In [24]:
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
import json

# Replace this path with your actual path
train_data_path = r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\train.json"

with open(train_data_path) as json_file:
    train_data = json.load(json_file)


In [28]:
import json

# Replace this path with your actual path
dev_data_path = r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\dev.json"

with open(dev_data_path) as json_file1:
    dev_data = json.load(json_file1)


In [29]:
# Assuming you already have the CustomDataset class defined
train_dataset = CustomDataset(train_data)

NameError: name 'CustomDataset' is not defined

In [19]:
# Custom Dataset class for PyTorch DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        try:
            instance = self.data[idx]
        except IndexError:
            print(f"IndexError: Index {idx} out of range for dataset length {len(self.data)}")
        raise

        statement = instance["Statement"]
        label = 1 if instance["Label"] == "Entailment" else 0

        # Tokenize and encode text
        encoding = tokenizer(statement, truncation=True, padding="max_length", max_length=128, return_tensors="pt")
        input_ids = encoding["input_ids"].squeeze()
        attention_mask = encoding["attention_mask"].squeeze()

        return {"input_ids": input_ids, "attention_mask": attention_mask, "label": label}


NameError: name 'Dataset' is not defined

In [20]:
from torch.utils.data import DataLoader
from transformers import AdamW

# Create DataLoader for training and dev sets
train_dataset = CustomDataset(train_data)
dev_dataset = CustomDataset(dev_data)

train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
dev_dataloader = DataLoader(dev_dataset, batch_size=8, shuffle=False)



NameError: name 'CustomDataset' is not defined

In [21]:
# Set up training parameters
optimizer = AdamW(model.parameters(), lr=2e-5)
num_epochs = 3

# Training loop
for epoch in range(num_epochs):
    model.train()

    for batch in train_dataloader:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Evaluation on dev set
model.eval()
all_predictions = []
all_labels = []

with torch.no_grad():
    for batch in dev_dataloader:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        probabilities = torch.nn.functional.softmax(logits, dim=1)
        predictions = torch.argmax(probabilities, dim=1).tolist()

        all_predictions.extend(predictions)
        all_labels.extend(labels.tolist())

# Calculate evaluation metrics
f_score = f1_score(all_labels, all_predictions)
p_score = precision_score(all_labels, all_predictions)
r_score = recall_score(all_labels, all_predictions)

print('F1: {:.6f}'.format(f_score))
print('Precision: {:.6f}'.format(p_score))
print('Recall: {:.6f}'.format(r_score))

NameError: name 'model' is not defined

In [30]:
# Install necessary libraries
!pip install transformers scikit-learn
from transformers import AutoTokenizer, AutoModelForSequenceClassification

import os
import json
import numpy as np
import torch
from sklearn.metrics import f1_score, precision_score, recall_score
from transformers import BertTokenizer, BertForSequenceClassification

# Load the development set
dev_path = r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\dev.json"
with open(dev_path) as json_file:
    dev = json.load(json_file)

# Load Biomedical Pre-trained Discriminative Transformer (BERT)
tokenizer = AutoTokenizer.from_pretrained("roberta-base", padding="max_length", truncation=True)
model = BertForSequenceClassification.from_pretrained("biomed_roberta_base")

# Results dictionary
results = {}

# Helper function to encode text using BERT tokenizer
def encode_text(text, max_length=128):
    encoding = tokenizer(text, truncation=True, padding="max_length", max_length=max_length, return_tensors="pt")
    return encoding

# Iterate over instances in the development set
for uuid, instance in dev.items():
    primary_ctr_path = os.path.join(r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\CT json", instance["Primary_id"] + ".json")
    
    with open(primary_ctr_path) as json_file:
        primary_ctr = json.load(json_file)
    
    primary_section = primary_ctr[instance["Section_id"]]
    statement = instance["Statement"]

    # Encode statement and primary section using BERT tokenizer
    encoding_s = encode_text(statement)
    encoding_p = encode_text(" ".join(primary_section))

    # Make predictions using the pre-trained model
    with torch.no_grad():
        outputs = model(**encoding_s)
        logits_s = outputs.logits

        outputs = model(**encoding_p)
        logits_p = outputs.logits

    # Combine logits or probabilities as needed for your specific task
    combined_logits = (logits_s + logits_p) / 2
    probabilities = torch.nn.functional.softmax(combined_logits, dim=1)
    prediction = torch.argmax(probabilities, dim=1).item()

    # Map prediction to "Entailment" or "Contradiction"
    prediction_label = "Contradiction" if prediction == 0 else "Entailment"

    # Store results in the dictionary
    results[uuid] = {"Prediction": prediction_label}

# Save results to a JSON file
results_path = r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\results.json"
with open(results_path, 'w') as json_file:
    json_file.write(json.dumps(results, indent=4))

# Calculate evaluation metrics
gold_labels = [1 if instance["Label"] == "Entailment" else 0 for instance in dev.values()]
results_pred = [1 if result["Prediction"] == "Entailment" else 0 for result in results.values()]

f_score = f1_score(gold_labels, results_pred)
p_score = precision_score(gold_labels, results_pred)
r_score = recall_score(gold_labels, results_pred)

print('F1: {:.6f}'.format(f_score))
print('Precision: {:.6f}'.format(p_score))
print('Recall: {:.6f}'.format(r_score))




OSError: biomed_roberta_base is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

In [None]:
pip install gensim


In [32]:
import json
import pandas as pd
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
w
# Load training set
with open(r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\train.json") as file:
    train_data = json.load(file)

# Load development set
with open(r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\dev.json") as file:
    dev_data = json.load(file)

# Convert the data into a DataFrame
train_df = pd.DataFrame(train_data).T  # Transpose to have columns as features
dev_df = pd.DataFrame(dev_data).T

# Split the data into training and testing sets
train_data, test_data = train_test_split(train_df, test_size=0.2, random_state=42)

# Word Embeddings using Word2Vec
word2vec_model = Word2Vec(sentences=train_data['Statement'].apply(str.split), vector_size=100, window=5, min_count=1, workers=4)

# Function to convert statements to average word embeddings
def average_word_embeddings(statement, model):
    words = str(statement).split()
    vectors = [model.wv[word] for word in words if word in model.wv]
    return sum(vectors) / len(vectors) if vectors else [0] * model.vector_size

# Apply the function to get embeddings for each statement
train_embeddings = train_data['Statement'].apply(lambda x: average_word_embeddings(x, word2vec_model)).tolist()
test_embeddings = test_data['Statement'].apply(lambda x: average_word_embeddings(x, word2vec_model)).tolist()

# Define labels
y_train = train_data['Label']
y_test = test_data['Label']

# Train a logistic regression model
model = LogisticRegression(random_state=42)
model.fit(train_embeddings, y_train)

# Make predictions on the test set
predictions = model.predict(test_embeddings)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)

print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(report)


Accuracy: 0.4824
Classification Report:
               precision    recall  f1-score   support

Contradiction       0.48      0.26      0.33       172
   Entailment       0.48      0.71      0.58       168

     accuracy                           0.48       340
    macro avg       0.48      0.49      0.46       340
 weighted avg       0.48      0.48      0.45       340



In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use TF-IDF vectorizer
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_data['Statement'])
X_test_tfidf = vectorizer.transform(test_data['Statement'])

# Train a logistic regression model
model_tfidf = LogisticRegression(random_state=42)
model_tfidf.fit(X_train_tfidf, y_train)

# Make predictions on the test set
predictions_tfidf = model_tfidf.predict(X_test_tfidf)

# Evaluate the model
accuracy_tfidf = accuracy_score(y_test, predictions_tfidf)
report_tfidf = classification_report(y_test, predictions_tfidf)

print(f"Accuracy with TF-IDF: {accuracy_tfidf:.4f}")
print("Classification Report with TF-IDF:")
print(report_tfidf)


Accuracy with TF-IDF: 0.3588
Classification Report with TF-IDF:
               precision    recall  f1-score   support

Contradiction       0.34      0.29      0.31       172
   Entailment       0.37      0.43      0.40       168

     accuracy                           0.36       340
    macro avg       0.36      0.36      0.36       340
 weighted avg       0.36      0.36      0.36       340



In [51]:
import os
import json
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

Results = {}

for i in range(len(uuid_list)):
    primary_ctr_path = os.path.join(r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\CT json", dev[uuid_list[i]]["Primary_id"] + ".json")
    with open(primary_ctr_path) as json_file:
        primary_ctr = json.load(json_file)

    # Retrieve the full section from the primary trial
    primary_section = primary_ctr[dev[uuid_list[i]]["Section_id"]]

    # Convert a primary section entries to a matrix of Count Vectorizer features.
    vectorizer = CountVectorizer().fit(primary_section)
    X_s = vectorizer.transform([statements[i]])
    X_p = vectorizer.transform(primary_section)

    # Compute the cosine similarity between the primary section entries and the statement
    primary_scores = cosine_distances(X_s, X_p)

    # Repeat for the secondary trial
    if dev[uuid_list[i]]["Type"] == "Comparison":
        secondary_ctr_path = os.path.join(r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\CT json", dev[uuid_list[i]]["Secondary_id"] + ".json")
        with open(secondary_ctr_path) as json_file:
            secondary_ctr = json.load(json_file)
        secondary_section = secondary_ctr[dev[uuid_list[i]]["Section_id"]]
        vectorizer = CountVectorizer().fit(secondary_section)
        X_s = vectorizer.transform([statements[i]])
        X_p = vectorizer.transform(secondary_section)
        secondary_scores = cosine_distances(X_s, X_p)

        # Combine and average the cosine distances of all entries from the relevant section of the primary and secondary trial
        combined_scores = []
        combined_scores.extend(secondary_scores[0])
        combined_scores.extend(primary_scores[0])
        score = numpy.average(combined_scores)

        # If the cosine distance is greater than 0.9 the prediction is contradiction
        if score > 0.9:
            Prediction = "Contradiction"
        else:
            Prediction = "Entailment"
        Results[str(uuid_list[i])] = {"Prediction": Prediction}
    else:
        # If the cosine distance is greater than 0.9 the prediction is contradiction
        score = numpy.average(primary_scores)
        if score > 0.9:
            Prediction = "Contradiction"
        else:
            Prediction = "Entailment"
        Results[str(uuid_list[i])] = {"Prediction": Prediction}


In [52]:
for uuid, result in Results.items():
    print(f"UUID: {uuid}, Prediction: {result['Prediction']}")


UUID: 1adc970c-d433-44d0-aa09-d3834986f7a2, Prediction: Contradiction
UUID: 6b9162d0-0816-46d4-81af-c60028dcc63b, Prediction: Entailment
UUID: 0b6cc8e3-69ee-4a91-b93d-2ad3fddce65f, Prediction: Contradiction
UUID: cc1f712a-2116-4e40-9810-f315e3fa5ff8, Prediction: Entailment
UUID: 904061c0-14fa-4f13-9118-9a41e24fa8eb, Prediction: Entailment
UUID: 43ee7645-ce1e-42d5-9a74-3e379f6f367b, Prediction: Contradiction
UUID: 0cef8c8e-7986-46c7-a597-c5733a9899c0, Prediction: Contradiction
UUID: 43ce26e5-03fa-4e9d-b0eb-6ea356295753, Prediction: Entailment
UUID: 3facad41-0221-42f8-834d-470e65c4aad5, Prediction: Entailment
UUID: 9cbc00e9-3a2d-4471-a93e-72c95132fb6a, Prediction: Entailment
UUID: 8b91cab9-d858-45f3-bf8d-3d6fc55b4818, Prediction: Entailment
UUID: 4a75574c-fa86-4e62-a210-81c7b98a3807, Prediction: Contradiction
UUID: d0b50aeb-aad8-4a8d-aae6-5c58a7d382c7, Prediction: Entailment
UUID: b0b61978-57db-4a1c-812c-509e8b05f2dc, Prediction: Contradiction
UUID: 24b85b44-b9e6-4c28-b3aa-1bd97102b7f1, 

In [53]:
dev

{'1adc970c-d433-44d0-aa09-d3834986f7a2': {'Type': 'Single',
  'Section_id': 'Results',
  'Primary_id': 'NCT00066573',
  'Statement': 'there is a 13.2% difference between the results from the two the primary trial cohorts',
  'Label': 'Contradiction'},
 '6b9162d0-0816-46d4-81af-c60028dcc63b': {'Type': 'Comparison',
  'Section_id': 'Eligibility',
  'Primary_id': 'NCT00425854',
  'Secondary_id': 'NCT01224678',
  'Statement': 'Patients with significantly elevated ejection fraction are excluded from the primary trial, but can still be eligible for the secondary trial if they are 55 years of age or over',
  'Label': 'Contradiction'},
 '0b6cc8e3-69ee-4a91-b93d-2ad3fddce65f': {'Type': 'Comparison',
  'Section_id': 'Adverse Events',
  'Primary_id': 'NCT02273973',
  'Secondary_id': 'NCT00281697',
  'Statement': 'a significant number of the participants in the secondary trial and the primary trial suffered from Enterocolitis',
  'Label': 'Contradiction'},
 'cc1f712a-2116-4e40-9810-f315e3fa5ff8': 

In [55]:
import os
import json
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

Results = {}
actual_labels = []  # Assuming you have a list of actual labels

for i in range(len(uuid_list)):
    # If the cosine distance is greater than 0.9 the prediction is contradiction
    if score > 0.9:
        Prediction = "Contradiction"
    else:
        Prediction = "Entailment"
    Results[str(uuid_list[i])] = {"Prediction": Prediction}

    # Assuming you have a list of actual labels
    actual_labels.append(dev[uuid_list[i]]["Label"])

# Calculate accuracy
predicted_labels = [result["Prediction"] for result in Results.values()]
accuracy = accuracy_score(actual_labels, predicted_labels)

print(f"Accuracy: {accuracy}")

Accuracy: 0.5


In [49]:
import os
import json
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

Results = {}

for i in range(len(uuid_list)):
    primary_ctr_path = os.path.join(r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\CT json", dev[uuid_list[i]]["Primary_id"] + ".json")
    with open(primary_ctr_path) as json_file:
        primary_ctr = json.load(json_file)

    # Retrieve the full section from the primary trial
    primary_section = primary_ctr[dev[uuid_list[i]]["Section_id"]]

    # Convert a primary section entries to a matrix of Count Vectorizer features.
    vectorizer = CountVectorizer().fit(primary_section)
    X_s = vectorizer.transform([statements[i]])
    X_p = vectorizer.transform(primary_section)

    # Compute the cosine similarity between the primary section entries and the statement
    primary_scores = cosine_distances(X_s, X_p)

    # Repeat for the secondary trial
    if dev[uuid_list[i]]["Type"] == "Comparison":
        secondary_ctr_path = os.path.join(r"C:\Users\Srija Vakiti\Desktop\Task-2-SemEval-2024\CT json", dev[uuid_list[i]]["Secondary_id"] + ".json")
        with open(secondary_ctr_path) as json_file:
            secondary_ctr = json.load(json_file)
        secondary_section = secondary_ctr[dev[uuid_list[i]]["Section_id"]]
        vectorizer = CountVectorizer().fit(secondary_section)
        X_s = vectorizer.transform([statements[i]])
        X_p = vectorizer.transform(secondary_section)
        secondary_scores = cosine_distances(X_s, X_p)

        # Combine and average the cosine distances of all entries from the relevant section of the primary and secondary trial
        combined_scores = []
        combined_scores.extend(secondary_scores[0])
        combined_scores.extend(primary_scores[0])
        score = numpy.average(combined_scores)

        # If the cosine distance is greater than 0.9 the prediction is contradiction
        if score > 0.9:
            Prediction = "Contradiction"
        else:
            Prediction = "Entailment"
        Results[str(uuid_list[i])] = {"Prediction": Prediction}
    else:
        # If the cosine distance is greater than 0.9 the prediction is contradiction
        score = numpy.average(primary_scores)
        if score > 0.5:
            Prediction = "Contradiction"
        else:
            Prediction = "Entailment"
        Results[str(uuid_list[i])] = {"Prediction": Prediction}

# Evaluate results
gold = dev
uuid_list = list(Results.keys())

results_pred = []
gold_labels = []

for i in range(len(uuid_list)):
    if Results[uuid_list[i]]["Prediction"] == "Entailment":
        results_pred.append(1)
    else:
        results_pred.append(0)
    
    if gold[uuid_list[i]]["Label"] == "Entailment":
        gold_labels.append(1)
    else:
        gold_labels.append(0)

# Calculate evaluation metrics
f_score = f1_score(gold_labels, results_pred)
p_score = precision_score(gold_labels, results_pred)
r_score = recall_score(gold_labels, results_pred)
accuracy = accuracy_score(gold_labels, results_pred)

# Print results and evaluation metrics
for uuid, result in Results.items():
    print(f"UUID: {uuid}, Prediction: {result['Prediction']}")

print('\nEvaluation Metrics:')
print('F1 Score: {:f}'.format(f_score))
print('Precision Score: {:f}'.format(p_score))
print('Recall Score: {:f}'.format(r_score))
print('Accuracy: {:f}'.format(accuracy))


UUID: 1adc970c-d433-44d0-aa09-d3834986f7a2, Prediction: Contradiction
UUID: 6b9162d0-0816-46d4-81af-c60028dcc63b, Prediction: Entailment
UUID: 0b6cc8e3-69ee-4a91-b93d-2ad3fddce65f, Prediction: Contradiction
UUID: cc1f712a-2116-4e40-9810-f315e3fa5ff8, Prediction: Contradiction
UUID: 904061c0-14fa-4f13-9118-9a41e24fa8eb, Prediction: Contradiction
UUID: 43ee7645-ce1e-42d5-9a74-3e379f6f367b, Prediction: Contradiction
UUID: 0cef8c8e-7986-46c7-a597-c5733a9899c0, Prediction: Contradiction
UUID: 43ce26e5-03fa-4e9d-b0eb-6ea356295753, Prediction: Contradiction
UUID: 3facad41-0221-42f8-834d-470e65c4aad5, Prediction: Contradiction
UUID: 9cbc00e9-3a2d-4471-a93e-72c95132fb6a, Prediction: Contradiction
UUID: 8b91cab9-d858-45f3-bf8d-3d6fc55b4818, Prediction: Contradiction
UUID: 4a75574c-fa86-4e62-a210-81c7b98a3807, Prediction: Contradiction
UUID: d0b50aeb-aad8-4a8d-aae6-5c58a7d382c7, Prediction: Contradiction
UUID: b0b61978-57db-4a1c-812c-509e8b05f2dc, Prediction: Contradiction
UUID: 24b85b44-b9e6-4c2