# 2024 COMP90042 Project
*Make sure you change the file name with your group id.*

# Readme
*If there is something to be noted for the marker, please mention here.*

*If you are planning to implement a program with Object Oriented Programming style, please put those the bottom of this ipynb file*

# 0.Setting Colab Method for future model developing
Firstly, run the following block to mount the drive to the colab. Then, drag the data folder/**eval.py** to the "Colab Folder Space" to ensure the code runs successfully.

If data folder updated, attempt to forcibly remount, call `drive.mount("/content/drive", force_remount=True)`.


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 1.DataSet Processing

## 1.1 Reading and gathering data

Using `json` package reading and gathering claims and evidences, then print an output.

In [8]:
import json
from collections import Counter
from statistics import mean

with open('data/train-claims.json', 'r') as input_file:
    train_claim_data = json.load(input_file)

# Read in development data (claim)
with open('data/dev-claims.json', 'r') as input_file:
    dev_claim_data = json.load(input_file)

# Read in test data (claim)
with open('data/test-claims-unlabelled.json', 'r') as input_file:
    test_claim_data = json.load(input_file)

# Read in evidence data
with open('data/evidence.json', 'r') as input_file:
    evi_data = json.load(input_file)

#EDA


claim_count = 0
evi_count = 0
claim_length = []
evidence_count = []
evidence_length = []
labels = []

for key,value in train_claim_data.items():
    claim_count+=1
    claim_length.append(len(value["claim_text"]))
    evidence_count.append(len(value["evidences"]))
    evidence_length += [len(evi_data[x]) for x in value["evidences"]]
    labels.append(value["claim_label"])

for key,value in evi_data.items():
    evi_count+=1

print("claim count: ",claim_count)
print("evidence count: ",evi_count)
print("max claim length: ",max(claim_length))
print("min claim length: ",min(claim_length))
print("mean claim length: ",mean(claim_length))
print("max evidence count: ",max(evidence_count))
print("min evidence count: ",min(evidence_count))
print("mean evidence count: ",mean(evidence_count))
print("max evidence length: ",max(evidence_length))
print("min evidence length: ",min(evidence_length))
print("mean evidence length: ",mean(evidence_length))
print(Counter(labels))



inside = 0
outside = 0

train_evi_id = []
for claim_id,claim_value in train_claim_data.items():
    train_evi_id=train_evi_id+claim_value['evidences']

for claim_id,claim_value in dev_claim_data.items():
    test_evi_id=claim_value['evidences']
    for e in test_evi_id:
        if e in train_evi_id:
            inside += 1
        else:
            outside += 1
print("Dev evi inside train evi", inside)
print("Dev evi outside train evi", outside)

full_evidence_id = list(evi_data.keys())
full_evidence_text  = list(evi_data.values())
train_claim_id = list(train_claim_data.keys())
train_claim_text  = [ v["claim_text"] for v in train_claim_data.values()]
print("Train claim count: ",len(train_claim_id))


claim count:  1228
evidence count:  1208827
max claim length:  332
min claim length:  26
mean claim length:  122.95521172638436
max evidence count:  5
min evidence count:  1
mean evidence count:  3.3566775244299674
max evidence length:  1979
min evidence length:  13
mean evidence length:  173.5
Counter({'SUPPORTS': 519, 'NOT_ENOUGH_INFO': 386, 'REFUTES': 199, 'DISPUTED': 124})
Dev evi inside train evi 154
Dev evi outside train evi 0
Train claim count:  1228


## 1.2 Data preprocessing

### Implementing preprocessing fuctions

In [9]:
import nltk
import string
import re
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
stopwords = set(stopwords.words('english'))

def lemmatize(word):
    lemma = lemmatizer.lemmatize(word, 'v')
    return lemma if lemma != word else lemmatizer.lemmatize(word, 'n')

def is_pure_english(text):
    english_letters = set(string.ascii_letters)
    cleaned_text = ''.join(char for char in text if char.isalpha() or char.isspace())
    return all(char in english_letters or char.isspace() for char in cleaned_text)

def remove_non_eng(dictionary):
    eng_data = {}
    for key, value in dictionary.items():
        if is_pure_english(value):
            eng_data[key] = value
    return eng_data

def contains_climate_keywords(text, keywords):
    text = text.lower()
    for keyword in keywords:
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            return True
    return False

def filter_climate_related(dictionary, keywords):
    cs_data = {}
    for key, value in dictionary.items():
        if contains_climate_keywords(value, keywords):
            cs_data[key] = value
    return cs_data

def text_preprocessing(text, remove_stopwords=False):
    words = [lemmatize(w) for w in text.lower().split()]
    if remove_stopwords:
        words = [w for w in words if w not in stopwords]
    return " ".join(words)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ABC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ABC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Implementing **Claim data preprocessing** and **Evidence data preprocessing** functions

In [10]:
# 权威网站 https://www.ucdavis.edu/climate/definitions
climate_keywords = [
    "climate", "environment", "global warming", "greenhouse effect", "carbon", "co2", "carbon dioxide",
    "methane", "renewable energy", "sustainability", "ecology", "biodiversity", "fossil fuels",
    "emissions", "air quality", "ozone", "solar energy", "wind energy", "climate change", "climate crisis",
    "climate adaptation", "climate mitigation", "ocean", "sea levels", "ice melting", "deforestation",
    "reforestation", "pollution"
]

# def filter_evidence_by_train(train_claim_data, evidence_data):

#     # Collect all evidence ids in the training set
#     train_evidence_ids = set()

#     for claim in train_claim_data.values():
#         train_evidence_ids.update(claim['evidences'])

#     # filter evidence data by the evidence ids in the training set
#     filtered_evidence_data = {key: value for key, value in evidence_data.items() if key in train_evidence_ids}

#     return filtered_evidence_data

def preprocess_claim_data(claim_data, existed_evidences_id=None):
    claim_data = remove_non_eng(claim_data)
    claim_data_text = []
    claim_data_id = []
    claim_data_label = []
    claim_evidences = []

    for key in claim_data.keys():
        claim_data[key]["claim_text"] = text_preprocessing(claim_data[key]["claim_text"])
        claim_data_text.append(claim_data[key]["claim_text"])
        claim_data_id.append(key)

        if "claim_label" in claim_data[key]:
            claim_data_label.append(claim_data[key]["claim_label"])
        else:
            claim_data_label.append(None)

        if existed_evidences_id and "evidences" in claim_data[key]:
            claim_evidences.append([existed_evidences_id[i] for i in claim_data[key]["evidences"]])
        else:
            claim_evidences.append([])

    return claim_data_text, claim_data_id, claim_data_label, claim_evidences

# def preprocess_evi_data(evi_data, climate_keywords, train_claim_data):
#     evi_data = remove_non_eng(evi_data)
#     # cs_evi_data = filter_climate_related(evi_data, climate_keywords)

#     # filter evidence data by the evidence ids in the training set
#     # train_evi_data = filter_evidence_by_train(train_claim_data, cs_evi_data)

#     for key in evi_data.keys():
#         evi_data[key] = text_preprocessing(evi_data[key], remove_stopwords=True)

#     cleaned_evidence_text = list(evi_data.values())
#     cleaned_evidence_id = list(evi_data.keys())

#     return cleaned_evidence_text, cleaned_evidence_id

def preprocess_evi_data(evi_data):
    evi_data = remove_non_eng(evi_data)
    # cs_evi_data = filter_climate_related(evi_data, climate_keywords)

    # filter evidence data by the evidence ids in the training set
    # train_evi_data = filter_evidence_by_train(train_claim_data, cs_evi_data)

    for key in evi_data.keys():
        evi_data[key] = text_preprocessing(evi_data[key], remove_stopwords=True)

    cleaned_evidence_text = list(evi_data.values())
    cleaned_evidence_id = list(evi_data.keys())

    return cleaned_evidence_text, cleaned_evidence_id


### Start dataset preprocessing

In [11]:
# Preprocessing the claim data, split the data into text, id, label and evidences
train_claim_text, train_claim_id, train_claim_label, train_claim_evidences = preprocess_claim_data(train_claim_data)

dev_claim_text, dev_claim_id, dev_claim_label, dev_claim_evidences = preprocess_claim_data(dev_claim_data)

test_claim_text, test_claim_id, _, _ = preprocess_claim_data(test_claim_data)

# Preprocessing the evidence data, split the data into text and id
# cleaned_evidence_text, cleaned_evidence_id = preprocess_evi_data(evi_data, climate_keywords, train_claim_data)

cleaned_evidence_text, cleaned_evidence_id = preprocess_evi_data(evi_data)

In [11]:
print("Train claim count: ",len(train_claim_text))
print("Dev claim count: ",len(dev_claim_text))
print("Test claim count: ",len(test_claim_text))
print("Evidence count: ",len(cleaned_evidence_text))

Train claim count:  1228
Dev claim count:  154
Test claim count:  153
Evidence count:  1114577


## 1.3 Development Set Prediction

In this section, we perform the main tasks of the project on the development set:

1. **Evidence Retrieval**: For each claim, find the most relevant evidence from the corpus.
2. **Claim Classification**: Predict the label for each claim based on the retrieved evidence and the claim's similarity to the training claims.

The code uses TF-IDF vectorization and cosine similarity to measure the relevance between claims and evidence, and between development and training claims. The most similar evidence and training claims are used for prediction.

In [21]:
import operator
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import euclidean

# Creating two vectorizer
evidence_tfidf_vectorizer = TfidfVectorizer(max_features=400000, use_idf=True)

# fit the vectorizer on the evidence data
evidence_tfidf_vectorizer.fit(cleaned_evidence_text)

# Transform cleaned_evidence_text
transformed_evidence = evidence_tfidf_vectorizer.transform(cleaned_evidence_text)

# Transform claim data
train_claim_tfidf = evidence_tfidf_vectorizer.transform(train_claim_text)
dev_claim_tfidf = evidence_tfidf_vectorizer.transform(dev_claim_text)
test_claim_tfidf = evidence_tfidf_vectorizer.transform(test_claim_text)

In [22]:
print("Transformed evidence shape: ", transformed_evidence.shape)
print("Transformed train claim shape: ", train_claim_tfidf.shape)
print("Transformed dev claim shape: ", dev_claim_tfidf.shape)

Transformed evidence shape:  (1114577, 400000)
Transformed train claim shape:  (1228, 400000)
Transformed dev claim shape:  (154, 400000)


In [23]:
# Calculate cosine similarity between train claims and evidence
train_similarity = cosine_similarity(train_claim_tfidf, transformed_evidence)

# Calculate cosine similarity between dev claims and evidence
dev_similarity = cosine_similarity(dev_claim_tfidf, transformed_evidence)

# Calculate cosine similarity between test claims and evidence
test_similarity = cosine_similarity(test_claim_tfidf, transformed_evidence)


In [25]:
print("Train similarity shape: ", train_similarity.shape)
print("Dev similarity shape: ", transformed_evidence.shape)
print("Test similarity shape: ", test_similarity)

Train similarity shape:  (1228, 1114577)
Dev similarity shape:  (1114577, 400000)
Test similarity shape:  [[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.02440987 0.        ]
 [0.         0.         0.         ... 0.         0.02700692 0.        ]
 ...
 [0.         0.01452821 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [96]:
# # Calculate Euclidean distance between train claims and evidence
# train_distance = np.zeros((train_claim_tfidf.shape[0], transformed_evidence.shape[0]))
# for i in range(train_claim_tfidf.shape[0]):
#     for j in range(transformed_evidence.shape[0]):
#         train_distance[i, j] = euclidean(train_claim_tfidf[i].toarray().ravel(), transformed_evidence[j].toarray().ravel())

# # Calculate Euclidean distance between dev claims and evidence
# dev_distance = np.zeros((dev_claim_tfidf.shape[0], transformed_evidence.shape[0]))
# for i in range(dev_claim_tfidf.shape[0]):
#     for j in range(transformed_evidence.shape[0]):
#         dev_distance[i, j] = euclidean(dev_claim_tfidf[i].toarray().ravel(), transformed_evidence[j].toarray().ravel())

# # Calculate Euclidean distance between test claims and evidence
# test_distance = np.zeros((test_claim_tfidf.shape[0], transformed_evidence.shape[0]))
# for i in range(test_claim_tfidf.shape[0]):
#     for j in range(transformed_evidence.shape[0]):
#         test_distance[i, j] = euclidean(test_claim_tfidf[i].toarray().ravel(), transformed_evidence[j].toarray().ravel())

KeyboardInterrupt: 

# 2. Model Implementation
(You can add as many code blocks and text blocks as you need. However, YOU SHOULD NOT MODIFY the section title)

In [26]:
def spliting_dataset(similarity, claim_texts, claim_evidences, evidence_texts, top_k=5, neg_ratio=1):

    dataset = []
    labels = []

    # Based on the similarity matrix, find the top k most similar evidence for each claim
    for i in range(similarity.shape[0]):

        claim_text = claim_texts[i]

        # Find the top k most similar evidence
        top_evidences = np.argsort(-similarity[i])[:top_k]

        # Add the top k most similar evidence to the dataset, label as 1
        for evidence_index in top_evidences:
            evidence_text = evidence_texts[evidence_index]
            dataset.append("[cls] " + claim_text + " [sep] " + evidence_text)
            labels.append(1)

        # If the claim has evidences, add the evidence to the dataset, label as 1
        if claim_evidences is not None:
            for evidence_index in claim_evidences[i]:
                evidence_text = evidence_texts[evidence_index]
                dataset.append("[cls] " + claim_text + " [sep] " + evidence_text)
                labels.append(1)

        # Randomly sample negative samples, label as 0
        neg_samples_num = int(neg_ratio * len(top_evidences))

        # Randomly sample negative samples that are not in the top k most similar evidence
        neg_evidences = np.random.choice(
            [j for j in range(similarity.shape[1]) if j not in top_evidences],
            neg_samples_num
        )

        # Add the negative samples to the dataset
        for evidence_index in neg_evidences:
            evidence_text = evidence_texts[evidence_index]
            dataset.append("[cls] " + claim_text + " [sep] " + evidence_text)
            labels.append(0)

    return dataset, labels

In [27]:
train_dataset, train_dataset_labels = spliting_dataset(
    train_similarity, train_claim_text, train_claim_evidences, cleaned_evidence_text, top_k=10, neg_ratio=1.2
)
dev_dataset, dev_dataset_labels = spliting_dataset(
    dev_similarity, dev_claim_text, None, cleaned_evidence_text, top_k=10, neg_ratio=1.2
)
test_dataset, test_dataset_labels = spliting_dataset(
    test_similarity, test_claim_text, None, cleaned_evidence_text, top_k=10, neg_ratio=1.2
)

In [28]:
# Convert the dataset labels to numpy array
train_label_array = np.array(train_dataset_labels)
dev_label_array = np.array(dev_dataset_labels)
test_label_array = np.array(test_dataset_labels)

In [29]:
# need to install
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(oov_token="<UNK>")
tokenizer.fit_on_texts(train_dataset)

In [30]:
vocab_size = len(tokenizer.word_index) + 1  # 0 is padding token

In [31]:
# Convert the text data to sequence
train_sequence = tokenizer.texts_to_sequences(train_dataset)
dev_sequence = tokenizer.texts_to_sequences(dev_dataset)
test_sequence = tokenizer.texts_to_sequences(test_dataset)

In [32]:
longest_train_sequence = 0
for i in train_sequence:
    longest_train_sequence = max(longest_train_sequence, len(i))

longest_dev_sequence = 0
for i in dev_sequence:
    longest_dev_sequence = max(longest_dev_sequence, len(i))


In [33]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padding_length = max(longest_train_sequence, longest_dev_sequence) + 5

padded_train_sequence = pad_sequences(train_sequence, maxlen=padding_length, padding='post')
padded_dev_sequence = pad_sequences(dev_sequence, maxlen=padding_length, padding='post')
padded_test_sequence = pad_sequences(test_sequence, maxlen=padding_length, padding='post')

In [1]:
# from workshop
import tensorflow as tf
from tensorflow.keras.layers import LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.regularizers import l2


embedding_dim = 200
hidden_dim = 400

#model definition
# feedforward network (MLP)
model = Sequential(name="retrieval_cls_lstm")
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=padding_length, embeddings_regularizer=l2(0.02)))

model.add(layers.Dropout(0.5))
# model.add(LSTM(hidden_dim, return_sequences=True, dropout=0.1))
# model.add(LSTM(hidden_dim, dropout=0.1))

model.add(layers.Bidirectional(LSTM(hidden_dim, return_sequences=True, dropout=0.5, kernel_regularizer=l2(0.02), recurrent_regularizer=l2(0.02))))
model.add(layers.GlobalMaxPooling1D())

model.add(layers.Dense(hidden_dim, activation='tanh', kernel_regularizer=l2(0.02), bias_regularizer=l2(0.02)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

#since it's a binary classification problem, we use a binary cross entropy loss here
# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[keras.metrics.Recall()])
# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.compile(loss='binary_crossentropy', optimizer='adam')

decay_steps = 3000
learning_rate = 1e-5
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    learning_rate, decay_steps
)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
model.compile(loss='binary_crossentropy', optimizer=optimizer)
earlystopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, verbose=1, mode='min')
model.summary()

NameError: name 'vocab_size' is not defined

In [50]:
# Train the model

model.fit(padded_train_sequence,train_label_array,epochs=15,validation_data=(padded_dev_sequence, dev_label_array),verbose=True,batch_size=500,callbacks=[earlystopping])

Epoch 1/15

: 

In [82]:
# Save the model
# model.save('retrieval_cls_lstm')

# Load the model
# model = tf.keras.models.load_model('retrieval_cls_lstm')

# 3.Testing and Evaluation
(You can add as many code blocks and text blocks as you need. However, YOU SHOULD NOT MODIFY the section title)

In [47]:
# Start prediction

dev_predictions = model.predict(padded_dev_sequence, batch_size=64)
test_predictions = model.predict(padded_test_sequence, batch_size=64)




In [48]:
print(dev_predictions[:20])
print(test_predictions[:5])

[[0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]]
[[0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]
 [0.45046675]]


In [49]:
def evidences_retrieval(claim_evidence_scores, top_k):

    top_evidence_indices = []

    for scores in claim_evidence_scores:
        sorted_indices = np.argsort(scores)[::-1]
        top_indices = sorted_indices[:top_k]
        top_evidence_indices.append(top_indices)

    return top_evidence_indices


select_evidence_k = 6
dev_top_evidence_indices = evidences_retrieval(dev_predictions, select_evidence_k)
test_top_evidence_indices = evidences_retrieval(test_predictions, select_evidence_k)

In [39]:
# Update the dev JSON file
with open('data/dev-claims.json', 'r') as f:
    dev_claims = json.load(f)

for claim_id, evidence_indices in zip(dev_claim_id, dev_top_evidence_indices):
    top_evidence_ids = [cleaned_evidence_id[idx] for idx in evidence_indices]
    dev_claims[claim_id]['evidences'] = top_evidence_ids

with open('data/dev-claims.json', 'w') as f:
    json.dump(dev_claims, f)


In [40]:
# Update the test JSON file
with open('data/test-claims-unlabelled.json', 'r') as f:
    test_claims = json.load(f)

for claim_id, evidence_indices in zip(test_claim_id, test_top_evidence_indices):
    top_evidence_ids = [cleaned_evidence_id[idx] for idx in evidence_indices]
    test_claims[claim_id]['evidences'] = top_evidence_ids

with open('data/test-claims-unlabelled.json', 'w') as f:
    json.dump(test_claims, f)

In [41]:
# %%cmd
# python eval.py --predictions dev-claims-baseline.json --groundtruth dev-claims.json
# python eval.py --predictions dev_predict.json --groundtruth dev-claims.json


import subprocess

# proc = subprocess.Popen(["python", "eval.py", "--predictions", "data\dev_predict.json", "--groundtruth", "data\dev-claims.json"
# ], stdout=subprocess.PIPE, shell=True)
# (out, err) = proc.communicate()
# print(str(out))

# 高自动化模型/预处理选择，可以自动读取准确度
output = subprocess.check_output("python eval.py --predictions data/dev_predict.json --groundtruth data/dev-claims.json", shell=True)
output_str = output.decode('utf-8')

# Split the output into lines
output_lines = output_str.strip().split('\n')

# Format the output
formatted_lines = []
for line in output_lines:
    metric, value = line.split('=')
    metric = metric.strip()
    value = value.strip()
    formatted_line = f"{metric}: {value}"
    formatted_lines.append(formatted_line)

# Join the formatted lines into a single string
formatted_output = '\n'.join(formatted_lines)
print(formatted_output)

Evidence Retrieval F-score (F): 0.0
Claim Classification Accuracy (A): 0.38961038961038963
Harmonic Mean of F and A: 0.0


## Object Oriented Programming codes here

*You can use multiple code snippets. Just add more if needed*