# Finding complex answers to medical questions


This project focuses on "query-focused summarisation" on medical questions where the goal is, given a medical question and a list of sentences extracted from relevant medical publications, to determine which of these sentences from the list can be used as part of the answer to the question. Assignment 3 is divided into two parts. Part 1 will help you get familar with the data, and Part 2 requires you to implement deep neural networks.

We will use data that has been derived from the **BioASQ challenge** (http://www.bioasq.org/). The BioASQ challenge organises several "shared tasks", including a task on biomedical semantic question answering which we are using here. The data are in the file `bioasq10_labelled.csv`, which is part of the zip file provided. Each row of the file has a question, a sentence text, and a label that indicates whether the sentence text is part of the answer to the question (1) or not (0).

## Data Review

The following code uses pandas to store the file `bioasq10_labelled.csv` 

In [2]:
from google.colab import drive

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import pandas as pd
dataset = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/3420-Assignment3-data/bioasq10b_labelled.csv")
dataset.head()

Unnamed: 0,qid,sentid,question,sentence text,label
0,0,0,Is Hirschsprung disease a mendelian or a multi...,Hirschsprung disease (HSCR) is a multifactoria...,0
1,0,1,Is Hirschsprung disease a mendelian or a multi...,"In this study, we review the identification of...",1
2,0,2,Is Hirschsprung disease a mendelian or a multi...,The majority of the identified genes are relat...,1
3,0,3,Is Hirschsprung disease a mendelian or a multi...,The non-Mendelian inheritance of sporadic non-...,1
4,0,4,Is Hirschsprung disease a mendelian or a multi...,Coding sequence mutations in e.g.,0


The columns of the CSV file are:

* `qid`: an ID for a question. Several rows may have the same question ID, as we can see above.
* `sentid`: an ID for a sentence.
* `question`: The text of the question. In the above example, the first rows all have the same question: "Is Hirschsprung disease a mendelian or a multifactorial disorder?"
* `sentence text`: The text of the sentence.
* `label`: 1 if the sentence is a part of the answer, 0 if the sentence is not part of the answer.

## Task 1: Implementation of a Simple Siamese Neural Network for Text Similarity

**Model Architecture:**
A simple Siamese neural network was implemented using TensorFlow and Keras. The architecture comprises dense layers with ReLU activation functions, complemented by batch normalization and dropout for regularization. The model takes TF-IDF representations of text triplets (anchor, positive, and negative examples) as input. After experimenting with different configurations, the optimal number and size of hidden layers were determined based on performance on the dev_test set.
A custom distance layer was implemented to calculate the squared Euclidean distance between the anchor and positive/negative examples. This layer is crucial for optimizing the triplet loss function, which encourages the model to minimize the distance between related sentence pairs while maximizing the distance between unrelated pairs. <br>

**Data Preparation:**
A function was developed to process the raw CSV data into suitable triplets for training. This function ensures a good balance between positive and negative pairs, typically generating 10-20 negative pairs for every 10 positive pairs per question. <br>

**Model Training:**
The model was trained on the provided training data, using the dev_test set to fine-tune hyperparameters. The triplet loss function was employed to effectively learn distinctions between related and unrelated sentence pairs.  <br>

**Summarizer Implementation:**
The nn_summariser function was created as specified. This function takes a CSV file, a list of question IDs, and a number n as input, and returns the IDs of the n most relevant sentences for each question based on the model's predictions. <br>

**Evaluation:**
The best model was evaluated using the test set. 

In [5]:
pip install keras-tuner

Collecting keras-tuner
  Downloading keras_tuner-1.4.7-py3-none-any.whl (129 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/129.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m129.1/129.1 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
Collecting kt-legacy (from keras-tuner)
  Downloading kt_legacy-1.0.5-py3-none-any.whl (9.6 kB)
Installing collected packages: kt-legacy, keras-tuner
Successfully installed keras-tuner-1.4.7 kt-legacy-1.0.5


In [6]:
import tensorflow as tf
import pandas as pd
import numpy as np
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras import layers
from keras_tuner import HyperModel
from keras_tuner.tuners import RandomSearch
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import random
import os
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt
from transformers import BertTokenizer, TFBertModel

In [7]:
from tensorflow.keras.layers import Layer
from tensorflow.keras import backend as K

In [8]:
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization, Layer
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.preprocessing import normalize


In [9]:
# Function to load the data
def load_data(training_path, dev_test_path, test_path):
    print("Loading data...")
    training_data = pd.read_csv(training_path)
    dev_test_data = pd.read_csv(dev_test_path)
    test_data = pd.read_csv(test_path)
    print("Data loaded successfully.")
    return training_data, dev_test_data, test_data

In [10]:
# Function to prepare triplets
def prepare_triplets(data):
    print("Preparing triplets...")
    triplets = []
    grouped = data.groupby('qid')
    for qid, group in grouped:
        question = group['question'].values[0]
        positives = group[group['label'] == 1]
        negatives = group[group['label'] == 0]
        num_pos = len(positives)
        num_neg = len(negatives)
        num_samples = min(num_pos, num_neg)
        positives = positives.sample(n=num_samples)
        negatives = negatives.sample(n=num_samples)
        for pos, neg in zip(positives.itertuples(), negatives.itertuples()):
            triplets.append((question, pos._4, neg._4))
    print(f"Prepared {len(triplets)} triplets.")
    return triplets

In [11]:
# Function to build the Siamese network
def build_siamese_nn(input_shape, hidden_layer_size):
    print("Building Siamese network...")
    input = Input(shape=(input_shape,))
    x = Dense(hidden_layer_size, activation='relu')(input)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    x = Dense(hidden_layer_size, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    x = Dense(hidden_layer_size, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    model = Model(inputs=input, outputs=x)
    print("Siamese network built successfully.")
    return model


#### **Model Specification**
We employed a simple Siamese neural network architecture comprising dense layers with ReLU activation functions, batch normalization, and dropout for regularization. The model uses a custom distance layer to compute the squared Euclidean distances between the anchor, positive, and negative embeddings, which is crucial for optimizing the triplet loss function. The triplet loss function encourages the model to minimize the distance between related sentence pairs while maximizing the distance between unrelated pairs, thereby effectively learning to distinguish between them.

In [12]:
# Function to make a custom Distance Layer
class DistanceLayer(Layer):
    """Custom layer to calculate squared Euclidean distances."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, inputs):
        anchor, positive, negative = inputs
        positive_dist = tf.reduce_sum(tf.square(anchor - positive), axis=1, keepdims=True)
        negative_dist = tf.reduce_sum(tf.square(anchor - negative), axis=1, keepdims=True)
        return tf.concat([positive_dist, negative_dist], axis=1)

In [13]:
# Custom Triplet Loss Function
def triplet_loss(margin=1.0):
    def loss(y_true, y_pred):
        positive_dist = y_pred[:, 0]
        negative_dist = y_pred[:, 1]
        return tf.maximum(positive_dist - negative_dist + margin, 0)
    return loss

In [14]:
# Prepare the model
def siamese_model(input_shape, hidden_layer_size):
    """Constructs the Siamese neural network model."""
    base_nn = build_siamese_nn(input_shape, hidden_layer_size)

    anchor_input = Input(shape=(input_shape,))
    positive_input = Input(shape=(input_shape,))
    negative_input = Input(shape=(input_shape,))

    anchor_embedding = base_nn(anchor_input)
    positive_embedding = base_nn(positive_input)
    negative_embedding = base_nn(negative_input)

    distances = DistanceLayer()([anchor_embedding, positive_embedding, negative_embedding])

    model = Model(inputs=[anchor_input, positive_input, negative_input], outputs=distances)
    model.compile(optimizer=Adam(learning_rate=0.001), loss=triplet_loss(margin=1.0))
    print("Siamese model compiled successfully.")
    return model

In [15]:
# Function to vectorize text using TFIDF
def vectorize_text(data, tfidf_vectorizer=None):
    """Vectorizes the text data using TF-IDF."""
    if tfidf_vectorizer is None:
        print("Fitting TF-IDF vectorizer...")
        tfidf_vectorizer = TfidfVectorizer(max_features=5000)
        tfidf_vectorizer.fit(data)
        print("TF-IDF vectorizer fitted.")

    return tfidf_vectorizer.transform(data), tfidf_vectorizer

In [16]:
# Train the model
def train_model(model, triplets, tfidf_vectorizer):
    """Trains the Siamese model with the provided triplets."""
    print("Training the model...")
    questions = [triplet[0] for triplet in triplets]
    positives = [triplet[1] for triplet in triplets]
    negatives = [triplet[2] for triplet in triplets]

    anchor_vectors, _ = vectorize_text(questions, tfidf_vectorizer)
    positive_vectors, _ = vectorize_text(positives, tfidf_vectorizer)
    negative_vectors, _ = vectorize_text(negatives, tfidf_vectorizer)

    anchor_vectors = anchor_vectors.toarray()
    positive_vectors = positive_vectors.toarray()
    negative_vectors = negative_vectors.toarray()

    y_dummy = np.zeros(len(triplets))

    early_stopping = EarlyStopping(monitor='loss', patience=3)
    reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=2)

    model.fit(
        [anchor_vectors, positive_vectors, negative_vectors],
        y_dummy,
        epochs=50,
        batch_size=32,
        callbacks=[early_stopping, reduce_lr]
    )
    print("Model training completed.")

In [17]:
# Summarizer function
def nn_summariser(csvfile, questionids, n=1):
    """Return the IDs of the n sentences that have the highest predicted score."""
    print("Summarizing results...")
    data = pd.read_csv(csvfile)
    grouped = data.groupby('qid')

    results = []
    for qid in questionids:
        group = grouped.get_group(qid)
        question = group['question'].values[0]
        sentences = group['sentence text'].values
        sentence_ids = group['sentid'].values

        anchor_vector, _ = vectorize_text([question], tfidf_vectorizer)
        sentence_vectors, _ = vectorize_text(sentences, tfidf_vectorizer)

        anchor_vector = anchor_vector.toarray()
        sentence_vectors = sentence_vectors.toarray()

        scores = model.predict([np.tile(anchor_vector, (len(sentences), 1)), sentence_vectors, sentence_vectors])
        ranked_sentences = np.argsort(scores[:, 0])[:n]

        results.append(sentence_ids[ranked_sentences].tolist())

    print("Summarization completed.")
    return results

In [30]:

# Main code
print("Starting the process...")
training_data, dev_test_data, test_data = load_data('/content/drive/MyDrive/Colab Notebooks/3420-Assignment3-data/training.csv', '/content/drive/MyDrive/Colab Notebooks/3420-Assignment3-data/dev_test.csv','/content/drive/MyDrive/Colab Notebooks/3420-Assignment3-data/test.csv')
training_data['question'] = training_data['question']
training_data['sentence text'] = training_data['sentence text']
dev_test_data['question'] = dev_test_data['question']
dev_test_data['sentence text'] = dev_test_data['sentence text']
test_data['question'] = test_data['question']
test_data['sentence text'] = test_data['sentence text']

training_triplets = prepare_triplets(training_data)
dev_test_triplets = prepare_triplets(dev_test_data)

questions = training_data['question'].unique()
all_sentences = pd.concat([training_data['sentence text'], dev_test_data['sentence text'], test_data['sentence text']])
all_text = np.concatenate([questions, all_sentences])

print("Vectorizing text...")
_, tfidf_vectorizer = vectorize_text(all_text)

input_shape = 5000
hidden_layer_size = 256

model = siamese_model(input_shape, hidden_layer_size)
train_model(model, training_triplets, tfidf_vectorizer)

Starting the process...
Loading data...
Data loaded successfully.
Preparing triplets...
Prepared 8182 triplets.
Preparing triplets...
Prepared 2873 triplets.
Vectorizing text...
Fitting TF-IDF vectorizer...
TF-IDF vectorizer fitted.
Building Siamese network...
Siamese network built successfully.
Siamese model compiled successfully.
Training the model...
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
Model training completed.


In [31]:
# Model Evaluation
def evaluate_model(data, model, tfidf_vectorizer, n=1):
    """Evaluates the model on the test data, returns top n sentences, and calculates precision, recall, and F1 score."""
    print("Evaluating the model...")
    grouped = data.groupby('qid')
    y_true = []
    y_pred = []
    top_sentences = []
    for qid, group in grouped:
        question = group['question'].values[0]
        sentences = group['sentence text'].values
        labels = group['label'].values
        sentence_ids = group['sentid'].values

        anchor_vector, _ = vectorize_text([question], tfidf_vectorizer)
        sentence_vectors, _ = vectorize_text(sentences, tfidf_vectorizer)

        anchor_vector = anchor_vector.toarray()
        sentence_vectors = sentence_vectors.toarray()

        scores = model.predict([np.tile(anchor_vector, (len(sentences), 1)), sentence_vectors, sentence_vectors])
        ranked_sentences = np.argsort(scores[:, 0])[:n]
        top_sentences.extend(sentence_ids[ranked_sentences])

        y_true.extend(labels)
        y_pred.extend([1 if i in ranked_sentences else 0 for i in range(len(labels))])

    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    print("Model evaluation completed.")
    return top_sentences, precision, recall, f1

# Evaluate the model on the test set
top_sentences, test_precision, test_recall, test_f1_score = evaluate_model(test_data, model, tfidf_vectorizer)

print(f"Top sentences: {top_sentences}")
print(f"Precision on test set: {test_precision}")
print(f"Recall on test set: {test_recall}")
print(f"F1 Score on test set: {test_f1_score}")


Evaluating the model...
Model evaluation completed.
Top sentences: [6, 9, 15, 4, 2, 35, 21, 19, 30, 0, 24, 11, 13, 28, 14, 1, 8, 36, 1, 1, 12, 6, 3, 8, 12, 15, 1, 0, 13, 1, 14, 0, 0, 2, 5, 1, 11, 1, 9, 8, 1, 0, 0, 17, 4, 3, 17, 2, 9, 13, 2, 4, 2, 2, 10, 6, 18, 2, 1, 21, 1, 9, 0, 87, 0, 0, 1, 34, 19, 0, 0, 7, 21, 0, 13, 9, 6, 4, 2, 1, 1, 5, 13, 1, 48, 0, 8, 24, 13, 3, 7, 43, 12, 4, 1, 1, 1, 10, 1, 20, 11, 13, 3, 11, 3, 4, 0, 2, 0, 33, 16, 6, 2, 0, 62, 2, 15, 31, 17, 0, 42, 33, 10, 43, 0, 4, 11, 0, 2, 5, 1, 0, 34, 3, 5, 0, 3, 17, 13, 4, 24, 10, 4, 0, 18, 0, 48, 6, 0, 2, 25, 6, 1, 0, 20, 6, 0, 1, 5, 11, 2, 6, 20, 31, 8, 2, 1, 4, 7, 1, 0, 2, 4, 7, 1, 60, 6, 27, 1, 6, 19, 1, 3, 1, 5, 13, 10, 21, 9, 2, 25, 3, 21, 1, 5, 5, 7, 2, 1, 23, 11, 20, 6, 6, 4, 6, 0, 1, 27, 4, 10, 14, 1, 4, 3, 28, 1, 7, 0, 20, 4, 10, 9, 6, 23, 8, 19, 14, 1, 4, 4, 7, 7, 3, 1, 5, 0, 3, 1, 18, 5, 56, 9, 4, 1, 2, 20, 11, 8, 0, 3, 16, 18, 0, 20, 5, 5, 6, 24, 1, 2, 9, 41, 7, 3, 1, 3, 1, 4, 2, 7, 9, 8, 22, 1, 14, 8, 4, 1, 0,

#### **Results**
* Top sentences: [6, 9, 15, 4, 2, 35, 21, 19, 30, 0, 24, 11, 13, 28, 14, 1, 8, ]<br>
* Precision on test set: 0.587 <br>
* Recall on test set: 0.130 <br>
* F1 Score on test set: 0.213 <br>

#### **Takeaways**
* The model correctly identifies relatedness between sentencesroughly around 60% of the time.
*  However, recall value indicates models poor performance in capturing the actual relatedness in sentences. The model can only identify about 13% of actual related sentences in the dataset.
* Similar to recall value, the model performs poorly on F1 scores, indicating the imbalance between precision and recall and a general poor performance of the model.

#### **Recommendations**
* Apply regularization techniques like L2 regularization and increase the dropout rate to prevent overfitting.
* Perform grid search or random search to find optimal values for learning rate, dropout rate, and hidden layer sizes.

## Task 2: Implementation of an Advanced Siamese Neural Network for Text Similarity

**Model Architecture:**
An advanced Siamese neural network was implemented using TensorFlow and Keras. The architecture includes:

* An embedding layer generating 35-dimensional vectors for sentence text
* An LSTM layer to capture sequential dependencies and contextual information
* Three hidden layers with ReLU activation functions
* Dropout layers for regularization <br>

A custom distance layer was implemented to calculate the squared Euclidean distance between the anchor and positive/negative examples. This layer is crucial for optimizing the triplet loss function, which encourages the model to minimize the distance between related sentence pairs while maximizing the distance between unrelated pairs. <br>

**Data Preparation:**
A function was developed to process the raw CSV data into suitable triplets for training, ensuring a good balance between positive and negative pairs. <br>

**Model Training:**
The model was trained on the provided training data. The dev_test set was used to determine the optimal size of the LSTM layer and an appropriate sentence length limit. The triplet loss function was employed to effectively learn distinctions between related and unrelated sentence pairs. <br>

**Summarizer Implementation:**
The nn_summariser function was created as specified, taking a CSV file, a list of question IDs, and a number n as input, and returning the IDs of the n most relevant sentences for each question based on the model's predictions.<br>

**Evaluation:**
The best model was evaluated using the test set. 

In [32]:
# Loading data using a function
def load_data_task2(training_path, dev_test_path, test_path):
    print("Loading data...")
    training_data = pd.read_csv(training_path)
    dev_test_data = pd.read_csv(dev_test_path)
    test_data = pd.read_csv(test_path)
    print("Data loaded successfully.")
    return training_data, dev_test_data, test_data

training_data, dev_test_data, test_data = load_data_task2('/content/drive/MyDrive/Colab Notebooks/3420-Assignment3-data/training.csv', '/content/drive/MyDrive/Colab Notebooks/3420-Assignment3-data/dev_test.csv','/content/drive/MyDrive/Colab Notebooks/3420-Assignment3-data/test.csv')
for dataset in [training_data, dev_test_data, test_data]:
    dataset['question'] = dataset['question']
    dataset['sentence text'] = dataset['sentence text']



Loading data...
Data loaded successfully.


In [33]:
# Preparing triplets
def prepare_triplets_task2(data):
    triplets = []
    grouped = data.groupby('qid')
    for qid, group in grouped:
        question = group['question'].values[0]
        positives = group[group['label'] == 1]
        negatives = group[group['label'] == 0]
        num_samples = min(len(positives), len(negatives))
        positives = positives.sample(n=num_samples)
        negatives = negatives.sample(n=num_samples)
        for pos, neg in zip(positives.itertuples(), negatives.itertuples()):
            triplets.append((question, pos[3], neg[3]))
    print(f"Prepared {len(triplets)} triplets.")
    return triplets

# Prepare triplets for training and dev/test datasets
training_triplets = prepare_triplets_task2(training_data)
dev_test_triplets = prepare_triplets_task2(dev_test_data)



Prepared 8182 triplets.
Prepared 2873 triplets.


#### **Model Specification**
We advanced to a more complex Siamese network architecture, incorporating an embedding layer to convert words into dense vectors and an LSTM layer to capture sequential dependencies and contextual information within the sentences. This model also includes dense and dropout layers to further process the LSTM output and prevent overfitting. Similar to Task 1, we used a custom distance layer to calculate the squared Euclidean distances and employed the triplet loss function to train the model. The decision to use LSTM layers aims to leverage their strength in handling sequential data, making this architecture potentially more powerful for capturing sentence similarities.

In [34]:
#Custom Distance Layers
class DistanceLayerTask2(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, inputs):
        anchor, positive, negative = inputs
        positive_dist = tf.reduce_sum(tf.square(anchor - positive), axis=1, keepdims=True)
        negative_dist = tf.reduce_sum(tf.square(anchor - negative), axis=1, keepdims=True)
        return tf.concat([positive_dist, negative_dist], axis=1)

In [35]:
# Custom loss function
def triplet_loss_task2(margin=1.0):
    def loss(y_true, y_pred):
        positive_dist = y_pred[:, 0]
        negative_dist = y_pred[:, 1]
        return tf.maximum(positive_dist - negative_dist + margin, 0)
    return loss

In [36]:
def build_siamese_nn_task2(input_shape, vocab_size, embedding_dim, lstm_units):
    input = Input(shape=(input_shape,))
    x = Embedding(vocab_size, embedding_dim, input_length=input_shape)(input)
    x = LSTM(lstm_units, return_sequences=False)(x)
    x = Dense(256, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    return Model(inputs=input, outputs=x)

In [37]:
def siamese_model_task2(input_shape, vocab_size, embedding_dim, lstm_units):
    base_nn = build_siamese_nn_task2(input_shape, vocab_size, embedding_dim, lstm_units)

    anchor_input = Input(shape=(input_shape,))
    positive_input = Input(shape=(input_shape,))
    negative_input = Input(shape=(input_shape,))

    anchor_embedding = base_nn(anchor_input)
    positive_embedding = base_nn(positive_input)
    negative_embedding = base_nn(negative_input)

    distances = DistanceLayerTask2()([anchor_embedding, positive_embedding, negative_embedding])

    model = Model(inputs=[anchor_input, positive_input, negative_input], outputs=distances)
    model.compile(optimizer=Adam(learning_rate=0.001), loss=triplet_loss_task2(margin=1.0))
    return model

In [38]:
# Vectorizing function
def vectorize_text_task2(data, tokenizer=None, max_len=100):
    if tokenizer is None:
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(data)

    sequences = tokenizer.texts_to_sequences(data)
    padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')

    return padded_sequences, tokenizer

In [39]:
# defining training function
def train_model_task2(model, triplets, tokenizer, max_len=100):
    questions = [triplet[0] for triplet in triplets]
    positives = [triplet[1] for triplet in triplets]
    negatives = [triplet[2] for triplet in triplets]

    anchor_vectors, _ = vectorize_text_task2(questions, tokenizer, max_len=max_len)
    positive_vectors, _ = vectorize_text_task2(positives, tokenizer, max_len=max_len)
    negative_vectors, _ = vectorize_text_task2(negatives, tokenizer, max_len=max_len)

    y_dummy = np.zeros(len(triplets))  # Dummy variable for loss function

    early_stopping = EarlyStopping(monitor='loss', patience=3)
    reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=2)

    model.fit(
        [anchor_vectors, positive_vectors, negative_vectors],
        y_dummy,
        epochs=50,
        batch_size=32,
        callbacks=[early_stopping, reduce_lr]
    )

In [40]:
 from tensorflow.keras.layers import LSTM, Embedding

In [41]:
# Model training
input_shape = 100
questions = training_data['question'].unique()
all_sentences = pd.concat([training_data['sentence text'], dev_test_data['sentence text'], test_data['sentence text']])
all_text = np.concatenate([questions, all_sentences])

padded_sequences, tokenizer = vectorize_text_task2(all_text, max_len=input_shape)
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 50
lstm_units = 128

model = siamese_model_task2(input_shape, vocab_size, embedding_dim, lstm_units)
train_model_task2(model, training_triplets, tokenizer, max_len=input_shape)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [46]:
# summarising and evaluation function
def nn_summariser_task2(csvfile, questionids, tokenizer, model, max_len=100, n=1):
    data = pd.read_csv(csvfile)
    grouped = data.groupby('qid')

    results = []
    for qid in questionids:
        if qid not in grouped.groups:
            print(f"Warning: qid {qid} not found in the data.")
            continue
        group = grouped.get_group(qid)
        question = group['question'].values[0]
        sentences = group['sentence text'].values
        sentence_ids = group['sentid'].values

        anchor_vector, _ = vectorize_text_task2([question], tokenizer, max_len=max_len)
        sentence_vectors, _ = vectorize_text_task2(sentences, tokenizer, max_len=max_len)

        scores = model.predict([np.tile(anchor_vector, (len(sentences), 1)), sentence_vectors, sentence_vectors])
        ranked_sentences = np.argsort(scores[:, 0])[:n]

        results.append(sentence_ids[ranked_sentences].tolist())

    return results

def evaluate_model_task2(data, model, tokenizer, max_len=100, n=1):
    grouped = data.groupby('qid')
    y_true = []
    y_pred = []
    top_sentences = []
    for qid, group in grouped:
        question = group['question'].values[0]
        sentences = group['sentence text'].values
        labels = group['label'].values
        sentence_ids = group['sentid'].values

        anchor_vector, _ = vectorize_text_task2([question], tokenizer, max_len=max_len)
        sentence_vectors, _ = vectorize_text_task2(sentences, tokenizer, max_len=max_len)

        scores = model.predict([np.tile(anchor_vector, (len(sentences), 1)), sentence_vectors, sentence_vectors])
        ranked_sentences = np.argsort(scores[:, 0])[:n]
        top_sentences.extend(sentence_ids[ranked_sentences])

        y_true.extend(labels)
        y_pred.extend([1 if i in ranked_sentences else 0 for i in range(len(labels))])

    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)


    return top_sentences, precision, recall, f1

# Ensure the following variables are defined properly:
# test_data, tokenizer, model, input_shape

# Evaluate the model on the test set
test_question_ids = test_data['qid'].unique()
top_sentences, test_precision, test_recall, test_f1_score = evaluate_model_task2(test_data, model, tokenizer, max_len=input_shape)

print(f"Top sentences: {top_sentences}")
print(f"Precision on test set: {test_precision}")
print(f"Recall on test set: {test_recall}")
print(f"F1 Score on test set: {test_f1_score}")


Top sentences: [30, 15, 11, 2, 6, 35, 31, 2, 36, 1, 18, 21, 28, 8, 7, 3, 0, 7, 15, 2, 15, 6, 3, 6, 2, 17, 1, 5, 14, 1, 43, 2, 1, 14, 3, 6, 10, 1, 8, 1, 12, 0, 3, 5, 3, 3, 15, 1, 4, 8, 1, 6, 2, 7, 7, 1, 15, 6, 8, 19, 0, 14, 3, 31, 19, 6, 3, 7, 11, 16, 14, 6, 15, 0, 4, 30, 9, 2, 2, 2, 4, 2, 16, 13, 32, 9, 7, 29, 0, 8, 2, 9, 3, 9, 13, 8, 7, 10, 2, 39, 11, 25, 10, 21, 6, 4, 3, 5, 4, 11, 0, 13, 3, 9, 1, 1, 31, 16, 16, 0, 42, 2, 6, 2, 4, 9, 4, 0, 4, 2, 5, 0, 44, 1, 4, 15, 5, 13, 0, 35, 24, 8, 6, 0, 0, 3, 36, 4, 0, 5, 14, 4, 3, 2, 8, 11, 0, 4, 6, 43, 3, 5, 7, 17, 14, 0, 9, 2, 7, 3, 0, 3, 4, 6, 0, 54, 6, 5, 68, 9, 29, 1, 3, 5, 6, 16, 20, 26, 4, 1, 13, 0, 9, 0, 19, 4, 2, 1, 0, 11, 7, 12, 4, 0, 1, 3, 1, 1, 21, 6, 5, 0, 0, 5, 25, 31, 0, 2, 2, 21, 2, 10, 6, 0, 14, 3, 12, 14, 12, 8, 2, 23, 10, 5, 1, 5, 3, 7, 0, 15, 0, 27, 8, 3, 8, 7, 15, 10, 14, 1, 2, 5, 24, 9, 30, 18, 4, 6, 4, 4, 2, 19, 37, 6, 9, 1, 1, 0, 3, 6, 11, 1, 30, 7, 3, 1, 6, 7, 1, 2, 6, 11, 11, 5, 6, 1, 10, 19, 18, 7, 1, 7, 24, 16, 1, 4, 

#### Results
* Precision on test set: 0.45690672963400236 <br>
* Recall on test set: 0.10109717868338558 <br>
* F1 Score on test set: 0.1655614973262032 <br>
* Top Sentences: [[15], [4], [7], [7], [8], [1], [4], [3], [3], [2]


#### **Takeaways**
* Precison value of 0.457 suggests that the model has a moderate ability to correctly identify related sentences among the sentences it predicted as related. However, it is not very high, implying that there are still many false positives.
* The recall of 0.101 indicates that the model correctly identifies only 10% of all actual related sentences in the dataset. .
* The F1 score in turn, also performs poorly at around 0.166

#### **Comparison**
Task 1's simpler model achieved higher precision (0.587 vs. 0.457), recall (0.130 vs. 0.101), and F1 score (0.213 vs. 0.166) compared to Task 2's more complex model. This indicates that the simpler model in Task 1 is better at correctly identifying related sentences and has overall better performance. The more complex model in Task 2 may be overfitting or insufficiently trained, highlighting the need for more data and better regularization techniques.

 #### **Recommendations**
* Conduct extensive hyperparameter tuning to optimize LSTM layer size, dropout rates, and learning rate.
* Apply more aggressive regularization methods like higher dropout rates and weight decay.
Use techniques like early stopping and model checkpointing to enhance training
* Implement k-fold cross-validation to ensure model generalization and robustness.