# 1. Overview

This assignment is built on the skills and techniques introduced in previous assignments, but completion (partially or entirely) of any previous assignment is not required to complete this work.

The previous assignment ended with word similarity and word analogy tests. Here, we take the next step and explore sentence similarity: students will learn how the representations of smaller language units (e.g., words) can be composed to form representations of larger units (e.g., sentences) using deep learning. Specifically, the assignment introduces
1. the **task** of measuring semantic textual similarity (STS),
2. a popular **dataset** of [Sentences Involving Compositional Knowledge (SICK)](https://zenodo.org/records/2787612),
3. **fine-tuning** pretrained embeddings for a specific task, and
4. how regression (or regression-like) tasks use **correlation statistics for evaluation**.
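
To make the last point concrete, here is a minimal sketch of how predicted relatedness scores could be compared against gold scores using Pearson's $r$ and Spearman's $\rho$. The score values are made up for illustration, and the `spearman_rho` helper is a simplified version that assumes there are no tied values.

```python
import numpy as np

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation: covariance normalized by both standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def spearman_rho(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman correlation: Pearson correlation of the ranks (assumes no ties)."""
    def rank(v: np.ndarray) -> np.ndarray:
        return np.argsort(np.argsort(v)).astype(float)
    return pearson_r(rank(x), rank(y))

# Hypothetical gold relatedness scores and model predictions (illustrative values only).
gold = np.array([1.0, 2.5, 3.0, 4.2, 4.9])
pred = np.array([1.4, 2.0, 3.3, 3.9, 4.6])

r = pearson_r(gold, pred)
rho = spearman_rho(gold, pred)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Pearson's $r$ rewards predictions that are linearly related to the gold scores, while Spearman's $\rho$ only looks at whether the ordering of the pairs is preserved. In practice, library routines such as `scipy.stats.pearsonr` and `scipy.stats.spearmanr` handle ties and edge cases.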

# 2. Technical Overview of Model Architecture

Modeling semantic textual similarity is complicated by the ambiguity and variability of linguistic expressions. To tackle this, you will develop and test a model comprising two components:

1. A sentence model for converting a sentence into a representation for similarity measurement. This is a convolutional neural network (CNN) architecture with multiple types of convolution and pooling, designed to capture different granularities of information.
2. A similarity measurement layer using structured similarity measurements, which compare local regions of sentence representations (obtained from the sentence model).

This approach involves two subnetworks, each processing one sentence in parallel. The subnetworks share all their weights and are eventually joined by a similarity measurement layer, which is followed by a fully connected layer that outputs the final similarity score. This kind of architecture is called a *Siamese network* or *twin network* in the NLP research literature.

> **[Schematic diagram of a twin network](https://drive.google.com/file/d/1sqS8n145QCEjxdBo6Ztlrjyf0ahoF8eJ/view?usp=drive_link)**

*Make sure to understand the conceptual layout shown in the above schematic diagram before you proceed.*

# 3. Semantic Textual Similarity (STS): Technical Details and Programming


## 3.1 The SICK Dataset

You are going to use a very well-known corpus called [the SICK (Sentences Involving Compositional Knowledge) dataset](https://zenodo.org/records/2787612). It includes information other than semantic similarity of sentences, but for the purposes of this assignment, you can ignore those additional properties of this corpus.

So, let us first obtain the corpus.

In [None]:
!wget https://zenodo.org/records/2787612/files/SICK.zip

--2024-04-30 00:00:57--  https://zenodo.org/records/2787612/files/SICK.zip
Resolving zenodo.org (zenodo.org)... 188.185.79.172, 188.184.103.159, 188.184.98.238, ...
Connecting to zenodo.org (zenodo.org)|188.185.79.172|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 217584 (212K) [application/octet-stream]
Saving to: ‘SICK.zip’


2024-04-30 00:00:58 (591 KB/s) - ‘SICK.zip’ saved [217584/217584]



This is a `.zip` archive, so we need to extract it.

In [None]:
from zipfile import ZipFile

with ZipFile('SICK.zip', 'r') as z:
    z.extractall('sick_dataset')

You should be able to see the extracted corpus using the `Files` icon on the left sidebar here on Colab. The corpus resides in the `sick_dataset` folder, and contains a `readme.txt` and a `SICK.txt`.

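By default, this should be located in your `/content` folder on Colab. You can (and should) verify this using the `!pwd` and `!ls` commands. Note that `!cd` runs in a temporary subshell, so a directory change made with it does not persist; use `%cd` if you need to change the working directory.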

*Before moving forward in this assignment, check the structure of the data and understand what it provides.*
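
One way to check the structure of a tab-separated file is sketched below. To keep the example runnable anywhere, it reads a hypothetical two-row excerpt from memory; the rows are invented, and only the four column names that the `SickDataset` class reads later in this assignment are reproduced. The `summarize` helper is an illustrative name, not part of the provided code.

```python
import csv
import io

# Hypothetical two-row excerpt in a tab-separated layout; the sentences and
# scores are invented for illustration.
sample_tsv = (
    "pair_ID\tsentence_A\tsentence_B\trelatedness_score\n"
    "1\tA dog is running\tA dog runs\t4.5\n"
    "2\tA man is cooking\tA cat sleeps\t1.2\n"
)

def summarize(handle) -> tuple[list[str], int, float, float]:
    """Return the column names, row count, and min/max relatedness score."""
    reader = csv.DictReader(handle, delimiter="\t")
    scores = [float(row["relatedness_score"]) for row in reader]
    return list(reader.fieldnames), len(scores), min(scores), max(scores)

columns, n_rows, lo, hi = summarize(io.StringIO(sample_tsv))
print(columns, n_rows, lo, hi)

# On Colab, after extraction, the same helper could be applied to the real file:
# with open("sick_dataset/SICK.txt", encoding="utf-8") as f:
#     print(summarize(f))
```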

## 3.2 The `torchtext` package

This package consists of data processing utilities for natural language processing. You are being introduced to this package through this assignment with the expectation that you will find it useful not just in this assignment, but in future work related to NLP. It has the added advantage of being extremely well-integrated with the wider PyTorch project.

In [None]:
!pip install torchtext

Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.2.1->torchtext)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.2.1->torchtext)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.2.1->torchtext)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.2.1->torchtext)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.2.1->torchtext)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.2.1->torchtext)
  Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
Collecting nvidia-curand-cu12==10.3.2.106 

## 3.3 Utility functions and Lexical Similarity

Here, we provide you with a utility function that creates a dictionary and one function that provides a few features based on lexical overlap.

The use of these functions is highly recommended (but not mandatory), as these features are known to improve the performance in semantic similarity detection.

You are free to add other utility functions that compute specific features or create various dictionaries (for maintaining indices, or other mappings required by your implementation).
- **Always provide a docstring with any function you add, and also include type hints so that the data types are obvious to anyone using your code**.

In [None]:
from collections import defaultdict

import nltk

nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

import numpy as np

def pairwise_word2doc_frequency(sentence_list_1: list[list[str]], sentence_list_2: list[list[str]]) -> dict[str, int]:
    """
    Calculate the document frequency of each unique word from two lists of tokenized sentences.

    This function counts how many "documents" (in this context, a pair of sentences) each unique word appears in.
    Each pair of sentences from the two lists is considered as a separate "document". If a word occurs in either
    sentence of the pair, it is counted once for that pair.

    Args:
    sentence_list_1 (list[list[str]]): The first list of sentences, each given as a list of tokens.
    sentence_list_2 (list[list[str]]): The second list of sentences, each given as a list of tokens. It should be of
                                       the same length as sentence_list_1.

    Returns:
    dict[str, int]: A dictionary where keys are the unique words and values are the number of "documents" in which the
                    word appears.

    Raises:
    ValueError: If the input lists have different lengths.
    """
    if len(sentence_list_1) != len(sentence_list_2):
        raise ValueError("Sentence lists have different lengths.")
    word2doc_counts = defaultdict(int)
    for s1, s2 in zip(sentence_list_1, sentence_list_2):
        uniquetokens = set(s1) | set(s2)

        for t in uniquetokens:
            word2doc_counts[t] += 1
            #print(word2doc_counts)
    return word2doc_counts

def pairwise_lexical_overlap_features(sentence_list_1: list[list[str]], sentence_list_2: list[list[str]],
                                      word2doc_counts: dict[str, int]) -> list[list[float]]:
    """
    Calculate various lexical overlap features between two lists of tokenized sentences.

    This function computes four types of lexical overlap features for each pair of sentences:
    1. Basic overlap: The proportion of overlapping tokens in the two sentences.
    2. IDF-weighted overlap: The inverse document frequency (IDF) weighted overlap score.
    3. Content-only overlap: The basic overlap excluding stopwords.
    4. Content-only IDF-weighted overlap: The IDF-weighted overlap score excluding stopwords.

    Args:
        sentence_list_1 (list[list[str]]): The first list of tokenized sentences to be analyzed.
        sentence_list_2 (list[list[str]]): The second list of tokenized sentences to be analyzed.
        word2doc_counts (dict[str, int]): A dictionary mapping tokens to their document frequency across a corpus.

    Returns:
        list[list[float]]: A list of lists, where each sublist contains four float values for each sentence-pair:
                           [overlap, IDF-weighted overlap, content-only overlap, content-only IDF-weighted overlap]

    Raises:
        ValueError: If `sentence_list_1` and `sentence_list_2` have different lengths.
    """
    if len(sentence_list_1) != len(sentence_list_2):
        raise ValueError("Sentence lists have different lengths.")

    stopwords_set = set(stopwords.words('english'))
    num_docs = len(sentence_list_1)
    overlap_features = []
    for s1, s2 in zip(sentence_list_1, sentence_list_2):
        tokens_a_set, tokens_b_set = set(s1), set(s2)
        intersection = tokens_a_set & tokens_b_set
        # Normalizer: combined number of unique tokens in the two sentences.
        total = len(tokens_a_set) + len(tokens_b_set)

        # Keep only shared tokens with a known, non-zero document frequency.
        intersection_with_counts = [t for t in intersection if word2doc_counts.get(t, 0) != 0]
        idf_intersection = sum(np.log(num_docs / word2doc_counts[t]) for t in intersection_with_counts)

        # Repeat the computation on content words only (stopwords removed).
        tokens_a_contentset = set(t for t in s1 if t not in stopwords_set)
        tokens_b_contentset = set(t for t in s2 if t not in stopwords_set)
        intersection_content = tokens_a_contentset & tokens_b_contentset
        total_content = len(tokens_a_contentset) + len(tokens_b_contentset)

        intersection_content_with_counts = [t for t in intersection_content if word2doc_counts.get(t, 0) != 0]
        idf_intersection_content = sum(np.log(num_docs / word2doc_counts[t]) for t in intersection_content_with_counts)

        # Guard every division: a pair with no tokens (or no content tokens) gets 0.0.
        overlap = len(intersection) / total if total > 0 else 0.0
        idf_weighted_overlap = idf_intersection / total if total > 0 else 0.0
        overlap_content = len(intersection_content) / total_content if total_content > 0 else 0.0
        idf_weighted_overlap_content = idf_intersection_content / total_content if total_content > 0 else 0.0

        overlap_features.append([overlap, idf_weighted_overlap, overlap_content, idf_weighted_overlap_content])
    return overlap_features
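
To see what the first two features measure, the following standalone sketch works through the arithmetic by hand on two invented sentence pairs. It mirrors the formulas used in the utility functions (document frequency per pair, overlap normalized by the summed sizes of the two token sets) but does not call them, so that it can be run on its own.

```python
import math

# Two hypothetical tokenized sentence pairs (a "document" is one pair).
pairs = [
    (["a", "dog", "runs"], ["a", "dog", "sleeps"]),
    (["a", "cat", "sleeps"], ["the", "cat", "eats"]),
]

# Document frequency: in how many pairs does each token occur?
doc_freq: dict[str, int] = {}
for s1, s2 in pairs:
    for token in set(s1) | set(s2):
        doc_freq[token] = doc_freq.get(token, 0) + 1

num_docs = len(pairs)
s1, s2 = pairs[0]
shared = set(s1) & set(s2)                  # {"a", "dog"}
denominator = len(set(s1)) + len(set(s2))   # 3 + 3 = 6

basic_overlap = len(shared) / denominator   # 2 / 6
# "a" occurs in both pairs, so its IDF is log(2/2) = 0; "dog" contributes log(2/1).
idf_overlap = sum(math.log(num_docs / doc_freq[t]) for t in shared) / denominator
print(basic_overlap, idf_overlap)
```

Because the denominator adds the sizes of both token sets, two identical sentences receive a basic overlap of 0.5 rather than 1. The IDF weighting reduces the contribution of tokens that appear in many pairs.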

Before going any further, we must fix random seeds for random number generation throughout the remainder of this assignment. This ensures that the experiments are reproducible.

In [None]:
import random
import torch

# Do not change the seed value and any line where this value is used for settings.
# Changing the seed values may prevent your results from being reproduced if needed.

SEED = 1234
DATA_SPLIT_SEED = 99
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if device.type == 'cuda':
    torch.cuda.manual_seed(SEED)
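
The effect of seeding can be illustrated with Python's built-in generator alone. This is a small sketch; the `draw` helper is a hypothetical name used only here.

```python
import random

def draw(seed: int, n: int = 3) -> list[float]:
    """Re-seed Python's generator and draw n pseudo-random numbers."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(1234)
second = draw(1234)   # same seed, so the same sequence
other = draw(99)      # different seed, so a different sequence is expected
print(first == second, first == other)
```

The same principle applies to NumPy and PyTorch, each of which keeps its own generator state. That is why all of them are seeded separately in the cell above.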

## 3.4 The SICK Dataset Object in Python

Here, you are given the class to represent the SICK dataset. Some preprocessing functionality is also included. PyTorch allows for [map-style datasets](https://pytorch.org/docs/stable/data.html#map-style-datasets), which is the approach taken in this assignment.

You are encouraged to add methods and/or enhance the structure of this dataset in any way, as long as the additional code does not use a prohibited library or package. But please remember that the primary objective of this assignment is to understand convolutional neural networks (CNNs) and semantic representation of sentences. Your enhancements should be guided by those goals. Otherwise, you run the risk of overinvesting in this portion of the assignment for diminished returns!

> A standard recommendation from the teaching staff is that once you understand the `SickDataset` class in its given form, you should move on to the CNN and its filters (next section). There, you may think of enhancing the `SickDataset` class in certain ways so that the class's structure integrates smoothly with how you want to use the CNN architecture.


In [None]:
import os
import torch

import pandas as pd

from torch.utils.data import Dataset, DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from sklearn.model_selection import train_test_split

class SickDataset(Dataset):
    """
    A PyTorch Dataset class for loading and processing the SICK dataset.

    This class handles loading the SICK dataset from a specified path and provides methods for processing the data,
    splitting it into training, development, and test sets, and accessing individual instances.

    Attributes:
        path (str): The path to the directory containing the SICK dataset files.
        tokenizer (callable): The tokenizer function used to process text data.
        fields (list): A list specifying how each column in the dataset should be processed.
        instances (list): A list containing dictionaries, each representing an instance in the dataset.
                          Each dictionary contains the processed data for a single instance.

    Methods:
        __len__(): Returns the total number of instances in the dataset.
        __getitem__(idx): Returns the instance at the specified index.
        splits(): Splits the dataset into training, development, and test sets.
    """

    def __init__(self, path):
        self.path = path
        self.tokenizer = get_tokenizer("basic_english")
        self.fields = [('id', None), ('sentence_1', self.tokenizer), ('sentence_2', self.tokenizer),
                       ('external_features', None), ('label', float)]
        self.instances = self._process_data()

    def __len__(self):
        return len(self.instances)

    def __getitem__(self, idx):
        return self.instances[idx]

    def __create_instance(self, row):
        return {
            'id': row['pair_ID'],
            'sentence_1': row['sentence_A'],
            'sentence_2': row['sentence_B'],
            'external_features': row['overlap_features'],
            'label': float(row['relatedness_score'])
        }

    def _process_data(self):
        corpus_df = pd.read_csv(os.path.join(self.path, 'SICK.txt'), sep='\t')
        remove_trailing_space = lambda s : self.tokenizer(s.rstrip())
        sentence_1_list = corpus_df['sentence_A'].apply(remove_trailing_space).tolist()
        sentence_2_list = corpus_df['sentence_B'].apply(remove_trailing_space).tolist()

        self.word2doc_counts = pairwise_word2doc_frequency(sentence_1_list, sentence_2_list)
        corpus_df['overlap_features'] = pairwise_lexical_overlap_features(sentence_1_list, sentence_2_list,
                                                                          self.word2doc_counts)
        instances = corpus_df.apply(self.__create_instance, axis=1).tolist()

        return instances

    def splits(self) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        """
        Split the dataset into training (70%), development (10%), and test (20%) sets.

        Returns:
            tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: A tuple containing three DataFrames, in the order train_df,
                                                             dev_df, and test_df.
        """
        entire_df = pd.DataFrame(self.instances)

        train_df, other = train_test_split(entire_df, test_size=0.3, random_state=DATA_SPLIT_SEED)
        dev_df, test_df = train_test_split(other, test_size=2/3, random_state=DATA_SPLIT_SEED)

        return train_df, dev_df, test_df
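
The two-step split in `splits()` may look odd at first, so here is the arithmetic behind it as a small sketch using exact fractions: holding out 30% and then sending two thirds of that holdout to the test set yields the documented 70/10/20 proportions.

```python
from fractions import Fraction

holdout = Fraction(3, 10)                # first split: test_size=0.3
test_share_of_holdout = Fraction(2, 3)   # second split: test_size=2/3

train = 1 - holdout                      # 7/10 of the data
test = holdout * test_share_of_holdout   # 3/10 * 2/3 = 1/5
dev = holdout - test                     # 3/10 - 1/5 = 1/10
print(train, dev, test)
```

The actual set sizes are rounded to whole instances, so they may differ slightly from these exact proportions.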

## 3.5 Model Description

A **convolutional neural network (CNN)** is a *regularized* feed-forward neural network that learns useful features automatically through the use of **filters** (a process that removes some unwanted components or features from an input).
- A **convolution** is the process of applying a filter to text data to extract features from it. The original idea comes from image processing, where convolutions are applied to 2D grids. In NLP, we apply a filter to a sequence of words or characters.

You will use a CNN to model each sentence, by using two types of convolution filters. The intention is to model two different perspectives of the semantics of a sentence. You will also explore multiple types of pooling.

We view the input as a sequence of tokens where nearby tokens are very likely correlated. Thus, we consider a sentence $S \in \mathbb{R}^{l \times d}$ as a sequence of $l$ input words, and each word represented by a $d$-dimensional embedding.

We will need to introduce some notation for a technical description of the model. To keep things as similar as possible to our textbook,

- $S_i \in \mathbb{R}^d$ will denote the embedding of the $i^{\text{th}}$ word in the sequence
- $S_{i:j}$ will denote the concatenation of embeddings from word $i$ up to and including word $j$.
- The $k^{\text{th}}$ dimension will be denoted by $[k]$ in the superscript. That is, $S_i^{[k]}$ is dimension $k$ in the representation of word $i$ in our sentence. Similarly, $S_{i:j}^{[k]}$ is the vector of the values in the $k^{\text{th}}$ dimension of words $i$ to $j$.
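
The notation above maps onto array indexing as in the following sketch. The matrix `S` is a toy stand-in for real embeddings, and the variable names are chosen only for this illustration. Note that the text counts words and dimensions from 1, whereas NumPy counts from 0.

```python
import numpy as np

l, d = 5, 4                                       # toy sentence length and embedding dimension
S = np.arange(l * d, dtype=float).reshape(l, d)   # stand-in for real embeddings

# The text indexes from 1; NumPy indexes from 0, hence the "- 1" shifts.
i, j, k = 2, 4, 3

S_i = S[i - 1]                      # embedding of word i, shape (d,)
S_i_to_j = S[i - 1:j].reshape(-1)   # concatenation of words i..j, shape ((j - i + 1) * d,)
S_i_k = S[i - 1, k - 1]             # dimension k of word i, a scalar
S_i_to_j_k = S[i - 1:j, k - 1]      # dimension k of words i..j, shape (j - i + 1,)

print(S_i, S_i_to_j.shape, S_i_k, S_i_to_j_k)
```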


### 3.5.1 Convolution filters

We define a convolution filter $F = \langle ws, w_F, b_F, h_F \rangle$ as a tuple of size 4, comprising
1. $ws$, the width of the sliding window,
2. $w_F$, the weight vector for the filter (this is a vector in $\mathbb{R}^{ws \times d}$),
3. a real-valued scalar bias $b_F$, and
4. a nonlinear activation function $h_F$.

When the above filter is applied to a sentence $S$, it computes the inner product between $w_F$ and each possible window of length $ws$ in the sentence $S$. Then, as with any feedforward neural network, we add the bias and apply the activation function. Thus, the output is a vector $\mathbf{o}_F \in \mathbb{R}^{1+l-ws}$ given by
$$\mathbf{o}_F = \langle h_F(w_F \cdot S_{i:i+ws-1} + b_F)\rangle_{i=1}^{1+l-ws}$$
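
As a sanity check on the formula, the following NumPy sketch applies one filter to a randomly generated "sentence". It is only an illustration of the definition, not the PyTorch implementation you are asked to write; `apply_filter` and the random inputs are assumptions made for this example.

```python
import numpy as np

def apply_filter(S: np.ndarray, w_F: np.ndarray, b_F: float, h_F=np.tanh) -> np.ndarray:
    """Slide a (ws x d) filter over an (l x d) sentence; return 1 + l - ws activations."""
    ws = w_F.shape[0]
    l = S.shape[0]
    outputs = [
        h_F(np.sum(w_F * S[i:i + ws]) + b_F)   # inner product with one window, plus bias
        for i in range(l - ws + 1)
    ]
    return np.array(outputs)

rng = np.random.default_rng(0)
S = rng.normal(size=(7, 50))      # hypothetical sentence: l = 7 words, d = 50
w_F = rng.normal(size=(3, 50))    # window width ws = 3
o_F = apply_filter(S, w_F, b_F=0.1)
print(o_F.shape)                  # (5,), i.e. 1 + 7 - 3
```

Each entry of `o_F` corresponds to one window position, which is why the output length shrinks as the window width grows.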

First, you will need some pretrained embeddings to serve as initial $d$-dimensional word embeddings. For this, let us resort to something you have already seen in the previous assignment: GloVe embeddings.

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2024-04-30 00:02:23--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-04-30 00:02:23--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2024-04-30 00:05:03 (5.16 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


Now you have the following embeddings:

- `glove.6B.50d.txt`
- `glove.6B.100d.txt`
- `glove.6B.200d.txt`
- `glove.6B.300d.txt`

You can either use the `gensim` library as shown in the previous assignment, or use these pretrained embeddings directly as follows:

In [None]:
def load_glove_embeddings(file_path):
    embeddings_index = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

# Example use:
embeddings_file = 'glove.6B.50d.txt'
glove_embeddings = load_glove_embeddings(embeddings_file)

if 'fox' in glove_embeddings:
    fox_embedding = glove_embeddings['fox']
    print("Embedding vector for 'fox':", fox_embedding)
else:
    print("Embedding vector for 'fox' not found.")

Embedding vector for 'fox': [ 0.44206   0.059552  0.15861   0.92777   0.1876    0.24256  -1.593
 -0.79847  -0.34099  -0.24021  -0.32756   0.43639  -0.11057   0.50472
  0.43853   0.19738  -0.1498   -0.046979 -0.83286   0.39878   0.062174
  0.28803   0.79134   0.31798  -0.21933  -1.1015   -0.080309  0.39122
  0.19503  -0.5936    1.7921    0.3826   -0.30509  -0.58686  -0.76935
 -0.61914  -0.61771  -0.68484  -0.67919  -0.74626  -0.036646  0.78251
 -1.0072   -0.59057  -0.7849   -0.39113  -0.49727  -0.4283   -0.15204
  1.5064  ]
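
If you want the pretrained vectors to be fine-tuned along with the rest of the model, one possible route is to copy them into a trainable `nn.Embedding`. The sketch below uses a tiny invented dictionary in place of `glove_embeddings` so that it runs without the downloaded files; the `<pad>` token and the choice of index 0 for padding are assumptions of this example, not requirements of the assignment.

```python
import numpy as np
import torch
import torch.nn as nn

# Toy stand-in for the dictionary returned by load_glove_embeddings (d = 3 here).
toy_embeddings = {
    "fox": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "dog": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

# Hypothetical vocabulary layout: index 0 is a padding token with a zero vector.
vocab = ["<pad>"] + sorted(toy_embeddings)
word2idx = {word: idx for idx, word in enumerate(vocab)}

matrix = np.zeros((len(vocab), 3), dtype="float32")
for word, vector in toy_embeddings.items():
    matrix[word2idx[word]] = vector

# freeze=False keeps the vectors trainable, so they are updated during training.
embedding = nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=False, padding_idx=0)

ids = torch.tensor([word2idx["fox"], word2idx["dog"], word2idx["<pad>"]])
sentence = embedding(ids)   # shape (3, 3): one row per token
print(sentence.shape, embedding.weight.requires_grad)
```

With `freeze=True` (the default of `from_pretrained`), the vectors would stay fixed and only the other parameters of the model would be trained.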


With such pretrained embeddings, and the technical details described above, it's time to implement the convolution filter. The start of this portion is given to you below. You are free to add and/or modify fields to make this filter class richer, if you want/need (and you will almost certainly need to add more methods).

In [None]:
import torch

import torch.nn as nn

class ConvolutionalFilter(nn.Module):
    def __init__(self, window_size, embedding_dim, activation_fn=nn.Tanh()):
        super(ConvolutionalFilter, self).__init__()
        self.window_size = window_size
        self.embedding_dim = embedding_dim
        self.weight = nn.Parameter(torch.randn(window_size, embedding_dim))
        self.bias = nn.Parameter(torch.randn(1))
        self.activation_function = activation_fn

    def forward(self, sentence):
        # In a PyTorch nn.Module, the forward function defines the forward pass computation of the neural network. This
        # function describes how input data is processed through the layers of the network to produce the output.
        # Usually, it consists of
        # 1. The input data (usually a tensor) is passed into the forward function as an argument.
        # 2. Then, you define the computation graph by specifying how the input data flows forward through the layers
        #    of your network.
        # 3. You apply operations defined by the layers (e.g., convolutional transformation, activation functions,
        #    pooling, etc.) to the input data successively, forming the forward pass of the network.
        # 4. Finally, the forward function returns the output of the network after processing the input data through all
        #    the layers.
        # TODO

        # Unfold the sentence into (length - window_size + 1) sliding windows, each of shape
        # (window_size, embedding_dim).
        sentence_unfolded = sentence.unfold(0, self.window_size, 1).transpose(1, 2)

        # Inner product of the filter weights with each window: multiply element-wise, sum over
        # the window and embedding dimensions, then add the bias once per window.
        out = (sentence_unfolded * self.weight).sum(dim=(1, 2)) + self.bias

        # Finally, apply the activation function. Pooling is done outside this module.
        out = self.activation_function(out)
        return out  # Vector of shape (length - window_size + 1,)

# # Example use:
# embeddings_file = 'glove.6B.50d.txt'
# sentence = 'the quick brown jumped over the moon'

# sentence_embeddings = []
# for word in sentence.split():
#   if word in glove_embeddings:
#     word_embedding = glove_embeddings[word]
#     sentence_embeddings.append(word_embedding)
#   else:
#     print('error')

# sentence_tensor = torch.tensor(sentence_embeddings)
# #print(sentence_tensor)
# cf = ConvolutionalFilter(window_size=3, embedding_dim=50)

# print(cf.forward(sentence_tensor))

### 3.5.2 Pooling

Once you start looking into the details of the convolution filter, you will perhaps notice a discrepancy in the shape of certain intermediate results and the output vector as described. This is where **pooling** comes in.

In networks like this, the output vector of a convolution filter is usually converted to a scalar, and this conversion is done using some sort of pooling. In this context, it makes sense for us to think of an operational object triple $(ws, p, S)$ that contains a convolution layer with width $ws$, uses a pooling function $p$, and operates on the sentence $S$. These operational triples model different perspectives on the semantics of a sentence. That is, such a "perspective" can be defined as

$$\{(ws, p, S) : p \in \{\max, \min, \text{mean}\}\}$$

**Why all these complicated things?**
*We want each convolution layer to learn to recognize distinct phenomena of the input. This allows for richer modeling of compositional semantics. To this end, the design of the operational triples allows a pooling function to interact with its own underlying convolution layers independently.*

For a perspective $(ws, p, S)$, with a convolution layer with $n$ filters, the output is a vector of length $n$, whose $i^\text{th}$ entry is $p(\mathbf{o}_{F_i})$, where $F_i$ is the $i^\text{th}$ filter.
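
The following sketch illustrates this with $n = 2$ filters whose output vectors are invented for the example. The `perspective` helper and the `POOLING` table are hypothetical names; the point is only that each pooling function turns $n$ filter outputs into a vector of length $n$.

```python
import numpy as np

POOLING = {"max": np.max, "min": np.min, "mean": np.mean}

def perspective(filter_outputs: list[np.ndarray], p: str) -> np.ndarray:
    """Pool each filter's output vector o_F to a scalar; n filters give a length-n vector."""
    return np.array([POOLING[p](o_F) for o_F in filter_outputs])

# Hypothetical outputs of n = 2 filters with the same window width on one sentence.
filter_outputs = [np.array([0.2, -0.5, 0.9]), np.array([-0.1, 0.4, 0.3])]

max_vec = perspective(filter_outputs, "max")     # one entry per filter
min_vec = perspective(filter_outputs, "min")
mean_vec = perspective(filter_outputs, "mean")
print(max_vec, min_vec, mean_vec)
```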

Pooling is a common technique used in neural networks for reducing the spatial dimensions of tensors.

**Max Pooling** is a pooling operation that takes the maximum value from each patch of the input tensor. It effectively downsamples the input tensor by retaining only the maximum value from each patch. In the simplest scenario, the entire tensor is a single patch, and it is downsampled to a single scalar -- the element in that tensor which had the highest value.

**Mean Pooling**, also known as average pooling, computes the average value from each patch of the input tensor. It downsamples the input tensor by retaining the average value from each patch. In the simplest scenario, the entire tensor is downsampled to a single scalar, which is the average of all the elements in that tensor.

**Min Pooling** computes the minimum value from each patch of the input tensor. It downsamples the input tensor by retaining the minimum value from each patch. In the simplest case, the entire tensor is replaced by a scalar -- its minimum element.

These pooling operations are commonly used in convolutional neural networks (CNNs) and other types of neural network architectures to reduce the spatial dimensions of feature maps while retaining important information. They help in controlling overfitting and in extracting essential features from input data.
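
The difference between pooling over patches and pooling over the whole tensor can be seen in a few lines of NumPy. The input values and the patch size of 2 are arbitrary choices for this sketch.

```python
import numpy as np

x = np.array([1.0, 4.0, -2.0, 3.0, 0.0, 5.0])

# Non-overlapping patches of size 2: reshape to (num_patches, patch_size), reduce each row.
patches = x.reshape(-1, 2)
max_patches = patches.max(axis=1)     # [4., 3., 5.]
mean_patches = patches.mean(axis=1)   # [2.5, 0.5, 2.5]
min_patches = patches.min(axis=1)     # [1., -2., 0.]

# Simplest scenario: the whole tensor is one patch, so each pooling yields a scalar.
global_max, global_mean, global_min = x.max(), x.mean(), x.min()
print(max_patches, mean_patches, min_patches, global_max, global_mean, global_min)
```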

### 3.5.3 Structured Similarity Computation

Recall the lexical overlap features provided to you initially as a utility function. You may or may not have used it in your code already. Now, however, is the time to look at it again.

You will combine word representations through different pooling techniques and window sizes to create the input to the similarity computation layer in your Siamese network (i.e., the `ConvolutionalTwinNetwork` coming up soon). So, simply using cosine similarity is most probably not leveraging all that hard work!

For both sentences in the input, you have used convolution filters. Some of these filters may have used max pooling, others may have used min or mean pooling. Similarly, multiple window sizes may have also been used.

Now, build a similarity measurement function with the following properties:

- it computes the similarity of the max-pooled regions of the representation of sentence 1 with the max-pooled regions of the representation of sentence 2 (and similarly for min-pooled and mean-pooled)
- it computes a weighted sum of these regional similarities, where the weights are derived from the amount of lexical overlap
- the final similarity measure abides by the basic rule of cosine similarity in that its range is $[-1, 1]$.
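
One way to satisfy all three properties is sketched below in NumPy. It is an illustration of the idea, not the required solution: the region names, the example vectors, and the weight values are all hypothetical, and how you derive the weights from the lexical overlap features is left to you. The key observation is that a weighted average with non-negative weights that sum to 1 is a convex combination, so it cannot leave the range of the individual cosine similarities.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray, eps: float = 1e-8) -> float:
    """Cosine similarity with a small epsilon guarding against zero vectors."""
    return float(u @ v / max(np.linalg.norm(u) * np.linalg.norm(v), eps))

def regional_similarity(regions1: dict[str, np.ndarray], regions2: dict[str, np.ndarray],
                        weights: dict[str, float]) -> float:
    """Weighted average of per-region cosine similarities (weights must be non-negative)."""
    total = sum(weights.values())
    if total <= 0:   # no usable weights: fall back to a plain average
        weights, total = {name: 1.0 for name in regions1}, float(len(regions1))
    return sum(weights[name] / total * cosine(regions1[name], regions2[name]) for name in regions1)

# Hypothetical pooled regions of two sentence representations.
regions1 = {"max": np.array([1.0, 0.0]), "min": np.array([0.0, 1.0]), "mean": np.array([1.0, 1.0])}
regions2 = {"max": np.array([1.0, 0.0]), "min": np.array([0.0, -1.0]), "mean": np.array([1.0, 0.0])}

# Hypothetical weights standing in for quantities derived from lexical overlap.
weights = {"max": 0.5, "min": 0.2, "mean": 0.3}

score = regional_similarity(regions1, regions2, weights)
print(score)
```

Here the max-pooled regions agree perfectly, the min-pooled regions point in opposite directions, and the mean-pooled regions are partially aligned, so the final score lies strictly between -1 and 1.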

In [None]:
# TODO
# Write your structured similarity computation function here. This is the function you should use in the siamese network
# code. Observe carefully how `nn.CosineSimilarity` works, and write your code in a similar form so that in your final
# experiments, you can swap the standard cosine similarity with your custom function (i.e., this one) and see which one
# works better.


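def structured_similarity(sentence1: str, sentence2: str, filtered_tokens1: torch.Tensor, filtered_tokens2: torch.Tensor, word2doc_counts: dict[str, int], device) -> torch.Tensor:
    """
    Compute similarity features for two sentences from their pooled filter outputs and lexical overlap.

    Returns a tensor of five values in [-1, 1]: the cosine similarity of the pooled representations, followed by the
    four lexical overlap features. A linear layer is still needed to reduce these to a single score.
    """
    # The overlap function expects tokenized sentences. Whitespace splitting is used here as a simple stand-in
    # for the tokenizer used by `SickDataset`.
    tokens1, tokens2 = sentence1.lower().split(), sentence2.lower().split()

    # With a single pair, the result is one list of four floats. Note that the overlap function then sees
    # only one "document", so the IDF terms log(1 / count) are never positive.
    overlap_features = pairwise_lexical_overlap_features([tokens1], [tokens2], word2doc_counts)[0]

    # Cosine similarity between the pooled filter outputs (assumed to be 1-D tensors)
    pool_similarity = nn.CosineSimilarity(dim=0)(filtered_tokens1, filtered_tokens2)

    # Combine the features (unweighted for now; weights may be added in the future). Concatenating tensors,
    # rather than rebuilding a tensor from Python values, keeps the similarity attached to the autograd graph.
    overlap_tensor = torch.tensor(overlap_features, dtype=torch.float32, device=pool_similarity.device)
    similarity = torch.cat([pool_similarity.reshape(1).float(), overlap_tensor]).to(device)

    # Squash into the range [-1, 1]; a linear layer is still applied afterwards.
    similarity = torch.tanh(similarity)

    return similarity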



### 3.5.4 Siamese Network

Once you have finished the convolutional filter, the next step is to build a "Siamese" feedforward neural network whose input layer takes in two sentences.

**How to handle sentences of varying lengths?** *Pad the shorter sentence with zero vectors.*
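
A minimal sketch of zero-padding is shown below, using NumPy arrays as stand-ins for embedded sentences. The `pad_to_length` helper is a hypothetical name, and truncating sentences that exceed the target length is an assumption of this example rather than something the assignment prescribes.

```python
import numpy as np

def pad_to_length(S: np.ndarray, target_len: int) -> np.ndarray:
    """Append zero vectors to an (l x d) sentence until it has target_len rows."""
    l, d = S.shape
    if l >= target_len:
        return S[:target_len]   # assumption: longer sentences are truncated
    padding = np.zeros((target_len - l, d), dtype=S.dtype)
    return np.vstack([S, padding])

s1 = np.ones((3, 4))   # hypothetical sentence with 3 words, d = 4
s2 = np.ones((5, 4))   # hypothetical sentence with 5 words, d = 4
target = max(len(s1), len(s2))
p1, p2 = pad_to_length(s1, target), pad_to_length(s2, target)
print(p1.shape, p2.shape)
```

Whatever target length you choose, it should be at least as large as your widest convolution window; otherwise a filter has no window to slide over.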

PyTorch can be used to combine the following to create a Siamese neural network:

- an input layer (with padding, as described above),
- a hidden convolutional layer (which uses the convolutional filter described earlier),
- a similarity computation layer,
- a fully connected (linear) layer, and
- the final scalar output, which is the "relatedness score" of the two sentences (recall that the gold-standard actual score is available from the dataset during training, and also for evaluation during testing).
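
To show how these pieces fit together, here is a deliberately small twin network. It is a sketch under simplifying assumptions, not the model you are asked to build: it swaps in PyTorch's built-in `nn.Conv1d` for the custom convolution filter, uses only max pooling, and uses plain cosine similarity as the similarity layer. The class name `TinyTwinNetwork` and all sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTwinNetwork(nn.Module):
    """Toy twin network: one shared encoder, a similarity layer, and a linear output."""

    def __init__(self, embedding_dim: int, num_filters: int, window_size: int):
        super().__init__()
        # A single encoder object is applied to both sentences, so the weights are shared.
        self.encoder = nn.Conv1d(embedding_dim, num_filters, kernel_size=window_size)
        self.fc = nn.Linear(1, 1)

    def encode(self, sentence: torch.Tensor) -> torch.Tensor:
        # sentence: (batch, length, embedding_dim); Conv1d expects (batch, channels, length).
        features = torch.tanh(self.encoder(sentence.transpose(1, 2)))
        return features.max(dim=2).values   # max pooling over window positions

    def forward(self, sentence1: torch.Tensor, sentence2: torch.Tensor) -> torch.Tensor:
        v1, v2 = self.encode(sentence1), self.encode(sentence2)
        similarity = F.cosine_similarity(v1, v2, dim=1).unsqueeze(1)   # (batch, 1)
        return self.fc(similarity).squeeze(1)   # one relatedness score per pair

torch.manual_seed(0)
model = TinyTwinNetwork(embedding_dim=50, num_filters=8, window_size=3)
batch1 = torch.randn(4, 15, 50)   # hypothetical batch: 4 padded sentences of length 15
batch2 = torch.randn(4, 15, 50)
scores = model(batch1, batch2)    # shape (4,)
print(scores.shape)
```

Because both sentences pass through the same `encoder`, a gradient step updates one set of weights using information from both inputs, which is the defining property of a twin network.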

This component of the assignment requires you to build this network. You do **not** have to build it from scratch! You should use the flexibility of PyTorch and combine layers built using PyTorch.

A skeleton code is given below. Please understand that this is simply the skeleton to show you the overall structure to be followed.



In [None]:
import torch

import torch.nn as nn
import torch.nn.functional as F

class ConvolutionalTwinNetwork(nn.Module):
    def __init__(self, embedding_dim, vocab_size, filter_widths, num_filters, hidden_dim, similarity_measurement='structure', convolution_layer='our_conv', padding_size=15, device='cpu'):
        super(ConvolutionalTwinNetwork, self).__init__()

        #Embedding layer, choosing not to update glove embedding, just using weight to update instead, may switch later
        #self.embedding = nn.Embedding(vocab_size, embedding_dim)

        self.embedding_dim = embedding_dim

        self.glove_embeddings = load_glove_embeddings(f'glove.6B.{embedding_dim}d.txt')

        self.padding_size = padding_size

        self.window_sizes = filter_widths


        #Need to create num_filters filters, will loop through all window sizes till all created
        self.conv_layers = nn.ModuleList()

        # Cycle through the window sizes until num_filters filters have been created.
        for filter_index in range(num_filters):
            window_size = self.window_sizes[filter_index % len(self.window_sizes)]
            conv_layer = ConvolutionalFilter(window_size, self.embedding_dim, nn.Tanh())
            self.conv_layers.append(conv_layer)

        print(self.conv_layers)

        # Similarity measurement layer
        self.similarity_measurement = similarity_measurement  # updated to use either similarity

        #Fully connected layer only needed for own similarity
        self.fc = nn.Linear(5, 1) #5 for size of output from similarity function

        self.device = device

    def forward(self, sentence1, sentence2, word2doc_counts):
        sentence1 = list(sentence1)
        sentence2 = list(sentence2)
        final_scaled_outputs = []

        #Go through each sentence in a batch
        for sen1, sen2 in zip(sentence1, sentence2):
            pooled_outputs1 = []
            pooled_outputs2 = []

            #Get embeddings from helper function defined below, padding is done there too
            embedded_sen1 = self.embed_sentence(sen1)
            embedded_sen2 = self.embed_sentence(sen2)

            # print(sen1)
            # print(embedded_sen1.shape)
            # print(sen2)
            # print(embedded_sen2.shape)
            # move to device, done in helper
            # embedded_sen1 = embedded_sen1.to(self.device)
            # embedded_sen2 = embedded_sen2.to(self.device)

            #Put sentence 1 and 2 through all conv filters
            conv_outputs1 = [conv(embedded_sen1) for conv in self.conv_layers]
            conv_outputs2 = [conv(embedded_sen2) for conv in self.conv_layers]

            #Pool each output to extract information in sentence
            #Use this for similarity comparison
            #Use every form of pooling, may cause issues but should extract different semantic info
            for conv_output1, conv_output2 in zip(conv_outputs1, conv_outputs2):
                max_pooled1 = torch.max(conv_output1, dim=-1)[0].unsqueeze(0)
                avg_pooled1 = torch.mean(conv_output1, dim=-1).unsqueeze(0)
                min_pooled1 = torch.min(conv_output1, dim=-1)[0].unsqueeze(0)

                max_pooled2 = torch.max(conv_output2, dim=-1)[0].unsqueeze(0)
                avg_pooled2 = torch.mean(conv_output2, dim=-1).unsqueeze(0)
                min_pooled2 = torch.min(conv_output2, dim=-1)[0].unsqueeze(0)

                #Combine all pooled output from one set of convolution outputs
                pooled_outputs1.extend([max_pooled1,avg_pooled1,min_pooled1])
                pooled_outputs2.extend([max_pooled2,avg_pooled2,min_pooled2])

            #Merge together all poolings for comparison
            pooled_outputs1_cat = torch.cat(pooled_outputs1,dim=0)
            pooled_outputs2_cat = torch.cat(pooled_outputs2,dim=0)

            #Similarity comparison can use just cosine or own function,
            #Own function should give better results
            # print(self.similarity_measurement)

            if self.similarity_measurement == 'structure':
                similarity = structured_similarity(sen1, sen2, pooled_outputs1_cat, pooled_outputs2_cat, word2doc_counts, self.device)
                similarity = self.fc(similarity)
                activ_fn = nn.Tanh()
                similarity = activ_fn(similarity)
            else:
                similarity = F.cosine_similarity(pooled_outputs1_cat, pooled_outputs2_cat, dim=0)
            #print('sim: ', similarity, conv_output1, conv_output2)

            #print('here')

            #
            # Apply fully connected layer
            # output = self.fc(similarity)

            # activ_fn = nn.Tanh()
            # final_output = activ_fn(output)

            #Scale from [-1, 1] to [1, 5] so we can measure against labels
            final_scaled_output = ((similarity + 1) * 2) + 1

            final_scaled_outputs.append(final_scaled_output)

        #Merge all batch results together for one gradient update
        final_scaled_outputs_tensor = torch.stack(final_scaled_outputs)
        return final_scaled_outputs_tensor

    def embed_sentence(self, sentence):
      """
      Given a sentence uses glove embedding to get the embeddings for each word.
      """
      #Get padding for sentence; padding_size should always be greater than the sentence length since extra slots were added
      sentence_length = len(sentence.split())
      amount_to_pad = self.padding_size - sentence_length

      #Break sentence into words and get the embedding for each
      #Use the word 'the' as the unknown token since it is a common stop word
      #It shouldn't change the meaning of the sentence; if a word is not known there is not
      #much we can do, since we are not focused on learning our own word embeddings
      #Will maybe experiment with using other stop words
      embedded_sentence = []
      for word in sentence.split():
          word = word.lower()
          if word in self.glove_embeddings:
              word_embedding = self.glove_embeddings[word]
          else:
              word_embedding = self.glove_embeddings['the']
          embedded_sentence.append(torch.tensor(word_embedding))

      #Padding here, just adding 0s to fill up size
      zero_array = torch.zeros(self.embedding_dim, dtype=torch.float32)
      for _ in range(amount_to_pad):
          embedded_sentence.append(zero_array)
      embedded_sentence = torch.stack(embedded_sentence)

      return embedded_sentence.to(self.device)
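
The forward pass ends by mapping a similarity in [-1, 1] (the range of both tanh and cosine similarity) onto the 1 to 5 range of the SICK relatedness labels. As a minimal sketch, a linear rescaling between two intervals could be written as follows; the function name `rescale` and its parameters are illustrative and not part of the assignment code.

```python
def rescale(value, src_min=-1.0, src_max=1.0, dst_min=1.0, dst_max=5.0):
    """Linearly map value from [src_min, src_max] to [dst_min, dst_max]."""
    # Position of value inside the source interval, as a fraction in [0, 1].
    fraction = (value - src_min) / (src_max - src_min)
    # Same fraction of the way through the destination interval.
    return dst_min + fraction * (dst_max - dst_min)


print(rescale(-1.0))  # lowest similarity maps to the lowest label
print(rescale(0.0))   # midpoint maps to the midpoint
print(rescale(1.0))   # highest similarity maps to the highest label
```

With the default arguments this reduces to `(value + 1) * 2 + 1`. Because it only uses arithmetic operators, the same function should also work element-wise on a PyTorch tensor of similarities.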


## 3.6 Training (fine-tuning)

With your Siamese network ready, you can now start modifying the pretrained vectors. This is easier than training a model from scratch, and is typically much faster (even without a GPU runtime, the entire process should not take more than 1 to 1.5 hours).

The training process is nothing new, so we won't repeat any code from the previous assignment here. The steps are:

1. Define your Siamese network with pretrained embeddings
2. Define your loss function
3. Define/choose your optimizer (we recommend Adam, available in PyTorch)
4. Create your training loop (in this assignment, you will not need more than 6-8 epochs; the use of small batch sizes such as 8 is recommended)

By training your Siamese network with gold-standard labels, you will effectively be modifying the pretrained embeddings to become better suited to your specific task. This process is called **fine-tuning**. In this assignment, you are fine-tuning GloVe embeddings to capture semantic similarity between sentences, and in doing so you are also creating embeddings (think of the output of your convolution layer) that capture the meaning of sentences instead of individual words.
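
To make the idea of fine-tuning concrete, here is a minimal sketch of one way pretrained vectors can be made trainable in PyTorch. It assumes the GloVe vectors have already been stacked into a single tensor; the small random `pretrained` matrix, the token indices, and the toy objective below are stand-ins for illustration, not real GloVe data or the assignment's loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a (vocab_size x embedding_dim) matrix of GloVe vectors (hypothetical values).
pretrained = torch.randn(10, 4)

# freeze=False keeps requires_grad=True, so the optimizer is allowed to update the vectors.
embedding = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)
optimizer = torch.optim.Adam(embedding.parameters(), lr=0.01)

token_ids = torch.tensor([1, 3, 5])  # indices of the words in one toy sentence
target = torch.zeros(4)              # dummy target for the toy objective

# Toy objective: pull the mean of the sentence's word vectors towards the target.
loss = ((embedding(token_ids).mean(dim=0) - target) ** 2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Rows that differ from the original matrix after one update step.
changed = (embedding.weight.data != pretrained).any(dim=1)
print(changed)  # only the rows used in the sentence have moved
```

Only the vectors of words that actually appear in the training batch receive a gradient, which is why fine-tuning on a small dataset adjusts a small part of the vocabulary.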

One loss function you can use is the **mean squared error**. However, it turns out that there is a better loss function for STS, called the **KL divergence loss**. This is defined as follows:

$\mathcal{L}_{KL}(\theta) = \frac{1}{m}\sum_{k=1}^m \mathrm{KL}\left(\mathbf{y} \,\Vert\, \hat{\mathbf{y}}_\theta \right) + \lambda \Vert\theta\Vert^2$

A few things to note:

- the loss incorporates $L_2$ regularization.
- for STS, the KL divergence can be computed by treating the actual scores $\mathbf{y}$ over all training instances as the first probability distribution, and the estimated scores $\hat{\mathbf{y}}_\theta$ as the second distribution.
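
The description above can be sketched in code as follows. This is one possible reading, not a definitive implementation: it assumes all scores are positive (as they are in the 1 to 5 range), normalizes the gold and predicted scores of a batch so that each sums to 1, and adds an explicit $L_2$ penalty. The function name `kl_divergence_loss` and its parameters are hypothetical.

```python
import torch
import torch.nn.functional as F


def kl_divergence_loss(predicted_scores, gold_scores, parameters=(), l2_lambda=0.0, eps=1e-8):
    """KL(gold || predicted) over a batch of scores, plus an optional L2 penalty.

    Assumption: scores are positive (e.g. in [1, 5]), so dividing by their sum
    yields a valid probability distribution over the batch.
    """
    gold = gold_scores.reshape(-1).float()
    pred = predicted_scores.reshape(-1).float().clamp_min(eps)

    p = gold / gold.sum()  # first distribution: gold-standard scores
    q = pred / pred.sum()  # second distribution: estimated scores

    # F.kl_div takes log-probabilities as input and probabilities as target.
    kl = F.kl_div(q.log(), p, reduction="sum")

    # L2 penalty over the given parameters: lambda * ||theta||^2.
    l2 = sum((w ** 2).sum() for w in parameters)
    return kl + l2_lambda * l2


gold = torch.tensor([1.0, 3.0, 5.0])
print(kl_divergence_loss(gold, gold).item())                           # identical scores: approximately zero
print(kl_divergence_loss(torch.tensor([5.0, 3.0, 1.0]), gold).item())  # mismatched scores: positive
```

Because `F.kl_div` (like `nn.KLDivLoss`) expects log-probabilities as its input, passing raw scores to it can produce negative values; normalizing both sides first keeps the divergence non-negative. In a training loop, `parameters` would typically be `model.parameters()`.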

In [None]:
# TODO
# Write your training code here, as per the above description.

from typing import List, Tuple, Dict, Union
from tqdm import tqdm
import torch
import torch.nn as nn


class Trainer:
    def __init__(self, model, ckpt_save_path, dataset, train_loader):
        self.model = model
        self.ckpt_save_path = ckpt_save_path
        self.dataset = dataset
        self.train_loader = train_loader

    def train(self, dataset, max_epochs, ckpt_interval, validation_interval, device="cpu", learning_rate=0.01, batch_size=8, loss_fn = 'kl'):
        optimizer = torch.optim.Adam(params=self.model.parameters(), lr=learning_rate)

        if loss_fn == 'kl':
            #Use KLDivLoss by default; note that it expects log-probabilities as input and probabilities as target
            loss_fn = nn.KLDivLoss(reduction='batchmean')
        else:
            loss_fn = nn.MSELoss()

        self.model.to(device)

        progress = tqdm(range(max_epochs))
        training_loss = 0.0
        num_batches = 0

        #Run though specified amount of epochs
        for epoch in progress:
            self.model.train()

            total_loss = 0.0
            num_batches = 0  #Reset per epoch so the average below covers this epoch only
            for batch in self.train_loader:  #Need to extract batches
                sentence1, sentence2, label = batch
                label = label.to(device)
                optimizer.zero_grad()

                output =self.model(sentence1, sentence2,self.dataset.word2doc_counts)  #forward pass
                #print(output)
                loss = loss_fn(output, label)

                loss.backward()   #backward pass

                #Check params and grads
                # for name, param in self.model.named_parameters():
                #     print(f'Gradient of {name}:')
                #     print(param.grad)


                optimizer.step()  #update the parameters

                total_loss += loss.item()
                num_batches += 1
                #print('loss:', loss)
                #print('total_loss: ', total_loss)
            training_loss = total_loss / num_batches
            progress.set_description("Epoch %d - Average loss: %.4f" % (epoch + 1, training_loss))

            if (epoch> 0) and (epoch % ckpt_interval == 0) :
                self.save_model_checkpoint(epoch)

    def save_model_checkpoint(self, current_epoch):
        torch.save(self.model.state_dict(), "%s/%s.ckpt" % (self.ckpt_save_path, str(current_epoch)))


In [None]:
import os
import torch
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd
import pickle

def create_path(path):
    if not os.path.exists(path):
        os.mkdir(path)
        print ("Created the path: %s" % (path))

def run_training(dataset, #dataset of all sets
                 loss_fn='kl', #loss function to use
                 embedding_dimensions=50, #pretrained embedding dimensions to use
                 regularization_parameter=0.4, #regularization param to use
                 window_sizes = [2,3,4], #window sizes to use
                 convolution_filters_used=6, #number of filters to use
                 similarity_fn='cosine', #similarity function to use
                 ckpt_model_path='./ckpt', #path to checkpoint
                 final_model_path='./final_model_ckpt', # path to trained model
                 batch_size=20,#batch size
                 ckpt_epoch_size=2, #number of epochs after model is saved
                 epochs=20): #number of epochs

    ckpt_model_path = f'{ckpt_model_path}_glove.{embedding_dimensions}d.tuned.{convolution_filters_used}-filters.{loss_fn}-loss'
    create_path(ckpt_model_path)


    vocab_size = len(dataset.word2doc_counts)

    train_data, dev_data, test_data = dataset.splits()

    #Find padding length for all sentences by finding max length and adding 5, just chose 5 may need to change
    longest_sentence = 0
    for sentence_1, sentence_2 in zip(train_data['sentence_1'], train_data['sentence_2']):
        max_length = max(len(sentence_1.split()), len(sentence_2.split()))
        if max_length > longest_sentence:
            longest_sentence =max_length

    padding_length = longest_sentence + 5

    train_dataset = []
    for sen_1, sen_2, label in zip(train_data['sentence_1'], train_data['sentence_2'], train_data['label']):
        train_dataset.append((sen_1, sen_2, label))

    #print(train_dataset)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    #Get word to counts
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    model = ConvolutionalTwinNetwork(embedding_dim=embedding_dimensions,
                                     vocab_size=vocab_size,
                                     filter_widths=window_sizes,
                                     num_filters=convolution_filters_used,
                                     hidden_dim=128,
                                     similarity_measurement=similarity_fn,
                                     convolution_layer='our_conv',
                                     padding_size=padding_length,
                                     device = device
                                     )

    trainer = Trainer(model, ckpt_model_path,dataset, train_loader)

    this_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f'Device: {this_device}')

    trainer.train(dataset,
                  max_epochs=epochs,
                  ckpt_interval=ckpt_epoch_size,
                  validation_interval=1000,
                  device=this_device,
                  batch_size=batch_size,
                  loss_fn = loss_fn)

    create_path(final_model_path)
    model_filepath = os.path.join(final_model_path, f'glove.{embedding_dimensions}d.tuned.{convolution_filters_used}-filters.{loss_fn}-loss.pkl')
    property_filepath = os.path.join(final_model_path, f'glove.{embedding_dimensions}d.tuned.{convolution_filters_used}-filters.{loss_fn}-loss-properties.md')
    with open(model_filepath, 'wb') as f:
        pickle.dump(model, f)


    property_dict = {
        'Embedding_dimension': embedding_dimensions,
        'Training_loss_function' :loss_fn,
        'Regularization_parameter_(in_loss_function)':regularization_parameter,
        'Window_sizes_used': window_sizes,
        'Number_of_convolutional_filters_used' : convolution_filters_used,
        'Similarity_computation_(cosine/structured)': similarity_fn,
        'Number_of_training_epochs': epochs,
        'Batch_size_used_for_training' : batch_size
    }
    with open(property_filepath, 'wb') as f:
        pickle.dump(property_dict,f)

In [None]:
#Test with dev
dataset = SickDataset("/content/sick_dataset")
run_training(dataset, epochs=5, embedding_dimensions=50, convolution_filters_used=3, similarity_fn='cosine', ckpt_model_path='./dev_ckpt', final_model_path='./dev_ckpt')

ModuleList(
  (0-2): 3 x ConvolutionalFilter(
    (activation_function): Tanh()
  )
)
Device: cuda:0


Epoch 5 - Average loss: -1.6579: 100%|██████████| 5/5 [02:55<00:00, 35.16s/it]


In [None]:
#Test with cosine
dataset = SickDataset("/content/sick_dataset")
run_training(dataset, epochs=25, embedding_dimensions=200, convolution_filters_used=18, similarity_fn='cosine', ckpt_model_path='./cosine_ckpt', final_model_path='./final_model_cosine_ckpt')

Created the path: ./cosine_ckpt_glove.200d.tuned.18-filters.kl-loss
ModuleList(
  (0-17): 18 x ConvolutionalFilter(
    (activation_function): Tanh()
  )
)
Device: cuda:0


Epoch 25 - Average loss: -0.3319: 100%|██████████| 25/25 [1:05:26<00:00, 157.07s/it]


Created the path: ./final_model_cosine_ckpt


In [None]:
#Test with structured similarity function 1
dataset = SickDataset("/content/sick_dataset")
run_training(dataset, epochs=15, embedding_dimensions=200, convolution_filters_used=18, similarity_fn='structure', ckpt_model_path='./structure_ckpt', final_model_path='./final_model_structure_ckpt')

ModuleList(
  (0-17): 18 x ConvolutionalFilter(
    (activation_function): Tanh()
  )
)
Device: cuda:0


Epoch 15 - Average loss: -11.0492: 100%|██████████| 15/15 [20:29<00:00, 81.99s/it]


Created the path: ./final_model_structure_ckpt


In [None]:
#Test with structured similarity function 2
dataset = SickDataset("/content/sick_dataset")
run_training(dataset, epochs=12, embedding_dimensions=300, convolution_filters_used=12, similarity_fn='structure', ckpt_model_path='./structure2_ckpt', final_model_path='./final_model_structure2_ckpt')

ModuleList(
  (0-11): 12 x ConvolutionalFilter(
    (activation_function): Tanh()
  )
)
Device: cuda:0


Epoch 12 - Average loss: -13.8110: 100%|██████████| 12/12 [12:04<00:00, 60.40s/it]


Created the path: ./final_model_structure2_ckpt


In [None]:
#Test with best KL loss
dataset = SickDataset("/content/sick_dataset")
run_training(dataset, epochs=12, embedding_dimensions=300, convolution_filters_used=24, similarity_fn='cosine', ckpt_model_path='./best_kl_ckpt', final_model_path='./final_model_best_kl_ckpt')

Created the path: ./best_kl_ckpt_glove.300d.tuned.24-filters.kl-loss
ModuleList(
  (0-23): 24 x ConvolutionalFilter(
    (activation_function): Tanh()
  )
)
Device: cuda:0


Epoch 12 - Average loss: -0.6906: 100%|██████████| 12/12 [46:04<00:00, 230.36s/it]


Created the path: ./final_model_best_kl_ckpt


In [None]:
# dataset = SickDataset("/content/sick_dataset")

# vocab_size = len(dataset.word2doc_counts)

# train_data, dev_data, test_data = dataset.splits()

# print(train_data)

## 3.7 Experiments and Evaluation

In this final leg of the assignment, you are required to report on your experiments and the final results. Unlike the previous assignments, the evaluation here is not in terms of precision, recall, and accuracy. Rather, it is in terms of the following:

- Pearson's correlation coefficient $r$
- Mean Squared Error (MSE)

Your aim should be to achieve $r \geq 0.80$ and MSE $\leq 0.35$ on the test set. This is a somewhat ambitious aim, so please don't despair if you can't reach this goal! These numbers are given to you as a yardstick.
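
As a minimal sketch of how the two metrics relate, the snippet below computes both for a pair of score lists using NumPy. The helper name `pearson_and_mse` and the toy scores are made up for illustration and are not results from any model.

```python
import numpy as np


def pearson_and_mse(predictions, targets):
    """Return (Pearson's r, mean squared error) for two equal-length score lists."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Off-diagonal entry of the 2x2 correlation matrix is Pearson's r.
    r = np.corrcoef(predictions, targets)[0, 1]
    mse = np.mean((predictions - targets) ** 2)
    return float(r), float(mse)


# Toy scores: every prediction is exactly 0.5 above the gold score.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.5, 2.5, 3.5, 4.5, 5.5]
r, mse = pearson_and_mse(pred, gold)
print(f"r = {r:.3f}, MSE = {mse:.3f}")
```

In this toy case the predictions are perfectly correlated with the gold scores, yet the MSE is not zero because of the constant offset. Correlation measures whether pairs are ranked and spaced consistently, while MSE measures how close the values are, which is why both are reported.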

When you finish fine-tuning and start to achieve decent results, you should save your model as a `.zip` (see the last section on what to submit).

For grading the performance of your models, four models are required:

- Your two best models using structured similarity computation
- Your best model using cosine similarity
- Your best model among those trained using KL divergence loss. If this criterion is already fulfilled by the earlier models, then you should provide your best fine-tuned model that was trained using mean squared error as the loss function.



In [None]:
import numpy as np
from scipy.stats import pearsonr #Used to calculate corr

def load_finetuned_embeddings(embeddings_path):
    """
    Load fine-tuned GloVe embeddings from a file.
    """
    # Load embeddings from file
    # Assuming embeddings are stored as numpy arrays
    embeddings = np.load(embeddings_path, allow_pickle=True)
    return embeddings

def load_model_properties(properties_path):
    """
    Load model properties from a file.
    """
    #Load properties from file
    properties = np.load(properties_path, allow_pickle=True)
    return properties

def evaluate_model_metrics(model, test_loader, loss_fn, device):
    """
    Calculates both the correlation and MSE given our model, dataloader and device.
    """
    targets = []
    predictions = []

    losses = []

    #Use evaluation mode; we don't want to update the model
    model.eval()
    with torch.no_grad():
        for batch in test_loader:
            sentence1,sentence2,label = batch
            label = label.to(device)

            output =model(sentence1,sentence2,dataset.word2doc_counts)
            # print('output', output)
            # print('label', label)

            #Store for MSE (flatten both tensors so their shapes and dtypes match)
            loss =loss_fn(output.reshape(-1), label.reshape(-1).float())
            losses.append(loss.item())

            #Store for corr
            predictions.extend(output.cpu().numpy() )
            targets.extend(label.cpu().numpy())

    # print(predictions)
    # print(targets)

    #Calc corr
    predictions = np.array(predictions).squeeze()
    correlation, _= pearsonr(predictions, targets)

    #Calc MSE
    mse=np.mean(losses)

    return (correlation, mse)

def test(finetuned_embeddings_path, dataset,  model_properties_file_path=None) -> None:
    """
    Computes the Pearson's correlation coefficient r and the mean squared error (MSE) for fine-tuned GloVe embeddings
    and prints the result in a neat tabular structure with model details, as follows (actual values not displayed):

    Embedding dimension                            200
    Training loss function                         KL Divergence Loss
    Regularization parameter (in loss function)    0.4
    Window sizes used                              2, 3, 4
    Number of convolutional filters used           15
    Number of filters with max pooling             8
    Number of filters with min pooling             5
    Number of filters with mean pooling            2
    Similarity computation (cosine/structured)     structured
    Number of training epochs                      8
    Batch size used for training                   20

    Pearson's Correlation Coefficient              0.821
    Mean Squared Error                             0.344
    """

    #Get model and properties
    model = load_finetuned_embeddings(finetuned_embeddings_path)
    properties = load_model_properties(model_properties_file_path)


    #Print properties
    for key, value in properties.items():
        key = key.replace("_", " ")
        print(key, ": ", value)


    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    loss_fn = nn.MSELoss()

    train_data, dev_data, test_data = dataset.splits()


    test_dataset = []
    for sen_1, sen_2, label in zip(test_data['sentence_1'], test_data['sentence_2'], test_data['label']):
        test_dataset.append((sen_1, sen_2, label))

    test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)

    #Get MSE and corr using helper function
    correlation, mse = evaluate_model_metrics(model, test_loader, loss_fn, device)

    #Print MSE and corr
    print(f"Pearson's Correlation Coefficient: {correlation:0.4f}")
    print(f"Mean Squared Error: {mse:0.4f}")

def worst(finetuned_embeddings_path, dataset,  model_properties_file_path=None, k=10):
    """
    Returns the top k sentence pairs for which your model's estimated score was the worst (by worst, we mean the
    highest difference between the gold-standard relatedness score and your predicted score).
    """

    #Get model and properties
    model = load_finetuned_embeddings(finetuned_embeddings_path)
    properties = load_model_properties(model_properties_file_path)

    #Get test data
    train_data, dev_data, test_data = dataset.splits()

    test_dataset = []
    for sen_1, sen_2, label in zip(test_data['sentence_1'], test_data['sentence_2'], test_data['label']):
        test_dataset.append((sen_1, sen_2, label))

    test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    #Loss fn is now MSE
    loss_fn = nn.MSELoss()

    differences = []

    #Use evaluation mode; we don't want to update the model
    model.eval()
    with torch.no_grad():
        for batch in test_loader:
            sentence1, sentence2, label = batch
            label = label.to(device)

            output = model(sentence1, sentence2, dataset.word2doc_counts)

            #Get difference between our outputs and labels (true results)
            difference = torch.abs(output - label)
            differences.append((sentence1, sentence2, difference.item(), output, label))

    #Sort list by worst difference
    worst_pairs = sorted(differences, key=lambda x: x[2], reverse=True)
    worst_pairs_k = worst_pairs[:k]

    #Print k worst with stats
    for i, (sentence1, sentence2, difference, output, label) in enumerate(worst_pairs_k):
        print()
        print(f"Worst Pair {i + 1}:")
        print("Sentence 1: ", sentence1)
        print("Sentence 2: ", sentence2)
        print("Difference: ", difference)
        print("Model Output: ", output)
        print("True Label: ", label)
        print()

def best(finetuned_embeddings_path, dataset,  model_properties_file_path=None, k=10):
    """
    Returns the top k sentence pairs for which your model's estimated score was the best (by best, we mean that the
    difference between the gold-standard relatedness score and your predicted score was the lowest)
    """

    #Get model and properties
    model = load_finetuned_embeddings(finetuned_embeddings_path)
    properties = load_model_properties(model_properties_file_path)

    #Get test data
    train_data, dev_data, test_data = dataset.splits()

    test_dataset = []
    for sen_1, sen_2, label in zip(test_data['sentence_1'], test_data['sentence_2'], test_data['label']):
        test_dataset.append((sen_1, sen_2, label))

    test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    #Loss fn is now MSE
    loss_fn = nn.MSELoss()

    differences = []

    #Use evaluation mode; we don't want to update the model
    model.eval()
    with torch.no_grad():
        for batch in test_loader:
            sentence1, sentence2, label = batch
            label = label.to(device)

            output = model(sentence1, sentence2, dataset.word2doc_counts)

            #Get difference between our outputs and labels (true results)
            difference = torch.abs(output - label)
            differences.append((sentence1, sentence2, difference.item(), output, label))

    #Sort list by best difference (closest)
    best_pairs = sorted(differences, key=lambda x: x[2], reverse=False)
    best_pairs_k = best_pairs[:k]

    #Print k best with stats
    for i, (sentence1, sentence2, difference, output, label) in enumerate(best_pairs_k):
        print()
        print(f"Best Pair {i + 1}:")
        print("Sentence 1: ", sentence1)
        print("Sentence 2: ", sentence2)
        print("Difference: ", difference)
        print("Model Output: ", output)
        print("True Label: ", label)
        print()

In [None]:
#Test for cosine model with 200 embed, 18 filters, and kl loss
test('/content/final_model_cosine_ckpt/glove.200d.tuned.18-filters.kl-loss.pkl',dataset,'/content/final_model_cosine_ckpt/glove.200d.tuned.18-filters.kl-loss-properties.md')
worst('/content/final_model_cosine_ckpt/glove.200d.tuned.18-filters.kl-loss.pkl',dataset,'/content/final_model_cosine_ckpt/glove.200d.tuned.18-filters.kl-loss-properties.md')
best('/content/final_model_cosine_ckpt/glove.200d.tuned.18-filters.kl-loss.pkl',dataset,'/content/final_model_cosine_ckpt/glove.200d.tuned.18-filters.kl-loss-properties.md')

Embedding dimension :  200
Training loss function :  kl
Regularization parameter (in loss function) :  0.4
Window sizes used :  [2, 3, 4]
Number of convolutional filters used :  18
Similarity computation (cosine/structured) :  cosine
Number of training epochs :  25
Batch size used for training :  20
Pearson's Correlation Coefficient: 0.0995
Mean Squared Error: 1.0460

Worst Pair 1:
Sentence 1:  ('The band is singing ',)
Sentence 2:  ('A woman is carefully removing her makeup',)
Difference:  2.666499614715576
Model Output:  tensor([3.6665], device='cuda:0')
True Label:  tensor([1.], device='cuda:0', dtype=torch.float64)


Worst Pair 2:
Sentence 1:  ('A jet is not flying',)
Sentence 2:  ('A dog is barking',)
Difference:  2.6664578914642334
Model Output:  tensor([3.6665], device='cuda:0')
True Label:  tensor([1.], device='cuda:0', dtype=torch.float64)


Worst Pair 3:
Sentence 1:  ('A woman is putting on makeup carefully',)
Sentence 2:  ('The band is singing ',)
Difference:  2.666428565979

In [None]:
#Structured 1 model results
test('/content/final_model_structure_ckpt/glove.200d.tuned.18-filters.kl-loss.pkl',dataset,'/content/final_model_structure_ckpt/glove.200d.tuned.18-filters.kl-loss-properties.md')
worst('/content/final_model_structure_ckpt/glove.200d.tuned.18-filters.kl-loss.pkl',dataset,'/content/final_model_structure_ckpt/glove.200d.tuned.18-filters.kl-loss-properties.md')
best('/content/final_model_structure_ckpt/glove.200d.tuned.18-filters.kl-loss.pkl',dataset,'/content/final_model_structure_ckpt/glove.200d.tuned.18-filters.kl-loss-properties.md')

Embedding dimension :  200
Training loss function :  kl
Regularization parameter (in loss function) :  0.4
Window sizes used :  [2, 3, 4]
Number of convolutional filters used :  18
Similarity computation (cosine/structured) :  structure
Number of training epochs :  15
Batch size used for training :  20


Pearson's Correlation Coefficient: 0.4235
Mean Squared Error: 1.0475

Worst Pair 1:
Sentence 1:  ('The dog is catching a ball',)
Sentence 2:  ('A small girl is riding in a toy car',)
Difference:  2.666635036468506
Model Output:  tensor([[3.6666]], device='cuda:0')
True Label:  tensor([1.], device='cuda:0', dtype=torch.float64)


Worst Pair 2:
Sentence 1:  ('A man is shooting guns',)
Sentence 2:  ('A woman is not riding a horse',)
Difference:  2.6666347980499268
Model Output:  tensor([[3.6666]], device='cuda:0')
True Label:  tensor([1.], device='cuda:0', dtype=torch.float64)


Worst Pair 3:
Sentence 1:  ('A dog is sitting on the ground',)
Sentence 2:  ('A girl is tapping her fingernails',)
Difference:  2.6666340827941895
Model Output:  tensor([[3.6666]], device='cuda:0')
True Label:  tensor([1.], device='cuda:0', dtype=torch.float64)


Worst Pair 4:
Sentence 1:  ('Two kids are doing martial arts on a blue mat',)
Sentence 2:  ('A tan dog is running through the brush',)
Difference:  2.666

In [None]:
#Structured 2 model results
test('/content/final_model_structure2_ckpt/glove.300d.tuned.12-filters.kl-loss.pkl',dataset,'/content/final_model_structure2_ckpt/glove.300d.tuned.12-filters.kl-loss-properties.md')
worst('/content/final_model_structure2_ckpt/glove.300d.tuned.12-filters.kl-loss.pkl',dataset,'/content/final_model_structure2_ckpt/glove.300d.tuned.12-filters.kl-loss-properties.md')
best('/content/final_model_structure2_ckpt/glove.300d.tuned.12-filters.kl-loss.pkl',dataset,'/content/final_model_structure2_ckpt/glove.300d.tuned.12-filters.kl-loss-properties.md')

Embedding dimension :  300
Training loss function :  kl
Regularization parameter (in loss function) :  0.4
Window sizes used :  [2, 3, 4]
Number of convolutional filters used :  12
Similarity computation (cosine/structured) :  structure
Number of training epochs :  12
Batch size used for training :  20
Pearson's Correlation Coefficient: 0.4248
Mean Squared Error: 1.0475

Worst Pair 1:
Sentence 1:  ('An email is being read by a man',)
Sentence 2:  ('A person is peeling a banana',)
Difference:  2.666576862335205
Model Output:  tensor([[3.6666]], device='cuda:0')
True Label:  tensor([1.], device='cuda:0', dtype=torch.float64)


Worst Pair 2:
Sentence 1:  ('A baby elephant is eating a small tree',)
Sentence 2:  ('A little girl is selling a scooter',)
Difference:  2.666566848754883
Model Output:  tensor([[3.6666]], device='cuda:0')
True Label:  tensor([1.], device='cuda:0', dtype=torch.float64)


Worst Pair 3:
Sentence 1:  ('A baby elephant is not eating a small tree',)
Sentence 2:  ('A lit

In [None]:
#Best KL loss model results
test('/content/final_model_best_kl_ckpt/glove.300d.tuned.24-filters.kl-loss.pkl',dataset,'/content/final_model_best_kl_ckpt/glove.300d.tuned.24-filters.kl-loss-properties.md')
worst('/content/final_model_best_kl_ckpt/glove.300d.tuned.24-filters.kl-loss.pkl',dataset,'/content/final_model_best_kl_ckpt/glove.300d.tuned.24-filters.kl-loss-properties.md')
best('/content/final_model_best_kl_ckpt/glove.300d.tuned.24-filters.kl-loss.pkl',dataset,'/content/final_model_best_kl_ckpt/glove.300d.tuned.24-filters.kl-loss-properties.md')

Embedding dimension :  300
Training loss function :  kl
Regularization parameter (in loss function) :  0.4
Window sizes used :  [2, 3, 4]
Number of convolutional filters used :  24
Similarity computation (cosine/structured) :  cosine
Number of training epochs :  12
Batch size used for training :  20
Pearson's Correlation Coefficient: 0.2081
Mean Squared Error: 1.0427

Worst Pair 1:
Sentence 1:  ('A man is jumping rope outside',)
Sentence 2:  ('A woman is slicing a cucumber',)
Difference:  2.66607666015625
Model Output:  tensor([3.6661], device='cuda:0')
True Label:  tensor([1.], device='cuda:0', dtype=torch.float64)


Worst Pair 2:
Sentence 1:  ('A jet is not flying',)
Sentence 2:  ('A dog is barking',)
Difference:  2.665893316268921
Model Output:  tensor([3.6659], device='cuda:0')
True Label:  tensor([1.], device='cuda:0', dtype=torch.float64)


Worst Pair 3:
Sentence 1:  ('A fish is being sliced by a man',)
Sentence 2:  ('A cat is jumping into a box',)
Difference:  2.6658926010131836

Complete the above `test` function's implementation. You may want to store various model properties in a file (e.g., the actual fine-tuned embeddings are in a file `blah-blah.txt`; and all the details about its training and other properties may be stored in `blah-blah.properties`).

If your `test` function expects such a properties file, make sure that the file is also saved in a human-readable format and included in your submission.
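One possible way to keep such a properties file human-readable is to write one `key : value` line per property. The sketch below is only an illustration: the helper names `save_properties` and `load_properties`, the file name `example-properties.md`, and the example values are all hypothetical and not part of the assignment.

```python
import tempfile
from pathlib import Path


def save_properties(path, properties):
    """Write one 'key : value' line per property (hypothetical helper)."""
    lines = [f"{key} : {value}" for key, value in properties.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def load_properties(path):
    """Read 'key : value' lines back into a dict of strings (hypothetical helper)."""
    properties = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if " : " not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(" : ", 1)
        properties[key.strip()] = value.strip()
    return properties


# Round trip in a temporary directory, using illustrative values only.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / "example-properties.md"
    save_properties(
        example_path,
        {"Embedding dimension": 300, "Training loss function": "kl"},
    )
    loaded = load_properties(example_path)

print(loaded)
```

Values come back as strings, so a `test` function that relies on this format would need to convert numeric entries itself.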

For your four models (as required for grading, described earlier), run the `test` function. Each run should be in a separate code cell. **Do NOT remove the results of these runs.**

Similarly, for each one of your four models, return the results of calling the `best` and `worst` functions. You are free to change the signature of these two functions, if needed. Each run should be in a separate code cell, and clearly mention which model's `best` and `worst` are being called. **Do NOT remove the results of these runs.**
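As a rough illustration of what `best` and `worst` report, the sketch below ranks sentence pairs by the absolute difference between the model output and the true label. The function name `rank_pairs_by_error`, its signature, and the toy data are assumptions made for this example; the actual `best` and `worst` functions take a checkpoint, a dataset, and a properties file.

```python
def rank_pairs_by_error(pairs, predictions, labels, k=3, worst=True):
    """Return the k pairs with the largest (worst=True) or smallest absolute error.

    Each returned item is a tuple (difference, pair, prediction, label).
    """
    scored = [
        (abs(pred - gold), pair, pred, gold)
        for pair, pred, gold in zip(pairs, predictions, labels)
    ]
    # Largest errors first for the worst pairs, smallest first for the best.
    scored.sort(key=lambda item: item[0], reverse=worst)
    return scored[:k]


# Hypothetical toy data, not taken from any real model run.
pairs = [
    ("sentence A1", "sentence A2"),
    ("sentence B1", "sentence B2"),
    ("sentence C1", "sentence C2"),
]
predictions = [3.7, 4.6, 2.0]
labels = [1.0, 4.5, 3.0]

worst_pairs = rank_pairs_by_error(pairs, predictions, labels, k=1, worst=True)
best_pairs = rank_pairs_by_error(pairs, predictions, labels, k=1, worst=False)

for difference, pair, pred, gold in worst_pairs:
    print("Worst:", pair, "difference:", difference, "output:", pred, "label:", gold)
```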


# 4. Conceptual Questions and Qualitative Analysis

**4.1** Write down a brief but precise description of your structured similarity computation. Then, write this similarity using a mathematical formula. Make sure your formula and your implementation are faithful to each other!

My similarity function is simple but effective. I first compute the cosine similarity between the pooled representations of sentence 1 and sentence 2, which I call the pooled similarity. I then compute the lexical overlap between the two sentences, which yields four features: overlap, IDF_weighted_overlap, content_only_overlap, and content_only_IDF_weighted_overlap. I concatenate these five values into a tensor and pass it to my linear layer to get an output between 1 and 5. The formula includes weights because I originally planned to weight the features differently; for now they are all weighted equally.

similarity = tanh(pool_similarity * a + overlap * b + IDF_weighted_overlap * c + content_only_overlap * d + content_only_IDF_weighted_overlap * e)
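The formula above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: the function names, the example vectors, and the overlap feature values are hypothetical, and the four overlap features are taken as given inputs because their exact computation is not described here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(pool_similarity, overlap_features, weights):
    """tanh of the weighted sum of the pooled similarity and the overlap features."""
    features = [pool_similarity] + list(overlap_features)
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature")
    return math.tanh(sum(w * f for w, f in zip(weights, features)))


# Hypothetical pooled sentence vectors and overlap features.
pooled_1 = [1.0, 0.0]
pooled_2 = [1.0, 0.0]
overlap_features = [0.5, 0.5, 0.5, 0.5]  # overlap, IDF-weighted, content-only, content-only IDF-weighted
weights = [1.0, 1.0, 1.0, 1.0, 1.0]      # a, b, c, d, e, all equal as described above

pool_similarity = cosine_similarity(pooled_1, pooled_2)
score = combined_similarity(pool_similarity, overlap_features, weights)
print(pool_similarity, score)
```

Note that `tanh` is bounded in (-1, 1), so producing a score on the 1-to-5 scale would need a further rescaling step that the formula does not show.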

**4.2** Identify at least one linguistic pattern in the worst performing sentence-pairs as shown by the call to `worst` for your best performing model (you may need to use a non-default value of `k`). What aspect of your fine-tuning process will you change to improve the predicted scores for sentences that fall into this pattern? Why do you think this change will work to improve the result for these sentence-pairs?

My models tend to predict a score near 3.6 for almost every pair, presumably because a mid-range guess keeps the error moderate against labels anywhere between 1 and 5. As a result, they perform worst on pairs of unrelated sentences, whose true label is 1. This probably happens because the model quickly learns that a near-constant output reduces the loss, while learning the true outputs takes much longer. As I train the model for longer, the MSE decreases and the correlation increases. To pull the model out of this pattern, I think it may help to also fine-tune the embeddings, so that gradients propagate further back and have a larger impact. I expect this change to improve the model and lead to better results. Training the model for more epochs should also help it learn more relations between sentences.
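The effect described above, where near-constant predictions give a moderate MSE but a correlation close to zero, can be checked with a small diagnostic. The sketch below uses plain Python and hypothetical toy numbers; the function names are illustrative and the values are not taken from any real model run.

```python
import math


def mean_squared_error(predictions, labels):
    """Average squared difference between predictions and labels."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


def pearson_correlation(predictions, labels):
    """Pearson's r; returns NaN when either sequence has zero variance."""
    n = len(labels)
    mean_p = sum(predictions) / n
    mean_y = sum(labels) / n
    cov = sum((p - mean_p) * (y - mean_y) for p, y in zip(predictions, labels))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    if var_p == 0.0 or var_y == 0.0:
        return float("nan")
    return cov / math.sqrt(var_p * var_y)


# Hypothetical labels and two hypothetical prediction sets.
labels = [1.0, 2.0, 3.0, 4.0, 5.0]
collapsed = [3.66, 3.67, 3.66, 3.67, 3.66]  # outputs stuck near one value
spread = [1.5, 2.5, 3.0, 3.5, 4.5]          # outputs that track the labels

r_collapsed = pearson_correlation(collapsed, labels)
r_spread = pearson_correlation(spread, labels)
prediction_range = max(collapsed) - min(collapsed)

print("collapsed:", mean_squared_error(collapsed, labels), r_collapsed)
print("spread:   ", mean_squared_error(spread, labels), r_spread)
print("range of collapsed predictions:", prediction_range)
```

A very small prediction range on the test set, as in the `collapsed` case, is one quick sign that the model has settled on a near-constant output.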