## Task 1: Content Summarization

Build a model to classify whether two sentences are paraphrases of each other (“1” = yes, “0” = no). You are expected to establish an end-to-end process, including pre-processing, modeling, and validation.


**IMPORTANT:** Please make sure you are in the '/project_root/src/' directory when running this notebook. Further, install the required packages by running the following command in the terminal: `pip install -r requirements.txt`

#### Quora Question Pairs Dataset Stats:

1. Available Columns: id, qid1, qid2, question1, question2, is_duplicate

2. Class labels: 0 (not paraphrases), 1 (paraphrases/duplicates)

3. Total training data / No. of rows: 404290

4. No. of columns: 6

5. No. of non-duplicate data points is 255027

6. No. of duplicate data points is 149263

7. We have 404290 training data points, and only 36.92% of them are positive, so the dataset is **imbalanced**.
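
The class balance above can be re-derived from the two counts. The snippet below is a minimal sketch in plain Python; the inverse-frequency `class_weights` are only an illustrative assumption about how the imbalance could be compensated, not something this notebook is known to use.

```python
# Counts copied from the dataset stats above.
num_non_duplicates = 255027
num_duplicates = 149263
total = num_non_duplicates + num_duplicates

# Share of positive (duplicate) pairs in the full dataset.
positive_share = num_duplicates / total

# Hypothetical "balanced" inverse-frequency weights:
# total / (num_classes * class_count). Not taken from the notebook.
class_weights = {
    0: total / (2 * num_non_duplicates),
    1: total / (2 * num_duplicates),
}

print(f"total pairs:    {total}")
print(f"positive share: {positive_share:.2%}")
print(f"class weights:  {class_weights}")
```

The rarer positive class gets the larger weight. Sampling a balanced subset is another way to handle the same imbalance.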


#### Losses Available to build such a model:

1. **Cosine Embedding Loss:** Similar pairs (label 1) are pulled together so that they are close in vector space. Dissimilar pairs that are closer than a defined margin are pushed apart. The metric used is cosine distance.

2. **Online Contrastive Loss:** An improved version of cosine embedding loss. It looks for negative pairs whose distance is lower than that of the most distant positive pair, and for positive pairs whose distance is higher than that of the closest negative pair. In other words, this loss automatically detects the hard cases in a batch and computes the loss only for these cases.

3. **Multiple Negatives Ranking Loss:** Reduces the distance between positive pairs out of a large set of possible candidates. However, the distance between non-duplicate questions is not that large, so this loss does not work that well for pair classification.

4. **MSE:** Problematic, as the loss does not take the relative order into account. For instance, for two pairs with correct target scores (0.4, 0.5), the loss function would equally penalize predictions like (0.3, 0.6) and (0.5, 0.4), since each is off by 0.1 on both pairs. However, the first prediction is better, as it keeps the correct ranking, while the second one does not.

5. **Triplet Loss & InfoNCE:** Constructing pairs or triplets from the training set is problematic, as it is hard to find non-trivial negative examples.

6. **Batch Softmax Contrastive Loss:** The best and most recent of these, but too complex to implement, and no open-source code is available.

7. **Cross Entropy Loss (easiest of the above):** This is the simplest and most effective loss function among all of the above. It is also the most widely used loss function for classification tasks. However, if we use it, only the final linear layer will learn the mapping to the output distribution really well, and the rest of the layers in the model might not learn good representations of the sentences. Further, a zero-shot formulation during inference is not possible with this loss function.
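
To make the first loss in the list concrete, here is a minimal plain-Python sketch of a per-pair cosine embedding loss. It uses the same per-pair formula as PyTorch's `CosineEmbeddingLoss`, except that negatives are labeled 0 (as in this dataset) rather than -1. The function names, the toy 2-D vectors, and `margin=0.5` are assumptions for illustration; this is not the loss code used by the trainer in this notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_embedding_loss(u, v, label, margin=0.5):
    """Per-pair loss; label 1 = paraphrase, label 0 = not a paraphrase.

    Positives are penalized by their cosine distance (1 - cos).
    Negatives are penalized only when their similarity exceeds the margin.
    """
    cos = cosine_similarity(u, v)
    if label == 1:
        return 1.0 - cos
    return max(0.0, cos - margin)

# Toy embeddings (hypothetical): identical and orthogonal to the anchor.
anchor = [1.0, 0.0]
identical = [1.0, 0.0]
orthogonal = [0.0, 1.0]

loss_pos_close = cosine_embedding_loss(anchor, identical, label=1)
loss_pos_far = cosine_embedding_loss(anchor, orthogonal, label=1)
loss_neg_close = cosine_embedding_loss(anchor, identical, label=0)
loss_neg_far = cosine_embedding_loss(anchor, orthogonal, label=0)

print(loss_pos_close, loss_pos_far, loss_neg_close, loss_neg_far)
```

A negative pair that is already less similar than the margin contributes zero loss, which is why only the "too close" negatives are pushed apart.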


**VERY IMPORTANT:** It is advised to use the **task_1_run_me.py** script instead to seriously train & evaluate the model on multiple GPUs. Jupyter Notebooks are slow & only useful for demonstration purposes. Thanks!

In [1]:
# Run this cell to import all the required custom functions and classes
from utils.dataset_utils import (
    get_quora_dataset, 
    create_k_fold_datasets,
    load_df_from_pickle
)

from utils.preprocess_utils import normalize_text
from utils.plot import plot_zipf_distribution

from utils.dataloader import get_dataset_generators
from utils.model import AkshayFormer
from utils.trainer import AdapterTransfomerTrainer

[nltk_data] Downloading package stopwords to /home/joshi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.

Logging training progress using wandb...Please make sure to login to wandb using your API key.

wandb: Currently logged in as: akshayjoshi (akshay-joshi). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /home/joshi/.netrc


In [None]:
# Load the dataset
quora = get_quora_dataset(
            num_samples=50000,
            balanced=True,
            rand_seed=7
        )

# Split into 5 folds for fair evaluation
create_k_fold_datasets(
            dataset=quora, 
            num_folds=5,
            rand_seed=7
        )
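
The internals of `create_k_fold_datasets` are not shown in this notebook, so the following is only a minimal sketch of what a seeded k-fold split generally does: shuffle the row indices once, cut them into `num_folds` disjoint folds, and hold out one fold per split. The helper name `k_fold_indices` and the toy row count are hypothetical.

```python
import random

def k_fold_indices(num_rows, num_folds, rand_seed):
    """Return a list of (train_indices, test_indices), one per fold."""
    indices = list(range(num_rows))
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(rand_seed).shuffle(indices)
    # Strided slicing yields disjoint folds of near-equal size.
    folds = [indices[i::num_folds] for i in range(num_folds)]

    splits = []
    for held_out in range(num_folds):
        test_idx = folds[held_out]
        train_idx = [
            idx
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for idx in fold
        ]
        splits.append((train_idx, test_idx))
    return splits

# Toy usage with 10 rows instead of the real dataset.
splits = k_fold_indices(num_rows=10, num_folds=5, rand_seed=7)
for fold_id, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_id, sorted(train_idx), sorted(test_idx))
```

Every row appears in exactly one test fold, so each example is evaluated once across the folds.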

In [None]:
# Load df from pickle file
df1 = load_df_from_pickle("data/cross_folds/train_1_folds.pkl")
df2 = load_df_from_pickle("data/cross_folds/test_1_folds.pkl")

In [None]:
# Apply preprocessing operations on the text & normalize the text
df1_norm = normalize_text(df1)
df2_norm = normalize_text(df2)

In [None]:
# Plot Zipf's curve for non-normalized text
plot_zipf_distribution(
    dataframes=(df1, df2),
    title="Zipf's Curve for Custom Quora Dataset: Before Normalization", 
    save_path="plots/zipfs_curve_unnormalized.png"
)

# Plot Zipf's curve for normalized text
plot_zipf_distribution(
    dataframes=(df1_norm, df2_norm),
    title="Zipf's Curve for Custom Quora Dataset: After Normalization", 
    save_path="plots/zipfs_curve_normalized.png"
)
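
A Zipf curve is built from a rank-frequency table: tokens sorted by how often they occur, with rank 1 being the most frequent. The sketch below shows that table for a few made-up questions using whitespace tokenization. The helper name `rank_frequency` and the sample sentences are hypothetical, and `plot_zipf_distribution` may tokenize and plot differently.

```python
from collections import Counter

def rank_frequency(texts):
    """Return (rank, token, frequency) tuples, most frequent token first."""
    counts = Counter(
        token
        for text in texts
        for token in text.lower().split()
    )
    return [
        (rank, token, freq)
        for rank, (token, freq) in enumerate(counts.most_common(), start=1)
    ]

# Made-up sample questions, not taken from the Quora dataset.
sample = [
    "what is the best way to learn python",
    "what is the fastest way to learn python",
    "is the earth round",
]

table = rank_frequency(sample)
for rank, token, freq in table:
    print(rank, token, freq)
```

Plotting frequency against rank on log-log axes gives the usual Zipf curve. Comparing the curve before and after normalization shows how much the preprocessing reshapes the vocabulary.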

In [None]:
# Get the AkshayFormer model & corresponding trainer to train in a self-supervised way
# using Pairwise Cosine Embedding Contrastive Loss
model = AkshayFormer(model_name_or_path="thenlper/gte-base")
trainer = AdapterTransfomerTrainer(
                model=model,
                epochs=50,
                learning_rate=0.05,
                train_full_model=True,
                model_save_name='AkshayFormer_AO_CV1',
        )

In [None]:
# Get dataset generators to sample tuples of anchor, positive/negative, label
train_dset, val_dset, test_dset = get_dataset_generators(
                                    train_df=df1,
                                    test_df=df2,
                                    model_name_or_path="thenlper/gte-base",
                                    max_seq_length=64,
                                    seed=55
                                )

# Get the dataloaders
train_loader, test_loader, val_loader = trainer.get_data_loaders(
                                            train_dset,
                                            test_dset,
                                            val_dset,
                                            batch_size=256
                                        )

In [None]:
# Train the model. This single method will also evaluate the model on the validation 
# set after each epoch and later test on test set after the training is complete.
trainer.train(
        train_loader, 
        val_loader,
        test_loader
    )