# Learning Sentence Representations from Question-Answer Pairs

In this project, we explore a method for learning fixed-length vector representations of variable-length "short" sentences (at most around 50 words) collected from a limited-scope topic domain (like "diabetes"). These sentences are questions scraped from an online forum like answers.com, and each question is accompanied by its top response (answer). We use question-answer data scraped from an online forum because the "label" (answer) for each training sample (question) is free. These labels are technically "soft labels", since they are responses taken verbatim from the general public and are not guaranteed to be correct. Our goal is to use these answers to supervise the corresponding questions in a sort of "skip-thought" fashion in order to learn sentence representations within a specific topic domain.

### Text-based Representation Learning

Understanding user questions might be the starting point if you are attempting to build a chatbot that automatically handles customer queries. As an initial step, you might want to cluster queries into high-level categories, which requires a learned numerical feature representation of consistent dimensionality for all queries. A fully supervised approach, where a model is trained to output the correct answer given a question, could potentially work for this, but it would most likely require a lot of data and some clever model design, and it would depend too heavily on soft labels that are sometimes downright wrong. On the other hand, this could be attempted in an entirely unsupervised manner, perhaps by learning to unscramble augmented question strings or to impute missing words. However, unsupervised methods may bias too heavily towards unexpected features (like individual words or low-level grammatical logic), and they ignore the useful information contained in the answers, so why not try to use it?

### Triplet Networks

The method chosen here involves constructing "triplets" of comparison-based learning samples, each consisting of a question, its correct answer, and a sampled incorrect answer, usually referred to as the anchor, positive, and negative samples, respectively. Specifically, by constructing "positive" question-answer pairs of anchor and positive samples and "negative" question-answer pairs of anchor and negative samples, we train a model to discriminate between the two kinds of pairs. Usually this is done by satisfying some criterion based on a distance/similarity measurement in a learned, fixed-dimensionality feature space, requiring that samples from a positive pair be mapped closer together (on average) than samples from a negative pair. Deep networks constructed to solve this learning problem are popularly called "Triplet Networks".

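The triplet construction described above can be sketched as follows. This is a minimal illustration and not the project's data pipeline: the function name `build_triplets`, the `seed` parameter, and the toy strings are all hypothetical, and the negative is assumed to be drawn uniformly from every answer except the anchor's own.

```python
import random


def build_triplets(questions, answers, seed=0):
    """Return (anchor, positive, negative) tuples from aligned question/answer lists."""
    if len(questions) != len(answers) or len(questions) < 2:
        raise ValueError("need at least two aligned question-answer pairs")
    rng = random.Random(seed)
    triplets = []
    for i, question in enumerate(questions):
        # Draw an index uniformly from all answers except the anchor's own:
        # sample from n - 1 slots, then shift past position i.
        j = rng.randrange(len(answers) - 1)
        if j >= i:
            j += 1
        triplets.append((question, answers[i], answers[j]))
    return triplets


# Toy placeholder data (hypothetical, not taken from the scraped dataset).
questions = ["question about diet", "question about insulin", "question about exercise"]
answers = ["answer about diet", "answer about insulin", "answer about exercise"]

for anchor, positive, negative in build_triplets(questions, answers):
    print(anchor, "|", positive, "|", negative)
```

Because the negative is sampled at random, it can occasionally be a reasonable answer to the anchor question, which is one more reason to treat the resulting supervision as soft.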
### Model Architecture and Objective

The authors of the paper [Learning Thematic Similarity Metric Using Triplet Networks](https://pdfs.semanticscholar.org/0846/f3cb0ae555c4f7015dca2fce6a047501154f.pdf?_ga=2.178325220.1389316910.1606965483-939693653.1606965483) use a triplet network equipped with the "Ratio Loss" loss function, which converts distances between samples in representation space into probabilities. The authors report better results with this loss function than with the popular "Triplet Margin Loss" loss function used in other triplet network implementations, such as the [FaceNet](https://arxiv.org/pdf/1503.03832.pdf) paper. Upon visual investigation using nearest-neighbor searches, dimensionality reduction, and clustering, we also observed better results using the "Ratio Loss", so we use this loss function as well.

Since our dataset consists of question-answer pairs, constructing the positive pair for a triplet is simply done by pairing a question with its corresponding answer. To construct the negative pair for a triplet, we randomly sample a different answer uniformly from the dataset, resulting in an answer that is most likely incorrect for the anchor sample.

Like most triplet network implementations, our triplet network consists of 3 identical deep sentence encoders with tied weights. The encoders compute representations for the anchor, positive, and negative samples, respectively, and these 3 representations are then used to compute the overall loss based on their "closeness" to each other as measured in the representation space. We test whether indeed "Attention Is All You Need" by choosing our encoder architecture to be a series of stacked transformer networks, described in the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf). The resulting model architecture is summarized in the diagram below.

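To make the difference between the two losses concrete, here is a small NumPy sketch. It is an illustration under assumptions rather than the project's training code: the ratio loss shown is one common softmax-ratio formulation (a softmax over the positive and negative distances, penalized by its squared distance from the target `(0, 1)`), the margin loss is the usual hinge on the distance gap, and the function names and random vectors are hypothetical. A real training loop would compute the same quantities in an autograd framework such as PyTorch.

```python
import numpy as np


def pairwise_dist(a, b, p=2.0):
    """L-p distance between matching rows of a and b."""
    return np.linalg.norm(a - b, ord=p, axis=1)


def ratio_loss(anchor, positive, negative, p=2.0):
    """Softmax-ratio loss (assumed formulation): push softmax(d_pos, d_neg) towards (0, 1)."""
    d_pos = pairwise_dist(anchor, positive, p)
    d_neg = pairwise_dist(anchor, negative, p)
    # Softmax over the two distances; subtract the max for numerical stability.
    m = np.maximum(d_pos, d_neg)
    e_pos, e_neg = np.exp(d_pos - m), np.exp(d_neg - m)
    s_pos = e_pos / (e_pos + e_neg)
    s_neg = e_neg / (e_pos + e_neg)
    # Squared distance of (s_pos, s_neg) from the ideal target (0, 1).
    return float(np.mean(s_pos ** 2 + (s_neg - 1.0) ** 2))


def margin_loss(anchor, positive, negative, margin=0.2, p=2.0):
    """Triplet margin loss: hinge on d_pos - d_neg + margin."""
    d_pos = pairwise_dist(anchor, positive, p)
    d_neg = pairwise_dist(anchor, negative, p)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


# Hypothetical embeddings: positives sit near their anchors, negatives are unrelated.
rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 512))
positive = anchor + 0.1 * rng.normal(size=(8, 512))
negative = rng.normal(size=(8, 512))

print("ratio loss: ", ratio_loss(anchor, positive, negative))
print("margin loss:", margin_loss(anchor, positive, negative))
```

In this formulation the ratio loss is bounded and depends only on the difference between the two distances, whereas the margin loss is exactly zero for every triplet that already satisfies the margin.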
### Dataset



### Training

### Exploring Learned Representations

#### 1. Setup

Import packages and load learned feature representation data files.

In [1]:
import torch
import pandas as pd
import numpy as np

data_dir = '/home/dylan/model_files/sentence2vec/'
out_dir = '/tmp/'
name = 'softmax_tanh_l1'
metric = 'l1'

# load validation question and answer token lists
with open('{}{}_val_question_tok.txt'.format(data_dir, name), 'r') as fp:
    val_question_tok = [line.strip('\n') for line in fp]
with open('{}{}_val_answer_tok.txt'.format(data_dir, name), 'r') as fp:
    val_answer_tok = [line.strip('\n') for line in fp]
    
# load validation question and answer vectors
val_question_vec = torch.tensor(np.genfromtxt(
    '{}{}_val_question_vec.txt'.format(data_dir, name), delimiter=','))
val_answer_vec = torch.tensor(np.genfromtxt(
    '{}{}_val_answer_vec.txt'.format(data_dir, name), delimiter=','))

#### 2. K-nearest Neighbors

For each question in the validation set, compute the k-nearest neighbors ($k=3$) within the learned feature space and write these neighbors, along with the original question, to a csv file. 

In [None]:
# number of neighbors to keep per question
k = 3
# collect one row per question and build the dataframe once at the end
# (DataFrame.append was removed in pandas 2.0)
rows = []

# iterate through question sample texts
for i, sample in enumerate(val_question_tok):
    # indices of every other question in the validation set
    others = torch.cat([torch.arange(i), torch.arange(i + 1, len(val_question_vec))])
    candidates = val_question_vec[others]
    # create a stacked array of the anchor question
    vec = val_question_vec[i].unsqueeze(0).expand_as(candidates)

    if metric == 'cosine':
        # compute cosine similarity b/w anchor question and every other question
        sims = torch.nn.functional.cosine_similarity(vec, candidates, dim=1)
        order = torch.argsort(sims, descending=True)
    else:
        # compute Manhattan (p=1) or Euclidean (p=2) dist b/w anchor question and every other question
        p = 1.0 if metric == 'l1' else 2.0
        dists = torch.nn.functional.pairwise_distance(vec, candidates, p=p)
        order = torch.argsort(dists, descending=False)

    # map positions among the candidates back to indices in the full validation set
    # (the anchor was removed, so positions at or after i are shifted by one)
    knns = others[order[:k]]

    rows.append({
        'question': sample,
        'neighbor_1': val_question_tok[knns[0].item()],
        'neighbor_2': val_question_tok[knns[1].item()],
        'neighbor_3': val_question_tok[knns[2].item()],
    })

# k nearest neighbors dataframe
knn_df = pd.DataFrame(rows, columns=['question', 'neighbor_1', 'neighbor_2', 'neighbor_3'])

# save question neighbors
knn_df.to_csv('{}question_knns.csv'.format(out_dir), index=False)

In [None]:
for i in range(10):
    print(knn_df.iloc[np.random.randint(len(knn_df))])
    print('')

#### 3. Clustering

Project the question representations from the validation set into 3 dimensions with t-SNE, then perform DBSCAN clustering on the projections.

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=3, metric=metric)
val_question_proj = tsne.fit_transform(val_question_vec)

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.cluster import OPTICS
import plotly.express as px

# NOTE: low eps, high min_samples -> search for higher density clusters
clustering = DBSCAN(eps=4, min_samples=10, n_jobs=-1, metric='l2')
#clustering = OPTICS(min_samples=10, n_jobs=-1)

clusters = clustering.fit_predict(val_question_proj)
# DBSCAN labels noise points as -1, so exclude that label from the cluster count
n_clusters = len(np.unique(clusters[clusters != -1]))
print('{} clusters found'.format(n_clusters))


df = pd.DataFrame(val_question_proj, columns=['pc_1', 'pc_2', 'pc_3'])
df = pd.concat([df, knn_df['question']], axis=1)
fig = px.scatter_3d(df, x='pc_1', y='pc_2', z='pc_3', color=clusters, hover_data=['question'])
fig.update_traces(marker_size=10)
fig.show()

### Notes:

1. From [When is "Nearest Neighbor" Meaningful?](https://members.loria.fr/MOBerger/Enseignement/Master2/Exposes/beyer.pdf) - "Another possible scenario where high dimensional nearest neighbor queries are meaningful occurs when the underlying dimensionality of the data is much lower than the actual dimensionality."

    - Can we perform nearest neighbor calculations better if we enforce a sparsity constraint on the learned embeddings?
    
2. Test3: triplet margin loss, l1 norm, relu activation, normalization True, margin 0.2, 512 out dim, l1 metric for knn and tsne. - best results so far

3. discuss "soft-labels"

4. Softmax_1: softmax margin loss, 6 transformers with no linear layer, relu activation, 512 out dim, l metric for knn and tsne, DBSCAN(eps=3, n_sam=15) OPTICS(min_samples=15) - great results - also good with cosine metric - maybe better

5. Mention "skip-thought" training.

6. Our intuition is that variably worded questions could have like-worded answers and vice versa. Hopefully this supervision could learn not only word associations, but higher-level idea associations as well.

7. TODO: add utilities for custom distance metrics and use triplet loss with custom distance metrics

8. Implement hyperbolic geometry distance metric.

9. Probabilistic/statistical distance metric? For $v_i \in \mathbb{R}^d$ and $v_j \in \mathbb{R}^d$, randomly sample $n \ll d$ indices $l_0, ..., l_{n-1}$ uniformly and compute a proxy distance $d(v_i, v_j) \approx d(s_i, s_j)$, where $s_x = [v_x(l_0), ..., v_x(l_{n-1})]$. The intuition is that if $d(v_i, v_j) < d(v_i, v_k)$, then $P(\mathcal{X}) > 0.5$ for the event $\mathcal{X} = \{d(s_i, s_j) < d(s_i, s_k)\}$.

10. softmax loss trains best with learning rate of 1e-4

11. relu on transformer outputs, bad results

12. tanh on transformer outputs, good results

13. The authors [here](https://pdfs.semanticscholar.org/0846/f3cb0ae555c4f7015dca2fce6a047501154f.pdf?_ga=2.178325220.1389316910.1606965483-939693653.1606965483) note they saw better performance using the Ratio Loss than using the Triplet Margin Loss from FaceNet. We explore both and generally see the same trend from visual inspection of nearest neighbors and clustering on projections.

14. TODO: change 'softmax' to 'ratio'

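The subsampled distance proxy from note 9 can be checked empirically with a short NumPy sketch. Everything here is an assumption made for illustration: the function names, sampling the indices without replacement, the L1 distance, and the synthetic vectors are not taken from the project.

```python
import numpy as np


def subsampled_distance(v_i, v_j, indices, p=1.0):
    """L-p distance computed only on the sampled coordinates."""
    return np.linalg.norm(v_i[indices] - v_j[indices], ord=p)


def ordering_agreement(v_i, v_j, v_k, n, trials=1000, p=1.0, seed=0):
    """Fraction of random index subsets on which the proxy ranks v_j closer to v_i than v_k."""
    rng = np.random.default_rng(seed)
    d = v_i.shape[0]
    hits = 0
    for _ in range(trials):
        # Sample n of the d coordinates uniformly (assumed: without replacement).
        idx = rng.choice(d, size=n, replace=False)
        if subsampled_distance(v_i, v_j, idx, p) < subsampled_distance(v_i, v_k, idx, p):
            hits += 1
    return hits / trials


# Synthetic vectors: v_j is a small perturbation of v_i, v_k a larger one.
rng = np.random.default_rng(1)
v_i = rng.normal(size=512)
v_j = v_i + 0.5 * rng.normal(size=512)
v_k = v_i + 1.0 * rng.normal(size=512)

print("empirical P(X) with n=16:", ordering_agreement(v_i, v_j, v_k, n=16))
```

Running this for several values of `n` gives a rough sense of how many coordinates are needed before the proxy ordering becomes reliable for a given pair of distances.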

### Runs:

1. `margin_1` config:

    dataset_directory: /wheatley/dylan/datasets/answers_diabetes/
    output_directory: /home/dylan/model_files/sentence2vec/
    model_name: margin_1
    model_file: null
    number_epochs: 50
    batch_size: 64
    learning_rate: 0.0001
    weight_decay: 0.01
    number_workers: 4
    embed_dimensionality: 512
    number_transformers: 6
    output_normalize: True
    margin: 0.2
    p_norm: 2.0
    loss: margin
    
    Validation metric: 'l2'
    
    Results: Not that great.
    
2. `margin_2` config:

    dataset_directory: /wheatley/dylan/datasets/answers_diabetes/
    output_directory: /home/dylan/model_files/sentence2vec/
    model_name: margin_2
    model_file: null
    number_epochs: 50
    batch_size: 64
    learning_rate: 0.0001
    weight_decay: 0.01
    number_workers: 4
    embed_dimensionality: 512
    number_transformers: 6
    output_normalize: True
    margin: 0.2
    p_norm: 1.0
    loss: margin
    
    Validation metric: 'l1'
    
    Results: Not great.
    
3. `softmax_1` config:
    
    dataset_directory: /wheatley/dylan/datasets/answers_diabetes/
    output_directory: /home/dylan/model_files/sentence2vec/
    model_name: softmax_1
    model_file: null
    number_epochs: 50
    batch_size: 64
    learning_rate: 0.0001
    weight_decay: 0.01
    number_workers: 4
    embed_dimensionality: 512
    number_transformers: 6
    output_normalize: True
    margin: 0.2
    p_norm: 2.0
    loss: softmax

    Validation metric: 'l2'
    
    Results:
    
4. `softmax_2` config:

    dataset_directory: /wheatley/dylan/datasets/answers_diabetes/
    output_directory: /home/dylan/model_files/sentence2vec/
    model_name: softmax_2
    model_file: null
    number_epochs: 50
    batch_size: 64
    learning_rate: 0.0001
    weight_decay: 0.01
    number_workers: 4
    embed_dimensionality: 512
    number_transformers: 6
    output_normalize: True
    margin: 0.2
    p_norm: 1.0
    loss: softmax

    Validation metric: 'l1'
    
    Results: