# Learning Sentence Representations from Question-Answer Pairs

In this project, we explore a method for learning fixed-length vector representations of variable-length "short" sentences (at most around 50 words) collected from a limited-scope topic domain (like "diabetes"). These sentences are questions scraped from an online forum like answers.com, and each question is accompanied by its top response (answer). We use question-answer data scraped from an online forum because the "label" (answer) for each training sample (question) is free. These labels are technically "soft labels" since they are responses taken verbatim from the general public. Our goal is to use these answers to supervise their corresponding questions in a sort of "skip-thought" fashion in order to learn sentence representations within a specific topic domain.

### Text-based Representation Learning

Understanding user questions might be the starting point if you are building a chatbot to automatically handle customer queries. As an initial step, you might want to cluster queries into high-level categories, which requires a learned numerical representation of consistent dimensionality for all queries. A fully supervised approach, where a model is trained to output the correct answer given a question, could potentially work, but it would most likely require a lot of data and some clever model design, and it would depend too heavily on soft labels that are sometimes downright wrong. On the other hand, this could be attempted in an entirely unsupervised manner, perhaps by learning to unscramble augmented question strings or impute missing words. However, unsupervised methods may bias too heavily towards unexpected features (like individual words or low-level grammatical logic), and the answers still carry useful information, so why not try to use it?

### Triplet Networks

The method chosen here involves constructing "triplets" of comparison-based learning samples consisting of a question, its correct answer, and a sampled incorrect answer, usually referred to as the anchor, positive, and negative samples, respectively. Specifically, by constructing "positive" question-answer pairs of anchor and positive samples, and "negative" question-answer pairs of anchor and negative samples, we train a model to discriminate between these two pairs. Usually this is done by satisfying some criterion based on a distance/similarity measurement in a learned, fixed-dimensionality feature space, requiring that samples from a positive pair be mapped closer together (on average) than samples from a negative pair. Deep networks constructed to solve this learning problem are popularly called "Triplet Networks".

### Model Architecture and Objective

The authors of the paper [Learning Thematic Similarity Metric Using Triplet Networks](https://pdfs.semanticscholar.org/0846/f3cb0ae555c4f7015dca2fce6a047501154f.pdf?_ga=2.178325220.1389316910.1606965483-939693653.1606965483) use a triplet network equipped with the "Ratio Loss" loss function, which converts distances between samples in representation space into probabilities. The authors report better results with this loss function than with the popular "Triplet Margin Loss" used in other triplet network implementations such as the [FaceNet](https://arxiv.org/pdf/1503.03832.pdf) paper. Upon visual investigation using nearest-neighbor searches, dimensionality reduction, and clustering, we were able to obtain good results using either loss function. We therefore allow the choice between these two losses to be set as a configuration parameter ("margin" for the Triplet Margin Loss and "ratio" for the Ratio Loss).
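To make the two objectives concrete, here is a minimal pure-Python sketch for a single triplet. The margin loss follows the standard hinge form. The ratio loss shown here is one commonly used formulation (a softmax over the two distances, penalized by its squared distance from the ideal outcome); treating it as the form used in this repo is an assumption, and the actual implementation may differ in detail. All function names are hypothetical.

```python
import math

def l1_distance(u, v):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def triplet_margin_loss(d_pos, d_neg, margin=0.2):
    """Hinge loss: zero once the negative is at least `margin` farther away than the positive."""
    return max(d_pos - d_neg + margin, 0.0)

def ratio_loss(d_pos, d_neg):
    """Softmax the two distances into probabilities, then penalize the squared
    distance from the ideal outcome (p_pos, p_neg) = (0, 1).
    NOTE: assumed formulation, not taken from this repo's code."""
    m = max(d_pos, d_neg)  # subtract the max for numerical stability
    e_pos, e_neg = math.exp(d_pos - m), math.exp(d_neg - m)
    p_pos = e_pos / (e_pos + e_neg)
    p_neg = e_neg / (e_pos + e_neg)
    return p_pos ** 2 + (p_neg - 1.0) ** 2

# toy 2D representations (made up for illustration)
anchor, positive, negative = [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]
d_pos = l1_distance(anchor, positive)  # approximately 0.2
d_neg = l1_distance(anchor, negative)  # 2.0

print(triplet_margin_loss(d_pos, d_neg))  # well-separated triplet, so the hinge is inactive
print(ratio_loss(d_pos, d_neg))
```

One practical difference is visible here: the margin loss drops to exactly zero for triplets that already satisfy the margin, while the ratio loss keeps producing a small, non-zero signal for every triplet.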

Since our dataset consists of question-answer pairs, constructing the positive pair for a triplet is simply done by pairing a question with its corresponding answer. To construct the negative pair for a triplet, we sample a different answer uniformly at random from the dataset, resulting in an answer that is most likely incorrect for the anchor sample. Like most triplet network implementations, our triplet network consists of three identical deep sentence encoders with tied weights. The encoders compute representations for the anchor, positive, and negative samples, and these three representations are then used to compute the overall loss based on their "closeness" to each other as measured in the representation space. We test if indeed "Attention Is All You Need" by choosing our encoder architecture to be a series of stacked transformer networks, described in the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf). The generalized model architecture we use is summarized in the diagram below.
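The triplet construction described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the repo's data loader: the function name `sample_triplets` and the toy question-answer pairs are hypothetical.

```python
import random

def sample_triplets(pairs, seed=0):
    """Build one (anchor, positive, negative) triplet per question-answer pair.

    `pairs` is a list of (question, answer) tuples. The negative is an answer
    drawn uniformly at random from the answers of the *other* pairs.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to sample a negative")
    rng = random.Random(seed)
    triplets = []
    for i, (question, answer) in enumerate(pairs):
        # draw an index uniformly from all positions except i
        j = rng.randrange(len(pairs) - 1)
        if j >= i:
            j += 1
        triplets.append((question, answer, pairs[j][1]))
    return triplets

# made-up toy data for illustration only
pairs = [
    ("can a diabetic eat fruit ?", "yes , in moderation ."),
    ("what is insulin ?", "a hormone that regulates blood sugar ."),
    ("is type 2 diabetes reversible ?", "it can often be managed with lifestyle changes ."),
    ("should a diabetic use a sauna ?", "ask your doctor first ."),
]
triplets = sample_triplets(pairs)
for anchor, positive, negative in triplets:
    print(anchor, "|", positive, "|", negative)
```

Because negatives are sampled at random, a sampled answer can occasionally be a reasonable response to the anchor question, which is why the negative is only "most likely" incorrect.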

Anchor, positive, and negative sentences are provided as vocabulary-indexed batches, which the embedding layer converts to batches of word-vector arrays. The embedding layer can be loaded with pre-trained word embeddings by setting the `embedding` variable within the `config.yaml` file to one of the available embedding file names listed for the `torchtext.vocab.load_vectors` function (listed [here](https://torchtext.readthedocs.io/en/latest/vocab.html#torchtext.vocab.Vocab.load_vectors)), or the embeddings can be trained along with the rest of the model parameters by setting the `embedding` variable to the string `'custom'`. The output of the final transformer is pooled (sum, max, mean, etc.) along the sentence-length dimension to construct fixed-length representations for every sentence. The distance function can be any function that satisfies the [distance metric axioms](https://en.wikipedia.org/wiki/Metric_(mathematics)). The transformers at each level (i.e. $Transformer_{i}$) and the embedding layers are weight-tied across the three streams, resulting in identical processing for the anchor, positive, and negative samples up until the distance function computation.
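The pooling step is what turns a variable number of token vectors into a fixed-length sentence vector. The sketch below uses plain Python lists for clarity; in the model this would be a tensor reduction over the sentence-length dimension. The function name `pool` and the toy inputs are hypothetical.

```python
def pool(token_vectors, mode="mean"):
    """Collapse a (sentence_length x dim) list of token vectors into one dim-sized vector."""
    columns = list(zip(*token_vectors))  # one tuple per feature dimension
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    raise ValueError("unknown pooling mode: {}".format(mode))

# two "sentences" of different lengths with 2-dimensional token vectors
short_sentence = [[1.0, 2.0], [3.0, 4.0]]
long_sentence = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

for mode in ("sum", "max", "mean"):
    # both outputs have length 2 regardless of sentence length
    print(mode, pool(short_sentence, mode), pool(long_sentence, mode))
```

Note that sum pooling makes the representation's magnitude grow with sentence length, whereas mean and max pooling do not, which can matter when the distance function is sensitive to scale.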

![title](images/triplet_net.png)

### Exploring Learned Representations

The steps below explore learned representations from files of encoded sentence vectors after a model has been trained. The `README` in this repo explains how to preprocess the data and train a model, which will write `val_question_tok.txt` and `val_question_vec.txt` files consisting of question strings and their encoded learned representations from the validation split, respectively. You can use the code blocks to explore learned representations for various pretrained models and clustering parameters. We experimented with varying loss functions (Ratio Loss vs. Triplet Margin Loss), model architectures (only pre-trained word vectors, only custom word vectors, 2 transformers, 4 transformers, 100d word vectors, 300d word vectors, etc.), and various other model parameters (transformer activation, distance metric, etc.). We found clusters in each case, but generally observed less bias towards individual words when transformer layers were included. Below, we include a set of `config.yaml` model parameters and DBSCAN clustering parameters that together yield some nice clusters.

`config.yaml` Parameter Settings:

```
number_epochs: 100
batch_size: 64
learning_rate: 0.0001
weight_decay: 0.01
number_workers: 4
embedding_type: 'glove.6B.300d'
embedding_dim: 300
number_transformers: 2
number_attention_heads: 10
transformer_activation: 'tanh'
distance_metric: 'l1'
output_process: 'normalize'
margin: 0.2
loss: 'margin'

```

DBSCAN Parameter Settings:
```
eps: 0.1
min_samples: 10
metric: 'l2'
```

#### 1. Setup

Import packages, define some variables, and load validation question tokens and representations files.

In [1]:
import torch
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
import nltk
from nltk.corpus import stopwords
import plotly.express as px

data_dir = '/home/dylan/trained_model_files/pytorch/sentence2vec/'
out_dir = '/tmp/'
name = 'glove_300_2T_tanh_l1_02M'
metric = 'l1'

# load validation question token strings (one question per line)
with open('{}{}_val_question_tok.txt'.format(data_dir, name), 'r') as fp:
    val_question_tok = [line.strip('\n') for line in fp]
    
# load validation question vectors (one comma-separated vector per line)
val_question_vec = np.genfromtxt('{}{}_val_question_vec.txt'.format(data_dir, name), delimiter=',')
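Before projecting and clustering, a nearest-neighbor search in the original representation space is a quick sanity check. The helper below is a minimal NumPy sketch with a hypothetical name and made-up toy vectors; it could be applied to `val_question_vec` and `val_question_tok` in the same way.

```python
import numpy as np

def nearest_neighbors(query_idx, vectors, k=3, metric="l1"):
    """Return indices of the k rows closest to vectors[query_idx], excluding the query itself."""
    vectors = np.asarray(vectors, dtype=float)
    diffs = vectors - vectors[query_idx]
    if metric == "l1":
        dists = np.abs(diffs).sum(axis=1)
    elif metric == "l2":
        dists = np.sqrt((diffs ** 2).sum(axis=1))
    else:
        raise ValueError("unknown metric: {}".format(metric))
    dists[query_idx] = np.inf  # never return the query itself
    return np.argsort(dists)[:k]

# toy stand-ins for the loaded vectors and question strings
toy_vec = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.2]])
toy_tok = ["question a", "question b", "question c", "question d"]

for idx in nearest_neighbors(0, toy_vec, k=2):
    print(toy_tok[idx])
```

Using the same metric here as the one used during training (`metric = 'l1'` in the setup cell) keeps the neighbor ranking consistent with what the model was optimized for.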

#### 2. t-SNE Projection

Embed the validation question vectors in three dimensions with t-SNE. We use the same distance metric for t-SNE that was used for training the representations, since t-SNE operates in the same feature space. Additionally, we scale the projections along each dimension by the maximum absolute value so that all components lie in the common range $[-1, 1]^3$, which simplifies the search for good clustering parameters across varying model parameters and the stochastic outputs of t-SNE.

In [2]:
# compute t-SNE projections
tsne = TSNE(n_components=3, metric=metric)
val_question_proj = tsne.fit_transform(val_question_vec)

# scale projections along each dimension by the max absolute value to lie in [-1, 1]^3
val_question_proj = val_question_proj / np.max(np.abs(val_question_proj), axis=0)
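The scaling in the cell above divides each column by its maximum absolute value, which bounds every component by 1 in magnitude but keeps negative values. The sketch below contrasts this with min-max scaling, a hypothetical alternative that is not used in this notebook, on random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for a t-SNE output with very different spreads per dimension
proj = rng.normal(size=(100, 3)) * np.array([5.0, 20.0, 1.0])

# max-abs scaling (as in the cell above): each column ends up within [-1, 1]
max_abs = proj / np.max(np.abs(proj), axis=0)

# min-max scaling (alternative): each column ends up within [0, 1]
col_min, col_max = proj.min(axis=0), proj.max(axis=0)
min_max = (proj - col_min) / (col_max - col_min)

print(max_abs.min(axis=0), max_abs.max(axis=0))
print(min_max.min(axis=0), min_max.max(axis=0))
```

Either choice puts all dimensions on a comparable scale, but DBSCAN's `eps` is a distance in this scaled space, so switching the scaling scheme would require re-tuning `eps`.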

#### 3. Clustering in Projection Space

In the code block below, we perform density-based DBSCAN clustering on the t-SNE projections of the question representations. We use Euclidean (L2) distance for clustering since the algorithm operates in a 3D space in which the L2 norm is well behaved. The values of `eps` and `min_samples` control the density of the clusters DBSCAN searches for. Generally, lower `eps` and higher `min_samples` lead to the discovery of higher-density clusters (more samples within less space). Lower values of `min_samples` will generally lead to more specific clusters (and therefore bias more heavily towards specific words), while higher values will yield more general clusters. Iterating on different values of these parameters is a good way to explore the projected feature space of the learned representations. Since DBSCAN assigns "outlier" samples the label -1, we report the percentage of samples assigned to non-outlier clusters along with the number of distinct labels found (this count includes the outlier label). We perform the analysis moving forward on only the non-outlier clusters and ignore the rest for now.

In [8]:
# cluster 3D t-SNE projections
# NOTE: low eps, high min_samples -> search for higher density clusters
clustering = DBSCAN(eps=0.1, min_samples=10, n_jobs=-1, metric='l2')
clusters = clustering.fit_predict(val_question_proj)

# cluster info (NOTE: num_clusters counts the outlier label -1 as a cluster when present)
num_clusters = len(np.unique(clusters))
perc_labeled = 100 * (len(clusters[clusters != -1]) / len(clusters))
print('[Num. Clusters]: {}, [Perc. Non-Outliers]: {:.1f}'.format(num_clusters, perc_labeled))

[Num. Clusters]: 50, [Perc. Non-Outliers]: 52.1
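When iterating on `eps` and `min_samples`, it can help to separate the outlier label from the real clusters and to look at cluster sizes. The helper below is a small standard-library sketch with a hypothetical name; it accepts any list of integer labels, for example `clusters.tolist()`.

```python
from collections import Counter

def summarize_labels(labels):
    """Return (number of non-outlier clusters, percent of non-outlier samples,
    cluster sizes sorted from largest to smallest)."""
    counts = Counter(labels)
    n_outliers = counts.pop(-1, 0)  # DBSCAN marks outliers with the label -1
    n_clusters = len(counts)
    perc_labeled = 100.0 * (len(labels) - n_outliers) / len(labels)
    return n_clusters, perc_labeled, counts.most_common()

# toy labels for illustration: three clusters plus two outliers
toy_labels = [0, 0, 1, -1, -1, 1, 1, 2]
n_clusters, perc_labeled, sizes = summarize_labels(toy_labels)
print('[Num. Clusters]: {}, [Perc. Non-Outliers]: {:.1f}'.format(n_clusters, perc_labeled))
print(sizes)
```

A single very large cluster in `sizes` is often a sign that `eps` is too large for the scaled projection space, while a very low non-outlier percentage suggests that `eps` is too small or `min_samples` too high.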


In [9]:
# initialize dataframe of question token strings
df = pd.DataFrame(val_question_tok, columns=['question'])

# add cluster labels to dataframe
df['label'] = clusters

# add projection components to dataframe
proj_df = pd.DataFrame(val_question_proj, columns=['pc_1', 'pc_2', 'pc_3'])
df = pd.concat([proj_df, df], axis=1)

#### 4. Cluster Summary Strings

For each cluster label returned by DBSCAN, we create a pseudo-label consisting of the five most frequent stemmed words from the questions assigned to that label, omitting all standard stop words and some extra stop words defined in the cell below. (The outlier label -1 also receives a summary, but it is filtered out in the plots that follow.) These pseudo-labels aim to provide a rough natural-language topic summary for each cluster.

In [10]:
# get list of standard stop words and add some custom ones to this list
stop_words = stopwords.words('english')
extras = ['diabetes', '?', '.', '!', '<unk>']
stop_words += extras

# initialize word stemmer
stemmer = nltk.stem.PorterStemmer()

# stem stop words
stop_words = [stemmer.stem(word) for word in stop_words]

# add a summary column
df['summary'] = None

# infer cluster topics
for label in sorted(df['label'].unique()):
    # get all samples with this label
    samples = df[df['label'] == label]['question']
    
    # convert samples to a list
    samples = samples.tolist()
    
    # tokenize samples by whitespace
    tokens = [[word for word in sentence.split(' ')] for sentence in samples]
    
    # flatten samples list
    tokens = [inner for outer in tokens for inner in outer]
    
    # stem tokens
    tokens = [stemmer.stem(token) for token in tokens]
    
    # filter stopwords
    tokens = [token for token in tokens if token not in stop_words]
    
    # get token frequencies
    fdist = nltk.FreqDist(tokens)
    
    # get summary string from 5 most frequent tokens
    summary = ' '.join([token for token, _ in fdist.most_common(5)])
    
    # add summary string to dataframe
    df.loc[df['label'] == label, 'summary'] = summary

#### 5. Plot 3D Clusters

In the code block below, we create a 3D scatter plot of the validation question t-SNE projections, omitting all samples assigned to the outlier cluster label by DBSCAN. Points are color-coded by cluster to reveal the organized structure in feature space, and a legend is included which lists the generated pseudo-label topic summary for each cluster.

In [11]:
fig = px.scatter_3d(
    df[df['label'] != -1], x='pc_1', y='pc_2', z='pc_3', color='summary', 
    hover_data=['question', 'summary', 'label'])
fig.update_traces(marker_size=8)
fig.update_layout(
    title='Projected Feature Space of Validation Questions',
    legend={'title': 'Cluster Summaries', 'itemclick': 'toggleothers'},
    height=500,
)
fig.show()

#### 6. Plot Cluster Label Distribution

To get another perspective on the validation question clusters, we plot below the percent-weighted histogram of the non-outlier cluster labels. This gives us a sampled distributional view of the "core" topics that users ask questions about within this domain, allowing us to estimate the expected relative frequency of these question topics in the data-source forum.

In [13]:
fig = px.histogram(
    df[df['label'] != -1], x='summary', 
    histnorm='percent', nbins=len(df['label'].unique()))
fig.update_layout(
    title='Non-Outlier Cluster Label Distribution ({:.1f}% Retained)'.format(perc_labeled),
    xaxis={'title': 'Clusters', 'showticklabels': False},
    yaxis={'title': 'Percent'}
)
fig.show()

### Conclusion

Running this notebook with the parameter settings stated above results in question clusters that clearly extend beyond simply clustering based on pre-trained word vectors. For example, one such cluster with the auto-generated topic summary of "take join team massag borderlin" included the queries "should a diabetic use a sauna?", "can you massage on a diabetic?", and "can a diabetic do a \<unk\> tattoo?". We hypothesize that the ability to link these spa-like/beauty-like questions is learned because the answers to these questions share common terms and ideas. This is consistent with a useful "many-to-one" graphical structure between questions and answers, which can often be seen in online forums where questions with highly varying forms share a common general answer.