# Learning Sentence Representations from Question-Answer Pairs

In this project, we explore methods to learn fixed length vector representations for variable lengthed "short" sentences (on the order of at most around 50 words) of text data. These "short" sentences are questions scraped from an online forum like answers.com, and each sample is accompanied by the top response (answer) corresponding to each question. We use question-answer data scraped from an online forum because the "label" (answer) for each training sample (question) is free. We use quotes around the word "label" because these labels are more like "soft labels" since they are responses taken verbatim from the general public. Our goal is to use these labels in order to learn sentence representations from within a specific topic domain (like "diabetes").

### Text-based Representation Learning

The application of understanding user questions might be the starting point if you are attempting to build a chatbot to automatically handle customer queries. As an initial step, you might want to cluster queries into different high-level feature categories, requiring a numerical learned feature representation of consistent dimensionality for all queries. A fully supervised approach could potentially work for this, where a model is trained to output the correct answer given a quesion, but would most likely require a lot of data and some clever model design and depend too heavily on the answer "labels" which may sometimes be donwright wrong. On the other hand, this could be attempted in an entirely unsupervised manner, perhaps by learning to unscramble augmented question strings or impute missing words. Unsupervised methods may learn to bias too heavily towards unexpected features (like individual words or low-level grammatical logic), and still, we have useful information nonetheless in the answers, so why not try to use this. The method we choose here involves using comparison-based learning samples (between questions and answers) in order to directly learn a feature representation for text-based inputs. Specifically, by constructing "positive" question-answer pairs, where the paired answer is the "correct" answer, and "negative" question-answer pairs, where the paired answer is the "incorrect" answer, we can train a model to simply discriminate between these two pairs. Usually this is done by satisfying some criterion based on a distance/similarity measurement in a learned, fixed-dimensionality feature space, requiring that samples from a "positive" pair be mapped closer together (on average) than samples form a "negative" pair. Here, we explore the learned features for some of these comparison-based representation learning models, including the [Triplet Network](https://arxiv.org/pdf/1412.6622.pdf) and the [Siamese Network](https://papers.nips.cc/paper/1993/file/288cc0ff022877bd3df94bc9360b9c5d-Paper.pdf). 

### Training the Models

### Exploring Learned Representations

#### 1. Setup

Import packages and load learned feature representation data files.

In [1]:
import torch
import pandas as pd
import numpy as np

data_dir = '/home/dylan/trained_model_files/pytorch/sentence2vec/'
out_dir = '/tmp/'

# load validation question and answer token lists
with open('{}margin_val_question_tok.txt'.format(data_dir), 'r') as fp:
    val_question_tok = [line.strip('\n') for line in fp]
with open('{}margin_val_answer_tok.txt'.format(data_dir), 'r') as fp:
    val_answer_tok = [line.strip('\n') for line in fp]
    
# load validation question and answer vectors
val_question_vec = torch.tensor(np.genfromtxt(
    '{}margin_val_question_vec.txt'.format(data_dir), delimiter=','))
val_answer_vec = torch.tensor(np.genfromtxt(
    '{}margin_val_answer_vec.txt'.format(data_dir), delimiter=','))

#### 2. K-nearest Neighbors

For each question in the validation set, compute the k-nearest neighbors ($k=3$) within the learned feature space and write these neighbors, along with the original question, to a csv file. 

In [2]:
# k nearest neighbors dataframe
knn_df = pd.DataFrame(columns=['question', 'neighbor_1', 'neighbor_2', 'neighbor_3'])

# iterate through question sample texts
for i, sample in enumerate(val_question_tok):
    # create a stacked array of the anchor question
    vec = val_question_vec[i].unsqueeze(0).repeat(len(val_question_vec)-1, 1)
    
    # compute pairwise cosine similarity b/w anchor question and every other question
    dists = torch.nn.functional.cosine_similarity(
        vec, torch.cat([val_question_vec[:i], val_question_vec[i+1:]]), dim=1)
    knns = torch.argsort(dists, descending=True)[:4].long()
    
    # compute euclidiean pairwise dist b/w anchor question and every other question
    #dists = torch.nn.functional.pairwise_distance(
    #    vec, torch.cat([val_question_vec[:i], val_question_vec[i+1:]]), p=1.0)
    #knns = torch.argsort(dists, descending=False)[:4].long()
    
    knn_df = knn_df.append({
        'question': sample,
        'neighbor_1': val_question_tok[knns[0].item()],
        'neighbor_2': val_question_tok[knns[1].item()],
        'neighbor_3': val_question_tok[knns[2].item()],
    }, ignore_index=True)
    
# save question neighbors
knn_df.to_csv('{}question_knns.csv'.format(out_dir), index=False)

In [3]:
for i in range(10):
    print(knn_df.iloc[np.random.randint(len(knn_df))])
    print('')

question      why does a diabetic person has high blood suga...
neighbor_1         what are the best choices of diabetic food ?
neighbor_2    what is the normal <unk> of glucose in 100ml o...
neighbor_3                           define diabetes mellitus ?
Name: 116, dtype: object

question             can diabetes affect a tattoo ?
neighbor_1            does diabetes get passed on ?
neighbor_2    what should my blood sugar level be ?
neighbor_3          can someone die from diabetes ?
Name: 895, dtype: object

question         is marijuana a benefit to type one diabetics ?
neighbor_1                         is honey good for diabetes ?
neighbor_2    can being upset cause blood sugar to drop with...
neighbor_3         can you take viagra with type two diabetes ?
Name: 930, dtype: object

question      how can you tell the difference between a drun...
neighbor_1                          can a diabetic smoke weed ?
neighbor_2    will being diabetic affect the result of a pre...
neighbor_3

#### 3. Clustering

Perform DBSCAN clustering on question representations from the validation set.

In [4]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=3, metric='cosine')
val_question_proj = tsne.fit_transform(val_question_vec)

In [14]:
from sklearn.cluster import DBSCAN
import plotly.express as px

# NOTE: low eps, high min_samples -> search for higher density clusters
clustering = DBSCAN(eps=2.0, min_samples=20, n_jobs=-1, metric='l2')

clusters = clustering.fit_predict(val_question_proj)
n_clusters = len(np.unique(clusters))
print('{} clusters found'.format(n_clusters))


df = pd.DataFrame(val_question_proj, columns=['pc_1', 'pc_2', 'pc_3'])
df = pd.concat([df, knn_df['question']], axis=1)
fig = px.scatter_3d(df, x='pc_1', y='pc_2', z='pc_3', color=clusters, hover_data=['question'])
fig.update_traces(marker_size=3)
fig.show()

26 clusters found


### Notes:

1. From [When is "Nearest Neighbor" Meaningful?](https://members.loria.fr/MOBerger/Enseignement/Master2/Exposes/beyer.pdf) - "Another possible scenario where high dimensional nearest neighbor queries are meaningful occurs when the underlying dimensionality of the data is much lower than the actual dimensionality."

    - Can we perform nearest neighbor calculations better if we enforce a sparsity constraint on the learned embeddings?
    
2. 