# Learning Sentence Representations from Question-Answer Pairs

In this project, we explore how to use deep NLP to learn fixed length vector representations for variable lengthed "short" sentences (on the order of at most around 50 words). We will explore two methods used to learn representations through comparison based learning; Siamese networks that utilize contrastive learning, and Triplet networks that learn an embedding from mined triplets of data samples. We will go into the details of both these implementations in subsequent sections. 

### The Difficulty With Text Data

Unless you have a nicely pre-processed dataset on hand, training an NLP model on text data is a challenging problem for several reasons. Text data collected in the wild is often messy and full of errors---either low-level spelling/grammatical errors or high-level logical/factual errors. It takes a lot of work, and sort of defeats the purpose of fully-autonomous NLP, to have to manually clean text data at all of these levels before training a model. Luckily, there are many libraries that can help us initially tackle cleaning low-level spelling/grammatical errors from text data. Dealing with high-level errors becomes more challenging as they affect how an NLP model trains by introducing biases or even just blatantly incorrect supervision signals. For these errors, we are left to come up with clever training techniques that attempt to counteract these distructive signals during training. Additionally, many NLP tasks such as sentiment analysis or flagging specific samples for content violations require training datasets in which samples need to be manually labeled---an expensive and time consuming operation.

For this project, we use question-answer data scraped from an online forum because the "label" (answer) for each training sample (question) is free. We use quotes around the word "label" because these labels are more like "soft labels" since they are responses taken verbatim from the general public. Our goal is to use this data in order to learn fixed length representations of user questions from within a specific topic domain (like "diabetes").

### Handling User Questions

The application of understanding user questions would be the starting point, for example, of an engineer attempting to build a chatbot to automatically handle queries. As an initial step, the engineer might desire a clustering of customer queries into different categories. Since we are assuming for this project that forum-based answers are also available, this could be attempted in a straight foward way by training a sequence-to-sequence model to output the corresponding answer string given a question string as input in order to learn useful latent feature representations. This approach would involve some tricky model design, though, and depending on the size of the vocabulary for your data domain, the model output for generating a sequence could be very large, resulting in a difficult-to-manage learning signal. On the other hand, learning sentence representations could be attempted in an entirely unsupervised manner, perhaps by learning to unscramble augmented question strings or impute missing words. This approach would avoid the massive dimensionality at the output when predicting output sentences directly, but could result in learning signals that bias too heavily towards unexpected features (like indiidual words or low-level gramatical logic). We have noisy data in the form of sometimes-wrong-answers that might provide a useful learning signal (on average) nonetheless, therefore it would be unwise to completely ignore this information. Instead, we can combine the fully-supervised sequence-to-sequence approach and the fully-unsupervised approach in a way that "meets in the middle" by using the answer strings to construct comparative based training samples. In the folowing sections, we go into the details of two comparison-based learning approaches, train NLP models using these approaches on question-answer data, and explore sentence representations learned through data visualization and cluster analysis. 

### Comparison-based Question-Answer Supervision



### Setup

In [1]:
import torch
import pandas as pd
import numpy as np

data_dir = '/home/dylan/trained_model_files/pytorch/sentence2vec/'
out_dir = '/tmp/'

# load validation question and answer token lists
with open('{}sentence2vec_val_question_tok.txt'.format(data_dir), 'r') as fp:
    val_question_tok = [line.strip('\n') for line in fp]
with open('{}sentence2vec_val_answer_tok.txt'.format(data_dir), 'r') as fp:
    val_answer_tok = [line.strip('\n') for line in fp]
    
# load validation question and answer vectors
val_question_vec = torch.tensor(np.genfromtxt(
    '{}sentence2vec_val_question_vec.txt'.format(data_dir), delimiter=','))
val_answer_vec = torch.tensor(np.genfromtxt(
    '{}sentence2vec_val_answer_vec.txt'.format(data_dir), delimiter=','))

In [2]:
# k nearest neighbors dataframe
knn_df = pd.DataFrame(columns=['question', 'neighbor_1', 'neighbor_2', 'neighbor_3'])

# iterate through question sample texts
for i, sample in enumerate(val_question_tok):
    # create a stacked array of the anchor question
    vec = val_question_vec[i].unsqueeze(0).repeat(len(val_question_vec)-1, 1)
    
    # compute pairwise cosine similarity b/w anchor question and every other question
    dists = torch.nn.functional.cosine_similarity(
        vec, torch.cat([val_question_vec[:i], val_question_vec[i+1:]]), dim=1)
    knns = torch.argsort(dists, descending=True)[:4].long()
    
    # compute euclidiean pairwise dist b/w anchor question and every other question
    #dists = torch.nn.functional.pairwise_distance(
    #    vec, torch.cat([val_question_vec[:i], val_question_vec[i+1:]]))
    #knns = torch.argsort(dists, descending=False)[:4].long()
    
    knn_df = knn_df.append({
        'question': sample,
        'neighbor_1': val_question_tok[knns[0].item()],
        'neighbor_2': val_question_tok[knns[1].item()],
        'neighbor_3': val_question_tok[knns[2].item()],
    }, ignore_index=True)
    
# save question neighbors
knn_df.to_csv('{}question_knns.csv'.format(out_dir), index=False)

In [3]:
knn_df

Unnamed: 0,question,neighbor_1,neighbor_2,neighbor_3
0,what amount of blood glucose level is normal ?,a healthy diet plan for african american with ...,how does diabetes affect the people that have ...,what is the name for a good diabetic cook book ?
1,where to study certified diabetes educator in ...,what is the average level for blood sugar ?,is type one diabetes strong or type two ?,why are patients fasted before oral glucose to...
2,what is the peak time when administering regul...,is seven a high reading ?,who was 1st person to get diabetes ?,what are the major manifestations of diabetes ...
3,where can you get health insurance in <unk> if...,how serious a disease is diabetes ?,what is one of the major food categories in a ...,do you need a comma in this sentence vascular ...
4,what does the term <unk> sugar or sugar <unk> ...,what is the most prevalent form of diabetes ?,what does starch do for your body ?,what do you have to watch what you are doing w...
...,...,...,...,...
2996,what does two hundred and three as a blood sug...,what is the normal range of blood sugar ?,what is the normal level for blood sugar ?,what is considered a normal blood sugar range ?
2997,who was a diabetic in the movie <unk> <unk> ' ?,any side effects in kissing a diabetic person ?,diabetic grocery list ?,"diabetic retinopathy , a <unk> <unk> ?"
2998,what symptoms do horses have with diabetes ?,what are some natural ways to potentially reve...,what are some medical terms <unk> to diabetes ?,what are the top <unk> children is diabetes an...
2999,describe the possible long-term complications ...,which endocrine gland fails to produce enough ...,who has diabetes out of the jonas brothers ?,diabetes warning signs ?
