# Learning Sentence Representations from Question-Answer Pairs

In this project, we explore how to use deep NLP to learn fixed-length vector representations for variable-length "short" sentences (at most around 50 words). We explore two comparison-based methods for learning representations: Siamese networks trained with contrastive learning, and triplet networks that learn an embedding from mined triplets of data samples. We go into the details of both implementations in subsequent sections.

### The Difficulty With Text Data

Unless you have a nicely pre-processed dataset on hand, training an NLP model on text data is challenging for several reasons. Text data collected in the wild is often messy and full of errors, both low-level (spelling and grammar) and high-level (logic and facts). Manually cleaning text data at all of these levels before training takes a lot of work and sort of defeats the purpose of fully autonomous NLP. Luckily, many libraries can help with a first pass at cleaning low-level spelling and grammatical errors. High-level errors are harder to deal with, because they affect how an NLP model trains by introducing biases or even blatantly incorrect supervision signals. For these errors, we are left to come up with clever training techniques that attempt to counteract these destructive signals during training. Additionally, many NLP tasks, such as sentiment analysis or flagging specific samples for content violations, require training datasets in which samples are manually labeled, which is an expensive and time-consuming operation.

For this project, we use question-answer data scraped from an online forum because the "label" (answer) for each training sample (question) is free. We put "label" in quotes because these are more like "soft labels": they are responses taken verbatim from the general public. Our goal is to use this data to learn fixed-length representations of user questions within a specific topic domain (like "diabetes").

### Handling User Questions

Understanding user questions would be the starting point, for example, for an engineer building a chatbot to automatically handle queries. As an initial step, the engineer might want to cluster customer queries into different categories. Since we assume for this project that forum-based answers are also available, a straightforward approach would be to train a sequence-to-sequence model to output the corresponding answer string given a question string as input, and to use its latent features as representations. This approach would involve some tricky model design, though, and depending on the vocabulary size of your data domain, the output layer needed to generate a sequence could be very large, resulting in a difficult-to-manage learning signal.

On the other hand, sentence representations could be learned in an entirely unsupervised manner, perhaps by learning to unscramble augmented question strings or impute missing words. This would avoid the massive output dimensionality of predicting sentences directly, but could result in learning signals that bias too heavily towards unexpected features (like individual words or low-level grammatical logic).

Our answers are noisy and sometimes wrong, but on average they may still provide a useful learning signal, so it would be unwise to ignore them completely. Instead, we can "meet in the middle" between the fully supervised sequence-to-sequence approach and the fully unsupervised approach by using the answer strings to construct comparison-based training samples. In the following sections, we go into the details of two comparison-based learning approaches, train NLP models using these approaches on question-answer data, and explore the learned sentence representations through data visualization and cluster analysis.
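
To make the idea of constructing comparison-based samples from answer strings concrete, here is a minimal sketch of one way to build triplets from question-answer pairs. It treats each question as the anchor, its own answer as the positive, and an answer drawn at random from a different pair as the negative. The function name `build_triplets`, the random negative sampling, and the toy strings are all assumptions made for illustration; the mining strategy actually used in this project may differ.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]          # (question, answer)
Triplet = Tuple[str, str, str]  # (anchor, positive, negative)


def build_triplets(pairs: List[Pair], seed: int = 0) -> List[Triplet]:
    """Build one (anchor, positive, negative) triplet per question-answer pair.

    Hypothetical helper: the negative is an answer sampled uniformly
    from the *other* pairs, which is only one of many possible strategies.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to sample a negative")
    rng = random.Random(seed)
    triplets = []
    for i, (question, answer) in enumerate(pairs):
        # Sample an index from the other len(pairs) - 1 pairs, skipping i.
        j = rng.randrange(len(pairs) - 1)
        if j >= i:
            j += 1
        triplets.append((question, answer, pairs[j][1]))
    return triplets


# Made-up placeholder pairs, not taken from the scraped dataset.
toy_pairs = [
    ("question about exercise", "answer about exercise"),
    ("question about diet", "answer about diet"),
    ("question about testing", "answer about testing"),
]
triplets = build_triplets(toy_pairs)
```

Random negatives are cheap to generate, but many of them are easy to tell apart from the positive, which is one reason triplet mining strategies are often more selective.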

### Comparison-based Question-Answer Supervision
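
As a rough illustration of the two comparison-based objectives named in the introduction, the sketch below writes out a commonly used form of the contrastive loss (for Siamese pairs) and the triplet margin loss. It is written in plain Python rather than PyTorch so that it runs without dependencies. The Euclidean distance, the margin value, and the function names are assumptions for illustration and are not necessarily what the models in this project use.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, is_match: bool, margin: float = 1.0) -> float:
    """Contrastive loss for one pair of embeddings.

    Matching pairs are pulled together (squared distance); non-matching
    pairs are pushed apart until they are at least `margin` away.
    """
    d = euclidean(u, v)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Triplet margin loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

In a question-answer setting, a question embedding would typically play the role of `u` or `anchor`, with answer embeddings as the comparison vectors.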



### Setup

In [5]:
import torch
import pandas as pd
import numpy as np

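# Path to the pickled sentence vectors (the author's local path; adjust for your machine).
sentence_vecs_pkl = '/home/dylan/trained_model_files/pytorch/sentence2vec/sentence2vec_val_vecs.pickle'

# Load the DataFrame of tokenized questions/answers, their vocabulary indices, and their learned vectors.
df = pd.read_pickle(sentence_vecs_pkl)
df  # display the DataFrame as the cell output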

| | question_tok | answer_tok | question_idx | answer_idx | question_vec | answer_vec |
|---|---|---|---|---|---|---|
| 0 | `[is walking good for diabetics ?, what would c...` | `[yes , walking is very good for diabetics . wa...` | `[[5, 1730, 67, 19, 42, 3], [14, 72, 71, 12, 10...` | `[[58, 7, 1730, 5, 112, 67, 19, 42, 2, 1730, 5,...` | `[[-0.1325819492340088, 0.16458217799663544, 0....` | `[[-0.06456340104341507, 0.08461099117994308, 0...` |
| 1 | `[who is the actress in the diabetes test strip...` | `[<unk> <unk> -- she is an australian actress w...` | `[[96, 5, 4, 3902, 15, 4, 6, 95, 1355, 2135, 3]...` | `[[0, 0, 298, 366, 5, 60, 3983, 3902, 96, 5, 88...` | `[[-0.011543991044163704, 0.08648223429918289, ...` | `[[-0.05706747993826866, 0.16805730760097504, -...` |
| 2 | `[what type of diet is recommend for type two d...` | `[your diet is going to be one of the most impo...` | `[[14, 22, 9, 43, 5, 879, 19, 22, 32, 6, 3], [1...` | `[[24, 43, 5, 471, 10, 30, 25, 9, 4, 83, 178, 2...` | `[[-0.0009431586950086057, 0.10353779047727585,...` | `[[-0.057450417429208755, 0.0717715471982956, -...` |
| 3 | `[how do i find out if i have normal blood suga...` | `[the best way to measure blood sugar levels , ...` | `[[36, 35, 33, 93, 135, 31, 33, 27, 74, 20, 17,...` | `[[4, 104, 158, 10, 692, 20, 17, 44, 7, 5, 10, ...` | `[[-0.21692338585853577, 0.24251995980739594, -...` | `[[-0.096956767141819, 0.18522176146507263, -0....` |
| 4 | `[how do i control my diabetes ?, what are some...` | `[the best thing you can do to help control you...` | `[[36, 35, 33, 110, 105, 6, 3], [14, 16, 62, 73...` | `[[4, 104, 347, 12, 13, 35, 10, 99, 110, 24, 6,...` | `[[-0.15618093311786652, 0.08491409569978714, -...` | `[[-0.05038363113999367, -0.03781764954328537, ...` |
| 5 | `[is dates good for diabetic patient ?, if a pa...` | `[yes . only one or two dates per day is good f...` | `[[5, 1037, 67, 19, 21, 152, 3], [31, 8, 1692, ...` | `[[58, 2, 156, 25, 34, 32, 1037, 325, 138, 5, 6...` | `[[-0.12354319542646408, 0.13046208024024963, 0...` | `[[-0.08047402650117874, 0.08867007493972778, 0...` |
| 6 | `[where can one learn more information about th...` | `[high blood sugar could be a sign to a medical...` | `[[79, 13, 25, 802, 76, 159, 94, 4, 85, 9, 47, ...` | `[[47, 20, 17, 128, 30, 8, 526, 10, 8, 153, 165...` | `[[-0.19064490497112274, 0.20862457156181335, -...` | `[[-0.11032918840646744, 0.12685944139957428, -...` |
| 7 | `[what is normal range for oral glucose challen...` | `[<unk> <unk> mg/dl or <unk> <unk> mg/dl or <un...` | `[[14, 5, 74, 184, 19, 540, 37, 2734, 95, 3], [...` | `[[0, 0, 419, 34, 0, 0, 419, 34, 0, 0, 419, 34,...` | `[[-0.1346587836742401, 0.28653988242149353, -0...` | `[[-0.1066971868276596, 0.17729060351848602, -0...` |
| 8 | `[what are some popular books on diabetes nutri...` | `[the diabetes food and nutrition bible : a com...` | `[[14, 16, 62, 1060, 667, 39, 6, 571, 3], [14, ...` | `[[4, 6, 80, 11, 571, 6301, 107, 8, 1088, 1049,...` | `[[-0.009066197089850903, 0.11014312505722046, ...` | `[[0.005060721188783646, 0.0339461974799633, 0....` |
| 9 | `[what happens to the pancreas when diabetes is...` | `[diabetes is a non-communicable disease . it i...` | `[[14, 211, 10, 4, 92, 52, 6, 5, 8, 479, 1145, ...` | `[[6, 5, 8, 4831, 91, 2, 18, 5, 4, 165, 52, 24,...` | `[[-0.0037127903196960688, 0.05959530174732208,...` | `[[-0.06878530234098434, 0.0694245919585228, -0...` |
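
One simple sanity check on vectors like those in the `question_vec` and `answer_vec` columns is to compare each question vector with its corresponding answer vector using cosine similarity. The sketch below assumes that the two lists in a row are aligned (the i-th question goes with the i-th answer), which the output above does not confirm. The helper names and the 2-dimensional toy vectors are made up for illustration and do not come from the dataset.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def paired_similarities(question_vecs, answer_vecs) -> List[float]:
    """Similarity of the i-th question vector to the i-th answer vector."""
    return [cosine_similarity(q, a) for q, a in zip(question_vecs, answer_vecs)]


# Toy stand-ins for one row's question_vec / answer_vec lists.
toy_question_vecs = [[1.0, 0.0], [0.0, 2.0]]
toy_answer_vecs = [[2.0, 0.0], [3.0, 0.0]]
sims = paired_similarities(toy_question_vecs, toy_answer_vecs)
# The first pair points in the same direction; the second pair is orthogonal.
```

With the real DataFrame, the same helper could be applied to one row at a time, for example to `df.loc[0, 'question_vec']` and `df.loc[0, 'answer_vec']`, provided the alignment assumption holds.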
