## First, we will install the dependencies for running the Similarity Learning task

In [1]:
import os
import csv
import re
from gensim.similarity_learning import DRMM_TKS_Model
from pprint import pprint

Using TensorFlow backend.


## Data Format

We have to provide data in a format which is understood by the model.
The model understands sentences as a list of words. 
Further, we need to give a :
 1. Queries List
 2. Candidate Document List
 3. Correct Label List

1 is a list of list of words
2 and 3 is actually a list of list of list of words/ints

Example:
```
queries = ["When was Abraham Lincoln born ?".split(), 
            "When was the first World War ?".split()]
docs = [
		 ["Abraham Lincoln was the president of the United States of America".split(),
		 "He was born in 1809".split()],
		 ["The first world war was bad".split(),
		 "It was fought in 1914".split(),
		 "There were over a million deaths".split()]
       ]
labels = [[0,
           1],
		  [0,
           1,
           0]
          ]
```

## About the dataset : WikiQA

The WikiQA corpus is a set of question-answer pairs in which for every query there are several candidate documents of which none, one or more documents might be relevant.
Relevance is purely binary, i.e., 1: relavant, 0: not relevant

Sample data:
```
QuestionID	Question	DocumentID	DocumentTitle	SentenceID	Sentence	Label
Q1	how are glacier caves formed?	D1	Glacier cave	D1-0	A partly submerged glacier cave on Perito Moreno Glacier .	0
Q1	how are glacier caves formed?	D1	Glacier cave	D1-1	The ice facade is approximately 60 m high	0
Q1	how are glacier caves formed?	D1	Glacier cave	D1-2	Ice formations in the Titlis glacier cave	0
Q1	how are glacier caves formed?	D1	Glacier cave	D1-3	A glacier cave is a cave formed within the ice of a glacier .	1
Q1	how are glacier caves formed?	D1	Glacier cave	D1-4	Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice.	0
```

## Data Preprocessing
We need to take the above text and make it into `queries, docs, labels` form
We use the below code for that


In [2]:
# Fill the below with wherever you have your WikiQACorpus Folder
wikiqa_data_path = os.path.join('data', 'WikiQACorpus', 'WikiQA-train.tsv')


def preprocess_sent(sent):
    """Utility function to lower, strip and tokenize each sentence
    
    Replace this function if you want to handle preprocessing differently"""
    return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()

# Defining some consants for .tsv reading
QUESTION_ID_INDEX = 0
QUESTION_INDEX = 1
ANSWER_INDEX = 5
LABEL_INDEX = 6

with open(wikiqa_data_path, encoding='utf8') as tsv_file:
    tsv_reader = csv.reader(tsv_file, delimiter='\t')
    data_rows = []
    for row in tsv_reader:
        data_rows.append(row)


        
document_group = []
label_group = []

n_relevant_docs = 0
n_filtered_docs = 0

queries = []
docs = []
labels = []

for i, line in enumerate(data_rows[1:], start=1):
    if i < len(data_rows) - 1:  # check if out of bounds might occur
        if data_rows[i][QUESTION_ID_INDEX] == data_rows[i + 1][QUESTION_ID_INDEX]:
            document_group.append(preprocess_sent(data_rows[i][ANSWER_INDEX]))
            label_group.append(int(data_rows[i][LABEL_INDEX]))
            n_relevant_docs += int(data_rows[i][LABEL_INDEX])
        else:
            document_group.append(preprocess_sent(data_rows[i][ANSWER_INDEX]))
            label_group.append(int(data_rows[i][LABEL_INDEX]))

            n_relevant_docs += int(data_rows[i][LABEL_INDEX])

            if n_relevant_docs > 0:
                docs.append(document_group)
                labels.append(label_group)
                queries.append(preprocess_sent(data_rows[i][QUESTION_INDEX]))
            else:
                n_filtered_docs += 1

            n_relevant_docs = 0
            document_group = []
            label_group = []

    else:
        # If we are on the last line
        document_group.append(preprocess_sent(data_rows[i][ANSWER_INDEX]))
        label_group.append(int(data_rows[i][LABEL_INDEX]))
        n_relevant_docs += int(data_rows[i][LABEL_INDEX])

        if n_relevant_docs > 0:
            docs.append(document_group)
            labels.append(label_group)
            queries.append(preprocess_sent(data_rows[i][QUESTION_INDEX]))
        else:
            n_filtered_docs += 1
            n_relevant_docs = 0

## Let's have a look at the data

In [3]:
queries[300]

['where', 'did', 'hurricane', 'katrina', 'begin']

In [4]:
print(docs[300])

[['hurricane', 'katrina', 'was', 'the', 'deadliest', 'and', 'most', 'destructive', 'atlantic', 'hurricane', 'of', 'the', '2005', 'atlantic', 'hurricane', 'season'], ['it', 'was', 'the', 'costliest', 'natural', 'disaster', 'as', 'well', 'as', 'one', 'of', 'the', 'five', 'deadliest', 'hurricanes', 'in', 'the', 'history', 'of', 'the', 'united', 'states'], ['among', 'recorded', 'atlantic', 'hurricanes', 'it', 'was', 'the', 'sixth', 'strongest', 'overall'], ['at', 'least', '1', '833', 'people', 'died', 'in', 'the', 'hurricane', 'and', 'subsequent', 'floods', 'making', 'it', 'the', 'deadliest', 'u', 's', 'hurricane', 'since', 'the', '1928', 'okeechobee', 'hurricane', 'total', 'property', 'damage', 'was', 'estimated', 'at', '81', 'billion', '2005', 'usd', 'nearly', 'triple', 'the', 'damage', 'brought', 'by', 'hurricane', 'andrew', 'in', '1992'], ['hurricane', 'katrina', 'formed', 'over', 'the', 'bahamas', 'on', 'august', '23', '2005', 'and', 'crossed', 'southern', 'florida', 'as', 'a', 'moder

In [5]:
print(labels[300])

[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


## Making a train-validation split
At this point, it would be good to make a train-validation split so we can see how the model performs as it trains

In [6]:
train_queries, test_queries = queries[:int(len(queries)*0.8)], queries[int(len(queries)*0.8): ]
train_docs, test_docs = docs[:int(len(docs)*0.8)], docs[int(len(docs)*0.8):]
train_labels, test_labels = labels[:int(len(labels)*0.8)], labels[int(len(labels)*0.8):]

In [7]:
print(len(train_queries), len(test_queries))
print(len(train_docs), len(test_docs))
print(len(train_labels), len(test_labels))

697 175
697 175
697 175


# Training the Model
If we want to train the model with some pretrained word embeddingd like Glove, we will have to specify the path

In [8]:
word_embedding_path = os.path.join('data', 'glove.6B.50d.txt')

We would like to monitor the progress of training of the model.
However, we can't rely on the metrics provided by keras as those metrics don't necessarily apply to Information Retrieval problems.

We can additionally provide a validation dataset which will be tested after every epoch.

Now that we have the preprocessed extracted data, training the model just takes one line:

In [9]:
# Train the model
drmm_tks_model = DRMM_TKS_Model(train_queries, train_docs, train_labels, word_embedding_path=word_embedding_path,
                                epochs=10, validation_data=[test_queries, test_docs, test_labels])

2018-06-20 21:19:27,144 : INFO : Starting Vocab Build
2018-06-20 21:19:27,233 : INFO : Vocab Build Complete
2018-06-20 21:19:27,234 : INFO : Vocab Size is 17800
2018-06-20 21:19:27,235 : INFO : Building embedding index using pretrained word embeddings
2018-06-20 21:19:44,020 : INFO : The embeddings_index built from the given file has 400000 words of 50 dimensions
2018-06-20 21:19:44,021 : INFO : Embedding Matrix for Embedding Layer has shape (17801, 50) 
2018-06-20 21:19:44,054 : INFO : There are 594 words not in the embeddings. Setting them to zero
2018-06-20 21:19:44,055 : INFO : Adding additional dimensions from the embedding file to embedding matrix
2018-06-20 21:19:44,895 : INFO : Normalizing the word embeddings
2018-06-20 21:19:45,466 : INFO : Embedding Matrix now has shape (400597, 50)
2018-06-20 21:19:45,466 : INFO : Pad word has been set to index 400594
2018-06-20 21:19:45,467 : INFO : Embedding index build complete


ValueError: text_maxlen: 200 isn't big enough. Error at sentence of length 305. Sentence is ['in', 'his', 'retelling', 'of', 'fairy', 'tales', 'in', 'the', 'scots', 'language', '0', 'q708', 'what', 'does', 'leeroy', 'jenkins', 'mean', 'd690', 'leeroy', 'jenkins', 'd690', '0', 'ben', 'schulz', 'player', 'of', 'leeroy', 'jenkins', 'at', 'blizzcon', '2007', '0', 'q708', 'what', 'does', 'leeroy', 'jenkins', 'mean', 'd690', 'leeroy', 'jenkins', 'd690', '1', 'leeroy', 'jenkins', 'sometimes', 'misspelled', 'leroy', 'jenkins', 'and', 'often', 'elongated', 'with', 'numerous', 'additional', 'letters', 'is', 'an', 'internet', 'meme', 'named', 'for', 'a', 'player', 'character', 'created', 'by', 'ben', 'schulz', 'in', 'blizzard', 'entertainment', 's', 'mmorpg', 'world', 'of', 'warcraft', '1', 'q708', 'what', 'does', 'leeroy', 'jenkins', 'mean', 'd690', 'leeroy', 'jenkins', 'd690', '2', 'the', 'character', 'became', 'popular', 'due', 'to', 'a', 'video', 'of', 'the', 'game', 'that', 'circulated', 'around', 'the', 'internet', '0', 'q708', 'what', 'does', 'leeroy', 'jenkins', 'mean', 'd690', 'leeroy', 'jenkins', 'd690', '3', 'the', 'phenomenon', 'has', 'since', 'spread', 'beyond', 'the', 'boundaries', 'of', 'the', 'gaming', 'community', 'into', 'other', 'online', 'and', 'mainstream', 'media', '0', 'q709', 'what', 'happened', 'to', 'the', 'officer', 'in', 'bart', 'shooting', 'd691', 'bart', 'police', 'shooting', 'of', 'oscar', 'grant', 'd691', '0', 'oscar', 'grant', 'was', 'fatally', 'shot', 'by', 'bart', 'police', 'officer', 'johannes', 'mehserle', 'in', 'oakland', 'california', 'united', 'states', 'in', 'the', 'early', 'morning', 'hours', 'of', 'new', 'year', 's', 'day', '2009', '0', 'q709', 'what', 'happened', 'to', 'the', 'officer', 'in', 'bart', 'shooting', 'd691', 'bart', 'police', 'shooting', 'of', 'oscar', 'grant', 'd691', '1', 'responding', 'to', 'reports', 'of', 'a', 'fight', 'on', 'a', 'crowded', 'bay', 'area', 'rapid', 'transit', 'train', 'returning', 'from', 'san', 'francisco', 'bart', 'police', 'officers', 'detained', 'oscar', 'grant', 'and', 'several', 'other', 'passengers', 'on', 'the', 'platform', 'at', 'the', 'fruitvale', 'bart', 'station', '0', 'q709', 'what', 'happened', 'to', 'the', 'officer', 'in', 'bart', 'shooting', 'd691', 'bart', 'police', 'shooting', 'of', 'oscar', 'grant', 'd691', '2', 'officer', 'johannes', 'mehserle', 'and', 'another', 'officer', 'were', 'restraining', 'grant', 'who', 'was', 'prostrate', 'and', 'allegedly', 'resisting', 'arrest', '0', 'q709', 'what', 'happened', 'to', 'the', 'officer', 'in', 'bart', 'shooting', 'd691', 'bart', 'police', 'shooting', 'of', 'oscar', 'grant', 'd691', '3', 'officer', 'mehserle', 'stood', 'and', 'according', 'to', 'witnesses', 'said', 'get', 'back', 'i', 'm', 'gonna', 'tase', 'him']

## Testing the model on new data

The testing of the data can be done on completely unseen data using `model.predict(queries, docs)` where
queries: list of list of words
docs: list of list of list of words

In [None]:
# Example:
queries = ["how are glacier caves formed ?".split()]
docs = ["A partly submerged glacier cave on Perito Moreno Glacier".split(),
        "A glacier cave is a cave formed within the ice of a glacier".split()]

In [None]:
drmm_tks_model.predict(queries, docs)

As can be seen above, the correct answer has the higher similarity score.