Information Retrieval System

Motivation


Below, I discuss how I approached the Quora question pairs problem and the motivation for choosing a Siamese network architecture.

How and why I chose a Siamese-based network

  • Having worked on computer vision in domains such as medical imaging, biometrics, and document image super-resolution, I chose the architecture for the Quora question pairs dataset based specifically on my biometrics experience.
  • While working on medical imaging at IIT Mandi, I also did some work in biometrics, where the inputs were palm and knuckle print images of the left and right hands. Similarly, this problem takes question 1 and question 2 as a paired input, so a Siamese-based network was a natural fit.
  • Although the end task was to find, from a pool of questions, the top-3 questions closest to a user-supplied question, first classifying whether a given pair of questions is similar or dissimilar was essential. The dataset repeats the same question against many different questions, so the network learned meaningful, discriminative features that ultimately helped solve the primary task of producing top-3 suggestions.
  • A Siamese network takes two inputs and works by sharing weights between its twin branches. Here the inputs were question 1 and question 2 as tokenized vectors, each of length 103, with a binary label (0 or 1).
  • I tried a variety of architectures for the Siamese network. The network is organized into three functions: Embedding(), Middle(), and Predict(). The following are some of the architectural changes I made in the Middle() function (a sketch of the final architecture follows this list):
    • A fully Conv1D Siamese network, built from multiple combinations of Conv1D layers, which gave a validation accuracy of around 75%.
    • Conv1D + LSTM + Dense (in sequence) together with a MaxPool1D layer, which gave about a 2% boost in validation accuracy (~78%).
    • Finally, I added a MaxPool1D layer directly after the embedding layer output, to capture salient features rather than passing the embedding output straight to the conv or LSTM layers. This further improved performance: I achieved a training accuracy of roughly 98-99% and a validation accuracy of roughly 80-81%. The final architecture is simple and compact, consisting of just one max-pooling, one LSTM, and one Dense layer in the Middle() function.
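
Below is a minimal Keras sketch of the final Siamese architecture as described above. The sequence length (103), the MaxPool1D → LSTM → Dense ordering of Middle(), and the 128-dimensional Dense output follow the text; the vocabulary size, embedding dimension, LSTM width, and the absolute-difference Predict() head are assumptions for illustration, not values taken from the repository.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN = 103        # tokenized question length (from the text)
VOCAB_SIZE = 50000   # assumption
EMBED_DIM = 128      # assumption

def build_shared_tower():
    """Embedding() + Middle(): one weight-shared branch applied to both questions."""
    inp = layers.Input(shape=(SEQ_LEN,))
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inp)  # Embedding()
    x = layers.MaxPool1D(pool_size=2)(x)              # pooling right after the embedding
    x = layers.LSTM(64)(x)                            # assumption: 64 units
    x = layers.Dense(128, activation="relu")(x)       # 128-d feature vector
    return Model(inp, x, name="shared_tower")

tower = build_shared_tower()
q1 = layers.Input(shape=(SEQ_LEN,), name="question_1")
q2 = layers.Input(shape=(SEQ_LEN,), name="question_2")
f1, f2 = tower(q1), tower(q2)   # shared weights: same tower encodes both questions

# Predict(): combine the two feature vectors and classify similar / dissimilar.
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([f1, f2])
out = layers.Dense(1, activation="sigmoid")(diff)

siamese = Model([q1, q2], out)
siamese.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```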

Finding the top-3 closest matches for a given query question

  • The goal of this task was to find, from a pool of questions (the training data), the top-3 questions closest to the user-supplied query question (the validation data).
  • To accomplish this, I split the Siamese network and created a new model consisting of just the Embedding() and Middle() functions.
  • At test time, each input (training and validation questions, fed one at a time) passed through the embedding layer, and features were then extracted from the Middle() function, whose Dense layer outputs a 128-dimensional feature vector.
  • Finally, I saved these feature representations to disk, in particular the training feature maps, so that the predictions did not have to be recomputed again and again.
  • This feature-extraction process was computationally efficient, since features for all of the training data (~650K questions) were computed exactly once.
  • The feature maps were then reshaped into a two-dimensional array whose first dimension indexes all the training questions (both question 1 and question 2) and whose second dimension is the feature size (128), as shown in the sketch below.
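
A hedged sketch of this extract-once-and-cache step follows. It assumes the shared Embedding()+Middle() tower from the previous sketch was saved as tower.h5 and that the tokenized questions live in train_questions.npy; both file names are illustrative.

```python
import numpy as np
from tensorflow.keras.models import load_model

# Assumed artifacts: the shared tower saved earlier, and tokenized questions.
tower = load_model("tower.h5")                    # hypothetical file name
train_questions = np.load("train_questions.npy")  # shape (N, 103), int tokens

# One forward pass over the whole training pool -- computed only once.
train_features = tower.predict(train_questions, batch_size=1024)  # (N, 128)

# Cache to disk so the ~650K training questions never need re-encoding,
# then reshape into (num_questions, 128) for nearest-neighbour search.
np.save("train_features.npy", train_features)
train_features = np.load("train_features.npy").reshape(-1, 128)
```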

Reducing the search space for finding the top-3 closest matches to a query question

  • Finding the top-3 closest matches for a given query from a pool of 650K or more questions is time-consuming and computationally inefficient when done by brute force.
  • To reduce the search space, I used K-Means clustering, which cuts the search time by a large margin (see the sketch after this list).
  • For example, finding the top-3 closest matches for 100 questions from the pool of 650K took 313 seconds with the brute-force method, whereas with K-Means clustering it took only 200 seconds to find the top-3 closest matches for 1,000 questions from the same pool.
  • Moreover, clustering the data points did not affect the final results: the clustering approach was robust compared with brute force, since both methods returned the same top-3 closest matches for every query question.
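
The sketch below illustrates the K-Means search-space reduction: cluster the cached training features once, then compare each query only against the members of its own cluster instead of all ~650K questions. The number of clusters (100) and the Euclidean distance metric are illustrative assumptions, not values taken from the repository.

```python
import numpy as np
from sklearn.cluster import KMeans

# Cached 128-d features from the extraction step above.
train_features = np.load("train_features.npy")  # shape (N, 128)

# Partition the pool once; n_clusters=100 is an assumed value.
kmeans = KMeans(n_clusters=100, random_state=0).fit(train_features)

def top3_matches(query_feature: np.ndarray) -> np.ndarray:
    """Return indices of the 3 training questions nearest to one query feature."""
    # Restrict the search to the query's own cluster.
    cluster = kmeans.predict(query_feature.reshape(1, -1))[0]
    members = np.where(kmeans.labels_ == cluster)[0]
    # Euclidean distance to every member of that cluster only.
    dists = np.linalg.norm(train_features[members] - query_feature, axis=1)
    return members[np.argsort(dists)[:3]]
```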
