Exploration of Document Similarity Models
Abstract:
The need to develop robust document/text similarity measures is an essential step in building applications such as recommendation systems, search engines and information retrieval systems, as well as other ML/AI applications such as news aggregators (which require document clustering) or automated recruitment systems that match CVs to job specifications. In general, text similarity is the measure of how lexically and semantically close words/tokens, tweets, phrases, sentences, paragraphs and entire documents are to each other. Texts/words are lexically similar if they have a similar character sequence or structure, and semantically similar if they have the same meaning, describe similar concepts and are used in the same context.
This work will demonstrate a number of strategies for feature extraction, i.e. transforming documents into numeric feature vectors. This transformation step is a prerequisite for computing the similarity between documents. Each strategy typically involves 3 steps, namely: 1) the use of standard natural language pre-processing techniques to prepare/clean the documents, 2) the transformation of the document text into numeric feature vectors, and 3) the calculation of the document similarity matrix using the Cosine or some other similarity metric.
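The 3 steps above can be sketched as follows. This is a minimal illustration using scikit-learn's TF-IDF vectorizer and cosine similarity; the tiny in-line corpus and the trivial lower-casing step are contrived for demonstration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: pre-process/clean the documents (here just lower-casing;
# a real pipeline would use richer NLTK-based cleaning)
docs = ["The cat sat on the mat", "A cat sat on a mat", "Stock markets fell sharply"]
docs = [d.lower() for d in docs]

# Step 2: transform the document text into numeric feature vectors
vectorizer = TfidfVectorizer()
feature_matrix = vectorizer.fit_transform(docs)

# Step 3: compute the pairwise document similarity matrix
similarity_matrix = cosine_similarity(feature_matrix)
print(similarity_matrix.shape)  # (3, 3)
```

The first two documents share most of their words, so their similarity score is higher than that of either one against the third, unrelated document.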
Strategies and associated ML/NLP libraries that will be presented are:
- document pre-processing using NLTK library
- feature extraction using Term frequency – Inverse Document Frequency (TF-IDF) with the aid of the Scikit-Learn library
- feature extraction using a pre-trained GloVe embedding with the aid of the Spacy NLP Library
- feature extraction using a trained Word2Vec embedding from scratch with the aid of the Gensim NLP library
- feature extraction using a trained Doc2Vec embedding from scratch also with the aid of the Gensim library
Models developed from these strategies will be compared against each other on the Kaggle competition problem of identifying duplicate questions using the Quora dataset. These models will also be used for computing the document similarity of the popular NLP 20 Newsgroups problem.
Note that this project is currently work-in-progress. Completed parts of this project are:
- TFIDF Feature vector Extractor component
- Cosine Similarity Measure
- Visualization component including options to reduce the feature matrix dimensions to 2D using PCA, MDS, T-SNE and UMAP techniques
- Plots of the Feature Matrix and the Similarity Matrix Heatmap
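The dimensionality-reduction step of the visualization component can be sketched as follows. This uses PCA from scikit-learn on a made-up random feature matrix standing in for the TF-IDF output; t-SNE, MDS and UMAP are drop-in alternatives with the same fit/transform shape:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 24 documents x 100 TF-IDF features
rng = np.random.default_rng(0)
feature_matrix = rng.random((24, 100))

# Reduce the feature matrix to 2D so each document becomes a plottable point
coords_2d = PCA(n_components=2).fit_transform(feature_matrix)
print(coords_2d.shape)  # (24, 2)
```

The resulting 2D coordinates can then be scatter-plotted, while the similarity matrix itself is plotted directly as a heatmap.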
Demos of the above features to solve 2 problems, namely:
- A toy/contrived problem which demonstrates how to compute the similarity between documents in a corpus of 24 documents (book titles), using data provided in this blog
- The popular NLP 20 Newsgroups problem, whose dataset comprises around 18,000 newsgroup posts on 20 topics
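The 20 Newsgroups corpus ships with scikit-learn, so loading it for the demo is a one-liner (the dataset is downloaded on first use; the `remove` argument, used here to strip metadata that leaks the topic, is optional):

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch the training split of the ~18,000-post, 20-topic corpus
newsgroups = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
print(len(newsgroups.data), len(newsgroups.target_names))
```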
Pending items for this project include:
- build a document feature extraction model using a pre-trained GloVe embedding
- build a document feature extraction model using a Word2Vec embedding trained from scratch
- build a document feature extraction model using a Doc2Vec embedding trained from scratch
- compare the performance of models
- add the Quora duplicate questions problem (see this Kaggle link) and apply the above models to it
Although this project is currently work-in-progress, the completed components of the project include:
- GenericDataSerializerComponent.py: used for getting serialised problem corpus data
- ProblemSpecificationInterface.py: used as an interface abstraction to specify a document similarity problem
- NLPEngineComponent.py: used as an NLP pre-processing module for cleaning the raw corpus of text
- DocumentFeatureExtractionInterface.py: used as an interface abstraction to specify a number of feature extraction models
- TFIDFDocmentVectorExtractor.py: used to build a TF-IDF based feature vector extraction model
- DocumentFeatureVisualization.py: used to reduce the dimensions of the feature matrix with techniques such as PCA, t-SNE, MDS and UMAP. It also provides the visualization infrastructure to plot the similarity heatmap and to visualize documents in 2D space
You can further explore the details of the project by running this Jupyter notebook
To install the Python package dependencies required to run this notebook, I advise that you create a virtual environment and install from the requirements.txt file using this pip command:
- pip3 install -r requirements.txt
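For example, on Linux/macOS the virtual environment can be created and activated before installing (the environment name `venv` is an arbitrary choice):

```shell
# Create and activate an isolated virtual environment
python3 -m venv venv
source venv/bin/activate

# Install the project's dependencies into it
pip3 install -r requirements.txt
```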