Exploration of Document Similarity Models
Abstract:
The need to develop robust document/text similarity measures is an essential step in building applications such as recommendation systems, search engines and information retrieval systems, as well as other ML/AI applications such as news aggregators (which require document clustering) or automated recruitment systems that match CVs to job specifications. In general, text similarity is the measure of how lexically and semantically close words/tokens, tweets, phrases, sentences, paragraphs and entire documents are to each other. Texts/words are lexically similar if they have a similar character sequence or structure, and semantically similar if they have the same meaning, describe similar concepts and are used in the same context.
This work will demonstrate a number of strategies for feature extraction, i.e. transforming documents into numeric feature vectors. This transformation step is a prerequisite for computing the similarity between documents. Each strategy typically involves 3 steps, namely: 1) the use of standard natural language pre-processing techniques to prepare/clean the documents, 2) the transformation of the document text into numeric feature vectors, and 3) the calculation of the document similarity matrix using the Cosine or some other similarity metric.
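The 3 steps above can be sketched as follows. This is a minimal illustration using scikit-learn's TF-IDF vectorizer and cosine similarity; the tiny in-line corpus and the trivial lower-casing step are contrived for demonstration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: pre-process/clean the documents (here just lower-casing;
# a real pipeline would use richer NLTK-based cleaning)
docs = ["The cat sat on the mat", "A cat sat on a mat", "Stock markets fell sharply"]
docs = [d.lower() for d in docs]

# Step 2: transform the document text into numeric feature vectors
vectorizer = TfidfVectorizer()
feature_matrix = vectorizer.fit_transform(docs)

# Step 3: compute the pairwise document similarity matrix
similarity_matrix = cosine_similarity(feature_matrix)
print(similarity_matrix.shape)  # (3, 3)
```

The first two documents share most of their words, so their similarity score is higher than that of either one against the third, unrelated document.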
Strategies and associated ML/NLP libraries that will be presented are:
- document pre-processing using NLTK library
- feature extraction using Term frequency – Inverse Document Frequency (TF-IDF) with the aid of the Scikit-Learn library
- feature extraction using a pre-trained GloVe embedding with the aid of the Spacy NLP Library
- feature extraction using a trained Word2Vec embedding from scratch with the aid of the Gensim NLP library
- feature extraction using a trained Doc2Vec embedding from scratch also with the aid of the Gensim library
Models developed from these strategies will be compared against each other on the Kaggle competition problem of identifying duplicate questions using the Quora dataset. These models will also be used for computing the document similarity of the popular NLP 20 Newsgroups problem.
Note that this project is currently work-in-progress. Completed parts of this project are:
- TFIDF Feature vector Extractor component
- Cosine Similarity Measure
- Visualization component including options to reduce the feature matrix dimensions to 2D using PCA, MDS, T-SNE and UMAP techniques
- Plots of the Feature Matrix and the Similarity Matrix Heatmap
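The dimensionality-reduction step of the visualization component can be sketched as follows. This uses PCA from scikit-learn on a made-up random feature matrix standing in for the TF-IDF output; t-SNE, MDS and UMAP are drop-in alternatives with the same fit/transform shape:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 24 documents x 100 TF-IDF features
rng = np.random.default_rng(0)
feature_matrix = rng.random((24, 100))

# Reduce the feature matrix to 2D so each document becomes a plottable point
coords_2d = PCA(n_components=2).fit_transform(feature_matrix)
print(coords_2d.shape)  # (24, 2)
```

The resulting 2D coordinates can then be scatter-plotted, while the similarity matrix itself is plotted directly as a heatmap.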
Demos of the above features to solve 2 problems, namely:
- A toy/contrived problem which demonstrates how to compute the similarity between documents in a corpus of 24 documents (book titles), using data provided in this blog
- The popular NLP 20 Newsgroups problem, whose dataset comprises around 18,000 newsgroup posts on 20 topics
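The 20 Newsgroups corpus ships with scikit-learn, so loading it for the demo is a one-liner (the dataset is downloaded on first use; the `remove` argument, used here to strip metadata that leaks the topic, is optional):

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch the training split of the ~18,000-post, 20-topic corpus
newsgroups = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
print(len(newsgroups.data), len(newsgroups.target_names))
```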
Pending items for this project include:
- build a document feature extraction model using a pre-trained GloVe embedding
- build a document feature extraction model using a Word2Vec embedding trained from scratch
- build a document feature extraction model using a Doc2Vec embedding trained from scratch
- compare the performance of models
- add the Quora duplicate questions problem (see this Kaggle link) and apply the above models to it
Although this project is currently work-in-progress, the completed components of the project include:
- GenericDataSerializerComponent.py: used for getting serialised problem corpus data
- ProblemSpecificationInterface.py: used as an interface abstraction to specify a document similarity problem
- NLPEngineComponent.py: used as an NLP pre-processing module for cleaning the raw corpus of text
- DocumentFeatureExtractionInterface.py: used as an interface abstraction to specify a number of feature extraction models
- TFIDFDocmentVectorExtractor.py: used to build a TF-IDF based feature vector extraction model
- DocumentFeatureVisualization.py: used to reduce the dimensions of the feature matrix with techniques such as PCA, t-SNE, MDS and UMAP. It also provides the visualization infrastructure to plot the similarity heatmap and to visualize documents in 2D space
You can further explore the details of the project by running this Jupyter notebook
To install the Python package dependencies required to run this notebook, I advise that you create a virtual environment and install from the requirements.txt file using this pip command:
- pip3 install -r requirements.txt
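For example, on Linux/macOS the virtual environment can be created and activated before installing (the environment name `venv` is an arbitrary choice):

```shell
# Create and activate an isolated virtual environment
python3 -m venv venv
source venv/bin/activate

# Install the project's dependencies into it
pip3 install -r requirements.txt
```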