aidowu1/Ades-NLP-Recepies

Ades-NLP-Recipies

Exploration of Document Similarity Models

Abstract: Robust document/text similarity measures are essential for building applications such as recommendation systems, search engines and information retrieval systems, as well as other ML/AI applications such as news aggregators (which require document clustering) and automated recruitment systems that match CVs to job specifications. In general, text similarity measures how close words/tokens, tweets, phrases, sentences, paragraphs and entire documents are to each other, both lexically and semantically. Texts are lexically similar if they have similar character sequences or structure, and semantically similar if they have the same meaning, describe similar concepts and are used in the same context.
This work demonstrates a number of strategies for feature extraction, i.e. transforming documents into numeric feature vectors. This transformation is a prerequisite for computing the similarity between documents. Typically each strategy involves three steps: 1) using standard natural language pre-processing techniques to prepare/clean the documents; 2) transforming the document text into a matrix of feature vectors (one vector per document); 3) calculating the document similarity matrix using cosine similarity or some other similarity metric.
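The three steps can be sketched from scratch using only the Python standard library. This is an illustrative sketch, not the project's implementation: the stop-word list, the example corpus and all function names are assumptions, and the IDF term uses one common smoothed variant.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "for"}


def preprocess(text):
    """Step 1: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_matrix(docs):
    """Step 2: map tokenised documents to TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in docs for t in set(doc))
    # Smoothed IDF: log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[t] * idf[t] for t in vocab])
    return vocab, rows


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(rows):
    """Step 3: pairwise cosine similarity between all document vectors."""
    return [[cosine(u, v) for v in rows] for u in rows]


# Hypothetical three-document corpus.
corpus = [
    "The cat sat on the mat",
    "A cat sat on the sofa",
    "Stock markets fell sharply today",
]
tokenised = [preprocess(d) for d in corpus]
vocab, features = tfidf_matrix(tokenised)
sims = similarity_matrix(features)
```

Here the first two documents share the terms "cat" and "sat", so their similarity is positive, while the third shares no terms with them and scores zero against both.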

Strategies and associated ML/NLP libraries that will be presented are:

  • document pre-processing using the NLTK library
  • feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF) with the aid of the scikit-learn library
  • feature extraction using a pre-trained GloVe embedding with the aid of the spaCy NLP library
  • feature extraction using a Word2Vec embedding trained from scratch with the aid of the Gensim NLP library
  • feature extraction using a Doc2Vec embedding trained from scratch, also with the aid of the Gensim library
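For the word-embedding strategies, one common way to obtain a document vector is to average the vectors of the document's words. The sketch below illustrates that idea with a hypothetical three-dimensional toy embedding table; real GloVe or Word2Vec vectors are far larger and would be loaded through spaCy or Gensim, and this averaging scheme is an assumption rather than the project's confirmed approach.

```python
import math

# Hypothetical toy embeddings (made-up values, for illustration only).
TOY_EMBEDDINGS = {
    "king": [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.9, 0.3],
    "fruit": [0.1, 0.8, 0.4],
}


def document_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; out-of-vocabulary tokens are skipped."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known token: fall back to a zero vector.
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


royal = document_vector(["king", "queen", "unknown"], TOY_EMBEDDINGS)
food = document_vector(["apple", "fruit"], TOY_EMBEDDINGS)
mixed = document_vector(["queen", "apple"], TOY_EMBEDDINGS)

royal_vs_food = cosine(royal, food)
royal_vs_mixed = cosine(royal, mixed)
```

With these toy values the "royal" document is closer to the "mixed" document than to the "food" document, because they share the word "queen". Averaging ignores word order, which is one reason to also consider Doc2Vec, where document vectors are learned directly.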

Models developed from these strategies will be compared against each other on the Kaggle competition problem of identifying duplicate questions using the Quora dataset. These models will also be used to compute document similarity for the popular NLP 20 Newsgroups problem.

Update Notes:

Note that this project is currently a work in progress. The completed parts of this project are:

  • TF-IDF feature vector extractor component
  • Cosine similarity measure
  • Visualization component, including options to reduce the feature matrix dimensions to 2D using PCA, MDS, t-SNE and UMAP techniques
  • Plots of the Feature Matrix and the Similarity Matrix Heatmap
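As a rough illustration of how these components fit together, the following sketch uses scikit-learn to extract TF-IDF features, compute the cosine similarity matrix and project the feature matrix to 2D with PCA. It assumes scikit-learn is installed; the four titles are hypothetical stand-ins, not the project's data, and the project's own components may be organised differently.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical book-title-style corpus (assumption for illustration).
titles = [
    "Machine learning for text analysis",
    "Deep learning for text classification",
    "Cooking Italian pasta at home",
    "Italian recipes for beginners",
]

# TF-IDF feature extraction; English stop words are removed.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(titles)  # sparse matrix: documents x terms

# Pairwise cosine similarity between all documents.
similarity = cosine_similarity(features)

# Reduce the feature matrix to two dimensions for plotting.
coords = PCA(n_components=2).fit_transform(features.toarray())
```

The `similarity` array is what a heatmap would display, and `coords` holds one 2D point per document for a scatter plot. MDS, t-SNE or UMAP could be substituted for PCA at the last step.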

The above features are demonstrated on two problems:

  • A toy/contrived problem which demonstrates how to compute the similarity between documents in a corpus of 24 documents (book titles), using data provided in this blog
  • The popular NLP 20 Newsgroups problem, whose dataset comprises around 18,000 newsgroup posts on 20 topics

Pending items for this project include:

  • build a document feature extraction model using a pre-trained GloVe embedding
  • build a document feature extraction model using a Word2Vec embedding trained from scratch
  • build a document feature extraction model using a Doc2Vec embedding trained from scratch
  • compare the performance of the models
  • add the Quora duplicate questions problem (see this Kaggle link) and apply the above models to it
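One possible way to compare models on a duplicate-question task is to threshold each model's similarity score and measure how often the decision matches the label. The sketch below is an assumption about how such a comparison could look, not the project's planned method: the labelled pairs are hypothetical, accuracy is chosen only for simplicity, and a Jaccard token overlap stands in for a real model's similarity function.

```python
def jaccard(a, b):
    """Stand-in similarity: overlap of the two questions' token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def accuracy(similarity_fn, labelled_pairs, threshold=0.5):
    """Fraction of pairs where (similarity >= threshold) agrees with the label."""
    correct = sum(
        (similarity_fn(q1, q2) >= threshold) == bool(is_duplicate)
        for q1, q2, is_duplicate in labelled_pairs
    )
    return correct / len(labelled_pairs)


# Hypothetical (question 1, question 2, is_duplicate) triples.
pairs = [
    ("how do i learn python", "how do i learn python quickly", 1),
    ("what is the capital of france", "how do i bake bread", 0),
    ("best way to lose weight", "what is the best way to lose weight", 1),
]

score = accuracy(jaccard, pairs)
strict_score = accuracy(jaccard, pairs, threshold=0.9)
```

Any function that takes two texts and returns a similarity score (for example, one built on TF-IDF or embedding vectors) could be passed in place of `jaccard`, so the same harness can score each model on the same pairs.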

Although this project is still a work in progress, you can explore its completed components in more detail by running this Jupyter notebook

To install the Python package dependencies required to run this notebook, I recommend creating a virtual environment and installing the packages listed in this requirements.txt file using this pip command:

  • pip3 install -r requirements.txt

About

Repo of NLP processing ML/AI focussed models, solutions and practical use cases
