
Topic Modeling using BERT Embedding on Job Description Dataset

The goal of this project is to cluster jobs based on their descriptions. It uses classical NLP techniques as well as state-of-the-art deep learning approaches.

Keywords: LDA, Transformers, K-means, TF-IDF, Word Embedding

Quick run through Docker image

A Flask app that encapsulates the whole project is available as a Docker image. This image shows results on the whole dataset.

To pull Docker image:

docker pull anumafzal/topic-model-employee-objective

To run the pulled image:

docker run --rm -it -p 5000:5000/tcp anumafzal/topic-model-employee-objective:latest

Then open http://127.0.0.1:5000/ or http://0.0.0.0:5000/ in your browser.

If you are interested in the models used, there is a Jupyter notebook that can be used for experimentation.

1) Datasets:

2) Directory Structure:

  • merck-employee-objectives

    • data (folder):

      • All the data files, such as the job description dataset and the Merck objective dataset, are stored in this folder.
    • flaskApp (folder):

      • This folder contains all the files needed for the Flask app.
    • models (folder):

      This folder contains the models that are used throughout the framework (see the sketch after this list for one way they could fit together).
      • autoencoder_model.py: Contains an autoencoder used to learn the latent representations of the features.
      • embeddings_models.py: All functions related to transformer models are found in this file.
      • topic_model.py: This file contains a topic model class which is the backbone of this framework.
    • utilities (folder):

      All utility functions can be found in this folder
    • app.py (file):

    The starting point for the Flask app.
    • Dockerfile (file):

    Dockerfile for creating a Docker image.
    • enviroment.yml (file):

    File to create a conda environment for running this project. Use the following command to create the environment:

    conda env create -f enviroment.yml

    • README.md (file):

    This file.
    • topic-modeling-playground-colab.ipynb (Notebook):

    Advanced-level notebook for doing analytics on the dataset.
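
As a rough orientation, the sketch below shows one way the pieces in the models folder could be combined: Sentence-BERT embeddings (cf. embeddings_models.py), optional dimensionality reduction (cf. autoencoder_model.py), and K-means clustering with TF-IDF topic words. The input file path and column names are hypothetical, and this is not the exact API of topic_model.py.

```python
# Minimal sketch of the modeling pipeline, not the exact API of this repository.
# Assumes sentence-transformers and scikit-learn are installed; the input file and
# column names below are hypothetical.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("data/job_descriptions.csv")            # hypothetical path
texts = df["description"].astype(str).tolist()           # hypothetical column

# 1) Sentence-BERT embeddings (cf. embeddings_models.py)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, show_progress_bar=True)

# 2) Cluster the embeddings (an autoencoder could first reduce their
#    dimensionality, cf. autoencoder_model.py)
kmeans = KMeans(n_clusters=10, random_state=42).fit(embeddings)
df["cluster"] = kmeans.labels_

# 3) Describe each cluster by its top TF-IDF terms (one way to obtain topic words)
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(texts)
terms = vectorizer.get_feature_names_out()
for c in sorted(df["cluster"].unique()):
    mean_scores = tfidf[(df["cluster"] == c).to_numpy()].mean(axis=0).A1
    top_terms = [terms[i] for i in mean_scores.argsort()[::-1][:10]]
    print(f"Cluster {c}: {', '.join(top_terms)}")
```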

3) Code Entry point:

There are 2 ways to use this framework.

  • Flask app, which provides a GUI to interact with the backend; simply run the following command from the project directory. This is recommended if you are a non-technical user or just want to quickly get an insight into the framework. A hedged sketch of such an entry point is shown after this list.

    flask run

  • Jupyter Notebook (topic-modeling-playground-colab.ipynb), which provides the same insight as the Flask app but allows you to dive into the analytics and play around.
    Just run the notebook using JupyterLab. You can also run it with Google Colab if you want everything to run faster, but for a small dataset a regular computer is also fine.
    The notebook has comments explaining which extra cells need to be run in the case of Colab.
    Make sure you upload the whole directory to your Google Drive before running the notebook, as it looks for all the Python packages there.
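
For orientation only, here is a minimal sketch of what a Flask entry point could look like; the actual app.py in this repository wires the GUI to the topic model and will differ, so treat every route and template name here as an assumption.

```python
# Hedged sketch of a minimal Flask entry point, not the actual app.py of this repo.
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")
def index():
    # The real app would load the dataset, run the topic model and
    # pass the resulting visualizations to the template.
    return render_template("index.html")   # "index.html" is a hypothetical template

if __name__ == "__main__":
    # Matches the port exposed by the Docker image (5000).
    app.run(host="0.0.0.0", port=5000)
```

Recent Flask versions discover app.py automatically when you call `flask run`; otherwise set FLASK_APP=app.py first.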

4) Visualizations / Evaluations:

Wordcloud

  • Each image (wordcloud) represents a cluster of employees with similar objectives.
  • The words in each wordcloud are the main topics discussed by the employees belonging to that cluster.
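
Such per-cluster word clouds can be produced with the wordcloud package, roughly as sketched below; the exact styling used in the notebook and Flask app may differ, and the `df` columns are assumptions carried over from the pipeline sketch above.

```python
# Sketch: one word cloud per cluster, assuming `df` has "cluster" and "description" columns.
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

for c in sorted(df["cluster"].unique()):
    text = " ".join(df.loc[df["cluster"] == c, "description"].astype(str))
    wc = WordCloud(width=800, height=400, background_color="white",
                   stopwords=STOPWORDS).generate(text)
    plt.figure(figsize=(8, 4))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Cluster {c}")
    plt.show()
```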

PyLDAvis

  • This visualization is meant for the LDA model only.
  • It gives an in-depth analysis of the base model in terms of probabilities.
  • It includes information such as the words and the probability of them belonging to a cluster.
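
This kind of interactive view is typically produced with pyLDAvis on top of a gensim LDA model; a hedged sketch follows, where `tokenized_docs` is an assumed list of token lists and the notebook may use a different module (older pyLDAvis versions use pyLDAvis.gensim instead of pyLDAvis.gensim_models).

```python
# Sketch of a pyLDAvis visualization for a gensim LDA model.
# `tokenized_docs` (a list of token lists) is assumed to exist.
import gensim
import gensim.corpora as corpora
import pyLDAvis
import pyLDAvis.gensim_models   # pyLDAvis >= 3.x; older versions use pyLDAvis.gensim

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda = gensim.models.LdaModel(corpus=corpus, id2word=dictionary,
                             num_topics=10, random_state=42)

panel = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(panel, "lda_visualization.html")   # or pyLDAvis.display(panel) in a notebook
```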

Employee clusters in 2D

  • In this visualization, the employees are shown in 2D space.
  • Each dot in the graph represents an employee, and by hovering over it some basic information about the employee can be seen.
  • Each cluster has its own color, and the topic words for each cluster are shown on the top right.
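
An interactive 2D plot like this can be built with Plotly after projecting the embeddings to two dimensions, e.g. with t-SNE or UMAP; the sketch below uses t-SNE from scikit-learn and a hypothetical "employee_id" hover column.

```python
# Sketch: project embeddings to 2D and plot the clusters with hover information.
# `embeddings` and `df["cluster"]` are assumed from the pipeline sketch above.
import plotly.express as px
from sklearn.manifold import TSNE

coords = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
df["x"], df["y"] = coords[:, 0], coords[:, 1]

fig = px.scatter(df, x="x", y="y",
                 color=df["cluster"].astype(str),   # one color per cluster
                 hover_data=["employee_id"])        # hypothetical hover column
fig.show()
```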

Employee clusters in 3D

  • In this visualization, the employees are shown in 3D space.
  • Each dot in the graph represents an employee, and by hovering over it some basic information about the employee can be seen.
  • Each cluster has its own color, and the topic words for each cluster are shown on the top right.
  • It is also possible to rotate the space by left-clicking and dragging the mouse, and to zoom in by using the mouse wheel.
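
The 3D view works the same way with a three-component projection and px.scatter_3d; rotation and zoom come built in with Plotly's 3D scenes. As before, the hover column is a hypothetical name.

```python
# Sketch: same idea in 3D; dragging rotates the scene and the mouse wheel zooms.
import plotly.express as px
from sklearn.manifold import TSNE

coords3d = TSNE(n_components=3, random_state=42).fit_transform(embeddings)
df["x3"], df["y3"], df["z3"] = coords3d[:, 0], coords3d[:, 1], coords3d[:, 2]

fig = px.scatter_3d(df, x="x3", y="y3", z="z3",
                    color=df["cluster"].astype(str),
                    hover_data=["employee_id"])      # hypothetical hover column
fig.show()
```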

5) Good to Know

  • The framework operates on English text and hence ignores all text in other languages.
  • It concatenates all the objectives belonging to a user into one long objective as part of the preprocessing.
  • It drops all duplicated entries from the data as part of the preprocessing step.
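
A hedged sketch of these preprocessing steps (deduplication, language filtering, concatenation per user) is shown below; the langdetect-based filter, the file path, and the column names are assumptions, not the repository's exact code.

```python
# Sketch of the preprocessing described above; the path and the column names
# ("user_id", "objective") are hypothetical, as is the langdetect-based filter.
import pandas as pd
from langdetect import detect

def is_english(text: str) -> bool:
    try:
        return detect(text) == "en"
    except Exception:   # langdetect raises on empty or undetectable text
        return False

df = pd.read_csv("data/objectives.csv")                  # hypothetical path
df = df.drop_duplicates()                                # drop duplicated entries
df = df[df["objective"].astype(str).map(is_english)]     # keep English text only

# Concatenate all objectives belonging to a user into one long objective
df = (df.groupby("user_id")["objective"]
        .apply(" ".join)
        .reset_index())
```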
