Insight Artificial Intelligence, Session 2019B
As an AI fellow at Insight Data Science, I completed a consulting project with Ping, a company that offers automated timekeeping software to law firms.
I built this topic modeling package to help them more quickly and accurately analyze large amounts of unstructured text in the form of emails, documents, and memos that lawyers write each day.
This program leverages the power of natural language processing (NLP) and unsupervised learning to build a topic model that accurately identifies and represents the key topics in unstructured text.
Specifically, it is designed to take a set of documents and generate a topic model specific to the legal domain. The resulting topic model can provide insights into the nature of an existing corpus and can also be used to perform inference-- that is, to identify the key topics in an unseen document.
Additional Info: Slides
- Git clone this repository to local and create a virtual environment:
conda create -n text2topics python=3.7
conda activate text2topics
cd text2topics
- Install required packages:
pip install -r requirements.txt
- See the example directory for a step by step usage example in a notebook.
If first time running, you must download some important models. Note: this may take some time (it's 788 MB).
python config.py
The following command will:
- load raw text data
- clean and process it
- iterate over different versions of the model
Use the resulting plot and the elbow method to identify the best number of topics.
cd source
python run.py \
--raw_file ../data/raw_docs.pkl \
--clean_path ../data/ \
--results_path ../results/ \
--clean_from_raw 1 \
--iterate 1 \
--use_lda 0 \
The following command will:
- load already cleaned data
- perform embedding + clustering procedure with a user provided number of topics
- generate word clouds for each of the identified topics, saving to results folder
Use the resulting model and word clouds to explore the nature of the topics and documents.
cd source
python run.py \
--raw_file ../data/raw_docs.pkl \
--clean_path ../data/ \
--results_path ../results/ \
--clean_from_raw 0 \
--iterate 0 \
--use_lda 0 \
To perform the above examples using LDA rather than the embeddings + clustering approach, pass the argument:
--use_lda 1 \
This program uses a GloVe embedding model pretrained on Wikipedia corpus (Spacy's 'en_core_web_lg'). It generates 300dim vectors for each unique word in the corpus.
Word embedding models are good for topic modeling because they are able to capture semantic and contextual information that other BOW methods cannot. Semantic relationships between words are preserved via their spatial relationships in the embedding space.
Once the words have been embedded, perform k-means clustering on the vectors to identify words forming topic clusters.
To determine the best "k" for the K-Means clustering, plot the results for several different values and use the elbow method to identify the best.
For example, here, 20 topics is good.