text2topics

Insight Artificial Intelligence, Session 2019B

Background

As an AI fellow at Insight Data Science, I completed a consulting project with Ping, a company that offers automated timekeeping software to law firms.

I built this topic modeling package to help them more quickly and accurately analyze large amounts of unstructured text in the form of emails, documents, and memos that lawyers write each day.

This program leverages the power of natural language processing (NLP) and unsupervised learning to build a topic model that accurately identifies and represents the key topics in unstructured text.

Specifically, it is designed to take a set of documents and generate a topic model specific to the legal domain. The resulting topic model can provide insights into the nature of an existing corpus and can also be used to perform inference, that is, to identify the key topics in an unseen document.

Additional Info: Slides

Setup

Clone this repository to your local machine, then create and activate a conda environment:

conda create -n text2topics python=3.7
conda activate text2topics
cd text2topics

Install required packages:

pip install -r requirements.txt

See the examples directory for a step-by-step usage example in a notebook.

Advanced Usage

If this is your first time running the program, you must first download the required models. Note: this may take some time (the download is 788 MB).

python config.py

Example 1

The following command will:

load raw text data
clean and process it
iterate over different versions of the model

Use the resulting plot and the elbow method to identify the best number of topics.

cd source
python run.py \
    --raw_file ../data/raw_docs.pkl \
    --clean_path ../data/ \
    --results_path ../results/ \
    --clean_from_raw 1 \
    --iterate 1 \
    --use_lda 0

Example 2

The following command will:

load already cleaned data
perform the embedding + clustering procedure with a user-provided number of topics
generate word clouds for each of the identified topics and save them to the results folder

Use the resulting model and word clouds to explore the nature of the topics and documents.

cd source
python run.py \
    --raw_file ../data/raw_docs.pkl \
    --clean_path ../data/ \
    --results_path ../results/ \
    --clean_from_raw 0 \
    --iterate 0 \
    --use_lda 0

Example 3

To perform the above examples using LDA rather than the embeddings + clustering approach, pass the argument:

    --use_lda 1 \
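For readers who want to adapt the command line, here is a minimal sketch of how the flags shown in the examples could be parsed with argparse. It is an illustration only: the function name `build_parser`, the default values, and the choice to restrict the switches to 0 or 1 are assumptions, and the actual run.py may handle its arguments differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags used in the examples."""
    parser = argparse.ArgumentParser(description="Topic modeling runner (sketch)")
    # Path arguments; the defaults are assumptions copied from the examples.
    parser.add_argument("--raw_file", default="../data/raw_docs.pkl")
    parser.add_argument("--clean_path", default="../data/")
    parser.add_argument("--results_path", default="../results/")
    # On/off switches are passed as 0 or 1, as in the examples.
    for flag in ("--clean_from_raw", "--iterate", "--use_lda"):
        parser.add_argument(flag, type=int, choices=(0, 1), default=0)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--clean_from_raw", "1", "--iterate", "1"])
print(bool(args.clean_from_raw), bool(args.iterate), bool(args.use_lda))
```

Restricting each switch to `choices=(0, 1)` makes argparse reject a typo such as `--use_lda 2` with a clear error instead of silently treating it as true.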

How It Works

Step 1: Use embedding model to vectorize words in documents.

This program uses the pretrained GloVe word vectors that ship with spaCy's en_core_web_lg model (trained on Common Crawl). It generates a 300-dimensional vector for each unique word in the corpus.

Word embedding models are well suited to topic modeling because they capture semantic and contextual information that bag-of-words (BOW) methods cannot. Semantic relationships between words are preserved as spatial relationships in the embedding space: words with similar meanings lie close together.
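As a rough illustration of this step, the sketch below collects one vector per unique word using spaCy's token attributes. The function name `embed_vocabulary` and the filtering rules (alphabetic tokens only, stop words dropped) are assumptions made for this example and are not necessarily what the package does.

```python
def embed_vocabulary(texts, nlp):
    """Map each unique lowercased word in `texts` to an embedding vector.

    `nlp` is expected to be a loaded spaCy pipeline that includes word vectors.
    """
    vectors = {}
    for doc in nlp.pipe(texts):
        for token in doc:
            # Skip punctuation and numbers, stop words, and words without a vector.
            if not token.is_alpha or token.is_stop or not token.has_vector:
                continue
            # Keep the vector of the first occurrence of each lowercased word.
            vectors.setdefault(token.lower_, token.vector)
    return vectors


# Usage (requires spaCy and the en_core_web_lg model to be installed):
#   import spacy
#   nlp = spacy.load("en_core_web_lg")
#   vectors = embed_vocabulary(["The tenant signed the lease agreement."], nlp)
#   # each value in `vectors` is a 300-dimensional NumPy array
```

Passing the pipeline in as an argument keeps the function independent of how the model is loaded, so it can also be exercised with a small stand-in object.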

Step 2: K-Means Clustering

Once the words have been embedded, perform k-means clustering on the vectors. Each resulting cluster groups semantically related words and is treated as a topic.
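To make the clustering step concrete, here is a small, dependency-free sketch of k-means (Lloyd's algorithm) applied to toy 2-D "word vectors". The words, coordinates, and function names are invented for illustration. Real embeddings have 300 dimensions, and in practice a library implementation would normally be used; the README does not say which one this package relies on.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster `points` into k groups; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Inertia: total squared distance from each point to its centroid.
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


# Toy data: two well-separated groups of made-up word vectors.
words = ["contract", "lease", "tenant", "invoice", "payment", "refund"]
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, inertia = kmeans(points, k=2)
for topic in sorted(set(labels)):
    print(topic, [w for w, lab in zip(words, labels) if lab == topic])
```

The inertia returned alongside the labels is the quantity typically plotted against k when choosing the number of clusters.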

Step 2.5: Iterate

To determine the best "k" for the k-means clustering, fit the model for several values of k, plot the results, and use the elbow method to pick the value beyond which adding more clusters yields little improvement.

For example, in the plot for this project's corpus, the elbow suggests that 20 topics is a good choice.
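The elbow is usually judged by eye, but it can also be estimated programmatically. The sketch below uses one common heuristic: rescale both axes to the range 0 to 1, then pick the k whose point falls farthest below the straight line joining the first and last points. The function name `find_elbow` and the score values are made up for illustration; they are not output from this package.

```python
def find_elbow(ks, scores):
    """Return the k whose (k, score) point lies farthest below the chord
    joining the first and last points of a decreasing curve."""
    if len(ks) != len(scores) or len(ks) < 3:
        raise ValueError("need at least three (k, score) pairs")
    k_span = ks[-1] - ks[0]
    s_span = scores[0] - scores[-1]
    if k_span == 0 or s_span == 0:
        raise ValueError("k values and scores must each span a non-zero range")
    best_k, best_gap = ks[0], 0.0
    for k, s in zip(ks, scores):
        x = (k - ks[0]) / k_span       # 0 at the first k, 1 at the last
        y = (s - scores[-1]) / s_span  # 1 at the first score, 0 at the last
        gap = 1.0 - x - y              # proportional to the distance below the chord
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Synthetic clustering scores (for example, k-means inertia) for six values of k.
ks = [5, 10, 15, 20, 25, 30]
scores = [100.0, 70.0, 45.0, 25.0, 22.0, 20.0]
print(find_elbow(ks, scores))  # 20
```

For these invented numbers the curve flattens after k = 20, which is the value the heuristic returns. On real data it is still worth checking the plot, since noisy curves can have more than one plausible elbow.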
