Here we will build a topic model with Gensim, using Latent Dirichlet Allocation (LDA) as the algorithm. We will apply unsupervised classification to the consumer complaint narrative column of the dataset we prepared in the DataExplore notebook. The goal is to derive our own categories from the data and see how well the existing categories line up with them. Each category is defined by a topic, which is a collection of recurring keywords.

In [1]:
# Make sure to have nltk and stopwords downloaded
import nltk; nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\danrl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Import needed packages

import re
import numpy
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy

import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
%matplotlib inline

import logging
logging.basicConfig(format="%(asctime)s: %(levelname)s : %(message)s", level=logging.ERROR)

import warnings 
warnings.filterwarnings("ignore",category=DeprecationWarning)

unable to import 'smart_open.gcs', disabling that module


In [3]:
# NLTK stop words
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [4]:
# Get data
df = pd.read_csv('../../student-loan-complaints-data/text_analysis_data.csv')
df.head()

Unnamed: 0,Date received,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company,State,Tags,Company response to consumer,Timely response?,Consumer disputed?,month,year
0,2020-05-19,Private student loan,Dealing with your lender or servicer,Received bad information about your loan,When I was applying for my loan my XXXX accoun...,"Figure Technologies, Inc",NJ,,Closed with explanation,Yes,,5,2020
1,2020-02-06,Federal student loan servicing,Incorrect information on your report,Account status incorrect,I'm on a deferred payment plan t never ; late,"Nelnet, Inc.",TX,,Closed with explanation,Yes,,2,2020
2,2020-02-08,Federal student loan servicing,Dealing with your lender or servicer,Problem with customer service,I have attempted multiple times to contact FED...,AES/PHEAA,KY,,Closed with non-monetary relief,Yes,,2,2020
3,2020-01-21,Federal student loan servicing,Dealing with your lender or servicer,Trouble with how payments are being handled,I was divorced in 2004 and I agreed to take th...,AES/PHEAA,OK,,Closed with explanation,Yes,,1,2020
4,2019-12-04,Federal student loan servicing,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,This particular account situation that is late...,AES/PHEAA,FL,,Closed with explanation,Yes,,12,2019


Now that we've imported the necessary packages and loaded the data, we will prepare the text so it can be fed into the model.

In [5]:
# A function to preprocess all rows of a text column (a pandas Series)
def preprocess_data(data):
    # Change all text to lowercase
    data = data.apply(lambda text: " ".join(word.lower() for word in text.split()))
    
    # Remove punctuation (raw string pattern; regex=True is required in recent pandas versions)
    data = data.str.replace(r"[^\w\s]", "", regex=True)
    
    # Remove stopwords
    from nltk.corpus import stopwords
    stop = stopwords.words("english")
    data = data.apply(lambda text: " ".join(word for word in text.split() if word not in stop))
    
    # Remove the 10 most common words across all complaints
    freq = pd.Series(" ".join(data).split()).value_counts()[:10]
    freq = list(freq.index)
    data = data.apply(lambda text: " ".join(word for word in text.split() if word not in freq))
    
    # Lemmatization
    from textblob import Word
    data = data.apply(lambda text: " ".join([Word(word).lemmatize() for word in text.split()]))
    
    # Return transformed data
    return data

# Preprocess the complaint narratives (the result is a pandas Series of cleaned strings)
data = preprocess_data(df["Consumer complaint narrative"])
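
The first three steps of preprocess_data can be checked on a single sentence without pandas or NLTK. The sketch below is only an illustration: clean_text and TOY_STOPWORDS are hypothetical names, and the tiny stop list stands in for NLTK's much longer English list.

```python
import re

# Tiny stand-in stop list for illustration only; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"i", "my", "the", "a", "to", "was", "for", "and"}

def clean_text(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip punctuation, and drop stop words from one string."""
    text = text.lower()
    # Keep only word characters and whitespace, the same pattern used in preprocess_data
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(word for word in text.split() if word not in stopwords)

print(clean_text("I called the servicer, and my payment was NOT applied!"))
# called servicer payment not applied
```

Note that "not" survives here only because it is missing from the toy stop list; NLTK's English list is longer, so the real output may differ.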

In [6]:
# Tokenize the data
data = [sub.split() for sub in data] 
print(data[:3])

[['applying', 'account', 'correctly', 'communicate', 'issue', 'offer', '025', 'rate', 'deduction', 'autopay', 'showing', 'account', 'told', 'go', 'application', 'anyway', 'account', 'opened', 'could', 'add', 'autopay', 'receive', 'discount', 'way', 'since', 'account', 'opened', 'called', 'call', 'center', 'least', '4', 'time', 'trying', 'receive', 'autopay', 'discount', 'first', '3', 'time', 'told', 'going', 'applied', 'still', 'seen', 'additionally', 'last', 'time', 'called', '3', 'week', 'ago', 'asked', 'speak', 'manager', 'told', 'take', '10', 'day', 'get', 'back', 'still', 'yet', 'hear', 'back', '15', 'business', 'day', 'later', 'told', 'receiving', 'autopay', 'discount', 'receiving', 'opened', 'account', 'company', 'lying', 'rate', 'going', 'receive', 'dont', 'autopay', 'initiate', '2', 'autopays', 'go', 'far', 'additional', 'issue', 'told', 'rate', 'going', 'based', '1', 'month', 'libor', 'rate', 'published', 'wsj', 'month', 'none', 'rate', 'received', 'thus', 'far', 'match', 'ra

Now that we've cleaned and tokenized the data, we need to create bigrams and trigrams. Bigrams are pairs of words that frequently appear together; trigrams are the same idea with three words. We will use Gensim's Phrases model to build both.
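
To get a feel for how a pair of words ends up being treated as a phrase, here is a simplified, pure-Python sketch of the kind of scoring rule that Phrases' default scorer is based on: pairs that occur together more often than their individual counts would suggest get a higher score, and only pairs above the threshold are merged. The function name score_bigrams and the toy documents are assumptions for illustration, and Gensim's own bookkeeping (for example how it counts vocabulary size) differs in detail.

```python
from collections import Counter

def score_bigrams(docs, min_count=1):
    """Score adjacent pairs as (count(a, b) - min_count) * V / (count(a) * count(b))."""
    unigrams = Counter(word for doc in docs for word in doc)
    pairs = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # V: number of distinct words in the toy corpus
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in pairs.items()
    }

# Made-up tokenized documents, not taken from the complaint data
toy_docs = [
    ["call", "center", "never", "answered"],
    ["call", "center", "was", "rude"],
    ["never", "got", "a", "call"],
]
scores = score_bigrams(toy_docs)
for pair, score in sorted(scores.items(), key=lambda item: -item[1])[:3]:
    print(pair, round(score, 3))
```

In this toy corpus only ("call", "center") occurs more than once, so it is the only pair with a positive score. On real data, raising threshold keeps fewer, more strongly associated pairs.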

In [7]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data], threshold=100)

bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data[0]]])

['applying', 'account', 'correctly', 'communicate', 'issue', 'offer', '025', 'rate', 'deduction', 'autopay', 'showing', 'account', 'told', 'go', 'application', 'anyway', 'account', 'opened', 'could', 'add', 'autopay', 'receive', 'discount', 'way', 'since', 'account', 'opened', 'called', 'call', 'center', 'least', '4', 'time', 'trying', 'receive', 'autopay', 'discount', 'first', '3', 'time', 'told', 'going', 'applied', 'still', 'seen', 'additionally', 'last', 'time', 'called', '3', 'week', 'ago', 'asked', 'speak', 'manager', 'told', 'take', '10', 'day', 'get', 'back', 'still', 'yet', 'hear', 'back', '15', 'business', 'day', 'later', 'told', 'receiving', 'autopay', 'discount', 'receiving', 'opened', 'account', 'company', 'lying', 'rate', 'going', 'receive', 'dont', 'autopay', 'initiate', '2', 'autopays', 'go', 'far', 'additional', 'issue', 'told', 'rate', 'going', 'based', '1', 'month', 'libor', 'rate', 'published', 'wsj', 'month', 'none', 'rate', 'received', 'thus_far', 'match', 'rate',

In [17]:
# As we can see above, our first attempt at lemmatizing the data didn't fully work
# (forms such as 'receiving' and 'opened' remain), so we will try again with spaCy.
# Here we define functions for bigrams, trigrams, and lemmatizing the data

def make_bigrams(data):
    return [bigram_mod[doc] for doc in data]

def make_trigrams(data):
    return [trigram_mod[bigram_mod[doc]] for doc in data]

def lemmatization(data, allowed_postags=["NOUN", "ADJ", "VERB", "ADV"]):
    complaints_out = []
    for complaint in data:
        doc = nlp(" ".join(complaint))
        complaints_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return complaints_out

In [18]:
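# Now we call the functions we built above
data_words_bigrams = make_bigrams(data)
# The "en" shortcut was removed in spaCy 3; load the model by its full package name instead
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])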
data_lemmatized = lemmatization(data_words_bigrams)

print(data_lemmatized[:1])

[['apply', 'account', 'correctly', 'communicate', 'issue', 'offer', 'rate', 'deduction', 'autopay', 'show', 'account', 'tell', 'go', 'application', 'account', 'open', 'could', 'add', 'autopay', 'receive', 'discount', 'way', 'account', 'open', 'call', 'center', 'least', 'time', 'try', 'receive', 'autopay', 'discount', 'first', 'time', 'tell', 'go', 'apply', 'still', 'see', 'additionally', 'last', 'time', 'call', 'week', 'ask', 'manager', 'tell', 'take', 'day', 'get', 'back', 'still', 'yet', 'hear', 'back', 'business', 'day', 'later', 'tell', 'receive', 'autopay', 'discount', 'receiving', 'open', 'account', 'company', 'lie', 'rate', 'go', 'receive', 'autopay', 'autopay', 'go', 'far', 'additional', 'issue', 'tell', 'rate', 'go', 'base', 'month', 'rate', 'publish', 'month', 'none', 'rate', 'receive', 'thus_far', 'match', 'rate', 'really', 'know', 'try', 'contact', 'many', 'time', 'people', 'phone', 'seem', 'helpful', 'time', 'talk', 'seem', 'get', 'do']]


At this point we've created the bigrams and lemmatized the text. Now we need to create the dictionary and corpus required for topic modeling.
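
Before using Gensim for this, it may help to see what a dictionary and a bag-of-words corpus are in plain Python. The sketch below is a toy stand-in, not Gensim's implementation: build_vocab and to_bow are hypothetical helpers, and the way ids are assigned here (alphabetical order) is an assumption made for simplicity.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id (alphabetical order, for simplicity)."""
    tokens = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(tokens)}

def to_bow(doc, vocab):
    """Turn one tokenized document into sorted (word_id, word_frequency) pairs."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())

# Made-up tokenized documents for illustration
toy_docs = [["account", "autopay", "account"], ["rate", "autopay"]]
vocab = build_vocab(toy_docs)
bow = [to_bow(doc, vocab) for doc in toy_docs]

print(vocab)  # {'account': 0, 'autopay': 1, 'rate': 2}
print(bow)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document becomes a list of (word_id, word_frequency) pairs, which is the same shape that doc2bow produces in the next cell.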

In [23]:
# Create dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
complaints = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(complaint) for complaint in complaints]

print("Corpus format: (word_id, word_frequency)")
print(corpus[:1])
print()
print("Readable version of term-frequency:")
print([[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]])

Corpus format: (word_id, word_frequency)
[[(0, 5), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 6), (8, 2), (9, 1), (10, 1), (11, 2), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 3), (21, 1), (22, 1), (23, 1), (24, 2), (25, 5), (26, 1), (27, 1), (28, 2), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 2), (38, 1), (39, 1), (40, 3), (41, 1), (42, 1), (43, 1), (44, 6), (45, 1), (46, 5), (47, 1), (48, 1), (49, 2), (50, 1), (51, 2), (52, 1), (53, 1), (54, 5), (55, 1), (56, 5), (57, 2), (58, 1), (59, 1), (60, 1)]]

Readable version of term-frequency:
[[('account', 5), ('add', 1), ('additional', 1), ('additionally', 1), ('application', 1), ('apply', 2), ('ask', 1), ('autopay', 6), ('back', 2), ('base', 1), ('business', 1), ('call', 2), ('center', 1), ('communicate', 1), ('company', 1), ('contact', 1), ('correctly', 1), ('could', 1), ('day', 2), ('deduction', 1), ('discount', 3), ('do', 1), ('far', 1), ('first', 1), ('get', 2

Now that we've prepared the data and have everything we need to train the LDA model, we will build it. 

In [26]:
# Building the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha="auto",
                                            per_word_topics=True)

In [27]:
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.196*"repay" + 0.170*"bankruptcy" + 0.132*"daughter" + 0.058*"timely" + '
  '0.039*"daily" + 0.033*"xxxxxxxxxxxxxxxx" + 0.030*"official" + 0.028*"split" '
  '+ 0.028*"overpay" + 0.025*"supply"'),
 (1,
  '0.149*"information" + 0.094*"contact" + 0.067*"company" + 0.066*"number" + '
  '0.062*"provide" + 0.045*"regard" + 0.031*"attempt" + 0.028*"name" + '
  '0.024*"give" + 0.023*"list"'),
 (2,
  '0.206*"receive" + 0.148*"send" + 0.097*"letter" + 0.096*"request" + '
  '0.095*"email" + 0.066*"state" + 0.040*"form" + 0.032*"mail" + '
  '0.030*"response" + 0.025*"write"'),
 (3,
  '0.234*"default" + 0.190*"collection" + 0.108*"nelnet" + 0.094*"agency" + '
  '0.085*"agreement" + 0.042*"education" + 0.034*"rehabilitation" + '
  '0.025*"dept" + 0.024*"mine" + 0.013*"tax"'),
 (4,
  '0.102*"plan" + 0.100*"repayment" + 0.084*"income" + 0.066*"program" + '
  '0.053*"forgiveness" + 0.042*"year" + 0.040*"base" + 0.038*"qualify" + '
  '0.034*"service" + 0.028*"pslf"'),
 (5,
  '0.469*"account" + 

In [32]:
# Compute perplexity (log_perplexity returns a per-word log-likelihood bound, hence the negative value)
print("Perplexity:", lda_model.log_perplexity(corpus))

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence="c_v")
coherence_lda = coherence_model_lda.get_coherence()
print("\nCoherence score:", coherence_lda)

Perplexity: -9.206020033247555

Coherence score: 0.38448119717595325
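
The value printed as "Perplexity" is negative because log_perplexity reports a per-word likelihood bound rather than the perplexity itself. As far as we understand Gensim's behavior, the corresponding perplexity estimate is 2 raised to the negative of that bound. The helper name bound_to_perplexity below is our own, so treat this as a sketch of the conversion rather than part of the Gensim API.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# The bound printed in the cell above
print(bound_to_perplexity(-9.206020033247555))
```

For the bound above this gives a perplexity of about 590. Lower perplexity is better, but the coherence score is usually the more interpretable of the two when comparing topic models.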


The model has been built, and its topics can be seen in the print_topics output above. For each topic, the output lists the top keywords together with their weights, which indicate how strongly each word contributes to that topic. There are 20 topics in total (0 to 19).
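
Since the goal is to compare these topics with the existing categories, one possible next step is to assign each complaint its highest-probability topic and count how those topics line up with the Issue column. The sketch below uses made-up topic distributions and labels purely for illustration; dominant_topic is a hypothetical helper. In the real notebook the (topic_id, probability) pairs could come from lda_model.get_document_topics(bow) for each document in corpus.

```python
from collections import Counter

def dominant_topic(topic_probs):
    """Return the topic id with the highest probability from (topic_id, prob) pairs."""
    return max(topic_probs, key=lambda pair: pair[1])[0]

# Made-up distributions and labels for illustration only
toy_doc_topics = [
    [(2, 0.70), (4, 0.30)],
    [(4, 0.55), (1, 0.45)],
    [(2, 0.90), (3, 0.10)],
]
toy_issues = [
    "Dealing with your lender or servicer",
    "Incorrect information on your report",
    "Dealing with your lender or servicer",
]

# Count how often each existing Issue label co-occurs with each dominant topic
table = Counter(
    (issue, dominant_topic(probs)) for issue, probs in zip(toy_issues, toy_doc_topics)
)
for (issue, topic), n in sorted(table.items()):
    print(f"{issue} -> topic {topic}: {n}")
```

If an existing category maps almost entirely onto one topic, the two agree; a category spread across many topics suggests the existing label is broader than the themes found in the text.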

Now we will visualize the topics and their keywords using pyLDAvis.

In [33]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

Using the visualization above:
The topics are shown on the left; the larger the bubble, the more prevalent the topic. The closer two bubbles are to each other, the more similar their topics. Moving the cursor over a bubble updates the words and bars on the right, which are the keywords for the selected topic.