This is the second notebook I'll be using for my LDA model on the customer complaints about student loans. I'm going to make a few adjustments and hopefully get a better score than I did with my last model. I'm using a separate notebook so I can compare this model and process to the last one without losing anything. The coherence score I got from the last model was 0.4441 (with 8 topics), which isn't all that good, so here I will see if I can get it above 0.6.

In [1]:
# Make sure to have nltk and stopwords downloaded
import nltk; nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\danrl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Import needed packages

import re
import numpy
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy

import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
%matplotlib inline

import logging
logging.basicConfig(format="%(asctime)s: %(levelname)s : %(message)s", level=logging.ERROR)

import warnings 
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [3]:
# NLTK stop words
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [4]:
# Get data
df = pd.read_csv('../../student-loan-complaints-data/text_analysis_data.csv')
df.head()

Unnamed: 0,Date received,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company,State,Tags,Company response to consumer,Timely response?,Consumer disputed?,month,year
0,2020-05-19,Private student loan,Dealing with your lender or servicer,Received bad information about your loan,When I was applying for my loan my XXXX accoun...,"Figure Technologies, Inc",NJ,,Closed with explanation,Yes,,5,2020
1,2020-02-06,Federal student loan servicing,Incorrect information on your report,Account status incorrect,I'm on a deferred payment plan t never ; late,"Nelnet, Inc.",TX,,Closed with explanation,Yes,,2,2020
2,2020-02-08,Federal student loan servicing,Dealing with your lender or servicer,Problem with customer service,I have attempted multiple times to contact FED...,AES/PHEAA,KY,,Closed with non-monetary relief,Yes,,2,2020
3,2020-01-21,Federal student loan servicing,Dealing with your lender or servicer,Trouble with how payments are being handled,I was divorced in 2004 and I agreed to take th...,AES/PHEAA,OK,,Closed with explanation,Yes,,1,2020
4,2019-12-04,Federal student loan servicing,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,This particular account situation that is late...,AES/PHEAA,FL,,Closed with explanation,Yes,,12,2019


Now that we've imported the necessary packages and loaded the data, we will prepare the text to feed into the model.

In [8]:
# A function to preprocess all rows in a dataframe
def preprocess_data(data):
    # Change all text to lowercase
    data = data.apply(lambda x: " ".join(x.lower() for x in x.split()))
    
    # Remove punctuation (regex=True is required on recent pandas versions)
    data = data.str.replace(r"[^\w\s]", "", regex=True)
    
    # Remove stopwords
    from nltk.corpus import stopwords
    stop = stopwords.words("english")
    data = data.apply(lambda x: " ".join(x for x in x.split() if x not in stop))
    
    # Remove redacted words containing "xx" (e.g. "xxxx") and purely numeric tokens
    data = data.apply(lambda x: " ".join(x for x in x.split() if "xx" not in x))
    data = data.apply(lambda x: " ".join(x for x in x.split() if not x.isnumeric()))
    # This round we decided not to remove common words
#     freq = pd.Series(" ".join(data).split()).value_counts()[:10]
#     freq = list(freq.index)
#     data = data.apply(lambda x: " ".join(x for x in x.split() if x not in freq))
    
    # Lemmatization
    from textblob import Word
    data = data.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
    
    # Return transformed data
    return data

# Preprocess the complaint narratives (the result is a pandas Series of cleaned strings)
data = preprocess_data(df["Consumer complaint narrative"])
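
To see what the cleaning steps do to a single complaint, here is a minimal sketch using only the standard library. The sample sentence, the `clean_text` helper and the tiny `toy_stop` set are made up for illustration; the notebook itself uses pandas and NLTK's full English stop word list.

```python
import re

# Hypothetical stand-in for NLTK's English stop word list
toy_stop = {"i", "on", "but", "me"}

def clean_text(text, stop_words):
    """Lowercase, strip punctuation, then drop stop words,
    redacted "xx" tokens and purely numeric tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # keep only word characters and whitespace
    return [
        word for word in text.split()
        if word not in stop_words and "xx" not in word and not word.isnumeric()
    ]

# Made-up complaint, not taken from the dataset
sample = "I paid XXXX on 12/01, but Navient charged me 300 dollars!"
print(clean_text(sample, toy_stop))
```

Note that removing punctuation turns "12/01" into "1201", which is then dropped by the numeric filter.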

In [9]:
# Tokenize the data
data = [sub.split() for sub in data] 
print(data[:3])

[['applying', 'loan', 'account', 'correctly', 'communicate', 'issue', 'offer', 'rate', 'deduction', 'autopay', 'showing', 'account', 'told', 'go', 'application', 'anyway', 'account', 'opened', 'could', 'add', 'autopay', 'receive', 'discount', 'way', 'since', 'account', 'opened', 'called', 'call', 'center', 'least', 'time', 'trying', 'receive', 'autopay', 'discount', 'first', 'time', 'told', 'going', 'applied', 'still', 'seen', 'additionally', 'last', 'time', 'called', 'week', 'ago', 'asked', 'speak', 'manager', 'told', 'would', 'take', 'day', 'get', 'back', 'still', 'yet', 'hear', 'back', 'business', 'day', 'later', 'told', 'would', 'receiving', 'autopay', 'discount', 'receiving', 'opened', 'account', 'company', 'lying', 'rate', 'loan', 'going', 'receive', 'dont', 'autopay', 'initiate', 'autopays', 'go', 'far', 'additional', 'issue', 'told', 'rate', 'going', 'based', 'month', 'libor', 'rate', 'published', 'wsj', 'month', 'none', 'rate', 'received', 'thus', 'far', 'match', 'rate', 'dont

Now that we've cleaned and tokenized the data, we need to create bigrams and trigrams. Bigrams are two words frequently seen paired together; trigrams are the same but with three words. We will use Gensim's Phrases model to build the bigrams and trigrams.
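
To get a feel for how a pair of words ends up being treated as a bigram, here is a rough sketch in plain Python. It scores adjacent word pairs in the spirit of Gensim's default scorer (pairs that co-occur more often than their individual counts would suggest get a higher score), but it is a simplification: the `score_bigrams` helper and the toy documents are made up, and Gensim's own bookkeeping differs in the details, so the numbers will not match what Phrases computes.

```python
from collections import Counter

def score_bigrams(docs, min_count=1):
    """Score each adjacent word pair: higher means the pair appears
    together more often than its individual word counts suggest."""
    word_counts = Counter(word for doc in docs for word in doc)
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(word_counts)
    scores = {}
    for (first, second), n_pair in pair_counts.items():
        if n_pair < min_count:
            continue  # too rare to be considered at all
        scores[(first, second)] = (
            (n_pair - min_count) * vocab_size
            / (word_counts[first] * word_counts[second])
        )
    return scores

# Made-up tokenized documents
toy_docs = [
    ["autopay", "discount", "late"],
    ["autopay", "discount", "rate"],
    ["late", "rate", "autopay", "discount"],
]
scores = score_bigrams(toy_docs)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

In Phrases, a pair is joined with an underscore only when its score is above the threshold, so a higher threshold produces fewer bigrams.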

In [11]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data, min_count=5, threshold=20)
trigram = gensim.models.Phrases(bigram[data], threshold=20)

bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data[0]]])

['applying', 'loan', 'account', 'correctly', 'communicate', 'issue', 'offer', 'rate', 'deduction', 'autopay', 'showing', 'account', 'told', 'go', 'application', 'anyway', 'account', 'opened', 'could', 'add', 'autopay', 'receive', 'discount', 'way', 'since', 'account', 'opened', 'called', 'call', 'center', 'least', 'time', 'trying', 'receive', 'autopay_discount', 'first', 'time', 'told', 'going', 'applied', 'still', 'seen', 'additionally', 'last', 'time', 'called', 'week_ago', 'asked_speak_manager', 'told', 'would', 'take', 'day', 'get', 'back', 'still', 'yet_hear', 'back', 'business_day', 'later', 'told', 'would', 'receiving', 'autopay_discount', 'receiving', 'opened', 'account', 'company', 'lying', 'rate', 'loan', 'going', 'receive', 'dont', 'autopay', 'initiate', 'autopays', 'go', 'far', 'additional', 'issue', 'told', 'rate', 'going', 'based', 'month', 'libor_rate', 'published', 'wsj', 'month', 'none', 'rate', 'received', 'thus_far', 'match', 'rate', 'dont', 'really', 'know', 'tried_

In [12]:
# As we can see above, our attempt at lemmatizing the data didn't work well (words like "applying" and "opened" were left unchanged). We will just try it again with spaCy.
# Here we are defining functions for bigrams, trigrams, and lemmatizing the data

def make_bigrams(data):
    return [bigram_mod[doc] for doc in data]

def make_trigrams(data):
    return [trigram_mod[bigram_mod[doc]] for doc in data]

def lemmatization(data, allowed_postags=["NOUN", "ADJ", "VERB", "ADV"]):
    complaints_out = []
    for complaint in data:
        doc = nlp(" ".join(complaint))
        complaints_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return complaints_out

In [13]:
# Now we call the functions we built above
data_words_bigrams = make_bigrams(data)
# Note: the "en" shortcut works on spaCy 2.x; on spaCy 3+ load "en_core_web_sm" instead
nlp = spacy.load("en", disable=["parser", "ner"])
data_lemmatized = lemmatization(data_words_bigrams)

print(data_lemmatized[:1])

[['apply', 'loan', 'account', 'correctly', 'communicate', 'issue', 'offer', 'rate', 'deduction', 'autopay', 'show', 'account', 'tell', 'go', 'application', 'account', 'open', 'could', 'add', 'autopay', 'receive', 'discount', 'way', 'account', 'open', 'call', 'center', 'least', 'time', 'try', 'receive', 'first', 'time', 'tell', 'go', 'apply', 'still', 'see', 'additionally', 'last', 'time', 'call', 'ask', 'speak_manag', 'tell', 'would', 'take', 'day', 'back', 'still', 'yet', 'hear', 'back', 'later', 'tell', 'would', 'receive', 'receive', 'open', 'account', 'company', 'lie', 'rate', 'loan', 'go', 'receive', 'autopay', 'autopay', 'go', 'far', 'additional', 'issue', 'tell', 'rate', 'go', 'base', 'month', 'publish', 'month', 'none', 'rate', 'receive', 'thus_far', 'match', 'rate', 'really', 'know', 'tried_contacte', 'many', 'time', 'people', 'phone', 'seem', 'helpful', 'time', 'talk', 'seem', 'get', 'do']]


At this point we've created bigrams and lemmatized the data. Now we need to create the dictionary and corpus that are needed for topic modeling.
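
As a rough picture of what the dictionary and corpus contain, here is a small sketch in plain Python. The `build_vocab` and `to_bow` helpers are made up for illustration and only mimic the idea behind Gensim's Dictionary and doc2bow: every unique word gets an integer id, and each document becomes a list of (word_id, count) pairs. Gensim's actual id assignment may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Map every unique token to an integer id (alphabetical order here)."""
    unique_tokens = sorted({token for doc in docs for token in doc})
    return {token: idx for idx, token in enumerate(unique_tokens)}

def to_bow(doc, token2id):
    """Convert one tokenized document into sorted (word_id, count) pairs."""
    counts = Counter(doc)
    return sorted((token2id[token], n) for token, n in counts.items())

# Made-up tokenized documents
toy_docs = [["loan", "account", "loan"], ["account", "pay"]]
token2id = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Word order is thrown away in this representation; only how often each word appears in a document is kept.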

In [14]:
# Create dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
complaints = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(complaint) for complaint in complaints]

print("Corpus format: (word_id, word_frequency)")
print(corpus[:1])
print()
print("Readable version of term-frequency:")
print([[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]])

Corpus format: (word_id, word_frequency)
[[(0, 5), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 4), (8, 2), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 5), (24, 1), (25, 1), (26, 2), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 2), (33, 1), (34, 1), (35, 2), (36, 1), (37, 1), (38, 3), (39, 1), (40, 1), (41, 1), (42, 5), (43, 1), (44, 6), (45, 1), (46, 2), (47, 1), (48, 1), (49, 2), (50, 1), (51, 1), (52, 5), (53, 1), (54, 5), (55, 1), (56, 1), (57, 1), (58, 2), (59, 1)]]

Readable version of term-frequency:
[[('account', 5), ('add', 1), ('additional', 1), ('additionally', 1), ('application', 1), ('apply', 2), ('ask', 1), ('autopay', 4), ('back', 2), ('base', 1), ('call', 2), ('center', 1), ('communicate', 1), ('company', 1), ('correctly', 1), ('could', 1), ('day', 1), ('deduction', 1), ('discount', 1), ('do', 1), ('far', 1), ('first', 1), ('get', 1), ('go', 5), ('hear', 1), ('helpful', 1),

Now that we've prepared the data and have everything we need to train the LDA model, we will build it. 

In [22]:
# Building the LDA model
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=8,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha="auto",
    per_word_topics=True,
)

In [23]:
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.100*"interest" + 0.084*"loan" + 0.083*"amount" + 0.082*"pay" + '
  '0.069*"navient" + 0.048*"balance" + 0.034*"charge" + 0.031*"apply" + '
  '0.028*"principal" + 0.022*"total"'),
 (1,
  '0.056*"pay" + 0.035*"year" + 0.027*"go" + 0.022*"time" + 0.021*"get" + '
  '0.020*"work" + 0.018*"help" + 0.016*"try" + 0.015*"month" + 0.015*"even"'),
 (2,
  '0.246*"loan" + 0.091*"student" + 0.027*"school" + 0.027*"company" + '
  '0.027*"debt" + 0.018*"private" + 0.015*"federal" + 0.014*"service" + '
  '0.013*"default" + 0.011*"program"'),
 (3,
  '0.037*"deferment" + 0.021*"borrower" + 0.011*"practice" + 0.011*"allow" + '
  '0.011*"use" + 0.010*"financial" + 0.010*"like" + 0.010*"force" + '
  '0.010*"term" + 0.009*"repayment"'),
 (4,
  '0.069*"call" + 0.057*"tell" + 0.048*"would" + 0.038*"navient" + 0.033*"say" '
  '+ 0.027*"ask" + 0.019*"time" + 0.019*"could" + 0.017*"phone" + 0.016*"day"'),
 (5,
  '0.050*"account" + 0.047*"receive" + 0.036*"information" + 0.034*"send" + '
  '0.028*"reques

In [24]:
# Compute perplexity
print("Perplexity:", lda_model.log_perplexity(corpus))

# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence="c_v")
coherence_lda = coherence_model_lda.get_coherence()
print("\nCoherence score:", coherence_lda)

Perplexity: -6.654269893658784

Coherence score: 0.42528402999884996
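
A note on the number printed as "Perplexity": log_perplexity returns a per-word likelihood bound, not the perplexity itself. As far as I understand Gensim's own logging, the matching perplexity estimate is 2 raised to the negative of that bound. The sketch below just does that arithmetic on the value printed above; the variable names are my own.

```python
# Per-word likelihood bound printed by lda_model.log_perplexity(corpus) above
per_word_bound = -6.654269893658784

# Assumed conversion: perplexity estimate = 2 ** (-bound)
perplexity_estimate = 2 ** (-per_word_bound)
print("Perplexity estimate:", round(perplexity_estimate, 1))
```

A lower perplexity estimate is better, which corresponds to a bound closer to zero.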


We now have a baseline. We will measure the performance of this model mostly using the coherence score (the higher the better). Our baseline coherence score is 0.4253, which is pretty bad and slightly lower than the 0.4441 from the last notebook. But at least we have a baseline, and we can start working on improving the model from here.
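
To give some intuition for what a coherence score rewards, here is a small sketch in plain Python. It is not the c_v measure used above (c_v is more involved); it is a simpler UMass-style score that only looks at how often a topic's top words appear in the same documents. The `umass_style_coherence` helper and the toy data are made up, and its values are not comparable with the c_v numbers in this notebook.

```python
import math

def umass_style_coherence(top_words, docs):
    """Average log ratio of document co-occurrence for each pair of top words.
    Assumes every word in top_words appears in at least one document."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(all(word in doc for word in words) for doc in doc_sets)

    total, n_pairs = 0.0, 0
    for i in range(1, len(top_words)):
        for j in range(i):
            later, earlier = top_words[i], top_words[j]
            total += math.log((doc_freq(later, earlier) + 1) / doc_freq(earlier))
            n_pairs += 1
    return total / n_pairs

# Made-up tokenized documents
toy_docs = [
    ["loan", "interest", "balance"],
    ["loan", "interest"],
    ["call", "phone", "tell"],
    ["call", "phone"],
]
coherent = umass_style_coherence(["loan", "interest"], toy_docs)
incoherent = umass_style_coherence(["loan", "phone"], toy_docs)
print(coherent, incoherent)
```

Words that tend to show up in the same complaints score higher than words that never appear together, which is the same intuition behind "the higher the better" for c_v.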

The model has been built, and its topics are printed above (the print_topics output). For each topic, the output lists the top words and the weight each word carries in that topic. There are 8 topics total (0 - 7).

Now we will visualize the topics and their keywords using pyLDAvis.

In [25]:
pyLDAvis.enable_notebook(local=True)
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
pyLDAvis.display(vis)

Using the visualization above:
On the left are the topics; the larger the bubble, the more prevalent the topic. The closer two bubbles are to each other, the more similar the topics are. Moving the cursor over a bubble updates the words and bars on the right. These words are the keywords for the selected topic.

Now that we've created the model, we need to find the dominant topic for each complaint.
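
The idea of a "dominant topic" can be sketched in plain Python. The `dominant_topic` helper and the numbers are made up for illustration. The one thing to watch for is that a model built with per_word_topics=True returns a tuple for each document, where the first item is the list of (topic_id, probability) pairs, so the sketch unwraps that tuple first.

```python
def dominant_topic(model_output):
    """Return the (topic_id, probability) pair with the highest probability."""
    # With per_word_topics=True the per-document output is a tuple whose
    # first element is the topic distribution; otherwise it is already a list
    if isinstance(model_output, tuple):
        model_output = model_output[0]
    return max(model_output, key=lambda pair: pair[1])

# Made-up outputs in both shapes
plain_output = [(0, 0.1), (1, 0.7), (4, 0.2)]
tuple_output = (plain_output, [], [])

print(dominant_topic(plain_output))
print(dominant_topic(tuple_output))
```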

In [40]:
def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
    # Collect one row per document, then build the DataFrame once at the end
    rows = []

    # Get main topic in each document
    for doc_topics in ldamodel[corpus]:
        # With per_word_topics=True each item is a tuple
        # (topic distribution, word topics, phi values); keep only the topic distribution
        if isinstance(doc_topics, tuple):
            doc_topics = doc_topics[0]
        # Sort topics by their contribution, largest first
        doc_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)
        # Get the dominant topic, percent contribution and keywords for the document
        topic_num, prop_topic = doc_topics[0]
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, prop in wp)
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df


df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)