<b><h1 align="center"> Topic Modeling with Latent Dirichlet Allocation </h1></b>

Topic modeling is a method for unsupervised classification of documents. Like clustering on numeric data, it finds natural groups of items (topics) even when we are unsure of what we are looking for. A document can belong to multiple topics, as in fuzzy (soft) clustering, where each data point can belong to more than one cluster.

**Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives. It can help with the following:**

*   *Discover the hidden themes in the collection.*
*   *Classify the documents into the discovered themes.*
*   *Use the classification to organize/summarize/search the documents.*

Latent Dirichlet Allocation is one of the most popular topic modeling methods. LDA aims to find topics that a document belongs to based on the words in it.

***
![topic-modeling.png](https://2.bp.blogspot.com/-UO8E6wws1Go/XGWgbLTPJnI/AAAAAAAABoQ/tGuBrjfJZ1UGmUQ112ZCv3gAu3Tg0O1FACLcBGAs/s640/image001-min.png)
***

### **Topic Modeling using Latent Dirichlet Allocation**
***

[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is one way to implement [Topic Modeling](https://en.wikipedia.org/wiki/Topic_model). It is a generative probabilistic model in which each document is assumed to be a mixture of topics, each present in a different proportion.

Latent Dirichlet Allocation (LDA) models two things: the topics per document and the words per topic. Both are modeled with Dirichlet distributions. LDA makes two key assumptions:

*   **Documents are a mixture of topics.**
*   **Topics are a mixture of tokens (or words).**

Each topic generates words according to its own probability distribution. In statistical terms, each document is a probability distribution over topics, and each topic is a probability distribution over words.
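The two assumptions describe a generative process, which can be sketched in a few lines of NumPy. This is only an illustration, not how any library implements LDA: the vocabulary, the number of topics, and the Dirichlet parameters `alpha` and `beta` are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["youtube", "videos", "blog", "reading", "technical"]  # hypothetical vocabulary
n_topics = 2  # assumed number of topics
alpha = 0.5   # assumed Dirichlet parameter for the document-topic mixture
beta = 0.5    # assumed Dirichlet parameter for the topic-word mixture

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet([beta] * len(vocab), size=n_topics)


def generate_document(n_words):
    # Each document is a probability distribution over topics.
    doc_topic = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        topic = rng.choice(n_topics, p=doc_topic)            # pick a topic for this word
        word_id = rng.choice(len(vocab), p=topic_word[topic])  # pick a word from that topic
        words.append(vocab[word_id])
    return doc_topic, words


doc_topic, words = generate_document(8)
print(doc_topic, words)
```

Fitting LDA runs this process in reverse: given only the words, it infers plausible values for `doc_topic` and `topic_word`.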
***

### **How does the LDA algorithm work?**

The following steps are carried out in LDA to assign topics to each of the documents:

![lda-algorithm](https://miro.medium.com/max/1400/0*FUB2WfIUKZ5r87e_)
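One common way to carry out this kind of assignment is collapsed Gibbs sampling: start from random word-to-topic assignments, then repeatedly resample each word's topic in proportion to how prevalent the topic is in the document and how prevalent the word is in the topic. The sketch below is a minimal version under assumed settings (toy corpus of word ids, `alpha`, `beta`, iteration count); it is not taken from the figure, and scikit-learn's `LatentDirichletAllocation` uses variational Bayes rather than Gibbs sampling.

```python
import numpy as np


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # words in doc d assigned to topic k
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    topic_total = np.zeros(n_topics)               # total words assigned to topic k

    # Step 1: assign every word to a random topic and fill the count tables.
    assignments = []
    for d, doc in enumerate(docs):
        z = rng.integers(n_topics, size=len(doc))
        assignments.append(z)
        for w, k in zip(doc, z):
            doc_topic[d, k] += 1
            topic_word[k, w] += 1
            topic_total[k] += 1

    # Step 2: resample each word's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d, k] -= 1
                topic_word[k, w] -= 1
                topic_total[k] -= 1
                # p(topic | document) * p(word | topic), up to a constant factor
                p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                p = p / (topic_total + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                assignments[d][i] = k
                doc_topic[d, k] += 1
                topic_word[k, w] += 1
                topic_total[k] += 1

    # Step 3: turn smoothed counts into probability distributions.
    theta = doc_topic + alpha
    theta = theta / theta.sum(axis=1, keepdims=True)  # document-topic mixtures
    phi = topic_word + beta
    phi = phi / phi.sum(axis=1, keepdims=True)        # topic-word distributions
    return theta, phi


# Toy corpus: word ids 0-1 and 2-3 tend to appear together.
toy_docs = [[0, 1, 0, 1], [0, 1, 1], [2, 3, 2], [2, 3, 0, 1]]
theta, phi = gibbs_lda(toy_docs, n_topics=2, vocab_size=4)
print(np.round(theta, 2))
```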
***

### **Hyperparameters in LDA:**

There are three hyperparameters in LDA.

*   $\alpha \rightarrow$ Document Density Factor.
*   $\beta \rightarrow$ Topic Word Density Factor.
*   $K \rightarrow$ Number of topics selected.

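The $\alpha$ hyperparameter controls how many topics each document is expected to contain: a low $\alpha$ favors documents dominated by a few topics, while a high $\alpha$ favors documents that mix many topics. The $\beta$ hyperparameter plays the same role for the words within each topic: a low $\beta$ favors topics dominated by a few words. $K$ defines how many topics we want to extract.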
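The effect of $\alpha$ can be seen by sampling document-topic mixtures directly from a symmetric Dirichlet distribution. The values of `n_topics`, `n_docs`, and the two `alpha` settings below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 5  # K, chosen arbitrarily for the demo


def mean_largest_share(alpha, n_docs=1000):
    # Draw document-topic mixtures from a symmetric Dirichlet(alpha) and
    # report how much weight the largest topic receives on average.
    mixtures = rng.dirichlet([alpha] * n_topics, size=n_docs)
    return mixtures.max(axis=1).mean()


sparse = mean_largest_share(alpha=0.1)  # low alpha: one topic tends to dominate
dense = mean_largest_share(alpha=10.0)  # high alpha: weight spreads across topics
print(f"alpha=0.1  -> largest topic share {sparse:.2f}")
print(f"alpha=10.0 -> largest topic share {dense:.2f}")
```

In scikit-learn's `LatentDirichletAllocation`, these hyperparameters correspond to `doc_topic_prior` ($\alpha$), `topic_word_prior` ($\beta$), and `n_components` ($K$); both priors default to `1 / n_components` when left unset.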

***
### **Illustrative Example of LDA.**
***
Suppose we have the following four documents in the corpus, and we wish to carry out topic modeling on these documents.

*   **Document 1:** We watch a lot of videos on YouTube.
*   **Document 2:** YouTube videos are very informative.
*   **Document 3:** Reading a technical blog makes me understand things easily.
*   **Document 4:** I prefer blogs to YouTube videos.

LDA modeling helps discover topics in the above corpus and assign topic mixtures for each of the documents. As an example, the model might output something as given below:

*   **Topic 1:** 40% videos, 60% YouTube
*   **Topic 2:** 95% blogs, 5% YouTube

Documents 1 and 2 would then belong 100% to Topic 1. Document 3 would belong 100% to Topic 2. Document 4 would belong 80% to Topic 2 and 20% to Topic 1.
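To see how a topic mixture translates into words, the probability of each word in Document 4 can be computed as the topic-weighted sum of the word probabilities, using only the illustrative numbers above:

```python
# Topic-word probabilities from the illustrative output above.
topics = {
    "Topic 1": {"videos": 0.40, "YouTube": 0.60},
    "Topic 2": {"blogs": 0.95, "YouTube": 0.05},
}
# Document 4's topic mixture from the example.
doc4_mixture = {"Topic 1": 0.20, "Topic 2": 0.80}

# P(word | document) = sum over topics of P(topic | document) * P(word | topic)
word_probs = {}
for topic, weight in doc4_mixture.items():
    for word, prob in topics[topic].items():
        word_probs[word] = word_probs.get(word, 0.0) + weight * prob

for word, prob in sorted(word_probs.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {prob:.2f}")
```

Under these numbers, Document 4 is expected to use "blogs" about 76% of the time, "YouTube" about 16%, and "videos" about 8%.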
***

### **Reference:**

> [**Introduction to Latent Dirichlet Allocation**](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/)







# **Topic Modeling using [sklearn.decomposition.LatentDirichletAllocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html\#sklearn.decomposition.LatentDirichletAllocation)**

> [**YouTube Explanation**](https://www.youtube.com/watch?v=25JOEnrz40c)



In [1]:
# Import Library.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.model_selection import train_test_split
import re, nltk
from nltk.stem.porter import PorterStemmer
from textblob import Word
import warnings

warnings.filterwarnings("ignore")

nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
# Load Dataset.
data = pd.read_csv(
    "https://github.com/srivatsan88/YouTubeLI/blob/master/dataset/consumer_compliants.zip?raw=true",
    compression="zip",
    sep=",",
    quotechar='"',
)

data.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,4/3/2020,Vehicle loan or lease,Loan,Getting a loan or lease,Fraudulent loan,This auto loan was opened on XX/XX/2020 in XXX...,Company has responded to the consumer and the ...,TRUIST FINANCIAL CORPORATION,PA,,,Consent provided,Web,4/3/2020,Closed with explanation,Yes,,3591341
1,3/12/2020,Debt collection,Payday loan debt,Attempts to collect debt not owed,Debt is not yours,In XXXX of 2019 I noticed a debt for {$620.00}...,,CURO Intermediate Holdings,CO,806XX,,Consent provided,Web,3/12/2020,Closed with explanation,Yes,,3564184
2,2/6/2020,Vehicle loan or lease,Loan,Getting a loan or lease,Credit denial,"As stated from Capital One, XXXX XX/XX/XXXX an...",,CAPITAL ONE FINANCIAL CORPORATION,OH,430XX,,Consent provided,Web,2/6/2020,Closed with explanation,Yes,,3521949
3,3/6/2020,Checking or savings account,Savings account,Managing an account,Banking errors,"Please see CFPB case XXXX. \n\nCapital One, in...",,CAPITAL ONE FINANCIAL CORPORATION,CA,,,Consent provided,Web,3/6/2020,Closed with explanation,Yes,,3556237
4,2/14/2020,Debt collection,Medical debt,Attempts to collect debt not owed,Debt is not yours,This debt was incurred due to medical malpract...,Company believes it acted appropriately as aut...,"Merchants and Professional Bureau, Inc.",OH,432XX,,Consent provided,Web,2/14/2020,Closed with explanation,Yes,,3531704


In [3]:
data["Product"].value_counts()

Debt collection                21772
Credit card or prepaid card    13193
Mortgage                        9799
Checking or savings account     7003
Student loan                    2950
Vehicle loan or lease           2736
Name: Product, dtype: int64

In [4]:
data["Company"].value_counts()

CITIBANK, N.A.                                                    3226
CAPITAL ONE FINANCIAL CORPORATION                                 2711
BANK OF AMERICA, NATIONAL ASSOCIATION                             2580
JPMORGAN CHASE & CO.                                              2409
WELLS FARGO & COMPANY                                             2001
                                                                  ... 
Time Investment Company, Inc.                                        1
Automotive Services Finance, Inc.                                    1
Foxstar Energy Resources LLC DBA Federal Student Loans Council       1
Uplift, Inc                                                          1
Keystone Credit Services LLC                                         1
Name: Company, Length: 2197, dtype: int64

In [5]:
complaints_data = data[["Consumer complaint narrative", "Product", "Company"]].rename(
    columns={"Consumer complaint narrative": "complaints"}
)

# None disables column truncation; passing -1 is deprecated since pandas 1.0.
pd.set_option("display.max_colwidth", None)
complaints_data

Unnamed: 0,complaints,Product,Company
0,"This auto loan was opened on XX/XX/2020 in XXXX, NC with BB & T in my name. I have NEVER been to North Carolina and I have NEVER been a resident. I have filed a dispute twice through my credit bureaus but both times BB & T has claimed that this is an accurate loan. Which I wasn't aware of until today. I have tried to contact BB & T multiple times but I have never gotten through to a live person. I do n't drive and I have never owned a car before. I didn't have any knowledge of this account until I checked XXXXXXXX XXXX and noticed it. I've tried twice to dispute it. Additionally I never received any bills or information about this account. This is my last resort in trying to remove this fraudulent loan off of my account.",Vehicle loan or lease,TRUIST FINANCIAL CORPORATION
1,"In XXXX of 2019 I noticed a debt for {$620.00} on my credit which i believed was mine I thought speedy cash had bought one of my old debts and sold it to XXXX XXXX XXXX XXXX. I contacted XXXX XXXX XXXX XXXX and after several attempts of giving my full name, nothing came up in their system. I gave my social and the rep said the account popped up but DID NOT tell me that the account was under someone elses name and continued to let me make a payment. The payment was for {$120.00}. Confirmation number-XXXX. After realizing it was not my account, I called back to get my money back and inform them of the mistake. I was told i needed to mail them an FTC report and dispute letter to get my money back. I completed all of this and when i called again they said they transferred the account back to speedy cash for fraud review and I would need to contact them. After contacting them i was again told that i can not get my money back. The issue im having is this representative at XXXX XXXX played blind to obvious fraud and let an innocent person make a payment on someone elses debt and i want my money back.",Debt collection,CURO Intermediate Holdings
2,"As stated from Capital One, XXXX XX/XX/XXXX and XXXX 2018, My wife and I went to several car dealerships to request for a car loan to get a used car. However, according to their credit requirements unfortunately my credit score was insufficient for the car loan approval at that time. It seemed as though they pulled my credit report multiple times.",Vehicle loan or lease,CAPITAL ONE FINANCIAL CORPORATION
3,"Please see CFPB case XXXX. \n\nCapital One, in the letter they provided ( and attached to that case as their response ) said this : "" The funds were reversed and sent back to XXXX XXXX XXXX on XX/XX/XXXX ''. \n\nXXXX XXXX XXXX ( now XXXX XXXX ) has not received these funds. Staff at XXXX XXXX - and also staff at the account-holder 's business - have looked for return of my money ( {$650.00} ) and find nothing. \n\nCapital One needs to document - actually prove - they returned the funds, as stated in their letter. Capital One must provide electronic information, if the return was made that way, or document the paper check they sent back to XXXX XXXX. \n\nI've left 3 messages about this problem for the person who signed the letter ( XXXX ) from Capital One. I have received no call-backs. \n\nSummary : Capital One said they returned my money on XX/XX/XXXX : they did not. If they continue claim they did, then they need to prove that.",Checking or savings account,CAPITAL ONE FINANCIAL CORPORATION
4,"This debt was incurred due to medical malpractice ( XXXX XXXX XXXX, XXXX, TX ). I asked the doctor to turn over my claim to his malpractice insurance company. This has cost me thousands of dollars to XXXX XXXX XXXX. I am still trying to collect damages from this doctor. He never responded and turned over me to collections Merchants and Professional Collection Bureau , Inc. I sent them a letter describing exactly this issue and instead of not contacting me and verifying my debt they start reporting this debt to the credit reporting agencies. They never verified the debt, like I asked and they never stopped it from being reported when I specifically told them not to, due to the circumstances above.",Debt collection,"Merchants and Professional Bureau, Inc."
...,...,...,...
57448,"I am attempting to make a payment toward my student loans on the Nelnet website today, XX/XX/20, and Nelnet will not allow me to post the payment sooner than XX/XX/20. By the time the payment posts, 2-3 days of additional interest will have accrued and my payments will apply more to interest than is due today, the day that I'm attempting to pay. My understanding was that I could make a payment at any time but this does not appear to be true. The funds are available in my bank account today regardless of whether Nelnet can collect over the weekend. I should not be penalized for this. \n\nI submitted complaint XXXX in XXXX for other deceptive practices with Nelnet. They have not yet resolved the issue identified in that complaint or contacted me as they said they would in their response. I believe this new issue is just one more deceptive practice by this company that causes financial harm to borrowers.",Student loan,"Nelnet, Inc."
57449,Received letter for {$480.00}. Original creditor didnt contact me until past statute of limitations for insurance company recoupment per Arizona law. Debt collection is illegal for phantom debt. Additionally they are phoning my office excessively.,Debt collection,"The Receivable Management Services LLC, New York, NY Branch"
57450,"entire time 10 years until XX/XX/2020. XXXX makes my blood boil. I have called and was lied to told to provide my checking account information over the phone in order to turn my cell phone back on. i called at XXXX them at XXXX {$300.00} was added to my bill. \n\nScam scam scam I was told I can not call the office of the President just to write to XXXX XXXX XXXX XXXX XXXX XXXX XXXX, NM XXXX. I did three thousand times. the last letter I mailed on XX/XX/2020. Two collection agencies later. \n\nI chose to leave XXXX XXXX every time I called the XXXX supervisor would threaten me on a recorded line. I need peace of mind and a good Heart to beat inside of me. Im on a XXXX XXXX due to the stress at XXXX XXXX taking all my money 4 10 years.",Debt collection,"Convergent Resources, Inc."
57451,"I am a customer with Wells Fargo Bank. Recently money was withdrawn on a couple of occasions without my permission or consent to pay for a timeshare account that was never used by me nor anyone connected to me because of unfair policies pertaining to the fees of the said timeshare. I tried cancelling the said timeshare account several times because of these fees that were never mentioned at the initiation. My account was debited to pay for the timeshare fees without my knowledge or consent several times. I tried correcting this with Wells Fargo bank with no avail. I would appreciate it if you can look into this matter for me. I was left with no funds in my account and as such I could not take care of the basic necessities of my day to day life. \nThanks in advance,",Checking or savings account,WELLS FARGO & COMPANY


In [6]:
# Split Dataset into Train (30%) and Validation (70%) Sets.
X_train, X_valid = train_test_split(complaints_data, test_size=0.7, random_state=42)

# Text Cleaning and Preprocessing.
X_train["complaints"] = X_train["complaints"].apply(
    lambda x: re.sub("[^a-zA-Z]", " ", x).lower()
)
X_train["complaints"] = X_train["complaints"].apply(
    lambda x: " ".join([Word(word).lemmatize() for word in x.split()])
)

In [7]:
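def tokenize(text):
    # Keep tokens longer than 3 characters and drop redaction masks such as
    # "xxxx" (tokens with 2 or fewer characters left after stripping X, x, /).
    tokens = [
        word
        for word in nltk.word_tokenize(text)
        if (len(word) > 3 and len(word.strip("Xx/")) > 2)
    ]
    return tokens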


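# With use_idf=False and norm=None the vectorizer emits raw term counts,
# which is the input LDA expects.
vectorizer_tf = TfidfVectorizer(
    tokenizer=tokenize,
    stop_words="english",
    max_df=0.75,
    min_df=50,
    max_features=10000,
    use_idf=False,
    norm=None,
)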

tf_vectors = vectorizer_tf.fit_transform(X_train.complaints)

In [8]:
# Train the Latent Dirichlet Allocation (LDA) algorithm for Topic Modeling.
lda = LatentDirichletAllocation(
    n_components=10,  # K: number of topics to extract
    max_iter=100,
    learning_method="online",
    learning_offset=50,
    n_jobs=-1,
    random_state=42,
)

# Alternative: Non-Negative Matrix Factorization (NMF) for Topic Modeling.
# nmf = NMF(n_components=10, random_state=42)

W1 = lda.fit_transform(tf_vectors)  # document-topic matrix, shape (n_docs, n_topics)
H1 = lda.components_  # topic-word matrix, shape (n_topics, n_words)

In [9]:
num_words = 10

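# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
vocab = np.array(vectorizer_tf.get_feature_names_out())


def top_words(topic_weights):
    # Words with the num_words largest weights, in descending order.
    return [vocab[i] for i in np.argsort(topic_weights)[: -num_words - 1 : -1]]


topic_words = [top_words(t) for t in H1]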
topics = [" ".join(t) for t in topic_words]

topics

['card credit charge dispute fraud capital account transaction claim citi',
 'told called said time asked phone email spoke just number',
 'payment account paid late balance statement month received mortgage escrow',
 'credit payment month time year card account paid score company',
 'debt company number collection letter received phone address sent information',
 'loan chase payment student forbearance navient year program month income',
 'loan mortgage home property document modification closing letter company foreclosure',
 'insurance offer bonus policy term customer service month point cooper',
 'debt credit account report collection information company reporting provide bureau',
 'account bank check fund money checking america closed deposit fargo']
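The rows of `H1` are unnormalized topic-word weights, while `W1` already holds normalized document-topic proportions. A small sketch on the four illustrative documents from earlier shows the relationship; the two-topic setting and the use of `CountVectorizer` here are assumptions for the demo, separate from the complaints pipeline above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "We watch a lot of videos on YouTube.",
    "YouTube videos are very informative.",
    "Reading a technical blog makes me understand things easily.",
    "I prefer blogs to YouTube videos.",
]

toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Document-topic proportions: each row sums to 1.
toy_doc_topic = toy_lda.fit_transform(toy_counts)

# Normalizing each row of components_ gives a distribution over words per topic.
toy_topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

print(np.round(toy_doc_topic, 2))
print(np.round(toy_topic_word.sum(axis=1), 2))
```

With only four short documents the learned topics are not meaningful; the point is the shape and normalization of the two matrices.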

In [10]:
colnames = ["Topic-" + str(i) for i in range(lda.n_components)]
docnames = ["Doc-" + str(i) for i in range(len(X_train.complaints))]
df_doc_topic = pd.DataFrame(np.round(W1, 2), columns=colnames, index=docnames)
significant_topic = np.argmax(df_doc_topic.values, axis=1)
df_doc_topic["dominant_topic"] = significant_topic

df_doc_topic

|  | Topic-0 | Topic-1 | Topic-2 | Topic-3 | Topic-4 | Topic-5 | Topic-6 | Topic-7 | Topic-8 | Topic-9 | dominant_topic |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc-0 | 0.00 | 0.17 | 0.07 | 0.19 | 0.13 | 0.34 | 0.00 | 0.00 | 0.10 | 0.00 | 5 |
| Doc-1 | 0.00 | 0.54 | 0.00 | 0.18 | 0.19 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1 |
| Doc-2 | 0.52 | 0.01 | 0.01 | 0.11 | 0.22 | 0.01 | 0.01 | 0.01 | 0.01 | 0.12 | 0 |
| Doc-3 | 0.39 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 0.07 | 0.06 | 0.11 | 0.10 | 0 |
| Doc-4 | 0.10 | 0.29 | 0.30 | 0.00 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Doc-17230 | 0.00 | 0.44 | 0.02 | 0.00 | 0.02 | 0.00 | 0.46 | 0.00 | 0.02 | 0.04 | 6 |
| Doc-17231 | 0.14 | 0.00 | 0.00 | 0.00 | 0.09 | 0.00 | 0.00 | 0.00 | 0.30 | 0.46 | 9 |
| Doc-17232 | 0.17 | 0.00 | 0.29 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.09 | 3 |
| Doc-17233 | 0.00 | 0.13 | 0.28 | 0.00 | 0.46 | 0.04 | 0.00 | 0.08 | 0.00 | 0.00 | 4 |


In [11]:
# Text Cleaning and Preprocessing.
X_valid["complaints"] = X_valid["complaints"].apply(
    lambda x: re.sub("[^a-zA-Z]", " ", x).lower()
)
X_valid["complaints"] = X_valid["complaints"].apply(
    lambda x: " ".join([Word(word).lemmatize() for word in x.split()])
)

WHold = lda.transform(vectorizer_tf.transform(X_valid.complaints))

colnames = ["Topic-" + str(i) for i in range(lda.n_components)]
docnames = ["Doc-" + str(i) for i in range(len(X_valid.complaints))]
df_doc_topic = pd.DataFrame(np.round(WHold, 2), columns=colnames, index=docnames)
significant_topic = np.argmax(df_doc_topic.values, axis=1)
df_doc_topic["dominant_topic"] = significant_topic

df_doc_topic

|  | Topic-0 | Topic-1 | Topic-2 | Topic-3 | Topic-4 | Topic-5 | Topic-6 | Topic-7 | Topic-8 | Topic-9 | dominant_topic |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc-0 | 0.00 | 0.25 | 0.00 | 0.00 | 0.14 | 0.00 | 0.00 | 0.00 | 0.43 | 0.17 | 8 |
| Doc-1 | 0.02 | 0.08 | 0.00 | 0.22 | 0.11 | 0.00 | 0.57 | 0.00 | 0.00 | 0.00 | 6 |
| Doc-2 | 0.00 | 0.50 | 0.00 | 0.00 | 0.00 | 0.06 | 0.40 | 0.00 | 0.00 | 0.04 | 1 |
| Doc-3 | 0.01 | 0.01 | 0.01 | 0.01 | 0.72 | 0.01 | 0.01 | 0.01 | 0.19 | 0.01 | 4 |
| Doc-4 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.92 | 0.01 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Doc-40213 | 0.00 | 0.31 | 0.34 | 0.13 | 0.00 | 0.00 | 0.17 | 0.04 | 0.00 | 0.00 | 2 |
| Doc-40214 | 0.00 | 0.22 | 0.02 | 0.00 | 0.00 | 0.00 | 0.75 | 0.00 | 0.00 | 0.00 | 6 |
| Doc-40215 | 0.04 | 0.65 | 0.00 | 0.00 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1 |
| Doc-40216 | 0.00 | 0.11 | 0.00 | 0.00 | 0.41 | 0.00 | 0.00 | 0.00 | 0.47 | 0.00 | 8 |
