## Overview:

In this notebook, I approach the problem as a topic modeling task, specifically with Latent Dirichlet Allocation (LDA), a statistical model that is widely used in NLP text classification problems. The main objective of topic modeling is to discover the different topics present in our raw data (the `text` column of the `case_study_data.csv` file).

Each document in the raw data is assumed to be generated from one or more topics. Once the technique is applied, our job as humans is to interpret the output and judge whether the mixture of words in each topic makes sense. If it doesn't, we can tune the number of topics, the terms in the document-term matrix, or the model parameters, or even try a different model.




# Topic Modeling & Latent Dirichlet Allocation

## Introduction

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.

LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. It is used to assign the text in a document to a certain topic. It builds a topic-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related. Consequently, selecting the right corpus is crucial.
* It also assumes that documents are made up of a mixture of topics. Those topics then generate words according to their probability distributions.
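
The generative story described above can be sketched in a few lines of plain Python. This is only an illustration of the sampling process that LDA assumes, not how gensim fits the model; the two toy topics, their word probabilities and the value of `alpha` are made-up assumptions.

```python
import random

random.seed(0)  # fixed seed so the toy example is repeatable

# Hypothetical toy topics: each topic is a probability distribution over words
topics = {
    "banking": {"account": 0.5, "check": 0.3, "fee": 0.2},
    "mortgage": {"loan": 0.5, "home": 0.3, "escrow": 0.2},
}

def sample_dirichlet(alpha, k):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension k."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    """Pick topic proportions for the document, then a topic and a word per position."""
    names = list(topics)
    theta = sample_dirichlet(alpha, len(names))  # per-document topic mixture
    words = []
    for _ in range(n_words):
        topic = random.choices(names, weights=theta)[0]
        vocab = topics[topic]
        words.append(random.choices(list(vocab), weights=list(vocab.values()))[0])
    return theta, words

theta, words = generate_document(8)
print(theta)
print(words)
```

Training LDA goes in the opposite direction: given only the words, it tries to infer plausible topic mixtures and topic-word distributions.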




## 1- Load the dataset

The dataset we'll use is a list of opened complaint cases with 268,363 records. Each record contains the complaint text (**text**), a message identifier (**complaint_id**) and a verified correct complaint department (**product_group**).

In [163]:

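# Read the dataset from a CSV file and save the text part to 'data_text'

import pandas as pd
# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3)
data = pd.read_csv('case_study_data.csv', on_bad_lines='skip')
# We only need the 'text' column from the raw data as our corpus (documents)
# .copy() avoids a SettingWithCopyWarning when the 'index' column is added
data_text = data[:268363][['text']].copy()
data_text['index'] = data_text.index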


In [164]:
# First glimpse of the raw data
data.head()

Unnamed: 0,complaint_id,product_group,text
0,2815595,bank_service,On XX/XX/2017 my check # XXXX was debited from...
1,2217937,bank_service,I opened a Bank of the the West account. The a...
2,2657456,bank_service,wells fargo in nj opened a business account wi...
3,1414106,bank_service,A hold was placed on my saving account ( XXXX ...
4,1999158,bank_service,Dear CFPB : I need to send a major concern/com...


In [165]:
# Get all distinct `product_group`
dist_product_group = data.product_group.unique()
dist_product_group

array(['bank_service', 'credit_card', 'credit_reporting',
       'debt_collection', 'loan', 'money_transfers', 'mortgage'],
      dtype=object)

In [166]:
len(dist_product_group)

7

In [167]:
# First glimpse of data_text
data_text.head()

Unnamed: 0,text,index
0,On XX/XX/2017 my check # XXXX was debited from...,0
1,I opened a Bank of the the West account. The a...,1
2,wells fargo in nj opened a business account wi...,2
3,A hold was placed on my saving account ( XXXX ...,3
4,Dear CFPB : I need to send a major concern/com...,4


In [168]:
# Check the rows of the dataset
print(len(data_text))

268363


In [169]:
# Check the shape of the data_text 
data_text.shape

(268363, 2)

## 2- Data Preprocessing 

For data preprocessing we will execute the following steps:

* **Tokenization**: Split the text into sentences and the sentences into words, lowercase all words and remove punctuation.
* **Remove short words**: Words with three or fewer characters are removed (the code below keeps only tokens with `len(token) > 3`).
* **Remove stopwords**: All **stopwords** are removed.
* **Lemmatization**: Words in the third person are changed to the first person, and verbs in past and future tenses are changed to the present tense.
* **Stemming**: Words are reduced to their root form.
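
The first three steps can be illustrated without any external library. The sketch below is not the gensim/nltk pipeline used later in this notebook: the regular expression and the tiny `TOY_STOPWORDS` set are assumptions made for illustration, lemmatization and stemming are left out, and `min_len=4` mirrors the `len(token) > 3` check in `preprocess()`.

```python
import re

# Hypothetical, deliberately tiny stopword list (gensim's STOPWORDS is much larger)
TOY_STOPWORDS = {"they", "because", "which", "being", "about", "against", "them", "have"}

def toy_preprocess(text, min_len=4):
    """Lowercase, keep alphabetic tokens, drop short tokens and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in TOY_STOPWORDS]

result = toy_preprocess("They closed my account because a federal lawsuit is being prepared.")
print(result)
```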


In [1]:

# Import gensim and nltk libraries

# pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed()  # note: without an argument this reseeds from OS entropy, so runs are not reproducible
import nltk
nltk.download('wordnet')
import re

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ahmetcakmak/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [171]:
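# Remove non-alphabetic characters and the XX... anonymization masks from the 'text' column
# regex=True is required in pandas >= 2.0, where str.replace treats the pattern literally by default
data_text['text'] = data_text['text'].str.replace(r'[^a-zA-Z\s]+|X{2,}', '', regex=True)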

In [172]:
data_text.head()

Unnamed: 0,text,index
0,On my check was debited from my checking ac...,0
1,I opened a Bank of the the West account The ac...,1
2,wells fargo in nj opened a business account wi...,2
3,A hold was placed on my saving account beca...,3
4,Dear CFPB I need to send a major concerncompl...,4


In [3]:
import pandas as pd

### Example - 1:

Before preprocessing the entire data_text, let's look at a short example of lemmatization. What output do we get if we lemmatize the word 'given'?

In [173]:
# Convert past tense to present tense
print(WordNetLemmatizer().lemmatize('given', pos = 'v')) 

give


### Example - 2:
In this example, we will see how stemming works. I will pick a number of words from the dataset and see how the stemmer deals with them.

In [4]:
stemmer = SnowballStemmer("english")
original_words = ['account','save','deducted','deduction','transaction','transferred','landlord','agreed','owned', 
           'fee','mortgage','meeting']
singles = [stemmer.stem(word) for word in original_words]

pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

Unnamed: 0,original word,stemmed
0,account,account
1,save,save
2,deducted,deduct
3,deduction,deduct
4,transaction,transact
5,transferred,transfer
6,landlord,landlord
7,agreed,agre
8,owned,own
9,fee,fee


In [175]:
# The following functions implement the preprocessing steps applied to the entire data_text

def lemm_stemm(x_text):
    # Lemmatize the token as a verb, then stem the result
    return stemmer.stem(WordNetLemmatizer().lemmatize(x_text, pos='v'))

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(x_text):
    result = []
    for token in gensim.utils.simple_preprocess(x_text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            # perform lemm_stemm on the token, then append to the result list
            result.append(lemm_stemm(token))
    return result


In [206]:

# Preview a document after preprocessing

document_num = 146
# 'documents' is only assigned further down; it is the same object as data_text
sample_document = data_text[data_text['index'] == document_num].values[0][0]

print("original raw document: ")
print(sample_document.split(' '))
print("\n\ntokenized and lemmatized document: ")
print(preprocess(sample_document))

original raw document: 
['they', 'closed', 'my', 'account', 'because', 'im', '', 'which', 'a', 'federal', 'lawsuit', 'is', 'being', 'prepared', 'and', 'about', 'to', 'be', 'filed', 'against', 'them', 'but', 'they', 'have', 'no', 'sent', 'me', 'my', 'money', 'there', 'holding', 'my', 'money', 'and', 'wont', 'tell', 'me', 'when', 'there', 'sending', 'my', 'money']


tokenized and lemmatized document: 
['close', 'account', 'feder', 'lawsuit', 'prepar', 'file', 'send', 'money', 'hold', 'money', 'wont', 'tell', 'send', 'money']


In [207]:
data_text

Unnamed: 0,text,index
0,On my check was debited from my checking ac...,0
1,I opened a Bank of the the West account The ac...,1
2,wells fargo in nj opened a business account wi...,2
3,A hold was placed on my saving account beca...,3
4,Dear CFPB I need to send a major concerncompl...,4
...,...,...
268358,I have recently been declined for a loan due t...,268358
268359,I am military I requested help from Loan car...,268359
268360,The collections department at Wells Fargo bega...,268360
268361,I was denied the chance to continue an applica...,268361


### Note: 
To avoid confusion later on, I refer to each row of **data_text** as a document and assign **data_text** to **documents**.

In [178]:
# Call each row of data_text a document, therefore documents for data_text
documents = data_text

Now it is time to preprocess all the WF complaints we have in data_text. To do so, let's use the [map function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) from pandas to apply `preprocess()` to the `text` column.



In [179]:
# preprocess all the complaints, saving the results as 'processed_data'
processed_data = documents['text'].map(preprocess)

In [180]:
# view the first 10 'processed_data'
processed_data[:10]

0    [check, debit, check, account, check, wasnt, c...
1    [open, bank, west, account, account, come, pro...
2    [well, fargo, open, busi, account, author, ear...
3    [hold, place, save, account, institut, say, ma...
4    [dear, cfpb, need, send, major, fals, advertis...
5    [bank, america, close, check, account, prior, ...
6    [complaint, bank, america, account, start, loc...
7    [open, account, saturday, well, fargo, go, com...
8    [busi, partner, experi, total, breakdown, comm...
9    [want, credit, increas, credit, card, chase, b...
Name: text, dtype: object

## 3 - Bag of Words (BoW) on the data set

### Bag of Words (BoW)   ###

Now we will create a dictionary from '**processed_data**' that maps every unique token to an integer id and records how often it appears in the training set. To do so, we pass `processed_data` to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call the result '`BoW_dictionary`'.

In [181]:
# Using gensim.corpora.Dictionary, create a dictionary from the preprocessed data ('processed_data')
# that records the words appearing in the training set, and call it 'BoW_dictionary'
BoW_dictionary = gensim.corpora.Dictionary(processed_data)

In [182]:
# Checking the size of BoW_dictionary
len(BoW_dictionary)

84129

In [183]:
# Print the first 10 items in the BoW_dictionary we created.
count = 0
for k, v in BoW_dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 9:
        break

0 account
1 america
2 answer
3 bank
4 breach
5 cash
6 check
7 contract
8 contractor
9 copi


**From Gensim, filter_extremes:**

This function in [`Gensim`](https://radimrehurek.com/gensim/index.html) filters out tokens in the dictionary by their frequency. It removes all tokens that are:

* Less frequent than no_below documents (absolute number, e.g. 5) or

* More frequent than no_above documents (fraction of the total corpus size, e.g. 0.3).

* After the first two steps above, keep only the first keep_n most frequent tokens (or keep all if keep_n=None).

After the pruning, resulting gaps in word ids are shrunk. Due to this gap shrinking, the same word may have a different word id before and after the call to this function!

[`filter_extremes(no_below=5, no_above=0.5, keep_n=300000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)
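
To make the three rules concrete, here is a small pure-Python sketch of the same filtering logic on a toy corpus. It follows the rules as described above rather than reproducing gensim's implementation; the function name `toy_filter_extremes` and the sample documents are hypothetical.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below=5, no_above=0.5, keep_n=None):
    """Return the tokens that survive document-frequency filtering."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent first; ties are broken alphabetically to keep the result deterministic
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept if keep_n is None else kept[:keep_n]

docs = [
    ["account", "bank", "fee"],
    ["account", "bank", "loan"],
    ["account", "loan", "escrow"],
    ["account", "card"],
]
surviving = toy_filter_extremes(docs, no_below=2, no_above=0.5)
print(surviving)
```

In this toy corpus, "account" occurs in every document and is dropped by `no_above`, while words that occur in a single document are dropped by `no_below`.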



In [184]:
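# Remove words appearing in fewer than 15 documents or in more than 10% of all documents.
# Note: the outputs recorded below still contain very frequent stems such as 'account',
# so they appear to come from a run in which BoW_dictionary was not filtered.
BoW_dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=300000)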

**From Gensim, doc2bow**

[`doc2bow(document)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)

It converts a document (a list of words) into the bag-of-words (BoW) format, i.e. a list of `(token_id, token_count)` 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in the document, so apply tokenization, stemming etc. before calling this method.

It returns:

* list of (int, int): the BoW representation of the document.

* list of (int, int), dict of (str, int): if `return_missing` is True, the BoW representation of the document plus a dictionary with the missing tokens and their frequencies.
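
The idea behind `doc2bow` can be sketched with a plain dictionary and a counter. This is a simplified illustration, not gensim's code: `token2id` is a hand-written hypothetical mapping, and tokens missing from it are silently skipped, which corresponds to the default behaviour without `return_missing`.

```python
from collections import Counter

# Hypothetical vocabulary mapping tokens to integer ids
token2id = {"account": 0, "close": 1, "money": 2, "send": 3}

def toy_doc2bow(tokens, token2id):
    """Convert a token list into a sorted list of (token_id, token_count) tuples."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = toy_doc2bow(["close", "account", "send", "money", "money", "wont", "send", "money"], token2id)
print(bow)
```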

In [185]:
# Create the bag-of-words representation of each document:
# a list of (token_id, token_count) pairs recording which words appear and how often.
# Save the result to 'BoW_data'

BoW_data = [BoW_dictionary.doc2bow(document) for document in processed_data]

In [186]:
#Check BoW_data for our sample document (document_num = 146) assigning as (token_id, token_count)

BoW_data[document_num]

[(0, 1),
 (58, 1),
 (64, 1),
 (72, 3),
 (86, 1),
 (174, 2),
 (217, 1),
 (715, 1),
 (746, 1),
 (1321, 1),
 (1345, 1)]

In [187]:
# View BoW for our sample preprocessed document
# Here document_num is 146, the sample document we previewed in Step 2
BoW_doc_146 = BoW_data[document_num]

for i in range(len(BoW_doc_146)):
    print("The word {} (\"{}\") appears {} time.".format(BoW_doc_146[i][0],
                                                     BoW_dictionary[BoW_doc_146[i][0]],
                                                     BoW_doc_146[i][1]))

The word 0 ("account") appears 1 time.
The word 58 ("close") appears 1 time.
The word 64 ("hold") appears 1 time.
The word 72 ("money") appears 3 time.
The word 86 ("tell") appears 1 time.
The word 174 ("send") appears 2 time.
The word 217 ("file") appears 1 time.
The word 715 ("prepar") appears 1 time.
The word 746 ("feder") appears 1 time.
The word 1321 ("lawsuit") appears 1 time.
The word 1345 ("wont") appears 1 time.


### TF-IDF ###

While performing TF-IDF on the corpus is not necessary for running LDA with the gensim model, it is highly recommended. The TF-IDF model expects a bag-of-words training corpus (integer counts) during initialization. During transformation, it takes a vector and returns another vector of the same dimensionality.



**TF-IDF is short for "Term Frequency, Inverse Document Frequency".**

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
* Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

In other words:

* TF(w) = `(Number of times term w appears in a document) / (Total number of terms in the document)`.
* IDF(w) = `log(Total number of documents / Number of documents with term w in it)`. The base of the logarithm only rescales the scores; the example below uses base 10.

**For example**

* Consider a document containing `100` words wherein the word 'tiger' appears 3 times. 
* The term frequency (i.e., tf) for 'tiger' is then: 
    - `TF = (3 / 100) = 0.03`. 

* Now, assume we have `10 million` documents and the word 'tiger' appears in `1000` of these. Then, the inverse document frequency (i.e., idf), using a base-10 logarithm, is calculated as:
    - `IDF = log_10(10,000,000 / 1,000) = 4`. 

* Thus, the TF-IDF weight is the product of these quantities: 
    - `TF-IDF = 0.03 * 4 = 0.12`.
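
The arithmetic of the 'tiger' example can be checked with a few lines of Python. This is a sketch of the textbook formula with a base-10 logarithm; the helper names `tf` and `idf` are hypothetical. gensim's `TfidfModel` uses a different default weighting (a base-2 logarithm followed by normalization), so the scores it produces will not match these numbers.

```python
import math

def tf(term_count, doc_length):
    """Term frequency: share of the document's terms that are this term."""
    return term_count / doc_length

def idf(num_docs, docs_with_term):
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(num_docs / docs_with_term)

tf_tiger = tf(3, 100)
idf_tiger = idf(10_000_000, 1_000)
tfidf_tiger = tf_tiger * idf_tiger
print(tf_tiger, idf_tiger, tfidf_tiger)
```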

In [188]:
# Create tf-idf model, using models.TfidfModel on 'BoW_data' and save it to 'tfidf'

from gensim import corpora, models

tfidf = models.TfidfModel(BoW_data)

In [189]:
# Apply the transformation to the entire BoW corpus and call it 'tfidf_data'

tfidf_data = tfidf[BoW_data]

In [190]:

# View TF-IDF scores for the first document assigned as (token_id, tfidf score)

from pprint import pprint
for doc in tfidf_data:
    pprint(doc)
    break

[(0, 0.045825291607662355),
 (1, 0.2120719192479975),
 (2, 0.1723888484344525),
 (3, 0.09613578316476834),
 (4, 0.24305337933065752),
 (5, 0.20752190679544874),
 (6, 0.3474928245414018),
 (7, 0.18097198195724792),
 (8, 0.358197632746281),
 (9, 0.14544491750720015),
 (10, 0.17889193043554064),
 (11, 0.03788818381484691),
 (12, 0.21607305152235232),
 (13, 0.3593746431907906),
 (14, 0.1679352428545851),
 (15, 0.10406549293031389),
 (16, 0.07307272424878672),
 (17, 0.15467907071791237),
 (18, 0.09226974821905376),
 (19, 0.37461015664244207),
 (20, 0.28713105349636575)]


## 4- Executing LDA by use of BoW ##

We are going for 10 topics in the document corpus.

**We will be executing LDA using multiple CPU cores to parallelize and speed up model training.**

Some of the parameters we will set are:

* **num_topics:** is the number of requested hidden (latent) topics to be extracted from the training corpus (gensim's default is 100; we use 10).
* **id2word:** is a mapping from *word_id* (integers) to *word* (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers:** is the number of worker processes to use for parallelization. If left unset, gensim uses the number of available cores minus one.
* **alpha, $\alpha$** and **eta, $\eta$:** are hyperparameters that affect the sparsity of the document-topic (theta, $\theta$) and topic-word (lambda, $\lambda$) distributions. We will leave these at their default values for now (the default is $\frac{1}{num\_topics}$).
    - $\alpha$ is the prior on the per-document topic distribution.
        * High **alpha**: Every document has a mixture of all topics (documents appear similar to each other).
        * Low **alpha**: Every document has a mixture of very few topics.

    - $\eta$ is the prior on the per-topic word distribution.
        * High **eta**: Each topic has a mixture of most words (topics appear similar to each other).
        * Low **eta**: Each topic has a mixture of few words.

* **passes:** is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, chunksize is 10,000 and passes is 2, then online training is done in 10 updates:
    * `#1 documents 0-9,999 `
    * `#2 documents 10,000-19,999 `
    * `#3 documents 20,000-29,999 `
    * `#4 documents 30,000-39,999 `
    * `#5 documents 40,000-49,999 `
    * `#6 documents 0-9,999 `
    * `#7 documents 10,000-19,999 `
    * `#8 documents 20,000-29,999 `
    * `#9 documents 30,000-39,999 `
    * `#10 documents 40,000-49,999` 
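
The update schedule listed above can be reproduced with a short helper. This sketch only mirrors the bookkeeping of the example (chunks per pass times number of passes); the function `update_schedule` is hypothetical and does not model how gensim distributes chunks across workers.

```python
def update_schedule(num_docs, chunksize, passes):
    """List (update_number, first_doc, last_doc) for each chunk in each pass."""
    schedule = []
    update = 0
    for _ in range(passes):
        for start in range(0, num_docs, chunksize):
            update += 1
            schedule.append((update, start, min(start + chunksize, num_docs) - 1))
    return schedule

for update, first, last in update_schedule(50_000, 10_000, 2):
    print(f"#{update} documents {first:,}-{last:,}")
```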

In [191]:
# LDA mono-core
# lda_model = gensim.models.LdaModel(BoW_data, 
#                                    num_topics = 10, 
#                                    id2word = BoW_dictionary,                                    
#                                    passes = 50)

# LDA multicore: Train our lda model by gensim.models.LdaMulticore and save it to 'LDA_model'

LDA_model = gensim.models.LdaMulticore(BoW_data, 
                                       num_topics=10, 
                                       id2word = BoW_dictionary, 
                                       passes = 2, 
                                       workers=2)

In [192]:

# For each topic, print the words occurring in that topic and their relative weights

for id_x, topic in LDA_model.print_topics(-1):
    print("The topic: {} \nWords: {}".format(id_x, topic))
    print("\n")

The topic: 0 
Words: 0.056*"chase" + 0.027*"fraud" + 0.024*"secur" + 0.020*"inform" + 0.017*"person" + 0.015*"address" + 0.014*"number" + 0.013*"social" + 0.011*"fraudul" + 0.011*"steal"


The topic: 1 
Words: 0.046*"credit" + 0.045*"card" + 0.026*"citi" + 0.022*"capit" + 0.016*"call" + 0.014*"phone" + 0.014*"applic" + 0.014*"number" + 0.014*"tell" + 0.013*"line"


The topic: 2 
Words: 0.110*"account" + 0.108*"bank" + 0.045*"check" + 0.027*"america" + 0.026*"charg" + 0.025*"money" + 0.023*"close" + 0.020*"fund" + 0.018*"balanc" + 0.016*"pay"


The topic: 3 
Words: 0.145*"credit" + 0.126*"report" + 0.048*"account" + 0.032*"remov" + 0.018*"inform" + 0.017*"score" + 0.014*"compani" + 0.014*"disput" + 0.014*"bureaus" + 0.011*"inquiri"


The topic: 4 
Words: 0.072*"loan" + 0.050*"mortgag" + 0.023*"home" + 0.019*"modif" + 0.013*"properti" + 0.013*"year" + 0.011*"foreclosur" + 0.011*"well" + 0.011*"fargo" + 0.010*"escrow"


The topic: 5 
Words: 0.044*"call" + 0.040*"tell" + 0.026*"say" + 0.02

### Classification of the topics ###

Using the words in each topic and their corresponding weights, what categories were you able to infer?

* 0: 
* 1: 
* 2: 
* 3: 
* 4: 
* 5: 
* 6: 
* 7:  
* 8: 
* 9: 

## Step 4.2 Running LDA using TF-IDF ##

In [196]:
'''
Define the LDA model using tfidf_data
'''

LDA_model_tfidf = gensim.models.LdaMulticore(tfidf_data, 
                                             num_topics=10, 
                                             id2word = BoW_dictionary, 
                                             passes = 2, 
                                             workers=4)

In [198]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for id_x, topic in LDA_model_tfidf.print_topics(-1):
    print("The topic: {} Word: {}".format(id_x, topic))
    print("\n")

The topic: 0 Word: 0.015*"loan" + 0.013*"payment" + 0.012*"mortgag" + 0.007*"month" + 0.007*"modif" + 0.006*"home" + 0.006*"escrow" + 0.005*"pay" + 0.005*"tell" + 0.005*"ocwen"


The topic: 1 Word: 0.021*"report" + 0.014*"credit" + 0.013*"debt" + 0.012*"account" + 0.011*"remov" + 0.010*"disput" + 0.010*"collect" + 0.008*"inform" + 0.007*"valid" + 0.007*"agenc"


The topic: 2 Word: 0.011*"debt" + 0.007*"mortgag" + 0.006*"collect" + 0.006*"foreclosur" + 0.006*"properti" + 0.005*"document" + 0.005*"sale" + 0.005*"loan" + 0.005*"court" + 0.005*"letter"


The topic: 3 Word: 0.033*"inquiri" + 0.018*"paypal" + 0.015*"hard" + 0.012*"salli" + 0.012*"author" + 0.012*"credit" + 0.010*"pull" + 0.009*"citibank" + 0.009*"card" + 0.007*"report"


The topic: 4 Word: 0.016*"call" + 0.011*"phone" + 0.010*"number" + 0.009*"debt" + 0.009*"forbear" + 0.008*"tell" + 0.006*"work" + 0.006*"say" + 0.006*"harass" + 0.006*"ask"


The topic: 5 Word: 0.012*"bank" + 0.011*"payment" + 0.010*"check" + 0.009*"account"

### Classification of the topics ###

As we can see, when using TF-IDF, heavier weights are given to words that are less frequent, which results in nouns being factored in. That makes it harder to figure out the categories, as nouns can be hard to categorize. This goes to show that the choice of model depends on the type of text corpus we are dealing with.

Using the words in each topic and their corresponding weights, what categories could you find?

* 0: 
* 1:  
* 2: 
* 3: 
* 4:  
* 5: 
* 6: 
* 7: 
* 8: 
* 9: 

## Step 5.1: Performance evaluation by classifying sample document using LDA Bag of Words model ##

We will check to see where our test document would be classified. 

In [199]:
'''
Text of sample document 146
'''
processed_data[146]

['close',
 'account',
 'feder',
 'lawsuit',
 'prepar',
 'file',
 'send',
 'money',
 'hold',
 'money',
 'wont',
 'tell',
 'send',
 'money']

In [200]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''

# Our test document is document number 146
for index, score in sorted(LDA_model[BoW_data[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, LDA_model.print_topic(index, 10)))


Score: 0.41861197352409363	 
Topic: 0.110*"account" + 0.108*"bank" + 0.045*"check" + 0.027*"america" + 0.026*"charg" + 0.025*"money" + 0.023*"close" + 0.020*"fund" + 0.018*"balanc" + 0.016*"pay"

Score: 0.24240322411060333	 
Topic: 0.044*"call" + 0.040*"tell" + 0.026*"say" + 0.025*"send" + 0.025*"receiv" + 0.019*"ask" + 0.019*"phone" + 0.018*"time" + 0.017*"contact" + 0.017*"speak"

Score: 0.2113182544708252	 
Topic: 0.034*"request" + 0.033*"letter" + 0.027*"document" + 0.027*"send" + 0.025*"inform" + 0.024*"provid" + 0.016*"file" + 0.016*"account" + 0.015*"report" + 0.015*"receiv"

Score: 0.08765189349651337	 
Topic: 0.052*"debt" + 0.033*"collect" + 0.015*"court" + 0.012*"compani" + 0.011*"attorney" + 0.011*"state" + 0.009*"agenc" + 0.009*"legal" + 0.008*"claim" + 0.007*"servic"


### The sample document has the highest probability (about `0.42`) of belonging to topic 2, whose top words (account, bank, check, money, close) fit a complaint about a closed account and withheld money. ###
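
Picking the dominant topic from a list of `(topic_id, score)` pairs, as done in the loop above, amounts to a sort or a `max`. A minimal sketch follows; the pairs are hypothetical values shaped like the output of `LDA_model[bow_vector]`, not results from the model.

```python
# Hypothetical (topic_id, score) pairs shaped like the output of LDA_model[bow_vector]
doc_topics = [(5, 0.24), (2, 0.42), (8, 0.21), (7, 0.09)]

# Sort by descending score, as in the loop above
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)

# The dominant topic is simply the first entry of the ranking
best_topic, best_score = ranked[0]
print(best_topic, best_score)
```

Comparing the dominant topic with the known `product_group` of each complaint would be one way to turn this spot check into a systematic evaluation.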

## Step 5.2: Performance evaluation by classifying sample document using LDA TF-IDF model ##

In [201]:
'''
Check which topic our test document belongs to using the LDA TF-IDF model.
'''
for index, score in sorted(LDA_model_tfidf[BoW_data[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, LDA_model_tfidf.print_topic(index, 10)))


Score: 0.6701599955558777	 
Topic: 0.012*"bank" + 0.011*"payment" + 0.010*"check" + 0.009*"account" + 0.009*"money" + 0.009*"charg" + 0.008*"card" + 0.007*"transfer" + 0.007*"fund" + 0.006*"transact"

Score: 0.2764914035797119	 
Topic: 0.011*"debt" + 0.007*"mortgag" + 0.006*"collect" + 0.006*"foreclosur" + 0.006*"properti" + 0.005*"document" + 0.005*"sale" + 0.005*"loan" + 0.005*"court" + 0.005*"letter"


### With the TF-IDF model, the sample document has the highest probability (about `0.67`) of belonging to topic 5 (bank, payment, check, account). ###

## Step 6: Testing model on unseen document ##

In [205]:
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
BoW_vector = BoW_dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(LDA_model[BoW_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, LDA_model.print_topic(index, 5)))

Score: 0.4725496172904968	 Topic: 0.056*"chase" + 0.027*"fraud" + 0.024*"secur" + 0.020*"inform" + 0.017*"person"
Score: 0.2120215743780136	 Topic: 0.120*"payment" + 0.042*"month" + 0.024*"pay" + 0.020*"late" + 0.016*"time"
Score: 0.19608375430107117	 Topic: 0.044*"call" + 0.040*"tell" + 0.026*"say" + 0.025*"send" + 0.025*"receiv"
Score: 0.017050854861736298	 Topic: 0.046*"credit" + 0.045*"card" + 0.026*"citi" + 0.022*"capit" + 0.016*"call"
Score: 0.01705033704638481	 Topic: 0.110*"account" + 0.108*"bank" + 0.045*"check" + 0.027*"america" + 0.026*"charg"
Score: 0.017049968242645264	 Topic: 0.052*"debt" + 0.033*"collect" + 0.015*"court" + 0.012*"compani" + 0.011*"attorney"
Score: 0.017049238085746765	 Topic: 0.145*"credit" + 0.126*"report" + 0.048*"account" + 0.032*"remov" + 0.018*"inform"
Score: 0.017048416659235954	 Topic: 0.034*"request" + 0.033*"letter" + 0.027*"document" + 0.027*"send" + 0.025*"inform"
Score: 0.01704840362071991	 Topic: 0.026*"servic" + 0.015*"request" + 0.014*"ema

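The model assigns the unseen document to topic 0 (chase, fraud, secur) with a probability of about 0.47. Since the sentence has nothing to do with financial complaints, this result mainly shows how the model behaves on out-of-domain text rather than a correct classification.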