# Charis Project 4 Part 3: Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is used to assign the text in a document to a particular topic. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

* Each job description (a document) is modeled as a multinomial distribution over topics, and each topic is modeled as a multinomial distribution over words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related. Therefore, choosing the right corpus of data is crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.
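
To make the assumptions above concrete, here is a minimal sketch of the generative story LDA assumes: draw a topic mixture for the document, then draw each word from one of those topics. The vocabulary, the two topics and their probabilities are hypothetical toy values chosen only for illustration, and the sketch uses Python's standard library rather than gensim.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocabulary, alpha, n_words, rng):
    """Generate one toy document following LDA's assumed generative story."""
    # 1. Draw this document's mixture over topics
    topic_mix = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture
        topic = rng.choices(range(len(topic_mix)), weights=topic_mix)[0]
        # 3. Pick a word according to that topic's word distribution
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return topic_mix, words

# Hypothetical toy vocabulary and two hand-written topics (illustrative values only)
vocabulary = ["data", "model", "analytics", "sales", "client", "account"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "data science"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "sales"-like topic
]

rng = random.Random(400)
topic_mix, words = generate_document(
    topic_word_probs, vocabulary, alpha=[0.5, 0.5], n_words=8, rng=rng
)
print(topic_mix)
print(words)
```

Training LDA works in the opposite direction: given only the words, it tries to recover plausible topic mixtures and topic-word distributions.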

In [128]:
import json
import pandas as pd

## Step 1: Load the dataset

The dataset we'll use is a list of job descriptions. Start by loading the text file.

In [129]:
# Load the dataset from previous section (in a separate jupyter notebook) and save it to 'data_text'

import json
with open('data0.txt') as json_file:  
    data = json.load(json_file)

data_text = pd.DataFrame.from_dict(data).T.description
documents=data_text
nrows = documents.shape[0]

In [130]:
documents.head()

0    Be part of a great team culture in the analyti...
1    Flexible work arrangements to meet your needs\...
2    Job Title: Technical Applications Specialist /...
3    Federal Government - baseline preferred\nASAP ...
4    The Opportunity\n\nDo you have what it takes t...
Name: description, dtype: object

Let's glance at the dataset:

In [131]:
'''
Get the total number of documents
'''
print(len(documents))

34772


## Step 2: Data Preprocessing ##

We will perform the following steps:

* **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words that have 3 or fewer characters are removed.
* All **stopwords** are removed.
* Words are **lemmatized** - words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Words can also be **stemmed** (reduced to their root form), although the `preprocess()` function in this notebook only lemmatizes.


In [132]:
# Loading Gensim and nltk libraries
# import sys
# !{sys.executable} -m pip install nltk numpy tqdm gensim


In [133]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer
import numpy as np

np.random.seed(400)  # seed NumPy's global random number generator

In [134]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [135]:
# Pre-processing helpers that are applied to every document in the dataset
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Despite the name, this only lemmatizes (treating the token as a verb); no stemming is applied
    return lemmatizer.lemmatize(text, pos='v')

# Tokenize and lemmatize
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        # Keep tokens that are not stopwords and are longer than 3 characters
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [136]:
# Preview a document after preprocessing
document_num = 3  #taking the document at index 3 as sample for demonstration
doc_sample = documents[document_num]

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['Federal', 'Government', '-', 'baseline', 'preferred\nASAP', 'Start\nR,', 'Hadoop,', 'Teradata,', 'Shiny\nWe', 'are', 'looking', 'for', 'a', 'creative,', 'innovative', 'and', 'intellectually', 'curious', 'and', 'entrepreneurial', 'Data', 'Scientist', 'to', 'join', 'the', 'Analytics', 'team', 'within', 'a', 'Federal', 'Government', 'Agency.\n\nResponsibilities:\n\nIdentify', 'valuable', 'data', 'sources', 'and', 'automate', 'collection', 'processes\nUndertake', 'pre-processing', 'of', 'structured', 'and', 'unstructured', 'data\nAnalyse', 'large', 'amounts', 'of', 'information', 'to', 'discover', 'trends', 'and', 'patterns\nBuild', 'predictive', 'models', 'and', 'machine-learning', 'algorithms\nCombine', 'models', 'through', 'ensemble', 'modelling\nPresent', 'information', 'using', 'data', 'visualization', 'techniques\nPropose', 'solutions', 'and', 'strategies', 'to', 'business', 'challenges\nCollaborate', 'with', 'engineering', 'and', 'product', 'development', 'team

Let's now preprocess all the job descriptions we have. To do that, let's use the [map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) function from pandas to apply `preprocess()` to the `description` column.

In [138]:
# preprocess all the job descriptions, saving the list of results as 'processed_docs'
processed_docs = documents.map(preprocess)

In [161]:
#Preview 'processed_docs', remember processed_docs is a pandas series
processed_docs[:10]

0    [great, team, culture, analytics, space, fanta...
1    [flexible, work, arrangements, meet, need, wor...
2    [title, technical, applications, specialist, f...
3    [federal, government, baseline, prefer, asap, ...
4    [opportunity, take, initiative, faculty, infor...
5    [work, type, fix, term, location, parkville, d...
6    [opportunity, immerse, inclusive, culture, ski...
7    [post, applications, close, midnight, days, re...
8    [deliver, components, design, implementation, ...
9    [flexible, work, arrangements, meet, need, wor...
Name: description, dtype: object

## Step 3.1: Bag of words on the dataset

Now let's create a dictionary from 'processed_docs' that maps every unique word in the training set to an integer id and keeps track of how often each word appears. To do that, let's pass `processed_docs` to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call it '`dictionary`'.

In [141]:
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears 
in the training set using gensim.corpora.Dictionary and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [142]:
dictionary[4] 

'actionable'

In [143]:
# Checking dictionary created
count = 0
for k, v in dictionary.items():  # .items() works on Python 3 (iteritems() is a Python 2 idiom)
    print(k, v)
    count += 1
    if count > 10:
        break

0 ability
1 acquire
2 acquisition
3 act
4 actionable
5 actively
6 add
7 advantage
8 alignment
9 amazon
10 analyse


### Gensim filter_extremes

[`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)

Filter out tokens that appear in

* less than no_below documents (absolute number) or
* more than no_above documents (fraction of total corpus size, not absolute number).
* after (1) and (2), keep only the first keep_n most frequent tokens (or keep all if None).
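
To see what these three rules do, here is a rough pure-Python sketch of the same filtering logic on a toy corpus. It is not gensim's implementation: `filter_extremes_sketch` and the toy documents are hypothetical names and values used only for illustration.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive the three filtering rules."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    max_docs = no_above * len(docs)  # no_above is a fraction of the corpus size
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]

    # Keep only the keep_n most frequent survivors (all of them if keep_n is None)
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

docs = [
    ["data", "team", "python"],
    ["data", "team", "sales"],
    ["data", "python", "legal"],
    ["data", "team", "python"],
]
print(sorted(filter_extremes_sketch(docs, no_below=2, no_above=0.75)))
```

Here "data" is dropped because it appears in every document, and "sales" and "legal" are dropped because each appears in only one document.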

In [145]:
# OPTIONAL STEP
# Remove very rare and very common words:
#
# - words appearing in fewer than 15 documents
# - words appearing in more than 10% of all documents
#
# dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)

### Gensim doc2bow

[`doc2bow(document)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)

* Convert a document (a list of token strings, i.e. one entry of the `processed_docs` Series) into a bag-of-words list of tuples (token_id, token_count). Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in the document; apply tokenization, stemming, etc. before calling this method.
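
As a rough illustration of the bag-of-words format, the sketch below counts tokens against a small hand-written vocabulary. `doc2bow_sketch` and `token2id` are hypothetical stand-ins, not gensim's code; in the notebook the vocabulary comes from `dictionary`.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Turn a list of tokens into a sorted list of (token_id, token_count) tuples."""
    # Tokens that are not in the vocabulary are dropped
    counts = Counter(tok for tok in tokens if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

# Hypothetical toy vocabulary; a real one comes from gensim.corpora.Dictionary
token2id = {"data": 0, "learn": 1, "machine": 2, "team": 3}

tokens = ["data", "machine", "learn", "data", "hadoop", "team", "data"]
print(doc2bow_sketch(tokens, token2id))
```

Note that "hadoop" disappears because it is not in the toy vocabulary. The same thing happens to words of an unseen document that the trained dictionary has never encountered.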

In [146]:
# Create the bag-of-words model: for each document, a list of (token_id, token_count) pairs
# reporting which words appear and how many times. Save this to 'bow_corpus'
bow_corpus = processed_docs.apply(dictionary.doc2bow)

In [148]:
# Checking Bag of Words corpus for sample document (document_num=3)
bow_corpus[document_num] # --> Format is (token_id, token_count)

[(7, 1),
 (10, 1),
 (14, 1),
 (30, 1),
 (32, 1),
 (40, 1),
 (43, 1),
 (46, 1),
 (57, 1),
 (65, 6),
 (82, 1),
 (87, 4),
 (115, 1),
 (135, 1),
 (136, 4),
 (146, 1),
 (191, 1),
 (194, 1),
 (213, 1),
 (221, 3),
 (227, 1),
 (231, 2),
 (238, 3),
 (243, 1),
 (250, 1),
 (259, 1),
 (266, 1),
 (280, 1),
 (287, 1),
 (340, 1),
 (345, 1),
 (349, 3),
 (355, 1),
 (356, 5),
 (370, 1),
 (372, 1),
 (374, 2),
 (378, 1),
 (381, 1),
 (401, 1),
 (403, 1),
 (406, 2),
 (411, 1),
 (422, 1),
 (423, 1),
 (429, 1),
 (438, 1),
 (440, 1),
 (455, 1),
 (464, 1),
 (489, 1),
 (523, 1),
 (553, 1),
 (554, 1),
 (555, 1),
 (556, 1),
 (557, 1),
 (558, 1),
 (559, 1),
 (560, 1),
 (561, 1),
 (562, 1),
 (563, 1),
 (564, 1),
 (565, 1),
 (566, 1),
 (567, 1),
 (568, 1),
 (569, 1),
 (570, 2),
 (571, 2),
 (572, 2),
 (573, 2),
 (574, 1),
 (575, 1),
 (576, 1),
 (577, 1),
 (578, 1),
 (579, 1),
 (580, 1),
 (581, 1),
 (582, 1),
 (583, 1),
 (584, 1),
 (585, 1),
 (586, 1),
 (587, 1),
 (588, 1),
 (589, 1),
 (590, 1),
 (591, 1),
 (592, 1),
 

In [149]:
# Preview BOW for our sample preprocessed document. Here document_num is document number 3 which we have checked in Step 2.
for token_id, count in bow_corpus[document_num]:  # each entry is a (token_id, token_count) tuple
    print("Word \"{}\" with token_id {} appears {} time.".format(dictionary[token_id], token_id, count))

Word "advantage" with token_id 7 appears 1 time.
Word "analyse" with token_id 10 appears 1 time.
Word "analytics" with token_id 14 appears 1 time.
Word "build" with token_id 30 appears 1 time.
Word "business" with token_id 32 appears 1 time.
Word "collaborate" with token_id 40 appears 1 time.
Word "company" with token_id 43 appears 1 time.
Word "complex" with token_id 46 appears 1 time.
Word "creative" with token_id 57 appears 1 time.
Word "data" with token_id 65 appears 6 time.
Word "engineer" with token_id 82 appears 1 time.
Word "experience" with token_id 87 appears 4 time.
Word "identify" with token_id 115 appears 1 time.
Word "large" with token_id 135 appears 1 time.
Word "learn" with token_id 136 appears 4 time.
Word "look" with token_id 146 appears 1 time.
Word "product" with token_id 191 appears 1 time.
Word "quantitative" with token_id 194 appears 1 time.
Word "science" with token_id 213 appears 1 time.
Word "skills" with token_id 221 appears 3 time.
Word "stakeholders" with t

## Step 3.2: TF-IDF on our document set ##

While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus during initialization. During transformation, it will take a vector and return another vector of the same dimensionality.

*Please note: The author of Gensim states that the standard procedure for LDA is to use the Bag of Words model.*

**TF-IDF stands for "Term Frequency, Inverse Document Frequency".**

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
* Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

In other words:

* TF(w) = `(Number of times term w appears in a document) / (Total number of terms in the document)`.
* IDF(w) = `log_e(Total number of documents / Number of documents with term w in it)`.

**For example**

* Consider a document containing `100` words wherein the word 'tiger' appears 3 times. 
* The term frequency (i.e., tf) for 'tiger' is then: 
    - `TF = (3 / 100) = 0.03`. 

* Now, assume we have `10 million` documents and the word 'tiger' appears in `1000` of these. Then, the inverse document frequency (i.e., idf) is calculated as:
    - `IDF = log_e(10,000,000 / 1,000) ≈ 9.21`. (With a base-10 logarithm the IDF would be exactly 4.)

* Thus, the Tf-idf weight is the product of these quantities: 
    - `TF-IDF = 0.03 * 9.21 ≈ 0.28`.
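
The 'tiger' example can be checked numerically with a few lines of Python. The helper names `tf` and `idf` are ours, for illustration only. The base of the logarithm changes the numbers: the natural log gives an IDF of about 9.21, while a base-10 log gives exactly 4.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document's terms that are this term."""
    return term_count / total_terms

def idf(total_docs, docs_with_term, log=math.log):
    """Inverse document frequency, using the given logarithm function."""
    return log(total_docs / docs_with_term)

tf_tiger = tf(3, 100)
idf_natural = idf(10_000_000, 1_000)                 # natural log (log_e)
idf_base10 = idf(10_000_000, 1_000, log=math.log10)  # base-10 log

print("TF:", tf_tiger)
print("natural log -> IDF:", round(idf_natural, 2), "TF-IDF:", round(tf_tiger * idf_natural, 3))
print("base-10 log -> IDF:", round(idf_base10, 2), "TF-IDF:", round(tf_tiger * idf_base10, 3))
```

gensim's `TfidfModel` applies its own weighting and normalization defaults, so the scores it produces will not match these hand-computed values exactly.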

In [150]:
# Create tf-idf model object using models.TfidfModel on 'bow_corpus' and save it to 'tfidf'
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)

In [151]:
# Apply transformation to the entire corpus and call it 'corpus_tfidf'

corpus_tfidf = tfidf[bow_corpus]

In [152]:
# Preview TF-IDF scores for our first document --> (token_id, tfidf score)
from pprint import pprint

for i in corpus_tfidf:
    pprint(i)
    break

[(0, 0.009745921556971626),
 (1, 0.030252811992231503),
 (2, 0.031579760699573425),
 (3, 0.03759869320286057),
 (4, 0.03170039552236711),
 (5, 0.03621500746707537),
 (6, 0.0269366922760234),
 (7, 0.04348299693072362),
 (8, 0.033307936883346365),
 (9, 0.028091253636389965),
 (10, 0.03788802685483294),
 (11, 0.02477744094128617),
 (12, 0.01883091611394919),
 (13, 0.035214016983874115),
 (14, 0.04475666066155876),
 (15, 0.02748677306053997),
 (16, 0.017070543303190588),
 (17, 0.0038396691200862417),
 (18, 0.02928030790314177),
 (19, 0.0217709695513815),
 (20, 0.04173395321905053),
 (21, 0.02829093111383332),
 (22, 0.037744855847277964),
 (23, 0.03069999603875983),
 (24, 0.022611920717976394),
 (25, 0.012496855220599899),
 (26, 0.006055004923320923),
 (27, 0.03654586151571627),
 (28, 0.019384979408895064),
 (29, 0.03175365611557208),
 (30, 0.009382449259523034),
 (31, 0.03644701659778417),
 (32, 0.015968046111317892),
 (33, 0.01901412855979561),
 (34, 0.017103409817903457),
 (35, 0.0305912

## Step 4.1: Running LDA using Bag of Words ##

We are going for 10 topics in the document corpus.

**We will run LDA with `LdaMulticore`, which parallelizes training across CPU cores to speed it up.**

Some of the parameters we will be tweaking are:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus.
* **id2word** is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers** is the number of extra processes to use for parallelization. By default, it uses all available cores minus one.
* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave these at their default values for now (the default value is `1/num_topics`).
    - Alpha controls the per-document topic distribution.
        * High alpha: Every document has a mixture of all topics (documents appear similar to each other).
        * Low alpha: Every document has a mixture of very few topics.

    - Eta controls the per-topic word distribution.
        * High eta: Each topic has a mixture of most words (topics appear similar to each other).
        * Low eta: Each topic has a mixture of few words.

* **passes** is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, chunksize is 10,000, and passes is 2, then online training is done in 10 updates:
    * `#1 documents 0-9,999 `
    * `#2 documents 10,000-19,999 `
    * `#3 documents 20,000-29,999 `
    * `#4 documents 30,000-39,999 `
    * `#5 documents 40,000-49,999 `
    * `#6 documents 0-9,999 `
    * `#7 documents 10,000-19,999 `
    * `#8 documents 20,000-29,999 `
    * `#9 documents 30,000-39,999 `
    * `#10 documents 40,000-49,999` 
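
The update schedule listed above can be reproduced with a short sketch. `training_updates` is a hypothetical helper written for illustration. It only enumerates which documents fall into each chunk and does not reflect how gensim dispatches chunks to worker processes.

```python
def training_updates(num_docs, chunksize, passes):
    """List (update_number, first_doc, last_doc) for every chunk of every pass."""
    updates = []
    for _ in range(passes):
        for start in range(0, num_docs, chunksize):
            end = min(start + chunksize, num_docs) - 1
            updates.append((len(updates) + 1, start, end))
    return updates

for number, first, last in training_updates(num_docs=50_000, chunksize=10_000, passes=2):
    print("#{} documents {:,}-{:,}".format(number, first, last))
```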

In [153]:
# Train lda model using gensim.models.LdaMulticore and save it to 'lda_model'
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=10,
                                       id2word=dictionary,
                                       passes=2,   # number of training passes through the corpus
                                       workers=2   # number of worker processes (specific to LdaMulticore)
                                      )

In [154]:
# For each topic, we will explore the words occurring in that topic and their relative weights
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.014*"team" + 0.013*"technical" + 0.012*"experience" + 0.012*"skills" + 0.012*"data" + 0.010*"cloud" + 0.010*"customer" + 0.010*"learn" + 0.010*"solutions" + 0.009*"development"


Topic: 1 
Words: 0.024*"work" + 0.022*"client" + 0.018*"watson" + 0.015*"business" + 0.014*"experience" + 0.014*"team" + 0.013*"clients" + 0.013*"service" + 0.012*"company" + 0.011*"role"


Topic: 2 
Words: 0.025*"sales" + 0.016*"account" + 0.016*"investment" + 0.012*"customer" + 0.011*"business" + 0.009*"manage" + 0.009*"team" + 0.009*"company" + 0.008*"provide" + 0.008*"solution"


Topic: 3 
Words: 0.021*"data" + 0.015*"develop" + 0.013*"work" + 0.012*"team" + 0.011*"support" + 0.010*"management" + 0.010*"skills" + 0.010*"investment" + 0.010*"market" + 0.010*"willis"


Topic: 4 
Words: 0.019*"work" + 0.016*"team" + 0.012*"experience" + 0.012*"management" + 0.011*"service" + 0.009*"legal" + 0.009*"role" + 0.008*"tech" + 0.008*"technology" + 0.008*"firm"


Topic: 5 
Words: 0.028*"school" + 0

### Classification of the topics ###

Using the words in each topic and their corresponding weights, what categories were you able to infer?

* 0: 
* 1: 
* 2: 
* 3: 
* 4: 
* 5: 
* 6: 
* 7:  
* 8: 
* 9: 

## Step 4.2 Running LDA using TF-IDF ##

In [155]:
# Define lda model using corpus_tfidf
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
                                             num_topics=10, 
                                             id2word = dictionary, 
                                             passes = 2, 
                                             workers=4)

In [156]:
# Exploring the words occurring in each topic and their relative weights
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.008*"manufacture" + 0.005*"mechanical" + 0.005*"estimator" + 0.005*"linkedin" + 0.005*"major" + 0.005*"sukanya" + 0.005*"civil" + 0.005*"devops" + 0.004*"design" + 0.004*"maximo"


Topic: 1 Word: 0.007*"school" + 0.006*"dimensional" + 0.006*"institutional" + 0.005*"water" + 0.005*"financial" + 0.005*"risk" + 0.004*"investment" + 0.004*"oversight" + 0.004*"departments" + 0.004*"philosophy"


Topic: 2 Word: 0.008*"sales" + 0.006*"zendesk" + 0.006*"organize" + 0.006*"account" + 0.006*"manager" + 0.006*"telstra" + 0.005*"contact" + 0.005*"manner" + 0.004*"mercer" + 0.004*"customer"


Topic: 3 Word: 0.007*"deloitte" + 0.007*"linear" + 0.006*"legal" + 0.004*"advisory" + 0.004*"array" + 0.004*"firm" + 0.004*"statistical" + 0.004*"discount" + 0.004*"lenovo" + 0.004*"apprehension"


Topic: 4 Word: 0.013*"investment" + 0.008*"watson" + 0.004*"developers" + 0.003*"institutional" + 0.003*"specific" + 0.003*"fund" + 0.003*"component" + 0.003*"graduate" + 0.003*"solution" + 0.003*"e

### Classification of the topics ###

As we can see, when using tf-idf, heavier weights are given to words that are less frequent across the corpus, which brings proper nouns such as company and product names (for example "deloitte", "telstra" and "zendesk") into the topics. That makes it harder to figure out the categories, as such names are hard to categorize. This goes to show that the model we apply should depend on the type of text corpus we are dealing with.
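
A toy example shows why this happens. In the sketch below (hypothetical documents, plain `log_e` TF-IDF rather than gensim's `TfidfModel`), words shared by every document score zero, while a company name that appears in a single document gets the highest weight.

```python
import math

# Hypothetical toy corpus: three short, already tokenized "job descriptions"
docs = [
    ["team", "data", "experience", "deloitte"],
    ["team", "data", "experience", "sales"],
    ["team", "data", "experience", "legal"],
]

def tf_idf(term, doc, docs):
    """Plain TF-IDF with a natural logarithm."""
    tf = doc.count(term) / len(doc)
    docs_with_term = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / docs_with_term)

# Score every word of the first document, highest first
scores = {term: tf_idf(term, docs[0], docs) for term in docs[0]}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 3))
```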

Using the words in each topic and their corresponding weights, what categories could you find?

* 0: 
* 1:  
* 2: 
* 3: 
* 4:  
* 5: 
* 6: 
* 7: 
* 8: 
* 9: 

## Step 5.1: Performance evaluation by classifying sample document using LDA Bag of Words model ##

We will check to see where our test document would be classified. 

In [157]:
# Text of sample document 3
processed_docs[3]

['federal',
 'government',
 'baseline',
 'prefer',
 'asap',
 'start',
 'hadoop',
 'teradata',
 'shiny',
 'look',
 'creative',
 'innovative',
 'intellectually',
 'curious',
 'entrepreneurial',
 'data',
 'scientist',
 'join',
 'analytics',
 'team',
 'federal',
 'government',
 'agency',
 'identify',
 'valuable',
 'data',
 'source',
 'automate',
 'collection',
 'process',
 'undertake',
 'process',
 'structure',
 'unstructured',
 'data',
 'analyse',
 'large',
 'amount',
 'information',
 'discover',
 'trend',
 'pattern',
 'build',
 'predictive',
 'model',
 'machine',
 'learn',
 'algorithms',
 'combine',
 'model',
 'ensemble',
 'model',
 'present',
 'information',
 'data',
 'visualization',
 'techniques',
 'propose',
 'solutions',
 'strategies',
 'business',
 'challenge',
 'collaborate',
 'engineer',
 'product',
 'development',
 'team',
 'experience',
 'require',
 'experience',
 'statistical',
 'machine',
 'learn',
 'project',
 'commercial',
 'experience',
 'data',
 'science',
 'principles',


In [158]:
# Check which topic our test document belongs to using the LDA Bag of Words model.

# document_num is 3 because our test document is document 3
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.6753913760185242	 
Topic: 0.026*"data" + 0.021*"work" + 0.017*"experience" + 0.017*"team" + 0.012*"business" + 0.010*"solutions" + 0.009*"skills" + 0.008*"build" + 0.008*"design" + 0.008*"process"

Score: 0.11879268288612366	 
Topic: 0.014*"team" + 0.013*"technical" + 0.012*"experience" + 0.012*"skills" + 0.012*"data" + 0.010*"cloud" + 0.010*"customer" + 0.010*"learn" + 0.010*"solutions" + 0.009*"development"

Score: 0.11593401432037354	 
Topic: 0.019*"work" + 0.016*"team" + 0.012*"experience" + 0.012*"management" + 0.011*"service" + 0.009*"legal" + 0.009*"role" + 0.008*"tech" + 0.008*"technology" + 0.008*"firm"

Score: 0.055896494537591934	 
Topic: 0.018*"experience" + 0.014*"customer" + 0.014*"sales" + 0.013*"support" + 0.012*"management" + 0.012*"work" + 0.012*"software" + 0.012*"business" + 0.012*"technical" + 0.011*"team"

Score: 0.02997143380343914	 
Topic: 0.029*"analytics" + 0.017*"data" + 0.015*"work" + 0.014*"experience" + 0.014*"drive" + 0.010*"help" + 0.010*"learn

### It has the highest probability (`x`) of being part of the topic that we assigned as X, which is the accurate classification. ###
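
The dominant topic can also be read off programmatically. The sketch below uses hypothetical `(topic_id, probability)` pairs shaped like the output of `lda_model[bow_vector]`. The ids and scores are made up for illustration and are not the notebook's results.

```python
# Hypothetical (topic_id, probability) pairs, shaped like the output of lda_model[bow_vector]
topic_scores = [(0, 0.12), (4, 0.11), (6, 0.68), (7, 0.06), (8, 0.03)]

# Sort from most to least probable, as the loop above does
ranked = sorted(topic_scores, key=lambda pair: pair[1], reverse=True)

# The dominant topic is simply the first entry
best_topic, best_score = ranked[0]
print("Dominant topic: {} (probability {:.2f})".format(best_topic, best_score))
```

Topics whose probability falls below the model's `minimum_probability` threshold are left out of the list gensim returns, so the scores shown for a document do not always cover all 10 topics.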

## Step 5.2: Performance evaluation by classifying sample document using LDA TF-IDF model ##

In [159]:
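# Check which topic our test document belongs to using the LDA TF-IDF model.
# Note: the model was trained on 'corpus_tfidf', but the query below passes raw bag-of-words counts;
# passing tfidf[bow_corpus[document_num]] instead would match the training input.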

for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.4450368881225586	 
Topic: 0.010*"analytics" + 0.010*"accenture" + 0.009*"research" + 0.007*"maturity" + 0.005*"advancements" + 0.005*"security" + 0.004*"statistics" + 0.004*"wireless" + 0.004*"implementation" + 0.004*"stelarlab"

Score: 0.23393133282661438	 
Topic: 0.020*"tower" + 0.020*"willis" + 0.013*"iqvia" + 0.008*"healthcare" + 0.007*"deloitte" + 0.006*"data" + 0.006*"analyst" + 0.006*"watson" + 0.006*"analytics" + 0.005*"optimise"

Score: 0.1604917347431183	 
Topic: 0.013*"investment" + 0.008*"watson" + 0.004*"developers" + 0.003*"institutional" + 0.003*"specific" + 0.003*"fund" + 0.003*"component" + 0.003*"graduate" + 0.003*"solution" + 0.003*"employment"

Score: 0.09476153552532196	 
Topic: 0.008*"manufacture" + 0.005*"mechanical" + 0.005*"estimator" + 0.005*"linkedin" + 0.005*"major" + 0.005*"sukanya" + 0.005*"civil" + 0.005*"devops" + 0.004*"design" + 0.004*"maximo"

Score: 0.06177525967359543	 
Topic: 0.011*"liveperson" + 0.006*"trade" + 0.006*"consult" + 0.005*"v

### It has the highest probability (`x%`) of being part of the topic that we assigned as X. ###

## Step 6: Testing model on unseen document ##

In [160]:
unseen_document = "Senior financial consultant, Melbourne/Sydney, quantitative modelling risk management."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.468237966299057	 Topic: 0.026*"data" + 0.021*"work" + 0.017*"experience" + 0.017*"team" + 0.012*"business"
Score: 0.3150328993797302	 Topic: 0.019*"work" + 0.016*"team" + 0.012*"experience" + 0.012*"management" + 0.011*"service"
Score: 0.14670851826667786	 Topic: 0.029*"experience" + 0.025*"project" + 0.020*"service" + 0.019*"engineer" + 0.018*"network"
Score: 0.010003920644521713	 Topic: 0.021*"data" + 0.015*"develop" + 0.013*"work" + 0.012*"team" + 0.011*"support"
Score: 0.010003841482102871	 Topic: 0.029*"analytics" + 0.017*"data" + 0.015*"work" + 0.014*"experience" + 0.014*"drive"
Score: 0.010003740899264812	 Topic: 0.024*"work" + 0.022*"client" + 0.018*"watson" + 0.015*"business" + 0.014*"experience"
Score: 0.01000288501381874	 Topic: 0.025*"sales" + 0.016*"account" + 0.016*"investment" + 0.012*"customer" + 0.011*"business"
Score: 0.01000240258872509	 Topic: 0.014*"team" + 0.013*"technical" + 0.012*"experience" + 0.012*"skills" + 0.012*"data"
Score: 0.010002021677792072	 

The model correctly classifies the unseen document with 'x'% probability to the X category.