## Pre-processing and Training Data Development

In [1]:
import pandas as pd
import numpy as np

import gensim
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import TreebankWordTokenizer

from gensim.models.tfidfmodel import TfidfModel

import warnings
warnings.filterwarnings("ignore")

In [2]:
data = pd.read_csv('../data/data_cleaned_final.csv')

pd.set_option('display.max_colwidth',None)
data.head()

Unnamed: 0,job_title,location,connection,job_title_nostop,location_cleaned,combined
0,2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional,"Houston, Texas",85.0,"['2019', 'ct', 'bauer', 'college', 'business', 'graduate', 'magna', 'cum', 'laude', 'aspiring', 'human', 'resources', 'professional']","['houston', 'texas']",2019 ct bauer college business graduate magna cum laude aspiring human resources professional houston texas
1,Native English Teacher at EPIK (English Program in Korea),Canada,500.0,"['native', 'english', 'teacher', 'epik', 'english', 'program', 'korea']",['canada'],native english teacher epik english program korea canada
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina",44.0,"['aspiring', 'human', 'resources', 'professional']","['raleigh', 'durham', 'north', 'carolina', '']",aspiring human resources professional raleigh durham north carolina
3,People Development Coordinator at Ryan,"Denton, Texas",500.0,"['people', 'development', 'coordinator', 'ryan']","['denton', 'texas']",people development coordinator ryan denton texas
4,Aspiring Human Resources Specialist,New York City,1.0,"['aspiring', 'human', 'resources', 'specialist']","['new', 'york', 'city', '']",aspiring human resources specialist new york city


### Feature Engineering

The `combined` column holds the cleaned job title and location text joined into a single string. Its value counts are displayed below.

In [3]:
data['combined'].value_counts()

2019 ct bauer college business graduate magna cum laude aspiring human resources professional houston texas                          1
liberal arts major aspiring human resources analyst baton rouge louisiana                                                            1
senior human resources business partner heil environmental chattanooga tennessee                                                     1
aspiring human resources professional energetic teamfocused leader austin texas                                                      1
hr manager endemol shine north america los angeles california                                                                        1
human resources professional world leader gis software highland california                                                           1
rrp brand portfolio executive jti japan tobacco international philadelphia                                                           1
information systems specialist programmer love data org

In [4]:
# variable to train
X_train = data['combined'].values

In [5]:
print(X_train)

['2019 ct bauer college business graduate magna cum laude aspiring human resources professional houston texas'
 'native english teacher epik english program korea canada'
 'aspiring human resources professional raleigh durham north carolina '
 'people development coordinator ryan denton texas'
 'aspiring human resources specialist new york city '
 'student humber college aspiring human resources generalist canada'
 'hr senior specialist san francisco bay '
 'seeking human resources hris generalist positions philadelphia '
 'student chapman university lake forest california'
 'svp chro marketing communications csr officer engie houston woodlands energy gphr sphr houston texas '
 'human resources coordinator intercontinental buckhead atlanta atlanta georgia'
 'aspiring human resources management student seeking internship houston texas '
 'seeking human resources opportunities chicago illinois'
 'experienced retail manager aspiring human resources professional austin texas '
 'human reso

### Generate Document Vectors

#### Usage of Gensim
Gensim is an open-source NLP library for building document and word vectors.<br>
With these vectors, we can examine relationships among words or documents based on how near or far apart they are in the vector space.<br>
Gensim lets you build corpora and dictionaries using simple classes and functions.

### Tokenize words

In [6]:
tokenized_docs = [TreebankWordTokenizer().tokenize(doc) for doc in X_train]

In [None]:
import copy

# keep an untouched copy of the tokenized documents
copy_tokenized_docs = copy.deepcopy(tokenized_docs)

In [7]:
print(tokenized_docs)

[['2019', 'ct', 'bauer', 'college', 'business', 'graduate', 'magna', 'cum', 'laude', 'aspiring', 'human', 'resources', 'professional', 'houston', 'texas'], ['native', 'english', 'teacher', 'epik', 'english', 'program', 'korea', 'canada'], ['aspiring', 'human', 'resources', 'professional', 'raleigh', 'durham', 'north', 'carolina'], ['people', 'development', 'coordinator', 'ryan', 'denton', 'texas'], ['aspiring', 'human', 'resources', 'specialist', 'new', 'york', 'city'], ['student', 'humber', 'college', 'aspiring', 'human', 'resources', 'generalist', 'canada'], ['hr', 'senior', 'specialist', 'san', 'francisco', 'bay'], ['seeking', 'human', 'resources', 'hris', 'generalist', 'positions', 'philadelphia'], ['student', 'chapman', 'university', 'lake', 'forest', 'california'], ['svp', 'chro', 'marketing', 'communications', 'csr', 'officer', 'engie', 'houston', 'woodlands', 'energy', 'gphr', 'sphr', 'houston', 'texas'], ['human', 'resources', 'coordinator', 'intercontinental', 'buckhead', 'at

### Create bigram

Topic models make more sense when 'new' and 'york' are treated as the single token 'new_york'. We can do this by training a bigram model and rewriting our corpus with the detected phrases.

In [8]:
from gensim.models.phrases import Phrases

In [9]:
#Build another Bigram without the connector_words
bigram = Phrases(tokenized_docs, min_count=1, threshold=1) #train model

tokenized_docs_bigram= [bigram[line] for line in tokenized_docs]

In [10]:
for phrase, score in bigram.find_phrases(tokenized_docs).items():
    print(phrase, score)

aspiring_human 15.258064516129034
resources_professional 12.483870967741936
houston_texas 38.22222222222222
raleigh_durham 114.66666666666666
north_carolina 51.6
resources_specialist 4.161290322580645
new_york 82.56
resources_generalist 8.32258064516129
seeking_human 5.548387096774194
human_resources 16.10822060353798
resources_management 2.774193548387097
seeking_internship 28.666666666666664
retail_manager 36.857142857142854
austin_texas 28.666666666666664
director_human 5.548387096774194
north_america 51.6
manager_seeking 8.19047619047619
business_management 17.2
major_aspiring 14.333333333333332
resources_manager 4.755760368663594
information_systems 129.0
los_angeles 129.0
management_major 28.666666666666664
long_beach 64.5
resources_position 5.548387096774194
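
The scores printed above appear consistent with gensim's default phrase scoring, where a pair of words scores higher the more often it occurs relative to the counts of its parts. The plain-Python sketch below reimplements that idea on a toy corpus purely as an illustration. The function name `bigram_scores` and the toy documents are assumptions made for this example, and the vocabulary size is assumed to count unigram and bigram types together.

```python
from collections import Counter

def bigram_scores(docs, min_count=1, threshold=1.0):
    """Score adjacent word pairs, keeping those above `threshold` (illustrative only)."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    # Assumption: the vocabulary size counts unigram and bigram types together.
    len_vocab = len(unigrams) + len(bigrams)

    scores = {}
    for (a, b), count_ab in bigrams.items():
        # score = (count(a, b) - min_count) * len_vocab / (count(a) * count(b))
        score = (count_ab - min_count) * len_vocab / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[f"{a}_{b}"] = score
    return scores

toy_docs = [
    ["new", "york", "city"],
    ["new", "york", "office"],
    ["city", "office"],
]
toy_scores = bigram_scores(toy_docs)
print(toy_scores)
```

With `min_count=1`, a pair that is seen only once scores zero, so only repeated pairs such as `new_york` pass the threshold in this toy corpus.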


In [None]:
tokenized_docs_bigram

In [11]:
len(tokenized_docs_bigram)

49

### Create Dictionary needed for Topic Modeling

In [12]:
# Build a dictionary from the tokenized job title and location text.
# It maps each unique token to an integer id and tracks how often each token appears in the training set.
dictionary = Dictionary(tokenized_docs_bigram) 

In [13]:
print(dictionary)

Dictionary(236 unique tokens: ['2019', 'aspiring_human', 'bauer', 'business', 'college']...)


### Create a Bag of Words Corpus

Bag of words corpora in the Gensim library are based on dictionaries and contain the ID of each word along with the frequency of occurrence of the word.
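
To make the representation concrete, here is a minimal plain-Python sketch of the same idea: map each token to an integer id, then count ids per document. The helper names `build_token2id` and `to_bow` and the toy documents are hypothetical, and the ids will not match the ones gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the mapping."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_docs = [
    ["native", "english", "teacher", "english"],
    ["english", "teacher", "canada"],
]
toy_token2id = build_token2id(toy_docs)
toy_bow = [to_bow(doc, toy_token2id) for doc in toy_docs]
print(toy_token2id)
print(toy_bow)
```

Tokens that are missing from the mapping are simply dropped in this sketch, which matters later when unseen text is converted with a dictionary built on the training set.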

In [14]:
# Create a gensim corpus
# This is a bag of words for each document, in the format of (token_id, token_count) 2-tuples

#dictionary.token2id
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs_bigram]
print(bow_corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1)], [(12, 1), (13, 2), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1)], [(1, 1), (11, 1), (19, 1), (20, 1)], [(21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1)], [(1, 1), (27, 1), (28, 1), (29, 1)], [(1, 1), (4, 1), (12, 1), (30, 1), (31, 1), (32, 1)], [(33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1)], [(39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1)], [(32, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1)], [(8, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1)], [(21, 1), (62, 2), (63, 1), (64, 1), (65, 1), (66, 1)], [(1, 1), (8, 1), (32, 1), (67, 1), (68, 1)], [(43, 1), (44, 1), (69, 1), (70, 1), (71, 1)], [(1, 1), (11, 1), (72, 1), (73, 1), (74, 1)], [(65, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1)], [(27, 1), (28, 1), (38, 1), (65, 1), (80, 1)], [(43, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87

### Topic Modelling

Topic modelling describes each text document as a mixture of topics, where each topic is a weighted set of words.

In [15]:
from gensim.models import LdaModel, LsiModel, HdpModel

#### LSI - Latent Semantic Indexing
An information retrieval method that decomposes the original document-term matrix (via singular value decomposition) and keeps only the strongest components as topics

In [16]:
lsimodel = LsiModel(corpus=bow_corpus, num_topics=20, id2word=dictionary)
lsimodel.show_topics(num_topics=5)  # showing only the top 5 topics

[(0,
  '0.622*"human_resources" + 0.391*"aspiring_human" + 0.180*"resources_professional" + 0.174*"new_york" + 0.147*"houston_texas" + 0.135*"position" + 0.133*"resources_manager" + 0.121*"california" + 0.108*"virginia" + 0.104*"north_carolina"'),
 (1,
  '-0.465*"aspiring_human" + 0.424*"human_resources" + -0.324*"resources_professional" + -0.251*"new_york" + -0.189*"houston_texas" + -0.141*"student" + -0.134*"college" + 0.118*"virginia" + 0.117*"management" + 0.110*"california"'),
 (2,
  '-0.387*"university" + -0.385*"student" + -0.346*"california" + -0.308*"indiana" + -0.176*"resources" + -0.165*"business_management" + -0.156*"retail_manager" + -0.154*"delphi" + -0.154*"lafayette" + -0.154*"kokomo"'),
 (3,
  '-0.367*"california" + -0.320*"resources" + -0.260*"seeking_human" + 0.235*"indiana" + 0.225*"student" + 0.196*"university" + -0.156*"new_york" + -0.148*"san" + -0.136*"staffing" + 0.134*"business_management"'),
 (4,
  '0.365*"houston_texas" + -0.293*"new_york" + 0.214*"engie" + 

#### HDP - Hierarchical Dirichlet Process
HDP is an unsupervised topic model that infers the number of topics from the data

In [17]:
hdpmodel = HdpModel(corpus=bow_corpus,id2word=dictionary)
hdpmodel.show_topics() 

[(0,
  '0.026*chattanooga + 0.018*epik + 0.016*patient + 0.016*lafayette + 0.016*resources + 0.016*procedures + 0.016*california + 0.015*indiana + 0.014*medical + 0.014*hr + 0.014*girardeau + 0.014*partner + 0.013*payroll + 0.013*bauer + 0.012*portfolio + 0.012*baltimore + 0.012*engaging + 0.011*executive + 0.011*inc + 0.011*analyst'),
 (1,
  '0.027*missouri + 0.021*tennessee + 0.019*magna + 0.018*love + 0.017*management_major + 0.017*community + 0.016*retired + 0.015*denton + 0.013*california + 0.013*analyst + 0.012*ryan + 0.012*administrative + 0.011*chro + 0.011*senior + 0.011*inclusive + 0.011*louisiana + 0.011*data + 0.011*indiana + 0.011*raleigh_durham + 0.010*rrp'),
 (2,
  '0.024*lead + 0.023*logging + 0.021*delphi + 0.021*within + 0.019*kokomo + 0.018*payroll + 0.015*torrance + 0.015*business_management + 0.014*jti + 0.014*work + 0.013*junior + 0.013*manager_seeking + 0.012*community + 0.012*excellence + 0.011*missouri + 0.011*compensation + 0.011*chattanooga + 0.011*may + 0.01

#### LDA - Latent Dirichlet Allocation

In [18]:
ldamodel = LdaModel(corpus=bow_corpus, num_topics=10, id2word=dictionary)

ldamodel.show_topics()

[(0,
  '0.054*"human_resources" + 0.040*"atlanta" + 0.027*"california" + 0.014*"aspiring_human" + 0.014*"care" + 0.014*"opportunities" + 0.014*"within" + 0.014*"liberal" + 0.014*"seeking" + 0.014*"rouge"'),
 (1,
  '0.056*"generalist" + 0.038*"north_carolina" + 0.038*"philadelphia" + 0.038*"raleigh_durham" + 0.038*"human_resources" + 0.020*"aspiring_human" + 0.020*"jti" + 0.020*"brand" + 0.020*"tobacco" + 0.020*"leader"'),
 (2,
  '0.033*"state" + 0.033*"university" + 0.033*"student" + 0.033*"north_carolina" + 0.033*"bridgewater" + 0.033*"resources_professional" + 0.033*"westfield" + 0.033*"raleigh_durham" + 0.033*"massachusetts" + 0.033*"aspiring_human"'),
 (3,
  '0.033*"coordinator" + 0.033*"people" + 0.033*"denton" + 0.033*"ey" + 0.033*"director_human" + 0.033*"development" + 0.033*"resources" + 0.033*"texas" + 0.033*"atlanta" + 0.033*"ryan"'),
 (4,
  '0.032*"resources" + 0.032*"california" + 0.032*"north_america" + 0.032*"staffing" + 0.032*"human_resources" + 0.032*"manager" + 0.017*

### TF-IDF with Gensim

Building a TF-IDF model with Gensim.<br>
The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.
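
As a rough illustration of this weighting, the sketch below assumes the common scheme of term frequency times log2(N / document frequency), followed by scaling each document vector to unit length. The function name `tfidf_weights` is hypothetical, and the document frequencies are assumptions chosen for the example: 49 documents in total, with 'canada' occurring in two of them and every other term in one.

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, doc_freq, num_docs):
    """Weight each term by tf * log2(N / df), then scale the vector to unit length."""
    tf = Counter(doc_tokens)
    raw = {term: count * math.log2(num_docs / doc_freq[term]) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

doc_tokens = "native english teacher epik english program korea canada".split()
# Assumed document frequencies: 'canada' occurs in 2 documents, the rest in 1.
doc_freq = {term: 1 for term in doc_tokens}
doc_freq["canada"] = 2

example_weights = tfidf_weights(doc_tokens, doc_freq, num_docs=49)
for term, weight in sorted(example_weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Under these assumptions 'english' receives the largest weight because it appears twice in the document, and 'canada' the smallest because it is the least rare term.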

In [19]:
tfidf = TfidfModel(bow_corpus) # step 1 -- initialize a model

#tfidf is treated as a read-only object that can be used to convert any vector from the old 
#representation (bag-of-words integer counts) to the new representation 

corpus_tfidf = tfidf[bow_corpus] # step 2 Apply transformation to the entire corpus 

In [20]:
type(corpus_tfidf)

gensim.interfaces.TransformedCorpus

In [21]:
for doc, as_text in zip(corpus_tfidf, X_train):
    print(doc, as_text, end='\n\n')

[(0, 0.33249872454322044), (1, 0.13577674153068667), (2, 0.33249872454322044), (3, 0.2386384851773402), (4, 0.2732795068499476), (5, 0.33249872454322044), (6, 0.33249872454322044), (7, 0.33249872454322044), (8, 0.1949959592239595), (9, 0.33249872454322044), (10, 0.33249872454322044), (11, 0.1949959592239595)] 2019 ct bauer college business graduate magna cum laude aspiring human resources professional houston texas

[(12, 0.2642287470777187), (13, 0.6429733601593075), (14, 0.32148668007965375), (15, 0.32148668007965375), (16, 0.32148668007965375), (17, 0.32148668007965375), (18, 0.32148668007965375)] native english teacher epik english program korea canada

[(1, 0.32896328403055086), (11, 0.4724410852392083), (19, 0.5781792882566074), (20, 0.5781792882566074)] aspiring human resources professional raleigh durham north carolina 

[(21, 0.35530268870431103), (22, 0.4322961944081207), (23, 0.4322961944081207), (24, 0.4322961944081207), (25, 0.4322961944081207), (26, 0.35530268870431103)] 

In [22]:
#The model can then be applied to any particular document of interest:
corpus_tfidf[1]

[(12, 0.2642287470777187),
 (13, 0.6429733601593075),
 (14, 0.32148668007965375),
 (15, 0.32148668007965375),
 (16, 0.32148668007965375),
 (17, 0.32148668007965375),
 (18, 0.32148668007965375)]

The most relevant term in document 1 is the one with the highest weight, term number 13 (0.643)

In [28]:
#looking up the word in the dictionary
dictionary[13]

'english'

'english' is the most relevant term in that document; 'canada' (term 12) has the lowest weight

In [26]:
tf_obj = corpus_tfidf[1]
sorted(tf_obj, key=lambda x: x[1], reverse=True)

[(13, 0.6429733601593075),
 (14, 0.32148668007965375),
 (15, 0.32148668007965375),
 (16, 0.32148668007965375),
 (17, 0.32148668007965375),
 (18, 0.32148668007965375),
 (12, 0.2642287470777187)]

And now we see the top five terms for this particular document:

In [27]:
n_terms = 5

# keep only the n_terms highest-weighted terms
top_terms = []
for obj in sorted(tf_obj, key=lambda x: x[1], reverse=True)[:n_terms]:
    top_terms.append("{0:s} ({1:01.03f})".format(dictionary[obj[0]], obj[1]))

print(top_terms)

['english (0.643)', 'epik (0.321)', 'korea (0.321)', 'native (0.321)', 'program (0.321)']


In [29]:
# transform into sparse matrix
# To create a document term matrix from gensim

from gensim.matutils import corpus2csc

tf_sparse_array = corpus2csc(bow_corpus) 

In [30]:
type(tf_sparse_array )

scipy.sparse.csc.csc_matrix

### Model Evaluation Metrics

### Similarity - Cosine similarity

After building the term matrix with TF-IDF, each document is projected into a D-dimensional space (where D is the number of terms in our dictionary).<br>
One use of this space is to determine which documents are similar to one another. We could compute the (Euclidean) distance between two documents, but a more popular metric in text analysis is cosine similarity. It measures the angle between two document vectors, so it is not sensitive to how long a particular document is.<br>
Here, we use the MatrixSimilarity class to build a similarity index over our corpus.
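
For intuition, cosine similarity can be written in a few lines of plain Python. This is a minimal sketch on sparse vectors stored as dictionaries; the function name `cosine_similarity` and the toy vectors are assumptions for illustration and are not part of the notebook's pipeline.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {term_id: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

short_doc = {0: 1.0, 1: 2.0}
long_doc = {0: 3.0, 1: 6.0}    # same direction as short_doc, three times longer
other_doc = {2: 1.0, 3: 1.0}   # shares no terms with short_doc

print(cosine_similarity(short_doc, long_doc))
print(cosine_similarity(short_doc, other_doc))
```

The first pair points in the same direction, so its similarity is (up to rounding) 1 despite the difference in length; the second pair shares no terms, so its similarity is 0.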

In [31]:
from gensim.similarities import MatrixSimilarity
matsim = MatrixSimilarity(corpus_tfidf)

We need to apply this similarity index to a particular document. Here we query it with document 1:<br>
The cosine measure returns similarities in the range [-1, 1] (the greater, the more similar), so document 1 scores 1.0 against itself. Because TF-IDF weights are non-negative, the scores here fall between 0 and 1.

In [32]:
sims = matsim[corpus_tfidf[1]]
print(list(enumerate(sims)))

[(0, 0.0), (1, 1.0), (2, 0.0), (3, 0.0), (4, 0.0), (5, 0.11127411), (6, 0.0), (7, 0.0), (8, 0.0), (9, 0.0), (10, 0.0), (11, 0.0), (12, 0.0), (13, 0.0), (14, 0.0), (15, 0.0), (16, 0.0), (17, 0.0), (18, 0.0), (19, 0.0), (20, 0.0), (21, 0.0), (22, 0.0), (23, 0.0), (24, 0.0), (25, 0.0), (26, 0.0), (27, 0.0), (28, 0.0), (29, 0.0), (30, 0.0), (31, 0.0), (32, 0.0), (33, 0.0), (34, 0.0), (35, 0.0), (36, 0.0), (37, 0.0), (38, 0.0), (39, 0.0), (40, 0.0), (41, 0.0), (42, 0.0), (43, 0.0), (44, 0.0), (45, 0.0), (46, 0.0), (47, 0.0), (48, 0.0)]


In [34]:
# sort (document index, score) pairs by ascending similarity score
sims = sorted(enumerate(sims), key=lambda item: item[-1])
for doc_position, doc_score in sims:
    print(doc_score, X_train[doc_position])

0.0 2019 ct bauer college business graduate magna cum laude aspiring human resources professional houston texas
0.0 aspiring human resources professional raleigh durham north carolina 
0.0 people development coordinator ryan denton texas
0.0 aspiring human resources specialist new york city 
0.0 hr senior specialist san francisco bay 
0.0 seeking human resources hris generalist positions philadelphia 
0.0 student chapman university lake forest california
0.0 svp chro marketing communications csr officer engie houston woodlands energy gphr sphr houston texas 
0.0 human resources coordinator intercontinental buckhead atlanta atlanta georgia
0.0 aspiring human resources management student seeking internship houston texas 
0.0 seeking human resources opportunities chicago illinois
0.0 experienced retail manager aspiring human resources professional austin texas 
0.0 human resources staffing recruiting professional jackson mississippi 
0.0 human resources specialist luxottica new york city 

In [35]:
# Here there is some similarity between document 1 and document 5

print("Document 1: '" + X_train[1] + "'")
print("Document 5: '" + X_train[5] + "'")

Document 1: 'native english teacher epik english program korea canada'
Document 5: 'student humber college aspiring human resources generalist canada'


In [36]:
# check another document similarity
sims1 = matsim[corpus_tfidf[11]]
print(list(enumerate(sims1)))

[(0, 0.107881114), (1, 0.0), (2, 0.08534675), (3, 0.0), (4, 0.073172), (5, 0.14907332), (6, 0.0), (7, 0.0), (8, 0.093855724), (9, 0.06219403), (10, 0.0), (11, 1.0), (12, 0.0), (13, 0.06262708), (14, 0.0), (15, 0.0), (16, 0.0), (17, 0.0), (18, 0.0), (19, 0.0), (20, 0.5391047), (21, 0.0), (22, 0.0), (23, 0.03498843), (24, 0.0), (25, 0.0), (26, 0.0), (27, 0.0), (28, 0.05391196), (29, 0.0), (30, 0.0), (31, 0.0), (32, 0.0), (33, 0.0), (34, 0.0), (35, 0.0), (36, 0.0), (37, 0.0), (38, 0.0), (39, 0.0), (40, 0.085927546), (41, 0.055813584), (42, 0.50632364), (43, 0.0), (44, 0.031536143), (45, 0.0), (46, 0.0), (47, 0.0), (48, 0.0)]


In [37]:
# Similarity between document 11 and document 20
print("Document 11: '" + X_train[11] + "'")
print("Document 20: '" + X_train[20] + "'")

Document 11: 'aspiring human resources management student seeking internship houston texas '
Document 20: 'aspiring human resources manager seeking internship human resources houston texas '


### Testing model on unseen data

#### Using LDA model

In [38]:
unseen_text1 = "aspiring human resources"

#Processing step for unseen data
bow_vector = dictionary.doc2bow(unseen_text1.split())

for index, score in sorted(ldamodel[bow_vector], key=lambda tup: -1*tup[1]):
    print('Score: {:.3f} Topic: {}'.format(score, ldamodel.print_topic(index, 5)))

Score: 0.550 Topic: 0.032*"resources" + 0.032*"california" + 0.032*"north_america" + 0.032*"staffing" + 0.032*"human_resources"
Score: 0.050 Topic: 0.033*"coordinator" + 0.033*"people" + 0.033*"denton" + 0.033*"ey" + 0.033*"director_human"
Score: 0.050 Topic: 0.056*"generalist" + 0.038*"north_carolina" + 0.038*"philadelphia" + 0.038*"raleigh_durham" + 0.038*"human_resources"
Score: 0.050 Topic: 0.029*"university" + 0.029*"student" + 0.029*"indiana" + 0.029*"aspiring_human" + 0.029*"english"
Score: 0.050 Topic: 0.054*"human_resources" + 0.040*"atlanta" + 0.027*"california" + 0.014*"aspiring_human" + 0.014*"care"
Score: 0.050 Topic: 0.033*"state" + 0.033*"university" + 0.033*"student" + 0.033*"north_carolina" + 0.033*"resources_professional"
Score: 0.050 Topic: 0.047*"human_resources" + 0.032*"position" + 0.032*"virginia" + 0.017*"houston_texas" + 0.017*"aspiring_human"
Score: 0.050 Topic: 0.038*"human_resources" + 0.038*"business" + 0.038*"management" + 0.020*"graduate" + 0.020*"aspirin

The highest probability score, 0.550, belongs to the topic listed first in the output, so the unseen text matches that topic best.
The remaining topics all share the same score of 0.050. An identical low score across topics indicates little evidence for any of them rather than a partial match.
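
The pattern of one score of 0.550 and many scores of 0.050 is what we would expect if only one token of the unseen text were found in the dictionary and every topic also received a small prior weight. The sketch below illustrates that arithmetic under two assumptions: a symmetric prior of 1/num_topics per topic, and each recognized token being assigned entirely to one topic. It is a simplification for intuition, not gensim's actual inference procedure, and `approx_topic_scores` is a hypothetical helper.

```python
def approx_topic_scores(tokens_per_topic, num_topics, alpha=None):
    """Approximate topic proportions as (tokens assigned + prior) / (total tokens + total prior)."""
    if alpha is None:
        alpha = 1.0 / num_topics  # assumed symmetric prior
    total = sum(tokens_per_topic.values()) + num_topics * alpha
    return [(tokens_per_topic.get(k, 0) + alpha) / total for k in range(num_topics)]

# One recognized token, assigned entirely to topic 4 (hypothetical assignment).
scores_one_token = approx_topic_scores({4: 1}, num_topics=10)
print([round(s, 3) for s in scores_one_token])
```

Read this way, the low equal scores come from the prior alone. A practical consequence is that unseen text should go through the same tokenizer and bigram steps as the training data, so that more of its tokens are found in the dictionary.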

Another example of testing unseen data:

In [39]:
unseen_text2 = "seeking english teacher"

#Processing step for unseen data
bow_vector = dictionary.doc2bow(unseen_text2.split())

for index, score in sorted(ldamodel[bow_vector], key=lambda tup: -1*tup[1]):
    print('Score: {:.3f} Topic: {}'.format(score, ldamodel.print_topic(index, 5)))

Score: 0.525 Topic: 0.029*"university" + 0.029*"student" + 0.029*"indiana" + 0.029*"aspiring_human" + 0.029*"english"
Score: 0.275 Topic: 0.047*"human_resources" + 0.032*"position" + 0.032*"virginia" + 0.017*"houston_texas" + 0.017*"aspiring_human"
Score: 0.025 Topic: 0.054*"human_resources" + 0.040*"atlanta" + 0.027*"california" + 0.014*"aspiring_human" + 0.014*"care"
Score: 0.025 Topic: 0.056*"generalist" + 0.038*"north_carolina" + 0.038*"philadelphia" + 0.038*"raleigh_durham" + 0.038*"human_resources"
Score: 0.025 Topic: 0.033*"state" + 0.033*"university" + 0.033*"student" + 0.033*"north_carolina" + 0.033*"resources_professional"
Score: 0.025 Topic: 0.033*"coordinator" + 0.033*"people" + 0.033*"denton" + 0.033*"ey" + 0.033*"director_human"
Score: 0.025 Topic: 0.038*"human_resources" + 0.038*"business" + 0.038*"management" + 0.020*"graduate" + 0.020*"aspiring_human"
Score: 0.025 Topic: 0.044*"long_beach" + 0.023*"community" + 0.023*"california" + 0.023*"myrtle" + 0.023*"francisco"
Sc

Here the highest probability score, 0.525, assigns the sample text to the topic listed first, whose top words include "english".
The second-highest score, 0.275, goes to a human resources topic, and the remaining topics share a low score of 0.025.

### Clustering

Another task we can do given that our documents are projected in a high-dimensional numeric space is to cluster the documents according to the words they use. We have already seen how to do this with the network data; we are just doing this same thing now with the words themselves.

Because of the curse of dimensionality, classical clustering techniques tend to work poorly on this data: there are as many dimensions as words in our lexicon, which is too many to cluster the documents reliably. One technique that works directly with a matrix of similarity scores is spectral clustering, and we can apply it as follows:

In [42]:
from sklearn.cluster import SpectralClustering


scmodel = SpectralClustering(n_clusters=3, affinity='precomputed')
similarity_matrix = matsim[corpus_tfidf]
sc = scmodel.fit_predict(similarity_matrix)
sc

array([0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 2, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0,
       2, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 2,
       0, 0, 0, 2, 1])

Spectral clustering splits our documents into three groups and returns the cluster id assigned to each document
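
One quick way to inspect the result is to count how many documents fall in each cluster and to collect the document indices per cluster. The sketch below does this in plain Python with the labels copied from the output above; the variable names are chosen for this example, and the cluster ids themselves are arbitrary and may differ between runs.

```python
from collections import Counter

# Cluster labels copied from the output above (one label per document).
labels = [0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 2, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0,
          2, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 2,
          0, 0, 0, 2, 1]

cluster_sizes = Counter(labels)
for cluster_id, size in sorted(cluster_sizes.items()):
    print(f"cluster {cluster_id}: {size} documents")

# Document indices per cluster, useful for looking up the original text.
members = {}
for doc_index, cluster_id in enumerate(labels):
    members.setdefault(cluster_id, []).append(doc_index)
```

The indices in `members` can be used with `X_train` to read the documents in each group and judge whether the clusters are meaningful.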

### KMeans

In [43]:
from sklearn.cluster import KMeans

In [44]:
# transform the corpus into sparse matrix
# To create a document term matrix from gensim

from gensim.matutils import corpus2csc

tf_sparse_array = corpus2csc(bow_corpus) 

In [45]:
tf_sparse_array

<236x49 sparse matrix of type '<class 'numpy.float64'>'
	with 337 stored elements in Compressed Sparse Column format>

In [47]:
# check the type of the sparse matrix
type(tf_sparse_array)

scipy.sparse.csc.csc_matrix