# Topic modeling of job reviews
## Andrew Hall <br><sup>September 6, 2022 <br> Submission script for final project of Metis NLP short course</sup>

### Abstract
This project applies topic modeling to a corpus of job reviews collected from an anonymous job review website, with the goal of drawing conclusions about what reviewers discuss.

Each review contains a general written description as well as responses to prompts for pros and cons. The analysis is therefore split into three parts: the general written reviews, the responses to the “Pros” prompt, and the responses to the “Cons” prompt.


## Initial data set up and loading

In [96]:
import pandas as pd
import numpy as np

business_names = ['Adobe', 'Airbnb', 'Amazon', 'Apple', 'Atlassian', 'Bloomberg',
                 'Bytedance', 'Cisco', 'Coinbase', 'Deloitte', 'Goldman-Sachs', 'Google',
                 'IBM', 'Intel', 'Intuit', 'Meta', 'Microsoft', 'Netflix', 'Oracle',
                 'Salesforce', 'SAP-Labs', 'Stripe', 'Twitter', 'Uber', 'Walmart']

# Dictionary mapping each company name to its review DataFrame
data = {}

# Read each company's CSV, keyed by company name
for name in business_names:
    data[name] = pd.read_csv(f'data/{name}/{name}-data.csv')

In [2]:
full_data = []
cols = ['Rating','Description', 'Pros', 'Cons', 'Company']
for company in data:
    subset_data = data[company]
    subset_data['Company'] = company
    full_data.append(subset_data[cols])

full_data = pd.concat(full_data)
print("The number of documents is: ", full_data.shape[0])

The number of documents is:  43803


In [3]:
full_data_long = pd.melt(full_data, 
                         id_vars = ['Company', 'Rating'], 
                         value_vars = ['Description', 'Pros', 'Cons'], 
                         var_name = 'Prompt', 
                         value_name = 'Output')
print("The number of individual document components is: ", full_data_long.shape[0])

The number of individual document components is:  131409
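
To make the reshaping concrete, here is a minimal plain-Python sketch of what the wide-to-long step does: each review row contributes one record per prompt column, which is why the long table has three times as many rows as the original (43803 × 3 = 131409). The helper name `melt_rows` and the toy rows are illustrative assumptions, not part of the project code.

```python
def melt_rows(rows, id_vars, value_vars, var_name="Prompt", value_name="Output"):
    """Reshape wide records into long records, one per value column."""
    long_rows = []
    # Iterate column-first so the output is grouped by prompt
    # (all Descriptions, then all Pros, then all Cons).
    for col in value_vars:
        for row in rows:
            record = {key: row[key] for key in id_vars}
            record[var_name] = col
            record[value_name] = row[col]
            long_rows.append(record)
    return long_rows


# Hypothetical toy reviews, for illustration only
toy_reviews = [
    {"Company": "A", "Rating": 3.0, "Description": "Decent company",
     "Pros": "Good benefits", "Cons": "Slow growth"},
    {"Company": "B", "Rating": 5.0, "Description": "Great place to work",
     "Pros": "Good wlb", "Cons": "Too much bureaucracy"},
]

long_reviews = melt_rows(toy_reviews,
                         id_vars=["Company", "Rating"],
                         value_vars=["Description", "Pros", "Cons"])
print("Rows in long format:", len(long_reviews))
for record in long_reviews:
    print(record)
```

Because the records are grouped by prompt, the first entries of the long table are all general descriptions, which matches the sample printed a few cells further down.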


In [4]:
sum([len(str(d).split(' ')) for d in full_data_long.Output]) > 100000.

True

In [5]:
# Print the first ten entries of the long table (general descriptions from Adobe reviews)
for d in full_data_long.Output[:10]:
    print(d)

A decent tier 2 company 
Good Company...terrible middle managers
Great place to work
Not a place for work life balance, full of politics.
Work life balance is good
Great benefits and very good wlb
First Impressions 
Gr8 WLB, Management heavy with no direction
Disappointing 
Adobe is amazing. Your managers may not be.


## Initial vectorized output using CountVectorizer

In [7]:
full_data.Description = [str(row) for row in full_data.Description]
full_data.Description

0                                A decent tier 2 company 
1                 Good Company...terrible middle managers
2                                     Great place to work
3       Not a place for work life balance, full of pol...
4                               Work life balance is good
                              ...                        
1372                                   Learned a lot here
1373                  Lot's of investment into e-commerce
1374                                        Exciting work
1375    Good company with opportunities to work on lar...
1376                                            Backwater
Name: Description, Length: 43803, dtype: object

In [14]:
full_data.head()

| | Rating | Description | Pros | Cons | Company |
|---|---|---|---|---|---|
| 0 | 3.0 | A decent tier 2 company | Benefits are good. ESPP option is amazing. Cul... | Growth is not too great. Too many old timers w... | Adobe |
| 1 | 5.0 | Good Company...terrible middle managers | Solid comp, decent RSUs, good wlb but it's tea... | Too much bureaucracy, some really bad middle m... | Adobe |
| 2 | 4.0 | Great place to work | Great work life balance; everyone wants to do ... | Uncertain career progression. Internal candida... | Adobe |
| 3 | 2.0 | Not a place for work life balance, full of pol... | PerksPay (if including stocks)Company policies... | ManagementWork life balance is terriblePolitic... | Adobe |
| 4 | 4.0 | Work life balance is good | Wellness compensation, work life balance, stab... | Some teams really sucks, nothing to learn, no ... | Adobe |


In [175]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Create doc-term matrix initially using CountVectorizer
vec = CountVectorizer(stop_words = "english", 
                      ngram_range=(1,3),
                      min_df=200,
                      max_df = .8)
doc_term = vec.fit_transform(full_data.Description)

doc_term.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [176]:
pd.DataFrame(doc_term.toarray(), columns = vec.get_feature_names_out())

Unnamed: 0,amazing,amazon,average,awesome,bad,bad wlb,balance,balance good,benefits,best,...,toxic,want,wlb,wlb good,work,work life,work life balance,working,worst,years
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,1,1,0,0,0
4,0,0,0,0,0,0,1,1,0,0,...,0,0,0,0,1,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43798,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
43799,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
43800,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
43801,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
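
The vectorizer settings above are easier to reason about with a small example. The sketch below is a simplified, plain-Python stand-in (not scikit-learn's implementation): it lowercases, splits on whitespace, builds 1- to 3-grams, and keeps only terms whose document frequency is at least a minimum count and at most a maximum fraction of the corpus, mirroring the roles of `ngram_range`, `min_df`, and `max_df`. The function names, the toy documents, and the omission of stop-word removal and punctuation handling are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams of the token list as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def build_vocabulary(docs, min_df=2, max_df=0.8):
    """Keep terms appearing in at least min_df docs and at most max_df of all docs."""
    doc_freq = Counter()
    for doc in docs:
        # Use a set so each term counts once per document
        doc_freq.update(set(ngrams(doc.lower().split())))
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


# Hypothetical toy corpus
toy_docs = [
    "great work life balance",
    "good work life balance",
    "great place",
    "work life balance",
    "great pay",
]

vocab = build_vocabulary(toy_docs)
print(vocab)
```

Terms that occur in only one toy document, such as "great place", are dropped by the minimum document frequency. The same mechanism, at `min_df=200`, explains why the real vocabulary is limited to frequent words and phrases.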


## Topic modeling: LDA, NMF, LSA

In [70]:
# Apply LDA, NMF, and LSA to the general text reviews using the vectorizer tuned above, then compare results
# pyLDAvis
import pyLDAvis as pyLDA

# gensim
from gensim import corpora, models, similarities, matutils
import gensim

# sklearn
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer

# logging for gensim (set to INFO)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### LDA

In [350]:
data_description = full_data_long[full_data_long.Prompt == "Description"]
term_doc = vec.fit_transform(data_description['Output'].values.astype(str)).transpose()

In [351]:
pd.DataFrame(term_doc.toarray(), vec.get_feature_names_out())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,43793,43794,43795,43796,43797,43798,43799,43800,43801,43802
amazing,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
amazon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
average,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
awesome,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bad,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
work life,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
work life balance,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
working,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
worst,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [352]:
# Convert the sparse term-document matrix of counts to a gensim corpus
corpus = matutils.Sparse2Corpus(term_doc)
# Map each matrix index back to its term (inverse of the vectorizer vocabulary)
id2word = dict((v, k) for k, v in vec.vocabulary_.items())
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=3, id2word=id2word, passes=5)

2022-09-06 21:56:42,675 : INFO : using symmetric alpha at 0.3333333333333333
2022-09-06 21:56:42,680 : INFO : using symmetric eta at 0.3333333333333333
2022-09-06 21:56:42,681 : INFO : using serial LDA version on this node
2022-09-06 21:56:42,682 : INFO : running online (multi-pass) LDA training, 3 topics, 5 passes over the supplied corpus of 43803 documents, updating model once every 2000 documents, evaluating perplexity every 20000 documents, iterating 50x with a convergence threshold of 0.001000
2022-09-06 21:56:42,688 : INFO : PROGRESS: pass 0, at document #2000/43803
2022-09-06 21:56:43,344 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:56:43,345 : INFO : topic #0 (0.333): 0.103*"work" + 0.080*"great" + 0.051*"good" + 0.043*"growth" + 0.042*"place" + 0.037*"company" + 0.030*"place work" + 0.025*"wlb" + 0.022*"company work" + 0.021*"great company"
2022-09-06 21:56:43,345 : INFO : topic #1 (0.333): 0.135*"great" + 0.067*"work" + 0.064*"lif

2022-09-06 21:56:45,427 : INFO : topic #1 (0.333): 0.112*"work" + 0.107*"great" + 0.097*"life" + 0.096*"balance" + 0.093*"work life" + 0.091*"life balance" + 0.091*"work life balance" + 0.050*"great work" + 0.047*"great work life" + 0.018*"great wlb"
2022-09-06 21:56:45,427 : INFO : topic #2 (0.333): 0.159*"good" + 0.103*"wlb" + 0.054*"company" + 0.047*"culture" + 0.041*"bad" + 0.031*"team" + 0.026*"growth" + 0.024*"management" + 0.023*"good company" + 0.022*"good wlb"
2022-09-06 21:56:45,428 : INFO : topic diff=0.213300, rho=0.333333
2022-09-06 21:56:45,743 : INFO : -3.977 per-word bound, 15.7 perplexity estimate based on a held-out corpus of 2000 documents with 7845 words
2022-09-06 21:56:45,744 : INFO : PROGRESS: pass 0, at document #20000/43803
2022-09-06 21:56:45,955 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:56:45,955 : INFO : topic #0 (0.333): 0.129*"great" + 0.101*"place" + 0.066*"work" + 0.046*"great place" + 0.042*"good" + 0.039

2022-09-06 21:56:47,508 : INFO : PROGRESS: pass 0, at document #36000/43803
2022-09-06 21:56:47,697 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:56:47,697 : INFO : topic #0 (0.333): 0.197*"great" + 0.085*"place" + 0.068*"work" + 0.047*"company" + 0.044*"great wlb" + 0.040*"great place" + 0.035*"place work" + 0.034*"pay" + 0.033*"great company" + 0.024*"good"
2022-09-06 21:56:47,698 : INFO : topic #1 (0.333): 0.117*"work" + 0.105*"balance" + 0.104*"life" + 0.101*"work life" + 0.100*"life balance" + 0.099*"work life balance" + 0.099*"great" + 0.068*"great work" + 0.064*"great work life" + 0.019*"good work"
2022-09-06 21:56:47,698 : INFO : topic #2 (0.333): 0.155*"good" + 0.127*"wlb" + 0.056*"culture" + 0.043*"company" + 0.034*"team" + 0.034*"good wlb" + 0.033*"growth" + 0.027*"bad" + 0.024*"compensation" + 0.023*"career"
2022-09-06 21:56:47,698 : INFO : topic diff=0.109164, rho=0.235702
2022-09-06 21:56:47,702 : INFO : PROGRESS: pass 0, at do

2022-09-06 21:56:49,671 : INFO : topic #1 (0.333): 0.133*"work" + 0.112*"life" + 0.110*"balance" + 0.107*"work life" + 0.104*"life balance" + 0.103*"work life balance" + 0.082*"great" + 0.055*"great work" + 0.051*"great work life" + 0.020*"good work"
2022-09-06 21:56:49,671 : INFO : topic #2 (0.333): 0.139*"good" + 0.084*"wlb" + 0.054*"team" + 0.049*"culture" + 0.046*"growth" + 0.043*"bad" + 0.033*"career" + 0.029*"company" + 0.025*"career growth" + 0.025*"management"
2022-09-06 21:56:49,671 : INFO : topic diff=0.093184, rho=0.204544
2022-09-06 21:56:49,675 : INFO : PROGRESS: pass 1, at document #10000/43803
2022-09-06 21:56:49,837 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:56:49,838 : INFO : topic #0 (0.333): 0.159*"great" + 0.092*"place" + 0.062*"work" + 0.044*"company" + 0.040*"great place" + 0.038*"learn" + 0.028*"place work" + 0.026*"good place" + 0.025*"learning" + 0.023*"good"
2022-09-06 21:56:49,838 : INFO : topic #1 (0.333): 0.13

2022-09-06 21:56:51,441 : INFO : PROGRESS: pass 1, at document #26000/43803
2022-09-06 21:56:51,621 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:56:51,621 : INFO : topic #0 (0.333): 0.194*"great" + 0.091*"place" + 0.064*"work" + 0.053*"company" + 0.047*"great wlb" + 0.043*"great place" + 0.035*"place work" + 0.027*"great company" + 0.023*"pay" + 0.023*"good place"
2022-09-06 21:56:51,622 : INFO : topic #1 (0.333): 0.126*"work" + 0.112*"life" + 0.111*"balance" + 0.107*"work life" + 0.105*"life balance" + 0.104*"work life balance" + 0.088*"great" + 0.066*"great work" + 0.063*"great work life" + 0.020*"good work"
2022-09-06 21:56:51,622 : INFO : topic #2 (0.333): 0.145*"good" + 0.115*"wlb" + 0.050*"growth" + 0.048*"culture" + 0.034*"career" + 0.034*"team" + 0.034*"company" + 0.029*"people" + 0.029*"good wlb" + 0.029*"bad"
2022-09-06 21:56:51,622 : INFO : topic diff=0.086973, rho=0.204544
2022-09-06 21:56:51,627 : INFO : PROGRESS: pass 1, at do

2022-09-06 21:56:53,343 : INFO : topic #2 (0.333): 0.146*"good" + 0.115*"wlb" + 0.060*"culture" + 0.049*"growth" + 0.032*"company" + 0.030*"good wlb" + 0.030*"team" + 0.029*"career" + 0.027*"bad" + 0.027*"people"
2022-09-06 21:56:53,344 : INFO : topic diff=0.077790, rho=0.204544
2022-09-06 21:56:53,582 : INFO : -3.861 per-word bound, 14.5 perplexity estimate based on a held-out corpus of 1803 documents with 6919 words
2022-09-06 21:56:53,583 : INFO : PROGRESS: pass 1, at document #43803/43803
2022-09-06 21:56:53,732 : INFO : merging changes from 1803 documents into a model of 43803 documents
2022-09-06 21:56:53,733 : INFO : topic #0 (0.333): 0.199*"great" + 0.083*"place" + 0.069*"work" + 0.063*"company" + 0.050*"great wlb" + 0.035*"place work" + 0.035*"great place" + 0.030*"great company" + 0.029*"pay" + 0.022*"good place"
2022-09-06 21:56:53,733 : INFO : topic #1 (0.333): 0.128*"work" + 0.112*"life" + 0.112*"balance" + 0.108*"work life" + 0.107*"life balance" + 0.106*"work life balanc

2022-09-06 21:56:55,103 : INFO : topic #1 (0.333): 0.141*"work" + 0.117*"life" + 0.115*"balance" + 0.112*"work life" + 0.109*"life balance" + 0.108*"work life balance" + 0.079*"great" + 0.058*"great work" + 0.054*"great work life" + 0.020*"good work"
2022-09-06 21:56:55,104 : INFO : topic #2 (0.333): 0.135*"good" + 0.094*"wlb" + 0.056*"culture" + 0.052*"growth" + 0.046*"team" + 0.041*"bad" + 0.032*"career" + 0.028*"company" + 0.025*"management" + 0.024*"career growth"
2022-09-06 21:56:55,104 : INFO : topic diff=0.081273, rho=0.200395
2022-09-06 21:56:55,109 : INFO : PROGRESS: pass 2, at document #18000/43803
2022-09-06 21:56:55,287 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:56:55,288 : INFO : topic #0 (0.333): 0.174*"great" + 0.087*"place" + 0.062*"work" + 0.054*"company" + 0.038*"great place" + 0.029*"great wlb" + 0.029*"place work" + 0.027*"learn" + 0.026*"good" + 0.025*"good place"
2022-09-06 21:56:55,288 : INFO : topic #1 (0.333): 0.1

2022-09-06 21:56:56,773 : INFO : PROGRESS: pass 2, at document #34000/43803
2022-09-06 21:56:56,943 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:56:56,944 : INFO : topic #0 (0.333): 0.218*"great" + 0.083*"place" + 0.064*"work" + 0.055*"great wlb" + 0.054*"company" + 0.038*"great place" + 0.034*"place work" + 0.031*"pay" + 0.029*"great company" + 0.023*"good place"
2022-09-06 21:56:56,944 : INFO : topic #1 (0.333): 0.129*"work" + 0.114*"balance" + 0.114*"life" + 0.110*"work life" + 0.109*"life balance" + 0.108*"work life balance" + 0.084*"great" + 0.074*"great work" + 0.069*"great work life" + 0.020*"good work"
2022-09-06 21:56:56,945 : INFO : topic #2 (0.333): 0.138*"good" + 0.120*"wlb" + 0.057*"culture" + 0.049*"growth" + 0.034*"team" + 0.030*"good wlb" + 0.030*"career" + 0.030*"company" + 0.025*"bad" + 0.024*"people"
2022-09-06 21:56:56,945 : INFO : topic diff=0.055705, rho=0.200395
2022-09-06 21:56:56,950 : INFO : PROGRESS: pass 2, at do

2022-09-06 21:56:58,746 : INFO : topic #1 (0.333): 0.141*"work" + 0.120*"life" + 0.118*"balance" + 0.115*"work life" + 0.112*"life balance" + 0.111*"work life balance" + 0.072*"great" + 0.063*"great work" + 0.059*"great work life" + 0.022*"good work"
2022-09-06 21:56:58,746 : INFO : topic #2 (0.333): 0.132*"good" + 0.088*"wlb" + 0.053*"growth" + 0.052*"culture" + 0.047*"team" + 0.034*"bad" + 0.033*"career" + 0.026*"company" + 0.025*"management" + 0.024*"career growth"
2022-09-06 21:56:58,746 : INFO : topic diff=0.078489, rho=0.196489
2022-09-06 21:56:58,750 : INFO : PROGRESS: pass 3, at document #8000/43803
2022-09-06 21:56:58,905 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:56:58,905 : INFO : topic #0 (0.333): 0.183*"great" + 0.089*"place" + 0.062*"work" + 0.048*"company" + 0.038*"great place" + 0.032*"learn" + 0.029*"place work" + 0.026*"great wlb" + 0.026*"good place" + 0.024*"good"
2022-09-06 21:56:58,906 : INFO : topic #1 (0.333): 0.14

2022-09-06 21:57:00,883 : INFO : PROGRESS: pass 3, at document #24000/43803
2022-09-06 21:57:01,051 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:57:01,052 : INFO : topic #0 (0.333): 0.205*"great" + 0.088*"place" + 0.062*"work" + 0.052*"company" + 0.045*"great wlb" + 0.043*"great place" + 0.034*"place work" + 0.027*"great company" + 0.024*"good" + 0.022*"good place"
2022-09-06 21:57:01,053 : INFO : topic #1 (0.333): 0.135*"work" + 0.118*"life" + 0.116*"balance" + 0.113*"work life" + 0.111*"life balance" + 0.110*"work life balance" + 0.081*"great" + 0.070*"great work" + 0.067*"great work life" + 0.019*"good work"
2022-09-06 21:57:01,053 : INFO : topic #2 (0.333): 0.131*"good" + 0.110*"wlb" + 0.055*"growth" + 0.053*"culture" + 0.034*"team" + 0.034*"career" + 0.030*"people" + 0.029*"company" + 0.028*"bad" + 0.025*"career growth"
2022-09-06 21:57:01,053 : INFO : topic diff=0.068132, rho=0.196489
2022-09-06 21:57:01,058 : INFO : PROGRESS: pass 3,

2022-09-06 21:57:02,684 : INFO : topic #2 (0.333): 0.136*"good" + 0.120*"wlb" + 0.064*"culture" + 0.050*"growth" + 0.031*"company" + 0.031*"team" + 0.030*"good wlb" + 0.028*"career" + 0.026*"bad" + 0.025*"people"
2022-09-06 21:57:02,684 : INFO : topic diff=0.044435, rho=0.196489
2022-09-06 21:57:02,688 : INFO : PROGRESS: pass 3, at document #42000/43803
2022-09-06 21:57:02,846 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:57:02,846 : INFO : topic #0 (0.333): 0.222*"great" + 0.082*"place" + 0.067*"work" + 0.062*"company" + 0.054*"great wlb" + 0.037*"great place" + 0.035*"place work" + 0.032*"great company" + 0.027*"pay" + 0.022*"good place"
2022-09-06 21:57:02,847 : INFO : topic #1 (0.333): 0.131*"work" + 0.115*"balance" + 0.115*"life" + 0.111*"work life" + 0.109*"life balance" + 0.109*"work life balance" + 0.082*"great" + 0.075*"great work" + 0.070*"great work life" + 0.022*"good work"
2022-09-06 21:57:02,847 : INFO : topic #2 (0.333): 0.137

2022-09-06 21:57:04,351 : INFO : topic #1 (0.333): 0.152*"work" + 0.124*"life" + 0.120*"balance" + 0.117*"work life" + 0.113*"life balance" + 0.113*"work life balance" + 0.068*"great" + 0.057*"great work" + 0.052*"great work life" + 0.020*"good work"
2022-09-06 21:57:04,351 : INFO : topic #2 (0.333): 0.126*"good" + 0.081*"wlb" + 0.057*"culture" + 0.051*"team" + 0.050*"growth" + 0.041*"bad" + 0.031*"career" + 0.025*"company" + 0.024*"management" + 0.023*"career growth"
2022-09-06 21:57:04,351 : INFO : topic diff=0.073487, rho=0.192802
2022-09-06 21:57:04,356 : INFO : PROGRESS: pass 4, at document #16000/43803
2022-09-06 21:57:04,519 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:57:04,520 : INFO : topic #0 (0.333): 0.186*"great" + 0.084*"place" + 0.061*"work" + 0.056*"company" + 0.037*"great place" + 0.029*"great wlb" + 0.028*"place work" + 0.028*"learn" + 0.027*"good" + 0.025*"great company"
2022-09-06 21:57:04,520 : INFO : topic #1 (0.333): 

2022-09-06 21:57:05,978 : INFO : PROGRESS: pass 4, at document #32000/43803
2022-09-06 21:57:06,145 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 21:57:06,146 : INFO : topic #0 (0.333): 0.216*"great" + 0.083*"place" + 0.064*"work" + 0.057*"company" + 0.050*"great wlb" + 0.039*"great place" + 0.033*"place work" + 0.029*"pay" + 0.028*"great company" + 0.024*"good"
2022-09-06 21:57:06,146 : INFO : topic #1 (0.333): 0.132*"work" + 0.116*"life" + 0.116*"balance" + 0.112*"work life" + 0.110*"life balance" + 0.110*"work life balance" + 0.081*"great" + 0.072*"great work" + 0.068*"great work life" + 0.021*"good work"
2022-09-06 21:57:06,146 : INFO : topic #2 (0.333): 0.133*"good" + 0.115*"wlb" + 0.057*"culture" + 0.051*"growth" + 0.033*"team" + 0.031*"career" + 0.028*"good wlb" + 0.027*"company" + 0.027*"bad" + 0.026*"people"
2022-09-06 21:57:06,147 : INFO : topic diff=0.052802, rho=0.192802
2022-09-06 21:57:06,151 : INFO : PROGRESS: pass 4, at document
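
The log lines above report both a per-word bound and a perplexity estimate. For the values shown, the two are consistent with perplexity = 2^(-bound), which the short check below reproduces for the two evaluations visible in this log. The function name is an assumption for illustration; only the numbers come from the output above.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (log base 2) into a perplexity."""
    return 2.0 ** (-per_word_bound)


# (per-word bound, perplexity reported in the log)
logged_values = [(-3.977, 15.7), (-3.861, 14.5)]

for bound, reported in logged_values:
    estimate = perplexity_from_bound(bound)
    print(f"bound={bound:.3f}  perplexity={estimate:.1f}  (log reported {reported})")
```

Lower perplexity indicates a better fit on the held-out chunk, so the drop between the two evaluations suggests the model improved over the additional training, although the two estimates use different held-out chunks.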

In [353]:
lda_model.print_topics(num_words = 5)

2022-09-06 21:57:42,742 : INFO : topic #0 (0.333): 0.212*"great" + 0.082*"place" + 0.068*"work" + 0.064*"company" + 0.050*"great wlb"
2022-09-06 21:57:42,743 : INFO : topic #1 (0.333): 0.133*"work" + 0.116*"life" + 0.115*"balance" + 0.112*"work life" + 0.110*"life balance"
2022-09-06 21:57:42,746 : INFO : topic #2 (0.333): 0.139*"good" + 0.110*"wlb" + 0.057*"culture" + 0.048*"growth" + 0.032*"team"


[(0,
  '0.212*"great" + 0.082*"place" + 0.068*"work" + 0.064*"company" + 0.050*"great wlb"'),
 (1,
  '0.133*"work" + 0.116*"life" + 0.115*"balance" + 0.112*"work life" + 0.110*"life balance"'),
 (2,
  '0.139*"good" + 0.110*"wlb" + 0.057*"culture" + 0.048*"growth" + 0.032*"team"')]
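
Each topic string above is a weighted list of terms. For downstream comparison with NMF and LSA it can be convenient to turn those strings into (term, weight) pairs. The sketch below parses the three strings copied from the output above with a regular expression; the helper name `parse_topic` is an assumption. gensim's `show_topics` method also accepts `formatted=False` to return pairs directly, which may be preferable when the model object is at hand.

```python
import re

# Matches fragments such as 0.212*"great"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Split a formatted topic string into a list of (term, weight) pairs."""
    return [(term, float(weight)) for weight, term in TERM_PATTERN.findall(topic_string)]


# Topic strings copied from the print_topics output above
topics = [
    (0, '0.212*"great" + 0.082*"place" + 0.068*"work" + 0.064*"company" + 0.050*"great wlb"'),
    (1, '0.133*"work" + 0.116*"life" + 0.115*"balance" + 0.112*"work life" + 0.110*"life balance"'),
    (2, '0.139*"good" + 0.110*"wlb" + 0.057*"culture" + 0.048*"growth" + 0.032*"team"'),
]

parsed = {topic_id: parse_topic(text) for topic_id, text in topics}
for topic_id, pairs in parsed.items():
    print(topic_id, pairs)
```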

### NMF - Matrix Factorization

In [181]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=3)
doc_topic = nmf.fit_transform(doc_term)



In [182]:
def get_top_terms(topic, n_terms, nmf=nmf, terms=vec.get_feature_names_out()):
    # get the topic components (i.e., term weights)
    components = nmf.components_[topic, :]

    # get indices of the n_terms largest weights (argsort is ascending, so the strongest term comes last)
    top_term_indices = components.argsort()[-n_terms:]
    
    # use the `terms` array to get the actual top terms
    top_terms = np.array(terms)[top_term_indices]
    
    return top_terms.tolist()

In [184]:
for i in range(3):
    print(get_top_terms(i, 5))

['life balance', 'work life', 'balance', 'life', 'work']
['great place', 'great wlb', 'place', 'wlb', 'great']
['work', 'place', 'good wlb', 'wlb', 'good']
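
Note that taking the last `n_terms` indices of an ascending sort returns the strongest term last, which is why "work", "great", and "good" close each list above. The plain-Python sketch below mirrors that slicing on made-up weights and shows how reversing the slice gives a strongest-first ranking. The `argsort` helper, the weights, and the term list are hypothetical stand-ins for the NumPy call and the fitted components.

```python
def argsort(values):
    """Return indices that would sort `values` in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Hypothetical term weights for a single topic
terms = ["balance", "life", "work", "pay", "team"]
weights = [0.6, 0.7, 0.9, 0.1, 0.2]

order = argsort(weights)

# Same slicing as in the helper above: strongest term ends up last
top3_ascending = [terms[i] for i in order[-3:]]

# Reversing the slice puts the strongest term first
top3_descending = [terms[i] for i in order[-3:][::-1]]

print(top3_ascending)
print(top3_descending)
```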


### LSA - Matrix Factorization

In [185]:
from sklearn.decomposition import TruncatedSVD

lsa = TruncatedSVD(n_components=3)
doc_topic = lsa.fit_transform(doc_term)
doc_topic

array([[ 0.03926376,  0.168048  ,  0.08791759],
       [ 0.15101321,  0.37312656,  0.96420067],
       [ 0.84348008,  1.23155494, -0.16198667],
       ...,
       [ 0.41205901, -0.00944227,  0.10980741],
       [ 0.57004534,  0.38832006,  1.08557773],
       [ 0.        ,  0.        ,  0.        ]])

In [186]:
def get_top_terms_lsa(topic, n_terms, lsa=lsa, terms=vec.get_feature_names_out()):
    # get the topic components (i.e., term weights)
    components = lsa.components_[topic, :]

    # get indices of the n_terms largest weights (argsort is ascending, so the strongest term comes last)
    top_term_indices = components.argsort()[-n_terms:]
    
    # use the `terms` array to get the actual top terms
    top_terms = np.array(terms)[top_term_indices]
    
    return top_terms.tolist()

In [188]:
for i in range(3):
    print(get_top_terms_lsa(i, 5))

['life balance', 'work life', 'balance', 'life', 'work']
['great wlb', 'good', 'place', 'wlb', 'great']
['good wlb', 'good work life', 'good work', 'wlb', 'good']
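
Unlike NMF, the SVD behind LSA can produce negative values, as the negative entries in `doc_topic` above show, and the same holds for the term weights of a component. Ranking by raw weight therefore surfaces only the positively loaded terms; ranking by absolute value is one alternative that also shows strongly negative terms. The sketch below contrasts the two on hypothetical weights; it is an illustration under those assumptions, not a change to the analysis above.

```python
def top_terms(terms, weights, n_terms, by_magnitude=False):
    """Return the n_terms highest-ranked terms, strongest first."""
    if by_magnitude:
        key = lambda i: abs(weights[i])
    else:
        key = lambda i: weights[i]
    order = sorted(range(len(weights)), key=key, reverse=True)
    return [terms[i] for i in order[:n_terms]]


# Hypothetical signed weights for one LSA component
terms = ["good", "wlb", "great", "place", "bad"]
weights = [0.8, 0.5, -0.7, -0.1, 0.05]

by_raw_weight = top_terms(terms, weights, 2)
by_absolute_weight = top_terms(terms, weights, 2, by_magnitude=True)

print(by_raw_weight)
print(by_absolute_weight)
```

In this toy case "great" appears only in the magnitude-based ranking, because its weight is large but negative.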


## Comparison to TF-IDF

In [None]:
# Repeat the full LDA, NMF, LSA workflow from above, using TF-IDF as the input doc-term matrix

In [203]:
# Create the doc-term matrix using TfidfVectorizer (same settings as the CountVectorizer above)
vec_tfidf = TfidfVectorizer(stop_words = "english", 
                            ngram_range=(1,3),
                            min_df=200,
                            max_df = .8)
doc_term = vec_tfidf.fit_transform(full_data.Description)
term_doc = vec_tfidf.fit_transform(data_description['Output'].values.astype(str)).transpose()

pd.DataFrame(doc_term.toarray(), columns = vec_tfidf.get_feature_names_out())



Unnamed: 0,amazing,amazon,average,awesome,bad,bad wlb,balance,balance good,benefits,best,...,toxic,want,wlb,wlb good,work,work life,work life balance,working,worst,years
0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.256455,0.000000,0.000000,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.265721,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.223214,0.268717,0.271128,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.246777,0.542856,0.0,0.0,...,0.0,0.0,0.0,0.0,0.207301,0.249559,0.251799,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43798,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
43799,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
43800,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,1.000000,0.000000,0.000000,0.0,0.0,0.0
43801,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.223449,0.000000,0.000000,0.0,0.0,0.0
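
The weights in this table can be traced by hand. With scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), the inverse document frequency is idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each document's tf·idf vector is then scaled to unit length. One consequence is consistent with row 43800 above: a document containing a single vocabulary term gets weight 1.0 for that term. The sketch below reproduces this on toy documents; the helper name and the data are assumptions, and tokenization is simplified to whitespace splitting on unigrams.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute L2-normalized TF-IDF weights per document (smoothed idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Documents with no terms keep an empty (all-zero) row
        rows.append({t: w / norm for t, w in raw.items()} if norm > 0 else {})
    return rows


# Hypothetical toy corpus
toy_docs = ["work", "work life balance", "great work"]
rows = tfidf_rows(toy_docs)

for doc, row in zip(toy_docs, rows):
    print(repr(doc), {t: round(w, 3) for t, w in row.items()})
```

In the second toy document, "work" receives a lower weight than "life" or "balance" because it occurs in every document and so has the smallest idf.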


### LDA with TF-IDF

In [206]:
# Convert the sparse term-document matrix of TF-IDF weights to a gensim corpus
corpus = matutils.Sparse2Corpus(term_doc)
# Map each matrix index back to its term, using the TF-IDF vectorizer's vocabulary
id2word = dict((v, k) for k, v in vec_tfidf.vocabulary_.items())
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=3, id2word=id2word, passes=5)

2022-09-06 16:47:29,453 : INFO : using symmetric alpha at 0.3333333333333333
2022-09-06 16:47:29,457 : INFO : using symmetric eta at 0.3333333333333333
2022-09-06 16:47:29,458 : INFO : using serial LDA version on this node
2022-09-06 16:47:29,459 : INFO : running online (multi-pass) LDA training, 3 topics, 5 passes over the supplied corpus of 43803 documents, updating model once every 2000 documents, evaluating perplexity every 20000 documents, iterating 50x with a convergence threshold of 0.001000
2022-09-06 16:47:29,467 : INFO : PROGRESS: pass 0, at document #2000/43803
2022-09-06 16:47:30,035 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:30,036 : INFO : topic #0 (0.333): 0.083*"great" + 0.042*"company" + 0.042*"good" + 0.041*"wlb" + 0.038*"growth" + 0.033*"culture" + 0.026*"career" + 0.025*"people" + 0.024*"team" + 0.022*"balance"
2022-09-06 16:47:30,036 : INFO : topic #1 (0.333): 0.069*"good" + 0.060*"work" + 0.043*"bad" + 0.037*"plac

2022-09-06 16:47:31,692 : INFO : topic #1 (0.333): 0.113*"good" + 0.067*"place" + 0.059*"bad" + 0.041*"great place" + 0.039*"management" + 0.034*"place work" + 0.032*"good company" + 0.031*"work" + 0.027*"good work" + 0.026*"depends"
2022-09-06 16:47:31,692 : INFO : topic #2 (0.333): 0.088*"great" + 0.088*"work" + 0.074*"life" + 0.074*"balance" + 0.071*"work life" + 0.069*"life balance" + 0.069*"work life balance" + 0.049*"company" + 0.047*"great work" + 0.044*"great work life"
2022-09-06 16:47:31,693 : INFO : topic diff=0.210698, rho=0.333333
2022-09-06 16:47:31,933 : INFO : -4.592 per-word bound, 24.1 perplexity estimate based on a held-out corpus of 2000 documents with 3406 words
2022-09-06 16:47:31,933 : INFO : PROGRESS: pass 0, at document #20000/43803
2022-09-06 16:47:32,099 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:32,100 : INFO : topic #0 (0.333): 0.086*"wlb" + 0.050*"career" + 0.050*"good" + 0.049*"growth" + 0.047*"great" + 0

2022-09-06 16:47:33,410 : INFO : topic diff=0.120349, rho=0.242536
2022-09-06 16:47:33,414 : INFO : PROGRESS: pass 0, at document #36000/43803
2022-09-06 16:47:33,577 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:33,578 : INFO : topic #0 (0.333): 0.114*"wlb" + 0.066*"great" + 0.064*"great wlb" + 0.048*"good" + 0.044*"growth" + 0.042*"culture" + 0.039*"good wlb" + 0.037*"career" + 0.030*"compensation" + 0.028*"benefits"
2022-09-06 16:47:33,578 : INFO : topic #1 (0.333): 0.122*"good" + 0.082*"place" + 0.046*"great place" + 0.041*"bad" + 0.039*"place work" + 0.039*"good work" + 0.035*"work" + 0.034*"good work life" + 0.033*"good company" + 0.032*"management"
2022-09-06 16:47:33,579 : INFO : topic #2 (0.333): 0.092*"great" + 0.091*"work" + 0.086*"balance" + 0.085*"life" + 0.081*"work life" + 0.081*"life balance" + 0.080*"work life balance" + 0.069*"great work" + 0.064*"great work life" + 0.043*"company"
2022-09-06 16:47:33,579 : INFO : topic 

2022-09-06 16:47:35,234 : INFO : topic #1 (0.333): 0.106*"good" + 0.080*"place" + 0.061*"bad" + 0.039*"great place" + 0.037*"management" + 0.031*"place work" + 0.031*"work" + 0.030*"learn" + 0.028*"depends" + 0.028*"good place"
2022-09-06 16:47:35,234 : INFO : topic #2 (0.333): 0.097*"work" + 0.088*"great" + 0.080*"life" + 0.079*"balance" + 0.075*"work life" + 0.074*"life balance" + 0.073*"work life balance" + 0.053*"company" + 0.049*"great work" + 0.044*"great work life"
2022-09-06 16:47:35,235 : INFO : topic diff=0.090714, rho=0.204544
2022-09-06 16:47:35,239 : INFO : PROGRESS: pass 1, at document #10000/43803
2022-09-06 16:47:35,385 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:35,385 : INFO : topic #0 (0.333): 0.075*"wlb" + 0.055*"growth" + 0.047*"great" + 0.044*"career" + 0.044*"culture" + 0.042*"good" + 0.040*"team" + 0.031*"experience" + 0.029*"opportunities" + 0.026*"learning"
2022-09-06 16:47:35,386 : INFO : topic #1 (0.333): 0.1

2022-09-06 16:47:36,728 : INFO : PROGRESS: pass 1, at document #26000/43803
2022-09-06 16:47:36,894 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:36,895 : INFO : topic #0 (0.333): 0.105*"wlb" + 0.061*"great" + 0.058*"great wlb" + 0.048*"growth" + 0.045*"good" + 0.044*"career" + 0.038*"culture" + 0.035*"good wlb" + 0.028*"best" + 0.024*"slow"
2022-09-06 16:47:36,895 : INFO : topic #1 (0.333): 0.115*"good" + 0.091*"place" + 0.049*"great place" + 0.040*"place work" + 0.040*"bad" + 0.033*"work" + 0.032*"good work" + 0.031*"management" + 0.028*"good work life" + 0.028*"good place"
2022-09-06 16:47:36,895 : INFO : topic #2 (0.333): 0.095*"work" + 0.093*"great" + 0.087*"balance" + 0.087*"life" + 0.082*"work life" + 0.081*"life balance" + 0.081*"work life balance" + 0.063*"great work" + 0.059*"great work life" + 0.053*"company"
2022-09-06 16:47:36,896 : INFO : topic diff=0.067239, rho=0.204544
2022-09-06 16:47:36,901 : INFO : PROGRESS: pass 1, at

2022-09-06 16:47:38,455 : INFO : topic #2 (0.333): 0.093*"work" + 0.092*"great" + 0.086*"balance" + 0.085*"life" + 0.082*"work life" + 0.081*"life balance" + 0.080*"work life balance" + 0.068*"great work" + 0.064*"great work life" + 0.052*"company"
2022-09-06 16:47:38,455 : INFO : topic diff=0.066847, rho=0.204544
2022-09-06 16:47:38,672 : INFO : -4.507 per-word bound, 22.7 perplexity estimate based on a held-out corpus of 1803 documents with 3017 words
2022-09-06 16:47:38,672 : INFO : PROGRESS: pass 1, at document #43803/43803
2022-09-06 16:47:38,813 : INFO : merging changes from 1803 documents into a model of 43803 documents
2022-09-06 16:47:38,814 : INFO : topic #0 (0.333): 0.107*"wlb" + 0.061*"great" + 0.058*"great wlb" + 0.047*"good" + 0.046*"growth" + 0.045*"culture" + 0.037*"good wlb" + 0.035*"career" + 0.028*"compensation" + 0.024*"pay"
2022-09-06 16:47:38,814 : INFO : topic #1 (0.333): 0.124*"good" + 0.082*"place" + 0.049*"management" + 0.043*"bad" + 0.039*"good work" + 0.039*

2022-09-06 16:47:40,099 : INFO : topic #0 (0.333): 0.091*"wlb" + 0.051*"growth" + 0.049*"culture" + 0.048*"great" + 0.044*"good" + 0.041*"career" + 0.034*"great wlb" + 0.031*"team" + 0.026*"experience" + 0.025*"compensation"
2022-09-06 16:47:40,100 : INFO : topic #1 (0.333): 0.107*"good" + 0.076*"place" + 0.059*"bad" + 0.039*"great place" + 0.038*"management" + 0.031*"learn" + 0.030*"place work" + 0.028*"work" + 0.026*"good place" + 0.025*"depends"
2022-09-06 16:47:40,100 : INFO : topic #2 (0.333): 0.102*"work" + 0.092*"great" + 0.082*"life" + 0.081*"balance" + 0.077*"work life" + 0.075*"life balance" + 0.075*"work life balance" + 0.064*"company" + 0.050*"great work" + 0.045*"great work life"
2022-09-06 16:47:40,101 : INFO : topic diff=0.061943, rho=0.200395
2022-09-06 16:47:40,105 : INFO : PROGRESS: pass 2, at document #18000/43803
2022-09-06 16:47:40,250 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:40,250 : INFO : topic #0 (0.333): 0.0

2022-09-06 16:47:41,545 : INFO : topic diff=0.051416, rho=0.200395
2022-09-06 16:47:41,549 : INFO : PROGRESS: pass 2, at document #34000/43803
2022-09-06 16:47:41,696 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:41,697 : INFO : topic #0 (0.333): 0.112*"wlb" + 0.063*"great" + 0.062*"great wlb" + 0.047*"good" + 0.043*"growth" + 0.041*"culture" + 0.038*"career" + 0.037*"good wlb" + 0.029*"compensation" + 0.028*"pay"
2022-09-06 16:47:41,697 : INFO : topic #1 (0.333): 0.119*"good" + 0.086*"place" + 0.046*"great place" + 0.040*"place work" + 0.040*"bad" + 0.035*"good work" + 0.033*"work" + 0.031*"management" + 0.031*"good work life" + 0.030*"good place"
2022-09-06 16:47:41,697 : INFO : topic #2 (0.333): 0.096*"work" + 0.095*"great" + 0.089*"balance" + 0.088*"life" + 0.084*"work life" + 0.083*"life balance" + 0.083*"work life balance" + 0.069*"great work" + 0.065*"great work life" + 0.050*"company"
2022-09-06 16:47:41,698 : INFO : topic diff=0.

2022-09-06 16:47:43,284 : INFO : topic #1 (0.333): 0.109*"good" + 0.079*"place" + 0.049*"bad" + 0.040*"management" + 0.038*"great place" + 0.034*"place work" + 0.032*"learn" + 0.031*"work" + 0.028*"good place" + 0.027*"good work"
2022-09-06 16:47:43,285 : INFO : topic #2 (0.333): 0.101*"work" + 0.091*"great" + 0.086*"life" + 0.084*"balance" + 0.081*"work life" + 0.079*"life balance" + 0.079*"work life balance" + 0.056*"company" + 0.055*"great work" + 0.051*"great work life"
2022-09-06 16:47:43,285 : INFO : topic diff=0.065763, rho=0.196489
2022-09-06 16:47:43,289 : INFO : PROGRESS: pass 3, at document #8000/43803
2022-09-06 16:47:43,424 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:43,425 : INFO : topic #0 (0.333): 0.078*"wlb" + 0.054*"growth" + 0.049*"great" + 0.043*"culture" + 0.043*"career" + 0.041*"good" + 0.035*"team" + 0.031*"great wlb" + 0.027*"opportunities" + 0.026*"experience"
2022-09-06 16:47:43,425 : INFO : topic #1 (0.333): 0

2022-09-06 16:47:44,677 : INFO : PROGRESS: pass 3, at document #24000/43803
2022-09-06 16:47:44,825 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:44,826 : INFO : topic #0 (0.333): 0.101*"wlb" + 0.059*"great" + 0.056*"great wlb" + 0.048*"growth" + 0.043*"good" + 0.043*"career" + 0.040*"culture" + 0.030*"good wlb" + 0.028*"best" + 0.024*"slow"
2022-09-06 16:47:44,826 : INFO : topic #1 (0.333): 0.108*"good" + 0.090*"place" + 0.051*"great place" + 0.041*"place work" + 0.041*"bad" + 0.031*"work" + 0.029*"great" + 0.028*"management" + 0.028*"good place" + 0.025*"good work"
2022-09-06 16:47:44,827 : INFO : topic #2 (0.333): 0.098*"great" + 0.097*"work" + 0.086*"life" + 0.085*"balance" + 0.081*"work life" + 0.080*"life balance" + 0.079*"work life balance" + 0.062*"great work" + 0.059*"great work life" + 0.058*"company"
2022-09-06 16:47:44,827 : INFO : topic diff=0.051377, rho=0.196489
2022-09-06 16:47:44,832 : INFO : PROGRESS: pass 3, at document

2022-09-06 16:47:46,256 : INFO : topic #2 (0.333): 0.095*"work" + 0.092*"great" + 0.090*"balance" + 0.088*"life" + 0.085*"work life" + 0.084*"life balance" + 0.083*"work life balance" + 0.071*"great work" + 0.067*"great work life" + 0.052*"company"
2022-09-06 16:47:46,256 : INFO : topic diff=0.036241, rho=0.196489
2022-09-06 16:47:46,261 : INFO : PROGRESS: pass 3, at document #42000/43803
2022-09-06 16:47:46,404 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:46,405 : INFO : topic #0 (0.333): 0.108*"wlb" + 0.063*"great" + 0.063*"great wlb" + 0.046*"growth" + 0.046*"good" + 0.045*"culture" + 0.037*"career" + 0.036*"good wlb" + 0.027*"compensation" + 0.027*"pay"
2022-09-06 16:47:46,405 : INFO : topic #1 (0.333): 0.119*"good" + 0.084*"place" + 0.044*"great place" + 0.041*"management" + 0.041*"place work" + 0.040*"bad" + 0.037*"good work" + 0.035*"work" + 0.032*"good work life" + 0.030*"company"
2022-09-06 16:47:46,405 : INFO : topic #2 (0.333)

2022-09-06 16:47:47,981 : INFO : topic #0 (0.333): 0.077*"wlb" + 0.050*"growth" + 0.049*"culture" + 0.046*"great" + 0.043*"good" + 0.040*"career" + 0.035*"team" + 0.029*"experience" + 0.028*"great wlb" + 0.026*"compensation"
2022-09-06 16:47:47,981 : INFO : topic #1 (0.333): 0.103*"good" + 0.076*"place" + 0.059*"bad" + 0.040*"great place" + 0.036*"management" + 0.035*"learn" + 0.030*"place work" + 0.028*"work" + 0.026*"depends" + 0.025*"good place"
2022-09-06 16:47:47,982 : INFO : topic #2 (0.333): 0.105*"work" + 0.094*"great" + 0.081*"life" + 0.079*"balance" + 0.075*"work life" + 0.073*"life balance" + 0.073*"work life balance" + 0.066*"company" + 0.046*"great work" + 0.041*"great work life"
2022-09-06 16:47:47,982 : INFO : topic diff=0.060013, rho=0.192802
2022-09-06 16:47:47,987 : INFO : PROGRESS: pass 4, at document #16000/43803
2022-09-06 16:47:48,129 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:48,130 : INFO : topic #0 (0.333): 0.0

2022-09-06 16:47:49,407 : INFO : topic diff=0.052103, rho=0.192802
2022-09-06 16:47:49,415 : INFO : PROGRESS: pass 4, at document #32000/43803
2022-09-06 16:47:49,581 : INFO : merging changes from 2000 documents into a model of 43803 documents
2022-09-06 16:47:49,581 : INFO : topic #0 (0.333): 0.106*"wlb" + 0.060*"great" + 0.057*"great wlb" + 0.047*"good" + 0.044*"growth" + 0.040*"culture" + 0.039*"career" + 0.035*"good wlb" + 0.028*"pay" + 0.028*"compensation"
2022-09-06 16:47:49,582 : INFO : topic #1 (0.333): 0.116*"good" + 0.087*"place" + 0.047*"great place" + 0.041*"bad" + 0.039*"place work" + 0.033*"good work" + 0.032*"work" + 0.030*"management" + 0.029*"good work life" + 0.029*"good place"
2022-09-06 16:47:49,582 : INFO : topic #2 (0.333): 0.096*"work" + 0.095*"great" + 0.087*"balance" + 0.087*"life" + 0.083*"work life" + 0.082*"life balance" + 0.082*"work life balance" + 0.066*"great work" + 0.062*"great work life" + 0.055*"company"
2022-09-06 16:47:49,582 : INFO : topic diff=0.
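
The log above is long, and one way to follow training progress is to pull out only the `topic diff` and `rho` values. The sketch below is a minimal example that assumes the line format visible in the log; the helper name `extract_topic_diffs` is hypothetical, and the three sample lines are copied from the output above.

```python
import re

# Matches the "topic diff=..., rho=..." lines in the INFO log shown above.
# Truncated lines such as "topic diff=0." are skipped because they lack the full pattern.
DIFF_RE = re.compile(r"topic diff=(\d+\.\d+), rho=(\d+\.\d+)")

def extract_topic_diffs(log_text):
    """Return a list of (topic_diff, rho) float pairs, in log order."""
    return [(float(diff), float(rho)) for diff, rho in DIFF_RE.findall(log_text)]

# Three lines copied from the log above
sample_log = """\
2022-09-06 16:47:36,896 : INFO : topic diff=0.067239, rho=0.204544
2022-09-06 16:47:40,101 : INFO : topic diff=0.061943, rho=0.200395
2022-09-06 16:47:46,256 : INFO : topic diff=0.036241, rho=0.196489
"""

diffs = extract_topic_diffs(sample_log)
for diff, rho in diffs:
    print(f"rho={rho:.6f}  topic diff={diff:.6f}")
```

Plotting the extracted values against their position in the log would give a rough picture of how much the topics still change from one update to the next.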

In [207]:
lda_model.print_topics(num_words = 5)


[(0,
  '0.106*"wlb" + 0.059*"great" + 0.057*"great wlb" + 0.046*"good" + 0.045*"growth"'),
 (1,
  '0.123*"good" + 0.082*"place" + 0.048*"management" + 0.043*"bad" + 0.039*"great place"'),
 (2,
  '0.096*"work" + 0.088*"great" + 0.087*"balance" + 0.086*"life" + 0.083*"work life"')]
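
The formatted strings above are easy to read but awkward to compare programmatically. The sketch below parses them into `(term, weight)` pairs and lists the terms that appear in more than one topic. `parse_topic` is a hypothetical helper, and the input strings are copied from the output above. gensim's `show_topics(formatted=False)` should return the pairs directly, so the regex is only needed when working from saved output.

```python
import re
from collections import Counter

# Each term appears as  weight*"term"  in the strings returned by print_topics.
TERM_RE = re.compile(r'(\d+\.\d+)\*"([^"]+)"')

def parse_topic(topic_str):
    """Turn '0.106*"wlb" + 0.059*"great"' into [('wlb', 0.106), ('great', 0.059)]."""
    return [(term, float(weight)) for weight, term in TERM_RE.findall(topic_str)]

# Output of lda_model.print_topics(num_words = 5), copied from above
topics = [
    (0, '0.106*"wlb" + 0.059*"great" + 0.057*"great wlb" + 0.046*"good" + 0.045*"growth"'),
    (1, '0.123*"good" + 0.082*"place" + 0.048*"management" + 0.043*"bad" + 0.039*"great place"'),
    (2, '0.096*"work" + 0.088*"great" + 0.087*"balance" + 0.086*"life" + 0.083*"work life"'),
]

parsed = {topic_id: parse_topic(s) for topic_id, s in topics}

# Terms that show up in more than one topic hint at overlapping topics
term_counts = Counter(term for pairs in parsed.values() for term, _ in pairs)
shared_terms = sorted(term for term, n in term_counts.items() if n > 1)
print(shared_terms)
```

Generic words shared between topics, such as "good" and "great" here, are candidates for a custom stop-word list if sharper topics are wanted.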

### NMF with TF-IDF

In [208]:
nmf = NMF(n_components=3)
doc_topic = nmf.fit_transform(doc_term)



In [210]:
def get_top_terms(topic, n_terms, nmf=nmf, terms=vec_tfidf.get_feature_names_out()):
    # get the topic components (i.e., term weights)
    components = nmf.components_[topic, :]

    # get indices of the n_terms largest weights (argsort is ascending, so the top term comes last)
    top_term_indices = components.argsort()[-n_terms:]
    
    # use the `terms` array to get the actual top terms
    top_terms = np.array(terms)[top_term_indices]
    
    return top_terms.tolist()

In [211]:
for i in range(3):
    print(get_top_terms(i, 5))

['work life balance', 'life balance', 'work life', 'balance', 'life']
['culture', 'great place', 'great wlb', 'wlb', 'great']
['good place', 'good company', 'good wlb', 'company', 'good']


### LSA with TF-IDF

In [212]:
lsa = TruncatedSVD(n_components=3)
doc_topic = lsa.fit_transform(doc_term)
doc_topic

array([[ 0.03388926,  0.13514276,  0.04797742],
       [ 0.04455252,  0.19098186,  0.2555708 ],
       [ 0.21106943,  0.25110293, -0.12412539],
       ...,
       [ 0.35408265, -0.03122332,  0.06668786],
       [ 0.12773978,  0.19831203,  0.27946553],
       [ 0.        ,  0.        ,  0.        ]])

In [214]:
def get_top_terms_lsa(topic, n_terms, lsa=lsa, terms=vec_tfidf.get_feature_names_out()):
    # get the topic components (i.e., term weights)
    components = lsa.components_[topic, :]

    # get indices of the n_terms largest weights (argsort is ascending, so the top term comes last)
    top_term_indices = components.argsort()[-n_terms:]
    
    # use the `terms` array to get the actual top terms
    top_terms = np.array(terms)[top_term_indices]
    
    return top_terms.tolist()

In [215]:
for i in range(3):
    print(get_top_terms_lsa(i, 5))

['life balance', 'work life', 'balance', 'life', 'work']
['company', 'great wlb', 'good', 'great', 'wlb']
['good company', 'good work life', 'good work', 'good wlb', 'good']
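
Unlike NMF, LSA components can be negative (the `doc_topic` array above already contains negative entries), so ranking by signed weight can hide terms that load strongly in the negative direction. The sketch below ranks by absolute weight instead. It is a minimal illustration in plain Python: `top_terms_by_magnitude` is a hypothetical helper, and the toy weights are made up for the example rather than taken from the fitted model.

```python
def top_terms_by_magnitude(weights, terms, n_terms):
    """Return the n_terms (term, weight) pairs with the largest absolute weight, strongest first."""
    ranked = sorted(zip(terms, weights), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:n_terms]

# Toy component row (illustrative values, not taken from the fitted model)
terms = ["good", "great", "wlb", "balance", "company"]
weights = [0.62, -0.55, -0.40, 0.05, 0.10]

print(top_terms_by_magnitude(weights, terms, 3))
```

With the objects from this notebook, `weights` would be `lsa.components_[topic, :]` and `terms` would be `vec_tfidf.get_feature_names_out()`.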


### Visualizing with pyLDAvis

In [216]:
import pyLDAvis
import pyLDAvis.gensim_models as gensim_vis

In [354]:
#vec_tfidf.vocabulary_.items()
#vec.vocabulary_.items()
word2id = dict((k, v) for k, v in vec.vocabulary_.items())
d = corpora.Dictionary()
d.id2token = id2word
d.token2id = word2id
lda_corpus = lda_model[corpus]
lda_corpus

<gensim.interfaces.TransformedCorpus at 0x7f9a0b804220>

In [355]:
#gensim_vis.pyLDAvis.prepare(lda, lda_corpus, id2word)
pyLDAvis.enable_notebook()
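# prepare() expects the bag-of-words corpus used to train the model, not the transformed lda_model[corpus]
visualization = gensim_vis.prepare(lda_model, corpus = corpus, dictionary = d)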
visualization



## Comparing pro vs. con reviews across companies

In [222]:
# use spaCy for positive vs negative and scattertext
import spacy
nlp = spacy.load('en_core_web_sm')

In [237]:
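# cast to str so missing (NaN) entries do not break spaCy;
# rebinding a loop variable would leave the column unchanged
full_data['Cons'] = full_data['Cons'].astype(str)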

In [226]:
full_data['spacy_pros'] = list(nlp.pipe(full_data.Pros))
full_data['Cons'] = full_data['Cons'].astype(str) # fixes an error caused by this column not being read in as text
full_data['spacy_cons'] = list(nlp.pipe(full_data.Cons))
full_data.head()

Unnamed: 0,Rating,Description,Pros,Cons,Company,spacy_pros
0,3.0,A decent tier 2 company,Benefits are good. ESPP option is amazing. Cul...,Growth is not too great. Too many old timers w...,Adobe,"(Benefits, are, good, ., ESPP, option, is, ama..."
1,5.0,Good Company...terrible middle managers,"Solid comp, decent RSUs, good wlb but it's tea...","Too much bureaucracy, some really bad middle m...",Adobe,"(Solid, comp, ,, decent, RSUs, ,, good, wlb, b..."
2,4.0,Great place to work,Great work life balance; everyone wants to do ...,Uncertain career progression. Internal candida...,Adobe,"(Great, work, life, balance, ;, everyone, want..."
3,2.0,"Not a place for work life balance, full of pol...",PerksPay (if including stocks)Company policies...,ManagementWork life balance is terriblePolitic...,Adobe,"(PerksPay, (, if, including, stocks)Company, p..."
4,4.0,Work life balance is good,"Wellness compensation, work life balance, stab...","Some teams really sucks, nothing to learn, no ...",Adobe,"(Wellness, compensation, ,, work, life, balanc..."




In [242]:
# extract top adjectives for pros and cons reviews
#pos_reviews = df[df.Type == 'pos']
pros_adj = [token.text.lower() for doc in 
           full_data.spacy_pros
           for token in doc if 
           token.pos_ == 'ADJ'] #pos here is "part of speech" not positive

#neg_reviews = df[df.Type == 'neg']
cons_adj = [token.text.lower() for doc in
           full_data.spacy_cons
           for token in doc if 
           token.pos_ == 'ADJ'] #pos here is "part of speech" not positive

In [309]:
from collections import Counter  # imported here so the cell runs on its own

top_pros_adj = Counter(pros_adj).most_common(10)
top_cons_adj = Counter(cons_adj).most_common(10)
print(top_pros_adj)
print(top_cons_adj)

[('good', 20703), ('great', 14631), ('smart', 3153), ('nice', 2652), ('interesting', 2235), ('decent', 2104), ('new', 2053), ('other', 1612), ('best', 1522), ('many', 1480)]
[('bad', 3882), ('slow', 3754), ('good', 3538), ('other', 3420), ('many', 2991), ('much', 2886), ('low', 2731), ('poor', 2196), ('great', 1890), ('hard', 1751)]
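
Several adjectives appear in both lists. "good" and "great" in the Cons list plausibly come from negated phrases such as "Growth is not too great", although that is an assumption, not something the counts themselves show. The sketch below separates the shared adjectives from those unique to each prompt, using the counts copied from the output above.

```python
# Top-10 adjective counts copied from the output above
top_pros_adj = [('good', 20703), ('great', 14631), ('smart', 3153), ('nice', 2652),
                ('interesting', 2235), ('decent', 2104), ('new', 2053), ('other', 1612),
                ('best', 1522), ('many', 1480)]
top_cons_adj = [('bad', 3882), ('slow', 3754), ('good', 3538), ('other', 3420),
                ('many', 2991), ('much', 2886), ('low', 2731), ('poor', 2196),
                ('great', 1890), ('hard', 1751)]

pros_words = {word for word, _ in top_pros_adj}
cons_words = {word for word, _ in top_cons_adj}

# adjectives common to both prompts, and those that appear for only one of them
shared = sorted(pros_words & cons_words)
pros_only = [word for word, _ in top_pros_adj if word not in cons_words]
cons_only = [word for word, _ in top_cons_adj if word not in pros_words]

print("shared:   ", shared)
print("pros only:", pros_only)
print("cons only:", cons_only)
```

The prompt-specific lists are the more informative ones for contrasting Pros and Cons. The shared adjectives would need context, for example a negation check, before they could be interpreted.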


In [349]:
full_data.loc[full_data['Pros'].str.contains('great', case = False)]
full_data.loc[full_data['Cons'].str.contains('Many', case = False)]

Unnamed: 0,Rating,Description,Pros,Cons,Company,spacy_pros,spacy_cons
0,3.0,A decent tier 2 company,Benefits are good. ESPP option is amazing. Cul...,Growth is not too great. Too many old timers w...,Adobe,"(Benefits, are, good, ., ESPP, option, is, ama...","(Growth, is, not, too, great, ., Too, many, ol..."
54,4.0,Great work life balance,Great work like balance. Good culture. Good ES...,Not many cons I can think of. Pretty good company,Adobe,"(Great, work, like, balance, ., Good, culture,...","(Not, many, cons, I, can, think, of, ., Pretty..."
72,4.0,Great work life balance,Very reasonable workload and WFH policy (team ...,"Old codebase and mentality, not willing to inn...",Adobe,"(Very, reasonable, workload, and, WFH, policy,...","(Old, codebase, and, mentality, ,, not, willin..."
85,5.0,Probably the best WLB I have had in the last 1...,"- Most devs are chill, low on politics and les...",- At VP and higher level too many recent reorgs.,Adobe,"(-, Most, devs, are, chill, ,, low, on, politi...","(-, At, VP, and, higher, level, too, many, rec..."
87,1.0,Micromanagement and toxic culture,Their compensation packages for new hires are ...,Micromanagement that starts from the top. Shan...,Adobe,"(Their, compensation, packages, for, new, hire...","(Micromanagement, that, starts, from, the, top..."
...,...,...,...,...,...,...,...
1350,4.0,Depends on the team. If you work work with off...,"Good tech stack, friendly people for most part...",Too many layers of management who have no clue...,Walmart,"(Good, tech, stack, ,, friendly, people, for, ...","(Too, many, layers, of, management, who, have,..."
1354,4.0,Good work life balance,Good tech stack and great learning opportunity...,Too many management changes affects stability ...,Walmart,"(Good, tech, stack, and, great, learning, oppo...","(Too, many, management, changes, affects, stab..."
1361,4.0,Work life balance pays well for low cost ofLiving,Cost of living in Arkansas is low and pays wel...,Very politicalNot many other job choices nearb...,Walmart,"(Cost, of, living, in, Arkansas, is, low, and,...","(Very, politicalNot, many, other, job, choices..."
1362,4.0,"Not bad, very friendly and socially aware, hid...",1. Very friendly and caring feel to it for foc...,"1. Managers are managers, means that they ofte...",Walmart,"(1, ., Very, friendly, and, caring, feel, to, ...","(1, ., Managers, are, managers, ,, means, that..."


In [334]:
from collections import Counter

# find adjective modifiers of the target noun (here "company")
noun_str = 'company'

# Pros
adj_modifiers_pros = []
top_adj_mod_pros = []

for doc in full_data.spacy_pros: 
    for token in doc:
        if token.text == noun_str:
            for child in token.children:
                if child.dep_ == 'amod':
                    adj_modifiers_pros.append(child.text.lower())

top_adj_mod_pros = Counter(adj_modifiers_pros).most_common(20)

# Cons
adj_modifiers_cons = []
top_adj_mod_cons = []

for doc in full_data.spacy_cons: 
    for token in doc:
        if token.text == noun_str:
            for child in token.children:
                if child.dep_ == 'amod':
                    adj_modifiers_cons.append(child.text.lower())

top_adj_mod_cons = Counter(adj_modifiers_cons).most_common(20)

In [335]:
top_adj_mod_pros

[('great', 298),
 ('good', 273),
 ('big', 189),
 ('large', 105),
 ('stable', 86),
 ('other', 67),
 ('best', 59),
 ('growing', 43),
 ('overall', 33),
 ('huge', 30),
 ('first', 25),
 ('decent', 18),
 ('driven', 18),
 ('global', 18),
 ('known', 17),
 ('friendly', 17),
 ('amazing', 17),
 ('nice', 16),
 ('top', 16),
 ('innovative', 16)]

In [336]:
top_adj_mod_cons

[('big', 327),
 ('large', 200),
 ('other', 88),
 ('huge', 61),
 ('tech', 52),
 ('good', 48),
 ('great', 36),
 ('old', 26),
 ('driven', 23),
 ('top', 21),
 ('moving', 18),
 ('slow', 17),
 ('massive', 16),
 ('giant', 16),
 ('growing', 16),
 ('first', 15),
 ('based', 13),
 ('bad', 13),
 ('entire', 13),
 ('overall', 12)]
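
The Pros and Cons loops above are identical apart from their input, so they could be folded into one helper. The sketch below is one possible refactor, not the notebook's code: `count_adj_modifiers` and `tok` are hypothetical names, and lower-casing the token before comparing is an added assumption (the loops above match `token.text` exactly, so "Company" is not counted there). Stand-in token objects are used so the example runs without spaCy. They expose only the attributes the loops rely on (`text`, `dep_`, `children`).

```python
from collections import Counter
from types import SimpleNamespace

def count_adj_modifiers(docs, noun, top_n=20):
    """Count adjectival modifiers (dep_ == 'amod') attached to `noun` across docs."""
    noun = noun.lower()
    counts = Counter()
    for doc in docs:
        for token in doc:
            # assumption: case-insensitive match, unlike the exact match used above
            if token.text.lower() == noun:
                counts.update(child.text.lower() for child in token.children
                              if child.dep_ == 'amod')
    return counts.most_common(top_n)

# Stand-in tokens (hypothetical) that mimic the spaCy attributes used here
def tok(text, dep="", children=()):
    return SimpleNamespace(text=text, dep_=dep, children=list(children))

great = tok("Great", dep="amod")
big = tok("big", dep="amod")
the = tok("the", dep="det")
doc1 = [great, tok("Company", children=[great])]
doc2 = [the, big, tok("company", children=[the, big])]

print(count_adj_modifiers([doc1, doc2], "company"))
```

With the notebook's data the calls would be `count_adj_modifiers(full_data.spacy_pros, noun_str)` and `count_adj_modifiers(full_data.spacy_cons, noun_str)`. Because of the case-insensitive match, the counts may differ slightly from the ones shown above.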

In [None]:
# also use sentiment analysis with Vader

## Scattertext application

### Build the scattertext corpus

In [300]:
import scattertext as st

# Extract only the pros and cons in a column
scatter_data = full_data_long[full_data_long['Prompt'].isin(['Pros', 'Cons'])]

corpus = st.CorpusFromPandas(scatter_data,
                             category_col = 'Prompt',
                             text_col = 'Output',
                             nlp = st.whitespace_nlp_with_sentences).build()

# starting point if a random sample of the data is needed
#df.sample(5, random_state=10)


### Create scatterplot html

In [306]:
html = st.produce_scattertext_explorer(
        corpus,
        category='Pros',
        category_name='Pros',
        not_category_name='Cons',
        minimum_term_frequency=10,
        pmi_threshold_coefficient=5,
        width_in_pixels=1000
#        metadata=scatter_data['Company'],
        )
open('job_reviews.html', 'wb').write(html.encode('utf-8'))

12488115

In [299]:
scatter_data.head()
# starting point if a random sample of the data is needed
#df.sample(5, random_state=10)

Unnamed: 0,Company,Rating,Prompt,Output
43803,Adobe,3.0,Pros,Benefits are good. ESPP option is amazing. Cul...
43804,Adobe,5.0,Pros,"Solid comp, decent RSUs, good wlb but it's tea..."
43805,Adobe,4.0,Pros,Great work life balance; everyone wants to do ...
43806,Adobe,2.0,Pros,PerksPay (if including stocks)Company policies...
43807,Adobe,4.0,Pros,"Wellness compensation, work life balance, stab..."


## Predicting overall rating from general reviews

In [None]:
#Predict including company name

## Sentiment of pros vs. cons? 