## Are the most common words also the most important words?

This may seem like a trivial question, but in text analytics it can be harder to answer than it looks, largely because text data is challenging to work with: it is unstructured and therefore tedious to parse.

Given the question posed above, imagine you are working with customer survey data (that is, a corpus of reviews). A word such as *quality* might appear many times in one particular customer's survey and nowhere else in the rest of the data. If that one review repeats 'quality' many, many times, a raw count could erroneously make 'quality' look like a major theme across all of the surveys.

In the following notebook, I'll be looking at **importance of words** via the number of times they appear across sentences in a corpus rather than via their raw count (how many times they occur altogether through an entire corpus). 
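To make the difference concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences and variable names are my own illustration, not the PNC data). It contrasts the raw count of a word with the number of sentences the word appears in.

```python
from collections import Counter

# Hypothetical mini-corpus: one sentence repeats "quality" many times.
sentences = [
    "quality quality quality quality service",
    "friendly service and fair pay",
    "fair pay and good benefits",
]

tokenized = [s.split() for s in sentences]

# Raw count: every occurrence counts, so one repetitive sentence dominates.
raw_counts = Counter(w for words in tokenized for w in words)

# Sentence count: each sentence contributes a given word at most once.
sentence_counts = Counter(w for words in tokenized for w in set(words))

print(raw_counts["quality"], sentence_counts["quality"])  # 4 1
print(raw_counts["service"], sentence_counts["service"])  # 2 2
```

By raw count 'quality' looks like the top word, while by sentence count it is less widespread than 'service'.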

**Disclaimer**
My inspiration comes from CSE6040's Problem 20 in this semester's set of final practice problems (Fall 2021). The general workflow is similar to Professor Vuduc's, but the data wrangling is my own and is applied to employee reviews for PNC Bank. The data comes from an HTML file that I downloaded and parsed, and it is wrangled with my own sentence tokenizer, word tokenizer, and bag-of-words function, all of which you will see shortly.

At some point, I reuse the professor's code for generating word IDs and for reversing that mapping, since his implementation was both simple and elegantly written with Python dictionaries. I also show how to get the leading left singular vector manually with numpy's linalg.svd function, which the professor obtains with scipy's svds() function.



In [1]:
import bs4
import re
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We'll first use Beautiful Soup to open and parse an HTML file that I pulled directly from the review page for PNC on Glassdoor. Once it is parsed, we pull all of the comments marked as 'pros' (the positive ones) and store the result in a list called **outtext**.

In [2]:
# Working with HTML file
with open("PNCReviews_Glassdoor.html", "r") as f:
    doc = bs4.BeautifulSoup(f, "html.parser")
# Converting the doc file into a string
as_string_doc = str(doc)

# Regex for pulling out all the positive 'pro' reviews
pattern = re.compile(r'\"pros\">([^<]*)<')
result = pattern.findall(as_string_doc)

# Lowercase each matched review and store the results in a list
outtext = [match.lower() for match in result] 

In [3]:
n = 3
print(f'There are {len(outtext)} reviews in the corpus. Here are the first {n}:')
outtext[:n]

There are 10 reviews in the corpus. Here are the first 3:


['pension. 401k match, training and development opportunities if your time permits. actually making a difference in communities with green investing, a social bond, and other leading edge initiatives.',
 'especially with the pandemic, they have been very flexible with remote capable employees. leadership has shown a willingness to listen to employees and change.',
 'i was able to do some financial planning for the future such as creating a retirement plan and buying some shares at a discounted rate until i resigned then they make you pay it back by deducting it from your shares. this was a great opportunity for me to see what types of jobs i never want again my regret is that i didn’t resign sooner. i’m grateful to have worked in both the branches and corporate.']
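To see what the 'pros' regex from cell In [2] actually captures, here is a small sketch on a made-up HTML fragment. The markup below is an assumption for illustration only; the real Glassdoor page may be structured differently.

```python
import re

# Hypothetical snippet; the real Glassdoor markup may differ.
html = (
    '<span data-test="pros">Great benefits. Flexible hours.</span>'
    '<span data-test="cons">Slow promotions.</span>'
    '<span data-test="pros">Supportive team</span>'
)

# Same pattern as in the notebook: capture everything between
# the literal text "pros"> and the next "<".
pattern = re.compile(r'\"pros\">([^<]*)<')
pros = [match.lower() for match in pattern.findall(html)]

print(pros)
# ['great benefits. flexible hours.', 'supportive team']
```

Because the capture group stops at the first `<`, a review that contains inline tags would be cut off at that tag.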

#### But this document only has 10 reviews?

In the process of working with the review data, I learned quickly that, given the number of words we might expect to see in a review, the total number of words we'd be working with grows substantially as the number of reviews increases. For demonstration purposes, I'm limiting the number of reviews to just these 10, but the idea works for larger corpuses as well.

#### Let's get the sentences out.

In [4]:
def extractsentences(doc):
    final = []
    for s in doc:
        # Split on sentence-ending punctuation (raw string avoids invalid-escape warnings)
        out = re.split(r'[?!.]', s)
        cleaned = [sent.strip(' ') for sent in out]
        final = final + [sent for sent in cleaned if sent]
    return final
all_sents = extractsentences(outtext)

In [5]:
all_sents

['pension',
 '401k match, training and development opportunities if your time permits',
 'actually making a difference in communities with green investing, a social bond, and other leading edge initiatives',
 'especially with the pandemic, they have been very flexible with remote capable employees',
 'leadership has shown a willingness to listen to employees and change',
 'i was able to do some financial planning for the future such as creating a retirement plan and buying some shares at a discounted rate until i resigned then they make you pay it back by deducting it from your shares',
 'this was a great opportunity for me to see what types of jobs i never want again my regret is that i didn’t resign sooner',
 'i’m grateful to have worked in both the branches and corporate',
 'bank recognizes value of technology and ongoing training of employees',
 'it is very focused on diversity and inclusion and offers a broad product suite to meet the needs of any business or individual',
 "there'

#### What are the most common words?

The NLTK package is a great resource for working with text data. It offers sentence and word tokenizers as well as the list of stopwords that I use below. For the bag of words built later on, I decided NOT to use the package's built-in sent_tokenize() or word_tokenize() functions, mainly because I wanted to implement my own tokenization procedure; the quick count in the next cell still relies on nltk.word_tokenize().

Stopwords like 'I', 'is', and 'the' appear a TON of times, but do they really carry a lot of semantic meaning? *Probably not*.

In [6]:
from collections import Counter
from nltk.corpus import stopwords

# NLTK's stopword file id is lowercase; 'English' can fail on case-sensitive file systems
stopwordlist = stopwords.words('english')
every_single_word = []
for s in all_sents:
    every_single_word = every_single_word + nltk.word_tokenize(s)
every_single_word = [w for w in every_single_word if (w) and (w not in stopwordlist) and (w != ',')]

Before getting knee deep into the manipulation of the text data, let's see what the top ***n*** words are by raw count.

In [7]:
N = 5
topcommonwords = Counter(every_single_word).most_common()
topNwords = topcommonwords[:N]
print(f'Top {N} words by frequency/count: \n', topNwords)

Top 5 words by frequency/count: 
 [('employees', 6), ('financial', 4), ('training', 3), ('good', 3), ('shares', 2)]


**Now let's construct the bag of words by tokenizing the raw sentences.** 

As an example, imagine my text is "Obiwan is actually Luke's father. Anakin would be devastated."

From the **all_sents** that we constructed, the input list would look something *like* this:

    all_sents = ["Obiwan is actually Luke's father",
                 "Anakin would be devastated"]

and the output would be in a form similar to this (a list of word sets derived from the sentences):

    wordsinabag = [{'Obiwan', 'actually', 'Luke', 'father'},
                   {'Anakin', 'devastated'}]
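The example above can be reproduced with a short sketch. To keep it self-contained I use a tiny hand-picked stopword set and a hypothetical helper named toy_bag_of_words; the notebook itself uses NLTK's stopword list, which is much larger.

```python
import re

# Tiny hand-picked stopword set for illustration only; the notebook
# uses NLTK's English stopword list instead.
STOPWORDS = {"is", "be", "would", "s"}

def toy_bag_of_words(sentences):
    """Return one set of non-stopword tokens per sentence."""
    bags = []
    for sentence in sentences:
        # Keep runs of letters only, so "Luke's" becomes "Luke" and "s".
        tokens = re.findall(r"[a-zA-Z]+", sentence)
        bags.append({t for t in tokens if t.lower() not in STOPWORDS})
    return bags

example = ["Obiwan is actually Luke's father", "Anakin would be devastated"]
bags = toy_bag_of_words(example)
print(bags)
```

Each sentence becomes a set, so a word repeated within one sentence is only recorded once for that sentence.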
        

In [8]:
# Tokenize a sentence into alphabetic words (digits and punctuation are dropped)
def tokenize_sentence(s):
    pattern = r"[a-zA-Z]+"
    return re.findall(pattern, s)

In [9]:
# Generate the bag of words. Again, our goal is to return a list of sets containing the unique words, which we will
# use to evaluate the occurrence of each word across all sentences.
def bag_of_words(sents):
    from nltk.corpus import stopwords
    stopwordlist = stopwords.words('english')
    x = [tokenize_sentence(s) for s in sents]
    out = []
    for cleaned_sentence in x:
        out.append({word for word in cleaned_sentence if word not in stopwordlist})
    return out

In [10]:
wordsinabag = bag_of_words(all_sents)

In [11]:
print("Here are the first 5 elements (tokenized sentences): ")
wordsinabag[:5]

Here are the first 5 elements (tokenized sentences): 


[{'pension'},
 {'development', 'k', 'match', 'opportunities', 'permits', 'time', 'training'},
 {'actually',
  'bond',
  'communities',
  'difference',
  'edge',
  'green',
  'initiatives',
  'investing',
  'leading',
  'making',
  'social'},
 {'capable', 'employees', 'especially', 'flexible', 'pandemic', 'remote'},
 {'change', 'employees', 'leadership', 'listen', 'shown', 'willingness'}]

As mentioned above, I brought over the professor's code from Practice Problem 20 that generates the set of all unique words and uses this set to create word to integer mappings. We will be using these to map words and their ids to each other momentarily. 

In [12]:
# Grab all of the unique words from the bag of words constructed above 
all_words = set()
for b in wordsinabag:
    all_words = all_words | b

# Assign an index to each word and create another dictionary that
# maps the indices back to their words (for later use)
word_to_id = {w: k for k, w in enumerate(all_words)}
id_to_word = {k: w for k, w in enumerate(all_words)}
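As a small illustration of the two mappings, here is a sketch on made-up bags. Sorting the vocabulary is my own addition and not part of the professor's code: it only makes the ids reproducible between runs, since the iteration order of a set of strings can differ from one Python session to the next.

```python
# Hypothetical bags of words for illustration.
bags = [{"pay", "benefits"}, {"pay", "culture"}]

# Union of all bags gives the vocabulary.
all_words = set().union(*bags)

# Sorting (an assumption added here) fixes the id of each word.
word_to_id = {w: k for k, w in enumerate(sorted(all_words))}
id_to_word = {k: w for w, k in word_to_id.items()}

print(word_to_id)     # {'benefits': 0, 'culture': 1, 'pay': 2}
print(id_to_word[2])  # pay
```

Building id_to_word by inverting word_to_id guarantees that the two dictionaries stay consistent with each other.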

As a reminder, in Problem 20 the idea was to build a sparse matrix for $m$ words and $n$ sentences, that is, a matrix $A$ of size $m \times n$. Its entries are $a_{i,j}$, where $i$ is the ID of a word and $j$ is the ID of a sentence (which we conveniently made the index of the sentence in the bag of words list), and $n_i$ is the number of sentences that contain word $i$. Eventually we'll rank the words (you can do this with the sentences too) by how widely they appear across sentences, leveraging the power of the SVD from linear algebra.


$$
  a_{i,j} = \left\{ \begin{array}{cl}
          \left(\ln \dfrac{n+1}{n_i}\right)^{-1} & \mbox{if word $i$ appears in sentence $j$} \\
                             0                   & \mbox{otherwise.}
      \end{array} \right.
$$
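To get a feel for the formula, here is a quick numeric sketch for a hypothetical corpus of $n = 3$ sentences. The helper name weight is my own.

```python
from math import log

n = 3  # number of sentences in a hypothetical corpus

def weight(n_i, n):
    """Weight from the formula above for a word found in n_i of n sentences."""
    return 1.0 / log((n + 1) / n_i)

for n_i in (1, 2, 3):
    print(n_i, round(weight(n_i, n), 4))
# 1 0.7213
# 2 1.4427
# 3 3.4761
```

The weight grows with $n_i$, so a word that shows up in more sentences gets larger entries in $A$.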

Here is my implementation of this formula in the gen_coords function (like the gen_coords function from Problem 20). Admittedly, it is not the most efficient: get_ni rescans every bag for each word inside the nested for loop that computes our $a_{i,j}$ values, but it gets the job done.

In [13]:
def gen_coords(bags, word_to_id): 
    m, n = len(word_to_id), len(bags)
    rows, cols, vals = [], [], []

    from math import log
    def get_ni(bags, word):
        count = 0
        for word_set in bags:
            if word in word_set:
                count += 1
        return count
    
    for i, word_set in enumerate(bags):
        for j, word in enumerate(word_set):
            rows.append(word_to_id[word])
            cols.append(i)
            ni = get_ni(bags, word)
            want = 1/log((n + 1)/ni)
            vals.append(want)
        
    return rows, cols, vals
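If the repeated scanning ever becomes a bottleneck, one possible alternative is to count $n_i$ for every word in a single pass. This is only a sketch under that assumption; gen_coords_fast is a hypothetical name and the sample bags are made up.

```python
from collections import Counter
from math import log

def gen_coords_fast(bags, word_to_id):
    """Same kind of output as gen_coords, but n_i is precomputed in one pass."""
    n = len(bags)
    # Each bag is a set, so a word is counted once per sentence.
    n_i = Counter(word for bag in bags for word in bag)

    rows, cols, vals = [], [], []
    for j, bag in enumerate(bags):
        for word in bag:
            rows.append(word_to_id[word])   # row = word id
            cols.append(j)                  # column = sentence index
            vals.append(1.0 / log((n + 1) / n_i[word]))
    return rows, cols, vals

bags = [{"pay", "benefits"}, {"pay", "culture"}, {"culture"}]
word_to_id = {"benefits": 0, "culture": 1, "pay": 2}
rows, cols, vals = gen_coords_fast(bags, word_to_id)

# Collect the entries in a dict keyed by (row, col) for easy inspection.
coords = dict(zip(zip(rows, cols), vals))
print(sorted(coords))
# [(0, 0), (1, 1), (1, 2), (2, 0), (2, 1)]
```

The order of the returned lists depends on set iteration order, which is why the sketch inspects the entries through a dictionary instead of comparing lists directly.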

Using gen_coords, we can get the row, column, and data values that will be fed into scipy's csr_matrix() constructor.

In [14]:
r, c, v = gen_coords(wordsinabag, word_to_id)

In [15]:
from scipy.sparse import csr_matrix
A = csr_matrix((v, (r, c)), shape=(len(word_to_id), len(wordsinabag)))
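For readers who have not used the coordinate form of the constructor before, here is a tiny sketch with made-up values showing where each (row, column, value) triple ends up.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical coordinates: 3 words (rows) by 2 sentences (columns).
rows = [0, 2, 1, 2]
cols = [0, 0, 1, 1]
vals = [0.5, 1.5, 0.5, 1.5]

A_small = csr_matrix((vals, (rows, cols)), shape=(3, 2))

print(A_small.nnz)        # 4 stored entries
print(A_small.toarray())
# [[0.5 0. ]
#  [0.  0.5]
#  [1.5 1.5]]
```

Every position that is not listed in the coordinates is an implicit zero, which is what keeps the word-by-sentence matrix cheap to store.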

Now it's time to actually rank the words using matrix $A$'s SVD. Note that our matrix is in CSR format, which np.linalg.svd() cannot consume directly, so the function below first converts it to a dense matrix. With np.linalg.svd() we can easily grab the left singular vectors U, the singular values Sigma, and the right singular vectors (returned in transposed form, $V^H$).

In [16]:
def applySVD(A):
    """Apply the SVD to our problem"""
    # Convert it back to a dense matrix
    A = A.todense()
    u, s, v = np.linalg.svd(A)
    return u, s, v

In [17]:
U, Sigma, V = applySVD(A)
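A few properties of np.linalg.svd() matter for the ranking step, and they can be checked on a small made-up matrix. The values and names below are illustrative assumptions, not the review data.

```python
import numpy as np

# Small non-negative matrix standing in for A (values are made up).
A_demo = np.array([[0.5, 0.0],
                   [0.0, 0.5],
                   [1.5, 1.5]])

U_demo, s_demo, Vh_demo = np.linalg.svd(A_demo, full_matrices=False)

# Singular values come back sorted from largest to smallest.
print(s_demo)

# The third return value is V transposed (V^H), not V itself.
assert np.allclose(A_demo, U_demo @ np.diag(s_demo) @ Vh_demo)

# Singular vectors are only defined up to sign, so compare magnitudes.
u0_demo = np.abs(U_demo[:, 0])
top_row = int(np.argsort(u0_demo)[::-1][0])
print(top_row)  # 2, the row with the largest entries
```

The sign ambiguity is the reason the professor's version further below takes absolute values before sorting.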

In [18]:
# Rank the words using the leading left singular vector
def get_ranked_words(U):
    """Takes in U, whose columns are the left singular vectors of A (the eigenvectors of A A^H).
    Returns a numpy array of word ids sorted from most to least important, based on the
    left singular vector that corresponds to the largest singular value.
    """
    # Recall that the columns of U are the left singular vectors; we want the first one, which corresponds
    # to the largest singular value (the first singular value in Sigma)
    U_0 = U[:, 0]

    # U is a numpy matrix here, so U_0 is an (m, 1) column; flatten it into a plain 1-D array
    U_0_1 = np.asarray(U_0).ravel()

    # Singular vectors are only determined up to sign, so rank by magnitude
    # and reverse argsort's ascending order to put the largest entries first
    return np.argsort(np.abs(U_0_1))[::-1]

In [19]:
# Now let's rank the words
ranked_words = get_ranked_words(U)
top_words = [id_to_word[k] for k in ranked_words[:5]]
print("Top 5 words computed:", top_words)

Top 5 words computed: ['employees', 'good', 'training', 'culture', 'bank']


**For demonstration purposes, this is Professor Vuduc's implementation from Problem 20, which gets the same results as above.**

It turns out that scipy.sparse.linalg has an svds() solver that can be asked for the largest singular values (the which='LM' argument) and for how many singular values and corresponding singular vectors to return (the k=1 argument). A major difference is that svds() takes a sparse matrix as input, unlike my implementation above, which needed a dense matrix to work on.

In [20]:
def get_svds_largest(A):
    from scipy.sparse.linalg import svds
    from numpy import abs
    u, s, v = svds(A, k=1, which='LM', return_singular_vectors=True)
    return s, abs(u.reshape(A.shape[0])), abs(v.reshape(A.shape[1]))

_, u0, _ = get_svds_largest(A)
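As a sanity check that the sparse and dense routes agree, here is a sketch on a small made-up matrix. It compares the largest singular value and the magnitudes of the leading left singular vector; the matrix values and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Hypothetical 4-word by 3-sentence matrix.
A_demo = csr_matrix(np.array([[0.5, 0.0, 0.0],
                              [0.0, 0.5, 0.7],
                              [1.5, 1.5, 0.0],
                              [0.0, 0.7, 0.7]]))

# Sparse route: only the largest singular triplet is computed.
u_sparse, s_sparse, _ = svds(A_demo, k=1, which='LM')

# Dense route: full SVD, then keep the first column of U.
U_dense, s_dense, _ = np.linalg.svd(A_demo.toarray(), full_matrices=False)

# Compare magnitudes, since either solver may flip the sign of the vector.
same_value = bool(np.isclose(s_sparse[0], s_dense[0]))
same_vector = bool(np.allclose(np.abs(u_sparse[:, 0]),
                               np.abs(U_dense[:, 0]), atol=1e-6))
print(same_value, same_vector)
```

Note that svds() requires k to be smaller than both dimensions of the matrix, which is easily satisfied here with k=1.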

In [21]:
n = 5
ranked_words = np.argsort(u0)[::-1]
top_ten_words = [id_to_word[k] for k in ranked_words[:n]]
print(f"Top {n} words computed:", top_ten_words)
a = set(top_ten_words)

Top 5 words computed: ['employees', 'good', 'training', 'culture', 'bank']


In [22]:
b = [w[0] for w in topcommonwords[:n]]
print(f"Top {n} words by count:", b)
b = set(b)

Top 5 words by count: ['employees', 'financial', 'training', 'good', 'shares']


In [23]:
print(f'With the naked eye we can see how similar these sets are from the \
elements in their intersection: \n {a&b}. <- Here we have {len(a&b)} elements out of a total of {len(a)}.')


With the naked eye we can see how similar these sets are from the elements in their intersection: 
 {'good', 'employees', 'training'}. <- Here we have 3 elements out of a total of 5.
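Beyond eyeballing the intersection, one simple way to put a number on the overlap is the Jaccard similarity (intersection size divided by union size). This is my own addition, not part of Problem 20; the two sets are copied from the outputs above.

```python
# Top-5 lists copied from the outputs above.
svd_top = {'employees', 'good', 'training', 'culture', 'bank'}
count_top = {'employees', 'financial', 'training', 'good', 'shares'}

shared = svd_top & count_top
jaccard = len(shared) / len(svd_top | count_top)

print(sorted(shared))     # ['employees', 'good', 'training']
print(round(jaccard, 3))  # 0.429
```

A value of 1 would mean the two rankings pick exactly the same words, and 0 would mean they share none.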


### What's next?

Using regex again, let's pull in job title information.

In [26]:
d = {'reviews': outtext}
emp_reviews = pd.DataFrame(d)


def f(row):
    if 'employees' in row['reviews']:
        return 1
    else:
        return 0
emp_reviews['flag'] = emp_reviews.apply(f, axis=1)

# Regex for pulling out the job titles
jobpattern = re.compile(r">\w\w\w \d\d\, \d\d\d\d - ([\w\d\s\/,]+)<")
jpres = jobpattern.findall(as_string_doc)

emp_reviews['jobtitles'] = pd.Series(jpres)
emp_reviews

| | reviews | flag | jobtitles |
|---|---|---|---|
| 0 | pension. 401k match, training and development ... | 0 | Technical Writer/Editor |
| 1 | especially with the pandemic, they have been v... | 1 | Anonymous Employee |
| 2 | i was able to do some financial planning for t... | 0 | Anonymous Employee |
| 3 | bank recognizes value of technology and ongoin... | 1 | Treasury Management Officer |
| 4 | my best reason for working there is having the... | 0 | Financial Consultant |
| 5 | they pay is decent enough i guess | 0 | Branch Banker |
| 6 | decent culture. seems to care about employees.... | 1 | Senior Manager, Product Management |
| 7 | interacting with customers, fun co workers, go... | 0 | Customer Service Representative |
| 8 | excellent, diverse financial institute. that l... | 1 | Account Executive |
| 9 | pnc is a well-established organization and ple... | 1 | Campus Recruiter |


Here, I also added a 'flag' column to the dataframe of reviews that marks whether a particular review contains the consensus top word 'employees' (top by rank as well as by frequency). A possible next step is to pull the dates of the reviews from the input HTML file. This way I could see how many times our 'top word' appeared within a specific time frame such as 2020 - 2021.
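As a rough sketch of that date extraction, the job-title regex already hints that each title is preceded by text shaped like "Nov 12, 2021 - ". Assuming that shape holds, the dates could be captured and parsed as below. The HTML fragment and the dates in it are invented for illustration, and the %b month parsing assumes an English locale.

```python
import re
from datetime import datetime

# Hypothetical fragment shaped like the text the job-title regex matches;
# the real page may use a different layout.
html = (
    '<span>Nov 12, 2021 - Branch Banker</span>'
    '<span>Mar 03, 2020 - Campus Recruiter</span>'
)

# Capture the "Mon DD, YYYY" part that precedes " - <job title>".
date_pattern = re.compile(r">(\w{3} \d{2}, \d{4}) - ")
dates = [datetime.strptime(d, "%b %d, %Y") for d in date_pattern.findall(html)]

print([d.year for d in dates])  # [2021, 2020]
```

With parsed dates in the dataframe, the flag column could then be summed per year to follow the 'top word' over time.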

Could it be the case that the top concern for employees has changed from, say, benefits to a better work-life balance since the start of the pandemic? Or is there a new theme that is yet to be uncovered?

Or maybe further analysis could be targeted at verifying whether a theme does indeed exist around, say, pay and compensation. Is that consistently appearing in the top 'x' number of comments? Are key words like 'balance', 'work-life', or 'flexibility' showing up in the rankings?

If there are indicators like 'work-life' and 'flexibility', could we use a flag to see whether the distribution of reviews with that flag is proportional across job families or job titles?
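One way such a comparison might look is a simple group-by on the job title. The sketch below reuses the flag and job title values from the table above; the column names share_flagged and n_reviews are my own.

```python
import pandas as pd

# flag and job title values copied from the table above.
df = pd.DataFrame({
    "flag": [0, 1, 0, 1, 0, 0, 1, 0, 1, 1],
    "jobtitles": [
        "Technical Writer/Editor", "Anonymous Employee", "Anonymous Employee",
        "Treasury Management Officer", "Financial Consultant", "Branch Banker",
        "Senior Manager, Product Management", "Customer Service Representative",
        "Account Executive", "Campus Recruiter",
    ],
})

# Share of flagged reviews and number of reviews per job title.
summary = (
    df.groupby("jobtitles")["flag"]
      .agg(share_flagged="mean", n_reviews="count")
      .sort_values("share_flagged", ascending=False)
)
print(summary)
```

With only one or two reviews per title the shares are not meaningful yet; the same code would become more informative on a larger set of reviews or on broader job families.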