### Introduction:

Large amounts of data are produced every day, and much of this data is in fact unstructured. Hence, it is important to find mechanisms that help us organize this unstructured data automatically.

Topic modeling is one such mechanism that: 

- helps uncover hidden topics of various documents
- allocates topics to these documents
- provides a simpler way to analyze large volumes of unlabeled text 

As per the __[KDnuggets post](http://www.kdnuggets.com/2016/07/text-mining-101-topic-modeling.html)__, 'Topic modelling can be described as a method for finding a group of words (i.e topic) from a collection of documents that best represents the information in the collection.' (Please note that this post is inspired in part by the linked KDnuggets post.)

So, now we know what topic modeling is. What model can we use to find topics across documents of data? 

We will be using a model called Latent Dirichlet Allocation (LDA). Following are some of the highlights of LDA: 

- LDA is an unsupervised machine learning algorithm
- It extracts key topics from a collection of texts / documents
- These topics are represented in order of their importance / relevance to the document
- LDA describes each document based on the ordered allocation of these topics
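
To make the last two points concrete, here is a minimal sketch of what an "ordered allocation" of topics could look like for a single document. The topic labels and weights are made-up values chosen for illustration; they are not output from a real model.

```python
# Hypothetical topic weights for one document (illustrative values only)
doc_topic_weights = {"management": 0.15, "pay": 0.55, "culture": 0.30}

# Order the topics from most to least relevant for this document
ordered_allocation = sorted(
    doc_topic_weights.items(), key=lambda kv: kv[1], reverse=True
)
print(ordered_allocation)
```

In an actual LDA model the topics are unnamed groups of words rather than labels like 'pay', so naming them is a manual interpretation step.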

Other important references used throughout this notebook are:
- __[LDA explained in a video](https://www.youtube.com/watch?v=3mHy4OSyRf0)__

### How will we learn?

In this example, we will try to extract topics from company reviews data from Indeed's Company pages. 

The example we will follow is split into three parts:

- _Data Extraction_: We will extract company reviews from Indeed's website. Specifically, we will start by searching for companies that have posted 'Data Scientist' jobs in Austin, TX
- _Explore Ratings_: Once we have all the information, we will make some nice charts around ratings of these companies
- _Topic modeling_: Once we have data, we will go over the methodology to apply LDA 

### Data Extraction

I have separated the details of scraping reviews from Indeed into the indeed.py file.

Overall, the steps I have followed are:
- Look up job details for a given job query and location
- Extract the names of the companies that have posted these jobs
- Find the corresponding company pages on Indeed for these companies
- Find ratings and reviews for these companies

In [1]:
import indeed
import pandas as pd

In [2]:
#get the SERP url
js_url = indeed.jobsearch_url('Data Scientist', 'Austin,TX')

In [3]:
#Extract job details
jobs_df = indeed.get_jobs(js_url)

In [4]:
#For the companies that posted these jobs, get company ratings
ratings_df = indeed.get_comp_ratings(jobs_df)

In [5]:
#Find reviews for companies that posted these jobs
review_df = indeed.get_reviews(jobs_df)
review_df = review_df.merge(jobs_df, on = 'comp_name', how = 'right')

### Topic Modeling using LDA

If you had a chance to watch the linked YouTube video, there are three important items to consider while building an LDA model. These are:

- Setting the right window of context: Define the minimum number of words that must be present in each document. A window of around 100 - 300 words is recommended. Since we are looking at review data, which tends to be short, we will use MIN_WORDS = 20

- Including the right words in the model: We will remove any words that appear in fewer than 20 documents or in more than 10% of all our documents, remove any stop words, and reduce each remaining word to its base form by lemmatizing (and, optionally, stemming)

- Capturing the widest range of topics: We will do this by setting the total number of topics we are interested in

Following are the steps to preprocess our data before modeling:
- *Tokenize*: split sentences into individual words
- *Min freq & max freq*: remove any tokens or words that appear in fewer than 20 documents or in more than 10% of all the documents
- *Stop words*: remove stop words from these tokens
- *Convert each word to its base form*: This can be done by either 'stemming' or 'lemmatizing'. For our purposes, we will lemmatize the tokens, with stemming as an optional further step. We will examine how lemmatization and stemming work and the difference between the two
- *Create id-term dictionary*: We will later assign a unique integer id to each word / token in our corpus
- *Create BoW*: Gensim's LDA algorithm requires bag-of-words input. Hence, we will use our id-term dictionary to convert each document into the required input format

In [6]:
# words that appear in fewer than MIN_DOC_FREQ documents will be ignored
MIN_DOC_FREQ = 20

# documents with fewer than MIN_WORDS tokens are dropped
MIN_WORDS = 20

In [7]:
import datetime
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from stop_words import get_stop_words
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
import re
import string
from gensim import corpora, models, similarities
import time
import operator

In [8]:
#Work on a copy of the reviews; lowercasing is demonstrated on a sample below
reviews_lda_df = review_df.copy()

In [9]:
#Sample before changing the case
reviews_lda_df['review_text'][0]

u'Dell provides employees with learning opportunities to succeed.'

In [10]:
#Sample after changing the review to lower case
reviews_lda_df['review_text'][0].lower()

u'dell provides employees with learning opportunities to succeed.'

**Tokenization**:

Tokenization segments a document into atomic units, in our case individual words.

There are several ways to tokenize a document. For example, you can split() your document on spaces or on '.'. In our case, we will use RegexpTokenizer from nltk's tokenize module.

You can see how this works in a fun interactive way here: try '\w+' at __[http://regexr.com/](http://regexr.com/)__

In [11]:
tokenizer = RegexpTokenizer(r'\w+')

#Running it only through first review for demo
first_review = reviews_lda_df['review_text'][0]
first_review_tokens = tokenizer.tokenize(first_review)

print 'There are a total of', len(first_review_tokens), 'words in our first review. The first few words are', first_review_tokens[:10]

There are a total of 8 words in our first review. The first few words are [u'Dell', u'provides', u'employees', u'with', u'learning', u'opportunities', u'to', u'succeed']
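
As a quick cross-check that needs only the standard library, the same pattern can be tried with Python's re module. This is a small sketch rather than a replacement for RegexpTokenizer; for this sample sentence it should yield the same tokens.

```python
import re

sample = "Dell provides employees with learning opportunities to succeed."

# \w+ matches runs of letters, digits and underscores,
# so spaces and punctuation act as separators
tokens = re.findall(r"\w+", sample)
print(len(tokens), tokens[:3])
```

Note that the trailing '.' is dropped, which is why punctuation rarely needs separate handling after this kind of tokenization.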


**Stopword removal:**

Words like 'the', 'for', 'a', etc. do not add any meaningful information to our text. However, these words occur many times more often than important and valuable words such as 'management'. Since these words (also called stopwords) are so common, including them would mean that our LDA algorithm most likely returns topics dominated by words such as 'the' or 'a'.

So, it is important to exclude these words from our reviews during our data preprocessing. 

Note that your choice of stopwords also depends on the context. For example, 'and' is a typical stopword. However, let's say that 'pay and benefits' is a very commonly occurring term in our dataset. In that case, having a single topic such as 'pay and benefits' is more meaningful than two different topics like 'pay' and 'benefits'.
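
One possible way to keep a phrase such as 'pay and benefits' together is to join it into a single token before stopwords are removed. The sketch below is only an illustration of that idea: the stop word set, the phrase list and the function name are hand-picked assumptions, not part of the pipeline used in this notebook.

```python
import re

# Hypothetical, hand-picked inputs for illustration only
STOP_WORDS = {"the", "a", "and", "is", "are", "at"}
KEEP_PHRASES = ["pay and benefits"]

def tokenize_keeping_phrases(text, phrases=KEEP_PHRASES, stop_words=STOP_WORDS):
    text = text.lower()
    for phrase in phrases:
        # join the phrase with underscores so that \w+ keeps it as one token
        text = text.replace(phrase, phrase.replace(" ", "_"))
    tokens = re.findall(r"\w+", text)
    return [token for token in tokens if token not in stop_words]

result = tokenize_keeping_phrases(
    "The pay and benefits are great and the hours are flexible"
)
print(result)
```

Here the 'and' inside the phrase survives as part of 'pay_and_benefits', while the standalone 'and' is still removed as a stopword.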


In [12]:
#remove stop words
nltk_stop_wds = stopwords.words('english')
get_stop_wds = get_stop_words('en')
all_stop_words = list(set(nltk_stop_wds + get_stop_wds))
all_stop_words += ['.', '...', ',', '(', ')', ':', '`', '``', ';']
all_stop_words += ["'s", "n't"]

print(len(set(all_stop_words)))
print(all_stop_words[:10])

218
[u'all', u"she'll", u'just', u"don't", u'being', u'over', u'both', u'through', u'yourselves', u'its']


In [13]:
fst_rev_wo_stp = [token for token in first_review_tokens if token not in all_stop_words]
print(fst_rev_wo_stp[:10])

[u'Dell', u'provides', u'employees', u'learning', u'opportunities', u'succeed']


**Stemming**: 

Stemming is a mechanism in NLP that helps convert any given word to its base form. For example, 'running' and 'runs' will both be reduced to their base form, 'run'.

We will be using the Snowball stemmer in our example.

In [14]:
#Stem the first review
# Instantiate a Snowball stemmer
sb_stemmer = SnowballStemmer('english')

In [15]:
stemmed_tokens = [sb_stemmer.stem(token) for token in fst_rev_wo_stp]
print stemmed_tokens[:10]

[u'dell', u'provid', u'employe', u'learn', u'opportun', u'succeed']


**Lemmatization:**

__[These Stack Overflow answers](https://stackoverflow.com/questions/1787110/what-is-the-true-difference-between-lemmatization-vs-stemming)__ provide a good explanation of the difference between stemming and lemmatization, as follows:

'Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma'

Ex: 

- The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up

- The word "walk" is the base form for word "walking", and hence this is matched in both stemming and lemmatisation.

- The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.
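
The contrast in these three examples can be sketched with a deliberately tiny toy implementation. Both functions and the mini vocabulary below are hypothetical stand-ins written for illustration; they are not how the Snowball stemmer or the WordNet lemmatizer work internally.

```python
# Toy illustration only: real stemmers and lemmatizers are far more elaborate

def toy_stem(word):
    """Crude heuristic: chop a known suffix off the end of the word."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini vocabulary keyed by (word, part of speech)
TOY_LEMMAS = {
    ("better", "adj"): "good",
    ("walking", "verb"): "walk",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
}

def toy_lemmatize(word, pos):
    """Dictionary look-up that falls back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

examples = [("better", "adj"), ("walking", "verb"),
            ("meeting", "noun"), ("meeting", "verb")]
for word, pos in examples:
    print(word, pos, toy_stem(word), toy_lemmatize(word, pos))
```

The suffix chopper misses the link from 'better' to 'good' and always turns 'meeting' into 'meet', whereas the look-up can return a different lemma depending on the part of speech.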

In [16]:
#Lemmatize the first review
# Instantiate a Wordnet lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In [17]:
lemma_tokens = [wordnet_lemmatizer.lemmatize(token) for token in fst_rev_wo_stp]
print(lemma_tokens[:10])

[u'Dell', u'provides', u'employee', u'learning', u'opportunity', u'succeed']


### Putting this together

Before we apply our LDA model, we need to create a list of documents (a list of lists) wherein each document is tokenized, has its stop words removed, and is lemmatized. Stemming is kept in the function below as an optional, commented-out step.

Then, we will create a term-document frequency matrix from this list and pass it to our LDA algorithm.

In [18]:
def preprocess(review_df):
    #First, we will create list of documents (list of each review in our dataset)
    doc_list = [review for review in review_df['review_text'].dropna()]
    
    #lowercase all the documents
    doc_list = [doc.lower() for doc in doc_list]
    
    #tokenize every document
    tokenizer = RegexpTokenizer(r'\w+')
    doc_list = [tokenizer.tokenize(content) for content in doc_list]
    
    #remove any stop words
    nltk_stop_wds = stopwords.words('english')
    get_stop_wds = get_stop_words('en')
    all_stop_words = list(set(nltk_stop_wds + get_stop_wds))
    all_stop_words += ['.', '...', ',', '(', ')', ':', '`', '``', ';']
    all_stop_words += ["'s", "n't"]
    doc_list = [[token for token in review_doc if token not in all_stop_words] for review_doc in doc_list]

    #lemmatize words
    wordnet_lemmatizer = WordNetLemmatizer()
    doc_list = [[wordnet_lemmatizer.lemmatize(token) for token in doc] for doc in doc_list]
    
    #optionally stem words (disabled here; uncomment the two lines below to stem after lemmatizing)
#     sb_stemmer = SnowballStemmer('english')
#     doc_list = [[sb_stemmer.stem(token) for token in doc] for doc in doc_list]
    
    #keep only those documents that have at least MIN_WORDS (20) tokens
    doc_list = [doc for doc in doc_list if len(doc) >= MIN_WORDS]
    
    return doc_list

In [19]:
start = time.time()
preprocessed_reviews = preprocess(reviews_lda_df)
end = time.time()
print 'Preprocessing complete in', str(end - start), 'seconds'

Preprocessing complete in 0.991232156754 seconds


#### Id Term dictionary

As mentioned at the beginning, we will now convert our preprocessed reviews into an id-term dictionary wherein every word in our preprocessed reviews corpus is assigned a unique integer id.

In [20]:
# Now, we will create the id-term dictionary as mentioned in the steps at
# the beginning
reviews_dict = corpora.Dictionary(preprocessed_reviews)
print reviews_dict

Dictionary(3853 unique tokens: [u'gatekeeper', u'similarity', u'consolidated', u'personally', u'yellow']...)
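
Conceptually, an id-term dictionary is just a mapping from each distinct token to an integer. The following is a minimal pure-Python sketch of that idea on a made-up mini corpus; the order in which gensim hands out ids may differ from this simple first-seen scheme.

```python
# Hypothetical mini corpus of already-tokenized documents
docs = [["good", "pay", "good", "team"], ["team", "lunch"]]

token2id = {}
for doc in docs:
    for token in doc:
        # give each previously unseen token the next free integer id
        if token not in token2id:
            token2id[token] = len(token2id)

print(token2id)
```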


As mentioned at the beginning of the tutorial, we need to include the right words in the model.

One part of this is excluding tokens that appear in too few or too many documents. The call below uses no_below=30 and no_above=0.15, i.e. it drops tokens that appear in fewer than 30 documents or in more than 15% of the documents. Note that these thresholds differ slightly from the 20 documents and 10% outlined earlier.

We are working with a small corpus, so we may not want to shrink the vocabulary even further. However, let's see what impact this step has on our corpus.

In [21]:
reviews_dict.filter_extremes(no_below=30, no_above=0.15) # changes reviews_dict in place
print(reviews_dict)
print("top terms:")
print(sorted(reviews_dict.token2id.items(), key=operator.itemgetter(1), reverse = False)[:10])

Dictionary(217 unique tokens: [u'atmosphere', u'help', u'office', u'food', u'indeed']...)
top terms:
[(u'atmosphere', 0), (u'help', 1), (u'office', 2), (u'indeed', 3), (u'lack', 4), (u'find', 5), (u'young', 6), (u'competitive', 7), (u'month', 8), (u'enjoyable', 9)]
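
To see what this kind of filtering does, here is a small pure-Python sketch of a document-frequency filter on a made-up corpus. The function name and data are assumptions for illustration, and it only approximates the thresholding idea; gensim's filter_extremes has further options (such as a cap on vocabulary size) that are ignored here.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Return tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    # count each token once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

# Hypothetical mini corpus of tokenized documents
docs = [
    ["work", "pay", "team"],
    ["work", "pay"],
    ["work", "lunch"],
    ["work", "team"],
]
kept = filter_by_doc_freq(docs, no_below=2, no_above=0.5)
print(kept)
```

In this toy corpus 'work' is dropped because it appears in every document, and 'lunch' is dropped because it appears in only one.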


Clearly, we went from **3853 tokens** to **217 tokens**. That is a big difference. Let's continue with this small vocabulary and see what we get.

#### Bag of words

As mentioned at the beginning, the LDA algorithm requires a bag-of-words representation as its input.

If we represent each document in our corpus as a list of (id, count of that word in the document) pairs, we get the bag-of-words representation. __[Here's a short (2 min) clip](https://www.youtube.com/watch?v=OGK9SHt8SWg)__ explaining bag of words with an example.

We will use the doc2bow() method to convert each document in our list into its bag-of-words form.

In [22]:
reviews_corpus = [reviews_dict.doc2bow(review_doc) for review_doc in preprocessed_reviews]
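
The following pure-Python sketch shows the shape of a bag-of-words vector for one document. The mapping and the document are made up for illustration; tokens missing from the mapping (for example, because they were filtered out) are simply ignored, and the pairs are sorted by id.

```python
from collections import Counter

# Hypothetical id-term mapping and tokenized document
token2id = {"good": 0, "pay": 1, "team": 2}
doc = ["good", "team", "good", "lunch"]  # 'lunch' is not in the mapping

# count how often each known token id occurs in this document
counts = Counter(token2id[token] for token in doc if token in token2id)
bow = sorted(counts.items())
print(bow)
```

Each pair reads as (token id, number of occurrences in this document), so 'good' (id 0) appears twice and 'team' (id 2) once.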

### LDA modeling

For LDA modeling, we need to choose the number of topics before we start modeling. 

Ideally, this depends on the context. For example, if you are trying to build a topic model on documents from a news website and you know that the site mainly covers four topics (politics, sports, world news and finance), you would set the number of topics to 4.

In our case, since we do not have such context for now, let's start by looking for 4 topics.

In [23]:
start_time = time.time()
lda_model = models.LdaModel(reviews_corpus,alpha='auto', 
                                   num_topics=4, id2word = reviews_dict, 
                                   passes=20)
end_time = time.time()
print 'It took', str(end_time - start_time), 'seconds to complete LDA model'

It took 37.9830701351 seconds to complete LDA model


### Show me the topics!

Now that we have an LDA model, let's look at the top 10 words in the reviews associated with each of the 4 topics.

In [24]:
lda_model.show_topics(num_topics = 4,num_words = 10)

[(0,
  u'0.027*student + 0.026*indeed + 0.019*research + 0.018*learn + 0.017*opportunity + 0.017*experience + 0.017*make + 0.016*university + 0.016*austin + 0.015*pay'),
 (1,
  u'0.046*sale + 0.034*indeed + 0.027*year + 0.022*product + 0.016*office + 0.016*training + 0.014*call + 0.014*experience + 0.013*many + 0.013*position'),
 (2,
  u'0.029*office + 0.022*everyone + 0.021*hour + 0.016*ibm + 0.016*well + 0.016*really + 0.015*pay + 0.015*life + 0.014*training + 0.014*help'),
 (3,
  u'0.039*learned + 0.031*hardest + 0.027*enjoyable + 0.021*typical + 0.020*always + 0.018*different + 0.016*hard + 0.015*best + 0.014*high + 0.014*customer')]
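
If you want to work with these topics programmatically, the strings can be split into (word, weight) pairs. The helper below is a hypothetical convenience function, not part of gensim, and it assumes the 'weight*word + weight*word' layout shown above; some gensim versions may wrap each word in double quotes, which the sketch strips.

```python
def parse_topic(topic_string):
    """Split a '0.027*student + 0.026*indeed' style string into (word, weight) pairs."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        # strip whitespace and any surrounding double quotes from the word
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs

# First three terms of topic 0 from the output above
topic_0 = "0.027*student + 0.026*indeed + 0.019*research"
parsed = parse_topic(topic_0)
print(parsed)
```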

### Next steps:

We just saw how to build a simple LDA model on company reviews.

To get more interesting topics, we can try to:

- find topics in reviews by company
- find topics by average rating of company