# Introduction

This document examines text mining in Python using two major approaches: Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Both require three elements: 1) a corpus (a set of n documents), 2) a vocabulary (a set of m words), and 3) a matrix of size n * m, known as the term-document frequency matrix, which records how often each word occurs in each document. Here, the corpus is the set of Frozen reviews. From it, we derive the vocabulary by identifying the distinct words in the corpus. Finally, from the documents in the corpus and the vocabulary, we create the term-document matrix.
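
To make the three elements concrete, here is a minimal sketch in plain Python. The toy corpus and the names `toy_corpus`, `vocabulary`, and `matrix` are assumptions for illustration only; they are not part of the Frozen dataset or of the analysis that follows.

```python
from collections import Counter

# Toy corpus of n = 3 documents (hypothetical text, not the Frozen reviews).
toy_corpus = [
    "frozen is a great movie",
    "great songs and great animation",
    "the animation is beautiful",
]

# Tokenize by whitespace; real text would also need punctuation handling.
tokenized = [doc.split() for doc in toy_corpus]

# Vocabulary: the m distinct words, sorted so column order is reproducible.
vocabulary = sorted({word for doc in tokenized for word in doc})

# n x m matrix: entry [i][j] counts how often word j occurs in document i.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row corresponds to one document and each column to one vocabulary word, which is the n * m layout described above.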

# Dataset

We have a dataset of reviews for the movie "Frozen". It contains 736 reviews scraped from various websites such as IMDB and Rotten Tomatoes.

In [None]:
## import the libraries
import pandas as pd
import numpy as np
import csv 
from sklearn.feature_extraction.text import CountVectorizer
import nltk
# nltk.download()
# Uncomment the line above to download the NLTK data (e.g. stop words, WordNet) to the local machine

We load the dataset into a DataFrame; its `Text` column holds the 736 reviews.

In [2]:
df=pd.read_sas("C:/Users/namhpham/Documents/Personal files/R workspace/frozentxt.sas7bdat",encoding="utf-8")

In [3]:
df.head()

Unnamed: 0,id,Text
0,1.0,When people speak of their favorite Disney mov...
1,2.0,A lot of people criticize Frozen for what it i...
2,3.0,"This is a huge movie, seriously huge. You can ..."
3,4.0,Frozen is a legitimately great film but also a...
4,5.0,The last time Disney adapted a Hans Christian ...


In [105]:
df.shape

(736, 2)

We then join the reviews into a single string and split it on whitespace to get a list of words.

In [156]:
# Convert the column of reviews into one string
corpus=df.iloc[:,1]
doc_complete_string=''.join("'"+ w +"'," for w in corpus)
# Split the string on whitespace into a list of words
doc_complete=doc_complete_string.split()

We then count how often each of the first 50 words occurs within those 50 words.

In [248]:
wordfreq = [doc_complete[:50].count(w) for w in doc_complete[:50]]
zipped=list(zip(doc_complete[:50],wordfreq))

print ("Frequency: \n" + str(sorted(zipped,key=lambda x: x[1],reverse=True)))

Frequency: 
[('of', 4), ('of', 4), ('of', 4), ('of', 4), ('the', 3), ('the', 3), ('the', 3), ('Disney', 2), ('and', 2), ('to', 2), ('Disney', 2), ('some', 2), ('and', 2), ('some', 2), ('to', 2), ("'When", 1), ('people', 1), ('speak', 1), ('their', 1), ('favorite', 1), ('movies,', 1), ('big', 1), ('four', 1), ('Renaissance', 1), ('films', 1), ('Golden', 1), ('Age', 1), ('animation', 1), ('are', 1), ('likely', 1), ('be', 1), ('mentioned.', 1), ('The', 1), ('past', 1), ('decade', 1), ('has', 1), ('seen', 1), ('movies', 1), ('that', 1), ('were', 1), ('hit', 1), ('or', 1), ('miss.', 1), ('Some', 1), ('considered', 1), ('classics,', 1), ('forgotten', 1), ('close', 1), ('being', 1), ('classics', 1)]
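
The list above repeats a pair for every occurrence of a word. As a hedged alternative, `collections.Counter` reports each distinct word once. The `tokens` list below is copied by hand from the start of the output above and is only a stand-in for `doc_complete[:50]`.

```python
from collections import Counter

# First few tokens from the output above (copied by hand for illustration).
tokens = ["'When", "people", "speak", "of", "their", "favorite", "Disney",
          "movies,", "the", "big", "four", "of", "the", "Renaissance"]

# Counter tallies each distinct token once, so no duplicates appear.
freq = Counter(tokens)

# most_common sorts by count, highest first.
top = freq.most_common(3)
print(top)
```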


We can see that 'of' appears most often. However, this word gives us no insight into the topic, and the same applies to 'the' and 'to'. These are stop words. Our next step is therefore to remove punctuation, special characters, and stop words (e.g. a, the, as), and to lemmatize what remains. The result is the list of words that makes up our vocabulary.

In [184]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]

Comparing the two versions shows that stop words and punctuation have been removed and the remaining words lowercased and lemmatized. A removed stop word leaves an empty list behind.

In [142]:
print(doc_complete[:50])

["'When", 'people', 'speak', 'of', 'their', 'favorite', 'Disney', 'movies,', 'the', 'big', 'four', 'of', 'the', 'Renaissance', 'and', 'films', 'of', 'the', 'Golden', 'Age', 'of', 'animation', 'are', 'likely', 'to', 'be', 'mentioned.', 'The', 'past', 'decade', 'has', 'seen', 'Disney', 'movies', 'that', 'were', 'hit', 'or', 'miss.', 'Some', 'considered', 'classics,', 'some', 'forgotten', 'and', 'some', 'close', 'to', 'being', 'classics']


In [143]:
print(doc_clean[:50])

[['when'], ['people'], ['speak'], [], [], ['favorite'], ['disney'], ['movie'], [], ['big'], ['four'], [], [], ['renaissance'], [], ['film'], [], [], ['golden'], ['age'], [], ['animation'], [], ['likely'], [], [], ['mentioned'], [], ['past'], ['decade'], [], ['seen'], ['disney'], ['movie'], [], [], ['hit'], [], ['miss'], [], ['considered'], ['classic'], [], ['forgotten'], [], [], ['close'], [], [], ['classic']]
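
In the output above every entry holds at most one token, because `doc_complete` is a list of words rather than a list of reviews. If the intent is to treat each review as one document, a possible sketch is shown below. It is not the notebook's method: the `STOP_WORDS` set is a small hand-picked assumption standing in for NLTK's list, the two reviews are hypothetical, lemmatization is omitted, and punctuation is stripped before stop words are filtered.

```python
import string

# Small hand-picked stop-word set; an assumption standing in for NLTK's list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "it", "for", "what"}
PUNCTUATION = set(string.punctuation)

def clean_review(review):
    """Lowercase, strip punctuation, and drop stop words from one review."""
    no_punct = "".join(ch for ch in review.lower() if ch not in PUNCTUATION)
    return [word for word in no_punct.split() if word not in STOP_WORDS]

# Two hypothetical reviews standing in for rows of the Text column.
reviews = [
    "Frozen is a legitimately great film.",
    "A lot of people criticize Frozen for what it is.",
]

# One token list per review, so each review stays a single document.
docs_clean = [clean_review(r) for r in reviews]
print(docs_clean)
```

With this layout, word co-occurrence within a review is preserved, which is the signal that topic models rely on.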


We then build the term dictionary and the bag-of-words corpus with the `gensim` library.

In [150]:
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]


In [149]:
print(dictionary)

Dictionary(12361 unique tokens: ['when', 'people', 'speak', 'favorite', 'disney']...)
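
To illustrate what the dictionary and the bag-of-words conversion produce, here is a plain-Python sketch of the idea. It mimics the shape of the result (a list of `(token_id, count)` pairs per document) but is not gensim's implementation; `docs`, `token2id`, and `to_bow` are hypothetical names.

```python
from collections import Counter

# Hypothetical cleaned documents, each a list of tokens.
docs = [["disney", "movie", "classic"], ["movie", "movie", "song"]]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

Only non-zero counts are stored, so each document is a sparse row of the term-document matrix.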


From here, LSA and LDA go their separate ways. We start our analysis with LSA.

# Building a Latent Semantic Analysis model


"Latent" means hidden or concealed. "Semantic" refers to meaning. So LSA is the analysis of the hidden meaning of words. The method aims to detect that meaning from how words occur across a collection of documents. For example, the word "bank" might refer to a financial institution or to the edge of a stream, and the surrounding words indicate which sense is meant.

Latent Semantic Analysis arose from the problem of how to find relevant documents from search words. The fundamental difficulty arises when we compare words to find relevant documents, because what we really want to do is compare the meanings or concepts behind the words. LSA attempts to solve this problem by mapping both words and documents into a “concept” space and doing the comparison in this space.

This is similar to exploratory factor analysis, where factors are searched for within a sample dataset of variables. The built-in function in the gensim library takes care of building the model. Here we choose 25 topics and display 5 words per topic for easier interpretation.

In [137]:
lsi = gensim.models.lsimodel.LsiModel(corpus=doc_term_matrix, id2word=dictionary, num_topics=25)

In [253]:
lsi.print_topics(num_topics=25, num_words=5)

[(0,
  '-1.000*"movie" + -0.000*"almost" + -0.000*"build" + 0.000*"quite" + 0.000*"im"'),
 (1,
  '-1.000*"disney" + -0.000*"old" + 0.000*"beast" + 0.000*"tale" + 0.000*"look"'),
 (2,
  '-1.000*"film" + 0.000*"new" + 0.000*"interesting" + 0.000*"fan" + -0.000*"find"'),
 (3,
  '1.000*"elsa" + 0.001*"anna" + 0.001*"pretty" + 0.000*"elsas" + 0.000*"still"'),
 (4,
  '0.950*"anna" + -0.314*"character" + -0.001*"elsa" + -0.000*"isnt" + 0.000*"young"'),
 (5,
  '0.950*"character" + 0.314*"anna" + 0.000*"right" + -0.000*"king" + 0.000*"kristen"'),
 (6,
  '1.000*"it" + 0.001*"never" + -0.000*"voice" + -0.000*"help" + 0.000*"know"'),
 (7,
  '-1.000*"frozen" + -0.001*"love" + 0.001*"old" + -0.001*"bad" + -0.000*"look"'),
 (8,
  '-1.000*"like" + 0.002*"love" + -0.001*"watch" + 0.001*"lot" + -0.001*"king"'),
 (9,
  '-1.000*"love" + -0.002*"like" + 0.001*"frozen" + -0.001*"guy" + 0.001*"all"'),
 (10,
  '-1.000*"one" + 0.001*"new" + 0.001*"look" + 0.001*"something" + 0.001*"whole"'),
 (11,
  '-1.000*"s

As is standard for LSA, this process derives the underlying concepts from a singular value decomposition of the term-document frequency matrix. We can choose how many topics to display and how many words to show per topic for easier understanding. For example, topic 20 indicates that the movie has a good soundtrack, and topic 23 reflects reviewers' appreciation of the animation.
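
The role of the singular value decomposition can be sketched with NumPy on a toy matrix. The counts below are made up, and `k`, `doc_coords`, and `term_loadings` are illustrative names rather than gensim attributes; this shows the underlying linear algebra, not how `LsiModel` is implemented internally.

```python
import numpy as np

# Toy document-term count matrix: 4 documents (rows) x 5 terms (columns).
# The values are made up for illustration.
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 0],
], dtype=float)

# SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values to get a k-dimensional concept space.
k = 2
doc_coords = U[:, :k] * s[:k]   # each document as a point in concept space
term_loadings = Vt[:k, :]       # each concept as signed weights over terms

# Rank-k reconstruction approximates the original matrix.
X_k = doc_coords @ term_loadings
print(np.round(term_loadings, 3))
print(np.round(np.linalg.norm(X - X_k), 3))
```

Each row of `term_loadings` has unit length and may contain negative entries, which is why the LSA topics printed above carry signed weights.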

# Building a Latent Dirichlet Allocation model

To begin the comparison, let’s start with the motivation behind LDA. We begin with an assumption: all of the words within each Frozen review are exchangeable, meaning their order does not matter. Under this assumption, we use a bag-of-words approach: for each of the M documents (i.e. reviews), we choose a topic (z) and then choose N vocabulary words, each selected independently from a multinomial distribution conditioned on the topic z. Because each document has exactly one topic, this model is referred to as a mixture of unigrams. To allow a document to have multiple topics, one option is the probabilistic latent semantic indexing (pLSI) model.

Under the pLSI model, we choose a topic (z) for each word of each document, now selected from a multinomial distribution conditioned on the specific document. With a topic (z) chosen, we then choose a vocabulary word from a multinomial distribution conditioned on that topic. Because z is drawn from a distribution conditioned on a specific document in the corpus, the number of parameters in the pLSI model grows linearly with the number of documents, which works against the goal of reduction.

As such, LDA was born as a hierarchical probabilistic generative approach that models a corpus by topics, using probability distributions over the vocabulary. Given a finite vocabulary from a corpus, a number of topics (K), corpus-level smoothing parameters (α and β) that adjust the fit of the model, and a prior distribution over document lengths, LDA describes how random documents whose contents are a mixture of topics would be generated. Fitting the model reverses this process: from the observed words, it infers each document's mixture over topics and each topic's distribution over words. This tells us which documents are most relevant to which topics, a slight spin on the approach of LSA.
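
The generative story described above can be sketched with NumPy. This is a simplified illustration, not gensim's training code: the sizes `K`, `V`, and `N` are arbitrary, the document length is fixed rather than drawn from a prior, and words are represented by integer ids instead of real vocabulary entries.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N = 3, 8, 10        # topics, vocabulary size, words per document
alpha = np.full(K, 0.5)   # Dirichlet prior over topics in a document
beta = np.full(V, 0.5)    # Dirichlet prior over words in a topic

# Each topic is a probability distribution over the V vocabulary ids.
topic_word = rng.dirichlet(beta, size=K)   # shape (K, V)

def generate_document():
    """Sample one document as a list of word ids under the LDA process."""
    theta = rng.dirichlet(alpha)            # this document's topic mixture
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)          # pick a topic for this word
        w = rng.choice(V, p=topic_word[z])  # pick a word from that topic
        words.append(int(w))
    return theta, words

theta, words = generate_document()
print(np.round(theta, 3), words)
```

Unlike the mixture of unigrams, a topic is drawn for every word, and unlike pLSI, the per-document mixture `theta` is itself drawn from a shared Dirichlet prior, so the number of model parameters does not grow with the number of documents.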

In [125]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=25, id2word = dictionary)

In [136]:
ldamodel.print_topics(num_topics=25, num_words=6)

[(0,
  '0.113*"good" + 0.112*"really" + 0.067*"queen" + 0.026*"build" + 0.023*"shes" + 0.023*"anything"'),
 (1,
  '0.043*"child" + 0.036*"since" + 0.034*"day" + 0.031*"different" + 0.027*"twist" + 0.027*"doesnt"'),
 (2,
  '0.183*"one" + 0.104*"animation" + 0.097*"go" + 0.047*"loved" + 0.042*"didnt" + 0.041*"bit"'),
 (3,
  '0.067*"way" + 0.052*"made" + 0.040*"show" + 0.035*"voice" + 0.029*"without" + 0.027*"far"'),
 (4,
  '0.088*"power" + 0.070*"watch" + 0.034*"back" + 0.027*"everything" + 0.024*"kingdom" + 0.023*"sven"'),
 (5,
  '0.210*"it" + 0.065*"think" + 0.061*"let" + 0.046*"im" + 0.040*"kristoff" + 0.029*"right"'),
 (6,
  '0.071*"even" + 0.069*"first" + 0.037*"funny" + 0.035*"still" + 0.033*"find" + 0.026*"point"'),
 (7,
  '0.216*"frozen" + 0.196*"love" + 0.077*"much" + 0.061*"little" + 0.044*"want" + 0.029*"look"'),
 (8,
  '0.086*"snowman" + 0.068*"get" + 0.063*"classic" + 0.052*"han" + 0.029*"feature" + 0.024*"score"'),
 (9,
  '0.116*"sister" + 0.070*"kid" + 0.047*"seen" + 0.038

As with LSA, the LDA result lists each topic together with its top keywords. For example, topic 9 can be interpreted as a recommendation of Frozen as a good cartoon for adults. Topic 13 concerns the lyrics of the main soundtrack. Topic 21 commends the songs in the movie.

# Conclusion

LSA calculates term/topic and document/topic matrices with associated loadings (between -1 and 1) to illustrate the shared semantic vector space. LDA calculates the same matrices but populates them with something different: probabilities (ranging from 0 to 1) instead of loadings (analogous to correlations).

Furthermore, while both techniques deliver reduction in the form of topics, LSA relies on a factor-analysis-like approach that seeks to uncover latent structures in the corpus by leveraging linear algebra (singular value decomposition). LDA, on the other hand, takes a Bayesian approach that starts with a potential structure and attempts to see which words stick.

No single method always works better than the other. We therefore need to return to the problem definition at the outset, decide what we really need to achieve, and then use these techniques to tell a compelling story.