#### Code source: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
#### Tutorial source: https://www.datacamp.com/tutorial/what-is-topic-modeling

### Imports for Loading and Preprocessing

In [18]:
import string   # contains a public variable with all ASCII punctuation characters
import nltk

# list of all stopwords such as 'and', 'the', 'is', etc.
nltk.download('stopwords')  

# WordNet is a lexical database of English words that groups words into sets of synonyms, while also recording semantic relationships between words such as "is-a", "part-of", and "opposite-of" relationships.
nltk.download('wordnet')    

# Open Multilingual WordNet (omw) links hand created wordnets and automatically created wordnets for different languages.
nltk.download('omw-1.4')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer 

# Maps each unique token in the corpus to an integer ID. The dictionary can then be used to build a term-document matrix.
from gensim.corpora import Dictionary


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### Loading Data

In [9]:
# Sample docs
doc_1 = "A whopping 96.5 percent of water on Earth is in our oceans, covering 71 percent of the surface of our planet. And at any given time, about 0.001 percent is floating above us in the atmosphere. If all of that water fell as rain at once, the whole planet would get about 1 inch of rain."
doc_2 = "One-third of your life is spent sleeping. Sleeping 7-9 hours each night should help your body heal itself, activate the immune system, and give your heart a break. Beyond that--sleep experts are still trying to learn more about what happens once we fall asleep."
doc_3 = "A newborn baby is 78 percent water. Adults are 55-60 percent water. Water is involved in just about everything our body does."
doc_4 = "While still in high school, a student went 264.4 hours without sleep, for which he won first place in the 10th Annual Great San Diego Science Fair in 1964."
doc_5 = "We experience water in all three states: solid ice, liquid water, and gas water vapor."

# Corpus, which is simply a collection of documents
corpus = [doc_1, doc_2, doc_3, doc_4, doc_5]

### Data preprocessing

In [16]:
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)
lemmatizer = WordNetLemmatizer()

def clean(doc):
    lower_case_words = doc.lower().split()

    # drop stopwords (tokens still carry punctuation here, so "once," is not matched against "once")
    stop_free = " ".join(word for word in lower_case_words if word not in stop_words)
    # drop punctuation characters (this also removes the "." inside numbers such as "96.5")
    punc_free = "".join(ch for ch in stop_free if ch not in punctuation)
    # lemmatize each word; with no POS tag given, WordNetLemmatizer treats every word as a noun
    lemmatized = " ".join(lemmatizer.lemmatize(word) for word in punc_free.split())

    return lemmatized

clean_corpus = [clean(doc).split() for doc in corpus]

# Print doc_1 after cleaning
print(clean_corpus[0])

['whopping', '965', 'percent', 'water', 'earth', 'ocean', 'covering', '71', 'percent', 'surface', 'planet', 'given', 'time', '0001', 'percent', 'floating', 'u', 'atmosphere', 'water', 'fell', 'rain', 'once', 'whole', 'planet', 'would', 'get', '1', 'inch', 'rain']
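The output above shows two side effects of this pipeline: removing punctuation character by character turns "96.5" into "965" and "0.001" into "0001", and because stopwords are filtered before punctuation is removed, "once," survives even though "once" is a stopword. The sketch below uses only the standard library to illustrate one possible alternative ordering, stripping punctuation from the edges of each token before the stopword check. The function `strip_edges` and the small `demo_stop_words` set are hypothetical stand-ins rather than part of this notebook, and lemmatization is left out.

```python
import string

# Hypothetical, tiny stand-in for the NLTK stopword list used in the notebook
demo_stop_words = {"about", "of", "as", "at", "once", "the"}

def strip_edges(text, stop_words):
    """Lowercase, strip punctuation from token edges, then drop stopwords."""
    tokens = []
    for raw in text.lower().split():
        # Only leading/trailing punctuation is removed, so "96.5" keeps its decimal point
        token = raw.strip(string.punctuation)
        if token and token not in stop_words:
            tokens.append(token)
    return tokens

sample = "About 96.5 percent of water fell as rain at once, the report said."
tokens = strip_edges(sample, demo_stop_words)
print(tokens)
```

Whether keeping numbers intact is desirable depends on the task; for topic modeling it may be just as reasonable to drop numeric tokens entirely.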


We now need to convert this corpus (currently just a list of token lists) into a bag-of-words representation: a list with one entry per document, where each entry is a list of (token_id, count) tuples for the words that occur in that document.

In [20]:
dictionary = Dictionary(clean_corpus)       # creates a dictionary mapping all words in the corpus to integers
doc_term_matrix = [dictionary.doc2bow(doc) for doc in clean_corpus]

print(doc_term_matrix[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 3), (15, 2), (16, 2), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1)]
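To make the `(id, count)` pairs in the output above more concrete, here is a minimal standard-library sketch of the same idea. The helpers `build_vocab` and `to_bow` are hypothetical names for illustration; they mimic the concept behind `Dictionary` and `doc2bow` but are not guaranteed to assign the same integer IDs that gensim does.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer ID to every distinct token (sorted for reproducibility)."""
    vocab = sorted({token for doc in tokenized_docs for token in doc})
    return {token: idx for idx, token in enumerate(vocab)}

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

# Toy documents, already tokenized
docs = [["water", "percent", "water"], ["sleep", "hour"]]
token2id = build_vocab(docs)
bows = [to_bow(doc, token2id) for doc in docs]

print(token2id)
print(bows)
```

Each document only lists the tokens it actually contains, which is why the representation stays compact even when the vocabulary is large.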


### Modeling

In [40]:
from pprint import pprint
NUM_TOPICS = 3

First, we try Latent Semantic Analysis (LSA). Gensim calls the class LsiModel because LSA is also known as Latent Semantic Indexing. In simplified terms, LSA performs a truncated Singular Value Decomposition on the term-document matrix (dimensions: number of words in vocab x number of documents) and keeps only the largest singular values.

This reduces the number of rows in the matrix from vocab_size to num_topics while preserving as much of the variation across the document columns as possible, which gives a simpler representation of the text data.
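The row reduction described above can be sketched directly with NumPy. This is only an illustration of the underlying linear algebra on a made-up term-document matrix, not a reproduction of how gensim's `LsiModel` computes its decomposition internally; the counts and the variable names are assumptions for the example.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents
# (hypothetical counts, not taken from the corpus above)
term_doc = np.array([
    [2.0, 0.0, 1.0],   # "water"
    [1.0, 0.0, 1.0],   # "percent"
    [0.0, 2.0, 0.0],   # "sleep"
    [0.0, 1.0, 0.0],   # "hour"
])

k = 2  # number of topics to keep

# Thin SVD: term_doc = U @ diag(S) @ Vt, with singular values in descending order
U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)

# Keep the k largest singular values and their vectors
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# Each document is now a k-dimensional column instead of a vocab_size-dimensional one
doc_topics = np.diag(S_k) @ Vt_k
print(doc_topics.shape)

# Rank-k approximation of the original matrix and its reconstruction error
approx = U_k @ np.diag(S_k) @ Vt_k
error = np.linalg.norm(term_doc - approx)
print(error)
```

The columns of `U_k` play the role of topics (weights over terms). The overall sign of each singular vector pair is arbitrary, so a topic whose weights are all negative is equivalent to the same topic with all-positive weights.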

In [41]:
from gensim.models import LsiModel

lsa_model = LsiModel(doc_term_matrix, num_topics=NUM_TOPICS, id2word = dictionary)     # the argument id2word is used to map the integer IDs back to words when we print the topics
pprint(lsa_model.print_topics(num_topics=NUM_TOPICS, num_words=3))                     # the argument num_words limits the number of words to display for each topic

[(0, '0.555*"water" + 0.489*"percent" + 0.239*"planet"'),
 (1, '-0.361*"sleeping" + -0.215*"still" + -0.215*"hour"'),
 (2, '0.562*"water" + -0.231*"planet" + -0.231*"rain"')]




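Next we try Latent Dirichlet Allocation (LDA), a generative probabilistic model that treats each document as a mixture of topics and each topic as a probability distribution over words.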

In [42]:
from gensim.models import LdaModel

lda_model = LdaModel(doc_term_matrix, num_topics=NUM_TOPICS, id2word = dictionary)     # no random_state is set, so the topics can differ between runs
pprint(lda_model.print_topics(num_topics=NUM_TOPICS, num_words=3))

[(0, '0.028*"hour" + 0.028*"still" + 0.028*"10th"'),
 (1, '0.102*"water" + 0.065*"percent" + 0.028*"planet"'),
 (2, '0.042*"sleeping" + 0.024*"body" + 0.024*"still"')]
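As a rough illustration of the generative story that LDA assumes, the standard-library sketch below draws a per-document topic mixture from a Dirichlet distribution, then picks a topic and a word for each position in the document. Fitting an LDA model runs this story in reverse, inferring topics and mixtures from the observed words. The topics, probabilities, and function names here are hypothetical, and this is not how gensim's `LdaModel` is implemented.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document under the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)   # per-document topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to the mixture
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word from the chosen topic's word distribution
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each maps words to probabilities that sum to 1
toy_topics = [
    {"water": 0.6, "percent": 0.3, "planet": 0.1},
    {"sleep": 0.5, "hour": 0.3, "body": 0.2},
]

rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

The weights printed by `print_topics` for an LDA model correspond to the per-topic word probabilities in this picture, which is why they are all non-negative, unlike the LSA weights.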


Observations: 
- The LSA model finds two similar topics (0 and 2) in which "water" is the most prominent word.
- LDA collects all the facts about water under a single topic led by "water" and "percent", which is reasonable.
- Both models give weight to "still", even though it carries little meaning. It appears in both sleep documents (doc_2 and doc_4), and in a corpus this small a word that occurs in just two documents is enough to influence a topic.
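The last observation can be checked by looking at which documents contain "still". The sketch below does this with the standard library on one excerpt per sample document (quoted from the corpus above); the helper `docs_containing` is a hypothetical name, and its tokenization is deliberately simpler than the `clean` function.

```python
import string

def docs_containing(docs, word):
    """Return indices of documents in which `word` appears as a token."""
    hits = []
    for idx, doc in enumerate(docs):
        tokens = {tok.strip(string.punctuation) for tok in doc.lower().split()}
        if word in tokens:
            hits.append(idx)
    return hits

# One excerpt from each of doc_1 ... doc_5
excerpts = [
    "If all of that water fell as rain at once, the whole planet would get about 1 inch of rain.",
    "Beyond that--sleep experts are still trying to learn more about what happens once we fall asleep.",
    "Water is involved in just about everything our body does.",
    "While still in high school, a student went 264.4 hours without sleep",
    "We experience water in all three states: solid ice, liquid water, and gas water vapor.",
]

still_docs = docs_containing(excerpts, "still")
water_docs = docs_containing(excerpts, "water")
print(still_docs, water_docs)
```

Since "still" occurs only in the two sleep-related documents, the models have no way to tell it apart from a genuinely topical word such as "hour" without more data or a larger stopword list.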

### Visualization

Based on the pyLDAvis docs and https://github.com/bmabey/pyLDAvis/blob/master/pyLDAvis/gensim_models.py, gensim_models does not appear to support gensim's LSA model. Look into creating a PR in the future.

In [35]:
import pyLDAvis
import pyLDAvis.gensim_models as gensim_vis

# For visualizing the topics in a Jupyter notebook
pyLDAvis.enable_notebook()

# Base name of the saved visualization (the .html extension is added when saving)
LDAvis_filepath = 'topics_modeling_basics_LDA_' + str(NUM_TOPICS)
# LSAvis_filepath = 'topics_modeling_basics_LSA_' + str(NUM_TOPICS)

LDAvis_prepared = gensim_vis.prepare(lda_model, doc_term_matrix, dictionary)
# LSAvis_prepared = gensim_vis.prepare(lsa_model, doc_term_matrix, dictionary)   # LSA model does not seem to be supported

# pyLDAvis.save_html(LSAvis_prepared, LSAvis_filepath + '.html')
pyLDAvis.save_html(LDAvis_prepared, LDAvis_filepath + '.html')

In [36]:
LDAvis_prepared