<a href="https://colab.research.google.com/github/andrybrew/bigdatanalysis-bi/blob/master/005_text_mining_topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Mining - Topic Modeling**

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a body of text. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear more often in documents about cats, and "the" and "is" will appear about equally in both.
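
To make this intuition concrete, here is a minimal sketch that counts words in two made-up documents (they are invented for illustration and are not part of the tweet dataset). It uses only the standard library and is not a topic model, just the word-frequency signal that a topic model builds on.

```python
from collections import Counter

# Toy documents, invented for illustration only
docs = {
    "dogs": "the dog buries the bone and the dog is happy",
    "cats": "the cat says meow and the cat is on the mat",
}

# Count how often each word occurs in each document
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Topic-specific words differ between documents; function words do not
for word in ["dog", "bone", "cat", "meow", "the", "is"]:
    print(word, {name: c[word] for name, c in counts.items()})
```

Here "dog" and "bone" occur only in the first document and "cat" and "meow" only in the second, while "the" and "is" occur equally often in both.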

***Install Library, Import Libraries, and Import Modules***

In [None]:
# Install Library
! pip install pyLDAvis

In [None]:
# Import Libraries
from __future__ import print_function 
import nltk
import os
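import numpy as np
import pyLDAvis
import pyLDAvis.sklearn  # note: pyLDAvis 3.4+ renames this module to pyLDAvis.lda_model

# Render pyLDAvis output inline in the notebook
pyLDAvis.enable_notebook()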

# Import Modules
from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from matplotlib import pyplot as plt

In [None]:
# Clone the helper library and data from GitHub
! git clone https://github.com/dianrdn/tm

# Change the working directory to the cloned repository
os.chdir('tm')

***Import Data***

In [None]:
# Download the NLTK stop-word list
nltk.download('stopwords')

# Path to the preprocessed tweets
data_file = 'text_preprocessed.csv'

# Load the tweets with the helper module from the cloned repository
import MyLib as TS
Tweets = TS.LoadTxt(data_file)
print('Total loaded tweets = {0}'.format(len(Tweets)))

***Set Number of Topics, Top Topics, Top Words***

In [None]:
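n_topics = 4    # number of topics the model will fit
top_topics = 4  # number of topics to print
top_words = 8   # number of top words to print per topic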

***Feature Extraction (Bag of Words)***

In [None]:
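# Feature Extraction: build a document-term matrix of raw term counts
# (tokens are alphabetic words of at least three letters)
count_vector = CountVectorizer(lowercase=True, token_pattern=r'\b[a-zA-Z]{3,}\b')
dtm_tf = count_vector.fit_transform(Tweets)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_terms = count_vector.get_feature_names_out()
del Tweets  # the raw text is no longer needed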
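
As a rough illustration of what the `token_pattern` above keeps and drops, the sketch below applies the same regular expression to a single made-up string with the standard `re` module. The sample text is an assumption for illustration only; `CountVectorizer` lowercases the text and then collects the pattern's matches in much the same way.

```python
import re

# Same pattern as the CountVectorizer above: alphabetic words of 3+ letters
token_pattern = re.compile(r'\b[a-zA-Z]{3,}\b')

# Made-up sample text, not taken from the dataset
sample = "RT @user: AI is gr8, topic modeling in 2024 works well!"

# Lowercase first (lowercase=True), then extract the matching tokens
tokens = token_pattern.findall(sample.lower())
print(tokens)
```

Short words ("rt", "ai", "is", "in"), numbers, and mixed tokens such as "gr8" are dropped, which leaves `['user', 'topic', 'modeling', 'works', 'well']`.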

***Show Topic***

In [None]:
# Fit the LDA topic model on the document-term matrix
lda_tf = LatentDirichletAllocation(n_components=n_topics, learning_method='online', random_state=0).fit(dtm_tf)

# Assign each document to its most probable topic (1-based topic labels)
vsm_topics = lda_tf.transform(dtm_tf)
doc_topic = [a.argmax() + 1 for a in tqdm(vsm_topics)]
print('In total there are {0} major topics, distributed as follows'.format(len(set(doc_topic))))

# Histogram of documents per dominant topic
plt.hist(np.array(doc_topic), alpha=0.5)
plt.show()

# Print the top words of each topic
print('Printing top {0} Topics, with top {1} Words:'.format(top_topics, top_words))
TS.print_Topics(lda_tf, tf_terms, top_topics, top_words)
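
The helper `TS.print_Topics` comes from the cloned `MyLib` module, whose source is not shown here. As a hedged sketch of how top words can be read off a fitted model, the function below ranks terms by their weight in each topic row. The name `top_words_per_topic` and the toy weight matrix are hypothetical and stand in for `lda_tf.components_`; this is not necessarily how `MyLib` does it.

```python
import numpy as np

def top_words_per_topic(components, terms, n_words):
    """Return the n_words highest-weighted terms for each topic row."""
    topics = []
    for weights in np.asarray(components):
        top_idx = np.argsort(weights)[::-1][:n_words]  # largest weights first
        topics.append([terms[i] for i in top_idx])
    return topics

# Hypothetical 2-topic x 5-term weight matrix, standing in for lda_tf.components_
toy_components = np.array([[5.0, 4.0, 0.1, 0.2, 1.0],
                           [0.1, 0.3, 6.0, 3.0, 1.0]])
toy_terms = ["dog", "bone", "cat", "meow", "the"]

for k, words in enumerate(top_words_per_topic(toy_components, toy_terms, 3), start=1):
    print("Topic {0}: {1}".format(k, ", ".join(words)))
```

In this notebook, the analogous call would be `top_words_per_topic(lda_tf.components_, tf_terms, top_words)`.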

In [None]:
# Interactively visualize the topics (warnings can be ignored).
# Wait a few minutes, then hover the mouse over the topics to explore.
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, count_vector)