# Topic Modeling 

Topic modeling is a statistical technique for discovering the abstract "topics" that occur in a collection of documents.  
It is commonly applied to text documents, and it is now an emerging research area in social media analysis.  
One of the most popular algorithms is Latent Dirichlet Allocation (LDA), proposed by  
[David Blei et al. in 2003](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf).   
Here, I perform topic modeling on the upvoted Kaggle dataset. 

Some notes on topic modeling:   
* To determine the number of topics, it is common to use the [elbow method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) with the [perplexity score](http://qpleple.com/perplexity-to-evaluate-topic-models/) as its cost function.   
* To evaluate the models, we can calculate [topic coherence](http://qpleple.com/topic-coherence-to-evaluate-topic-models/).   
* Finally, to interpret the topics, the [triangulation method](http://www.federica.eu/users/9/docs/amaturo-39571-01-Triangulation.pdf) from social science research can be used.  
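
To make the elbow idea concrete, here is a minimal plain-Python sketch. It uses one common heuristic (pick the point farthest from the straight line joining the two ends of the cost curve), which is only one of several ways to locate an elbow. The function name `elbow_point` and the perplexity values are hypothetical; the numbers are invented purely to exercise the function and are not results from this dataset.

```python
def elbow_point(ks, costs):
    """Return the k whose (k, cost) point lies farthest from the straight
    line joining the first and last points of the cost curve."""
    x0, y0 = ks[0], costs[0]
    x1, y1 = ks[-1], costs[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5

    def distance(x, y):
        # perpendicular distance from (x, y) to the end-to-end line
        return abs(dy * (x - x0) - dx * (y - y0)) / norm

    return max(zip(ks, costs), key=lambda point: distance(*point))[0]


# Hypothetical perplexity values (made up, NOT measured on the Kaggle data).
candidate_ks = [5, 10, 15, 20, 25, 30]
made_up_perplexity = [900.0, 600.0, 450.0, 420.0, 405.0, 400.0]

best_k = elbow_point(candidate_ks, made_up_perplexity)
print(best_k)  # -> 15
```

In practice the costs would come from training one model per candidate number of topics. Note that this distance depends on how the two axes are scaled, so the result is a guide rather than a definitive answer.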

## Import libraries

I used the LDA model from gensim. Another option is scikit-learn.

In [5]:
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
import pandas as pd
import gensim
import pyLDAvis.gensim  # recent pyLDAvis releases expose this module as pyLDAvis.gensim_models


## Initializing Tokenizer and Lemmatizer

Initialize the tokenizer, stop-word list, and lemmatizer from the libraries.

* The tokenizer splits sentences into words.  
* The lemmatizer (similar to a stemmer) reduces words to their base form.   
The difference is that a lemmatizer uses vocabulary and word meaning to return a valid dictionary form, while a stemmer only strips affixes by rule. 


In [6]:
pattern = r'\b[^\d\W]+\b'
tokenizer = RegexpTokenizer(pattern)
en_stop = get_stop_words('en')
lemmatizer = WordNetLemmatizer()
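
To see what the pattern above keeps and drops, here is a small sketch using only the standard `re` module. It assumes that `RegexpTokenizer` with its default settings returns the pattern's matches much like `re.findall` does; the sample sentence is made up for illustration.

```python
import re

# Same pattern as above: runs of word characters that contain no digits,
# bounded on both sides by word boundaries.
pattern = r'\b[^\d\W]+\b'

sample = "Fraud in 2013: 492 frauds out of 284,807 transactions (v1 excluded)"
tokens = re.findall(pattern, sample.lower())
print(tokens)
# -> ['fraud', 'in', 'frauds', 'out', 'of', 'transactions', 'excluded']
```

Pure numbers are dropped, and so are mixed tokens such as `v1`, because no word boundary separates the letter from the digit.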

In [7]:
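# custom words to remove in addition to the standard stop words
# note: these are matched before lemmatization, so inflected forms
# such as 'acknowledgements' are not caught by this list
remove_words = ['data','dataset','datasets','content','context','acknowledgement','inspiration']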

## Read the data

In [10]:
# Input from csv
df = pd.read_csv('voted-kaggle-dataset.csv')

# sample data
print(df['Description'].head(5))

0    The datasets contains transactions made by cre...
1    The ultimate Soccer database for data analysis...
2    Background\nWhat can we say about the success ...
3    This dataset is simulated\nWhy are our best an...
4    Context\nInformation on more than 170,000 Terr...
Name: Description, dtype: object


## Perform Tokenization, Word Removal, and Lemmatization

In [12]:
# list of tokenized documents
texts = []

# loop through the descriptions
for description in df['Description']:
    # lowercase and tokenize the document string
    # (str() guards against non-string values such as NaN)
    raw = str(description).lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]
    
    # remove the custom words listed in remove_words
    stopped_tokens_new = [token for token in stopped_tokens if token not in remove_words]
    
    # lemmatize tokens
    lemma_tokens = [lemmatizer.lemmatize(token) for token in stopped_tokens_new]
    
    # remove tokens consisting of a single character
    new_lemma_tokens = [token for token in lemma_tokens if len(token) != 1]
    
    # add tokens to list
    texts.append(new_lemma_tokens)

# sample data
print(texts[0])

['contains', 'transaction', 'made', 'credit', 'card', 'september', 'european', 'cardholder', 'present', 'transaction', 'occurred', 'two', 'day', 'fraud', 'transaction', 'highly', 'unbalanced', 'positive', 'class', 'fraud', 'account', 'transaction', 'contains', 'numerical', 'input', 'variable', 'result', 'pca', 'transformation', 'unfortunately', 'due', 'confidentiality', 'issue', 'provide', 'original', 'feature', 'background', 'information', 'feature', 'principal', 'component', 'obtained', 'pca', 'feature', 'transformed', 'pca', 'time', 'amount', 'feature', 'time', 'contains', 'second', 'elapsed', 'transaction', 'first', 'transaction', 'feature', 'amount', 'transaction', 'amount', 'feature', 'can', 'used', 'example', 'dependant', 'cost', 'senstive', 'learning', 'feature', 'class', 'response', 'variable', 'take', 'value', 'case', 'fraud', 'otherwise', 'given', 'class', 'imbalance', 'ratio', 'recommend', 'measuring', 'accuracy', 'using', 'area', 'precision', 'recall', 'curve', 'auprc', 'c

## Create term dictionary and document-term matrix

In [13]:
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
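
To illustrate what the dictionary and the bag-of-words corpus contain, here is a plain-Python sketch of the same idea. It is not gensim's implementation: the helper names `build_dictionary` and `doc_to_bow` are hypothetical, the toy documents are made up, and the ids gensim assigns to tokens may differ from the first-seen order used here.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Toy documents, invented for illustration only.
toy_texts = [["fraud", "transaction", "fraud"], ["transaction", "card"]]
token2id = build_dictionary(toy_texts)
toy_corpus = [doc_to_bow(doc, token2id) for doc in toy_texts]

print(token2id)    # -> {'fraud': 0, 'transaction': 1, 'card': 2}
print(toy_corpus)  # -> [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document becomes a sparse list of (id, count) pairs, so word order is discarded and only the frequencies remain.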

## Generate LDA model

I used a predetermined number of topics (15). A better approach would be to calculate perplexity for several candidate values and use it to find the optimum number of topics.    
*top_topics* returns the topics sorted by topic coherence.
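
As a rough illustration of what a coherence score measures, here is a plain-Python sketch of a UMass-style coherence based on document co-occurrence counts. It is a simplified sketch, not gensim's implementation, so its values are not expected to match the scores printed below. The function name `umass_coherence` and the toy documents are hypothetical, and the sketch assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Sum, over ordered pairs of top words, of
    log((docs containing both words + 1) / docs containing the earlier word)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # number of documents that contain all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score


# Toy tokenized documents, invented for illustration only.
toy_docs = [
    ["tweet", "twitter", "sentiment"],
    ["tweet", "twitter"],
    ["movie", "film"],
    ["tweet", "movie"],
]

related = umass_coherence(["tweet", "twitter"], toy_docs)
unrelated = umass_coherence(["tweet", "film"], toy_docs)
print(related > unrelated)  # -> True
```

Words that tend to appear in the same documents score closer to zero, while words that rarely co-occur push the score further negative.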

In [14]:
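import pprint

# train LDA with 15 topics and 20 passes over the corpus
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=15, id2word=dictionary, passes=20)
# show the 5 highest-weighted words per topic, with topics ordered by coherence
pprint.pprint(ldamodel.top_topics(corpus, topn=5))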

[([(0.028881738, 'others'),
   (0.020567674, 'question'),
   (0.019726312, 'acknowledgement'),
   (0.018826835, 'world'),
   (0.018470805, 'science')],
  -0.8983189510991755),
 ([(0.009770964, 'year'),
   (0.009243801, 'state'),
   (0.006931029, 'country'),
   (0.0068268585, 'information'),
   (0.0064245136, 'can')],
  -0.9113489232636575),
 ([(0.014603494, 'csv'),
   (0.013380207, 'file'),
   (0.011615968, 'word'),
   (0.010382965, 'can'),
   (0.008353946, 'http')],
  -0.9829009040006934),
 ([(0.012837452, 'model'),
   (0.010796197, 'can'),
   (0.010479897, 'file'),
   (0.008199418, 'feature'),
   (0.008171519, 'set')],
  -1.2590129710812663),
 ([(0.038292494, 'tweet'),
   (0.020086152, 'twitter'),
   (0.0137979705, 'sentiment'),
   (0.00965343, 'user'),
   (0.006608839, 'medium')],
  -1.2875551367267146),
 ([(0.008708336, 'time'),
   (0.008550358, 'can'),
   (0.0076371445, 'number'),
   (0.0063950927, 'year'),
   (0.006237971, 'game')],
  -1.6253247268305426),
 ([(0.013892414, 'movie

## Visualize the topic model

Using pyLDAvis, we can create an interactive visualization.

In [16]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)

In [22]:
import streamlit as st
import streamlit.components.v1 as components
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
from gensim import corpora, models


# lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)
vis_data = gensimvis.prepare(ldamodel, corpus, dictionary)

# Streamlit interface
st.title("Topic Visualization with LDA")
pyLDAvis_html = pyLDAvis.prepared_data_to_html(vis_data)
# render inside an HTML component so the visualization's JavaScript can run
# (width and height are arbitrary choices)
components.html(pyLDAvis_html, width=1300, height=800, scrolling=True)

In [27]:
!streamlit topic-modeling.ipynb


Usage: streamlit [OPTIONS] COMMAND [ARGS]...
Try 'streamlit --help' for help.

Error: No such command 'topic-modeling.ipynb'.

The command fails because `streamlit` expects a subcommand: the form is `streamlit run <script>`. The target should also be a Python script rather than a notebook, so the Streamlit code above would first need to be saved to a `.py` file.
