# Topic Modeling 

Topic modeling is a statistical technique for discovering the abstract "topics" that occur in a collection of documents.  
It is most commonly applied to text documents, and it has become an active research area in social media analysis.  
One of the most popular algorithms is Latent Dirichlet Allocation (LDA), which was proposed by  
[David Blei et al. in 2003](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf).   
Here, I perform topic modeling on the upvoted Kaggle datasets, using each dataset's description as a document. 

Some notes on topic modeling:   
* To determine the number of topics, it is common to use the [elbow method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) with the [perplexity score](http://qpleple.com/perplexity-to-evaluate-topic-models/) as its cost function.   
* To evaluate the models, we can calculate [topic coherence](http://qpleple.com/topic-coherence-to-evaluate-topic-models/).   
* Finally, to interpret the topics, we can borrow the [triangulation method](http://www.federica.eu/users/9/docs/amaturo-39571-01-Triangulation.pdf) from social science research.  
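
The elbow heuristic from the first note can be sketched in a few lines. This is only an illustration: `find_elbow` is a hypothetical helper, the "farthest point from the chord" rule is one common way to locate an elbow rather than a method prescribed by the links above, and the scores in the usage example are made-up numbers, not results from this dataset.

```python
def find_elbow(ks, scores):
    """Return the k whose (k, score) point is farthest from the chord
    joining the first and last points, after scaling both axes to [0, 1]."""
    if len(ks) != len(scores) or len(ks) < 3:
        raise ValueError("need at least three (k, score) pairs")

    def rescale(values):
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid dividing by zero on a flat curve
        return [(v - lo) / span for v in values]

    xs, ys = rescale(ks), rescale(scores)
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]

    # The absolute cross product is proportional to the distance from the chord
    distances = [abs(dy * (x - xs[0]) - dx * (y - ys[0])) for x, y in zip(xs, ys)]
    return ks[distances.index(max(distances))]


# Made-up perplexity values, for illustration only
candidate_ks = [2, 4, 6, 8, 10]
made_up_perplexity = [1000.0, 600.0, 400.0, 380.0, 370.0]
print(find_elbow(candidate_ks, made_up_perplexity))  # → 6
```

Rescaling both axes first keeps the result from depending on the units of the score.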

## Import libraries

I used the LDA model from gensim. Another option is scikit-learn.

In [11]:
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
import pandas as pd
import gensim  # topic modeling library (provides the LDA implementation)
import pyLDAvis.gensim  # visualization; renamed to pyLDAvis.gensim_models in pyLDAvis 3.3+


## Initializing the Tokenizer and Lemmatizer

Initialize the tokenizer, stop-word list, and lemmatizer from the libraries.

* The tokenizer splits sentences into words.  
* The lemmatizer (similar to a stemmer) reduces each word to its base form.   
The difference is that a lemmatizer uses a vocabulary to return a real dictionary word (the lemma), while a stemmer only strips suffixes by rule and may produce non-words. 
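
The difference between the two can be seen with a deliberately tiny toy example. Neither function below is NLTK's algorithm: `toy_stem` and `toy_lemmatize` are hypothetical stand-ins (a crude suffix stripper and a three-entry lookup table), meant only to contrast rule-based truncation with dictionary lookup.

```python
def toy_stem(word):
    """Crude rule-based stemmer: strip the first matching suffix."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Tiny hand-written lookup table standing in for a real vocabulary
TOY_LEMMAS = {"studies": "study", "mice": "mouse", "transactions": "transaction"}


def toy_lemmatize(word):
    """Dictionary-based lemmatizer: return the known base form, if any."""
    return TOY_LEMMAS.get(word, word)


for word in ("studies", "mice", "transactions"):
    print(word, toy_stem(word), toy_lemmatize(word))
```

The toy stemmer turns "studies" into the non-word "stud" and leaves "mice" untouched, while the lookup returns "study" and "mouse".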


In [12]:
pattern = r'\b[^\d\W]+\b'
tokenizer = RegexpTokenizer(pattern)
en_stop = get_stop_words('en')
lemmatizer = WordNetLemmatizer()
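
To see what the pattern above keeps, it can be tried with the standard `re` module. This sketch assumes that `RegexpTokenizer` in its default mode returns the pattern's matches, much like `re.findall`; the sample sentence is invented for illustration.

```python
import re

# Same pattern as above: runs of word characters that contain no digits
pattern = r'\b[^\d\W]+\b'

sample = "The 2 datasets contain 1,234 transactions."
tokens = re.findall(pattern, sample.lower())
print(tokens)  # → ['the', 'datasets', 'contain', 'transactions']
```

Numbers and punctuation never match, so only alphabetic words (and underscores) survive.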

In [13]:
remove_words = ['data','dataset','datasets','content','context','acknowledgement','inspiration']

## Read the data

In [14]:
# Input from csv
df = pd.read_csv('../input/voted-kaggle-dataset.csv')

# sample data
print(df['Description'].head(2))

0    The datasets contains transactions made by cre...
1    The ultimate Soccer database for data analysis...
Name: Description, dtype: object


In [15]:
df.head()

Unnamed: 0,Title,Subtitle,Owner,Votes,Versions,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
0,Credit Card Fraud Detection,Anonymized credit card transactions labeled as...,Machine Learning Group - ULB,1241,"Version 2,2016-11-05|Version 1,2016-11-03",crime\nfinance,CSV,144 MB,ODbL,"442,136 views","53,128 downloads","1,782 kernels",26 topics,https://www.kaggle.com/mlg-ulb/creditcardfraud,The datasets contains transactions made by cre...
1,European Soccer Database,"25k+ matches, players & teams attributes for E...",Hugo Mathien,1046,"Version 10,2016-10-24|Version 9,2016-10-24|Ver...",association football\neurope,SQLite,299 MB,ODbL,"396,214 views","46,367 downloads","1,459 kernels",75 topics,https://www.kaggle.com/hugomathien/soccer,The ultimate Soccer database for data analysis...
2,TMDB 5000 Movie Dataset,"Metadata on ~5,000 movies from TMDb",The Movie Database (TMDb),1024,"Version 2,2017-09-28",film,CSV,44 MB,Other,"446,255 views","62,002 downloads","1,394 kernels",46 topics,https://www.kaggle.com/tmdb/tmdb-movie-metadata,Background\nWhat can we say about the success ...
3,Global Terrorism Database,"More than 170,000 terrorist attacks worldwide,...",START Consortium,789,"Version 2,2017-07-19|Version 1,2016-12-08",crime\nterrorism\ninternational relations,CSV,144 MB,Other,"187,877 views","26,309 downloads",608 kernels,11 topics,https://www.kaggle.com/START-UMD/gtd,"Context\nInformation on more than 170,000 Terr..."
4,Bitcoin Historical Data,Bitcoin data at 1-min intervals from select ex...,Zielak,618,"Version 11,2018-01-11|Version 10,2017-11-17|Ve...",history\nfinance,CSV,119 MB,CC4,"146,734 views","16,868 downloads",68 kernels,13 topics,https://www.kaggle.com/mczielinski/bitcoin-his...,Context\nBitcoin is the longest running and mo...


In [16]:
df.shape

(2150, 15)

## Perform Tokenization, Word Removal, and Lemmatization

In [17]:
# list of tokenized documents, one entry per description
texts = []

# loop through the descriptions
# (Series.iteritems() was removed in pandas 2.0; items() is the equivalent)
for _, description in df['Description'].items():
    # convert to str in case of non-string values, then lowercase and tokenize
    raw = str(description).lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # remove the domain-specific words listed in remove_words
    stopped_tokens_new = [token for token in stopped_tokens if token not in remove_words]

    # lemmatize tokens
    lemma_tokens = [lemmatizer.lemmatize(token) for token in stopped_tokens_new]

    # remove single-character tokens
    new_lemma_tokens = [token for token in lemma_tokens if len(token) != 1]

    # add the cleaned tokens to the list
    texts.append(new_lemma_tokens)

# sample data
print(texts[0])

['contains', 'transaction', 'made', 'credit', 'card', 'september', 'european', 'cardholder', 'present', 'transaction', 'occurred', 'two', 'day', 'fraud', 'transaction', 'highly', 'unbalanced', 'positive', 'class', 'fraud', 'account', 'transaction', 'contains', 'numerical', 'input', 'variable', 'result', 'pca', 'transformation', 'unfortunately', 'due', 'confidentiality', 'issue', 'provide', 'original', 'feature', 'background', 'information', 'feature', 'principal', 'component', 'obtained', 'pca', 'feature', 'transformed', 'pca', 'time', 'amount', 'feature', 'time', 'contains', 'second', 'elapsed', 'transaction', 'first', 'transaction', 'feature', 'amount', 'transaction', 'amount', 'feature', 'can', 'used', 'example', 'dependant', 'cost', 'senstive', 'learning', 'feature', 'class', 'response', 'variable', 'take', 'value', 'case', 'fraud', 'otherwise', 'given', 'class', 'imbalance', 'ratio', 'recommend', 'measuring', 'accuracy', 'using', 'area', 'precision', 'recall', 'curve', 'auprc', 'c
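
The same steps can be wrapped in a reusable function. The sketch below is a variation, not the code used above: `preprocess` is a hypothetical helper, it uses `re.findall` in place of `RegexpTokenizer`, takes the lemmatizer as a plain callable, and applies the custom word filter after lemmatization so that a plural such as "datasets" is caught by a singular entry.

```python
import re

TOKEN_PATTERN = re.compile(r'\b[^\d\W]+\b')


def preprocess(text, stop_words, remove_words, lemmatize=lambda token: token):
    """Lowercase, tokenize, drop stop words, lemmatize, then drop
    custom words and single-character tokens."""
    tokens = TOKEN_PATTERN.findall(str(text).lower())
    tokens = [token for token in tokens if token not in stop_words]
    lemmas = [lemmatize(token) for token in tokens]
    return [lemma for lemma in lemmas
            if lemma not in remove_words and len(lemma) > 1]


# Stand-in lemmatizer for the demo: strips a trailing "s" (this is not WordNet)
def strip_plural(token):
    return token[:-1] if token.endswith("s") else token


print(preprocess("The datasets contain 2 transactions",
                 stop_words={"the"}, remove_words={"dataset"},
                 lemmatize=strip_plural))
# → ['contain', 'transaction']
```

In this notebook the call would presumably look like `preprocess(description, set(en_stop), set(remove_words), lemmatizer.lemmatize)`.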

## Create term dictionary and document-term matrix

In [18]:
# build an id <-> term dictionary from the tokenized documents
dictionary = corpora.Dictionary(texts)
# convert each tokenized document into a bag of words: a list of (term id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
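
What these two calls produce can be mimicked in plain Python. This is a toy re-implementation for illustration, not gensim's code: `build_vocab` and `to_bow` are hypothetical names, and the ids are assigned in first-seen order, which may differ from the ids gensim assigns.

```python
from collections import Counter


def build_vocab(texts):
    """Map each distinct token to an integer id (first-seen order)."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())


toy_texts = [["fraud", "card", "fraud"], ["card", "price"]]
token2id = build_vocab(toy_texts)
toy_corpus = [to_bow(text, token2id) for text in toy_texts]
print(token2id)    # → {'fraud': 0, 'card': 1, 'price': 2}
print(toy_corpus)  # → [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded; each document keeps only which terms occur and how often.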

In [20]:
dictionary

<gensim.corpora.dictionary.Dictionary at 0x7e26f2fd6c88>

In [22]:
# for k, v in dictionary.items():
#     print(k,v)

## Generate LDA model

I used a pre-determined number of topics (7). It would be better to compute perplexity for several candidate values and choose the number of topics from that.    
*top_topics* returns the topics sorted by topic coherence, highest first.

In [28]:
import pprint

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=7, id2word=dictionary, passes=20)
pprint.pprint(ldamodel.top_topics(corpus, topn=15))

[([(0.011116325, 'description'),
   (0.0083374484, 'can'),
   (0.0081585506, 'yet'),
   (0.0080767516, 'file'),
   (0.0068363356, 'http'),
   (0.0067467154, 'year'),
   (0.006257317, 'information'),
   (0.0056186169, 'time'),
   (0.0055987849, 'set'),
   (0.005363334, 'csv'),
   (0.0052493629, 'user'),
   (0.0052406141, 'contains'),
   (0.0048864721, 'used'),
   (0.0048376811, 'one'),
   (0.0044507743, 'acknowledgement')],
  -1.1907569607921484),
 ([(0.012243281, 'word'),
   (0.010499313, 'text'),
   (0.010103249, 'can'),
   (0.0079157585, 'price'),
   (0.0058530611, 'contains'),
   (0.0054972009, 'language'),
   (0.0053285342, 'file'),
   (0.0047037825, 'model'),
   (0.0046010995, 'acknowledgement'),
   (0.0045944462, 'corpus'),
   (0.0045317831, 'name'),
   (0.0044834353, 'http'),
   (0.004323517, 'product'),
   (0.003792386, 'value'),
   (0.0037410262, 'using')],
  -1.5700097892459439),
 ([(0.0072528436, 'state'),
   (0.0069877189, 'number'),
   (0.0059843701, 'city'),
   (0.0053042
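
The number that closes each tuple above is the topic's coherence score. As far as I know, `top_topics` uses the UMass measure by default, which rewards topics whose top words tend to appear in the same documents. The function below is a simplified pure-Python version of that idea, with add-one smoothing and averaged over word pairs. `umass_coherence` is a hypothetical helper, the toy documents are invented, and gensim's own implementation differs in details such as the smoothing constant, so the values will not match the output above.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents, smoothing=1.0):
    """Average log((D(later, earlier) + smoothing) / D(earlier)) over all
    pairs of top words, where D counts the documents containing the words
    and top_words is ordered from most to least probable."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    scores = []
    for i, j in combinations(range(len(top_words)), 2):  # i < j
        earlier, later = top_words[i], top_words[j]
        if doc_freq(earlier) == 0:
            raise ValueError(f"{earlier!r} does not occur in any document")
        scores.append(math.log((doc_freq(later, earlier) + smoothing)
                               / doc_freq(earlier)))
    if not scores:
        raise ValueError("need at least two top words")
    return sum(scores) / len(scores)


# Invented toy documents
toy_docs = [["credit", "card", "fraud"], ["credit", "card"], ["soccer", "team"]]
print(umass_coherence(["credit", "card"], toy_docs))    # words that co-occur
print(umass_coherence(["credit", "soccer"], toy_docs))  # words that never co-occur
```

The first call gives a higher score than the second, which matches the intuition that "credit" and "card" belong to the same topic.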

## Visualize the topic model

Using pyLDAvis, we can create an interactive visualization.

In [27]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]
