<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Libraries" data-toc-modified-id="Libraries-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Libraries</a></span></li></ul></li><li><span><a href="#Data" data-toc-modified-id="Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing</a></span><ul class="toc-item"><li><span><a href="#preprocessing-functions" data-toc-modified-id="preprocessing-functions-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>preprocessing functions</a></span></li><li><span><a href="#example-preprocessed-document" data-toc-modified-id="example-preprocessed-document-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>example preprocessed document</a></span></li></ul></li><li><span><a href="#grab-a-random-document" data-toc-modified-id="grab-a-random-document-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>grab a random document</a></span><ul class="toc-item"><li><span><a href="#preprocess-the-documents" data-toc-modified-id="preprocess-the-documents-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>preprocess the documents</a></span></li></ul></li><li><span><a href="#Bag-of-Words" data-toc-modified-id="Bag-of-Words-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Bag of Words</a></span><ul class="toc-item"><li><span><a href="#Dictionary" data-toc-modified-id="Dictionary-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Dictionary</a></span></li><li><span><a href="#Remove-Extremes" data-toc-modified-id="Remove-Extremes-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Remove Extremes</a></span></li><li><span><a href="#BoW" data-toc-modified-id="BoW-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>BoW</a></span></li></ul></li></ul></div>

**Latent Dirichlet Allocation**

Latent Dirichlet Allocation (LDA) can be used for topic modeling or document classification. It uses two Dirichlet distributions as priors: one over the topic mixture of each document and one over the word distribution of each topic. Each document is modeled as a multinomial distribution over topics, and each topic as a multinomial distribution over words.

Some assumptions:
- every chunk of text will contain words that are somehow related
- documents are produced from a mixture of topics
- topics generate words based on their multinomial distribution
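
The assumptions above describe a generative story, which can be sketched with the standard library alone. This is a toy illustration, not how gensim fits a model: the vocabulary, the number of topics, the prior values of 0.5, and the helper names `sample_dirichlet` and `generate_document` are all made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, doc_alpha, n_words, rng):
    """Generate one toy document following the LDA generative story."""
    # per-document topic mixture, drawn from the first Dirichlet prior
    topic_mix = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(n_words):
        # pick a topic for this word position, then a word from that topic
        topic = rng.choices(range(len(topic_mix)), weights=topic_mix)[0]
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return topic_mix, words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = ["mine", "coal", "strike", "pay", "rain", "flood"]  # hypothetical vocabulary
n_topics = 2

# per-topic word distributions, drawn from the second Dirichlet prior
topic_word_probs = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(n_topics)]

topic_mix, words = generate_document(topic_word_probs, vocab, [0.5] * n_topics, 8, rng)
print(topic_mix)
print(words)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers topic mixtures and topic-word distributions that could plausibly have produced them.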


## Libraries

In [1]:
# data load
import pandas as pd
import os


# data cleaning
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk

import numpy as np

# Data
The dataset we'll use is a list of over one million news headlines published over a period of 15 years. We'll start by loading it from the `abcnews-date-text.csv` file.


In [2]:
# skip malformed rows; on_bad_lines replaces error_bad_lines, which was removed in pandas 2.0
data = pd.read_csv(os.path.join('data', 'abcnews-date-text.csv'), on_bad_lines='skip')

In [3]:
data

| | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |
| ... | ... | ... |
| 1103660 | 20171231 | the ashes smiths warners near miss liven up bo... |
| 1103661 | 20171231 | timelapse: brisbanes new year fireworks |
| 1103662 | 20171231 | what 2017 meant to the kids of australia |
| 1103663 | 20171231 | what the papodopoulos meeting may mean for ausus |


In [4]:
# grab just the headlines for the first 300000 entries
# (.copy() avoids a SettingWithCopyWarning when the index column is added)
documents = data[['headline_text']].iloc[:300000].copy()
documents['index'] = documents.index

In [5]:
documents.head()

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


# Preprocessing

Data cleaning will consist of:
- Tokenization
- Case standardization
- Punctuation removal
- Removal of words shorter than 3 characters
- Stopword removal
- Lemmatization
- Stemming
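
To make the first five steps concrete, here is a minimal standard-library sketch applied to one headline from the dataset. It is only an illustration: `TOY_STOPWORDS` and `clean_headline` are hypothetical names, the stopword list is deliberately tiny, and lemmatization and stemming are left out because they need NLTK.

```python
import re

# hypothetical, deliberately tiny stopword list for illustration only
TOY_STOPWORDS = {"in", "for", "the", "and", "are"}

def clean_headline(text, min_len=3):
    """Apply tokenization through stopword removal to one headline."""
    # tokenization, case standardization, and punctuation removal in one pass
    tokens = re.findall(r"[a-z]+", text.lower())
    # removal of words shorter than min_len characters
    tokens = [t for t in tokens if len(t) >= min_len]
    # stopword removal
    return [t for t in tokens if t not in TOY_STOPWORDS]

cleaned = clean_headline("air nz staff in aust strike for pay rise")
print(cleaned)  # → ['air', 'staff', 'aust', 'strike', 'pay', 'rise']
```

Here "nz" and "in" fall to the length filter and "for" to the stopword filter; the surviving tokens would then be lemmatized and stemmed.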

In [6]:
# verify wordnet is up-to-date for lemmatizing
nltk.download("wordnet")

[nltk_data] Error loading wordnet: <urlopen error [WinError 10061] No
[nltk_data]     connection could be made because the target machine
[nltk_data]     actively refused it>


False

## preprocessing functions

In [10]:
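# build the lemmatizer and stemmer once instead of on every call
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

def lemmatize_stemming(text):
    """Lemmatize a token as a verb, then reduce it to its Snowball stem."""
    lem = lemmatizer.lemmatize(text, pos='v')
    return stemmer.stem(lem)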

In [8]:
def preprocess(text):

    # use gensim to tokenize the text & perform basic preprocessing
    # converts to lower case, ignores tokens that are too short
    tokens = gensim.utils.simple_preprocess(text, min_len=3)
    
    # initialize a list to store the result
    result = []

    # loop over the tokens
    for token in tokens: 
        
        # only keep if not a stop word
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            
            # add the lemmatized & stemmed word to the result list
            result.append(lemmatize_stemming(token))
            
    return result

## example preprocessed document

In [26]:
# grab a random document
document_num = np.random.randint(len(documents))
ex_doc = documents[documents['index'] == document_num].values[0][0]

print("Original document: ")
print(ex_doc.split(' '))
print("\n\nTokenized and lemmatized document: ")
print(preprocess(ex_doc))

Original document: 
['mining', 'company', 'says', 'expansion', 'criticisms', 'are']


Tokenized and lemmatized document: 
['mine', 'compani', 'say', 'expans', 'critic']


## preprocess the documents

In [15]:
# preprocess all the headlines
processed_docs = documents["headline_text"].apply(preprocess)

In [16]:
processed_docs[:5]

0    [aba, decid, communiti, broadcast, licenc]
1                       [act, wit, awar, defam]
2        [call, infrastructur, protect, summit]
3         [air, staff, aust, strike, pay, rise]
4     [air, strike, affect, australian, travel]
Name: headline_text, dtype: object

# Bag of Words

## Dictionary

In [17]:
# use gensim to get a dictionary with all words and an integer ID
dictionary = gensim.corpora.Dictionary(processed_docs)

In [19]:
# check out the first 11 dictionary entries (IDs 0 through 10)
for i, (num, word) in enumerate(dictionary.iteritems()):
    if i > 10:
        break
    print(num, word)

0 aba
1 broadcast
2 communiti
3 decid
4 licenc
5 act
6 awar
7 defam
8 wit
9 call
10 infrastructur
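
The IDs in the output above follow a visible pattern: within each document, new words appear to be numbered in alphabetical order. The sketch below reproduces that pattern for the first two processed headlines. It is inferred from the printed output rather than taken from gensim's source, and `build_token2id` is a hypothetical helper.

```python
def build_token2id(docs):
    """Assign each new token the next free integer ID, document by document."""
    token2id = {}
    for doc in docs:
        # visit each document's unique tokens in alphabetical order
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

docs = [
    ["aba", "decid", "communiti", "broadcast", "licenc"],
    ["act", "wit", "awar", "defam"],
]
token2id = build_token2id(docs)
print(token2id)
```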


## Remove Extremes

In [20]:
# remove rare words (in fewer than 15 docs) and common words (in more than 10% of docs);
# filter_extremes also keeps at most keep_n=100000 tokens by default and reassigns IDs
dictionary.filter_extremes(no_below=15, no_above=0.1)
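
The two thresholds work on document frequency, meaning the number of documents a word appears in, not on total word counts. A minimal sketch of the idea follows, using toy documents. `filter_extremes_sketch` is a hypothetical helper that mirrors the `no_below` and `no_above` parameters but ignores `keep_n` and does not reassign IDs.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above):
    """Return the tokens whose document frequency is within both thresholds."""
    # document frequency: count each token at most once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # convert the fraction no_above into an absolute number of documents
    max_docs = int(no_above * len(docs))
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

toy_docs = [
    ["strike", "pay"],
    ["strike", "rain"],
    ["strike", "flood"],
    ["pay", "rain"],
]
kept = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.5)
print(kept)  # → ['pay', 'rain']
```

In this toy corpus "strike" is dropped for appearing in 3 of 4 documents (more than 50%), and "flood" is dropped for appearing in only one.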

## BoW

In [24]:
# create a bag of words model for each document
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [25]:
# example document
bow_corpus[document_num]

[(131, 1), (133, 1), (2024, 1), (3977, 1)]
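
Each pair in a bag-of-words vector is `(word ID, count)`. Words that are missing from the dictionary, for example because they were filtered out, are dropped. The idea can be sketched as follows; `doc2bow_sketch` and the small `token2id` mapping are made up for illustration and are not gensim's implementation.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(tokens)
    # keep only tokens the dictionary knows about
    return sorted((token2id[t], n) for t, n in counts.items() if t in token2id)

token2id = {"air": 0, "strike": 1, "pay": 2}  # hypothetical dictionary
bow = doc2bow_sketch(["air", "strike", "air", "nz"], token2id)
print(bow)  # → [(0, 2), (1, 1)]
```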

In [29]:
# print each word ID, the word itself, and its count for the example document
for word_id, count in bow_corpus[document_num]:
    print(f'Word {word_id} ("{dictionary[word_id]}") appears {count} time.')

Word 138 ("critic") appears 1 time.
Word 411 ("say") appears 1 time.
Word 1343 ("mine") appears 1 time.
Word 2558 ("compani") appears 1 time.
Word 3402 ("expans") appears 1 time.
