
**Latent Dirichlet Allocation**

LDA can be used for topic modeling or document classification. It relies on two Dirichlet distributions: one acts as the prior on each document's mixture of topics, the other as the prior on each topic's distribution over words. Each document is modeled as a multinomial distribution over topics, and each topic as a multinomial distribution over words.

Some assumptions:
- every chunk of text will contain words that are somehow related
- documents are produced from a mixture of topics
- topics generate words based on their multinomial distribution
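
The generative story behind these assumptions can be sketched in a few lines of NumPy. This is only an illustration, not part of the notebook's pipeline: the vocabulary, the number of topics, and the prior values `alpha` and `beta` are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and priors (hypothetical values chosen for illustration only)
vocab = ["rain", "flood", "storm", "vote", "election", "minister"]
n_topics = 2
alpha = np.full(n_topics, 0.5)     # Dirichlet prior over topics per document
beta = np.full(len(vocab), 0.5)    # Dirichlet prior over words per topic

# One word distribution per topic, each drawn from Dirichlet(beta)
topic_word = rng.dirichlet(beta, size=n_topics)  # shape (n_topics, len(vocab))

def generate_document(n_words):
    """Sample one document following the LDA generative story."""
    theta = rng.dirichlet(alpha)                      # topic mixture for this document
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)             # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])   # pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers plausible values for `theta` and `topic_word`.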


## Libraries

In [1]:
# data load
import pandas as pd
import os


# data cleaning
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import nltk

import numpy as np

# Data
The dataset we'll use is a list of over one million news headlines published over a period of 15 years. We'll start by loading it from the `abcnews-date-text.csv` file.


In [2]:
# skip malformed rows (on_bad_lines replaces error_bad_lines, which was removed in pandas 2.0)
data = pd.read_csv(os.path.join('data', 'abcnews-date-text.csv'), on_bad_lines='skip')

In [3]:
data

| | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |
| ... | ... | ... |
| 1103660 | 20171231 | the ashes smiths warners near miss liven up bo... |
| 1103661 | 20171231 | timelapse: brisbanes new year fireworks |
| 1103662 | 20171231 | what 2017 meant to the kids of australia |
| 1103663 | 20171231 | what the papodopoulos meeting may mean for ausus |


In [4]:
# grab just the headlines for the first 300000 entries
# .copy() avoids a SettingWithCopyWarning when the index column is added
documents = data.iloc[:300000][['headline_text']].copy()
documents['index'] = documents.index

In [5]:
documents.head()

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


# Preprocessing

Data cleaning will consist of:
- Tokenization
- Case standardization
- Punctuation removal
- Removal of words shorter than 3 characters
- Stopword removal
- Lemmatization
- Stemming
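
As a rough illustration of the first five steps, here is a dependency-free sketch using only the standard library. The names `basic_clean` and `TOY_STOPWORDS` are hypothetical, and the stopword list is a tiny stand-in for gensim's much larger `STOPWORDS`. Lemmatization and stemming are left out because they need NLTK.

```python
import re

# Tiny hypothetical stopword list; gensim's STOPWORDS is much larger
TOY_STOPWORDS = {"the", "for", "and", "must", "against"}

def basic_clean(text, min_len=3):
    """Tokenize, lowercase, strip punctuation, drop short tokens and stopwords."""
    # Lowercase, then keep runs of letters only (this also discards punctuation and digits)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in TOY_STOPWORDS]

print(basic_clean("Air NZ staff in Aust strike for pay rise!"))
# → ['air', 'staff', 'aust', 'strike', 'pay', 'rise']
```

Note that "nz" and "in" are dropped by the length filter alone, before the stopword check is ever reached.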

In [6]:
# verify wordnet is up-to-date for lemmatizing
nltk.download("wordnet")

[nltk_data] Error loading wordnet: <urlopen error [WinError 10061] No
[nltk_data]     connection could be made because the target machine
[nltk_data]     actively refused it>


False

## preprocessing functions

In [7]:
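# assumption: the stemmer is NLTK's English Snowball stemmer (imported above)
stemmer = SnowballStemmer('english')

def lemmatize_stemming(text):
    # lemmatize the token as a verb, then reduce the lemma to its stem
    lem = WordNetLemmatizer().lemmatize(text, pos='v')
    return stemmer.stem(lem)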

In [8]:
def preprocess(text):

    # use gensim to tokenize the text & perform basic preprocessing:
    # lowercases, strips punctuation, and drops tokens shorter than min_len
    # (tokens longer than the default max_len of 15 are dropped as well)
    tokens = gensim.utils.simple_preprocess(text, min_len=3)
    
    # initialize a list to store the result
    result = []

    # loop over the tokens
    for token in tokens: 
        
        # only keep if not a stop word
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            
            # add the lemmatized & stemmed word to the result list
            result.append(lemmatize_stemming(token))
            
    return result
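
A natural next step is to apply the function to every headline. The sketch below shows one way to do that with `Series.map`. To stay runnable without gensim or the WordNet data, it uses a hypothetical stand-in, `toy_preprocess`, and a two-row frame; in the notebook, `preprocess` and the real `documents` frame would presumably take their place.

```python
import pandas as pd

def toy_preprocess(text):
    # Hypothetical stand-in for preprocess(): lowercase and keep words of 3+ characters
    return [w for w in text.lower().split() if len(w) >= 3]

documents = pd.DataFrame({
    "headline_text": [
        "air nz staff in aust strike for pay rise",
        "air nz strike to affect australian travellers",
    ]
})

# Series.map calls the function once per headline and returns a Series of token lists
processed_docs = documents["headline_text"].map(toy_preprocess)
print(processed_docs.tolist())
```

The result keeps the original index, so each token list stays aligned with its headline.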