# Topic Modeling With Latent Dirichlet Allocation

Topic modeling using Latent Dirichlet Allocation (LDA), as discussed in Raschka & Mirjalili (2017), *Python Machine Learning*, Packt Publishing (Chapter 8).

The acronym LDA is also used for Linear Discriminant Analysis, which is a different method.

Todo items:

* Email discussions often have user "signatures" or "footers". These need to be removed.
* Email discussions also have "threads": a reply is sent with the previous messages quoted below it. The quoted text needs to be removed as well.
* Others ...
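
The first two todo items could be approached with simple pattern matching. The sketch below is one possible heuristic, not a tested solution for this corpus: it cuts a message at the first line that looks like the start of a quoted reply or a conventional signature delimiter. The function name `strip_thread_and_signature` and all of the patterns are assumptions for illustration. It also assumes plain text with line breaks preserved, so it would have to run before tokenization flattens the message.

```python
import re

# Hypothetical patterns marking the start of a quoted earlier message.
# These are assumptions; inspect the corpus before relying on them.
REPLY_MARKERS = re.compile(
    r'^(?:-+\s*Original Message\s*-+'   # Outlook-style separator
    r'|On .+ wrote:'                     # "On <date>, <name> wrote:"
    r'|From:\s.+)$',                     # header block of a quoted message
    flags=re.IGNORECASE | re.MULTILINE)

# Conventional signature delimiter: a line holding only "--" (or "-- ").
SIGNATURE_MARKER = re.compile(r'^--\s*$', flags=re.MULTILINE)


def strip_thread_and_signature(message):
    """Return the message body above any quoted thread or signature."""
    cut = len(message)
    for pattern in (REPLY_MARKERS, SIGNATURE_MARKER):
        match = pattern.search(message)
        if match:
            # Keep only the text before the earliest marker found.
            cut = min(cut, match.start())
    return message[:cut].rstrip()


example = ("I think it would depend on your policies.\n"
           "\n"
           "From: studentjudicial\n"
           "Sent: Monday\n"
           "Have an interesting twist to a case.")
print(strip_thread_and_signature(example))
```

Signatures that do not follow the `--` convention (a name and title directly under the message, as in the examples further down) are not caught by this sketch and would need corpus-specific rules.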

In [1]:
import pandas as pd
import numpy as np
import nltk
import contractions
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [2]:
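def display_results(bag_array, bag_vocab):
    # Label each column as '<index>_<term>', ordered by vocabulary index
    # so the labels line up with the columns of the bag-of-words array.
    terms = sorted(bag_vocab, key=bag_vocab.get)
    make_list = [str(bag_vocab[term]) + '_' + term for term in terms]
    df = pd.DataFrame(bag_array, columns=make_list)

    return df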

def remove_html(html_text):
    from bs4 import BeautifulSoup
    # Name the parser explicitly so results do not depend on what is installed
    soup = BeautifulSoup(html_text, 'html.parser')
    text = soup.get_text()
    return text

def remove_comments(commented_text):
    # Remove every HTML comment (<!-- ... -->), including multi-line ones
    import re
    clean_string = re.sub(r'<!--.*?-->', '', commented_text, flags=re.DOTALL)
    return clean_string

def text_only(all_text):
    # Keep only purely alphabetic tokens and join them with single spaces
    return ' '.join(x for x in all_text if x.isalpha())
    
def normalize_text(original_text):
    # print('Removing system comments    ... ', end='')
    return_text = remove_comments(original_text)
    # print('DONE')
    # print('Converting contractions     ... ', end='')
    return_text = contractions.fix(return_text)
    # print('DONE')
    # print('Removing non-alpha tokens   ... ', end='')
    return_text = nltk.word_tokenize(return_text)
    return_text = text_only(return_text)
    # print('DONE')
    # Fall back to a single space when nothing survives the cleaning
    if not return_text:
        return_text = ' '
    return return_text

In [3]:
msgs = pd.read_stata('studentjudicial_messages.dta')
msgs['msgTxt'] = msgs['msgTxt'].apply(normalize_text)
msgs.head()

Unnamed: 0,index,userId,authName,subject,Unix,Date,msgId,preInTpc,nxtInTpc,preInTime,nxtInTime,topicId,MsgsInTopic,msgRaw,msgTxt
0,0,581375493,"Savage, Shannon",Filming and prop weapons policies,1541724012,2018-11-08 19:40:12,33102,0,0,33101,0,33102,1,"<div id=""ygrps-yiv-1920241537"">Hello all,<br/>...",Hello all Our campus has recently had two inci...
1,1,558034509,Sara Ash,CBD and Policies,1541456954,2018-11-05 17:29:14,33101,0,0,33100,33102,33101,1,"<div id=""ygrps-yiv-79522852"">Hello everyone,<b...",Hello everyone I was wondering if you had any ...
2,2,190928695,Ray Tuttle (rtuttle),Holding students accountable for not respectin...,1541442992,2018-11-05 13:36:32,33100,0,0,33099,33101,33100,1,"<div id=""ygrps-yiv-1523644933""><html>\n<head>\...",Good afternoon Do any of you have language in ...
3,3,569518549,Anthony Leger,RE: Student now not wanting a lawyer,1541438971,2018-11-05 12:29:31,33099,33098,0,33098,33100,33098,2,"<div id=""ygrps-yiv-1607611912"">Dave,<br/>\n<br...",Dave I think it would depend on your policies ...
4,4,578133796,"Steward, David K",Student now not wanting a lawyer,1541436488,2018-11-05 11:48:08,33098,0,33099,33097,33099,33098,2,"<div id=""ygrps-yiv-1417733977"">Have an interes...",Have an interesting twist to a case Student ye...


In [4]:
count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
bag = count.fit_transform(msgs['msgTxt'])

In [5]:
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
bag_topics = lda.fit_transform(bag)

In [6]:
lda.components_.shape

(10, 5000)

In [7]:
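n_top_words = 7
# get_feature_names() was removed in scikit-learn 1.2
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print('Topic %d:' % (topic_idx + 1))
    # Indices of the n_top_words largest weights, in descending order
    top_idx = topic.argsort()[:-n_top_words - 1:-1]
    print(' '.join(feature_names[i] for i in top_idx))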

Topic 1:
marijuana drug police stephen drinking drugs possession
Topic 2:
police criminal assault sex crime violence victim
Topic 3:
suspension probation color interim drug greg suspended
Topic 4:
qui purchase records mary road ny hill
Topic 5:
appeal accused decision victim evidence appeals complainant
Topic 6:
ix transmission sender chris reply contents electronic
Topic 7:
linda illinois nona peoria resolution florida conflict
Topic 8:
programs training position experience professional research related
Topic 9:
communication records attachments electronic immediately privileged copy
Topic 10:
special brett counsel sokolow advisor ncherm risk
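
The slice `[:-n_top_words - 1:-1]` used to pick the top words is compact but easy to misread. Applied to an ascending argsort, it walks backwards from the last position and stops after `n_top_words` entries, which yields the indices of the largest weights in descending order. Below is a small pure-Python sketch of the same idea. The weights and vocabulary are made up for illustration, and `top_indices` is a hypothetical helper in which `sorted` stands in for `numpy.argsort`.

```python
def top_indices(weights, n_top):
    """Indices of the n_top largest weights, largest first."""
    # Stand-in for numpy's argsort: positions ordered by ascending weight.
    ascending = sorted(range(len(weights)), key=weights.__getitem__)
    # Same slice as in the notebook: start at the end, step backwards,
    # and stop before position -(n_top + 1).
    return ascending[:-n_top - 1:-1]


# Made-up topic-word weights and vocabulary, for illustration only.
weights = [0.2, 3.1, 0.7, 5.4, 1.0]
vocab = ['appeal', 'police', 'records', 'drug', 'victim']
print(' '.join(vocab[i] for i in top_indices(weights, 3)))
# prints: drug police victim
```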


In [8]:
drug = bag_topics[:, 0].argsort()[::-1]
for iter_idx, msg_idx in enumerate(drug[:3]):
    print('\nDrug example #%d:' % (iter_idx + 1))
    # argsort returns positions, so index by position with .iloc
    print(msgs['msgTxt'].iloc[msg_idx][:600], '. . .')


Drug example #1:
Thanks to those of you who have volunteered in the last few hours I have added you all to the list Keep the volunteers coming no experience necessary Chris Loschiavo JD Director of Student Conduct and Community Standards please note the new title Office of Student Life University of Oregon Eugene OR Ph Fax chrislos From studentjudicial mailto studentjudicial On Behalf Of Adriane Sent Thursday January PM To studentjudicial Subject RE ASJA Volunteers needed I attended ASJA for the first time last year Because of the number of job responsibilities I am juggling I have not been able to be very inv . . .

Drug example #2:
I attended ASJA for the first time last year Because of the number of job responsibilities I am juggling I have not been able to be very involved with my circuit or with the organization as a whole I would love to volunteer to do something to help at the upcoming conference I want to be helpful and be able to get to know some of the folks in the organizat