### LDA background
LDA assumes that documents are probability distribution over laten topics.
Topics are probability distribution over words.
LDA takes a number of documents. It assumes that the words in each document are related. It then tries to figure out the 'recipe' for how each document could have been created. We just need to tell the model how many topics to construct and it uses that 'recipe' to generate topic and word distributions over a corpus. Based on that output, we can identify similar documents within the corpus.

### In order to understand the LDA process, we have to know how LDA assumes topics are generated:
1. determine the number of words in the document
2. choose a topic mixture for the document over a fixed set of topics (ie. topic A 20%, topic B 50%, etc)
3. generate words in the document by:
    - pick a topic based on the document's multinomial distribution
    - pick a word based on the topic's multinomial distribution

### Working backwards
Suppose you have a corpus of documents, and you want LDA to learn the topic representatino of K topics in each document and the word distribution of each topic. LDA would backtrack from the document level to identify topics that are likely to have generated the corpus.

### LDA's Magic
1. randomly assign each word in each documen tto one of the K topics
2. for each document
    - assume that all topic assignments except for the current one are correct
    - claculate two proportions:
        1. proportion of words in document d that are currently assigned to topic t = p(topic t | document d)
        2. proportion of assignments to topic t over all documents that come from this word w = p(word w | topic t)
    - multiply those two proportions and assign w a new topic based on that probability. p(topic t | document d) * p(word w | topic t)
3. eventually we'll reach a steady state where assignments make sense

### alpha (parameter of the Dirichlet pior of the per-document topic distribution)
high: each document will contain many topics
low: each document iwll have distinct topics

### beta (parameter of the Dirichlet prior on the per-topic word distribution)
high: each topic will contain many words
low: each topic will contain few words

### theta (topic distribution for document m)
### z (topic for the n-th word in document m)
### w (specific word)

### step 0: examine and import corpus

In [121]:
#all the required libraries
import pandas as pd
from bs4 import BeautifulSoup

import os
import codecs
import itertools as it

import spacy
nlp = spacy.load('en')

from gensim.models import Phrases #seems like this is slower, but Phaser was not compatible to our code? need some research
from gensim.models.word2vec import LineSentence
from spacy.lang.en.stop_words import STOP_WORDS

from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle

In [26]:
def grab_dockets():
    files = []
    #get all .html files in the folder (all docket files are in .html)
    for file in os.listdir('docket_texts/'):
        if file.endswith('.html'):
            files.append(os.path.join('docket_texts/', file))

    df_docket_texts = pd.DataFrame()
    
    for i in range(len(files)): #gather all docket texts
    #for i in [0, 1]: #for testing purposes
        
        content = codecs.open(files[i], 'r', 'utf-8').read()
        #use beautiful soup to get the case ID
        soup = BeautifulSoup(content, 'lxml')
        case_id = str(soup.find_all('h3'))    
        bookmark1 = case_id.find('CASE #:') + len('CASE #:')
        bookmark2 = case_id.find('</h3>')
        case_id = case_id[bookmark1:bookmark2]

        #use pandas to grab tables in the html files
        docket_tables = pd.read_html(content)

        #error checking: gotta do this because there's different length of docket_list/
        #usually docket texts are in docket_list[3], but not always
        n = 0
        while docket_tables[n].isin(['Docket Text']).sum().sum() == 0:
            #print(n, docket_tables[n].isin(['Docket Text']).sum().sum())
            n += 1
                        
        #print(i, files[i])
        #print(docket_tables[n].head())

        #docket_tables[n] is the docket text table
        new_header = docket_tables[n].iloc[0]
        docket_tables[n] = docket_tables[n][1:]
        docket_tables[n].columns = new_header
        
        docket_tables[n]['#'] = pd.to_numeric(docket_tables[n]['#'],
                                              downcast = 'signed', errors = 'coerce')
        docket_tables[n]['Date Filed'] = pd.to_datetime(docket_tables[n]['Date Filed'])
        docket_tables[n]['Case ID'] = case_id

        df_docket_texts = pd.concat([df_docket_texts, docket_tables[n]])
    #reorder a column
    cols = list(df_docket_texts.columns)
    df_docket_texts = df_docket_texts[[cols[-1]] + cols[:-1]]
    
    print('current docket text table size/shape: {}'.format(df_docket_texts.shape))
    return df_docket_texts

In [27]:
#current docket text table size/shape: (721, 4), 2018-04-18
df = grab_dockets()
df.head()

current docket text table size/shape: (721, 4)


Unnamed: 0,Case ID,Date Filed,#,Docket Text
1,1:16-cv-01215-AMD-SJB,2016-03-10,1.0,"COMPLAINT against Cardiogenics Holdings, Inc. ..."
2,1:16-cv-01215-AMD-SJB,2016-03-10,,Case assigned to Judge Ann M Donnelly and Magi...
3,1:16-cv-01215-AMD-SJB,2016-03-10,2.0,"Summons Issued as to Cardiogenics Holdings, In..."
4,1:16-cv-01215-AMD-SJB,2016-03-11,,NOTICE - emailed attorney regarding missing se...
5,1:16-cv-01215-AMD-SJB,2016-03-11,3.0,In accordance with Rule 73 of the Federal Rule...


In [28]:
docket_original = list(df['Docket Text'])
print(docket_original[0:2])

['COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)', 'Case assigned to Judge Ann M Donnelly and Magistrate Judge Vera M. Scanlon. Please download and review the Individual Practices of the assigned Judges, located on our website. Attorneys are responsible for providing courtesy copies to judges where their Individual Practices require such. (Bowens, Priscilla) (Entered: 03/11/2016)']


In [29]:
docket_txt_filepath = 'docket_texts/docket_text_all.txt'

In [54]:
%%time

# if we need to update file, then True
if True:
    
    docket_count = 0

    # create & open a new file in write mode
    with codecs.open(docket_txt_filepath, 'w', encoding = 'utf_8') as docket_txt_file:


        for i in range(len(docket_original)):
            docket_txt_file.write(docket_original[i] + '\n')
            docket_count += 1

    print('Text from {:,} docket texts written to the new txt file'.format(docket_count))
    
else:
    with codecs.open(docket_txt_filepath, encoding = 'utf_8') as docket_txt_file:
        for review_count, line in enumerate(docket_txt_file):
            pass
        
    print(u'Text from {:,} restaurant reviews in the txt file.'.format(docket_count))

Text from 721 docket texts written to the new txt file
Wall time: 5.04 ms


In [56]:
print(df['Docket Text'].iloc[9])
print(df.shape)

EXHIBIT C - Notice of Conversion by LG Capital Funding, LLC. Related document: 1 Complaint, filed by LG Capital Funding, LLC. (Kehrli, Kevin) (Entered: 04/11/2016)
(721, 4)


In [120]:
# extracting a review for testing
n_th_review = 10
with codecs.open(docket_txt_filepath, encoding = 'utf_8') as f:
    sample_review = list(it.islice(f, n_th_review, n_th_review + 1))[0]

print(sample_review)

#using NLP (SpaCy English) to parse the review text
parsed_review = nlp(sample_review)

#for example, we can print out the sentences from the parsed object
#why were the sentences split by a dash?!
for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')

MOTION for Extension of Time to File Answer by Cardiogenics Holdings, Inc. (Kogan, Simon) Modified on 4/13/2016 to add motion (Quinlan, Krista). (Entered: 04/12/2016)

Sentence 1:
MOTION for Extension of Time to File Answer by Cardiogenics Holdings, Inc.

Sentence 2:
(Kogan, Simon)

Sentence 3:
Modified on 4/13/2016 to add motion (Quinlan, Krista).

Sentence 4:
(Entered: 04/12/2016)




In [68]:
#we can also look at entities, note that this result is not great
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

Entity 1: AFFIDAVIT OF SERVICE of ( - ORG

Entity 2: 2 - CARDINAL

Entity 3: 3 - CARDINAL

Entity 4: Chris Han - PERSON

Entity 5: 4 - CARDINAL

Entity 6: NCP - ORG

Entity 7: New York - GPE

Entity 8: 4/11/2018 - DATE

Entity 9: Thomas Paglia - PERSON

Entity 10: Hangzhou - GPE

Entity 11: Kailai Neckwear Apparel Co. Ltd. - ORG

Entity 12: Han - NORP

Entity 13: Chris - PERSON

Entity 14: Entered - PERSON

Entity 15: 
 - GPE



In [69]:
token_text = [token.orth_ for token in parsed_review] #Verbatim text content 
token_pos = [token.pos_ for token in parsed_review] #part-of-speech.

pd.DataFrame(list(zip(token_text, token_pos)), columns=['token_text', 'part_of_speech'])

Unnamed: 0,token_text,part_of_speech
0,AFFIDAVIT,PROPN
1,OF,ADP
2,SERVICE,PROPN
3,of,ADP
4,(,PUNCT
5,1,PUNCT
6,),PUNCT
7,Notice,PROPN
8,of,ADP
9,Motion,PROPN


In [70]:
token_lemma = [token.lemma_ for token in parsed_review] # Base form of the token, with no inflectional suffixes.
token_shape = [token.shape_ for token in parsed_review] # Transform of the tokens's string, to show orthographic features. For example, "Xxxx" or "dd".

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)), columns=['token_text', 'token_lemma', 'token_shape'])

Unnamed: 0,token_text,token_lemma,token_shape
0,AFFIDAVIT,affidavit,XXXX
1,OF,of,XX
2,SERVICE,service,XXXX
3,of,of,xx
4,(,(,(
5,1,1,d
6,),),)
7,Notice,notice,Xxxxx
8,of,of,xx
9,Motion,motion,Xxxxx


In [72]:
token_entity_type = [token.ent_type_ for token in parsed_review] #Named entity type.
token_entity_iob = [token.ent_iob_ for token in parsed_review] #IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.

pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
             columns=['token_text', 'entity_type', 'inside_outside_begin'])

Unnamed: 0,token_text,entity_type,inside_outside_begin
0,AFFIDAVIT,ORG,B
1,OF,ORG,I
2,SERVICE,ORG,I
3,of,ORG,I
4,(,ORG,I
5,1,,O
6,),,O
7,Notice,,O
8,of,,O
9,Motion,,O


In [73]:
token_attributes = [(token.orth_, #Verbatim text content
                     token.prob, #Smoothed log probability estimate of token's type
                     token.is_stop, #Is the token part of a "stop list"
                     token.is_punct, #Is the token punctuation
                     token.is_space, #Does the token consist of whitespace characters
                     token.like_num, #Does the token represent a number
                     token.is_oov) #Is the token out-of-vocabulary
                    for token in parsed_review]

df_temp = pd.DataFrame(token_attributes,
                       columns=['text', 'log_probability', 'stop?', 'punctuation?',
                                'whitespace?', 'number?', 'out of vocab.?'])

df_temp.loc[:, 'stop?':'out of vocab.?'] = (df_temp.loc[:, 'stop?':'out of vocab.?']
                                            .applymap(lambda x: u'Yes' if x else u''))
                                               
df_temp

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,AFFIDAVIT,-20.0,,,,,Yes
1,OF,-20.0,,,,,Yes
2,SERVICE,-20.0,,,,,Yes
3,of,-20.0,Yes,,,,Yes
4,(,-20.0,,Yes,,,Yes
5,1,-20.0,,,,Yes,Yes
6,),-20.0,,Yes,,,Yes
7,Notice,-20.0,,,,,Yes
8,of,-20.0,Yes,,,,Yes
9,Motion,-20.0,,,,,Yes


## we pretty much used SpaCy as text clean so far. We are going to use gensim to do phrase modeling

1. Segment text of complete reviews into sentences & normalize text
2. First-order phrase modeling → apply first-order phrase model to transform sentences
3. Second-order phrase modeling → apply second-order phrase model to transform sentences
4. Apply text normalization and second-order phrase model to text of complete reviews

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

In [75]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename), batch_size=10000, n_threads=4):
        
        for sent in parsed_review.sents:
            #the problem is here where token.lemma_ transforms the pronouns
            yield u' '.join([token.lemma_ for token in sent if not punct_space(token)])

In [76]:
unigram_sentences_filepath = 'docket_texts/unigram_sentences_all.txt'

In [77]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if True:
    with codecs.open(unigram_sentences_filepath, 'w', encoding = 'utf_8') as f:
        for sentence in lemmatized_sentence_corpus(docket_txt_filepath):
            f.write(sentence + '\n')

Wall time: 9.38 s


In [80]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [87]:
print('original text:')
print(df['Docket Text'].iloc[0])
print(df['Docket Text'].iloc[1])

print('\nunigram_sentence:')
for unigram_sentence in it.islice(unigram_sentences, 0, 10):
    print(' '.join(unigram_sentence))
    print('')

original text:
COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)
Case assigned to Judge Ann M Donnelly and Magistrate Judge Vera M. Scanlon. Please download and review the Individual Practices of the assigned Judges, located on our website. Attorneys are responsible for providing courtesy copies to judges where their Individual Practices require such. (Bowens, Priscilla) (Entered: 03/11/2016)

unigram_sentence:
complaint against cardiogenics holdings inc. filing fee $ 400 receipt number 0207

8445206 be the disclosure statement on civil cover sheet complete -yes file by lg capital funding llc

steinmetz michael

additional attachment(s add on 3/11/2016 1 civil cover sheet 2 propose summon

bow

In [88]:
bigram_model_filepath = 'docket_texts/bigram_model_all'

In [90]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if True:

    bigram_model = Phrases(unigram_sentences)

    bigram_model.save(bigram_model_filepath)
    
# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)

Wall time: 92.2 ms


In [91]:
bigram_sentences_filepath = 'docket_texts/bigram_sentences_all.txt'

In [92]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if True:

    with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for unigram_sentence in unigram_sentences:
            
            bigram_sentence = u' '.join(bigram_model[unigram_sentence])
            
            f.write(bigram_sentence + '\n')



Wall time: 210 ms


In [93]:
bigram_sentences = LineSentence(bigram_sentences_filepath)
print('unigram length = {}, bigram length = {}'.format(len(list(unigram_sentences)), len(list(bigram_sentences))))

unigram length = 3456, bigram length = 3456


In [97]:
start = 0
finish = 10
print('original text:')
print(df['Docket Text'].iloc[0])
print(df['Docket Text'].iloc[1])

print('\nunigram sentence:')
for unigram_sentence in it.islice(unigram_sentences, 0, 10):
    print(' '.join(unigram_sentence))
print('\nbigram sentence:')
for bigram_sentence in it.islice(bigram_sentences, start, finish):
    print(' '.join(bigram_sentence))

original text:
COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)
Case assigned to Judge Ann M Donnelly and Magistrate Judge Vera M. Scanlon. Please download and review the Individual Practices of the assigned Judges, located on our website. Attorneys are responsible for providing courtesy copies to judges where their Individual Practices require such. (Bowens, Priscilla) (Entered: 03/11/2016)

unigram sentence:
complaint against cardiogenics holdings inc. filing fee $ 400 receipt number 0207
8445206 be the disclosure statement on civil cover sheet complete -yes file by lg capital funding llc
steinmetz michael
additional attachment(s add on 3/11/2016 1 civil cover sheet 2 propose summon
bowens 

In [98]:
trigram_model_filepath = 'docket_texts/trigram_model_all'

In [100]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if True:

    trigram_model = Phrases(bigram_sentences)

    trigram_model.save(trigram_model_filepath)
    
# load the finished model from disk
trigram_model = Phrases.load(trigram_model_filepath)

Wall time: 87.3 ms


In [101]:
trigram_sentences_filepath = 'docket_texts/trigram_sentences_all.txt'

In [102]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if True:

    with codecs.open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for bigram_sentence in bigram_sentences:
            
            trigram_sentence = u' '.join(trigram_model[bigram_sentence])
            
            f.write(trigram_sentence + '\n')



Wall time: 155 ms


In [103]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [104]:
start = 0
finish = 10
print('original text:')
print(df['Docket Text'].iloc[0])
print(df['Docket Text'].iloc[1])

print('\nunigram sentence:')
for unigram_sentence in it.islice(unigram_sentences, 0, 10):
    print(' '.join(unigram_sentence))
print('\nbigram sentence:')
for bigram_sentence in it.islice(bigram_sentences, start, finish):
    print(' '.join(bigram_sentence))
print('\ntrigram sentence:')
for trigram_sentence in it.islice(trigram_sentences, start, finish):
    print(' '.join(trigram_sentence))

original text:
COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)
Case assigned to Judge Ann M Donnelly and Magistrate Judge Vera M. Scanlon. Please download and review the Individual Practices of the assigned Judges, located on our website. Attorneys are responsible for providing courtesy copies to judges where their Individual Practices require such. (Bowens, Priscilla) (Entered: 03/11/2016)

unigram sentence:
complaint against cardiogenics holdings inc. filing fee $ 400 receipt number 0207
8445206 be the disclosure statement on civil cover sheet complete -yes file by lg capital funding llc
steinmetz michael
additional attachment(s add on 3/11/2016 1 civil cover sheet 2 propose summon
bowens 

In [133]:
#write trigram to file
trigram_dockets_filepath = 'docket_texts/trigram_transformed_dockets_all.txt'

In [135]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if True:

    with codecs.open(trigram_dockets_filepath, 'w', encoding= 'utf_8') as f:
        
        for parsed_review in nlp.pipe(line_review(docket_txt_filepath), batch_size = 10000, n_threads = 4):
            
            # lemmatize the text, removing punctuation and whitespace
            unigram_review = [token.lemma_ for token in parsed_review if not punct_space(token)]
            
            # apply the first-order and second-order phrase models
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
            
            # remove any remaining stopwords
            trigram_review = [term for term in trigram_review if term not in STOP_WORDS]
            
            #additionally we need to remove those pronouns
            #trigram_review = [term for term in trigram_review if term != '-PRON-']
            
            # write the transformed review as a line in the new file
            trigram_review = ' '.join(trigram_review)
            f.write(trigram_review + '\n')



Wall time: 9.06 s


In [138]:
# extracting a review for testing
n_th_review = 10
with codecs.open(docket_txt_filepath, encoding = 'utf_8') as f:
    sample_review = list(it.islice(f, n_th_review, n_th_review + 1))[0]
    
print('Original:' + '\n')
print(sample_review)

    
# the transformed file shouldn't have any pronouns in it
print('----' + '\n')
print('Transformed:' + '\n') 

with codecs.open(trigram_dockets_filepath, encoding = 'utf_8') as f:
    for review in it.islice(f, n_th_review, n_th_review + 1):
        print('corresponding transformation: ')
        print(review)

Original:

MOTION for Extension of Time to File Answer by Cardiogenics Holdings, Inc. (Kogan, Simon) Modified on 4/13/2016 to add motion (Quinlan, Krista). (Entered: 04/12/2016)

----

Transformed:

corresponding transformation: 
motion_for_extension time file answer cardiogenics_holdings_inc. kogan_simon modified_on 4/13/2016 add motion quinlan_krista enter 04/12/2016



## we are finally going to topic model with LDA (Latent Dirichlet Allocation)

In [139]:
trigram_dictionary_filepath = 'docket_texts/trigram_dict_all.dict'

In [140]:
%%time

#some dictionary hyperparameters:
no_below = 5
no_above = 0.8

# this is a bit time consuming - make the if statement True
# if you want to learn the dictionary yourself.
if True:

    trigram_reviews = LineSentence(trigram_dockets_filepath)

    # learn the dictionary by iterating over all of the reviews
    trigram_dictionary = Dictionary(trigram_reviews)
    
    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    trigram_dictionary.filter_extremes(no_below = no_below, no_above = no_above) #this step is questionable. May need to change the parameters
    trigram_dictionary.compactify()

    trigram_dictionary.save(trigram_dictionary_filepath)
    
# load the finished dictionary from disk
trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

Wall time: 49.1 ms


### Turning the trigram dictionary into bag-of-words corpus

In [141]:
trigram_bow_filepath = 'docket_texts/trigram_bow_corpus_all.mm'

In [142]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)

In [143]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to build the bag-of-words corpus yourself.
if True:

    # generate bag-of-words representations for
    # all reviews and save them as a matrix
    MmCorpus.serialize(trigram_bow_filepath, trigram_bow_generator(trigram_sentences_filepath))
    
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)

Wall time: 80.2 ms


In [144]:
lda_model_filepath = 'docket_texts/lda_model_all'

### LDA Model Training
note that we need couple of things:
1. dictionary
2. bow corpus
3. numbre of topics

In [130]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to train the LDA model yourself.
if True:

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(trigram_bow_corpus, num_topics = 50, id2word = trigram_dictionary, workers = 3)
    
    lda.save(lda_model_filepath)
    
# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

Wall time: 27.2 s


### This is to check out which token are assigned to which topic

In [131]:
def explore_topic(topic_number, topn = 10):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print('{:20} {}'.format('term', 'frequency') + '\n')

    for term, frequency in lda.show_topic(topic_number, topn = topn):
        print('{:20} {:.3f}'.format(term, round(frequency, 3)))

In [132]:
#try looking at different topics' constitutinos
for i in range(5):
    print("\n topic {}'s tokens:".format(i))
    explore_topic(topic_number = i)
#seems like topic:
#0: 
#1: 
#2: 
#3: 
#4: 
#etc...


 topic 0's tokens:
term                 frequency

to                   0.038
enter                0.036
of                   0.035
on                   0.032
sign_by              0.024
capital_funding_llc  0.024
the                  0.022
by                   0.020
document_file_by_lg  0.020
kehrli_kevin         0.018

 topic 1's tokens:
term                 frequency

the                  0.043
and                  0.029
on                   0.023
for                  0.021
a                    0.020
be                   0.017
of                   0.016
ex                   0.015
to                   0.015
file                 0.015

 topic 2's tokens:
term                 frequency

to                   0.057
by                   0.032
enter                0.027
motion               0.025
the                  0.025
and                  0.016
llc                  0.016
by_lg_capital_funding 0.016
-PRON-               0.016
file                 0.015

 topic 3's tokens:
term         

In [None]:
import os
import codecs
import json

import spacy
import pandas as pd
import itertools as it

from gensim.models import Phrases #seems like this is slower, but Phaser was not compatible to our code? need some research
from gensim.models.word2vec import LineSentence
from spacy.lang.en.stop_words import STOP_WORDS

from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle

nlp = spacy.load('en')

In [None]:
def lda_description(review_text, min_topic_freq = 0.05):
    """
    accept the original text of a review and 
    1. parse it with spaCy,
    2. apply text pre-proccessing steps, 
    3. create a bag-of-words representation, 
    4. create an LDA representation, and
    5. print a sorted list of the top topics in the LDA representation
    """
    
    # parse the review text with spaCy
    parsed_review = nlp(review_text)
    
    # lemmatize the text and remove punctuation and whitespace
    unigram_review = [token.lemma_ for token in parsed_review if not punct_space(token)]
    
    # apply the first-order and secord-order phrase models
    bigram_review = bigram_model[unigram_review]
    trigram_review = trigram_model[bigram_review]
    
    # remove any remaining stopwords
    trigram_review = [term for term in trigram_review if term not in STOP_WORDS]
    
    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)
    
    # create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly related topics first
    review_lda.sort(key = lambda tup: tup[1], reverse = True)
    
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        print('{:25} {}'.format(topic_names[topic_number], round(freq, 3)))