## Ultimate Goal: 
Build a bot so it assists attorneys in responding to docket texts (e-mails). The bot will achieve the following:
- scan attorney e-mail inboxes for docket texts
- according to docket texts classifications or urgencies, react to docket texts by:
    - do nothing
    - drafting emails

## Difficulties: 
We currently don't have classifications or urgency lables on docket texts to do a supervised learning exercise. Two possible scenarios:
- have attorney begin to label docket texts, and then perform machine learning
- ask the machine to perform unsupervised ML on docket texts (topic modeling), and see if machine produces usable/reliable classifications/urgencies labels

## Notebook Goal:  
Topic modeling on docket texts. We will explore NLP capabilities, phrase identification, as well as LDA and other topic modeling algorithms. Background, NLP and modeling steps are laid out and explained.

### LDA background
LDA assumes that documents are probability distribution over laten topics.
Topics are probability distribution over words.
LDA takes a number of documents. It assumes that the words in each document are related. It then tries to figure out the 'recipe' for how each document could have been created. We just need to tell the model how many topics to construct and it uses that 'recipe' to generate topic and word distributions over a corpus. Based on that output, we can identify similar documents within the corpus.

### In order to understand the LDA process, we have to know how LDA assumes topics are generated:
1. determine the number of words in the document
2. choose a topic mixture for the document over a fixed set of topics (ie. topic A 20%, topic B 50%, etc)
3. generate words in the document by:
    - pick a topic based on the document's multinomial distribution
    - pick a word based on the topic's multinomial distribution

### Working backwards
Suppose you have a corpus of documents, and you want LDA to learn the topic representatino of K topics in each document and the word distribution of each topic. LDA would backtrack from the document level to identify topics that are likely to have generated the corpus.

### LDA's Magic
1. randomly assign each word in each documen tto one of the K topics
2. for each document
    - assume that all topic assignments except for the current one are correct
    - claculate two proportions:
        1. proportion of words in document d that are currently assigned to topic t = p(topic t | document d)
        2. proportion of assignments to topic t over all documents that come from this word w = p(word w | topic t)
    - multiply those two proportions and assign w a new topic based on that probability. p(topic t | document d) * p(word w | topic t)
3. eventually we'll reach a steady state where assignments make sense

### Step 0.a: Import and examine corpus
Our goal here is to take a look at what our data looks like and libraries are capable of

In [121]:
#standard libraries for basic operations
import pandas as pd
from bs4 import BeautifulSoup
import os
import codecs
import itertools as it

#NLP library
import spacy
nlp = spacy.load('en')

#LDA libraries
from gensim.models import Phrases #seems like this is slower, but Phaser was not compatible to our code? need some research
from gensim.models.word2vec import LineSentence
from spacy.lang.en.stop_words import STOP_WORDS
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

#visualization libraries
import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle

In [26]:
#import corpus/docket texts from html to pandas DataFrame
def grab_dockets():
    files = []
    #get all .html files in the folder (all docket files are in .html)
    for file in os.listdir('docket_texts/'):
        if file.endswith('.html'):
            files.append(os.path.join('docket_texts/', file))

    df_docket_texts = pd.DataFrame()
    
    for i in range(len(files)): #gather all docket texts
    #for i in [0, 1]: #for testing purposes
        
        content = codecs.open(files[i], 'r', 'utf-8').read()
        #use beautiful soup to get the case ID
        soup = BeautifulSoup(content, 'lxml')
        case_id = str(soup.find_all('h3'))    
        bookmark1 = case_id.find('CASE #:') + len('CASE #:')
        bookmark2 = case_id.find('</h3>')
        case_id = case_id[bookmark1:bookmark2]

        #use pandas to grab tables in the html files
        docket_tables = pd.read_html(content)

        #error checking: gotta do this because there's different length of docket_list/
        #usually docket texts are in docket_list[3], but not always
        n = 0
        while docket_tables[n].isin(['Docket Text']).sum().sum() == 0:
            #print(n, docket_tables[n].isin(['Docket Text']).sum().sum())
            n += 1
                        
        #print(i, files[i])
        #print(docket_tables[n].head())

        #docket_tables[n] is the docket text table
        new_header = docket_tables[n].iloc[0]
        docket_tables[n] = docket_tables[n][1:]
        docket_tables[n].columns = new_header
        
        docket_tables[n]['#'] = pd.to_numeric(docket_tables[n]['#'],
                                              downcast = 'signed', errors = 'coerce')
        docket_tables[n]['Date Filed'] = pd.to_datetime(docket_tables[n]['Date Filed'])
        docket_tables[n]['Case ID'] = case_id

        df_docket_texts = pd.concat([df_docket_texts, docket_tables[n]])
    #reorder a column
    cols = list(df_docket_texts.columns)
    df_docket_texts = df_docket_texts[[cols[-1]] + cols[:-1]]
    
    print('current docket text table size/shape: {}'.format(df_docket_texts.shape))
    return df_docket_texts

In [27]:
#current docket text table size/shape: (721, 4), as of 2018-04-18
df = grab_dockets()
df.head()

current docket text table size/shape: (721, 4)


Unnamed: 0,Case ID,Date Filed,#,Docket Text
1,1:16-cv-01215-AMD-SJB,2016-03-10,1.0,"COMPLAINT against Cardiogenics Holdings, Inc. ..."
2,1:16-cv-01215-AMD-SJB,2016-03-10,,Case assigned to Judge Ann M Donnelly and Magi...
3,1:16-cv-01215-AMD-SJB,2016-03-10,2.0,"Summons Issued as to Cardiogenics Holdings, In..."
4,1:16-cv-01215-AMD-SJB,2016-03-11,,NOTICE - emailed attorney regarding missing se...
5,1:16-cv-01215-AMD-SJB,2016-03-11,3.0,In accordance with Rule 73 of the Federal Rule...


In [234]:
docket_original = list(df['Docket Text'])
print(docket_original[0:2])

['COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)', 'Case assigned to Judge Ann M Donnelly and Magistrate Judge Vera M. Scanlon. Please download and review the Individual Practices of the assigned Judges, located on our website. Attorneys are responsible for providing courtesy copies to judges where their Individual Practices require such. (Bowens, Priscilla) (Entered: 03/11/2016)']


In [29]:
docket_txt_filepath = 'docket_texts/docket_text_all.txt'

In [240]:
%%time

# writing our docket texts to a text file: docket_txt_filepath = 'docket_texts/docket_text_all.txt'
# if we need to update file, then True
if True:
    
    docket_count = 0

    # create & open a new file in write mode
    with codecs.open(docket_txt_filepath, 'w', encoding = 'utf_8') as docket_txt_file:


        for i in range(len(docket_original)):
            #note that I used replace here to get rid of '(s)' in the documents as SpaCy NLP wasn't able to handle it
            docket_txt_file.write(docket_original[i].replace('(s)', '') + '\n')
            docket_count += 1

    print('Text from {:,} docket texts written to the new txt file'.format(docket_count))
    
else:
    with codecs.open(docket_txt_filepath, encoding = 'utf_8') as docket_txt_file:
        for review_count, line in enumerate(docket_txt_file):
            pass
        
    print(u'Text from {:,} restaurant reviews in the txt file.'.format(docket_count))

Text from 721 docket texts written to the new txt file
Wall time: 12.5 ms


### Step 0.b: Test out NLP
Let's see what SpaCy is capabile of. The goal is to remove what we don't need from the corpus, in a systematic way. i.e. remove dates, numbers, pronouns, etc

In [241]:
# extracting a review for testing
n_th_review = 0
with codecs.open(docket_txt_filepath, encoding = 'utf_8') as f:
    sample_review = list(it.islice(f, n_th_review, n_th_review + 1))[0]

print(sample_review)

#using NLP (SpaCy English) to parse the review text
parsed_review = nlp(sample_review)

#for example, we can print out the sentences from the parsed object
#why were the sentences split by a dash?!
for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')

COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)

Sentence 1:
COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207

Sentence 2:
-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC.

Sentence 3:
(Steinmetz, Michael) (Additional attachment added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla).

Sentence 4:
(Entered: 03/10/2016)




In [242]:
#we can also look at entities, note that this result is not great
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

Entity 1: COMPLAINT - ORG

Entity 2: Cardiogenics Holdings, Inc. - ORG

Entity 3: 400 - MONEY

Entity 4: the Disclosure Statement on Civil Cover Sheet - ORG

Entity 5: LG Capital Funding - ORG

Entity 6: LLC - PERSON

Entity 7: Steinmetz - PERSON

Entity 8: Michael - PERSON

Entity 9: 3/11/2016 - DATE

Entity 10: # 1 - MONEY

Entity 11: Civil Cover Sheet - ORG

Entity 12: Bowens - PERSON

Entity 13: Priscilla - PERSON

Entity 14: Entered - PERSON

Entity 15: 
 - GPE



In [243]:
#looking at parts of speech
token_text = [token.orth_ for token in parsed_review] #Verbatim text content 
token_lemma = [token.lemma_ for token in parsed_review] #lemmatization
token_pos = [token.pos_ for token in parsed_review] #part-of-speech.

pd.DataFrame(list(zip(token_text, token_lemma, token_pos)), columns=['token_text', 'lemma', 'part_of_speech'])
#remove when token.pos_ in ['PUNCT', 'SYM', 'NUM', 'PUNCT', 'SPACE']

Unnamed: 0,token_text,lemma,part_of_speech
0,COMPLAINT,complaint,PROPN
1,against,against,ADP
2,Cardiogenics,cardiogenics,PROPN
3,Holdings,holdings,PROPN
4,",",",",PUNCT
5,Inc.,inc.,PROPN
6,filing,filing,NOUN
7,fee,fee,NOUN
8,$,$,SYM
9,400,400,NUM


In [244]:
#try out lemmatization. this is quite important as it returns words into its original forms. However it's not perfect. We still see things such as 'attachment(s'
token_lemma = [token.lemma_ for token in parsed_review] # Base form of the token, with no inflectional suffixes.
token_shape = [token.shape_ for token in parsed_review] # Transform of the tokens's string, to show orthographic features. For example, "Xxxx" or "dd".

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)), columns=['token_text', 'token_lemma', 'token_shape'])

Unnamed: 0,token_text,token_lemma,token_shape
0,COMPLAINT,complaint,XXXX
1,against,against,xxxx
2,Cardiogenics,cardiogenics,Xxxxx
3,Holdings,holdings,Xxxxx
4,",",",",","
5,Inc.,inc.,Xxx.
6,filing,filing,xxxx
7,fee,fee,xxx
8,$,$,$
9,400,400,ddd


In [245]:
# trying out different attributes
token_entity_type = [token.ent_type_ for token in parsed_review] #Named entity type.
token_entity_iob = [token.ent_iob_ for token in parsed_review] #IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.

pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
             columns=['token_text', 'entity_type', 'inside_outside_begin'])

Unnamed: 0,token_text,entity_type,inside_outside_begin
0,COMPLAINT,ORG,B
1,against,,O
2,Cardiogenics,ORG,B
3,Holdings,ORG,I
4,",",ORG,I
5,Inc.,ORG,I
6,filing,,O
7,fee,,O
8,$,,O
9,400,MONEY,B


In [246]:
# a bigger summary of the NLP tokens
token_attributes = [(token.orth_, #Verbatim text content
                     token.prob, #Smoothed log probability estimate of token's type
                     token.is_stop, #Is the token part of a "stop list"
                     token.is_punct, #Is the token punctuation
                     token.is_space, #Does the token consist of whitespace characters
                     token.like_num, #Does the token represent a number
                     token.is_oov) #Is the token out-of-vocabulary
                    for token in parsed_review]

df_temp = pd.DataFrame(token_attributes,
                       columns=['text', 'log_probability', 'stop?', 'punctuation?',
                                'whitespace?', 'number?', 'out of vocab.?'])

df_temp.loc[:, 'stop?':'out of vocab.?'] = (df_temp.loc[:, 'stop?':'out of vocab.?']
                                            .applymap(lambda x: u'Yes' if x else u''))
                                               
df_temp

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,COMPLAINT,-20.0,,,,,Yes
1,against,-20.0,Yes,,,,Yes
2,Cardiogenics,-20.0,,,,,Yes
3,Holdings,-20.0,,,,,Yes
4,",",-20.0,,Yes,,,Yes
5,Inc.,-20.0,,,,,Yes
6,filing,-20.0,,,,,Yes
7,fee,-20.0,,,,,Yes
8,$,-20.0,,,,,Yes
9,400,-20.0,,,,Yes,Yes


In [254]:
# after exploring what SpaCy can do for us, we'll lemmatize our corpus using these functions
def unnecessary(token):
    """
    helper function to eliminate tokens
    when token.pos_ in ['PUNCT', 'SYM', 'NUM', 'PUNCT', 'SPACE'] or when the .lemma turns out to be ['-PRON-']
    """
    if token.pos_ in ['PUNCT', 'SYM', 'NUM', 'PUNCT', 'SPACE']:
        return True
    elif token.lemma_ in ['-PRON-']:
        return True
    else:
        return False

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename), batch_size = 10000, n_threads = 4):
        
        for sent in parsed_review.sents:
            #the problem is here where token.lemma_ transforms the pronouns
            yield ' '.join([token.lemma_ for token in sent if not unnecessary(token)])

### Step 0.c: Try gensim to do phrase modeling

1. Segment text of complete reviews into sentences & normalize text
2. First-order phrase modeling → apply first-order phrase model to transform sentences
3. Second-order phrase modeling → apply second-order phrase model to transform sentences
4. Apply text normalization and second-order phrase model to text of complete reviews

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

In [255]:
unigram_sentences_filepath = 'docket_texts/unigram_sentences_all.txt'

In [256]:
%%time
# turn the lemmatized corpus into unigram sentences
with codecs.open(unigram_sentences_filepath, 'w', encoding = 'utf_8') as f:
    for sentence in lemmatized_sentence_corpus(docket_txt_filepath):
        f.write(sentence + '\n')

Wall time: 8.57 s


In [257]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [259]:
#let's do some comparision between the original text and unigram sentences, shouldn't be that different.
print('original text:')
print(df['Docket Text'].iloc[0])
#print(df['Docket Text'].iloc[1])

print('\nunigram_sentence:')
for unigram_sentence in it.islice(unigram_sentences, 0, 10):
    print(' '.join(unigram_sentence))
    print('')

original text:
COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)

unigram_sentence:
complaint against cardiogenics holdings inc. filing fee receipt number

be the disclosure statement on civil cover sheet complete -yes file by lg capital funding llc

steinmetz michael additional attachment add on civil cover sheet propose summon bowens priscilla

enter

case assign to judge ann m donnelly and magistrate judge vera m. scanlon

please download and review the individual practices of the assign judges locate on website

attorney be responsible for provide courtesy copy to judge where individual practices require such

bowens priscilla

enter

summon issue as to cardiogenics holdings inc



In [260]:
bigram_model_filepath = 'docket_texts/bigram_model_all' 

In [261]:
%%time

# store our bigram model
bigram_model = Phrases(unigram_sentences)
bigram_model.save(bigram_model_filepath)
    
# load the finished model from disk if we don't want to run this again
#bigram_model = Phrases.load(bigram_model_filepath)

Wall time: 80.2 ms


In [262]:
bigram_sentences_filepath = 'docket_texts/bigram_sentences_all.txt'

In [263]:
%%time

# apply the bigram model, and write it to file
with codecs.open(bigram_sentences_filepath, 'w', encoding = 'utf_8') as f:
    for unigram_sentence in unigram_sentences:
        bigram_sentence = ' '.join(bigram_model[unigram_sentence])
        f.write(bigram_sentence + '\n')



Wall time: 161 ms


In [264]:
bigram_sentences = LineSentence(bigram_sentences_filepath)
print('unigram length = {}, bigram length = {}'.format(len(list(unigram_sentences)), len(list(bigram_sentences))))

unigram length = 3439, bigram length = 3439


In [265]:
#original v. unigram v. bigram. Some phrases should be combined already
start = 0
finish = 10
print('original text:')
print(df['Docket Text'].iloc[0])
print(df['Docket Text'].iloc[1])

print('\nunigram sentence:')
for unigram_sentence in it.islice(unigram_sentences, 0, 10):
    print(' '.join(unigram_sentence))
print('\nbigram sentence:')
for bigram_sentence in it.islice(bigram_sentences, start, finish):
    print(' '.join(bigram_sentence))

original text:
COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)
Case assigned to Judge Ann M Donnelly and Magistrate Judge Vera M. Scanlon. Please download and review the Individual Practices of the assigned Judges, located on our website. Attorneys are responsible for providing courtesy copies to judges where their Individual Practices require such. (Bowens, Priscilla) (Entered: 03/11/2016)

unigram sentence:
complaint against cardiogenics holdings inc. filing fee receipt number
be the disclosure statement on civil cover sheet complete -yes file by lg capital funding llc
steinmetz michael additional attachment add on civil cover sheet propose summon bowens priscilla
enter
case assign to judg

In [266]:
trigram_model_filepath = 'docket_texts/trigram_model_all'

In [267]:
%%time

# again, using Phrases to attach more words to phrases already formed
trigram_model = Phrases(bigram_sentences)
trigram_model.save(trigram_model_filepath)

# load the finished model from disk
#trigram_model = Phrases.load(trigram_model_filepath)

Wall time: 68.2 ms


In [268]:
trigram_sentences_filepath = 'docket_texts/trigram_sentences_all.txt'

In [269]:
%%time

with codecs.open(trigram_sentences_filepath, 'w', encoding = 'utf_8') as f:
    for bigram_sentence in bigram_sentences:
        trigram_sentence = ' '.join(trigram_model[bigram_sentence])
        f.write(trigram_sentence + '\n')



Wall time: 227 ms


In [270]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [271]:
start = 0
finish = 15
print('original text:')
print(df['Docket Text'].iloc[0],'\n')
print(df['Docket Text'].iloc[1],'\n')
print(df['Docket Text'].iloc[2],'\n')
print(df['Docket Text'].iloc[3],'\n')

print('\nUNIGRAM Sentence:')
for unigram_sentence in it.islice(unigram_sentences, start, finish):
    print(' '.join(unigram_sentence))
print('\nBIGRAM Sentence:')
for bigram_sentence in it.islice(bigram_sentences, start, finish):
    print(' '.join(bigram_sentence))
print('\nTRIGRAM Sentence:')
for trigram_sentence in it.islice(trigram_sentences, start, finish):
    print(' '.join(trigram_sentence))

original text:
COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016) 

Case assigned to Judge Ann M Donnelly and Magistrate Judge Vera M. Scanlon. Please download and review the Individual Practices of the assigned Judges, located on our website. Attorneys are responsible for providing courtesy copies to judges where their Individual Practices require such. (Bowens, Priscilla) (Entered: 03/11/2016) 

Summons Issued as to Cardiogenics Holdings, Inc.. (Bowens, Priscilla) (Entered: 03/11/2016) 

NOTICE - emailed attorney regarding missing second page of the civil cover sheet. (Bowens, Priscilla) (Entered: 03/11/2016) 


UNIGRAM Sentence:
complaint against cardiogenics holdings inc. filing fee receipt

In [272]:
#write trigram to file
trigram_dockets_filepath = 'docket_texts/trigram_transformed_dockets_all.txt'

### Step 1: Combine all NLP and phrase modeling preparation steps

1. Apply SpaCy NLP cleaning to the docket texts
2. Apply bigram model to the clean texts
3. Apply trigram model to the bigram outputs
4. Write to and export resulting file

In [273]:
%%time

#some feedbacks:
#remove all numbers and dates
#remove pronouns
#remove names
#lets throw the corresponding transformation back into parser again to see results

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
with codecs.open(trigram_dockets_filepath, 'w', encoding= 'utf_8') as f:

    for parsed_review in nlp.pipe(line_review(docket_txt_filepath), batch_size = 10000, n_threads = 4):

        # lemmatize the text, removing punctuation and whitespace
        unigram_review = [token.lemma_ for token in parsed_review if not unnecessary(token)]

        # remove

        # apply the first-order and second-order phrase models
        bigram_review = bigram_model[unigram_review]
        trigram_review = trigram_model[bigram_review]

        # remove any remaining stopwords
        trigram_review = [term for term in trigram_review if term not in STOP_WORDS]

        #additionally we need to remove those pronouns
        #trigram_review = [term for term in trigram_review if term != '-PRON-']

        # write the transformed review as a line in the new file
        trigram_review = ' '.join(trigram_review)
        f.write(trigram_review + '\n')



Wall time: 8.72 s


In [275]:
# extracting a review for comparision purposes
n_th_review = 0
with codecs.open(docket_txt_filepath, encoding = 'utf_8') as f:
    sample_review = list(it.islice(f, n_th_review, n_th_review + 1))[0]
    
print('Original:' + '\n')
print(sample_review)

    
# the transformed file shouldn't have any pronouns in it
print('----' + '\n')
print('Transformed:' + '\n') 

with codecs.open(trigram_dockets_filepath, encoding = 'utf_8') as f:
    for review in it.islice(f, n_th_review, n_th_review + 1):
        print('corresponding transformation: ')
        print(review)

Original:

COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)

----

Transformed:

corresponding transformation: 
complaint_against cardiogenics_holdings_inc. filing_fee_receipt_number disclosure_statement civil_cover_sheet complete_-yes file lg_capital_funding_llc steinmetz michael additional attachment add civil_cover_sheet propose summon bowens_priscilla enter



### Step 2.a: Loading trigram texts into a dictionary. Filter out some extreme values along the way

In [276]:
trigram_dictionary_filepath = 'docket_texts/trigram_dict_all.dict'

In [277]:
%%time

#some dictionary hyperparameters:
no_below = 10 #reference is 10
no_above = 0.4 #reference is 0.4



trigram_reviews = LineSentence(trigram_dockets_filepath)

# learn the dictionary by iterating over all of the reviews
trigram_dictionary = Dictionary(trigram_reviews)

# filter tokens that are very rare otrigram_reviewsr too common from
# the dictionary (filter_extremes) and reassign integer ids (compactify)
trigram_dictionary.filter_extremes(no_below = no_below, no_above = no_above) #this step is questionable. May need to change the parameters
trigram_dictionary.compactify()

trigram_dictionary.save(trigram_dictionary_filepath)
    
# load the finished dictionary from disk
#trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

Wall time: 31.6 ms


### Step 2.b: Load into bag-of-words

In [278]:
trigram_bow_filepath = 'docket_texts/trigram_bow_corpus_all.mm'

In [280]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)

In [281]:
%%time

# generate bag-of-words representations for
# all reviews and save them as a matrix
MmCorpus.serialize(trigram_bow_filepath, trigram_bow_generator(trigram_sentences_filepath))
    
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)

Wall time: 66.2 ms


In [282]:
lda_model_filepath = 'docket_texts/lda_model_all'

### Step 2.c LDA Model Training
note that we need couple of things:
1. dictionary
2. bow corpus
3. numbre of topics

### During LDA modeling, some parameters we may need to be mindful of:
- alpha (parameter of the Dirichlet pior of the per-document topic distribution)
   - high: each document will contain many topics
   - low: each document iwll have distinct topics

- beta (parameter of the Dirichlet prior on the per-topic word distribution)
   - high: each topic will contain many words
   - low: each topic will contain few words

- theta (topic distribution for document m)
- z (topic for the n-th word in document m)
- w (specific word)

In [284]:
%%time

#note that we can change # of topics any time
num_topics = 10

with warnings.catch_warnings():
    warnings.simplefilter('ignore')

    # workers => sets the parallelism, and should be
    # set to your number of physical cores minus one
    lda = LdaMulticore(trigram_bow_corpus, num_topics = num_topics, id2word = trigram_dictionary, workers = 4)

lda.save(lda_model_filepath)

# load the finished LDA model from disk
#lda = LdaMulticore.load(lda_model_filepath)

Wall time: 5.39 s


### Step 2.d This is to check out which tokens are assigned to which topic. Use Intuition to provide a description for the topic

In [285]:
def explore_topic(topic_number, topn = 10):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print('{:20} {}'.format('term', 'frequency') + '\n')

    for term, frequency in lda.show_topic(topic_number, topn = topn):
        print('{:20} {:.3f}'.format(term, round(frequency, 3)))

In [286]:
# try looking at different topics' constitutinos
# we might want to 
for i in range(5):
    print("\n topic {}'s tokens:".format(i))
    explore_topic(topic_number = i)
#seems like topic:
#0: 
#1: 
#2: 
#3: 
#4: 
#etc...


 topic 0's tokens:
term                 frequency

lg_capital_funding_llc 0.073
document_file_by     0.063
order                0.058
letter               0.033
file                 0.028
notice               0.023
motion               0.021
kehrli_kevin         0.017
modify               0.015
civil_cover_sheet    0.015

 topic 1's tokens:
term                 frequency

order                0.050
exhibit              0.037
file                 0.034
counsel              0.034
document_file_by     0.029
notice               0.021
party                0.021
time                 0.018
letter               0.017
co._ltd.             0.016

 topic 2's tokens:
term                 frequency

counsel              0.048
kehrli_kevin         0.034
attachment           0.031
conference           0.031
scheduling_order     0.027
exhibit              0.027
order                0.024
direct               0.023
party                0.022
motion               0.018

 topic 3's tokens:
term        

### Step 2.e After looking at the token makeup for each topic, one can interpret a description for each topic. This will aid in future visualizations

In [288]:
#so these are fake as I didn't go through all the outputs
#until we provided a label:
topic_names = {}
for i in range(num_topics):
    topic_names[i] = 'Topic ' + str(i)

''' this is a place holder
topic_names = {0: 'mexican',
               1: 'menu',
               2: 'thai',
               3: 'steak',
               4: 'donuts & appetizers',
               5: 'specials',
               6: 'soup',
               7: 'wings, sports bar',
               8: 'foreign language',
               9: 'las vegas'}
'''
print(topic_names)

{0: 'Topic 0', 1: 'Topic 1', 2: 'Topic 2', 3: 'Topic 3', 4: 'Topic 4', 5: 'Topic 5', 6: 'Topic 6', 7: 'Topic 7', 8: 'Topic 8', 9: 'Topic 9'}


In [289]:
topic_names_filepath = 'docket_texts/topic_names.pkl'

with open(topic_names_filepath, 'wb') as f:
    pickle.dump(topic_names, f)

### Step 3 Proceed to use LDavis to visualize our results

In [290]:
LDAvis_data_filepath = 'docket_texts/ldavis_prepared'

In [291]:
%%time

LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus, trigram_dictionary)

with open(LDAvis_data_filepath, 'wb') as f:
    pickle.dump(LDAvis_prepared, f)

Wall time: 39.3 s


In [292]:
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

### What are we looking at?
- the plot is redered in 2D according to MDS algo. Topics that are similar should appear close together; those that are dis-similar should be far apart.
- the relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus
- when no topic is selected, the bar chart changes to show the top 30 most relevant terms for the selected topic. the relevance metric is controlled by the parameter λ. 
- λ close to 1 will rank the terms soley according to their probability within the topic.
- λ close to 0 will rank the terms solely according to their distinctiveness/exclusivity within the topic. (terms that only occur in this topic)
- Rolling over a term in the bar chart will cause the topic to resize, thereby showing the strength of the relationship between the topic and the selected term.

In [293]:
pyLDAvis.display(LDAvis_prepared)

### Step 4 Let's label some texts with topics

In [334]:
def get_sample_docket(docket_number):
    return list(it.islice(line_review(docket_txt_filepath), docket_number, docket_number + 1))[0]

In [335]:
def lda_description(docket_text, min_topic_freq = 0.05):
    '''
    accept the original text of a review and 
    1) parse it with spaCy,
    2) apply text pre-proccessing steps, 
    3) create a bag-of-words representation, 
    4) create an LDA representation, and
    5) print a sorted list of the top topics in the LDA representation
    '''
    output = []
    # parse the review text with spaCy
    parsed_docket = nlp(docket_text)
    
    # lemmatize the text and remove punctuation and whitespace
    unigram_docket = [token.lemma_ for token in parsed_review if not unnecessary(token)]
    
    # apply the first-order and secord-order phrase models
    bigram_docket = bigram_model[unigram_docket]
    trigram_docket = trigram_model[bigram_docket]
    
    # remove any remaining stopwords
    trigram_docket = [term for term in trigram_docket if term not in STOP_WORDS]
    
    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_docket)
    
    # create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly related topics first
    review_lda.sort(key = lambda tup: tup[1], reverse = True)
    
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        #print('{:25} {}'.format(topic_names[topic_number], round(freq, 3)))
        output.append((topic_names[topic_number], round(freq, 5)))
    return output

In [339]:
sample_docket = get_sample_docket(2)
print(sample_docket)
print(lda_description(sample_docket))

Summons Issued as to Cardiogenics Holdings, Inc.. (Bowens, Priscilla) (Entered: 03/11/2016)





[('Topic 4', 0.67041), ('Topic 1', 0.25377), ('Topic 7', 0.05168)]


In [313]:
df['topics'] = df['Docket Text'].apply(lda_description)



In [340]:
#export to csv
df.to_csv('docket_texts/docket_text_topics.csv')