## Notebook Goal:  
Using existing NLP and LDA methodologies to perform topic modeling on docket texts. Three hyperparameters to consider:
1. to remove organization or not in docket texts, so organizations themselves won't become topics.
2. to remove names or not in docket texts, so names themselves won't become topics.
3. variations in topic numbers: [2, 3, 5, 10]

Will then perform visualizations and model summary output on every permutation/iteration.

In [1]:
import nltk
from nltk.tag.stanford import StanfordNERTagger
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords

In [2]:
from gensim.models.word2vec import LineSentence
from gensim.models import Phrases
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

#visualization libraries
import pyLDAvis
import pyLDAvis.gensim



In [3]:
import os
java_path = 'C:/Program Files/Java/jdk-10.0.1/bin/java.exe'
os.environ['JAVAHOME'] = java_path

import pandas as pd
import numpy as np
import codecs
import itertools as it
from bs4 import BeautifulSoup
import warnings
import pickle
from collections import Counter
import re
import string

In [7]:
#import corpus/docket texts from html to pandas DataFrame
def grab_docket_test():
    files = []
    #get all .html files in the folder (all docket files are in .html)
    for file in os.listdir('docket_texts/Train/'):
        if file.endswith('.html'):
            files.append(os.path.join('docket_texts/Train/', file))

    df_docket_texts = pd.DataFrame()
    
    for i in range(len(files)): #gather all docket texts
    #for i in [0, 1]: #for testing purposes
        
        content = codecs.open(files[i], 'r', 'utf-8').read()
        #use beautiful soup to get the case ID
        soup = BeautifulSoup(content, 'lxml')
        case_id = str(soup.find_all('h3'))    
        bookmark1 = case_id.find('CASE #:') + len('CASE #:')
        bookmark2 = case_id.find('</h3>')
        case_id = case_id[bookmark1:bookmark2]

        #use pandas to grab tables in the html files
        docket_tables = pd.read_html(content)

        #error checking: gotta do this because there's different length of docket_list/
        #usually docket texts are in docket_list[3], but not always
        n = 0
        while docket_tables[n].isin(['Docket Text']).sum().sum() == 0:
            #print(n, docket_tables[n].isin(['Docket Text']).sum().sum())
            n += 1
                        
        #print(i, files[i])
        #print(docket_tables[n].head())

        #docket_tables[n] is the docket text table
        new_header = docket_tables[n].iloc[0]
        docket_tables[n] = docket_tables[n][1:]
        docket_tables[n].columns = new_header
        
        docket_tables[n]['#'] = pd.to_numeric(docket_tables[n]['#'],
                                              downcast = 'signed', errors = 'coerce')
        docket_tables[n]['Date Filed'] = pd.to_datetime(docket_tables[n]['Date Filed'])
        docket_tables[n]['Case ID'] = case_id

        df_docket_texts = pd.concat([df_docket_texts, docket_tables[n]])
    #reorder a column
    cols = list(df_docket_texts.columns)
    df_docket_texts = df_docket_texts[[cols[-1]] + cols[:-1]]
    
    print('current docket text table size/shape: {}'.format(df_docket_texts.shape))
    return df_docket_texts

### Pull from dir .html files

In [8]:
%%time
df = grab_docket_test()
docket_original = list(df['Docket Text'])
for i in range(5):
    print('docket text {}'.format(i))
    print(docket_original[i], '\n')

  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)
  self.parser.feed(markup)


current docket text table size/shape: (3244, 4)
docket text 0
COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016) 

docket text 1
Case assigned to Judge Ann M Donnelly and Magistrate Judge Vera M. Scanlon. Please download and review the Individual Practices of the assigned Judges, located on our website. Attorneys are responsible for providing courtesy copies to judges where their Individual Practices require such. (Bowens, Priscilla) (Entered: 03/11/2016) 

docket text 2
Summons Issued as to Cardiogenics Holdings, Inc.. (Bowens, Priscilla) (Entered: 03/11/2016) 

docket text 3
NOTICE - emailed attorney regarding missing second page of the civil cover sheet. (Bowens, Priscilla) (Entered: 03/11/2

### Used Stanford NER to identy Names and Entities

In [9]:
%%time
path_to_model = r'C:\Users\inves\AppData\Local\Programs\Python\Python35\Lib\site-packages\nltk\stanford-ner-2018-02-27\classifiers\english.all.3class.distsim.crf.ser.gz'
path_to_jar = r'C:\Users\inves\AppData\Local\Programs\Python\Python35\Lib\site-packages\nltk\stanford-ner-2018-02-27\stanford-ner.jar'
tagger = StanfordNERTagger(path_to_model, path_to_jar = path_to_jar)

output = []
#length = 10
length = len(docket_original)
for i in range(length):
    org_str = []
    name_str = []
    stripped_str1 = []
    stripped_str2 = []
    tokens = nltk.tokenize.word_tokenize(docket_original[i])
    #print(tokens)
    for label, token in zip(tagger.tag(tokens), tokens):
        #print(label)
        if label[1] == 'ORGANIZATION':
            org_str.append(label[0])
            stripped_str1.append('-ORG-')
        elif label[1] == 'PERSON':
            name_str.append(label[0])
            stripped_str1.append('-NAME-')
        else:
            stripped_str1.append(token)
            stripped_str2.append(token)
    
    output.append([docket_original[i],
                   ' '.join(org_str),
                   ' '.join(name_str),
                   ' '.join(stripped_str1),
                   ' '.join(stripped_str2)])

Wall time: 1h 31min 32s


In [13]:
NER_df = pd.DataFrame(output, columns = ['Original Docket Text', 'Organization Portion', 'Name Portion', 
                                         'Identifying Org and Name', 'Stripped Org and Name'])

### To re-build new_df, start here

In [14]:
new_df = NER_df.copy()

In [15]:
print(new_df.head())
docket_text_list = list(new_df['Stripped Org and Name'])

                                Original Docket Text  \
0  COMPLAINT against Cardiogenics Holdings, Inc. ...   
1  Case assigned to Judge Ann M Donnelly and Magi...   
2  Summons Issued as to Cardiogenics Holdings, In...   
3  NOTICE - emailed attorney regarding missing se...   
4  In accordance with Rule 73 of the Federal Rule...   

                              Organization Portion  \
0  Cardiogenics Holdings , Inc. LG Capital Funding   
1      Individual Practices of the assigned Judges   
2                            Cardiogenics Holdings   
3                                                    
4                                                    

                                      Name Portion  \
0             ( Steinmetz Michael Bowens Priscilla   
1  Ann M Donnelly Vera M. Scanlon Bowens Priscilla   
2                                 Bowens Priscilla   
3                                 Bowens Priscilla   
4                                 Bowens Priscilla   

             

In [16]:
def text_preprocess1(text):
    text = text.replace('/', ' ')
    text = re.sub("[\(\[].*?[\)\]]", "", text)
    text = text.replace('-', '')
    text = text.replace('(', ' ')
    text = text.replace(')', ' ')
    text = text.replace('(s)', 's')
    text = text.replace("'s", 's')
    text = text.replace('*', '')
    text = text.replace('', '')
    text = text.replace('<', '')
    text = text.replace('>', '')
    text = text.replace('\\', '')
    text = text.replace('&', ' ')
    
    table = text.maketrans(string.punctuation, len(string.punctuation) * ' ')
    text = text.translate(table)
    return text

def text_preprocess2(text):
    text = text.replace('.', '')
    return text

def remove_stop(sentence):
    output = []
    for word in sentence.split():
        if word not in set(stopwords.words('english')):
            output.append(word)
    return ' '.join(output)

keywords = pd.read_csv('docket_texts/keywords.csv', header = None)
keywords.columns = ['keywords']
keyword_list = list(keywords['keywords'])

In [17]:
print(docket_text_list[5], '\n')
docket_text_list = [text_preprocess1(sentence).lower() for sentence in docket_text_list]
docket_text_list = [text_preprocess2(sentence) for sentence in docket_text_list]
print(docket_text_list[5])
print('\nlength of docket text dataset: {}'.format(len(docket_text_list)))

This attorney case opening filing has been checked for quality control . See the attachment for corrections that were made , if any . ( , ) ( Entered : 03/11/2016 ) 

this attorney case opening filing has been checked for quality control   see the attachment for corrections that were made   if any    

length of docket text dataset: 3244


In [18]:
class Splitter(object):

    def __init__(self):
        self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self,text):

        # split into single sentence
        sentences = self.splitter.tokenize(text)
        # tokenization in each sentences
        tokens = [self.tokenizer.tokenize(remove_stop(sent)) for sent in sentences]
        return tokens


class LemmatizationWithPOSTagger(object):
    def __init__(self):
        pass
    def get_wordnet_pos(self,treebank_tag):
        """
        return WORDNET POS compliance to WORDENT lemmatization (a,n,r,v) 
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # As default pos in lemmatization is Noun
            return wordnet.NOUN

    def pos_tag(self,tokens):
        # find the pos tagginf for each tokens [('What', 'WP'), ('can', 'MD'), ('I', 'PRP') ....
        pos_tokens = [nltk.pos_tag(token) for token in tokens]

        # lemmatization using pos tagg   
        # convert into feature set of [('What', 'What', ['WP']), ('can', 'can', ['MD']), ... ie [original WORD, Lemmatized word, POS tag]
        pos_tokens = [ [(word, lemmatizer.lemmatize(word,self.get_wordnet_pos(pos_tag)), [pos_tag]) for (word,pos_tag) in pos] for pos in pos_tokens]
        return pos_tokens

In [19]:
%%time
lemmatizer = WordNetLemmatizer()
splitter = Splitter()
lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

lemma_docket_text_list = []
for docket_text in docket_text_list:
    #step 1 split document into sentence followed by tokenization
    tokens = splitter.split(docket_text)

    #step 2 lemmatization using pos tagger 
    lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)
    lemma_docket_text_list.append(lemma_pos_token)

Wall time: 1min 11s


In [20]:
print(len(lemma_docket_text_list)) #docket text document level
print(len(lemma_docket_text_list[0])) #docket text sentence level
print(len(lemma_docket_text_list[0][0])) #docket text word level
print(lemma_docket_text_list[0][0][0]) #docket text token level
print(lemma_docket_text_list[0][0][0][0]) #docket text tuple level

3244
1
27
('complaint', 'complaint', ['NN'])
complaint


In [21]:
#let's do a collection of what we have
collection = {}
for lemma_pos_token in lemma_docket_text_list:
    for sentence in lemma_pos_token:
        for token in sentence:
            #print(token[2][0])
            if token[2][0] not in list(collection.keys()):
                collection[token[2][0]] = []
                collection[token[2][0]].append(token[1])
            else:
                if token[1] not in collection[token[2][0]]:
                    collection[token[2][0]].append(token[1])

In [22]:
remove_pos = list(pd.read_excel('NLP_to_be_removed.xlsx', sheetname = 0, header = None)[0])
remove_word = list(pd.read_excel('NLP_to_be_removed.xlsx', sheetname = 1, header = None)[0])
remove_trigram = list(pd.read_excel('NLP_to_be_removed.xlsx', sheetname = 2, header = None)[0])

In [16]:
pd.DataFrame(dict([ (k, pd.Series(v)) for k, v in collection.items()])).to_csv('NLP_pos.csv', index = False)

In [23]:
%%time
'''
remove_pos = ["``", "NNPS", "NNP", "CD", '#', '$', "''", ",", "0", ":"]
remove_word = ["'s", "judge", "party", "defendant", "ex", "plantiff", "shall", "date", "b", "exhibit", "pennsylvania",  
               "Inc..", "inc..", "llc", "'", "[_]", "action", "clerk", "july", "kw", "regard", "sac", "attachment", "c.d", "cal", "case", "cd", "l.p.", 
               "claim", "copy", "court", "direct", "form", "hereby", "magistrate", "p.c", "pl", "plaintiff", "regard", "sign", "time", "mr.", 
               "docket", "follow", "set", "matter" "agreement" "proceeding", "cotton", "january", "february", "march", "april", "may", "june", 
               "july", "august", "september", "october", "november", "december",
               "agreement", "v.", "modify", "fund", "associated", "provide", "material", "amount", "accordingly", "additional", 
               "second", "esq", "transmission", "g.c.", "seal", "review", "honor", "submit", "counsel", "witness", "civ", "first", "ltd..", "enter", 
               "stay", "forth", "matter", "whether", "class", "master", "information", "statement", "submission", "related", "see", "make", "paper", 
               "brookfield", "designate", "remain", "reportertranscriber", "submit", "include", "mail", "fact", "refer", "take", "pursuant", "amount", 
               "behalf", "I.p..", "must", "attorney",
               'abovecapitoned', 'attach', 'add', 'concern', 'chamber', 'close', 'district', 'damage', 'later', 
               'relate', 'return', 'require', 'restriction', 'respect', 'ny', 'seek', 'write', 'expert', 'transcript', 
               'day', 'h.o', 'damage', 'pre', 'proceeding', 'present', 'page', 'pending', 'p.m.', 'frcp', 'g.c.', 'record', 'r.',
              'application', 'filing', 'issue', 'assign', 'iii', 'state', 'protocol', 'loan', 'error', 'file', 'document']
'''
    
#rebuild corpus
docket_texts_output = [] #ultimate output after cleaning

for lemma_pos_token in lemma_docket_text_list:
    docket_text_output = ''
    for sentence in lemma_pos_token:
        sentence_output = []
        for token in sentence:
            #print(token[1])
            
            if token[2][0] not in remove_pos: #if the pos is not in the remove_pos list
                if token[1] not in remove_word: #these are the intentionally left out words
                    sentence_output.append(token[1]) #append the the sentence
        docket_text_output += ' '.join(sentence_output)
    docket_texts_output.append(docket_text_output)
print(docket_texts_output[:10])

['complaint fee receipt number disclosure civil cover sheet complete yes civil cover sheet propose summons', 'please download locate website responsible courtesy individual practice', 'summons inc', 'notice email miss civil cover sheet', 'accordance rule federal rule civil procedure local rule notify consent united available conduct civil trial order entry final judgment notice blank consent fill electronically wish consent also access link http www uscourts gov uscourts formsandfees ao085 pdf withhold consent without adverse substantive consequence consent unless consent', 'open check quality control correction', 'notice appearance', 'backend note complaint', 'ta letter complaint', 'c notice conversion complaint']
Wall time: 155 ms


In [24]:
new_df['Removed unnecessary POS & vocab'] = pd.Series(docket_texts_output)
new_df.head()

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab
0,"COMPLAINT against Cardiogenics Holdings, Inc. ...","Cardiogenics Holdings , Inc. LG Capital Funding",( Steinmetz Michael Bowens Priscilla,COMPLAINT against -ORG- -ORG- -ORG- -ORG- fili...,"COMPLAINT against filing fee $ 400 , receipt n...",complaint fee receipt number disclosure civil ...
1,Case assigned to Judge Ann M Donnelly and Magi...,Individual Practices of the assigned Judges,Ann M Donnelly Vera M. Scanlon Bowens Priscilla,Case assigned to Judge -NAME- -NAME- -NAME- an...,Case assigned to Judge and Magistrate Judge . ...,please download locate website responsible cou...
2,"Summons Issued as to Cardiogenics Holdings, In...",Cardiogenics Holdings,Bowens Priscilla,"Summons Issued as to -ORG- -ORG- , Inc.. ( -NA...","Summons Issued as to , Inc.. ( , ) ( Entered :...",summons inc
3,NOTICE - emailed attorney regarding missing se...,,Bowens Priscilla,NOTICE - emailed attorney regarding missing se...,NOTICE - emailed attorney regarding missing se...,notice email miss civil cover sheet
4,In accordance with Rule 73 of the Federal Rule...,,Bowens Priscilla,In accordance with Rule 73 of the Federal Rule...,In accordance with Rule 73 of the Federal Rule...,accordance rule federal rule civil procedure l...


#### Decision tree to identify keywords and topics based on Chris' feedback

In [25]:
manual_topics_df = pd.read_csv('mannual_topics.csv')
manual_topics_df = manual_topics_df.apply(lambda x: x.astype(str).str.lower())
manual_topics_dict = manual_topics_df.to_dict('list')
for topic in manual_topics_dict.keys():
    manual_topics_dict[topic] = [keyword for keyword in manual_topics_dict[topic] if keyword != 'nan']

In [26]:
def mannual_topic_assignment(text):
    text = text.split()
    #print(text)
    output = []
    for topic in manual_topics_dict.keys():
        if set(text).intersection(manual_topics_dict[topic]):
            output.append(topic)
    #print(output)
    return ', '.join(output)

In [27]:
docket_texts_output_DT = []
topics_DT = []

for text in docket_texts_output:
    topic = mannual_topic_assignment(text)
    #print(topic)
    if topic != '':
        docket_texts_output_DT.append('')
        topics_DT.append(topic)
    else:
        docket_texts_output_DT.append(text)
        topics_DT.append('')

In [28]:
print(topics_DT[:5])
print(docket_texts_output_DT[:5])

['Complaints, Service of Process', '', 'Service of Process', 'Notices', 'Notices, Motions']
['', 'please download locate website responsible courtesy individual practice', '', '', '']


In [29]:
new_df['DT Topics'] = pd.Series(topics_DT)
new_df['Removed unnecessary POS & vocab DT'] = pd.Series(docket_texts_output_DT)
new_df.head()

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab,DT Topics,Removed unnecessary POS & vocab DT
0,"COMPLAINT against Cardiogenics Holdings, Inc. ...","Cardiogenics Holdings , Inc. LG Capital Funding",( Steinmetz Michael Bowens Priscilla,COMPLAINT against -ORG- -ORG- -ORG- -ORG- fili...,"COMPLAINT against filing fee $ 400 , receipt n...",complaint fee receipt number disclosure civil ...,"Complaints, Service of Process",
1,Case assigned to Judge Ann M Donnelly and Magi...,Individual Practices of the assigned Judges,Ann M Donnelly Vera M. Scanlon Bowens Priscilla,Case assigned to Judge -NAME- -NAME- -NAME- an...,Case assigned to Judge and Magistrate Judge . ...,please download locate website responsible cou...,,please download locate website responsible cou...
2,"Summons Issued as to Cardiogenics Holdings, In...",Cardiogenics Holdings,Bowens Priscilla,"Summons Issued as to -ORG- -ORG- , Inc.. ( -NA...","Summons Issued as to , Inc.. ( , ) ( Entered :...",summons inc,Service of Process,
3,NOTICE - emailed attorney regarding missing se...,,Bowens Priscilla,NOTICE - emailed attorney regarding missing se...,NOTICE - emailed attorney regarding missing se...,notice email miss civil cover sheet,Notices,
4,In accordance with Rule 73 of the Federal Rule...,,Bowens Priscilla,In accordance with Rule 73 of the Federal Rule...,In accordance with Rule 73 of the Federal Rule...,accordance rule federal rule civil procedure l...,"Notices, Motions",


In [30]:
#print some examples
for i in range(10):
    print(i)
    if new_df['DT Topics'].iloc[i] != []:
        print(new_df['DT Topics'].iloc[i])
        print(new_df['Removed unnecessary POS & vocab'].iloc[i])
        print(new_df['Removed unnecessary POS & vocab DT'].iloc[i])

0
Complaints, Service of Process
complaint fee receipt number disclosure civil cover sheet complete yes civil cover sheet propose summons

1

please download locate website responsible courtesy individual practice
please download locate website responsible courtesy individual practice
2
Service of Process
summons inc

3
Notices
notice email miss civil cover sheet

4
Notices, Motions
accordance rule federal rule civil procedure local rule notify consent united available conduct civil trial order entry final judgment notice blank consent fill electronically wish consent also access link http www uscourts gov uscourts formsandfees ao085 pdf withhold consent without adverse substantive consequence consent unless consent

5

open check quality control correction
open check quality control correction
6
Notices
notice appearance

7
Complaints
backend note complaint

8
Complaints
ta letter complaint

9
Notices, Complaints
c notice conversion complaint



In [31]:
unigram_sentences_filepath = 'docket_texts/train/DT/unigram_nltk_noorgnoname.txt'

In [35]:
%%time
# turn the lemmatized corpus into unigram sentences
with codecs.open(unigram_sentences_filepath, 'w', encoding = 'utf_8') as f:
    for sentence in docket_texts_output_DT:
        f.write(sentence + '\n')

Wall time: 5.99 ms


In [36]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [38]:
#let's do some comparision between the original text and unigram sentences, shouldn't be that different.
print('Original text:')
print(new_df['Removed unnecessary POS & vocab'].iloc[:10])
print(new_df['Removed unnecessary POS & vocab DT'].iloc[:10])
#print(df['Docket Text'].iloc[1])

print('\nUnigram_sentence:')
for unigram_sentence in it.islice(unigram_sentences, 0, 10):
    print(' '.join(unigram_sentence))
    print('')

Original text:
0    complaint fee receipt number disclosure civil ...
1    please download locate website responsible cou...
2                                          summons inc
3                  notice email miss civil cover sheet
4    accordance rule federal rule civil procedure l...
5                open check quality control correction
6                                    notice appearance
7                               backend note complaint
8                                  ta letter complaint
9                        c notice conversion complaint
Name: Removed unnecessary POS & vocab, dtype: object
0                                                     
1    please download locate website responsible cou...
2                                                     
3                                                     
4                                                     
5                open check quality control correction
6                                                   

In [39]:
bigram_model_filepath = 'docket_texts/train/DT/bigram_model_noorgnoname' 

In [40]:
%%time

# store our bigram model
bigram_model = Phrases(unigram_sentences)
bigram_model.save(bigram_model_filepath)
    
# load the finished model from disk if we don't want to run this again
#bigram_model = Phrases.load(bigram_model_filepath)

Wall time: 18.9 ms


In [41]:
bigram_sentences_filepath = 'docket_texts/train/DT/bigram_sentences_noorgnoname.txt'

In [42]:
%%time

# apply the bigram model, and write it to file
with codecs.open(bigram_sentences_filepath, 'w', encoding = 'utf_8') as f:
    for unigram_sentence in unigram_sentences:
        bigram_sentence = ' '.join(bigram_model[unigram_sentence])
        f.write(bigram_sentence + '\n')

Wall time: 32.9 ms




In [43]:
bigram_sentences = LineSentence(bigram_sentences_filepath)
print('unigram length = {}, bigram length = {}'.format(len(list(unigram_sentences)), len(list(bigram_sentences))))

unigram length = 574, bigram length = 574


In [44]:
#original v. unigram v. bigram. Some phrases should be combined already
start = 0
finish = 10
print('Original text:')
print(new_df['Removed unnecessary POS & vocab'].iloc[0])
print(new_df['Removed unnecessary POS & vocab'].iloc[1])

print('\nUnigram sentence:')
for unigram_sentence in it.islice(unigram_sentences, 0, 10):
    print(' '.join(unigram_sentence))
print('\nBigram sentence:')
for bigram_sentence in it.islice(bigram_sentences, start, finish):
    print(' '.join(bigram_sentence))

Original text:
complaint fee receipt number disclosure civil cover sheet complete yes civil cover sheet propose summons
please download locate website responsible courtesy individual practice

Unigram sentence:
please download locate website responsible courtesy individual practice
open check quality control correction
propose schedule order
letter propose schedule order
order respond letter order
order respond letter motion order
letter status
reassign longer please download locate website responsible courtesy individual practice
status report
letter order

Bigram sentence:
please_download locate_website responsible_courtesy individual_practice
open_check quality_control correction
propose schedule order
letter propose schedule order
order respond letter order
order respond letter motion order
letter status
reassign longer please_download locate_website responsible_courtesy individual_practice
status_report
letter order


In [45]:
trigram_model_filepath = 'docket_texts/train/DT/trigram_model_nonamenoorg'

In [46]:
%%time

# again, using Phrases to attach more words to phrases already formed
trigram_model = Phrases(bigram_sentences)
trigram_model.save(trigram_model_filepath)

# load the finished model from disk
#trigram_model = Phrases.load(trigram_model_filepath)

Wall time: 16 ms


In [47]:
trigram_sentences_filepath = 'docket_texts/train/DT/trigram_sentences_nonamenoorg.txt'

In [48]:
%%time

with codecs.open(trigram_sentences_filepath, 'w', encoding = 'utf_8') as f:
    for bigram_sentence in bigram_sentences:
        #print('Bi', bigram_sentence)
        trigram_sentence = ' '.join(trigram_model[bigram_sentence])
        #print('Tri', trigram_sentence)
        f.write(trigram_sentence + '\n')

Wall time: 24 ms




In [49]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [50]:
start = 0
finish = 15
print('Original text:')
print(new_df['Removed unnecessary POS & vocab'].iloc[0],'\n')
print(new_df['Removed unnecessary POS & vocab'].iloc[1],'\n')
print(new_df['Removed unnecessary POS & vocab'].iloc[2],'\n')
print(new_df['Removed unnecessary POS & vocab'].iloc[3],'\n')

print('\nUNIGRAM Sentence:')
for unigram_sentence in it.islice(unigram_sentences, start, finish):
    print(' '.join(unigram_sentence))
print('\nBIGRAM Sentence:')
for bigram_sentence in it.islice(bigram_sentences, start, finish):
    print(' '.join(bigram_sentence))
print('\nTRIGRAM Sentence:')
for trigram_sentence in it.islice(trigram_sentences, start, finish):
    print(' '.join(trigram_sentence))

Original text:
complaint fee receipt number disclosure civil cover sheet complete yes civil cover sheet propose summons 

please download locate website responsible courtesy individual practice 

summons inc 

notice email miss civil cover sheet 


UNIGRAM Sentence:
please download locate website responsible courtesy individual practice
open check quality control correction
propose schedule order
letter propose schedule order
order respond letter order
order respond letter motion order
letter status
reassign longer please download locate website responsible courtesy individual practice
status report
letter order
letter
undeliverable decision order sender postal notation move unknown
please download locate website responsible courtesy individual practice
open check quality control correction
schedule order hear order show cause reschedule courtroom south deadline effect order

BIGRAM Sentence:
please_download locate_website responsible_courtesy individual_practice
open_check quality_con

In [51]:
def trigram_transform(texts):
    display = False
    trigram_output = ''
    #print(texts)
    '''
    remove_trigram = ['calendar_day', 'court_notice_intend', 'minute_entry_proceeding_hold', 'court_reportertranscriber_abovecaptioned_matter',
                      'redaction_calendar_day', 'rule_statement', 'obtain_pacer', 'may_obtain_pacer', 'reportertranscriber_abovecaptioned_matter',
                      'redact_transcript_deadline', 'send_chamber', "official_transcript_notice_give", "notice_intent_request", "proceed_hold", 
                      "fee_receipt_number", "civil_procedure", "pursuant_frcp", "official_transcript_conference", 
                      "purchase_reportertranscriber_deadline_release", "et_al", "mail_chamber", "transcript_restriction", "redaction_transcript", 
                      "transcript_view_public_terminal", "transcript_make_remotely", "associated_et_al", "electronically_available_public_without", 
                      "genesys_id", "release_transcript_restriction", "adar_bay", "redaction_request_due", "new_york", "official_transcript_conference", 
                      "transcript_make_remotely", "transcript_proceeding_conference_hold", "redaction_transcript",
                      'affidavit_jr._c.p.a', 'corporate_parent', 'certain_underwriter', 'federal_rule_civil_procedure', 'redaction_request', 
                      'official_transcript', 'rule_disclosure', 'rule_corporate_disclosure', 'place_vault', 'public_without_redaction_calendar', 
                      'purchase_deadline_release_transcript', 'transcript_proceeding_hold', 'transcript_remotely_electronically_available',
                      'minute_entry_hold', 'discovery_hear_hold', 'jury_trial_hold', "sign_judge",'place_vault']
    '''
    if texts == None:
        return None
    
    unigram_review = []
    for word in texts.split():
        unigram_review.append(word)
    if display:
        print('Uni: ', unigram_review)
    bigram_review = bigram_model[unigram_review]
    if display:
        print('Bi: ', bigram_review)
    trigram_review = trigram_model[bigram_review]
    if display:
        print('Tri: ', trigram_review)
    trigram_review = [phrase for phrase in trigram_review if phrase not in remove_trigram]
    if display:
        print('Tri removed: ', trigram_review)
    trigram_output += ' '.join(trigram_review)
    
    return trigram_output

In [52]:
new_df['Apply Trigram Phrase Model'] = new_df['Removed unnecessary POS & vocab DT'].apply(trigram_transform)



In [53]:
new_df['Apply Trigram Phrase Model'].iloc[1]

'please_download_locate_website responsible_courtesy_individual_practice'

In [54]:
#write trigram to file
trigram_dockets_filepath = 'docket_texts/train/DT/trigram_transformed_dockets_noorgnoname.txt'

In [55]:
with codecs.open(trigram_dockets_filepath, 'w', encoding= 'utf_8') as f:
    for i in range(len(new_df['Apply Trigram Phrase Model'])):
        f.write(' '.join(new_df['Apply Trigram Phrase Model'][i]) + '\n')

In [56]:
trigram_dictionary_filepath = 'docket_texts/train/DT/trigram_dict_noorgnoname.dict'

In [57]:
%%time

#some dictionary hyperparameters:
no_below = 10 #reference is 10
no_above = 0.4 #reference is 0.4

trigram_reviews = LineSentence(trigram_dockets_filepath)

# learn the dictionary by iterating over all of the reviews
trigram_dictionary = Dictionary(trigram_reviews)

# filter tokens that are very rare otrigram_reviewsr too common from
# the dictionary (filter_extremes) and reassign integer ids (compactify)
trigram_dictionary.filter_extremes(no_below = no_below, no_above = no_above) #this step is questionable. May need to change the parameters
trigram_dictionary.compactify()

trigram_dictionary.save(trigram_dictionary_filepath)
    
# load the finished dictionary from disk
#trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

Wall time: 17 ms


In [58]:
trigram_bow_filepath = 'docket_texts/train/DT/trigram_bow_corpus_noorgnoname.mm'

In [59]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        #print(review)
        #print(trigram_dictionary.doc2bow(review))
        yield trigram_dictionary.doc2bow(review)

In [60]:
%%time

# generate bag-of-words representations for
# all reviews and save them as a matrix
MmCorpus.serialize(trigram_bow_filepath, trigram_bow_generator(trigram_sentences_filepath))
    
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)
print(trigram_bow_corpus)

MmCorpus(574 documents, 11 features, 14 non-zero entries)
Wall time: 10 ms


In [61]:
new_df.head()

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab,DT Topics,Removed unnecessary POS & vocab DT,Apply Trigram Phrase Model
0,"COMPLAINT against Cardiogenics Holdings, Inc. ...","Cardiogenics Holdings , Inc. LG Capital Funding",( Steinmetz Michael Bowens Priscilla,COMPLAINT against -ORG- -ORG- -ORG- -ORG- fili...,"COMPLAINT against filing fee $ 400 , receipt n...",complaint fee receipt number disclosure civil ...,"Complaints, Service of Process",,
1,Case assigned to Judge Ann M Donnelly and Magi...,Individual Practices of the assigned Judges,Ann M Donnelly Vera M. Scanlon Bowens Priscilla,Case assigned to Judge -NAME- -NAME- -NAME- an...,Case assigned to Judge and Magistrate Judge . ...,please download locate website responsible cou...,,please download locate website responsible cou...,please_download_locate_website responsible_cou...
2,"Summons Issued as to Cardiogenics Holdings, In...",Cardiogenics Holdings,Bowens Priscilla,"Summons Issued as to -ORG- -ORG- , Inc.. ( -NA...","Summons Issued as to , Inc.. ( , ) ( Entered :...",summons inc,Service of Process,,
3,NOTICE - emailed attorney regarding missing se...,,Bowens Priscilla,NOTICE - emailed attorney regarding missing se...,NOTICE - emailed attorney regarding missing se...,notice email miss civil cover sheet,Notices,,
4,In accordance with Rule 73 of the Federal Rule...,,Bowens Priscilla,In accordance with Rule 73 of the Federal Rule...,In accordance with Rule 73 of the Federal Rule...,accordance rule federal rule civil procedure l...,"Notices, Motions",,


In [62]:
new_df.columns

Index(['Original Docket Text', 'Organization Portion', 'Name Portion',
       'Identifying Org and Name', 'Stripped Org and Name',
       'Removed unnecessary POS & vocab', 'DT Topics',
       'Removed unnecessary POS & vocab DT', 'Apply Trigram Phrase Model'],
      dtype='object')

In [75]:
new_df[['Original Docket Text', 'Removed unnecessary POS & vocab', 
        'Removed unnecessary POS & vocab DT', 'DT Topics',
        'Apply Trigram Phrase Model']].to_csv('NEW_NLP_output.csv', index = False)

### Insert New Model
### Example 1

In [None]:
def make_Dictionary(train_dir):
    emails = [os.path.join(train_dir,f) for f in os.listdir(train_dir)]    
    all_words = []       
    for mail in emails:    
        with open(mail) as m:
            for i,line in enumerate(m):
                if i == 2:  #Body of email is only 3rd line of text file
                    words = line.split()
                    all_words += words
    
    dictionary = Counter(all_words)
    # Paste code for non-word removal here(code snippet is given below) 
    return dictionary

In [None]:
list_to_remove = dictionary.keys()
for item in list_to_remove:
    if item.isalpha() == False: 
        del dictionary[item]
    elif len(item) == 1:
        del dictionary[item]
dictionary = dictionary.most_common(3000)

In [None]:
def extract_features(mail_dir): 
    files = [os.path.join(mail_dir,fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files),3000))
    docID = 0;
    for fil in files:
      with open(fil) as fi:
        for i,line in enumerate(fi):
          if i == 2:
            words = line.split()
            for word in words:
              wordID = 0
              for i,d in enumerate(dictionary):
                if d[0] == word:
                  wordID = i
                  features_matrix[docID,wordID] = words.count(word)
        docID = docID + 1     
    return features_matrix

In [None]:
import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.svm import SVC, NuSVC, LinearSVC

# Create a dictionary of words with its frequency

train_dir = 'train-mails'
dictionary = make_Dictionary(train_dir)

# Prepare feature vectors per training mail and its labels

train_labels = np.zeros(702)
train_labels[351:701] = 1
train_matrix = extract_features(train_dir)

# Training SVM and Naive bayes classifier

model1 = MultinomialNB()
model2 = LinearSVC()
model1.fit(train_matrix,train_labels)
model2.fit(train_matrix,train_labels)

# Test the unseen mails for Spam
test_dir = 'test-mails'
test_matrix = extract_features(test_dir)
test_labels = np.zeros(260)
test_labels[130:260] = 1
result1 = model1.predict(test_matrix)
result2 = model2.predict(test_matrix)
print confusion_matrix(test_labels,result1)
print confusion_matrix(test_labels,result2)

### Example 2

references: 

https://towardsdatascience.com/spam-classifier-in-python-from-scratch-27a98ddd8e73

https://www.kaggle.com/uciml/sms-spam-collection-dataset

In [7]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from math import log, sqrt
%matplotlib inline

In [None]:
#load data
mails = pd.read_csv('spam.csv')
mails.head()

In [None]:
totalMails = mails['message'].shape[0]
trainIndex, testIndex = list(), list()
for i in range(mails.shape[0]):
    if np.random.uniform(0, 1) < 0.75:
        trainIndex += [i]
    else:
        testIndex += [i]

trainData = mails.loc[trainIndex]
testData = mails.loc[testIndex]

In [None]:
trainData.reset_index(inplace = True)
trainData.drop(['index'], axis = 1, inplace = True)
trainData.head()

In [None]:
spam_words = ' '.join(list(mails[mails['label'] == 1]['message']))
spam_wc = WordCloud(width = 512, height = 512).generate(spam_words)
plt.figure(figsize = (10, 8), facecolor = 'k')
plt.imshow(spam_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plot.show()

In [None]:
def process_message(message, lower_case = True, stem = True, stop_words = True, gram = 2):
    if lower_case:
        message = message.lower()
    words = word_tokenize(message)
    words = [w for w in words if len(w) > 2]
    if gram > 1:
        w = []
        for i in range(len(words) - gram + 1):
            w += [' '.join(words[i: i + gram])]
        return w
    if stop_words:
        sw = stopwords.words('english')
        words = [word for word in words if word not in sw]
        words = [stemmer.stem(word) for word in words]
    return words

In [None]:
sc_tf_idf = SpamClassifier(trainData, 'tf-idf')
sc_tf_idf.train()
preds_tf_idf = sc_tf_idf.predict(testData['message'])
metrics(testData['label'], preds_tf_idf)

In [None]:
sc_bow = SpamClassifier(trainData, 'bow')
sc_bow.train()
preds_bow = sc_bow.predict(testData['message'])
metrics(testData['label'], preds_bow)

In [None]:
pm = process_message("I can't pick the phone righ tnow. Pls send a message")
sc_tf_idf.classify(pm)

In [None]:
pm = process_message('Congratulations ur awarded $500')
sc_tf_idf.classify(pm)

In [51]:
def explore_topic(model, topic_number, topn = 10):
    topics = []
    print('{:20} {}'.format('term', 'frequency') + '\n')
    for term, frequency in model.show_topic(topic_number, topn = topn):
        print('{:20} {:.3f}'.format(term, round(frequency, 3)))
        topics.append((term, round(frequency, 3)))
    return topics

In [52]:
def topic_modeling_pipeline(num_topics, model_file_path, trigram_bow_corpus, trigram_dictionary, export = False):

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')

        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(trigram_bow_corpus, num_topics = num_topics, id2word = trigram_dictionary, workers = 4)

        lda.save(model_file_path)
    
    topic_dict = {}
    for i in range(num_topics):
        print("\n Topic {}'s make-up:".format(i + 1))
        topic_dict[i] = explore_topic(lda, topic_number = i)
    
    if export:
        pd.DataFrame(topic_dict).to_csv(model_file_path + 'topics.csv', index = False)
    
    LDAvis_data_filepath = model_file_path + '_ldavis'
    
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus, trigram_dictionary)

    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
        
    with open(LDAvis_data_filepath, 'rb') as f:
        LDAvis_prepared = pickle.load(f)
        
    return LDAvis_prepared, lda

In [53]:
def lda_description(docket_text, lda, trigram_dictionary, topic_names, min_topic_freq = 0.05):
    '''
    accept the processed texts (trigram) of a review and 
    1) create a bag-of-words representation, 
    4) create an LDA representation, and
    5) print a sorted list of the top topics in the LDA representation
    '''
    output = []
    analyze_this = []
    for sentence in docket_text:
        analyze_this += sentence.split()
    
    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(analyze_this)
    
    # create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly related topics first
    review_lda.sort(key = lambda tup: tup[1], reverse = True)
    #print(review_lda)
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        #print('{:25} {}'.format(topic_names[topic_number], round(freq, 3)))
        output.append((topic_names[topic_number], round(freq, 5)))
    return output

### Providing topic visualizations, topic constituents for each topic, and classifying each docket text. For some reason when I use a pipeline functionality it doesn't work with pyLADvis.display
### 2 Topics:

In [54]:
num_topics = 2
pretrained_model_file_path = 'docket_texts/train/DT/lda_model_noorgnomodel_' + str(num_topics)
column_name = str(num_topics) + '-topic Model Classification'

topic_names = {}
for i in range(num_topics):
    topic_names[i] = 'Topic ' + str(i)

LDAvis_prepared, model = topic_modeling_pipeline(num_topics, pretrained_model_file_path, trigram_bow_corpus, trigram_dictionary, export = True)
topic_summary = []

for docket_text in list(new_df['Apply Trigram Phrase Model']):
    #print(docket_text)
    if docket_text == []:
        topic_summary.append('')
    else:
        topic_summary.append(lda_description(docket_text, model, trigram_dictionary, topic_names))

new_df[column_name] = topic_summary
pyLDAvis.display(LDAvis_prepared)


 Topic 1's make-up:
term                 frequency

order                0.259
discovery            0.072
motion               0.065
letter_address       0.051
schedule             0.050
hold                 0.050
letter               0.049
rule                 0.046
report               0.033
due                  0.028

 Topic 2's make-up:
term                 frequency

order                0.144
rule                 0.143
letter_address       0.091
inc                  0.064
motion               0.060
deposition           0.048
propose              0.043
hold                 0.030
status_report        0.025
motion_limine        0.024


### 3 Topics:

In [55]:
num_topics = 3
pretrained_model_file_path = 'docket_texts/train/DT/lda_model_noorgnomodel_' + str(num_topics)
column_name = str(num_topics) + '-topic Model Classificaiton'

topic_names = {}
for i in range(num_topics):
    topic_names[i] = 'Topic ' + str(i)

LDAvis_prepared, model = topic_modeling_pipeline(num_topics, pretrained_model_file_path, trigram_bow_corpus, trigram_dictionary, export = True)
topic_summary = []

for docket_text in list(new_df['Apply Trigram Phrase Model']):
    #print(docket_text)
    if docket_text == []:
        topic_summary.append('')
    else:
        topic_summary.append(lda_description(docket_text, model, trigram_dictionary, topic_names))

new_df[column_name] = topic_summary
pyLDAvis.display(LDAvis_prepared)


 Topic 1's make-up:
term                 frequency

motion               0.161
hold                 0.135
order                0.134
schedule             0.050
pm                   0.046
rule                 0.046
due                  0.043
endorsement          0.038
discovery            0.036
grant                0.035

 Topic 2's make-up:
term                 frequency

order                0.360
letter_address       0.096
discovery            0.068
schedule             0.047
report               0.035
protective           0.035
deposition           0.032
status_report        0.031
letter               0.031
propose              0.030

 Topic 3's make-up:
term                 frequency

rule                 0.225
inc                  0.124
letter               0.093
lp                   0.085
letter_address       0.064
motion               0.048
propose              0.038
discovery            0.036
order                0.031
brief                0.030


### 4 Topics

In [56]:
num_topics = 4
pretrained_model_file_path = 'docket_texts/train/DT/lda_model_noorgnomodel_' + str(num_topics)
print(pretrained_model_file_path)
column_name = str(num_topics) + '-topic Model Classificaiton'

''' commented out due to new DT direction
#according to Chris' feedback on 2018-5-10
topic_names = {0: 'motion to dismiss',
               1: 'motion for summary judgment',
               2: 'Complaint and motion',
               3: 'Amended Complaint and motion'}
'''
topic_names = {}
for i in range(num_topics):
    topic_names[i] = 'Topic ' + str(i)


LDAvis_prepared, model = topic_modeling_pipeline(num_topics, pretrained_model_file_path, trigram_bow_corpus, trigram_dictionary, export = True)
topic_summary = []

for docket_text in list(new_df['Apply Trigram Phrase Model']):
    #print(docket_text)
    if docket_text == []:
        topic_summary.append('')
    else:
        topic_summary.append(lda_description(docket_text, model, trigram_dictionary, topic_names))

new_df[column_name] = topic_summary
pyLDAvis.display(LDAvis_prepared)

docket_texts/train/DT/lda_model_noorgnomodel_4

 Topic 1's make-up:
term                 frequency

rule                 0.172
order                0.125
hold                 0.099
letter               0.074
discovery            0.061
deposition           0.052
propose              0.052
schedule             0.050
endorsed_letter_address 0.041
letter_address       0.039

 Topic 2's make-up:
term                 frequency

order                0.300
letter_address       0.148
schedule             0.063
report               0.060
motion               0.058
discovery            0.052
hold                 0.035
due                  0.033
propose              0.024
associate            0.023

 Topic 3's make-up:
term                 frequency

order                0.245
status_report        0.113
pm                   0.082
discovery            0.075
letter               0.062
dispute              0.049
deadline             0.041
due                  0.039
schedule             0.035
motion_l

### 5 Topics:

In [57]:
num_topics = 5
pretrained_model_file_path = 'docket_texts/train/DT/lda_model_noorgnomodel_' + str(num_topics)
column_name = str(num_topics) + '-topic Model Classificaiton'

topic_names = {}
for i in range(num_topics):
    topic_names[i] = 'Topic ' + str(i)

LDAvis_prepared, model = topic_modeling_pipeline(num_topics, pretrained_model_file_path, trigram_bow_corpus, trigram_dictionary, export = True)
topic_summary = []

for docket_text in list(new_df['Apply Trigram Phrase Model']):
    #print(docket_text)
    if docket_text == []:
        topic_summary.append('')
    else:
        topic_summary.append(lda_description(docket_text, model, trigram_dictionary, topic_names))

new_df[column_name] = topic_summary
pyLDAvis.display(LDAvis_prepared)


 Topic 1's make-up:
term                 frequency

order                0.193
discovery            0.165
letter_address       0.097
motion               0.074
inc                  0.069
status_report        0.058
due                  0.053
schedule             0.043
dispute              0.043
rule                 0.028

 Topic 2's make-up:
term                 frequency

hold                 0.280
motion               0.156
letter_address       0.095
pm                   0.081
grant                0.062
endorsed_letter_address 0.035
order                0.030
inc                  0.029
letter               0.029
associate            0.028

 Topic 3's make-up:
term                 frequency

order                0.427
letter               0.085
schedule             0.064
letter_address       0.050
protective           0.042
rule                 0.031
propose              0.031
endorsement          0.025
report               0.025
memo_endorsement     0.022

 Topic 4's make-up:
term   

### 6 Topics:

In [58]:
num_topics = 6
pretrained_model_file_path = 'docket_texts/train/DT/lda_model_noorgnomodel_' + str(num_topics)
column_name = str(num_topics) + '-topic Model Classificaiton'

topic_names = {}

for i in range(num_topics):
    topic_names[i] = 'Topic ' + str(i)

LDAvis_prepared, model = topic_modeling_pipeline(num_topics, pretrained_model_file_path, trigram_bow_corpus, trigram_dictionary, export = True)
topic_summary = []

for docket_text in list(new_df['Apply Trigram Phrase Model']):
    #print(docket_text)
    if docket_text == []:
        topic_summary.append('')
    else:
        topic_summary.append(lda_description(docket_text, model, trigram_dictionary, topic_names))

new_df[column_name] = topic_summary
pyLDAvis.display(LDAvis_prepared)


 Topic 1's make-up:
term                 frequency

order                0.236
hold                 0.111
letter_address       0.104
letter               0.082
discovery            0.075
due                  0.060
motion_limine        0.059
dispute              0.053
rule                 0.024
motion               0.023

 Topic 2's make-up:
term                 frequency

letter_address       0.196
inc                  0.132
discovery            0.105
rule                 0.087
hold                 0.074
pm                   0.074
deposition           0.062
order                0.038
deny                 0.032
deadline             0.025

 Topic 3's make-up:
term                 frequency

report               0.257
inc                  0.177
hold                 0.074
due                  0.073
propose              0.051
trial                0.050
motion_limine        0.050
associate            0.039
schedule             0.027
status_report        0.027

 Topic 4's make-up:
term      

### 7 Topics:

In [59]:
num_topics = 7
pretrained_model_file_path = 'docket_texts/train/DT/lda_model_noorgnomodel_' + str(num_topics)
column_name = str(num_topics) + '-topic Model Classification'

topic_names = {}

for i in range(num_topics):
    topic_names[i] = 'Topic ' + str(i)

LDAvis_prepared, model = topic_modeling_pipeline(num_topics, pretrained_model_file_path, trigram_bow_corpus, trigram_dictionary, export = True)
topic_summary = []

for docket_text in list(new_df['Apply Trigram Phrase Model']):
    #print(docket_text)
    if docket_text == []:
        topic_summary.append('')
    else:
        topic_summary.append(lda_description(docket_text, model, trigram_dictionary, topic_names))

new_df[column_name] = topic_summary
pyLDAvis.display(LDAvis_prepared)


 Topic 1's make-up:
term                 frequency

order                0.261
rule                 0.089
pm                   0.087
motion               0.076
hold                 0.058
deadline             0.058
inc                  0.051
status_report        0.044
report               0.038
schedule             0.038

 Topic 2's make-up:
term                 frequency

order                0.221
status_report        0.123
amend                0.099
rule                 0.076
letter_address       0.076
discovery            0.076
motion_limine        0.075
schedule             0.052
motion               0.028
trial                0.028

 Topic 3's make-up:
term                 frequency

rule                 0.268
letter_address       0.224
discovery            0.061
schedule             0.046
order                0.039
inc                  0.038
entry                0.038
letter               0.031
due                  0.031
memo_endorsement     0.031

 Topic 4's make-up:
term      

### 10 Topics:

In [60]:
num_topics = 10
pretrained_model_file_path = 'docket_texts/train/DT/lda_model_noorgnomodel_' + str(num_topics)
column_name = str(num_topics) + '-topic Model Classificaiton'

topic_names = {}
for i in range(num_topics):
    topic_names[i] = 'Topic ' + str(i)

LDAvis_prepared, model = topic_modeling_pipeline(num_topics, pretrained_model_file_path, trigram_bow_corpus, trigram_dictionary, export = True)
topic_summary = []

for docket_text in list(new_df['Apply Trigram Phrase Model']):
    #print(docket_text)
    if docket_text == []:
        topic_summary.append('')
    else:
        topic_summary.append(lda_description(docket_text, model, trigram_dictionary, topic_names))

new_df[column_name] = topic_summary
pyLDAvis.display(LDAvis_prepared)


 Topic 1's make-up:
term                 frequency

inc                  0.384
order                0.087
pm                   0.066
schedule             0.066
motion               0.045
letter               0.045
entry                0.045
deadline             0.045
memo_endorsement     0.045
rule                 0.023

 Topic 2's make-up:
term                 frequency

hold                 0.290
order                0.210
letter               0.100
motion               0.064
discovery            0.037
schedule             0.037
pm                   0.037
trial                0.037
memo_endorsement     0.028
endorsement          0.028

 Topic 3's make-up:
term                 frequency

rule                 0.168
order                0.085
motion               0.085
endorsed_letter_address 0.085
deny                 0.085
endorsement          0.085
pm                   0.064
inc                  0.044
deposition           0.044
memo_endorsement     0.044

 Topic 4's make-up:
term   

### Export DataFrame to .csv

In [61]:
new_df.head()

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab,DT Topics,Removed unnecessary POS & vocab DT,Apply Trigram Phrase Model,2-topic Model Classification,3-topic Model Classificaiton,4-topic Model Classificaiton,5-topic Model Classificaiton,6-topic Model Classificaiton,7-topic Model Classificaiton,10-topic Model Classificaiton
0,"COMPLAINT against Cardiogenics Holdings, Inc. ...","Cardiogenics Holdings , Inc. LG Capital Funding",( Steinmetz Michael Bowens Priscilla,COMPLAINT against -ORG- -ORG- -ORG- -ORG- fili...,"COMPLAINT against filing fee $ 400 , receipt n...",[complaint fee receipt number disclosure civil...,"[Complaints, Service of Process]",[],[],,,,,,,
1,Case assigned to Judge Ann M Donnelly and Magi...,Individual Practices of the assigned Judges,Ann M Donnelly Vera M. Scanlon Bowens Priscilla,Case assigned to Judge -NAME- -NAME- -NAME- an...,Case assigned to Judge and Magistrate Judge . ...,[please download locate website responsible co...,[],[please download locate website responsible co...,[please_download_locate_website responsible_co...,"[(Topic 0, 0.5), (Topic 1, 0.5)]","[(Topic 0, 0.33333), (Topic 1, 0.33333), (Topi...","[(Topic 0, 0.25), (Topic 1, 0.25), (Topic 2, 0...","[(Topic 0, 0.2), (Topic 1, 0.2), (Topic 2, 0.2...","[(Topic 0, 0.16667), (Topic 1, 0.16667), (Topi...","[(Topic 0, 0.14286), (Topic 1, 0.14286), (Topi...","[(Topic 0, 0.1), (Topic 1, 0.1), (Topic 2, 0.1..."
2,"Summons Issued as to Cardiogenics Holdings, In...",Cardiogenics Holdings,Bowens Priscilla,"Summons Issued as to -ORG- -ORG- , Inc.. ( -NA...","Summons Issued as to , Inc.. ( , ) ( Entered :...",[summons inc],[Service of Process],[],[],,,,,,,
3,NOTICE - emailed attorney regarding missing se...,,Bowens Priscilla,NOTICE - emailed attorney regarding missing se...,NOTICE - emailed attorney regarding missing se...,[notice email miss civil cover sheet],[Notices],[],[],,,,,,,
4,In accordance with Rule 73 of the Federal Rule...,,Bowens Priscilla,In accordance with Rule 73 of the Federal Rule...,In accordance with Rule 73 of the Federal Rule...,[accordance rule federal rule civil procedure ...,"[Motions, Notices]",[],[],,,,,,,


In [62]:
new_df.columns

Index(['Original Docket Text', 'Organization Portion', 'Name Portion',
       'Identifying Org and Name', 'Stripped Org and Name',
       'Removed unnecessary POS & vocab', 'DT Topics',
       'Removed unnecessary POS & vocab DT', 'Apply Trigram Phrase Model',
       '2-topic Model Classification', '3-topic Model Classificaiton',
       '4-topic Model Classificaiton', '5-topic Model Classificaiton',
       '6-topic Model Classificaiton', '7-topic Model Classificaiton',
       '10-topic Model Classificaiton'],
      dtype='object')

In [63]:
new_df[['Original Docket Text', 'Removed unnecessary POS & vocab', 'Removed unnecessary POS & vocab DT', 'Apply Trigram Phrase Model', 
        'DT Topics', '2-topic Model Classification', '3-topic Model Classificaiton', '4-topic Model Classificaiton', 
        '5-topic Model Classificaiton', '6-topic Model Classificaiton', '7-topic Model Classificaiton', 
        '10-topic Model Classificaiton']].to_csv('docket_texts\Train\DT\examine_this.csv', index = False)

In [66]:
new_df.head()

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab,DT Topics,Removed unnecessary POS & vocab DT,Apply Trigram Phrase Model,2-topic Model Classification,3-topic Model Classificaiton,4-topic Model Classificaiton,5-topic Model Classificaiton,6-topic Model Classificaiton,7-topic Model Classificaiton,10-topic Model Classificaiton
0,"COMPLAINT against Cardiogenics Holdings, Inc. ...","Cardiogenics Holdings , Inc. LG Capital Funding",( Steinmetz Michael Bowens Priscilla,COMPLAINT against -ORG- -ORG- -ORG- -ORG- fili...,"COMPLAINT against filing fee $ 400 , receipt n...",[complaint fee receipt number disclosure civil...,"[Complaints, Service of Process]",[],[],,,,,,,
1,Case assigned to Judge Ann M Donnelly and Magi...,Individual Practices of the assigned Judges,Ann M Donnelly Vera M. Scanlon Bowens Priscilla,Case assigned to Judge -NAME- -NAME- -NAME- an...,Case assigned to Judge and Magistrate Judge . ...,[please download locate website responsible co...,[],[please download locate website responsible co...,[please_download_locate_website responsible_co...,"[(Topic 0, 0.5), (Topic 1, 0.5)]","[(Topic 0, 0.33333), (Topic 1, 0.33333), (Topi...","[(Topic 0, 0.25), (Topic 1, 0.25), (Topic 2, 0...","[(Topic 0, 0.2), (Topic 1, 0.2), (Topic 2, 0.2...","[(Topic 0, 0.16667), (Topic 1, 0.16667), (Topi...","[(Topic 0, 0.14286), (Topic 1, 0.14286), (Topi...","[(Topic 0, 0.1), (Topic 1, 0.1), (Topic 2, 0.1..."
2,"Summons Issued as to Cardiogenics Holdings, In...",Cardiogenics Holdings,Bowens Priscilla,"Summons Issued as to -ORG- -ORG- , Inc.. ( -NA...","Summons Issued as to , Inc.. ( , ) ( Entered :...",[summons inc],[Service of Process],[],[],,,,,,,
3,NOTICE - emailed attorney regarding missing se...,,Bowens Priscilla,NOTICE - emailed attorney regarding missing se...,NOTICE - emailed attorney regarding missing se...,[notice email miss civil cover sheet],[Notices],[],[],,,,,,,
4,In accordance with Rule 73 of the Federal Rule...,,Bowens Priscilla,In accordance with Rule 73 of the Federal Rule...,In accordance with Rule 73 of the Federal Rule...,[accordance rule federal rule civil procedure ...,"[Motions, Notices]",[],[],,,,,,,


In [76]:
new_df[new_df['2-topic Model Classification'] != ''].count()

Original Docket Text                  622
Organization Portion                  622
Name Portion                          622
Identifying Org and Name              622
Stripped Org and Name                 622
Removed unnecessary POS & vocab       622
DT Topics                             622
Removed unnecessary POS & vocab DT    622
Apply Trigram Phrase Model            622
2-topic Model Classification          622
3-topic Model Classificaiton          622
4-topic Model Classificaiton          622
5-topic Model Classificaiton          622
6-topic Model Classificaiton          622
7-topic Model Classificaiton          622
10-topic Model Classificaiton         622
dtype: int64

In [77]:
622/len(new_df['2-topic Model Classification'])

0.19173859432799015