## Notebook Goal:  
Using existing NLP and LDA methodologies to perform topic modeling on docket texts. Three hyperparameters to consider:
1. to remove organization or not in docket texts, so organizations themselves won't become topics.
2. to remove names or not in docket texts, so names themselves won't become topics.
3. variations in topic numbers: [2, 3, 5, 10]

Will then perform visualizations and model summary output on every permutation/iteration.

In [1]:
import nltk
from nltk.tag.stanford import StanfordNERTagger
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet, stopwords
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
from nltk.chunk import conlltags2tree, tree2conlltags

path_to_model = r'C:\Users\inves\AppData\Local\Programs\Python\Python35\Lib\site-packages\nltk\stanford-ner-2018-02-27\classifiers\english.all.3class.distsim.crf.ser.gz'
path_to_jar = r'C:\Users\inves\AppData\Local\Programs\Python\Python35\Lib\site-packages\nltk\stanford-ner-2018-02-27\stanford-ner.jar'
tagger = StanfordNERTagger(path_to_model, path_to_jar = path_to_jar)

In [2]:
from gensim.models.word2vec import LineSentence
from gensim.models import Phrases
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

#visualization libraries
import pyLDAvis
import pyLDAvis.gensim



In [10]:
import os
import pandas as pd
import numpy as np
import codecs
import itertools as it
from bs4 import BeautifulSoup
import warnings
import pickle
from collections import Counter
import re
import datetime
import string

java_path = 'C:/Program Files/Java/jdk-10.0.1/bin/java.exe'
os.environ['JAVAHOME'] = java_path

In [4]:
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\inves\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\inves\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\inves\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [5]:
filename = 'docket_texts/train/DT/basic_df.pickle'

In [6]:
#to load
with open(filename, 'rb') as handle:
    NER_df = pickle.load(handle)

In [7]:
new_df = NER_df.copy()

In [8]:
docket_original = list(new_df['Original Docket Text'])

### Actually we can do some deduping... but can wait as well

In [9]:
len(set(docket_original))

3203

In [11]:
def valid_date(datestring):
    try:
        mat = re.match('(\d{1,2})[/.-](\d{2})[/.-](\d{4})$', datestring)
        if mat is not None:
            datetime.datetime(*(map(int, mat.groups()[-1::-1])))
            return True
    except ValueError:
        pass
    return False
    
valid_date('003/11/2016')

False

### 1. Normalize

In [78]:
url_regex = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
date_regex = '(\d{1,2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})'
punct_regex = r"[^a-zA-Z0-9]"
num_regex = "\d+"
extraspace_regex = " +"

docket_normalized = [text.lower() for text in docket_original]
docket_nourl = [re.sub(url_regex, "url", text) for text in docket_normalized]
docket_nodate = [re.sub(date_regex, "date", text) for text in docket_nourl]
docket_nopunct = [re.sub(punct_regex, " ", text) for text in docket_nodate]
docket_nonum = [re.sub(num_regex, " ", text) for text in docket_nopunct]
docket_noextraspace = [re.sub(extraspace_regex, " ", text) for text in docket_nonum]

COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016) 

complaint against cardiogenics holdings inc filing fee receipt number was the disclosure statement on civil cover sheet completed yes filed by lg capital funding llc steinmetz michael additional attachment s added on date civil cover sheet proposed summons bowens priscilla entered date 


In [13]:
print(docket_original[0], '\n')

print(docket_noextraspace[0])

COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016) 

complaint against cardiogenics holdings inc filing fee receipt number was the disclosure statement on civil cover sheet completed yes filed by lg capital funding llc steinmetz michael additional attachment s added on date civil cover sheet proposed summons bowens priscilla entered date 


### 2. Split and tokenize

In [14]:
docket_tokenized = [word_tokenize(text) for text in docket_noextraspace]
docket_tokenized[0]

['complaint',
 'against',
 'cardiogenics',
 'holdings',
 'inc',
 'filing',
 'fee',
 'receipt',
 'number',
 'was',
 'the',
 'disclosure',
 'statement',
 'on',
 'civil',
 'cover',
 'sheet',
 'completed',
 'yes',
 'filed',
 'by',
 'lg',
 'capital',
 'funding',
 'llc',
 'steinmetz',
 'michael',
 'additional',
 'attachment',
 's',
 'added',
 'on',
 'date',
 'civil',
 'cover',
 'sheet',
 'proposed',
 'summons',
 'bowens',
 'priscilla',
 'entered',
 'date']

### 3. Remove Stop words

In [79]:
docket_nostop = [[w for w in words if w not in stopwords.words("english")] for words in docket_tokenized]
print(docket_nostop[0])

['complaint', 'cardiogenics', 'holdings', 'inc', 'filing', 'fee', 'receipt', 'number', 'disclosure', 'statement', 'civil', 'cover', 'sheet', 'completed', 'yes', 'filed', 'lg', 'capital', 'funding', 'llc', 'steinmetz', 'michael', 'additional', 'attachment', 'added', 'date', 'civil', 'cover', 'sheet', 'proposed', 'summons', 'bowens', 'priscilla', 'entered', 'date']


### 4. Lemmatization

In [80]:
docket_lemmed = [[WordNetLemmatizer().lemmatize(w, pos='v') for w in words] for words in docket_nostop]
print(docket_lemmed[0])

['complaint', 'cardiogenics', 'hold', 'inc', 'file', 'fee', 'receipt', 'number', 'disclosure', 'statement', 'civil', 'cover', 'sheet', 'complete', 'yes', 'file', 'lg', 'capital', 'fund', 'llc', 'steinmetz', 'michael', 'additional', 'attachment', 'add', 'date', 'civil', 'cover', 'sheet', 'propose', 'summon', 'bowens', 'priscilla', 'enter', 'date']


### 5. Phrase Modeling

In [81]:
docket_phrase1 = [' '.join(text) for text in docket_lemmed]
docket_phrase1[:5]

['complaint cardiogenics hold inc file fee receipt number disclosure statement civil cover sheet complete yes file lg capital fund llc steinmetz michael additional attachment add date civil cover sheet propose summon bowens priscilla enter date',
 'case assign judge ann donnelly magistrate judge vera scanlon please download review individual practice assign judge locate website attorneys responsible provide courtesy copy judge individual practice require bowens priscilla enter date',
 'summon issue cardiogenics hold inc bowens priscilla enter date',
 'notice email attorney regard miss second page civil cover sheet bowens priscilla enter date',
 'accordance rule federal rule civil procedure local rule party notify party consent unite state magistrate judge court available conduct proceed civil action include jury nonjury trial order entry final judgment attach notice blank copy consent form fill sign file electronically party wish consent form may also access follow link url may withhol

In [82]:
unigram_sentences_filepath = 'docket_texts/train/DT/unigram_nltk_newsop.txt'

In [83]:
%%time
# turn the lemmatized corpus into unigram sentences
with codecs.open(unigram_sentences_filepath, 'w', encoding = 'utf_8') as f:
    for sentence in docket_phrase1:
        f.write(sentence + '\n')

Wall time: 15.4 ms


In [84]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [85]:
bigram_model_filepath = 'docket_texts/train/DT/bigram_model_newsop' 

In [86]:
%%time

# store our bigram model
bigram_model = Phrases(unigram_sentences)
bigram_model.save(bigram_model_filepath)
    
# load the finished model from disk if we don't want to run this again
#bigram_model = Phrases.load(bigram_model_filepath)

Wall time: 250 ms


In [87]:
bigram_sentences_filepath = 'docket_texts/train/DT/bigram_sentences_newsop.txt'

In [88]:
%%time

# apply the bigram model, and write it to file
with codecs.open(bigram_sentences_filepath, 'w', encoding = 'utf_8') as f:
    for unigram_sentence in unigram_sentences:
        bigram_sentence = ' '.join(bigram_model[unigram_sentence])
        f.write(bigram_sentence + '\n')



Wall time: 484 ms


In [89]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [90]:
print('\nUnigram sentence:')
for unigram_sentence in it.islice(unigram_sentences, 0, 10):
    print(' '.join(unigram_sentence))
print('\nBigram sentence:')
for bigram_sentence in it.islice(bigram_sentences, 0, 10):
    print(' '.join(bigram_sentence))


Unigram sentence:
complaint cardiogenics hold inc file fee receipt number disclosure statement civil cover sheet complete yes file lg capital fund llc steinmetz michael additional attachment add date civil cover sheet propose summon bowens priscilla enter date
case assign judge ann donnelly magistrate judge vera scanlon please download review individual practice assign judge locate website attorneys responsible provide courtesy copy judge individual practice require bowens priscilla enter date
summon issue cardiogenics hold inc bowens priscilla enter date
notice email attorney regard miss second page civil cover sheet bowens priscilla enter date
accordance rule federal rule civil procedure local rule party notify party consent unite state magistrate judge court available conduct proceed civil action include jury nonjury trial order entry final judgment attach notice blank copy consent form fill sign file electronically party wish consent form may also access follow link url may withho

In [91]:
trigram_model_filepath = 'docket_texts/train/DT/trigram_model_newsop'

In [92]:
%%time

# again, using Phrases to attach more words to phrases already formed
trigram_model = Phrases(bigram_sentences)
trigram_model.save(trigram_model_filepath)

# load the finished model from disk
#trigram_model = Phrases.load(trigram_model_filepath)

Wall time: 250 ms


In [93]:
trigram_sentences_filepath = 'docket_texts/train/DT/trigram_sentences_newsop.txt'

In [94]:
%%time

with codecs.open(trigram_sentences_filepath, 'w', encoding = 'utf_8') as f:
    for bigram_sentence in bigram_sentences:
        #print('Bi', bigram_sentence)
        trigram_sentence = ' '.join(trigram_model[bigram_sentence])
        #print('Tri', trigram_sentence)
        f.write(trigram_sentence + '\n')



Wall time: 437 ms


In [95]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [96]:
start = 0
finish = 5
print('Original text:')
print(docket_phrase1[start:finish], '\n')

print('\nUNIGRAM Sentence:')
for unigram_sentence in it.islice(unigram_sentences, start, finish):
    print(' '.join(unigram_sentence))
print('\nBIGRAM Sentence:')
for bigram_sentence in it.islice(bigram_sentences, start, finish):
    print(' '.join(bigram_sentence))
print('\nTRIGRAM Sentence:')
for trigram_sentence in it.islice(trigram_sentences, start, finish):
    print(' '.join(trigram_sentence))

Original text:
['complaint cardiogenics hold inc file fee receipt number disclosure statement civil cover sheet complete yes file lg capital fund llc steinmetz michael additional attachment add date civil cover sheet propose summon bowens priscilla enter date', 'case assign judge ann donnelly magistrate judge vera scanlon please download review individual practice assign judge locate website attorneys responsible provide courtesy copy judge individual practice require bowens priscilla enter date', 'summon issue cardiogenics hold inc bowens priscilla enter date', 'notice email attorney regard miss second page civil cover sheet bowens priscilla enter date', 'accordance rule federal rule civil procedure local rule party notify party consent unite state magistrate judge court available conduct proceed civil action include jury nonjury trial order entry final judgment attach notice blank copy consent form fill sign file electronically party wish consent form may also access follow link url 

In [64]:
def trigram_transform(texts):
    display = False
    texts = str(texts)
    trigram_output = ''
    #print(texts)

    remove_trigram = ['calendar_day', 'court_notice_intend', 'minute_entry_proceeding_hold', 'court_reportertranscriber_abovecaptioned_matter',
                      'redaction_calendar_day', 'rule_statement', 'obtain_pacer', 'may_obtain_pacer', 'reportertranscriber_abovecaptioned_matter',
                      'redact_transcript_deadline', 'send_chamber', "official_transcript_notice_give", "notice_intent_request", "proceed_hold", 
                      "fee_receipt_number", "civil_procedure", "pursuant_frcp", "official_transcript_conference", 
                      "purchase_reportertranscriber_deadline_release", "et_al", "mail_chamber", "transcript_restriction", "redaction_transcript", 
                      "transcript_view_public_terminal", "transcript_make_remotely", "associated_et_al", "electronically_available_public_without", 
                      "genesys_id", "release_transcript_restriction", "adar_bay", "redaction_request_due", "new_york", "official_transcript_conference", 
                      "transcript_make_remotely", "transcript_proceeding_conference_hold", "redaction_transcript",
                      'affidavit_jr._c.p.a', 'corporate_parent', 'certain_underwriter', 'federal_rule_civil_procedure', 'redaction_request', 
                      'official_transcript', 'rule_disclosure', 'rule_corporate_disclosure', 'place_vault', 'public_without_redaction_calendar', 
                      'purchase_deadline_release_transcript', 'transcript_proceeding_hold', 'transcript_remotely_electronically_available',
                      'minute_entry_hold', 'discovery_hear_hold', 'jury_trial_hold', "sign_judge",'place_vault']

    if texts == None:
        return None
    
    unigram_review = []
    for word in texts.split():
        unigram_review.append(word)
    if display:
        print('Uni: ', unigram_review)
    bigram_review = bigram_model[unigram_review]
    if display:
        print('Bi: ', bigram_review)
    trigram_review = trigram_model[bigram_review]
    if display:
        print('Tri: ', trigram_review)
    trigram_review = [phrase for phrase in trigram_review if phrase not in remove_trigram]
    if display:
        print('Tri removed: ', trigram_review)
    trigram_output += ' '.join(trigram_review)
    
    return trigram_output

In [97]:
docket_phrase2 = [trigram_transform(text) for text in docket_phrase1]



In [99]:
print(docket_original[:5])
print(docket_phrase2[:5])
len(set(docket_phrase2))

['COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)', 'Case assigned to Judge Ann M Donnelly and Magistrate Judge Vera M. Scanlon. Please download and review the Individual Practices of the assigned Judges, located on our website. Attorneys are responsible for providing courtesy copies to judges where their Individual Practices require such. (Bowens, Priscilla) (Entered: 03/11/2016)', 'Summons Issued as to Cardiogenics Holdings, Inc.. (Bowens, Priscilla) (Entered: 03/11/2016)', 'NOTICE - emailed attorney regarding missing second page of the civil cover sheet. (Bowens, Priscilla) (Entered: 03/11/2016)', 'In accordance with Rule 73 of the Federal Rules of Civil Procedure and Local Rule 73.1, the

3049

### 6. Use NER (named entity recognition) and GPE (geo-political entity)
This did not work as well as I thought. Also, the standford version worked better than the nltk version.
Seems like after some text treatments, it's hard for the NER to be identified.

In [106]:
#after the inital normalization, there's no sense of the NER or GEP by NLTK NER and GPE. Seems that it heavily depend on the cases
iob_tagged = tree2conlltags(ne_chunk(pos_tag(word_tokenize(docket_original[0]))))
print('orignal text: ')
print(iob_tagged, '\n')
iob_tagged = tree2conlltags(ne_chunk(pos_tag(word_tokenize(docket_normalized[0]))))
print('normalized: ')
print(iob_tagged, '\n')
iob_tagged = tree2conlltags(ne_chunk(pos_tag(word_tokenize(docket_nourl[0]))))
print('no url: ')
print(iob_tagged, '\n')
iob_tagged = tree2conlltags(ne_chunk(pos_tag(word_tokenize(docket_nodate[0]))))
print('no date: ')
print(iob_tagged, '\n')
iob_tagged = tree2conlltags(ne_chunk(pos_tag(word_tokenize(docket_nopunct[0]))))
print('no punctuation: ')
print(iob_tagged, '\n')
iob_tagged = tree2conlltags(ne_chunk(pos_tag(word_tokenize(docket_nonum[0]))))
print('no numbers: ')
print(iob_tagged, '\n')
iob_tagged = tree2conlltags(ne_chunk(pos_tag(word_tokenize(docket_noextraspace[0]))))
print('no extra spaces: ')
print(iob_tagged, '\n')
iob_tagged = tree2conlltags(ne_chunk(pos_tag(word_tokenize(docket_phrase1[0]))))
print('lemmentized: ')
print(iob_tagged, '\n')
iob_tagged = tree2conlltags(ne_chunk(pos_tag(word_tokenize(docket_phrase2[0]))))
print('phrase modeled: ')
print(iob_tagged)

orignal text: 
[('COMPLAINT', 'NNP', 'B-ORGANIZATION'), ('against', 'IN', 'O'), ('Cardiogenics', 'NNP', 'B-ORGANIZATION'), ('Holdings', 'NNPS', 'I-ORGANIZATION'), (',', ',', 'O'), ('Inc.', 'NNP', 'O'), ('filing', 'VBG', 'O'), ('fee', 'JJ', 'O'), ('$', '$', 'O'), ('400', 'CD', 'O'), (',', ',', 'O'), ('receipt', 'JJ', 'O'), ('number', 'NN', 'O'), ('0207-8445206', 'NN', 'O'), ('Was', 'NNP', 'O'), ('the', 'DT', 'O'), ('Disclosure', 'NNP', 'B-ORGANIZATION'), ('Statement', 'NNP', 'O'), ('on', 'IN', 'O'), ('Civil', 'NNP', 'B-PERSON'), ('Cover', 'NNP', 'I-PERSON'), ('Sheet', 'NNP', 'I-PERSON'), ('completed', 'VBD', 'O'), ('-YES', 'NNP', 'O'), (',', ',', 'O'), (',', ',', 'O'), ('filed', 'VBN', 'O'), ('by', 'IN', 'O'), ('LG', 'NNP', 'B-ORGANIZATION'), ('Capital', 'NNP', 'I-ORGANIZATION'), ('Funding', 'NNP', 'I-ORGANIZATION'), (',', ',', 'O'), ('LLC', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O'), ('(', '(', 'O'), ('Steinmetz', 'NNP', 'B-PERSON'), (',', ',', 'O'), ('Michael', 'NNP', 'B-GPE'), (')', '

In [108]:
#after the inital normalization, there's no sense of the NER or GEP by NLTK NER and GPE. Seems that it heavily depend on the cases
iob_tagged = tagger.tag(word_tokenize(docket_original[0]))
print('orignal text: ')
print(iob_tagged, '\n')
iob_tagged = tagger.tag(word_tokenize(docket_normalized[0]))
print('normalized: ')
print(iob_tagged, '\n')
iob_tagged = tagger.tag(word_tokenize(docket_nourl[0]))
print('no url: ')
print(iob_tagged, '\n')
iob_tagged = tagger.tag(word_tokenize(docket_nodate[0]))
print('no date: ')
print(iob_tagged, '\n')
iob_tagged = tagger.tag(word_tokenize(docket_nopunct[0]))
print('no punctuation: ')
print(iob_tagged, '\n')
iob_tagged = tagger.tag(word_tokenize(docket_nonum[0]))
print('no numbers: ')
print(iob_tagged, '\n')
iob_tagged = tagger.tag(word_tokenize(docket_noextraspace[0]))
print('no extra spaces: ')
print(iob_tagged, '\n')
iob_tagged = tagger.tag(word_tokenize(docket_phrase1[0]))
print('lemmentized: ')
print(iob_tagged, '\n')
iob_tagged = tagger.tag(word_tokenize(docket_phrase2[0]))
print('phrase modeled: ')
print(iob_tagged)

orignal text: 
[('COMPLAINT', 'O'), ('against', 'O'), ('Cardiogenics', 'ORGANIZATION'), ('Holdings', 'ORGANIZATION'), (',', 'ORGANIZATION'), ('Inc.', 'ORGANIZATION'), ('filing', 'O'), ('fee', 'O'), ('$', 'O'), ('400', 'O'), (',', 'O'), ('receipt', 'O'), ('number', 'O'), ('0207-8445206', 'O'), ('Was', 'O'), ('the', 'O'), ('Disclosure', 'O'), ('Statement', 'O'), ('on', 'O'), ('Civil', 'O'), ('Cover', 'O'), ('Sheet', 'O'), ('completed', 'O'), ('-YES', 'O'), (',', 'O'), (',', 'O'), ('filed', 'O'), ('by', 'O'), ('LG', 'ORGANIZATION'), ('Capital', 'ORGANIZATION'), ('Funding', 'ORGANIZATION'), (',', 'O'), ('LLC', 'O'), ('.', 'O'), ('(', 'PERSON'), ('Steinmetz', 'PERSON'), (',', 'O'), ('Michael', 'PERSON'), (')', 'O'), ('(', 'O'), ('Additional', 'O'), ('attachment', 'O'), ('(', 'O'), ('s', 'O'), (')', 'O'), ('added', 'O'), ('on', 'O'), ('3112016', 'O'), (':', 'O'), ('#', 'O'), ('1', 'O'), ('Civil', 'O'), ('Cover', 'O'), ('Sheet', 'O'), (',', 'O'), ('#', 'O'), ('2', 'O'), ('Proposed', 'O'), ('S

### So let's start over

In [None]:
docket_original = list(new_df['Original Docket Text'])

### 1. NER
Stanford is much better at knowing what's going on

In [130]:
output = []
for i in range(5):
    org_str = []
    name_str = []
    stripped_str1 = []
    stripped_str2 = []

    tokens = word_tokenize(docket_original[i])
    for SFlabel, NLlabel, token in zip(tagger.tag(tokens), tree2conlltags(ne_chunk(pos_tag(tokens))), tokens):
        print(SFlabel, NLlabel, token)
        if SFlabel[1] == 'ORGANIZATION':
            org_str.append(SFlabel[0])
            stripped_str1.append('-ORG-')
        elif SFlabel[1] == 'PERSON':
            name_str.append(SFlabel[0])
            stripped_str1.append('-NAME-')
        else:
            stripped_str1.append(token)
            stripped_str2.append(token)

    output.append([docket_original[i],
                   ' '.join(org_str),
                   ' '.join(name_str),
                   ' '.join(stripped_str1),
                   ' '.join(stripped_str2)])
    
for i in range(5):
    print('docket text:', i)
    print(output[i], '\n')

('COMPLAINT', 'O') ('COMPLAINT', 'NNP', 'B-ORGANIZATION') COMPLAINT
('against', 'O') ('against', 'IN', 'O') against
('Cardiogenics', 'ORGANIZATION') ('Cardiogenics', 'NNP', 'B-ORGANIZATION') Cardiogenics
('Holdings', 'ORGANIZATION') ('Holdings', 'NNPS', 'I-ORGANIZATION') Holdings
(',', 'ORGANIZATION') (',', ',', 'O') ,
('Inc.', 'ORGANIZATION') ('Inc.', 'NNP', 'O') Inc.
('filing', 'O') ('filing', 'VBG', 'O') filing
('fee', 'O') ('fee', 'JJ', 'O') fee
('$', 'O') ('$', '$', 'O') $
('400', 'O') ('400', 'CD', 'O') 400
(',', 'O') (',', ',', 'O') ,
('receipt', 'O') ('receipt', 'JJ', 'O') receipt
('number', 'O') ('number', 'NN', 'O') number
('0207-8445206', 'O') ('0207-8445206', 'NN', 'O') 0207-8445206
('Was', 'O') ('Was', 'NNP', 'O') Was
('the', 'O') ('the', 'DT', 'O') the
('Disclosure', 'O') ('Disclosure', 'NNP', 'B-ORGANIZATION') Disclosure
('Statement', 'O') ('Statement', 'NNP', 'O') Statement
('on', 'O') ('on', 'IN', 'O') on
('Civil', 'O') ('Civil', 'NNP', 'B-PERSON') Civil
('Cover', 'O')

In [109]:
tagger.tag(word_tokenize(docket_original[4]))

[('In', 'O'),
 ('accordance', 'O'),
 ('with', 'O'),
 ('Rule', 'O'),
 ('73', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('Federal', 'O'),
 ('Rules', 'O'),
 ('of', 'O'),
 ('Civil', 'O'),
 ('Procedure', 'O'),
 ('and', 'O'),
 ('Local', 'O'),
 ('Rule', 'O'),
 ('73.1', 'O'),
 (',', 'O'),
 ('the', 'O'),
 ('parties', 'O'),
 ('are', 'O'),
 ('notified', 'O'),
 ('that', 'O'),
 ('if', 'O'),
 ('all', 'O'),
 ('parties', 'O'),
 ('consent', 'O'),
 ('a', 'O'),
 ('United', 'LOCATION'),
 ('States', 'LOCATION'),
 ('magistrate', 'O'),
 ('judge', 'O'),
 ('of', 'O'),
 ('this', 'O'),
 ('court', 'O'),
 ('is', 'O'),
 ('available', 'O'),
 ('to', 'O'),
 ('conduct', 'O'),
 ('all', 'O'),
 ('proceedings', 'O'),
 ('in', 'O'),
 ('this', 'O'),
 ('civil', 'O'),
 ('action', 'O'),
 ('including', 'O'),
 ('a', 'O'),
 ('(', 'O'),
 ('jury', 'O'),
 ('or', 'O'),
 ('nonjury', 'O'),
 (')', 'O'),
 ('trial', 'O'),
 ('and', 'O'),
 ('to', 'O'),
 ('order', 'O'),
 ('the', 'O'),
 ('entry', 'O'),
 ('of', 'O'),
 ('a', 'O'),
 ('final', 'O'),
 ('j

### Used Stanford NER to identy Names and Entities

In [110]:
%%time

output = []
#length = 10
length = len(docket_original)
for i in range(length):
    org_str = []
    name_str = []
    stripped_str1 = []
    stripped_str2 = []
    tokens = word_tokenize(docket_original[i])
    #print(tokens)
    for STlabel, token in zip(tagger.tag(tokens), tokens):
        #print(label)
        if label[1] == 'ORGANIZATION':
            org_str.append(label[0])
            stripped_str1.append('-ORG-')
        elif label[1] == 'PERSON':
            name_str.append(label[0])
            stripped_str1.append('-NAME-')
        else:
            stripped_str1.append(token)
            stripped_str2.append(token)
    
    output.append([docket_original[i],
                   ' '.join(org_str),
                   ' '.join(name_str),
                   ' '.join(stripped_str1),
                   ' '.join(stripped_str2)])

NameError: name 'label' is not defined

In [111]:
NER_df = pd.DataFrame(output, columns = ['Original Docket Text', 'Organization Portion', 'Name Portion', 
                                         'Identifying Org and Name', 'Stripped Org and Name'])

### To re-build new_df, start here

In [112]:
new_df = NER_df.copy()

In [113]:
print('shape before dedupe: {}'.format(new_df.shape))
new_df.drop_duplicates(inplace = True)
print('shape after dedupe: {}'.format(new_df.shape))

shape before dedupe: (0, 5)
shape after dedupe: (0, 5)


In [114]:
print(new_df.head())
docket_text_list = list(new_df['Stripped Org and Name'])

Empty DataFrame
Columns: [Original Docket Text, Organization Portion, Name Portion, Identifying Org and Name, Stripped Org and Name]
Index: []


In [127]:
def text_preprocess1(text):
    text = text.replace('/', ' ')
    text = re.sub("[\(\[].*?[\)\]]", "", text)
    text = text.replace('-', '')
    text = text.replace('(', ' ')
    text = text.replace(')', ' ')
    text = text.replace('(s)', 's')
    text = text.replace("'s", 's')
    text = text.replace('*', '')
    text = text.replace('', '')
    text = text.replace('<', '')
    text = text.replace('>', '')
    text = text.replace('\\', '')
    text = text.replace('&', ' ')
    
    table = text.maketrans(string.punctuation, len(string.punctuation) * ' ')
    text = text.translate(table)
    return text

def text_preprocess2(text):
    text = text.replace('.', '')
    return text

def remove_stop(sentence):
    output = []
    for word in sentence.split():
        if word not in set(stopwords.words('english')):
            output.append(word)
    return ' '.join(output)

keywords = pd.read_csv('docket_texts/keywords.csv', header = None)
keywords.columns = ['keywords']
keyword_list = list(keywords['keywords'])

In [128]:
print(docket_text_list[5], '\n')
docket_text_list = [text_preprocess1(sentence).lower() for sentence in docket_text_list]
docket_text_list = [text_preprocess2(sentence) for sentence in docket_text_list]
print(docket_text_list[5])
print('\nlength of docket text dataset: {}'.format(len(docket_text_list)))

This attorney case opening filing has been checked for quality control . See the attachment for corrections that were made , if any . ( , ) ( Entered : 03/11/2016 ) 

this attorney case opening filing has been checked for quality control   see the attachment for corrections that were made   if any    

length of docket text dataset: 3203


In [129]:
class Splitter(object):

    def __init__(self):
        self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self,text):

        # split into single sentence
        sentences = self.splitter.tokenize(text)
        # tokenization in each sentences
        tokens = [self.tokenizer.tokenize(remove_stop(sent)) for sent in sentences]
        return tokens


class LemmatizationWithPOSTagger(object):
    def __init__(self):
        pass
    def get_wordnet_pos(self,treebank_tag):
        """
        return WORDNET POS compliance to WORDENT lemmatization (a,n,r,v) 
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # As default pos in lemmatization is Noun
            return wordnet.NOUN

    def pos_tag(self,tokens):
        # find the pos tagginf for each tokens [('What', 'WP'), ('can', 'MD'), ('I', 'PRP') ....
        pos_tokens = [nltk.pos_tag(token) for token in tokens]

        # lemmatization using pos tagg   
        # convert into feature set of [('What', 'What', ['WP']), ('can', 'can', ['MD']), ... ie [original WORD, Lemmatized word, POS tag]
        pos_tokens = [ [(word, lemmatizer.lemmatize(word,self.get_wordnet_pos(pos_tag)), [pos_tag]) for (word,pos_tag) in pos] for pos in pos_tokens]
        return pos_tokens

In [130]:
%%time
lemmatizer = WordNetLemmatizer()
splitter = Splitter()
lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

lemma_docket_text_list = []
for docket_text in docket_text_list:
    #step 1 split document into sentence followed by tokenization
    tokens = splitter.split(docket_text)

    #step 2 lemmatization using pos tagger 
    lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)
    lemma_docket_text_list.append(lemma_pos_token)

Wall time: 1min 12s


In [138]:
print(len(lemma_docket_text_list)) #docket text document level
print(len(lemma_docket_text_list[4])) #docket text sentence level
print(lemma_docket_text_list[4]) #docket text sentence level
print(len(lemma_docket_text_list[0][0])) #docket text word level
print(lemma_docket_text_list[0][0][0]) #docket text token level
print(lemma_docket_text_list[0][0][0][0]) #docket text tuple level

3203
1
[[('accordance', 'accordance', ['NN']), ('rule', 'rule', ['NN']), ('73', '73', ['CD']), ('federal', 'federal', ['JJ']), ('rules', 'rule', ['NNS']), ('civil', 'civil', ['JJ']), ('procedure', 'procedure', ['NN']), ('local', 'local', ['JJ']), ('rule', 'rule', ['NN']), ('73', '73', ['CD']), ('1', '1', ['CD']), ('parties', 'party', ['NNS']), ('notified', 'notify', ['VBN']), ('parties', 'party', ['NNS']), ('consent', 'consent', ['NN']), ('united', 'united', ['JJ']), ('states', 'state', ['NNS']), ('magistrate', 'magistrate', ['VBP']), ('judge', 'judge', ['NN']), ('court', 'court', ['NN']), ('available', 'available', ['JJ']), ('conduct', 'conduct', ['NN']), ('proceedings', 'proceeding', ['NNS']), ('civil', 'civil', ['JJ']), ('action', 'action', ['NN']), ('including', 'include', ['VBG']), ('trial', 'trial', ['NN']), ('order', 'order', ['NN']), ('entry', 'entry', ['NN']), ('final', 'final', ['JJ']), ('judgment', 'judgment', ['NN']), ('attached', 'attach', ['VBN']), ('notice', 'notice', ['

In [133]:
#let's do a collection of what we have, this is to see for seeing what we have from a POS perspective, and potentially remove those POS

pos_collection = {}
for lemma_pos_token in lemma_docket_text_list:
    for sentence in lemma_pos_token:
        for token in sentence:
            #print(token[2][0])
            if token[2][0] not in list(pos_collection.keys()):
                pos_collection[token[2][0]] = []
                pos_collection[token[2][0]].append(token[1])
            else:
                if token[1] not in pos_collection[token[2][0]]:
                    pos_collection[token[2][0]].append(token[1])

In [16]:
pd.DataFrame(dict([ (k, pd.Series(v)) for k, v in collection.items()])).to_csv('NLP_train_pos.csv', index = False)

In [134]:
remove_pos = list(pd.read_excel('NLP_to_be_removed.xlsx', sheetname = 0, header = None)[0])
remove_word = list(pd.read_excel('NLP_to_be_removed.xlsx', sheetname = 1, header = None)[0])
remove_trigram = list(pd.read_excel('NLP_to_be_removed.xlsx', sheetname = 2, header = None)[0])

In [135]:
%%time
'''
remove_pos = ["``", "NNPS", "NNP", "CD", '#', '$', "''", ",", "0", ":"]
remove_word = ["'s", "judge", "party", "defendant", "ex", "plantiff", "shall", "date", "b", "exhibit", "pennsylvania",  
               "Inc..", "inc..", "llc", "'", "[_]", "action", "clerk", "july", "kw", "regard", "sac", "attachment", "c.d", "cal", "case", "cd", "l.p.", 
               "claim", "copy", "court", "direct", "form", "hereby", "magistrate", "p.c", "pl", "plaintiff", "regard", "sign", "time", "mr.", 
               "docket", "follow", "set", "matter" "agreement" "proceeding", "cotton", "january", "february", "march", "april", "may", "june", 
               "july", "august", "september", "october", "november", "december",
               "agreement", "v.", "modify", "fund", "associated", "provide", "material", "amount", "accordingly", "additional", 
               "second", "esq", "transmission", "g.c.", "seal", "review", "honor", "submit", "counsel", "witness", "civ", "first", "ltd..", "enter", 
               "stay", "forth", "matter", "whether", "class", "master", "information", "statement", "submission", "related", "see", "make", "paper", 
               "brookfield", "designate", "remain", "reportertranscriber", "submit", "include", "mail", "fact", "refer", "take", "pursuant", "amount", 
               "behalf", "I.p..", "must", "attorney",
               'abovecapitoned', 'attach', 'add', 'concern', 'chamber', 'close', 'district', 'damage', 'later', 
               'relate', 'return', 'require', 'restriction', 'respect', 'ny', 'seek', 'write', 'expert', 'transcript', 
               'day', 'h.o', 'damage', 'pre', 'proceeding', 'present', 'page', 'pending', 'p.m.', 'frcp', 'g.c.', 'record', 'r.',
              'application', 'filing', 'issue', 'assign', 'iii', 'state', 'protocol', 'loan', 'error', 'file', 'document']
'''
    
#rebuild corpus
docket_texts_output = [] #ultimate output after cleaning

for lemma_pos_token in lemma_docket_text_list:
    docket_text_output = ''
    for sentence in lemma_pos_token:
        sentence_output = []
        for token in sentence:
            #print(token[1])
            
            if token[2][0] not in remove_pos: #if the pos is not in the remove_pos list
                if token[1] not in remove_word: #these are the intentionally left out words
                    sentence_output.append(token[1]) #append the the sentence
        docket_text_output += ' '.join(sentence_output)
    docket_texts_output.append(docket_text_output)
print(docket_texts_output[:10])

['complaint fee receipt number disclosure civil cover sheet complete yes civil cover sheet propose summons', 'please download locate website responsible courtesy individual practice', 'summons inc', 'notice email miss civil cover sheet', 'accordance rule federal rule civil procedure local rule notify consent united available conduct civil trial order entry final judgment notice blank consent fill electronically wish consent also access link http www uscourts gov uscourts formsandfees ao085 pdf withhold consent without adverse substantive consequence consent unless consent', 'open check quality control correction', 'notice appearance', 'backend note complaint', 'ta letter complaint', 'c notice conversion complaint']
Wall time: 159 ms


In [139]:
new_df['Removed unnecessary POS & vocab'] = pd.Series(docket_texts_output)
new_df.head()

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab
0,"COMPLAINT against Cardiogenics Holdings, Inc. ...","Cardiogenics Holdings , Inc. LG Capital Funding",( Steinmetz Michael Bowens Priscilla,COMPLAINT against -ORG- -ORG- -ORG- -ORG- fili...,"COMPLAINT against filing fee $ 400 , receipt n...",complaint fee receipt number disclosure civil ...
1,Case assigned to Judge Ann M Donnelly and Magi...,Individual Practices of the assigned Judges,Ann M Donnelly Vera M. Scanlon Bowens Priscilla,Case assigned to Judge -NAME- -NAME- -NAME- an...,Case assigned to Judge and Magistrate Judge . ...,please download locate website responsible cou...
2,"Summons Issued as to Cardiogenics Holdings, In...",Cardiogenics Holdings,Bowens Priscilla,"Summons Issued as to -ORG- -ORG- , Inc.. ( -NA...","Summons Issued as to , Inc.. ( , ) ( Entered :...",summons inc
3,NOTICE - emailed attorney regarding missing se...,,Bowens Priscilla,NOTICE - emailed attorney regarding missing se...,NOTICE - emailed attorney regarding missing se...,notice email miss civil cover sheet
4,In accordance with Rule 73 of the Federal Rule...,,Bowens Priscilla,In accordance with Rule 73 of the Federal Rule...,In accordance with Rule 73 of the Federal Rule...,accordance rule federal rule civil procedure l...


#### Decision tree to identify keywords and topics based on Chris' feedback

In [140]:
manual_topics_df = pd.read_csv('mannual_topics.csv')
manual_topics_df = manual_topics_df.apply(lambda x: x.astype(str).str.lower())
manual_topics_dict = manual_topics_df.to_dict('list')
for topic in manual_topics_dict.keys():
    manual_topics_dict[topic] = [keyword for keyword in manual_topics_dict[topic] if keyword != 'nan']

In [141]:
def mannual_topic_assignment(text):
    text = text.split()
    #print(text)
    output = []
    for topic in manual_topics_dict.keys():
        if set(text).intersection(manual_topics_dict[topic]):
            output.append(topic)
    #print(output)
    return ', '.join(output)

In [142]:
docket_texts_output_DT = []
topics_DT = []

for text in docket_texts_output:
    topic = mannual_topic_assignment(text)
    #print(topic)
    if topic != '':
        docket_texts_output_DT.append('')
        topics_DT.append(topic)
    else:
        docket_texts_output_DT.append(text)
        topics_DT.append('')

In [143]:
print(topics_DT[:5])
print(docket_texts_output_DT[:5])

['Complaints, Service of Process', '', 'Service of Process', 'Notices', 'Notices, Motions']
['', 'please download locate website responsible courtesy individual practice', '', '', '']


In [144]:
new_df['DT Topics'] = pd.Series(topics_DT)
new_df['Removed unnecessary POS & vocab DT'] = pd.Series(docket_texts_output_DT)
new_df.head()

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab,DT Topics,Removed unnecessary POS & vocab DT
0,"COMPLAINT against Cardiogenics Holdings, Inc. ...","Cardiogenics Holdings , Inc. LG Capital Funding",( Steinmetz Michael Bowens Priscilla,COMPLAINT against -ORG- -ORG- -ORG- -ORG- fili...,"COMPLAINT against filing fee $ 400 , receipt n...",complaint fee receipt number disclosure civil ...,"Complaints, Service of Process",
1,Case assigned to Judge Ann M Donnelly and Magi...,Individual Practices of the assigned Judges,Ann M Donnelly Vera M. Scanlon Bowens Priscilla,Case assigned to Judge -NAME- -NAME- -NAME- an...,Case assigned to Judge and Magistrate Judge . ...,please download locate website responsible cou...,,please download locate website responsible cou...
2,"Summons Issued as to Cardiogenics Holdings, In...",Cardiogenics Holdings,Bowens Priscilla,"Summons Issued as to -ORG- -ORG- , Inc.. ( -NA...","Summons Issued as to , Inc.. ( , ) ( Entered :...",summons inc,Service of Process,
3,NOTICE - emailed attorney regarding missing se...,,Bowens Priscilla,NOTICE - emailed attorney regarding missing se...,NOTICE - emailed attorney regarding missing se...,notice email miss civil cover sheet,Notices,
4,In accordance with Rule 73 of the Federal Rule...,,Bowens Priscilla,In accordance with Rule 73 of the Federal Rule...,In accordance with Rule 73 of the Federal Rule...,accordance rule federal rule civil procedure l...,"Notices, Motions",


In [145]:
#print some examples
for i in range(10):
    print(i)
    if new_df['DT Topics'].iloc[i] != []:
        print(new_df['DT Topics'].iloc[i])
        print(new_df['Removed unnecessary POS & vocab'].iloc[i])
        print(new_df['Removed unnecessary POS & vocab DT'].iloc[i])

0
Complaints, Service of Process
complaint fee receipt number disclosure civil cover sheet complete yes civil cover sheet propose summons

1

please download locate website responsible courtesy individual practice
please download locate website responsible courtesy individual practice
2
Service of Process
summons inc

3
Notices
notice email miss civil cover sheet

4
Notices, Motions
accordance rule federal rule civil procedure local rule notify consent united available conduct civil trial order entry final judgment notice blank consent fill electronically wish consent also access link http www uscourts gov uscourts formsandfees ao085 pdf withhold consent without adverse substantive consequence consent unless consent

5

open check quality control correction
open check quality control correction
6
Notices
notice appearance

7
Complaints
backend note complaint

8
Complaints
ta letter complaint

9
Notices, Complaints
c notice conversion complaint



In [182]:
new_df['Apply Trigram Phrase Model'] = new_df['Removed unnecessary POS & vocab DT'].apply(trigram_transform)



In [171]:
new_df[['DT Topics','Apply Trigram Phrase Model']].head()

Unnamed: 0,DT Topics,Apply Trigram Phrase Model
0,"Complaints, Service of Process",
1,,please_download_locate_website responsible_cou...
2,Service of Process,
3,Notices,
4,"Notices, Motions",


In [177]:
#write trigram to file
trigram_dockets_filepath = 'docket_texts/train/DT/trigram_transformed_dockets_noorgnoname.txt'

In [189]:
with codecs.open(trigram_dockets_filepath, 'w', encoding = 'utf_8') as f:
    for i in range(len(new_df['Apply Trigram Phrase Model'])):
        text = new_df['Apply Trigram Phrase Model'].iloc[i]
        if text != '':
            #print(text)
            f.write(text + '\n')

In [190]:
trigram_dictionary_filepath = 'docket_texts/train/DT/trigram_dict_noorgnoname.dict'

In [191]:
%%time

#some dictionary hyperparameters:
no_below = 10 #reference is 10
no_above = 0.4 #reference is 0.4

trigram_reviews = LineSentence(trigram_dockets_filepath)

# learn the dictionary by iterating over all of the reviews
trigram_dictionary = Dictionary(trigram_reviews)

# filter tokens that are very rare otrigram_reviewsr too common from
# the dictionary (filter_extremes) and reassign integer ids (compactify)
trigram_dictionary.filter_extremes(no_below = no_below, no_above = no_above) #this step is questionable. May need to change the parameters
trigram_dictionary.compactify()

trigram_dictionary.save(trigram_dictionary_filepath)
    
# load the finished dictionary from disk
#trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

Wall time: 13 ms


In [192]:
trigram_bow_filepath = 'docket_texts/train/DT/trigram_bow_corpus_noorgnoname.mm'

In [193]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        #print(review)
        #print(trigram_dictionary.doc2bow(review))
        yield trigram_dictionary.doc2bow(review)

In [194]:
%%time

# generate bag-of-words representations for
# all reviews and save them as a matrix
MmCorpus.serialize(trigram_bow_filepath, trigram_bow_generator(trigram_sentences_filepath))
    
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)
print(trigram_bow_corpus)

MmCorpus(552 documents, 32 features, 832 non-zero entries)
Wall time: 12 ms


In [195]:
new_df.head()

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab,DT Topics,Removed unnecessary POS & vocab DT,Apply Trigram Phrase Model
0,"COMPLAINT against Cardiogenics Holdings, Inc. ...","Cardiogenics Holdings , Inc. LG Capital Funding",( Steinmetz Michael Bowens Priscilla,COMPLAINT against -ORG- -ORG- -ORG- -ORG- fili...,"COMPLAINT against filing fee $ 400 , receipt n...",complaint fee receipt number disclosure civil ...,"Complaints, Service of Process",,
1,Case assigned to Judge Ann M Donnelly and Magi...,Individual Practices of the assigned Judges,Ann M Donnelly Vera M. Scanlon Bowens Priscilla,Case assigned to Judge -NAME- -NAME- -NAME- an...,Case assigned to Judge and Magistrate Judge . ...,please download locate website responsible cou...,,please download locate website responsible cou...,please_download_locate_website responsible_cou...
2,"Summons Issued as to Cardiogenics Holdings, In...",Cardiogenics Holdings,Bowens Priscilla,"Summons Issued as to -ORG- -ORG- , Inc.. ( -NA...","Summons Issued as to , Inc.. ( , ) ( Entered :...",summons inc,Service of Process,,
3,NOTICE - emailed attorney regarding missing se...,,Bowens Priscilla,NOTICE - emailed attorney regarding missing se...,NOTICE - emailed attorney regarding missing se...,notice email miss civil cover sheet,Notices,,
4,In accordance with Rule 73 of the Federal Rule...,,Bowens Priscilla,In accordance with Rule 73 of the Federal Rule...,In accordance with Rule 73 of the Federal Rule...,accordance rule federal rule civil procedure l...,"Notices, Motions",,


In [196]:
new_df.columns

Index(['Original Docket Text', 'Organization Portion', 'Name Portion',
       'Identifying Org and Name', 'Stripped Org and Name',
       'Removed unnecessary POS & vocab', 'DT Topics',
       'Removed unnecessary POS & vocab DT', 'Apply Trigram Phrase Model'],
      dtype='object')

In [197]:
new_df[['Original Docket Text', 'Removed unnecessary POS & vocab', 
        'Removed unnecessary POS & vocab DT', 'DT Topics',
        'Apply Trigram Phrase Model']].to_csv('NEW_NLP_output.csv', index = False)

In [198]:
new_df.shape

(3203, 9)

In [199]:
new_df['Original Docket Text'].nunique()

3203

### Stop here if you don't need to compare with previous NLP results

In [76]:
noDT_data = pd.read_excel(r'E:\WinUser\Documents\Python Code\AI Paralegal\docket_texts\Train\DT\New Topics - Classification -5.27.2018.xlsx')
noDT_data.drop('DT Topics', axis = 1, inplace = True)
noDT_data.head()

Unnamed: 0,Original Docket Text,Removed unnecessary POS & vocab,Removed unnecessary POS & vocab DT,Apply Trigram Phrase Model,New Topocs,Action [Y/N],If Y
0,(REDACTED) RULE 56.1 STATEMENT. Document filed...,"['rule .', 'document file .']","['rule .', 'document file .']","['rule .', 'document file .']",System Msg,N,
1,***DELETED DOCUMENT. Deleted document number 1...,"['delete document .', 'delete document number ...","['delete document .', 'delete document number ...","['delete document .', 'delete document number ...",System Msg,N,
2,***DELETED DOCUMENT. Deleted document number 2...,"['delete document .', 'delete document number ...","['delete document .', 'delete document number ...","['delete document .', 'delete document number ...",System Msg,N,
3,***NOTE TO ATTORNEY OF NON-ECF CASE ERROR. Not...,"['note nonecf error .', 'note manually refile ...","['note nonecf error .', 'note manually refile ...","['note nonecf error .', 'note manually refile ...",System Msg,N,
4,***NOTE TO ATTORNEY TO RE-FILE DOCUMENT - NON-...,"['note refile document nonecf error .', 'note ...","['note refile document nonecf error .', 'note ...","['note refile document nonecf error .', 'note ...",System Msg,N,


In [105]:
print('old data shape: {}'.format(noDT_data.shape))
print('old data no duplicates shape: {}'.format(noDT_data.drop_duplicates().shape))
print('new data shape: {}'.format(new_df.shape))
print('new data no duplicates shape: {}'.format(new_df.drop_duplicates().shape))
print('new data no DT: {}'.format((new_df.drop_duplicates()['DT Topics'] == '').sum()))

old data shape: (624, 7)
old data no duplicates shape: (602, 7)
new data shape: (3244, 9)
new data no duplicates shape: (3203, 9)
new data no DT: 596


In [111]:
combined_df = new_df[new_df['DT Topics'] == ''].drop_duplicates().merge(noDT_data.drop_duplicates(), how = 'outer', on = 'Original Docket Text')
combined_df.shape

(603, 15)

In [112]:
combined_df.to_csv('old_v_new_study.csv', index = False)

In [101]:
noDT_data.columns

Index(['Original Docket Text', 'Removed unnecessary POS & vocab',
       'Removed unnecessary POS & vocab DT', 'Apply Trigram Phrase Model',
       'New Topocs', 'Action [Y/N]', 'If Y'],
      dtype='object')

In [96]:
combined_df.columns

Index(['Original Docket Text', 'Organization Portion', 'Name Portion',
       'Identifying Org and Name', 'Stripped Org and Name',
       'Removed unnecessary POS & vocab_x', 'DT Topics',
       'Removed unnecessary POS & vocab DT_x', 'Apply Trigram Phrase Model_x',
       'Removed unnecessary POS & vocab_y',
       'Removed unnecessary POS & vocab DT_y', 'Apply Trigram Phrase Model_y',
       'New Topocs', 'Action [Y/N]', 'If Y'],
      dtype='object')

In [99]:
combined_df[(combined_df['DT Topics'] != '') & (combined_df['Action [Y/N]'] == '')].head()

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab_x,DT Topics,Removed unnecessary POS & vocab DT_x,Apply Trigram Phrase Model_x,Removed unnecessary POS & vocab_y,Removed unnecessary POS & vocab DT_y,Apply Trigram Phrase Model_y,New Topocs,Action [Y/N],If Y


In [None]:
.to_csv('old_v_new_study.csv', index = False)

In [113]:
text1 = "MEMO ENDORSEMENT on re: (37 in 1:04-cv-07900-LAK, (72 in 1:03-cv-02387-LAK) MOTION for an entry of an order in the form attached hereto as Exhibit 1 attached to this motion. ENDORSED: Granted. So Ordered (Signed by Judge Lewis A. Kaplan on 7/29/2008) Filed In Associated Cases: 1:03-cv-02387-LAK, 1:04-cv-07900-LAK(jfe) (Entered: 07/29/2008)"
noDT_data[noDT_data['Original Docket Text'] == text1]
#Chris' input: 1) Y, 2) Court's Order, 3) Triage

Unnamed: 0,Original Docket Text,Removed unnecessary POS & vocab,Removed unnecessary POS & vocab DT,Apply Trigram Phrase Model,New Topocs,Action [Y/N],If Y
242,MEMO ENDORSEMENT on re: (37 in 1:04-cv-07900-L...,['memo endorsement motion entry order hereto a...,['memo endorsement motion entry order hereto a...,['memo_endorsement motion entry order hereto a...,,,


In [114]:
new_df[new_df['Original Docket Text'] == text1]
#Chris' input: 1) Y, 2) Court's Order, 3) Triage

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab,DT Topics,Removed unnecessary POS & vocab DT,Apply Trigram Phrase Model
679,MEMO ENDORSEMENT on re: (37 in 1:04-cv-07900-L...,,Lewis A. Kaplan,MEMO ENDORSEMENT on re : ( 37 in 1:04-cv-07900...,MEMO ENDORSEMENT on re : ( 37 in 1:04-cv-07900...,memo endorsement motion entry order hereto att...,,memo endorsement motion entry order hereto att...,memo_endorsement motion entry order hereto att...


In [115]:
text2 = "DECISION and ORDER: Pursuant to 28 U.S.C. § 1404, this Court hereby transfers this matter to the United States District Court for the Southern District of New York. Case transferred to District of Southern District of New York. Original file, certified copy of transfer order, and docket sheet sent. ALL FILINGS ARE TO BE MADE IN THE TRANSFER COURT, DO NOT DOCKET TO THIS CASE.. So Ordered by Judge William F. Kuntz, II on 4/17/2017. (Tavarez, Jennifer) [Transferred from New York Eastern on 4/28/2017.] (Entered: 04/19/2017)"
noDT_data[noDT_data['Original Docket Text'] == text1]
#Chris' input: 1) Y, 2) Court's Order, 3) Triage

Unnamed: 0,Original Docket Text,Removed unnecessary POS & vocab,Removed unnecessary POS & vocab DT,Apply Trigram Phrase Model,New Topocs,Action [Y/N],If Y


In [116]:
new_df[new_df['Original Docket Text'] == text2]
#Chris' input: 1) Y, 2) Court's Order, 3) Triage

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab,DT Topics,Removed unnecessary POS & vocab DT,Apply Trigram Phrase Model
733,DECISION and ORDER: Pursuant to 28 U.S.C. § 14...,United States District Court for the Southern ...,William F. Kuntz Tavarez Jennifer,DECISION and ORDER : Pursuant to 28 U.S.C . § ...,DECISION and ORDER : Pursuant to 28 U.S.C . § ...,decision order u c § transfer transfer origina...,,decision order u c § transfer transfer origina...,decision order u_c § transfer transfer origina...


In [118]:
text3 = "***NOTE TO ATTORNEY TO RE-FILE DOCUMENT - NON-ECF DOCUMENT ERROR. Note to Attorney Gary A. Bornstein: Document No. 329 is an Exhibit. This document is not filed via ECF. Exhibits are ONLY filed as attachments to a supporting or opposing document. (ldi) (Entered: 01/02/2013)"
noDT_data[noDT_data['Original Docket Text'] == text3]
#Chris' input: 1) Y, 2) Court's Order, 3) Triage

Unnamed: 0,Original Docket Text,Removed unnecessary POS & vocab,Removed unnecessary POS & vocab DT,Apply Trigram Phrase Model,New Topocs,Action [Y/N],If Y
6,***NOTE TO ATTORNEY TO RE-FILE DOCUMENT - NON-...,['note refile document nonecf document error ....,['note refile document nonecf document error ....,['note refile document nonecf document error ....,System Msg,,


In [119]:
new_df[new_df['Original Docket Text'] == text3]
#Chris' input: 1) Y, 2) Court's Order, 3) Triage

Unnamed: 0,Original Docket Text,Organization Portion,Name Portion,Identifying Org and Name,Stripped Org and Name,Removed unnecessary POS & vocab,DT Topics,Removed unnecessary POS & vocab DT,Apply Trigram Phrase Model
1391,***NOTE TO ATTORNEY TO RE-FILE DOCUMENT - NON-...,ECF,Gary A. Bornstein,***NOTE TO ATTORNEY TO RE-FILE DOCUMENT - NON-...,***NOTE TO ATTORNEY TO RE-FILE DOCUMENT - NON-...,note refile nonecf note via support oppose,,note refile nonecf note via support oppose,note refile nonecf note via support oppose
