### LDA background
LDA assumes that documents are probability distributions over latent topics.
Topics, in turn, are probability distributions over words.
LDA takes a collection of documents and assumes that the words in each document are related. It then tries to figure out the 'recipe' by which each document could have been created. We only need to tell the model how many topics to construct; it then uses that 'recipe' to infer topic and word distributions over the corpus. Based on that output, we can identify similar documents within the corpus.

### To understand the LDA process, we have to know how LDA assumes documents are generated:
1. determine the number of words in the document
2. choose a topic mixture for the document over a fixed set of topics (i.e. topic A 20%, topic B 50%, etc.)
3. generate words in the document by:
    - picking a topic based on the document's multinomial topic distribution
    - picking a word based on that topic's multinomial word distribution
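
The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the assumed generative story, not how gensim implements LDA: the toy vocabulary, the two topic-word distributions, and the helper names `sample_dirichlet` and `generate_document` are all made up for this example, and the document length is simply passed in rather than drawn at random.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one document following the three steps above."""
    n_topics = len(topic_word_probs)
    # step 2: choose a topic mixture for the document
    topic_mixture = sample_dirichlet([alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):  # step 1: the document length is fixed at n_words
        # step 3a: pick a topic from the document's topic mixture
        topic = rng.choices(range(n_topics), weights=topic_mixture)[0]
        # step 3b: pick a word from that topic's word distribution
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return topic_mixture, words

rng = random.Random(0)
vocab = ['motion', 'order', 'summons', 'complaint']  # toy vocabulary (assumption)
topic_word_probs = [
    [0.45, 0.45, 0.05, 0.05],  # hypothetical topic A
    [0.05, 0.05, 0.45, 0.45],  # hypothetical topic B
]
mixture, doc = generate_document(topic_word_probs, vocab, alpha=0.5, n_words=8, rng=rng)
print(mixture)
print(doc)
```

Training LDA amounts to running this story in reverse: we only observe the words and have to infer the topic mixtures and topic-word distributions.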

### Working backwards
Suppose you have a corpus of documents, and you want LDA to learn the topic representation of K topics in each document and the word distribution of each topic. LDA backtracks from the document level to identify the topics that are likely to have generated the corpus.

### LDA's Magic
1. randomly assign each word in each document to one of the K topics
2. for each word w in each document d
    - assume that all topic assignments except for the current one are correct
    - calculate two proportions:
        1. proportion of words in document d that are currently assigned to topic t = p(topic t | document d)
        2. proportion of assignments to topic t, over all documents, that come from this word w = p(word w | topic t)
    - multiply those two proportions and assign w a new topic based on the resulting probability: p(topic t | document d) * p(word w | topic t)
3. repeat step 2 many times; eventually we'll reach a steady state where the assignments make sense
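
As a rough sketch, the loop above looks like the following in plain Python. This is a toy Gibbs-style sampler written for illustration only; it is not what gensim runs internally. The function name `gibbs_lda`, the smoothing values `alpha` and `beta` (added so that no probability is ever exactly zero), and the toy documents are assumptions.

```python
import random

def gibbs_lda(docs, n_topics, n_iters=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    n_vocab = len({w for doc in docs for w in doc})
    # step 1: randomly assign each word in each document to one of the K topics
    assignments = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]      # words in doc d assigned to topic t
    topic_word = [dict() for _ in range(n_topics)]  # times word w is assigned to topic t
    topic_total = [0] * n_topics                    # all assignments to topic t
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1
    # steps 2 and 3: resample every assignment, many times over
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # treat every assignment except the current one as correct
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1
                weights = []
                for t in range(n_topics):
                    # proportional to p(topic t | document d); the shared
                    # denominator is dropped because it does not depend on t
                    p_topic_given_doc = doc_topic[d][t] + alpha
                    # smoothed p(word w | topic t)
                    p_word_given_topic = ((topic_word[t].get(w, 0) + beta)
                                          / (topic_total[t] + beta * n_vocab))
                    weights.append(p_topic_given_doc * p_word_given_topic)
                # draw the new topic in proportion to the product
                t_new = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] = topic_word[t_new].get(w, 0) + 1
                topic_total[t_new] += 1
    return assignments, doc_topic

docs = [
    ['motion', 'order', 'motion', 'order'],
    ['summons', 'complaint', 'summons', 'complaint'],
    ['motion', 'order', 'summons', 'complaint'],
]
assignments, doc_topic = gibbs_lda(docs, n_topics=2)
print(doc_topic)  # per-document counts of words assigned to each topic
```

The result is a sample, so the exact counts depend on the seed and on the number of iterations.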

### alpha (parameter of the Dirichlet prior on the per-document topic distribution)
high: each document will contain a mix of many topics
low: each document will contain only a few topics

### beta (parameter of the Dirichlet prior on the per-topic word distribution)
high: each topic will contain many words
low: each topic will contain few words
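
To get a feel for what "high" and "low" mean for alpha and beta, we can draw a few samples from a symmetric Dirichlet distribution with different concentration values. This is a small sketch using only the standard library; `sample_dirichlet` is a helper written for this example, and the values 0.1, 1.0 and 10.0 are arbitrary.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet by normalizing gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
for concentration in (0.1, 1.0, 10.0):
    # each sample is one possible topic mixture over 5 topics
    mixture = sample_dirichlet(concentration, size=5, rng=rng)
    print(concentration, [round(p, 2) for p in mixture])
```

Low values tend to put most of the probability mass on one or two entries, while high values tend to spread it almost evenly. The same reasoning applies to beta, with words in place of topics.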

### theta (topic distribution for document m)
### z (topic for the n-th word in document m)
### w (specific word)

### step 0: examine and import corpus

In [25]:
#all the required libraries
import pandas as pd
from bs4 import BeautifulSoup

import os
import codecs
import itertools as it

import spacy
nlp = spacy.load('en')

from gensim.models import Phrases #Phrases seems slower, but Phraser was not compatible with our code? needs some research
from gensim.models.word2vec import LineSentence
from spacy.lang.en.stop_words import STOP_WORDS

In [26]:
def grab_dockets():
    files = []
    #get all .html files in the folder (all docket files are in .html)
    for file in os.listdir('docket_texts/'):
        if file.endswith('.html'):
            files.append(os.path.join('docket_texts/', file))

    df_docket_texts = pd.DataFrame()
    
    for i in range(len(files)): #gather all docket texts
    #for i in [0, 1]: #for testing purposes
        
        content = codecs.open(files[i], 'r', 'utf-8').read()
        #use beautiful soup to get the case ID
        soup = BeautifulSoup(content, 'lxml')
        case_id = str(soup.find_all('h3'))    
        bookmark1 = case_id.find('CASE #:') + len('CASE #:')
        bookmark2 = case_id.find('</h3>')
        case_id = case_id[bookmark1:bookmark2]

        #use pandas to grab tables in the html files
        docket_tables = pd.read_html(content)

        #error checking: the number of tables varies from file to file
        #the docket texts are usually in docket_tables[3], but not always
        n = 0
        while docket_tables[n].isin(['Docket Text']).sum().sum() == 0:
            #print(n, docket_tables[n].isin(['Docket Text']).sum().sum())
            n += 1
                        
        #print(i, files[i])
        #print(docket_tables[n].head())

        #docket_tables[n] is the docket text table
        new_header = docket_tables[n].iloc[0]
        docket_tables[n] = docket_tables[n][1:]
        docket_tables[n].columns = new_header
        
        docket_tables[n]['#'] = pd.to_numeric(docket_tables[n]['#'],
                                              downcast = 'signed', errors = 'coerce')
        docket_tables[n]['Date Filed'] = pd.to_datetime(docket_tables[n]['Date Filed'])
        docket_tables[n]['Case ID'] = case_id

        df_docket_texts = pd.concat([df_docket_texts, docket_tables[n]])
    #reorder a column
    cols = list(df_docket_texts.columns)
    df_docket_texts = df_docket_texts[[cols[-1]] + cols[:-1]]
    
    print('current docket text table size/shape: {}'.format(df_docket_texts.shape))
    return df_docket_texts
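
The case ID extraction in `grab_dockets` relies on two string "bookmarks". A slightly more defensive version of the same idea is sketched below. The helper name `extract_case_id` and the sample markup are assumptions; the real `<h3>` content in the docket files may look different.

```python
def extract_case_id(h3_markup, marker='CASE #:'):
    """Return the text between the marker and the next closing </h3> tag."""
    start = h3_markup.find(marker)
    if start == -1:
        return None  # marker not present, nothing to extract
    start += len(marker)
    # look for the closing tag only after the marker
    end = h3_markup.find('</h3>', start)
    if end == -1:
        end = len(h3_markup)
    return h3_markup[start:end].strip()

# hypothetical markup, similar to str(soup.find_all('h3'))
sample = '[<h3>CIVIL DOCKET FOR CASE #: 1:16-cv-01215-AMD-SJB</h3>]'
print(extract_case_id(sample))
```

Searching for `</h3>` only after the marker avoids an empty slice when an earlier `<h3>` closes before the case number, and `strip()` drops the space that follows `CASE #:`.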

In [27]:
#current docket text table size/shape: (721, 4), 2018-04-18
df = grab_dockets()
df.head()

current docket text table size/shape: (721, 4)


Unnamed: 0,Case ID,Date Filed,#,Docket Text
1,1:16-cv-01215-AMD-SJB,2016-03-10,1.0,"COMPLAINT against Cardiogenics Holdings, Inc. ..."
2,1:16-cv-01215-AMD-SJB,2016-03-10,,Case assigned to Judge Ann M Donnelly and Magi...
3,1:16-cv-01215-AMD-SJB,2016-03-10,2.0,"Summons Issued as to Cardiogenics Holdings, In..."
4,1:16-cv-01215-AMD-SJB,2016-03-11,,NOTICE - emailed attorney regarding missing se...
5,1:16-cv-01215-AMD-SJB,2016-03-11,3.0,In accordance with Rule 73 of the Federal Rule...


In [28]:
docket_original = list(df['Docket Text'])
print(docket_original[0:2])

['COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)', 'Case assigned to Judge Ann M Donnelly and Magistrate Judge Vera M. Scanlon. Please download and review the Individual Practices of the assigned Judges, located on our website. Attorneys are responsible for providing courtesy copies to judges where their Individual Practices require such. (Bowens, Priscilla) (Entered: 03/11/2016)']


In [29]:
docket_txt_filepath = 'docket_texts/docket_text_all.txt'

In [34]:
%%time

# if we need to update file, then True
if True:
    
    docket_count = 0

    # create & open a new file in write mode
    with codecs.open(docket_txt_filepath, 'w', encoding = 'utf_8') as docket_txt_file:


        for i in range(len(docket_original)):
            docket_txt_file.write(docket_original[i] + '\n')
            docket_count += 1

    print('Text from {:,} docket texts written to the new txt file'.format(docket_count))
    
else:
    # count the docket texts already stored in the txt file
    with codecs.open(docket_txt_filepath, encoding = 'utf_8') as docket_txt_file:
        docket_count = sum(1 for line in docket_txt_file)
        
    print('Text from {:,} docket texts in the txt file.'.format(docket_count))

Text from 721 docket texts written to the new txt file
Wall time: 5.01 ms


In [35]:
# extract a sample docket text for testing
start = 0
finish = 10
with codecs.open(docket_txt_filepath, encoding = 'utf_8') as f:
    # read all lines once; counting after it.islice would miss the lines already consumed
    lines = list(f)

sample_review = lines[start:finish][0]
length = len(lines)

print('# of docket texts {}'.format(length))
print(sample_review)

# of docket texts 721
COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)



In [36]:
%%time
#use spaCy's English model to parse the sample docket text
parsed_review = nlp(sample_review)

Wall time: 129 ms


In [37]:
#the parsed object prints just like the original text, but it carries a lot of extra information
print(type(parsed_review))
print(parsed_review)

<class 'spacy.tokens.doc.Doc'>
COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC. (Steinmetz, Michael) (Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons) (Bowens, Priscilla). (Entered: 03/10/2016)



In [38]:
#for example, we can print out the sentences from the parsed object
for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')

Sentence 1:
COMPLAINT against Cardiogenics Holdings, Inc. filing fee $ 400, receipt number 0207

Sentence 2:
-8445206 Was the Disclosure Statement on Civil Cover Sheet completed -YES,, filed by LG Capital Funding, LLC.

Sentence 3:
(Steinmetz, Michael)

Sentence 4:
(Additional attachment(s) added on 3/11/2016: # 1 Civil Cover Sheet, # 2 Proposed Summons)

Sentence 5:
(Bowens, Priscilla).

Sentence 6:
(Entered: 03/10/2016)




In [39]:
#we can also look at entities
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

Entity 1: COMPLAINT - ORG

Entity 2: Cardiogenics Holdings, Inc. - ORG

Entity 3: 400 - MONEY

Entity 4: the Disclosure Statement on Civil Cover Sheet - ORG

Entity 5: LG Capital Funding - ORG

Entity 6: LLC - PERSON

Entity 7: Steinmetz - PERSON

Entity 8: Michael - PERSON

Entity 9: 3/11/2016 - DATE

Entity 10: # 1 - MONEY

Entity 11: Civil Cover Sheet - ORG

Entity 12: Bowens - PERSON

Entity 13: Priscilla - PERSON

Entity 14: Entered - PERSON

Entity 15: 
 - GPE



In [40]:
token_text = [token.orth_ for token in parsed_review] #Verbatim text content 
token_pos = [token.pos_ for token in parsed_review] #part-of-speech.

pd.DataFrame(list(zip(token_text, token_pos)), columns=['token_text', 'part_of_speech'])

Unnamed: 0,token_text,part_of_speech
0,COMPLAINT,PROPN
1,against,ADP
2,Cardiogenics,PROPN
3,Holdings,PROPN
4,",",PUNCT
5,Inc.,PROPN
6,filing,NOUN
7,fee,NOUN
8,$,SYM
9,400,NUM


In [41]:
token_lemma = [token.lemma_ for token in parsed_review] # Base form of the token, with no inflectional suffixes.
token_shape = [token.shape_ for token in parsed_review] # Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd".

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)), columns=['token_text', 'token_lemma', 'token_shape'])

Unnamed: 0,token_text,token_lemma,token_shape
0,COMPLAINT,complaint,XXXX
1,against,against,xxxx
2,Cardiogenics,cardiogenics,Xxxxx
3,Holdings,holdings,Xxxxx
4,",",",",","
5,Inc.,inc.,Xxx.
6,filing,filing,xxxx
7,fee,fee,xxx
8,$,$,$
9,400,400,ddd


In [42]:
token_entity_type = [token.ent_type_ for token in parsed_review] #Named entity type.
token_entity_iob = [token.ent_iob_ for token in parsed_review] #IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.

pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
             columns=['token_text', 'entity_type', 'inside_outside_begin'])

Unnamed: 0,token_text,entity_type,inside_outside_begin
0,COMPLAINT,ORG,B
1,against,,O
2,Cardiogenics,ORG,B
3,Holdings,ORG,I
4,",",ORG,I
5,Inc.,ORG,I
6,filing,,O
7,fee,,O
8,$,,O
9,400,MONEY,B
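
The IOB codes can be turned back into whole entities by starting a new span at every "B" and extending it on every "I". Below is a minimal sketch of that idea in plain Python, using the ten tokens from the table above as input. The helper name `group_entities` is made up for this example; spaCy already exposes the merged spans through `parsed_review.ents`.

```python
def group_entities(tokens, iob_tags, entity_types):
    """Merge token-level IOB tags into (entity_text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None
    for text, iob, ent_type in zip(tokens, iob_tags, entity_types):
        if iob == 'B':
            # a new entity begins: close the previous one, if any
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [text], ent_type
        elif iob == 'I' and current_tokens:
            # still inside the current entity
            current_tokens.append(text)
        else:
            # outside any entity: close the previous one, if any
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities

tokens = ['COMPLAINT', 'against', 'Cardiogenics', 'Holdings', ',', 'Inc.',
          'filing', 'fee', '$', '400']
iob = ['B', 'O', 'B', 'I', 'I', 'I', 'O', 'O', 'O', 'B']
types = ['ORG', '', 'ORG', 'ORG', 'ORG', 'ORG', '', '', '', 'MONEY']
print(group_entities(tokens, iob, types))
```

Joining tokens with a single space does not reproduce the original spacing (the comma ends up separated from "Holdings"), so this is only meant to show how the tags relate to the entities.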


In [43]:
token_attributes = [(token.orth_, #Verbatim text content
                     token.prob, #Smoothed log probability estimate of token's type
                     token.is_stop, #Is the token part of a "stop list"
                     token.is_punct, #Is the token punctuation
                     token.is_space, #Does the token consist of whitespace characters
                     token.like_num, #Does the token represent a number
                     token.is_oov) #Is the token out-of-vocabulary
                    for token in parsed_review]

df_temp = pd.DataFrame(token_attributes,
                       columns=['text', 'log_probability', 'stop?', 'punctuation?',
                                'whitespace?', 'number?', 'out of vocab.?'])

df_temp.loc[:, 'stop?':'out of vocab.?'] = (df_temp.loc[:, 'stop?':'out of vocab.?']
                                            .applymap(lambda x: u'Yes' if x else u''))
                                               
df_temp

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,COMPLAINT,-20.0,,,,,Yes
1,against,-20.0,Yes,,,,Yes
2,Cardiogenics,-20.0,,,,,Yes
3,Holdings,-20.0,,,,,Yes
4,",",-20.0,,Yes,,,Yes
5,Inc.,-20.0,,,,,Yes
6,filing,-20.0,,,,,Yes
7,fee,-20.0,,,,,Yes
8,$,-20.0,,,,,Yes
9,400,-20.0,,,,Yes,Yes


In [None]:
### import corpus, only for relevant texts

In [None]:
import os
import codecs
import json

import spacy
import pandas as pd
import itertools as it

from gensim.models import Phrases #Phrases seems slower, but Phraser was not compatible with our code? needs some research
from gensim.models.word2vec import LineSentence
from spacy.lang.en.stop_words import STOP_WORDS

from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle

nlp = spacy.load('en')

In [None]:
def lda_description(review_text, min_topic_freq = 0.05):
    """
    accept the original text of a document (here, a docket text) and
    1. parse it with spaCy,
    2. apply text pre-processing steps,
    3. create a bag-of-words representation,
    4. create an LDA representation, and
    5. print a sorted list of the top topics in the LDA representation
    """
    
    # parse the review text with spaCy
    parsed_review = nlp(review_text)
    
    # lemmatize the text and remove punctuation and whitespace
    unigram_review = [token.lemma_ for token in parsed_review if not punct_space(token)]
    
    # apply the first-order and second-order phrase models
    bigram_review = bigram_model[unigram_review]
    trigram_review = trigram_model[bigram_review]
    
    # remove any remaining stopwords
    trigram_review = [term for term in trigram_review if term not in STOP_WORDS]
    
    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)
    
    # create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly related topics first
    review_lda.sort(key = lambda tup: tup[1], reverse = True)
    
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        print('{:25} {}'.format(topic_names[topic_number], round(freq, 3)))
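
`lda_description` calls a helper named `punct_space` that is not defined in this section. A plausible definition, assuming it is only meant to flag punctuation and whitespace tokens, is sketched below. The stand-in tokens are built with `SimpleNamespace` so the example runs without loading a spaCy model; real spaCy tokens expose the same `is_punct` and `is_space` attributes.

```python
from types import SimpleNamespace

def punct_space(token):
    """Return True for tokens that are pure punctuation or whitespace (assumed behavior)."""
    return token.is_punct or token.is_space

# stand-in tokens exposing only the two attributes the helper reads
comma = SimpleNamespace(orth_=',', is_punct=True, is_space=False)
word = SimpleNamespace(orth_='fee', is_punct=False, is_space=False)
print(punct_space(comma), punct_space(word))
```

The function also expects `bigram_model`, `trigram_model`, `trigram_dictionary`, `lda` and `topic_names` to exist already, so those have to be trained or loaded before it is called.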