# Topic Modeling with various methods
Topic modeling is a powerful tool for quickly sorting through a large collection of documents without having to read every one. Python offers several methods for this, spread across several libraries. Getting meaningful results from topic modeling is extremely challenging, though. "Garbage in, garbage out" applies well here: it takes a significant amount of text preprocessing to extract the right information to feed into a model. On this sheet, I will be topic modeling Supreme Court cases with the following:

__scikit-learn__

LDA (with TF)

NMF (with TFIDF)

LSA - AKA TruncatedSVD (with TF and TFIDF)

### Process of the ENTIRE project
Extracting text using Beautiful Soup --> processing the text --> fitting the text to a model --> applying the model to other text
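
To make the "processing the text" step concrete, here is a minimal sketch using only the standard library. The stop-word list and the sample sentence are made up for illustration, and the sketch leaves out lemmatization, so it is not the preprocessing actually used for this project.

```python
import re

# Hypothetical stop-word list for illustration; a real run would use a fuller one.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "for", "was"}


def preprocess(raw_text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


# Made-up sentence standing in for the text of an opinion.
sample = "The judgment of the Court of Appeals is REVERSED, 5-4."
cleaned = preprocess(sample)
print(cleaned)
```

The output is a single space-separated string per document, which is the shape the vectorizers below expect.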

In [3]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from textblob import TextBlob
from sklearn.preprocessing import Normalizer

In [25]:
doc_list = pd.read_pickle("full_proj_lemmatized3.pickle")  # always save your work!

In [26]:
doc_list.shape #checking to make sure we have the info we expected to have

(23268, 5)

## _____________________________________________________________________
## Model testing section
I'm trying LDA, NMF, and LSA, and adjusting the number of features, the number of topics, and overlap.

In [103]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
    
    
def modeler(corp, n_topics, n_top_words, clf, vect):
    """Vectorize corp with vect, fit clf on the result, and print each topic's top words."""
    str_vect = str(vect).split("(")[0]
    str_clf = str(clf).split("(")[0]

    print("Extracting {} features for {}...".format(str_vect, str_clf))
    vect_trans = vect.fit_transform(corp)
    n_features = vect_trans.shape[1]  # vocabulary size after vectorizing

    # Fit the model
    print("Fitting the {} model with {} features, "
          "n_topics= {}, n_topic_words= {}, n_features= {}..."
          .format(str_clf, str_vect, n_topics, n_top_words, n_features))

    clf = clf.fit(vect_trans)
    if str_clf == "TruncatedSVD":
        print("\nExplained variance ratio", clf.explained_variance_ratio_)

    print("\nTopics in {} model:".format(str_clf))
    # get_feature_names() was removed in scikit-learn 1.2;
    # on versions older than 1.0, call get_feature_names() instead
    feature_names = vect.get_feature_names_out()
    return print_top_words(clf, feature_names, n_top_words)
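
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` is easy to misread, so here is a small sketch of what it does. It uses plain lists in place of NumPy arrays, and the vocabulary and weights are made up.

```python
# Made-up vocabulary and topic weights, for illustration only.
feature_names = ["court", "tariff", "shipper", "jury", "patent"]
topic = [0.1, 0.9, 0.7, 0.0, 0.3]
n_top_words = 3

# Plain-Python equivalent of argsort: indices ordered from smallest to largest weight.
ascending = sorted(range(len(topic)), key=lambda i: topic[i])

# Walk backwards from the end and stop after n_top_words entries,
# which gives the indices of the largest weights in descending order.
top_idx = ascending[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)
```

So each printed topic line lists the terms from the highest weight downward.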

### NMF model
Find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. This factorization can be used, for example, for dimensionality reduction, source separation, or topic extraction.
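
To show what "two non-negative matrices whose product approximates X" means in practice, here is a rough pure-Python sketch using the classic multiplicative update rules. This is an illustration under my own assumptions (toy matrix, helper names, iteration count), not scikit-learn's implementation, which may use a different solver.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(X, W, H):
    """Frobenius norm of X - WH."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry)) ** 0.5


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor X (documents x terms) into W (documents x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Toy document-term matrix: two documents use terms 0-1, two use terms 2-3.
X = [[2, 1, 0, 0],
     [4, 2, 0, 0],
     [0, 0, 1, 3],
     [0, 0, 2, 6]]
W, H = nmf(X, k=2)
print("reconstruction error:", frob_error(X, W, H))
```

In topic-modeling terms, each row of H is a topic (weights over terms) and each row of W says how much of each topic a document contains. The rows of H are what `print_top_words` reads from `components_`.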

In [None]:
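min_df_val = 2  # assumption: not defined in the original cell; 2 matches the other runs
modeler(doc_list.lem, 30, 30,
        # alpha was split into alpha_W / alpha_H in scikit-learn 1.0 and removed in 1.2
        NMF(n_components=30, random_state=1, alpha=.1, l1_ratio=.5),
        TfidfVectorizer(
            max_df=0.95,
            min_df=min_df_val,  # reduced from 5
            stop_words='english',
            ngram_range=(1, 1),
            max_features=5000  # limit features to avoid memory issues
        ))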

Extracting tf-idf features for NMF...
Fitting the NMF model with tf-idf features, n_topics= 30, n_topic_words= 30, n_features= 2000...

Topics in NMF model:
Topic #0:
jurisdiction suit admiralty citizenship controversy bring question exclusive arise removal diversity proceeding original exercise jurisdictional complaint want stat dismiss appellate confer final merit constitution entertain remove allege section judgment venue
Topic #1:
dismiss curiam want whereon substantial report appellee question misc appellant assistant pd appellees jurisdiction improvidently sod solicitor probable frankfurter app note consideration decision mosk moot paxton brief reverse rhyne dispense
Topic #2:
respondent brief reverse judgment file assistant affirm solicitor curia urge footnote divide join improvidently amicus amici rehnquist equally complaint reversal jj curiam deliver usc app blackmun decision affirmance award conclude
Topic #3:
vacate remand pauperis forma curiam judgment ante proceed consider

#### Notes about NMF performance 
Seeing these results makes me so happy. After several attempts at playing around with the options for this model, it has proved overwhelmingly good for the type of topic modeling I'm doing. I've done more reading about NMF, and I think the methods behind it are what have led to its strong performance. Being able to use tf-idf is, I think, very important for this.
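
To illustrate why tf-idf weighting might matter so much on this corpus, here is a small sketch on a made-up three-document corpus. It uses the textbook formula, raw count times log(N / df). `TfidfVectorizer` applies a smoothed idf and normalizes each row, so its numbers will differ, but the ordering effect is the same idea.

```python
import math
from collections import Counter

# Tiny made-up corpus: "court" appears in every document, "tariff" in only one.
docs = [
    "court judgment tariff shipper",
    "court judgment jury verdict",
    "court patent infringement",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc):
    """Raw term count times the textbook idf, log(N / df)."""
    tf = Counter(doc)
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}


weights = tfidf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {w:.3f}")
```

A term that shows up in every document gets an idf of zero here, so vocabulary shared by the whole corpus contributes nothing, while terms specific to a few documents dominate.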

### Truncated SVD (LSA) Model
This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). It is very similar to PCA, but operates on sample vectors directly, instead of on a covariance matrix. This means it can work with scipy.sparse matrices efficiently.

Notes: SVD suffers from a problem called “sign indeterminacy”, which means the sign of the `components_` and the output from transform depend on the algorithm and random state. To work around this, fit instances of this class to data once, then keep the instance around to do transformations.
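
A quick way to see where the sign indeterminacy comes from: flipping the sign of both singular vectors in a pair leaves the reconstructed matrix unchanged, so neither choice is more "correct" than the other. The sketch below uses a made-up rank-1 example with plain lists. The values of `sigma`, `u`, and `v` are illustrative only.

```python
def outer(u, v, scale=1.0):
    """Rank-1 matrix scale * u v^T, stored as a list of rows."""
    return [[scale * ui * vj for vj in v] for ui in u]


# Made-up singular value and unit-length singular vectors.
sigma = 2.0
u = [0.6, 0.8]
v = [0.0, 0.6, 0.8]

# Flip the sign of both vectors.
flipped_u = [-x for x in u]
flipped_v = [-x for x in v]

original = outer(u, v, sigma)
flipped = outer(flipped_u, flipped_v, sigma)
print(original == flipped)
```

Because both versions rebuild the same matrix, two fits can return topics whose weights are mirrored, which is why reusing one fitted instance matters.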

In [None]:
# n_topics is only used in the printed message; set it to match the 2 SVD components
modeler(doc_list.lem, 2, 30, TruncatedSVD(2, algorithm='arpack'),
        TfidfVectorizer(max_df=.8, min_df=2, stop_words='english'))

Extracting tf-idf features for LSA...
Fitting the LSA model with tf-idf features,n_samples=2000 and n_features=1000...

Topics in LSA model:

Explained variance ratio [ 0.00454438  0.04702202]
Topic #0:
defendant judgment dismiss make error shall sup respondent statute appellant claim question opinion issue section proceeding file curiam evidence remand
Topic #1:
dismiss curiam want appellant substantial appellees assistant whereon affirm appellee ginnane solicitor app vacate question judgment jurisdiction report hummel pd



#### Notes about LSA performance
A few attempts at tinkering with this algorithm did not improve its performance at all. The issues I'm finding with this are the same as the issues I found with LDA - it's good at pulling out the law themes, but that's not _really_ what I need. I really need the law terms to not play a role at all in modeling for these topics - we know that this entire corpus is about the law, but we need to know what KIND of law each case within the corpus is about. 
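
One lever for keeping corpus-wide legal vocabulary out of the topics is a document-frequency cutoff, which is what the `max_df` argument of the vectorizers controls. Here is a minimal sketch of that filter on a made-up corpus. The 0.80 threshold mirrors the value used above, but everything else is hypothetical.

```python
from collections import Counter

# Made-up corpus: "court" is in every document, "judgment" in three of four.
docs = [
    "court judgment tariff shipper",
    "court judgment jury verdict",
    "court patent infringement",
    "court judgment death penalty",
]
tokenized = [d.split() for d in docs]
max_df = 0.80

# Fraction of documents containing each term.
doc_freq = Counter(term for doc in tokenized for term in set(doc))
too_common = {t for t, n in doc_freq.items() if n / len(tokenized) > max_df}

# Drop the overly common terms before modeling.
filtered = [[t for t in doc if t not in too_common] for doc in tokenized]
print(sorted(too_common))
print(filtered)
```

Lowering `max_df` removes more of the shared vocabulary, at the risk of also discarding terms that separate broad areas of law.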

### Latent Dirichlet Allocation model
In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
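
"Generative" here means LDA assumes each document was produced by first drawing a mixture of topics and then drawing each word from one of those topics. The sketch below simulates that story with two made-up topics. The vocabulary, topic probabilities, and `alpha` are all hypothetical, and fitting LDA is the reverse problem of inferring the topics and mixtures from the observed words.

```python
import random

rng = random.Random(0)


def sample_dirichlet(alpha):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


# Two made-up topics, each a probability distribution over a tiny vocabulary.
vocab = ["tariff", "shipper", "jury", "verdict"]
topics = [
    [0.5, 0.5, 0.0, 0.0],  # hypothetical "commerce" topic
    [0.0, 0.0, 0.5, 0.5],  # hypothetical "criminal trial" topic
]


def generate_document(n_words, alpha=(0.5, 0.5)):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        w = rng.choices(vocab, weights=topics[z])[0]           # pick a word from it
        words.append(w)
    return theta, words


theta, words = generate_document(8)
print([round(t, 2) for t in theta], words)
```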

In [36]:
modeler(doc_list.lem, 30, 30,
        # n_topics was renamed to n_components in scikit-learn 0.19
        LatentDirichletAllocation(n_components=30, max_iter=5, learning_method='online',
                                  learning_offset=50., random_state=0),
        CountVectorizer(max_df=.80, min_df=2, stop_words='english'))

Extracting tf features for LDA...
Fitting LDA models with tf features, n_samples and n_features=1000...

Topics in LDA model:
Topic #0:
activity engage authorize acquire relate closely provide principally affirm brief hearing respondent affiliate language prohibit approval section statute determination record
Topic #1:
respondent statute evidence make opinion defendant sup shipper purpose file judgment notice provide child violate reverse deliver present claim appellee
Topic #2:
claim shall provide respondent injury pay manner defendant follow constitution suit make purpose proceeding ordinance yes pass regulation person effect
Topic #3:
shall make defendant statute purchase file section issue pay creditor provide appellant asset evidence require opinion purpose respondent determine resident
Topic #4:
defendant respondent make present judgment error amend exempt lease comment assistant suit render allege obtain premium affirm lawfully year duty
Topic #5:
charge tariff shipper responden

In [37]:
LDA_mod(doc_list.lem, .95, 2, 2000, 10)  # document-frequency cutoffs are one way to keep only the 'meaningful' terms

Extracting tf features for LDA...
Fitting LDA models with tf features, n_samples and n_features=2000...

Topics in LDA model:
Topic #0:
defendant make judgment shall error evidence opinion issue purpose pay question proceeding respondent section statute provide suit present charge subject
Topic #1:
respondent death penalty tariff charge opinion shipper offender question defendant activity engage punishment evidence file execution arrangement comment authorize judgment
Topic #2:
opinion require make effect clause provide child affirm finding reimbursement religious violate voting number judgment result death deliver section program
Topic #3:
respondent statute file jurisdiction brief opinion appellee permit claim question authorize evidence deliver determine certificate issue counterclaim subject regulation judgment
Topic #4:
value shall damage cost make completion exceed date default erect pay respondent prior breach debt follow measure decision entitle lien
Topic #5:
abandonment statu

#### Notes about LDA model performance
LDA was the first modeling type I tried, because it comes up most often in conversations about topic modeling. Initially I assumed that I would not have any other reasonable options, but LDA has proven ineffective for this project. From further reading about the differences between LDA and NMF, LDA seems to struggle with subtle differences within a corpus about a single subject: if I wanted to separate Apple products from apple the fruit, LDA would probably work, but not when I need to separate cases whose text is mostly about the law. My suspicion is that this is because LDA works from raw term counts (a count vectorizer) rather than tf-idf weights, so the plain bag of words is a serious limitation for finding how these documents _relate_.