### Model application to dataframe
The workflow is a little difficult to follow, so here's an outline:

Step 1: fit the model on the entire dataframe to extract the topics

Step 2: apply the fitted model back to the dataframe to assign the most likely topic to each case (we want the topic number and its weight)

Step 3: build a dictionary of the top words that make up each topic in the fitted model

Step 4: use this dictionary to "look up" each case's topic words and add them to the dataframe
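
The four steps above can be sketched end to end on a tiny corpus. This is a minimal illustration, not the notebook's actual run: the `toy` dataframe, the two-topic setting, and the three-word cutoff are assumptions chosen to keep the example small, and the NMF regularization settings used later are left out.

```python
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the lemmatized case text.
toy = pd.DataFrame({"lem": [
    "dismiss appeal want jurisdiction",
    "dismiss appeal curiam jurisdiction",
    "patent invention claim infringement",
    "patent claim invention license",
]})

# Step 1: fit tf-idf and NMF on the whole corpus.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy["lem"])
model = NMF(n_components=2, random_state=1, max_iter=500).fit(tfidf)

# Step 2: score every document, keep the strongest topic and its weight.
weights = model.transform(tfidf)            # shape: (n_documents, n_topics)
toy["topicnumber"] = weights.argmax(axis=1)
toy["strengthoftopic"] = weights.max(axis=1)

# Step 3: map each topic number to its highest-weighted terms.
vocab = vectorizer.vocabulary_              # term -> column index
terms = sorted(vocab, key=vocab.get)        # terms in column order
topic_words = {
    k: " ".join(terms[i] for i in row.argsort()[::-1][:3])
    for k, row in enumerate(model.components_)
}

# Step 4: look up the words for each document's topic.
toy["words"] = toy["topicnumber"].map(topic_words)
```

Building the term list from `vocabulary_` is a choice made here so the sketch does not depend on which feature-name accessor a given scikit-learn version provides.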

In [85]:
import pandas as pd
import re

In [87]:
##########################################  modeling imports  #######################################################
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
#from sklearn.preprocessing import Normalizer

In [88]:
df = pd.read_pickle("full_proj_lemmatized.pickle")

In [89]:
df.head(5)

| Unnamed: 0 | caseurl | casetitle | years | case | lem |
|---|---|---|---|---|---|
| 15000 | http://caselaw.findlaw.com/us-supreme-court/38... | ALUMINUM CO. OF AMERICA v. UNITED STATES, 382... | 1965 | United States Supreme Court JOBE v. CITY OF ... | jobe dismiss appellant appellee curiam dismiss... |
| 15001 | http://caselaw.findlaw.com/us-supreme-court/38... | JONES & LAUGHLIN STEEL CORP. v. GRIDIRON STEEL... | 1965 | United States Supreme Court JONES & LAUGHLIN... | gridiron civ proc extend limit expire saturday... |
| 15002 | http://caselaw.findlaw.com/us-supreme-court/38... | JORDAN v. SILVER, 381 U.S. 415 (1965) | 1965 | United States Supreme Court JORDAN v. SILVER... | affirm corker assistant gruskin selvin appella... |
| 15003 | http://caselaw.findlaw.com/us-supreme-court/38... | KADANS v. DICKERSON, 382 U.S. 22 (1965) | 1965 | United States Supreme Court KADANS v. DICKER... | dismiss appellant parraguirre appellees curiam... |
| 15004 | http://caselaw.findlaw.com/us-supreme-court/38... | METROMEDIA, INC. v. AMERICAN SOCIETY OF COMPOS... | 1965 | United States Supreme Court KASHARIAN v. MET... | dismiss appellant conover appellees curiam dis... |


## Step 1: Run model against entire dataframe (as a corpus)
Think of it like this: we need to find the themes that run across the entire set of documents (over 23,000 in all), so we treat all the documents together as one corpus and fit the model on every one of them at once.

In [90]:
def nmf_mod(corp):
    max_df = .80        # ignore terms that appear in more than 80% of documents
    n_topics = 30
    n_features = 2000   # only reported in the message below; max_features is commented out
    n_top_words = 30

    # Use tf-idf features for NMF.
    print("Extracting tf-idf features for NMF...")
    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=2, #max_features=n_features,
                                       stop_words='english')

    tfidf = tfidf_vectorizer.fit_transform(corp)

    # Fit the NMF model
    print("Fitting the NMF model with tf-idf features, "
          "n_topics= %d, n_topic_words= %d, n_features= %d..."
          % (n_topics, n_top_words, n_features))

    # Note: `alpha` was deprecated in scikit-learn 1.0 (later removed) in favor of alpha_W / alpha_H.
    nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)

    print("\nTopics in NMF model:")
    # Note: get_feature_names() was deprecated in scikit-learn 1.0 in favor of get_feature_names_out().
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()
    #return print_top_words(nmf, tfidf_feature_names, n_top_words)
    return tfidf, nmf

In [91]:
tfidf, nmf_mod_test = nmf_mod(df.lem)

Extracting tf-idf features for NMF...
Fitting the NMF model with tf-idf features, n_topics= 30, n_topic_words= 30, n_features= 2000...

Topics in NMF model:


## Step 2: Applying the model back to the dataframe 
NMF (like other kinds of topic modeling) returns a matrix of weights describing how strongly each document fits Topic 1, 2, etc. Unlike LDA, __an NMF matrix does not contain probabilities of inclusion__. Instead, the model approximates the tf-idf matrix as the product of two non-negative matrices, a document-topic matrix and a topic-term matrix, and `transform` returns the document-topic weights. Don't worry about the (linear algebra) details; just imagine that we need to find the biggest number in each row of this matrix and return its index.
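
To make the matrix picture concrete, here is a small NumPy sketch. The matrices `W` and `H` are hand-written, hypothetical values (not output from the model above); they only illustrate the shapes involved and why the per-row maximum is the quantity of interest.

```python
import numpy as np

# Hypothetical weights for 3 documents over 2 topics (the kind of matrix transform returns).
W = np.array([[0.9, 0.0],
              [0.1, 0.4],
              [0.0, 0.7]])

# Hypothetical topic-term matrix for 2 topics over 4 terms (the role of components_).
H = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.8, 0.6]])

# NMF looks for non-negative W and H whose product approximates the tf-idf matrix.
X_approx = W @ H                    # shape: (3 documents, 4 terms)

# Rows of W are not probabilities: they need not sum to 1.
row_sums = W.sum(axis=1)            # 0.9, 0.5, 0.7

# The most likely topic is the column index of the largest weight in each row.
best_topic = W.argmax(axis=1)       # 0, 1, 1
best_weight = W.max(axis=1)         # 0.9, 0.4, 0.7
```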

In [92]:
out = nmf_mod_test.transform(tfidf)
out[49]  # verified that each of these is different

array([ 0.00682247,  0.07127508,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.02156119,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ])

__Returning these as a Series__
It's easy to run the model against a column of the dataframe, return the result as a Series, and add that Series as a new column. (Remember not to sort the dataframe in between: the values are matched to rows by position, so the row order has to stay the same.)
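
The reason for passing the dataframe's own index when building the Series can be shown on a small example. The `frame` and `topics` values below are hypothetical; the point is only that pandas aligns a Series to a dataframe by index label, and this dataframe's index starts at 15000 rather than 0.

```python
import pandas as pd

# Hypothetical frame whose index starts at 15000, like the case data.
frame = pd.DataFrame({"lem": ["a b", "c d", "e f"]}, index=[15000, 15001, 15002])
topics = [1, 3, 16]  # one value per row, in row order

# Without an explicit index the Series is labeled 0, 1, 2, so nothing lines up.
misaligned = frame.assign(topicnumber=pd.Series(topics))

# Reusing the frame's index attaches each value to the row it came from.
aligned = frame.assign(topicnumber=pd.Series(topics, index=frame.index))

print(misaligned["topicnumber"].isna().all())   # True
print(aligned["topicnumber"].tolist())          # [1, 3, 16]
```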

In [93]:
# Index of the largest weight in each row = most likely topic for that document.
df["topicnumber"] = pd.Series(out.argmax(axis=1), index=df.index)

In [94]:
# The largest weight itself = strength of that topic for the document.
df["strengthoftopic"] = pd.Series(out.max(axis=1), index=df.index)

In [95]:
df.topicnumber.value_counts() #let's make sure this is a good model...

16    3267
21    1590
25    1308
29    1144
7     1053
28     959
1      801
11     784
19     762
3      745
14     745
6      742
26     738
12     736
8      714
2      662
10     650
20     620
13     612
0      579
4      570
18     543
22     526
5      440
15     431
27     350
24     345
23     331
9      316
17     205
Name: topicnumber, dtype: int64

## Step 3: Creating a dictionary of topic components
There's probably an easier way to do this, but I haven't found one. I'm fitting the model again (the fixed random state gives the same results as before), but this time building a dictionary of each topic's top words to "look up" from my dataframe.
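
One possible way to avoid the second fit is to read the top words straight from a model that has already been fitted. The helper below is a hypothetical sketch, not part of the notebook: it assumes the fitted model's `components_` matrix and the vectorizer's feature names are both still available (the Step 1 function would have to return the vectorizer as well), and it uses small hand-written arrays in their place.

```python
import numpy as np

def top_words_per_topic(components, feature_names, n_top_words):
    """Map each topic index to a space-joined string of its top terms."""
    topic_dict = {}
    for topic_idx, weights in enumerate(components):
        # argsort is ascending, so reverse it and keep the first n_top_words.
        top = np.argsort(weights)[::-1][:n_top_words]
        topic_dict[topic_idx] = " ".join(feature_names[i] for i in top)
    return topic_dict

# Hypothetical 2-topic, 4-term example standing in for nmf.components_.
components = np.array([[0.1, 0.9, 0.0, 0.5],
                       [0.7, 0.0, 0.2, 0.3]])
feature_names = ["affirm", "dismiss", "remand", "curiam"]

print(top_words_per_topic(components, feature_names, 2))
# {0: 'dismiss curiam', 1: 'affirm curiam'}
```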

In [96]:
def nmf_topics_dict(corp):
    max_df = .80
    n_topics = 30
    n_features = 2000   # unused; max_features is commented out
    n_top_words = 30

    # Same vectorizer and NMF settings as nmf_mod, so the topics match.
    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=2, #max_features=n_features,
                                       stop_words='english')

    tfidf = tfidf_vectorizer.fit_transform(corp)
    nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()

    # Map each topic index to its n_top_words highest-weighted terms.
    topic_dict = {}
    for topic_idx, topic in enumerate(nmf.components_):
        topic_dict[topic_idx] = " ".join([tfidf_feature_names[i]
                                    for i in topic.argsort()[:-n_top_words - 1:-1]])
    return topic_dict

In [97]:
nmf_words = nmf_topics_dict(df.lem) #dict object

In [98]:
def word_lookup(num):
    return nmf_words.get(num)

In [99]:
df["words"] = df.topicnumber.apply(word_lookup)
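
As a side note, the lookup step can also be written without a helper function, because `Series.map` accepts a dictionary directly. The dictionary and Series below are hypothetical stand-ins for `nmf_words` and `df.topicnumber`.

```python
import pandas as pd

# Hypothetical topic dictionary and topic assignments.
nmf_words_demo = {1: "dismiss curiam want", 3: "vacate remand pauperis"}
topicnumber = pd.Series([1, 3, 1, 7], index=[15000, 15001, 15002, 15003])

# Series.map accepts a dict directly; keys missing from the dict become NaN.
words = topicnumber.map(nmf_words_demo)

print(words.tolist())
# ['dismiss curiam want', 'vacate remand pauperis', 'dismiss curiam want', nan]
```

A topic number with no entry in the dictionary yields a missing value rather than an error, which matches what `nmf_words.get(num)` does with its default of `None`.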

In [104]:
df.loc[15002, "words"]  # This cell and the one below verify that it worked (.ix was removed from pandas; .loc is the label-based equivalent)

'brief statute file regulation usc clause require decision member join requirement violate apply dissent challenge seek curia issue footnote violation discrimination question relief complaint urge activity statutory program conduct provide'

In [103]:
df.loc[15002, "lem"]

'affirm corker assistant gruskin selvin appellant phill appellee curiam appellant notice record strike dismiss affirm affirm judgment affirm join concur constitution propose file sign person vote precede gubernatorial propose submit bare majority vote require pass prior constitution iv provide apportion basis measure know proposition submit voter delete requirement apportion strict basis leave method apportion unaffected statement accompany measure distribute voter proposition attempt provide federaltype similar apportionment summarize argument proposal proposition approve vote following year adopt apportionment statute effectuate legislation submit require approve provide compose member elect base contain consist change measure submit voter occasion attempt change apportionment proposition defeat vote proposition defeat vote proposition defeat vote today summarily affirm decree apportionment consistently approve majority voting invalid decision companion able detect slight basis optim

In [None]:
df.to_pickle("full_project_modelled.pickle")

In [102]:
df

| Unnamed: 0 | caseurl | casetitle | years | case | lem | topicnumber | strengthoftopic | words |
|---|---|---|---|---|---|---|---|---|
| 15000 | http://caselaw.findlaw.com/us-supreme-court/38... | ALUMINUM CO. OF AMERICA v. UNITED STATES, 382... | 1965 | United States Supreme Court JOBE v. CITY OF ... | jobe dismiss appellant appellee curiam dismiss... | 1 | 0.043896 | dismiss curiam want whereon substantial report... |
| 15001 | http://caselaw.findlaw.com/us-supreme-court/38... | JONES & LAUGHLIN STEEL CORP. v. GRIDIRON STEEL... | 1965 | United States Supreme Court JONES & LAUGHLIN... | gridiron civ proc extend limit expire saturday... | 3 | 0.008186 | vacate remand pauperis forma curiam judgment a... |
| 15002 | http://caselaw.findlaw.com/us-supreme-court/38... | JORDAN v. SILVER, 381 U.S. 415 (1965) | 1965 | United States Supreme Court JORDAN v. SILVER... | affirm corker assistant gruskin selvin appella... | 16 | 0.023336 | brief statute file regulation usc clause requi... |
| 15003 | http://caselaw.findlaw.com/us-supreme-court/38... | KADANS v. DICKERSON, 382 U.S. 22 (1965) | 1965 | United States Supreme Court KADANS v. DICKER... | dismiss appellant parraguirre appellees curiam... | 1 | 0.095863 | dismiss curiam want whereon substantial report... |
| 15004 | http://caselaw.findlaw.com/us-supreme-court/38... | METROMEDIA, INC. v. AMERICAN SOCIETY OF COMPOS... | 1965 | United States Supreme Court KASHARIAN v. MET... | dismiss appellant conover appellees curiam dis... | 1 | 0.039786 | dismiss curiam want whereon substantial report... |
| 15005 | http://caselaw.findlaw.com/us-supreme-court/38... | KASHARIAN v. SOUTH PLAINFIELD BAPTIST CHURCH, ... | 1965 | United States Supreme Court KASHARIAN v. SOU... | dismiss curiam dismiss want jurisdiction supp ... | 1 | 0.035189 | dismiss curiam want whereon substantial report... |
| 15006 | http://caselaw.findlaw.com/us-supreme-court/38... | FLORIDA EAST COAST RAILWAY CO. v. UNITED STATE... | 1965 | United States Supreme Court KASHARIAN v. WIL... | dismiss curiam dismiss want jurisdiction where... | 1 | 0.041425 | dismiss curiam want whereon substantial report... |
| 15007 | http://caselaw.findlaw.com/us-supreme-court/38... | KENNECOTT COPPER CORP. v. UNITED STATES, 381 ... | 1965 | United States Supreme Court KENNECOTT COPPER... | supp affirm milman appellant solicitor assista... | 10 | 0.033415 | appellant affirm appellees appellee supp assis... |
| 15008 | http://caselaw.findlaw.com/us-supreme-court/38... | KILLGORE v. BLACKWELL, 381 U.S. 278 (1965) | 1965 | United States Supreme Court KILLGORE v. BLAC... | footnote reporter note opinion report amend ju... | 3 | 0.058271 | vacate remand pauperis forma curiam judgment a... |
| 15009 | http://caselaw.findlaw.com/us-supreme-court/37... | KITTY HAWK DEVELOPMENT CO. v. CITY OF COLORADO... | 1965 | United States Supreme Court KITTY HAWK DEVEL... | dismiss report pd prettyman appellant rhyne rh... | 1 | 0.085258 | dismiss curiam want whereon substantial report... |
