### Model application to dataframe
The workflow is a little difficult to follow, so here is an outline:

Step 1: Run the model against the entire dataframe to collect the topics.

Step 2: Apply the fitted model back to the dataframe to assign the most likely topic to each case (we want the topic number and its weight).

Step 3: Build a dictionary of the words that make up each topic in the original model.

Step 4: Use this dictionary to look up each case's topic words and add them to the dataframe.

Step 5: Put the data together for visualization.

In [149]:
import pandas as pd
import re

In [150]:
##########################################  modeling imports  #######################################################
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
#from sklearn.preprocessing import Normalizer

In [151]:
df = pd.read_pickle("full_proj_lemmatized3.pickle")

In [152]:
df.head(5)

Unnamed: 0,caseurl,casetitle,years,case,lem
15000,http://caselaw.findlaw.com/us-supreme-court/38...,"ALUMINUM CO. OF AMERICA v. UNITED STATES, 382...",1965,United States Supreme Court JOBE v. CITY OF ...,jobe appellant appellee want substantial proba...
15001,http://caselaw.findlaw.com/us-supreme-court/38...,JONES & LAUGHLIN STEEL CORP. v. GRIDIRON STEEL...,1965,United States Supreme Court JONES & LAUGHLIN...,gridiron civ proc extend limit expire saturday...
15002,http://caselaw.findlaw.com/us-supreme-court/38...,"JORDAN v. SILVER, 381 U.S. 415 (1965)",1965,United States Supreme Court JORDAN v. SILVER...,corker gruskin selvin appellant phill appellee...
15003,http://caselaw.findlaw.com/us-supreme-court/38...,"KADANS v. DICKERSON, 382 U.S. 22 (1965)",1965,United States Supreme Court KADANS v. DICKER...,appellant parraguirre appellees want jurisdict...
15004,http://caselaw.findlaw.com/us-supreme-court/38...,"METROMEDIA, INC. v. AMERICAN SOCIETY OF COMPOS...",1965,United States Supreme Court KASHARIAN v. MET...,appellant conover appellees want jurisdiction ...


In [153]:
df.loc[15000, "caseurl"]  # .ix was removed in pandas 1.0; .loc does the same label-based lookup

'http://caselaw.findlaw.com/us-supreme-court/382/12.html'

## Step 1: Run model against entire dataframe (as a corpus)
Think of it like this: we need to find the themes that run across the entire set of documents (over 23,000 in all), so we treat all of the documents together as a single corpus and extract the topics from that.
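To make the idea of fitting on the whole corpus concrete, here is a minimal sketch on a hypothetical four-document corpus. The documents, the variable names, and the choice of two topics are illustrative assumptions, not the notebook's data or settings.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the lemmatized case texts.
toy_corpus = [
    "appellant judgment error verdict",
    "judgment error verdict defendant",
    "vacate remand consideration moot",
    "remand vacate moot solicitor",
]

# One tf-idf matrix for the whole corpus: rows are documents, columns are terms.
toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_corpus)

# Factorize into 2 topics (an arbitrary choice for this toy example).
toy_nmf = NMF(n_components=2, random_state=2, max_iter=500)
toy_doc_topic = toy_nmf.fit_transform(toy_tfidf)

print(toy_tfidf.shape)      # (documents, terms)
print(toy_doc_topic.shape)  # (documents, topics)
```

The same two objects, the tf-idf matrix and the fitted model, are what the full-size function returns.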

In [231]:
def nmf_mod(corp):
    max_df = .80  # ignore terms that appear in more than 80% of documents
    n_topics = 30
    n_features = 2000  # only reported in the log message; max_features is commented out below
    n_top_words = 40

    # Use tf-idf features for NMF.
    print("Extracting tf-idf features for NMF...")
    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=5, # ngram_range=(1,2), #max_features=n_features,
                                       stop_words='english')

    tfidf = tfidf_vectorizer.fit_transform(corp)


    # Fit the NMF model
    print("Fitting the NMF model with tf-idf features, "
          "n_topics= %d, n_topic_words= %d, n_features= %d..."
          % (n_topics, n_top_words, n_features))

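    # alpha and l1_ratio add regularization; scikit-learn 1.2+ replaces `alpha` with `alpha_W` and `alpha_H`.
    nmf = NMF(n_components=n_topics, random_state=2, alpha=.1, l1_ratio=.5).fit(tfidf)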
    
    print("\nTopics in NMF model:")
    # On scikit-learn 1.2+ this method is get_feature_names_out().
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()
    #return print_top_words(nmf, tfidf_feature_names, n_top_words) 
    return tfidf,nmf

In [232]:
tfidf, nmf_mod_test = nmf_mod(df.lem)

Extracting tf-idf features for NMF...
Fitting the NMF model with tf-idf features, n_topics= 30, n_topic_words= 40, n_features= 2000...

Topics in NMF model:


## Step 2: Applying the model back to the dataframe 
NMF, like other topic models, returns a matrix with one row per document and one column per topic. Unlike LDA, __an NMF matrix does not contain probabilities of inclusion; its entries are the non-negative weights produced by the matrix factorization__, so a row need not sum to 1. Don't worry about the linear algebra details: for each row we just need the largest value and the index of the column it sits in.
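A small numeric sketch may help here. The two factor matrices below are made-up values for three documents, two topics, and four terms; they are assumptions for illustration and were not produced by the model in this notebook.

```python
import numpy as np

# Hypothetical factors (illustrative values only).
W = np.array([[0.05, 0.00],
              [0.00, 0.30],
              [0.02, 0.01]])          # document-topic weights (what transform() returns)
H = np.array([[1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 1.0]])  # topic-term weights (what components_ holds)

approx = W @ H                 # NMF picks W and H so this approximates the tf-idf matrix
row_sums = W.sum(axis=1)       # not 1.0, so rows are not probability distributions
best_topic = W.argmax(axis=1)  # column index of the largest weight in each row

print(row_sums)
print(best_topic)
```

Because the rows are not normalized, the weights are best compared within a row (which topic wins) rather than read as probabilities.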

In [233]:
out = nmf_mod_test.transform(tfidf)
out[49]  # spot check: each row should hold a different set of weights

array([ 0.        ,  0.        ,  0.01743965,  0.        ,  0.        ,
        0.00812628,  0.03701182,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.05023011,  0.00074146,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.00029258,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ])

__Returning these as a Series__

It's easy to take the per-row results, wrap them in a Series, and add that Series as a new column. Keep the rows in their original order (do not sort in between), because the values are matched to the dataframe by position when `index=df.index` is applied.

In [234]:
import operator
topics = []
for item in out:
    max_index, max_value = max(enumerate(item), key=operator.itemgetter(1))
    topics.append(max_index) 
    
df["topicnumber"] = pd.Series(topics, index=df.index)    

In [235]:
topics_likelihood = []
for item in out:
    max_index, max_value = max(enumerate(item), key=operator.itemgetter(1))
    topics_likelihood.append(max_value)
    
df["strengthoftopic"] = pd.Series(topics_likelihood, index=df.index)     
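The two loops above can also be written without explicit iteration. The sketch below uses a hypothetical 3x3 weight matrix and a toy frame whose index mimics the notebook's non-zero-based labels; `toy_out` and `toy_df` are illustrative names, not objects from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic matrix and a frame with non-zero-based index labels.
toy_out = np.array([[0.00, 0.05, 0.01],
                    [0.20, 0.00, 0.03],
                    [0.00, 0.00, 0.07]])
toy_df = pd.DataFrame({"years": [1965, 1965, 1966]}, index=[15000, 15001, 15002])

# argmax/max along axis=1 give the winning column and its weight for every row at once.
toy_df["topicnumber"] = pd.Series(toy_out.argmax(axis=1), index=toy_df.index)
toy_df["strengthoftopic"] = pd.Series(toy_out.max(axis=1), index=toy_df.index)

print(toy_df)
```

Passing `index=toy_df.index` matters: column assignment aligns on index labels, so a Series with the default 0, 1, 2 index would not line up with labels such as 15000 and would leave NaN values.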

In [236]:
df.topicnumber.value_counts() # sanity check: how evenly are the cases spread across topics?

16    2053
0     1567
22    1539
27    1457
3     1343
19    1242
10    1039
29     944
20     943
4      860
11     764
21     734
1      680
15     652
18     589
6      559
5      555
26     552
14     547
23     542
13     540
7      458
8      450
9      449
17     434
2      405
24     401
28     377
12     366
25     227
Name: topicnumber, dtype: int64

## Step 3: Creating dictionary of topic components
There's probably an easier way to do this, but I haven't found one. I'm running the model function again (the fixed `random_state` reproduces the earlier results), but this time building a dictionary of each topic's top words to look up from the dataframe.
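One possible simplification, sketched under the assumption that the Step 1 function were changed to also hand back its vectorizer's feature names: the dictionary can be built from a fitted model's `components_` array directly, with no second fit. `top_words_by_topic` and the toy inputs below are hypothetical, not part of this notebook.

```python
import numpy as np

def top_words_by_topic(components, feature_names, n_top_words):
    """Map each topic index to a comma-separated string of its highest-weighted terms."""
    topic_words = {}
    for topic_idx, weights in enumerate(components):
        # argsort is ascending, so reverse it and keep the first n_top_words indices.
        top_idx = np.argsort(weights)[::-1][:n_top_words]
        topic_words[topic_idx] = ", ".join(feature_names[i] for i in top_idx)
    return topic_words

# Hypothetical topic-term weights and vocabulary, for illustration only.
toy_components = np.array([[0.1, 0.9, 0.0, 0.4],
                           [0.7, 0.0, 0.8, 0.2]])
toy_vocab = ["error", "judgment", "remand", "vacate"]

toy_topic_words = top_words_by_topic(toy_components, toy_vocab, 2)
print(toy_topic_words)
```

With a real model, the first argument would be the fitted model's `components_` and the second the vectorizer's feature names.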

In [221]:
def nmf_topics_dict(corp, n_topics):
    max_df = .80
    n_top_words = 40

    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=5,# ngram_range=(1,2), #max_features=n_features,
                                       stop_words='english')

    tfidf = tfidf_vectorizer.fit_transform(corp)
    # Same settings and random_state as nmf_mod, so the topics match the earlier fit.
    nmf = NMF(n_components=n_topics, random_state=2, alpha=.1, l1_ratio=.5).fit(tfidf)
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()  # get_feature_names_out() on scikit-learn 1.2+

    topic_dict = {}
    for topic_idx, topic in enumerate(nmf.components_):
        # argsort is ascending; the reversed slice picks the n_top_words largest weights
        topic_dict[topic_idx] = ", ".join([tfidf_feature_names[i]
                                           for i in topic.argsort()[:-n_top_words - 1:-1]])
    return topic_dict

In [222]:
# I tested a few different topic distributions, 30 was optimal
nmf_words_30 = nmf_topics_dict(df.lem, 30) #dict object

In [223]:
nmf_words_30

{0: 'error, defendant, sup, deliver, judgment, statute, mckenna, bring, record, messrs, prosecute, assignment, recover, favor, render, pass, verdict, proceeding, demurrer, cost, exception, try, sustain, assign, appearance, overrule, allow, purpose, assess, make, come, involve, result, raise, memorandum, contrary, validity, contention, mandamus, approve',
 1: 'vacate, remand, consideration, judgment, solicitor, moot, reconsideration, proceeding, suggestion, respondent, reason, app, ninth, seventh, pauperis, forma, disposition, hearing, inconsistent, result, appellate, record, basis, appropriate, merit, eighth, habeas, consistent, outright, complaint, supra, relation, manoli, dismissal, finan, ordman, gillen, wulf, consider, timely',
 2: 'want, substantial, report, appellate, app, properly, rhyne, consideration, solicitor, grey, quel, appellees, kaufmann, derengoski, probable, bourke, melaniphy, sybert, tighe, torina, londerholm, reversed, jurisdictional, botter, stedman, toch, hickey, u

In [224]:
import json
with open('finaliteration_topics.json', 'w') as fp:
    json.dump(nmf_words_30, fp)
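One thing to keep in mind when reading this file back: JSON object keys are always strings, so the integer topic numbers come back as `"0"`, `"1"`, and so on. The sketch below uses a hypothetical two-topic dictionary and round-trips through a string rather than a file.

```python
import json

# Hypothetical two-topic dictionary with integer keys, like nmf_words_30.
toy_topics = {0: "error, defendant", 1: "vacate, remand"}

# Round-trip through a JSON string instead of a file to keep the example self-contained.
loaded = json.loads(json.dumps(toy_topics))
print(list(loaded))  # the keys are now strings

# Convert the keys back to integers before looking up topic numbers.
restored = {int(key): words for key, words in loaded.items()}
print(restored == toy_topics)
```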

## Step 4: Looking up topic words for each item in dataframe

In [237]:
def word_lookup(num):
    return nmf_words_30.get(num)

In [238]:
df["words"] = df.topicnumber.apply(word_lookup)
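As an aside, pandas can do the same lookup without a helper function, since `Series.map` accepts a dictionary. The values below are hypothetical and only illustrate the behavior.

```python
import pandas as pd

# Hypothetical topic numbers and lookup table, for illustration only.
toy_topic_numbers = pd.Series([1, 0, 1, 2], index=[15000, 15001, 15002, 15003])
toy_lookup = {0: "error, defendant", 1: "vacate, remand"}

# map() with a dict looks up each value; values missing from the dict become NaN.
toy_words = toy_topic_numbers.map(toy_lookup)
print(toy_words)
```

Topic 2 has no entry in the toy table, so its row comes back as NaN, which makes gaps in the lookup easy to spot.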

In [239]:
df.loc[15754,"words"] # This cell and the one below verify that it worked

'forma, pauperis, proceed, reason, examination, submit, reverse, docketing, frivolous, ninth, hearing, compliance, supra, issue, clerk, habeas, record, respondent, extraordinary, consideration, relief, filing, vacate, docket, transcript, forth, request, app, set, noncriminal, status, allow, mahorner, unless, argument, faircloth, result, denial, appellate, fender'

In [240]:
df.loc[14972,"lem"]

'appellant licensed physician convict accessory married person information advice prevent conception follow examination prescribe contraceptive device wife statute make person article prevent conception appellant claim accessory statute apply violate fourteenth appellate judgment appellant standing a'

In [241]:
df.to_pickle("full_project_modelled_final.pickle")

## Step 5: Putting data together for visualization
I'm making a brushable area chart with D3. This visualization requires a datapoint for every topic for every year, so we have to do some pivoting to make that happen.

In [242]:
temp_df = df[['years', 'topicnumber']]
#temp_df.to_csv("temp.csv")
temp_df

Unnamed: 0,years,topicnumber
15000,1965,2
15001,1965,23
15002,1965,5
15003,1965,6
15004,1965,6
15005,1965,6
15006,1965,15
15007,1965,17
15008,1965,25
15009,1965,2


In [243]:
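#dummy value of 1 for each case so the rows can be counted later.
#pandas emits a SettingWithCopyWarning here (a warning, not an error) because temp_df is a slice of df;
#the assignment still works.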
temp_df["count"] = 1
temp_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


Unnamed: 0,years,topicnumber,count
15000,1965,2,1
15001,1965,23,1
15002,1965,5,1
15003,1965,6,1
15004,1965,6,1
15005,1965,6,1
15006,1965,15,1
15007,1965,17,1
15008,1965,25,1
15009,1965,2,1


In [244]:
#count the cases for each (year, topic) pair, giving one row per pair that actually occurs
temp_df = temp_df.groupby(["years", "topicnumber"]).count().reset_index()
temp_df

Unnamed: 0,years,topicnumber,count
0,1792,21,1
1,1793,20,2
2,1795,10,1
3,1795,22,1
4,1796,4,1
5,1796,19,1
6,1796,21,1
7,1796,29,1
8,1798,4,1
9,1798,20,1
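For comparison, the same counts can be produced without the helper column of 1s by using `groupby(...).size()`. This is a sketch on hypothetical data; `toy_assignments` and `toy_counts` are illustrative names.

```python
import pandas as pd

# Hypothetical slice of (year, topic) assignments.
toy_assignments = pd.DataFrame({
    "years": [1965, 1965, 1965, 1966],
    "topicnumber": [6, 6, 2, 6],
})

# size() counts the rows in each group, so no dummy column is needed.
toy_counts = (
    toy_assignments.groupby(["years", "topicnumber"])
    .size()
    .reset_index(name="count")
)
print(toy_counts)
```

Skipping the column assignment also avoids the SettingWithCopyWarning that appears when a column is added to a slice.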


In [245]:
'''
A few (really cool) things happen in this step. We pivot so the topic numbers become column headers, which
creates a cell for every year/topic pair, including pairs that never occur (for example, 1792 has only 1 case
but there are 30 topics, so it needs 29 points of 0). We fill those NaNs with 0, then unstack the frame back
into long form and reset the index.'''

data_fillna = temp_df.pivot_table("count", "years", "topicnumber").fillna(0).unstack().reset_index()
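To see the zero-filling on a small scale, here is a sketch with two hypothetical years and two topics, where each year is missing one topic. The frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical counts: 1965 has no cases for topic 2, and 1966 has none for topic 1.
toy_counts = pd.DataFrame({
    "years": [1965, 1966],
    "topicnumber": [1, 2],
    "count": [3, 5],
})

# Wide form: one row per year, one column per topic, NaN where a pair never occurred.
wide = toy_counts.pivot_table(values="count", index="years", columns="topicnumber")

# Fill the gaps with 0, then return to long form (one row per topic/year pair).
long_form = wide.fillna(0).unstack().reset_index()
long_form.columns = ["topicnumber", "years", "count"]
print(long_form)
```

Two input rows become four output rows, one for every combination of year and topic, which is the shape the area chart needs.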

In [246]:
#the count column loses its label in the previous step, so we rename the columns here and reorder them to match
#the layout of the viz csv
data_fillna.columns = ["topicnumber", "years", "count"]
data_fillna = data_fillna[["years", "topicnumber", "count"]]

In [249]:
#sort by year
data_fillna.sort_values("years", inplace = True, ascending = True)
data_fillna.to_csv("topicsbyyear.csv", index = False)

In [251]:
#backup file
data_fillna.to_csv("year_topic_data2.csv", index = False)

In [None]:
'''The best part of this viz is the side-to-side brushing. For that we only need the total number of cases
for every year and no other information.'''

data_fillna.groupby("years")["count"].sum().reset_index().to_csv("year_data.csv", index = False)

## _The end!_