### Model application to dataframe
The workflow is a little difficult to follow, so here is an outline:

Step 1: Run the model against the entire dataframe to collect the topics.

Step 2: Apply the fitted model back to the dataframe to assign the most likely topic to each case (we want the topic number and its weight).

Step 3: Make a dictionary of the words that make up each topic in the original model.

Step 4: Use this dictionary to "look up" the topic words and attach them to the dataframe.

Step 5: Get the data together for visualization!

In [3]:
import pandas as pd
import re

In [4]:
##########################################  modeling imports  #######################################################
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
#from sklearn.preprocessing import Normalizer

In [6]:
df = pd.read_pickle("full_proj_lemmatized3.pickle")

In [7]:
df.head(5)

Unnamed: 0,caseurl,casetitle,years,case,lem
15000,http://caselaw.findlaw.com/us-supreme-court/38...,"ALUMINUM CO. OF AMERICA v. UNITED STATES, 382...",1965,United States Supreme Court JOBE v. CITY OF ...,jobe appellant appellee want substantial proba...
15001,http://caselaw.findlaw.com/us-supreme-court/38...,JONES & LAUGHLIN STEEL CORP. v. GRIDIRON STEEL...,1965,United States Supreme Court JONES & LAUGHLIN...,gridiron civ proc extend limit expire saturday...
15002,http://caselaw.findlaw.com/us-supreme-court/38...,"JORDAN v. SILVER, 381 U.S. 415 (1965)",1965,United States Supreme Court JORDAN v. SILVER...,corker gruskin selvin appellant phill appellee...
15003,http://caselaw.findlaw.com/us-supreme-court/38...,"KADANS v. DICKERSON, 382 U.S. 22 (1965)",1965,United States Supreme Court KADANS v. DICKER...,appellant parraguirre appellees want jurisdict...
15004,http://caselaw.findlaw.com/us-supreme-court/38...,"METROMEDIA, INC. v. AMERICAN SOCIETY OF COMPOS...",1965,United States Supreme Court KASHARIAN v. MET...,appellant conover appellees want jurisdiction ...


In [8]:
# .ix was removed in pandas 1.0; .loc performs the same label-based lookup
df.loc[15000, "caseurl"]

'http://caselaw.findlaw.com/us-supreme-court/382/12.html'

## Step 1: Run model against entire dataframe (as a corpus)
Think of it like this: we need to find the themes that run across the entire set of documents (over 23,000 in all), so we treat all of the documents together as one corpus and extract the topics from that.

In [231]:
def nmf_mod(corp):
    max_df = .80        # ignore terms that appear in more than 80% of documents
    n_topics = 30
    n_features = 2000   # only reported in the message below; max_features is commented out
    n_top_words = 40

    # Use tf-idf features for NMF.
    print("Extracting tf-idf features for NMF...")
    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=5, # ngram_range=(1,2), #max_features=n_features,
                                       stop_words='english')

    tfidf = tfidf_vectorizer.fit_transform(corp)

    # Fit the NMF model
    print("Fitting the NMF model with tf-idf features, "
          "n_topics= %d, n_topic_words= %d, n_features= %d..."
          % (n_topics, n_top_words, n_features))

    # NOTE: newer scikit-learn releases replace `alpha` with alpha_W / alpha_H
    nmf = NMF(n_components=n_topics, random_state=2, alpha=.1, l1_ratio=.5).fit(tfidf)

    print("\nTopics in NMF model:")
    # NOTE: newer scikit-learn releases rename this method to get_feature_names_out()
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()
    #return print_top_words(nmf, tfidf_feature_names, n_top_words)
    # Return the tf-idf matrix and the fitted model so both can be reused in Step 2.
    return tfidf, nmf

In [232]:
tfidf, nmf_mod_test = nmf_mod(df.lem)

Extracting tf-idf features for NMF...
Fitting the NMF model with tf-idf features, n_topics= 30, n_topic_words= 40, n_features= 2000...

Topics in NMF model:


## Step 2: Applying the model back to the dataframe 
NMF (like other types of topic modeling) returns a matrix with one row per document and one column per topic; each entry scores how strongly that document fits Topic 1, 2, etc. Unlike LDA, __an NMF matrix does not contain probabilities of inclusion__. Its entries are non-negative weights from a matrix factorization: the tf-idf matrix is approximated by the product of a document-topic matrix and a topic-term matrix. Don't worry about the (linear algebra) details; just imagine that we need to find the biggest number in each row of this matrix and return its index.
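
To make the factorization remark concrete, here is a minimal sketch with hand-made numbers (none of them come from the fitted model). The names `W`, `H`, and `X_approx` are illustrative only: `W` stands in for the document-topic weights that `transform` returns, and `H` for `nmf.components_`.

```python
import numpy as np

# Hypothetical document-topic weights: 3 documents x 2 topics.
W = np.array([[0.0, 0.5],
              [0.2, 0.1],
              [0.9, 0.0]])

# Hypothetical topic-term weights: 2 topics x 4 terms.
H = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 1.0]])

# NMF approximates the tf-idf matrix by the product W @ H.
X_approx = W @ H

# Rows of W are non-negative weights; they do not have to sum to 1.
row_sums = W.sum(axis=1)

# The strongest topic for each document is the column with the largest weight.
best_topic = W.argmax(axis=1)
```

Because the rows of `W` are not normalized, the weights are only comparable as "larger means a stronger fit", not as probabilities.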

In [233]:
out = nmf_mod_test.transform(tfidf)
out[49]  # verified that each of these rows is different

array([ 0.        ,  0.        ,  0.01743965,  0.        ,  0.        ,
        0.00812628,  0.03701182,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.05023011,  0.00074146,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.00029258,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ])

__Returning these as a Series__
It's easy to run the model against a column of the dataframe, collect the results as a Series, and attach that Series as a new column. Do not sort or reorder the dataframe in between: the rows of `out` line up with the dataframe rows by position only.

In [234]:
import operator
topics = []
for item in out:
    max_index, max_value = max(enumerate(item), key=operator.itemgetter(1))
    topics.append(max_index) 
    
df["topicnumber"] = pd.Series(topics, index=df.index)    

In [235]:
topics_likelihood = []
for item in out:
    max_index, max_value = max(enumerate(item), key=operator.itemgetter(1))
    topics_likelihood.append(max_value)
    
df["strengthoftopic"] = pd.Series(topics_likelihood, index=df.index)     
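
The two loops above can probably be collapsed into NumPy calls. The following is a sketch on toy data, assuming `out` is a dense NumPy array as shown earlier; `out_demo` and `df_demo` are made-up stand-ins, not the real objects.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `out`: 3 documents x 4 topics (values are made up).
out_demo = np.array([[0.00, 0.02, 0.05, 0.00],
                     [0.10, 0.00, 0.00, 0.03],
                     [0.00, 0.00, 0.00, 0.07]])

# Toy frame with a non-default index, like the case dataframe.
df_demo = pd.DataFrame({"lem": ["a", "b", "c"]}, index=[15000, 15001, 15002])

# Index of the largest weight in each row, and the weight itself.
# Assigning a NumPy array fills the column by position, not by index label.
df_demo["topicnumber"] = out_demo.argmax(axis=1)
df_demo["strengthoftopic"] = out_demo.max(axis=1)
```

Like `max(enumerate(item), ...)`, `argmax` returns the first position when several topics tie for the largest weight.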

In [236]:
df.topicnumber.value_counts() #let's make sure this is a good model...

16    2053
0     1567
22    1539
27    1457
3     1343
19    1242
10    1039
29     944
20     943
4      860
11     764
21     734
1      680
15     652
18     589
6      559
5      555
26     552
14     547
23     542
13     540
7      458
8      450
9      449
17     434
2      405
24     401
28     377
12     366
25     227
Name: topicnumber, dtype: int64

## Step 3: Creating dictionary of topic components
There's probably an easier way to do this, but I haven't found one. I'm running the model function again (the fixed `random_state` reproduces the earlier results), but this time building a dictionary of each topic's top words to "look up" from my dataframe.

In [221]:
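def nmf_topics_dict(corp, n_topics):
    max_df = .80       # same vectorizer settings as nmf_mod, so the topics match
    n_top_words = 40

    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=5,# ngram_range=(1,2), #max_features=n_features,
                                       stop_words='english')

    tfidf = tfidf_vectorizer.fit_transform(corp)
    # NOTE: newer scikit-learn releases replace `alpha` with alpha_W / alpha_H
    nmf = NMF(n_components=n_topics, random_state=2, alpha=.1, l1_ratio=.5).fit(tfidf)
    # NOTE: newer scikit-learn releases rename this method to get_feature_names_out()
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()

    # Map each topic index to a comma-separated string of its top words.
    topic_dict = {}
    for topic_idx, topic in enumerate(nmf.components_):
        topic_dict[topic_idx] = ", ".join([tfidf_feature_names[i]
                                    for i in topic.argsort()[:-n_top_words - 1:-1]])
    return topic_dict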
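
The slice `topic.argsort()[:-n_top_words - 1:-1]` is the least obvious part of the function. Here is a small sketch of what it does, using made-up weights and a hypothetical six-word vocabulary in place of `nmf.components_` and the tf-idf feature names.

```python
import numpy as np

# Made-up weights for one topic over six terms.
terms = ["court", "tax", "union", "lease", "patent", "jury"]
topic = np.array([0.1, 0.9, 0.0, 0.4, 0.7, 0.2])
n_top_words = 3

# argsort() orders indices from smallest to largest weight; the slice walks
# backwards from the end and stops after n_top_words items.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_terms = [terms[i] for i in top_idx]
```

The result lists the highest-weighted terms first, which is why each topic string starts with its most characteristic word.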

In [222]:
# I tested a few different numbers of topics; 30 was optimal
nmf_words_30 = nmf_topics_dict(df.lem, 30) #dict object

In [223]:
nmf_words_30

{0: 'error, defendant, sup, deliver, judgment, statute, mckenna, bring, record, messrs, prosecute, assignment, recover, favor, render, pass, verdict, proceeding, demurrer, cost, exception, try, sustain, assign, appearance, overrule, allow, purpose, assess, make, come, involve, result, raise, memorandum, contrary, validity, contention, mandamus, approve',
 1: 'vacate, remand, consideration, judgment, solicitor, moot, reconsideration, proceeding, suggestion, respondent, reason, app, ninth, seventh, pauperis, forma, disposition, hearing, inconsistent, result, appellate, record, basis, appropriate, merit, eighth, habeas, consistent, outright, complaint, supra, relation, manoli, dismissal, finan, ordman, gillen, wulf, consider, timely',
 2: 'want, substantial, report, appellate, app, properly, rhyne, consideration, solicitor, grey, quel, appellees, kaufmann, derengoski, probable, bourke, melaniphy, sybert, tighe, torina, londerholm, reversed, jurisdictional, botter, stedman, toch, hickey, u

In [224]:
import json
with open('finaliteration_topics.json', 'w') as fp:
    json.dump(nmf_words_30, fp)
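
One thing to watch if the JSON file is ever loaded back in: JSON object keys are always strings, so the integer topic numbers come back as `"0"`, `"1"`, and so on. The sketch below uses a two-entry stand-in (`topics_demo`) rather than the real dictionary, and round-trips through a string instead of a file.

```python
import json

# Small stand-in for nmf_words_30 with integer keys.
topics_demo = {0: "error, defendant", 1: "vacate, remand"}

# Round-trip through JSON text (same effect as json.dump / json.load on a file).
reloaded = json.loads(json.dumps(topics_demo))

# The keys are now strings, so convert them back to int before any lookup.
restored = {int(k): v for k, v in reloaded.items()}
```

Without the conversion, a lookup such as `reloaded.get(0)` would return `None`.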

## Step 4: Looking up topic words for each item in dataframe

In [237]:
def word_lookup(num):
    return nmf_words_30.get(num)

In [238]:
df["words"] = df.topicnumber.apply(word_lookup)
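
As an alternative sketch, pandas can do the same lookup without a helper function by passing the dictionary to `Series.map`. The data here is made up; `words_demo` and `topicnumber` are stand-ins for `nmf_words_30` and `df.topicnumber`.

```python
import pandas as pd

words_demo = {0: "error, defendant", 1: "vacate, remand"}
topicnumber = pd.Series([1, 0, 1, 2], index=[15000, 15001, 15002, 15003])

# map() looks each value up in the dict; keys missing from the dict become NaN.
words = topicnumber.map(words_demo)
```

The one difference from `apply(word_lookup)` is the missing-key case: `dict.get` returns `None`, while `map` yields `NaN`.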

In [239]:
df.loc[15017, "words"]  # This cell and the ones below verify that it worked

'forma, pauperis, proceed, reason, examination, submit, reverse, docketing, frivolous, ninth, hearing, compliance, supra, issue, clerk, habeas, record, respondent, extraordinary, consideration, relief, filing, vacate, docket, transcript, forth, request, app, set, noncriminal, status, allow, mahorner, unless, argument, faircloth, result, denial, appellate, fender'

In [240]:
df.loc[14972, "lem"]

'appellant licensed physician convict accessory married person information advice prevent conception follow examination prescribe contraceptive device wife statute make person article prevent conception appellant claim accessory statute apply violate fourteenth appellate judgment appellant standing a'

In [14]:
df.loc[15017, "caseurl"]

'http://caselaw.findlaw.com/us-supreme-court/380/145.html'

In [9]:
df.to_pickle("full_project_modelled_final.pickle")

In [10]:
df = pd.read_pickle("full_project_modelled_final.pickle")

## Step 5: Putting data together for visualization
I'm making a brushable area chart with D3. This visualization requires a datapoint for every topic for every year, so we have to do some pivoting to make that happen.

In [119]:
# some topics were extremely similar, so at the suggestion of my instructors,
# for the sake of the visualization, I have condensed the 30 topics down to 19

def topic_condenser(topicnum):
    if topicnum == 20:
        return 24
    if topicnum == 25:
        return 1
    if topicnum == 2:
        return 12
    if topicnum == 27:
        return 26
    if topicnum == 18 or topicnum == 5:
        return 29
    if topicnum == 8 or topicnum == 22:
        return 7
    if topicnum == 15:
        return 16
    if topicnum == 9:
        return 14
    if topicnum == 19:
        return 3
    else: 
        return topicnum
df["condensedtopics"] = df.topicnumber.apply(topic_condenser)
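
The chain of `if` statements could also be written as a lookup table, which makes the merges easier to audit. This is only a sketch of the same mapping; `CONDENSE_MAP` and `condense` are hypothetical names, not part of the original notebook.

```python
# Same merges as topic_condenser, written as a lookup table.
CONDENSE_MAP = {20: 24, 25: 1, 2: 12, 27: 26, 18: 29, 5: 29,
                8: 7, 22: 7, 15: 16, 9: 14, 19: 3}

def condense(topicnum):
    # Topics absent from the table keep their original number.
    return CONDENSE_MAP.get(topicnum, topicnum)

# Distinct topic ids that remain after condensing topics 0..29.
condensed_ids = sorted({condense(t) for t in range(30)})
```

With this version the column would be built as `df.topicnumber.map(condense)`.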

In [120]:
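# doing some research on the not so obvious topics
# drop the cases assigned to topic 2 from the dataframe
df = df[df["topicnumber"] != 2]
#df_16.ix[15065, "caseurl"]
# NOTE: df_16 is not defined in the cells shown here; it must already exist in the session
df_16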

Unnamed: 0,caseurl,casetitle,years,case,lem,topicnumber,strengthoftopic,words,condensedtopics
15006,http://caselaw.findlaw.com/us-supreme-court/38...,FLORIDA EAST COAST RAILWAY CO. v. UNITED STATE...,1965,United States Supreme Court KASHARIAN v. WIL...,want jurisdiction middle supp alvis appellant ...,15,0.034274,"appellant, appellees, supp, solicitor, probabl...",15
15042,http://caselaw.findlaw.com/us-supreme-court/38...,"O'CONNOR v. OHIO, 382 U.S. 19 (1965)",1965,United States Supreme Court O'CONNOR v. OHIO...,cowell appellant friberg appellee want jurisdi...,15,0.035508,"appellant, appellees, supp, solicitor, probabl...",15
15049,http://caselaw.findlaw.com/us-supreme-court/38...,PRICE v. STATE ROAD COMMISSION OF WEST VIRGINI...,1965,United States Supreme Court PUGACH v. NEW YO...,want jurisdiction dba appellant graziani sarve...,15,0.050230,"appellant, appellees, supp, solicitor, probabl...",15
15053,http://caselaw.findlaw.com/us-supreme-court/38...,"BURNETTE v. DAVIS, 382 U.S. 42 (1965)",1965,United States Supreme Court RATLEY v. CROUSE...,ratley want jurisdiction supp bowl bowl appell...,15,0.040606,"appellant, appellees, supp, solicitor, probabl...",15
15063,http://caselaw.findlaw.com/us-supreme-court/38...,"SCREVANE v. LOMENZO, 382 U.S. 11 (1965)",1965,United States Supreme Court SCREVANE v. LOME...,handel appellant solicitor appellees judgment ...,15,0.057852,"appellant, appellees, supp, solicitor, probabl...",15
15090,http://caselaw.findlaw.com/us-supreme-court/38...,"TRAVIA v. LOMENZO, 382 U.S. 9 (1965)",1965,United States Supreme Court TRAVIA v. LOMENZ...,rifkind costikyan appellant solicitor appellee...,15,0.034208,"appellant, appellees, supp, solicitor, probabl...",15
15127,http://caselaw.findlaw.com/us-supreme-court/38...,WETHERALL v. STATE ROAD COMMISSION OF WEST VIR...,1965,United States Supreme Court WETHERALL v. STA...,appellant graziani sarver appellees dba,15,0.066042,"appellant, appellees, supp, solicitor, probabl...",15
15133,http://caselaw.findlaw.com/us-supreme-court/38...,"AMALGAMATED TRANSIT UNION v. UNITED STATES, 3...",1966,United States Supreme Court AMALGAMATED TRAN...,amalgamate teamster chauffeur warehouseman hel...,15,0.054034,"appellant, appellees, supp, solicitor, probabl...",15
15136,http://caselaw.findlaw.com/us-supreme-court/38...,"AMERICAN TRUCKING ASSOCIATIONS, INC. v. UNITED...",1966,United States Supreme Court AMERICAN TRUCKIN...,supp beardsley busser appellant helmetag appel...,15,0.040616,"appellant, appellees, supp, solicitor, probabl...",15
15137,http://caselaw.findlaw.com/us-supreme-court/38...,"AMERICAN TRUCKING ASSOCIATIONS, INC. v. UNITED...",1966,United States Supreme Court AMERICAN TRUCKIN...,supp beardsley appellant solicitor moloney bis...,15,0.044420,"appellant, appellees, supp, solicitor, probabl...",15


In [124]:
df_details = pd.read_csv("detailsford3.csv", encoding = 'iso-8859-1')
df_details.columns = ["condensedtopics", "topicname", "title", "exampleURL", "leadpp", "topicwords"]
df_details

Unnamed: 0,condensedtopics,topicname,title,exampleURL,leadpp,topicwords
0,4,Interstate Law,"KOEHRING CO. v. HYDE CONSTR. CO., (1966)",http://caselaw.findlaw.com/us-supreme-court/38...,"On March 10, 1964, the Court of Appeals for th...","decree, enter, final, suit, adjudge, supplemen..."
1,16,Civil Rights - discrimination,"LOUISIANA v. UNITED STATES, (1965)",http://caselaw.findlaw.com/us-supreme-court/38...,Pursuant to 42 U.S.C. 1971 (c) the Attorney Ge...,"statute, suit, clause, require, violate, regul..."
2,24,Municipal,"CAMARA v. MUNICIPAL COURT, (1967)",http://caselaw.findlaw.com/us-supreme-court/38...,Appellant was charged with violating the San F...,"ordinance, pass, permit, construct, adopt, ope..."
3,1,Right to Defense Regardless of Income,"GIDEON v. WAINWRIGHT, (1963)",http://caselaw.findlaw.com/us-supreme-court/37...,Charged in a Florida State Court with a noncap...,"vacate, remand, consideration, judgment, solic..."
4,12,Monopolies,"UNITED STATES v. UNITED SHOE CORP., (1968)",http://caselaw.findlaw.com/us-supreme-court/39...,In 1953 the District Court for the District of...,"report, app, consideration, exception, improvi..."
5,23,Free Speech,"MEMOIRS v. MASSACHUSETTS, (1966)",http://caselaw.findlaw.com/us-supreme-court/38...,"Appellee, the Attorney General of Massachusett...","judgment, reverse, enter, verdict, favor, rend..."
6,26,Stocks & Fair Values,"FRIBOURG NAV. CO. v. COMMISSIONER, (1966)",http://caselaw.findlaw.com/us-supreme-court/38...,Prior to acquiring a used Liberty ship for $46...,"value, assess, par, decedent, valuation, cost,..."
7,29,Torts - a civil wrong that caused a loss,"UNITED STATES v. ATLAS INS. CO., (1965)",http://caselaw.findlaw.com/us-supreme-court/38...,The Life Insurance Company Income Tax Act of 1...,"claim, claimant, suit, owner, assert, cl, poss..."
8,10,Workers Unions,"LABOR BOARD v. BROWN, (1965)",http://caselaw.findlaw.com/us-supreme-court/38...,Respondents were members of a multiemployer ba...,"employee, employer, relation, agreement, barga..."
9,6,"Jurisdiction - states, immigration, reservations","DURFEE v. DUKE, (1963)",http://caselaw.findlaw.com/us-supreme-court/37...,Petitioners sued respondent in a Nebraska Stat...,"jurisdiction, want, probable, argument, sup, d..."


In [130]:
df_with_details = pd.merge(df, df_details, how = "inner", on = "condensedtopics")

In [148]:
#temp_df = df_with_details[['years', 'condensedtopics', "topicname", "title", "exampleURL", "leadpp", "topicwords"]]
#temp_df.to_csv("temp.csv")
temp_df = df_with_details[['years', 'condensedtopics']]
temp_df.condensedtopics.value_counts()

16    2705
3     2585
7     2447
29    2088
26    2009
0     1567
24    1344
10    1039
14     996
1      907
4      860
11     764
21     734
6      559
23     542
13     540
17     434
28     377
12     366
Name: condensedtopics, dtype: int64

In [None]:
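# dummy value of 1 for each row, so the cases can be counted per year/topic.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
temp_df = temp_df.assign(count=1)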
temp_df

In [150]:
# collapse the rows into one row per (year, topic) pair, with the number of cases in "count"
temp_df = temp_df.groupby(["years", "condensedtopics"]).count().reset_index()
temp_df

Unnamed: 0,years,condensedtopics,count
0,1792,21,1
1,1793,24,2
2,1795,7,1
3,1795,10,1
4,1796,3,1
5,1796,4,1
6,1796,21,1
7,1796,29,1
8,1798,4,1
9,1798,21,1


#### A few (really cool) things are happening in the cell below 

First, we pivot to add dummy values for nonexistent year/topic points (for example, there's only 1 case in 1792, but the chart needs a point for every topic in that year, so every other topic gets a 0). The topic numbers become column headers, the NaNs are filled with 0's, and then we unstack the df back into long form and reset the index.
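
Here is a minimal sketch of that pivot, fill, and unstack sequence on three made-up rows, so the intermediate shapes are easy to see. The frame `counts` and its values are invented for illustration; only the column names match the notebook.

```python
import pandas as pd

# Made-up counts: only topic 21 appears in 1792; topics 7 and 24 appear in 1793.
counts = pd.DataFrame({
    "years": [1792, 1793, 1793],
    "condensedtopics": [21, 7, 24],
    "count": [1, 2, 1],
})

# Wide table: one row per year, one column per topic, NaN where no cases exist.
wide = counts.pivot_table(values="count", index="years", columns="condensedtopics")

# Fill the gaps with 0, then unstack back to one row per (topic, year) pair.
tidy = wide.fillna(0).unstack().reset_index()
tidy.columns = ["condensedtopics", "years", "count"]
```

Three input rows become six output rows (2 years x 3 topics), and the three combinations that had no cases now carry an explicit 0.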



In [None]:
data_fillna = temp_df.pivot_table(values="count", index="years", columns="condensedtopics").fillna(0).unstack().reset_index()

In [152]:
#we lose the count label column in the previous steps, so we're just renaming it here, and reordering columns based on 
#how they are arranged in the viz csv
data_fillna.columns = ["condensedtopics", "years", "count"]
data_fillna = data_fillna[["years", "condensedtopics", "count"]]

In [153]:
# attach the topic details to every year/topic point (sorting by year happens in the next cell)
final_data = pd.merge(data_fillna, df_details, how = "inner", on = "condensedtopics")
final_data

Unnamed: 0,years,condensedtopics,count,topicname,title,exampleURL,leadpp,topicwords
0,1792,0,0,Criminal Activity - nonviolent,"ORNELAS et al. v. UNITED STATES, (1996)",http://caselaw.findlaw.com/us-supreme-court/51...,In denying petitioners' motion to suppress coc...,"""error, defendant, sup, deliver, judgment, sta..."
1,1793,0,0,Criminal Activity - nonviolent,"ORNELAS et al. v. UNITED STATES, (1996)",http://caselaw.findlaw.com/us-supreme-court/51...,In denying petitioners' motion to suppress coc...,"""error, defendant, sup, deliver, judgment, sta..."
2,1795,0,0,Criminal Activity - nonviolent,"ORNELAS et al. v. UNITED STATES, (1996)",http://caselaw.findlaw.com/us-supreme-court/51...,In denying petitioners' motion to suppress coc...,"""error, defendant, sup, deliver, judgment, sta..."
3,1796,0,0,Criminal Activity - nonviolent,"ORNELAS et al. v. UNITED STATES, (1996)",http://caselaw.findlaw.com/us-supreme-court/51...,In denying petitioners' motion to suppress coc...,"""error, defendant, sup, deliver, judgment, sta..."
4,1798,0,0,Criminal Activity - nonviolent,"ORNELAS et al. v. UNITED STATES, (1996)",http://caselaw.findlaw.com/us-supreme-court/51...,In denying petitioners' motion to suppress coc...,"""error, defendant, sup, deliver, judgment, sta..."
5,1799,0,1,Criminal Activity - nonviolent,"ORNELAS et al. v. UNITED STATES, (1996)",http://caselaw.findlaw.com/us-supreme-court/51...,In denying petitioners' motion to suppress coc...,"""error, defendant, sup, deliver, judgment, sta..."
6,1800,0,0,Criminal Activity - nonviolent,"ORNELAS et al. v. UNITED STATES, (1996)",http://caselaw.findlaw.com/us-supreme-court/51...,In denying petitioners' motion to suppress coc...,"""error, defendant, sup, deliver, judgment, sta..."
7,1801,0,0,Criminal Activity - nonviolent,"ORNELAS et al. v. UNITED STATES, (1996)",http://caselaw.findlaw.com/us-supreme-court/51...,In denying petitioners' motion to suppress coc...,"""error, defendant, sup, deliver, judgment, sta..."
8,1802,0,0,Criminal Activity - nonviolent,"ORNELAS et al. v. UNITED STATES, (1996)",http://caselaw.findlaw.com/us-supreme-court/51...,In denying petitioners' motion to suppress coc...,"""error, defendant, sup, deliver, judgment, sta..."
9,1803,0,2,Criminal Activity - nonviolent,"ORNELAS et al. v. UNITED STATES, (1996)",http://caselaw.findlaw.com/us-supreme-court/51...,In denying petitioners' motion to suppress coc...,"""error, defendant, sup, deliver, judgment, sta..."


In [154]:
final_data.sort_values("years", inplace = True, ascending = True)
final_data

Unnamed: 0,years,condensedtopics,count,topicname,title,exampleURL,leadpp,topicwords
0,1792,0,0,Criminal Activity - nonviolent,"ORNELAS et al. v. UNITED STATES, (1996)",http://caselaw.findlaw.com/us-supreme-court/51...,In denying petitioners' motion to suppress coc...,"""error, defendant, sup, deliver, judgment, sta..."
1554,1792,21,1,Civil Rights - search and seizure,"MAPP v. OHIO, (1961)",http://caselaw.findlaw.com/us-supreme-court/36...,"On May 23, 1957, three Cleveland police office...","person, obscene, engage, make, convict, unlawf..."
2664,1792,7,0,Violent Crimes & Death Penalty,"BUMPER v. NORTH CAROLINA, (1968)",http://caselaw.findlaw.com/us-supreme-court/39...,Petitioner was tried for rape in North Carolin...,"death, penalty, sentence, circumstance, punish..."
444,1792,17,0,Crimes - mail,"SINGER v. UNITED STATES, (1965)",http://caselaw.findlaw.com/us-supreme-court/38...,"Petitioner, a defendant in a federal criminal ...","solicitor, curia, urge, amici, equally, divide..."
1998,1792,29,0,Torts - a civil wrong that caused a loss,"UNITED STATES v. ATLAS INS. CO., (1965)",http://caselaw.findlaw.com/us-supreme-court/38...,The Life Insurance Company Income Tax Act of 1...,"claim, claimant, suit, owner, assert, cl, poss..."
2220,1792,12,0,Monopolies,"UNITED STATES v. UNITED SHOE CORP., (1968)",http://caselaw.findlaw.com/us-supreme-court/39...,In 1953 the District Court for the District of...,"report, app, consideration, exception, improvi..."
1332,1792,10,0,Workers Unions,"LABOR BOARD v. BROWN, (1965)",http://caselaw.findlaw.com/us-supreme-court/38...,Respondents were members of a multiemployer ba...,"employee, employer, relation, agreement, barga..."
2886,1792,23,0,Free Speech,"MEMOIRS v. MASSACHUSETTS, (1966)",http://caselaw.findlaw.com/us-supreme-court/38...,"Appellee, the Attorney General of Massachusett...","judgment, reverse, enter, verdict, favor, rend..."
2442,1792,6,0,"Jurisdiction - states, immigration, reservations","DURFEE v. DUKE, (1963)",http://caselaw.findlaw.com/us-supreme-court/37...,Petitioners sued respondent in a Nebraska Stat...,"jurisdiction, want, probable, argument, sup, d..."
3996,1792,28,0,Tenants Rights - business and individual,"MILLINERY CORP. v. COMMISSIONER, (1956)",http://caselaw.findlaw.com/us-supreme-court/35...,"In April 1924, petitioner leased land in New Y...","lease, lessee, lessor, mineral, premise, year,..."


In [156]:
#backup file
final_data.to_csv("topicsbyyear.csv", index = False)
final_data.to_csv("year_topic_data2.csv", index = False)

In [None]:
# The best part of this viz is the side-to-side brushing effect. For that, we need the total cases for every year
# and no other information.

data_fillna.groupby("years")["count"].sum().reset_index().to_csv("year_data.csv", index = False)

## _The end!_