# CS109 Project - The Court Rules In Favor Of...
## Aidi Adnan Brian John (Team AABJ)

### Abstract
The purpose of this project is to predict votes of Supreme Court justices using oral argument transcripts. Studies in linguistics and psychology, as well as common sense, suggest that the word choices people make convey crucial information about their beliefs and intentions with regard to issues. Rather than use precedents or formal analysis of the law to predict Supreme Court decisions, we attempt to extract essential emotional features of oral arguments made by justices and advocates in the court. Using aggregate data from 1946 to present

### Data
Oral Argument Transcripts - obtained from http://www.supremecourt.gov/oral_arguments/argument_transcript.aspx. Transcripts are made available on the day of the court hearing.
Justice Vote Counts/Case Information - obtained from the Supreme Court Database.

## Data Cleaning and Preparation

In [33]:
import string
import re
import numpy as np
import pandas as pd
import operator
import os
import sys
import io
import collections

We used a Python script (scraper.py) to scrape the PDFs from the Supreme Court website (we did not upload them to the repository, because we ultimately wanted to work with text files). We then used a script to convert the PDF files to text files, after first removing the last 10 pages, which are reserved for a word index.

In [34]:
# gather all txt files, first get the path to the data directory
# then list the files and filter out all non-txt files
curPath = os.getcwd()
dataPath = curPath + '/data/'
fileList = os.listdir(dataPath)
fileExt = ".txt"
txtFiles = filter(lambda f : f[-4:] == fileExt, fileList)
txtFiles = map(lambda f : dataPath + f, txtFiles)

In [35]:
#reads in one text file; replace "wut.txt" with the path to the relevant transcript, e.g. "data/cut_126, orig_ppl4.txt"
fil = "wut.txt"
with open(fil, "r") as text_file:
    text = text_file.read()

We wrote a parser to extract the names of the petitioner and respondent attorneys from the first 2 pages of the converted text document. An example list of petitioner and respondent speakers, taken from the 2014 case Johnson v. United States (docket number 13-7120), which is used as the recurring example in this process book, is:

Katherine M. Menendez, ESQ., Minneapolis, Minn.; on behalf of Petitioner
Michael R. Dreeben, ESQ., Deputy Solicitor General, Department of Justice, Washington D.C.; on behalf of Respondent
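As a rough illustration of how such blurbs can be assigned to a side, the sketch below checks each blurb for side-specific keywords. The function name `side_of` and the keyword tuples are assumptions made for this example; the project's own parser appears in the next cell.

```python
def side_of(blurb):
    """Return 'petitioner', 'respondent', or 'other' for one APPEARANCES blurb."""
    lowered = blurb.lower()
    # check petitioner-side words first, then respondent-side words
    if any(word in lowered for word in ("petitioner", "appellant", "plaintiff")):
        return "petitioner"
    if any(word in lowered for word in ("respondent", "appellee", "defendant")):
        return "respondent"
    return "other"

blurbs = [
    "Katherine M. Menendez, ESQ., Minneapolis, Minn.; on behalf of Petitioner",
    "Michael R. Dreeben, ESQ., Deputy Solicitor General, Department of Justice, "
    "Washington D.C.; on behalf of Respondent",
]
for blurb in blurbs:
    # the speaker name is everything before the first comma
    print(blurb.split(",")[0] + " -> " + side_of(blurb))
```

Lowercasing the blurb before matching is a design choice of this sketch; it avoids depending on how the transcript capitalizes "Petitioner" or "PETITIONER".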

In [36]:
def get_petitioners_and_respondents(text):
    '''
    This function takes in input text file as string and outputs 2 lists of speakers speaking for petitioners and
    respondents sides.
    '''
    #get portion of transcript between APPEARANCES and CONTENTS that specifies speakers for petitioners/respondents
    start = text.find('APPEARANCES:') + len('APPEARANCES:')
    end = text.find('C O N T E N T S')
    speakers_text = text[start:end]
    split_speakers_text = re.split(r'\.[ ]*\n', speakers_text)
    #for each speaker, get name (capitalized) and side (Pet/Res) he/she is speaking for
    pet_speakers, res_speakers, other_speakers = [], [], []
    for speaker in split_speakers_text:
        name = speaker.strip().split(',')[0]
        #search for first index of capitalized word (which will be start of speaker name)
        start = 0
        for idx, char in enumerate(name):
            if str.isupper(char):
                start = idx
                break
        #actual name to be appended to correct list
        name = name[start:]
        
        #if words Petition, Plaintiff, etc occur in speaker blurb, speaker belongs to Pet
        if any(x in speaker for x in ['etition' , 'ppellant', 'emand', 'evers', 'laintiff']):
            pet_speakers.append(name)
        #otherwise if words Respondent, Defendant, etc occur, speaker belongs to Res
        elif any(x in speaker for x in ['espond' , 'ppellee', 'efendant']):
            res_speakers.append(name)
        #otherwise if neither side is specified in blurb, speaking belongs to Other
        elif 'neither' in speaker:
            other_speakers.append(name)
    return pet_speakers, res_speakers, other_speakers

In [37]:
# For example, for wut.txt, there's 1 petitioner and 2 respondents
pet_speakers, res_speakers, other_speakers = get_petitioners_and_respondents(text)
pet_speakers, res_speakers, other_speakers 

(['MR. H. BARTOW FARR'],
 ['MR. ROY L. REARDON', 'MS. BARBARA D. UNDERWOOD'],
 [])

In [38]:
# generate a list of regular expressions to split the text on
def generateRES(nameList, plebe):
    """
    plebe is a boolean: True if nameList holds attorneys (non-justices),
    False if it holds justices
    """
    retList = []
    for name in nameList:
        address = ""
        if plebe:
            words = name.split(' ')
            # first term is the title, last
            # word is the last name
            address = "%s %s" % (words[0], words[-1])
            retList.append(address)
        else:
            address = "JUSTICE %s" % name
            address2 = "CHIEF JUSTICE %s" % name
            retList.append(address)
            retList.append(address2)
    return retList

In [39]:
def getJusticeNames(text):
    """
    returns the unique names that follow "JUSTICE " and precede a colon
    """
    index = 0
    retList = []
    while index < len(text):
        index = text.find("JUSTICE", index)
        if index == -1:
            break
        index += 8 # because length of JUSTICE is 7, plus length of the space
        end = text.find(':', index)
        if end == -1: # no colon left, so there are no more speaker labels
            break
        retList.append(text[index:end])
        index = end
    return list(set(retList))

In [40]:
# nms = getJusticeNames(clean_argument)
# generateRES(nms, False)
# print nms

The general flow of court proceedings is that the Petitioner attorneys make their oral argument, followed by the Respondent attorneys, before we hear the rebuttal argument of the Petitioners. Throughout the proceedings, Justices are free to interject with questions and statements of their own. The function below extracts the main argument portion of the oral transcripts, which is the part of the proceedings that we want to analyze.

In [41]:
def get_argument_portion(text):
    '''
    This function gets just the argument portion of the text.
    '''
    #start and end defines bounds of argument portion of text
    start = text.find('P R O C E E D I N G S')
    end = text.rfind('Whereupon')
    return text[start:end]

In [42]:
argument_portion = get_argument_portion(text)
argument_portion[:500]

"P R O C E E D I N G S\n\n2\n\n[10:13 a.m.]\n\n3\n4\n\nCHIEF JUSTICE REHNQUIST:\n\nWe'll hear argument on\n\nNumber 00-24, PGA Tour, Inc. vs. Casey Martin.\n\n5\n\nORAL ARGUMENT OF H. BARTOW FARR, III\n\n6\n\nON BEHALF OF THE PETITIONER\n\n7\n\nMR. FARR:\n\nMr. Farr?\n\nMr. Chief Justice and may it please\n\n8\n\nthe Court:\n\nThe Ninth Circuit in our view made two\n\n9\n\ncritical mistakes in applying the Disabilities Act to this\n\n10\n\ntype of claim by a professional athlete. First it failed\n\n11\n\nto recognize that Title 3 of the act, "

In [43]:
def count_words(s):
    '''
    This function counts number of proper English words in a string s (not non-words like - or --)
    '''
    s = s.split()
    non_words = ['-', '--']
    return sum([x not in non_words for x in s])

In [44]:
def modify_speaker_names(speakers):
    '''
    This function modifies speaker names like 'QUESTION' to 'QUESTION: ', for word count parsing later on
    '''
    return map(lambda x: x+': ', speakers)

In [45]:
def cleanTextMaker(text):
    '''
    This function takes in the portions of text, and gets rid of the \n and the line numbers. 
    '''
    text_arr=text.splitlines()
    text_clean=[]
    for each in text_arr:
        if each != '':
            try:
                int(each)
            except ValueError: #assumption: a line that parses as an integer is a line number, so only non-integer lines are kept
                text_clean.append(each)
    out_text=' '.join(text_clean)
    return out_text

In [46]:
cleanText=cleanTextMaker(argument_portion)
cleanText[:500]

"P R O C E E D I N G S [10:13 a.m.] CHIEF JUSTICE REHNQUIST: We'll hear argument on Number 00-24, PGA Tour, Inc. vs. Casey Martin. ORAL ARGUMENT OF H. BARTOW FARR, III ON BEHALF OF THE PETITIONER MR. FARR: Mr. Farr? Mr. Chief Justice and may it please the Court: The Ninth Circuit in our view made two critical mistakes in applying the Disabilities Act to this type of claim by a professional athlete. First it failed to recognize that Title 3 of the act, the public accommodations provision, apply on"

In [201]:
def total_wordcount(text):
    '''
    POSSIBLE FEATURE 1:
    This function returns a dictionary with key: name of speaker/justice and value: total number of words they
    spoke in total throughout argument.
    '''
    arg_text = get_argument_portion(text)
    #keeps track of current speaker
    current_speaker = 'N/A'
    clean_argument = cleanTextMaker(arg_text)
    
    #clean argument text split by instances where speakers change

    # first get the names of the judges and speakers
    pet_speakers, res_speakers, other_speakers = get_petitioners_and_respondents(text)
    justiceLeague = getJusticeNames(clean_argument)
    # create the regular expression for the justices and the plebes
    # need to also add the justice speaker
    JLList = generateRES(justiceLeague, False)
    plebeList = pet_speakers + res_speakers + other_speakers
    plebeRE = generateRES(plebeList, True)
    finREList = ["QUESTION"]
    finREList += plebeRE + JLList
    
    finREList = map(lambda name : name + ":", finREList)
    # escape the names so that the "." in titles such as "MR." is matched literally
    RE = '('  + '|'.join(map(re.escape, finREList)) + ')'
    
    split_argument = re.split(RE, clean_argument)
    all_speakers = finREList
    
    #num_words is a dictionary that maps all speaker names to number of words they spoke
    num_words = dict(zip(all_speakers + [current_speaker], [0] * (len(all_speakers)+1)))
    
    #iterate through split argument, accumulating word counts for all speakers
    for s in split_argument:
        #if split chunk signifies change in speaker
        if s in all_speakers:
            current_speaker = s
        #if split chunk is part of speech of current speaker, append to word count
        else:
            num_words[current_speaker] = num_words[current_speaker] + count_words(s)
    
    return num_words

In [48]:
#for example, this gives us total number of words uttered by each speaker
#we just need to find list of all speakers in the form they're referred to in the argument, "JUSTICE SCALIA: " for ex.
total_wordcount(text)
# print text

{'CHIEF JUSTICE REHNQUIST:': 24,
 'JUSTICE REHNQUIST:': 0,
 'MR. FARR:': 3433,
 'MR. REARDON:': 1480,
 'MS. UNDERWOOD:': 1129,
 'N/A': 13,
 'QUESTION:': 4009}
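The counting above relies on `re.split` with a capturing group, which keeps the speaker labels in the output so that labels and speech alternate. The sketch below shows the idea on a made-up three-turn exchange; the toy transcript and variable names are illustrative only.

```python
import re

toy = ("MR. FARR: May it please the Court. QUESTION: Why? "
       "MR. FARR: Because the statute says so.")
labels = ["MR. FARR:", "QUESTION:"]

# re.escape keeps the "." in "MR." literal; the outer group makes re.split keep the labels
pattern = "(" + "|".join(re.escape(label) for label in labels) + ")"
chunks = re.split(pattern, toy)

counts = dict((label, 0) for label in labels)
current = None
for chunk in chunks:
    if chunk in counts:
        # a label chunk switches the current speaker
        current = chunk
    elif current is not None:
        # any other chunk is speech by the current speaker
        counts[current] += len(chunk.split())
print(counts)
```

Text that appears before the first label is simply skipped here, whereas `total_wordcount` files it under the `'N/A'` key.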

In [49]:
def wordCounter(text):
    """
    counts number of times each word appears in a file
    
    returns a dictionary of word : times it appears
    """
    wordCount={}
    for word in text.split():
        # unfortunately, isalpha does discount some real words
        # like those with apostrophes, and words with question
        # marks at the end of them
        if word.lower() not in wordCount and word.isalpha():
            wordCount[word.lower()] = 1
        elif word.isalpha():
            wordCount[word.lower()] += 1
    return wordCount

In [50]:
def topWords(diction, num, verbose=False):
    """
    returns the top num words in a dictionary
    dictionary is expected to be of the format {word : count}
    """
    d = collections.Counter(diction)
    if verbose:
        for k, v in d.most_common(num):
            print '%s: %i' % (k, v)
    return d.most_common(num)

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

In [52]:
# credit to:
# http://stackoverflow.com/questions/15173225/how-to-calculate-cosine-similarity-given-2-sentence-strings-python
def getCosine(vec1, vec2):
    """
    cosine similarity is used to calculate the similarity index between two vectors
    """
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = np.sqrt(sum1) * np.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def textToVector(text):
    """
    turn a text into a vector of word counts
    """
    WORD = re.compile(r'\w+')
    words = WORD.findall(text)
    return collections.Counter(words)
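To make the similarity measure concrete, here is a small self-contained sketch that compares three short made-up sentences. The helper names `cosine` and `word_counts` are hypothetical stand-ins that mirror the two functions above; unlike the cell above, this sketch lowercases the text first.

```python
import math
import re
from collections import Counter

def cosine(counts_a, counts_b):
    """Cosine similarity between two word-count dictionaries."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def word_counts(text):
    """Lowercase the text and count its words."""
    return Counter(re.findall(r"\w+", text.lower()))

a = word_counts("the court may rule")
b = word_counts("the court may not rule")
c = word_counts("golf carts")

print(cosine(a, b))  # high: the sentences share four words
print(cosine(a, c))  # zero: no words in common
```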

In [154]:
def textCompiler(text):
    """
    cleanly puts together all of the functions we have written to return a text ready for
    classification
    """
    txt = get_argument_portion(text)
    txt = cleanTextMaker(txt)
    #txt = txt.replace('.', '\n').split('\n')
    return txt

In [93]:
def getData():
    """
    opens and cleans all data, putting it in a list
    uses txtFiles, which is declared at the beginning of
    the ipynb
    """
    retList = []
    for File in txtFiles:
        cur = open(File)
        textual = cur.read()
        cleanTextual = textCompiler(textual)
        retList.append(cleanTextual)
        cur.close()
    return retList

In [55]:
# splits along 1 dimension deterministically
def splitData(X, fraction_train=9.0 / 10.0):
    end_train = int(len(X) * fraction_train)
    X_train = X[0:end_train]
    X_test = X[end_train:]
    return X_train, X_test

# def splitTrainTest(X, Y, fraction_train = 9.0 / 10.0):
#     X_train, X_test = splitData(X, fraction_train)
#     Y_train, Y_test = splitData(Y, fraction_train)
#     assert(X_train.shape[0] == Y_train.shape[0])
#     assert(X_test.shape[0] == Y_test.shape[0])
#     return X_train, y_train, X_test, y_test

In [94]:
fullBody = getData()

In [95]:
# splits data. note that tf-idf measures the relative importance of a word in a document:
# its frequency in the document, weighted by its inverse document frequency
trainBody, testBody = splitData(fullBody)
vectorizer = TfidfVectorizer(min_df=1, norm='l2', use_idf=True, stop_words='english')

In [None]:
def textToMat(vizer, docList):
    """
    turns documents into a tfidf matrix
    Parameters:
        vizer is a vectorizer of type TfidfVectorizer
        docList is a list of (ideally) preprocessed documents
    """
    retList = []
    for doc in docList:
        # transform expects an iterable of documents, so wrap the single document in a list
        resMat = vizer.transform([doc]).todense()
        retList.append(resMat)
    return retList    

In [None]:
# print vectorizer.vocabulary_
# print mat.todense()#[:,942]

In [None]:
mat = vectorizer.fit_transform(trainBody)
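To give some intuition for what the tf-idf weights express, the sketch below computes a textbook tf-idf by hand on three made-up one-line documents. This is a simplified formula for illustration: scikit-learn's `TfidfVectorizer` applies smoothing and normalization, so its numbers will differ.

```python
import math

docs = [
    "the petitioner argues the statute applies",
    "the respondent argues the statute is vague",
    "the court asks about golf",
]
tokenized = [doc.split() for doc in docs]

def tf_idf(term, tokens, corpus):
    """Textbook tf-idf: term frequency times log inverse document frequency."""
    tf = tokens.count(term) / float(len(tokens))
    df = sum(1 for other in corpus if term in other)
    idf = math.log(len(corpus) / float(df))
    return tf * idf

# "the" occurs in every document, so its weight is zero
print(tf_idf("the", tokenized[0], tokenized))
# "petitioner" occurs in only one document, so it gets the largest weight of the three terms
print(tf_idf("petitioner", tokenized[0], tokenized))
print(tf_idf("statute", tokenized[0], tokenized))
```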

# Running classifier

### Logistic Regression

Logistic regression is a natural first choice for a model since our target value can be viewed as a probability between 0 and 1 that an individual justice votes For rather than Against, with a higher probability representing higher confidence that the justice votes in favor of the arguing party.
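As a minimal sketch of how logistic regression turns features into such a probability, the code below applies the logistic (sigmoid) function to a weighted sum. The weights, bias, and feature values are made up for illustration and are not fitted to our data.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

weights = [1.5, -2.0]  # hypothetical coefficients
bias = 0.25            # hypothetical intercept
p = predict_proba(weights, bias, [1.0, 0.0])
print("P(vote For) = %.3f" % p)
print("predicted vote: %s" % ("For" if p >= 0.5 else "Against"))
```

Thresholding at 0.5 gives the predicted vote, while the probability itself serves as the confidence.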

In [None]:
from sklearn.linear_model import LogisticRegression


caseId  
docketId   
caseIssuesId  
voteId  
dateDecision  
decisionType  
usCite  
sctCite  
ledCite  
lexisCite  
term   
naturalCourt  
chief  
docket   
caseName  
dateArgument  
dateRearg  
petitioner   
petitionerState  
respondent  
respondentState  
jurisdiction  
adminAction  
adminActionState  
threeJudgeFdc  
caseOrigin  
caseOriginState  
caseSource  
caseSourceState  
lcDisagreement  
certReason  
lcDisposition  
lcDispositionDirection  
declarationUncon  
caseDisposition   
caseDispositionUnusual  
partyWinning    
precedentAlteration  
voteUnclear    
issue  
issueArea  
decisionDirection  
decisionDirectionDissent  
authorityDecision1  
authorityDecision2  
lawType  
lawSupp  
lawMinor  
majOpinWriter  
majOpinAssigner  
splitVote  
majVotes  
minVotes 

issue: This variable identifies the issue for each decision. Although criteria for the identification of issues are hard to articulate, the focus here is on the subject matter of the controversy (e.g., sex discrimination, state tax, affirmative action) rather than its legal basis (e.g., the equal protection clause).

issueArea: This variable simply separates the issues identified in the preceding variable (issue) into the following larger categories: criminal procedure (issues 10010-10600), civil rights (issues 20010-20410), First Amendment (issues 30010-30020), etc.

decisionDirection: In order to determine whether the Court supports or opposes the issue to which the case pertains, this variable codes the ideological "direction" of the decision. An outcome is liberal (=2) or conservative (=1).

dateDecision: This variable contains the year, month, and day that the Court announced its decision in the case.

justiceName: This is a string variable indicating the first initial for the five justices with a common surname (Harlan, Johnson, Marshall, Roberts, and White) and last name of each justice.

chief: This variable identifies the chief justice during whose tenure the case was decided.

caseSource: This variable identifies the court whose decision the Supreme Court reviewed. If the case originated in the same court whose decision the Supreme Court reviewed, the entry in the caseOrigin should be the same as here. This variable has no entry if the case arose under the Supreme Court's original jurisdiction. 

certReason: This variable provides the reason, if any, that the Court gives for granting the petition for certiorari. 
 

In [205]:
bigdf=pd.read_csv("SCDB_2015_01_justiceCentered_Citation.csv")

In [207]:
bigdf['issueArea'].head()

0    8
1    8
2    8
3    8
4    8
Name: issueArea, dtype: float64

In [None]:
from patsy import dmatrix
bigdf=pd.read_csv("SCDB_2015_01_justiceCentered_Citation.csv")

# dateDecision and chief are kept for reference but are not used in the formula below
regress_vars = ['issue', 'issueArea', 'decisionDirection', 
                'dateDecision', 'justice', 'chief', 'caseSource', 'certReason', 'vote'] 

# drop rows with missing values up front so that X and y stay aligned row by row
smalldf = bigdf[regress_vars].dropna()

# one-hot encode the categorical predictors; the target stays a 1-d array of vote codes
X = dmatrix('C(issue) + C(issueArea) + C(decisionDirection) + C(justice) + \
             C(caseSource) + C(certReason)',
            smalldf, return_type="dataframe")
y = smalldf['vote'].values

print(X.shape)
print(y.shape)

model = LogisticRegression()
model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree 
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm
from sklearn.cross_validation import train_test_split, cross_val_score

In [None]:
# read in SCDB data from file
bigdf=pd.read_csv("supremeCourtDb.csv")
bigdf.head()


In [None]:
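# select several columns at once by passing a list; note that "case" is not among the column names listed above
df = bigdf[["docketId", "dateDecision", "case"]]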


### Linear SVM Classifier

In [None]:
svm_model = svm.SVC(C=1.0, kernel='linear', probability=True, class_weight='auto')
svm_fit = svm_model.fit(X, y)
# W is the held-out feature matrix; it is not defined anywhere in this notebook yet
svm_pred = svm_fit.predict(W)
# Class probabilities, based on log regression on distance to hyperplane.
svm_prob = svm_fit.predict_proba(W)
svm_dist = svm_fit.decision_function(W)

## 2. Justice Ruling Prediction

We use a different dataset and a slightly different approach to predict Supreme Court rulings. This method is motivated by the observation that usually only about 2 justices act as swing votes, and that justices' decisions are strongly influenced by factors outside of what transpires in court proceedings, such as background information about the case itself. The Supreme Court Database provides a justice-centered dataset with extensive information about each case; the fields we are most interested in analyzing are:

1. Decision Year
2. Natural Court
3. Petitioner
4. Respondent
5. Case Origin
6. Case Source
7. Lower Court Disposition Direction
8. Issue Area

Our target value to predict is the field called partyWinning (petitioner or respondent). Under our justice-centered approach, we predict the vote of each individual justice, aggregate the predicted votes, and take the majority. The confidence of the overall prediction is obtained by averaging the individual confidences of our per-justice models.
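The aggregation step can be sketched as follows. The function name `aggregate_votes`, the vote encoding (1 for petitioner, 0 for respondent), and the example confidences are assumptions made for this illustration.

```python
def aggregate_votes(justice_predictions):
    """
    justice_predictions: list of (vote, confidence) pairs, one per justice,
    where vote is 1 for the petitioner and 0 for the respondent.
    Returns (predicted winner, average confidence).
    """
    votes = [vote for vote, _ in justice_predictions]
    # strict majority for the petitioner; a tie falls to the respondent (an assumption)
    winner = 1 if sum(votes) * 2 > len(votes) else 0
    total_conf = sum(conf for _, conf in justice_predictions)
    return winner, total_conf / float(len(justice_predictions))

# hypothetical per-justice predictions for a 5-4 case
predictions = [(1, 0.9), (1, 0.8), (1, 0.7), (1, 0.6), (1, 0.55),
               (0, 0.6), (0, 0.7), (0, 0.8), (0, 0.9)]
winner, confidence = aggregate_votes(predictions)
print("predicted winner: %s" % ("petitioner" if winner == 1 else "respondent"))
print("average confidence: %.3f" % confidence)
```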

In [None]:
# read in justice-centered SCDB data from file
newdf=pd.read_csv("SCDB_justice_centered.csv")

In [None]:
# maybe lcDispositionDirection? choose features with continuous/numerical features
# do the numbers mean anything though?
newsmalldf = newdf[["term", "naturalCourt", "petitioner", "respondent", "caseOrigin", "caseSource", "lcDisposition", "issueArea"]]

In [None]:
newsmalldf.head()

For an intuitive understanding of the features above, check out the documentation here: http://scdb.wustl.edu/documentation.php?var=petitioner. All the above features are categorical instead of continuous (which means the numbers specify a category instead of having a numerical meaning). For an illustrative example, the "petitioner" variable includes:

1. attorney general of the United States, or his office
2. specified state board or department of education
7. state department or agency
etc
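Because these codes are labels rather than quantities, one common way to stop a model from reading an order into them is one-hot encoding. The plain-Python sketch below illustrates the idea on a few made-up petitioner codes; it is not necessarily how the features are fed to the models in this notebook.

```python
def one_hot(codes):
    """Turn a list of category codes into indicator rows, one column per distinct code."""
    categories = sorted(set(codes))
    rows = [[1 if code == cat else 0 for cat in categories] for code in codes]
    return categories, rows

petitioner_codes = [1, 7, 2, 7]  # hypothetical values of the "petitioner" variable
categories, rows = one_hot(petitioner_codes)
print(categories)
for code, row in zip(petitioner_codes, rows):
    print("%d -> %s" % (code, row))
```

After encoding, code 7 is no longer "larger" than code 2; each code just switches on its own column.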

#### Advantages of Using Decision Tree Classifiers

Having an intuitive understanding of the meanings behind the variables is important and leads us to the idea of using a decision tree classifier. A distinct advantage of decision trees is that the decision at each node has an intuitive meaning and corresponds to querying along one feature axis at a time (e.g. is the petitioner an attorney general of the United States?).

Furthermore, trees are easy to understand and interpret. We can look at the top node and figure out which feature it corresponds to, and conclude that this feature contributes the most information gain, i.e. is the most important/predictive feature. This makes it easy to verify whether our results make intuitive sense.
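To make "information gain" concrete, the sketch below computes the entropy of a set of votes and the gain from splitting it on a yes/no question. The labels and the two candidate splits are made up. Note that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default rather than entropy, so this illustrates the concept rather than that default.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = float(len(labels))
    return -sum((n / total) * math.log(n / total, 2) for n in Counter(labels).values())

def information_gain(labels, mask):
    """Entropy reduction from splitting labels by a boolean mask."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / float(len(labels))
    return entropy(labels) - weighted

votes = [1, 1, 1, 0, 0, 0]                             # hypothetical votes
perfect = [True, True, True, False, False, False]     # a question that separates the votes
weak = [True, False, True, False, True, False]        # a question that barely helps

print(information_gain(votes, perfect))
print(information_gain(votes, weak))
```

A tree built with the entropy criterion would place the first question at the top node, because it yields the larger gain.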

We now show the process of running a decision tree for each justice before aggregating the votes.

### 2.1 Justice-Centered Decision Tree Classifiers

Ultimately, the feature that we want to predict is the vote for each justice.

In [None]:
from sklearn import tree

In [None]:
# newdf.majority refers to whether justice voted with the majority (1 for dissent, 2 for majority)
# newdf.partyWinning indicates winning party (0 for responding party, 1 for petitioning party, 2 for unclear)
# We use the above 2 features to infer which party the individual justice voted for
# NOTE: majority has around 4000 NaNs that we should filter out?
results = []
ctr1, ctr2 = 0,0
for idx, x in enumerate(newdf.majority):
    #if decision is unclear, append 2 to results (in reality, apparently there aren't ANY 2s)
    if newdf.partyWinning[idx] == 2:
        results.append(2)
        ctr1 += 1
        #move on to the next row; a break here would leave results shorter than newdf
        continue
    #if justice voted in the majority
    if x == 2:
        results.append(newdf.partyWinning[idx]) #results contains 0/1
    #if justice voted in the minority
    elif x == 1:
        results.append(1 - newdf.partyWinning[idx]) #results contains 0/1
    else:
        #majority is missing (NaN); defaulting to 1 is a placeholder that still needs cleaning up
        results.append(1)
        ctr2 += 1
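The inference rule in the loop above can also be written as a small pure function, which makes it easy to check each case in isolation. The name `infer_vote` is hypothetical, and unlike the loop above, which appends 1 when `majority` is missing, this sketch returns `None` so that such rows can be filtered out later.

```python
def infer_vote(majority, party_winning):
    """
    majority: 2 if the justice voted with the majority, 1 if the justice dissented
    party_winning: 1 if the petitioner won, 0 if the respondent won
    Returns 1 for a petitioner vote, 0 for a respondent vote,
    or None when the vote cannot be inferred (unclear winner or missing data).
    """
    if party_winning not in (0, 1):
        return None
    if majority == 2:
        return party_winning
    if majority == 1:
        return 1 - party_winning
    return None

# a dissenter in a case the petitioner won must have voted for the respondent
print(infer_vote(1, 1))
# a missing majority value (NaN) yields None instead of a guess
print(infer_vote(float("nan"), 1))
```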

In [None]:
len(newsmalldf)

In [None]:
newsmalldf.columns

In [None]:
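# these are our target values; wrap the list in a Series so it can be placed next to the features
pd.concat([newsmalldf, pd.Series(results, name="results")], axis=1)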

In [None]:
# drop rows where any column value is NaN - dealing with missing data
newsmalldf.dropna(axis=0).head()

In [None]:
len(newsmalldf), len(results)

In [None]:
newsmalldf["results"] = pd.Series(results)

In [None]:
newsmalldf

In [None]:
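# features are the case-level columns; targets are the inferred votes in "results"
treedf = newsmalldf.dropna(axis=0)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(treedf.drop("results", axis=1), treedf["results"])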

## Sentiment Analysis

In this section, we will analyze the transcripts using sentiment analysis to get a sense of what the verdict will be. 

### Part 1: Prepping the Text File

In [299]:
def get_file_dict(fileList, fileExt='.txt'):
    '''
    This function takes the fileList and returns a list of dictionaries of the format 
    {'docket': docket_number, 'full_text': full_text}
    '''
    fileDict=[]
    fields=['docket', 'full_text']
    txtFiles_filter = filter(lambda f : f[-4:] == fileExt, fileList)
    for each in txtFiles_filter:
        name_str=each[4:-4]
        try:
            indexx=name_str.index('_')
            docketNum=name_str[:indexx]
        except ValueError:
            docketNum=name_str
        cur = open(dataPath+each)
        textual = cur.read()
        cleanTextual = textCompiler(textual) #textCompiler edited (see above) @Aidi I commented out the replace-split
        cur.close()
        record=(docketNum, cleanTextual)
        fileDict.append(dict(zip(fields, record)))
    return fileDict

In [300]:
fileDict=get_file_dict(fileList)

In [301]:
txtdf = pd.DataFrame(fileDict)

(1116, 2)


Unnamed: 0,docket,full_text
0,00-1011,P R O C E E D I N G S (10:17 a.m.) CHIEF JUSTI...
1,00-1021,P R O C E E D I N G S (10:07 a.m.) CHIEF JUSTI...
2,00-1045,P R O C E E D I N G S (11:02 a.m.) CHIEF JUSTI...
3,00-10666,P R O C E E D I N G S (11:03 a.m.) CHIEF JUSTI...
4,00-1072,P R O C E E D I N G S (11:18 a.m.) CHIEF JUSTI...


In [277]:
casedf=pd.read_csv('supremeCourtDb.csv')

In [482]:
#the 2015 transcripts will be dropped in the process as the database does not contain 2015 entries
merged=pd.merge(left=txtdf, right=casedf, how='inner', left_on='docket', right_on='docket')

In [208]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
punctuation = list('.,;:!?()[]{}`''\"@#$^&*+-|=~_')

In [209]:
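# import the stop word list directly so that the module name `text` does not shadow the transcript variable `text`
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stopwords=ENGLISH_STOP_WORDS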

In [291]:
#adapted from hw5
def get_parts(thetext):
    nouns=[]
    descriptives=[]
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            #print token
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2

In [304]:
%%time
parsed=[get_parts(t) for t in merged.full_text.values]



### Dumped `parsed` to pickle

In [335]:
#####storing the parsed variable in a file in case it's needed later.
#import pickle

#credit: http://stackoverflow.com/questions/6568007/how-do-i-save-and-restore-multiple-variables-in-python
#with open('parsed.pickle', 'wb') as f:
#    pickle.dump(parsed, f)

# Getting back the objects:
#with open('parsed.pickle', 'rb') as f:
#    parsed = pickle.load(f)

In [327]:
#taking all the descriptives from the transcript. 
nbdata=[each[1] for each in parsed]

In [504]:
#flatten the per-sentence lists into one list per transcript; note that duplicates are kept, so repeated words count more than once
flattened=[]
for sub_nb in nbdata:
    flattened.append([item for l in sub_nb for item in l])

In [505]:
#adding the flattened list of descriptives to the dataframe as a column. 
merged['descriptives']=pd.Series(flattened, index=merged.index)
merged.head()

Unnamed: 0,docket,full_text,caseId,docketId,caseIssuesId,voteId,dateDecision,decisionType,usCite,sctCite,...,lawMinor,majOpinWriter,majOpinAssigner,splitVote,majVotes,minVotes,net_pos,descriptives,pos,neg
0,00-1011,P R O C E E D I N G S (10:17 a.m.) CHIEF JUSTI...,2000-079,2000-079-01,2000-079-01-01,2000-079-01-01-01,6/25/01,1,533 U.S. 348,121 S. Ct. 2268,...,,103,103,1,5,4,9,"[chief, oral, jurisdictional, legal, pure, com...","[[accurate, good, permissible, right, consiste...","[[unconstitutional, criminal, arbitrary, odd, ..."
1,00-1021,P R O C E E D I N G S (10:07 a.m.) CHIEF JUSTI...,2001-071,2001-071-01,2001-071-01-01,2001-071-01-01-01,6/20/02,1,536 U.S. 355,122 S. Ct. 2151,...,,107,103,1,5,4,-1,"[chief, oral, straightforward, civil, exclusiv...","[[precise, interesting, adequate, excited, str...","[[reluctant, undermined, puzzled, bad, difficu..."
2,00-1045,P R O C E E D I N G S (11:02 a.m.) CHIEF JUSTI...,2001-002,2001-002-01,2001-002-01-01,2001-002-01-01-01,11/13/01,1,534 U.S. 19,122 S. Ct. 441,...,15 U.S.C. § 1681,109,102,1,9,0,-6,"[chief, oral, improper, opposed, improper, exp...","[[accurate, timely, like, reasonable, right, f...","[[limited, negligent, defamatory, bad, liable,..."
3,00-10666,P R O C E E D I N G S (11:03 a.m.) CHIEF JUSTI...,2001-076,2001-076-01,2001-076-01-01,2001-076-01-01-01,6/24/02,1,536 U.S. 545,122 S. Ct. 2406,...,18 U.S.C. § 924,106,102,1,5,4,5,"[chief, oral, brandish, reasonable, separate, ...","[[precise, intricate, harmless, good, unlimite...","[[unconstitutional, violent, worst, anxious, u..."
4,00-1072,P R O C E E D I N G S (11:18 a.m.) CHIEF JUSTI...,2001-031,2001-031-01,2001-031-01-01,2001-031-01-01-01,3/19/02,1,535 U.S. 106,122 S. Ct. 1145,...,,107,102,1,9,0,10,"[chief, oral, relation-back, proper, procedura...","[[accurate, perfect, good, permissible, timely...","[[stringent, inconsistent, untimely, discrimin..."


In [492]:
#load opinion lexicon
pos_link = "opinion-lexicon-English/positive-words.txt"
neg_link = "opinion-lexicon-English/negative-words.txt"

def get_both_list(pos_link, neg_link):
    '''
    This function takes the links in for the lexicon files downloaded from 
    https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
    and outputs two lists: posList and negList
    '''
    pos_file = open(pos_link, "r")
    neg_file = open(neg_link, "r")

    posList = pos_file.read()
    negList = neg_file.read()
    
    posList=get_pos_lexicon(posList)
    negList=get_neg_lexicon(negList)

    posList=posList.split('\n')
    negList=negList.split('\n')

    return posList, negList

In [493]:
def get_neg_lexicon(text):
    '''
    This function gets neg words from the list, skipping the file header.
    '''
    #start and end define the bounds of the word list ('2-faced' is the first word, 'zombie' the last)
    start = text.find('2-faced')
    end = text.rfind('zombie') + len('zombie')
    return text[start:end]

def get_pos_lexicon(text):
    '''
    This function gets pos words from the list, skipping the file header.
    '''
    #start and end define the bounds of the word list ('a+' is the first word, 'zippy' the last)
    start = text.find('a+')
    end = text.rfind('zippy') + len('zippy')
    return text[start:end]

In [None]:
#get posList and negList for the next step
posList, negList = get_both_list(pos_link, neg_link)

In [495]:
def split_pos_neg(descriptives, posList, negList):
    '''
    This function takes all the descriptives for each transcript and 
    splits the list into two lists: one of positive and one of negative words. 
    '''
    pos=[]
    neg=[]
    for l in descriptives:
        positive=[]
        negative=[]
        for e in l:
            if e in posList:
                positive.append(e)
            elif e in negList:
                negative.append(e)
        pos.append([positive, 'pos'])
        neg.append([negative, 'neg'])
    return pos, neg

In [506]:
pos, neg=split_pos_neg(list(merged.descriptives.values), posList, negList)



In [507]:
#adding the split lists to merged as columns
merged['pos']=pd.Series(pos, index=merged.index)
merged['neg']=pd.Series(neg, index=merged.index)

Unnamed: 0,docket,full_text,caseId,docketId,caseIssuesId,voteId,dateDecision,decisionType,usCite,sctCite,...,lawMinor,majOpinWriter,majOpinAssigner,splitVote,majVotes,minVotes,net_pos,descriptives,pos,neg
0,00-1011,P R O C E E D I N G S (10:17 a.m.) CHIEF JUSTI...,2000-079,2000-079-01,2000-079-01-01,2000-079-01-01-01,6/25/01,1,533 U.S. 348,121 S. Ct. 2268,...,,103,103,1,5,4,9,"[chief, oral, jurisdictional, legal, pure, com...","[[pure, pure, proper, gracious, permissible, c...","[[awkward, difficult, awkward, ambiguous, crim..."
1,00-1021,P R O C E E D I N G S (10:07 a.m.) CHIEF JUSTI...,2001-071,2001-071-01,2001-071-01-01,2001-071-01-01-01,6/20/02,1,536 U.S. 355,122 S. Ct. 2151,...,,107,103,1,5,4,-1,"[chief, oral, straightforward, civil, exclusiv...","[[straightforward, clear, correct, available, ...","[[improper, reluctant, puzzled, unable, wrong,..."
2,00-1045,P R O C E E D I N G S (11:02 a.m.) CHIEF JUSTI...,2001-002,2001-002-01,2001-002-01-01,2001-002-01-01-01,11/13/01,1,534 U.S. 19,122 S. Ct. 441,...,15 U.S.C. § 1681,109,102,1,9,0,-6,"[chief, oral, improper, opposed, improper, exp...","[[correct, like, available, reasonable, reason...","[[improper, improper, improper, improper, impr..."
3,00-10666,P R O C E E D I N G S (11:03 a.m.) CHIEF JUSTI...,2001-076,2001-076-01,2001-076-01-01,2001-076-01-01-01,6/24/02,1,536 U.S. 545,122 S. Ct. 2406,...,18 U.S.C. § 924,106,102,1,5,4,5,"[chief, oral, brandish, reasonable, separate, ...","[[reasonable, clear, leading, important, good,...","[[steep, disadvantaged, ineffective, ineffecti..."
4,00-1072,P R O C E E D I N G S (11:18 a.m.) CHIEF JUSTI...,2001-031,2001-031-01,2001-031-01-01,2001-031-01-01-01,3/19/02,1,535 U.S. 106,122 S. Ct. 1145,...,,107,102,1,9,0,10,"[chief, oral, relation-back, proper, procedura...","[[proper, consistent, significant, sufficient,...","[[unlawful, unlawful, discriminatory, unlawful..."


In [508]:
#get net count of positive words over negative words
net_count=[len(e1[0])-len(e2[0]) for (e1, e2) in zip(merged.pos, merged.neg)]
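
The split and net-count steps above can be checked on a small hand-made input. The sketch below is a stand-alone mirror of `split_pos_neg`, not the notebook's own cell; the names `split_pos_neg_sketch`, `toy_descriptives`, `toy_pos` and `toy_neg` and all of the toy words are made up for illustration and are not taken from the lexicon files or from `merged`.

```python
def split_pos_neg_sketch(descriptives, pos_words, neg_words):
    """Split each transcript's descriptives into positive and negative words.

    Hypothetical mirror of split_pos_neg: a word found in both lexicons
    is counted as positive only, as in the if/elif of the original.
    """
    pos_set, neg_set = set(pos_words), set(neg_words)
    pos, neg = [], []
    for words in descriptives:
        positive = [w for w in words if w in pos_set]
        negative = [w for w in words if w in neg_set and w not in pos_set]
        pos.append([positive, 'pos'])
        neg.append([negative, 'neg'])
    return pos, neg

# toy data (assumed, not from the real transcripts)
toy_descriptives = [['clear', 'improper', 'oral', 'clear'],
                    ['wrong', 'awkward', 'proper']]
toy_pos = ['clear', 'proper']
toy_neg = ['improper', 'wrong', 'awkward']

pos_toy, neg_toy = split_pos_neg_sketch(toy_descriptives, toy_pos, toy_neg)

# net count = number of positive words minus number of negative words
net_toy = [len(p[0]) - len(n[0]) for p, n in zip(pos_toy, neg_toy)]
print(net_toy)
```

Words that appear in neither lexicon (here `oral`) are dropped from both lists, so they do not affect the net count.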

In [509]:
#append net_pos to merged as a column
merged['net_pos']=pd.Series(net_count, index=merged.index)
merged.head(2)

Unnamed: 0,docket,full_text,caseId,docketId,caseIssuesId,voteId,dateDecision,decisionType,usCite,sctCite,...,lawMinor,majOpinWriter,majOpinAssigner,splitVote,majVotes,minVotes,net_pos,descriptives,pos,neg
0,00-1011,P R O C E E D I N G S (10:17 a.m.) CHIEF JUSTI...,2000-079,2000-079-01,2000-079-01-01,2000-079-01-01-01,6/25/01,1,533 U.S. 348,121 S. Ct. 2268,...,,103,103,1,5,4,6,"[chief, oral, jurisdictional, legal, pure, com...","[[pure, pure, proper, gracious, permissible, c...","[[awkward, difficult, awkward, ambiguous, crim..."
1,00-1021,P R O C E E D I N G S (10:07 a.m.) CHIEF JUSTI...,2001-071,2001-071-01,2001-071-01-01,2001-071-01-01-01,6/20/02,1,536 U.S. 355,122 S. Ct. 2151,...,,107,103,1,5,4,15,"[chief, oral, straightforward, civil, exclusiv...","[[straightforward, clear, correct, available, ...","[[improper, reluctant, puzzled, unable, wrong,..."


## Naive Bayes Classifier

In [456]:
import findspark
findspark.init()

from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

/usr/local/opt/apache-spark/libexec


In [464]:
nbdatardd=sc.parallelize([ele[1] for ele in parsed])
nbdatardd.cache()

ParallelCollectionRDD[13] at parallelize at PythonRDD.scala:423

In [465]:
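#adapted from hw5
#each RDD element is a list of word lists, so flatten twice to reach the words,
#then keep one entry per distinct word and give it a column index
adjvocab = (nbdatardd.flatMap(lambda wordlists: wordlists)
             .flatMap(lambda words: words)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
             .map(lambda kv: kv[0])
             .zipWithIndex()
            ).collectAsMap()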
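
For readers without a Spark context, the vocabulary step above can be sketched in plain Python. This is an illustration under assumptions, not the notebook's code: `build_vocab` and `toy_docs` are hypothetical names, and the indices it assigns follow first appearance, whereas the order produced by `zipWithIndex` depends on how Spark partitions the data. Only the property "each distinct word gets one unique index" carries over.

```python
from itertools import chain

def build_vocab(docs):
    """Map each distinct word to a unique column index.

    docs: a list of documents, where each document is a list of word lists
    (the same nesting that the two flatMap calls unwrap).
    """
    vocab = {}
    for word in chain.from_iterable(chain.from_iterable(docs)):
        if word not in vocab:
            vocab[word] = len(vocab)  # next free index
    return vocab

# toy input (assumed, not from the parsed transcripts)
toy_docs = [[['clear', 'proper'], ['improper']],
            [['clear'], ['wrong', 'clear']]]
toy_vocab = build_vocab(toy_docs)
print(toy_vocab)
```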

In [467]:
import itertools
Xarraypre=nbdatardd.map(lambda l: " ".join(list(itertools.chain.from_iterable(l))))
Xarray=Xarraypre.collect()

In [468]:
from sklearn.cross_validation import train_test_split
itrain, itest = train_test_split(xrange(len(Xarray)), train_size=0.7)
mask=np.ones(len(Xarray), dtype='int')
mask[itrain]=1
mask[itest]=0
mask = (mask==1)

In [469]:
X=np.array(Xarray)
y=np.array(merged.partyWinning.values)

In [470]:
def make_xy(X_col, y_col, vectorizer):
    X = vectorizer.fit_transform(X_col)
    y = y_col
    return X, y

In [471]:
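def log_likelihood(clf, x, y):
    '''
    Sum of the log-probabilities that clf assigns to the true class of each row.
    Assumes binary labels, with column 0 of predict_log_proba for class 0
    and column 1 for class 1.
    '''
    prob = clf.predict_log_proba(x)
    favorable = y == 1
    unfavorable = ~favorable
    return prob[unfavorable, 0].sum() + prob[favorable, 1].sum()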
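
The score computed by `log_likelihood` above is the sum, over rows, of the log-probability given to each row's true class. A minimal sketch of that arithmetic with hand-picked probabilities follows; `log_likelihood_from_probs`, `toy_log_probs` and `toy_labels` are hypothetical names and the numbers are invented, so nothing here reflects the fitted classifier.

```python
import math

def log_likelihood_from_probs(log_probs, labels):
    """Sum the log-probability assigned to each row's true class.

    log_probs: rows of [log P(class 0), log P(class 1)]
    labels:    true class of each row, 0 or 1
    """
    return sum(row[label] for row, label in zip(log_probs, labels))

# three made-up rows of predicted probabilities and their true labels
toy_log_probs = [[math.log(0.2), math.log(0.8)],
                 [math.log(0.7), math.log(0.3)],
                 [math.log(0.5), math.log(0.5)]]
toy_labels = [1, 0, 1]

toy_score = log_likelihood_from_probs(toy_log_probs, toy_labels)
print(toy_score)  # equals log(0.8) + log(0.7) + log(0.5)
```

The score is always at most 0, and values closer to 0 mean the model put more probability on the correct outcomes.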

In [472]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import KFold

In [476]:
def cv_score(clf, x, y, score_func, nfold=5):
    result = 0
    for train, test in KFold(y.size, nfold): # split data into train/test groups, 5 times
        clf.fit(x[train], y[train]) # fit
        result += score_func(clf, x[test], y[test]) # evaluate score function on held-out data
    return result / nfold # average
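
`cv_score` above averages a score over folds. The sketch below shows one way an unshuffled k-fold split can partition row indices into contiguous blocks; it is assumed to resemble the default behaviour of the legacy `KFold(n, n_folds)` call, but it is an illustration with hypothetical names (`kfold_indices`, `toy_folds`), not a copy of scikit-learn's implementation.

```python
def kfold_indices(n, n_folds):
    """Yield (train, test) index lists for contiguous, unshuffled folds.

    The first n % n_folds folds get one extra row so that every row
    is used as test data exactly once.
    """
    base, extra = divmod(n, n_folds)
    start = 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

# 7 rows split into 3 folds
toy_folds = list(kfold_indices(7, 3))
for train, test in toy_folds:
    print(train, test)
```

Because folds of this kind follow row order, any ordering in the data (for example by docket number) carries into the folds unless the rows are shuffled first.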

In [487]:
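#search for best alphas and min_dfs
#note: alpha=0 means no smoothing, and CountVectorizer ignores min_df when a
#fixed vocabulary is passed, so the min_df values do not change the features here
alphas = [0, .1, 1, 5]
min_dfs = [1e-5, 1e-3, 1e-1]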

best_alpha = None
best_min_df = None
maxscore=-np.inf
for alpha in alphas:
    for min_df in min_dfs:         
        vectorizer = CountVectorizer(min_df = min_df, vocabulary=adjvocab)       
        Xthis, ythis = make_xy(X_col=X, y_col=y, vectorizer=vectorizer)
        Xtrainthis=Xthis[mask]
        ytrainthis=ythis[mask]
        clf = MultinomialNB(alpha=alpha)
        cvscore = cv_score(clf, Xtrainthis, ytrainthis, log_likelihood)
        if cvscore > maxscore:
            maxscore = cvscore
            best_alpha, best_min_df = alpha, min_df

In [488]:
print "alpha: %f" % best_alpha
print "min_df: %f" % best_min_df

alpha: 1.000000
min_df: 0.000010


In [489]:
#get train and test accuracy
vectorizer = CountVectorizer(min_df = 0.00001, vocabulary=adjvocab)       
Xthis, ythis = make_xy(X_col=X, y_col=y, vectorizer=vectorizer)
Xtrainthis=Xthis[mask]
ytrainthis=ythis[mask]
clf = MultinomialNB(alpha=1.)
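cvscore = cv_score(clf, Xtrainthis, ytrainthis, log_likelihood)
#note: cv_score leaves clf fitted on the training portion of its last fold only,
#so the accuracies below are for that model, not one refit on the full training set
print "accuracy on training set", clf.score(Xtrainthis, ytrainthis)
print "accuracy on testing set", clf.score(Xthis[~mask], ythis[~mask])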

accuracy on training set 0.886676875957
accuracy on testing set 0.55871886121
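
The testing accuracy above (about 0.56) is well below the training accuracy (about 0.89). One way to put the testing figure in context is to compare it with the accuracy of always predicting the most common training label. The sketch below shows that calculation on made-up labels; `majority_baseline_accuracy`, `toy_train` and `toy_test` are hypothetical and are not drawn from `merged.partyWinning`.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return hits / float(len(test_labels))

# toy labels (assumed): 1 is the majority class in the training labels
toy_train = [1, 1, 0, 1, 0, 1]
toy_test = [1, 0, 1, 1]

toy_baseline = majority_baseline_accuracy(toy_train, toy_test)
print(toy_baseline)
```

With the real data, the same function could be called on `y[mask]` and `y[~mask]` (converted to lists) to get a reference point for the classifier's testing accuracy.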
