# CS109 Project - The Court Rules In Favor Of...
## Aidi Adnan Brian John (Team AABJ)

### Abstract
The purpose of this project is to predict votes of Supreme Court justices using oral argument transcripts. Studies in linguistics and psychology, as well as common sense, dictates that the word choices that people make convey crucial information about their beliefs and intentions with regard to issues. Rather than use precedents or formal analysis of the law to predict Supreme Court decisions, we attempt to extract essential emotional features of oral arguments made by justices and advocates in the court. Using aggregate data from 1946 to present

### Data
Oral Argument Transcripts - obtained from http://www.supremecourt.gov/oral_arguments/argument_transcript.aspx. Transcripts are made available on the day of court hearing.
Justice Vote Counts/Case Information - obtained from the Supreme Court Database.

## Data Cleaning and Preparation

In [3]:
import string
import re
import numpy as np
import pandas as pd
import operator
import os
import sys

**TODO**: Brian/Adnan can you fill this in with a description of what you did in parser.py/convert.py
First, we took the all the PDF files of ...

In [2]:
#reads in text file, replace path of "wut.txt" to relevant txt; only processes one text file currently
text_file = open("wut.txt", "r")
text = text_file.read()

We wrote a parser to extract the names of the petitioner and respondant attorneys from the first 2 pages of the converted text document. An example of list of petitioner and respondant speakers, taken from the example case in 2014 of Johnson v United States (docket number 13-7120) which shall be henceforth used as the recurring example in this process book, is:

Katherine M. Menendez, ESQ., Minneapolis, Minn.; on behalf of Petitioner
Michael R. Dreeben, ESQ., Deputy Solicitor General, Department of Justice, Washington D.C.; on behalf of Respondent

In [3]:
def get_petitioners_and_respondents(text):
    '''
    This function takes in input text file as string and outputs 2 lists of speakers speaking for petitioners and
    respondents sides.
    '''
    #get portion of transcript between APPEARANCES and CONTENTS that specifies speakers for petitioners/respondents
    start = text.find('APPEARANCES:') + len('APPEARNACES')
    end = text.find('C O N T E N T S')
    speakers_text = text[start:end]
    split_speakers_text = re.split('\.[ ]*\n', speakers_text)
    #for each speaker, get name (capitalized) and side (Pet/Res) he/she is speaking for
    pet_speakers, res_speakers, other_speakers = [], [], []
    for speaker in split_speakers_text:
        name = speaker.strip().split(',')[0]
        #search for first index of capitalized word (which will be start of speaker name)
        start = 0
        for idx, char in enumerate(name):
            if str.isupper(char):
                start = idx
                break
        #actual name to be appended to correct list
        name = name[start:]
        #print name
        
        #if words Petition, Plaintiff, etc occur in speaker blurb, speaker belongs to Pet
        if any(x in speaker for x in ['etition' , 'ppellant', 'emand', 'evers', 'laintiff']):
            pet_speakers.append(name)
        #otherwise if words Respondent, Defendant, etc occur, speaker belongs to Res
        elif any(x in speaker for x in ['espond' , 'ppellee', 'efendant']):
            res_speakers.append(name)
        #otherwise if neither side is specified in blurb, speaking belongs to Other
        elif 'neither' in speaker:
            other_speakers.append(name)
    return pet_speakers, res_speakers, other_speakers

In [9]:
# For example, for wut.txt, there's 1 petitioner and 2 respondents
pet_speakers, res_speakers, other_speakers = get_petitioners_and_respondents(text)
pet_speakers, res_speakers, other_speakers 

(['MR. H. BARTOW FARR'],
 ['MR. ROY L. REARDON', 'MS. BARBARA D. UNDERWOOD'],
 [])

The general flow of court proceedings is that the Petitioner attornies make their oral argument, followed by the Respondent attornies, before we hear the rebuttal argument of the Petitioners again. Throughout all proceedings, Justices are free to interject with questions and statements of their own. The below function extracts the main argument portion of the oral transcripts, which is the meat of the proceedings that we are interested in conducting analysis on. 

In [5]:
def get_argument_portion(text):
    '''
    This function gets just the argument portion of the text.
    '''
    #start and end defines bounds of argument portion of text
    start = text.find('P R O C E E D')
    end = text.rfind('Whereupon')
    return text[start:end]

In [6]:
argument_portion = get_argument_portion(text)
argument_portion[:500]

"P R O C E E D I N G S\n\n2\n\n[10:13 a.m.]\n\n3\n4\n\nCHIEF JUSTICE REHNQUIST:\n\nWe'll hear argument on\n\nNumber 00-24, PGA Tour, Inc. vs. Casey Martin.\n\n5\n\nORAL ARGUMENT OF H. BARTOW FARR, III\n\n6\n\nON BEHALF OF THE PETITIONER\n\n7\n\nMR. FARR:\n\nMr. Farr?\n\nMr. Chief Justice and may it please\n\n8\n\nthe Court:\n\nThe Ninth Circuit in our view made two\n\n9\n\ncritical mistakes in applying the Disabilities Act to this\n\n10\n\ntype of claim by a professional athlete. First it failed\n\n11\n\nto recognize that Title 3 of the act, "

In [7]:
def count_words(s):
    '''
    This function counts number of proper English words in a string s (not non-words like - or --)
    '''
    s = s.split()
    non_words = ['-', '--']
    return sum([x not in non_words for x in s])

In [47]:
def modify_speaker_names(speakers):
    '''
    This function modifies speaker names like 'QUESTION' to 'QUESTION: ', for word count parsing later on
    '''
    return map(lambda x: x+': ', speakers)

In [33]:
def clean_text(text):
    '''
    This function takes in the portions of text, and gets rid of the \n and the line numbers. 
    '''
    text_arr=text.splitlines()
    text_arr.remove('')
    text_clean=[]
    for each in text_arr:
        if each != '':
            try:
                int(each)
            except ValueError: #assummption: if the item only has integers, it is a line number.
                text_clean.append(each)
    out_text=' '.join(text_clean)
    return out_text

In [34]:
clean_argument=clean_text(argument_portion)
clean_argument[:500]

"P R O C E E D I N G S [10:13 a.m.] CHIEF JUSTICE REHNQUIST: We'll hear argument on Number 00-24, PGA Tour, Inc. vs. Casey Martin. ORAL ARGUMENT OF H. BARTOW FARR, III ON BEHALF OF THE PETITIONER MR. FARR: Mr. Farr? Mr. Chief Justice and may it please the Court: The Ninth Circuit in our view made two critical mistakes in applying the Disabilities Act to this type of claim by a professional athlete. First it failed to recognize that Title 3 of the act, the public accommodations provision, apply on"

In [72]:
def total_wordcount(text):
    '''
    POSSIBLE FEATURE 1:
    This function returns a dictionary with key: name of speaker/justice and value: total number of words they
    spoke in total throughout argument.
    '''
    arg_text = get_argument_portion(text)
    #keeps track of current speaker
    current_speaker = 'N/A'
    clean_argument = clean_text(arg_text)
    
    #clean argument text split by instances where speakers change
    #TODO: cleanup - these should not be hardcorded and instead be result of 
    #modify_speaker_names(pet_speakers + res_speakers + other_speakers)!!!
    #this is currently kept this way cuz of QUESTION: ..........ugh
    split_argument = re.split('(MR. FARR: |QUESTION: |MR. REARDON: |CHIEF JUSTICE REHNQUIST: )', clean_argument)
    all_speakers = ['MR. FARR: ', 'QUESTION: ', 'MR. REARDON: ', 'CHIEF JUSTICE REHNQUIST: ']
    
    #num_words is a dictionary that maps all speaker names to number of words they spoke
    num_words = dict(zip(all_speakers + [current_speaker], [0] * (len(all_speakers)+1)))
    
    #iterate through split argument, accumulating word counts for all speakers
    for s in split_argument:
        #if split chunk signifies change in speaker
        if s in all_speakers:
            current_speaker = s
        #if split chunk is part of speech of current speaker, append to word count
        else:
            num_words[current_speaker] = num_words[current_speaker] + count_words(s)
    
    return num_words

In [73]:
#for example, this gives us total number of words uttered by each speaker
#we just need to find list of all speakers in the form they're referred to in the argument, "JUSTICE SCALIA: " for ex.
total_wordcount(text)

{'CHIEF JUSTICE REHNQUIST: ': 24,
 'MR. FARR: ': 3433,
 'MR. REARDON: ': 1480,
 'N/A': 13,
 'QUESTION: ': 5170}

In [20]:
text_file.close()

# Running classifier

### Logistic Regression

Logistic regression is a natural first choice for a model since our target value can be viewed as a probability between 0 or 1 for any individual justice to vote For or Against, with a higher probability representing a higher confidence of that justice voting in favor of the arguing party. 

In [2]:
from sklearn.linear_model import LogisticRegression

In [None]:
log_model = LogisticRegression(penalty='l2',C=1.0, fit_intercept=True, class_weight='auto')
log_model = LR.fit(X, y)

### Random Forest Classifier

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree 
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm
from sklearn.cross_validation import train_test_split, cross_val_score

In [6]:
# read in SCDB data from file
bigdf=pd.read_csv("supremeCourtDb.csv")

In [8]:
df = bigdf["docketId", "dateDecision", "case"]

Index([u'caseId', u'docketId', u'caseIssuesId', u'voteId', u'dateDecision',
       u'decisionType', u'usCite', u'sctCite', u'ledCite', u'lexisCite',
       u'term', u'naturalCourt', u'chief', u'docket', u'caseName',
       u'dateArgument', u'dateRearg', u'petitioner', u'petitionerState',
       u'respondent', u'respondentState', u'jurisdiction', u'adminAction',
       u'adminActionState', u'threeJudgeFdc', u'caseOrigin',
       u'caseOriginState', u'caseSource', u'caseSourceState',
       u'lcDisagreement', u'certReason', u'lcDisposition',
       u'lcDispositionDirection', u'declarationUncon', u'caseDisposition',
       u'caseDispositionUnusual', u'partyWinning', u'precedentAlteration',
       u'voteUnclear', u'issue', u'issueArea', u'decisionDirection',
       u'decisionDirectionDissent', u'authorityDecision1',
       u'authorityDecision2', u'lawType', u'lawSupp', u'lawMinor',
       u'majOpinWriter', u'majOpinAssigner', u'splitVote', u'majVotes',
       u'minVotes'],
      dtype='obj

### Linear SVM Classifier

In [None]:
svm_model = svm.SVC(C=1.0, kernel='linear', probability=True, class_weight='auto')
svm_model = my_svm.fit(X, y)
svm_pred = svm_fit.predict(W)
# Class probabilities, based on log regression on distance to hyperplane.
svm_prob = svm_fit.predict_proba(W)
svm_dist = svm_fit.decision_function(W)

## 2. Justice Ruling Prediction

We use a different dataset in a slightly different approach to making Supreme Court ruling predictions. This method is motivated by the fact that usually, only 2 justices tend to be swing votes and justice decisions are highly influenced by factors outside of what transpires in court proceedings, such as background information about the case itself. The Supreme Court website contains a Justice-centered database which contains extensive information about each case; in particular, the most pertinent fields we are interested in analyzing are:

1. Decision Year
2. Natural Court
3. Petitioner
4. Respondent
5. Case Origin
6. Case Source
7. Lower Court Disposition Direction
8. Issue Area

Our target value to predict is the field called winningParty (petitioner or respondent), which using our justice-centered approach involves aggregating predicted votes for each individual justice and taking majority vote. The associated confidence of our entire prediction is obtained by averaging individual confidences of our models for each justice.

In [4]:
# read in justice-centered SCDB data from file
newdf=pd.read_csv("SCDB_justice_centered.csv")

  interactivity=interactivity, compiler=compiler, result=result)


In [24]:
# maybe lcDispositionDirection? choose features with continuous/numerical features
# do the numbers mean anything though?
newsmalldf = newdf[["term", "naturalCourt", "petitioner", "respondent", "caseOrigin", "caseSource", "lcDisposition", "issueArea"]]

In [25]:
newsmalldf.head()

Unnamed: 0,term,naturalCourt,petitioner,respondent,caseOrigin,caseSource,lcDisposition,issueArea
0,1946,1301,198,172,51,29,2,8
1,1946,1301,198,172,51,29,2,8
2,1946,1301,198,172,51,29,2,8
3,1946,1301,198,172,51,29,2,8
4,1946,1301,198,172,51,29,2,8


For an intuitive understanding of the features above, check out the documentation here: http://scdb.wustl.edu/documentation.php?var=petitioner. All the above features are categorical instead of continuous (which means the numbers specify a category instead of having a numerical meaning). For an illustrative example, the "petitioner" variable includes:

1. attorney general of the United States, or his office
2. specified state board or department of education
7. state department or agency
etc

#### Advantages of Using Decision Tree Classifiers

Having an intuitive understanding of the meanings behind the variables is important and leads us to our idea of usign the decision tree classifier. A distinct advantage of using decisiontrees is that the decision at each node has an intuitive meaning and corresponds to querying along one feature axis at a time (e.g. is the petitioner an attorney general of the United States?). 

Furthermore, trees are easy to understand and interpret. We can look at the top node and figure out which feature it corresponds to, and conclude that this feature contributes the most information gain, i.e. is the most important/predictive feature. This makes it easy to verify whether our results make intuitive sense.

We will show the process of running decision trees on each justice, before aggregating the votes now.

### 2.1 Justice-Centered Decision Tree Classifiers

Ultimately, the feature that we want to predict is the vote for each justice.

In [29]:
from sklearn import tree

In [36]:
# newdf.majority refers to whether justice voted with the majority (1 for dissent, 2 for majority)
# newdf.partyWinning indicates winning party (0 for responding party, 1 for petitioning party, 2 for unclear)


0         1
1         1
2         1
3         1
4         1
5         1
6         1
7         1
8         1
9         0
10        0
11        0
12        0
13        0
14        0
15        0
16        0
17        0
18        0
19        0
20        0
21        0
22        0
23        0
24        0
25        0
26        0
27        0
28        0
29        0
         ..
114865    1
114866    1
114867    1
114868    0
114869    0
114870    0
114871    0
114872    0
114873    0
114874    0
114875    0
114876    0
114877    1
114878    1
114879    1
114880    1
114881    1
114882    1
114883    1
114884    1
114885    1
114886    0
114887    0
114888    0
114889    0
114890    0
114891    0
114892    0
114893    0
114894    0
Name: partyWinning, dtype: float64

In [None]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit()