# CS109 Project - The Court Rules In Favor Of...
## Aidi Adnan Brian John (Team AABJ)

### Abstract
The purpose of this project is to predict the votes of Supreme Court justices using oral argument transcripts. Studies in linguistics and psychology, as well as common sense, suggest that people's word choices convey crucial information about their beliefs and intentions regarding an issue. Rather than use precedents or formal analysis of the law to predict Supreme Court decisions, we attempt to extract essential emotional features of the oral arguments made by justices and advocates in the court. Using aggregate data from 1946 to present

### Data
Oral Argument Transcripts - obtained from http://www.supremecourt.gov/oral_arguments/argument_transcript.aspx. Transcripts are made available on the day of court hearing.
Justice Vote Counts/Case Information - obtained from the Supreme Court Database.

## Data Cleaning and Preparation

In [1]:
import string
import re
import numpy as np
import pandas as pd
import operator
import os
import sys
import io
import collections

We used a Python script (scraper.py) to scrape the PDFs from the Supreme Court website; we did not upload them to the repository because the rest of our process works on text files. We then used a second script to convert each PDF to a text file, after first removing the last 10 pages, which are reserved for a word index.

In [2]:
# gather all txt files, first get the path to the data directory
# then list the files and filter out all non-txt files
curPath = os.getcwd()
dataPath = curPath + '/data/'
fileList = os.listdir(dataPath)
fileExt = ".txt"
txtFiles = filter(lambda f : f[-4:] == fileExt, fileList)
txtFiles = map(lambda f : dataPath + f, txtFiles)

In [4]:
# reads in a single text file; replace "wut.txt" with the path of the relevant transcript,
# e.g. "data/cut_126, orig_ppl4.txt" (only one text file is processed here)
fil = "wut.txt"
with open(fil, "r") as text_file:
    text = text_file.read()

We wrote a parser to extract the names of the petitioner and respondent attorneys from the first 2 pages of the converted text document. An example list of petitioner and respondent speakers, taken from the 2014 case Johnson v. United States (docket number 13-7120), which is used as the recurring example in this process book, is:

Katherine M. Menendez, ESQ., Minneapolis, Minn.; on behalf of Petitioner
Michael R. Dreeben, ESQ., Deputy Solicitor General, Department of Justice, Washington D.C.; on behalf of Respondent
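
The side-assignment rule used by the parser can be sketched on these two blurbs. This is a minimal, self-contained illustration rather than the parser itself: `side_of` is a hypothetical helper, the keyword fragments are copied from `get_petitioners_and_respondents`, and sending every unmatched blurb to "other" is a simplifying assumption.

```python
# Keyword fragments copied from get_petitioners_and_respondents; the leading
# letter is dropped so that both "Petitioner" and "petitioner" match.
PET_KEYS = ['etition', 'ppellant', 'emand', 'evers', 'laintiff']
RES_KEYS = ['espond', 'ppellee', 'efendant']

def side_of(blurb):
    """Hypothetical helper: return 'pet', 'res' or 'other' for one entry."""
    if any(key in blurb for key in PET_KEYS):
        return 'pet'
    if any(key in blurb for key in RES_KEYS):
        return 'res'
    return 'other'  # simplification: anything unmatched counts as "other"

blurbs = [
    "Katherine M. Menendez, ESQ., Minneapolis, Minn.; on behalf of Petitioner",
    "Michael R. Dreeben, ESQ., Deputy Solicitor General, Department of Justice, "
    "Washington D.C.; on behalf of Respondent",
]

# The speaker name is everything before the first comma.
sides = {blurb.split(',')[0]: side_of(blurb) for blurb in blurbs}
```

Here `sides` maps "Katherine M. Menendez" to `'pet'` and "Michael R. Dreeben" to `'res'`.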

In [129]:
'''
Function
--------
get_petitioners_and_respondents

Inputs
------
text : a transcript in its raw form, without having run cleanTextMaker

Returns
-------
pet_speakers, res_speakers, other_speakers
'''
def get_petitioners_and_respondents(text):
    #get portion of transcript between APPEARANCES and CONTENTS that specifies speakers for petitioners/respondents
    start = text.find('APPEARANCES:') + len('APPEARANCES:')
    end = text.find('C O N T E N T S')
    speakers_text = text[start:end]
    split_speakers_text = re.split(r'\.[ ]*\n', speakers_text)
    #for each speaker, get name (capitalized) and side (Pet/Res) he/she is speaking for
    pet_speakers, res_speakers, other_speakers = [], [], []
    for speaker in split_speakers_text:
        name = speaker.strip().split(',')[0]
        #search for first index of capitalized word (which will be start of speaker name)
        start = 0
        for idx, char in enumerate(name):
            if str.isupper(char):
                start = idx
                break
        #actual name to be appended to correct list
        name = name[start:]
        
        #if words Petition, Plaintiff, etc occur in speaker blurb, speaker belongs to Pet
        if any(x in speaker for x in ['etition' , 'ppellant', 'emand', 'evers', 'laintiff']):
            pet_speakers.append(name)
        #otherwise if words Respondent, Defendant, etc occur, speaker belongs to Res
        elif any(x in speaker for x in ['espond' , 'ppellee', 'efendant']):
            res_speakers.append(name)
        #otherwise if neither side is specified in blurb, speaking belongs to Other
        elif 'neither' in speaker:
            other_speakers.append(name)
    return pet_speakers, res_speakers, other_speakers

In [7]:
# generate a list of regular expressions to split the text on
def generateRES(nameList, plebe):
    """
    plebe is a boolean determining whether or not the list is of
    justices or not
    """
    retList = []
    for name in nameList:
        address = ""
        if plebe:
            words = name.split(' ')
            # first term is the title, last
            # word is the last name
            address = "%s" % (words[-1])
            retList.append(address)
        else:
            address = "JUSTICE %s" % name
            address2 = "CHIEF JUSTICE %s" % name
            retList.append(address)
            retList.append(address2)
    return retList

In [8]:
def getJusticeNames(text):
    index = 0
    retList = []
    while index < len(text):
        index = text.find("JUSTICE", index)
        if index == -1:
            break
        index += 8 # because length of JUSTICE is 7, plus length of the space
        prevIndex = index
        # stop at the colon that ends the speaker label; bail out if none is left
        index = text.find(':', prevIndex)
        if index == -1:
            break
        retList.append(text[prevIndex:index])
    return list(set(retList))

In [9]:
# nms = getJusticeNames(clean_argument)
# generateRES(nms, False)
# print nms

The general flow of court proceedings is that the Petitioner's attorneys make their oral argument, followed by the Respondent's attorneys, after which the Petitioner's attorneys give a rebuttal. Throughout the proceedings, Justices are free to interject with questions and statements of their own. The function below extracts the main argument portion of the oral transcripts, which is the part of the proceedings we want to analyze.

In [10]:
def get_argument_portion(text):
    '''
    This function gets just the argument portion of the text.
    '''
    #start and end defines bounds of argument portion of text
    start = text.find('P R O C E E D I N G S')
    end = text.rfind('Whereupon')
    return text[start:end]

In [11]:
argument_portion = get_argument_portion(text)
argument_portion[:500]

"P R O C E E D I N G S\n\n2\n\n[10:13 a.m.]\n\n3\n4\n\nCHIEF JUSTICE REHNQUIST:\n\nWe'll hear argument on\n\nNumber 00-24, PGA Tour, Inc. vs. Casey Martin.\n\n5\n\nORAL ARGUMENT OF H. BARTOW FARR, III\n\n6\n\nON BEHALF OF THE PETITIONER\n\n7\n\nMR. FARR:\n\nMr. Farr?\n\nMr. Chief Justice and may it please\n\n8\n\nthe Court:\n\nThe Ninth Circuit in our view made two\n\n9\n\ncritical mistakes in applying the Disabilities Act to this\n\n10\n\ntype of claim by a professional athlete. First it failed\n\n11\n\nto recognize that Title 3 of the act, "

In [12]:
def count_words(s):
    '''
    This function counts number of proper English words in a string s (not non-words like - or --)
    '''
    s = s.split()
    non_words = ['-', '--']
    return sum([x not in non_words for x in s])

In [13]:
def modify_speaker_names(speakers):
    '''
    This function modifies speaker names like 'QUESTION' to 'QUESTION: ', for word count parsing later on
    '''
    return map(lambda x: x+': ', speakers)

In [14]:
def cleanTextMaker(text):
    '''
    This function takes in the portions of text, and gets rid of the \n and the line numbers. 
    '''
    text_arr=text.splitlines()
    text_clean=[]
    for each in text_arr:
        if each != '':
            try:
                int(each)
            except ValueError: #assumption: an item made up only of digits is a line number, so only non-integer items are kept
                text_clean.append(each)
    out_text=' '.join(text_clean)
    return out_text

In [15]:
cleanText=cleanTextMaker(argument_portion)
cleanText[:500]

"P R O C E E D I N G S [10:13 a.m.] CHIEF JUSTICE REHNQUIST: We'll hear argument on Number 00-24, PGA Tour, Inc. vs. Casey Martin. ORAL ARGUMENT OF H. BARTOW FARR, III ON BEHALF OF THE PETITIONER MR. FARR: Mr. Farr? Mr. Chief Justice and may it please the Court: The Ninth Circuit in our view made two critical mistakes in applying the Disabilities Act to this type of claim by a professional athlete. First it failed to recognize that Title 3 of the act, the public accommodations provision, apply on"

In [16]:
def total_wordcount(text):
    '''
    POSSIBLE FEATURE 1:
    This function returns a dictionary with key: name of speaker/justice and value: total number of words they
    spoke throughout the argument.
    '''
    arg_text = get_argument_portion(text)
    #keeps track of current speaker
    current_speaker = 'N/A'
    clean_argument = cleanTextMaker(arg_text)
    
    #clean argument text split by instances where speakers change

    # first get the names of the judges and speakers
    pet_speakers, res_speakers, other_speakers = get_petitioners_and_respondents(text)
    justiceLeague = getJusticeNames(clean_argument)
    # create the regular expression for the justices and the plebes
    # need to also add the justice speaker
    JLList = generateRES(justiceLeague, False)
    plebeList = pet_speakers + res_speakers + other_speakers
    plebeRE = generateRES(plebeList, True)
    finREList = ["QUESTION"]
    finREList += plebeRE + JLList
    
    finREList = map(lambda name : name + ":", finREList)
    RE = '('  + '|'.join(finREList) + ')'
    
    split_argument = re.split(RE, clean_argument)
    all_speakers = finREList
    
    #num_words is a dictionary that maps all speaker names to number of words they spoke
    num_words = dict(zip(all_speakers + [current_speaker], [0] * (len(all_speakers)+1)))
    
    #iterate through split argument, accumulating word counts for all speakers
    for s in split_argument:
        #if split chunk signifies change in speaker
        if s in all_speakers:
            current_speaker = s
        #if split chunk is part of speech of current speaker, append to word count
        else:
            num_words[current_speaker] = num_words[current_speaker] + count_words(s)
    
    return num_words

In [17]:
#for example, this gives us total number of words uttered by each speaker
#we just need to find list of all speakers in the form they're referred to in the argument, "JUSTICE SCALIA: " for ex.
total_wordcount(text)

{'CHIEF JUSTICE REHNQUIST:': 25,
 'FARR:': 3433,
 'JUSTICE REHNQUIST:': 0,
 'N/A': 13,
 'QUESTION:': 4118,
 'REARDON:': 1480,
 'UNDERWOOD:': 1129}
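
The counts above depend on how `re.split` behaves when the pattern is wrapped in a capture group: the speaker labels are kept as their own chunks, so the loop can tell who is speaking. A minimal sketch on a made-up fragment (the sample text and the variable names are illustrative assumptions, not project data):

```python
import re

# Toy transcript fragment (hypothetical, not taken from the data set).
sample = ("MR. FARR: May it please the Court. QUESTION: Why? "
          "MR. FARR: Because -- it does.")
labels = ["FARR:", "QUESTION:"]
pattern = '(' + '|'.join(re.escape(label) for label in labels) + ')'

# The capture group makes re.split return each label as a separate chunk.
chunks = re.split(pattern, sample)

counts = dict.fromkeys(labels + ['N/A'], 0)
current = 'N/A'  # text before the first label belongs to no known speaker
for chunk in chunks:
    if chunk in labels:
        current = chunk
    else:
        # skip dash tokens, mirroring count_words
        counts[current] += sum(tok not in ('-', '--') for tok in chunk.split())
```

Because only the last name is used as the label, a title such as "MR." is counted for whoever spoke just before it. The same quirk would account for part of the small `'N/A'` count in the real output.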

In [18]:
def wordCounter(text):
    """
    counts number of times each word appears in a file
    
    returns a dictionary of word : times it appears
    """
    wordCount={}
    for word in text.split():
        # unfortunately, isalpha does discount some real words
        # like those with apostrophes, and words with question
        # marks at the end of them
        if word.lower() not in wordCount and word.isalpha():
            wordCount[word.lower()] = 1
        elif word.isalpha():
            wordCount[word.lower()] += 1
    return wordCount

In [19]:
def topWords(diction, num, verbose=False):
    """
    returns the top num words in a dictionary
    dictionary is expected to be of the format word : count
    """
    d = collections.Counter(diction)
    if verbose:
        for k, v in d.most_common(num):
            print '%s: %i' % (k, v)
    return d.most_common(num)

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

In [21]:
# credit to:
# http://stackoverflow.com/questions/15173225/how-to-calculate-cosine-similarity-given-2-sentence-strings-python
def getCosine(vec1, vec2):
    """
    cosine similarity is used to calculate the similarity index between two vectors
    """
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = np.sqrt(sum1) * np.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def textToVector(text):
    """
    turn a text into a vector of word counts
    """
    WORD = re.compile(r'\w+')
    words = WORD.findall(text)
    return collections.Counter(words)

In [22]:
def textCompiler(text):
    """
    cleanly puts together all of the functions we have written to return a text ready for
    classification
    """
    txt = get_argument_portion(text)
    txt = cleanTextMaker(txt)
    txt = txt.replace('.', '\n').split('\n')
    return txt

In [23]:
def getData():
    """
    opens and cleans all data, putting it in a list
    uses txtFiles, which is declared at the beginning of
    the ipynb
    """
    retList = []
    for File in txtFiles:
        cur = open(File)
        textual = cur.read()
        cleanTextual = textCompiler(textual)
        retList.append(cleanTextual)
        cur.close()
    return retList

In [24]:
# splits along 1 dimension deterministically
def splitData(X, fraction_train=9.0 / 10.0):
    end_train = int(len(X) * fraction_train)
    X_train = X[0:end_train]
    X_test = X[end_train:]
    return X_train, X_test

# def splitTrainTest(X, Y, fraction_train = 9.0 / 10.0):
#     X_train, X_test = splitData(X, fraction_train)
#     Y_train, Y_test = splitData(Y, fraction_train)
#     assert(X_train.shape[0] == Y_train.shape[0])
#     assert(X_test.shape[0] == Y_test.shape[0])
#     return X_train, Y_train, X_test, Y_test

In [25]:
fullBody = getData()

In [26]:
# splits data. note that tf-idf is a measure of the relative importance of a word in a document
# based on its frequency there and its inverse frequency across all documents
trainBody, testBody = splitData(fullBody)
vectorizer = TfidfVectorizer(min_df=1, norm='l2', use_idf=True, stop_words='english')

In [27]:
def textToMat(vizer, docList):
    """
    turns documents into a tfidf matrix
    Parameters:
        vizer is a vectorizer of type TfidfVectorizer
        docList is a list of (ideally) preprocessed documents
    """
    retList = []
    for doc in docList:
        resMat = vizer.transform(doc).todense()
        retList.append(resMat)
    return retList

In [28]:
# print vectorizer.vocabulary_
# print mat.todense()#[:,942]

## Pre-processing

#### Text representation

Document representation is the first step of our analysis, since there are a variety of ways to represent a transcript, which in its raw form is a simple string of text. We use a pre-processing technique that reduces the complexity of the documents and makes them easier to handle: we transform each oral transcript from its full text version into a document vector, stored as a row of a sparse matrix. Every text document is represented as a vector of term weights (word features) over a set of terms (the dictionary), where each term in the dictionary occurs in at least a certain minimum number of documents.
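
As a rough illustration of this representation, the sketch below builds a small dictionary and one count vector per document using only the standard library. The three toy documents and the `MIN_DOCS` threshold are assumptions made for the example; the project itself relies on `TfidfVectorizer` for this step.

```python
from collections import Counter

docs = [
    "the court hears the argument",
    "the petitioner makes an argument",
    "the respondent answers the court",
]
MIN_DOCS = 2  # hypothetical threshold: keep terms found in at least 2 documents

tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
vocabulary = sorted(term for term, df in doc_freq.items() if df >= MIN_DOCS)

# One row of term counts per document, columns ordered as in `vocabulary`.
vectors = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]
```

With these inputs the dictionary shrinks to `['argument', 'court', 'the']`, and each document becomes a three-element vector.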

#### High dimensionality of text representation

A major characteristic of document classification problems is the extremely high dimensionality of the data: the number of potential features often exceeds the number of training documents. Dimensionality reduction is thus critical for efficient data manipulation. Irrelevant and redundant features often degrade the performance of classification algorithms in both accuracy and speed, and they make the all-too-common trap of overfitting more likely.

#### Data cleaning

Pre-processing of text data involves tokenization of raw text, stop word removal, stemming, and eliminating language-dependent factors as far as possible. Brief explanations of these preprocessing stages are as follows:

1. Sentence splitting: identifying sentence boundaries in documents
2. Tokenization: partitioning documents that are initially treated as a string into a list of tokens
3. Stop words removal: removing common English words like "the", "a", etc.
4. Stemming: reducing derived words to their root form, for example happiest -> happy
5. Noisy data: cleaning noise carried over from the PDF-to-text conversion, such as line numbers and page breaks
6. Text representation: determining whether we should use words, phrases or entire sentences as a "token" for analysis
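
The first four stages can be sketched end to end as follows. This is an illustration under stated assumptions: the stop word list is a tiny hand-picked sample, and `toy_stem` is a hypothetical suffix stripper standing in for the Snowball stemmer used elsewhere in this notebook.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and", "this"}  # tiny sample
SUFFIXES = ("ing", "ed", "s")  # toy stemmer only; not the Snowball algorithm

def split_sentences(text):
    # Naive boundary rule: split after ., ? or ! followed by whitespace.
    return [s for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s]

def tokenize(sentence):
    # Lowercase alphabetic tokens, keeping internal apostrophes.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())

def toy_stem(token):
    # Strip the first matching suffix, but leave at least three characters.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    return [
        [toy_stem(tok) for tok in tokenize(sentence) if tok not in STOP_WORDS]
        for sentence in split_sentences(text)
    ]

result = preprocess("The Court reviewed the arguments. "
                    "Is the statute limited in this case?")
```

`result` holds one token list per sentence: `[['court', 'review', 'argument'], ['statute', 'limit', 'case']]`.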

### Feature extraction vs. feature selection

After feature extraction, feature selection was conducted to construct a vector space of appropriate dimensionality, which improves the scalability, efficiency and accuracy of our classification algorithm. The main idea of feature selection is to choose a subset of features from the original texts, the subset being determined by keeping the features that score highest under some predetermined measure of feature importance.

We attempt two different approaches to feature selection in our analysis:
1. Wrappers: these evaluate candidate feature subsets by training and scoring the classification algorithm itself on each subset, which ties the selection to the classifier and makes it computationally expensive.
2. Filters: as opposed to wrappers, filters can be run independently of the actual classification algorithm, and hence are less computationally expensive. Filters use an evaluation metric that measures the ability of a feature to differentiate between classes, and so choose the most discriminative and valuable features. The filter of our choice is a technique called term frequency-inverse document frequency, as shown below.
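
To make the filter idea concrete, the sketch below scores every term without training any classifier and keeps the top-scoring ones. The scoring rule (the gap between the share of documents containing the term in each class) is a deliberately simple stand-in chosen for illustration, and the labelled toy documents, `filter_scores` and `TOP_K` are all hypothetical.

```python
from collections import Counter

# Hypothetical labelled toy data: 1 = petitioner won, 0 = petitioner lost.
docs = [
    (["statute", "clear", "text"], 1),
    (["statute", "clear", "history"], 1),
    (["statute", "ambiguous", "history"], 0),
    (["statute", "ambiguous", "text"], 0),
]

def filter_scores(labelled_docs):
    """Score = |share of class-1 docs with the term - share of class-0 docs|."""
    n_docs = Counter(label for _, label in labelled_docs)
    df = {0: Counter(), 1: Counter()}
    for tokens, label in labelled_docs:
        df[label].update(set(tokens))
    terms = set(df[0]) | set(df[1])
    return {t: abs(df[1][t] / float(n_docs[1]) - df[0][t] / float(n_docs[0]))
            for t in terms}

scores = filter_scores(docs)
TOP_K = 2
# Highest score first; ties are broken alphabetically for a stable result.
selected = sorted(scores, key=lambda t: (-scores[t], t))[:TOP_K]
```

Terms that appear equally often in both classes ("statute", "text", "history") score 0, so only "ambiguous" and "clear" are selected.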

### Term Frequency-Inverse Document Frequency (TF-IDF)

Term frequency-inverse document frequency (tf-idf) is a widely used method for evaluating how important a word is to a document: a word scores highly when it is frequent in that document but rare across the rest of the collection, so the score captures the relative relevance among words.

Methodology:
It converts the textual representation of information into a Vector-Space Model, stored as a sparse matrix.
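
A small worked example may help. The sketch below uses the textbook weighting, tf-idf = (count / document length) * log(N / document frequency), on three made-up documents. It is not the exact formula used by scikit-learn's `TfidfVectorizer`, which by default smooths the idf term and L2-normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

docs = [
    "court statute court",
    "court petitioner",
    "court respondent",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Number of documents in which each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Textbook weighting: (count / doc length) * log(N / document frequency)."""
    counts = Counter(tokens)
    return {term: (count / float(len(tokens)))
                  * math.log(n_docs / float(doc_freq[term]))
            for term, count in counts.items()}

weights = [tf_idf(tokens) for tokens in tokenized]
```

"court" appears in every document, so its weight is 0 everywhere, while "statute" appears in only one document and receives the weight log(3) / 3 there.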

In [28]:
'''
Returns list of all uncleaned transcripts in raw text form.
'''
def getAllRawText(txtFiles):
    return [(open(txtFiles[i]).read()) for i in range(len(txtFiles))]

In [29]:
'''
Returns list of all cleaned transcripts.
'''
def getAllRawCleanText(txtFiles):
    return [(cleanTextMaker(open(txtFiles[i]).read())) for i in range(len(txtFiles))]

In [30]:
allRawText = getAllRawText(txtFiles)
allRawCleanText = getAllRawCleanText(txtFiles)

In [31]:
def toColloquialName(formal_name):
    ret = formal_name.split()
    return ret[-1]

In [33]:
def getBagOfWords():
    """
    Gets bag of words dictionary for every document in txtFiles
    bag of words dictionary consists of word : number of times word appears in transcript
    """
    retList = []
    for File in txtFiles:
        cur = open(File)
        textual = cur.read()
        cleanTextual = wordCounter(textual)
        retList.append(cleanTextual)
        cur.close()
    return retList

In [34]:
# bag-of-words counts for every transcript (not yet tf-idf weighted)
bow = getBagOfWords()

In [36]:
from sklearn.utils import as_float_array
from sklearn.utils.extmath import randomized_svd
import snowballstemmer

In [166]:
def destem(allRawText):
    """
    INPUT:
        allRawText: a list of documents, each a single string
    FUNCTION:
        stems all words from a list of documents
        documents are assumed to be stored in Latin1 encoding
        there is one document that is not tractable so we exclude it
        uses snowballstemmer
        required to decode string to avoid UnicodeDecodeErrors
    """
    stemmer = snowballstemmer.stemmer('english')
    stemmedList = []
    for text in allRawText:
        try:
            temp = stemmer.stemWords(text.split())
            for i in xrange(len(temp)):
                # Python 2: byte strings must be decoded before joining with unicode
                if isinstance(temp[i], str):
                    temp[i] = temp[i].decode('Latin1')
            res = ' '.join(temp)
            stemmedList.append(res)
        except UnicodeDecodeError:
            # skip the one document that cannot be decoded
            pass
    return stemmedList

In [155]:
def getDocketNo(text):
    '''
    Input: raw text
    '''
    cleantext = cleanTextMaker(text)
    docketIdx = cleantext.find("No.")
    return cleantext[docketIdx+4:].split()[0]

In [151]:
def splitTextPetRes(text):
    '''
    Input: raw text
    '''
    arg_text = get_argument_portion(text)
    #keeps track of current speaker
    current_speaker = 'N/A'
    clean_argument = cleanTextMaker(arg_text)

    # first get the names of the judges and speakers
    pet_speakers, res_speakers, _ = get_petitioners_and_respondents(text)
    
    # create the regular expression for the justices and the plebes
    petList = generateRES(pet_speakers, True)
    resList = generateRES(res_speakers, True)
    petList = map(lambda name : name + ":", petList)
    resList = map(lambda name : name + ":", resList)
    all_speakers = (petList + resList)
    
    RE = '('  + '|'.join(all_speakers) + ')'
    
    # split argument portion wherever a speaker label from all_speakers (e.g. FARR:) appears
    split_argument = re.split(RE, clean_argument)
    
    # dictionary keyed by speaker, with value actual speech (in string format)
    speech = dict(zip(all_speakers + [current_speaker], [""] * (len(all_speakers)+1)))
    
    #iterate through split argument, accumulating speeches for all speakers
    for s in split_argument:
        if s in all_speakers:
            current_speaker = s
        #if split chunk is part of speech of current speaker, append to that speaker's speech
        else:
            speech[current_speaker] += s

    #combine all pet and res speakers, if multiple
    retDict = {"resSpeakers":"", "petSpeakers":""}
    
    for rSpeaker in resList:
        retDict["resSpeakers"] += speech[rSpeaker]
    for pSpeaker in petList:
        retDict["petSpeakers"] += speech[pSpeaker]

    return speech, retDict

In [161]:
allPetSpeeches = []
allResSpeeches = []
allDocketNo = []
for text in allRawText:
    speech, retDict = splitTextPetRes(text)
    petSpeech = retDict["petSpeakers"]
    resSpeech = retDict["resSpeakers"]
    # if either petSpeech or resSpeech is an empty string, do not add to workable dataset
    if petSpeech and resSpeech:
        allPetSpeeches.append(petSpeech)
        allResSpeeches.append(resSpeech)
        allDocketNo.append(getDocketNo(text))

In [162]:
# 58 oral transcripts have empty petitioner or respondent speeches due to dirty scraping of pdf files
# for example, get_petitioners_and_respondents sometimes does not scrape properly due to bad formatting
len(allPetSpeeches), len(allResSpeeches), len(allDocketNo)

(1058, 1058, 1058)

In [167]:
# takes long to run
# TODO: pickle to file
allDestemmedPetSpeeches = destem(allPetSpeeches)
# allDestemmedResSpeeches = destem(allResSpeeches)

In [299]:
def runSVD(documentList, vizer, numComponents=25):
    """
    takes a list of documents and a vectorizer
    converts document list to a document-by-word matrix of weights
        (as determined by the vectorizer)
    takes matrix and runs truncated SVD on it to generate
    a matrix of how strongly each theme appears in each document (T),
    the overall importance of each theme (S),
    and a matrix of how important each word is to each theme (DT)
    
    this code is partially derived from sklearn's
    truncated SVD implementation (which doesn't return
    all of the matrices we are interested in)
    """
    mat = vizer.fit_transform(documentList)
    X = as_float_array(mat, copy=False)
    # T is the document-by-concept matrix
    # S holds the singular values, one per concept
    # DT is the concept-by-term matrix
    T, S, DT = randomized_svd(X, numComponents, n_iter=50)
    return T, S, DT

In [222]:
merged[merged["docket"] == '13-1499']

Unnamed: 0,docket,full_text,caseId,docketId,caseIssuesId,voteId,dateDecision,decisionType,usCite,sctCite,...,authorityDecision1,authorityDecision2,lawType,lawSupp,lawMinor,majOpinWriter,majOpinAssigner,splitVote,majVotes,minVotes
828,13-1499,"[P R O C E E D I N G S (11:15 a, m, ) CHIE...",2014-055,2014-055-01,2014-055-01-01,2014-055-01-01-01,4/29/15,1,,135 S. Ct. 1656,...,2,,2,200,,111,111,1,5,4


In [227]:
merged[merged["docket"] == '12-9012']

Unnamed: 0,docket,full_text,caseId,docketId,caseIssuesId,voteId,dateDecision,decisionType,usCite,sctCite,...,authorityDecision1,authorityDecision2,lawType,lawSupp,lawMinor,majOpinWriter,majOpinAssigner,splitVote,majVotes,minVotes


In [220]:
allDocketNo[900:1000]

['129012',
 '12929',
 '12930',
 '129490',
 '12-96',
 '12-98',
 '12992',
 '1299',
 '126,',
 '128,',
 '129',
 '131010',
 '131019',
 '131032',
 '131034',
 '131041',
 '131067',
 '131074',
 '131075',
 '131080',
 '13115',
 '131174',
 '131175',
 '131211',
 '131314',
 '13132',
 '131333',
 '131339',
 '131352',
 '131371',
 '131402',
 '131421',
 '131428',
 '131433',
 '131487',
 '131499',
 '13193',
 '13212',
 '13271',
 '13298',
 '13299',
 '13301',
 '13316',
 '13317',
 '13339',
 '13352',
 '13369',
 '13433',
 '13435',
 '13461',
 '13483',
 '13485',
 '13502',
 '13517',
 '13534',
 '13550',
 '13553',
 '13604',
 '13628.',
 '136827',
 '13684',
 '137120',
 '13719',
 '13720',
 '137211',
 '137451',
 '13854.',
 '13894',
 '13895',
 '13896',
 '139026',
 '13935',
 '13975.',
 '13983',
 '139972',
 '132,',
 '137,',
 '138,',
 '14103',
 '141146',
 '14114.',
 '14116',
 '14144',
 '1415',
 '14275',
 '14280',
 '14361',
 '14378',
 '14400',
 '14419',
 '14449',
 '14452',
 '14462',
 '1446',
 '14520',
 '14556',
 '14556',
 '14
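
Many of the docket numbers listed above have lost their hyphen or carry trailing punctuation, which would explain why a lookup such as `'12-9012'` finds no match for the extracted `'129012'`. One possible clean-up is sketched below. `normalize_docket` is a hypothetical helper, and the rule "the first two digits are the term" is an assumption based on dockets such as `13-1499`; short tokens like `'126,'` are only stripped, not repaired, because they cannot be resolved this way.

```python
import re

def normalize_docket(raw):
    """Hypothetical clean-up: strip stray punctuation and restore 'YY-NNN'."""
    token = raw.strip().rstrip('.,;')
    if re.match(r'^\d{2}-\d+$', token):
        return token                        # already hyphenated
    if re.match(r'^\d{4,}$', token):
        return token[:2] + '-' + token[2:]  # assume a two-digit term prefix
    return token                            # too ambiguous to repair; leave as-is

cleaned = [normalize_docket(d) for d in ['129012', '12-96', '13628.', '126,']]
```

`cleaned` is `['12-9012', '12-96', '13-628', '126']`. Any repaired value should still be checked against the SCDB `docket` column before it is trusted.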

In [216]:
allDecisions = []
for docketNo in allDocketNo:
    row = merged[merged["docket"] == docketNo]
    if len(row) > 0:
        allDecisions.append(row.partyWinning.values[0])
    else:
        continue  # no SCDB row matches this docket number, so skip it

In [217]:
len(allDecisions)

753

In [200]:
docketNo = allDocketNo

In [202]:
docketNo = '00-1011'

In [None]:
# transforms a raw transcript into the combined petitioner speech
def textToPet(text):
    speech, retDict = splitTextPetRes(text)
    return retDict["petSpeakers"]

In [199]:
# we have merged, allDocketNo; we need to line these up



In [190]:
# allDocketNo -> allDecisions
map(lambda x: merged[merged["docket"] == x], allDocketNo)

1057

In [248]:
merged.tail()

Unnamed: 0,docket,full_text,caseId,docketId,caseIssuesId,voteId,dateDecision,decisionType,usCite,sctCite,...,authorityDecision1,authorityDecision2,lawType,lawSupp,lawMinor,majOpinWriter,majOpinAssigner,splitVote,majVotes,minVotes
929,99-901,"[P R O C E E D I N G S (11:06 a, m, ) CHIEF JU...",2000-018,2000-018-01,2000-018-01-01,2000-018-01-01-01,2/20/01,1,531 U.S. 288,121 S. Ct. 924,...,2,,2,230,,107,103,1,5,4
930,99-9073,"[P R O C E E D I N G S (11:03 a, m, ) CHIEF JU...",2000-033,2000-033-01,2000-033-01-01,2000-033-01-01-01,3/20/01,1,532 U.S. 59,121 S. Ct. 1276,...,4,,6,600,18 U.S.C.,110,102,1,9,0
931,99-9136,"[P R O C E E D I N G S (10:03 a, m, ) CHIEF JU...",2000-045,2000-045-01,2000-045-01-01,2000-045-01-01-01,4/25/01,1,532 U.S. 374,121 S. Ct. 1578,...,4,,3,341,,104,102,1,5,4
932,99-929,"[P R O C E E D I N G S [10:59 a, m, ] CHIEF JU...",2000-027,2000-027-01,2000-027-01-01,2000-027-01-01-01,2/28/01,1,531 U.S. 510,121 S. Ct. 1029,...,2,,2,223,,103,103,1,9,0
933,99-936,"[P R O C E E D I N G S (10:03 a, m, ) CHIEF JU...",2000-034,2000-034-01,2000-034-01-01,2000-034-01-01-01,3/21/01,1,532 U.S. 67,121 S. Ct. 1281,...,2,,2,205,,103,103,1,6,3


In [301]:
# produce vectorizers that will calculate the importance of words
vectorizer1 = TfidfVectorizer(min_df=1, norm='l2', use_idf=True, stop_words='english', encoding='Latin1', analyzer='word', token_pattern=r'\w+')
vectorizer2 = TfidfVectorizer(min_df=1, norm='l2', use_idf=True, stop_words='english', encoding='Latin1', analyzer='word', token_pattern=r'\w+')
tPet, sPet, dTPet = runSVD(allDestemmedPetSpeeches, vectorizer, numComponents=25)

In [302]:
tPet.shape, sPet.shape, dTPet.shape

((1115, 25), (25,), (25, 44282))

In [168]:
# TODO: take tPet and docket no. and generate target/result value
# import SCDB csv on case-centered data
caseCentered = pd.read_csv("SCDB_caseCentered_Docket.csv")

In [242]:
def get_file_dict(fileList, fileExt='.txt'):
    '''
    This function takes the fileList and returns a list of dictionaries of the format 
    {'docket': docket_number, 'full_text': full_text}
    '''
    fileDict=[]
    fields=['docket', 'full_text']
    txtFiles_filter = filter(lambda f : f[-4:] == fileExt, fileList)
    for each in txtFiles_filter:
        name_str=each[4:-4]
        try:
            indexx=name_str.index('_')
            docketNum=name_str[:indexx]
        except ValueError:
            docketNum=name_str
        cur = open(dataPath+each)
        textual = cur.read()
        cleanTextual = textCompiler(textual) # list of sentence fragments, see textCompiler above
        cur.close()
        tuple_=(docketNum, cleanTextual)
        fileDict.append(dict(zip(fields, tuple_)))
    return fileDict

In [243]:
fileDict=get_file_dict(fileList)

In [251]:
fileDict

KeyboardInterrupt: 

In [244]:
# fileDict=get_file_dict(fileList)
txtdf = pd.DataFrame(fileDict)
casedf = pd.read_csv('supremeCourtDb.csv')
merged = pd.merge(left=txtdf, right=casedf, how='inner', left_on='docket', right_on='docket')
print merged.shape

(934, 54)


In [261]:
# allPetSpeeches
# allResSpeeches
# allDocketNo
fulltext = merged.full_text[0]

In [263]:
fulltext

['P R O C E E D I N G S (10:17 a',
 'm',
 ") CHIEF JUSTICE REHNQUIST: We'll hear argument now in Number 00-1011, Deboris Calcano-Martinez v",
 ' The Immigration and Naturalization Service',
 ' Mr',
 ' Guttentag',
 ' ORAL ARGUMENT OF LUCAS GUTTENTAG ON BEHALF OF THE PETITIONER MR',
 ' GUTTENTAG: Mr',
 ' Chief Justice, and may it please the Court: The jurisdictional issue presented in this case is whether a legal ruling by the Attorney General on a pure question of law compelling the deportation of long- time legal permanent residents is reviewable in any court',
 " Never in our country's history has an alien been subject to deportation without the judicial branch determining the legal validity of the administrative deportation order",
 " We submit that the Constitution does not permit denying judicial scrutiny of the Attorney General's ruling, and that the statute did not deprive this Court and the district court and the courts of appeals of considering the pure question of law presente

In [262]:
speech, retDict = splitTextPetRes(fulltext)

AttributeError: 'list' object has no attribute 'find'

In [None]:
runSVD(allDestemmedPetSpeeches, vectorizer1)

In [104]:
# run logistic regression on D x numTopics matrix of independent variables, vs. 0/1 result vector
from sklearn.linear_model import LogisticRegression

In [None]:
clf = LogisticRegression()

In [105]:
"""
Function
--------
cv_optimize

Inputs
------
clf : an instance of a scikit-learn classifier
parameters: a parameter grid dictionary that is passed to GridSearchCV
X: a samples-features matrix in the scikit-learn style
y: the response vectors of 1s and 0s (+ives and -ives)
n_folds: the number of cross-validation folds (default 5)
score_func: a score function we might want to pass (default python None)

Returns
-------
The best estimator from the GridSearchCV, after the GridSearchCV has been used to
fit the model.
"""
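try:
    from sklearn.model_selection import GridSearchCV  # scikit-learn >= 0.18
except ImportError:
    from sklearn.grid_search import GridSearchCV  # older releases

def cv_optimize(clf, parameters, X, y, n_folds=5, score_func=None):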
    clf = GridSearchCV(clf, param_grid=parameters, scoring=score_func, cv=n_folds)
    clf.fit(X,y)
    return clf.best_estimator_
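As a hedged usage sketch, the helper can be exercised on synthetic data. The toy matrix, the labels, and the grid of `C` values below are assumptions made up for illustration, and the import path assumes scikit-learn 0.18 or later.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def cv_optimize(clf, parameters, X, y, n_folds=5, score_func=None):
    # Grid-search the given parameters with n_folds-fold cross-validation
    gs = GridSearchCV(clf, param_grid=parameters, scoring=score_func, cv=n_folds)
    gs.fit(X, y)
    return gs.best_estimator_

# Synthetic data (assumption): the label depends mostly on the first feature
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

grid = {"C": [0.01, 0.1, 1.0, 10.0]}
best = cv_optimize(LogisticRegression(), grid, X_toy, y_toy)
print(best.C, best.score(X_toy, y_toy))
```

The returned estimator is already refit on all of `X_toy`, so it can be used for prediction directly.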

In [None]:
clflog, Xtrain, ytrain, Xtest, ytest = do_classify(LogisticRegression(penalty="l1"), {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, dftouse, lcols, u'RESP', 1, reuse_split=reuse_split)

# Running classifier

### Logistic Regression

Logistic regression is a natural first choice for a model, since our target can be viewed as a probability between 0 and 1 that an individual justice votes for or against, with a higher probability representing higher confidence that the justice votes in favor of the arguing party.
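To make the probability interpretation concrete, here is a minimal sketch of how a fitted logistic model turns a weighted sum of features into a vote probability. The weights, features, and the 0.5 threshold are hypothetical values chosen for illustration, not fitted results.

```python
import math

def vote_probability(weights, features, intercept=0.0):
    # Linear score passed through the logistic (sigmoid) function
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights and features for one justice in one case
p = vote_probability([1.2, -0.7], [1.0, 0.5], intercept=-0.1)

# Threshold at 0.5 to turn the probability into a predicted vote
predicted_vote = 1 if p >= 0.5 else 0
print(round(p, 3), predicted_vote)
```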

In [12]:
from sklearn.linear_model import LogisticRegression


caseId  
docketId   
caseIssuesId  
voteId  
dateDecision  
decisionType  
usCite  
sctCite  
ledCite  
lexisCite  
term   
naturalCourt  
chief  
docket   
caseName  
dateArgument  
dateRearg  
petitioner   
petitionerState  
respondent  
respondentState  
jurisdiction  
adminAction  
adminActionState  
threeJudgeFdc  
caseOrigin  
caseOriginState  
caseSource  
caseSourceState  
lcDisagreement  
certReason  
lcDisposition  
lcDispositionDirection  
declarationUncon  
caseDisposition   
caseDispositionUnusual  
partyWinning    
precedentAlteration  
voteUnclear    
issue  
issueArea  
decisionDirection  
decisionDirectionDissent  
authorityDecision1  
authorityDecision2  
lawType  
lawSupp  
lawMinor  
majOpinWriter  
majOpinAssigner  
splitVote  
majVotes  
minVotes 

issue: This variable identifies the issue for each decision. Although criteria for the identification of issues are hard to articulate, the focus here is on the subject matter of the controversy (e.g., sex discrimination, state tax, affirmative action) rather than its legal basis (e.g., the equal protection clause) 

issueArea: This variable simply separates the issues identified in the preceding variable (issue) into the following larger categories: criminal procedure (issues 10010-10600), civil rights (issues 20010-20410), First Amendment (issues 30010-30020), etc 

decisionDirection: In order to determine whether the Court supports or opposes the issue to which the case pertains, this variable codes the ideological "direction" of the decision. 
    An outcome is liberal (=2) or conservative (=1)

dateDecision: This variable contains the year, month, and day that the Court announced its decision in the case.

justiceName: This is a string variable indicating the first initial for the five justices with a common surname (Harlan, Johnson, Marshall, Roberts, and White) and last name of each justice.

chief: This variable identifies the chief justice during whose tenure the case was decided.

caseSource: This variable identifies the court whose decision the Supreme Court reviewed. If the case originated in the same court whose decision the Supreme Court reviewed, the entry in the caseOrigin should be the same as here. This variable has no entry if the case arose under the Supreme Court's original jurisdiction. 

certReason: This variable provides the reason, if any, that the Court gives for granting the petition for certiorari. 
 

In [587]:
bigdf=pd.read_csv("SCDB_2015_01_justiceCentered_Citation.csv")

In [589]:
bigdf['issueArea']

0        8
1        8
2        8
3        8
4        8
5        8
6        8
7        8
8        8
9        1
10       1
11       1
12       1
13       1
14       1
15       1
16       1
17       1
18       8
19       8
20       8
21       8
22       8
23       8
24       8
25       8
26       8
27       2
28       2
29       2
        ..
77312    9
77313    9
77314    9
77315    2
77316    2
77317    2
77318    2
77319    2
77320    2
77321    2
77322    2
77323    2
77324    1
77325    1
77326    1
77327    1
77328    1
77329    1
77330    1
77331    1
77332    1
77333    2
77334    2
77335    2
77336    2
77337    2
77338    2
77339    2
77340    2
77341    2
Name: issueArea, dtype: float64

In [43]:
from patsy import dmatrices
log_model = LogisticRegression(penalty='l2',C=1.0, fit_intercept=True, class_weight='balanced')
bigdf=pd.read_csv("SCDB_2015_01_justiceCentered_Citation.csv")

smalldf = pd.DataFrame()

regress_vars = ['issue', 'issueArea', 'decisionDirection', 
                'dateDecision', 'justice', 'chief', 'caseSource', 'certReason', 'vote'] 

for i in regress_vars: 
    smalldf[i] = bigdf[i]
    
y, X = dmatrices('C(vote) ~ C(issue) + C(issueArea) + C(decisionDirection) + C(justice) + \
                  C(caseSource) + C(certReason)',
                  smalldf, return_type="dataframe")

print (X.shape)
print(y.shape)
# y = np.ravel(y)
print((y))
model = LogisticRegression()

model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)

(61833, 434)
(61833, 8)
       C(vote)[1.0]  C(vote)[2.0]  C(vote)[3.0]  C(vote)[4.0]  C(vote)[5.0]  \
0                 0             1             0             0             0   
1                 1             0             0             0             0   
2                 1             0             0             0             0   
3                 0             0             0             1             0   
4                 1             0             0             0             0   
5                 1             0             0             0             0   
6                 1             0             0             0             0   
7                 1             0             0             0             0   
8                 1             0             0             0             0   
9                 1             0             0             0             0   
10                0             1             0             0             0   
11                1         

ValueError: bad input shape (61833, 8)

NameError: name 'X' is not defined
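The `ValueError` above comes from the shape of `y`: wrapping the response in `C(vote)` makes patsy expand it into one indicator column per vote code (eight here), whereas `LogisticRegression.fit` expects a one-dimensional label vector. Flattening with `np.ravel` would not help, since it would produce eight entries per row. One possible way around this, sketched on made-up data, is to keep `vote` as a plain column and encode only the predictors. `pd.get_dummies` stands in for patsy here, and the toy frame is an assumption.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the SCDB columns (values are made up)
toy = pd.DataFrame({
    "issueArea": [1, 1, 2, 2, 8, 8, 1, 2, 8, 1, 2, 8],
    "decisionDirection": [1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2],
    "vote": [1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2],
})

# One-hot encode the categorical predictors only
X = pd.get_dummies(toy[["issueArea", "decisionDirection"]].astype(str))
# Keep the response as a single 1-D column of class labels
y = toy["vote"]

model = LogisticRegression().fit(X, y)
print(X.shape, y.shape)
print(model.score(X, y))
```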

### Random Forest Classifier

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree 
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import train_test_split, cross_val_score

In [585]:
# read in SCDB data from file
bigdf=pd.read_csv("supremeCourtDb.csv")
bigdf.head(10)


Unnamed: 0,caseId,docketId,caseIssuesId,voteId,dateDecision,decisionType,usCite,sctCite,ledCite,lexisCite,...,authorityDecision1,authorityDecision2,lawType,lawSupp,lawMinor,majOpinWriter,majOpinAssigner,splitVote,majVotes,minVotes
0,1946-001,1946-001-01,1946-001-01-01,1946-001-01-01-01,11/18/46,1,329 U.S. 1,67 S. Ct. 6,91 L. Ed. 3,1946 U.S. LEXIS 1724,...,4,,6,600,35 U.S.C. § 33,78,78,1,8,1
1,1946-002,1946-002-01,1946-002-01-01,1946-002-01-01-01,11/18/46,1,329 U.S. 14,67 S. Ct. 13,91 L. Ed. 12,1946 U.S. LEXIS 1725,...,4,,6,600,18 U.S.C. § 398,81,87,1,6,3
2,1946-003,1946-003-01,1946-003-01-01,1946-003-01-01-01,11/18/46,1,329 U.S. 29,67 S. Ct. 1,91 L. Ed. 22,1946 U.S. LEXIS 3037,...,1,,2,207,,84,78,1,5,4
3,1946-004,1946-004-01,1946-004-01-01,1946-004-01-01-01,11/25/46,7,329 U.S. 40,67 S. Ct. 167,91 L. Ed. 29,1946 U.S. LEXIS 1696,...,4,,6,600,49 Stat. 801,87,87,1,5,3
4,1946-005,1946-005-01,1946-005-01-01,1946-005-01-01-01,11/25/46,1,329 U.S. 64,67 S. Ct. 154,91 L. Ed. 44,1946 U.S. LEXIS 2997,...,7,,,,,78,87,1,6,3
5,1946-006,1946-006-01,1946-006-01-01,1946-006-01-01-01,11/25/46,1,329 U.S. 69,67 S. Ct. 156,91 L. Ed. 80,1946 U.S. LEXIS 3005,...,2,,1,129,,81,87,1,7,1
6,1946-007,1946-007-01,1946-007-01-01,1946-007-01-01-01,11/25/46,1,329 U.S. 90,67 S. Ct. 133,91 L. Ed. 103,1946 U.S. LEXIS 3053,...,1,,6,600,15 U.S.C. § 79,82,74,1,6,0
7,1946-008,1946-008-01,1946-008-01-01,1946-008-01-01-01,12/9/46,1,329 U.S. 129,67 S. Ct. 231,91 L. Ed. 128,1946 U.S. LEXIS 2995,...,4,,6,600,35 U.S.C. § 89,87,87,1,9,0
8,1946-009,1946-009-01,1946-009-01-01,1946-009-01-01-01,12/9/46,1,329 U.S. 143,67 S. Ct. 245,91 L. Ed. 136,1946 U.S. LEXIS 3047,...,4,5,5,512,,87,87,1,9,0
9,1946-010,1946-010-01,1946-010-01-01,1946-010-01-01-01,12/9/46,1,329 U.S. 156,67 S. Ct. 237,91 L. Ed. 162,1946 U.S. LEXIS 3048,...,4,,3,307,,78,87,1,8,0


In [27]:
df = bigdf["docketId", "dateDecision", "case"]

KeyError: ('docketId', 'dateDecision', 'case')
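The `KeyError` above arises because `bigdf["docketId", "dateDecision", "case"]` passes a single tuple as the key, so pandas looks for one column named by that tuple. Selecting several columns takes a list inside the brackets. The sketch below uses a toy frame, and it uses `caseName` in place of `case` on the assumption that this is the intended column, since the column listing earlier contains `caseName` but no `case`.

```python
import pandas as pd

# Toy frame with a few SCDB-style columns (case names are hypothetical)
toy = pd.DataFrame({
    "docketId": ["1946-001-01", "1946-002-01"],
    "dateDecision": ["11/18/46", "11/18/46"],
    "caseName": ["A v. B", "C v. D"],
    "term": [1946, 1946],
})

# A list of column names selects a sub-DataFrame
subset = toy[["docketId", "dateDecision", "caseName"]]

# A bare tuple is treated as a single (missing) column label
try:
    toy["docketId", "dateDecision", "caseName"]
    raised = False
except KeyError:
    raised = True

print(subset.shape, raised)
```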

### Linear SVM Classifier

In [None]:
svm_model = svm.SVC(C=1.0, kernel='linear', probability=True, class_weight='balanced')
svm_fit = svm_model.fit(X, y)
# W is a held-out feature matrix with the same columns as X (not yet defined in this notebook)
svm_pred = svm_fit.predict(W)
# Class probabilities, based on logistic regression on the distance to the hyperplane.
svm_prob = svm_fit.predict_proba(W)
svm_dist = svm_fit.decision_function(W)

## 2. Justice Ruling Prediction

Here we take a slightly different approach to predicting Supreme Court rulings, using a different dataset. The motivation is that usually only about two justices tend to be swing votes, and that justices' decisions are heavily influenced by factors outside of what transpires in court proceedings, such as background information about the case itself. The Supreme Court Database (SCDB) provides a justice-centered dataset with extensive information about each case; the fields most pertinent to our analysis are:

1. Decision Year
2. Natural Court
3. Petitioner
4. Respondent
5. Case Origin
6. Case Source
7. Lower Court Disposition Direction
8. Issue Area

Our target value to predict is the field called partyWinning (petitioner or respondent). Under our justice-centered approach, we predict each justice's vote individually, aggregate the predicted votes, and take the majority. The confidence of the overall prediction is the average of the individual confidences of the per-justice models.
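The aggregation step can be sketched as follows. The function name, the nine (vote, confidence) pairs, and the tie-breaking rule (ties go to the respondent) are all assumptions for illustration, not part of the project code.

```python
def aggregate_votes(justice_predictions):
    """Combine per-justice (vote, confidence) pairs into a case-level prediction.

    vote is 1 for the petitioner and 0 for the respondent.
    """
    votes = [v for v, _ in justice_predictions]
    confidences = [c for _, c in justice_predictions]
    # The petitioner wins only with a strict majority of predicted votes
    winner = 1 if sum(votes) > len(votes) / 2 else 0
    # Overall confidence is the plain average of per-justice confidences
    mean_confidence = sum(confidences) / len(confidences)
    return winner, mean_confidence

# Hypothetical predictions for nine justices
preds = [(1, 0.9), (1, 0.8), (0, 0.6), (1, 0.7), (0, 0.55),
         (1, 0.65), (0, 0.7), (1, 0.6), (0, 0.5)]
winner, confidence = aggregate_votes(preds)
print(winner, round(confidence, 3))
```

Averaging treats every justice's model as equally reliable. Weighting by each model's validation accuracy would be one alternative.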

In [91]:
# read in justice-centered SCDB data from file
newdf=pd.read_csv("SCDB_justice_centered.csv")

In [1]:
# maybe lcDispositionDirection? choose features with continuous/numerical features
# do the numbers mean anything though?
newsmalldf = newdf[["term", "naturalCourt", "petitioner", "respondent", "caseOrigin", "caseSource", "lcDisposition", "issueArea"]]

NameError: name 'newdf' is not defined

In [93]:
newsmalldf.head()

Unnamed: 0,term,naturalCourt,petitioner,respondent,caseOrigin,caseSource,lcDisposition,issueArea
0,1946,1301,198,172,51,29,2,8
1,1946,1301,198,172,51,29,2,8
2,1946,1301,198,172,51,29,2,8
3,1946,1301,198,172,51,29,2,8
4,1946,1301,198,172,51,29,2,8


For an intuitive understanding of the features above, check out the documentation here: http://scdb.wustl.edu/documentation.php?var=petitioner. All the above features are categorical instead of continuous (which means the numbers specify a category instead of having a numerical meaning). For an illustrative example, the "petitioner" variable includes:

1. attorney general of the United States, or his office
2. specified state board or department of education
7. state department or agency
etc
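Because these codes are labels rather than quantities, one common way to present them to a model is one-hot encoding, so that no ordering between codes is implied. A minimal sketch with `pd.get_dummies` on a made-up frame (the values are illustrative, not taken from the SCDB file):

```python
import pandas as pd

# Toy frame of categorical codes (values are made up)
toy = pd.DataFrame({"petitioner": [1, 2, 7, 1], "issueArea": [8, 8, 1, 2]})

# Each distinct code becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["petitioner", "issueArea"])
print(list(encoded.columns))
```

With the full SCDB code lists this creates many columns, which is a trade-off to weigh against feeding the raw integer codes to a tree.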

#### Advantages of Using Decision Tree Classifiers

Having an intuitive understanding of the meanings behind the variables is important and leads us to the idea of using a decision tree classifier. A distinct advantage of decision trees is that the decision at each node has an intuitive meaning and corresponds to querying along one feature axis at a time (e.g. is the petitioner an attorney general of the United States?).

Furthermore, trees are easy to understand and interpret. We can look at the top node and figure out which feature it corresponds to, and conclude that this feature contributes the most information gain, i.e. is the most important/predictive feature. This makes it easy to verify whether our results make intuitive sense.
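As a sketch of how this inspection could look in scikit-learn, the fitted tree exposes the feature used at the root split through `tree_.feature[0]` and an overall ranking through `feature_importances_`. The data below is synthetic, and the label is constructed to depend only on `issueArea`, purely for illustration.

```python
import numpy as np
from sklearn import tree

rng = np.random.RandomState(0)
feature_names = ["petitioner", "respondent", "issueArea"]

# Synthetic codes (assumption): the label depends only on the third feature
X_toy = rng.randint(0, 5, size=(300, 3))
y_toy = (X_toy[:, 2] >= 3).astype(int)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Index 0 of tree_.feature is the feature tested at the root node
root_feature = feature_names[clf.tree_.feature[0]]
importances = dict(zip(feature_names, clf.feature_importances_))
print(root_feature, importances)
```

On real data the root feature and the top-ranked importance need not coincide, so it may be worth checking both.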

We now show the process of running decision trees for each justice, before aggregating the votes.

### 2.1 Justice-Centered Decision Tree Classifiers

Ultimately, the target that we want to predict is the vote of each justice.

In [29]:
from sklearn import tree

In [94]:
# newdf.majority refers to whether justice voted with the majority (1 for dissent, 2 for majority)
# newdf.partyWinning indicates winning party (0 for responding party, 1 for petitioning party, 2 for unclear)
# We use the above 2 features to infer which party the individual justice voted for
# NOTE: majority has around 4000 NaNs that we should filter out?
results = []
ctr1, ctr2 = 0,0
for idx, x in enumerate(newdf.majority):
    #if decision is unclear, append 2 to results (in reality, apparently there aren't ANY 2s)
    if newdf.partyWinning[idx] == 2:
        results.append(2)
        ctr1 += 1
        # move on to the next row so that results stays aligned with newdf
        continue
    #if justice voted in the majority
    if x == 2:
        results.append(newdf.partyWinning[idx]) #results contains 0/1
    #if justice voted in the minority
    elif x == 1:
        results.append(1 - newdf.partyWinning[idx]) #results contains 0/1
    else:
        #majority is NaN: defaults to a petitioner vote (1) for now; need to clean this up o.o
        results.append(1)
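The same inference can be written without an explicit loop. The sketch below is a vectorized variant on a toy frame. It departs from the cell above in one respect, flagged here as an assumption: rows with a missing `majority` value get `NaN` instead of a default vote of 1, so they can be dropped later. The column name `justiceVote` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for newdf (values are made up)
toy = pd.DataFrame({
    "majority": [2, 1, 2, 1, np.nan],
    "partyWinning": [1, 1, 0, 0, 1],
})

majority = toy["majority"]
winner = toy["partyWinning"]

# Majority voters side with the winner, dissenters with the loser,
# and rows with missing majority information stay NaN
toy["justiceVote"] = np.where(
    majority == 2, winner,
    np.where(majority == 1, 1 - winner, np.nan),
)
print(toy)
```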

In [99]:
# these are our target values
pd.concat([newsmalldf, results])

AttributeError: 'numpy.float64' object has no attribute 'index'

In [85]:
# drop rows where any column value is NaN - dealing with missing data
newsmalldf.dropna(axis=0).head()

Unnamed: 0,term,naturalCourt,petitioner,respondent,caseOrigin,caseSource,lcDisposition,issueArea
0,1946,1301,198,172,51,29,2,8
1,1946,1301,198,172,51,29,2,8
2,1946,1301,198,172,51,29,2,8
3,1946,1301,198,172,51,29,2,8
4,1946,1301,198,172,51,29,2,8


In [100]:
len(newsmalldf), len(results)

(114895, 114895)

In [104]:
newsmalldf["results"] = pd.Series(results)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
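The warning above appears because `newsmalldf` was created by slicing `newdf`, so pandas cannot tell whether the assignment is meant to reach the original frame. The earlier `pd.concat` call failed for a different reason: it expects pandas objects, and `results` is a plain list. A minimal sketch of one way to avoid both, using a toy frame with made-up values, is to take an explicit copy and assign the list as a new column:

```python
import pandas as pd

# Toy stand-ins for newdf and results (values are made up)
full = pd.DataFrame({"term": [1946, 1946, 1947],
                     "issueArea": [8, 1, 2],
                     "other": [0, 0, 0]})
results = [0, 1, 1]

# .copy() makes the subset an independent DataFrame
small = full[["term", "issueArea"]].copy()
# A list of matching length is assigned by position
small["results"] = results
print(small)
```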


In [105]:
newsmalldf

Unnamed: 0,term,naturalCourt,petitioner,respondent,caseOrigin,caseSource,lcDisposition,issueArea,results
0,1946,1301,198,172,51,29,2,8,0
1,1946,1301,198,172,51,29,2,8,1
2,1946,1301,198,172,51,29,2,8,1
3,1946,1301,198,172,51,29,2,8,1
4,1946,1301,198,172,51,29,2,8,1
5,1946,1301,198,172,51,29,2,8,1
6,1946,1301,198,172,51,29,2,8,1
7,1946,1301,198,172,51,29,2,8,1
8,1946,1301,198,172,51,29,2,8,1
9,1946,1301,100,27,123,30,2,1,0


In [79]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,y)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
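The `ValueError` above indicates that the matrix passed to `fit` still contains missing values. A hedged sketch of one remedy is to drop incomplete rows from features and target together, so they stay aligned, and only then fit the tree. The toy frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Toy frame with some missing codes (values are made up)
toy = pd.DataFrame({
    "petitioner": [198, 100, 27, np.nan, 198, 100],
    "issueArea": [8, 1, 1, 2, np.nan, 8],
    "results": [0, 1, 1, 0, 1, 0],
})

# Drop any row with a missing value before splitting into X and y
clean = toy.dropna(axis=0)
X = clean[["petitioner", "issueArea"]]
y = clean["results"]

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)
print(len(toy), len(clean))
```

Dropping rows discards data. Imputing a dedicated "missing" code would be an alternative if many rows are affected.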