# CS109 Project - The Court Rules In Favor Of...
## Aidi Adnan Brian John (Team AABJ)

### Abstract
The purpose of this project is to predict the votes of Supreme Court justices using oral argument transcripts. Studies in linguistics and psychology, as well as common sense, suggest that people's word choices convey crucial information about their beliefs and intentions with regard to issues. Rather than use precedents or formal analysis of the law to predict Supreme Court decisions, we attempt to extract essential emotional features of the oral arguments made by justices and advocates in the court. Using aggregate data from 1946 to present

### Data
Oral Argument Transcripts - obtained from http://www.supremecourt.gov/oral_arguments/argument_transcript.aspx. Transcripts are made available on the day of the hearing.
Justice Vote Counts/Case Information - obtained from the Supreme Court Database.

# Table of Contents
* [The Court Rules in Favor Of...](#CS109 Project - The Court Rules In Favor Of...)
    * [1. Data Cleaning and Preparation](#1.-Data-Cleaning-and-Preparation)
    * [2. Latent Semantic Indexing](#2.-LSI-:-Latent-Sentiment-Indexing)
        * [2.0.1 Pre-processing](##Stage-1:-Pre-processing)
        * [2.0.2 Term Frequency-Inverse Document Frequency (TF-IDF)](###-TF-IDF-:-Frequency-Document-Inverse-Document Frequency)
        * [2.1 Splitting prepared documents into petitioner and respondent speeches](##-2.1-Splitting-prepared-documents-into-petitioner-and-respondent-speeches)
        * [2.2 Applying term-frequency inverse document frequency vectorizer](##-2.2-Applying-term-frequency-inverse-document-frequency-vectorizer)
        * [2.3 Running Singular Value Decomposition](##-2.3-Running-Singular-Vector-Decomposition)
        * [2.4 Training logistic regression classifier on petitioner and respondent differences](##-2.4-Training logistic-regression-classifier-on-petitioner-and-respondent-differences)
    * [Sentiment Analysis](#-3.-Sentiment-Analysis)

# 1. Data Cleaning and Preparation

In [1]:
import string
import re
import numpy as np
import pandas as pd
import operator
import os
import sys
import io
import collections

We used a Python script (scraper.py) to scrape the PDFs from the Supreme Court website. We did not upload the PDFs to the repository, because the rest of our process works on text files. We then used a second script to convert the PDF files to text files, but not before removing the last 10 pages from each transcript, which are reserved for a word index.

In [2]:
# gather all txt files, first get the path to the data directory
# then list the files and filter out all non-txt files
curPath = os.getcwd()
dataPath = curPath + '/data/'
fileList = os.listdir(dataPath)
fileExt = ".txt"
# list comprehensions return real lists, so len() and indexing work on any Python version
txtFiles = [f for f in fileList if f.endswith(fileExt)]
txtFiles = [dataPath + f for f in txtFiles]

In [28]:
def getAllRawText(txtFiles):
    '''
    Inputs:
    txtFiles: a list of paths of textfiles
    
    Returns:
    list of all uncleaned transcripts in raw text form.
    '''
    rawTexts = []
    for path in txtFiles:
        # the context manager closes each file handle after reading
        with open(path) as f:
            rawTexts.append(f.read())
    return rawTexts

In [264]:
def get_file_dict(fileList, fileExt='.txt'):
    '''
    This function takes the fileList and returns a list of dictionaries of the format 
    {'docket': docket_number, 'full_text': full_text}
    
    Inputs:
    fileList: list of the file names in the data directory
    fileExt: optional parameter for the type of file
    
    Returns:
    list of dictionaries, one per text file
    '''
    fileDict=[]
    fields=['docket', 'full_text']
    txtFiles_filter = filter(lambda f : f[-4:] == fileExt, fileList)
    for each in txtFiles_filter:
        name_str=each[4:-4]
        try:
            indexx=name_str.index('_')
            docketNum=name_str[:indexx]
        except ValueError:
            docketNum=name_str
        cur = open(dataPath+each)
        textual = cur.read()
        cur.close()
        tuple_=(docketNum, textual)
        fileDict.append(dict(zip(fields, tuple_)))
    return fileDict

We wrote a parser to extract the names of the petitioner and respondent attorneys from the first 2 pages of the converted text document. An example list of petitioner and respondent speakers, taken from the 2014 case Johnson v. United States (docket number 13-7120), is:

Katherine M. Menendez, ESQ., Minneapolis, Minn.; on behalf of Petitioner
Michael R. Dreeben, ESQ., Deputy Solicitor General, Department of Justice, Washington D.C.; on behalf of Respondent

To get these speakers, we write a function that uses a regular expression to split the appearances section into one blurb per speaker and takes the name at the start of each blurb. We then assign each name to the petitioner or respondent side based on the wording of its blurb.

In [129]:
def get_petitioners_and_respondents(text):
    '''
    Inputs:
    text : a transcript in its raw form, without having run cleanTextMaker

    Returns:
    pet_speakers, res_speakers, other_speakers
    the petitioner speakers, the respondent speakers, and any other speakers, each as a list
    '''
    #get portion of transcript between APPEARANCES and CONTENTS that specifies speakers for petitioners/respondents
    start = text.find('APPEARANCES:') + len('APPEARANCES:')
    end = text.find('C O N T E N T S')
    speakers_text = text[start:end]
    #each speaker blurb ends with a period followed by a line break
    split_speakers_text = re.split(r'\.[ ]*\n', speakers_text)
    #for each speaker, get name (capitalized) and side (Pet/Res) he/she is speaking for
    pet_speakers, res_speakers, other_speakers = [], [], []
    for speaker in split_speakers_text:
        name = speaker.strip().split(',')[0]
        #search for first index of capitalized word (which will be start of speaker name)
        start = 0
        for idx, char in enumerate(name):
            if str.isupper(char):
                start = idx
                break
        #actual name to be appended to correct list
        name = name[start:]
        
        #if words Petition, Plaintiff, etc occur in speaker blurb, speaker belongs to Pet
        if any(x in speaker for x in ['etition' , 'ppellant', 'emand', 'evers', 'laintiff']):
            pet_speakers.append(name)
        #otherwise if words Respondent, Defendant, etc occur, speaker belongs to Res
        elif any(x in speaker for x in ['espond' , 'ppellee', 'efendant']):
            res_speakers.append(name)
        #otherwise if neither side is specified in blurb, speaking belongs to Other
        elif 'neither' in speaker:
            other_speakers.append(name)
    return pet_speakers, res_speakers, other_speakers
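
To make the parsing rule concrete, here is a minimal, self-contained sketch of the same idea on a made-up APPEARANCES block. The sample text, the helper name `classify_speakers`, and the shortened keyword checks are assumptions for illustration only; real transcripts are formatted less consistently.

```python
import re

# Hypothetical APPEARANCES block, formatted the way we assume the transcripts are.
sample = (
    "APPEARANCES:\n"
    "KATHERINE M. MENENDEZ, ESQ., Minneapolis, Minn.; on behalf of\n"
    "Petitioner.\n"
    "MICHAEL R. DREEBEN, ESQ., Deputy Solicitor General,\n"
    "Department of Justice, Washington, D.C.; on behalf of\n"
    "Respondent.\n"
    "C O N T E N T S\n"
)

def classify_speakers(text):
    """Return (petitioner_names, respondent_names) from an APPEARANCES block."""
    start = text.find('APPEARANCES:') + len('APPEARANCES:')
    end = text.find('C O N T E N T S')
    # each blurb ends with a period followed by a line break
    blurbs = re.split(r'\.[ ]*\n', text[start:end])
    pet, res = [], []
    for blurb in blurbs:
        name = blurb.strip().split(',')[0]
        if not name:
            continue  # skip the empty chunk left after the last blurb
        if 'etition' in blurb:
            pet.append(name)
        elif 'espond' in blurb:
            res.append(name)
    return pet, res

pet_names, res_names = classify_speakers(sample)
print(pet_names)
print(res_names)
```

Abbreviations such as `Minn.` and `D.C.` do not break the split here because they are not followed directly by a line break; a blurb that wraps right after an abbreviation would be split in the wrong place.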

The following function originally generated a regular expression based on "TITLE. LASTNAME". However, some transcripts do not follow that pattern, so we abandoned it in favor of using just the last name to build the regular expression that we ultimately use to separate the text by speaker.

In [7]:
def generateRES(nameList, plebe):
    """
    generates a regular expression that splits text based on speaker names
    
    Inputs:
    nameList: a list of strings of the names
    plebe: a boolean determining whether or not the list is of justices or not
    
    Returns:
    A list of regular expressions for each of the names that work for supreme
    court case transcripts
    """
    retList = []
    for name in nameList:
        address = ""
        if plebe:
            words = name.split(' ')
            # first term is the title, last
            # word is the last name
            address = words[-1]
            retList.append(address)
        else:
            # Justice can appear in two ways!
            address = "JUSTICE %s" % name
            address2 = "CHIEF JUSTICE %s" % name
            retList.append(address)
            retList.append(address2)
    return retList

We don't yet have a way to gather the names of the Justices from the text, so we write one. We find each occurrence of the word JUSTICE in all caps and collect whatever comes after it, up to the next colon.

In [8]:
def getJusticeNames(text):
    """
    Inputs:
    text is the raw text of a transcript
    
    Returns:
    A list of the unique names of Justices mentioned by name in the transcript
    """
    index = 0
    retList = []
    while index < len(text):
        index = text.find("JUSTICE", index)
        if index == -1:
            break
        index += 8 # because length of JUSTICE is 7, plus length of the space
        # the name runs up to the next colon; stop if there is none,
        # instead of walking past the end of the text
        colonIndex = text.find(':', index)
        if colonIndex == -1:
            break
        retList.append(text[index:colonIndex])
        index = colonIndex
    return list(set(retList))
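
As a point of comparison, the same lookup can be sketched with a single regular expression. This is a swapped-in technique rather than the function above: the excerpt is made up, the helper name `justice_names` is hypothetical, and we assume speaker tags always look like `JUSTICE NAME:` with an all-caps last name.

```python
import re

# Made-up excerpt; real transcripts are assumed to mark speakers as "JUSTICE NAME:".
excerpt = (
    "CHIEF JUSTICE ROBERTS: We will hear argument next. "
    "MS. MENENDEZ: Mr. Chief Justice, and may it please the Court. "
    "JUSTICE SCALIA: What about the residual clause? "
    "JUSTICE O'CONNOR: Counsel, a question. "
    "JUSTICE SCALIA: Let me follow up."
)

def justice_names(text):
    """Collect the unique names that follow the all-caps word JUSTICE and precede a colon."""
    # [A-Z'] keeps names such as O'CONNOR intact; lowercase "Justice" is not matched
    return sorted(set(re.findall(r"JUSTICE ([A-Z']+):", text)))

names = justice_names(excerpt)
print(names)
```

Requiring the colon right after the name avoids picking up phrases such as "DEPARTMENT OF JUSTICE", which the index-based search would read up to the next colon.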

The general flow of court proceedings is that the Petitioner's attorneys make their oral argument, followed by the Respondent's attorneys, before we hear the Petitioner's rebuttal argument. Throughout the proceedings, Justices are free to interject with questions and statements of their own. The function below extracts the main argument portion of the oral transcripts, which is the meat of the proceedings that we want to analyze.

In [10]:
def get_argument_portion(text):
    '''
    Inputs:
    Raw text of a transcript as a string
    
    Returns:
    The argument portion of the case as a string
    '''
    #start and end defines bounds of argument portion of text
    start = text.find('P R O C E E D I N G S')
    end = text.rfind('Whereupon')
    return text[start:end]
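
A small sketch of the slicing on a made-up miniature transcript. The sample text is an assumption, and the fallback for missing markers is our addition for illustration; the function above does not have it.

```python
# Made-up miniature transcript; the markers are the ones the function above searches for.
transcript = (
    "APPEARANCES: (omitted) C O N T E N T S (omitted) "
    "P R O C E E D I N G S "
    "MR. SMITH: May it please the Court. "
    "(Whereupon, at 11:07 a.m., the case was submitted.)"
)

start = transcript.find('P R O C E E D I N G S')
end = transcript.rfind('Whereupon')

# str.find returns -1 when a marker is missing, which would silently give a wrong slice,
# so this sketch falls back to the full text in that case (an assumption, not the notebook's behavior).
argument = transcript[start:end] if start != -1 and end != -1 else transcript
print(argument)
```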

In [12]:
def count_words(s):
    '''
    Inputs:
    s: string of words
    
    Returns:
    An integer counting the number of the words in s
    '''
    s = s.split()
    non_words = ['-', '--']
    return sum([x not in non_words for x in s])

The transcripts contain a lot of line numbers as well as line breaks in the middle of sentences, so we want to remove those before we do any analysis on them. We can also use this function to return a list of the cleaned versions of our documents.

In [14]:
def cleanTextMaker(text):
    '''
    Inputs:
    text: a raw text with newlines and numbers, as is usual in the transcripts
    
    Returns:
    A string with newlines and line numbers scrubbed
    '''
    text_arr=text.splitlines()
    text_clean=[]
    for each in text_arr:
        if each != '':
            try:
                int(each)
            except ValueError: #assumption: if the line contains only an integer, it is a line number.
                text_clean.append(each)
    out_text=' '.join(text_clean)
    return out_text
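
The effect of the cleaning step is easiest to see on a tiny example. The fragment below is made up, and the sketch uses `str.isdigit()` in place of the `int()` check above (a small substitution: `int()` also accepts signed numbers such as `-3`).

```python
# Made-up fragment in the style of a converted transcript page:
# digit-only lines are line numbers, blank lines are layout residue.
raw = "1\nMR. SMITH: The statute\n\n2\nsays 10 years,\n3\nnot 15.\n"

kept = []
for line in raw.splitlines():
    stripped = line.strip()
    # drop blank lines and lines that consist only of digits
    if stripped and not stripped.isdigit():
        kept.append(line)

clean = ' '.join(kept)
print(clean)
```

Numbers inside a sentence (the `10` and `15` here) survive, because only lines that are entirely numeric are dropped.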

In [29]:
def getAllRawCleanText(txtFiles):
    '''
    Inputs:
    txtFiles: a list of paths of textfiles
    
    Returns:
    list of all cleaned transcripts in raw text form.
    '''
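    cleanTexts = []
    for path in txtFiles:
        # the context manager closes each file handle after reading
        with open(path) as f:
            cleanTexts.append(cleanTextMaker(f.read()))
    return cleanTexts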

It may be useful to split the data into train and test sets, so let's write a function that does so for later.

In [341]:
def splitData(X, fraction_train=9.0 / 10.0):
    """
    Deterministically splits a vector
    
    Inputs:
    X: a one dimensional vector
    fraction_train: the fraction of data that is desired to be train
    
    Returns:
    the train portion and test portions of the vector
    """
    end_train = int(len(X) * fraction_train)
    X_train = X[0:end_train]
    X_test = X[end_train:]
    return X_train, X_test

def splitTrainTest(X, Y, fraction_train = 9.0 / 10.0):
    """
    Inputs:
    X, Y : vectors to be split
    
    Returns:
    Each vector split into train and test
    """
    X_train, X_test = splitData(X, fraction_train)
    Y_train, Y_test = splitData(Y, fraction_train)
    return X_train, Y_train, X_test, Y_test

# 2. LSI : Latent Sentiment Indexing

LSI (latent semantic indexing) is a method that uses singular value decomposition (SVD) to find patterns and themes across unstructured documents. The key idea behind LSI is to extract the conceptual content of a text by finding connections between terms that occur in similar contexts. Once we compute an SVD, we can also truncate the resulting matrices to produce smaller matrices that are easier to handle and less prone to overfitting.

## Stage 1: Pre-processing

#### Text representation

Document representation is the first step of our analysis, since there are a variety of ways to represent a transcript, which in its raw form is a plain string of text. We use a pre-processing technique that reduces the complexity of the documents and makes them easier to handle: we transform each oral transcript from its full text into a document vector (a row of a sparse matrix). Every text document is represented as a vector of term weights (word features) over a set of terms (the dictionary), where each term occurs in at least a certain minimum number of documents.

#### High dimensionality of text representation

A major characteristic of document classification problems is the extremely high dimensionality of the data, where the number of potential features often exceeds the number of training documents. Dimensionality reduction is thus critical to allow for efficient data manipulation. Irrelevant and redundant features often degrade the performance of classification algorithms in both accuracy and speed, and also make it easy to fall into the all-too-common trap of overfitting.

#### Data cleaning

Pre-processing of text data involves tokenizing the raw text, removing stop words, stemming, and eliminating language-dependent factors as much as possible. Brief explanations of these pre-processing stages are as follows:

1. Sentence splitting: identifying sentence boundaries in documents
2. Tokenization: partitioning documents, which are initially treated as a single string, into a list of tokens
3. Stop word removal: removing common English words like "the", "a", etc.
4. Stemming: reducing derived words to their root form, for example happiest -> happy
5. Noisy data: cleaning noise left over from the pdf to text conversion, such as line numbers and page breaks
6. Text representation: determining whether we should use words, phrases or entire sentences as a "token" for analysis
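
The first four stages can be sketched in a few lines of plain Python. Everything here is a simplified stand-in: the stop-word list and suffix rules are tiny assumptions made up for this example, and `crude_stem` is not the Snowball stemmer that the notebook actually uses.

```python
import re

# Tiny illustrative stop-word list and suffix rules; both are assumptions for this sketch.
STOP_WORDS = {'the', 'a', 'an', 'of', 'to', 'is', 'was', 'and'}
SUFFIXES = ('ing', 'ed', 's')

def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemmer such as Snowball."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Sentence-split, tokenize, remove stop words, and stem."""
    # split after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r'(?<=[.?!])\s+', text.strip())
    result = []
    for sentence in sentences:
        tokens = re.findall(r'[a-z]+', sentence.lower())
        result.append([crude_stem(t) for t in tokens if t not in STOP_WORDS])
    return result

processed = preprocess("The defendant was stealing cars. The court rejected the appeals!")
print(processed)
```

With this toy rule set, "stealing" and "steal" would map to the same stem, which is the property we care about when assigning importance to words.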

### Feature extraction vs. feature selection

After feature extraction, feature selection was conducted to construct a vector space of appropriate dimensionality, which improves the scalability, efficiency and accuracy of our classification algorithm. The main idea of feature selection is to choose a subset of features from the original texts, with the subset determined by taking the features with the highest score according to some predetermined measure of feature importance.

We use filters to generate our features. Filters can be applied independently of the actual classification algorithm and are not very computationally expensive. A filter uses an evaluation metric that measures the ability of a feature to differentiate each class, and so selects the most discriminative and valuable features. The filter of our choice is a technique called term frequency-inverse document frequency, as shown below.

### TF-IDF : Term Frequency-Inverse Document Frequency

Term frequency-inverse document frequency (tf-idf) is a powerful method to evaluate how important a word is in a document, and it captures the relative significance among words. It converts the textual representation of information into a vector space model, stored as a sparse matrix.

In [766]:
# used to generate tfidf sparse matrices for the importance of words in documents
from sklearn.feature_extraction.text import TfidfVectorizer
# internal utilities used to replicate functionality of truncated_svd
from sklearn.utils import as_float_array
from sklearn.utils.extmath import randomized_svd
# stemmer of words
import snowballstemmer

In [30]:
allRawText = getAllRawText(txtFiles)
allRawCleanText = getAllRawCleanText(txtFiles)

Similar to the approach we took when generating the regular expressions for parsing, we want a way to convert the names scraped from the documents into a name that we can split the speeches by, so we adopt the convention of using the last name.

In [31]:
def toColloquialName(formal_name):
    ret = formal_name.split()
    return ret[-1]

When running LSI, we do not want to split importance among words that are actually very similar (such as "stealing" and "steal"), so we stem words: we remove the suffix to bring each word down to a root, or 'stem', that we can assign importance to. There was an issue with encodings: the txt documents are stored in Latin1 encoding when converted from PDFs so that earlier functions work, but when iterating through words there are problems with how the words are decoded and passed to the stemmer. As a result, we need to manually convert incompatible strings to a tractable format. This conversion failed for exactly one document in our database, so we ultimately removed it. (Had we kept it, the docketId-based indexing into the merged dataframe later on would have left us with one more response variable than predictor variables.) Stemming can leave a word as a non-English string, or may stem a word incorrectly. Nothing short of a large dictionary containing the stem of every possible word would perform the stemming perfectly, so we are forced to accept this aggressive trimming.
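
The encoding mismatch can be reproduced in isolation. The bytes below are a made-up example. Python 2 implicitly decodes byte strings as ASCII when they are mixed with unicode strings, which is the kind of step that raises `UnicodeDecodeError`; decoding explicitly as Latin1 avoids it.

```python
# A byte string as it might be read from a Latin1-encoded file (made-up example).
raw_bytes = b'caf\xe9 r\xe9sum\xe9'

# Latin1 maps every possible byte to a code point, so an explicit decode always succeeds.
text = raw_bytes.decode('latin-1')

# The same bytes are not valid ASCII, so an ASCII decode fails on the first accented byte.
try:
    raw_bytes.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(ascii_ok)
```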

In [295]:
def destem(allRawText):
    """
    stems all words from a list of documents
    documents are assumed to be stored in Latin1 encoding
    there is one document that is not tractable so we exclude it
    uses snowballstemmer
    required to decode string to avoid UnicodeDecodeErrors
    
    Inputs:
    a list of raw text files that have been cleaned
    
    Returns:
    a list of text files that have been stemmed word by word
    """
    stemmer = snowballstemmer.stemmer('english')
    stemmedList = []
    for text in allRawText:
        try:
            temp = stemmer.stemWords(text.split())
            for i in range(len(temp)):
                # byte strings (Python 2 str) must be decoded; unicode strings are left alone
                if isinstance(temp[i], bytes):
                    temp[i] = temp[i].decode('Latin1')
            res = ' '.join(temp)
            stemmedList.append(res)
        except UnicodeDecodeError:
            # literally just one document
            pass
    return stemmedList

The first page of each transcript contains the docket number of the case. We can retrieve it by performing a .find() for "No." on the document and taking the token that follows.

In [155]:
def getDocketNo(text):
    '''
    Input: the text of a transcript
    
    Returns:
    the docket number of the case
    '''
    cleantext = cleanTextMaker(text)
    docketIdx = cleantext.find("No.")
    return cleantext[docketIdx+4:].split()[0]
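
A regular expression gives a slightly stricter version of the same lookup. This is a sketch of an alternative, not the function above: the first-page text is made up, the helper name `docket_number` is hypothetical, and we assume docket numbers look like `13-7120` (other formats would need a wider pattern).

```python
import re

# Made-up first-page text; we assume the docket number follows "No." as in "No. 13-7120".
first_page = (
    "IN THE SUPREME COURT OF THE UNITED STATES "
    "JOHNSON, Petitioner, v. UNITED STATES, Respondent. No. 13-7120 "
    "Washington, D.C."
)

def docket_number(text):
    """Return the first digits-dash-digits token after 'No.', or None if there is none."""
    match = re.search(r'No\.\s*(\d+-\d+)', text)
    return match.group(1) if match else None

print(docket_number(first_page))
print(docket_number("no docket here"))
```

Returning `None` when nothing matches makes a missing docket number visible, whereas slicing after a failed `.find()` returns an unrelated token.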

We want to merge the text files with the Supreme Court Database so that we can easily associate the text files with the docketId and get the decision of each case. However, the Supreme Court Database unfortunately does not contain information from beyond 2014, so we lose several transcripts in the merging process.

In [266]:
fileDict=get_file_dict(fileList)
txtdf = pd.DataFrame(fileDict)
casedf = pd.read_csv('supremeCourtDb.csv')
merged = pd.merge(left=txtdf, right=casedf, how='inner', left_on='docket', right_on='docket')

(934, 54)


As referenced earlier, there is a single document that is not tractable to stemming due to codec issues, so we merely drop it for being insolent.

In [325]:
# drop problematic docket
merged = merged[merged.docket != '08-351']

In [321]:
merged.shape

(931, 54)

## 2.1 Splitting prepared documents into petitioner and respondent speeches

Ok, so we have about 930 documents to work with. We now want to find a way to gather the texts of what the petitioners and the respondents say. We adapt a function we wrote earlier that counted the number of words that each speaker said and use it to gather the texts that each party is responsible for. We first gather them by speaker then gather them by the group that they are a part of.

In [151]:
def splitTextPetRes(text):
    '''
    Input: 
    text: raw text document (transcript, recently opened)
        
    Returns:
    a dictionary of speaker: the words they said
    a dictionary of group: the words they said
    '''
    arg_text = get_argument_portion(text)
    #keeps track of current speaker
    current_speaker = 'N/A'
    clean_argument = cleanTextMaker(arg_text)

    # first get the names of the judges and speakers
    pet_speakers, res_speakers, _ = get_petitioners_and_respondents(text)
    
    # build the speaker tags (last name plus colon) for both sides;
    # list comprehensions return real lists, so the two can be concatenated
    petList = [name + ":" for name in generateRES(pet_speakers, True)]
    resList = [name + ":" for name in generateRES(res_speakers, True)]
    all_speakers = (petList + resList)
    
    RE = '('  + '|'.join(all_speakers) + ')'
    
    # split argument portion wherever a speaker tag from all_speakers (e.g. FARR:) appears
    split_argument = re.split(RE, clean_argument)
    
    # dictionary keyed by speaker, with value actual speech (in string format)
    speech = dict(zip(all_speakers + [current_speaker], [""] * (len(all_speakers)+1)))
    
    #iterate through split argument, accumulating speeches for all speakers
    for s in split_argument:
        if s in all_speakers:
            current_speaker = s
        #otherwise the chunk is part of the current speaker's speech, so append it
        else:
            speech[current_speaker] += s

    #combine all pet and res speakers, if multiple
    retDict = {"resSpeakers":"", "petSpeakers":""}
    
    for rSpeaker in resList:
        retDict["resSpeakers"] += speech[rSpeaker]
    for pSpeaker in petList:
        retDict["petSpeakers"] += speech[pSpeaker]

    return speech, retDict
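
The core of this function is `re.split` with a capturing group, which keeps the speaker tags in the output so we can tell who said each chunk. Here is a minimal sketch on made-up text; the attorney names are hypothetical, and `re.escape` is our addition to guard against names that contain regex characters.

```python
import re

# Made-up cleaned argument text with two hypothetical attorneys.
argument = (
    "MR. FARR: May it please the Court. "
    "MS. LEE: We disagree. "
    "MR. FARR: In rebuttal, two points."
)
speakers = ['FARR:', 'LEE:']

# The capturing group makes re.split keep the speaker tags in the output list.
pattern = '(' + '|'.join(re.escape(s) for s in speakers) + ')'
chunks = re.split(pattern, argument)

speech = {s: '' for s in speakers}
current = None
for chunk in chunks:
    if chunk in speech:
        current = chunk            # a speaker tag: switch speaker
    elif current is not None:
        speech[current] += chunk   # text: credit it to the current speaker

print(speech)
```

Because the split is on the last name only, the title that precedes each tag (`MR.`, `MS.`) stays attached to the end of the previous speaker's chunk. For our purposes this adds a small amount of noise to each speech.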

Because the transcripts are not consistently formatted, we sometimes fail to gather the names of the petitioners or respondents. In that case, we do not add the case to the dataset, because we would be comparing one side's speech against an empty one. That just would not be fair!

In [326]:
# iterate through merged.full_text, trying to fill in merged.pet_speech and merged.res_speech
allPetSpeeches = []
allResSpeeches = []
allDocketNo = []
allDecisions = []
for _, row in merged.iterrows():
    speech, retDict = splitTextPetRes(row["full_text"])
    petSpeech = retDict["petSpeakers"]
    resSpeech = retDict["resSpeakers"]
    # if either petSpeech or resSpeech is an empty string, do not add to workable dataset
    if petSpeech and resSpeech:
        allPetSpeeches.append(petSpeech)
        allResSpeeches.append(resSpeech)
        allDocketNo.append(row["docket"])
        allDecisions.append(row["partyWinning"])

In [330]:
# some oral transcripts have empty petitioner or respondent speeches due to dirty scraping of pdf files
# for example, get_petitioners_and_respondents sometimes does not scrape properly due to bad formatting
len(allPetSpeeches), len(allResSpeeches), len(allDecisions)

(886, 886, 886)

We now stem all of the words in the document files. This takes a long time because of the inconsistent encodings of strings as mentioned earlier, necessitating iterating through each word and manually trying to convert it. 

In [328]:
# takes long to run
allDestemmedPetSpeeches = destem(allPetSpeeches)
allDestemmedResSpeeches = destem(allResSpeeches)

In [329]:
len(allDestemmedPetSpeeches), len(allDestemmedResSpeeches)

(886, 886)

## 2.2 Applying term-frequency inverse document frequency vectorizer

Finally, we are ready to pass each set of documents to the vectorizer, which weights every word by term frequency-inverse document frequency (tf-idf). It first calculates the raw term frequency (the number of times that the word appears in the document) and then multiplies it by the inverse document frequency (a global weighting function):
$$g_i = \log_2 \frac{n}{1 + df_i}$$

where $g_i$ is the weight for term $i$, $n$ is the total number of documents, and $df_i$ is the number of documents in which term $i$ appears. This penalizes words that appear in many documents. We take $g_i \times f_i$ (where $f_i$ is the raw frequency of term $i$ in a document) as the tf-idf statistic for that word in that document. Note that scikit-learn's `TfidfVectorizer` uses a smoothed, natural-log variant of this weight by default, so its values differ from the formula above even though the idea is the same.

(credit to: https://en.wikipedia.org/wiki/Latent_semantic_indexing)
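
To see the formula at work, here is a small worked example in plain Python. The three token lists are made up, and the helper names `global_weight` and `tfidf` are hypothetical. The code follows the $\log_2$ formula above, not scikit-learn's internal variant.

```python
import math

# Three toy "documents", already tokenized (made-up data).
docs = [
    ['court', 'statute', 'court'],
    ['court', 'appeal'],
    ['court', 'statute', 'felony'],
]
n = len(docs)  # total number of documents

def global_weight(term):
    """g_i = log2(n / (1 + df_i)), where df_i counts the documents containing the term."""
    df = sum(term in doc for doc in docs)
    return math.log(n / (1.0 + df), 2)

def tfidf(term, doc):
    """Raw term frequency f_i times the global weight g_i."""
    return doc.count(term) * global_weight(term)

print(tfidf('court', docs[0]))   # 'court' is in every document, so its weight is negative
print(tfidf('felony', docs[2]))  # 'felony' is in one document only, so its weight is positive
```

The `1 +` in the denominator means that a term appearing in every document gets a slightly negative weight rather than zero, so ubiquitous words are actively pushed down.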

In [331]:
# raw strings keep the backslash in the token pattern from being read as a string escape
vectorizer1 = TfidfVectorizer(min_df=1, norm='l2', use_idf=True, stop_words='english', encoding='Latin1', analyzer='word', token_pattern=r'\w+')
vectorizer2 = TfidfVectorizer(min_df=1, norm='l2', use_idf=True, stop_words='english', encoding='Latin1', analyzer='word', token_pattern=r'\w+')

## 2.3 Running Singular Value Decomposition

SVD is a method to decompose a matrix into three other matrices that perform the following transformations: a rotation, a scaling along the coordinate axes, and another rotation. We label these T, S, and DT, respectively, in our code.

When we run SVD on the matrix generated by the vectorizers, the three matrices we get back have a meaning in the context of LSI. The first matrix describes how strongly each theme is present in each document, the second is a diagonal matrix of singular values (returned as a vector) representing the relative importance of each theme overall, and the third describes the importance of each word in each theme. We can run logistic regression on just the first matrix, and specifically on the difference between the petitioners' matrix and the respondents' matrix, to represent the difference in how strongly the parties speak about a certain theme within a case.
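
In symbols, using the shapes reported for the petitioner speeches further down (886 documents, 29308 terms, 25 components), the truncated decomposition can be written as follows. The subscript $k$ for the number of retained components is our notation, not something that appears in the code.

$$X \approx T_k S_k D_k^{T}, \qquad X \in \mathbb{R}^{886 \times 29308},\quad T_k \in \mathbb{R}^{886 \times 25},\quad S_k = \mathrm{diag}(\sigma_1, \dots, \sigma_{25}),\quad D_k^{T} \in \mathbb{R}^{25 \times 29308}$$

Row $d$ of $T_k$ holds the 25 theme coordinates of document $d$, which is why that matrix is the one used as the feature matrix.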

In [143]:
def runSVD(documentList, vizer, numComponents=25, nIter=5):
    """
    takes a list of documents and a vectorizer
    converts document list to a matrix of tf-idf weights
        (as determined by the vectorizer) of document by word
    takes matrix and runs truncated SVD on it to generate
    a matrix of how strongly each theme appears in each document (T),
    a vector of the overall importance of each theme (S),
    and a matrix that consists of how important each word
    is in each theme (DT)
    
    this code is partially derived from sklearn's
    TruncatedSVD (which doesn't return
    all of the matrices we are interested in)
    """
    mat = vizer.fit_transform(documentList)
    X = as_float_array(mat, copy=False)
    # T is the document-by-theme matrix
    # S holds the singular values, one per theme
    # DT is the theme-by-term matrix
    T, S, DT = randomized_svd(X, numComponents, n_iter=nIter)
    return T, S, DT

In [354]:
tPet, sPet, dTPet = runSVD(allDestemmedPetSpeeches, vectorizer1, numComponents=25)
tRes, sRes, dTRes = runSVD(allDestemmedResSpeeches, vectorizer2, numComponents=25)

In [332]:
print tPet.shape, sPet.shape, dTPet.shape
print tRes.shape, sRes.shape, dTRes.shape

((886, 25), (25,), (25, 29308))

## 2.4 Training logistic regression classifier on petitioner and respondent differences

As suggested earlier, we run logistic regression on the difference of the two matrices that we are interested in.

In [333]:
# run logistic regression on D x numTopics matrix of independent variables, vs. 0/1 result vector
from sklearn.linear_model import LogisticRegression
# model_selection replaces the cross_validation and grid_search modules removed in scikit-learn 0.20
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import accuracy_score

In [349]:
def cv_optimize(clf, parameters, X, y, n_folds=5):
    """
    From CS109's problem sets

    Inputs:

    clf : an instance of a scikit-learn classifier
    parameters: a parameter grid dictionary that's passed to GridSearchCV
    X: a samples-features matrix in the scikit-learn style
    y: the response vectors of 1s and 0s (+ives and -ives)
    n_folds: the number of cross-validation folds (default 5)

    Returns:

    The best estimator from the GridSearchCV, after the GridSearchCV has been used to
    fit the model.
    """
    clf = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    clf.fit(X,y)
    return clf.best_estimator_

Logistic Regression, our good old friend.

In [None]:
clflog = LogisticRegression()
tDiff = tPet - tRes
xTrain, yTrain, xTest, yTest = splitTrainTest(tDiff, allDecisions)

In [364]:
clflogopt = cv_optimize(clflog, {"C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, xTrain, yTrain, n_folds=5)

In [365]:
clflogopt

LogisticRegression(C=0.0001, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [366]:
training_accuracy = clflogopt.score(xTrain, yTrain)
test_accuracy = clflogopt.score(xTest, yTest)
print training_accuracy, test_accuracy

0.692597239649 0.651685393258


Ok, so maybe we are not at the point where we can replace judges just yet. However, there were some issues that may have hindered our accuracy. For example, we had to get rid of some documents when we were merging document texts with databases, because we didn't have data from decisions in 2015.

# 3. Sentiment Analysis

In the above analysis, we focused on the parts of the oral transcripts corresponding to attorney speeches, which means we essentially ignored an equally valuable source of information in these transcripts: the justices' responses and questions to the attorneys. In this section, we apply natural language processing to the justices' speeches and other salient features of the text that were not included in the latent semantic analysis above, which might yield interesting results.



In [397]:
# Filter out all oral transcripts that do not discriminate between individual justice's speeches, instead using
# QUESTION: in place of all justice's speeches
qualifying_rows = []
for rowNo in range(len(merged)):
    # find appropriate row and check whether full_text contains QUESTION: - if so, delete from database
    row = merged.iloc[rowNo, :]
    if row["full_text"].find("QUESTION:") == -1:
        qualifying_rows.append(rowNo)

In [399]:
mergedJudges = merged.iloc[qualifying_rows, :] 
mergedJudges.head()

Now that we have our cleaned dataframe with the raw text and decision for every oral transcript, we want to find the most appropriate features to use as predictors, so we are at the feature extraction stage again. After much experimentation and linguistic analysis of court proceedings in particular, we identified the following features to introduce as predictors for our classification algorithm.

For each oral transcript, we want to identify:
1. Number of words judges uttered to petitioner's and respondent's sides:
        Usage: judges_word_count_split(text)
2. Number of times a judge interrupted petitioner or respondent attorneys: 
        Usage: total_interruptions_pet, total_interruptions_ret = get_total_interruptions(text)
3. Sentiment analysis on words judges uttered to petitioner's and respondent's sides: 
        Usage: judges_speech_split(text) and sentiment analysis using existing dictionary of positive/negative
        words

We will look at each of these features one by one.

## Feature 1: Number of words judges directed at each side

In [734]:
# we can see from this illustrative example that the total number of words corresponds to roughly the sum of words
# each judge said to each side. There might have been words uttered to speakers neither on petitioner or respondent's
# side, or a prologue directed to the general audience (especially for the Chief Justice), which can be ignored for
# our purposes.

In [701]:
def getSplitArgument(text):
    '''
    Gets split argument by RE generated with list of all speakers.
    '''
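    # extract and clean the argument portion so the speaker tags can be matched
    # (clean_argument was previously undefined inside this function)
    clean_argument = cleanTextMaker(get_argument_portion(text))
    
    # first get the names of the judges and speakers
    pet_speakers, res_speakers, other_speakers = get_petitioners_and_respondents(text)
    justice_speakers = getJusticeNames(clean_argument)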
    
    # creates duplicate "JUSTICE" and "CHIEF JUSTICE"; creates RE for pet/res/justices
    justice_RE = generateRES(justice_speakers, False)
    pet_RE = generateRES(pet_speakers, True)
    res_RE = generateRES(res_speakers, True)
    
    # appends colons to all REs
    justice_speakers_with_colon = map(lambda name : name + ":", justice_RE)
    pet_speakers_with_colon = map(lambda name : name + ":", pet_RE)
    res_speakers_with_colon = map(lambda name : name + ":", res_RE)
    
    # aggregates justice and attorney REs
    all_speakers = ["QUESTION"]
    all_speakers += (justice_RE + pet_RE + res_RE)
    all_speakers_with_colon = map(lambda name : name + ":", all_speakers)
    
    # finally, creates regular expression for the justices and attorneys (i.e. all_speakers) to split text on
    RE = '('  + '|'.join(all_speakers_with_colon) + ')'
    
    # splits argument portion according to generated regular expression above that consists of all possible speakers,
    # plebes and judges alike
    split_argument = re.split(RE, clean_argument)
    
    return split_argument, all_speakers_with_colon, pet_speakers_with_colon, res_speakers_with_colon, justice_speakers_with_colon

In [702]:
def speakersWordCount(text):
    '''
    FINAL FEATURE # 1:
        Total number of words spoken by each speaker (justice or attorney) in the entire transcript.
    Output:
        num_words: a dictionary with key being name of justice/attorney and value being total number of words they
        spoke in total throughout argument.  
    '''
    # the name must match the loop below; the original bound the result to splitArgument, which raised a NameError
    split_argument, all_speakers_with_colon, pet_speakers_with_colon, res_speakers_with_colon, _ = getSplitArgument(text)
    
    #num_words is a dictionary that maps all speaker names to number of words they spoke
    currentSpeaker = 'NA:'
    num_words = dict(zip(all_speakers_with_colon + [currentSpeaker], [0] * (len(all_speakers_with_colon)+1)))
    
    #iterate through split argument, accumulating word counts for all speakers
    for s in split_argument:
        #if split chunk signifies change in speaker
        if s in all_speakers_with_colon:
            currentSpeaker = s
        #if split chunk is part of speech of current speaker, append to word count
        else:
            num_words[currentSpeaker] += count_words(s)

    return num_words, pet_speakers_with_colon, res_speakers_with_colon

In [703]:
def deleteVal(dictionary,val):
    '''
    Get rid of all items in dictionary with value being a specific val: 
    E.g. speakers with no words spoken can be ignored
    '''
    # iterate over a copy of the items so that deleting keys during the loop is safe
    for k,v in list(dictionary.items()):
        if v == val:
            del dictionary[k]
    return dictionary

In [704]:
def featureOne(text):
    numWords, pet_speakers_with_colon, res_speakers_with_colon = speakersWordCount(text)
    numWords = deleteVal(numWords,0)
    # clump together petitioners and respondents
    numWordsPet, numWordsRes = 0,0
    for s in pet_speakers_with_colon:
        if s in numWords:
            numWordsPet += numWords[s]
    for s in res_speakers_with_colon:
        if s in numWords:
            numWordsRes += numWords[s]
    return numWordsPet, numWordsRes

In [705]:
featureOne(text)

(3226, 2909)
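
The word-count feature boils down to splitting the transcript on speaker labels and tallying the words between labels. The block below is a minimal, self-contained sketch of that idea on a toy transcript. The speaker names, the transcript text and the helper `count_words_by_speaker` are hypothetical and are not part of the notebook's pipeline.

```python
import re

# Hypothetical toy transcript; real transcripts are far longer.
toy_transcript = (
    "MR. SMITH: May it please the Court. "
    "JUSTICE DOE: Why should we accept that reading? "
    "MR. SMITH: Because the statute is clear."
)

def count_words_by_speaker(transcript, speakers):
    """Return {speaker label: number of words spoken}."""
    # re.escape keeps the "." in "MR." from acting as a regex wildcard
    labels = [re.escape(name) + ":" for name in speakers]
    # the capturing group makes re.split keep the speaker labels in its output
    chunks = re.split("(" + "|".join(labels) + ")", transcript)
    counts = dict((name + ":", 0) for name in speakers)
    current = None
    for chunk in chunks:
        if chunk in counts:
            current = chunk  # a label: the speaker changes
        elif current is not None:
            counts[current] += len(chunk.split())
    return counts

toy_counts = count_words_by_speaker(toy_transcript, ["MR. SMITH", "JUSTICE DOE"])
print(toy_counts)
```

Text that appears before the first speaker label is ignored here, which mirrors the role of the `'NA:'` placeholder speaker in `speakersWordCount`.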

## Feature 2: Number of interruptions per side

In [726]:
def get_num_interruptions(text):
    '''
    FINAL FEATURE # 2 HELPER:
        Number of times a judge interrupted petitioner or respondent attorneys in the entire transcript.
    Output:
        num_interruptions: 
        a dictionary with key being name of justice/attorney speaking and value being total number of times that 
        speaker was interrupted
    '''
    # define the sign for an interruption at end of speech
    interruptions = ["-", "--"]
    
    # get split argument
    split_argument, all_speakers_with_colon, pet_speakers_with_colon, res_speakers_with_colon, \
        justice_speakers_with_colon = getSplitArgument(text)
    
    # num_interruptions maps every speaker name to the number of times that speaker was interrupted
    current_speaker = 'NA:'
    num_interruptions = dict(zip(all_speakers_with_colon + [current_speaker], [0] * (len(all_speakers_with_colon)+1)))
    
    # iterate through split argument, accumulating interruption counts for all speakers
    for s in split_argument:
        # if split chunk is a speaker label, the speaker has changed
        if s in all_speakers_with_colon:
            current_speaker = s
        # otherwise the chunk is the current speaker's speech: check whether it was cut off
        else:
            # if speech contains at least 2 words, check whether the last or second-to-last word is an
            # interruption marker; the marker can be second-to-last when the last word is part of the
            # first name of the next speaker
            words = s.split()
            if len(words) >= 2:
                if words[-1] in interruptions or words[-2] in interruptions:
                    num_interruptions[current_speaker] += 1
    
    return num_interruptions, pet_speakers_with_colon, res_speakers_with_colon

In [728]:
num_interruptions, pet_speakers_with_colon, res_speakers_with_colon = get_num_interruptions(text)

In [731]:
# FINAL FEATURE #2 FUNCTION
def get_total_interruptions(text):
    num_interruptions, pet_speakers_with_colon, res_speakers_with_colon = get_num_interruptions(text)
    # now we need to clump petitioner and respondent interruptions together
    total_interruptions_pet = sum([num_interruptions[k] for k in pet_speakers_with_colon])
    total_interruptions_res = sum([num_interruptions[k] for k in res_speakers_with_colon])
    return total_interruptions_pet, total_interruptions_res

In [733]:
get_total_interruptions(text)

(15, 15)
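
As a small illustration of the interruption heuristic, the sketch below counts the turns that end in a dash marker on a toy exchange. It is a simplified version under stated assumptions: it only inspects the final token of each turn (the notebook's function also checks the second-to-last token), and the speaker names and the helper `count_interrupted_turns` are hypothetical.

```python
import re

# Hypothetical toy exchange; "--" at the end of a turn marks a speaker who was cut off.
toy_exchange = (
    "MR. SMITH: The statute plainly requires -- "
    "JUSTICE DOE: Where does it say that? "
    "MR. SMITH: In section two, which -- "
    "JUSTICE DOE: Go on. "
    "MR. SMITH: Thank you."
)

def count_interrupted_turns(transcript, speakers, markers=("-", "--")):
    """Return {speaker label: number of turns that end in an interruption marker}."""
    labels = [re.escape(name) + ":" for name in speakers]
    chunks = re.split("(" + "|".join(labels) + ")", transcript)
    interrupted = dict((name + ":", 0) for name in speakers)
    current = None
    for chunk in chunks:
        if chunk in interrupted:
            current = chunk  # a label: the speaker changes
            continue
        words = chunk.split()
        # a turn counts as interrupted when its final token is a dash marker
        if current is not None and words and words[-1] in markers:
            interrupted[current] += 1
    return interrupted

toy_interruptions = count_interrupted_turns(toy_exchange, ["MR. SMITH", "JUSTICE DOE"])
print(toy_interruptions)
```

Summing these per-speaker counts over all petitioner attorneys and over all respondent attorneys gives the two numbers used as the feature.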

## Feature 3: Sentiment analysis on judges' speeches

In [735]:
def judges_speech_split(text):
    '''
    FINAL FEATURE # 3:
        All words said by each justice to petitioners and respondents in the entire transcript.
    Output:
        words_to_pet, words_to_res: two dictionaries keyed by speaker name; the values are the concatenated
        speech each justice directed at the petitioner's and the respondent's side, respectively.
    '''
    # reuse getSplitArgument to split the transcript on speaker labels and to get the
    # petitioner, respondent and justice labels (each with a trailing colon)
    split_argument, all_speakers_with_colon, pet_speakers_with_colon, res_speakers_with_colon, \
        justice_speakers_with_colon = getSplitArgument(text)
    
    # words_to_pet / words_to_res map every speaker name to the speech that speaker directed at each side
    previous_speaker = 'NA:'
    current_speaker = 'NA:'
    words_to_pet = dict(zip(all_speakers_with_colon + [current_speaker], [""] * (len(all_speakers_with_colon)+1)))
    words_to_res = dict(zip(all_speakers_with_colon + [current_speaker], [""] * (len(all_speakers_with_colon)+1)))
    
    # iterate through split argument, accumulating each justice's speech by the side being addressed
    for s in split_argument:
        # if split chunk is a speaker label, the speaker has changed
        if s in all_speakers_with_colon:
            previous_speaker = current_speaker
            current_speaker = s
        # otherwise, attribute a justice's speech to the side of whoever spoke just before
        else:
            if current_speaker in justice_speakers_with_colon and previous_speaker in pet_speakers_with_colon:
                words_to_pet[current_speaker] += s
            elif current_speaker in justice_speakers_with_colon and previous_speaker in res_speakers_with_colon:
                words_to_res[current_speaker] += s
    
    return words_to_pet, words_to_res
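
The attribution rule used above is that a justice's remark is taken to be addressed to whichever side spoke immediately before. The following sketch shows that rule in isolation on turns that are already split. The turn list, the speaker names and the helper `split_justice_speech` are hypothetical.

```python
# Hypothetical, already-split turns: (speaker, speech) pairs in transcript order.
toy_turns = [
    ("MR. SMITH", "May it please the Court."),
    ("JUSTICE DOE", "Is that reading plausible?"),
    ("MS. JONES", "The respondent disagrees."),
    ("JUSTICE DOE", "That seems reasonable."),
]

def split_justice_speech(turns, petitioners, respondents, justices):
    """Return (speech directed at petitioner side, speech directed at respondent side)."""
    to_pet, to_res = [], []
    previous = None
    for speaker, speech in turns:
        if speaker in justices:
            # attribute the remark to the side of the previous speaker
            if previous in petitioners:
                to_pet.append(speech)
            elif previous in respondents:
                to_res.append(speech)
        previous = speaker
    return " ".join(to_pet), " ".join(to_res)

toy_to_pet, toy_to_res = split_justice_speech(
    toy_turns,
    petitioners={"MR. SMITH"},
    respondents={"MS. JONES"},
    justices={"JUSTICE DOE"},
)
print(toy_to_pet)
print(toy_to_res)
```

A remark that follows another justice is not attributed to either side under this rule, which is also how `judges_speech_split` behaves.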

In [736]:
words_to_pet, words_to_res = judges_speech_split(text)

In [753]:
def combine_speeches(dictionary):
    #joins strings that are values of a particular dictionary (want to use on words_to_pet, words_to_res)
    return " ".join(dictionary.values()).strip()

In [762]:
def featureThree(text):
    #we obtain dictionaries keyed by justice with values equivalent to the stitched together speeches of every time
    # that justice spoke up in the court proceedings
    words_to_pet,  words_to_res = judges_speech_split(text)
    words_to_pet = deleteVal(words_to_pet, "")
    words_to_res = deleteVal(words_to_res, "")
    words_to_pet = combine_speeches(words_to_pet)
    words_to_res = combine_speeches(words_to_res)
    petSAscore = sentimentAnalysis(words_to_pet)
    resSAscore = sentimentAnalysis(words_to_res)
    return petSAscore, resSAscore

In [758]:
# featureThree returns sentiment scores, so name the results accordingly
petSAscore, resSAscore = featureThree(text)

### Sentiment Analysis

In [763]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
punctuation = list('.,;:!?()[]{}`''\"@#$^&*+-|=~_')

In [764]:
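# import under an alias so the module does not overwrite the transcript variable `text` used above
from sklearn.feature_extraction import text as sk_text
stopwords = sk_text.ENGLISH_STOP_WORDS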

In [765]:
#adapted from hw5
def get_parts(thetext):
    nouns=[]
    descriptives=[]
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            #print token
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2

In [None]:
%%time
## TODO: change to pet speeches and res speeches
parsed=[get_parts(t) for t in merged.full_text.values]

In [None]:
#taking all the descriptives from the transcript. 
nbdata=[each[1] for each in parsed]

In [None]:
#right now the per-sentence lists are simply flattened into one list per transcript (duplicates are kept);
#room for improvement in how repeated words are treated.
flattened=[]
for sub_nb in nbdata:
    flattened.append([item for l in sub_nb for item in l])

In [None]:
#adding the flattened list of descriptives to the dataframe as a column. 
merged['descriptives']=pd.Series(flattened, index=merged.index)
merged.head()

In [None]:
#load opinion lexicon
pos_link = "opinion-lexicon-English/positive-words.txt"
neg_link = "opinion-lexicon-English/negative-words.txt"

def get_both_list(pos_link, neg_link):
    '''
    This function takes the paths of the lexicon files downloaded from 
    https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
    and outputs two lists: posList and negList
    '''
    # read both lexicon files, closing them when done
    with open(pos_link, "r") as pos_file:
        posList = pos_file.read()
    with open(neg_link, "r") as neg_file:
        negList = neg_file.read()
    
    # strip the file headers, keeping only the word lists
    posList=get_pos_lexicon(posList)
    negList=get_neg_lexicon(negList)

    posList=posList.split('\n')
    negList=negList.split('\n')

    return posList, negList

In [None]:
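def get_neg_lexicon(text):
    '''
    This function strips the file header and returns the negative words,
    from the first entry ('2-faced') through the last ('zombie').
    '''
    #start and end define the bounds of the word list within the file;
    #adding the length of the last word keeps it in the slice
    start = text.find('2-faced')
    end = text.rfind('zombie') + len('zombie')
    return text[start:end]

def get_pos_lexicon(text):
    '''
    This function strips the file header and returns the positive words,
    from the first entry ('a+') through the last ('zippy').
    '''
    #start and end define the bounds of the word list within the file;
    #adding the length of the last word keeps it in the slice
    start = text.find('a+')
    end = text.rfind('zippy') + len('zippy')
    return text[start:end]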

In [None]:
#get posList and negList for the next step
posList, negList = get_both_list(pos_link, neg_link)

In [None]:
def split_pos_neg(descriptives, posList, negList):
    '''
    This function takes all the descriptives for each transcript and 
    splits the list into two lists: one of positive and one of negative words. 
    '''
    pos=[]
    neg=[]
    for l in descriptives:
        positive=[]
        negative=[]
        for e in l:
            if e in posList:
                positive.append(e)
            elif e in negList:
                negative.append(e)
        pos.append([positive, 'pos'])
        neg.append([negative, 'neg'])
    return pos, neg

In [None]:
pos, neg=split_pos_neg(list(merged.descriptives.values), posList, negList)

In [None]:
#adding the split lists to merged as columns
merged['pos']=pd.Series(pos, index=merged.index)
merged['neg']=pd.Series(neg, index=merged.index)

In [None]:
#get net count of positive words over negative words
net_count=[len(e1[0])-len(e2[0]) for (e1, e2) in zip(merged.pos, merged.neg)]

In [None]:
#append net_pos to merged as a column
merged['net_pos']=pd.Series(net_count, index=merged.index)
merged.head(2)
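
The net-positivity score computed above is simply the number of descriptive words found in the positive lexicon minus the number found in the negative lexicon. The sketch below reproduces that arithmetic on toy data. The two small word sets are hypothetical stand-ins for the opinion lexicon files, and `net_sentiment` is an illustrative helper rather than a function from the notebook.

```python
# Tiny hypothetical lexicons standing in for the opinion lexicon files.
toy_positive = set(["clear", "reasonable", "fair"])
toy_negative = set(["absurd", "wrong", "unfair"])

def net_sentiment(words, positive, negative):
    """Count of positive-lexicon hits minus count of negative-lexicon hits."""
    pos_hits = [w for w in words if w in positive]
    neg_hits = [w for w in words if w in negative]
    return len(pos_hits) - len(neg_hits)

# repeated words count every time they occur; "long" is in neither lexicon
toy_descriptives = ["clear", "absurd", "fair", "long", "clear"]
toy_net = net_sentiment(toy_descriptives, toy_positive, toy_negative)
print(toy_net)
```

Using sets for the lexicons keeps each membership test fast, which matters when every descriptive word of every transcript is looked up.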

# Running classifier

### Logistic Regression

Logistic regression is a natural first choice of model, since our target can be viewed as a probability between 0 and 1 that an individual justice votes for or against a party, with a higher probability representing higher confidence that the justice votes in favor of the arguing party.
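
To make the "probability between 0 and 1" concrete, here is a minimal sketch of how a fitted logistic regression turns a linear score into a probability. The weights, the intercept and the two feature values are made up for illustration and do not come from any model fitted in this notebook.

```python
import math

def logistic(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_vote_probability(features, weights, intercept):
    """Probability of the positive class under a logistic regression model."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(score)

# Hypothetical coefficients and feature values, for illustration only.
toy_weights = [0.8, -0.5]
toy_features = [2.0, 1.0]
toy_probability = predict_vote_probability(toy_features, toy_weights, intercept=0.0)
print(toy_probability)
```

A score of zero maps to a probability of exactly 0.5, and larger positive scores push the probability towards 1.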

In [12]:
from sklearn.linear_model import LogisticRegression


caseId  
docketId   
caseIssuesId  
voteId  
dateDecision  
decisionType  
usCite  
sctCite  
ledCite  
lexisCite  
term   
naturalCourt  
chief  
docket   
caseName  
dateArgument  
dateRearg  
petitioner   
petitionerState  
respondent  
respondentState  
jurisdiction  
adminAction  
adminActionState  
threeJudgeFdc  
caseOrigin  
caseOriginState  
caseSource  
caseSourceState  
lcDisagreement  
certReason  
lcDisposition  
lcDispositionDirection  
declarationUncon  
caseDisposition   
caseDispositionUnusual  
partyWinning    
precedentAlteration  
voteUnclear    
issue  
issueArea  
decisionDirection  
decisionDirectionDissent  
authorityDecision1  
authorityDecision2  
lawType  
lawSupp  
lawMinor  
majOpinWriter  
majOpinAssigner  
splitVote  
majVotes  
minVotes 

issue: This variable identifies the issue for each decision. Although criteria for the identification of issues are hard to articulate, the focus here is on the subject matter of the controversy (e.g., sex discrimination, state tax, affirmative action) rather than its legal basis (e.g., the equal protection clause) 

issueArea: This variable simply separates the issues identified in the preceding variable (issue) into the following larger categories: criminal procedure (issues 10010-10600), civil rights (issues 20010-20410), First Amendment (issues 30010-30020), etc 

decisionDirection: In order to determine whether the Court supports or opposes the issue to which the case pertains, this variable codes the ideological "direction" of the decision. 
    An outcome is liberal (=2) or conservative (=1)

dateDecision: This variable contains the year, month, and day that the Court announced its decision in the case.

justiceName: This is a string variable indicating the first initial for the five justices with a common surname (Harlan, Johnson, Marshall, Roberts, and White) and last name of each justice.

chief: This variable identifies the chief justice during whose tenure the case was decided.

caseSource: This variable identifies the court whose decision the Supreme Court reviewed. If the case originated in the same court whose decision the Supreme Court reviewed, the entry in the caseOrigin should be the same as here. This variable has no entry if the case arose under the Supreme Court's original jurisdiction. 

certReason: This variable provides the reason, if any, that the Court gives for granting the petition for certiorari. 
 

In [587]:
bigdf=pd.read_csv("SCDB_2015_01_justiceCentered_Citation.csv")

In [589]:
bigdf['issueArea']

0        8
1        8
2        8
3        8
4        8
5        8
6        8
7        8
8        8
9        1
10       1
11       1
12       1
13       1
14       1
15       1
16       1
17       1
18       8
19       8
20       8
21       8
22       8
23       8
24       8
25       8
26       8
27       2
28       2
29       2
        ..
77312    9
77313    9
77314    9
77315    2
77316    2
77317    2
77318    2
77319    2
77320    2
77321    2
77322    2
77323    2
77324    1
77325    1
77326    1
77327    1
77328    1
77329    1
77330    1
77331    1
77332    1
77333    2
77334    2
77335    2
77336    2
77337    2
77338    2
77339    2
77340    2
77341    2
Name: issueArea, dtype: float64

In [43]:
import numpy as np
from patsy import dmatrices
log_model = LogisticRegression(penalty='l2',C=1.0, fit_intercept=True, class_weight='auto')
bigdf=pd.read_csv("SCDB_2015_01_justiceCentered_Citation.csv")

smalldf = pd.DataFrame()

regress_vars = ['issue', 'issueArea', 'decisionDirection', 
                'dateDecision', 'justice', 'chief', 'caseSource', 'certReason', 'vote'] 

for i in regress_vars: 
    smalldf[i] = bigdf[i]

# The predictors are treated as categorical. The response is left as a plain column:
# wrapping it in C(...) makes patsy one-hot encode it into 8 columns, which
# scikit-learn rejects with "ValueError: bad input shape (61833, 8)".
y, X = dmatrices('vote ~ C(issue) + C(issueArea) + C(decisionDirection) + C(justice) + \
                  C(caseSource) + C(certReason)',
                  smalldf, return_type="dataframe")

# flatten y into the 1-D array of class labels that scikit-learn expects
y = np.ravel(y)

# on this dataset the design matrix X has shape (61833, 434)
print(X.shape)
print(y.shape)
model = LogisticRegression()

model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)

### Random Forest Classifier

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree 
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm
from sklearn.cross_validation import train_test_split, cross_val_score

In [585]:
# read in SCDB data from file and show the first 10 rows
bigdf=pd.read_csv("supremeCourtDb.csv")
bigdf.head(10)


Unnamed: 0,caseId,docketId,caseIssuesId,voteId,dateDecision,decisionType,usCite,sctCite,ledCite,lexisCite,...,authorityDecision1,authorityDecision2,lawType,lawSupp,lawMinor,majOpinWriter,majOpinAssigner,splitVote,majVotes,minVotes
0,1946-001,1946-001-01,1946-001-01-01,1946-001-01-01-01,11/18/46,1,329 U.S. 1,67 S. Ct. 6,91 L. Ed. 3,1946 U.S. LEXIS 1724,...,4,,6,600,35 U.S.C. § 33,78,78,1,8,1
1,1946-002,1946-002-01,1946-002-01-01,1946-002-01-01-01,11/18/46,1,329 U.S. 14,67 S. Ct. 13,91 L. Ed. 12,1946 U.S. LEXIS 1725,...,4,,6,600,18 U.S.C. § 398,81,87,1,6,3
2,1946-003,1946-003-01,1946-003-01-01,1946-003-01-01-01,11/18/46,1,329 U.S. 29,67 S. Ct. 1,91 L. Ed. 22,1946 U.S. LEXIS 3037,...,1,,2,207,,84,78,1,5,4
3,1946-004,1946-004-01,1946-004-01-01,1946-004-01-01-01,11/25/46,7,329 U.S. 40,67 S. Ct. 167,91 L. Ed. 29,1946 U.S. LEXIS 1696,...,4,,6,600,49 Stat. 801,87,87,1,5,3
4,1946-005,1946-005-01,1946-005-01-01,1946-005-01-01-01,11/25/46,1,329 U.S. 64,67 S. Ct. 154,91 L. Ed. 44,1946 U.S. LEXIS 2997,...,7,,,,,78,87,1,6,3
5,1946-006,1946-006-01,1946-006-01-01,1946-006-01-01-01,11/25/46,1,329 U.S. 69,67 S. Ct. 156,91 L. Ed. 80,1946 U.S. LEXIS 3005,...,2,,1,129,,81,87,1,7,1
6,1946-007,1946-007-01,1946-007-01-01,1946-007-01-01-01,11/25/46,1,329 U.S. 90,67 S. Ct. 133,91 L. Ed. 103,1946 U.S. LEXIS 3053,...,1,,6,600,15 U.S.C. § 79,82,74,1,6,0
7,1946-008,1946-008-01,1946-008-01-01,1946-008-01-01-01,12/9/46,1,329 U.S. 129,67 S. Ct. 231,91 L. Ed. 128,1946 U.S. LEXIS 2995,...,4,,6,600,35 U.S.C. § 89,87,87,1,9,0
8,1946-009,1946-009-01,1946-009-01-01,1946-009-01-01-01,12/9/46,1,329 U.S. 143,67 S. Ct. 245,91 L. Ed. 136,1946 U.S. LEXIS 3047,...,4,5,5,512,,87,87,1,9,0
9,1946-010,1946-010-01,1946-010-01-01,1946-010-01-01-01,12/9/46,1,329 U.S. 156,67 S. Ct. 237,91 L. Ed. 162,1946 U.S. LEXIS 3048,...,4,,3,307,,78,87,1,8,0


In [27]:
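# select several columns with a list of names (double brackets); a bare tuple of names raises a KeyError.
# NOTE: the original cell asked for a "case" column, which is not among the SCDB columns listed above;
# caseName is assumed here.
df = bigdf[["docketId", "dateDecision", "caseName"]]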

### Linear SVM Classifier

In [None]:
svm_model = svm.SVC(C=1.0, kernel='linear', probability=True, class_weight='auto')
svm_fit = svm_model.fit(X, y)
# W is assumed to be a held-out feature matrix with the same columns as X; it is not defined in this notebook
svm_pred = svm_fit.predict(W)
# Class probabilities, based on log regression on distance to hyperplane.
svm_prob = svm_fit.predict_proba(W)
svm_dist = svm_fit.decision_function(W)

## 2. Justice Ruling Prediction

We use a different dataset and a slightly different approach to predict Supreme Court rulings. This method is motivated by the fact that usually only two justices tend to be swing votes, and that justices' decisions are strongly influenced by factors outside of what transpires in court proceedings, such as background information about the case itself. The Supreme Court Database (SCDB) offers a justice-centered dataset with extensive information about each case; the fields most pertinent to our analysis are:

1. Decision Year
2. Natural Court
3. Petitioner
4. Respondent
5. Case Origin
6. Case Source
7. Lower Court Disposition Direction
8. Issue Area

Our target value is the field partyWinning (petitioner or respondent). In our justice-centered approach, we predict a vote for each individual justice, aggregate the predicted votes, and take the majority. The confidence of the overall prediction is the average of the individual confidences of the per-justice models.
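
The aggregation step can be sketched in a few lines. The block below assumes each per-justice model returns a predicted vote (1 for the petitioner, 0 for the respondent) together with a confidence; the nine prediction pairs are made-up numbers, and `aggregate_votes` is a hypothetical helper, not code from the notebook. Ties, which can only occur with an even number of voting justices, are resolved in favor of the respondent here as an arbitrary assumption.

```python
def aggregate_votes(justice_predictions):
    """Combine per-justice (vote, confidence) pairs into a case-level prediction.

    vote is 1 for the petitioner and 0 for the respondent.
    Returns (winning party, mean confidence).
    """
    votes = [vote for vote, _ in justice_predictions]
    petitioner_votes = sum(votes)
    # strict majority for the petitioner; a tie falls to the respondent (assumption)
    winner = 1 if petitioner_votes > len(votes) / 2.0 else 0
    mean_confidence = sum(conf for _, conf in justice_predictions) / float(len(justice_predictions))
    return winner, mean_confidence

# Hypothetical predictions for a nine-justice bench.
toy_predictions = [(1, 0.9), (1, 0.7), (0, 0.6), (1, 0.8), (0, 0.55),
                   (1, 0.65), (0, 0.7), (1, 0.75), (1, 0.6)]
toy_winner, toy_confidence = aggregate_votes(toy_predictions)
print(toy_winner)
print(toy_confidence)
```

Averaging the confidences treats every justice equally; weighting by each model's validation accuracy would be one alternative design.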

In [91]:
# read in justice-centered SCDB data from file
newdf=pd.read_csv("SCDB_justice_centered.csv")

In [1]:
# maybe lcDispositionDirection? choose features with continuous/numerical features
# do the numbers mean anything though?
newsmalldf = newdf[["term", "naturalCourt", "petitioner", "respondent", "caseOrigin", "caseSource", "lcDisposition", "issueArea"]]


In [93]:
newsmalldf.head()

Unnamed: 0,term,naturalCourt,petitioner,respondent,caseOrigin,caseSource,lcDisposition,issueArea
0,1946,1301,198,172,51,29,2,8
1,1946,1301,198,172,51,29,2,8
2,1946,1301,198,172,51,29,2,8
3,1946,1301,198,172,51,29,2,8
4,1946,1301,198,172,51,29,2,8


For an intuitive understanding of the features above, check out the documentation here: http://scdb.wustl.edu/documentation.php?var=petitioner. All the above features are categorical instead of continuous (which means the numbers specify a category instead of having a numerical meaning). For an illustrative example, the "petitioner" variable includes:

1. attorney general of the United States, or his office
2. specified state board or department of education
7. state department or agency
etc
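
Because these codes are labels rather than quantities, a model that relies on numeric magnitude should not be given them directly. One common remedy is one-hot encoding, sketched below in plain Python on a handful of petitioner codes taken from the list above. The helper `one_hot` is hypothetical and only illustrates the idea; it is not how the notebook prepares its data.

```python
def one_hot(codes):
    """Return (sorted categories, one indicator row per input code)."""
    categories = sorted(set(codes))
    position = dict((code, i) for i, code in enumerate(categories))
    rows = []
    for code in codes:
        row = [0] * len(categories)
        row[position[code]] = 1  # exactly one indicator is set per row
        rows.append(row)
    return categories, rows

# petitioner codes: 1 = attorney general, 2 = state board of education, 7 = state agency
toy_codes = [1, 7, 2, 7]
toy_categories, toy_encoded = one_hot(toy_codes)
print(toy_categories)
print(toy_encoded)
```

After encoding, code 7 is no longer "seven times" code 1; each category simply gets its own indicator column.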

#### Advantages of Using Decision Tree Classifiers

Having an intuitive understanding of the meanings behind the variables is important and leads us to the idea of using a decision tree classifier. A distinct advantage of decision trees is that the decision at each node has an intuitive meaning and corresponds to querying along one feature axis at a time (e.g. is the petitioner an attorney general of the United States?). 

Furthermore, trees are easy to understand and interpret. We can look at the top node and figure out which feature it corresponds to, and conclude that this feature contributes the most information gain, i.e. is the most important/predictive feature. This makes it easy to verify whether our results make intuitive sense.
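
Information gain can be computed by hand for a single candidate split, which makes the "most predictive feature" claim easy to check. The sketch below does so with Shannon entropy on toy labels. The labels, the split masks and both helper functions are hypothetical. Note also that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default and only uses entropy when constructed with `criterion="entropy"`.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = float(len(labels))
    return -sum((n / total) * math.log(n / total, 2) for n in Counter(labels).values())

def information_gain(labels, mask):
    """Entropy reduction from splitting labels by a boolean mask (one flag per label)."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / float(len(labels))
    return entropy(labels) - weighted

# Hypothetical votes (1 = petitioner, 0 = respondent) and two candidate yes/no questions.
toy_labels = [1, 1, 0, 0]
perfect_split = [True, True, False, False]        # separates the classes completely
uninformative_split = [True, False, True, False]  # each branch is still a 50/50 mix

gain_perfect = information_gain(toy_labels, perfect_split)
gain_uninformative = information_gain(toy_labels, uninformative_split)
print(gain_perfect)
print(gain_uninformative)
```

The split that separates the two classes completely recovers the full entropy of the labels, while the split that leaves each branch evenly mixed gains nothing.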

We now show the process of fitting a decision tree for each justice before aggregating the votes.

### 2.1 Justice-Centered Decision Tree Classifiers

Ultimately, the feature that we want to predict is the vote for each justice.

In [29]:
from sklearn import tree

In [94]:
# newdf.majority refers to whether justice voted with the majority (1 for dissent, 2 for majority)
# newdf.partyWinning indicates winning party (0 for responding party, 1 for petitioning party, 2 for unclear)
# We use the above 2 features to infer which party the individual justice voted for
# NOTE: majority has around 4000 NaNs that we should filter out?
results = []
# ctr1 counts rows with an unclear winner, ctr2 counts rows with a missing majority value
ctr1, ctr2 = 0,0
for idx, x in enumerate(newdf.majority):
    #if decision is unclear, append 2 to results (in reality, apparently there aren't ANY 2s)
    if newdf.partyWinning[idx] == 2:
        results.append(2)
        ctr1 += 1
        #move on to the next row; a break here would stop the whole loop and leave results too short
        continue
    #if justice voted in the majority
    if x == 2:
        results.append(newdf.partyWinning[idx]) #results contains 0/1
    #if justice voted in the minority
    elif x == 1:
        results.append(1 - newdf.partyWinning[idx]) #results contains 0/1
    else:
        #majority is missing (NaN): default to 1 for now - need to clean this up
        results.append(1)
        ctr2 += 1

In [99]:
# these are our target values; wrap the list in a Series so that pandas can concatenate it column-wise
pd.concat([newsmalldf, pd.Series(results, name="results", index=newsmalldf.index)], axis=1)

In [85]:
# drop rows where any column value is NaN - dealing with missing data
newsmalldf.dropna(axis=0).head()

Unnamed: 0,term,naturalCourt,petitioner,respondent,caseOrigin,caseSource,lcDisposition,issueArea
0,1946,1301,198,172,51,29,2,8
1,1946,1301,198,172,51,29,2,8
2,1946,1301,198,172,51,29,2,8
3,1946,1301,198,172,51,29,2,8
4,1946,1301,198,172,51,29,2,8


In [100]:
len(newsmalldf), len(results)

(114895, 114895)

In [104]:
# work on an explicit copy so that the new column is not written to a slice of newdf
newsmalldf = newsmalldf.copy()
newsmalldf["results"] = pd.Series(results, index=newsmalldf.index)


In [105]:
newsmalldf

Unnamed: 0,term,naturalCourt,petitioner,respondent,caseOrigin,caseSource,lcDisposition,issueArea,results
0,1946,1301,198,172,51,29,2,8,0
1,1946,1301,198,172,51,29,2,8,1
2,1946,1301,198,172,51,29,2,8,1
3,1946,1301,198,172,51,29,2,8,1
4,1946,1301,198,172,51,29,2,8,1
5,1946,1301,198,172,51,29,2,8,1
6,1946,1301,198,172,51,29,2,8,1
7,1946,1301,198,172,51,29,2,8,1
8,1946,1301,198,172,51,29,2,8,1
9,1946,1301,100,27,123,30,2,1,0


In [79]:
# scikit-learn trees reject NaN input, so drop incomplete rows before fitting
clean = newsmalldf.dropna(axis=0)
X = clean.drop("results", axis=1)
y = clean["results"]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,y)