# CS109 Project - The Court Rules In Favor Of...
## Aidi Adnan Brian John (Team AABJ)

### Abstract
The purpose of this project is to predict votes of Supreme Court justices using oral argument transcripts. Studies in linguistics and psychology, as well as common sense, dictates that the word choices that people make convey crucial information about their beliefs and intentions with regard to issues. Rather than use precedents or formal analysis of the law to predict Supreme Court decisions, we attempt to extract essential emotional features of oral arguments made by justices and advocates in the court. Using aggregate data from 1946 to present

### Data
Oral Argument Transcripts - obtained from http://www.supremecourt.gov/oral_arguments/argument_transcript.aspx. Transcripts are made available on the day of court hearing.
Justice Vote Counts/Case Information - obtained from the Supreme Court Database.

# 1. Data Cleaning and Preparation

In [1]:
import string
import re
import numpy as np
import pandas as pd
import operator
import os
import sys
import io
import collections

We used a python script (scraper.py) to first scrape the pdfs from the Supreme Court Justice Website (but didn't upload those to the repository, because we ultimately wanted to use text files in our process). We then used a script to convert the pdf files to text files, but not before removing the last 10 pages which were reserved as an index for certain words. 

In [2]:
# gather all txt files, first get the path to the data directory
# then list the files and filter out all non-txt files
curPath = os.getcwd()
dataPath = curPath + '/data/'
fileList = os.listdir(dataPath)
fileExt = ".txt"
txtFiles = filter(lambda f : f[-4:] == fileExt, fileList)
txtFiles = map(lambda f : dataPath + f, txtFiles)

We wrote a parser to extract the names of the petitioner and respondant attorneys from the first 2 pages of the converted text document. An example of list of petitioner and respondant speakers, taken from the example case in 2014 of Johnson v United States (docket number 13-7120) is:

Katherine M. Menendez, ESQ., Minneapolis, Minn.; on behalf of Petitioner
Michael R. Dreeben, ESQ., Deputy Solicitor General, Department of Justice, Washington D.C.; on behalf of Respondent

To get these speakers, we write a function that uses a regular expression to split the lines based on new line and checks whether there is a name in the line. If we find a name, then we check if that name was listed as a petitoner or respondent at the beginning of the text.

In [802]:
def getPetitionersAndRespondents(text):
    '''
    Inputs:
    text : a transcript in its raw form, without having run cleanTextMaker

    Returns:
    pet_speakers, res_speakers, other_speakers
    the petitoner speakers, the respondent speakers, and any other speakers as a list
    '''
    #get portion of transcript between APPEARANCES and CONTENTS that specifies speakers for petitioners/respondents
    start = text.find('APPEARANCES:') + len('APPEARNACES')
    end = text.find('C O N T E N T S')
    speakers_text = text[start:end]
    split_speakers_text = re.split('\.[ ]*\n', speakers_text)
    #for each speaker, get name (capitalized) and side (Pet/Res) he/she is speaking for
    pet_speakers, res_speakers, other_speakers = [], [], []
    for speaker in split_speakers_text:
        name = speaker.strip().split(',')[0]
        #search for first index of capitalized word (which will be start of speaker name)
        start = 0
        for idx, char in enumerate(name):
            if str.isupper(char):
                start = idx
                break
        #actual name to be appended to correct list
        name = name[start:]
        
        #if words Petition, Plaintiff, etc occur in speaker blurb, speaker belongs to Pet
        if any(x in speaker for x in ['etition' , 'ppellant', 'emand', 'evers', 'laintiff']):
            pet_speakers.append(name)
        #otherwise if words Respondent, Defendant, etc occur, speaker belongs to Res
        elif any(x in speaker for x in ['espond' , 'ppellee', 'efendant']):
            res_speakers.append(name)
        #otherwise if neither side is specified in blurb, speaking belongs to Other
        elif 'neither' in speaker:
            other_speakers.append(name)
    return pet_speakers, res_speakers, other_speakers

The following function used to generate a regular expression for later use based on "$TITLE. $LASTNAME"; however, that pattern ultimately ended up not being used in some of the cases, and we abandonded this format in favor of just using the last name to generate the regular expression that we would ultimately use to separate the text into which speaker was responsible for a portion of text

In [7]:
# generate a list of regular expressions to split the text on
def generateRES(nameList, plebe):
    """
    Inputs:
    nameList: a list of strings of the names
    plebe: a boolean determining whether or not the list is of justices or not
    
    Returns:
    A list of regular expressions for each of the names that work for supreme
    court case transcripts
    """
    retList = []
    for name in nameList:
        address = ""
        if plebe:
            words = name.split(' ')
            # first term is the title, last
            # word is the last name
            address = words[-1]
            retList.append(address)
        else:
            address = "JUSTICE %s" % name
            address2 = "CHIEF JUSTICE %s" % name
            retList.append(address)
            retList.append(address2)
    return retList

In [8]:
def getJusticeNames(text):
    """
    Inputs:
    text is the raw text of a transcript
    
    Returns:
    A set of the names of Justices mentioned by name in the transcript
    """
    index = 0
    retList = []
    while index < len(text):
        index = text.find("JUSTICE", index)
        if index == -1:
            break
        index += 8 # because length of JUSTICE is 7, plus length of the space
        prevIndex = index
        while text[index] != ':':
            index +=1
        retList.append(text[prevIndex:index])
    return list(set(retList))

The general flow of court proceedings is that the Petitioner attornies make their oral argument, followed by the Respondent attornies, before we hear the rebuttal argument of the Petitioners again. Throughout all proceedings, Justices are free to interject with questions and statements of their own. The below function extracts the main argument portion of the oral transcripts, which is the meat of the proceedings that we are interested in conducting analysis on. 

In [10]:
def get_argument_portion(text):
    '''
    Inputs:
    Raw text of a transcript as a string
    
    Returns:
    The argument portion of the case as a string
    '''
    #start and end defines bounds of argument portion of text
    start = text.find('P R O C E E D I N G S')
    end = text.rfind('Whereupon')
    return text[start:end]

In [1036]:
def countWords(s):
    '''
    Inputs:
    s: string of words
    
    Returns:
    An integer counting the number of the words in s
    '''
    s = s.split()
    non_words = ['-', '--']
    return sum([x not in non_words for x in s])

In [13]:
def modify_speaker_names(speakers):
    '''
    Inputs:
    speakers: a list of speakers as strings
    
    Returns:
    A list of the speaker with a colon appended onto them,
    corresponding to how they appear in the transcripts
    '''
    return map(lambda x: x+': ', speakers)

The transcripts contain a lot of line numbers as well as linebreaks in between sentences, so we want to remove those before we try and do any analysis on them.

In [14]:
def cleanTextMaker(text):
    '''
    Inputs:
    text: a raw text with newlines and numbers, as is usual in the transcripts
    
    Returns:
    A file with newlines and numbers scrubbed
    '''
    text_arr=text.splitlines()
    text_clean=[]
    for each in text_arr:
        if each != '':
            try:
                int(each)
            except ValueError: #assummption: if the item only has integers, it is a line number.
                text_clean.append(each)
    out_text=' '.join(text_clean)
    return out_text

In [16]:
def total_wordcount(text):
    '''
    Inputs:
    text: some raw text of a transcript
    
    Returns:
    A dictionary of the the number of words each speaker in the text spoke
    '''
    
    arg_text = get_argument_portion(text)
    #keeps track of current speaker
    current_speaker = 'N/A'
    #clean argument text split by instances where speakers change
    clean_argument = cleanTextMaker(arg_text)
    

    # first get the names of the judges and speakers
    pet_speakers, res_speakers, other_speakers = get_petitioners_and_respondents(text)
    justiceLeague = getJusticeNames(clean_argument)
    # create the regular expression for the justices and the plebes
    # need to also add the justice speaker
    JLList = generateRES(justiceLeague, False)
    plebeList = pet_speakers + res_speakers + other_speakers
    plebeRE = generateRES(plebeList, True)
    finREList = ["QUESTION"]
    finREList += plebeRE + JLList
    
    # speakers are indicated by their name with a colon appended to the name
    finREList = map(lambda name : name + ":", finREList)
    RE = '('  + '|'.join(finREList) + ')'
    
    split_argument = re.split(RE, clean_argument)
    all_speakers = finREList
    
    #num_words is a dictionary that maps all speaker names to number of words they spoke
    num_words = dict(zip(all_speakers + [current_speaker], [0] * (len(all_speakers)+1)))
    
    #iterate through split argument, accumulating word counts for all speakers
    for s in split_argument:
        #if split chunk signifies change in speaker
        if s in all_speakers:
            current_speaker = s
        #if split chunk is part of speech of current speaker, append to word count
        else:
            num_words[current_speaker] = num_words[current_speaker] + count_words(s)
    
    return num_words

In [18]:
def wordCounter(text):
    """
    counts number of times each word appears in a file
    
    Inputs:
    text: raw text of transcript
    
    Returns:
    a dictionary of (word : times) it appears
    """
    wordCount={}
    for word in text.split():
        # unfortunately, isalpha does discount some real words
        # like those with apostrophes, and words with question
        # marks at the end of them
        if word.lower() not in wordCount and word.isalpha():
            wordCount[word.lower()] = 1
        elif word.isalpha():
            wordCount[word.lower()] += 1
    return wordCount

In [142]:
def topWords(diction, num, verbose=False):
    """
    Inputs:
    dictionary: of the format word : count for each entry
    num: an integer
    verbose: whether printing of the words is desired
    
    returns: 
    the top num words in a dictionary
    
    """
    d = collections.Counter(diction)
    if verbose:
        for k, v in d.most_common(numTop):
            print '%s: %i' % (k, v)
    return d.most_common(num)

In [341]:
def splitData(X, fraction_train=9.0 / 10.0):
    """
    Deterministically splits a vector
    
    Inputs:
    X: a one dimensional vector
    fraction_train: the fraction of data that is desired to be train
    
    Returns:
    the train portion and test portions of the vector
    """
    end_train = int(len(X) * fraction_train)
    X_train = X[0:end_train]
    X_test = X[end_train:]
    return X_train, X_test

def splitTrainTest(X, Y, fraction_train = 9.0 / 10.0):
    """
    Inputs:
    X, Y : vectors to be split
    
    Returns:
    Each vector split into train and test
    """
    X_train, X_test = splitData(X, fraction_train)
    Y_train, Y_test = splitData(Y, fraction_train)
    return X_train, Y_train, X_test, Y_test

In [27]:
def textToMat(vizer, docList):
    """
    turns documents into a tfidf matrix
    
    Inputs:
    vizer: a vectorizer on documents
    docList: a list of preprocessed documents
    
    Returns:
    a list of matricies that contain the frequencies of words in
    the documents based on the vectorizer passed
    """
    retList = []
    for doc in docList:
        resMat = vizer.transform(doc).todense()
        retList.append(resMat)
    return retList

## Stage 1: Pre-processing

#### Text representation

Document representation is the first step of our analysis since there are a variety of ways to represent a transcript, which in its raw form is a simple string of texts. We use a pre-processing technique that reduces the complexity of the documents and makes them easier to handle, which is to transform the oral transcripts from the full text version to a document vector/sparse matrix. Every text document is represented as a vector of term weights (word features) from a set of terms (dictionary), where each term occurs at least once in a certain critical number of documents.

#### High dimensionality of text representation

A major characteristic of document classification problems is the extremely high dimensionality of data where the number of potential features often exceeds the number of training documents. Dimensionality reduction is thus critical to allow for efficient data manipulation. Irrelevant and redundant features often degrade performance of classification algorithms both in accuracy and speed, and also tends to fall into the all-common trap of overfitting.

#### Data cleaning

Pre-processing of text data involves tokenization of raw text, stop words removal, stemming and eliminating as much as possible the language dependent factors. Brief explanations of these preprocessing stages are as fllows:

1. Sentence splitting: identifying sentence boundaries in documents
2. Tokenization: partitioning documents that are initially treated as a string into a list of tokens
3. Stop words removal: removing common English words like "the", "a", etc
4. Stemming: reducing derived words to its most root form, example happiest -> happy
5. Noisy data: cleaning noisy data spilt over from pdf to text conversion, including inclusion of line numbers, page breaks, etc
6. Text representation: determining whether we should use words, phrases or entire sentences as a "token" for analysis

### Feature extraction vs. feature selection

After feature extraction, feature selection was conducted to construct a vector space of appropriate dimensionality, which improves the scalability, efficiency and accuracy of our classification algorithm. The main idea of feature selection is to choose a subset of features from the original texts, with subset determined by obtaining features with the highest score according to some predetermined measure of feature importance.

We attempt two different approaches for feature selection in our analysis:
1. Wrappers: 
2. Filters: as opposed to wrappers, filters can be conducted independently of the actual classification algorithm, and hence is less computationally expensive. Filters use an evaluation metric that measures ability of a feature to differentiate each class, hence choosing the most discriminative and valuable features. The filter of our choice is a technique called frequency document-inverse document frequency, as shown below.

### Frequency Document-Inverse Document Frequency (TF-IDF)

Frequency document–inverse document frequency (tf-idf), is a powerful method to evaluate how important is a word in a document, and captures the relative relevance among words. It converts the textual representation of information into a Vector-Space Model or a sparse matrix representation.

In [766]:
# used to generated tfidf sparse matricies for the importance of words in documents
from sklearn.feature_extraction.text import TfidfVectorizer
# internal utilities used to replicate functionality of truncated_svd
from sklearn.utils import as_float_array
from sklearn.utils.extmath import randomized_svd
# stemmer of words
import snowballstemmer

In [28]:
def getAllRawText(txtFiles):
    '''
    Inputs:
    txtFiles: a list of paths of textfiles
    
    Returns:
    list of all uncleaned transcripts in raw text form.
    '''
    return [(open(txtFiles[i]).read()) for i in range(len(txtFiles))]

In [29]:
def getAllRawCleanText(txtFiles):
    '''
    Inputs:
    txtFiles: a list of paths of textfiles
    
    Returns:
    list of all cleaned transcripts in raw text form.
    '''
    return [(cleanTextMaker(open(txtFiles[i]).read())) for i in range(len(txtFiles))]

In [30]:
allRawText = getAllRawText(txtFiles)
allRawCleanText = getAllRawCleanText(txtFiles)

Similar to the approach we took when generating the regular expressions for parsing, we want to have a way to convert the names scraped from the documents into a name that we can generally split the speechs by, and thus we adopt the convention of using the last name.

In [31]:
def toColloquialName(formal_name):
    ret = formal_name.split()
    return ret[-1]

In [33]:
def getBagOfWords(txtFiles):
    """
    Gets bag of words dictionary for every document in txtFiles
    
    Inputs:
    txtFiles: lists of paths of textfiles
    
    Returns:
    list of dictionary of word : number of times word appears in transcript
    """
    retList = []
    for File in txtFiles:
        cur = open(File)
        textual = cur.read()
        cleanTextual = wordCounter(textual)
        retList.append(cleanTextual)
        cur.close()
    return retList

In [34]:
# transforms bag of words into td-idf weighted
bow = getBagOfWords(txtFiles)

# 2. Latent Sentiment Indexing (LSI)

When running LSI, we do not want to possibly split importance among words that are actually very similar (such as "stealing" and "steal"), so we stem words by removing the suffix, bring down the words to a root, or 'stem' that we can assign importance to. There was an issue with different encodings: the txt documents are stored in Latin1 encoding when converted from PDFs to permit earlier functions to work, but when iterating through words there is some issues with how the words are decoded and passed to the stemmer. As a result, we need to manually convert incompatible strings to a tractable format. This conversion was not possible on a single document in our entire database, so we ultimately had to remove it from our database (if we didn't remove it, there would be an issue when we used the docketId from the document to index into a merged dataframe later on, we would have one more response variable than predictor variables). Stemming naturally can leave words as a non english word, or may incorrectly mistem a word. Nothing short of a large dictionary containing the stem of every possible word would accurately perform the stemming, so we are forced to accept this aggressive trimming.

In [295]:
def destem(allRawText):
    """
    stems all words from a list of documents
    documents are assumed to be stored in Latin1 encoding
    there is one document that is not tractable so we exclude it
    uses snowballstemmer
    required to decode string to avoid UnicodeDecodeErrors
    
    Inputs:
    a list of raw text files that have been cleaned
    
    Returns:
    a list of text files that have been stemmed word by word
    """
    stemmer = snowballstemmer.stemmer('english')
    stemmedList = []
    for text in allRawText:
        try:
            temp = stemmer.stemWords(text.split())
            for i in xrange(len(temp)):
                if str(type(temp[i])) == "<type 'str'>":
                    temp[i] = temp[i].decode('Latin1')
            res = ' '.join(temp)
            stemmedList.append(res)
        except UnicodeDecodeError:
            # literally just one document
            pass
    return stemmedList

In [155]:
def getDocketNo(text):
    '''
    Input: the text of a transcript
    
    Returns:
    the docket number of the case
    '''
    cleantext = cleanTextMaker(text)
    docketIdx = cleantext.find("No.")
    return cleantext[docketIdx+4:].split()[0]

In [264]:
def get_file_dict(fileList, fileExt='.txt'):
    '''
    This function takes the fileList and returns a list of dictionaries of the format 
    {'case_number': case_number, 'full_text': full_text}
    
    Inputs:
    fileList: list of the paths of the textfiles
    fileExt: optional parameter for the type of file
    
    Returns:
    dictionary of the filename:text
    '''
    fileDict=[]
    fields=['docket', 'full_text']
    txtFiles_filter = filter(lambda f : f[-4:] == fileExt, fileList)
    for each in txtFiles_filter:
        name_str=each[4:-4]
        try:
            indexx=name_str.index('_')
            docketNum=name_str[:indexx]
        except ValueError:
            docketNum=name_str
        cur = open(dataPath+each)
        textual = cur.read()
        cur.close()
        tuple_=(docketNum, textual)
        fileDict.append(dict(zip(fields, tuple_)))
    return fileDict

We want to merge the text files with the supreme court database so that we can easily associate the text files with the docketId and get the decision of the cases. However, the supreme court database unfortunately does not contain information from beyond 2014, so we lose out on several transcripts in the merging process. 

In [266]:
fileDict=get_file_dict(fileList)
txtdf = pd.DataFrame(fileDict)
casedf = pd.read_csv('supremeCourtDb.csv')
merged = pd.merge(left=txtdf, right=casedf, how='inner', left_on='docket', right_on='docket')

(934, 54)


As referenced earlier, there is a single document that is not tractable to stemming due to codec issues, so we merely drop it for being insolent.

In [325]:
# drop problematic docket
merged = merged[merged.docket != '08-351']

In [321]:
merged.shape

(931, 54)

## 2.1 Splitting prepared documents into petitioner and respondent speeches

Ok, so we have about 930 documents to work with. We now want to find a way to gather the texts of what the petitioners and the respondents say. We adapt a function we wrote earlier that counted the number of words that each speaker said and use it to gather the texts that each party is responsible for. We first gather them by speaker then gather them by the group that they are a part of.

In [151]:
def splitTextPetRes(text):
    '''
    Input: 
    text: raw text document (transcript, recently opened)
        
    Returns:
    a dictionary of speaker: the words they said
    a dictionary of group: the words they said
    '''
    arg_text = get_argument_portion(text)
    #keeps track of current speaker
    current_speaker = 'N/A'
    clean_argument = cleanTextMaker(arg_text)

    # first get the names of the judges and speakers
    pet_speakers, res_speakers, _ = get_petitioners_and_respondents(text)
    
    # create the regular expression for the justices and the plebes
    petList = generateRES(pet_speakers, True)
    resList = generateRES(res_speakers, True)
    petList = map(lambda name : name + ":", petList)
    resList = map(lambda name : name + ":", resList)
    all_speakers = (petList + resList)
    
    RE = '('  + '|'.join(all_speakers) + ')'
    
    # split argument portion by times elements in plebeList (e.x. MR. FARR: or EUGENE: appears)
    split_argument = re.split(RE, clean_argument)
    
    # dictionary keyed by speaker, with value actual speech (in string format)
    speech = dict(zip(all_speakers + [current_speaker], [""] * (len(all_speakers)+1)))
    
    #iterate through split argument, accumulating speeches for all speakers
    for s in split_argument:
        if s in all_speakers:
            current_speaker = s
        #if split chunk is part of speech of current speaker, append to word count
        else:
            speech[current_speaker] += s

    #combine all pet and res speakers, if multiple
    retDict = {"resSpeakers":"", "petSpeakers":""}
    
    for rSpeaker in resList:
        retDict["resSpeakers"] += speech[rSpeaker]
    for pSpeaker in petList:
        retDict["petSpeakers"] += speech[pSpeaker]

    return speech, retDict

Because scraping is not perfect, sometimes we fail to gather the names of petitioners or respondents. In this case, we do not want to add that case's speech to the database because then we wouldnt be able to compare either the respondents or the petitioners against an empty speech. Just would not be fair!

In [326]:
# iterate through merged.full_text, trying to fill in merged.pet_speech and merged.res_speech
allPetSpeeches = []
allResSpeeches = []
allDocketNo = []
allDecisions = []
for row in merged.iterrows():
    speech, retDict = splitTextPetRes(row[1]["full_text"])
    petSpeech = retDict["petSpeakers"]
    resSpeech = retDict["resSpeakers"]
    # if either petSpeech or resSpeech is an empty string, do not add to workable dataset
    if petSpeech and resSpeech:
        allPetSpeeches.append(petSpeech)
        allResSpeeches.append(resSpeech)
        allDocketNo.append(row[1]["docket"])
        allDecisions.append(row[1]["partyWinning"])

In [330]:
# some oral transcripts have empty petitioner or respondent speeches due to dirty scraping of pdf files
# for example, get_petitioners_and_respondents sometimes does not scrape properly due to bad formatting
len(allPetSpeeches), len(allResSpeeches), len(allDecisions)

(886, 886, 886)

We now stem all of the words in the document files. This takes a long time because of the inconsistent encodings of strings as mentioned earlier, necessitating iterating through each word and manually trying to convert it. 

In [328]:
# takes long to run
allDestemmedPetSpeeches = destem(allPetSpeeches)
allDestemmedResSpeeches = destem(allResSpeeches)

In [329]:
len(allDestemmedPetSpeeches), len(allDestemmedResSpeeches)

(886, 886)

## 2.2 Applying term-frequency inverse document frequency vectorizer

Finally, we are ready to pass each set of documents to the vectorizer that will count the number of words and assign them based on Term-Frequency Inverse Document Frequency (tfidf), which first calculates the raw term frequency (aka the number of times that the word appears in the document) and then multiplies it by the inverse document frequency (a global weighing function):
$$g_i = \log_2 \frac{n}{1 + df_i}$$

Where $g_i$ is the weight for term $i$, $n$ is the number of times a word appears in a document, and $df_i$ is the number of documents in which $i$ appears. This properly penalizes words that appear frequently in many documents ('the', 'of', etc). We take $g_i \times f_i$ (where $f_i$ is the raw frequency) as the tfidf statistic for that word in a given document. (credit to: https://en.wikipedia.org/wiki/Latent_semantic_indexing)

In [331]:
# produallStemmedArgumentsemmedArgumentsvectorizer that will calculate the importance of words
vectorizer1 = TfidfVectorizer(min_df=1, norm='l2', use_idf=True, stop_words='english', encoding='Latin1', analyzer='word', token_pattern='\w+')
vectorizer2 = TfidfVectorizer(min_df=1, norm='l2', use_idf=True, stop_words='english', encoding='Latin1', analyzer='word', token_pattern='\w+')

When we run SVD on the matrix generated by the vectorizers, we get back 3 matrices. The first matrix is a matrix that represents the themes in the documents, the second is a diagonal matrix of singular values representing the relative importance of each theme overall, and the third matrix is representing the importance of each word in the themes. We can run logistic regression on just the first matrix, and specifically the different between the matrix of the respondents and the petitioners, to represent the difference in how strongly the parties speak about a certain theme within a document.

## 2.3 Running Singular Vector Decomposition

In [143]:
def runSVD(documentList, vizer, numComponents=25, nIter=5):
    """
    takes a list of documents and a vectorizer
    converts document list to a matrix of frequencies 
        (as determined by the vectorizer) of document by word
    takes matrix and runs truncated SVD on it to generate
    a matrix that consists of themes (T) in each document
    overall importance of the word (S)
    and a matrix that consists of how important each word
    is in the theme (DT)
    
    this code is partially derived from sklearn's
    truncated_svd function (which doesn't return
    all of the matricies we are interested in)
    """
    mat = vizer.fit_transform(documentList)
    X = as_float_array(mat, copy=False)
    # T is the term by concept matrix
    # S the singular value matrix
    # D is the concept-document matrix
    T, S, DT = randomized_svd(X, numComponents, n_iter=nIter)
    return T, S, DT

In [354]:
tPet, sPet, dTPet = runSVD(allDestemmedPetSpeeches, vectorizer1, numComponents=25)
tRes, sRes, dTRes = runSVD(allDestemmedResSpeeches, vectorizer2, numComponents=25)

In [332]:
tPet.shape, sPet.shape, dTPet.shape

((886, 25), (25,), (25, 29308))

In [355]:
tRes.shape, sRes.shape, dTRes.shape

((886, 25), (25,), (25, 28559))

In [358]:
tDiff = tPet - tRes

## 2.4 Training logistic regression classifier on petitioner and respondent differences

In [333]:
# run logistic regression on D x numTopics matrix of independent variables, vs. 0/1 result vector
from sklearn.linear_model import LogisticRegression

In [348]:
clflog = LogisticRegression()

In [349]:
"""
Function
--------
cv_optimize

Inputs
------
clf : an instance of a scikit-learn classifier
parameters: a parameter grid dictionary thats passed to GridSearchCV (see above)
X: a samples-features matrix in the scikit-learn style
y: the response vectors of 1s and 0s (+ives and -ives)
n_folds: the number of cross-validation folds (default 5)
score_func: a score function we might want to pass (default python None)

Returns
-------
The best estimator from the GridSearchCV, after the GridSearchCV has been used to
fit the model.
"""
def cv_optimize(clf, parameters, X, y, n_folds=5):
    clf = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    clf.fit(X,y)
    return clf.best_estimator_

In [343]:
from sklearn.cross_validation import KFold
from sklearn.metrics import accuracy_score
from sklearn.grid_search import GridSearchCV

## 2.5 Testing cross-validation and evaluating results

In [364]:
xTrain, yTrain, xTest, yTest = splitTrainTest(tDiff, allDecisions)
clflogopt = cv_optimize(clflog, {"C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, xTrain, yTrain, n_folds=5)

In [365]:
clflogopt

LogisticRegression(C=0.0001, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [366]:
training_accuracy = clflogopt.score(xTrain, yTrain)
test_accuracy = clflogopt.score(xTest, yTest)
print training_accuracy, test_accuracy

0.692597239649 0.651685393258


# 3. Sentiment Analysis

In the above analysis, we have focused on parts of the oral transcripts corresponding to attorney speeches, which means we are essentially ignoring an equally valuable portion of information that we can glean from these transcripts: the judge's responses and questions to these attorneys. In this section, we attempt to conduct Natural Language Processing on judge's speeches and other salient features of the text that were not included in the Latent Semantic Analysis above, which might yield interesting results.

## 3.1 Cleaning dataset
Some of our oral transcripts do not discriminate between individual justice's speeches, instead using "QUESTION:" in place of all justice's speeches. We are not unable to use these, and for consistency's sake, delete all these transcripts from our working dataset. It turns out that this was not too consequential since only transcripts from 2001-2002 had the "QUESTION:" ambiguity problem.

In [774]:
#find qualifying rows without "QUESTION:"
qualifyingRows = []
for rowNo in range(len(merged)):
    # find appropriate row and check whether full_text contains QUESTION: - if so, delete from database
    row = merged.iloc[rowNo, :]
    if row["full_text"].find("QUESTION:") == -1:
        qualifyingRows.append(rowNo)

In [775]:
mergedJudges = merged.iloc[qualifying_rows, :] 
mergedJudges.head()

Unnamed: 0,docket,full_text,caseId,docketId,caseIssuesId,voteId,dateDecision,decisionType,usCite,sctCite,...,authorityDecision1,authorityDecision2,lawType,lawSupp,lawMinor,majOpinWriter,majOpinAssigner,splitVote,majVotes,minVotes
158,02-1472,1\n\nIN THE SUPREME COURT OF THE UNITED STATES...,2004-025,2004-025-01,2004-025-01-01,2004-025-01-01-01,3/1/05,1,543 U.S. 631,125 S. Ct. 1172,...,4,,6,600,25 U.S.C. � 450,110,103,1,8,0
169,02-1672,1\n2\n\nIN THE SUPREME COURT OF THE UNITED STA...,2004-033,2004-033-01,2004-033-01-01,2004-033-01-01-01,3/29/05,1,544 U.S. 167,125 S. Ct. 1497,...,4,,3,322,,104,103,1,5,4
216,03-10198,1\n\nIN THE SUPREME COURT OF THE UNITED STATES...,2004-073,2004-073-01,2004-073-01-01,2004-073-01-01-01,6/23/05,1,545 U.S. 605,125 S. Ct. 2582,...,7,2.0,2,231,,109,102,1,7,2
218,03-1039,1\n\nIN THE SUPREME COURT OF THE UNITED STATES...,2004-032,2004-032-01,2004-032-01-01,2004-032-01-01-01,3/22/05,1,544 U.S. 133,125 S. Ct. 1432,...,7,4.0,3,341,,106,104,1,5,3
220,03-1116,1\n\nIN THE SUPREME COURT OF THE UNITED STATES...,2004-045,2004-045-01,2004-045-01-01,2004-045-01-01-01,5/16/05,1,544 U.S. 460,125 S. Ct. 1885,...,2,,1,111,,106,105,1,5,4


## 3.2 Feature Selection and Extraction

Now that we have our cleaned dataframe with information about raw text and decisions for every oral transcript, we want to find the most appropriate features for predictors - we're on the stage of feature extraction again. After much experimentation and linguistical analysis on court proceedings in particular, we identified the following features that we want to introduce as predictors to run our classification algorithm on. 

For each oral transcript, we want to identify:
1. Number of words judges uttered to petitioner's and respondent's sides:
        Usage: judges_word_count_split(text)
2. Number of times a judge interrupted petitioner or respondent attorneys: 
        Usage: total_interruptions_pet, total_interruptions_ret = get_total_interruptions(text)
3. Sentiment analysis on words judges uttered to petitioner's and respondent's sides: 
        Usage: judges_speech_split(text) and sentiment analysis using existing dictionary of positive/negative
        words

We will look at each of these features one by one.

## 3.3 Exploring feature #1: Number of words judges directed at each side

In [1032]:
def getSplitArgument(text):
    '''
    Gets split argument by RE generated with list of all speakers.
    '''
    # first get the names of the judges and speakers
    petSpeakers, resSpeakers, _ = getPetitionersAndRespondents(text)
    cleanArg = cleanTextMaker(text)
    justiceSpeakers = getJusticeNames(cleanArg)
    
    # creates duplicate "JUSTICE" and "CHIEF JUSTICE"; creates RE for pet/res/justices
    justiceRE = generateRES(justiceSpeakers, False)
    petRE = generateRES(petSpeakers, True)
    resRE = generateRES(resSpeakers, True)
    
    # appends colons to all REs
    justiceSpeakersColon = map(lambda name : name + ":", justiceRE)
    petSpeakersColon = map(lambda name : name + ":", petRE)
    resSpeakersColon = map(lambda name : name + ":", resRE)
    justiceSpeakersColon = map(lambda name : name + ":", justiceRE)
    
    # aggregates justice and attorney REs
    allSpeakers = ["QUESTION"]
    allSpeakers += (justiceRE + petRE + resRE)
    allSpeakersColon = map(lambda name : name + ":", allSpeakers)
    
    # finally, creates regular expression for the justices and attorneys (i.e. all_speakers) to split text on
    RE = '('  + '|'.join(allSpeakersColon) + ')'
    
    # splits argument portion according to generated regular expression above that consists of all possible speakers,
    # plebes and judges alike
    splitArg = re.split(RE, cleanArg)
    
    return splitArg, allSpeakersColon, petSpeakersColon, resSpeakersColon, justiceSpeakersColon

In [1058]:
def speakersWordCount(text):
    '''
    FINAL FEATURE # 1:
    Total number of words spoken by each justice in the entire transcript.
    
    Returns:
    num_words: a dictionary with key being name of justice/attorney and value being total number of words they
    spoke in total throughout argument.  
    '''
    # splits argument
    splitArg, allSpeakersColon, petSpeakersColon, resSpeakersColon, justiceSpeakersColon = getSplitArgument(text)
    
    # num_words is a dictionary that maps all speaker names to number of words they spoke
    currSpeaker = 'NA:'
    prevSpeaker = 'NA:'
    numWordsToPet = dict(zip(allSpeakersColon + [currSpeaker], [0] * (len(allSpeakersColon)+1)))
    numWordsToRes = dict(zip(allSpeakersColon + [currSpeaker], [0] * (len(allSpeakersColon)+1)))
    
    # iterate through split argument, accumulating word counts for all speakers
    for s in splitArg:
        print currSpeaker in justiceSpeakersColon
        print prevSpeaker in petSpeakersColon
        print prevSpeaker in resSpeakersColon
        #if split chunk signifies change in speaker
        if s in allSpeakersColon:
            prevSpeaker = currSpeaker
            currSpeaker = s
        #if split chunk is part of speech of current speaker, append to word count
        else:
            if currSpeaker in justiceSpeakersColon and prevSpeaker in petSpeakersColon:
                numWordsToPet[currSpeaker] += countWords(s)
            elif currSpeaker in justiceSpeakersColon and prevSpeaker in resSpeakersColon:
                numWordsToRes[currSpeaker] += countWords(s)

    return numWordsToPet, numWordsToRes, petSpeakersColon, resSpeakersColon

In [1048]:
def deleteVal(dictionary,val):
    '''
    Get rid of all items in dictionary with value being a specific val: 
    E.x. when no words spoken means that we can ignore them
    '''
    for k,v in dictionary.items():
        if v == val:
            del dictionary[k]
    return dictionary

In [1053]:
def featureOne(text):
    numWordsToPet, numWordsToRes, petSpeakersColon, resSpeakersColon = speakersWordCount(text)
    numWordsToPet = deleteVal(numWordsToPet,0)
    numWordsToRes = deleteVal(numWordsToRes,0)
    # clump together petitioners and respondents
    numWordsPet, numWordsRes = 0,0
    for s in petSpeakersColon:
        if s in numWords:
            numWordsPet += numWords[s]
    for s in resSpeakersColon:
        if s in numWords:
            numWordsRes += numWords[s]
    return numWordsPet, numWordsRes

In [734]:
# we can see from this illustrative example that the total number of words corresponds to roughly the sum of words
# each judge said to each side. There might have been words uttered to speakers neither on petitioner or respondent's
# side, or a prologue directed to the general audience (especially for the Chief Justice), which can be ignored for
# our purposes.

In [1054]:
# append NEW FEATURES to mergedJudges!!!
allNumWordsPet = map(lambda x: featureOne(x)[0], mergedJudges.full_text.values)
allNumWordsRes = map(lambda x: featureOne(x)[1], mergedJudges.full_text.values)
# mergedJudges["allNumWordsPet"] = allNumWordsPet
# mergedJudges["allNumWordsRes"] = allNumWordsRes

In [1057]:
featureOne(txt1)

(0, 0)

## 3.2. Exploring feature #2: Number of interruptions per side

In [726]:
def get_num_interruptions(text):
    '''
    FINAL FEATURE # 2 HELPER:
        Number of times a judge interrupted petitioner or respondent attorneys in the entire transcript.
    Output:
        num_interruptions: 
        a dictionary with key being name of justice/attorney speaking and value being total number of times that 
        speaker was interrupted
    '''
    # define the sign for an interruption at end of speech
    interruptions = ["-", "--"]
    
    # get split argument
    split_argument, all_speakers_with_colon, pet_speakers_with_colon, res_speakers_with_colon, \
        justice_speakers_with_colon = getSplitArgument(text)
    
    #num_words is a dictionary that maps all speaker names to number of words they spoke
    current_speaker = 'NA:'
    num_interruptions = dict(zip(all_speakers_with_colon + [current_speaker], [0] * (len(all_speakers_with_colon)+1)))
    
    #iterate through split argument, accumulating word counts for all speakers
    for s in split_argument:
        #if split chunk signifies change in speaker to a justice speaking!
        if s in all_speakers_with_colon:
            current_speaker = s
        #if split chunk is part of speech of current speaker, append to word count
        else:
            #if speech contains at least 2 words, check whether the last or second-last contians an interruption
            #the interruption could come as the 2nd last word when the last word is part of the first name of
            #next speaker
            if len(s.split()) >= 2:
                if s.split()[-1] in interruptions or s.split()[-2] in interruptions:
                    num_interruptions[current_speaker] += 1
    
    return num_interruptions, pet_speakers_with_colon, res_speakers_with_colon

In [728]:
num_interruptions, pet_speakers_with_colon, res_speakers_with_colon = get_num_interruptions(text)

In [731]:
# FINAL FEATURE #2 FUNCTION
def get_total_interruptions(text):
    num_interruptions, pet_speakers_with_colon, res_speakers_with_colon = get_num_interruptions(text)
    # now we need to clump petitioner and respondent interruptions together
    total_interruptions_pet = sum([num_interruptions[k] for k in pet_speakers_with_colon])
    total_interruptions_res = sum([num_interruptions[k] for k in res_speakers_with_colon])
    return total_interruptions_pet, total_interruptions_res

In [733]:
get_total_interruptions(text)

(15, 15)

In [1023]:
# again, append NEW FEATURES to mergedJudges!!!
allInterruptionsPet = map(lambda x: get_total_interruptions(x)[0], mergedJudges.full_text.values)
allInterruptionsRes = map(lambda x: get_total_interruptions(x)[1], mergedJudges.full_text.values)
mergedJudges["allInterruptionsPet"] = allInterruptionsPet
mergedJudges["allInterruptionsRes"] = allInterruptionsRes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


## Exploring feature #3: Sentiment analysis on judge's speeches

In [875]:
def judgesSpeechSplit(txt):
    '''
    FINAL FEATURE # 3:
        All words said by each justice to petitioners and respondents in the entire transcript.
    Output:
        num_words: a dictionary with key being name of justice/attorney and value being total number of words they
        spoke in total throughout argument.  
    '''
    # first get the names of the judges and speakers
    petSpeakers, resSpeakers, otherSpeakers = getPetitionersAndRespondents(txt)
    cleanArg = cleanTextMaker(txt)
    justiceSpeakers = getJusticeNames(cleanArg)
    
    # creates duplicate "JUSTICE" and "CHIEF JUSTICE"; creates RE for pet/res/justices
    justiceRE = generateRES(justiceSpeakers, False)
    petRE = generateRES(petSpeakers, True)
    resRE = generateRES(resSpeakers, True)
    
    # appends colons to all REs
    justiceSpeakersColon = map(lambda name : name + ":", justiceRE)
    petSpeakersColon = map(lambda name : name + ":", petRE)
    resSpeakersColon = map(lambda name : name + ":", resRE)

    # aggregates justice and attorney REs
    allSpeakers = ["QUESTION"]
    allSpeakers += (justiceRE + petRE + resRE)
    allSpeakersColon = map(lambda name : name + ":", allSpeakers)
    
    # finally, creates regular expression for the justices and attorneys (i.e. all_speakers) to split text on
    RE = '('  + '|'.join(allSpeakersColon) + ')'
    
    # splits argument portion according to generated regular expression above that consists of all possible speakers,
    # plebes and judges alike
    splitArg = re.split(RE, cleanArg)
    
    # num_words is a dictionary that maps all speaker names to number of words they spoke
    prevSpeaker = 'NA:'
    currSpeaker = 'NA:'
    wordsToPet = dict(zip(allSpeakersColon + [currSpeaker], [""] * (len(allSpeakersColon)+1)))
    wordsToRes = dict(zip(allSpeakersColon + [currSpeaker], [""] * (len(allSpeakersColon)+1)))
    
    # iterate through split argument, accumulating word counts for all speakers
    for s in splitArg:
        # if split chunk signifies change in speaker to a justice speaking!
        if s in allSpeakersColon:
            prevSpeaker = currSpeaker
            currSpeaker = s
        # if split chunk is part of speech of current speaker, append to word count
        else:
            if currSpeaker in justiceSpeakersColon and prevSpeaker in petSpeakersColon:
                wordsToPet[currSpeaker] += s
            elif currSpeaker in justiceSpeakersColon and prevSpeaker in resSpeakersColon:
                wordsToRes[currSpeaker] += s
    
    return wordsToPet, wordsToRes

In [876]:
def combineSpeeches(dictionary):
    # joins strings that are values of a particular dictionary (want to use on words_to_pet, words_to_res)
    return " ".join(dictionary.values()).strip()

In [877]:
def featureThreeHelper(txt):
    # we obtain dictionaries keyed by justice with values equivalent to the stitched together speeches of every time
    # that justice spoke up in the court proceedings
    wordsToPet,  wordsToRes = judgesSpeechSplit(txt)
    wordsToPet = deleteVal(wordsToPet, "")
    wordsToRes = deleteVal(wordsToRes, "")
    wordsToPet = combine_speeches(wordsToPet)
    wordsToRes = combine_speeches(wordsToRes)
    return wordsToPet, wordsToRes

In [880]:
wordsToPet, wordsToRes = judgesSpeechSplit(txt)

In [814]:
print "Here are the words judges spoke to the petitioner side: \n"
print wordsToPet[:500]
print "-------------------------"
print "And here are the words judges spoke to the respondents side: \n"
print wordsToRes[:500]

Here are the words judges spoke to the petitioner side: 

Well, did we take this case on the ground that he wasn't adequately advised or did we 1111 14th Street, NW Suite 400 Alderson Reporting Company 1-800-FOR-DEPO Washington, DC 20005 take the case on the ground that even if he were advised, he'd still have his right? MR.  What the Well, I -- I take it you would challenge the validity of the waiver even if he were advised? MR.  Absolutely. And even if he said, I hereby waive? MR.  How -- why is that prejudicial to him? did it but I can't plead guilt
-------------------------
And here are the words judges spoke to the respondents side: 

MR.  Because of the error in scoring? MR.  That -- that ultimately You take the position that in fact there was no error in scoring. MR.   Look, the -- imagine -- I'm just repeating what Justice Souter said. it's so obvious that there must be an obvious answer, but I haven't heard the answer. There must be - He knows Michigan law or his lawyer does. 

In [None]:
def featureThree(txt):
    petSAscore = sentimentAnalysis(words_to_pet)
    resSAscore = sentimentAnalysis(words_to_res)
    return petSAscore, resSAscore

### 3.3.1 Exploring Sentiment Analysis of Text

We use the pattern Python library with tools for scraping text and natural language processing. In the getParts function, we will implement the parsing of separate petitioners and respondents' text by tokenizing, parsing out the punctuation, removing stop words in English, and finally returning us information about nouns and descriptives that are most salient in the text.

In [763]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
from sklearn.feature_extraction import text 
stopwords = text.ENGLISH_STOP_WORDS
punctuation = list('.,;:!?()[]{}`''\"@#$^&*+-|=~_')

In [898]:
# adapted from hw5! :D
def getParts(thetext):
    nouns=[]
    descriptives=[]
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            #print token
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2

In [889]:
# get all judge's speeches to petitioners and respondents from our initial dataframe: mergedJudges
allToPetSpeeches = map(lambda t: featureThreeHelper(t)[0], mergedJudges.full_text.values)
allToResSpeeches = map(lambda t: featureThreeHelper(t)[1], mergedJudges.full_text.values)

In [891]:
len(allToPetSpeeches), len(allToResSpeeches)

(653, 653)

In [892]:
%%time
# get parsed speeches for all speeches to petitioners and respondents, separately
parsedPet = [getParts(t) for t in allToPetSpeeches]
parsedRes = [getParts(t) for t in allToResSpeeches]

CPU times: user 4min 46s, sys: 2.59 s, total: 4min 49s
Wall time: 4min 50s


In [997]:
# taking all the descriptives from the transcript
nbDataPet = [each[1] for each in parsedPet]
nbDataRes = [each[1] for each in parsedRes]

In [1004]:
# right now the list is flattened and duplicates are not dropped since we are counting frequency of appearance
flattenedPet = map(lambda x: list(itertools.chain(*x)), nbDataPet)
flattenedRes = map(lambda x: list(itertools.chain(*x)), nbDataRes)

In [1005]:
# adding the flattened list of descriptives to the dataframe as columns
mergedJudges['descriptivesPet'] = pd.Series(flattenedPet, index=mergedJudges.index)
mergedJudges['descriptivesRes'] = pd.Series(flattenedRes, index=mergedJudges.index)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [1015]:
# load opinion lexicon
posLink = "idvsus-backup/opinion-lexicon-English/positive-words.txt"
negLink = "idvsus-backup/opinion-lexicon-English/negative-words.txt"

def getBothList(posLink, negLink):
    '''
    Inputs:
    The links in for the lexicon files downloaded from https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
    
    Outputs:
    Two lists: posList and negList representing lists of positive and negative words appearing in each transcript
    '''
    posFile = open(posLink, "r")
    negFile = open(negLink, "r")

    posList = posFile.read()
    negList = negFile.read()
    
    posList = getPosLexicon(posList)
    negList = getNegLexicon(negList)

    posList = posList.split('\n')
    negList = negList.split('\n')

    return posList, negList

In [1016]:
def getNegLexicon(text):
    '''
    Gets negative-sentiment words from the list
    '''
    # start and end defines bounds of argument portion of text
    start = text.find('2-faced')
    end = text.rfind('zombie')
    return text[start:end]

def getPosLexicon(text):
    '''
    Gets positive-sentiment words from the list.
    '''
    # start and end defines bounds of argument portion of text
    start = text.find('a+')
    end = text.rfind('zippy')
    return text[start:end]

In [935]:
# get posList and negList for the next step
posList, negList = getBothList(posLink, negLink)

In [1007]:
descriptivesPet = mergedJudges.descriptivesPet.values
descriptivesRes = mergedJudges.descriptivesRes.values

In [960]:
# for example, here are the list of positive and negative words in a randomly selected descriptive list for petitioners
d = descriptivesPet[0]
pos, neg = splitDescriptivePosNeg(d, posList, negList)
print "These are the positive words spoken to petitioners: \n"
print pos
print "---------------------"
print "These are the negative words spoken to petitioners: \n"
print neg

These are the positive words spoken to petitioners: 

[u'significant', u'helpful', u'clear', u'articulate', u'best', u'better', u'important', u'articulate', u'intuitive', u'great', u'tough']
---------------------
These are the negative words spoken to petitioners: 

[u'reckless', u'culpable', u'drunk', u'forceful', u'violent', u'violent', u'embarrassing', u'violent', u'odd', u'drunk', u'accidental', u'arbitrary', u'bad', u'blind', u'dangerous', u'difficult', u'false', u'unlawful', u'unlawful', u'explosive', u'drunk', u'assault', u'inconsistent', u'blind']




In [1011]:
def findNetPositivity(descriptive, posList, negList):
    '''
    Inputs:
    descriptive: list of salient descriptive words taken from a transcript
    posList: list of all positive words from lookup dictionary
    negList: list of all negative words from lookup dictionary
    
    Returns:
    net positivity of that descriptive, using sentiment analysis techniques
    '''
    pos, neg = [], []
    for word in descriptive:
        if word in posList:
            pos.append(word)
        elif word in negList:
            neg.append(word)
    return len(pos) - len(neg)

In [1013]:
netPosPet = map(lambda x: findNetPositivity(x, posList, negList), mergedJudges.descriptivesPet.values)



In [1014]:
netPosRes = map(lambda x: findNetPositivity(x, posList, negList), mergedJudges.descriptivesRes.values)



In [1018]:
# append netPosPet and netPosRes to merged as a column as NEW FEATURES
mergedJudges['netPosPet'] = pd.Series(netPosPet, index=mergedJudges.index)
mergedJudges['netPosRes'] = pd.Series(netPosRes, index=mergedJudges.index)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


## 3.4 Running classifier on combined features #1, #2, #3

Let's take a sneak peek at what our completed dataframe looks like now!!!

In [1060]:
# combining all features from #1, #2, #3 in a consolidated predictor space
X = mergedJudges[["netPosPet", "netPosRes", "allInterruptionsPet", "allInterruptionsRes", "allNumWordsPet", "allNumWordsRes"]]

In [1061]:
y = mergedJudges["partyWinning"]

This is awesome!!! We have all our pain-stakingly selected features in a predictor matrix and our target values - now it's time to run our classifier! Let's simply run Logistic Regression because we are trying to predict binary outcomes.

In [1062]:
clflog2 = LogisticRegression()

In [1065]:
xTrain, yTrain, xTest, yTest = splitTrainTest(X, y)
clflogopt2 = cv_optimize(clflog, {"C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, xTrain, yTrain, n_folds=5)

In [1066]:
# this is our optimal classifier after cross-validation
clflogopt2

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [1068]:
training_accuracy = clflogopt2.score(xTrain, yTrain)
test_accuracy = clflogopt2.score(xTest, yTest)
print training_accuracy, test_accuracy

0.695059625213 0.636363636364


# 4. Analysis on Justice-Centered Data

Variables  
* term: This variable identifies the term in which the Court handed down its decision.  
* naturalCourt: A natural court is a period during which no personnel change occurs. Scholars have subdivided them into "strong" and "weak" natural courts, but no convention exists as to the dates on which they begin and end. Options include 1) date of confirmation, 2) date of seating, 3) cases decided after seating, and 4) cases argued and decided after seating.  
* petitioner: Petitioner" refers to the party who petitioned the Supreme Court to review the case. This party is variously known as the petitioner or the appellant.
* respondent: Respondent" refers to the party being sued or tried and is also known as the appellee.
* caseOrigin: The focus of this variable is the court in which the case originated,
* caseSource: This variable identifies the court whose decision the Supreme Court reviewed. 
* lcDisposition: This variable specifies the treatment the court whose decision the Supreme Court reviewed accorded the decision of the court it reviewed; e.g., whether the court below the Supreme Court---typically a federal court of appeals or a state supreme court---affirmed, reversed, remanded, etc. the decision of the court it reviewed---typically a trial court. 
* issueArea: This variable simply separates the issues identified in the preceding variable (issue) into the following larger categories: criminal procedure, civil rights, First Amendment, due process, privacy , attorneys' or governmental officials' fees or compensation, unions, economic activity, judicial power, federalism, interstate relation, federal taxation, miscellaneous, and private law. 

In [767]:
def encode_variable(df, target_column):
    """Encode variable into a number.

    Inputs
    ----
    df -- pandas DataFrame.
    target_column -- column to map to int, producing
                     new Target column.

    Returns
    -------
    df_mod -- modified DataFrame.
    targets -- list of target names.
    """
    df_mod = df.copy()
    targets = df_mod[target_column].unique()
    map_to_int = {name: n for n, name in enumerate(targets)}
    df_mod[target_column] = df_mod[target_column].replace(map_to_int)

    return (df_mod, targets)

#from graphviz documentation
def visualize_tree(tree, feature_names):
    """Create tree png using graphviz.

    Args
    ----
    tree -- scikit-learn DecsisionTree.
    feature_names -- list of feature names.
    """
    with open("dt.dot", 'w') as f:
        export_graphviz(tree, out_file=f,
                        feature_names=feature_names)

    command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"]
    try:
        subprocess.check_call(command)
    except:
        exit("Could not run dot, ie graphviz, to "
             "produce visualization")

In [768]:
#create the target, and train
#read  the csv and make a big df
bigdf=pd.read_csv("SCDB_2015_01_justiceCentered_Citation.csv")

#create an empty small df
smalldf = pd.DataFrame()

#select the variables to run the classifier on 
#casedisposition is our target
train_vars = ['term', 'naturalCourt', 'petitioner',
                'respondent', 'caseOrigin', 'caseSource', 'lcDisposition', 'issueArea', 'partyWinning', 'majority']

# ["term", "naturalCourt ", "petitioner", "respondent", "caseOrigin", "caseSource", "lcDisposition", "issueArea"]
#add train columns to smalldf
smalldf = bigdf[train_vars]

#drop row if any values are NAN - maxmimum  22% of original data 
smalldf=smalldf.dropna(axis=0,how='any')

print smalldf.shape

# smalldf, _ = encode_variable(smalldf,'chief')
# smalldf, _ = encode_variable(smalldf, 'dateDecision')

# smalldf.majority refers to whether justice voted with the majority (1 for dissent, 2 for majority)
# smalldf.partyWinning indicates winning party (0 for responding party, 1 for petitioning party, 2 for unclear)
# We use the above 2 features to infer which party the individual justice voted for
# NOTE: majority has around 4000 NaNs that we should filter out?

results = []

for idx, x in smalldf.iterrows(): 
    if x.partyWinning == 2:
        results.append(2)
        
    if x.partyWinning == 1 and x.majority == 2: 
        results.append(1)
        
    if x.partyWinning == 0 and x.majority == 2: 
        results.append(0)
        
    if x.partyWinning == 1 and x.majority == 1: 
        results.append(0)
        
    if x.partyWinning == 0 and x.majority == 1:
        results.append(1)

smalldf['justiceVote'] = results

smalldf['is_train'] = np.random.uniform(0, 1, len(smalldf)) <= .75

train, test = smalldf[smalldf['is_train']==True], smalldf[smalldf['is_train']==False]

features = list(smalldf.columns[:8])

(62929, 10)


## 4.1 Decision Trees

In [770]:
import matplotlib
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree 
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm

In [771]:
dt = DecisionTreeClassifier(random_state=99)
dt.fit(train[features],train["justiceVote"])
visualize_tree(dt, features)
dt.score(test[features],test['justiceVote'])

0.75429881543752386

In [None]:
Cs=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
gs=GridSearchCV(dt, param_grid={'C':Cs}, cv=5)
gs.fit(train[features], train['justiceVote'])
print "BEST", gs.best_params_, gs.best_score_, gs.grid_scores_

## 4.2 Linear SVM

In [None]:
%%time
svc = svm.LinearSVC(loss="hinge")
svc_classifier = svc.fit(train[features], train['justiceVote'])
print svc_classifier.score(test[features], test['justiceVote'])

In [None]:
Cs=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
gs=GridSearchCV(svc, param_grid={'C':Cs}, cv=5)
gs.fit(train[features], train['justiceVote'])
print "BEST", gs.best_params_, gs.best_score_, gs.grid_scores_

In [None]:
best = gs.best_estimator_
best.fit(train[features], train['justiceVote'])
best.score(test[features], test['justiceVote'])

## 4.3 Logistic Regression ?? do we want to do this

Logistic regression is a natural first choice for a model since our target value can be viewed as a probability between 0 or 1 for any individual justice to vote For or Against, with a higher probability representing a higher confidence of that justice voting in favor of the arguing party. 

In [None]:
from sklearn.linear_model import LogisticRegression

In [587]:
bigdf=pd.read_csv("SCDB_2015_01_justiceCentered_Citation.csv")

In [43]:
from patsy import dmatrices
log_model = LogisticRegression(penalty='l2',C=1.0, fit_intercept=True, class_weight='auto')
bigdf=pd.read_csv("SCDB_2015_01_justiceCentered_Citation.csv")

smalldf = pd.DataFrame()

regress_vars = ['issue', 'issueArea', 'decisionDirection', 
                'dateDecision', 'justice', 'chief', 'caseSource', 'certReason', 'vote'] 

for i in regress_vars: 
    smalldf[i] = bigdf[i]
    
y, X = dmatrices('C(vote) ~ C(issue) + C(issueArea) + C(decisionDirection) + C(justice) + \
                  C(caseSource) + C(certReason)',
                  smalldf, return_type="dataframe")

print (X.shape)
print(y.shape)
# y = np.ravel(y)
print((y))
model = LogisticRegression()

model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)

(61833, 434)
(61833, 8)
       C(vote)[1.0]  C(vote)[2.0]  C(vote)[3.0]  C(vote)[4.0]  C(vote)[5.0]  \
0                 0             1             0             0             0   
1                 1             0             0             0             0   
2                 1             0             0             0             0   
3                 0             0             0             1             0   
4                 1             0             0             0             0   
5                 1             0             0             0             0   
6                 1             0             0             0             0   
7                 1             0             0             0             0   
8                 1             0             0             0             0   
9                 1             0             0             0             0   
10                0             1             0             0             0   
11                1         

ValueError: bad input shape (61833, 8)

NameError: name 'X' is not defined

### Random Forest Classifier

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree 
from sklearn import preprocessing
from sklearn import metrics
from sklearn import svm
from sklearn.cross_validation import train_test_split, cross_val_score

In [585]:
# read in SCDB data from file
bigdf=pd.read_csv("supremeCourtDb.csv")
bigdf[]


Unnamed: 0,caseId,docketId,caseIssuesId,voteId,dateDecision,decisionType,usCite,sctCite,ledCite,lexisCite,...,authorityDecision1,authorityDecision2,lawType,lawSupp,lawMinor,majOpinWriter,majOpinAssigner,splitVote,majVotes,minVotes
0,1946-001,1946-001-01,1946-001-01-01,1946-001-01-01-01,11/18/46,1,329 U.S. 1,67 S. Ct. 6,91 L. Ed. 3,1946 U.S. LEXIS 1724,...,4,,6,600,35 U.S.C. � 33,78,78,1,8,1
1,1946-002,1946-002-01,1946-002-01-01,1946-002-01-01-01,11/18/46,1,329 U.S. 14,67 S. Ct. 13,91 L. Ed. 12,1946 U.S. LEXIS 1725,...,4,,6,600,18 U.S.C. � 398,81,87,1,6,3
2,1946-003,1946-003-01,1946-003-01-01,1946-003-01-01-01,11/18/46,1,329 U.S. 29,67 S. Ct. 1,91 L. Ed. 22,1946 U.S. LEXIS 3037,...,1,,2,207,,84,78,1,5,4
3,1946-004,1946-004-01,1946-004-01-01,1946-004-01-01-01,11/25/46,7,329 U.S. 40,67 S. Ct. 167,91 L. Ed. 29,1946 U.S. LEXIS 1696,...,4,,6,600,49 Stat. 801,87,87,1,5,3
4,1946-005,1946-005-01,1946-005-01-01,1946-005-01-01-01,11/25/46,1,329 U.S. 64,67 S. Ct. 154,91 L. Ed. 44,1946 U.S. LEXIS 2997,...,7,,,,,78,87,1,6,3
5,1946-006,1946-006-01,1946-006-01-01,1946-006-01-01-01,11/25/46,1,329 U.S. 69,67 S. Ct. 156,91 L. Ed. 80,1946 U.S. LEXIS 3005,...,2,,1,129,,81,87,1,7,1
6,1946-007,1946-007-01,1946-007-01-01,1946-007-01-01-01,11/25/46,1,329 U.S. 90,67 S. Ct. 133,91 L. Ed. 103,1946 U.S. LEXIS 3053,...,1,,6,600,15 U.S.C. � 79,82,74,1,6,0
7,1946-008,1946-008-01,1946-008-01-01,1946-008-01-01-01,12/9/46,1,329 U.S. 129,67 S. Ct. 231,91 L. Ed. 128,1946 U.S. LEXIS 2995,...,4,,6,600,35 U.S.C. � 89,87,87,1,9,0
8,1946-009,1946-009-01,1946-009-01-01,1946-009-01-01-01,12/9/46,1,329 U.S. 143,67 S. Ct. 245,91 L. Ed. 136,1946 U.S. LEXIS 3047,...,4,5,5,512,,87,87,1,9,0
9,1946-010,1946-010-01,1946-010-01-01,1946-010-01-01-01,12/9/46,1,329 U.S. 156,67 S. Ct. 237,91 L. Ed. 162,1946 U.S. LEXIS 3048,...,4,,3,307,,78,87,1,8,0


In [27]:
df = bigdf["docketId", "dateDecision", "case"]

KeyError: ('docketId', 'dateDecision', 'case')

### Linear SVM Classifier

In [None]:
svm_model = svm.SVC(C=1.0, kernel='linear', probability=True, class_weight='auto')
svm_model = my_svm.fit(X, y)
svm_pred = svm_fit.predict(W)
# Class probabilities, based on log regression on distance to hyperplane.
svm_prob = svm_fit.predict_proba(W)
svm_dist = svm_fit.decision_function(W)

## 2. Justice Ruling Prediction

We use a different dataset in a slightly different approach to making Supreme Court ruling predictions. This method is motivated by the fact that usually, only 2 justices tend to be swing votes and justice decisions are highly influenced by factors outside of what transpires in court proceedings, such as background information about the case itself. The Supreme Court website contains a Justice-centered database which contains extensive information about each case; in particular, the most pertinent fields we are interested in analyzing are:

1. Decision Year
2. Natural Court
3. Petitioner
4. Respondent
5. Case Origin
6. Case Source
7. Lower Court Disposition Direction
8. Issue Area

Our target value to predict is the field called winningParty (petitioner or respondent), which using our justice-centered approach involves aggregating predicted votes for each individual justice and taking majority vote. The associated confidence of our entire prediction is obtained by averaging individual confidences of our models for each justice.

In [91]:
# read in justice-centered SCDB data from file
newdf=pd.read_csv("SCDB_justice_centered.csv")

In [1]:
# maybe lcDispositionDirection? choose features with continuous/numerical features
# do the numbers mean anything though?
newsmalldf = newdf[["term", "naturalCourt", "petitioner", "respondent", "caseOrigin", "caseSource", "lcDisposition", "issueArea"]]

NameError: name 'newdf' is not defined

In [93]:
newsmalldf.head()

Unnamed: 0,term,naturalCourt,petitioner,respondent,caseOrigin,caseSource,lcDisposition,issueArea
0,1946,1301,198,172,51,29,2,8
1,1946,1301,198,172,51,29,2,8
2,1946,1301,198,172,51,29,2,8
3,1946,1301,198,172,51,29,2,8
4,1946,1301,198,172,51,29,2,8


For an intuitive understanding of the features above, check out the documentation here: http://scdb.wustl.edu/documentation.php?var=petitioner. All the above features are categorical instead of continuous (which means the numbers specify a category instead of having a numerical meaning). For an illustrative example, the "petitioner" variable includes:

1. attorney general of the United States, or his office
2. specified state board or department of education
7. state department or agency
etc

#### Advantages of Using Decision Tree Classifiers

Having an intuitive understanding of the meanings behind the variables is important and leads us to our idea of usign the decision tree classifier. A distinct advantage of using decisiontrees is that the decision at each node has an intuitive meaning and corresponds to querying along one feature axis at a time (e.g. is the petitioner an attorney general of the United States?). 

Furthermore, trees are easy to understand and interpret. We can look at the top node and figure out which feature it corresponds to, and conclude that this feature contributes the most information gain, i.e. is the most important/predictive feature. This makes it easy to verify whether our results make intuitive sense.

We will show the process of running decision trees on each justice, before aggregating the votes now.

### 2.1 Justice-Centered Decision Tree Classifiers

Ultimately, the feature that we want to predict is the vote for each justice.

In [29]:
from sklearn import tree

In [94]:
# newdf.majority refers to whether justice voted with the majority (1 for dissent, 2 for majority)
# newdf.partyWinning indicates winning party (0 for responding party, 1 for petitioning party, 2 for unclear)
# We use the above 2 features to infer which party the individual justice voted for
# NOTE: majority has around 4000 NaNs that we should filter out?
results = []
ctr1, ctr2 = 0,0
for idx, x in enumerate(newdf.majority):
    #if decision is unclear, append 2 to results (in reality, apparently there aren't ANY 2s)
    if newdf.partyWinning[idx] == 2:
        results.append(2)
        ctr += 1
        break
    #if justice voted in the majority
    if x == 2:
        results.append(newdf.partyWinning[idx]) #results contains 0/1
    #if justice voted in the minority
    elif x == 1:
        results.append(1 - newdf.partyWinning[idx]) #results contains 0/1
    else:
        #need to clean this up o.o
        results.append(1)

In [99]:
# these are our target values
pd.concat([newsmalldf, results])

AttributeError: 'numpy.float64' object has no attribute 'index'

In [85]:
# drop rows where any column value is NaN - dealing with missing data
newsmalldf.dropna(axis=0).head()

Unnamed: 0,term,naturalCourt,petitioner,respondent,caseOrigin,caseSource,lcDisposition,issueArea
0,1946,1301,198,172,51,29,2,8
1,1946,1301,198,172,51,29,2,8
2,1946,1301,198,172,51,29,2,8
3,1946,1301,198,172,51,29,2,8
4,1946,1301,198,172,51,29,2,8
