## Supreme Court Decision Prediction

### Load In Data

In [140]:
import pandas as pd
import numpy as np
import re

In [141]:
cases = pd.read_csv('csvs/justice2.csv')
judges = pd.read_csv('csvs/table_of_justices.csv')
presidents = pd.read_csv('csvs/presidents.csv')

In [142]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3303 entries, 0 to 3302
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          3303 non-null   int64 
 1   ID                  3303 non-null   int64 
 2   name                3303 non-null   object
 3   href                3303 non-null   object
 4   docket              3292 non-null   object
 5   term                3303 non-null   int64 
 6   first_party         3302 non-null   object
 7   second_party        3302 non-null   object
 8   facts               3303 non-null   object
 9   facts_len           3303 non-null   int64 
 10  majority_vote       3303 non-null   int64 
 11  minority_vote       3303 non-null   int64 
 12  first_party_winner  3288 non-null   object
 13  decision_type       3296 non-null   object
 14  disposition         3231 non-null   object
 15  issue_area          3161 non-null   object
dtypes: int64(6), object(10)


In [143]:
judges.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Index                     121 non-null    int64 
 1   Justice Name              121 non-null    object
 2   Supreme Court Term Start  121 non-null    object
 3   Supreme Court Term End    121 non-null    object
 4   Appointing President      121 non-null    object
 5   Notable Opinion(s)        121 non-null    object
dtypes: int64(1), object(5)
memory usage: 5.8+ KB


Note that no information is missing for any of the judges here. Every judge has
* Name
* Start Term
* End Term
* Appointing President

This will help us classify each judge's likely political affiliation when creating features for our model.

In [144]:
presidents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   President   46 non-null     object
 1   Party       46 non-null     object
dtypes: object(2)
memory usage: 864.0+ bytes


### Data Prep

#### Get All Dates in Terms of Years

In [145]:
judges['start_year'] = pd.to_datetime(judges['Supreme Court Term Start']).dt.year
judges['end_year'] = pd.to_datetime(judges['Supreme Court Term End'].replace("--","01-Mar-24")).dt.year


Each case records only the "term" in which it was decided, not an actual date. We therefore assume that if a Supreme Court Justice's years of service include the term of a ruling, that Justice took part in the ruling.
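
As a minimal sketch of this assumption, the snippet below parses term dates for a small made-up table, treats "--" (no end date) as "still serving" by substituting an assumed cutoff year, and checks whether a justice's service includes a given term. The names `toy`, `CUTOFF_YEAR`, and `served_during`, as well as the rows and their date format, are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows; the real file's date format may differ.
toy = pd.DataFrame({
    "Justice Name": ["Justice A", "Justice B"],
    "Supreme Court Term Start": ["October 5, 1981", "August 10, 2009"],
    "Supreme Court Term End": ["January 31, 2006", "--"],
})

CUTOFF_YEAR = 2024  # assumed stand-in year for justices still serving

toy["start_year"] = pd.to_datetime(toy["Supreme Court Term Start"]).dt.year
# errors="coerce" turns the "--" placeholder into NaT instead of raising.
end = pd.to_datetime(toy["Supreme Court Term End"], errors="coerce")
toy["end_year"] = end.dt.year.fillna(CUTOFF_YEAR).astype(int)

def served_during(df, term):
    """Boolean mask of justices whose service years include `term`."""
    return (df["start_year"] <= term) & (df["end_year"] >= term)

mask_2010 = served_during(toy, 2010)
```

Because the comparison is made at year granularity, a term in which one justice leaves and a successor joins counts both of them.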

#### Match Judges to their Party Affiliations

In [146]:
judges['president_last_name'] = judges["Appointing President"].str.extract(r'^(\w+),')
presidents['president_last_name']= presidents['President '].str.extract(r'\s(\w+)$') 

In [147]:
# Look up each appointing president's party by last name; unmatched names stay missing
president_party_dict = presidents.set_index('president_last_name')['Party '].to_dict()
judges['party'] = judges['president_last_name'].map(president_party_dict).str.strip()

We assume that each Justice shares the political affiliation of the President who appointed him or her. We therefore map each President's party onto the Supreme Court Justices that President appointed.
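
One caveat with matching on last name alone is that presidents who share a surname collapse into a single dictionary key. The sketch below uses a two-row toy table (not the contents of `presidents.csv`) to show the effect and one way to flag it; `toy_presidents`, `party_by_last_name`, and `ambiguous` are hypothetical names. Whether the real data is affected would need to be checked.

```python
import pandas as pd

# Toy table with two presidents who share a surname.
toy_presidents = pd.DataFrame({
    "President": ["Theodore Roosevelt", "Franklin D. Roosevelt"],
    "Party": ["Republican", "Democratic"],
})
toy_presidents["last_name"] = toy_presidents["President"].str.extract(
    r"\s(\w+)$", expand=False
)

# Building a dict from a non-unique index keeps only the last entry per key.
party_by_last_name = toy_presidents.set_index("last_name")["Party"].to_dict()

# Flag surnames that occur more than once so they can be resolved by hand.
ambiguous = toy_presidents.loc[
    toy_presidents["last_name"].duplicated(keep=False), "last_name"
].unique().tolist()
```

If `ambiguous` is non-empty on the real table, matching on the full name would avoid assigning a justice to the wrong president's party.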

In [148]:
judges.value_counts("party")

party
Republican                                   49
Democratic                                   44
Independent                                  11
Democratic-Republican                         6
Republican/National Union                     5
Democratic-Republican/National Republican     4
Whig                                          2
Name: count, dtype: int64

This is a problem we did not foresee. Party names have changed over the years, and similarly named parties held distinct ideals. We cannot group the parties by name alone because...

`Democratic` + `Republican` != `Democratic-Republican`

Our solution is instead to map each party to an ideology of "neutral", "conservative", or "liberal", using our pre-existing knowledge of these parties.

In [149]:
ideology_map = {
    "Republican": "conservative",
    "Democratic": "liberal",
    "Independent": "neutral", 
    "Whig": "conservative",
    "Democratic-Republican/National Republican": "conservative",
    "Republican/National Union": "conservative",
    "Democratic-Republican": "liberal"
}

In [150]:
# Parties missing from ideology_map are left as missing values
judges['ideology'] = judges['party'].map(ideology_map)
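
Since any party label missing from the mapping would leave a judge without an ideology, a quick coverage check may be worthwhile. This is a sketch on a made-up series; `parties`, `partial_map`, and `unmapped` are hypothetical names, and "Federalist" stands in for a label the mapping does not cover.

```python
import pandas as pd

# A subset of the mapping, for illustration only.
partial_map = {
    "Republican": "conservative",
    "Democratic": "liberal",
    "Whig": "conservative",
}

# Hypothetical party column, including an uncovered label and a missing value.
parties = pd.Series(["Republican", "Whig", "Federalist", None])
ideology = parties.map(partial_map)

# Labels that are present in the data but absent from the mapping.
unmapped = sorted(parties[ideology.isna() & parties.notna()].unique())
```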

#### Match Cases to Judges to Get Counts for Each Ideology


In [151]:
cases["conservative"] = 0
cases["liberal"] = 0
cases["neutral"] = 0

In [152]:
# For each case, count the justices of each ideology serving during its term
for case_index, term in enumerate(cases.term):
    for judge_index, (start, end) in enumerate(zip(judges.start_year, judges.end_year)):
        if start <= term <= end:
            judge_ideology = judges.ideology[judge_index]
            cases.loc[case_index, judge_ideology] += 1
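
The nested loop above can also be expressed with NumPy broadcasting, which avoids row-by-row `.loc` updates. The following is a sketch on toy frames rather than a drop-in replacement; `toy_cases`, `toy_judges`, and `serving` are hypothetical names and the years are made up.

```python
import numpy as np
import pandas as pd

toy_cases = pd.DataFrame({"term": [1990, 2010]})
toy_judges = pd.DataFrame({
    "start_year": [1981, 1986, 2009],
    "end_year": [2006, 2016, 2024],
    "ideology": ["conservative", "conservative", "liberal"],
})

terms = toy_cases["term"].to_numpy()[:, None]   # shape (n_cases, 1)
starts = toy_judges["start_year"].to_numpy()    # shape (n_judges,)
ends = toy_judges["end_year"].to_numpy()

# serving[i, j] is True when judge j's service years include case i's term.
serving = (starts <= terms) & (terms <= ends)

for label in ["conservative", "liberal", "neutral"]:
    is_label = (toy_judges["ideology"] == label).to_numpy()
    toy_cases[label] = (serving & is_label).sum(axis=1)
```

The boolean matrix has one row per case and one column per judge, so summing along `axis=1` gives the per-case count for each ideology.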

#### Remove Rows Missing `issue_area`

In [153]:
# Copy the filtered frame so that later column assignments do not act on a view
cases = cases.loc[cases.issue_area.notnull()].copy()

#### Regularize Facts Statements

In [154]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def tokenizer(text):
    ''' Tokenize document text by splitting it into individual tokens, removing digit-only tokens, punctuation, single-character tokens, and stop words, and lemmatizing the remaining tokens.
    
    Parameters
    ----------
    text : str
        The body of text that you would like to tokenize
    
    Returns
    -------
    cleaned_doc_tokens : str
        The cleaned and tokenized text

    Example
    -------
    text = "Hailey Naugle was quite the important contributor to this Supreme Court analysis. - Brandon Owens"

    clean_text = tokenizer(text)
    
    Will output:
        "hailey naugle important contributor supreme court analysis brandon owens"

    '''

    text_tokens = []
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        sent_tokens = nltk.word_tokenize(sentence)
        sent_tokens = [lemmatizer.lemmatize(word.lower()) for word in sent_tokens 
                       if (word.lower() not in stop_words) and (word not in string.punctuation) and (len(word) > 1) and not(word.isdigit())]
        text_tokens += sent_tokens

    cleaned_doc_tokens = ' '.join(text_tokens)

    return cleaned_doc_tokens


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\brand\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\brand\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\brand\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [155]:
cases["cleaned_facts"] = cases["facts"].apply(tokenizer)

#### Convert Target Column to Binary

In [156]:
cases.first_party_winner = cases.first_party_winner.replace({True: 1, False: 0})
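
The `cases.info()` output above shows that `first_party_winner` has fewer non-null entries than there are rows, so some targets are missing and the replacement leaves them missing. One possible way to handle this, sketched on a toy frame, is to drop those rows before converting to integers. Dropping is an assumption here, not a decision made in this notebook, and `toy` is a hypothetical name.

```python
import pandas as pd

# Toy target column with one missing value.
toy = pd.DataFrame(
    {"first_party_winner": [True, False, None, True]}, dtype=object
)

# Drop rows without a target, then map booleans to 1/0 and cast to int.
toy = toy.dropna(subset=["first_party_winner"]).copy()
toy["first_party_winner"] = (
    toy["first_party_winner"].map({True: 1, False: 0}).astype(int)
)
```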

### EDA

In [None]:
import gensim
import seaborn as sns

### Modeling

#### Baseline Model

**Hey Hails, not sure how you wanted to do the baseline/null model here for something to compare to. It's your call**

**Also, I got rid of the "unanimous" section you added just because under the decision type, per-curiam means unanimous according to Google I guess?**

**Why don't we pick out some models to eventually try -- I haven't finished with the nlp stuff yet. I still want to do a lot with it in the EDA, I just had to prep a tokenizer**
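
One common choice for a null model is to always predict the majority class and record its accuracy as the number to beat. The sketch below illustrates that idea on made-up labels; it is only one option, and `y`, `majority_class`, and `baseline_accuracy` are hypothetical names.

```python
import pandas as pd

# Hypothetical labels standing in for cases.first_party_winner.
y = pd.Series([1, 1, 0, 1, 0, 1])

# The most frequent label is the null model's constant prediction.
majority_class = y.mode().iloc[0]

# Accuracy obtained by predicting the majority class for every case.
baseline_accuracy = (y == majority_class).mean()
```

To keep the comparison fair, the majority class would be chosen on the training split and its accuracy measured on the test split.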

#### Split the data into test and train data sets