## Supreme Court Decision Prediction

### Load In Data

In [59]:
import pandas as pd
import numpy as np
import re

In [60]:
cases = pd.read_csv('csvs/justice2.csv')
judges = pd.read_csv('csvs/table_of_justices.csv')
presidents = pd.read_csv('csvs/presidents.csv')

In [61]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3303 entries, 0 to 3302
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          3303 non-null   int64 
 1   ID                  3303 non-null   int64 
 2   name                3303 non-null   object
 3   href                3303 non-null   object
 4   docket              3292 non-null   object
 5   term                3303 non-null   int64 
 6   first_party         3302 non-null   object
 7   second_party        3302 non-null   object
 8   facts               3303 non-null   object
 9   facts_len           3303 non-null   int64 
 10  majority_vote       3303 non-null   int64 
 11  minority_vote       3303 non-null   int64 
 12  first_party_winner  3288 non-null   object
 13  decision_type       3296 non-null   object
 14  disposition         3231 non-null   object
 15  issue_area          3161 non-null   object
dtypes: int64(6), object(10)


In [62]:
judges.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Index                     121 non-null    int64 
 1   Justice Name              121 non-null    object
 2   Supreme Court Term Start  121 non-null    object
 3   Supreme Court Term End    121 non-null    object
 4   Appointing President      121 non-null    object
 5   Notable Opinion(s)        121 non-null    object
dtypes: int64(1), object(5)
memory usage: 5.8+ KB


There is no missing information for any of the judges here. Every judge has a
* Name
* Start Term
* End Term
* Appointing President

This lets us infer each judge's likely political affiliation when creating features for our model.

In [63]:
presidents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   President   46 non-null     object
 1   Party       46 non-null     object
dtypes: object(2)
memory usage: 864.0+ bytes


### Data Prep

#### Get All Dates in Terms of Years

In [64]:
judges['start_year'] = pd.to_datetime(judges['Supreme Court Term Start']).dt.year
judges['end_year'] = pd.to_datetime(judges['Supreme Court Term End'].replace("--","01-Mar-24")).dt.year


Each case lists only the "term" in which it was decided, not an exact date. We therefore assume that a Supreme Court Justice took part in a ruling if their years of service include the term in which the ruling was issued.

#### Match Judges to their Party Affiliations

In [65]:
judges['president_last_name'] = judges["Appointing President"].str.extract(r'^(\w+),')
presidents['president_last_name']= presidents['President '].str.extract(r'\s(\w+)$') 

In [66]:
president_party_dict = presidents.set_index('president_last_name')['Party '].to_dict()
judges['party'] = np.where(judges['president_last_name'].isin(president_party_dict.keys()),
                           judges['president_last_name'].map(president_party_dict),
                           None)
judges['party'] = judges['party'].dropna().apply(lambda x: x.strip())

We assume that each judge shares the political affiliation of the President who appointed them. We therefore map each President's party to the Supreme Court Justices they appointed.
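One caveat with keying on last names: several presidents share one (the two Roosevelts, for example), and `Series.to_dict()` keeps only the last value for a repeated key. The sketch below, on made-up rows, shows one way to detect such collisions before mapping. `presidents_demo` and its clean `Party` column name are hypothetical and are not the notebook's CSV.

```python
import pandas as pd

# Hypothetical rows for illustration only (not read from presidents.csv).
presidents_demo = pd.DataFrame({
    "president_last_name": ["Roosevelt", "Taft", "Roosevelt"],
    "Party": ["Republican", "Republican", "Democratic"],
})

# Last names that appear more than once cannot be mapped unambiguously.
dupes = presidents_demo.loc[
    presidents_demo["president_last_name"].duplicated(keep=False),
    "president_last_name",
].unique().tolist()
print(dupes)

# to_dict() silently keeps the last party listed for a duplicated key.
party_by_last_name = (
    presidents_demo.set_index("president_last_name")["Party"].to_dict()
)
print(party_by_last_name["Roosevelt"])
```

If `dupes` is non-empty on the real data, matching on the full name instead of the last name would be a safer key.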

In [67]:
judges.value_counts("party")

party
Republican                                   49
Democratic                                   44
Independent                                  11
Democratic-Republican                         6
Republican/National Union                     5
Democratic-Republican/National Republican     4
Whig                                          2
Name: count, dtype: int64

This is a problem that was unforeseen. Party names have changed over the years, and parties with similar names have held quite different ideals. The names alone cannot be used to group the parties because...

`Democratic` + `Republican` != `Democratic-Republican`

Our solution is to instead map each party to an ideology of "neutral", "conservative", or "liberal" using our pre-existing knowledge of these parties.

In [68]:
ideology_map = {
    "Republican": "conservative",
    "Democratic": "liberal",
    "Independent": "neutral", 
    "Whig": "conservative",
    "Democratic-Republican/National Republican": "conservative",
    "Republican/National Union": "conservative",
    "Democratic-Republican": "liberal"
}

In [69]:
judges['ideology'] = np.where(judges['party'].isin(ideology_map.keys()),
                           judges['party'].map(ideology_map),
                           None)

#### Match Cases to Judges to Get Counts for Each Ideology


In [70]:
cases["conservative"] = 0
cases["liberal"] = 0
cases["neutral"] = 0

In [71]:
for case_index, term in enumerate(cases.term):
    for judge_index, (start, end) in enumerate(zip(judges.start_year, judges.end_year)):
        # Count the judge toward the case if their years of service include the term
        if start <= term <= end:
            judge_ideology = judges.ideology[judge_index]
            cases.loc[case_index, judge_ideology] += 1
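The nested loop above can also be expressed with NumPy broadcasting, which avoids one `.loc` write per matching judge. This is a minimal sketch on toy data, assuming the same `start_year`, `end_year`, and `ideology` columns; `judges_demo` and `terms` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real frames (values are illustrative only).
judges_demo = pd.DataFrame({
    "start_year": [1960, 1970, 1975, 1990],
    "end_year":   [1972, 1985, 2000, 2010],
    "ideology":   ["liberal", "conservative", "conservative", "neutral"],
})
terms = np.array([1971, 1980, 1995])

# Boolean matrix: rows are cases, columns are judges. True where the
# judge's years of service include the case's term.
starts = judges_demo["start_year"].to_numpy()[None, :]
ends = judges_demo["end_year"].to_numpy()[None, :]
serving = (starts <= terms[:, None]) & (terms[:, None] <= ends)

# Sum the serving judges per ideology for every case.
counts = pd.DataFrame({
    label: serving[:, (judges_demo["ideology"] == label).to_numpy()].sum(axis=1)
    for label in ["conservative", "liberal", "neutral"]
})
print(counts)
```

On the real data, the three resulting columns could be assigned back to `cases` in one step, provided the row order of `terms` matches `cases`.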

#### Remove Rows Missing `issue_area` and `first_party_winner`

In [72]:
cases = cases.loc[cases.issue_area.notnull()]
cases = cases.loc[cases.first_party_winner.notnull()]

#### Regularize Facts Statements

In [73]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def tokenizer(text):
    ''' Tokenize document text by splitting it into individual tokens, removing digit-only tokens, punctuation, and stop words, and lemmatizing the remaining tokens.
    
    Parameters
    ----------
    text : str
        The body of text that you would like to tokenize
    
    Returns
    -------
    cleaned_doc_tokens : str
        The cleaned and tokenized text

    Example
    -------
    text = "Hailey Naugle was quite the important contributor to this Supreme Court analysis. - Brandon Owens"

    clean_text = tokenizer(text)
    
    Will output:
        "hailey naugle important contributor supreme court analysis brandon owens"

    '''

    text_tokens = []
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        sent_tokens = nltk.word_tokenize(sentence)
        sent_tokens = [lemmatizer.lemmatize(word.lower()) for word in sent_tokens 
                       if (word.lower() not in stop_words) and (word not in string.punctuation) and (len(word) > 1) and not(word.isdigit())]
        text_tokens += sent_tokens

    cleaned_doc_tokens = ' '.join(text_tokens)

    return cleaned_doc_tokens


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\brand\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\brand\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\brand\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [74]:
cases["cleaned_facts"] = cases["facts"].apply(tokenizer)

#### Convert Some Columns to Binary

In [75]:
issue_areas = pd.get_dummies(cases.issue_area)
combo_dummies = pd.concat([issue_areas], axis = 1, join="inner")

In [76]:
cols = ["name", "term", "first_party", "second_party", "facts", "cleaned_facts", "majority_vote", "minority_vote", "first_party_winner", "conservative", "liberal", "neutral"]
cases = pd.concat([cases[cols], combo_dummies], axis=1, join="inner")

In [77]:
cases = cases.replace({True: 1, False: 0})

In [78]:
cases.reset_index()

Unnamed: 0,index,name,term,first_party,second_party,facts,cleaned_facts,majority_vote,minority_vote,first_party_winner,...,Economic Activity,Federal Taxation,Federalism,First Amendment,Interstate Relations,Judicial Power,Miscellaneous,Privacy,Private Action,Unions
0,1,Stanley v. Illinois,1971,"Peter Stanley, Sr.",Illinois,<p>Joan Stanley had three children with Peter ...,joan stanley three child peter stanley stanley...,5,2,1,...,0,0,0,0,0,0,0,0,0,0
1,2,Giglio v. United States,1971,John Giglio,United States,<p>John Giglio was convicted of passing forged...,john giglio convicted passing forged money ord...,7,0,1,...,0,0,0,0,0,0,0,0,0,0
2,3,Reed v. Reed,1971,Sally Reed,Cecil Reed,"<p>The Idaho Probate Code specified that ""male...",idaho probate code specified `` male must pref...,7,0,1,...,0,0,0,0,0,0,0,0,0,0
3,4,Miller v. California,1971,Marvin Miller,California,"<p>Miller, after conducting a mass mailing cam...",miller conducting mass mailing campaign advert...,5,4,1,...,0,0,0,1,0,0,0,0,0,0
4,5,Kleindienst v. Mandel,1971,"Richard G. Kleindienst, Attorney General of th...","Ernest E. Mandel, et al.",<p>Ernest E. Mandel was a Belgian professional...,ernest e. mandel belgian professional journali...,6,3,1,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3154,3297,Yellen v. Confederated Tribes of the Chehalis ...,2020,"Janet L. Yellen, Secretary of the Treasury",Confederated Tribes of the Chehalis Reservatio...,<p>For over a century after the Alaska Purchas...,century alaska purchase federal government set...,6,3,1,...,0,0,0,0,0,0,0,0,0,0
3155,3298,United States v. Palomar-Santiago,2020,United States,Refugio Palomar-Santiago,"<p>Refugio Palomar-Santiago, a Mexican nationa...",refugio palomar-santiago mexican national gran...,9,0,1,...,0,0,0,0,0,0,0,0,0,0
3156,3299,Terry v. United States,2020,Tarahrick Terry,United States,<p>Tarahrick Terry pleaded guilty to one count...,tarahrick terry pleaded guilty one count posse...,9,0,0,...,0,0,0,0,0,0,0,0,0,0
3157,3300,United States v. Cooley,2020,United States,Joshua James Cooley,<p>Joshua James Cooley was parked in his picku...,joshua james cooley parked pickup truck side r...,9,0,1,...,0,0,0,0,0,0,0,0,0,0


In [79]:
cases.columns

Index(['name', 'term', 'first_party', 'second_party', 'facts', 'cleaned_facts',
       'majority_vote', 'minority_vote', 'first_party_winner', 'conservative',
       'liberal', 'neutral', 'Attorneys', 'Civil Rights', 'Criminal Procedure',
       'Due Process', 'Economic Activity', 'Federal Taxation', 'Federalism',
       'First Amendment', 'Interstate Relations', 'Judicial Power',
       'Miscellaneous', 'Privacy', 'Private Action', 'Unions'],
      dtype='object')

### EDA

In [23]:
%matplotlib inline

In [24]:
import seaborn as sns

I'm not sure what I would like to do here yet. I was thinking of doing topic modelling at the start, but they already have the issue areas? I want a little time to think about this.

### Modeling

#### Baseline Model

**NEED TO MAKE A BASELINE MODEL**
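One possible baseline is a classifier that always predicts the most frequent class, so that every later accuracy score can be read against it. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data. The arrays `X_demo` and `y_demo` and the 65/35 class split are assumptions for illustration, not values taken from `cases`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows with an imbalanced binary label.
rng = np.random.default_rng(27)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 65 + [0] * 35)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=27, stratify=y_demo
)

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(Xd_train, yd_train)
baseline_acc = baseline.score(Xd_test, yd_test)
print(baseline_acc)
```

On the real data, the same two lines would be fit on `X_train, y_train` and scored on `X_test, y_test` once those splits exist.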

In [80]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

#### Features without NLP

In [81]:
features = cases[["term", "conservative", "liberal", "neutral", 'Attorneys', 'Civil Rights', 'Criminal Procedure', 'Due Process',
       'Economic Activity', 'Federal Taxation', 'Federalism',
       'First Amendment', 'Interstate Relations', 'Judicial Power',
       'Miscellaneous', 'Privacy', 'Private Action', 'Unions']]
targets = cases['first_party_winner']

In [82]:
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=27)

##### LR

In [83]:
clf = LogisticRegression(random_state=27).fit(X_train, y_train)
clf.score(X_test, y_test)

0.6772151898734177

##### Decision Tree

In [84]:
dt = DecisionTreeClassifier(random_state=27).fit(X_train, y_train)
dt.score(X_test, y_test)

0.6044303797468354

##### Random Forest

In [85]:
rf = RandomForestClassifier(random_state=27).fit(X_train, y_train)
rf.score(X_test, y_test)

0.629746835443038

##### K-Nearest Neighbors

In [86]:
knn = KNeighborsClassifier().fit(X_train, y_train)
knn.score(X_test, y_test)

AttributeError: 'Flags' object has no attribute 'c_contiguous'
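This error has been reported for some combinations of pandas and scikit-learn versions when `KNeighborsClassifier` receives a DataFrame. A commonly suggested workaround, not verified against this notebook's environment, is to pass NumPy arrays instead. A sketch on synthetic data, where `X_demo` and `y_demo` are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Synthetic DataFrame features and a label derived from column "a".
rng = np.random.default_rng(27)
X_demo = pd.DataFrame(rng.normal(size=(60, 4)), columns=list("abcd"))
y_demo = (X_demo["a"] > 0).astype(int)

# Convert to plain NumPy arrays before fitting and scoring.
knn_demo = KNeighborsClassifier().fit(X_demo.to_numpy(), y_demo.to_numpy())
knn_acc = knn_demo.score(X_demo.to_numpy(), y_demo.to_numpy())
print(knn_acc)
```

Upgrading scikit-learn may also resolve the error without any change to the code.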

#### NLP Extraction without Features

In [87]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [88]:
features = cases["cleaned_facts"]

In [89]:
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=27)

In [90]:
count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

##### LR

In [91]:
pipe_count_lr = Pipeline(steps=[('cv', CountVectorizer()),('lr', LogisticRegression(solver='liblinear'))])
pipe_tfidf_lr = Pipeline(steps=[('tfidfv', TfidfVectorizer()),('lr', LogisticRegression(solver='liblinear'))])

In [92]:
pipe_count_lr.fit(X_train, y_train)
pipe_count_lr.score(X_test, y_test)

0.6012658227848101

In [93]:
pipe_tfidf_lr.fit(X_train, y_train)
pipe_tfidf_lr.score(X_test, y_test)

0.6819620253164557

##### Decision Tree

In [94]:
pipe_count_dt = Pipeline(steps=[('cv', CountVectorizer()),('dt', DecisionTreeClassifier(random_state=27))])
pipe_tfidf_dt = Pipeline(steps=[('tfidfv', TfidfVectorizer()),('dt', DecisionTreeClassifier(random_state=27))])

In [95]:
pipe_count_dt.fit(X_train, y_train)
pipe_count_dt.score(X_test, y_test)

0.5759493670886076

In [96]:
pipe_tfidf_dt.fit(X_train, y_train)
pipe_tfidf_dt.score(X_test, y_test)

0.5537974683544303

##### Random Forest


In [97]:
pipe_count_rf = Pipeline(steps=[('cv', CountVectorizer()),('rf', RandomForestClassifier(random_state=27))])
pipe_tfidf_rf = Pipeline(steps=[('tfidfv', TfidfVectorizer()),('rf', RandomForestClassifier(random_state=27))])

In [98]:
pipe_count_rf.fit(X_train, y_train)
pipe_count_rf.score(X_test, y_test)

0.6677215189873418

In [99]:
pipe_tfidf_rf.fit(X_train, y_train)
pipe_tfidf_rf.score(X_test, y_test)

0.6566455696202531

##### K-Nearest Neighbors

In [100]:
pipe_count_knn = Pipeline(steps=[('cv', CountVectorizer()),('knn', KNeighborsClassifier())])
pipe_tfidf_knn = Pipeline(steps=[('tfidfv', TfidfVectorizer()),('knn', KNeighborsClassifier())])

In [101]:
pipe_count_knn.fit(X_train, y_train)
pipe_count_knn.score(X_test, y_test)

0.6170886075949367

In [102]:
pipe_tfidf_knn.fit(X_train, y_train)
pipe_tfidf_knn.score(X_test, y_test)

0.5996835443037974

#### Combine BOW Vectors with Features

In [49]:
facts_count_vectorized = count_vectorizer.fit_transform(cases["cleaned_facts"]) 
dense_counts_vector_array = facts_count_vectorized.toarray()
# get_feature_names_out() lists terms in column order; vocabulary_.keys() does not
counts_df = pd.DataFrame(dense_counts_vector_array, columns = count_vectorizer.get_feature_names_out())

In [50]:
facts_tfidf_vectorized = tfidf_vectorizer.fit_transform(cases["cleaned_facts"]) 
dense_tfidf_vector_array = facts_tfidf_vectorized.toarray()
# get_feature_names_out() lists terms in column order; vocabulary_.keys() does not
tfidf_df = pd.DataFrame(dense_tfidf_vector_array, columns = tfidf_vectorizer.get_feature_names_out())

In [51]:
counts_features_df = pd.concat([counts_df, cases], axis=1, join="inner")
tfidf_features_df = pd.concat([tfidf_df, cases], axis=1, join="inner")
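An alternative way to combine the text and tabular features is a `ColumnTransformer` inside a `Pipeline`. This keeps rows aligned by construction and, when fit on a training split, learns the vocabulary from the training rows only. The sketch below runs on toy data. The frame `demo` and all of its values are invented, and only two of the tabular columns are shown.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented rows that mimic the shape of the prepared cases frame.
demo = pd.DataFrame({
    "cleaned_facts": [
        "petitioner convicted appeal granted",
        "state statute challenged first amendment",
        "tax assessment disputed federal court",
        "union contract dispute arbitration",
        "petitioner appeal denied state court",
        "federal statute challenged due process",
    ],
    "term": [1971, 1980, 1995, 2001, 2010, 2020],
    "conservative": [5, 6, 7, 7, 5, 6],
    "first_party_winner": [1, 0, 1, 0, 1, 0],
})

preprocess = ColumnTransformer([
    # A single column name (not a list) hands the vectorizer a 1-D series of strings.
    ("text", TfidfVectorizer(), "cleaned_facts"),
    # Tabular columns are passed through unchanged.
    ("tabular", "passthrough", ["term", "conservative"]),
])

combined = Pipeline([
    ("prep", preprocess),
    ("lr", LogisticRegression(solver="liblinear")),
])

X_all = demo.drop(columns="first_party_winner")
combined.fit(X_all, demo["first_party_winner"])
preds = combined.predict(X_all)
print(preds)
```

Because the vectorizer lives inside the pipeline, the same object can be handed to `train_test_split` and `GridSearchCV` without building the dense count frames by hand.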

In [52]:
log_reg = LogisticRegression()
rf = RandomForestClassifier()

In [53]:
unnecessary_cols = ["name", "first_party", "second_party", "facts", "cleaned_facts", "majority_vote", "minority_vote", "first_party_winner"]
counts_features = counts_features_df.loc[:, ~counts_features_df.columns.isin(unnecessary_cols)]
tfidf_features = tfidf_features_df.loc[:, ~tfidf_features_df.columns.isin(unnecessary_cols)]
targets = counts_features_df.loc[:, "first_party_winner"]

In [54]:
len(targets)

3022
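`pd.concat(..., axis=1, join="inner")` aligns rows on index labels, not on positions. Because `cases.reset_index()` in cell 78 returns a new frame and leaves `cases` unchanged, the filtered frame appears to keep its original labels while `counts_df` is labelled from 0. That would explain why fewer rows survive here than the table printed in cell 78 shows (its last row sits at position 3157), and it could pair some vectors with the wrong case. A toy sketch of the effect, using invented frames:

```python
import pandas as pd

# Frame that lost rows 1 and 3 to filtering but kept its old labels.
filtered = pd.DataFrame({"name": ["a", "c", "e"]}, index=[0, 2, 4])
# Freshly built frame, labelled 0..2 by position.
vectors = pd.DataFrame({"count": [10, 30, 50]})

# The inner join on labels keeps only 0 and 2, and label 2 pairs "c" with 50.
misaligned = pd.concat([vectors, filtered], axis=1, join="inner")

# Resetting the index (and keeping the result) restores positional pairing.
aligned = pd.concat(
    [vectors, filtered.reset_index(drop=True)], axis=1, join="inner"
)
print(misaligned)
print(aligned)
```

Assigning the result, as in `cases = cases.reset_index(drop=True)`, before building the vector frames would be one way to keep all rows and the correct pairing.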

In [55]:
Xc_train, Xc_test, yc_train, yc_test = train_test_split(counts_features, targets, test_size=0.2, random_state=27)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(tfidf_features, targets, test_size=0.2, random_state=27)

In [56]:
from sklearn.model_selection import GridSearchCV

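# Only solver/penalty pairs that scikit-learn supports are listed;
# "elasticnet" is omitted because it requires solver="saga" and an l1_ratio.
lr_param_grid = [
    {"solver": ["lbfgs", "newton-cholesky"], "penalty": ["l2"], "C": [0.01, 0.1, 1, 10]},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]},
]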

grid_search = GridSearchCV(estimator=log_reg, param_grid=lr_param_grid, cv=5)


In [57]:
grid_search.fit(Xc_train, yc_train)
best_model_counts = grid_search.best_estimator_
print(grid_search.best_params_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

KeyboardInterrupt: 

In [None]:
grid_search.fit(Xt_train, yt_train)
best_model_tfidf = grid_search.best_estimator_
print(grid_search.best_params_)