# Supreme Court Data Exploration

In [3]:
from convokit import Corpus, download

Download our corpus from ConvoKit. This takes a minute, since it downloads a ~1.2 GB zip and unpacks it.

In [4]:
corpus = Corpus(filename=download("supreme-corpus"))
# If you have already downloaded the corpus, pass its local path instead of download("supreme-corpus").

Dataset already exists at /Users/vaughnfranz/.convokit/downloads/supreme-corpus


In [5]:
import pandas as pd

## Conversations Dataframe
Get all of the conversations in the corpus formatted as a dataframe.

In [6]:
conv_df = corpus.get_conversations_dataframe()

A lot of the data we need is already here: the side of each advocate, the side that won the case, and the side each justice voted for.

In [7]:
conv_df.head()

id,vectors,meta.case_id,meta.advocates,meta.win_side,meta.votes_side
13127,[],1955_71,"{'harry_f_murphy': {'side': 1, 'role': 'inferr...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
12997,[],1955_410,"{'howard_c_westwood': {'side': 1, 'role': 'inf...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13024,[],1955_410,"{'howard_c_westwood': {'side': 1, 'role': 'inf...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13015,[],1955_351,"{'harry_d_graham': {'side': 3, 'role': 'inferr...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13016,[],1955_38,"{'robert_n_gorman': {'side': 3, 'role': 'infer...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."


In [8]:
conv_df.shape

(7817, 5)

In [18]:
conv_df.isna().sum()

vectors             0
meta.case_id        0
meta.advocates      0
meta.win_side      13
meta.votes_side    13
dtype: int64

In [9]:
conv_df['meta.advocates'].iloc[0]

{'harry_f_murphy': {'side': 1, 'role': 'inferred'},
 'john_v_lindsay': {'side': 0, 'role': 'inferred'}}

In [12]:
conv_df['meta.votes_side'].iloc[0]

{'j__john_m_harlan2': 0,
 'j__hugo_l_black': 0,
 'j__william_o_douglas': 0,
 'j__earl_warren': 0,
 'j__tom_c_clark': 0,
 'j__felix_frankfurter': 0,
 'j__harold_burton': 0,
 'j__stanley_reed': 0,
 'j__sherman_minton': 0}
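
Since `meta.votes_side` holds one dict per conversation, it may be easier to work with after flattening it to one row per justice. The following is a minimal sketch on a small hand-built frame that mirrors the columns above. The names `toy_conv_df`, `votes_long`, and `voted_for_winner` are made up for illustration, and the vote dicts are abbreviated to two justices, so this is not the real corpus.

```python
import pandas as pd

# Toy stand-in for conv_df: two conversations with abbreviated vote dicts.
toy_conv_df = pd.DataFrame(
    {
        "meta.case_id": ["1955_71", "1955_410"],
        "meta.win_side": [0, 1],
        "meta.votes_side": [
            {"j__john_m_harlan2": 0, "j__hugo_l_black": 0},
            {"j__john_m_harlan2": 1, "j__hugo_l_black": 1},
        ],
    },
    index=pd.Index(["13127", "12997"], name="id"),
)

# Build one row per (conversation, justice) pair.
rows = [
    {
        "id": conv_id,
        "case_id": row["meta.case_id"],
        "justice": justice,
        "vote_side": side,
    }
    for conv_id, row in toy_conv_df.iterrows()
    for justice, side in row["meta.votes_side"].items()
]
votes_long = pd.DataFrame(rows)

# Flag whether each justice voted for the side that won the case.
win_side = toy_conv_df["meta.win_side"]
votes_long["voted_for_winner"] = votes_long["vote_side"] == votes_long["id"].map(win_side)
print(votes_long)
```

On the real dataframe, the 13 conversations with missing `meta.win_side` and `meta.votes_side` would presumably need to be dropped first.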

## Utterances Dataframe
We can also get a dataframe of all of the utterances. This is the one that contains the text we can train on. Building it takes a while because there are about 1.7 million utterances.

In [13]:
utterances_df = corpus.get_utterances_dataframe()

In [6]:
utterances_df.head()

id,timestamp,text,speaker,reply_to,conversation_id,meta.case_id,meta.start_times,meta.stop_times,meta.speaker_type,meta.side,meta.timestamp,vectors
13127__0_000,,"Number 71, Lonnie Affronti versus United State...",j__earl_warren,,13127,1955_71,"[0.0, 7.624]","[7.624, 9.218]",J,,0.0,[]
13127__0_001,,May it please the Court.\nWe are here by writ ...,harry_f_murphy,13127__0_000,13127,1955_71,"[9.218, 11.538, 15.653, 22.722, 28.849, 33.575]","[11.538, 15.653, 22.722, 28.849, 33.575, 48.138]",A,1.0,9.218,[]
13127__0_002,,Consecutive sentences.,j__william_o_douglas,13127__0_001,13127,1955_71,[48.138],[49.315],J,,48.138,[]
13127__0_003,,"Consecutive sentences.\nIn this case, the defe...",harry_f_murphy,13127__0_002,13127,1955_71,"[49.315, 51.844, 60.81, 67.083, 72.584, 89.839...","[51.844, 60.81, 67.083, 72.584, 89.839, 95.873...",A,1.0,49.315,[]
13127__0_004,,Was the aggregate prison sentence was 20 or 25...,<INAUDIBLE>,13127__0_003,13127,1955_71,[174.058],[176.766],,,174.058,[]


Note the meta.case_id column here, which makes it very easy to slice the dataframe by case.
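
As a rough illustration of slicing by case, the sketch below filters and groups a tiny hand-built frame on `meta.case_id`. The frame `toy_utterances_df` and its placeholder text are assumptions for the example; only the column names and ids follow the output above.

```python
import pandas as pd

# Toy stand-in for utterances_df with placeholder text.
toy_utterances_df = pd.DataFrame(
    {
        "text": ["placeholder a", "placeholder b", "placeholder c", "placeholder d"],
        "speaker": [
            "j__earl_warren",
            "harry_f_murphy",
            "j__william_o_douglas",
            "howard_c_westwood",
        ],
        "meta.case_id": ["1955_71", "1955_71", "1955_71", "1955_410"],
    },
    index=pd.Index(
        ["13127__0_000", "13127__0_001", "13127__0_002", "12997__0_000"], name="id"
    ),
)

# All utterances belonging to a single case.
case_df = toy_utterances_df[toy_utterances_df["meta.case_id"] == "1955_71"]

# Number of utterances per case.
utterances_per_case = toy_utterances_df.groupby("meta.case_id").size()
print(utterances_per_case)
```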

In [20]:
utterances_df.shape

(1700789, 12)

In [19]:
utterances_df.isna().sum()

timestamp            1700789
text                       0
speaker                    0
reply_to                7817
conversation_id            0
meta.case_id               0
meta.start_times           0
meta.stop_times            0
meta.speaker_type      88103
meta.side             875520
meta.timestamp             0
vectors                    0
dtype: int64

In [21]:
utterances_df[utterances_df["text"] == ""].shape

(0, 12)

In [8]:
utterances_df["speaker"].value_counts()

<INAUDIBLE>             88102
j__byron_r_white        79536
j__antonin_scalia       57631
j__felix_frankfurter    53061
j__john_paul_stevens    50851
                        ...  
fred_a_granata              1
william_c_harvin            1
william_g_comb              1
iver_e_skjeie               1
c_c_fraizer                 1
Name: speaker, Length: 8979, dtype: int64

There are 88,102 utterances attributed to <INAUDIBLE>. That is probably not a huge problem, since we are unlikely to use those speaker labels.

## Cases Data 
Now let's look at the cases.jsonl file for a minute. This is where all of the information for each case is drawn from. We may not need it, since the votes are already recorded in the conversations dataframe.

In [22]:
import json

In [24]:
with open('../data/cases.jsonl', 'r') as f:
    json_list = list(f)

In [25]:
jsons = []
for json_str in json_list:
    jsons.append(json.loads(json_str))
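
A variation on the loading step, sketched under the assumption that looking cases up by id would be useful: parse the file line by line, skip blank lines, and index the records by `id`. An in-memory `io.StringIO` with two heavily abbreviated records stands in for `cases.jsonl`, and `cases_by_id` is a name chosen for this example.

```python
import io
import json

# Two abbreviated records standing in for lines of cases.jsonl (most fields omitted).
fake_file = io.StringIO(
    '{"id": "1955_71", "year": 1955, "win_side": 0.0}\n'
    "\n"
    '{"id": "1955_410", "year": 1955, "win_side": 1.0}\n'
)

# Parse one JSON object per line and index the records by case id.
cases_by_id = {}
for line in fake_file:
    line = line.strip()
    if not line:
        continue  # tolerate blank lines
    record = json.loads(line)
    cases_by_id[record["id"]] = record

print(sorted(cases_by_id))
```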

In [26]:
jsons[0]

{'id': '1955_71',
 'year': 1955,
 'citation': '350 US 79',
 'title': 'Affronti v. United States',
 'petitioner': 'Affronti',
 'respondent': 'United States',
 'docket_no': '71',
 'court': 'Warren Court',
 'decided_date': 'Dec 5, 1955',
 'url': 'https://www.oyez.org/cases/1955/71',
 'transcripts': [{'name': 'Oral Argument - November 15, 1955',
   'url': 'https://apps.oyez.org/player/#/warren3/oral_argument_audio/13127',
   'id': 13127,
   'case_id': '1955_71'}],
 'adv_sides_inferred': True,
 'known_respondent_adv': True,
 'advocates': {'Harry F. Murphy': {'id': 'harry_f_murphy',
   'name': 'Harry F. Murphy',
   'side': 1},
  'John V. Lindsay': {'id': 'john_v_lindsay',
   'name': 'John V. Lindsay',
   'side': 0}},
 'win_side': 0.0,
 'win_side_detail': 2.0,
 'scdb_docket_id': '1955-009-01',
 'votes': {'j__john_m_harlan2': 2.0,
  'j__hugo_l_black': 2.0,
  'j__william_o_douglas': 2.0,
  'j__earl_warren': 2.0,
  'j__tom_c_clark': 2.0,
  'j__felix_frankfurter': 2.0,
  'j__harold_burton': 2.0,


The voting information is a little confusing. You can read more about it here:
https://convokit.cornell.edu/documentation/supreme.html
http://scdb.wustl.edu/documentation.php?var=majority
http://scdb.wustl.edu/documentation.php?var=vote

Luckily, the information we need has already been extracted from it and is available in the conversations dataframe.

## Manual exploration of Justice Rosters
The idea here was to find the set of justices that sat together on the most cases in the dataset.

In [28]:
jsons[0]['votes'].keys()

dict_keys(['j__john_m_harlan2', 'j__hugo_l_black', 'j__william_o_douglas', 'j__earl_warren', 'j__tom_c_clark', 'j__felix_frankfurter', 'j__harold_burton', 'j__stanley_reed', 'j__sherman_minton'])

In [30]:
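# Count how many cases each justice voted in, and how many cases
# each distinct set of justices (roster) decided together.
justice_combos = dict()
justices = dict()
for j in jsons:
    # Skip cases with no recorded votes.
    if j['votes'] is not None:
        temp_justices = list(j['votes'].keys())
        for justice in temp_justices:
            justices[justice] = justices.get(justice, 0) + 1
        # A frozenset is hashable and ignores order, so it can be a dict key.
        frozen_js = frozenset(temp_justices)
        justice_combos[frozen_js] = justice_combos.get(frozen_js, 0) + 1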

In [31]:
max(justice_combos, key=justice_combos.get)

frozenset({'j__anthony_m_kennedy',
           'j__antonin_scalia',
           'j__clarence_thomas',
           'j__david_h_souter',
           'j__john_paul_stevens',
           'j__ruth_bader_ginsburg',
           'j__sandra_day_oconnor',
           'j__stephen_g_breyer',
           'j__william_h_rehnquist'})

In [32]:
justice_combos[max(justice_combos, key=justice_combos.get)]

950

In [33]:
len(justice_combos.keys())

42

We were also curious how many justices are represented in the dataset in total.

In [34]:
len(justices.keys())

34
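
The same roster counting could also be written with `collections.Counter`, which gives `most_common` for free. This is only a sketch on made-up records: the case ids and justice names in `toy_cases` are hypothetical, and the real input would be the parsed `cases.jsonl` records.

```python
from collections import Counter

# Hypothetical records shaped like the parsed cases (only the fields used here).
toy_cases = [
    {"id": "case_a", "votes": {"j__x": 1, "j__y": 0}},
    {"id": "case_b", "votes": {"j__y": 1, "j__x": 1}},
    {"id": "case_c", "votes": None},
    {"id": "case_d", "votes": {"j__x": 0, "j__z": 0}},
]

# Cases decided by each distinct roster (order of justices does not matter).
roster_counts = Counter(
    frozenset(case["votes"]) for case in toy_cases if case["votes"] is not None
)

# Cases in which each individual justice has a recorded vote.
justice_counts = Counter(
    justice
    for case in toy_cases
    if case["votes"] is not None
    for justice in case["votes"]
)

top_roster, top_count = roster_counts.most_common(1)[0]
print(sorted(top_roster), top_count)
```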

## The convokit Transformer Class
This should make our lives a lot easier for this project. It includes a lot of built-in functionality for preprocessing, feature extraction, and analysis.
You can see more here: https://convokit.cornell.edu/documentation/transformers.html

In [39]:
from convokit import TextCleaner, TextParser, BoWTransformer

### TextCleaner
"Transformer that cleans the text of utterances in an input Corpus. By default, the text cleaner assumes the text is in English. It fixes unicode errors, transliterates text to the closest ASCII representation, lowercases text, removes line breaks, and replaces URLs, emails, phone numbers, numbers, currency symbols with special tokens."

In [36]:
TextCleaner(verbosity=50000).transform(corpus)

50000/1700789 utterances processed
100000/1700789 utterances processed
150000/1700789 utterances processed
200000/1700789 utterances processed
250000/1700789 utterances processed
300000/1700789 utterances processed
350000/1700789 utterances processed
400000/1700789 utterances processed
450000/1700789 utterances processed
500000/1700789 utterances processed
550000/1700789 utterances processed
600000/1700789 utterances processed
650000/1700789 utterances processed
700000/1700789 utterances processed
750000/1700789 utterances processed
800000/1700789 utterances processed
850000/1700789 utterances processed
900000/1700789 utterances processed
950000/1700789 utterances processed
1000000/1700789 utterances processed
1050000/1700789 utterances processed
1100000/1700789 utterances processed
1150000/1700789 utterances processed
1200000/1700789 utterances processed
1250000/1700789 utterances processed
1300000/1700789 utterances processed
1350000/1700789 utterances processed
1400000/1700789 utter

<convokit.model.corpus.Corpus at 0x10d58af10>
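
To get a feel for the kind of cleaning the quoted description mentions, here is a deliberately simplified sketch using only the standard library. It is not ConvoKit's implementation: the function `rough_clean`, the regular expressions, and the `<url>` and `<number>` placeholder tokens are all assumptions made for illustration, and it covers only lowercasing, line-break removal, and URL and number replacement.

```python
import re


def rough_clean(text):
    """Simplified illustration of a few cleaning steps (not ConvoKit's code)."""
    text = text.lower()
    # Replace URLs first so digits inside them are not treated as numbers.
    text = re.sub(r"https?://\S+", "<url>", text)
    # Replace integers and simple decimals with a placeholder token.
    text = re.sub(r"\d+(?:\.\d+)?", "<number>", text)
    # Collapse line breaks and repeated whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text


# Made-up sample text, not taken from the corpus.
sample = "Argued in 1955.\nTranscript: https://www.oyez.org/cases/1955/71"
cleaned = rough_clean(sample)
print(cleaned)
```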

### TextParser
"Transformer that dependency-parses each Utterance in a Corpus. This parsing step is a prerequisite for some of the models included in ConvoKit.

By default, will perform the following:

- tokenize words and sentences
- POS-tags words
- dependency-parses sentences"

This may not be necessary for our models but it's cool to know that it's an option.

In [None]:
TextParser(verbosity=200000).transform(corpus)

### BoWTransformer
Bag-of-Words Transformer for annotating a Corpus’s objects with the bag-of-words vectorization of some textual element of the Corpus components.
This gives us bag-of-words vectors for all of our utterances out of the box, which is pretty nifty.

In [40]:
bow_transformer = BoWTransformer(obj_type="utterance")

Initializing default unigram CountVectorizer...Done.


In [41]:
bow_transformer.fit_transform(corpus)



<convokit.model.corpus.Corpus at 0x10d58af10>

Now all of our utterances have a bag-of-words vector. The `vectors` attribute lists the names of the vectors attached to an utterance.

In [46]:
corpus.get_utterance('13127__0_000').vectors

['bow_vector']
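
For intuition about what a unigram bag-of-words vector contains, the sketch below counts word occurrences by hand with the standard library. It is an illustration only and not what `BoWTransformer` runs internally. The tokenization (lowercasing, tokens of two or more word characters) is meant to resemble scikit-learn's default `CountVectorizer` settings, and the utterance ids and texts are made up.

```python
import re
from collections import Counter

# Made-up utterances keyed by hypothetical ids.
toy_utterances = {
    "u1": "Consecutive sentences.",
    "u2": "Consecutive sentences. In this case, the sentences ran consecutively.",
}


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Unigram counts per utterance.
counts = {uid: Counter(tokenize(text)) for uid, text in toy_utterances.items()}

# Shared vocabulary: one vector position per distinct token, in sorted order.
vocab = sorted(set().union(*counts.values()))

# Dense count vectors aligned with the vocabulary.
bow_vectors = {uid: [c[token] for token in vocab] for uid, c in counts.items()}

print(vocab)
print(bow_vectors)
```

Real bag-of-words matrices over 1.7 million utterances are typically stored in sparse form, since most entries are zero.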