In [2]:
from convokit import Corpus, download, TextCleaner, TextParser, BoWTransformer
import pandas as pd
import json

In [21]:
corpus = Corpus(filename="/Users/vaughnfranz/.convokit/downloads/supreme-corpus")

## Data Preprocessing 

Below I define a couple of custom cleaning functions to complement the built-in cleaning capabilities of convokit.

The TextCleaner transformer from convokit operates on a string, so these functions do as well. 

The first function removes punctuation. The second removes stopwords and lemmatizes the text.

In [38]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
stop_words = set(stopwords.words('english'))
wordNetLemm = WordNetLemmatizer()

def remove_punctuation(text):
    # Drop every character listed in string.punctuation.
    return "".join(char for char in text if char not in string.punctuation)

def custom_cleaner(text):
    # Split on whitespace, drop English stopwords, then lemmatize what is left.
    toks = text.split()
    toks = [word for word in toks if word not in stop_words]
    toks = [wordNetLemm.lemmatize(word) for word in toks]
    return " ".join(toks)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vaughnfranz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/vaughnfranz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/vaughnfranz/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


I'm going to start with removing the punctuation, then use the built-in cleaning functionality, and then do the additional cleaning. The reason for this order is that removing punctuation afterwards would strip the angle brackets from the special tokens which convokit inserts for numbers and the like.
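To see why the order matters, here is a small sketch that applies the same punctuation filter to a string that already contains special tokens. The sample sentence is copied from the sanity-check output in this notebook, and nothing here calls convokit itself.

```python
import string

def remove_punctuation(text):
    # Same filter as defined above: drop every character in string.punctuation.
    return "".join(char for char in text if char not in string.punctuation)

# A string as it looks AFTER the built-in cleaning has inserted special tokens.
already_tokenized = "was the aggregate prison sentence was <number> or <number> years"

# '<' and '>' count as punctuation, so the special tokens lose their brackets
# and can no longer be told apart from the ordinary word "number".
stripped = remove_punctuation(already_tokenized)
print(stripped)
```

Running the punctuation filter first avoids this, because the tokens do not exist yet at that point.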

A verbosity of 250,000 (one progress message every 250,000 utterances) feels reasonable for 1.7+ million utterances, so I will use that number throughout.

The TextCleaner class of convokit takes a keyword argument, text_cleaner, that allows us to specify a custom cleaning function to apply to each utterance.

With replace_text set to False the cleaner should store the cleaned text in a metadata field on the utterances called 'cleaned'. The original text will be preserved in 'text'. I will then do the additional cleaning steps on the 'cleaned' field.

In [31]:
corpus = TextCleaner(verbosity=250000, text_cleaner=remove_punctuation, replace_text=False).transform(corpus)

250000/1700789 utterances processed
500000/1700789 utterances processed
750000/1700789 utterances processed
1000000/1700789 utterances processed
1250000/1700789 utterances processed
1500000/1700789 utterances processed
1700789/1700789 utterances processed


Let's make sure the cleaning is operating more or less as expected.

In [32]:
test_utterance_id = '13127__0_000'
utt = corpus.get_utterance(test_utterance_id)
print('ORIGINAL:', utt.text)
print('CLEANED:', utt.meta['cleaned'])

ORIGINAL: Number 71, Lonnie Affronti versus United States of America.
Mr. Murphy.
CLEANED: Number 71 Lonnie Affronti versus United States of America
Mr Murphy


Next, cleaning using the built-in functionality of convokit.

The TextCleaner will, by default:
- fix unicode errors, transliterate text to the closest ASCII representation
- lowercase text
- remove line breaks
- replace URLs, emails, phone numbers, numbers, and currency symbols with special tokens

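The cleaner will operate by default on the utterances (specifically, utterance.text). Here I set input_field='cleaned' so that it operates on the output of the punctuation step instead.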

In [33]:
corpus = TextCleaner(verbosity=250000, input_field='cleaned', replace_text=False).transform(corpus)

250000/1700789 utterances processed
500000/1700789 utterances processed
750000/1700789 utterances processed
1000000/1700789 utterances processed
1250000/1700789 utterances processed
1500000/1700789 utterances processed
1700789/1700789 utterances processed


Another sanity test on the cleaning process.

In [34]:
test_utterance_id = '13127__0_000'
test_utterance_id_2 = '13127__0_004'
utt = corpus.get_utterance(test_utterance_id)
print('TEST: 13127__0_000')
print('ORIGINAL:', utt.text)
print('CLEANED:', utt.meta['cleaned'])
utt2 = corpus.get_utterance(test_utterance_id_2)
print('TEST: 13127__0_004')
print('ORIGINAL:', utt2.text)
print('CLEANED:', utt2.meta['cleaned'])

TEST: 13127__0_000
ORIGINAL: Number 71, Lonnie Affronti versus United States of America.
Mr. Murphy.
CLEANED: number <number> lonnie affronti versus united states of america mr murphy
TEST: 13127__0_004
ORIGINAL: Was the aggregate prison sentence was 20 or 25 years?
CLEANED: was the aggregate prison sentence was <number> or <number> years


Now we apply the remaining custom cleaning steps (stopword removal and lemmatization) as defined in custom_cleaner up top.

In [39]:
corpus = TextCleaner(verbosity=250000, text_cleaner=custom_cleaner, input_field='cleaned', replace_text=False).transform(corpus)

250000/1700789 utterances processed
500000/1700789 utterances processed
750000/1700789 utterances processed
1000000/1700789 utterances processed
1250000/1700789 utterances processed
1500000/1700789 utterances processed
1700789/1700789 utterances processed


Sanity test again...

In [40]:
test_utterance_id = '13127__0_000'
test_utterance_id_2 = '13127__0_004'
utt = corpus.get_utterance(test_utterance_id)
print('TEST: 13127__0_000')
print('ORIGINAL:', utt.text)
print('CLEANED:', utt.meta['cleaned'])
utt2 = corpus.get_utterance(test_utterance_id_2)
print('TEST: 13127__0_004')
print('ORIGINAL:', utt2.text)
print('CLEANED:', utt2.meta['cleaned'])

TEST: 13127__0_000
ORIGINAL: Number 71, Lonnie Affronti versus United States of America.
Mr. Murphy.
CLEANED: number <number> lonnie affronti versus united state america mr murphy
TEST: 13127__0_004
ORIGINAL: Was the aggregate prison sentence was 20 or 25 years?
CLEANED: aggregate prison sentence <number> <number> year


## Putting the data together 
Now we can get our dataframes, and connect the case information with the utterances.

In [42]:
utterances_df = corpus.get_utterances_dataframe()

In [41]:
conversations_df = corpus.get_conversations_dataframe()

In [43]:
utterances_df.head()

id,timestamp,text,speaker,reply_to,conversation_id,meta.case_id,meta.start_times,meta.stop_times,meta.speaker_type,meta.side,meta.timestamp,meta.cleaned,vectors
13127__0_000,,"Number 71, Lonnie Affronti versus United State...",j__earl_warren,,13127,1955_71,"[0.0, 7.624]","[7.624, 9.218]",J,,0.0,number <number> lonnie affronti versus united ...,[]
13127__0_001,,May it please the Court.\nWe are here by writ ...,harry_f_murphy,13127__0_000,13127,1955_71,"[9.218, 11.538, 15.653, 22.722, 28.849, 33.575]","[11.538, 15.653, 22.722, 28.849, 33.575, 48.138]",A,1.0,9.218,may please court writ certiorari eighth circui...,[]
13127__0_002,,Consecutive sentences.,j__william_o_douglas,13127__0_001,13127,1955_71,[48.138],[49.315],J,,48.138,consecutive sentence,[]
13127__0_003,,"Consecutive sentences.\nIn this case, the defe...",harry_f_murphy,13127__0_002,13127,1955_71,"[49.315, 51.844, 60.81, 67.083, 72.584, 89.839...","[51.844, 60.81, 67.083, 72.584, 89.839, 95.873...",A,1.0,49.315,consecutive sentence case defendant affronti i...,[]
13127__0_004,,Was the aggregate prison sentence was 20 or 25...,<INAUDIBLE>,13127__0_003,13127,1955_71,[174.058],[176.766],,,174.058,aggregate prison sentence <number> <number> year,[]


In [44]:
conversations_df.head()

id,vectors,meta.case_id,meta.advocates,meta.win_side,meta.votes_side
13127,[],1955_71,"{'harry_f_murphy': {'side': 1, 'role': 'inferr...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
12997,[],1955_410,"{'howard_c_westwood': {'side': 1, 'role': 'inf...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13024,[],1955_410,"{'howard_c_westwood': {'side': 1, 'role': 'inf...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13015,[],1955_351,"{'harry_d_graham': {'side': 3, 'role': 'inferr...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13016,[],1955_38,"{'robert_n_gorman': {'side': 3, 'role': 'infer...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."


We can use the built-in pandas merge function to bring the case information into the utterances dataframe.

In [45]:
utterances_df = pd.merge(utterances_df, conversations_df[['meta.case_id', 'meta.win_side', 'meta.votes_side']], how='left', left_on='meta.case_id', right_on='meta.case_id')
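One thing worth checking before relying on this merge: in the conversations_df.head() output above, case 1955_410 appears for two conversations (12997 and 13024), so meta.case_id is not unique in conversations_df. A left merge on a non-unique key repeats each matching utterance once per matching conversation row, which may be why the label counts below add up to more rows than the 1,700,789 utterances processed earlier. The sketch below uses toy dataframes with illustrative values, and it assumes the case-level label is identical across conversations of the same case. De-duplicating on the case id is one possible fix, not necessarily the right one for every analysis.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (values are illustrative only).
utts = pd.DataFrame({
    "meta.case_id": ["1955_410", "1955_410", "1955_71"],
    "meta.cleaned": ["first utterance", "second utterance", "third utterance"],
})
convos = pd.DataFrame(
    {"meta.case_id": ["1955_410", "1955_410", "1955_71"],
     "meta.win_side": [1, 1, 0]},
    index=["12997", "13024", "13127"],
)

# Merging on a non-unique key duplicates the utterances of case 1955_410.
naive = pd.merge(utts, convos, how="left", on="meta.case_id")

# Assumption: labels agree across conversations of one case, so keeping a
# single row per case loses no information.
cases = convos.drop_duplicates(subset="meta.case_id")
deduped = pd.merge(utts, cases, how="left", on="meta.case_id",
                   validate="many_to_one")

print(len(utts), len(naive), len(deduped))
```

Passing validate="many_to_one" makes pandas raise an error if the right-hand key is not unique, which turns a silent row inflation into a visible failure.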

In [46]:
utterances_df.head()

Unnamed: 0,timestamp,text,speaker,reply_to,conversation_id,meta.case_id,meta.start_times,meta.stop_times,meta.speaker_type,meta.side,meta.timestamp,meta.cleaned,vectors,meta.win_side,meta.votes_side
0,,"Number 71, Lonnie Affronti versus United State...",j__earl_warren,,13127,1955_71,"[0.0, 7.624]","[7.624, 9.218]",J,,0.0,number <number> lonnie affronti versus united ...,[],0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
1,,May it please the Court.\nWe are here by writ ...,harry_f_murphy,13127__0_000,13127,1955_71,"[9.218, 11.538, 15.653, 22.722, 28.849, 33.575]","[11.538, 15.653, 22.722, 28.849, 33.575, 48.138]",A,1.0,9.218,may please court writ certiorari eighth circui...,[],0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
2,,Consecutive sentences.,j__william_o_douglas,13127__0_001,13127,1955_71,[48.138],[49.315],J,,48.138,consecutive sentence,[],0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
3,,"Consecutive sentences.\nIn this case, the defe...",harry_f_murphy,13127__0_002,13127,1955_71,"[49.315, 51.844, 60.81, 67.083, 72.584, 89.839...","[51.844, 60.81, 67.083, 72.584, 89.839, 95.873...",A,1.0,49.315,consecutive sentence case defendant affronti i...,[],0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
4,,Was the aggregate prison sentence was 20 or 25...,<INAUDIBLE>,13127__0_003,13127,1955_71,[174.058],[176.766],,,174.058,aggregate prison sentence <number> <number> year,[],0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."


In [48]:
utterances_df['meta.win_side'].value_counts()

 1    1378597
 0     796114
 2        831
-1        374
Name: meta.win_side, dtype: int64

According to the convokit documentation, a 2 signifies that the decision was unclear and a -1 signifies that the data was unavailable. We can drop these rows to simplify the classification task.

In [49]:
# Drop rows whose outcome is unclear (2) or unavailable (-1).
utterances_df = utterances_df[~utterances_df['meta.win_side'].isin([2, -1])]

In [64]:
utterances_df['meta.win_side'].value_counts()

1    1378597
0     796114
Name: meta.win_side, dtype: int64

## Grouping Utterances by Case for Case Level Classification

I wanted to try creating documents for each case in the event that we wanted to do case-level classification. 

To do this, I am just going to concatenate the cleaned text of all utterances in each case.

In [66]:
utt_df_cpy = utterances_df.copy(deep=True)

Also going to remove some columns to make this next part simpler...

In [72]:
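# drop() returns a new dataframe, so assign the result back to keep the change.
utt_df_cpy = utt_df_cpy.drop(columns=['speaker', 'reply_to', 'conversation_id',
                                      'meta.start_times', 'meta.stop_times', 
                                      'meta.speaker_type', 'meta.side', 
                                      'meta.timestamp', 'vectors', 'timestamp', 
                                      'meta.votes_side'])
utt_df_cpy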

Unnamed: 0,text,meta.case_id,meta.cleaned,meta.win_side
0,"Number 71, Lonnie Affronti versus United State...",1955_71,number <number> lonnie affronti versus united ...,0
1,May it please the Court.\nWe are here by writ ...,1955_71,may please court writ certiorari eighth circui...,0
2,Consecutive sentences.,1955_71,consecutive sentence,0
3,"Consecutive sentences.\nIn this case, the defe...",1955_71,consecutive sentence case defendant affronti i...,0
4,Was the aggregate prison sentence was 20 or 25...,1955_71,aggregate prison sentence <number> <number> year,0
...,...,...,...,...
2179509,-- has all sorts of meaning that you're not en...,2019_19-67,sort meaning youre endorsing youre saying aidi...,1
2179510,"No, Your Honor --",2019_19-67,honor,1
2179511,-- altogether?,2019_19-67,altogether,1
2179512,-- we are using the principles of complicity a...,2019_19-67,using principle complicity solicitation statut...,1


In [73]:
utt_df_cpy.groupby(['meta.case_id', 'meta.win_side'])['meta.cleaned'].apply(" ".join).reset_index()

Unnamed: 0,meta.case_id,meta.win_side,meta.cleaned
0,1955_10,0,number <number> commonwealth pennsylvania vers...
1,1955_102,0,minute remaining simply desire point brief dec...
2,1955_110,1,court dennis case one cited brief reading beca...
3,1955_111,1,number <number> gonzales versus hr landon dist...
4,1955_112,1,number <number> amos reece versus state georgi...
...,...,...,...
6711,2019_19-631,0,well hear argument next case <number> william ...
6712,2019_19-635,0,well hear argument next case <number> donald t...
6713,2019_19-67,1,well hear argument morning case <number> unite...
6714,2019_19-7,1,well hear argument first morning case <number>...
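
The grouping step can be checked on a toy dataframe (illustrative values, same column names as above): each case should end up as one row whose text is the space-joined cleaned utterances in their original order. Assigning the result to a variable, here the hypothetical name case_df, keeps the case-level documents available for later steps instead of only displaying them.

```python
import pandas as pd

# Toy utterance-level data; values are illustrative only.
toy = pd.DataFrame({
    "meta.case_id": ["1955_71", "1955_71", "1955_10"],
    "meta.win_side": [0, 0, 1],
    "meta.cleaned": ["consecutive sentence",
                     "aggregate prison sentence",
                     "may please court"],
})

# One row per (case, outcome) pair, with all cleaned utterances joined by spaces.
case_df = (
    toy.groupby(["meta.case_id", "meta.win_side"])["meta.cleaned"]
    .apply(" ".join)
    .reset_index()
)
print(case_df)
```

Note that " ".join requires every value to be a string, so any missing values in meta.cleaned would need to be filled or dropped first.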
