In [1]:
from convokit import Corpus, download, TextCleaner, TextParser, BoWTransformer
import pandas as pd
import json

In [2]:
corpus = Corpus(filename="/Users/vaughnfranz/.convokit/downloads/supreme-corpus")

## Data Preprocessing 

Below I define a couple of custom cleaning functions to complement the built-in cleaning capabilities of convokit.

The TextCleaner transformer from convokit operates on a string, so these functions do as well. 

The first function removes punctuation. The second removes stopwords and lemmatizes the text.

In [3]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
stop_words = set(stopwords.words('english'))
wordNetLemm = WordNetLemmatizer()

def remove_punctuation(text):
    cleaned = "".join([char for char in text if char not in string.punctuation])
    return cleaned

def custom_cleaner(text):
    toks = text.split()
    toks = [word for word in toks if word not in stop_words]
    toks = [wordNetLemm.lemmatize(word) for word in toks]
    cleaned = " ".join(toks)
    return cleaned

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vaughnfranz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/vaughnfranz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/vaughnfranz/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


I'm going to start with removing the punctuation, then use the built-in cleaning functionality, and then do the additional cleaning. The reason for this order is to avoid affecting the special tokens that convokit inserts for numbers and the like.
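
To see why the order matters, here is a minimal sketch using only the standard library. The helper repeats the logic of remove_punctuation from above; the string with the `<number>` token is a hand-written stand-in for what the built-in cleaner produces, not actual convokit output.

```python
import string

def remove_punctuation(text):
    # Same logic as the notebook's helper: drop every ASCII punctuation character.
    return "".join(ch for ch in text if ch not in string.punctuation)

raw = "Was the sentence 20 or 25 years?"

# Hand-written stand-in for text that already contains the special token.
tokenized = "was the sentence <number> or <number> years?"

# Stripping punctuation AFTER the token exists also strips its angle brackets.
print(remove_punctuation(tokenized))

# Stripping punctuation first leaves the digits intact for the later step.
print(remove_punctuation(raw))
```

Because `<` and `>` are part of string.punctuation, running the punctuation step last would turn every special token into an ordinary word.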

Verbosity of 250,000 feels reasonable for 1.7+ million utterances, so I will use that number throughout. 

The TextCleaner class of convokit takes a keyword arg that allows us to specify a custom cleaning function to apply to each utterance. 

With replace_text set to False, the cleaner should store the cleaned text in an attribute on the utterances called 'cleaned'. The original text will be preserved in 'text'. I will then do the additional cleaning steps on the 'cleaned' attribute. 

In [4]:
corpus = TextCleaner(verbosity=250000, text_cleaner=remove_punctuation, replace_text=False).transform(corpus)

250000/1700789 utterances processed
500000/1700789 utterances processed
750000/1700789 utterances processed
1000000/1700789 utterances processed
1250000/1700789 utterances processed
1500000/1700789 utterances processed
1700789/1700789 utterances processed


Let's make sure the cleaning is operating more or less as expected.

In [5]:
test_utterance_id = '13127__0_000'
utt = corpus.get_utterance(test_utterance_id)
print('ORIGINAL:', utt.text)
print('CLEANED:', utt.meta['cleaned'])

ORIGINAL: Number 71, Lonnie Affronti versus United States of America.
Mr. Murphy.
CLEANED: Number 71 Lonnie Affronti versus United States of America
Mr Murphy


Next, cleaning with the built-in functionality of convokit. 

The TextCleaner will, by default:
- fix unicode errors, transliterate text to the closest ASCII representation
- lowercase text
- remove line breaks
- replace URLs, emails, phone numbers, numbers, and currency symbols with special tokens

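By default the cleaner operates on the utterances (specifically, utterance.text). The TextCleaner call in the next cell passes input_field='cleaned' so that it works on the punctuation-stripped text instead.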
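
For intuition, the sketch below approximates three of the default steps (lowercasing, removing line breaks, replacing numbers) with the standard library. It is not convokit's implementation: the function name approx_default_clean is made up here, and unicode repair plus URL, email, phone, and currency handling are left out.

```python
import re

def approx_default_clean(text):
    """Rough stand-in for part of the default cleaning; not convokit's code."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"\s*\n\s*", " ", text)                # remove line breaks
    text = re.sub(r"\d+(?:\.\d+)?", "<number>", text)    # numbers -> special token
    return text.strip()

sample = "Number 71 Lonnie Affronti versus United States of America\nMr Murphy"
print(approx_default_clean(sample))
```

On the punctuation-stripped test utterance this prints the same string as the CLEANED line in the sanity check that follows the real cleaning step.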

In [6]:
corpus = TextCleaner(verbosity=250000, input_field='cleaned', replace_text=False).transform(corpus)

250000/1700789 utterances processed
500000/1700789 utterances processed
750000/1700789 utterances processed
1000000/1700789 utterances processed
1250000/1700789 utterances processed
1500000/1700789 utterances processed
1700789/1700789 utterances processed


Another sanity test on the cleaning process.

In [7]:
test_utterance_id = '13127__0_000'
test_utterance_id_2 = '13127__0_004'
utt = corpus.get_utterance(test_utterance_id)
print('TEST: 13127__0_000')
print('ORIGINAL:', utt.text)
print('CLEANED:', utt.meta['cleaned'])
utt2 = corpus.get_utterance(test_utterance_id_2)
print('TEST: 13127__0_004')
print('ORIGINAL:', utt2.text)
print('CLEANED:', utt2.meta['cleaned'])

TEST: 13127__0_000
ORIGINAL: Number 71, Lonnie Affronti versus United States of America.
Mr. Murphy.
CLEANED: number <number> lonnie affronti versus united states of america mr murphy
TEST: 13127__0_004
ORIGINAL: Was the aggregate prison sentence was 20 or 25 years?
CLEANED: was the aggregate prison sentence was <number> or <number> years


Now performing our other custom cleaning steps (stopword removal and lemmatization), as defined in custom_cleaner above. 

In [8]:
corpus = TextCleaner(verbosity=250000, text_cleaner=custom_cleaner, input_field='cleaned', replace_text=False).transform(corpus)

250000/1700789 utterances processed
500000/1700789 utterances processed
750000/1700789 utterances processed
1000000/1700789 utterances processed
1250000/1700789 utterances processed
1500000/1700789 utterances processed
1700789/1700789 utterances processed


Sanity test again...

In [9]:
test_utterance_id = '13127__0_000'
test_utterance_id_2 = '13127__0_004'
utt = corpus.get_utterance(test_utterance_id)
print('TEST: 13127__0_000')
print('ORIGINAL:', utt.text)
print('CLEANED:', utt.meta['cleaned'])
utt2 = corpus.get_utterance(test_utterance_id_2)
print('TEST: 13127__0_004')
print('ORIGINAL:', utt2.text)
print('CLEANED:', utt2.meta['cleaned'])

TEST: 13127__0_000
ORIGINAL: Number 71, Lonnie Affronti versus United States of America.
Mr. Murphy.
CLEANED: number <number> lonnie affronti versus united state america mr murphy
TEST: 13127__0_004
ORIGINAL: Was the aggregate prison sentence was 20 or 25 years?
CLEANED: aggregate prison sentence <number> <number> year


### OMIT THIS -- DOES NOT WORK AS ANTICIPATED 

Now, we use the TextParser to tokenize our cleaned strings. 

I'm going to use a custom tokenizer here which simply splits the sentences on spaces. This will give us our desired output as a list of strings which is needed for the gensim Word2Vec model. 

This uses nltk's sentence tokenizer by default. The output will be stored in a field called 'parsed.'

If you want to run this you need to make sure spacy's english model is downloaded. You can do this by running:

``` python -m spacy download en ```

In [13]:
def custom_tokenizer(text):
    toks = text.split()
    return toks 
# corpus = TextParser(verbosity=250000, sent_tokenizer=custom_tokenizer, input_field='cleaned', mode='tokenize').transform(corpus)

## Putting the data together 
Now we can get our dataframes, and connect the case information with the utterances.

In [15]:
utterances_df = corpus.get_utterances_dataframe()

Dropping the parsed column from the misstep in using convokit's TextParser earlier... You don't need to run this if you skipped the earlier step. 

In [26]:
utterances_df = utterances_df.drop(columns=['meta.parsed'])

Saving the dataframe to a file for safekeeping so that the above steps do not need to be repeated. 

In [18]:
utterances_df.to_csv(path_or_buf="parsed_utterances.csv", index=False)

Now let's actually tokenize our text, using the custom tokenizer defined earlier. 

In [19]:
utterances_df['tokens'] = utterances_df['meta.cleaned'].apply(custom_tokenizer)

In [27]:
utterances_df.head()

Unnamed: 0,timestamp,text,speaker,reply_to,conversation_id,meta.case_id,meta.start_times,meta.stop_times,meta.speaker_type,meta.side,meta.timestamp,meta.cleaned,vectors,tokens,meta.win_side,meta.votes_side
0,,"Number 71, Lonnie Affronti versus United State...",j__earl_warren,,13127,1955_71,"[0.0, 7.624]","[7.624, 9.218]",J,,0.0,number <number> lonnie affronti versus united ...,[],"[number, <number>, lonnie, affronti, versus, u...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
1,,May it please the Court.\nWe are here by writ ...,harry_f_murphy,13127__0_000,13127,1955_71,"[9.218, 11.538, 15.653, 22.722, 28.849, 33.575]","[11.538, 15.653, 22.722, 28.849, 33.575, 48.138]",A,1.0,9.218,may please court writ certiorari eighth circui...,[],"[may, please, court, writ, certiorari, eighth,...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
2,,Consecutive sentences.,j__william_o_douglas,13127__0_001,13127,1955_71,[48.138],[49.315],J,,48.138,consecutive sentence,[],"[consecutive, sentence]",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
3,,"Consecutive sentences.\nIn this case, the defe...",harry_f_murphy,13127__0_002,13127,1955_71,"[49.315, 51.844, 60.81, 67.083, 72.584, 89.839...","[51.844, 60.81, 67.083, 72.584, 89.839, 95.873...",A,1.0,49.315,consecutive sentence case defendant affronti i...,[],"[consecutive, sentence, case, defendant, affro...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
4,,Was the aggregate prison sentence was 20 or 25...,<INAUDIBLE>,13127__0_003,13127,1955_71,[174.058],[176.766],,,174.058,aggregate prison sentence <number> <number> year,[],"[aggregate, prison, sentence, <number>, <numbe...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."


In [21]:
conversations_df = corpus.get_conversations_dataframe()

In [22]:
conversations_df.head()

id,vectors,meta.case_id,meta.advocates,meta.win_side,meta.votes_side
13127,[],1955_71,"{'harry_f_murphy': {'side': 1, 'role': 'inferr...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
12997,[],1955_410,"{'howard_c_westwood': {'side': 1, 'role': 'inf...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13024,[],1955_410,"{'howard_c_westwood': {'side': 1, 'role': 'inf...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13015,[],1955_351,"{'harry_d_graham': {'side': 3, 'role': 'inferr...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13016,[],1955_38,"{'robert_n_gorman': {'side': 3, 'role': 'infer...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."


We can use pandas' built-in merge function to bring the case information into the utterances dataframe. 

In [23]:
utterances_df = pd.merge(utterances_df, conversations_df[['meta.case_id', 'meta.win_side', 'meta.votes_side']], how='left', left_on='meta.case_id', right_on='meta.case_id')
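
One thing to watch with a merge on meta.case_id: the conversations dataframe can hold several conversations for the same case (1955_410 appears under both 12997 and 13024 in the head output), so each utterance of such a case matches more than one row on the right. The win_side counts in this notebook sum to 2,175,916 rows against 1,700,789 utterances, which suggests that this duplication is happening. The sketch below reproduces the effect on made-up toy frames and shows two possible ways around it; neither is what the notebook actually runs.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (values are made up for illustration).
utts = pd.DataFrame({
    "conversation_id": ["c1", "c1", "c2"],
    "meta.case_id": ["1955_410", "1955_410", "1955_410"],
})
convos = pd.DataFrame(
    {"meta.case_id": ["1955_410", "1955_410"], "meta.win_side": [1, 1]},
    index=pd.Index(["c1", "c2"], name="id"),
)

# A key that repeats on the right side multiplies the rows on the left.
naive = pd.merge(utts, convos, how="left", on="meta.case_id")
print(len(utts), len(naive))

# Option 1: keep one row per case before merging.
per_case = convos.drop_duplicates(subset="meta.case_id")
deduped = pd.merge(utts, per_case, how="left", on="meta.case_id")

# Option 2: join each utterance to its own conversation through the index.
by_convo = utts.merge(convos[["meta.win_side"]], how="left",
                      left_on="conversation_id", right_index=True)
print(len(deduped), len(by_convo))
```

Option 1 assumes the outcome columns are identical for every conversation of a case; option 2 avoids that assumption by using the conversation id as the key.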

In [28]:
utterances_df.head()

Unnamed: 0,timestamp,text,speaker,reply_to,conversation_id,meta.case_id,meta.start_times,meta.stop_times,meta.speaker_type,meta.side,meta.timestamp,meta.cleaned,vectors,tokens,meta.win_side,meta.votes_side
0,,"Number 71, Lonnie Affronti versus United State...",j__earl_warren,,13127,1955_71,"[0.0, 7.624]","[7.624, 9.218]",J,,0.0,number <number> lonnie affronti versus united ...,[],"[number, <number>, lonnie, affronti, versus, u...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
1,,May it please the Court.\nWe are here by writ ...,harry_f_murphy,13127__0_000,13127,1955_71,"[9.218, 11.538, 15.653, 22.722, 28.849, 33.575]","[11.538, 15.653, 22.722, 28.849, 33.575, 48.138]",A,1.0,9.218,may please court writ certiorari eighth circui...,[],"[may, please, court, writ, certiorari, eighth,...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
2,,Consecutive sentences.,j__william_o_douglas,13127__0_001,13127,1955_71,[48.138],[49.315],J,,48.138,consecutive sentence,[],"[consecutive, sentence]",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
3,,"Consecutive sentences.\nIn this case, the defe...",harry_f_murphy,13127__0_002,13127,1955_71,"[49.315, 51.844, 60.81, 67.083, 72.584, 89.839...","[51.844, 60.81, 67.083, 72.584, 89.839, 95.873...",A,1.0,49.315,consecutive sentence case defendant affronti i...,[],"[consecutive, sentence, case, defendant, affro...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
4,,Was the aggregate prison sentence was 20 or 25...,<INAUDIBLE>,13127__0_003,13127,1955_71,[174.058],[176.766],,,174.058,aggregate prison sentence <number> <number> year,[],"[aggregate, prison, sentence, <number>, <numbe...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."


In [29]:
utterances_df['meta.win_side'].value_counts()

 1    1378597
 0     796114
 2        831
-1        374
Name: meta.win_side, dtype: int64

According to the convokit documentation, a 2 signifies that the decision was unclear and a -1 signifies that the data was unavailable. We can drop these rows to simplify the classification task. 

In [30]:
utterances_df = utterances_df[utterances_df['meta.win_side'] != 2]
utterances_df = utterances_df[utterances_df['meta.win_side'] != -1]

In [31]:
utterances_df['meta.win_side'].value_counts()

1    1378597
0     796114
Name: meta.win_side, dtype: int64

We are also going to make use of the speaker information in our models, so let's examine that and drop data where necessary.

In [32]:
utterances_df['meta.speaker_type'].value_counts()

A    1056973
J    1012327
Name: meta.speaker_type, dtype: int64

We also need to do a one-hot-encoding of the speakers for our model. Doing that here.

In [33]:
utterances_df = pd.get_dummies(utterances_df, prefix=['speaker_type'], columns=['meta.speaker_type'])
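
A side effect worth knowing: rows with a missing speaker type get a 0 in every indicator column, which is what happens to the INAUDIBLE row in the head output that follows. The sketch below shows this on a made-up three-row frame and, as one possible alternative, uses dummy_na=True to make the missing case explicit.

```python
import pandas as pd

# Toy frame: one justice, one advocate, one row with no speaker type.
toy = pd.DataFrame({"meta.speaker_type": ["J", "A", None]})

encoded = pd.get_dummies(toy, prefix=["speaker_type"], columns=["meta.speaker_type"])
# Cast to int so the printout looks the same across pandas versions.
encoded = encoded.astype(int)
print(encoded)

# dummy_na=True adds an extra indicator column for the missing value.
with_na = pd.get_dummies(toy, prefix=["speaker_type"],
                         columns=["meta.speaker_type"], dummy_na=True).astype(int)
print(list(with_na.columns))
```

Whether the all-zero encoding is acceptable depends on the model; it effectively treats unknown speakers as a third, implicit category.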

In [34]:
utterances_df.head()

Unnamed: 0,timestamp,text,speaker,reply_to,conversation_id,meta.case_id,meta.start_times,meta.stop_times,meta.side,meta.timestamp,meta.cleaned,vectors,tokens,meta.win_side,meta.votes_side,speaker_type_A,speaker_type_J
0,,"Number 71, Lonnie Affronti versus United State...",j__earl_warren,,13127,1955_71,"[0.0, 7.624]","[7.624, 9.218]",,0.0,number <number> lonnie affronti versus united ...,[],"[number, <number>, lonnie, affronti, versus, u...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",0,1
1,,May it please the Court.\nWe are here by writ ...,harry_f_murphy,13127__0_000,13127,1955_71,"[9.218, 11.538, 15.653, 22.722, 28.849, 33.575]","[11.538, 15.653, 22.722, 28.849, 33.575, 48.138]",1.0,9.218,may please court writ certiorari eighth circui...,[],"[may, please, court, writ, certiorari, eighth,...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",1,0
2,,Consecutive sentences.,j__william_o_douglas,13127__0_001,13127,1955_71,[48.138],[49.315],,48.138,consecutive sentence,[],"[consecutive, sentence]",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",0,1
3,,"Consecutive sentences.\nIn this case, the defe...",harry_f_murphy,13127__0_002,13127,1955_71,"[49.315, 51.844, 60.81, 67.083, 72.584, 89.839...","[51.844, 60.81, 67.083, 72.584, 89.839, 95.873...",1.0,49.315,consecutive sentence case defendant affronti i...,[],"[consecutive, sentence, case, defendant, affro...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",1,0
4,,Was the aggregate prison sentence was 20 or 25...,<INAUDIBLE>,13127__0_003,13127,1955_71,[174.058],[176.766],,174.058,aggregate prison sentence <number> <number> year,[],"[aggregate, prison, sentence, <number>, <numbe...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",0,0


In [47]:
utterances_df['meta.case_id'].value_counts()

1961_8          12672
1961_2           7476
1965_759         6595
1960_6           5576
1962_8           4972
                ...  
1977_76-5935       27
1968_574           26
1960_340           24
2000_99-1884        2
2011_10-1195        1
Name: meta.case_id, Length: 6728, dtype: int64

## Obtaining embedded utterances 

We are now going to use gensim in order to obtain embedded utterances. We will use a word2vec model and then will pool the vectors for each utterance. 

To start, I am going to slice the dataframe down to a specific case and train on that. We will then use that to debug our model pipelines. 

In [35]:
case_1955_71 = utterances_df[utterances_df['meta.case_id'] == '1955_71']

In [36]:
case_1955_71.shape

(145, 17)

In [37]:
case_1955_71.head()

Unnamed: 0,timestamp,text,speaker,reply_to,conversation_id,meta.case_id,meta.start_times,meta.stop_times,meta.side,meta.timestamp,meta.cleaned,vectors,tokens,meta.win_side,meta.votes_side,speaker_type_A,speaker_type_J
0,,"Number 71, Lonnie Affronti versus United State...",j__earl_warren,,13127,1955_71,"[0.0, 7.624]","[7.624, 9.218]",,0.0,number <number> lonnie affronti versus united ...,[],"[number, <number>, lonnie, affronti, versus, u...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",0,1
1,,May it please the Court.\nWe are here by writ ...,harry_f_murphy,13127__0_000,13127,1955_71,"[9.218, 11.538, 15.653, 22.722, 28.849, 33.575]","[11.538, 15.653, 22.722, 28.849, 33.575, 48.138]",1.0,9.218,may please court writ certiorari eighth circui...,[],"[may, please, court, writ, certiorari, eighth,...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",1,0
2,,Consecutive sentences.,j__william_o_douglas,13127__0_001,13127,1955_71,[48.138],[49.315],,48.138,consecutive sentence,[],"[consecutive, sentence]",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",0,1
3,,"Consecutive sentences.\nIn this case, the defe...",harry_f_murphy,13127__0_002,13127,1955_71,"[49.315, 51.844, 60.81, 67.083, 72.584, 89.839...","[51.844, 60.81, 67.083, 72.584, 89.839, 95.873...",1.0,49.315,consecutive sentence case defendant affronti i...,[],"[consecutive, sentence, case, defendant, affro...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",1,0
4,,Was the aggregate prison sentence was 20 or 25...,<INAUDIBLE>,13127__0_003,13127,1955_71,[174.058],[176.766],,174.058,aggregate prison sentence <number> <number> year,[],"[aggregate, prison, sentence, <number>, <numbe...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",0,0


In [44]:
from gensim.models.word2vec import Word2Vec

Splitting the word2vec training into separate steps. We will use a maximum vocabulary size of 10k for our initial training. 

In [45]:
w2vmodel = Word2Vec(max_vocab_size=10000, vector_size=300)

In [46]:
w2vmodel.build_vocab(case_1955_71['tokens'], progress_per=10000)

In [49]:
case_1955_71['tokens'].shape

(145,)

In [51]:
w2vmodel.train(case_1955_71['tokens'], total_examples=145, epochs=30)

(26141, 88380)

## Using the Trained W2V model to create embeddings

We will combine the vectors for each word in an utterance by averaging them. 

In [84]:
import numpy as np

def get_pooled_embedding(tokens):
    word_embeddings = [w2vmodel.wv[tok] for tok in tokens if tok in w2vmodel.wv]
    if len(word_embeddings) > 0:
        average = np.mean(word_embeddings, axis=0)
    else:
        average = np.zeros((300, ))
    return average
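
The pooling logic can be checked without a trained model. In this sketch a plain dict of made-up 3-dimensional vectors stands in for w2vmodel.wv, and the pool function takes the lookup and the dimension as arguments instead of reading them from globals.

```python
import numpy as np

# Made-up 3-dimensional vectors standing in for the trained word vectors.
toy_vectors = {
    "prison": np.array([1.0, 0.0, 2.0]),
    "sentence": np.array([3.0, 2.0, 0.0]),
}

def pool(tokens, vectors, dim):
    # Keep only tokens that have a vector, then average them element-wise.
    found = [vectors[tok] for tok in tokens if tok in vectors]
    if found:
        return np.mean(found, axis=0)
    # No known token at all: fall back to a zero vector of the right size.
    return np.zeros(dim)

print(pool(["aggregate", "prison", "sentence"], toy_vectors, 3))
print(pool(["<number>"], toy_vectors, 3))
```

With gensim's default min_count of 5, a corpus as small as a single case leaves many tokens out of the vocabulary, so the zero-vector fallback is likely to be hit fairly often at this stage.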

In [85]:
case_1955_71['sentence_embedding'] = case_1955_71['tokens'].apply(get_pooled_embedding)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  case_1955_71['sentence_embedding'] = case_1955_71['tokens'].apply(get_pooled_embedding)
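
The warning above appears because case_1955_71 was created by filtering utterances_df, so pandas cannot tell whether the assignment is meant for the slice or the original. A common way to avoid it, sketched here on a made-up frame, is to take an explicit copy when slicing.

```python
import pandas as pd

df = pd.DataFrame({
    "meta.case_id": ["1955_71", "1955_71", "1955_38"],
    "tokens": [["a"], ["b", "c"], ["d"]],
})

# .copy() makes the slice an independent frame, so adding a column is unambiguous.
subset = df[df["meta.case_id"] == "1955_71"].copy()
subset["n_tokens"] = subset["tokens"].apply(len)

print(subset)
print("n_tokens" in df.columns)  # the original frame is left untouched
```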


In [86]:
case_1955_71.head()

Unnamed: 0,timestamp,text,speaker,reply_to,conversation_id,meta.case_id,meta.start_times,meta.stop_times,meta.side,meta.timestamp,meta.cleaned,vectors,tokens,meta.win_side,meta.votes_side,speaker_type_A,speaker_type_J,sentence_embedding
0,,"Number 71, Lonnie Affronti versus United State...",j__earl_warren,,13127,1955_71,"[0.0, 7.624]","[7.624, 9.218]",,0.0,number <number> lonnie affronti versus united ...,[],"[number, <number>, lonnie, affronti, versus, u...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",0,1,"[-0.009820573, 0.13145004, -0.017705124, -0.01..."
1,,May it please the Court.\nWe are here by writ ...,harry_f_murphy,13127__0_000,13127,1955_71,"[9.218, 11.538, 15.653, 22.722, 28.849, 33.575]","[11.538, 15.653, 22.722, 28.849, 33.575, 48.138]",1.0,9.218,may please court writ certiorari eighth circui...,[],"[may, please, court, writ, certiorari, eighth,...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",1,0,"[-0.013877024, 0.15704316, -0.021506846, -0.01..."
2,,Consecutive sentences.,j__william_o_douglas,13127__0_001,13127,1955_71,[48.138],[49.315],,48.138,consecutive sentence,[],"[consecutive, sentence]",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",0,1,"[-0.017855529, 0.17307705, -0.024734817, -0.01..."
3,,"Consecutive sentences.\nIn this case, the defe...",harry_f_murphy,13127__0_002,13127,1955_71,"[49.315, 51.844, 60.81, 67.083, 72.584, 89.839...","[51.844, 60.81, 67.083, 72.584, 89.839, 95.873...",1.0,49.315,consecutive sentence case defendant affronti i...,[],"[consecutive, sentence, case, defendant, affro...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",1,0,"[-0.012529933, 0.14926648, -0.020781767, -0.01..."
4,,Was the aggregate prison sentence was 20 or 25...,<INAUDIBLE>,13127__0_003,13127,1955_71,[174.058],[176.766],,174.058,aggregate prison sentence <number> <number> year,[],"[aggregate, prison, sentence, <number>, <numbe...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,...",0,0,"[-0.014745751, 0.16578588, -0.020736586, -0.01..."


In [88]:
case_1955_71_data = case_1955_71[["sentence_embedding", "speaker_type_A", "speaker_type_J", "meta.win_side"]]

In [90]:
case_1955_71_data.head()

Unnamed: 0,sentence_embedding,speaker_type_A,speaker_type_J,meta.win_side
0,"[-0.009820573, 0.13145004, -0.017705124, -0.01...",0,1,0
1,"[-0.013877024, 0.15704316, -0.021506846, -0.01...",1,0,0
2,"[-0.017855529, 0.17307705, -0.024734817, -0.01...",0,1,0
3,"[-0.012529933, 0.14926648, -0.020781767, -0.01...",1,0,0
4,"[-0.014745751, 0.16578588, -0.020736586, -0.01...",0,0,0


In [92]:
case_1955_71_np_arr = case_1955_71_data.to_numpy()

In [96]:
with open("case_1955_71.npy", "wb") as f:
    np.save(f, case_1955_71_np_arr)
    
with open("case_1955_71.npy", "rb") as f:
    test = np.load(f, allow_pickle=True)
print(test)

[[array([-0.00982057,  0.13145004, -0.01770512, -0.01309748,  0.02003422,
         -0.09262118,  0.12788484,  0.30289868, -0.0372206 , -0.05293366,
          0.0681519 , -0.03293465, -0.01475439, -0.04844739, -0.00565174,
         -0.13407059,  0.16388571, -0.00729544, -0.02639389, -0.05541466,
         -0.02227178, -0.00414846,  0.18889457,  0.08828274,  0.01137721,
         -0.05698402, -0.17822069, -0.00414393, -0.07419278, -0.10577166,
          0.11179742, -0.02635492,  0.00936802, -0.05196199, -0.03022265,
          0.07457402,  0.0287883 , -0.13213901, -0.10053387,  0.0308865 ,
         -0.02088697, -0.03120656,  0.02503637, -0.1047829 ,  0.08456096,
          0.14965193,  0.04479617,  0.04773679, -0.00337038,  0.07332712,
          0.07453831,  0.0121258 , -0.03165824,  0.08230674, -0.0388317 ,
          0.06520162,  0.02425457,  0.00982084, -0.01222275,  0.03510634,
         -0.08391145, -0.06923111,  0.06755479,  0.0058574 , -0.06111544,
          0.1402124 ,  0.07109143,  0.
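
As the output shows, to_numpy on a frame whose first column holds arrays yields an object array, which is why loading needs allow_pickle=True. A possible alternative, sketched on a made-up frame with 4-dimensional embeddings, is to stack everything into a plain numeric feature matrix and a label vector. The names X and y are illustrative only.

```python
import io

import numpy as np
import pandas as pd

# Toy frame with made-up 4-dimensional embeddings in place of the 300-dimensional ones.
toy = pd.DataFrame({
    "sentence_embedding": [np.arange(4, dtype=np.float32), np.ones(4, dtype=np.float32)],
    "speaker_type_A": [0, 1],
    "speaker_type_J": [1, 0],
    "meta.win_side": [0, 0],
})

# Stack the per-row vectors into one (n_rows, dim) matrix, then append the flags.
X = np.hstack([
    np.stack(toy["sentence_embedding"].to_list()),
    toy[["speaker_type_A", "speaker_type_J"]].to_numpy(dtype=np.float32),
])
y = toy["meta.win_side"].to_numpy()
print(X.shape, X.dtype, y.shape)

# Numeric arrays round-trip without pickling (an in-memory buffer is used here).
buf = io.BytesIO()
np.savez(buf, X=X, y=y)
buf.seek(0)
loaded = np.load(buf)
print(np.array_equal(loaded["X"], X))
```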

## Grouping Utterances by Case for Case Level Classification

I wanted to try creating documents for each case in the event that we wanted to do case-level classification. 

Just going to concatenate all text for the cases. 

In [66]:
utt_df_cpy = utterances_df.copy(deep=True)

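Also going to remove some columns to make this next part simpler. Note that drop returns a new DataFrame, so the call below only displays the trimmed frame; utt_df_cpy itself is unchanged unless the result is assigned back.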

In [72]:
utt_df_cpy.drop(columns=['speaker', 'reply_to', 'conversation_id',
                         'meta.start_times', 'meta.stop_times', 
                         'meta.speaker_type', 'meta.side', 
                         'meta.timestamp', 'vectors', 'timestamp', 
                         'meta.votes_side'])

Unnamed: 0,text,meta.case_id,meta.cleaned,meta.win_side
0,"Number 71, Lonnie Affronti versus United State...",1955_71,number <number> lonnie affronti versus united ...,0
1,May it please the Court.\nWe are here by writ ...,1955_71,may please court writ certiorari eighth circui...,0
2,Consecutive sentences.,1955_71,consecutive sentence,0
3,"Consecutive sentences.\nIn this case, the defe...",1955_71,consecutive sentence case defendant affronti i...,0
4,Was the aggregate prison sentence was 20 or 25...,1955_71,aggregate prison sentence <number> <number> year,0
...,...,...,...,...
2179509,-- has all sorts of meaning that you're not en...,2019_19-67,sort meaning youre endorsing youre saying aidi...,1
2179510,"No, Your Honor --",2019_19-67,honor,1
2179511,-- altogether?,2019_19-67,altogether,1
2179512,-- we are using the principles of complicity a...,2019_19-67,using principle complicity solicitation statut...,1


In [73]:
utt_df_cpy.groupby(['meta.case_id', 'meta.win_side'])['meta.cleaned'].apply(" ".join).reset_index()

Unnamed: 0,meta.case_id,meta.win_side,meta.cleaned
0,1955_10,0,number <number> commonwealth pennsylvania vers...
1,1955_102,0,minute remaining simply desire point brief dec...
2,1955_110,1,court dennis case one cited brief reading beca...
3,1955_111,1,number <number> gonzales versus hr landon dist...
4,1955_112,1,number <number> amos reece versus state georgi...
...,...,...,...
6711,2019_19-631,0,well hear argument next case <number> william ...
6712,2019_19-635,0,well hear argument next case <number> donald t...
6713,2019_19-67,1,well hear argument morning case <number> unite...
6714,2019_19-7,1,well hear argument first morning case <number>...
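
The grouping step can be tried on a small made-up frame. The sketch stores the result in a variable (case_docs is a name chosen here; the cell above only displays its result) so that it could be reused for case-level features.

```python
import pandas as pd

toy = pd.DataFrame({
    "meta.case_id": ["1955_71", "1955_71", "1955_38"],
    "meta.win_side": [0, 0, 1],
    "meta.cleaned": ["consecutive sentence", "aggregate prison sentence", "may please court"],
})

# One row per (case, outcome); utterances are joined in their original order.
case_docs = (
    toy.groupby(["meta.case_id", "meta.win_side"])["meta.cleaned"]
       .agg(" ".join)
       .reset_index()
)
print(case_docs)
```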
