# Data 

This notebook illustrates how to estimate word embeddings on a corpus of Supreme Court oral arguments. The data are available via the excellent Cornell Conversational Analysis Toolkit (ConvoKit). You can read more about the data (and ConvoKit) in the [ConvoKit documentation](https://convokit.cornell.edu/documentation/supreme.html#). The Oral Arguments corpus is described as:

>A collection of cases from the U.S. Supreme Court, along with transcripts of oral arguments. Contains approximately 1,700,000 utterances over 8,000 oral arguments transcripts from 7,700 cases.



In [None]:
!pip3 install convokit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting convokit
  Downloading convokit-2.5.3.tar.gz (167 kB)
Collecting msgpack-numpy>=0.4.3.2
  Downloading msgpack_numpy-0.4.8-py2.py3-none-any.whl (6.9 kB)
Collecting clean-text>=0.1.1
  Downloading clean_text-0.6.0-py3-none-any.whl (11 kB)
Collecting unidecode>=1.1.1
  Downloading Unidecode-1.3.4-py3-none-any.whl (235 kB)
Collecting emoji<2.0.0,>=1.0.0
  Downloading emoji-1.7.0.tar.gz (175 kB)
Collecting ftfy<7.0,>=6.0
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Building wheels for collected packages: convokit, emoji
  Building wheel for convokit (setup.py) ... done
  Created wheel for convokit: filename=convokit-2

## Prepping the Corpus

The first thing we need to do is download the corpus. This will take a couple of minutes, as the corpus is large (the zipped download is about 1.26 GB). Lawyers and judges like to talk a lot. The benefit of all this text, though, is that we have far more information from which to estimate the word embeddings.


In [None]:
from convokit import Corpus, download

In [None]:
corpus = Corpus(filename=download("supreme-corpus"))

Downloading supreme-corpus to /root/.convokit/downloads/supreme-corpus
Downloading supreme-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/supreme-corpus/supreme-corpus.zip (1255.8MB)... Done


We can print some summary statistics for the corpus as follows.

In [None]:
corpus.print_summary_stats()

Number of Speakers: 8979
Number of Utterances: 1700789
Number of Conversations: 7817


Let's look at the first utterance. This is the Chief Justice of the U.S. Supreme Court introducing the case and the first lawyer to speak before the Court.

In [None]:
for utt in corpus.iter_utterances():
    print(utt.text)
    break

Number 71, Lonnie Affronti versus United States of America.
Mr. Murphy.


Next, we need to begin to prepare the corpus for estimating word embeddings. To do so, we must first do some standard NLP tasks, segmenting the corpus by sentence and tokenizing the texts. We'll just use the nltk tokenizers to segment into sentences and tokens.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Let's look at how the tokenizer works for the first utterance.

In [None]:
for utt in corpus.iter_utterances():
    print( [word_tokenize(t) for t in sent_tokenize(utt.text)])
    break

[['Number', '71', ',', 'Lonnie', 'Affronti', 'versus', 'United', 'States', 'of', 'America', '.'], ['Mr.', 'Murphy', '.']]


Now we generate the sentence tokens, and the word tokens within them, for every utterance. This took about 11 minutes for the 1.7 million utterances.

In [None]:
sents = []
for utt in corpus.iter_utterances():
    sents.append([word_tokenize(t) for t in sent_tokenize(utt.text)])

In [None]:
len(sents)

1700789

In [None]:
sents[1]

[['May', 'it', 'please', 'the', 'Court', '.'],
 ['We',
  'are',
  'here',
  'by',
  'writ',
  'of',
  'certiorari',
  'to',
  'the',
  'Eighth',
  'Circuit',
  '.'],
 ['There',
  'is',
  'one',
  'question',
  'to',
  'be',
  'decided',
  'in',
  'this',
  'case',
  ',',
  'decided',
  'carefully',
  '.'],
 ['Upon',
  'sentence',
  'to',
  'consecutive',
  'sentences',
  'or',
  'terms',
  'by',
  'a',
  'District',
  'Court',
  '.'],
 ['The',
  'defending',
  'pattern',
  'started',
  'the',
  'service',
  'of',
  'a',
  'first',
  'sentence',
  '.'],
 ['Thus',
  ',',
  'the',
  'District',
  'Court',
  'thereafter',
  'have',
  'jurisdiction',
  'to',
  'suspend',
  'the',
  'execution',
  'of',
  'the',
  'remaining',
  'sentences',
  'and',
  'place',
  'the',
  'defendant',
  'on',
  'probation',
  '.']]

That's the second utterance: a list of sentences, each of which is a list of tokens. That means sents is organized as a list of lists of lists (utterances, then sentences, then tokens). The model wants a list of lists (the tokens by sentence, without distinguishing between the utterances in which they appear). So, we flatten sents by one level into a single list of sentences, each a list of tokens.

In [None]:
flat_sents_list = [sentence for utt in sents for sentence in utt] # for every utterance, loop over its sentences and add them to the list

In [None]:
len(flat_sents_list)

3880254

As you can see, we are closing in on 4 million sentences overall.
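
To make the change of shape concrete, here is a minimal sketch on a hand-made toy example. The names `toy_sents`, `toy_flat`, and `toy_flat_chain` are illustrative and not part of the notebook. It also tries `itertools.chain.from_iterable` from the standard library, which should give the same flat list as the comprehension.

```python
from itertools import chain

# Toy stand-in for sents: utterances -> sentences -> tokens
toy_sents = [
    [["Number", "71", "."], ["Mr.", "Murphy", "."]],  # utterance 0: two sentences
    [["May", "it", "please", "the", "Court", "."]],   # utterance 1: one sentence
]

# Same comprehension as the one used to build flat_sents_list
toy_flat = [sentence for utt in toy_sents for sentence in utt]

# Alternative that removes exactly one level of nesting
toy_flat_chain = list(chain.from_iterable(toy_sents))

print(len(toy_sents), len(toy_flat))  # utterances vs. sentences
print(toy_flat == toy_flat_chain)
```

Either form removes only the utterance level, so each sentence stays intact as a list of tokens.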

## FastText

Next, we estimate FastText embeddings on the flattened list of sentences. This takes about 15 minutes.

In [None]:
from gensim.models import FastText


In [None]:
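# 100-dimensional vectors, 5-token context window, words with fewer than 5 occurrences dropped
# Note: gensim >= 4.0 renamed the size argument to vector_size
modelf_w5 = FastText(sentences=flat_sents_list, size=100, window=5, min_count=5, workers=1)
modelf_w5.save("w5_fasttext.model")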
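
FastText differs from word2vec in that it represents each word through its character n-grams as well as the word itself, which helps explain why nearest neighbours often share spelling. The sketch below is a simplified illustration of that n-gram extraction, not gensim's implementation. `char_ngrams` is a hypothetical helper, and the 3-to-6 range is an assumption based on what we understand to be the library defaults (`min_n=3`, `max_n=6`).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, with < and > marking its boundaries."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

# Trigrams only, to keep the output short
print(char_ngrams("where", 3, 3))

# Trigrams shared by two words that end up in the same cluster later in the notebook
shared = set(char_ngrams("monument", 3, 3)) & set(char_ngrams("torment", 3, 3))
print(sorted(shared))
```

Words that share many n-grams share part of their representation, so they can land close together even when their meanings differ.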

In [None]:
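import numpy as np
from sklearn.cluster import KMeans

# Word vectors and their labels (gensim >= 4.0 renamed index2word to index_to_key)
vectors_w5_f = np.asarray(modelf_w5.wv.vectors)
labels_w5_f = np.asarray(modelf_w5.wv.index2word)

# Group the word vectors into 20 clusters
kmeans_w5_f_20 = KMeans(n_clusters=20)
kmeans_w5_f_20.fit(vectors_w5_f)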

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=20, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
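
Each cluster centre is a vector in the same space as the word embeddings, so a cluster can be labelled by the words whose vectors lie closest to its centre. As a rough pure-Python sketch of that idea, the block below ranks a few made-up 2-d vectors by cosine similarity to a made-up centre. The names `cosine`, `nearest_words`, and `toy_vectors` are illustrative, and gensim's `most_similar` does the same kind of ranking far more efficiently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(center, word_vectors, topn=3):
    """Rank words by cosine similarity to a centre vector, most similar first."""
    scored = [(word, cosine(center, vec)) for word, vec in word_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors purely for illustration
toy_vectors = {
    "court": [1.0, 0.1],
    "judge": [0.9, 0.2],
    "banana": [0.0, 1.0],
}

top = nearest_words([1.0, 0.0], toy_vectors, topn=2)
print(top)
```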

In [None]:
# For each cluster, print the ten words whose vectors are most similar to its centre
for k in range(20):
    print(modelf_w5.wv.most_similar([kmeans_w5_f_20.cluster_centers_[k]]))

[('acknowledgment', 0.8484030961990356), ('concede', 0.8400231003761292), ('reappoint', 0.8161632418632507), ('acknowledgement', 0.8080217242240906), ('Vatersay', 0.8050147294998169), ('appreciate', 0.8028841018676758), ('understand—and', 0.8028236627578735), ('criticise', 0.7989882230758667), ('reassert', 0.7972162961959839), ('conquest', 0.7933142185211182)]
[('Could', 0.8523712158203125), ('is—will', 0.8343628644943237), ('Gould', 0.8303952813148499), ('will—will', 0.827309787273407), ('—will', 0.8239887356758118), ('will', 0.8192586898803711), ('do—will', 0.8120394945144653), ('Would', 0.8107205629348755), ('Should', 0.8085587024688721), ('Will', 0.7992162704467773)]
[('alignment', 0.9004708528518677), ('propulsion', 0.8757753968238831), ('torment', 0.8706355094909668), ('destination', 0.8695518970489502), ('temperament', 0.8676959276199341), ('monument', 0.8650738000869751), ('apportionment', 0.8616500496864319), ('clandestinely', 0.8612180352210999), ('intimation', 0.860936582088