# Computing the orientation of phrases and utterances

This notebook demos an unsupervised procedure for deriving the _orientation_ of a phrase in an utterance, a measure of the extent to which it aims forwards to advance the conversation, relative to the extent to which it aims backwards to address what's been said. It implements, with some methodological tweaks, an approach detailed in the [paper](http://www.cs.cornell.edu/~cristian/Orientation_files/orientation-forwards-backwards.pdf):

```
Balancing Objectives in Counseling Conversations: Advancing Forwards or Looking Backwards
Justine Zhang and Cristian Danescu-Niculescu-Mizil
Proceedings of ACL 2020.

```

Beyond the measure, the notebook illustrates an approach to characterizing utterances, and phrases within utterances, based on the types of replies that tend to follow them and the types of predecessors they tend to reply to. Interestingly, this approach can be seen as a generalization of the approach for inferring [Prompt Types](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/prompt-types/prompt-type-demo.ipynb), detailed in a [previous work](http://www.cs.cornell.edu/~cristian/Asking_too_much.html). We're exploring a more unified way to think about this approach, and an implementation in ConvoKit is forthcoming.

Note that the dataset used in the paper is a collection of crisis counseling conversations, which we cannot release (see [here](https://www.crisistextline.org/data-philosophy/research-fellows) for details about access). Rather, for the demo, we use a dataset of oral arguments from the Supreme Court, extracted from the Oyez [website](https://www.oyez.org/) and available [here](https://convokit.cornell.edu/documentation/supreme.html) (we used a small subset of this corpus to perform some exploratory analysis that was reported in the appendix of the aforementioned paper). In this setting, we will characterize the orientation of things that the justices say in back-and-forths with lawyers.

Supreme Court oral arguments tend to be more lexically diverse than crisis counseling conversations: the types of cases heard are much more varied than the types of situations covered in counseling conversations, and justices often have distinctive linguistic idiosyncrasies. As such, while we feel that our approach for computing orientation still returns sensible output, the demo might also suggest some additional challenges that future work could tackle, like dealing with this increased lexical diversity.

In [1]:
import os

import pandas as pd
import json

In [2]:
from convokit import Corpus
from convokit.text_processing import TextProcessor, TextToArcs
from convokit import download

In [3]:
from convokit.convokitPipeline import ConvokitPipeline

In [4]:
import warnings
warnings.filterwarnings('ignore')

## Preliminaries: setting up the training data

At a high level, our approach uses some "training data" consisting of a subset of utterances in a corpus and their associated replies and predecessors to derive per-phrase orientation scores (corresponding to the relative forwards/backwards intention of that phrase), before scoring utterances. Note that as this approach is unsupervised, "training data" is somewhat figurative -- we don't have supervision in the form of explicit labels in the data, but we will use information from the conversational context as a source of signal.

In this section, we'll generate this training data as a subset of the larger Supreme Court corpus. (This corresponds to Figure 3A in the paper.)

Note that we've made some particular decisions about what to include in this subset; you may wish to play around with these choices depending on the data you've got.

The Supreme Court corpus is distributed as separate sub-corpora per year, since it's quite large. In this notebook, we demonstrate our particular choice of training data on one year (2019) of data; running `get_train_subset.py` in this directory then gives you the rest of the training data.



In [5]:
DEMO_CORPUS_NAME = 'supreme-2019'

Replace this with the directory you wish to write the corpora to:

In [6]:
DATA_DIR = '<YOUR DIRECTORY>'

Uncomment the appropriate line, depending on whether you want to download the corpus or read it from disk:

In [7]:
# demo_corpus = Corpus(download(DEMO_CORPUS_NAME, data_dir=DATA_DIR))
demo_corpus = Corpus(os.path.join(DATA_DIR, DEMO_CORPUS_NAME))

In [8]:
demo_corpus.print_summary_stats()

Number of Speakers: 113
Number of Utterances: 13707
Number of Conversations: 58


We will first preprocess the data to generate phrases for each utterance. In our case, we will use dependency tree arcs (detailed [here](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/prompt-types/prompt-type-demo.ipynb)) as phrases. (This requires us to read in dependency parses, which we've provided in the corpus but which we do not load by default.)

In [9]:
demo_corpus.load_info('utterance',['parsed'])

In [10]:
text_prep_pipe = ConvokitPipeline([
    ('arcs_per_sent', TextToArcs(output_field='arcs_per_sent')),
    ('arcs', TextProcessor(input_field='arcs_per_sent', output_field='arcs',
                     proc_fn=lambda sents: '\n'.join(sents))),
    ('wordcount', TextProcessor(input_field='parsed', output_field='wordcount',
           proc_fn=lambda sents: sum(sum(x['tag'] != '_SP' for x in sent['toks']) for sent in sents))),
    ('tokens', TextProcessor(input_field='parsed', output_field='tokens',
           proc_fn=lambda sents: '\n'.join((' '.join(x['tok'] for x in sent['toks']).strip()) for sent in sents)))
])

In [11]:
demo_corpus = text_prep_pipe.transform(demo_corpus)

Here's what the preprocessing step outputs for each utterance:

In [12]:
utt = demo_corpus.get_utterance('24929__0_000')

In [13]:
print(utt.text)

We'll hear argument next in Case 18-877, Allen versus Cooper. Mr. Shaffer.


In [14]:
utt.retrieve_meta('wordcount')

18

In [15]:
utt.retrieve_meta('arcs')

"'ll_* allen_* allen_versus argument_* case_* cooper_* hear_'ll hear_* hear_allen hear_argument hear_next hear_we in_* in_case next_* next_in versus_* versus_cooper we>'ll we>* we_*\nshaffer_*"

In [16]:
print(utt.retrieve_meta('tokens'))

We 'll hear argument next in Case 18 - 877 , Allen versus Cooper .
Mr. Shaffer .
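
The `arcs` field packs one sentence per line, with the phrases of a sentence separated by spaces. As a minimal sketch of unpacking it (the helper name `split_arcs` is ours and not part of ConvoKit, and the input is a shortened version of the string above):

```python
def split_arcs(arcs_str):
    """Split an arcs string into a list of sentences, each a list of phrases."""
    return [sent.split() for sent in arcs_str.split('\n') if sent.strip()]

# shortened, hand-copied version of the arcs output shown above
example = "hear_* hear_argument hear_we we>'ll\nshaffer_*"
sentences = split_arcs(example)
print(sentences)
```

Each inner list holds the dependency arcs of one sentence; these arcs are what we refer to as phrases in the rest of the notebook.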


As noted above, our approach centers around characterizing a justice utterance in terms of the types of lawyer utterances that tend to precede or follow. More generically, we aim to characterize **source** utterances/phrases in terms of surrounding **target** utterances. In this case, the source and target utterances correspond to what justices and lawyers say, respectively; in our paper, source and target correspond to counselor and texter. (Something that would be interesting to try is to reverse roles, i.e., such that lawyers now utter the source utterances.) 

Our training data must contain information about justice utterances, and about the lawyer utterances that these justice utterances precede or follow. We will work towards outputting two tables, one for justice utterances and the other for lawyer utterances; each table will contain the set of component phrases in an utterance as well as the IDs of replies and predecessors.

To address some of the noisiness in this corpus, we will be somewhat restrictive with the training data we subset. In particular, we will focus on characterizing justice utterances of some minimum length, that occur between lawyer utterances that are reasonably long -- that is, there is enough information about the utterance and about the context it arises in, and we are not dealing with utterances that might be small interjections or disfluencies, or very long speeches that don't reflect a back-and-forth dynamic. 

Therefore, we'll start by extracting a list of justice utterances and the IDs of their replies and predecessors, stored as a dataframe:

In [17]:
def get_context_id_df(corpus):
    prev_df = pd.DataFrame([{'id': utt.id, 'prev_id': utt.reply_to} for utt in corpus.iter_utterances()])
    context_id_df = prev_df.join(prev_df.drop_duplicates('prev_id').set_index('prev_id')['id'].rename('next_id'), on='id')
    return context_id_df

In [18]:
context_id_df = get_context_id_df(demo_corpus)

In [19]:
context_id_df.head()

Unnamed: 0,id,prev_id,next_id
0,24929__0_000,,24929__0_001
1,24929__0_001,24929__0_000,24929__0_002
2,24929__0_002,24929__0_001,24929__0_003
3,24929__0_003,24929__0_002,24929__0_004
4,24929__0_004,24929__0_003,24929__0_005
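
To see what the join in `get_context_id_df` is doing, here is a small sketch on a hypothetical three-utterance conversation (the ids `u0`, `u1`, `u2` are made up for illustration). The idea is to invert the `reply_to` relation, so that looking up an utterance's own id returns the id of its reply:

```python
import pandas as pd

# hypothetical conversation: each utterance replies to the previous one
prev_df = pd.DataFrame([
    {'id': 'u0', 'prev_id': None},
    {'id': 'u1', 'prev_id': 'u0'},
    {'id': 'u2', 'prev_id': 'u1'},
])

# index by prev_id, so that an utterance's id maps to the utterance replying to it
next_ids = prev_df.drop_duplicates('prev_id').set_index('prev_id')['id'].rename('next_id')
toy_context_df = prev_df.join(next_ids, on='id')
print(toy_context_df)
```

The last utterance has no reply, so its `next_id` is missing. Because `drop_duplicates` keeps the first occurrence by default, an utterance with several replies would only be linked to the first of them.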


We use justice utterances as source utterances, and lawyer utterances as target utterances:

In [20]:
source_filter = lambda utt: (utt.retrieve_meta('speaker_type') == 'J') and (utt.retrieve_meta('arcs') != '')
target_filter = lambda utt: (utt.retrieve_meta('speaker_type') == 'A') and (utt.retrieve_meta('arcs') != '')

In [21]:
for utt in demo_corpus.iter_utterances():
    utt.set_info('source_filter',source_filter(utt))
    utt.set_info('target_filter',target_filter(utt))

To filter down to our training set, we need to get sets of source and target utterances, subject to the wordcount constraints we suggested above. We'll use some dataframe operations to make this selection.

In [22]:
utt_df = demo_corpus.get_attribute_table('utterance', ['wordcount', 'source_filter','target_filter'])

In [23]:
utt_df.head()

id,source_filter,target_filter,wordcount
24929__0_000,True,False,18
24929__0_001,False,True,390
24929__0_002,True,False,45
24929__0_003,False,True,161
24929__0_004,True,False,4


In [24]:
full_context_df = context_id_df.join(utt_df, on='id')\
    .join(utt_df, on='prev_id', rsuffix='_prev')\
    .join(utt_df, on='next_id', rsuffix='_next')

In [25]:
full_context_df.head()

Unnamed: 0,id,prev_id,next_id,source_filter,target_filter,wordcount,source_filter_prev,target_filter_prev,wordcount_prev,source_filter_next,target_filter_next,wordcount_next
0,24929__0_000,,24929__0_001,True,False,18,,,,False,True,390.0
1,24929__0_001,24929__0_000,24929__0_002,False,True,390,True,False,18.0,True,False,45.0
2,24929__0_002,24929__0_001,24929__0_003,True,False,45,False,True,390.0,False,True,161.0
3,24929__0_003,24929__0_002,24929__0_004,False,True,161,True,False,45.0,True,False,4.0
4,24929__0_004,24929__0_003,24929__0_005,True,False,4,False,True,161.0,False,True,7.0


We want source utterances that are reasonably (but not too) long, and that occur between reasonably long target utterances. The following min/max wordcounts roughly correspond to 25th and 50th percentiles (these are parameters that could be tweaked); selecting on them produces tables listing the source and target utterances we will consider.

In [26]:
min_wc_source = 10
max_wc_source = 50
min_wc_target = 10
max_wc_target = 75

In [27]:
source_df = full_context_df[full_context_df.source_filter
           & full_context_df.wordcount.between(min_wc_source, max_wc_source)
           & full_context_df.wordcount_prev.between(min_wc_target, max_wc_target)
           & full_context_df.wordcount_next.between(min_wc_target, max_wc_target)].set_index('id')

In [28]:
target_df = full_context_df[full_context_df.target_filter
   & full_context_df.wordcount.between(min_wc_target, max_wc_target)].set_index('id')
source_df = source_df[source_df.prev_id.isin(target_df.index)
         & source_df.next_id.isin(target_df.index)]

In [29]:
len(source_df)

353

In [30]:
len(target_df)

2087
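
As a small illustration of how the wordcount selection behaves, the sketch below applies the same kind of `between` filter to a hypothetical table (the rows and index labels are invented). Note that `between` is inclusive at both ends by default, and that a missing neighbor (e.g., the first utterance of a conversation, which has no predecessor) fails the filter:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'wordcount':      [18, 45, 4, 30],
    'wordcount_prev': [np.nan, 60, 40, 9],
    'wordcount_next': [50, 20, 12, 70],
}, index=['a', 'b', 'c', 'd'])

keep = (toy.wordcount.between(10, 50)
        & toy.wordcount_prev.between(10, 75)
        & toy.wordcount_next.between(10, 75))
print(toy[keep].index.tolist())
```

Only `b` survives: `a` has no predecessor, `c` is too short, and `d` follows a predecessor that is too short.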

Joining these tables with tables of utterance phrasings gives us the full training data we will subsequently use.

In [31]:
text_cols = ['arcs','tokens']
text_df = demo_corpus.get_attribute_table('utterance',text_cols)

In [32]:
source_df = source_df[['prev_id','next_id']].join(text_df)
target_df = target_df[[]].join(text_df)

In [33]:
source_df.head()

id,prev_id,next_id,arcs,tokens
24929__0_014,24929__0_013,24929__0_015,but>* but>how could_* going_* going_rules goin...,But how -- how could -- how could we have the ...
24929__0_016,24929__0_015,24929__0_017,'re_* asking_'re asking_* asking_basically ask...,"So , basically , you 're asking us to overrule..."
24929__0_054,24929__0_053,24929__0_055,a_* by_* by_government by_state constitutional...,Every -- every infringement is a violate -- ev...
24929__0_082,24929__0_081,24929__0_083,all>* all_* california_* over_* over_all over_...,All over California .\nWhy does n't California...
24929__0_086,24929__0_085,24929__0_087,'m_* about_* about_copyright copyright_* i>'m ...,I 'm not talking about copyright .\nI 'm talki...


In [34]:
target_df.head()

id,arcs,tokens
24929__0_013,argument_* argument_kagan argument_that be_* b...,It would be certainly open to folks in patent ...
24929__0_015,be_* be_prediction be_that be_would my_* predi...,That would be my prediction .\nMy prediction i...
24929__0_017,'m_* alito_* alito_justice alito_think asking_...,"I 'm asking this Court to follow Katz , Justic..."
24929__0_019,basis_* basis_for basis_the florida_* for_* fo...,I think it -- it overruled -- it overruled the...
24929__0_025,court_* court_the decide_* decide_court decide...,"Well , Justice Kavanaugh , obviously , the Cou..."


You can generate the rest of the training data by running `get_train_subset.py` in the same directory as this. The variables at the top of the file can be tweaked and played around with, per the comments in the script.

We now read all of the training data from across the entire Supreme Court corpus. (`MIN_YEAR` and `MAX_YEAR` can be adjusted if you wish to only examine a subset, or if you're short on memory.)

In [35]:
MIN_YEAR = 1955
MAX_YEAR = 2019

In [36]:
source_dfs = []
target_dfs = []

In [37]:
for year in range(MIN_YEAR, MAX_YEAR + 1):
    source_dfs.append(pd.read_csv(os.path.join(DATA_DIR, 'supreme-' + str(year) + '.source.tsv'), sep='\t', index_col=0))
    target_dfs.append(pd.read_csv(os.path.join(DATA_DIR, 'supreme-' + str(year) + '.target.tsv'), sep='\t', index_col=0))

In [38]:
source_df = pd.concat(source_dfs)
target_df = pd.concat(target_dfs)

This is how many source and target utterances we have. 

(A note: these numbers are not equivalent because we were slightly permissive about which target utterances to include; while we enforce that source utterances must be surrounded by reasonably-long target utterances, the only restriction we place on target utterances is that they're reasonably long, without imposing these contextual constraints.)

In [39]:
len(source_df)

91924

In [40]:
len(target_df)

372268

## Deriving vector representations of target utterances

The next step of our approach is to derive vector representations of target (i.e., lawyer) utterances, corresponding to Figure 3B in the paper. Per the paper, we will:
* derive tf-idf vectors of target utterances;
* use singular value decomposition to get low-dimensional representations of utterances.

In [41]:
import numpy as np

from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils.extmath import randomized_svd
from sklearn.preprocessing import normalize

from sklearn.metrics.pairwise import cosine_distances
from scipy import sparse

### getting tf-idf vectors

Some details:
* MIN_DF and MAX_DF correspond to the min_df and max_df arguments passed to `sklearn`'s `TfidfVectorizer`, controlling how frequently phrases need to appear to be counted in our vocabulary. In the Supreme Court corpus, it might be safe to set MIN_DF fairly high; otherwise the vocabulary could contain many phrases that are specific to particular cases.
* We found that for each phrase, normalizing the tf-idf weight across all the utterances the phrase appears in produced slightly nicer output. The intuition might be to think of these next few steps as characterizing _phrases_, rather than utterances. `ColNormedTfidf` is a custom transformer that accomplishes this (and stores norms so that new data can later be similarly transformed).

In [42]:
MIN_DF = 100
MAX_DF = 1.
MAX_FEATURES = 2000

TEXT_COL = 'arcs'

In [43]:
class ColNormedTfidf(TransformerMixin):
    
    def __init__(self, norm_cols=True, **kwargs):
        self.tfidf_model = TfidfVectorizer(token_pattern=r'(?u)(\S+)',**kwargs)
        self.norm_cols = norm_cols
    
    def fit(self, X, y=None):
        tfidf_vects_raw = self.tfidf_model.fit_transform(X)
        # store per-phrase (column) norms so that new data can be normalized the same way
        self.tfidf_norms = sparse.linalg.norm(tfidf_vects_raw, axis=0)
        return self
    
    def transform(self, X):
        tfidf_vects_raw = self.tfidf_model.transform(X)
        if self.norm_cols:
            tfidf_vect = tfidf_vects_raw.T / self.tfidf_norms[:,np.newaxis]
        else:
            tfidf_vect = tfidf_vects_raw.T / np.ones_like(self.tfidf_norms[:,np.newaxis])
        return tfidf_vect
    
    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)

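    def get_feature_names(self):
        # newer scikit-learn versions replace get_feature_names with get_feature_names_out
        if hasattr(self.tfidf_model, 'get_feature_names_out'):
            return list(self.tfidf_model.get_feature_names_out())
        return self.tfidf_model.get_feature_names()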
    
    def get_params(self, deep=True):
        return self.tfidf_model.get_params(deep=deep)
    
    def set_params(self, **params):
        return self.tfidf_model.set_params(**params)

In [44]:
target_tfidf_obj = ColNormedTfidf(max_features=MAX_FEATURES, binary=True,
                                 min_df=MIN_DF, max_df=MAX_DF)
target_tfidf_vect = target_tfidf_obj.fit_transform(target_df[TEXT_COL].values)

In [45]:
target_tfidf_vect.shape

(2000, 372268)
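
To make the column normalization concrete, here is a minimal sketch on a hypothetical three-utterance corpus. It uses dense arrays for readability rather than the sparse matrices in `ColNormedTfidf`, and the toy arcs strings are invented:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy arcs strings, one per utterance
docs = ['is_* is_there there_*', 'is_* difference_*', 'difference_* difference_the is_*']
vectorizer = TfidfVectorizer(token_pattern=r'(?u)(\S+)', binary=True)
tfidf = vectorizer.fit_transform(docs).toarray()   # shape: (n_utterances, n_phrases)

col_norms = np.linalg.norm(tfidf, axis=0)          # one norm per phrase
col_normed = (tfidf / col_norms).T                 # shape: (n_phrases, n_utterances)

print(col_normed.shape)
print(np.linalg.norm(col_normed, axis=1))          # each phrase row now has unit norm
```

As in the output above, the result is transposed so that rows correspond to phrases and columns to utterances.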

### getting low-dimensional representations using SVD

Setting SVD_DIMS higher or lower roughly toggles the extent to which you capture higher-level conceptual classes, versus more direct lexical matches. In our paper, we used a higher dimension for the counseling data than what we've chosen here: the intuition is again to work around the increased lexical diversity and mitigate the possibility of capturing case-specific information. 

An in-the-weeds spoiler alert: this is worth playing around with -- higher values of SVD_DIMS result in more forwards-oriented phrasings later on. (An intuition is that more sensitivity to lexical differences means more sensitivity to noise in the varied things that lawyers say that justices respond to, whereas lawyers, perhaps out of procedure or respect, tend to have more well-defined responses to justice prompts.)

In [46]:
SVD_DIMS = 15
RANDOM_STATE = 2019

In [47]:
def get_svd_obj(vect, svd_dims, random_state=RANDOM_STATE):
    U,s,V = randomized_svd(vect, n_components=svd_dims, random_state=random_state)
    return {'U': U, 's': s, 'V': V.T}

In [48]:
target_svd_obj = get_svd_obj(target_tfidf_vect, SVD_DIMS)

In [49]:
target_svd_obj['s']

array([4.96525955, 2.22446003, 2.11388959, 2.05197851, 2.00981703,
       1.98249565, 1.9364843 , 1.92670078, 1.91136899, 1.89870573,
       1.88872119, 1.84319579, 1.83926558, 1.82746449, 1.82167165])
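
Since the tf-idf matrix is laid out as (phrases, utterances), it can help to check which factor holds which embeddings. The sketch below runs `randomized_svd` on a random stand-in matrix (the sizes are arbitrary and chosen only for illustration):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
toy_vect = rng.rand(20, 100)     # stand-in for a (n_phrases, n_utterances) tf-idf matrix

U, s, Vt = randomized_svd(toy_vect, n_components=5, random_state=0)
print(U.shape, s.shape, Vt.T.shape)
```

Rows of `U` are embeddings of target phrases, rows of the transposed third factor (stored as `'V'` in `get_svd_obj`) are embeddings of target utterances, and `s` holds the singular values in decreasing order.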

For text, the first SVD dimension typically corresponds to word/phrase frequency. Since we would like embeddings to be close together on the basis of semantic, rather than numeric similarity, we will drop the first dimension via the following function:

In [50]:
def snip(vects, dim=None, snip_first_dim=True):
    if dim is None:
        dim = vects.shape[1]
    return normalize(vects[:,int(snip_first_dim):dim])
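
As a quick check of what `snip` does with its default arguments, the following sketch reproduces the two steps (drop the first column, then scale each row to unit length) on a small hand-made array:

```python
import numpy as np
from sklearn.preprocessing import normalize

vects = np.array([[5.0, 3.0, 4.0],
                  [9.0, 0.0, 2.0]])

# drop the first column, then scale each row to unit length
snipped = normalize(vects[:, 1:])
print(snipped)
```

The large values in the first column no longer affect the result, so cosine comparisons between rows only reflect the remaining dimensions.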

## Representing source phrases in terms of target utterances

Thus far, we've produced representations of target utterances in the training data. We now want to work towards producing representations of the _source_ phrases that follow or precede these target utterances -- recall that what we're after is some characterization of justices, not lawyers.

The high-level idea we will subsequently implement is to represent a source phrase in terms of the target utterances that follow source utterances with that phrase in the training data, e.g., all lawyer responses to utterances where the justice says "[what's the] difference between..." -- we'll refer to this as a _forwards_ representation. 

Likewise, we'll compute a _backwards_ representation of a source phrase in terms of the target utterances that precede source utterances with that phrase, e.g., all lawyer utterances to which the justice responds "[what's the] difference between".

This requires us to keep track of which source utterances are associated with which target utterances in either direction -- here, we'll keep track of pairs of indices: (index of source utterance in an array; index of corresponding target utterance in an array).

In [51]:
source_df['mtx_idx'] = np.arange(len(source_df))
target_df['mtx_idx'] = np.arange(len(target_df))

In [52]:
source_df = source_df.join(target_df.mtx_idx, on='prev_id', rsuffix='_prev')\
    .join(target_df.mtx_idx, on='next_id', rsuffix='_next')

In [53]:
fw_idx_mapping = source_df[['mtx_idx','mtx_idx_next']].values # forwards
bk_idx_mapping = source_df[['mtx_idx','mtx_idx_prev']].values # backwards

Using these associations between utterances:
1. we will essentially _project_ (in a linear algebra sense) a source phrase into the low-dimensional space of target utterances (Figure 3C), to derive "prototypical representations" of source phrases in terms of their expected responses/predecessors. In practice, this amounts to taking a _weighted average_ of target utterances; here we use tf-idf weights normalized by phrase (similar to how we represented target utterances above), and rescale each dimension by the singular values from the preceding SVD.
2. given these prototypical representations, we compute a _range_ for each phrase that quantifies the extent to which expected replies (or predecessors) are well-defined and similar to each other, or varied and spread out (Figure 3D).

In [54]:
class CrossEmbed:
    
    def __init__(self, source_vects, target_embeddings, target_s, idx_mapping, snip_first_dim=True):
        
        self.source_vects = source_vects
        self.target_embeddings = target_embeddings
        self.target_s = target_s
        
        source_subset = self.source_vects[:, idx_mapping[:,0]]
        target_subset = self.target_embeddings[idx_mapping[:, 1]]
        
        # deriving central point for a phrase
        self.term_embeddings = source_subset * target_subset / target_s
        
        # computing range for a phrase
        full_dists = cosine_distances(
            snip(self.term_embeddings, snip_first_dim=snip_first_dim),
            snip(target_subset, snip_first_dim=snip_first_dim)
        )
        weights = normalize(np.array(source_subset > 0), norm='l1')
        clipped_dists = np.clip(full_dists, None, 1)
        
        self.term_ranges = (clipped_dists * weights).sum(axis=1)
        
    # deriving embeddings of utterances. we won't use this, but it corresponds to the PromptType methodology 
    # and we might as well include it.
    def embed_docs(self, doc_vect):
        return doc_vect.T * self.term_embeddings / self.target_s
    
    # computing ranges of utterances.
    def compute_docs_range(self, doc_vect):
        return np.dot(normalize(doc_vect.T, norm='l1'), self.term_ranges)
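
The two steps can be sketched on a toy example with plain `numpy` arrays. This is a simplified illustration under stated assumptions, not a replacement for `CrossEmbed`: the weights and embeddings are invented, and the sketch skips the rescaling by singular values, the dropping of the first SVD dimension, and the clipping of distances.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_distances

# toy weights: 2 source phrases x 3 source utterances (standing in for tf-idf weights)
source_weights = np.array([[1.0, 1.0, 0.0],
                           [0.0, 1.0, 1.0]])
# toy embeddings of the 3 target utterances paired with those source utterances
target_embeds = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [0.0, 1.0]])

# central point per phrase: weighted sum of the paired target embeddings
phrase_embeds = source_weights @ target_embeds

# range per phrase: average cosine distance from the central point to the paired targets
dists = cosine_distances(normalize(phrase_embeds), normalize(target_embeds))
avg_weights = normalize((source_weights > 0).astype(float), norm='l1')
phrase_ranges = (dists * avg_weights).sum(axis=1)
print(phrase_ranges)
```

The second phrase is always paired with the same target embedding, so its range is 0; the first phrase is paired with two different targets, so its range is larger.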

(computing tf-idf weights for our weighted average)

In [55]:
source_tfidf_obj = ColNormedTfidf(max_features=MAX_FEATURES, binary=True,
                                 min_df=MIN_DF, max_df=MAX_DF)
source_tfidf_vect = source_tfidf_obj.fit_transform(source_df[TEXT_COL].values)

In [56]:
frequency = np.array(source_tfidf_vect > 0).sum(axis=1)

Representing source phrases in terms of replies (forwards):

In [57]:
fw_obj = CrossEmbed(source_tfidf_vect, target_svd_obj['V'], target_svd_obj['s'], fw_idx_mapping)

Representing source phrases in terms of predecessors (backwards):

In [58]:
bk_obj = CrossEmbed(source_tfidf_vect, target_svd_obj['V'], target_svd_obj['s'], bk_idx_mapping)

Following this procedure, we now have two quantities that characterize each source phrase -- a forwards and a backwards range. Subtracting these ranges then gives us the phrase's orientation (Figure 3E).

In [59]:
orientation = bk_obj.term_ranges - fw_obj.term_ranges
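
To make the sign convention explicit, here is a tiny sketch with hypothetical range values (not taken from the data): a phrase whose replies are more tightly clustered than its predecessors has a smaller forwards range, and hence a positive orientation.

```python
import numpy as np

# hypothetical ranges for three phrases
fw_ranges = np.array([0.65, 0.80, 0.88])
bk_ranges = np.array([0.80, 0.80, 0.81])

toy_orientation = bk_ranges - fw_ranges
labels = np.where(toy_orientation > 0, 'forwards',
                  np.where(toy_orientation < 0, 'backwards', 'neutral'))
print(labels)
```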

## Exploring orientation of phrases from justice utterances

We can inspect these phrases to see what our measure reflects in the Supreme Court corpus. 

In [60]:
orientation_df = pd.DataFrame({
    'index': source_tfidf_obj.get_feature_names(),
    'orientation': orientation,
    'fw_range': fw_obj.term_ranges,
    'bk_range': bk_obj.term_ranges,
    'n': frequency
}).set_index('index')

First, most justice phrasings have a positive orientation -- that is, they're more forwards-oriented, so we generally have a stronger sense of what replies they prompt than what predecessors they follow. This might make sense given the power dynamics in the Supreme Court: it's believable that justices use more words that push lawyers towards specific replies than words that reflect on what the lawyers have said. An alternative explanation is that lawyers say very diverse things for justices to reply to (such that backwards ranges are more spread out), but justices often structure their questions to provoke particular forms of answers (such that there is greater lexical cohesion among lawyer responses).

In [61]:
np.sign(orientation_df.orientation).value_counts(normalize=True)

 1.0    0.639
-1.0    0.361
Name: orientation, dtype: float64

Here are some of the most backwards-oriented phrases. We see a few more "topical" phrases (commission, prosecution, employees), perhaps reflecting points that justices often pick up on in what a lawyer has said.

In [62]:
orientation_df[orientation_df.n >= 250].sort_values('orientation').head(25)

index,orientation,fw_range,bk_range,n
available_*,-0.068312,0.875046,0.806734,317
commission_*,-0.067559,0.85292,0.785361,656
is_for,-0.061511,0.846946,0.785435,256
specific_*,-0.060206,0.875184,0.814978,342
and>it,-0.059189,0.855133,0.795944,329
prosecution_*,-0.059091,0.82594,0.766848,257
employees_*,-0.054333,0.767615,0.713282,318
laughter_*,-0.053807,0.832771,0.778964,683
in_order,-0.053699,0.873568,0.819868,410
understand_that,-0.053405,0.880014,0.826609,392


Here are some of the most forwards-oriented phrases. Interestingly (and perhaps promisingly), many sound like fragments of questions.

In [63]:
orientation_df[orientation_df.n >= 250].sort_values('orientation', ascending=False).head(25)

index,orientation,fw_range,bk_range,n
in_brief,0.145374,0.654769,0.800142,313
brief_your,0.141113,0.676582,0.817695,338
is_anything,0.115232,0.707332,0.822564,288
be_what,0.11334,0.723905,0.837245,363
is_issue,0.10966,0.715291,0.824952,270
difference_any,0.10834,0.742786,0.851126,267
make_would,0.103676,0.748457,0.852132,266
is_really,0.103238,0.767692,0.87093,284
be_you,0.102827,0.749427,0.852254,368
hear_*,0.097677,0.652663,0.75034,292


My take is that while the forwards-oriented examples seem pretty clearly pointed forwards, in that they sound like questions prompting particular types of answers (like in the Prompt Types intuition), the backwards-oriented examples are a little more muddied. Perhaps interpreting them requires a bit more domain knowledge about the Supreme Court, or they tend to be more contingent on the particularities of various cases (which, after all, justices have to address) -- future work could better address such contingencies.

One way to better interpret orientation, and more broadly, what these embeddings are telling us, is to look at source and target phrases that are mapped to similar regions of the latent space. By construction, it should be the case that if a representation of a source phrase, under the forwards mapping, is close to a representation of a target phrase, that target phrase tended to be in a reply to that source phrase in the training data. Likewise, if two embeddings are close together under the backwards mapping, then that target phrase would tend to precede the source phrase in the training data. 

As such, for each source phrase, we inspect nearest neighbors:

In [64]:
def get_cross_embed_neighbors(source_term_embeds, target_term_embeds, source_terms, target_terms,
                             snip_first_dim=True):
    neighbors = cosine_distances(snip(source_term_embeds, snip_first_dim=snip_first_dim),
                                snip(target_term_embeds, snip_first_dim=snip_first_dim))
    return pd.DataFrame(data=neighbors, index=source_terms, columns=target_terms)

In [65]:
fw_neighbors = get_cross_embed_neighbors(fw_obj.term_embeddings, target_svd_obj['U'], source_tfidf_obj.get_feature_names(),
                                        target_tfidf_obj.get_feature_names())
bk_neighbors = get_cross_embed_neighbors(bk_obj.term_embeddings, target_svd_obj['U'], source_tfidf_obj.get_feature_names(),
                                        target_tfidf_obj.get_feature_names())
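
The neighbor tables hold cosine distances, so smaller values mean closer neighbors. A minimal sketch with made-up two-dimensional embeddings (the labels `near`, `orthogonal` and `opposite` are hypothetical target phrases):

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_distances

source_embeds = np.array([[1.0, 0.1]])    # one hypothetical source phrase
target_embeds = np.array([[0.9, 0.0],     # points in nearly the same direction
                          [0.0, 1.0],     # roughly orthogonal
                          [-1.0, 0.0]])   # roughly opposite

toy_neighbors = pd.DataFrame(cosine_distances(source_embeds, target_embeds),
                             index=['src_phrase'],
                             columns=['near', 'orthogonal', 'opposite'])
ranked = toy_neighbors.loc['src_phrase'].sort_values()
print(ranked)
```

Sorting a row in ascending order, as we do below, therefore lists the target phrases most associated with a source phrase first.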

One example of a forwards-oriented phrase is `difference_between`:

In [66]:
orientation_df.loc['difference_between'].orientation

0.08025254053242692

Inspecting nearest neighbors of the forwards embedding -- i.e., things we expect lawyers to say in response to a justice utterance containing "difference_between", based on the training data -- we see that the phrase (unsurprisingly) prompts lawyers to draw contrasts:

In [67]:
fw_neighbors.loc['difference_between'].sort_values().head(20)

is_difference       0.108667
difference_the      0.119686
is_where            0.121033
difference_*        0.135345
difference_a        0.150741
distinction_*       0.180849
between_*           0.190600
significant_*       0.192622
is_are              0.239439
speech_*            0.241334
discrimination_*    0.255033
is_different        0.267632
is_there            0.269909
substantial_*       0.273739
requirement_*       0.282649
real_*              0.295133
injury_*            0.297816
different_*         0.299115
serious_*           0.311851
basis_a             0.314357
Name: difference_between, dtype: float64

This cohesion is arguably less visible if we look at nearest neighbors of the backwards embedding -- i.e., things that lawyers said that justices tended to respond to with utterances containing "difference_between":

In [68]:
bk_neighbors.loc['difference_between'].sort_values().head(20)

there_*           0.228998
difference_a      0.255950
question_no       0.261286
is_there          0.281736
difference_*      0.283994
relevant_*        0.296001
question_about    0.316051
's_there          0.316707
there>*           0.324570
doubt_*           0.325439
is_difference     0.328573
but>there         0.336422
's_question       0.337620
difference_the    0.351272
and>there         0.351567
sense_the         0.352142
serious_*         0.354075
distinction_*     0.363075
question_a        0.370353
possibility_*     0.371970
Name: difference_between, dtype: float64

It still seems that justices tend to ask for contrasts after lawyers articulate contrasts, but this is certainly not a hard-and-fast rule.

One example of a backwards-oriented phrase is `specific_*`:

In [69]:
orientation_df.loc['specific_*'].orientation

-0.06020567193337334

Looking at nearest backwards neighbors, we see lawyer phrases which seem to locate specific aspects of, e.g., a statute, a requirement, or some other precedent:

In [72]:
bk_neighbors.loc['specific_*'].sort_values().head(20)

is_which            0.188142
specific_*          0.188767
is_one              0.197095
is_to               0.216025
is_in               0.244239
discrimination_*    0.245091
itself_*            0.251703
is_where            0.278767
in_statute          0.284501
separate_*          0.293673
is_the              0.295271
is_here             0.297449
process_the         0.302387
is_there            0.303016
is_*                0.307968
is_now              0.311984
requirement_*       0.313694
is_statute          0.314290
requirement_the     0.317387
of_section          0.322219
Name: specific_*, dtype: float64

This is less clear in the forwards direction:

In [70]:
fw_neighbors.loc['specific_*'].sort_values().head(20)

speaking_*       0.267105
certain_*        0.271000
aware_of         0.317548
are_there        0.321637
aware_*          0.324101
sure_*           0.340423
'm_not           0.343878
'm_sure          0.345877
of_that          0.349148
'm_i             0.360995
am_sure          0.362645
am_not           0.366701
are_well         0.366811
'm_sorry         0.376976
sorry_*          0.378896
completely_*     0.388760
am_i             0.392444
understand_*     0.401210
understand_i     0.402399
question_your    0.402740
Name: specific_*, dtype: float64

## sentence-level orientation

Finally, we can characterize the orientation of a sentence by aggregating phrase-level orientation across all the phrases in a sentence. For now, we will simply compute a tf-idf weighted average of phrase-level orientation. Note that at this level of aggregation, the measure gets a lot messier, especially given the relatively noisy oral argument setting. 
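To make the aggregation concrete, here is a minimal sketch of a tf-idf weighted average over phrase-level scores. It assumes a dense sentence-by-phrase weight matrix and a vector holding one orientation score per phrase. The function name `weighted_average_orientation`, the toy numbers, and the choice to assign 0 to sentences with no in-vocabulary phrases are all assumptions of the sketch. The notebook's own computation may differ in its details.

```python
import numpy as np

def weighted_average_orientation(tfidf_weights, phrase_orientation):
    """Per-sentence weighted average of phrase-level orientation.

    tfidf_weights: (n_sentences, n_phrases) array of non-negative weights.
    phrase_orientation: (n_phrases,) array of phrase-level scores.
    """
    weights = np.asarray(tfidf_weights, dtype=float)
    scores = np.asarray(phrase_orientation, dtype=float)
    totals = weights.sum(axis=1)      # total weight per sentence
    weighted_sums = weights @ scores  # sum of weight * score per sentence
    # assumption: sentences with zero total weight get orientation 0
    return np.divide(weighted_sums, totals,
                     out=np.zeros_like(weighted_sums), where=totals > 0)

# toy example with made-up weights and scores
toy_weights = np.array([[0.5, 0.5, 0.0],
                        [0.0, 0.2, 0.8],
                        [0.0, 0.0, 0.0]])
toy_orientation = np.array([0.10, -0.06, 0.30])
sent_orientation = weighted_average_orientation(toy_weights, toy_orientation)
print(sent_orientation)
```

Dividing by the total weight keeps long sentences from scoring higher simply because they contain more phrases.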

We can, of course, go beyond sentences to look at entire utterances -- some more work might need to be done here, since utterances in oral arguments can get quite long and cover a lot of ground (in contrast to the short text messages in the crisis counseling data used in the paper).

Note that while we used the subsetted training data to compute orientation, we are not bound to the same constraints in computing the orientation of a new utterance -- i.e., an utterance has a well-defined orientation regardless of how long or short its replies or predecessors are. This speaks to the intuition that our embeddings aim to represent some aspect of a speaker's _intention_ based on the phrases they use, rather than what _actually happens_ in a conversation.

As such, we will compute orientation for all sentences uttered by justices in our demo corpus from earlier:

In [74]:
# get sentence-level representations of utterances
arcs_per_sentence = []
for utt in demo_corpus.iter_utterances():
    # keep only utterances marked by the source filter (here, things justices say)
    if utt.retrieve_meta('source_filter'):
        # arcs and tokens are stored with one sentence per line
        sents = utt.retrieve_meta('arcs').split('\n')
        tok_sents = utt.retrieve_meta('tokens').split('\n')
        for i, (sent, tok_sent) in enumerate(zip(sents, tok_sents)):
            # sentence id = utterance id plus a zero-padded sentence index
            arcs_per_sentence.append({'id': '%s__%02d' % (utt.id, i),
                                      'n_tokens': len(tok_sent.split()),
                                      'arcs': sent,
                                      'tokens': tok_sent})

In [75]:
arc_sent_df = pd.DataFrame(arcs_per_sentence).set_index('id')

In [76]:
arc_sent_vects = source_tfidf_obj.transform(arc_sent_df.arcs.values)

In [77]:
arc_sent_fw_range = fw_obj.compute_docs_range(arc_sent_vects)
arc_sent_bk_range = bk_obj.compute_docs_range(arc_sent_vects)

In [78]:
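# orientation = backwards range - forwards range;
# positive values are forwards-oriented, negative values are backwards-oriented
arc_sent_df['orientation'] = arc_sent_bk_range - arc_sent_fw_range
arc_sent_df['fw'] = arc_sent_fw_range
arc_sent_df['bk'] = arc_sent_bk_range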

As with phrases, we see that most justice sentences are forwards-oriented:

In [79]:
np.sign(arc_sent_df.orientation).value_counts(normalize=True)

 1.0    0.783507
-1.0    0.196412
 0.0    0.020081
Name: orientation, dtype: float64

Below, we display a selection of the most backwards-oriented sentences (lowest orientation), followed by the most forwards-oriented sentences (highest orientation):

In [80]:
# for interpretability, examine reasonably long sentences and drop repeated ones
arc_sent_subset = arc_sent_df[arc_sent_df.n_tokens >= 15].drop_duplicates('arcs')

In [81]:
arc_sent_subset.sort_values('orientation').head(10).tokens.values

array(["Now , if we 're going to be extending the -- the -- the understanding of what sex encompasses , and I know your argument --",
       'But before counsel leave , I would like to invite Mr. Clement to return to the lectern .',
       'And obviously I can use the same example with race , which is famous .',
       "Normally , we use law enforcement investigative tools like subpoenas to investigate known crimes and not to pursue individuals ' defined crimes .",
       'But AEDPA was intended to move habeas petitions along quickly and is full of deadlines .',
       'And -- and RFRA provides a backstop on that , but even beyond RFRA , in the ACA , Congress has delegated to the agency .',
       'Mr. Feigin , everybody has authority to spend or do their act on behalf of the agency .',
       'And then , in the 1907 Act , which is after , you know , the Enabling Act , it says all causes , civil or criminal , shall be proceeded with , held and determined by the courts of the state comi

In [82]:
arc_sent_subset.sort_values('orientation').tail(10).tokens.values

array(['Let me ask you a question about the difference between you and the government .',
       'Counsel , you spend a lot of time in your brief documenting that the purpose of these subpoenas was actually investigatory rather than legislative .',
       '-- between -- what is the difference between that setup and the setup that Mr. Lessig says is required ?',
       'Well , I -- I thought that in your brief , in your letter brief , you specifically rejected every other theory of -- of why this case was live .',
       'May I ask you a question about this -- this theory of yours I saw nowhere aired below ?',
       'I guess what struck me was that in many -- on many occasions you modified that test in your brief .',
       'Mr. Citron , may I ask you a basic question of -- of what matters here ?',
       "Is there anything explicitly that terminated the reservation in the history that you 've recounted ?",
       "It was n't raised in -- in the district court or in the court of appeal