Computing Surprise With ConvoKit
=====================
This notebook provides a demo of how to use the Surprise transformer to compute surprise across a corpus. In this demo, we will use the Surprise transformer to compute Speaker Convo Diversity, a measure of how surprising a speaker's participation in one conversation is compared to their participation in all other conversations. We will then compare the results to those obtained using the actual SpeakerConvoDiversity transformer. We eventually want to use the Surprise transformer within the SpeakerConvoDiversity transformer to reduce redundancy, but for now, this demo serves as a sanity check on the correctness of the Surprise transformer.

In [1]:
import convokit
import numpy as np
from convokit import Corpus, download, Surprise
from convokit.text_processing import TextProcessor, TextParser

Step 1: Load a corpus
--------
For now, we will use data from the subreddit r/Cornell to demonstrate the functionality of this transformer.

In [2]:
corpus = Corpus(filename=download('subreddit-Cornell'))

In [3]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


To speed up the demo, we will keep only the 25 most active speakers (ranked by the number of conversations they participate in).

In [4]:
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot', 'AutoModerator']
def utterance_is_valid(utterance):
    return utterance.speaker.id not in SPEAKER_BLACKLIST and utterance.text

In [5]:
corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)



In [6]:
speaker_activities = corpus.get_attribute_table('speaker', ['n_convos'])

In [7]:
speaker_activities.sort_values('n_convos', ascending=False).head(10)

id,n_convos
laveritecestla,781.0
EQUASHNZRKUL,726.0
CornHellUniversity,696.0
t3hasiangod,647.0
ilovemymemesboo,430.0
omgdonerkebab,425.0
cartesiancategory,341.0
cornell256,330.0
mushiettake,321.0
Fencerman2,298.0


In [8]:
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(25).index

In [9]:
import itertools

subset_utts = [list(corpus.get_speaker(speaker).iter_utterances(selector=utterance_is_valid)) for speaker in top_speakers]
subset_corpus = Corpus(utterances=list(itertools.chain(*subset_utts)))

In [10]:
subset_corpus.print_summary_stats()

Number of Speakers: 25
Number of Utterances: 11145
Number of Conversations: 5082


In [11]:
from convokit.text_processing import TextParser
from convokit.speaker_convo_helpers.speaker_convo_attrs import SpeakerConvoAttrs

tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

def _join_all_tokens(parses):
    joined = []
    for parse in parses:
        for sent in parse:
            joined += [tok['tok'].lower() for tok in sent['toks']]
    return joined

agg_tokens = SpeakerConvoAttrs('tokens',
                 agg_fn=_join_all_tokens,
                 recompute=False)

agg_tokens.transform(subset_corpus)

1000/11145 utterances processed
2000/11145 utterances processed
3000/11145 utterances processed
4000/11145 utterances processed
5000/11145 utterances processed
6000/11145 utterances processed
7000/11145 utterances processed
8000/11145 utterances processed
9000/11145 utterances processed
10000/11145 utterances processed
11000/11145 utterances processed
11145/11145 utterances processed


<convokit.model.corpus.Corpus at 0x2375df2ac08>

Step 2: Create instance of Surprise transformer
---------------
`target_sample_size` and `context_sample_size` specify the minimum number of tokens that the target and the context must contain, respectively. If either is too short, the transformer sets the surprise score to `nan`. If we set these to just 1, the most surprising statements tend to be the very short ones. The transformer draws `n_samples` samples from the target and from the context (of sizes `target_sample_size` and `context_sample_size`), computes the cross entropy for each pair of samples, and averages the results to get the final surprise score. Sampling to a fixed size minimizes the effect of length on the scores.
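
The sampling procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the transformer's actual code: the function names `cross_entropy` and `sampled_surprise` are hypothetical, samples are drawn without replacement, the language model is a simple unigram count, and a token missing from the context sample is given a count of 1.

```python
import math
import random
from collections import Counter


def cross_entropy(target, context):
    """Average negative log-probability of target tokens under the context's unigram counts."""
    counts = Counter(context)
    total = len(context)
    # Assumption: an unseen token is treated as if it had been seen once.
    return -sum(math.log(counts.get(tok, 1) / total) for tok in target) / len(target)


def sampled_surprise(target, context, target_sample_size, context_sample_size,
                     n_samples, rng):
    """Mean cross entropy over n_samples fixed-size samples; nan if either input is too short."""
    if len(target) < target_sample_size or len(context) < context_sample_size:
        return float("nan")
    scores = []
    for _ in range(n_samples):
        t = rng.sample(target, target_sample_size)
        c = rng.sample(context, context_sample_size)
        scores.append(cross_entropy(t, c))
    return sum(scores) / n_samples


rng = random.Random(0)
context = ["the", "cat", "sat", "on", "the", "mat"] * 20
familiar = ["the", "cat", "sat"] * 10
unfamiliar = ["quantum", "flux", "capacitor"] * 10

familiar_score = sampled_surprise(familiar, context, 20, 50, 10, rng)
unfamiliar_score = sampled_surprise(unfamiliar, context, 20, 50, 10, rng)
short_score = sampled_surprise(familiar[:5], context, 20, 50, 10, rng)  # target too short

print(familiar_score, unfamiliar_score, short_score)
```

Because every sample has the same size, a long target cannot score higher simply by containing more tokens, and a target shorter than `target_sample_size` is skipped rather than scored.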

`model_key_selector` defines how utterances in a corpus should be mapped to a model. It takes in an utterance and returns the key for the corresponding model. For this demo we want to map utterances to models based on their speaker and conversation ids.

The transformer also has an optional `cv` parameter for customizing the `scikit-learn` `CountVectorizer` it uses to vectorize text. Since we are comparing Surprise to the SpeakerConvoDiversity transformer, we want our transformer to handle tokenization the same way SpeakerConvoDiversity does, so we will pass in a custom tokenizer function.
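
As a small illustration of why identity functions are passed to `CountVectorizer` in this demo: when each document is already a list of tokens, an identity `preprocessor` and `tokenizer` make the vectorizer count those tokens as given. The toy documents below are made up for the example, and scikit-learn may emit a warning about unused parameters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each "document" is already a list of tokens, so both hooks are identity functions.
cv = CountVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x)

docs = [["go", "big", "red"], ["big", "red", "bear", "!"]]
counts = cv.fit_transform(docs)

print(sorted(cv.vocabulary_))   # the tokens exactly as supplied, punctuation included
print(counts.toarray())         # one row per document, one column per token
```

With the default settings, the vectorizer would instead expect raw strings and apply its own token pattern, which would not match the tokens produced by `TextParser`.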

The `smooth` parameter determines whether the transformer uses +1 Laplace smoothing (`smooth = True`) or naively replaces 0 counts with 1s (`smooth = False`), as SpeakerConvoDiversity does. Here we'll set `smooth = False` since we're comparing the results of Surprise with those of SpeakerConvoDiversity.
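
To make the difference between the two options concrete, here is a minimal sketch on toy data. The helper `token_probs` is hypothetical, and the smoothing denominator used here (context length plus the size of the combined vocabulary) is one common choice that we assume for illustration; the transformer's internals may differ.

```python
from collections import Counter


def token_probs(target, context, smooth):
    """Probability of each distinct target token under the context's unigram counts."""
    counts = Counter(context)
    vocab = set(context) | set(target)
    probs = {}
    for tok in set(target):
        if smooth:
            # +1 Laplace smoothing: add one to every count, including zeros.
            probs[tok] = (counts[tok] + 1) / (len(context) + len(vocab))
        else:
            # Naive fallback: only zero counts are bumped up to 1.
            probs[tok] = max(counts[tok], 1) / len(context)
    return probs


context = ["slope", "day", "slope", "prelim"]
target = ["slope", "finals"]  # "finals" never appears in the context

smoothed = token_probs(target, context, smooth=True)
naive = token_probs(target, context, smooth=False)

print(smoothed)
print(naive)
```

Both options avoid taking the log of zero for an unseen token, but smoothing also shifts the probabilities of tokens that were seen, which is why the two settings give different scores.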

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

surp = Surprise(cv=CountVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x), model_key_selector=lambda utt: '_'.join([utt.speaker.id, utt.conversation_id]), target_sample_size=100, context_sample_size=100, n_samples=50, smooth=False)

Step 3: Fit transformer to corpus
-----
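The `text_func` parameter defines what text each model should be trained on. For this demo, we want the model for a (speaker, conversation) pair to be trained on utterances from the same speaker in different conversations. Note that the `_get_text_func` helper in the next cell narrows this further: it keeps only the speaker's conversations whose `convo_idx` has the opposite parity from that of the current conversation.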
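
The idea of training on "the same speaker in different conversations" can be sketched with a toy lookup table. The table `tokens_by_pair` and the helper `context_tokens` are hypothetical stand-ins for the corpus attribute table, not part of ConvoKit.

```python
from itertools import chain

# Toy stand-in for the speaker-conversation token table: (speaker, conversation) -> tokens.
tokens_by_pair = {
    ("alice", "c1"): ["hello", "there"],
    ("alice", "c2"): ["prelim", "tomorrow"],
    ("alice", "c3"): ["slope", "day"],
    ("bob", "c1"): ["hi"],
}


def context_tokens(speaker, convo_id, table):
    """Tokens the same speaker produced in every conversation except convo_id."""
    others = (toks for (spk, cid), toks in table.items()
              if spk == speaker and cid != convo_id)
    return list(chain.from_iterable(others))


alice_c2_context = context_tokens("alice", "c2", tokens_by_pair)
print(alice_c2_context)
```

Leaving the current conversation out of its own context is what lets the score measure how much the speaker departs from their usual language, rather than how well the conversation predicts itself.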

In [13]:
from itertools import chain
speaker_convo_attr_table = subset_corpus.get_full_attribute_table(['tokens'])

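def _get_text_func(utt, df):
    # Row for this utterance's (speaker, conversation) pair.
    utt_row = df.loc[f'{utt.speaker.id}__{utt.conversation_id}']
    # Same speaker, restricted to conversations whose index has the opposite parity.
    ref_subset = df[(df.convo_idx % 2 != utt_row.convo_idx % 2) & (df.speaker == utt_row.speaker)]
    # Concatenate the selected token lists into a single training document.
    return [np.array(list(chain(*ref_subset.tokens.values)))]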

surp = surp.fit(subset_corpus, text_func=lambda utt: _get_text_func(utt, speaker_convo_attr_table))

3269it [01:28, 16.89it/s]

Step 4: Transform corpus
--------
We'll call `transform` with object type `'speaker'` so that surprise scores will be added as a metadata field for each speaker.

In [14]:
transformed_corpus = surp.transform(subset_corpus, 'speaker')

0it [00:00, ?it/s]


TypeError: len() of unsized object

Analysis
------
Let's take a look at some of the most surprising speaker conversation involvements.

In [None]:
import pandas as pd
from functools import reduce
def combine_dicts(x,y):
    x.update(y)
    return x
surprise_scores = reduce(combine_dicts, transformed_corpus.get_speakers_dataframe()['meta.surprise'].values)
suprise_series = pd.Series(surprise_scores).dropna()

In [None]:
most_surprising = suprise_series.sort_values(ascending=False).head(10)
most_surprising

Now, let's look at some of the least surprising entries.

In [None]:
least_surprising = suprise_series.sort_values(ascending=True).head(10)
least_surprising

## Comparison to SpeakerConvoDiversity

In [None]:
from convokit import SpeakerConvoDiversity

scd = SpeakerConvoDiversity('div', select_fn=lambda df, row, aux: (df.convo_id != row.convo_id) & (df.speaker == row.speaker), speaker_cols=['n_convos'], aux_input={'n_iters': 50, 'cmp_sample_size': 100, 'ref_sample_size': 100}, verbosity=1000)

In [None]:
div_transformed = scd.transform(subset_corpus)

Here are the speaker convo entries that have the highest diversity score.

In [None]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div', ascending=False).head(10)

With Laplace smoothing enabled (`smooth=True`), the scores returned by the `Surprise` transformer differ slightly from the diversity scores returned by `SpeakerConvoDiversity`. The difference comes from how out-of-vocabulary (OOV) tokens are handled: `Surprise` applies Laplace smoothing, whereas `SpeakerConvoDiversity` simply treats the count of an OOV token as 1. With the `smooth` flag set to `False`, as configured in Step 2, `Surprise` treats OOV tokens the same way `SpeakerConvoDiversity` does and returns the same scores.

Here are the least diverse speaker-convo entries based on the SpeakerConvoDiversity transformer.

In [None]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div').head(10)

## Surprise With Smoothing

In [None]:
import spacy
spacy_nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])  # assumption: any installed spaCy English model works here
surp = Surprise(cv=CountVectorizer(tokenizer=lambda x: [t.text for t in spacy_nlp(x)]), model_key_selector=lambda utt: '_'.join([utt.speaker.id, utt.conversation_id]), surprise_attr_name='surprise_smoothed', target_sample_size=100, context_sample_size=100, n_samples=50, smooth=True)
surp.fit(subset_corpus, text_func=lambda utt: [u.text for u in utt.speaker.iter_utterances() if u.conversation_id != utt.conversation_id])
transformed_corpus = surp.transform(subset_corpus, 'speaker')

In [None]:
import pandas as pd
from functools import reduce
def combine_dicts(x,y):
    x.update(y)
    return x
surprise_scores = reduce(combine_dicts, transformed_corpus.get_speakers_dataframe()['meta.surprise_smoothed'].values)
suprise_series = pd.Series(surprise_scores).dropna()

In [None]:
most_surprising = suprise_series.sort_values(ascending=False).head(10)
most_surprising

In [None]:
least_surprising = suprise_series.sort_values(ascending=True).head(10)
least_surprising

## SpeakerConvoDiversity reimplemented using Surprise

In [None]:
from convokit import SpeakerConvoDiversityWrapper
from convokit.speakerConvoDiversity.speakerConvoDiversity2 import SpeakerConvoDiversityWrapper as SpeakerConvoDiversityWrapper2

In [None]:
corpus = Corpus(filename=download('subreddit-Cornell'))
corpus.print_summary_stats()

In [None]:
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot','AutoModerator']
def utterance_is_valid(utterance):
    return (utterance.id != utterance.conversation_id) and (utterance.speaker.id not in SPEAKER_BLACKLIST)

corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)

In [None]:
speaker_activities = corpus.get_attribute_table('speaker',['n_convos'])
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(25).index

In [None]:
subset_utts = []
for speaker in top_speakers:
    subset_utts += list(corpus.get_speaker(speaker).iter_utterances(selector=utterance_is_valid))
subset_corpus = Corpus(utterances=subset_utts)
subset_corpus.print_summary_stats()

In [None]:
tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

In [None]:
scd = SpeakerConvoDiversityWrapper(lifestage_size=2, max_exp=20,
                sample_size=20, min_n_utterances=1, n_iters=50, cohort_delta=60*60*24*30*2, verbosity=100)

In [None]:
subset_corpus = scd.transform(subset_corpus)

In [None]:
subset_corpus.get_speaker_convo_attribute_table(attrs=['div__self', 'div__other', 'div__adj']).dropna(subset=['div__self', 'div__other', 'div__adj'], how='all').head(10)

In [None]:
subset_corpus = Corpus(utterances=subset_utts)
subset_corpus.print_summary_stats()

In [None]:
scd = SpeakerConvoDiversityWrapper2(lifestage_size=2, max_exp=20,
                sample_size=20, min_n_utterances=1, n_iters=50, cohort_delta=60*60*24*30*2, verbosity=100)

In [None]:
subset_corpus = scd.transform(subset_corpus)

In [None]:
subset_corpus.get_speaker_convo_attribute_table(attrs=['div__self', 'div__other', 'div__adj']).dropna(subset=['div__self', 'div__other', 'div__adj'], how='all').head(10)