Computing Surprise With ConvoKit
=====================
This notebook provides a demo of how to use the Surprise transformer to compute surprise across a corpus. The transformer currently only allows computation of how surprising a speaker's utterances in one conversation (target) are compared to their utterances in all other conversations (context) in the corpus. Eventually, the functionality of the Surprise transformer will be abstracted to allow for computation of surprise between any target and context types.

In [1]:
import convokit
import numpy as np
from convokit import Corpus, download, Surprise

Step 1: Load a corpus
--------
For now, we will use data from the subreddit r/Cornell to demonstrate the functionality of this transformer.

In [2]:
corpus = Corpus(filename=download('subreddit-Cornell'))

Dataset already exists at C:\Users\rgang\.convokit\downloads\subreddit-Cornell


In [3]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


In order to speed up the demo, we will take just the top 100 most active speakers (based on the number of conversations they participate in).

In [4]:
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot', 'AutoModerator']
def utterance_is_valid(utterance):
    # Keep only non-empty utterances whose speaker is not a deleted account or a bot.
    return utterance.speaker.id not in SPEAKER_BLACKLIST and bool(utterance.text)

In [5]:
corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)

In [6]:
speaker_activities = corpus.get_attribute_table('speaker', ['n_convos'])

In [7]:
speaker_activities.sort_values('n_convos', ascending=False).head(10)

id,n_convos
laveritecestla,781.0
EQUASHNZRKUL,726.0
CornHellUniversity,696.0
t3hasiangod,647.0
ilovemymemesboo,430.0
omgdonerkebab,425.0
cartesiancategory,341.0
cornell256,330.0
mushiettake,321.0
Fencerman2,298.0


In [8]:
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(100).index

In [9]:
import itertools

subset_utts = [list(corpus.get_speaker(speaker).iter_utterances()) for speaker in top_speakers]
subset_corpus = Corpus(utterances=list(itertools.chain(*subset_utts)))

In [10]:
subset_corpus.print_summary_stats()

Number of Speakers: 100
Number of Utterances: 20700
Number of Conversations: 6904


Step 2: Create instance of surprise transformer
---------------
`target_sample_size` and `context_sample_size` specify the minimum number of tokens that should be in the target and context, respectively. If the target or context is too short, the transformer sets the surprise to `nan`. If we set these to simply be 1, the most surprising statements tend to be just the very short statements. The transformer takes `n_samples` samples from the target and from the context (with sample sizes given by `target_sample_size` and `context_sample_size`). It calculates the cross entropy for each pair of samples and averages them to get the final surprise score. This is done to minimize the effect of length on scores.
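To make the sampling procedure concrete, here is a minimal sketch in plain Python. It is not ConvoKit's implementation: the function names `cross_entropy` and `sampled_surprise`, the add-one smoothing over the joint vocabulary of each sample pair, and the `seed` argument are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def cross_entropy(target_tokens, context_tokens):
    """Cross entropy of the target under an add-one smoothed unigram model of the context.

    Assumption: the vocabulary is the union of the tokens in both samples.
    """
    context_counts = Counter(context_tokens)
    vocab_size = len(set(target_tokens) | set(context_tokens))
    denom = len(context_tokens) + vocab_size
    log_probs = (math.log((context_counts[tok] + 1) / denom) for tok in target_tokens)
    return -sum(log_probs) / len(target_tokens)


def sampled_surprise(target_tokens, context_tokens,
                     target_sample_size=100, context_sample_size=100,
                     n_samples=50, seed=0):
    """Average cross entropy over n_samples pairs of fixed-size random samples."""
    # Too little text on either side: report nan instead of a score.
    if (len(target_tokens) < target_sample_size
            or len(context_tokens) < context_sample_size):
        return float("nan")
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        target_sample = rng.sample(target_tokens, target_sample_size)
        context_sample = rng.sample(context_tokens, context_sample_size)
        scores.append(cross_entropy(target_sample, context_sample))
    return sum(scores) / len(scores)
```

Because every sample has the same number of tokens, a long target and a short target are scored on equal footing, which is the point of sampling rather than scoring the full text.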

`model_key_selector` defines how utterances in a corpus should be mapped to a model. It takes in an utterance and returns the key for the corresponding model. For this demo we want to map utterances to models based on their speaker and conversation ids.
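As a rough illustration of what a key selector does, the sketch below groups toy utterances by a speaker-plus-conversation key. `ToyUtt`, `key_selector`, and `group_by_key` are hypothetical names for this example only; real ConvoKit utterances expose `utt.speaker.id` and `utt.conversation_id` instead of the flat fields used here.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a ConvoKit utterance, with only the fields needed here.
ToyUtt = namedtuple("ToyUtt", ["speaker_id", "conversation_id", "text"])


def key_selector(utt):
    """Map an utterance to the key of the model it belongs to."""
    return "_".join([utt.speaker_id, utt.conversation_id])


def group_by_key(utterances, selector):
    """Collect utterance texts under the key returned by the selector."""
    groups = defaultdict(list)
    for utt in utterances:
        groups[selector(utt)].append(utt.text)
    return dict(groups)


utts = [
    ToyUtt("alice", "c1", "hello there"),
    ToyUtt("alice", "c1", "how are you"),
    ToyUtt("alice", "c2", "prelim tomorrow"),
    ToyUtt("bob", "c1", "hi"),
]
groups = group_by_key(utts, key_selector)
```

Here the two `alice` utterances in conversation `c1` share the key `alice_c1`, so they would be scored together as one target.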

In [11]:
surp = Surprise(model_key_selector=lambda utt: '_'.join([utt.speaker.id, utt.conversation_id]), target_sample_size=100, context_sample_size=100, n_samples=50, smooth=True)

Step 3: Fit transformer to corpus
-----
The `text_func` parameter defines what text each model should be trained on. For this demo, we want a model corresponding to a (speaker, conversation) pair to be trained on all the utterances from the same speaker in different conversations.
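The context selection described above can be sketched on toy data as follows. `ToyUtt` and `context_texts` are hypothetical names for this example; in the notebook, the same filtering is done through `utt.speaker.iter_utterances()`.

```python
from collections import namedtuple

# Hypothetical stand-in for a ConvoKit utterance, with only the fields needed here.
ToyUtt = namedtuple("ToyUtt", ["speaker_id", "conversation_id", "text"])


def context_texts(utt, all_utts):
    """Texts by the same speaker in every conversation except utt's own."""
    return [
        u.text
        for u in all_utts
        if u.speaker_id == utt.speaker_id
        and u.conversation_id != utt.conversation_id
    ]


utts = [
    ToyUtt("alice", "c1", "hello there"),
    ToyUtt("alice", "c2", "prelim tomorrow"),
    ToyUtt("alice", "c3", "see you at the library"),
    ToyUtt("bob", "c2", "good luck"),
]
context = context_texts(utts[0], utts)
```

For the `alice` utterance in `c1`, the context contains only her utterances in `c2` and `c3`; `bob`'s utterance and her own `c1` text are excluded.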

In [12]:
surp = surp.fit(subset_corpus, text_func=lambda utt: [u.text for u in utt.speaker.iter_utterances() if u.conversation_id != utt.conversation_id])

Step 4: Transform corpus
--------
We'll call `transform` with object type `'speaker'` so that surprise scores will be added as a metadata field for each speaker.

In [13]:
transformed_corpus = surp.transform(subset_corpus, 'speaker')

Analysis
------
Let's take a look at some of the most surprising speaker conversation involvements.

In [14]:
import pandas as pd
from functools import reduce
def combine_dicts(x,y):
    x.update(y)
    return x
surprise_scores = reduce(combine_dicts, transformed_corpus.get_speakers_dataframe()['meta.surprise'].values, {})
suprise_series = pd.Series(surprise_scores).dropna()

In [15]:
most_surprising = suprise_series.sort_values(ascending=False).head(10)
most_surprising

GROUP_cartesiancategory_6pot7n__MODEL_cartesiancategory_6pot7n    5.231371
GROUP_t3hasiangod_4f56qv__MODEL_t3hasiangod_4f56qv                5.231347
GROUP_t3hasiangod_3sc927__MODEL_t3hasiangod_3sc927                5.225886
GROUP_t3hasiangod_3t9lgm__MODEL_t3hasiangod_3t9lgm                5.225873
GROUP_t3hasiangod_5641lt__MODEL_t3hasiangod_5641lt                5.222505
GROUP_t3hasiangod_5fqbes__MODEL_t3hasiangod_5fqbes                5.222256
GROUP_t3hasiangod_4tymy0__MODEL_t3hasiangod_4tymy0                5.221537
GROUP_t3hasiangod_57ci9e__MODEL_t3hasiangod_57ci9e                5.219757
GROUP_t3hasiangod_5v6s6m__MODEL_t3hasiangod_5v6s6m                5.215703
GROUP_EQUASHNZRKUL_59sn56__MODEL_EQUASHNZRKUL_59sn56              5.215699
dtype: float64

Now, let's look at some of the least surprising entries.

In [16]:
least_surprising = suprise_series.sort_values(ascending=True).head(10)
least_surprising

GROUP_apbay_8bmd6z__MODEL_apbay_8bmd6z                      4.199647
GROUP_Sleeppppp_7jz1e0__MODEL_Sleeppppp_7jz1e0              4.265192
GROUP_Sleeppppp_7ldk9w__MODEL_Sleeppppp_7ldk9w              4.267883
GROUP_chrissydablack_5vvc60__MODEL_chrissydablack_5vvc60    4.336837
GROUP_ChocolatePain_2kxu9t__MODEL_ChocolatePain_2kxu9t      4.440125
GROUP_Bearclawmen_8z0gx8__MODEL_Bearclawmen_8z0gx8          4.474428
GROUP_Bearclawmen_91yv8u__MODEL_Bearclawmen_91yv8u          4.479524
GROUP_apbay_4f3ko9__MODEL_apbay_4f3ko9                      4.483573
GROUP_BuildAnything_8rf7j1__MODEL_BuildAnything_8rf7j1      4.487007
GROUP_soontocollege_4retr6__MODEL_soontocollege_4retr6      4.495519
dtype: float64
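The keys in these series pack the speaker and conversation id into one string. A small helper like the one below can split them apart for further analysis. It is a sketch that assumes the `GROUP_<key>__MODEL_<key>` layout visible in the output above and assumes conversation ids contain no underscores; `parse_surprise_key` is a hypothetical name, not a ConvoKit function.

```python
def parse_surprise_key(key):
    """Split a 'GROUP_<speaker>_<convo>__MODEL_<model key>' string into its parts."""
    prefix, separator = "GROUP_", "__MODEL_"
    if not key.startswith(prefix) or separator not in key:
        raise ValueError(f"unexpected key format: {key!r}")
    group_part, model_part = key[len(prefix):].split(separator, 1)
    # Speaker names may contain underscores, so split on the last one only.
    speaker, convo_id = group_part.rsplit("_", 1)
    return {"speaker": speaker, "convo_id": convo_id, "model": model_part}


parsed = parse_surprise_key(
    "GROUP_cartesiancategory_6pot7n__MODEL_cartesiancategory_6pot7n"
)
```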

### Comparison to SpeakerConvoDiversity

In [17]:
from convokit import SpeakerConvoDiversity

scd = SpeakerConvoDiversity('div', select_fn=lambda df, row, aux: (df.convo_id != row.convo_id) & (df.speaker == row.speaker), speaker_cols=['n_convos'], aux_input={'n_iters': 50, 'cmp_sample_size': 100, 'ref_sample_size': 100}, verbosity=100)

In [18]:
from convokit.text_processing import TextParser

tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

20700/20700 utterances processed


In [19]:
div_transformed = scd.transform(subset_corpus)

joining tokens across conversation utterances

Here are the speaker-convo entries that have the highest diversity scores.

In [20]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div', ascending=False).head(10)

id,speaker,convo_id,convo_idx,div
Straight_Derpin__5kst5l,Straight_Derpin,5kst5l,34,4.58973
Dr_Narwhal__6h08sg,Dr_Narwhal,6h08sg,75,4.543916
SwissWatchesOnly__9hcpip,SwissWatchesOnly,9hcpip,129,4.494886
t3hasiangod__5v6sqb,t3hasiangod,5v6sqb,590,4.492267
rrrrrrr1131__8l3xht,rrrrrrr1131,8l3xht,25,4.487745
sasha07974__8v40c1,sasha07974,8v40c1,42,4.48657
agottler__9iyo8u,agottler,9iyo8u,66,4.483837
ScottVandeberg__8tlcdl,ScottVandeberg,8tlcdl,81,4.48068
t3hasiangod__4ufm6z,t3hasiangod,4ufm6z,262,4.480016
cartesiancategory__8bdf5g,cartesiancategory,8bdf5g,310,4.478022


Notice that the diversity scores returned by `SpeakerConvoDiversity` are slightly different from the scores returned by the `Surprise` transformer. This difference can be attributed to the Laplace smoothing that the `Surprise` transformer applies to account for out-of-vocabulary (OOV) tokens. The `SpeakerConvoDiversity` transformer deals with OOV tokens by simply treating their count as 1. If you run the `Surprise` transformer with `smooth=False`, it treats OOV tokens the same way `SpeakerConvoDiversity` does and returns the same scores.
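The effect of the two OOV treatments can be seen on a tiny example. The sketch below is not taken from either transformer's source: both function names are hypothetical, and the add-one smoothing over the joint vocabulary of target and context is an assumption about what Laplace smoothing means here.

```python
import math
from collections import Counter


def cross_entropy_smoothed(target, context):
    """Add-one (Laplace) smoothing: every token's context count is incremented by 1."""
    counts = Counter(context)
    vocab_size = len(set(target) | set(context))
    denom = len(context) + vocab_size
    return -sum(math.log((counts[t] + 1) / denom) for t in target) / len(target)


def cross_entropy_oov_as_one(target, context):
    """No smoothing: tokens missing from the context are treated as if seen once."""
    counts = Counter(context)
    return -sum(math.log(counts.get(t, 1) / len(context)) for t in target) / len(target)


target = ["a", "b", "z"]        # "z" never appears in the context
context = ["a", "a", "b", "c"]
smoothed = cross_entropy_smoothed(target, context)
unsmoothed = cross_entropy_oov_as_one(target, context)
```

On this toy input the two scores differ because smoothing changes the probability of every token, not only the OOV token `z`.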

Here are the least diverse speaker-convo entries based on the SpeakerConvoDiversity transformer.

In [21]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div').head(10)

id,speaker,convo_id,convo_idx,div
Fencerman2__6tiomd,Fencerman2,6tiomd,111,4.203668
t3hasiangod__3wtoeo,t3hasiangod,3wtoeo,46,4.221644
iBeReese__1uuldh,iBeReese,1uuldh,6,4.223295
Enyo287__5ipedu,Enyo287,5ipedu,281,4.225254
laveritecestla__4pylgl,laveritecestla,4pylgl,74,4.225618
dedicateddan__4krfrc,dedicateddan,4krfrc,97,4.228558
Pjcrafty__5apodz,Pjcrafty,5apodz,17,4.230794
t3hasiangod__4ar3u0,t3hasiangod,4ar3u0,99,4.231619
kickstand__obvjl,kickstand,obvjl,6,4.235342
shadowclan98__6njs8z,shadowclan98,6njs8z,8,4.238192


In [11]:
from convokit.speaker_convo_helpers.speaker_convo_lifestage import SpeakerConvoLifestage

lifestage_transform = SpeakerConvoLifestage(20)
lifestage_transform.transform(subset_corpus)

<convokit.model.corpus.Corpus at 0x1dd3e8ac8c8>

In [16]:
subset_corpus.get_speaker_convo_info('laveritecestla', '5xn5yi', key='lifestage')

13