Computing Surprise With ConvoKit
=====================
This notebook demonstrates how to use the Surprise transformer to compute surprise across a corpus. We use it to compute Speaker Convo Diversity, a measure of how surprising a speaker's participation in one conversation is compared to their participation in all other conversations, and then compare the results to those obtained using the actual SpeakerConvoDiversity transformer. We eventually want to use the Surprise transformer within the SpeakerConvoDiversity transformer to reduce redundancy, but for now, this demo serves as a sanity check on the correctness of the Surprise transformer.

In [1]:
import convokit
import itertools
import numpy as np
import spacy
from convokit import Corpus, download, Surprise
from convokit.text_processing import TextProcessor, TextParser
from sklearn.feature_extraction.text import CountVectorizer

Step 1: Load a corpus
--------
For now, we will use data from the subreddit r/Cornell to demonstrate the functionality of this transformer.

In [2]:
corpus = Corpus(filename=download('subreddit-Cornell'))

Dataset already exists at /home/axl4/.convokit/downloads/subreddit-Cornell


In [3]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


In order to speed up the demo, we will take just the top 100 most active speakers (based on the number of conversations they participate in).

In [4]:
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot', 'AutoModerator']
def utterance_is_valid(utterance):
    return utterance.speaker.id not in SPEAKER_BLACKLIST and utterance.text

In [5]:
corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)



In [6]:
speaker_activities = corpus.get_attribute_table('speaker', ['n_convos'])

In [7]:
speaker_activities.sort_values('n_convos', ascending=False).head(10)

id,n_convos
laveritecestla,781.0
EQUASHNZRKUL,726.0
CornHellUniversity,696.0
t3hasiangod,647.0
ilovemymemesboo,430.0
omgdonerkebab,425.0
cartesiancategory,341.0
cornell256,330.0
mushiettake,321.0
Fencerman2,298.0


In [8]:
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(100).index

In [9]:
import itertools

subset_utts = [list(corpus.get_speaker(speaker).iter_utterances(selector=utterance_is_valid)) for speaker in top_speakers]
subset_corpus = Corpus(utterances=list(itertools.chain(*subset_utts)))

In [10]:
subset_corpus.print_summary_stats()

Number of Speakers: 100
Number of Utterances: 20550
Number of Conversations: 6866


Step 2: Create instance of Surprise transformer
---------------
`target_sample_size` and `context_sample_size` specify the minimum number of tokens that should be in the target and context respectively. If the target or context is too short, the transformer will set the surprise to `nan`. If we set these to simply be 1, the most surprising statements tend to just be the very short statements. The transformer takes `n_samples` samples from the target and context (where samples are of size `target_sample_size` and `context_sample_size` respectively). It calculates the cross entropy for each pair of samples and takes the average to get the final surprise score. This is done to minimize the effect of length on scores.
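
The sampling procedure described above can be sketched in plain Python. This is a minimal illustration, not ConvoKit's implementation: the function names, the natural logarithm, sampling without replacement, and counting unseen tokens as 1 are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def cross_entropy(target_tokens, context_tokens):
    """Mean negative log-probability of the target tokens under the
    unigram distribution estimated from the context tokens."""
    context_counts = Counter(context_tokens)
    total = len(context_tokens)
    # Assumption: a token unseen in the context is given a count of 1
    # so that the logarithm is always defined.
    log_probs = [
        math.log(max(context_counts[tok], 1) / total) for tok in target_tokens
    ]
    return -sum(log_probs) / len(target_tokens)


def sampled_surprise(target, context, target_sample_size,
                     context_sample_size, n_samples, seed=0):
    """Average cross entropy over `n_samples` fixed-size random samples.

    Returns nan when the target or the context has fewer tokens than
    the requested sample size.
    """
    if len(target) < target_sample_size or len(context) < context_sample_size:
        return float("nan")
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        target_sample = rng.sample(target, target_sample_size)
        context_sample = rng.sample(context, context_sample_size)
        scores.append(cross_entropy(target_sample, context_sample))
    return sum(scores) / n_samples
```

Because every sample has the same size regardless of how long the underlying text is, a short target and a long target are scored on equal footing, which is the motivation given above for sampling.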

`model_key_selector` defines how utterances in a corpus should be mapped to a model. It takes in an utterance and returns the key for the corresponding model. For this demo we want to map utterances to models based on their speaker and conversation ids.
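
As a rough illustration of what such a key selector does, the sketch below groups tokens by a speaker-plus-conversation key. `ToyUtterance` and `group_by_model_key` are hypothetical stand-ins invented for this example; they are not ConvoKit classes or functions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToyUtterance:
    """Hypothetical stand-in for an utterance (not the ConvoKit class)."""
    speaker_id: str
    conversation_id: str
    tokens: list


def model_key_selector(utt):
    # Same shape as the selector used in this demo:
    # speaker id and conversation id joined by an underscore.
    return "_".join([utt.speaker_id, utt.conversation_id])


def group_by_model_key(utterances, key_fn):
    """Collect the tokens of all utterances that map to the same key."""
    groups = defaultdict(list)
    for utt in utterances:
        groups[key_fn(utt)].extend(utt.tokens)
    return dict(groups)


utts = [
    ToyUtterance("alice", "c1", ["hi"]),
    ToyUtterance("alice", "c1", ["there"]),
    ToyUtterance("alice", "c2", ["bye"]),
    ToyUtterance("bob", "c1", ["yo"]),
]
groups = group_by_model_key(utts, model_key_selector)
```

Here both of alice's utterances in conversation `c1` end up under the single key `alice_c1`, while her utterance in `c2` gets its own key.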

The transformer also has an optional `cv` parameter to customize the `scikit-learn` `CountVectorizer` it uses to vectorize text. Since we are comparing Surprise to the SpeakerConvoDiversity transformer, we want to make sure that our transformer handles tokenization the same way as SpeakerConvoDiversity, so we will pass in a custom tokenizer function.
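
To see why an identity function can serve as the tokenizer when the text has already been split into tokens, consider the toy counter below. `count_tokens` is a hypothetical helper written for this example and only mimics the idea of a pluggable tokenizer; it is not `CountVectorizer`.

```python
from collections import Counter


def count_tokens(documents, tokenizer):
    """Toy bag-of-words: apply `tokenizer` to each document and count tokens."""
    return [Counter(tokenizer(doc)) for doc in documents]


# Documents that are already lists of tokens pass through unchanged.
pre_tokenized = [["go", "big", "red"], ["go", "go"]]
identity_counts = count_tokens(pre_tokenized, tokenizer=lambda x: x)

# Raw strings need a tokenizer that actually splits them.
raw = ["go big red", "go go"]
split_counts = count_tokens(raw, tokenizer=str.split)
```

Both calls produce the same counts, so pre-tokenizing once and passing an identity tokenizer keeps the tokenization under our control.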

The `smooth` parameter determines whether the transformer uses +1 Laplace smoothing (`smooth = True`) or naively replaces 0 counts with 1s (`smooth = False`), as SpeakerConvoDiversity does. Here we'll set `smooth = False` since we're comparing the results of Surprise with SpeakerConvoDiversity.
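
The two options can be contrasted on a tiny example. The sketch below uses the textbook add-one formula, (count + 1) / (N + V), for the smoothed case and max(count, 1) / N for the naive case; the helper name and the choice of V as the combined target and context vocabulary are assumptions, and ConvoKit's exact arithmetic may differ.

```python
from collections import Counter


def token_probability(token, context_counts, total, vocab_size, smooth):
    """Probability of `token` under the context distribution."""
    if smooth:
        # Add-one (Laplace) smoothing: every vocabulary item gains one count.
        return (context_counts[token] + 1) / (total + vocab_size)
    # Naive alternative: only zero counts are replaced with 1.
    return max(context_counts[token], 1) / total


context = ["the", "the", "cat", "sat"]
target = ["the", "dog"]  # "dog" never occurs in the context

counts = Counter(context)
total = len(context)
vocab_size = len(set(context) | set(target))

p_dog_smooth = token_probability("dog", counts, total, vocab_size, smooth=True)
p_dog_naive = token_probability("dog", counts, total, vocab_size, smooth=False)
p_the_smooth = token_probability("the", counts, total, vocab_size, smooth=True)
p_the_naive = token_probability("the", counts, total, vocab_size, smooth=False)
```

In this example smoothing lowers the probability of every token, seen or unseen, which would raise the resulting cross entropy.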

In [11]:
import spacy

spacy_nlp = spacy.load('en_core_web_sm', disable=['ner','parser', 'tagger', 'lemmatizer'])
for utt in subset_corpus.iter_utterances():
    utt.meta['joined_tokens'] = [t.text.lower() for t in spacy_nlp(utt.text)]

In [12]:
surp = Surprise(
    tokenizer=lambda x: x,
    model_key_selector=lambda utt: '_'.join([utt.speaker.id, utt.conversation_id]),
    target_sample_size=100,
    context_sample_size=1000,
    n_samples=50,
    smooth=False,
)

In [13]:
surp = surp.fit(
    subset_corpus,
    # Context for an utterance: all tokens its speaker wrote in other conversations.
    text_func=lambda utt: [
        list(itertools.chain(*[
            u.meta['joined_tokens']
            for u in utt.speaker.iter_utterances()
            if u.conversation_id != utt.conversation_id
        ]))
    ],
)

fit1: 20550it [00:16, 1267.10it/s]
fit2: 100%|██████████| 15394/15394 [00:00<00:00, 989140.20it/s]


Step 3: Transform corpus
--------
We'll call `transform` with object type `'speaker'` so that surprise scores will be added as a metadata field for each speaker.

In [14]:
transformed_corpus = surp.transform(subset_corpus, 'speaker')

transform: 100it [13:21,  8.01s/it]


Analysis
------
Let's take a look at some of the most surprising speaker conversation involvements.

In [15]:
import pandas as pd
from functools import reduce
def combine_dicts(x, y):
    x.update(y)
    return x
# Start from an empty dict so that no speaker's own metadata dict is modified in place.
surprise_scores = reduce(combine_dicts, transformed_corpus.get_speakers_dataframe()['meta.surprise'].values, {})
suprise_series = pd.Series(surprise_scores).dropna()

In [16]:
most_surprising = suprise_series.sort_values(ascending=False).head(10)
most_surprising

GROUP_EQUASHNZRKUL_815y6t__MODEL_EQUASHNZRKUL_815y6t            6.903181
GROUP_SwissWatchesOnly_8g5q88__MODEL_SwissWatchesOnly_8g5q88    6.884900
GROUP_SwissWatchesOnly_67cljd__MODEL_SwissWatchesOnly_67cljd    6.819719
GROUP_CornellMan333_9iwucv__MODEL_CornellMan333_9iwucv          6.801696
GROUP_EQUASHNZRKUL_73xuw6__MODEL_EQUASHNZRKUL_73xuw6            6.797471
GROUP_Udontlikecake_7rj6a0__MODEL_Udontlikecake_7rj6a0          6.796684
GROUP_Straight_Derpin_5kst5l__MODEL_Straight_Derpin_5kst5l      6.768749
GROUP_laveritecestla_6v4ysm__MODEL_laveritecestla_6v4ysm        6.745163
GROUP_SharkHogBestHog_9f8mou__MODEL_SharkHogBestHog_9f8mou      6.738350
GROUP_ClawofBeta_52u1nu__MODEL_ClawofBeta_52u1nu                6.736324
dtype: float64

Now, let's look at some of the least surprising entries.

In [17]:
least_surprising = suprise_series.sort_values(ascending=True).head(10)
least_surprising

GROUP_Unga_Bunga_30ac0l__MODEL_Unga_Bunga_30ac0l              5.544727
GROUP_crash_over-ride_8f7b0y__MODEL_crash_over-ride_8f7b0y    5.607746
GROUP_Bisphosphate_7r8nu1__MODEL_Bisphosphate_7r8nu1          5.613737
GROUP_crash_over-ride_6bjxnm__MODEL_crash_over-ride_6bjxnm    5.621535
GROUP_crash_over-ride_30zba1__MODEL_crash_over-ride_30zba1    5.622496
GROUP_omgdonerkebab_v4a3p__MODEL_omgdonerkebab_v4a3p          5.660275
GROUP_crash_over-ride_2vhtzx__MODEL_crash_over-ride_2vhtzx    5.666062
GROUP_crash_over-ride_2vtgvc__MODEL_crash_over-ride_2vtgvc    5.673908
GROUP_crash_over-ride_t6w01__MODEL_crash_over-ride_t6w01      5.674109
GROUP_crash_over-ride_9b132c__MODEL_crash_over-ride_9b132c    5.683613
dtype: float64
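
The keys in the series above follow the pattern `GROUP_<key>__MODEL_<key>`, where `<key>` is the value returned by `model_key_selector`. A small helper can split such a key back into speaker and conversation. This is a sketch based only on the keys printed above; `parse_surprise_key` is a hypothetical name, and it assumes conversation ids never contain an underscore (speaker ids may).

```python
def parse_surprise_key(key):
    """Split 'GROUP_<g>__MODEL_<m>' into (speaker, conversation_id, model_key)."""
    group_part, model_key = key.split("__MODEL_", 1)
    if not group_part.startswith("GROUP_"):
        raise ValueError(f"unexpected key format: {key!r}")
    group_key = group_part[len("GROUP_"):]
    # Assumption: the conversation id is the text after the last underscore.
    speaker, conversation_id = group_key.rsplit("_", 1)
    return speaker, conversation_id, model_key


parsed = parse_surprise_key(
    "GROUP_Straight_Derpin_5kst5l__MODEL_Straight_Derpin_5kst5l"
)
```

Splitting from the right keeps speaker ids such as `Straight_Derpin` or `crash_over-ride` intact.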

## Comparison to SpeakerConvoDiversity

In [18]:
from convokit.text_processing import TextProcessor, TextParser

tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

1000/20550 utterances processed
2000/20550 utterances processed
3000/20550 utterances processed
4000/20550 utterances processed
5000/20550 utterances processed
6000/20550 utterances processed
7000/20550 utterances processed
8000/20550 utterances processed
9000/20550 utterances processed
10000/20550 utterances processed
11000/20550 utterances processed
12000/20550 utterances processed
13000/20550 utterances processed
14000/20550 utterances processed
15000/20550 utterances processed
16000/20550 utterances processed
17000/20550 utterances processed
18000/20550 utterances processed
19000/20550 utterances processed
20000/20550 utterances processed
20550/20550 utterances processed


In [19]:
from convokit import SpeakerConvoDiversity

scd = SpeakerConvoDiversity('div', select_fn=lambda df, row, aux: (df.convo_id != row.convo_id) & (df.speaker == row.speaker), speaker_cols=['n_convos'], aux_input={'n_iters': 50, 'cmp_sample_size': 100, 'ref_sample_size': 1000}, verbosity=1000)

In [20]:
div_transformed = scd.transform(subset_corpus)

joining tokens across conversation utterances
1000 / 15394
2000 / 15394
3000 / 15394
4000 / 15394
5000 / 15394
6000 / 15394
7000 / 15394
8000 / 15394
9000 / 15394
10000 / 15394
11000 / 15394
12000 / 15394
13000 / 15394
14000 / 15394
15000 / 15394


Here are the speaker convo entries that have the highest diversity score.

In [21]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div', ascending=False).head(10)

id,speaker,convo_id,convo_idx,div
Straight_Derpin__5kst5l,Straight_Derpin,5kst5l,34,6.822236
Dr_Narwhal__6h08sg,Dr_Narwhal,6h08sg,75,6.523284
sasha07974__8v40c1,sasha07974,8v40c1,42,6.267265
mushiettake__89mbvs,mushiettake,89mbvs,269,6.136225
SwissWatchesOnly__9hcpip,SwissWatchesOnly,9hcpip,129,6.134547
blackashi__2xxkm4,blackashi,2xxkm4,6,6.114604
ScottVandeberg__8tlcdl,ScottVandeberg,8tlcdl,81,6.081968
cartesiancategory__8bdf5g,cartesiancategory,8bdf5g,310,6.075852
agottler__9iyo8u,agottler,9iyo8u,66,6.073317
t3hasiangod__5v6sqb,t3hasiangod,5v6sqb,590,6.059111


Notice that the diversity scores returned by `SpeakerConvoDiversity` differ slightly from the scores returned by the `Surprise` transformer. One source of difference between the two transformers is the handling of out-of-vocabulary (OOV) tokens: the `Surprise` transformer can apply Laplace smoothing to account for them, whereas the `SpeakerConvoDiversity` transformer simply treats the count of an OOV token as 1. With `smooth=False`, as in the run above, the `Surprise` transformer treats OOV tokens the same way `SpeakerConvoDiversity` does, so the two transformers compute the same quantity. The small differences that remain in the tables are likely due to both transformers averaging over random samples.
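
To compare individual entries, the two outputs first have to be aligned, since `SpeakerConvoDiversity` reports `speaker` and `convo_id` columns while the `Surprise` scores are keyed by `GROUP_..._MODEL_...` strings. The sketch below does this for a few values copied from the tables in this notebook; the helper names are hypothetical, and the key format is inferred from the printed output rather than from the library's documentation.

```python
def surprise_key(speaker, convo_id):
    """Build the Surprise key for a (speaker, conversation) pair."""
    model_key = "_".join([speaker, convo_id])
    return f"GROUP_{model_key}__MODEL_{model_key}"


def score_differences(diversity_rows, surprise_scores):
    """Return div minus surprise for every pair present in both outputs."""
    diffs = {}
    for speaker, convo_id, div in diversity_rows:
        key = surprise_key(speaker, convo_id)
        if key in surprise_scores:
            diffs[(speaker, convo_id)] = div - surprise_scores[key]
    return diffs


# Values copied from the outputs shown in this notebook (smooth=False run).
surprise_scores = {
    "GROUP_Straight_Derpin_5kst5l__MODEL_Straight_Derpin_5kst5l": 6.768749,
    "GROUP_EQUASHNZRKUL_815y6t__MODEL_EQUASHNZRKUL_815y6t": 6.903181,
}
diversity_rows = [
    ("Straight_Derpin", "5kst5l", 6.822236),
    ("Dr_Narwhal", "6h08sg", 6.523284),
]

differences = score_differences(diversity_rows, surprise_scores)
```

Only `Straight_Derpin` in conversation `5kst5l` appears in both top-10 lists, and its two scores differ by about 0.05.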

Here are the least diverse speaker-convo entries based on the SpeakerConvoDiversity transformer.

In [22]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div').head(10)

id,speaker,convo_id,convo_idx,div
CornellMan333__9epekx,CornellMan333,9epekx,56,5.041151
laveritecestla__4pylgl,laveritecestla,4pylgl,74,5.161081
CornellMan333__9j2exy,CornellMan333,9j2exy,67,5.18982
cryptkeep__3zgnom,cryptkeep,3zgnom,40,5.207361
voluminous_lexicon__6v3oa0,voluminous_lexicon,6v3oa0,9,5.225655
Fencerman2__66rq5i,Fencerman2,66rq5i,51,5.230217
iBeReese__1uuldh,iBeReese,1uuldh,6,5.236974
BuildAnything__6v0x2j,BuildAnything,6v0x2j,17,5.244524
SpookBusters__93abug,SpookBusters,93abug,225,5.248574
SantaSoul__6zqlyk,SantaSoul,6zqlyk,35,5.257848


## Surprise With Smoothing

In [23]:
import spacy

spacy_nlp = spacy.load('en_core_web_sm', disable=['ner','parser', 'tagger', 'lemmatizer'])
for utt in subset_corpus.iter_utterances():
    utt.meta['joined_tokens'] = [t.text.lower() for t in spacy_nlp(utt.text)]

In [24]:
surp = Surprise(
    tokenizer=lambda x: x,
    model_key_selector=lambda utt: '_'.join([utt.speaker.id, utt.conversation_id]),
    target_sample_size=100,
    context_sample_size=1000,
    n_samples=50,
    smooth=True,
    surprise_attr_name='surprise_smoothed',
)
surp.fit(
    subset_corpus,
    # Context for an utterance: all tokens its speaker wrote in other conversations.
    text_func=lambda utt: [
        list(itertools.chain(*[
            u.meta['joined_tokens']
            for u in utt.speaker.iter_utterances()
            if u.conversation_id != utt.conversation_id
        ]))
    ],
)
transformed_corpus = surp.transform(subset_corpus, 'speaker')

fit1: 20550it [00:11, 1826.44it/s]
fit2: 100%|██████████| 15394/15394 [00:00<00:00, 993416.66it/s]
transform: 100it [12:08,  7.29s/it]


In [25]:
import pandas as pd
from functools import reduce
def combine_dicts(x, y):
    x.update(y)
    return x
# Start from an empty dict so that no speaker's own metadata dict is modified in place.
surprise_scores = reduce(combine_dicts, transformed_corpus.get_speakers_dataframe()['meta.surprise_smoothed'].values, {})
suprise_series = pd.Series(surprise_scores).dropna()

In [26]:
most_surprising = suprise_series.sort_values(ascending=False).head(10)
most_surprising

GROUP_EQUASHNZRKUL_815y6t__MODEL_EQUASHNZRKUL_815y6t            7.218165
GROUP_SwissWatchesOnly_8g5q88__MODEL_SwissWatchesOnly_8g5q88    7.205426
GROUP_SwissWatchesOnly_67cljd__MODEL_SwissWatchesOnly_67cljd    7.137258
GROUP_EQUASHNZRKUL_73xuw6__MODEL_EQUASHNZRKUL_73xuw6            7.089950
GROUP_CornellMan333_9iwucv__MODEL_CornellMan333_9iwucv          7.070621
GROUP_Straight_Derpin_5kst5l__MODEL_Straight_Derpin_5kst5l      7.062313
GROUP_ClawofBeta_52u1nu__MODEL_ClawofBeta_52u1nu                7.049150
GROUP_syntheticity_97zg9z__MODEL_syntheticity_97zg9z            7.044579
GROUP_Udontlikecake_7rj6a0__MODEL_Udontlikecake_7rj6a0          7.043805
GROUP_Enyo287_3s4yj4__MODEL_Enyo287_3s4yj4                      7.039872
dtype: float64

In [27]:
least_surprising = suprise_series.sort_values(ascending=True).head(10)
least_surprising

GROUP_Unga_Bunga_30ac0l__MODEL_Unga_Bunga_30ac0l              5.869289
GROUP_crash_over-ride_6bjxnm__MODEL_crash_over-ride_6bjxnm    5.938821
GROUP_Bisphosphate_7r8nu1__MODEL_Bisphosphate_7r8nu1          5.941510
GROUP_omgdonerkebab_v4a3p__MODEL_omgdonerkebab_v4a3p          5.942420
GROUP_crash_over-ride_t6w01__MODEL_crash_over-ride_t6w01      5.960937
GROUP_crash_over-ride_9b132c__MODEL_crash_over-ride_9b132c    5.975243
GROUP_crash_over-ride_7owfvv__MODEL_crash_over-ride_7owfvv    5.979871
GROUP_crash_over-ride_2vhtzx__MODEL_crash_over-ride_2vhtzx    5.981664
GROUP_crash_over-ride_8f7b0y__MODEL_crash_over-ride_8f7b0y    5.989878
GROUP_crash_over-ride_llc0q__MODEL_crash_over-ride_llc0q      5.993079
dtype: float64

## SpeakerConvoDiversity reimplemented using Surprise

In [28]:
from convokit import SpeakerConvoDiversityWrapper
from convokit.speakerConvoDiversity.speakerConvoDiversity2 import SpeakerConvoDiversityWrapper as SpeakerConvoDiversityWrapper2

In [29]:
corpus = Corpus(filename=download('subreddit-Cornell'))
corpus.print_summary_stats()

Dataset already exists at /home/axl4/.convokit/downloads/subreddit-Cornell
Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


In [30]:
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot','AutoModerator']
def utterance_is_valid(utterance):
    return (utterance.id != utterance.conversation_id) and (utterance.speaker.id not in SPEAKER_BLACKLIST)

corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)



In [31]:
speaker_activities = corpus.get_attribute_table('speaker',['n_convos'])
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(25).index

In [32]:
subset_utts = []
for speaker in top_speakers:
    subset_utts += list(corpus.get_speaker(speaker).iter_utterances(selector=utterance_is_valid))
subset_corpus = Corpus(utterances=subset_utts)
subset_corpus.print_summary_stats()

Number of Speakers: 25
Number of Utterances: 10909
Number of Conversations: 5042


In [33]:
tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

1000/10909 utterances processed
2000/10909 utterances processed
3000/10909 utterances processed
4000/10909 utterances processed
5000/10909 utterances processed
6000/10909 utterances processed
7000/10909 utterances processed
8000/10909 utterances processed
9000/10909 utterances processed
10000/10909 utterances processed
10909/10909 utterances processed


In [34]:
scd = SpeakerConvoDiversityWrapper(lifestage_size=2, max_exp=20,
                sample_size=20, min_n_utterances=1, n_iters=50, cohort_delta=60*60*24*30*2, verbosity=100)

In [35]:
subset_corpus = scd.transform(subset_corpus)

getting lifestages
getting within diversity
joining tokens across conversation utterances
100 / 396
200 / 396
300 / 396
getting across diversity
joining tokens across conversation utterances
100 / 396
200 / 396
300 / 396
getting relative diversity
100 / 380
200 / 380
300 / 380


In [36]:
subset_corpus.get_speaker_convo_attribute_table(attrs=['div__self', 'div__other', 'div__adj']).dropna(subset=['div__self', 'div__other', 'div__adj'], how='all').head(10)

id,speaker,convo_id,convo_idx,div__self,div__other,div__adj
laveritecestla__1t542i,laveritecestla,1t542i,0,2.88289,2.945694,0.062805
laveritecestla__22i7ke,laveritecestla,22i7ke,1,2.86387,2.965638,0.101768
laveritecestla__2kk0n2,laveritecestla,2kk0n2,3,,2.971524,
laveritecestla__31hwi8,laveritecestla,31hwi8,8,2.927225,,
laveritecestla__34ycz6,laveritecestla,34ycz6,9,2.958407,2.969732,0.011326
laveritecestla__36bxln,laveritecestla,36bxln,10,2.949173,,
laveritecestla__36tnnq,laveritecestla,36tnnq,11,2.925592,2.965691,0.040099
laveritecestla__3856f2,laveritecestla,3856f2,12,2.969732,2.942005,-0.027727
laveritecestla__37vkfp,laveritecestla,37vkfp,13,2.957332,2.951916,-0.005416
laveritecestla__39m37w,laveritecestla,39m37w,15,,2.951187,
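
In the rows above where all three columns are present, `div__adj` matches `div__other - div__self` up to rounding. The check below uses values copied from the table; it records an observation about the printed output and is not a statement about how the wrapper computes the column internally. `adjusted_diversity` is a hypothetical helper name.

```python
def adjusted_diversity(div_self, div_other):
    """Difference between across-speaker and within-speaker diversity."""
    return div_other - div_self


# (id, div__self, div__other, div__adj) copied from the table above.
rows = [
    ("laveritecestla__1t542i", 2.88289, 2.945694, 0.062805),
    ("laveritecestla__22i7ke", 2.86387, 2.965638, 0.101768),
    ("laveritecestla__34ycz6", 2.958407, 2.969732, 0.011326),
    ("laveritecestla__3856f2", 2.969732, 2.942005, -0.027727),
]

# The printed values are rounded to six decimals, so allow a small tolerance.
max_error = max(
    abs(adjusted_diversity(div_self, div_other) - div_adj)
    for _, div_self, div_other, div_adj in rows
)
```

Rows where either `div__self` or `div__other` is missing have no `div__adj`, which fits the same reading.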


In [37]:
subset_corpus = Corpus(utterances=subset_utts)
subset_corpus.print_summary_stats()

Number of Speakers: 25
Number of Utterances: 10909
Number of Conversations: 5042


In [38]:
scd = SpeakerConvoDiversityWrapper2(lifestage_size=2, max_exp=20,
                sample_size=20, min_n_utterances=1, n_iters=50, cohort_delta=60*60*24*30*2, verbosity=100)

In [39]:
subset_corpus = scd.transform(subset_corpus)

fit1: 41it [00:00, 403.87it/s]

getting lifestages
getting within diversity
joining tokens across conversation utterances


fit1: 10909it [00:24, 450.05it/s]
fit2: 100%|██████████| 8143/8143 [00:00<00:00, 938637.91it/s]
transform: 25it [00:24,  1.02it/s]
set output: 25it [04:47, 11.50s/it]
fit1: 20it [00:00, 195.12it/s]

getting across diversity
joining tokens across conversation utterances


fit1: 10909it [00:46, 236.85it/s]
fit2: 100%|██████████| 8143/8143 [00:00<00:00, 359559.71it/s]
transform: 25it [00:17,  1.44it/s]
set output: 25it [03:30,  8.43s/it]


getting relative diversity
100 / 5104
200 / 5104
300 / 5104
400 / 5104
500 / 5104
600 / 5104
700 / 5104
800 / 5104
900 / 5104
1000 / 5104
1100 / 5104
1200 / 5104
1300 / 5104
1400 / 5104
1500 / 5104
1600 / 5104
1700 / 5104
1800 / 5104
1900 / 5104
2000 / 5104
2100 / 5104
2200 / 5104
2300 / 5104
2400 / 5104
2500 / 5104
2600 / 5104
2700 / 5104
2800 / 5104
2900 / 5104
3000 / 5104
3100 / 5104
3200 / 5104
3300 / 5104
3400 / 5104
3500 / 5104
3600 / 5104
3700 / 5104
3800 / 5104
3900 / 5104
4000 / 5104
4100 / 5104
4200 / 5104
4300 / 5104
4400 / 5104
4500 / 5104
4600 / 5104
4700 / 5104
4800 / 5104
4900 / 5104
5000 / 5104
5100 / 5104


In [40]:
subset_corpus.get_speaker_convo_attribute_table(attrs=['div__self', 'div__other', 'div__adj']).dropna(subset=['div__self', 'div__other', 'div__adj'], how='all').head(10)

id,speaker,convo_id,convo_idx,div__self,div__other,div__adj
laveritecestla__1t542i,laveritecestla,1t542i,0,2.929177,2.932982,0.003806
laveritecestla__22i7ke,laveritecestla,22i7ke,1,2.912328,2.959453,0.047125
laveritecestla__2kk0n2,laveritecestla,2kk0n2,3,,2.974585,
laveritecestla__31hwi8,laveritecestla,31hwi8,8,2.952755,,
laveritecestla__34ycz6,laveritecestla,34ycz6,9,2.939873,2.969563,0.02969
laveritecestla__36bxln,laveritecestla,36bxln,10,2.937259,,
laveritecestla__36tnnq,laveritecestla,36tnnq,11,2.930667,2.962513,0.031846
laveritecestla__3856f2,laveritecestla,3856f2,12,2.968516,2.94644,-0.022076
laveritecestla__37vkfp,laveritecestla,37vkfp,13,2.952691,2.958236,0.005544
laveritecestla__39m37w,laveritecestla,39m37w,15,,2.957072,
