Computing Surprise With ConvoKit
=====================
This notebook demonstrates how to use the Surprise transformer to compute surprise across a corpus. We use it to compute Speaker Convo Diversity, a measure of how surprising a speaker's participation in one conversation is compared to their participation in all other conversations, and then compare the results to those of the existing SpeakerConvoDiversity transformer. We eventually want to use the Surprise transformer within SpeakerConvoDiversity to reduce redundancy; for now, this demo serves as a sanity check on the correctness of the Surprise transformer.

In [1]:
import convokit
import itertools
import numpy as np
import spacy
from convokit import Corpus, download, Surprise
from convokit.text_processing import TextProcessor, TextParser
from sklearn.feature_extraction.text import CountVectorizer

Step 1: Load a corpus
--------
For now, we will use data from the subreddit r/Cornell to demonstrate the functionality of this transformer.

In [2]:
corpus = Corpus(filename=download('subreddit-Cornell'))

Dataset already exists at /home/axl4/.convokit/downloads/subreddit-Cornell


In [3]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


In order to speed up the demo, we will take just the top 100 most active speakers (based on the number of conversations they participate in).

In [4]:
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot', 'AutoModerator']
def utterance_is_valid(utterance):
    return utterance.speaker.id not in SPEAKER_BLACKLIST and utterance.text

In [5]:
corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)



In [6]:
speaker_activities = corpus.get_attribute_table('speaker', ['n_convos'])

In [7]:
speaker_activities.sort_values('n_convos', ascending=False).head(10)

id,n_convos
laveritecestla,781.0
EQUASHNZRKUL,726.0
CornHellUniversity,696.0
t3hasiangod,647.0
ilovemymemesboo,430.0
omgdonerkebab,425.0
cartesiancategory,341.0
cornell256,330.0
mushiettake,321.0
Fencerman2,298.0


In [8]:
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(100).index

In [9]:
import itertools

subset_utts = [list(corpus.get_speaker(speaker).iter_utterances(selector=utterance_is_valid)) for speaker in top_speakers]
subset_corpus = Corpus(utterances=list(itertools.chain(*subset_utts)))

In [10]:
subset_corpus.print_summary_stats()

Number of Speakers: 100
Number of Utterances: 20550
Number of Conversations: 6866


Step 2: Create instance of Surprise transformer
---------------
`target_sample_size` and `context_sample_size` specify the minimum number of tokens that should be in the target and context, respectively. If the target or context is too short, the transformer sets the surprise to `nan`. If we set these to just 1, the most surprising statements tend to be the very short ones. The transformer takes `n_samples` samples from the target and from the context (of sizes `target_sample_size` and `context_sample_size`, respectively), calculates the cross entropy for each pair of samples, and averages the results to get the final surprise score. This minimizes the effect of length on the scores.
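
The sampling procedure described above can be sketched in plain Python. This is a simplified illustration, not the transformer's actual implementation: the function names `cross_entropy` and `sampled_surprise`, the `seed` argument, and the choice to give unseen tokens a count of 1 are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def cross_entropy(target, context):
    """Mean negative log-probability of the target tokens under the
    context's unigram distribution."""
    counts = Counter(context)
    total = len(context)
    # Assumption of this sketch: unseen tokens are given a count of 1
    # so that the logarithm is always defined.
    return -sum(math.log(max(counts[tok], 1) / total) for tok in target) / len(target)


def sampled_surprise(target, context, target_sample_size,
                     context_sample_size, n_samples, seed=0):
    """Average cross entropy over n_samples fixed-size random samples."""
    # Too little text on either side: report nan instead of a score.
    if len(target) < target_sample_size or len(context) < context_sample_size:
        return float("nan")
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        target_sample = rng.sample(target, target_sample_size)
        context_sample = rng.sample(context, context_sample_size)
        scores.append(cross_entropy(target_sample, context_sample))
    return sum(scores) / n_samples


context = "the cat sat on the mat and the dog sat too".split()
familiar = "the cat sat on the mat".split()
unfamiliar = "quantum flux capacitor array".split()

print(sampled_surprise(familiar, context, 4, 8, n_samples=20))
print(sampled_surprise(unfamiliar, context, 4, 8, n_samples=20))
print(sampled_surprise(["hi"], context, 4, 8, n_samples=20))  # too short, so nan
```

Because every sample has the same size, a long target and a short target are scored on the same footing, which is the point of sampling rather than scoring the full text.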

`model_key_selector` defines how utterances in a corpus should be mapped to a model. It takes in an utterance and returns the key for the corresponding model. For this demo we want to map utterances to models based on their speaker and conversation ids.
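
As a rough illustration of what such a key function does, the sketch below groups utterances by a speaker-and-conversation key. The `Utt` tuple is a hypothetical stand-in for a ConvoKit utterance, and the sample data is invented for this example.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a ConvoKit utterance, for illustration only.
Utt = namedtuple("Utt", ["speaker_id", "conversation_id", "tokens"])


def model_key(utt):
    """Key combining speaker and conversation ids, as in this demo."""
    return "_".join([utt.speaker_id, utt.conversation_id])


utts = [
    Utt("alice", "c1", ["hello", "there"]),
    Utt("alice", "c1", ["how", "are", "you"]),
    Utt("alice", "c2", ["prelim", "tomorrow"]),
    Utt("bob", "c1", ["hi"]),
]

# All utterances that share a key contribute to the same group.
targets = defaultdict(list)
for utt in utts:
    targets[model_key(utt)].extend(utt.tokens)

for key, tokens in sorted(targets.items()):
    print(key, tokens)
```

Each distinct key then corresponds to one speaker-conversation pair, which is the unit that receives a surprise score in this demo.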

The transformer also has an optional `cv` parameter for customizing the `scikit-learn` `CountVectorizer` it uses to vectorize text. Since we are comparing Surprise to SpeakerConvoDiversity, we want both transformers to tokenize text the same way, so we will pass in a custom tokenizer function.

The `smooth` parameter determines whether the transformer uses +1 Laplace smoothing (`smooth=True`) or naively replaces counts of 0 with 1 (`smooth=False`), as SpeakerConvoDiversity does. Here we'll set `smooth=False` since we're comparing the results of Surprise with those of SpeakerConvoDiversity.
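
To make the two options concrete, here is a small sketch of how they change the probability assigned to a token that never occurs in the context. The helper `context_probs` and the choice of vocabulary (all tokens seen in either the target or the context) are assumptions of this example, not a description of the transformer's internals.

```python
from collections import Counter


def context_probs(target, context, smooth):
    """Unigram probabilities over the joint vocabulary of target and context."""
    vocab = sorted(set(target) | set(context))
    counts = Counter(context)
    if smooth:
        # +1 Laplace smoothing: every count is incremented by one.
        adjusted = {tok: counts[tok] + 1 for tok in vocab}
    else:
        # Naive option: only zero counts are replaced with 1.
        adjusted = {tok: max(counts[tok], 1) for tok in vocab}
    total = sum(adjusted.values())
    return {tok: adjusted[tok] / total for tok in vocab}


context = ["a", "a", "a", "b"]
target = ["a", "c"]  # "c" never occurs in the context

smoothed = context_probs(target, context, smooth=True)
unsmoothed = context_probs(target, context, smooth=False)
print("P(c) with smoothing:   ", smoothed["c"])
print("P(c) without smoothing:", unsmoothed["c"])
```

In this toy case the unseen token gets a lower probability with smoothing than without, so it contributes more to the cross entropy.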

In [11]:
import spacy

spacy_nlp = spacy.load('en_core_web_sm', disable=['ner','parser', 'tagger', 'lemmatizer'])
for utt in subset_corpus.iter_utterances():
    utt.meta['joined_tokens'] = [t.text.lower() for t in spacy_nlp(utt.text)]

In [12]:
surp = Surprise(tokenizer=lambda x: x, model_key_selector=lambda utt: '_'.join([utt.speaker.id, utt.conversation_id]), target_sample_size=100, context_sample_size=1000, n_samples=50, smooth=False)

Step 3: Fit transformer to corpus
--------
The `text_func` passed to `fit` defines the context for each utterance: here, all tokens from the same speaker's utterances in other conversations.

In [13]:
surp = surp.fit(subset_corpus, text_func=lambda utt: [list(itertools.chain(*[u.meta['joined_tokens'] for u in utt.speaker.iter_utterances() if u.conversation_id != utt.conversation_id]))])

fit1: 20550it [00:16, 1273.50it/s]
fit2: 100%|██████████| 15394/15394 [00:00<00:00, 918229.10it/s]
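
The `text_func` passed to `fit` above builds each utterance's context from the same speaker's tokens in every other conversation. A plain-Python sketch of that leave-one-conversation-out idea follows. The `records` list and the helper `context_for` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical (speaker, conversation_id, tokens) records.
records = [
    ("alice", "c1", ["office", "hours"]),
    ("alice", "c2", ["dining", "hall"]),
    ("alice", "c3", ["prelim", "grades"]),
    ("bob", "c1", ["thanks"]),
]

by_speaker = defaultdict(list)
for speaker, convo_id, tokens in records:
    by_speaker[speaker].append((convo_id, tokens))


def context_for(speaker, convo_id):
    """Tokens the speaker produced in every conversation except convo_id."""
    return [tok
            for other_id, tokens in by_speaker[speaker]
            if other_id != convo_id
            for tok in tokens]


print(context_for("alice", "c2"))
print(context_for("bob", "c1"))
```

A speaker who appears in only one conversation ends up with an empty context, which would fall below `context_sample_size` and so yield `nan`.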


Step 4: Transform corpus
--------
We'll call `transform` with object type `'speaker'` so that surprise scores will be added as a metadata field for each speaker.

In [14]:
transformed_corpus = surp.transform(subset_corpus, 'speaker')

transform: 100it [13:11,  7.92s/it]
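
The analysis code in the next section suggests that each speaker's `surprise` metadata holds a dict mapping model keys to scores. Under that assumption, the sketch below shows one way to merge such dicts, drop `nan` entries, and rank the rest. The `speaker_scores` data is invented for this example.

```python
import math

# Hypothetical per-speaker metadata: model key -> surprise score.
speaker_scores = {
    "alice": {"alice_c1": 6.1, "alice_c2": float("nan")},
    "bob": {"bob_c1": 5.7},
}

# Merge into a fresh dict so no speaker's own dict is modified.
merged = {}
for scores in speaker_scores.values():
    merged.update(scores)

# Drop entries whose target or context was too short to score.
valid = {key: score for key, score in merged.items() if not math.isnan(score)}

# Highest surprise first.
ranked = sorted(valid.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Merging into a new dict avoids mutating the first speaker's metadata in place, which a `reduce` with `dict.update` would do.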


Analysis
------
Let's take a look at some of the most surprising speaker conversation involvements.

In [15]:
import pandas as pd
from functools import reduce
def combine_dicts(x,y):
    x.update(y)
    return x
surprise_scores = reduce(combine_dicts, transformed_corpus.get_speakers_dataframe()['meta.surprise'].values)
surprise_series = pd.Series(surprise_scores).dropna()

In [16]:
most_surprising = surprise_series.sort_values(ascending=False).head(10)
most_surprising

EQUASHNZRKUL_815y6t          6.889343
SwissWatchesOnly_8g5q88      6.882153
SwissWatchesOnly_67cljd      6.818217
CornellMan333_9iwucv         6.805448
Straight_Derpin_5kst5l       6.791131
Udontlikecake_7rj6a0         6.770960
EQUASHNZRKUL_73xuw6          6.759494
laveritecestla_6v4ysm        6.758605
ClawofBeta_52u1nu            6.746009
Pretty_Good_At_IRL_6zoww2    6.744907
dtype: float64

Now, let's look at some of the least surprising entries.

In [17]:
least_surprising = surprise_series.sort_values(ascending=True).head(10)
least_surprising

crash_over-ride_6bjxnm    5.585589
Unga_Bunga_30ac0l         5.616060
omgdonerkebab_v4a3p       5.622264
Bisphosphate_7r8nu1       5.623319
crash_over-ride_t6w01     5.628418
crash_over-ride_8f7b0y    5.635361
crash_over-ride_30zba1    5.637371
crash_over-ride_7owfvv    5.673594
Bisphosphate_8mbpdu       5.676536
crash_over-ride_9ghfjc    5.678164
dtype: float64

## Comparison to SpeakerConvoDiversity

In [18]:
from convokit.text_processing import TextProcessor, TextParser

tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

1000/20550 utterances processed
2000/20550 utterances processed
3000/20550 utterances processed
4000/20550 utterances processed
5000/20550 utterances processed
6000/20550 utterances processed
7000/20550 utterances processed
8000/20550 utterances processed
9000/20550 utterances processed
10000/20550 utterances processed
11000/20550 utterances processed
12000/20550 utterances processed
13000/20550 utterances processed
14000/20550 utterances processed
15000/20550 utterances processed
16000/20550 utterances processed
17000/20550 utterances processed
18000/20550 utterances processed
19000/20550 utterances processed
20000/20550 utterances processed
20550/20550 utterances processed


In [19]:
from convokit import SpeakerConvoDiversity

scd = SpeakerConvoDiversity('div', select_fn=lambda df, row, aux: (df.convo_id != row.convo_id) & (df.speaker == row.speaker), speaker_cols=['n_convos'], aux_input={'n_iters': 50, 'cmp_sample_size': 100, 'ref_sample_size': 1000}, verbosity=1000)

In [20]:
div_transformed = scd.transform(subset_corpus)

joining tokens across conversation utterances
1000 / 15394
2000 / 15394
3000 / 15394
4000 / 15394
5000 / 15394
6000 / 15394
7000 / 15394
8000 / 15394
9000 / 15394
10000 / 15394
11000 / 15394
12000 / 15394
13000 / 15394
14000 / 15394
15000 / 15394


Here are the speaker-convo entries that have the highest diversity scores.

In [21]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div', ascending=False).head(10)

id,speaker,convo_id,convo_idx,div
Straight_Derpin__5kst5l,Straight_Derpin,5kst5l,34,6.816388
Dr_Narwhal__6h08sg,Dr_Narwhal,6h08sg,75,6.539103
sasha07974__8v40c1,sasha07974,8v40c1,42,6.260069
SwissWatchesOnly__9hcpip,SwissWatchesOnly,9hcpip,129,6.134661
blackashi__2xxkm4,blackashi,2xxkm4,6,6.120601
mushiettake__89mbvs,mushiettake,89mbvs,269,6.109522
agottler__9iyo8u,agottler,9iyo8u,66,6.098672
EQUASHNZRKUL__55e40e,EQUASHNZRKUL,55e40e,28,6.098302
t3hasiangod__5v6sqb,t3hasiangod,5v6sqb,590,6.093472
ScottVandeberg__8tlcdl,ScottVandeberg,8tlcdl,81,6.071957


Notice that the diversity scores returned by `SpeakerConvoDiversity` differ slightly from the scores returned by the `Surprise` transformer. One source of such differences is the handling of out-of-vocabulary (OOV) tokens: with `smooth=True`, the `Surprise` transformer applies Laplace smoothing, whereas `SpeakerConvoDiversity` simply treats the count of an OOV token as 1. With `smooth=False`, as in the run above, the `Surprise` transformer treats OOV tokens the same way `SpeakerConvoDiversity` does, so the two are intended to compute the same quantity, though individual scores can still vary because both rely on random sampling.

Here are the least diverse speaker-convo entries based on the SpeakerConvoDiversity transformer.

In [22]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div').head(10)

id,speaker,convo_id,convo_idx,div
CornellMan333__9epekx,CornellMan333,9epekx,56,5.062892
CornellMan333__9j2exy,CornellMan333,9j2exy,67,5.153162
laveritecestla__4pylgl,laveritecestla,4pylgl,74,5.195755
Fencerman2__57qqhb,Fencerman2,57qqhb,21,5.238213
cornell256__96utv3,cornell256,96utv3,292,5.243144
cryptkeep__3zgnom,cryptkeep,3zgnom,40,5.243289
iBeReese__1uuldh,iBeReese,1uuldh,6,5.246245
ScottVandeberg__71whjx,ScottVandeberg,71whjx,34,5.246945
t3hasiangod__3wtoeo,t3hasiangod,3wtoeo,46,5.248247
dedicateddan__4krfrc,dedicateddan,4krfrc,97,5.250035


## Surprise With Smoothing

In [23]:
import spacy

spacy_nlp = spacy.load('en_core_web_sm', disable=['ner','parser', 'tagger', 'lemmatizer'])
for utt in subset_corpus.iter_utterances():
    utt.meta['joined_tokens'] = [t.text.lower() for t in spacy_nlp(utt.text)]

In [24]:
surp = Surprise(tokenizer=lambda x: x, model_key_selector=lambda utt: '_'.join([utt.speaker.id, utt.conversation_id]), target_sample_size=100, context_sample_size=1000, n_samples=50, smooth=True, surprise_attr_name='surprise_smoothed')
surp.fit(subset_corpus, text_func=lambda utt: [list(itertools.chain(*[u.meta['joined_tokens'] for u in utt.speaker.iter_utterances() if u.conversation_id != utt.conversation_id]))])
transformed_corpus = surp.transform(subset_corpus, 'speaker')

fit1: 20550it [00:11, 1837.53it/s]
fit2: 100%|██████████| 15394/15394 [00:00<00:00, 874454.76it/s]
transform: 100it [12:00,  7.21s/it]


In [25]:
import pandas as pd
from functools import reduce
def combine_dicts(x,y):
    x.update(y)
    return x
surprise_scores = reduce(combine_dicts, transformed_corpus.get_speakers_dataframe()['meta.surprise_smoothed'].values)
surprise_series = pd.Series(surprise_scores).dropna()

In [26]:
most_surprising = surprise_series.sort_values(ascending=False).head(10)
most_surprising

EQUASHNZRKUL_815y6t        7.234207
SwissWatchesOnly_8g5q88    7.229217
SwissWatchesOnly_67cljd    7.122332
Udontlikecake_7rj6a0       7.097393
EQUASHNZRKUL_73xuw6        7.095252
CornellMan333_9iwucv       7.074767
ClawofBeta_52u1nu          7.074176
Straight_Derpin_5kst5l     7.060966
laveritecestla_6v4ysm      7.055223
DEEP_THORAX_8drwet         7.039485
dtype: float64

In [27]:
least_surprising = surprise_series.sort_values(ascending=True).head(10)
least_surprising

Unga_Bunga_30ac0l         5.864586
crash_over-ride_6bjxnm    5.898152
crash_over-ride_30zba1    5.932144
crash_over-ride_8f7b0y    5.940947
Bisphosphate_7r8nu1       5.948326
crash_over-ride_v4j70     5.956539
omgdonerkebab_v4a3p       5.985297
Bisphosphate_8mbpdu       5.987885
dontich_8gzrs4            5.989558
crash_over-ride_9b132c    5.993430
dtype: float64

## SpeakerConvoDiversity reimplemented using Surprise

In [28]:
from convokit import SpeakerConvoDiversityWrapper
from convokit.speakerConvoDiversity.speakerConvoDiversity2 import SpeakerConvoDiversityWrapper as SpeakerConvoDiversityWrapper2

In [29]:
corpus = Corpus(filename=download('subreddit-Cornell'))
corpus.print_summary_stats()

Dataset already exists at /home/axl4/.convokit/downloads/subreddit-Cornell
Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


In [30]:
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot','AutoModerator']
def utterance_is_valid(utterance):
    return (utterance.id != utterance.conversation_id) and (utterance.speaker.id not in SPEAKER_BLACKLIST)

corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)



In [31]:
speaker_activities = corpus.get_attribute_table('speaker',['n_convos'])
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(25).index

In [32]:
subset_utts = []
for speaker in top_speakers:
    subset_utts += list(corpus.get_speaker(speaker).iter_utterances(selector=utterance_is_valid))
subset_corpus = Corpus(utterances=subset_utts)
subset_corpus.print_summary_stats()

Number of Speakers: 25
Number of Utterances: 10909
Number of Conversations: 5042


In [33]:
tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

1000/10909 utterances processed
2000/10909 utterances processed
3000/10909 utterances processed
4000/10909 utterances processed
5000/10909 utterances processed
6000/10909 utterances processed
7000/10909 utterances processed
8000/10909 utterances processed
9000/10909 utterances processed
10000/10909 utterances processed
10909/10909 utterances processed


In [34]:
scd = SpeakerConvoDiversityWrapper(lifestage_size=2, max_exp=20,
                sample_size=20, min_n_utterances=1, n_iters=50, cohort_delta=60*60*24*30*2, verbosity=100)

In [35]:
subset_corpus = scd.transform(subset_corpus)

getting lifestages
getting within diversity
joining tokens across conversation utterances
100 / 396
200 / 396
300 / 396
getting across diversity
joining tokens across conversation utterances
100 / 396
200 / 396
300 / 396
getting relative diversity
100 / 380
200 / 380
300 / 380


In [36]:
subset_corpus.get_speaker_convo_attribute_table(attrs=['div__self', 'div__other', 'div__adj']).dropna(subset=['div__self', 'div__other', 'div__adj'], how='all').head(10)

id,speaker,convo_id,convo_idx,div__self,div__other,div__adj
laveritecestla__1t542i,laveritecestla,1t542i,0,2.899498,2.945001,0.045503
laveritecestla__22i7ke,laveritecestla,22i7ke,1,2.902664,2.962396,0.059732
laveritecestla__2kk0n2,laveritecestla,2kk0n2,3,,2.9721,
laveritecestla__31hwi8,laveritecestla,31hwi8,8,2.931096,,
laveritecestla__34ycz6,laveritecestla,34ycz6,9,2.952168,2.971982,0.019814
laveritecestla__36bxln,laveritecestla,36bxln,10,2.953607,,
laveritecestla__36tnnq,laveritecestla,36tnnq,11,2.958524,2.966267,0.007742
laveritecestla__3856f2,laveritecestla,3856f2,12,2.964933,2.960538,-0.004394
laveritecestla__37vkfp,laveritecestla,37vkfp,13,2.948179,2.969014,0.020835
laveritecestla__39m37w,laveritecestla,39m37w,15,,2.968986,


In [37]:
subset_corpus = Corpus(utterances=subset_utts)
subset_corpus.print_summary_stats()

Number of Speakers: 25
Number of Utterances: 10909
Number of Conversations: 5042


In [38]:
scd = SpeakerConvoDiversityWrapper2(lifestage_size=2, max_exp=20,
                sample_size=20, min_n_utterances=1, n_iters=50, cohort_delta=60*60*24*30*2, verbosity=100)

In [39]:
subset_corpus = scd.transform(subset_corpus)

fit1: 39it [00:00, 389.70it/s]

getting lifestages
getting within diversity
joining tokens across conversation utterances


fit1: 10909it [00:25, 434.79it/s]
fit2: 100%|██████████| 8143/8143 [00:00<00:00, 866088.94it/s]
transform: 25it [00:29,  1.16s/it]
set output: 25it [00:00, 1071.76it/s]
fit1: 20it [00:00, 188.74it/s]

getting across diversity
joining tokens across conversation utterances


fit1: 10909it [00:47, 229.37it/s]
fit2: 100%|██████████| 8143/8143 [00:00<00:00, 338173.96it/s]
transform: 25it [00:20,  1.22it/s]
set output: 25it [00:00, 1566.16it/s]


getting relative diversity
100 / 5104
200 / 5104
300 / 5104
400 / 5104
500 / 5104
600 / 5104
700 / 5104
800 / 5104
900 / 5104
1000 / 5104
1100 / 5104
1200 / 5104
1300 / 5104
1400 / 5104
1500 / 5104
1600 / 5104
1700 / 5104
1800 / 5104
1900 / 5104
2000 / 5104
2100 / 5104
2200 / 5104
2300 / 5104
2400 / 5104
2500 / 5104
2600 / 5104
2700 / 5104
2800 / 5104
2900 / 5104
3000 / 5104
3100 / 5104
3200 / 5104
3300 / 5104
3400 / 5104
3500 / 5104
3600 / 5104
3700 / 5104
3800 / 5104
3900 / 5104
4000 / 5104
4100 / 5104
4200 / 5104
4300 / 5104
4400 / 5104
4500 / 5104
4600 / 5104
4700 / 5104
4800 / 5104
4900 / 5104
5000 / 5104
5100 / 5104


In [40]:
subset_corpus.get_speaker_convo_attribute_table(attrs=['div__self', 'div__other', 'div__adj']).dropna(subset=['div__self', 'div__other', 'div__adj'], how='all').head(10)

id,speaker,convo_id,convo_idx,div__self,div__other,div__adj
laveritecestla__1t542i,laveritecestla,1t542i,0,2.901239,2.921369,0.02013
laveritecestla__22i7ke,laveritecestla,22i7ke,1,2.896725,2.961061,0.064336
laveritecestla__2kk0n2,laveritecestla,2kk0n2,3,,2.974532,
laveritecestla__31hwi8,laveritecestla,31hwi8,8,2.932066,,
laveritecestla__34ycz6,laveritecestla,34ycz6,9,2.963612,2.96182,-0.001792
laveritecestla__36bxln,laveritecestla,36bxln,10,2.953149,,
laveritecestla__36tnnq,laveritecestla,36tnnq,11,2.936448,2.957372,0.020925
laveritecestla__3856f2,laveritecestla,3856f2,12,2.967771,2.949905,-0.017865
laveritecestla__37vkfp,laveritecestla,37vkfp,13,2.955516,2.973774,0.018257
laveritecestla__39m37w,laveritecestla,39m37w,15,,2.964423,
