Computing Surprise With ConvoKit
=====================
This notebook provides a demo of how to use the Surprise transformer to compute surprise across a corpus. In this demo, we will use the Surprise transformer to compute Speaker Convo Diversity, a measure of how surprising a speaker's participation in one conversation is compared to their participation in all other conversations. We will then compare the results to those obtained using the actual SpeakerConvoDiversity transformer. We eventually want to use the Surprise transformer within the SpeakerConvoDiversity transformer to reduce redundancy, but for now, this demo serves as a sanity check on the correctness of the Surprise transformer.

In [2]:
import convokit
import numpy as np
from convokit import Corpus, download, Surprise

Step 1: Load a corpus
--------
For now, we will use data from the subreddit r/Cornell to demonstrate the functionality of this transformer

In [3]:
corpus = Corpus(filename=download('subreddit-Cornell'))

Dataset already exists at C:\Users\rgang\.convokit\downloads\subreddit-Cornell


In [4]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


In order to speed up the demo, we will take just the top 100 most active speakers (based on the number of conversations they participate in).

In [5]:
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot', 'AutoModerator']
def utterance_is_valid(utterance):
    return utterance.speaker.id not in SPEAKER_BLACKLIST and utterance.text

In [6]:
corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)

In [7]:
speaker_activities = corpus.get_attribute_table('speaker', ['n_convos'])

In [8]:
speaker_activities.sort_values('n_convos', ascending=False).head(10)

Unnamed: 0_level_0,n_convos
id,Unnamed: 1_level_1
laveritecestla,781.0
EQUASHNZRKUL,726.0
CornHellUniversity,696.0
t3hasiangod,647.0
ilovemymemesboo,430.0
omgdonerkebab,425.0
cartesiancategory,341.0
cornell256,330.0
mushiettake,321.0
Fencerman2,298.0


In [9]:
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(100).index

In [10]:
import itertools

subset_utts = [list(corpus.get_speaker(speaker).iter_utterances()) for speaker in top_speakers]
subset_corpus = Corpus(utterances=list(itertools.chain(*subset_utts)))

In [11]:
subset_corpus.print_summary_stats()

Number of Speakers: 100
Number of Utterances: 20700
Number of Conversations: 6904


Step 2: Create instance of Surprise transformer
---------------
`target_sample_size` and `context_sample_size` specify the minimum number of tokens that should be in the target and context respectively. If the target or context is too short, the transformer will set the surprise to be `nan`. If we sent these to simply be 1, the most surprising statements tend to just be the very short statements. The transformer takes `n_samples` samples from the target and context transformer (where samples are of size corresponding to `target_sample_size` and `context_sample_size`). It calculates cross entropy for each pair of samples and takes the average to get the final surprise score. This is done to minimize effect of length on scores.

`model_key_selector` defines how utterances in a corpus should be mapped to a model. It takes in an utterance and returns the key for the corresponding model. For this demo we want to map utterances to models based on their speaker and conversation ids.

The transformer also has an optional `cv` to customize the `scikit-learn` `CountVectorizer` used by the transformer to vectorize text. Since we are comparing Surprise to the SpeakerConvoDiversity transformer, we want to make sure that our transformer handles tokenization the same way as SpeakerConvoDiversity, so we will pass in a custom tokenizer function.

The `smooth` parameter determines whether the transformer uses +1 laplace smoothing (`smooth = True`) or naively replaces 0 counts with 1's (`smooth = False`) as SpeakerConvoDiversity does. Here we'll set `smooth = False` since we're comparing the results of Surprise with SpeakerConvoDiversity.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
import spacy
spacy_nlp = spacy.load('en', disable=['ner','parser', 'tagger'])
surp = Surprise(cv=CountVectorizer(tokenizer=lambda x: [t.text for t in spacy_nlp(x)]), model_key_selector=lambda utt: '_'.join([utt.speaker.id, utt.conversation_id]), target_sample_size=100, context_sample_size=100, n_samples=50, smooth=False)

Step 3: Fit transformer to corpus
-----
The `text_func` parameter defines what text each model should be trained on. For this demo, we want a model corresponding to a (speaker, conversation) pair to be trained on all the utterances from the same speaker in different conversations.

In [13]:
surp = surp.fit(subset_corpus, text_func=lambda utt: [u.text for u in utt.speaker.iter_utterances() if u.conversation_id != utt.conversation_id])

Step 4: Transform corpus
--------
We'll call `transform` with object type `'speaker'` so that surprise scores will be added as a metadata field for each speaker.

In [14]:
transformed_corpus = surp.transform(subset_corpus, 'speaker')

100it [55:38, 33.39s/it]


Analysis
------
Let's take a look at some of the most surprising speaker conversation involvements.

In [15]:
import pandas as pd
from functools import reduce
def combine_dicts(x,y):
    x.update(y)
    return x
surprise_scores = reduce(combine_dicts, transformed_corpus.get_speakers_dataframe()['meta.surprise'].values)
suprise_series = pd.Series(surprise_scores).dropna()

In [16]:
most_surprising = suprise_series.sort_values(ascending=False).head(10)
most_surprising

GROUP_Straight_Derpin_5kst5l__MODEL_Straight_Derpin_5kst5l      4.589240
GROUP_Dr_Narwhal_6h08sg__MODEL_Dr_Narwhal_6h08sg                4.514626
GROUP_rrrrrrr1131_8l3xht__MODEL_rrrrrrr1131_8l3xht              4.480983
GROUP_sasha07974_8v40c1__MODEL_sasha07974_8v40c1                4.477288
GROUP_ScottVandeberg_8tlcdl__MODEL_ScottVandeberg_8tlcdl        4.472549
GROUP_t3hasiangod_5v6sqb__MODEL_t3hasiangod_5v6sqb              4.471669
GROUP_t3hasiangod_4ufm6z__MODEL_t3hasiangod_4ufm6z              4.459614
GROUP_mushiettake_89mbvs__MODEL_mushiettake_89mbvs              4.458309
GROUP_SwissWatchesOnly_9hcpip__MODEL_SwissWatchesOnly_9hcpip    4.458051
GROUP_blackashi_2xxkm4__MODEL_blackashi_2xxkm4                  4.457852
dtype: float64

Now, let's look at some of the least surprising entries.

In [17]:
least_surprising = suprise_series.sort_values(ascending=True).head(10)
least_surprising

GROUP_shadowclan98_6njs8z__MODEL_shadowclan98_6njs8z        4.203864
GROUP_dedicateddan_4krfrc__MODEL_dedicateddan_4krfrc        4.210269
GROUP_Pjcrafty_5apodz__MODEL_Pjcrafty_5apodz                4.211766
GROUP_dedicateddan_1zcxhv__MODEL_dedicateddan_1zcxhv        4.216961
GROUP_chaosbutters_9abvcm__MODEL_chaosbutters_9abvcm        4.220616
GROUP_t3hasiangod_3wtoeo__MODEL_t3hasiangod_3wtoeo          4.224474
GROUP_shadowclan98_9mfj81__MODEL_shadowclan98_9mfj81        4.227370
GROUP_laveritecestla_54dxr7__MODEL_laveritecestla_54dxr7    4.228319
GROUP_CornellMan333_9epekx__MODEL_CornellMan333_9epekx      4.232255
GROUP_Fencerman2_6tiomd__MODEL_Fencerman2_6tiomd            4.233666
dtype: float64

## Comparison to SpeakerConvoDiversity

In [18]:
from convokit import SpeakerConvoDiversity

scd = SpeakerConvoDiversity('div', select_fn=lambda df, row, aux: (df.convo_id != row.convo_id) & (df.speaker == row.speaker), speaker_cols=['n_convos'], aux_input={'n_iters': 50, 'cmp_sample_size': 100, 'ref_sample_size': 100}, verbosity=1000)

In [19]:
from convokit.text_processing import TextParser

tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

1000/20700 utterances processed
2000/20700 utterances processed
3000/20700 utterances processed
4000/20700 utterances processed
5000/20700 utterances processed
6000/20700 utterances processed
7000/20700 utterances processed
8000/20700 utterances processed
9000/20700 utterances processed
10000/20700 utterances processed
11000/20700 utterances processed
12000/20700 utterances processed
13000/20700 utterances processed
14000/20700 utterances processed
15000/20700 utterances processed
16000/20700 utterances processed
17000/20700 utterances processed
18000/20700 utterances processed
19000/20700 utterances processed
20000/20700 utterances processed
20700/20700 utterances processed


In [20]:
div_transformed = scd.transform(subset_corpus)

joining tokens across conversation utterances
1000 / 15394
2000 / 15394
3000 / 15394
4000 / 15394
5000 / 15394
6000 / 15394
7000 / 15394
8000 / 15394
9000 / 15394
10000 / 15394
11000 / 15394
12000 / 15394
13000 / 15394
14000 / 15394
15000 / 15394


Here are the speaker convo entries that have the highest diversity score.

In [21]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div', ascending=False).head(10)

Unnamed: 0_level_0,speaker,convo_id,convo_idx,div
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Straight_Derpin__5kst5l,Straight_Derpin,5kst5l,34,4.590349
Dr_Narwhal__6h08sg,Dr_Narwhal,6h08sg,75,4.539678
SwissWatchesOnly__9hcpip,SwissWatchesOnly,9hcpip,129,4.500917
ScottVandeberg__8tlcdl,ScottVandeberg,8tlcdl,81,4.499456
sasha07974__8v40c1,sasha07974,8v40c1,42,4.497518
rrrrrrr1131__8l3xht,rrrrrrr1131,8l3xht,25,4.497179
blackashi__2xxkm4,blackashi,2xxkm4,6,4.490293
agottler__9iyo8u,agottler,9iyo8u,66,4.48928
t3hasiangod__5v6sqb,t3hasiangod,5v6sqb,590,4.484694
EQUASHNZRKUL__82l7qg,EQUASHNZRKUL,82l7qg,328,4.471964


Notice that the diversity scores returned by `SpeakerConvoDiversity` are slightly different from the scores returned by the `Surprise` transformer. This difference can be attributed to the addition of Laplace smoothing in the `Surprise` transformer to account for out of vocabulary tokens. The `SpeakerConvoDiversity` transformer deals with OOV tokens by simply treating their count as 1. If you run the `Surprise` transformer with the `smooth` flag set to false, the transformer will treat OOV tokens the same way `SpeakerConvoDiversity` does. When run without smoothing, the `Surprise` transformer returns the same scores as `SpeakerConvoDiversity`.

Here are the least diverse speaker-convo entries based on the SpeakerConvoDiversity transformer.

In [22]:
div_transformed.get_speaker_convo_attribute_table(attrs=['div']).sort_values('div').head(10)

Unnamed: 0_level_0,speaker,convo_id,convo_idx,div
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fencerman2__6tiomd,Fencerman2,6tiomd,111,4.204802
cornell256__96utv3,cornell256,96utv3,292,4.211056
dedicateddan__4krfrc,dedicateddan,4krfrc,97,4.218379
t3hasiangod__3wtoeo,t3hasiangod,3wtoeo,46,4.225599
t3hasiangod__4ar3u0,t3hasiangod,4ar3u0,99,4.230127
Enyo287__5ipedu,Enyo287,5ipedu,281,4.231756
t3hasiangod__4me3m0,t3hasiangod,4me3m0,168,4.233925
iBeReese__1uuldh,iBeReese,1uuldh,6,4.234036
Pjcrafty__5apodz,Pjcrafty,5apodz,17,4.234668
dedicateddan__1zcxhv,dedicateddan,1zcxhv,20,4.239415


## Surprise With Smoothing

In [23]:
surp = Surprise(cv=CountVectorizer(tokenizer=lambda x: [t.text for t in spacy_nlp(x)]), model_key_selector=lambda utt: '_'.join([utt.speaker.id, utt.conversation_id]), surprise_attr_name='surprise_smoothed', target_sample_size=100, context_sample_size=100, n_samples=50, smooth=True)
surp.fit(subset_corpus, text_func=lambda utt: [u.text for u in utt.speaker.iter_utterances() if u.conversation_id != utt.conversation_id])
transformed_corpus = surp.transform(subset_corpus, 'speaker')

100it [43:02, 25.83s/it]


In [24]:
import pandas as pd
from functools import reduce
def combine_dicts(x,y):
    x.update(y)
    return x
surprise_scores = reduce(combine_dicts, transformed_corpus.get_speakers_dataframe()['meta.surprise'].values)
suprise_series = pd.Series(surprise_scores).dropna()

In [25]:
most_surprising = suprise_series.sort_values(ascending=False).head(10)
most_surprising

GROUP_Straight_Derpin_5kst5l__MODEL_Straight_Derpin_5kst5l      4.589240
GROUP_Dr_Narwhal_6h08sg__MODEL_Dr_Narwhal_6h08sg                4.514626
GROUP_rrrrrrr1131_8l3xht__MODEL_rrrrrrr1131_8l3xht              4.480983
GROUP_sasha07974_8v40c1__MODEL_sasha07974_8v40c1                4.477288
GROUP_ScottVandeberg_8tlcdl__MODEL_ScottVandeberg_8tlcdl        4.472549
GROUP_t3hasiangod_5v6sqb__MODEL_t3hasiangod_5v6sqb              4.471669
GROUP_t3hasiangod_4ufm6z__MODEL_t3hasiangod_4ufm6z              4.459614
GROUP_mushiettake_89mbvs__MODEL_mushiettake_89mbvs              4.458309
GROUP_SwissWatchesOnly_9hcpip__MODEL_SwissWatchesOnly_9hcpip    4.458051
GROUP_blackashi_2xxkm4__MODEL_blackashi_2xxkm4                  4.457852
dtype: float64

In [26]:
least_surprising = suprise_series.sort_values(ascending=True).head(10)
least_surprising

GROUP_shadowclan98_6njs8z__MODEL_shadowclan98_6njs8z        4.203864
GROUP_dedicateddan_4krfrc__MODEL_dedicateddan_4krfrc        4.210269
GROUP_Pjcrafty_5apodz__MODEL_Pjcrafty_5apodz                4.211766
GROUP_dedicateddan_1zcxhv__MODEL_dedicateddan_1zcxhv        4.216961
GROUP_chaosbutters_9abvcm__MODEL_chaosbutters_9abvcm        4.220616
GROUP_t3hasiangod_3wtoeo__MODEL_t3hasiangod_3wtoeo          4.224474
GROUP_shadowclan98_9mfj81__MODEL_shadowclan98_9mfj81        4.227370
GROUP_laveritecestla_54dxr7__MODEL_laveritecestla_54dxr7    4.228319
GROUP_CornellMan333_9epekx__MODEL_CornellMan333_9epekx      4.232255
GROUP_Fencerman2_6tiomd__MODEL_Fencerman2_6tiomd            4.233666
dtype: float64