# Analyzing the Tennis Corpus with Surprise
This demo is based on the [Tie-breaker paper](https://www.cs.cornell.edu/~liye/tennis.html) on gender bias in sports journalism.

In [1]:
import convokit
import numpy as np
from convokit import Corpus, download, Surprise

In [2]:
corpus = Corpus(filename=download('tennis-corpus'))

Dataset already exists at C:\Users\rgang\.convokit\downloads\tennis-corpus


To help with the analysis, let's add a metadata attribute to each reporter question that records the gender of the player the question is posed to.

In [3]:
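# Question IDs end in '.q' and the paired answer IDs end in '.a' (e.g. 1681_0.q / 1681_0.a),
# so the player is the speaker of the answer that shares the question's pair index.
for utt in corpus.iter_utterances(selector=lambda u: u.meta['is_question']):
    answer = utt.get_conversation().get_utterance(utt.id.replace('q', 'a'))
    utt.add_meta('player_gender', answer.get_speaker().meta['gender'])
corpus.get_utterances_dataframe()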

id,timestamp,text,speaker,reply_to,conversation_id,meta.is_answer,meta.is_question,meta.pair_idx,meta.player_gender
1681_0.q,2008-08-28,I think this is your biggest success right now...,REPORTER,,1681_0.q,False,True,1681_0,M
1681_0.a,2008-08-28,Yeah.,Kei Nishikori,1681_0.q,1681_0.q,True,False,1681_0,
1681_1.q,2008-08-28,How would you describe it? Is it fantastic for...,REPORTER,,1681_1.q,False,True,1681_1,M
1681_1.a,2008-08-28,"Yeah, I'm pretty happy, but it was -- I wasn't...",Kei Nishikori,1681_1.q,1681_1.q,True,False,1681_1,
1681_2.q,2008-08-28,Do you know why he has retired?,REPORTER,,1681_2.q,False,True,1681_2,M
...,...,...,...,...,...,...,...,...,...
755_9.a,2007-01-16,"Yeah, no.",Kim Clijsters,755_9.q,755_9.q,True,False,755_9,
755_10.q,2007-01-16,It's working?,REPORTER,,755_10.q,False,True,755_10,F
755_10.a,2007-01-16,Yeah. Feel good.,Kim Clijsters,755_10.q,755_10.q,True,False,755_10,
755_11.q,2007-01-16,So when something's not going right with your ...,REPORTER,,755_11.q,False,True,755_11,F
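
The lookup in cell 3 uses `utt.id.replace('q', 'a')`, which works because the IDs shown above contain no other `q`. As a minimal sketch, assuming every question ID ends in `.q` as in the table, a stricter variant could rewrite only the suffix. The helper name `answer_id_for` is hypothetical and not part of ConvoKit.

```python
def answer_id_for(question_id):
    """Map a question ID such as '1681_0.q' to its paired answer ID '1681_0.a'."""
    # Assumption: question IDs always end in '.q', as in the dataframe above.
    if not question_id.endswith(".q"):
        raise ValueError(f"not a question ID: {question_id!r}")
    # Replace only the final suffix, leaving the rest of the ID untouched.
    return question_id[:-2] + ".a"


paired = answer_id_for("1681_0.q")
```

This fails loudly on an unexpected ID instead of silently building a wrong one.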


## Part 1: How surprising is each interview question compared to the other questions?

For this demo, we want to train one model for the entire corpus, so we'll make `model_key_selector` a function that returns the same key for every utterance in the corpus.

In [4]:
surp = Surprise(model_key_selector=lambda utt: 'corpus', target_sample_size=10, context_sample_size=5000)
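
To illustrate what a key selector does, here is a small sketch in plain Python. It is not ConvoKit's implementation, and the toy utterances and the helper `group_by_model_key` are hypothetical. Utterances that map to the same key end up in the same group, so a constant key yields a single group, while a key based on `player_gender` yields one group per gender.

```python
from collections import defaultdict


def group_by_model_key(utterances, model_key_selector):
    """Group utterance texts by the key that the selector returns for each one."""
    groups = defaultdict(list)
    for utt in utterances:
        groups[model_key_selector(utt)].append(utt["text"])
    return dict(groups)


# Hypothetical toy data: plain dicts standing in for ConvoKit utterances.
toy_utts = [
    {"text": "How do you feel?", "player_gender": "F"},
    {"text": "What went wrong today?", "player_gender": "M"},
    {"text": "Are you happy with the serve?", "player_gender": "F"},
]

# A constant key puts every utterance into one group (one model for the corpus).
single_model = group_by_model_key(toy_utts, lambda utt: "corpus")
# A metadata-based key creates one group per distinct value.
per_gender = group_by_model_key(toy_utts, lambda utt: utt["player_gender"])
```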

Since we just want to look at how surprising questions asked by reporters are, we'll fit the transformer just on utterances that are questions.

In [5]:
surp.fit(corpus, selector=lambda utt: utt.meta['is_question'])

<convokit.surprise.surprise.Surprise at 0x189bebfc488>

To speed up the demo, we'll select a random subset of interview questions to compute surprise scores for.

In [6]:
# Build the pool of question utterances once, then draw 500 of them at random.
question_df = corpus.get_utterances_dataframe(selector=lambda utt: utt.meta['is_question'])
subset_utts = [corpus.get_utterance(utt_id) for utt_id in question_df.sample(500).index]
subset_corpus = Corpus(utterances=subset_utts)
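
The sample above is unseeded, so each run scores a different set of questions. If reproducibility matters, pandas' `sample` accepts a `random_state` argument. The same idea in standard-library Python might look like the sketch below, where `sample_ids` is a hypothetical helper and the ID list is a handful of question IDs taken from the dataframe shown earlier.

```python
import random


def sample_ids(ids, k, seed):
    """Draw k distinct IDs; the same seed always gives the same draw."""
    rng = random.Random(seed)
    # Sorting first makes the result independent of the input order.
    return rng.sample(sorted(ids), k)


question_ids = ["1681_0.q", "1681_1.q", "1681_2.q", "755_10.q", "755_11.q"]
first_draw = sample_ids(question_ids, 3, seed=0)
second_draw = sample_ids(question_ids, 3, seed=0)  # identical to first_draw
```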

Again, we select only utterances that are questions when computing surprise.

In [7]:
surp.transform(subset_corpus, obj_type='utterance', selector=lambda utt: utt.meta['is_question'])

<convokit.model.corpus.Corpus at 0x189a80ec648>
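
For intuition about what a surprise score measures, the sketch below computes the cross-entropy of a target text under an add-one-smoothed unigram model of a context. This is an assumption-laden illustration, not ConvoKit's implementation: the library's tokenization, its sampling controlled by `target_sample_size` and `context_sample_size`, and its smoothing may all differ. The function name and toy sentences are hypothetical.

```python
import math
from collections import Counter


def cross_entropy(target_tokens, context_tokens):
    """Average negative log-probability of the target tokens under an
    add-one-smoothed unigram model estimated from the context tokens."""
    counts = Counter(context_tokens)
    # Vocabulary covers both sides so unseen target tokens get nonzero mass.
    vocab_size = len(set(context_tokens) | set(target_tokens))
    total = len(context_tokens)
    log_probs = [
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in target_tokens
    ]
    return -sum(log_probs) / len(target_tokens)


# Hypothetical toy data: whitespace tokenization keeps the example small.
context = "how do you feel about the match today".split()
typical = "how do you feel".split()
unusual = "describe your favourite recipe".split()

typical_score = cross_entropy(typical, context)
unusual_score = cross_entropy(unusual, context)  # higher: words unseen in context
```

A target that reuses the context's vocabulary gets a lower score than one made of unseen words, which is the sense in which a question is more or less "surprising".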

In [8]:
utterances = subset_corpus.get_utterances_dataframe(selector=lambda utt: utt.meta['is_question'])
utterances

id,timestamp,text,speaker,reply_to,conversation_id,meta.is_answer,meta.is_question,meta.pair_idx,meta.player_gender,meta.surprise
5245_1.q,2013-09-07,Stan said he found you were extremely nervous ...,REPORTER,,5245_1.q,False,True,5245_1,M,6.54898
4341_7.q,2012-08-08,Can you talk about how you were able to not le...,REPORTER,,4341_7.q,False,True,4341_7,F,5.95094
107_3.q,2015-03-17,Yes.,REPORTER,,107_3.q,False,True,107_3,F,
6211_3.q,2014-05-25,You obviously have played a lot of SEC team sp...,REPORTER,,6211_3.q,False,True,6211_3,M,6.40259
924_6.q,2007-03-26,"You would have been in Mexico, right, Mexico C...",REPORTER,,924_6.q,False,True,924_6,M,
...,...,...,...,...,...,...,...,...,...,...
186_5.q,2015-10-10,"Before you came to the China Open, you had to ...",REPORTER,,186_5.q,False,True,186_5,F,6.75991
3979_0.q,2012-01-09,What pleased you most about that performance?,REPORTER,,3979_0.q,False,True,3979_0,F,
2987_10.q,2010-07-01,Was there a point in your career when you real...,REPORTER,,2987_10.q,False,True,2987_10,F,6.29503
4289_1.q,2012-08-23,Can you go over exactly what happened? When di...,REPORTER,,4289_1.q,False,True,4289_1,F,5.82461


In [9]:
utterances[utterances['meta.player_gender'] == 'F']['meta.surprise'].dropna().mean()

6.243879214686866

In [10]:
utterances[utterances['meta.player_gender'] == 'M']['meta.surprise'].dropna().mean()

6.275950256659707
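
The two cells above compute one mean per gender while skipping missing scores. As a sketch of the same calculation in plain Python, the example below uses only the nine rows displayed in the output of cell 8, so its means differ from the full-sample values above. The helper `mean_by_group` is hypothetical.

```python
import math
from collections import defaultdict

# (player_gender, surprise) pairs copied from the rows displayed in cell 8;
# rows with no score are represented as NaN.
rows = [
    ("M", 6.54898), ("F", 5.95094), ("F", math.nan),
    ("M", 6.40259), ("M", math.nan), ("F", 6.75991),
    ("F", math.nan), ("F", 6.29503), ("F", 5.82461),
]


def mean_by_group(pairs):
    """Mean value per group, ignoring NaN entries."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for group, value in pairs:
        if math.isnan(value):
            continue
        totals[group][0] += value
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


means = mean_by_group(rows)
```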

## Part 2: How surprising is a question compared to all questions posed to male players and all questions posed to female players?

Let's see how surprising questions are compared to questions posed to players of each gender. To do this, we'll want to make our `model_key_selector` return a key based on the player's gender. Recall that we added `'player_gender'` as a metadata field to each question earlier.

In [11]:
gender_models_surp = Surprise(model_key_selector=lambda utt: utt.meta['player_gender'], target_sample_size=10, context_sample_size=1000, surprise_attr_name='surprise_gender_model')

In [12]:
gender_models_surp.fit(corpus, selector=lambda utt: utt.meta['is_question'])

<convokit.surprise.surprise.Surprise at 0x189bcd89108>

Since we want to compute surprise for each question under both the male and the female interview-question models, we use the `group_and_models` parameter of the `transform` function. Each utterance belongs to its own group and is compared to both the `'M'` and `'F'` gender models.

In [13]:
gender_models_surp.transform(subset_corpus, obj_type='utterance', group_and_models=lambda utt: (utt.id, ['M', 'F']), selector=lambda utt: utt.meta['is_question'])

<convokit.model.corpus.Corpus at 0x189a80ec648>
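
The `group_and_models` function returns a pair: a group name and the list of model keys to score that group against. The sketch below only mimics the key naming visible in this notebook's output (`GROUP_<group>__MODEL_<model>`); it is not ConvoKit code, and `expected_keys` is a hypothetical helper.

```python
def expected_keys(utt_id, group_and_models):
    """List the result keys one utterance would get, one per model it is scored against."""
    group, models = group_and_models(utt_id)
    # Key pattern assumed from the output shown in this notebook.
    return [f"GROUP_{group}__MODEL_{model}" for model in models]


# Each utterance is its own group and is compared to both gender models.
keys = expected_keys("5245_1.q", lambda utt_id: (utt_id, ["M", "F"]))
```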

In [14]:
utterances = subset_corpus.get_utterances_dataframe(selector=lambda utt: utt.meta['is_question'])
utterances

id,timestamp,text,speaker,reply_to,conversation_id,meta.is_answer,meta.is_question,meta.pair_idx,meta.player_gender,meta.surprise,meta.surprise_gender_model
5245_1.q,2013-09-07,Stan said he found you were extremely nervous ...,REPORTER,,5245_1.q,False,True,5245_1,M,6.54898,"{'GROUP_5245_1.q__MODEL_M': 7.103776529351631,..."
4341_7.q,2012-08-08,Can you talk about how you were able to not le...,REPORTER,,4341_7.q,False,True,4341_7,F,5.95094,"{'GROUP_4341_7.q__MODEL_M': 6.783209349957584,..."
107_3.q,2015-03-17,Yes.,REPORTER,,107_3.q,False,True,107_3,F,,"{'GROUP_107_3.q__MODEL_M': nan, 'GROUP_107_3.q..."
6211_3.q,2014-05-25,You obviously have played a lot of SEC team sp...,REPORTER,,6211_3.q,False,True,6211_3,M,6.40259,"{'GROUP_6211_3.q__MODEL_M': 6.678198283998114,..."
924_6.q,2007-03-26,"You would have been in Mexico, right, Mexico C...",REPORTER,,924_6.q,False,True,924_6,M,,"{'GROUP_924_6.q__MODEL_M': nan, 'GROUP_924_6.q..."
...,...,...,...,...,...,...,...,...,...,...,...
186_5.q,2015-10-10,"Before you came to the China Open, you had to ...",REPORTER,,186_5.q,False,True,186_5,F,6.75991,"{'GROUP_186_5.q__MODEL_M': 6.968725597594767, ..."
3979_0.q,2012-01-09,What pleased you most about that performance?,REPORTER,,3979_0.q,False,True,3979_0,F,,"{'GROUP_3979_0.q__MODEL_M': nan, 'GROUP_3979_0..."
2987_10.q,2010-07-01,Was there a point in your career when you real...,REPORTER,,2987_10.q,False,True,2987_10,F,6.29503,{'GROUP_2987_10.q__MODEL_M': 7.090169721884778...
4289_1.q,2012-08-23,Can you go over exactly what happened? When di...,REPORTER,,4289_1.q,False,True,4289_1,F,5.82461,"{'GROUP_4289_1.q__MODEL_M': 6.631479446868192,..."


In [15]:
utterances[utterances['meta.player_gender'] == 'F']['meta.surprise_gender_model'].values[:10]

array([{'GROUP_4341_7.q__MODEL_M': 6.783209349957584, 'GROUP_4341_7.q__MODEL_F': 6.845298787847863},
       {'GROUP_107_3.q__MODEL_M': nan, 'GROUP_107_3.q__MODEL_F': nan},
       {'GROUP_3052_4.q__MODEL_M': nan, 'GROUP_3052_4.q__MODEL_F': nan},
       {'GROUP_521_14.q__MODEL_M': 6.659929098638321, 'GROUP_521_14.q__MODEL_F': 6.550808970181529},
       {'GROUP_2224_1.q__MODEL_M': 6.7815066487137505, 'GROUP_2224_1.q__MODEL_F': 6.829032042880556},
       {'GROUP_5852_2.q__MODEL_M': 6.694288506666312, 'GROUP_5852_2.q__MODEL_F': 6.767889002922368},
       {'GROUP_3743_0.q__MODEL_M': 6.889600003115634, 'GROUP_3743_0.q__MODEL_F': 6.82613640612194},
       {'GROUP_5975_2.q__MODEL_M': 6.8160504781214675, 'GROUP_5975_2.q__MODEL_F': 6.732984651078998},
       {'GROUP_2820_5.q__MODEL_M': 6.749485526356883, 'GROUP_2820_5.q__MODEL_F': 6.641774645055219},
       {'GROUP_1863_15.q__MODEL_M': 7.023676916052487, 'GROUP_1863_15.q__MODEL_F': 6.889031503704184}],
      dtype=object)

In [16]:
utterances[utterances['meta.player_gender'] == 'M']['meta.surprise_gender_model'].values[:10]

array([{'GROUP_5245_1.q__MODEL_M': 7.103776529351631, 'GROUP_5245_1.q__MODEL_F': 6.900926940711671},
       {'GROUP_6211_3.q__MODEL_M': 6.678198283998114, 'GROUP_6211_3.q__MODEL_F': 6.931403827505043},
       {'GROUP_924_6.q__MODEL_M': nan, 'GROUP_924_6.q__MODEL_F': nan},
       {'GROUP_3890_10.q__MODEL_M': 6.799334299891807, 'GROUP_3890_10.q__MODEL_F': 6.763646052451127},
       {'GROUP_1860_4.q__MODEL_M': 6.84138969161925, 'GROUP_1860_4.q__MODEL_F': 6.724015665111105},
       {'GROUP_1371_3.q__MODEL_M': 6.9518876126735165, 'GROUP_1371_3.q__MODEL_F': 7.083357797245587},
       {'GROUP_472_2.q__MODEL_M': 6.893839769831588, 'GROUP_472_2.q__MODEL_F': 6.750585792412042},
       {'GROUP_5608_7.q__MODEL_M': 6.800128219871717, 'GROUP_5608_7.q__MODEL_F': 6.7548165776156415},
       {'GROUP_4841_3.q__MODEL_M': 7.007302827107132, 'GROUP_4841_3.q__MODEL_F': 6.388921585232761},
       {'GROUP_1202_20.q__MODEL_M': nan, 'GROUP_1202_20.q__MODEL_F': nan}],
      dtype=object)
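
To compare the two models directly, the per-utterance dictionaries can be unpacked into one score per model. The sketch below assumes the `GROUP_<id>__MODEL_<key>` naming seen in the output above and uses the first four entries from cell 16 (questions to male players). The helper `scores_by_model` is hypothetical, and four rows are far too few to draw any conclusion.

```python
import math


def scores_by_model(score_dict):
    """Turn {'GROUP_<id>__MODEL_<key>': score} into {'<key>': score}."""
    return {key.rsplit("__MODEL_", 1)[1]: value for key, value in score_dict.items()}


# Entries copied from the output of cell 16.
male_player_scores = [
    {"GROUP_5245_1.q__MODEL_M": 7.103776529351631, "GROUP_5245_1.q__MODEL_F": 6.900926940711671},
    {"GROUP_6211_3.q__MODEL_M": 6.678198283998114, "GROUP_6211_3.q__MODEL_F": 6.931403827505043},
    {"GROUP_924_6.q__MODEL_M": math.nan, "GROUP_924_6.q__MODEL_F": math.nan},
    {"GROUP_3890_10.q__MODEL_M": 6.799334299891807, "GROUP_3890_10.q__MODEL_F": 6.763646052451127},
]

parsed = [scores_by_model(d) for d in male_player_scores]
# Keep only questions that received a score under both models.
scored = [p for p in parsed if not (math.isnan(p["M"]) or math.isnan(p["F"]))]
# Count questions that are less surprising under the male-question model.
less_surprising_under_m = sum(p["M"] < p["F"] for p in scored)
```

Applying the same unpacking to the full `meta.surprise_gender_model` column would give per-model columns that can be averaged by `meta.player_gender`.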