# Converting the Friends dataset into ConvoKit format

This notebook describes how we converted the Friends dataset (https://github.com/emorynlp/character-mining) into a Corpus with ConvoKit.

In [1]:
!pip3 install convokit
# !python3 -m spacy download en

Collecting convokit
  Downloading https://files.pythonhosted.org/packages/81/f3/d04f33525a2dbf316c31bfab383d1ee84280714b474718e272882e8b2d6a/convokit-2.0.11.tar.gz (77kB)
Collecting msgpack-numpy==0.4.3.2 (from convokit)
  Downloading https://files.pythonhosted.org/packages/ad/45/464be6da85b5ca893cfcbd5de3b31a6710f636ccb8521b17bd4110a08d94/msgpack_numpy-0.4.3.2-py2.py3-none-any.whl
Collecting spacy==2.0.12 (from convokit)
  Downloading https://files

In [0]:
import requests
import json
from tqdm import tqdm
from convokit import Corpus, User, Utterance

## The Friends Dataset

The original dataset (https://github.com/emorynlp/character-mining) contains a set of 10 JSON files, each of which represents a complete transcript of 1 season of <i>Friends</i>. Since the data are available in JSON format from this GitHub repo, we download the raw data directly using the `requests` module. You will not need to download raw data files to use this script.
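
As a minimal sketch of how the per-season URLs can be built, assuming the file naming used in the repository (`friends_season_01.json` through `friends_season_10.json`). The helper name `season_url` is hypothetical and is not used elsewhere in this notebook:

```python
BASE_URL = "https://raw.githubusercontent.com/emorynlp/character-mining/master/json/"

def season_url(season: int) -> str:
    """Return the raw-JSON URL for a season number between 1 and 10."""
    if not 1 <= season <= 10:
        raise ValueError("expected a season number between 1 and 10")
    # Zero-pad to two digits: 1 -> '01', 10 -> '10'
    return f"{BASE_URL}friends_season_{season:02d}.json"

urls = [season_url(i) for i in range(1, 11)]
print(urls[0])
```

Each URL can then be passed to `requests.get` and the response text to `json.loads`, as the cells in this notebook do.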

## Generating user information
Since the dataset has no separate user information, we extract speaker information from the conversations themselves. For each user, we record the episode in which they first appear and guess their gender from their first name using the `gender_guesser` module.

Users are indexed by their name, which is a `str`. For each user, we create an object with:

- <b>first_appearance:</b> the episode in which the character first appeared
- <b>gender:</b> the character's gender, as guessed by the `gender_guesser` module from the character's first name
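
The bookkeeping described above can be sketched on a tiny hand-made season. Everything here is illustrative: `toy_season` is made-up data that only mirrors the nesting of the real files, and `guess_gender` is a stand-in for the `gender_guesser` detector so that the sketch runs without the package.

```python
# Hypothetical miniature of one season file, mirroring the nesting
# episodes -> scenes -> utterances -> speakers used in the dataset.
toy_season = {
    "episodes": [
        {"episode_id": "s01_e01", "scenes": [
            {"utterances": [{"speakers": ["Monica Geller"]},
                            {"speakers": ["Joey Tribbiani"]},
                            {"speakers": []}]}]},
        {"episode_id": "s01_e02", "scenes": [
            {"utterances": [{"speakers": ["Monica Geller"]},
                            {"speakers": ["Carol Willick"]}]}]},
    ]
}

def guess_gender(first_name):
    # Stand-in for the gender_guesser detector (assumption: any callable
    # that maps a first name to a label can be plugged in here).
    return {"Monica": "female", "Joey": "male"}.get(first_name, "unknown")

def collect_users(season, guess=guess_gender):
    users = {}
    for episode in season["episodes"]:
        for scene in episode["scenes"]:
            for utterance in scene["utterances"]:
                for speaker in utterance["speakers"]:
                    # Only the first sighting of a speaker is recorded
                    if speaker not in users:
                        users[speaker] = {
                            "first_appearance": episode["episode_id"],
                            "gender": guess(speaker.split()[0]),
                        }
    return users

toy_users = collect_users(toy_season)
print(toy_users["Carol Willick"])
```

Because episodes are visited in order, the first sighting of a name is also the character's first appearance.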

In [3]:
! pip3 install gender_guesser
import gender_guesser.detector as gender
d = gender.Detector()

Collecting gender_guesser
  Downloading https://files.pythonhosted.org/packages/13/fb/3f2aac40cd2421e164cab1668e0ca10685fcf896bd6b3671088f8aab356e/gender_guesser-0.4.0-py2.py3-none-any.whl (379kB)
Installing collected packages: gender-guesser
Successfully installed gender-guesser-0.4.0


In [4]:
users = {}
for i in tqdm(range(1,11)):
  season_number = '0'+str(i) if i < 10 else '10'
  json_file = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_'+str(season_number)+'.json'
  r = requests.get(json_file)
  
  season = json.loads(r.text)
  episodes = season['episodes']
  for j in range(len(episodes)):
    episode = episodes[j]
    scenes = episode['scenes']
    for k in range(len(scenes)):
      scene = scenes[k]
      utterances = scene['utterances']
      for l in range(len(utterances)):
        utterance = utterances[l]
        speaker_list = utterance['speakers']
        for speaker in speaker_list:
          if speaker not in users:
            users[speaker] = {'first_appearance': episode['episode_id'], 'gender': d.get_gender(speaker.split()[0])}

100%|██████████| 10/10 [00:03<00:00,  2.78it/s]


As a sanity check, we print the number of users and the entries for the six main characters. Each should first appear in `s01_e01` and have a plausible gender guess:

In [5]:
print("number of users in the data = {}/700".format(len(users)))
print("Monica Geller object: ", users["Monica Geller"])
print("Joey Tribbiani object: ", users["Joey Tribbiani"])
print("Chandler Bing object: ", users["Chandler Bing"])
print("Phoebe Buffay object: ", users["Phoebe Buffay"])
print("Ross Geller object: ", users["Ross Geller"])
print("Rachel Green object: ", users["Rachel Green"])

number of users in the data = 700/700
Monica Geller object:  {'first_appearance': 's01_e01', 'gender': 'female'}
Joey Tribbiani object:  {'first_appearance': 's01_e01', 'gender': 'male'}
Chandler Bing object:  {'first_appearance': 's01_e01', 'gender': 'mostly_male'}
Phoebe Buffay object:  {'first_appearance': 's01_e01', 'gender': 'female'}
Ross Geller object:  {'first_appearance': 's01_e01', 'gender': 'male'}
Rachel Green object:  {'first_appearance': 's01_e01', 'gender': 'female'}


We then create a User object for each unique character in the dataset.

In [0]:
corpus_users = {k: User(name=k, meta=v) for k,v in users.items()}

In [7]:
print(corpus_users['Monica Geller'].name)
print(corpus_users['Monica Geller'].meta)

Monica Geller
{'first_appearance': 's01_e01', 'gender': 'female'}


## Generating Utterances

We then loop through the data to generate a list of all utterances in the series. To align with the Utterance schema ConvoKit expects, we construct for each utterance:

- **id:** index of the utterance

- **user:** the user who authored the utterance; the speaker in our case

- **root:** id of the conversation root of the utterance; the first utterance in the scene, in our case

- **reply_to:** id of the utterance to which this utterance replies; None if the utterance is not a reply.

- **timestamp:** time of the utterance (None for us -- the dataset does not contain this information)

- **text:** textual content of the utterance

We also attach the following metadata:
- **tokens:** a tokenized representation of the text (handy for sentence separation)
- **character_entities:** available for some but not all utterances; `None` if unavailable. These are intended to identify who the user is speaking to and/or about.
- **emotion:** emotion labels for each token. Available for some but not all utterances; `None` if unavailable.
- **caption:** available for some but not all utterances; `None` if unavailable. This contains the begin time, end time, and text sans punctuation. Only available for seasons 6-9.
- **transcript_with_note:** a version of the text with an action note (e.g. "(to Ross) Hand me the coffee" vs. "Hand me the coffee"). Available for some but not all utterances; `None` if unavailable.
- **tokens_with_note:** a tokenized representation of the above.
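
The `root` and `reply_to` fields above can be illustrated with a small sketch that links the utterances of one scene. It assumes each utterance is a dict with `utterance_id` and `speakers` keys, as in the dataset; `link_scene` and `toy_scene` are hypothetical names used only for this illustration. Utterances without a speaker are skipped and break the reply chain.

```python
def link_scene(utterances):
    """Return (id, root, reply_to) triples for the utterances of one scene."""
    if not utterances:
        return []
    root_id = utterances[0]["utterance_id"]  # first utterance in the scene
    prev_id = None
    links = []
    for utt in utterances:
        if not utt["speakers"]:
            # No speaker: skip the utterance and restart the reply chain
            prev_id = None
            continue
        links.append((utt["utterance_id"], root_id, prev_id))
        prev_id = utt["utterance_id"]
    return links

toy_scene = [
    {"utterance_id": "s01_e01_c01_u001", "speakers": ["Monica Geller"]},
    {"utterance_id": "s01_e01_c01_u002", "speakers": ["Joey Tribbiani"]},
    {"utterance_id": "s01_e01_c01_u003", "speakers": []},
    {"utterance_id": "s01_e01_c01_u004", "speakers": ["Phoebe Buffay"]},
]
for link in link_scene(toy_scene):
    print(link)
```

In this sketch every utterance in a scene shares one root, so a scene maps onto a single conversation even when the reply chain is interrupted.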

In [8]:
all_utterances = {}



for i in tqdm(range(1,11)):
  season_number = '0'+str(i) if i < 10 else '10'
  json_file = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_'+str(season_number)+'.json'
  r = requests.get(json_file)
  
  season = json.loads(r.text)
  episodes = season['episodes']
  for j in range(len(episodes)):
    episode = episodes[j]
    scenes = episode['scenes']
    for k in range(len(scenes)):
      scene = scenes[k]
      utterances = scene['utterances']
      
      root = utterances[0] #set the root as the first utterance in the scene for now
      
      prev_utt = None

      for l in range(len(utterances)):
        utterance = utterances[l]
        
        speaker = utterance['speakers']
        
        if len(speaker) == 0:
          prev_utt = None
          continue
        
        # Add meta       
        meta = {
            'tokens': utterance.get('tokens'),
            'character_entities': utterance.get('character_entities'),
            'emotion': utterance.get('emotion'),
            'caption': utterance.get('caption'),
            'transcript_with_note': utterance.get('transcript_with_note'),
            'tokens_with_note': utterance.get('tokens_with_note')
        }
        
        # Create the Utterance, including meta
        all_utterances[utterance['utterance_id']] = Utterance(
            id=utterance['utterance_id'],
            user=corpus_users[speaker[0]],
            root=root['utterance_id'],
            reply_to=prev_utt,
            timestamp=None,
            text=utterance['transcript'],
            meta=meta
        )
        
        # Get the prev_utt for the next iteration
        prev_utt = utterance['utterance_id']


100%|██████████| 10/10 [00:04<00:00,  2.50it/s]


In [9]:
print("This corpus has {}/61309 utterances".format(len(all_utterances)))

This corpus has 61338/61309 utterances


In [35]:
all_utterances['s01_e18_c05_u001']

Utterance({'id': 's01_e18_c05_u001', 'user': User([('name', 'Phoebe Buffay')]), 'root': 's01_e18_c05_u001', 'reply_to': None, 'timestamp': None, 'text': 'Ross, could we please, please, please listen to anything else?', 'meta': {'tokens': [['Ross', ',', 'could', 'we', 'please', ',', 'please', ',', 'please', 'listen', 'to', 'anything', 'else', '?']], 'character_entities': [[[0, 1, 'Ross Geller'], [3, 4, 'Ross Geller', 'Phoebe Buffay', 'Rachel Green']]], 'emotion': ['Joyful', ['Joyful', 'Mad', 'Joyful', 'Neutral']], 'caption': None, 'transcript_with_note': None, 'tokens_with_note': None, 'parsed': Ross, could we please, please, please listen to anything else?}})

## Creating the corpus from a list of utterances

We now create the corpus from our dict of utterances. Note that we allow ConvoKit to create conversation IDs automatically after loading the utterance list.

In [0]:
utterance_list = list(all_utterances.values())

In [0]:
friends_corpus = Corpus(utterances=utterance_list, version=1)

Sanity checks for the number of conversations in the dataset and the first 5 conversations:

In [13]:
print("number of conversations in the dataset={}".format(len(friends_corpus.get_conversation_ids())))

number of conversations in the dataset=3099


In [14]:
convo_ids = friends_corpus.get_conversation_ids()
for i, convo_idx in enumerate(convo_ids[0:5]):
    print("sample conversation {}:".format(i))
    print(friends_corpus.get_conversation(convo_idx).get_utterance_ids())

sample conversation 0:
['s01_e01_c01_u001', 's01_e01_c01_u002', 's01_e01_c01_u003', 's01_e01_c01_u004', 's01_e01_c01_u006', 's01_e01_c01_u007', 's01_e01_c01_u008', 's01_e01_c01_u010', 's01_e01_c01_u011', 's01_e01_c01_u012', 's01_e01_c01_u013', 's01_e01_c01_u014', 's01_e01_c01_u015', 's01_e01_c01_u016', 's01_e01_c01_u017', 's01_e01_c01_u018', 's01_e01_c01_u019', 's01_e01_c01_u021', 's01_e01_c01_u022', 's01_e01_c01_u023', 's01_e01_c01_u024', 's01_e01_c01_u025', 's01_e01_c01_u026', 's01_e01_c01_u027', 's01_e01_c01_u028', 's01_e01_c01_u029', 's01_e01_c01_u030', 's01_e01_c01_u031', 's01_e01_c01_u032', 's01_e01_c01_u033', 's01_e01_c01_u034', 's01_e01_c01_u035', 's01_e01_c01_u036', 's01_e01_c01_u037', 's01_e01_c01_u038', 's01_e01_c01_u039', 's01_e01_c01_u040', 's01_e01_c01_u041', 's01_e01_c01_u042', 's01_e01_c01_u044', 's01_e01_c01_u045', 's01_e01_c01_u047', 's01_e01_c01_u048', 's01_e01_c01_u049', 's01_e01_c01_u050', 's01_e01_c01_u051', 's01_e01_c01_u052', 's01_e01_c01_u053', 's01_e01_c01_u05

Summary stats for the corpus:

In [15]:
friends_corpus.print_summary_stats()

Number of Users: 699
Number of Utterances: 61338
Number of Conversations: 3099


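## Updating Gender for our Main Characters

`gender_guesser` returns hedged labels for some names (for example, `mostly_male` for Chandler). We override these by hand where the character's gender is known from the show.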

In [16]:
users['Chandler Bing']['gender'] = 'male'
users['Carol Willick']['gender'] = 'female'
print(users['Chandler Bing'])

{'first_appearance': 's01_e01', 'gender': 'male'}


# Exploring Dataset
## Taking character entities and placing at conversation level
For each scene, we collect the unique characters mentioned during the conversation (from the character entities) and the unique speakers, and guess a gender for each name.

In [0]:
all_ce = {}

for i in tqdm(range(1,11)):
  season_number = '0'+str(i) if i < 10 else '10'
  json_file = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_'+str(season_number)+'.json'
  r = requests.get(json_file)
  
  season = json.loads(r.text)
  for episode in season['episodes']:
    for scene in episode['scenes']:
      ces = set()    # characters mentioned in this scene
      spkrs = set()  # characters who speak in this scene
      for utterance in scene['utterances']:
        speakers = utterance['speakers']
        spkrs.update(speakers)
        # character_entities holds one list per sentence; each entry looks
        # like [start, end, name, ...], so the first name sits at index 2
        for sentence_entities in utterance.get('character_entities') or []:
          for li in sentence_entities:
            name = li[2]
            # Skip speakers referring to themselves
            if name not in speakers:
              ces.add(name)
      # Guess genders once per scene, after both sets are complete
      ceg = [d.get_gender(n.split()[0]) for n in ces]
      spg = [d.get_gender(z.split()[0]) for z in spkrs]
      all_ce[scene['scene_id']] = {'character_entities': ces, 'ce_gender': ceg, 'speakers': spkrs, 'spkr_gender': spg}

In [0]:
print(all_ce['s01_e01_c01'])

{'character_entities': {'#GENERAL#', 'Waitress', 'Monica Geller', 'Carol Willick', 'Paul the Wine Guy', 'Joey Tribbiani', 'Chandler Bing', 'Rachel Green', 'Phoebe Buffay', 'Ross Geller'}, 'ce_gender': ['unknown', 'unknown', 'female', 'mostly_female', 'male', 'male', 'mostly_male', 'female', 'female', 'male'], 'speakers': {'Waitress', 'Monica Geller', '#ALL#', 'Joey Tribbiani', 'Chandler Bing', 'Rachel Green', 'Phoebe Buffay', 'Ross Geller'}, 'spkr_gender': ['unknown', 'female', 'unknown', 'male', 'mostly_male', 'female', 'female', 'male']}


In [0]:
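def entity_name(li):
  # Return the name stored in a character-entity entry, or None for
  # placeholder names such as '#GENERAL#' that contain '#' (chr(35))
  return None if '#' in li[2] else li[2]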

## Extracting Romantic Words from Conversation
For each scene, we check which romantic words are used during the conversation.

In [0]:
from google.colab import files
uploaded = files.upload()

Saving RomanticWords.txt to RomanticWords.txt


In [0]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

We read the entries in RomanticWords.txt and store the stem of each one. We match single words rather than phrases to reduce the time spent checking each utterance.
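
A minimal sketch of stem-based matching follows. To keep it self-contained it uses `toy_stem`, a deliberately crude stand-in for NLTK's `PorterStemmer`, and a made-up four-word lexicon; both are assumptions for illustration only.

```python
def toy_stem(word):
    # Crude stand-in for a real stemmer (assumption): lowercase the word
    # and strip at most one of a few common suffixes.
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Store stems in a set so that each membership test is a hash lookup
lexicon = {toy_stem(w) for w in ["dream", "want", "hug", "love"]}

def split_tokens(tokens, stems):
    """Split tokens into those whose stem is in `stems` and the rest."""
    romantic, other = [], []
    for tok in tokens:
        (romantic if toy_stem(tok) in stems else other).append(tok)
    return romantic, other

romantic, other = split_tokens(["She", "wanted", "hugs", "and", "coffee"], lexicon)
print(romantic, other)
```

Stemming both the lexicon and the tokens with the same function is what lets inflected forms such as "wanted" match the entry "want".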

In [0]:
with open('RomanticWords.txt', 'r') as f:  # close the file once it has been read
  romantic_words = [ps.stem(line.rstrip('\n')) for line in f]

In [0]:
print(romantic_words)

['ador', 'amaz', 'angel', 'babe', 'beau', 'beauti', 'belov', 'better half', 'crazy for y', 'darl', 'dearest', 'enchant', 'friend and lov', 'gorgeou', 'handsom', 'heavenli', 'honey', 'life-chang', 'main squeez', 'my everyth', 'paramour', 'sweetheart', 'sweeti', 'swoon', 'wonder', 'ador', 'admir', 'care', 'cherish', 'choos', 'daydream', 'delight', 'dream', 'need', 'prize', 'treasur', 'valu', 'want', 'worship', 'yearn', 'date ', 'love', 'kiss', 'sex', 'romanc', 'romant', 'hug']


In [0]:
extract_romantic = {}

for i in tqdm(range(1,11)):
  season_number = '0'+str(i) if i < 10 else '10'
  json_file = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_'+str(season_number)+'.json'
  r = requests.get(json_file)
  
  season = json.loads(r.text)
  episodes = season['episodes']
  for j in range(len(episodes)):
    episode = episodes[j]
    scenes = episode['scenes']
    for k in range(len(scenes)):
      scene = scenes[k]
      utterances = scene['utterances']
      romantic = []
      non_romantic = []
      for l in range(len(utterances)):
        utterance = utterances[l]
        if len(utterance['speakers']) == 0 :
          continue
        tokens = utterance['tokens']
        for token in tokens:
          for word in token:
            if ps.stem(word) in romantic_words:
              romantic.append(word)
            else:
              non_romantic.append(word)
      extract_romantic[scene['scene_id']] = {'romantic words': romantic, 'nonromantic words': non_romantic}

100%|██████████| 10/10 [00:17<00:00,  1.74s/it]


In [0]:
extract_romantic.keys()

dict_keys(['s01_e01_c01', 's01_e01_c02', 's01_e01_c03', 's01_e01_c04', 's01_e01_c05', 's01_e01_c06', 's01_e01_c07', 's01_e01_c08', 's01_e01_c09', 's01_e01_c10', 's01_e01_c11', 's01_e01_c12', 's01_e01_c13', 's01_e01_c14', 's01_e01_c15', 's01_e02_c01', 's01_e02_c02', 's01_e02_c03', 's01_e02_c04', 's01_e02_c05', 's01_e02_c06', 's01_e02_c07', 's01_e02_c08', 's01_e02_c09', 's01_e02_c10', 's01_e02_c11', 's01_e03_c01', 's01_e03_c02', 's01_e03_c03', 's01_e03_c04', 's01_e03_c05', 's01_e03_c06', 's01_e03_c07', 's01_e03_c08', 's01_e03_c09', 's01_e03_c10', 's01_e03_c11', 's01_e03_c12', 's01_e03_c13', 's01_e03_c14', 's01_e04_c01', 's01_e04_c02', 's01_e04_c03', 's01_e04_c04', 's01_e04_c05', 's01_e04_c06', 's01_e04_c07', 's01_e04_c08', 's01_e04_c09', 's01_e04_c10', 's01_e04_c11', 's01_e04_c12', 's01_e04_c13', 's01_e04_c14', 's01_e04_c15', 's01_e04_c16', 's01_e05_c01', 's01_e05_c02', 's01_e05_c03', 's01_e05_c04', 's01_e05_c05', 's01_e05_c06', 's01_e05_c07', 's01_e05_c08', 's01_e05_c09', 's01_e05_c10',

In [0]:
print(extract_romantic['s01_e01_c01'])

{'romantic words': ['want', 'sex', 'dream', 'dream', 'sweetie', 'want', 'want', 'gorgeous', 'wondering'], 'nonromantic words': ['There', "'s", 'nothing', 'to', 'tell', '!', 'He', "'s", 'just', 'some', 'guy', 'I', 'work', 'with', '!', "C'mon", ',', 'you', "'re", 'going', 'out', 'with', 'the', 'guy', '!', 'There', "'s", 'got', 'ta', 'be', 'something', 'wrong', 'with', 'him', '!', 'All', 'right', 'Joey', ',', 'be', 'nice', '.', 'So', 'does', 'he', 'have', 'a', 'hump', '?', 'A', 'hump', 'and', 'a', 'hairpiece', '?', 'Wait', ',', 'does', 'he', 'eat', 'chalk', '?', 'Just', ',', "'", 'cause', ',', 'I', 'do', "n't", 'her', 'to', 'go', 'through', 'what', 'I', 'went', 'through', 'with', 'Carl', '-', 'oh', '!', 'Okay', ',', 'everybody', 'relax', '.', 'This', 'is', 'not', 'even', 'a', 'date', '.', 'It', "'s", 'just', 'two', 'people', 'going', 'out', 'to', 'dinner', 'and', '-', 'not', 'having', '.', 'Sounds', 'like', 'a', 'date', 'to', 'me', '.', 'Alright', ',', 'so', 'I', "'m", 'back', 'in', 'high

# Assessing the Romantic Words and Genders

In [0]:
[(scene_id, len(extract_romantic[scene_id]['romantic words']), len(extract_romantic[scene_id]['nonromantic words'])) for scene_id in extract_romantic.keys()]

[('s01_e01_c01', 9, 867),
 ('s01_e01_c02', 6, 894),
 ('s01_e01_c03', 5, 54),
 ('s01_e01_c04', 1, 187),
 ('s01_e01_c05', 0, 139),
 ('s01_e01_c06', 1, 119),
 ('s01_e01_c07', 1, 273),
 ('s01_e01_c08', 2, 179),
 ('s01_e01_c09', 1, 40),
 ('s01_e01_c10', 0, 116),
 ('s01_e01_c11', 3, 529),
 ('s01_e01_c12', 2, 102),
 ('s01_e01_c13', 3, 232),
 ('s01_e01_c14', 3, 568),
 ('s01_e01_c15', 1, 185),
 ('s01_e02_c01', 5, 203),
 ('s01_e02_c02', 0, 228),
 ('s01_e02_c03', 4, 733),
 ('s01_e02_c04', 3, 521),
 ('s01_e02_c05', 3, 402),
 ('s01_e02_c06', 0, 122),
 ('s01_e02_c07', 0, 76),
 ('s01_e02_c08', 0, 165),
 ('s01_e02_c09', 4, 255),
 ('s01_e02_c10', 1, 363),
 ('s01_e02_c11', 1, 223),
 ('s01_e03_c01', 0, 298),
 ('s01_e03_c02', 1, 241),
 ('s01_e03_c03', 1, 604),
 ('s01_e03_c04', 0, 136),
 ('s01_e03_c05', 2, 452),
 ('s01_e03_c06', 1, 184),
 ('s01_e03_c07', 3, 178),
 ('s01_e03_c08', 0, 23),
 ('s01_e03_c09', 0, 283),
 ('s01_e03_c10', 1, 149),
 ('s01_e03_c11', 1, 167),
 ('s01_e03_c12', 6, 491),
 ('s01_e03_c13',

In [0]:
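# Merge the two per-scene dicts key by key; dict.update would replace each
# scene's all_ce entry, because both dicts are keyed by scene_id
d4 = {scene_id: {**all_ce[scene_id], **extract_romantic.get(scene_id, {})} for scene_id in all_ce}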

## Adding corpus-level metadata

We add the name of the corpus.

In [0]:
friends_corpus.meta['name'] = 'Friends Dataset'

# Create the corpus dump

If working in a locally mounted notebook:

In [0]:
friends_corpus.dump("friends-corpus", base_path = "YOUR_BASE_PATH/datasets/friends-corpus")

If working in Google Colab, first mount your Google Drive, then dump:

In [0]:
from zipfile import ZipFile
import os
import google
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
friends_corpus.dump("friends-corpus", base_path = "gdrive/My Drive/F19-CS6742/")

In [0]:
corpus_dir = "gdrive/My Drive/F19-CS6742/friends-corpus"  # must match the base_path passed to dump() above
with ZipFile(corpus_dir + ".zip", 'w') as zip_f:
  for fname in os.listdir(corpus_dir):
    zip_f.write(os.path.join(corpus_dir, fname))

# Assessing Politeness Strategies

In this step (A1 part D2) we apply the ConvoKit politeness transformer to the utterances in our dataset.

## Setup: Install spacy and nltk

ConvoKit transformers expect installations of:
- spacy, with its `en` model
- nltk, with its `punkt` library

In [27]:
!python -m spacy download en

Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.0.0-cp36-none-any.whl size=37405977 sha256=e52c9a72137ccf59580ba1bd6a573a738dbe9d0ae2b40006f91a5cd628cf54f1
  Stored in directory: /tmp/pip-ephem-wheel-cache-2302kkhn/wheels/54/7c/d8/f86364af8fbba7258e14adae115f18dd2c91552406edc3fdaa
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
  Found existing installation: en-core-web-sm 2.1.0
    Uninstalling en-core-web-sm-2.1.0:
      Successfully uninstalled en

In [30]:
import spacy
spacy.load('en')

<spacy.lang.en.English at 0x7f740a3aa780>

In [38]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## 1) Construct parsed versions of each utterance for downstream use in the transformer.

In [0]:
from convokit.parser.parser import Parser
parser = Parser()
parsed_corpus = parser.transform(corpus=friends_corpus)

In [0]:
uids = friends_corpus.get_utterance_ids()

In [42]:
parsed_corpus.get_utterance(uids[0])

Utterance({'id': 's01_e01_c01_u001', 'user': User([('name', 'Monica Geller')]), 'root': 's01_e01_c01_u001', 'reply_to': None, 'timestamp': None, 'text': "There's nothing to tell! He's just some guy I work with!", 'meta': {'tokens': [['There', "'s", 'nothing', 'to', 'tell', '!'], ['He', "'s", 'just', 'some', 'guy', 'I', 'work', 'with', '!']], 'character_entities': [[], [[0, 1, 'Paul the Wine Guy'], [4, 5, 'Paul the Wine Guy'], [5, 6, 'Monica Geller']]], 'emotion': None, 'caption': None, 'transcript_with_note': None, 'tokens_with_note': None, 'parsed': There's nothing to tell! He's just some guy I work with!, 'politeness_strategies': {'feature_politeness_==Please==': 0, 'feature_politeness_==Please_start==': 0, 'feature_politeness_==Indirect_(btw)==': 0, 'feature_politeness_==Hedges==': 0, 'feature_politeness_==Factuality==': 0, 'feature_politeness_==Deference==': 0, 'feature_politeness_==Gratitude==': 0, 'feature_politeness_==Apologizing==': 0, 'feature_politeness_==1st_person_pl.==': 0

## 2) Apply the PolitenessStrategies transformer to the parsed corpus.

In [0]:
from convokit.politenessStrategies.politenessStrategies import PolitenessStrategies
politeness_transformer = PolitenessStrategies()
polite_corpus = politeness_transformer.transform(corpus=parsed_corpus)

## 3) Analyze the politeness outputs

We are interested in a cursory understanding of how the politeness strategies of each utterance intersect with the speaker's gender identity, as defined in our metadata.

First we build a DataFrame of politeness strategies and speaker genders per utterance:

In [55]:
import pandas as pd

utterance_ids = polite_corpus.get_utterance_ids()
rows = []
for uid in utterance_ids:
  utt = polite_corpus.get_utterance(uid)
  meta = dict(utt.meta["politeness_strategies"])  # copy, so the corpus metadata is not modified
  meta['speaker_gender'] = utt.user.meta['gender']
  rows.append(meta)
politeness_strategies = pd.DataFrame(rows, index=utterance_ids)

Unnamed: 0,feature_politeness_==1st_person==,feature_politeness_==1st_person_pl.==,feature_politeness_==1st_person_start==,feature_politeness_==2nd_person==,feature_politeness_==2nd_person_start==,feature_politeness_==Apologizing==,feature_politeness_==Deference==,feature_politeness_==Direct_question==,feature_politeness_==Direct_start==,feature_politeness_==Factuality==,feature_politeness_==Gratitude==,feature_politeness_==HASHEDGE==,feature_politeness_==HASNEGATIVE==,feature_politeness_==HASPOSITIVE==,feature_politeness_==Hedges==,feature_politeness_==INDICATIVE==,feature_politeness_==Indirect_(btw)==,feature_politeness_==Indirect_(greeting)==,feature_politeness_==Please==,feature_politeness_==Please_start==,feature_politeness_==SUBJUNCTIVE==,speaker_gender
s01_e01_c01_u001,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,female
s01_e01_c01_u002,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,male
s01_e01_c01_u003,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,male
s01_e01_c01_u004,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,female
s01_e01_c01_u006,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,female
s01_e01_c01_u007,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,female
s01_e01_c01_u008,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,male
s01_e01_c01_u010,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,male
s01_e01_c01_u011,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,unknown
s01_e01_c01_u012,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,male


We split the politeness DataFrame by `gender_guesser` category: male (`male` or `mostly_male`), female (`female` or `mostly_female`), `andy` (androgynous), and `unknown`. For each group, we compute the percentage of utterances that exhibit each politeness feature.
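
The per-group percentage can be sketched without pandas on a few made-up rows. This assumes each feature column holds 0 or 1, as in the DataFrame shown earlier; `GROUP`, `feature_percentages`, and `toy_rows` are hypothetical names used only here.

```python
# Map gender_guesser labels onto the four groups used in the analysis
GROUP = {"male": "male", "mostly_male": "male",
         "female": "female", "mostly_female": "female",
         "andy": "andy", "unknown": "unknown"}

def feature_percentages(rows, feature):
    """Percent of utterances per group whose 0/1 `feature` value is 1."""
    totals, hits = {}, {}
    for row in rows:
        group = GROUP[row["speaker_gender"]]
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + row[feature]
    return {g: 100 * hits[g] / totals[g] for g in totals}

# Made-up rows; real rows carry one column per politeness feature
toy_rows = [
    {"speaker_gender": "female", "Please": 1},
    {"speaker_gender": "mostly_female", "Please": 0},
    {"speaker_gender": "male", "Please": 0},
    {"speaker_gender": "mostly_male", "Please": 0},
    {"speaker_gender": "unknown", "Please": 1},
]
toy_pct = feature_percentages(toy_rows, "Please")
print(toy_pct)
```

The denominator is the number of utterances in the group, not the number of speakers, which matches the `len(female_df)`-style denominators used in the cells of this section.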

In [105]:
female_df = politeness_strategies.loc[politeness_strategies['speaker_gender'].isin(['female', 'mostly_female'])]
sum_female = female_df.sum().drop('speaker_gender')
print('female denom: ', len(female_df))
ratio_female = (sum_female / len(female_df))*100
print(ratio_female)

female denom:  29103
feature_politeness_==1st_person==                27.011
feature_politeness_==1st_person_pl.==           9.38048
feature_politeness_==1st_person_start==         18.2524
feature_politeness_==2nd_person==               30.9143
feature_politeness_==2nd_person_start==         7.70711
feature_politeness_==Apologizing==              2.71106
feature_politeness_==Deference==               0.900251
feature_politeness_==Direct_question==          10.5659
feature_politeness_==Direct_start==             9.79968
feature_politeness_==Factuality==               5.54582
feature_politeness_==Gratitude==                2.28842
feature_politeness_==HASHEDGE==                 11.1982
feature_politeness_==HASNEGATIVE==              18.1734
feature_politeness_==HASPOSITIVE==               32.653
feature_politeness_==Hedges==                   6.52854
feature_politeness_==INDICATIVE==              0.515411
feature_politeness_==Indirect_(btw)==         0.0377968
feature_politeness_==Indire

In [109]:
male_df = politeness_strategies.loc[politeness_strategies['speaker_gender'].isin(['male', 'mostly_male'])]
sum_male = male_df.sum().drop('speaker_gender')
print('male denom: ', len(male_df))
ratio_male = (sum_male / len(male_df))*100
print(ratio_male)

male denom:  29726
feature_politeness_==1st_person==               27.3363
feature_politeness_==1st_person_pl.==           9.81296
feature_politeness_==1st_person_start==          18.758
feature_politeness_==2nd_person==               28.5979
feature_politeness_==2nd_person_start==         7.73061
feature_politeness_==Apologizing==              2.19337
feature_politeness_==Deference==                 1.1976
feature_politeness_==Direct_question==          10.3209
feature_politeness_==Direct_start==             10.2907
feature_politeness_==Factuality==               4.68613
feature_politeness_==Gratitude==                2.09581
feature_politeness_==HASHEDGE==                  11.532
feature_politeness_==HASNEGATIVE==              18.2803
feature_politeness_==HASPOSITIVE==              32.3622
feature_politeness_==Hedges==                   6.38835
feature_politeness_==INDICATIVE==              0.595438
feature_politeness_==Indirect_(btw)==         0.0336406
feature_politeness_==Indirect

In [110]:
andy_df = politeness_strategies.loc[politeness_strategies['speaker_gender']=='andy']
sum_andy = andy_df.sum().drop('speaker_gender')
print('andy denom: ', len(andy_df))
ratio_andy = (sum_andy / len(andy_df))*100
print(ratio_andy)

andy denom:  80
feature_politeness_==1st_person==                20
feature_politeness_==1st_person_pl.==          3.75
feature_politeness_==1st_person_start==        12.5
feature_politeness_==2nd_person==                25
feature_politeness_==2nd_person_start==         2.5
feature_politeness_==Apologizing==             3.75
feature_politeness_==Deference==               1.25
feature_politeness_==Direct_question==         6.25
feature_politeness_==Direct_start==             7.5
feature_politeness_==Factuality==              3.75
feature_politeness_==Gratitude==               1.25
feature_politeness_==HASHEDGE==                 7.5
feature_politeness_==HASNEGATIVE==              7.5
feature_politeness_==HASPOSITIVE==            16.25
feature_politeness_==Hedges==                  6.25
feature_politeness_==INDICATIVE==              1.25
feature_politeness_==Indirect_(btw)==             0
feature_politeness_==Indirect_(greeting)==       15
feature_politeness_==Please==                  1

In [111]:
unk_df = politeness_strategies.loc[politeness_strategies['speaker_gender']=='unknown']
sum_unk = unk_df.sum().drop('speaker_gender')
print('unk denom: ', len(unk_df))
ratio_unk = (sum_unk / len(unk_df))*100
print(ratio_unk)

unk denom:  2429
feature_politeness_==1st_person==               16.9617
feature_politeness_==1st_person_pl.==            9.2219
feature_politeness_==1st_person_start==         13.5447
feature_politeness_==2nd_person==               26.5541
feature_politeness_==2nd_person_start==         7.45163
feature_politeness_==Apologizing==              3.66406
feature_politeness_==Deference==                1.68794
feature_politeness_==Direct_question==          6.62824
feature_politeness_==Direct_start==             6.75175
feature_politeness_==Factuality==               2.88184
feature_politeness_==Gratitude==                2.26431
feature_politeness_==HASHEDGE==                 7.90449
feature_politeness_==HASNEGATIVE==              15.5208
feature_politeness_==HASPOSITIVE==              29.3948
feature_politeness_==Hedges==                   3.82874
feature_politeness_==INDICATIVE==              0.288184
feature_politeness_==Indirect_(btw)==         0.0411692
feature_politeness_==Indirect_(

Sanity check that all utterances have been captured in a dataframe:

In [85]:
polite_corpus.print_summary_stats()

Number of Users: 699
Number of Utterances: 61338
Number of Conversations: 3099


In [97]:
len(female_df) + len(male_df) + len(andy_df) + len(unk_df)

61338

We combine the per-group percentages into one table and add two heuristics for gender bias: the male relative bias, (p_m - p_f) / p_m, and the female relative bias, (p_f - p_m) / p_f, where p_m and p_f are the male and female percentages for a feature.
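
As a small worked sketch of the two heuristics, using the `1st_person` percentages reported in this notebook's results table (p_m = 27.336339, p_f = 27.010961). The helper name `relative_bias` is hypothetical:

```python
def relative_bias(p_a, p_b):
    """Relative difference (p_a - p_b) / p_a; positive when p_a > p_b.

    Undefined when p_a is 0, i.e. when the feature never occurs in group a.
    """
    return (p_a - p_b) / p_a

# Percentages of utterances showing the 1st_person feature
p_m, p_f = 27.336339, 27.010961
m_rel_bias = relative_bias(p_m, p_f)
f_rel_bias = relative_bias(p_f, p_m)
print(round(m_rel_bias, 6), round(f_rel_bias, 6))
```

The two values have opposite signs but slightly different magnitudes, because each one is normalized by its own group's percentage.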

In [124]:
rel_props = pd.DataFrame([ratio_male,
                          ratio_female,
                          ratio_andy,
                          ratio_unk,
                          (ratio_male - ratio_female) / ratio_male,      # male relative bias
                          (ratio_female - ratio_male) / ratio_female     # female relative bias
                         ], 
                         index=['p_m', 'p_f', 'p_a', 'p_u', 'm rel bias', 'f rel bias'])
rel_props

Unnamed: 0,feature_politeness_==1st_person==,feature_politeness_==1st_person_pl.==,feature_politeness_==1st_person_start==,feature_politeness_==2nd_person==,feature_politeness_==2nd_person_start==,feature_politeness_==Apologizing==,feature_politeness_==Deference==,feature_politeness_==Direct_question==,feature_politeness_==Direct_start==,feature_politeness_==Factuality==,feature_politeness_==Gratitude==,feature_politeness_==HASHEDGE==,feature_politeness_==HASNEGATIVE==,feature_politeness_==HASPOSITIVE==,feature_politeness_==Hedges==,feature_politeness_==INDICATIVE==,feature_politeness_==Indirect_(btw)==,feature_politeness_==Indirect_(greeting)==,feature_politeness_==Please==,feature_politeness_==Please_start==,feature_politeness_==SUBJUNCTIVE==
p_m,27.336339,9.812958,18.75799,28.59786,7.730606,2.193366,1.197605,10.320931,10.290655,4.686133,2.095808,11.531992,18.280293,32.362242,6.388347,0.595438,0.033641,8.86093,0.457512,0.292673,0.659355
p_f,27.010961,9.380476,18.252414,30.914339,7.707109,2.711061,0.900251,10.565921,9.799677,5.54582,2.288424,11.198158,18.173384,32.652991,6.528537,0.515411,0.037797,7.040511,0.707831,0.408893,0.807477
p_a,20.0,3.75,12.5,25.0,2.5,3.75,1.25,6.25,7.5,3.75,1.25,7.5,7.5,16.25,6.25,1.25,0.0,15.0,1.25,0.0,2.5
p_u,16.961713,9.221902,13.544669,26.554138,7.451626,3.664059,1.687937,6.628242,6.75175,2.881844,2.264306,7.904487,15.52079,29.394813,3.828736,0.288184,0.041169,8.686702,1.152738,0.452861,0.658707
m rel bias,0.011903,0.044073,0.026953,-0.081002,0.003039,-0.236027,0.248291,-0.023737,0.047711,-0.183453,-0.091905,0.028949,0.005848,-0.008984,-0.021945,0.134401,-0.123547,0.205443,-0.547131,-0.397097,-0.224646
f rel bias,-0.012046,-0.046104,-0.027699,0.074932,-0.003049,0.190956,-0.330301,0.023187,-0.050101,0.155015,0.084169,-0.029812,-0.005883,0.008904,0.021473,-0.155269,0.109962,-0.258563,0.353642,0.28423,0.183437
