# Converting the Friends dataset into ConvoKit format

This notebook describes how we converted the Friends dataset (https://github.com/emorynlp/character-mining) into a Corpus with ConvoKit.

In [0]:
!pip3 install convokit
# !python3 -m spacy download en

Collecting convokit
  Downloading https://files.pythonhosted.org/packages/81/f3/d04f33525a2dbf316c31bfab383d1ee84280714b474718e272882e8b2d6a/convokit-2.0.11.tar.gz (77kB)
Collecting msgpack-numpy==0.4.3.2 (from convokit)
  Downloading https://files.pythonhosted.org/packages/ad/45/464be6da85b5ca893cfcbd5de3b31a6710f636ccb8521b17bd4110a08d94/msgpack_numpy-0.4.3.2-py2.py3-none-any.whl
Collecting spacy==2.0.12 (from convokit)
  Downloading https://files.pythonhosted.org/packages/24/de/ac14cd453c98656d6738a5669f96a4ac7f668493d5e6b78227ac933c5fd4/spacy-2.0.12.tar.gz (22.0MB)
Collecting nltk>=3.4 (from convokit)
  Downloading https://files.pythonhosted.org/packages/f6/1d/d925cfb4f324ede997f6d47bea4d9babba51b49e87a767c170b77005889d/nltk-3.4.5.zip (1.5MB)
Collecting dill==0.2.9 (from convokit)

In [0]:
import requests
import json
from tqdm import tqdm
from convokit import Corpus, User, Utterance

## The Friends Dataset

The original dataset (https://github.com/emorynlp/character-mining) contains a set of 10 JSON files, each of which represents a complete transcript of 1 season of <i>Friends</i>. Since the data are available in JSON format from this GitHub repo, we download the raw data directly using the `requests` module. You will not need to download raw data files to use this script.
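
As a rough illustration of the layout described above, the sketch below builds the ten season URLs and walks a tiny hand-made stand-in for one season file. The field names (`episodes`, `episode_id`, `scenes`, `utterances`, `utterance_id`, `speakers`, `transcript`) are the ones the conversion code in this notebook reads; the sample values and the helper names `season_url` and `iter_utterances` are made up for illustration.

```python
import json

BASE_URL = ('https://raw.githubusercontent.com/emorynlp/character-mining/'
            'master/json/friends_season_{}.json')


def season_url(season):
    """Return the raw-file URL for a season, zero-padded to two digits."""
    return BASE_URL.format(str(season).zfill(2))


def iter_utterances(season_json):
    """Yield (episode_id, utterance) pairs from the nested season structure."""
    for episode in season_json['episodes']:
        for scene in episode['scenes']:
            for utterance in scene['utterances']:
                yield episode['episode_id'], utterance


# Hypothetical miniature season file; the real files are much larger.
sample = json.loads('''
{"episodes": [{"episode_id": "s01_e01", "scenes": [{"utterances": [
  {"utterance_id": "s01_e01_c01_u001", "speakers": ["Monica Geller"],
   "transcript": "Hello."},
  {"utterance_id": "s01_e01_c01_u002", "speakers": [],
   "transcript": ""}
]}]}]}
''')

urls = [season_url(i) for i in range(1, 11)]
pairs = list(iter_utterances(sample))
print(urls[0])
print(len(pairs))
```

Note that an utterance's `speakers` field is a list and may be empty, which is why the conversion code below checks its length before using it.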

## Generating user information
Since the dataset does not include separate user information, we extract speakers from the conversations themselves. For each user, we record the episode in which they first appear and guess their gender from their first name using the `gender_guesser` module.

Users are indexed by their name, which is a `<str>`. For each user, we create an object with:

- <b>first_appearance:</b> the episode in which the character first appeared
- <b>gender:</b> the character's gender, as guessed by the `gender_guesser` module from the character's first name

In [0]:
! pip3 install gender_guesser
import gender_guesser.detector as gender
d = gender.Detector()

Collecting gender_guesser
  Downloading https://files.pythonhosted.org/packages/13/fb/3f2aac40cd2421e164cab1668e0ca10685fcf896bd6b3671088f8aab356e/gender_guesser-0.4.0-py2.py3-none-any.whl (379kB)
Installing collected packages: gender-guesser
Successfully installed gender-guesser-0.4.0


In [0]:
users = {}
for i in tqdm(range(1, 11)):
  # Season files are numbered with two digits: 01, 02, ..., 10
  season_number = str(i).zfill(2)
  json_file = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_' + season_number + '.json'
  r = requests.get(json_file)

  season = json.loads(r.text)
  for episode in season['episodes']:
    for scene in episode['scenes']:
      for utterance in scene['utterances']:
        for speaker in utterance['speakers']:
          # Record each speaker the first time they appear
          if speaker not in users:
            users[speaker] = {
                'first_appearance': episode['episode_id'],
                # Guess the gender from the first name only
                'gender': d.get_gender(speaker.split()[0])
            }

100%|██████████| 10/10 [00:05<00:00,  2.17it/s]


Sanity-checking the user data, we should see plausible genders assigned to the six friends (note that `gender_guesser` can return hedged labels such as `mostly_male`):

In [0]:
print("number of users in the data = {}/700".format(len(users)))
print("Monica Geller object: ", users["Monica Geller"])
print("Joey Tribbiani object: ", users["Joey Tribbiani"])
print("Chandler Bing object: ", users["Chandler Bing"])
print("Phoebe Buffay object: ", users["Phoebe Buffay"])
print("Ross Geller object: ", users["Ross Geller"])
print("Rachel Green object: ", users["Rachel Green"])

number of users in the data = 700/700
Monica Geller object:  {'first_appearance': 's01_e01', 'gender': 'female'}
Joey Tribbiani object:  {'first_appearance': 's01_e01', 'gender': 'male'}
Chandler Bing object:  {'first_appearance': 's01_e01', 'gender': 'mostly_male'}
Phoebe Buffay object:  {'first_appearance': 's01_e01', 'gender': 'female'}
Ross Geller object:  {'first_appearance': 's01_e01', 'gender': 'male'}
Rachel Green object:  {'first_appearance': 's01_e01', 'gender': 'female'}
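
Beyond spot-checking individual characters, it can help to tally how the guessed labels are distributed. The following is a minimal sketch, assuming a `users` dict shaped like the one built above; here it is replaced by a small stand-in copied from the sanity-check output, and `gender_counts` is a hypothetical helper name.

```python
from collections import Counter

# Stand-in for the `users` dict, copied from the sanity-check output above.
users = {
    'Monica Geller': {'first_appearance': 's01_e01', 'gender': 'female'},
    'Joey Tribbiani': {'first_appearance': 's01_e01', 'gender': 'male'},
    'Chandler Bing': {'first_appearance': 's01_e01', 'gender': 'mostly_male'},
    'Phoebe Buffay': {'first_appearance': 's01_e01', 'gender': 'female'},
    'Ross Geller': {'first_appearance': 's01_e01', 'gender': 'male'},
    'Rachel Green': {'first_appearance': 's01_e01', 'gender': 'female'},
}


def gender_counts(users):
    """Count how many users received each guessed gender label."""
    return Counter(info['gender'] for info in users.values())


counts = gender_counts(users)
for label, n in counts.most_common():
    print(label, n)
```

Run against the full `users` dict, the same tally would show how many characters received each label, which is a quick way to judge how far the name-based guesses can be trusted.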


We then create a User object for each unique character in the dataset.

In [0]:
corpus_users = {k: User(name=k, meta=v) for k,v in users.items()}

In [0]:
print(corpus_users['Monica Geller'].name)
print(corpus_users['Monica Geller'].meta)

Monica Geller
{'first_appearance': 's01_e01', 'gender': 'female'}


## Generating Utterances

We then loop through the data to generate a list of all utterances in the series. To align with the Utterance schema ConvoKit expects, we construct for each utterance:

- **id:** index of the utterance

- **user:** the user who authored the utterance; the speaker in our case

- **root:** id of the conversation root of the utterance; the first utterance in the scene, in our case

- **reply_to:** id of the utterance to which this utterance replies; `None` if the utterance is not a reply.

- **timestamp:** time of the utterance (None for us -- the dataset does not contain this information)

- **text:** textual content of the utterance

We also pull in the following metadata:
- **tokens:** a tokenized representation of the text (handy for sentence separation).
- **character_entities:** available for some but not all utterances; `None` if unavailable. These are intended to identify who the user is speaking to and/or about.
- **emotion:** emotion labels for each token. Available for some but not all utterances; `None` if unavailable.
- **caption:** available for some but not all utterances; `None` if unavailable. This contains the begin time, end time, and text sans punctuation. Only available for seasons 6-9.
- **transcript_with_note:** a version of the text with an action note (e.g. "(to Ross) Hand me the coffee" vs. "Hand me the coffee"). Available for some but not all utterances; `None` if unavailable.
- **tokens_with_note:** a tokenized representation of `transcript_with_note`.
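
The `root` and `reply_to` fields described above can be sketched on plain dicts before involving ConvoKit. This is a minimal illustration that follows the same rules as the conversion loop in this notebook: the root is the scene's first utterance, each utterance replies to the previous one, and an utterance with no speaker is skipped and breaks the reply chain. The helper name `link_scene`, the empty-scene guard, and the shortened ids are assumptions added for the example.

```python
def link_scene(utterances):
    """Return {utterance_id: (root_id, reply_to_id)} for one scene."""
    links = {}
    if not utterances:  # guard added for the sketch
        return links
    root_id = utterances[0]['utterance_id']
    prev_id = None
    for utt in utterances:
        if not utt['speakers']:
            # No speaker: skip the utterance and reset the reply chain
            prev_id = None
            continue
        links[utt['utterance_id']] = (root_id, prev_id)
        prev_id = utt['utterance_id']
    return links


# Hypothetical four-utterance scene with shortened ids
scene = [
    {'utterance_id': 'u001', 'speakers': ['Ross Geller']},
    {'utterance_id': 'u002', 'speakers': ['Rachel Green']},
    {'utterance_id': 'u003', 'speakers': []},
    {'utterance_id': 'u004', 'speakers': ['Ross Geller']},
]
links = link_scene(scene)
for utt_id, (root_id, reply_to) in links.items():
    print(utt_id, root_id, reply_to)
```

In this sketch `u003` is dropped, so `u004` has no `reply_to` even though it shares the scene's root.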

In [0]:
all_utterances = {}

for i in tqdm(range(1, 11)):
  # Season files are numbered with two digits: 01, 02, ..., 10
  season_number = str(i).zfill(2)
  json_file = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_' + season_number + '.json'
  r = requests.get(json_file)

  season = json.loads(r.text)
  for episode in season['episodes']:
    for scene in episode['scenes']:
      utterances = scene['utterances']

      root = utterances[0]  # set the root as the first utterance in the scene for now

      prev_utt = None

      for utterance in utterances:
        speaker = utterance['speakers']

        # Skip utterances without a speaker and reset the reply chain
        if len(speaker) == 0:
          prev_utt = None
          continue

        # Add meta
        meta = {
            'tokens': utterance.get('tokens'),
            'character_entities': utterance.get('character_entities'),
            'emotion': utterance.get('emotion'),
            'caption': utterance.get('caption'),
            'transcript_with_note': utterance.get('transcript_with_note'),
            'tokens_with_note': utterance.get('tokens_with_note')
        }

        # Create the Utterance, including meta.
        # Only the first listed speaker is used as the author.
        all_utterances[utterance['utterance_id']] = Utterance(
            id=utterance['utterance_id'],
            user=corpus_users[speaker[0]],
            root=root['utterance_id'],
            reply_to=prev_utt,
            timestamp=None,
            text=utterance['transcript'],
            meta=meta
        )

        # Get the prev_utt for the next iteration
        prev_utt = utterance['utterance_id']


100%|██████████| 10/10 [00:04<00:00,  2.34it/s]


In [0]:
print("This corpus has {}/61309 utterances".format(len(all_utterances)))

This corpus has 61338/61309 utterances


In [0]:
all_utterances['s01_e18_c05_u021']

Utterance({'id': 's01_e18_c05_u021', 'user': User([('name', 'Ross Geller')]), 'root': 's01_e18_c05_u001', 'reply_to': 's01_e18_c05_u020', 'timestamp': None, 'text': 'Alright.', 'meta': {'tokens': [['Alright', '.']], 'character_entities': [[]], 'emotion': ['Neutral', ['Neutral', 'Neutral', 'Neutral', 'Neutral']], 'caption': None, 'transcript_with_note': None, 'tokens_with_note': None}})

## Creating the corpus from a list of utterances

We now create the corpus from our dict of utterances. Note that we allow ConvoKit to create conversation IDs automatically after loading the utterance list.

In [0]:
utterance_list = list(all_utterances.values())

In [0]:
friends_corpus = Corpus(utterances=utterance_list, version=1)
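
To make the automatic grouping concrete, the sketch below assumes that conversations are keyed by the `root` id their utterances share, so every utterance with the same root lands in the same conversation. It uses plain tuples and is only an illustration of that idea, not ConvoKit's implementation; `group_by_root` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_root(utterances):
    """Group (utterance_id, root_id) pairs into {root_id: [utterance_ids]}."""
    conversations = defaultdict(list)
    for utt_id, root_id in utterances:
        conversations[root_id].append(utt_id)
    return dict(conversations)


# Hypothetical sample: two scenes, hence two roots
sample = [
    ('s01_e01_c01_u001', 's01_e01_c01_u001'),
    ('s01_e01_c01_u002', 's01_e01_c01_u001'),
    ('s01_e01_c02_u001', 's01_e01_c02_u001'),
]
conversations = group_by_root(sample)
print(len(conversations))
```

Under this assumption, the number of conversations equals the number of distinct roots, that is, the number of scenes that contribute at least one utterance.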

Sanity checks for the number of conversations in the dataset and the first 5 conversations:

In [0]:
print("number of conversations in the dataset={}".format(len(friends_corpus.get_conversation_ids())))

number of conversations in the dataset=3099


In [0]:
convo_ids = friends_corpus.get_conversation_ids()
for i, convo_idx in enumerate(convo_ids[0:5]):
    print("sample conversation {}:".format(i))
    print(friends_corpus.get_conversation(convo_idx).get_utterance_ids())

sample conversation 0:
['s01_e01_c01_u001', 's01_e01_c01_u002', 's01_e01_c01_u003', 's01_e01_c01_u004', 's01_e01_c01_u006', 's01_e01_c01_u007', 's01_e01_c01_u008', 's01_e01_c01_u010', 's01_e01_c01_u011', 's01_e01_c01_u012', 's01_e01_c01_u013', 's01_e01_c01_u014', 's01_e01_c01_u015', 's01_e01_c01_u016', 's01_e01_c01_u017', 's01_e01_c01_u018', 's01_e01_c01_u019', 's01_e01_c01_u021', 's01_e01_c01_u022', 's01_e01_c01_u023', 's01_e01_c01_u024', 's01_e01_c01_u025', 's01_e01_c01_u026', 's01_e01_c01_u027', 's01_e01_c01_u028', 's01_e01_c01_u029', 's01_e01_c01_u030', 's01_e01_c01_u031', 's01_e01_c01_u032', 's01_e01_c01_u033', 's01_e01_c01_u034', 's01_e01_c01_u035', 's01_e01_c01_u036', 's01_e01_c01_u037', 's01_e01_c01_u038', 's01_e01_c01_u039', 's01_e01_c01_u040', 's01_e01_c01_u041', 's01_e01_c01_u042', 's01_e01_c01_u044', 's01_e01_c01_u045', 's01_e01_c01_u047', 's01_e01_c01_u048', 's01_e01_c01_u049', 's01_e01_c01_u050', 's01_e01_c01_u051', 's01_e01_c01_u052', 's01_e01_c01_u053', 's01_e01_c01_u05

Summary stats for the corpus:

In [0]:
friends_corpus.print_summary_stats()

Number of Users: 699
Number of Utterances: 61338
Number of Conversations: 3099


## Adding corpus-level metadata

We add the name of the corpus.

In [0]:
friends_corpus.meta['name'] = 'Friends Dataset'

## Creating the corpus dump

If working in a locally mounted notebook:

In [0]:
friends_corpus.dump("friends-corpus", base_path = "YOUR_BASE_PATH/datasets/friends-corpus")

If working in Google Colab, first mount your Google Drive, then dump:

In [0]:
from zipfile import ZipFile
import os
import google
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
friends_corpus.dump("friends-corpus", base_path = "gdrive/My Drive/F19-CS6742/")

In [0]:
# Zip the dumped corpus; the directory must match the base_path passed to dump() above
corpus_dir = "gdrive/My Drive/F19-CS6742/friends-corpus"
with ZipFile(corpus_dir + ".zip", 'w') as zip_f:
  for fname in os.listdir(corpus_dir):
    zip_f.write(os.path.join(corpus_dir, fname))
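
As an alternative to writing each file by hand, the standard library's `shutil.make_archive` can zip a whole directory in one call. The following is a minimal sketch, demonstrated on a throwaway directory with placeholder files rather than the real corpus dump; `zip_directory` is a hypothetical helper, and the placeholder file names are for illustration only.

```python
import os
import shutil
import tempfile
from zipfile import ZipFile


def zip_directory(dir_path):
    """Create <dir_path>.zip containing the directory; return the archive path."""
    dir_path = os.path.abspath(dir_path)
    parent, name = os.path.split(dir_path)
    # root_dir/base_dir keep archive paths relative to the parent folder
    return shutil.make_archive(dir_path, 'zip', root_dir=parent, base_dir=name)


# Demo on a temporary directory with placeholder files
tmp = tempfile.mkdtemp()
corpus_dir = os.path.join(tmp, 'friends-corpus')
os.makedirs(corpus_dir)
for fname in ('utterances.json', 'users.json'):
    with open(os.path.join(corpus_dir, fname), 'w') as f:
        f.write('{}')

archive = zip_directory(corpus_dir)
with ZipFile(archive) as zf:
    names = sorted(zf.namelist())
print(names)
```

With this approach the archive entries are stored relative to the corpus folder's parent (for example `friends-corpus/utterances.json`) rather than under the full Drive path.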