# Converting the Friends dataset into ConvoKit format

This notebook describes how we converted the Friends dataset (https://github.com/emorynlp/character-mining) into a Corpus with ConvoKit.

In [2]:
!pip3 install convokit
!python3 -m spacy download en

Collecting convokit
Collecting scikit-learn>=0.20.0 (from convokit)
  Using cached https://files.pythonhosted.org/packages/e9/57/8a9889d49d0d77905af5a7524fb2b468d2ef5fc723684f51f5ca63efed0d/scikit_learn-0.21.3-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Collecting dill==0.2.9 (from convokit)
Collecting pandas>=0.23.4 (from convokit)
  Using cached https://files.pythonhosted.org/packages/39/73/99aa822ee88cef5829607217c11bf24ecc1171ae5d49d5f780085f5da518/pandas-0.25.1-cp37-cp37m-macosx_10_9_x86_64.macosx_10_10_x86_64.whl
Collecting spacy==2.0.12 (from convokit)
Collecting msgpack-numpy==0.4.3.2 (from convokit)
  Using cached https://files.pythonhosted.org/packages/ad/45/464be6da85b5ca893cfcbd5de3b31a6710f636ccb8521b17bd4110a08d94/msgpack_numpy-0.4.3.2-py2.py3-none-any.whl
Collecting matplotlib>=3.0.0 (from convokit)
  Using cached https://files.pythonhosted.org/packages/c3/8b/af9e0984f5c0df06d3fab0bf396eb09cbf05f8452de4e950

In [4]:
import requests
import json
from tqdm import tqdm
from convokit import Corpus, User, Utterance

## The Friends Dataset

The original dataset (https://github.com/emorynlp/character-mining) contains 10 JSON files, each holding the complete transcript of one season of <i>Friends</i>. Because the data are available as JSON in the GitHub repo, we download them directly with the `requests` module; you do not need to download the raw files yourself to run this notebook.
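As a rough sketch of the download step, the helper below builds the URL for one season file and parses it. The base URL and file-name pattern come from the cells in this notebook; the function names `season_url` and `load_season` are hypothetical, and the sketch uses the standard-library `urllib` in place of `requests` only so that it has no third-party dependency.

```python
import json
from urllib.request import urlopen

# Base URL copied from the notebook's download cells.
BASE_URL = "https://raw.githubusercontent.com/emorynlp/character-mining/master/json/"


def season_url(season: int) -> str:
    """Return the raw-JSON URL for a season number between 1 and 10."""
    if not 1 <= season <= 10:
        raise ValueError("season must be between 1 and 10")
    # Season numbers are zero-padded to two digits in the file names.
    return f"{BASE_URL}friends_season_{season:02d}.json"


def load_season(season: int) -> dict:
    """Download and parse one season file (hypothetical helper; needs network access)."""
    with urlopen(season_url(season)) as response:
        return json.load(response)


# Building the URL does not touch the network.
print(season_url(1))
```

Calling `load_season(1)` should return the parsed season, whose `'episodes'` key is what the cells in this notebook iterate over.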

## Gather information about the corpus
The **corpus.json** file records the number of episodes, scenes, utterances, and speakers in the dataset.
When counting utterances, we ignore those that have no speakers.

In [5]:
num_episodes = 0
num_scenes = 0
num_utterances = 0
speakers = set()
base_url = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/'
for i in range(1, 11):
  # Season files are zero-padded: friends_season_01.json ... friends_season_10.json
  json_file = base_url + 'friends_season_{:02d}.json'.format(i)
  r = requests.get(json_file)

  season = json.loads(r.text)
  episodes = season['episodes']
  num_episodes += len(episodes)
  for episode in episodes:
    scenes = episode['scenes']
    num_scenes += len(scenes)
    for scene in scenes:
      for utterance in scene['utterances']:
        speaker = utterance['speakers']
        # Track every speaker name, including co-speakers of a shared line
        speakers.update(speaker)
        # Only utterances with at least one speaker are counted
        if speaker:
          num_utterances += 1
corpus = {'friends': 'friends corpus', 'num_episodes': num_episodes, 'num_scenes': num_scenes, 'num_utterances': num_utterances, 'num_speakers': len(speakers)}

In [0]:
print(corpus)

{'friends': 'friends corpus', 'num_episodes': 236, 'num_scenes': 3107, 'num_utterances': 61338, 'num_speakers': 700}
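The counting rule used above (every scene and episode is counted, but an utterance only counts if it has at least one speaker) can be illustrated on a tiny hand-made structure. The data below is invented for illustration and only mirrors the keys the notebook reads; `summarize` is a hypothetical helper, not part of ConvoKit.

```python
# Invented stand-in for one season file; ids and names are made up.
toy_season = {
    "episodes": [
        {
            "episode_id": "e1",
            "scenes": [
                {
                    "utterances": [
                        {"utterance_id": "u1", "speakers": ["Monica Geller"]},
                        {"utterance_id": "u2", "speakers": []},  # no speaker: not counted
                        {"utterance_id": "u3", "speakers": ["Ross Geller", "Rachel Green"]},
                    ]
                },
                {"utterances": [{"utterance_id": "u4", "speakers": []}]},
            ],
        }
    ]
}


def summarize(seasons):
    """Count episodes, scenes, spoken utterances, and distinct speakers."""
    num_episodes = num_scenes = num_utterances = 0
    speakers = set()
    for season in seasons:
        for episode in season["episodes"]:
            num_episodes += 1
            for scene in episode["scenes"]:
                num_scenes += 1
                for utterance in scene["utterances"]:
                    speakers.update(utterance["speakers"])
                    if utterance["speakers"]:
                        num_utterances += 1
    return {
        "num_episodes": num_episodes,
        "num_scenes": num_scenes,
        "num_utterances": num_utterances,
        "num_speakers": len(speakers),
    }


summary = summarize([toy_season])
print(summary)
```

In this toy example a shared line (`u3`) counts as one utterance but contributes two speakers, which is the same behavior as the cell above.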


## Generating user information
The dataset does not come with separate user information, so we extract speakers from the utterances themselves. For each user, we record the episode in which they first appear and guess their gender from their first name using the `gender_guesser` module.

For each user, we create an object with:

- <b>name:</b> the name of the character
- <b>first_appearance:</b> the episode in which the character first appears
- <b>gender:</b> the character's gender, as guessed by the `gender_guesser` module from the character's first name

In [6]:
! pip3 install gender_guesser
import gender_guesser.detector as gender
d = gender.Detector()

Collecting gender_guesser
  Downloading https://files.pythonhosted.org/packages/13/fb/3f2aac40cd2421e164cab1668e0ca10685fcf896bd6b3671088f8aab356e/gender_guesser-0.4.0-py2.py3-none-any.whl (379kB)
Installing collected packages: gender-guesser
Successfully installed gender-guesser-0.4.0


In [39]:
users = {}
base_url = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/'
for i in tqdm(range(1, 11)):
  json_file = base_url + 'friends_season_{:02d}.json'.format(i)
  r = requests.get(json_file)

  season = json.loads(r.text)
  for episode in season['episodes']:
    for scene in episode['scenes']:
      for utterance in scene['utterances']:
        for speaker in utterance['speakers']:
          # Record a speaker only the first time they appear
          if speaker not in users:
            users[speaker] = {
              'first_appearance': episode['episode_id'],
              # Guess gender from the first token of the character's name
              'gender': d.get_gender(speaker.split()[0])
            }

100%|██████████| 10/10 [00:03<00:00,  2.62it/s]


As a sanity check, we confirm that the six main characters are assigned the expected genders (note that `gender_guesser` can return hedged labels such as `mostly_male`):

In [40]:
print("number of users in the data = {}/700".format(len(users)))
print("Monica Geller object: ", users["Monica Geller"])
print("Joey Tribbiani object: ", users["Joey Tribbiani"])
print("Chandler Bing object: ", users["Chandler Bing"])
print("Phoebe Buffay object: ", users["Phoebe Buffay"])
print("Ross Geller object: ", users["Ross Geller"])
print("Rachel Green object: ", users["Rachel Green"])

number of users in the data = 700/700
Monica Geller object:  {'first_appearance': 's01_e01', 'gender': 'female'}
Joey Tribbiani object:  {'first_appearance': 's01_e01', 'gender': 'male'}
Chandler Bing object:  {'first_appearance': 's01_e01', 'gender': 'mostly_male'}
Phoebe Buffay object:  {'first_appearance': 's01_e01', 'gender': 'female'}
Ross Geller object:  {'first_appearance': 's01_e01', 'gender': 'male'}
Rachel Green object:  {'first_appearance': 's01_e01', 'gender': 'female'}
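The first-appearance logic can be sketched without downloading anything. In the sketch below the episodes are invented, `build_users` is a hypothetical helper, and `stub_guess` is a stand-in for `d.get_gender` so that the example does not depend on `gender_guesser` being installed.

```python
def build_users(episodes, guess_gender):
    """Map each speaker to the episode of first appearance and a guessed gender."""
    users = {}
    for episode in episodes:
        for scene in episode["scenes"]:
            for utterance in scene["utterances"]:
                for speaker in utterance["speakers"]:
                    # Later appearances never overwrite the first one.
                    if speaker not in users:
                        users[speaker] = {
                            "first_appearance": episode["episode_id"],
                            # Only the first token of the name is passed to the guesser.
                            "gender": guess_gender(speaker.split()[0]),
                        }
    return users


def stub_guess(first_name):
    """Stand-in for a real gender guesser; the lookup table is invented."""
    return {"Monica": "female", "Ross": "male"}.get(first_name, "unknown")


# Invented episodes that mirror the keys the notebook reads.
toy_episodes = [
    {"episode_id": "e1", "scenes": [{"utterances": [{"speakers": ["Monica Geller"]}]}]},
    {
        "episode_id": "e2",
        "scenes": [
            {
                "utterances": [
                    {"speakers": ["Monica Geller", "Ross Geller"]},
                    {"speakers": ["A Waiter"]},
                ]
            }
        ],
    },
]

toy_users = build_users(toy_episodes, stub_guess)
print(toy_users)
```

Because only the first token is used, a name such as `A Waiter` is looked up as `A`, so generic role names are unlikely to receive a meaningful guess.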


We then create a User object for each unique character in the dataset.

In [41]:
corpus_users = {k: User(name=k, meta=v) for k,v in users.items()}

In [42]:
print(corpus_users['Monica Geller'].name)
print(corpus_users['Monica Geller'].meta)

Monica Geller
{'first_appearance': 's01_e01', 'gender': 'female'}


## Generating Utterances

We then loop through the data to generate a list of all utterances in the series. To align with the Utterance schema ConvoKit expects, we construct for each utterance:

- **id:** unique id of the utterance; we reuse the dataset's `utterance_id`

- **user:** the user who authored the utterance; in our case, the first listed speaker

- **root:** id of the conversation root of the utterance; in our case, the id of the first utterance in the scene

- **reply_to:** id of the utterance that this utterance replies to; None if the utterance is not a reply. To simplify, we treat each utterance as a reply to the previous utterance in the same scene.

- **timestamp:** time of the utterance; we set this to None

- **text:** textual content of the utterance

In [43]:
all_utterances = {}
base_url = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/'

for i in tqdm(range(1, 11)):
  json_file = base_url + 'friends_season_{:02d}.json'.format(i)
  r = requests.get(json_file)

  season = json.loads(r.text)
  for episode in season['episodes']:
    for scene in episode['scenes']:
      utterances = scene['utterances']
      # The first utterance of the scene is the conversation root.
      # Note: if it has no speakers it is skipped below, so the root id
      # may refer to an utterance that is not in the corpus.
      root = utterances[0]
      prev_utt = None
      for utterance in utterances:
        speaker = utterance['speakers']
        # Skip utterances with no speakers and break the reply chain
        if len(speaker) == 0:
          prev_utt = None
          continue
        all_utterances[utterance['utterance_id']] = Utterance(
            id=utterance['utterance_id'],
            user=corpus_users[speaker[0]],  # only the first listed speaker is used
            root=root['utterance_id'],
            reply_to=prev_utt,
            timestamp=None,
            text=utterance['transcript']
        )
        prev_utt = utterance['utterance_id']


100%|██████████| 10/10 [00:04<00:00,  2.31it/s]
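The root and reply-chain logic of the cell above can be sketched with plain dictionaries in place of ConvoKit `Utterance` objects. The scene below is invented, and `link_scene` is a hypothetical helper written for illustration only.

```python
def link_scene(utterances):
    """Turn one scene's raw utterances into records with root and reply_to ids."""
    records = []
    if not utterances:
        return records
    # Every record in the scene shares the id of the scene's first utterance as root.
    root_id = utterances[0]["utterance_id"]
    prev_id = None
    for utt in utterances:
        if not utt["speakers"]:
            # A speakerless line is dropped and breaks the reply chain.
            prev_id = None
            continue
        records.append(
            {
                "id": utt["utterance_id"],
                "user": utt["speakers"][0],  # first listed speaker only
                "root": root_id,
                "reply_to": prev_id,
                "text": utt["transcript"],
            }
        )
        prev_id = utt["utterance_id"]
    return records


# Invented scene; ids, names, and text are made up.
toy_scene = [
    {"utterance_id": "u1", "speakers": ["Ross Geller"], "transcript": "Hi."},
    {"utterance_id": "u2", "speakers": ["Rachel Green"], "transcript": "Hey."},
    {"utterance_id": "u3", "speakers": [], "transcript": ""},
    {"utterance_id": "u4", "speakers": ["Ross Geller"], "transcript": "Bye."},
]

toy_records = link_scene(toy_scene)
for record in toy_records:
    print(record["id"], "root:", record["root"], "reply_to:", record["reply_to"])
```

Here `u2` replies to `u1`, while `u4` has no `reply_to` because the speakerless `u3` reset the chain; all three records still share the root `u1`.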


In [44]:
print("This corpus has {}/61338 utterances".format(len(all_utterances)))

This corpus has 61338/61338 utterances


## Creating the corpus from a list of utterances

We now create the corpus from our dict of utterances.

In [45]:
utterance_list = list(all_utterances.values())

In [46]:
friends_corpus = Corpus(utterances=utterance_list, version=1)

In [47]:
print("number of conversations in the dataset={}".format(len(friends_corpus.get_conversation_ids())))

number of conversations in the dataset=3099
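One way to reason about the conversation count is to group utterances by their root id. The sketch below assumes that conversations correspond to distinct root ids, which is our reading of how this version of ConvoKit behaves rather than its actual implementation; `group_by_root` is a hypothetical helper and the records are invented.

```python
from collections import defaultdict


def group_by_root(records):
    """Group utterance ids under the root id they point to."""
    conversations = defaultdict(list)
    for record in records:
        conversations[record["root"]].append(record["id"])
    return dict(conversations)


# Invented records: two scenes produced utterances, so two roots appear.
toy_utterances = [
    {"id": "u1", "root": "u1"},
    {"id": "u2", "root": "u1"},
    {"id": "u5", "root": "u5"},
]

toy_conversations = group_by_root(toy_utterances)
print(len(toy_conversations), "conversations:", toy_conversations)
```

Under this assumption, a scene whose utterances all lack speakers contributes no records and therefore no conversation, which would be consistent with the corpus having fewer conversations than the dataset has scenes.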


In [18]:
friends_corpus.get_usernames()

{'#ALL#',
 '1st Customer',
 'A Casino Boss',
 'A Crew Member',
 'A Disembodied Voice',
 'A Drunken Gambler',
 'A Female Student',
 'A Male Customer',
 'A Student',
 'A Tourist',
 'A Waiter',
 'A Woman',
 'Actor',
 'Adoption Agency Guy',
 'Adrienne',
 'Agency Guy',
 'Air Hostess',
 'Air Stewardess',
 'Airline Employee',
 'Alan',
 'Alex',
 'Alexandra Steele',
 'Alice Knight',
 'Alison',
 'Allesandro',
 "Amanda (Ross' date)",
 'Amanda Buffamonteezi',
 'Amber',
 'Amy Green',
 'Anchorwoman',
 'Andrea',
 'Andrea Waltham',
 'Angela Delveccio',
 'Annabelle',
 'Announcement',
 'Announcer',
 'Another Extra',
 "Another Man's Voice",
 'Another Scientist',
 'Another Tour Guide',
 'Answering Machine',
 'Anxious Wedding Guest',
 'Arthur',
 'Ashley',
 'Assistant',
 'Attendant',
 'Aunt Iris',
 'Aunt Lillian',
 'Aunt Lisa',
 'Aunt Millie',
 'Aurora',
 'Ballerina',
 'Bandleader',
 'Bank Officer',
 'Barry Farber',
 'Bass Singer',
 'Ben Geller',
 'Benjamin Hobart',
 'Bernice',
 'Best Man',
 'Big Bully',
 ...

In [48]:
friends_corpus.print_summary_stats()

Number of Users: 699
Number of Utterances: 61338
Number of Conversations: 3099


## Create the corpus dump

In [53]:
friends_corpus.dump("corpus", base_path="/Users/emilytseng/Cornell-Conversational-Analysis-Toolkit/datasets/friends-corpus")

In [52]:
pwd

'/Users/emilytseng/Cornell-Conversational-Analysis-Toolkit/datasets/friends-corpus'