## Converting the CANDOR Corpus into ConvoKit format 

This notebook helps people working with the CANDOR Corpus convert it quickly into ConvoKit format.
You can request the CANDOR Corpus from https://betterup-data-requests.herokuapp.com/ and then run this notebook to get the corpus in ConvoKit format.

Details about the construction of the corpus are available in the original paper (please cite this paper if you use the corpus):
[Andrew Reece et al., The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation. Sci. Adv. 9, eadf3197 (2023)](https://www.science.org/doi/10.1126/sciadv.adf3197).


In [1]:
from tqdm import tqdm
from convokit import Corpus, Speaker, Utterance
from collections import defaultdict, Counter
import pandas as pd
import numpy as np

Below, set CANDOR_PATH to the local directory that holds your copy of the CANDOR Corpus. That directory should contain three files (survey.tsv and two transcript files, transcript_backbiter.tsv and transcript_cliffhanger.tsv) and one subdirectory, raw, which contains the transcripts of every conversation as produced by all three methods.
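Before running the rest of the notebook, it can help to confirm that the directory has the expected layout. The following is a minimal sketch, not part of the original pipeline: the helper name `missing_candor_entries` is hypothetical, and the file names are taken from the directory listing shown in this notebook.

```python
import os
import tempfile

# File names as they appear in the directory listing in this notebook.
EXPECTED_FILES = ["survey.tsv", "transcript_backbiter.tsv", "transcript_cliffhanger.tsv"]


def missing_candor_entries(candor_path):
    """Return the expected files/directories that are absent from candor_path."""
    missing = [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(candor_path, name))
    ]
    if not os.path.isdir(os.path.join(candor_path, "raw")):
        missing.append("raw/")
    return missing


# Demo on a throwaway directory that only contains survey.tsv and raw/.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "survey.tsv"), "w").close()
    os.mkdir(os.path.join(demo_dir, "raw"))
    demo_missing = missing_candor_entries(demo_dir)

print(demo_missing)
```

On a real copy of the corpus, `missing_candor_entries(CANDOR_PATH)` should return an empty list if the layout matches the description above.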

In [2]:
# replace the directory with where your CANDOR corpus is saved
CANDOR_PATH = '<YOUR CANDOR CORPUS DIRECTORY>'
# ls: raw  survey.tsv  transcript_backbiter.tsv  transcript_cliffhanger.tsv

Next, pick the transcription type. The default is "cliffhanger", which in our experience gives the easiest-to-read version of the transcripts.

According to the paper, transcripts are parsed into speaker turns by three different algorithms: Audiophile, Cliffhanger, and Backbiter. Please refer to the paper for a more detailed description of how the three algorithms are implemented.

- Audiophile: A turn lasts from when one speaker starts talking until the other speaker starts speaking.
- Cliffhanger: A turn is one full sentence said by one speaker, delimited by terminal punctuation marks (periods, question marks, and exclamation points).
- Backbiter: A turn lasts from when one speaker starts talking until the other speaker says a non-backchannel word (example backchannel words: "mhm", "yeah", "exactly", etc.).

Note that the Utterance-level metadata differs depending on which algorithm was used to process the transcripts.
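To make the difference between the Audiophile and Backbiter definitions concrete, here is a toy sketch. It is not the paper's implementation: the segment format, the function names, and the tiny backchannel set are all assumptions made for illustration, and dropping backchannels entirely is a simplification.

```python
# Hypothetical, deliberately tiny backchannel vocabulary.
BACKCHANNELS = {"mhm", "yeah", "exactly"}


def merge_same_speaker(segments):
    """Audiophile-like: a turn ends whenever the other speaker starts talking."""
    turns = []
    for speaker, text in segments:
        if turns and turns[-1][0] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


def is_backchannel(text):
    """True if the whole segment is a single backchannel word."""
    return text.lower().strip(" .,!?") in BACKCHANNELS


def backbiter_like(segments):
    """Backbiter-like: backchannel-only segments do not end the current turn."""
    kept = [(s, t) for s, t in segments if not is_backchannel(t)]
    return merge_same_speaker(kept)


segments = [
    ("A", "So I went"),
    ("B", "Mhm."),
    ("A", "to the store."),
    ("B", "Which one?"),
]
audiophile_result = merge_same_speaker(segments)
backbiter_result = backbiter_like(segments)
print(len(audiophile_result), backbiter_result)
```

In this toy input, the "Mhm." splits speaker A's sentence into two turns under the Audiophile-like rule, while the Backbiter-like rule keeps it as one turn.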


In [3]:
TRANSCRIPTION_TYPE = "cliffhanger" # or "backbiter" or "audiophile"

In [4]:
import os
# os.path.join works whether or not CANDOR_PATH ends with a path separator
survey_path = os.path.join(CANDOR_PATH, "survey.tsv")
survey = pd.read_csv(survey_path, delimiter='\t')

In [None]:
# Convert the date strings into datetime objects in one vectorized step
# (this avoids chained assignment such as survey['date'][i] = ...)
survey['date'] = pd.to_datetime(survey['date'], format="%Y-%m-%d")

In [6]:
survey = survey.sort_values(by='date')

In [7]:
survey.head(2)

Unnamed: 0.1,Unnamed: 0,user_id,partner_id,convo_id,date,survey_duration_in_seconds,time_zone,pre_affect,pre_arousal,technical_quality,...,my_conscientious,my_neurotic,my_open,your_extraversion,your_agreeable,your_conscientious,your_neurotic,your_open,who_i_talked_to_most_past24,most_common_format_past24
22,0,5ad7c075c25ea0000188486b,5de5538f8fde1c4dbc951498,01849238-f5f0-487e-bca4-7b4fe0c9625c,2020-01-07 00:00:00,6001,,7.0,7.0,2.0,...,2.333333,3.0,4.333333,4.333333,3.666667,3.333333,2.333333,4.0,,
471,1,5d0a731abf09e10001c3f3d8,5acba4a15cd10500016280ca,2673c393-67e0-4880-b56f-d6b2696f1631,2020-01-07 00:00:00,2711,,6.0,5.0,1.0,...,4.0,3.0,4.0,1.666667,4.0,3.333333,3.666667,3.333333,,


We create a speaker list by extracting all speakers from the survey, which every participant filled out before and after each video call during the experiment. Completing the survey was a required part of the CANDOR data collection.

In [8]:
all_speakers = list(set(survey['user_id'].to_list() + survey['partner_id'].to_list()))
len(all_speakers)

1454

In [9]:
corpus_speakers = {k: Speaker(id = k, meta = {}) for k in all_speakers}

In [10]:
print("number of speakers in the data = {}".format(len(corpus_speakers)))

number of speakers in the data = 1454


### Creating Utterance List

Now we extract all utterances from the corpus, conversation by conversation. Each conversation is stored as its own folder inside the "raw" folder, so we go through these folders and process them one by one.

Note that there are three versions of the transcript for each conversation, corresponding to three different ways of processing the audio transcriptions. For consistency, we recommend sticking with one type of transcription when constructing the corpus. The three types are Audiophile, Cliffhanger, and Backbiter. Set the TRANSCRIPTION_TYPE variable to your intended processing method. Refer to the paper for details on each transcription processing method.

CANDOR corpus paper: https://www.science.org/doi/epdf/10.1126/sciadv.adf3197
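As a small illustration of how each transcript file is located, the sketch below builds the path for one conversation and rejects misspelled transcription types early. The helper name `transcript_path` is hypothetical; the folder structure (`raw/<convo_id>/transcription/transcript_<type>.csv`) follows the paths used in this notebook.

```python
import os

VALID_TYPES = ("audiophile", "cliffhanger", "backbiter")


def transcript_path(candor_path, convo_id, transcription_type):
    """Build the path of one conversation's transcript for the chosen type."""
    if transcription_type not in VALID_TYPES:
        raise ValueError(
            f"unknown transcription type {transcription_type!r}; "
            f"expected one of {VALID_TYPES}"
        )
    return os.path.join(
        candor_path, "raw", convo_id, "transcription",
        f"transcript_{transcription_type}.csv",
    )


# "candor" and "abc" are placeholder values, not real corpus entries.
example_path = transcript_path("candor", "abc", "cliffhanger")
print(example_path)
```

Validating the type up front turns a typo into an immediate `ValueError` rather than a file-not-found error partway through the loop over conversations.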

In [11]:
import os
conversations_list_path = os.path.join(CANDOR_PATH, "raw", "")  # the trailing "" keeps the final path separator
conversations = [d for d in os.listdir(conversations_list_path) if os.path.isdir(os.path.join(conversations_list_path, d))]
len(conversations)

1650

Note that the fields of ConvoKit Utterance objects are:
Utterance(id=..., speaker =..., conversation_id =..., reply_to=..., timestamp=..., text =..., meta =...)
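The `reply_to` field is what chains utterances together within a conversation: the first utterance of a conversation has `reply_to=None`, and every later one points to the ID of the utterance before it. The sketch below shows that chaining with plain dicts standing in for transcript rows. The function name and the toy rows are assumptions for illustration, as is the reading that `turn_id == 0` marks the first turn of a conversation.

```python
def assign_reply_chain(rows):
    """Return (utterance_id, reply_to) pairs for rows given in transcript order.

    Each row is a dict with a 'turn_id' key; turn_id 0 is assumed to mark
    the first turn of a conversation.
    """
    links = []
    for utt_index, row in enumerate(rows):
        utt_id = str(utt_index)
        # Keep None as None: str(None) would give the string "None",
        # which would look like a reply to a non-existent utterance.
        reply_to = None if row["turn_id"] == 0 else str(utt_index - 1)
        links.append((utt_id, reply_to))
    return links


# Two toy conversations with two turns each.
rows = [
    {"convo_id": "c1", "turn_id": 0},
    {"convo_id": "c1", "turn_id": 1},
    {"convo_id": "c2", "turn_id": 0},
    {"convo_id": "c2", "turn_id": 1},
]
links = assign_reply_chain(rows)
print(links)
```

Because utterance IDs are assigned from one running counter across all conversations, the reset to `None` at each conversation's first turn is what keeps reply chains from crossing conversation boundaries.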

In [12]:
utt_id_count = 0
corpus_utterances = {}
for convo_id in conversations:
    transcription_path = f"{conversations_list_path}{convo_id}/transcription/transcript_{TRANSCRIPTION_TYPE}.csv"
    transcription = pd.read_csv(transcription_path)
    for index, row in transcription.iterrows():
        # the first turn of a conversation replies to nothing; keep None rather than the string "None"
        reply_to = None if row['turn_id'] == 0 else str(utt_id_count - 1)
        # every transcript column except the speaker and the text becomes utterance metadata
        meta = {k: v for k, v in row.items() if k not in ("speaker", "utterance")}
        utt = Utterance(id=str(utt_id_count), speaker=corpus_speakers[row['speaker']], conversation_id=str(convo_id), reply_to=reply_to, timestamp=row['start'], text=row['utterance'], meta=meta)
        corpus_utterances[utt_id_count] = utt
        utt_id_count += 1

print("total number of utterances: ", utt_id_count)

527869


In [13]:
utterance_list = list(corpus_utterances.values())

In [14]:
CANDOR_corpus = Corpus(utterances=utterance_list)

### Updating Conversation Info

Here we update the conversation info, in particular the metadata from the surveys that participants filled out. The metadata for each conversation corresponds to the answers the two speakers gave in the surveys before and after that conversation.

Each conversation is a video call between two people, and each participant completed one survey, so we have two surveys per conversation. We organize the metadata as follows:

convo.meta = {"survey field name" : {"sp_A id" : "sp_A survey value", "sp_B id" : "sp_B survey value"} ... }

We chose this way of organizing the metadata because we usually focus on a few survey fields and analyze the values from both participants. This format allows us to extract that information quickly. Feel free to modify the format to suit your research or work needs.
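To show how this layout is read back, here is a minimal sketch using a plain dict in place of `convo.meta`. The speaker IDs and values are made up for illustration, the two field names are borrowed from the survey columns, and the helper names are hypothetical.

```python
# Toy stand-in for convo.meta; IDs and values are invented.
convo_meta = {
    "pre_affect": {"speaker_A": 7.0, "speaker_B": 6.0},
    "technical_quality": {"speaker_A": 2.0, "speaker_B": 1.0},
}


def field_for_speaker(meta, field, speaker_id):
    """Value that speaker_id reported for the given survey field."""
    return meta[field][speaker_id]


def field_for_partner(meta, field, speaker_id):
    """Value that the other participant reported for the given survey field."""
    values = meta[field]
    partner_ids = [sid for sid in values if sid != speaker_id]
    if len(partner_ids) != 1:
        raise ValueError("expected exactly one partner in the conversation")
    return values[partner_ids[0]]


own_value = field_for_speaker(convo_meta, "pre_affect", "speaker_A")
partner_value = field_for_partner(convo_meta, "pre_affect", "speaker_A")
print(own_value, partner_value)
```

Keying by field first and speaker second means that comparing both participants on one survey question is a single dictionary lookup.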

The ConvoKit metadata preserves the field names from the survey of the original experiment, which are explained in the paper. For more explanation of what each survey field means, refer to the BetterUp CANDOR Corpus Data Dictionary: https://docs.google.com/spreadsheets/d/1ADoaajRsw63WpM3zS2xyGC1YS5WM_IuhFZ94W84DDls/edit#gid=997152539

In [15]:
print("number of conversations in the dataset = {}".format(len(CANDOR_corpus.get_conversation_ids())))

number of conversations in the dataset = 1650


Below we see what the surveys from the two participants of a random conversation look like.

In [16]:
convo = CANDOR_corpus.random_conversation()
survey[survey["convo_id"] == convo.id]

Unnamed: 0.1,Unnamed: 0,user_id,partner_id,convo_id,date,survey_duration_in_seconds,time_zone,pre_affect,pre_arousal,technical_quality,...,my_conscientious,my_neurotic,my_open,your_extraversion,your_agreeable,your_conscientious,your_neurotic,your_open,who_i_talked_to_most_past24,most_common_format_past24
1562,0,5b6cb3049ee1a50001c5ef0f,5d3356323e1190001909339b,79912e48-076a-4cb0-82e2-e43a932d0cd3,2020-08-12 00:00:00,3723,8.0,5.0,1.0,1.0,...,3.333333,4.0,4.666667,3.333333,3.666667,3.0,2.666667,4.0,,
1563,1,5d3356323e1190001909339b,5b6cb3049ee1a50001c5ef0f,79912e48-076a-4cb0-82e2-e43a932d0cd3,2020-08-12 00:00:00,6264,5.0,7.0,7.0,1.0,...,2.333333,3.0,3.0,3.333333,3.333333,2.333333,2.333333,4.0,,


In [17]:
# add conversation level metadata
for convo in CANDOR_corpus.iter_conversations():
    convo_id = convo.id
    # the two survey rows (one per participant) for this conversation
    convo_rows = survey[survey['convo_id'] == convo_id]
    row1 = convo_rows.iloc[0]
    row2 = convo_rows.iloc[1]
    sp_A = row1['user_id']
    sp_B = row2['user_id']
    metadata = {}
    # skip the first column as well as convo_id and user_id
    for field in list(row1.index[1:]):
        if field != "convo_id" and field != 'user_id':
            metadata[field] = {sp_A : row1[field], sp_B : row2[field]}
    convo.meta = metadata
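The loop above indexes the first and second survey rows of each conversation, so it relies on every conversation having two survey rows. A minimal sketch for checking that assumption is shown below; the helper name is hypothetical and the toy IDs are invented.

```python
from collections import Counter


def convos_without_two_surveys(convo_ids):
    """Map each conversation ID to its survey-row count when that count is not 2."""
    counts = Counter(convo_ids)
    return {cid: n for cid, n in counts.items() if n != 2}


# Toy input: c1 is fine, c2 has one survey row, c3 has three.
demo_ids = ["c1", "c1", "c2", "c3", "c3", "c3"]
problems = convos_without_two_surveys(demo_ids)
print(problems)
```

On the real data this could be called as `convos_without_two_surveys(survey['convo_id'])`. It only sees conversations that appear in the survey at least once, so a conversation with no survey rows at all would need a separate check against the corpus's conversation IDs.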

Most survey fields are conversation-specific, meaning the speaker answered them based on their experience of that particular conversation. However, we also extract some demographic information about each speaker from the survey and store it in the speaker-level metadata as well (see the sp_meta_lst list for the selected speaker-level fields).

Note: In a few cases, participants report conflicting values across different surveys for the demographic questions. In those cases, the metadata in the Speaker reflects the first non-NaN answer given. The per-survey answers are available in the conversation metadata.
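The "first non-NaN answer" rule can be sketched in isolation as follows. This is a standalone illustration with hypothetical helper names and made-up values; it assumes the answers are already ordered by survey date.

```python
import math


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def first_valid(values, default=float("nan")):
    """Return the first non-missing value, or `default` if all are missing."""
    for value in values:
        if not is_missing(value):
            return value
    return default


# A participant who skipped the question once, then gave conflicting ages.
reported_ages = [float("nan"), 21.0, 22.0]
chosen_age = first_valid(reported_ages)
print(chosen_age)
```

If every answer is missing, the function falls back to NaN, which mirrors how the speaker metadata is filled when no valid answer exists.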

In [60]:
sp_meta_lst = ['sex', 'politics', 'race', 'edu', 'employ', 'employ_7_TEXT', 'age']

for sp in CANDOR_corpus.iter_speakers():
    df = survey[survey['user_id'] == sp.id]
    df = df.sort_values(by='date')
    for field_name in sp_meta_lst:
        field_values = df[field_name].tolist()
        valid = False
        for x in field_values:
            if not valid:
                if not pd.isna(x):
                    sp.add_meta(field_name, x)
                    valid = True
        if not valid:
            sp.add_meta(field_name, np.nan)

In particular, the 'employ' survey field is a multiple-choice question whose answers are coded as numbers, in the order of the following list. We replace each number with the corresponding answer label. Participants who answered 'other' for the 'employ' field were asked to fill in 'employ_7_TEXT', a free-text answer. For them, we replace 'employ' with the answer from 'employ_7_TEXT'.

In [62]:
employment_lst = ['employed', 'unemployed', 'temp_leave', 'disabled', 'retired', 'homemaker', 'other']

for sp in CANDOR_corpus.iter_speakers():
    if pd.isna(sp.meta['employ']):
        continue
    employ = int(sp.meta['employ'])
    if employ != 7:
        # assumes the survey codes are 1-based (7 = 'other', the last list entry),
        # so shift by one to get a 0-based list index
        sp.meta['employ'] = employment_lst[employ - 1]
    else:
        sp.meta['employ'] = sp.meta['employ_7_TEXT'] if not pd.isna(sp.meta['employ_7_TEXT']) else 'other'

In [71]:
CANDOR_corpus.delete_metadata(obj_type='speaker', attribute='employ_7_TEXT')

In [72]:
sp = CANDOR_corpus.random_speaker()
sp.meta

ConvoKitMeta({'sex': 'male', 'politics': 6.0, 'race': 'white', 'edu': 'some_college', 'employ': 'temp_leave', 'age': 21.0})

### Save the Corpus

In [None]:
SAVE_PATH = '<YOUR DIRECTORY TO SAVE CORPUS>'
CANDOR_corpus.dump(f"CANDOR-corpus-{TRANSCRIPTION_TYPE}", base_path=SAVE_PATH)

In [None]:
from convokit import meta_index
meta_index(filename = f"{SAVE_PATH}/CANDOR-corpus-{TRANSCRIPTION_TYPE}")

### Retrieve Corpus

In [None]:
my_CANDOR_corpus = Corpus(filename=f"{SAVE_PATH}/CANDOR-corpus-{TRANSCRIPTION_TYPE}")

In [None]:
my_CANDOR_corpus.print_summary_stats()