# Converting the Interview 2P Dataset into the ConvoKit Format

This notebook constructs a ConvoKit-formatted version of the dataset originally distributed with the following paper:

Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, and Julian McAuley. 2020. [Interview: Large-Scale Modeling of Media Dialog with Discourse Patterns and Knowledge Grounding](https://www.aclweb.org/anthology/2020.emnlp-main.653). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 8129–41.

Please cite this paper when using this corpus in your research.


**Main Contributors:** Andrea Wang, Lucy Jiang, and Rebecca Hicke

**Conversion Notebook Contributors:** Andrea Wang, Lucy Jiang, Rebecca Hicke, Yash Chatha, Sean Zhang

**Original Dataset:** [NPR Interview 2P](https://www.kaggle.com/datasets/shuyangli94/interview-npr-media-dialog-transcripts?select=utterances-2sp.csv)

Guide informed and inspired by:
* [Converting the Cornell Movie-Dialogs Corpus into ConvoKit format](https://github.com/CornellNLP/ConvoKit/blob/master/examples/converting_movie_corpus.ipynb)
* [ConvoKit Tutorial](https://colab.research.google.com/drive/1_jvL1t9PA2dERKbEm9pCnBS0sbW7B1AW?usp=sharing#scrollTo=kRu1nFlV4z-Z)

We use the following files to create our Corpus:
* **utterances.csv** contains 105k+ multi-party interview transcripts from 20 years of NPR interviews
* **utterances-2sp.csv** contains all conversations within utterances.csv that are between two participants
* **episodes.csv** contains the titles and program names for all episodes
* **host-map.json** maps each host ID to the host's name (lowercase string), a list of episodes hosted, and a list of programs hosted

## Installation and Setup

In [None]:
!pip install convokit

In [None]:
import json
import pandas as pd
from tqdm import tqdm
from collections import defaultdict
from convokit import Corpus, Speaker, Utterance

### Data Import

We uploaded and imported the dataset from Google Drive. The original dataset can be found on [Kaggle](https://www.kaggle.com/datasets/shuyangli94/interview-npr-media-dialog-transcripts?select=utterances-2sp.csv).

In [None]:
# For Colab
from google.colab import drive
drive.mount('/content/drive')
data_dir = '/content/drive/MyDrive/Assignment 1 Group Project/dataset/'

Mounted at /content/drive


In [None]:
utterances2p = pd.read_csv(data_dir + "utterances-2sp.csv")
utterances = pd.read_csv(data_dir + "utterances.csv")
episodes = pd.read_csv(data_dir + "episodes.csv")
episodes2p = utterances2p['episode'].unique()

In [None]:
print(f"There are {episodes2p.size} 2p episodes in this dataset.")

There are 23714 2p episodes in this dataset.


### Data Cleaning

Each row of **utterances.csv** is a full turn, whereas each row of **utterances-2sp.csv** is a single sentence. We therefore build the corpus from **utterances.csv**, keeping only the episodes that also appear in **utterances-2sp.csv** (that is, conversations between two participants). Additionally, we removed all utterances assigned to "_NO_SPEAKER", as these were transcriptions of non-dialogue sounds.

In [None]:
utterances = utterances[utterances['episode'].isin(episodes2p)]
utterances = utterances[utterances['speaker'] != '_NO_SPEAKER']

Some episodes in **utterances-2sp.csv** contained incorrect values for *host_id* and *is_host*. These errors usually took one of two forms: (1) miscoded hosts, where *host_id* was -1 for both participants even though -1 is the value reserved for guests, or (2) miscoded guests, where a guest was given a *host_id* other than -1.

For the first issue, we found all episodes where the sum of *host_id* over all rows equaled the number of rows in the episode multiplied by -1 (meaning every row within the episode had -1 in the *host_id* column), and removed these episodes from the dataset. For the second issue, we removed all episodes where the minimum *host_id* was not -1. We also removed all utterances with a null value.

In [None]:
utterances2p_by_ep = utterances2p.groupby(['episode'])

In [None]:
# remove episodes where hosts are mis-coded
x = utterances2p_by_ep.agg({"episode": "size", "host_id": "sum"})
remove_episode = x[x['episode'] == -1*x['host_id']].index

In [None]:
# remove episodes where guests are mis-coded
y = utterances2p_by_ep.agg({"host_id": "min"})
remove_episode = remove_episode.append(y[y['host_id'] != -1].index)

In [None]:
utterances = utterances[~utterances['episode'].isin(remove_episode)]

In [None]:
print(f"There are {len(remove_episode)} problematic 2p episodes that we have removed from this dataset.")

There are 1529 problematic 2p episodes that we have removed from this dataset.
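
The two checks are easier to follow on a small example. The sketch below applies the same aggregations to a hypothetical three-episode frame. The episode numbers and host IDs are made up for illustration, and the first check assumes that every valid *host_id* is greater than -1.

```python
import pandas as pd

# Hypothetical toy data: episode 1 is well-formed, episode 2 has no host
# (every host_id is -1), and episode 3 has a guest coded with a host ID.
toy = pd.DataFrame({
    "episode": [1, 1, 1, 2, 2, 3, 3],
    "host_id": [7, -1, 7, -1, -1, 7, 9],
})
by_ep = toy.groupby("episode")

# Check 1: if every row is -1, the sum of host_id equals -1 * (number of rows).
sizes = by_ep.size()
sums = by_ep["host_id"].sum()
no_host = sizes[sizes == -1 * sums].index

# Check 2: a well-formed episode contains a guest, so its minimum host_id is -1.
mins = by_ep["host_id"].min()
no_guest = mins[mins != -1].index

# Combine both sets of problematic episodes as plain Python ints.
bad_episodes = sorted(int(e) for e in set(no_host) | set(no_guest))
print(bad_episodes)
```

Taking the set union here avoids counting an episode twice if it were ever flagged by both checks.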


Next, we removed all episodes whose host was not included in **host-map.json** by finding the maximum *host_id* in each episode (the host's ID, since guests are coded as -1) and comparing it against the list of hosts in **host-map.json**.

In [None]:
# find all hosts represented in host-map.json
with open(data_dir + "host-map.json", "r") as f:
  host_map = json.load(f)
hosts = pd.DataFrame.from_dict(host_map, orient='index')
hosts = hosts.reset_index()
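
The sketch below shows how `DataFrame.from_dict(..., orient='index')` reshapes a mapping of this kind. The entries are hypothetical: the `name` key is used later in this notebook, but the `episodes` and `programs` key names and all of the values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical host-map entries; only the "name" key is confirmed by this notebook.
toy_host_map = {
    "12": {"name": "jane doe", "episodes": [101, 102], "programs": ["Example Program"]},
    "34": {"name": "john roe", "episodes": [103], "programs": ["Example Program"]},
}

# Each top-level key becomes a row label; reset_index() moves it into an "index" column.
toy_hosts = pd.DataFrame.from_dict(toy_host_map, orient="index").reset_index()

# JSON object keys are always strings, so cast them before comparing
# against integer host_id values.
toy_host_ids = toy_hosts["index"].astype(int).to_list()
print(toy_hosts[["index", "name"]])
```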

In [None]:
# remove episodes where hosts are not in host-map.json
z = utterances2p_by_ep.agg({"host_id": "max"})
json_hosts = hosts['index'].astype(int).to_list()
missing_hosts = z[~z['host_id'].isin(json_hosts)].index

In [None]:
utterances = utterances[~utterances['episode'].isin(missing_hosts)]

In [None]:
print(f"There are {len(missing_hosts)} episodes with hosts that are not in host-map.json that we have removed from this dataset.")

There are 775 episodes with hosts that are not in host-map.json that we have removed from this dataset.


In [None]:
print(f"There are {utterances['episode'].unique().size} 2p episodes left in this dataset.")

There are 22149 2p episodes left in this dataset.


In [None]:
# remove null utterances
utterances = utterances[utterances['utterance'].notnull()]

In [None]:
print(f"There are {len(utterances)} utterances remaining in this dataset.")

There are 428624 utterances remaining in this dataset.


## Create Speakers

We begin by determining which speaker is the host in a given conversation. While most host labels have the word "host" in the name, this is not consistent across the entire dataset. Instead, we aggregate the sentence-level rows in **utterances-2sp.csv** so that there is one row per turn. This allows us to map each *host_id* value in **utterances-2sp.csv** to its corresponding turn in **utterances.csv**.

In [None]:
# use utterances2p to find host_id (rather than use "host" in speaker_name)
utterances2p = utterances2p[~utterances2p['episode'].isin(remove_episode)]
episode_order2host_id = utterances2p.groupby(["episode", "episode_order"]).agg({"host_id": "min"}).reset_index()
utterances = utterances.merge(episode_order2host_id, how='left')
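
As a minimal sketch of this step, the toy frames below mimic the two files. The column layout is an assumption (in particular, that the turn-level file has no *host_id* column of its own), and the speaker names are invented.

```python
import pandas as pd

# Hypothetical sentence-level rows (like utterances-2sp.csv): several rows per turn.
sentences = pd.DataFrame({
    "episode":       [1, 1, 1, 1],
    "episode_order": [1, 1, 2, 2],
    "host_id":       [7, 7, -1, -1],
})

# Hypothetical turn-level rows (like utterances.csv): one row per turn, no host_id.
turns = pd.DataFrame({
    "episode":       [1, 1],
    "episode_order": [1, 2],
    "speaker":       ["JANE DOE, HOST", "JOHN ROE"],
})

# Collapse sentences to one host_id per (episode, episode_order) turn.
turn2host = (
    sentences.groupby(["episode", "episode_order"])
    .agg({"host_id": "min"})
    .reset_index()
)

# Naming the join keys explicitly guards against any other shared column names.
turns = turns.merge(turn2host, how="left", on=["episode", "episode_order"])
print(turns)
```

When `on` is omitted, `merge` joins on every column name the two frames share, so spelling out the keys makes the intent explicit.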

We then assign guests a *speaker_id* based on the episode that they are part of. For simplicity, we treat each guest as a separate speaker (e.g., even if the same guest appears in two different episodes, they are assigned two different IDs). We determine which speaker is a guest using **utterances-2sp.csv**, in which guests are identified by *host_id* = -1.

In [None]:
# create guest speaker_id
utterances['speaker_id'] = utterances.apply(lambda row: "g" + str(row['episode']) if row['host_id'] == -1 else "", axis=1)

# create host speaker_id
utterances['speaker_id'] = utterances.apply(lambda row: "h" + str(row['host_id']) if row['host_id'] != -1 else row['speaker_id'], axis=1)
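
The same ID scheme can also be expressed without row-wise `apply`. The following is a sketch of an equivalent vectorized form on a hypothetical toy frame, not the code used to build the corpus.

```python
import numpy as np
import pandas as pd

# Hypothetical turns: host 7 interviews a different guest in each episode.
toy = pd.DataFrame({"episode": [1, 1, 2, 2], "host_id": [7, -1, 7, -1]})

# Guests (host_id == -1) get "g" + episode; hosts get "h" + host_id.
toy["speaker_id"] = np.where(
    toy["host_id"] == -1,
    "g" + toy["episode"].astype(str),
    "h" + toy["host_id"].astype(str),
)
print(toy)
```

Because guest IDs are derived from the episode, the host keeps one ID across episodes while each guest appearance gets its own.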

In [None]:
# sanity check - each episode should have exactly two speakers
utterances.groupby(["episode"])['speaker_id'].nunique().sort_values()

episode
1         2
97596     2
97593     2
97590     2
97589     2
         ..
63680     2
63679     2
63674     2
63700     2
141179    2
Name: speaker_id, Length: 22149, dtype: int64
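
Rather than inspecting the sorted counts by eye, the same check can be written as an assertion. The sketch below uses a hypothetical toy frame; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical turns for two episodes, each with one host and one guest.
toy = pd.DataFrame({
    "episode":    [1, 1, 1, 2, 2],
    "speaker_id": ["h7", "g1", "h7", "h7", "g2"],
})

# Count distinct speakers per episode and collect any episode that is not a pair.
speakers_per_episode = toy.groupby("episode")["speaker_id"].nunique()
offenders = speakers_per_episode[speakers_per_episode != 2]

assert offenders.empty, f"Episodes without exactly two speakers: {offenders.index.tolist()}"
```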

### Assign Speaker Metadata

We gather host names from **host-map.json** to omit the ", host" tag that is present in the speaker names in **utterances.csv**. For each speaker, we save their name and the type of speaker that they are (host or guest). We then create `Speaker` objects for each host and guest.

In [None]:
speaker_meta = {}

In [None]:
# create host data from host_map
hosts['speaker_id'] = "h" + hosts['index']
hosts2p = hosts[hosts['speaker_id'].isin(utterances['speaker_id'].unique())]
hosts2p = hosts2p[['name', 'speaker_id']].to_dict(orient='records')

In [None]:
for host in hosts2p:
  speaker_meta[host['speaker_id']] = {"name": host['name'], "type": "host"}

In [None]:
# create guest data from utterances.csv
speakers = utterances[['speaker', 'speaker_id']].drop_duplicates()
guests = speakers[speakers['speaker_id'].str.startswith("g")].to_dict(orient='records')
for guest in guests:
  speaker_meta[guest['speaker_id']] = {"name": guest['speaker'], "type": "guest"}

In [None]:
corpus_speakers = {k: Speaker(id = k, meta = v) for k,v in speaker_meta.items()}

## Create Utterances and Corpus

To create `Utterance` objects, we iterate through the **utterances** DataFrame to capture the reply structure and metadata (*episode* and *order*). As two-participant interviews are linear in structure, we treat every utterance after the first in a conversation as a reply to the utterance immediately before it. We detect the start of a new conversation by comparing each utterance's *episode* with that of the previous utterance.

In [None]:
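# sort first, then reset the index so that consecutive rows get consecutive ids
# (the loop below relies on index-1 being the previous turn)
utterances = utterances.sort_values(['episode', 'episode_order'])
utterances = utterances.reset_index(drop=True)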

utterance_corpus = {}
root = ""
last_episode = -1

for index, utterance in utterances.iterrows():
  if utterance["episode"] != last_episode:
    root = str(index)
    reply_to = None
  else:
    reply_to = str(index-1)
  meta = {"episode": utterance["episode"], "order": utterance["episode_order"]}
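  # newer ConvoKit releases name the conversation-root argument `conversation_id` (formerly `root`)
  utterance_corpus[index] = Utterance(
    id = str(index),
    speaker = corpus_speakers[utterance["speaker_id"]],
    text = str(utterance["utterance"]),
    conversation_id = root,
    reply_to = reply_to,
    meta = meta,
  )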
  last_episode = utterance["episode"]
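
The linking logic can be traced without ConvoKit. The sketch below replays it on a few hypothetical turns using plain dictionaries in place of `Utterance` objects; it assumes the rows are already sorted by episode and order, so that position `i - 1` is always the previous turn.

```python
# Hypothetical turns, already sorted by (episode, order).
rows = [
    {"episode": 1, "order": 1},
    {"episode": 1, "order": 2},
    {"episode": 1, "order": 3},
    {"episode": 2, "order": 1},
    {"episode": 2, "order": 2},
]

links = []
conversation_id = None
last_episode = None
for i, row in enumerate(rows):
    if row["episode"] != last_episode:
        # First turn of a new episode: it starts a conversation and replies to nothing.
        conversation_id = str(i)
        reply_to = None
    else:
        # Every later turn replies to the turn immediately before it.
        reply_to = str(i - 1)
    links.append({"id": str(i), "conversation_id": conversation_id, "reply_to": reply_to})
    last_episode = row["episode"]

for link in links:
    print(link)
```

Each conversation is identified by the ID of its first utterance, which is why the first turn of an episode has a `conversation_id` equal to its own ID.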

Lastly, we create the `Corpus` from a list of `Utterance`s. Each `Conversation` contains metadata including fields such as *program*, *title*, and *date*.

In [None]:
utterance_list = list(utterance_corpus.values())

In [None]:
corpus = Corpus(utterances = utterance_list)

In [None]:
episodes = episodes[episodes["id"].isin(episodes2p)]

In [None]:
ep_info_dict = {}
for index, ep in episodes.iterrows():
  ep_info_dict[ep["id"]] = {"program": ep["program"], "title": ep["title"], "date": ep["episode_date"]}

In [None]:
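# copy episode-level metadata onto each conversation
for convo in corpus.iter_conversations():
  # every utterance in a conversation shares the same episode, so the first one suffices
  utt = convo.get_utterance_ids()[0]
  episode_id = corpus.get_utterance(utt).meta["episode"]
  convo.meta.update(ep_info_dict[episode_id])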

## Save Corpus

In [None]:
corpus.dump("npr-2p-corpus", base_path = data_dir)