<a href="https://colab.research.google.com/github/deepak-karkala/chandler-bing-comment-generator/blob/main/data_preprocessing_sitcom_characters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install kaggle



### Load script dataset

In [2]:
!kaggle datasets download -d ryanstonebraker/friends-transcript

Dataset URL: https://www.kaggle.com/datasets/ryanstonebraker/friends-transcript
License(s): GPL-2.0
Downloading friends-transcript.zip to /content
  0% 0.00/1.72M [00:00<?, ?B/s]
100% 1.72M/1.72M [00:00<00:00, 148MB/s]


In [3]:
import zipfile
with zipfile.ZipFile('friends-transcript.zip', 'r') as zip_ref:
    zip_ref.extractall()

In [4]:
import pandas as pd
import re
df = pd.read_csv('friends_quotes.csv')
print(df.head())

     author  episode_number           episode_title  \
0    Monica             1.0  Monica Gets A Roommate   
1      Joey             1.0  Monica Gets A Roommate   
2  Chandler             1.0  Monica Gets A Roommate   
3    Phoebe             1.0  Monica Gets A Roommate   
4    Phoebe             1.0  Monica Gets A Roommate   

                                               quote  quote_order  season  
0  There's nothing to tell! He's just some guy I ...          0.0     1.0  
1  C'mon, you're going out with the guy! There's ...          1.0     1.0  
2  All right Joey, be nice. So does he have a hum...          2.0     1.0  
3                           Wait, does he eat chalk?          3.0     1.0  
4  Just, 'cause, I don't want her to go through w...          4.0     1.0  


In [5]:
print(" \n ".join(list((df.iloc[0:4]["author"] + ": " + df.iloc[0:4]["quote"]).values)))

Monica: There's nothing to tell! He's just some guy I work with! 
 Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him! 
 Chandler: All right Joey, be nice. So does he have a hump? A hump and a hairpiece? 
 Phoebe: Wait, does he eat chalk?


### Extract pairs of previous dialogue and current dialogue

In [29]:
num_prev_dialogues = 15
df_prev_dialogues = pd.DataFrame(columns=["prev_dialogues", "current_dialogue"])
num_words = 0
new_scene_start_idx = 0
prev_episode_id = 1

for i in range(len(df)):
  #print(i)
  curr_episode_id = int(df.iloc[i]["episode_number"])

  # Don't use context from previous episode, start from this dialog
  if curr_episode_id != prev_episode_id:
    new_scene_start_idx = i

  # Don't use context from previous scene, start from this dialog
  if "Scene" in df.iloc[i]["quote"]:
    new_scene_start_idx = i + 1

  if df.iloc[i]["author"] == "Chandler":
    current_dialog_num = i #int(df.iloc[i]["quote_order"])
    current_dialog = df.iloc[i]["quote"]

    #prev_dialogues_range = [max(0, i - num_prev_dialogues) : current_dialog_num]
    prev_dialogues = " \n ".join(list((df.iloc[max(new_scene_start_idx, current_dialog_num - num_prev_dialogues) : current_dialog_num]["author"] + ": " + df.iloc[max(new_scene_start_idx, current_dialog_num - num_prev_dialogues) : current_dialog_num]["quote"]).values))
    prev_dialogues = re.sub("\(.*\)","", prev_dialogues)
    current_dialog = re.sub("\(.*\)","", current_dialog)

    df_prev_dialogues.loc[i] = [prev_dialogues, current_dialog]
    num_words += len(prev_dialogues.split(" "))

  prev_episode_id = curr_episode_id

df_prev_dialogues.reset_index(inplace=True)
df_prev_dialogues.drop('index', axis=1, inplace=True)
df_prev_dialogues.head(2)

Unnamed: 0,index,prev_dialogues,current_dialogue
0,2,Monica: There's nothing to tell! He's just som...,"All right Joey, be nice. So does he have a hum..."
1,6,Monica: There's nothing to tell! He's just som...,Sounds like a date to me.


In [33]:
print(f"Number of words in context dialogues: {num_words}")
print(f"Number of Chandler dialogues: {len(df_prev_dialogues)}")

content_system = """You are given a few dialogues from the famous TV sitcom FRIENDS. You have two tasks. First, you have to summarize the dialogues into context without any names of characters. Second, you have to then write a dialogue as an appropriate response to the given context, the dialogue should be in the style of Chandler Bing, one of the characters from FRIENDS. Keep in mind that Chandler Bing’s humor is marked by a unique blend of sarcasm, self-deprecation, and quick wit. He tends to make jokes that deflect serious or emotional moments, often using his dry, sarcastic tone.  His style is heavily reliant on irony, often delivering punchlines that are deliberately over-the-top or nonsensical. Extract these two things and return them in JSON format."""
num_words_content_message = len(content_system.split(" ")) * len(df_prev_dialogues)
print(f"Number of words to batch input: {num_words + num_words_content_message}")

Number of words in context dialogues: 1327763
Number of Chandler dialogues: 7488
Number of words to batch input: 2263763


### Write data in OpenAI Batch API file format


In [36]:
# OpenAI Batch API file format
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo-0125", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}

{'custom_id': 'request-1',
 'method': 'POST',
 'url': '/v1/chat/completions',
 'body': {'model': 'gpt-3.5-turbo-0125',
  'messages': [{'role': 'system', 'content': 'You are a helpful assistant.'},
   {'role': 'user', 'content': 'Hello world!'}],
  'max_tokens': 1000}}

In [34]:
custom_id = 0
df_batch_input = pd.DataFrame(columns=["custom_id", "method", "url", "body"])
content_system = """You are given a few dialogues from the famous TV sitcom FRIENDS. You have two tasks. First, you have to summarize the dialogues into context without any names of characters. Second, you have to then write a dialogue as an appropriate response to the given context, the dialogue should be in the style of Chandler Bing, one of the characters from FRIENDS. Keep in mind that Chandler Bing’s humor is marked by a unique blend of sarcasm, self-deprecation, and quick wit. He tends to make jokes that deflect serious or emotional moments, often using his dry, sarcastic tone.  His style is heavily reliant on irony, often delivering punchlines that are deliberately over-the-top or nonsensical. Extract these two things and return them in JSON format."""


for i in range(len(df_prev_dialogues)):
  custom_id_str = "friends-" + str(custom_id)
  body = {}
  body["model"] = "gpt-4o-mini"

  messages = []
  message_system = {}
  message_system["role"] = "system"
  message_system["content"] = content_system

  message_user = {}
  message_user["role"] = "user"
  message_user["content"] = df_prev_dialogues.iloc[i]["prev_dialogues"]

  messages.append(message_system)
  messages.append(message_user)
  body["messages"] = messages

  body["max_tokens"] = 500

  df_batch_input.loc[i] = [custom_id_str, "POST", "/v1/chat/completions", body]
  custom_id += 1

In [35]:
import json
# Convert DataFrame to JSONL (this will escape forward slashes)
#df_batch_input.to_json('output.jsonl', orient='records', lines=True)

# Convert to JSON string without escaping forward slashes
#json_str = json.dumps(df_batch_input.to_dict(orient='records'), ensure_ascii=False).encode('utf8')
#batch_data = json.loads(json_str)
batch_data = df_batch_input.to_dict(orient='records')

# Write to a JSONL file
with open("batch_file_friends.jsonl", "w", encoding="utf-8") as f:
    for i, line in enumerate(batch_data):
        json.dump(line, f, ensure_ascii=False)  # Serialize dict as JSON
        if i != len(batch_data) - 1: f.write("\n")  # Newline for JSONL format

### Get chat completions from OpenAI API

In [None]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ··········


In [None]:
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ContextDialogue(BaseModel):
    context: str
    dialog: str

In [None]:
#response = client.chat.completions.create(
response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "developer",
            "content": """You are given a few dialogues from the famous TV sitcom FRIENDS. You have two tasks.
                          First, you have to summarize the dialogues into context without any names of characters.
                          Second, you have to then write a dialogue as an appropriate response to the given context,
                          the dialogue should be in the style of Chandler Bing, one of the characters from FRIENDS.
                          Keep in mind that Chandler Bing’s humor is marked by a unique blend of sarcasm, self-deprecation, and quick wit.
                          He tends to make jokes that deflect serious or emotional moments, often using his dry, sarcastic tone.
                          His style is heavily reliant on irony, often delivering punchlines that are deliberately over-the-top or nonsensical.
                          Extract these two things and return them in JSON format."""
        },
        {
            "role": "user",
            "content": """Phoebe: Ooh! Oh!  \n Ross: No, no don\'t! Stop cleansing my aura! No, just leave my aura alone, okay? \n Phoebe: Fine! Be murky! \n Ross: I\'ll be fine, alright? Really, everyone. I hope she\'ll be very happy. \n Monica: No you don\'t. \n Ross: No I don\'t, to hell with her, she left me! \n Joey: And you never knew she was a lesbian... \n Ross: No!! Okay?! Why does everyone keep fixating on that? She didn\'t know, how should I know? \n Chandler: Sometimes I wish I was a lesbian...  Did I say that out loud? \n Ross: I told mom and dad last night, they seemed to take it pretty well. \n Monica: Oh really, so that hysterical phone call I got from a woman at sobbing 3:00 A.M., "I\'ll never have grandchildren, I\'ll never have grandchildren." was what? A wrong number? \n Ross: Sorry. \n Joey: Alright Ross, look. You\'re feeling a lot of pain right now. You\'re angry. You\'re hurting. Can I tell you what the answer is? \n Joey: Strip joint! C\'mon, you\'re single! Have some hormones! \n Ross:"""
        }
    ],
    response_format=ContextDialogue,
)

print(response.choices[0].message.content);

{"context":"A character is upset about a breakup, feeling angry and hurt, while others offer mixed responses ranging from sarcastic quips to genuine concern. Some attempt to comfort him while others make jokes about his situation, leading to a humorous yet awkward moment.","dialog":"Could we have a little less aura and a little more chips and guac? I mean, if we're cleansing anything, let's cleanse the tacos, am I right?"}


In [None]:
import json
json.loads(response.choices[0].message.content)

{'context': "In this scene, a group of friends is dealing with the aftermath of one friend's recent breakup. One friend is attempting to cleanse another friend's aura, but he insists on being left alone. Throughout the conversation, they discuss feelings of anger and sadness related to the breakup, while a couple of the friends poke fun by offering sarcastic or outrageous suggestions on how to cope with the situation, like going to a strip joint. Additionally, there's humor derived from one friend's unexpected and humorous desire to be part of the LGBTQ+ community.",
 'dialog': "Could I BE any more heartbroken? I mean, why go to a strip joint when I can just gather my friends and wallow in self-pity while eating an entire pizza by myself? That's practically a family tradition!"}

In [None]:
json.loads(response.choices[0].message.content)

{'context': "There is a conversation where one is experiencing emotional distress after a breakup. Another person playfully tries to cleanse their aura, while others express skepticism about their well-being. There's a mention of confusion regarding a partner's sexuality leading to humorous moment, and one person reflects on family reactions to the breakup. Another suggests an unconventional way to cope with the pain of being single.",
 'dialog': "Well, in times like these, I often find that the best cure for a broken heart is a combination of ice cream, self-pity, and, of course, that delightful little fling known as denial. Have you tried the denial? It's delightful, seriously!"}

In [None]:
json.loads(response.choices[0].message.content)

{'context': 'A character is upset about a breakup, feeling angry and hurt, while others offer mixed responses ranging from sarcastic quips to genuine concern. Some attempt to comfort him while others make jokes about his situation, leading to a humorous yet awkward moment.',
 'dialog': "Could we have a little less aura and a little more chips and guac? I mean, if we're cleansing anything, let's cleanse the tacos, am I right?"}