<a href="https://colab.research.google.com/github/akash166d/PEFT_fine_tuning_IMDB/blob/main/Custom_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Project: Build Your Own Custom Chatbot

**Dataset :** character_descriptions.csv

**Why?**
The media industry is an art form and an ever-evolving field. For casting directors, it is very important to cast the right person for the script.
When a production is being developed, casting directors look for similar performances across the field, and in most cases they want recent artists (who may bring in more money because they already have face value). OpenAI models may not have this kind of media information (in our case, characters in particular), so they cannot be relied on for the most recent details.

GPT needs text as the input so we will compress the information from all fields into one. To save token space we will keep it simple:

**Text** = Name of Character: <Name>  /n/n Description: <description> /n/n Setting: <setting> /n/n Medium: <medium>
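
As a minimal sketch of this template, the helper below assembles one record into the single text field. The function name `build_text` and the sample values are hypothetical; the ` /n/n ` separator is copied literally from the template above.

```python
def build_text(name: str, description: str, setting: str, medium: str) -> str:
    """Join the four fields into one string using the template's literal ' /n/n ' separator."""
    parts = [
        f"Name of Character: {name}",
        f"Description: {description}",
        f"Setting: {setting}",
        f"Medium: {medium}",
    ]
    return " /n/n ".join(parts)


# Hypothetical sample record, for illustration only.
example = build_text("Ada", "A curious engineer.", "London", "Play")
print(example)
```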



In [37]:
# The cells below use the legacy (pre-1.0) openai interface
# (openai.Embedding, openai.Completion, openai.embeddings_utils),
# which was removed in openai 1.0, so pin a 0.x release.
!pip uninstall -y openai
!pip install "openai<1.0"
!pip install tiktoken


In [29]:
import pandas as pd
import numpy as np
import openai
import os
from openai.embeddings_utils import distances_from_embeddings
# from openai import OpenAI

In [30]:
from google.colab import userdata
import openai
openai.api_base = "https://openai.vocareum.com/v1"
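# Read the key from Colab's secret store instead of hard-coding it in the notebook.
# "OPENAI_API_KEY" is an assumed secret name; use the name you saved the key under.
openai.api_key = userdata.get("OPENAI_API_KEY")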
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
MAX_TOKENS = 1000

# Data Import and Wrangling

In [31]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [32]:
FILE_PATH = "/content/drive/My Drive/PEFT_fine_tuning/character_descriptions.csv"
# Load & transform data
data = pd.read_csv(FILE_PATH)
data['text'] = "Name of Character: " + data.Name + " /n/n Description: " + data.Description + " /n/n Setting: " + data.Setting + " /n/n Medium: " + data.Medium
data = data[["text"]]
print(str(data.shape))
data.head(2)


(55, 1)


| | text |
|---|---|
| 0 | Name of Character: Emily /n/n Description: A y... |
| 1 | Name of Character: Jack /n/n Description: A mi... |


Since the dataset is small, we will create the embeddings and store them in a CSV. In most practical settings, we would use a vector database to store the vectors (embeddings) and retrieve them by similarity.

In [39]:
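embeddings = []
# One API call per row; each response carries the embedding for that row's text.
for index, row in data.iterrows():
  response = openai.Embedding.create(
      input=row["text"],
      engine=EMBEDDING_MODEL_NAME
  )
  # Use a distinct loop name so the DataFrame `data` is not shadowed.
  embeddings.extend([item["embedding"] for item in response["data"]])
data["embeddings"] = embeddings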
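
The loop above keeps the vectors only in memory. One way to persist them in a CSV, as a sketch, is to serialize each vector as a JSON string and parse it back on load. The example uses only the standard library and made-up three-dimensional vectors; the helper names `save_rows` and `load_rows` are hypothetical.

```python
import csv
import io
import json


def save_rows(rows, handle):
    """Write (text, embedding) pairs as CSV, storing each embedding as a JSON string."""
    writer = csv.writer(handle)
    writer.writerow(["text", "embeddings"])
    for text, embedding in rows:
        writer.writerow([text, json.dumps(embedding)])


def load_rows(handle):
    """Read the CSV back and parse each embedding string into a list of floats."""
    reader = csv.DictReader(handle)
    return [(row["text"], json.loads(row["embeddings"])) for row in reader]


# Toy vectors for illustration; real text-embedding-ada-002 vectors have 1536 dimensions.
rows = [
    ("Name of Character: A", [0.1, 0.2, 0.3]),
    ("Name of Character: B", [0.0, -1.5, 2.0]),
]

buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
save_rows(rows, buffer)
buffer.seek(0)
restored = load_rows(buffer)
print(restored == rows)
```

Storing the vector as JSON avoids the common pitfall where a list column is written as its string representation and read back as plain text.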

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [40]:
def question_embeddings(question):
  response = openai.Embedding.create(
      input=question,
      engine=EMBEDDING_MODEL_NAME
  )
  return response["data"][0]["embedding"]

In [None]:
q1 = "Suggest a character who is best suited for a live comedy show."
q2 = "Suggest at least two female characters who are best suited to play an anchor role."
q1_embeddings = question_embeddings(q1)
q2_embeddings = question_embeddings(q2)

# `df` was never defined above, so work on a copy of the embedded dataset.
df = data.copy()
df['q1_distances'] = distances_from_embeddings(
  q1_embeddings,
  df['embeddings'].values,
  distance_metric="cosine"
)
df['q2_distances'] = distances_from_embeddings(
  q2_embeddings,
  df['embeddings'].values,
  distance_metric="cosine"
)
# A smaller cosine distance means a closer match, so sort ascending.
dfq1 = df.sort_values(by=["q1_distances"], ascending=True)
dfq2 = df.sort_values(by=["q2_distances"], ascending=True)
dfq1.head()
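
To make the ranking step concrete, here is a standard-library sketch of cosine distance (one minus cosine similarity), which is what the `"cosine"` metric is understood to compute. It is an illustration with toy vectors, not the library's implementation; `cosine_distance` and `rank_by_distance` are hypothetical names.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, candidates):
    """Return (label, distance) pairs sorted from closest to farthest."""
    scored = [(label, cosine_distance(query, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy two-dimensional vectors standing in for embeddings.
candidates = {
    "opposite": [-1.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "same_direction": [2.0, 0.0],
}
ranking = rank_by_distance([1.0, 0.0], candidates)
print([label for label, _ in ranking])
```

Because the metric ignores vector length, `[2.0, 0.0]` is as close to the query as `[1.0, 0.0]` itself would be.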

In [None]:
dfq1["text"].head(10).values

In [None]:
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")
prompt_template = """
Answer the question using Context provided below.
If no useful information or answer found just say 'I'm not sure'

Context:

{}

---

Question: {}
Answer:"""
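def get_prompt(question, df):
  """Build the prompt from the closest rows of `df` (already sorted by distance) within MAX_TOKENS."""
  # Start with the tokens used by the template and the question itself.
  token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
  context_list = []
  for text in df["text"].head(10).values:
    token_count += len(tokenizer.encode(text))
    # Stop at the first row that would push the prompt over the budget.
    if token_count > MAX_TOKENS:
      break
    context_list.append(text)
  prompt = prompt_template.format("\n\n###\n\n".join(context_list), question)
  return prompt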
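
The budgeting logic in `get_prompt` can be seen in isolation with a stand-in tokenizer. The sketch below counts whitespace-separated words instead of `cl100k_base` tokens (an assumption made so it runs offline), and `count_tokens` and `pack_context` are hypothetical names.

```python
def count_tokens(text):
    """Stand-in tokenizer: count whitespace-separated words (not real model tokens)."""
    return len(text.split())


def pack_context(texts, budget, overhead=0):
    """Keep texts in ranked order until the next one would exceed the token budget."""
    used = overhead  # tokens already spent on the template and the question
    kept = []
    for text in texts:
        used += count_tokens(text)
        if used > budget:
            break  # stop at the first overflow, as get_prompt does
        kept.append(text)
    return kept


ranked_texts = ["one two three", "four five", "six seven eight nine", "ten"]
kept = pack_context(ranked_texts, budget=6, overhead=1)
print(kept)
```

Note that the loop stops at the first text that does not fit, so a later, shorter text (`"ten"` here) is not considered even though it would fit the remaining budget.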

In [None]:
q1_prompt=get_prompt(q1, dfq1)
q2_prompt=get_prompt(q2, dfq2)

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1 Custom

In [None]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=q1_prompt,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

### Question 2 Custom

In [None]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=q2_prompt,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

### Question 1 Basic

In [None]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=q1,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

### Question 2 Basic

In [42]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=q2,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)
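
The four cells above repeat the same call with a different prompt. As a sketch, the comparison can be wrapped in one helper that takes the completion call as a parameter. `compare_answers` and `fake_complete` are hypothetical; in the notebook, the callable would wrap the `openai.Completion.create` call shown in the cells above.

```python
def compare_answers(question, custom_prompt, complete):
    """Return the basic and custom answers for one question.

    `complete` is any callable that maps a prompt string to the answer text.
    """
    return {
        "question": question,
        "basic": complete(question).strip(),
        "custom": complete(custom_prompt).strip(),
    }


def fake_complete(prompt):
    """Stub standing in for the API call so the sketch runs offline."""
    return " grounded answer " if "Context:" in prompt else " generic answer "


result = compare_answers(
    "Who suits a comedy show?",
    "Context:\n...\n---\nQuestion: Who suits a comedy show?",
    fake_complete,
)
print(result["basic"], "|", result["custom"])
```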
