## Dataset Selection
I chose the Character Descriptions dataset because it contains profiles of fictional characters. It suits a demonstration of a custom chatbot: with the relevant profiles supplied as context, the model can answer questions about these specific characters more accurately than a general model working without the dataset.

In [1]:
# Import all required libraries (the calls below use the pre-1.0 `openai` SDK interface)
import openai
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


In [2]:
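# API key setup: read the key from the environment instead of hard-coding it
import os

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = os.environ.get("OPENAI_API_KEY", "YOUR_API_KEY")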

In [3]:
# Function to create an embedding for one text using the OpenAI API
def create_embedding(text):
    """Return the embedding vector (a list of floats) for a single string."""
    response = openai.Embedding.create(
        input=[text],
        model="text-embedding-ada-002"  # Any other OpenAI embedding model can be used here
    )
    return response['data'][0]['embedding']


In [4]:
# Function to embed all context documents (one API call per document)
def embed_context(texts):
    """Return a list of embeddings, in the same order as texts."""
    return [create_embedding(text) for text in texts]
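
`embed_context` makes one API request per document. Because `create_embedding` already passes `input` as a list, sending several texts per request is one possible refinement. The sketch below is an assumption, not part of the original notebook: `embed_in_batches`, `embed_batch_fn`, `fake_embed_batch`, and `batch_size` are hypothetical names, and the embedding call is injected as a function so the batching logic can be run without an API key.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch_fn: Callable[[List[str]], List[Vector]],
    batch_size: int = 2,
) -> List[Vector]:
    """Embed texts in chunks of batch_size, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_batch_fn(batch)
        # Guard against a backend that drops or duplicates items.
        if len(vectors) != len(batch):
            raise RuntimeError("embedding function returned the wrong number of vectors")
        embeddings.extend(vectors)
    return embeddings


def fake_embed_batch(batch: List[str]) -> List[Vector]:
    """Stand-in for the API call: maps each text to [length, word count]."""
    return [[float(len(text)), float(len(text.split()))] for text in batch]


docs = ["alpha beta", "gamma", "delta epsilon zeta"]
batched = embed_in_batches(docs, fake_embed_batch, batch_size=2)
print(batched)  # [[10.0, 2.0], [5.0, 1.0], [18.0, 3.0]]
```

In the notebook, `embed_batch_fn` could wrap `openai.Embedding.create(input=batch, model=...)` and return the `embedding` field of each item in `response['data']`. Check the `index` field of each item if the order matters.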


In [5]:
# Function to create the embedding for the question
def create_question_embedding(question):
    return create_embedding(question)


In [6]:
# Function to find the contexts most similar to the question
def find_most_similar_context(question_embedding, context_embeddings, texts, top_k=3):
    """Return the top_k texts ranked by cosine similarity, most similar first."""
    similarities = cosine_similarity(
        [question_embedding], context_embeddings
    )[0]
    # argsort is ascending: take the last top_k indices, then reverse them
    top_indices = similarities.argsort()[-top_k:][::-1]
    selected_texts = [texts[i] for i in top_indices]
    return selected_texts
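
To make the ranking step concrete without calling the API, here is a minimal sketch that computes cosine similarity directly with NumPy on hand-made 2-D vectors. The vectors and the helper name `top_k_indices` are illustrative assumptions. Real embeddings from `text-embedding-ada-002` have 1536 dimensions, but the arithmetic is the same.

```python
import numpy as np


def top_k_indices(query, candidates, top_k=3):
    """Return indices of the top_k candidates by cosine similarity, best first."""
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    sims = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    )
    # argsort is ascending, so take the last top_k entries and reverse them.
    order = np.argsort(sims)[-top_k:][::-1]
    return order.tolist(), sims


toy_query = [1.0, 0.0]
toy_candidates = [
    [0.0, 1.0],  # orthogonal to the query, similarity 0
    [1.0, 1.0],  # 45 degrees from the query, similarity about 0.707
    [2.0, 0.0],  # same direction as the query, similarity 1
]
indices, sims = top_k_indices(toy_query, toy_candidates, top_k=2)
print(indices)  # [2, 1]
```

The third vector ranks first even though it is twice as long as the query, because cosine similarity compares direction only.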


In [7]:
# Function to create the final prompt for the LLM
def create_prompt(question, selected_contexts):
    """Combine the retrieved contexts and the question into one prompt string."""
    context_text = "\n\n".join(selected_contexts)
    instruction = f"""You are a helpful assistant.
Given the following context:
{context_text}

Answer the following question:
{question}
"""
    return instruction
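
To see what the model actually receives, the sketch below rebuilds the same template and prints the result for two short placeholder contexts. The function is repeated under the hypothetical name `build_prompt` so the block runs on its own. The sample strings are illustrative and not taken from the dataset.

```python
def build_prompt(question, selected_contexts):
    """Join the contexts with blank lines and place them above the question."""
    context_text = "\n\n".join(selected_contexts)
    return (
        "You are a helpful assistant.\n"
        "Given the following context:\n"
        f"{context_text}\n"
        "\n"
        "Answer the following question:\n"
        f"{question}\n"
    )


sample_prompt = build_prompt(
    "What is discussed in document 2?",
    ["First placeholder context.", "Second placeholder context."],
)
print(sample_prompt)
```

The contexts appear in ranked order, separated by blank lines, so the most similar text comes first in the prompt.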


In [8]:
# Example documents (placeholder texts used to test the pipeline)
texts = [
    "This is document 1 about artificial intelligence.",
    "This is document 2 discussing machine learning models.",
    "This is document 3 which explains neural networks."
]

# Step 1: Embed context documents
context_embeddings = embed_context(texts)

# Step 2: Take a question
question = "What is discussed in document 2?"

# Step 3: Embed the question
question_embedding = create_question_embedding(question)

# Step 4: Find the most similar context(s)
# With only three documents, top_k=3 returns all of them, ranked by similarity
selected_contexts = find_most_similar_context(
    question_embedding, context_embeddings, texts, top_k=3
)

# Step 5: Create prompt to send to LLM
final_prompt = create_prompt(question, selected_contexts)

# Step 6: Send the prompt to the OpenAI chat model (optional, for testing the prompt end to end)
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": final_prompt}]
)

# Step 7: Print final output
print(response['choices'][0]['message']['content'])


Document 2 discusses machine learning models.
