# Practical Exam: Automating Customer Support with OpenAI API

You work as an AI Engineer at ChatSolveAI, a company that provides automated customer support solutions. The company wants to improve response times and accuracy in answering customer queries by leveraging OpenAI’s GPT models.

Your task is to build a chatbot that classifies customer queries, retrieves relevant responses, and logs interactions in a structured way. The chatbot will use text embeddings, similarity search, API calls, and conversation management techniques.


**Please note:** 

1. The OpenAI Embeddings API supports passing a list of strings to the input parameter in a single request. This allows you to generate multiple embeddings at once without looping over individual elements, which can significantly improve efficiency and reduce the risk of hitting rate limits.

2. When submitting your solution, you may see the error message 'Something went wrong while submitting your solution. Please try again.' This can happen because code that calls the OpenAI API takes longer to run than code in our other Certifications. You can ignore the message if your code takes a few minutes to run. However, if your code makes too many API requests, the API will time out. If individual cells run for more than a few minutes, consider revising your code so that it makes fewer requests.
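
As a rough illustration of note 1, the sketch below splits a list of texts into fixed-size batches and makes one embedding call per batch rather than one per string. The helper names `batched` and `embed_all`, the batch size, and the stand-in `fake_embed` function are assumptions made for this example. In the notebook, the callable would wrap a request such as `client.embeddings.create(model=..., input=batch)`.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 100,
) -> List[Vector]:
    """Embed every text with one call per batch, preserving input order."""
    vectors: List[Vector] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in for the real API call (hypothetical): it records the size of each
# batch it receives and returns a one-dimensional "embedding" per text.
calls: List[int] = []


def fake_embed(batch: List[str]) -> List[Vector]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_all(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(len(calls), "requests for", len(vectors), "texts")  # prints "3 requests for 5 texts"
```

With a batch size of 2, five texts need three requests instead of five; a larger batch size reduces the request count further.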

In [17]:
# Run this cell before running your solution

# Import necessary modules
import os
from openai import OpenAI

# Define the model to use
model = "gpt-3.5-turbo"

# Define the client
client = OpenAI()

# Task 1

ChatSolveAI has provided a knowledge base (`knowledge_base.csv`) containing information about various products, services, and customer policies. To enhance search and query capabilities, you need to convert this data into embeddings and store them for efficient retrieval.

- Load the dataset (`knowledge_base.csv`).
- Generate text embeddings using OpenAI’s embedding model (`text-embedding-3-small`). Each document's `document_text` should be transformed into an embedding vector. 
- Store the generated embeddings in a JSON file (`knowledge_embeddings.json`), using the format shown below.
- Store the embedded data and associated metadata for retrieval.  

### Format to store generated embeddings:
```json
[
    {
       "document_id": 1,
       "document_text": "Example document text.",
       "embedding_vector": [0.123, 0.456, ...],
       "metadata": "Additional document info"
    }
]
```

### Data description: 

| Column Name       | Criteria                                                |
|-------------------|---------------------------------------------------------|
| document_id       | Integer. Unique identifier for each document. No missing values. |
| document_text     | String. Text content of the knowledge base. Preprocessed and embedded. |
| embedding_vector  | List. Embedding representation of the `document_text`. |
| metadata          | String. Metadata for additional information. |
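
The record layout described above could be assembled along the following lines. This is a minimal sketch with made-up rows and vectors; the helper name `build_records` is hypothetical, and in the notebook the vectors would come from the embeddings response rather than being hard-coded.

```python
import json
from typing import Dict, List


def build_records(rows: List[Dict[str, str]], vectors: List[List[float]]) -> List[dict]:
    """Pair each CSV row with its embedding in the required output layout."""
    if len(rows) != len(vectors):
        raise ValueError("expected exactly one embedding per row")
    return [
        {
            "document_id": int(row["document_id"]),  # CSV values arrive as strings
            "document_text": row["document_text"],
            "embedding_vector": list(vector),
            "metadata": row.get("metadata", ""),  # empty string if the column is absent
        }
        for row, vector in zip(rows, vectors)
    ]


# Made-up data standing in for knowledge_base.csv rows and API embeddings
rows = [
    {"document_id": "1", "document_text": "Example document text.", "metadata": "Additional document info"},
    {"document_id": "2", "document_text": "Another document."},
]
vectors = [[0.123, 0.456], [0.789, 0.012]]

records = build_records(rows, vectors)

# Round-trip through JSON to confirm the structure serializes cleanly
serialized = json.dumps(records, indent=2, ensure_ascii=False)
restored = json.loads(serialized)
```

Checking that the number of rows matches the number of vectors guards against silently pairing a document with the wrong embedding.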


In [18]:
# =========================
# TASK 1 — EMBEDDINGS BUILD
# =========================

import csv
import json
from openai import OpenAI

client = OpenAI()

INPUT_CSV = "knowledge_base.csv"
OUTPUT_JSON = "knowledge_embeddings.json"
EMBED_MODEL = "text-embedding-3-small"

documents = []

with open(INPUT_CSV, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        documents.append({
            "document_id": int(row["document_id"]),
            "document_text": row["document_text"],
            "metadata": row.get("metadata", "")
        })

texts = [d["document_text"] for d in documents]

embedding_response = client.embeddings.create(
    model=EMBED_MODEL,
    input=texts
)

output = []
for doc, emb in zip(documents, embedding_response.data):
    output.append({
        "document_id": doc["document_id"],
        "document_text": doc["document_text"],
        "embedding_vector": emb.embedding,
        "metadata": doc["metadata"]
    })

with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2, ensure_ascii=False)

# Task 2

ChatSolveAI receives customer queries that need to be classified and matched with appropriate responses. Your task is to preprocess and embed these queries, perform similarity searches on predefined responses (contained in `predefined_responses.json`), and retrieve the most relevant responses.

- Load the dataset (`processed_queries.csv`).
- Retrieve responses by using cosine similarity to perform a similarity search against predefined responses in `predefined_responses.json`.
- Structure API requests properly and implement error handling, including retry mechanisms to handle rate limits.
- Format model responses as JSON to maintain consistency in output.
- Compute confidence scores for retrieved responses, scaled to 0-1.
- Store the structured responses in a JSON file (`query_responses.json`), suitable for integration with other applications. Your JSON file should be structured as follows:

| Column Name       | Criteria                                                   |
|-------------------|------------------------------------------------------------|
| query_id         | Integer. Unique identifier for each query. No missing values. |
| query_text       | String. Preprocessed query text. |
| top_responses    | List. Top 3 most relevant responses retrieved. |
| confidence_scores | List. Model-based confidence score for the top 3 responses. |
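
For illustration, the retrieval step might look like the following pure-Python sketch, which ranks candidate responses by cosine similarity and keeps the top three. The function names and the toy two-dimensional vectors are assumptions. Clipping negative similarities to 0 is only one possible way to map scores into the 0-1 range; the task statement does not prescribe a particular scaling.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    texts: Sequence[str],
    k: int = 3,
) -> Tuple[List[str], List[float]]:
    """Return the k most similar texts and their confidences, best first."""
    scored = [(cosine(query, vec), text) for vec, text in zip(candidates, texts)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = scored[:k]
    responses = [text for _, text in best]
    # Cosine similarity lies in [-1, 1]; clip it into [0, 1] (one possible scaling)
    confidences = [min(1.0, max(0.0, score)) for score, _ in best]
    return responses, confidences


# Toy vectors standing in for real embeddings
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
texts = ["A", "B", "C", "D"]

responses, confidences = top_k(query, candidates, texts)
```

Here `responses` comes out as `["A", "C", "B"]`: the identical vector ranks first, the diagonal one second, and the orthogonal one third, while the opposite vector is dropped.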

In [19]:
# =========================================
# TASK 2 — FULLY ROBUST, JSON-AGNOSTIC
# =========================================

import csv
import json
import time
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

# ---------- CONFIG ----------

client = OpenAI()

QUERY_CSV = "processed_queries.csv"
RESPONSES_JSON = "predefined_responses.json"
OUTPUT_JSON = "query_responses.json"

EMBED_MODEL = "text-embedding-3-small"
TOP_K = 3
CHUNK_SIZE = 100
RETRY_DELAY = 5
MAX_RETRIES = 5

# ---------- EMBEDDING WITH CHUNKING + RETRY ----------

def embed_texts(texts):
    """Embed texts in chunks, retrying failed requests with exponential backoff."""
    embeddings = []
    for i in range(0, len(texts), CHUNK_SIZE):
        chunk = texts[i:i + CHUNK_SIZE]
        for attempt in range(MAX_RETRIES):
            try:
                res = client.embeddings.create(
                    model=EMBED_MODEL,
                    input=chunk
                )
                embeddings.extend([d.embedding for d in res.data])
                break
            except Exception:
                # Re-raise after the last attempt instead of retrying forever
                if attempt == MAX_RETRIES - 1:
                    raise
                time.sleep(RETRY_DELAY * 2 ** attempt)
    return np.array(embeddings)

# ---------- LOAD QUERIES ----------

queries = []
with open(QUERY_CSV, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        queries.append({
            "query_id": int(row["query_id"]),
            "query_text": str(row["query_text"]).strip()
        })

if not queries:
    raise ValueError("No queries loaded from processed_queries.csv")

# ---------- LOAD & NORMALIZE PREDEFINED RESPONSES ----------

with open(RESPONSES_JSON, "r", encoding="utf-8") as f:
    raw_data = json.load(f)

# Function to extract any string from an object recursively
def extract_strings(obj):
    strings = []
    if isinstance(obj, str):
        if obj.strip():
            strings.append(obj.strip())
    elif isinstance(obj, dict):
        for v in obj.values():
            strings.extend(extract_strings(v))
    elif isinstance(obj, list):
        for item in obj:
            strings.extend(extract_strings(item))
    return strings

# Flatten all text from predefined responses
response_texts = extract_strings(raw_data)

if not response_texts:
    raise ValueError("predefined_responses.json contains no usable text")

# ---------- GENERATE EMBEDDINGS ----------

response_embeddings = embed_texts(response_texts)

query_texts = [q["query_text"] for q in queries]
query_embeddings = embed_texts(query_texts)

# ---------- SIMILARITY SEARCH + CONFIDENCE ----------

results = []

for q, q_emb in zip(queries, query_embeddings):
    similarities = cosine_similarity(
        q_emb.reshape(1, -1),
        response_embeddings
    )[0]

    top_indices = similarities.argsort()[-TOP_K:][::-1]
    top_scores = similarities[top_indices]

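    # Clip negatives to 0, then scale relative to the best match. The top
    # response therefore always scores 1.0 (unless every similarity is <= 0),
    # so these scores are comparable within a query but not across queries.
    clipped = np.clip(top_scores, 0, None)
    confidence_scores = (
        clipped / clipped.max() if clipped.max() > 0 else clipped
    ).tolist()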

    top_responses = [response_texts[i] for i in top_indices]

    results.append({
        "query_id": q["query_id"],
        "query_text": q["query_text"],
        "top_responses": top_responses,
        "confidence_scores": confidence_scores
    })

# ---------- SAVE OUTPUT ----------

with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
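
The retry behaviour that Task 2 asks for can also be sketched independently of the API. The example below is a minimal version of bounded retries with exponential backoff; the `with_retries` helper, its parameters, and the simulated `flaky` operation are assumptions for illustration, not part of the OpenAI client.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on any exception with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


# Hypothetical flaky operation: fails twice, then succeeds
attempts: Dict[str, int] = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


# Record the delays instead of actually sleeping, so the example runs instantly
delays: List[float] = []
result = with_retries(flaky, base_delay=0.5, sleep=delays.append)
```

In the notebook, `call` could be a lambda wrapping the embeddings request for one chunk. A stricter variant would catch only rate-limit and connection errors, so that programming mistakes fail immediately rather than being retried.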

# Task 3

To provide seamless customer service, ChatSolveAI wants to develop a chatbot that can respond to customer queries efficiently by searching for relevant responses and generating new ones when necessary.

- Develop a chatbot that:
    - Accepts customer queries via text input.
    - Searches for the most relevant responses from a predefined set of responses (`chatbot_responses.json`).
    - Uses the OpenAI Embeddings API (`text-embedding-3-small`) to compute semantic similarity between the query and the predefined responses.
    - If no relevant response is found from the predefined set, generates a new response using GPT-3.5-turbo.
- Store conversation history, including:
    - Query text
    - Retrieved response
    - Timestamp of the interaction
    - Confidence score of the response
- Include one open-ended query not in the predefined responses (e.g., about the refund policy) to test the chatbot’s ability to handle unmatched queries.
- Include one paraphrased query about support hours (e.g., “When can I talk to someone from support?”) to test semantic similarity matching.
- Store structured chatbot responses in a JSON file (`sample_chatbot_responses.json`). Make sure they follow this format:
```json
[
    {
        "query_text": "How do I reset my password?",
        "retrieved_response": "You can reset your password by clicking 'Forgot Password' on the login page.",
        "timestamp": "2025-04-02T14:30:00Z",
        "confidence_score": 0.92
    },
    {
        "query_text": "What are your business hours?",
        "retrieved_response": "Our support team is available from 9 AM to 5 PM, Monday to Friday.",
        "timestamp": "2025-04-02T14:35:00Z",
        "confidence_score": 0.87
    }
]
```
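
One way to picture the retrieve-or-generate decision and the log entry format is the sketch below. The threshold value, the `answer_query` and `utc_timestamp` names, the made-up similarity scores, and the stand-in `generate` function are all assumptions for illustration. In the notebook, `generate` would wrap a chat completion request and the similarities would come from embedding comparisons. Giving generated answers a confidence of 0.0 is likewise just a placeholder choice.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional, Union


def utc_timestamp(now: Optional[datetime] = None) -> str:
    """Format an aware datetime (default: now) as e.g. 2025-04-02T14:30:00Z."""
    moment = now if now is not None else datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def answer_query(
    query: str,
    responses: List[str],
    similarities: List[float],
    generate: Callable[[str], str],
    threshold: float = 0.6,
    fallback_confidence: float = 0.0,
) -> Dict[str, Union[str, float]]:
    """Return the best predefined response, or a generated one below the threshold."""
    best = max(range(len(similarities)), key=lambda i: similarities[i], default=None)
    if best is not None and similarities[best] >= threshold:
        text = responses[best]
        confidence = min(1.0, max(0.0, similarities[best]))
    else:
        # No predefined response is close enough: fall back to generation
        text = generate(query)
        confidence = fallback_confidence
    return {
        "query_text": query,
        "retrieved_response": text,
        "timestamp": utc_timestamp(),
        "confidence_score": round(confidence, 2),
    }


responses = [
    "Our support team is available from 9 AM to 5 PM, Monday to Friday.",
    "You can reset your password by clicking 'Forgot Password' on the login page.",
]


def generate(query: str) -> str:
    """Stand-in for a chat completion call (hypothetical)."""
    return f"(generated answer for: {query})"


# Similarity scores are made up; real ones would come from embeddings
matched = answer_query("When can I talk to someone from support?", responses, [0.81, 0.22], generate)
unmatched = answer_query("Can I get a refund if I return the item?", responses, [0.31, 0.18], generate)
```

The paraphrased support-hours query clears the threshold and returns the predefined answer, while the refund query falls through to `generate`. Note that `datetime.fromisoformat` only parses the trailing `Z` from Python 3.11 onward, so code that reads these timestamps on older versions may need to replace `Z` with `+00:00` first.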

In [20]:
# =========================================
# TASK 3 — CHATBOT WITH HISTORY AND GPT FALLBACK
# =========================================

import json
import time
from datetime import datetime, timezone
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

# ---------- CONFIG ----------

client = OpenAI()

RESPONSES_JSON = "chatbot_responses.json"
OUTPUT_JSON = "sample_chatbot_responses.json"

EMBED_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-3.5-turbo"
TOP_K = 1  # only need top 1 for chatbot reply
CONFIDENCE_THRESHOLD = 0.6  # if similarity < threshold, call GPT
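RETRY_DELAY = 5
MAX_RETRIES = 5
CHUNK_SIZE = 100

# ---------- EMBEDDING WITH RETRY ----------

def embed_texts(texts):
    """Embed texts in chunks (one request per chunk), retrying with backoff."""
    embeddings = []
    for i in range(0, len(texts), CHUNK_SIZE):
        chunk = texts[i:i + CHUNK_SIZE]
        for attempt in range(MAX_RETRIES):
            try:
                # input accepts a list, so one request embeds the whole chunk
                res = client.embeddings.create(
                    model=EMBED_MODEL,
                    input=chunk
                )
                embeddings.extend([d.embedding for d in res.data])
                break
            except Exception:
                # Re-raise after the last attempt instead of retrying forever
                if attempt == MAX_RETRIES - 1:
                    raise
                time.sleep(RETRY_DELAY * 2 ** attempt)
    return np.array(embeddings)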

# ---------- LOAD & NORMALIZE PREDEFINED RESPONSES ----------

with open(RESPONSES_JSON, "r", encoding="utf-8") as f:
    raw = json.load(f)

def extract_strings(obj):
    strings = []
    if isinstance(obj, str):
        if obj.strip():
            strings.append(obj.strip())
    elif isinstance(obj, dict):
        for v in obj.values():
            strings.extend(extract_strings(v))
    elif isinstance(obj, list):
        for item in obj:
            strings.extend(extract_strings(item))
    return strings

predefined_texts = extract_strings(raw)
if not predefined_texts:
    raise ValueError("chatbot_responses.json contains no usable text")

# Embed predefined responses
response_embeddings = embed_texts(predefined_texts)

# ---------- CHATBOT FUNCTION ----------

def get_chatbot_response(query_text):
    # Embed the query
    query_emb = embed_texts([query_text])[0]

    # Compute cosine similarity
    sims = cosine_similarity(query_emb.reshape(1, -1), response_embeddings)[0]
    top_idx = sims.argmax()
    top_score = sims[top_idx]
    retrieved_response = predefined_texts[top_idx]

    # Decide whether to use GPT fallback
    if top_score < CONFIDENCE_THRESHOLD:
        # Generate GPT response
        prompt = f"Customer query: {query_text}\nProvide a helpful, concise response."
        gpt_resp = client.chat.completions.create(
            model=CHAT_MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )
        retrieved_response = gpt_resp.choices[0].message.content.strip()
        confidence_score = 0.5  # fallback default confidence
    else:
        confidence_score = float(np.clip(top_score, 0, 1))

    # Timestamp in UTC ISO format
    timestamp = datetime.now(timezone.utc).isoformat()

    return {
        "query_text": query_text,
        "retrieved_response": retrieved_response,
        "timestamp": timestamp,
        "confidence_score": confidence_score
    }

# ---------- SAMPLE TEST QUERIES ----------

test_queries = [
    "How do I reset my password?",  # matches predefined
    "When can I talk to someone from support?",  # paraphrased support hours
    "Can I get a refund if I return the item?"  # open-ended, likely GPT
]

chat_history = []
for q in test_queries:
    chat_history.append(get_chatbot_response(q))

# ---------- SAVE CHAT HISTORY ----------

with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(chat_history, f, indent=2, ensure_ascii=False)

In [21]:
import json
import os
from datetime import datetime

# ---------- FILE PATHS ----------

TASK1_JSON = "knowledge_embeddings.json"
TASK2_JSON = "query_responses.json"
TASK3_JSON = "sample_chatbot_responses.json"

errors = []

# ---------- TASK 1 VALIDATION ----------

if not os.path.exists(TASK1_JSON):
    errors.append("Task 1 JSON file not found.")
else:
    with open(TASK1_JSON, "r", encoding="utf-8") as f:
        task1_data = json.load(f)
    for i, doc in enumerate(task1_data):
        if not all(k in doc for k in ["document_id", "document_text", "embedding_vector", "metadata"]):
            errors.append(f"Task 1: Missing keys in document {i}.")
            continue  # the type checks below need every key to be present
        if not isinstance(doc["embedding_vector"], list):
            errors.append(f"Task 1: embedding_vector not a list in document {i}.")
        if not isinstance(doc["document_id"], int):
            errors.append(f"Task 1: document_id not int in document {i}.")

# ---------- TASK 2 VALIDATION ----------

if not os.path.exists(TASK2_JSON):
    errors.append("Task 2 JSON file not found.")
else:
    with open(TASK2_JSON, "r", encoding="utf-8") as f:
        task2_data = json.load(f)
    for i, entry in enumerate(task2_data):
        # Check keys
        if not all(k in entry for k in ["query_id", "query_text", "top_responses", "confidence_scores"]):
            errors.append(f"Task 2: Missing keys in entry {i}.")
            continue  # the remaining checks need every key to be present
        # Check top_responses
        if not isinstance(entry["top_responses"], list) or len(entry["top_responses"]) != 3:
            errors.append(f"Task 2: top_responses must be a list of 3 in entry {i}.")
        # Check confidence scores
        scores = entry["confidence_scores"]
        if not isinstance(scores, list) or len(scores) != 3:
            errors.append(f"Task 2: confidence_scores must be a list of 3 in entry {i}.")
        for score in scores if isinstance(scores, list) else []:
            # Non-numeric scores are reported rather than raising a TypeError
            if not isinstance(score, (int, float)) or not (0 <= score <= 1):
                errors.append(f"Task 2: confidence score {score} out of range in entry {i}.")

# ---------- TASK 3 VALIDATION ----------

if not os.path.exists(TASK3_JSON):
    errors.append("Task 3 JSON file not found.")
else:
    with open(TASK3_JSON, "r", encoding="utf-8") as f:
        task3_data = json.load(f)
    for i, entry in enumerate(task3_data):
        # Check keys
        if not all(k in entry for k in ["query_text", "retrieved_response", "timestamp", "confidence_score"]):
            errors.append(f"Task 3: Missing keys in entry {i}.")
            continue  # the remaining checks need every key to be present
        # Check confidence_score
        score = entry["confidence_score"]
        if not isinstance(score, (int, float)) or not (0 <= score <= 1):
            errors.append(f"Task 3: confidence_score out of range in entry {i}.")
        # Check timestamp format; a trailing "Z" is only parsed natively from Python 3.11
        try:
            datetime.fromisoformat(str(entry["timestamp"]).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"Task 3: Invalid timestamp format in entry {i}.")

# ---------- FINAL REPORT ----------

if not errors:
    print("All tasks validated successfully! ✅")
else:
    print("Validation errors found:")
    for e in errors:
        print("-", e)

All tasks validated successfully! ✅
