#A sample for building an HKJC FAQ chatbot using ChatGPT

In this notebook, I demonstrate how to use two OpenAI models (GPT-3.5 and an Embeddings model) to build a chatbot that answers questions based on a dataset drawn from the FAQ sections of the HKJC public website.

##Install and define required lib, models and openai API key

**1. Install and define the required libraries, models and openai api key we have to use for this demo.**

In [None]:
pip install --upgrade openai

In [None]:
pip install --upgrade tiktoken

In [16]:
import numpy as np
import openai
import pandas as pd
import pickle
import tiktoken

COMPLETIONS_MODEL = "text-davinci-003"
EMBEDDING_MODEL = "text-embedding-ada-002"

In [17]:
api_key="{API-KEY}"
openai.api_key = api_key

**2. By default, GPT-3.5 doesn't have accurate information about the HKJC FAQ data. For example, it does not answer the following question correctly.
In fact, HKJC operates three public riding schools at Tuen Mun, Pokfulam and Lei Yue Mun, so GPT needs some assistance here.**

In [18]:
prompt = "Where to learn horse riding courses provided by Hong Kong Jockey Club?"

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

'The Hong Kong Jockey Club provides horse riding courses at its two training centres: the Sha Tin Racecourse and the Beas River Country Club. Both centres offer a range of courses for all levels of riders, from beginners to advanced.'

##Set the prompt for unsure answer response



**3. First, keep the temperature at 0 and instruct the model in the prompt to say that it has no relevant information whenever it is unsure of the answer.**


In [46]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't have relevant information."

Q: Where to learn horse riding course provided by Hong Kong Jockey Club?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Sorry, I don't have relevant information."

##Provide extra contextual info in the prompt directly

**4. To help the model answer the question, we provide extra contextual information in the prompt. When the total required context is short, we can include it in the prompt directly to tell the model to explicitly make use of the provided text.**
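
As a minimal sketch, the prompt layout described here can be wrapped in a small helper so that the context and the question can be swapped without retyping the instruction. The names `build_prompt` and `FALLBACK` are hypothetical and not part of this notebook, and no API call is made.

```python
# Hypothetical helper (not part of the original notebook): assembles the
# instruction, the context and the question into a single prompt string.
FALLBACK = "Sorry, I don't have relevant information."


def build_prompt(context: str, question: str, fallback: str = FALLBACK) -> str:
    instruction = (
        "Answer the question as truthfully as possible using the provided text, "
        "and if the answer is not contained within the text below, "
        f'say "{fallback}"'
    )
    # Instruction first, then the context, then the question with an open "A:".
    return f"{instruction}\n\nContext:\n{context.strip()}\n\nQ: {question.strip()}\nA:"


example_prompt = build_prompt(
    context="The Hong Kong Jockey Club operates three public riding schools at Tuen Mun, Pokfulam, Lei Yue Mun.",
    question="Where to learn horse riding course provided by Hong Kong Jockey Club?",
)
print(example_prompt)
```

The resulting string could then be passed as the `prompt` argument of the completion call.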

In [65]:
prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "Sorry, I don't have relevant information."

Context:
The Hong Kong Jockey Club operates three public riding schools at Tuen Mun, Pokfulam, Lei Yue Mun.  
The three schools, all recognized and approved by The British Horse Society, offer courses and activities for all ages. 

Q: Where to learn horse riding course provided by Hong Kong Jockey Club? Can tell me more about it?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

'The Hong Kong Jockey Club operates three public riding schools at Tuen Mun, Pokfulam, Lei Yue Mun. These schools offer courses and activities for all ages, and are recognized and approved by The British Horse Society.'

##To build the chatbot with our own custom data on GPT model

**5. Adding extra information into the prompt only works when the dataset of extra content that the model may need to know is small enough to fit in a single prompt. What do we do when we need the model to choose relevant contextual information from within a large body of information?**

The remainder of this notebook demonstrates a method for augmenting GPT-3.5 with a large body of additional contextual information by using document embeddings and retrieval. This method answers queries in two steps: first it retrieves the information relevant to the query, then it writes an answer tailored to the question based on the retrieved information. The first step uses the [Embeddings API](https://beta.openai.com/docs/guides/embeddings), the second step uses the [Completions API](https://beta.openai.com/docs/guides/completion/introduction).
 
The steps are:
* Preprocess the contextual information by splitting it into chunks and creating an embedding vector for each chunk.
* On receiving a query, embed the query in the same vector space as the context chunks and find the context embeddings which are most similar to the query.
* Prepend the text of the most relevant context chunks to the query prompt.
* Submit the question along with the most relevant context to GPT, and receive an answer which makes use of the provided contextual information.
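
The four steps can be sketched end to end without calling any API. In the sketch below, `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the chunk texts are short illustrative sentences rather than rows of the dataset, and the final call to the completion model is omitted.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> dict[str, float]:
    """Stand-in for an embedding model: a unit-length bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


# Step 1: split the context into chunks and embed each one (illustrative text).
chunks = [
    "The Club operates three public riding schools at Tuen Mun, Pokfulam and Lei Yue Mun.",
    "The latest racing news is published on the racing news website.",
    "Football is contested between two teams.",
]
chunk_embeddings = [toy_embed(chunk) for chunk in chunks]

# Step 2: embed the query in the same space and rank the chunks by similarity.
query = "Where are the public riding schools?"
query_embedding = toy_embed(query)
ranked = sorted(
    zip((similarity(query_embedding, e) for e in chunk_embeddings), chunks),
    reverse=True,
)

# Step 3: prepend the most relevant chunk to the query.
best_chunk = ranked[0][1]
prompt = f"Context:\n* {best_chunk}\n\nQ: {query}\nA:"

# Step 4: the prompt would now be submitted to the completion model (omitted).
print(prompt)
```

A real embedding model captures meaning rather than word overlap, but the retrieve-then-answer flow is the same.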

**6. Preprocess the document library from the relevant dataset:**
- We plan to use document embeddings to fetch the most relevant part or parts of our document library and insert them into the prompt that we provide to GPT-3.5. We therefore need to break up the document library into "sections" of context, which can be searched and retrieved separately.
- Sections should be large enough to contain enough information to answer a question, but small enough to fit one or several into the GPT-3.5 prompt. We find that approximately a paragraph of text is usually a good length, but you should experiment for your particular use case.
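
One possible way to produce paragraph-sized sections is sketched below. It is an assumption for illustration, not the procedure used to build the hosted dataset: `split_into_sections` is a hypothetical helper, and it uses a whitespace word count as a crude proxy for the token count.

```python
def split_into_sections(text: str, max_words: int = 120) -> list[str]:
    """Split on blank lines, then break any paragraph longer than max_words
    into consecutive pieces of at most max_words words."""
    sections = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        # An empty paragraph has no words, so the loop below adds nothing.
        for start in range(0, len(words), max_words):
            sections.append(" ".join(words[start:start + max_words]))
    return sections


sample = (
    "Para one has five words.\n\n"
    "Para two is a little bit longer than the first one."
)
sections = split_into_sections(sample, max_words=6)
for section in sections:
    print(len(section.split()), "words:", section)
```

For real use, the word count would be replaced by the tokenizer's count so that section sizes match the prompt budget.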

In [23]:
# We have hosted the processed dataset, so you can download it directly without having to recreate it.
# This dataset has already been split into sections, one row for each section of faq from HKJC public website (partial data)

df = pd.read_csv('https://storage.googleapis.com/alexshlam-chatgpt/hkjc_faq_text.csv')
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")
df.sample(10)

10 rows in the data.


| title | heading | content | tokens |
|---|---|---|---|
| Membership | What is included in the Concession Schemes and what are the fees involved? | Racing Members may join the following concessi... | 210 |
| Horse Racing | I know little about horseracing, Where can I learn more about the sport? | Please click?"Racing 101"?and?"Know About Hors... | 10 |
| Experience Football | Which are the top football leagues, teams and star players in the world? | European football are widely regarded as the h... | 54 |
| Membership | How do I become a Hong Kong Jockey Club Member? | Anyone aged 18 or above can apply to become a ... | 130 |
| Experience Football | I know little about football. How is a football match being played? | Football is contested between 2 teams of 11 pl... | 33 |
| Horse Racing | Where can I find the latest Racing news? | You can find the details there https://racingn... | 10 |
| Membership | How much is the Membership fee for Racing Members and Full Members? And what are the benefits? | c | 130 |
| Horse Racing | I want to have deeper knowledge on Hong Kong horseracing history. Where can I find the related information? | Located at Happy Valley Racecourse, The Hong K... | 14 |
| Horse Racing | Where can I learn horse riding in Hong Kong? Are there any riding courses provided by Hong Kong Jockey Club? | Yes. the Club operates three public riding sch... | 30 |
| Experience Football | What are the basic factors to consider when I am trying to predict a match result? | Player injuries and suspensions, both teams' r... | 21 |


Note: We preprocess the document sections by creating an embedding vector for each section. An embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar are their contents. See the [documentation on OpenAI embeddings](https://beta.openai.com/docs/guides/embeddings) for more information.
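
To make "closer" concrete, the sketch below compares small hand-made vectors. It assumes unit-length vectors, for which the dot product and the cosine similarity coincide; the two-dimensional vectors are illustrative only, since real embeddings have many more dimensions.

```python
import math


def dot(x: list[float], y: list[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def normalize(x: list[float]) -> list[float]:
    """Scale a vector to length 1."""
    length = math.sqrt(dot(x, x))
    return [a / length for a in x]


def cosine_similarity(x: list[float], y: list[float]) -> float:
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])    # points in a similar direction to a
c = normalize([-3.0, -4.0])  # points in the opposite direction to a

# For unit-length vectors the plain dot product equals the cosine similarity.
print(dot(a, b), cosine_similarity(a, b))
print(dot(a, c), cosine_similarity(a, c))
```

Vectors pointing in similar directions score close to 1, while opposite directions score close to -1.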


In [24]:
def get_embedding(text: str, model: str=EMBEDDING_MODEL) -> list[float]:
    result = openai.Embedding.create(
      model=model,
      input=text
    )
    return result["data"][0]["embedding"]

def compute_doc_embeddings(df: pd.DataFrame) -> dict[tuple[str, str], list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps between each embedding vector and the index of the row that it corresponds to.
    """
    return {
        idx: get_embedding(r.content) for idx, r in df.iterrows()
    }

In [28]:
document_embeddings = compute_doc_embeddings(df)

In [29]:
# An example embedding:
example_entry = list(document_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

('Horse Racing ', 'I know little about horseracing, Where can I learn more about the sport?') : [-0.003935437183827162, 0.00815497525036335, -0.0015714308246970177, -0.036571480333805084, -0.033988747745752335]... (1536 entries)


**7. So we have split our document library into sections, and encoded them by creating embedding vectors that represent each chunk. Next we will use these embeddings to answer our users' questions.**
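
Because every embedding costs an API call, it can be convenient to store the computed embeddings and reload them in later sessions. The sketch below is one possible approach using `pickle`; the helper names, the file name and the toy dictionary are assumptions for illustration and do not appear elsewhere in this notebook.

```python
import os
import pickle
import tempfile


def save_embeddings(embeddings: dict, path: str) -> None:
    """Write the {(title, heading): vector} dictionary to disk."""
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)


def load_embeddings(path: str) -> dict:
    """Read the dictionary back. Only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy stand-in for the real embeddings dictionary (3 numbers instead of 1536).
toy_embeddings = {
    ("Horse Racing", "Where can I find the latest Racing news?"): [0.1, 0.2, 0.3],
}

# Hypothetical cache location in the system temp directory.
cache_path = os.path.join(tempfile.gettempdir(), "hkjc_faq_embeddings_example.pkl")
save_embeddings(toy_embeddings, cache_path)
restored = load_embeddings(cache_path)
print(restored == toy_embeddings)
```

The cache has to be rebuilt whenever the dataset or the embedding model changes.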


**8. Find the most similar document embeddings to the question embedding:** 
- At the time of question-answering, to answer the user's query we compute the query embedding of the question and use it to find the most similar document sections.

In [30]:
def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))

def order_document_sections_by_query_similarity(query: str, contexts: dict[tuple[str, str], list[float]]) -> list[tuple[float, tuple[str, str]]]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities

In [68]:
order_document_sections_by_query_similarity("Where to learn horse riding?", document_embeddings)[:5]

[(0.8421600772648534,
  ('Horse Racing ',
   'I know little about horseracing, Where can I learn more about the sport?')),
 (0.8285157612694096,
  ('Horse Racing ',
   'Where can I learn horse riding in Hong Kong? Are there any riding courses provided by Hong Kong Jockey Club?')),
 (0.8060660295010802,
  ('Horse Racing ',
   'I want to have deeper knowledge on Hong Kong horseracing history. Where can I find the related information?')),
 (0.7796799688026859,
  ('Horse Racing ', 'Where can I find the latest Racing news?')),
 (0.7617852025579473,
  ('Membership', 'How do I become a Hong Kong Jockey Club Member?'))]

In [69]:
order_document_sections_by_query_similarity("Can let me know where can I learn horse riding courses from HKJC?", document_embeddings)[:5]

[(0.860636031189496,
  ('Horse Racing ',
   'Where can I learn horse riding in Hong Kong? Are there any riding courses provided by Hong Kong Jockey Club?')),
 (0.853733839524847,
  ('Horse Racing ', 'Where can I find the latest Racing news?')),
 (0.8354852763608227,
  ('Horse Racing ',
   'I know little about horseracing, Where can I learn more about the sport?')),
 (0.8329132193851098,
  ('Horse Racing ',
   'I want to have deeper knowledge on Hong Kong horseracing history. Where can I find the related information?')),
 (0.80637284279489,
  ('Membership', 'How do I become a Hong Kong Jockey Club Member?'))]

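Note: The output lists the most relevant document sections for each question, ranked by similarity score. The ranking depends on how the question is phrased: for the shorter first query, the riding-course section comes second rather than first.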

**9. Add the most relevant document sections to the query prompt:**
- Once we've calculated the most relevant pieces of context, we construct a prompt by simply prepending them to the supplied query. It is helpful to use a query separator to help the model distinguish between separate pieces of text.
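
The selection of sections under a token budget can be sketched in isolation as follows. The helper `select_sections`, the section labels and the budget of 100 tokens are hypothetical values chosen for illustration; the token counts are supplied by hand rather than computed with a tokenizer.

```python
SEPARATOR = "\n* "


def select_sections(ranked_sections, max_tokens: int, separator_tokens: int) -> list[str]:
    """Take sections in relevance order until the token budget would be exceeded.

    ranked_sections is a list of (text, token_count) pairs, most relevant first.
    """
    chosen = []
    used = 0
    for text, tokens in ranked_sections:
        used += tokens + separator_tokens
        if used > max_tokens:
            # Stop at the first section that does not fit.
            break
        chosen.append(text)
    return chosen


ranked = [
    ("riding schools section", 30),
    ("racing 101 section", 10),
    ("membership fee section", 130),
    ("racing history section", 14),
]
chosen = select_sections(ranked, max_tokens=100, separator_tokens=3)

# Each chosen section is prefixed with the separator to keep the pieces distinct.
context = "".join(SEPARATOR + text for text in chosen)
print(context)
```

Stopping at the first section that does not fit keeps the context in strict relevance order, at the cost of skipping shorter sections further down the ranking.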

In [36]:
MAX_SECTION_LEN = 500
SEPARATOR = "\n* "
ENCODING = "gpt2"  # tokenizer used for counting tokens; note that tiktoken maps text-davinci-003 to "p50k_base"

encoding = tiktoken.get_encoding(ENCODING)
separator_len = len(encoding.encode(SEPARATOR))

f"Context separator contains {separator_len} tokens"

'Context separator contains 3 tokens'

In [52]:
def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:
    """
    Fetch the document sections most relevant to the question and prepend them to the question to build the prompt.
    """
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.        
        document_section = df.loc[section_index]
        
        chosen_sections_len += document_section.tokens + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
            
    # Useful diagnostic information
    print(f"Selected {len(chosen_sections)} document sections:")
    print("\n".join(chosen_sections_indexes))
    
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't have relevant information."\n\nContext:\n"""
    
    return header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"

In [70]:
prompt = construct_prompt(
    "Where to learn horse riding?",
    document_embeddings,
    df
)

print("===\n", prompt)

Selected 6 document sections:
('Horse Racing ', 'I know little about horseracing, Where can I learn more about the sport?')
('Horse Racing ', 'Where can I learn horse riding in Hong Kong? Are there any riding courses provided by Hong Kong Jockey Club?')
('Horse Racing ', 'I want to have deeper knowledge on Hong Kong horseracing history. Where can I find the related information?')
('Horse Racing ', 'Where can I find the latest Racing news?')
('Membership', 'How do I become a Hong Kong Jockey Club Member?')
('Membership', 'How much is the Membership fee for Racing Members and Full Members? And what are the benefits?')
===
 Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't have relevant information."

Context:

* Please click?"Racing 101"?and?"Know About Horses"?for details.
* Yes. the Club operates three public riding schools at Tuen Mun, Pokfulam, Lei Yue Mun.? The three schools, all recognize

Note: We have now obtained the document sections that are most relevant to the question. As a final step, let's put it all together to get an answer to the question.

**10. Answer the user's question based on the context:**
- Now that we've retrieved the relevant context and constructed our prompt, we can finally use the Completions API to answer the user's query.

In [71]:
COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}

In [72]:
def answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    document_embeddings: dict[tuple[str, str], list[float]],
    show_prompt: bool = False
) -> str:
    prompt = construct_prompt(
        query,
        document_embeddings,
        df
    )
    
    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
        prompt=prompt,
        **COMPLETIONS_API_PARAMS
    )

    return response["choices"][0]["text"].strip(" \n")

In [63]:
answer_query_with_context("Where to learn horse riding provided by HKJC?", df, document_embeddings)

Selected 6 document sections:
('Horse Racing ', 'Where can I learn horse riding in Hong Kong? Are there any riding courses provided by Hong Kong Jockey Club?')
('Horse Racing ', 'Where can I find the latest Racing news?')
('Horse Racing ', 'I want to have deeper knowledge on Hong Kong horseracing history. Where can I find the related information?')
('Horse Racing ', 'I know little about horseracing, Where can I learn more about the sport?')
('Membership', 'How do I become a Hong Kong Jockey Club Member?')
('Membership', 'What is included in the Concession Schemes and what are the fees involved?')


'The Hong Kong Jockey Club operates three public riding schools at Tuen Mun, Pokfulam, and Lei Yue Mun. Please click here for details.'

##Showcase how to use Embeddings and Completion APIs for HKJC FAQ chatbot

**11. By combining the Embeddings and Completions APIs, we have created a question-answering model which can answer questions using a large base of additional knowledge. It also understands when it doesn't know the answer!** 

-- Let's have some fun and try some more examples: 

In [73]:
query = "Where to learn horse riding provided by HKJC?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 6 document sections:
('Horse Racing ', 'Where can I learn horse riding in Hong Kong? Are there any riding courses provided by Hong Kong Jockey Club?')
('Horse Racing ', 'Where can I find the latest Racing news?')
('Horse Racing ', 'I want to have deeper knowledge on Hong Kong horseracing history. Where can I find the related information?')
('Horse Racing ', 'I know little about horseracing, Where can I learn more about the sport?')
('Membership', 'How do I become a Hong Kong Jockey Club Member?')
('Membership', 'What is included in the Concession Schemes and what are the fees involved?')

Q: Where to learn horse riding provided by HKJC?
A: The Hong Kong Jockey Club operates three public riding schools at Tuen Mun, Pokfulam, and Lei Yue Mun. Please click here for details.


In [56]:
query = "How do I become a Hong Kong Jockey Club Member?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 6 document sections:
('Membership', 'How do I become a Hong Kong Jockey Club Member?')
('Horse Racing ', 'Where can I find the latest Racing news?')
('Horse Racing ', 'Where can I learn horse riding in Hong Kong? Are there any riding courses provided by Hong Kong Jockey Club?')
('Membership', 'What is included in the Concession Schemes and what are the fees involved?')
('Horse Racing ', 'I want to have deeper knowledge on Hong Kong horseracing history. Where can I find the related information?')
('Horse Racing ', 'I know little about horseracing, Where can I learn more about the sport?')

Q: How do I become a Hong Kong Jockey Club Member?
A: To become a Hong Kong Jockey Club Member, you must be 18 or above and fill in the application form for Racing Membership as well as a separate application form for Full Membership and return both forms to the Club. The application form for Full Membership can be obtained from resident Honorary Stewards, Honorary Voting Members (O) or Votin

In [67]:
query = "Which are the top football leagues, teams and star players in the world?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 9 document sections:
('Experience Football', 'Which are the top football leagues, teams and star players in the world?')
('Experience Football', 'I know little about football. How is a football match being played?')
('Experience Football', 'What are the basic factors to consider when I am trying to predict a match result?')
('Horse Racing ', 'I know little about horseracing, Where can I learn more about the sport?')
('Membership', 'How much is the Membership fee for Racing Members and Full Members? And what are the benefits?')
('Horse Racing ', 'Where can I learn horse riding in Hong Kong? Are there any riding courses provided by Hong Kong Jockey Club?')
('Horse Racing ', 'Where can I find the latest Racing news?')
('Horse Racing ', 'I want to have deeper knowledge on Hong Kong horseracing history. Where can I find the related information?')
('Membership', 'How do I become a Hong Kong Jockey Club Member?')

Q: Which are the top football leagues, teams and star players in the w

In [58]:
query = "Where can I get the latest racing news?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 7 document sections:
('Horse Racing ', 'Where can I find the latest Racing news?')
('Horse Racing ', 'I know little about horseracing, Where can I learn more about the sport?')
('Horse Racing ', 'I want to have deeper knowledge on Hong Kong horseracing history. Where can I find the related information?')
('Experience Football', 'What are the basic factors to consider when I am trying to predict a match result?')
('Membership', 'How do I become a Hong Kong Jockey Club Member?')
('Horse Racing ', 'Where can I learn horse riding in Hong Kong? Are there any riding courses provided by Hong Kong Jockey Club?')
('Membership', 'How much is the Membership fee for Racing Members and Full Members? And what are the benefits?')

Q: Where can I get the latest racing news?
A: You can find the latest racing news at https://racingnews.hkjc.com/english/.


Our Q&A model is less prone to hallucinating answers and has a better sense of what it does or doesn't know. It declines to answer both when the information isn't contained in the context and when the question is not relevant to the dataset.

In [74]:
query = "WHich team won the champion leagues in 2020?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 9 document sections:
('Experience Football', 'Which are the top football leagues, teams and star players in the world?')
('Experience Football', 'I know little about football. How is a football match being played?')
('Experience Football', 'What are the basic factors to consider when I am trying to predict a match result?')
('Membership', 'How much is the Membership fee for Racing Members and Full Members? And what are the benefits?')
('Horse Racing ', 'Where can I find the latest Racing news?')
('Horse Racing ', 'Where can I learn horse riding in Hong Kong? Are there any riding courses provided by Hong Kong Jockey Club?')
('Horse Racing ', 'I know little about horseracing, Where can I learn more about the sport?')
('Horse Racing ', 'I want to have deeper knowledge on Hong Kong horseracing history. Where can I find the related information?')
('Membership', 'How do I become a Hong Kong Jockey Club Member?')

Q: WHich team won the champion leagues in 2020?
A: I don't have releva