# GPT Chatbot Guide: Using Embeddings and a Personalized Knowledge Base

A guide on how to use the OpenAI embeddings endpoint to answer questions based on your own dataset, such as the FAQ section of a public website.

This notebook guides you through the process in a few steps:
- Load the CSV file for further processing and set the column names.
- Calculate a vector for each section in the data file, using the embeddings endpoint.
- Search for the relevant sections based on a prompt and the vectors (embeddings) we calculated.
- Answer the question in a chat session based on the context we provided.


> Step 1: Load the dataset (in CSV format) and calculate the number of tokens for each text. This matters because GPT models can process only a limited number of tokens per request, and usage is charged per token. To do this, we can use the tiktoken package and the following code:



In [None]:
pip install tiktoken

In [3]:
pip install pandas

In [5]:
import tiktoken
import pandas as pd

# Load the cl100k_base tokenizer which is designed to work with the ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

# Read the dataset into a data frame
df = pd.read_csv('hkjc_faq.csv')
# Set the column names
df.columns = ['question', 'answer']

# Combine each question and its answer into a single text field
df['qna'] = df['question'] + df['answer']

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df['qna'].apply(lambda x: len(tokenizer.encode(x)))

df.head()

| | question | answer | qna | n_tokens |
|---|---|---|---|---|
| 0 | Where can I learn horse riding in Hong Kong? A... | Yes. the Club operates three public riding sch... | Where can I learn horse riding in Hong Kong? A... | 68 |
| 1 | I want to have deeper knowledge on Hong Kong h... | Located at Happy Valley Racecourse, The Hong K... | I want to have deeper knowledge on Hong Kong h... | 47 |
| 2 | Where can I find the latest Racing news? | You can find the details in https://racingnews... | Where can I find the latest Racing news?You ca... | 26 |
| 3 | I know little about football. How is a footbal... | Football is contested between 2 teams of 11 pl... | I know little about football. How is a footbal... | 66 |
| 4 | Which are the top football leagues, teams and ... | European football are widely regarded as the h... | Which are the top football leagues, teams and ... | 106 |
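
The per-row token counts can also be summed to gauge the size of the whole job before calling the API. The sketch below is illustrative only: the counts are copied from the five rows shown above, `PRICE_PER_1K_TOKENS` is an assumed placeholder rather than a real price, and `estimate_cost` and `rows_over_limit` are hypothetical helper names.

```python
# Token counts of the five rows shown above, standing in for df['n_tokens'].
n_tokens = [68, 47, 26, 66, 106]

# Assumed placeholder price; substitute the current rate for your model.
PRICE_PER_1K_TOKENS = 0.0001


def estimate_cost(token_counts, price_per_1k):
    """Return (total tokens, estimated cost) for processing every row once."""
    total = sum(token_counts)
    return total, total / 1000 * price_per_1k


def rows_over_limit(token_counts, limit):
    """Return the positions of rows whose token count exceeds the limit."""
    return [i for i, n in enumerate(token_counts) if n > limit]


total, cost = estimate_cost(n_tokens, PRICE_PER_1K_TOKENS)
print(total, rows_over_limit(n_tokens, 100))
```

Rows flagged by `rows_over_limit` are the ones that would need to be split before embedding.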


In [6]:
max_tokens = 500

# Function to split the text into chunks of a maximum number of tokens
def split_into_many(text, max_tokens = max_tokens):

    # Split the text into sentences
    sentences = text.split('. ')

    # Get the number of tokens for each sentence
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]
    
    chunks = []
    tokens_so_far = 0
    chunk = []

    # Loop through the sentences and tokens joined together in a tuple
    for sentence, token in zip(sentences, n_tokens):

        # If adding this sentence would exceed the max number of tokens, close the
        # current chunk, add it to the list of chunks, and reset the running totals
        if chunk and tokens_so_far + token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk = []
            tokens_so_far = 0

        # If the number of tokens in the current sentence is greater than the max number of 
        # tokens, go to the next sentence
        if token > max_tokens:
            continue

        # Otherwise, add the sentence to the chunk and add the number of tokens to the total
        chunk.append(sentence)
        tokens_so_far += token + 1

    # Add the final chunk, which the loop above never flushes
    if chunk:
        last = ". ".join(chunk)
        chunks.append(last if last.endswith(".") else last + ".")

    return chunks
    

shortened = []

# Loop through the dataframe
for _, row in df.iterrows():

    # If the text is missing, go to the next row
    if pd.isna(row['qna']):
        continue

    # If the number of tokens is greater than the max number of tokens, split the text into chunks
    if row['n_tokens'] > max_tokens:
        shortened += split_into_many(row['qna'])
    
    # Otherwise, add the text to the list of shortened texts
    else:
        shortened.append(row['qna'])

In [7]:
df = pd.DataFrame(shortened, columns = ['qna'])
df['n_tokens'] = df['qna'].apply(lambda x: len(tokenizer.encode(x)))
df.head()

| | qna | n_tokens |
|---|---|---|
| 0 | Where can I learn horse riding in Hong Kong? A... | 68 |
| 1 | I want to have deeper knowledge on Hong Kong h... | 47 |
| 2 | Where can I find the latest Racing news?You ca... | 26 |
| 3 | I know little about football. How is a footbal... | 66 |
| 4 | Which are the top football leagues, teams and ... | 106 |


> Step 2: Pass the dataset to the OpenAI Embeddings API to create an embedding (a vector of floating-point numbers representing the input text) for each row, and store the result as a CSV file. We will then use these embeddings to find the most appropriate FAQ section(s) for the user prompt:

In [None]:
pip install openai

In [24]:
import openai
openai.api_key = ''  # Set your OpenAI API key here

In [13]:

# Request an embedding for each row (one API call per row)
df['embeddings'] = df['qna'].apply(lambda x: openai.Embedding.create(input=x, engine='text-embedding-ada-002')['data'][0]['embedding'])
df.to_csv('hkjc_faq_embeddings.csv')
df.head()

| | qna | n_tokens | embeddings |
|---|---|---|---|
| 0 | Where can I learn horse riding in Hong Kong? A... | 68 | [-0.009615047834813595, -0.009309403598308563,... |
| 1 | I want to have deeper knowledge on Hong Kong h... | 47 | [-0.014127939939498901, -0.020054219290614128,... |
| 2 | Where can I find the latest Racing news?You ca... | 26 | [-0.010904733091592789, 0.02442556619644165, 0... |
| 3 | I know little about football. How is a footbal... | 66 | [0.006150386296212673, -0.0035078837536275387,... |
| 4 | Which are the top football leagues, teams and ... | 106 | [0.023489559069275856, -0.0022667041048407555,... |
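
The cell above makes one request per row, which is fine for a small FAQ. For larger datasets, the embeddings endpoint also accepts a list of input strings, so rows can be sent in batches. The sketch below shows only the batching logic: `embed_in_batches` is a hypothetical helper, and `fake_embed` is a stand-in for whatever function wraps the real API call and returns one vector per input, in order.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on slices of texts and return one vector per text, in order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch):
    # Hypothetical stand-in for the API call: one 2-number "vector" per text.
    return [[float(len(t)), float(t.count(" "))] for t in batch]


vectors = embed_in_batches(["a b", "cd", "e f g"], fake_embed, batch_size=2)
print(vectors)
```

Passing the embedding function in as an argument keeps the batching logic testable without network access.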


> Step 3: Load the embeddings CSV file into a dataframe that can be used to find the most similar context:



In [16]:
import ast

import numpy as np
from openai.embeddings_utils import distances_from_embeddings, cosine_similarity

df = pd.read_csv('hkjc_faq_embeddings.csv', index_col=0)
# The embeddings were saved as strings; parse them back into NumPy arrays
df['embeddings'] = df['embeddings'].apply(ast.literal_eval).apply(np.array)

df.head()

| | qna | n_tokens | embeddings |
|---|---|---|---|
| 0 | Where can I learn horse riding in Hong Kong? A... | 68 | [-0.009615047834813595, -0.009309403598308563,... |
| 1 | I want to have deeper knowledge on Hong Kong h... | 47 | [-0.014127939939498901, -0.020054219290614128,... |
| 2 | Where can I find the latest Racing news?You ca... | 26 | [-0.010904733091592789, 0.02442556619644165, 0... |
| 3 | I know little about football. How is a footbal... | 66 | [0.006150386296212673, -0.0035078837536275387,... |
| 4 | Which are the top football leagues, teams and ... | 106 | [0.023489559069275856, -0.0022667041048407555,... |
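
Because a CSV file stores each embedding as text, the column has to be parsed back into numbers on load. One way to do that without executing arbitrary code is `ast.literal_eval`, which accepts only Python literals. The sketch below uses a tiny in-memory CSV with made-up values in place of the real file.

```python
import ast
import csv
import io

# A tiny CSV in the same shape as the embeddings file (values are made up).
raw = 'qna,embeddings\n"Q1?A1","[0.1, -0.2, 0.3]"\n"Q2?A2","[0.0, 0.5, -0.5]"\n'

rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    # literal_eval parses lists and numbers but refuses function calls or imports.
    row["embeddings"] = ast.literal_eval(row["embeddings"])

print(rows[0]["embeddings"])
```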




> Step 4: Define a function that converts the user prompt into an embedding and uses it to find the most similar context in the dataframe:



In [17]:
def create_context(
    question, max_len=1800, size="ada"
):
    """
    Create a context for a question by finding the most similar context from the
    global dataframe `df`. `max_len` is the context budget in tokens; `size` is unused.
    """

    # Get the embeddings for the question
    q_embeddings = openai.Embedding.create(input=question, engine='text-embedding-ada-002')['data'][0]['embedding']

    # Get the distances from the embeddings
    df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')


    returns = []
    cur_len = 0

    # Sort by distance and add the text to the context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():
        
        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4
        
        # If the context is too long, break
        if cur_len > max_len:
            break
        
        # Else add it to the text that is being returned
        returns.append(row["qna"])

    # Return the context
    return "\n\n###\n\n".join(returns)
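
To make the ranking step concrete, here is a small self-contained sketch of the same idea using toy two-dimensional vectors. It assumes the `cosine` distance metric means one minus the cosine similarity. `cosine_distance`, `select_context`, and the sample sections are hypothetical and use plain Python rather than the `openai.embeddings_utils` helper.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_context(query_vec, sections, max_len):
    """Pick the closest sections until the token budget would be exceeded."""
    ranked = sorted(sections, key=lambda s: cosine_distance(query_vec, s["embedding"]))
    picked, used = [], 0
    for section in ranked:
        used += section["n_tokens"] + 4  # same per-section allowance as create_context
        if used > max_len:
            break
        picked.append(section["text"])
    return picked


sections = [
    {"text": "riding schools", "n_tokens": 10, "embedding": [1.0, 0.0]},
    {"text": "football rules", "n_tokens": 10, "embedding": [0.0, 1.0]},
    {"text": "racing news", "n_tokens": 10, "embedding": [0.8, 0.6]},
]
print(select_context([1.0, 0.1], sections, max_len=30))
```

The query vector points almost in the same direction as the first section, so that section ranks first. The token budget of 30 then leaves room for only one more section.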


> Step 5: Define a function that builds the query with the corresponding context and calls the OpenAI Text Completion API to get the relevant answer:

In [23]:
def qna(
    model="text-davinci-003",
    q="Where can I learn the horse riding provided by Hong Kong Jockey Club?",
    max_len=1800,
    size="ada",
    debug=False,
    max_tokens=150,
    stop_sequence=None
):
    """
    Answer a question based on the most similar context from the dataframe texts
    """
    # Retrieve the most relevant FAQ sections within the token budget
    context = create_context(
        q,
        max_len=max_len,
        size=size,
    )
    # If debug, print the retrieved context
    if debug:
        print("Context:\n" + context)
        print("\n\n")

    try:
        # Create a completion using the question and context
        response = openai.Completion.create(
            prompt=f"Your name is called HelpYou 168. Answer the question based on the context below, and if the question can't be answered based on the context, say \"I don't know\"\n\nContext: {context}\n\n---\n\nQuestion: {q}\nAnswer:",
            temperature=0,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=stop_sequence,
            model=model,
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        # Report the error and fall back to an empty answer
        print(e)
        return ""
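
The prompt is the only place where the retrieved context and the user's question meet, so it can help to inspect it separately from the API call. The sketch below isolates that step in a hypothetical `build_prompt` helper. It follows the prompt layout used above but leaves out the persona sentence for brevity, and the sample context and question are made up.

```python
def build_prompt(context, question):
    """Assemble the instruction, retrieved context, and question into one prompt."""
    return (
        "Answer the question based on the context below, and if the question "
        "can't be answered based on the context, say \"I don't know\"\n\n"
        f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    )


prompt = build_prompt(
    "The Club operates three public riding schools.",
    "Where can I learn riding?",
)
print(prompt)
```

Ending the prompt with `Answer:` cues a completion model to continue with the answer text. The explicit "I don't know" instruction is intended to discourage answers that are not grounded in the context.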

In [18]:
question = "What is your name?"
answer = qna(q=question)

print(f"\nQ: {question}\nA: {answer}")


Q: What is your name?
A: My name is HelpYou 168.


In [19]:
question = "Where can I learn horse riding provided by HKJC?"
answer = qna(q=question)

print(f"\nQ: {question}\nA: {answer}")


Q: Where can I learn horse riding provided by HKJC?
A: Yes. the Club operates three public riding schools at Tuen Mun, Pokfulam, Lei Yue Mun.  The three schools, all recognized and approved by The British Horse Society, offer courses and activities for all ages.


In [20]:
question = "香港賽馬會提供的騎術哪裡可以學?"
answer = qna(q=question)

print(f"\nQ: {question}\nA: {answer}")


Q: 香港賽馬會提供的騎術哪裡可以學?
A: 香港賽馬會在屯門、薄扶林和鯉魚門營運三個公共騎術學校，所有學校均獲得英國馬會的認可和批准，提供各年齡層的課程和活動。


In [21]:
question = "How much is the Membership fee for Racing Members and Full Members?"
answer = qna(q=question)

print(f"\nQ: {question}\nA: {answer}")


Q: How much is the Membership fee for Racing Members and Full Members?
A: Racing Membership: The entrance fee for admission to Racing Membership is currently at HK$150,000. Racing Members are also required to pay a monthly subscription currently at HK$850. Full Membership: The entrance fee for admission to Full Membership is currently at HK$850,000; Monthly subscription for continued enjoyment of membership privileges and facilities is currently at HK$2,550.


In [22]:
question = "Which team is the famous English Clubs?"
answer = qna(q=question)

print(f"\nQ: {question}\nA: {answer}")


Q: Which team is the famous English Clubs?
A: The famous English clubs are Manchester Utd, Arsenal, Liverpool and Chelsea.


In [None]:
question = "哪支球隊是著名的英格蘭俱樂部?"
answer = qna(q=question)

print(f"\nQ: {question}\nA: {answer}")


Q: 哪支球隊是著名的英格蘭俱樂部?
A: Manchester Utd, Arsenal, Liverpool and Chelsea.


In [None]:
question = "What is The Racing Club Concession Scheme?"
answer = qna(q=question)

print(f"\nQ: {question}\nA: {answer}")


Q: What is The Racing Club Concession Scheme?
A: The Racing Club Concession Scheme allows Members to use The Racing Club facilities in Happy Valley Racecourse from Monday to Sunday and on all Racedays; and The Racing Club facilities in Sha Tin Racecourse on all Racedays. Members may also enroll in a wide range of racing and lifestyle programmes designed exclusively for Racing Club Members. The joining fee for enrollment into the Scheme is currently at HK$73,000. Members are also required to pay a monthly subscription currently at HK$700. The terms and conditions of the Concession Schemes, including the fees, are subject to review by the Stewards from time to time. All fees and charges paid are not refundable.
