# Power your products with ChatGPT and your own data

This walkthrough shows how to build starter Q&A and Chatbot applications using the ChatGPT API and your own data.

It is laid out in these sections:
- **Setup:** 
    - Initiate variables and source the data
- **Lay the foundations:**
    - Set up the vector database to accept vectors and data
    - Load the dataset, chunk the data up for embedding and store in the vector database
- **Make it a product:**
    - Add a retrieval step where users provide queries and we return the most relevant entries
    - Summarise search results with GPT-3
    - Test out this basic Q&A app in Streamlit
- **Build your moat:**
    - Create an Assistant class to manage context and interact with our bot
    - Use the Chatbot to answer questions using semantic search context
    - Test out this basic Chatbot app in Streamlit
    
Upon completion, you will have the building blocks to create your own production chatbot or Q&A application using OpenAI APIs and a vector database.

This notebook was originally presented with [these slides](https://drive.google.com/file/d/1dB-RQhZC_Q1iAsHkNNdkqtxxXqYODFYy/view?usp=share_link), which provide visual context for this journey.

In [2]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Setup

First we'll set up our libraries and environment variables.

In [3]:
import openai
import os
import requests
import numpy as np
import pandas as pd
from typing import Iterator
import tiktoken
import textract
from numpy import array, average

from database import get_redis_connection

# Set our default models and chunking size
from config import COMPLETIONS_MODEL, EMBEDDINGS_MODEL, CHAT_MODEL, TEXT_EMBEDDING_CHUNK_SIZE, VECTOR_FIELD_NAME

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

# Unclosed socket warnings are raised as ResourceWarning
warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [4]:
pd.set_option('display.max_colwidth', 0)

In [5]:
data_dir = os.path.join(os.curdir,'data')
pdf_files = sorted([x for x in os.listdir(data_dir) if 'DS_Store' not in x])
pdf_files

['01-EN-3P379970-7B.pdf',
 '01_en_3p436086-1d.pdf',
 '01_en_3p591321-10e-rzr_rzq-tbvjua-installation-manual.pdf']

## Laying the foundations

### Storage

We're going to use Redis as our database for both the document contents and the vector embeddings. You will need the full Redis Stack to get RediSearch, the module that enables semantic search. More detail is in the [docs for Redis Stack](https://redis.io/docs/stack/get-started/install/docker/).

To set this up locally, you will need to install Docker and then run the following command: ```docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest```.

The code used here draws heavily on [this repo](https://github.com/RedisAI/vecsim-demo).

After setting up the Docker instance of Redis Stack, you can follow the below instructions to initiate a Redis connection and create a Hierarchical Navigable Small World (HNSW) index for semantic search.
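
To make the distance metric concrete, here is a minimal sketch of what a `COSINE` search computes, written as an exact brute-force scan in plain Python. It assumes cosine distance is defined as 1 minus cosine similarity; HNSW approximates this kind of ranking without scanning every vector. The helper names and the toy 3-dimensional vectors are hypothetical and only for illustration.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query, documents, k=2):
    """Rank (name, vector) pairs by ascending cosine distance to the query."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in documents]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy vectors (hypothetical); the real embeddings in this notebook have 1536 dimensions
toy_docs = [
    ("refrigerant", [0.9, 0.1, 0.0]),
    ("drain piping", [0.1, 0.9, 0.0]),
    ("remote controller", [0.0, 0.2, 0.9]),
]
top_matches = nearest([1.0, 0.0, 0.0], toy_docs, k=2)
print(top_matches)
```

A smaller distance means a closer match, so the first entry is the best candidate.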

In [6]:
# Setup Redis
from redis import Redis
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TextField,
    VectorField,
    NumericField
)
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)

redis_client = get_redis_connection()

In [18]:
# Constants
VECTOR_DIM = 1536 #len(data['title_vector'][0]) # length of the vectors
#VECTOR_NUMBER = len(data)                 # initial number of vectors
PREFIX = "hvacdocs"                            # prefix for the document keys
DISTANCE_METRIC = "COSINE"                # distance metric for the vectors (ex. COSINE, IP, L2)

In [19]:
# Create search index

# Index
INDEX_NAME = "hvac-index"           # name of the search index
VECTOR_FIELD_NAME = 'content_vector'

# Define RediSearch fields for each of the columns in the dataset
# This is where you should add any additional metadata you want to capture
filename = TextField("filename")
text_chunk = TextField("text_chunk")
file_chunk_index = NumericField("file_chunk_index")

# define RediSearch vector fields to use HNSW index

text_embedding = VectorField(VECTOR_FIELD_NAME,
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC
    }
)
# Add all our field objects to a list to be created as an index
fields = [filename,text_chunk,file_chunk_index,text_embedding]
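
Because the vector field is declared as `FLOAT32` with `DIM` 1536, each embedding stored in a Redis hash has to be a binary blob of exactly that size. The loader in `transformers.py` is not shown here, so the following is only a sketch of how such a blob could be produced with the standard library; the helper names are hypothetical, and little-endian byte order is an assumption that matches what NumPy's `tobytes()` yields on common hardware.

```python
import struct

VECTOR_DIM = 1536  # matches the DIM declared for the index above

def to_float32_bytes(vector):
    """Pack a list of floats into little-endian FLOAT32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)

def from_float32_bytes(blob):
    """Unpack FLOAT32 bytes back into a list of Python floats."""
    count = len(blob) // 4  # 4 bytes per float32 value
    return list(struct.unpack(f"<{count}f", blob))

placeholder_vector = [0.0] * VECTOR_DIM  # stand-in for a real embedding
blob = to_float32_bytes(placeholder_vector)
print(len(blob))
```

A blob whose length is not 4 times the declared dimension would not match the field definition, which is a useful check when ingestion appears to succeed but search returns nothing.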

In [20]:
redis_client.ping()

True

In [21]:
# Optional step to drop the index if it already exists
#redis_client.ft(INDEX_NAME).dropindex()

# Check if index exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except Exception as e:
    print(e)
    # Create RediSearch Index
    print('Not there yet. Creating')
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )

Index already exists


### Ingestion

We'll load our PDFs and do the following:
- Initiate our tokenizer
- Run a processing pipeline to:
    - Mine the text from each PDF
    - Split them into chunks and embed them
    - Store them in Redis
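
The chunking step can be sketched as follows. This is not the code in `transformers.py`; it is a simplified illustration that splits on whitespace instead of using the `tiktoken` tokenizer, and the function names and chunk size are assumptions chosen for readability.

```python
def chunk_tokens(tokens, chunk_size):
    """Yield consecutive slices of at most chunk_size tokens."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]

def chunk_text(text, chunk_size=5):
    """Split text into chunks of at most chunk_size whitespace-separated tokens."""
    # Whitespace splitting stands in for a real tokenizer (assumption for illustration)
    tokens = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(tokens, chunk_size)]

sample = "Read these instructions carefully before installation and keep this manual in a handy place"
chunks = chunk_text(sample, chunk_size=5)
for index, chunk in enumerate(chunks):
    print(index, chunk)
```

Each chunk would then be embedded and written to Redis together with its filename and chunk index, which correspond to the `filename`, `text_chunk` and `file_chunk_index` fields defined for the index.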

In [22]:
# The transformers.py file contains all of the transforming functions, including ones to chunk, embed and load data
# For more details, check the file and work through each function individually
from transformers import handle_file_string

In [23]:
# Read the API key from the environment instead of hard-coding it in the notebook
openai.api_key = os.getenv("OPENAI_API_KEY")

In [24]:
%%time
# This step can take up to a few minutes, depending on the number and size of the PDFs

# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

# Process each PDF file and prepare for embedding
for pdf_file in pdf_files:
    
    pdf_path = os.path.join(data_dir,pdf_file)
    print(pdf_path)
    
    # Extract the raw text from each PDF using textract
    text = textract.process(pdf_path, method='pdfminer')
    
    # Chunk each document, embed the contents and load to Redis
    handle_file_string((pdf_file,text.decode("utf-8")),tokenizer,redis_client,VECTOR_FIELD_NAME,INDEX_NAME)

./data/01-EN-3P379970-7B.pdf
./data/01_en_3p436086-1d.pdf
./data/01_en_3p591321-10e-rzr_rzq-tbvjua-installation-manual.pdf
CPU times: user 618 ms, sys: 101 ms, total: 719 ms
Wall time: 23.9 s


In [25]:
# Check that our docs have been inserted
redis_client.ft(INDEX_NAME).info()['num_docs']

'136'

## Make it a product

Now we can test that our search works as intended by:
- Querying our data in Redis using semantic search and verifying results
- Adding a step to pass the results to GPT-3 for summarisation

In [26]:
from database import get_redis_results

In [30]:
%%time

f1_query='what type of refrigerant is used for the daikin RZQ24TBVJUA?'

result_df = get_redis_results(redis_client,f1_query,index_name=INDEX_NAME)
result_df.head(2)

CPU times: user 6.66 ms, sys: 1.54 ms, total: 8.2 ms
Wall time: 336 ms


Unnamed: 0,id,result,certainty
0,0,"Filename is: 01-EN-3P379970-7B.pdf; This is installation manual split series air conditioning units that use R410a refrigerant. Specifically, it is for the following products: FVXS09NVJU, FVXS12NVJU, FVXS15NVJU, FVXS18NVJU h s i l g n E s i a ç n a r F l o ñ a p s E DAIKIN ROOM AIR CONDITIONER INSTALLATION MANUAL R410A Split Series Installation manual Manuel dinstallation Manuel dinstallation Manual de instalación MODELS FVXS09NVJU FVXS12NVJU FVXS15NVJU FVXS18NVJU 00_CV_3P379970-7B.indd 1 10/28/2015 20:12:06 Contents Safety Considerations .................................... 1 Accessories ..................................................... 3 Choosing an Installation Site ........................ 3 1. Indoor unit ................................................................... 3 2. Wireless remote controller ........................................... 3 Indoor Unit Installation Diagram ................... 4 Indoor Unit Installation ................................... 5 1. Refrigerant piping ........................................................ 5 2. Drilling a wall hole and installing wall embedded pipe .............................................................................. 7 3. Drain piping ................................................................. 7 4. Installing indoor unit .................................................... 8 4-1. Preparation .......................................................... 8 4-2. Installation ........................................................... 9 5. Flaring the pipe end ..................................................... 12 6. Connecting the refrigerant pipe ................................... 12 6-1. Caution on piping handling .................................. 13 6-2. Selection of copper and heat insulation materials .............................................................. 13 7. Checking for gas leakage ........................................",0.167544424534
1,1,"Filename is: 01_en_3p591321-10e-rzr_rzq-tbvjua-installation-manual.pdf; This document is an installation manual for split system air conditioners. Specifically it is for the following air conditioning products: RZQ18TBVJUA, RZQ24TBVJUA, RZQ30TBVJUA, RZQ36TBVJUA, RZQ42TBVJUA, RZQ48TBVJUA, RZR18TBVJUA, RZR24TBVJUA, RZR30TBVJUA, RZR36TBVJUA, RZR42TBVJUA, RZR48TBVJUA INSTALLATION MANUAL SPLIT SYSTEM Air Conditioners MODEL RZQ18TBVJUA RZQ24TBVJUA RZQ30TBVJUA RZQ36TBVJUA RZQ42TBVJUA RZQ48TBVJUA RZR18TBVJUA RZR24TBVJUA RZR30TBVJUA RZR36TBVJUA RZR42TBVJUA RZR48TBVJUA English Français Español Read these instructions carefully before installation. Keep this manual in a handy place for future reference. This manual should be left with the equipment owner. Lire soigneusement ces instructions avant l’installation. Conserver ce manuel à portée de main pour référence ultérieure. Ce manuel doit être donné au propriétaire de l’équipement. Lea cuidadosamente estas instrucciones antes de instalar. Guarde este manual en un lugar a mano para leer en caso de tener alguna duda. Este manual debe permanecer con el propietario del equipo.",0.172387063503


In [31]:
# Build a prompt to provide the original query, the result and ask to summarise for the user
summary_prompt = '''Summarise this result in a bulleted list to answer the search query a customer has sent.
Search query: SEARCH_QUERY_HERE
Search result: SEARCH_RESULT_HERE
Summary:
'''
summary_prepped = summary_prompt.replace('SEARCH_QUERY_HERE',f1_query).replace('SEARCH_RESULT_HERE',result_df['result'][0])
summary = openai.Completion.create(engine=COMPLETIONS_MODEL,prompt=summary_prepped,max_tokens=500)
# Response provided by GPT-3
print(summary['choices'][0]['text'])

- R410a refrigerant is used for the Daikin RZQ24TBVJUA
- Accessories include indoor unit and wireless remote controller
- Installation requires refrigerant piping, drilling a wall hole and installing wall embedded pipe, and drain piping
- Refrigerant pipe end must be flared
- Check for gas leakage after installation is complete


### Search

Now that we've got our knowledge embedded and stored in Redis, we can create an internal search application. It's not sophisticated, but it'll get the job done for us.

In the directory containing this app, execute ```streamlit run search.py```. This will open up a Streamlit app in your browser where you can ask questions of your embedded data.

__Example Questions__:
- what type of refrigerant is used for the daikin RZQ24TBVJUA?
- what are the safety guidelines for installing a daikin room air conditioner?

## Build your moat

The Q&A was useful, but fairly limited in the complexity of interaction we can have - if the user asks a sub-optimal question, there is no assistance from the system to prompt them for more info or conversation to lead them down the right path.

For the next step we'll make a Chatbot using the Chat Completions endpoint, which will:
- Be given instructions on how it should act and what the goals of its users are
- Be supplied some required information that it needs to collect
- Go back and forth with the customer until it has populated that information
- Say a trigger word that will kick off semantic search and summarisation of the response

For more details on our Chat Completions endpoint and how to interact with it, please check out the docs [here](https://platform.openai.com/docs/guides/chat).

### Framework

This section outlines a basic framework for working with the API and storing context of previous conversation "turns". Once this is established, we'll extend it to use our retrieval endpoint.
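
The idea of carrying context across turns can be sketched without any API calls. In this minimal example a stub function stands in for the chat endpoint (an assumption so the snippet runs offline); the `History` class and `echo_model` names are hypothetical. The point it shows is that every request is sent the full list of earlier messages.

```python
def echo_model(messages):
    """Hypothetical stand-in for the chat endpoint: reports how many messages it saw."""
    return {"role": "assistant", "content": f"I have seen {len(messages)} message(s)."}

class History:
    """Keeps every turn so each request carries the full conversation."""

    def __init__(self, respond):
        self.respond = respond
        self.turns = []

    def ask(self, content):
        # Record the user turn, send the whole history, then record the reply
        self.turns.append({"role": "user", "content": content})
        reply = self.respond(self.turns)
        self.turns.append(reply)
        return reply

history = History(echo_model)
first = history.ask("How can you help me")
second = history.ask("Tell me more")
print(second["content"])
```

Because the history grows with every turn, a real application would eventually need to trim or summarise older messages to stay within the model's context limit.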

In [32]:
# A basic example of how to interact with our ChatCompletion endpoint
# It requires a list of "messages", consisting of a "role" (one of system, user or assistant) and "content"
question = 'How can you help me'


completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "user", "content": question}
  ]
)
print(f"{completion['choices'][0]['message']['role']}: {completion['choices'][0]['message']['content']}")

assistant: As an AI language model, I can help you with a variety of tasks. You can ask me questions, seek explanations on different topics, proofread your written work, or help you with your homework. I can also assist you in generating ideas for writing assignments, summarizing longer texts, and providing general knowledge and information. Just let me know what you need help with, and I'll do my best to assist you.


In [33]:
from termcolor import colored

# A basic class to create a message as a dict for chat
class Message:
    
    
    def __init__(self,role,content):
        
        self.role = role
        self.content = content
        
    def message(self):
        
        return {"role": self.role,"content": self.content}
        
# Our assistant class we'll use to converse with the bot
class Assistant:
    
    def __init__(self):
        self.conversation_history = []

    def _get_assistant_response(self, prompt):
        
        try:
            completion = openai.ChatCompletion.create(
              model="gpt-3.5-turbo",
              messages=prompt
            )
            
            response_message = Message(completion['choices'][0]['message']['role'],completion['choices'][0]['message']['content'])
            return response_message.message()
            
        except Exception as e:
            
            # Return the error in message form so callers can still read ['content']
            return Message('assistant', f'Request failed with exception {e}').message()

    def ask_assistant(self, next_user_prompt, colorize_assistant_replies=True):
        self.conversation_history.extend(next_user_prompt)
        assistant_response = self._get_assistant_response(self.conversation_history)
        self.conversation_history.append(assistant_response)
        return assistant_response
            
        
    def pretty_print_conversation_history(self, colorize_assistant_replies=True):
        for entry in self.conversation_history:
            if entry['role'] == 'system':
                pass
            else:
                prefix = entry['role']
                content = entry['content']
                output = colored(prefix +':\n' + content, 'green') if colorize_assistant_replies and entry['role'] == 'assistant' else prefix +':\n' + content
                print(output)

In [34]:
# Initiate our Assistant class
conversation = Assistant()

# Create a list to hold our messages and insert both a system message to guide behaviour and our first user question
messages = []
system_message = Message('system','You are a helpful business assistant who has innovative ideas, and talks in the style of alexander pushkin')
user_message = Message('user','What can you do to help me')
messages.append(system_message.message())
messages.append(user_message.message())
messages

[{'role': 'system',
  'content': 'You are a helpful business assistant who has innovative ideas, and talks in the style of alexander pushkin'},
 {'role': 'user', 'content': 'What can you do to help me'}]

In [35]:
# Get back a response from the Chatbot to our question
response_message = conversation.ask_assistant(messages)
print(response_message['content'])

Dear Sir/Madam,

As a helpful business assistant with innovative ideas, I can assist you in a number of ways that will help you grow and develop your business. Firstly, I would recommend that we conduct a thorough analysis of your business operations to identify areas of improvement. This assessment will help us to identify and address any inefficiencies in your business processes, which will ultimately lead to increased productivity and profitability.

Secondly, I can assist you in developing a strategic plan for your business. This plan will outline your goals, objectives, and steps that need to be taken to achieve those goals. By creating a clear and concise roadmap for your business, we can help you to stay focused and motivated as you work towards success.

Lastly, I would suggest that we work together to develop a strong and effective marketing strategy. By identifying your brand's unique selling points, we can develop a marketing plan that will help you stand out from your compe

In [55]:
next_question = 'Tell me more about how you might help me execute a marketing strategy'

# Initiate a fresh messages list and insert our next question
messages = []
user_message = Message('user',next_question)
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
print(response_message['content'])

Ah, my dear friend, crafting a successful marketing strategy requires a keen eye for detail and a creative mind. Together, we can devise an inclusive plan to promote your brand and engage with your customers across various platforms.

Firstly, we shall analyze your target audience, their preferences, and habits, and develop tailored messaging that speaks to them in a language that resonates with their needs. Once we have developed a comprehensive understanding of your customer base, we shall advise you on the most effective channels to engage with them.

We shall explore various digital channels, including email marketing, social media marketing, search engine optimization, and possibly even paid advertising campaigns. We shall also consider offline channels, such as print media and events, which can be just as effective, especially for local businesses.

To execute the strategy effectively, we shall identify key performance indicators to measure your campaign's success rate. This may 

In [56]:
# Print out a log of our conversation so far

conversation.pretty_print_conversation_history()

user:
What can you do to help me
assistant:
My dear friend, worry not, for I, your humble assistant, am brimming with innovative ideas to help you in your business pursuits. From the promotion of your goods and services to the expansion of your clientele, I shall assist you with the greatest zeal and fervor.

I shall propose to you a tailored plan, wherein you may utilize various marketing strategies such as social media campaigns, print advertisements, and other methods to engage with your clients and attract potential patrons. Additionally, I can recommend methods to make your products and services more accessible to the public, as well as techniques to streamline your operations and maximize profitability.

Rest assured that with my guidance, your business shall prosper and flourish beyond measure. Let us work together with vigor and passion to achieve your vision!
user:
Tell me more about how you might help me execute a marketing strategy
assistant:
Ah, my dear friend

### Knowledge retrieval

Now we'll extend the class to call a downstream service when the Chatbot says a trigger phrase.

The main changes are:
- The system message is more comprehensive, giving criteria for the Chatbot to advance the conversation
- The system message includes an explicit trigger phrase for the Chatbot to say once it has the info it needs
- The class gains a function ```_get_search_results``` which sources Redis results
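
The control flow behind these changes can be sketched in isolation. In this minimal example, stub functions stand in for the Redis lookup and the second chat call (assumptions so the snippet runs offline); `route_reply`, `fake_search` and `fake_answer` are hypothetical names, and the excerpt text is a placeholder.

```python
TRIGGER = "searching for answers"

def route_reply(reply_text, search, answer_with_context):
    """Return the final reply, running a search only when the trigger phrase appears."""
    if TRIGGER in reply_text.lower():
        context = search()
        return answer_with_context(context)
    return reply_text

# Hypothetical stubs standing in for the Redis lookup and the second chat call
def fake_search():
    return "placeholder manual excerpt"

def fake_answer(context):
    return f"Based on the manual: {context}"

print(route_reply("Certainly, do you have a specific model number?", fake_search, fake_answer))
print(route_reply("Searching for answers.", fake_search, fake_answer))
```

Matching on the lowercased reply keeps the check tolerant of capitalisation, which matters because the model may write "Searching for answers." rather than the exact lowercase phrase.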

In [59]:
# Updated system prompt requiring Question and Model Number to be extracted from the user
system_prompt = '''
You are a helpful HVAC knowledge base assistant. You need to capture a Question and Model Number from each customer.
The Question is their query on HVAC products, and the Model Number is the model number for an applicable product.
If they haven't provided the model number, ask them for it again.
Once you have the model number, say "searching for answers".

Example 1:

User: I'd like the safety guidelines for installing a daikin room air conditioner

Assistant: Certainly, do you have a specific model number?

User: Sure, CTXG09QVJUW

Assistant: Searching for answers.
'''

# New Assistant class to add a vector database call to its responses
class RetrievalAssistant:
    
    def __init__(self):
        self.conversation_history = []  

    def _get_assistant_response(self, prompt):
        
        try:
            completion = openai.ChatCompletion.create(
              model=CHAT_MODEL,
              messages=prompt,
              temperature=0.1
            )
            
            response_message = Message(completion['choices'][0]['message']['role'],completion['choices'][0]['message']['content'])
            return response_message.message()
            
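        except Exception as e:
            
            # Return the error in message form so callers can still read ['content']
            return Message('assistant', f'Request failed with exception {e}').message()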
    
    # The function to retrieve Redis search results
    def _get_search_results(self,prompt):
        latest_question = prompt
        search_content = get_redis_results(redis_client,latest_question,INDEX_NAME)['result'][0]
        return search_content
        

    def ask_assistant(self, next_user_prompt):
        self.conversation_history.extend(next_user_prompt)
        assistant_response = self._get_assistant_response(self.conversation_history)
        
        # Answer normally unless the assistant's reply contains the trigger phrase "searching for answers"
        if 'searching for answers' in assistant_response['content'].lower():
            question_extract = openai.Completion.create(model=COMPLETIONS_MODEL,prompt=f"Extract the user's latest question and the model number for that question from this conversation: {self.conversation_history}. Extract it as a sentence stating the Question and Model Number")
            search_result = self._get_search_results(question_extract['choices'][0]['text'])
            
            # We insert an extra system prompt here to give fresh context to the Chatbot on how to use the Redis results
            # In this instance we add it to the conversation history, but in production it may be better to hide
            self.conversation_history.insert(-1,{"role": 'system',"content": f"Answer the user's question using this content: {search_result}. If you cannot answer the question, say 'Sorry, I don't know the answer to this one'"})
            #[self.conversation_history.append(x) for x in next_user_prompt]
            
            assistant_response = self._get_assistant_response(self.conversation_history)
            print(next_user_prompt)
            print(assistant_response)
            self.conversation_history.append(assistant_response)
            return assistant_response
        else:
            self.conversation_history.append(assistant_response)
            return assistant_response
            
        
    def pretty_print_conversation_history(self, colorize_assistant_replies=True):
        for entry in self.conversation_history:
            if entry['role'] == 'system':
                pass
            else:
                prefix = entry['role']
                content = entry['content']
                output = colored(prefix +':\n' + content, 'green') if colorize_assistant_replies and entry['role'] == 'assistant' else prefix +':\n' + content
                #prefix = entry['role']
                print(output)

In [60]:
conversation = RetrievalAssistant()
messages = []
system_message = Message('system',system_prompt)
user_message = Message('user','I need the installation guidelines a room AC')
messages.append(system_message.message())
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
response_message

{'role': 'assistant',
 'content': 'Sure, do you have a specific model number for the room AC?'}

In [61]:
messages = []
user_message = Message('user','CTXG12QVJUW')
messages.append(user_message.message())
response_message = conversation.ask_assistant(messages)
#response_message

[{'role': 'user', 'content': 'CTXG12QVJUW'}]
{'role': 'assistant', 'content': 'Great, thank you for providing the model number. Based on the installation manual for the Daikin CTXG12QVJUW, the accessories that come with the unit include a mounting plate, a titanium apatite photocatalytic air-purifying filter, a drain hose, insulation tape, a wireless remote controller, a remote controller holder, fixing screws for the remote controller holder, indoor unit fixing screws, dry battery AAA LR03 (alkaline), an operation manual, and a warranty. \n\nAs for the installation guidelines, the manual recommends choosing an installation site that meets certain requirements, such as ensuring that the indoor unit is positioned in a place where the air inlet and outlet are unobstructed, the unit is not exposed to direct sunlight, and there is no source of machine oil vapor nearby. It also recommends finding a location for the wireless remote controller where signals are properly received by the indoor

In [62]:
conversation.pretty_print_conversation_history()

user:
I need the installation guidelines a room AC
assistant:
Sure, do you have a specific model number for the room AC?
user:
CTXG12QVJUW
assistant:
Great, thank you for providing the model number. Based on the installation manual for the Daikin CTXG12QVJUW, the accessories that come with the unit include a mounting plate, a titanium apatite photocatalytic air-purifying filter, a drain hose, insulation tape, a wireless remote controller, a remote controller holder, fixing screws for the remote controller holder, indoor unit fixing screws, dry battery AAA LR03 (alkaline), an operation manual, and a warranty. 

As for the installation guidelines, the manual recommends choosing an installation site that meets certain requirements, such as ensuring that the indoor unit is positioned in a place where the air inlet and outlet are unobstructed, the unit is not exposed to direct sunlight, and there is no source of machine oil vapor nearby. It also recommends finding a location f

### Chatbot

Now we'll put all this into action with a real (basic) Chatbot.

In the directory containing this app, execute ```streamlit run chat.py```. This will open up a Streamlit app in your browser where you can ask questions of your embedded data. 

__Example Questions__:
- what type of refrigerant is used for the daikin RZQ24TBVJUA?
- I'd like the safety guidelines for installing a daikin room air conditioner
- I need the installation guidelines for a room AC

### Consolidation

Over the course of this notebook you have:
- Laid the foundations of your product by embedding your knowledge base
- Created a Q&A application to serve basic use cases
- Extended this to be an interactive Chatbot

These are the foundational building blocks of any Q&A or Chat application using our APIs - these are your starting point, and we look forward to seeing what you build with them!