# Semantic Search Playground

This notebook accompanies the Ingenuity blog post: https://blog.siemens.com/2023/07/build-your-own-semantic-search-with-large-language-models/

![alt text](robot_searching_documents.png "Robot searching documents")

## Setup

Install dependencies

In [None]:
!pip install openai langchain tiktoken faiss-cpu PyPDF2

## Load dependencies

Load the dependencies and the API key. See https://wiki.siemens.com/display/en/The+FAQ+of+ai+attack#TheFAQofaiattack-SiemensOpenAIPlayground for details on how to get access to the Siemens Azure OpenAI endpoint. Your API key needs to be in the file `.key`. Never check the key into version control; this is why `.key` is listed in `.gitignore`.

In [1]:
import openai
openai.api_type = "azure"
openai.api_base = "PUT YOUR API ENDPOINT HERE"
openai.api_version = "2023-03-15-preview"
with open('.key', 'r') as file:
    openai.api_key = file.read().rstrip()

from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter

import glob
import json
import urllib.request
import PyPDF2



## Prepare semantic index

In this section, we divide the large texts into chunks and then embed each chunk. As this takes a while, we save the embeddings so that we can load them directly next time. This preparation only needs to be done the first time you run the notebook.

You can bring your own documents. Just put the PDFs into `data/`. For the example we use the [Siemens S7-1500 manual](https://support.industry.siemens.com/cs/attachments/86140384/s71500_et200mp_manual_collection_en-US.pdf) and the STEP 7 Professional V14 manual. Let's download them! Grab some coffee, the manuals are big ;-)



In [7]:
# prepare the progress bar
def show_progress(block_num, block_size, total_size):
    print(f'download at {round(block_num * block_size / total_size * 100, 2)}%', end="\r")

# download the S7-1500 and STEP 7 manuals
urllib.request.urlretrieve('https://support.industry.siemens.com/cs/attachments/86140384/s71500_et200mp_manual_collection_en-US.pdf', 'data/s7-1500-manual.pdf', show_progress)
urllib.request.urlretrieve('https://support.industry.siemens.com/cs/attachments/109742272/STEP_7_Professional_V14_enUS_en-US.pdf', 'data/step7-manual.pdf', show_progress)
pass

download at 100.0%

('data/step7-manual.pdf', <http.client.HTTPMessage at 0x15e53a2dd10>)

In [13]:
# helper function that extracts all text from the PDF at `path`
def load_pdf_as_string(path: str) -> str:
    # open the file in binary mode; the `with` block closes it even if extraction fails
    with open(path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        total_pages = len(pdf_reader.pages)

        pages = []
        for i, page in enumerate(pdf_reader.pages):
            print(f'{path}: {round((i * 100) / total_pages)}% at page: {i}         ', end="\r")
            pages.append(page.extract_text())

    # finish the progress line, then join the pages with blank lines in between
    print('')
    return '\n\n'.join(pages)



### Create chunks
First, we extract the text from the PDFs.

In [14]:
# go through all PDFs in `data/` and extract the text
texts = [load_pdf_as_string(path) for path in glob.glob('data/*.pdf')]


data\s7-1500-manual.pdf: 100% at page: 11924
data\step7-manual.pdf: 100% at page: 14417


Now we split the text into chunks and save them for later use.

In [15]:
# split the texts into chunks, which are saved into `data/chunks`
import os

CHUNK_SIZE = 15_000
text_splitter = RecursiveCharacterTextSplitter(
    # Set a chunk size that is about half the size of the LLM's context length.
    # We need the rest for the question and the answer.
    # Note: the splitter measures chunk_size and chunk_overlap in characters, not tokens.
    chunk_size=CHUNK_SIZE,
    chunk_overlap=5_000,
)

docs = text_splitter.create_documents(texts)

# make sure the output folder exists before writing the chunks
os.makedirs('data/chunks', exist_ok=True)
for i, doc in enumerate(docs):
    with open(f'data/chunks/chunk-{i}.txt', "w", encoding="utf-8") as out_page:
        out_page.write(doc.page_content)
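
To make the roles of `chunk_size` and `chunk_overlap` more tangible, here is a minimal sketch of overlapping chunks using a plain sliding window. It is a simplified illustration, not what `RecursiveCharacterTextSplitter` does internally (the real splitter also tries to cut at natural separators such as paragraph breaks). The function name `split_with_overlap` and the sample text are made up for this example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so a sentence that
    straddles a boundary still appears complete in at least one chunk.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each printed piece starts with the last four characters of the previous one. In the notebook the same idea applies with 15,000 characters per chunk and 5,000 characters of overlap.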

### Embed chunks

Next, we calculate the embedding for each chunk. The embeddings are saved for later use.

In [None]:

embeddings = OpenAIEmbeddings(openai_api_key=openai.api_key, model_kwargs={"engine": "text-embedding-ada-002"})
embedding_list = []

# read all chunks and calculate the embedding of each one
for path in glob.glob('data/chunks/chunk-*.txt'):
    with open(path, 'r', encoding="utf-8") as file:
        chunk = file.read()
    # The API does not yet support embedding multiple texts in one call, hence the awkward looping and indexing.
    # Note that `chunk_size` here is the number of texts sent per request, not the length of a text chunk.
    embedding_vector = embeddings.embed_documents([chunk], chunk_size=1)
    embedding_list.append(embedding_vector[0])

# save the embeddings in the order in which glob returned the chunk files
with open('data/embeddings.json', 'w') as outfile:
    json.dump(embedding_list, outfile)
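
Since the embeddings only need to be computed once, one possible way to skip this cell on later runs is a small "load or compute" helper. The sketch below is an assumption about how this could be done, not part of the original notebook; `load_or_compute` and `fake_embeddings` are hypothetical names, and the fake function merely stands in for the expensive embedding loop.

```python
import json
import os
import tempfile


def load_or_compute(path, compute):
    """Return the JSON content of `path` if the file exists.

    Otherwise call `compute()`, save its result to `path`, and return it.
    """
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as infile:
            return json.load(infile)

    result = compute()
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(result, outfile)
    return result


# stand-in for the expensive embedding loop; it records how often it is called
calls = []

def fake_embeddings():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "embeddings.json")
    first = load_or_compute(cache_path, fake_embeddings)   # computes and saves
    second = load_or_compute(cache_path, fake_embeddings)  # loads from disk

print(f"compute was called {len(calls)} time(s)")
```

Keep in mind that such a cache does not notice when the chunks change. If you add or replace PDFs, delete `data/embeddings.json` so the embeddings are recomputed.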

## Let's search!

Almost ready.

### Load the index

We load the embeddings and their respective chunks of text into the [FAISS](https://github.com/facebookresearch/faiss) vectorstore.

In [None]:
# create the embedding model
embeddings_api = OpenAIEmbeddings(openai_api_key=openai.api_key, model_kwargs={"engine": "text-embedding-ada-002"})

# load the embeddings
with open('data/embeddings.json') as json_file:
    embeddings = json.load(json_file)

# load the text chunks; this relies on glob returning the files in the same order as in the embedding step
chunks = []
for path in glob.glob('data/chunks/chunk-*.txt'):
    with open(path, "r", encoding="utf-8") as file:
        chunks.append(file.read())

# setup the vectorstore
db = FAISS.from_embeddings(
    list(zip(chunks, embeddings)),
    embedding=embeddings_api,
)
retriever = db.as_retriever(search_kwargs={"k": 1})

# prepare the prompt template. It takes the question and the chunk of text that hopefully contains the information needed to answer it
prompt = PromptTemplate.from_template(
    "Answer the following question based on the document. If the document does not provide the information needed, tell so:\nQUESTION:{question}\nDOCUMENT:\n{document}\n\nANSWER:"
)

# create the LLM
model_name = "text-davinci-003"
temperature = 0.0
model_api = OpenAI(
    model_kwargs={"engine": model_name},
    temperature=temperature,
    max_tokens=500,
    openai_api_key=openai.api_key,
)
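
Conceptually, the retriever embeds the question and returns the chunk whose embedding lies closest to it. The following sketch shows that idea as a brute-force search over toy vectors. It assumes Euclidean distance as the measure; the index type and metric that FAISS actually uses may differ, and `nearest_chunks`, `toy_index` and the three-dimensional vectors are invented for illustration (real embeddings have far more dimensions).

```python
import math


def nearest_chunks(query_vector, indexed_chunks, k=1):
    """Return the texts of the `k` chunks whose vectors are closest to `query_vector`.

    `indexed_chunks` is a list of (text, vector) pairs, the same shape that is
    passed to the vector store in the cell above.
    """
    scored = sorted(indexed_chunks, key=lambda item: math.dist(query_vector, item[1]))
    return [text for text, _ in scored[:k]]


# toy index: three "chunks" with made-up embeddings
toy_index = [
    ("how to record a trace", [0.9, 0.1, 0.0]),
    ("ladder diagram basics", [0.1, 0.9, 0.0]),
    ("wiring the power supply", [0.0, 0.1, 0.9]),
]

# made-up embedding of a question about tracing
toy_question = [0.8, 0.2, 0.1]

best = nearest_chunks(toy_question, toy_index, k=1)
print(best)
```

A dedicated library such as FAISS does the same job far more efficiently, which matters once the index holds many thousands of chunks.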


We define an answer function that takes a question, searches the vector store for the closest chunk of text, and finally prompts the LLM to answer the question based on that text.

In [3]:
def answer(question):
    chunks = retriever.get_relevant_documents(question)
    # we configured the retriever to return only the best match
    chunk = chunks[0]

    # fill in the prompt template
    prompt_value = prompt.format_prompt(
        document=chunk.page_content,
        question=question,
    )
    try:
        print(model_api(prompt_value.to_string()))
    except Exception as e:
        print(f'Error: {e}')

Finally, we are ready to answer questions. Here are a few examples:

In [4]:
answer('how do I trace a signal?')


To trace a signal, double-click the "Add new trace" entry in the project tree with the "Traces" system folder below the device. Adapt the name of the trace configuration by clicking the text. Select the signals to be recorded in the "Signals" area. Configure the sampling, trigger mode and the condition for the selected trigger. Transfer the trace configuration to the device with the button. Activate the recording by clicking the button. Wait until the "Recording" or "Recording completed" status is displayed in the status display of the trace. Switch to the "Diagram" tab and click the icon of a signal in the signal table. Select or deselect the individual signals and bits for display with the icon. Transfer the measurement to the project with the button.


In [5]:
answer('what is the difference between ladder diagrams and scl code?')


Ladder diagrams are a graphical representation of a program, while SCL code is a textual representation of a program.


Our system will answer questions only based on the provided PDFs. Let's try a question that cannot be answered based on them.

In [4]:
answer('Should I buy Siemens stock?')


This document does not provide information about whether or not you should buy Siemens stock.


Perfect, the system did not make up an answer!

## Your turn to tinker around

Feel free to play around with the parameters of the LLM and embeddings API calls!

Here are a few more things to try:

* can you change the prompt to get answers more suitable for novice Simatic users?
* is there any other way to improve the prompt to the LLM?
* what about getting more than one chunk from the vectorstore? Is there a way to use them to improve the answers?
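
As a starting point for the last idea, here is one possible way to put several chunks into a single prompt. It is only a sketch: `MULTI_PROMPT` and `build_multi_chunk_prompt` are hypothetical names, the wording is adapted from the prompt used above, and plain string formatting is used instead of `PromptTemplate` to keep the example self-contained.

```python
# prompt wording adapted from the single-document prompt used in this notebook
MULTI_PROMPT = (
    "Answer the following question based on the documents. "
    "If the documents do not provide the information needed, tell so:\n"
    "QUESTION:{question}\n{documents}\n\nANSWER:"
)


def build_multi_chunk_prompt(question, chunk_texts):
    """Number each chunk and place all of them into one prompt string."""
    documents = "\n\n".join(
        f"DOCUMENT {i}:\n{text}" for i, text in enumerate(chunk_texts, start=1)
    )
    return MULTI_PROMPT.format(question=question, documents=documents)


example_prompt = build_multi_chunk_prompt(
    "how do I trace a signal?",
    ["first chunk text", "second chunk text"],
)
print(example_prompt)
```

To try this in the notebook, you would raise `k` in the retriever's `search_kwargs` and pass the `page_content` of every returned document. With the chunk size used here, several chunks will likely not fit into the model's context at once, so you would probably also need a smaller `CHUNK_SIZE`.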

Have fun!