In [39]:
# install all libs
!pip install -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython -m pip install --upgrade pip[0m


# GPT-4 with Retrieval Augmentation over Web

In this notebook we'll work through an example of using GPT-4 with retrieval augmentation to answer questions about Iteratec.

First we download the content of https://www.iteratec.com/en/ with `wget`.

The command below crawls the site recursively and saves every HTML page into the `rtdocs` directory.

In [1]:
# crawl website content using wget
!wget -e robots=off -r -A.html -P rtdocs https://www.iteratec.com/en/

--2023-06-11 16:03:11--  https://www.iteratec.com/en/
Resolving www.iteratec.com (www.iteratec.com)... 3.66.50.156
Connecting to www.iteratec.com (www.iteratec.com)|3.66.50.156|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33861 (33K) [text/html]
Saving to: ‘rtdocs/www.iteratec.com/en/index.html’


2023-06-11 16:03:11 (210 MB/s) - ‘rtdocs/www.iteratec.com/en/index.html’ saved [33861/33861]

--2023-06-11 16:03:11--  https://www.iteratec.com/
Reusing existing connection to www.iteratec.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.iteratec.com/de/ [following]
--2023-06-11 16:03:11--  https://www.iteratec.com/de/
Reusing existing connection to www.iteratec.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 66744 (65K) [text/html]
Saving to: ‘rtdocs/www.iteratec.com/index.html’


2023-06-11 16:03:11 (5.79 MB/s) - ‘rtdocs/www.iteratec.com/index.html’ saved [66744/66744]

--2023-06-11 16:03:11--  https://
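
Before loading the pages, it can help to check what the crawl actually saved. The following is a minimal sketch using only the standard library; the helper names `list_html_files` and `summarize_crawl` are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def list_html_files(root):
    """Return all .html files below `root`, sorted by path."""
    return sorted(Path(root).rglob("*.html"))


def summarize_crawl(root):
    """Count downloaded pages per immediate subdirectory of `root`."""
    counts = {}
    for path in list_html_files(root):
        relative = path.relative_to(root)
        # Files that sit directly in `root` are grouped under "."
        section = relative.parts[0] if len(relative.parts) > 1 else "."
        counts[section] = counts.get(section, 0) + 1
    return counts
```

Calling `summarize_crawl('rtdocs/www.iteratec.com')` should show how many pages ended up under `en` compared with other language folders, which is useful because the crawl also follows the redirect to the German start page.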

Now we can use LangChain itself to process these docs.
We do this using the `DirectoryLoader` together with the `BSHTMLLoader`:

In [2]:
from langchain.document_loaders import BSHTMLLoader, DirectoryLoader

loader = DirectoryLoader('rtdocs/www.iteratec.com/en', loader_cls=BSHTMLLoader)

docs = loader.load()
len(docs)

25

This leaves us with `25` processed doc pages. Let's take a look at the format each one contains:

In [3]:
docs[20].page_content

"\n\n\n\n\nCookies | iteratec\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n  \n\n\n\n\n\n\n  \n\n\n\n\n\n\n\n\n  \n\n\n\n\n    \n  \n   \n\n\n\n\n\n\n\n\n\n\n\n\nWhy iteratec\n\n\n\n\n\nWhy iteratec\nWe create excellent solutions that will provide you with competitive advantages.\nLearn more\n\n\n\n\nAmbition\nCompany\n\nManagement\n\n\n\n\n\n\n\n\nHow we work\n\n\n\n\n\nHow we work\nCreate. Code. Care. Coach. That's how we develop digital champions.\nLearn more\n\n\n\n\nCreate\nCode\nCare\nCoach\n\n\n\n\n\n\nWhat we do\n\n\n\n\n\nWhat we do\nWe develop individual solutions for your most important strategic challenges.\nLearn more\n\n\n\n\nDigital Transformation\nIndividual Software & Systems\nAI & Data Analytics\nIT-Security\nWeb3\n\nDecentraVote\n\n\n\n\n\n\n\n\nCareer\n\n\n\n\n\nCareer\nDiscover what makes us unique as a software company.\nLearn more\n\n\n\n\nJobs\n\n\n\n\n\n\nInsights\n\n\n\n\n\nInsights\nShaping the futu

In [4]:
#remove \n in docs.page_content
for doc in docs:
    doc.page_content = doc.page_content.replace('\n', ' ')
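
Replacing each newline with a single space leaves long runs of spaces behind, and those still count towards the token budget of every chunk. One possible refinement, shown here as a sketch rather than as part of the original pipeline, is to collapse all whitespace runs with a regular expression. The function name `normalize_whitespace` is an assumption made for illustration.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace into a single space."""
    # For str patterns, \s matches Unicode whitespace, which includes
    # newlines, tabs and non-breaking spaces (\xa0).
    return re.sub(r"\s+", " ", text).strip()
```

It could be applied in the loop above as `doc.page_content = normalize_whitespace(doc.page_content)`. Note that this also removes the `"\n\n"` paragraph breaks, just as the plain `replace` does, so the splitter will fall back to splitting on spaces.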

Now let's process all of these pages. We will split everything into chunks of roughly 500 tokens, which is easy to do with `langchain` and `tiktoken`:

In [5]:
import tiktoken

tokenizer_name = tiktoken.encoding_for_model('gpt-4')
tokenizer_name.name

'cl100k_base'

In [6]:
tokenizer = tiktoken.get_encoding(tokenizer_name.name)

# create the length function
# `tiktoken_len` tokenizes `text` with `tokenizer.encode()` and returns the
# number of tokens. Passing `disallowed_special=()` means no special token is
# rejected: special-token strings in the text are encoded as ordinary text
# instead of raising an error.
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
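
To make the meaning of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter over a list of tokens. It is only an illustration of the two parameters: `RecursiveCharacterTextSplitter` works differently internally (it splits on the separators first and then merges pieces), and the name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` sharing `chunk_overlap` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows
```

With `chunk_size=500` and `chunk_overlap=20`, each window would start 480 tokens after the previous one, so the last 20 tokens of one chunk reappear at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.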

Process the `docs` into more chunks using this approach.

In [8]:
from uuid import uuid4
from tqdm.auto import tqdm

chunks = []

for idx, page in enumerate(tqdm(docs)):
    content = page.page_content
    if len(content) > 100:
        url = page.metadata['source'].replace('rtdocs/', 'https://')
        texts = text_splitter.split_text(content)
        chunks.extend([{
            'id': str(uuid4()),
            'text': texts[i],
            'chunk': i,
            'url': url
        } for i in range(len(texts))])

100%|██████████| 25/25 [00:00<00:00, 53.00it/s]


In [9]:
chunks[0]

{'id': 'fa1a1089-2ac5-416b-998c-51f740c96033',
 'text': "An innovative software development company | iteratec                                                                                                                  Why iteratec      Why iteratec We create excellent solutions that will provide you with competitive advantages. Learn more     Ambition Company  Management         How we work      How we work Create. Code. Care. Coach. That's how we develop digital champions. Learn more     Create Code Care Coach       What we do      What we do We develop individual solutions for your most important strategic challenges. Learn more     Digital Transformation Individual Software & Systems AI & Data Analytics IT-Security Web3  DecentraVote         Career      Career Discover what makes us unique as a software company. Learn more     Jobs       Insights      Insights Shaping the future and writing about it. These are the topics that move us and the technology sector. Learn more     B
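
Because `uuid4()` generates a new random ID on every run, re-running the notebook and upserting again adds a second copy of each chunk to the index instead of overwriting the first. If that is not wanted, one option is a deterministic ID derived from the page URL and the chunk position. This is a sketch under that assumption; `stable_chunk_id` is a hypothetical helper, not something the original notebook defines.

```python
from uuid import NAMESPACE_URL, uuid5


def stable_chunk_id(url, chunk_index):
    """Derive a reproducible ID from the page URL and the chunk position."""
    # uuid5 hashes the name within a namespace, so the same inputs
    # always produce the same UUID string.
    return str(uuid5(NAMESPACE_URL, f"{url}#chunk-{chunk_index}"))
```

Using `'id': stable_chunk_id(url, i)` in place of `'id': str(uuid4())` would make repeated upserts of the same page idempotent, since an upsert with an existing ID replaces the stored vector.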

Our chunks are ready so now we move onto embedding and indexing everything.

## Initialize Embedding Model

We use `text-embedding-ada-002` as the embedding model. We can embed text like so:

In [42]:
import os
import openai


openai.api_key = os.getenv("OPENAI_API_KEY") or "OPENAI_API_KEY"

openai.Engine.list()  # check we have authenticated

<OpenAIObject list at 0x7fa9341f08b0> JSON: {
  "object": "list",
  "data": [
    {
      "object": "engine",
      "id": "whisper-1",
      "ready": true,
      "owner": "openai-internal",
      "permissions": null,
      "created": null
    },
    {
      "object": "engine",
      "id": "babbage",
      "ready": true,
      "owner": "openai",
      "permissions": null,
      "created": null
    },
    {
      "object": "engine",
      "id": "davinci",
      "ready": true,
      "owner": "openai",
      "permissions": null,
      "created": null
    },
    {
      "object": "engine",
      "id": "text-davinci-edit-001",
      "ready": true,
      "owner": "openai",
      "permissions": null,
      "created": null
    },
    {
      "object": "engine",
      "id": "babbage-code-search-code",
      "ready": true,
      "owner": "openai-dev",
      "permissions": null,
      "created": null
    },
    {
      "object": "engine",
      "id": "text-similarity-babbage-001",
      "ready": t

In [11]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)

In the response `res` we will find a JSON-like object containing our new embeddings within the `'data'` field.

In [12]:
res.keys()

dict_keys(['object', 'data', 'model', 'usage'])

Inside `'data'` we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains `1536` dimensions (the output dimensionality of the `text-embedding-ada-002` model).

In [13]:
len(res['data'])

2

In [14]:
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])

(1536, 1536)

We will apply this same embedding logic to the iteratec pages we've just scraped. But before doing so we must create a place to store the embeddings.

## Initializing the Index

Now we need a place to store these embeddings and to run an efficient vector search over them. For that we use Pinecone. We can get a [free API key](https://app.pinecone.io/) and enter it below, where we initialize the connection to Pinecone and create a new index.

In [43]:
import pinecone

# setup pinecone account to get credentials
PINECONE_API_KEY='PINECONE_API_KEY'
PINECONE_ENVIRONMENT='us-west1-gcp-free'

os.environ['PINECONE_ENVIRONMENT'] = PINECONE_ENVIRONMENT

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
pinecone.whoami()

WhoAmIResponse(username='a9d81a4', user_label='default', projectname='cc97d47')

In [16]:
index_name = 'gpt-4-iteratec-docs'

In [18]:
import time

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='cosine'
    )
    # wait for index to be initialized
    time.sleep(1)

# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 262}},
 'total_vector_count': 262}

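On a first run the index is empty, with a `total_vector_count` of `0`. The output above reports `262` vectors, presumably because the index had already been populated by an earlier run. We can populate it with embeddings built by OpenAI's `text-embedding-ada-002` like so: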

In [19]:
chunks[1]['text']

"partner for excellent software and pioneering technology We are on fire for digital solutions that create excitement. For those who are boldly leading the way, who want to conquer new markets and who want to inspire customers, we are the technology partner at your side. Our holistic approach to uncovering business potential and transforming it into cutting-edge software makes our customers the winners of successful digital transformation. This is what we mean by Developing Digital Champions. Why iteratec             How we develop digital champions We guide you through the entire process: from the initial idea, through implementation and sustainable operation, to the empowerment of your organization.          How we work               Solutions that make the difference We support you in harnessing the potential of digital technologies to reach peak performance.                  Digital Transformation  We identify the potential of digital technologies for your organization, develop app

In [21]:
openai.Embedding.create(input=chunks[1]['text'], engine=embed_model)

<OpenAIObject list at 0x7fa938b884f0> JSON: {
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.01436258852481842,
        0.0020250568632036448,
        0.004002465400844812,
        -0.0009027669439092278,
        -1.4531166016240604e-05,
        0.04223009571433067,
        -0.014063084498047829,
        -0.0030597078148275614,
        -0.009999356232583523,
        -0.0189504474401474,
        0.012252445332705975,
        0.01583288237452507,
        0.012313706800341606,
        -0.01028524711728096,
        -0.0045334044843912125,
        -0.003149899421259761,
        0.021469006314873695,
        -0.0036553128156811,
        -0.02186380699276924,
        -0.0021748091094195843,
        -0.040269702672958374,
        -0.0004935012548230588,
        0.021659599617123604,
        0.00535023445263505,
        -0.02108781971037388,
        -0.003046093974262476,
        -0.009420769289135933,
        -0.027772208675

In [22]:
from tqdm.auto import tqdm
from time import sleep

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(chunks), batch_size)):
    # find end of batch
    i_end = min(len(chunks), i+batch_size)
    meta_batch = chunks[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings, retrying every 5 seconds on failure
    # (e.g. when the API raises a RateLimitError)
    while True:
        try:
            res = openai.Embedding.create(input=texts, engine=embed_model)
            break
        except Exception as e:
            print(e)
            sleep(5)
    embeds = [record['embedding'] for record in res['data']]
    # cleanup metadata
    meta_batch = [{
        'text': x['text'],
        'chunk': x['chunk'],
        'url': x['url']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

100%|██████████| 1/1 [00:02<00:00,  2.91s/it]
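
The loop above waits a fixed 5 seconds between attempts and never gives up. A common alternative is exponential backoff with a retry limit. The sketch below is one way to write it with the standard library only; the helper `retry_with_backoff` and its parameters are assumptions for illustration, not part of the OpenAI client.

```python
import time


def retry_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `fn()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # give up and surface the last error
            # wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

Inside the batch loop it could be used as `res = retry_with_backoff(lambda: openai.Embedding.create(input=texts, engine=embed_model))`. The `sleep` parameter exists only so the waiting can be replaced in tests.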


Now we've added all of the iteratec pages to the index. With that we can move on to retrieval and then answer generation using GPT-4.

## Retrieval

To search through our documents we first need to create a query vector `xq`. Using `xq` we will retrieve the most relevant chunks from the iteratec pages, like so:

In [23]:
def create_embedding(query):
    return openai.Embedding.create(
        input=[query],
        engine=embed_model
    )

def query_pinecone_embedding(openai_embedding):
    # retrieve from Pinecone
    xq = openai_embedding['data'][0]['embedding']
    # get the 5 most similar chunks together with their metadata
    return index.query(xq, top_k=5, include_metadata=True)
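
Conceptually, a query with `metric='cosine'` and `top_k=5` ranks the stored vectors by cosine similarity to `xq` and returns the best five. The sketch below shows that idea as an exhaustive search in plain Python. It is only meant to illustrate the ranking: Pinecone's index is not implemented this way, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b`."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, vectors, k=5):
    """Return the k best (id, score) pairs; `vectors` maps id -> vector."""
    scored = [
        (vector_id, cosine_similarity(query_vector, vector))
        for vector_id, vector in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

An exhaustive scan like this is linear in the number of stored vectors, which is fine for a few hundred chunks but is the reason a dedicated vector index is used for larger collections.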

In [24]:
query = "How does iteratec talk about digital champions?"

embedding = create_embedding(query)
result = query_pinecone_embedding(embedding)
result

{'matches': [{'id': '3c7007fe-defe-43b2-b41e-d41203f2bbdf',
              'metadata': {'chunk': 1.0,
                           'text': 'Digital Champions...           ... are '
                                   'organizations that use pioneering '
                                   'technologies to operate more successfully '
                                   'than their competitors in increasingly '
                                   'volatile market environments.   ... have '
                                   'the tools, methods and structures to '
                                   'optimize their existing business, realize '
                                   'growth opportunities and better exploit '
                                   'disruptive innovations.   ... secure their '
                                   'existence in the long term.             '
                                   'Our values Our values serve as a framework '
                                   'and g

With retrieval complete, we move on to feeding these into GPT-4 to produce answers.

## Retrieval Augmented Generation

GPT-4 is currently accessed via OpenAI's `ChatCompletion` endpoint. To give the model the information we retrieved, we pass it in the user message *alongside* our original query. We can do that like so:

In [25]:
def pinecone_to_list(res):
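    # get list of retrieved texts
    contexts = [item['metadata']['text'] for item in res['matches']]
    # note: `query` is read from the notebook's global scope, so it must
    # hold the question that produced `res`
    return "\n\n---\n\n".join(contexts)+"\n\n-----\n\n"+query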

pinecone_to_list(result)


'Digital Champions...           ... are organizations that use pioneering technologies to operate more successfully than their competitors in increasingly volatile market environments.   ... have the tools, methods and structures to optimize their existing business, realize growth opportunities and better exploit disruptive innovations.   ... secure their existence in the long term.             Our values Our values serve as a framework and guiding principles for our daily actions and for collaborating with our customers, colleagues and partners.                  Our performance: Permanently excellent\xa0 We rely on the best team and future-proof technologies. We stand for operational excellence and unique problem-solving expertise. We love difficult, complex challenges. Because we can and we learn every day - from each other and with each other. Our customers and partners can always rely on that.                  Our attitude: Constructively independent\xa0 We are owners in everything
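
`pinecone_to_list` reads the question from the global variable `query`, which is easy to get wrong when cells are run out of order. A variant that takes the question as an explicit argument could look like the sketch below. The name `build_augmented_message` and its `separator` parameter are assumptions for illustration; the message layout mirrors the one used above.

```python
def build_augmented_message(contexts, question, separator="\n\n---\n\n"):
    """Join retrieved passages and append the question after a divider."""
    context_block = separator.join(contexts)
    # The longer divider marks where the context ends and the question begins.
    return f"{context_block}\n\n-----\n\n{question}"
```

Given a Pinecone result `res`, the contexts would be collected as `[item['metadata']['text'] for item in res['matches']]` and passed in together with the question string.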

Next we define a helper that sends the system prompt and the augmented user message to the chat model:

In [26]:
def query_gpt(prompt, message, model="gpt-4"):
    return openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": message}
        ]
    )

In [28]:
# system message to 'prime' the model
prompt = """You are Q&A bot. A highly intelligent system that answers
user questions based on the information provided by the user above
each question. If the information can not be found in the information
provided by the user you truthfully say "I don't know".
"""

In [29]:
def search_docs(query, model="gpt-4"):
    embedding = create_embedding(query)
    result = query_pinecone_embedding(embedding)
    open_ai_message = pinecone_to_list(result)
    return query_gpt(prompt, open_ai_message, model=model)

In [33]:
query = "Who is the CEO of iteratec?"
# "How does iteratec help customers become digital champions?"
# "Who is the CEO of iteratec?"
# "How does iteratec talk about digital champions?"
# "What is the iteratec mission?"
# "Which technologies does iteratec use?"

result = search_docs(query, model="gpt-4")

In [31]:
result

<OpenAIObject chat.completion id=chatcmpl-7QHqzeS2NU3Ubq2kWT4kNCUbojE5v at 0x7fa937c45620> JSON: {
  "id": "chatcmpl-7QHqzeS2NU3Ubq2kWT4kNCUbojE5v",
  "object": "chat.completion",
  "created": 1686499513,
  "model": "gpt-4-0314",
  "usage": {
    "prompt_tokens": 1892,
    "completion_tokens": 60,
    "total_tokens": 1952
  },
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The specific technologies that iteratec uses are not mentioned in the provided information. However, they offer services in digital transformation, individual software & systems, AI & data analytics, IT-security, and Web3. Their focus is on developing digital products and individual software systems, leveraging technological potential for their customers."
      },
      "finish_reason": "stop",
      "index": 0
    }
  ]
}
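
The response is a nested, dictionary-like object. When only the answer and the token cost matter, the relevant fields can be pulled out with a small helper. This is a sketch that assumes the response has the shape shown above; `summarize_response` is a hypothetical name.

```python
def summarize_response(response):
    """Extract the answer text, finish reason and token count from a chat response."""
    choice = response["choices"][0]  # the first (and here only) completion
    return {
        "answer": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }
```

A `finish_reason` other than `"stop"` (for example `"length"`) would indicate that the answer was cut off, which is worth checking before displaying it.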

To display the answer nicely, we render it as Markdown.

In [34]:
from IPython.display import Markdown

display(Markdown(result['choices'][0]['message']['content']))

There is no specific mention of a CEO in the provided information. However, there are several Managing Directors at iteratec, including Klaus Eberhardt (founder), Stefan Rauch, Michael Schulz, and Alexander Youssef.

Let's compare this to a non-augmented query...

In [35]:
res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": prompt },
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))

I don't know.

What happens if we drop the `"I don't know"` instruction from the system prompt?

In [36]:
res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are Q&A bot. A highly intelligent system that answers user questions"},
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))

As of my last knowledge update in September 2021, the CEO of iteratec is Michael Ernst. However, it's worth checking the most recent information on the company's website or other sources, as management positions might have changed.

Then we see something even worse than `"I don't know"`: hallucinations. Clearly, augmenting our queries with additional context can make a huge difference to the quality of the answers.

Great, we've seen how to augment GPT-4 with semantic search so that it can answer iteratec-specific queries.

Once you're finished, delete the index to save resources.

In [None]:
#pinecone.delete_index(index_name)