In [2]:
# install all libs
#!pip install -r requirements.txt

In [3]:
import dotenv
# Reload the variables in your '.env' file (override the existing variables)
dotenv.load_dotenv(".env", override=True)

True

# GPT-4 with Retrieval Augmentation over Web

In this notebook we'll work through an example of using GPT-4 with retrieval augmentation to answer questions about NOUS Wissensmanagement GmbH.

In this example, we will download all the content from https://www.nousdigital.net/en/ with `wget`, which saves the HTML pages into the `nous` directory.

In [4]:
# crawl website content using wget
# !wget -e robots=off -r -A.html -P nous https://www.nousdigital.net/en/

Now we can use LangChain to process these pages.
We do this using the `DirectoryLoader` together with the `BSHTMLLoader`:

In [49]:
from langchain.document_loaders import BSHTMLLoader, DirectoryLoader

loader = DirectoryLoader('nous/www.nousdigital.net/en', loader_cls=BSHTMLLoader)

docs = loader.load()
len(docs)

232

This leaves us with `232` processed doc pages. Let's take a look at the format each one contains:

In [50]:
docs[20].page_content

'\n\n\n\n\n\n\n\n\n\n\n      NOUS solutions for culture\n      – NOUSdigital\n    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nData protection\n\n  \nThis website uses cookies and services in the categories listed below. Select REQUIRED ONLY to accept only technically necessary cookies or SAVE CHOICE to save your individual settings.\n\n\n      FURTHER INFORMATION\n    \n\nPrivacy Policy\n\nLegal Notice\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nRequired Cookies\n\n\n\n\n\n\nThese cookies are needed to enable the basic functionality of this website. These cookies can therefore not be disabled.\n\n\n\n\nHTTP Cookie:\naccepted_optional_cookies\n\nPurpose: This cookie stores information about which optional cookies have been accepted or rejected.\nDomain: www.nousdigital.net\nStorage duration: 1 year\n\nThird party: No\n\n\n\n\n\nHTTP Cookie:\ncsrftoken\n\nPurpose: Protect against "Cross Site Request Forgery (CSRF)" attacks via form submission.\nDomain: www.nousdigital.net\nStorage duration: 1 year\n\

In [7]:
#remove '\n' in all docs.page_content
for doc in docs:
    doc.page_content = doc.page_content.replace('\n', ' ')
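
Replacing each newline with a single space, as above, still leaves long runs of spaces in the text, which cost tokens without adding information. A minimal sketch of a stricter cleanup, assuming it is acceptable to collapse every run of whitespace into one space (the helper name `normalize_whitespace` is hypothetical and not part of the notebook):

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Sample resembling the raw page content shown earlier
sample = "\n\n  Data protection\n\n  \nThis website uses cookies\n"
cleaned = normalize_whitespace(sample)
print(cleaned)
```

In the notebook this could be applied as `doc.page_content = normalize_whitespace(doc.page_content)` inside the same loop.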

Now let's see how we can process all of these. We will split everything into chunks of at most 400 tokens, which we can do easily with `langchain` and `tiktoken`:

In [51]:
import tiktoken

tokenizer_name = tiktoken.encoding_for_model('gpt-4')
tokenizer_name.name

'cl100k_base'

In [52]:
tokenizer = tiktoken.get_encoding(tokenizer_name.name)

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [53]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
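
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also prefers to break on the listed separators). The sketch assumes whitespace-separated words as a stand-in for `tiktoken` tokens, and `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, chunk_size=400, chunk_overlap=20):
    """Split text into windows of `chunk_size` words that share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Tiny demo: 10 words, windows of 4 words with 1 word shared between neighbours
demo = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(demo, chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Each chunk starts with the last word of the previous one, which is what the overlap buys: a sentence cut at a chunk boundary still appears with some of its context in the next chunk.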

Process the `docs` into more chunks using this approach.

In [54]:
from uuid import uuid4
from tqdm.auto import tqdm

chunks = []

for page in tqdm(docs):
    content = page.page_content
    # skip pages with almost no text
    if len(content) > 100:
        # map the local wget path back to the page URL
        # (assumes the source path starts with the 'nous/' download directory)
        url = page.metadata['source'].replace('nous/', 'https://', 1)
        texts = text_splitter.split_text(content)
        chunks.extend([{
            'id': str(uuid4()),
            'text': text,
            'chunk': i,
            'url': url
        } for i, text in enumerate(texts)])


In [55]:
chunks[2]

{'id': '63556d5f-b9f8-4565-8c1a-030fd857d1e0',
 'text': 'What We Do\n\n\n\n\n\n\n\n\n\n\n\n\n\nAbout us\n\n\n\n\n\n\n\n\n\n\n\n\n\nOur Products\n\n\n\n\n\n\n\n\n\n\n\n\n\nOur Showroom\n\n\n\n\n\n\n\n\n\n\n\n\n\nThe Audio Revolution\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreative Content\n\n\n\n\n\n\n\n\n\n\n\n\n\nBlog\n\n\n\n\n\n\n\n\n\n\n\n\n\nNOUS Wissensmanagement GmbH\n\n          Ullmannstraße 35 |\n          1150 Vienna, Austria |\n          info@nousdigital.com\n\n\n\nNewsletter signup\n\n          register\n        \n\n\n\n\n\n\n\n                  Home\n                \n\n\n\n                  What We Do\n                \n\n\n\n                  Our Products\n                \n\n\n\n                  Our Showroom\n                \n\n\n\n                  About us\n                \n\n\n\n                  Blog\n                \n\n\n\n\nLegal Notice\n\n\nTerms & Conditions\n\n\nPrivacy Policy\n\n\nCookie settings\n\n\n\n\n\n                  Deutsch\n                \n\n\n\n           

Our chunks are ready, so now we move on to embedding and indexing everything.

## Initialize Embedding Model

We use `text-embedding-ada-002` as the embedding model. We can embed text like so:

In [56]:
import os
import openai


openai.api_key = os.getenv("OPENAI_API_KEY") or "OPENAI_API_KEY"

openai.Engine.list()  # verify that we are authenticated

<OpenAIObject list at 0x7fb533a84090> JSON: {
  "object": "list",
  "data": [
    {
      "object": "engine",
      "id": "whisper-1",
      "ready": true,
      "owner": "openai-internal",
      "permissions": null,
      "created": null
    },
    {
      "object": "engine",
      "id": "babbage",
      "ready": true,
      "owner": "openai",
      "permissions": null,
      "created": null
    },
    {
      "object": "engine",
      "id": "davinci",
      "ready": true,
      "owner": "openai",
      "permissions": null,
      "created": null
    },
    {
      "object": "engine",
      "id": "text-davinci-edit-001",
      "ready": true,
      "owner": "openai",
      "permissions": null,
      "created": null
    },
    {
      "object": "engine",
      "id": "babbage-code-search-code",
      "ready": true,
      "owner": "openai-dev",
      "permissions": null,
      "created": null
    },
    {
      "object": "engine",
      "id": "text-similarity-babbage-001",
      "ready": t

In [57]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)

In the response `res` we will find a JSON-like object containing our new embeddings within the `'data'` field.

In [59]:
res

<OpenAIObject list at 0x7fb533b2dbc0> JSON: {
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.0030597876757383347,
        0.011693241074681282,
        -0.005041236057877541,
        -0.027201106771826744,
        -0.016350319609045982,
        0.03235017880797386,
        -0.016161609441041946,
        -0.0010235798545181751,
        -0.025812745094299316,
        -0.006638525985181332,
        0.020164944231510162,
        0.016619903966784477,
        -0.009172623045742512,
        0.023413440212607384,
        -0.0101094301789999,
        0.01342532318085432,
        0.025246618315577507,
        -0.016849050298333168,
        0.01208413951098919,
        -0.016350319609045982,
        -0.004229111596941948,
        -0.006466665770858526,
        -0.004336945712566376,
        0.020771509036421776,
        -0.01053402666002512,
        -0.0037000514566898346,
        0.013667949475347996,
        -0.0263519156724

In [58]:
res.keys()

dict_keys(['object', 'data', 'model', 'usage'])

Inside `'data'` we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains `1536` dimensions (the output dimensionality of the `text-embedding-ada-002` model).

In [16]:
len(res['data'])

2

In [17]:
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])

(1536, 1536)

We will apply this same embedding logic to the NOUS website pages we've just scraped. But before doing so, we must create a place to store the embeddings.

## Initializing the Index

Now we need a place to store these embeddings and enable an efficient vector search through them all. To do that we use Pinecone. We can get a [free API key](https://app.pinecone.io/) and enter it below, where we will initialize our connection to Pinecone and create a new index.
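
The index created in the next cells uses the `cosine` metric. As a rough illustration of what that metric computes, the sketch below scores a query vector against a few stored vectors in plain Python and returns the best matches. It is a brute-force stand-in for intuition only and makes no claim about how Pinecone searches internally; the function names and toy vectors are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, vectors, k=2):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    scored = [(vec_id, cosine_similarity(query_vec, vec))
              for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-dimensional "embeddings"; real ones have 1536 dimensions
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
matches = top_k([1.0, 0.1], toy_vectors, k=2)
print(matches)
```

Because cosine similarity depends only on direction, not magnitude, two texts of very different lengths can still score as close matches if their embeddings point the same way.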

In [62]:
import pinecone
import os

# set up a Pinecone account to get credentials
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or 'PINECONE_API_KEY'
PINECONE_ENVIRONMENT = 'us-west1-gcp-free'

os.environ['PINECONE_ENVIRONMENT'] = PINECONE_ENVIRONMENT

# note the spelling of the `environment` keyword
pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
pinecone.whoami()

WhoAmIResponse(username='a9d81a4', user_label='default', projectname='cc97d47')

In [61]:
index_name = 'gpt-4-nous-docs'

In [20]:
import time

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='cosine'
    )
    # wait for index to be initialized
    time.sleep(1)

# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 4884}},
 'total_vector_count': 4884}

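On a first run the index is empty, with a `total_vector_count` of `0`. The stats shown above report `4884` vectors because this index already held vectors, presumably from an earlier run. We can populate it with embeddings built by OpenAI's `text-embedding-ada-002` like so: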

In [21]:
chunks[1]['text']

'HTTP Cookie: _ga  Purpose: Used to distinguish users. Domain: www.nousdigital.net Storage duration: 2 years  Third party: Yes      HTTP Cookie: _gat  Purpose: Used to throttle the request rate. Domain: www.nousdigital.net Storage duration: Session  Third party: Yes      HTTP Cookie: _gid  Purpose: Registers a unique ID that is used to generate statistical data on how the visitor uses the website. Domain: www.nousdigital.net Storage duration: 23 hours  Third party: Yes        Service name: Google Maps   Privacy policy: https://policies.google.com/privacy   Owner: Google LLC       HTTP Cookie: _ga  Purpose: Used to distinguish users. Domain: www.nousdigital.net Storage duration: 2 years  Third party: Yes      HTTP Cookie: _gat  Purpose: Used to throttle the request rate. Domain: www.nousdigital.net Storage duration: Session  Third party: Yes      HTTP Cookie: _gid  Purpose: Registers a unique ID that is used to generate statistical data on how the visitor uses the website. Domain: www.n

In [63]:
openai.Embedding.create(input=chunks[1]['text'], engine=embed_model)

<OpenAIObject list at 0x7fb5557e6930> JSON: {
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.0074389623478055,
        0.0018271435983479023,
        0.030167600139975548,
        -0.023250188678503036,
        -0.013944623060524464,
        0.0037915376015007496,
        -0.0011245940113440156,
        -0.014438724145293236,
        0.011151581071317196,
        -0.04370047152042389,
        0.0035685058683156967,
        -0.0020313032437115908,
        0.010046716779470444,
        -0.025020716711878777,
        0.00855755154043436,
        0.032226353883743286,
        0.025487367063760757,
        -0.021932587027549744,
        0.02928919903934002,
        -0.0011571909999474883,
        -0.016456302255392075,
        0.0074938624165952206,
        -0.027244169265031815,
        -0.0012798584066331387,
        -0.008001687936484814,
        -0.012126057408750057,
        0.017252353951334953,
        -0.004261619

In [64]:
from tqdm.auto import tqdm
from time import sleep

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(chunks), batch_size)):
    # find end of batch
    i_end = min(len(chunks), i+batch_size)
    meta_batch = chunks[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings (retry every 5 seconds, e.g. after a RateLimitError)
    try:
        res = openai.Embedding.create(input=texts, engine=embed_model)
    except Exception as e:
        print(e)
        done = False
        while not done:
            sleep(5)
            try:
                res = openai.Embedding.create(input=texts, engine=embed_model)
                done = True
            except Exception:
                # keep retrying; a bare `except:` would also swallow KeyboardInterrupt
                pass
    embeds = [record['embedding'] for record in res['data']]
    # cleanup metadata
    meta_batch = [{
        'text': x['text'],
        'chunk': x['chunk'],
        'url': x['url']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)


Now we've added all of the NOUS website chunks to the index. With that, we can move on to retrieval and then answer generation using GPT-4.
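
The upsert loop above retries a failed embedding call every five seconds, indefinitely. A more defensive pattern is to cap the number of attempts and grow the wait between them. The sketch below shows that pattern in isolation and assumes nothing about the OpenAI client: `with_retries` and `flaky_embed` are hypothetical names, and `flaky_embed` merely simulates a call that fails twice before succeeding.

```python
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            sleep(delay)

calls = {"count": 0}

def flaky_embed():
    """Simulated embedding call: fails on the first two calls, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return [0.1, 0.2, 0.3]

# Record the waits instead of actually sleeping, to keep the demo fast
waits = []
embedding = with_retries(flaky_embed, base_delay=0.01, sleep=waits.append)
print(embedding, waits)
```

In the notebook, the embedding call could be wrapped as `with_retries(lambda: openai.Embedding.create(input=texts, engine=embed_model))`, so a persistent failure surfaces as an exception instead of looping forever.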

## Retrieval

To search through our documents we first need to create a query vector `xq`. Using `xq` we will retrieve the most relevant chunks from the NOUS website, like so:

In [65]:
def create_embedding(query):
    """Embed `query` with `embed_model` and return the raw OpenAI response."""
    return openai.Embedding.create(
        input=[query],
        engine=embed_model
    )

def query_pinecone_embedding(openai_embedding):
    # extract the query vector from the OpenAI response
    xq = openai_embedding['data'][0]['embedding']
    # retrieve the 5 most similar chunks from Pinecone, with their metadata
    return index.query(xq, top_k=5, include_metadata=True)

In [66]:
query = "Who is the CEO of NOUS Wissensmanagement GmbH?"

embedding = create_embedding(query)
result = query_pinecone_embedding(embedding)
result

{'matches': [{'id': 'ea4f9d6f-9a8f-479a-b136-6c3c738239c4',
              'metadata': {'chunk': 10.0,
                           'text': '3                           NOUS '
                                   'Wissensmanagement GmbH            '
                                   'Ullmannstraße 35 |           1150 Vienna, '
                                   'Austria |           '
                                   'info@nousdigital.com    Newsletter '
                                   'signup            '
                                   'register                                   '
                                   'Home                                       '
                                   'What We '
                                   'Do                                       '
                                   'Our '
                                   'Products                                       '
                                   'Our '
                                

With retrieval complete, we move on to feeding these into GPT-4 to produce answers.

## Retrieval Augmented Generation

GPT-4 is currently accessed via the `ChatCompletion` endpoint of OpenAI. To give the model the information we retrieved, we pass it in the user prompt *alongside* our original query. We can do that like so:

In [67]:
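def pinecone_to_list(res, question=None):
    # fall back to the notebook-level `query` when no question is passed
    question = query if question is None else question
    # get list of retrieved texts
    contexts = [item['metadata']['text'] for item in res['matches']]
    # join the contexts and append the question after a separator
    return "\n\n---\n\n".join(contexts) + "\n\n-----\n\n" + question

pinecone_to_list(result)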


'3                           NOUS Wissensmanagement GmbH            Ullmannstraße 35 |           1150 Vienna, Austria |           info@nousdigital.com    Newsletter signup            register                                   Home                                       What We Do                                       Our Products                                       Our Showroom                                       About us                                       Blog                      Legal Notice   Terms & Conditions   Privacy\n\n---\n\n3                           NOUS Wissensmanagement GmbH            Ullmannstraße 35 |           1150 Vienna, Austria |           info@nousdigital.com    Newsletter signup            register                                   Home                                       What We Do                                       Our Products                                       Our Showroom                                       About us                          
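
The output above repeats the same footer text, presumably because the same navigation block appears on many of the crawled pages. One possible mitigation, sketched here under the assumption that matches have the same `metadata['text']` shape as the Pinecone result shown earlier, is to drop repeated texts before joining them. `dedupe_contexts` is a hypothetical helper, and the sample matches are invented for illustration.

```python
def dedupe_contexts(matches):
    """Return retrieved texts in order, skipping whitespace/case duplicates."""
    seen = set()
    unique = []
    for match in matches:
        text = match["metadata"]["text"]
        # normalise whitespace and case so trivially different copies collide
        key = " ".join(text.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

# Invented sample: the first two entries differ only in spacing
sample_matches = [
    {"metadata": {"text": "NOUS Wissensmanagement GmbH   Vienna, Austria"}},
    {"metadata": {"text": "NOUS Wissensmanagement GmbH Vienna, Austria"}},
    {"metadata": {"text": "NOUS develops mobile guides."}},
]
contexts = dedupe_contexts(sample_matches)
print(contexts)
```

Deduplicating leaves more of the context window for distinct passages, although it cannot help when none of the retrieved chunks contains the answer.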

Now we define a helper that sends a system prompt and the augmented user message to the chat model:

In [68]:
def query_gpt(prompt, message, model="gpt-4"):
    return openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": message}
        ]
    )

In [28]:
# system message to 'prime' the model
prompt = """You are Q&A bot. A highly intelligent system that answers
user questions based on the information provided by the user above
each question. If the information can not be found in the information
provided by the user you truthfully say "I don't know".
"""

In [29]:
def search_docs(query, model="gpt-4"):
    embedding = create_embedding(query)
    result = query_pinecone_embedding(embedding)
    open_ai_message = pinecone_to_list(result)
    return query_gpt(prompt, open_ai_message, model=model)

In [None]:
# "What can you tell me about the company NOUS?"
# "What services does NOUS provide?"
# "Who are the CEOs?"
# "Who is the CEO of NOUS Wissensmanagement?"
# "Which technologies does NOUS use?"

In [79]:
query = "What can you tell me about the company NOUS?"
result = search_docs(query, model="gpt-3.5-turbo")
result['choices'][0]['message']['content']

'NOUS is a business headquartered in Vienna, Austria, with branch offices in Denver (USA) and Dubai (UAE) that ranks among the leading providers in the field of mobile guides and media-based education. They have provided tailor-made concepts with innovative technology for over 200 projects of differing sizes along with various focuses realized with museums and cultural organizations as well as big brands worldwide. NOUS advises institutions and companies on adapting their educational role to bring them inline with this ever more digital world and develops multimedia, intuitive end-user and streaming applications on the web and other platforms. They offer varied and challenging tasks in a young and dynamic team, internal trainings and further education, regular team events, flexible working hours and the possibility to work remotely, and an airy office located in an up-and-coming creative area with very good public connections.'

To render this response nicely, we display it as Markdown.

In [80]:
from IPython.display import Markdown

display(Markdown(result['choices'][0]['message']['content']))

NOUS is a business headquartered in Vienna, Austria, with branch offices in Denver (USA) and Dubai (UAE) that ranks among the leading providers in the field of mobile guides and media-based education. They have provided tailor-made concepts with innovative technology for over 200 projects of differing sizes along with various focuses realized with museums and cultural organizations as well as big brands worldwide. NOUS advises institutions and companies on adapting their educational role to bring them inline with this ever more digital world and develops multimedia, intuitive end-user and streaming applications on the web and other platforms. They offer varied and challenging tasks in a young and dynamic team, internal trainings and further education, regular team events, flexible working hours and the possibility to work remotely, and an airy office located in an up-and-coming creative area with very good public connections.

Let's compare this to a non-augmented query, sent to `gpt-4` without any retrieved context:

In [76]:
res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": prompt },
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))

I don't know.

What if we drop the `"I don't know"` part of the system prompt?

In [81]:
res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are Q&A bot. A highly intelligent system that answers user questions"},
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))

NOUS is potentially an ambiguous term, as there might be multiple companies or organizations with this name. One possible interpretation is that you are referring to NOUS Knowledge Management Inc., a Canadian-based tech company. This company specializes in developing software applications and providing consulting services for knowledge management, document management, and process automation. Their flagship product is the NOUS Document Management System, which helps organizations streamline the creation, review, storage, retrieval, and disposal of their documents.

Other possibilities might exist, such as Nous Group, a management consulting firm based in Australia. To provide you with the most accurate information about the company, please provide more details or specify any specific industry or country for the company.

Then we see something even worse than `"I don't know"`: hallucinations. Clearly, augmenting our queries with additional context can make a huge difference to the performance of our system.

Great, we've seen how to augment GPT-4 with semantic search to allow us to answer NOUS-specific queries.

Once you're finished, delete the index to save resources.

In [35]:
#pinecone.delete_index(index_name)