[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/rag-chatbot.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/langchain/rag-chatbot.ipynb)

# Building RAG Chatbots with LangChain

In this example, we'll build an AI chatbot from start to finish. We will use LangChain, OpenAI, and the Pinecone vector database to build a chatbot capable of learning from the external world using **R**etrieval **A**ugmented **G**eneration (RAG).

We will be using a dataset sourced from the Llama 2 ArXiv paper and other related papers to help our chatbot answer questions about the latest and greatest in the world of GenAI.

By the end of the example we'll have a functioning chatbot and RAG pipeline that can hold a conversation and provide informative responses based on a knowledge base.

### Before you begin

You'll need to get an [OpenAI API key](https://platform.openai.com/account/api-keys) and [Pinecone API key](https://app.pinecone.io).
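
A minimal sketch of one way to read these keys from the environment and fail early when one is missing, instead of sending a placeholder string to the API. The helper name `require_env` and the `DEMO_API_KEY` variable are hypothetical and only there to make the snippet runnable anywhere; in the notebook you would call it with `OPENAI_API_KEY` and `PINECONE_API_KEY`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it before running the notebook"
        )
    return value


# demo with a throwaway variable so the snippet runs without real credentials
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```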

### Prerequisites

Before we start building our chatbot, we need to install some Python libraries. Here's a brief overview of what each library does:

- **langchain**: This is a library for GenAI. We'll use it to chain together different language models and components for our chatbot.
- **openai**: This is the official OpenAI Python client. We'll use it to interact with the OpenAI API and generate responses for our chatbot.
- **datasets**: This library provides a vast array of datasets for machine learning. We'll use it to load our knowledge base for the chatbot.
- **pinecone-client**: This is the official Pinecone Python client. We'll use it to interact with the Pinecone API and store our chatbot's knowledge base in a vector database.

You can install these libraries using pip like so:

In [1]:
!pip install -qU \
    langchain==0.0.354 \
    openai==1.6.1 \
    datasets==2.10.1 \
    pinecone-client==3.0.0 \
    tiktoken==0.5.2

### Building a Chatbot (no RAG)

We will be relying heavily on the LangChain library to bring together the different components needed for our chatbot. To begin, we'll create a simple chatbot without any retrieval augmentation. We do this by initializing a `ChatOpenAI` object. For this we do need an [OpenAI API key](https://platform.openai.com/account/api-keys).

In [2]:
import os
from langchain.chat_models import ChatOpenAI

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or "YOUR_API_KEY"

chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)


Chats with OpenAI's `gpt-3.5-turbo` and `gpt-4` chat models are typically structured (in plain text) like this:

```
System: You are a helpful assistant.

User: Hi AI, how are you today?

Assistant: I'm great thank you. How can I help you?

User: I'd like to understand string theory.

Assistant:
```

The final `"Assistant:"` without a response is what would prompt the model to continue the conversation. In the official OpenAI `ChatCompletion` endpoint these would be passed to the model in a format like:

```python
[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi AI, how are you today?"},
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    {"role": "user", "content": "I'd like to understand string theory."}
]
```
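
To make the link between the two layouts concrete, here is a small sketch that renders a list of role/content dictionaries as the plain-text transcript shown earlier. The function name `render_transcript` is hypothetical and is not part of the OpenAI client; it only illustrates how the roles map onto the `System:`, `User:`, and `Assistant:` labels.

```python
def render_transcript(messages):
    """Render role/content dicts as the plain-text chat layout."""
    labels = {"system": "System", "user": "User", "assistant": "Assistant"}
    lines = [f"{labels[m['role']]}: {m['content']}" for m in messages]
    # the trailing "Assistant:" is the cue for the model to write the next turn
    lines.append("Assistant:")
    return "\n\n".join(lines)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi AI, how are you today?"},
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    {"role": "user", "content": "I'd like to understand string theory."},
]
print(render_transcript(messages))
```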

In LangChain there is a slightly different format. We use three _message_ objects like so:

In [3]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand string theory.")
]

The format is very similar. We've swapped the `"system"` role for `SystemMessage`, the `"user"` role for `HumanMessage`, and the `"assistant"` role for `AIMessage`.

We generate the next response from the AI by passing these messages to the `ChatOpenAI` object.

In [4]:
res = chat(messages)
res


AIMessage(content='String theory is a theoretical framework in physics that aims to provide a unified description of all fundamental forces and particles in the universe. It suggests that the fundamental building blocks of the universe are not point-like particles, but tiny, vibrating strings.\n\nHere are some key points to understand about string theory:\n\n1. Dimensions: String theory requires the existence of extra dimensions beyond the familiar three dimensions of space (length, width, and height) and one dimension of time. These additional dimensions are compactified or "curled up" at incredibly small scales, making them difficult to detect.\n\n2. Vibrating Strings: In string theory, particles are not considered as separate entities but rather as different vibrational modes of a string. The different vibrational patterns of the string correspond to different particles with varying masses and properties.\n\n3. Supersymmetry: String theory incorporates supersymmetry, which suggests 

In response we get another AI message object. We can print it more clearly like so:

In [5]:
print(res.content)

String theory is a theoretical framework in physics that aims to provide a unified description of all fundamental forces and particles in the universe. It suggests that the fundamental building blocks of the universe are not point-like particles, but tiny, vibrating strings.

Here are some key points to understand about string theory:

1. Dimensions: String theory requires the existence of extra dimensions beyond the familiar three dimensions of space (length, width, and height) and one dimension of time. These additional dimensions are compactified or "curled up" at incredibly small scales, making them difficult to detect.

2. Vibrating Strings: In string theory, particles are not considered as separate entities but rather as different vibrational modes of a string. The different vibrational patterns of the string correspond to different particles with varying masses and properties.

3. Supersymmetry: String theory incorporates supersymmetry, which suggests that every known particle h

Because `res` is just another `AIMessage` object, we can append it to `messages`, add another `HumanMessage`, and generate the next response in the conversation.

In [6]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="Why do physicists believe it can produce a 'unified theory'?"
)
# add to messages
messages.append(prompt)

# send to chat-gpt
res = chat(messages)

print(res.content)

Physicists believe that string theory has the potential to produce a unified theory because it incorporates all the fundamental forces and particles of nature into a single framework. This is in contrast to the currently accepted theories, such as the Standard Model of particle physics, which describe the electromagnetic, weak, and strong nuclear forces separately.

Here are some reasons why physicists see string theory as a candidate for a unified theory:

1. Gravity and Quantum Mechanics: One of the main motivations for string theory is its ability to incorporate gravity into a quantum mechanical framework. Gravity, described by Einstein's general theory of relativity, is not fully compatible with quantum mechanics, which is the framework that successfully describes the other three fundamental forces. String theory offers a way to reconcile these two theories by describing gravity in terms of vibrating strings, which naturally exhibit quantum behavior.

2. Consistency and Mathematica
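
The cell above follows a simple pattern: append the model's reply to the history, append the next user turn, and send the whole list again. The chat endpoint does not remember earlier calls, so the history list is the only memory the chatbot has. The sketch below shows that pattern with plain dictionaries and a hypothetical `fake_chat` stand-in for the model so it runs offline; `ask` is also an illustrative helper, not a LangChain function.

```python
def fake_chat(messages):
    """Hypothetical stand-in for a chat model: reports how much history it saw."""
    return {
        "role": "assistant",
        "content": f"(reply after {len(messages)} messages)",
    }


def ask(history, user_text, chat_fn=fake_chat):
    """Append the user turn, call the model, then append and return its reply."""
    history.append({"role": "user", "content": user_text})
    reply = chat_fn(history)
    history.append(reply)
    return reply


history = [{"role": "system", "content": "You are a helpful assistant."}]
ask(history, "I'd like to understand string theory.")
second = ask(history, "Why do physicists believe it can produce a 'unified theory'?")

print(second["content"])
print(len(history))
```

Because every call resends the full list, long conversations grow the prompt with each turn.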

### Dealing with Hallucinations

We have our chatbot, but the knowledge of LLMs is limited because they learn everything they know during training. An LLM essentially compresses the "world" as seen in the training data into the internal parameters of the model. We call this knowledge the _parametric knowledge_ of the model.

By default, LLMs have no access to the external world.

The result of this is very clear when we ask LLMs about more recent information, like about the new (and very popular) Llama 2 LLM.

In [7]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="What is so special about Llama 2?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [8]:
print(res.content)

I'm sorry, but I couldn't find any specific information about "Llama 2" in relation to any significant scientific or technological concept or discovery. It's possible that you may be referring to something specific that I am not aware of. Could you please provide more context or clarify your question? That way, I can try to assist you better.


Our chatbot can no longer help us because it doesn't contain the information we need to answer the question. It is clear from this answer that the LLM doesn't know the information, but sometimes an LLM responds as if it _does_ know the answer, and this can be very hard to detect.

OpenAI have since adjusted the behavior for this particular example as we can see below:

In [9]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="Can you tell me about the LLMChain in LangChain?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [10]:
print(res.content)

I apologize, but I couldn't find any information on an LLMChain specifically related to LangChain. It's possible that you may be referring to a specific term or concept that is not widely known or documented. If you can provide more context or explain further, I'll do my best to assist you.


There is another way of feeding knowledge into LLMs. It is called _source knowledge_ and it refers to any information fed into the LLM via the prompt. We can try that with the LLMChain question. We can take a description of this object from the LangChain documentation.

In [11]:
llmchain_information = [
    "A LLMChain is the most common type of chain. It consists of a PromptTemplate, a model (either an LLM or a ChatModel), and an optional output parser. This chain takes multiple input variables, uses the PromptTemplate to format them into a prompt. It then passes that to the model. Finally, it uses the OutputParser (if provided) to parse the output of the LLM into a final format.",
    "Chains is an incredibly generic concept which refers to a sequence of modular components (or other chains) combined in a particular way to accomplish a common use case.",
    "LangChain is a framework for developing applications powered by language models. We believe that the most powerful and differentiated applications will not only call out to a language model via an api, but will also: (1) Be data-aware: connect a language model to other sources of data, (2) Be agentic: Allow a language model to interact with its environment. As such, the LangChain framework is designed with the objective in mind to enable those types of applications."
]

source_knowledge = "\n".join(llmchain_information)

We can feed this additional knowledge into our prompt with some instructions telling the LLM how we'd like it to use this information alongside our original query.

In [12]:
query = "Can you tell me about the LLMChain in LangChain?"

augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

Now we feed this into our chatbot as we were before.

In [13]:
# create a new user prompt
prompt = HumanMessage(
    content=augmented_prompt
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [14]:
print(res.content)

The LLMChain is a type of chain within the LangChain framework. Chains in LangChain are sequences of modular components or other chains combined in a specific way to achieve a common use case. The LLMChain, in particular, is the most common type of chain.

The LLMChain consists of three main components: a PromptTemplate, a language model (either an LLM or a ChatModel), and an optional output parser. This chain takes multiple input variables, which are formatted into a prompt using the PromptTemplate. The formatted prompt is then passed to the language model for processing. Finally, the OutputParser, if provided, is used to parse the output of the language model into a final format.

LangChain itself is a framework designed for developing applications powered by language models. It aims to create applications that not only make use of language models via an API but also connect the language model to other sources of data, making them data-aware. Additionally, LangChain enables applicati

The quality of this answer is phenomenal. This is made possible by augmenting our query with external knowledge (source knowledge). There's just one problem: how do we get this information in the first place?

We learned in the previous chapters about Pinecone and vector databases. Well, they can help us here too. But first, we'll need a dataset.

### Importing the Data

We will use the Hugging Face Datasets library to load our data, specifically the `"jamescalam/llama-2-arxiv-papers-chunked"` dataset. It contains a collection of ArXiv papers, already split into chunks, which will serve as the external knowledge base for our chatbot.

In [15]:
from datasets import load_dataset

dataset = load_dataset("jamescalam/llama-2-arxiv-papers-chunked", split="train")
dataset


Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 4838
})

In [16]:
dataset[0]

{'doi': '1102.0183',
 'chunk-id': '0',
 'chunk': 'High-Performance Neural Networks\nfor Visual Object Classi\x0ccation\nDan C. Cire\x18 san, Ueli Meier, Jonathan Masci,\nLuca M. Gambardella and J\x7f urgen Schmidhuber\nTechnical Report No. IDSIA-01-11\nJanuary 2011\nIDSIA / USI-SUPSI\nDalle Molle Institute for Arti\x0ccial Intelligence\nGalleria 2, 6928 Manno, Switzerland\nIDSIA is a joint institute of both University of Lugano (USI) and University of Applied Sciences of Southern Switzerland (SUPSI),\nand was founded in 1988 by the Dalle Molle Foundation which promoted quality of life.\nThis work was partially supported by the Swiss Commission for Technology and Innovation (CTI), Project n. 9688.1 IFF:\nIntelligent Fill in Form.arXiv:1102.0183v1  [cs.AI]  1 Feb 2011\nTechnical Report No. IDSIA-01-11 1\nHigh-Performance Neural Networks\nfor Visual Object Classi\x0ccation\nDan C. Cire\x18 san, Ueli Meier, Jonathan Masci,\nLuca M. Gambardella and J\x7f urgen Schmidhuber\nJanuary 2011\nAbs

#### Dataset Overview

The dataset we are using is sourced from the Llama 2 ArXiv papers. It is a collection of academic papers from ArXiv, a repository of electronic preprints approved for publication after moderation. Each entry in the dataset represents a "chunk" of text from these papers.

Because most **L**arge **L**anguage **M**odels (LLMs) only contain knowledge of the world as it was during training, they cannot answer our questions about Llama 2 — at least not without this data.

### Building the Knowledge Base

We now have a dataset that can serve as our chatbot knowledge base. Our next task is to transform that dataset into the knowledge base that our chatbot can use. To do this we must use an embedding model and vector database.

We begin by initializing our connection to Pinecone, which requires a [free API key](https://app.pinecone.io).

In [17]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [18]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-west-2"
)

Then we initialize the index. We will be using OpenAI's `text-embedding-ada-002` model for creating the embeddings, so we set the `dimension` to `1536`.

In [19]:
import time

index_name = 'llama-2-rag'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

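
The index is created with `metric='dotproduct'`. For unit-length vectors the dot product and cosine similarity are the same number, so both metrics rank results identically. OpenAI describes its embeddings as normalized to length 1, though it is worth confirming this for whichever model you use. The sketch below checks the identity on toy 3-dimensional vectors; the helper names are illustrative and the vectors are made up, not real embeddings.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity, which divides out both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# toy 3-dimensional vectors standing in for 1536-dimensional embeddings
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

print(round(dot(a, b), 6), round(cosine(a, b), 6))
```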


Our index is now ready but it's empty. It is a vector index, so it needs vectors. As mentioned, to create these vector embeddings we will use OpenAI's `text-embedding-ada-002` model, which we can access via LangChain like so:

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")

Using this model we can create embeddings like so:

In [20]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed_model.embed_documents(texts)
len(res), len(res[0])


From this we get two (aligning to our two chunks of text) 1536-dimensional embeddings.

We're now ready to embed and index all of our data! We do this by looping through our dataset, embedding each batch, and inserting it into the index.

In [21]:
from tqdm.auto import tqdm  # for progress bar

data = dataset.to_pandas()  # this makes it easier to iterate over the dataset

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate unique ids for each chunk
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    # get text to embed
    texts = [x['chunk'] for _, x in batch.iterrows()]
    # embed text
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))

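
To see what the batching loop does without calling any API, the sketch below reproduces only its index arithmetic and ID format. With the 4838 rows reported by the dataset and a batch size of 100, the loop runs 49 times and the last batch is a short one. The helper names `batch_ranges` and `make_id` are hypothetical.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) index pairs that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(n_rows, start + batch_size)


def make_id(doi, chunk_id):
    """Build a per-chunk vector ID in the '<doi>-<chunk-id>' format."""
    return f"{doi}-{chunk_id}"


ranges = list(batch_ranges(4838, 100))  # 4838 rows, as reported by the dataset
print(len(ranges), ranges[0], ranges[-1])  # 49 (0, 100) (4800, 4838)
print(make_id("1102.0183", "0"))  # 1102.0183-0
```

Combining the DOI with the chunk ID keeps IDs unique across papers, so re-running the loop overwrites existing vectors instead of duplicating them.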

We can check that the vector index has been populated using `describe_index_stats` like before:

In [22]:
index.describe_index_stats()


#### Retrieval Augmented Generation

We've built a fully-fledged knowledge base. Now it's time to connect that knowledge base to our chatbot. To do that we'll be diving back into LangChain and reusing our template prompt from earlier.

To use LangChain here we need to load the LangChain abstraction for a vector index, called a `vectorstore`. We pass in our vector `index` to initialize the object.

In [23]:
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)


Using this `vectorstore` we can already query the index and see if we have any relevant information given our question about Llama 2.

In [24]:
query = "What is so special about Llama 2?"

vectorstore.similarity_search(query, k=3)


We return a lot of text here and it's not that clear what we need or what is relevant. Fortunately, our LLM will be able to parse this information much faster than us. All we need is to connect the output from our `vectorstore` to our `chat` chatbot. To do that we can use the same logic as we used earlier.

In [25]:
def augment_prompt(query: str):
    # get top 3 results from knowledge base
    results = vectorstore.similarity_search(query, k=3)
    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt
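
Conceptually, `similarity_search` embeds the query, scores it against the stored vectors, and returns the text of the top `k` matches, which `augment_prompt` then joins into the prompt. The sketch below imitates those two steps with an in-memory list and made-up 2-dimensional vectors so it runs without Pinecone or OpenAI. The names `top_k_texts` and `build_prompt` are hypothetical, and a real index uses an approximate search rather than sorting every record.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_texts(query_vec, records, k=3):
    """Return the texts of the k records with the highest dot-product score."""
    ranked = sorted(
        records, key=lambda r: dot(query_vec, r["values"]), reverse=True
    )
    return [r["text"] for r in ranked[:k]]


def build_prompt(query, contexts):
    """Join retrieved contexts and the query into a single prompt string."""
    source_knowledge = "\n".join(contexts)
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{source_knowledge}\n\n"
        f"Query: {query}"
    )


# made-up vectors; real ones would come from the embedding model
records = [
    {"text": "chunk about llamas", "values": [0.9, 0.1]},
    {"text": "chunk about string theory", "values": [0.1, 0.9]},
    {"text": "chunk about alpacas", "values": [0.8, 0.2]},
]
query_vec = [1.0, 0.0]

top = top_k_texts(query_vec, records, k=2)
print(build_prompt("What is so special about Llama 2?", top))
```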

Using this we produce an augmented prompt:

In [26]:
print(augment_prompt(query))


There is still a lot of text here, so let's pass it onto our chat model to see how it performs.

In [27]:
# create a new user prompt
prompt = HumanMessage(
    content=augment_prompt(query)
)
# add to messages
messages.append(prompt)

res = chat(messages)

print(res.content)


We can continue with more Llama 2 questions. Let's try _without_ RAG first:

In [28]:
prompt = HumanMessage(
    content="what safety measures were used in the development of llama 2?"
)

res = chat(messages + [prompt])
print(res.content)

I apologize for the confusion, but based on the given contexts, there is no specific information available regarding the safety measures used in the development of "LLama 2" or any direct relation to safety measures. It's possible that "LLama 2" is not a widely known or referenced concept in the provided contexts. If you have any other questions or need assistance with a different topic, please let me know, and I'll be happy to help.


The chatbot is able to respond about Llama 2 thanks to its conversational history stored in `messages`. However, it doesn't know anything about the safety measures themselves as we have not provided it with that information via the RAG pipeline. Let's try again but with RAG.

In [29]:
prompt = HumanMessage(
    content=augment_prompt(
        "what safety measures were used in the development of llama 2?"
    )
)

res = chat(messages + [prompt])
print(res.content)


We get a much more informed response that includes several items missing in the previous non-RAG response, such as "red-teaming", "iterative evaluations", and the intention of the researchers to share this research to help "improve their safety, promoting responsible development in the field".
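
Note that the last two cells call `chat(messages + [prompt])` rather than appending to `messages`. Concatenation builds a new list for that call only, so the long augmented prompt is not stored in the conversation history and is not resent on later turns. A minimal illustration with plain strings standing in for message objects:

```python
history = ["system message", "earlier user turn", "earlier AI reply"]
one_off = "augmented prompt with retrieved contexts"

# concatenation creates a new list and leaves the history untouched
sent = history + [one_off]
print(len(sent), len(history))

# append mutates the shared history, so later calls would resend the contexts
history.append(one_off)
print(len(history))
```

Which option fits depends on whether follow-up questions should still see the retrieved contexts.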

Delete the index to save resources:

In [30]:
pc.delete_index(index_name)



### Weaviate

---

In [32]:
import weaviate
import weaviate.classes as wvc
import os

client = weaviate.connect_to_local(
    port=8080,
    grpc_port=50051,
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]  # Replace with your inference API key
    }
)

In [33]:
questions = client.collections.create(
    name="Question",
    vectorizer_config=wvc.Configure.Vectorizer.text2vec_openai(),  # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
    generative_config=wvc.Configure.Generative.openai()  # Ensure the `generative-openai` module is used for generative queries
)


In [34]:
import requests
import json
import os

resp = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
data = json.loads(resp.text)  # Load data

question_objs = list()
for i, d in enumerate(data):
    question_objs.append({
        "answer": d["Answer"],
        "question": d["Question"],
        "category": d["Category"],
    })

questions = client.collections.get("Question")
questions.data.insert_many(question_objs)  # This uses batching under the hood

BatchObjectReturn(all_responses=[UUID('7d434b23-c082-4c6f-b289-a55830dd8113'), UUID('17794abf-d48f-41be-b852-58b7b277a188'), UUID('1d6152cd-db90-40b4-b063-da39c1d863ee'), UUID('93e665b8-6d09-47f8-917a-99b9f282cecd'), UUID('1637b11f-e5e7-4ed3-8aa0-5f99e4a61ed4'), UUID('f51e0763-ada4-43f3-9a56-576883ec4680'), UUID('04b903ce-7f60-444f-970c-4b86031f2c6d'), UUID('9ed8825b-a9a7-40dc-8221-b6e40c35a3df'), UUID('2330959d-7613-4532-a79e-9dd3f8eb9b8b'), UUID('3976fc59-ca22-435e-ac4c-a6ee6aa994ff')], elapsed_seconds=1.8372163772583008, errors={}, uuids={0: UUID('7d434b23-c082-4c6f-b289-a55830dd8113'), 1: UUID('17794abf-d48f-41be-b852-58b7b277a188'), 2: UUID('1d6152cd-db90-40b4-b063-da39c1d863ee'), 3: UUID('93e665b8-6d09-47f8-917a-99b9f282cecd'), 4: UUID('1637b11f-e5e7-4ed3-8aa0-5f99e4a61ed4'), 5: UUID('f51e0763-ada4-43f3-9a56-576883ec4680'), 6: UUID('04b903ce-7f60-444f-970c-4b86031f2c6d'), 7: UUID('9ed8825b-a9a7-40dc-8221-b6e40c35a3df'), 8: UUID('2330959d-7613-4532-a79e-9dd3f8eb9b8b'), 9: UUID('39

In [35]:
import weaviate
import weaviate.classes as wvc

# As of November 2023, we are working towards making all WCS instances compatible with the new API introduced in the v4 Python client.
# Accordingly, we show you how to connect to a local instance of Weaviate.
# Here, authentication is switched off, which is why you do not need to provide the Weaviate API key.
client = weaviate.connect_to_local(
    port=8080,
    grpc_port=50051,
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]  # Replace with your inference API key
    }
)

questions = client.collections.get("Question")

response = questions.query.near_text(
    query="biology",
    limit=2
)

response.objects[0].properties # Inspect the first object

{'answer': 'DNA',
 'question': 'In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance',
 'category': 'SCIENCE'}

In [36]:
questions = client.collections.get("Question")

response = questions.query.near_text(
    query="biology",
    limit=2,
    filters=wvc.Filter(path="category").equal("ANIMALS")
)

print(response.objects[0].properties)  # Inspect the first object

{'answer': 'the nose or snout', 'question': 'The gavial looks very much like a crocodile except for this bodily feature', 'category': 'ANIMALS'}


In [None]:
questions = client.collections.get("Question")

response = questions.generate.near_text(
    query="biology",
    limit=2,
    single_prompt="Explain {answer} as you might to a five-year-old."
)

print(response.objects[0].generated)  # Inspect the generated text

DNA is like a special code that tells our bodies how to grow and work. It's like a recipe book that has all the instructions for making you who you are. Just like a recipe book tells you how to make yummy cookies, DNA tells your body how to make your eyes, hair, and even how tall you will be. It's really amazing because it's what makes you unique and different from everyone else!


### Weaviate for Chatbot

In [2]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")


In [47]:
import os
import re
import openai
import pinecone
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

openai.api_key = os.environ["OPENAI_API_KEY"]
MODEL = "text-embedding-ada-002"

def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)
    return text

def process_pdf(file_path):
    # create a loader
    loader = PyPDFLoader(file_path)
    # load your data
    data = loader.load()
    # Split your data up into smaller documents with Chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=50)
    documents = text_splitter.split_documents(data)
    # Convert Document objects into strings
    texts = [str(doc) for doc in documents]
    return texts

# Define a function to create embeddings
def create_embeddings(texts):
    # embed_documents expects a list of strings and returns one vector per string
    return embed_model.embed_documents(texts)

file_path = "10 Academy Cohort A - Weekly Challenge: Week - 6.pdf"  
docs = process_pdf(file_path)
docs

["page_content='10\\nAcademy\\nCohort\\nA\\nWeekly\\nChallenge:\\nWeek\\n6\\nPrecision\\nRAG:\\nPrompt\\nTuning\\nFor\\nBuilding\\nEnterprise' metadata={'source': '10 Academy Cohort A - Weekly Challenge: Week - 6.pdf', 'page': 0}",
 "page_content='RAG:\\nPrompt\\nTuning\\nFor\\nBuilding\\nEnterprise\\nGrade\\nRAG\\nSystems\\nBusiness\\nobjective\\nPromptlyTech\\nis\\nan' metadata={'source': '10 Academy Cohort A - Weekly Challenge: Week - 6.pdf', 'page': 0}",
 "page_content='RAG\\nSystems\\nBusiness\\nobjective\\nPromptlyTech\\nis\\nan\\ninnovative\\ne-business\\nspecializing\\nin\\nproviding' metadata={'source': '10 Academy Cohort A - Weekly Challenge: Week - 6.pdf', 'page': 0}",
 "page_content='innovative\\ne-business\\nspecializing\\nin\\nproviding\\nAI-driven\\nsolutions\\nfor\\noptimizing\\nthe\\nuse\\nof' metadata={'source': '10 Academy Cohort A - Weekly Challenge: Week - 6.pdf', 'page': 0}",
 "page_content='AI-driven\\nsolutions\\nfor\\noptimizing\\nthe\\nuse\\nof\\nLanguage\\nMo

In [21]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key = openai.api_key)

In [22]:
import weaviate
import weaviate.classes as wvc
from langchain.vectorstores import Weaviate
import os

client = weaviate.connect_to_local(
    port=8080,
    grpc_port=50051,
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]  # Replace with your inference API key
    }
)

In [23]:
client = weaviate.Client("http://localhost:8080/")
client.schema.delete_all()
client.schema.get()
schema = {
    "classes": [
        {
            "class": "Chatbot",
            "description": "Documents for chatbot",
            "vectorizer": "text2vec-openai",
            "moduleConfig": {"text2vec-openai": {"model": "ada", "type": "text"}},
            "properties": [
                {
                    "dataType": ["text"],
                    "description": "The content of the paragraph",
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": False,
                            "vectorizePropertyName": False,
                        }
                    },
                    "name": "content",
                },
            ],
        },
    ]
}

client.schema.create(schema)

vectorstore = Weaviate(client, "Chatbot", "content", attributes=["source"])

In [24]:
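# `docs` holds strings, so reload the chunks as Document objects
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=50)
documents = splitter.split_documents(PyPDFLoader(file_path).load())
texts = [doc.page_content for doc in documents]
meta = [doc.metadata for doc in documents]
vectorstore.add_texts(texts, meta)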

In [96]:
query = "what is this weeks task?"

# retrieve text related to the query
docs = vectorstore.similarity_search(query, k=4)
