# Retrieval Augmented Generation with Google Gemini and BigQuery


## Install libraries

In [1]:
%pip install --upgrade --quiet  langchain langchain-google-vertexai google-cloud-bigquery unstructured beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


## Set up

In [1]:
!gcloud config set project derrick-doit-sandbox --quiet

Updated property [core/project].


### Create BigQuery Dataset

In [2]:
PROJECT_ID = "derrick-doit-sandbox"
REGION = "US"
DATASET = "vector_search"
TABLE = "doc_and_vectors"
SITEMAP = "https://ai.google.dev/sitemap_0_of_1.xml"

In [3]:
from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID, location=REGION)
client.create_dataset(dataset=DATASET, exists_ok=True)

Dataset(DatasetReference('derrick-doit-sandbox', 'vector_search'))

### Embeddings

Embeddings represent data of any type (images, audio files, text, documents, and so on) as arrays of numbers called vectors, arranged so that semantically similar items sit close together.

Vertex AI Embeddings for Text has an embedding space with 768 dimensions.

The image below visualizes the embedding space of 8 million Stack Overflow questions.

![Nomic AI Atlas map of the Stack Overflow question embedding space](https://storage.googleapis.com/gweb-cloudblog-publish/images/4._Nomic_AI_Atlas.max-2200x2200.png)

Credit to:
- https://atlas.nomic.ai/
- https://cloud.google.com/blog/products/ai-machine-learning/how-to-use-grounding-for-your-llms-with-text-embeddings
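
To make "close together in the embedding space" concrete, here is a minimal sketch that compares toy 3-dimensional vectors using Euclidean distance and cosine similarity. The vectors and their labels are invented for illustration; real `textembedding-gecko` vectors have 768 dimensions and come from the model, not from hand-written lists.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (closer to 1 means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not produced by any real model.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.0]
invoice = [0.0, 0.1, 0.9]

print(euclidean_distance(cat, kitten), euclidean_distance(cat, invoice))
print(cosine_similarity(cat, kitten), cosine_similarity(cat, invoice))
```

With these made-up values, `cat` is nearer to `kitten` than to `invoice` under both measures, which is the property a vector search relies on.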

### Use Vertex AI Embeddings model

In [4]:
from langchain_google_vertexai import VertexAIEmbeddings

embedding = VertexAIEmbeddings(
    model_name="textembedding-gecko", project=PROJECT_ID
)

### Use BigQuery as Vector Store

In [5]:
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_community.vectorstores import BigQueryVectorSearch

In [6]:
store = BigQueryVectorSearch(
    project_id=PROJECT_ID,
    dataset_name=DATASET,
    table_name=TABLE,
    location=REGION,
    embedding=embedding,
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
)

# Document processing


## Parse the sitemap

In [7]:
import requests
from bs4 import BeautifulSoup

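def parse_sitemap(url):
    """Return every <loc> URL listed in the sitemap at `url`."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.content, "xml")
    return [element.text for element in soup.find_all("loc")]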

sites = parse_sitemap(SITEMAP)

In [8]:
sites_filtered = [url for url in sites if '.css' not in url and '.json' not in url and '.xml' not in url]
sites_filtered

['https://ai.google.dev/api/rest/v1beta/corpora.permissions/get',
 'https://ai.google.dev/examples/chat_calculator',
 'https://ai.google.dev/palm_docs/tuning_quickstart_rest',
 'https://ai.google.dev/api/python/google/ai/generativelanguage/GenerateMessageResponse',
 'https://ai.google.dev/api/python/google/ai/generativelanguage/Permission',
 'https://ai.google.dev/api/python/google/generativeai/types/AuthorError',
 'https://ai.google.dev/api/python/google/ai/generativelanguage/ListDocumentsResponse',
 'https://ai.google.dev/api/rest/v1beta/tunedModels.permissions/create',
 'https://ai.google.dev/prompts/scifi-novel-writer',
 'https://ai.google.dev/tutorials/get_started_go',
 'https://ai.google.dev/api/python/google/generativeai/types/Completion',
 'https://ai.google.dev/api/rest/v1/models/batchEmbedContents',
 'https://ai.google.dev/prompts/topic-to-questions/index_21588cec9ba17091d2dbd6a54df9fb06432054c90809b12e1a1e89b69de5f818.frame',
 'https://ai.google.dev/api/python/google/ai/gene

In [9]:
len(sites_filtered)

491

In [10]:
sites_filtered[100]

'https://ai.google.dev/prompts/unhedge'
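
The substring filter above drops any URL that contains `.css`, `.json`, or `.xml` anywhere in the string. A possible refinement, sketched here with only the standard library, is to look at the extension of the URL path instead. The helper name `is_page_url`, the excluded set, and the third example URL are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions treated as non-page assets (an illustrative choice).
EXCLUDED_SUFFIXES = {".css", ".json", ".xml"}

def is_page_url(url, excluded=EXCLUDED_SUFFIXES):
    """Return True unless the URL path ends in an excluded extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix not in excluded

example_urls = [
    "https://ai.google.dev/tutorials/get_started_go",
    "https://ai.google.dev/sitemap_0_of_1.xml",
    "https://ai.google.dev/docs/working-with.json-data",  # hypothetical URL
]
pages = [url for url in example_urls if is_page_url(url)]
print(pages)
```

Here the sitemap file is excluded, while the hypothetical page whose name merely contains `.json` is kept; the substring filter would have discarded it.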

## Load page content using LangChain's UnstructuredURLLoader

In [11]:
from langchain_community.document_loaders import UnstructuredURLLoader
loader = UnstructuredURLLoader(urls=sites_filtered)
documents = loader.load()
len(documents)

491

In [12]:
documents[100]

Document(page_content="Google AI for Developers\n\nStay organized with collections\n\nSave and categorize content based on your preferences.\n\nPrompt gallery\n\nUnhedge\n\nRewrite a sentence to be more assertive\n\n\n          Prompt type:\n          \n\nborder_all\n            \nData\n\n\n          Use case:\n          \n\nRewrite\n\n\n            Related:\n            \n\nDetailed grammar rewrite\n\nTalk to snowman\n\nGrammar rewrite\n\nOpen in Google AI Studio\n\nContext\n\nRemove hedging to make your writing more persuasive.\n\nPrompt examples\n\nInput\n                  \n                  Hedged:\n\nOutput\n                  \n                  Unhedged:\n\nI think the report might be ready by tomorrow.\n\nThe report will be ready by tomorrow.\n\nIt is obvious that the children do enjoy playing with the trains.\n\nThe children enjoy playing with trains.\n\nIt seems to me like we could use a vacation day.\n\nWe need a vacation day.\n\nIf you don't mind could you send me the addre

## Chunking

In [13]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=2000,
    chunk_overlap=200,
)

document_chunks = text_splitter.split_documents(documents)

print(f"Number documents {len(documents)}")
print(f"Number chunks {len(document_chunks)}")

Created a chunk of size 2870, which is longer than the specified 2000
Created a chunk of size 10294, which is longer than the specified 2000
Created a chunk of size 10334, which is longer than the specified 2000
Created a chunk of size 2916, which is longer than the specified 2000
Created a chunk of size 2937, which is longer than the specified 2000
Created a chunk of size 3365, which is longer than the specified 2000
Created a chunk of size 4322, which is longer than the specified 2000
Created a chunk of size 2456, which is longer than the specified 2000
Created a chunk of size 2523, which is longer than the specified 2000
Created a chunk of size 2340, which is longer than the specified 2000
Created a chunk of size 2174, which is longer than the specified 2000
Created a chunk of size 2684, which is longer than the specified 2000
Created a chunk of size 2870, which is longer than the specified 2000
Created a chunk of size 2868, which is longer than the specified 2000
Created a chunk of

Number documents 491
Number chunks 1277
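
The "Created a chunk of size ..." warnings are what you would expect when the only separator is `"\n"`: a single line longer than `chunk_size` has no newline to break on, so it is kept whole. The sketch below is a simplified model of separator-based splitting written for this explanation. It is not LangChain's implementation, it ignores `chunk_overlap`, and the function name `split_on_separator` is hypothetical.

```python
def split_on_separator(text, separator="\n", chunk_size=2000):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size cannot be split further,
    so it becomes an oversized chunk on its own.
    """
    pieces = [piece for piece in text.split(separator) if piece]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size by itself
    if current:
        chunks.append(current)
    return chunks

sample = "a" * 10 + "\n" + "b" * 50 + "\n" + "c" * 10
chunk_lengths = [len(chunk) for chunk in split_on_separator(sample, chunk_size=25)]
print(chunk_lengths)
```

In this toy input the 50-character line exceeds the 25-character limit and is emitted as one oversized chunk. Splitting on a fallback list of separators (for example with a recursive splitter) is one common way to reduce such chunks.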


In [14]:
document_chunks[100]

Document(page_content="the model, or restrict the boundaries of the responses to only what's within the\nprompt.\nPrompt: Marbles:\nColor: red\nNumber: 12\nColor: blue\nNumber: 28\nColor: yellow\nNumber: 15\nColor: green\nNumber: 17\nHow many green marbles are there? Response: There are 17 green marbles. (text-bison@001)\nExamples\nExamples are input-output pairs that you include in the prompt to give the\nmodel an example of an ideal response. Including examples in\nthe prompt is an effective strategy for customizing the response format.\nPrompt: Classify the following.\nOptions:\n- red wine\n- white wine\nText: Chardonnay\nThe answer is: white wine\nText: Cabernet\nThe answer is: red wine\nText: Moscato\nThe answer is: white wine\nText: Riesling\nThe answer is: Response: white wine (text-bison@001)\nNext steps\nNow that you have an understanding of prompt design, try writing your\nown prompts using Google AI Studio.\nFor a deeper understanding of prompt design, see the\nprompt strate

# Embeddings for documents



## Create embedding for all document chunks

In [15]:
store.add_documents(documents=document_chunks, embedding=embedding)

['1a2d0f60e2504f1586fc51d54536e821',
 '521c4bcbd475407ca74d3a468b12b8f3',
 '7fa714f2bfc4446990e78c5e112d5ba3',
 '32405af399404b729c7f95ce59a99243',
 '618ae1063f4344d7b5fc69466f1938f0',
 '9027dd692ae1481383a82e8ba0eaeecd',
 '9c240eafcd1245faa28fb2e0249a9387',
 'cf01bf2a0b6f437c900b0d8166de552f',
 '771e889a719a4c8abc14bd4f22e22a8a',
 '6a16d354d7454e3289af2c89dd3ec429',
 'e4d9849f9b3b49e89e4165e30371c83c',
 '590ad3a2d20e4acabd4116666bace800',
 '6ed907af34a047528189ade39782070e',
 'b6c9864e13104e9e8091132efb4c7557',
 '07f2e11baf404ba28f3625c53a2a2386',
 '67ea9896fe284e3ca8b22712e29d5479',
 '1a7fe8eb051c4030b4f42866537ec753',
 '12e9a424765949b4b446e949a6187542',
 '14a86648adcd408c8de86ee87821f14b',
 'fd43716099ef4f3aae026a2181f56df0',
 '76023a4b739a479f860326f4fdbf4bc9',
 '08a03216bafe49c099f73dcb1cc05a57',
 '25c65bf9dd714ecfb67b70449fd05296',
 '399e0eb4465740ec91da36cd74850196',
 '619cefb05ddd4a5eaff97c9ce8f8009a',
 '96133f80dc484b0abc9144b22c397d69',
 'a6565ec3188c4fa08da971c96b576c35',
 

In [16]:
question = "What is the deprecation date of PaLM API?"

In [17]:
store.similarity_search(question, k=8)

[Document(page_content="The PaLM API is deprecated for use with Google AI services and tools (but \nPaLM API deprecation guide.\nGoogle AI for Developers\nProducts\nPaLM API deprecation\nStay organized with collections\nSave and categorize content based on your preferences.\nAs of February 15, 2024, the PaLM API for use with Google AI services and\ntooling is deprecated. In 6 months, the PaLM API will be decommissioned,\nmeaning that users won't be able to use a PaLM model in a prompt, tune a new\nPaLM model, or run inference on PaLM-tuned models.\nTo interact with models using Google AI services and tooling, use the\nGemini API, which is now available in a stable\nversion. For more information on migrating from PaLM to Gemini, see the\nmigration guide.\nNote that this deprecation notice does not pertain to PaLM API support in\nVertex AI.\nExcept as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed un
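
Conceptually, a similarity search embeds the question, measures its distance to every stored vector, and returns the `k` closest rows. The sketch below shows that idea as a brute-force search over an in-memory list using Euclidean distance. It is only an illustration: the 2-dimensional vectors and row texts are invented, `top_k` is a hypothetical helper, and how BigQuery actually executes the search is not shown here.

```python
import math

def top_k(query_vector, rows, k=2):
    """Return the k (text, distance) pairs closest to query_vector."""
    scored = [(text, math.dist(query_vector, vector)) for text, vector in rows]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical stored rows: (chunk text, embedding vector).
rows = [
    ("PaLM API deprecation notice", [0.9, 0.1]),
    ("Gemini quickstart for Go", [0.2, 0.8]),
    ("PaLM to Gemini migration guide", [0.7, 0.3]),
]
query = [1.0, 0.0]  # stands in for the embedded question

results = top_k(query, rows, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```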

# Retrieve answers

## Get answer without context

In [20]:
from langchain_google_vertexai import VertexAI

llm = VertexAI(model_name="gemini-1.0-pro", temperature=0)
llm.invoke("What is the deprecation date of PaLM API?")

'There is no deprecation date for the PaLM API.'

## Get answer with context from BigQuery (Vector Store)
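
The chain in this section "stuffs" the retrieved chunks into a single prompt: each document is rendered with a per-document template, the rendered pieces are joined, and the result fills the `{context}` slot of the question-answering prompt. The plain-Python sketch below mimics that flow with `str.format`. It assumes a blank-line separator between documents, uses a shortened prompt, and the `stuff_documents` helper and `example.com` source URLs are hypothetical.

```python
DOCUMENT_TEMPLATE = "Context:\ncontent:{page_content}\nsource:{source}"
QA_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question.\n\n"
    "{context}\n\n"
    "**Question**: {question}"
)

def stuff_documents(docs, question, separator="\n\n"):
    """Render each document, join the pieces, and fill the QA prompt."""
    formatted = [
        DOCUMENT_TEMPLATE.format(page_content=doc["page_content"], source=doc["source"])
        for doc in docs
    ]
    return QA_TEMPLATE.format(context=separator.join(formatted), question=question)

# Hypothetical retrieved documents; the source URLs are placeholders.
docs = [
    {
        "page_content": "As of February 15, 2024, the PaLM API for use with "
                        "Google AI services and tooling is deprecated.",
        "source": "https://example.com/palm-deprecation",
    },
    {
        "page_content": "For more information on migrating from PaLM to Gemini, "
                        "see the migration guide.",
        "source": "https://example.com/migration-guide",
    },
]

prompt = stuff_documents(docs, "What is the deprecation date of PaLM API?")
print(prompt)
```

Because every retrieved chunk lands in one prompt, the number of chunks the retriever returns directly affects prompt length.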

In [22]:
from langchain.prompts import PromptTemplate
from langchain.chains.llm import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain_google_vertexai import VertexAI
from langchain.chains import RetrievalQA


prompt_template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.

{context}

Please follow the following rules:
1. If the question is to request links, please only return the source links with no answer.
2. Don't rephrase the question. If you don't know the answer, don't try to make up an answer. Just say **I can't find the final answer but you may want to check the following links** and add the source links as a list.
3. If you find the answer, write the answer in a concise way with many details and add the list of sources that are **directly** used to derive the answer. Exclude the sources that are irrelevant to the final answer.
4. The answer should be in the following format. Preserve the line breaks and don't truncate the links:

**Question**: {question}
\n**Answer**:
\n**Source**:
"""


QA_CHAIN_PROMPT = PromptTemplate.from_template(prompt_template)
llm_chain = LLMChain(llm=VertexAI(model_name="gemini-1.0-pro", temperature=0), prompt=QA_CHAIN_PROMPT, callbacks=None, verbose=True)
document_prompt = PromptTemplate(
    input_variables=["page_content", "source"],
    template="Context:\ncontent:{page_content}\nsource:{source}",
)
combine_documents_chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_variable_name="context",
    document_prompt=document_prompt,
    callbacks=None,
)
qa = RetrievalQA(
    combine_documents_chain=combine_documents_chain,
    callbacks=None,
    retriever=store.as_retriever(),
    return_source_documents=True,
)
res = qa.invoke("What is the deprecation date of PaLM API?")

print(res['result'])

> Entering new LLMChain chain...
Prompt after formatting:
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.

Context:
content:The PaLM API is deprecated for use with Google AI services and tools (but 
PaLM API deprecation guide.
Google AI for Developers
Products
PaLM API deprecation
Stay organized with collections
Save and categorize content based on your preferences.
As of February 15, 2024, the PaLM API for use with Google AI services and
tooling is deprecated. In 6 months, the PaLM API will be decommissioned,
meaning that users won't be able to use a PaLM model in a prompt, tune a new
PaLM model, or run inference on PaLM-tuned models.
To interact with models using Google AI services and tooling, use the
Gemini API, which is now available in a stable
version. For more information on migrating from PaLM to Gemini, see the
migration guide.
Note that this deprecation notice does not pertain to 