<a href="https://colab.research.google.com/github/akajammythakkar/google-io-extended-brc/blob/main/RAG_with_Gemini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# install libraries
!pip install langchain chromadb pypdf google-generativeai sentence_transformers


#### Import necessary libraries

In [1]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from pypdf import PdfReader
import google.generativeai as genai
from pprint import pprint
from google.colab import userdata

In [2]:
# Create a PdfReader object to read the PDF file
reader = PdfReader("/content/Alphabet annual report.pdf")

# Extract text from each page in the PDF and strip any leading/trailing whitespace
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter out any empty strings from the extracted texts
pdf_texts = [text for text in pdf_texts if text]

# Pretty-print the text from the first page of the PDF
pprint(pdf_texts[0])

('To our investors,\n'
 '2022 was a year full of change and uncertainty around \n'
 'the world. In February, when war broke out in Ukraine, our teams worked '
 'around-the-clock to make sure our products were helpful to people who needed '
 'them, from providing trustworthy information on Search to disrupting '
 'cyberattacks to partnering with the government to deploy air raid alerts. In '
 'March, I traveled to Warsaw, Poland, where I met Googlers hosting families '
 'who sought refuge, talked with entrepreneurs using our office spaces, and '
 'saw how our products like Google Translate were helping Ukrainians find a '
 'bit of hope and connection.\n'
 'By late spring, the tech industry was adjusting to a \n'
 'more challenging macroeconomic environment, and as a company we embarked on '
 'efforts to sharpen our focus and make sure our efforts are aligned with our '
 'highest priorities. Near the end of the year, AI reached an inflection '
 'point, made possible by our foundational b

In [3]:
# Create a RecursiveCharacterTextSplitter object with specified separators, chunk size, and chunk overlap
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],  # List of separators for splitting the text
    chunk_size=1000,  # Maximum size of each text chunk, in characters
    chunk_overlap=0  # Number of characters to overlap between chunks
)

# Join the extracted PDF texts with '\n\n' and split the combined text into chunks
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

# Pretty-print the text of the 11th chunk (index 10) of the split text
pprint(character_split_texts[10])

# Print the total number of chunks created
print(f"\nTotal chunks: {len(character_split_texts)}")

('5\n'
 'Year in Review 2022\n'
 'Multisearch\n'
 'With multisearch, people can now \n'
 'search with both images and text  \n'
 'at the same time in Google Lens.')

Total chunks: 489
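
The splitter above tries each separator in order and only falls back to a finer one when a piece is still larger than `chunk_size`. The following is a minimal pure-Python sketch of that idea. It is not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and it drops the separator at chunk boundaries, which the library may handle differently.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified illustration of recursive splitting (hypothetical helper)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # Use the first separator that occurs in the text; "" means "split per character".
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        # No separators left: fall back to fixed-width cuts.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    finer_separators = separators[i + 1:]
    pieces = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging small pieces while they still fit in one chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, finer_separators, chunk_size))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


sample = "Alpha beta.\n\nGamma delta epsilon. Zeta eta theta.\n\nIota."
demo_chunks = recursive_split(sample, ["\n\n", "\n", ". ", " ", ""], chunk_size=20)
print(demo_chunks)
```

In the toy input, the middle paragraph is longer than 20 characters, so it is the only one that gets split again on `". "`; the other paragraphs are kept whole.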


In [4]:
# Create a SentenceTransformersTokenTextSplitter object with specified chunk overlap and tokens per chunk
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

# Initialize an empty list to hold the token-split texts
token_split_texts = []

# Loop through each chunk in the character-split texts
for text in character_split_texts:
    # Split the text into smaller chunks using the token splitter and add them to the token_split_texts list
    token_split_texts += token_splitter.split_text(text)

# Pretty-print the 11th chunk (index 10) of the token-split text
pprint(token_split_texts[10])

# Print the total number of token-split chunks created
print(f"\nTotal chunks: {len(token_split_texts)}")


('5 year in review 2022 multisearch with multisearch, people can now search '
 'with both images and text at the same time in google lens.')

Total chunks: 511
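
The chunk count grows from 489 to 511 because any character chunk longer than 256 tokens is cut into several token windows. The lowercased output is presumably a side effect of round-tripping the text through the model's tokenizer. The sketch below illustrates the windowing only. It assumes whitespace-separated words as a stand-in for real model tokens, and `split_by_token_window` is a hypothetical helper, not part of LangChain.

```python
def split_by_token_window(text, tokens_per_chunk, chunk_overlap=0):
    """Cut text into windows of at most `tokens_per_chunk` tokens.

    Assumption: whitespace-separated words stand in for model tokens.
    """
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")

    tokens = text.split()
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + tokens_per_chunk]
        chunks.append(" ".join(window))
        # Stop once a window has reached the end of the token list.
        if start + tokens_per_chunk >= len(tokens):
            break
    return chunks


print(split_by_token_window("one two three four five six seven", tokens_per_chunk=3))
```

A text that already fits in one window comes back as a single chunk, which is why most of the 489 character chunks pass through unchanged.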


In [5]:
# Create a SentenceTransformerEmbeddingFunction object
embedding_function = SentenceTransformerEmbeddingFunction()

# Generate embeddings for the 11th chunk (index 10) of the token-split text and print the result
print(embedding_function([token_split_texts[10]]))

[[-0.032990410923957825, -0.039943940937519073, 0.027683231979608536, -0.020394684746861458, 0.01031690463423729, -0.013435314409434795, -0.11188779026269913, 0.0020084530115127563, -0.008219867013394833, -0.05463451147079468, 0.05009116232395172, 0.0608343631029129, 0.05028848722577095, 0.040277741849422455, 0.00026212268858216703, -0.0145362988114357, -0.05434255301952362, 0.0022567829582840204, -0.03845824673771858, -0.05322171375155449, 0.05845063552260399, -0.04633062705397606, 0.11275570094585419, -0.0909026563167572, -0.00557461753487587, 0.024509740993380547, -0.14071717858314514, -0.09735379368066788, -0.007698466069996357, 0.018437139689922333, -5.319666161085479e-05, 0.034669168293476105, -0.0063465856947004795, 0.11858687549829483, -0.08441866934299469, 0.05390059947967529, -0.06958919018507004, 0.05044799670577049, -0.010030854493379593, 0.016402781009674072, -0.04245797544717789, -0.036958444863557816, -0.03685654327273369, -0.05933716520667076, 0.031969502568244934, 0.02
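
Each chunk is now represented by a vector like the one above, and retrieval works by comparing the query's vector with the stored ones. Cosine similarity is one common way to compare two embeddings; the sketch below shows it in plain Python. Which distance metric the Chroma collection actually uses depends on its configuration, so treat this as an illustration of the idea rather than a description of Chroma's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.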

In [6]:
# Create a ChromaDB client
chroma_client = chromadb.Client()

# Create a new collection in ChromaDB named "alphabet_annual_report" with the specified embedding function
chroma_collection = chroma_client.create_collection("alphabet_annual_report", embedding_function=embedding_function)

# Generate a list of string IDs corresponding to the number of token-split text chunks
ids = [str(i) for i in range(len(token_split_texts))]

# Add the token-split text chunks to the ChromaDB collection using the generated IDs
chroma_collection.add(ids=ids, documents=token_split_texts)

# Count and return the number of documents in the ChromaDB collection
chroma_collection.count()

511

In [7]:
# Step 1: Retrieve the API key from Colab secrets (stored under the name 'API_KEY')
GEMINI_API_KEY = userdata.get('API_KEY')

# Step 2: Configure the GenAI client with the retrieved API key
genai.configure(api_key=GEMINI_API_KEY)

# Step 3: Define the generation configuration for the model
generation_config = {
    "temperature": 0.9,       # Controls the randomness of the output (higher values mean more random)
    "top_p": 1,               # Controls nucleus sampling (1 means no filtering)
    "top_k": 1,               # Controls the number of highest probability tokens to consider (1 means only the highest)
    "max_output_tokens": 2048 # Maximum number of tokens in the output
}

# Step 4: Initialize the generative model with the specified name and configuration
model = genai.GenerativeModel(
    model_name="gemini-1.0-pro",       # Name of the model
    generation_config=generation_config  # Configuration for text generation
)
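
The three sampling settings interact: with `top_k` set to 1, only the single most probable token survives the filter, so `temperature` has little left to randomize. The sketch below is a toy sampler over a hand-written logit table that shows this effect. The order of the filtering steps and the names (`sample_token`, `toy_logits`) are assumptions made for illustration; the Gemini service may apply these settings differently.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Pick one token from a {token: logit} dict (illustrative only)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature rescales the logits before the softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top_k keeps only the k most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top_p keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in kept]
    probs = [prob for _, prob in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]


toy_logits = {"ads": 2.0, "cloud": 1.5, "hardware": 0.5}  # made-up scores
picks = {
    sample_token(toy_logits, temperature=0.9, top_k=1, rng=random.Random(seed))
    for seed in range(20)
}
print(picks)
```

If more varied answers are wanted, raising `top_k` is likely to matter more than raising `temperature` in this configuration.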

In [8]:
def rag(query, retrieved_documents):
    # Combine the retrieved documents into a single string, separated by double newlines
    information = "\n\n".join(retrieved_documents)

    # Create the message for the generative model, providing context and the user's query.
    # Adjacent string literals are concatenated, so each part ends with explicit whitespace.
    messages = [
        "You will be shown the user's question and the relevant information from the annual report. "
        "Answer the user's question using only this information.\n\n"
        f"Question: {query}\n\nInformation: {information}"
    ]

    # Generate a response using the configured generative model
    response = model.generate_content(messages)

    # Return the text part of the first candidate's response
    return response.candidates[0].content.parts[0].text
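
The prompt assembly inside `rag` can be tested without calling the model if it is factored into its own function. The sketch below is one possible way to do that. `build_prompt` is a hypothetical helper, and numbering the excerpts is an added convention for readability, not something the original cell does.

```python
def build_prompt(query, retrieved_documents):
    """Assemble the instruction, question and retrieved excerpts into one prompt string."""
    if not retrieved_documents:
        raise ValueError("no documents retrieved; nothing to ground the answer on")

    # Number each excerpt so the boundaries between chunks stay visible to the model.
    excerpts = "\n\n".join(
        f"[{i}] {doc.strip()}" for i, doc in enumerate(retrieved_documents, start=1)
    )
    return (
        "You will be shown the user's question and relevant excerpts from the annual report. "
        "Answer the user's question using only these excerpts.\n\n"
        f"Question: {query}\n\n"
        f"Excerpts:\n{excerpts}"
    )


print(build_prompt("What drives revenue?", ["advertising ", " cloud services"]))
```

Printing the prompt before sending it is a cheap way to catch missing spaces or newlines between the instruction and the question.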

In [14]:
# Step 1: Define the query string
query = "What are some major revenues coming from?"

# Step 2: Query the ChromaDB collection with the specified query string, retrieving the top 3 results
results = chroma_collection.query(query_texts=[query], n_results=3)

# Step 3: Extract the list of retrieved documents from the query results
retrieved_documents = results['documents'][0]

# Step 4: Pretty-print each retrieved document, followed by a blank line for readability
for document in retrieved_documents:
    pprint(document)
    print('\n')


('zour employees are critical to our success and we expect to continue '
 'investing in them. our employees are among our best assets and are critical '
 'for our continued success. we expect to continue hiring talented employees '
 'around the globe and to provide competitive compensation programs. for '
 'additional information see culture and workforce in part i, item 1 “ '
 'business. ” revenues and monetization metrics we generate revenues by '
 'delivering relevant, cost - effective online advertising ; cloud - based '
 'solutions that provide enterprise customers of all sizes with infrastructure '
 'and platform services as well as communication and collaboration tools ; '
 'sales of other products and services, such as apps and in - app purchases, '
 'and hardware ; and fees received for subscription - based products. for '
 'details on how we recognize revenue, see note 1 of the notes to consolidated '
 'financial statements included in item 8 of this annual report on form 10 
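
`query` accepts a list of query texts and returns one result list per query, which is why the cell above indexes `results['documents'][0]`. The sketch below mimics that nested shape with a tiny in-memory store and brute-force nearest-neighbour search. The store layout, the toy 2-dimensional vectors, and the use of squared Euclidean distance are assumptions for illustration; Chroma uses an approximate index and a configurable metric.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def query_store(store, query_vectors, n_results):
    """Return the n closest documents for each query vector.

    `store` is a list of (id, vector, document) tuples (hypothetical layout).
    The result holds one inner list per query, nearest first.
    """
    results = {"ids": [], "documents": [], "distances": []}
    for query_vector in query_vectors:
        scored = sorted(
            ((squared_distance(query_vector, vec), doc_id, doc) for doc_id, vec, doc in store),
            key=lambda item: item[0],
        )[:n_results]
        results["distances"].append([dist for dist, _, _ in scored])
        results["ids"].append([doc_id for _, doc_id, _ in scored])
        results["documents"].append([doc for _, _, doc in scored])
    return results


toy_store = [
    ("0", [1.0, 0.1], "advertising revenue"),
    ("1", [0.0, 1.0], "employee headcount"),
    ("2", [0.9, 0.3], "cloud revenue"),
]
toy_results = query_store(toy_store, query_vectors=[[1.0, 0.0]], n_results=2)
print(toy_results["documents"][0])
```

With a single query, the outer list has exactly one element, so `[0]` selects the ranked documents for that query.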

In [15]:
# Generate the response using the RAG function with the provided query and retrieved documents
output = rag(query=query, retrieved_documents=retrieved_documents)

# Print the generated response
print(output)

**Major revenue sources:**

* **Google advertising:** Includes revenues from Google Search, YouTube Ads, and Google Network, totaling $224.473 billion in 2022.

* **Google services:** Includes revenues from other products and services such as apps, in-app purchases, hardware, and subscription-based products, totaling $253.528 billion in 2022.

* **Google Cloud:** Provides cloud-based solutions for enterprise customers, generating $26.280 billion in revenue in 2022.
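
Because the prompt asks the model to answer using only the retrieved information, one rough sanity check is to confirm that every figure in the answer also appears in the retrieved chunks. The sketch below does this with a plain substring match. `unsupported_figures` is a hypothetical helper, and the check is deliberately crude: it ignores units and number formatting, so a flagged figure is a prompt to look closer rather than proof of an error.

```python
import re


def unsupported_figures(answer, documents):
    """List numbers that appear in the answer but in none of the documents."""
    # Drop thousands separators so "26,280" and "26280" compare equal.
    context = " ".join(documents).replace(",", "")
    figures = re.findall(r"\d+(?:\.\d+)?", answer.replace(",", ""))
    return [figure for figure in figures if figure not in context]


# Made-up strings for illustration, not taken from the report.
toy_documents = ["Google Cloud revenue was $26.280 billion in 2022."]
toy_answer = "Cloud made $26.280 billion; ads made $999.9 billion in 2022."
print(unsupported_figures(toy_answer, toy_documents))
```

Running it as `unsupported_figures(output, retrieved_documents)` after the cell above would list any figures in the generated answer that cannot be found verbatim in the three retrieved chunks.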
