<a href="https://colab.research.google.com/github/akajammythakkar/rag-with-gemini/blob/main/RAG_with_Gemini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install libraries (the import below uses `pypdf`, so install `pypdf` rather than `pypdf2`)
!pip install langchain chromadb pypdf google-generativeai sentence_transformers



#### Import necessary libraries

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from pypdf import PdfReader
import google.generativeai as genai
from pprint import pprint
from google.colab import userdata

In [None]:
# Create a PdfReader object to read the PDF file
reader = PdfReader("/content/Alphabet annual report.pdf")

# Extract text from each page in the PDF and strip any leading/trailing whitespace
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter out any empty strings from the extracted texts
pdf_texts = [text for text in pdf_texts if text]

# Pretty-print the text from the first page of the PDF
pprint(pdf_texts[0])

('UNITED STATES\n'
 'SECURITIES AND EXCHANGE COMMISSION\n'
 'Washington, D.C. 20549\n'
 '___________________________________________\n'
 'FORM 10-K\n'
 '___________________________________________\n'
 '(Mark One)\n'
 '☒ ANNUAL REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE '
 'ACT OF 1934\n'
 'For the fiscal year ended December 31, 2022\n'
 'OR\n'
 '☐ TRANSITION REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES '
 'EXCHANGE ACT OF 1934\n'
 'For the transition period from              to             .\n'
 'Commission file number: 001-37580\n'
 '___________________________________________\n'
 'Alphabet Inc.\n'
 '(Exact name of registrant as specified in its charter)\n'
 '___________________________________________\n'
 'Delaware 61-1767919\n'
 '(State or other jurisdiction of incorporation or organization) (I.R.S. '
 'Employer Identification No.)\n'
 '1600 Amphitheatre Parkway\n'
 'Mountain V iew, CA 94043\n'
 '(Address of principal executive offices, including

In [None]:
# Create a RecursiveCharacterTextSplitter object with specified separators, chunk size, and chunk overlap
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],  # List of separators for splitting the text
    chunk_size=1000,  # Maximum size of each text chunk
    chunk_overlap=0  # Number of characters to overlap between chunks
)

# Join the extracted PDF texts with '\n\n' and split the combined text into chunks
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

# Pretty-print the text of the 11th chunk (index 10) of the split text
pprint(character_split_texts[10])

# Print the total number of chunks created
print(f"\nTotal chunks: {len(character_split_texts)}")

('•our expectation that our monetization trends will fluctuate, which could '
 'affect our revenues and margins;\n'
 '•fluctuations in our revenues, as well as the change in paid clicks and '
 'cost-per-click and the change in\n'
 'impressions and cost-per-impression, and various factors contributing to '
 'such fluctuations;\n'
 '•our expectation that we will continue to periodically review, refine, and '
 'update our methodologies for\n'
 'monitoring, gathering, and counting the number of paid clicks and '
 'impressions;\n'
 '•our expectation that our results will be affected by our performance in '
 'international markets as users in\n'
 'developing economies increasingly come online;\n'
 '•our expectation that our foreign exchange risk management program will not '
 'fully offset our net exposure to\n'
 'fluctuations in foreign currency exchange rates;\n'
 '•the expected variability of gains and losses related to hedging activities '
 'under our foreign exchange risk\n'
 'managemen
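
The recursive splitting used above can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: `recursive_split` is a hypothetical helper that tries each separator in order, packs pieces greedily up to `chunk_size`, and only falls back to a finer separator for pieces that are still too long. It omits overlap handling and the final re-merging of small fragments that the real splitter performs.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified sketch of recursive character splitting (hypothetical helper)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, finer, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta.\n\nIota."
chunks = recursive_split(sample, ["\n\n", "\n", ". ", " ", ""], chunk_size=20)
print(chunks)
# ['Alpha beta gamma.', 'Delta epsilon zeta', 'eta theta.', 'Iota.']
```

Paragraph breaks are preferred as cut points, and the space separator is only used for the one paragraph that exceeds the limit.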

In [None]:
# Create a SentenceTransformersTokenTextSplitter object with specified chunk overlap and tokens per chunk
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

# Initialize an empty list to hold the token-split texts
token_split_texts = []

# Loop through each chunk in the character-split texts
for text in character_split_texts:
    # Split the text into smaller chunks using the token splitter and add them to the token_split_texts list
    token_split_texts += token_splitter.split_text(text)

# Pretty-print the 11th chunk (index 10) of the token-split text
pprint(token_split_texts[10])

# Print the total number of token-split chunks created
print(f"\nTotal chunks: {len(token_split_texts)}")



('table of contents alphabet inc. note about forward - looking statements this '
 'annual report on form 10 - k contains forward - looking statements within '
 'the meaning of the private securities litigation reform act of 1995. these '
 'include, among other things, statements regarding : • the growth of our '
 'business and revenues and our expectations about the factors that influence '
 'our success and trends in our business ; • fluctuations in our revenues and '
 'margins and various factors contributing to such fluctuations ; • our '
 'expectation that the continuing shift from an offline to online world will '
 'continue to benefit our business ; • our expectation that the portion of our '
 'revenues that we derive from non - advertising revenues will continue to '
 'increase and may affect our margins ; • our expectation that our traffic '
 'acquisition costs ( tac ) and the associated tac rate will fluctuate, which '
 'could affect our overall margins ;')

Total chunks: 521
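
The token-based pass caps each chunk at a fixed number of tokens so that it fits the embedding model's input window. A minimal sketch of that windowing idea follows. It assumes whitespace-separated words as a stand-in for the model's real subword tokenizer, and `split_by_token_window` is a hypothetical helper, not part of LangChain.

```python
def split_by_token_window(text, tokens_per_chunk=256, chunk_overlap=0):
    """Split text into windows of at most tokens_per_chunk tokens (hypothetical helper)."""
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be in [0, tokens_per_chunk)")

    tokens = text.split()  # assumption: whitespace words stand in for subword tokens
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + tokens_per_chunk]))
        if start + tokens_per_chunk >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


words = " ".join(f"w{i}" for i in range(10))
print(split_by_token_window(words, tokens_per_chunk=4))
# ['w0 w1 w2 w3', 'w4 w5 w6 w7', 'w8 w9']
print(split_by_token_window(words, tokens_per_chunk=4, chunk_overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The lowercased text and the spaces around punctuation in the output above are consistent with text being decoded back from an uncased tokenizer, which this word-level sketch does not reproduce.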


In [None]:
# Create a SentenceTransformerEmbeddingFunction object
embedding_function = SentenceTransformerEmbeddingFunction()

# Generate embeddings for the 11th chunk (index 10) of the token-split text and print the result
print(embedding_function([token_split_texts[10]]))

[[-0.038972221314907074, -0.039320990443229675, 0.009861109778285027, -0.05084536597132683, 0.023311369121074677, 0.07700969278812408, 0.04277116805315018, 0.03615760803222656, 0.06379776448011398, 0.026388373225927353, 0.019688792526721954, 0.1407993733882904, -0.017853906378149986, -0.06877782195806503, -0.004027882125228643, -0.018674977123737335, 0.03352261707186699, 0.04201819747686386, -0.028582502156496048, 0.01630794070661068, -0.020787663757801056, 0.0520034059882164, 0.025474274531006813, 0.03172564506530762, -0.03456216678023338, -0.025852885097265244, -0.0893649160861969, 0.014263750053942204, -0.048373088240623474, -0.06445921212434769, -0.08078080415725708, 0.043986380100250244, 0.038726773113012314, 0.016588792204856873, 0.04580427333712578, -0.06902538239955902, -0.01974904164671898, -0.012786035425961018, 0.04005995765328407, 0.008965833112597466, 0.022647129371762276, -0.11218259483575821, -0.05282265320420265, 0.01696201041340828, 0.01976599544286728, 0.0072877081111
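
Each chunk is now a vector of floats, and retrieval works by comparing such vectors. Cosine similarity is one common comparison, sketched below with made-up three-dimensional vectors. It is used here purely for illustration and is not necessarily the metric the collection in this notebook uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values, not real embeddings)
query_vec = [1.0, 0.0, 1.0]
similar_vec = [2.0, 0.0, 2.0]     # same direction, different length
unrelated_vec = [0.0, 3.0, 0.0]   # orthogonal to the query

print(cosine_similarity(query_vec, similar_vec))
print(cosine_similarity(query_vec, unrelated_vec))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.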

In [None]:
# Create a ChromaDB client
chroma_client = chromadb.Client()

# Create a new collection in ChromaDB named "alphabet_annual_report" with the specified embedding function
chroma_collection = chroma_client.create_collection("alphabet_annual_report", embedding_function=embedding_function)

# Generate a list of string IDs corresponding to the number of token-split text chunks
ids = [str(i) for i in range(len(token_split_texts))]

# Add the token-split text chunks to the ChromaDB collection using the generated IDs
chroma_collection.add(ids=ids, documents=token_split_texts)

# Count and return the number of documents in the ChromaDB collection
chroma_collection.count()

521
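
Conceptually, the collection stores one vector per document ID and answers a query by returning the nearest stored vectors. The toy class below sketches that idea in memory. `ToyCollection`, `toy_embed`, and the four-word vocabulary are all hypothetical stand-ins, and the brute-force squared Euclidean search is an assumption for illustration, not a description of how ChromaDB is implemented.

```python
VOCAB = ["cloud", "ads", "revenue", "risk"]  # hypothetical toy vocabulary


def toy_embed(text):
    """Count occurrences of each vocabulary word (stand-in for a real embedding model)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyCollection:
    """In-memory sketch of add / count / query."""

    def __init__(self, embed):
        self._embed = embed
        self._rows = {}  # id -> (document, vector)

    def add(self, ids, documents):
        if len(ids) != len(documents):
            raise ValueError("ids and documents must have the same length")
        for doc_id, doc in zip(ids, documents):
            if doc_id in self._rows:
                raise ValueError(f"duplicate id: {doc_id}")
            self._rows[doc_id] = (doc, self._embed(doc))

    def count(self):
        return len(self._rows)

    def query(self, query_text, n_results=3):
        query_vec = self._embed(query_text)
        scored = sorted(
            (squared_l2(query_vec, vec), doc_id, doc)
            for doc_id, (doc, vec) in self._rows.items()
        )
        return scored[:n_results]  # smallest distance first


docs = ["cloud revenue grew", "ads revenue fell", "risk factors"]
collection = ToyCollection(toy_embed)
collection.add(ids=[str(i) for i in range(len(docs))], documents=docs)
for distance, doc_id, doc in collection.query("cloud revenue", n_results=2):
    print(doc_id, distance, doc)
```

A real vector store replaces the linear scan with an approximate nearest-neighbor index, which is what makes lookups over many chunks practical.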

In [None]:
# Step 1: Retrieve the API key from user data
GEMINI_API_KEY = userdata.get('API_KEY')  # Get API Key from Secrets

# Step 2: Configure the GenAI client with the retrieved API key
genai.configure(api_key=GEMINI_API_KEY)

# Step 3: Define the generation configuration for the model
generation_config = {
    "temperature": 0.9,       # Controls the randomness of the output (higher values mean more random)
    "top_p": 1,               # Controls nucleus sampling (1 means no filtering)
    "top_k": 1,               # Number of highest-probability tokens to sample from; 1 keeps only the top token, so decoding is effectively greedy and temperature has little effect
    "max_output_tokens": 2048 # Maximum number of tokens in the output
}

# Step 4: Initialize the generative model with the specified name and configuration
model = genai.GenerativeModel(
    model_name="gemini-1.0-pro",       # Name of the model
    generation_config=generation_config  # Configuration for text generation
)
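
To see how `temperature`, `top_k`, and `top_p` interact, the sketch below applies them to a made-up set of next-token scores. `filtered_distribution` is a hypothetical helper. The filtering order (temperature, then top-k, then top-p) is an assumption for illustration, and the internals of the Gemini service are not visible from this notebook.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return {token: probability} after temperature scaling and top-k / top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    # top-k: keep only the k most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for prob, _ in kept)
    return {tok: prob / norm for prob, tok in kept}


toy_logits = {"cloud": 2.0, "ads": 1.0, "search": 0.5}  # made-up scores

# Same settings as the notebook: only one candidate survives top_k=1.
print(filtered_distribution(toy_logits, temperature=0.9, top_k=1, top_p=1))
# {'cloud': 1.0}

# Without top-k, all three tokens remain candidates.
print(filtered_distribution(toy_logits, temperature=0.9))
```

With `top_k=1` the surviving distribution has a single token, so raising the temperature would not add variety. Increasing `top_k` is what lets the temperature setting matter.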

In [13]:
def rag(query, retrieved_documents):
    # Combine the retrieved documents into a single string, separated by double newlines
    information = "\n\n".join(retrieved_documents)

    # Create the message for the generative model, providing context and the user's query
    messages = [
        "You are an expert financial research assistant specializing in analyzing annual reports. "
        "Your task is to help users by answering their questions based on the provided information from an annual report. "
        "You will be given a specific question along with relevant excerpts from the annual report. "
        "Please provide a clear and accurate answer using only the given information.\n\n"
        f"Question: {query}\n\nInformation:\n{information}"
    ]

    # Generate a response using the configured generative model
    response = model.generate_content(messages)

    # Return the text part of the first candidate's response
    return response.candidates[0].content.parts[0].text
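
The prompt assembly inside `rag` can be tested on its own, without calling the model. The sketch below is one possible variant, assuming a hypothetical `build_prompt` helper that numbers each excerpt so the boundaries between retrieved chunks stay visible to the model. The instruction wording is shortened and is not the notebook's exact text.

```python
def build_prompt(query, retrieved_documents):
    """Assemble instructions, question, and numbered excerpts into one prompt string."""
    instructions = (
        "You are an expert financial research assistant. "
        "Answer the question using only the excerpts provided."
    )
    excerpts = "\n\n".join(
        f"[Excerpt {i}]\n{doc.strip()}"
        for i, doc in enumerate(retrieved_documents, start=1)
    )
    return f"{instructions}\n\nQuestion: {query.strip()}\n\nInformation:\n{excerpts}"


prompt = build_prompt(
    "What is TAC?",
    [" traffic acquisition costs ( tac ) ... ", "the associated tac rate ..."],
)
print(prompt)
```

Keeping this step as a pure function makes it easy to inspect exactly what is sent to the model before spending an API call.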

In [21]:
# Step 1: Define the query string
query = "What are some technologies..?"

# Step 2: Query the ChromaDB collection with the specified query string, retrieving the top 3 results
results = chroma_collection.query(query_texts=[query], n_results=3)

# Step 3: Extract the list of retrieved documents from the query results
retrieved_documents = results['documents'][0]

# Step 4: Loop through each retrieved document, pretty-print it, and add a blank line for readability
for document in retrieved_documents:
    pprint(document)
    print('\n')


('as a result of these factors, the value of our investments could decline, '
 'which could harm our financial condition and operating results. risks '
 'related to our industry people access the internet through a variety of '
 'platforms and devices that continue to evolve with the advancement of '
 'technology and user preferences. if manufacturers and users do not widely '
 'adopt versions of our products and services developed for these interfaces, '
 'our business could be harmed. people access the internet through a growing '
 'variety of devices such as desktop computers, mobile phones, smartphones, '
 'laptops and tablets, video game consoles, voice - activated speakers, '
 'wearables, automobiles, and television - streaming devices. our products and '
 'services may be less popular on some interfaces. each manufacturer or '
 'distributor may establish unique technical standards for its devices, and '
 'our products and services may not be')


('available or may only be availa
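
The query result is a dictionary of parallel lists with one inner list per query text, which is why the code above indexes `results['documents'][0]`. The sketch below pairs each document with its ID and distance. It uses a hand-written dictionary with made-up values in place of a real query result, and `rows_for_query` is a hypothetical helper.

```python
def rows_for_query(results, query_index=0):
    """Return (id, distance, document) tuples for one query in a batched result."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    all_distances = results.get("distances")
    # Distances may be absent if they were not requested; fill with None in that case.
    distances = all_distances[query_index] if all_distances else [None] * len(ids)
    return list(zip(ids, distances, documents))


# Made-up values shaped like a result for a single query with n_results=3
mock_results = {
    "ids": [["12", "40", "7"]],
    "documents": [["first chunk", "second chunk", "third chunk"]],
    "distances": [[0.8, 0.9, 1.1]],
}

for doc_id, distance, document in rows_for_query(mock_results):
    print(doc_id, distance, document)
```

Looking at the distances alongside the text helps spot cases where even the best match is far from the query, which is often a sign that the question is too vague for retrieval to work well.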

In [22]:
# Generate the response using the RAG function with the provided query and retrieved documents
output = rag(query=query, retrieved_documents=retrieved_documents)

# Print the generated response
print(output)

Some technologies that the company is using to solve big problems include:

- Improving transportation and health technology
- Exploring solutions to address climate change
