In [14]:
# Clean the notebook (quick fix):
# 1. Open your notebook in Jupyter or Colab.
# 2. Before downloading, run the command below; it writes a copy with cleared cell outputs to cleaned_notebook.ipynb.
# 3. Go to File → Download → Download as → Notebook (.ipynb).
!jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to notebook --output cleaned_notebook.ipynb FinalTest2RAG.ipynb


This notebook builds a simple RAG pipeline: read a PDF, split it into chunks, embed the chunks with SentenceTransformers, index them with FAISS, retrieve the chunks most relevant to a question, then answer with a small LLaMA model.

Step 1: Install Required Libraries


In [None]:
!pip install -q llama-index PyMuPDF sentence-transformers faiss-cpu transformers

Step 2: Upload and Extract PDF Content

In [None]:
from google.colab import files  # Colab's file-upload helper (lets you pick a file from your computer)
import fitz  # PyMuPDF: library for opening and reading PDFs

uploaded = files.upload()  # opens the upload dialog; returns a dict of {filename: file bytes}

def extract_text_from_pdf(pdf_path):
    """Read every page of a PDF and return the concatenated text."""
    doc = fitz.open(pdf_path)  # open the PDF
    text = ""  # empty string that accumulates the page text
    for page in doc:
        text += page.get_text()  # append the text of the current page
    return text

pdf_filename = next(iter(uploaded))  # name of the first uploaded file
document_text = extract_text_from_pdf(pdf_filename)
print(document_text[:1000])  # print the first 1,000 characters to sanity-check the extraction

Saving Siemens Travel policy.pdf to Siemens Travel policy.pdf
 
Page 1/3 
 
Press 
Munich, October 20, 2022 
Siemens makes business travel more 
sustainable 
 
• Green Stay Initiative (GSI) is first global standard to enable more 
sustainable business-travel lodging and meetings 
• Beginning in 2023, Siemens to select its preferred supplier hotels according 
to GSI sustainability criteria 
• Open-source tool available for also enabling other companies to reduce 
their carbon footprint 
 
Siemens will be the first global company to introduce a mandatory Green Stay 
Initiative (GSI) policy and request that hotels provide data about their environmental 
impact. Together with the international service company HRS, Siemens developed 
global, traceable, and standardized criteria by which the carbon footprint of each 
individual hotel can be tracked and compared with others. So far, there has not been 
a global standard for using traceable criteria to compare hotel sustainability.  
 
In the 

Step 3: Create Embeddings and Store in FAISS Index

In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Split text into chunks
def split_text(text, max_length=500):
    return [text[i:i+max_length] for i in range(0, len(text), max_length)]

chunks = split_text(document_text)
embeddings = model.encode(chunks)

# FAISS index
dimension = embeddings.shape[1]  # number of features (dimensions) in each embedding vector
# embeddings is a 2D array with one row per chunk; shape[1] is the number of columns, i.e. the size of each vector.

index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))

# Store mapping of index to chunk
chunk_mapping = {i: chunk for i, chunk in enumerate(chunks)}

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



SentenceTransformer is the embedding model class (turns text into vectors).

faiss is Facebook AI Similarity Search (fast vector search).

numpy is used for arrays.

model = SentenceTransformer('all-MiniLM-L6-v2') loads a small, fast sentence-embedding model.

split_text(...) slices the document into fixed 500-character pieces. This is simple and works, but it can cut words and sentences in the middle.

chunks = split_text(document_text) creates the list of chunks.

embeddings = model.encode(chunks) turns each chunk into a vector (array of numbers).

dimension = embeddings.shape[1] gets the vector size (columns of the 2D array).

index = faiss.IndexFlatL2(dimension) builds a flat FAISS index that runs an exact, brute-force search using L2 (Euclidean) distance.

index.add(...) inserts all the chunk embeddings into the index.

chunk_mapping = {i: chunk ...} remembers which chunk (text) corresponds to each embedding row.
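
Because fixed 500-character slices can split a sentence across two chunks, one common variation is to let consecutive chunks overlap, so that text near a boundary is more likely to appear intact in one of them. The sketch below is an illustration only and is not part of this notebook's pipeline: the function name `split_text_with_overlap`, its `overlap` parameter, and the values used are all assumptions.

```python
def split_text_with_overlap(text, max_length=500, overlap=50):
    """Split text into chunks of at most max_length characters.

    Each chunk starts (max_length - overlap) characters after the previous
    one, so neighbouring chunks share `overlap` characters.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be >= 0 and smaller than max_length")
    step = max_length - overlap
    return [text[i:i + max_length] for i in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters of stand-in text
for chunk in split_text_with_overlap(sample, max_length=12, overlap=4):
    print(repr(chunk))
```

With `overlap=0` this behaves like the original `split_text`. A larger overlap produces more chunks, and therefore more embeddings to compute and store.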



Step 4: Load a LLaMA-based Model (via HuggingFace Transformers)

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')  # Embedding model: re-imports and re-creates the same model as in Step 3.
# (Tip: this duplication is harmless, but you can skip this cell if the Step 3 cell has already run.)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# Use a smaller, Colab-friendly LLaMA model
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
llama_model = AutoModelForCausalLM.from_pretrained(model_name)

# Set up generation pipeline (use GPU if available)
generator = pipeline("text-generation", model=llama_model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1)


Device set to use cpu


-Imports Hugging Face Transformers utilities and PyTorch.

-model_name picks a small chat model to keep resource usage low.

-AutoTokenizer.from_pretrained(...) downloads the matching tokenizer.

-AutoModelForCausalLM.from_pretrained(...) downloads the language model.

-pipeline("text-generation", ...) wraps the model+tokenizer into a convenient generator.

-device=0 if ... else -1 uses GPU if available, otherwise CPU.





Step 5: Query the Document (RAG Retrieval + Generation)

In [None]:
def retrieve_relevant_chunks(query, top_k=4):  # top_k: how many of the most relevant document chunks to retrieve (default 4)
    query_embedding = model.encode([query])
    distances, indices = index.search(np.array(query_embedding), top_k)
    # index is the FAISS index created earlier; it holds the embeddings of all the document chunks.
    # index.search() looks up the embeddings in the index that are closest to query_embedding.
    # It returns two arrays:
    # distances: the distances between the query embedding and the top_k most similar chunk embeddings (smaller means more similar).
    # indices: the positions in the original chunks list of those top_k chunk embeddings.
    return [chunk_mapping[i] for i in indices[0]]

def generate_answer(query):
    context_chunks = retrieve_relevant_chunks(query)
    context = "\n".join(context_chunks)

    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    output = generator(prompt, max_new_tokens=200, do_sample=True)[0]['generated_text']
    return output

-retrieve_relevant_chunks(query, top_k) embeds the user question and searches the FAISS index.

-index.search(...) finds the top_k most similar chunks.

-return [chunk_mapping[i] ...] converts those indices back into the original text chunks.

-generate_answer(query) pulls those chunks, stitches them into a context string, and builds a prompt.

-The prompt format gives the model the Context and the Question.

-generator(..., max_new_tokens=200, do_sample=True) samples up to 200 new tokens. By default the returned generated_text contains the prompt followed by the answer.
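
To make the retrieval step concrete, here is a dependency-free sketch of what an exact L2 search computes: the squared Euclidean distance from the query to every stored vector, followed by picking the `top_k` smallest. The helper `l2_search` is hypothetical, and the toy 2-D vectors stand in for real embeddings. FAISS does the same job far more efficiently, and to my knowledge `IndexFlatL2` also reports squared distances.

```python
def l2_search(vectors, query, top_k):
    """Return (distances, indices) of the top_k vectors closest to query."""
    scored = []
    for position, vector in enumerate(vectors):
        # Squared Euclidean distance between the query and this vector.
        distance = sum((q - v) ** 2 for q, v in zip(query, vector))
        scored.append((distance, position))
    scored.sort()  # smallest distance (most similar) first
    best = scored[:top_k]
    return [d for d, _ in best], [i for _, i in best]


# Toy 2-D "embeddings" standing in for the chunk vectors (made-up values).
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
toy_chunks = {0: "chunk A", 1: "chunk B", 2: "chunk C", 3: "chunk D"}

distances, indices = l2_search(toy_vectors, query=[1.0, 0.5], top_k=2)
print(distances, indices)
print([toy_chunks[i] for i in indices])  # map indices back to text, as chunk_mapping does
```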

Example Usage

In [None]:
generate_answer("what is GSI?")

"Context:\n \nPage 1/3 \n \nPress \nMunich, October 20, 2022 \nSiemens makes business travel more \nsustainable \n \n• Green Stay Initiative (GSI) is first global standard to enable more \nsustainable business-travel lodging and meetings \n• Beginning in 2023, Siemens to select its preferred supplier hotels according \nto GSI sustainability criteria \n• Open-source tool available for also enabling other companies to reduce \ntheir carbon footprint \n \nSiemens will be the first global company to introduce a mandatory Gre\nn Stay Initiative has been designed as an \nopen-source tool for enabling other companies to travel more sustainably as well. \nSiemens, which accounted for more than two million room nights annually before the \npandemic, sees GSI as an essential driver for achieving its sustainability targets. \nMore than half of Siemens’ supplier hotels have already provided their data. \nSiemens AG \nCommunications  \nHead: Lynette Jackson \nWerner-von-Siemens-Strasse 1 \n80333 Mu

In [None]:
generate_answer("How many hotel chains are participating in HRS’ Green Stay Initiative?")

'Context:\nply with verifiable corporate reporting standards on environmental, social and \ngovernance topics. Demonstrating noteworthy traction, more than 300 hotel chains \nand thousands of properties across 130 countries are currently participating in HRS’ \nGreen Stay Initiative, less than 18 months after HRS introduced the award-winning \ntechnology to corporate lodging buyers and suppliers worldwide. \n \n“We’ve noted that a number of our Fortune 500 clients have begun making definitive \nrequests regarding\nility criteria, Siemens aims to \nencourage more hotels to reduce their carbon emissions. \n \nSiemens collaborates globally with HRS on a broad range of hotel program \nmanagement functions, including procurement, rate auditing, booking, payment and \nmanagement of meetings. Recognizing that many hotels have limited resources in \nthe wake of the pandemic, HRS’ and Siemens’ sustainability experts purposefully \ndesigned the Green Stay Initiative to make it easy for hotels to

In [None]:
generate_answer("Who is the Contact person?")

'Context:\ne the arrival of net-zero corporate lodging programs, all \nwhile helping the travel industry reach carbon neutrality years in advance of the \n2050 goal outlined in the Paris Agreement.” \n \nThis press release is available at https://sie.ag/3MGicvA \nFollow us on Twitter at www.twitter.com/siemens_press \n \nContact for journalists  \nAndreas Friedrich \nPhone: +49 1522 2103967 \nE-mail: friedrich@siemens.com \n \nKonstanze Somborn \nPhone: +49 173 8947354 \nE-mail: konstanze.somborn@siemens.com \n \n \nSiemen\ns of September 30, 2021, the company had around 303,000 employees worldwide. Further \ninformation is available on the Internet at www.siemens.com. \n\nOPR202210186581EN \n \nSiemens AG \nPress release \n \nReference number: HQCOPR202210186581EN  \n \n“The HRS Green Stay Initiative provides us with a uniform global standard for \nmeasuring and comparing hotels’ environmental footprints,” said Thorsten Eicke, \nHead of Global Mobility Services at Siemens AG. “By impl
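
Each result above begins with `Context:` because, by default, the `generated_text` returned by a text-generation pipeline contains the prompt followed by the completion. A small post-processing step can keep only the answer. The helper name `extract_answer` is hypothetical, and the strings below are stand-ins rather than real model output. The pipeline call also accepts `return_full_text=False`, which should have a similar effect.

```python
def extract_answer(generated_text, prompt):
    """Return only the text the model produced after the prompt."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


# Stand-in strings; in the notebook these would come from generate_answer().
prompt = (
    "Context:\nGSI is a hotel sustainability standard.\n\n"
    "Question: what is GSI?\nAnswer:"
)
full_text = prompt + " A standard for comparing hotel sustainability."

print(extract_answer(full_text, prompt))
```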

In [None]:
def retrieve_relevant_chunks(query, top_k=3):
    # Step 1: Encode the query into an embedding
    query_embedding = model.encode([query])  # Encoding the query

    # Step 2: Perform the similarity search in the FAISS index
    distances, indices = index.search(np.array(query_embedding), top_k)  # `index.search()` returns distances and indices

    # Step 3: Print the distances and indices for debugging purposes
    print(f"Distances: {distances}")  # Show how similar the retrieved chunks are to the query
    print(f"Indices: {indices}")      # Show the positions of the top-k chunks in the original chunk list

    # Step 4: Print the chunk mapping with chunk number (optional for clarity)
    print("\nChunk Mapping (Top-k Chunks):")
    for i, idx in enumerate(indices[0]):
        print(f"Chunk {i + 1}: {chunk_mapping[idx]} (Index: {idx}) - Distance: {distances[0][i]}")

    # Step 5: Return the relevant chunks from the chunk mapping
    return [chunk_mapping[i] for i in indices[0]]
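
The printed distances can also be used to discard weak matches before they reach the prompt. The sketch below assumes a hypothetical helper `select_chunks` and an illustrative `max_distance` cutoff that would need tuning on real data. The nested lists are made-up values shaped like the output of `index.search`. It also skips index `-1`, which FAISS uses to pad results when it finds fewer than `top_k` neighbours.

```python
def select_chunks(distances, indices, chunk_mapping, max_distance):
    """Keep chunks whose distance is within max_distance, skipping padding."""
    selected = []
    for distance, idx in zip(distances[0], indices[0]):
        if idx == -1:  # padding entry: no neighbour was found for this slot
            continue
        if distance <= max_distance:
            selected.append(chunk_mapping[int(idx)])
    return selected


# Made-up values shaped like FAISS output for a single query (one row each).
example_distances = [[0.42, 0.97, 1.63, 3.4e38]]
example_indices = [[2, 0, 1, -1]]
example_mapping = {0: "chunk zero", 1: "chunk one", 2: "chunk two"}

print(select_chunks(example_distances, example_indices, example_mapping, max_distance=1.0))
```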


In [None]:
!pip install jedi

Collecting jedi
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 16.4 MB/s eta 0:00:00
Installing collected packages: jedi
Successfully installed jedi-0.19.2
