<a href="https://colab.research.google.com/github/galaxyenergy/Mike-Cunningham-Law-Firm/blob/main/GALAXY_SAFETY_MANUAL_RAG_CLEAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SECTION 1


Step 1: Setting Up Your Environment

Install Required Packages


In [1]:
!pip install transformers
!pip install langchain
!pip install sentence-transformers
!pip install faiss-cpu  # FAISS for efficient vector search
!pip install PyMuPDF  # For PDF processing
!pip install -U langchain-community



Import Libraries


In [2]:
# The vector store and embedding wrappers live in the langchain-community package
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import fitz  # PyMuPDF
import os




# SECTION 2


Step 2: Load and Embed Text from the PDF


Load and Preprocess the PDF


In [3]:
# The packages and libraries from Step 1 are already installed and imported;
# only the Colab upload helper is new here.
from google.colab import files

# Upload the PDF file
uploaded = files.upload()

# Get the uploaded file's path by extracting the filename from the `uploaded` dictionary
file_path = next(iter(uploaded))
print(f"File uploaded: {file_path}")

# Load the PDF file and extract text using PyMuPDF
with fitz.open(file_path) as doc:
    pdf_text = ""
    for page in doc:
        pdf_text += page.get_text()

# Break down the PDF text into chunks
def split_text_into_chunks(text, max_length=512):
    """Split text into chunks of at most max_length whitespace-separated words."""
    words = text.split()
    return [' '.join(words[i:i + max_length]) for i in range(0, len(words), max_length)]

# pdf_text was extracted above; embedding happens in the next step
chunks = split_text_into_chunks(pdf_text)



Saving GALAXY DRIVING SAFETY PROCEDURE MANUAL.pdf to GALAXY DRIVING SAFETY PROCEDURE MANUAL (5).pdf
File uploaded: GALAXY DRIVING SAFETY PROCEDURE MANUAL (5).pdf
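
The chunker above cuts the text every 512 words, so a sentence that straddles a boundary ends up split across two chunks. One common mitigation is to let neighbouring chunks share a few words. The sketch below is an illustration only: `split_with_overlap` and its `overlap` parameter are hypothetical names that do not appear elsewhere in this notebook, and the default of 64 words is an arbitrary assumption.

```python
def split_with_overlap(text, max_length=512, overlap=64):
    """Split text into word windows that share `overlap` words with the previous window."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    words = text.split()
    step = max_length - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_length]))
        if start + max_length >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Tiny demonstration with 10 words, windows of 4 words, 1 word shared
sample = " ".join(f"w{i}" for i in range(10))
print(split_with_overlap(sample, max_length=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

If this variant were used, it would take the place of `split_text_into_chunks(pdf_text)` when building `chunks`. Overlap increases the number of chunks, so embedding takes somewhat longer.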





Embed Text Using Hugging Face Model


In [6]:
# Load the embedding model through LangChain's Hugging Face wrapper
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=model_name)

# Embed every chunk from the previous step in one batched call
# (split_text_into_chunks and chunks are defined in the cell above)
embedded_texts = embeddings.embed_documents(chunks)

In [7]:
# Create a list of (text, embedding) pairs
text_embedding_pairs = list(zip(chunks, embedded_texts))

# Now create the FAISS index
vectorstore = FAISS.from_embeddings(text_embedding_pairs, embeddings)

# SECTION 3


Store Chunks and Embeddings in FAISS


In [8]:
def retrieve_relevant_text(query, k=5):
    """Retrieve top k relevant text chunks from the FAISS vector store for a given query."""
    # similarity_search takes the query string and embeds it with the store's own model
    results = vectorstore.similarity_search(query, k=k)
    # Each result is a LangChain Document; its text is stored in page_content
    return [result.page_content for result in results]


In [9]:
# Unpack the (text, embedding) tuples into two parallel sequences.
# The vectors get their own name so the `embeddings` model object is not overwritten.
texts, vectors = zip(*text_embedding_pairs)

Step 3: Retrieval Function


In [10]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load the sentence-transformers model.
# Note: this is a different model from the all-MiniLM-L6-v2 embeddings used in Step 2,
# so vectors from the two indexes are not interchangeable.
model = SentenceTransformer('all-mpnet-base-v2')

def get_embedding(text):
    """Get the embedding for the provided text using sentence-transformers.

    Args:
        text: The text to embed.

    Returns:
        A NumPy array of floats representing the embedding.
    """
    return model.encode(text)

# Get embeddings for all chunks ('chunks' comes from Step 2)
embedded_texts = [get_embedding(chunk) for chunk in chunks]

# Create a list of (text, embedding) pairs
text_embedding_pairs = list(zip(chunks, embedded_texts))

# Convert the embeddings to a float32 NumPy array, which is what FAISS expects
embeddings_np = np.array(embedded_texts).astype('float32')

# Create the FAISS index, reading the dimension from the data (768 for all-mpnet-base-v2)
dimension = embeddings_np.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings_np)

# This redefines retrieve_relevant_text to search the raw FAISS index directly
def retrieve_relevant_text(query, k=5):
    """Retrieve top k relevant text chunks from FAISS index for a given query."""
    query_embedding_np = np.array([get_embedding(query)]).astype('float32')
    # D holds the distances to the nearest vectors, I holds their indices
    D, I = index.search(query_embedding_np, k)
    # FAISS pads I with -1 when the index holds fewer than k vectors; skip those
    return [chunks[i] for i in I[0] if i != -1]
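
To make the search step less of a black box, here is a minimal pure-Python sketch of what an exhaustive ("flat") L2 search computes: the squared Euclidean distance from the query to every stored vector, followed by a top-k selection. `flat_l2_search` is a hypothetical helper written for illustration, not part of FAISS, and it ignores the batching and optimizations of the real index.

```python
import heapq


def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to query."""
    # Score every stored vector by its squared L2 distance to the query
    scored = [
        (sum((a - b) ** 2 for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    nearest = heapq.nsmallest(k, scored)  # smallest distance first
    return [d for d, _ in nearest], [i for _, i in nearest]


# Three 2-dimensional vectors standing in for chunk embeddings
vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, [0.9, 0.0], k=2)
print(indices)  # [1, 0]
```

One difference from the cell above: when fewer than `k` vectors are stored, this sketch simply returns a shorter list, whereas `index.search` pads the result with `-1`.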



# SECTION 4


Step 4: Putting It All Together


In [11]:
# Example query
query = "What are the safety procedures?"
relevant_texts = retrieve_relevant_text(query)

print("Top relevant chunks:")
for i, text in enumerate(relevant_texts):
    print(f"{i+1}. {text}")


Top relevant chunks:
1. Are all passengers buckled up before vehicle is put into motion? • Are all cargo and/or potential ‘projectiles’ properly secured? •Does cargo require special provisions? (Such as hazardous waste, cylinders, animals, pipe, instrumentation, etc.) 03/24/2020 Page 9 of 14 03/24/2020 Page 10 of 14 03/24/2020 Page 11 of 14 03/24/2020 Page 12 of 14 03/24/2020 Page 13 of 14 03/24/2020 Page 14 of 14
2. able to properly operate the vehicle in a safe manner. This includes conditions such as operating a vehicle while under the influence of drugs, medicines, or alcohol, or when under conditions of extreme stress, fatigue, or any other physical or mental impairment that may hinder safe vehicle operation. D. Galaxy Energy Services employees and contractors who operate rental, company or personal vehicles on behalf of Galaxy Energy Services must be in a driving safety program operated by their employer which ensures the following conditions are met: •Be in possession of, and sh
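
The first result above ends with a run of page footers ("03/24/2020 Page 9 of 14 ...") that were extracted along with the body text. Stripping them before chunking keeps that noise out of the embeddings. The sketch below assumes every footer follows the date-plus-page pattern seen in this output; `strip_page_footers` and `PAGE_FOOTER` are hypothetical names, and a differently formatted manual would need a different pattern.

```python
import re

# Assumed footer shape, taken from the sample output: "03/24/2020 Page 9 of 14"
PAGE_FOOTER = re.compile(r"\d{2}/\d{2}/\d{4}\s+Page\s+\d+\s+of\s+\d+")


def strip_page_footers(text):
    """Remove date-and-page-number footers, then collapse the leftover whitespace."""
    without_footers = PAGE_FOOTER.sub(" ", text)
    return " ".join(without_footers.split())


sample = "Does cargo require special provisions? 03/24/2020 Page 9 of 14 03/24/2020 Page 10 of 14"
print(strip_page_footers(sample))
# Does cargo require special provisions?
```

In this notebook the natural place to apply it would be on `pdf_text`, right before `split_text_into_chunks(pdf_text)`.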