# Exercise: Multi-Modal Retrieval Augmented Generation from Scratch

## Objective

Implement a Retrieval Augmented Generation (RAG) system from scratch using the [PHI-3 vision](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) model, [Jina-CLIP-V1](https://huggingface.co/jinaai/jina-clip-v1) embeddings, and the Chroma vector database. Your task is to use this system to generate text based on the contents of the paper "Attention is All You Need" ([https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)).

## Allowed Libraries

You are only permitted to use the following libraries:
- `torch`
- `chromadb`
- `numpy`
- `io`
- `fitz`
- `requests`
- `PIL`
- `transformers`

## Components to Implement

1. **Document Ingestion**: Load the "Attention is All You Need" paper and prepare it for processing. This includes handling the PDF and extracting text and images as needed.

2. **Embeddings Generation**: Use Jina-CLIP-V1 to create embeddings for the text and images extracted from the document.

3. **Vector Database Management**: Utilize the Chroma vector database to store and manage the embeddings.

4. **Retrieval Mechanism**: Implement a mechanism to retrieve relevant document segments (text or images) based on a given query using the stored embeddings.

5. **Generation Model**: Integrate the PHI-3 vision model to generate new content based on the retrieved segments.

## Submission

Your submission should include:
- The complete code for the exercise.
- A README file explaining your implementation and how to run the code.
- Examples of generated outputs based on sample queries.

## Evaluation Criteria

- Correctness: The RAG system should accurately retrieve relevant document segments and generate coherent outputs.
- Efficiency: The implementation should be optimized for performance.
- Creativity: Innovative approaches to integrate and utilize the components are encouraged.
- Clarity: Code should be well-documented and easy to understand.

Good luck and enjoy the exercise!

Installing and importing the packages required to load the Jina-CLIP model

In [1]:
!pip install transformers einops timm pillow
from transformers import AutoModel


In [5]:
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

Loading Jina-CLIP model...



ImportError: /usr/local/lib/python3.11/dist-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

Loading the file and extracting text and images from the document

In [3]:
!pip install PyMuPDF Pillow

Collecting PyMuPDF
  Downloading pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.5


In [4]:
import fitz
from PIL import Image
import io

In [17]:
import requests
import fitz
from PIL import Image
import io

pdf_url = "https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"
pdf_path = "attention_is_all_you_need.pdf" # Define a local path to save the PDF

# Download the PDF (fail fast on HTTP errors instead of saving an error page)
response = requests.get(pdf_url, timeout=60)
response.raise_for_status()
with open(pdf_path, 'wb') as f:
    f.write(response.content)

# Open the downloaded PDF
pdf_document = fitz.open(pdf_path)

In [19]:
extracted_images = []

for page_number in range(pdf_document.page_count):
    page = pdf_document.load_page(page_number)
    image_list = page.get_images(full=True)

    for img_index, img_info in enumerate(image_list):
        xref = img_info[0]
        base_image = pdf_document.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        image = Image.open(io.BytesIO(image_bytes))
        extracted_images.append(image)

print(f"Extracted {len(extracted_images)} images.")

Extracted 3 images.


In [7]:
image_embeddings = model.encode_image(extracted_images)



To generate multiple text embeddings, you can pass a list of text strings to the `model.encode_text()` function.

In [18]:
extracted_text = []

for page_number in range(pdf_document.page_count):
    page = pdf_document.load_page(page_number)
    extracted_text.append(page.get_text())

print(f"Extracted text from {len(extracted_text)} pages.")

Extracted text from 11 pages.
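
The cell above embeds one full page per entry. If finer-grained retrieval is wanted, each page could be split into smaller overlapping passages before calling `model.encode_text()`. The following is a minimal sketch of that idea; the helper name `chunk_text` and the `max_chars` / `overlap` values are illustrative assumptions, not part of the original notebook.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chars].strip()
        if chunk:
            chunks.append(chunk)
        # Stop once the current window already reaches the end of the text
        if start + max_chars >= len(text):
            break
    return chunks


# Stand-in for one entry of `extracted_text`
page_text = "Attention is all you need. " * 100
passages = chunk_text(page_text, max_chars=1000, overlap=200)
print(f"{len(passages)} passages, first one has {len(passages[0])} characters")
```

The resulting passages could then replace the per-page strings passed to `model.encode_text()`, at the cost of more embeddings to store.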


In [9]:
# Make sure to run the cell above to define the model
text_embeddings = model.encode_text(extracted_text)


Checking similarities between the embeddings

In [None]:
# The embeddings are assumed to be L2-normalized (the encoder's default),
# so each dot product below is a cosine similarity.
print(text_embeddings[0] @ text_embeddings[1].T) # text embedding similarity
print(text_embeddings[0] @ image_embeddings[0].T) # text-image cross-modal similarity
print(text_embeddings[0] @ image_embeddings[1].T) # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[0].T) # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[1].T) # text-image cross-modal similarity

0.72897476
0.17103584
0.15112677
0.14756522
0.1411351
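
The dot products above only equal cosine similarities when the embeddings have unit length. As a small illustration of that relationship, the sketch below computes cosine similarity explicitly with NumPy and compares it with the dot product of normalized vectors. The function name `cosine_similarity` and the random stand-in vectors are assumptions made for this example.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random stand-ins for two embedding vectors
rng = np.random.default_rng(0)
u = rng.normal(size=8)
v = rng.normal(size=8)

# After L2 normalization, the plain dot product matches the cosine similarity
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
print(np.isclose(u_hat @ v_hat, cosine_similarity(u, v)))
```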


Storing the embeddings in ChromaDB

In [11]:
pip install chromadb



In [5]:
import chromadb

CHROMA_DATA_PATH = "chroma_data/"
COLLECTION_NAME = "text_embeddings" # Renamed for clarity
# Embeddings are computed with Jina-CLIP and passed in explicitly,
# so no Chroma embedding function (e.g. "all-MiniLM-L6-v2") is configured.
client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)

In [6]:
# Use get_or_create_collection to handle cases where the collection might already exist
collection = client.get_or_create_collection(name=COLLECTION_NAME)
print(f"Accessed or created collection: {collection.name}")

NameError: name 'client' is not defined

In [24]:
# Requires `model` (Jina-CLIP), `extracted_text`, and `collection` from the earlier cells
if extracted_text:
    text_embeddings = model.encode_text(extracted_text)

    # Generate unique IDs for each text embedding
    text_embedding_ids = [f"text_embedding_{i}" for i in range(len(text_embeddings))]

    # upsert (rather than add) overwrites entries left in the persistent store
    # by earlier runs, so re-running this cell keeps the stored documents in sync
    collection.upsert(
        embeddings=text_embeddings.tolist(),
        documents=extracted_text, # Include the original text content
        ids=text_embedding_ids
    )

    print(f"Successfully added {len(text_embeddings)} text embeddings to Chroma collection '{COLLECTION_NAME}'.")
else:
    print("No text extracted to create embeddings.")

Successfully added 11 text embeddings to Chroma collection 'text_embeddings'.


Now that the embeddings are stored in Chroma, we can implement the retrieval mechanism.

In [22]:
# Define a sample query
query_text = "What are the benefits of the Attention mechanism?"

# Generate embedding for the query using the Jina-CLIP model
query_embedding = model.encode_text(query_text)

# Query the Chroma collection for the 5 nearest stored embeddings
results = collection.query(
    query_embeddings=[query_embedding.tolist()], # Pass the query vector as a plain list of floats
    n_results=5,
    include=['documents', 'distances', 'metadatas'] # Include relevant information in the results
)

# Display the results
print("Query Results:")
for i, (doc, dist, metadata) in enumerate(zip(results['documents'][0], results['distances'][0], results['metadatas'][0])):
    print(f"Result {i+1} (Distance: {dist:.4f}):")
    if doc:
        print(f"  Document: {doc[:200]}...") # Print first 200 characters of text documents
    elif metadata and 'file_path' in metadata: # Assuming image metadata includes a file path or identifier
        print(f"  Image: {metadata['file_path']}") # Or some other image identifier
    print("-" * 20)

Query Results:
Result 1 (Distance: 1.0938):
--------------------
Result 2 (Distance: 1.1270):
--------------------
Result 3 (Distance: 1.1564):
--------------------
Result 4 (Distance: 1.1829):
--------------------
Result 5 (Distance: 1.2816):
--------------------
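
The distances above are larger than 1, which can look odd next to the cosine similarities printed earlier. Assuming the collection uses Chroma's default distance (squared L2) and that the embeddings are unit-normalized, the two are related by d = ||a - b||² = 2 - 2·cos(a, b). The sketch below applies that conversion to the distances printed above; the helper name `squared_l2_to_cosine` is an assumption for this example.

```python
def squared_l2_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    Assumes both vectors have unit length, so ||a - b||^2 = 2 - 2 * cos(a, b).
    """
    return 1.0 - distance / 2.0


# Distances copied from the query output above
for d in [1.0938, 1.1270, 1.1564, 1.1829, 1.2816]:
    print(f"distance {d:.4f} -> cosine similarity {squared_l2_to_cosine(d):.4f}")
```

If the collection were created with a different distance function, this conversion would not apply.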


In [25]:
# Requires `model` (Jina-CLIP), `extracted_images`, and `collection` from the earlier cells
if extracted_images:
    image_embeddings = model.encode_image(extracted_images)

    # Generate unique IDs for each image embedding
    image_embedding_ids = [f"image_embedding_{i}" for i in range(len(image_embeddings))]

    # Chroma's 'documents' field holds strings, not image objects, so a text placeholder
    # is stored for each image. If the images themselves are needed for generation, store
    # image paths or base64-encoded strings instead, or keep the images in a separate lookup.
    # The index matches the position in `extracted_images`; PIL does not record the PDF page.
    image_placeholders = [f"Image {i} extracted from the PDF" for i in range(len(extracted_images))]

    collection.add(
        embeddings=image_embeddings.tolist(),
        documents=image_placeholders, # Add a placeholder for the image
        ids=image_embedding_ids
    )

    print(f"Successfully added {len(image_embeddings)} image embeddings to Chroma collection '{COLLECTION_NAME}'.")
else:
    print("No images extracted to create embeddings.")

Successfully added 3 image embeddings to Chroma collection 'text_embeddings'.


Implementing the mechanism to retrieve documents for a given query

In [None]:
# Define a sample query
query_text = "What are the benefits of the Attention mechanism?"

# Generate embedding for the query using the Jina-CLIP model
query_embedding = model.encode_text(query_text)

# Query the Chroma collection
# We will query for the top 5 most relevant results (both text and images)
results = collection.query(
    query_embeddings=[query_embedding.tolist()], # Pass the query vector as a plain list of floats
    n_results=5,
    include=['documents', 'distances', 'metadatas'] # Include relevant information in the results
)

# Display the results
print("Query Results:")
for i, (doc, dist, metadata) in enumerate(zip(results['documents'][0], results['distances'][0], results['metadatas'][0])):
    print(f"Result {i+1} (Distance: {dist:.4f}):")
    if doc:
        print(f"  Document: {doc[:200]}...") # Print first 200 characters of text documents
    elif metadata and 'file_path' in metadata: # Assuming image metadata includes a file path or identifier
        print(f"  Image: {metadata['file_path']}") # Or some other image identifier
    print("-" * 20)

Query Results:
Result 1 (Distance: 1.7033):
--------------------
Result 2 (Distance: 1.7837):
--------------------
Result 3 (Distance: 1.8058):
--------------------
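
Because text and image embeddings share one collection, a query can return a mix of both. One way to separate them before generation is sketched below. It operates on a dictionary shaped like the output of `collection.query()` (lists of lists under `ids`, `documents`, and `metadatas`). The `type` metadata key and the helper name `split_by_modality` are assumptions for this example; the `image_embedding_` ID prefix follows the IDs used in this notebook.

```python
def split_by_modality(results: dict) -> tuple[list[str], list[str]]:
    """Separate retrieved text passages from retrieved image IDs (hypothetical helper)."""
    ids = results["ids"][0]
    docs = (results.get("documents") or [[None] * len(ids)])[0]
    metas = (results.get("metadatas") or [[None] * len(ids)])[0]

    texts, image_ids = [], []
    for item_id, doc, meta in zip(ids, docs, metas):
        kind = (meta or {}).get("type")  # assumed metadata key, may be absent
        if kind == "image" or item_id.startswith("image_embedding_"):
            image_ids.append(item_id)
        elif doc:
            texts.append(doc)
    return texts, image_ids


# Hand-written stand-in for a query result
sample_results = {
    "ids": [["text_embedding_3", "image_embedding_0", "text_embedding_7"]],
    "documents": [["Self-attention relates positions of a sequence.", "Image placeholder", None]],
    "metadatas": [[{"type": "text"}, {"type": "image"}, None]],
    "distances": [[1.09, 1.70, 1.78]],
}
texts, image_ids = split_by_modality(sample_results)
print(texts)
print(image_ids)
```

Entries without a document, such as the third one in the sample, are skipped rather than passed on as empty context.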


Integrating the PHI-3 vision model to generate new content based on the retrieved output

In [None]:
!pip install flash-attn --no-build-isolation

In [None]:
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu118torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl

Collecting flash-attn==2.5.8+cu118torch2.3cxx11abiFALSE
  Downloading https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu118torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl (122.1 MB)
Collecting ninja (from flash-attn==2.5.8+cu118torch2.3cxx11abiFALSE)
  Using cached ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (5.1 kB)
Using cached ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (180 kB)
Installing collected packages: ninja, flash-attn
  Attempting uninstall: flash-attn
    Found existing installation: flash_attn 2.8.3
    Uninstalling flash_attn-2.8.3:
      Successfully uninstalled flash_attn-2.8.3
Successfully installed flash-attn-2.5.8 ninja-1.13.0


In [7]:
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

# Define the model ID for PHI-3 vision
model_id = "microsoft/Phi-3-vision-128k-instruct"

# Load with eager attention so that FlashAttention2 is not required.
# If a flash-attn build that does not match the installed PyTorch is present,
# the import may still fail; uninstalling flash-attn is one possible workaround.
try:
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    # No quantization_config is passed because none is defined in this notebook
    model_phi3 = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        trust_remote_code=True,
        attn_implementation="eager",  # explicitly disable FlashAttention2
        device_map="auto",
    )

    print(f"Loaded PHI-3 vision model: {model_id}")

except ImportError as e:
    print(f"Error loading model: {e}")
    print("It seems there is an issue with FlashAttention2. Please ensure you have the correct version installed and that it is compatible with your CUDA setup.")
    print("Refer to the transformers documentation on FlashAttention2 for more information.")

except Exception as e:
    print(f"An unexpected error occurred: {e}")

ImportError: /usr/local/lib/python3.11/dist-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
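
An "undefined symbol" error like the one above usually suggests that the compiled `flash_attn` extension was built against a different PyTorch build than the one installed. The sketch below is one way to check up front whether `flash_attn` imports cleanly and to pick an attention implementation accordingly. The helper name `flash_attn_importable` is an assumption, and this check alone may not be enough: if a broken build stays installed, the model's remote code may still try to import it, in which case uninstalling the package may be needed.

```python
import importlib


def flash_attn_importable() -> bool:
    """Return True only if flash_attn and its compiled extension import cleanly."""
    try:
        importlib.import_module("flash_attn")
        return True
    except ImportError as exc:
        # Covers both "not installed" and binary mismatches such as undefined symbols
        print(f"flash_attn unavailable: {exc}")
        return False


# Fall back to eager attention when flash_attn cannot be imported
attn_implementation = "flash_attention_2" if flash_attn_importable() else "eager"
print(f"Using attn_implementation={attn_implementation!r}")
```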

In [26]:
# Attempting to install flash_attn from source with specific CUDA flags
# This can sometimes resolve compatibility issues
import os

# You might need to set CUDA_HOME if it's not automatically detected
# os.environ['CUDA_HOME'] = '/usr/local/cuda' # Example path, verify in your environment

# Set the compute capability for your GPU
# You can find your GPU's compute capability by running !nvidia-smi and looking up your GPU model
# Replace '8.0' with your GPU's compute capability (e.g., '7.0' for V100, '7.5' for T4, '8.0' for A100, '8.6' for RTX 30 series, '8.9' for RTX 40 series)
# os.environ['TORCH_CUDA_ARCH_LIST'] = '8.0' # Example for A100

!pip install git+https://github.com/Dao-AILab/flash-attention.git --no-build-isolation

Collecting git+https://github.com/Dao-AILab/flash-attention.git
  Cloning https://github.com/Dao-AILab/flash-attention.git to /tmp/pip-req-build-ixtsid2u
  Running command git clone --filter=blob:none --quiet https://github.com/Dao-AILab/flash-attention.git /tmp/pip-req-build-ixtsid2u
  Resolved https://github.com/Dao-AILab/flash-attention.git to commit 91f14ca07b792645b72efbb05b233907a831c898
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: flash_attn
  Building wheel for flash_attn (setup.py) ... done
  Created wheel for flash_attn: filename=flash_attn-2.8.3-cp311-cp311-linux_x86_64.whl size=256022485 sha256=0abc62d04f28f140f4f76ab7cfd1d8ce24a69c6ab0cbace8d4ab99640b68dc0a
  Stored in directory: /tmp/pip-ephem-wheel-cache-n00zahiv/wheels/5d/d7/50/ea7b22955a7d70789a383640683f618c642c6c0cd343effe19
Successfully built flash_attn
Installing collected packages: flash_attn
Su

In [None]:
!pip install flash_attn==2.5.8 --no-build-isolation

Collecting flash_attn==2.5.8
  Downloading flash_attn-2.5.8.tar.gz (2.5 MB)
  Preparing metadata (setup.py) ... done
Collecting ninja (from flash_attn==2.5.8)
  Downloading ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (5.1 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->flash_attn==2.5.8)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->flash_attn==2.5.8)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting

In [1]:
import torch

print(f"CUDA is available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # torch.version.cuda is the CUDA version PyTorch was built against
    print(f"PyTorch CUDA version: {torch.version.cuda}")
else:
    print("Please ensure you have a CUDA-enabled GPU and that CUDA is properly installed.")

CUDA is available: True
PyTorch CUDA version: 12.4


In [2]:
# Reinstall flash_attn as a troubleshooting step for the ImportError
!pip install flash_attn --no-build-isolation

Collecting flash_attn
  Downloading flash_attn-2.8.3.tar.gz (8.4 MB)
  Preparing metadata (setup.py) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->flash_attn)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->flash_attn)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch->flash_attn)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch->flash_attn)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch->flash_attn)
  Downloading nvidia_cubla

In [15]:
# Ensure the Jina-CLIP model is loaded
# Check if 'model' is defined, if not, load it
if 'model' not in globals():
    print("Loading Jina-CLIP model...")
    model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)
    print("Jina-CLIP model loaded.")
else:
    print("Jina-CLIP model is already loaded.")

Jina-CLIP model is already loaded.


In [23]:
# Define a sample query
query_text = "What are the benefits of the Attention mechanism?"

# Generate embedding for the query using the Jina-CLIP model
# Ensure the 'model' variable from earlier steps is available
query_embedding = model.encode_text(query_text)

# Query the Chroma collection
# Ensure 'collection' variable from earlier steps is available
results = collection.query(
    query_embeddings=[query_embedding.tolist()], # Pass the query vector as a plain list of floats
    n_results=5,
    include=['documents', 'distances', 'metadatas']
)

# Combine the retrieved text for the prompt
retrieved_text = "\n".join([doc for doc in results['documents'][0] if doc is not None])

# Check if any text was retrieved
if retrieved_text:
    # Create the prompt for the PHI-3 model
    messages = [
        {"role": "user", "content": f"Based on the following information:\n{retrieved_text}\n\nAnswer the question: {query_text}"}
    ]

    # Ensure 'processor' and 'model_phi3' variables from earlier steps are available
    prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = processor(prompt, return_tensors="pt").to(model_phi3.device)

    # Greedy decoding; temperature is omitted because it has no effect when do_sample=False
    generation_args = {
        "max_new_tokens": 500,
        "do_sample": False,
    }

    generate_ids = model_phi3.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

    # remove input tokens
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

    print(response)
else:
    print("No relevant documents retrieved to generate a response.")

No relevant documents retrieved to generate a response.
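
The generation cell above sends only text to PHI-3 vision. To also pass retrieved images, the prompt needs one numbered image placeholder per image, in the `<|image_1|>` form shown on the model card. The sketch below builds such a message list. The helper name `build_messages` and its parameters are assumptions for this example, and the prompt wording simply mirrors the cell above.

```python
def build_messages(question: str, passages: list[str], num_images: int = 0) -> list[dict]:
    """Build a single-turn chat message with optional image placeholders (hypothetical helper)."""
    # One numbered placeholder per image, e.g. <|image_1|>, <|image_2|>, ...
    image_tags = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    context = "\n\n".join(passages)
    content = (
        f"{image_tags}Based on the following information:\n{context}\n\n"
        f"Answer the question: {question}"
    )
    return [{"role": "user", "content": content}]


messages = build_messages(
    "What are the benefits of the Attention mechanism?",
    ["Self-attention connects all positions with a constant number of operations."],
    num_images=2,
)
print(messages[0]["content"])

# With the notebook's objects available, the messages would be used roughly as follows
# (assuming `images` is a list of PIL images matching the placeholders):
# prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# inputs = processor(prompt, images, return_tensors="pt").to(model_phi3.device)
```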
