# Vector Search with ColPali

## **TLDR:**

This is all the code you need to try ColPali:

### Installation

```bash
# If running locally:
# you need to sign in with the Hugging Face CLI, request access to PaliGemma, and have at least 8GB of VRAM.
pip install 'midrasai[local]'

# If using the MidrasAI API:
# you need an API key from midrasai.com
pip install midrasai
```

### Usage

```python
# If running locally
from midrasai.local import LocalMidras 
midras = LocalMidras()

# If using the API
from midrasai import Midras

midras = Midras(midras_key="your-key")
```
```python
# Everything else is the same!

index_name = "test_index"

midras.create_index(index_name)

response = midras.embed_pdf("Attention_is_all_you_need.pdf", include_images=True)

# Use `i` rather than `id` so the builtin id() is not shadowed
for i, embedding in enumerate(response.embeddings):
    midras.add_point(
        index=index_name,
        id=i,
        embedding=embedding,
        data={
            "page_number": i + 1,
            "anything_we_want": "Huzzah!"
        }
    )

query = "Explain the architecture diagram proposed in this paper"

results = midras.query_text(index_name, text=query)

print("Top 3 relevant pages, in order:")

for result in results[:3]:
    print(f"Page {result.data['page_number']} score: {result.score}")
```

### Running Locally

If you want to run the ColPali model that Midras uses, you'll need a GPU; I recommend at least a T4.

Also, you'll need to install some extra features from the midrasai package with the command:

```bash
pip install 'midrasai[local]'
```

This will download some additional dependencies, such as PyTorch and Transformers, that are needed to run the ColPali model locally.

Installing `midrasai[local]` gives you access to the `local` namespace in the midrasai package.
This namespace holds the `LocalMidras` class, which is identical to the standard `Midras` class, but loads and runs ColPali locally instead of remotely.
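
Before importing `midrasai.local`, it can help to confirm that the heavy dependencies are actually present. The sketch below uses only the standard library. The helper name `missing_modules` is hypothetical and not part of midrasai, and it assumes PyTorch and Transformers are importable under their usual names, `torch` and `transformers`.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Assumption: the `local` extra provides PyTorch and Transformers,
# importable as `torch` and `transformers`.
missing = missing_modules(["torch", "transformers"])
if missing:
    print(f"Missing {missing}; try: pip install 'midrasai[local]'")
else:
    print("Local dependencies found.")
```

This only checks that the modules can be located; it does not verify versions or GPU support.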

**NOTE:** You need permission to download the ColPali model from Hugging Face. To get it, follow these steps:

1. Create a Hugging Face account
2. Request access to ColPali
3. Create an account token
4. Authenticate from the CLI with `huggingface-cli login`
    

```python
from midrasai.local import LocalMidras

midras = LocalMidras()
```


Now that the model is loaded, we can create an index to start saving data.
Later, we'll use the `embed_pdf` method to turn a local pdf file, given by its *path*, into a list of images and embeddings.

```python
# Define any name for our index
index_name = "test_index"

# Create a new index using our index name
midras.create_index(index_name)
```


Now we have our index, but it's empty.

Let's use the *embed_pdf* utility method to process a pdf file, which we will then load into our index.

The *embed_pdf* method accepts a path to a local pdf file and returns a response object with the ColBERT-style vector embeddings for each page in the pdf.

Additionally, if we set the parameter "include_images" to True, the response object will also contain the list of PIL Images that were generated from the pdf.

**NOTE:** The *embed_pdf* method relies on poppler to convert pdfs to images. If it fails, make sure you have poppler installed on your system.
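
A quick way to check for poppler before calling *embed_pdf* is to look for its command-line tools on your `PATH`. This is a minimal sketch that assumes your poppler install ships `pdftoppm` or `pdftocairo`. The helper `poppler_available` is hypothetical, and which tool midrasai actually invokes is not confirmed here.

```python
import shutil


def poppler_available(which=shutil.which):
    """Return True if a Poppler command-line tool is on PATH."""
    # Assumption: Poppler installs `pdftoppm` and/or `pdftocairo`.
    return any(which(tool) is not None for tool in ("pdftoppm", "pdftocairo"))


if not poppler_available():
    print("Poppler not found on PATH; install it before calling embed_pdf.")
```

The `which` parameter exists only so the lookup can be swapped out when testing.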

```python
# Pass the path to a pdf file to embed_pdf. Set include_images to True
# to also receive the images generated from the pdf (one per page).
response = midras.embed_pdf("./Attention_is_all_you_need.pdf", include_images=True)

print(f"Number of pages: {len(response.images)}")
```

Let's visualize the first page...

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(7,7))
plt.imshow(response.images[0])
plt.axis("off")
plt.show()
```

Now we have our embeddings, so we can store them in our vector database along with any important information.

Let's write a for loop that goes through every page and calls the *add_point* method to create an entry with an id, a vector that will be used for search, and a payload, which can be any information you want to attach to this vector. In this payload we add a field called "page_number" that is equal to the index of that page plus one. This is because indices in Python start at 0, so page 1 has an index of 0, page 2 has an index of 1, and so on.

```python
# For every page in the pdf, let's add a data point to our index.
# i is the zero-based page index, so the page number is i + 1.
for i, embedding in enumerate(response.embeddings):
    midras.add_point(
        index=index_name,
        id=i,
        embedding=embedding,
        data={
            "page_number": i + 1,
            "anything_else_we_want": "Huzzah!"
        }
    )
```

Finally, with our index loaded with data points, we can use the *query_text* method to run similarity search on the index with a text input.

```python
query = "Explain the architecture diagram proposed in this paper"

results = midras.query_text(index_name, text=query)

print("Top 3 relevant pages, in order:")

for result in results[:3]:
    print(f"Page {result.data['page_number']} score: {result.score}")

    # result.id is the zero-based page index we stored with add_point
    plt.figure(figsize=(7,7))
    plt.imshow(response.images[result.id])
    plt.axis("off")
    plt.show()
```
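
For intuition about what a similarity score can mean with ColBERT-style embeddings, here is a toy sketch of late-interaction ("MaxSim") scoring, the approach commonly used with this family of models. Each query vector is matched to its best page vector, and those best matches are summed. Whether Midras computes `result.score` exactly this way is an assumption; the function name and the tiny 2-dimensional vectors are illustrative only.

```python
def maxsim_score(query_vectors, page_vectors):
    """Sum, over query vectors, of the best dot product against any page vector."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


# Toy 2-dimensional vectors; real embeddings are much larger.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    0: [[1.0, 0.0], [0.5, 0.5]],   # matches the first query vector well
    1: [[0.9, 0.1], [0.1, 0.9]],   # matches both query vectors fairly well
}

# Rank page ids from most to least relevant.
ranked = sorted(pages, key=lambda pid: maxsim_score(query_vecs, pages[pid]), reverse=True)
print(ranked)
```

In this toy data, page 1 ranks first because it has a good match for every query vector, while page 0 matches only one of them strongly.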



### Using the API

If you don't have the hardware to try ColPali locally, you can go to [midrasai](https://midrasai.com) and generate an API key with your GitHub account.
This key will give you access to a ColPali model I'm running on some cloud GPUs.

Once you have your key, you can use it to create a Midras client.
Since it's not running locally, you can just install it with `pip install midrasai` and avoid heavy libraries like PyTorch!

```python
from midrasai import Midras
import os

midras = Midras(os.getenv("MIDRAS_API_KEY"))
```
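
Note that `os.getenv` returns `None` when the variable is not set, so a missing key would only surface later. A small guard like the one below fails early with a clearer message. `require_env` is a hypothetical helper, not part of midrasai.

```python
import os


def require_env(name, env=None):
    """Return the value of environment variable `name`, or raise if unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (commented out so the sketch runs without a key):
# midras = Midras(require_env("MIDRAS_API_KEY"))
```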

After you create your client with the Midras API, the rest of the code is exactly the same!

```python
# Define any name for our index
index_name = "test_index"

# Create a new index using our index name
midras.create_index(index_name)
```

```python
# Pass the path to a pdf file to embed_pdf. Set include_images to True
# to also receive the images generated from the pdf (one per page).
response = midras.embed_pdf("./Attention_is_all_you_need.pdf", include_images=True)

print(f"Number of pages: {len(response.images)}")
```

```python
# For every page in the pdf, let's add a data point to our index.
# i is the zero-based page index, so the page number is i + 1.
for i, embedding in enumerate(response.embeddings):
    midras.add_point(
        index=index_name,
        id=i,
        embedding=embedding,
        data={
            "page_number": i + 1,
            "anything_else_we_want": "Huzzah!"
        }
    )
```

```python
import matplotlib.pyplot as plt

query = "Explain the architecture diagram proposed in this paper"

results = midras.query_text(index_name, text=query)

print("Top 3 relevant pages, in order:")

for result in results[:3]:
    print(f"Page {result.data['page_number']} score: {result.score}")

    # result.id is the zero-based page index we stored with add_point
    plt.figure(figsize=(7,7))
    plt.imshow(response.images[result.id])
    plt.axis("off")
    plt.show()
```