# Module 1: Use Postgres (Pgvector) vector database as an online store for retrieving documents

## 1. Overview
In this notebook, we explore how to use Feast to retrieve documents from a Postgres (pgvector) vector database. We use the `city_embeddings` feature view created in the previous notebook and call the `retrieve_online_documents` method to retrieve the top-k documents closest to a query vector.
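
To make "top-k documents closest to the query vector" concrete, here is a minimal brute-force sketch in plain Python. It assumes Euclidean (L2) distance; the metric the online store actually uses may differ. The `top_k_nearest` helper and the toy 2-dimensional vectors are hypothetical, made up for illustration, and are not part of Feast or of this repository.

```python
import heapq
import math


def top_k_nearest(query, documents, k=3):
    """Return the k (distance, doc_id) pairs closest to `query`.

    Assumption: closeness is measured with Euclidean (L2) distance.
    """
    scored = (
        (math.dist(query, vector), doc_id)
        for doc_id, vector in documents.items()
    )
    # nsmallest keeps only the k smallest distances, in ascending order.
    return heapq.nsmallest(k, scored)


# Toy vectors, invented for this example only.
toy_documents = {
    "doc_a": [0.0, 1.0],
    "doc_b": [1.0, 0.0],
    "doc_c": [0.9, 0.1],
    "doc_d": [-1.0, 0.0],
}

nearest = top_k_nearest([1.0, 0.0], toy_documents, k=3)
for distance, doc_id in nearest:
    print(doc_id, round(distance, 3))
```

A vector database performs the same ranking inside the store, usually with an index, so that not every stored vector has to be compared against the query.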

If you haven't already, look at the [README](../README.md) for setup instructions prior to starting this notebook.

<img src="../architecture.png" width="750"/>

# 1. Set up the feature store

### Apply feature repository
We first run `feast apply` to register the data sources and features and to set up the Postgres (pgvector) online store.

In [5]:
!feast apply

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In a future release, Dask DataFrame will use a new implementation that
contains several improvements including a logical query planning.
The user-facing DataFrame API will remain unchanged.

The new implementation is already available and can be enabled by
installing the dask-expr library:

    $ pip install dask-expr

and turning the query planning option on:

    >>> import dask
    >>> dask.config.set({'dataframe.query-planning': True})
    >>> import dask.dataframe as dd

API documentation for the new implementation is available at
https://docs.dask.org/en/stable/dask-expr-api.html

Any feedback can be reported on the Dask issue tracker
https://github.com/dask/dask/issues 


    # via Python

    # via CLI


  import dask.dataframe as dd
Deploying infrastructure for city_embeddings


# 2. Materialize training data
The datasets are prepared in the data directory. You can get them locally by running the commands in README.md. We will materialize the training data into the online store.

In [6]:
!feast materialize 2024-04-01T00:00:00 2024-04-17T00:00:00

Materializing 1 feature views from 2024-03-31 17:00:00-07:00 to 2024-04-16 17:00:00-07:00 into the postgres online store.

city_embeddings:
100%|███████
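
The log above reports the window as 2024-03-31 17:00:00-07:00 to 2024-04-16 17:00:00-07:00, although the command was given 2024-04-01 and 2024-04-17. The sketch below shows one way to reconcile the two. It rests on two assumptions: the naive timestamps on the command line are interpreted as UTC, and the machine that produced the log runs at UTC-07:00.

```python
from datetime import datetime, timedelta, timezone

# Assumption: naive timestamps passed to `feast materialize` are read as UTC.
start_utc = datetime.fromisoformat("2024-04-01T00:00:00").replace(tzinfo=timezone.utc)
end_utc = datetime.fromisoformat("2024-04-17T00:00:00").replace(tzinfo=timezone.utc)

# Assumption: the log was produced on a machine at UTC-07:00.
local_tz = timezone(timedelta(hours=-7))

# Convert the same instants to local time for display.
start_local = start_utc.astimezone(local_tz)
end_local = end_utc.astimezone(local_tz)

print(start_local)
print(end_local)
```

Under these assumptions the printed values match the timestamps in the log, so the materialized window is the one requested; only the display timezone differs.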

Now, we instantiate a Feast `FeatureStore` object, which we will use to retrieve documents from the online store.

In [1]:
from feast import FeatureStore
store = FeatureStore(repo_path=".")



# Prepare a query vector

In [3]:
from batch_score_documents import run_model, TOKENIZER, MODEL
from transformers import AutoTokenizer, AutoModel

In [4]:
question = "the most populous city in the U.S. state of Texas?"

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
model = AutoModel.from_pretrained(MODEL)
query_embedding = run_model(question, tokenizer, model)
query = query_embedding.detach().cpu().numpy().tolist()[0]
print(query)

[0.07801833748817444, -0.02972417138516903, 0.012690403498709202, 0.08342994004487991, -0.07765800505876541, 0.019601989537477493, -0.015240228734910488, -0.008848312310874462, -0.040954213589429855, 0.0025382512249052525, 0.033096734434366226, -0.046222101897001266, 0.05860760435461998, -0.0568450428545475, -0.05276476591825485, 0.0008733967551961541, 0.0573134645819664, -0.05047149211168289, 0.1344185769557953, -0.07026461511850357, -0.012536157853901386, 0.0014152592048048973, 0.03534318506717682, 0.024096962064504623, 0.05246112868189812, 0.020924478769302368, 0.025234023109078407, 0.0519547164440155, -0.039378199726343155, -0.028298156335949898, -0.02180365100502968, 0.04103993624448776, 0.07427085936069489, -0.05584770813584328, -0.0056844125501811504, -0.019990745931863785, 0.030951738357543945, -0.05062446370720863, 0.014741722494363785, 0.04260324314236641, -0.042490728199481964, -0.03377283364534378, 0.04507656395435333, 0.03705034777522087, -0.019746845588088036, -0.05689480

# Retrieve the top-k documents

In [5]:
features = store.retrieve_online_documents(
    feature="city_embeddings:Embeddings",
    query=query,
    top_k=3
).to_dict()

def print_online_features(features):
    for key, value in sorted(features.items()):
        print(key, " : ", value)

print_online_features(features)

Embeddings  :  [[0.11749927699565887, -0.04684491828083992, 0.074561707675457, 0.10036394000053406, -0.02789139188826084, 0.004901227541267872, -0.025490708649158478, -0.014385512098670006, -0.03353535756468773, -0.03694501891732216, 0.019829893484711647, -0.08767078071832657, 0.15164919197559357, -0.05422529578208923, 0.04684631526470184, -0.016555113717913628, 0.06950949877500534, 0.012052210047841072, 0.024535944685339928, -0.0060577718541026115, 0.06979842483997345, 0.026241665706038475, -0.06335429847240448, 0.03742428496479988, -0.006074287462979555, 0.12012293934822083, 0.012978488579392433, 0.019200358539819717, -0.09065929055213928, -0.010197900235652924, 0.046665437519550323, 0.07225364446640015, 0.07100000977516174, -0.08593559265136719, 0.05330311879515648, 0.004392698407173157, -0.06441846489906311, -0.006751690525561571, -0.04681907594203949, -0.006416881922632456, 0.0013941957149654627, -0.014143028296530247, 0.03822663053870201, 0.06176742911338806, -0.07114912569522858

The top 3 document embeddings are returned, along with their distances to the query vector.
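
The raw dictionary is hard to read because each embedding is a long list. Below is a minimal sketch of pairing each returned embedding with its distance and sorting by distance. It assumes the result dictionary holds parallel lists, one entry per document, and that the distances are stored under a key named `distance`. Both are assumptions, so check the keys printed above before relying on them. The `rank_documents` helper and the toy values are hypothetical and made up for illustration.

```python
def rank_documents(result, embedding_key="Embeddings", distance_key="distance"):
    """Pair each embedding with its distance and sort by ascending distance.

    Assumption: `result` maps each key to a list with one entry per document,
    and the lists are aligned by position.
    """
    pairs = zip(result[distance_key], result[embedding_key])
    return sorted(pairs, key=lambda pair: pair[0])


# Toy result, invented for this example; real embeddings are much longer.
toy_result = {
    "Embeddings": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "distance": [0.42, 0.17, 0.88],
}

ranked = rank_documents(toy_result)
for rank, (distance, embedding) in enumerate(ranked, start=1):
    # Print only the first two components to keep the output short.
    print(rank, distance, embedding[:2])
```

With the real output, the same call would be `rank_documents(features)`, adjusting `distance_key` if the key name differs.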