# Module 1: Use Postgres (Pgvector) vector database as an online store for retrieving documents

## 1. Overview
In this notebook, we explore how to use Feast to retrieve documents from a Postgres (Pgvector) vector database. We use the `city_embeddings` feature view created in the previous notebook and call the `retrieve_online_documents` method to fetch the top-k documents closest to a query vector.
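To make "top-k documents closest to the query vector" concrete, here is a minimal pure-Python sketch of nearest-neighbor ranking. It assumes Euclidean (L2) distance, which is an assumption for illustration only; the metric actually used by the online store is not stated in this notebook. The function names and the toy 2-D vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_nearest(
    query: Sequence[float], documents: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k documents closest to the query."""
    scored = [(i, l2_distance(query, doc)) for i, doc in enumerate(documents)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 2-D "embeddings", made up for illustration (not real model output).
toy_documents = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
toy_query = [0.9, 0.1]
nearest = top_k_nearest(toy_query, toy_documents, k=2)
print(nearest)
```

A vector database performs the same ranking, typically with an index so that it does not have to scan every stored vector.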

If you haven't already, look at the [README](../README.md) for setup instructions prior to starting this notebook.

# 1. Set up the feature store

### Apply feature repository
We first run `feast apply` to register the data sources and features and to set up the Postgres (Pgvector) online store.

In [1]:
import os
import pandas as pd
from feast import FeatureStore

from batch_score_documents import run_model, TOKENIZER, MODEL
from transformers import AutoTokenizer, AutoModel

In [2]:
df = pd.read_parquet("./feature_repo/data/city_wikipedia_summaries_with_embeddings.parquet")

In [3]:
df.head()

| | State | Wiki Summary | Embeddings | event_timestamp | item_id |
|---|---|---|---|---|---|
| 0 | New York, New York | New York, often called New York City or simply... | [0.17517076, -0.1259909, 0.019542355, 0.030451... | 2024-05-01 22:24:21.593813 | 0 |
| 1 | Los Angeles, California | Los Angeles, often referred to by its initials... | [0.16593967, -0.10821897, 0.043743934, 0.01682... | 2024-05-01 22:24:21.593813 | 1 |
| 2 | Chicago, Illinois | Chicago is the most populous city in the U.S. ... | [0.16295174, -0.063115865, 0.048169453, 0.0283... | 2024-05-01 22:24:21.593813 | 2 |
| 3 | Houston, Texas | Houston ( ; HEW-stən) is the most populous cit... | [0.10329512, -0.078975916, 0.045779355, 0.0774... | 2024-05-01 22:24:21.593813 | 3 |
| 4 | Phoenix, Arizona | Phoenix ( FEE-niks; Spanish: Fénix;) is the ca... | [0.13658537, -0.038460232, -0.06357397, 0.1216... | 2024-05-01 22:24:21.593813 | 4 |


In [5]:
os.chdir("./feature_repo")

In [6]:
os.system("feast apply")

In a future release, Dask DataFrame will use a new implementation that
contains several improvements including a logical query planning.
The user-facing DataFrame API will remain unchanged.

The new implementation is already available and can be enabled by
installing the dask-expr library:

    $ pip install dask-expr

and turning the query planning option on:

    >>> import dask
    >>> dask.config.set({'dataframe.query-planning': True})
    >>> import dask.dataframe as dd

API documentation for the new implementation is available at
https://docs.dask.org/en/stable/dask-expr-api.html

Any feedback can be reported on the Dask issue tracker
https://github.com/dask/dask/issues 


    # via Python

    # via CLI


  import dask.dataframe as dd


Deploying infrastructure for city_embeddings


0

# 2. Materialize training data
The datasets are prepared in the data directory. You can get them locally by running the commands in README.md. We will materialize the training data into the online store.

In [7]:
os.system('CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S");feast materialize-incremental $CURRENT_TIME')
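The shell one-liner above computes a UTC timestamp and passes it to `feast materialize-incremental`. As a sketch, the same command can be assembled in Python so that it does not depend on the shell's `date` utility. The helper name `build_materialize_command` is hypothetical; only the `feast materialize-incremental <timestamp>` invocation comes from the cell above.

```python
from datetime import datetime, timezone
from typing import List, Optional


def build_materialize_command(now: Optional[datetime] = None) -> List[str]:
    """Build the `feast materialize-incremental` argument list for a UTC end time."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Same format as the shell call: date -u +"%Y-%m-%dT%H:%M:%S"
    end_time = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return ["feast", "materialize-incremental", end_time]


# Fixed timestamp so the example is reproducible.
example = build_materialize_command(
    datetime(2024, 5, 11, 2, 41, 47, tzinfo=timezone.utc)
)
print(example)

# To actually run it from inside the feature repo directory:
# import subprocess
# subprocess.run(build_materialize_command(), check=True)
```

Passing an argument list to `subprocess.run` with `check=True` raises an exception on a non-zero exit code, whereas `os.system` only returns the status (the `0` printed below the cell).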

0it [00:00, ?it/s]


Materializing 1 feature views to 2024-05-10 22:41:47-04:00 into the postgres online store.

city_embeddings from 2024-05-11 00:41:49-04:00 to 2024-05-10 22:41:47-04:00:


0

## Now, we instantiate a Feast `FeatureStore` object to retrieve documents from

In [8]:
store = FeatureStore(repo_path=".")



# Prepare a query vector

In [9]:
question = "the most populous city in the U.S. state of Texas?"

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
model = AutoModel.from_pretrained(MODEL)
query_embedding = run_model(question, tokenizer, model)
query = query_embedding.detach().cpu().numpy().tolist()[0]
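The `run_model` helper is imported from `batch_score_documents` and its internals are not shown in this notebook. One common way to turn per-token model outputs into a single sentence vector is mask-aware mean pooling; the sketch below illustrates that idea in pure Python. It is an assumption for illustration, not a description of what `run_model` actually does, and the function name and toy values are hypothetical.

```python
from typing import List


def mean_pool(
    token_embeddings: List[List[float]], attention_mask: List[int]
) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three toy token vectors; the last one is padding and is masked out.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
pooled = mean_pool(toy_tokens, toy_mask)
print(pooled)
```

Whatever pooling is used, the query must be embedded the same way as the stored documents, which is why the cell above reuses the same `TOKENIZER`, `MODEL`, and `run_model`.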

# Retrieve the top-k documents

In [10]:
features = store.retrieve_online_documents(
    feature="city_embeddings:Embeddings",
    query=query,
    top_k=3
)

## You can see the top 3 document embeddings as well as the distance returned

In [11]:
features.to_df()

| | Embeddings | distance |
|---|---|---|
| 0 | [0.11749928444623947, -0.04684492573142052, 0.... | 0.935567 |
| 1 | [0.10329511761665344, -0.07897591590881348, 0.... | 0.939936 |
| 2 | [0.11634305864572525, -0.10321836173534393, -0... | 0.983343 |
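The result contains only embeddings and distances, so recovering which city each row refers to requires matching the returned vectors against the source data. The sketch below shows one way to do that with a tolerance-based comparison. The helper `find_matching_label` is hypothetical, and the catalog holds only the first three components copied from the `df.head()` output earlier in this notebook; a real lookup would compare full vectors.

```python
import math
from typing import Dict, Optional, Sequence


def find_matching_label(
    embedding: Sequence[float],
    catalog: Dict[str, Sequence[float]],
    rel_tol: float = 1e-5,
) -> Optional[str]:
    """Return the label whose vector matches `embedding` within tolerance, else None."""
    for label, candidate in catalog.items():
        if len(candidate) == len(embedding) and all(
            math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-8)
            for a, b in zip(embedding, candidate)
        ):
            return label
    return None


# Three-component prefixes taken from the df.head() output (truncated vectors).
catalog = {
    "New York, New York": [0.17517076, -0.1259909, 0.019542355],
    "Houston, Texas": [0.10329512, -0.078975916, 0.045779355],
}
# Prefix of the second returned embedding, at the precision the store returned it.
returned = [0.10329511761665344, -0.07897591590881348, 0.045779354870319366]
match = find_matching_label(returned, catalog)
print(match)
```

A tolerance is used rather than exact equality because the stored values come back with slightly different floating-point precision than the values shown in the source DataFrame.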


### As a dictionary returning the first 3 embedding values

In [15]:
import json

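def print_online_features(features: dict, k: int = 3):
    """Print each returned field as JSON, keeping only the first k values of every embedding."""
    for key, value in sorted(features.items()):
        if key == 'Embeddings':
            # Truncate each embedding so the output stays readable.
            print(json.dumps({key: [v[0:k] for v in value]}, indent=2))
        else:
            print(json.dumps({key: value}, indent=2))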

print_online_features(features.to_dict())

{
  "Embeddings": [
    [
      0.11749928444623947,
      -0.04684492573142052,
      0.0745617225766182
    ],
    [
      0.10329511761665344,
      -0.07897591590881348,
      0.045779354870319366
    ],
    [
      0.11634305864572525,
      -0.10321836173534393,
      -0.0071899304166436195
    ]
  ]
}
{
  "distance": [
    0.9355665445327759,
    0.9399362802505493,
    0.9833431243896484
  ]
}


# END