# Create PGVector tables from documents
- Converts the document extracts produced by the document preprocessing pipeline into embeddings
  and loads them into an RDS database with PGVector.
- Computing embeddings is very slow on a CPU instance and may crash. A GPU instance such as
  ml.g4dn.xlarge is recommended.
  
## Recommended setup
- *Necessary*: Use the Data Science 2.0 kernel, or any kernel with conda and Python >= 3.8.
- When computing embeddings: use an instance with at least one GPU, e.g. ml.g4dn.xlarge.
- When using precomputed embeddings: ml.t3.medium is sufficient.
- If not using a GPU, set the gpu_available variable in the first cell to False.

### Options

In [71]:
# Set compute_embeddings to True to recompute the embeddings.
# Otherwise, precomputed embeddings are downloaded from S3.
compute_embeddings = False
# Set clear_index to True to clear any existing table before inserting documents.
clear_index = True
# Set gpu_available to True when using a GPU instance, especially when computing embeddings.
gpu_available = False
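
Rather than setting gpu_available by hand, the value could be derived from the environment. The following is a minimal sketch, not part of the original notebook: `detect_gpu` is a hypothetical helper that asks PyTorch when it is installed and otherwise falls back to checking whether `nvidia-smi` is on the PATH, which is only a heuristic.

```python
import importlib.util
import shutil


def detect_gpu() -> bool:
    """Best-effort check for a usable NVIDIA GPU (hypothetical helper)."""
    # Prefer PyTorch's own CUDA check when torch is installed.
    if importlib.util.find_spec("torch") is not None:
        try:
            import torch

            return bool(torch.cuda.is_available())
        except ImportError:
            # torch is present but cannot be imported; fall through to the heuristic.
            pass
    # Fallback heuristic: the NVIDIA driver utility is available on PATH.
    return shutil.which("nvidia-smi") is not None


gpu_detected = detect_gpu()
print(f"GPU detected: {gpu_detected}")
```

If this is used, assigning `gpu_available = detect_gpu()` in the options cell would replace the manual setting. The PyTorch check is only available once the prerequisites below have been installed.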

## Install prerequisites

In [2]:
%%capture
!pip install huggingface_hub transformers
# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install "langchain>=0.0.240"
!pip install pandas
!pip install pgvector
!pip install psycopg2-binary
!pip install -U sentence-transformers

## Call script

In [76]:
# Translate the notebook options into --flag / --no-flag command-line arguments
args = []
args.append('--compute_embeddings' if compute_embeddings else '--no-compute_embeddings')
args.append('--clear_index' if clear_index else '--no-clear_index')
args.append('--gpu_available' if gpu_available else '--no-gpu_available')
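
The `--flag` / `--no-flag` pairs built in this cell match the convention that `argparse.BooleanOptionalAction` implements. The source of rds_combined_script.py is not shown here, so the following is only a sketch of how a script could accept these arguments, not the script's actual parser. Note that `BooleanOptionalAction` requires Python 3.9 or later, and the default values are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts --name and --no-name for each option."""
    parser = argparse.ArgumentParser(description="Hypothetical CLI sketch")
    for name in ("compute_embeddings", "clear_index", "gpu_available"):
        # BooleanOptionalAction registers both --<name> and --no-<name>.
        parser.add_argument(
            f"--{name}",
            action=argparse.BooleanOptionalAction,
            default=False,  # assumed default; the real script may differ
        )
    return parser


# Parse the same arguments the notebook passes with the default options.
opts = build_parser().parse_args(
    ["--no-compute_embeddings", "--clear_index", "--no-gpu_available"]
)
print(opts.compute_embeddings, opts.clear_index, opts.gpu_available)
```

Passing both forms explicitly, as the notebook does, keeps the command line unambiguous regardless of the script's defaults.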

In [80]:
!python rds_combined_script.py {' '.join(args)}

Loaded embeddings combined_title_embeddings
Loaded embeddings title_embeddings
Loaded embeddings parent_title_embeddings
Loaded embeddings document_embeddings
Loaded embeddings bm25
Begin upload to db with pgvector
Finished upload to db with pgvector
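
For orientation, the upload step can be pictured as a handful of SQL statements against a pgvector-enabled database. The sketch below only builds the statements and does not connect to anything. The table name `demo_documents`, its columns, the dimension 384, and the choice to drop the table when clear_index is set are all assumptions for illustration; the actual script may use a different schema or clear the table differently.

```python
from typing import List, Sequence


def to_vector_literal(values: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def build_statements(table: str, dim: int, clear_index: bool) -> List[str]:
    """Return setup statements for a pgvector table (hypothetical schema).

    The table name is interpolated directly, so it must come from trusted
    code, never from user input.
    """
    statements = ["CREATE EXTENSION IF NOT EXISTS vector;"]
    if clear_index:
        # Assumption: "clearing" is modelled here as dropping the table.
        statements.append(f"DROP TABLE IF EXISTS {table};")
    statements.append(
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"(id bigserial PRIMARY KEY, content text, embedding vector({dim}));"
    )
    return statements


# Parameterized insert; the embedding is sent as text and cast to vector.
INSERT_SQL = (
    "INSERT INTO demo_documents (content, embedding) VALUES (%s, %s::vector);"
)

for statement in build_statements("demo_documents", dim=384, clear_index=True):
    print(statement)
print(INSERT_SQL, ("example text", to_vector_literal([0.1, 0.2, 0.3])))
```

With psycopg2, each statement would be passed to `cursor.execute`, with the insert parameters supplied as the second argument.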
