# Create Pinecone.io index from documents
- Converts the document extracts produced by the document preprocessing pipeline into embeddings
  and upserts them into a Pinecone index
- Computing embeddings is very slow on a CPU instance and may crash. A GPU instance such as
  ml.g4dn.xlarge is recommended
  
## Recommended setup
- *Necessary*: Use the Data Science 2.0 kernel, or any kernel with conda and Python >= 3.8
- When computing embeddings: use an instance with at least one GPU, e.g. ml.g4dn.xlarge
- When using precomputed embeddings: ml.t3.medium is sufficient
- If not using a GPU, set the gpu_available variable in the first cell to False
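
As a rough aid for choosing the gpu_available value, the sketch below guesses whether the current instance has an NVIDIA GPU by checking for the nvidia-smi binary on the PATH. The helper name detect_gpu is hypothetical and not part of this notebook or its script, and the check is only a heuristic: the binary being present does not guarantee a usable GPU. If PyTorch is already installed, torch.cuda.is_available() is a more direct check.

```python
import shutil


def detect_gpu() -> bool:
    """Heuristic (assumption): an NVIDIA GPU instance has nvidia-smi on the PATH."""
    return shutil.which("nvidia-smi") is not None


# Suggested value for the gpu_available option; override it manually if the guess is wrong.
gpu_available = detect_gpu()
print(f"gpu_available = {gpu_available}")
```

On a CPU-only instance such as ml.t3.medium this would normally print False, which matches the setting recommended above.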

### Options

In [2]:
# Set compute_embeddings to True to recompute the embeddings.
# Otherwise, precomputed embeddings are downloaded from S3.
compute_embeddings = False
# Set clear_index to True to clear any existing index contents before inserting documents
clear_index = True
# Set gpu_available to True when using a GPU instance, especially when computing embeddings
gpu_available = False

## Install prerequisites

In [3]:
%%capture
!pip install sentence-transformers
!pip install huggingface_hub transformers
!pip install pinecone-client pinecone-text
!pip install langchain==0.0.218

In [4]:
%%capture
# Only needed on the base Python image; not required on the Data Science image
!pip install pandas
!pip install sagemaker

## Call script

In [7]:
args = []
args.append('--compute_embeddings' if compute_embeddings else '--no-compute_embeddings')
args.append('--clear_index' if clear_index else '--no-clear_index')
args.append('--gpu_available' if gpu_available else '--no-gpu_available')
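
The paired `--flag` / `--no-flag` arguments built above have the shape that Python's argparse.BooleanOptionalAction produces. The sketch below shows how a script could accept them. It is an assumption about how pinecone_combined_script.py parses its arguments, not a copy of it, and the names build_parser and FLAG_NAMES are hypothetical. BooleanOptionalAction requires Python 3.9 or later.

```python
import argparse

# Option names taken from the cell above; everything else here is illustrative.
FLAG_NAMES = ("compute_embeddings", "clear_index", "gpu_available")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts --<name> and --no-<name> for each boolean option."""
    parser = argparse.ArgumentParser(description="Sketch of a boolean flag parser")
    for name in FLAG_NAMES:
        parser.add_argument(
            f"--{name}",
            action=argparse.BooleanOptionalAction,  # also registers --no-<name>
            default=False,
        )
    return parser


# Parse the same kind of list that the notebook cell builds.
example_args = ["--no-compute_embeddings", "--clear_index", "--no-gpu_available"]
options = build_parser().parse_args(example_args)
print(options.compute_embeddings, options.clear_index, options.gpu_available)
```

Passing every option explicitly, as the notebook does, keeps the run independent of whatever defaults the script defines.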

In [None]:
!python pinecone_combined_script.py {' '.join(args)}

Loaded embeddings combined_title_embeddings
Loaded embeddings title_embeddings
Loaded embeddings parent_title_embeddings
Loaded embeddings document_embeddings
Loaded embeddings bm25
Pinecone index langchain-hybrid-index already exists
Clearing the existing pinecone index for namespace all-mpnet-base-v2
Beginning batches upsert to pinecone index
100%|█████████████████████████████████████████| 462/462 [03:46<00:00,  2.04it/s]
Completed upsert to pinecone index
