# Create Pinecone.io index from documents
- Converts the document extracts produced by the document preprocessing pipeline into embeddings
  and inserts them into a Pinecone index.
- Computing embeddings is very slow on a CPU instance and may crash. A GPU instance such as
  ml.g4dn.xlarge is recommended.
  
## Recommended setup
- *Necessary*: Use the Data Science 2.0 kernel, or any kernel with conda and Python >= 3.8
- When computing embeddings: use an instance with at least 1 GPU, e.g. ml.g4dn.xlarge
- When using precomputed embeddings: ml.t3.medium is sufficient
- If not using a GPU, set the gpu_available variable in the first cell to False
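
If it is unclear whether the current instance has a usable GPU, the flag can be derived instead of set by hand. The following is a minimal sketch, not part of the original notebook: `detect_gpu` is a hypothetical helper, and it assumes PyTorch is the backend (it is installed as a dependency of sentence-transformers). It falls back to `False` when PyTorch is missing.

```python
def detect_gpu() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device.

    Hypothetical helper for illustration; the notebook itself sets
    gpu_available manually.
    """
    try:
        import torch  # pulled in by sentence-transformers
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


gpu_available = detect_gpu()
# Device string in the form PyTorch and sentence-transformers accept.
device = "cuda" if gpu_available else "cpu"
print(f"gpu_available={gpu_available}, device={device}")
```

On an ml.t3.medium instance this should report `cpu`; on a GPU instance with working drivers it should report `cuda`.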

### Options

In [None]:
# Set compute_embeddings to True to recompute the embeddings.
# Otherwise, precomputed embeddings are downloaded from S3.
compute_embeddings = False
# Set clear_index to True to clear the existing index before inserting documents.
clear_index = True
# Set gpu_available to True when using a GPU instance, especially when computing embeddings.
gpu_available = False
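
The `compute_embeddings` switch follows a common "recompute or reuse" pattern. The sketch below illustrates that pattern only; it is not the code in `pinecone_combined_script.py`. `load_or_compute` and `fake_embed` are hypothetical names, the cache is a local pickle file, and the S3 download step is left out.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_compute(path: Path, compute_fn: Callable[[], Any], recompute: bool) -> Any:
    """Return the object cached at `path`, or compute it and cache the result."""
    if path.exists() and not recompute:
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def fake_embed():
    """Stand-in for an expensive embedding computation (assumption)."""
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "embeddings" / "document_embeddings.pkl"
    first = load_or_compute(cache, fake_embed, recompute=False)   # computes, then caches
    second = load_or_compute(cache, fake_embed, recompute=False)  # served from the cache
    third = load_or_compute(cache, fake_embed, recompute=True)    # forced recompute
```

With `recompute=False` the expensive function runs only when no cached file exists, which is why a small CPU instance is enough once the embeddings have been precomputed.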

## Install prerequisites

In [None]:
%%capture
!pip install sentence-transformers
!pip install huggingface_hub transformers
!pip install pinecone-client pinecone-text
!pip install langchain==0.0.218

In [None]:
%%capture
# Needed only on the base Python image; not required on the Data Science image
!pip install pandas
!pip install sagemaker

## Call Script

In [10]:
!python pinecone_combined_script.py

Downloading from s3://sagemaker-document-embeddings/documents/faculties.txt to documents/faculties.txt
Downloading from s3://sagemaker-document-embeddings/documents/website_extracts.csv to documents/website_extracts.csv
Downloading from s3://sagemaker-document-embeddings/documents/website_graph.txt to documents/website_graph.txt
Computing parent_title_embeddings
Saving parent_title_embeddings to directory
Computing title_embeddings
Saving title_embeddings to directory
Computing combined_title_embeddings
Saving combined_title_embeddings to directory
Computing document_embeddings
Saving document_embeddings to directory
Fitting bm25 to the texts
100%|████████████████████████████████████| 15586/15586 [00:38<00:00, 401.15it/s]
Computing bm25 vectors
Saving bm25 vectors to directory
Pinecone index langchain-hybrid-index already exists
Clearing the existing pinecone index for namespace all-mpnet-base-v2
Beginning batches upsert to pinecone index
100%|█████████████████████████████████████████|
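
The log above ends with a batched upsert. As a rough illustration of that step, the sketch below splits records into fixed-size batches and hands each batch to a callback. It is an assumption about the general approach, not the script's actual code: `batched`, `upsert_in_batches`, the batch size of 100, and the record layout are all hypothetical. With a real index, `upsert_fn` could wrap the Pinecone client's `index.upsert(vectors=batch, namespace=...)` call.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def upsert_in_batches(
    records: Iterable[T],
    upsert_fn: Callable[[List[T]], object],
    batch_size: int = 100,
) -> int:
    """Send records to `upsert_fn` batch by batch; return the number sent."""
    total = 0
    for batch in batched(records, batch_size):
        upsert_fn(batch)
        total += len(batch)
    return total


# Demo with a fake sink in place of a real Pinecone index.
sent = []
records = [{"id": f"doc-{i}", "values": [0.0, 1.0]} for i in range(250)]
n = upsert_in_batches(records, sent.append, batch_size=100)
print(n, [len(b) for b in sent])
```

Sending records in batches keeps each request small and makes it straightforward to wrap the loop in a progress bar.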