# Create Pinecone.io index from documents
- Converts the document extracts produced by the document preprocessing pipeline into embeddings
  and stores them in a Pinecone index
- Requires s3_config.json, which specifies the S3 bucket and directory that contain the documents.
  website_extracts.csv should be in that documents directory
- Computing embeddings on a CPU instance is very slow and may crash. A GPU instance such as
  ml.g4dn.xlarge is recommended
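
The role of s3_config.json can be illustrated with a minimal sketch. The key names `bucket` and `documents_dir` and the helper functions below are assumptions made for illustration only; the actual file used by the script may be structured differently.

```python
import json

# Hypothetical key names; the real s3_config.json may use different ones.
REQUIRED_KEYS = ("bucket", "documents_dir")


def parse_s3_config(text):
    """Parse the JSON config text and check that the assumed keys are present."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"s3_config.json is missing keys: {missing}")
    return config


def extracts_key(config):
    """Build the S3 key where website_extracts.csv is expected to live."""
    return f"{config['documents_dir'].strip('/')}/website_extracts.csv"


# Example config text with placeholder values (not the real bucket).
example = '{"bucket": "my-example-bucket", "documents_dir": "documents"}'
config = parse_s3_config(example)
print(f"s3://{config['bucket']}/{extracts_key(config)}")
```

Validating the config up front makes a missing bucket or directory entry fail immediately, rather than partway through the download step.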
  
## Recommended setup
- *Necessary*: Use the Data Science 2.0 kernel, or any kernel with conda and Python >= 3.8
- When computing embeddings: an instance with at least 1 GPU, e.g. ml.g4dn.xlarge
- When using precomputed embeddings: ml.t3.medium is sufficient
- If not using a GPU, set the gpu_available variable in the first cell to False
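
The notebook expects gpu_available to be set by hand. As an optional alternative, the flag could be derived from PyTorch, which is installed as a dependency of sentence-transformers. This is a sketch only: the `detect_gpu` helper and the `device` variable are hypothetical names and are not part of the original notebook.

```python
import importlib.util


def detect_gpu():
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed yet, so assume no GPU support.
        return False
    import torch

    return torch.cuda.is_available()


gpu_available = detect_gpu()
# Device string in the form PyTorch-based libraries commonly accept.
device = "cuda" if gpu_available else "cpu"
print(f"gpu_available={gpu_available}, device={device}")
```

On an ml.t3.medium instance this would be expected to fall back to the CPU, which matches the precomputed-embeddings setup described above.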

## Install prerequisites

In [2]:
!pip install sentence-transformers
!pip install huggingface_hub transformers
!pip install pinecone-client pinecone-text
!pip install langchain==0.0.218

Collecting sentence-transformers
  Using cached sentence_transformers-2.2.2-py3-none-any.whl
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Using cached transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting torch>=1.6.0 (from sentence-transformers)
  Using cached torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl (619.9 MB)
Collecting torchvision (from sentence-transformers)
  Using cached torchvision-0.15.2-cp38-cp38-manylinux1_x86_64.whl (33.8 MB)
Collecting sentencepiece (from sentence-transformers)
  Using cached sentencepiece-0.1.99-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0 (from sentence-transformers)
  Using cached huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99 (from torch>=1.6.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
Collecting nvidia-cuda-runtime-cu11==11.7.99 (from torch>=1.6.0->sente

## Call Script

In [22]:
!python pinecone_combined_script.py

Downloading from s3://sagemaker-document-embeddings/documents/faculties.txt to documents/faculties.txt
Downloading from s3://sagemaker-document-embeddings/documents/website_extracts.csv to documents/website_extracts.csv
Downloading from s3://sagemaker-document-embeddings/documents/website_graph.txt to documents/website_graph.txt
Downloading from s3://sagemaker-document-embeddings/embeddings-all-mpnet-base-v2/bm25.pkl to embeddings-all-mpnet-base-v2/bm25.pkl
Downloading from s3://sagemaker-document-embeddings/embeddings-all-mpnet-base-v2/combined_title_embeddings.pkl to embeddings-all-mpnet-base-v2/combined_title_embeddings.pkl
Downloading from s3://sagemaker-document-embeddings/embeddings-all-mpnet-base-v2/document_embeddings.pkl to embeddings-all-mpnet-base-v2/document_embeddings.pkl
Downloading from s3://sagemaker-document-embeddings/embeddings-all-mpnet-base-v2/parent_title_embeddings.pkl to embeddings-all-mpnet-base-v2/parent_title_embeddings.pkl
Downloading from s3://sagemaker-doc