# Create Pinecone.io index from documents
- Converts the document extracts produced by the document preprocessing pipeline into embeddings
  and stores them in a Pinecone index.
- Requires s3_config.json, which specifies the S3 bucket and directory that contain the documents.
  website_extracts.csv must be present in that documents directory.
- Computing embeddings on a CPU instance is very slow and may crash. A GPU instance such as
  ml.g4dn.xlarge is recommended.
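
As a rough illustration of the configuration step, the sketch below parses a config and derives the S3 location of website_extracts.csv. The key names `bucket` and `directory` and both helper functions are assumptions made for illustration; the actual schema of s3_config.json is not shown in this notebook.

```python
import json


def load_s3_config(text: str) -> dict:
    """Parse the config JSON and check the two fields this sketch relies on."""
    config = json.loads(text)
    for key in ("bucket", "directory"):  # hypothetical key names
        if key not in config:
            raise KeyError(f"s3_config.json is missing '{key}'")
    return config


def extracts_uri(config: dict, filename: str = "website_extracts.csv") -> str:
    """Build the s3:// URI of a file inside the configured documents directory."""
    directory = config["directory"].strip("/")
    return f"s3://{config['bucket']}/{directory}/{filename}"


# Example values taken from the download log in the "Call Script" output.
example = '{"bucket": "sagemaker-document-embeddings", "directory": "documents"}'
config = load_s3_config(example)
print(extracts_uri(config))
```

In a real run the JSON text would come from reading s3_config.json from disk rather than from an inline string.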
  
## Recommended setup
- *Necessary*: Use the Data Science 2.0 kernel, or any kernel with conda and Python >= 3.8
- When computing embeddings: use an instance with at least one GPU, e.g. ml.g4dn.xlarge
- When using precomputed embeddings: ml.t3.medium is sufficient
- If not using a GPU, set the gpu_available variable in the first cell to False
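
Instead of setting the flag by hand, it could be derived from the environment. This is a minimal sketch, assuming PyTorch is the backend that sentence-transformers runs on; the helper name `detect_gpu` and the `device` variable are hypothetical and not part of the notebook's script.

```python
def detect_gpu() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch  # installed as a dependency of sentence-transformers
    except ImportError:
        return False
    return torch.cuda.is_available()


gpu_available = detect_gpu()
# Device string in the form PyTorch-based libraries generally accept.
device = "cuda" if gpu_available else "cpu"
print(f"gpu_available={gpu_available}, device={device}")
```

On an ml.t3.medium instance this should report False, which matches the manual setting described above.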

## Install prerequisites

In [2]:
!pip install sentence-transformers
!pip install huggingface_hub transformers
!pip install pinecone-client pinecone-text
!pip install langchain==0.0.218

Collecting sentence-transformers
  Using cached sentence_transformers-2.2.2-py3-none-any.whl
Collecting huggingface-hub>=0.4.0
  Using cached huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting torch>=1.6.0
  Using cached torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl (619.9 MB)
Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 9.9 MB/s eta 0:00:01
Collecting sentencepiece
  Using cached sentencepiece-0.1.99-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting scipy
  Downloading scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
     |████████████████████████████████| 34.5 MB 49.5 MB/s eta 0:00:01
Collecting tqdm
  Using cached tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting numpy
  Using cached numpy-1.24.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting scikit-learn
  Downloading scikit_learn-1.3.0-cp38-cp38-manylin

In [6]:
# Needed only on the base Python image, not on the Data Science image
!pip install pandas
!pip install sagemaker

You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Collecting sagemaker
  Downloading sagemaker-2.173.0.tar.gz (854 kB)
     |████████████████████████████████| 854 kB 6.6 MB/s eta 0:00:01
Collecting cloudpickle==2.2.1
  Using cached cloudpickle-2.2.1-py3-none-any.whl (25 kB)
Collecting google-pasta
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting protobuf<5.0,>=3.12
  Using cached protobuf-4.23.4-cp37-abi3-manylinux2014_x86_64.whl (304 kB)
Collecting smdebug_rulesconfig==1.0.1
  Using cached smdebug_rulesconfig-1.0.1-py2.py3-none-any.whl (20 kB)
Collecting pathos
  Using cached pathos-0.3.0-py3-none-any.whl (79 kB)
Collecting schema
  Using cached schema-0.7.5-py2.py3-none-any.whl (17 kB)
Collecting PyYAML~=6.0
  Downloading PyYAML-6.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (736 kB)
     |████████████████████████████████| 736 kB 77.3 MB/s eta 0:00:01
Collecting jsonschema
  Downloa

## Run the script

In [10]:
!python pinecone_combined_script.py

Downloading from s3://sagemaker-document-embeddings/documents/faculties.txt to documents/faculties.txt
Downloading from s3://sagemaker-document-embeddings/documents/website_extracts.csv to documents/website_extracts.csv
Downloading from s3://sagemaker-document-embeddings/documents/website_graph.txt to documents/website_graph.txt
Computing parent_title_embeddings
Saving parent_title_embeddings to directory
Computing title_embeddings
Saving title_embeddings to directory
Computing combined_title_embeddings
Saving combined_title_embeddings to directory
Computing document_embeddings
Saving document_embeddings to directory
Fitting bm25 to the texts
100%|████████████████████████████████████| 15586/15586 [00:38<00:00, 401.15it/s]
Computing bm25 vectors
Saving bm25 vectors to directory
Pinecone index langchain-hybrid-index already exists
Clearing the existing pinecone index for namespace all-mpnet-base-v2
Beginning batches upsert to pinecone index
100%|█████████████████████████████████████████|
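
The final log lines describe a batched upsert into the index. The internals of pinecone_combined_script.py are not shown here, so the following is only a sketch of how such batching might look. The helper names, the batch size, and the toy records are assumptions; the record layout (`id`, `values`, `sparse_values`) is assumed to follow Pinecone's format for hybrid dense and sparse vectors. A stand-in index class replaces the real client so the example runs without credentials.

```python
from typing import Iterable, Iterator, List


def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch


def upsert_in_batches(index, vectors: Iterable[dict], namespace: str,
                      batch_size: int = 100) -> int:
    """Call index.upsert once per batch and return the number of vectors sent."""
    sent = 0
    for batch in batched(vectors, batch_size):
        index.upsert(vectors=batch, namespace=namespace)
        sent += len(batch)
    return sent


class FakeIndex:
    """Stand-in for a Pinecone index handle; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def upsert(self, vectors, namespace):
        self.calls.append((len(vectors), namespace))


# Toy records pairing a dense embedding with sparse BM25-style weights.
records = [
    {"id": str(i), "values": [0.1, 0.2],
     "sparse_values": {"indices": [i], "values": [1.0]}}
    for i in range(5)
]
index = FakeIndex()
total = upsert_in_batches(index, records, namespace="all-mpnet-base-v2", batch_size=2)
print(total, index.calls)
```

Sending fixed-size batches keeps each request small and makes progress easy to report, which is consistent with the progress bar shown in the log.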