# Vector Stores

- a vector store can be a SQL, NoSQL, graph, or dedicated vector database
- it provides native support for storing and querying vector data

## Example databases
- PostgreSQL (with the `pgvector` extension)
- Apache Cassandra
- Elasticsearch / OpenSearch (AWS)
- Facebook AI Similarity Search (`FAISS`)
- Pinecone
- Amazon Kendra

![Vector stores databases](./images/7-vector-stores-databases.png)

## Data types
![Vector stores data](./images/7-vector-stores-data.png)

![Vector stores embeddings](./images/7-vector-stores-embeddings.png)

- vector databases find similar items with nearest-neighbour search, either exact K-Nearest Neighbour (KNN) or the faster Approximate Nearest Neighbour (ANN), and rank candidates with a distance metric such as cosine similarity
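
As a rough illustration of the exact (brute-force) variant of nearest-neighbour search, the sketch below ranks a few hand-made three-dimensional vectors by cosine similarity to a query vector. The names `cosineSimilarity` and `nearestNeighbours` and the sample data are made up for this example; real vector stores replace the linear scan with an ANN index.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|); 1 means the same direction.
function cosineSimilarity(a, b) {
    var dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact k-nearest-neighbour search: score every item, sort, keep the top k.
function nearestNeighbours(query, items, k) {
    return items
        .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.vector) }))
        .sort((x, y) => y.score - x.score)
        .slice(0, k);
}

// Hypothetical toy data (real embeddings have hundreds of dimensions).
var sampleItems = [
    { id: 'a', vector: [1, 0, 0] },
    { id: 'b', vector: [0, 1, 0] },
    { id: 'c', vector: [1, 1, 0] }
];

var top2 = nearestNeighbours([1, 0.1, 0], sampleItems, 2);
console.log(top2.map((hit) => hit.id)); // [ 'a', 'c' ]
```

The linear scan costs one similarity computation per stored vector, which is why large stores trade a little accuracy for speed with ANN indexes.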

## Import dependencies

In [None]:
var BedrockEmbeddings = require('@langchain/community/embeddings/bedrock').BedrockEmbeddings;
var Bedrock = require('@langchain/community/llms/bedrock').Bedrock;

## Instantiate the `model` client

In [None]:
var model = new Bedrock({
    model:'amazon.titan-text-express-v1', // the JS client reads `model`; `model_id` is the Python name
    temperature: 1,
    maxTokenCount: 512,
    topP: 0.9,
    verbose: true
});

## Instantiate the `embeddings model` client

In [None]:
var embeddingsClient = new BedrockEmbeddings({
    model:'amazon.titan-embed-text-v2:0',
    region:'us-east-1'
});

## Load data into memory

In [None]:
var PDFLoader = require('@langchain/community/document_loaders/fs/pdf').PDFLoader;

var pdfLoaderClient = new PDFLoader("/workspace/packages/llm/src/notebooks/data/7-2022-Shareholder-Letter.pdf");

var rawPdfDocuments;
pdfLoaderClient.load().then((pdfDocuments) => {
    rawPdfDocuments = pdfDocuments;
    console.log(pdfDocuments);
});

## Split documents

In [None]:
var RecursiveCharacterTextSplitter = require("langchain/text_splitter").RecursiveCharacterTextSplitter;

var splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 300,
    chunkOverlap: 30
});

In [None]:
var splitDocs;

splitter.splitDocuments(rawPdfDocuments).then((data) => {
    splitDocs = data;
    console.log(splitDocs);
});
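
To make `chunkSize` and `chunkOverlap` concrete, here is a simplified fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); `splitWithOverlap` is a made-up helper that only shows how consecutive chunks share their boundary characters.

```javascript
function splitWithOverlap(text, chunkSize, chunkOverlap) {
    if (chunkOverlap >= chunkSize) {
        throw new Error('chunkOverlap must be smaller than chunkSize');
    }
    // Each new chunk starts this many characters after the previous one.
    var step = chunkSize - chunkOverlap;
    var chunks = [];
    for (var start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        // Stop once the current chunk reaches the end of the text.
        if (start + chunkSize >= text.length) break;
    }
    return chunks;
}

var sampleChunks = splitWithOverlap('abcdefghijklmnopqrstuvwxyz', 10, 3);
console.log(sampleChunks);
// [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.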

## Create vector store using `documents` and `embeddings model`

In [None]:
var MemoryVectorStore = require('langchain/vectorstores/memory').MemoryVectorStore;

var vectorStore;

MemoryVectorStore.fromDocuments(splitDocs, embeddingsClient).then((store) => {
    vectorStore = store;

});

## Generate a retriever function

- a retriever is the function handed to `LangChain` that fetches data from the vector store for the model to use; `k` sets how many documents it returns

In [None]:
var retriever = vectorStore.asRetriever({k: 3});

## Create prompt

In [None]:
var ChatPromptTemplate = require('@langchain/core/prompts').ChatPromptTemplate;
var prompt = ChatPromptTemplate.fromTemplate(`
    Answer the user question.
    Context: {context}
    Question: {input}
`);

## Create basic documents chain

In [None]:
var createStuffDocumentsChain = require("langchain/chains/combine_documents").createStuffDocumentsChain;
var combineDocsChain;

createStuffDocumentsChain({
    llm: model,
    prompt,
}).then((chain) => combineDocsChain = chain);
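
The "stuff" strategy is simple: the text of every retrieved document is concatenated and substituted into the prompt's `{context}` placeholder. A minimal sketch of that idea follows, with a hypothetical `stuffPrompt` helper and hand-written documents. The real chain also handles prompt formatting and output parsing, and its exact separator may differ.

```javascript
var sampleTemplate = 'Answer the user question.\nContext: {context}\nQuestion: {input}';

function stuffPrompt(template, docs, input) {
    var values = {
        // Join all document texts into one context string.
        context: docs.map((doc) => doc.pageContent).join('\n\n'),
        input: input
    };
    // Substitute both placeholders in a single pass over the template.
    return template.replace(/\{(context|input)\}/g, (match, key) => values[key]);
}

// Hypothetical documents standing in for retrieved PDF chunks.
var sampleDocs = [
    { pageContent: 'First retrieved chunk.' },
    { pageContent: 'Second retrieved chunk.' }
];

console.log(stuffPrompt(sampleTemplate, sampleDocs, 'What do the chunks say?'));
```

Because everything lands in one prompt, this approach only works while the combined chunks fit in the model's context window, which is one reason the retriever is limited to a few documents.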

## Create retrieval chain

- an enhanced chain that, unlike the previous ones, uses the vector store retriever to fetch context before calling the model

In [None]:
var createRetrievalChain = require('langchain/chains/retrieval').createRetrievalChain;

var retrievalChain;

createRetrievalChain({
    combineDocsChain,
    retriever
}).then((chain) => retrievalChain = chain);

## Configure input query

In [None]:
var input = "Who is Andy Jassy?";

## Invoke LLM

- the LLM is invoked with only the relevant chunks from the vector store: the chain embeds the input, runs a similarity search against the embeddings of all PDF chunks, and passes the best matches to the model as context

In [None]:
retrievalChain.invoke({
    input,
}).then((response) => console.log(response));
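
Conceptually, the retrieval chain performs two awaited steps in order: fetch the most relevant documents for the input, then pass them together with the input to the documents chain. The sketch below mimics that ordering with stand-in functions. `fakeRetrieve`, `fakeAnswer`, and `runRetrievalFlow` are placeholders invented here, not LangChain APIs, and the `{ input, context, answer }` result is modelled on the keys that `createRetrievalChain` results are expected to contain.

```javascript
// Stand-in for the retriever: resolves with the documents to use as context.
async function fakeRetrieve(input) {
    return [{ pageContent: 'Placeholder chunk relevant to: ' + input }];
}

// Stand-in for the documents chain: a real chain would call the LLM here.
async function fakeAnswer({ input, context }) {
    return 'Answer to "' + input + '" based on ' + context.length + ' document(s)';
}

async function runRetrievalFlow(input) {
    // Step 1 must finish before step 2 starts, hence the awaits.
    var context = await fakeRetrieve(input);
    var answer = await fakeAnswer({ input, context });
    return { input, context, answer };
}

runRetrievalFlow('Who is Andy Jassy?').then((response) => console.log(response.answer));
```

Writing the steps inside one `async` function guarantees that each value exists before the next step uses it, which is harder to ensure when results are assigned to shared variables inside separate `.then` callbacks.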

![High level flow](./images/6-embeddings-high-level.png)

## Pinecone

- to save time (no need to deploy VPC, RDS, or OpenSearch clusters, or to set up extra local containers), we will use the `Pinecone` `SaaS` vector store
- the website is [https://www.pinecone.io/](https://www.pinecone.io/)
- a free tier is available that covers one index
- we will get an API key for Pinecone and use it from a local machine or from AWS

### Setup
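1. Install `@langchain/pinecone` together with the Pinecone client `@pinecone-database/pinecone`, which the code in this section imports directly:

```shell
yarn add @langchain/pinecone @pinecone-database/pinecone
```
2. Create a Pinecone account at [https://www.pinecone.io/](https://www.pinecone.io/).
3. Create the index `codechat2024`, using the parameters from [AWS Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-setup.html) for index setup. The index name must match the `PINECONE_INDEX` environment variable, and the index dimension must match the output size of the embedding model.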
## Import environment variables for API keys

In [1]:
require('dotenv').config({path: require('path').resolve(__dirname, '../../.env')});

{
  parsed: {
    PINECONE_API_KEY: '<redacted>',
    PINECONE_INDEX: 'chatbot'
  }
}
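
Since the later cells read these values from `process.env`, it can help to fail fast with a clear message when one of them is missing. A small sketch follows, where `requireEnv` is a hypothetical helper and its second argument exists only so the example can run without real credentials:

```javascript
function requireEnv(name, env) {
    // Fall back to the real environment when no object is passed in.
    var source = env || process.env;
    var value = source[name];
    if (value === undefined || value === '') {
        throw new Error('Missing required environment variable: ' + name);
    }
    return value;
}

// Example with an explicit object instead of process.env.
var exampleEnv = { PINECONE_INDEX: 'chatbot' };
console.log(requireEnv('PINECONE_INDEX', exampleEnv)); // chatbot
```

In the notebook this would be called as `requireEnv('PINECONE_API_KEY')` right after loading the `.env` file, before any client is constructed.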

## Initialize Pinecone Vector Store

In [2]:
var Pinecone = require('@pinecone-database/pinecone').Pinecone;
var PineconeStore = require('@langchain/pinecone').PineconeStore;
var BedrockEmbeddings = require('@langchain/community/embeddings/bedrock').BedrockEmbeddings;

var embeddingsClient = new BedrockEmbeddings({
    model:'amazon.titan-embed-text-v2:0',
    region:'us-east-1'
});

var pineconeClient = new Pinecone({
    apiKey: process.env.PINECONE_API_KEY,
});

var pineconeIndex = pineconeClient.index(process.env.PINECONE_INDEX);
var pineconeStore = new PineconeStore(embeddingsClient, {
    pineconeIndex
});

## Get `LangChain` `Documents` from existing `PDF`

In [3]:
var PDFLoader = require('@langchain/community/document_loaders/fs/pdf').PDFLoader;

var pdfLoaderClient = new PDFLoader("/workspace/packages/llm/src/notebooks/data/7-2022-Shareholder-Letter.pdf");

var rawPdfDocuments;
pdfLoaderClient.load().then((pdfDocuments) => {
    rawPdfDocuments = pdfDocuments;
    console.log(pdfDocuments);
});

Promise { <pending> }

[
  Document {
    pageContent: 'Dear shareholders:\n' +
      'As I sit down to write my second annual shareholder letter as CEO, I find myself optimistic and energized\n' +
      'by what lies ahead for Amazon. Despite 2022 being one of the harder macroeconomic years in recent memory,\n' +
      'and with some of our own operating challenges to boot, we still found a way to grow demand (on top of\n' +
      'the unprecedented growth we experienced in the first half of the pandemic). We innovated in our largest\n' +
      'businesses to meaningfully improve customer experience short and long term. And, we made important\n' +
      'adjustments in our investment decisions and the way in which we’ll invent moving forward, while still\n' +
      'preserving the long-term investments that we believe can change the future of Amazon for customers,\n' +
      'shareholders, and employees.\n' +
      'While there were an unusual number of simultaneous challenges this past year, the reality is

## Split documents

In [None]:
var RecursiveCharacterTextSplitter = require("langchain/text_splitter").RecursiveCharacterTextSplitter;

var splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 300,
    chunkOverlap: 30
});

var splitDocs;

splitter.splitDocuments(rawPdfDocuments).then((data) => {
    splitDocs = data;
    // console.log(splitDocs);
    console.log("Splitting done");
});

Promise { <pending> }

Splitting done


## Load documents in `Pinecone` vector store

In [5]:
pineconeStore.addDocuments(splitDocs).then((ids) => {
    console.log(ids);
    console.log("Successfully added documents to Pinecone!");
});

Promise { <pending> }

[
  '8f1bb613-8cc7-4bf7-bd38-8ccd037c9804',
  '8e4cac25-be78-47a4-87e0-d2f41278e78e',
  '54a71db7-301a-4221-8a41-79e0e3d41530',
  '9054f197-e2fc-4378-bd9c-021b626d24fc',
  'd0bc9405-0a50-4d29-a39d-7dcddb4bc32a',
  'c0a94c61-69fc-4939-8c25-d54e8769df61',
  '011bd8a8-fc4e-4d6c-bec1-577f6096fc19',
  '6d5ddde3-c381-47c5-9612-8aa93fea56f6',
  'ac0cdf71-7b68-44d8-be56-c649ea195b38',
  '445f90b6-a054-4cfe-9874-85d6bf926575',
  '54a036ef-d4ac-4ce7-8780-11fba22dd093',
  '0c784303-ad4d-433e-9540-9c5fab33b74c',
  '1b326715-5e75-477a-a521-7163dc725eea',
  'c31f691d-0ca7-4d1d-a3ea-72b155c28b86',
  '069356c5-cf03-4f66-8a55-dfa77eb2fcee',
  'e93a156a-0c3d-4cf8-8689-5f79555f5a69',
  '0dd4da15-832d-4d62-94df-1eb9a20fd68d',
  '29fb28b0-7d5e-4d2b-a2ca-1d4118fdab61',
  'b94b6898-313c-4a51-9aae-9f0d199a2e0c',
  '9c8dea56-77a8-41dc-8dac-413de79af33c',
  '7eaab472-724b-467f-a5ff-2c60622a1740',
  '6819df9d-966e-4e2b-aba1-19911f890134',
  '65f7fa68-4305-42ae-902e-f507ff511ba5',
  'a68d2ee1-eaff-41f4-ac77-34c21
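
Each run of the cell above stores the chunks under freshly generated IDs, so executing it twice would likely leave duplicate vectors in the index. One possible mitigation is to derive a stable ID from each chunk's content. The sketch below only computes such IDs with Node's built-in `crypto` module. `stableId` is a hypothetical helper, and whether the installed `PineconeStore.addDocuments` accepts an `ids` option should be checked against its documentation before relying on it.

```javascript
var nodeCrypto = require('crypto');

// The same source and text always produce the same ID.
function stableId(doc) {
    var source = (doc.metadata && doc.metadata.source) || '';
    return nodeCrypto
        .createHash('sha256')
        .update(source + '\n' + doc.pageContent)
        .digest('hex')
        .slice(0, 32);
}

// Hypothetical documents; the first two are identical on purpose.
var exampleDocs = [
    { pageContent: 'Dear shareholders:', metadata: { source: 'letter.pdf' } },
    { pageContent: 'Dear shareholders:', metadata: { source: 'letter.pdf' } },
    { pageContent: 'A different chunk.', metadata: { source: 'letter.pdf' } }
];

var ids = exampleDocs.map(stableId);
console.log(ids[0] === ids[1], ids[0] === ids[2]); // true false
```

With content-derived IDs, re-uploading an unchanged chunk would overwrite the existing vector instead of adding a new one, assuming the store upserts by ID.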

## Leverage Pinecone vector data

- now that the Amazon Titan embeddings are loaded into the Pinecone vector store, let's create a fresh chat client with a retriever that uses the loaded data
- no more loading of documents, only retrieving

In [None]:
require('dotenv').config({path: require('path').resolve(__dirname, '../../.env')});

var Bedrock = require('@langchain/community/llms/bedrock').Bedrock;

var model = new Bedrock({
    model:'amazon.titan-text-express-v1', // the JS client reads `model`; `model_id` is the Python name
    temperature: 1,
    maxTokenCount: 512,
    topP: 0.9,
    verbose: true
});

var ChatPromptTemplate = require('@langchain/core/prompts').ChatPromptTemplate;
var prompt = ChatPromptTemplate.fromTemplate(`
    Answer the user question.
    Context: {context}
    Question: {input}
`);

var Pinecone = require('@pinecone-database/pinecone').Pinecone;
var PineconeStore = require('@langchain/pinecone').PineconeStore;
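var BedrockEmbeddings = require('@langchain/community/embeddings/bedrock').BedrockEmbeddings;

// Recreate the embeddings client so this cell does not depend on earlier cells
var embeddingsClient = new BedrockEmbeddings({
    model:'amazon.titan-embed-text-v2:0',
    region:'us-east-1'
});

var pineconeClient = new Pinecone({
    apiKey: process.env.PINECONE_API_KEY,
});

var pineconeIndex = pineconeClient.index(process.env.PINECONE_INDEX);
var pineconeStore = new PineconeStore(embeddingsClient, {
    pineconeIndex
});

// Retrieve from the Pinecone store, not the earlier in-memory `vectorStore`
var retriever = pineconeStore.asRetriever({k: 3});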

var createStuffDocumentsChain = require("langchain/chains/combine_documents").createStuffDocumentsChain;
var combineDocsChain;

createStuffDocumentsChain({
    llm: model,
    prompt,
}).then((chain) => combineDocsChain = chain);


