## LangChain Documents and Vector DB - OpenAI, FAISS

1. OpenAI
Check whether the account has remaining credit:
- https://platform.openai.com/account/billing/overview
- https://platform.openai.com/usage

2. LangChain manual
https://python.langchain.com/docs/integrations/vectorstores

3. Learning LangChain -> Documents and Vector DB

## Initial environment setup

In [None]:
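import os
from pathlib import Path

# Append the per-user binary directory to PATH so user-installed tools are found
HOME = str(Path.home())
add_binary_path = HOME + '/.local/bin'
os.environ['PATH'] = os.environ['PATH'] + ':' + add_binary_path

# Record the current working directory (IPython shell escape)
current_folder = !pwd
current_folder = current_folder[0]
current_folder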

## Install packages

In [None]:
# For Colab
!pip install cohere faiss-cpu gdown kaleido langchain openai pyngrok pypdf python-dotenv sentence-transformers tiktoken -q

### OPENAI API KEY

In [None]:
# OpenAI API key, method 1: write the key to a .env file and load it with python-dotenv

!echo "OPENAI_API_KEY=sk-xxxxxxx" > .env
from dotenv import load_dotenv
load_dotenv() # loads env variables

In [None]:
# OpenAI API key, method 2: set the environment variable directly in code

import os
os.environ["OPENAI_API_KEY"] = "sk-xxxxxxx"

In [None]:
# OpenAI API key, method 3: prompt for the key so it is never written into the notebook

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass()
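
Whichever of the three methods is used, it can help to confirm that the key is actually present before any embedding call is made. The following is a minimal sketch, not part of the original notebook: `require_api_key` is a hypothetical helper, and the check for the `sk-xxx` placeholder is an assumption based on the dummy values shown in the cells above.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key stored under `name`; raise if it is missing or still a placeholder.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to try out.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # "sk-xxx..." is the dummy value used in this notebook, not a real key (assumption)
    if not key or key.startswith("sk-xxx"):
        raise RuntimeError(f"{name} is not set; use one of the three methods first.")
    return key


# Usage with an explicit mapping, so the example does not touch the real environment
demo_key = require_api_key({"OPENAI_API_KEY": "sk-demo-not-a-real-key"})
print(len(demo_key))
```

In the notebook itself, calling `require_api_key()` with no arguments right after any of the three cells would fail early with a readable message instead of a later authentication error.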

### LOAD LIBRARY

In [None]:
# Load library
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

### Document processing

In [None]:
!mkdir -p data/pdf/
!gdown 1AldhEWVCtcE50XARgSnXR0azZ965nNmT -O data/pdf/

In [None]:
# Load the PDF and split it into overlapping chunks
pdf_file = './data/pdf/e2729e76-29a0-4be5-9eef-67809b05d6b9.pdf'
loader = PyPDFLoader(pdf_file)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

print(len(texts))
print(texts[3:6])
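
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch using a fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries to break at natural separators, so its chunks vary in length. The function name `chunk_text` is hypothetical and only illustrates the two parameters.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so a sentence cut at a
    boundary still appears in full context in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. With the notebook's settings (`chunk_size=1000`, `chunk_overlap=50`), neighbouring chunks would share roughly 50 characters in the same way.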

### Embed the text chunks and store them in a temporary database

In [None]:
embeddings = OpenAIEmbeddings()
vectortdb = FAISS.from_documents(texts, embeddings)
#DB_PATH = 'vectorstore/db_faiss'
#vectortdb.save_local(DB_PATH)

### Embed the query text and search the temporary database

In [None]:
# Load DB
#embeddings = OpenAIEmbeddings()
#DB_PATH = 'vectorstore/db_faiss'
#vectortdb = FAISS.load_local(DB_PATH, embeddings)

# Test search in the vector DB
# The query is kept in its original language; roughly: "Please explain how the company manages its assets?"
query = "請說明櫃公司如何進行資產管理?"
source_documents = vectortdb.similarity_search(query, k=3)

for doc in source_documents:
    # PyPDFLoader stores a zero-based page index and the file path in the metadata
    page = doc.metadata["page"]
    file = os.path.basename(doc.metadata["source"])
    print(f"Source: {file}, Page {page + 1}")
    print(doc.page_content)
    print("\n\n")
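
Conceptually, the search embeds the query with the same model as the chunks and returns the `k` stored vectors closest to it. The sketch below imitates that idea in pure Python under loud assumptions: `toy_embed` is a hypothetical letter-frequency stand-in for `OpenAIEmbeddings`, the sample texts are invented, and an exhaustive Euclidean (L2) comparison is assumed in place of the actual FAISS index.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a normalised letter-frequency vector."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    vec = [float(counts[ch]) for ch in ALPHABET]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_similarity_search(index, query, k=3):
    """Return the k stored texts whose vectors are closest to the query vector."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: l2_distance(item[0], q))
    return [text for _, text in ranked[:k]]


# Invented sample "chunks"; the index is just a list of (vector, text) pairs
docs = ["asset management policy", "annual revenue report", "employee travel rules"]
index = [(toy_embed(d), d) for d in docs]

results = toy_similarity_search(index, "asset management", k=2)
print(results[0])
# asset management policy
```

A real embedding model captures meaning rather than letter counts, and FAISS avoids the full sort used here, but the ranking step follows the same pattern: embed, measure distance, keep the closest `k`.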
