# RAG system
* LLM: GPT-3.5 Turbo / Gemini
* LangChain
* Vector DB: ChromaDB
* Embedding: 
  * GanymedeNil/text2vec-large-chinese
    1. Use a pipeline as a high-level helper
      * Fast, automated processing
    2. Load model directly
      * More flexible, easier to optimize

## Environment Setup

In [None]:
%pip install -r requirements.txt

In [None]:
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_google_genai import GoogleGenerativeAI  # provided by the langchain-google-genai package
from langchain.chains import RetrievalQA

## Data Processing and Database Import
* Data: 綠色金融行動方案3.pdf (Green Finance Action Plan 3.0)
* Document Loaders:
  * LangChain offers around 55 types of document loaders, including loaders for Word, CSV, PDF, Google Drive, and YouTube
* Split Documents: 
  * Text splitter splits documents or text into chunks to avoid exceeding the LLM's token limit
  * The main parameters include chunk_size (determining the max number of characters per chunk) and chunk_overlap (specifying the overlapping characters between consecutive chunks)
* Embedding Model: 
  * Converts the text chunks into vectors so they can be compared by similarity.
  * LangChain provides interfaces for many embedding models.
  * [Open-source embedding model comparison](https://ithelp.ithome.com.tw/articles/10298540?sc=rss.iron)
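
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification for illustration, not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraphs and newlines). The helper name `split_with_overlap` and the sample text are assumptions.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-size character chunking (illustration only, hypothetical helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.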

In [None]:
# Load data
loader = PyMuPDFLoader("綠色金融行動方案3.pdf")
PDF_data = loader.load()

In [None]:
# Split text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=5)
all_splits = text_splitter.split_documents(PDF_data)

In [None]:
# Show text
print(len(all_splits))
print(all_splits[150])

In [None]:
# embedding
model_name = "aspire/acge_text_embedding"
model_kwargs = {'device': 'cpu'}
embeddings = HuggingFaceEmbeddings(model_name=model_name,
                                  multi_process=True,
                                  model_kwargs=model_kwargs,
                                  encode_kwargs={"normalize_embeddings": True},  # set True for cosine similarity
                                  )
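
The `normalize_embeddings` comment above can be checked with a little arithmetic: once vectors are scaled to unit length, their dot product equals their cosine similarity, and squared Euclidean distance becomes `2 - 2 * cosine`, so both measures rank neighbors the same way. The sketch below uses hand-made 2-d vectors rather than real model output; all helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for embeddings (assumption: real ones have far more dimensions)
a = [3.0, 4.0]
b = [4.0, 3.0]

cos_raw = cosine_similarity(a, b)
a_unit, b_unit = l2_normalize(a), l2_normalize(b)
dot_normalized = dot(a_unit, b_unit)
squared_l2 = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))

print(cos_raw, dot_normalized, squared_l2)
```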

In [None]:
# Import the embedding results into the vector DB
persist_directory = 'db'
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory=persist_directory)
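
Conceptually, a vector store keeps each chunk next to its embedding and answers a query by returning the closest chunks. The sketch below is a toy in-memory version with hand-made 3-d vectors and a linear scan; it is not how Chroma is implemented, and real vector databases typically use an approximate nearest-neighbor index instead. `TinyVectorStore` and the sample texts are hypothetical.

```python
import math


class TinyVectorStore:
    """Toy in-memory vector store (illustration only)."""

    def __init__(self):
        self._items = []  # list of (text, unit-length vector)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def add(self, text, vector):
        self._items.append((text, self._normalize(vector)))

    def search(self, query_vector, k=2):
        """Return the k texts whose vectors have the highest cosine similarity."""
        q = self._normalize(query_vector)
        scored = [(sum(a * b for a, b in zip(q, v)), text) for text, v in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = TinyVectorStore()
store.add("green bonds", [1.0, 0.1, 0.0])
store.add("climate risk disclosure", [0.8, 0.6, 0.0])
store.add("cafeteria menu", [0.0, 0.0, 1.0])

results = store.search([1.0, 0.2, 0.0], k=2)
print(results)
```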

## Enable the LLM Service: Google AI

In [None]:
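import os
from getpass import getpass

# Read the API key from the environment instead of hardcoding it in the notebook
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Google API key: ")

Google_llm = GoogleGenerativeAI(model="gemini-pro", google_api_key=os.environ["GOOGLE_API_KEY"])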

In [None]:
# Text Retrieval + Query LLM
retriever = vectordb.as_retriever()

qa_Google = RetrievalQA.from_chain_type(
    llm=Google_llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
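
The `stuff` chain type places all retrieved chunks into a single prompt and sends that to the LLM in one call. A minimal sketch of the idea follows; the prompt wording and the `build_stuff_prompt` helper are assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into one context block (hypothetical helper)."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    "What is green finance?",
    ["Chunk A.", "Chunk B."],  # stand-ins for chunks returned by the retriever
)
print(prompt)
```

Because every chunk lands in the same prompt, the number and size of retrieved chunks must stay within the model's context window.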

In [None]:
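# Using RAG
# The query is kept in Chinese to match the source document: "What is the Green Finance Action Plan 3.0?"
query = "綠色金融行動方案 3.0 是什麼？"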
qa_Google.run(query)