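# RAG: Retrieval-Augmented Generation
- Retrieval: retrieve up-to-date information from external sources
- Augmented: augment the model's existing knowledge with that information
- Generation: generate the response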
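
The three stages can be sketched with plain Python before any library is involved. This is only a toy illustration: the word-overlap scoring, the function names `retrieve`, `augment`, and `generate`, and the two sample documents are assumptions made for the sketch, and `generate` is a placeholder where a real LLM call would go.

```python
def retrieve(query, documents, k=1):
    """Rank documents by how many lowercase words they share with the query (toy scoring)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment(query, context_docs):
    """Build a prompt that places the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


def generate(prompt):
    """Placeholder: a real pipeline would send the prompt to a chat model here."""
    return f"[model response for a prompt of {len(prompt)} characters]"


documents = [
    "CS229 is the machine learning class taught by Andrew Ng.",
    "A text splitter cuts long documents into chunks.",
]
query = "Who teaches the machine learning class?"

top_docs = retrieve(query, documents)
prompt = augment(query, top_docs)
answer = generate(prompt)
print(prompt)
print(answer)
```

A real pipeline would replace the word-overlap scoring with vector similarity search and the placeholder with a chat model call.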

In [1]:
import os 
import logging 

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
api_key = os.environ.get('OPENAI_API_KEY')
if api_key is None:
    raise ValueError("The OPENAI_API_KEY environment variable is not set.")

In [2]:
import os 
from langchain_openai import ChatOpenAI
chat = ChatOpenAI(
    openai_api_base=os.environ["CHATGPT_API_ENDPOINT"],
    openai_api_key=os.environ["OPENAI_API_KEY"])



## Loading a PDF

In [6]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("ML-guide.pdf")

In [12]:
pages = loader.load()
print(len(pages))
page = pages[0]
print(page.page_content[:500])
print(page.metadata)

22
MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of the class, and then we'll start to talk a bit about machine learning.  
By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so 
I personally work in machine learning, and I've worked on it for about 15 years now, and 
I actually think that machine learning is the 
{'source': 'ML-guide.pdf', 'page': 0}


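## Retrieving information from the web
- LangChain `WebBaseLoader`: loads the text of a web page
- Serper (google.serper.dev): Google search API; the free tier includes 2,500 queries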

In [13]:
from langchain.document_loaders import WebBaseLoader

In [14]:
loader = WebBaseLoader("https://google.com")

In [19]:
docs = loader.load()
print(docs[0].page_content)

Google搜尋圖片地圖PlayYouTube新聞Gmail雲端硬碟更多日曆翻譯圖書購物Blogger財經相片文件更多 »Account Options登入搜尋設定網頁記錄 進階搜尋Google 提供：  English 廣告商業解決方案關於 GoogleGoogle.com.tw© 2024 - 隱私權 - 服務條款   


In [24]:
import requests
import json
load_dotenv(find_dotenv())

url = "https://google.serper.dev/news"

payload = json.dumps({
  "q": "apple inc",
  "hl": "zh-tw"
})
headers = {
  'X-API-KEY': os.environ["X_API_KEY"],
  'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)

{"searchParameters":{"q":"apple inc","hl":"zh-tw","type":"news","engine":"google"},"news":[{"title":"Apple Intelligence has arrived. Here's how to update your device to get it","link":"https://www.foxbusiness.com/technology/apple-intelligence-has-arrived-heres-how-update-your-device-get","snippet":"Apple's inaugural batch of Apple Intelligence features debuted on Monday for eligible devices. It comes after the company first announced...","date":"2 天前","source":"Fox Business","imageUrl":"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT5LbetDTaaDgWmlf0WdqpqqwshUlEHzFQZrM53c_9g19SfxQi94bhj3jNUvQ&s","position":1},{"title":"Apple Set to Report Q4 Earnings: Buy, Sell or Hold the Stock?","link":"https://finance.yahoo.com/news/apple-set-report-q4-earnings-164600386.html","snippet":"AAPL's fourth-quarter fiscal 2024 results are likely to reflect strong services growth despite unfavorable forex and sluggish Mac shipments.","date":"1 天前","source":"Yahoo Finance","imageUrl":"https://encrypt
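
The raw JSON is hard to read, so it usually helps to keep only the fields needed downstream. The sketch below parses an abbreviated sample shaped like the response above (only fields visible in that output are used); the helper name `extract_news` is an assumption, and in the notebook one would pass `response.text` instead of the sample string.

```python
import json

# Abbreviated sample copied from the response printed above; some fields are omitted.
sample_response = '''
{"searchParameters": {"q": "apple inc", "hl": "zh-tw", "type": "news", "engine": "google"},
 "news": [
  {"title": "Apple Intelligence has arrived. Here's how to update your device to get it",
   "link": "https://www.foxbusiness.com/technology/apple-intelligence-has-arrived-heres-how-update-your-device-get",
   "source": "Fox Business"},
  {"title": "Apple Set to Report Q4 Earnings: Buy, Sell or Hold the Stock?",
   "link": "https://finance.yahoo.com/news/apple-set-report-q4-earnings-164600386.html",
   "source": "Yahoo Finance"}
 ]}
'''


def extract_news(response_text):
    """Return a list of {title, link, source} dicts from a Serper-style news response."""
    data = json.loads(response_text)
    return [
        {
            "title": item.get("title", ""),
            "link": item.get("link", ""),
            "source": item.get("source", ""),
        }
        for item in data.get("news", [])  # tolerate a response without a "news" key
    ]


news_items = extract_news(sample_response)
for item in news_items:
    print(f'{item["source"]}: {item["title"]}')
```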

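## Text Splitter
Q: Why split text?
- External information has to be organized and added to a vector database.
- When the user asks a question, retrieval finds the content relevant to that question.

Each small piece is called a chunk.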

In [31]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 9,
    chunk_overlap = 0
)

In [32]:
text1 = "123456789"
text2 = "123456789123456789"

print(text_splitter.split_text(text1))
print(text_splitter.split_text(text2))

['123456789']
['123456789', '123456789']


In [38]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 9,
    chunk_overlap = 3
)
print(text_splitter.split_text(text1))
print(text_splitter.split_text(text2))

['123456789']
['123456789', '789123456', '456789']
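
For text with no separators, the outputs above match a plain sliding window: each chunk starts `chunk_size - chunk_overlap` characters after the previous one. The function below is a simplified sketch of that idea, not the actual `RecursiveCharacterTextSplitter` implementation (which first splits on separators and then merges pieces); the name `sliding_window_chunks` is an assumption.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(sliding_window_chunks("123456789", 9, 3))
print(sliding_window_chunks("123456789123456789", 9, 3))
```

With `chunk_overlap = 3` the window advances 6 characters at a time, which is why the second chunk begins with `789`, the last three characters of the first chunk.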


In [39]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 9,
    chunk_overlap = 9
)
text3 = "This is a sample text to split. It has multiple sentences."
print(text_splitter.split_text(text3))

['This is a', 'a sample', 'text to', 'split.', 'It has', 'multiple', 'sentence', 'sentences', 'entences.']


### PDF

In [45]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("ML-guide.pdf")
pages = loader.load()
print(len(pages))
pages[0].page_content[:500]

22


"MachineLearning-Lecture01  \nInstructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is just spend a little time going over the logistics \nof the class, and then we'll start to talk a bit about machine learning.  \nBy way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so \nI personally work in machine learning, and I've worked on it for about 15 years now, and \nI actually think that machine learning is the "

In [51]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 150,
    length_function = len,
    separators= ["\n\n","\n", " ",""]
)
docs = text_splitter.split_documents(pages)
len(docs)

172
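
The `separators` list is tried in order: paragraphs first, then lines, then words, then single characters. The sketch below illustrates that fallback under simplifying assumptions (no overlap, separators are dropped at chunk boundaries); it is not LangChain's implementation, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text, chunk_size, separators):
    """Split on the first separator; recurse with the next one for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: fall back to a finer separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee"
print(recursive_split(sample, 7, ["\n\n", "\n", " ", ""]))
```

The first paragraph is longer than 7 characters, so it is split again on spaces, while the second paragraph fits and is kept whole.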

In [54]:
docs[0].page_content

"MachineLearning-Lecture01  \nInstructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is just spend a little time going over the logistics \nof the class, and then we'll start to talk a bit about machine learning.  \nBy way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so \nI personally work in machine learning, and I've worked on it for about 15 years now, and"

In [55]:
docs[1].page_content

"I personally work in machine learning, and I've worked on it for about 15 years now, and \nI actually think that machine learning is the most exciting field of all the computer \nsciences. So I'm actually always excited about teaching this class. Sometimes I actually \nthink that machine learning is not only the most exciting thing in computer science, but \nthe most exciting thing in all of human endeavor, so maybe a little bias there."

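### How to choose the chunk size
- Too large: the retrieved chunks may not fit in the model's context window
- Too small: the meaning of some sentences or paragraphs is only clear within their surrounding context, which a tiny chunk loses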
- chunkviz: https://chunkviz.up.railway.app/

In [56]:
import os 
from langchain_openai import ChatOpenAI
chat = ChatOpenAI(
    openai_api_base=os.environ["CHATGPT_API_ENDPOINT"],
    openai_api_key=os.environ["OPENAI_API_KEY"])

In [67]:
from langchain.document_loaders import ReadTheDocsLoader

loader = ReadTheDocsLoader("htmldocs")
docs = loader.load()
print(len(docs))
print(docs[0].page_content[:500])

2
langchain.indexes.vectorstore.VectorstoreIndexCreator¶
class langchain.indexes.vectorstore.VectorstoreIndexCreator[source]¶
Bases: BaseModel
Logic for creating indexes.
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be parsed to form a valid model.
param embedding: Embeddings [Optional]¶
param text_splitter: TextSplitter [Optional]¶
param vectorstore_cls: Type[VectorStore] = <class 'langchain_community.vectorstores.


In [61]:
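# gpt-3.5-turbo has a 4096-token context window.
# The window must hold: input (instruction + query + retrieved content) + output.
# If about 2000 tokens are left for retrieved content and 5 chunks are retrieved:
  # Chunk size = 2000 / 5 = 400
  # So chunk size <= 400
  # Too small: chunks are not meaningful
  # Too big: not efficient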
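
The same budget arithmetic can be written as a small helper. The 2096 tokens reserved for the instruction, query, and output are an assumption chosen so that 2000 tokens remain for retrieved content, as in the notes above; the function name `max_chunk_size` is also hypothetical.

```python
def max_chunk_size(content_budget_tokens, num_chunks):
    """Largest chunk size (in tokens) such that num_chunks chunks fit in the budget."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    return content_budget_tokens // num_chunks


CONTEXT_WINDOW = 4096            # gpt-3.5-turbo figure used in the notes above
RESERVED_FOR_PROMPT_AND_OUTPUT = 2096  # assumption: instruction + query + output

content_budget = CONTEXT_WINDOW - RESERVED_FOR_PROMPT_AND_OUTPUT
print(content_budget)                     # tokens left for retrieved chunks
print(max_chunk_size(content_budget, 5))  # per-chunk limit when retrieving 5 chunks
```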


In [64]:
import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")  # tokenizer for this model
tokenizer

<Encoding 'cl100k_base'>

In [69]:
def token_count(text):
    tokens = tokenizer.encode(
        text, 
        disallowed_special = ()
    )
    return len(tokens)

In [72]:
tokens = [token_count(doc.page_content) for doc in docs]
tokens

[1538, 1605]

In [73]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 400,
    chunk_overlap = 20,
    length_function = token_count,
    separators= ["\n\n","\n", " ",""]
)
chunks = text_splitter.split_text(docs[0].page_content)
len(chunks)

5

In [74]:
tuple(token_count(chunk) for chunk in chunks)

(383, 373, 345, 376, 105)
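
A quick sanity check on these numbers: every chunk should stay within `chunk_size`, and the chunk totals should exceed the document's token count only by roughly the overlap. The sketch below uses the counts printed above; note that token counts are not strictly additive across split points, so the "extra" figure is only an approximation of the overlapped tokens.

```python
chunk_tokens = [383, 373, 345, 376, 105]  # counts printed above
document_tokens = 1538                    # token count of docs[0] from the earlier cell
chunk_size = 400
chunk_overlap = 20

largest = max(chunk_tokens)
extra = sum(chunk_tokens) - document_tokens
# Each of the boundaries between neighbouring chunks can repeat up to chunk_overlap tokens.
overlap_ceiling = chunk_overlap * (len(chunk_tokens) - 1)

print(f"largest chunk: {largest} tokens (limit {chunk_size})")
print(f"extra tokens from overlap: {extra} (ceiling {overlap_ceiling})")
```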

In [77]:
chunks 

["langchain.indexes.vectorstore.VectorstoreIndexCreator¶\nclass langchain.indexes.vectorstore.VectorstoreIndexCreator[source]¶\nBases: BaseModel\nLogic for creating indexes.\nCreate a new model by parsing and validating input data from keyword arguments.\nRaises ValidationError if the input data cannot be parsed to form a valid model.\nparam embedding: Embeddings [Optional]¶\nparam text_splitter: TextSplitter [Optional]¶\nparam vectorstore_cls: Type[VectorStore] = <class 'langchain_community.vectorstores.inmemory.InMemoryVectorStore'>¶\nparam vectorstore_kwargs: dict [Optional]¶\nasync afrom_documents(documents: List[Document]) → VectorStoreIndexWrapper[source]¶\nCreate a vectorstore index from documents.\nParameters\ndocuments (List[Document]) – \nReturn type\nVectorStoreIndexWrapper\nasync afrom_loaders(loaders: List[BaseLoader]) → VectorStoreIndexWrapper[source]¶\nCreate a vectorstore index from loaders.\nParameters\nloaders (List[BaseLoader]) – \nReturn type\nVectorStoreIndexWrappe