# **------------------- Create an IELTS Writing Instructor with RAG -------------------**

## **Import**

In [63]:
import os
import io
import sys
import requests
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display, update_display
import subprocess
import pandas as pd
import numpy as np
import gradio as gr

In [64]:
load_dotenv(override=True)
oai_key = os.getenv('OPENAI_API_KEY')
ds_key = os.getenv('DEEPSEEK_API_KEY')
oai_client = OpenAI()
ds_client = OpenAI(api_key=ds_key, base_url="https://api.deepseek.com")

# **-------------------------------------------------------------------------------------------**
# **First, Create a Chroma Knowledge Base from Local Files**

**Step 1 : Load Files**

**Step 2 : Split Files into Chunks**

**Step 3 : Vector Embedding -- Turn Chunks into Vectors (Using Auto-Encoding LLMs -- OpenAIEmbeddings or BERT)**

**Step 4 : Store Vectors in a Database (e.g. Chroma, FAISS)**

# **-------------------------------------------------------------------------------------------**

### **Step 1 :  Load Files**

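**Recursive folder traversal**: the source folder contains nested subfolders, so file paths are collected recursively.

**Folder name as doc_type**: for each file, record the name of its immediate parent folder.

**Multiple file types**: choose a loader based on the file extension (PDF, Word, Excel, and so on).

**Metadata**: add the parent folder name to each document's metadata.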
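
The traversal idea can be tried on its own with the standard library. The sketch below is an illustration, not the loader used in this notebook: `collect_files` and the folder names are hypothetical, and it uses `Path.rglob` instead of the explicit recursion in the cells that follow.

```python
import tempfile
from pathlib import Path

# Extensions treated as loadable in this sketch (an assumption for the demo).
SUPPORTED = {".pdf", ".md", ".docx", ".xlsx"}

def collect_files(root):
    """Return sorted (relative_path, doc_type) pairs for supported files under root."""
    root = Path(root)
    found = []
    for path in root.rglob("*"):  # rglob descends into nested folders
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            # doc_type is the name of the immediate parent folder
            found.append((path.relative_to(root).as_posix(), path.parent.name))
    return sorted(found)

# Build a throwaway folder tree with made-up names to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "Task1" / "samples").mkdir(parents=True)
    (base / "Task2").mkdir()
    (base / "Task1" / "samples" / "line_chart.md").write_text("demo")
    (base / "Task2" / "criteria.PDF").write_text("demo")
    (base / "Task2" / "notes.txt").write_text("ignored")
    demo_files = collect_files(base)

print(demo_files)
```

Lower-casing the suffix means `criteria.PDF` is still picked up, while the unsupported `.txt` file is skipped.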

**Import**

In [3]:
from glob import glob
from pathlib import Path
from langchain.document_loaders import DirectoryLoader, TextLoader, PyPDFLoader, Docx2txtLoader
from langchain.docstore.document import Document

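1. **`os`**: locates each file's parent folder
2. **`glob`**: lists the contents of a directory
3. **`pathlib`**: inspects file attributes such as the extension (pdf, docx, md)
4. **`LangChain`**: loads file contents and converts them to Document objects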

In [4]:
# Folder path
folder = (r"C:\Users\zekin\projects\llm_engineering\My Projects.ipynb\Week 5 Projects\IELTS Writting")

In [5]:
# Supported file types and their loaders
file_loaders = {
    ".pdf": PyPDFLoader,       # PDF loader
    ".md": TextLoader,         # Markdown loader
    ".docx": Docx2txtLoader,   # Word loader
}

In [6]:
# Holds every loaded document
documents = []

# Recursively walk the folder tree and load every supported file
def process_folder(folder_path):
    for item in glob(os.path.join(folder_path, "*")):
        if os.path.isdir(item):
            # Subfolder: recurse into it
            process_folder(item)
        else:
            # File: load it and attach metadata
            file_ext = Path(item).suffix.lower()
            if file_ext in file_loaders:
                loader = file_loaders[file_ext](item)  # pass the path to the matching loader
                docs = loader.load()   # load the file
                # Name of the immediate parent folder
                doc_type = os.path.basename(os.path.dirname(item))
                for doc in docs:
                    doc.metadata["doc_type"] = doc_type
                    documents.append(doc)
            elif file_ext == ".xlsx":
                # Excel is not in file_loaders: read the first sheet with pandas and store it as text
                excel_file = pd.read_excel(item, engine="openpyxl")
                content = excel_file.to_string()
                doc = Document(page_content=content, metadata={'source': item})
                doc_type = os.path.basename(os.path.dirname(item))
                doc.metadata['doc_type'] = doc_type
                documents.append(doc)

In [7]:
process_folder(folder)

In [8]:
len(documents)

85
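
The count above is a number of `Document` objects, not necessarily a number of files, because some loaders return several documents per file (`PyPDFLoader`, for instance, yields one per page). A quick sanity check is to tally documents per `doc_type`. The sketch below is only an illustration: `count_doc_types` is a hypothetical helper, and the metadata is made up to stand in for `[doc.metadata for doc in documents]`.

```python
from collections import Counter

def count_doc_types(metadatas):
    """Count how many documents carry each doc_type value."""
    return Counter(meta.get("doc_type", "unknown") for meta in metadatas)

# Made-up metadata standing in for [doc.metadata for doc in documents].
sample_metadata = [
    {"source": "a.pdf", "doc_type": "Task1"},
    {"source": "b.pdf", "doc_type": "Task1"},
    {"source": "c.md", "doc_type": "Task2"},
    {"source": "d.xlsx"},  # no doc_type recorded
]
doc_type_counts = count_doc_types(sample_metadata)
print(doc_type_counts)
```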

### **Step 2 :  Split Text into Chunks**

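**Use LangChain's CharacterTextSplitter to split the documents into small chunks that are easier to process.**

```chunk_size```: 1000 characters ≈ 200-300 tokens (roughly, for English text)  
Retrieval tasks: smaller chunks (500-1500) make precise matching easier   
Summarization tasks: larger chunks (1000-2000) keep more context

```chunk_overlap```: the number of characters shared by adjacent chunks; overlap keeps key information from being lost at chunk boundaries  
Dense text (e.g. technical documentation): increase the overlap (200-300)   
Sparse text (e.g. dialogue): reduce the overlap (50-100)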
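
To make the two parameters concrete, here is a simplified fixed-window splitter. It is a sketch under stated assumptions, not how `CharacterTextSplitter` works internally: the real splitter first splits on a separator and then merges pieces up to `chunk_size`, while the hypothetical `sliding_chunks` below simply slides a window over the raw characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

demo_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(demo_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk begins with the last four characters of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.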

**Import**

In [9]:
from langchain.text_splitter import CharacterTextSplitter

**Split**

In [10]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)  # split by character count, with overlap to preserve context
chunks = text_splitter.split_documents(documents)  # split every document and return a new list of chunk documents

### **Step 3 :  Vector Embedding -- Turn Chunks into Vectors**

**Import**

In [11]:
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from sklearn.manifold import TSNE

Here, we will use **OpenAIEmbeddings**

In [12]:
embeddings = OpenAIEmbeddings()

**Create our Chroma vectorstore!**

In [13]:
db_name = "IEKTS_vector_db"
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

Vectorstore created with 178 documents


**Get one vector and find how many dimensions it has**

In [14]:
collection = vectorstore._collection  # the underlying Chroma collection object
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

The vectors have 1,536 dimensions
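
Retrieval works by comparing the query's vector with the stored vectors. Cosine similarity is one common way to compare embeddings; which metric the vector store actually uses depends on its configuration, so treat the sketch below as an illustration on made-up 3-dimensional vectors rather than a description of Chroma's internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different length
orthogonal = [0.0, 3.0, 0.0]       # shares no component with the query

sim_same = cosine_similarity(query, same_direction)
sim_orth = cosine_similarity(query, orthogonal)
print(round(sim_same, 6), sim_orth)
```

Vectors pointing in the same direction score close to 1 regardless of length, and unrelated (orthogonal) vectors score 0.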


**Save Vectorstore**

In [None]:
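from langchain_chroma import Chroma
# Passing persist_directory makes Chroma write the store to the local folder "vectorstore".
# Assumption: with the langchain_chroma package this happens automatically, so the explicit
# persist() call from the older community wrapper is not needed here.
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory="vectorstore")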

### **Step 4 :  A Conversation Chain with RAG and Memory + A Gradio Interface**

#### **Method 1 : Use A Pipeline  -- ConversationalRetrievalChain.from_llm**

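**Notes**
- `ConversationBufferMemory` stores the conversation history and is commonly paired with `ConversationalRetrievalChain`.
- `memory_key` and `return_messages` are set at initialization; they determine the key under which the history is exposed and the format in which it is returned.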
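
The effect of the two settings can be shown with a toy buffer. This is a conceptual stand-in, not LangChain's implementation: the class `BufferMemorySketch` and its methods `add_turn` and `load` are hypothetical names chosen for the illustration.

```python
class BufferMemorySketch:
    """Toy conversation buffer showing what memory_key and return_messages control."""

    PREFIXES = {"human": "Human", "ai": "AI"}

    def __init__(self, memory_key="history", return_messages=False):
        self.memory_key = memory_key
        self.return_messages = return_messages
        self._turns = []  # (role, text) pairs in arrival order

    def add_turn(self, user_text, ai_text):
        self._turns.append(("human", user_text))
        self._turns.append(("ai", ai_text))

    def load(self):
        if self.return_messages:
            # Structured form: one message record per utterance
            value = [{"role": role, "content": text} for role, text in self._turns]
        else:
            # Flat form: a single transcript string
            value = "\n".join(f"{self.PREFIXES[role]}: {text}" for role, text in self._turns)
        return {self.memory_key: value}

as_messages = BufferMemorySketch(memory_key="chat_history", return_messages=True)
as_messages.add_turn("Hi", "Hello")
loaded_messages = as_messages.load()

as_text = BufferMemorySketch(memory_key="chat_history", return_messages=False)
as_text.add_turn("Hi", "Hello")
loaded_text = as_text.load()

print(loaded_messages)
print(loaded_text)
```

In both cases the history is found under the `chat_history` key; only its shape differs.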

**Import**

In [15]:
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

In [16]:
# Initialize a DeepSeek chat model in LangChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    temperature=0.7,
    model_name="deepseek-chat",
    openai_api_key=ds_key,                          # DeepSeek API key
    openai_api_base="https://api.deepseek.com/v1",  # DeepSeek API endpoint
    streaming=True  # enable streaming output
)

# Initialize a conversation buffer memory
# memory_key='chat_history': the key under which the chain reads the history
# return_messages=True: return structured message objects instead of one string
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)  # a configured memory instance that tracks the conversation

# the retriever is an abstraction over the VectorStore that will be used during RAG
retriever = vectorstore.as_retriever(search_kwargs={"k": 25})  # retrieve the 25 most relevant chunks; a larger k passes more context to the model

# putting it together: set up the conversation chain with the DeepSeek LLM, the retriever and memory
conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory
)



**```result = conversation_chain.invoke({"question": question})```**:

This is the key call: it passes the user's question to the ConversationalRetrievalChain and returns the result.
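
Conceptually, one turn of a conversational retrieval chain condenses the follow-up into a standalone question, retrieves context, generates an answer, and records the turn. The sketch below walks through that flow with stub functions. It is an assumption-laden illustration, not the library's code: `run_conversational_rag` and the three stubs are hypothetical, and a real chain would call an LLM and a retriever in their place.

```python
def run_conversational_rag(question, chat_history, condense, retrieve, answer):
    """One turn of a conversational RAG loop built from three injected callables."""
    # 1. Rewrite a follow-up into a standalone question, only when there is history.
    standalone = condense(question, chat_history) if chat_history else question
    # 2. Fetch context for the standalone question.
    docs = retrieve(standalone)
    # 3. Generate the answer from the question plus the retrieved context.
    reply = answer(standalone, docs)
    # 4. Record the turn so the next call can resolve references to it.
    chat_history.append((question, reply))
    return {"question": question, "answer": reply}

# Stubs standing in for the LLM and the retriever.
def stub_condense(question, chat_history):
    return f"{question} (about: {chat_history[-1][0]})"

def stub_retrieve(question):
    return [f"doc for '{question}'"]

def stub_answer(question, docs):
    return f"answered '{question}' using {len(docs)} doc(s)"

history = []
first = run_conversational_rag("What is Task 2?", history, stub_condense, stub_retrieve, stub_answer)
second = run_conversational_rag("How long should it be?", history, stub_condense, stub_retrieve, stub_answer)
print(first["answer"])
print(second["answer"])
```

The second call shows why memory matters: "it" only becomes answerable once the previous question is folded into the standalone query.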

In [17]:
def chat(question, history):
    result = conversation_chain.invoke({"question": question})
    return result["answer"]

In [18]:
view = gr.ChatInterface(chat, type="messages").launch(inbrowser=True)

* Running on local URL:  http://127.0.0.1:7940

To create a public link, set `share=True` in `launch()`.


#### **Method 2 : Manually Step by Step**

**Load the chat model**

In [19]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    temperature=0.7,
    model_name="deepseek-chat",
    openai_api_key=ds_key,
    openai_api_base="https://api.deepseek.com/v1",
    streaming=True
)

**Initialize memory**

In [20]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

**Set up the retriever**

In [26]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

**Handle input and retrieval manually**

In [52]:
def add_message(history, message):
    history.append({"role": "user", "content": message})
    return history, gr.Textbox(value="", interactive=False)

In [53]:
def generate_user_prompt(message):
    prompt_merge = "User question: " + message
    results = retriever.invoke(message)
    prompt_merge += "\nRetrieved content: "
    for doc in results:
        prompt_merge += doc.page_content
    return prompt_merge
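
The function above concatenates the retrieved chunks back to back. One possible variation, sketched here on plain strings, numbers each snippet, puts it on its own line, and stops once a character budget is reached. The helper `build_prompt` and its `max_chars` parameter are hypothetical and not part of this notebook.

```python
def build_prompt(question, snippets, max_chars=2000):
    """Join the question and retrieved snippets, numbering and separating each snippet."""
    parts = [f"User question: {question}", "Retrieved context:"]
    used = 0
    for index, snippet in enumerate(snippets, start=1):
        snippet = snippet.strip()
        if used + len(snippet) > max_chars:
            break  # stop before the context budget is exceeded
        parts.append(f"[{index}] {snippet}")
        used += len(snippet)
    return "\n".join(parts)

# Made-up snippets standing in for doc.page_content values.
demo_prompt = build_prompt(
    "What is Task 2?",
    ["  Task 2 is an essay. ", "Write at least 250 words.", "Band descriptors are public."],
    max_chars=50,
)
print(demo_prompt)
```

With a budget of 50 characters the first two snippets fit and the third is dropped, which keeps the prompt bounded when `k` is large.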

In [59]:
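system_prompt = """You are an IELTS writing assistant AI focused on helping users improve their IELTS writing skills. Give targeted advice only when the user asks a specific question, and avoid volunteering unrelated content. Keep answers concise and direct. When the user asks you to mark an essay, score it against the IELTS writing band descriptors, then find and correct all of its errors."""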

In [57]:
def get_output(history):
    message = history[-1]["content"]
    full_prompt = generate_user_prompt(message)
    messages = [{'role': 'system', 'content': system_prompt}] + history[:-1] + [{'role': 'user', 'content': f"Answer based on the following context: {full_prompt}"}]
    stream = ds_client.chat.completions.create(
        model='deepseek-chat', 
        messages=messages, 
        stream=True
    )
    history.append({"role": "assistant", "content": ""})
    response = ""
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        history[-1]["content"] = response
        yield history
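
The streaming pattern in `get_output` can be exercised without an API call. The sketch below replaces the model stream with a made-up list of text pieces; `stream_into_history` is a hypothetical helper, and it yields copies of the history only so that the demo can keep a snapshot of every step.

```python
def stream_into_history(history, deltas):
    """Yield the history after each streamed piece, growing the last assistant message."""
    history.append({"role": "assistant", "content": ""})
    response = ""
    for delta in deltas:
        response += delta or ""  # some chunks carry no text (None)
        history[-1]["content"] = response
        # Copy each message so earlier snapshots are not changed by later updates.
        yield [dict(message) for message in history]

fake_deltas = ["Band ", None, "7 ", "essay"]
snapshots = list(stream_into_history([{"role": "user", "content": "Score?"}], fake_deltas))
partial_texts = [snapshot[-1]["content"] for snapshot in snapshots]
print(partial_texts)
# ['Band ', 'Band ', 'Band 7 ', 'Band 7 essay']
```

Each yielded value is a complete history with a longer assistant message, which is what lets the chat window redraw the reply as it grows.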

In [58]:
with gr.Blocks() as demo:
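    gr.Markdown("""# Hello! I am your personal IELTS Writing Instructor~👩🏻‍🏫
                **Example questions:**
                🧋What are the band 6 criteria for IELTS Task 2 / Task 1?
                🥣How is an IELTS Task 2 / Task 1 answer structured?
                🍧How should I write the introduction for Task 2?
                🍲How can I describe an upward trend in a Task 1 chart question?
                🍛......
    """)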

    chatbot = gr.Chatbot(type="messages",height=600, show_copy_button=True)

    chat_input = gr.Textbox(placeholder="Ask your question here~",label='Ask questions:')
    stop_btn = gr.Button("Stop")

    chat_msg = chat_input.submit(add_message, [chatbot, chat_input], [chatbot, chat_input])
    bot_msg = chat_msg.then(get_output, chatbot, chatbot)
    bot_msg.then(lambda: gr.update(interactive=True), None, [chat_input])

    stop_btn.click(None, cancels=[bot_msg])
    
demo.launch(share=True,node_port=8050)

* Running on local URL:  http://127.0.0.1:7950
* Running on public URL: https://8852d08755f0d839c8.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


