<a href="https://colab.research.google.com/github/dai-zz-flora/something_about_llm/blob/main/llm_reading_pdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

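# Overview
This notebook shows how to use the transformers, faiss, and sentence-transformers libraries to have a large language model analyze a PDF document and answer questions based on its content.

The large language model used here is Tsinghua University's ChatGLM, specifically the THUDM/chatglm-6b-int4 checkpoint.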



# Preparation


## 1. Check the GPU

In [None]:
!nvidia-smi

Thu May 25 14:34:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## 2. Install the required Python libraries

This notebook relies on several well-known Python libraries, including transformers and faiss.

In [None]:
!apt -y update -qq
!apt -y install -qq aria2
!pip install -q transformers datasets sentencepiece cpm_kernels mdtex2html accelerate faiss-gpu sentence-transformers PyPDF2 pdfplumber

## 3. Download the test PDF document

In [None]:
%cd /content
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M "https://ask.qcloudimg.com/draft/1103456/9xw693yh19.pdf" -d /content/ -o "腾讯云区块链TBaaS产品白皮书.pdf"

## 4. Define a PDF parsing function

In [None]:
import PyPDF2
import pdfplumber
import pandas as pd


def parse_pdf(pdf_path):
    text_pages = []
    images = []
    tables = []

    # parse text from pdf
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        num_pages = len(reader.pages)

        for i in range(num_pages):
            page = reader.pages[i]
            text = page.extract_text()
            text_pages.append(text)

    

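    # Extract images and tables with pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Collect image metadata (pdfplumber returns dicts, not pixel data)
            extracted_images = page.images
            images.extend(extracted_images)

            # Extract tables; the first row of each table becomes the header
            extracted_tables = page.extract_tables()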
            for table in extracted_tables:
                df = pd.DataFrame(table[1:], columns=table[0])
                tables.append(df)

    return text_pages, images, tables
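
`extract_text()` can return an empty string for pages that contain only images or scans, and blank pages add noise once the text is embedded. Below is a minimal sketch of a cleanup step, assuming blank pages should be dropped and whitespace runs collapsed. The helper name `clean_pages` is hypothetical and not part of the original notebook.

```python
def clean_pages(text_pages):
    """Return (page_number, text) pairs for pages that contain text.

    Hypothetical helper: page numbers are 1-based and runs of
    whitespace are collapsed to a single space.
    """
    cleaned = []
    for page_number, text in enumerate(text_pages, start=1):
        # Treat None defensively like an empty page, then collapse whitespace.
        normalized = " ".join((text or "").split())
        if normalized:
            cleaned.append((page_number, normalized))
    return cleaned


sample_pages = ["Chapter 1\n  Overview ", "", None, "Chapter 2"]
print(clean_pages(sample_pages))
```

Keeping the page number next to the text makes it possible to cite the source page later, even after blank pages have been removed.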

## 5. Initialize the model

We use the quantized checkpoint THUDM/chatglm-6b-int4 to reduce GPU memory usage.

In [None]:
from transformers import AutoTokenizer, AutoModel
model_name = 'THUDM/chatglm-6b-int4'
# trust_remote_code=True is required because ChatGLM ships its own modeling code
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).half().cuda()
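
To see why quantization matters on the Tesla T4 reported by `nvidia-smi` (15360 MiB), here is a rough back-of-the-envelope estimate of weight memory. It assumes about 6.2 billion parameters and counts the weights only, so treat the numbers as approximate. `weight_memory_gib` is a hypothetical helper used only for this illustration.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 6.2e9  # assumption: approximate ChatGLM-6B parameter count

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: about {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the FP16 weights come to roughly 11.5 GiB, against about 2.9 GiB for INT4. Actual GPU usage will be higher once activations and any non-quantized layers are counted.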

# PDF parsing


## 1. Parse the PDF

In [None]:
parsed_text, parsed_images, parsed_tables = parse_pdf("/content/腾讯云区块链TBaaS产品白皮书.pdf")

## 2. Generate text embeddings

We use the [sentence-transformers library](https://www.sbert.net/) to generate one embedding vector per page of text. The model loaded in the code below is [shibing624/text2vec-base-chinese](https://huggingface.co/shibing624/text2vec-base-chinese).

In [None]:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
embeddings_model = SentenceTransformer("shibing624/text2vec-base-chinese")
# Compute one embedding per page of the PDF
embeddings_corpus = embeddings_model.encode(parsed_text, show_progress_bar=True)

# faiss expects a float32 array
embeddings = np.asarray(embeddings_corpus, dtype="float32")


## 3. Build a vector index with faiss

In [None]:

# Build an exact (brute-force) L2 index sized to the embedding dimension
index = faiss.IndexFlatL2(embeddings.shape[1])

# Add the page embeddings; each vector's ID is its position, i.e. the page index
index.add(embeddings)
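
`IndexFlatL2` performs an exact, brute-force search: it compares the query with every stored vector and ranks them by squared Euclidean distance. The pure-Python sketch below mirrors that computation on toy 2-D vectors to make the ranking concrete. `flat_l2_search` is a hypothetical helper for illustration, not the faiss implementation.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to query.

    Distances are squared L2 distances, smallest first.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first; ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
distances, indices = flat_l2_search(toy_vectors, query=[1.0, 1.0], k=2)
print(distances, indices)  # [1.0, 2.0] [1, 0]
```

Because every vector is scanned, the result is exact but the cost grows linearly with the number of pages, which is acceptable for a single document.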

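# Building the question-answering step with ChatGLM-6B


## 1. Define the question and search the vector index for related pages

1.1 Encode the question as a vector

1.2 Retrieve the 5 most similar pages

1.3 Concatenate the retrieved pages into the context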

In [None]:
question = '白皮书都讲了什么'  # "What does the white paper cover?"
question_vector = embeddings_model.encode([question])

# Search for the most similar pages
k = 5  # number of nearest neighbors to return
_, indices = index.search(np.asarray(question_vector, dtype="float32"), k)

# Look up the text of the retrieved pages
related_docs = [parsed_text[idx] for idx in indices[0]]
# Concatenate the pages into a single context string
context = "\n".join(related_docs)
# for text in related_docs:
#   print(text)
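
One edge case is worth guarding against: when `k` is larger than the number of indexed pages, faiss pads the result with `-1`, and `parsed_text[-1]` would then silently return the last page. Below is a minimal sketch of a guard, assuming padded entries should simply be skipped. `select_pages` is a hypothetical helper, not part of the original notebook.

```python
def select_pages(index_row, pages):
    """Map one row of search results to page texts, skipping -1 padding."""
    selected = []
    for idx in index_row:
        idx = int(idx)  # faiss returns NumPy integers
        if 0 <= idx < len(pages):
            selected.append(pages[idx])
    return selected


toy_pages = ["page one", "page two", "page three"]
# A result row for k = 5 over a 3-page index, padded with -1.
print(select_pages([2, 0, 1, -1, -1], toy_pages))
```

In the notebook this could stand in for the list comprehension, for example `related_docs = select_pages(indices[0], parsed_text)`.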

## 2. Build the prompt from the template

In [None]:
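# The template is kept in Chinese because the model is asked to answer in Chinese.
# English gloss: "Known information: {context}. Based on the known information above,
# answer the user's question concisely and professionally. If the answer cannot be
# derived from it, say 'the question cannot be answered from the known information'
# or 'not enough relevant information was provided'. Do not add fabricated content
# to the answer, and answer in Chinese. The question is: {question}"
PROMPT_TEMPLATE = """已知信息：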
{context} 

根据上述已知信息，简洁和专业的来回答用户的问题。如果无法从中得到答案，请说 “根据已知信息无法回答该问题” 或 “没有提供足够的相关信息”，不允许在答案中添加编造成分，答案请使用中文。 问题是：{question}"""

prompt = PROMPT_TEMPLATE.replace("{question}", question).replace("{context}", context)
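
Five full pages of text can exceed the model's context window, so it may help to cap the context before filling the template. Below is a minimal sketch, assuming a simple character budget. The value 1500 is an arbitrary placeholder rather than a tuned number, and `build_context` is a hypothetical helper.

```python
def build_context(docs, max_chars=1500):
    """Join documents in rank order until the character budget is used up."""
    parts = []
    remaining = max_chars
    for doc in docs:
        if remaining <= 0:
            break
        piece = doc[:remaining]  # truncate the last document if needed
        parts.append(piece)
        remaining -= len(piece) + 1  # +1 accounts for the newline separator
    return "\n".join(parts)


toy_docs = ["a" * 10, "b" * 10, "c" * 10]
print(build_context(toy_docs, max_chars=15))
```

Since `related_docs` is ordered by similarity, truncating from the end drops the least relevant text first. A token-based budget computed with the tokenizer would be more precise than counting characters.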

## 3. Ask the model to answer the question

In [None]:
max_token = 10000
temperature = 0.8
top_p = 0.9
history = []  # no earlier turns; each question is answered independently

response, _ = model.chat(
    tokenizer,
    prompt,
    history=history,
    max_length=max_token,
    temperature=temperature,
    top_p=top_p,
)

print(response)
print("\nRelated pages:\n")
for text in related_docs:
    print(text)
    print("\n")
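
The `temperature` and `top_p` values control how the next token is sampled. The sketch below is a conceptual illustration in pure Python, not the transformers implementation: it applies temperature scaling to made-up scores, then keeps the most likely tokens until their cumulative probability reaches `top_p`. The function name and the toy logits are assumptions for illustration.

```python
import math


def sampling_distribution(logits, temperature=0.8, top_p=0.9):
    """Return renormalized probabilities of the tokens kept by top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -3.0]  # made-up scores for four tokens
print(sampling_distribution(toy_logits))
```

Lowering `temperature` or `top_p` narrows the set of candidate tokens, which tends to make answers more deterministic. That can be desirable when the answer should stay close to the retrieved pages.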