# LangChain document loaders can load many document types, such as `csv`, `txt`, `html`, `json`, and `pdf`. This notebook walks through the `pdf` loaders.

https://langchain-fanyi.readthedocs.io/en/latest/modules/indexes/document_loaders.html

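## Problems to solve
1. Parsing images and tables
2. Page structure (layout) issues
3. Formatting structure (output that reads naturally to a human)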

### Parsing the PDF with `pypdf`: the PDF is parsed page by page

In [None]:
! pip install pypdf

In [1]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("E:/langchain_RAG/data/baichuan.pdf")
pages_pypdf = loader.load()

In [2]:
len(pages_pypdf)

28

In [3]:
print(pages_pypdf[2].page_content)

C-Eval MMLU CMMLU Gaokao AGIEval BBH GSM8K HumanEval
GPT-4 68.40 83.93 70.33 66.15 63.27 75.12 89.99 69.51
GPT-3.5 Turbo 51.10 68.54 54.06 47.07 46.13 61.59 57.77 52.44
LLaMA-7B 27.10 35.10 26.75 27.81 28.17 32.38 9.78 11.59
LLaMA 2-7B 28.90 45.73 31.38 25.97 26.53 39.16 16.22 12.80
MPT-7B 27.15 27.93 26.00 26.54 24.83 35.20 8.64 14.02
Falcon-7B 24.23 26.03 25.66 24.24 24.10 28.77 5.46 -
ChatGLM 2-6B (base)∗51.70 47.86 - - - 33.68 32.37 -
Baichuan 1-7B 42.80 42.30 44.02 36.34 34.44 32.48 9.17 9.207B
Baichuan 2-7B-Base 54.00 54.16 57.07 47.47 42.73 41.56 24.49 18.29
LLaMA-13B 28.50 46.30 31.15 28.23 28.22 37.89 20.55 15.24
LLaMA 2-13B 35.80 55.09 37.99 30.83 32.29 46.98 28.89 15.24
Vicuna-13B 32.80 52.00 36.28 30.11 31.55 43.04 28.13 16.46
Chinese-Alpaca-Plus-13B 38.80 43.90 33.43 34.78 35.46 28.94 11.98 16.46
XVERSE-13B 53.70 55.21 58.44 44.69 42.54 38.06 18.20 15.85
Baichuan 1-13B-Base 52.40 51.60 55.30 49.69 43.20 43.01 26.76 11.5913B
Baichuan 2-13B-Base 58.10 59.17 61.97 54.33 48.17
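
The page text printed above is a benchmark table flattened into plain lines, and in two rows the group label (`7B`, `13B`) is fused onto the last score (`9.207B`, `11.5913B`). A minimal sketch of turning such lines back into records follows. The helper `parse_row` and its `n_cols=8` default are assumptions made for this illustration, not part of LangChain or `pypdf`.

```python
import re

# A value is either a decimal like 68.40 or the "-" placeholder for a missing score.
VALUE = re.compile(r"^(?:\d+\.\d+|-)$")

def parse_row(line, n_cols=8):
    """Split one flattened table row into (model_name, values).

    Hypothetical helper: missing scores ("-") become None, and the function
    returns None when the row does not end with exactly n_cols value tokens.
    """
    tokens = line.split()
    if len(tokens) <= n_cols:
        return None  # header row or truncated row
    # Strip a group label such as "7B" that was fused onto the last score.
    tokens[-1] = re.sub(r"^(\d+\.\d{2})\d+B$", r"\1", tokens[-1])
    name, values = tokens[:-n_cols], tokens[-n_cols:]
    if not all(VALUE.match(v) for v in values):
        return None  # e.g. a name fused with the first score
    return " ".join(name), [None if v == "-" else float(v) for v in values]

row = "Baichuan 1-7B 42.80 42.30 44.02 36.34 34.44 32.48 9.17 9.207B"
print(parse_row(row))
```

Rows the sketch cannot split safely, such as the `ChatGLM 2-6B (base)∗51.70 ...` line where the name is fused with the first score, are rejected with `None` rather than guessed at.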

In [4]:
print(len(pages_pypdf[2].page_content))

2908


In [5]:
pages_pypdf[0].page_content[1583:2000]

'Introduction\nThe field of large language models has witnessed\npromising and remarkable progress in recent years.\nThe size of language models has grown from\nmillions of parameters, such as ELMo (Peters\net al., 2018), GPT-1 (Radford et al., 2018), to\nbillions or even trillions of parameters such as GPT-\n3 (Brown et al., 2020), PaLM (Chowdhery et al.,\n2022; Anil et al., 2023) and Switch Transformers\n(Fedus et al., 20'

### Extracting image information

In [None]:
! pip install rapidocr-onnxruntime

In [6]:
loader = PyPDFLoader("E:/langchain_RAG/data/baichuan.pdf", extract_images=True)
pages_pypdf_image = loader.load()

In [7]:
print(pages_pypdf_image[2].page_content)

C-Eval MMLU CMMLU Gaokao AGIEval BBH GSM8K HumanEval
GPT-4 68.40 83.93 70.33 66.15 63.27 75.12 89.99 69.51
GPT-3.5 Turbo 51.10 68.54 54.06 47.07 46.13 61.59 57.77 52.44
LLaMA-7B 27.10 35.10 26.75 27.81 28.17 32.38 9.78 11.59
LLaMA 2-7B 28.90 45.73 31.38 25.97 26.53 39.16 16.22 12.80
MPT-7B 27.15 27.93 26.00 26.54 24.83 35.20 8.64 14.02
Falcon-7B 24.23 26.03 25.66 24.24 24.10 28.77 5.46 -
ChatGLM 2-6B (base)∗51.70 47.86 - - - 33.68 32.37 -
Baichuan 1-7B 42.80 42.30 44.02 36.34 34.44 32.48 9.17 9.207B
Baichuan 2-7B-Base 54.00 54.16 57.07 47.47 42.73 41.56 24.49 18.29
LLaMA-13B 28.50 46.30 31.15 28.23 28.22 37.89 20.55 15.24
LLaMA 2-13B 35.80 55.09 37.99 30.83 32.29 46.98 28.89 15.24
Vicuna-13B 32.80 52.00 36.28 30.11 31.55 43.04 28.13 16.46
Chinese-Alpaca-Plus-13B 38.80 43.90 33.43 34.78 35.46 28.94 11.98 16.46
XVERSE-13B 53.70 55.21 58.44 44.69 42.54 38.06 18.20 15.85
Baichuan 1-13B-Base 52.40 51.60 55.30 49.69 43.20 43.01 26.76 11.5913B
Baichuan 2-13B-Base 58.10 59.17 61.97 54.33 48.17

In [8]:
print(len(pages_pypdf_image[2].page_content))

3259
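
The printed start of the page looks the same with and without `extract_images=True`, but the page length grows from 2908 to 3259 characters, which presumably is text recognized from images on that page. A small sketch for finding which pages gained text is shown below. The function `ocr_gain` is a hypothetical helper that works on plain lists of strings.

```python
def ocr_gain(plain_texts, ocr_texts):
    """Return {page_index: extra_chars} for pages whose text grew with OCR on."""
    gains = {}
    for i, (plain, ocr) in enumerate(zip(plain_texts, ocr_texts)):
        extra = len(ocr) - len(plain)
        if extra > 0:
            gains[i] = extra
    return gains

# With the notebook's variables this would be called as:
# ocr_gain([p.page_content for p in pages_pypdf],
#          [p.page_content for p in pages_pypdf_image])
```

For the page at index 2, the lengths reported in this notebook give a gain of 3259 - 2908 = 351 characters.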


### Parsing the PDF page by page with `pdfplumber`; text order gets mixed up in multi-column layouts and extraction is incomplete

In [9]:
from langchain.document_loaders import PDFPlumberLoader

In [10]:
loader = PDFPlumberLoader("E:/langchain_RAG/data/baichuan.pdf")
data_plumber = loader.load()

In [13]:
print(len(data_plumber))

28


In [14]:
print(data_plumber[2].page_content)

C-Eval MMLU CMMLU Gaokao AGIEval BBH GSM8K HumanEval
GPT-4 68.40 83.93 70.33 66.15 63.27 75.12 89.99 69.51
GPT-3.5Turbo 51.10 68.54 54.06 47.07 46.13 61.59 57.77 52.44
LLaMA-7B 27.10 35.10 26.75 27.81 28.17 32.38 9.78 11.59
LLaMA2-7B 28.90 45.73 31.38 25.97 26.53 39.16 16.22 12.80
MPT-7B 27.15 27.93 26.00 26.54 24.83 35.20 8.64 14.02
7B Falcon-7B 24.23 26.03 25.66 24.24 24.10 28.77 5.46 -
ChatGLM2-6B(base)∗ 51.70 47.86 - - - 33.68 32.37 -
Baichuan1-7B 42.80 42.30 44.02 36.34 34.44 32.48 9.17 9.20
Baichuan2-7B-Base 54.00 54.16 57.07 47.47 42.73 41.56 24.49 18.29
LLaMA-13B 28.50 46.30 31.15 28.23 28.22 37.89 20.55 15.24
LLaMA2-13B 35.80 55.09 37.99 30.83 32.29 46.98 28.89 15.24
Vicuna-13B 32.80 52.00 36.28 30.11 31.55 43.04 28.13 16.46
13B Chinese-Alpaca-Plus-13B 38.80 43.90 33.43 34.78 35.46 28.94 11.98 16.46
XVERSE-13B 53.70 55.21 58.44 44.69 42.54 38.06 18.20 15.85
Baichuan1-13B-Base 52.40 51.60 55.30 49.69 43.20 43.01 26.76 11.59
Baichuan2-13B-Base 58.10 59.17 61.97 54.33 48.17 48.78
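
Compared with the `pypdf` output for the same page, this output drops spaces inside model names (`GPT-3.5Turbo`) and puts the group labels `7B` and `13B` at the start of a row. One way to spot such differences is to pair the lines of both outputs by position. The sketch below uses a hypothetical helper, `differing_lines`, and assumes both loaders emit the rows in the same order.

```python
def differing_lines(text_a, text_b):
    """Pair lines by position and return (line_no, a, b) where they differ."""
    pairs = zip(text_a.splitlines(), text_b.splitlines())
    return [(n, a, b) for n, (a, b) in enumerate(pairs, start=1) if a != b]

# With the notebook's variables this would be called as:
# differing_lines(pages_pypdf[2].page_content, data_plumber[2].page_content)
```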

### Using `PDFMiner`: the whole document is parsed into one continuous text, and you can define the text structure yourself

In [15]:
from langchain.document_loaders import PDFMinerLoader

loader = PDFMinerLoader("E:/langchain_RAG/data/baichuan.pdf")
data_miner = loader.load()

In [16]:
len(data_miner)

1

In [17]:
print(data_miner[0].page_content[1590:1800])

Introduction

The field of large language models has witnessed
promising and remarkable progress in recent years.
The size of language models has grown from
millions of parameters, such as ELMo (Peters
et al., 


In [18]:
data_miner[0].page_content[1590:1800]

'Introduction\n\nThe field of large language models has witnessed\npromising and remarkable progress in recent years.\nThe size of language models has grown from\nmillions of parameters, such as ELMo (Peters\net al., '
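
Because the loader returns a single `Document`, any structure has to be rebuilt from the raw text. In the output above, blank lines separate blocks and single newlines are hard line wraps, so one possible approach is to split on blank lines and rejoin the wrapped lines. This is a minimal sketch with a hypothetical helper, `to_paragraphs`. It does not repair words hyphenated across lines.

```python
import re

def to_paragraphs(text):
    """Split on blank lines, then join hard-wrapped lines inside each block."""
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [p for p in paragraphs if p]

sample = (
    "Introduction\n\n"
    "The field of large language models has witnessed\n"
    "promising and remarkable progress in recent years.\n"
)
for paragraph in to_paragraphs(sample):
    print(paragraph)
```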

### Using `Unstructured`

In [19]:
from langchain.document_loaders import UnstructuredPDFLoader

In [20]:
loader = UnstructuredPDFLoader("E:/langchain_RAG/data/baichuan.pdf")

In [21]:
data_unstru = loader.load()



In [22]:
print(data_unstru[0].page_content[1662:2000])

Introduction

The field of large language models has witnessed promising and remarkable progress in recent years. The size of language models has grown from millions of parameters, such as ELMo (Peters et al., 2018), GPT-1 (Radford et al., 2018), to billions or even trillions of parameters such as GPT- 3 (Brown et al., 2020), PaLM (Chow


In [23]:
print(len(data_unstru))

1


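The Unstructured loader creates a separate "element" for each chunk of text. By default the elements are combined into one document, but you can keep them separate by passing `mode="elements"` and then split or regroup them with your own logic.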
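
As one example of such custom logic, elements can be grouped by the `category` value in their metadata. The sketch below assumes each element exposes `page_content` and a `metadata` dict with a `category` key. The function `group_by_category` is a hypothetical helper, and `SimpleNamespace` objects stand in for LangChain `Document` objects so the snippet runs on its own.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_category(elements):
    """Map each metadata 'category' to the list of element texts, in order."""
    groups = defaultdict(list)
    for el in elements:
        category = el.metadata.get("category", "Unknown")
        groups[category].append(el.page_content)
    return dict(groups)

# Stand-ins carrying the same two attributes as a LangChain Document.
sample = [
    SimpleNamespace(page_content="3 2 0 2",
                    metadata={"category": "UncategorizedText", "page_number": 1}),
    SimpleNamespace(page_content="p e S 0 2",
                    metadata={"category": "Title", "page_number": 1}),
]
print(group_by_category(sample))
```

With a loader created with `mode="elements"`, the list returned by `loader.load()` could be passed to the function directly.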

In [24]:
loader = UnstructuredPDFLoader("E:/langchain_RAG/data/baichuan.pdf", mode="elements")

In [25]:
data_elements = loader.load()

In [26]:
data_elements

[Document(page_content='3 2 0 2', metadata={'source': 'E:/langchain_RAG/data/baichuan.pdf', 'coordinates': {'points': ((16.34, 263.25), (16.34, 303.25), (36.34, 303.25), (36.34, 263.25)), 'system': 'PixelSpace', 'layout_width': 595.276, 'layout_height': 841.89}, 'file_directory': 'E:/langchain_RAG/data', 'filename': 'baichuan.pdf', 'languages': ['eng'], 'last_modified': '2023-12-08T23:38:07', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),
 Document(page_content='p e S 0 2', metadata={'source': 'E:/langchain_RAG/data/baichuan.pdf', 'coordinates': {'points': ((16.34, 308.25), (16.34, 363.25), (36.34, 363.25), (36.34, 308.25)), 'system': 'PixelSpace', 'layout_width': 595.276, 'layout_height': 841.89}, 'file_directory': 'E:/langchain_RAG/data', 'filename': 'baichuan.pdf', 'languages': ['eng'], 'last_modified': '2023-12-08T23:38:07', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),
 Document(page_content='] L C . s c [', metadata={