
Document AI PDF Splitter Sample

NOTE: This sample is deprecated. Use Document AI Toolbox to split PDFs based on the output of a Splitter/Classifier processor.

This project uses Document AI Splitter/Classifier processors to identify split points and uses pikepdf to split the PDF documents at those points.

Designed to work with the following processors:

Lending Document Splitter & Classifier (LENDING_DOCUMENT_SPLIT_PROCESSOR)
Procurement Document Splitter & Classifier (PROCUREMENT_DOCUMENT_SPLIT_PROCESSOR)
DEPRECATED General Document Splitter (DOCUMENT_SPLIT_PROCESSOR)

For more information about Document AI Splitters, see Document splitters behavior.
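As a rough illustration of how split points can be read from a splitter response, the sketch below walks the entities of a Document that has been loaded from JSON. It is not the code in main.py. The field names (`entities`, `type`, `pageAnchor`, `pageRefs`, `page`) assume the camelCase JSON form of the Document AI Document message, in which page indexes are 0-based, may be encoded as strings, and may be omitted for page 0. The helper name `page_ranges` is hypothetical.

```python
def page_ranges(document: dict) -> list[tuple[str, int, int]]:
    """Return (type, first_page, last_page) for each entity.

    Pages are 0-based and inclusive. Field names are an assumption
    about the JSON form of a Document AI Document.
    """
    ranges = []
    for entity in document.get("entities", []):
        refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        if not refs:
            # Without page references there is nothing to split out.
            continue
        # JSON may encode int64 page indexes as strings and omit page 0.
        pages = [int(ref.get("page", 0)) for ref in refs]
        ranges.append((entity.get("type", "unknown"), min(pages), max(pages)))
    return ranges
```

Given the raw output file described in the quick start, something like `page_ranges(json.load(open("multi_document.json")))` would list one page range per detected sub-document, provided the file uses the field names assumed above.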

Quick start

Install Python
Install the prerequisites: pip install -r requirements.txt
Install the Google Cloud SDK
Run gcloud init, create a new project, and enable billing
Enable the Document AI API: gcloud services enable documentai.googleapis.com
Set up Application Default Credentials: gcloud auth application-default login
Run the sample: python main.py -i multi_document.pdf
- You should see the split sub-documents in your current directory, with file names like pg1-2_1040sc_2020_multi_document.
- You should also see the raw Document output from Document AI in a JSON file, multi_document.json.
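The file name above suggests a pattern of 1-based page range, document type, and the input file's stem. The sketch below shows one way such files could be produced with pikepdf. It is an assumption-laden illustration rather than the sample's actual implementation: the helper names `output_name` and `split_pdf`, the `.pdf` extension, and the exact naming pattern are all guesses based on the example name.

```python
from pathlib import Path


def output_name(pdf_path: str, doc_type: str, first: int, last: int) -> str:
    """Build a name like pg1-2_1040sc_2020_multi_document.pdf.

    `first` and `last` are 0-based page indexes; the name uses 1-based pages.
    The pattern is inferred from the example file name, not confirmed.
    """
    return f"pg{first + 1}-{last + 1}_{doc_type}_{Path(pdf_path).stem}.pdf"


def split_pdf(pdf_path: str, ranges, output_dir: str = ".") -> list[Path]:
    """Write one PDF per (doc_type, first, last) range and return the paths."""
    # Third-party dependency, imported here so output_name() works without it.
    import pikepdf

    written = []
    with pikepdf.Pdf.open(pdf_path) as source:
        for doc_type, first, last in ranges:
            part = pikepdf.Pdf.new()
            # Copy the inclusive 0-based page range into the new PDF.
            for page in source.pages[first : last + 1]:
                part.pages.append(page)
            target = Path(output_dir) / output_name(pdf_path, doc_type, first, last)
            part.save(target)
            written.append(target)
    return written
```

For instance, `split_pdf("multi_document.pdf", [("1040sc_2020", 0, 1)])` would write the first two pages to a single file, assuming pikepdf is installed and the input has at least two pages.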

Setup

Install dependencies

Install pyenv: https://github.com/pyenv/pyenv#installation
Use pyenv to install the latest version of Python 3. For example, to install Python 3.10.1, run: pyenv install 3.10.1
Create a Python virtual environment with the installed version of Python 3. For example, to create a Python 3.10.1 virtual environment called docai-splitter, run: pyenv virtualenv 3.10.1 docai-splitter
Clone this repo and cd to the root of the repo
Configure pyenv to use the virtual environment created earlier whenever you are in this repo: pyenv local docai-splitter
Install the prerequisites: pip install -r requirements.txt

Setup Google Cloud

Install the Cloud SDK: https://cloud.google.com/sdk/docs/install
Run gcloud init to create a new project, and link a billing account to your project
Enable the Document AI API: gcloud services enable documentai.googleapis.com
Set up Application Default Credentials: gcloud auth application-default login

Running the sample

Run the sample: python main.py -i multi_document.pdf
Check that the PDFs created in the current directory are sub-documents of multi_document.pdf.
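One quick way to do this check is to list the files whose names follow the output pattern. The sketch below assumes output names start with `pg` and contain the input file's stem, as in the quick start example. The helper name `find_sub_documents` is hypothetical.

```python
from pathlib import Path


def find_sub_documents(directory: str = ".", stem: str = "multi_document") -> list[str]:
    """Return sorted names of files that look like split output for `stem`."""
    matches = Path(directory).glob(f"pg*_{stem}*")
    return sorted(path.name for path in matches if path.is_file())


if __name__ == "__main__":
    for name in find_sub_documents():
        print(name)
```

The original input file is not listed because its name does not start with `pg`. Opening each listed file in a PDF viewer remains the way to confirm that the page contents are correct.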

Testing

Linting

Install dependencies:
```
pip install -U pylint
```
Run the linter:
```
pylint *.py
```

Unit tests

Run the unit tests: python main_test.py

Manual

Run the sample: python main.py -i multi_document.pdf
Check that the PDFs created in the current directory are sub-documents of multi_document.pdf.
