NOTE: This sample is deprecated. Use Document AI Toolbox to split PDFs based on output from a Splitter/Classifier processor.
This project uses Document AI Splitter/Classifier processors to identify split points and uses PikePDF to split PDF documents at those points.
Designed to work with the following processors:
- Lending Document Splitter & Classifier
LENDING_DOCUMENT_SPLIT_PROCESSOR
- Procurement Document Splitter & Classifier
PROCUREMENT_DOCUMENT_SPLIT_PROCESSOR
- DEPRECATED General Document Splitter
DOCUMENT_SPLIT_PROCESSOR
For more information about Document AI Splitters, check out Document splitters behavior
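The split step described above can be sketched roughly as follows. This is a minimal illustration, not the sample's actual code: it assumes each splitter entity is available as a dict in the camelCase JSON form of the Document (with `pageAnchor.pageRefs` holding zero-based page indices), and the helper names `entity_page_range` and `split_pdf`, as well as the `pages_N-M.pdf` output names, are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def entity_page_range(entity: dict) -> Tuple[int, int]:
    """Return the zero-based (first, last) page covered by one splitter entity.

    Assumes the camelCase JSON shape of a Document entity.
    """
    refs = entity.get("pageAnchor", {}).get("pageRefs", [])
    # JSON output may omit a page value of 0 and may encode page numbers
    # as strings, so default to 0 and convert with int().
    pages = [int(ref.get("page", 0)) for ref in refs]
    if not pages:
        raise ValueError("entity has no page references")
    return min(pages), max(pages)


def split_pdf(
    src_path: str, ranges: Iterable[Tuple[int, int]], out_dir: str = "."
) -> List[Path]:
    """Write one PDF per (first, last) zero-based page range."""
    # pikepdf is a third-party package; importing it here keeps
    # entity_page_range usable even where pikepdf is not installed.
    import pikepdf

    written = []
    with pikepdf.Pdf.open(src_path) as src:
        for first, last in ranges:
            dst = pikepdf.Pdf.new()
            for index in range(first, last + 1):
                dst.pages.append(src.pages[index])
            # Hypothetical output name; the sample uses its own scheme.
            out_path = Path(out_dir) / f"pages_{first + 1}-{last + 1}.pdf"
            dst.save(out_path)
            written.append(out_path)
    return written
```

With a list of entities in hand, something like `split_pdf("multi_document.pdf", [entity_page_range(e) for e in entities])` would then produce one file per detected sub-document.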
- Install Python
- Install the prerequisites:
pip install -r requirements.txt
- Install the Google Cloud SDK
- Run gcloud init, create a new project, and enable billing
- Enable the Document AI API:
gcloud services enable documentai.googleapis.com
- Set up application default authentication by running:
gcloud auth application-default login
- Run the sample:
python main.py -i multi_document.pdf
- You should see the split sub-documents in your current directory, with file names like pg1-2_1040sc_2020_multi_document.
- You should also see the raw Document output from Document AI in a JSON file, multi_document.json.
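The file name in the example above appears to combine a one-based page range, the detected document type, and the input file's name. A small sketch of that convention, assuming the pattern `pg<first>-<last>_<type>_<input name>`; the function name `subdocument_name` and the single-page form `pg<N>_...` are assumptions, not taken from the sample:

```python
def subdocument_name(first: int, last: int, doc_type: str, source_stem: str) -> str:
    """Build an output name from a zero-based page range and a document type.

    The single-page form (no dash) is an assumption for illustration.
    """
    if first == last:
        pages = f"pg{first + 1}"
    else:
        pages = f"pg{first + 1}-{last + 1}"
    return f"{pages}_{doc_type}_{source_stem}"


# Pages 0 and 1 classified as "1040sc_2020" in multi_document.pdf:
print(subdocument_name(0, 1, "1040sc_2020", "multi_document"))
# prints "pg1-2_1040sc_2020_multi_document"
```

Putting the page range first keeps the outputs grouped in page order when the directory is listed by name, at least for documents with fewer than ten sub-documents.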
- Install pyenv: https://github.com/pyenv/pyenv#installation
- Use pyenv to install the latest version of Python 3. For example, to install Python 3.10.1, run:
pyenv install 3.10.1
- Create a Python virtual environment with the installed version of Python 3. For example, to create a Python 3.10.1 virtual environment called docai-splitter, run:
pyenv virtualenv 3.10.1 docai-splitter
- Clone this repo and cd to the root of the repo
- Configure pyenv to use the virtual environment created earlier whenever you are in this repo:
pyenv local docai-splitter
- Install the prerequisites:
pip install -r requirements.txt
- Install the Cloud SDK: https://cloud.google.com/sdk/docs/install
- Run gcloud init to create a new project, and link a billing account to your project
- Enable the Document AI API:
gcloud services enable documentai.googleapis.com
- Set up application default authentication by running:
gcloud auth application-default login
- Run the sample:
python main.py -i multi_document.pdf
- Check that the PDFs created in the current directory are sub-documents of multi_document.pdf.
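One rough way to run this check is to list the output files and read the page ranges back out of their names. The sketch below assumes the `pg<first>-<last>_<type>_<input name>` naming shown earlier; `parse_subdocument_name` and `list_subdocuments` are hypothetical helpers, not part of the sample, and they only inspect file names rather than PDF contents.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# pg<first>[-<last>]_<rest>, with one-based page numbers.
NAME_PATTERN = re.compile(r"^pg(\d+)(?:-(\d+))?_(.+)$")


def parse_subdocument_name(stem: str) -> Optional[Tuple[int, int, str]]:
    """Return (first_page, last_page, rest) or None if the name does not match."""
    match = NAME_PATTERN.match(stem)
    if match is None:
        return None
    first = int(match.group(1))
    last = int(match.group(2) or first)
    return first, last, match.group(3)


def list_subdocuments(
    directory: str = ".", source_stem: str = "multi_document"
) -> List[Tuple[int, int, str]]:
    """List (first_page, last_page, file name) for outputs of one input file."""
    found = []
    for path in Path(directory).glob("pg*"):
        parsed = parse_subdocument_name(path.stem)
        if parsed and parsed[2].endswith(source_stem):
            found.append((parsed[0], parsed[1], path.name))
    return sorted(found)


if __name__ == "__main__":
    for first, last, name in list_subdocuments():
        print(f"pages {first}-{last}: {name}")
```

If the listed ranges are contiguous and together cover every page of the input, the split most likely worked as intended; opening a few of the files is still the more reliable check.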
- Install dependencies:
pip install -U pylint
- Run the linter:
pylint *.py
- Run the unit tests:
python main_test.py
- Run the sample:
python main.py -i multi_document.pdf
- Check that the PDFs created in the current directory are sub-documents of multi_document.pdf.