Extract text from a PDF (PDF to text). API in Docker.

Why did we create this project?

Our Laravel project needed to extract text from large files, and the existing packages do not work with files larger than 50 megabytes.
Text extraction is an expensive operation, so running it on a separate server reduces the load on the main application.
We also needed to generate a cover image from the source file.

Installation

git clone https://github.com/dotcode-moscow/pdf-api.git
cd pdf-api
docker-compose up -d pdf-api

Method /api/extractText

Extracts text from a PDF file. Pass the URL of the file in the "url" parameter.

Method /api/pdf/ping

Ping-pong method for checking that the service is up.

Method /api/imageToPDF

Converts an image to a PDF.

Basic example

curl -d "url=https://trove.nla.gov.au/newspaper/rendition/nla.news-page29291123.pdf" "http://localhost:8080/api/extractText"
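The same request can be sketched in Python with only the standard library. This is a minimal sketch, not part of the project: the helper names `build_extract_request` and `extract_text` are hypothetical, the base URL assumes the service is listening on localhost:8080 as in the curl call above, and the 300-second timeout is an arbitrary choice for large files.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the service runs locally on port 8080, as in the curl example.
BASE_URL = "http://localhost:8080"


def build_extract_request(pdf_url, base_url=BASE_URL):
    """Build a form-encoded POST to /api/extractText (hypothetical helper)."""
    # urlencode percent-encodes the PDF URL; curl -d sends it unencoded.
    body = urlencode({"url": pdf_url}).encode("ascii")
    return Request(
        f"{base_url}/api/extractText",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def extract_text(pdf_url, base_url=BASE_URL, timeout=300):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_extract_request(pdf_url, base_url)
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `extract_text("https://trove.nla.gov.au/newspaper/rendition/nla.news-page29291123.pdf")` would then return the parsed JSON, provided the container is running.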

POST(HTTP) example

http://localhost:8080/api/extractText?url=https://trove.nla.gov.au/newspaper/rendition/nla.news-page29291123.pdf

Response (JSON) example

Each key is a page number (keys are not sorted) and each value is the text extracted from that page.
"img" is the front page cover as a base64-encoded JPEG.

{
  "1":"National Library of Australia...",
  "img": "data:image/jpeg;base64..."
}
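Because the page keys are not sorted, a client will usually want to order them and decode the cover. The sketch below shows one way to do that in Python, assuming the response has exactly the shape shown above: numeric string keys for pages plus an "img" data URI. The function name `split_response` is hypothetical.

```python
import base64


def split_response(payload):
    """Return (pages, cover_bytes) from an /api/extractText response dict.

    pages is a list of (page_number, text) tuples sorted by page number;
    cover_bytes is the decoded JPEG, or None if "img" is absent.
    """
    # Page keys arrive as strings ("1", "2", ...), so sort them numerically.
    pages = sorted(
        (int(key), text) for key, text in payload.items() if key.isdigit()
    )
    cover = None
    img = payload.get("img")
    if img and "," in img:
        # A data URI has the form "data:image/jpeg;base64,<data>";
        # keep only the part after the first comma.
        cover = base64.b64decode(img.split(",", 1)[1])
    return pages, cover
```

Joining the page texts with `"\n".join(text for _, text in pages)` then gives the full document text in reading order.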

Production mode

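To run in production with host networking, set this option for the pdf-api service in docker-compose.yml:

network_mode: "host"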

Credit

PDFBox

Contributing

Pull requests are welcome.
