This program finds words and phrases in the pages of PDF documents.
It returns the matches with the search terms highlighted.
It works like the find function of a PDF viewer, but it runs online and searches across several files at once.
It is currently set up to search electoral programs, but you can load any PDF files you want with little effort.
Example: Buscador de programas
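As a rough illustration of the highlighted search described above, a phrase query with highlighting can be written in the Elasticsearch query DSL as a plain dictionary. This is only a sketch: the field name `content` and the helper `build_search_body` are assumptions, not names taken from this project.

```python
import json


def build_search_body(phrase, field="content"):
    """Build a phrase query that also asks Elasticsearch for highlights.

    `field` is a hypothetical name for the field holding the page text.
    """
    return {
        "query": {"match_phrase": {field: phrase}},
        # An empty dict requests highlighting with the default settings.
        "highlight": {"fields": {field: {}}},
    }


body = build_search_body("public education")
request_json = json.dumps(body, sort_keys=True)
```

The resulting JSON could be sent to the `_search` endpoint of an index; by default Elasticsearch wraps each highlighted match in `<em>` tags.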
Requirements
- Python (tested with version 2.7.9)
- Django (tested with version 1.8.6)
- Elasticsearch (tested with version 2.0.0)
- pdf2htmlEX (tested with version 0.12)
- pdfseparate (tested with version 0.26.5)
- Docker (tested with version 1.9.1)
This repository contains:
- web: the Django project and app.
- docker: Docker files and scripts to build the entire system.
- programs: the PDF documents.
- prepare.sh: script to prepare the programs for Elasticsearch.
- load.sh: script to load the documents into Elasticsearch.
How to use it
This program has two parts: scripts to load documents into Elasticsearch, and a web interface to query them.
You load a document into Elasticsearch in two steps:
- Prepare documents
- Load documents to ElasticSearch
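The two steps can be sketched as a small driver that builds the command lines for one document. It only uses the arguments shown in the help output of `prepare.py` and `load.py`; the function name, the sample values, and the assumption that `-p` receives the PDF path are illustrative guesses.

```python
def commands_for_document(party, year, program_id, zone, pdf_path):
    """Return the prepare and load command lines for a single document."""
    # Step 1: split the PDF into pages and convert it to HTML.
    prepare_cmd = ["python", "prepare.py", party, pdf_path]
    # Step 2: load the prepared program into Elasticsearch.
    # Assumption: -p takes the same path that was given to prepare.py.
    load_cmd = [
        "python", "load.py",
        "-a", party,
        "-y", str(year),
        "-i", program_id,
        "-z", zone,
        "-p", pdf_path,
    ]
    return [prepare_cmd, load_cmd]


# Hypothetical values, for illustration only.
commands = commands_for_document(
    "example-party", 2015, "example-2015", "00", "programs/example.pdf"
)
```

Each list could be passed to `subprocess.check_call` to run the steps in order.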
Prepare documents
prepare.sh prepares the programs for Elasticsearch. It is a script that calls prepare.py once for every document.
prepare.py splits a PDF file into pages and converts the pages and the complete PDF to HTML.
    $ python prepare.py -h
    usage: prepare.py [-h] party pdf

    Split a PDF in pages and transform the pages and the complete PDF in HTML

    positional arguments:
      party       The Party name
      pdf         The PDF file

    optional arguments:
      -h, --help  show this help message and exit
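To make the split-and-convert step concrete, here is a minimal sketch of the two external commands such a script might run, using `pdfseparate` and `pdf2htmlEX` from the tools listed earlier. The output layout, file naming, and function name are assumptions; the actual `prepare.py` may organise its output differently.

```python
import os


def build_prepare_commands(party, pdf, out_dir="output"):
    """Build the commands to split a PDF and convert it to HTML.

    `out_dir` and the naming scheme are hypothetical.
    """
    dest = os.path.join(out_dir, party)
    base = os.path.splitext(os.path.basename(pdf))[0]
    # pdfseparate writes one file per page; %d is replaced by the page number.
    split_cmd = ["pdfseparate", pdf, os.path.join(dest, base + "-%d.pdf")]
    # pdf2htmlEX converts a PDF to HTML inside the destination directory.
    html_cmd = ["pdf2htmlEX", "--dest-dir", dest, pdf]
    return split_cmd, html_cmd


split_cmd, html_cmd = build_prepare_commands("example-party", "programs/example.pdf")
```

The same `pdf2htmlEX` command could then be repeated for each page file produced by `pdfseparate`.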
Load documents to ElasticSearch
load.sh loads the data into Elasticsearch. It is a script that calls load.py once for every document.
load.py modifies the Elasticsearch database.
    $ python load.py -h
    usage: load.py [-h] [-a PARTY] [-y YEAR] [-i PROGRAM_ID] [-z ZONE] [-p PATH]
                   [-d] [-s]

    Load a program to ElasticSearch

    optional arguments:
      -h, --help            show this help message and exit
      -a PARTY, --party PARTY
                            The Party name
      -y YEAR, --year YEAR  The Program year
      -i PROGRAM_ID, --program_id PROGRAM_ID
                            The identifier for this program
      -z ZONE, --zone ZONE  The CCAA code. INE.
      -p PATH, --path PATH  The path to the program.
      -d, --delete          Delete de index.
      -s, --schema          Delete de index.
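The help output above corresponds to an `argparse` parser along the following lines. This is a reconstruction from the help text only: the help strings are copied from it, while the use of `store_true` for `-d` and `-s` and the string type of every value are assumptions.

```python
import argparse


def build_parser():
    """Rebuild a parser matching the load.py help text (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="load.py", description="Load a program to ElasticSearch"
    )
    parser.add_argument("-a", "--party", help="The Party name")
    parser.add_argument("-y", "--year", help="The Program year")
    parser.add_argument("-i", "--program_id",
                        help="The identifier for this program")
    parser.add_argument("-z", "--zone", help="The CCAA code. INE.")
    parser.add_argument("-p", "--path", help="The path to the program.")
    # Assumption: both switches are simple boolean flags.
    parser.add_argument("-d", "--delete", action="store_true",
                        help="Delete de index.")
    parser.add_argument("-s", "--schema", action="store_true",
                        help="Delete de index.")
    return parser


# Hypothetical values, for illustration only.
args = build_parser().parse_args(
    ["-a", "example-party", "-y", "2015", "-i", "example-2015",
     "-z", "00", "-p", "programs/example.pdf"]
)
```

Parsed values are then available as attributes such as `args.party` and `args.path`.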
You can search in documents with the web application. This web application is a
Django Project in