alexandru-dinu/pdf-indexer

PDF indexer

Code style: black

Simple indexer + text searcher for a collection of PDF documents.

Usage

First, extract the words from the PDF documents:

python extract_text.py --pdf_dir PDF_DIR --text_path TEXT_PATH [--pool_size POOL_SIZE]
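As a rough illustration of what this step might involve, the sketch below runs an extractor over every PDF in a directory with a worker pool. It is not the code from extract_text.py: the function names, the default pool size, and the choice of calling pdfgrep with the pattern "." (which matches every non-empty line) are all assumptions made for the example.

```python
import subprocess
from multiprocessing.pool import ThreadPool
from pathlib import Path


def extract_one(pdf_path):
    """Return the text of one PDF as printed by pdfgrep (hypothetical helper)."""
    # Assumption: the regex "." matches every line that has at least one
    # character, so pdfgrep prints all non-empty lines of the document.
    proc = subprocess.run(
        ["pdfgrep", ".", str(pdf_path)],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.stdout


def extract_all(pdf_dir, pool_size=4, extractor=extract_one):
    """Apply `extractor` to every *.pdf under pdf_dir using a pool of workers."""
    pdf_paths = sorted(Path(pdf_dir).rglob("*.pdf"))
    # Threads are enough here: the heavy work happens in the pdfgrep subprocess.
    with ThreadPool(pool_size) as pool:
        texts = pool.map(extractor, pdf_paths)
    return {str(path): text for path, text in zip(pdf_paths, texts)}
```

Passing the extractor as a parameter keeps the pool logic independent of pdfgrep, so the same function can be exercised with a stub when the binary is not installed.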

Then, construct the index:

python construct_index.py --text_path TEXT_PATH --index_path INDEX_PATH
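Index construction with whoosh could look roughly like the following. This is a minimal sketch, not the contents of construct_index.py: the schema (a stored `path` field and an indexed `content` field), the `build_index` name, and the in-memory `{path: text}` input are assumptions, since the README does not describe the format of TEXT_PATH.

```python
import os


def build_index(docs, index_path):
    """Index a {path: text} mapping with Whoosh; return the number of documents."""
    # Imported lazily so this module loads even where Whoosh is not installed.
    from whoosh.fields import ID, TEXT, Schema
    from whoosh.index import create_in

    # Hypothetical schema: the path is stored so it can be shown in results,
    # the content is tokenized and indexed but not stored.
    schema = Schema(path=ID(stored=True, unique=True), content=TEXT)

    os.makedirs(index_path, exist_ok=True)
    ix = create_in(index_path, schema)

    writer = ix.writer()
    for path, text in docs.items():
        writer.add_document(path=path, content=text)
    writer.commit()
    return len(docs)
```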

Finally, run the searcher:

python query.py --index_path INDEX_PATH [--max_results MAX_RESULTS]
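For orientation, a query against a whoosh index might be sketched as below. The `search` function is hypothetical and assumes the index has a stored `path` field and an indexed `content` field; the actual field names used by query.py are not stated in this README.

```python
def search(index_path, query_text, max_results=10):
    """Return up to max_results (path, score) pairs for query_text."""
    # Imported lazily so this module loads even where Whoosh is not installed.
    from whoosh.index import open_dir
    from whoosh.qparser import QueryParser

    ix = open_dir(index_path)
    # Assumption: the searchable text lives in a field named "content".
    parser = QueryParser("content", schema=ix.schema)
    query = parser.parse(query_text)

    with ix.searcher() as searcher:
        hits = searcher.search(query, limit=max_results)
        # Copy what we need before the searcher is closed.
        return [(hit["path"], hit.score) for hit in hits]
```

The `limit` argument plays the role that `--max_results` presumably plays on the command line.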

Requirements

  • pdfgrep binary for text extraction.
  • spacy for text preprocessing (e.g. tokenization).
  • whoosh for index construction and querying.
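To give an idea of the preprocessing role spacy plays, here is a small tokenization sketch. It assumes a blank English pipeline (tokenizer only, no model download) and a simple "lowercase alphabetic tokens" filter; the project may well use a different pipeline or filtering rule.

```python
def tokenize(text):
    """Split text into lowercase alphabetic tokens using spaCy's tokenizer."""
    # Imported lazily so this module loads even where spaCy is not installed.
    import spacy

    # Assumption: a blank English pipeline is sufficient for tokenization.
    nlp = spacy.blank("en")
    doc = nlp(text)
    # Keep purely alphabetic tokens, dropping punctuation and numbers.
    return [token.lower_ for token in doc if token.is_alpha]
```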
