pdf_scraper

pdf_scraper is a tool that finds all PDF files in a folder and its sub-folders, extracts the text from each file, removes punctuation and stop-words, and counts how often each word occurs. The most common keywords are stored for each PDF file together with its file path. Similar PDF files can then be identified by comparing the keywords between files.

$ ./pdf_scraper <path>
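The steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in pdf_scraper.py: the function names, the tiny stop-word set, and the use of Jaccard similarity to compare keyword lists are all assumptions made for this example. Text extraction from the PDF itself is left out, so `top_keywords` takes text that has already been extracted.

```python
import string
from collections import Counter
from pathlib import Path

# Tiny illustrative stop-word set (assumption); the repository's
# stop_words file is not reproduced here.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def find_pdfs(root):
    """Return all .pdf files under root, including sub-folders, sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def top_keywords(text, n=10):
    """Strip punctuation and stop-words, then return the n most frequent words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(n)]


def keyword_similarity(a, b):
    """Jaccard similarity between two keyword lists, from 0.0 to 1.0."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Two files whose keyword lists share most entries score close to 1.0, while files with no keywords in common score 0.0. Other measures, such as cosine similarity on the word counts, would work as well.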

Installing pdf_scraper

pdf_scraper is not yet available on PyPI. Once it has been published, it should be installable with:

$ python -m pip install pdf_scraper

Repository: aojanzen/pdf_scraper
