pdf2text

This repo contains code to extract text from PDFs, pictures, and scanned documents.

OCR

OCR (Optical Character Recognition) identifies the words in a picture or scanned document and converts them into machine-readable text that a computer can process further. Although the technology is mature and uses advanced techniques, it still quite often produces erroneous output.

This repo contains the code for BYOB Challenge: OCR De-noising.
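To make "de-noising" concrete, here is a minimal sketch of one common post-processing idea: snapping each OCR token to the closest word in a known vocabulary. This is an illustration only and is not taken from this repo's code; the function name `denoise_tokens`, the sample vocabulary, and the `cutoff` value are all assumptions.

```python
import difflib
from typing import Iterable, List


def denoise_tokens(
    tokens: Iterable[str], vocabulary: List[str], cutoff: float = 0.7
) -> List[str]:
    """Replace each token with its closest vocabulary word, if one is similar enough.

    Hypothetical helper for illustration; not part of this repo.
    """
    cleaned = []
    for token in tokens:
        # get_close_matches returns at most n candidates whose similarity
        # ratio to the token is at least `cutoff`, best match first.
        matches = difflib.get_close_matches(
            token.lower(), vocabulary, n=1, cutoff=cutoff
        )
        # Keep the original token when nothing in the vocabulary is close.
        cleaned.append(matches[0] if matches else token)
    return cleaned
```

For example, calling `denoise_tokens(["rnachine", "t3xt"], ["machine", "readable", "text"])` maps the two misread tokens back to `machine` and `text`. A lower `cutoff` corrects more aggressively but risks replacing valid words that are simply missing from the vocabulary.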

Steps to run the program:

1. Clone the repository using:

   git clone https://github.com/ViswanathaReddyGajjala/pdf2text.git
2. Go to pdftoimage.com and convert the PDF file to JPG images.
3. Place the PDF and the corresponding images in the /data/demo folder.
4. Run the demo.py file.
5. The result is printed on the command line (for Windows users) or in the terminal (for Ubuntu users).

- Note: Remove any files left over from previous runs in the /data/demo and /data/result folders.
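The cleanup in the note can be scripted. The sketch below is not part of the repo: `clear_folder` is a hypothetical helper, and the folder names are simply those mentioned in the note. Run it before placing new inputs in the demo folder, since it deletes every file it finds there.

```python
from pathlib import Path


def clear_folder(folder: Path) -> int:
    """Delete every regular file directly inside `folder`.

    Returns the number of files removed. Subfolders are left untouched.
    Hypothetical helper for illustration; not part of this repo.
    """
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    removed = 0
    for entry in folder.iterdir():
        if entry.is_file():
            entry.unlink()
            removed += 1
    return removed
```

A typical call, from the repository root, would be `clear_folder(Path("data/demo"))` followed by `clear_folder(Path("data/result"))`. Adjust the paths if your checkout uses different folder names.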

Results

Localized text proposals on a pdf.

* Bounding box coordinates can be found in data/results/*.txt.
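The result files can be loaded for further processing with a few lines of Python. This README does not document the file layout, so the sketch below assumes one box per line, written as comma-separated numbers; check a generated file and adapt the parsing if the format differs. `parse_boxes` and `load_all_boxes` are hypothetical names.

```python
from pathlib import Path
from typing import Dict, List


def parse_boxes(text: str) -> List[List[int]]:
    """Parse one box per line.

    Assumption: each line holds comma-separated numbers.
    """
    boxes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # int(float(...)) accepts both "12" and "12.0"; empty fields are ignored.
        boxes.append([int(float(v)) for v in line.split(",") if v.strip()])
    return boxes


def load_all_boxes(results_dir: str = "data/results") -> Dict[str, List[List[int]]]:
    """Map each .txt file name in `results_dir` to its list of boxes."""
    return {
        path.name: parse_boxes(path.read_text())
        for path in sorted(Path(results_dir).glob("*.txt"))
    }
```

Calling `load_all_boxes()` from the repository root returns a dictionary keyed by file name, so the boxes for each page image stay grouped together.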
