ViswanathaReddyGajjala/pdf2text

pdf2text

This repo contains code to extract text from a PDF, picture, or scanned document.

OCR

OCR (Optical Character Recognition) identifies the words in a picture or scanned document and converts them into machine-readable text that can be processed further by a computer. Although the technology is mature and uses advanced techniques, it still quite often produces erroneous output.

This repo contains the code for BYOB Challenge: OCR De-noising.

Steps to run the program:

  1. Clone the repository using:

    git clone https://github.com/ViswanathaReddyGajjala/pdf2text.git

  2. Go to pdftoimage.com and convert the PDF file to JPG images.

  3. Place the PDF and the corresponding images in the /data/demo folder.

  4. Run the demo.py file.

  5. The result is printed to the command line (for Windows users) or the terminal (for Ubuntu users).

    • Note: Remove any files left over from previous runs from the /data/demo and /data/result folders.
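The cleanup mentioned in the note can be scripted. The following is a minimal sketch and not part of the repository: the helper name `clear_stale_files` and its `keep_suffixes` parameter are hypothetical, and it assumes that leftover outputs are regular files sitting directly inside the folder.

```python
from pathlib import Path


def clear_stale_files(folder, keep_suffixes=()):
    """Delete regular files directly inside `folder` and return their names.

    Hypothetical helper (not from the repo). Subfolders are left untouched,
    and files whose extension is listed in `keep_suffixes` are kept.
    """
    folder = Path(folder)
    removed = []
    if not folder.is_dir():
        # Nothing to clean if the folder does not exist yet.
        return removed
    for entry in sorted(folder.iterdir()):
        if entry.is_file() and entry.suffix.lower() not in keep_suffixes:
            entry.unlink()
            removed.append(entry.name)
    return removed
```

One possible use is to call `clear_stale_files("data/result")` and `clear_stale_files("data/demo")` before copying a new PDF and its images into /data/demo, so that stale inputs and outputs do not mix with the new run.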

Results

  • Localized text proposals on a pdf.

  • Bounding box coordinates can be found in data/results/*.txt.
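To work with the bounding boxes afterwards, the result files can be loaded with a few lines of Python. This is only a sketch under an assumption: the README does not describe the file layout, so the code assumes that each non-empty line holds the numeric coordinates of one box, separated by commas or whitespace. The function names `read_boxes` and `read_all_boxes` are hypothetical, and the parsing may need adjusting to match what demo.py actually writes.

```python
import re
from pathlib import Path


def read_boxes(path):
    """Parse one result file into a list of coordinate lists, one per line.

    Assumption: each non-empty line contains only numbers separated by
    commas or whitespace. The real format written by demo.py may differ.
    """
    boxes = []
    for line in Path(path).read_text().splitlines():
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if fields:
            boxes.append([float(f) for f in fields])
    return boxes


def read_all_boxes(results_dir="data/results"):
    """Map each *.txt file name in `results_dir` to its parsed boxes."""
    return {p.name: read_boxes(p) for p in sorted(Path(results_dir).glob("*.txt"))}
```

Calling `read_all_boxes()` from the repository root would then return a dictionary keyed by result file name. The meaning of each number (corner order, units) is not specified here and should be checked against the actual output.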
