fallback python-based .ps extraction #50

deanmalmgren · 2014-08-12T15:18:42Z

The .ps parser currently uses pstotext to extract content from PostScript files. This is a great starting point, but it would be nice to replicate the behavior of the .pdf parser and have a pure Python fallback method so that textract is usable across platforms. From a minute of googling around, it looks like others have started down this path:

I'm not sure whether it makes more sense to roll our own or just use these other packages to extract the text (I have a slight bias toward using the existing packages), but I thought I'd throw this issue together in case it inspires ideas or contributions from others.
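
To make the fallback idea concrete, here is a rough sketch of how the chaining might look. It is not textract's actual implementation: `extract_ps_text` and the `python_fallback` callable are hypothetical names, the sketch assumes `pstotext` writes the extracted text to stdout when given a file path, and it relies on `shutil.which`, which needs Python 3.3 or later.

```python
import shutil
import subprocess


def extract_ps_text(path, python_fallback, command="pstotext"):
    """Return text from a .ps file, preferring the external tool when present.

    `python_fallback` is a hypothetical callable (path -> str) standing in
    for whichever pure Python extractor ends up being used.
    """
    # Only try the external binary if it is actually on the PATH.
    if shutil.which(command) is not None:
        result = subprocess.run(
            [command, path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        if result.returncode == 0:
            return result.stdout.decode("utf-8", errors="replace")
    # Tool missing or failed: defer to the pure Python extractor.
    return python_fallback(path)
```

Passing the fallback in as a callable keeps the question of rolling our own versus wrapping an existing package open, since either option can be plugged in without touching the chaining logic.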


deanmalmgren added the enhancement label Aug 12, 2014

deanmalmgren mentioned this issue Oct 3, 2014

pdf parser: chain pdftotext/pdfminer + tesseract #77

Open
