DocSplit fails to extract text on Windows when filenames have spaces #41

@ghost

Description

Docsplit invokes pdftotext to extract text and builds the command line by escaping each space in the filename with \. On Windows, \ is a path separator rather than an escape character, so the spaces still split the filename into separate arguments.
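To illustrate the escaping described above, here is a minimal sketch. It assumes the backslashes come from Ruby's Shellwords.escape or something equivalent; the issue does not say which call Docsplit uses, and posix_pdftotext_command is a hypothetical name chosen for this example.

```ruby
require 'shellwords'

# Hypothetical helper that mimics the command line described in this issue.
# Shellwords.escape applies Bourne-shell rules: each space gets a leading
# backslash. That is correct for sh or bash, but not for Windows.
def posix_pdftotext_command(pdf_path, text_path)
  "pdftotext -enc UTF-8 #{Shellwords.escape(pdf_path)} " \
    "#{Shellwords.escape(text_path)} 2>&1"
end

puts posix_pdftotext_command(
  "test-docs/Ideology and Climate Change.pdf",
  "extracted-text/Ideology and Climate Change.txt"
)
```

On a POSIX shell the escaped spaces keep each path together as one argument. On Windows they presumably do not, which would explain why pdftotext prints its usage text instead of converting the file.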

In my instrumented test, docsplit attempts to execute the following command:

pdftotext -enc UTF-8 test-docs/Ideology\ and\ Climate\ Change.pdf extracted-text/Ideology\ and\ Climate\ Change.txt 2>&1

This fails with the error message below. The following command works:

pdftotext -enc UTF-8 "test-docs/Ideology and Climate Change.pdf" "extracted-text/Ideology and Climate Change.txt" 2>&1
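One possible way to produce the quoted form on Windows while keeping the existing escaping elsewhere is sketched below. This is not Docsplit's code: quote_path, pdftotext_command, and the WINDOWS constant are hypothetical names, and the host_os pattern is an assumption about how native Windows Ruby builds identify themselves.

```ruby
require 'rbconfig'
require 'shellwords'

# Assumption: native Windows Ruby builds report mswin or mingw in host_os.
WINDOWS = !!(RbConfig::CONFIG['host_os'] =~ /mswin|mingw/)

# Hypothetical helper: wrap the path in double quotes on Windows,
# fall back to Bourne-shell escaping everywhere else.
def quote_path(path, windows = WINDOWS)
  if windows
    # Windows filenames cannot contain a double quote, so reject them
    # rather than attempt to escape them.
    raise ArgumentError, "path contains a double quote: #{path}" if path.include?('"')
    %("#{path}")
  else
    Shellwords.escape(path)
  end
end

# Builds the same pdftotext command line shown in this issue.
def pdftotext_command(pdf_path, text_path, windows = WINDOWS)
  "pdftotext -enc UTF-8 #{quote_path(pdf_path, windows)} " \
    "#{quote_path(text_path, windows)} 2>&1"
end

# Force the Windows branch to show the quoted form.
puts pdftotext_command("test-docs/Ideology and Climate Change.pdf",
                       "extracted-text/Ideology and Climate Change.txt",
                       true)
```

The sketch only addresses spaces in filenames. It does not handle other cmd.exe metacharacters such as %, and a path that ends in a backslash would need extra care before the closing quote.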

To reproduce, you will need Poppler for Windows, which is available here: http://www.compgeom.com/~piyush/scripts/scripts.html

Full error message follows:

C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:99:in `run': pdftotext version 0.16.6 (Docsplit::ExtractionFailed)
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -r <fp>           : resolution, in DPI (default is 72)
  -x <int>          : x-coordinate of the crop area top left corner
  -y <int>          : y-coordinate of the crop area top left corner
  -W <int>          : width of crop area in pixels (default is 0)
  -H <int>          : height of crop area in pixels (default is 0)
  -layout           : maintain original physical layout
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -listenc          : list available encodings
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -bbox             : output bounding box for each word and page size to html. Sets -htmlmeta
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:106:in `extract_full'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:54:in `extract_from_pdf'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:38:in `block in extract'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:32:in `each'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:32:in `extract'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit.rb:51:in `extract_text'
        from overview-prototype/docloader/docloader.rb:75:in `processFile'
        from overview-prototype/docloader/docloader.rb:150:in `block in <main>'
        from overview-prototype/docloader/docloader.rb:50:in `call'
        from overview-prototype/docloader/docloader.rb:50:in `block in scanDir'
        from overview-prototype/docloader/docloader.rb:42:in `foreach'
        from overview-prototype/docloader/docloader.rb:42:in `scanDir'
        from overview-prototype/docloader/docloader.rb:150:in `<main>'
