DocSplit fails to extract text on Windows when filenames have spaces

Docsplit invokes pdftotext to extract text and builds the command line by escaping spaces in the filename with a backslash (\). On Windows, \ is the path separator and does not escape spaces, so the filename is split into several arguments.

In my instrumented test, docsplit attempts to execute the following command:

  pdftotext -enc UTF-8 test-docs/Ideology\ and\ Climate\ Change.pdf extracted-text/Ideology\ and\ Climate\ Change.txt 2>&1

This fails with the error message reproduced in full at the end of this report. Quoting the filenames instead works:

  pdftotext -enc UTF-8 "test-docs/Ideology and Climate Change.pdf" "extracted-text/Ideology and Climate Change.txt" 2>&1
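
One possible fix is to choose the quoting style by platform. The following is only a sketch, not Docsplit's actual code: `WINDOWS`, `quote_path`, and `pdftotext_command` are hypothetical names introduced for illustration, and the platform check assumes that `mswin` or `mingw` in `host_os` identifies a native Windows Ruby. It wraps paths in double quotes on Windows (Windows filenames cannot contain `"`) and falls back to `Shellwords.escape` elsewhere.

```ruby
require 'rbconfig'
require 'shellwords'

# Assumption: native Windows Rubies report mswin or mingw in host_os.
WINDOWS = !!(RbConfig::CONFIG['host_os'] =~ /mswin|mingw/)

# Hypothetical helper (not part of Docsplit): quote a path for the
# platform's command interpreter.
def quote_path(path, windows = WINDOWS)
  if windows
    # Double quotes keep a path with spaces together as one argument.
    %("#{path}")
  else
    # POSIX shells understand backslash escaping.
    Shellwords.escape(path)
  end
end

# Hypothetical helper: build the command line shown in this report.
def pdftotext_command(pdf, txt, windows = WINDOWS)
  "pdftotext -enc UTF-8 #{quote_path(pdf, windows)} #{quote_path(txt, windows)} 2>&1"
end

puts pdftotext_command('test-docs/Ideology and Climate Change.pdf',
                       'extracted-text/Ideology and Climate Change.txt')
```

Another option would be to avoid the shell entirely by passing the arguments as separate strings (for example with `Open3.capture2e('pdftotext', '-enc', 'UTF-8', pdf, txt)`), which removes the need for any quoting.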

You will need Poppler on Windows to reproduce this; it is available here: http://www.compgeom.com/~piyush/scripts/scripts.html

Full error message follows:

C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:99:in `run': pdftotext version 0.16.6 (Docsplit::ExtractionFailed)
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -r <fp>           : resolution, in DPI (default is 72)
  -x <int>          : x-coordinate of the crop area top left corner
  -y <int>          : y-coordinate of the crop area top left corner
  -W <int>          : width of crop area in pixels (default is 0)
  -H <int>          : height of crop area in pixels (default is 0)
  -layout           : maintain original physical layout
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -listenc          : list available encodings
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -bbox             : output bounding box for each word and page size to html. Sets -htmlmeta
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:106:in `extract_full'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:54:in `extract_from_pdf'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:38:in `block in extract'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:32:in `each'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:32:in `extract'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit.rb:51:in `extract_text'
        from overview-prototype/docloader/docloader.rb:75:in `processFile'
        from overview-prototype/docloader/docloader.rb:150:in `block in <main>'
        from overview-prototype/docloader/docloader.rb:50:in `call'
        from overview-prototype/docloader/docloader.rb:50:in `block in scanDir'
        from overview-prototype/docloader/docloader.rb:42:in `foreach'
        from overview-prototype/docloader/docloader.rb:42:in `scanDir'
        from overview-prototype/docloader/docloader.rb:150:in `<main>'

