Initial import
dayflower committed Apr 19, 2012
0 parents commit 578f781
Showing 14 changed files with 686 additions and 0 deletions.
21 changes: 21 additions & 0 deletions .gitignore
@@ -0,0 +1,21 @@
.hg
.hgignore
.rbenv-gemsets
vendor/bundle
*.gem
*.rbc
.bundle
.config
.yardoc
Gemfile.lock
InstalledFiles
_yardoc
coverage
doc/
lib/bundler/man
pkg
rdoc
spec/reports
test/tmp
test/version_tmp
tmp
3 changes: 3 additions & 0 deletions Gemfile
@@ -0,0 +1,3 @@
source 'https://rubygems.org'

gemspec
22 changes: 22 additions & 0 deletions LICENSE
@@ -0,0 +1,22 @@
Copyright (c) 2012 ITO Nobuaki

MIT License

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
65 changes: 65 additions & 0 deletions README.md
@@ -0,0 +1,65 @@
# MSWordDoc::Extractor

Extracts text content from Microsoft Word binary documents (.doc).

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'msworddoc-extractor', :git =>
'git://github.com/dayflower/msworddoc-extractor.git'
```

And then execute:

    $ bundle install

## Usage

```ruby
require 'msworddoc-extractor'

doc = MSWordDoc::Extractor.load('sample.doc')
puts doc.whole_contents # doc is MSWordDoc::Essence
# You have to close the document explicitly
doc.close

# Or call load() with a block (recommended)
MSWordDoc::Extractor.load('sample.doc') do |doc|
  puts doc.header
end
```

### Properties of `MSWordDoc::Essence`

* `document`
* `header`
* `footnote`
* `macro`
* `annotation`
* `endnote`
* `textbox`
* `header_textbox`
* `whole_contents`
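
As a minimal sketch of how these properties might be read in one pass: the helpers `essence_parts` and `collect_parts` below are hypothetical and not part of this gem, and the `Struct` is only a stand-in so the example runs without a `.doc` file. With the gem installed, the object yielded by `MSWordDoc::Extractor.load` would presumably take its place.

```ruby
# Property names copied from the list above.
def essence_parts
  %i[
    document header footnote macro annotation
    endnote textbox header_textbox whole_contents
  ]
end

# Hypothetical helper (not part of the gem): collect the non-empty parts of
# any object that responds to the property names above.
def collect_parts(doc, parts = essence_parts)
  parts.each_with_object({}) do |name, result|
    text = doc.public_send(name).to_s
    result[name] = text unless text.strip.empty?
  end
end

# Stand-in for MSWordDoc::Essence (an assumption for illustration only).
fake_essence = Struct.new(*essence_parts)
fake = fake_essence.new('Body text', 'Page header', '')

collect_parts(fake).each do |name, text|
  puts "#{name}: #{text}"
end
```

Using `public_send` rather than `send` keeps the lookup limited to public methods, which is the safer choice when the property names come from a list.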

## Limitations

Only the Microsoft Word binary format (.doc) is supported.
The Microsoft Word XML format (.docx) is not supported.

This module does not handle `PAP` (PAragraph Properties) and `CHP` (CHaracter
Properties), which define paragraph and character styles.
This styling information is required to determine the function of
some special characters (such as row marks and footnote references),
but the module ignores it, so the extracted text may be inaccurate.

This module also does not handle the summary information stream in a Word file.

## Contributing

1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Added some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create new Pull Request
8 changes: 8 additions & 0 deletions Rakefile
@@ -0,0 +1,8 @@
#!/usr/bin/env rake
require 'bundler/gem_tasks'
require 'rake/testtask'

Rake::TestTask.new do |test|
  test.libs << 'test'
  test.test_files = Dir[ 'test/test_*.rb' ]
end
48 changes: 48 additions & 0 deletions bin/worddoc-extract
@@ -0,0 +1,48 @@
#!/usr/bin/env ruby

require 'optparse'
require 'msworddoc-extractor'

def app(*argv)
  actions = []

  # Each entry: OptionParser arguments followed by the Essence property to print.
  options = [
    [ '-d', '--document', 'Main contents (default)', :document ],
    [ '-w', '--whole', 'Whole text contents', :whole_contents ],
    [ '-i', '--header', 'Header parts', :header ],
    [ '-f', '--footnote', 'Footnotes', :footnote ],
    [ '-e', '--endnote', 'Endnotes', :endnote ],
    [ '-a', '--annotation', 'Annotations', :annotation ],
    [ '-t', '--textbox', 'Text boxes', :textbox ],
    [ '--header_textbox', 'Header text boxes', :header_textbox ],
    [ '-m', '--macro', 'Macro part', :macro ],
  ]

  optparse = OptionParser.new do |opt|
    opt.banner = 'Usage: worddoc-extract [options] <files> ...'

    options.each do |o|
      action = o.pop
      opt.on(*o) { actions << action }
    end

    opt.separator ''
    opt.on('-h', '--help', 'Show this help') { puts opt; exit }
  end

  # Parse first: the option blocks fill actions only while parsing.
  files = optparse.parse(argv)

  # Fall back to the main contents when no part was selected.
  if actions.empty?
    actions = [ :document ]
  end

  files.each do |file|
    doc = MSWordDoc::Extractor.load_file(file)
    actions.each do |action|
      puts doc.send(action)
    end
  end
end

app(*ARGV)
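
Note on the option handling: `OptionParser` runs the blocks registered with `on` only while `parse` executes, so a default such as `[:document]` has to be applied after parsing, or it will always be included. The sketch below illustrates that ordering with a reduced option set; `parse_actions` is a hypothetical helper written for this note, not code from the gem.

```ruby
require 'optparse'

# Hypothetical helper: map command-line flags to a list of action symbols,
# falling back to [:document] only when no flag was given.
def parse_actions(argv)
  actions = []

  parser = OptionParser.new do |opt|
    opt.on('-d', '--document', 'Main contents (default)') { actions << :document }
    opt.on('-i', '--header', 'Header parts') { actions << :header }
    opt.on('-f', '--footnote', 'Footnotes') { actions << :footnote }
  end

  # parse returns the non-option arguments (the file names) and runs the
  # option blocks as a side effect, so actions is only filled in here.
  files = parser.parse(argv)

  actions = [:document] if actions.empty?
  [actions, files]
end

p parse_actions(%w[-i -f a.doc]) # flags select the parts to print
p parse_actions(%w[a.doc b.doc]) # no flags, so the default applies
```

Returning both the actions and the remaining file names from one function keeps the parsing step easy to test without touching any document.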

1 change: 1 addition & 0 deletions lib/msworddoc-extractor.rb
@@ -0,0 +1 @@
require File.expand_path('../msworddoc/extractor', __FILE__)
