Bunpa

Bunpa is an extremely simple wrapper around the MeCab Japanese grammar parser. It was designed with two key features in mind:

Simplicity - only returns the text and major part of speech for each component
Completeness - ensures that whitespace and any unknown characters are preserved

Background

Bunpa parses Japanese text into an ordered set of components. Each component represents either a part of speech (noun, verb, etc.) or formatting (whitespace, etc.). All components have a text value (exactly as it appears in the text provided) and a kind (usually the part of speech).

All grammatical information is provided by the excellent MeCab Japanese part of speech and morphological analyser. Formatting information is inserted into the set of components in a post-processing step (it is not done by MeCab). These components have a fake 'kind' assigned to them. Currently Bunpa handles the following kinds of formatting components:

spaces (スペース)
tabs (タブ)
newlines (改行)

Any components that cannot be identified by either MeCab or Bunpa are marked as unknown (未知).
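
The post-processing step can be pictured with a small sketch. This is not Bunpa's actual implementation: the `Component` struct, the `FORMATTING_KINDS` table and the `reinsert_formatting` helper are hypothetical names, and the hand-written token list merely stands in for what an analyser might return once whitespace has been dropped.

```ruby
# Hypothetical stand-in for a parsed component; not Bunpa's real class.
Component = Struct.new(:text, :kind)

# Fake 'kinds' for formatting characters, mirroring the list above.
FORMATTING_KINDS = { " " => "スペース", "\t" => "タブ", "\n" => "改行" }.freeze

# Walk the original string, emitting analyser tokens where they match and
# filling every gap with a formatting or unknown (未知) component.
def reinsert_formatting(original, tokens)
  components = []
  queue = tokens.dup
  pos = 0
  while pos < original.length
    token = queue.first
    if token && original[pos, token.text.length] == token.text
      components << queue.shift
      pos += token.text.length
    else
      char = original[pos]
      components << Component.new(char, FORMATTING_KINDS.fetch(char, "未知"))
      pos += 1
    end
  end
  components
end

# Assumed analyser output for the input below, with whitespace missing.
tokens = [Component.new("はい", "感動詞"), Component.new("元気", "名詞")]
result = reinsert_formatting("はい\t元気\n", tokens)
result.each { |c| puts "#{c.text.inspect}\t(#{c.kind})" }
```

Walking the original string character by character is one simple way to guarantee that nothing is lost: every character ends up in exactly one component, so joining the text values gives back the input.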

Installation

From within your application's base directory:

Edit your Gemfile and add:
```
 gem 'bunpa'
```
Install the gem:
```
 bundle
```

Usage

Bunpa operates as a very simple parser. It returns the components it identifies as an Array of Bunpa::Text::Component objects, in the same order as they appear in the document. Each Component object has two accessors - 'text' and 'kind', which return the text value and part of speech of the component respectively.

Basic usage is as follows:

require 'bunpa'

# Create the parser
parser = Bunpa::JapaneseTextParser.new

# Get an enumerable of Bunpa::Text::Components
components = parser.parse("A: こんにちは！ お元気ですか。\nB: はい、元気です！")

components.each do |component|
  puts "#{component.text}\t(#{component.kind})"
end

This would output:

A       (名詞)
:       (名詞)
        (スペース)
こんにちは      (感動詞)
！      (記号)
        (スペース)
お      (接頭詞)
元気    (名詞)
です    (助動詞)
か      (助詞)
。      (記号)

        (改行)
B       (名詞)
:       (名詞)
        (スペース)
は      (助詞)
い      (動詞)
、      (記号)
元気    (名詞)
です    (助動詞)
！      (記号)

For a slightly more detailed example, see the usage_example.rb script in the bin directory.
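
Because each component exposes only `text` and `kind`, common follow-up steps are ordinary Array operations. The sketch below uses a hypothetical `StubComponent` struct and a hand-written sample (the second line of the output above) so that it runs without MeCab installed; with Bunpa itself, the array would instead come from `parser.parse`.

```ruby
# Hypothetical stand-in exposing the same two accessors (text, kind).
StubComponent = Struct.new(:text, :kind)

# Hand-written sample mirroring "B: はい、元気です！" from the output above.
components = [
  StubComponent.new("B", "名詞"),
  StubComponent.new(":", "名詞"),
  StubComponent.new(" ", "スペース"),
  StubComponent.new("は", "助詞"),
  StubComponent.new("い", "動詞"),
  StubComponent.new("、", "記号"),
  StubComponent.new("元気", "名詞"),
  StubComponent.new("です", "助動詞"),
  StubComponent.new("！", "記号")
]

# Kinds that describe formatting rather than grammar.
FORMATTING = %w[スペース タブ 改行].freeze

# If whitespace and unknown characters are preserved as described,
# joining every text value rebuilds the original input.
rebuilt = components.map(&:text).join

# Drop formatting components to keep only the grammatical ones.
words = components.reject { |c| FORMATTING.include?(c.kind) }

# Count how many components fall under each kind.
counts = components.each_with_object(Hash.new(0)) { |c, h| h[c.kind] += 1 }

puts rebuilt
puts words.map(&:text).join(" | ")
counts.each { |kind, n| puts "#{kind}\t#{n}" }
```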

Notes

This is very much a work in progress - it only has minimal testing at the moment, so use at your own risk :)
