Merge v0.9.0
adworse committed Dec 7, 2018
2 parents 48620e1 + 1f86daa commit 63d4bd9
Showing 10 changed files with 169 additions and 67 deletions.
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,16 @@

## [Unreleased]

## [0.9.0] - 2018-12-07
### Added
- Open cells rendering added. Tables like this are now processed correctly:
```
__|____|_______|_____|
__|____|_______|_____|
__|____|_______|_____|
```


## [0.8.4] - 2018-11-24
### Changed
- The `phrases` option of the `Iguvium::Table#to_a` method (render phrases before cell assembly) is now true by default.
23 changes: 12 additions & 11 deletions README.md
@@ -24,15 +24,24 @@ Get this table:

* Character extraction is done by the [PDF::Reader gem](https://github.com/yob/pdf-reader). Some PDFs are so messed up that it can't extract meaningful text from them, and in that case neither can Iguvium.

* Current version extracts regular (with constant number of rows per column and vice versa) tables with explicit lines formatting, like this:
* Current version extracts regular (with constant number of rows per column and vice versa)
tables with explicit lines formatting, like this:

```
.__________________.
|____|_______|_____|
|____|_______|_____|
|____|_______|_____|
```
Merged cells content is split as if cells were not merged.
And, starting with version 0.9.0, like this:
```
__|____|_______|_____|
__|____|_______|_____|
__|____|_______|_____|
```


Merged cells content is split as if cells were not merged unless you use the `:phrases` option.

* Performance: considering the fact it has computer vision under the hood, the gem is reasonably fast. Full page extraction takes up to 1 second on modern CPUs and up to 2 seconds on the older ones.

@@ -107,15 +116,7 @@ Initially inspired by [camelot](https://github.com/socialcopsdev/camelot/) idea

## Roadmap

The next version will deal with open-edged tables like

```
__|____|_______|_____|
__|____|_______|_____|
__|____|_______|_____|
```

It also will keep open-edged rows metadata ('floorless' and 'roofless') for the needs of multipage tables merger.
The next version will keep open-edged rows metadata ('floorless' and 'roofless') for the needs of multipage tables merger.

The final one will recognize tables with merged cells.

1 change: 1 addition & 0 deletions exe/iguvium
@@ -18,6 +18,7 @@ opts = Slop.parse { |o|
end
o.on '-h', '--help', 'show help' do
puts o.to_s.gsub(/(usage:).+(iguvium)/, '\1 \2 filename.pdf')
puts Iguvium::VERSION
exit
end
}
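The `--help` handler in this hunk rewrites the usage banner with `gsub` before printing the version. A minimal sketch of that substitution follows; the banner string is an assumed sample, not captured output from the tool, and only the regular expression is taken from the diff:

```ruby
# Assumed sample banner; in the executable the text comes from `o.to_s`.
banner = 'usage: /usr/local/bin/iguvium [options]'

# Group 1 keeps "usage:", the greedy `.+` swallows the path prefix,
# group 2 keeps the executable name, and a filename placeholder is appended.
usage = banner.gsub(/(usage:).+(iguvium)/, '\1 \2 filename.pdf')

puts usage
```

Anything after the executable name (here ` [options]`) is outside the match, so it is left in place.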
17 changes: 7 additions & 10 deletions lib/iguvium/cv.rb
@@ -39,11 +39,11 @@ class CV
# Prepares image for recognition: initial blur
# @param image [ChunkyPNG::Image] from {Iguvium::Image.read}
def initialize(image)
@image = blur image
@blurred = blur(image)
@image = to_narray(image).to_a
end

# @return [Array] 8-bit representation of an image
attr_reader :image
attr_reader :image, :blurred

# @return [Recognized]
# lines most probably forming table cells and tables' outer borders as boxes
@@ -62,9 +62,8 @@ def lines
{
vertical: Labeler.new(verticals)
.lines
.map { |line| flip_line line }
.sort_by { |x, yrange| [yrange.begin, x] },
horizontal: Labeler.new(horizontals).lines.map { |line| flip_line line }.sort_by { |_, y| [y] }
.map { |line| flip_line line },
horizontal: Labeler.new(horizontals).lines.map { |line| flip_line line }
}
end

@@ -80,14 +79,14 @@ def boxes

def verticals(threshold = 3)
Matrix
.rows(convolve(NArray[*horizontal_scan(image)], VERTICAL, 0).to_a)
.rows(convolve(NArray[*horizontal_scan(blurred)], VERTICAL, 0).to_a)
.map { |pix| pix < threshold ? nil : pix }
.to_a
end

def horizontals(threshold = 3)
Matrix
.rows(convolve(NArray[*vertical_scan(image)], HORIZONTAL, 0).to_a)
.rows(convolve(NArray[*vertical_scan(blurred)], HORIZONTAL, 0).to_a)
.map { |pix| pix < threshold ? nil : pix }
.to_a
end
@@ -186,8 +185,6 @@ def vertical_scan(image)
def box(coord_array)
ax, bx = coord_array.map(&:last).minmax
ay, by = coord_array.map(&:first).minmax
# additional pixels removed from the box definition
# [ax - 1..bx + 1, ay - 1..by + 1]
[ax..bx, flip_range(ay..by)]
end
end
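Both `verticals` and `horizontals` end with the same thresholding step: convolution responses below the threshold become `nil`, everything else is kept. A rough sketch of that step alone, assuming plain nested arrays in place of the `Matrix`/`NArray` objects used in the gem (`suppress_weak` is a hypothetical name):

```ruby
# Hypothetical helper mirroring the `pix < threshold ? nil : pix` step,
# written for plain nested arrays instead of Matrix/NArray.
def suppress_weak(rows, threshold = 3)
  rows.map { |row| row.map { |pix| pix < threshold ? nil : pix } }
end

# Made-up convolution response, two rows of three pixels.
response = [
  [0, 1, 5],
  [2, 3, 9]
]

cleaned = suppress_weak(response)
p cleaned
```

A value equal to the threshold survives, since the comparison is strictly "less than".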
32 changes: 32 additions & 0 deletions lib/iguvium/row.rb
@@ -0,0 +1,32 @@
# frozen_string_literal: true

module Iguvium

class Row
# gets characters limited by yrange and set of column ranges
def initialize(columns, characters, phrases: true)
@columns = columns
if phrases
characters =
characters
.sort
.chunk_while { |a, b| a.mergable?(b) }
.map { |chunk| chunk.inject(:+) }
end
@characters = characters
end

def cells
@columns.map { |range|
@characters.select { |character| range.cover?(character.x) }
}
end

# @return rendered row array
def render(newlines: false)
end

def merge(other)
end
end
end
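The new `Row` class merges neighbouring characters into phrases with `sort`, `chunk_while` and `inject(:+)`, then distributes them over column ranges with `cover?`. A minimal sketch of that pipeline, assuming a hypothetical `Glyph` struct in place of the gem's character objects; the 2-unit gap rule in `mergable?` is an illustration, not the gem's actual criterion:

```ruby
# Hypothetical stand-in for the gem's character objects: only the pieces
# Row relies on (`x`, `<=>`, `mergable?`, `+`) are modelled here.
Glyph = Struct.new(:x, :width, :text) do
  # Sort glyphs left to right.
  def <=>(other)
    x <=> other.x
  end

  # Assumption: two glyphs form one phrase when the gap is under 2 units.
  def mergable?(other)
    other.x - (x + width) < 2
  end

  # Merging keeps the left edge and spans up to the right glyph's end.
  def +(other)
    Glyph.new(x, other.x + other.width - x, text + other.text)
  end
end

glyphs = [Glyph.new(30, 5, 'c'), Glyph.new(10, 5, 'a'), Glyph.new(15, 5, 'b')]

# Same chain as in Row#initialize with `phrases: true`.
phrases = glyphs
          .sort
          .chunk_while { |a, b| a.mergable?(b) }
          .map { |chunk| chunk.inject(:+) }

# Same selection as in Row#cells, reduced to the text of each phrase.
columns = [0..25, 25..50]
cells = columns.map do |range|
  phrases.select { |phrase| range.cover?(phrase.x) }.map(&:text)
end

p cells
```

Because a merged phrase keeps the `x` of its first glyph, it lands in the column where it starts, even if it extends past the column's right edge.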
79 changes: 67 additions & 12 deletions lib/iguvium/table.rb
@@ -16,6 +16,8 @@ def initialize(box, page)
@box = box
@lines = page.lines
@page = page
grid
heal
end

# Renders the table into an array of strings.
@@ -30,24 +32,74 @@ def initialize(box, page)
# @return [Array] 2D array of strings (content of table's cells)
#
def to_a(newlines: false, phrases: true)
grid[:rows]
@to_a ||=
grid[:rows]
.reverse
.map { |row|
grid[:columns].map do |column|
render(
phrases ? words_inside(column, row) : chars_inside(column, row),
newlines: newlines
)
end
}
grid[:columns].map do |column|
render(
phrases ? words_inside(column, row) : chars_inside(column, row),
newlines: newlines
)
end
}
end

# def width
# grid[:columns].count
# end

# def mergeable?(other)
# width == other.width
# end

# def roofless?
# @roofless
# end

# def floorless?
# @floorless
# end

private

attr_reader :page, :lines, :box

def enhancer(grid)
# @todo write grid enhancer to detect cells between outer grid lines and box borders
# Looks if there are characters inside the box but outside of already detected cells
# and adds rows and/or columns if necessary.
# @return [Iguvium::Table] with added open-cell rows and columns
def heal
heal_rows
heal_cols
self
end

def wide_box
@wide_box ||= [
box.first.begin - 2..box.first.end + 2,
box.last.begin - 2..box.last.end + 2
]
end

def heal_cols
leftcol = box.first.begin..grid[:columns].first.begin
rightcol = grid[:columns].last.end..box.first.end
@grid[:columns].unshift(leftcol) if chars_inside(leftcol, box.last).any?
@grid[:columns].append(rightcol) if chars_inside(rightcol, box.last).any?
end

def heal_rows
# TODO: shrink box (like `box.last.end - 2`)
roofrow = box.last.begin..grid[:rows].first.begin
floorrow = grid[:rows].last.end..box.last.end
if chars_inside(box.first, roofrow).any?
@grid[:rows].unshift(roofrow)
@roofless = true
end
if chars_inside(box.first, floorrow).any?
@grid[:rows].append(floorrow)
@floorless = true
end
end

def characters
@@ -74,15 +126,18 @@ def words_inside(xrange, yrange)
end

def grid
@grid ||=
return @grid if @grid

@grid =
{
rows: lines_to_ranges(lines[:horizontal]),
columns: lines_to_ranges(lines[:vertical])
}
end

def lines_to_ranges(lines)
lines.select { |line| line_in_box?(line, box) }
# TODO: extend box for the sake of lines select
lines.select { |line| line_in_box?(line, wide_box) }
.map { |line| line.first.is_a?(Numeric) ? line.first : line.last }
.sort
.uniq
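The `heal_cols` method added in this file prepends or appends a column when characters sit between the box border and the outermost detected line. A simplified sketch of that idea, assuming characters reduced to bare x coordinates and a hypothetical `heal_columns` helper that returns a new array rather than mutating `@grid`:

```ruby
# Hypothetical simplified version of the column-healing idea: characters are
# bare x coordinates, and the box is reduced to its horizontal range.
def heal_columns(box_x, columns, char_xs)
  left  = box_x.begin..columns.first.begin
  right = columns.last.end..box_x.end
  healed = columns.dup
  # Add an open-edged column only if some character actually falls inside it.
  healed.unshift(left) if char_xs.any? { |x| left.cover?(x) }
  healed.push(right)   if char_xs.any? { |x| right.cover?(x) }
  healed
end

box_x   = 40..570
columns = [100..300, 300..570] # ranges between detected vertical lines
char_xs = [55, 120, 350]       # 55 sits left of the first vertical line

healed = heal_columns(box_x, columns, char_xs)
p healed
```

`heal_rows` in the diff follows the same pattern on the vertical axis and additionally records the `@roofless` and `@floorless` flags.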
2 changes: 1 addition & 1 deletion lib/iguvium/version.rb
@@ -1,5 +1,5 @@
# frozen_string_literal: true

module Iguvium
VERSION = '0.8.4'
VERSION = '0.9.0'
end
61 changes: 33 additions & 28 deletions spec/cv_spec.rb
@@ -10,45 +10,48 @@
let(:page_index) { 0 }

let(:gspath) do
return 'gs' unless RbConfig::CONFIG['host_os'].match(/mswin|mingw|cygwin/)
return 'gs' unless RbConfig::CONFIG['host_os'] =~ /mswin|mingw|cygwin/

gspath = Dir.glob('C:/Program Files/gs/gs*/bin/gswin??c.exe').first.tr('/', '\\')
"\"#{gspath}\""
end


let(:lines) { cv.recognize[:lines]}
let(:boxes) { cv.recognize[:boxes]}
let(:lines) { cv.recognize[:lines] }
let(:boxes) { cv.recognize[:boxes] }

it do
expect(boxes)
.to have_attributes(count: 7)
.to have_attributes(count: 13)
.and eql(
[
[64..536, 326..427], [70..188, 434..439], [64..536, 476..536], [70..138, 542..547],
[64..536, 578..610], [70..121, 617..622], [70..133, 669..721]
[66..534, 328..425], [72..186, 436..437], [66..534, 478..534],
[72..136, 544..545], [66..534, 580..608], [72..119, 619..620],
[79..84, 671..679], [87..93, 672..679], [96..103, 672..679],
[106..113, 672..679], [117..124, 672..679], [72..131, 681..684],
[80..118, 688..719]
]
)

expect(lines[:vertical])
.to have_attributes(count: 24)
.and eql(
[
[66, 326..427], [253, 326..426], [440, 326..426], [160, 327..426], [347, 327..426],
[534, 328..425], [74, 434..439], [534, 476..536], [66, 477..536], [222, 477..535],
[378, 477..535], [74, 542..547], [66, 578..610], [534, 578..608], [253, 579..610],
[440, 579..610], [160, 580..609], [347, 580..609], [74, 617..622], [81, 669..680],
[122, 670..675], [108, 675..682], [88, 696..711], [116, 696..711]
[88, 696..711], [116, 696..711], [108, 675..682], [81, 669..680], [122, 670..675],
[74, 617..622], [160, 580..610], [347, 580..610], [66, 578..609], [253, 579..609],
[440, 579..609], [534, 578..608], [74, 542..547], [66, 477..536], [534, 476..536],
[222, 477..535], [378, 477..535], [74, 434..439], [66, 326..427], [160, 327..426],
[253, 326..426], [347, 327..426], [440, 326..426], [534, 328..425]
]
)
expect(lines[:horizontal])
.to have_attributes(count: 25)
.and eql(
[
[64..533, 328], [65..535, 342], [65..536, 356], [65..536, 370], [65..536, 384],
[65..536, 398], [64..536, 425], [70..188, 436], [64..532, 478], [65..534, 492],
[65..535, 506], [64..536, 534], [70..138, 545], [66..532, 580], [65..534, 594],
[64..535, 608], [70..120, 620], [107..124, 676], [79..84, 677], [70..78, 681],
[84..133, 683], [103..108, 690], [79..86, 694], [78..86, 701], [95..108, 718]
[95..108, 718], [78..86, 701], [79..86, 694], [103..108, 690], [84..133, 683],
[70..78, 681], [79..84, 677], [107..124, 676], [70..120, 620], [64..534, 608],
[65..534, 594], [66..532, 580], [70..138, 545], [64..536, 534], [65..535, 506],
[65..534, 492], [64..532, 478], [70..188, 436], [64..536, 425], [65..536, 398],
[65..536, 384], [65..536, 370], [65..536, 356], [65..535, 342], [64..533, 328]
]
)
end
@@ -58,40 +61,42 @@
let(:path) { 'spec/files/quote.pdf' }
let(:page_index) { 0 }
let(:gspath) do
return 'gs' unless RbConfig::CONFIG['host_os'].match(/mswin|mingw|cygwin/)
return 'gs' unless RbConfig::CONFIG['host_os'] =~ /mswin|mingw|cygwin/

gspath = Dir.glob('C:/Program Files/gs/gs*/bin/gswin??c.exe').first.tr('/', '\\')
"\"#{gspath}\""
end

let(:lines) { cv.recognize[:lines]}
let(:boxes) { cv.recognize[:boxes]}
let(:lines) { cv.recognize[:lines] }
let(:boxes) { cv.recognize[:boxes] }

it do
expect(boxes)
.to have_attributes(count: 5)
.and eql(
[
[40..571, 34..442], [40..571, 460..465], [40..571, 494..499], [40..571, 658..663], [42..81, 716..755]
[42..569, 36..440], [42..569, 462..463], [42..569, 496..497],
[42..569, 660..661], [44..79, 718..753]
]
)

expect(lines[:vertical])
.to have_attributes(count: 14)
.and eql(
[
[42, 34..442], [167, 34..440], [323, 34..440], [378, 34..440], [417, 34..440],
[472, 34..440], [513, 34..440], [568, 34..440], [44, 460..465], [44, 494..499],
[44, 658..663], [53, 732..744], [64, 732..743], [70, 735..743]
[53, 732..744], [64, 732..743], [70, 735..743], [44, 658..663], [44, 494..499],
[44, 460..465], [42, 34..442], [378, 34..441], [417, 34..441], [472, 34..441],
[513, 34..441], [167, 34..440], [323, 34..440], [568, 34..439]
]
)
expect(lines[:horizontal])
.to have_attributes(count: 20)
.and eql(
[
[42..571, 59], [41..570, 80], [41..570, 122], [41..570, 154], [41..570, 186],
[42..571, 217], [41..570, 237], [41..570, 269], [41..570, 380], [42..571, 411],
[41..569, 439], [40..571, 463], [40..571, 497], [40..571, 661], [44..56, 720],
[63..76, 720], [51..66, 731], [63..69, 737], [71..78, 745], [46..67, 748]
[46..67, 748], [71..78, 745], [63..69, 737], [51..66, 731], [44..56, 720],
[63..76, 720], [40..571, 661], [40..571, 497], [40..571, 463], [40..567, 439],
[42..571, 411], [41..570, 380], [41..570, 269], [41..570, 237], [42..571, 217],
[41..570, 186], [41..570, 154], [41..570, 122], [41..570, 80], [42..571, 59]
]
)
end
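The spec's `gspath` helper now tests the host OS with `=~` instead of `.match`; both return `nil` on a non-match, so the guard behaves the same. A small sketch of the check in isolation, assuming a hypothetical `windows_host?` helper that accepts the OS string as a parameter so both branches can be tried on any machine:

```ruby
require 'rbconfig'

# Hypothetical helper: same host-OS check as the spec's `gspath`, with the
# OS string passed in; defaults to the value Ruby reports for this machine.
def windows_host?(host_os = RbConfig::CONFIG['host_os'])
  # `=~` returns the match index or nil, so convert it to a strict boolean.
  !(host_os =~ /mswin|mingw|cygwin/).nil?
end

puts windows_host?('mingw32')
puts windows_host?('linux-gnu')
puts windows_host?
```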
