Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An Iterator on big dataset (lazy generated) #4198

Closed
bew opened this issue Mar 27, 2017 · 6 comments
Closed

An Iterator on big dataset (lazy generated) #4198

bew opened this issue Mar 27, 2017 · 6 comments

Comments

@bew
Copy link
Contributor

bew commented Mar 27, 2017

From irc, someone (domgetter) asked if it was possible to do something like: (pseudo code)

struct Int
  def palindrome?
    self.to_s == self.to_s.reverse
  end
end

nums = Iterator.new do |yielder|
  (100..999).each do |i|
    (i..999).each do |j|
      yielder << i * j
    end
  end
end

puts nums.select(&.palindrome?).max

Here, nums would be an Iterator on a big dataset, lazy generated (on demand, no space used).


I implemented it this way:

struct LazyIterator(T)
  include Iterator(T)

  struct Yielder(T)
    def initialize(@chan : Channel(T | Iterator::Stop))
    end

    def <<(value : T)
      @chan.send value
    end
  end

  @data_chan = Channel(T | Iterator::Stop).new(1)
  @launched = false

  # Create a LazyIterator
  def initialize(&@generator : Yielder(T) ->)
  end

  # Return the next value from the data generator
  def next
    unless @launched
      launch_data_generation
    end

    @data_chan.receive
  end

  # Launch data generation in separate fiber
  private def launch_data_generation
    data_chan = @data_chan
    generator = @generator

    spawn do
      generator.call Yielder(T).new(data_chan)
      data_chan.send Iterator::Stop::INSTANCE
    end

    @launched = true
  end
end

#----------------------------------
# Use the magic :)

struct Int
  def palindrome?
    self.to_s == self.to_s.reverse
  end
end

nums = LazyIterator(Int32).new do |yielder|
  (100..999).each do |i|
    (i..999).each do |j|
      yielder << i * j
    end
  end
end

puts nums.select(&.palindrome?).max

I thought this would be great to gave this in the stdlib, what do you think ?

@bew bew changed the title An Iterator on lazy generated dataset An Iterator on big dataset (lazy generated) Mar 27, 2017
@RX14
Copy link
Contributor

RX14 commented Mar 27, 2017

This should work, 99% sure it's a bug that it doesn't, but it's reasonably short. Iterator.of(Channel) is maybe the only thing I'd add to the stdlib.

struct Int
  def palindrome?
    self.to_s == self.to_s.reverse
  end
end

channel = Channel(Int32).new

spawn do
  (100..999).each do |i|
    (i..999).each do |j|
      channel.send(i * j)
    end
  end
end

iter = Iterator.of { channel.receive? || Iterator.stop }

puts iter.select(&.palindrome?).max

https://carc.in/#/r/1rro

@bcardiff
Copy link
Member

@RX14 you need to close the channel at the very end, otherwise receive? will block.

struct Int
  def palindrome?
    self.to_s == self.to_s.reverse
  end
end

channel = Channel(Int32).new

spawn do
  (100..999).each do |i|
    (i..999).each do |j|
      channel.send(i * j)
    end
  end
  channel.close
end

iter = Iterator.of { channel.receive? || Iterator.stop }
puts iter.map(&.as(Int32)).select(&.palindrome?).max

https://carc.in/#/r/1s64

Iterator.of generates a Iterator(T) where T is the return type of the block. Maybe we should play a bit with types to express an Iterator(T - Iterator::Stop)

Currently:

  def self.of(&block : -> T)
    SingletonProc(T).new(block)
  end

But if we change it to:

module Iterator(T)
  def self.of(&block : -> T)
    SingletonProc(typeof(without_stop(&block))).new(block)
  end

  def self.without_stop(&block : -> T)
    e = block.call
    raise "" if e.is_a?(Iterator::Stop)
    e
  end

  private struct SingletonProc(T)
    include Iterator(T)

    def initialize(@proc : -> (T | Iterator::Stop))
    end

    def next
      @proc.call
    end
  end
end

Then, we can get rid of the .map(&.as(Int32))

iter = Iterator.of { channel.receive? || Iterator.stop }
puts iter.select(&.palindrome?).max

So we could support Iterator.stop calls in the block. I think that is something good.

I'm ok with Iterator.of(Channel) having a special semantic.

With those two things the lazy evaluation is more or less straight, yet I would personally prefer the api exposed by @bew , but sadly I see no way to get rid of the type annotation either in the channel or in a yielder helper class.

@RX14
Copy link
Contributor

RX14 commented Mar 29, 2017

@bcardiff I actually wrote that code on my phone while walking from memory, thanks for tidying it up and fixing the Iterator.of bug.

I think adding a little helper for this would be ok, with a type annotation.

@asterite
Copy link
Member

The idea of an Enumerator using fibers and channels has been around for some time. However, it has several problems:

  • It's slow: a hand-written iterator can be up to 1000 times faster than using fibers and channels
  • If the iterator is not consumed there will be an unfreed fiber and channel, which is just wasted memory (it will be eventually GCed)

Because it's slow we don't want to include it in the standard library and encourage its use.

@bew Do you have a concrete use case for this? Right now iterators are pretty flexible, but this case is hard to express.

@RX14
Copy link
Contributor

RX14 commented May 20, 2017

I think that this can be closed in favour of #4438.

@bew
Copy link
Contributor Author

bew commented May 20, 2017

@RX14 I agree, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants