dask.bag: allow printing result to stdout #1066
Suppose my use case is this: read input from stdin, process it, then write the output to stdout (so that it can be piped into another process). This cannot be done easily at the moment. I have cheated with
I then hacked the
and then use
However, is there a better way to do this? It would be nice to be able to print the result right after it is computed (
The best way would probably be to fork off a separate process that does the actual printing and reads from one end of a queue, and have the printing task put its results on that queue. This should limit the synchronization overhead that a lock causes. Assuming the printing process can keep up with the workers producing the data, this shouldn't flood your RAM either.
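The queue approach described above might be sketched roughly as follows. This is only a minimal illustration, not something tested against this issue's workload: the names `printer`, `enqueue_partition`, `main`, and `SENTINEL` are made up for the example, the `maxsize` value is arbitrary, and it assumes rows are never `None` (which is used as the stop marker) and that a `Manager` queue proxy can be passed to the workers the same way the `Manager` lock is.

```python
import sys
from functools import partial
from multiprocessing import Manager, Process

# Hypothetical stop marker; assumes real rows are never None.
SENTINEL = None


def printer(queue):
    """Take rows off the queue and print them until the sentinel arrives."""
    for row in iter(queue.get, SENTINEL):
        print(row)
    sys.stdout.flush()


def enqueue_partition(partition, queue):
    """Put every row of one partition on the queue; return the row count."""
    count = 0
    for row in partition:
        queue.put(row)
        count += 1
    return [count]


def main():
    import dask.bag as db

    manager = Manager()
    # A bounded queue makes workers block instead of filling RAM
    # when the printer falls behind (the size here is arbitrary).
    queue = manager.Queue(maxsize=10000)

    proc = Process(target=printer, args=(queue,))
    proc.start()
    try:
        data = db.range(1000, npartitions=10)
        data.map_partitions(partial(enqueue_partition, queue=queue)).compute()
    finally:
        # Always tell the printer to stop, even if the computation failed.
        queue.put(SENTINEL)
        proc.join()
```

To try it, call `main()` from a script under an `if __name__ == "__main__":` guard, which matters on platforms that spawn rather than fork worker processes. Only the printer process writes to stdout, so rows from different partitions cannot interleave mid-line, though their relative order is still not guaranteed.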
A cheap option though is just to use a manager to create a lock. The following works for me:
    from multiprocessing import Manager
    import dask.bag as db

    manager = Manager()
    lock = manager.Lock()

    data = db.range(1000, npartitions=10)

    def to_stdout(data, lock=lock):
        lock.acquire()
        for row in data:
            print(row)
        lock.release()

    data.map_partitions(to_stdout).compute()
Since grabbing a lock is mildly expensive, we do it once for each chunk rather than once for each record. If this is fast enough for you, then this is the solution I'd recommend.
Yes. In general the multiprocessing scheduler (