
Conversation

@DT-Priyanka (Contributor)

The blocksThreshold property is already exposed by the FileSplitter operator; this change exposes the same property through the module.
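For context, a minimal sketch of the shape of this change (the class name FSInputModule and the wiring below are illustrative, not the exact code in this PR): the module keeps its own blocksThreshold property and forwards it to the embedded splitter when the DAG is populated, so users can configure the module without reaching into the operator.

    // Illustrative sketch only; the actual module in this PR may differ.
    public class FSInputModule implements com.datatorrent.api.Module
    {
      @javax.validation.constraints.Min(1)
      private int blocksThreshold = Integer.MAX_VALUE;  // matches the splitter's current default discussed below

      public void setBlocksThreshold(int blocksThreshold)
      {
        this.blocksThreshold = blocksThreshold;
      }

      public int getBlocksThreshold()
      {
        return blocksThreshold;
      }

      @Override
      public void populateDAG(com.datatorrent.api.DAG dag, org.apache.hadoop.conf.Configuration conf)
      {
        FileSplitterInput fileSplitter = dag.addOperator("FileSplitter", new FileSplitterInput());
        // Forward the module-level property to the operator that already exposes it.
        fileSplitter.setBlocksThreshold(blocksThreshold);
        // ... block reader and streams elided ...
      }
    }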

@DT-Priyanka (Contributor, Author)

@ChandniSingh, can you please review this?

    dag.setAttribute(blockReader, Context.OperatorContext.PARTITIONER,
        new StatelessPartitioner<FSSliceReader>(readersCount));
    fileSplitter.setBlocksThreshold(readersCount);
  }
  if (blocksThreshold != 0) {
Review comment (Contributor) on this diff:
Do you need this check, given that blocksThreshold is always >= 1 according to the constraint?
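For reference, the constraint in question is a bean-validation annotation on the splitter property, roughly along these lines (a sketch; the exact declaration in AbstractFileSplitter may differ):

    // Validated when the application launches, so a value of 0 is rejected
    // before this code ever runs, which would make the != 0 check redundant.
    @javax.validation.constraints.Min(1)
    protected int blocksThreshold = Integer.MAX_VALUE;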

@gauravgopi123 (Contributor)

Test case?

@otterc (Contributor) commented May 19, 2016

@DT-Priyanka
Can we change the default value of blocksThreshold in AbstractFileSplitter? Currently it is Integer.MAX_VALUE, which is too large, and we have seen many times that users forget to change it and the block readers are unable to handle the number of blocks.
Maybe we should reduce it to 10.
@vrozov, can you please suggest a reasonable default value?

@DT-Priyanka (Contributor, Author)

In my experience so far, 2 or 4 looks like a good number. @vrozov, or anyone else with another idea, please let me know. I would remove the min=1 constraint and put the default in AbstractFileSplitter.

@otterc (Contributor) commented May 19, 2016

@DT-Priyanka
Why would you remove the min=1 constraint?

@DT-Priyanka (Contributor, Author)

We need to find a good default value and also update the tests if required; closing this PR for the time being.

@vrozov (Member) commented May 19, 2016

I don't think there is a good default value. It depends on the run-time environment and the amount of processing necessary. In general, it would be better to change the behavior of the downstream operator to process as many records as it can in a single window, instead of trying to process all tuples emitted by the FileSplitter.
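A sketch of the alternative vrozov describes (this is hypothetical, not Malhar code; BlockMetadata stands in for the splitter's output type and readBlock is an invented helper): the reader buffers incoming block metadata and drains it as time permits, instead of relying on the splitter to throttle.

    // Hypothetical reader following vrozov's suggestion; not Malhar code.
    public class TimeBoundedReader extends BaseOperator implements Operator.IdleTimeHandler
    {
      private final ArrayDeque<BlockMetadata> pending = new ArrayDeque<>();

      public final transient DefaultInputPort<BlockMetadata> input = new DefaultInputPort<BlockMetadata>()
      {
        @Override
        public void process(BlockMetadata block)
        {
          pending.add(block);  // accept everything the splitter emits this window
        }
      };

      @Override
      public void handleIdleTime()
      {
        // Called when there is spare time in the current window: drain one
        // buffered block at a time. How many blocks get processed per window
        // now depends on runtime speed, which is the idempotency concern
        // otterc raises below.
        BlockMetadata next = pending.poll();
        if (next != null) {
          readBlock(next);  // hypothetical helper doing the actual I/O
        }
      }

      private void readBlock(BlockMetadata block)
      {
        // ... read the block from the file system ...
      }
    }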

@otterc (Contributor) commented May 19, 2016

That breaks idempotency of the downstream operator. Also, the problem of not having a good default value would still apply to the block reader.

The file splitter and block reader work in conjunction, so to keep things simple and not break idempotency, this property IMO should live on the splitter.

Priyanka, how about setting the default to 16?

@DT-Priyanka (Contributor, Author)

If the user ends up using static partitioning on the readers and sets the reader count to a really small value, say 1 or 2, the readers will be overloaded reading 16 blocks in a window, and we can still see this problem.
A couple of options:

  1. Force the user to set blocksThreshold (by marking it as @NotNull). Here we will have to document clearly what this property actually means and how other parameters like reader count, block size, etc. affect its value. (Vlad also suggested we could make this a compulsory parameter; see the sketch after this list.)
  2. Look at the number of readers and the block size, do some calculations, and set blocksThreshold ourselves. We would need to handle dynamic partitioning scenarios gracefully here.
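A sketch of what option 1 could look like (assuming the field is changed from a primitive int to a boxed Integer so that @NotNull can enforce that it is set; this is not the actual AbstractFileSplitter source):

    /**
     * Maximum number of blocks emitted per streaming window. This must be set
     * explicitly: a good value depends on the reader partition count, the
     * block size, and how much work a reader can do in one window.
     */
    @javax.validation.constraints.NotNull
    @javax.validation.constraints.Min(1)
    private Integer blocksThreshold;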

@otterc (Contributor) commented May 20, 2016

Priyanka,
Point 2 is already handled.

I agree with point 1.

@vrozov (Member) commented May 20, 2016

@ChandniSingh, can you please elaborate on how item 2 is already handled?

@otterc (Contributor) commented May 20, 2016

@vrozov, dynamic partitioning of the block readers is explained in the documentation of this operator. There is an open ticket for changing the block threshold dynamically on the splitter.
