Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 1403: ArrayIndexOutOfBoundsException is thrown on readLastAddConfirmedAndEntry #1404

Closed
wants to merge 22 commits into from

Conversation

sijie
Copy link
Member

@sijie sijie commented May 12, 2018

Descriptions of the changes in this PR:

Motivation

There are two bugs in ReadLastAddConfirmedAndEntry:

  1. a regression was introduced by DistributionSchedule uses custom wrapped int[] rather than HashSet #657. the long poll read op is attempting to long-poll reading lac. since lac is stored across ensemble, so the retry logic assumes it will attempt over all the bookies in the ensemble. however DistributionSchedule uses custom wrapped int[] rather than HashSet #657 use a write-quorum-size write set for tracking those attempts. this leads to ArrayIndexOutOfBoundsException reported at ArrayIndexOutOfBoundsException is thrown on readLastAddConfirmedAndEntry #1403. The integrate tests added in this PR can easily reproduce this issue.

  2. there was a bug on retry logic when a ledger whose ensemble size is larger than write quorum size. when this happens, it will claim lac is not advanced prior to attempt the bookie in the ensemble. so the client will never know lac is advanced if using long poll reads on such ledgers. The integrate tests added in this PR can also catch this issue.

disclaim: twitter uses long poll reads but never uses ensembleSize > writeQuorumSize. so this is not a problem for dlog users.

Solution

  • introduce a getWriteSetForLongPoll call that uses ensembleSize for building the write set. this would address problem 1)
  • fix the assignment of numEmptyResponsesAllowed, so the long poll reads can work with ensembleSize > writeQuorumSize
  • add integration tests for long polling reads

at the same time, also add an integration test for normal tailing reads with

sijie added 21 commits May 3, 2018 01:13
*Motivation*

mvn build fails on a centos/7 box. Exceptions are thrown as below:

```
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (checkstyle) on project circe-checksum: Execution checkstyle of goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check failed: An API incompatibility was encountered while executing org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;Ljava/lang/Throwable;)V
```

*Solution*

The problem was addressed in https://issues.apache.org/jira/browse/MCHECKSTYLE-335. We need to bump the checkstyle plugin version to `3.0.0`
*Motivation*

 apache#1352 update the BC tests to include newer versions, however it removes the flags that used for 4.6.0 to handle badly shutdown.
 This change makes the BC tests become very flaky.

*Solution*

- apply `badlyshutdown` flag to both from 4.6.0 to 4.6.1 and from 4.6.1 to 4.6.2 upgrade
*Motivation*

Currently all bookkeeper's integration tests and bc tests are docker based and written
using arquillian framework. Due to the network issues in docker and the way how bookies
advertise themselves in the cluster, we can *ONLY* run integration tests in a linux
environment. It is inconvinient to people who is using macOS as the development environment.

*Solution*

This PR is providing a Vagrantfile to lanuch a dev vm that contains all the tools required
for running integration tests and bc tests. So people can actually debug and test integration
tests locally using a linux vm.
*Motivation*

TestSmoke#testTailingReads is testing the tailing behavior at bookkeeper. However the current testing logic is a bit
problematic on checking whether the reader has caught up with the expected last add confirmed. so the read logic break
earlier before reading the expectedLastConfirmedEntry.

*Solution*

Fix the checking logic on breaking the read loop.
@sijie
Copy link
Member Author

sijie commented May 12, 2018

This PR is based on #1402 and #1401 so it includes changes from those two PRs as well.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! +1

Copy link
Member

@jiazhai jiazhai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, need merge with master

@eolivelli
Copy link
Contributor

retest this please

@sijie sijie added the type/bug label May 15, 2018
@sijie sijie self-assigned this May 15, 2018
@sijie sijie closed this in 9259173 May 15, 2018
sijie added a commit that referenced this pull request May 15, 2018
…onfirmedAndEntry

Descriptions of the changes in this PR:

*Motivation*

There are two bugs in `ReadLastAddConfirmedAndEntry`:

1) a regression was introduced by #657. the long poll read op is attempting to long-poll reading lac. since lac is stored across ensemble, so the retry logic assumes it will attempt over all the bookies in the ensemble. however #657 use a `write-quorum-size` write set for tracking those attempts. this leads to ArrayIndexOutOfBoundsException reported at #1403. The integrate tests added in this PR can easily reproduce this issue.

2) there was a bug on retry logic when a ledger whose ensemble size is larger than write quorum size. when this happens, it will claim lac is not advanced prior to attempt the bookie in the ensemble. so the client will never know lac is advanced if using long poll reads on such ledgers. The integrate tests added in this PR can also catch this issue.

disclaim: twitter uses long poll reads but never uses `ensembleSize > writeQuorumSize`. so this is not a problem for dlog users.

*Solution*

- introduce a `getWriteSetForLongPoll` call that uses `ensembleSize` for building the write set. this would address problem 1)
- fix the assignment of `numEmptyResponsesAllowed`, so the long poll reads can work with `ensembleSize > writeQuorumSize`
- add integration tests for long polling reads

at the same time, also add an integration test for normal tailing reads with

Author: Sijie Guo <sijie@apache.org>

Reviewers: Enrico Olivelli <eolivelli@gmail.com>, Jia Zhai <None>

This closes #1404 from sijie/longpoll_tests, closes #1403

(cherry picked from commit 9259173)
Signed-off-by: Sijie Guo <sijie@apache.org>
@sijie sijie deleted the longpoll_tests branch July 16, 2018 02:44
reddycharan pushed a commit to reddycharan/bookkeeper that referenced this pull request Oct 17, 2018
…stAddConfirmedAndEntry

Descriptions of the changes in this PR:

*Motivation*

There are two bugs in `ReadLastAddConfirmedAndEntry`:

1) a regression was introduced by apache#657. the long poll read op is attempting to long-poll reading lac. since lac is stored across ensemble, so the retry logic assumes it will attempt over all the bookies in the ensemble. however apache#657 use a `write-quorum-size` write set for tracking those attempts. this leads to ArrayIndexOutOfBoundsException reported at apache#1403. The integrate tests added in this PR can easily reproduce this issue.

2) there was a bug on retry logic when a ledger whose ensemble size is larger than write quorum size. when this happens, it will claim lac is not advanced prior to attempt the bookie in the ensemble. so the client will never know lac is advanced if using long poll reads on such ledgers. The integrate tests added in this PR can also catch this issue.

disclaim: twitter uses long poll reads but never uses `ensembleSize > writeQuorumSize`. so this is not a problem for dlog users.

*Solution*

- introduce a `getWriteSetForLongPoll` call that uses `ensembleSize` for building the write set. this would address problem 1)
- fix the assignment of `numEmptyResponsesAllowed`, so the long poll reads can work with `ensembleSize > writeQuorumSize`
- add integration tests for long polling reads

at the same time, also add an integration test for normal tailing reads with

Author: Sijie Guo <sijie@apache.org>

Reviewers: Enrico Olivelli <eolivelli@gmail.com>, Jia Zhai <None>

This closes apache#1404 from sijie/longpoll_tests, closes apache#1403

(cherry picked from commit 9259173)
Signed-off-by: Sijie Guo <sijie@apache.org>
Signed-off-by: JV Jujjuri <vjujjuri@salesforce.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants