Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RATIS-523 RATIS-524 RATIS-525 RATIS-526 RATIS-527 RATIS-533 Lots of cleanup on the LogService #18

Closed
wants to merge 6 commits into from

Conversation

joshelser
Copy link
Member

Lots of various things that help clean up the logservice docker integration. Things seem to be repeatable now that I'm using docker on a linux machine (instead of docker for mac).

cc/ @ankitsinghal @VladRodionov @chrajeshbabu

@joshelser joshelser force-pushed the logservice-tweaks branch 4 times, most recently from b59d0a7 to f0d8ea8 Compare April 25, 2019 20:12
@joshelser joshelser changed the title RATIS-523 RATIS-524 RATIS-525 RATIS-526 RATIS-527 Lots of cleanup on the LogService RATIS-523 RATIS-524 RATIS-525 RATIS-526 RATIS-527 RATIS-533 Lots of cleanup on the LogService Apr 25, 2019
@ankitsinghal
Copy link
Contributor

Was getting following exception in the service
` 2019-05-02 03:22:56,201 WARN server.GrpcServerProtocolService (LogUtils.java:warn(136)) - worker2.logservice.ratis.org_9999: Failed requestVote worker1.logservice.ratis.org_9999->worker2.logservice.ratis.org_9999#0
org.apache.ratis.protocol.GroupMismatchException: worker2.logservice.ratis.org_9999: group-87E0933ED47B not found.
at org.apache.ratis.server.impl.RaftServerProxy$ImplMap.get(RaftServerProxy.java:122)
at org.apache.ratis.server.impl.RaftServerProxy.getImplFuture(RaftServerProxy.java:269)
at org.apache.ratis.server.impl.RaftServerProxy.getImpl(RaftServerProxy.java:278)
at org.apache.ratis.server.impl.RaftServerProxy.getImpl(RaftServerProxy.java:273)
at org.apache.ratis.server.impl.RaftServerProxy.requestVote(RaftServerProxy.java:442)
at org.apache.ratis.grpc.server.GrpcServerProtocolService.requestVote(GrpcServerProtocolService.java:55)
at org.apache.ratis.proto.grpc.RaftServerProtocolServiceGrpc$MethodHandlers.invoke(RaftServerProtocolServiceGrpc.java:319)
at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)
at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

May 02, 2019 3:22:56 AM org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference cleanQueue
SEVERE: ~* Channel ManagedChannelImpl{logId=22, target=worker3.logservice.ratis.org:9999} was not shutdown properly!!! *~
Make sure to call shutdown()/shutdownNow() and wait until awaitTermination() returns true.
java.lang.RuntimeException: ManagedChannel allocation site
at org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.(ManagedChannelOrphanWrapper.java:103)
at org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:53)
at org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:44)
at org.apache.ratis.thirdparty.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:411)
at org.apache.ratis.grpc.client.GrpcClientProtocolClient.(GrpcClientProtocolClient.java:118)
at org.apache.ratis.grpc.client.GrpcClientRpc.lambda$new$0(GrpcClientRpc.java:55)
at org.apache.ratis.util.PeerProxyMap$PeerAndProxy.lambda$getProxy$0(PeerProxyMap.java:61)
at org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:202)
at org.apache.ratis.util.PeerProxyMap$PeerAndProxy.getProxy(PeerProxyMap.java:60)
at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:107)
at org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:91)
at org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:401)
at org.apache.ratis.client.impl.RaftClientImpl.groupRemove(RaftClientImpl.java:280) `

While the verification tool is saying that writes are successful.
2019-05-02 03:23:46,699 INFO server.LogStateMachine (VerificationTool.java:run(163)) - LogName['testlog5'] Writing message900 2019-05-02 03:23:46,885 INFO server.LogStateMachine (VerificationTool.java:run(163)) - LogName['testlog6'] Writing message900 2019-05-02 03:23:47,304 INFO server.LogStateMachine (VerificationTool.java:run(163)) - LogName['testlog2'] Writing message950 2019-05-02 03:23:47,663 INFO server.LogStateMachine (VerificationTool.java:run(168)) - 1000 log entries written to log LogName['testlog4'] successfully. 2019-05-02 03:23:48,064 INFO server.LogStateMachine (VerificationTool.java:run(163)) - LogName['testlog5'] Writing message950 2019-05-02 03:23:48,365 INFO server.LogStateMachine (VerificationTool.java:run(163)) - LogName['testlog6'] Writing message950 2019-05-02 03:23:48,626 INFO server.LogStateMachine (VerificationTool.java:run(168)) - 1000 log entries written to log LogName['testlog2'] successfully. 2019-05-02 03:23:49,034 INFO server.LogStateMachine (VerificationTool.java:run(168)) - 1000 log entries written to log LogName['testlog5'] successfully. 2019-05-02 03:23:49,124 INFO server.LogStateMachine (VerificationTool.java:run(168)) - 1000 log entries written to log LogName['testlog6'] successfully.
Not sure if it is an issue or just a normal logging, but anyways doesn't seem to be related to changes or JIRAs fixed under this pull request

RaftClientConfigKeys.Rpc.setRequestTimeout(properties, TimeDuration.valueOf(100, TimeUnit.SECONDS));

// Increase the segment size to avoid rolling so quickly
SizeInBytes segmentSize = SizeInBytes.valueOf("32MB");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why hardcoding? can't we do it through some config? or it is just default and will be overridden in later step?

Copy link
Member Author

@joshelser joshelser May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was weird. I was playing with making it configurable, but changes to it had some really harmful implications (things were just hanging when I increased this, for example).

Edit: was thinking about the write-buffer size. I don't think I played around with this.

I left it here as something for us to come back to later, but not something I wanted to have people tweak. Maybe it's OK to still let this be configurable? (with a big-fat-warning :P)

TimeDuration leaderElectionMaxTimeout = TimeDuration.valueOf(
leaderElectionTimeoutMin.toLong(TimeUnit.MILLISECONDS) + 200,
TimeUnit.MILLISECONDS);
RaftServerConfigKeys.Rpc.setTimeoutMax(properties, leaderElectionMaxTimeout);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, why are we hardcoding these values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a copy-paste out of Ozone. I don't have a good understand of why they chose these.

Was going for just calling out things we want to set in Ratis, but leaving the actual configuration to the work Vlad is going to get to.

a snapshot is an equivalent representation of all of the updates from the log. However, the LogService
is written to expect that we maintain these records. As such, we must not allow snapshots to automatically
happen as we may lose records from teh RAFT log. `RaftServerConfigKeys.Snapshot.setAutoTriggerEnabled()`
defaults to `false` and should not be set to `true`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good tuning guide.

But Snapshot.setAutoTriggerEnabled() config doesn't look like a tuning, as it impacts the log service, shouldn't we throw an exception if it is set to true or explicitly set to false in code(just to avoid human errors)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we throw an exception if it is set to true or explicitly set to false in code(just to avoid human errors)

Very good suggestion. Let me change that.

@joshelser
Copy link
Member Author

org.apache.ratis.protocol.GroupMismatchException: worker2.logservice.ratis.org_9999: group-87E0933ED47B not found.

Yeah, I've seen this too, but I don't know what group this is actually referring to. Something else to dig into. I was ignoring it for now.

SEVERE: ~* Channel ManagedChannelImpl{logId=22, target=worker3.logservice.ratis.org:9999} was not shutdown properly!!! *~
Make sure to call shutdown()/shutdownNow() and wait until awaitTermination() returns true.

I'm guessing something around gRPC in the Ratis client code isn't closing stuff correctly in all situations. Another thing to come back to as it doesn't seem to be an immediate problem :)

I've seen both of these since doing stuff in the LogService. I don't think they're new.

joshelser added 6 commits May 2, 2019 15:07
* Add lots of new options
* Fix LOG/stdout use
* Remove logs if they already exist before writing
Include brief summary from the Jira issue as to what/why we
changed, things we might want to tune in the future, and
validate that config isn't set which would break the
log service.
@joshelser joshelser force-pushed the logservice-tweaks branch from f0d8ea8 to 538ef5e Compare May 2, 2019 19:35
@joshelser
Copy link
Member Author

Rebased the branch. 538ef5e has some basic validation to make sure we don't have autoSnapshotEnabled

@chrajeshbabu
Copy link
Contributor

The refactor and new docker scripts for docker handling are good @joshelser.

@asfgit asfgit closed this in c01c5e9 May 3, 2019
@joshelser joshelser deleted the logservice-tweaks branch May 3, 2019 18:09
@joshelser
Copy link
Member Author

Thanks for looking, Rajeshbabu!

symious pushed a commit to symious/ratis that referenced this pull request Jan 2, 2024
Include brief summary from the Jira issue as to what/why we
changed, things we might want to tune in the future, and
validate that config isn't set which would break the
log service.

Closes apache#18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants