Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DiSCo(Concurrency) unable to propagate context in ForkJoinTask #12

Open
guilhermeblanco opened this issue Sep 30, 2020 · 6 comments
Open

Comments

@guilhermeblanco
Copy link

Following the guidelines provided in https://docs.aws.amazon.com/xray/latest/devguide/aws-x-ray-auto-instrumentation-agent-for-java.html , my hopes were that I could collect tracing data from multiple Java services I have implemented as part of my stack.

My first service is a tool we rely on for CQRS called Axon (https://axoniq.io/download). In our case, we use the AxonServer enterprise (paid), but I was able to replicate the exact same issue on a single node on AxonServer standard edition (open-source). I decided to build my own container and publish to ECR. To package and configure the pre-built JAR file for AxonServer, I created the following Dockerfile:

FROM busybox as source

RUN addgroup -S -g 1001 axonserver \
    && adduser -S -u 1001 -G axonserver -h /axonserver -D axonserver \
    && mkdir -p /axonserver/disco/disco-plugins \
    && mkdir -p /axonserver/config /axonserver/data /axonserver/events /axonserver/log \
    && chown -R axonserver:axonserver /axonserver

FROM amazoncorretto/amazoncorretto:15-alpine-full

COPY --from=source /etc/passwd /etc/group /etc/
COPY --from=source --chown=axonserver /axonserver /axonserver

COPY --chown=axonserver axonserver.jar axonserver.properties /axonserver/
COPY --chown=axonserver disco/* /axonserver/disco/

USER axonserver
WORKDIR /axonserver

ENV JAVA_OPTIONS=""
ENV LANG=C.UTF-8

VOLUME [ "/axonserver/config", "/axonserver/data", "/axonserver/events", "/axonserver/log" ]
EXPOSE 8024/tcp 8124/tcp 8224/tcp

ENTRYPOINT java -Duser.timezone=UTC -Djava.security.egd=file:/dev/./urandom -Dlogging.level.com.amazonaws.xray=DEBUG -javaagent:/axonserver/disco/disco-java-agent.jar=pluginPath=/axonserver/disco/disco-plugins:loggerfactory=software.amazon.disco.agent.reflect.logging.StandardOutputLoggerFactory:verbose $JAVA_OPTIONS -jar axonserver.jar

Axon exposes 3 ports that does the following: 8024 provides the GUI (and also responsible for readiness and liveness), 8124 is the gRPC responsible to receive commands/queries and dispatch to corresponding handlers, and finally to relay the resulting events to any event handler that may be interested, and finally port 8224 that does cross-node communication for clustering.

My setup is currently an EKS cluster with various ties to AWS services, such as Cloudwatch, Container Insights, ECR, IAM-Authenticator, ALB Ingress Controller, etc. To leverage the integration with X-Ray, I've deployed the X-Ray daemon as a Daemonset in my cluster and exposed them as a headless service reachable through aws-xray-daemon.aws-system.svc.cluster.local:2000. Performing a netcat successfully connects to the UDP port, meaning daemon is reachable for the containers. Please let me know if you require me to provide the Kubernetes manifests for this deployment. It's inspired on https://www.eksworkshop.com/intermediate/245_x-ray/x-ray-daemon/ with my specific setup/configuration.

My xray daemon contains the following configuration:

# Maximum buffer size in MB (minimum 3). Choose 0 to use 1% of host memory.
TotalBufferSizeMB: 0
# Maximum number of concurrent calls to AWS X-Ray to upload segment documents.
Concurrency: 8
# Send segments to AWS X-Ray service in a specific region
Region: "us-east-1"
# Change the X-Ray service endpoint to which the daemon sends segment documents.
Endpoint: ""
Socket:
  # Change the address and port on which the daemon listens for UDP packets containing segment documents.
  # Make sure we listen on all IP's by default for the k8s setup
  UDPAddress: 0.0.0.0:2000
Logging:
  LogRotation: true
  # Change the log level, from most verbose to least: dev, debug, info, warn, error, prod (default).
  LogLevel: prod
  # Output logs to the specified file path.
  LogPath: ""
# Turn on local mode to skip EC2 instance metadata check.
LocalMode: false
# Amazon Resource Name (ARN) of the AWS resource running the daemon.
ResourceARN: ""
# Assume an IAM role to upload segments to a different account.
RoleARN: ""
# Disable TLS certificate verification.
NoVerifySSL: false
# Upload segments to AWS X-Ray through a proxy.
ProxyAddress: ""
# Daemon configuration file format version.
Version: 2

When deploying it on my EKS cluster, the application is fully functional/operational. No traces show up in the AWS Console UI. However, when I turn on the logs, I experience these from disco:

axonserver-0 axonserver [software.amazon.disco.agent.DiscoAgentTemplate] DiSCo(Core) finished parsing argument list: pluginPath=/axonserver/disco/disco-plugins:loggerfactory=software.amazon.disco.agent.reflect.logging.StandardOutputLoggerFactory:verbose
axonserver-0 axonserver [software.amazon.disco.agent.DiscoAgentTemplate] DiSCo(Core) passing arguments to ForkJoinTaskInterceptor to process
axonserver-0 axonserver [software.amazon.disco.agent.DiscoAgentTemplate] DiSCo(Core) passing arguments to ForkJoinPoolInterceptor to process
axonserver-0 axonserver [software.amazon.disco.agent.DiscoAgentTemplate] DiSCo(Core) passing arguments to ExecutorInterceptor to process
axonserver-0 axonserver [software.amazon.disco.agent.DiscoAgentTemplate] DiSCo(Core) passing arguments to ThreadInterceptor to process
axonserver-0 axonserver [software.amazon.disco.agent.DiscoAgentTemplate] DiSCo(Core) passing arguments to ThreadSubclassInterceptor to process
axonserver-0 axonserver [software.amazon.disco.agent.DiscoAgentTemplate] DiSCo(Core) passing arguments to ScheduledThreadPoolExecutorInterceptor to process
axonserver-0 axonserver [software.amazon.disco.agent.interception.InterceptionInstaller] DiSCo(Core) attempting to install software.amazon.disco.agent.concurrent.ForkJoinTaskInterceptor
axonserver-0 axonserver [software.amazon.disco.agent.interception.InterceptionInstaller] DiSCo(Core) attempting to install software.amazon.disco.agent.concurrent.ForkJoinPoolInterceptor
axonserver-0 axonserver [software.amazon.disco.agent.interception.InterceptionInstaller] DiSCo(Core) attempting to install software.amazon.disco.agent.concurrent.ExecutorInterceptor
axonserver-0 axonserver [software.amazon.disco.agent.interception.InterceptionInstaller] DiSCo(Core) attempting to install software.amazon.disco.agent.concurrent.ThreadInterceptor
axonserver-0 axonserver [software.amazon.disco.agent.interception.InterceptionInstaller] DiSCo(Core) attempting to install software.amazon.disco.agent.concurrent.ThreadSubclassInterceptor
axonserver-0 axonserver [software.amazon.disco.agent.interception.InterceptionInstaller] DiSCo(Core) attempting to install software.amazon.disco.agent.concurrent.ScheduledThreadPoolExecutorInterceptor
axonserver-0 axonserver [software.amazon.disco.agent.DiscoAgent] DiSCo(Agent) agent startup complete

... Axon startup logs here ...

axonserver-0 axonserver [software.amazon.disco.agent.concurrent.ForkJoinTaskInterceptor] DiSCo(Concurrency) unable to propagate context in ForkJoinTask
axonserver-0 axonserver [software.amazon.disco.agent.concurrent.ForkJoinTaskInterceptor] DiSCo(Concurrency) unable to propagate context in ForkJoinTask

... Same error keeps repeating undefinitely ...

Please let me know if there is anything else I can provide or do to help you troubleshoot this error.

@connellp
Copy link
Contributor

connellp commented Oct 1, 2020

Hello. Thanks for trying to onboard with X-Ray, and sorry that you're experiencing these frustrations. The first thing really that I notice, is that your Dockerfile tells me that you are using Coretto 15. So far, we have tested the disco/x-ray software stack on JDKs 8 and 11. If you are committed to using version 15, we can spend some time on our side testing with that JDK - I wouldn't necessarily be able to immediately commit to a firm timeframe for that, but we can certainly treat it as a '+1' on a feature request for it.

In parallel, are you able to repeat your test using JDK 11? If that worked for you, it would give us a clearer hint that the JDK version might be impactful here.

@willarmiros
Copy link
Contributor

Hi @guilhermeblanco,

Could you also post the contents of your /axonserver/disco/disco-plugins directory? In those initialization logs where it says "attempting to install XXXInterceptor" I would expect to see similar messages messages for an AWS interceptor, SQL interceptor, and Web interceptor which are bundled by default in disco-plugins with the X-Ray Agent ZIP.

As for the unable to propagate context, I've experienced those messages myself and further investigation is needed for them, but I don't think it's related to your issue of not seeing traces.

@guilhermeblanco
Copy link
Author

@willarmiros Interesting... Docker flats the folder structure an only copies the files (without their recursive path). I ended up with my Dockerfile to have all plugins in the same folder as disco-java-agent.jar. I've changes the line to copy them directly and it indeed expanded the list of interceptors installed.

@connellp I've addressed the point William made about missing interceptors and switched to amazoncorretto/amazoncorretto:11-alpine-full as per your request. Everything works smoothly on JDK11. I'll keep this version for this service and will revert the JDK upgrade from 11 => 15 for the other services, so we can promote tracing across all services we're implementing.

Please let me know if there's anything else you may want me to verify or test. Otherwise, I'll leave the fate of this issue in your hands to either close it, rename it, etc. =)

@willarmiros
Copy link
Contributor

Glad to hear you got it working! So you don't see the unable to propagate context messages with the fixed disco-plugins directory and on JDK 11? Also, did you notice roughly how often the messages were being printed? E.g. every 10 seconds, longer, or shorter?

@guilhermeblanco
Copy link
Author

guilhermeblanco commented Oct 1, 2020

@willarmiros Yes, everything is working (I see some errors, but they are mainly due to background processes and lack a main segment), so fully ignorable. Other parts are working as expected.

axonserver-2 axonserver [software.amazon.disco.agent.event.EventBus] DiSCo(Core) publishing event software.amazon.disco.agent.event.ServiceDownstreamRequestEvent@2c2b93df from origin SQL
axonserver-2 axonserver [software.amazon.disco.agent.event.EventBus] DiSCo(Core) publishing event software.amazon.disco.agent.event.ServiceDownstreamResponseEvent@151daca9 from origin SQL
axonserver-2 axonserver [software.amazon.disco.agent.event.EventBus] DiSCo(Core) publishing event software.amazon.disco.agent.event.ServiceDownstreamResponseEvent@65d870be from origin SQL
axonserver-2 axonserver [software.amazon.disco.agent.event.EventBus] DiSCo(Core) publishing event software.amazon.disco.agent.event.ServiceDownstreamResponseEvent@30eb4739 from origin SQL
axonserver-2 axonserver 2020-10-01 19:12:05.621 ERROR 1 --- [           main] c.a.x.s.LogErrorContextMissingStrategy   : Suppressing AWS X-Ray context missing exception (SegmentNotFoundException): No segment in progress.
axonserver-2 axonserver 2020-10-01 19:12:05.622 DEBUG 1 --- [           main] c.a.x.s.LogErrorContextMissingStrategy   : java.lang.RuntimeException: No segment in progress.
axonserver-2 axonserver 2020-10-01 19:12:05.622 ERROR 1 --- [           main] c.a.x.s.LogErrorContextMissingStrategy   : Suppressing AWS X-Ray context missing exception (SubsegmentNotFoundException): Failed to end subsegment: subsegment cannot be found.
axonserver-2 axonserver 2020-10-01 19:12:05.622 DEBUG 1 --- [           main] c.a.x.s.LogErrorContextMissingStrategy   : java.lang.RuntimeException: Failed to end subsegment: subsegment cannot be found.

I'm seeing logs around every 6 seconds.

@connellp
Copy link
Contributor

connellp commented Oct 1, 2020

@guilhermeblanco I think the ForkJoinTask thing is unrelated, and possibly benign. I ran all our integ tests (which are quite large in number and I trust them) on JDK15, and they did all pass, so it's still unexplained. If I find a reproducible case, I will follow up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants