Skywalking benchmark on Finance Core System #9002

lewiselau · 2022-05-06T17:24:24Z

lewiselau
May 6, 2022

Hi Community,

We ran a performance test on SkyWalking, based on our finance core system.

The platform architecture:

Application: Finance core system, Kubernetes-based, 10 key Spring Boot microservices, 40 pods with load balancing.
SkyWalking backend: OAP v9.1.0-SNAPSHOT-85CE164 (20220217220813), K8S-based, 2 nodes with 8 GHz / 8 GB RAM each, Elasticsearch storage (3 to 6 nodes). (After the 100 CCU test, OAP was upgraded to OAP 9.1.0 - 20220508.)
SkyWalking Java agent plugins: Java agent v8.10, default plugins + MyBatis plugin + thread pool plugin + URL ignore plugin + trace Maven dependency to write the TraceID into Logback. (After the 100 CCU test, the Java agent was upgraded to v8.11 - 20220508.)
SkyWalking Vue agent: "skywalking-client-js":"^0.8.0"
Test cases: 5 CCU, 10 CCU, 20 CCU, 40 CCU, 50 CCU, 100 CCU (CCU = concurrent users; here each one is a REST API caller driven by JMeter, without think time)

Phased Summary:

Storage is already the key bottleneck and the main optimization point. For cost efficiency, we are willing to extend the ES flush interval for better performance; 60 seconds seems to work well instead of the 10-second default.
There are multiple OAP thread pools. We tried scaling out some of the threads, but threads are not the bottleneck; storage IOPS is.
Some maximum content length settings need tuning; the current limits cause part of the data to be dropped.
Storage consumption: 1 GB per CCU per hour, of which 80% is logging (2-day retention), 8% is tracing (2-day retention), and 12% is metrics (10-day retention).
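To make the storage figure easier to plan with, here is a rough back-of-the-envelope sketch. It assumes that the 1 GB per CCU per hour is the ingest rate, that it scales linearly with CCU, that the load runs 24 hours a day, and that each data type is kept for exactly its retention window. None of these assumptions is confirmed by the benchmark, and the function name is only illustrative.

```python
# Ingest rate and (share, retention in days) as reported in the summary above.
GB_PER_CCU_HOUR = 1.0
BREAKDOWN = {
    "logging": (0.80, 2),
    "tracing": (0.08, 2),
    "metrics": (0.12, 10),
}


def steady_state_storage_gb(ccu, hours_per_day=24.0):
    """Estimate the stored volume per data type once retention is fully filled.

    Assumes linear scaling with CCU and a constant load (both are assumptions).
    """
    daily_total = ccu * hours_per_day * GB_PER_CCU_HOUR
    return {
        name: daily_total * share * retention_days
        for name, (share, retention_days) in BREAKDOWN.items()
    }


estimate = steady_state_storage_gb(100)
for name, gb in estimate.items():
    print(f"{name:8s} {gb:8.0f} GB")
print(f"{'total':8s} {sum(estimate.values()):8.0f} GB")
```

Under these assumptions, metrics take up a large part of the steady-state footprint despite being only 12% of the ingest, because they are retained five times longer than logs and traces.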

Some open questions:

We used SkyWalking self-observability to trace OAP-side performance. However, we could only find one OAP instance, 'localhost:1234', even with multiple OAP pods, whether we used static hostnames or K8S pod discovery as described in the setup manual: https://skywalking.apache.org/docs/main/latest/en/setup/backend/backend-telemetry/.
<20220509> This one was resolved. Per review from the community, the key points are:
- The OAP side speaks OpenCensus, so the OpenTelemetry Collector ConfigMap must use the opencensus exporter; see https://github.com/apache/skywalking-showcase/blob/main/deploy/platform/kubernetes/feature-so11y/open-telemetry.yaml#L51
- Disable SW_PROMETHEUS_FETCHER, which is only for the static way.
- Keep the OAP container name and service name the same as those defined in the scrape_configs job of the OpenTelemetry Collector, e.g. oap as the container name and oap-server as the service name.
Please refer to my practice in the Appendix.
No data appears on the Virtual Database UI in our environment (Ali K8S PaaS, Ali MySQL PaaS), although it works well in other scenarios. No idea so far.
<20220512> This one was resolved. The root cause is that SkyWalking's class enhancement was affected by the enhancement from another APM tracer, Aliyun ARMS, which led to the enhancement failures below:
- Enhance class io.grpc.netty.NettyClientStream error
- Enhance class com.alibaba.druid.pool.DruidDataSource error.
- Enhance class com.mysql.cj.jdbc.ConnectionImpl error. (This one is related to Virtual Database)
  After disabling ARMS, the Virtual Database UI works well.
We hope to archive some key metrics for long-term analysis, e.g. apdex and p9x response time data. However, there is no default solution. We are validating three options: a) GraphQL with a script; b) coding a gRPC server to activate the SkyWalking exporter feature; c) ETL at the storage level into another storage. We are still working on it.
<20220512> We are proceeding with the gRPC server option and will update later.
After a small version upgrade within OAP 9.0 (from v9.1.0-SNAPSHOT-85CE164 (20220217220813) to 9.1.0 - 20220508), no newly collected data can be shown on the UI, which displays the following notification:
Exception while fetching data (/data) : null
The data can be shown again after a storage refresh, e.g. of ES.
<20220522> Per our test, new data (metrics/traces/logs) can be written to Elasticsearch, and data ingested within the past week can be queried from the UI normally. However, when querying the past month, where data older than one week has already been removed by the housekeeping job, the UI shows the warning "Exception while fetching data (/data) : null".
In addition, logs such as the following are generated:

2022-05-22 13:10:57,480 org.apache.skywalking.library.elasticsearch.client.SearchClient 63 [armeria-eventloop-epoll-5-1] ERROR [] - [9.1.0-SNAPSHOT-1285f5b] Failed to search, request org.apache.skywalking.library.elasticsearch.requests.search.Search@404c6719, params null, index [sky-walking_segment-20220422, sky-walking_segment-20220423, sky-walking_segment-20220424, sky-walking_segment-20220425, sky-walking_segment-20220426, sky-walking_segment-20220427, sky-walking_segment-20220428, sky-walking_segment-20220429, sky-walking_segment-20220430, sky-walking_segment-20220501, sky-walking_segment-20220502, sky-walking_segment-20220503, sky-walking_segment-20220504, sky-walking_segment-20220505, sky-walking_segment-20220506, sky-walking_segment-20220507, sky-walking_segment-20220508, sky-walking_segment-20220509, sky-walking_segment-20220510, sky-walking_segment-20220511, sky-walking_segment-20220512, sky-walking_segment-20220513, sky-walking_segment-20220514, sky-walking_segment-20220515, sky-walking_segment-20220516, sky-walking_segment-20220517, sky-walking_segment-20220518, sky-walking_segment-20220519, sky-walking_segment-20220520, sky-walking_segment-20220521, sky-walking_segment-20220522]
java.util.concurrent.CompletionException: java.lang.RuntimeException: {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [sky-walking_segment-20220422]","resource.type":"index_or_alias","resource.id":"sky-walking_segment-20220422","index_uuid":"_na_","index":"sky-walking_segment-20220422"}],"type":"index_not_found_exception","reason":"no such index [sky-walking_segment-20220422]","resource.type":"index_or_alias","resource.id":"sky-walking_segment-20220422","index_uuid":"_na_","index":"sky-walking_segment-20220422"},"status":404}
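The log above lists 31 daily indices, and Elasticsearch rejects the search because the oldest one no longer exists. The sketch below only illustrates that mismatch and one way a caller could narrow the range before querying; it is not how OAP builds its index list. The function names are hypothetical, and the 7-day retention (counting today) is an assumption taken from the "over 1 week" remark.

```python
from datetime import date, timedelta


def daily_indices(prefix, start, end):
    """Return one index name per day for the inclusive range start..end."""
    days = (end - start).days
    return [f"{prefix}-{start + timedelta(days=i):%Y%m%d}" for i in range(days + 1)]


def clamp_to_retention(start, end, today, ttl_days):
    """Move start forward so the range only covers days still within retention."""
    oldest_kept = today - timedelta(days=ttl_days - 1)
    return max(start, oldest_kept), end


today = date(2022, 5, 22)
query_start = date(2022, 4, 22)

# What a one-month query asks for: includes indices already removed.
requested = daily_indices("sky-walking_segment", query_start, today)

# The same query narrowed to the assumed 7-day retention window.
start, end = clamp_to_retention(query_start, today, today, ttl_days=7)
existing = daily_indices("sky-walking_segment", start, end)

print(len(requested), "requested, oldest:", requested[0])
print(len(existing), "within retention, oldest:", existing[0])
```

Until the query side handles missing indices, keeping the UI time range within the retention window avoids the 404.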

Detailed performance test data

The data is linked below, and we will continue to update it.

https://gist.github.com/lewiselau/9aa8bec87e3af682d7229660f965fbab
Update: 20220508 00
Direct image here, since the gist sometimes seems not to be visible.

Appendix:

1. Self-observability settings with K8S discovery
Per review and guidance from the community experts, the k8s-discovery option of the self-observability feature is now activated. The configuration files are attached here as a working practice.
Environment variables and ports on the OAP deployment
ports:
  - containerPort: 1234
    name: prometheus-port
env:
  - name: SW_TELEMETRY
    value: prometheus
  - name: SW_TELEMETRY_PROMETHEUS_PORT
    value: '1234'
  - name: SW_OTEL_RECEIVER
    value: default
  - name: SW_OTEL_RECEIVER_ENABLED_OC_RULES
    value: oap
  # SW_PROMETHEUS_FETCHER stays disabled; it is only for the static way.
  # - name: SW_PROMETHEUS_FETCHER
  #   value: default

ConfigMap and Deployment for the OpenTelemetry Collector

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf-so11y
  labels:
    app: opentelemetry-so11y
data:
  otel-collector-config: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: 'skywalking-so11y'
              metrics_path: '/metrics'
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_container_name, __meta_kubernetes_pod_container_port_name]
                  action: keep
                  regex: oap;prometheus-port
                - source_labels: []
                  target_label: service
                  replacement: oap-server
                - source_labels: [__meta_kubernetes_pod_name]
                  target_label: host_name
                  regex: (.+)
                  replacement: $$1
    exporters:
      opencensus:
        endpoint: "sw-service.apm.svc:11800"
        tls:
          insecure: true
      # logging:
      #   logLevel: debug
    service:
      pipelines:
        metrics:
          receivers: [ prometheus ]
          exporters: [ opencensus ]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-deployment-so11y
  labels:
    app: otel-so11y
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-so11y
  template:
    metadata:
      labels:
        app: otel-so11y
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      serviceAccountName: otel-sa-so11y
      containers:
        - name: otel
          image: otel/opentelemetry-collector:0.50.0
          command:
            - "/otelcol"
            - "--config=/conf/otel-collector-config.yaml"
          volumeMounts:
            - name: otel-collector-config-vol-so11y
              mountPath: /conf
      volumes:
        - name: otel-collector-config-vol-so11y
          configMap:
            name: otel-collector-conf-so11y
            items:
              - key: otel-collector-config
                path: otel-collector-config.yaml
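The first relabel rule in the ConfigMap above explains why the container and port names matter. Prometheus-style relabeling joins the values of source_labels with ";" and keeps a target only if the regex matches the whole joined string. The sketch below imitates that behaviour with Python's re module as a stand-in for the collector's matcher; the helper name and the sample label sets are hypothetical.

```python
import re


def keep(labels, source_labels, regex, separator=";"):
    """Imitate a Prometheus 'keep' action: join the label values, then full-match."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, joined) is not None


SOURCE = [
    "__meta_kubernetes_pod_container_name",
    "__meta_kubernetes_pod_container_port_name",
]
RULE = "oap;prometheus-port"  # same regex as in the ConfigMap above

# Hypothetical discovered targets.
oap_target = {SOURCE[0]: "oap", SOURCE[1]: "prometheus-port"}
renamed_container = {SOURCE[0]: "skywalking-oap", SOURCE[1]: "prometheus-port"}
other_port = {SOURCE[0]: "oap", SOURCE[1]: "grpc"}

for name, target in [
    ("oap_target", oap_target),
    ("renamed_container", renamed_container),
    ("other_port", other_port),
]:
    print(name, "kept" if keep(target, SOURCE, RULE) else "dropped")
```

Only the target whose container is named oap and whose port is named prometheus-port survives, which is why those names have to match the OAP deployment exactly.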

Permission settings, since the K8S API server is accessed for pod metadata:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-sa-so11y
  namespace: apm
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-role-so11y
  #namespace: apm
rules:
  - apiGroups: [""]
    resources:
      - pods # @feature: so11y; OpenTelemetry needs to read OAP Pods information to get OAP details
    verbs:
      - get
      - watch
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-role-binding-so11y
  namespace: apm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-role-so11y
subjects:
  - kind: ServiceAccount
    name: otel-sa-so11y
    namespace: apm

2. The UI snapshot
So far we have achieved self-observability in both the static and the K8S-discovery ways.

Superskyyy · 2022-05-06T18:51:10Z

Superskyyy
May 6, 2022
Collaborator

Thanks for sharing! It looks like the tables are messed up because GitHub is trying to limit the width. Let's find a better way to show the work.

0 replies

kezhenxu94 · 2022-05-07T02:06:47Z

kezhenxu94
May 7, 2022
Collaborator

Some open questions:

We used SkyWalking self-observability to trace OAP-side performance. However, we could only find one OAP instance, 'localhost:1234', even with multiple OAP pods, whether we used static hostnames or K8S pod discovery as described in the setup manual: https://skywalking.apache.org/docs/main/latest/en/setup/backend/backend-telemetry/.

https://skywalking.apache.org/docs/main/latest/en/setup/backend/backend-telemetry/#service-discovery-on-kubernetes Service discovery on Kubernetes should work. I have used it to set up self-observability multiple times, and all the OAP instances are correctly shown with their Pod IPs. Can you share more details on how you set it up and what the result looks like?

We hope to archive some key metrics for long-term analysis, e.g. apdex and p9x response time data. However, there is no default solution. We are validating three options: a) GraphQL with a script; b) coding a gRPC server to activate the SkyWalking exporter feature; c) ETL at the storage level into another storage. We are still working on it.

I personally think using the SkyWalking gRPC exporter is a quick and practical way to save those key metrics for the longer term.

9 replies

wankai123 May 9, 2022
Collaborator

@lewiselau The config above gets SW_OTEL_RECEIVER_ENABLED_OC_RULES wrong: it should be oap, not oapx.
I will adopt the new OTEL version, but before that, you could use version 0.29.0, as the showcase does.

lewiselau May 9, 2022
Author

@wankai123 Thanks. It is definitely oap in the real environment; oapx was only a typo on the discussion page, which has been corrected. So even with oap and 0.29.0, the same error log appears on the OTel Collector pod.

wankai123 May 9, 2022
Collaborator

@lewiselau I see in your log {"kind": "exporter", "name": "otlp", "error ....."
Can you confirm that your OTEL exporter config uses opencensus? It should not be otlp. You had better refer to the showcase for the whole config.
https://github.com/apache/skywalking-showcase/blob/main/deploy/platform/kubernetes/feature-so11y/open-telemetry.yaml#L51

kezhenxu94 May 9, 2022
Collaborator

Method not found: opentelemetry.proto.collector.metrics.v1.MetricsService/Export

This is a clear error log (you should have pasted it in the first place) showing that we never implemented the OpenTelemetry protocol. Please see the comment above: what we support is the OpenCensus exporter, not otlp.

lewiselau May 9, 2022
Author

@wu-sheng @wankai123 @kezhenxu94 Thanks for your kind help. Per your professional review and guidance, the k8s-discovery option of the self-observability feature was activated. The correction was added to the main discussion description. (We validated that the latest otel/opentelemetry-collector:0.50.0 works well.)

kezhenxu94 · 2022-05-07T02:08:38Z

kezhenxu94
May 7, 2022
Collaborator

@wu-sheng The author says they will keep updating the perf results, but you moved them to your own gist account. I think the author won't be able to update them there.

2 replies

wu-sheng May 7, 2022
Collaborator

I just gave him/her a demo; the old one was impossible to read. I DMed him/her to use a new gist of their own.

lewiselau May 7, 2022
Author

Thanks for your kind guidance; a gist was created for better sharing.

kezhenxu94 · 2022-05-08T02:21:50Z

kezhenxu94
May 8, 2022
Collaborator

@lewiselau Regarding the "thread pool full" exception, you should bump the OAP version to include commit 023a2d3165e2cf2217572caf8f8017b8a202e135. There was a critical bug in queue consumption, and I have verified the "thread pool full" exception on my side. It would be better to bump to the latest version, as it also includes another bug fix for an issue that appears when there are tens of millions of services (I do not see that in your case, though).

8 replies

lewiselau May 9, 2022
Author

@kezhenxu94 Quick update: INVALID_ARGUMENT was resolved after upgrading the Java agent side to the latest code version. Cheers.
We will start a new cycle of higher-throughput benchmarks soon.

lewiselau May 12, 2022
Author

@kezhenxu94 Please see the small version upgrade exception below for your review.
After a small version upgrade within OAP 9.0 (from v9.1.0-SNAPSHOT-85CE164 (20220217220813) to 9.1.0 - 20220508), no newly collected data can be shown on the UI, which displays the following notification:
Exception while fetching data (/data) : null
The data can be shown again after a storage refresh, e.g. of ES. We are not sure whether an Elasticsearch refresh is required for every patch upgrade.

kezhenxu94 May 13, 2022
Collaborator

We are not sure whether an Elasticsearch refresh is required for every patch upgrade.

If that's the case, it is an unexpected breaking change. Maybe you should check the OAP logs when the UI complains with "Exception while fetching data (/data) : null".

lewiselau May 22, 2022
Author

@kezhenxu94 <20220522> Per our test, new data (metrics/traces/logs) can be written to Elasticsearch, and data ingested within the past week can be queried from the UI normally. However, when querying the past month, where data older than one week has already been removed by the housekeeping job, the UI shows the warning "Exception while fetching data (/data) : null".
In addition, logs such as the following are generated:

2022-05-22 13:10:57,480 org.apache.skywalking.library.elasticsearch.client.SearchClient 63 [armeria-eventloop-epoll-5-1] ERROR [] - [9.1.0-SNAPSHOT-1285f5b] Failed to search, request org.apache.skywalking.library.elasticsearch.requests.search.Search@404c6719, params null, index [sky-walking_segment-20220422, sky-walking_segment-20220423, sky-walking_segment-20220424, sky-walking_segment-20220425, sky-walking_segment-20220426, sky-walking_segment-20220427, sky-walking_segment-20220428, sky-walking_segment-20220429, sky-walking_segment-20220430, sky-walking_segment-20220501, sky-walking_segment-20220502, sky-walking_segment-20220503, sky-walking_segment-20220504, sky-walking_segment-20220505, sky-walking_segment-20220506, sky-walking_segment-20220507, sky-walking_segment-20220508, sky-walking_segment-20220509, sky-walking_segment-20220510, sky-walking_segment-20220511, sky-walking_segment-20220512, sky-walking_segment-20220513, sky-walking_segment-20220514, sky-walking_segment-20220515, sky-walking_segment-20220516, sky-walking_segment-20220517, sky-walking_segment-20220518, sky-walking_segment-20220519, sky-walking_segment-20220520, sky-walking_segment-20220521, sky-walking_segment-20220522]
java.util.concurrent.CompletionException: java.lang.RuntimeException: {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [sky-walking_segment-20220422]","resource.type":"index_or_alias","resource.id":"sky-walking_segment-20220422","index_uuid":"_na_","index":"sky-walking_segment-20220422"}],"type":"index_not_found_exception","reason":"no such index [sky-walking_segment-20220422]","resource.type":"index_or_alias","resource.id":"sky-walking_segment-20220422","index_uuid":"_na_","index":"sky-walking_segment-20220422"},"status":404}

It does not seem to impact normal functionality, but it is not friendly.

wu-sheng May 22, 2022
Collaborator

Maybe this helps, #9076

wu-sheng · 2022-05-08T07:33:25Z

wu-sheng
May 8, 2022
Collaborator

A question, what does CCU represent?

No data appears on the Virtual Database UI in our environment (Ali K8S PaaS, Ali MySQL PaaS), although it works well in other scenarios. No idea so far.

I have no idea how Ali MySQL PaaS works. The agent basically works with the client-side driver (JDBC in this case); if that is somehow changed, there will be no related data.

We hope to archive some key metrics for long-term analysis, e.g. apdex and p9x response time data. However, there is no default solution. We are validating three options: a) GraphQL with a script; b) coding a gRPC server to activate the SkyWalking exporter feature; c) ETL at the storage level into another storage. We are still working on it.

The exporter is designed for this. Choose exporting mode only.

Storage is already the key bottleneck and the main optimization point. For cost efficiency, we are willing to extend the ES flush interval for better performance; 60 seconds seems to work well instead of the 10-second default.

Decreasing the flush frequency (i.e. extending the interval) definitely works, if it is acceptable.
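As a rough illustration of that trade-off, the sketch below counts timer-driven bulk flushes per hour for the two intervals discussed. It assumes that flushes are triggered only by the interval timer (any size-based trigger is ignored) and uses the 2 OAP nodes from the platform architecture; the function name is illustrative.

```python
SECONDS_PER_HOUR = 3600
OAP_NODES = 2  # from the platform architecture in the original post


def flushes_per_hour(flush_interval_s, nodes=OAP_NODES):
    """Timer-driven bulk flushes per hour across all OAP nodes (assumption: timer only)."""
    return nodes * SECONDS_PER_HOUR // flush_interval_s


for interval in (10, 60):
    print(
        f"{interval:>2}s interval: {flushes_per_hour(interval):>3} flushes/hour, "
        f"up to {interval}s before buffered data reaches storage"
    )
```

The longer interval sends fewer and larger bulk requests to storage, while data can wait longer in the buffer before it is written, which is the "if it is acceptable" part.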

Storage is definitely the most challenging part, which is why we began the new design of BanyanDB last year: https://github.com/apache/skywalking-banyandb. From now on, we are going to work more on BanyanDB than on Elasticsearch. Elasticsearch's resource cost (IOPS or SaaS bill) per service is clearly very high, and time-series databases (InfluxDB and IoTDB) don't show a distinct performance improvement, especially when facing logs/traces, where they are hugely impacted by trace IDs (too many index candidates).
What we learned over the last 5 years leads to only one conclusion: APM/observability data is not simply time series, so we have to build a new database.

There are multiple OAP thread pools. We tried scaling out some of the threads, but threads are not the bottleneck; storage IOPS is.

The number of threads and thread pools is already large enough. Note the issue Zhenxu mentioned; the others should be fine for 9.0.0.

4 replies

lewiselau May 8, 2022
Author

Cool, really appreciated.
Our CCU scenario was added to the discussion content. CCU = concurrent users; here each one is a REST API caller driven by JMeter, without think time.

wu-sheng May 8, 2022
Collaborator

For payload testing, various things could be the issue. The typical one is that your system is close to burning out, e.g. at 80% CPU usage or a high CPU load; then a little more CPU cost or swap could trigger a butterfly effect.

lewiselau May 12, 2022
Author

@wu-sheng, Mr. Wu, please see the update below on the Virtual Database UI.
No data appears on the Virtual Database UI in our environment (Ali K8S PaaS, Ali MySQL PaaS), although it works well in other scenarios.
<20220512> This one was resolved. The root cause is that SkyWalking's class enhancement was affected by the enhancement from another APM tracer, Aliyun ARMS, which led to the enhancement failures below:
Enhance class io.grpc.netty.NettyClientStream error
Enhance class com.alibaba.druid.pool.DruidDataSource error.
Enhance class com.mysql.cj.jdbc.ConnectionImpl error. (This one is related to Virtual Database)
After disabling ARMS, the Virtual Database UI works well.

wu-sheng May 13, 2022
Collaborator

Got it, but as it is commercial, I can't see what ARMS does to the running environment. They may do some dirty work that somehow impacts the original process.
