
server: make applier use ReadTx() in Txn() instead of ConcurrentReadTx() #12896

Merged
merged 3 commits into etcd-io:master from wilsonwang371:profiling-txn2
May 6, 2021

Conversation

wilsonwang371
Contributor

This is related to the transaction logic improvement in #12692. As per Piotr's comment, it might be a better choice to use ReadTx(). ReadTx() does not copy the shared buffer, while ConcurrentReadTx() does. Therefore, we can avoid slow transaction execution in the applier while minimizing the scope of a write transaction.
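
For context, here is a minimal illustrative sketch of the trade-off, using simplified hypothetical types that stand in for etcd's mvcc backend (this is not the actual server/mvcc API):

package mvccsketch

import "sync"

// txBuffer stands in for etcd's buffered, not-yet-committed writes.
type txBuffer struct {
	entries map[string][]byte
}

func (b txBuffer) copy() txBuffer {
	dup := txBuffer{entries: make(map[string][]byte, len(b.entries))}
	for k, v := range b.entries {
		dup.entries[k] = v
	}
	return dup
}

// readTx reads either a private buffer copy or the shared buffer under a lock.
type readTx struct {
	buf txBuffer
	mu  *sync.RWMutex // nil when buf is a private copy
}

type backend struct {
	mu  sync.RWMutex
	buf txBuffer
}

// ConcurrentReadTx pays an O(buffer size) copy on every transaction, so
// that long-running reads never block writers afterwards.
func (b *backend) ConcurrentReadTx() *readTx {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return &readTx{buf: b.buf.copy()}
}

// ReadTx skips the copy entirely; readers must hold the shared lock while
// reading, which is cheap for the applier's short-lived transactions.
func (b *backend) ReadTx() *readTx {
	return &readTx{buf: b.buf, mu: &b.mu}
}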

@wilsonwang371
Contributor Author

More performance evaluation will be done on this.

@codecov-commenter

codecov-commenter commented Apr 26, 2021

Codecov Report

Merging #12896 (5615421) into master (9a3aff6) will decrease coverage by 6.41%.
The diff coverage is 92.85%.


@@            Coverage Diff             @@
##           master   #12896      +/-   ##
==========================================
- Coverage   69.60%   63.18%   -6.42%     
==========================================
  Files         411      405       -6     
  Lines       33421    33439      +18     
==========================================
- Hits        23263    21129    -2134     
- Misses       8208    10145    +1937     
- Partials     1950     2165     +215     
Flag Coverage Δ
all 63.18% <92.85%> (-6.42%) ⬇️


Impacted Files Coverage Δ
server/mvcc/kv.go 50.00% <ø> (ø)
server/mvcc/kv_view.go 60.00% <66.66%> (-20.00%) ⬇️
server/etcdserver/apply.go 88.74% <100.00%> (+0.07%) ⬆️
server/mvcc/kvstore_txn.go 71.68% <100.00%> (-0.54%) ⬇️
client/pkg/v3/testutil/recorder.go 0.00% <0.00%> (-79.67%) ⬇️
server/etcdserver/api/v3compactor/revision.go 0.00% <0.00%> (-76.93%) ⬇️
server/auth/options.go 32.43% <0.00%> (-59.46%) ⬇️
server/wal/walpb/record.go 42.85% <0.00%> (-57.15%) ⬇️
server/proxy/tcpproxy/userspace.go 0.00% <0.00%> (-53.85%) ⬇️
server/wal/repair.go 0.00% <0.00%> (-52.84%) ⬇️
... and 158 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9a3aff6...5615421.

@wilsonwang371
Contributor Author

wilsonwang371 commented Apr 26, 2021

This is the result compared with the master branch.

[image: benchmark results vs. master]

txn-readtx.csv

@wilsonwang371 wilsonwang371 changed the title server: applier uses ReadTx instead of ConcurrentTx server: make applier use ReadTx() in Txn() instead of ConcurrentReadTx() Apr 26, 2021
Inline review comment on server/etcdserver/apply.go (outdated, resolved)
@ptabor
Contributor

ptabor commented Apr 26, 2021

Really good numbers. @jingyih, you are an expert on concurrentTx, could you please take a look?

@wilsonwang371
Contributor Author

I am doing another mixed read/write workload performance evaluation. Will update with the data after around 48 hours.

@wilsonwang371
Contributor Author

@ptabor The performance looks much better now compared with #12692. I even feel this is a little too good to be true... Hence we need to review this carefully.

[images: mixed read/write benchmark heat maps]

@wilsonwang371
Contributor Author

I just realized that there is an issue with my code: for read-only Txn requests we are still using ReadTx(), when in fact we should be using ConcurrentReadTx() there.

Updated the patch, doing another round of performance evaluation.

Inline review comment on server/etcdserver/apply.go (outdated, resolved)
@gyuho
Contributor

gyuho commented Apr 29, 2021

> Updated the patch, doing another round of performance evaluation.

@wilsonwang371 What does "performance" in the heat graph mean? Is that TPS? Can you share more details about the provisioning setup? Single-node, or multi-node?

@wilsonwang371
Contributor Author

> Updated the patch, doing another round of performance evaluation.

> @wilsonwang371 What does "performance" in the heat graph mean? Is that TPS? Can you share more details about the provisioning setup? Single-node, or multi-node?

Hi Gyuho,

To avoid networking interference, I am always using a single node for this patch's performance evaluation. The yellow-green heat maps show the txn benchmark throughput, in queries per second (QPS), for either the master branch or my patch. The red-blue heat maps show the QPS difference between the master branch and my patch.

I shared my Python script for plotting the data, along with the test data details, in #12692. You can get more information from there.

Let me know if you have other questions.

@gyuho
Contributor

gyuho commented Apr 29, 2021

@wilsonwang371 That's awesome!

> Updated the patch, doing another round of performance evaluation.

Have we run another round of performance tests yet? Once we confirm, we can merge.

@wilsonwang371
Contributor Author

> @wilsonwang371 That's awesome!
>
> > Updated the patch, doing another round of performance evaluation.
>
> Have we run another round of performance tests yet? Once we confirm, we can merge.

It will take another 24 hours to get the data generated. I will post it once available.

@gyuho
Contributor

gyuho commented Apr 29, 2021

@wilsonwang371 Very cool. Can you share more details on the machine spec and how you generated the client-side workloads? Are we just using the benchmark tool?

@wilsonwang371
Contributor Author

[images: mixed-workload benchmark heat maps]

@ptabor @gyuho

Guys, here is the latest evaluation of my patch. There is still a worst-case 7% performance penalty at an R/W ratio of 100/1, but compared with #12692 it is better.

Do we still want a switch for this feature?

@wilsonwang371
Contributor Author

> @wilsonwang371 Very cool. Can you share more details on the machine spec and how you generated the client-side workloads? Are we just using the benchmark tool?

The machine setup is:

CPU: Intel(R) Xeon(R) Gold 5218 @ 2.30GHz, 2 sockets, 64 vCPUs
RAM: 256 GB
Storage: 2x 2 TB NVMe SSD
NIC: Mellanox CX-5

The benchmark tool is etcd's default benchmark tool. I just made some small modifications so that it can send both read and write txn requests.

The client and server are running on the same machine.

Inline review comment on server/etcdserver/apply.go (outdated, resolved)
@gyuho
Contributor

gyuho commented May 3, 2021

[screenshot: Screen Shot 2021-05-03 at 10 24 03 AM]

If I understand this correctly, we get up to 2x throughput due to increased concurrency for writes, because we make reads fully concurrent?

> worst-case 7% performance penalty at an R/W ratio of 100/1

I think it's a safe trade-off to make, given that we already did a round of optimizations for kube-apiserver reads (list and watch) in 3.4? Or it could be noise. I think the increased concurrency in writes should be good enough to compensate for this.

@gyuho
Contributor

gyuho commented May 3, 2021

@wilsonwang371 Could you share the exact benchmark tool commands and flags for anyone who might be interested in reproducing this?

@wilsonwang371
Contributor Author

> [screenshot: Screen Shot 2021-05-03 at 10 24 03 AM]
>
> If I understand this correctly, we get up to 2x throughput due to increased concurrency for writes, because we make reads fully concurrent?
>
> > worst-case 7% performance penalty at an R/W ratio of 100/1
>
> I think it's a safe trade-off to make, given that we already did a round of optimizations for kube-apiserver reads (list and watch) in 3.4? Or it could be noise. I think the increased concurrency in writes should be good enough to compensate for this.

In this graph, ReadTxn/WriteTxn = 0.125 = 1/8. That means we are doing 8 txn writes for every 1 txn read. The gain comes from avoiding the shared buffer copy, which speeds up txn writes.

I am using the etcd benchmark tool for this, but I made some small changes to it so that it can do both txn-range and txn-put at the same time.

Here is the new file I added to the benchmark.

// Copyright 2017 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package cmd

import (
	"context"
	"encoding/binary"
	"fmt"
	"math"
	"math/rand"
	"os"
	"time"

	v3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/pkg/v3/report"

	"github.com/spf13/cobra"
	"golang.org/x/time/rate"
	"gopkg.in/cheggaaa/pb.v1"
)

// mixedTxnCmd represents the mixed-txn command
var mixedTxnCmd = &cobra.Command{
	Use:   "mixed-txn key [end-range]",
	Short: "Benchmark a mixed load of txn-put & txn-range",

	Run: mixedTxnFunc,
}

var (
	mixedTxnTotal          int
	mixedTxnRate           int
	mixedTxnOpsPerTxn      int
	mixedTxnReadWriteRatio float64
	writeCount             uint64
	readCount              uint64
)

func init() {
	RootCmd.AddCommand(mixedTxnCmd)
	mixedTxnCmd.Flags().IntVar(&keySize, "key-size", 8, "Key size of mixed txn")
	mixedTxnCmd.Flags().IntVar(&valSize, "val-size", 8, "Value size of mixed txn")
	mixedTxnCmd.Flags().IntVar(&mixedTxnOpsPerTxn, "txn-ops", 1, "Number of ops per txn")
	mixedTxnCmd.Flags().IntVar(&mixedTxnRate, "rate", 0, "Maximum txns per second (0 is no limit)")

	mixedTxnCmd.Flags().IntVar(&mixedTxnTotal, "total", 10000, "Total number of txn requests")
	mixedTxnCmd.Flags().IntVar(&keySpaceSize, "key-space-size", 1, "Maximum possible keys")
	mixedTxnCmd.Flags().StringVar(&rangeConsistency, "consistency", "l", "Linearizable(l) or Serializable(s)")

	mixedTxnCmd.Flags().Float64Var(&mixedTxnReadWriteRatio, "rw-ratio", 1, "Read/write ops ratio")
}

func mixedTxnFunc(cmd *cobra.Command, args []string) {
	if keySpaceSize <= 0 {
		fmt.Fprintf(os.Stderr, "expected positive --key-space-size, got (%v)", keySpaceSize)
		os.Exit(1)
	}

	if mixedTxnOpsPerTxn > keySpaceSize {
		fmt.Fprintf(os.Stderr, "expected --txn-ops no larger than --key-space-size, "+
			"got txn-ops(%v) key-space-size(%v)\n", txnPutOpsPerTxn, keySpaceSize)
		os.Exit(1)
	}

	if rangeConsistency == "l" {
		fmt.Println("bench with linearizable range")
	} else if rangeConsistency == "s" {
		fmt.Println("bench with serializable range")
	} else {
		fmt.Fprintln(os.Stderr, cmd.Usage())
		os.Exit(1)
	}

	requests := make(chan []v3.Op, totalClients)
	if mixedTxnRate == 0 {
		mixedTxnRate = math.MaxInt32
	}
	limit := rate.NewLimiter(rate.Limit(mixedTxnRate), 1)
	clients := mustCreateClients(totalClients, totalConns)
	k, v := make([]byte, keySize), string(mustRandBytes(valSize))

	bar = pb.New(mixedTxnTotal)
	bar.Format("Bom !")
	bar.Start()

	r := newReport()
	for i := range clients {
		wg.Add(1)
		go func(c *v3.Client) {
			defer wg.Done()
			for ops := range requests {
				limit.Wait(context.Background())
				st := time.Now()
				_, err := c.Txn(context.TODO()).Then(ops...).Commit()
				r.Results() <- report.Result{Err: err, Start: st, End: time.Now()}
				bar.Increment()
			}
		}(clients[i])
	}

	go func() {
		for i := 0; i < mixedTxnTotal; i++ {
			ops := make([]v3.Op, mixedTxnOpsPerTxn)
			for j := 0; j < mixedTxnOpsPerTxn; j++ {
				op := v3.Op{}
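				// With rw-ratio r (reads per write), pick a read with
				// probability r/(1+r) so reads:writes converges to r:1.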
				if rand.Float64() < mixedTxnReadWriteRatio/(1+mixedTxnReadWriteRatio) {
					opts := []v3.OpOption{v3.WithRange("")}
					if rangeConsistency == "s" {
						opts = append(opts, v3.WithSerializable(), v3.WithPrevKV(), v3.WithPrefix())
					}
					op = v3.OpGet("rangeKey", opts...)
					readCount++
				} else {
					binary.PutVarint(k, int64(((i*mixedTxnOpsPerTxn)+j)%keySpaceSize))
					op = v3.OpPut(string(k), v)
					writeCount++
				}
				ops[j] = op
			}
			requests <- ops
		}
		close(requests)
	}()

	rc := r.Run()
	wg.Wait()
	close(r.Results())
	bar.Finish()
	fmt.Printf("READ: %d, WRITE: %d\n", readCount, writeCount)
	fmt.Println(<-rc)
}
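
For example, an invocation might look like this (endpoint and numbers are just placeholders; the full sweep script I used is below):

./benchmark mixed-txn "" --conns=256 --clients=256 --total=262144 \
  --rw-ratio 0.125 --endpoints "http://127.0.0.1:23790"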

@gyuho
Contributor

gyuho commented May 3, 2021

> The gain comes from avoiding the shared buffer copy, which speeds up txn writes.

Yes, you are right. Now I see how we create different types of transactions for write txns.
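
A hedged sketch of the dispatch being discussed (a hypothetical helper, not the merged apply.go diff, and the mvcc signatures have varied across etcd versions):

package applytxnsketch

import (
	"go.etcd.io/etcd/pkg/v3/traceutil"
	"go.etcd.io/etcd/server/v3/mvcc"
)

// newApplyTxn picks the cheaper shared-buffer write tx when the txn
// contains writes, and keeps the fully concurrent (copied-buffer) read
// tx for read-only txns.
func newApplyTxn(kv mvcc.KV, isWrite bool) mvcc.TxnWrite {
	if isWrite {
		return kv.Write(traceutil.TODO()) // shared buffer, no copy
	}
	return mvcc.NewReadOnlyTxnWrite(kv.Read(traceutil.TODO())) // concurrent read tx
}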

@wilsonwang371
Contributor Author

Here is the test script that I am using.

#!/bin/bash

set -xe;

etcd_path="/home/wilson.wang/etcd-master";
run_path="/data00/etcd-tests";
run_count=$((1024*256));
# standard k8s setting
limit_count=$((1024*256));
keys_count=$((1024*256));
backend_size=$((20*1024*1024*1024));

pushd ${run_path};

trap ctrl_c INT

function ctrl_c() {
    for i in $(ps aux | grep -v grep | grep etcd | awk "{print \$2}");
    do
      kill -9 $i;
    done;
    popd;
    exit 0;
}

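# Sweep R/W ratio, value size (2^8..2^14 bytes), and conns/clients
# (2^5..2^12); run the benchmark 5 times per configuration and print
# one CSV line per configuration.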
for ratio_str in 1/8 1/4 1/2 2/1 4/1 8/1 100/1;
do
  ratio=$(echo "scale=3; ${ratio_str}" | bc -l);
  for vallen in $(seq 4 10);
  do
    for i in $(seq 5 12);
    do
      rm -rf default.etcd/;
      ${etcd_path}/bin/etcd --quota-backend-bytes=${backend_size} \
        --listen-client-urls http://0.0.0.0:23790 \
        --advertise-client-urls http://127.0.0.1:23790 \
        --log-level 'error' &
      pid=$!;
      sleep 6;
      conn=$((2**$i));
      valsize=$((2**(4 + $vallen)));
      ${etcd_path}/tools/benchmark/benchmark put --sequential-keys \
        --key-space-size=${keys_count} \
        --val-size=${valsize} --key-size=256 \
        --endpoints "http://127.0.0.1:23790" \
        --total=${keys_count} &>/dev/null ;
      sleep 6;
      line="${ratio}, ${conn}, ${valsize}";
      for j in $(seq 0 4);
      do
        tmp=$(${etcd_path}/tools/benchmark/benchmark mixed-txn "" \
          --conns=${conn} --clients=${conn} \
          --total=${run_count} \
          --endpoints "http://127.0.0.1:23790" \
          --rw-ratio ${ratio} \
          2>/dev/null | grep Requests | awk "{print \$2}");

        line="$line, ${tmp}";
        sleep 20;
      done;
      echo ${line};
      kill -9 ${pid};
      sleep 5;
    done;
  done;
done;

popd;

@gyuho
Contributor

gyuho commented May 3, 2021

@ptabor @jingyih I am inclined toward this change.

> But from what I've seen K8s clusters have the Range/Txn ratio of ~4, so for k8s workloads it's positive.

#12692 (comment)

Based on @wilsonwang371's tests, there's no noticeable penalty for this type of workload.

Actually, Kubernetes etcd transactions are 100% read-write transactions (compact + guaranteed-update calls) -- isTxnReadonly will always return false for kube-apiserver etcd transactions. So this actually benefits all Kubernetes mutable requests.

Please comment if you have any concerns about making this change the default mode.
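
For reference, a simplified sketch of the read-only check mentioned above (the real isTxnReadonly in server/etcdserver/apply.go walks a *pb.TxnRequest, including its success and failure branches; this condensed version only conveys the idea):

package applysketch

import pb "go.etcd.io/etcd/api/v3/etcdserverpb"

// A txn counts as read-only only when every op is a Range. Any Put,
// DeleteRange, or nested txn (as in kube-apiserver's compare-and-swap
// transactions) forces the write path that this PR optimizes.
func isTxnReadonly(ops []*pb.RequestOp) bool {
	for _, op := range ops {
		if _, ok := op.Request.(*pb.RequestOp_RequestRange); !ok {
			return false
		}
	}
	return true
}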

@wilsonwang371
Contributor Author

> @ptabor @jingyih I am inclined toward this change.
>
> > But from what I've seen K8s clusters have the Range/Txn ratio of ~4, so for k8s workloads it's positive.
>
> #12692 (comment)
>
> Based on @wilsonwang371's tests, there's no noticeable penalty for this type of workload.
>
> Actually, Kubernetes etcd transactions are 100% read-write transactions (compact + guaranteed-update calls) -- isTxnReadonly will always return false for kube-apiserver etcd transactions. So this actually benefits all Kubernetes mutable requests.
>
> Please comment if you have any concerns about making this change the default mode.

The reason I posted this new patch is that I want to separate the txn optimization concern here from the one proposed in #12529.

In this case, if we make further optimizations to the shared buffer, we won't introduce new conflicting logic, and it will be easier to further improve etcd performance.

@wilsonwang371
Contributor Author

Ideally, after a patch similar to the one proposed in #12529, we should see performance graphs similar to the ones in #12896 (comment).

Contributor

@gyuho gyuho left a comment

@wilsonwang371 To move forward, let's make this configurable.

It would be great if we can validate this with Kubernetes scale tests.

From the etcd 3.4 release blog post, where we set concurrent read tx as the default:

> We further made backend read transactions fully concurrent. Previously, ongoing long-running read transactions block writes and upcoming reads. With this change, write throughput is increased by 70% and P99 write latency is reduced by 90% in the presence of long-running reads. We also ran Kubernetes 5000-node scalability test on GCE with this change and observed similar improvements. For example, in the very beginning of the test where there are a lot of long-running “LIST pods”, the P99 latency of “POST clusterrolebindings” is reduced by 97.4%.

How about we add a flag

etcd --txn-mode

where we configure:

  • read-with-copied-buffer (default value: use concurrent read tx no matter what)
  • write-with-shared-buffer (optional: if the txn includes a write, fall back to the shared buffer)

That way, we can make incremental changes for #12692 with etcd --txn-mode=only-long-read-with-copied-buffer, so as not to create a concurrent tx for short writes.

@ptabor
Contributor

ptabor commented May 4, 2021

I think these results are awesome. I'm not insisting on a flag any longer (as in the long run it adds complexity),
but maybe for the stabilization period / experimenting it's worth having one ('experimental'?).

@jkaniuk FYI: about performance impact

@gyuho
Contributor

gyuho commented May 4, 2021

> stabilization period / experimenting it's worth having one ('experimental'?).

Sounds good.

@wilsonwang371 Can we make this a flag (e.g., etcd --experimental-txn-mode-write-with-shared-buffer)?

@wilsonwang371
Contributor Author

> I think these results are awesome. I'm not insisting on a flag any longer (as in the long run it adds complexity), but maybe for the stabilization period / experimenting it's worth having one ('experimental'?).
>
> @jkaniuk FYI: about performance impact

> > stabilization period / experimenting it's worth having one ('experimental'?).
>
> Sounds good.
>
> @wilsonwang371 Can we make this a flag (e.g., etcd --experimental-txn-mode-write-with-shared-buffer)?

Done. By default, using the shared buffer is on.
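
For anyone skimming, a hedged sketch of what the flag wiring amounts to (the flag name comes from the discussion above; the merged code's plumbing through etcd's config differs in detail):

package main

import (
	"flag"
	"fmt"
)

// Default true keeps the new behavior: write txns read from the shared
// buffer instead of a private copy; setting it to false restores
// copied-buffer reads as an escape hatch.
func main() {
	writeWithSharedBuffer := flag.Bool(
		"experimental-txn-mode-write-with-shared-buffer", true,
		"enable write transactions to use the shared buffer for their reads")
	flag.Parse()
	fmt.Println("shared-buffer write txns enabled:", *writeWithSharedBuffer)
}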

Contributor

@gyuho gyuho left a comment

lgtm thx!

@gyuho gyuho merged commit 344c9f3 into etcd-io:master May 6, 2021
@jkaniuk

jkaniuk commented May 6, 2021

Thanks @ptabor for the heads up.

> But from what I've seen K8s clusters have the Range/Txn ratio of ~4, so for k8s workloads it's positive.
>
> Based on @wilsonwang371's tests, there's no noticeable penalty for this type of workload.

As I understand from those graphs, there should be a 1.05-1.25x improvement.

Great results!

@wilsonwang371 wilsonwang371 deleted the profiling-txn2 branch May 6, 2021 21:43
@WIZARD-CXY
Contributor

nice
