
server: make applier use ReadTx() in Txn() instead of ConcurrentReadTx() #12896

Merged
merged 3 commits into etcd-io:master from wilsonwang371:profiling-txn2
May 6, 2021

Conversation

wilsonwang371
Contributor

This is related to the transaction logic improvement in #12692. As per Piotr's comment, it might be a better choice to use ReadTx(). ReadTx() does not copy the shared buffer, while ConcurrentReadTx() does. Therefore, we can avoid slow transaction execution in the applier while minimizing the scope of a write transaction.
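
For context, here is a minimal illustrative sketch of the trade-off, using simplified hypothetical types that stand in for etcd's mvcc backend (this is not the actual server/mvcc API):

package mvccsketch

import "sync"

// txBuffer stands in for etcd's buffered, not-yet-committed writes.
type txBuffer struct {
	entries map[string][]byte
}

func (b txBuffer) copy() txBuffer {
	dup := txBuffer{entries: make(map[string][]byte, len(b.entries))}
	for k, v := range b.entries {
		dup.entries[k] = v
	}
	return dup
}

// readTx reads either a private buffer copy or the shared buffer under a lock.
type readTx struct {
	buf txBuffer
	mu  *sync.RWMutex // nil when buf is a private copy
}

type backend struct {
	mu  sync.RWMutex
	buf txBuffer
}

// ConcurrentReadTx pays an O(buffer size) copy on every transaction, so
// that long-running reads never block writers afterwards.
func (b *backend) ConcurrentReadTx() *readTx {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return &readTx{buf: b.buf.copy()}
}

// ReadTx skips the copy entirely; readers must hold the shared lock while
// reading, which is cheap for the applier's short-lived transactions.
func (b *backend) ReadTx() *readTx {
	return &readTx{buf: b.buf, mu: &b.mu}
}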

@wilsonwang371
Contributor Author

More performance evaluation will be done on this.

@codecov-commenter

codecov-commenter commented Apr 26, 2021

Codecov Report

Merging #12896 (5615421) into master (9a3aff6) will decrease coverage by 6.41%.
The diff coverage is 92.85%.


@@            Coverage Diff             @@
##           master   #12896      +/-   ##
==========================================
- Coverage   69.60%   63.18%   -6.42%     
==========================================
  Files         411      405       -6     
  Lines       33421    33439      +18     
==========================================
- Hits        23263    21129    -2134     
- Misses       8208    10145    +1937     
- Partials     1950     2165     +215     
Flag Coverage Δ
all 63.18% <92.85%> (-6.42%) ⬇️


Impacted Files Coverage Δ
server/mvcc/kv.go 50.00% <ø> (ø)
server/mvcc/kv_view.go 60.00% <66.66%> (-20.00%) ⬇️
server/etcdserver/apply.go 88.74% <100.00%> (+0.07%) ⬆️
server/mvcc/kvstore_txn.go 71.68% <100.00%> (-0.54%) ⬇️
client/pkg/v3/testutil/recorder.go 0.00% <0.00%> (-79.67%) ⬇️
server/etcdserver/api/v3compactor/revision.go 0.00% <0.00%> (-76.93%) ⬇️
server/auth/options.go 32.43% <0.00%> (-59.46%) ⬇️
server/wal/walpb/record.go 42.85% <0.00%> (-57.15%) ⬇️
server/proxy/tcpproxy/userspace.go 0.00% <0.00%> (-53.85%) ⬇️
server/wal/repair.go 0.00% <0.00%> (-52.84%) ⬇️
... and 158 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9a3aff6...5615421.

@wilsonwang371
Contributor Author

wilsonwang371 commented Apr 26, 2021

This is the result compared with the master branch.

[image: benchmark results vs. master]

txn-readtx.csv

@wilsonwang371 wilsonwang371 changed the title server: applier uses ReadTx instead of ConcurrentTx server: make applier use ReadTx() in Txn() instead of ConcurrentReadTx() Apr 26, 2021
Inline review comment on server/etcdserver/apply.go (outdated, resolved)
@ptabor
Contributor

ptabor commented Apr 26, 2021

Really good numbers. @jingyih, you are an expert on concurrentTx, could you please take a look?

@wilsonwang371
Contributor Author

I am doing another mixed read/write workload performance evaluation. Will update with the data after around 48 hours.

@wilsonwang371
Contributor Author

@ptabor The performance looks much better now compared with #12692. I even feel this is a little too good to be true... Hence we need to review this carefully.

[images: mixed read/write benchmark heat maps]

@wilsonwang371
Contributor Author

I just realized that there is an issue with my code: for read-only Txn requests we are still using ReadTx(), when in fact we should be using ConcurrentReadTx() there.

Updated the patch, doing another round of performance evaluation.

Inline review comment on server/etcdserver/apply.go (outdated, resolved)
@gyuho
Contributor

gyuho commented Apr 29, 2021

> Updated the patch, doing another round of performance evaluation.

@wilsonwang371 What does "performance" in the heat graph mean? Is that TPS? Can you share more details about the provisioning setup? Single-node, or multi-node?

@wilsonwang371
Contributor Author

> Updated the patch, doing another round of performance evaluation.

> @wilsonwang371 What does "performance" in the heat graph mean? Is that TPS? Can you share more details about the provisioning setup? Single-node, or multi-node?

Hi Gyuho,

To avoid networking interference, I am always using a single node for this patch's performance evaluation. The yellow-green heat maps show the txn benchmark throughput, in queries per second (QPS), for either the master branch or my patch. The red-blue heat maps show the QPS difference between the master branch and my patch.

I shared my Python script for plotting the data, along with the test data details, in #12692. You can get more information from there.

Let me know if you have other questions.

@gyuho
Contributor

gyuho commented Apr 29, 2021

@wilsonwang371 That's awesome!

> Updated the patch, doing another round of performance evaluation.

Have we run another round of performance tests yet? Once we confirm, we can merge.

@wilsonwang371
Contributor Author

> @wilsonwang371 That's awesome!
>
> > Updated the patch, doing another round of performance evaluation.
>
> Have we run another round of performance tests yet? Once we confirm, we can merge.

It will take another 24 hours to get the data generated. I will post it once available.

@gyuho
Contributor

gyuho commented Apr 29, 2021

@wilsonwang371 Very cool. Can you share more details on the machine spec and how you generated the client-side workloads? Are we just using the benchmark tool?

@wilsonwang371
Contributor Author

[images: mixed-workload benchmark heat maps]

@ptabor @gyuho

Guys, here is the latest evaluation of my patch. There is still a worst-case 7% performance penalty at an R/W ratio of 100/1, but compared with #12692 it is better.

Do we still want a switch for this feature?

@wilsonwang371
Contributor Author

> @wilsonwang371 Very cool. Can you share more details on the machine spec and how you generated the client-side workloads? Are we just using the benchmark tool?

The machine setup is:

CPU: Intel(R) Xeon(R) Gold 5218 @ 2.30GHz, 2 sockets, 64 vCPUs
RAM: 256 GB
Storage: 2x 2 TB NVMe SSD
NIC: Mellanox CX-5

The benchmark tool is etcd's default benchmark tool. I just made some small modifications so that it can send both read and write txn requests.

The client and server are running on the same machine.

Inline review comment on server/etcdserver/apply.go (outdated, resolved)
@gyuho
Contributor

gyuho commented May 3, 2021

[screenshot: Screen Shot 2021-05-03 at 10 24 03 AM]

If I understand this correctly, we get up to 2x throughput due to increased concurrency for writes, because we make reads fully concurrent?

> worst-case 7% performance penalty at an R/W ratio of 100/1

I think it's a safe trade-off to make, given that we already did a round of optimizations for kube-apiserver reads (list and watch) in 3.4? Or it could be noise. I think the increased concurrency in writes should be good enough to compensate for this.

@gyuho
Contributor

gyuho commented May 3, 2021

@wilsonwang371 Could you share the exact benchmark tool commands and flags for anyone who might be interested in reproducing this?

@wilsonwang371
Contributor Author

> [screenshot: Screen Shot 2021-05-03 at 10 24 03 AM]
>
> If I understand this correctly, we get up to 2x throughput due to increased concurrency for writes, because we make reads fully concurrent?
>
> > worst-case 7% performance penalty at an R/W ratio of 100/1
>
> I think it's a safe trade-off to make, given that we already did a round of optimizations for kube-apiserver reads (list and watch) in 3.4? Or it could be noise. I think the increased concurrency in writes should be good enough to compensate for this.

In this graph, ReadTxn/WriteTxn = 0.125 = 1/8. That means we are doing 8 txn writes for every 1 txn read. The gain comes from avoiding the shared buffer copy, which speeds up txn writes.

I am using the etcd benchmark tool for this, but I made some small changes to it so that it can do both txn-range and txn-put at the same time.

Here is the new file I added to the benchmark.

// Copyright 2017 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package cmd

import (
	"context"
	"encoding/binary"
	"fmt"
	"math"
	"math/rand"
	"os"
	"time"

	v3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/pkg/v3/report"

	"github.com/spf13/cobra"
	"golang.org/x/time/rate"
	"gopkg.in/cheggaaa/pb.v1"
)

// mixedTxnCmd represents the mixed-txn command
var mixedTxnCmd = &cobra.Command{
	Use:   "mixed-txn key [end-range]",
	Short: "Benchmark a mixed load of txn-put & txn-range",

	Run: mixedTxnFunc,
}

var (
	mixedTxnTotal          int
	mixedTxnRate           int
	mixedTxnOpsPerTxn      int
	mixedTxnReadWriteRatio float64
	writeCount             uint64
	readCount              uint64
)

func init() {
	RootCmd.AddCommand(mixedTxnCmd)
	mixedTxnCmd.Flags().IntVar(&keySize, "key-size", 8, "Key size of mixed txn")
	mixedTxnCmd.Flags().IntVar(&valSize, "val-size", 8, "Value size of mixed txn")
	mixedTxnCmd.Flags().IntVar(&mixedTxnOpsPerTxn, "txn-ops", 1, "Number of ops per txn")
	mixedTxnCmd.Flags().IntVar(&mixedTxnRate, "rate", 0, "Maximum txns per second (0 is no limit)")

	mixedTxnCmd.Flags().IntVar(&mixedTxnTotal, "total", 10000, "Total number of txn requests")
	mixedTxnCmd.Flags().IntVar(&keySpaceSize, "key-space-size", 1, "Maximum possible keys")
	mixedTxnCmd.Flags().StringVar(&rangeConsistency, "consistency", "l", "Linearizable(l) or Serializable(s)")

	mixedTxnCmd.Flags().Float64Var(&mixedTxnReadWriteRatio, "rw-ratio", 1, "Read/write ops ratio")
}

func mixedTxnFunc(cmd *cobra.Command, args []string) {
	if keySpaceSize <= 0 {
		fmt.Fprintf(os.Stderr, "expected positive --key-space-size, got (%v)", keySpaceSize)
		os.Exit(1)
	}

	if mixedTxnOpsPerTxn > keySpaceSize {
		fmt.Fprintf(os.Stderr, "expected --txn-ops no larger than --key-space-size, "+
			"got txn-ops(%v) key-space-size(%v)\n", txnPutOpsPerTxn, keySpaceSize)
		os.Exit(1)
	}

	if rangeConsistency == "l" {
		fmt.Println("bench with linearizable range")
	} else if rangeConsistency == "s" {
		fmt.Println("bench with serializable range")
	} else {
		fmt.Fprintln(os.Stderr, cmd.Usage())
		os.Exit(1)
	}

	requests := make(chan []v3.Op, totalClients)
	if mixedTxnRate == 0 {
		mixedTxnRate = math.MaxInt32
	}
	limit := rate.NewLimiter(rate.Limit(mixedTxnRate), 1)
	clients := mustCreateClients(totalClients, totalConns)
	k, v := make([]byte, keySize), string(mustRandBytes(valSize))

	bar = pb.New(mixedTxnTotal)
	bar.Format("Bom !")
	bar.Start()

	r := newReport()
	for i := range clients {
		wg.Add(1)
		go func(c *v3.Client) {
			defer wg.Done()
			for ops := range requests {
				limit.Wait(context.Background())
				st := time.Now()
				_, err := c.Txn(context.TODO()).Then(ops...).Commit()
				r.Results() <- report.Result{Err: err, Start: st, End: time.Now()}
				bar.Increment()
			}
		}(clients[i])
	}

	go func() {
		for i := 0; i < mixedTxnTotal; i++ {
			ops := make([]v3.Op, mixedTxnOpsPerTxn)
			for j := 0; j < mixedTxnOpsPerTxn; j++ {
				op := v3.Op{}
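				// With rw-ratio r (reads per write), pick a read with
				// probability r/(1+r) so reads:writes converges to r:1.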
				if rand.Float64() < mixedTxnReadWriteRatio/(1+mixedTxnReadWriteRatio) {
					opts := []v3.OpOption{v3.WithRange("")}
					if rangeConsistency == "s" {
						opts = append(opts, v3.WithSerializable(), v3.WithPrevKV(), v3.WithPrefix())
					}
					op = v3.OpGet("rangeKey", opts...)
					readCount++
				} else {
					binary.PutVarint(k, int64(((i*mixedTxnOpsPerTxn)+j)%keySpaceSize))
					op = v3.OpPut(string(k), v)
					writeCount++
				}
				ops[j] = op
			}
			requests <- ops
		}
		close(requests)
	}()

	rc := r.Run()
	wg.Wait()
	close(r.Results())
	bar.Finish()
	fmt.Printf("READ: %d, WRITE: %d\n", readCount, writeCount)
	fmt.Println(<-rc)
}
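
For example, an invocation might look like this (endpoint and numbers are just placeholders; the full sweep script I used is below):

./benchmark mixed-txn "" --conns=256 --clients=256 --total=262144 \
  --rw-ratio 0.125 --endpoints "http://127.0.0.1:23790"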

@gyuho
Contributor

gyuho commented May 3, 2021

> The gain comes from avoiding the shared buffer copy, which speeds up txn writes.

Yes, you are right. Now I see how we create different types of transactions for write txns.
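
A hedged sketch of the dispatch being discussed (a hypothetical helper, not the merged apply.go diff, and the mvcc signatures have varied across etcd versions):

package applytxnsketch

import (
	"go.etcd.io/etcd/pkg/v3/traceutil"
	"go.etcd.io/etcd/server/v3/mvcc"
)

// newApplyTxn picks the cheaper shared-buffer write tx when the txn
// contains writes, and keeps the fully concurrent (copied-buffer) read
// tx for read-only txns.
func newApplyTxn(kv mvcc.KV, isWrite bool) mvcc.TxnWrite {
	if isWrite {
		return kv.Write(traceutil.TODO()) // shared buffer, no copy
	}
	return mvcc.NewReadOnlyTxnWrite(kv.Read(traceutil.TODO())) // concurrent read tx
}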

@wilsonwang371
Contributor Author

Here is the test script that I am using.

#!/bin/bash

set -xe;

etcd_path="/home/wilson.wang/etcd-master";
run_path="/data00/etcd-tests";
run_count=$((1024*256));
# standard k8s setting
limit_count=$((1024*256));
keys_count=$((1024*256));
backend_size=$((20*1024*1024*1024));

pushd ${run_path};

trap ctrl_c INT

function ctrl_c() {
    for i in $(ps aux | grep -v grep | grep etcd | awk "{print \$2}");
    do
      kill -9 $i;
    done;
    popd;
    exit 0;
}

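# Sweep R/W ratio, value size (2^8..2^14 bytes), and conns/clients
# (2^5..2^12); run the benchmark 5 times per configuration and print
# one CSV line per configuration.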
for ratio_str in 1/8 1/4 1/2 2/1 4/1 8/1 100/1;
do
  ratio=$(echo "scale=3; ${ratio_str}" | bc -l);
  for vallen in $(seq 4 10);
  do
    for i in $(seq 5 12);
    do
      rm -rf default.etcd/;
      ${etcd_path}/bin/etcd --quota-backend-bytes=${backend_size} \
        --listen-client-urls http://0.0.0.0:23790 \
        --advertise-client-urls http://127.0.0.1:23790 \
        --log-level 'error' &
      pid=$!;
      sleep 6;
      conn=$((2**$i));
      valsize=$((2**(4 + $vallen)));
      ${etcd_path}/tools/benchmark/benchmark put --sequential-keys \
        --key-space-size=${keys_count} \
        --val-size=${valsize} --key-size=256 \
        --endpoints "http://127.0.0.1:23790" \
        --total=${keys_count} &>/dev/null ;
      sleep 6;
      line="${ratio}, ${conn}, ${valsize}";
      for j in $(seq 0 4);
      do
        tmp=$(${etcd_path}/tools/benchmark/benchmark mixed-txn "" \
          --conns=${conn} --clients=${conn} \
          --total=${run_count} \
          --endpoints "http://127.0.0.1:23790" \
          --rw-ratio ${ratio} \
          2>/dev/null | grep Requests | awk "{print \$2}");

        line="$line, ${tmp}";
        sleep 20;
      done;
      echo ${line};
      kill -9 ${pid};
      sleep 5;
    done;
  done;
done;

popd;

@gyuho
Contributor

gyuho commented May 3, 2021

@ptabor @jingyih I am inclined toward this change.

> But from what I've seen K8s clusters have the Range/Txn ratio of ~4, so for k8s workloads it's positive.

#12692 (comment)

Based on @wilsonwang371's tests, there's no noticeable penalty for this type of workload.

Actually, Kubernetes etcd transactions are 100% read-write transactions (compact + guaranteed-update calls) -- isTxnReadonly will always return false for kube-apiserver etcd transactions. So this actually benefits all Kubernetes mutable requests.

Please comment if you have any concerns about making this change the default mode.
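
For reference, a simplified sketch of the read-only check mentioned above (the real isTxnReadonly in server/etcdserver/apply.go walks a *pb.TxnRequest, including its success and failure branches; this condensed version only conveys the idea):

package applysketch

import pb "go.etcd.io/etcd/api/v3/etcdserverpb"

// A txn counts as read-only only when every op is a Range. Any Put,
// DeleteRange, or nested txn (as in kube-apiserver's compare-and-swap
// transactions) forces the write path that this PR optimizes.
func isTxnReadonly(ops []*pb.RequestOp) bool {
	for _, op := range ops {
		if _, ok := op.Request.(*pb.RequestOp_RequestRange); !ok {
			return false
		}
	}
	return true
}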

@wilsonwang371
Contributor Author

> @ptabor @jingyih I am inclined toward this change.
>
> > But from what I've seen K8s clusters have the Range/Txn ratio of ~4, so for k8s workloads it's positive.
>
> #12692 (comment)
>
> Based on @wilsonwang371's tests, there's no noticeable penalty for this type of workload.
>
> Actually, Kubernetes etcd transactions are 100% read-write transactions (compact + guaranteed-update calls) -- isTxnReadonly will always return false for kube-apiserver etcd transactions. So this actually benefits all Kubernetes mutable requests.
>
> Please comment if you have any concerns about making this change the default mode.

The reason I posted this new patch is that I want to separate the txn optimization concern here from the one proposed in #12529.

In this case, if we make further optimizations to the shared buffer, we won't introduce new conflicting logic, and it will be easier to further improve etcd performance.

@wilsonwang371
Contributor Author

Ideally, after a patch similar to the one proposed in #12529, we should see performance graphs similar to the ones in #12896 (comment).

Contributor

@gyuho gyuho left a comment

@wilsonwang371 To move forward, let's make this configurable.

It would be great if we can validate this with Kubernetes scale tests.

From the etcd 3.4 release blog post, where we set concurrent read tx as the default:

> We further made backend read transactions fully concurrent. Previously, ongoing long-running read transactions block writes and upcoming reads. With this change, write throughput is increased by 70% and P99 write latency is reduced by 90% in the presence of long-running reads. We also ran Kubernetes 5000-node scalability test on GCE with this change and observed similar improvements. For example, in the very beginning of the test where there are a lot of long-running “LIST pods”, the P99 latency of “POST clusterrolebindings” is reduced by 97.4%.

How about we add a flag

etcd --txn-mode

where we configure:

  • read-with-copied-buffer (default value: use concurrent read tx no matter what)
  • write-with-shared-buffer (optional: if the txn includes a write, fall back to the shared buffer)

That way, we can make incremental changes for #12692 with etcd --txn-mode=only-long-read-with-copied-buffer, so as not to create a concurrent tx for short writes.

@ptabor
Contributor

ptabor commented May 4, 2021

I think these results are awesome. I'm not insisting on a flag any longer (as in the long run it adds complexity),
but maybe for the stabilization period / experimenting it's worth having one ('experimental'?).

@jkaniuk FYI: about performance impact

@gyuho
Contributor

gyuho commented May 4, 2021

> stabilization period / experimenting it's worth having one ('experimental'?).

Sounds good.

@wilsonwang371 Can we make this a flag (e.g., etcd --experimental-txn-mode-write-with-shared-buffer)?

@wilsonwang371
Contributor Author

> I think these results are awesome. I'm not insisting on a flag any longer (as in the long run it adds complexity), but maybe for the stabilization period / experimenting it's worth having one ('experimental'?).
>
> @jkaniuk FYI: about performance impact

> > stabilization period / experimenting it's worth having one ('experimental'?).
>
> Sounds good.
>
> @wilsonwang371 Can we make this a flag (e.g., etcd --experimental-txn-mode-write-with-shared-buffer)?

Done. By default, using the shared buffer is on.
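
For anyone skimming, a hedged sketch of what the flag wiring amounts to (the flag name comes from the discussion above; the merged code's plumbing through etcd's config differs in detail):

package main

import (
	"flag"
	"fmt"
)

// Default true keeps the new behavior: write txns read from the shared
// buffer instead of a private copy; setting it to false restores
// copied-buffer reads as an escape hatch.
func main() {
	writeWithSharedBuffer := flag.Bool(
		"experimental-txn-mode-write-with-shared-buffer", true,
		"enable write transactions to use the shared buffer for their reads")
	flag.Parse()
	fmt.Println("shared-buffer write txns enabled:", *writeWithSharedBuffer)
}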

Contributor

@gyuho gyuho left a comment

lgtm thx!

@gyuho gyuho merged commit 344c9f3 into etcd-io:master May 6, 2021
@jkaniuk

jkaniuk commented May 6, 2021

Thanks @ptabor for the heads up.

> But from what I've seen K8s clusters have the Range/Txn ratio of ~4, so for k8s workloads it's positive.
>
> Based on @wilsonwang371's tests, there's no noticeable penalty for this type of workload.

As I understand from those graphs, there should be a 1.05-1.25x improvement.

Great results!

@wilsonwang371 wilsonwang371 deleted the profiling-txn2 branch May 6, 2021 21:43
@WIZARD-CXY
Contributor

nice
