@yujun777
Contributor

@yujun777 yujun777 commented Jul 17, 2023

Proposed changes

Issue Number: close #xxx

  1. Improve tablet scheduling speed.
    a. Many scheduling steps finish quickly, for example starting a clone task, beginning or ending the deletion of a replica, or marking a replica as slow. Each takes far less than 1s, so the scheduler thread does not need to wait 1s between rounds. We change the scheduling period from 1s to 100ms.
    b. When a sched ctx finishes, check the tablet immediately. If it is unhealthy, put a new ctx for this tablet into the pending queue, so it can be repaired quickly without waiting for the TabletChecker's next check.
    c. If all the backends of a tablet are either alive or decommissioning, the tablet can be put into the sched pending queue immediately.
    Test run: 3 BEs, each holding 1000 empty tablets, with 1 BE decommissioned. The old scheduler took 900s; the new scheduler took 40s.

  2. Fix tablets being scheduled too many times, which may block the queue forever.
    a. A repair task had no limit on its failed scheduling count, so it could stay in the pending queue forever and prevent other tasks from being added to the queue. For example, a decommission task may fail forever if its txn cannot finish. So we limit the failed scheduling count.
    b. A running clone task may also fail forever, so we also limit the failed running count.

  3. Remove the tablet sched ctx's dynamic priority.
    The dynamic priority is hard to understand. It also changes slowly: a priority change takes a few minutes, which is rather long. We remove the dynamic priority.

  4. If a balance task for a tablet has been added to the pending queue, a repair task for the same tablet cannot be added. This can cause a problem: if the tablet is unhealthy, the balance task fails, but in the next loop the balance task may be selected and added to the queue again, which keeps the tablet from being repaired.
    So we add a fix: if a balance ctx exists for a tablet and a repair ctx is later added for the same tablet, the balance ctx is automatically converted to a repair task.
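
For readers who have not looked at the scheduler, points 2a and 4 can be pictured with a small Java sketch. This is not the code in this PR. `PendingQueueSketch`, `Ctx`, `add`, `onSchedFailed`, and `MAX_FAILED_SCHED` (and its value of 3) are hypothetical names and numbers chosen only for illustration, and the sketch assumes at most one pending ctx per tablet.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Doris code base.
class PendingQueueSketch {
    enum Type { BALANCE, REPAIR }

    static final class Ctx {
        final long tabletId;
        Type type;
        int failedSchedCount = 0;

        Ctx(long tabletId, Type type) {
            this.tabletId = tabletId;
            this.type = type;
        }
    }

    // Assumed limit; the PR description does not state the real value.
    static final int MAX_FAILED_SCHED = 3;

    // Assumption: the pending queue holds at most one ctx per tablet.
    private final Map<Long, Ctx> pending = new HashMap<>();

    /**
     * Adds a ctx for a tablet. If a BALANCE ctx is already pending and a
     * REPAIR ctx arrives, the pending ctx is converted to REPAIR (point 4).
     * Returns false when the request is dropped as a duplicate.
     */
    boolean add(long tabletId, Type type) {
        Ctx existing = pending.get(tabletId);
        if (existing == null) {
            pending.put(tabletId, new Ctx(tabletId, type));
            return true;
        }
        if (existing.type == Type.BALANCE && type == Type.REPAIR) {
            existing.type = Type.REPAIR;
            return true;
        }
        return false;
    }

    /**
     * Records one failed scheduling attempt. Once the count reaches the
     * limit, the ctx is removed so it cannot occupy the queue forever
     * (point 2a). Returns true if the ctx is still pending afterwards.
     */
    boolean onSchedFailed(long tabletId) {
        Ctx ctx = pending.get(tabletId);
        if (ctx == null) {
            return false;
        }
        ctx.failedSchedCount++;
        if (ctx.failedSchedCount >= MAX_FAILED_SCHED) {
            pending.remove(tabletId);
            return false;
        }
        return true;
    }

    /** Returns the pending ctx type for a tablet, or null if none is pending. */
    Type typeOf(long tabletId) {
        Ctx ctx = pending.get(tabletId);
        return ctx == null ? null : ctx.type;
    }

    public static void main(String[] args) {
        PendingQueueSketch queue = new PendingQueueSketch();
        queue.add(1L, Type.BALANCE);
        queue.add(1L, Type.REPAIR); // converts the pending balance ctx
        System.out.println(queue.typeOf(1L));
    }
}
```

In this sketch a later balance request for the same tablet is simply dropped while a repair ctx is pending, and a ctx that keeps failing leaves the queue after the assumed limit, which frees the slot for other tablets.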

@yujun777
Contributor Author

run buildall

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 54.57 seconds
stream load tsv: 503 seconds loaded 74807831229 Bytes, about 141 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 30.1 seconds inserted 10000000 Rows, about 332K ops/s
storage size: 17162058746 Bytes

@yujun777 yujun777 force-pushed the impr-tablet-sched-speed branch from 28fe805 to cbc0f28 July 17, 2023 12:00
@yujun777
Contributor Author

run buildall

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 55.51 seconds
stream load tsv: 513 seconds loaded 74807831229 Bytes, about 139 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.7 seconds inserted 10000000 Rows, about 336K ops/s
storage size: 17162083770 Bytes

@yujun777
Contributor Author

run buildall

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 53.16 seconds
stream load tsv: 550 seconds loaded 74807831229 Bytes, about 129 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.5 seconds inserted 10000000 Rows, about 338K ops/s
storage size: 17161827466 Bytes

@yujun777 yujun777 force-pushed the impr-tablet-sched-speed branch from 3f87faf to 60ef2c1 July 18, 2023 08:06
@yujun777
Contributor Author

run buildall

@yujun777
Contributor Author

run buildall

@yujun777
Contributor Author

run buildall

@yujun777
Contributor Author

run buildall

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 52.31 seconds
stream load tsv: 504 seconds loaded 74807831229 Bytes, about 141 MB/s
stream load json: 18 seconds loaded 2358488459 Bytes, about 124 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.8 seconds inserted 10000000 Rows, about 347K ops/s
storage size: 17162526772 Bytes

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved (Indicates a PR has been approved by one committer.) and reviewed labels Jul 18, 2023
@github-actions
Contributor

PR approved by anyone and no changes requested.

@yujun777 yujun777 changed the title from "[Improvement](tablet clone) impr tablet sched speed and fix tablet sched fail too many times" to "[Improvement](tablet clone) impr tablet sched speed and fix tablet sched failed too many times" Jul 18, 2023
Contributor

@yiguolei yiguolei left a comment

LGTM

@yiguolei yiguolei merged commit beec0e9 into apache:master Jul 18, 2023
@dataroaring dataroaring added the dev/2.0.0 2.0.0 release label Jul 19, 2023
@xiaokang xiaokang added dev/2.0.0-merged and removed dev/2.0.0 2.0.0 release labels Jul 20, 2023
xiaokang pushed a commit that referenced this pull request Jul 20, 2023
LHG41278 pushed a commit to LHG41278/dorisMine that referenced this pull request Jul 20, 2023
Labels

approved Indicates a PR has been approved by one committer. dev/2.0.0-merged reviewed

5 participants