
Active defrag v2 #4691

Merged
merged 2 commits into redis:unstable on Mar 22, 2018

Conversation

oranagra
Member

@oranagra oranagra commented Feb 18, 2018

First commit: Avoid latency spikes when defragging huge keys

  • big keys are not defragged in one go from within the dict scan; instead they are scanned in parts after the main dict hash bucket is done.
  • add latency monitor sample for defrag
  • change default active-defrag-cycle-min to induce lower latency
  • make active defrag start a new scan right away if needed, so it's easier (for the test suite) to detect when it's done
  • make active defrag quit the current cycle after each db / big key
  • defrag some non key long term global allocations
  • some refactoring for smaller functions and more reusable code

Second commit: Adding real allocator fragmentation to INFO and MEMORY command + active defrag test

other fixes / improvements:

  • Lua script memory isn't allocated via zmalloc (it comes from libc malloc), so it could cause a falsely high fragmentation ratio to be displayed
  • there was a problem with "fragmentation" info being calculated from RSS and used_memory sampled at different times (now they are sampled together)

other details:

  • adding a few more allocator info fields to INFO and MEMORY commands
  • improve defrag test to measure defrag latency of big keys
  • increasing the accuracy of the defrag test (by looking at real frag info); this way we can use an even lower threshold and still avoid false positives
  • keep the old (total) "fragmentation" field unchanged, but add new ones for specific things
  • add these to the MEMORY DOCTOR command
  • deduct Lua memory from the RSS in case of a non-jemalloc allocator (one for which we don't have "allocator active/used" stats)
  • reduce sampling rate of the rss and allocator info

One detail from the commit message not covered above: during dict rehashing, one scan iteration of the dict can end up scanning one bucket in the smaller dict and many, many buckets in the larger dict, so waiting for 16 scan iterations before checking the time may be much too long.
@oranagra
Member Author

@antirez from time to time people keep complaining that defrag doesn't work (last example: #4751).
please consider merging this PR; the second commit should clear up that confusion.

@antirez
Contributor

antirez commented Mar 14, 2018

Hello @oranagra, I'm going to merge this because I believe it's too complex for me to dig into the defragmentation code every time, so I nominate you, if you accept, the maintainer of this Redis sub-system, and will just merge anything you PR. However, if I understand correctly, this also changes things beyond defrag, like the way fragmentation is computed.

Questions:

  1. Is the total fragmentation fixed just by accounting for Lua memory (good idea indeed), and by sampling at the same time (also cool)?
  2. If there is any other change, could you explain the changes at a higher level, and tell me if we now hook into APIs (jemalloc or whatever) that may sometimes be slower, introducing latency spikes? So far we just queried our total memory allocations + /proc, which is also technically just a file system, but actually we can kinda imagine it being real-time.

Remarks:

  1. Thanks for improving MEMORY DOCTOR this is cool.
  2. Thanks for all the tests.

Thank you

@oranagra
Member Author

Hi.
This sounds like a good plan, especially since i did do an intensive code review of that code with @guybe7, and we also QA'd it internally at RL. (And considering that this is off by default and marked as experimental, it's low risk anyway.)

I've split that PR into two commits so that it'll be easier for you to review just the second one (which has changes outside the defrag code).
despite that, I do think you might also want to take a quick look at the first commit too, at the parts outside the defrag.c file (e.g. redis.conf).

Anyway, regarding the second commit and the changes it brings to redis INFO / MEMORY command:

The idea this time was not to change the behavior of the old info fields (they report the same measurements they used to), but to add new info fields that break down what this RSS overhead (incorrectly called "fragmentation") is made of.
in this case, the old fragmentation_ratio field will still include the Lua overhead (no change here),
but the new allocator_frag field will not include it (even if both redis and Lua are using glibc malloc).

the thing about sampling memory usage and RSS at the same time is actually a bugfix: otherwise the ratio you calculate can be wrong if the RSS sampled at the server cron was very different from the memory usage at the time of infoCommand.

i will obviously turn your attention to risky things i'm adding, even if you won't review them.
in this case i did add calls to je_mallctl() to collect stats from the server cron, but i also reduced the rate at which they are called (and also the existing call to zmalloc_get_rss) to no more than once per 10ms (in case someone increases hz). i don't believe this is problematic, but if you do, we can reduce the rate further.

all the other changes in that commit are just adding more fields to MEMORY and INFO commands and improving the DOCTOR.

@oranagra
Member Author

@antirez i forgot to tag you for the above reply.

@antirez
Contributor

antirez commented Mar 16, 2018

Thank you @oranagra, so I'll follow your advice and today I'll read both the commit and merge or comment. Also thanks for splitting the PR into two commits for easy reading.

@antirez
Contributor

antirez commented Mar 22, 2018

Hello @oranagra, I read the two commits. On the big one implementing defrag V2 I only scanned the different functions to get an idea of what it does; the code seems well written to me and the improvement in latency very important, however I cannot claim that my review was able to uncover any bug. It would be cool if we could reserve some "office time" in San Francisco, in case you are willing to give me a 20-minute tour of how exactly it works. Also, it is pretty clear that if we go with a radix-tree+listpacks for all the data types (but sorted sets), this will be a big win from the POV of the simplicity that can be achieved inside the defragmentation code: a single code path for Streams, Lists (we should move away from quicklists as well), Hashes, and Sets would be cool.

About the second commit, it looks cool as well, also the hints in MEMORY DOCTOR are useful. I'm merging everything as it is, but the only thing I would like to improve before Redis 5.0 RC1 is the naming of the fields. mem_fragmentation vs allocator_fragmentation is complex to understand. I wonder if we can pick more specific names, and document somewhere effectively the difference between all those.

So starting with merging the PR, and we can continue later from here. Thanks for your work!

@antirez antirez merged commit da62178 into redis:unstable Mar 22, 2018
@oranagra
Member Author

@antirez thanks for merging.
Sure, if we can find some time in RedisConf i'll be happy to go over it with you.

Regarding the names, i originally wanted to change the meaning of mem_fragmentation and have it show what's currently displayed in allocator_fragmentation, and add a field named rss_overhead that would show what's currently displayed in mem_fragmentation.
I think that's more "accurate", but i understand we may want to keep some backwards compatibility, and also show the most inclusive (highest ratio) number where people are used to looking for it, and then expose details elsewhere.

if you want to suggest some alternative names, please give an example, and we'll go from there.
thanks again.

JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Apr 10, 2018
pulllock pushed a commit to pulllock/redis that referenced this pull request Jun 28, 2023