The Non-Heap Memory Hoax #46

cjuexuan opened this issue Dec 2, 2017 · 0 comments

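Background

As NIO has matured, many frameworks, Spark among them, now make heavy use of off-heap memory. Business teams regularly tell us that their jobs were killed by YARN for exceeding the memory limit, so we treat off-heap memory collection as important. We looked at how several common frameworks obtain non-heap memory usage and wired the same approach into our monitoring system.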

  1. es

In es, non-heap memory is collected in org.elasticsearch.monitor.jvm.JvmStats through MXBeans. The code is as follows:

...
memoryMXBean = ManagementFactory.getMemoryMXBean();
memUsage = memoryMXBean.getNonHeapMemoryUsage();
long nonHeapUsed = memUsage.getUsed() < 0 ? 0 : memUsage.getUsed();
...

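However, another property there points at directMemory. We will come back to that later.

  2. metrics

In metrics, non-heap memory is collected in com.codahale.metrics.jvm.MemoryUsageGaugeSet, in a way very similar to es: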

      gauges.put("non-heap.used", new Gauge<Long>() {
          @Override
          public Long getValue() {
              return mxBean.getNonHeapMemoryUsage().getUsed();
          }
      });

So we implemented the same collection approach ourselves:

  new TSMetadata[Long] {
      override val name: String = NON_HEAP_MEM_USED
      override val metricType: MetricType = Gauge
      override val generateValue: () => Long = () => mxBeans.getNonHeapMemoryUsage.getUsed
  }
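All three snippets above come down to the same call on the platform MemoryMXBean. As a minimal, self-contained sketch of that call (the class name NonHeapGauge is hypothetical and not taken from any of the frameworks), it could look like this:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical class name, used only for illustration.
class NonHeapGauge {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    // Bytes the JVM reports as "non-heap used", clamped at 0 like the es snippet.
    static long nonHeapUsed() {
        MemoryUsage usage = MEMORY.getNonHeapMemoryUsage();
        return Math.max(0L, usage.getUsed());
    }

    public static void main(String[] args) {
        System.out.println("non-heap.used = " + nonHeapUsed() + " bytes");
    }
}
```

Whatever wrapper sits on top (a Gauge, a TSMetadata), the reported number is whatever getNonHeapMemoryUsage() returns.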

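We happily had a business team adopt it. A few days later they came back and told us the metric was badly inaccurate, and showed us the chart below:

[Figure: spark monitor]

The chart shows very little non-heap memory in use. The user had not set MaxDirectMemorySize and had set Xmx to 8G, so the effective maxDirectMemory should have been 8G * (1 - 0.3 (young generation) * 0.2 (survivor space)), roughly 7.5G. In theory the job should not have been killed, yet YARN killed it anyway, so we had to re-examine the data we were collecting.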
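The estimate above can be written out as a small calculation. This is only a sketch: the class name is hypothetical, the 0.3 and 0.2 fractions are the ones assumed in the text rather than values read from the JVM, and the second print just shows Runtime.maxMemory(), which, as far as we know, is what HotSpot on JDK 8 falls back to when MaxDirectMemorySize is not set.

```java
// Hypothetical helper, only to make the arithmetic explicit.
class DirectMemoryEstimate {
    // Heap size minus one survivor space, using assumed generation fractions.
    static double estimateGb(double xmxGb, double youngFraction, double survivorFraction) {
        return xmxGb * (1.0 - youngFraction * survivorFraction);
    }

    public static void main(String[] args) {
        // 0.3 and 0.2 are the fractions assumed in the text, not measured values.
        System.out.printf("estimated limit: %.2f GB%n", estimateGb(8.0, 0.3, 0.2));
        // What this JVM reports as its maximum usable heap.
        double maxGb = Runtime.getRuntime().maxMemory() / (1024.0 * 1024 * 1024);
        System.out.printf("Runtime.maxMemory(): %.2f GB%n", maxGb);
    }
}
```

With Xmx at 8G and those fractions the estimate works out to 8 * 0.94 = 7.52 GB, which is the "roughly 7.5G" used in the reasoning.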

A small experiment

So I wrote a small test demo to take a look:

	import java.nio.ByteBuffer
	import scala.collection.JavaConverters._
	import com.codahale.metrics.Gauge
	import com.codahale.metrics.jvm.MemoryUsageGaugeSet
	import org.elasticsearch.monitor.jvm.JvmStats
	import sun.nio.ch.DirectBuffer // JDK 8 internal API
	// FormatUtils and MemoryTSMetadataSet are not shown in this post

	def main(args: Array[String]): Unit = {
		var heapBuffer = ByteBuffer.allocate(1024 * 1024 * 1024) // 1G on the heap
		val directBuffer = ByteBuffer.allocateDirect(1024 * 1024 * 1024) // 1G of direct memory
		printInfo()
		Thread sleep 1000
		// Release both buffers: drop the heap reference, free the direct memory explicitly
		heapBuffer = null
		directBuffer.asInstanceOf[DirectBuffer].cleaner().clean()
		System.gc()
		printInfo()
		Thread sleep 1000
	}

	// Prints the same memory figures as seen by metrics, spoor and es
	def printInfo(): Unit = {
		println("######start################")
		println("-------MxBeans-----------")
		mxBeansPrint()
		println("-------Spoor-------------")
		spoorPrint()
		println("-------Es----------------")
		esPrint()
		println("######end################")
	}

	private def mxBeansPrint(): Unit = {
		val set = new MemoryUsageGaugeSet
		set.getMetrics.asScala.filter(_._1.contains("heap")).toSeq.sortBy(_._1).foreach(kv => println(s"${kv._1}:${FormatUtils.readableFileSize(kv._2.asInstanceOf[Gauge[Number]].getValue.longValue())}"))
	}

	private def spoorPrint(): Unit = {
		MemoryTSMetadataSet.memoryTSMetaDataS.sortBy(_.name).foreach(t => println(s"${t.name}:${FormatUtils.readableFileSize(t.generateValue())}"))
	}

	private def esPrint(): Unit = {
		val mem = JvmStats.jvmStats().getMem
		println(s"heapUsed:${FormatUtils.readableFileSize(mem.getHeapUsed.getBytes)}")
		println(s"nonHeapUsed:${FormatUtils.readableFileSize(mem.getNonHeapUsed.getBytes)}")
	}

Experiment 1

The JVM was started with -Xmx2G -XX:MaxDirectMemorySize=512M, and we tried to allocate 1G of direct memory. As expected, we got the familiar OOM:

Exception in thread "main" java.lang.OutOfMemoryError: Direct buffer memory

Experiment 2

With MaxDirectMemorySize removed from the startup flags, the output is:

######start################
-------MxBeans-----------
heap.committed:1.2 GB
heap.init:128 MB
heap.max:1.8 GB
heap.usage:0
heap.used:1 GB
non-heap.committed:12.1 MB
non-heap.init:2.4 MB
non-heap.max:0
non-heap.usage:0
non-heap.used:11.3 MB
-------Spoor-------------
jvm.mem.heap.committed:1.2 GB
jvm.mem.heap.used:1 GB
jvm.mem.nonHeap.committed:12.1 MB
jvm.mem.nonHeap.used:11.5 MB
-------Es----------------
heapUsed:1 GB
nonHeapUsed:11.7 MB
######end################
######start################
-------MxBeans-----------
heap.committed:1.2 GB
heap.init:128 MB
heap.max:1.8 GB
heap.usage:0
heap.used:3.9 MB
non-heap.committed:12.6 MB
non-heap.init:2.4 MB
non-heap.max:0
non-heap.usage:0
non-heap.used:12.1 MB
-------Spoor-------------
jvm.mem.heap.committed:1.2 GB
jvm.mem.heap.used:3.9 MB
jvm.mem.nonHeap.committed:12.6 MB
jvm.mem.nonHeap.used:12.1 MB
-------Es----------------
heapUsed:3.9 MB
nonHeapUsed:12.1 MB
######end################

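The heap figures from all three frameworks are essentially consistent and accurate. The nonHeap figures are also very close to one another, but they are all wrong for our purpose: none of them reflects the 1G direct buffer.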
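The same effect can be reproduced without any framework. The sketch below (hypothetical class name, and a 32 MB buffer instead of 1G so it runs under small default limits) measures how much the non-heap figure grows while a direct buffer is held:

```java
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical class name, used only for illustration.
class NonHeapBlindSpot {
    static long nonHeapUsed() {
        return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed();
    }

    // Growth of "non-heap used" while a direct buffer of the given size is alive.
    static long nonHeapGrowthWhileHolding(int directBytes) {
        long before = nonHeapUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(directBytes);
        long after = nonHeapUsed();
        buffer.put(0, (byte) 1); // touch the buffer after measuring so it stays reachable
        return after - before;
    }

    public static void main(String[] args) {
        int size = 32 * 1024 * 1024; // 32 MB of direct memory
        System.out.println("allocated direct bytes : " + size);
        System.out.println("non-heap growth (bytes): " + nonHeapGrowthWhileHolding(size));
    }
}
```

In our understanding the growth stays far below the allocated size, which matches the outputs above: the direct buffer simply is not part of this figure.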

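Experiment 3

Increase the sleep duration, open jconsole and VisualVM, and observe what they show:

[Figure: jconsoleandvm]

This chart plots nonHeapUsed, printed once a minute and drawn in Numbers:

[Figure: nonHeapMXBeans]

Remarkably, the nonHeap metric tracks the change in Metaspace size fairly closely.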

Experiment 4

Set the maximum Metaspace size with -XX:MaxMetaspaceSize=10m, and the expected exception duly appears:

[Figure: metaSpace]

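Experiment 5

So what exactly belongs to the non-heap part, and why do nonHeap and the Metaspace size not match exactly? The Metaspace size is always smaller than the nonHeap size we see, because structures such as the StringTable and SymbolTable are also allocated outside the heap. For details, see RednaxelaFX's answer on Zhihu.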
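One way to see what the JVM itself counts as non-heap is to list the memory pools whose type is NON_HEAP. The sketch below uses only java.lang.management; the class name is hypothetical. On HotSpot JDK 8 we would expect pool names such as Metaspace, Code Cache and Compressed Class Space, but the exact set depends on the JVM and its flags, so treat that as an assumption rather than a guarantee.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class name, used only for illustration.
class NonHeapPools {
    // Pool name -> used bytes, for every pool the JVM classifies as NON_HEAP.
    static Map<String, Long> nonHeapPoolUsage() {
        Map<String, Long> result = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null for an invalid pool
            if (pool.getType() == MemoryType.NON_HEAP && usage != null) {
                result.put(pool.getName(), usage.getUsed());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        nonHeapPoolUsage().forEach((name, used) ->
                System.out.println(name + ": " + used + " bytes"));
    }
}
```

If pools other than Metaspace show up in this list, they would account for at least part of the gap between nonHeap and the Metaspace size.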

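So can we actually get the directMem monitoring we want? Fortunately, since JDK 7 there are MXBeans that cover it. See the screenshots:

Initially 0:

[Figure: before]

After allocating direct memory, the value changes:

[Figure: allocate]

After cleaning, it is back to zero:

[Figure: clean]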

Naturally, we can obtain this value through the MXBean specification as well.

In the Sun JDK, the code is in sun.management.ManagementFactoryHelper:

    public static synchronized List<BufferPoolMXBean> getBufferPoolMXBeans() {
        if (bufferPools == null) {
            bufferPools = new ArrayList(2);
            bufferPools.add(createBufferPoolMXBean(SharedSecrets.getJavaNioAccess().getDirectBufferPool()));
            bufferPools.add(createBufferPoolMXBean(FileChannelImpl.getMappedBufferPool()));
        }

        return bufferPools;
    }

	import java.nio.ByteBuffer
	import sun.misc.SharedSecrets // JDK 8 internal API
	import sun.nio.ch.DirectBuffer // JDK 8 internal API

	def main(args: Array[String]): Unit = {
		println(s"before${FormatUtils.readableFileSize(getDirectPoolUsed)}")
		val buffer = ByteBuffer.allocateDirect(1024 * 1024 * 1024) // 1G
		println(s"allocate${FormatUtils.readableFileSize(getDirectPoolUsed)}")
		buffer.asInstanceOf[DirectBuffer].cleaner().clean()
		println(s"clean${FormatUtils.readableFileSize(getDirectPoolUsed)}")
	}

	// Reads the direct buffer pool's usage through the JDK-internal accessor
	def getDirectPoolUsed: Long = SharedSecrets.getJavaNioAccess.getDirectBufferPool.getMemoryUsed

The output is:

before0
allocate1 GB
clean0

This confirms our hypothesis.

What es was holding back

            List<BufferPoolMXBean> bufferPools = ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
            stats.bufferPools = new ArrayList<>(bufferPools.size());
            for (BufferPoolMXBean bufferPool : bufferPools) {
                stats.bufferPools.add(new BufferPool(bufferPool.getName(), bufferPool.getCount(), bufferPool.getTotalCapacity(), bufferPool.getMemoryUsed()));
            }

es exposes this part under bufferPools, also in JvmStats.
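The same value can be read without touching JDK-internal classes such as SharedSecrets, by going through the public BufferPoolMXBean the way the es snippet does. The sketch below is a minimal version under two assumptions: the class name is hypothetical, and the pool is looked up by the name "direct", which is what HotSpot uses as far as we know.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical class name, used only for illustration.
class DirectPoolGauge {
    // Bytes used by the "direct" buffer pool, or -1 if this JVM exposes no such pool.
    static long directPoolUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) { // pool name is an assumption about HotSpot
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directPoolUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024); // 16 MB
        long after = directPoolUsed();
        buffer.put(0, (byte) 1); // keep the buffer reachable until after the measurement
        System.out.println("direct pool grew by " + (after - before) + " bytes");
    }
}
```

A gauge built on directPoolUsed() is the kind of metric we were actually after when we first wired in non-heap.used.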

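Summary

To sum up, our understanding of this metric was not precise enough. What we actually wanted was the size of directMemory, and the value obtained through nonHeap is not that. We finally changed how we obtain the value and can now monitor this region of memory (Spark uses it very heavily).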

@cjuexuan cjuexuan added the apm label Dec 2, 2017