You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The cursor.skip() method is often expensive because it requires the server to walk from the
beginning of the collection or index to get the offset or skip position before beginning to return
results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and
more CPU intensive. With larger collections, cursor.skip() may become IO bound.
/** * <p>Limits the number of elements returned in one batch. A cursor * typically fetches a batch of result objects and store them * locally.</p> * * <p>If {@code batchSize} is positive, it represents the size of each batch of objects retrieved. It can be adjusted to optimize * performance and limit data transfer.</p> * * <p>If {@code batchSize} is negative, it will limit of number objects returned, that fit within the max batch size limit (usually * 4MB), and cursor will be closed. For example if {@code batchSize} is -10, then the server will return a maximum of 10 documents and * as many as can fit in 4MB, then close the cursor. Note that this feature is different from limit() in that documents must fit within * a maximum size, and it removes the need to send a request to close the cursor server-side.</p>*/
引言
原因
从代码逻辑上看,这里采取了分批次导出的方式,类似前端的分页,具体是通过skip+limit的方式实现的,那么采用这种方式会有什么问题呢?我们google一下这两个接口的文档:
简单来说,随着页数的增长,skip()会变得越来越慢,但是具体就我们这里导出的场景来说,按理说应该没必要每次都去重复计算,做一些无用功,我的理解应该可以拿到一个指针,慢慢遍历,简单google之后,我们发现果然是可以这样做的。
我们可以在持久层新增一个方法,返回一个cursor专门供上层去遍历数据,这样就不用再去遍历已经导出过的结果集,从O(N2)优化到了O(N),这里还可以指定一个batchSize,设置一次从MongoDB中抓取的数据量(元素个数),注意这里最大是4M.
比如说我这里配置的8000,那么mongo客户端就会去默认抓取这么多的数据量:
经过本地简单的测试,我们发现性能已经有了飞跃的提升,导出30万数据,采用之前的方式,翻页到后面平均要500ms,总耗时60039ms。而优化后的方式,平均耗时在100ms-200ms之间,总耗时16667ms(中间包括业务逻辑的耗时)。
使用
那么我们再看看hasNext内部的逻辑好吗?好的.
最后
initFromCommandResult
拿到结果并解析成Bson对象总结
我们平常写代码的时候,最好都能够针对每个方法、接口甚至是更细的粒度加上埋点,也可以设置成
debug
级别,这样利用log4j/logback等日志框架动态更新级别,可以随时查看耗时,从而更能够针对性的优化,对于本文说的这个场景,我们首先看看是不是代码的逻辑有问题,然后看看是不是数据库的问题,比如说没建索引、数据量过大等,再去想办法针对性的优化,而不要上来就撸代码。The text was updated successfully, but these errors were encountered: