Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance MetricsEsDAO#multiGet to avoid OutOfBounds #5096

Merged
merged 2 commits into from
Jul 14, 2020
Merged

Enhance MetricsEsDAO#multiGet to avoid OutOfBounds #5096

merged 2 commits into from
Jul 14, 2020

Conversation

amwyyyy
Copy link
Contributor

@amwyyyy amwyyyy commented Jul 14, 2020

See this issues #5039

The problem is because the es query times out
image
Cause duplicate service_traffic

{
  "_index": "sw8_service_traffic-20200704",
  "_type": "type",
  "_id": "bWJwLWJvbnVzLWNhbGN1bGF0ZQ==.1",
  "_version": 1,
  "_score": 2,
  "_source": {
    "node_type": 0,
    "name": "mbp-bonus-calculate"
  }
}

{
  "_index": "sw8_service_traffic-20200706",
  "_type": "type",
  "_id": "bWJwLWJvbnVzLWNhbGN1bGF0ZQ==.1",
  "_version": 1,
  "_score": 2,
  "_source": {
    "node_type": 0,
    "name": "mbp-bonus-calculate"
  }
}

Then the server starts throw exception
image

View server source code, under normal circumstances method "ids" the return should be less than or equal to the length of String[] ids. However, because service_traffic is repeated, one id can search two results, resulting in totalHits greater than the actual. So throw ArrayIndexOutOfBoundsException.
image
image

We should guarantee the stability of es, but there will always be unexpected situations, hoping that skywalking is fault-tolerant. Because in the previous version, at most only some trace data was lost.

Copy link
Member

@wu-sheng wu-sheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kezhenxu94 @dmsolr Would you take a look at ES API?

Seems like response.getHits().getTotalHits().value VS response.getHits().getHits().length. @amwyyyy A little explanation about the difference?

@codecov
Copy link

codecov bot commented Jul 14, 2020

Codecov Report

Merging #5096 into master will decrease coverage by 0.07%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #5096      +/-   ##
============================================
- Coverage     53.09%   53.01%   -0.08%     
- Complexity     2985     2986       +1     
============================================
  Files          1417     1417              
  Lines         30719    30714       -5     
  Branches       3417     3418       +1     
============================================
- Hits          16310    16283      -27     
- Misses        13605    13626      +21     
- Partials        804      805       +1     
Impacted Files Coverage Δ Complexity Δ
...torage/plugin/elasticsearch/base/MetricsEsDAO.java 100.00% <100.00%> (ø) 5.00 <0.00> (ø)
...orage/plugin/elasticsearch7/dao/MetricsEs7DAO.java 100.00% <100.00%> (ø) 3.00 <0.00> (ø)
...erver/library/client/jdbc/JDBCClientException.java 0.00% <0.00%> (-50.00%) 0.00% <0.00%> (-1.00%)
...ing/oap/server/library/server/grpc/GRPCServer.java 48.33% <0.00%> (-11.67%) 6.00% <0.00%> (-1.00%)
...e/plugin/jdbc/h2/dao/H2NetworkAddressAliasDAO.java 77.27% <0.00%> (-9.10%) 5.00% <0.00%> (ø%)
...alking/apm/agent/core/context/util/PeerFormat.java 25.00% <0.00%> (-8.34%) 0.00% <0.00%> (ø%)
...brary/client/jdbc/hikaricp/JDBCHikariCPClient.java 53.52% <0.00%> (-5.64%) 13.00% <0.00%> (ø%)
...server/core/remote/client/RemoteClientManager.java 83.15% <0.00%> (-4.22%) 18.00% <0.00%> (-1.00%)
...skywalking/oap/server/core/CoreModuleProvider.java 75.69% <0.00%> (-4.17%) 9.00% <0.00%> (ø%)
...core/analysis/manual/instance/InstanceTraffic.java 56.00% <0.00%> (-4.00%) 7.00% <0.00%> (ø%)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c5972c2...d791aa6. Read the comment docs.

@dmsolr
Copy link
Member

dmsolr commented Jul 14, 2020

In actually, gitHits().length is more robustness. In general, gitHits().Totalhits works well unless the query timeout.

gitHits().totalHits is as many documents matching the query. (hits)
gitHits().length means the ResultSet size of query. (hits and recalls)

Copy link
Member

@kezhenxu94 kezhenxu94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wu-sheng
Copy link
Member

Notice, this still could cause duplicate raws in different index. This should be clear.

@wu-sheng
Copy link
Member

Because timeout still could happen, then duplicate records generated.

@wu-sheng wu-sheng merged commit ce2469d into apache:master Jul 14, 2020
@wu-sheng wu-sheng added this to the 8.1.0 milestone Jul 14, 2020
@wu-sheng wu-sheng added backend OAP backend related. enhancement Enhancement on performance or codes labels Jul 14, 2020
@wu-sheng wu-sheng changed the title optimization MetricsEsDAO multiGet Enhance MetricsEsDAO#multiGet to avoid OutOfBounds Jul 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend OAP backend related. enhancement Enhancement on performance or codes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants