Skip to content

Druid historical loses track of used space #11841

@tanisdlj

Description

@tanisdlj

Affected Version

0.22.2

Description

  • Cluster size: 2 brokers, 2 routers, 2 coordinators, 37 historicals (15 hot, 21 cold, 1 frozen), 2 overlords, 43 middlemanagers
  • Steps to reproduce the problem: One morning we found that, during the night, a massive rebalanced happened leaving many servers at 100% disk usage while others in the same tier were left empty. After restarting the coordinator segments were better balanced but we started noticing this issue. Many servers in two different tiers had their disks full while reporting not being full. A restart on the historical or the coordinator doesn't fix it.

Coordinator log:

Oct 25 08:59:53 druid-master-1 java[19246]: 2021-10-25T08:59:53,185 ERROR [Master-PeonExec--0] org.apache.druid.server.coordinator.HttpLoadQueuePeon - Server[http://stde2-hhot-10.stde2] Failed segment[datasource_2021-09-04T11:00:00.000Z_2021-09-04T12:00:00.000Z_2021-09-04T11:00:00.016Z_226] request[SegmentChangeRequestLoad] with cause [Exception loading segment[datasource_2021-09-04T11:00:00.000Z_2021-09-04T12:00:00.000Z_2021-09-04T11:00:00.016Z_226]].
Oct 25 08:59:53 druid-master-1 java[19246]: 2021-10-25T08:59:53,475 ERROR [Master-PeonExec--0] org.apache.druid.server.coordinator.HttpLoadQueuePeon - Server[http://stde2-hhot-01.stde2] Failed segment[datasource_2021-08-31T19:00:00.000Z_2021-08-31T20:00:00.000Z_2021-08-31T19:00:00.019Z_449] request[SegmentChangeRequestLoad] with cause [Exception loading segment[datasource_2021-08-31T19:00:00.000Z_2021-08-31T20:00:00.000Z_2021-08-31T19:00:00.019Z_449]].
Oct 25 08:59:55 druid-master-1 java[19246]: 2021-10-25T08:59:55,918 ERROR [Master-PeonExec--0] org.apache.druid.server.coordinator.HttpLoadQueuePeon - Server[http://stde2-hhot-01.stde2] Failed segment[datasource_2021-10-24T17:00:00.000Z_2021-10-24T18:00:00.000Z_2021-10-24T17:00:00.012Z_339] request[SegmentChangeRequestLoad] with cause [Exception loading segment[datasource_2021-10-24T17:00:00.000Z_2021-10-24T18:00:00.000Z_2021-10-24T17:00:00.012Z_339]].

One of the historicals affected:

Oct 25 09:47:19 stde2-hhot-01.stde2 java[58946]: 2021-10-25T09:47:19,998 INFO [SimpleDataSegmentChangeHandler-105] org.apache.druid.server.SegmentManager - Attempting to close segment datasource_2021-08-25T08:00:00.000Z_2021-08-25T09:00:00.000Z_2021-08-25T08:00:00.024Z_41
Oct 25 09:47:19 stde2-hhot-01.stde2 java[58946]: 2021-10-25T09:47:19,998 INFO [SimpleDataSegmentChangeHandler-105] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[/druid/segment-cache-1/datasource/2021-08-25T08:00:00.000Z_2021-08-25T09:00:00.000Z/2021-08-25T08:00:00.024Z/41]
Oct 25 09:47:19 stde2-hhot-01.stde2java[58946]: 2021-10-25T09:47:19,999 WARN [SimpleDataSegmentChangeHandler-105] org.apache.druid.segment.loading.StorageLocation - SegmentDir[/druid/segment-cache-1/datasource/2021-08-25T08:00:00.000Z_2021-08-25T09:00:00.000Z/2021-08-25T08:00:00.024Z/41] is not found under this location[/druid/segment-cache-1]
Oct 25 09:47:20 stde2-hhot-01.stde2java[58946]: 2021-10-25T09:47:20,001 INFO [SimpleDataSegmentChangeHandler-24] org.apache.druid.server.SegmentManager - Attempting to close segment datasource_2021-08-25T08:00:00.000Z_2021-08-25T09:00:00.000Z_2021-08-25T08:00:00.024Z_25
Oct 25 09:47:20 stde2-hhot-01.stde2 java[58946]: 2021-10-25T09:47:20,001 INFO [SimpleDataSegmentChangeHandler-24] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[/druid/segment-cache-1/datasource/2021-08-25T08:00:00.000Z_2021-08-25T09:00:00.000Z/2021-08-25T08:00:00.024Z/25]
Oct 25 09:47:20 stde2-hhot-01.stde2 java[58946]: 2021-10-25T09:47:20,001 WARN [SimpleDataSegmentChangeHandler-24] org.apache.druid.segment.loading.StorageLocation - SegmentDir[/druid/segment-cache-1/datasource/2021-08-25T08:00:00.000Z_2021-08-25T09:00:00.000Z/2021-08-25T08:00:00.024Z/25] is not found under this location[/druid/segment-cache-1]
Oct 25 09:47:20 stde2-hhot-01.stde2 java[58946]: 2021-10-25T09:47:20,003 INFO [SimpleDataSegmentChangeHandler-108] org.apache.druid.server.SegmentManager - Attempting to close segment datasource_2021-08-25T08:00:00.000Z_2021-08-25T09:00:00.000Z_2021-08-25T08:00:00.024Z_10
Oct 25 09:47:20 stde2-hhot-01.stde2 java[58946]: 2021-10-25T09:47:20,003 INFO [SimpleDataSegmentChangeHandler-108] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[/druid/segment-cache-2/datasource/2021-08-25T08:00:00.000Z_2021-08-25T09:00:00.000Z/2021-08-25T08:00:00.024Z/10]
Oct 25 09:47:20 stde2-hhot-01.stde2 java[58946]: 2021-10-25T09:47:20,005 INFO [SimpleDataSegmentChangeHandler-4] org.apache.druid.server.SegmentManager - Attempting to close segment ds2_2021-08-25T08:00:00.000Z_2021-08-25T09:00:00.000Z_2021-08-25T08:00:00.086Z_25
Oct 25 09:47:20 stde2-hhot-01.stde2 java[58946]: 2021-10-25T09:47:20,006 INFO [SimpleDataSegmentChangeHandler-4] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[/druid/segment-cache-2/ds2/2021-08-25T08:00:00.000Z_2021-08-25T09:00:00.000Z/2021-08-25T08:00:00.086Z/25]
Oct 25 09:47:20 stde2-hhot-01.stde2 java[58946]: 2021-10-25T09:47:20,007 WARN [SimpleDataSegmentChangeHandler-4] org.apache.druid.segment.loading.StorageLocation - SegmentDir[/druid/segment-cache-2/ds2/2021-08-25T08:00:00.000Z_2021-08-25T09:00:00.000Z/2021-08-25T08:00:00.086Z/25] is not found under this location[/druid/segment-cache-2]

image

image

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions