# Learn to delete data with the Druid API

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Druid retains copies of existing data segments in deep storage and on Historical processes. As you add new data to Druid, deep storage grows over time unless you explicitly remove data.

While deep storage is an important part of Druid's elastic, fault-tolerant design, over time, data accumulation in deep storage can lead to increased storage costs. Periodically deleting data can reclaim storage space and promote optimal resource allocation.

This notebook provides a tutorial on deleting existing data in Druid using the Coordinator API endpoints. 

## Table of contents

- [Prerequisites](#Prerequisites)
- [Ingest data](#Ingest-data)
- [Deletion steps](#Deletion-steps)
- [Delete by time interval](#Delete-by-time-interval)
- [Delete entire table](#Delete-entire-table)
- [Delete by segment ID](#Delete-by-segment-ID)

For the best experience, use JupyterLab so that you can always access the table of contents.


## Prerequisites

This tutorial works with Druid 26.0.0 or later.


Launch this tutorial and all prerequisites using the `druid-jupyter`, `kafka-jupyter`, or `all-services` profiles of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).

If you do not use the Docker Compose environment, you need the following:

* A running Druid instance.<br>
     Update the `druid_host` variable to point to your Router endpoint. For example:
     ```
     druid_host = "http://localhost:8888"
     ```

To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host, where the Router service listens.

`druid_host` is the hostname and port for your Druid deployment. In a distributed environment, you can point to other Druid services. In this tutorial, you'll use the Router service as the `druid_host`.

In [None]:
import requests
import json

# druid_host is the hostname and port for your Druid deployment. 
# In the Docker Compose tutorial environment, this is the Router
# service running at "http://router:8888".
# If you are not using the Docker Compose environment, edit the `druid_host`.

druid_host = "http://host.docker.internal:8888"
druid_host

Before we proceed with the tutorial, let's use the `/status/health` endpoint to verify that the cluster is up and running. This endpoint returns the JSON value `true` if the Druid cluster has finished starting up and is running. Do not move on from this point if the following call does not return `true`.

In [None]:
endpoint = druid_host + '/status/health'
response = requests.request("GET", endpoint)
print(response.text)
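If you want later cells to fail fast, the health check can be reduced to a boolean. The following is a minimal sketch: the helper name `is_healthy` is hypothetical and is not part of Druid or `requests`. It assumes the response body is the JSON literal `true`, as described above.

```python
import json


def is_healthy(body: str) -> bool:
    """Return True only if the response body is the JSON literal true."""
    try:
        return json.loads(body) is True
    except json.JSONDecodeError:
        # An empty body or an HTML error page counts as "not healthy".
        return False


print(is_healthy("true"))
```

In this notebook you could call `is_healthy(response.text)` on the response from the cell above.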

In the rest of this tutorial, the `endpoint` and other variables are updated in code cells to call a different Druid endpoint to accomplish a task.

## Ingest data

Apache Druid stores data partitioned by time chunks into segments and supports deleting data by dropping segments. Before dropping any data, we will ingest the quickstart Wikipedia data with an ingestion spec that creates hourly segments.

The following cell sets `endpoint` to `/druid/indexer/v1/task`. 

In [None]:
endpoint = druid_host + '/druid/indexer/v1/task'
endpoint

Next, construct a JSON payload with the ingestion spec to create a `wikipedia_hour` datasource with hour segmentation. There are many different [methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods) to ingest data. This tutorial uses [native batch ingestion](https://druid.apache.org/docs/latest/ingestion/native-batch.html) and the `/druid/indexer/v1/task` endpoint. For more information on constructing an ingestion spec, see [ingestion spec reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html).

In [None]:
payload = json.dumps({
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia_hour",
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "dimensionsSpec": {
        "useSchemaDiscovery": True
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "hour",
        "queryGranularity": "none",
        "intervals": [
          "2015-09-12/2015-09-13"
        ],
        "rollup": False
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "quickstart/tutorial/",
        "filter": "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat": {
        "type": "json"
      },
      "appendToExisting": False
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000,
      "maxRowsInMemory": 25000
    }
  }
})

headers = {
  'Content-Type': 'application/json'
}

With the payload and headers ready, run the next cell to send a `POST` request to the endpoint.

In [None]:
response = requests.request("POST", endpoint, headers=headers, data=payload)
                            
print(response.text)
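The `POST` above submits the task and returns right away, so the data may not be queryable yet. As a hedged sketch, the helpers below read the task ID from the submission response and the state from a task status response. The function names are hypothetical, and the sample dictionaries are illustrative shapes rather than captured output, so check them against what your own cluster returns.

```python
def task_id_from_submission(body: dict) -> str:
    """Read the task ID from the JSON body of the task submission response."""
    return body["task"]


def task_state(status_body: dict) -> str:
    """Read the state string from the JSON body of a task status response."""
    return status_body["status"]["status"]


def is_finished(state: str) -> bool:
    """A task is finished once it has either succeeded or failed."""
    return state in {"SUCCESS", "FAILED"}


# Illustrative bodies only (assumed shapes, not real output).
submission = {"task": "index_parallel_wikipedia_hour_example"}
status = {"task": submission["task"], "status": {"status": "RUNNING"}}

print(task_id_from_submission(submission), task_state(status))
```

With a live cluster, the status body would come from a `GET` request to `/druid/indexer/v1/task/<task_id>/status`, repeated with a short pause until `is_finished` returns `True`.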

Once the data has been ingested, Druid is populated with a segment for each hourly interval that contains data. Since `wikipedia_hour` was ingested with `HOUR` granularity, there will be 24 segments associated with `wikipedia_hour`.
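The figure of 24 follows from the ingestion interval and the segment granularity. A small sketch of that arithmetic, using a hypothetical helper `hourly_buckets`:

```python
from datetime import datetime, timedelta


def hourly_buckets(interval: str) -> int:
    """Count the one-hour time chunks in an ISO 8601 'start/end' interval."""
    start_text, end_text = interval.split("/")
    start = datetime.fromisoformat(start_text)
    end = datetime.fromisoformat(end_text)
    return (end - start) // timedelta(hours=1)


print(hourly_buckets("2015-09-12/2015-09-13"))
```

Treat the result as the number of time chunks, not a guaranteed segment count: an hour with no rows produces no segment, so the count on your cluster is what the metadata query reports.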

For demonstration, let's view the segments generated for the `wikipedia_hour` datasource before any deletion is made. Run the following cell to set the endpoint to `/druid/v2/sql/`. For more information on this endpoint, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).

Using this endpoint, you can query the `sys` [metadata table](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#system-schema).

In [None]:
endpoint = druid_host + '/druid/v2/sql'
endpoint

Now, you can query the metadata table to retrieve segment information. The following cell sends a SQL query to retrieve `segment_id` information for the `wikipedia_hour` datasource. This tutorial sets the `resultFormat` to `objectLines`. This helps format the response with newlines and makes it easier to parse the output.

In [None]:
payload = json.dumps({
  "query": "SELECT segment_id FROM sys.segments WHERE \"datasource\" = 'wikipedia_hour'",
  "resultFormat": "objectLines"
})
headers = {
  'Content-Type': 'application/json'
}
 
response = requests.request("POST", endpoint, headers=headers, data=payload)

print(response.text)
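Because `objectLines` puts one JSON object on each line, the response text is easy to turn into a Python list. The sketch below assumes that layout and skips blank lines. The helper name is hypothetical, and the sample text reuses segment IDs shown in this tutorial.

```python
import json
from typing import List


def segment_ids_from_object_lines(text: str) -> List[str]:
    """Parse an objectLines response into a list of segment_id strings."""
    ids = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines, including any trailing one
        ids.append(json.loads(line)["segment_id"])
    return ids


sample = (
    '{"segment_id":"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z"}\n'
    '{"segment_id":"wikipedia_hour_2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z_2023-08-07T21:36:29.244Z"}\n'
    "\n"
)
print(len(segment_ids_from_object_lines(sample)))
```

In the notebook, `segment_ids_from_object_lines(response.text)` would give you the IDs to count or compare between steps.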

Observe the response retrieved from the previous cell. In total, there are 24 `segment_id` values, each containing the datasource name `wikipedia_hour` along with the start and end of its hour interval. The tail end of the ID also contains a timestamp from the time of ingestion.

For this tutorial, we are concerned with observing the start and end interval for each `segment_id`. 

For example: 
`{"segment_id":"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z"}` indicates this segment contains data from `2015-09-12T00:00:00.000Z` to `2015-09-12T01:00:00.000Z`.
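To work with these parts programmatically, you can split the ID from the right, since the datasource name itself may contain underscores. This is a sketch rather than an official parser: `parse_segment_id` and the `SegmentId` tuple are hypothetical names, the trailing timestamp is called `version` here, and the handling of an optional numeric partition suffix is an assumption.

```python
from typing import NamedTuple, Optional


class SegmentId(NamedTuple):
    datasource: str
    start: str
    end: str
    version: str
    partition: Optional[int]


def parse_segment_id(segment_id: str) -> SegmentId:
    """Split a segment ID into its parts, reading from the right-hand side."""
    parts = segment_id.split("_")
    partition = None
    if parts[-1].isdigit():
        # Assumption: a trailing all-digit part is a partition number.
        partition = int(parts.pop())
    version = parts.pop()
    end = parts.pop()
    start = parts.pop()
    # Whatever remains is the datasource name, underscores included.
    return SegmentId("_".join(parts), start, end, version, partition)


parsed = parse_segment_id(
    "wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z"
)
print(parsed.datasource, parsed.start, parsed.end)
```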

## Deletion steps

Permanent deletion of a segment in Apache Druid has two steps:

1. A segment is marked as "unused." This step occurs when a segment is dropped by a [drop rule](https://druid.apache.org/docs/latest/operations/rule-configuration.html#set-retention-rules) or manually marked as "unused" through the Coordinator API or web console. Note that marking a segment as "unused" is a soft delete: the segment is no longer available for querying, but the segment files remain in deep storage and segment records remain in the metadata store.
2. A kill task is sent to permanently remove "unused" segments. This deletes the segment file from deep storage and removes its record from the metadata store. This is a hard delete: the data is unrecoverable unless you have a backup.
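The two steps can be pictured with a toy in-memory model. This is not Druid code: the `State` enum and both functions are hypothetical and exist only to show that a kill removes segments that were marked "unused" first and leaves used segments alone.

```python
from enum import Enum


class State(Enum):
    USED = "used"
    UNUSED = "unused"


def mark_unused(segments: dict, ids) -> int:
    """Soft delete: flip matching used segments to unused and count the changes."""
    changed = 0
    for segment_id in ids:
        if segments.get(segment_id) is State.USED:
            segments[segment_id] = State.UNUSED
            changed += 1
    return changed


def kill_unused(segments: dict) -> list:
    """Hard delete: remove every unused segment and return the removed IDs."""
    doomed = [sid for sid, state in segments.items() if state is State.UNUSED]
    for segment_id in doomed:
        del segments[segment_id]
    return doomed


segments = {"a": State.USED, "b": State.USED, "c": State.USED}
changed = mark_unused(segments, ["b"])
removed = kill_unused(segments)
print(changed, removed, sorted(segments))
```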

## Delete by time interval

Segments can be deleted in a specified time interval. This begins with marking all segments in the interval as "unused", then sending a kill request to delete them permanently from deep storage.

First, set the endpoint variable to the Coordinator API endpoint `/druid/coordinator/v1/datasources/:dataSource/markUnused`. Since the datasource ingested is `wikipedia_hour`, let's specify that in the endpoint.

In [None]:
endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'
endpoint

The following cell constructs a JSON payload with the interval of the segments to delete. This marks the segments from `18:00:00.000` up to, but not including, `20:00:00.000` as "unused." The payload is sent to the endpoint in a `POST` request.

In [None]:
payload = json.dumps({
  "interval": "2015-09-12T18:00:00.000Z/2015-09-12T20:00:00.000Z"
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", endpoint, headers=headers, data=payload)

print(response.text)

The response from the above cell should return a JSON object with the property `"numChangedSegments"` and the value `2`. This refers to the following segments:

* `{"segment_id":"wikipedia_hour_2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z_2023-08-07T21:36:29.244Z"}`
* `{"segment_id":"wikipedia_hour_2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z_2023-08-07T21:36:29.244Z"}`
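To see why exactly two segments change, you can model the hourly segments and the requested interval directly. The sketch below assumes that a segment is selected when it lies entirely inside the interval and that the end of the interval is exclusive. Both helper names are hypothetical.

```python
from datetime import datetime, timedelta


def hourly_segments(day_start, hours=24):
    """Build (start, end) pairs for consecutive one-hour segments."""
    step = timedelta(hours=1)
    return [(day_start + i * step, day_start + (i + 1) * step) for i in range(hours)]


def segments_within(segments, start, end):
    """Keep segments that lie entirely inside the interval from start to end."""
    return [(s, e) for s, e in segments if start <= s and e <= end]


selected = segments_within(
    hourly_segments(datetime(2015, 9, 12)),
    datetime(2015, 9, 12, 18),
    datetime(2015, 9, 12, 20),
)
for s, e in selected:
    print(s.isoformat(), "to", e.isoformat())
```

The segment that starts at `20:00` is not selected because its end, `21:00`, falls outside the interval.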

Next, verify that the segments have been soft deleted. The following cell sets the endpoint variable to `/druid/v2/sql` and sends a `POST` request querying for the existing `segment_id`s. 

In [None]:
endpoint = druid_host + '/druid/v2/sql'
payload = json.dumps({
  "query": "SELECT segment_id FROM sys.segments WHERE \"datasource\" = 'wikipedia_hour'",
  "resultFormat": "objectLines"
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", endpoint, headers=headers, data=payload)

print(response.text)

Observe the response above. There should now be only 22 segments, and the "unused" segments have been soft deleted. 

However, because you've only soft deleted the segments, they remain in deep storage.

Before permanently deleting the segments, let's observe how this can change in deep storage. This step is optional. You can move on to the next set of cells without completing it.

[OPTIONAL] If you are running Druid outside the Docker Compose environment, follow these instructions to retrieve segments from deep storage:

* Navigate to the distribution directory for Druid. This is the same place where you run `./bin/start-druid` to start up Druid.
* Run this command: `ls -l1 var/druid/segments/wikipedia_hour/`.

[OPTIONAL] If you are running Druid within the Docker Compose environment, follow these instructions to retrieve segments from deep storage:

* Navigate to your Docker terminal.
* Run this command: `docker exec -it historical ls /opt/shared/segments/wikipedia_hour`

The output should look similar to this:

```bash
$ ls -l1 var/druid/segments/wikipedia_hour/
2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z
2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z
2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z
2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z
2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z
2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z
2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z
2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z
2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z
2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z
2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z
2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z
2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z
2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z
2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z
2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z
2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z
2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z
2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z
2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z
2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z
2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z
```

Now, you can move on to sending a kill task to permanently delete the segments from deep storage. This can be done with the `/druid/coordinator/v1/datasources/:dataSource/intervals/:interval` endpoint.

The following cell uses the endpoint, setting the `dataSource` path parameter as `wikipedia_hour` with the interval `2015-09-12_2015-09-13`. 

Notice that the interval is set to `2015-09-12_2015-09-13`, which covers every segment in the datasource, including the 22 that are still in use. Druid will only permanently delete the "unused" segments within this interval.

In [None]:
endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'
endpoint
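Note that the interval is written with a `/` in the `markUnused` JSON payload but with a `_` in this URL path, because a slash would otherwise be read as a path separator. A small sketch of that conversion, using a hypothetical helper `kill_path`:

```python
def kill_path(datasource: str, interval: str) -> str:
    """Build the Coordinator path for deleting unused segments in an interval."""
    # The payload form 'start/end' becomes 'start_end' inside the URL path.
    path_interval = interval.replace("/", "_")
    return f"/druid/coordinator/v1/datasources/{datasource}/intervals/{path_interval}"


print(kill_path("wikipedia_hour", "2015-09-12/2015-09-13"))
```

You could then build the full URL with `druid_host + kill_path(...)`.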

Run the next cell to send the `DELETE` request.

In [None]:
response = requests.request("DELETE", endpoint)
print(response.status_code)

Last, observe that the segments have been deleted from deep storage in the following sample output. 

```bash
$ ls -l1 var/druid/segments/wikipedia_hour/
2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z
2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z
2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z
2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z
2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z
2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z
2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z
2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z
2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z
2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z
2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z
2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z
2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z
2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z
2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z
2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z
2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z
2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z
2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z
2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z
```

## Delete entire table

You can delete entire tables the same way you can delete parts of a table, using intervals.
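If you do not know the full time range of a table in advance, one option is to derive it from the segment IDs. The sketch below is an illustration under assumptions: `covering_interval` is a hypothetical helper, and it relies on every timestamp in the ID using the same UTC format shown in this tutorial, so that plain string comparison matches chronological order.

```python
import re

# Matches timestamps such as 2015-09-12T00:00:00.000Z.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z")


def covering_interval(segment_ids):
    """Return a 'start/end' interval that spans every listed segment."""
    starts, ends = [], []
    for segment_id in segment_ids:
        stamps = TIMESTAMP.findall(segment_id)
        # The last three timestamps are the start, the end, and the trailing one.
        starts.append(stamps[-3])
        ends.append(stamps[-2])
    return f"{min(starts)}/{max(ends)}"


ids = [
    "wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z",
    "wikipedia_hour_2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z_2023-08-07T21:36:29.244Z",
]
print(covering_interval(ids))
```

The function raises `ValueError` for an empty list, which is a reasonable signal that there is nothing to delete.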

Run the following cell to reset the endpoint to `/druid/coordinator/v1/datasources/:dataSource/markUnused`.

In [None]:
endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'
endpoint

Next, send a `POST` with the payload `{"interval": "2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z"}` to mark the entirety of the table as "unused."

In [None]:
payload = json.dumps({
  "interval": "2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z"
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", endpoint, headers=headers, data=payload)

print(response.status_code)

To verify the segment changes, the following cell sets the endpoint to `/druid/v2/sql` and sends a SQL-based request.

In [None]:
endpoint = druid_host + '/druid/v2/sql'
payload = json.dumps({
  "query": "SELECT segment_id FROM sys.segments WHERE \"datasource\" = 'wikipedia_hour'",
  "resultFormat": "objectLines"
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", endpoint, headers=headers, data=payload)

Run the next cell to view the response. You should see that `response.text` returns nothing, but `response.status_code` returns a 200.

The response should return the remaining segments, but since the table was deleted, there are no segments to return.

In [None]:
print(response.text)
print(response.status_code)

So far, you've soft deleted the table. Run the following cells to permanently delete the table from deep storage:

In [None]:
endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'
endpoint

In [None]:
response = requests.request("DELETE", endpoint)
print(response.status_code)

## Delete by segment ID

In addition to deleting by interval, you can delete segments by using `segment_id`. Let's load in some new data to work with.

Run the following cell to ingest a new set of data for `wikipedia_hour`. 

In [None]:
endpoint = druid_host + '/druid/indexer/v1/task'
payload = json.dumps({
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia_hour",
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "dimensionsSpec": {
        "useSchemaDiscovery": True
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "hour",
        "queryGranularity": "none",
        "intervals": [
          "2015-09-12/2015-09-13"
        ],
        "rollup": False
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "quickstart/tutorial/",
        "filter": "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat": {
        "type": "json"
      },
      "appendToExisting": False
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000,
      "maxRowsInMemory": 25000
    }
  }
})

headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", endpoint, headers=headers, data=payload)
                            
print(response.text)

Now that you have a brand new datasource to work with, let's view the segment information for it.

Run the next cell to retrieve the `segment_id` of each segment.

In [None]:
endpoint = druid_host + '/druid/v2/sql'
payload = json.dumps({
  "query": "SELECT segment_id FROM sys.segments WHERE \"datasource\" = 'wikipedia_hour'",
  "resultFormat": "objectLines"
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", endpoint, headers=headers, data=payload)

print(response.text)

With known `segment_id` values, you can mark specific segments "unused" by sending a request to the `/druid/coordinator/v1/datasources/wikipedia_hour/markUnused` endpoint with an array of `segment_id` values.

In [None]:
endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'
endpoint

In the next cell, construct a payload with a `segmentIds` property that holds an array of `segment_id` values. This payload should mark the segments for the intervals `01:00:00.000` to `02:00:00.000` and `05:00:00.000` to `06:00:00.000` as "unused."

Fill in the `segmentIds` array with the `segment_id` corresponding to these intervals, then run the cell.
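Instead of copying the IDs by hand, you could pick them out of the list returned by the previous query. The following sketch matches on the `_<start>_<end>_` portion of each ID. The helper name is hypothetical, and the sample IDs use an illustrative trailing timestamp, so the IDs on your cluster will differ.

```python
def pick_segment_ids(segment_ids, intervals):
    """Return the IDs whose start and end match one of the (start, end) pairs."""
    tokens = [f"_{start}_{end}_" for start, end in intervals]
    return [sid for sid in segment_ids if any(token in sid for token in tokens)]


# Illustrative IDs for the first eight hours; the last timestamp is made up.
version = "2023-08-07T21:36:29.244Z"
sample_ids = [
    f"wikipedia_hour_2015-09-12T{h:02d}:00:00.000Z_2015-09-12T{h + 1:02d}:00:00.000Z_{version}"
    for h in range(8)
]

picked = pick_segment_ids(
    sample_ids,
    [
        ("2015-09-12T01:00:00.000Z", "2015-09-12T02:00:00.000Z"),
        ("2015-09-12T05:00:00.000Z", "2015-09-12T06:00:00.000Z"),
    ],
)
for sid in picked:
    print(sid)
```

The resulting list can be passed as the value of `segmentIds` in the payload.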

In [None]:
payload = json.dumps({
  "segmentIds": [
    "",
    ""
  ]
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", endpoint, headers=headers, data=payload)

print(response.text)

You should see a response with the `numChangedSegments` property and the value `2` for the two segments marked as "unused."

Run the cell below to view changes in the datasource's segments.

In [None]:
endpoint = druid_host + '/druid/v2/sql'
payload = json.dumps({
  "query": "SELECT segment_id FROM sys.segments WHERE \"datasource\" = 'wikipedia_hour'",
  "resultFormat": "objectLines"
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", endpoint, headers=headers, data=payload)

print(response.text)

Last, run the following cells to permanently delete the segments from deep storage.

In [None]:
endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'
endpoint

In [None]:
response = requests.request("DELETE", endpoint)
print(response.status_code)