# Tutorial: Druid SQL segment sizing and partitioning

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
  
Partitioning is a method of organizing a large datasource into independent partitions.
Partitioning can reduce the size of your data and improve query performance.

At ingestion, Apache Druid always partitions its data by time.
Each time chunk is then divided into one or more [segments](https://druid.apache.org/docs/latest/design/segments.html).

This tutorial describes how to configure partitioning for the Druid SQL ingestion method. For information about partitioning configurations supported by other ingestion methods, see [How to configure partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning).

## Prerequisites

Make sure that you meet the requirements outlined in the README.md file of the [apache/druid repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).
Specifically, you need the following:
- Knowledge of SQL
- [Python3](https://www.python.org/downloads/)
- [The `requests` package for Python](https://requests.readthedocs.io/en/latest/user/install/)
- [JupyterLab](https://jupyter.org/install#jupyterlab) (recommended) or [Jupyter Notebook](https://jupyter.org/install#jupyter-notebook) running on a non-default port. Druid and Jupyter both default to port `8888`, so you need to start Jupyter on a different port. 
- An available Druid instance. This tutorial uses the `micro-quickstart` configuration described in the [Druid quickstart](https://druid.apache.org/docs/latest/tutorials/index.html), so no authentication or authorization is required unless explicitly mentioned. If you haven’t already, download Druid version 24.0 or higher and start Druid services as described in the quickstart.

## Prepare your environment

Start by running the following cell. It imports the required Python packages and defines variables for the Druid host and the datasource name.

In [None]:
import requests
import json

# druid_host is the hostname and port for your Druid deployment.
# In a distributed environment, use the Router service as the `druid_host`.
druid_host = "http://localhost:8888"
dataSourceName = "partitioning-tutorial"
print(f"\033[1mDruid host\033[0m: {druid_host}")

In the rest of the tutorial, each cell reassigns the `endpoint`, `http_method`, and `payload` variables to accomplish a different task.

## Segment size

A segment is the smallest unit of storage in Druid.
Optimize your segment file size at ingestion time so that Druid operates well under a heavy query load.

Consider the following to optimize your segment file size:

- The number of rows per segment should be around five million. You can set the number of rows per segment using the `rowsPerSegment` query context parameter in the [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html) or in a [JDBC connection properties object](https://druid.apache.org/docs/latest/querying/sql-jdbc.html). To specify the `rowsPerSegment` parameter in the Druid web console, navigate to the **Query** page, then click **Engine > Edit context** to bring up the **Edit query context** dialog. For more information on how to specify query context parameters, see [Setting the query context](https://druid.apache.org/docs/latest/querying/sql-query-context.html#setting-the-query-context).
- Segment file size should be within the range of 300-700 MB. The number of rows per segment takes precedence over the segment byte size. 

For more information on segment sizing, see [Segment size optimization](https://druid.apache.org/docs/latest/operations/segment-optimization.html).
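
As a minimal sketch of the first point, the query context can be sent in the same JSON body as the query. The helper name `build_task_payload` and the placeholder query string are assumptions made for illustration; `rowsPerSegment` and `maxNumTasks` are the context parameters referenced in this tutorial.

```python
import json

def build_task_payload(query, rows_per_segment=5_000_000, max_num_tasks=3):
    """Return a JSON request body that carries the query and its context.

    build_task_payload is a hypothetical helper, not part of any Druid client.
    """
    return json.dumps({
        "query": query,
        "context": {
            "maxNumTasks": max_num_tasks,
            "rowsPerSegment": rows_per_segment,
        },
    })

# Placeholder text: substitute a complete INSERT or REPLACE statement.
placeholder_query = "<your INSERT or REPLACE statement>"
print(build_task_payload(placeholder_query))
```

The resulting string can be passed as the `data` argument of the `requests` calls used in the cells of this tutorial.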

## PARTITIONED BY

In Druid SQL, the granularity of a segment is defined by the granularity of the PARTITIONED BY clause.

[INSERT](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#insert) and [REPLACE](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#replace) statements both require the PARTITIONED BY clause.

PARTITIONED BY accepts the following time granularity arguments:
- `time_unit`
- `TIME_FLOOR(__time, period)` 
- `FLOOR(__time TO time_unit)`
- `ALL` or `ALL TIME`

Continue reading to learn about each of the supported arguments.
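
The ingestion statements in this tutorial differ only in their final PARTITIONED BY argument. The sketch below shows one way to assemble such a statement from a given argument. The helper name `build_ingest_query` is hypothetical; the source URI and column list are copied from the cells in this tutorial.

```python
import json

# Column names from the EXTERN signature used throughout this tutorial.
COLUMNS = [
    "added", "channel", "cityName", "comment", "commentLength",
    "countryIsoCode", "countryName", "deleted", "delta", "deltaBucket",
    "diffUrl", "flags", "isAnonymous", "isMinor", "isNew", "isRobot",
    "isUnpatrolled", "metroCode", "namespace", "page", "regionIsoCode",
    "regionName", "timestamp", "user",
]
LONG_COLUMNS = {"added", "commentLength", "deleted", "delta"}
SOURCE_URI = "https://druid.apache.org/data/wikipedia.json.gz"

def build_ingest_query(partitioned_by, table="partitioning-tutorial", replace=True):
    """Build an INSERT or REPLACE statement ending in the given PARTITIONED BY argument."""
    input_source = json.dumps({"type": "http", "uris": [SOURCE_URI]})
    input_format = json.dumps({"type": "json"})
    signature = json.dumps(
        [{"name": c, "type": "long" if c in LONG_COLUMNS else "string"} for c in COLUMNS]
    )
    target = f'REPLACE INTO "{table}" OVERWRITE ALL' if replace else f'INSERT INTO "{table}"'
    return (
        f"{target} "
        'SELECT TIME_PARSE("timestamp") AS __time, * '
        f"FROM TABLE(EXTERN('{input_source}', '{input_format}', '{signature}')) "
        f"PARTITIONED BY {partitioned_by}"
    )

# Print only the tail of each statement to show the four argument forms.
for clause in ("DAY", "TIME_FLOOR(__time, 'PT30M')", "FLOOR(__time TO HOUR)", "ALL"):
    query = build_ingest_query(clause)
    print(query[query.rindex("PARTITIONED BY"):])
```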

### Time unit

`PARTITIONED BY time_unit`. Partition by `SECOND`, `MINUTE`, `HOUR`, `DAY`, `WEEK`, `MONTH`, `QUARTER`, or `YEAR`.

For example, run the following cell to ingest data from an external source into a table named `partitioning-tutorial` and partition the datasource by `DAY`:

In [None]:
endpoint = "/druid/v2/sql/task"
print(f"\033[1mQuery endpoint\033[0m: {druid_host+endpoint}")
http_method = "POST"

# If you already have an existing datasource named partitioning-tutorial, use REPLACE INTO instead of INSERT INTO.
payload = json.dumps({
"query": "INSERT INTO \"partitioning-tutorial\" SELECT TIME_PARSE(\"timestamp\") \
          AS __time, * FROM TABLE \
          (EXTERN('{\"type\": \"http\", \"uris\": [\"https://druid.apache.org/data/wikipedia.json.gz\"]}', '{\"type\": \"json\"}', '[{\"name\": \"added\", \"type\": \"long\"}, {\"name\": \"channel\", \"type\": \"string\"}, {\"name\": \"cityName\", \"type\": \"string\"}, {\"name\": \"comment\", \"type\": \"string\"}, {\"name\": \"commentLength\", \"type\": \"long\"}, {\"name\": \"countryIsoCode\", \"type\": \"string\"}, {\"name\": \"countryName\", \"type\": \"string\"}, {\"name\": \"deleted\", \"type\": \"long\"}, {\"name\": \"delta\", \"type\": \"long\"}, {\"name\": \"deltaBucket\", \"type\": \"string\"}, {\"name\": \"diffUrl\", \"type\": \"string\"}, {\"name\": \"flags\", \"type\": \"string\"}, {\"name\": \"isAnonymous\", \"type\": \"string\"}, {\"name\": \"isMinor\", \"type\": \"string\"}, {\"name\": \"isNew\", \"type\": \"string\"}, {\"name\": \"isRobot\", \"type\": \"string\"}, {\"name\": \"isUnpatrolled\", \"type\": \"string\"}, {\"name\": \"metroCode\", \"type\": \"string\"}, {\"name\": \"namespace\", \"type\": \"string\"}, {\"name\": \"page\", \"type\": \"string\"}, {\"name\": \"regionIsoCode\", \"type\": \"string\"}, {\"name\": \"regionName\", \"type\": \"string\"}, {\"name\": \"timestamp\", \"type\": \"string\"}, {\"name\": \"user\", \"type\": \"string\"}]')) \
          PARTITIONED BY DAY",
  "context": {
    "maxNumTasks": 3
  }
})

headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
ingestion_taskId_response = response
ingestion_taskId = json.loads(ingestion_taskId_response.text)['taskId']

print(f"\033[1mQuery\033[0m:\n" + payload)
print(f"\nInserting data into the table named {dataSourceName}")
print("\nThe response includes the task ID and the status: " + response.text + ".")

To check on the status of your ingestion task, run the following cell. 

In [None]:
import time

endpoint = f"/druid/indexer/v1/task/{ingestion_taskId}/status"
print(f"\033[1mQuery endpoint\033[0m: {druid_host+endpoint}")
http_method = "GET"

payload = {}
headers = {}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
ingestion_status = json.loads(response.text)['status']['status']
# If you only want to fetch the status once and print it, 
# uncomment the print statement and comment out the if and while loops
# print(json.dumps(response.json(), indent=4))

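if ingestion_status == "RUNNING":
  print("The ingestion is running...")

# Poll until the task reaches a terminal state. Stopping on FAILED as well as
# SUCCESS keeps the loop from running forever when the ingestion fails.
while ingestion_status not in ("SUCCESS", "FAILED"):
  time.sleep(15)
  response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
  ingestion_status = json.loads(response.text)['status']['status']

if ingestion_status == "SUCCESS":
  print("The ingestion is complete:")
else:
  print("The ingestion failed:")
print(json.dumps(response.json(), indent=4))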

### TIME_FLOOR

`PARTITIONED BY(TIME_FLOOR(__time, period))`. Partition by a timestamp rounded down to the specified period.

`period` can be any of the following ISO 8601 periods:
- `PT1S`: one second
- `PT1M`: one minute
- `PT5M`: five minutes
- `PT10M`: ten minutes
- `PT15M`: fifteen minutes
- `PT30M`: thirty minutes
- `PT1H`: one hour
- `PT6H`: six hours
- `PT8H`: eight hours 
- `P1D`: one day
- `P1W`: one week
- `P1M`: one month
- `P3M`: three months
- `P1Y`: one year
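
To make the rounding concrete, the following sketch floors a timestamp to a fixed-length period in plain Python. It illustrates the idea only and is not Druid's implementation: `floor_to_period` is a hypothetical helper, the timestamp is an arbitrary example, and the sketch assumes buckets aligned to the Unix epoch, which suits sub-day periods such as `PT30M` or `PT6H`. Calendar periods such as `P1W`, `P1M`, or `P1Y` are not covered.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def floor_to_period(ts, period):
    """Round a timezone-aware datetime down to a multiple of `period` since the epoch."""
    return ts - ((ts - EPOCH) % period)

ts = datetime(2016, 6, 27, 0, 47, 12, tzinfo=timezone.utc)
print(floor_to_period(ts, timedelta(minutes=30)))  # 2016-06-27 00:30:00+00:00
print(floor_to_period(ts, timedelta(hours=6)))     # 2016-06-27 00:00:00+00:00
```

Every row whose timestamp floors to the same value lands in the same time chunk.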

Run the following cell to partition the `partitioning-tutorial` datasource by a timestamp rounded down to thirty-minute intervals:

In [None]:
endpoint = "/druid/v2/sql/task"
print(f"\033[1mQuery endpoint\033[0m: {druid_host+endpoint}")
http_method = "POST"

payload = json.dumps({
"query": "REPLACE INTO \"partitioning-tutorial\" OVERWRITE ALL SELECT TIME_PARSE(\"timestamp\") \
          AS __time, * FROM TABLE \
          (EXTERN('{\"type\": \"http\", \"uris\": [\"https://druid.apache.org/data/wikipedia.json.gz\"]}', '{\"type\": \"json\"}', '[{\"name\": \"added\", \"type\": \"long\"}, {\"name\": \"channel\", \"type\": \"string\"}, {\"name\": \"cityName\", \"type\": \"string\"}, {\"name\": \"comment\", \"type\": \"string\"}, {\"name\": \"commentLength\", \"type\": \"long\"}, {\"name\": \"countryIsoCode\", \"type\": \"string\"}, {\"name\": \"countryName\", \"type\": \"string\"}, {\"name\": \"deleted\", \"type\": \"long\"}, {\"name\": \"delta\", \"type\": \"long\"}, {\"name\": \"deltaBucket\", \"type\": \"string\"}, {\"name\": \"diffUrl\", \"type\": \"string\"}, {\"name\": \"flags\", \"type\": \"string\"}, {\"name\": \"isAnonymous\", \"type\": \"string\"}, {\"name\": \"isMinor\", \"type\": \"string\"}, {\"name\": \"isNew\", \"type\": \"string\"}, {\"name\": \"isRobot\", \"type\": \"string\"}, {\"name\": \"isUnpatrolled\", \"type\": \"string\"}, {\"name\": \"metroCode\", \"type\": \"string\"}, {\"name\": \"namespace\", \"type\": \"string\"}, {\"name\": \"page\", \"type\": \"string\"}, {\"name\": \"regionIsoCode\", \"type\": \"string\"}, {\"name\": \"regionName\", \"type\": \"string\"}, {\"name\": \"timestamp\", \"type\": \"string\"}, {\"name\": \"user\", \"type\": \"string\"}]')) \
          PARTITIONED BY TIME_FLOOR(__time, 'PT30M')",
  "context": {
    "maxNumTasks": 3
  }
})

headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
ingestion_taskId_response = response
ingestion_taskId = json.loads(response.text)['taskId']

print(f"\033[1mQuery\033[0m:\n" + payload)
print(f"\nReplacing data in the table named {dataSourceName}")
print("\nThe response includes the task ID and the status: " + response.text + ".")
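
The task submitted above runs asynchronously, so the status check shown in the Time unit section applies here as well. One way to reuse it after each statement is to wrap the polling loop in a function. The sketch below is an illustration under assumptions: `wait_for_task` is a hypothetical helper, it treats `SUCCESS` and `FAILED` as the terminal states, and it takes the status lookup as a callable so the loop can be tried without a running Druid instance.

```python
import time

TERMINAL_STATES = {"SUCCESS", "FAILED"}

def wait_for_task(get_status, poll_seconds=15, max_polls=40):
    """Call get_status() until it returns a terminal state or max_polls is reached."""
    status = get_status()
    for _ in range(max_polls - 1):
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
        status = get_status()
    return status

# Dry run with a canned sequence of statuses instead of a live task.
fake_statuses = iter(["RUNNING", "RUNNING", "SUCCESS"])
print(wait_for_task(lambda: next(fake_statuses), poll_seconds=0))  # SUCCESS

# In the notebook, pass a function that reads the task status endpoint, for example:
# status_url = f"{druid_host}/druid/indexer/v1/task/{ingestion_taskId}/status"
# wait_for_task(lambda: requests.get(status_url).json()["status"]["status"])
```

The `max_polls` cap keeps the loop from running indefinitely if the task never reaches a terminal state.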

### FLOOR

`PARTITIONED BY(FLOOR(__time TO time_unit))`. Partition by the timestamp rounded down to the start of the specified time unit, where `time_unit` can be any of the following values: `SECOND`, `MINUTE`, `HOUR`, `DAY`, `WEEK`, `MONTH`, `QUARTER`, `YEAR`.
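
As an illustration of rounding down to a calendar unit, the sketch below truncates a Python `datetime` to the start of a unit. It is not Druid's implementation: `floor_to_unit` is a hypothetical helper, the timestamp is an arbitrary example, and `WEEK` and `QUARTER` are left out to keep the sketch short.

```python
from datetime import datetime, timezone

# Fields to reset, ordered from coarsest to finest, with their starting values.
_FIELDS = ["month", "day", "hour", "minute", "second", "microsecond"]
_START = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0, "microsecond": 0}
# Index of the first field to reset for each supported unit.
_FIRST_RESET = {"YEAR": 0, "MONTH": 1, "DAY": 2, "HOUR": 3, "MINUTE": 4, "SECOND": 5}

def floor_to_unit(ts, unit):
    """Truncate a datetime to the start of the given calendar unit."""
    fields = _FIELDS[_FIRST_RESET[unit]:]
    return ts.replace(**{name: _START[name] for name in fields})

ts = datetime(2016, 6, 27, 13, 47, 12, tzinfo=timezone.utc)
print(floor_to_unit(ts, "HOUR"))   # 2016-06-27 13:00:00+00:00
print(floor_to_unit(ts, "MONTH"))  # 2016-06-01 00:00:00+00:00
```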

Run the following cell to partition the `partitioning-tutorial` datasource by a timestamp rounded down to the hour:

In [None]:
endpoint = "/druid/v2/sql/task"
print(f"\033[1mQuery endpoint\033[0m: {druid_host+endpoint}")
http_method = "POST"

payload = json.dumps({
"query": "REPLACE INTO \"partitioning-tutorial\" OVERWRITE ALL SELECT TIME_PARSE(\"timestamp\") \
          AS __time, * FROM TABLE \
          (EXTERN('{\"type\": \"http\", \"uris\": [\"https://druid.apache.org/data/wikipedia.json.gz\"]}', '{\"type\": \"json\"}', '[{\"name\": \"added\", \"type\": \"long\"}, {\"name\": \"channel\", \"type\": \"string\"}, {\"name\": \"cityName\", \"type\": \"string\"}, {\"name\": \"comment\", \"type\": \"string\"}, {\"name\": \"commentLength\", \"type\": \"long\"}, {\"name\": \"countryIsoCode\", \"type\": \"string\"}, {\"name\": \"countryName\", \"type\": \"string\"}, {\"name\": \"deleted\", \"type\": \"long\"}, {\"name\": \"delta\", \"type\": \"long\"}, {\"name\": \"deltaBucket\", \"type\": \"string\"}, {\"name\": \"diffUrl\", \"type\": \"string\"}, {\"name\": \"flags\", \"type\": \"string\"}, {\"name\": \"isAnonymous\", \"type\": \"string\"}, {\"name\": \"isMinor\", \"type\": \"string\"}, {\"name\": \"isNew\", \"type\": \"string\"}, {\"name\": \"isRobot\", \"type\": \"string\"}, {\"name\": \"isUnpatrolled\", \"type\": \"string\"}, {\"name\": \"metroCode\", \"type\": \"string\"}, {\"name\": \"namespace\", \"type\": \"string\"}, {\"name\": \"page\", \"type\": \"string\"}, {\"name\": \"regionIsoCode\", \"type\": \"string\"}, {\"name\": \"regionName\", \"type\": \"string\"}, {\"name\": \"timestamp\", \"type\": \"string\"}, {\"name\": \"user\", \"type\": \"string\"}]')) \
          PARTITIONED BY FLOOR(__time TO HOUR)",
  "context": {
    "maxNumTasks": 3
  }
})

headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
ingestion_taskId_response = response
ingestion_taskId = json.loads(response.text)['taskId']

print(f"\033[1mQuery\033[0m:\n" + payload)
print(f"\nReplacing data in the table named {dataSourceName}")
print("\nThe response includes the task ID and the status: " + response.text + ".")

### ALL and ALL TIME

`PARTITIONED BY ALL` or `PARTITIONED BY ALL TIME`. Disable time partitioning by placing all data in a single time chunk.

PARTITIONED BY ALL and PARTITIONED BY ALL TIME clauses are suitable for datasets that do not have a primary timestamp. In this case, Druid creates a `__time` column in your Druid datasource and sets all timestamps to `1970-01-01T00:00:00Z`.

> To use LIMIT or OFFSET at the outer level of your INSERT or REPLACE query, you must set PARTITIONED BY to ALL or ALL TIME.
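
The rule in the note can be sketched as a small guard that appends an outer LIMIT only when the partitioning argument is `ALL` or `ALL TIME`. The helper name `add_outer_limit` and the placeholder SELECT text are assumptions for illustration.

```python
def add_outer_limit(select_sql, limit, partitioned_by="ALL"):
    """Append LIMIT and PARTITIONED BY to a SELECT, enforcing the rule in the note."""
    if partitioned_by.upper() not in ("ALL", "ALL TIME"):
        raise ValueError("An outer LIMIT requires PARTITIONED BY ALL or ALL TIME")
    return f"{select_sql} LIMIT {int(limit)} PARTITIONED BY {partitioned_by}"

# Placeholder SELECT: substitute the SELECT ... FROM TABLE(EXTERN(...)) text from the cells above.
statement = 'REPLACE INTO "partitioning-tutorial" OVERWRITE ALL ' + add_outer_limit(
    "SELECT <columns> FROM <source>", 1000
)
print(statement)
```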

Run the following cell to skip time partitioning and place all data into a single time chunk:

In [None]:
endpoint = "/druid/v2/sql/task"
print(f"\033[1mQuery endpoint\033[0m: {druid_host+endpoint}")
http_method = "POST"

payload = json.dumps({
"query": "REPLACE INTO \"partitioning-tutorial\" OVERWRITE ALL SELECT TIME_PARSE(\"timestamp\") \
          AS __time, * FROM TABLE \
          (EXTERN('{\"type\": \"http\", \"uris\": [\"https://druid.apache.org/data/wikipedia.json.gz\"]}', '{\"type\": \"json\"}', '[{\"name\": \"added\", \"type\": \"long\"}, {\"name\": \"channel\", \"type\": \"string\"}, {\"name\": \"cityName\", \"type\": \"string\"}, {\"name\": \"comment\", \"type\": \"string\"}, {\"name\": \"commentLength\", \"type\": \"long\"}, {\"name\": \"countryIsoCode\", \"type\": \"string\"}, {\"name\": \"countryName\", \"type\": \"string\"}, {\"name\": \"deleted\", \"type\": \"long\"}, {\"name\": \"delta\", \"type\": \"long\"}, {\"name\": \"deltaBucket\", \"type\": \"string\"}, {\"name\": \"diffUrl\", \"type\": \"string\"}, {\"name\": \"flags\", \"type\": \"string\"}, {\"name\": \"isAnonymous\", \"type\": \"string\"}, {\"name\": \"isMinor\", \"type\": \"string\"}, {\"name\": \"isNew\", \"type\": \"string\"}, {\"name\": \"isRobot\", \"type\": \"string\"}, {\"name\": \"isUnpatrolled\", \"type\": \"string\"}, {\"name\": \"metroCode\", \"type\": \"string\"}, {\"name\": \"namespace\", \"type\": \"string\"}, {\"name\": \"page\", \"type\": \"string\"}, {\"name\": \"regionIsoCode\", \"type\": \"string\"}, {\"name\": \"regionName\", \"type\": \"string\"}, {\"name\": \"timestamp\", \"type\": \"string\"}, {\"name\": \"user\", \"type\": \"string\"}]')) \
          PARTITIONED BY ALL",
  "context": {
    "maxNumTasks": 3
  }
})

headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
ingestion_taskId_response = response
ingestion_taskId = json.loads(response.text)['taskId']

print(f"\033[1mQuery\033[0m:\n" + payload)
print(f"\nReplacing data in the table named {dataSourceName}")
print("\nThe response includes the task ID and the status: " + response.text + ".")
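
After an ingestion task completes, one way to see the effect of a PARTITIONED BY argument is to list the segments of the datasource. The sketch below builds a request body for the `sys.segments` system table, to be posted to the `/druid/v2/sql` endpoint. The helper name `build_segment_query` is hypothetical, and the datasource name is embedded directly in the SQL text for brevity.

```python
import json

def build_segment_query(datasource):
    """Return a JSON body that lists each segment's interval, row count, and size."""
    sql = (
        'SELECT "start", "end", "num_rows", "size" '
        "FROM sys.segments "
        f"WHERE \"datasource\" = '{datasource}' "
        'ORDER BY "start"'
    )
    return json.dumps({"query": sql})

segment_payload = build_segment_query("partitioning-tutorial")
print(segment_payload)

# In the notebook, send the body to the SQL endpoint, for example:
# response = requests.post(druid_host + "/druid/v2/sql",
#                          headers={"Content-Type": "application/json"},
#                          data=segment_payload)
# print(json.dumps(response.json(), indent=4))
```

With `PARTITIONED BY ALL`, the listed segments should share a single time chunk, whereas finer granularities such as `HOUR` produce one or more segments per hour of data.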

## Learn more

To learn more about Druid segment sizing and partitioning, see the following topics:

- [Segments](https://druid.apache.org/docs/latest/design/segments.html) for general information about segments in Druid. 
- [Partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html) to learn how to set up partitions within a single datasource.
- [Context parameters](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#context-parameters) for context parameters specific to the multi-stage query task engine.