# Tutorial: Basic visualizations using the Druid API

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
  
This tutorial introduces basic visualization options you can use with the Druid API.

It focuses on two Python modules to accomplish visualization tasks: [pandas](https://pandas.pydata.org/) and [Bokeh](https://bokeh.org/).
This tutorial builds on [Learn the basics of the Druid API](api-tutorial.ipynb).

## Table of contents

- [Prerequisites](#Prerequisites)
- [Create a datasource](#Create-a-datasource)
- [Display data with pandas](#Display-data-with-pandas)
- [Display data with a bar graph](#Display-data-with-a-bar-graph)
- [Display data with a line graph](#Display-data-with-a-line-graph)
- [Next steps](#Next-steps)

For the best experience, use JupyterLab so that you can always access the table of contents.

## Prerequisites

This tutorial works with Druid 25.0.0 or later.

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).

If you do not use the Docker Compose environment, you need the following:
* A running Druid instance.
   * Update the `druid_host` variable to point to your Router endpoint. For example, `druid_host = "http://localhost:8888"`.
* The following Python packages:
   * `pandas` for loading query results and displaying them as tables
   * `bokeh` for plotting
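
If you run the notebook outside Docker Compose, one way to avoid editing the cell by hand is to read the host from an environment variable. The following is a minimal sketch of that idea. The variable name `DRUID_HOST` and the helper `resolve_druid_host` are assumptions made for illustration; they are not part of the tutorial environment or of `druidapi`.

```python
import os

def resolve_druid_host(default="http://router:8888"):
    """Return the Router URL from the (hypothetical) DRUID_HOST variable, or the default."""
    host = os.environ.get("DRUID_HOST", default)
    # Strip a trailing slash so that the URL has a consistent form.
    return host.rstrip("/")

druid_host = resolve_druid_host()
print(druid_host)
```

With this approach, the same notebook can point at `http://localhost:8888` on one machine and at the Docker Compose Router on another without code changes.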

To start the tutorial, run the next cell. It imports the Python packages you'll need and defines the Druid host where the Druid Router service listens. The quickstart deployment configures the Router to listen on port `8888` by default, so you'll be making API calls against `http://router:8888`.

In [None]:
import requests, json
import druidapi
import pandas as pd
from bokeh.palettes import Spectral10
from bokeh.io import output_notebook, show
from bokeh.plotting import figure

# druid_host is the hostname and port for your Druid deployment. 
# In the Docker Compose tutorial environment, this is the Router
# service running at "http://router:8888".
# If you are not using the Docker Compose environment, edit `druid_host`.

druid_host = 'http://router:8888'

# Instantiate the Druid API REST client

druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql

## Create a datasource

The following cell defines the query to create the datasource for this tutorial, `wikipedia-vis`.
It uses the multi-stage query (MSQ) task engine to ingest the data and waits for the MSQ task to complete.
While the task runs, an asterisk (`[*]`) appears in the left margin.

In [None]:
sql = '''
INSERT INTO "wikipedia-vis" 
SELECT TIME_PARSE("timestamp") AS __time, * 
FROM TABLE (EXTERN(
    '{"type": "http", "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type": "json"}', 
    '[{"name": "added", "type": "long"}, {"name": "channel", "type": "string"}, {"name": "cityName", "type": "string"}, {"name": "comment", "type": "string"}, {"name": "commentLength", "type": "long"}, {"name": "countryIsoCode", "type": "string"}, {"name": "countryName", "type": "string"}, {"name": "deleted", "type": "long"}, {"name": "delta", "type": "long"}, {"name": "deltaBucket", "type": "string"}, {"name": "diffUrl", "type": "string"}, {"name": "flags", "type": "string"}, {"name": "isAnonymous", "type": "string"}, {"name": "isMinor", "type": "string"}, {"name": "isNew", "type": "string"}, {"name": "isRobot", "type": "string"}, {"name": "isUnpatrolled", "type": "string"}, {"name": "metroCode", "type": "string"}, {"name": "namespace", "type": "string"}, {"name": "page", "type": "string"}, {"name": "regionIsoCode", "type": "string"}, {"name": "regionName", "type": "string"}, {"name": "timestamp", "type": "string"}, {"name": "user", "type": "string"}]'
    ))
PARTITIONED BY DAY
'''
# Run the ingestion as an MSQ task
sql_client.run_task(sql)

# Wait until the new datasource is ready to query
sql_client.wait_until_ready('wikipedia-vis')

## Display data with pandas

By default, when you query Druid using the API, Druid returns the results as JSON. The JSON output is great for programmatic operations, but it is not easy to scan the data visually. This section shows you how to use the Python pandas module to transform JSON query results to tabular format. 

The following cell uses the Druid API SQL client to select the top 10 channels by number of additions and shows the output as JSON. The `query_result` variable is a JSON string that serializes the list of rows returned from Druid.

In [None]:
sql = '''
SELECT channel,
SUM(added) AS additions
FROM "wikipedia-vis" 
GROUP BY channel
ORDER BY additions DESC LIMIT 10
'''

# Run the SQL query
# Serialize the returned list of rows to a JSON string
query_result = json.dumps(sql_client.sql_query(sql).rows)
query_result
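
The raw JSON string is compact, which makes it hard to read. As a small sketch that uses only the standard library, you can re-serialize rows with indentation. The `sample_rows` values below are made-up placeholders that mimic the shape of the query result; they are not real output from Druid.

```python
import json

# Hypothetical rows shaped like the result of the query above.
sample_rows = [
    {"channel": "#example-a", "additions": 300},
    {"channel": "#example-b", "additions": 120},
]

# indent=2 puts each key on its own line, which is easier to scan.
pretty = json.dumps(sample_rows, indent=2)
print(pretty)
```

This helps for a handful of rows, but it does not scale well to larger results, which is where a tabular view becomes useful.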

The Druid API client also includes a `display` helper that formats query results for you. For example:

In [None]:
# Run the query with the Druid API SQL client
display.sql(sql)

You can use the pandas library [`read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html) method to load JSON results from the Druid API into a pandas DataFrame.
Pandas enables you to display the results in a tabular format.
The `orient` parameter sets pandas to accept data as a list of records in the format: `[{column: value, ...}, {column: value, ...}]`.

In [None]:
additions_pd = pd.read_json(query_result, orient='records')

additions_pd
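
To see the shape that `orient='records'` expects without querying Druid, here is a minimal, self-contained sketch. The rows are made-up placeholders. Wrapping the string in `StringIO` is a choice made for this sketch, because recent pandas versions deprecate passing a literal JSON string to `read_json`.

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical rows in "records" form: one dict per row, keyed by column name.
sample_rows = [
    {"channel": "#example-a", "additions": 300},
    {"channel": "#example-b", "additions": 120},
]
sample_json = json.dumps(sample_rows)

# Each record becomes a row; each key becomes a column.
sample_df = pd.read_json(StringIO(sample_json), orient="records")
print(sample_df)
```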

DataFrames are the basis for the Bokeh examples in this tutorial. Another benefit of using a DataFrame is that you can access all the pandas `DataFrame` object methods. For example, `DataFrame.max()`:

In [None]:
additions_pd.max()

Check out the [pandas docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html) for more ideas about how to use pandas and DataFrames with your Druid data.
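
As one illustration of what else a DataFrame offers, the sketch below sorts rows and derives a percentage column. It runs on made-up placeholder data with the same column names as the query result, so you can try it without a Druid connection.

```python
import pandas as pd

# Hypothetical data shaped like the query result.
sample_df = pd.DataFrame(
    {
        "channel": ["#example-a", "#example-b", "#example-c"],
        "additions": [120, 300, 45],
    }
)

# Sort by additions, largest first.
ranked = sample_df.sort_values("additions", ascending=False)

# Share of total additions contributed by each channel, as a percentage.
ranked["share_pct"] = (ranked["additions"] / ranked["additions"].sum() * 100).round(1)
print(ranked)
```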

## Display data with a bar graph

A table is fine for scanning data, but plots often make patterns easier to see. You can use visualization and plotting tools that work with Jupyter Notebooks to show your data as plots or graphs. This section uses [Bokeh](https://bokeh.org/) to illustrate some basic plots using Druid data.

In this section, you create a simple bar chart showing the channels with the most additions during the time range.

First, call `output_notebook()` so that Bokeh renders plots inline in the notebook. Alternatively, Bokeh's `output_file` function writes your plot to an HTML file. If you want to experiment with `output_file`, add it to the list of imports. For example:
```
from bokeh.io import output_notebook, show, output_file
```

In [None]:
output_notebook()
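
If you do want a standalone HTML file, the following sketch shows one way to produce it with Bokeh's `output_file` and `save` functions. The file name `additions.html` and the sample values are placeholders, and the file is written to a temporary directory to keep your workspace clean.

```python
import os
import tempfile

from bokeh.io import output_file, save
from bokeh.plotting import figure

# Placeholder data; in the tutorial you would use values from your DataFrame.
labels = ["#example-a", "#example-b"]
values = [3.0, 1.2]

plot = figure(x_range=labels, height=300, width=400, title="Example plot")
plot.vbar(x=labels, top=values, width=0.5)

# Write the plot to an HTML file instead of rendering it inline.
html_path = os.path.join(tempfile.mkdtemp(), "additions.html")
output_file(html_path, title="Example plot")
saved_path = save(plot)
print(saved_path)
```

Note that `output_file` sets session-wide output state, so it may affect later `show()` calls in the same notebook.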

There are several ways to use Bokeh with a DataFrame. In this case, make a list of channels to serve as the x-axis of the plot. For the y-axis, divide the total additions by 100000 for ease of display.

In [None]:
channels = additions_pd.channel.to_list()
total_additions = [x / 100000 for x in additions_pd.additions.to_list()]

In [None]:
channels

In [None]:
total_additions

Next, initialize the Bokeh plot (figure) with some basic configurations.

In [None]:
# Create a new plot with a title. Set the size in pixels.
bar_plot = figure(height=500, width=750, x_range=channels, title="Additions by channel",
                  toolbar_location=None)

Now, configure the renderer for the vertical bars on the plot:
- Set the x-axis to `channels`, the list of channels.
- Set the `top` coordinate that determines the bar height to `total_additions`.
- For a splash of color, set the `color` to `Spectral10`. 

Note that palettes in Bokeh are lists of colors. When you pass a list of colors, Bokeh expects its length to match the number of data points, which in this case is 10.

See the Bokeh docs for more information on [vertical bars](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.vbar) and [palettes](https://docs.bokeh.org/en/latest/docs/reference/palettes.html). Note the expected output for the cell: `GlyphRenderer(id = 'p1054', …)`

In [None]:
bar_plot.vbar(x=channels, top=total_additions, width=0.5, color=Spectral10)
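
Because the palette length has to match the number of bars, a query that returns more or fewer than 10 rows needs a color list of a different length. The following is a rough sketch of one way to handle that with the standard library. `fit_palette` is a hypothetical helper, and the hex values are placeholders rather than the actual `Spectral10` colors.

```python
from itertools import cycle, islice

def fit_palette(palette, n):
    """Return exactly n colors, repeating the palette if it is too short."""
    if not palette:
        raise ValueError("palette must contain at least one color")
    return list(islice(cycle(palette), n))

# Placeholder hex colors, not a real Bokeh palette.
base_palette = ["#1b9e77", "#d95f02", "#7570b3"]

print(fit_palette(base_palette, 2))  # keeps only the first two colors
print(fit_palette(base_palette, 5))  # wraps around after the third color
```

In the tutorial, you could pass something like `fit_palette(Spectral10, len(channels))` as the `color` argument, assuming you define the helper in the notebook first.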

Leave off the x-axis grid lines for this plot.

In [None]:
bar_plot.xgrid.grid_line_color = None

Now, configure the y-axis:
 - Set the minimum value to `0`.
 - Set the visible range to `(0,40)`.

In [None]:
bar_plot.y_range.start = 0
bar_plot.y_range.end = 40

Finally, display your plot with the `show()` method.

In [None]:
show(bar_plot)

## Display data with a line graph

In this section, you'll create a line graph that compares the following, per channel:

* Total additions
* Number of unique editors
* Number of bot edits

First, change the query to include the editors and robots, and load the results into a new pandas DataFrame.

In [None]:
sql = '''
SELECT channel,
  SUM(added) AS additions,
  COUNT(DISTINCT "user") AS editors,
  SUM(CASE WHEN isRobot = 'true' THEN 1 ELSE 0 END) AS robots
FROM "wikipedia-vis"
GROUP BY channel
ORDER BY additions DESC LIMIT 10
'''


# Run the query and serialize the returned rows to a JSON string.
query_result = json.dumps(sql_client.sql_query(sql).rows)

# Add the results to a pandas DataFrame.
editors_pd = pd.read_json(query_result, orient='records')

editors_pd

Set up values for your plots:
- Leave channels as the basis for the x-axis.
- Divide additions by 1000 so that the values fit on the same plot as the editors and robots values.
- Create lists of values for editors and robots.

In [None]:
channels = editors_pd.channel.to_list()
total_additions = [x / 1000 for x in editors_pd.additions.to_list()]
editors = editors_pd.editors.to_list()
robots = editors_pd.robots.to_list()
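
The divisor of 1000 is a manual choice. As a hedged sketch of how you might pick one programmatically, the hypothetical helper below finds the smallest power of ten that brings the largest value of one series at or below the largest value of another. The numbers are made up for illustration.

```python
def pick_divisor(series, reference):
    """Return the smallest power of ten that scales max(series) to at most max(reference)."""
    target = max(reference)
    if target <= 0:
        raise ValueError("reference must contain a positive value")
    divisor = 1
    while max(series) / divisor > target:
        divisor *= 10
    return divisor

# Hypothetical values: additions are far larger than editor counts.
sample_additions = [3_200_000, 1_500_000, 900_000]
sample_editors = [3500, 1200, 800]

divisor = pick_divisor(sample_additions, sample_editors)
scaled = [x / divisor for x in sample_additions]
print(divisor, scaled)
```

Whatever divisor you use, reflect it in the legend label so readers can recover the original magnitude.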

Next, create a plot the same as before, but this time, leave in the Bokeh tools so you can try those out at the end.

In [None]:
# Create a plot.
line_plot = figure(x_range=channels, height=500, width=750, title="Editors vs robots")

Change the scale to accommodate the current data set.

In [None]:
line_plot.y_range.start = 0
line_plot.y_range.end = 4000

Next, add data lines onto the plot:
- For the line color, you only need one color per line. To keep with the Spectral palette, set the index (0-9) for the color you want.
- Bokeh adds a legend for the lines. The legend label identifies each line in the legend.
- The `line_width` and `line_dash` properties control the line appearance.

See the [Bokeh docs](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.line) for more information about the `line` method.

In [None]:
# Add renderers for the lines
additions_line = line_plot.line(channels, total_additions, line_color=Spectral10[0], legend_label="Additions (thousands)", line_width=3, line_dash='solid')
editors_line = line_plot.line(channels, editors, line_color=Spectral10[3], legend_label="Editors", line_width=3, line_dash='dashed')
robots_line = line_plot.line(channels, robots, line_color=Spectral10[7], legend_label="Robots", line_width=3, line_dash='dotted')

Finally, display the plot.

In [None]:
show(line_plot)

The graph shows some interesting data: channels with more bots editing topics tend to have fewer users editing the topics. Those are some hard-working bots!

Notice the toolbar to the right of the plot. You can use it to focus and navigate around the plot.

## Next steps

This tutorial covers the absolute basics for visualization with Apache Druid, pandas, and Bokeh. See the following topics to learn more about the various visualizations you can build:

- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)
- [pandas](https://pandas.pydata.org/)
- [Bokeh](https://bokeh.org/)





