<table style="float:left; border:none">
   <tr style="border:none">
       <td style="border:none">
           <a href="https://bokeh.org/">
           <img
               src="assets/bokeh-transparent.png"
               style="width:50px"
           >
           </a>
       </td>
       <td style="border:none">
           <h1>Bokeh Tutorial</h1>
       </td>
   </tr>
</table>

<div style="float:right;"><h2>06 Data sources</h2></div>

In [1]:
# activate notebook output
from bokeh.io import output_notebook

output_notebook()

# load tutorial data set
import sys

sys.path.append("../data")
from carriers_data import CarrierDataSet

data = CarrierDataSet()

This chapter is focused on how Bokeh handles data. The concepts introduced here are
fundamental to Bokeh. You will use them throughout the rest of the tutorial.

In the previous examples, you have used standard Python lists or pandas DataFrames as
inputs for your data.

Behind the scenes, Bokeh converts all these inputs to a Bokeh ColumnDataSource.
This is Bokeh's internal data structure. It is used in all plots.

In most cases, Bokeh handles the ColumnDataSource automatically. However,
there are many cases where it is useful to create and use a ColumnDataSource
directly. Several of Bokeh's more advanced features rely on using
a ColumnDataSource, for example hover tooltips, automatically placed labels,
computed transforms, and custom interactions.

### Creating a ColumnDataSource from a dictionary

The first step to creating a `ColumnDataSource` is to import it from `bokeh.models`:

In [2]:
from bokeh.models import ColumnDataSource

A ColumnDataSource works similarly to a table or a pandas DataFrame. It is a mapping of
column names to sequences of values.

You can create a ColumnDataSource from Python dictionaries. The keys of the dictionary
are the column names, and the values are the sequences of values:

In [3]:
source = ColumnDataSource(
    data={
        "x": [1, 2, 3, 4, 5],  # first key creates a column named "x"
        "y": [3, 7, 8, 5, 1],  # second key creates a column named "y"
    }
)
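
If your data starts out as a list of row records rather than as columns, it has to be pivoted into the column-oriented dictionary shown above before you can pass it as `data`. The following is a minimal sketch in plain Python. The `rows_to_columns` helper and the sample records are hypothetical and not part of Bokeh:

```python
def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Hypothetical helper for illustration; assumes every row has the same keys.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# sample records (made up for this sketch)
rows = [
    {"x": 1, "y": 3},
    {"x": 2, "y": 7},
    {"x": 3, "y": 8},
]

columns = rows_to_columns(rows)
print(columns)  # {'x': [1, 2, 3], 'y': [3, 7, 8]}
```

You can then pass the resulting dictionary as `ColumnDataSource(data=columns)`.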

To access the contents of any column, use the `data` property of a `ColumnDataSource`:

In [4]:
source.data["x"]

[1, 2, 3, 4, 5]

The data you provide here is not limited to lists. You can also use NumPy arrays or pandas Series:

In [5]:
import pandas as pd
import numpy as np

# load a pandas Series from the demo data set
monthly_passengers_series = data.get_monthly_values()["passengers"]

# create a NumPy array of the same length as the pandas Series
range_array = np.arange(len(monthly_passengers_series))

# create a ColumnDataSource from the pandas Series and the NumPy array
source = ColumnDataSource(
    data={
        "x": monthly_passengers_series,  # first column uses a pandas Series
        "y": range_array,  # second column uses a NumPy array
    }
)

**All the columns in a ColumnDataSource must always be the SAME length**. This is why
the numpy array in the example above uses `len(monthly_passengers_series)`. This way,
the numpy array is the same length as the pandas Series used for the other column.
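
One way to catch a length mismatch early is to compare the column lengths yourself before creating the ColumnDataSource. The sketch below uses only plain Python. `column_lengths` and `check_same_length` are hypothetical helper names, not Bokeh functions:

```python
def column_lengths(data):
    """Return a dict mapping each column name to the length of its sequence."""
    return {name: len(values) for name, values in data.items()}


def check_same_length(data):
    """Raise a ValueError if the columns do not all have the same length."""
    lengths = column_lengths(data)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns differ in length: {lengths}")
    return data


good = {"x": [1, 2, 3, 4, 5], "y": [3, 7, 8, 5, 1]}
bad = {"x": [1, 2, 3, 4, 5], "y": [3, 7, 8, 5]}  # "y" is one value short

check_same_length(good)  # passes silently

try:
    check_same_length(bad)
except ValueError as error:
    print(error)  # columns differ in length: {'x': 5, 'y': 4}
```

Because the helpers rely only on `len()`, the same check should also work for NumPy arrays and pandas Series.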

The following code cell will show an error. Adjust one of the lists to create a valid
ColumnDataSource:

In the examples so far, you have used a Python list or pandas series for the `x` and `y`
values of functions like `p.circle`. This means that Bokeh has created the
ColumnDataSource for you automatically.

Instead of passing individual sequences of values, however, you can also use a
ColumnDataSource directly. To use a ColumnDataSource directly, you do two things
differently:

1. You pass the ColumnDataSource as the `source` argument to a glyph method.
2. To use values from a column of the ColumnDataSource, you pass the **name** of that
    column as the value for the property. For example, instead of passing `x=[1, 2, 3]`,
    you pass `x="x_values"`.

In the following code cell, you first create a ColumnDataSource from a dictionary with two columns.
Then you use those two columns as the `x` and `y` values for a circle glyph:

In [6]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

# create dict as basis for ColumnDataSource
data_dict = {"x_values": [1, 2, 3, 4, 5], "y_values": [6, 7, 2, 3, 6]}

# create ColumnDataSource based on the dict
source = ColumnDataSource(data=data_dict)

# create a plot and renderer with ColumnDataSource data
p = figure(height=300)
p.circle(
    x="x_values",  # use the sequence in the "x_values" column
    y="y_values",  # use the sequence in the "y_values" column
    source=source,  # use the ColumnDataSource as the data source
)

show(p)

### Creating a ColumnDataSource from a DataFrame

There are many similarities between a ColumnDataSource and a pandas DataFrame.
This is why it is simple to create a `ColumnDataSource` object directly from a
DataFrame.

Let's use the monthly passenger, freight, and mail data from the demo dataset again:

In [7]:
monthly_values_df = data.get_monthly_values()
monthly_values_df.head(5)

| month | passengers | freight | mail | month_name |
|---|---|---|---|---|
| 1 | 23895970 | 87598694 | 46968436 | January |
| 2 | 23970797 | 91949788 | 38184246 | February |
| 3 | 38529707 | 110337215 | 44435455 | March |
| 4 | 42939029 | 107077739 | 46288794 | April |
| 5 | 51766337 | 115079453 | 44898759 | May |


To create a ColumnDataSource from a DataFrame, pass the dataframe when creating the
`ColumnDataSource` object:

In [8]:
source = ColumnDataSource(monthly_values_df)

You now have a ColumnDataSource with the same columns as the DataFrame:
- a series of values in a column called `"passengers"`
- a series of values in a column called `"freight"`
- a series of values in a column called `"mail"`
- a series of strings in a column called `"month_name"`

You can use the ColumnDataSource directly in the same way as before:

In [16]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

# create ColumnDataSource based on DataFrame from the demo data set
source = ColumnDataSource(monthly_values_df)

# set up the figure
p = figure(
    height=300,
    # use the sequence of strings from the "month_name" column as categories
    x_range=source.data["month_name"],
)

# create a line renderer with data from the "passengers" column
p.line(
    x="month_name",  # use the sequence of strings from the "month_name" column as categories
    y="passengers",  # use the sequence of values from the "passengers" column as values
    source=source,
)

# create a second line renderer with data from a different column
p.line(
    x="month_name",  # use the sequence of strings from the "month_name" column as categories
    y="freight",  # use the sequence of values from the "freight" column as values
    # y="mail",       # use this line instead of the one above to use data from the "mail" column
    source=source,
)

show(p)

### ColumnDataSource transforms

Using ColumnDataSource objects also allows you to use Bokeh's built-in transforms.
Transforms perform computations on the data before the data is displayed.
BokehJS runs these transforms in the browser.
This means the underlying data is not modified and remains
available to other plots in the same document.



The following example uses the `cumsum` transform to draw a pie chart. You'll learn more about pie and donut charts in one of the next chapters.

In [17]:
from math import pi
import pandas as pd
from bokeh.palettes import Category20c
from bokeh.transform import cumsum

x = { 'United States': 157, 'United Kingdom': 93, 'Japan': 89, 'China': 63,
      'Germany': 44, 'India': 42, 'Italy': 40, 'Australia': 35, 'Brazil': 32,
      'France': 31, 'Taiwan': 31, 'Spain': 29 }

# build a DataFrame with one row per country
# (named pie_data to avoid overwriting the CarrierDataSet stored in `data`)
pie_data = pd.Series(x).reset_index(name="value").rename(columns={"index": "country"})
pie_data["color"] = Category20c[len(x)]

# represent each value as an angle = value / total * 2pi
pie_data["angle"] = pie_data["value"] / pie_data["value"].sum() * 2 * pi

p = figure(height=350, title="Pie Chart", toolbar_location=None,
           tools="hover", tooltips="@country: @value")

p.wedge(
    x=0,
    y=1,
    radius=0.4,
    # use cumsum to cumulatively sum the values for start and end angles
    start_angle=cumsum("angle", include_zero=True),
    end_angle=cumsum("angle"),
    line_color="white",
    fill_color="color",
    legend_field="country",
    source=pie_data,
)

p.axis.axis_label = None
p.axis.visible = False
p.grid.grid_line_color = None

show(p)
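
To see what the two `cumsum` calls in the cell above compute, you can reproduce the arithmetic in plain Python. This is only an illustrative sketch with a shortened sample of the values. In the actual plot, BokehJS performs the transform in the browser. The variable names are assumptions made for this example:

```python
from itertools import accumulate
from math import isclose, pi

values = [157, 93, 89, 63]  # shortened sample values for illustration
total = sum(values)

# represent each value as an angle = value / total * 2pi
angles = [value / total * 2 * pi for value in values]

# like cumsum("angle"): running total, used as the end angle of each wedge
end_angles = list(accumulate(angles))

# like cumsum("angle", include_zero=True): the same running total shifted
# by one position so that it starts at 0, used as the start angle
start_angles = [0.0] + end_angles[:-1]

for start, end in zip(start_angles, end_angles):
    print(f"wedge from {start:.3f} to {end:.3f} rad")

# the last wedge ends at a full circle
assert isclose(end_angles[-1], 2 * pi)
```

Each wedge starts where the previous one ends, which is why the start angles are the end angles shifted by one position.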

# Next section

[TBD, placeholder]