## Bar plots

This page demonstrates how to recreate the bar plots from the ["Visualizing amounts"](https://clauswilke.com/dataviz/visualizing-amounts.html) chapter of Claus O. Wilke's book *Fundamentals of Data Visualization*.

A summary of the chapter can be found in our blog post, [here](blog post link).

In [1]:
# import the relevant libraries
import pandas as pd
from bokeh.io import output_notebook

output_notebook()

### Data preparation

The data used for the plots in this notebook can be found in our GitHub [repository](https://github.com/bokeh/dataviz-fundamentals/tree/main/data). Here, we use the pandas library to parse the data.

### Vertical and horizontal bar plots


In [2]:
file = "../data/movies.csv"
df = pd.read_csv(file)
df["Title"] = df["Title"].apply(lambda x: x.split(":")[0])
df["Weekend gross"] = df["Weekend gross"].apply(
    lambda x: (int(x.split("$")[1])) / 1_000_000
)

df

| Unnamed: 0 | Rank | Title | Weekend gross |
|---|---|---|---|
| 0 | 1 | Star Wars | 71.565498 |
| 1 | 2 | Jumanji | 36.169328 |
| 2 | 3 | Pitch Perfect 3 | 19.928525 |
| 3 | 4 | The Greatest Showman | 8.805843 |
| 4 | 5 | Ferdinand | 7.316746 |
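The two `apply` calls in the cell above can also be written with pandas' vectorized string methods. The sketch below runs on two illustrative rows rather than on `movies.csv`. The raw formats (a title followed by a colon and a subtitle, and a gross written like `$71565498`) are assumptions inferred from the parsing code, and the subtitles are placeholders.

```python
import pandas as pd

# Illustrative rows only; the raw formats are assumed, not read from movies.csv.
raw = pd.DataFrame(
    {
        "Title": ["Star Wars: Some Subtitle", "Jumanji: Another Subtitle"],
        "Weekend gross": ["$71565498", "$36169328"],
    }
)

clean = raw.copy()

# keep only the part of the title before the first colon
clean["Title"] = clean["Title"].str.split(":").str[0]

# strip the leading dollar sign and convert to millions of USD
clean["Weekend gross"] = (
    clean["Weekend gross"].str.lstrip("$").astype(int) / 1_000_000
)

print(clean)
```

The vectorized form avoids a Python-level lambda per row, which mostly matters for larger files; for a five-row table either version is fine.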


In [3]:
from bokeh.plotting import figure, show
from bokeh.models import FactorRange

# plot a vertical bar
p1 = figure(
    x_range=FactorRange(factors=df.Title),
    height=300,
    width=600,
    title="Movie gross",
    y_axis_label="weekend gross (million USD)",
)

p1.vbar(x="Title", top="Weekend gross", width=0.7, color="#66B2FF", source=df)

p1.xaxis.major_tick_out = 0

p1.yaxis.minor_tick_out = 0
p1.yaxis.major_tick_out = 0
p1.y_range.start = 0

show(p1)

In [4]:
# plot a horizontal bar
p2 = figure(
    y_range=df.Title,
    height=300,
    title="Movie gross",
    x_axis_label="weekend gross (million USD)",
    sizing_mode="stretch_width",
)

p2.hbar(
    y="Title", left=0, right="Weekend gross", height=0.8, color="#66B2FF", source=df
)

p2.x_range.start = 0
p2.xaxis.major_tick_out = 0
p2.xaxis.minor_tick_out = 0

p2.yaxis.major_tick_out = 0


show(p2)

The `vbar()` and `hbar()` methods of the Bokeh `figure` class create vertical and horizontal bar plots, respectively.

The `vbar()` method takes the following parameters:

- `x`: The x-coordinates of the bar centers. It can be a list of categorical values or numerical positions, or the name of a column in `source`.

- `top`: The y-coordinates of the bar tops, which equal the bar heights when the bars start at 0. It can be a list of numerical values or the name of a column in `source`.

- `width`: The width of the bars in units of the x-axis. On a categorical axis, a width of 1 fills the whole slot of a category, so `0.7` leaves gaps between the bars.

- `source`: The data source (here a pandas DataFrame) in which `x` and `top` are looked up when they are given as column names.

- Other optional parameters such as `color`, `alpha`, and `line_width`.

The plots are further customized through attributes of the figure, for example:

- setting `major_tick_out` and `minor_tick_out` to 0 so that the ticks do not extend outside the plot
- setting the start of the y-axis range to 0
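As a minimal sketch of the two ways of supplying data, the example below draws the same bars twice: once from plain Python lists and once from column names that are looked up in a `source`. The values are the first three rows of the table shown earlier, rounded to one decimal; `titles`, `gross`, `p_lists`, and `p_source` are names chosen for illustration.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

titles = ["Star Wars", "Jumanji", "Pitch Perfect 3"]
gross = [71.6, 36.2, 19.9]  # million USD, rounded from the table shown earlier

# 1. pass the coordinates directly as lists
p_lists = figure(x_range=titles, height=300, title="From lists")
p_lists.vbar(x=titles, top=gross, width=0.7)

# 2. pass column names and let Bokeh look them up in `source`
source = ColumnDataSource(data={"Title": titles, "Weekend gross": gross})
p_source = figure(x_range=titles, height=300, title="From a data source")
p_source.vbar(x="Title", top="Weekend gross", width=0.7, source=source)
```

Both figures can be displayed with `show()`. The second form is usually more convenient once the data already lives in a DataFrame.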
    

The `hbar()` method takes the following parameters:

- `y`: The y-coordinates of the bar centers. It can be a list of categorical values or numerical positions, or the name of a column in `source`.

- `left`: The x-coordinates of the left edges of the bars. It is set to 0 here, which is also the default.

- `right`: The x-coordinates of the right edges of the bars. It can be a list of numerical values or the name of a column in `source`.

- `height`: The thickness of the bars in units of the y-axis. On a categorical axis, a height of 1 fills the whole slot of a category.

- Other optional parameters and plot customization options are the same as for the `vbar()` method.
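One detail worth knowing for horizontal bars: a categorical `y_range` is laid out from bottom to top, so the first factor ends up at the bottom of the plot. If the first row should appear at the top instead, one option is to reverse the list of factors, as in this sketch (values rounded from the table shown earlier; variable names are illustrative).

```python
from bokeh.plotting import figure

titles = ["Star Wars", "Jumanji", "Pitch Perfect 3"]
gross = [71.6, 36.2, 19.9]  # million USD, rounded from the table shown earlier

# reverse the factors so that the first title is drawn at the top
p = figure(y_range=list(reversed(titles)), height=250, title="Movie gross")

# the bars themselves are still matched to the factors by name
p.hbar(y=titles, left=0, right=gross, height=0.8)
p.x_range.start = 0
```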

### Grouped and stacked bar plots

#### A. Grouped bars

In [5]:
file = "../data/income_by_age.tsv"
df = pd.read_table(file)

df = (
    df.sort_values(["race", "age"])
    .reset_index(drop=True)
    .iloc[7:35, :]
    .reset_index(drop=True)
)
age_group = df.groupby(["age", "race"])[["median_income"]].sum()
age_group = age_group.unstack().reset_index().rename(columns={"": "age"})
age_group.columns = age_group.columns.droplevel()

age_group

| race | age | asian | black | hispanic | white |
|---|---|---|---|---|---|
| 0 | 15 to 24 | 45809 | 30267 | 45080 | 44588 |
| 1 | 25 to 34 | 80098 | 39176 | 45876 | 65389 |
| 2 | 35 to 44 | 100443 | 49336 | 50245 | 78093 |
| 3 | 45 to 54 | 98925 | 50103 | 58103 | 82289 |
| 4 | 55 to 64 | 91193 | 40363 | 51996 | 69387 |
| 5 | 65 to 74 | 56646 | 28697 | 36704 | 52219 |
| 6 | > 74 | 26487 | 22302 | 23797 | 32203 |


In [6]:
from bokeh.models import NumeralTickFormatter as NTF
from bokeh.palettes import Blues5
from bokeh.transform import dodge

p = figure(
    title="Median income by age group",
    height=300,
    sizing_mode="stretch_width",
    x_range=FactorRange(factors=age_group.age),
    x_axis_label="age (years)",
    y_axis_label="median income (USD)",
)

bar_width = 0.2

p.vbar(
    x=dodge("age", -0.3, range=p.x_range),
    top="asian",
    source=age_group,
    width=bar_width,
    color=Blues5[0],
    legend_label="Asian",
)

p.vbar(
    x=dodge("age", -0.1, range=p.x_range),
    top="white",
    source=age_group,
    width=bar_width,
    color=Blues5[1],
    legend_label="White",
)

p.vbar(
    x=dodge("age", 0.1, range=p.x_range),
    top="hispanic",
    source=age_group,
    width=bar_width,
    color=Blues5[2],
    legend_label="Hispanic",
)

p.vbar(
    x=dodge("age", 0.3, range=p.x_range),
    top="black",
    source=age_group,
    width=bar_width,
    color=Blues5[3],
    legend_label="Black",
)

# remove x-axis tick marks, grid lines and axis line (tick labels are kept)
p.xaxis.major_tick_line_color = None
p.xaxis.major_tick_out = 0
p.xgrid.grid_line_color = None
p.xaxis.axis_line_color = None

# remove y-axis tick marks and axis line (tick labels are kept)
p.yaxis.minor_tick_out = 0
p.yaxis.major_tick_out = 0
p.yaxis.axis_line_color = None

# configure y-axis range and y-axis ticks
p.y_range.start = 0
p.y_range.end = 100_000
p.yaxis.formatter = NTF(format="$0,0")

show(p)

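The `vbar()` method is called once for each group of bars (here, once per race). To place the bars of the different groups side by side, the `x` coordinate in each call is wrapped in the `dodge()` transform from `bokeh.transform`.

`dodge()` shifts each bar by a fixed offset from the center of its category so that the bars do not overlap. With a bar width of 0.2, the offsets -0.3, -0.1, 0.1, and 0.3 place the four bars directly next to each other.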
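The four near-identical `vbar()` calls can also be generated in a loop. The sketch below computes evenly spaced offsets centered on each category. The helper `dodge_offsets` is hypothetical (it is not part of Bokeh), and the data is a two-row excerpt of the table shown earlier.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import dodge


def dodge_offsets(n_groups: int, step: float) -> list[float]:
    """Hypothetical helper: n_groups offsets spaced `step` apart, centered on 0."""
    first = -step * (n_groups - 1) / 2
    # rounding removes floating-point noise such as -0.30000000000000004
    return [round(first + i * step, 10) for i in range(n_groups)]


# two age groups taken from the table shown earlier
data = {
    "age": ["15 to 24", "25 to 34"],
    "asian": [45809, 80098],
    "white": [44588, 65389],
    "hispanic": [45080, 45876],
    "black": [30267, 39176],
}
source = ColumnDataSource(data)
groups = ["asian", "white", "hispanic", "black"]
bar_width = 0.2

p = figure(x_range=data["age"], height=300, title="Median income by age group")

# one vbar() call per group, each shifted by its own offset
for group, offset in zip(groups, dodge_offsets(len(groups), step=bar_width)):
    p.vbar(
        x=dodge("age", offset, range=p.x_range),
        top=group,
        source=source,
        width=bar_width,
        legend_label=group.capitalize(),
    )

p.y_range.start = 0
```

Using the bar width as the step makes the bars touch; a larger step would leave gaps between the bars within a category.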

Other objects and parameters used are as follows:

- `NumeralTickFormatter`: Imported as `NTF`. It formats the y-axis tick labels as dollar amounts.

- `Blues5`: A five-color palette from the `bokeh.palettes` module that provides the colors for the bars.

- `legend_label`: Sets the legend label for each group of bars.
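Only four of the five `Blues5` colors are used in the plot above. As a small sketch of an alternative, `bokeh.palettes.Blues` is a dictionary keyed by the number of colors, so a palette with exactly one color per group can be requested directly. The two options below do not necessarily yield the same shades, and the mapping to group names is illustrative.

```python
from bokeh.palettes import Blues, Blues5

groups = ["Asian", "White", "Hispanic", "Black"]

# option 1: take the first colors of the five-color palette
colors_from_slice = Blues5[: len(groups)]

# option 2: request a palette with exactly one color per group
colors_by_size = Blues[len(groups)]

# pair each group with its color, e.g. for use in a loop over vbar() calls
color_map = dict(zip(groups, colors_by_size))
print(color_map)
```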

In [7]:
def plot_bars(df: pd.DataFrame) -> figure:
    """
    Creates a bar chart to visualize median income by age.

    Parameters:
        df (pd.DataFrame): The pandas DataFrame containing the data.
            It should have the following columns:
            - age: String values representing the age groups.
            - median_income: Numeric values representing the median income for each age group.
            It must also carry a `name` attribute, which is used as the plot title.

    Returns:
        figure: A Bokeh figure object representing the bar chart.

    Raises:
        ValueError: If the required columns are not present in the DataFrame.
        TypeError: If the 'median_income' column does not contain numeric values.

    Example:
        df = pd.DataFrame({'age': ['15-24', '25-34', '35-44'], 'median_income': [50000, 60000, 70000]})
        df.name = 'Example'
        plot = plot_bars(df)
        show(plot)
    """
    # Data validation
    if "age" not in df.columns or "median_income" not in df.columns:
        raise ValueError("The DataFrame must have 'age' and 'median_income' columns.")

    if not pd.api.types.is_numeric_dtype(df["median_income"]):
        raise TypeError("The 'median_income' column must contain numeric values.")

    if not pd.api.types.is_numeric_dtype(df["age"]):
        factors = df["age"].unique().tolist()
        df["age"] = pd.Categorical(df["age"], categories=factors, ordered=True)

    # Function implementation
    p = figure(
        title=f"{df.name}",
        height=300,
        width=300,
        x_range=FactorRange(factors=df.age),
        toolbar_location=None,
    )

    p.vbar(x="age", top="median_income", color="#99CCFF", source=df, width=0.9)
    p.xgrid.grid_line_color = None
    p.xaxis.major_tick_out = 0
    p.yaxis.formatter = NTF(format="$0,0")
    p.xaxis.axis_label = "Age (years)"
    p.yaxis.axis_label = "Median income (USD)"
    p.yaxis.minor_tick_out = 0
    p.yaxis.major_tick_out = 0
    p.y_range.start = 0
    p.y_range.end = 110_000

    return p

In [8]:
# change the age format in the dataframe
df["age"] = df.age.str.replace(" to ", "-")

# add a dataframe name for each race category
asian = df.iloc[:7, :].drop(["year", "race"], axis=1)
asian.name = "Asian"

black = df.iloc[7:14, :].drop(["year", "race"], axis=1)
black.name = "Black"

hispanic = df.iloc[14:21, :].drop(["year", "race"], axis=1)
hispanic.name = "Hispanic"

white = df.iloc[21:, :].drop(["year", "race"], axis=1)
white.name = "White"
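The positional slices above rely on the rows being sorted by race, with exactly seven age groups per race. A less order-dependent sketch splits the frame with `groupby` and keeps each display label next to its sub-frame in a dictionary instead of attaching a `name` attribute. The four-row frame below is an illustrative excerpt of the table shown earlier, not the full dataset, and `frames` is a name chosen for this example.

```python
import pandas as pd

# illustrative excerpt; values copied from the table shown earlier
df = pd.DataFrame(
    {
        "race": ["asian", "asian", "black", "black"],
        "age": ["15-24", "25-34", "15-24", "25-34"],
        "median_income": [45809, 80098, 30267, 39176],
    }
)

# one sub-frame per race, keyed by a display label
frames = {
    race.capitalize(): group.drop(columns="race").reset_index(drop=True)
    for race, group in df.groupby("race", sort=False)
}

for label, frame in frames.items():
    print(label, frame["median_income"].tolist())
```

With this layout, a plotting function would need to receive the label as an explicit argument rather than reading it from the DataFrame.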

In [9]:
# plot and render bar charts in a grid layout
from bokeh.layouts import gridplot

races = [asian, white, hispanic, black]
plots = []
for race in races:
    plot = plot_bars(race)
    plots.append(plot)

layout = gridplot([plots[:2], plots[-2:]])

show(layout)

In the plots above, the `gridplot` function creates a 2-by-2 grid layout from a nested list of plots, in which each inner list is one row of the grid. The first two plots are displayed in the first row and the last two plots in the second row.
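`gridplot` also accepts a flat list together with the `ncols` argument, which avoids slicing the list by hand. A minimal sketch with empty placeholder figures follows; the helper `make_plots` is hypothetical and only stands in for the plots created above.

```python
from bokeh.layouts import gridplot
from bokeh.plotting import figure


def make_plots(n: int) -> list:
    """Hypothetical helper: n empty placeholder figures."""
    return [figure(width=300, height=300, title=f"Plot {i + 1}") for i in range(n)]


# nested list: each inner list is one row of the grid
plots = make_plots(4)
nested = gridplot([plots[:2], plots[2:]], toolbar_location=None)

# flat list plus ncols: the grid is filled row by row
flat = gridplot(make_plots(4), ncols=2, toolbar_location=None)
```

Both layouts can be rendered with `show()`. The `ncols` form adapts more easily when the number of plots changes.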

#### B. Stacked bars

In [10]:
file = "../data/titanic_all.tsv"
titanic = pd.read_table(file)
t_class = titanic.groupby("class").sex.value_counts().unstack().drop("*", axis=0)
t_class.index = ["1st class", "2nd class", "3rd class"]
t_class = t_class.reset_index().rename(columns={"index": "class"})

t_class

| sex | class | female | male |
|---|---|---|---|
| 0 | 1st class | 143.0 | 179.0 |
| 1 | 2nd class | 107.0 | 172.0 |
| 2 | 3rd class | 212.0 | 499.0 |


In [11]:
from bokeh.models import HoverTool

p = figure(
    title="Male and female passengers on the Titanic",
    height=300,
    x_range=FactorRange(*t_class["class"]),
    toolbar_location=None,
    sizing_mode="stretch_width",
)

p.vbar_stack(
    ["male", "female"],
    x="class",
    source=t_class,
    width=0.9,
    line_color="white",
    color=[(0, 102, 204), (204, 102, 0)],
    legend_label=["male passengers", "female passengers"],
)

hover = HoverTool(tooltips=[("Male", "@male"), ("Female", "@female")])

p.add_tools(hover)
p.xaxis.axis_line_color = None
p.xaxis.axis_line_width = 0
p.xaxis.major_tick_out = 0
p.y_range.start = 0
p.yaxis.visible = False
p.grid.grid_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"

show(p)

Creating vertical stacked bars with Bokeh requires the `vbar_stack` method. This method takes the following parameters:

- The names of the columns to stack, given as a list of strings. Each column becomes one layer of the stack, drawn in the order listed. In this case, the columns are "male" and "female".

- `x`: The name of the column in the data source that holds the x-coordinates of the bars.

- `source`: The data source for the plot. It can be a Bokeh `ColumnDataSource` object or a pandas DataFrame.

- Other optional parameters such as `width`, `line_color`, `color`, and `legend_label`. When `color` or `legend_label` is given as a list, it must contain one entry per stacked column.

Several attributes of the figure object are then modified to customize the appearance of the plot:

- hiding the x-axis line
- setting the major ticks so that they do not extend outside the plot
- setting the start of the y-axis range to 0
- hiding the y-axis
- removing grid lines
- removing the plot outline

The location and orientation of the legend are also customized.
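To make the stacking explicit: `vbar_stack` draws one set of bars per column, each starting where the previous one ended. The sketch below uses the passenger counts from the table above, confirms that one renderer is returned per stacked column, and computes the segment tops with pandas. Variable names are illustrative.

```python
import pandas as pd
from bokeh.plotting import figure

# passenger counts copied from the table above
t_class = pd.DataFrame(
    {
        "class": ["1st class", "2nd class", "3rd class"],
        "male": [179, 172, 499],
        "female": [143, 107, 212],
    }
)

p = figure(x_range=list(t_class["class"]), height=300)
renderers = p.vbar_stack(["male", "female"], x="class", source=t_class, width=0.9)

# vbar_stack returns one renderer per stacked column
print(len(renderers))

# the top of each segment is the running total across the stacked columns
tops = t_class[["male", "female"]].cumsum(axis=1)
print(tops)
```

In `tops`, the "male" column holds the tops of the lower segments and the "female" column holds the total bar heights.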

A `HoverTool` is also added to the figure to display tooltips when the cursor hovers over the bars:

- The `HoverTool` object is created and assigned to the variable `hover`.

- The `tooltips` parameter of `HoverTool` is set to a list of tuples. Each tuple contains the tooltip label and a field such as `@male`, where the `@` prefix refers to a column of the data source. Hovering over any segment of a bar displays both the male and female passenger counts for that class.

- The `add_tools` method is called on the figure object to add the hover tool to the plot.
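If only the value of the hovered segment is wanted, one option is to use the special `$name` field. `vbar_stack` sets the `name` of each renderer to the column it draws, so `@$name` looks up the value in that column. The sketch below reuses the passenger counts from the table above; the tooltip labels are illustrative.

```python
import pandas as pd
from bokeh.models import HoverTool
from bokeh.plotting import figure

# passenger counts copied from the table above
t_class = pd.DataFrame(
    {
        "class": ["1st class", "2nd class", "3rd class"],
        "male": [179, 172, 499],
        "female": [143, 107, 212],
    }
)

p = figure(x_range=list(t_class["class"]), height=300, toolbar_location=None)
renderers = p.vbar_stack(
    ["male", "female"],
    x="class",
    source=t_class,
    width=0.9,
    color=["#0066CC", "#CC6600"],
)

# $name is the name of the hovered renderer ("male" or "female");
# @$name reads the value from the column with that name
hover = HoverTool(
    tooltips=[("Class", "@class"), ("Sex", "$name"), ("Passengers", "@$name")]
)
p.add_tools(hover)
```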

