# How to save time on data visualization

A picture is worth a thousand words. Good visualizations speak for themselves and save the stakeholders' expensive time when a data project is presented. A data professional who saves their own time while preparing the graphs ends up spending the time and attention of the audience on explanations rather than on discussing the implications. Though tempting, such a tradeoff can push a decision on a critical project back by weeks or months.


In this blog post, I will share pieces of code that make visualizations easier to read and approaches for saving time while putting that code together. The material is intended for data scientists who spend significant time exploring data and using the results to help stakeholders make business decisions. The tech stack is limited to `matplotlib`, Python's de facto standard visualization library, in a Jupyter notebook.

## The last mile problem
The Python community has a number of powerful visualization libraries.
<a href="https://seaborn.pydata.org/">Seaborn</a>, <a href="https://plotly.com/">Plotly</a>, <a href="https://altair-viz.github.io/">Altair</a>, <a href="https://www.pygal.org">Pygal</a>, and <a href="https://plotnine.readthedocs.io/">Plotnine</a> are just a few of them.
I get very excited reading about their capabilities. However, the resulting plot still needs fine tuning no matter how sophisticated the visualization function is. This last mile from a flashy picture to an informative graph requires the most effort. Say, I found a function in a fancy library that fits my data perfectly and renders the following plot, which I saw in <a href="https://medium.com/@jordan.johnson_35348/is-ruby-on-rails-on-the-next-train-out-cc580d9d0f14">this article</a>:
<div>
<img src="pics/flashy_plot.jpeg" width="400"/>
</div>
Looks slick, but I will have to use words to explain what was measured for each of the programming languages. Adding an x label or a title is typically possible but requires a dive into the documentation of that specific library. The next time, when the data fit a function from a different library better, I will have to dive again. Not impossible, but tedious.

Instead, I want to concentrate on the tricks that cover the last mile using the standard `matplotlib` library. There are two reasons for that:
<ol>
  <li> <b>Inheritance</b>. Some visualization libraries, like Seaborn, use <code>matplotlib</code> as a backbone. This means that the generated image can be finished off in the same way as if it had been generated with the standard library itself.
  </li>
  <li> <b>Sufficiency</b>. The standard plotting library provides the majority of the required functionality. The result might not be as flashy or interactive, but it has all the means for getting the point across efficiently. For example, the above visualization can be delivered with the standard library in the way shown below.
  </li>
</ol>

In [None]:
import matplotlib.pyplot as plt

programming_languages = ('JavaScript',
                         'HTML/CSS',
                         'SQL',
                         'Python',
                         'TypeScript',
                         'Java',
                         'Bash/Shell',
                         'C#')

past_year_use = (65.36, 55.08, 49.43, 48.07, 34.83, 33.27, 29.07, 27.98)

plt.barh(programming_languages, past_year_use)
plt.show()

The bare-bones plot from the standard library looks even less informative. The following section covers how to make this plot easy to read, which adds extra code to the setup. The section after that shows ways to save time on typing the same decoration commands over and over again.

## Making plots easier to read

### Image resolution

The resulting plot can only appear on slides or in documentation if it has been saved to a file. However, the default saving options are not always ideal. The default framing can cut parts of the title out of the image, while the default resolution can produce a pixelated result.
<div>
<img src="pics/demo_bar_plot_save_default.png" width="400"/>
</div>

It is our responsibility to make sure the final result looks good on the screen. The `bbox_inches='tight'` parameter makes everything fit and trims only the excess white space. The `dpi` parameter (dots per inch) controls the resolution.

```python
plt.savefig('pics/demo_bar_plot_save.png', bbox_inches='tight', dpi=300)
```
<div>
<img src="pics/demo_bar_plot_save.png" width="400"/>
</div>
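
As a rough check of what `dpi` does, the sketch below saves the same 5 by 3 inch figure at two resolutions into an in-memory buffer and reads back the pixel dimensions. The helper name `saved_pixel_size` and the placeholder data are made up for this illustration, and `bbox_inches='tight'` is left out on purpose so that the sizes stay predictable.

```python
import io

import matplotlib.pyplot as plt


def saved_pixel_size(dpi):
    """Save a small demo figure as PNG and return its (width, height) in pixels."""
    fig = plt.figure(figsize=(5, 3))        # 5 x 3 inches
    plt.barh(['a', 'b'], [1, 2])            # placeholder data, not from the survey
    buffer = io.BytesIO()                   # in-memory file instead of a path on disk
    fig.savefig(buffer, format='png', dpi=dpi)
    plt.close(fig)                          # release the figure
    buffer.seek(0)                          # rewind before reading the image back
    image = plt.imread(buffer, format='png')
    height, width = image.shape[:2]
    return width, height


low_res = saved_pixel_size(dpi=100)
high_res = saved_pixel_size(dpi=300)
print(low_res, high_res)
```

Without the tight bounding box, the pixel size is simply the figure size in inches multiplied by `dpi`, so tripling `dpi` triples both dimensions of the saved image.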

### Figure size
The size of the visualization should be proportional to the amount of information we want to convey. Stretching a few bars across a default-size plot only eats up space without adding any value, as in the following example from <a href="https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_col.html#two-variable-bar-plot">the plotnine documentation</a>:
<div>
<img src="pics/thick_bars.png" width="400"/>
</div>
Conversely, overcrowded plots benefit from a larger figure size.

The following piece of code placed before the plotting function will set the figure width to 5 inches and height to 3 inches:
```python
plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
```
<div>
<img src="pics/demo_bar_plot_fs.png" width="500"/>
</div>
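
One possible way to keep the size proportional to the content is to derive the figure height from the number of bars. The sketch below is only an illustration: the helper `bar_figsize` and its default values (0.35 inches per bar, 0.8 inches for title and labels) are my assumptions, not matplotlib settings.

```python
import matplotlib.pyplot as plt


def bar_figsize(n_bars, width=5.0, inches_per_bar=0.35, margin=0.8):
    """Return (width, height) in inches for a horizontal bar plot with n_bars bars.

    The defaults are illustrative guesses; tune them to taste.
    """
    return (width, margin + inches_per_bar * n_bars)


languages = ('JavaScript', 'HTML/CSS', 'SQL', 'Python')
usage = (65.36, 55.08, 49.43, 48.07)

# the figure grows with the number of bars instead of using a fixed height
fig = plt.figure(figsize=bar_figsize(len(languages)))
plt.barh(languages, usage)
```

With four bars this yields a figure that is 5 inches wide and 2.2 inches high; adding more languages makes the figure taller instead of squeezing the bars.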

In [None]:
# plot preparation
fig = plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
ax = plt.gca()  # get the axis from the plot
# plotting
handle_bars = plt.barh(programming_languages, past_year_use)  # the plot

plt.savefig('pics/demo_bar_plot_fs.png', bbox_inches='tight', dpi=300)
plt.show()

### Title and axis labels

The plot should inform the viewer about its content. A title and axis labels are a straightforward way of delivering this information. Assuming that this information is obvious leads to confusion, misinterpretations, unnecessary questions, and lengthy explanations. A simple remedy is to always label the axes and to state the point of the plot in the title.
Say, I want to get across the point that, contrary to my expectations, Python is not as widespread in developers' projects as some other programming languages. Then I would set the title and the axis labels as follows:

```python
plt.title('Less than half developers used Python\n'
          'in their projects in the past year')
plt.xlabel('Percentage of developers')
plt.ylabel('Programming Language')
```
<div>
<img src="pics/demo_bar_plot_ttl.png" width="500"/>
</div>
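
When the plot comes from a library that hands back a matplotlib `Axes` object (Seaborn's axes-level functions do), the same decorations can be set on that object directly. The sketch below assumes such an `ax` and creates it with `plt.subplots` on a shortened data set just to stay self-contained.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 3))
ax.barh(('SQL', 'Python'), (49.43, 48.07))   # shortened data for the demo

# Axes.set accepts several properties at once
ax.set(title='Less than half developers used Python\n'
             'in their projects in the past year',
       xlabel='Percentage of developers',
       ylabel='Programming Language')
```

This is equivalent to the `plt.title`, `plt.xlabel`, and `plt.ylabel` calls above, but it does not depend on which axes happens to be the current one.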

In [None]:
# plot preparation
fig = plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
ax = plt.gca()  # get the axis from the plot
# plotting
handle_bars = plt.barh(programming_languages, past_year_use)  # the plot

highlighting = {'color':'red', 'fontweight':'bold'}
plt.title('Less than half developers used Python\n'
          'in their projects in the past year',
          **highlighting)
plt.xlabel('Percentage of developers', **highlighting)
plt.ylabel('Programming Language', **highlighting)

plt.savefig('pics/demo_bar_plot_ttl.png', bbox_inches='tight', dpi=300)
plt.show()

### Grid lines
The grid lines help the viewer perceive the differences between values, similar to how the lines on a football field help referees spot the differences between players' positions when calling offside.
<div>
<img src="pics/offside.jpg" width="400"/>
</div>

The following line sets the grid:

```python
plt.grid()              # show the grid
```

However, by default the grid is drawn on top of the bars, which looks ugly. The fix requires getting the axes object from the plot and telling it to draw the grid below the plot elements:

```python
ax = plt.gca()          # get the axis from the plot
ax.set_axisbelow(True)  # set the grid lines behind the value bars
```
<div>
<img src="pics/demo_bar_plot_gl.png" width="500"/>
</div>
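
For a horizontal bar plot, the horizontal grid lines carry little information because every bar already has its own label. A possible variation, sketched below on a shortened data set, is to show only the vertical lines through the `axis` argument of `grid` and to fade them a little. The `alpha` value is an arbitrary choice of mine.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 3))
ax.barh(('SQL', 'Python'), (49.43, 48.07))   # shortened data for the demo

ax.grid(axis='x', alpha=0.5)   # vertical grid lines only, slightly faded
ax.set_axisbelow(True)         # keep the grid behind the bars
```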

In [None]:
# plot preparation
fig = plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
ax = plt.gca()  # get the axis from the plot
# plotting
handle_bars = plt.barh(programming_languages, past_year_use)  # the plot

plt.title('Less than half developers used Python\n'
          'in their projects in the past year')
plt.xlabel('Percentage of developers')
plt.ylabel('Programming Language')

plt.grid(color='red', linewidth=2)              # show the grid
ax = plt.gca()          # get the axis from the plot
ax.set_axisbelow(True)  # set the grid lines behind the value bars

plt.savefig('pics/demo_bar_plot_gl.png', bbox_inches='tight', dpi=300)
plt.show()

### Ticks formatting

Reading a numeric value takes two steps: reading the number from the tick and reading the measurement unit from the axis label. We can cut one mental step by adding the measurement unit to the tick value. The following trick defines a function that takes two arguments (the tick value and the tick position) and returns the replacement label. The function is then applied to the axis of interest (`xaxis` or `yaxis`):
```python
add_percentage = lambda value, tick_number: f"{value:.0f}%"       # function that adds % to tick value
ax.xaxis.set_major_formatter(plt.FuncFormatter(add_percentage))   # apply the tick modifier to x axis
```
<div>
<img src="pics/demo_bar_plot_tl.png" width="500"/>
</div>
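
For the specific case of percentages, matplotlib also ships a ready-made formatter, `matplotlib.ticker.PercentFormatter`, so the custom function is optional. The sketch below assumes the values are already on a 0 to 100 scale (hence `xmax=100`) and uses a shortened data set.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

fig, ax = plt.subplots(figsize=(5, 3))
ax.barh(('SQL', 'Python'), (49.43, 48.07))   # shortened data for the demo

# xmax=100: the data are already percentages; decimals=0: no fractional digits
percent_formatter = PercentFormatter(xmax=100, decimals=0)
ax.xaxis.set_major_formatter(percent_formatter)

# format a single value to see what a tick label will look like
label_for_50 = percent_formatter.format_pct(50, 100)
print(label_for_50)
```

The custom `FuncFormatter` approach remains the more general tool, since it works for any unit (currency, seconds, and so on).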

In [None]:
# plot preparation
fig = plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
ax = plt.gca()  # get the axis from the plot
# plotting
handle_bars = plt.barh(programming_languages, past_year_use)  # the plot

plt.title('Less than half developers used Python\n'
          'in their projects in the past year')
plt.xlabel('Percentage of developers')
plt.ylabel('Programming Language')

plt.grid()              # show the grid
ax = plt.gca()          # get the axis from the plot
ax.set_axisbelow(True)  # set the grid lines behind the value bars

ax.set_xticklabels(ax.get_xticks(), color='red', weight='bold')
add_percentage = lambda value, tick_number: f"{value:.0f}%"       # function that adds % to tick value
ax.xaxis.set_major_formatter(plt.FuncFormatter(add_percentage))   # apply the tick modifier to x axis

plt.savefig('pics/demo_bar_plot_tl.png', bbox_inches='tight', dpi=300)
plt.show()

### Reference points

Points of reference help the audience compare the data against expectations, whether it is a sales goal, an industry standard, the end-of-quarter date, or anything else of high importance. Adding them explicitly saves the effort of making sense of the plot.

I added a vertical line at 50% of developers who used a programming language, so it is easy to see that Python does not cross the line and to get a feeling for by how much without even looking at the numbers.

```python
plt.axvline(50, color='m', ls='--') # 50% cutoff vertical line (axhline for horizontal)
```
<div>
<img src="pics/demo_bar_plot_rp.png" width="500"/>
</div>
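
The reference does not have to be a fixed number. As an illustration only, the sketch below places the line at the average usage across the eight listed languages; whether such an average is a meaningful benchmark for a given audience is a separate question.

```python
from statistics import mean

import matplotlib.pyplot as plt

programming_languages = ('JavaScript', 'HTML/CSS', 'SQL', 'Python',
                         'TypeScript', 'Java', 'Bash/Shell', 'C#')
past_year_use = (65.36, 55.08, 49.43, 48.07, 34.83, 33.27, 29.07, 27.98)

# data-driven reference point: the average across the plotted languages
average_use = mean(past_year_use)

fig, ax = plt.subplots(figsize=(5, 3))
ax.barh(programming_languages, past_year_use)
handle_mean = ax.axvline(average_use, color='m', ls='--')
```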


In [None]:
# plot preparation
fig = plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
ax = plt.gca()  # get the axis from the plot
# plotting
handle_bars = plt.barh(programming_languages, past_year_use)  # the plot

plt.title('Less than half developers used Python\n'
          'in their projects in the past year')
plt.xlabel('Percentage of developers')
plt.ylabel('Programming Language')

plt.grid()              # show the grid
ax = plt.gca()          # get the axis from the plot
ax.set_axisbelow(True)  # set the grid lines behind the value bars

add_percentage = lambda value, tick_number: f"{value:.0f}%"       # function that adds % to tick value
ax.xaxis.set_major_formatter(plt.FuncFormatter(add_percentage))   # apply the tick modifier to x axis

plt.axvline(50, color='r', linewidth=4)  # 50% cutoff vertical line (axhline for horizontal)

plt.savefig('pics/demo_bar_plot_rp.png', bbox_inches='tight', dpi=300)
plt.show()

### Legend

A legend is essential for referencing the different layers of a plot when there is more than one. The automatic order of the layers can get mixed up; in such cases, individual layers can be referenced through their handles. In my example, a legend entry for the bars themselves would take up space without adding value, but the reference line should be annotated. The `loc` argument helps position the legend so that it does not obstruct areas of high importance.

```python
handle_vline = plt.axvline(50, color='m', ls='--')  # 50% cutoff vertical line (axhline for horizontal)
plt.legend([handle_vline],                          # plot handles
           ['majority cutoff'],                     # plot labels
           loc='upper right')                       # legend position
```
<div>
<img src="pics/demo_bar_plot_lgnd.png" width="500"/>
</div>
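
An alternative to passing handles explicitly is to give a `label` only to the layers that should show up in the legend. The sketch below uses a shortened data set; layers without a label, like the bars here, are skipped by `legend`.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 3))
ax.barh(('SQL', 'Python'), (49.43, 48.07))                   # no label: stays out of the legend
ax.axvline(50, color='m', ls='--', label='majority cutoff')  # labeled: gets a legend entry
legend = ax.legend(loc='upper right')

# collect the legend entries to confirm what ended up in it
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```

Explicit handles are still the safer choice when the order of the entries matters.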

In [None]:
# plot preparation
fig = plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
ax = plt.gca()  # get the axis from the plot
# plotting
handle_bars = plt.barh(programming_languages, past_year_use)  # the plot

plt.title('Less than half developers used Python\n'
          'in their projects in the past year')
plt.xlabel('Percentage of developers')
plt.ylabel('Programming Language')

plt.grid()              # show the grid
ax = plt.gca()          # get the axis from the plot
ax.set_axisbelow(True)  # set the grid lines behind the value bars

add_percentage = lambda value, tick_number: f"{value:.0f}%"       # function that adds % to tick value
ax.xaxis.set_major_formatter(plt.FuncFormatter(add_percentage))   # apply the tick modifier to x axis

handle_vline = plt.axvline(50, color='m', ls='--')  # 50% cutoff vertical line (axhline for horizontal)
plt.legend([handle_vline],                          # plot handles
           ['majority cutoff'],                     # plot labels
           loc='upper right',                       # legend position
           facecolor='pink')

plt.savefig('pics/demo_bar_plot_lgnd.png', bbox_inches='tight', dpi=300)
plt.show()

### Plot specific manipulations

The above manipulations apply to the major plot types, including line plots, bar plots, box plots, etc. However, plot-specific manipulations can also help deliver the message. For example, a horizontal bar plot can benefit from highlighting the bar of interest, listing the bars in descending order, and adding the numeric values to the bars. The data in my example are already sorted from largest to smallest, so inverting the y axis is enough to get the descending order.

```python
python_ix = programming_languages.index('Python')  # find the Python bar position
handle_bars[python_ix].set_color('green')          # set the Python bar to different color
ax.invert_yaxis()                                  # make bars display from top to bottom vs the default
for pos, val in enumerate(past_year_use):          # add the exact numbers to the bars
    ax.text(val, pos, f'{val:.0f}% ',
            verticalalignment='center', horizontalalignment='right',
            color='white', fontweight='bold')
plt.box(False)                                     # turn off the box
```
<div>
<img src="pics/demo_bar_plot_custom.png" width="500"/>
</div>
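
When the input is not sorted yet, the order has to be fixed before plotting. The sketch below sorts a shuffled subset of the same numbers and, as an alternative to the manual `ax.text` loop, labels the bars with `Axes.bar_label`, which I assume is available (it was added in matplotlib 3.4). The dictionary `usage_by_language` is made up for the demo.

```python
import matplotlib.pyplot as plt

# unsorted input: a subset of the survey numbers, shuffled for the demo
usage_by_language = {'Python': 48.07, 'JavaScript': 65.36,
                     'SQL': 49.43, 'HTML/CSS': 55.08}

# sort by value, largest first, then split into two parallel tuples
ranked = sorted(usage_by_language.items(), key=lambda item: item[1], reverse=True)
languages, usage = zip(*ranked)

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.barh(languages, usage)
ax.invert_yaxis()                                            # largest bar on top
value_labels = ax.bar_label(bars, fmt='%.0f%%', padding=3)   # numbers next to the bars
```

By default `bar_label` puts the numbers just outside the bar ends; the manual loop above gives finer control, for example white text inside the bars.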

In [None]:
# plot preparation
fig = plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
ax = plt.gca()  # get the axis from the plot
# plotting
handle_bars = plt.barh(programming_languages, past_year_use)  # the plot

plt.title('Less than half developers used Python\n'
          'in their projects in the past year')
plt.xlabel('Percentage of developers')
plt.ylabel('Programming Language')

plt.grid()              # show the grid
ax = plt.gca()          # get the axis from the plot
ax.set_axisbelow(True)  # set the grid lines behind the value bars

add_percentage = lambda value, tick_number: f"{value:.0f}%"       # function that adds % to tick value
ax.xaxis.set_major_formatter(plt.FuncFormatter(add_percentage))   # apply the tick modifier to x axis

handle_vline = plt.axvline(50, color='m', ls='--')  # 50% cutoff vertical line (axhline for horizontal)
plt.legend([handle_vline],                          # plot handles
           ['majority cutoff'],                     # plot labels
           loc='lower right')                       # legend position

# plot specific modifications
python_ix = programming_languages.index('Python')  # find the Python bar position
handle_bars[python_ix].set_color('green')          # set the Python bar to different color
ax.invert_yaxis()                                  # make bars display from top to bottom vs the default
# add the exact numbers to the bars
for pos, val in enumerate(past_year_use):
    ax.text(val, pos, f'{val:.0f}% ',
            verticalalignment='center', horizontalalignment='right',
            color='white', fontweight='bold')
plt.box(False)                                     # turn off the box

plt.savefig('pics/demo_bar_plot.png', bbox_inches='tight', dpi=300)
plt.show()

## Making plots faster to set up
The above example showed that a decent amount of fine tuning can be done with standard commands, though there are a lot of them. Memorizing that code is not the biggest problem. The issue is the time spent putting the same lines together, with slight modifications, for each plot. I will lay out a number of ways to reduce the memorization effort and the typing time.

### Change the defaults
Default plot settings can be changed in one place and propagated to all the downstream figures in a Jupyter notebook. For example, the default grid behavior and the default figure size can be changed in the following way:

```python
plt.rcParams['figure.figsize'] = [5, 3]
plt.rcParams['axes.axisbelow'] = True
plt.rcParams['axes.grid'] = True
```

The full list of the settings, options, and the defaults can be found <a href="https://matplotlib.org/stable/tutorials/introductory/customizing.html">here</a>.

This saves coding time, but it still requires looking up the `rcParams` names, adding these settings to each new notebook, and typing the content-specific commands.
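
If the changed defaults should apply only to some of the figures, the same settings can be scoped with `plt.rc_context`. The sketch below, on a shortened data set, records the figure size inside the block and checks that the global default is back afterwards.

```python
import matplotlib.pyplot as plt

default_figsize = list(plt.rcParams['figure.figsize'])   # remember the global default

overrides = {'figure.figsize': (5, 3),
             'axes.axisbelow': True,
             'axes.grid': True}

# the overrides are active only inside the with block
with plt.rc_context(overrides):
    fig, ax = plt.subplots()
    ax.barh(('SQL', 'Python'), (49.43, 48.07))
    size_inside = tuple(fig.get_size_inches())

restored_figsize = list(plt.rcParams['figure.figsize'])  # back to the default here
```

Keeping the overrides in one dictionary also makes them easy to copy between notebooks.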

### Copy and paste
This approach takes care of the content-specific commands. Just save a template somewhere and copy it every time you need to finish up a plot. The content-specific parts are easy to modify without memorizing all the necessary commands. The template below works well for the majority of my cases:

```python
# plot preparation
plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
ax = plt.gca()
# plotting

# fine tuning
plt.title('Title')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
ax.set_axisbelow(True)
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda value, tick_number: f"{value}" ))
handle_vline = plt.axvline(50, color='r')
plt.legend([handle_vline], ['reference'], loc='best')
plt.savefig('pics/file_name.png', bbox_inches='tight', dpi=150)
plt.show()
```

The template can be modified if some other setting gets used a lot.
The major inconvenience of this approach is carrying the template around in a separate file and pulling it from there every time it is needed. Not the end of the world, but annoying.
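
A variation on the same idea is to wrap the template in a function, so that the content-specific parts become arguments. This is only a sketch: the name `decorate_axes` and its parameters are hypothetical, and the function would still have to live in a file or module that every notebook can import.

```python
import matplotlib.pyplot as plt


def decorate_axes(ax, title, xlabel, ylabel, reference=None, reference_label='reference'):
    """Apply the usual decorations to an existing Axes (hypothetical helper)."""
    ax.set(title=title, xlabel=xlabel, ylabel=ylabel)
    ax.grid(True)
    ax.set_axisbelow(True)                 # grid lines behind the plot elements
    if reference is not None:              # optional vertical reference line
        handle_vline = ax.axvline(reference, color='r')
        ax.legend([handle_vline], [reference_label], loc='best')
    return ax


fig, ax = plt.subplots(figsize=(5, 3))
ax.barh(('SQL', 'Python'), (49.43, 48.07))   # shortened data for the demo
decorate_axes(ax, 'Title', 'x', 'y', reference=50)
```

The trade-off is flexibility: a pasted template can be edited line by line, while a function only exposes what its arguments allow.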

### Share across Jupyter notebooks with magics

The `%macro` magic captures a piece of code for future retrieval or execution. Just name the pattern and point the magic to the cell that holds the corresponding lines. Say, the template code is in cell `10` and I picked the name `fine_tune` for it. The following command will associate the code with the name:
```
%macro -q -r fine_tune 10
```
From this moment on the code can be retrieved anywhere in the notebook by typing:
```
%load fine_tune
```

The next step is to make the template available to other notebooks. The `%store` magic does the job. Its primary use is to share variables between notebooks, and the `fine_tune` macro is just another such variable. The following line saves the macro to IPython's persistent storage:
```
%store fine_tune
```
The following command makes the template available in the notebook where it needs to be used:
```
%store -r fine_tune
```
After that, run `%load fine_tune` whenever you are putting a plot together.

In [None]:
# plot preparation
plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
ax = plt.gca()
# plotting

# # fine tuning
# plt.title('Title')
# plt.xlabel('x')
# plt.ylabel('y')
# plt.grid()
# ax.set_axisbelow(True)
# ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda value, tick_number: f"{value}" ))
# handle_vline = plt.axvline(50, color='r')
# plt.legend([handle_vline], ['reference'], loc='best')
# plt.savefig('pics/file_name.png', bbox_inches='tight', dpi=150)
# plt.show()

In [None]:
%macro -q -r fine_tune 10
%store fine_tune

### Extra credit for the extra ~~lazy~~ efficient ones

For those who would go the extra mile to make the templates available without typing `%store -r` for each of them in every new notebook, there is a way. The command can be executed as part of the configuration script that runs when a notebook kernel starts. Run the following code to check whether the script exists:
```python
import os.path

ipython = !ipython locate
file_name = f'{ipython[0]}/profile_default/ipython_config.py'
os.path.isfile(file_name)
```
If the script does not exist (the last command returned `False`) the following command creates it:
```python
!ipython profile create
```
Edit the file at `file_name` as follows:
- find the line that contains `c.InteractiveShellApp.exec_lines`,
- uncomment it if it is commented,
- put the lines of code to be executed in the square brackets.

In my example, the line will turn from 
```python
# c.InteractiveShellApp.exec_lines = []
```
to
```python
c.InteractiveShellApp.exec_lines = [
    '%store -r fine_tune'
]
```

The following script will automate the process of adding new templates to the existing ones.

In [None]:
import os.path

new_template_names = ['fine_tune']

# config folder path
ipython = !ipython locate
file_name = f'{ipython[0]}/profile_default/ipython_config.py'
# search patterns for commented and uncommented lines
exec_line_default = '# c.InteractiveShellApp.exec_lines = []\n'
exec_line = 'c.InteractiveShellApp.exec_lines = [\n'

# create the config files if not there
if not os.path.isfile(file_name):
    !ipython profile create

# read the config file
with open(file_name) as f:
    lines = f.readlines()

# uncommenting the part with line execution if commented
if exec_line_default in lines:
    # find the commented line
    setting_position = lines.index(exec_line_default)
    # replace with uncommented
    lines[setting_position] = exec_line
    # add the closing bracket
    lines.insert(setting_position+1, ']\n')

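# the opening line of the setting must exist at this point; inserting at a
# guessed position would corrupt the config file
if exec_line not in lines:
    raise ValueError(f'multi-line exec_lines setting not found in {file_name}')
insert_position = lines.index(exec_line)

# add the template loading lines
for template_name in new_template_names:
    # shape the storage request line
    new_line = f"    '%store -r {template_name}',\n"
    # add the line if not already there
    if new_line not in lines:
        insert_position += 1
        lines.insert(insert_position, new_line)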

# write the updated config file
with open(file_name, mode='w') as f:
    f.writelines(lines)