# How to save time on data visualization

A picture is worth a thousand words. Good visualizations speak for themselves and save stakeholders' expensive time when a data project is presented. A data professional who saves their own time while preparing the graphs wastes the time and attention of the audience on explanations instead of using it to discuss the implications. Though tempting, such a tradeoff can push back a decision on a critical project by weeks or months.


In this post, I will share pieces of code that make visualizations easier to read and approaches for saving time while putting that code together. The material is intended for data scientists who spend significant time exploring data and using the results to help stakeholders make business decisions. The tech stack is limited to Python's standard visualization library, `matplotlib`, in a Jupyter notebook.

## The last mile problem
The Python community has a number of powerful visualization libraries.
<a href="https://seaborn.pydata.org/">Seaborn</a>, <a href="https://plotly.com/">Plotly</a>, <a href="https://altair-viz.github.io/">Altair</a>, <a href="https://www.pygal.org">Pygal</a>, and <a href="https://plotnine.readthedocs.io/">Plotnine</a> are just a few of them.
I get very excited reading about their capabilities. However, the resulting plot still needs fine tuning no matter how sophisticated the visualization function is. This last mile from a flashy picture to an informative graph requires the most effort. Say I found a function from a fancy library that fits my data perfectly and renders the following plot, which I saw in <a href="https://medium.com/@jordan.johnson_35348/is-ruby-on-rails-on-the-next-train-out-cc580d9d0f14">this article</a>:
<div>
<img src="pics/flashy_plot.jpeg" width="400"/>
</div>
Looks slick, but I will have to use words to explain what was measured for each of the programming languages. Adding an x label or writing a title is typically possible but requires a dive into the documentation of that specific library. The next time the data fit a function from a different library better, I will have to dive in again. Not impossible, but tedious.

Instead, I want to concentrate on the tricks that cover the last mile using the standard `matplotlib` library. There are two reasons for that:
<ol>
  <li> <b>Inheritance</b>. Some visualization libraries like Seaborn use <code>matplotlib</code> as a backbone. This means that the generated image can then be finished off in the same way as if it was generated with the standard library itself.
  </li>
  <li> <b>Sufficiency</b>. The standard plotting library provides the majority of the required functionality. The result might not be as flashy or interactive but has all the means for bringing the point across efficiently. For example, the above visualization can be delivered with the standard library in the way shown below.
  </li>
</ol>

```python
import matplotlib.pyplot as plt

programming_languages = ('JavaScript',
                         'HTML/CSS',
                         'SQL',
                         'Python',
                         'TypeScript',
                         'Node.js',
                         'Java',
                         'C#')

past_year_use = (68.62, 55.9, 50.73, 41.53, 36.42, 36.19, 34.51, 29.81)

plt.barh(programming_languages, past_year_use)
plt.show()
```

The bare-bones plot from the standard library looks even less informative. The following section covers how to make this plot easy to read, which adds extra code to the setup. The section after that shows ways to save time on typing the same decoration commands over and over again.

## Making plots easier to read

### Figure size
The size of the visualization should be proportional to the amount of information we want to convey. Stretching a few bars across a plot of the default size only eats up space without adding any value, as in the following example from <a href="https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_col.html#two-variable-bar-plot">here</a>:
<div>
<img src="pics/thick_bars.png" width="400"/>
</div>
Overcrowded plots will benefit from increasing the figure size.

The following piece of code placed before the plotting function will set the figure width to 5 inches and height to 3 inches:
```python
plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
```
<div>
<img src="pics/demo_bar_plot_fs.png" width="500"/>
</div>
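
One way to keep the size proportional to the amount of information is to derive the height from the number of bars. The sketch below assumes 0.4 inches per bar, a value I picked for illustration only, and a hypothetical variable name `inches_per_bar`:

```python
import matplotlib.pyplot as plt

programming_languages = ('JavaScript', 'HTML/CSS', 'SQL', 'Python',
                         'TypeScript', 'Node.js', 'Java', 'C#')
past_year_use = (68.62, 55.9, 50.73, 41.53, 36.42, 36.19, 34.51, 29.81)

inches_per_bar = 0.4  # assumed value, tune it to taste
fig_height = inches_per_bar * len(programming_languages)

fig = plt.figure(figsize=(5, fig_height))  # (width, height) in inches
plt.barh(programming_languages, past_year_use)
```

With eight bars this gives a figure about 3.2 inches tall, close to the fixed size used above, and the height grows automatically when more categories are added.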

### Title and axis labels

The most crucial piece of information is what the plot shows. Assuming that this is obvious typically leads to confusion, misinterpretations, unnecessary questions, and lengthy explanations. A simple remedy is to always label the axes and state the takeaway of the plot in the title.
Say I want to get across the point that, contrary to my expectations, Python is not as widespread in developers' projects as some other programming languages. Then I would set the title and the axis labels as follows:

```python
plt.title('Less than half developers used Python\n'
          'in their projects in the past year')
plt.xlabel('Percentage of developers')
plt.ylabel('Programming Language')
```
<div>
<img src="pics/demo_bar_plot_ttl.png" width="500"/>
</div>

### Grid lines
Grid lines help the reader perceive the differences between values, similar to how the lines on a football field help referees spot the differences between players' positions when calling offsides.
<div>
<img src="pics/offside.jpg" width="400"/>
</div>

The following line sets the grid:

```python
plt.grid()              # show the grid
```

However, by default the grid lines are drawn on top of the bars, which looks ugly. The fix requires getting the axes object of the plot and telling it to draw the grid below the plot elements:

```python
ax = plt.gca()          # get the axis from the plot
ax.set_axisbelow(True)  # set the grid lines behind the value bars
```
<div>
<img src="pics/demo_bar_plot_gl.png" width="500"/>
</div>
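
For a horizontal bar plot, only the vertical grid lines help compare bar lengths. As a small variation on the snippet above, the grid can be limited to one axis through the `axis` argument of `grid`. The shortened data and the use of `plt.subplots` here are my choices for a compact sketch:

```python
import matplotlib.pyplot as plt

languages = ('JavaScript', 'HTML/CSS', 'SQL', 'Python')
past_year_use = (68.62, 55.9, 50.73, 41.53)

fig, ax = plt.subplots(figsize=(5, 3))
ax.barh(languages, past_year_use)
ax.grid(True, axis='x')   # draw only the grid lines attached to the x axis
ax.set_axisbelow(True)    # keep the grid lines behind the bars
```

Passing `True` explicitly avoids the toggle behavior of a bare `grid()` call, which would switch the grid off if it was already enabled through the defaults.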

### Ticks formatting

Reading a numeric value takes two steps: reading the number from the tick and reading the measurement unit from the label. We can cut one mental step by adding the measurement unit to the tick value. The following trick defines a function that takes two arguments (tick value and tick position) and returns the replacement label. That function is then applied to the axis of interest (`xaxis` or `yaxis`):
```python
add_percentage = lambda value, tick_number: f"{value:.0f}%"       # function that adds % to tick value
ax.xaxis.set_major_formatter(plt.FuncFormatter(add_percentage))   # apply the tick modifier to x axis
```
<div>
<img src="pics/demo_bar_plot_tl.png" width="500"/>
</div>
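
For the specific case of percentages, `matplotlib.ticker` also ships a ready-made `PercentFormatter`, which can stand in for the custom function. A minimal sketch, assuming the values are already on a 0 to 100 scale (hence `xmax=100`):

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

fig, ax = plt.subplots(figsize=(5, 3))
ax.barh(('SQL', 'Python'), (50.73, 41.53))

# xmax is the value that corresponds to 100%, decimals=0 drops the fraction
percent_formatter = PercentFormatter(xmax=100, decimals=0)
ax.xaxis.set_major_formatter(percent_formatter)
```

The custom function remains the more general tool, since it works for any unit such as currency or time.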

### Reference points

Points of reference help the audience relate the data to expectations, whether it is a sales goal, an industry standard, the end of a quarter, or something else of high importance. Adding them explicitly saves effort in making sense of the plot.

I added a vertical line at 50% of developers so that it is easy to see that Python does not cross the line, and to get a feeling for by how much, without even looking at the numbers.

```python
plt.axvline(50, color='r')  # 50% cutoff vertical line (axhline for horizontal)
```
<div>
<img src="pics/demo_bar_plot_rp.png" width="500"/>
</div>
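
The horizontal counterpart works the same way. The sketch below marks a sales goal on a line plot with `axhline`. The monthly numbers and the goal are made up purely for illustration:

```python
import matplotlib.pyplot as plt

# made-up numbers, for illustration only
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
monthly_sales = [82, 91, 88, 104, 97, 110]
sales_goal = 100

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(months, monthly_sales, marker='o')
handle_hline = ax.axhline(sales_goal, color='r', linestyle='--')  # horizontal reference line
ax.set_ylabel('Units sold')
```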


### Legend

A legend is essential for referencing the different layers of a plot when there is more than one. The order of the layers can get mixed up, so individual layers or plots can be referenced through the corresponding handles. In my example, a legend entry for the bars themselves would take up space without adding value, but the reference line should be annotated. The location argument helps position the legend so that it does not obstruct areas of high importance.

```python
handle_vline = plt.axvline(50, color='r')  # 50% cutoff vertical line (axhline for horizontal)
plt.legend([handle_vline],                 # plot handles
           ['majority cutoff'],            # plot labels
           loc='upper right')              # legend position
```
<div>
<img src="pics/demo_bar_plot_lgnd.png" width="500"/>
</div>
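
An alternative to passing handles explicitly is to give the layer a `label` when it is created. `legend()` then picks up only the labeled layers, so the unlabeled bars stay out of the legend. A short sketch with a reduced data set:

```python
import matplotlib.pyplot as plt

languages = ('JavaScript', 'HTML/CSS', 'SQL', 'Python')
past_year_use = (68.62, 55.9, 50.73, 41.53)

fig, ax = plt.subplots(figsize=(5, 3))
ax.barh(languages, past_year_use)                   # no label, so no legend entry
ax.axvline(50, color='r', label='majority cutoff')  # labeled reference line
legend = ax.legend(loc='upper right')
```

Handles remain useful when the labels need to be decided after plotting or when the order of the legend entries matters.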

### Plot specific manipulations

The above manipulations apply to the major plot types, including line plots, bar plots, and box plots. However, plot-specific manipulations can also help deliver the message. For example, a horizontal bar plot can benefit from highlighting the bar of interest and listing the bars in descending order.

```python
handle_bars = plt.barh(programming_languages, past_year_use)  # keep the handle returned by the plotting call
python_ix = programming_languages.index('Python')  # find the Python bar position
handle_bars[python_ix].set_color('green')          # set bar to different color
ax.invert_yaxis()                                  # make bars display from top to bottom vs the default
```
<div>
<img src="pics/demo_bar_plot.png" width="500"/>
</div>
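
The data in this post happen to be sorted already. When they are not, the pairs can be sorted before plotting. The sketch below uses a deliberately shuffled subset of the data and plain Python sorting:

```python
import matplotlib.pyplot as plt

# deliberately shuffled subset of the data used in this post
programming_languages = ('Python', 'SQL', 'JavaScript', 'HTML/CSS')
past_year_use = (41.53, 50.73, 68.62, 55.9)

# sort (language, value) pairs by value, largest first
pairs = sorted(zip(programming_languages, past_year_use),
               key=lambda pair: pair[1], reverse=True)
sorted_languages, sorted_use = zip(*pairs)

fig, ax = plt.subplots(figsize=(5, 3))
handle_bars = ax.barh(sorted_languages, sorted_use)
handle_bars[sorted_languages.index('Python')].set_color('green')  # highlight Python
ax.invert_yaxis()  # largest value at the top
```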

### Save the plot

The last step before adding the plot to the presentation slides is saving it with enough resolution. The default value may or may not work for a given figure size. So it is our responsibility to make sure the final result looks good on the screen.

```python
plt.savefig('pics/demo_bar_plot.png', bbox_inches='tight', dpi=300)
```
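
`savefig` fails if the target folder does not exist, so it can help to create the folder first. The sketch below does that with `pathlib`. The file name `savefig_check.png` is a placeholder of mine:

```python
from pathlib import Path

import matplotlib.pyplot as plt

output_dir = Path('pics')                      # folder used in the example above
output_dir.mkdir(parents=True, exist_ok=True)  # create it if it is missing

fig, ax = plt.subplots(figsize=(5, 3))
ax.barh(('SQL', 'Python'), (50.73, 41.53))

output_path = output_dir / 'savefig_check.png'  # placeholder file name
fig.savefig(output_path, bbox_inches='tight', dpi=300)
```

As a rule of thumb, the pixel size is the figure size in inches times the dpi, so a 5 by 3 inch figure at 300 dpi comes out at roughly 1500 by 900 pixels before the tight bounding box trims or pads it.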

```python
# plot preparation
plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
ax = plt.gca()  # get the axis from the plot
# plotting
handle_bars = plt.barh(programming_languages, past_year_use)  # the plot
handle_bars[programming_languages.index('Python')].set_color('green')  # set bar of different color
ax.invert_yaxis()  # make bars display from top to bottom vs the default
# fine tuning
plt.title('Less than half developers used Python\n'
          'in their projects in the past year')
plt.xlabel('Percentage of developers')
plt.ylabel('Programming Language')
plt.grid()  # make the grid appear
ax.set_axisbelow(True)  # set the grid lines behind the value bars
add_percentage = lambda value, tick_number: f"{value:.0f}%"       # function that adds % to tick value
ax.xaxis.set_major_formatter(plt.FuncFormatter(add_percentage))   # apply the tick modifier to x axis
handle_vline = plt.axvline(50, color='r')  # 50% cutoff vertical line (axhline for horizontal)
plt.legend([handle_vline],                 # plot handles
           ['majority cutoff'],            # plot labels
           loc='lower right')              # legend position
plt.savefig('pics/demo_bar_plot.png', bbox_inches='tight', dpi=300)
plt.show()
```

## Making plots faster to set up
The above example showed that a decent amount of fine tuning can be standardized. Nevertheless, it requires a decent amount of code. Memorizing that code is not the biggest problem; the time spent typing the same lines with slight modifications for each plot is. I will lay out a number of ways to reduce both memorization and typing time.

### Change plot defaults
This approach saves time by changing the defaults once for every new plot within a Jupyter notebook. For example, the default grid behavior and the default figure size can be changed in the following way:

```python
plt.rcParams['figure.figsize'] = [5, 3]
plt.rcParams['axes.axisbelow'] = True
plt.rcParams['axes.grid'] = True
```

The full list of the settings, options, and the defaults can be found <a href="https://matplotlib.org/stable/tutorials/introductory/customizing.html">here</a>.
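
When the changed defaults should apply to a single plot rather than the whole notebook, the same settings can be passed to `plt.rc_context`, which restores the previous values on exit. A minimal sketch; the dictionary name `plot_defaults` is my own:

```python
import matplotlib.pyplot as plt

# same settings as above, collected in one place
plot_defaults = {
    'figure.figsize': (5, 3),
    'axes.axisbelow': True,
    'axes.grid': True,
}

with plt.rc_context(plot_defaults):
    # figures and axes created inside the block pick up the temporary defaults
    fig, ax = plt.subplots()
    ax.barh(('SQL', 'Python'), (50.73, 41.53))
    size_inside = tuple(fig.get_size_inches())
```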

This saves some coding, but everything else still has to be memorized and typed.

### Use jupyter macros

Macros make it possible to retrieve a block of predefined code, for example by loading it into a cell with <code>%load macro_name</code>. Titles, axis names, reference points, and other plot parameters change from figure to figure, but the code for setting them up stays the same. With a macro, I load a set of predefined commands and fill in the specifics instead of writing all the commands every time.

For example, the following set of commands is typical for my workflow. I can use all or some of them depending on the goal, but having them available right away to finish off a plot saves time.

```python
# plot preparation
plt.figure(figsize = (5,3))  # set the figure size (Width, Height)
ax = plt.gca()
# plotting

# fine tuning
plt.title('Title')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
ax.set_axisbelow(True)
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda value, tick_number: f"{value}" ))
handle_vline = plt.axvline(50, color='r')
plt.legend([handle_vline], ['reference'])
# plt.savefig('pics/plot_name.png', bbox_inches='tight', dpi=300)
plt.show()
```

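A macro stores the content of a cell so that it can be retrieved later. The first argument of <code>%macro</code> is the macro name and the second is the number of the cell to store, that is, the execution count shown next to the cell. The <code>-q</code> option keeps the definition quiet and <code>-r</code> stores the raw input as typed:
<code>%macro -q -r fine_tune 3</code>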

```python
%macro -q -r fine_tune 3
```

```python
%load fine_tune
```
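
Outside of Jupyter, or when the same template is needed across notebooks, a plain Python function can play a similar role. The sketch below is an alternative of mine rather than part of the macro workflow above. The function name `finish_plot` and its parameters are hypothetical:

```python
import matplotlib.pyplot as plt


def finish_plot(ax, title, xlabel, ylabel, reference=None,
                reference_label='reference'):
    """Apply the usual fine tuning to an existing axes (hypothetical helper)."""
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.grid(True)
    ax.set_axisbelow(True)  # grid lines behind the plot elements
    if reference is not None:
        handle_vline = ax.axvline(reference, color='r')  # vertical reference line
        ax.legend([handle_vline], [reference_label])
    return ax


fig, ax = plt.subplots(figsize=(5, 3))
ax.barh(('SQL', 'Python'), (50.73, 41.53))
finish_plot(ax, 'Title', 'x', 'y', reference=50)
```

The tradeoff is flexibility: a macro pastes editable code into the cell, while a function hides the commands behind a fixed set of arguments.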