In [1]:
# URL: https://datasciencelab.wordpress.com/2013/12/21/beautiful-plots-with-pandas-and-matplotlib/

In [2]:
"""
Data visualization plays a crucial role in communicating the results of data analyses, and it should always 
help transmit insights in an honest and clear way. Recently, the highly recommended blog Flowing Data posted a 
review of data visualization highlights during 2013, and at The Data Science Lab we felt like doing a bit of pretty 
plotting as well.

For Python lovers, matplotlib is the library of choice when it comes to plotting. Quite conveniently, the data
analysis library pandas comes equipped with useful wrappers around several matplotlib plotting routines, allowing 
for quick and handy plotting of data frames. Nice examples of plotting with pandas can be seen for instance in 
this IPython notebook. Still, for customized plots or less typical visualizations, the pandas wrappers need a 
bit of tweaking and some playing with matplotlib's inner machinery. If one is willing to devote a bit of time to 
googling and experimenting, very beautiful plots can emerge.
"""
# Import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.colors import LinearSegmentedColormap
from matplotlib.lines import Line2D

## Visualizing demographic data

In [3]:
"""
For this pre-Christmas data visualization table-top experiment we are going to use demographic data from 
countries in the European Union obtained from Wolfram|Alpha. Our data set contains information on population
(in millions), extension (land area in km2) and life expectancy (in years) in 24 European countries. We create 
a pandas data frame from three series that we simply construct from lists, setting the countries as the index 
of each series, and consequently of the data frame.
"""

countries = ['France','Spain','Sweden','Germany','Finland','Poland','Italy',
             'United Kingdom','Romania','Greece','Bulgaria','Hungary',
             'Portugal','Austria','Czech Republic','Ireland','Lithuania','Latvia',
             'Croatia','Slovakia','Estonia','Denmark','Netherlands','Belgium']
extensions = [547030,504782,450295,357022,338145,312685,301340,243610,238391,
              131940,110879,93028,92090,83871,78867,70273,65300,64589,56594,
              49035,45228,43094,41543,30528]
populations = [63.8,47,9.55,81.8,5.42,38.3,61.1,63.2,21.3,11.4,7.35,
               9.93,10.7,8.44,10.6,4.63,3.28,2.23,4.38,5.49,1.34,5.61,
               16.8,10.8]
life_expectancies = [81.8,82.1,81.8,80.7,80.5,76.4,82.4,80.5,73.8,80.8,73.5,
                    74.6,79.9,81.1,77.7,80.7,72.1,72.2,77,75.4,74.4,79.4,81,80.5]
data = {'extension' : pd.Series(extensions, index=countries), 
        'population' : pd.Series(populations, index=countries),
        'life expectancy' : pd.Series(life_expectancies, index=countries)}
df = pd.DataFrame(data)
df = df.sort_values('life expectancy', kind = 'quicksort')
df.head()

           extension  life expectancy  population
Lithuania      65300             72.1        3.28
Latvia         64589             72.2        2.23
Bulgaria      110879             73.5        7.35
Romania       238391             73.8        21.3
Estonia        45228             74.4        1.34
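
The same kind of frame can also be built in one step from a dictionary of plain lists that share an index. The following is a minimal sketch using only the first three countries from the lists above; the names `subset` and `small_df` are illustrative and not part of the original notebook. It also uses a stable sort (`kind = 'mergesort'`), an assumption made here so that countries with equal life expectancy keep their input order, which quicksort does not guarantee.

```python
import pandas as pd

# First three entries of the lists above (illustrative subset)
subset = ['France', 'Spain', 'Sweden']
small_df = pd.DataFrame(
    {'extension': [547030, 504782, 450295],    # area in km2
     'population': [63.8, 47, 9.55],           # millions of inhabitants
     'life expectancy': [81.8, 82.1, 81.8]},   # years
    index=subset)

# A stable sort keeps tied rows (France and Sweden) in their input order
small_df = small_df.sort_values('life expectancy', kind='mergesort')
print(small_df)
```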


In [4]:
"""
Now, thanks to the pandas plotting machinery, it is straightforward to show the contents of this 
data frame by calling the plot method of each column. The code below generates a figure with three subplots 
displayed vertically, each of which shows a bar plot for a particular column of the data frame. 
Each plot is titled with the name of its column, and the whole procedure takes only a few lines 
of code.
"""

fig, axes = plt.subplots(nrows = 3, ncols = 1, figsize = (12, 10))
for i, c in enumerate(df.columns):
    # df[c] is the column to plot; passing ax = axes[i] draws each of the
    # three columns in its own subplot, stacked one below the other
    df[c].plot(kind = 'bar', ax = axes[i], title = c)
plt.savefig('E01.png', bbox_inches = 'tight')

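As an alternative to looping over the columns, the data frame's own plot method can create the subplots itself. The sketch below assumes a small stand-in frame called `demo` (three rows copied from the table above) and selects a non-interactive backend so that it runs without a display; neither is part of the original notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import pandas as pd

# Small stand-in for df (first three rows of the sorted frame)
demo = pd.DataFrame({'extension': [65300, 64589, 110879],
                     'population': [3.28, 2.23, 7.35],
                     'life expectancy': [72.1, 72.2, 73.5]},
                    index=['Lithuania', 'Latvia', 'Bulgaria'])

# subplots = True draws every column in its own axes, stacked vertically
axes = demo.plot(kind='bar', subplots=True, figsize=(12, 10), legend=False)
print(len(axes))
```

The call returns an array with one axes object per column, so each subplot can still be customized individually afterwards.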
## Customization with matplotlib directives

In [5]:
"""
While this is an acceptable plot for the first steps of data exploration, the figure is not really 
publication-ready. It also looks very much “academic” and lacks that subtle flair that infographics in 
mainstream media have. Over the next paragraphs we will turn this plot into a much more beautiful object 
by playing around with the options that matplotlib supplies.

Let us first start by creating a figure and an axis object that will contain our subfigure:
"""
# Create a figure of given size
fig = plt.figure(figsize = (16, 12))
# Add a subplot
ax = fig.add_subplot(111)
# Set title
ttl = 'Population, size and life expectancy in the European Union'

In [6]:
"""
Colors are very important for data visualizations. By default, the matplotlib color palette offers solid hues,
which can be softened by applying transparencies. Similarly, the default colorbars can be customized to match
our taste (see below how one can define a custom-made color map with a gradient that changes softly from 
near-black to orange hues).
"""
# Set color transparency (0: transparent; 1: solid)
a = 0.7
# Create a list of RGB colors, one per row of df (24 countries), going from near-black to orange
customcmap = [(x/24.0,  x/48.0, 0.05) for x in range(len(df))]
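
The list above hard-codes the divisors 24 and 48, which works because the frame has 24 rows. A minimal sketch of the same gradient for an arbitrary number of rows follows; the helper name `make_gradient` is hypothetical and not part of the original notebook.

```python
def make_gradient(n):
    """Return n RGB tuples going from near-black to orange (hypothetical helper)."""
    # Red grows as x/n and green as x/(2n); blue stays fixed at 0.05
    return [(x / n, x / (2 * n), 0.05) for x in range(n)]

colors = make_gradient(24)
print(colors[0])    # darkest color, used for the first row of the frame
print(colors[-1])   # brightest color, used for the last row of the frame
```

Every component stays within [0, 1], which is what matplotlib expects for RGB tuples.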

In [7]:
"""
The main plotting instruction in our figure uses the pandas plot wrapper. In the initialization options,
we specify the type of plot (horizontal bar), the transparency, the color of the bars following the 
above-defined custom color map, the x-axis limits and the figure title. We also set the color of the bar
borders to white for a cleaner look.
"""
# Plot the 'population' column as horizontal bar plot
df['population'].plot(kind = 'barh', ax = ax, alpha = a, legend = False, color = customcmap, 
                     edgecolor = 'w', xlim = (0, max(df['population'])), title = ttl)

<matplotlib.axes._subplots.AxesSubplot at 0x7f770d0c0550>

In [8]:
"""
After this simple pandas plot directive, the figure already looks very promising. Note that, because we 
sorted the data frame by life expectancy and applied a gradient color map, the color of the different bars 
in itself carries information. We will explicitly label that information below when constructing a color bar.
For now we want to remove the grid, frame and axes lines from our plot, as well as customize its title and x,y
axes labels.
"""
# Remove grid lines (dotted lines inside plot)
ax.grid(False)
# Remove plot frame
ax.set_frame_on(False)
# Pandas trick: remove weird dotted line on axis
##ax.lines[0].set_visible(False)

# Customize title, set position, allow space on top of plot for title
ax.set_title(ax.get_title(), fontsize = 26, alpha = a, ha = 'left')
plt.subplots_adjust(top = 0.9)
ax.title.set_position((0, 1.08))

# Set x axis label on top of plot, set label text
ax.xaxis.set_label_position('top')
xlab = 'Population (in millions)'
ax.set_xlabel(xlab, fontsize = 20, alpha = a, ha = 'left')
ax.xaxis.set_label_coords(0, 1.04)

# Position x tick labels on top
ax.xaxis.tick_top() # Move the x tick labels to the top
# Remove tick lines in x and y axes
ax.yaxis.set_ticks_position('none')
ax.xaxis.set_ticks_position('none')

# Customize x tick labels
xticks = [5, 10, 20, 50, 80]
ax.xaxis.set_ticks(xticks)
ax.set_xticklabels(xticks, fontsize = 16, alpha = a)


# Customize y tick labels
yticks = [item.get_text() for item in ax.get_yticklabels()]
ax.set_yticklabels(yticks, fontsize = 10, alpha = a)
ax.yaxis.set_tick_params(pad = 12)

In [9]:
"""
So far, the lengths of our horizontal bars display the population (in millions) of the EU countries. 
All bars have the same height (which pandas sets by default to 50% of the space available to each bar). 
An interesting idea is to use the height of the bars to display further data. If we could make the bar height
depend on, say, the countries' extension (area), we would be adding a supplementary piece of information to the
plot. This is possible in matplotlib by accessing the elements that contain the bars and assigning them a
specific height in a for loop. Each bar is an element of the class Rectangle, and all the corresponding class 
methods can be applied to it. To assign a height according to each country's extension, we code a 
simple linear interpolation and create a lambda function to apply it.
"""
# Set bar height dependent on country extension
# Set min and max bar thickness (between 0 and 1)
hmin, hmax = 0.3, 0.9
xmin, xmax = min(df['extension']), max(df['extension'])
# Function that interpolates linearly between hmin and hmax
f = lambda x: hmin + (hmax - hmin) * (x - xmin) / (xmax - xmin)
# Make array of heights
hs = [f(x) for x in df['extension']]

# Iterate over bars
for container in ax.containers:
    # Each bar has a Rectangle element as child
    for i, child in enumerate(container.get_children()):
        # Reset the lower left point of each bar so that its midpoint stays where it was
        child.set_y(child.get_y() + child.get_height() / 2 - hs[i] / 2)
        # Attribute height to each Rectangle according to country's size
        plt.setp(child, height = hs[i])
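
The linear interpolation can be checked in isolation. The sketch below assumes three sample areas taken from the `extensions` list (smallest, one intermediate value and largest) and compares the explicit formula with NumPy's `np.interp`; the names `areas`, `heights` and `manual` are illustrative.

```python
import numpy as np

hmin, hmax = 0.3, 0.9
# Smallest, an intermediate and the largest area (km2) from the data above
areas = np.array([30528, 65300, 547030])

# np.interp maps [areas.min(), areas.max()] linearly onto [hmin, hmax]
heights = np.interp(areas, [areas.min(), areas.max()], [hmin, hmax])

# Same result as the explicit formula used in the lambda
manual = hmin + (hmax - hmin) * (areas - areas.min()) / (areas.max() - areas.min())
print(heights)
```

The smallest country gets the minimum thickness and the largest the maximum. A bar of height h whose midpoint sits at c has its lower edge at c - h / 2, which is the quantity the loop needs when re-centering each Rectangle.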

In [10]:
"""
Having added this "dimension" to the plot, we need a way of labelling the information so that the countries' 
extension is understandable. A legend would be the ideal solution, but since our plotting directive was set to
display the column ['population'], we cannot use the default. We can construct a "fake" legend though, and 
customize its handles to roughly match the height of the bars. We position the legend in the lower right 
part of our plot.
"""
# Legend
# Create fake labels for legend
l1 = Line2D([], [], linewidth = 6, color = 'k', alpha = a)
l2 = Line2D([], [], linewidth = 12, color = 'k', alpha = a)
l3 = Line2D([], [], linewidth = 22, color = 'k', alpha = a)
# Set three legend labels to be min, mean and max of countries' extensions
# (rounded to the nearest 10k km2)
rnd = 10000
labels =[str(int(round(l / rnd) * rnd )) for l in (min(df['extension']), np.mean(df['extension']), max(df['extension']))]
# Position legend in lower right part
# Set ncol = 3 for horizontally expanding legend
leg = ax.legend([l1, l2, l3], labels, ncol = 3, frameon = False, fontsize = 16, 
               bbox_to_anchor = [1.1, 0.11], handlelength = 2,
               handletextpad = 1, columnspacing = 2, title = 'Size (in km2)')

# Customize legend title
# Set position to increase space between legend and labels
plt.setp(leg.get_title(), fontsize = 20, alpha = a)
leg.get_title().set_position((0, 10))
# Customize transparency for legend labels
for label in leg.get_texts():
    plt.setp(label, alpha = a)
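
The three legend labels can be reproduced without any plotting. This sketch recomputes them from the `extensions` list with plain Python; the names `mean_ext` and `v` are illustrative.

```python
extensions = [547030, 504782, 450295, 357022, 338145, 312685, 301340, 243610, 238391,
              131940, 110879, 93028, 92090, 83871, 78867, 70273, 65300, 64589, 56594,
              49035, 45228, 43094, 41543, 30528]

rnd = 10000  # round to the nearest 10,000 km2
mean_ext = sum(extensions) / len(extensions)

# Minimum, mean and maximum area, each rounded and converted to a string
labels = [str(int(round(v / rnd) * rnd))
          for v in (min(extensions), mean_ext, max(extensions))]
print(labels)  # ['30000', '180000', '550000']
```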

In [11]:
"""
Finally, there is another piece of information in the plot that needs to be labelled, and that 
is the color map indicating the average life expectancy in the EU countries. Since we used a 
custom-made color map, the regular call to plt.colorbar() would not work. We need to create a 
LinearSegmentedColormap instead and “trick” matplotlib to display it as a colorbar. Then we can use 
the usual customization methods from colorbar to set fonts, transparency, position and size of the various 
elements in the color legend.
"""
# Create a fake colorbar
ctb = LinearSegmentedColormap.from_list('custombar', customcmap, N = 2048)
# Trick from http://stackoverflow.com/questions/8342549/
# matplotlib-add-colorbar-to-a-sequence-of-line-plots
sm = plt.cm.ScalarMappable(cmap = ctb, norm = mpl.colors.Normalize(vmin = 72, vmax = 84))
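# Give the scalar mappable an empty array so that colorbar accepts it
sm.set_array([])

# Set colorbar transparency (matching the bars), aspect ratio and size; attach it to our axis
cbar = plt.colorbar(sm, ax = ax, alpha = a, aspect = 16, shrink = 0.4)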
cbar.solids.set_edgecolor('face')
# Remove colorbar container frame
cbar.outline.set_visible(False)
# Fontsize for colorbar ticklabels
cbar.ax.tick_params(labelsize = 16)
# Customize colorbar ticks and tick labels
mytks = list(range(72, 86, 2))
cbar.set_ticks(mytks)
cbar.ax.set_yticklabels([str(t) for t in mytks], alpha = a)
 
# Colorbar label, customize fontsize and distance to colorbar
cbar.set_label('Life expectancy (in years)', alpha=a, 
               rotation=270, fontsize=20, labelpad=20)
# Remove color bar tick lines, while keeping the tick labels
cbarytks = plt.getp(cbar.ax.axes, 'yticklines')
plt.setp(cbarytks, visible=False)

# Save figure in png with tight bounding box
plt.savefig('EU.png', bbox_inches='tight', dpi=300)
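
To see what the "fake" colorbar actually encodes, the mapping from life expectancy to color can be inspected without drawing anything. The sketch below rebuilds the color list and the normalization from the cells above and looks up the colors at both ends of the range; the names `colors`, `low` and `high` are illustrative.

```python
from matplotlib.cm import ScalarMappable
from matplotlib.colors import LinearSegmentedColormap, Normalize

# Same color list and value range as in the notebook cells above
colors = [(x / 24.0, x / 48.0, 0.05) for x in range(24)]
ctb = LinearSegmentedColormap.from_list('custombar', colors, N=2048)
sm = ScalarMappable(norm=Normalize(vmin=72, vmax=84), cmap=ctb)

# RGBA colors for the lowest and highest life expectancy on the colorbar
low = sm.to_rgba(72.0)
high = sm.to_rgba(84.0)
print(low, high)
```

Note that the bars themselves are colored by rank (row position in the sorted frame), while the colorbar maps the gradient linearly onto 72 to 84 years, so the colorbar is only an approximate key to the bar colors.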