We decided to graphically represent the journey of private organisations through different funding schemes using a Sankey diagram.

Plot.ly can create Sankey diagrams, but it is difficult to get the data into the format that plot.ly accepts, so a declarative data visualisation library was needed.
Plot.ly Sankey link: https://plot.ly/python/sankey-diagram/

Altair and Holoviews were the options considered. Only Holoviews supports Sankey diagrams, so it was selected: https://holoviews.org/reference/elements/bokeh/Sankey.html

To install Holoviews:

In [None]:
conda install -c pyviz holoviews bokeh

In [46]:
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
hv.extension('bokeh')

The easiest way to define a Sankey element is to define a list of edges and their associated quantities:

In [None]:
sankey = hv.Sankey([
    ['A', 'X', 5],
    ['A', 'Y', 7],
    ['A', 'Z', 6],
    ['B', 'X', 2],
    ['B', 'Y', 9],
    ['B', 'Z', 4]]
)
sankey.opts(width=600, height=400)

In our case this would look something like:

In [92]:
sankey = hv.Sankey([
    ['Voucher', 'Loans', 5],
    ['Voucher', 'Equity', 7],
    ['Voucher', 'Z', 6],
    ['Voucher', 'Loans', 50],
    ['Grant', 'Voucher', 9],
    ['Grant', 'Z', 4]]
)
sankey.opts(width=600, height=400)
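The list above contains the Voucher to Loans pair twice (5 and 50). If repeated pairs are meant to be a single flow, they could be summed before building the diagram. The following is a minimal sketch in plain Python; the helper name merge_duplicate_edges is made up for illustration and is not part of Holoviews.

```python
from collections import defaultdict

# Edge list copied from the cell above; 'Voucher' -> 'Loans' appears twice.
raw_edges = [
    ['Voucher', 'Loans', 5],
    ['Voucher', 'Equity', 7],
    ['Voucher', 'Z', 6],
    ['Voucher', 'Loans', 50],
    ['Grant', 'Voucher', 9],
    ['Grant', 'Z', 4],
]


def merge_duplicate_edges(edges):
    """Sum the values of rows that share the same (source, target) pair.

    Hypothetical helper for illustration only.
    """
    totals = defaultdict(int)
    for source, target, value in edges:
        totals[(source, target)] += value
    # Dicts keep insertion order, so each pair stays where it first appeared.
    return [[source, target, value] for (source, target), value in totals.items()]


merged_edges = merge_duplicate_edges(raw_edges)
# merged_edges could then be passed to hv.Sankey(...) in place of the raw list.
```

Whether to merge duplicates depends on what each row represents; if every row is a separate cohort that should stay visible, the raw list is the better input.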

In [43]:
nodes = ["CSA - Coordination and support action", "CP - Collaborative project (generic)",  
         "NoE - Network of Excellence", "CP-FP - Small or medium-scale focused research project",
         "CSA-SA - Support actions",  "CSA-CA - Coordination (or networking) actions",  
         "CP-IP - Large-scale integrating project", "MSCA-IF-EF-ST - Standard EF", 
         "MC-IEF - Intra-European Fellowships (IEF)"]
nodes = hv.Dataset(enumerate(nodes), 'index', 'label')
edges = [
    (0, 1, 53), (0, 2, 47), (2, 6, 17), (2, 3, 30), (3, 1, 22), (3, 4, 10), 
    (3, 6, 4), (4, 5, 4), (0, 2, 20), (2, 8, 20), (0, 8, 13), (0, 7, 15)
]

value_dim = hv.Dimension('Numbers')
careers = hv.Sankey((edges, nodes), ['From', 'To'], vdims=value_dim)

careers.opts(
    opts.Sankey(labels='label', label_position='right', width=900, height=300, cmap='Set1',
                edge_color=dim('To').str(), node_color=dim('index').str()))

Original example of the above:

In [41]:
nodes = ["PhD", "Career Outside Science",  "Early Career Researcher", "Research Staff",
         "Permanent Research Staff",  "Professor",  "Non-Academic Research"]
nodes = hv.Dataset(enumerate(nodes), 'index', 'label')
edges = [
    (0, 1, 53), (0, 2, 47), (2, 6, 17), (2, 3, 30), (3, 1, 22.5), (3, 4, 3.5), (3, 6, 4.), (4, 5, 0.45)   
]

value_dim = hv.Dimension('Percentage', unit='%')
careers = hv.Sankey((edges, nodes), ['From', 'To'], vdims=value_dim)

careers.opts(
    opts.Sankey(labels='label', label_position='right', width=900, height=300, cmap='Set1',
                edge_color=dim('To').str(), node_color=dim('index').str()))
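Both examples above use integer node indices in the edge list, with the labels held separately in the nodes dataset. When the data starts out as label pairs (as in our funding scheme tables), the integer form has to be derived first. A minimal sketch of that conversion in plain Python follows; index_edges is a hypothetical helper name, and the three rows are borrowed from the careers example purely as sample input.

```python
def index_edges(labelled_edges):
    """Turn (source_label, target_label, value) rows into integer-indexed edges.

    Hypothetical helper: returns the node labels in order of first
    appearance, plus edges that refer to nodes by position in that list.
    """
    labels = []    # node labels in order of first appearance
    position = {}  # label -> integer index
    edges = []
    for source, target, value in labelled_edges:
        for label in (source, target):
            if label not in position:
                position[label] = len(labels)
                labels.append(label)
        edges.append((position[source], position[target], value))
    return labels, edges


labelled = [
    ("PhD", "Career Outside Science", 53),
    ("PhD", "Early Career Researcher", 47),
    ("Early Career Researcher", "Research Staff", 30),
]
node_labels, int_edges = index_edges(labelled)

# The results slot into the pattern used in the cells above:
#   nodes = hv.Dataset(enumerate(node_labels), 'index', 'label')
#   hv.Sankey((int_edges, nodes), ['From', 'To'], vdims=value_dim)
```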

Attempt at reading the data table in from a CSV file:

In [59]:
#edges = pd.read_csv('Holoviewstest.csv')
#edges.head(5)

Unnamed: 0,source,target,value
0,CSA - Coordination and support action,MC-IIF - International Incoming Fellowships (IIF),125
1,CSA - Coordination and support action,ERC-SG - ERC Starting Grant,19
2,CSA - Coordination and support action,"MC-COFUND - Co-funding of Regional, National a...",46
3,NoE - Network of Excellence,ERC-SG - ERC Starting Grant,99
4,NoE - Network of Excellence,"MC-COFUND - Co-funding of Regional, National a...",166


In [60]:
#sankey = hv.Sankey(edges, label='Funding Schemes')
#sankey.opts(label_position='left', edge_color='target', node_color='index', cmap='tab20')

Running the above example with our first test data (trial-and-error testing):

In [89]:
edges = pd.read_csv('test123.csv')
edges.head(5)

Unnamed: 0,source,target,value
0,z,a,1
1,y,b,1
2,x,c,5
3,z,a,6
4,y,b,5


In [90]:
sankey = hv.Sankey(edges, label='Funding Schemes')
sankey.opts(label_position='left', edge_color='target', node_color='index', cmap='tab20')

Testing the funding schemes defined directly in code, as opposed to reading them from a CSV file:

In [94]:
sankey = hv.Sankey([
    ['BBI-CSA - Bio-based Industries Coordination and Support action', 'BBI-IA-DEMO - Bio-based Industries Innovation action - Demonstration', 1],
    ['BBI-CSA - Bio-based Industries Coordination and Support action', 'IA - Innovation action', 1],
    ['BBI-IA-DEMO - Bio-based Industries Innovation action - Demonstration', 'IA - Innovation action', 5],
    ['BBI-IA-DEMO - Bio-based Industries Innovation action - Demonstration', 'BBI-RIA - Bio-based Industries Research and Innovation action', 6],
    ['BBI-IA-DEMO - Bio-based Industries Innovation action - Demonstration', 'IA - Innovation action', 5],
    ['BBI-IA-DEMO - Bio-based Industries Innovation action - Demonstration', 'MSCA-ITN-ETN - European Training Networks', 1]]
)
sankey.opts(width=600, height=400)

Same data as above but in .csv:

In [95]:
edges = pd.read_csv('codingequivalent.csv')
edges.head(5)

Unnamed: 0,source,target,value
0,BBI-CSA - Bio-based Industries Coordination an...,BBI-IA-DEMO - Bio-based Industries Innovation ...,1
1,BBI-CSA - Bio-based Industries Coordination an...,IA - Innovation action,1
2,BBI-IA-DEMO - Bio-based Industries Innovation ...,IA - Innovation action,5
3,BBI-IA-DEMO - Bio-based Industries Innovation ...,BBI-RIA - Bio-based Industries Research and In...,6
4,BBI-IA-DEMO - Bio-based Industries Innovation ...,IA - Innovation action,5


In [96]:
sankey = hv.Sankey(edges, label='Funding Schemes')
sankey.opts(label_position='left', edge_color='target', node_color='index', cmap='tab20')

The above example works for the first 6 rows in both the inline version and the CSV version. Now an attempt with more data:

In [106]:
edges = pd.read_csv('RemoveAcyclic.csv')
edges.head(5)

Unnamed: 0,source,target,value
0,BBI-CSA - Bio-based Industries Coordination an...,BBI-IA-DEMO - Bio-based Industries Innovation ...,1
1,BBI-CSA - Bio-based Industries Coordination an...,IA - Innovation action,1
2,BBI-IA-DEMO - Bio-based Industries Innovation ...,IA - Innovation action,5
3,BBI-IA-DEMO - Bio-based Industries Innovation ...,IA - Innovation action,6
4,BBI-IA-DEMO - Bio-based Industries Innovation ...,IA - Innovation action,5


In [107]:
sankey = hv.Sankey(edges, label='Funding Schemes')
sankey.opts(label_position='left', edge_color='target', node_color='index', cmap='tab20')
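Larger edge tables can contain cycles (A to B and B back to A, possibly through other schemes), which the Sankey layout cannot handle. One way to locate them before plotting is a depth-first search over the source and target columns. This is a minimal sketch in plain Python; find_cycle is a hypothetical helper, not a Holoviews function, and it assumes the edges are given as (source, target) pairs.

```python
def find_cycle(edges):
    """Return one cycle as a list of nodes, or None if the edges are acyclic.

    Hypothetical helper: `edges` is an iterable of (source, target) pairs.
    """
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {node: WHITE for node in graph}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == GREY:
                # nxt is already on the current path, so we have closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in graph:
        if state[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


cyclic = find_cycle([("A", "B"), ("B", "A")])
self_loop = find_cycle([("A", "A")])
acyclic = find_cycle([("A", "B"), ("B", "C"), ("A", "C")])
```

With a DataFrame like the ones above, the pairs could be supplied as zip(edges['source'], edges['target']). The search is recursive, which is fine for a few dozen funding schemes but would need rewriting for very deep graphs.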

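The larger dataset only rendered after the cyclic edges had been removed: the Sankey element requires an acyclic graph, so if A goes to B, B can't then go back to A. Below is a test of an alternative, where a space is added to the end of every target label in the spreadsheet (e.g. B2 & " "), so that a scheme appearing as a target is treated as a different node from the same scheme appearing as a source.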

In [108]:
edges = pd.read_csv('Spaceadded.csv')
edges.head(5)

Unnamed: 0,source,target,value
0,BBI-CSA - Bio-based Industries Coordination an...,BBI-IA-DEMO - Bio-based Industries Innovation ...,1
1,BBI-CSA - Bio-based Industries Coordination an...,IA - Innovation action,1
2,BBI-IA-DEMO - Bio-based Industries Innovation ...,BBI-IA-DEMO - Bio-based Industries Innovation ...,5
3,BBI-IA-DEMO - Bio-based Industries Innovation ...,BBI-RIA - Bio-based Industries Research and In...,6
4,BBI-IA-DEMO - Bio-based Industries Innovation ...,IA - Innovation action,5


In [109]:
sankey = hv.Sankey(edges, label='Funding Schemes')
sankey.opts(label_position='left', edge_color='target', node_color='index', cmap='tab20')
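The trailing-space trick was applied in the spreadsheet, but the same transformation could be done in pandas after loading. A minimal sketch, assuming the source, target and value column names shown in the tables above; the three rows use shortened labels and are for illustration only.

```python
import pandas as pd

# Small illustrative table; the real data would come from pd.read_csv(...).
edges = pd.DataFrame({
    'source': ['RIA', 'IA', 'RIA'],
    'target': ['RIA', 'RIA', 'IA'],
    'value': [828, 359, 448],
})

# Append a space to every target label, leaving the original frame untouched.
spaced = edges.copy()
spaced['target'] = spaced['target'] + ' '

# After the change no label appears as both a source and a target,
# so the edge list cannot contain a cycle or a self-loop.
overlap = set(spaced['source']) & set(spaced['target'])
```

One consequence of this approach is that every flow becomes a single source-to-target step, so a chain such as A to B to C is drawn as two unrelated flows rather than one continuous path.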

Sorted by largest value at the top:

In [110]:
edges = pd.read_csv('Spaceaddedvalue.csv')
edges.head(5)

Unnamed: 0,source,target,value
0,RIA - Research and Innovation action,RIA - Research and Innovation action,828
1,IA - Innovation action,IA - Innovation action,461
2,RIA - Research and Innovation action,IA - Innovation action,448
3,SME-1 - SME instrument phase 1,SME-2 - SME instrument phase 2,380
4,IA - Innovation action,RIA - Research and Innovation action,359


In [111]:
sankey = hv.Sankey(edges, label='Funding Schemes')
sankey.opts(label_position='left', edge_color='target', node_color='index', cmap='tab20')

The order in which the sources and targets are listed in the Sankey diagram doesn't change, regardless of how the rows are ordered in the CSV.
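For reference, the value-sorted table used in the last test could also be produced in pandas rather than in the spreadsheet. A minimal sketch, assuming a value column as in the tables above; the rows are made up for illustration.

```python
import pandas as pd

# Illustrative rows only; the real table would come from pd.read_csv(...).
edges = pd.DataFrame({
    'source': ['a', 'b', 'c'],
    'target': ['x', 'y', 'z'],
    'value': [5, 828, 461],
})

# Largest flows first, with a fresh 0..n-1 index.
sorted_edges = edges.sort_values('value', ascending=False).reset_index(drop=True)
```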