This notebook shows how to load all of the information from the supplied SQLite database into a networkx graph, so that paths between authors (and similar relationships) can be established. Once the network has been built it is saved as a .pickle file, which I will add to the dataset, because getting all of the data into the network format is quite inelegant and slow (more on this at the end).

First we have to import the packages that we are going to use and set up our defaults:

In [None]:
import sqlite3
import networkx as nx
import itertools
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
conn = sqlite3.connect('../input/database.sqlite')
c = conn.cursor()
g = nx.Graph()

We can now initialise the network with all of the authorIDs from the Authors table, giving each node a label containing the author's name:

In [None]:
c.execute('SELECT authorID, forename, initials, surname from Authors;')
out = c.fetchall()

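# Add one node per author; the label joins forename, initials and surname,
# collapsing any repeated whitespace left by empty fields
for i in out:
    g.add_node(i[0], label = ' '.join(' '.join(i[1:4]).split()))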
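
The nested `join`/`split` used for the label is easy to misread, so here is a minimal standalone sketch of what it does. The helper name `make_label` and the sample names are hypothetical, made up for illustration rather than taken from the database.

```python
def make_label(forename, initials, surname):
    """Join the name parts with single spaces, dropping empty fields."""
    joined = ' '.join([forename, initials, surname])
    # str.split() with no argument splits on any run of whitespace and
    # discards empty strings, so re-joining collapses doubled spaces
    return ' '.join(joined.split())

# Hypothetical rows in the (forename, initials, surname) order queried above
print(make_label('Jane', 'Q.', 'Doe'))   # all three fields present
print(make_label('Jane', '', 'Doe'))     # empty initials leave no double space
print(make_label('', 'J. Q.', 'Doe'))    # empty forename leaves no leading space
```

Note that this assumes every field is a string; a `NULL` coming back from SQLite as `None` would need to be replaced with `''` first.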

We now need to populate the network with edges from the database. Here an edge is formed between every pair of authors on each paper, which requires joining the Paper_Authors table to itself:

[Many thanks to CL. on stack overflow for this solution.][1]

 [1]: http://stackoverflow.com/a/42002707/6813373

In [None]:
c.execute('SELECT \
          pa1.authorID AS author1, \
          pa2.authorID AS author2, \
          p.doi AS doi \
          FROM Paper_Authors AS pa1 \
          JOIN Paper_Authors AS pa2 USING (paperID) \
          JOIN Papers p \
          ON pa1.paperID = p.paperID \
          WHERE pa1.authorID < pa2.authorID;')

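out = c.fetchall()
# Each row is (author1, author2, doi). g is a simple Graph, so if a pair of
# authors share several papers, only the last DOI is kept on their edge
for i in out:
    g.add_edge(i[0], i[1], doi = i[2])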

nx.write_gpickle(g, './network.pickle')
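
To make the self-join easier to follow, the sketch below runs the same shape of query against a tiny in-memory database. The table and column names mirror the query above, but the rows, the DOIs and the names `conn_demo` and `cur` are invented for illustration.

```python
import sqlite3

# In-memory toy database; rows are hypothetical
conn_demo = sqlite3.connect(':memory:')
cur = conn_demo.cursor()
cur.execute('CREATE TABLE Papers (paperID INTEGER, doi TEXT)')
cur.execute('CREATE TABLE Paper_Authors (paperID INTEGER, authorID INTEGER)')
cur.executemany('INSERT INTO Papers VALUES (?, ?)',
                [(1, 'doi-A'), (2, 'doi-B')])
# Paper 1 has authors 10, 20, 30; paper 2 has authors 20, 40
cur.executemany('INSERT INTO Paper_Authors VALUES (?, ?)',
                [(1, 10), (1, 20), (1, 30), (2, 20), (2, 40)])

# Pair every author of a paper with every other author of the same paper
cur.execute('''
    SELECT pa1.authorID, pa2.authorID, p.doi
    FROM Paper_Authors AS pa1
    JOIN Paper_Authors AS pa2 USING (paperID)
    JOIN Papers AS p ON pa1.paperID = p.paperID
    WHERE pa1.authorID < pa2.authorID
    ORDER BY p.doi, pa1.authorID, pa2.authorID
''')
pairs = cur.fetchall()
print(pairs)
conn_demo.close()
```

The `pa1.authorID < pa2.authorID` condition keeps each pair once and removes self-pairs, so a paper with n authors contributes n(n-1)/2 edges.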

Now that we have all of the data in our network, we can investigate its overall parameters:

In [None]:
print(nx.info(g))

We can see that the average number of collaborators each author has in JACS is around 10, and we can see the distribution by plotting the degree histogram:

In [None]:
hist = nx.degree_histogram(g)
plt.figure(figsize=(9,3))
plt.plot(hist, marker = '.')
plt.xlim((0,50))
plt.show()
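
As a rough illustration of what is being plotted: `nx.degree_histogram` returns a list in which position k holds the number of nodes with degree k. The plain-Python sketch below rebuilds that list, and the average degree, from a hypothetical edge list; the names `edges`, `hist_demo` and `mean_degree` are assumptions for this example only.

```python
from collections import Counter

# Hypothetical co-author pairs (undirected edges)
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (5, 6)]

# Degree of a node = number of edges touching it
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# hist_demo[k] = number of nodes with degree k, for k = 0 .. max degree.
# Nodes without any edge do not appear in an edge list; in the real graph
# they would be counted at position 0
counts = Counter(degree.values())
hist_demo = [counts.get(k, 0) for k in range(max(degree.values()) + 1)]

# Each edge contributes to two degrees, so the mean degree is 2E / N
mean_degree = 2 * len(edges) / len(degree)
print(hist_demo, mean_degree)
```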

Now we can use the network to determine the paths between different authors who have published in JACS. As an example, I will find the link between myself (`118208`) and the 2016 Nobel Prize winner J.-P. Sauvage (`2047`), who has a similar surname but is a far better chemist:

In [None]:
path = nx.shortest_path(g, source = 2047, target = 118208)
# The path lists the nodes visited, so the number of steps (edges) is one fewer
print('There are ' + str(len(path) - 1) + ' steps between J.-P. Sauvage and M. Savage')
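
For an unweighted graph such as this one, a shortest path can be found with a breadth-first search. The sketch below is a plain-Python illustration of that idea on a hypothetical adjacency dictionary; it is not a copy of what networkx does internally, and `bfs_shortest_path`, `adj` and `demo_path` are names invented here.

```python
from collections import deque

def bfs_shortest_path(adj, source, target):
    """Return one shortest path from source to target as a list of nodes,
    or None if target cannot be reached."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parents to rebuild the path
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nb in adj.get(node, ()):
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)
    return None

# Hypothetical co-authorship graph as an adjacency dict
adj = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['A', 'D'],
       'D': ['B', 'C', 'E'], 'E': ['D']}
demo_path = bfs_shortest_path(adj, 'A', 'E')
# The path lists nodes, so the number of steps (edges) is len(path) - 1
print(demo_path, 'steps:', len(demo_path) - 1)
```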

Since this path is so short, I may know some of the people along the way, so we can create a subgraph of the authors in the path and extract their labels, in this case the author names:

In [None]:
h = nx.subgraph(g, path)

nlabels = nx.get_node_attributes(h, 'label')
dlabels = nx.get_edge_attributes(h, 'doi')

pos = nx.spring_layout(h)
plt.figure(figsize=(9,6))
nx.draw(h, pos = pos, node_color = 'b', edge_color = 'r', node_size = 50)
nx.draw_networkx_labels(h, pos = pos, labels = nlabels, font_size = 10)
nx.draw_networkx_edge_labels(h, pos = pos, edge_labels = dlabels, font_size = 10)
plt.show()

Here we see that I have published with one co-author, who in turn published with J. F. Stoddart, another winner of the 2016 Nobel Prize.

What would be more interesting is to see all of the shortest paths through the network between the two authors:

In [None]:
paths = nx.all_shortest_paths(g, source = 2047, target = 118208)
paths = list(paths)
# Flatten the list of paths into a single list of nodes for the subgraph
h = nx.subgraph(g, sum(paths, []))

nlabels = nx.get_node_attributes(h, 'label')
dlabels = nx.get_edge_attributes(h, 'doi')

pos = nx.spring_layout(h)
plt.figure(figsize=(9,6))
nx.draw(h, pos = pos, node_color = 'b', edge_color = 'r', node_size = 50)
nx.draw_networkx_labels(h, pos = pos, labels = nlabels, font_size = 10)
nx.draw_networkx_edge_labels(h, pos = pos, edge_labels = dlabels, font_size = 8)
plt.show()

From this plot, we can see that I am connected to 3 co-authors, who were all on one paper with me and all on a paper with J. F. Stoddart, who in turn published with J.-P. Sauvage. This indicates that the network I am in and the one Sauvage is in are linked by Stoddart.

If we print the dictionary of edge labels, we can see that there are only 3 papers on these shortest paths.

In [None]:
dlabels
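
Because several edges can carry the same DOI, the number of papers is the number of distinct values in the dictionary, not its length. A minimal sketch, using a hypothetical dictionary `dlabels_demo` in the same `{(author1, author2): doi}` shape:

```python
# Hypothetical edge-label dictionary; the author IDs and DOIs are invented
dlabels_demo = {
    (1, 2): 'doi-A', (1, 3): 'doi-A', (2, 3): 'doi-A',
    (2, 4): 'doi-B', (3, 4): 'doi-B',
    (4, 5): 'doi-C',
}

# Collapse repeated DOIs with a set, then sort for a stable display order
unique_dois = sorted(set(dlabels_demo.values()))
print(len(dlabels_demo), 'edges,', len(unique_dois), 'papers:', unique_dois)
```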

We can now plot the route through the network, showing the neighbours of each author. To do this we first create a subgraph of all of the neighbours of the members of the path and use its layout as the global set of positions, then plot the members of the path larger and in a different colour. This should allow us to more easily visualise the immediate network of each author:

In [None]:
def all_networks(g, start, end):
    def plotpath(g, path):
        # Overlay one shortest path in red, using the shared layout `pos`
        h = nx.subgraph(g, path)
        nx.draw(h, pos = pos, node_color = 'r', edge_color = 'r',
                node_size = 100, alpha = 0.7, width = 2)

    paths = list(nx.all_shortest_paths(g, source = start, target = end))
    h = nx.subgraph(g, sum(paths, []))
    labels = nx.get_node_attributes(h, 'label')
    # Subgraph of every neighbour of every node on the shortest paths;
    # list() guards against nx.neighbors returning an iterator, not a list
    i = nx.subgraph(g, sum([list(nx.neighbors(g, n)) for n in h.nodes()], []))
    pos = nx.fruchterman_reingold_layout(i)

    plt.figure(figsize=(9,6))
    nx.draw(i, pos = pos, node_color = 'b', edge_color = 'b', node_size = 50, alpha = 0.5)
    # Draw each shortest path on top of the neighbourhood
    for p in paths:
        plotpath(g, p)
    nx.draw_networkx_labels(i, pos = pos, labels = labels, font_size = 10)
    plt.show()
all_networks(g, 22, 118208)
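
The neighbourhood step can be sketched without any plotting. The adjacency dictionary below is hypothetical (`P1`-`P2`-`P3` stands in for a path and the `x` nodes for other co-authors); it only illustrates how the union of neighbours is collected.

```python
# Hypothetical adjacency dict: P1-P2-P3 is the path, the rest are co-authors
adj_demo = {
    'P1': ['P2', 'x1', 'x2'],
    'P2': ['P1', 'P3', 'x2'],
    'P3': ['P2', 'x3'],
    'x1': ['P1'], 'x2': ['P1', 'P2'], 'x3': ['P3'],
    'far': [],          # not adjacent to the path, so it is left out
}
path_nodes = ['P1', 'P2', 'P3']

# Union of the neighbours of every node on the path
neighbourhood = set()
for node in path_nodes:
    neighbourhood.update(adj_demo[node])

# Path nodes are included automatically: each one neighbours the next
print(sorted(neighbourhood))
```

Using a set here also removes the duplicates that arise when two path members share a co-author.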

From this analysis, we can see that there are distinct clusters around each of the other three nodes in the network, indicating the research groups of each member of the path. There are, however, a number of interconnections, signalling slightly longer paths between members of different groups.

Here I am trying another method of importing the data into the network, so bear with me:

In [None]:
e = nx.Graph()

c.execute('SELECT authorID, paperID\
           FROM Paper_Authors;')
out = c.fetchall()
out = [('a' + str(i[0]), 'p' + str(i[1])) for i in out]
e.add_edges_from(out)
print(nx.info(e))

This method of importing the data is much faster and results in a larger number of nodes; however, the number of edges and the average degree are significantly lower.

On closer inspection, the number of nodes is the number of papers plus the number of authors, so each author has their papers as neighbours and each paper has its authors as neighbours. There may be a way to process the network so that it has only one kind of node.
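
One possible way to get back to a single kind of node is to project the author-paper network onto its authors, linking two authors whenever they share a paper. The sketch below does this in plain Python on hypothetical `(author, paper)` pairs; it has not been run against the real dataset, and `bipartite_edges`, `authors_of` and `coauthor_edges` are names invented for the example.

```python
from itertools import combinations

# Hypothetical (author, paper) edges with the same 'a'/'p' prefixes as above
bipartite_edges = [('a1', 'p1'), ('a2', 'p1'), ('a3', 'p1'),
                   ('a2', 'p2'), ('a4', 'p2')]

# Group the authors of each paper
authors_of = {}
for author, paper in bipartite_edges:
    authors_of.setdefault(paper, []).append(author)

# Link every pair of authors that share a paper, remembering which paper(s)
coauthor_edges = {}
for paper, authors in authors_of.items():
    for a, b in combinations(sorted(authors), 2):
        coauthor_edges.setdefault((a, b), []).append(paper)

print(coauthor_edges)
```

Keeping a list of papers per pair means that authors who share several papers do not lose any of them.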

In this alternative definition of the network, both authors and papers are nodes, so the shortest path between two authors passes through the papers that link them:

In [None]:
path2 = nx.shortest_path(e, source = 'a2047', target = 'a118208')

# Report the path found in the author-paper network (path2, not path)
print('Path length of: ' + str(len(path2)))
print('Via nodes: ' + str(path2))

From this we can see that there are 4 authors in the network, linked by 3 papers, as we would expect. It seems this is the best way to access the data, as it will allow us to know the papers that link the authors or *vice-versa*.
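
Since such a path alternates between author and paper nodes, the two kinds can be separated by position. A small sketch on a hypothetical path (`bip_path` and its IDs are invented, using the same `a`/`p` prefixes as the import above):

```python
# Hypothetical path in the same alternating author/paper form
bip_path = ['a1', 'p10', 'a2', 'p11', 'a3', 'p12', 'a4']

# Even positions are authors; odd positions are the papers linking
# consecutive authors
authors_on_path = bip_path[0::2]
papers_on_path = bip_path[1::2]

# Strip the one-letter prefixes to recover the integer IDs
author_ids = [int(a[1:]) for a in authors_on_path]
paper_ids = [int(p[1:]) for p in papers_on_path]
print(author_ids, paper_ids)
```

The recovered IDs could then be used to look up names and DOIs in the Authors and Papers tables.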

In [None]:
import community

In [None]:
# Use = with a single-quoted string literal; double quotes denote identifiers in SQL
c.execute("SELECT * FROM Authors WHERE surname = 'Attfield';")
c.fetchall()

In [None]:
part = community.best_partition(g)

In [None]:
values = [part.get(node) for node in g.nodes()]

In [None]:
#nx.draw_spring(g, cmap = plt.get_cmap('jet'), node_color = values, node_size=30, with_labels=False)

In [None]:
c.execute('SELECT doi FROM Papers;')
doi = c.fetchall()

In [None]:
doi[2065]