# Bibliographic Networks: A Python Tutorial

Networks offer useful measures for identifying data-driven patterns and dependencies. Given only a data file, however, it can be difficult to see how to build such a network. In this tutorial, we use a bibliographic data file downloaded from a query search in <a href = https://www.scopus.com/search/form.uri>Scopus</a> to walk through cleaning the data file, writing a Python script to parse the data into nodes and edges, computing graph measures using <a href = https://networkx.github.io/documentation/stable/index.html>NetworkX</a>, and creating an interactive network display using <a href = http://holoviews.org/>HoloViews</a>. 

### 1. Data Manipulation in Excel

As you are editing and cleaning your data set, be sure to always save in Excel as <i>CSV UTF-8 (Comma delimited) (.csv)</i>. This will ensure that the data file is readable by the Python reader used in this tutorial, and will keep any special characters. 
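
As a quick check that a file saved this way reads back cleanly, the sketch below uses made-up in-memory data (the column names mirror the SCOPUS headers used later in this tutorial; the row content is hypothetical). It assumes that Excel's CSV UTF-8 option writes a leading byte order mark (BOM), which is its usual behavior, and compares what `csv.DictReader` sees with the `utf-8` and `utf-8-sig` encodings.

```python
import csv
import io

# Hypothetical two-column file, encoded the way Excel's "CSV UTF-8" option
# typically writes it: UTF-8 with a leading byte order mark (BOM).
raw_bytes = "\ufeffTitle,References\r\nPaper A,Ref one; Ref two\r\n".encode("utf-8")


def read_headers(data, encoding):
    """Return the header names seen by csv.DictReader for the given encoding."""
    with io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="") as f:
        return csv.DictReader(f).fieldnames


headers_plain = read_headers(raw_bytes, "utf-8")    # BOM stays on the first header
headers_sig = read_headers(raw_bytes, "utf-8-sig")  # BOM is stripped

print(headers_plain)
print(headers_sig)
```

If the first header comes back with a stray leading character, lookups such as `row['Title']` can fail for that column, so `utf-8-sig` is the safer choice when reading a file exported this way.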

#### SCOPUS Specific Data Manipulation 
Few SCOPUS downloadable queries are perfect. This tutorial uses the SCOPUS file containing results for the query <i>economics AND "complex systems."</i> In this specific data file, some rows are misaligned because fields were read and parsed incorrectly. If you are customizing this tutorial, simply scroll through the file and delete any rows where the data is clearly mismatched (e.g. an author name in the 'Title' column, a numerical value in a non-numerical column, etc.). 

Additionally, across several different queries, we found duplicate entries in the 'Title' column whose other columns contained conflicting data. To produce a network, these duplicates should be removed. With your .csv file open in Excel, select <i>Data -> Table Tools -> Remove Duplicates</i>. Indicate that the .csv file has headers, as all SCOPUS files will, and select only the 'Title' column as the column by which duplicates are identified. After executing this command, it is important to save the file as a .csv as previously indicated. Otherwise, Excel may default to saving the file as a .txt, or another format, and data features may be lost. By always saving the file as a .csv, we ensure that it remains compatible with the Python code for this tutorial.
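
If you would rather remove duplicates in Python than in Excel, the same idea can be sketched with the standard `csv` module. This is only an illustration on hypothetical in-memory rows, and `drop_duplicate_titles` is a helper name introduced here, not part of the tutorial's script. The sketch keeps the first row seen for each title and discards later ones.

```python
import csv
import io

# Hypothetical rows standing in for a SCOPUS export; only 'Title' and
# 'References' matter for this tutorial.
sample_csv = (
    "Title,References\r\n"
    "Paper A,Ref one; Ref two\r\n"
    "Paper B,Ref three\r\n"
    "Paper A,Ref four\r\n"
)


def drop_duplicate_titles(rows):
    """Keep the first row seen for each 'Title', preserving row order."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row["Title"] not in seen:
            seen.add(row["Title"])
            unique_rows.append(row)
    return unique_rows


rows = list(csv.DictReader(io.StringIO(sample_csv, newline="")))
deduped = drop_duplicate_titles(rows)
print([row["Title"] for row in deduped])
```

Note that this compares titles exactly, so two titles that differ only in capitalization or spacing would both be kept.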

Generally, for the case of creating a connected network, we want the rows in our bibliographic data file to have a unique title and a list of references. Other customizations can be made as long as this feature is preserved. 

### 2. Import Necessary Libraries and Packages 
The following code installs and imports the libraries and packages needed for this tutorial. The imports succeed only if these libraries are installed in your local computing environment. 

To customize this tutorial, declare your own .csv file in the <i>file_name</i> variable. 

In [None]:
! pip install networkx
! pip install numpy
! pip install pandas
! pip install holoviews
! pip install bokeh
! pip install scikit-image
! pip install xarray
! pip install datashader

In [21]:
import csv
import networkx as nx
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews.operation.datashader import datashade, bundle_graph
from networkx.algorithms import community

file_name = 'scopus.csv' # TODO: insert filename

### 3. Partitioning the Data into Nodes and Edges

This Python script is specific to SCOPUS and bibliometric data, though it could easily be customized to match the parameters of any data file. To make a network, we must identify objects and the relationships between them. 

With bibliometric data, we can identify titles and designate a connection between two titles if one references the other. The downloaded SCOPUS file identifies a title for a source in each row. The column 'References' holds a semicolon-delimited list of references in MLA or APA format. To make this information useful, we must parse a title from each reference in the list.
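
The script in this section does not extract a clean title from each reference string. Instead, it treats a reference and a title as the same node when one string contains the other. The sketch below illustrates that test on a hypothetical row: the first reference is taken from the sample output shown later in this tutorial, while the title, the second reference, and the helper name `split_references` are made up for illustration.

```python
# Hypothetical row: one title plus a semicolon-delimited 'References' string.
title = "Fuzzy Sets and Fuzzy Logic"
references = (
    "Klir, G., Yuan, B., (1995) Fuzzy Sets and Fuzzy Logic, , Prentice Hall, "
    "Upper Saddle River; Doe, J., (2000) An Unrelated Study, Some Journal; "
    "http://example.com/report"
)


def split_references(ref_field):
    """Split the field on ';' and drop blank entries and web links."""
    kept = []
    for ref in ref_field.split(";"):
        if ref.strip() and "http://" not in ref and "https://" not in ref:
            kept.append(ref)
    return kept


refs = split_references(references)
# A reference matches the title when either string contains the other.
matches = [ref for ref in refs if title in ref or ref in title]
print(len(refs), len(matches))
```

One consequence of substring matching is that a very short title may match references to unrelated works that happen to contain the same words, so it is worth spot-checking the resulting edges.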

In [3]:
node_list = [] # a list of titles and references
edge_list = [] # includes rows of format [a, b] where 'a' references 'b'
type_dict = {} # key: node, value: type ('title' or 'reference'), holds all possible node values

''' 
Requires: 'n_type' is either 'title' or 'reference' 
Modifies: If 'node' matches an entry already in 'node_list', preserves type 'title' by
          either replacing that entry (and updating 'type_dict') or replacing 'node'
          with the existing entry. Otherwise, adds 'node' to 'node_list'.
Effects:  Compares 'node' to the current 'node_list' and returns the representation
          to use for 'node'.
'''
def comp_add(node_list, node, n_type): 
    for i in range(len(node_list)): 
        # check to see if 'node' compares to any current nodes
        if node in node_list[i] or node_list[i] in node: 
            # if a node exists as a row 'title' and a row 'reference', 
            # we want to favor the type 'title' in our data structures 
            if n_type == 'title': 
                # switch the representation in 'node_list' to 'title'
                node_list[i] = node
                type_dict[node] = n_type 
            else: 
                # switch the representation of 'node' to 'title' 
                node = node_list[i]
            return node 
        
    # the rest of this function executes if 'node' is not already in 'node_list':
    # add it and record its type (the same steps apply to both types)
    node_list.append(node)
    type_dict[node] = n_type

    return node

''' 
Main loop to parse data into nodes and edges. 
'''
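# 'utf-8-sig' drops the byte order mark that Excel's CSV UTF-8 format may add (Python 3)
with open(file_name, newline='', encoding='utf-8-sig') as csv_file: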
    reader = csv.DictReader(csv_file)
    for row in reader:
        # add node with unique identifier
        source_node = row['Title']
        source_node = comp_add(node_list, source_node, 'title')
        # add an edge for each source and its references
        refs = row['References'].split(';')
        for ref in refs:
            # disregard web references and blank entries left by formatting inconsistencies
            if 'https://' not in ref and 'http://' not in ref and ref.strip():
                ref = comp_add(node_list, ref, 'reference')
                edge = [source_node, ref] # 'source_node' references 'ref'
                edge_list.append(edge) 


### 4. Graph Manipulation 
Once you have created an <i>edge_list</i> variable, edges can be added to a NetworkX graph. Using NetworkX for this graph manipulation is intuitive and clean, requiring minimal lines of code.

In [4]:
G = nx.Graph()
for n in node_list: 
    G.add_node(n)
G.add_edges_from(edge_list)

For a large graph, depending on the information being represented, you may want to prune the graph so that it contains only nodes with a degree (the number of edges attached to a node) greater than 1. For this bibliometric data, we are primarily interested in the connections between nodes, so a node with only one connection is of much less importance. Removing these less significant nodes can also shrink the graph considerably, producing a layout that is easier to read. Each run of the pruning code can remove more nodes, because deleting degree-1 nodes may leave their neighbors with degree 1. Run it only as many times as you want to reduce the graph; otherwise, information of interest may be lost.  
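
To see how repeated pruning keeps shrinking a graph, consider the toy sketch below. The graph and the helper name `prune_once` are hypothetical and only mirror the two steps used in this tutorial: remove nodes of degree 1, then remove any nodes left with degree 0.

```python
import networkx as nx


def prune_once(graph):
    """Remove degree-1 nodes, then any nodes left with degree 0 (one pass)."""
    leaves = [n for n in graph.nodes() if graph.degree(n) == 1]
    graph.remove_nodes_from(leaves)
    isolated = [n for n in graph.nodes() if graph.degree(n) == 0]
    graph.remove_nodes_from(isolated)
    return graph


# Hypothetical toy graph: a triangle (a, b, c) with a two-node tail (c-d-e).
toy = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")])

# Record the node count before pruning and after each of two passes.
sizes = [toy.number_of_nodes()]
for _ in range(2):
    prune_once(toy)
    sizes.append(toy.number_of_nodes())

print(sizes)
```

The first pass removes only the end of the tail; the second pass removes the node that the first pass exposed. Only the triangle, where every node has degree 2, survives further passes.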

In [5]:
# each run removes nodes of degree 1, then any nodes left isolated (degree 0)

# first remove nodes of degree 1
nodes_to_remove = []
for n in G.nodes(): 
    if G.degree(n) == 1: 
        nodes_to_remove.append(n)
G.remove_nodes_from(nodes_to_remove)

# then remove nodes that are isolated 
nodes_to_remove = []
for n in G.nodes(): 
    if G.degree(n) == 0: 
        nodes_to_remove.append(n)
G.remove_nodes_from(nodes_to_remove)

Now that the graph is pruned, we give each node a label corresponding to its type so that nodes can be distinguished in the display. This step could be customized to distinguish nodes by any measure. 

In [7]:
for n in G.nodes:
    G.nodes[n]['label'] = type_dict[n]  # 'G.node' was removed in newer NetworkX releases

In [None]:
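# Girvan-Newman community detection: each step of the generator yields a
# tuple of node sets; the first step is the first split of the graph
communities_generator = community.girvan_newman(G)
top_level_communities = next(communities_generator)
print(top_level_communities)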

In [23]:
print(top_level_communities[0])

set(['Quantum approach explains the need for expert knowledge: On the example of econometrics', ' Klir, G., Yuan, B., (1995) Fuzzy Sets and Fuzzy Logic, , Prentice Hall, Upper Saddle River', 'Decomposition of complex systems into set of autonomous agents by fuzzy-genetic approach and its application in economic and business environments'])


### 5. Creating a Graphical Display
See the inline comments for places where further customization is possible. A brief discussion of specific functions and layouts is provided below. 

##### Bokeh

##### NetworkX: spring_layout 

##### HoloViews: bundle_graph 

##### HoloViews: datashade 



In [24]:
hv.extension('bokeh')

kwargs = dict(width=1000, height=1000, xaxis=None, yaxis=None)
hv.opts.defaults(hv.opts.Nodes(**kwargs), hv.opts.Graph(**kwargs))
colors = ['#000000']+hv.Cycle('Category20').values  

pos = nx.spring_layout(G,k=0.15,iterations=20)  
# nodes will be colored according to the designated 'label'
bib_ops = hv.opts.Graph(node_color=hv.dim('label'), cmap='Set1')
# collect graph from NetworkX 
my_graph = hv.Graph.from_networkx(G, pos).opts(bib_ops)
# bundle edges 
bundled = bundle_graph(my_graph)
# overlay the datashaded edges with the nodes, colored by 'label'
(datashade(bundled, normalization='linear', width=800, height=800) * bundled.nodes).opts(
    hv.opts.Nodes(color=hv.dim('label'), size=10, width=1000, cmap=colors, legend_position='right'))

# datashade(bundle_graph(my_graph), normalization='linear', width=900, height=900)
#bundled.opts(padding=0.1)

# green connection if something is referencing it
# orange connection if it is referencing something 
# blue if title , dark blue if high degree, light blue if small degree 
# red if resource, dark red if high degree, light red if small degree 