# Bibliographic Networks: A Python Tutorial

Networks offer useful measures for identifying data-driven patterns and dependencies. Given only a data file, however, it can be difficult to see how to build such a network. In this tutorial, we use a bibliographic data file downloaded from a query search in <a href = https://www.scopus.com/search/form.uri>Scopus</a> to walk through cleaning the data file, writing a Python script that parses the data into nodes and edges, computing graph measures with <a href = https://networkx.github.io/documentation/stable/index.html>NetworkX</a>, and creating an interactive network display with <a href = http://holoviews.org/>HoloViews</a>.

### Notes on Data Manipulation in Excel

While editing and cleaning your data set, always save it in Excel as <i>CSV UTF-8 (Comma delimited) (.csv)</i>. This keeps the file readable by Python's csv reader and preserves the original special characters.

The rest of this section is specific to files downloaded from Scopus, as some downloaded queries are unfortunately imperfect. First, some rows are skewed because fields were read and parsed incorrectly. Scroll through the file and delete any rows where the data is clearly mismatched (e.g. an author name in the 'Title' column or a numerical value in a non-numerical column).

Additionally, across several different queries, we found rows with duplicate 'Title' entries whose other columns contained conflicting data. Because each title becomes a single node in the network, we remove these duplicates.

In Excel, with your csv file open, select Data -> Remove Duplicates (in the Data Tools group). Indicate that the file has headers, as all Scopus files do, and select only the 'Title' column as the column by which duplicates are identified.

After executing this command, save the file as a csv again. Otherwise, Excel may default to txt or another format, and some data features may be lost. Consistently saving the file as a csv keeps it compatible with the Python code in this tutorial.
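
For readers who prefer to stay in Python, the same clean-up can be sketched with the standard `csv` module. This is a minimal sketch rather than a replacement for the Excel steps: the helper name `drop_duplicate_titles` and the sample rows are made up for illustration, and comparing titles without regard to case or surrounding spaces is an assumption.

```python
import csv
import io

def drop_duplicate_titles(rows):
    """Keep the first row seen for each 'Title'; later repeats are dropped."""
    seen = set()
    unique = []
    for row in rows:
        # assumption: titles that differ only in case or outer spaces are duplicates
        key = row["Title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up CSV text standing in for a Scopus export
sample = io.StringIO(
    "Title,Year\n"
    "Paper A,2001\n"
    "Paper B,2005\n"
    "paper a,2002\n"
)
rows = list(csv.DictReader(sample))
deduped = drop_duplicate_titles(rows)
print([row["Title"] for row in deduped])
```

As in Excel, the first occurrence of each title is the one that survives, so conflicting values in later rows are discarded.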

Generally, for the case of creating a connected network, we want the rows in our bibliographic data file to have an identifiable title and list of references. 
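
One way to check this requirement before building the network is to filter out rows where either field is blank. The sketch below is illustrative only: `has_title_and_references` is a hypothetical helper, and the rows are invented stand-ins for what `csv.DictReader` would yield.

```python
def has_title_and_references(row):
    """Return True when both fields contain something other than whitespace."""
    title = (row.get("Title") or "").strip()
    refs = (row.get("References") or "").strip()
    return bool(title) and bool(refs)

# Made-up rows in the shape produced by csv.DictReader
rows = [
    {"Title": "Paper A", "References": "Ref one; Ref two"},
    {"Title": "Paper B", "References": ""},
    {"Title": "  ", "References": "Ref three"},
]
usable = [row for row in rows if has_title_and_references(row)]
print(len(usable), "of", len(rows), "rows are usable")
```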

### Import Necessary Libraries and Packages 
The following code imports the libraries and packages needed for this tutorial. For the imports to succeed, these libraries must already be installed locally on your computer.

Using pip, the following libraries can be installed in the terminal window of your computer. 
- pip install networkx
- pip install numpy
- pip install pandas 
- pip install holoviews 
- pip install bokeh 
- pip install scikit-image
- pip install xarray
- pip install datashader
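
Before running the import cells, it can help to confirm which of these packages are visible to the current Python environment. The following is a small sketch using the standard library; `missing_modules` is a hypothetical helper name, and note that scikit-image is imported under the module name `skimage`.

```python
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names for the packages listed above (scikit-image imports as skimage)
required = ["networkx", "numpy", "pandas", "holoviews", "bokeh",
            "skimage", "xarray", "datashader"]
print("Missing:", missing_modules(required))
```

An empty list means every package was found; any name printed still needs to be installed.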

To customize this tutorial, set the <i>file_name</i> variable to your own csv file.

In [1]:
import csv
import networkx as nx
import numpy as np
import pandas as pd

In [2]:
import holoviews as hv

In [3]:
from holoviews import opts 

In [16]:
from holoviews.operation.datashader import datashade, bundle_graph

In [16]:
file_name = 'scopus.csv' # TODO: insert filename

### Scopus and bibliometric specific: fixing the data file 

Each row of the downloaded Scopus file describes one source, identified by its title. The 'References' column holds a semicolon-delimited list of references in MLA/APA format. To make this information useful, we must match each reference in the list to a title. To understand the function below, look at the formatting of the 'References' column in your Scopus file.

In [None]:
node_list = []  # entries are [name, kind]; kind is 't' (title) or 'r' (reference)
edge_list = []

def comp_add(node_list, node2, title):
    """Add node2 unless it matches an existing node; return the name stored for it."""
    for node1 in node_list:
        # a reference string usually contains the cited title, so a substring
        # match in either direction is treated as the same node
        if node2 in node1[0] or node1[0] in node2:
            if title:
                node1[1] = 't'  # a row title takes precedence over reference notation
            return node1[0]
    # no match: append with title notation ('t') or reference notation ('r')
    node_list.append([node2, 't' if title else 'r'])
    return node2

# utf-8-sig also drops the byte-order mark that Excel writes in CSV UTF-8 files
with open(file_name, newline='', encoding='utf-8-sig') as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        title = row['Title'].strip()
        if not title:
            continue  # a row without a title cannot become a node
        # add node with unique identifier
        sink_node = comp_add(node_list, title, True)
        # add an edge for each source and its references
        for ref in row['References'].split(';'):
            ref = ref.strip()  # clean up whitespace left by the ';' delimiter
            # only include journal-type references: skip blanks and web links
            if ref and 'https://' not in ref and 'http://' not in ref:
                ref_node = comp_add(node_list, ref, False)
                edge_list.append([ref_node, sink_node])  # 'ref_node' is cited by 'sink_node'
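
To see the splitting and matching rules from the cell above in isolation, the sketch below applies them to a single invented 'References' cell. The helper names `split_references` and `matches_title` are hypothetical, and the sample string is made up; real Scopus entries will be formatted differently.

```python
def split_references(cell):
    """Split a 'References' cell on ';' and drop blanks and web links."""
    refs = []
    for ref in cell.split(";"):
        ref = ref.strip()
        if ref and "http://" not in ref and "https://" not in ref:
            refs.append(ref)
    return refs

def matches_title(title, ref):
    """Substring test in either direction, mirroring the rule used in comp_add."""
    return title in ref or ref in title

# Made-up cell for illustration only
cell = ("Doe, J., A study of graphs, (2001) J. Graphs; "
        "https://example.org/page; ; Roe, R., Another work, (1999)")
refs = split_references(cell)
print(refs)
print([matches_title("A study of graphs", ref) for ref in refs])
```

Because the match is a plain substring test, a short title can match unrelated references that happen to contain it, so it is worth spot-checking the resulting nodes.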


### Graph Manipulation 
Once you have created an <i>edge_list</i> variable, its edges can be added to a NetworkX graph. NetworkX makes this graph manipulation intuitive and clean, requiring only a few lines of code.

In [None]:
G = nx.Graph() 
color_map = []

for node in node_list: 
    if node[1] == 't': 
        color_map.append('red')
    else: 
        color_map.append('blue')
    G.add_node(node[0])
    
G.add_edges_from(edge_list)
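
Once the edges are in a graph, NetworkX can compute measures on it directly. The sketch below uses a toy graph with made-up node names rather than the tutorial's data, and shows three measures that are often a reasonable starting point: node degree, degree centrality, and the number of connected components.

```python
import networkx as nx

# Toy citation graph with made-up names: three papers cite "Ref X"
G_demo = nx.Graph()
G_demo.add_edges_from([
    ["Ref X", "Paper A"],
    ["Ref X", "Paper B"],
    ["Ref X", "Paper C"],
    ["Ref Y", "Paper C"],
])

degrees = dict(G_demo.degree())             # number of edges at each node
centrality = nx.degree_centrality(G_demo)   # degree divided by (n - 1)
most_connected = max(degrees, key=degrees.get)

print(most_connected, degrees[most_connected])
print(round(centrality[most_connected], 2))
print(nx.number_connected_components(G_demo))
```

The same calls can be applied to `G` in place of `G_demo`; in a bibliographic network, a high-degree reference node is one cited by many of the sources in the file.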

For a large graph, one may want to prune the nodes that have a degree of 1. For this bibliometric data, we are primarily concerned with the connections between nodes, so a node with only one connection is of much less importance. Removing these less significant nodes can also shrink the graph considerably, which makes the layout easier to read. Each run of the cell below removes the nodes that currently have degree 1, and removing them can leave other nodes with degree 1, so re-running the cell prunes the graph further. Run it only as many times as you intend to reduce the graph, or information you care about may be lost.

In [None]:
nodes_to_remove = []
for n in G.nodes(): 
    if G.degree(n) == 1: 
        nodes_to_remove.append(n)
G.remove_nodes_from(nodes_to_remove)
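
The effect of running the pruning step more than once can be seen on a small example. The sketch below wraps the same idea in a hypothetical helper, `prune_leaves`, that works on a copy so the original graph is left untouched; the graph itself is made up.

```python
import networkx as nx

def prune_leaves(graph, passes=1):
    """Remove degree-1 nodes `passes` times, working on a copy of the graph."""
    pruned = graph.copy()
    for _ in range(passes):
        leaves = [n for n in pruned.nodes() if pruned.degree(n) == 1]
        if not leaves:
            break  # nothing left to prune
        pruned.remove_nodes_from(leaves)
    return pruned

# Made-up example: a triangle (a, b, c) with a two-node tail (c - d - e)
G_demo = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")])
print(sorted(prune_leaves(G_demo, passes=1).nodes()))
print(sorted(prune_leaves(G_demo, passes=2).nodes()))
```

The first pass removes only `e`; that leaves `d` with a single connection, so a second pass removes `d` as well.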

In [None]:
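hv.extension('bokeh')

# dimensions for graph window 
kwargs = dict(width=1000, height=1000, xaxis=None, yaxis=None)
opts.defaults(opts.Nodes(**kwargs), opts.Graph(**kwargs))

# position nodes with a force-directed layout, then bundle edges to reduce clutter
pos = nx.spring_layout(G, k=0.15, iterations=20)
my_graph = hv.Graph.from_networkx(G, pos)
bundled = bundle_graph(my_graph)

# rasterize the bundled edges with datashader and overlay the interactive nodes
# (no color dimension is set, because the nodes in G carry no attributes)
shaded = datashade(bundled, normalization='linear', width=900, height=900)
overlay = (shaded * bundled.nodes).opts(opts.Nodes(size=10, width=1000))

# only the last expression in a notebook cell is displayed; use `shaded` for the
# edges alone, or `bundled.opts(padding=0.1)` for the fully interactive graph
overlay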