# Introduction to Tethne: Simple Networks

In this notebook we will use the ``Paper``s in our WoS dataset to create a simple coauthor network, and then export that network to GraphML so that we can visualize it in Cytoscape.

## Before you start

* Download the Web of Science practice dataset from [here](http://devo-evo.lab.asu.edu/methods/tethne/datasets.zip), and store it in a place where you can find it. You'll need the full path to your dataset.
* Complete the tutorial **"Loading Data, part 1"**

## Loading our WoS data

We'll use the Web of Science dataset from our last workbook. Since this is a new workbook, we'll have to load it again. We'll load all of the WoS data files at once by passing the path of the directory that contains them to the **``wos.read``** function.

In [1]:
from tethne.readers import wos
datadirpath = '/Users/erickpeirson/Downloads/datasets/wos'
papers = wos.read(datadirpath)
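
If loading fails, the path is a likely culprit. The following is a minimal sketch, using only the Python standard library, of turning a shorthand path into a full path and checking that it points at a directory. The helper name ``resolve_datadir`` is made up for this example and is not part of Tethne.

```python
import os
import tempfile


def resolve_datadir(path):
    """Expand '~' and relative parts, and confirm the directory exists.

    Hypothetical helper for illustration; not part of Tethne.
    """
    full = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(full):
        raise ValueError('Not a directory: {0}'.format(full))
    return full


# A temporary directory stands in for the dataset folder here.
demo_dir = tempfile.mkdtemp()
resolved = resolve_datadir(demo_dir)
print(resolved)
```

With your own data, you would pass the location of the unzipped ``wos`` folder instead of ``demo_dir``.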

## Importing a ``networks`` module

The modules in the ``tethne.networks`` subpackage provide a variety of functions for building networks:

| Module                       | Description                                                                 | 
| ---------------------------- |-----------------------------------------------------------------------------|
| ``tethne.networks.authors``  |  Methods for making networks among authors, or among authors' institutions. |
| ``tethne.networks.features`` |  Methods for making networks based on featuresets (later in this workshop). |
| ``tethne.networks.papers``   |  Methods for making networks among papers, or their cited references.       |
| ``tethne.networks.topics``   |  Methods for making networks based on topic models.                         |

In this tutorial we'll work with the ``authors`` module. You can import it like so:

In [2]:
from tethne.networks import authors

In [3]:
authors.coauthors

<function tethne.networks.authors.coauthors>

## Building a network

The functions in the ``authors`` module all take a list of ``Paper``s as their first parameter. See the Tethne documentation for details about each method. In this workbook we'll use the **``authors.coauthors``** function, which doesn't require any additional parameters.

In [4]:
coauthor_graph = authors.coauthors(papers)

The ``authors.coauthors`` function returns a ``Graph`` object from the [``NetworkX`` Python package](https://networkx.github.io/). We'll come back to the joys of NetworkX later on in these tutorials.

In [5]:
type(coauthor_graph)

networkx.classes.graph.Graph

You can view a list of nodes and edges in the ``Graph`` using the ``nodes`` and ``edges`` methods:

In [6]:
coauthor_graph.nodes()[0:10]    # [0:10] selects only the first 10 entries in the list.

[('LIU', 'SR'),
 ('IVEY', 'CHRISTOPHER T'),
 ('LOMAN', 'J'),
 ('FERRIER', 'SHARON M'),
 ('ROOD', 'SB'),
 ('WIESENBERG', 'GUIDO L B'),
 ('DODD', 'RS'),
 ('BRONMARK', 'CHRISTER'),
 ('BALTISBERGER', 'MATTHIAS'),
 ('HEYNEN', 'M')]

In [7]:
coauthor_graph.edges()[0:10]

[(('LIU', 'SR'), ('YIN', 'CY')),
 (('LIU', 'SR'), ('LI', 'CY')),
 (('LIU', 'SR'), ('BERNINGER', 'F')),
 (('IVEY', 'CHRISTOPHER T'), ('THEISS', 'KATHRYN')),
 (('IVEY', 'CHRISTOPHER T'), ('KEPHART', 'SUSAN')),
 (('LOMAN', 'J'), ('CLAESSON', 'D')),
 (('FERRIER', 'SHARON M'), ('BAILEY', 'JOSEPH K')),
 (('FERRIER', 'SHARON M'), ('GRADY', 'KEVIN C')),
 (('FERRIER', 'SHARON M'), ('BANGERT', 'RANDY K')),
 (('FERRIER', 'SHARON M'), ('ALLAN', 'GERY J'))]

In [8]:
print 'This graph has {0} nodes and {1} edges.'.format(
    len(coauthor_graph.nodes()), len(coauthor_graph.edges())
    )

This graph has 4770 nodes and 10923 edges.


The ``authors.coauthors`` function tries to include some additional information about each node and edge. You can see data attached to nodes and edges by using the ``data=True`` keyword argument:

In [9]:
coauthor_graph.nodes(data=True)[0]   # Let's just look at the first one.

(('LIU', 'SR'), {'count': 2.0, 'documentCount': 2.0})

Each edge in our coauthor graph records how many papers the two authors have written together (``weight``), along with an abbreviated citation (``ayjid``) and a date for the ``Paper`` that links them.

In [10]:
coauthor_graph.edges(data=True)[0]   # Just look at the first one.

(('LIU', 'SR'),
 ('YIN', 'CY'),
 {'ayjid': 'LI CY 2004 ENVIRONMENTAL AND EXPERIMENTAL BOTANY',
  'date': 2004,
  'weight': 1})
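
The edge weights described above can be thought of as counts of author pairs. Here is a minimal sketch of that idea in plain Python. It is not Tethne's implementation, and the author names and the ``papers_authors`` list are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists, one per paper, using (surname, initials) tuples
# like the node IDs shown above. These names are made up.
papers_authors = [
    [('DOE', 'J'), ('ROE', 'R'), ('POE', 'E')],
    [('DOE', 'J'), ('ROE', 'R')],
    [('POE', 'E')],                      # single-author paper: no pairs
]

edge_weights = Counter()
for author_list in papers_authors:
    # Sort so that (a, b) and (b, a) count as the same undirected edge.
    for pair in combinations(sorted(set(author_list)), 2):
        edge_weights[pair] += 1

for (a, b), weight in sorted(edge_weights.items()):
    print(a, b, weight)
```

Each key in ``edge_weights`` plays the role of an edge, and its count plays the role of the ``weight`` attribute.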

## Thresholds

Many network-building functions accept a ``min_weight`` keyword argument. For the ``coauthors`` function, ``min_weight`` specifies the minimum number of papers that two authors have to write together in order for an edge to be drawn between them. The default is ``min_weight=1``; setting ``min_weight=2`` will result in a smaller graph.

In [11]:
smaller_graph = authors.coauthors(papers, min_weight=2)

In [12]:
print 'This graph has only {0} nodes and {1} edges.'.format(len(smaller_graph.nodes()), len(smaller_graph.edges()))

This graph has only 666 nodes and 960 edges.
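
To see why a threshold shrinks the graph, here is a small sketch of the filtering idea in plain Python. It illustrates the concept only and is not how Tethne implements ``min_weight``. The edge weights and the ``apply_min_weight`` helper are hypothetical, and the sketch assumes that authors left without any edge drop out of the graph.

```python
# Hypothetical edge weights: number of papers each pair wrote together.
edge_weights = {
    ('A', 'B'): 3,
    ('A', 'C'): 1,
    ('B', 'C'): 2,
    ('D', 'E'): 1,
}


def apply_min_weight(weights, min_weight=1):
    """Keep only edges whose weight is at least ``min_weight``."""
    return {pair: w for pair, w in weights.items() if w >= min_weight}


kept = apply_min_weight(edge_weights, min_weight=2)
# Assumption: only authors that still have at least one edge remain as nodes.
nodes = {author for pair in kept for author in pair}
print('{0} nodes and {1} edges remain.'.format(len(nodes), len(kept)))
```

In this toy example, authors ``D`` and ``E`` disappear entirely because their only shared edge falls below the threshold.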


## Exporting and visualizing networks

Functions for exporting networks in some common network file formats can be found in the **``writers``** subpackage. For a single network, we can use the methods in the **``writers.graph``** module.

The [GraphML file format](http://graphml.graphdrawing.org/) is a popular and well-supported graph format. To write a GraphML file from your network, first import the ``writers.graph`` module.

In [13]:
from tethne.writers import graph

Then specify a path to the file that you want to create (it will be created for you), and pass it along with your graph to the ``graph.to_graphml`` function.

In [14]:
graphml_path = '/Users/erickpeirson/Downloads/my_graphml_file.graphml'

In [15]:
graph.to_graphml(coauthor_graph, graphml_path)

If you open the resulting file in a text editor, you should see a bunch of XML that looks something like this:

```
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <key attr.name="weight" attr.type="int" for="edge" id="weight" />
  <key attr.name="ayjid" attr.type="string" for="edge" id="ayjid" />
  <key attr.name="institution" attr.type="string" for="node" id="institution" />
  <key attr.name="type" attr.type="string" for="graph" id="type" />
  <graph edgedefault="undirected">
    <data key="type">coauthors</data>
    <node id="SALVANES AGV">
      <data key="institution">UNIV BERGEN, NORWAY</data>
    </node>
    <node id="CHEN YH">
      <data key="institution">UNIV CALIF BERKELEY, CA 94720 USA</data>
    </node>
    .
    .
    .
    <edge source="CHATTIPAKORN N" target="CHATTIPAKORN S">
      <data key="ayjid">['PRATCHAYASAKUL W 2008 INDIAN JOURNAL OF MEDICAL RESEARCH']</data>
      <data key="weight">1</data>
    </edge>
    <edge source="EIN-GEDY S" target="HUBNER S">
      <data key="ayjid">['HUBNER S 2013 JOURNAL OF EVOLUTIONARY BIOLOGY']</data>
      <data key="weight">1</data>
    </edge>
    <edge source="RIPLEY BS" target="GILBERT ME">
      <data key="ayjid">['RIPLEY BS 2007 JOURNAL OF EXPERIMENTAL BOTANY', 'IBRAHIM DG 2008 PLANT CELL AND ENVIRONMENT', 'OSBORNE CP 2008 JOURNAL OF EXPERIMENTAL BOTANY']</data>
      <data key="weight">3</data>
    </edge>
  </graph>
</graphml>
```
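
Because GraphML is plain XML, you can sanity-check an exported file with Python's standard library before opening it in a visualization tool. The sketch below parses a small, made-up GraphML string modeled on the excerpt above and lists its nodes and weighted edges. To inspect a real file you would presumably call ``ET.parse(graphml_path).getroot()`` instead of ``ET.fromstring``.

```python
import xml.etree.ElementTree as ET

# A tiny GraphML document modeled on the excerpt above (contents are made up).
GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key attr.name="weight" attr.type="int" for="edge" id="weight" />
  <graph edgedefault="undirected">
    <node id="DOE J" />
    <node id="ROE R" />
    <node id="POE E" />
    <edge source="DOE J" target="ROE R">
      <data key="weight">2</data>
    </edge>
    <edge source="DOE J" target="POE E">
      <data key="weight">1</data>
    </edge>
  </graph>
</graphml>"""

# GraphML elements live in this XML namespace, so queries must name it.
NS = {'g': 'http://graphml.graphdrawing.org/xmlns'}

root = ET.fromstring(GRAPHML)
node_ids = [n.get('id') for n in root.findall('.//g:node', NS)]
edges = [
    (e.get('source'), e.get('target'),
     int(e.find("g:data[@key='weight']", NS).text))
    for e in root.findall('.//g:edge', NS)
]
print(len(node_ids), 'nodes;', len(edges), 'edges')
```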

Try loading this file in your favorite network visualization software. Here's what it looks like for me in Cytoscape:

![](http://devo-evo.lab.asu.edu/methods/tethne/images/coauthor_graph_whole.png)

Zooming in, I've used the ``name`` attribute in the graph to add node labels, set edge width as a function of weight (see above), and colored the nodes by institution.

![](http://devo-evo.lab.asu.edu/methods/tethne/images/coauthor_graph_closeup.png)

## The great thing about NetworkX...

...is that it comes with a bunch of great network analysis algorithms! You can read about them [here](http://networkx.github.io/documentation/networkx-1.9/reference/algorithms.html). For example, I can calculate the betweenness centrality for all of the nodes in a network using the ``networkx.betweenness_centrality`` function.

In [16]:
import networkx as nx

In [17]:
bcentrality = nx.betweenness_centrality(coauthor_graph)
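
Betweenness centrality scores a node by the share of shortest paths between other nodes that pass through it. As a rough illustration of what is being computed, here is a pure-Python sketch of Brandes' algorithm for a small unweighted, undirected graph. The function name ``betweenness`` and the toy graph are made up, and the normalization (dividing by the number of node pairs that exclude the node itself) reflects my understanding of the NetworkX default for undirected graphs. NetworkX's own implementation handles more cases, such as edge weights.

```python
from collections import deque


def betweenness(adj):
    """Normalized betweenness for an unweighted, undirected graph.

    ``adj`` maps each node to a list of its neighbors.
    Hypothetical helper for illustration only.
    """
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0.0)
        sigma[s] = 1.0
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # Each pair is visited from both ends, hence (n-1)(n-2) rather than half.
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: b * scale for v, b in bc.items()}


# A path graph A - B - C - D: the two middle nodes carry all the traffic.
path = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B', 'D'], 'D': ['C']}
toy_scores = betweenness(path)
print(toy_scores)
```

The end nodes of the path score zero because no shortest path between two other nodes runs through them, which is also why so many authors in a sparse coauthor graph score ``0.0``.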

``networkx.betweenness_centrality`` returns a ``dict`` that maps node IDs onto their betweenness centrality scores.

In [18]:
bcentrality.items()[5:20]

[(('WIESENBERG', 'GUIDO L B'), 0.0),
 (('FRESCHET', 'GREGOIRE T'), 1.5832114846864756e-06),
 (('BRONMARK', 'CHRISTER'), 0.0),
 (('BALTISBERGER', 'MATTHIAS'), 0.0),
 (('REHFELDT', 'GERALD E'), 0.0),
 (('HEYNEN', 'M'), 0.0),
 (('MARQUEZGUZMAN', 'J'), 0.0),
 (('DORRIS', 'J'), 0.0),
 (('DEJONG', 'TJ'), 0.0),
 (('RAMIREZHERRERA', 'C'), 0.0),
 (('ABBOTT', 'LB'), 0.0),
 (('FRIAR', 'ELIZABETH'), 0.0),
 (('KVET', 'JAN'), 0.0),
 (('DANG', 'QL'), 0.0),
 (('JAQUISH', 'BC'), 0.0)]
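
Since most scores are zero, it is often more informative to look at the highest-scoring nodes. A minimal sketch, using a small made-up dictionary in the same shape as ``bcentrality`` (node ID mapped to score):

```python
# Hypothetical scores: node ID -> betweenness centrality. Values are made up.
scores = {
    ('DOE', 'J'): 0.0,
    ('ROE', 'R'): 2.5e-06,
    ('POE', 'E'): 0.0,
    ('LOE', 'L'): 1.0e-05,
}

# Sort by score, highest first, and keep the top two.
top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:2]
for node, score in top:
    print(' '.join(node), score)

# Count how many nodes have a non-zero score at all.
nonzero = sum(1 for value in scores.values() if value > 0)
```

The same ``sorted`` call should work on the real ``bcentrality`` dictionary.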

We can then add those values as attributes to the nodes in that network using ``networkx.set_node_attributes``:

In [19]:
nx.set_node_attributes(coauthor_graph, 'centrality', bcentrality)

In [20]:
coauthor_graph.nodes(data=True)[10]

(('MARQUEZGUZMAN', 'J'),
 {'centrality': 0.0, 'count': 1.0, 'documentCount': 1.0})

...and those values get included when we export our graph to GraphML!

In [21]:
graphml_path_updated = '/Users/erickpeirson/Downloads/my_graphml_file_centrality.graphml'

In [22]:
graph.to_graphml(coauthor_graph, graphml_path_updated)

Here I've set node size based on betweenness centrality:

![](http://devo-evo.lab.asu.edu/methods/tethne/images/coauthor_graph_closeup_bc.png)