In [1]:
%matplotlib inline

In [2]:
from pprint import pprint
import matplotlib.pyplot as plt

# Introduction to Tethne: Loading Data, part 1

In this section we will use the ``tethne.readers.wos`` module to load up some data from the ISI Web of Science. We'll introduce the ``Paper`` and ``Corpus`` classes, and explore some basic aspects of their structure.

## Using this notebook

This is an interactive Python notebook. Most of the content is just Markdown text, like this paragraph, that explains some aspect of the Tethne package. Some of the cells are "code" cells, which look like this:

In [3]:
print("This is a code cell!")

This is a code cell!


You can execute the code in a code cell by clicking on it and pressing Shift-Enter on your keyboard, or by clicking the right-arrow "Run" button in the toolbar at the top of the page. The cell below will automatically be selected, so you can run many cells in quick succession by repeatedly pressing Shift-Enter (or the "Run" button). It's a good idea to run all of the code cells in order, from the top of the tutorial, since many commands later in the tutorial will depend on earlier ones.

## Before you start

* Download the Web of Science practice dataset from [here](http://devo-evo.lab.asu.edu/methods/tethne/datasets.zip), and store it in a place where you can find it. You'll need the full path to your dataset.
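
If you're not sure what the full path to your dataset is, the standard library can work it out for you. This is a minimal sketch that uses only ``os.path``; the helper name ``resolve_dataset_path`` and the example location are hypothetical and not part of Tethne.

```python
import os

def resolve_dataset_path(path):
    """Expand a leading '~' and return an absolute path (hypothetical helper)."""
    return os.path.abspath(os.path.expanduser(path))

# Assumed location; replace this with wherever you unpacked the dataset.
full_path = resolve_dataset_path('~/Downloads/datasets/wos')
print(full_path)
print(os.path.exists(full_path))  # True only if the dataset is really there
```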

## Importing a ``reader`` module

The modules in the ``tethne.readers`` subpackage allow you to read data from a few different databases. The readers for Web of Science and JSTOR DfR datasets are the most rigorously tested, but there is also a reader for Scopus, and one for DSpace (limited use-cases).

| Database                | Module                    |
| ----------------------- | ------------------------- |
| Web of Science          | ``tethne.readers.wos``    |
| JSTOR Data-for-Research | ``tethne.readers.dfr``    |
| Scopus                  | ``tethne.readers.scopus`` |
| DSpace (experimental)   | ``tethne.readers.dspace`` |

You can load the ``tethne.readers.wos`` module by importing it from the ``tethne.readers`` subpackage:

In [4]:
from tethne.readers import wos

You can import other reader modules the same way. To load the JSTOR DfR reader module, you would do:

In [5]:
from tethne.readers import dfr

## Using the ``read`` function

The simplest way to load data from a bibliographic dataset is to use the **``read``** function. Each module in the ``tethne.readers`` subpackage should have a ``read`` function. In fact, there should be identically-named versions of all of the common functions in each reader module.

**``read``** can parse a single data file, and returns a ``Corpus`` object. First I'll create a ``str`` object that holds the path to one of my data files, and then I'll pass that as an argument to ``read``.

In [6]:
# datapath should contain the path to one of your WoS data files.
datapath = '/Users/erickpeirson/Downloads/datasets/wos/genecol* OR common garden 1-500.txt'
corpus = wos.read(datapath)

## Reading more than one data file at a time

Often you'll be working with datasets composed of multiple data files. The Web of Science database only allows you to download 500 records at a time (because they're dirty capitalists). You can use the **``read``** function to load a single ``Corpus`` of ``Paper``s from a directory containing multiple data files.

First I'll create a ``str`` object containing the path to my data directory, and then I'll load those data using **``wos.read``**. The ``read`` function detects whether your path is a directory or a single data file; given a directory, it looks inside for WoS data files.

In [7]:
datadirpath = '/Users/erickpeirson/Downloads/datasets/wos'
corpus = wos.read(datadirpath)
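
To make the file-versus-directory distinction concrete, here is a rough sketch of how a reader could decide what to parse. This is not Tethne's actual implementation: the function name ``list_data_files`` is hypothetical, and filtering on a ``.txt`` extension is an assumption made for illustration.

```python
import os
import tempfile

def list_data_files(path, extension='.txt'):
    """Return the data files to parse for a file path or a directory path."""
    if os.path.isdir(path):
        # Directory: collect every file inside it with the expected extension.
        return sorted(
            os.path.join(path, name)
            for name in os.listdir(path)
            if name.endswith(extension)
        )
    # Anything else is treated as a single data file.
    return [path]

# Build a throwaway directory so the example runs anywhere.
demo_dir = tempfile.mkdtemp()
for name in ('records 1-500.txt', 'records 501-1000.txt', 'notes.md'):
    with open(os.path.join(demo_dir, name), 'w') as handle:
        handle.write('')

found = list_data_files(demo_dir)
print([os.path.basename(p) for p in found])
```

Only the two ``.txt`` files are picked up, and a path that is not a directory comes back unchanged as a one-item list.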

## ``Corpus`` objects

A ``Corpus`` is a collection of ``Paper``s with superpowers. Most importantly, it provides a consistent way of indexing bibliographic records. Indexing is important, because it sets the stage for all of the subsequent analyses that we may wish to do with our bibliographic data.

In [8]:
type(corpus), len(corpus)

(tethne.classes.corpus.Corpus, 1859)

A ``Corpus`` behaves like a list of ``Paper``s.

In [9]:
print('There are %i Papers in my Corpus' % len(corpus))
print('The first Paper in my Corpus is %s' % corpus[0])
print('The last Paper in my Corpus is %s' % corpus[-1])

There are 1859 Papers in my Corpus
The first Paper in my Corpus is <tethne.classes.paper.Paper object at 0x111c6aed0>
The last Paper in my Corpus is <tethne.classes.paper.Paper object at 0x10ccb4a10>
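
If you're curious what "behaves like a list" means in Python terms, an object only needs ``__len__`` and ``__getitem__`` to support ``len()``, indexing, and iteration. The class below is a toy stand-in (``TinyCorpus`` is a hypothetical name); it says nothing about how Tethne's ``Corpus`` is actually built.

```python
class TinyCorpus(object):
    """A toy list-like container (hypothetical; not the Tethne Corpus)."""

    def __init__(self, papers):
        self._papers = list(papers)

    def __len__(self):
        # Lets len(tiny) work.
        return len(self._papers)

    def __getitem__(self, index):
        # Lets tiny[0], tiny[-1], and for-loops work.
        return self._papers[index]

tiny = TinyCorpus(['paper A', 'paper B', 'paper C'])
print(len(tiny))
print(tiny[0])
print(tiny[-1])
```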


Each ``Paper`` represents one bibliographic record. If we inspect a ``Paper``, we should see some structured bibliographic metadata describing a publication. Notice that there is an ``abstract``, title (``atitle``, for 'article title'), and author names (``aulast``, ``auinit``, and ``auuri``).

In [10]:
pprint(corpus[0].__dict__)

{'ER': '',
 'GA': '161KU',
 'ISSN': ['1010-061X', 'J9 J EVOLUTION BIOL'],
 'PD': 'JAN',
 'PG': 5,
 'PT': 'J',
 'WC': 'Ecology; Evolutionary Biology; Genetics & Heredity',
 'abstract': "The waterstrider Aquarius najas is wingless in Northern Europe, while winged individuals occur frequently in Central and Southern Europe. To test if the latitudinal difference is genetically controlled, we collected mature individuals from 10 different populations and raised their offspring in 'common garden' laboratory conditions. Half of these populations were from southern and the other half from central Finland. Daylength and temperature do influence wing development among other species of waterstriders, and thus we maintained a similar short daylength and warm conditions for all populations. These conditions should be favourable for wing development in general. Among laboratory-bred individuals several winged individuals appeared, and their proportion varied between populations. The relative frequen

Since we're working with WoS data, there is also a very long list of ``citedReferences``. Each citation in that list has some of the same fields as the records in your dataset. These citations have far fewer fields, simply because the original Web of Science data contains very little information about each cited reference.

In [11]:
pprint(corpus[0].citedReferences[0].__dict__)

{'authors_init': [(u'BLANCKENHORN', u'W U')],
 'date': 1991,
 'doi': '10.2307/2409899',
 'journal': u'EVOLUTION',
 'pageStart': '1520',
 'volume': '45'}
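
Even with so few fields, a cited reference is enough to build a short human-readable citation string. The sketch below copies the field values from the output above into a plain ``dict``; ``format_citation`` is a hypothetical helper written for this example, not a Tethne function.

```python
# Field values copied from the cited reference shown above.
citation = {
    'authors_init': [('BLANCKENHORN', 'W U')],
    'date': 1991,
    'doi': '10.2307/2409899',
    'journal': 'EVOLUTION',
    'pageStart': '1520',
    'volume': '45',
}

def format_citation(fields):
    """Join the available fields into one line (hypothetical helper)."""
    authors = '; '.join(
        '%s %s' % (last, init)
        for last, init in fields.get('authors_init', [])
    )
    return '%s (%s) %s %s:%s' % (
        authors,
        fields.get('date', ''),
        fields.get('journal', ''),
        fields.get('volume', ''),
        fields.get('pageStart', ''),
    )

print(format_citation(citation))
```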
