In [1]:
%matplotlib inline

In [2]:
from pprint import pprint

# Introduction to Tethne: Loading Data, part 1

In this section we will use the ``tethne.readers.wos`` module to load up some data from the ISI Web of Science. We'll introduce the ``Paper`` and ``Corpus`` classes, and explore some basic aspects of their structure.

## Using this notebook

This is an interative Python notebook. Most of the content is just marked-down text, like this paragraph, that provides expository on some aspect of the Tethne package. Some of the cells are "code" cells, which look like this:

In [3]:
print("This is a code cell!")

This is a code cell!


You can execute the code in a code cell by clicking on it and pressing Shift-Enter on your keyboard, or by clicking the right-arrow "Run" button in the toolbar at the top of the page. The cell below will automatically be selected, so you can run many cells in quick succession by repeatedly pressing Shift-Enter (or the "Run" button). It's a good idea to run all of the code cells in order, from the top of the tutorial, since many commands later in the tutorial will depend on earlier ones.

## Before you start

* Download the Web of Science practice dataset from [here](http://devo-evo.lab.asu.edu/methods/tethne/datasets.zip), and store it in a place where you can find it. You'll need the full path to your dataset.
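If you are unsure what the full path is, the standard library can work it out for you. The snippet below is a minimal sketch: `resolve_dataset_path` is a hypothetical helper (it is not part of Tethne), and `~/Downloads/datasets/wos` is only an assumed location for the unzipped practice dataset.

```python
import os


def resolve_dataset_path(path):
    """Expand a leading '~' and return an absolute, normalized path."""
    return os.path.abspath(os.path.expanduser(path))


# Assumed location; replace it with wherever you stored the dataset.
datapath = resolve_dataset_path('~/Downloads/datasets/wos')

# True only if something actually exists at that location.
found = os.path.exists(datapath)
```

If `found` is `False`, double-check where the archive was unzipped before moving on.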

## Importing a ``reader`` module

The modules in the ``tethne.readers`` subpackage allow you to read data from a few different databases. The readers for Web of Science and JSTOR DfR datasets are the most rigorously tested, but there is also a reader for Scopus, and one for DSpace (limited use-cases).

| Database                | Module                    |
| ----------------------- | ------------------------- |
| Web of Science          | ``tethne.readers.wos``    |
| JSTOR Data-for-Research | ``tethne.readers.dfr``    |
| Scopus                  | ``tethne.readers.scopus`` |
| DSpace (experimental)   | ``tethne.readers.dspace`` |

You can load the ``tethne.readers.wos`` module by importing it from the ``tethne.readers`` subpackage:

In [4]:
from tethne.readers import wos

You can import other reader modules the same way. To load the JSTOR DfR reader module, you would do:

In [5]:
from tethne.readers import dfr

## Using the ``read`` function

The simplest way to load data from a bibliographic dataset is to use the **``read``** function. Each module in the ``tethne.readers`` subpackage should have a ``read`` function. In fact, there should be identically-named versions of all of the common functions (you'll see a few more, later, like ``read_corpus``) in each reader module.

**``read``** parses a single data file, and returns a list of ``Paper``s (we'll discuss what a ``Paper`` is below). First I'll create a ``str`` object that holds the path to one of my data files, and then I'll pass that as an argument to ``read``.

In [8]:
# datapath should contain the path to one of your WoS data files.
datapath = '/Users/erickpeirson/Downloads/datasets/wos/genecol* OR common garden 1-500.txt'

In [9]:
papers = wos.read(datapath)

There should be 500 ``Paper``s in the list that we just created (``papers``):

In [10]:
len(papers)

500
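
To give a rough sense of what a reader has to do, here is a minimal sketch of parsing the Web of Science field-tagged export format, in which each field starts with a two-letter tag, continuation lines are indented, and ``ER`` closes a record. This is not Tethne's parser: `parse_field_tagged` is a hypothetical helper, and the sample text is assembled from field values that appear in the ``Paper`` printout in this tutorial.

```python
def parse_field_tagged(text):
    """Parse field-tagged text into a list of dicts, one per record.

    Every field value is stored as a list of lines, so multi-line
    fields keep their line structure.
    """
    records = []
    current = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        if line.startswith('   '):
            # Indented line: continuation of the most recent field.
            if tag is not None:
                current[tag].append(line.strip())
            continue
        tag, _, value = line.partition(' ')
        if tag == 'ER':
            # 'ER' marks the end of a record.
            records.append(current)
            current = {}
            tag = None
        else:
            current.setdefault(tag, []).append(value.strip())
    return records


sample = "\n".join([
    "GA 219TL",
    "FX We were assisted in the field by R. Alex Smith, Catherine Fox, and Scott",
    "   David. The Rocky Mountain Biological Laboratory (RMBL) provided",
    "ER",
])

records = parse_field_tagged(sample)
```

A real reader also has to map these tags onto friendlier field names and build the cited references, which is the work that ``wos.read`` does for you.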

## ``Paper`` objects

If you print the first element in the list, you should see some structured bibliographic metadata describing a publication. Notice that there is an ``abstract``, title (``atitle``, for 'article title'), and author names (``aulast``, ``auinit``, and ``auuri``). 

In [12]:
pprint(papers[0].__dict__)

{'ER': '',
 'FX': ['We were assisted in the field by R. Alex Smith, Catherine Fox, and Scott',
        'David. The Rocky Mountain Biological Laboratory (RMBL) provided',
        'laboratory space and extensive logistical support. Marc Johnson and',
        'Xoaquin Moreira provided helpful comments on earlier manuscript drafts.',
        'W. K. Petry, M. Lopez, and S. K. Rudeen were supported by the National',
        'Science Foundation Research Experience for Undergraduates (NSF-REU)',
        'program under grant number DBI 0753774 to RMBL. K. I. Perry was',
        'supported as an NSF-REU supplement on DBI 0731346. W. K. Petry was',
        'supported by an NSF Graduate Research Fellowship, and K. A. Mooney by',
        'NSF DEB 0919178, a RMBL Ehrlich Fellowship, and the UC-Irvine Academic',
        'Senate Council on Research, Computing, and Libraries. Mapping of plants',
        'for annual surveys was made possible by NSF grant DBI 0420910 to RMBL.'],
 'GA': '219TL',
 'ISSN': 

Since we're working with WoS data, there is also a very long list of ``citedReferences``. Each citation in that list has some of the same fields as the records in your dataset. Many fields are missing, however, which means that we simply don't have any data for those fields.

In [17]:
pprint(papers[0].citedReferences[0].__dict__)

{'authors_init': [(u'ABDALA-ROBERTS', u'L')],
 'date': 2012,
 'doi': '10.1111/j.1600-0706.2012.20600.x',
 'journal': u'OIKOS',
 'pageStart': '1905',
 'volume': '121'}
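
Because cited references carry only a subset of fields, code that walks over them should tolerate absent values. The sketch below uses a plain stand-in object rather than a real ``Paper``: `StandInRecord` and `field_or_default` are hypothetical names, and the field values are copied from the output above. It relies only on Python's built-in `getattr`, on the assumption that fields are stored as ordinary attributes, as the ``__dict__`` printout suggests.

```python
class StandInRecord(object):
    """Hypothetical stand-in for a record; NOT Tethne's Paper class."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


def field_or_default(record, name, default=None):
    """Return the named field, or `default` when the record lacks it."""
    return getattr(record, name, default)


ref = StandInRecord(
    date=2012,
    doi='10.1111/j.1600-0706.2012.20600.x',
    journal='OIKOS',
    pageStart='1905',
    volume='121',
)

present = sorted(vars(ref))               # names of the fields we do have
title = field_or_default(ref, 'title', default='(no title)')
```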


All of these records -- both the records in your dataset and the records for their cited references -- are ``Paper`` objects. 

In [18]:
type(papers[0])

tethne.classes.paper.Paper

The ``Paper`` class behaves a lot like a ``dict``. You can access data in a ``Paper`` like you would a ``dict``:

In [20]:
papers[0]['title']

'Mechanisms Underlying Plant Sexual Dimorphism In Multi-Trophic Arthropod Communities'
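
If you are curious how an object can offer both attribute access and ``dict``-style access, the following sketch shows one common way to do it in Python. It is only an illustration of the general technique: `DictLikeRecord` is a hypothetical class, and nothing here is taken from Tethne's actual implementation of ``Paper``.

```python
class DictLikeRecord(object):
    """Hypothetical record whose fields are reachable as attributes or keys."""

    def __init__(self, **fields):
        self.__dict__.update(fields)

    def __getitem__(self, key):
        # Raises KeyError for unknown fields, just like a dict.
        return self.__dict__[key]

    def __setitem__(self, key, value):
        self.__dict__[key] = value

    def __contains__(self, key):
        return key in self.__dict__


record = DictLikeRecord(
    title='Mechanisms Underlying Plant Sexual Dimorphism '
          'In Multi-Trophic Arthropod Communities'
)

same_value = record['title'] == record.title   # both routes reach one field
has_abstract = 'abstract' in record            # membership test, dict-style
```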

## Reading more than one data file at a time

Often you'll be working with datasets composed of multiple data files. The Web of Science database only allows you to download 500 records at a time (because they're dirty capitalists). You can use the same **``read``** function to load a list of ``Paper``s from a directory containing multiple data files.

First I'll create a ``str`` object containing the path to my data directory, and then I'll load those data using **``wos.read``**. ``read`` knows that your path is a directory and not a data file; it looks inside of that directory for WoS data files.

In [24]:
datadirpath = '/Users/erickpeirson/Downloads/datasets/wos'

In [25]:
papers = wos.read(datadirpath)

You should have a much larger list of ``Paper``s now:

In [26]:
len(papers)

1859
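
Conceptually, reading a directory amounts to finding the data files inside it, parsing each one, and concatenating the results. The sketch below illustrates that idea with the standard library only. `list_data_files` and `read_all` are hypothetical helpers, the ``.txt`` extension filter is an assumption, and none of this is Tethne's own code; the demo uses throwaway placeholder files in a temporary directory.

```python
import os
import shutil
import tempfile


def list_data_files(dirpath, extension='.txt'):
    """Return sorted full paths of the files in `dirpath` with `extension`."""
    paths = []
    for name in os.listdir(dirpath):
        path = os.path.join(dirpath, name)
        if name.endswith(extension) and os.path.isfile(path):
            paths.append(path)
    return sorted(paths)


def read_all(dirpath, read_one):
    """Apply `read_one` to every data file and combine the resulting lists."""
    records = []
    for path in list_data_files(dirpath):
        records.extend(read_one(path))
    return records


# Demo with placeholder files (not real WoS data).
demo_dir = tempfile.mkdtemp()
try:
    for name in ('a 1-500.txt', 'b 501-1000.txt', 'notes.md'):
        with open(os.path.join(demo_dir, name), 'w') as handle:
            handle.write('placeholder\n')
    found = [os.path.basename(p) for p in list_data_files(demo_dir)]
    # A stand-in "reader" that yields one item per file.
    combined = read_all(demo_dir, lambda path: [os.path.basename(path)])
finally:
    shutil.rmtree(demo_dir)
```

In practice you would pass a real single-file reader as `read_one`; with Tethne, calling ``wos.read`` on the directory path already does all of this for you.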