# Data Workflow - Extracting & Utilizing Data from iDigBio Using Python

## 1. Introduction

This notebook walks through an example research workflow for biological specimen data, using a collection of Python scripts written by the author. The scripts make it quick and convenient to specify conditions for a subset of data of interest, retrieve all matching records from an online source, store them in a locally hosted database, and then query or analyze the data through an API. The notebook also documents some implementation details of the modules used along the way.

## 2. Extracting & Storing the Data

The first stage of this workflow is to specify the source of biological specimen data and to define the subset of that data that is of interest for later analysis. Next, a query is formatted to retrieve this subset from the data source's records, and finally the results are stored in a local database for later use.

The source of biological specimen data used in this workflow is iDigBio (Integrated Digitized Biocollections, https://www.idigbio.org/), a site that hosts specimen records from collections across the U.S. These collections in turn hold a large variety of specimens from all over the world. iDigBio provides free access to these records not only through a web portal but also through an API for programmatic data retrieval. Client libraries for the API make it easily accessible from programming languages such as R, Python and Ruby. As an important side note, the Python scripts in this workflow rely on the Python package for this API, called simply 'idigbio', which can be installed through pip or from https://github.com/iDigBio/idigbio-search-api.

Next, the subset of data to be explored needs to be defined so that it can be retrieved. A good way to explore candidate datasets is to run a few preliminary queries in iDigBio's search portal and check the quality and quantity of the records available. Once a suitable dataset is found, the search terms used to retrieve it should be documented so that the same dataset can be retrieved programmatically with the author's Python scripts.

To retrieve the defined dataset programmatically, the iDigBio API must be given parameters that match the search terms defined earlier, so that it can build a query and retrieve the appropriate records from iDigBio's database. To construct this query, the iDigBio API library in Python requires a dictionary (often named "rq", for record query) in which the search terms are defined as key-value pairs. Each key must be the name of a record field, and each value the value to match in that field. Because the keys correspond to record field names, only specific terms are accepted; a full list can be found here: https://github.com/idigbio/idigbio-search-api/wiki/Index-Fields. These terms cover many variables commonly associated with biological specimen collection records, such as scientific name, family and genus, among many others.

For example, to obtain all records of collected specimens within the genus "Panthera" that are held by the American Museum of Natural History (AMNH), the appropriate "rq" dictionary would look like this:


In [1]:
rq = {"genus" : "panthera", "institutioncode" : "AMNH"}

As of May 2018, this query into iDigBio's records yielded about 668 results, which is a suitable dataset size for this notebook. More search terms can be added to the dictionary to refine the query further, or terms can be removed to broaden it.

**PLEASE NOTE:** Due to restrictions of the iDigBio API, a query returns at most 5000 records. The number of results returned can optionally be reduced by specifying a variable called "limit". To get the maximum number of records, we will set the limit to 5000:

In [2]:
limit = 5000
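
To make the roles of "rq" and "limit" concrete, the following is a minimal sketch of how the two values could be turned into a raw search request. It is not taken from the author's scripts: the helper name `build_search_url` is hypothetical, and the endpoint URL is an assumption about the public iDigBio search service. The sketch only builds the URL and does not send a request.

```python
import json
from urllib.parse import urlencode

# Assumption: the public iDigBio search endpoint for specimen records.
SEARCH_URL = "https://search.idigbio.org/v2/search/records"
MAX_LIMIT = 5000  # maximum number of records returned for one query


def build_search_url(rq, limit=MAX_LIMIT):
    """Return a GET URL for a record search (hypothetical helper)."""
    # Never ask for more records than the API will return.
    limit = min(int(limit), MAX_LIMIT)
    # The search terms are sent as a JSON-encoded "rq" parameter.
    params = {"rq": json.dumps(rq, sort_keys=True), "limit": limit}
    return SEARCH_URL + "?" + urlencode(params)


rq = {"genus": "panthera", "institutioncode": "AMNH"}
url = build_search_url(rq, limit=5000)
print(url)
```

In the workflow itself this step is handled by the 'idigbio' package, so a helper like this would mainly be useful for inspecting or debugging a query.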

Now that the parameters for the query to iDigBio have been defined, the next steps are to run the query and store the results, which requires a database. In this notebook and the Python scripts written for this workflow, a PostgreSQL database has been selected for this task; to use the scripts, a PostgreSQL database must therefore be set up that the scripts can connect to and modify. Please note that the scripts use Python libraries such as "records" and "psycopg2" to interface with the PostgreSQL database. The scripts' database connection details, such as database name, user and password, can be set in the "connectDB()" function of the "DBInfo.py" script.
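
As an illustration of what such connection details amount to, here is a minimal sketch that assembles a PostgreSQL connection URL from individual settings. It does not reproduce "connectDB()": the helper name `build_db_url`, the default host and port, and the placeholder credentials are all assumptions made for this example.

```python
from urllib.parse import quote


def build_db_url(dbname, user, password, host="localhost", port=5432):
    """Build a PostgreSQL connection URL (hypothetical helper)."""
    # Percent-encode the credentials so characters such as "@" or "/"
    # cannot be mistaken for URL delimiters.
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port, dbname
    )


# Placeholder values; replace them with the settings of the local database.
db_url = build_db_url("specimens", "postgres", "p@ss/word")
print(db_url)
```

A URL of this form can typically be handed to `records.Database(db_url)` or `psycopg2.connect(db_url)`; whether the author's scripts connect this way is not stated in this notebook.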

Once a PostgreSQL database has been set up and the appropriate parameters set in the "DBInfo.py" script, the target data can be queried and stored. The first step is to define the name of the table in which the data will be stored in the PostgreSQL database:

In [3]:
table_name = "records"

Next, using the author's Python script "API.py", the query and the storage of the resulting data can be done with a single call to the function "createTable()". This function takes the "rq" dictionary defined earlier, which contains the desired search terms, the table name string defined above, and the optional limit argument that caps the number of records retrieved:

In [4]:
import API

API.createTable(rq, table_name, limit)

Database table records has been created successfully.
Database table records has been populated successfully.


Once the function has finished executing, the specified database should contain a new table holding all of the data returned by the query.

#### How the createTable() function works
This subsection briefly discusses how the table is created from the query results. It is not required knowledge for using the scripts and is included mainly for documentation purposes.

This function uses two helper scripts: "TableSchemaCreator.py" and "TableCreator.py". The former creates all the necessary fields in the database table, and the latter then populates those fields.

The way "TableSchemaCreator.py" creates the table fields is dynamic: it is passed the results of the query to iDigBio and uses them to build a list of the distinct field names present in that data. Additionally, the iDigBio API provides an endpoint from which the type of each field can be gathered. From the list of field names and the type of each field, a SQL command is formed for adding each field and its corresponding type to the table.
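
The idea of deriving a schema from the data can be sketched as follows. This is a simplified illustration, not the code in "TableSchemaCreator.py": it assumes the query results are available as a list of dictionaries, uses a hypothetical mapping from API field types to PostgreSQL column types, and emits a single CREATE TABLE statement rather than whatever SQL the script actually builds. The sample records and field types are made up.

```python
# Hypothetical mapping from API field types to PostgreSQL column types.
PG_TYPES = {
    "string": "TEXT",
    "integer": "INTEGER",
    "float": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
}


def quote_ident(name):
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def collect_field_names(records):
    """Return the distinct field names found across all records, sorted."""
    names = set()
    for record in records:
        names.update(record.keys())
    return sorted(names)


def build_create_table(table_name, records, field_types):
    """Build a CREATE TABLE statement with one column per distinct field."""
    columns = []
    for name in collect_field_names(records):
        # Fields with an unknown or missing type fall back to TEXT.
        pg_type = PG_TYPES.get(field_types.get(name, "string"), "TEXT")
        columns.append("{} {}".format(quote_ident(name), pg_type))
    return "CREATE TABLE {} ({})".format(
        quote_ident(table_name), ", ".join(columns)
    )


# Made-up sample data standing in for real query results and field types.
sample_records = [
    {"genus": "panthera", "institutioncode": "amnh"},
    {"genus": "panthera", "family": "felidae", "individualcount": 1},
]
sample_types = {"individualcount": "integer"}

create_sql = build_create_table("records", sample_records, sample_types)
print(create_sql)
```

Because the column list is computed from the records themselves, a field that appears in only one record still gets a column, which is what makes the schema dynamic.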

The "TableCreator.py" script is also given the results of the query, but its purpose is to populate the newly created table. Because the table already has a column for every field present in the query results, this script simply goes through the results record by record and creates INSERT statements that can then be executed in the database to load the data.
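
A record-by-record population step of this kind might look like the sketch below. It is an assumption-laden illustration rather than the contents of "TableCreator.py": the helper `build_insert` is hypothetical, the records are assumed to be plain dictionaries, and the statements use `%s` placeholders (the parameter style used by psycopg2) instead of embedding values directly in the SQL text.

```python
def quote_ident(name):
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_insert(table_name, record):
    """Return (sql, values) for inserting one record (hypothetical helper)."""
    columns = sorted(record)
    column_list = ", ".join(quote_ident(c) for c in columns)
    # One placeholder per column; the driver fills in the values safely.
    placeholders = ", ".join(["%s"] * len(columns))
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        quote_ident(table_name), column_list, placeholders
    )
    return sql, [record[c] for c in columns]


# Made-up sample data standing in for real query results.
sample_records = [
    {"genus": "panthera", "institutioncode": "amnh"},
    {"genus": "panthera", "family": "felidae", "individualcount": 1},
]

statements = [build_insert("records", r) for r in sample_records]
for sql, values in statements:
    print(sql, values)
```

Each `(sql, values)` pair could then be passed to a psycopg2 cursor as `cursor.execute(sql, values)`. Fields missing from a record are simply omitted from its statement, so the corresponding columns are left as NULL.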

## 3. Retrieving the Data from the PostgreSQL Database