# Data Workflow - Extracting & Utilizing Data from iDigBio Using Python

## 1. Introduction

This notebook provides an example of a research workflow with biological specimen data, using a collection of Python scripts written by the author. These scripts make it quick and convenient for a user to specify conditions for a subset of data of interest, retrieve all data matching these conditions from a source on the internet, store it in a locally hosted database, and then perform various actions on it, such as queries or analysis, through an API incorporated into the scripts.

It is important to note that these scripts were written as exploratory projects for learning how to program APIs, how to programmatically interact with APIs and databases, and how to use Python packages to analyze data. The scripts that resulted from these learning processes are now being combined into a workflow in this notebook to see how they interact.

This notebook will also provide some implementation details on the various modules within the collection of Python scripts that will be used in this notebook for documentation & usage purposes.

## 2. Extracting & Storing the Data

The first stage in this workflow consists of specifying the source of biological specimen data to be used and then defining a subset of the data available there that is of interest for later analysis. Next, a query needs to be formatted to retrieve this subset from the data source's records, and finally the results must be stored in a local database for later use.

#### Data Source
The biological specimen data used in this workflow comes from iDigBio (Integrated Digitized Biocollections, https://www.idigbio.org/), a site that holds specimen records from various specimen collections across the U.S. These collections in turn contain a large variety of specimens from all over the world. iDigBio provides free access to these records not only through a web portal but also through an API for more programmatic data retrieval. Libraries for the API also exist that make it easily accessible from programming languages such as R, Python and Ruby. As an important side note, the Python scripts in this workflow rely on the Python package for this API, called simply 'idigbio', which can be installed through pip or https://github.com/iDigBio/idigbio-search-api.

#### Defining & querying for a subset of data
Next, the subset of data to be explored needs to be defined so that its retrieval can begin. A good way to explore possible datasets of interest is to use iDigBio's search portal and run a few preliminary queries there to check the quality & quantity of the records available. Once a suitable dataset is found, the search terms used to retrieve it should be documented so that the same dataset can be retrieved programmatically with the author's Python scripts. This step is entirely optional, however, as the same queries can be run from the Python scripts as long as the correct search terms are used, something which will be discussed in the next paragraph.

To programmatically retrieve the defined dataset, parameters matching the search terms defined earlier must be given to the iDigBio API so that it can build a query and retrieve the appropriate records from iDigBio's database. To construct this query, the iDigBio API library in Python requires a dictionary (often named "rq" for record query) in which the search terms are defined as key-value pairs. Each key must be the name of a record field of interest, and each value must be the value to match in that field. Because the keys correspond to record field names, only specific terms can be used for them; a full list can be found here: https://github.com/idigbio/idigbio-search-api/wiki/Index-Fields. These terms cover many variables commonly associated with biological specimen collection records, such as scientific name, family and genus, among many others.

For example, if the interest is in obtaining all records of collected specimens within the genus "Panthera" that are in the collection of the American Museum of Natural History (AMNH), the appropriate "rq" dictionary would look like this:


In [1]:
rq = {"genus" : "panthera", "institutioncode" : "AMNH"}

As of May 2018, this query into iDigBio's records yielded about 668 results, which is a suitable dataset size for this notebook. However, more search terms can be added to the dictionary to refine the query further, or terms can be removed to broaden it, yielding fewer or more records respectively.
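As a rough illustration of refining and broadening a query, the sketch below derives two variants of the "rq" dictionary shown above. The extra "country" term and its value are assumptions chosen only for illustration; any field name from the index-fields list linked earlier could be used in its place.

```python
# Base query from the cell above.
rq = {"genus": "panthera", "institutioncode": "AMNH"}

# Refine the query by adding a term (hypothetical example field and value).
narrow_rq = {**rq, "country": "india"}

# Broaden the query by dropping a term: Panthera records from any institution.
broad_rq = {key: value for key, value in rq.items() if key != "institutioncode"}

print(narrow_rq)
print(broad_rq)
```

Building new dictionaries rather than editing "rq" in place keeps the original query available for the later steps of the notebook.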

**PLEASE NOTE:** Due to restrictions of the iDigBio API, the absolute maximum number of records that a query can return is 5000. The number of results returned can optionally be limited by specifying a variable called "limit", which is passed to the iDigBio API. To get the maximum number of records, we will set the limit to 5000:

In [2]:
limit = 5000
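
Because a value above the cap cannot be honoured, it may be convenient to check the limit before it is used. The helper below is a hypothetical sketch and not part of the author's scripts; the names `MAX_LIMIT` and `clamp_limit` are introduced here purely for illustration.

```python
MAX_LIMIT = 5000  # maximum result size stated above for the iDigBio API


def clamp_limit(requested):
    """Return a limit between 1 and MAX_LIMIT (hypothetical helper)."""
    if requested < 1:
        raise ValueError("limit must be a positive integer")
    return min(requested, MAX_LIMIT)


limit = clamp_limit(5000)
```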

#### Setting up a local database
Now that the parameters for the query to iDigBio have been defined, the next steps are to run the query and store the results. It is therefore first necessary to set up a database to hold the query results. In this notebook and the Python scripts written for this workflow, a PostgreSQL database has been selected for this task. The scripts accordingly use Python libraries such as "records" and "psycopg2" to interface with PostgreSQL (psycopg2 in particular is a PostgreSQL-specific driver). Consequently, in order to use the author's scripts, a PostgreSQL database must be set up that the scripts can connect to and modify. The connection details, such as database name, user and password, can be set in the "connectDB()" function of the "DBInfo.py" script.
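
The contents of "DBInfo.py" are not shown in this notebook, so the following is only a minimal sketch of what a "connectDB()"-style function could look like with psycopg2. All settings values are placeholders, and the names `DB_SETTINGS` and `connect_db` are hypothetical.

```python
# Placeholder connection settings; replace with the values for your own database.
DB_SETTINGS = {
    "dbname": "idigbio_local",
    "user": "postgres",
    "password": "change-me",
    "host": "localhost",
    "port": 5432,
}


def connect_db(settings=DB_SETTINGS):
    """Open a PostgreSQL connection (hypothetical stand-in for connectDB())."""
    # Imported here so the settings can be inspected without the driver installed.
    import psycopg2

    return psycopg2.connect(**settings)
```

Keeping the settings in one dictionary means only a single place needs editing when the database name, user or password changes.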

#### Retrieving the data
Once a PostgreSQL database has been set up and the appropriate parameters set in the "DBInfo.py" script, the target data can be queried and stored. The first step in this process is to define the name of the table in which the data will be stored in the PostgreSQL database. A table of the same name should not already exist in the database, as the script will automatically create a new table. The table name can be defined as follows:

In [3]:
table_name = "records"

Next, using the author's Python script called "API.py", the query and the storage of its results can be done with a single call to the function "createTable()". This function takes three arguments, two of which are required and one of which is optional. The first positional argument is the "rq" dictionary defined earlier (which contains the desired search terms), the second positional argument is the table_name string defined above, and the third, optional argument is the "limit" variable used for limiting the number of records retrieved from iDigBio (5000 by default). The function can be used as follows:

In [4]:
import API

API.createTable(rq, table_name, limit)

Database table records has been created successfully.
Database table records has been populated successfully.


Once the function has finished executing, the specified database should contain a new table holding all of the data returned from the query. As a side note on the structure of the author's scripts, "API.py" is the public interface to all of the other scripts, meaning that its functions provide access to all the functionality implemented in the other scripts. The functions present in "API.py" will be discussed in greater detail throughout this notebook as they are used, starting with the createTable() function below.

#### How the createTable() function works
This subsection will briefly discuss how the createTable() function creates a table in the PostgreSQL database using the results given by the user's query. This section does not provide necessary knowledge for using the scripts and is more for documentation purposes.

This function utilizes two helper scripts: "TableSchemaCreator.py" and "TableCreator.py". The former script's purpose is to create all the necessary fields in the database table, and the latter script's purpose is to then populate those fields.

"TableSchemaCreator.py" creates the table fields dynamically. It is passed the results from the query to iDigBio, which it uses to build a list of the distinct field names present in that data. Quite simply, the script iterates over the field names in each record of the query results, compares them to the field names already present in the database table, and adds the ones not yet present to the table. Additionally, the iDigBio API provides an endpoint from which the type of each field can be gathered. From the list of field names and the type of each field, a SQL command is formed for adding each field and its corresponding type to the table.
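
The field-discovery step described above can be sketched roughly as follows. This is not the author's implementation: the function names, the type mapping in `PG_TYPES`, and the sample records are all assumptions made for illustration, and the real script obtains field types from the iDigBio API rather than from a hand-written dictionary.

```python
# Assumed mapping from iDigBio field types to PostgreSQL column types.
PG_TYPES = {
    "string": "TEXT",
    "integer": "INTEGER",
    "float": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
    "date": "TIMESTAMP",
}


def new_fields(records, existing_columns):
    """Return field names found in the records that the table lacks, in first-seen order."""
    seen = set(existing_columns)
    missing = []
    for record in records:
        for name in record:
            if name not in seen:
                seen.add(name)
                missing.append(name)
    return missing


def add_column_statements(table, fields, field_types):
    """Build one ALTER TABLE statement per field; unknown types fall back to TEXT."""
    statements = []
    for name in fields:
        pg_type = PG_TYPES.get(field_types.get(name, "string"), "TEXT")
        statements.append(f'ALTER TABLE "{table}" ADD COLUMN "{name}" {pg_type};')
    return statements


# Tiny made-up sample standing in for query results.
sample = [
    {"genus": "panthera", "family": "felidae"},
    {"genus": "panthera", "datecollected": "1923-04-01"},
]
types = {"genus": "string", "family": "string", "datecollected": "date"}

fields = new_fields(sample, existing_columns=["genus"])
statements = add_column_statements("records", fields, types)
for statement in statements:
    print(statement)
```

Tracking the already-seen names in a set keeps the comparison cheap even when the query returns thousands of records.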

The "TableCreator.py" script is also given the results of the query, but its main purpose is to populate the newly created table. As a table with every field present in the query has been formed by the previous script, this script simply goes through the query result record by record and creates INSERT statements that can then be executed in the database to input the data.
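
As a minimal sketch of the population step, the function below builds a parameterized INSERT statement for a single record. It is an assumption about how such statements could be formed, not a copy of "TableCreator.py"; the name `insert_statement` and the sample record are hypothetical.

```python
def insert_statement(table, record):
    """Return (sql, values) for one record using psycopg2-style %s placeholders."""
    columns = list(record)
    column_sql = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f'INSERT INTO "{table}" ({column_sql}) VALUES ({placeholders});'
    return sql, [record[name] for name in columns]


sql, values = insert_statement("records", {"genus": "panthera", "family": "felidae"})
print(sql)
print(values)
```

With psycopg2, the resulting pair would typically be passed to `cursor.execute(sql, values)`, which lets the driver handle quoting of the values instead of interpolating them into the SQL string.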

The interaction between the database and the program is entirely done using the "psycopg2" library in Python.

## 3. Retrieving the Data from the PostgreSQL Database

Now that the data has been retrieved from iDigBio and stored in the local PostgreSQL database, the goal is to retrieve the data from this database so it can be used for other purposes, such as analysis. This will be done in much the same way that data was retrieved from iDigBio, namely through an API. Instead of iDigBio's API, the scripts provide an API with a similar implementation for accessing the local database. The following paragraphs discuss how this process works.

#### Launching the API server
Before the API for retrieving data from the local database can be used, the server that processes its requests must first be launched. The script that launches the server, and that contains all the URL routes & responses the server operates by, is called "APIServer.py". The web framework used for the server is a Python library called "Bottle". The web address & port of the server are also defined in this script and can be modified there if necessary.
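
The routes in "APIServer.py" are not listed in this notebook, so the following is only a rough sketch of how a Bottle server of this kind could be put together. The "/records" route, the response shape, and the names `rows_to_payload`, `build_app` and `fetch_rows` are all hypothetical; the host and port in the launch comment are placeholders.

```python
def rows_to_payload(rows):
    """Wrap database rows in a JSON-friendly dictionary (assumed response shape)."""
    return {"itemCount": len(rows), "items": rows}


def build_app(fetch_rows):
    """Create a Bottle app with one route; fetch_rows is a caller-supplied function."""
    # Imported lazily; requires the third-party "bottle" package.
    from bottle import Bottle, request

    app = Bottle()

    @app.route("/records")
    def records():
        # Pass URL query parameters (e.g. ?genus=panthera) to the data-access function.
        filters = dict(request.query)
        # Bottle serializes a returned dictionary to JSON automatically.
        return rows_to_payload(fetch_rows(filters))

    return app


# Example launch (placeholder host and port):
# build_app(lambda filters: []).run(host="localhost", port=8080)
```

Passing the data-access function in as an argument keeps the routing code separate from the database code, so the routes can be exercised without a running PostgreSQL instance.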

#### Using the API to retrieve data from the local database
