The API search function

pzermoglio edited this page Jan 15, 2017 · 57 revisions
  1. Introduction
  2. Search query strings
    1. Global full text search
    2. Field-specific keyword search
      1. Searchable fields
      2. Some querying examples
    3. Spatial search
  3. Limiting the number of records returned
  4. Retrieving large result sets

Introduction

The search function of the VertNet API provides a simple way to access VertNet data programmatically. With appropriate API requests, you can easily automate searching for and retrieving custom data sets from VertNet.

The base URL for API search requests is http://api.vertnet-portal.appspot.com/api/search?q=query_object, where query_object is a JSON object specifying the request parameters. Request objects can include the following properties:

  • q: query string
  • l: maximum number of records to return per query. Performance depends heavily on this value as described in Stucky 2013. Optimum and default value is 400. (OPTIONAL)
  • c: search cursor for paging over multiple results (OPTIONAL)

Search requests to the API return a JSON response object that includes the following properties:

  • recs: a list of records including Darwin Core fields, data set metadata fields, and VertNet-specific index fields
  • cursor: cursor string to use if there are more records to page over
  • api_version: the version of the API source code used in the request
  • query_version: the version of the query source code (search.py) used in the request
  • limit: the maximum number of records to return in a response
  • request_date: the UTC datetime when the request was submitted
  • request_origin: the latitude,longitude of the source of the request
  • response_records: the total number of records in the current response
  • matching_records: the total number of records that match the query

As a simple example, the following request searches for Noturus placidus (the threatened Neosho madtom catfish):

http://api.vertnet-portal.appspot.com/api/search?q={"q":"noturus placidus"}

This document discusses each possible property of the search request object ("q", "l", and "c") in detail, beginning with the query string.
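When you call the API from a script rather than a browser, the JSON request object has to be percent-encoded before it is appended to the base URL. The following is a minimal sketch using only the Python standard library; `build_search_url` is a hypothetical helper name chosen for illustration, not part of the VertNet API.

```python
import json
from urllib.parse import quote, unquote

BASE_URL = "http://api.vertnet-portal.appspot.com/api/search"

def build_search_url(query, limit=None, cursor=None):
    """Return a search URL whose q parameter is the JSON request object."""
    request = {"q": query}
    if limit is not None:
        request["l"] = limit      # optional: maximum records per response
    if cursor is not None:
        request["c"] = cursor     # optional: cursor from a previous response
    # Compact separators keep the URL short; quote() percent-encodes the JSON.
    payload = json.dumps(request, separators=(",", ":"))
    return BASE_URL + "?q=" + quote(payload)

url = build_search_url("noturus placidus")
print(url)           # the encoded form to send
print(unquote(url))  # the readable form used in this document
```

Decoding the result with `unquote` gives back the readable form shown in the examples throughout this page.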

Search query strings

The search query is specified as the value of the "q" property of the JSON API request object (e.g., "noturus placidus" in the example above). A search query is just a string that contains at most 2000 Unicode characters. Searches are case insensitive in terms of the content they match, but the Boolean operators AND, OR, and NOT (discussed below) must be written in upper case and the term names must be in all lower case (e.g., "basisofrecord"). Looking for punctuated content is a little tricky: you have to enclose the exact string you are looking for in escaped quotes. For example, the following API search looks for any records that contain the exact string urn:occurrence:Arctos:CUMV:Amph:10008:2243441:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"iptrecordid:\"urn:occurrence:Arctos:CUMV:Amph:10008:2243441\""}
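If the request object is produced by a JSON serializer, the escaping of the inner quotes happens automatically. The sketch below assumes Python's `json` module; `exact_term` is a hypothetical helper name, and it does not handle values that themselves contain double quotes.

```python
import json

def exact_term(field, value):
    """Wrap value in double quotes so punctuated content matches as one exact string."""
    # Term names must be lower case; the quoted value is left untouched.
    return f'{field.lower()}:"{value}"'

query = exact_term("iptrecordid", "urn:occurrence:Arctos:CUMV:Amph:10008:2243441")

# json.dumps escapes the inner quotes as \" when it writes the request object.
request_json = json.dumps({"q": query}, separators=(",", ":"))
print(request_json)
```

The printed request object matches the one in the example URL above, backslashes included.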

Queries can be used for global full text search, field keyword search, and spatial search. This document gives a good introduction to each of these. If you would like to learn more, you will want to read the official documentation for query strings from Google.

Global full text search

This is the simplest search option. It provides a basic keyword search that looks for matching text anywhere in a record. The sample search above for "noturus placidus" is an example of a global full text search. As a slightly more complex example, to search for all records that contain "mvz" (the abbreviation for the Museum of Vertebrate Zoology), "gymnogyps" (the genus of the rare California condor), and "california" anywhere in the record, you could use this query:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"mvz gymnogyps california"}

Query strings can use the Boolean operators AND, OR, and NOT (they must be written in upper case). NOT should always appear before the value it modifies, while AND and OR should be used between values. If multiple search keywords are provided but no Boolean operators are specified, AND is used by default. Thus, the previous query is essentially the same as

http://api.vertnet-portal.appspot.com/api/search?q={"q":"mvz AND gymnogyps AND california"}

which means that all three search terms ("mvz", "gymnogyps", and "california") must occur in a record for it to match the search query. If you try running these two versions of the query, you will see that they produce the same results.
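If you prefer to spell the operator out in generated queries, the rewrite from the implicit to the explicit form is a simple join. This is only an illustrative sketch; `all_of` is a hypothetical helper name.

```python
def all_of(*keywords):
    """Join keywords with an explicit AND, the operator the API applies by default."""
    return " AND ".join(keywords)

implicit = "mvz gymnogyps california"
explicit = all_of(*implicit.split())
print(explicit)  # mvz AND gymnogyps AND california
```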

Field-specific keyword search

Searchable fields

You can also limit keyword searches to match specific values of particular terms. To do so, provide the name of the term immediately before the search text, separated from it by a colon (":"); examples follow the list of terms. The following terms (with non-Darwin Core terms in italics) are indexed and available for searching:

Record-Level

  • institutioncode
  • collectioncode
  • catalognumber
  • dctype (dcterms:type)
  • license (dcterms:license)
  • iptlicense (eml:intellectualRights)
  • haslicense (dcterms:license or eml:intellectualRights has a license designated) {'0','1'}
  • basisofrecord {PreservedSpecimen, FossilSpecimen, MaterialSample, Occurrence, MachineObservation, HumanObservation}
  • isfossil (dwc:basisOfRecord is FossilSpecimen or collection is a paleo collection) {'0','1'}
  • hasmedia (has dwc:associatedMedia) {'0','1'}

Occurrence

  • iptrecordid (same as dwc:occurrenceID)
  • recordedby
  • recordnumber
  • fieldnumber
  • establishmentmeans
  • wascaptive (dwc:establishmentMeans or occurrenceRemarks suggests it was captive) {'0','1'}
  • wasinvasive (was the organism recorded to be invasive where and when it occurred) {'0','1'}
  • sex (standardized sex from original sex field or extracted from elsewhere in the record)
  • lifestage (lifeStage from original lifeStage field or extracted from elsewhere in the record)
  • preparations
  • hastissue (has dwc:preparations that suggests tissue is available) {'0','1'}
  • reproductivecondition

Event (for year, month, day, see below)

  • eventdate
  • year
  • month
  • day
  • startdayofyear
  • enddayofyear

Location

  • continent
  • country
  • stateprovince
  • county
  • municipality
  • island
  • islandgroup
  • waterbody
  • locality
  • geodeticdatum
  • georeferencedby
  • georeferenceverificationstatus
  • location (a Google GeoField of the dwc:decimalLatitude, dwc:decimalLongitude)
  • mappable (has valid dwc:decimalLatitude, dwc:decimalLongitude) {'0','1'}

Geological Context

  • bed
  • formation
  • group
  • member

Identification

  • typestatus
  • hastypestatus (dwc:typeStatus is populated) {'0','1'}

Taxon

  • kingdom
  • phylum
  • class
  • order
  • family
  • genus
  • specificepithet
  • infraspecificepithet
  • scientificname
  • vernacularname

Trait

  • haslength (was a value for length extracted?) {'0','1'}
  • hasmass (was a value for mass extracted?) {'0','1'}
  • hassex (does the record have sex?) {'0','1'}
  • haslifestage (does the record have life stage?) {'0','1'}
  • lengthtype (type of length measurement extracted from the record, can refer to a number or to a range) {'total length', 'standard length', 'snout-vent length','head-body length', 'fork length', 'total length range', 'standard length range', 'snout-vent length range','head-body length range', 'fork length range'}
  • lengthinmm (length measurement extracted from the record) {number}
  • massing (mass measurement extracted from the record) {number}

For detailed information about trait extraction and aggregation and querying via the VertNet portal, see http://vertnet.org/resources/traitsguide.html.

Data Set

  • gbifdatasetid (GBIF identifier for the data set)
  • gbifpublisherid (GBIF identifier for the data publishing organization)
  • lastindexed (date the record was most recently indexed into VertNet) {'YYYY-MM-DD'}
  • networks {MaNIS, ORNIS, HerpNET, FishNet, VertNet, Arctos, Paleo}
  • migrator (the version of the migrator used to process the data set) {'YYYY-MM-DD'}
  • orgcountry (the country where the organization is located)
  • orgstateprovince (the first-level administrative unit where the organization is located)

Index

Some querying examples

For example, suppose we want to find records of the black-footed ferret, Mustela nigripes, by explicitly searching for its scientific name. We could use this query:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"genus:mustela specificepithet:nigripes"}

Or, if we already know the globally unique identifier for an occurrence record (iptrecordid), we could use this query:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"iptrecordid:7108667e-1483-4d04-b204-6a44a73a5219"}

Or, to search for more than one iptrecordid at a time, string the list together with ' OR ' such as in the following example:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"iptrecordid:7108667e-1483-4d04-b204-6a44a73a5219 OR iptrecordid:1efe900e-bde2-45e7-9747-2b2c3e5f36c3"}

Number fields can be searched using less than/greater than comparison operators ("<", "<=", ">", ">=") in addition to the colon (which is equivalent to "=").

Now, let's put together many of the ideas we've discussed so far by using them to build a relatively complex query. Suppose we want to search for records of the black-footed ferret from either Colorado or Kansas, and we don't want any records from before the 20th century. This query will give us the data we want:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"genus:Mustela specificepithet:nigripes stateprovince:(colorado OR kansas) year>=1900"}

Note the use of parentheses to group together the two possible values for the "stateprovince" field.
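Queries like this one can be assembled from small pieces. The sketch below is one possible way to do it in Python; `field` and `compare` are hypothetical helper names, and the set of operators is the one listed above for number fields.

```python
def field(name, *values):
    """Return name:value, or name:(v1 OR v2 ...) when several values are acceptable."""
    name = name.lower()  # term names must be written in lower case
    if len(values) == 1:
        return f"{name}:{values[0]}"
    return f"{name}:(" + " OR ".join(values) + ")"

def compare(name, op, number):
    """Return a numeric comparison such as year>=1900."""
    if op not in ("<", "<=", ">", ">=", "="):
        raise ValueError(f"unsupported operator: {op}")
    return f"{name.lower()}{op}{number}"

# Terms separated by spaces are combined with AND by default.
query = " ".join([
    field("genus", "Mustela"),
    field("specificepithet", "nigripes"),
    field("stateprovince", "colorado", "kansas"),
    compare("year", ">=", 1900),
])
print(query)
```

The result is the same query string used in the ferret example above.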

We've also indexed some non-Darwin Core terms to allow you to find records with media, with tissues, or that are mappable. These fields are called "hasmedia", "hastissue", and "mappable", respectively. These fields can be thought of as Boolean properties, with a value of 1 for true and 0 for false. Thus, to find records for the black-footed ferret that are mappable, you could use this query:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"genus:mustela specificepithet:nigripes mappable:1"}

Note that in this example we used "mustela" all in lower case, while in the previous example we wrote it capitalized. The use of both is possible, as search is case insensitive.

If we wanted to look for trait information, we could combine different kinds of properties: Booleans such as “has mass”, strings such as “length type”, and numbers such as “length in mm”. If we were looking, for example, for records of the black-footed ferret that have mass data and whose total length is between 450 and 500 mm, the following query would return the data we are seeking:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"genus:mustela specificepithet:nigripes hasmass:1 lengthtype:'total length' lengthinmm>=450 lengthinmm<=500"}

Similarly, we could look for those records that also have sex data and a particular mass value (for example, 450 g) by using this query:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"genus:mustela specificepithet:nigripes hassex:1 lengthtype:'total length' lengthinmm>=450 lengthinmm<=500 massing=450"}
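A numeric range in these trait queries is simply two comparisons on the same field. As a small sketch (with `between` as a hypothetical helper name), the length range could be generated like this:

```python
def between(name, low, high):
    """Return a closed numeric range as two comparisons on the same field."""
    if low > high:
        raise ValueError("low must not exceed high")
    return f"{name}>={low} {name}<={high}"

terms = [
    "genus:mustela",
    "specificepithet:nigripes",
    "hasmass:1",                      # Boolean field: 1 means true
    "lengthtype:'total length'",      # multi-word value kept together by quotes
    between("lengthinmm", 450, 500),  # numeric range
]
query = " ".join(terms)
print(query)
```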

All these examples help us understand how to retrieve records that contain certain data. But, how do we exclude records that have particular information? For example, suppose we want to search for all Mustela records from the state of Virginia, United States. First, we could try the following query:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"genus:(mustela) stateprovince:(virginia)"}

Now, if we take a close look at the records that we get, we will note that we are getting more than we want: the data retrieved includes records with stateprovince: Virginia as well as records with stateprovince: West Virginia. To exclude those from West Virginia, we can try the following:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"genus:(mustela) stateprovince:(virginia) AND NOT stateprovince:(west virginia)"}

If, conversely, we want only the records from West Virginia, we can simply run the following query:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"genus:(mustela) stateprovince:(west virginia)"}

Note that, of these three queries, the numbers of records retrieved by the last two should add up to the number retrieved by the first (all “virginia” = “virginia” alone + “west virginia”). Keep in mind that a response returns at most 400 records by default (see the section “Limiting the number of records returned”), so if your initial search matches more than 400 records, the sum will not come out correctly.
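The three Virginia queries, and the consistency check between them, can be sketched as follows. The helper names `excluding` and `counts_add_up` are hypothetical. The check is meant to be run on the "matching_records" value of each response rather than on the number of records actually returned, which is capped by the limit; it assumes that value is an integer.

```python
def excluding(query, unwanted):
    """Append an AND NOT clause; NOT must come before the value it modifies."""
    return f"{query} AND NOT {unwanted}"

all_virginia = "genus:(mustela) stateprovince:(virginia)"
west_only = "genus:(mustela) stateprovince:(west virginia)"
virginia_only = excluding(all_virginia, "stateprovince:(west virginia)")
print(virginia_only)

def counts_add_up(total, *parts):
    """True if the counts of the partial queries sum to the count of the full query."""
    return total == sum(parts)
```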

Spatial search

This option allows you to search within a specified number of meters around a given spatial coordinate by using the "distance" operator. Here is an example query that searches for all records within 2 kilometers of the point 33.529, -105.694:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"distance(location,geopoint(33.529,-105.694))<2000"}

The "distance" operator returns the distance in meters between its two arguments, which in the example above are the value of the "location" field and the point 33.529,-105.694.
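A spatial query string can be generated from a coordinate and a radius. The sketch below uses a hypothetical helper named `within`; the range checks on latitude and longitude are an added safeguard, not something the API documentation above requires.

```python
def within(latitude, longitude, meters):
    """Return a query matching records closer than `meters` to the given point."""
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    # "location" is the indexed GeoField; distance() yields meters.
    return f"distance(location,geopoint({latitude},{longitude}))<{meters}"

query = within(33.529, -105.694, 2000)
print(query)  # distance(location,geopoint(33.529,-105.694))<2000
```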

Limiting the number of records returned

The optional "l" property of the API search request object allows you to specify the maximum number of records the request should return. The default value of "l" is 400. This default was chosen to optimize API performance under a variety of search scenarios, so it should work well most of the time. If you would like to use a custom value for "l", you can set it to any integer value from 1 to 1,000, inclusive.

For example, to retrieve only the first 20 records for Swainson's hawk, Buteo swainsoni, you could use this API call:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"buteo swainsoni","l":20}
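Since "l" accepts only integers from 1 to 1,000, a script can check the value before sending the request. This is a minimal sketch; `request_object` is a hypothetical helper name, and the default of 400 mirrors the API default described above.

```python
import json

def request_object(query, limit=400):
    """Return the JSON request object, rejecting limits outside 1 to 1,000."""
    is_int = isinstance(limit, int) and not isinstance(limit, bool)
    if not is_int or not 1 <= limit <= 1000:
        raise ValueError("l must be an integer from 1 to 1000")
    return json.dumps({"q": query, "l": limit}, separators=(",", ":"))

print(request_object("buteo swainsoni", 20))  # {"q":"buteo swainsoni","l":20}
```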

Retrieving large result sets

Most of the API search requests that we've looked at so far have returned relatively small result sets, such that we could retrieve all matching records in a single response object. As long as the total number of matching records is 1,000 or fewer, you can retrieve the entire result set with a single API request by using an appropriate value of the "l" request parameter.

The API request from the previous section, however, which searched for records of Swainson's hawk, matches more than 1,000 records. In this situation, we need to use multiple calls to the API to page through and retrieve all of the matching records. To see how this works, first try running the query for records of Swainson's hawk again:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"buteo swainsoni","l":20}

If all goes well, you should see the first 20 records, as expected, but the value of the "matching_records" property indicates that there are many more matching records waiting for retrieval (at the time of this writing, 1,863, to be exact). To retrieve the next 20 records, we need to use a cursor. Notice that the object returned by the query has a property called "cursor", with a very long string as its value. All we need to do is call the API search function again, using the exact same query string as before, but we also need to include the "c" property in the request JSON object and give it the value of the "cursor" string returned by the first API request. So our API request will look something like this:

http://api.vertnet-portal.appspot.com/api/search?q={"q":"buteo swainsoni","l":20,"c":"cursor_string"},

where "cursor_string" is the value of the "cursor" property of the original API response object. Because the value of the cursor string changes from request to request, you will have to construct this query yourself if you'd like to give it a try. Provided the cursor string is specified correctly, the API request will return a response with the next 20 records in the result set. This process is then repeated until all records are retrieved. How do you know when all matching records have been returned? If there are no more records waiting for retrieval, the value of the "cursor" property in the response object will be null.
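The paging procedure described above can be sketched as a loop that feeds each response's cursor into the next request. This is an illustrative sketch using the Python standard library, not an official client: `fetch_page` and `iter_records` are hypothetical names, and the code assumes the response carries the "recs" and "cursor" properties listed in the Introduction.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://api.vertnet-portal.appspot.com/api/search"

def fetch_page(request):
    """Send one request object to the API and return the decoded JSON response."""
    url = BASE_URL + "?q=" + quote(json.dumps(request))
    with urlopen(url) as response:
        return json.load(response)

def iter_records(query, limit=400, fetch=fetch_page):
    """Yield every matching record, following cursors until none is returned."""
    request = {"q": query, "l": limit}
    while True:
        page = fetch(request)
        yield from page.get("recs", [])
        cursor = page.get("cursor")
        if not cursor:             # a null cursor means the result set is exhausted
            break
        request["c"] = cursor      # same query string, plus the new cursor

# Example use (performs network requests when iterated):
# for record in iter_records("buteo swainsoni", limit=20):
#     print(record)
```

The `fetch` parameter exists so that the paging logic can be exercised without network access, for instance by passing a function that returns canned pages.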

To close this section, we should say a few words about the "matching_records" property of API search response objects. The value of "matching_records" is intended to be very accurate for small to medium-sized result sets, but for large result sets, it should be interpreted as an estimate only. Specifically, our testing indicates that for result sets of 10,000 records or fewer, "matching_records" is always correct. As the size of a result set grows beyond 10,000 records, "matching_records" becomes increasingly less accurate. For very large result sets, it should be considered nothing more than a rough guess.