# Calling the SHARE API
----
Here are some working examples of how to query the current SHARE database for individual results, metrics, and statistics.

These particular queries are just examples, and the data is open for anyone to use, so feel free to make your own and experiment!

First, we'll need a URL to access the SHARE Search API.

In [1]:
SHARE_API = 'https://staging-share.osf.io/api/search/abstractcreativework/_search'

## The SHARE Search Schema

The SHARE search API is built on a tool called elasticsearch. It lets you search a subset of SHARE's normalized metadata in a simple format.

Here are the fields available in SHARE's elasticsearch endpoint:

    - 'title'
    - 'language'
    - 'subject'
    - 'description'
    - 'date'
    - 'date_created'
    - 'date_modified'
    - 'date_updated'
    - 'date_published'
    - 'tags'
    - 'links'
    - 'awards'
    - 'venues'
    - 'sources'
    - 'contributors'

You can see a formatted version of the base results from the API by visiting the [SHARE Search API URL](https://staging-share.osf.io/api/search/abstractcreativework/_search).

### Service Names for Reference
----
Each provider that SHARE harvests from has a specific internal name. Let's make an API call to generate a table of all of those "internal" names, along with the official name of the repository that each one represents.

The SHARE API has different endpoints. One of those endpoints returns a list of all of the providers that SHARE is harvesting from, along with their internal names, official names, links to their homepages, and a simple version of an icon representing their service, in a parsable format called json.

Let's make a call to that API endpoint using the requests library, get the JSON data, and print out all of the shortnames and longnames.

In [2]:
import requests
import json

SHARE_PROVIDERS = 'https://staging-share.osf.io/api/providers/'

data = requests.get(SHARE_PROVIDERS).json()
 
print('Here are the first 10 Providers:')
for source in data['results']:
    print(
        '{}\n{}\n{}\n'.format(
            source['long_title'],
            source['home_page'],
            source['provider_name']
        )
    )

Here are the first 10 Providers:
Research Online @ University of Wollongong
http://ro.uow.edu.au
au.uow

Ghent University Academic Bibliography
https://biblio.ugent.be/
be.ghent

Pontifical Catholic University of Rio de Janeiro
http://www.maxwell.vrac.puc-rio.br
br.pcurio

Lake Winnipeg Basin Information Network
http://130.179.67.140
ca.lwbin

PAPYRUS - Dépôt institutionnel de l'Université de Montréal
http://papyrus.bib.umontreal.ca
ca.umontreal

Western University
http://ir.lib.uwo.ca
ca.uwo

BioMed Central
http://www.springer.com/us/
com.biomedcentral

Social Science Research Network
http://papers.ssrn.com/
com.dailyssrn

figshare
https://figshare.com/
com.figshare

Nature Publishing Group
http://www.nature.com/
com.nature
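
Once the provider list is loaded, it can be handy to keep the short and long names in a lookup table. The sketch below assumes each record carries the `provider_name` and `long_title` keys printed above; `provider_name_table` is a hypothetical helper, and the sample records are copied from the output rather than fetched live.

```python
def provider_name_table(records):
    """Map each provider's internal short name to its official long name."""
    return {rec['provider_name']: rec['long_title'] for rec in records}


# Sample records copied from the output above (not a live API call).
sample_records = [
    {
        'long_title': 'figshare',
        'home_page': 'https://figshare.com/',
        'provider_name': 'com.figshare',
    },
    {
        'long_title': 'Western University',
        'home_page': 'http://ir.lib.uwo.ca',
        'provider_name': 'ca.uwo',
    },
]

names = provider_name_table(sample_records)
print(names['ca.uwo'])
```

With live data, the same helper could be called on `data['results']` from the cell above.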



#### SHARE Schema

You can make queries against any of the fields defined in the [SHARE Schema](https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml). If we were able to harvest the information from the original source, it should appear in SHARE. However, not all fields are required for every document. 

Required fields include:
- title
- contributors
- uris
- providerUpdatedDateTime

We add some information after each document is harvested inside the field shareProperties, including:
- source (where the document was originally harvested)
- docID  (a unique identifier for that object from that source)

These two fields can be combined to make a unique document identifier.
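
As a minimal sketch of that combination: the helper name `make_document_id`, the `|` separator, and the example values are all assumptions made for illustration, not a SHARE convention.

```python
def make_document_id(document, separator='|'):
    """Join source and docID from shareProperties into one identifier.

    The separator is an assumption for this sketch, not a SHARE convention.
    """
    props = document['shareProperties']
    return '{}{}{}'.format(props['source'], separator, props['docID'])


# A made-up document containing only the fields used here.
example = {'shareProperties': {'source': 'figshare', 'docID': '12345'}}
print(make_document_id(example))
```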

## Simple Queries

Let's get the first 3 results from the most basic query - the first page of the most recently updated research release events in SHARE.

We'll use the URL parsing library furl to keep track of all of our arguments to the URL, because we'll be modifying them as we go along. We'll print the URL as we go to take a look at it, so we know what we're requesting.

We'll print out each result's title and the sources where it appears.

In [3]:
import furl

search_url = furl.furl(SHARE_API)
search_url.args['size'] = 3
recent_results = requests.get(search_url.url).json()

recent_results = recent_results['hits']['hits']

recent_results

print('The request URL is {}'.format(search_url.url))
print('----------')
for result in recent_results:
    print(
        '{} -- from {}'.format(
            result['_source']['title'],
            result['_source']['sources']
        )
    )

The request URL is https://staging-share.osf.io/api/search/abstractcreativework/_search?size=3
----------
LEDAPS corrected Landsat Enhanced Thematic Mapper image data for Shortgrass Steppe collected on 2011-06-17 -- from ['providers.org.datacite']
Test entry from ezid service for identifier: doi:10.6085//TEST/20152611351448160.0758426051207266 -- from ['providers.org.datacite']
LEDAPS corrected Landsat Enhanced Thematic Mapper image data for Shortgrass Steppe collected on 1990-04-20 -- from ['providers.org.datacite']
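
If furl is not available, the same kind of request URL can be put together with the standard library. This is only a sketch: `build_search_url` is a hypothetical helper, and the endpoint is repeated so the snippet stands on its own.

```python
from urllib.parse import urlencode

# Same endpoint as SHARE_API, repeated so this sketch is self-contained.
SEARCH_URL = 'https://staging-share.osf.io/api/search/abstractcreativework/_search'


def build_search_url(base_url, **params):
    """Append URL-encoded query arguments to base_url."""
    if not params:
        return base_url
    # safe=':' leaves the colon in "field:value" queries unescaped.
    return '{}?{}'.format(base_url, urlencode(params, safe=':'))


print(build_search_url(SEARCH_URL, size=3))
```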


Now let's limit that query to only documents mentioning "giraffes" somewhere in the title, description, or in any of the metadata. We'd do that by adding a query search term.

In [4]:
search_url.args['q'] = 'giraffes'
recent_results = requests.get(search_url.url).json()

recent_results = recent_results['hits']['hits']

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results:
    print(
        '{} -- from {}'.format(
            result['_source']['title'],
            result['_source']['sources']
        )
    )

The request URL is https://staging-share.osf.io/api/search/abstractcreativework/_search?size=3&q=giraffes
---------
Genome reveals why giraffes have long necks -- from ['providers.org.crossref']
Odd creature was ancient ancestor of today’s giraffes -- from ['providers.org.crossref']
Genome reveals why giraffes have long necks -- from ['providers.com.nature']


Let's search for documents from the source crossref.

In [5]:
search_url.args['q'] = 'sources:providers.org.crossref'
recent_results = requests.get(search_url.url).json()

recent_results = recent_results['hits']['hits']

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results:
    print(
        '{} -- from {}'.format(
            result['_source']['title'],
            result['_source']['sources']
        )
    )

The request URL is https://staging-share.osf.io/api/search/abstractcreativework/_search?size=3&q=sources:providers.org.crossref
---------
Communicating Accessibility Resources Benefits Everyone -- from ['providers.org.crossref']
The Devil Is in the Details -- from ['providers.org.crossref']
Progression of coronary artery calcification by cardiac computed tomography -- from ['providers.org.crossref']


Let's combine the two and find documents from crossref that mention giraffes.

In [6]:
search_url.args['q'] = 'sources:providers.org.crossref AND giraffes'
recent_results = requests.get(search_url.url).json()

recent_results = recent_results['hits']['hits']

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results:
    print(
        '{} -- from {}'.format(
            result['_source']['title'],
            result['_source']['sources']
        )
    )

The request URL is https://staging-share.osf.io/api/search/abstractcreativework/_search?size=3&q=sources:providers.org.crossref+AND+giraffes
---------
Genome reveals why giraffes have long necks -- from ['providers.org.crossref']
Odd creature was ancient ancestor of today’s giraffes -- from ['providers.org.crossref']
Of Caucasians, Asians, and Giraffes: The Influence of Categorization and Target Valence on Social Projection -- from ['providers.org.crossref']
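
Query strings like the ones above follow a simple pattern: `field:value` restricts a term to one field, and `AND` requires every clause to match. A small sketch of composing them in code, where `field_term` and `all_of` are hypothetical helpers and no escaping of special characters is attempted:

```python
def field_term(field, value):
    """Build a field-restricted term such as sources:providers.org.crossref."""
    return '{}:{}'.format(field, value)


def all_of(*clauses):
    """Join clauses with AND so that every clause must match."""
    return ' AND '.join(clauses)


query = all_of(field_term('sources', 'providers.org.crossref'), 'giraffes')
print(query)
```

The resulting string could then be assigned to `search_url.args['q']` as in the cells above.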


## Complex Queries
The SHARE Search API runs on elasticsearch - meaning that it can accept complicated queries that give you a wide variety of information.

Here are some examples of how to make more complex queries using the raw elasticsearch results. You can read a [lot more about elasticsearch queries here](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

In [7]:
search_url.args = None  # reset the args so that we remove our old query arguments.
search_url.url # Show the URL that we'll be requesting to make sure the args were cleared

'https://staging-share.osf.io/api/search/abstractcreativework/_search'

### Query Setup

We can define a few functions that we can reuse to make querying simpler. Elasticsearch queries are passed through as json blobs specifying how to return the information you want.

In [8]:
import json

def query_share(url, query):
    # A helper function that will use the requests library,
    # pass along the correct headers,
    # and make the query we want
    headers = {'Content-Type': 'application/json'}
    data = json.dumps(query)
    return requests.post(url, headers=headers, data=data).json()
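
To inspect what such a request carries without sending anything, the following sketch builds an equivalent POST using only the standard library. `build_search_request` is a hypothetical helper, and the endpoint is repeated so the snippet is self-contained; it is not a replacement for `query_share`.

```python
import json
from urllib.request import Request

# Same endpoint as SHARE_API, repeated so this sketch is self-contained.
SEARCH_URL = 'https://staging-share.osf.io/api/search/abstractcreativework/_search'


def build_search_request(url, query):
    """Prepare (but do not send) a POST request carrying query as JSON."""
    body = json.dumps(query).encode('utf-8')
    headers = {'Content-Type': 'application/json'}
    return Request(url, data=body, headers=headers, method='POST')


request = build_search_request(SEARCH_URL, {'query': {'exists': {'field': 'tags'}}})
print(request.get_method(), request.full_url)
print(request.data.decode('utf-8'))
```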


### Some Queries
The SHARE schema has many spots for information, and many of the original sources do not provide this information. We can do a query to find out if a certain field exists or not within certain records. The SHARE API is set up to show an empty list if the field is empty.

Let's query for the counts of documents that have content in their tags field.

In [9]:
tags_query = {
    "query": {
        "exists": {
            "field": "tags"
        }
    }
}


missing_tags_query = {
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "tags"
                }
            }
        }      
    }
}
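
The two query bodies above differ only in the field name and in whether the exists clause is negated, so they can be generated for any field. A sketch, with `exists_query` and `missing_query` as hypothetical helper names:

```python
def exists_query(field):
    """Query body matching documents that have a value for field."""
    return {'query': {'exists': {'field': field}}}


def missing_query(field):
    """Query body matching documents that have no value for field."""
    return {
        'query': {
            'bool': {
                'must_not': {'exists': {'field': field}}
            }
        }
    }


print(exists_query('tags'))
print(missing_query('tags'))
```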

In [10]:
with_tags = query_share(search_url.url, tags_query)
missing_tags = query_share(search_url.url, missing_tags_query)

total_results = requests.get(search_url.url).json()['hits']['total']

with_tags_percent = (float(with_tags['hits']['total'])/total_results)*100
missing_tags_percent = (float(missing_tags['hits']['total'])/total_results)*100


print(
    '{} results out of {}, or {}%, have tags.'.format(
        with_tags['hits']['total'],
        total_results,
        format(with_tags_percent, '.2f')
    )
)

print(
    '{} results out of {}, or {}%, do NOT have tags.'.format(
        missing_tags['hits']['total'],
        total_results,
        format(missing_tags_percent, '.2f')
    )
)

print('------------')
print('As a little sanity check....')
print('{} + {} = {}%'.format(with_tags_percent, missing_tags_percent, format(with_tags_percent + missing_tags_percent, '.2f')))

2443294 results out of 4914457, or 49.72%, have tags.
2471163 results out of 4914457, or 50.28%, do NOT have tags.
------------
As a little sanity check....
49.71645901062925 + 50.28354098937074 = 100.00%
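
The percentage arithmetic used above can be wrapped in a small function that also guards against an empty index. This is a sketch with a hypothetical helper name, `percent_of`, and the counts are copied from the output above.

```python
def percent_of(part, total):
    """Return part as a percentage of total, or 0.0 when total is zero."""
    if total == 0:
        return 0.0
    return float(part) / total * 100


# Counts copied from the output above.
with_tags_count = 2443294
total_count = 4914457
print('{:.2f}%'.format(percent_of(with_tags_count, total_count)))
```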


## Using SHAREPA for SHARE Parsing and Analysis

While you can always pass raw elasticsearch queries to the SHARE API, there is also a pip-installable python library that you can use that makes elasticsearch aggregations a little simpler. This library is called [sharepa - short for SHARE Parsing and Analysis](https://github.com/CenterForOpenScience/sharepa#sharepa)

### Basic Actions

A basic search will provide access to all documents in SHARE in 10-document slices.

#### Count
You can use sharepa and the basic search to get the total number of documents in SHARE

In [11]:
from sharepa import basic_search

basic_search.count()

4914457

#### Iterating Through Results
Executing the basic search will send the actual basic query to the SHARE API and then let you iterate through results, 10 at a time.

In [12]:
results = basic_search.execute()

for hit in results:
    print(hit.title)

LEDAPS corrected Landsat Enhanced Thematic Mapper image data for Shortgrass Steppe collected on 2011-06-17
Test entry from ezid service for identifier: doi:10.6085//TEST/20152611351448160.0758426051207266
LEDAPS corrected Landsat Enhanced Thematic Mapper image data for Shortgrass Steppe collected on 1990-04-20
Chemical composition of essential oils of three Pistacia cultivars in Khorasan Razavi, Iran
Test entry from ezid service for identifier: doi:10.6085//TEST/20152611351448190.5563383046761334
Test entry from ezid service for identifier: doi:10.6085//TEST/20152611351448200.9788127569002799
Test entry from ezid service for identifier: doi:10.6085//TEST/20152611351448150.3484667860365027
Compiled Tree-ring Dates from the Southwestern United States (Unrestricted)
None
Area-based Amino Acid Composition for three types of interactions in the BNCP-CS dataset


If we don't want 10 results, or we want to offset the results, we can use slices

In [13]:
results = basic_search[20:25].execute()
for hit in results:
    print(hit.sources)

['providers.org.datacite']
['providers.org.datacite']
['providers.org.datacite']
['providers.org.datacite']
['providers.org.datacite']
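
Assuming sharepa's search object follows the elasticsearch-dsl slicing convention, a slice such as `[20:25]` is translated into Elasticsearch's `from` and `size` paging arguments. A minimal sketch of that arithmetic, where `slice_to_paging` is a hypothetical helper and not part of sharepa:

```python
def slice_to_paging(start, stop):
    """Translate a [start:stop] slice into Elasticsearch paging arguments."""
    if start < 0 or stop < start:
        raise ValueError('expected 0 <= start <= stop')
    # "from" is the offset of the first hit, "size" the number of hits.
    return {'from': start, 'size': stop - start}


print(slice_to_paging(20, 25))
```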


### Advanced Search with sharepa

You can make your own search object, which allows you to pass in custom queries for certain terms or SHARE fields. Queries are formed using [lucene query syntax](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax), just like we used in the above examples.

The query below is an exists query, which takes the name of a field. Other options include a query_string, a match query, a multi-match query, a bool query, and any other query structure available in the elasticsearch API.

We can see the query that we're about to send to elasticsearch by using the pretty print helper function. You'll see that it looks very similar to the queries we defined by hand earlier.

In [14]:
from sharepa import ShareSearch
from sharepa.helpers import pretty_print

my_search = ShareSearch()

my_search = my_search.query(
    'exists', # Type of query; an exists query matches documents that have a value for a field
    field='tags', # The field to check, so this finds all documents that have tags
)

pretty_print(my_search.to_dict())

{
    "query": {
        "exists": {
            "field": "tags"
        }
    }
}


When you execute that query, you can then iterate through the results the same way that you could with the simple search query.

In [15]:
new_results = my_search.execute()
for hit in new_results:
    print(hit.tags)

['CDL.LTERNET', 'CDL', 'dataPackage', 'Dataset']
['CDL.PISCO', 'CDL']
['CDL.LTERNET', 'CDL', 'dataPackage', 'Dataset']
['CDL.DIGSCI', 'CDL', 'Paper', 'Dataset']
['CDL.PISCO', 'CDL']
['CDL.PISCO', 'CDL']
['CDL.PISCO', 'CDL']
['CDL.DIGANT', 'CDL', 'Dataset']
['TIB.R-GATE', 'TIB']
['CDL.DIGSCI', 'CDL', 'Image']


## Debugging and Problem Solving

Not everything always goes as planned when querying an unfamiliar API. Here are some debugging and problem solving strategies for when you're querying the SHARE API.

### Schema issues
The SHARE schema has a lot of parts, and much of the information is nested within sections. Making a query isn't always as straightforward as you might think if you're not looking in the right part of the schema.

Let's say you were trying to query for all SHARE documents that specify the language as not being in English.

We'll guess as to what that query might be, and try to make it using sharepa.

In [16]:
language_search = ShareSearch()

language_search = language_search.query(
    'query_string', # Type of query, will accept a lucene query string
    query='NOT languages=english', # A first guess at a lucene query string for documents that are not in English
)

In [17]:
results = language_search.execute()

for hit in results:
    print(hit.languages)

AttributeError: 'Result' object has no attribute 'languages'

So the result does not have an attribute called languages! Let's try to figure out what went wrong here.

One likely cause is that we are searching for documents that do NOT match a given parameter. Since languages is not a required field, the query returns results that do not include a languages field at all!
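
While investigating, one defensive way to read an optional field is `getattr` with a default, which avoids the AttributeError. The sketch below uses stand-in objects built with `SimpleNamespace` rather than real sharepa results, and `field_or_default` is a hypothetical helper; it assumes real results raise AttributeError for missing fields, as the error above suggests.

```python
from types import SimpleNamespace


def field_or_default(hit, name, default=None):
    """Read an optional attribute from a result, falling back to default."""
    return getattr(hit, name, default)


# Stand-in results: one with a language field, one without.
hits = [
    SimpleNamespace(title='With language', language=['eng']),
    SimpleNamespace(title='No language'),
]

for hit in hits:
    print(hit.title, field_or_default(hit, 'language', default=[]))
```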

So let's fix this up a bit to make sure that we're querying for items that specify language in the first place.

In [None]:
language_search = ShareSearch()

language_search = language_search.filter(
    'exists',
    field="language"
)

language_search.count()

In [None]:
results = language_search.execute()

# Let's see how many documents have language results.
print('There are {} documents with languages specified'.format(language_search.count()))

print('Here are the languages for the first 10 results:')

# Check out the first few results
for hit in results:
    print(hit.language)

So now we're better equipped to add on to this filter, and then narrow down to results that are not in English.

When we printed out the first few results, we might have noticed a second problem with our query -- going back to the [SHARE Schema](https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml), we might notice that there is a restriction on how languages are captured - as a three-letter lowercase representation. Instead of "english", let's look for the three-letter abbreviation - "eng".

We can modify our new and improved language query by adding another query to the language_search we already started. We'll use the elasticsearch query object Q, invert it with a ~ symbol, and search for the term "eng."

In [None]:
from elasticsearch_dsl import Q

language_search = language_search.query(~Q("term", language="eng"))

results = language_search.execute()

# Let's see how many documents have language results that aren't eng
print('There are {} documents that do not have "eng" listed.'.format(language_search.count()))

print('Here are the languages for the first 10 results:')

# Check out the first few results, make sure "eng" isn't in there
for hit in results:
    print(hit.language)
    print(hit.title)
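
For comparison with the raw queries written by hand earlier, here is a sketch of a roughly equivalent query body: keep documents where language exists, and exclude those whose language is "eng". It is hand-written under the assumption of an Elasticsearch version that supports the bool query's filter clause, and it may differ in detail from what `language_search.to_dict()` produces. `non_matching_language_query` is a hypothetical helper.

```python
import json


def non_matching_language_query(language_code):
    """Query body for documents that specify a language other than language_code."""
    return {
        'query': {
            'bool': {
                # Only consider documents that have a language field at all.
                'filter': [{'exists': {'field': 'language'}}],
                # Exclude documents whose language matches the given code.
                'must_not': [{'term': {'language': language_code}}],
            }
        }
    }


print(json.dumps(non_matching_language_query('eng'), indent=4))
```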