# Calling the SHARE API
----
Here are some working examples of how to query the current scrAPI database for metrics on results coming through the SHARE Notification Service.

These particular queries are just examples, and the data is open for anyone to use, so feel free to make your own and experiment!


## Setup


### Service Names for Reference
----
Each provider that SHARE harvests from uses a shortened name for its source. Let's make an API call to get all of those short names, along with the official name of the repository that each one represents.

The SHARE API has different endpoints. One of those endpoints returns a list of all of the providers that SHARE is harvesting from, along with their short names, official names, links to their homepages, and a simple version of an icon representing their service, in a parsable format called JSON.

Let's make a call to that API endpoint using the requests library, get the JSON data, and print out all of the short names and long names.

In [1]:
import requests

data = requests.get('https://osf.io/api/v1/share/providers/').json()

for source in data['providerMap'].keys():
    print(
        '{} - {}'.format(
            data['providerMap'][source]['short_name'],
            data['providerMap'][source]['long_name'].encode('utf-8')
        )
    )


doepages - Department of Energy Pages
scholarsphere - ScholarSphere @ Penn State University
scholarsarchiveosu - ScholarsArchive@OSU
calhoun - Calhoun: Institutional Archive of the Naval Postgraduate School
scholarworks_umass - ScholarWorks@UMass Amherst
cambridge - Apollo @ University of Cambridge
texasstate - DSpace at Texas State University
osf - Open Science Framework
lwbin - Lake Winnipeg Basin Information Network
uow - Research Online @ University of Wollongong
oaktrust - The OAKTrust Digital Repository at Texas A&M
umich - Deep Blue @ University of Michigan
cogprints - Cognitive Sciences ePrint Archive
utktrace - Trace: Tennessee Research and Creative Exchange
stcloud - The repository at St Cloud State
smithsonian - Smithsonian Digital Repository
csuohio - Cleveland State University's EngagedScholarship@CSU
biomedcentral - BioMed Central
crossref - CrossRef
purdue - PURR - Purdue University Research Repository
unl_digitalcommons - DigitalCommons@University of Nebraska - Lincoln
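
For later lookups it can help to keep the short and long names in a plain dictionary. The following is a minimal sketch, assuming the response keeps the `providerMap` shape used above. The helper name `build_name_lookup` and the two-entry `sample` payload are illustrative and not part of the SHARE API.

```python
def build_name_lookup(provider_map):
    """Map each provider's short name to its official long name."""
    return {
        info['short_name']: info['long_name']
        for info in provider_map.values()
    }


# Hypothetical, trimmed-down payload in the same shape as data['providerMap'].
sample = {
    'osf': {'short_name': 'osf', 'long_name': 'Open Science Framework'},
    'crossref': {'short_name': 'crossref', 'long_name': 'CrossRef'},
}

lookup = build_name_lookup(sample)
for short_name in sorted(lookup):
    print('{} - {}'.format(short_name, lookup[short_name]))
```

With the real response, `build_name_lookup(data['providerMap'])` would give the same kind of dictionary for every provider.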


#### SHARE Schema

You can make queries against any of the fields defined in the [SHARE Schema](https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml). If we were able to harvest the information from the original source, it should appear in SHARE. However, not all fields are required for every document. 

Required fields include:
- title
- contributors
- uris
- providerUpdatedDateTime

We add some information after each document is harvested inside the field shareProperties, including:
- source (where the document was originally harvested)
- docID  (a unique identifier for that object from that source)

These two fields can be combined to make a unique document identifier.
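
As a rough illustration of combining those two fields, here is a small sketch. The function name `unique_document_id`, the `|` separator, and the sample `docID` value are all assumptions made for this example; SHARE does not prescribe a particular format.

```python
def unique_document_id(document, separator='|'):
    """Join shareProperties.source and shareProperties.docID into one key.

    The separator is an arbitrary choice for this sketch.
    """
    props = document['shareProperties']
    return '{}{}{}'.format(props['source'], separator, props['docID'])


# Hypothetical document; the docID value is made up for illustration.
doc = {
    'title': 'Example',
    'shareProperties': {'source': 'mit', 'docID': 'example-123'},
}
print(unique_document_id(doc))
```

A key like this could be used to de-duplicate results collected across several queries.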

## Simple Queries

We need a URL to use to access the SHARE API.

In [2]:
OSF_APP_URL = 'https://osf.io/api/v1/share/search/'

Let's get the first 3 results from the most basic query - the first page of the most recently updated research release events in SHARE.

We'll use the URL parsing library furl to keep track of all of our arguments to the URL, because we'll be modifying them as we go along. We'll print the URL as we go to take a look at it, so we know what we're requesting.

We'll print out the result's title, original source, and when it was updated.

In [3]:
import furl

search_url = furl.furl(OSF_APP_URL)
search_url.args['size'] = 3
search_url.args['sort'] = 'providerUpdatedDateTime'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('----------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'].encode('utf-8'),
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )

The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime
----------
At the Crossroads: On Fairytales, Firebirds, and Real Life Choices -- from iwu_commons -- updated at 2016-01-26T16:47:20+00:00
School psychology 2010: Demographics, employment, and the context for professional practices – Part 1 -- from u_south_fl -- updated at 2016-01-26T14:25:24+00:00
Principles of Biology -- from npp_ksu -- updated at 2016-01-26T13:30:58+00:00


Now let's limit that query to only documents mentioning "giraffes" somewhere in the title, description, or in any of the metadata. We'd do that by adding a query search term.

In [4]:
search_url.args['q'] = 'giraffes'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'].encode('utf-8'),
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )

The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime&q=giraffes
---------
Odd creature was ancient ancestor of today’s giraffes -- from crossref -- updated at 2015-11-24T00:00:00+00:00
Naturalized seeing/colonial vision : interrogating the display of races in late nineteenth century France -- from datacite -- updated at 2015-11-11T23:35:35+00:00
Integration of complex shapes and natural patterns -- from datacite -- updated at 2015-11-07T03:06:45+00:00


Let's search for documents from the source mit.

In [5]:
search_url.args['q'] = 'shareProperties.source:mit'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'].encode('utf-8'),
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )

The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime&q=shareProperties.source:mit
---------
Similitude: Interfacing a Traffic Simulator and Network Simulator with Emulated Android Clients -- from mit -- updated at 2016-01-25T18:35:04+00:00
MobiStreams: A Reliable Distributed Stream Processing System for Mobile Devices -- from mit -- updated at 2016-01-25T18:30:59+00:00
Approximate cone factorizations and lifts of polytopes -- from mit -- updated at 2016-01-25T18:25:44+00:00


Let's combine the two and find documents from MIT that mention giraffes.

In [6]:
search_url.args['q'] = 'shareProperties.source:mit AND giraffes'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'].encode('utf-8'),
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )

The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime&q=shareProperties.source:mit+AND+giraffes
---------
Giraffes, religion and conflict : essays in behavioral decision making -- from mit -- updated at 2015-08-03T20:00:59
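
The query strings used above can also be assembled by a small helper instead of being typed by hand. This is a sketch written for Python 3 using only the standard library rather than furl. The function `build_search_url` and its parameter names are hypothetical, and it only covers the `size`, `sort`, and `q` arguments shown in this section.

```python
from urllib.parse import urlencode  # Python 3 standard library

OSF_APP_URL = 'https://osf.io/api/v1/share/search/'


def build_search_url(size=3, sort='providerUpdatedDateTime', source=None, term=None):
    """Build a search URL, joining an optional source filter and term with AND."""
    clauses = []
    if source:
        clauses.append('shareProperties.source:{}'.format(source))
    if term:
        clauses.append(term)

    params = {'size': size, 'sort': sort}
    if clauses:
        params['q'] = ' AND '.join(clauses)
    # Keep ':' unescaped so the URL reads like the ones printed above.
    return OSF_APP_URL + '?' + urlencode(params, safe=':')


print(build_search_url(source='mit', term='giraffes'))
```

The resulting string can be passed to `requests.get` in the same way as `search_url.url`.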


## Complex Queries
The SHARE Search API runs on elasticsearch - meaning that it can accept complicated queries that give you a wide variety of information.

Here are some examples of how to make more complex queries using the raw elasticsearch results. You can read a [lot more about elasticsearch queries here](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

In [7]:
search_url.args = None  # reset the args so that we remove our old query arguments.
search_url.url # Show the URL that we'll be requesting to make sure the args were cleared

'https://osf.io/api/v1/share/search/'

### Query Setup

We can define a few functions that we can reuse to make querying simpler. Elasticsearch queries are passed through as json blobs specifying how to return the information you want.

In [8]:
import json

def query_share(url, query):
    # A helper function that will use the requests library,
    # pass along the correct headers,
    # and make the query we want
    headers = {'Content-Type': 'application/json'}
    data = json.dumps(query)
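    # Note: verify=False turns off TLS certificate verification;
    # drop it if the server's certificate validates for you.
    return requests.post(url, headers=headers, data=data, verify=False).json()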

### Some Queries
The SHARE schema has many spots for information, and many of the original sources do not provide this information. We can do a query to find out if a certain field exists or not within certain records. The SHARE API is set up to not display the field if it is empty.

Let's query for the first 5 documents that have a sponsorships field.

In [9]:
sponsorship_query = {
    "size": 5,
    "query": {
        "filtered": {
            "filter": {
                "exists": {
                    "field": "sponsorships"
                }
            }
        }
    }
}
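
The same query shape works for any field, so it can be wrapped in a function. This sketch simply parameterizes the dictionary above; `exists_query` is a hypothetical helper name, and the `filtered` wrapper is kept only because it is what this section uses.

```python
def exists_query(field, size=5):
    """Build a filtered query matching documents where `field` is present."""
    return {
        'size': size,
        'query': {
            'filtered': {
                'filter': {
                    'exists': {'field': field}
                }
            }
        }
    }


rebuilt_query = exists_query('sponsorships')
languages_query = exists_query('languages', size=10)
print(rebuilt_query)
```

Either dictionary can be handed to `query_share` exactly like `sponsorship_query`.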

In [10]:
results = query_share(search_url.url, sponsorship_query)

for item in results['results']:
    # print(item['sponsorships'])  # uncomment to inspect the raw sponsorships field
    print('{} -- from source {} -- sponsored by {}'.format(
            item['title'].encode('utf-8'),
            item['shareProperties']['source'].encode('utf-8'),
            ' '.join(
                [sponsor['sponsor']['sponsorName'] for sponsor in item['sponsorships']]
            )
        )
    )
    print('-------------------')


A Phase III, Randomized, Comparative, Open-label Study of Intravenous Iron Isomaltoside 1000 (Monofer®) Administered as Maintenance Therapy by Single or Repeated Bolus Injections in Comparison With Intravenous Iron Sucrose in Subjects With Stage 5 Chronic Kidney Disease on Dialysis Therapy (CKD-5D) -- from source clinicaltrials -- sponsored by Pharmacosmos A/S 
-------------------
Phase IB Study of FOLFIRINOX Plus PF-04136309 in Patients With Borderline Resectable and Locally Advanced Pancreatic Adenocarcinoma -- from source clinicaltrials -- sponsored by Washington University School of Medicine National Cancer Institute (NCI)
-------------------
Discontinuation of Infliximab Therapy in Patients With Crohn's Disease During Sustained Complete Remission: A National Multi-center, Double Blinded, Randomized, Placebo Controlled Study -- from source clinicaltrials -- sponsored by Copenhagen University Hospital at Herlev 
-------------------
Temperature Evaluation by MRI Thermometry During Ce



Now, let's see how many results do not have tags.

In [11]:
tags_query = {
    "query": {
        "query_string": {
            "analyze_wildcard": True, 
            "query": "NOT tags:*"
        }
    }
}

In [12]:
results_without_tags = query_share(search_url.url, tags_query)
total_results = requests.get(search_url.url).json()['count']
results_percent = (float(results_without_tags['count'])/total_results)*100

print(
    '{} results out of {}, or {}%, do not have tags.'.format(
        results_without_tags['count'],
        total_results,
        format(results_percent, '.2f')
    )
)



4157872 results out of 4272328, or 97.32%, do not have tags.
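
The percentage arithmetic can be checked on its own, using the two counts printed above. The helper `percent_of_total` is a hypothetical name introduced for this sketch.

```python
def percent_of_total(part, total):
    """Return part/total as a percentage string with two decimals."""
    if total == 0:
        raise ValueError('total must be non-zero')
    return format(100.0 * part / total, '.2f')


without_tags = 4157872
total = 4272328

print(percent_of_total(without_tags, total))          # share without tags
print(percent_of_total(total - without_tags, total))  # share with tags
```

Subtracting the untagged count from the total gives the number of tagged documents without a second request.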




## Using SHAREPA for SHARE Parsing and Analysis

While you can always pass raw elasticsearch queries to the SHARE API, there is also a pip-installable Python library that makes elasticsearch aggregations a little simpler. This library is called [sharepa - short for SHARE Parsing and Analysis](https://github.com/CenterForOpenScience/sharepa#sharepa).

### Basic Actions

A basic search will provide access to all documents in SHARE, in slices of 10 documents.

#### Count
You can use sharepa and the basic search to get the total number of documents in SHARE.

In [13]:
from sharepa import basic_search

basic_search.count()

4272328

#### Iterating Through Results
Executing the basic search will send the actual basic query to the SHARE API and then let you iterate through results, 10 at a time.

In [14]:
results = basic_search.execute()

for hit in results:
    print(hit.title)

Avian community structure and incidence of human West Nile infection
Rat12_a
Non compact continuum limit of two coupled Potts models

Simultaneous Localization, Mapping, and Manipulation for Unsupervised
  Object Discovery
Synthesis of High-Temperature Self-lubricating Wear Resistant Composite Coating on Ti6Al4V Alloy by Laser Deposition
Comparative Studies of Silicon Dissolution in Molten Aluminum Under Different Flow Conditions, Part I: Single-Phase Flow
Scrambling of data in all-optical domain
Step behaviour and autonomic nervous system activity in multiparous dairy cows during milking in a herringbone milking system
<p>Typical features of the constant velocity forced dissociation process in the SGP-3-ligated 1G1Q 2CR complex system.</p>


If we don't want 10 results, or we want to offset the results, we can use slices.

In [15]:
results = basic_search[20:25].execute()
for hit in results:
    print(hit.title)

Elements of Trust in Named-Data Networking
Effect of Perceived Attributions about Ostracism on Social Pain and Task Performance
Millimeter Wave MIMO Channel Tracking Systems
Metric Dimension and Zero Forcing Number of Two Families of Line Graphs
The Glassey conjecture on asymptotically flat manifolds
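
In a raw elasticsearch request, paging like this is normally expressed with `from` and `size`. The sketch below shows that translation for a slice. It assumes sharepa's search objects behave like elasticsearch-dsl ones in this respect, which is not confirmed here, and `slice_to_paging` is a hypothetical helper.

```python
def slice_to_paging(start, stop):
    """Translate a [start:stop] slice into elasticsearch 'from' and 'size' values."""
    if start < 0 or stop < start:
        raise ValueError('expected 0 <= start <= stop')
    return {'from': start, 'size': stop - start}


print(slice_to_paging(20, 25))
```

Under that assumption, `basic_search[20:25]` skips the first 20 documents and asks for the next 5.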


### Advanced Search with sharepa

You can make your own search object, which allows you to pass in custom queries for certain terms or SHARE fields. Queries are formed using [lucene query syntax](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax), just like we used in the above examples.

This type of query accepts a 'query_string'. Other options include a match query, a multi-match query, a bool query, and any other query structure available in the elasticsearch API.

We can see the query that we're about to send to elasticsearch by using the pretty print helper function. You'll see that it looks very similar to the queries we defined by hand earlier.

In [16]:
from sharepa import ShareSearch
from sharepa.helpers import pretty_print

my_search = ShareSearch()

my_search = my_search.query(
    'query_string', # Type of query, will accept a lucene query string
    query='NOT tags:*', # This lucene query string will find all documents that don't have tags
    analyze_wildcard=True  # This will make elasticsearch pay attention to the asterisk (which matches anything)
)

pretty_print(my_search.to_dict())

{
    "query": {
        "query_string": {
            "analyze_wildcard": true, 
            "query": "NOT tags:*"
        }
    }
}
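
The lowercase `true` in that output is JSON's spelling of Python's `True`. A similar rendering can be produced with the standard library alone. This sketch does not claim to reproduce how sharepa's `pretty_print` is implemented; it only shows the dictionary-to-JSON step.

```python
import json

query = {
    'query': {
        'query_string': {
            'analyze_wildcard': True,
            'query': 'NOT tags:*',
        }
    }
}

# indent and sort_keys make the output stable and easy to compare by eye.
rendered = json.dumps(query, indent=4, sort_keys=True)
print(rendered)
```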


When you execute that query, you can then iterate through the results the same way that you could with the simple search query.

In [17]:
new_results = my_search.execute()
for hit in new_results:
    print(hit.title)

Avian community structure and incidence of human West Nile infection
Non compact continuum limit of two coupled Potts models

Simultaneous Localization, Mapping, and Manipulation for Unsupervised
  Object Discovery
Synthesis of High-Temperature Self-lubricating Wear Resistant Composite Coating on Ti6Al4V Alloy by Laser Deposition
Comparative Studies of Silicon Dissolution in Molten Aluminum Under Different Flow Conditions, Part I: Single-Phase Flow
Scrambling of data in all-optical domain
Step behaviour and autonomic nervous system activity in multiparous dairy cows during milking in a herringbone milking system
<p>Typical features of the constant velocity forced dissociation process in the SGP-3-ligated 1G1Q 2CR complex system.</p>
The elusive shepherdess


## Debugging and Problem Solving

Not everything always goes as planned when querying an unfamiliar API. Here are some debugging and problem solving strategies for when you're querying the SHARE API.

### Schema issues
The SHARE schema has a lot of parts, and much of the information is nested within sections. Making a query isn't always as straightforward as you might think if you're not looking in the right part of the schema.

Let's say you were trying to query for all SHARE documents that specify the language as not being in English.

We'll guess as to what that query might be, and try to make it using sharepa.

In [18]:
language_search = ShareSearch()

language_search = language_search.query(
    'query_string', # Type of query, will accept a lucene query string
    query='NOT languages=english', # Our first guess at a lucene query string for documents that are not in English
)

In [19]:
results = language_search.execute()

for hit in results:
    print(hit.languages)

AttributeError: 'Result' object has no attribute 'languages'

So the result does not have an attribute called languages! Let's try to figure out what went wrong here.

The first problem is that we are searching for documents that do NOT match a given value. Since languages is not a required field, the query also returns documents that have no languages field at all!
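
The effect is easy to reproduce without the API. The sketch below applies a bare negation, and then a negation that first requires the field to exist, to three made-up documents. The documents and both function names are hypothetical.

```python
# Hypothetical documents; only the second and third specify languages.
documents = [
    {'title': 'No language field'},
    {'title': 'English only', 'languages': ['eng']},
    {'title': 'German only', 'languages': ['ger']},
]


def not_english(doc):
    """Mimic a bare NOT clause: anything that lacks 'eng' matches."""
    return 'eng' not in doc.get('languages', [])


def has_languages_and_not_english(doc):
    """Require the field to exist before applying the negation."""
    return 'languages' in doc and 'eng' not in doc['languages']


bare = [d['title'] for d in documents if not_english(d)]
guarded = [d['title'] for d in documents if has_languages_and_not_english(d)]
print(bare)
print(guarded)
```

The bare negation lets the document with no languages field through, which is the kind of result that raised the `AttributeError` above.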

So let's fix this up a bit to make sure that we're querying for items that specify language in the first place.

In [20]:
language_search = ShareSearch()

language_search = language_search.filter(
    'exists',
    field="languages"
)


In [21]:
results = language_search.execute()

# Let's see how many documents have language results.
print('There are {} documents with languages specified'.format(language_search.count()))

print('Here are the languages for the first 10 results:')

# Check out the first few results
for hit in results:
    print(hit.languages)

There are 155407 documents with languages specified
Here are the languages for the first 10 results:
[u'ger']
[u'fre']
[u'eng']
[u'eng']
[u'eng']
[u'eng']
[u'eng']
[u'eng', u'fre']
[u'eng']
[u'eng']


So now we're better equipped to add on to this filter, and then narrow down to results that are not in English.

Printing out the first few results also reveals a second problem with our original query. Going back to the [SHARE Schema](https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml), we can see that there is a restriction on how languages are captured: as a three letter lowercase representation. Instead of "english", let's look for the three letter abbreviation, "eng".
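
A quick check like the following can catch this kind of mistake before a query is sent. It is a minimal sketch that only tests the shape described above (three lowercase letters). It does not verify that the value is a real language code, and `is_language_code` is a hypothetical helper.

```python
import re

# Three lowercase ASCII letters, per the restriction described above.
LANGUAGE_CODE = re.compile(r'[a-z]{3}')


def is_language_code(value):
    """Return True if value looks like a three letter lowercase code."""
    return LANGUAGE_CODE.fullmatch(value) is not None


for candidate in ['eng', 'english', 'ENG']:
    print(candidate, is_language_code(candidate))
```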

We can extend our improved language query by adding another query to the language_search we already started. We'll use the elasticsearch query object Q, invert it with the ~ operator, and search for the term "eng".

In [22]:
from elasticsearch_dsl import Q

language_search = language_search.query(~Q("term", languages="eng"))

results = language_search.execute()

# Let's see how many documents have language results that aren't eng
print('There are {} documents that do not have "eng" listed.'.format(language_search.count()))

print('Here are the languages for the first 10 results:')

# Check out the first few results, make sure "eng" isn't in there
for hit in results:
    print(hit.languages)

There are 4007 documents that do not have "eng" listed.
Here are the languages for the first 10 results:
[u'ger']
[u'fre']
[u'ger']
[u'fre']
[u'lat']
[u'lat']
[u'lat']
[u'lat']
[u'fre']
[u'tha']
