# Calling the SHARE API
----
Here are some working examples of how to query the current scrAPI database for metrics of results coming through the SHARE Notification Service.

These particular queries are just examples, and the data is open for anyone to use, so feel free to make your own and experiment!


## The SHARE Schema

Reference the SHARE schema to format queries and explore the SHARE data set.

### https://github.com/CenterForOpenScience/share-schema

### JSON Schema
#### http://json-schema.org/
- describes your existing data format
- clear, human- and machine-readable documentation
- complete structural validation, useful for
    - automated testing
    - validating client-submitted data
    
The schema appears in YAML format in the main schema repo:
```
$schema: "http://json-schema.org/draft-04/schema#"
type: "object"
description: "This is the Beta schema for the SHARE project."
properties:
    title:
        description: The title and any sub-titles of the resource.
        type: "string"
    contributors:
        description: The people or organizations responsible for making contributions to an object.
        type: array
        items:
            anyOf:
                - $ref: "#/definitions/person"
                - $ref: "#/definitions/organization"
```

This YAML format is transformed into JSON, which is used in scrapi (SHARE's data processing pipeline):

```
"properties": {
    "title": {
        "type": "string",
        "description": "The title and any sub-titles of the resource."
    },
    "contributors": {
        "items": {
            "anyOf": [
                {
                        "$ref": "#/definitions/person"
                },
                {
                        "$ref": "#/definitions/organization"
                }
            ]
        },
        "type": "array",
        "description": "The people or organizations responsible for making contributions to an object."
    }
}
```
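
To get a feel for the JSON form, the fragment can be loaded with Python's standard json module and walked to list each field with its type. This is a minimal sketch: the excerpt is wrapped in outer braces so that it parses on its own, and the definitions it references are left out.

```python
import json

# The excerpt above, wrapped in braces so it is a complete JSON document.
schema_fragment = json.loads('''
{
    "properties": {
        "title": {
            "type": "string",
            "description": "The title and any sub-titles of the resource."
        },
        "contributors": {
            "type": "array",
            "items": {
                "anyOf": [
                    {"$ref": "#/definitions/person"},
                    {"$ref": "#/definitions/organization"}
                ]
            },
            "description": "The people or organizations responsible for making contributions to an object."
        }
    }
}
''')

# Map each field name to its declared type.
field_types = {
    name: spec['type']
    for name, spec in schema_fragment['properties'].items()
}

for name, field_type in sorted(field_types.items()):
    print('{}: {}'.format(name, field_type))
```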

## Setup


### Service Names for Reference
----
Each provider that SHARE harvests from has a shortened name for its source. Let's make an API call to get all of those short names, along with the official name of the repository that each one represents.

The SHARE API has several endpoints. One of them returns a list of all of the providers that SHARE is harvesting from, along with their short names, official names, links to their homepages, and a simple version of an icon representing their service, in a parsable format called json.

Let's call that endpoint using the requests library, get the json data, and print out all of the short names and long names.

In [None]:
import requests

data = requests.get('https://osf.io/api/v1/share/providers/').json()

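# Each value in providerMap describes one provider.
for provider in data['providerMap'].values():
    # On Python 3, strings are formatted directly; calling .encode('utf-8')
    # here would print a bytes literal such as b'...'.
    print('{} - {}'.format(provider['short_name'], provider['long_name']))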
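
The loop above prints one line per provider. If an aligned table is easier to read, the same data can be formatted with padded columns. The sketch below assumes the response shape used above (a providerMap whose entries carry short_name and long_name); the two sample entries are hypothetical stand-ins for the real response, so it runs without a network call.

```python
# Hypothetical sample shaped like the providerMap in the API response.
sample_provider_map = {
    'mit': {'short_name': 'mit', 'long_name': 'DSpace@MIT'},
    'plos': {'short_name': 'plos', 'long_name': 'Public Library of Science'},
}


def provider_table(provider_map):
    """Return 'short_name  long_name' rows, sorted by short name."""
    # Pad every short name to the width of the longest one.
    width = max(len(p['short_name']) for p in provider_map.values())
    return [
        '{}  {}'.format(p['short_name'].ljust(width), p['long_name'])
        for p in sorted(provider_map.values(), key=lambda p: p['short_name'])
    ]


for row in provider_table(sample_provider_map):
    print(row)
```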


#### SHARE Schema

You can make queries against any of the fields defined in the [SHARE Schema](https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml). If we were able to harvest the information from the original source, it should appear in SHARE. However, not all fields are required for every document. 

Required fields include:
- title
- contributors
- uris
- providerUpdatedDateTime
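
Because only these four fields are guaranteed, it can help to check a document before relying on anything else. A minimal sketch, using a hypothetical document dict rather than a real API result:

```python
REQUIRED_FIELDS = ('title', 'contributors', 'uris', 'providerUpdatedDateTime')


def missing_required(document):
    """Return the required SHARE fields that are absent from a document dict."""
    return [field for field in REQUIRED_FIELDS if field not in document]


# Hypothetical document that lacks two of the required fields.
example_document = {'title': 'A study of giraffes', 'contributors': []}
print(missing_required(example_document))
```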

We add some information after each document is harvested inside the field shareProperties, including:
- source (where the document was originally harvested)
- docID (a unique identifier for that object from that source)

These two fields can be combined to make a unique document identifier.
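
One way to combine them is to join the two values with a separator. The separator and the sample values below are assumptions for illustration, not a format defined by SHARE:

```python
def document_key(result, separator='|'):
    """Build an identifier from a result's shareProperties source and docID."""
    properties = result['shareProperties']
    return '{}{}{}'.format(properties['source'], separator, properties['docID'])


# Hypothetical result; real results come from the search endpoint.
example_result = {'shareProperties': {'source': 'mit', 'docID': 'oai:example:1234'}}
print(document_key(example_result))
```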

## Simple Queries

We need a URL to use to access the SHARE API.

In [None]:
OSF_APP_URL = 'https://osf.io/api/v1/share/search/'

Let's get the first 3 results from the most basic query - the first page of the most recently updated research release events in SHARE.

We'll use the URL parsing library furl to keep track of all of our arguments to the URL, because we'll be modifying them as we go along. We'll print the URL as we go to take a look at it, so we know what we're requesting.

We'll print out the result's title, original source, and when it was updated.

In [None]:
import furl

search_url = furl.furl(OSF_APP_URL)
search_url.args['size'] = 3
search_url.args['sort'] = 'providerUpdatedDateTime'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('----------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'],
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )

Now let's limit that query to only documents mentioning "giraffes" somewhere in the title, description, or in any of the metadata. We'd do that by adding a query search term.

In [None]:
search_url.args['q'] = 'giraffes'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'],
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )

Let's search for documents from the source mit.

In [None]:
search_url.args['q'] = 'shareProperties.source:mit'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'],
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )

Let's combine the two and find documents from MIT that mention giraffes.

In [None]:
search_url.args['q'] = 'shareProperties.source:mit AND giraffes'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'],
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )
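
The query strings above follow one pattern: an optional source restriction joined to an optional search term with AND. A small helper can build them so the pieces are not retyped each time. This is a sketch; the function name and its parameters are hypothetical, and it only assembles the string that is passed as q.

```python
def build_query(term=None, source=None):
    """Assemble a lucene query string from an optional source and search term."""
    parts = []
    if source:
        parts.append('shareProperties.source:{}'.format(source))
    if term:
        parts.append(term)
    # With nothing to restrict on, fall back to the lucene match-all query.
    return ' AND '.join(parts) if parts else '*:*'


print(build_query(term='giraffes', source='mit'))
```

The returned string can be assigned to search_url.args['q'] in the same way as the hand-written queries.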

## Complex Queries
The SHARE Search API runs on elasticsearch - meaning that it can accept complicated queries that give you a wide variety of information.

Here are some examples of how to make more complex queries using the raw elasticsearch results. You can read a [lot more about elasticsearch queries here](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

In [None]:
search_url.args = None  # reset the args so that we remove our old query arguments.
search_url.url # Show the URL that we'll be requesting to make sure the args were cleared

### Query Setup

We can define a few functions that we can reuse to make querying simpler. Elasticsearch queries are passed through as json blobs specifying how to return the information you want.

In [None]:
import json

def query_share(url, query):
    """POST an elasticsearch query (a dict) to the SHARE API and return the parsed json."""
    headers = {'Content-Type': 'application/json'}
    data = json.dumps(query)
    # Note: verify=False turns off TLS certificate verification for this request.
    return requests.post(url, headers=headers, data=data, verify=False).json()

### Some Queries
The SHARE schema has many spots for information, and many of the original sources do not provide this information. We can do a query to find out if a certain field exists or not within certain records. The SHARE API is set up to not display the field if it is empty.

Let's query for the first 5 documents that have a sponsorships field.

In [None]:
sponsorship_query = {
    "size": 5,
    "query": {
        "filtered": {
            "filter": {
                "exists": {
                    "field": "sponsorships"
                }
            }
        }
    }
}

In [None]:
results = query_share(search_url.url, sponsorship_query)

for item in results['results']:
    # Join the sponsor names for this document into a single string.
    sponsors = ' '.join(
        sponsor['sponsor']['sponsorName'] for sponsor in item['sponsorships']
    )
    print('{} -- from source {} -- sponsored by {}'.format(
        item['title'],
        item['shareProperties']['source'],
        sponsors
    ))
    print('-------------------')
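
The filtered query used in sponsorship_query was deprecated in Elasticsearch 2.0 and removed in 5.0, where a bool query with a filter clause takes its place. Which form the SHARE API accepts depends on the Elasticsearch version behind it, so treat the following as a sketch of the newer syntax rather than a drop-in replacement:

```python
# Same intent as sponsorship_query, written for Elasticsearch versions
# that no longer support "filtered".
sponsorship_bool_query = {
    "size": 5,
    "query": {
        "bool": {
            "filter": {
                "exists": {
                    "field": "sponsorships"
                }
            }
        }
    }
}
```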


Now, let's see how many results do not have tags.

In [None]:
tags_query = {
    "query": {
        "query_string": {
            "analyze_wildcard": True, 
            "query": "NOT tags:*"
        }
    }
}

In [None]:
# tags_query matches documents that do NOT have tags, so name the result accordingly.
results_without_tags = query_share(search_url.url, tags_query)
total_results = requests.get(search_url.url).json()['count']
results_percent = (float(results_without_tags['count'])/total_results)*100

print(
    '{} results out of {}, or {}%, do not have tags.'.format(
        results_without_tags['count'],
        total_results,
        format(results_percent, '.2f')
    )
)
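
The percentage calculation divides by the total count, which would fail if a query ever returned no documents at all. A small helper, hypothetical and not part of any SHARE library, keeps that case from raising an error:

```python
def percent_of(part, total):
    """Return part as a percentage of total, or 0.0 when total is zero."""
    if total == 0:
        return 0.0
    return float(part) / total * 100


print('{:.2f}%'.format(percent_of(25, 200)))
```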



## Using SHAREPA for SHARE Parsing and Analysis

While you can always pass raw elasticsearch queries to the SHARE API, there is also a pip-installable Python library that makes elasticsearch aggregations a little simpler. This library is called [sharepa - short for SHARE Parsing and Analysis](https://github.com/CenterForOpenScience/sharepa#sharepa).

### Basic Actions

A basic search will provide access to all documents in SHARE in slices of 10 documents.

#### Count
You can use sharepa and the basic search to get the total number of documents in SHARE.

In [None]:
from sharepa import basic_search

basic_search.count()

#### Iterating Through Results
Executing the basic search will send the actual basic query to the SHARE API and then let you iterate through results, 10 at a time.

In [None]:
results = basic_search.execute()

for hit in results:
    print(hit.title)

If we don't want 10 results, or we want to offset the results, we can use slices.

In [None]:
results = basic_search[20:25].execute()
for hit in results:
    print(hit.title)
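
Slices also make it possible to page through a larger set of results in fixed-size steps. The generator below only computes the slice bounds; it is a sketch, and the name page_bounds is hypothetical. Each (start, stop) pair could then be used as basic_search[start:stop].execute().

```python
def page_bounds(total, page_size=10):
    """Yield (start, stop) pairs covering 0..total in steps of page_size."""
    for start in range(0, total, page_size):
        # The last page may be shorter than page_size.
        yield start, min(start + page_size, total)


print(list(page_bounds(25)))
```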

### Advanced Search with sharepa

You can make your own search object, which allows you to pass in custom queries for certain terms or SHARE fields. Queries are formed using [lucene query syntax](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax), just like we used in the above examples.

This type of query accepts a 'query_string'. Other options include a match query, a multi-match query, a bool query, and any other query structure available in the elasticsearch API.

We can see the query that we're about to send to elasticsearch by using the pretty print helper function. You'll see that it looks very similar to the queries we defined by hand earlier.

In [None]:
from sharepa import ShareSearch
from sharepa.helpers import pretty_print

my_search = ShareSearch()

my_search = my_search.query(
    'query_string', # Type of query, will accept a lucene query string
    query='NOT tags:*', # This lucene query string will find all documents that don't have tags
    analyze_wildcard=True  # This will make elasticsearch pay attention to the asterisk (which matches anything)
)

pretty_print(my_search.to_dict())

When you execute that query, you can then iterate through the results the same way that you could with the simple search query.

In [None]:
new_results = my_search.execute()
for hit in new_results:
    print(hit.title)

## Debugging and Problem Solving

Not everything always goes as planned when querying an unfamiliar API. Here are some debugging and problem solving strategies for when you're querying the SHARE API.

### Schema issues
The SHARE schema has a lot of parts, and much of the information is nested within sections. Making a query isn't always as straightforward as you might think if you're not looking in the right part of the schema.

Let's say you were trying to query for all SHARE documents that specify the language as not being in English.

We'll guess as to what that query might be, and try to make it using sharepa.

In [None]:
language_search = ShareSearch()

language_search = language_search.query(
    'query_string', # Type of query, will accept a lucene query string
    query='NOT languages=english', # A first guess at a query for documents that are not in English
)

In [None]:
results = language_search.execute()

for hit in results:
    print(hit.languages)

So the result does not have an attribute called languages! Let's try to figure out what went wrong here.

The first thing to consider is that we are trying to find something that does NOT match a given parameter. Since languages is not required, this query also returns results that do not include the languages field at all!
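
While exploring, a missing attribute does not have to stop the loop. Python's built-in getattr accepts a default that is returned when the attribute is absent. The sketch below uses SimpleNamespace objects as hypothetical stand-ins for search hits, so it runs without calling the API:

```python
from types import SimpleNamespace

# Hypothetical stand-ins for hits: only the first one has languages.
hits = [
    SimpleNamespace(title='Paper A', languages=['eng']),
    SimpleNamespace(title='Paper B'),
]

# Fall back to None instead of raising AttributeError.
found = [getattr(hit, 'languages', None) for hit in hits]
print(found)
```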

So let's fix this up a bit to make sure that we're querying for items that specify language in the first place.

In [None]:
language_search = ShareSearch()

language_search = language_search.filter(
    'exists',
    field="languages"
)

language_search.count()

In [None]:
results = language_search.execute()

# Let's see how many documents have language results.
print('There are {} documents with languages specified'.format(language_search.count()))

print('Here are the languages for the first 10 results:')

# Check out the first few results
for hit in results:
    print(hit.languages)

So now we're better equipped to add on to this filter, and then narrow down to results that are not in English.

When we printed out the first few results, we might have noticed a second problem with our query. Going back to the [SHARE Schema](https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml), we can see that there is a restriction on how languages are captured: as a three-letter lowercase representation. Instead of "english", let's look for the three-letter abbreviation "eng".

We can modify our new and improved language query by adding another query to our existing language_search. We'll use the elasticsearch query object Q, invert it with a ~ symbol, and search for the term "eng".

In [None]:
from elasticsearch_dsl import Q

language_search = language_search.query(~Q("term", languages="eng"))

results = language_search.execute()

# Let's see how many documents have language results that aren't eng
print('There are {} documents that do not have "eng" listed.'.format(language_search.count()))

print('Here are the languages for the first 10 results:')

# Check out the first few results, make sure "eng" isn't in there
for hit in results:
    print(hit.languages)
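
To check the logic of the combined search, the same two conditions can be applied to a handful of documents in plain Python: keep documents that have a languages field, then drop those that list eng. The documents below are made up for illustration.

```python
# Hypothetical documents; only the languages field matters here.
documents = [
    {'title': 'Paper A', 'languages': ['eng']},
    {'title': 'Paper B', 'languages': ['fre']},
    {'title': 'Paper C'},
    {'title': 'Paper D', 'languages': ['eng', 'spa']},
]

non_english = [
    doc['title']
    for doc in documents
    if 'languages' in doc              # the "exists" filter
    and 'eng' not in doc['languages']  # the inverted "term" query
]
print(non_english)
```

Paper D is dropped as well, because a document that lists eng alongside another language still matches the term being excluded.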