# Calling the SHARE API
----
Here are some working examples of how to query the current scrAPI database for metrics on the results coming through the SHARE Notification Service.

These particular queries are just examples, and the data is open for anyone to use, so feel free to make your own and experiment!

----


# Setup

It's good practice in Python to import all the modules you're going to use at the top, so we'll go ahead and do that now. We won't use some of these imports until later, but this gets everything out of the way at once.

In [1]:
from __future__ import division

import json
import furl
import requests
from collections import defaultdict

from sharepa import ShareSearch
from sharepa import basic_search
from sharepa import merge_dataframes
from sharepa import bucket_to_dataframe
from sharepa.helpers import pretty_print
from sharepa.helpers import source_counts

### Service Names for Reference
----
Each provider that SHARE harvests from is identified by a shortened source name. Let's make an API call to build a table of those short names, along with the official name of the repository each one represents.

The SHARE API has different endpoints. One of those endpoints returns a list of all of the providers that SHARE is harvesting from, along with their short names, official names, links to their homepages, and a simple version of an icon representing their service, in a parsable format called JSON.

Let's make a call to that API endpoint using the requests library, get the JSON data, and print out all of the short names and long names.

In [2]:
data = requests.get('https://osf.io/api/v1/share/providers/').json()

for source in data['providerMap'].keys():
    print(
        '{} - {}'.format(
            data['providerMap'][source]['short_name'],
            data['providerMap'][source]['long_name'].encode('utf-8')
        )
    )


doepages - Department of Energy Pages
scholarsarchiveosu - ScholarsArchive@OSU
utaustin - University of Texas at Austin Digital Repository
scholarworks_umass - ScholarWorks@UMass Amherst
cambridge - Apollo @ University of Cambridge
texasstate - DSpace at Texas State University
osf - Open Science Framework
lwbin - Lake Winnipeg Basin Information Network
uow - Research Online @ University of Wollongong
oaktrust - The OAKTrust Digital Repository at Texas A&M
umich - Deep Blue @ University of Michigan
cogprints - Cognitive Sciences ePrint Archive
utktrace - Trace: Tennessee Research and Creative Exchange
stcloud - The repository at St Cloud State
smithsonian - Smithsonian Digital Repository
csuohio - Cleveland State University's EngagedScholarship@CSU
biomedcentral - BioMed Central
crossref - CrossRef
purdue - PURR - Purdue University Research Repository
unl_digitalcommons - DigitalCommons@University of Nebraska - Lincoln
scholarscompass_vcu - VCU Scholars Compass
dataone - DataONE: Data O
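
The list above comes back in dictionary order, which can be hard to scan. Here's a minimal sketch of sorting it by short name. It runs on a small hand-made `sample` dictionary (an assumption, shaped like the `providerMap` part of the response, with two entries copied from the output above) rather than on a live API call, and `provider_rows` is a hypothetical helper name. On Python 3, strings print as Unicode directly, so the sketch skips the `.encode('utf-8')` step.

```python
# Hypothetical sample shaped like the 'providerMap' part of the response.
# The two entries are copied from the output above.
sample = {
    'providerMap': {
        'osf': {'short_name': 'osf', 'long_name': 'Open Science Framework'},
        'crossref': {'short_name': 'crossref', 'long_name': 'CrossRef'},
    }
}


def provider_rows(data):
    """Return (short_name, long_name) pairs sorted by short name."""
    providers = data['providerMap'].values()
    return sorted((p['short_name'], p['long_name']) for p in providers)


for short_name, long_name in provider_rows(sample):
    print('{} - {}'.format(short_name, long_name))
```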

#### SHARE Schema

You can make queries against any of the fields defined in the [SHARE Schema](https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml). If we were able to harvest the information from the original source, it should appear in SHARE. However, not all fields are required for every document.

Required fields include:
- title
- contributors
- uris
- providerUpdatedDateTime

We add some information after each document is harvested inside the field shareProperties, including:
- source (where the document was originally harvested)
- docID  (a unique identifier for that object from that source)

These two fields can be combined to make a unique document identifier.
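
As a rough sketch of what combining the two fields could look like, the helper below joins `source` and `docID` from a result's `shareProperties`. The function name `document_id`, the `|` separator, and the sample result are all assumptions made for illustration; SHARE itself may format combined identifiers differently.

```python
def document_id(result, separator='|'):
    """Combine source and docID into one string.

    The separator is an arbitrary choice for this sketch.
    """
    props = result['shareProperties']
    return '{}{}{}'.format(props['source'], separator, props['docID'])


# Hypothetical result with made-up values
sample_result = {
    'title': 'An example document',
    'shareProperties': {'source': 'mit', 'docID': 'abc123'},
}

print(document_id(sample_result))
```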

## Simple Queries

We need a URL to use to access the SHARE API.

In [3]:
OSF_APP_URL = 'https://osf.io/api/v1/share/search/'

Let's get the first 3 results from the most basic query - the first page of the most recently updated research release events in SHARE.

We'll use the URL parsing library furl to keep track of all of our arguments to the URL, because we'll be modifying them as we go along. We'll print the URL as we go to take a look at it, so we know what we're requesting.

We'll print out each result's title, original source, and when it was updated.

In [4]:
search_url = furl.furl(OSF_APP_URL)
search_url.args['size'] = 3
search_url.args['sort'] = 'providerUpdatedDateTime'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('----------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'].encode('utf-8'),
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )

The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime
----------
Data -- from osf -- updated at 2016-01-24T08:58:15.339000+00:00
Materials -- from osf -- updated at 2016-01-24T08:58:14.962000+00:00
Methods and Measures -- from osf -- updated at 2016-01-24T08:58:14.821000+00:00


Now let's limit that query to only documents mentioning "giraffes" somewhere in the title, description, or in any of the metadata. We'd do that by adding a query search term.

In [5]:
search_url.args['q'] = 'giraffes'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'].encode('utf-8'),
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )

The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime&q=giraffes
---------
Odd creature was ancient ancestor of today’s giraffes -- from crossref -- updated at 2015-11-24T00:00:00+00:00
Naturalized seeing/colonial vision : interrogating the display of races in late nineteenth century France -- from datacite -- updated at 2015-11-11T23:35:35+00:00
Integration of complex shapes and natural patterns -- from datacite -- updated at 2015-11-07T03:06:45+00:00


Let's search for documents from the source mit.

In [6]:
search_url.args['q'] = 'shareProperties.source:mit'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'].encode('utf-8'),
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )

The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime&q=shareProperties.source:mit
---------
The LSND and MiniBooNE Oscillation Searches at High [Delta]m[superscript 2] -- from mit -- updated at 2016-01-22T19:30:41+00:00
Microstructural view of burrowing with a bioinspired digging robot -- from mit -- updated at 2016-01-22T19:21:08+00:00
Six High-Precision Transits of Ogle-Tr-113b -- from mit -- updated at 2016-01-22T19:15:43+00:00


Let's combine the two and find documents from MIT that mention giraffes.

In [7]:
search_url.args['q'] = 'shareProperties.source:mit AND giraffes'
recent_results = requests.get(search_url.url).json()

print('The request URL is {}'.format(search_url.url))
print('---------')
for result in recent_results['results']:
    print(
        '{} -- from {} -- updated at {}'.format(
            result['title'].encode('utf-8'),
            result['shareProperties']['source'],
            result['providerUpdatedDateTime']
        )
    )

The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime&q=shareProperties.source:mit+AND+giraffes
---------
Giraffes, religion and conflict : essays in behavioral decision making -- from mit -- updated at 2015-08-03T20:00:59
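
The queries above all follow the same pattern: an optional source filter, optional search terms, and the two joined with AND. Here's a small sketch of that pattern using only the Python 3 standard library instead of furl, so it builds the URL without making a request. The helper names `build_query` and `build_url` are assumptions for this example, and passing `safe=':'` is just a way to keep the colon unescaped, as it is in the URLs printed above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://osf.io/api/v1/share/search/'


def build_query(source=None, terms=None):
    """Join an optional source filter and optional search terms with AND."""
    parts = []
    if source:
        parts.append('shareProperties.source:{}'.format(source))
    if terms:
        parts.append(terms)
    return ' AND '.join(parts)


def build_url(query, size=3, sort='providerUpdatedDateTime'):
    """Build the search URL; a list of pairs keeps the argument order fixed."""
    params = [('size', size), ('sort', sort), ('q', query)]
    return BASE_URL + '?' + urlencode(params, safe=':')


print(build_url(build_query(source='mit', terms='giraffes')))
```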


## Complex Queries
The SHARE Search API runs on elasticsearch - meaning that it can accept complicated queries that give you a wide variety of information.

Here are some examples of how to make more complex queries using the raw elasticsearch results. You can read a [lot more about elasticsearch queries here](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

In [8]:
search_url.args = None  # reset the args so that we remove our old query arguments.
search_url.url # Show the URL that we'll be requesting to make sure the args were cleared

'https://osf.io/api/v1/share/search/'

### Query Setup

We can define a few functions that we can reuse to make querying simpler. Elasticsearch queries are passed through as json blobs specifying how to return the information you want.

In [9]:
def query_share(url, query):
    # A helper function that uses the requests library to POST
    # an elasticsearch query (a dict) as a JSON body,
    # along with the matching Content-Type header
    headers = {'Content-Type': 'application/json'}
    data = json.dumps(query)
    # Note: verify=False turns off TLS certificate verification
    return requests.post(url, headers=headers, data=data, verify=False).json()

### Some Queries
The SHARE schema has fields for many kinds of information, but many of the original sources do not provide all of it. We can query to find out whether a certain field exists within certain records. The SHARE API is set up to leave a field out of a record if it is empty.

Let's query for the first 5 documents that have a sponsorships field.

In [10]:
sponsorship_query = {
    "size": 5,
    "query": {
        "filtered": {
            "filter": {
                "exists": {
                    "field": "sponsorships"
                }
            }
        }
    }
}

In [11]:
results = query_share(search_url.url, sponsorship_query)

for item in results['results']:
    print('{} -- from source {} -- sponsored by {}'.format(
            item['title'].encode('utf-8'),
            item['shareProperties']['source'].encode('utf-8'),
            ' '.join(
                [sponsor['sponsor']['sponsorName'] for sponsor in item['sponsorships']]
            )
        )
    )
    print('-------------------')


A Phase III, Randomized, Comparative, Open-label Study of Intravenous Iron Isomaltoside 1000 (Monofer®) Administered as Maintenance Therapy by Single or Repeated Bolus Injections in Comparison With Intravenous Iron Sucrose in Subjects With Stage 5 Chronic Kidney Disease on Dialysis Therapy (CKD-5D) -- from source clinicaltrials -- sponsored by Pharmacosmos A/S 
-------------------
Phase IB Study of FOLFIRINOX Plus PF-04136309 in Patients With Borderline Resectable and Locally Advanced Pancreatic Adenocarcinoma -- from source clinicaltrials -- sponsored by Washington University School of Medicine National Cancer Institute (NCI)
-------------------
Discontinuation of Infliximab Therapy in Patients With Crohn's Disease During Sustained Complete Remission: A National Multi-center, Double Blinded, Randomized, Placebo Controlled Study -- from source clinicaltrials -- sponsored by Copenhagen University Hospital at Herlev 
-------------------
Temperature Evaluation by MRI Thermometry During Ce
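
Because empty fields are left out of a record, code that reads `item['sponsorships']` directly only works on results we already know have that field. Here's a hedged sketch of a more defensive way to pull out sponsor names, using `dict.get` with defaults. The helper name `sponsor_names` and both sample results are assumptions for illustration; the nested `sponsor` and `sponsorName` keys follow the loop above.

```python
def sponsor_names(item):
    """Return sponsor names for a result, or an empty list if there are none."""
    names = []
    for sponsorship in item.get('sponsorships', []):
        name = sponsorship.get('sponsor', {}).get('sponsorName')
        if name:
            names.append(name)
    return names


# Hypothetical results: one with the field, one without it
with_sponsor = {
    'title': 'A sponsored study',
    'sponsorships': [{'sponsor': {'sponsorName': 'National Cancer Institute (NCI)'}}],
}
without_sponsor = {'title': 'An unsponsored study'}

print(sponsor_names(with_sponsor))
print(sponsor_names(without_sponsor))
```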



Now, let's see how many results do not have tags.

In [12]:
tags_query = {
    "query": {
        "query_string": {
            "analyze_wildcard": True, 
            "query": "NOT tags:*"
        }
    }
}

In [13]:
results_without_tags = query_share(search_url.url, tags_query)
total_results = requests.get(search_url.url).json()['count']
results_percent = (float(results_without_tags['count'])/total_results)*100

print(
    '{} results out of {}, or {}%, do not have tags.'.format(
        results_without_tags['count'],
        total_results,
        format(results_percent, '.2f')
    )
)



4150115 results out of 4264118, or 97.33%, do not have tags.
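
The arithmetic behind that number is simple enough to check by hand. Here's a small sketch that redoes it offline with the two counts from the output above; `percent` is a hypothetical helper, and the guard against a zero total is an addition that the cell above doesn't need.

```python
def percent(part, whole):
    """Return part as a percentage of whole, or 0.0 if whole is zero."""
    if whole == 0:
        return 0.0
    return 100.0 * part / whole


# Counts copied from the output above
without_tags = 4150115
total = 4264118

print('{:.2f}% do not have tags'.format(percent(without_tags, total)))
print('{:.2f}% have tags'.format(percent(total - without_tags, total)))
```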




## Aggregations

While searching for individual results is useful, the SHARE API also lets you make aggregation queries that give you results across the entirety of the SHARE dataset at once. This is useful if you're curious about the completeness of data sets.

For example, we can find the number of documents per source that are missing titles.

In [14]:
missing_titles_aggregation = {
    "query": {
        "query_string": {
            "analyze_wildcard": True, 
            "query": "NOT title:*"
        }
    }, 
    "aggs": {
        "sources": {
            "terms": {
                "field": "_type",  # A field where the SHARE source is stored
                "min_doc_count": 0,  # Include sources with no matching documents
                "size": 0  # Return all sources instead of only the top few
            }
        }
    }
}

In [15]:
# modify the args to make sure we return the raw elasticsearch results
search_url.args['raw'] = 'True'

In [16]:
results_without_titles = query_share(search_url.url, missing_titles_aggregation)

missing_titles_counts = results_without_titles['aggregations']['sources']['buckets']

for source in missing_titles_counts:
    print('{} has {} documents without titles'.format(source['key'], source['doc_count']))

dataone has 263947 documents without titles
biomedcentral has 22891 documents without titles
citeseerx has 9366 documents without titles
crossref has 4950 documents without titles
smithsonian has 101 documents without titles
pubmedcentral has 55 documents without titles
datacite has 54 documents without titles
bhl has 7 documents without titles
figshare has 5 documents without titles
scitech has 3 documents without titles
caltech has 2 documents without titles
iowaresearch has 2 documents without titles
rcaap has 2 documents without titles
dash has 1 documents without titles
duke has 1 documents without titles
icpsr has 1 documents without titles
lshtm has 1 documents without titles
mit has 1 documents without titles
shareok has 1 documents without titles
ucescholarship has 1 documents without titles
uiucideals has 1 documents without titles
addis_ababa has 0 documents without titles
arxiv_oai has 0 documents without titles
asu has 0 documents without titles
calhoun has 0 documents wit
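
Since we asked for every source, most of that list is zeros. Here's a sketch of filtering the buckets down to the sources that are actually missing titles. It runs on three sample buckets copied from the output above (with the same `key` and `doc_count` fields used in the loop), and `nonzero_counts` is a hypothetical helper name.

```python
# Sample buckets copied from the output above
buckets = [
    {'key': 'dataone', 'doc_count': 263947},
    {'key': 'mit', 'doc_count': 1},
    {'key': 'asu', 'doc_count': 0},
]


def nonzero_counts(buckets):
    """Map source name to count, skipping sources with no matching documents."""
    return {b['key']: b['doc_count'] for b in buckets if b['doc_count'] > 0}


for source, count in sorted(nonzero_counts(buckets).items()):
    print('{} has {} documents without titles'.format(source, count))
```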



This information isn't terribly useful if we don't also know how many documents are in each source.

Let's get that information as well, along with stats for what percentage of documents from each source are missing titles.

We'll do this with an elasticsearch "significant terms" aggregation. We're only interested in sources that have 1 or more documents without a title; any source that doesn't show up in the results has titles for all of its documents.

In [17]:
sig_terms_agg = {
    "query": {
        "query_string": {
            "analyze_wildcard": True, 
            "query": "NOT title:*"
        }
    },
    "aggs": {
        "sources":{
            "significant_terms":{
                "field": "_type",  # A field where the SHARE source is stored
                "min_doc_count": 1,  # Only sources with at least one matching document
                "percentage": {}  # This will make the "score" parameter a percentage
            }
        }
    }
}

In [18]:
docs_with_no_title_results = query_share(search_url.url, sig_terms_agg)
docs_with_no_title = docs_with_no_title_results['aggregations']['sources']['buckets']



In [19]:
for source in docs_with_no_title:
    print(
        '{} has {}/{} or {}% with no titles'.format(
            source['key'],
            source['doc_count'],
            source['bg_count'],
            format(source['score']*100, '.2f')
        )
    )

biomedcentral has 22891/24831 or 92.19% with no titles
dataone has 263947/356154 or 74.11% with no titles
citeseerx has 9366/184037 or 5.09% with no titles
smithsonian has 101/7327 or 1.38% with no titles
crossref has 4950/911594 or 0.54% with no titles
duke has 1/336 or 0.30% with no titles
dash has 1/906 or 0.11% with no titles
caltech has 2/5343 or 0.04% with no titles
uiucideals has 1/3133 or 0.03% with no titles
iowaresearch has 2/6417 or 0.03% with no titles


## Using SHAREPA for SHARE Parsing and Analysis

While you can always pass raw elasticsearch queries to the SHARE API, there is also a pip-installable Python library that makes elasticsearch aggregations a little simpler. This library is called [sharepa - short for SHARE Parsing and Analysis](https://github.com/CenterForOpenScience/sharepa#sharepa).

### Basic Actions

A basic search will provide access to all documents in SHARE in 10-document slices.

#### Count
You can use sharepa and the basic search to get the total number of documents in SHARE.

In [20]:
basic_search.count()

4264118

#### Iterating Through Results
Executing the basic search will send the actual basic query to the SHARE API and then let you iterate through results, 10 at a time.

In [21]:
results = basic_search.execute()

for hit in results:
    print(hit.title)

Avian community structure and incidence of human West Nile infection
Rat12_a
Non compact continuum limit of two coupled Potts models

Simultaneous Localization, Mapping, and Manipulation for Unsupervised
  Object Discovery
Synthesis of High-Temperature Self-lubricating Wear Resistant Composite Coating on Ti6Al4V Alloy by Laser Deposition
Comparative Studies of Silicon Dissolution in Molten Aluminum Under Different Flow Conditions, Part I: Single-Phase Flow
Scrambling of data in all-optical domain
Step behaviour and autonomic nervous system activity in multiparous dairy cows during milking in a herringbone milking system
<p>Typical features of the constant velocity forced dissociation process in the SGP-3-ligated 1G1Q 2CR complex system.</p>


If we don't want 10 results, or we want to offset the results, we can use slices.

In [22]:
results = basic_search[20:25].execute()
for hit in results:
    print(hit.title)

Elements of Trust in Named-Data Networking
Effect of Perceived Attributions about Ostracism on Social Pain and Task Performance
Millimeter Wave MIMO Channel Tracking Systems
Metric Dimension and Zero Forcing Number of Two Families of Line Graphs
The Glassey conjecture on asymptotically flat manifolds


### Advanced Search with sharepa

You can make your own search object, which allows you to pass in custom queries for certain terms or SHARE fields. Queries are formed using [Lucene query syntax](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax), just like we used in the above examples.

In [23]:
my_search = ShareSearch()

my_search = my_search.query(
    'query_string', # Type of query, will accept a lucene query string
    query='NOT tags:*', # This lucene query string will find all documents that don't have tags
    analyze_wildcard=True  # This will make elasticsearch pay attention to the asterisk (which matches anything)
)

This query is of type 'query_string'. Other options include a match query, a multi-match query, a bool query, and any other query structure available in the elasticsearch API.

We can see the query that we're about to send to elasticsearch by using the pretty print helper function. You'll see that it looks very similar to the queries we defined by hand earlier.

In [24]:
pretty_print(my_search.to_dict())

{
    "query": {
        "query_string": {
            "analyze_wildcard": true, 
            "query": "NOT tags:*"
        }
    }
}
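
For comparison, here's roughly what a match query and a bool query look like when written out by hand as plain dictionaries, the same way we wrote the raw queries earlier. The search terms are made up for illustration, and this sketch only builds and prints the query bodies; it doesn't send them to SHARE.

```python
import json

# A match query: documents whose title matches 'giraffes'
match_query = {'query': {'match': {'title': 'giraffes'}}}

# A bool query: the title must match 'giraffes' but must not match 'ancient'
bool_query = {
    'query': {
        'bool': {
            'must': [{'match': {'title': 'giraffes'}}],
            'must_not': [{'match': {'title': 'ancient'}}],
        }
    }
}

print(json.dumps(match_query, indent=4, sort_keys=True))
print(json.dumps(bool_query, indent=4, sort_keys=True))
```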


When you execute that query, you can then iterate through the results the same way that you could with the simple search query.

In [25]:
new_results = my_search.execute()
for hit in new_results:
    print(hit.title)

Avian community structure and incidence of human West Nile infection
Non compact continuum limit of two coupled Potts models

Simultaneous Localization, Mapping, and Manipulation for Unsupervised
  Object Discovery
Synthesis of High-Temperature Self-lubricating Wear Resistant Composite Coating on Ti6Al4V Alloy by Laser Deposition
Comparative Studies of Silicon Dissolution in Molten Aluminum Under Different Flow Conditions, Part I: Single-Phase Flow
Scrambling of data in all-optical domain
Step behaviour and autonomic nervous system activity in multiparous dairy cows during milking in a herringbone milking system
<p>Typical features of the constant velocity forced dissociation process in the SGP-3-ligated 1G1Q 2CR complex system.</p>
The elusive shepherdess


#### Aggregations with sharepa

You can also use sharepa to do aggregations, like we did with the above long query.

We can add an aggregation to my_search that will give us the number of documents per source that meet that previously defined search query (in our case, items that don't have tags). Here's what adding that aggregation will look like:

In [26]:
my_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'significant_terms',  # There are many kinds of aggregations
    field='_type',  # We store the source of a document in its type, so this will aggregate by source
    min_doc_count=1,
    percentage={},
    size=0
)

SignificantTerms(field='_type', min_doc_count=1, percentage={}, size=0)

We can see which query is actually going to be sent to elasticsearch by printing out the query. This is very similar to the queries we were defining by hand up above.

In [27]:
pretty_print(my_search.to_dict())

{
    "query": {
        "query_string": {
            "analyze_wildcard": true, 
            "query": "NOT tags:*"
        }
    }, 
    "aggs": {
        "sources": {
            "significant_terms": {
                "field": "_type", 
                "percentage": {}, 
                "min_doc_count": 1, 
                "size": 0
            }
        }
    }
}


In [28]:
aggregated_results = my_search.execute()

for source in aggregated_results.aggregations['sources']['buckets']:
    print(
        '{} -- {}% do not have tags'.format(
            source['key'], 
            format(source['score']*100, '.2f')
        )
    )

nist -- 100.00% do not have tags
purdue -- 100.00% do not have tags
sldr -- 100.00% do not have tags
pcurio -- 99.53% do not have tags
oaktrust -- 98.77% do not have tags
scholarsbank -- 98.33% do not have tags
hacettepe -- 97.78% do not have tags
addis_ababa -- 97.68% do not have tags
figshare -- 97.68% do not have tags
texasstate -- 96.85% do not have tags
calhoun -- 96.61% do not have tags
citeseerx -- 96.19% do not have tags
springer -- 95.89% do not have tags
dailyssrn -- 95.63% do not have tags
valposcholar -- 95.53% do not have tags
columbia -- 95.36% do not have tags
digitalhoward -- 95.02% do not have tags
plos -- 94.98% do not have tags
biomedcentral -- 94.76% do not have tags
scholarsarchiveosu -- 94.44% do not have tags
crossref -- 94.44% do not have tags
dash -- 94.38% do not have tags
upennsylvania -- 93.75% do not have tags
cuscholar -- 93.60% do not have tags
cyberleninka -- 93.48% do not have tags
bhl -- 93.12% do not have tags
pcom -- 93.12% do not have tags
asu -- 92