# FORCE 11

Introduction to SHARE Queries and the current state of SHARE Data

## The SHARE Schema

### https://github.com/CenterForOpenScience/share-schema

### JSON Schema
#### http://json-schema.org/
- describes your existing data format
- clear, human- and machine-readable documentation
- complete structural validation, useful for
    - automated testing
    - validating client-submitted data

The schema is written in YAML in the main schema repo:
```
$schema: "http://json-schema.org/draft-04/schema#"
type: "object"
description: "This is the Beta schema for the SHARE project."
properties:
    title:
        description: The title and any sub-titles of the resource.
        type: "string"
    contributors:
        description: The people or organizations responsible for making contributions to an object.
        type: array
        items:
            anyOf:
                - $ref: "#/definitions/person"
                - $ref: "#/definitions/organization"
```

This YAML is converted to JSON, which is the form used by scrapi (SHARE's data processing pipeline):

```
"properties": {
    "title": {
        "type": "string",
        "description": "The title and any sub-titles of the resource."
    },
    "contributors": {
        "items": {
            "anyOf": [
                {
                    "$ref": "#/definitions/person"
                },
                {
                    "$ref": "#/definitions/organization"
                }
            ]
        },
        "type": "array",
        "description": "The people or organizations responsible for making contributions to an object."
    }
}
```
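To make "structural validation" concrete, here is a minimal sketch of checking a record against the fragment above. It is not a real JSON Schema implementation: it handles only the keywords that appear in the fragment (`type`, `properties`, `items`, `anyOf`, local `$ref`), and the `person` and `organization` definitions are hypothetical, heavily simplified stand-ins for the ones in the schema repo. In practice a full validator library would do this job.

```python
import json

# The fragment shown above, plus HYPOTHETICAL simplified definitions for
# person and organization (assumptions for illustration only).
SCHEMA = json.loads("""
{
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "contributors": {
            "type": "array",
            "items": {
                "anyOf": [
                    {"$ref": "#/definitions/person"},
                    {"$ref": "#/definitions/organization"}
                ]
            }
        }
    },
    "definitions": {
        "person": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "email": {"type": "string"}}
        },
        "organization": {
            "type": "object",
            "properties": {"name": {"type": "string"}}
        }
    }
}
""")

# Only the three JSON types used in the fragment are mapped.
TYPE_MAP = {"string": str, "object": dict, "array": list}


def is_valid(value, schema, root):
    """Check value against the small keyword subset used in the fragment."""
    if "$ref" in schema:
        # Only local references of the form "#/definitions/<name>" are handled.
        name = schema["$ref"].split("/")[-1]
        return is_valid(value, root["definitions"][name], root)
    if "anyOf" in schema:
        return any(is_valid(value, option, root) for option in schema["anyOf"])
    expected = schema.get("type")
    if expected and not isinstance(value, TYPE_MAP[expected]):
        return False
    if isinstance(value, dict):
        for key, subschema in schema.get("properties", {}).items():
            if key in value and not is_valid(value[key], subschema, root):
                return False
    if isinstance(value, list) and "items" in schema:
        return all(is_valid(item, schema["items"], root) for item in value)
    return True


good = {"title": "A title", "contributors": [{"name": "Fraser, Scott E."}]}
bad = {"title": "A title", "contributors": ["Fraser, Scott E."]}  # bare string, not an object

print(is_valid(good, SCHEMA, SCHEMA))
print(is_valid(bad, SCHEMA, SCHEMA))
```

The second record fails because a bare string matches neither the person nor the organization branch of `anyOf`.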

## Exploring the data in SHARE

### Names are difficult to disambiguate

SHARE attempts to break names into Given, Family, and Additional pieces. The [SHARE person schema](https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml#L42) also has fields for ```email``` and ```affiliation```, plus a ```sameAs``` field for links to other identifiers such as ORCIDs.

Let's run a query that shows the contributor names on the 5 most recent documents in SHARE.

In [1]:
# Making the Query

import furl
import requests


def query_share(size, query=None):
    SHARE_API = 'https://osf.io/api/v1/share/search/'
    search_url = furl.furl(SHARE_API)
    search_url.args['size'] = size
    search_url.args['sort'] = 'providerUpdatedDateTime'
    if query:
        search_url.args['q'] = query
    return requests.get(search_url.url).json()

def print_title_contributors(results):
    for result in results['results']:
        print(result['title'].encode('utf-8'))
        print('~~~~~~~~~')
        for name in result['contributors']:
            print(name['name'])
        print('-------------------------------------------')
        
results = query_share(5)
print_title_contributors(results)

Offshore petroleum security: Analysis of offshore security threats, target attractiveness, and the international legal framework for the protection and security of offshore petroleum installations
~~~~~~~~~
Kashubsky, Mikhail
-------------------------------------------
Gradual Appearance of a Regulated Retinotectal Projection Pattern in Xenopus laevis
~~~~~~~~~
O'Rourke, Nancy A.
Fraser, Scott E.
-------------------------------------------
Wound Healing, Cell Communication, and DNA Synthesis during lmaginal Disc Regeneration in Drosophila
~~~~~~~~~
Bryant, Peter J.
Fraser, Scott E.
-------------------------------------------
On the Optimal Density for Real-Time Data Gathering of Spatio-Temporal Processes in Sensor Networks
~~~~~~~~~
Cristescu, Răzvan
Vetterli, Martin
-------------------------------------------
Molecular Mechanisms of Avian Neural Crest Cell Migration on Fibronectin and Laminin
~~~~~~~~~
Perris, Roberto
Paulsson, Mats
Bronner-Fraser, Marianne
---------------------------
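For readers who want to see what `query_share` actually requests, the sketch below rebuilds the same URL with only the standard library and makes no network call. The endpoint and parameter names are taken from the cell above; `build_share_url` is a hypothetical helper name, and the percent-encoding produced by `urlencode` may differ cosmetically from what `furl` emits.

```python
from urllib.parse import urlencode

SHARE_API = 'https://osf.io/api/v1/share/search/'


def build_share_url(size, query=None):
    """Build the search URL without sending a request (hypothetical helper)."""
    params = [('size', size), ('sort', 'providerUpdatedDateTime')]
    if query:
        # The optional query string is passed through as the "q" parameter.
        params.append(('q', query))
    return SHARE_API + '?' + urlencode(params)


print(build_share_url(5))
# https://osf.io/api/v1/share/search/?size=5&sort=providerUpdatedDateTime
```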

#### Names don't always show up in the same format

Let's choose a name, remove the middle initial, and see if we get a result.

In [2]:
from sharepa import ShareSearch

def search_a_name(name):
    name_search = ShareSearch()
    name_search = name_search.query(
        {
            "bool": {
                "should": [
                    {
                        "match": {
                            "contributors.name": {
                                "query": name, 
                                "operator": "and",
                                "type" : "phrase"
                            }
                        }
                    }
                ]
            }
        }
    )
    
    return name_search

def print_name_results(name):
    count = search_a_name(name).count()  # run the count request once and reuse it
    if count == 1:
        print('There is {} document with the contributor {}'.format(count, name))
    else:
        print('There are {} documents with the contributor {}'.format(count, name))

print_name_results('Meyerowitz, Elliot M.')
print_name_results('Meyerowitz, Elliot')
print_name_results('Elliot Meyerowitz')
print_name_results('Elliot M Meyerowitz')
print_name_results('Elliot M. Meyerowitz')

There are 27 documents with the contributor Meyerowitz, Elliot M.
There are 31 documents with the contributor Meyerowitz, Elliot
There is 1 document with the contributor Elliot Meyerowitz
There is 1 document with the contributor Elliot M Meyerowitz
There is 1 document with the contributor Elliot M. Meyerowitz
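The five spellings above refer to the same person but match different sets of documents. One naive way to group such variants is to reduce each string to a comparison key. The sketch below is only an illustration of the idea, not something SHARE does: `name_key` is a hypothetical helper, it assumes a name is either "Family, Given ..." or "Given ... Family", and it would wrongly merge different people who share a family name and first given name.

```python
def name_key(name):
    """Reduce a name string to a lowercased (family, first given name) pair."""
    if ',' in name:
        # "Family, Given Additional"
        family, _, rest = name.partition(',')
        given_parts = rest.split()
    else:
        # "Given Additional Family"
        parts = name.split()
        if not parts:
            return ('', '')
        family, given_parts = parts[-1], parts[:-1]
    given = given_parts[0] if given_parts else ''
    return (family.strip().lower(), given.strip('.').lower())


variants = [
    'Meyerowitz, Elliot M.',
    'Meyerowitz, Elliot',
    'Elliot Meyerowitz',
    'Elliot M Meyerowitz',
    'Elliot M. Meyerowitz',
]

# All five variants collapse to a single key.
print({name_key(v) for v in variants})
```

Dropping the middle initial is a deliberate simplification here; keeping it would separate some distinct people but would also split the variants that omit it.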


## Identifiers are Difficult to Identify

Here's a query that searches for all documents with at least one contributor who has an ORCID.

In [3]:
recent_results = query_share(3, 'contributors.sameAs:*orcid*')

print('There are {} results'.format(recent_results['count']))
print('----------')
for result in recent_results['results']:
    print(result['title'].encode('utf-8'))
    print(result['shareProperties']['source'])
    print('~~~~~~~~~')
    for name in result['contributors']:
        print('{} - {}'.format(name['name'].encode('utf-8'), name['sameAs']))
    print('-------------------------------------------')

There are 29231 results
----------
Size Still Matters, Although It Shouldn’t: The Debate on Small Cetaceans, IWC 65, and Monaco’s Resolution on Highly Migratory Cetaceans
crossref
~~~~~~~~~
Ed Couzens - [u'http://orcid.org/0000-0001-9321-1912']
-------------------------------------------
The collection of MicroED data for macromolecular crystallography
crossref
~~~~~~~~~
Dan Shi - []
Brent L Nannenga - []
M Jason de la Cruz - []
Jinyang Liu - []
Steven Sawtelle - []
Guillermo Calero - []
Francis E Reyes - []
Johan Hattne - [u'http://orcid.org/0000-0002-8936-0912']
Tamir Gonen - []
-------------------------------------------
NO PRECISE LOCALIZATION FOR FRB 150418: CLAIMED RADIO TRANSIENT IS AGN VARIABILITY
crossref
~~~~~~~~~
P. K. G. Williams - [u'http://orcid.org/0000-0003-3734-3587']
E. Berger - [u'http://orcid.org/0000-0002-9392-9681']
-------------------------------------------


Not every contributor on each document has an ORCID supplied.
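To put a number on that, the sketch below counts how many contributors on a single document carry an ORCID link. The sample data is copied from the second result above. `orcid_coverage` is a hypothetical helper, and matching on the substring `orcid.org` is an assumption about how ORCIDs appear in `sameAs`.

```python
def orcid_coverage(contributors):
    """Return (contributors with an ORCID link, total contributors)."""
    with_orcid = [
        c for c in contributors
        if any('orcid.org' in link for link in c.get('sameAs', []))
    ]
    return len(with_orcid), len(contributors)


# Contributors of "The collection of MicroED data for macromolecular
# crystallography", as printed in the output above.
contributors = [
    {'name': 'Dan Shi', 'sameAs': []},
    {'name': 'Brent L Nannenga', 'sameAs': []},
    {'name': 'M Jason de la Cruz', 'sameAs': []},
    {'name': 'Jinyang Liu', 'sameAs': []},
    {'name': 'Steven Sawtelle', 'sameAs': []},
    {'name': 'Guillermo Calero', 'sameAs': []},
    {'name': 'Francis E Reyes', 'sameAs': []},
    {'name': 'Johan Hattne', 'sameAs': ['http://orcid.org/0000-0002-8936-0912']},
    {'name': 'Tamir Gonen', 'sameAs': []},
]

have, total = orcid_coverage(contributors)
print('{} of {} contributors have an ORCID'.format(have, total))
# 1 of 9 contributors have an ORCID
```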

### Try the same for another form of identifier - an email address

In [4]:
# Contributors with Email Addresses

results = query_share(3, 'contributors.email:*')

print('There are {} results'.format(results['count']))
print('----------')
for result in results['results']:
    print(result['title'].encode('utf-8'))
    print(result['shareProperties']['source'])
    print('~~~~~~~~~')
    for name in result['contributors']:
        print('{} - {}'.format(name['name'].encode('utf-8'), name.get('email')))
    print('-------------------------------------------')


There are 52298 results
----------
Sphagnum phylogenetic tree
dataone
~~~~~~~~~
Granath, Gustaf - fia.bengtsson@ebc.uu.se
-------------------------------------------
Data from: Near infrared spectroscopy (NIRS) predicts non-structural carbohydrate concentrations in different tissue types of a broad range of tree species
dataone
~~~~~~~~~
Ramirez, Jorge A. - ramirez_correa.jorge_andres@courrier.uqam.ca
Vohland, Michael - None
Posada, Juan M. - None
Reu, Björn - None
Messier, Christian - None
Hoch, Günter - None
Handa, I. Tanya - None
-------------------------------------------
Candolin & Tukiainen Red area and intensity
dataone
~~~~~~~~~
Candolin, Ulrika - ulrika.candolin@helsinki.fi
Tukiainen, Iina - None
-------------------------------------------


## No shared taxonomy for subjects or explicit document types - manuscripts, data, figures, etc.

One of our developers is working on this right now!

The Dublin Core (DC) field "Type" is an excellent place to start. However, there is a lot of variation in what sources put in this field.

"Element Description: The nature or genre of the content of the resource. Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary (for example, the DCMIType vocabulary)."

In [5]:
# Type

import pandas as pd
from sharepa import ShareSearch, basic_search
from sharepa.helpers import pretty_print

type_search = ShareSearch()
total_documents = basic_search.count()

type_search.aggs.bucket(
    'typeTermFilter',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations
    field='otherProperties.properties.type',  # Aggregate on the document type reported by each source
    min_doc_count=1,
    exclude="of|and|or",  # Leave the terms "of", "and", and "or" out of the buckets
    size=50,
)

type_results_executed = type_search.execute()

type_results = type_results_executed.aggregations.typeTermFilter.to_dict()['buckets']

type_dataframe = pd.DataFrame(type_results)
type_dataframe['percent'] = (type_dataframe['doc_count'] / total_documents)*100
type_dataframe

|   | doc_count | key | percent |
|---|-----------|-----|---------|
| 0 | 1308082 | article | 21.748706 |
| 1 | 1161817 | text | 19.316844 |
| 2 | 1056061 | journal | 17.558502 |
| 3 | 195459 | paper | 3.249781 |
| 4 | 188823 | book | 3.139448 |
| 5 | 185616 | figure | 3.086127 |
| 6 | 185067 | dataset | 3.077 |
| 7 | 115957 | chapter | 1.927948 |
| 8 | 109111 | thesis | 1.814124 |
| 9 | 88958 | info:eu | 1.479052 |
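As a rough sketch of what the cell above does, the snippet below writes out the aggregation as a plain request body and recomputes the percent column without pandas. The request body is my assumption about how the `aggs.bucket(...)` call serializes, not something taken from sharepa's documentation. The bucket counts and the document total (6014528) are copied from the outputs in this notebook, and `add_percent` is a hypothetical helper.

```python
import json

# Assumed shape of the terms aggregation built by the cell above.
aggregation_body = {
    "aggs": {
        "typeTermFilter": {
            "terms": {
                "field": "otherProperties.properties.type",
                "min_doc_count": 1,
                "exclude": "of|and|or",
                "size": 50,
            }
        }
    }
}
print(json.dumps(aggregation_body, indent=2))

# First three buckets and the total, copied from this notebook's output.
total_documents = 6014528
buckets = [
    {"key": "article", "doc_count": 1308082},
    {"key": "text", "doc_count": 1161817},
    {"key": "journal", "doc_count": 1056061},
]


def add_percent(buckets, total):
    """Return copies of the buckets with a percent-of-total field added."""
    return [dict(b, percent=b["doc_count"] / total * 100) for b in buckets]


for row in add_percent(buckets, total_documents):
    print('{key:<10} {doc_count:>9} {percent:6.2f}%'.format(**row))
```

Because one document can contribute several terms to this aggregation, the percentages are shares of all documents per term and need not sum to 100.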


## Documents with Descriptions

Are abstracts copyrightable?

Abstracts, summaries and descriptions are not always made available. 

In [7]:
# How many documents have descriptions?
from __future__ import division

results = query_share(10, 'description:*')

print('There are {} results'.format(results['count']))
print('{}/{} or {}% of results have descriptions'.format(results['count'], total_documents, (results['count']/total_documents)*100))

There are 3297772 results
3297772/6014528 or 54.8301047065% of results have descriptions
