# SHARE Query Requests from the Community

Here's where we keep track of code that answers common questions from members of the SHARE community!

## Setup

In [1]:
import json
import requests

SHARE_SEARCH_API = 'https://osf.io/api/v1/share/search/'
ALL_PROVIDER_INFO = requests.get('https://osf.io/api/v1/share/providers/').json()['providerMap']

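def query_share(url, query):
    # POST an Elasticsearch-style query body as JSON and return the parsed response
    headers = {'Content-Type': 'application/json'}
    data = json.dumps(query)
    # Note: verify=False turns off TLS certificate verification
    return requests.post(url, headers=headers, data=data, verify=False).json()

def get_longname_for_shortname(shortname):
    # Look up a source's display name; returns None for an unknown short name
    provider = ALL_PROVIDER_INFO.get(shortname)
    return provider['long_name'] if provider is not None else None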

## Queries

In [2]:
# What's the earliest and latest document from each source?

import pandas as pd

date_stats_agg = {
    "aggregations": {
        "sources": {
            "terms": {"field": "_type", "size": 0},
            "aggregations": {
                "source_stats": {
                    "stats": {"field": "providerUpdatedDateTime"}
                }
            }
        }
    }
}

date_results = query_share(SHARE_SEARCH_API, date_stats_agg)['aggregations']['sources']['buckets']

date_results_df = pd.DataFrame()
date_results_df['source_shortname'] = [result['key'].encode('utf-8') for result in date_results]
date_results_df['source_longname'] = [get_longname_for_shortname(name).encode('utf-8') for name in date_results_df['source_shortname']]
date_results_df['earliest_date'] = [result['source_stats']['min_as_string'] for result in date_results]
date_results_df['latest_date'] = [result['source_stats']['max_as_string'] for result in date_results]
date_results_df



| | source_shortname | source_longname | earliest_date | latest_date |
|---|---|---|---|---|
| 0 | datacite | DataCite MDS | 2015-07-26T00:03:30.000Z | 2016-05-21T01:58:23.000Z |
| 1 | crossref | CrossRef | 2014-08-03T00:00:00.000Z | 2016-05-22T00:00:00.000Z |
| 2 | figshare | figshare | 2014-10-28T00:00:00.000Z | 2016-05-20T22:13:00.000Z |
| 3 | scitech | DoE's SciTech Connect Database | 2014-10-03T00:00:00.000Z | 2016-05-20T00:00:00.000Z |
| 4 | pubmedcentral | PubMed Central | 2014-12-28T00:00:00.000Z | 2016-05-20T00:00:00.000Z |
| 5 | dataone | DataONE: Data Observation Network for Earth | 2015-04-11T00:00:00.000Z | 2016-05-21T00:00:00.000Z |
| 6 | arxiv_oai | ArXiv | 2014-10-03T00:00:00.000Z | 2016-05-20T00:00:00.000Z |
| 7 | rcaap | RCAAP - Repositório Científico de Acesso Abert... | 2015-12-27T02:00:54.000Z | 2016-05-22T02:14:38.000Z |
| 8 | citeseerx | CiteSeerX Scientific Literature Digital Librar... | 2008-07-01T00:00:00.000Z | 2016-05-22T00:00:00.000Z |
| 9 | cyberleninka | CyberLeninka - Russian open access scientific ... | 2015-12-22T00:00:00.000Z | 2016-05-21T23:20:05.000Z |
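
The min_as_string and max_as_string values are ISO 8601 timestamps, so the coverage window for each source can be computed from them. Here is a minimal sketch using only the standard library, with two rows copied from the output above. The helper name coverage_days is made up for illustration, and the timestamp layout is assumed to match what is shown in the output.

```python
from datetime import datetime

# Timestamp layout assumed from the min_as_string / max_as_string values above
TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'

def coverage_days(earliest, latest):
    """Return the number of whole days between two SHARE timestamps."""
    start = datetime.strptime(earliest, TIMESTAMP_FORMAT)
    end = datetime.strptime(latest, TIMESTAMP_FORMAT)
    return (end - start).days

# Two rows copied from the output above
sample_rows = [
    ('crossref', '2014-08-03T00:00:00.000Z', '2016-05-22T00:00:00.000Z'),
    ('citeseerx', '2008-07-01T00:00:00.000Z', '2016-05-22T00:00:00.000Z'),
]

coverage = {name: coverage_days(lo, hi) for name, lo, hi in sample_rows}
for name, days in sorted(coverage.items()):
    print('{} -- {} days'.format(name, days))
```

The same function could be applied to the earliest_date and latest_date columns of date_results_df to add a coverage column.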


In [3]:
# Uncomment the following lines if running locally - will save to file formats

# date_results_df.to_csv('SHARE_Min_Max_dates.csv')
# date_results_df.to_excel('SHARE_Min_Max_dates.xlsx')

## Lucene Search and NOT Queries

A user wanted to know how to query for one term while excluding another. The q parameter accepts Lucene query syntax, so the NOT operator does this.

In [4]:
query = '?q=pedigree NOT child'

In [5]:
results = requests.get(SHARE_SEARCH_API + query).json()
results

{u'aggregations': None,
 u'aggs': None,
 u'count': 1072,
 u'results': [{u'contributors': [{u'additionalName': u'',
     u'familyName': u'Tung',
     u'givenName': u'Jenny',
     u'name': u'Tung, Jenny'},
    {u'additionalName': u'B.',
     u'familyName': u'Barriero',
     u'givenName': u'Luis',
     u'name': u'Barriero, Luis B.'},
    {u'additionalName': u'B.',
     u'familyName': u'Burns',
     u'givenName': u'Michael',
     u'name': u'Burns, Michael B.'},
    {u'additionalName': u'C.',
     u'familyName': u'Grenier',
     u'givenName': u'J.',
     u'name': u'Grenier, J. C.'},
    {u'additionalName': u'',
     u'familyName': u'Lynch',
     u'givenName': u'Josh',
     u'name': u'Lynch, Josh'},
    {u'additionalName': u'E.',
     u'familyName': u'Grieneisen',
     u'givenName': u'L.',
     u'name': u'Grieneisen, L. E.'},
    {u'additionalName': u'',
     u'familyName': u'Altmann',
     u'givenName': u'J.',
     u'name': u'Altmann, J.'},
    {u'additionalName': u'C.',
     u'familyName':
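
The query string above can also be assembled from its parts. This is a minimal sketch, assuming the endpoint treats q as a Lucene query string as in the cell above; build_not_query is a hypothetical helper, not part of any SHARE library. It percent-encodes the query so that spaces and quoted phrases are safe to paste into a browser or another tool.

```python
from urllib.parse import quote  # Python 3; in Python 2 use `from urllib import quote`

def build_not_query(include, exclude):
    """Build a '?q=' query string that requires `include` and rejects `exclude`."""
    lucene = '{} NOT {}'.format(include, exclude)
    # Percent-encode so spaces and double quotes survive inside a URL
    return '?q=' + quote(lucene)

print(build_not_query('pedigree', 'child'))
# A quoted phrase can be excluded the same way
print(build_not_query('pedigree', '"family history"'))
```

The result can be appended to SHARE_SEARCH_API exactly like the query variable above.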

## Querying by Document Type

Currently, document type is not curated by SHARE. However, we do collect from many sources that use the OAI-PMH metadata protocol, which includes dc:type. You can search that field in SHARE for now, until the harvesters collect and curate document type.

In [6]:
query = '?q=otherProperties.properties.type:article'


In [7]:
results = requests.get(SHARE_SEARCH_API + query).json()

for result in results['results']:
    for prop in result['otherProperties']:
        if prop['name'] == 'type':
            print(prop)
    print(result['title'])
    print(result['uris']['canonicalUri'])
    

{u'name': u'type', u'properties': {u'type': u'article'}}
The Phonetics of Register in Takhian Thong Chong
http://www.escholarship.org/uc/item/1t52p7qb
{u'name': u'type', u'properties': {u'type': u'article'}}
Advancing methodological knowledge within state and local demography: a case study.
http://www.escholarship.org/uc/item/1j85x2qp
{u'name': u'type', u'properties': {u'type': u'article'}}
[Front Matter]
http://www.escholarship.org/uc/item/2fw6515j
{u'name': u'type', u'properties': {u'type': u'article'}}
No More "Kicking the Can down the Road"
http://www.escholarship.org/uc/item/2tj8r6j2
{u'name': u'type', u'properties': {u'type': u'article'}}
Finding a Winning Strategy Against the MP3 Invasion: Supplemental Measures the Recording Industry Must Take to Curb Online Piracy
http://www.escholarship.org/uc/item/61w7d07w
{u'name': u'type', u'properties': {u'type': u'article'}}
Proposition 11 - What It Will Do
http://www.escholarship.org/uc/item/9867d79x
{u'name': u'type', u'properties': {u'
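
The loop above prints each type property as it is found. To tally types across a page of results instead, the lookup can be pulled into a function. This is a sketch under two assumptions: records are shaped like the output above, and a type value may be either a single string or a list of strings (the list case is a guess, not something shown in the output). get_dc_types is a hypothetical name.

```python
from collections import Counter

def get_dc_types(result):
    """Return every dc:type value recorded in a result's otherProperties."""
    types = []
    for prop in result.get('otherProperties', []):
        if prop.get('name') == 'type':
            value = prop.get('properties', {}).get('type')
            # Assumption: the value is a string or a list of strings
            if isinstance(value, list):
                types.extend(value)
            elif value is not None:
                types.append(value)
    return types

# Hand-written records shaped like the output above; the second one is invented
sample_results = [
    {'title': 'The Phonetics of Register in Takhian Thong Chong',
     'otherProperties': [{'name': 'type', 'properties': {'type': 'article'}}]},
    {'title': 'Hypothetical record without a type', 'otherProperties': []},
]

type_counts = Counter(t for result in sample_results for t in get_dc_types(result))
print(type_counts.most_common())
```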

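Here is an analysis of the top terms found in SHARE's collected dc:type field. The counts appear to be per token rather than per full value: info:eu, repo, and semantics in the output below share one count, as they would if they all came from a single value such as info:eu-repo/semantics/article.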

In [8]:
import pandas as pd
from sharepa import ShareSearch, basic_search
from sharepa.helpers import pretty_print

type_search = ShareSearch()
total_documents = basic_search.count()

type_search.aggs.bucket(
    'typeTermFilter',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations
    field='otherProperties.properties.type',
    exclude="of|and|or",
    size=50,
)

type_results_executed = type_search.execute()

type_results = type_results_executed.aggregations.typeTermFilter.to_dict()['buckets']

type_dataframe = pd.DataFrame(type_results)
type_dataframe['percent'] = (type_dataframe['doc_count'] / total_documents)*100

In [9]:
type_dataframe

| | doc_count | key | percent |
|---|---|---|---|
| 0 | 1807899 | article | 25.43421 |
| 1 | 1507145 | journal | 21.203088 |
| 2 | 1298430 | text | 18.266807 |
| 3 | 232890 | paper | 3.276385 |
| 4 | 207730 | dataset | 2.922425 |
| 5 | 202957 | book | 2.855276 |
| 6 | 197065 | figure | 2.772385 |
| 7 | 133246 | info:eu | 1.874555 |
| 8 | 133246 | repo | 1.874555 |
| 9 | 133246 | semantics | 1.874555 |
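
The same terms aggregation can be written as a plain request body, in the style of date_stats_agg from the first query, for use without sharepa. This is a sketch only: it assumes the search endpoint accepts this body the way it accepts the date aggregation, and the function names and the bucket numbers in the example are invented for illustration.

```python
def build_type_terms_agg(field='otherProperties.properties.type',
                         exclude='of|and|or', size=50):
    """Return a request body for a terms aggregation on `field`."""
    return {
        'aggregations': {
            'typeTermFilter': {
                'terms': {'field': field, 'exclude': exclude, 'size': size}
            }
        }
    }

def add_percent(buckets, total_documents):
    """Return copies of the buckets with a percent-of-total entry added."""
    return [
        dict(bucket, percent=100.0 * bucket['doc_count'] / total_documents)
        for bucket in buckets
    ]

agg_body = build_type_terms_agg()

# Invented numbers, only to show the calculation
example_buckets = [{'key': 'article', 'doc_count': 25},
                   {'key': 'text', 'doc_count': 5}]
for row in add_percent(example_buckets, total_documents=100):
    print('{key}: {doc_count} documents, {percent:.1f}%'.format(**row))
```

With a live connection, the buckets would come from query_share(SHARE_SEARCH_API, agg_body)['aggregations']['typeTermFilter']['buckets'].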


## Query by Exact Phrase

Question -- Is there a way to search SHARE for a specific phrase? For example, the queries information literacy, information AND literacy, and "information literacy" all return records containing both terms, but not necessarily as the phrase "information literacy": the two words can appear in different parts of the record. An Elasticsearch match_phrase query restricts results to records where the words appear together and in order within one field.

In [10]:
phrase_query = {
    "query": {
        "match_phrase" : {
            "title" : "information literacy"
        }
    }
}

results = query_share(SHARE_SEARCH_API, phrase_query)

for result in results['results']:
    print(
        '{} -- from {} -- {}'.format(
            result['title'].encode('utf-8'),
            result['shareProperties']['source'].encode('utf-8'),
            result['uris']['canonicalUri'].encode('utf-8')
        )
    )


Information Literacy -- from uiucideals -- http://hdl.handle.net/2142/41497
Information Literacy Inventory -- from crossref -- http://dx.doi.org/10.1037/t32581-000
Information Literacy in Schools -- from crossref -- http://dx.doi.org/10.1201/b19843-8
Information Literacy Doll -- from datacite -- http://dx.doi.org/10.6084/M9.FIGSHARE.1012828.V1
MOBILE INFORMATION LITERACY CURRICULUM -- from uwashington -- http://hdl.handle.net/1773/34803
Information literacy at work -- from crossref -- http://dx.doi.org/10.1108/el-04-2014-0063
“Real Deal” Information Literacy -- from calpoly -- http://digitalcommons.calpoly.edu/lib_fac/95
Information Literacy (Book Review) -- from uiucideals -- http://hdl.handle.net/2142/41357
Teaching Information Literacy Online -- from cuny -- http://academicworks.cuny.edu/ulj/vol18/iss1/5
Data Information Literacy: Developing Data Information Literacy Programs -- from umassmed -- http://escholarship.umassmed.edu/escience_symposium/2016/program/6
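
The query body above can be generalized to other fields and phrases. The sketch below also exposes slop, the Elasticsearch match_phrase option that lets the words sit up to that many positions apart. Whether SHARE's endpoint passes slop through unchanged is an assumption, and build_phrase_query is a hypothetical helper name.

```python
import json

def build_phrase_query(field, phrase, slop=0):
    """Return a match_phrase request body for `phrase` in `field`.

    slop=0 demands the exact phrase; larger values tolerate words in between.
    """
    return {
        'query': {
            'match_phrase': {
                field: {'query': phrase, 'slop': slop}
            }
        }
    }

exact = build_phrase_query('title', 'information literacy')
# Would also match titles such as "Information and Media Literacy"
loose = build_phrase_query('title', 'information literacy', slop=2)

print(json.dumps(exact, indent=2, sort_keys=True))
```

Either body can be passed to query_share(SHARE_SEARCH_API, ...) in place of phrase_query.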




In [11]:

# Using sharepa

phrase_search = ShareSearch()

phrase_search = phrase_search.query(
    'match_phrase',
    title="information literacy"
)

results = phrase_search.execute()

for result in results:
    print(
        '{} -- from {} -- {}'.format(
            result.title.encode('utf-8'),
            result.shareProperties.source.encode('utf-8'),
            result.uris.canonicalUri.encode('utf-8')
        )
    )

Information Literacy -- from uiucideals -- http://hdl.handle.net/2142/41497
Information Literacy Inventory -- from crossref -- http://dx.doi.org/10.1037/t32581-000
Information Literacy in Schools -- from crossref -- http://dx.doi.org/10.1201/b19843-8
Information Literacy Doll -- from datacite -- http://dx.doi.org/10.6084/M9.FIGSHARE.1012828.V1
MOBILE INFORMATION LITERACY CURRICULUM -- from uwashington -- http://hdl.handle.net/1773/34803
Information literacy at work -- from crossref -- http://dx.doi.org/10.1108/el-04-2014-0063
“Real Deal” Information Literacy -- from calpoly -- http://digitalcommons.calpoly.edu/lib_fac/95
Information Literacy (Book Review) -- from uiucideals -- http://hdl.handle.net/2142/41357
Teaching Information Literacy Online -- from cuny -- http://academicworks.cuny.edu/ulj/vol18/iss1/5
Data Information Literacy: Developing Data Information Literacy Programs -- from umassmed -- http://escholarship.umassmed.edu/escience_symposium/2016/program/6
