# SHARE Data in the Wide World

This notebook focuses on how to export SHARE data into different formats, and how to query SHARE for information specific to your institution, for example from a list of names, emails, or ORCIDs that act as researcher identifiers.


## Exporting a DataFrame to csv and Excel

When aggregating SHARE data, it can be useful to export the results to a format that is easy to distribute widely, such as a csv file or an Excel file.

First, we'll do a SHARE aggregation query for documents from each source that have a description, turn it into a pandas DataFrame, and export the data into both csv and Excel formats.

In [3]:
import pandas as pd

from sharepa import ShareSearch
from sharepa.helpers import pretty_print

description_search = ShareSearch()

description_search = description_search.query(
    'query_string', # Type of query, will accept a lucene query string
    query='description:*', # This lucene query string will find all documents that have a description
    analyze_wildcard=True  # This will make elasticsearch pay attention to the asterisk (which matches anything)
)

description_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'significant_terms',  # There are many kinds of aggregations
    field='_type',  # We store the source of a document in its type, so this will aggregate by source
    min_doc_count=0,
    percentage={}, # Will make the score value the percentage of all results (doc_count/bg_count)
    size=0
)

description_results = description_search.execute()

In [3]:
description_dataframe = pd.DataFrame(description_results.aggregations.sources.to_dict()['buckets'])

# We will add our own "percent" column to make things clearer
description_dataframe['percent'] = (description_dataframe['score'] * 100)

# Let's set the source name as the index, and then drop the old column
description_dataframe = description_dataframe.set_index(description_dataframe['key'])
description_dataframe = description_dataframe.drop('key', axis=1)

# Finally, we'll show the results!
description_dataframe

key,bg_count,doc_count,score,percent
nist,3,3,1.000000,100.000000
dailyssrn,6896,6880,0.997680,99.767981
pcurio,12215,12170,0.996316,99.631600
ut_chattanooga,296,292,0.986486,98.648649
addis_ababa,1907,1878,0.984793,98.479287
columbia,2079,2046,0.984127,98.412698
cogprints,52,51,0.980769,98.076923
figshare,574897,563267,0.979770,97.977029
asu,13271,12891,0.971366,97.136614
hacettepe,45,43,0.955556,95.555556
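
The `percent` column is just `doc_count / bg_count` scaled to 100, so it can be recomputed from the raw counts. The following is a minimal sketch of that arithmetic, assuming the buckets arrive as a list of dictionaries; the `buckets` list here is a hand-written stand-in (using three rows of counts from the output above), not a live SHARE response.

```python
import pandas as pd

# Hypothetical stand-in for
# description_results.aggregations.sources.to_dict()['buckets'];
# the counts are copied from three rows of the table above.
buckets = [
    {"key": "nist", "bg_count": 3, "doc_count": 3},
    {"key": "cogprints", "bg_count": 52, "doc_count": 51},
    {"key": "hacettepe", "bg_count": 45, "doc_count": 43},
]

# Use the source name as the index in one step.
frame = pd.DataFrame(buckets).set_index("key")

# Share of each source's documents that have a description, in percent.
frame["percent"] = frame["doc_count"] / frame["bg_count"] * 100

print(frame.sort_values("percent", ascending=False))
```

Passing the column name to `set_index` removes the need for a separate `drop` call.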


Let's export this pandas DataFrame to a csv file and to an Excel file.

The next cell writes both files to the current working directory, so it is meant to be run locally.

In [4]:
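# Note: run these lines locally. They write files to the current working directory,
# and to_excel needs an Excel writer package such as openpyxl to be installed.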

description_dataframe.to_csv('SHARE_Counts_with_Descriptions.csv')
description_dataframe.to_excel('SHARE_Counts_with_Descriptions.xlsx')
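
To check that an export keeps the source names, it can help to read the csv back in. The sketch below assumes a small hand-built frame in place of `description_dataframe`, and writes to an in-memory buffer rather than a file so it can run anywhere.

```python
import io

import pandas as pd

# Small hypothetical frame with the same shape as description_dataframe.
frame = pd.DataFrame(
    {"bg_count": [3, 52], "doc_count": [3, 51]},
    index=pd.Index(["nist", "cogprints"], name="key"),
)

# to_csv writes the index as the first column, under its name ("key").
buffer = io.StringIO()
frame.to_csv(buffer)

# Rewind the buffer and tell read_csv which column holds the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="key")

print(restored)
```

Without `index_col`, the source names would come back as an ordinary `key` column next to a fresh integer index.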

## Working with outside data

Let's say we have a list of names of researchers from a particular university, and we want to see whether their full names appear in any sources across the SHARE data set.

In [5]:
names = ["Susan Jones", "Ravi Patel"]

In [6]:
name_search = ShareSearch()

for name in names:
    name_search = name_search.query(
        {
            "bool": {
                "should": [
                    {
                        "match": {
                            "contributors.name": {
                                "query": name, 
                                "operator": "and",
                                "type" : "phrase"
                            }
                        }
                    }
                ]
            }
        }
    )


name_results = name_search.execute()

print('There are {} documents with contributors who have any of those names.'.format(name_search.count()))
print('Here are the first 10:')
print('---------')
for result in name_results:
    print(
        '{} -- with contributors {}'.format(
            result.title.encode('utf-8'),
            ', '.join([contributor.name.encode('utf-8') for contributor in result.contributors])
        )
    )


There are 39 documents with contributors who have any of those names.
Here are the first 10:
---------
Short- and Long-Term Outcomes for Extremely Preterm Infants -- with contributors Ravi Patel
"Prospective, Randomized, Multi-Center, Efficacy Non-inferiority Study of MEDIHONEY® Gel Versus Collagenase for Wound Debridement" -- with contributors Ravi Patel, MD
Obstetrical and Neonatal Perspectives on Prematurity -- with contributors Tracy Manuck, Ravi Patel
Representative structures of bHLH proteins from the Protein Data Bank -- with contributors Susan Jones
The cohesion interaction network -- with contributors Susan Jones, John Sgouros
Evolutionary tree for SMC proteins, created using PHYLIP 69,70 -- with contributors Susan Jones, John Sgouros
Assessing the Fitness of an Academic Library for Doctoral Research -- with contributors Edwards, Susan ; Jones, Lynn ;
‘It’s not what it looks like. I’m Santa’: Connecting Community through Film -- with contributors Susan Jones, Joanna McIntyre

To see which sources these names came from, we can add an aggregation.

In [7]:
name_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations, terms is a pretty useful one though
    field='_type',  # We store the source of a document in its type, so this will aggregate by source
    size=0,  # These are just to make sure we get numbers for all the sources, to make it easier to combine graphs
    min_doc_count=1
)

name_results = name_search.execute()

pd.DataFrame(name_results.aggregations.sources.to_dict()['buckets'])

Unnamed: 0,doc_count,key
0,12,datacite
1,11,crossref
2,7,pubmedcentral
3,3,clinicaltrials
4,1,arxiv_oai
5,1,citeseerx
6,1,figshare
7,1,ghent
8,1,mblwhoilibrary
9,1,utaustin


Names can be ambiguous, so say we'd like to search by email address or ORCID instead.

In [8]:
orcids = [
    'http://orcid.org/0000-0003-1942-4543',
    'http://orcid.org/0000-0003-4875-1447',
    'http://orcid.org/0000-0002-6085-4433',
    'http://orcid.org/0000-0002-7995-9948',
    'http://orcid.org/0000-0002-2170-853X',
    'http://orcid.org/0000-0002-8899-9087'
]

In [9]:
orcid_search = ShareSearch()

for orcid in orcids:
    orcid_search = orcid_search.query(
        {
            "bool": {
                "should": [
                    {
                        "match": {
                            "contributors.sameAs": {
                                "query": orcid, 
                                "operator": "and",
                                "type" : "phrase"
                            }
                        }
                    }
                ]
            }
        }
    )

orcid_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations, terms is a pretty useful one though
    field='_type',  # We store the source of a document in its type, so this will aggregate by source
    size=0,  # These are just to make sure we get numbers for all the sources, to make it easier to combine graphs
    min_doc_count=1
)

orcid_results = orcid_search.execute()
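
The cell above adds one query per identifier in a loop. An alternative sketch, which is not taken from the sharepa documentation, builds a single `bool` query with one `should` clause per identifier, so the "any of these" intent is visible in one place. `any_phrase_query` is a hypothetical helper name; each clause has the same shape as the ones used in this notebook.

```python
import json


def any_phrase_query(field, values):
    """Build a bool query matching documents in which `field` contains
    at least one of `values` as a phrase (hypothetical helper)."""
    return {
        "bool": {
            "should": [
                {
                    "match": {
                        field: {
                            "query": value,
                            "operator": "and",
                            "type": "phrase",
                        }
                    }
                }
                for value in values
            ],
            # At least one of the should clauses has to match.
            "minimum_should_match": 1,
        }
    }


query = any_phrase_query(
    "contributors.sameAs",
    [
        "http://orcid.org/0000-0003-1942-4543",
        "http://orcid.org/0000-0002-2170-853X",
    ],
)
print(json.dumps(query, indent=2))
```

Since `.query()` is given a dictionary elsewhere in this notebook, the result could presumably be passed the same way, for example `ShareSearch().query(query)`. The same helper would work for names by using the `contributors.name` field.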

In [10]:
print('There are {} documents with contributors who have any of those orcids.'.format(orcid_search.count()))

all_agg_df = pd.DataFrame()
all_agg_df['title'] = [result.title for result in orcid_results]
all_agg_df['docID'] = [result.shareProperties.docID for result in orcid_results]
all_agg_df['source'] = [result.shareProperties.source for result in orcid_results]
all_agg_df

There are 15 documents with contributors who have any of those orcids.


Unnamed: 0,title,docID,source
0,Widespread shortening of 3' untranslated regio...,10.1101/026831,crossref
1,COMADRE: a global database of animal demography,10.1101/027821,crossref
2,Light-induced indeterminacy alters shade avoid...,10.1101/024018,crossref
3,Optimisation of a treat-to-target approach in ...,10.1136/annrheumdis-2015-208324,crossref
4,A selfish genetic element drives recurring sel...,10.1101/024851,crossref
5,Clinical trials of new drugs for the treatment...,10.1136/annrheumdis-2016-209429,crossref
6,A psychometric analysis of outcome measures in...,10.1136/annrheumdis-2014-207235,crossref
7,Response to: ‘Evidence for treating rheumatoid...,10.1136/annrheumdis-2016-209499,crossref
8,Pharmacological treatment of psoriatic arthrit...,10.1136/annrheumdis-2015-208466,crossref
9,The many evolutionary fates of a large segment...,10.1101/043687,crossref
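
For a result set this small, per-source counts can also be computed on the client from the `source` column, without another aggregation request. This is a minimal sketch assuming a frame shaped like `all_agg_df`; the rows below are hypothetical and only illustrate the shape.

```python
import pandas as pd

# Hypothetical rows with the same columns as all_agg_df.
hits = pd.DataFrame(
    {
        "docID": ["doc-1", "doc-2", "doc-3"],
        "source": ["crossref", "crossref", "datacite"],
    }
)

# Count documents per source and shape the result like the
# aggregation buckets shown earlier (key, doc_count).
per_source = (
    hits["source"]
    .value_counts()
    .rename_axis("key")
    .reset_index(name="doc_count")
)

print(per_source)
```

Note that this only counts the rows that were actually fetched, whereas a server-side aggregation counts every matching document.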


## Exporting SHARE Records as a CSV File - Evaluating Contributors

In [38]:

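import requests
from sharepa import basic_search

from pandas.io.json import json_normalize

# Fetch records from the SHARE search API, sorted by providerUpdatedDateTime
results = requests.get('https://osf.io/api/v1/share/search/?sort=providerUpdatedDateTime').json()['results']

# One row per contributor; each record's shareProperties and title are carried along as extra columns
flattened = json_normalize(results, 'contributors', ['shareProperties', 'title'])
flattened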

Unnamed: 0,additionalName,familyName,givenName,name,shareProperties,title
0,A.,Vsevolozhskaya,Olga,"Vsevolozhskaya, Olga A.","{u'source': u'uky', u'filetype': u'xml', u'doc...",Statistical Interpretation including the APPRO...
1,A.,Vsevolozhskaya,Olga,"Vsevolozhskaya, Olga A.","{u'source': u'uky', u'filetype': u'xml', u'doc...",Estimated Probability of Becoming Alcohol Depe...
2,C.,Anthony,James,"Anthony, James C.","{u'source': u'uky', u'filetype': u'xml', u'doc...",Estimated Probability of Becoming Alcohol Depe...
3,,Eden,Brad,"Eden, Brad","{u'source': u'valposcholar', u'filetype': u'xm...","Rethinking Technical Services: New Frameworks,..."
4,,Eden,Brad,"Eden, Brad","{u'source': u'valposcholar', u'filetype': u'xm...",Partnerships and New Roles in 21st-Century Aca...
5,,Eden,Brad,"Eden, Brad","{u'source': u'valposcholar', u'filetype': u'xm...",Creating Research Infrastructures in 21st-Cent...
6,,Eden,Brad,"Eden, Brad","{u'source': u'valposcholar', u'filetype': u'xm...",Cutting-Edge Research in Developing the Librar...
7,,Eden,Brad,"Eden, Brad","{u'source': u'valposcholar', u'filetype': u'xm...",Enhancing Teaching and Learning in the 21st-Ce...
8,,Eden,Brad,"Eden, Brad","{u'source': u'valposcholar', u'filetype': u'xm...",Leading the 21st-Century Academic Library: Suc...
9,,Eden,Brad,"Eden, Brad","{u'source': u'valposcholar', u'filetype': u'xm...",We Can’t Wait Any Longer: Managing Long-Term E...
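
In the table above, `shareProperties` is still a nested dictionary in every row, which does not export cleanly to csv. One possible way around this is to name the nested fields explicitly in `meta`. The sketch below assumes pandas 1.0 or later (where `json_normalize` is available as `pd.json_normalize`) and uses hand-written `records` in place of the live API response; the titles and `docID` values are placeholders.

```python
import pandas as pd

# Hypothetical records shaped like the `results` list above.
records = [
    {
        "title": "Example record A",
        "shareProperties": {"source": "uky", "docID": "doc-a"},
        "contributors": [
            {"name": "Vsevolozhskaya, Olga A.", "familyName": "Vsevolozhskaya"},
            {"name": "Anthony, James C.", "familyName": "Anthony"},
        ],
    },
    {
        "title": "Example record B",
        "shareProperties": {"source": "valposcholar", "docID": "doc-b"},
        "contributors": [{"name": "Eden, Brad", "familyName": "Eden"}],
    },
]

# One row per contributor. Nested meta paths are given as lists and
# become dotted column names such as "shareProperties.source".
flat = pd.json_normalize(
    records,
    record_path="contributors",
    meta=["title", ["shareProperties", "source"], ["shareProperties", "docID"]],
)

# With only scalar columns left, the frame exports to csv directly.
csv_text = flat.to_csv(index=False)
print(csv_text)
```

Passing a filename to `to_csv` instead of capturing the returned string writes the same content to disk.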
