# SHARE Data in the Wide World

This notebook focuses on how to export SHARE data into different formats, and how to query SHARE for information specific to your institution, for example from a list of names, emails, or ORCIDs that act as researcher identifiers.


## Exporting a DataFrame to csv and Excel

When doing an aggregation on SHARE data, it can be useful to export the results to a format that is easy to distribute widely, such as a csv file or an Excel file.

First, we'll do a SHARE aggregation query for documents from each source that have a description, turn it into a pandas DataFrame, and export the data into both csv and Excel formats.

In [None]:
import pandas as pd

from sharepa import ShareSearch
from sharepa.helpers import pretty_print

description_search = ShareSearch()

description_search = description_search.query(
    'query_string', # Type of query, will accept a lucene query string
    query='description:*', # This lucene query string will find all documents that have a description
    analyze_wildcard=True  # This will make elasticsearch pay attention to the asterisk (which matches anything)
)

description_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'significant_terms',  # There are many kinds of aggregations
    field='_type',  # We store the source of a document in its type, so this will aggregate by source
    min_doc_count=0,
    percentage={}, # Will make the score value the percentage of all results (doc_count/bg_count)
    size=0
)

description_results = description_search.execute()

In [None]:
description_dataframe = pd.DataFrame(description_results.aggregations.sources.to_dict()['buckets'])

# We will add our own "percent" column to make things clearer
description_dataframe['percent'] = (description_dataframe['score'] * 100)

# Let's set the source name as the index, which also removes the old 'key' column
description_dataframe = description_dataframe.set_index('key')

# Finally, we'll show the results!
description_dataframe
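
To make the reshaping above concrete without querying SHARE, here is a minimal sketch on made-up bucket data. The source names and numbers in `example_buckets` are hypothetical; only the `key`, `doc_count`, `bg_count`, and `score` field names follow what the aggregation above refers to.

```python
import pandas as pd

# Hypothetical buckets, shaped like the 'buckets' list of the aggregation
example_buckets = [
    {"key": "source_a", "doc_count": 50, "bg_count": 200, "score": 0.25},
    {"key": "source_b", "doc_count": 90, "bg_count": 100, "score": 0.9},
]

example_df = pd.DataFrame(example_buckets)

# Same steps as above: add a percent column, then index by source name
example_df["percent"] = example_df["score"] * 100
example_df = example_df.set_index("key")  # 'key' leaves the columns here

# Sorting makes the best-covered sources easy to spot
example_df = example_df.sort_values("percent", ascending=False)
print(example_df)
```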

Let's export this pandas DataFrame to a csv file and to an Excel file.

The next cell will work when running locally!

In [None]:
# Note: Uncomment the following lines if running locally:

# description_dataframe.to_csv('SHARE_Counts_with_Descriptions.csv')
# description_dataframe.to_excel('SHARE_Counts_with_Descriptions.xlsx')
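
Before sharing an exported file, it can help to check that it reads back the way you expect. The sketch below writes a small, made-up table to an in-memory buffer instead of a real file, so it runs anywhere; the `counts` DataFrame is a hypothetical stand-in for `description_dataframe`. Note that `to_excel` additionally needs an Excel writer package such as openpyxl to be installed.

```python
import io

import pandas as pd

# Hypothetical counts standing in for description_dataframe
counts = pd.DataFrame(
    {"doc_count": [50, 90], "percent": [25.0, 90.0]},
    index=pd.Index(["source_a", "source_b"], name="key"),
)

# Write the csv to an in-memory buffer; the index becomes the first column
buffer = io.StringIO()
counts.to_csv(buffer)
buffer.seek(0)

# Read it back, telling pandas which column holds the index
restored = pd.read_csv(buffer, index_col="key")
print(restored)
```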

## Working with outside data

Let's say we have a list of names of researchers from a particular university. We're interested in seeing whether their full names appear in any sources across the SHARE data set.

In [None]:
names = ["Susan Jones", "Ravi Patel"]

In [None]:
name_search = ShareSearch()

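# Build one "should" clause per name. A single bool query with several
# "should" clauses matches documents containing ANY of the names,
# regardless of how the client combines chained .query() calls.
name_clauses = [
    {
        "match": {
            "contributors.name": {
                "query": name,
                "operator": "and",
                "type": "phrase"
            }
        }
    }
    for name in names
]

name_search = name_search.query({"bool": {"should": name_clauses}})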


name_results = name_search.execute()

print('There are {} documents with contributors who have any of those names.'.format(name_search.count()))
print('Here are the first 10:')
print('---------')
for result in name_results:
    print(
        '{} -- with contributors {}'.format(
            result.title.encode('utf-8'),
            ', '.join([contributor.name.encode('utf-8') for contributor in result.contributors])
        )
    )
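
Once results come back, it is often useful to know which of the searched names actually matched. The following is a minimal sketch on made-up records; `example_results` and the extra contributor names are hypothetical, and only the `title` and `contributors` / `name` fields mirror what is printed above.

```python
from collections import Counter

names = ["Susan Jones", "Ravi Patel"]

# Hypothetical records, shaped like the fields printed above
example_results = [
    {"title": "Paper A", "contributors": [{"name": "Susan Jones"}, {"name": "Li Wei"}]},
    {"title": "Paper B", "contributors": [{"name": "Ravi Patel"}, {"name": "Susan Jones"}]},
    {"title": "Paper C", "contributors": [{"name": "R. Patel"}]},
]

wanted = set(names)
matches_per_name = Counter()
for record in example_results:
    # Exact full-name matches only, counted once per document
    found = {contributor["name"] for contributor in record["contributors"]} & wanted
    matches_per_name.update(found)

for name in names:
    print(name, matches_per_name[name])
```

In this toy data "R. Patel" is not counted, which illustrates why full-name matching can miss documents that spell a name differently.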


If we want to see which sources these names came from, we can add an aggregation.

In [None]:
name_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations, terms is a pretty useful one though
    field='_type',  # We store the source of a document in its type, so this will aggregate by source
    size=0,  # These are just to make sure we get numbers for all the sources, to make it easier to combine graphs
    min_doc_count=1
)

name_results = name_search.execute()

pd.DataFrame(name_results.aggregations.sources.to_dict()['buckets'])
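
The comments above mention combining the per-source numbers from different searches. One possible way to do that in pandas is sketched below; both bucket lists are made up, and the column names `by_name` and `by_orcid` are chosen only for this illustration.

```python
import pandas as pd

# Hypothetical 'buckets' lists from two separate terms aggregations
name_buckets = [
    {"key": "source_a", "doc_count": 3},
    {"key": "source_b", "doc_count": 1},
]
orcid_buckets = [
    {"key": "source_b", "doc_count": 4},
    {"key": "source_c", "doc_count": 2},
]

by_name = pd.DataFrame(name_buckets).set_index("key")["doc_count"].rename("by_name")
by_orcid = pd.DataFrame(orcid_buckets).set_index("key")["doc_count"].rename("by_orcid")

# Align on the source name; a source missing from one search gets a count of 0
combined = pd.concat([by_name, by_orcid], axis=1).fillna(0).astype(int).sort_index()
print(combined)
```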

Say that instead of names, which can be somewhat arbitrary, we'd like to search by email address or ORCID.

In [None]:
orcids = [
    'http://orcid.org/0000-0003-1942-4543',
    'http://orcid.org/0000-0003-4875-1447',
    'http://orcid.org/0000-0002-6085-4433',
    'http://orcid.org/0000-0002-7995-9948',
    'http://orcid.org/0000-0002-2170-853X',
    'http://orcid.org/0000-0002-8899-9087'
]
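
Identifiers copied from an outside list can contain typos, so it may be worth validating them before querying. The last character of an ORCID iD is a check character computed with ISO 7064 MOD 11-2; the sketch below recomputes it. The helper names `orcid_check_digit` and `is_valid_orcid` are hypothetical and not part of sharepa.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid_url):
    """Check the 16-character identifier at the end of an ORCID URL."""
    identifier = orcid_url.rstrip("/").rsplit("/", 1)[-1].replace("-", "").upper()
    if len(identifier) != 16 or not identifier[:15].isdigit():
        return False
    return orcid_check_digit(identifier[:15]) == identifier[15]


candidate_orcids = [
    "http://orcid.org/0000-0003-1942-4543",
    "http://orcid.org/0000-0002-2170-853X",
]
checked = {orcid: is_valid_orcid(orcid) for orcid in candidate_orcids}
print(checked)
```

A valid check character only shows that the identifier is well formed, not that it belongs to the researcher you have in mind.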

In [None]:
orcid_search = ShareSearch()

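# Build one "should" clause per ORCID. A single bool query with several
# "should" clauses matches documents containing ANY of the ORCIDs,
# regardless of how the client combines chained .query() calls.
orcid_clauses = [
    {
        "match": {
            "contributors.sameAs": {
                "query": orcid,
                "operator": "and",
                "type": "phrase"
            }
        }
    }
    for orcid in orcids
]

orcid_search = orcid_search.query({"bool": {"should": orcid_clauses}})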

orcid_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations, terms is a pretty useful one though
    field='_type',  # We store the source of a document in its type, so this will aggregate by source
    size=0,  # These are just to make sure we get numbers for all the sources, to make it easier to combine graphs
    min_doc_count=1
)

orcid_results = orcid_search.execute()

In [None]:
print('There are {} documents with contributors who have any of those ORCIDs.'.format(orcid_search.count()))

all_agg_df = pd.DataFrame()
all_agg_df['title'] = [result.title for result in orcid_results]
all_agg_df['docID'] = [result.shareProperties.docID for result in orcid_results]
all_agg_df['source'] = [result.shareProperties.source for result in orcid_results]
all_agg_df

## Exporting SHARE Records as a CSV File - Evaluating Contributors

In [None]:
import requests
from sharepa import basic_search

from pandas.io.json import json_normalize

results = requests.get('https://osf.io/api/v1/share/search/?sort=providerUpdatedDateTime').json()['results']
flattened = json_normalize(results, 'contributors', ['shareProperties', 'title'])
flattened
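
To show what the flattening step does without calling the API, here is a minimal sketch on made-up records. The titles, sources, and contributor names are hypothetical, and it assumes pandas 1.0 or later, where `json_normalize` is available as `pd.json_normalize`. Each contributor becomes its own row, with the document's title and source repeated alongside it, and the result is written out as csv text.

```python
import pandas as pd

# Hypothetical records, shaped like entries in the 'results' list above
records = [
    {
        "title": "Paper A",
        "shareProperties": {"source": "source_one"},
        "contributors": [{"name": "Susan Jones"}, {"name": "Ravi Patel"}],
    },
    {
        "title": "Paper B",
        "shareProperties": {"source": "source_two"},
        "contributors": [{"name": "Ravi Patel"}],
    },
]

# One row per contributor; a nested meta path pulls out just the source
flat = pd.json_normalize(
    records,
    record_path="contributors",
    meta=["title", ["shareProperties", "source"]],
)

# How many records list each contributor?
records_per_contributor = flat.groupby("name")["title"].count()

# Export as csv text (pass a filename instead to write a file locally)
csv_text = flat.to_csv(index=False)
print(csv_text)
```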