# Querying Preprint Metrics!

This notebook contains examples of how to use the preprint data analytics endpoints developed for the Sloan grant.

# Main API Queries

First, we'll focus on making queries to the main views and downloads endpoints. The request is controlled by adding query parameters at the end of the URL. To add the first query param, add a "?" to the end of the URL, followed by the param name, an "=", and the value. So to look for downloads metrics for the preprint with the guid `abcde`, your URL would look like:

`/_/metrics/preprints/downloads/?guids=abcde`

To add a second query param to the URL, add a "&" and follow the same pattern. To then look for metrics for this guid on the date `2019-01-01`, your URL would look like:

`/_/metrics/preprints/downloads/?guids=abcde&on_date=2019-01-01`

To search for results for a list of guids, separate them with a ",":

`/_/metrics/preprints/downloads/?guids=abcde,efghi,jklmn&on_date=2019-01-01`
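
The same URLs can be assembled in code rather than by hand. The sketch below uses only the standard library. `build_metrics_url` is a hypothetical helper name, not part of the API, and the base URL is the local development address used in the examples further down.

```python
from urllib.parse import urlencode


def build_metrics_url(base, metric, guids, **params):
    """Build a GET URL for the views or downloads endpoint.

    Hypothetical helper for illustration; not part of the API itself.
    """
    query = {'guids': ','.join(guids)}
    # Skip any parameter that was passed as None.
    query.update({k: v for k, v in params.items() if v is not None})
    # Leave "," and ":" unescaped so the result matches the hand-written URLs.
    return '{}{}/?{}'.format(base, metric, urlencode(query, safe=',:'))


url = build_metrics_url(
    'http://localhost:8000/_/metrics/preprints/',
    'downloads',
    ['abcde', 'efghi', 'jklmn'],
    on_date='2019-01-01',
)
print(url)
```

Passing the guids as a list keeps the comma-joining in one place, so a single guid and a list of guids go through the same code path.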

Query params:

- `guids`: The guids, separated by commas, to request metrics for

- `on_date`: metrics for this specific day. If you include an on_date, you cannot include other date parameters. Must be in the format of `YYYY-MM-DD`

- `start_datetime`: restrict the results to starting on this datetime. Can either be in the format `YYYY-MM-DD` and results will start at midnight of that day, or `YYYY-MM-DDThh:mm` where h and m are the hour and minute. If you provide a start datetime with no end datetime, the end datetime will default to a full day ago at 11:59pm UTC. If you provide a start or end datetime including minutes, the other value must also include minutes.

- `end_datetime`: restrict the results to ending on this date. You cannot provide an `end_datetime` with no start datetime. See formatting rules for `start_datetime` for more specifics.

- `interval`: how fine-grained you'd like the returned results to be. To see what you can enter here, check out [the elasticsearch docs on intervals](https://www.elastic.co/guide/en/elasticsearch/reference/current/_intervals.html)
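
The rules in the list above for combining date parameters can be checked on the client before sending a request. This is a minimal sketch of one reading of those rules. `check_date_params` and its messages are hypothetical, it treats only `start_datetime` and `end_datetime` as the "other date parameters", and it says nothing about how the server itself validates input.

```python
from datetime import datetime

DATE_FORMAT = '%Y-%m-%d'
DATETIME_FORMAT = '%Y-%m-%dT%H:%M'


def _has_minutes(value):
    """True for YYYY-MM-DDThh:mm, False for YYYY-MM-DD, ValueError otherwise."""
    try:
        datetime.strptime(value, DATETIME_FORMAT)
        return True
    except ValueError:
        # Raises ValueError again if the value fits neither format.
        datetime.strptime(value, DATE_FORMAT)
        return False


def check_date_params(on_date=None, start_datetime=None, end_datetime=None):
    """Return a list of problems; an empty list means the combination looks fine."""
    problems = []
    if on_date is not None and (start_datetime is not None or end_datetime is not None):
        problems.append('on_date cannot be combined with other date parameters')
    if end_datetime is not None and start_datetime is None:
        problems.append('end_datetime requires start_datetime')
    if start_datetime is not None and end_datetime is not None:
        if _has_minutes(start_datetime) != _has_minutes(end_datetime):
            problems.append('start_datetime and end_datetime must use the same precision')
    return problems


print(check_date_params(start_datetime='2019-03-01T01:00', end_datetime='2019-03-02'))
```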


If no time period is specified, metrics will be returned for the previous 5 full days - 6 days ago at midnight UTC to one day ago at 11:59pm UTC.
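
To make the default window concrete, the sketch below computes its bounds for a given reference time, taking the stated bounds literally (6 days ago at midnight UTC to one day ago at 11:59pm UTC). `default_window` is a hypothetical name and the reference time is arbitrary; the server's own calculation is not shown in this notebook.

```python
from datetime import datetime, time, timedelta, timezone


def default_window(now):
    """Return (start, end) of the default period in UTC for a timezone-aware `now`."""
    today = now.astimezone(timezone.utc).date()
    start = datetime.combine(today - timedelta(days=6), time(0, 0), tzinfo=timezone.utc)
    end = datetime.combine(today - timedelta(days=1), time(23, 59), tzinfo=timezone.utc)
    return start, end


start, end = default_window(datetime(2019, 3, 14, 12, 0, tzinfo=timezone.utc))
print(start.isoformat())  # 2019-03-08T00:00:00+00:00
print(end.isoformat())    # 2019-03-13T23:59:00+00:00
```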

## Notes

* There is no server-side verification that any of the guids are *actually* preprints or, indeed, exist. Requested guids that have no data return no data.
* The API calls start from `/_/` rather than `/v2/` because this is a private endpoint.
* The withdrawn-preprint example below returns no data because it would take me a while to figure out how to withdraw a preprint. The elastic index doesn't know anything about the current status of a preprint, though: if there was an access, there will be an entry.

# Query Examples

In the following cells, replace the variables with ones you'd like to use.

In [1]:
# python setup 
import os
import json

import requests

In [2]:
# Change me! Uncomment and/or change the variables you'd like to use
METRICS_BASE = 'http://localhost:8000/_/metrics/preprints/'
# METRICS_BASE = 'https://api.osf.io/v2/metrics/preprints/'  
TOKEN = os.environ['LOCAL_OSF_TOKEN']
# TOKEN = os.environ['LOCAL_HENRIQUE_OSF_TOKEN']
# TOKEN = 'thisismylongstringthatisanosftokenyoucancreatefromosfsettings'

headers = {
    'Content-Type': 'application/vnd.api+json',
    'Authorization': 'Bearer {}'.format(TOKEN)
}

SINGLE_PREPRINT_GUID = 'jpftg'
WITHDRAWN_PREPRINT_GUID = 'hg89q'
LIST_OF_GUIDS =  ['jpftg', 'f8ph9', 'xr4jv', 'mdpcb']

In [3]:
# get download results for one preprint from the past 5 full days
url = '{}downloads/?guids={}'.format(METRICS_BASE, SINGLE_PREPRINT_GUID)
res = requests.get(url, headers=headers)

print('Request URL was: {}'.format(url))
print(json.dumps(res.json(), indent=4))


Request URL was: http://localhost:8000/_/metrics/preprints/downloads/?guids=jpftg
{
    "metric_type": "downloads",
    "data": [
        {
            "2019-03-08T00:00:00.000Z": {
                "jpftg": 48
            }
        },
        {
            "2019-03-09T00:00:00.000Z": {
                "jpftg": 48
            }
        },
        {
            "2019-03-10T00:00:00.000Z": {
                "jpftg": 48
            }
        },
        {
            "2019-03-11T00:00:00.000Z": {
                "jpftg": 48
            }
        },
        {
            "2019-03-12T00:00:00.000Z": {
                "jpftg": 48
            }
        },
        {
            "2019-03-13T00:00:00.000Z": {
                "jpftg": 48
            }
        }
    ]
}
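
Each entry in `data` is a one-key dict mapping a bucket timestamp to per-guid counts. A small helper can add those up across buckets. `total_by_guid` is a hypothetical name, and the sample below copies only the first two buckets from the output above.

```python
from collections import Counter


def total_by_guid(response_json):
    """Sum the counts for each guid across every time bucket in `data`."""
    totals = Counter()
    for bucket in response_json['data']:
        # Each bucket looks like {timestamp: {guid: count, ...}}.
        for counts in bucket.values():
            totals.update(counts)
    return dict(totals)


sample = {
    'metric_type': 'downloads',
    'data': [
        {'2019-03-08T00:00:00.000Z': {'jpftg': 48}},
        {'2019-03-09T00:00:00.000Z': {'jpftg': 48}},
    ],
}
print(total_by_guid(sample))  # {'jpftg': 96}
```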


In [4]:
# get view results for one preprint from January 3, 2019
url = '{}views/?guids={}&on_date=2019-01-03'.format(METRICS_BASE, SINGLE_PREPRINT_GUID)
res = requests.get(url, headers=headers)

print('Request URL was: {}'.format(url))
print(json.dumps(res.json(), indent=4))


Request URL was: http://localhost:8000/_/metrics/preprints/views/?guids=jpftg&on_date=2019-01-03
{
    "metric_type": "views",
    "data": [
        {
            "2019-01-03T00:00:00.000Z": {
                "jpftg": 18
            }
        }
    ]
}


In [5]:
','.join(LIST_OF_GUIDS)

'jpftg,f8ph9,xr4jv,mdpcb'

In [6]:
# get view results for a list of preprints from January 2, 2018
url = '{}views/?guids={}&on_date=2018-01-02'.format(
    METRICS_BASE, 
    ','.join(LIST_OF_GUIDS)
)
res = requests.get(url, headers=headers)

print('Request URL was: {}'.format(url))
print(json.dumps(res.json(), indent=4))


Request URL was: http://localhost:8000/_/metrics/preprints/views/?guids=jpftg,f8ph9,xr4jv,mdpcb&on_date=2018-01-02
{
    "metric_type": "views",
    "data": [
        {
            "2018-01-02T00:00:00.000Z": {
                "f8ph9": 18,
                "xr4jv": 18,
                "jpftg": 18,
                "mdpcb": 18
            }
        }
    ]
}
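
When several guids come back in one response, it can be handy to flatten the nested structure into rows, for example before writing a CSV. This is a sketch only; `flatten_metrics` is a hypothetical helper and the sample mirrors the output above.

```python
def flatten_metrics(response_json):
    """Turn the nested `data` list into (timestamp, guid, count) tuples."""
    rows = []
    for bucket in response_json['data']:
        for timestamp, counts in bucket.items():
            # Sort guids so the row order does not depend on dict ordering.
            for guid in sorted(counts):
                rows.append((timestamp, guid, counts[guid]))
    return rows


sample = {
    'metric_type': 'views',
    'data': [
        {'2018-01-02T00:00:00.000Z': {'f8ph9': 18, 'xr4jv': 18, 'jpftg': 18, 'mdpcb': 18}},
    ],
}
for row in flatten_metrics(sample):
    print(row)
```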


In [7]:
# get view results from January 3, 2019 to yesterday at 11:59pm UTC for a list of guids
url = '{}views/?guids={}&start_datetime=2019-01-03'.format(
    METRICS_BASE, 
    ','.join(LIST_OF_GUIDS)
)
res = requests.get(url, headers=headers)

print('Request URL was: {}'.format(url))
print(json.dumps(res.json(), indent=4))



Request URL was: http://localhost:8000/_/metrics/preprints/views/?guids=jpftg,f8ph9,xr4jv,mdpcb&start_datetime=2019-01-03
{
    "metric_type": "views",
    "data": [
        {
            "2019-01-03T00:00:00.000Z": {
                "f8ph9": 18,
                "xr4jv": 18,
                "jpftg": 18,
                "mdpcb": 18
            }
        },
        {
            "2019-01-04T00:00:00.000Z": {
                "jpftg": 0,
                "mdpcb": 0,
                "f8ph9": 0,
                "xr4jv": 0
            }
        },
        {
            "2019-01-05T00:00:00.000Z": {
                "jpftg": 0,
                "mdpcb": 0,
                "f8ph9": 0,
                "xr4jv": 0
            }
        },
        {
            "2019-01-06T00:00:00.000Z": {
                "jpftg": 0,
                "mdpcb": 0,
                "f8ph9": 0,
                "xr4jv": 0
            }
        },
        {
            "2019-01-07T00:00:00.000Z": {
                "jpftg": 0

In [8]:
# get view results from Jan 1, 2019 to Jan 3, 2019
url = '{}views/?guids={}&start_datetime=2019-01-01&end_datetime=2019-01-03'.format(METRICS_BASE, SINGLE_PREPRINT_GUID)
res = requests.get(url, headers=headers)

print('Request URL was: {}'.format(url))
print(json.dumps(res.json(), indent=4))



Request URL was: http://localhost:8000/_/metrics/preprints/views/?guids=jpftg&start_datetime=2019-01-01&end_datetime=2019-01-03
{
    "metric_type": "views",
    "data": [
        {
            "2019-01-03T00:00:00.000Z": {
                "jpftg": 18
            }
        }
    ]
}


In [9]:
# get view results from March 1, 2019 at 1:00am UTC to March 1, 2019 at 1:30am UTC by 5 min intervals
url = '{}views/?guids={}&start_datetime=2019-03-01T01:00&end_datetime=2019-03-01T01:30&interval=5m'.format(METRICS_BASE, SINGLE_PREPRINT_GUID)
res = requests.get(url, headers=headers)

print('Request URL was: {}'.format(url))
print(json.dumps(res.json(), indent=4))


Request URL was: http://localhost:8000/_/metrics/preprints/views/?guids=jpftg&start_datetime=2019-03-01T01:00&end_datetime=2019-03-01T01:30&interval=5m
{
    "metric_type": "views",
    "data": [
        {
            "2019-03-01T01:00:00.000Z": {
                "jpftg": 8
            }
        },
        {
            "2019-03-01T01:05:00.000Z": {
                "jpftg": 8
            }
        },
        {
            "2019-03-01T01:10:00.000Z": {
                "jpftg": 16
            }
        },
        {
            "2019-03-01T01:15:00.000Z": {
                "jpftg": 4
            }
        },
        {
            "2019-03-01T01:20:00.000Z": {
                "jpftg": 4
            }
        },
        {
            "2019-03-01T01:25:00.000Z": {
                "jpftg": 4
            }
        }
    ]
}


In [10]:
# Access metrics for a withdrawn preprint from January 3, 2019 to yesterday at 11:59pm UTC
url = '{}views/?guids={}&start_datetime=2019-01-03'.format(METRICS_BASE, WITHDRAWN_PREPRINT_GUID)
res = requests.get(url, headers=headers)

print('Request URL was: {}'.format(url))
print(json.dumps(res.json(), indent=4))



Request URL was: http://localhost:8000/_/metrics/preprints/views/?guids=hg89q&start_datetime=2019-01-03
{
    "metric_type": "views",
    "data": []
}


# Advanced Query Examples

The preprint metrics API also accepts `POST` requests containing more complicated raw queries for preprint metrics. These requests go to the bare `/_/metrics/preprints/views/` and `/_/metrics/preprints/downloads/` endpoints, without any query parameters. All of the data for the query is contained in a JSON object in the request's `POST` body.

These queries can be anything at all conforming to [the elasticsearch query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html), so there are many, many options not just limited to what you see here. 

Results will be formatted in the raw elasticsearch format, and so won't conform to the specific format outlined above.

From the [official preprint metrics docs](https://www.notion.so/cos/Impact-API-documentation-6d7c638c0cb642f8989287a1794580b2), each data point is stored with the following fields:

- timestamp
- provider_id, e.g. "socarxiv"
- preprint_id, e.g. "qmdc4"
- user_id, e.g. "q7fts"
- version (file version)
- path

You can use any of those fields when building a custom query. Note that you'll have to append `.keyword` to the field name in queries on `path`, `provider_id`, or `preprint_id` (for example `provider_id.keyword`); see the examples below.
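
Every `POST` example in this notebook wraps its query in the same envelope, so the wrapping can live in one small function. `build_payload` is a hypothetical helper name; the envelope shape is copied from the cells in this section, and the provider value is the one given in the field list above.

```python
def build_payload(query):
    """Wrap a raw elasticsearch query in the envelope used by the POST examples."""
    return {
        'data': {
            'type': 'preprint_metrics',
            'attributes': {'query': query},
        }
    }


# Term query on a keyword field: note the ".keyword" suffix on the field name.
provider_query = {'query': {'term': {'provider_id.keyword': 'socarxiv'}}}
payload = build_payload(provider_query)
print(payload['data']['attributes']['query'])
```

The resulting `payload` can be passed as `json=payload` to `requests.post`, exactly as the cells in this section do.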



In [11]:
post_url = '{}downloads/'.format(METRICS_BASE)

# total preprint downloads per year
query = {
    "aggs" : {
        "preprints_over_time" : {
            "date_histogram" : {
                "field" : "timestamp",
                "interval" : "year"
            }
        }
    }
}

payload = {
    'data': {
        'type': 'preprint_metrics',
        'attributes': {
            'query': query
        }
    }
}


res = requests.post(post_url, headers=headers, json=payload)
res.json()['aggregations']['preprints_over_time']['buckets']



[{'key_as_string': '2016-01-01T00:00:00.000Z',
  'key': 1451606400000,
  'doc_count': 1344},
 {'key_as_string': '2017-01-01T00:00:00.000Z',
  'key': 1483228800000,
  'doc_count': 1584},
 {'key_as_string': '2018-01-01T00:00:00.000Z',
  'key': 1514764800000,
  'doc_count': 1488},
 {'key_as_string': '2019-01-01T00:00:00.000Z',
  'key': 1546300800000,
  'doc_count': 2928}]
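
In each bucket, `key` is the bucket start in milliseconds since the Unix epoch and `key_as_string` is the same instant formatted as text. A minimal sketch for reducing yearly buckets to a plain `{year: count}` mapping follows; `counts_by_year` is a hypothetical name and the sample copies two buckets from the output above.

```python
from datetime import datetime, timezone


def counts_by_year(buckets):
    """Map each bucket's UTC year (derived from `key`) to its `doc_count`."""
    result = {}
    for bucket in buckets:
        # `key` is in milliseconds; fromtimestamp expects seconds.
        year = datetime.fromtimestamp(bucket['key'] / 1000, tz=timezone.utc).year
        result[year] = bucket['doc_count']
    return result


sample_buckets = [
    {'key_as_string': '2016-01-01T00:00:00.000Z', 'key': 1451606400000, 'doc_count': 1344},
    {'key_as_string': '2017-01-01T00:00:00.000Z', 'key': 1483228800000, 'doc_count': 1584},
]
print(counts_by_year(sample_buckets))  # {2016: 1344, 2017: 1584}
```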

In [12]:
# see downloads (post_url points at the downloads endpoint) broken down by month
# for one provider, restricted to one year
query = {
    "query": {
        "term": {"provider_id.keyword": "psyarxiv"}
    },
    "aggs" : {
        "preprints_from_2017": {
            "filter": {
                "range" : {
                    "timestamp" : {
                        "gte" : "2017-01-01",
                        # exclusive upper bound, so all of December 31 is included
                        "lt" : "2018-01-01"
                    }
                }
            },
            "aggs": {
                "preprints_per_month" : {
                    "date_histogram" : {
                        "field" : "timestamp",
                        "interval" : "month"
                    }
                }
            }
        }
    }
}


payload = {
    'data': {
        'type': 'preprint_metrics',
        'attributes': {
            'query': query
        }
    }
}

res = requests.post(post_url, headers=headers, json=payload)
res.json()['aggregations']['preprints_from_2017']['preprints_per_month']['buckets']


[{'key_as_string': '2017-01-01T00:00:00.000Z',
  'key': 1483228800000,
  'doc_count': 18},
 {'key_as_string': '2017-02-01T00:00:00.000Z',
  'key': 1485907200000,
  'doc_count': 18},
 {'key_as_string': '2017-03-01T00:00:00.000Z',
  'key': 1488326400000,
  'doc_count': 360}]

In [13]:
# downloads that come from logged in users
logged_in_query = {
    "query": {
         "exists" : { "field" : "user_id" }
    },
    "size": 0,
    "aggs" : {
        "preprints_per_year" : {
            "date_histogram" : {
                "field" : "timestamp",
                "interval" : "year"
            }
        }
    }
}


payload = {
    'data': {
        'type': 'preprint_metrics',
        'attributes': {
            'query': logged_in_query
        }
    }
}

res = requests.post(post_url, headers=headers, json=payload)
res.json()['aggregations']['preprints_per_year']['buckets']



[{'key_as_string': '2017-01-01T00:00:00.000Z',
  'key': 1483228800000,
  'doc_count': 240},
 {'key_as_string': '2018-01-01T00:00:00.000Z',
  'key': 1514764800000,
  'doc_count': 144},
 {'key_as_string': '2019-01-01T00:00:00.000Z',
  'key': 1546300800000,
  'doc_count': 1584}]

In [14]:
# downloads that come from NON-logged in users
logged_out_query = {
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "user_id"
                }
            }
        }
    },
    "size": 0,
    "aggs" : {
        "preprints_per_year" : {
            "date_histogram" : {
                "field" : "timestamp",
                "interval" : "year"
            }
        }
    }
}


payload = {
    'data': {
        'type': 'preprint_metrics',
        'attributes': {
            'query': logged_out_query
        }
    }
}

res = requests.post(post_url, headers=headers, json=payload)
res.json()['aggregations']['preprints_per_year']['buckets']



[{'key_as_string': '2016-01-01T00:00:00.000Z',
  'key': 1451606400000,
  'doc_count': 1344},
 {'key_as_string': '2017-01-01T00:00:00.000Z',
  'key': 1483228800000,
  'doc_count': 1344},
 {'key_as_string': '2018-01-01T00:00:00.000Z',
  'key': 1514764800000,
  'doc_count': 1344},
 {'key_as_string': '2019-01-01T00:00:00.000Z',
  'key': 1546300800000,
  'doc_count': 1344}]
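
The two bucket lists above can be combined to see what fraction of downloads came from logged-in users in each year. This is a sketch under the assumption that the two queries partition the records; `logged_in_share` is a hypothetical name and the sample numbers are copied from the 2016 and 2017 outputs above.

```python
def logged_in_share(logged_in_buckets, logged_out_buckets):
    """Return {bucket label: fraction of accesses made by logged-in users}."""
    logged_in = {b['key_as_string']: b['doc_count'] for b in logged_in_buckets}
    logged_out = {b['key_as_string']: b['doc_count'] for b in logged_out_buckets}
    shares = {}
    # A bucket may be missing from one list (no logged-in downloads in 2016 here).
    for label in sorted(set(logged_in) | set(logged_out)):
        inside = logged_in.get(label, 0)
        total = inside + logged_out.get(label, 0)
        shares[label] = inside / total if total else 0.0
    return shares


logged_in_sample = [
    {'key_as_string': '2017-01-01T00:00:00.000Z', 'key': 1483228800000, 'doc_count': 240},
]
logged_out_sample = [
    {'key_as_string': '2016-01-01T00:00:00.000Z', 'key': 1451606400000, 'doc_count': 1344},
    {'key_as_string': '2017-01-01T00:00:00.000Z', 'key': 1483228800000, 'doc_count': 1344},
]
shares = logged_in_share(logged_in_sample, logged_out_sample)
for label, share in shares.items():
    print(label, round(share, 3))
```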

# Notes and Extras

Code below here is for reference, or for running little one-off adjustments in the terminal to add more metrics.

It's commented out because it won't run in this notebook as is, and whoever runs it will need to adjust it for their own local testing.

In [15]:
# A little bit of code to add views and downloads to certain preprints for a developer
# uncomment, adjust, and run me in an interactive shell to add views/downloads

# from datetime import datetime

# from osf.metrics import PreprintView, PreprintDownload
# from osf.models import OSFUser, Preprint

# me = OSFUser.objects.get(username='erin@cos.io')
# user_to_use = OSFUser.objects.get(username='henrique@cos.io')

# metric_dates = ['2017-01-01', '2018-01-02', '2019-01-03']
# times = ['T00:00', 'T01:00', 'T02:00']

# preps = [Preprint.load('ythm7'), Preprint.load('h5rgp'), Preprint.load('e3fq4')]

# metrics = [PreprintView, PreprintDownload]
# for preprint_to_add in preps:
#     for metric in metrics:
#         for date in metric_dates:
#             for time in times:
#                 metric.record_for_preprint(
#                     preprint=preprint_to_add,
#                     user=user_to_use,
#                     path=preprint_to_add.primary_file.path,
#                     timestamp=datetime.strptime(date + time, '%Y-%m-%dT%H:%M')
#                 )
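
The date and time loops in the commented-out snippet can be previewed without an OSF shell. The sketch below only builds the timestamps that would be recorded, using the same lists and format string; it records nothing and touches none of the OSF models.

```python
from datetime import datetime
from itertools import product

metric_dates = ['2017-01-01', '2018-01-02', '2019-01-03']
times = ['T00:00', 'T01:00', 'T02:00']

# Every date is paired with every time, in the same order as the nested loops.
timestamps = [
    datetime.strptime(date + time, '%Y-%m-%dT%H:%M')
    for date, time in product(metric_dates, times)
]

print(len(timestamps))  # 9
print(timestamps[0], timestamps[-1])
```

With three preprints and two metric classes, the nested loops in the snippet make 3 x 2 x 9 = 54 `record_for_preprint` calls.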