## How to use the Papermill Alarm

The Papermill Alarm is a simple API: you send it a title and abstract, and it returns a prediction of whether the paper _looks like_ it came from a papermill. Keep in mind that a paper which _looks like_ a papermill product is not necessarily one.

Let's start with the basics. We'll import the `requests` package, which will do the majority of the work.

In [34]:
import requests

Before we get into making requests, we need to [subscribe to the API](https://rapidapi.com/clear-skies-clear-skies-default/api/papermill-alarm) and get an access key. The access key is essentially a password for the API, so it isn't safe to store it in the text of this notebook. Instead, we will store it in an environment variable.

To do that, assuming you are using Windows:
- Type 'env' into the search bar and you should see the option 'Edit the environment variables for your account'.
- Click 'Environment Variables', then 'New'.
- Enter 'PAPERMILL_ALARM_BATCH_KEY' as the "Variable name" and paste the key itself as the "Variable value".
- Importantly, close this window, shut down the terminal running this notebook (with ctrl+c) and then restart the terminal for the change to take effect!

Now we can use the code below to access our rapidapi key without making it visible.

In [35]:
import os

rapidapi_key = os.environ.get('PAPERMILL_ALARM_BATCH_KEY')
# this line just checks that the key exists
# if you see 'assertion error', it means that we can't find the key. 
# Check you have the right name for it and then restart and try again.
assert rapidapi_key
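
If you would rather see a descriptive error than a bare assertion error, the check above could be wrapped in a small helper. This is only a sketch: `load_api_key` and its `prompt_if_missing` option are hypothetical names introduced here, and the fallback to an interactive prompt is an assumption about what you might want, not something the API requires.

```python
import os
from getpass import getpass


def load_api_key(var_name="PAPERMILL_ALARM_BATCH_KEY", prompt_if_missing=False):
    """Return the API key stored in the environment variable `var_name`.

    Hypothetical helper: if the variable is missing and `prompt_if_missing`
    is True, ask for the key interactively (the input is not echoed).
    """
    key = os.environ.get(var_name)
    if not key and prompt_if_missing:
        key = getpass(f"Enter the value for {var_name}: ")
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Check the variable name, "
            "then restart the terminal and try again."
        )
    return key


# Example use in the notebook:
# rapidapi_key = load_api_key()
```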

The RapidAPI key is passed to RapidAPI via HTTP headers, so let's define those now to get them out of the way.

In [36]:
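# this is the host name of the API on RapidAPI, not the full URL;
# the full URL is defined inside the query functions below
host = 'papermill-alarm.p.rapidapi.com'
headers = {
    'content-type': 'application/json',
    'X-RapidAPI-Key': rapidapi_key,
    'X-RapidAPI-Host': host
}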

Let's build a simple function to query the API.

In [37]:
import json
import time

In [38]:
def query_papermill_alarm(doc):
    # build the payload in the expected format
    payload = {"payload":[doc]}
    # define the URL endpoint that the papermill alarm uses
    url = 'https://papermill-alarm.p.rapidapi.com'
    # make a POST request to the API
    r = requests.post(url, 
                      headers = headers,
                      json = payload)
    # if the response code is good, then we return the prediction;
    # otherwise we return None so that the caller can skip the failure
    if r.status_code == 200:
        return r.json()
    return None


Let's run the function on a single document

In [39]:
doc = {"id":"your_document_id",
       "title":"This is not a title of a paper",
       "abstract":"This is just an example piece of text. Not a real abstract."}
response = query_papermill_alarm(doc)
response

{'message': [{'id': 'your_document_id',
   'title': 'This is not a title of a paper',
   'abstract': 'This is just an example piece of text. Not a real abstract.',
   'message': {'status': 'green',
    'message': 'We did not detect any features in this metadata which are consistent with paper-mill activity. However, this check is only a simple check of metadata and does not cover all known indicators of papermill activity.'}}],
 'status_code': 200}
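
The response nests the verdict two levels deep: the outer `message` is a list with one entry per document, and each entry has its own `message` dict holding the `status`. As a rough sketch of how to pull out just the alert colour per document, assuming the response always has the shape shown above (`extract_alerts` is a hypothetical helper, and the sample below is a shortened copy of the real output):

```python
def extract_alerts(response):
    """Map each document id to its alert status (e.g. 'green')."""
    alerts = {}
    # `response` may be None if the request failed, so fall back to {}
    for item in (response or {}).get("message", []):
        verdict = item.get("message") or {}
        alerts[item.get("id")] = verdict.get("status")
    return alerts


# shortened stand-in for the response shown above
sample_response = {
    "message": [
        {
            "id": "your_document_id",
            "title": "This is not a title of a paper",
            "abstract": "This is just an example piece of text. Not a real abstract.",
            "message": {"status": "green", "message": "(explanatory text omitted)"},
        }
    ],
    "status_code": 200,
}

print(extract_alerts(sample_response))  # {'your_document_id': 'green'}
```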

OK - so we can see the result for 1 document, but wouldn't it be better to have a way to put multiple documents through and analyse them? 

Let's load a large number of documents in the same format as the above.

In [40]:
with open('arxiv_random_sample.json','r') as f:
    arxiv_data = json.load(f)
# check how many docs we have for testing
len(arxiv_data)

100

Now we pass those documents through the API in batches to get the predictions. First, we need a helper to split the data into batches.

In [41]:
def chunks(l,n):
    """We'll use this function to break the data into batches"""
    for i in range(0,len(l),n):
        yield l[i:(i+n)]
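
To see what `chunks` does, here is a small standalone example. The list of 31 integers is just a stand-in for a list of documents; the function is repeated so the snippet runs on its own.

```python
def chunks(l, n):
    """Break the list `l` into consecutive batches of at most `n` items."""
    for i in range(0, len(l), n):
        yield l[i:(i + n)]


docs = list(range(31))  # stand-in for 31 documents
batch_sizes = [len(batch) for batch in chunks(docs, 10)]
print(batch_sizes)  # [10, 10, 10, 1]
```

The last batch is simply shorter when the number of documents is not a multiple of the batch size.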

In [42]:
def query_papermill_alarm_batch(batch):
    # build the payload in the expected format
    payload = {"payload":batch}
    # define the URL endpoint that the papermill alarm uses
    url = 'https://papermill-alarm.p.rapidapi.com'
    # make a POST request to the API
    r = requests.post(url, 
                      headers = headers,
                      json = payload)
    # if the response code is good, then we return the prediction;
    # otherwise we return None so that the caller can skip the failure
    if r.status_code == 200:
        return r.json()
    return None


In [43]:
from tqdm import tqdm


## the wakeup function is just an ad hoc function which
## ensures that the API is awake. Running the API is expensive, 
## so it automatically switches itself off.
## This means that it's wise to wake it up before we make requests
from wakeup import wakeup
assert wakeup(headers=headers)

## Now query the API
results = []
for batch in tqdm(list(chunks(arxiv_data,10))):
    resp_data = query_papermill_alarm_batch(batch)
    if resp_data and 'message' in resp_data:
        results += resp_data['message']
# check how many results we got
len(results)

Papermill Alarm is awake and working. Beginning to process docs!
Response status code: 200
PMA response: 200


100%|██████████| 10/10 [01:03<00:00,  6.38s/it]


100
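
Note that `query_papermill_alarm_batch` returns nothing when the status code is not 200, so a failed batch is silently skipped by the loop above. If that matters for your use case, one option is to retry a few times with an increasing delay. The sketch below is an assumption about how you might do that, not part of the API: `call_with_retries`, `max_attempts` and `base_delay` are hypothetical names, and `send` is any zero-argument function that returns a `requests`-style response.

```python
import time


def call_with_retries(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it returns status 200 or attempts run out.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    Returns the decoded JSON on success, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        r = send()
        if r.status_code == 200:
            return r.json()
        # only wait if there is another attempt still to come
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# Example use with the names defined earlier in this notebook:
# resp_data = call_with_retries(
#     lambda: requests.post('https://papermill-alarm.p.rapidapi.com',
#                           headers=headers, json={"payload": batch}))
```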

## Analyse this
- let's just take a look at these predictions and see what we've got

In [44]:
def response_to_df(resp):
    """
    Simply convert the response to a 1-deep dict so that we can
    easily convert to dataframe
    """
    return {'id':resp.get('id'),
            'title':resp.get('title'),
            'abstract':resp.get('abstract'),
            'message':resp.get('message',dict()).get('message'),
            'alert':resp.get('message',dict()).get('status')}
import pandas as pd
df = pd.DataFrame([response_to_df(resp) for resp in results])
df.alert.value_counts()

green     99
orange     1
Name: alert, dtype: int64

That might look weird: almost all of the articles we checked came back 'green' (99 out of 100, with a single 'orange'). It's actually what we expect. The Papermill Alarm is trained on PubMed, so it is used to seeing papers in the biomedical fields. It isn't expecting to see physics, computer science and other arXiv-style fields.

You might consider this to be an 'out of domain' test. It's actually quite an important test: _because_ our API hasn't seen arXiv before, we couldn't predict what it would do when it saw it. Now we know. Funnily enough, most of PubMed will also come back green. We are looking for quite a small signal in the grand scheme of things.

## A better test
Instead of looking for nothing, let's look for something and see if we can find that.


Let's retrieve a different dataset: a list of retracted articles from [Smut Clyde's spreadsheets](https://docs.google.com/spreadsheets/d/1zKxfaqug4ZhwHyGzslF38pFyC8xtU8lzmmOFMGYITDI/edit#gid=0).

In [45]:
with open('sample_retractions.json','r') as f:
    retractions = json.load(f)
# check how many docs we have for testing
len(retractions)

31

In [46]:
assert wakeup(headers=headers)
results = []
for batch in tqdm(list(chunks(retractions,10))):
    resp_data = query_papermill_alarm_batch(batch)
    if resp_data and 'message' in resp_data:
        results += resp_data['message']

len(results)

Papermill Alarm is awake and working. Beginning to process docs!
Response status code: 200
PMA response: 200


100%|██████████| 4/4 [00:22<00:00,  5.69s/it]


31

In [47]:
df = pd.DataFrame([response_to_df(resp) for resp in results])
df.alert.value_counts()

red       30
orange     1
Name: alert, dtype: int64
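
To put those counts in proportion, here is a small sketch that turns a list of alert statuses into counts and shares. It uses only the standard library; `alert_summary` is a hypothetical helper, and the `statuses` list is reconstructed from the counts printed above rather than taken from the API.

```python
from collections import Counter


def alert_summary(statuses):
    """Return {status: (count, share of total)} for a list of statuses."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return {status: (n, n / total) for status, n in counts.items()}


# reconstructed from the value_counts output above
statuses = ["red"] * 30 + ["orange"]
for status, (n, share) in alert_summary(statuses).items():
    print(f"{status}: {n} ({share:.1%})")
# red: 30 (96.8%)
# orange: 1 (3.2%)
```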

And there we are.