## Papermill Alarm template analysis

This is a simple analysis notebook for use with the Papermill Alarm. It will allow you to identify journals and special volumes/issues that may have been targeted by papermills in the past. 

It's important to re-iterate that the Papermill Alarm only detects similarity between a paper and other papers which are believed to have been created by paper mills. It is not evidence of misconduct in itself and should be thought of as an indication of where to look carefully for that evidence. 

The processes below will help you to find papermill content in published literature using OpenAlex. There is plenty of scope to adapt the process here to any specific problem. 

E.g. 
- instead of reading data from OpenAlex, you could read data from new submissions and run them through the same process. This might help you identify papermill submissions before peer-review. 
- instead of looking at a selection of ISSNs, you could just look at a whole publisher. This would mean adapting the OpenAlex query

### Before we begin. Prerequisites.
1. This does require some experience of Python
2. you will need 2 environment variables
    - __PAPERMILL_ALARM_BATCH_KEY__ (this is the API key for the BATCH version of the papermill alarm which you can get from RapidAPI)
    - __MY_EMAIL__ (this is your email address which will get sent to OpenAlex with your queries. This way, if you send too many queries and it is a problem for them, they can email you to ask you to stop. Including your email address is optional, BUT if you don't include it, OpenAlex might give you slower responses. [See their docs for details](https://docs.openalex.org/api).)
    - after creating the environment variables, you will need to restart whichever terminal you launched this notebook from.
3. Then you will need to step through this notebook one cell at a time. Sometimes you will need to edit cells to update variables. 

In [4]:
%matplotlib inline

In [5]:
# Before we start, import all the packages we will need
import requests
import json
import os
import pandas as pd
from glob import glob 
from tqdm import tqdm

## Start by defining the journals we want to study
- we'll make a name for our study and this will automatically get translated into the name of a data directory where data will be stored.
- update the variables below to have the issn of the journal(s) we want to look at
- then you might want to update the path variable

# Example
I managed the peer-review operations for a gravitational physics journal from 2009-2015. This is a subject area, and an era, where I wouldn't expect to see any papermilling going on. But I am curious to see!

Furthermore, the Papermill Alarm (v1) was trained on PubMed. This means that this is another _out of domain_ test. We don't expect to find anything, but we won't know until we check. 

In [7]:
# Enter the ISSN of the journal you want to study
# this cell will create a folder in your D drive for it 
# If you don't want to use the D drive, change 'data_dir' to a path of your choice
# once you have finished editing this cell, delete the line containing 'assert False'
STUDY_NAME = 'Gravitational Physics'
ISSNS = ['0264-9381', '1572-9532', '2470-0029', ] # ISSNs of journals in this area
YEARS = list(range(2009,2016)) # equivalent to saying 'lets look at every year from 2009 to 2016 (but not including 2016!)
DATA_DIR = os.path.abspath(f'D:\\Papermill_alarm_study_{STUDY_NAME}')
DATASET_P = os.path.join(DATA_DIR, f'journal_openalex_data_{STUDY_NAME}.json')
RESULTS_P = os.path.join(DATA_DIR, f'papermill_alarm_results_{STUDY_NAME}.json')
# assert False

In [8]:
## check to see if your data directory already exists
## if not, create it.
if not os.path.exists(DATA_DIR):
    os.mkdir(DATA_DIR)

## Environment variables
- we need environment variables for the Papermill Alarm API key AND for our email address. 

This email address will be sent to OpenAlex with our queries so that they know who to contact if the queries cause problems for their API. You can simply send an empty string if you want to query OpenAlex anonymously, but then you will get slower responses.

In [9]:
email = os.environ.get('MY_EMAIL')
## this will throw an error if you haven't defined an email address properly
## you can set email='' to access OpenAlex anonymously, but sending your email address is recommended
assert email

In [10]:
# set the API key
rapidapi_key = os.environ.get('PAPERMILL_ALARM_BATCH_KEY')
# this line just checks that the key exists
# if you see 'assertion error', it means that we can't find the key. 
# Check you have created the environment variable and then restart and try again.
assert rapidapi_key

In [11]:
url = 'https://papermill-alarm.p.rapidapi.com'
headers = {
    'content-type':'application/json',
    'X-RapidAPI-Key':rapidapi_key,
    'X-RapidAPI-Host': 'papermill-alarm.p.rapidapi.com'
}

# Acquire data
We will download all of the data for the ISSNs we chose above.

The code below will first import some ad hoc functions for accessing OpenAlex's API

Then we will check to see if the data has already been downloaded. 
- if not, we download it
- if so, we simply load it from file (much faster!)

In [12]:
from openalex import openalex_from_issn, abstract_from_oa_response    

In [13]:
# does our dataset exist?
if not os.path.exists(DATASET_P):            
    # if not, we'll download everything 
    all_jnl_data = []
    for issn in ISSNS:
        print(f'Retrieving data for {issn}')
        for data in tqdm(openalex_from_issn(issn, email)):
            all_jnl_data += data
    # then save it to file
    with open(DATASET_P, 'w') as f:
        json.dump(all_jnl_data,f)
else:
    # if our dataset DOES exist, we just load it
    with open(DATASET_P, 'r') as f:
        all_jnl_data = json.load(f)
        
        
len(all_jnl_data)

47971

# Check the data


Before we proceed, we need to check that the data we have from OpenAlex is suitable for the task. 

We'll quickly audit the data to check for
- missing data. We need titles, abstracts and unique IDs for each document.
- duplication. In case we accidentally downloaded the same document twice - no point in querying the papermill alarm twice when we only need to do it once.
- any obviously unclean text that we might be able to fix. A title might not be missing, but it might indicate that the article is not a research article e.g. 'preface' or 'editorial'

# Then clean (filter) the data

Our checks will dictate changes that we can make to improve the data quality. This might include:

- remove any data that can't be fixed
- limit the data to whatever we're interested in. E.g. recent dates only. 

In [14]:
## OpenAlex data comes in structured JSON format.
## This is great, but it would be easier to work
## with a tabular format. So we'll convert
## the OpenAlex documents into DataFrame rows.
## DataFrames are like Excel, except good.
def convert_doc(doc):
    "Flatten an openalex doc into a dataframe row"
    return {
        'id':doc.get('id'),
        'title':doc.get('title'),
        'abstract':abstract_from_oa_response(doc),
        'publication_year':doc.get('publication_year'),
        'publication_date':doc.get('publication_date'),
        'volume':doc.get('biblio',{}).get('volume'),
        'issue':doc.get('biblio',{}).get('issue'),
        'issn_l':doc.get('issn_l'),
        'journal':doc.get('host_venue',{}).get('display_name'),
        'publisher':doc.get('host_venue',{}).get('publisher'),
        'is_retracted':doc.get('is_retracted')
        
    }

In [15]:
# first, I'll convert the data into a temporary dataframe to make it a bit more friendly
dftmp = pd.DataFrame([convert_doc(doc) for doc in all_jnl_data])
dftmp.shape

(47971, 11)

In [16]:
# how many articles are present for each year?
dftmp['publication_year'].value_counts()

2021    4856
2020    4527
2019    4278
2018    4158
2016    4152
2017    4015
2022    3315
2004     843
1989     823
2009     741
2003     720
2006     718
2002     712
2005     710
2010     707
2011     703
2008     699
2014     659
2007     638
2013     635
2015     631
2012     601
2001     579
2000     563
1997     525
1999     490
1996     456
1993     445
1998     443
1994     408
1992     399
1990     390
1995     375
1977     356
1991     332
1987     330
1988     304
1986     242
1985     218
1984     196
1979     166
1983     133
1982     112
1981     105
1978      99
1976      98
1980      90
1974      63
1975      62
1971      50
1972      42
1973      39
1970      20
Name: publication_year, dtype: int64

In [17]:
# filter the data just to the years that we specified at the start
dftmp = dftmp[dftmp['publication_year'].isin(set(YEARS))]
# check the shape - how much data is left after this filter?
dftmp.shape

(4677, 11)

#### Check for missing abstracts

Usually there are SOME missing abstracts, so we might see the number of 'non-null' entries here as being lower for abstracts than for, say, titles. 

In [18]:
dftmp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4677 entries, 0 to 47967
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                4677 non-null   object
 1   title             4677 non-null   object
 2   abstract          4512 non-null   object
 3   publication_year  4677 non-null   int64 
 4   publication_date  4677 non-null   object
 5   volume            1902 non-null   object
 6   issue             1900 non-null   object
 7   issn_l            0 non-null      object
 8   journal           4677 non-null   object
 9   publisher         4662 non-null   object
 10  is_retracted      4677 non-null   bool  
dtypes: bool(1), int64(1), object(9)
memory usage: 406.5+ KB


How many non-null values do we have?

Remember, we need title, abstract and id. Are there enough non-null values for abstract? Abstracts are sometimes missing in OpenAlex. 

If there are missing abstracts, we might want to try a different data source if possible. Missing abstracts can severely limit what we get.

In [19]:
# where are missing abstracts?
dftmp['abstract_missing'] = dftmp['abstract'].isna()

In [20]:
## if we are looking at volumes / issues,
## manually check volume / issues to see if the abstracts are missing at that level
## e.g. if it's a special issues journal is it just that sometimes abstracts don't get submitted for a whole issue/volume?
dftmp[['journal','volume','abstract_missing']].groupby(['journal','volume']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,abstract_missing
journal,volume,Unnamed: 2_level_1
Classical and Quantum Gravity,21,0
Classical and Quantum Gravity,23,0
Classical and Quantum Gravity,26,16
Classical and Quantum Gravity,27,16
Classical and Quantum Gravity,28,12
Classical and Quantum Gravity,29,10
Classical and Quantum Gravity,30,7
Classical and Quantum Gravity,31,8
Classical and Quantum Gravity,32,0
Classical and Quantum Gravity,33,0


In [47]:
## manually check individual articles with missing abstracts
## should the data be here?
dftmp[dftmp['abstract'].isna()].sample(3)

Unnamed: 0,id,title,abstract,publication_year,publication_date,volume,issue,issn_l,journal,publisher,is_retracted,abstract_missing
22086,https://openalex.org/W4239490462,1949–2011 Sixty-Two Years Gravity Research Fou...,,2010,2010-06-29,42,12,,General Relativity and Gravitation,Springer Nature,False,True
20251,https://openalex.org/W1480699288,"Giorgio Ferrarese, Donato Bini: Introduction t...",,2009,2009-02-01,41,2,,General Relativity and Gravitation,Springer Nature,False,True
14316,https://openalex.org/W4240682702,LISA 8 Science Organizing Committee and Local ...,,2011,2011-04-19,28,9,,Classical and Quantum Gravity,IOP Publishing,False,True


In [48]:
## either way, let's drop all rows with missing abstracts
dftmp = dftmp[~dftmp['abstract'].isna()]
dftmp.shape

(4514, 12)

In [49]:
## sometimes we have generic titles which appear multiple times. 
## these are usually 'Preface' or 'Editorial' etc. We can probably drop these.
maximum_times_a_title_can_appear = 1
dftmp = dftmp[dftmp['title'].isin(set((dftmp['title'].value_counts()>maximum_times_a_title_can_appear).index))]
dftmp.shape

(4514, 12)

In [50]:
# now drop any duplicates
dftmp = dftmp.drop_duplicates('id', keep = 'first')
dftmp.shape

(4513, 12)

In [51]:
## manually look at some errors
# [doc for doc in formatted_docs if not all([len(str(doc.get('id')))>4, len(str(doc.get('title')))>4,len(str(doc.get('abstract')))>4])]

In [52]:
# filter out docs with unusually short titles/abstracts

def check_str_len(s,n):
    # min n words in str
    s = str(s) # coerce to str - prefer that this is done upstream, but this fn will filter problematic rows anyway
    # then split the string into words and check that there are more than n
    return len(s.split())>n

min_title_length = 3
min_abstract_length = 10

dftmp = dftmp[(dftmp['title'].map(lambda x: check_str_len(x,min_title_length))) & (dftmp['abstract'].map(lambda x: check_str_len(x,min_abstract_length)))]
dftmp.shape

(4401, 12)

# Now for the fun part
- let's query the Papermill Alarm!

In [53]:
# import our ad hoc function to check that the papermill alarm is awake
from wakeup import wakeup

In [54]:
# this simple function turns a list into a list of lists
# it's basically a way to convert our data into batches
def chunks(l,n):
    for i in range(0,len(l),n):
        yield l[i:(i+n)]

In [55]:
def query_papermill_alarm(doc_batch):
    # build the payload in the expected format
    payload = {"payload":doc_batch}
    # define the URL endpoint that the papermill alarm uses
    url = 'https://papermill-alarm.p.rapidapi.com'
    # make a POST request to the API
    r = requests.post(url, 
                      headers = headers,
                      json = payload)
    # if the response code is good, then we return the prediction
    if r.status_code == 200:
        resp_data = r.json()
        return resp_data.get('message',[])

In [56]:
ids = set(dftmp['id'])
len(ids)

4401

In [57]:
ids_found = set()
results = []
n=10

# check to see if we already ran this search
if os.path.exists(RESULTS_P):            
    with open(RESULTS_P, 'r') as f:
        results = json.load(f)
    ids_found = {x['id'] for x in results}

# if we already ran the search, let's drop all the ids that we already found
ids_to_find = {x for x in ids if x not in ids_found}
print(f'We still have {len(ids_to_find)} documents still to query with the Papermill Alarm.')    

We still have 4401 documents still to query with the Papermill Alarm.


In [58]:
# then we can just search for the remaining ids
dftmp = dftmp[dftmp['id'].isin(ids_to_find)]
## limit to the columns we need 
dftmp= dftmp[['id','title','abstract']]
# now just get the rows as json docs
formatted_docs = list(dftmp.T.to_dict().values())

print(f'Checking {len(formatted_docs)} documents...')

Checking 4401 documents...


In [60]:
if len(formatted_docs)>0:
    # wake up the papermill alarm
    awake = wakeup(headers=headers)
    assert awake
    # here we perform the search
    for doc_batch in tqdm(chunks(formatted_docs,n), total = 1+(len(formatted_docs)//n)):
        results += query_papermill_alarm(doc_batch)

    # and write out the results
    with open(RESULTS_P, 'w') as f:
        json.dump(results,f)
    
len(results)

Response status code: 500
PMA response: None
Papermill Alarm needs time to wake up. Waiting for 60s.
Response status code: 504
PMA response: None
Papermill Alarm needs time to wake up. Waiting for 60s.
Response status code: 504
PMA response: None
Papermill Alarm needs time to wake up. Waiting for 60s.
Papermill Alarm is awake and working. Beginning to process docs!
Response status code: 200
PMA response: 200


100%|██████████| 441/441 [44:33<00:00,  6.06s/it]


4401

In [61]:
def response_to_df(resp):
    """
    Simply convert the response to a 1-deep dict so that we can
    easily convert to dataframe
    """
    return {'id':resp.get('id'),
            'title':resp.get('title'),
            'abstract':resp.get('abstract'),
            'message':resp.get('message',dict()).get('message'),
            'alert':resp.get('message',dict()).get('status')}

df = pd.DataFrame([response_to_df(resp) for resp in results])
df.alert.value_counts()

green     4395
orange       6
Name: alert, dtype: int64

## Findings
We're seeing 4395 green alerts (nothing to worry about) and 6 'orange' alerts which are essentially red alerts with low-confidence (you can usually ignore orange alerts). 

Since we already know that we are unlikely to find any papermill-products in this area, this is in-line with expectations. It's reasonable to conclude that the small number of orange alerts represent a low false-positive rate. 

In [78]:
## look at some alerts at random
df[df['alert']=='orange'].sample(3)

Unnamed: 0,id,title,abstract,message,alert,suspect
2762,https://openalex.org/W2005674598,An alternative well-posedness property and sta...,"In the first part of this paper, we show that ...",This article has SOME features in common with ...,orange,1
3556,https://openalex.org/W2067816691,Hawking radiation of charged rotating AdS blac...,Extending researches on Hawking radiation to c...,This article has SOME features in common with ...,orange,1
950,https://openalex.org/W1985871997,Fluid/gravity correspondence for general non-r...,"In this paper, we investigate the fluid/gravit...",This article has SOME features in common with ...,orange,1


In [64]:
alerts_of_interest = {'red','orange'} # 'red' alerts are high similarity to past papermills, 'orange' have lower similarity

# label alerts as 'suspect'. 
df['suspect'] = [1  if x in alerts_of_interest else 0 for x in df['alert'].tolist()]

# These next few cells are to help explore cases where we DO find something. 
- results below here aren't likely to be meaningful in cases where we don't have red alerts. 
- I'm leaving the code here for anyone who wants to play around with it on a different dataset. 

### Merge with original dataset to get volume and issue numbers

In [65]:
# now merge the openalex data with our predictions
right =  pd.DataFrame([convert_doc(doc) for doc in all_jnl_data])
adf = df[['id','message','alert','suspect']].merge(right = right, on='id', how = 'left')
adf.shape

(4402, 14)

In [1]:
# adf

#### Check for articles that are already retracted

In [67]:
## number of retractions recorded by openalex
right['is_retracted'].sum()

1

In [76]:
# right[right['is_retracted']==1]

In [70]:
## quick and dirty check for retracted articles
adf[adf['title'].map(lambda x: any(substr in x for substr in {'retract', 'expression of concern'})) ]

Unnamed: 0,id,message,alert,suspect,title,abstract,publication_year,publication_date,volume,issue,issn_l,journal,publisher,is_retracted


## Check for journals / volumes /issues that may have been targeted
- depending on whether you think papermills are getting in through specific journals or through special issues, set the 'target_level' variable to 'journal','volume' or 'issue'

In [71]:
target_level = 'journal' # 'volume' # 'issue'

gb = adf[[target_level,'suspect']].groupby([target_level]).sum().sort_values('suspect',ascending = False).reset_index()

In [2]:
## if we have used a target level like 'volume' or 'issue', we can check to see
## if we have some issues that stand out.
# gb.suspect.hist(bins=100)

In [21]:
# gb.head(30)

In [74]:
journal_names = {'journal_name'}
adf[(adf['journal'].isin(journal_names))*(adf['alert']!='green')]

Unnamed: 0,id,message,alert,suspect,title,abstract,publication_year,publication_date,volume,issue,issn_l,journal,publisher,is_retracted
