# PubMed/Biorxiv/Chemrxiv/Medrxiv scraper

This script checks several preprint servers as well as PubMed for new papers and determines whether they may be of interest to a reader. The model here uses PubMedBERT embeddings of the article/preprint abstract - specifically, the beginning-of-sentence and end-of-sentence tokens - and feeds them to a logistic regression model that is retrained daily on annotated data. In plain english, I provide a bunch of examples of papers that interest me and papers that don't, and use that to determine whether new papers will or won't also interest me. A downside is that this has lots of false positives, but those don't take long to filter through.

In [1]:
%%capture
!pip install pymed==0.8.9
!pip install paperscraper==0.2.10
!git clone https://github.com/delalamo/dda_scripts.git

In [2]:
import torch
import pandas as pd
from sklearn.linear_model import LogisticRegression
import textwrap
from pymed import PubMed
from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
import json
import numpy as np
from datetime import datetime, timedelta
from transformers import AutoTokenizer, AutoModelForMaskedLM



In [3]:
# Download LLM for encoding
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext")
model = AutoModelForMaskedLM.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext").to("cuda")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
def process(abstract: str) -> np.ndarray:
  """ Tokenize and encode abstract, return concatenated BOS and EOS array
  Token vector must be truncated at 512 (incl BOS & EOS) due to LLM limitations
  """
  inputs = tokenizer(abstract, return_tensors="pt").to("cuda")
  inputs = {k: torch.concat((v[:, :511], v[:, -1].unsqueeze(-1)), dim=-1) for k, v in inputs.items()}
  hidden_states = model(**inputs, output_hidden_states=True).hidden_states
  return torch.concat((hidden_states[-1][0, 0, :], hidden_states[-1][0, -1, :])).detach().cpu().numpy()

def format_date(date, sep):
  assert len(sep) == 1
  return sep.join([date.strftime(f"%{x}") for x in "Ymd"])

def format_dates(dates, sep):
  return [format_date(d, sep) for d in dates]

I no longer check on weekends, so searches carried out on Monday must also check articles/preprints deposited over the weekend.

In [5]:
yesterday = datetime.now() - timedelta(days = 1)

start_rxivs = format_date(yesterday, "-")
end_rxivs = start_rxivs

pubmed_days = [format_date(yesterday, "/")]

# check if we're catching up on weekend pubs
if yesterday.weekday() == 6: # 6 is sunday
  saturday = yesterday - timedelta(days = 1)
  friday = yesterday - timedelta(days = 2)

  start_rxivs = format_date(friday, "-")
  pubmed_days.extend(format_dates([friday, saturday], "/"))

In [6]:
df = pd.read_csv("dda_scripts/extras/annotated_abstracts.tsv", sep="\t").dropna(subset="label")

In [7]:
labels = []
reps = []
with torch.no_grad():
  for i, row in df.iterrows():
    reps.append(process(row["abstract"]))
    labels.append(row["label"])
clf = LogisticRegression(random_state=0, max_iter=10000).fit(reps, labels)

In [8]:
medrxiv(begin_date=start_rxivs, end_date=end_rxivs, save_path="medrxiv.jsonl")
biorxiv(begin_date=start_rxivs, end_date=end_rxivs, save_path="biorxiv.jsonl")
chemrxiv(begin_date=start_rxivs, end_date=end_rxivs, save_path="chemrxiv.jsonl")

51it [00:40,  1.26it/s]
184it [02:45,  1.11it/s]
26it [00:24,  1.06it/s]
100%|██████████| 28/28 [00:00<00:00, 3456.17it/s]


In [9]:
data = {"Title": [], "Abstract": [], "Probability": [], "Journal": []}

In [10]:
for jsonfile in ["medrxiv.jsonl", "biorxiv.jsonl", "chemrxiv.jsonl"]:
  with open(jsonfile) as infile:
    for line in infile:
      l = json.loads(line)
      abstract = l["abstract"].replace("\n", " ")
      reps = process(abstract)
      prob = clf.predict_proba(reps.reshape(1, -1))[:, 1].item()
      data["Title"].append(l["title"])
      data["Abstract"].append(abstract)
      data["Probability"].append(prob)
      data["Journal"].append(jsonfile.split(".")[0])

In [11]:
for date in pubmed_days:
  pubmed = PubMed(tool="MyTool", email="my@email.address")
  search_query = f"{date}[PDAT]"
  results = pubmed.query(search_query, max_results=500000)
  errors = []
  for i, article in enumerate(results):
    if article.abstract is None:
      continue
    reps = process(article.abstract)
    prob = clf.predict_proba(reps.reshape(1, -1))[:, 1].item()
    data["Title"].append(article.title)
    data["Abstract"].append(article.abstract.replace("\n", " "))
    data["Probability"].append(prob)
    data["Journal"].append(article.journal.strip().replace("\n", " "))

In [12]:
cutoff = 0.05

papers = pd.DataFrame.from_dict(data)
papers = papers.sort_values("Probability", ascending=False)
papers[papers["Probability"] >= cutoff]

Unnamed: 0,Title,Abstract,Probability,Journal
2077,ProSTAGE: Predicting Effects of Mutations on P...,Protein thermodynamic stability is essential t...,0.962591,Journal of chemical information and modeling
160,ProtHyena: A fast and efficient foundation pro...,The emergence of self-supervised deep language...,0.955628,biorxiv
98,The blobulator: a webtool for identification a...,MotivationClusters of hydrophobic residues are...,0.912446,biorxiv
672,Conservation of Hot Spots and Ligand Binding S...,The neural network-based program AlphaFold2 (A...,0.910195,Journal of chemical information and modeling
251,Molecular Graph Transformer: Stepping Beyond A...,Graph Neural Networks (GNNs) have revolutioniz...,0.889386,chemrxiv
909,Implicit model to capture electrostatic featur...,Membrane protein structure prediction and desi...,0.883075,PLoS computational biology
2064,Searching for Structure: Characterizing the Pr...,The identification and characterization of the...,0.843904,Journal of chemical information and modeling
117,Orthogonalized human protease control of secre...,Synthetic circuits that regulate protein secre...,0.830171,biorxiv
496,MARS an improved de novo peptide candidate sel...,Understanding the nature and extent of non-can...,0.777074,Nature communications
834,Viruses traverse the human proteome through pe...,We present a drug design strategy based on str...,0.607981,Proceedings of the National Academy of Science...


In [13]:
# Format for adding to TSV for subsequent days.
# Manually set the value at the end of each row to one to indicate a positive example.
for _, row in papers[papers["Probability"] >= cutoff].iterrows():
  prob, title, journal, abstract =  row["Probability"], row["Title"], row["Journal"], row["Abstract"]
  print(f"{prob:.4f}\t{title} ({journal})\t{abstract}\t0")

0.9626	ProSTAGE: Predicting Effects of Mutations on Protein Stability by Using Protein Embeddings and Graph Convolutional Networks. (Journal of chemical information and modeling)	Protein thermodynamic stability is essential to clarify the relationships among structure, function, and interaction. Therefore, developing a faster and more accurate method to predict the impact of the mutations on protein stability is helpful for protein design and understanding the phenotypic variation. Recent studies have shown that protein embedding will be particularly powerful at modeling sequence information with context dependence, such as subcellular localization, variant effect, and secondary structure prediction. Herein, we introduce a novel method, ProSTAGE, which is a deep learning method that fuses structure and sequence embedding to predict protein stability changes upon single point mutations. Our model combines graph-based techniques and language models to predict stability changes. Moreover,