# Tweets from bioRxiv and medRxiv

Retrieve Crossref Event Data Twitter events for DOIs with the Cold Spring Harbor Laboratory (CSHL) prefix 10.1101 from the last 7 days, then identify the most tweeted preprints.

In [1]:
import sys
sys.path.insert(0, '..') # make the local mrced2 module importable

import pandas as pd # data analysis library
import json
import mrced2 # module to run event data queries
import os # some file manipulations
import math # some number manipulations
import altair.vegalite.v3 as alt # some data visualizations
from IPython.display import Markdown as md # some markdown manipulations
from datetime import datetime, date, timedelta # some date manipulations

In [2]:
email = "info@front-matter.io"
prefix = "10.1101"
start_date = (date.today() - timedelta(days = 7)).strftime('%Y-%m-%d')
end_date = date.today()

In [3]:
ed = mrced2.eventData(email = email)
ed.buildQuery({'obj-id.prefix' : prefix, 'source': 'twitter', 'rows': 0,'from-occurred-date' : start_date, 'until-occurred-date' : end_date})

https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&obj-id.prefix=10.1101&source=twitter&rows=0&from-occurred-date=2021-08-02&until-occurred-date=2021-08-09
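
The logged URL is the endpoint followed by the query dictionary serialised as a query string. As a rough sketch of that mapping (not the actual `mrced2` implementation, which is not shown here), the standard library can reproduce the same URL. `build_event_data_url` is a hypothetical helper name, and the dates are fixed to the ones in the log above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.eventdata.crossref.org/v1/events"

def build_event_data_url(params, mailto="Anonymous"):
    """Hypothetical helper: serialise query parameters onto the endpoint."""
    # mailto goes first to mirror the parameter order in the logged URL
    query = {"mailto": mailto, **params}
    return BASE_URL + "?" + urlencode(query)

url = build_event_data_url({
    "obj-id.prefix": "10.1101",
    "source": "twitter",
    "rows": 0,
    "from-occurred-date": "2021-08-02",
    "until-occurred-date": "2021-08-09",
})
print(url)
```

Setting `rows` to 0 asks only for the hit count, which is why this first query is cheap to run before fetching the full result set.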


In [4]:
ed.runQuery(retry = 5)

Event Data query started...
API query complete  200
output file written to 1101/tweets.json


In [5]:
pages = math.ceil(ed.events.getHits() / 1000)

68586 events found
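
The page count follows from the hit count and the page size of 1,000 rows used in the next cell. A short worked check of that arithmetic, with the hit count copied from the output above:

```python
import math

PAGE_SIZE = 1000     # rows requested per page in the next cell
total_hits = 68586   # hit count reported by the query above

pages = math.ceil(total_hits / PAGE_SIZE)
# rows expected on the final, partially filled page
last_page_rows = total_hits - (pages - 1) * PAGE_SIZE
print(pages, last_page_rows)
```

Rounding up matters here: plain integer division would drop the final partial page.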


In [6]:
email = "info@front-matter.io"
prefix = "10.1101"
# keep start_date as a date object so it can be compared with posting dates later on
start_date = date.today() - timedelta(days = 7)
end_date = date.today()

# fetch all result pages for the search
ed = mrced2.eventData(email = email)
ed.getAllPages(pages, {'rows': 1000, 'obj-id.prefix' : prefix, 'source': 'twitter', 'from-occurred-date' : start_date, 'until-occurred-date' : end_date}, fileprefix = '1101/tweets_') 

https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-08-02&until-occurred-date=2021-08-09
Event Data query started...
API query complete  200
output file written to 1101/tweets_0000.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=699195ae-bc2a-436f-addc-38325fa8354d&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-08-02&until-occurred-date=2021-08-09
Event Data query started...
API query complete  200
output file written to 1101/tweets_0001.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=e6c32447-80b8-4440-8d3e-b129454191e8&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-08-02&until-occurred-date=2021-08-09
Event Data query started...
API query complete  200
output file written to 1101/tweets_0002.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=5851acb7-9fb8-4a1a-a2f5-b8036353d6fa&rows=
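
The log shows that every request after the first carries a `cursor` parameter, so the pages are chained rather than addressed by number. The loop below is a minimal sketch of that pattern, not the code inside `getAllPages`. It assumes each response holds the next cursor under `message["next-cursor"]`, and `fetch_page` is a hypothetical callable standing in for the HTTP request so the example runs offline.

```python
def fetch_all_pages(fetch_page, max_pages):
    """Follow cursors until none is returned or max_pages is reached.

    fetch_page is a hypothetical callable: it takes a cursor (None for
    the first page) and returns the decoded JSON response as a dict.
    """
    events, cursor = [], None
    for _ in range(max_pages):
        message = fetch_page(cursor)["message"]
        events.extend(message["events"])
        cursor = message.get("next-cursor")  # assumed field name
        if not cursor:
            break
    return events

# Stand-in for the network call: three fake pages keyed by cursor.
FAKE_PAGES = {
    None: {"message": {"events": [{"id": 1}, {"id": 2}], "next-cursor": "a"}},
    "a": {"message": {"events": [{"id": 3}], "next-cursor": "b"}},
    "b": {"message": {"events": [{"id": 4}], "next-cursor": None}},
}

all_events = fetch_all_pages(lambda cursor: FAKE_PAGES[cursor], max_pages=10)
print(len(all_events))
```

The `max_pages` cap plays the same role as the `pages` value computed earlier: it bounds the loop even if the API keeps returning cursors.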

Initialisation to look at the properties of the results.

In [7]:
jd1 = mrced2.eventRecord() # instance of a class to interpret the events
files = os.listdir('1101') # get all the filenames

jd1.mergeJsons(files, folder = '1101') # load the json event data from multiple files

failed to load preprint_tweets_2021-07-29.csv
failed to load .DS_Store
failed to load .gitkeep
failed to load preprint_tweets_2021-08-02.csv
failed to load .ipynb_checkpoints
failed to load preprint_tweets_2021-07-26.csv
output file written to 1101/tweets.json
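
The "failed to load" messages appear because `os.listdir` returns every entry in the folder, including CSV exports and hidden files. One way to avoid them is to select only the paginated result files before merging. The sketch below assumes the `tweets_NNNN.json` naming seen in the log. `list_page_files` is a hypothetical helper, demonstrated on a temporary folder rather than the real `1101` directory.

```python
import tempfile
from pathlib import Path

def list_page_files(folder, pattern="tweets_*.json"):
    """Return the sorted names of the paginated result files in folder."""
    return sorted(p.name for p in Path(folder).glob(pattern))

# Demonstration in a throwaway directory that mimics the '1101' folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["tweets_0001.json", "tweets_0000.json", "tweets.json",
                 ".DS_Store", ".gitkeep", "preprint_tweets_2021-08-02.csv"]:
        (Path(tmp) / name).touch()
    page_files = list_page_files(tmp)

print(page_files)
```

The resulting list could be passed in place of `files`, assuming `mergeJsons` accepts any list of file names, as the call above suggests. The pattern also skips the merged `tweets.json`, so a rerun does not merge earlier output back in.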


In [8]:
with open("1101/tweets.json") as f: # close the file once it has been parsed
    js = json.load(f)
df = pd.json_normalize(js, record_path = ['message', 'events'])
gdf = df.groupby(['obj_id']).size().reset_index(name='count').sort_values('count', ascending=False)
cdf = gdf[gdf['count'] >= 3]
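
The grouping step counts events per `obj_id` and keeps objects with at least 3 events. The same tally can be cross-checked without pandas. This is a minimal sketch using made-up events in the assumed shape of `1101/tweets.json` (a `message` object holding an `events` list). The DOIs are placeholders, not real preprints.

```python
from collections import Counter

MIN_TWEETS = 3  # same threshold as in the cell above

# Made-up payload; only the obj_id field matters for the count.
js = {"message": {"events": [
    {"obj_id": "https://doi.org/10.1101/example.1"},
    {"obj_id": "https://doi.org/10.1101/example.2"},
    {"obj_id": "https://doi.org/10.1101/example.1"},
    {"obj_id": "https://doi.org/10.1101/example.3"},
    {"obj_id": "https://doi.org/10.1101/example.2"},
    {"obj_id": "https://doi.org/10.1101/example.1"},
    {"obj_id": "https://doi.org/10.1101/example.2"},
    {"obj_id": "https://doi.org/10.1101/example.1"},
]}}

counts = Counter(event["obj_id"] for event in js["message"]["events"])
# most_common() sorts by count, highest first, like sort_values(ascending=False)
frequent = [(obj_id, n) for obj_id, n in counts.most_common() if n >= MIN_TWEETS]
print(frequent)
```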

In [9]:
email = "info@front-matter.io"
rest = mrced2.restApi(email = email)

data = []
for index, row in cdf.iterrows():
    rest.runQuery(row)
    if rest.work is not None and date.fromisoformat(rest.work["posted"]) >= start_date:
        data.append(rest.work)
print(data)
    
tdf = pd.DataFrame(data, columns=['doi','tweets','archive','subject-area','covid','title','authors','abstract','posted'])
tdf.to_csv('1101/preprint_tweets_' + date.today().strftime('%Y-%m-%d') + '.csv') # strftime already returns a string

tdf.head(50)

REST API query started for 10.1101/2021.07.08.21259912...
REST API query complete  200
REST API query started for 10.1101/2021.05.31.21258081...
REST API query complete  200
REST API query started for 10.1101/2021.06.01.21258176...
REST API query complete  200
REST API query started for 10.1101/2021.07.05.21260050...
REST API query complete  200
REST API query started for 10.1101/2021.07.07.21260122...
REST API query complete  200
REST API query started for 10.1101/2021.08.05.21261642...
REST API query complete  404
REST API query started for 10.1101/2021.05.08.21256619...
REST API query complete  200
REST API query started for 10.1101/2021.04.20.21254636...
REST API query complete  200
REST API query started for 10.1101/2021.07.26.21261146...
REST API query complete  200
REST API query started for 10.1101/2021.04.15.21252192...
REST API query complete  200
REST API query started for 10.1101/2021.06.30.21259787...
REST API query complete  200
REST API query started for 10.1101/2021.07.

| | doi | tweets | archive | subject-area | covid | title | authors | abstract | posted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.1101/2021.08.03.454980 | 109 | bioRxiv | Bioinformatics | False | The structural coverage of the human proteome ... | `[{'name': 'Eduard Porta-Pardo'}, {'name': 'Vic...` | `<p>The protein structure field is experiencing...` | 2021-08-03 |
| 1 | 10.1101/2021.07.31.454200 | 75 | bioRxiv | Neuroscience | False | Gene regulatory networks controlling temporal ... | `[{'name': 'Pin Lyu'}, {'name': 'Thanh Hoang'},...` | `<p>Gene regulatory networks (GRNs), consisting...` | 2021-08-02 |
| 2 | 10.1101/2021.08.02.454472 | 57 | bioRxiv | Genomics | False | Microbial community of recently discovered Auk... | `[{'name': 'Daan R Speth'}, {'name': 'Feiqiao B...` | `<p>Hydrothermal vents have been key to our und...` | 2021-08-02 |
| 3 | 10.1101/2021.08.03.454952 | 56 | bioRxiv | Plant Biology | False | High expression of VRT2 during wheat spikelet ... | `[{'name': 'Anna E. Backhaus'}, {'name': 'Ashle...` | `<p>Spikelets are the fundamental building bloc...` | 2021-08-04 |
| 4 | 10.1101/2021.08.03.454992 | 49 | bioRxiv | Neuroscience | False | Modulating D1 rather than D2 receptor-expressi... | `[{'name': 'Seongsik Yun'}, {'name': 'Ben Yang'...` | `<p>Overactive dopamine transmission in psychos...` | 2021-08-03 |
| 5 | 10.1101/2021.08.05.455304 | 48 | bioRxiv | Biochemistry | False | Structure of PINK1 reveals autophosphorylation... | `[{'name': 'Shafqat Rasool'}, {'name': 'Simon V...` | `<p>Mutations in PINK1 causes autosomal-recessi...` | 2021-08-05 |
| 6 | 10.1101/2021.08.01.454633 | 32 | bioRxiv | Systems Biology | False | Regulatory perturbations of ribosome allocatio... | `[{'name': 'David Hidalgo'}, {'name': 'César A....` | `<p>Bacteria regulate their cellular resource a...` | 2021-08-02 |
| 7 | 10.1101/2021.07.31.454592 | 24 | bioRxiv | Molecular Biology | True | SARS-CoV-2 fears green: the chlorophyll catabo... | `[{'name': 'Guillermo H. Jimenez-Aleman'}, {'na...` | `<p>SARS-CoV-2 pandemic is having devastating c...` | 2021-08-02 |
| 8 | 10.1101/2021.08.07.455365 | 18 | bioRxiv | Neuroscience | False | Spatiotemporal dynamics of human microglia are... | `[{'name': 'David A. Menassa'}, {'name': 'Tim A...` | `<p>Microglia, the brain's resident macrophages...` | 2021-08-07 |
| 9 | 10.1101/2021.08.05.455283 | 13 | bioRxiv | Plant Biology | False | Integrating omics approaches to discover and p... | `[{'name': 'Dayana K. Turquetti-Moraes'}, {'nam...` | `<p>Soybean is one of the major sources of edib...` | 2021-08-06 |
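
Only preprints posted within the 7-day window reach the table, which is why heavily tweeted older preprints from the query log do not appear in it. The filter can be isolated as a small predicate. This sketch assumes, as the `is not None` check in the cell suggests, that a failed lookup (such as the 404 in the log) leaves the work record as `None`. `posted_within` is a hypothetical helper name and the window start is fixed for reproducibility.

```python
from datetime import date

def posted_within(work, start_date):
    """True if work carries a 'posted' ISO date on or after start_date."""
    if work is None:  # assumed result of a failed REST lookup
        return False
    return date.fromisoformat(work["posted"]) >= start_date

start = date(2021, 8, 2)  # fixed window start, 7 days before 2021-08-09

print(posted_within({"posted": "2021-08-03"}, start))
print(posted_within({"posted": "2021-07-08"}, start))
print(posted_within(None, start))
```

Comparing `date` objects rather than strings keeps the check correct even if the date format of the source changes, provided it still parses as ISO 8601.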


### Tweets of bioRxiv and medRxiv preprints

In [10]:
num_rows = tdf['archive'].count()
num_covid = int(tdf['covid'].sum()) # rows flagged as covering SARS-CoV-2
num_biorxiv = int((tdf['archive'] == 'bioRxiv').sum()) # count by label, not by sort position
num_medrxiv = int((tdf['archive'] == 'medRxiv').sum())
end_date = date.today().strftime('%Y-%m-%d')
max_count = tdf['tweets'].max()

md('{} preprints (including {} covering SARS-CoV-2, {} from bioRxiv and {} from medRxiv) published in the last 7 days before {} had been tweeted at least 3 times (maximum {}).'.format(num_rows, num_covid, num_biorxiv, num_medrxiv, end_date, max_count))

30 preprints (including 5 covering SARS-CoV-2, 27 from bioRxiv and 3 from medRxiv) published in the last 7 days before 2021-08-09 had been tweeted at least 3 times (maximum 109).