# Tweets from bioRxiv and medRxiv

Find and display Crossref Event Data events for the Cold Spring Harbor Laboratory (CSHL) DOI prefix 10.1101 from the last 7 days, then identify the most tweeted preprints.

In [1]:
import sys
sys.path.insert(0, '..') # make the parent folder (containing mrced2) importable

import pandas as pd # data analysis library
import json
import mrced2 # module to run event data queries
import os # some file manipulations
import math # some number manipulations
import altair.vegalite.v3 as alt # some data visualizations
from IPython.display import Markdown as md # some markdown manipulations
from datetime import datetime, date, timedelta # some date manipulations

In [2]:
email = "info@front-matter.io"
prefix = "10.1101"
start_date = (date.today() - timedelta(days = 7)).strftime('%Y-%m-%d')
end_date = date.today()
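
The date window above can be sketched as a small helper. This is only an illustration: the function name `occurred_window` and the fixed example date are assumptions, chosen so the result matches the dates shown in the query URLs in this notebook.

```python
from datetime import date, timedelta

def occurred_window(today, days=7):
    """Return (start, end) as ISO date strings for the `days` days ending on `today`."""
    start = today - timedelta(days=days)
    return start.isoformat(), today.isoformat()

# Hypothetical fixed date so the example is reproducible
window_start, window_end = occurred_window(date(2021, 7, 29))
print(window_start, window_end)  # 2021-07-22 2021-07-29
```

Returning both ends as strings keeps the two query parameters in the same format.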

In [3]:
ed = mrced2.eventData(email = email)
ed.buildQuery({'obj-id.prefix' : prefix, 'source': 'twitter', 'rows': 0,'from-occurred-date' : start_date, 'until-occurred-date' : end_date})

https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&obj-id.prefix=10.1101&source=twitter&rows=0&from-occurred-date=2021-07-22&until-occurred-date=2021-07-29
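
For readers without the `mrced2` module, the query URL printed above can be reproduced with the standard library. This is a sketch, not how `mrced2` builds it internally; `build_events_url` is a hypothetical helper, and the parameter values are copied from the URL in the output.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.eventdata.crossref.org/v1/events"

def build_events_url(params):
    """Append URL-encoded query parameters to the Event Data endpoint."""
    return BASE_URL + "?" + urlencode(params)

url = build_events_url({
    "mailto": "Anonymous",
    "obj-id.prefix": "10.1101",
    "source": "twitter",
    "rows": 0,  # rows=0 asks only for the hit count, not the events
    "from-occurred-date": "2021-07-22",
    "until-occurred-date": "2021-07-29",
})
print(url)
```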


In [4]:
ed.runQuery(retry = 5)

Event Data query started...
API query complete  200
output file written to 1101/tweets.json


In [5]:
pages = math.ceil(ed.events.getHits() / 1000)

15954 events found
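
As a worked example of the page calculation: with 15954 events and 1000 rows per page, 16 requests are needed. The helper name `page_count` is an assumption made for illustration.

```python
import math

def page_count(total_hits, rows_per_page=1000):
    """Number of result pages needed to cover `total_hits` events."""
    return math.ceil(total_hits / rows_per_page)

print(page_count(15954))  # 16
```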


In [6]:
email = "info@front-matter.io"
prefix = "10.1101"
start_date = date.today() - timedelta(days = 7) # kept as a date object; cell 9 compares it with the posted date
end_date = date.today()

# fetch all result pages for the search
ed = mrced2.eventData(email = email)
ed.getAllPages(pages, {'rows': 1000, 'obj-id.prefix' : prefix, 'source': 'twitter', 'from-occurred-date' : start_date, 'until-occurred-date' : end_date}, fileprefix = '1101/tweets_') 

https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-07-22&until-occurred-date=2021-07-29
Event Data query started...
API query complete  200
output file written to 1101/tweets_0000.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=c778b3d2-1b13-47d7-8287-8e0a5102902b&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-07-22&until-occurred-date=2021-07-29
Event Data query started...
API query complete  200
output file written to 1101/tweets_0001.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=3c5841c6-4402-4e71-b0a7-4a9f9fa20b71&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-07-22&until-occurred-date=2021-07-29
Event Data query started...
API query complete  200
output file written to 1101/tweets_0002.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=8685d638-5cb0-4832-831f-538bf199c663&rows=
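
The `cursor` parameter in the URLs above suggests that each response supplies the cursor for the next page. A minimal sketch of that loop follows, assuming the response carries it under `message` as `next-cursor`. The function `fetch_all_pages` and the in-memory `FAKE_PAGES` stand-in for the HTTP call are hypothetical and are not part of `mrced2`.

```python
def fetch_all_pages(fetch_page, max_pages):
    """Collect events by following the cursor until it runs out or max_pages is reached."""
    events, cursor = [], None
    for _ in range(max_pages):
        message = fetch_page(cursor)["message"]
        events.extend(message["events"])
        cursor = message.get("next-cursor")
        if not cursor:  # no further pages
            break
    return events

# Hypothetical canned responses standing in for real API calls
FAKE_PAGES = {
    None: {"message": {"events": [{"id": "a"}, {"id": "b"}], "next-cursor": "c1"}},
    "c1": {"message": {"events": [{"id": "c"}], "next-cursor": None}},
}

events = fetch_all_pages(lambda cursor: FAKE_PAGES[cursor], max_pages=16)
print(len(events))  # 3
```

Passing the fetch function in as an argument keeps the paging logic testable without network access.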

Initialise an event record to inspect the properties of the results.

In [7]:
jd1 = mrced2.eventRecord() # instance of a class to interpret the events
files = os.listdir('1101') # get all the filenames

jd1.mergeJsons(files, folder = '1101') # load the json event data from multiple files

failed to load .DS_Store
failed to load .gitkeep
failed to load .ipynb_checkpoints
failed to load preprint_tweets_2021-07-26.csv
output file written to 1101/tweets.json
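
The "failed to load" messages above appear because `os.listdir` returns every entry in the folder, not only the result pages. One way to avoid them, sketched here with a hypothetical helper `page_files`, is to keep only names that match the `tweets_` prefix used when the pages were written. This also leaves out a previously merged `tweets.json`.

```python
def page_files(filenames, prefix="tweets_", suffix=".json"):
    """Keep only the paged result files, in page order."""
    return sorted(
        name for name in filenames
        if name.startswith(prefix) and name.endswith(suffix)
    )

# Example listing modelled on the folder contents reported above
listing = [
    ".DS_Store", ".gitkeep", ".ipynb_checkpoints",
    "preprint_tweets_2021-07-26.csv", "tweets.json",
    "tweets_0001.json", "tweets_0000.json",
]
print(page_files(listing))  # ['tweets_0000.json', 'tweets_0001.json']
```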


In [8]:
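with open("1101/tweets.json") as f: # close the file once it has been read
    js = json.load(f)
df = pd.json_normalize(js, record_path = ['message', 'events']) # one row per event
# count events per DOI, most tweeted first
gdf = df.groupby(['obj_id']).size().reset_index(name='count').sort_values('count', ascending=False)
cdf = gdf[gdf['count'] >= 3] # keep preprints tweeted at least 3 times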
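
The tally in the cell above groups events by `obj_id`, counts them, sorts in descending order, and keeps DOIs with at least 3 tweets. The same logic can be sketched without pandas using `collections.Counter`. This is a swapped-in standard-library technique, not the notebook's method, and the DOIs below are made-up placeholders.

```python
from collections import Counter

def most_tweeted(obj_ids, min_count=3):
    """Return (obj_id, count) pairs with at least `min_count` events, highest first."""
    counts = Counter(obj_ids)
    return [(obj_id, n) for obj_id, n in counts.most_common() if n >= min_count]

# Placeholder identifiers, one entry per tweet event
sample = (
    ["https://doi.org/10.1101/example.a"] * 4
    + ["https://doi.org/10.1101/example.b"] * 3
    + ["https://doi.org/10.1101/example.c"]
)
print(most_tweeted(sample))
```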

In [9]:
email = "info@front-matter.io"
rest = mrced2.restApi(email = email)

data = []
for index, row in cdf.iterrows():
    rest.runQuery(row)
    if rest.work is not None and date.fromisoformat(rest.work["posted"]) >= start_date:
        data.append(rest.work)
print(data)
    
tdf = pd.DataFrame(data, columns=['doi','tweets','archive','subject-area','covid','title','authors','abstract','posted'])
tdf.to_csv('1101/preprint_tweets_' + date.today().strftime('%Y-%m-%d') + '.csv')

tdf.head(50)

REST API query started for 10.1101/2021.06.25.449905...
REST API query complete  200
REST API query started for 10.1101/2021.07.15.21260561...
REST API query complete  200
REST API query started for 10.1101/2021.04.20.21254636...
REST API query complete  200
REST API query started for 10.1101/2021.02.23.432474...
REST API query complete  200
REST API query started for 10.1101/2021.06.28.21259420...
REST API query complete  200
REST API query started for 10.1101/2021.07.19.21260767...
REST API query complete  200
REST API query started for 10.1101/2021.06.01.21258176...
REST API query complete  200
REST API query started for 10.1101/2021.07.05.21260050...
REST API query complete  200
REST API query started for 10.1101/2021.05.08.443253...
REST API query complete  200
REST API query started for 10.1101/2021.05.31.21258122...
REST API query complete  200
REST API query started for 10.1101/2021.06.17.448820...
REST API query complete  200
REST API query started for 10.1101/2021.07.23.21261

Unnamed: 0,doi,tweets,archive,subject-area,covid,title,authors,abstract,posted
0,10.1101/2021.07.19.21260767,625,medRxiv,Pediatrics,True,Children with SARS-CoV-2 in the National COVID...,"[{'name': 'Blake Martin'}, {'name': 'Peter E. ...",<sec><title>Importance</title><p>SARS-CoV-2</p...,2021-07-22
1,10.1101/2021.07.23.21261030,232,medRxiv,Epidemiology,True,Breakthrough Symptomatic COVID-19 Infections L...,"[{'name': 'Daisy Massey'}, {'name': 'Diana Ber...",<p>Vaccines have been shown to be extremely ef...,2021-07-26
2,10.1101/2021.07.26.453748,74,bioRxiv,Microbiology,False,Identification of a new family of “megaphages”...,"[{'name': 'Slawomir Michniewski'}, {'name': 'B...",<p>Megaphages – bacteriophages harbouring extr...,2021-07-26
3,10.1101/2021.07.23.453070,60,bioRxiv,Evolutionary Biology,False,Regulation of sedimentation rate shapes the ev...,"[{'name': 'Omaya Dudin'}, {'name': 'Sébastien ...",<p>Significant increases in sedimentation rate...,2021-07-23
4,10.1101/2021.07.23.453379,57,bioRxiv,Bioinformatics,False,"AlphaPept, a modern and open framework for MS-...","[{'name': 'Maximilian T. Strauss'}, {'name': '...","<p>In common with other omics technologies, ma...",2021-07-26
5,10.1101/2021.07.23.453492,52,bioRxiv,Scientific Communication and Education,False,Delineating Medical Education: Bibliometric Re...,"[{'name': 'Lauren A. Maggio'}, {'name': 'Anton...",<sec><title>Background</title><p>The field of ...,2021-07-26
6,10.1101/2021.07.23.453605,51,bioRxiv,Genomics,False,Identifying cell-state associated alternative ...,"[{'name': 'Carlos F. Buen Abad Najar'}, {'name...",<p>Alternative splicing shapes the transcripto...,2021-07-24
7,10.1101/2021.07.24.21261040,47,medRxiv,Epidemiology,True,Novel risk factors for Coronavirus disease-ass...,"[{'name': 'Umang Arora'}, {'name': 'Megha Priy...",<sec><title>Background</title><p>The epidemiol...,2021-07-26
8,10.1101/2021.07.23.21261041,44,medRxiv,Public and Global Health,True,The impact of large mobile air purifiers on ae...,"[{'name': 'F. F. Duill'}, {'name': 'F. Schulz'...","<p>In the wake of the SARS-CoV-2 pandemic, an ...",2021-07-26
9,10.1101/2021.07.23.453478,44,bioRxiv,Biophysics,False,Mechanosensitivity of nucleocytoplasmic transport,"[{'name': 'Ion Andreu'}, {'name': 'Ignasi Gran...",<p>Mechanical force controls fundamental cellu...,2021-07-24
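
The table contains only preprints posted within the window because cell 9 compares each work's `posted` date with `start_date`. That comparison needs `start_date` to be a `date` object, not a formatted string. A small sketch of the filter, with a hypothetical helper `posted_recently` and placeholder records:

```python
from datetime import date, timedelta

def posted_recently(work, start_date):
    """True if the work exists and was posted on or after start_date (a date object)."""
    return work is not None and date.fromisoformat(work["posted"]) >= start_date

# Placeholder records; None stands for a failed lookup
works = [
    {"doi": "10.1101/example.1", "posted": "2021-07-22"},
    {"doi": "10.1101/example.2", "posted": "2021-06-25"},
    None,
]
window_start = date(2021, 7, 29) - timedelta(days=7)
recent = [w for w in works if posted_recently(w, window_start)]
print([w["doi"] for w in recent])  # ['10.1101/example.1']
```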


### Tweets of bioRxiv and medRxiv preprints

In [10]:
num_rows = tdf['archive'].count()
# count by explicit value rather than by position in value_counts(), which depends on the sort order
num_covid = tdf['covid'].eq(True).sum()
num_biorxiv = tdf['archive'].eq('bioRxiv').sum()
num_medrxiv = tdf['archive'].eq('medRxiv').sum()
end_date = date.today().strftime('%Y-%m-%d')
max_count = tdf['tweets'].max()

md('{} preprints (including {} covering SARS-CoV-2, {} from bioRxiv and {} from medRxiv) published in the last 7 days before {} had been tweeted at least 3 times (maximum {}).'.format(num_rows, num_covid, num_biorxiv, num_medrxiv, end_date, max_count))

112 preprints (including 20 covering SARS-CoV-2, 89 from bioRxiv and 23 from medRxiv) published in the last 7 days before 2021-07-29 had been tweeted at least 3 times (maximum 625).