# Tweets from bioRxiv and medRxiv

Retrieve the Twitter events from the last 7 days for DOIs with the CSHL prefix 10.1101 (bioRxiv and medRxiv), then find the most tweeted preprints.

In [1]:
import sys
sys.path.insert(0, '..') # look for local modules in the parent directory first

import pandas as pd # data analysis library
import json
import mrced2 # module to run event data queries
import os # some file manipulations
import math # some number manipulations
import altair.vegalite.v3 as alt # some data visualizations
from IPython.display import Markdown as md # some markdown manipulations
from datetime import datetime, date, timedelta # some date manipulations

In [2]:
email = "info@front-matter.io"
prefix = "10.1101"
start_date = (date.today() - timedelta(days = 7)).strftime('%Y-%m-%d')
end_date = date.today()

In [3]:
ed = mrced2.eventData(email = email)
ed.buildQuery({'obj-id.prefix' : prefix, 'source': 'twitter', 'rows': 0,'from-occurred-date' : start_date, 'until-occurred-date' : end_date})

https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&obj-id.prefix=10.1101&source=twitter&rows=0&from-occurred-date=2021-07-05&until-occurred-date=2021-07-12
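The URL printed above is the query dictionary serialised as query-string parameters. The following is a minimal sketch of that mapping using only the standard library. `build_events_url` is a hypothetical helper and not part of `mrced2`; the base URL and the leading `mailto` parameter are copied from the printed URL, and how `mrced2` builds the string internally is not shown here.

```python
from urllib.parse import urlencode

# Base URL copied from the printed query above.
BASE_URL = "https://api.eventdata.crossref.org/v1/events"

def build_events_url(params, mailto="Anonymous"):
    """Hypothetical helper: serialise a query dict into an events URL."""
    # Put mailto first so the parameter order matches the printed URL.
    query = {"mailto": mailto, **params}
    return BASE_URL + "?" + urlencode(query)

url = build_events_url({
    "obj-id.prefix": "10.1101",
    "source": "twitter",
    "rows": 0,
    "from-occurred-date": "2021-07-05",
    "until-occurred-date": "2021-07-12",
})
print(url)
```

Setting `rows` to 0 asks only for the number of matching events, which is what the next cells use to work out how many pages to fetch.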


In [4]:
ed.runQuery(retry = 5)

Event Data query started...
API query complete  200
output file written to 1101/tweets.json


In [5]:
pages = math.ceil(ed.events.getHits() / 1000)

14943 events found
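The page count is the number of hits divided by the page size, rounded up so that a final partial page is still fetched. A short worked example with the hit count reported above (the variable names `hits` and `rows_per_page` are illustrative):

```python
import math

hits = 14943          # events reported by the query above
rows_per_page = 1000  # page size used for the paged requests

# 14 full pages plus one partial page of 943 events
pages = math.ceil(hits / rows_per_page)
print(pages)  # 15
```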


In [6]:
email = "info@front-matter.io"
prefix = "10.1101"
start_date = date.today() - timedelta(days = 7) # kept as a date object; compared with posting dates below
end_date = date.today()

# fetch all result pages for the search
ed = mrced2.eventData(email = email)
ed.getAllPages(pages, {'rows': 1000, 'obj-id.prefix' : prefix, 'source': 'twitter', 'from-occurred-date' : start_date, 'until-occurred-date' : end_date}, fileprefix = '1101/tweets_') 

https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-07-05&until-occurred-date=2021-07-12
Event Data query started...
API query complete  200
output file written to 1101/tweets_0000.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=ad1289bf-30d2-435a-9f42-9907a004f3d8&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-07-05&until-occurred-date=2021-07-12
Event Data query started...
API query complete  200
output file written to 1101/tweets_0001.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=dcf28451-b046-47cf-ab6e-6c27e070b5df&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-07-05&until-occurred-date=2021-07-12
Event Data query started...
API query complete  200
output file written to 1101/tweets_0002.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=4d6b3314-00f3-4a8e-9821-88c36468d5b1&rows=
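Each request after the first carries a `cursor` parameter, which suggests that every response supplies the cursor for the next page. A rough sketch of that loop follows. The `next-cursor` field name and the `fetch_page` callable are assumptions for illustration, and a dictionary stands in for the API so the sketch runs without network access; it does not show how `getAllPages` is implemented.

```python
def fetch_all_pages(fetch_page, max_pages):
    """Collect events page by page, following the cursor from each response.

    fetch_page is a hypothetical callable that takes a cursor (None for the
    first page) and returns a dict shaped like
    {"message": {"events": [...], "next-cursor": ...}} (assumed layout).
    """
    events, cursor = [], None
    for _ in range(max_pages):
        message = fetch_page(cursor)["message"]
        events.extend(message["events"])
        cursor = message.get("next-cursor")
        if cursor is None:  # no further pages
            break
    return events

# Stand-in for the API: three pages linked by made-up cursors.
FAKE_PAGES = {
    None: {"message": {"events": [1, 2], "next-cursor": "a"}},
    "a": {"message": {"events": [3, 4], "next-cursor": "b"}},
    "b": {"message": {"events": [5], "next-cursor": None}},
}

all_events = fetch_all_pages(lambda cursor: FAKE_PAGES[cursor], max_pages=10)
print(all_events)  # [1, 2, 3, 4, 5]
```

Capping the loop with `max_pages` mirrors the notebook, where the page count is computed up front from the number of hits.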

Initialise an event record and merge the downloaded result pages into a single JSON file, so that the properties of the results can be examined.

In [7]:
jd1 = mrced2.eventRecord() # instance of a class to interpret the events
files = os.listdir('1101') # get all the filenames

jd1.mergeJsons(files, folder = '1101') # load the json event data from multiple files

failed to load preprint_tweets_2021-05-10.csv
failed to load preprint_tweets_2021-05-17.csv
failed to load .DS_Store
failed to load preprint_tweets_2021-07-05.csv
failed to load .gitkeep
failed to load preprint_tweets_2021-06-21.csv
failed to load tweets_sorted.json
failed to load preprint_tweets_2021-06-28.csv
failed to load preprint_tweets_2021-06-14.csv
failed to load preprint_tweets_2021-06-07.csv
failed to load preprint_tweets_2021-05-18.csv
failed to load preprint_tweets_2021-05-24.csv
failed to load preprint_tweets_2021-05-31.csv
failed to load preprint_tweets_2021-05-09.csv
failed to load preprint_tweets_2021-05-08.csv
output file written to 1101/tweets.json
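The "failed to load" messages appear because `os.listdir` returns every file in the folder, including CSV exports and hidden files. One possible way to avoid them is to pass only the page files to the merge step. The sketch below assumes the page files follow the `tweets_0000.json` naming seen in the output above; `select_page_files` is a hypothetical helper.

```python
import fnmatch

def select_page_files(filenames, pattern="tweets_[0-9][0-9][0-9][0-9].json"):
    """Hypothetical helper: keep only the numbered result pages, in order."""
    return sorted(fnmatch.filter(filenames, pattern))

# A sample of the names that appear in the folder listing above.
names = [
    "preprint_tweets_2021-05-10.csv",
    ".DS_Store",
    "tweets_0001.json",
    "tweets_sorted.json",
    ".gitkeep",
    "tweets_0000.json",
    "tweets.json",
]

page_files = select_page_files(names)
print(page_files)  # ['tweets_0000.json', 'tweets_0001.json']
```

The filtered list could then replace `files` in the call to `jd1.mergeJsons`.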


In [8]:
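with open("1101/tweets.json") as f: # close the file once it has been read
    js = json.load(f)
# one row per event
df = pd.json_normalize(js, record_path = ['message', 'events'])
# count events per preprint DOI, most tweeted first
gdf = df.groupby(['obj_id']).size().reset_index(name='count').sort_values('count', ascending=False)
# keep preprints with at least 5 tweets
cdf = gdf[gdf['count'] >= 5]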
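The counting step groups the events by the DOI they refer to, counts the rows in each group, and keeps the DOIs at or above a threshold. A small self-contained illustration with made-up identifiers and a threshold of 2 instead of 5:

```python
import pandas as pd

# Made-up events: one row per tweet, obj_id is the preprint being tweeted.
events = pd.DataFrame(
    {"obj_id": ["doi-a", "doi-b", "doi-a", "doi-a", "doi-c", "doi-b"]}
)

counts = (
    events.groupby("obj_id")
    .size()                       # number of events per obj_id
    .reset_index(name="count")    # back to a two-column DataFrame
    .sort_values("count", ascending=False)
)

popular = counts[counts["count"] >= 2]  # drop rarely tweeted items
print(popular)
```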

In [9]:
email = "info@front-matter.io"
rest = mrced2.restApi(email = email)

data = []

for index, row in cdf.iterrows():
    rest.runQuery(row)
    if rest.work is not None and date.fromisoformat(rest.work["posted"]) >= start_date:
        data.append(rest.work)
    
tdf = pd.DataFrame(data, columns=['doi','tweets','archive','subject-area','covid','title','authors','abstract','posted'])
tdf.to_csv('1101/preprint_tweets_' + date.today().strftime('%Y-%m-%d') + '.csv')

tdf.head(50)

REST API query started for 10.1101/2021.06.11.21258690...
REST API query complete  200
REST API query started for 10.1101/2021.05.16.21257255...
REST API query complete  200
REST API query started for 10.1101/2020.12.21.423721...
REST API query complete  200
REST API query started for 10.1101/2021.06.09.447686...
REST API query complete  200
REST API query started for 10.1101/2021.06.29.450356...
REST API query complete  200
REST API query started for 10.1101/2021.05.03.21256520...
REST API query complete  200
REST API query started for 10.1101/2021.06.28.21259673...
REST API query complete  200
REST API query started for 10.1101/2021.05.29.21258055...
REST API query complete  200
REST API query started for 10.1101/2021.06.01.21258176...
REST API query complete  200
REST API query started for 10.1101/2020.11.15.383323...
REST API query complete  200
REST API query started for 10.1101/2021.07.03.21259976...
REST API query complete  200
REST API query started for 10.1101/2021.05.05.21256

| | doi | tweets | archive | subject-area | covid | title | authors | abstract | posted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.1101/2021.07.03.21259976 | 246 | medRxiv | Infectious Diseases (except HIV/AIDS) | True | Incidence of Severe Acute Respiratory Syndrome... | [{'name': 'N Kojima'}, {'name': 'A Roshani'}, ... | `<sec><title>Introduction</title><p>The protect...` | 2021-07-07 |
| 1 | 10.1101/2021.07.08.21259351 | 212 | medRxiv | Hematology | True | Heparin for Moderately Ill Patients with Covid-19 | [{'name': 'Michelle Sholzberg'}, {'name': 'Gra... | `<sec><title>Background</title><p>Heparin, in a...` | 2021-07-08 |
| 2 | 10.1101/2021.07.05.451163 | 199 | bioRxiv | Evolutionary Biology | False | The crucial role of genome-wide genetic variat... | [{'name': 'Marty Kardos'}, {'name': 'Ellie Arm... | `<p>The unprecedented rate of extinction calls ...` | 2021-07-06 |
| 3 | 10.1101/2021.07.09.451765 | 91 | bioRxiv | Immunology | False | The PD-1 checkpoint receptor maintains toleran... | [{'name': 'Martina Damo'}, {'name': 'Can Cui'}... | `<p>Peripheral tolerance is thought to result f...` | 2021-07-10 |
| 4 | 10.1101/2021.07.07.451065 | 81 | bioRxiv | Synthetic Biology | False | Synthetic genomic reconstitution reveals princ... | [{'name': 'Sudarshan Pinglay'}, {'name': 'Mili... | `<p>Precise <italic>Hox</italic> gene expressio...` | 2021-07-07 |
| 5 | 10.1101/2021.07.04.451074 | 73 | bioRxiv | Neuroscience | False | Towards a Neurometric-based Construct Validity... | [{'name': 'Pin-Hao A. Chen'}, {'name': 'Domini... | `<p>Trust is a nebulous construct central to su...` | 2021-07-05 |
| 6 | 10.1101/2021.07.05.451142 | 71 | bioRxiv | Neuroscience | False | Synaptic Encoding of Vestibular Sensation Regu... | [{'name': 'Kyla R. Hamling'}, {'name': 'Kather... | `<p>Vertebrate vestibular circuits use sensory ...` | 2021-07-05 |
| 7 | 10.1101/2021.07.05.451212 | 67 | bioRxiv | Plant Biology | False | The Phytochemical Diversity of Commercial Cann... | [{'name': 'Christiana J. Smith'}, {'name': 'Da... | `<p>The legal status of <italic>Cannabis</itali...` | 2021-07-06 |
| 8 | 10.1101/2021.07.09.451712 | 65 | bioRxiv | Evolutionary Biology | False | Global patterns of subgenome evolution in orga... | [{'name': 'Joel Sharbrough'}, {'name': 'Justin... | `<p>Whole-genome duplications (WGDs), in which ...` | 2021-07-09 |
| 9 | 10.1101/2021.07.07.451547 | 64 | bioRxiv | Evolutionary Biology | False | Early adaptation in a microbial community is d... | [{'name': 'Sandeep Venkataram'}, {'name': 'Hua... | `<p>Evolutionary dynamics in ecological communi...` | 2021-07-09 |
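The loop in cell 9 keeps a preprint only if its posting date falls on or after `start_date`, which is why heavily tweeted but older preprints do not appear in the table. A minimal sketch of that comparison with made-up records follows; `posted_since` is a hypothetical helper and the reference date is fixed so the example is reproducible.

```python
from datetime import date, timedelta

def posted_since(works, start, key="posted"):
    """Hypothetical helper: keep works whose ISO posting date is >= start."""
    return [w for w in works if date.fromisoformat(w[key]) >= start]

today = date(2021, 7, 12)          # fixed reference date for the example
start = today - timedelta(days=7)  # 2021-07-05

works = [
    {"doi": "example-1", "posted": "2021-07-07"},
    {"doi": "example-2", "posted": "2021-06-11"},  # too old, dropped
    {"doi": "example-3", "posted": "2021-07-05"},  # boundary date, kept
]

recent = posted_since(works, start)
print([w["doi"] for w in recent])  # ['example-1', 'example-3']
```

The comparison needs a `date` on both sides; comparing a `date` with a formatted string raises a `TypeError`.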


### Tweets of bioRxiv and medRxiv preprints

In [10]:
num_rows = tdf['archive'].count()
# count by label; indexing value_counts() by position depends on which archive is larger
num_covid = int((tdf['covid'] == True).sum())
num_biorxiv = int((tdf['archive'] == 'bioRxiv').sum())
num_medrxiv = int((tdf['archive'] == 'medRxiv').sum())
end_date = date.today().strftime('%Y-%m-%d')
max_count = tdf['tweets'].max()

md('{} preprints (including {} covering SARS-CoV-2, {} from bioRxiv and {} from medRxiv) posted in the 7 days before {} were tweeted at least 5 times (maximum {}).'.format(num_rows, num_covid, num_biorxiv, num_medrxiv, end_date, max_count))

54 preprints (including 12 covering SARS-CoV-2, 36 from bioRxiv and 18 from medRxiv) posted in the 7 days before 2021-07-12 were tweeted at least 5 times (maximum 246).