# Tweets from bioRxiv and medRxiv

This notebook retrieves Crossref Event Data events for the CSHL DOI prefix 10.1101 (bioRxiv and medRxiv) from the last 7 days and ranks the most-tweeted preprints.

In [11]:
import sys
sys.path.insert(0, '..') # make the parent directory importable (for mrced2)

import pandas as pd # data analysis library
import json
import mrced2 # module to run event data queries
import os # some file manipulations
import math # some number manipulations
import altair.vegalite.v3 as alt # some data visualizations
from IPython.display import Markdown as md # some markdown manipulations
from datetime import datetime, date, timedelta # some date manipulations

In [12]:
email = "info@front-matter.io"
prefix = "10.1101"
start_date = (date.today() - timedelta(days = 7)).strftime('%Y-%m-%d')
end_date = date.today()

In [13]:
ed = mrced2.eventData(mailto = email)
ed.buildQuery({'obj-id.prefix' : prefix, 'source': 'twitter', 'rows': 500,'from-occurred-date' : start_date, 'until-occurred-date' : end_date})

https://api.eventdata.crossref.org/v1/events?mailto=info@front-matter.io&obj-id.prefix=10.1101&source=twitter&rows=500&from-occurred-date=2021-10-04&until-occurred-date=2021-10-11
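The URL printed above is a base endpoint plus URL-encoded filter parameters. The following is a minimal sketch of that composition using only the standard library. It is not how `mrced2` builds the query internally; `build_events_url` and `BASE_URL` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL printed by the notebook.
BASE_URL = "https://api.eventdata.crossref.org/v1/events"

def build_events_url(mailto, params):
    """Compose an Event Data query URL (hypothetical helper, not part of mrced2)."""
    query = {"mailto": mailto, **params}
    # safe="@" leaves the email address unescaped, as in the printed URL.
    return BASE_URL + "?" + urlencode(query, safe="@")

url = build_events_url(
    "info@front-matter.io",
    {
        "obj-id.prefix": "10.1101",
        "source": "twitter",
        "rows": 500,
        "from-occurred-date": "2021-10-04",
        "until-occurred-date": "2021-10-11",
    },
)
print(url)
```

Because dictionaries preserve insertion order, the parameters appear in the URL in the order they were supplied.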


In [14]:
ed.runQuery(retry = 5)

Event Data query started...
API query complete  200
output file written to 1101/tweets.json


In [15]:
pages = math.ceil(ed.events.getHits() / 1000)

22531 events found
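As a worked example of the page calculation, the hit count reported above and the page size of 1000 rows used in the next cell give the number of requests needed. The variable names here are illustrative only.

```python
import math

hits = 22531          # total events reported by the query above
rows_per_page = 1000  # page size used when fetching all pages

# Round up so the final, partially filled page is still requested.
pages = math.ceil(hits / rows_per_page)
print(pages)  # 23
```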


In [16]:
email = "info@front-matter.io"
prefix = "10.1101"
start_date = date.today() - timedelta(days = 7)
end_date = date.today()

# fetch all result pages for the search
ed = mrced2.eventData(mailto = email)
ed.getAllPages(pages, {'rows': 1000, 'obj-id.prefix' : prefix, 'source': 'twitter', 'from-occurred-date' : start_date, 'until-occurred-date' : end_date}, fileprefix = '1101/tweets_')

https://api.eventdata.crossref.org/v1/events?mailto=info@front-matter.io&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-10-04&until-occurred-date=2021-10-11
Event Data query started...
API query complete  200
output file written to 1101/tweets_0000.json
https://api.eventdata.crossref.org/v1/events?mailto=info@front-matter.io&cursor=b91c6f0a-1ecd-4143-b228-fe293dea90af&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-10-04&until-occurred-date=2021-10-11
Event Data query started...
API query complete  200
output file written to 1101/tweets_0001.json
https://api.eventdata.crossref.org/v1/events?mailto=info@front-matter.io&cursor=2ec9b04e-9d10-4666-bed1-e6c5766a79f7&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-10-04&until-occurred-date=2021-10-11
Event Data query started...
API query complete  200
output file written to 1101/tweets_0002.json
https://api.eventdata.crossref.org/v1/events?mailto=info@front-matter.io&curso
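The log above shows each request after the first carrying a `cursor` parameter. The following sketch illustrates cursor-based pagination in general, with a stub in place of the HTTP call so it runs offline. It is not the `mrced2` implementation: `fetch_all_pages` and `FAKE_PAGES` are hypothetical, and the response shape, in particular the `next-cursor` field, is an assumption.

```python
def fetch_all_pages(fetch_page, max_pages):
    """Follow cursors until none is returned or max_pages is reached.

    fetch_page(cursor) is assumed to return a dict shaped like
    {"message": {"events": [...], "next-cursor": str or None}}.
    """
    events, cursor = [], None
    for _ in range(max_pages):
        message = fetch_page(cursor)["message"]
        events.extend(message["events"])
        cursor = message.get("next-cursor")
        if not cursor:  # no further pages to request
            break
    return events


# Stand-in for the HTTP call: three fake pages keyed by cursor (made-up data).
FAKE_PAGES = {
    None: {"message": {"events": [{"id": 1}, {"id": 2}], "next-cursor": "c1"}},
    "c1": {"message": {"events": [{"id": 3}], "next-cursor": "c2"}},
    "c2": {"message": {"events": [{"id": 4}], "next-cursor": None}},
}

all_events = fetch_all_pages(lambda cursor: FAKE_PAGES[cursor], max_pages=10)
print(len(all_events))  # 4
```

Capping the loop with `max_pages` mirrors the notebook, which computes the page count up front, and guards against an endless loop if a cursor is always returned.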

Merge the downloaded result pages into a single file so the events can be analysed together.

In [17]:
jd1 = mrced2.eventRecord() # instance of a class to interpret the events
files = os.listdir('1101') # get all the filenames

jd1.mergeJsons(files, folder = '1101') # load the json event data from multiple files

failed to load preprint_tweets_2021-07-29.csv
failed to load preprint_tweets_2021-09-06.csv
failed to load preprint_tweets_2021-09-13.csv
failed to load .DS_Store
failed to load .gitkeep
failed to load preprint_tweets_2021-08-30.csv
failed to load preprint_tweets_2021-08-09.csv
failed to load preprint_tweets_2021-08-23.csv
failed to load preprint_tweets_2021-08-02.csv
failed to load preprint_tweets_2021-08-16.csv
failed to load .ipynb_checkpoints
failed to load preprint_tweets_2021-10-05.csv
failed to load preprint_tweets_2021-10-11.csv
failed to load preprint_tweets_2021-10-04.csv
failed to load preprint_tweets_2021-10-06.csv
failed to load preprint_tweets_2021-10-07.csv
failed to load preprint_tweets_2021-07-26.csv
failed to load preprint_tweets_2021-09-20.csv
output file written to 1101/tweets.json
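The "failed to load" messages arise because `os.listdir` returns every entry in the folder, including CSV exports and hidden files. One way to avoid them, sketched here under the assumption that the page files always follow the `tweets_NNNN.json` naming seen above, is to filter the listing before merging. `select_page_files` is a hypothetical helper, not part of `mrced2`.

```python
from fnmatch import fnmatch

def select_page_files(filenames, pattern="tweets_*.json"):
    """Return only the paginated result files, sorted by name (hypothetical helper)."""
    return sorted(name for name in filenames if fnmatch(name, pattern))

# A sample of the kind of entries os.listdir('1101') returns.
listing = [
    "preprint_tweets_2021-10-11.csv",
    ".DS_Store",
    ".gitkeep",
    "tweets_0001.json",
    "tweets.json",
    "tweets_0000.json",
    ".ipynb_checkpoints",
]
print(select_page_files(listing))  # ['tweets_0000.json', 'tweets_0001.json']
```

The pattern also skips a merged `tweets.json` left over from an earlier run, since that name has no underscore after `tweets`.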


In [18]:
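with open("1101/tweets.json") as f: # close the file once the JSON is parsed
    js = json.load(f)
df = pd.json_normalize(js, record_path = ['message', 'events']) # one row per event
# count events per DOI, most tweeted first
gdf = df.groupby(['obj_id']).size().reset_index(name='count').sort_values('count', ascending=False)
cdf = gdf[gdf['count'] >= 3] # keep preprints tweeted at least 3 times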
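The group, count, sort and threshold steps in the cell above can be illustrated with a standard-library equivalent on a handful of made-up events. The DOIs below are placeholders, and `Counter` stands in for the pandas `groupby(...).size()` call purely for illustration.

```python
from collections import Counter

# Made-up events; only the obj_id field matters for counting.
events = (
    [{"obj_id": "https://doi.org/10.1101/aaa"}] * 4
    + [{"obj_id": "https://doi.org/10.1101/bbb"}] * 3
    + [{"obj_id": "https://doi.org/10.1101/ccc"}] * 1
)

# Count events per DOI; most_common() orders them from most to least tweeted.
counts = Counter(event["obj_id"] for event in events)

# Keep only DOIs with at least 3 events, like the cdf filter.
most_tweeted = [(doi, n) for doi, n in counts.most_common() if n >= 3]
print(most_tweeted)
```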

In [19]:
email = "info@front-matter.io"
rest = mrced2.restApi(email = email)

data = []
for index, row in cdf.iterrows():
    rest.runQuery(row)
    if rest.work is not None and date.fromisoformat(rest.work["posted"]) >= start_date:
        data.append(rest.work)
print(data)
    
tdf = pd.DataFrame(data, columns=['doi','tweets','archive','subject-area','covid','title','authors','abstract','posted'])
tdf.to_csv('1101/preprint_tweets_' + date.today().strftime('%Y-%m-%d') + '.csv') # strftime already returns a string

tdf.head(50)

REST API query started for 10.1101/2021.08.24.21262415...
REST API query complete  200
REST API query started for 10.1101/2021.08.19.21262139...
REST API query complete  200
REST API query started for 10.1101/2021.09.28.21264262...
REST API query complete  200
REST API query started for 10.1101/2021.09.22.21263977...
REST API query complete  200
REST API query started for 10.1101/2021.09.28.21264260...
REST API query complete  200
REST API query started for 10.1101/2021.09.13.21262182...
REST API query complete  200
REST API query started for 10.1101/2021.09.30.462488...
REST API query complete  200
REST API query started for 10.1101/2021.02.16.21251535...
REST API query complete  200
REST API query started for 10.1101/2021.07.23.21260998...
REST API query complete  200
REST API query started for 10.1101/2021.05.03.21256520...
REST API query complete  200
REST API query started for 10.1101/2021.07.31.21261387...
REST API query complete  200
REST API query started for 10.1101/2021.07.08

Unnamed: 0,doi,tweets,archive,subject-area,covid,title,authors,abstract,posted
0,10.1101/2021.10.08.463746,56,bioRxiv,Bioengineering,False,Deep-Learning Super-Resolution Microscopy Reve...,"[{'name': 'Rong Chen'}, {'name': 'Xiao Tang'},...",<p>Single-molecule localization microscopy (SM...,2021-10-09
1,10.1101/2021.10.02.21264267,39,medRxiv,Infectious Diseases (except HIV/AIDS),True,Year-long COVID-19 infection reveals within-ho...,"[{'name': 'Veronique Nussenblatt'}, {'name': '...",<sec><title>Background</title><p>B-cell deplet...,2021-10-05
2,10.1101/2021.10.06.21264535,37,medRxiv,Infectious Diseases (except HIV/AIDS),True,Analytical performance of eleven SARS-CoV-2 an...,"[{'name': 'Meriem Bekliz'}, {'name': 'Kenneth ...",<p>Global concerns arose as the emerged and ra...,2021-10-07
3,10.1101/2021.10.05.463267,26,bioRxiv,Genetics,False,Intercellular transport of RNA can limit herit...,"[{'name': 'Nathan Shugarts'}, {'name': 'Andrew...",<p>RNAs in circulation carry sequence-specific...,2021-10-06
4,10.1101/2021.10.04.21264500,22,medRxiv,Obstetrics and Gynecology,True,Increase in preterm stillbirths and reduction ...,"[{'name': 'Lisa Hui'}, {'name': 'Melvin Barrie...",<sec><title>Objectives</title><p>The COVID-19 ...,2021-10-05
5,10.1101/2021.10.07.463568,18,bioRxiv,Neuroscience,False,Molecular rhythm alterations in prefrontal cor...,"[{'name': 'Xiangning Xue'}, {'name': 'Wei Zong...",<p>Severe and persistent disruptions to sleep ...,2021-10-09
6,10.1101/2021.10.08.463671,15,bioRxiv,Molecular Biology,False,High-throughput mutagenesis identifies mutatio...,"[{'name': 'Mariela Cortés-López'}, {'name': 'L...",<p>During CART-19 immunotherapy for B-cell acu...,2021-10-08
7,10.1101/2021.10.07.463475,14,bioRxiv,Developmental Biology,False,Pseudo-dynamic analysis of heart tube formatio...,"[{'name': 'Isaac Esteban'}, {'name': 'Patrick ...",<p>Understanding organ morphogenesis requires ...,2021-10-09
8,10.1101/2021.10.07.463556,14,bioRxiv,Microbiology,False,Large-scale discovery of candidate type VI sec...,"[{'name': 'Alexander Martin Geller'}, {'name':...",<p>Type VI secretion systems (T6SS) are common...,2021-10-07
9,10.1101/2021.10.04.462880,13,bioRxiv,Immunology,False,Nanoengineered DNA origami with repurposed TOP...,"[{'name': 'Keying Zhu'}, {'name': 'Yang Wang'}...","<p>Targeting myeloid cells, especially microgl...",2021-10-04
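Only preprints posted within the 7-day window reach the table, because the loop compares each `posted` date with `start_date`. The sketch below isolates that filter. `posted_since` is a hypothetical helper, the reference date is fixed to the run date shown in this notebook, and the second record is made up to show an entry being dropped.

```python
from datetime import date, timedelta

def posted_since(works, start):
    """Keep works whose ISO 'posted' date is on or after start (hypothetical helper)."""
    return [
        work for work in works
        if work is not None and date.fromisoformat(work["posted"]) >= start
    ]

today = date(2021, 10, 11)         # run date of this notebook
start = today - timedelta(days=7)  # 2021-10-04

works = [
    {"doi": "10.1101/2021.10.08.463746", "posted": "2021-10-09"},
    {"doi": "10.1101/example-old", "posted": "2021-08-25"},  # made-up older record
    {"doi": "10.1101/2021.10.04.462880", "posted": "2021-10-04"},
    None,  # a failed lookup
]

recent = posted_since(works, start)
print([work["doi"] for work in recent])
```

The comparison needs `start` to be a `date` object: comparing a `date` with a `'%Y-%m-%d'` string raises `TypeError`, which is presumably why `start_date` is redefined without `strftime` before the pages are fetched.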


### Tweets of bioRxiv and medRxiv preprints

In [20]:
num_rows = tdf['archive'].count()
num_covid = int(tdf['covid'].eq(True).sum()) # preprints flagged as covering SARS-CoV-2
num_biorxiv = int(tdf['archive'].eq('bioRxiv').sum()) # count by label rather than by position in value_counts()
num_medrxiv = int(tdf['archive'].eq('medRxiv').sum())
end_date = date.today().strftime('%Y-%m-%d')
max_count = tdf['tweets'].max()

md('{} preprints (including {} covering SARS-CoV-2, {} from bioRxiv and {} from medRxiv) published in the last 7 days before {} had been tweeted at least 3 times (maximum {}).'.format(num_rows, num_covid, num_biorxiv, num_medrxiv, end_date, max_count))

58 preprints (including 6 covering SARS-CoV-2, 52 from bioRxiv and 6 from medRxiv) published in the last 7 days before 2021-10-11 had been tweeted at least 3 times (maximum 56).