# Tweets from bioRxiv and medRxiv

Query Crossref Event Data for Twitter events from the last 7 days that reference DOIs with the Cold Spring Harbor Laboratory (CSHL) prefix 10.1101, then identify the most tweeted preprints.

In [11]:
import sys
sys.path.insert(0, '..') # also search the parent folder for modules

import pandas as pd # data analysis library
import json
import mrced2 # module to run event data queries
import os # some file manipulations
import math # some number manipulations
import altair.vegalite.v3 as alt # some data visualizations
from IPython.display import Markdown as md # some markdown manipulations
from datetime import datetime, date, timedelta # some date manipulations

In [12]:
email = "info@front-matter.io"
prefix = "10.1101"
# format both ends of the date range the same way for the query string
start_date = (date.today() - timedelta(days = 7)).strftime('%Y-%m-%d')
end_date = date.today().strftime('%Y-%m-%d')

In [13]:
ed = mrced2.eventData(email = email)
ed.buildQuery({'obj-id.prefix' : prefix, 'source': 'twitter', 'rows': 0,'from-occurred-date' : start_date, 'until-occurred-date' : end_date})

https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&obj-id.prefix=10.1101&source=twitter&rows=0&from-occurred-date=2021-05-10&until-occurred-date=2021-05-17


In [14]:
ed.runQuery(retry = 5)

Event Data query started...
API query complete  200
output file written to 1101/tweets.json


In [15]:
pages = math.ceil(ed.events.getHits() / 1000)

17613 events found
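The page count follows from the 1000-row page size used in the next cell: 17613 events need 17 full pages plus one partial page. A minimal sketch of that arithmetic, where `page_count` is a hypothetical helper and not part of `mrced2`:

```python
import math

def page_count(total_hits: int, rows_per_page: int = 1000) -> int:
    """Return how many result pages are needed to fetch total_hits events."""
    if rows_per_page <= 0:
        raise ValueError("rows_per_page must be positive")
    # ceil rounds the last, partially filled page up to a whole request
    return math.ceil(total_hits / rows_per_page)

pages = page_count(17613)  # 17 full pages plus one partial page
```

Rounding up rather than down matters here: integer division would silently drop the last 613 events.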


In [16]:
email = "info@front-matter.io"
prefix = "10.1101"
start_date = date.today() - timedelta(days = 7)
end_date = date.today()

# fetch all result pages for the search
ed = mrced2.eventData(email = email)
ed.getAllPages(pages, {'rows': 1000, 'obj-id.prefix' : prefix, 'source': 'twitter', 'from-occurred-date' : start_date, 'until-occurred-date' : end_date}, fileprefix = '1101/tweets_') 

https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-05-10&until-occurred-date=2021-05-17
Event Data query started...
API query complete  200
output file written to 1101/tweets_0000.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=a9be10b3-6f93-4e21-849b-9bd325251ec3&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-05-10&until-occurred-date=2021-05-17
Event Data query started...
API query complete  200
output file written to 1101/tweets_0001.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=029bca72-5570-4f4e-ba44-3a0ff25ffa2a&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2021-05-10&until-occurred-date=2021-05-17
Event Data query started...
API query complete  200
output file written to 1101/tweets_0002.json
https://api.eventdata.crossref.org/v1/events?mailto=Anonymous&cursor=65439495-4d8f-47e2-9031-7e9d1d37f608&rows=
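The URLs above show that every request after the first carries a `cursor` parameter. How `mrced2.getAllPages` handles this internally is not shown here, so the following is only a sketch of cursor-based paging in general. It assumes each response holds its events under `message.events` and the cursor for the next page under `message.next-cursor`; `iter_pages` and `fake_fetch` are hypothetical names, and the fake pages stand in for real HTTP responses.

```python
from typing import Callable, Dict, Iterator, List, Optional

def iter_pages(fetch: Callable[[Optional[str]], Dict], max_pages: int) -> Iterator[List[dict]]:
    """Yield the events of each page, following the cursor until it runs out."""
    cursor = None  # the first request is sent without a cursor
    for _ in range(max_pages):
        message = fetch(cursor)["message"]
        yield message["events"]
        cursor = message.get("next-cursor")  # assumed field name
        if not cursor:
            break  # no further pages

# Stand-in for the HTTP call: two fake pages keyed by cursor (illustrative data only)
FAKE_PAGES = {
    None: {"message": {"events": [{"id": "a"}, {"id": "b"}], "next-cursor": "c1"}},
    "c1": {"message": {"events": [{"id": "c"}], "next-cursor": None}},
}

def fake_fetch(cursor: Optional[str]) -> Dict:
    return FAKE_PAGES[cursor]

all_events = [event for page in iter_pages(fake_fetch, max_pages=18) for event in page]
```

Passing the fetch function in as an argument keeps the paging logic testable without network access.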

Initialise an event record and merge the downloaded result pages so that the properties of the results can be inspected.

In [17]:
jd1 = mrced2.eventRecord() # instance of a class to interpret the events
files = os.listdir('1101') # get all the filenames

jd1.mergeJsons(files, folder = '1101') # load the json event data from multiple files

failed to load preprint_tweets_2021-05-10.csv
failed to load preprint_tweets_2021-05-17.csv
failed to load .DS_Store
failed to load .gitkeep
failed to load tweets_sorted.json
failed to load preprint_tweets_2021-05-09.csv
failed to load preprint_tweets_2021-05-08.csv
output file written to 1101/tweets.json
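The "failed to load" messages appear because `os.listdir('1101')` returns every file in the folder, including CSV exports and hidden files. One way to avoid them, sketched below, is to keep only the numbered page files before merging. The pattern is an assumption based on the `tweets_0000.json` naming seen in the download log, and `page_files` is a hypothetical helper, not part of `mrced2`.

```python
from fnmatch import fnmatch

def page_files(filenames, pattern="tweets_[0-9]*.json"):
    """Keep only the numbered result pages, sorted by name."""
    return sorted(name for name in filenames if fnmatch(name, pattern))

# Example listing modelled on the folder contents reported above
listing = [
    "preprint_tweets_2021-05-10.csv",
    ".DS_Store",
    ".gitkeep",
    "tweets_sorted.json",
    "tweets_0001.json",
    "tweets_0000.json",
    "tweets.json",
]
selected = page_files(listing)
```

The filtered list could then be passed to `mergeJsons` in place of the raw directory listing, which also keeps the merged `tweets.json` from being read back in on a later run.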


In [18]:
js = json.load(open("1101/tweets.json"))
df = pd.json_normalize(js, record_path = ['message', 'events'])
gdf = df.groupby(['obj_id']).size().reset_index(name='count').sort_values('count', ascending=False)
cdf = gdf[gdf['count'] >= 5]
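The cell above counts events per `obj_id`, sorts by count in descending order, and keeps preprints with at least 5 tweets. The same logic can be sketched in plain Python, which may help when checking the result on a small sample. The DOIs below are made-up placeholders and `most_tweeted` is a hypothetical helper.

```python
from collections import Counter

def most_tweeted(events, min_count=5):
    """Count events per obj_id and keep those with at least min_count, most tweeted first."""
    counts = Counter(event["obj_id"] for event in events)
    return [(obj_id, n) for obj_id, n in counts.most_common() if n >= min_count]

# Illustrative events only: three placeholder DOIs tweeted 6, 2 and 5 times
sample_events = (
    [{"obj_id": "https://doi.org/10.1101/example.1"}] * 6
    + [{"obj_id": "https://doi.org/10.1101/example.2"}] * 2
    + [{"obj_id": "https://doi.org/10.1101/example.3"}] * 5
)
top = most_tweeted(sample_events)
```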

In [23]:
email = "info@front-matter.io"
rest = mrced2.restApi(email = email)

data = []

for index, row in cdf.iterrows():
    rest.runQuery(row)
    if rest.work is not None and date.fromisoformat(rest.work["posted"]) >= start_date:
        data.append(rest.work)
    
tdf = pd.DataFrame(data, columns=['doi','tweets','archive','subject-area','covid','title','authors','abstract','posted'])
tdf.to_csv('1101/preprint_tweets_' + date.today().strftime('%Y-%m-%d') + '.csv')

tdf.head(50)

REST API query started for 10.1101/2021.03.11.435000...
REST API query complete  200
REST API query started for 10.1101/2021.05.06.21256403...
REST API query complete  200
REST API query started for 10.1101/2021.02.27.433180...
REST API query complete  200
REST API query started for 10.1101/2021.03.22.436441...
REST API query complete  200
REST API query started for 10.1101/2021.05.07.21256539...
REST API query complete  200
REST API query started for 10.1101/2021.05.11.21256578...
REST API query complete  200
REST API query started for 10.1101/2021.05.06.21256755...
REST API query complete  200
REST API query started for 10.1101/2021.05.12.21257080...
REST API query complete  200
REST API query started for 10.1101/2021.05.11.21257037...
REST API query complete  200
REST API query started for 10.1101/2021.05.07.21256854...
REST API query complete  200
REST API query started for 10.1101/2021.05.11.443151...
REST API query complete  200
REST API query started for 10.1101/2021.03.01.21252

Unnamed: 0,doi,tweets,archive,subject-area,covid,title,authors,abstract,posted
0,10.1101/2021.05.07.21256539,312,medRxiv,Pediatrics,True,Immune profile of children with post-acute seq...,"[{'name': 'Gabriele Di Sante'}, {'name': 'Dani...",<p>There is increasing reporting by patients’ ...,2021-05-10
1,10.1101/2021.05.11.21256578,295,medRxiv,Infectious Diseases (except HIV/AIDS),True,Live virus neutralisation testing in convalesc...,"[{'name': 'Claudia Gonzalez'}, {'name': 'Carla...",<sec><title>Background</title><p>SARS-CoV-2 mu...,2021-05-11
2,10.1101/2021.05.06.21256755,280,medRxiv,Epidemiology,True,Clinical coding of long COVID in English prima...,"[{'name': ' '}, {'name': 'Alex J Walker'}, {'n...",<sec><title>Background</title><p>Long COVID is...,2021-05-13
3,10.1101/2021.05.12.21257080,251,medRxiv,Genetic and Genomic Medicine,True,A year of genomic surveillance reveals how the...,"[{'name': 'Eduan Wilkinson'}, {'name': 'Marta ...",<p>The progression of the SARS-CoV-2 pandemic ...,2021-05-13
4,10.1101/2021.05.11.21257037,247,medRxiv,Infectious Diseases (except HIV/AIDS),True,Mental health of Adolescents in the Pandemic: ...,"[{'name': 'Judith Blankenburg'}, {'name': 'Mag...",<sec><title>Backround</title><p>Post-COVID19 c...,2021-05-11
5,10.1101/2021.05.07.21256854,238,medRxiv,Endocrinology (including Diabetes Mellitus and...,False,Association of machine learning-derived measur...,"[{'name': 'Saaket Agrawal'}, {'name': 'Marcus ...",<sec><title>Background</title><p>Obesity is de...,2021-05-10
6,10.1101/2021.05.11.443151,200,bioRxiv,Neuroscience,False,Evaluating brain parcellations using the dista...,"[{'name': 'Da Zhi'}, {'name': 'Maedbh King'}, ...",<p>An important goal of human brain mapping is...,2021-05-11
7,10.1101/2021.05.08.21256866,133,medRxiv,Infectious Diseases (except HIV/AIDS),True,Antibody Responses After a Single Dose of ChAd...,"[{'name': 'Sebastian Havervall'}, {'name': 'Ul...",<sec><title>Background</title><p>Recent report...,2021-05-11
8,10.1101/2021.05.06.21256788,115,medRxiv,Infectious Diseases (except HIV/AIDS),True,Rapid detection of neutralizing antibodies to ...,"[{'name': 'Kei Miyakawa'}, {'name': 'Sundarara...",<p>The uncontrolled spread of the COVID-19 pan...,2021-05-10
9,10.1101/2021.05.11.21256877,106,medRxiv,Infectious Diseases (except HIV/AIDS),True,A blood atlas of COVID-19 defines hallmarks of...,"[{'name': ' '}, {'name': 'David J Ahern'}, {'n...",<p>Treatment of severe COVID-19 is currently l...,2021-05-11
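The loop in the cell above keeps a preprint only if it was posted on or after `start_date`, which is why older DOIs that appear in the query log are missing from the table. A minimal sketch of that filter follows, assuming each work is a dict with an ISO-formatted `posted` field as in the table; `posted_since` and the sample records are hypothetical. The comparison needs `start_date` to be a `date` object, since comparing a `date` with a string raises a `TypeError`.

```python
from datetime import date, timedelta

def posted_since(works, start_date):
    """Keep works whose 'posted' date is on or after start_date; skip failed lookups (None)."""
    return [
        work for work in works
        if work is not None and date.fromisoformat(work["posted"]) >= start_date
    ]

# Fixed reference day so the example is reproducible (illustrative values only)
today = date(2021, 5, 17)
start = today - timedelta(days=7)  # 2021-05-10

works = [
    {"doi": "10.1101/example.a", "posted": "2021-05-10"},  # on the boundary: kept
    {"doi": "10.1101/example.b", "posted": "2021-03-11"},  # too old: dropped
    None,                                                  # failed lookup: dropped
    {"doi": "10.1101/example.c", "posted": "2021-05-13"},  # recent: kept
]
recent = posted_since(works, start)
```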


### Tweets of bioRxiv and medRxiv preprints

In [24]:
num_rows = tdf['archive'].count()
# count by label rather than by position in the sorted value counts
num_covid = int((tdf['covid'] == True).sum())
num_biorxiv = int((tdf['archive'] == 'bioRxiv').sum())
num_medrxiv = int((tdf['archive'] == 'medRxiv').sum())
end_date = date.today().strftime('%Y-%m-%d')
max_count = tdf['tweets'].max()

md('{} preprints (including {} covering SARS-CoV-2, {} from bioRxiv and {} from medRxiv) published in the last 7 days before {} had been tweeted at least 5 times (maximum {}).'.format(num_rows, num_covid, num_biorxiv, num_medrxiv, end_date, max_count))

256 preprints (including 63 covering SARS-CoV-2, 196 from bioRxiv and 60 from medRxiv) published in the last 7 days before 2021-05-17 had been tweeted at least 5 times (maximum 312).