# Tweets from bioRxiv and medRxiv

Retrieve Crossref Event Data events for the CSHL DOI prefix 10.1101 (bioRxiv and medRxiv) from the last 7 days, and identify the most-tweeted preprints.

In [1]:
import sys
sys.path.insert(0, '..') # make the parent folder importable (needed for mrced2)

import pandas as pd # data analysis library
import json
import mrced2 # module to run event data queries
import os # some file manipulations
import math # some number manipulations
import altair.vegalite.v3 as alt # some data visualizations
from IPython.display import Markdown as md # some markdown manipulations
from datetime import datetime, date, timedelta # some date manipulations

In [2]:
email = "info@front-matter.io"
prefix = "10.1101"
start_date = (date.today() - timedelta(days = 7)).strftime('%Y-%m-%d')
end_date = date.today()

In [3]:
ed = mrced2.eventData(mailto = email)
ed.buildQuery({'obj-id.prefix' : prefix, 'source': 'twitter', 'rows': 500,'from-occurred-date' : start_date, 'until-occurred-date' : end_date})

https://api.eventdata.crossref.org/v1/events?mailto=info@front-matter.io&obj-id.prefix=10.1101&source=twitter&rows=500&from-occurred-date=2022-01-11&until-occurred-date=2022-01-18
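The URL printed above is the events endpoint followed by the filters encoded as a query string. As a minimal sketch, and without claiming this is how `mrced2` builds it internally (the notebook does not show that), the same URL can be assembled with the standard library. `build_events_url` is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.eventdata.crossref.org/v1/events"


def build_events_url(mailto, params):
    """Hypothetical helper: append mailto and the filters as a query string."""
    query = {"mailto": mailto, **params}
    # safe="@" keeps the e-mail address unescaped, as in the URL printed above
    return BASE_URL + "?" + urlencode(query, safe="@")


url = build_events_url(
    "info@front-matter.io",
    {
        "obj-id.prefix": "10.1101",
        "source": "twitter",
        "rows": 500,
        "from-occurred-date": "2022-01-11",
        "until-occurred-date": "2022-01-18",
    },
)
print(url)
```

Building the query from a dictionary keeps each filter visible and avoids hand-escaping values.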


In [4]:
ed.runQuery(retry = 5)

Event Data query started...
API query complete  200
output file written to 1101/tweets.json


In [5]:
pages = math.ceil(ed.events.getHits() / 1000)

28505 events found
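The page count is the number of hits divided by the page size, rounded up so that the last, partially filled page is still requested. A small worked sketch with the figures above (`page_count` is a hypothetical helper name):

```python
import math


def page_count(total_hits, rows_per_page):
    """Number of requests needed to retrieve total_hits events."""
    return math.ceil(total_hits / rows_per_page)


# 28505 events at 1000 rows per page: 28 full pages plus one partial page
print(page_count(28505, 1000))
```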


In [6]:
email = "info@front-matter.io"
prefix = "10.1101"
start_date = date.today() - timedelta(days = 7)
end_date = date.today()

# fetch all result pages for the search
ed = mrced2.eventData(mailto = email)
ed.getAllPages(pages, {'rows': 1000, 'obj-id.prefix' : prefix, 'source': 'twitter', 'from-occurred-date' : start_date, 'until-occurred-date' : end_date}, fileprefix = '1101/tweets_') 

https://api.eventdata.crossref.org/v1/events?mailto=info@front-matter.io&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2022-01-11&until-occurred-date=2022-01-18
Event Data query started...
API query complete  200
output file written to 1101/tweets_0000.json
https://api.eventdata.crossref.org/v1/events?mailto=info@front-matter.io&cursor=7ed9e3d1-fc1f-447a-b7ca-6fde4577d334&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2022-01-11&until-occurred-date=2022-01-18
Event Data query started...
API query complete  200
output file written to 1101/tweets_0001.json
https://api.eventdata.crossref.org/v1/events?mailto=info@front-matter.io&cursor=94591669-2c17-400f-b198-4d65812e8395&rows=1000&obj-id.prefix=10.1101&source=twitter&from-occurred-date=2022-01-11&until-occurred-date=2022-01-18
Event Data query started...
API query complete  200
output file written to 1101/tweets_0002.json
https://api.eventdata.crossref.org/v1/events?mailto=info@front-matter.io&curso
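The log above shows cursor-based pagination: every request after the first carries a `cursor` value taken from the previous response. The loop below is a minimal sketch of that pattern, not the `mrced2` implementation. It assumes the response JSON exposes the next cursor as `message["next-cursor"]` and the events as `message["events"]`; `fetch_page` is a hypothetical callable standing in for the HTTP request, so the example runs offline.

```python
def fetch_all_events(fetch_page, max_pages):
    """Follow cursors until none is returned or max_pages is reached.

    fetch_page is a hypothetical callable that takes a cursor (None for the
    first page) and returns the decoded JSON response as a dict.
    """
    events, cursor = [], None
    for _ in range(max_pages):
        message = fetch_page(cursor)["message"]
        events.extend(message.get("events", []))
        cursor = message.get("next-cursor")  # assumed field name
        if not cursor:
            break  # no further pages
    return events


# Stand-in responses keyed by cursor, replacing real API calls.
fake_pages = {
    None: {"message": {"events": [1, 2], "next-cursor": "a"}},
    "a": {"message": {"events": [3, 4], "next-cursor": "b"}},
    "b": {"message": {"events": [5], "next-cursor": None}},
}
all_events = fetch_all_events(fake_pages.get, max_pages=10)
print(len(all_events))
```

Capping the loop with `max_pages` mirrors the precomputed `pages` value and guards against an endpoint that keeps returning cursors.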

Initialise an event record object to inspect the properties of the results.

In [7]:
jd1 = mrced2.eventRecord() # instance of a class to interpret the events
files = os.listdir('1101') # get all the filenames

jd1.mergeJsons(files, folder = '1101') # load the json event data from multiple files

failed to load preprint_tweets_2021-10-18.csv
failed to load preprint_tweets_2021-10-25.csv
failed to load preprint_tweets_2021-07-29.csv
failed to load preprint_tweets_2021-12-20.csv
failed to load preprint_tweets_2021-09-06.csv
failed to load preprint_tweets_2021-09-13.csv
failed to load preprint_tweets_2021-12-27.csv
failed to load preprint_tweets_2021-10-23.csv
failed to load .gitkeep
failed to load preprint_tweets_2021-08-30.csv
failed to load preprint_tweets_2021-11-01.csv
failed to load preprint_tweets_2021-11-15.csv
failed to load preprint_tweets_2022-01-18.csv
failed to load preprint_tweets_2021-11-29.csv
failed to load preprint_tweets_2021-08-09.csv
failed to load preprint_tweets_2021-08-23.csv
failed to load preprint_tweets_2022-01-03.csv
failed to load .tweets.json.icloud
failed to load preprint_tweets_2021-11-08.csv
failed to load preprint_tweets_2022-01-10.csv
failed to load preprint_tweets_2021-08-02.csv
failed to load preprint_tweets_2021-08-16.csv
failed to load prepri
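The "failed to load" messages appear because `os.listdir` returns every file in the folder, including the CSV exports and hidden files, and only the JSON files can be parsed. A minimal sketch of filtering the listing first is shown below; `list_json_files` is a hypothetical helper, and whether `tweets.json` from the first query should also be excluded from the merge is not clear from the notebook.

```python
import os
import tempfile


def list_json_files(folder):
    """Return the sorted names of regular *.json files in folder."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.endswith(".json") and os.path.isfile(os.path.join(folder, name))
    )


# Demonstration on a temporary folder that mimics the mixed contents of '1101'.
with tempfile.TemporaryDirectory() as tmp:
    for name in [
        "tweets_0000.json",
        "tweets_0001.json",
        ".gitkeep",
        ".tweets.json.icloud",
        "preprint_tweets_2022-01-18.csv",
    ]:
        open(os.path.join(tmp, name), "w").close()
    json_files = list_json_files(tmp)

print(json_files)
```

The filtered list could then be passed to `jd1.mergeJsons(files, folder = '1101')` in place of the raw directory listing.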

In [8]:
with open("1101/tweets.json") as f: # close the file once it has been read
    js = json.load(f)
df = pd.json_normalize(js, record_path = ['message', 'events'])
gdf = df.groupby(['obj_id']).size().reset_index(name='count').sort_values('count', ascending=False)
cdf = gdf[gdf['count'] >= 3]
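The cell above counts events per preprint and keeps those with at least 3 tweets. The same steps on a tiny, made-up frame (the `obj_id` values are placeholders, not real DOIs) show what each stage produces:

```python
import pandas as pd

# One row per tweet event; obj_id identifies the tweeted preprint (placeholder values).
events = pd.DataFrame({"obj_id": ["a", "b", "a", "c", "a", "b", "b", "a"]})

counts = (
    events.groupby("obj_id")
    .size()                              # number of events per preprint
    .reset_index(name="count")           # back to a two-column DataFrame
    .sort_values("count", ascending=False)
)
frequent = counts[counts["count"] >= 3]  # keep preprints tweeted at least 3 times
print(frequent)
```

Here `a` has 4 events and `b` has 3, so both are kept, while `c` with a single event is dropped.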

In [9]:
email = "info@front-matter.io"
rest = mrced2.restApi(email = email)

data = []
for index, row in cdf.iterrows():
    rest.runQuery(row)
    if rest.work is not None and date.fromisoformat(rest.work["posted"]) >= start_date:
        data.append(rest.work)
print(data)
    
tdf = pd.DataFrame(data, columns=['doi','tweets','archive','subject-area','covid','title','authors','abstract','posted'])
tdf.to_csv('1101/preprint_tweets_' + date.today().strftime('%Y-%m-%d') + '.csv')

tdf.head(50)

REST API query started for 10.1101/2021.12.20.21267966...
REST API query complete  200
REST API query started for 10.1101/2022.01.05.22268800...
REST API query complete  200
REST API query started for 10.1101/2021.07.08.21260210...
REST API query complete  200
REST API query started for 10.1101/2022.01.10.22269010...
REST API query complete  200
REST API query started for 10.1101/2021.06.11.21258690...
REST API query complete  200
REST API query started for 10.1101/2021.05.03.21256520...
REST API query complete  200
REST API query started for 10.1101/2021.12.18.21268018...
REST API query complete  200
REST API query started for 10.1101/2021.08.24.21262415...
REST API query complete  200
REST API query started for 10.1101/2021.12.17.473248...
REST API query complete  200
REST API query started for 10.1101/2021.08.30.21262866...
REST API query complete  200
REST API query started for 10.1101/2022.01.07.475305...
REST API query complete  200
REST API query started for 10.1101/2022.01.12.4

| | doi | tweets | archive | subject-area | covid | title | authors | abstract | posted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.1101/2022.01.10.22269010 | 2384 | medRxiv | Infectious Diseases (except HIV/AIDS) | True | Infectious viral load in unvaccinated and vacc... | `[{'name': 'Olha Puhach'}, {'name': 'Kenneth Ad...` | `<sec><title>Background</title><p>Viral load (V...` | 2022-01-11 |
| 1 | 10.1101/2022.01.12.476031 | 368 | bioRxiv | Microbiology | True | The SARS-CoV-2 Omicron (B.1.1.529) variant exh... | `[{'name': 'Shuofeng Yuan'}, {'name': 'Zi-Wei Y...` | `<p>The newly emerging SARS-CoV-2 Omicron (B.1....` | 2022-01-13 |
| 2 | 10.1101/2022.01.14.476225 | 104 | bioRxiv | Cancer Biology | False | Single-cell multi-omics of human clonal hemato... | `[{'name': 'Anna S. Nam'}, {'name': 'Neville Du...` | `<p>Somatic mutations in cancer genes have been...` | 2022-01-16 |
| 3 | 10.1101/2022.01.11.475838 | 93 | bioRxiv | Bioinformatics | False | Lightweight compositional analysis of metageno... | `[{'name': 'Luiz Carlos Irber'}, {'name': 'Phil...` | `<p>The identification of reference genomes and...` | 2022-01-12 |
| 4 | 10.1101/2022.01.11.475793 | 88 | bioRxiv | Ecology | False | Global monitoring of soil animal communities u... | `[{'name': 'Anton M. Potapov'}, {'name': 'Xin S...` | `<p>Here we introduce the Soil BON Foodweb Team...` | 2022-01-12 |
| 5 | 10.1101/2022.01.13.476251 | 84 | bioRxiv | Genomics | False | Mixing genome annotation methods in a comparat... | `[{'name': 'Caroline M. Weisman'}, {'name': 'An...` | `<p>Comparisons of genomes of different species...` | 2022-01-15 |
| 6 | 10.1101/2022.01.11.475254 | 76 | bioRxiv | Bioinformatics | False | Identifying and correcting repeat-calling erro... | `[{'name': 'Kar-Tong Tan'}, {'name': 'Michael K...` | `<p>Nanopore long-read genome sequencing is eme...` | 2022-01-12 |
| 7 | 10.1101/2022.01.10.475719 | 74 | bioRxiv | Neuroscience | False | Learning in reverse: Dopamine errors drive exc... | `[{'name': 'Benjamin M. Seitz'}, {'name': 'Ivy ...` | `<p>For over two decades, midbrain dopamine was...` | 2022-01-12 |
| 8 | 10.1101/2022.01.11.475728 | 73 | bioRxiv | Bioinformatics | False | Predicting patient treatment response and resi... | `[{'name': 'Sanju Sinha'}, {'name': 'Rahulsimha...` | `<p>Tailoring the best treatments to cancer pat...` | 2022-01-12 |
| 9 | 10.1101/2022.01.13.476171 | 65 | bioRxiv | Genomics | False | A high-resolution comparative atlas across 74 ... | `[{'name': 'Elise Parey'}, {'name': 'Alexandra ...` | `<p>Teleost fish are one of the most species-ri...` | 2022-01-13 |
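The loop in cell 9 keeps only preprints whose `posted` date falls inside the 7-day window. The comparison needs `start_date` to be a `date` object, as redefined in cell 6, because `date.fromisoformat` returns a `date` that cannot be compared with the string form used in cell 2. A minimal sketch of the filter follows; it uses a fixed reference day for reproducibility, and the second record is an invented placeholder, not a real preprint.

```python
from datetime import date, timedelta


def posted_since(work, start):
    """True if the work's ISO-formatted 'posted' date is on or after start."""
    return date.fromisoformat(work["posted"]) >= start


reference_day = date(2022, 1, 18)            # fixed day instead of date.today()
start = reference_day - timedelta(days=7)    # 2022-01-11

works = [
    {"doi": "10.1101/2022.01.10.22269010", "posted": "2022-01-11"},
    {"doi": "10.1101/placeholder-older", "posted": "2021-12-21"},  # invented record
]
recent = [w["doi"] for w in works if posted_since(w, start)]
print(recent)
```

The boundary day itself is included because the comparison is `>=`.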


### Tweets of bioRxiv and medRxiv preprints

In [10]:
num_rows = tdf['archive'].count()
num_covid = int((tdf['covid'] == True).sum()) # count rows flagged as covering SARS-CoV-2
num_biorxiv = int((tdf['archive'] == 'bioRxiv').sum()) # select by label rather than by sort position
num_medrxiv = int((tdf['archive'] == 'medRxiv').sum())
end_date = date.today().strftime('%Y-%m-%d')
max_count = tdf['tweets'].max()

md('{} preprints (including {} covering SARS-CoV-2, {} from bioRxiv and {} from medRxiv) published in the last 7 days before {} had been tweeted at least 3 times (maximum {}).'.format(num_rows, num_covid, num_biorxiv, num_medrxiv, end_date, max_count))

292 preprints (including 35 covering SARS-CoV-2, 252 from bioRxiv and 40 from medRxiv) published in the last 7 days before 2022-01-18 had been tweeted at least 3 times (maximum 2384).