# bio.tools to text mining
This Jupyter notebook extracts the publications of software tools registered in bio.tools and keeps those that are open access and therefore amenable to text mining.


This may eventually become a *polyglot* notebook, combining Python and R code snippets. To allow for execution of R code, we first load the rpy2 package:

In [None]:
%load_ext rpy2.ipython

Then load the necessary libraries (not all of them are needed in the current notebook, but they will be as it grows):

In [None]:
%%R
library(jsonlite)
library(httr)
library(stringr)
if(!requireNamespace("europepmc", quietly = TRUE)) install.packages("europepmc") # install only if missing
library(europepmc)

# Get a subset of bio.tools publications
First extract tool names and corresponding publications from bio.tools:

In [None]:
%%R
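toolnames <- c()
toolpmids <- c()
page <- 1
repeat {
  # Fetch one page of tools annotated with the topic "proteomics"
  resp <- GET(paste0('https://bio.tools/api/tool/?topic=%22', 'proteomics', '%22&format=json&page=', page))
  if(status_code(resp) != 200) break # assumption: a page past the last one returns an error status
  tools <- content(resp, as='parsed')$list
  if(length(tools) == 0) break # no more tools to read
  for(t in tools) {
    for(p in t$publication) {
      # Keep only publications that have a PubMed ID
      if(length(p$pmid)) {
        toolnames <- c(toolnames, t$name)
        toolpmids <- c(toolpmids, p$pmid)
      }
    }
  }
  page <- page+1
}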
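
For the Python side of the notebook, the same paging logic could be sketched roughly as follows. This is only a sketch under assumptions: the helper names `page_url`, `extract_pairs`, `fetch_page` and `collect_pairs` are hypothetical, and the code assumes that each page is a JSON object with a `list` of tools and a `next` field that is empty on the last page, which should be checked against the bio.tools API documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://bio.tools/api/tool/"


def page_url(topic, page):
    """Build the URL for one page of tools annotated with the given topic."""
    query = urlencode({"topic": f'"{topic}"', "format": "json", "page": page})
    return f"{BASE_URL}?{query}"


def extract_pairs(payload):
    """Return (tool name, PMID) pairs for every publication that has a PMID."""
    pairs = []
    for tool in payload.get("list", []):
        for pub in tool.get("publication") or []:
            if pub.get("pmid"):
                pairs.append((tool["name"], pub["pmid"]))
    return pairs


def fetch_page(url):
    """Download and decode one page (performs a network request)."""
    with urlopen(url) as response:
        return json.load(response)


def collect_pairs(topic, fetch=fetch_page):
    """Walk the pages until the (assumed) `next` field is empty."""
    pairs, page = [], 1
    while True:
        payload = fetch(page_url(topic, page))
        pairs.extend(extract_pairs(payload))
        if not payload.get("next"):
            break
        page += 1
    return pairs
```

Calling `collect_pairs("proteomics")` would perform the network requests. Passing a different `fetch` function makes it possible to try the logic on saved pages without contacting the server.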

Then check, using Europe PMC, whether each publication is open access:

In [None]:
%%R
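tooloa <- c()
for(pmid in toolpmids) {
    # Look up the PubMed record for this PMID in Europe PMC
    hit <- suppressMessages(epmc_search(query = paste0('EXT_ID:', pmid, ' AND SRC:MED'), output = 'parsed'))
    # Record exactly one flag per PMID so that tooloa stays aligned with toolpmids;
    # publications not found in Europe PMC are treated as not open access
    flag <- if(is.null(hit) || NROW(hit) == 0) 'N' else hit$isOpenAccess[1]
    tooloa <- c(tooloa, flag)
}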
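
A Python counterpart of this check could query the Europe PMC REST search service directly, as in the sketch below. The helper names (`epmc_url`, `open_access_flag`, `fetch_json`, `check_open_access`) are hypothetical, and treating a PMID with no matching record as not open access is an assumption made here, not something the service prescribes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def epmc_url(pmid):
    """Build a search URL restricted to the PubMed (MED) record of a PMID."""
    query = urlencode({"query": f"EXT_ID:{pmid} AND SRC:MED", "format": "json"})
    return f"{EPMC_SEARCH}?{query}"


def open_access_flag(payload):
    """Return 'Y' or 'N' for a search response; 'N' if no record was found."""
    results = payload.get("resultList", {}).get("result", [])
    if not results:
        return "N"  # assumption: unknown publications count as not open access
    return results[0].get("isOpenAccess", "N")


def fetch_json(url):
    """Download and decode a JSON response (performs a network request)."""
    with urlopen(url) as response:
        return json.load(response)


def check_open_access(pmids, fetch=fetch_json):
    """Return one flag per PMID, in the same order as the input."""
    return [open_access_flag(fetch(epmc_url(pmid))) for pmid in pmids]
```

Returning exactly one flag per PMID keeps the result aligned with the list of PMIDs, which the later filtering step relies on.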

Count what fraction of tool publications are open access:

In [None]:
%%R
sum(tooloa=="Y")/(sum(tooloa=="Y") + sum(tooloa=="N"))

And how many open access publications we have:

In [None]:
%%R
sum(tooloa=="Y")

Keep only those tool-publication pairs where the publications are open access:

In [None]:
%%R
toolpmids <- toolpmids[tooloa=="Y"]
toolnames <- toolnames[tooloa=="Y"]
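
In Python, the same counting and filtering could be done by zipping the three parallel sequences, as in this minimal sketch. The function names `open_access_fraction` and `keep_open_access` are hypothetical, and the flags are assumed to be the strings "Y" and "N".

```python
def open_access_fraction(flags):
    """Fraction of 'Y' flags among all 'Y' and 'N' flags."""
    yes = sum(1 for flag in flags if flag == "Y")
    no = sum(1 for flag in flags if flag == "N")
    if yes + no == 0:
        raise ValueError("no 'Y' or 'N' flags to count")
    return yes / (yes + no)


def keep_open_access(names, pmids, flags):
    """Return the (name, PMID) pairs whose flag is 'Y'."""
    if not (len(names) == len(pmids) == len(flags)):
        raise ValueError("names, pmids and flags must have the same length")
    return [(name, pmid) for name, pmid, flag in zip(names, pmids, flags) if flag == "Y"]
```

The length check guards against the flags drifting out of step with the tool names and PMIDs, which would otherwise silently pair tools with the wrong publications.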

Make a data frame and save it to a TSV file:

In [None]:
%%R
df <- data.frame(Tool = toolnames, PMID = toolpmids)
write.table(df, file = "proteomics_tools_and_pmids_oa.tsv", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
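
If the table is to be written from Python instead, the standard `csv` module is one option, sketched below. The function names `pairs_to_tsv` and `write_pairs_tsv` are hypothetical. Unlike `write.table` with `quote = FALSE`, this sketch quotes a field when it contains a tab, a quote character or a line break.

```python
import csv
import io


def pairs_to_tsv(pairs):
    """Serialise (tool name, PMID) pairs as tab-separated text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Tool", "PMID"])
    writer.writerows(pairs)
    return buffer.getvalue()


def write_pairs_tsv(pairs, path):
    """Write the tab-separated table to the given file path."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(pairs_to_tsv(pairs))
```

For example, `write_pairs_tsv(pairs, "proteomics_tools_and_pmids_oa.tsv")` would produce a file with the same column names as the R cell above.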