# Workflow Tutorial for bio.tools Curation Tooling

This document describes the steps for running the workflow after Pub2Tools has been run for a specific month. As input, it takes the Pub2Tools output and its log, plus separate JSON files with low-priority tools and preprints.

> For testing purposes, please copy and use a different document as this serves only as a tutorial.

We will start by importing all dependencies into the workspace:

In [None]:
import json
from biotools_dev import add_tools, login_prod
from tool_processing import process_tools
from tool_validation import validate_tools
from preprints import identify_preprints
from utils.utils import check_date
from utils.csv_utils import generate_csv
from utils.json_utils import generate_json

We will now start with the actual workflow. First, some variables need to be defined. The _to_curate_ variable limits the number of tools that have to be curated manually.

The output from Pub2Tools can include more than 700 candidate tools every month, which makes it hard to curate everything manually and keep up with novel approaches.

Therefore, the workflow was designed so that only high-priority tools are added to the curation worksheet, while the rest are added to a low-priority file for potential future review. The priority of the tools is already defined by Pub2Tools, which ranks them by how likely they are to be useful tools, a pattern observed by previous curators. Hence, only the top-ranked tools (as many as specified by the _to_curate_ variable, and excluding preprints) are selected for immediate addition to the monthly worksheet.


1. **Define run settings:**
    * _to_curate_ (int or 'all'): number of top-ranked published tools to be added to the database, or 'all' to add every one.

In [None]:
to_curate = 100
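
To make the effect of _to_curate_ concrete, here is a minimal sketch of how a ranked list could be split into a curated part and a left-over part. The helper `split_by_priority` is hypothetical and is not part of the workflow's modules; the real selection happens inside `generate_csv`, whose internals may differ.

```python
def split_by_priority(ranked_tools, to_curate):
    """Split a ranked list into (selected, left_over).

    Hypothetical helper for illustration only; it assumes the list is
    already ordered from highest to lowest priority.
    """
    if to_curate == "all":
        return list(ranked_tools), []
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(to_curate, bool) or not isinstance(to_curate, int) or to_curate < 0:
        raise ValueError("to_curate must be a non-negative int or 'all'")
    return list(ranked_tools[:to_curate]), list(ranked_tools[to_curate:])


ranked = ["tool_a", "tool_b", "tool_c", "tool_d"]
selected, left_over = split_by_priority(ranked, 3)
print(selected)   # ['tool_a', 'tool_b', 'tool_c']
print(left_over)  # ['tool_d']
```

With `to_curate = "all"` the left-over list is empty, so nothing would go to the low-priority file.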

Next, the file paths must be defined. These files include the output from Pub2Tools and the JSON or zip files containing all preprints and low-priority tools.

There should be a low-priority tool file for each month. For a new run, this file will be created.

2. **Define file paths:**

    * _json_file_ (str): path for json file with Pub2Tools output
    * _pub2tools_log_ (str): path to existing output log file from Pub2Tools
    * _preprints_file_ (str): path to existing json file with all of the preprints
    * _low_priority_ (str): path to zip file with low priority tools

In [None]:
json_file = "to_biotools_sep22.json"
pub2tools_log = "pub2tools.log"
preprints_file = "data/preprints.json"
low_priority = "data/low_tools.zip"
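
Before running the remaining steps, it can help to confirm that the input files exist. The following is a small sketch, not part of the workflow itself; `missing_inputs` is a hypothetical helper. The low-priority archive is left out of the check because it is created on a new run.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the names of inputs whose path does not exist.

    Hypothetical helper; `paths` maps a variable name to its file path.
    """
    return [name for name, path in paths.items() if not Path(path).exists()]


required = {
    "json_file": "to_biotools_sep22.json",
    "pub2tools_log": "pub2tools.log",
    "preprints_file": "data/preprints.json",
}

missing = missing_inputs(required)
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All input files found.")
```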

Credentials for the development version of the server are needed to upload the tools to dev.

Make sure not to commit any credentials to the repository when making changes to the workflow!

> Always create a copy of this file and do not make changes to the original one.


3. **Define username and password.**

In [None]:
username = ''
password = ''
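
One way to avoid typing credentials into the notebook (and committing them by accident) is to read them from environment variables and fall back to an interactive prompt. This is only a sketch: the helper `read_credentials` and the variable names `BIOTOOLS_USERNAME` and `BIOTOOLS_PASSWORD` are assumptions made for this example, not names the workflow defines.

```python
import os
from getpass import getpass


def read_credentials(env=None, ask=getpass):
    """Return (username, password) from `env`, prompting for missing values.

    Hypothetical helper; the environment variable names are assumptions.
    """
    env = os.environ if env is None else env
    user = env.get("BIOTOOLS_USERNAME") or ask("bio.tools username: ")
    secret = env.get("BIOTOOLS_PASSWORD") or ask("bio.tools password: ")
    return user, secret


# Demonstration with a stand-in environment, so nothing is prompted for here.
demo_env = {"BIOTOOLS_USERNAME": "curator", "BIOTOOLS_PASSWORD": "not-a-real-password"}
demo_user, demo_secret = read_credentials(env=demo_env)
print(demo_user)
```

In the notebook itself, `username, password = read_credentials()` would take the place of the two assignments above.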

4. **Authentication.**

In [None]:
from biotools_dev import login_prod

token = login_prod(username, password)

Pub2Tools returns each tool with a confidence flag, and we only consider the tools where this flag is set to "high".

5. **Read Pub2Tools output** and get tools with **high confidence** score from json file.

In [None]:
import json
from tool_processing import process_tools

# Load the Pub2Tools output; the candidate tools are stored under the 'list' key.
with open(json_file, encoding="utf8") as jf:
    data = json.load(jf)
    tools = data['list']

processed_tools = process_tools(tools)
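
The confidence filter can be pictured with the sketch below. It is not the implementation of `process_tools`, which may do more than filtering; the helper `keep_high_confidence` is hypothetical, and the key name `confidence_flag` is an assumption about the layout of the Pub2Tools entries.

```python
def keep_high_confidence(tools, flag_key="confidence_flag"):
    """Keep only entries whose confidence flag equals "high".

    Hypothetical helper; `flag_key` is an assumed field name.
    Entries without the flag are dropped.
    """
    return [tool for tool in tools if str(tool.get(flag_key, "")).lower() == "high"]


sample = [
    {"name": "ToolA", "confidence_flag": "high"},
    {"name": "ToolB", "confidence_flag": "low"},
    {"name": "ToolC"},  # no flag at all
]
print([tool["name"] for tool in keep_high_confidence(sample)])  # ['ToolA']
```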

**Tool validation**

Tool validation goes through all the tools in the output from Pub2Tools and checks if there are errors using the bio.tools API.

6. **Validate tools** and separate them into valid and problem tools.

In [None]:
from tool_validation import validate_tools
valid_tools, problem_tools = validate_tools(processed_tools, token)

**Identify preprints**

This section comprises two steps. First, we check the global preprints file for preprints that have since been published. Then, we go through the list of valid tools, identify the preprints among them, and add those to the existing preprints file.

7. Check whether any preprints in _preprints_file_ have since been published, and return only those, with their _publication_link_ and _is_preprint_ flag updated. The function deletes the published preprints from _preprints_file_.

Here, we set _rerun_ to True because we are running the function on preprints that were already identified previously.

In [None]:
from preprints import identify_preprints
pubs_prp = identify_preprints(rerun=True, tools=None, json_prp=preprints_file)

8. Repeat the identification for the validated tools and return only the publications. The function updates _preprints_file_ with the preprints identified in _valid_tools_.

In this case, _rerun_ is set to False since we are running the function with tools from a new month.

In [None]:
pubs = identify_preprints(rerun=False, tools=valid_tools, json_prp=preprints_file)
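
Conceptually, this step separates published tools from preprints. The sketch below shows that split on plain dictionaries. It is an illustration only: `split_preprints` is a hypothetical helper, and it assumes the _is_preprint_ flag sits directly on each tool entry, which may not match how `identify_preprints` stores it.

```python
def split_preprints(tools, flag_key="is_preprint"):
    """Return (publications, preprints) based on a boolean flag.

    Hypothetical helper; entries without the flag count as publications.
    """
    preprints = [tool for tool in tools if tool.get(flag_key)]
    publications = [tool for tool in tools if not tool.get(flag_key)]
    return publications, preprints


candidates = [
    {"name": "ToolA", "is_preprint": False},
    {"name": "ToolB", "is_preprint": True},
    {"name": "ToolC"},
]
publications, preprints = split_preprints(candidates)
print([tool["name"] for tool in publications])  # ['ToolA', 'ToolC']
print([tool["name"] for tool in preprints])     # ['ToolB']
```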

**Create .csv file**

Once we have the valid tools and the preprints, we can combine them in _tools_to_add_ and add these to a CSV file with the valid tools at the top and the newly published preprints at the bottom.

The created file has four columns:

1. tool link in the development database  
2. tool name   
3. homepage   
4. publication link.  
 
The tools that are not included in this file (_tools_left_) will be added to a json file, as previously mentioned, and zipped with the other low priority files from previous months. 
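
The low-priority archive could be maintained along the following lines, using only the standard library. This is a sketch under assumptions: `archive_low_priority` is a hypothetical helper and the member names are made up for the example; the workflow's own `generate_json` may name and store the files differently.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def archive_low_priority(tools, zip_path, member_name):
    """Append `tools` as a JSON member to the archive, creating it if needed.

    Hypothetical helper for illustration only.
    """
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" adds to an existing archive and creates a new one otherwise.
    with zipfile.ZipFile(zip_path, mode="a", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, json.dumps(tools, indent=2))
    return zip_path


# Demonstration in a temporary directory so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "low_tools.zip"
    archive_low_priority([{"name": "ToolX"}], archive_path, "low_tools_aug22.json")
    archive_low_priority([{"name": "ToolY"}], archive_path, "low_tools_sep22.json")
    with zipfile.ZipFile(archive_path) as archive:
        members = archive.namelist()
        restored = json.loads(archive.read("low_tools_sep22.json"))

print(members)
```

Writing the same member name twice would store a duplicate entry rather than overwrite the first one, so one file name per month keeps the archive unambiguous.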

9. Generate the CSV file from the first _to_curate_ entries of _pubs_ and all of _pubs_prp_.

    Returns:
    
    * _tools_to_add_: tools to add to database 
    * _tools_left_: tools not in _tools_to_add_.

In [None]:
from utils.utils import check_date
from utils.csv_utils import generate_csv

file_date = check_date(pub2tools_log)
tools_to_add, tools_left = generate_csv(pubs, pubs_prp, to_curate, file_date)
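
For orientation, a worksheet with the four columns described above could be written with the standard `csv` module as sketched here. The column names, the helper `write_worksheet`, and the example rows (including the `example.org` links) are all assumptions made for this illustration; `generate_csv` defines the real layout.

```python
import csv
import io

# Assumed column names for the four columns described above.
COLUMNS = ["dev_link", "name", "homepage", "publication_link"]


def write_worksheet(rows, handle):
    """Write a header plus one line per tool to an open text handle.

    Hypothetical helper; keys outside COLUMNS are ignored.
    """
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


published = [
    {"dev_link": "https://example.org/dev/toola", "name": "ToolA",
     "homepage": "https://example.org/toola", "publication_link": "https://example.org/doi/a"},
]
newly_published_preprints = [
    {"dev_link": "https://example.org/dev/toolb", "name": "ToolB",
     "homepage": "https://example.org/toolb", "publication_link": "https://example.org/doi/b"},
]

# Valid tools go first, newly published preprints at the bottom.
buffer = io.StringIO()
write_worksheet(published + newly_published_preprints, buffer)
print(buffer.getvalue())
```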



**Create json files**

10. Generate JSON files with the tools that will not be curated.

In [None]:
from utils.json_utils import generate_json

generate_json(tools_left, file_date)

**Add tools to dev**

11. Add the selected tools (_tools_to_add_) to the development version of bio.tools.

In [None]:
from biotools_dev import add_tools

add_tools(tools_to_add, token, WRITE_TO_DB=True)
