# Workflow Tutorial for bio.agents Curation Agenting

This document describes, step by step, how to run the curation workflow after running Pub2Agents for a specific month. It takes as input the Pub2Agents output and log, along with separate files holding the preprints and the low-priority agents.

> For testing purposes, please copy and use a different document as this serves only as a tutorial.

We will start by importing all dependencies into the workspace:

In [None]:
import json

from agent_processing import process_agents
from agent_validation import validate_agents
from bioagents_dev import add_agents, login_prod
from preprints import identify_preprints
from utils.csv_utils import generate_csv
from utils.json_utils import generate_json
from utils.utils import check_date

The workflow itself starts by defining a few variables. The _to_curate_ variable limits the number of agents that have to be curated manually.

The output from Pub2Agents can include more than 700 possible agents every month, making it hard to curate everything manually and keep up with novel approaches.

Therefore, the workflow was designed such that only high-priority agents are added to the curation worksheet, while the rest are added to a low-priority file for potential future review. The priority of the agents is already defined by Pub2Agents, which ranks them based on the likelihood of their being useful agents, a pattern observed by previous curators. Hence, only the top-ranked agents (as many as specified by the _to_curate_ variable, and excluding preprints) are selected for immediate addition to the monthly worksheet.


1. **Define run settings:**
    * _to_curate_ (int || 'all'): number of published agents to be added to the database.

In [None]:
to_curate = 100
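
How a value such as _to_curate_ can split a ranked list into a curated part and a low-priority remainder can be sketched as follows. This is only an illustration, not the workflow's own code: `split_by_priority` is a hypothetical helper, and it assumes the list is already ordered from highest to lowest priority.

```python
def split_by_priority(ranked_agents, to_curate):
    """Split a ranked list into (selected, left_over).

    Hypothetical helper: assumes `ranked_agents` is already sorted from
    highest to lowest priority and that `to_curate` is an int or 'all'.
    """
    if to_curate == "all":
        return list(ranked_agents), []
    # bool is a subclass of int, so reject it explicitly
    if isinstance(to_curate, bool) or not isinstance(to_curate, int) or to_curate < 0:
        raise ValueError("to_curate must be a non-negative int or 'all'")
    return list(ranked_agents[:to_curate]), list(ranked_agents[to_curate:])


# Made-up example data
ranked = [f"agent_{i}" for i in range(5)]
selected, left_over = split_by_priority(ranked, 2)
```

With `to_curate = 'all'`, every agent is selected and the remainder is empty.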

Next, the paths to the input files must be defined: the Pub2Agents output and log, the JSON file with all preprints, and the zip file with the low-priority agents.

There should be a low-priority agent file for each month. For a new run, this file will be created.

2. **Define file paths:**

    * _json_file_ (str): path for json file with Pub2Agents output
    * _pub2agents_log_ (str): path to existing output log file from Pub2Agents
    * _preprints_file_ (str): path to existing json file with all of the preprints
    * _low_priority_ (str): path to zip file with low priority agents

In [None]:
json_file = "to_bioagents_sep22.json"
pub2agents_log = "pub2agents.log"
preprints_file = "data/preprints.json"
low_priority = "data/low_agents.zip"
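
Before continuing, it can help to confirm that the input files are where the paths say they are. The sketch below is an optional check, not part of the workflow: `missing_inputs` is a hypothetical helper, and leaving the zip archive out of the check is an assumption, based on the text describing only the log and the preprints file as existing.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the names of the entries in `paths` whose file does not exist."""
    return [name for name, path in paths.items() if not Path(path).is_file()]


# Same paths as in the cell above, repeated so the sketch is self-contained.
required = {
    "json_file": "to_bioagents_sep22.json",
    "pub2agents_log": "pub2agents.log",
    "preprints_file": "data/preprints.json",
}

missing = missing_inputs(required)
if missing:
    print("Missing input files:", ", ".join(missing))
```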

Credentials for the development version of the server are needed to upload the agents to dev.

Make sure not to submit any credentials to the repository when making changes to the workflow!

> Always create a copy of this file and do not make changes to the original one.


3. **Define username and password.**

In [None]:
username = ''
password = ''
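
One way to keep credentials out of the repository is to read them from the environment and fall back to an interactive prompt. The sketch below illustrates that idea under stated assumptions: `read_credentials` is a hypothetical helper, and the variable names `BIOAGENTS_USER` and `BIOAGENTS_PASSWORD` are made up for this example.

```python
import os
from getpass import getpass


def read_credentials(env=None, ask_user=input, ask_password=getpass):
    """Return (username, password) without hard-coding them in the notebook.

    BIOAGENTS_USER and BIOAGENTS_PASSWORD are hypothetical variable names.
    """
    env = os.environ if env is None else env
    # Fall back to a prompt only when the variable is unset or empty
    username = env.get("BIOAGENTS_USER") or ask_user("Username: ")
    password = env.get("BIOAGENTS_PASSWORD") or ask_password("Password: ")
    return username, password


# In the notebook this would replace the two assignments above:
# username, password = read_credentials()
```

Because `getpass` does not echo what is typed, the password also stays out of the saved cell output.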

4. **Authentication.**

In [None]:
from bioagents_dev import login_prod

token = login_prod(username, password)

As mentioned before, Pub2Agents returns agents with a confidence flag, and we only consider the ones where this flag is set to "high". 

5. **Read Pub2Agents output** and get agents with **high confidence** score from json file.

In [None]:
import json
from agent_processing import process_agents

with open(json_file, encoding="utf8") as jf:
    data = json.load(jf)

agents = data['list']

processed_agents = process_agents(agents)
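
The filtering idea behind this step can be sketched as follows. This is not the implementation of `process_agents`; it assumes each entry is a dictionary whose confidence is stored under a `confidence_flag` key, which is an assumption about the file layout, and `keep_high_confidence` is a hypothetical helper.

```python
def keep_high_confidence(agents, flag_key="confidence_flag"):
    """Keep only entries whose confidence flag equals 'high'.

    `flag_key` is an assumed field name; entries without it are dropped.
    """
    return [agent for agent in agents if agent.get(flag_key) == "high"]


# Made-up data in the same {'list': [...]} shape read above
sample = {
    "list": [
        {"name": "alpha", "confidence_flag": "high"},
        {"name": "beta", "confidence_flag": "low"},
        {"name": "gamma"},
    ]
}
high = keep_high_confidence(sample["list"])
```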

**Agent validation**

Agent validation goes through all the agents in the output from Pub2Agents and checks if there are errors using the bio.agents API.

6. **Validate agents** and separate them into valid and problem agents.

In [None]:
from agent_validation import validate_agents
valid_agents, problem_agents = validate_agents(processed_agents, token)
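
The split into valid and problem agents follows a simple partition pattern, sketched below. The real validation uses the bio.agents API; here a local stand-in check takes its place, and both `partition_agents` and `require_name_and_homepage` are hypothetical names introduced only for illustration.

```python
def partition_agents(agents, find_errors):
    """Split agents into (valid, problems) using a caller-supplied check.

    `find_errors` returns a list of error messages, empty when the agent is fine.
    """
    valid, problems = [], []
    for agent in agents:
        errors = find_errors(agent)
        if errors:
            problems.append({"agent": agent, "errors": errors})
        else:
            valid.append(agent)
    return valid, problems


def require_name_and_homepage(agent):
    """Stand-in check (not the API): report missing 'name' or 'homepage'."""
    return [f"missing {field}" for field in ("name", "homepage") if not agent.get(field)]


# Made-up example data
candidates = [
    {"name": "alpha", "homepage": "https://example.org/alpha"},
    {"name": "beta"},
]
ok, broken = partition_agents(candidates, require_name_and_homepage)
```

Keeping the error messages next to each problem agent makes it easier to see later why an entry was rejected.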

**Identify preprints**

This section comprises two steps. First, we check the global preprints file for preprints that have since been published. Then, we go through the list of valid agents, identify the preprints among them, and add those to the existing preprints file.

7. Check whether any agents in _preprints_file_ have been newly published, and return only those, with updated _publication_link_ and _is_preprint_ flag. The function deletes published preprints from _preprints_file_.

Here, we set _rerun_ to True because we are running the function on preprints that were already identified previously.

In [None]:
from preprints import identify_preprints
pubs_prp = identify_preprints(rerun = True, agents = None, json_prp = preprints_file)

8. Repeat the identification for the validated agents and return only publications. The function updates _preprints_file_ with the preprints identified in _valid_agents_.

In this case, _rerun_ is set to False since we are running the function with agents from a new month.

In [None]:
pubs = identify_preprints(rerun = False, agents = valid_agents, json_prp = preprints_file)
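
How preprints are recognised is not described here, so the sketch below shows one possible heuristic rather than the logic of `identify_preprints`: it treats an agent as a preprint when all of its DOIs start with a prefix used by a preprint server (10.1101 for bioRxiv and medRxiv, 10.48550 for arXiv). The record layout with a `publication` list of `doi` entries and the helper names are assumptions.

```python
# DOI prefixes of some preprint servers (bioRxiv/medRxiv and arXiv)
PREPRINT_DOI_PREFIXES = ("10.1101/", "10.48550/")


def is_preprint(agent):
    """Heuristic: True when every DOI of the agent has a preprint prefix."""
    dois = [pub.get("doi") or "" for pub in agent.get("publication", [])]
    return bool(dois) and all(doi.lower().startswith(PREPRINT_DOI_PREFIXES) for doi in dois)


def split_preprints(agents):
    """Return (publications, preprints) according to `is_preprint`."""
    publications = [agent for agent in agents if not is_preprint(agent)]
    preprints = [agent for agent in agents if is_preprint(agent)]
    return publications, preprints


# Made-up records with placeholder DOIs
records = [
    {"name": "alpha", "publication": [{"doi": "10.1101/placeholder-1"}]},
    {"name": "beta", "publication": [{"doi": "10.1093/placeholder-2"}]},
    {"name": "gamma", "publication": []},
]
published, preprint_only = split_preprints(records)
```

An agent with at least one non-preprint DOI counts as published, which mirrors the idea of a preprint that later gains a journal publication.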

**Create .csv file**

Once we have the valid agents and the preprints, we can combine them in _agents_to_add_ and add these to a CSV file with the valid agents at the top and the newly published preprints at the bottom.

The created file has 4 different columns: 

1. agent link in the development database  
2. agent name   
3. homepage   
4. publication link.  
 
The agents that are not included in this file (_agents_left_) will be added to a json file, as previously mentioned, and zipped with the other low priority files from previous months. 

9. Generate the CSV file from the first _to_curate_ entries of _pubs_ and all of _pubs_prp_.

    Returns:
    
    * _agents_to_add_: agents to add to database 
    * _agents_left_: agents not in _agents_to_add_.

In [None]:
from utils.utils import check_date
from utils.csv_utils import generate_csv

file_date = check_date(pub2agents_log)
agents_to_add, agents_left = generate_csv(pubs, pubs_prp, to_curate, file_date)
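
The four-column layout described above can be sketched with the standard `csv` module. This is an illustration, not the code of `generate_csv`: the column names, the record keys, and the `write_worksheet` helper are all assumptions, and the example writes to an in-memory buffer instead of a file.

```python
import csv
import io

# Assumed column names for the four columns listed above
COLUMNS = ["dev_link", "name", "homepage", "publication_link"]


def write_worksheet(rows, handle):
    """Write the rows as CSV with a header; keys outside COLUMNS are ignored."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Made-up rows: valid agents first, newly published preprints at the bottom
valid_rows = [{"dev_link": "https://example.org/dev/alpha", "name": "alpha",
               "homepage": "https://example.org/alpha",
               "publication_link": "https://example.org/pub/alpha"}]
preprint_rows = [{"dev_link": "https://example.org/dev/beta", "name": "beta",
                  "homepage": "https://example.org/beta",
                  "publication_link": "https://example.org/pub/beta"}]

buffer = io.StringIO()
write_worksheet(valid_rows + preprint_rows, buffer)
lines = buffer.getvalue().splitlines()
```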



**Create json files**

10. Generate a JSON file with the agents that will not be curated and add the file to the existing zipped file.

In [None]:
from utils.json_utils import generate_json

generate_json(agents_left, file_date)

In [2]:
import zipfile
import os

file_to_add = f"./data/low_agents_{file_date[0]}_{file_date[1]}.json"

# Append this month's file to the low-priority archive defined in step 2
with zipfile.ZipFile(low_priority, 'a') as zipf:
    zipf.write(file_to_add, arcname=os.path.basename(file_to_add))

os.remove(file_to_add) # remove file after zipping
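
Opening a zip file in append mode and writing a name that is already in the archive adds a second entry with the same name, for example when the cell above is run twice for the same month. The sketch below guards against that; `add_json_to_zip` is a hypothetical helper, the file name is made up, and the example uses an in-memory buffer so that no real archive is touched.

```python
import io
import json
import zipfile


def add_json_to_zip(zip_target, arcname, payload):
    """Append `payload` as JSON under `arcname` unless that name already exists.

    Returns True when the entry was written and False when it was skipped.
    """
    with zipfile.ZipFile(zip_target, "a") as zipf:
        if arcname in zipf.namelist():
            return False
        zipf.writestr(arcname, json.dumps(payload))
    return True


# Made-up example: the second call is skipped because the name is taken
archive = io.BytesIO()
first = add_json_to_zip(archive, "low_agents_example.json", [{"name": "alpha"}])
second = add_json_to_zip(archive, "low_agents_example.json", [{"name": "alpha"}])

with zipfile.ZipFile(archive) as zipf:
    names = zipf.namelist()
```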

**Add agents to dev**

11. Add the agents in _agents_to_add_ to the development version of bio.agents.

In [None]:
from bioagents_dev import add_agents

add_agents(agents_to_add, token, WRITE_TO_DB = True)
