# Tutorial for adapting viriation to another virus of choice
This tutorial will guide you through adapting the viriation program for any virus of your choice. The process primarily involves fine-tuning BERT and gradient boosting models to suit your specific needs.

## Table of Contents

1. [**Installation**](#Installation)
2. [**Re-training BERT model**](#BERT)
3. [**Re-training Gradient Boosting model**](#LightGBM)
4. [**Refactoring program**](#Refactor)

# Installation

Follow steps 1, 2, 3, and 5 of the [installation guide](https://github.com/davidhy8/viriation/wiki/Installation-Guide) to set up the program. This involves setting up your conda and pip environments and directories. You will also need to create a separate `train` directory at `data/train` to hold the training data for your models. You may use the following commands:

```
cd data/
mkdir -p train/raw/ train/processed/bioc/ train/processed/html/
```

These directories serve as intermediate locations for saving files while you prepare your training data. For convenience, the cell below loads a notebook of helper functions that are used later.


In [1]:
%run helper.ipynb



# BERT

The current PubMedBERT model is specifically trained to identify text chunks discussing SARS-CoV-2 mutations, so it will not flag texts reporting mutations from other viruses. To adapt the BERT model for your virus of interest, we recommend following the same procedure we used to train our PubMedBERT model, as outlined in `notebook/train.ipynb`. You can find detailed instructions [here](https://github.com/davidhy8/viriation/wiki/Model-Training-for-COVID%E2%80%9019).

If a database of mutations is not available for your virus of interest, you may need to manually prepare data to train the BERT model. In our study on SARS-CoV-2, we used 300 positive and 300 negative examples of literature (publications, preprints, and grey literature) reporting SARS-CoV-2 mutations. We recommend preparing a similar amount of data for your virus if possible. If very few relevant publications are available for your virus, it may be useful to use our pre-trained SARS-CoV-2 model as a base model and fine-tune it with the limited data you have.

The second step in training the BERT model involves fine-tuning it with annotated text chunks, each containing 512 tokens. The notebook `notebook/dataset_labeller.ipynb` simplifies this annotation: it reads a dataframe with columns for DOI and text (the full document) and splits each document into chunks for labelling on the command line. In the SARS-CoV-2 study, approximately 100 text chunks were annotated.
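
To make the chunking step concrete, here is a minimal sketch of splitting a document into fixed-size chunks. It assumes whitespace-separated words as a stand-in for tokens; the actual notebooks presumably count tokens with the BERT tokenizer, so chunk boundaries will differ. The function name `split_into_chunks` is hypothetical and not part of the viriation code.

```python
def split_into_chunks(text, chunk_size=512):
    """Split text into consecutive chunks of at most `chunk_size` tokens.

    Assumption: tokens are approximated by whitespace-separated words,
    not by the subword tokens a BERT tokenizer would produce.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]
```

Only the last chunk of a document can be shorter than `chunk_size`.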


## Preparing data

To fine-tune the BERT model for your specific needs, you must find **positive** and **negative examples** of literature reporting novel mutations in your virus of choice. Your collection may include publications, preprints and grey literature, but the papers should be categorized by type to ease BioC JSON retrieval in later steps. For example, you can gather the DOIs of your papers in JSON format as follows, using the category labels that the template code below checks for:

`{"DOI_1": "preprint", "DOI_2": "publication", "DOI_3": "grey", ...}`
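
One way to produce such a file is sketched below. It assumes the lowercase labels `publication`, `preprint` and `grey` that the splitting code in this section compares against; the DOIs are placeholders and `write_papers` is a hypothetical helper, not part of the viriation code.

```python
import json
from pathlib import Path

# Placeholder DOIs (assumption): replace with the papers you collected.
papers = {
    "10.1000/example.1": "preprint",
    "10.1000/example.2": "publication",
    "10.1000/example.3": "grey",
}

# Labels that the splitting code in this tutorial compares against.
ALLOWED = {"publication", "preprint", "grey"}


def write_papers(papers, path):
    """Validate the category labels, then write the DOI -> category map as JSON."""
    unknown = {doi: cat for doi, cat in papers.items() if cat not in ALLOWED}
    if unknown:
        raise ValueError(f"Unrecognised categories: {unknown}")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(papers, indent=2))
    return path
```

For example, `write_papers(papers, "../data/train/raw/papers.json")` would create the file read by the next cell. Validating the labels up front avoids papers being silently dropped because of a capitalisation mismatch.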

`notebook/pokay_processor.ipynb` and `notebook/full_preprocessing.ipynb` demonstrate how the training data was prepared for SARS-CoV-2. The segment below provides template code for processing your own data:

In [4]:
import json

with open('../data/train/raw/papers.json') as json_file: # Read in JSON file with collection of papers
    papers = json.load(json_file)

# Split dictionary by literature type
publication_bioc = {}
preprint_bioc = {}
grey_bioc = {}

for key, value in papers.items():
    if value == 'publication':
        publication_bioc[key] = None
    elif value == 'preprint':
        preprint_bioc[key] = None
    elif value == 'grey':
        grey_bioc[key] = None

In [None]:
# Retrieve BioC JSON file for each type of literature
publication_bioc, publication_unk_bioc = get_journal_publication_bioc(publication_bioc) # publications
rxiv_bioc, rxiv_unk_bioc = get_rxiv_bioc(preprint_bioc) # preprints

Grey literature is usually available only as PDFs, which must go through a series of conversions to reach BioC JSON format. You may use the following steps:
1. Download PDF version of article
2. Convert the file to HTML format using Allen Institute's [tool](https://papertohtml.org)
3. Using AutoCorpus, convert the HTML file to BioC JSON format:
```
cd submodules/autocorpus/
python run_app.py -c "../../data/other/config_allen.json" -t "../../data/train/processed/bioc/" -f "../../data/train/processed/html/" -o JSON
```

In [None]:
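# Load grey literature from directory
from pathlib import Path

for key in grey_bioc:
    file = get_file_name(key)
    file = "~/viriation/data/train/processed/bioc/" + file + "_bioc.json" # might have to adjust location
    try:
        # expanduser() resolves "~", which Path does not expand on its own
        grey_bioc[key] = Path(file).expanduser().read_text().replace('\n', '')
    except FileNotFoundError:
        pass # no BioC file exists for this paper; its value stays None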
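
Because papers whose BioC file cannot be read are skipped silently, it can help to list them before moving on. A minimal sketch, assuming (as in the cells above) that each dictionary maps a DOI to its BioC JSON text or to `None` when nothing was loaded; `missing_entries` is a hypothetical helper.

```python
def missing_entries(bioc_by_doi):
    """Return the DOIs whose BioC JSON was not loaded (value is still None)."""
    return sorted(doi for doi, bioc in bioc_by_doi.items() if bioc is None)
```

Calling `missing_entries(grey_bioc)` after the loading loop shows which PDFs still need to be converted or whose file paths need adjusting.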

After retrieving the BioC JSON for all literature, we extract the text from each JSON file and format it into a dataframe for model training. Recall that in addition to a set of positive examples, we also need a set of negative examples, which can be prepared in the same way as described above. Note that the numbers of positive and negative examples should be equal so that the training data doesn't skew the model towards one class.
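
If the two collections end up with different sizes, one simple option is to downsample the larger one. The sketch below is an illustration rather than the procedure used in the SARS-CoV-2 study; it assumes both collections are dictionaries keyed by DOI, and `balance` is a hypothetical helper.

```python
import random


def balance(pos, neg, seed=0):
    """Downsample the larger dictionary so both hold the same number of papers."""
    n = min(len(pos), len(neg))
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    pos_keys = rng.sample(sorted(pos), n)
    neg_keys = rng.sample(sorted(neg), n)
    return {k: pos[k] for k in pos_keys}, {k: neg[k] for k in neg_keys}
```

This could be applied to the positive and negative dictionaries before their text is extracted.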

In [None]:
import pandas as pd

# Merge dictionaries to get positive examples (dict.update works in place and returns None)
pos = publication_bioc.copy()
pos.update(rxiv_bioc)
pos.update(grey_bioc)

neg = negative_examples # prepare separately, with the same number of papers as the positive examples

pos_text = litcovid_text_extract(pos) # Extract text from BioC JSON of positive examples
neg_text = pokay_extract(neg) # Extract text from BioC JSON of negative examples

# Create dataframe
df = pd.DataFrame(neg_text, columns=["text"])
df["label"] = 0

df_2 = pd.DataFrame(pos_text, columns=["text"])
df_2["label"] = 1

df = pd.concat([df, df_2])

# Save dataset
df.to_csv("../data/train/processed/bert_dataset.csv")

The dataset above is used for the initial round of training. Re-training the model requires a second dataset in which each 512-token text chunk is manually labeled. Because manual labeling is very time-consuming, it is done only for a small subset of text chunks: those that the initial model cannot predict with high confidence. Labeling these chunks now rather than in the first round may seem redundant, but labeling every text chunk for a large number of papers would have been too time-intensive, and retraining the model along its decision boundary is also very effective.
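
The idea of selecting chunks near the decision boundary can be sketched as follows. The probability thresholds are illustrative assumptions, not values taken from the SARS-CoV-2 study, and `select_uncertain` is a hypothetical helper; `notebook/train.ipynb` contains the actual selection logic.

```python
def select_uncertain(chunks, probabilities, low=0.3, high=0.7):
    """Return the chunks whose predicted positive-class probability
    lies between `low` and `high`, i.e. close to the decision boundary."""
    if len(chunks) != len(probabilities):
        raise ValueError("chunks and probabilities must have the same length")
    return [c for c, p in zip(chunks, probabilities) if low <= p <= high]
```

Widening the interval yields more chunks to label, and narrowing it focuses the manual effort on the most ambiguous ones.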

In [None]:
# Create retraining dataset
retrain_data = {} # Placeholder: load your retraining data here as {DOI: BioC JSON}, preprocessed in the same way as the positive examples
retrain_data_text = litcovid_text_extract(retrain_data) # Extract text from JSON

# Create dataframe
df = pd.DataFrame(retrain_data_text, columns=["text"])

# Save chunks dataset -> this dataset will need to be manually annotated
df.to_csv('../data/train/processed/chunks_dataset.csv')

## Training model

Once the training data is prepared, we train the BERT model in two stages, using the dataset created for each round. To do this, replace the training data specified in `notebook/train.ipynb` with the datasets above (`bert_dataset.csv` and `chunks_dataset.csv`) and modify the input/output paths. The rest of the training process is covered thoroughly in the notebook.

During the model re-training step, you will have to label `retrain_df_1.csv`, which consists of text chunks that the model struggles to predict. You may use the code and labelling guidelines below to do so.

**Labelling guidelines**: label a chunk 1 (Yes) only if it:
- Includes specific terms like "Delta variant" or any specific variant name
- Includes terms like "mutation", "viral variant", "strain", "variant of concern" and "genetic variant"
- Describes characteristics, behaviours, or impacts of the variants, even if not named explicitly
- Compares different variants
- Discusses genetic mutations or sequences related to viral variants
- Discusses spread, transmission rates or infection rates associated with specific variants
- Discusses how variants affect vaccine efficacy
- Describes changes in symptom severity due to variants
- Discusses health outcomes associated with variants
- Discusses public health measures or responses tailored to specific variants
- Discusses time and place of variant emergence and spread

In [None]:
# LABEL CHUNKS
import numpy as np
import pandas as pd

# Load the text chunks to be labelled
df = pd.read_csv("../data/train/processed/retrain_df_1.csv")
# text = df.sample(n=3)

# Split dataframe into smaller dataframes to work with
df_split = np.array_split(df,30)

# Instantiate dataframe
result = pd.DataFrame(columns=["text", "label"])

# Iterate through each split
for i, split in enumerate(df_split):
    # if i < 3:
    #     continue
    print("Current split " + str(i))
    labelled_chunks = get_label(split, 'text')
    filename = "../data/train/processed/" + str(i) + "_chunks_labelled.csv"
    # labelled_chunks["position"] = i
    labelled_chunks.to_csv(filename) # save each split so that progress is not lost
    print("Finished split " + str(i))
    print("============================================================")
    # Combine labelled_split with result
    result = pd.concat([result, labelled_chunks])
    
result.to_csv("../data/train/processed/labelled_chunks.csv")
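
Since each split is saved to its own file, an interrupted labelling session can be resumed by skipping the splits that already exist on disk. A minimal sketch, assuming the `<i>_chunks_labelled.csv` naming used in the loop above; `finished_splits` is a hypothetical helper.

```python
from pathlib import Path


def finished_splits(directory, n_splits):
    """Return the indices of splits whose '<i>_chunks_labelled.csv' file exists."""
    directory = Path(directory)
    return [
        i for i in range(n_splits)
        if (directory / f"{i}_chunks_labelled.csv").exists()
    ]
```

Inside the loop, a check such as `if i in done: continue`, with `done = finished_splits("../data/train/processed/", 30)`, would skip finished splits. The saved files then need to be read back and concatenated to rebuild the combined result.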

# LightGBM

The gradient boosting model was originally trained to assess the positional importance of each text chunk within a paper when predicting the paper's overall relevance. Since positional information in literature (e.g., abstracts, conclusions) typically remains consistent, fine-tuning the gradient boosting model is optional. If you choose to fine-tune it, refer to `notebook/lightGBM_train.ipynb` for guidance. Note that the original model was trained using over 10,000 text chunks from 600 papers.

# Refactor

The web-scraping program was originally designed to capture new literature on SARS-CoV-2, so its search terms are currently specific to COVID-19. To collect papers discussing your virus of interest, replace these terms in the following two places, in `app/scripts/scrape_papers.py` and `run.sh`:

1. On line 145 of `app/scripts/scrape_papers.py`, modify `covid_terms = ['coronavirus', 'ncov', 'cov', '2019-nCoV', 'SARS-CoV-2', 'COVID19', 'COVID']` with your own terms.

2. On line 27 of `run.sh`, modify `esearch -db pubmed -query "('coronavirus'[All Fields] OR 'ncov'[All Fields] OR 'cov'[All Fields] OR '2019-nCoV'[All Fields] OR 'SARS-CoV-2'[All Fields] OR 'COVID19'[All Fields] OR 'COVID-19'[All Fields] OR 'COVID'[All Fields]) AND (\"${start_pm}\"[CRDT] : \"${end_pm}\"[CRDT]) NOT preprint[pt]" | efetch -format docsum > data/scraper/pubmed/litcovid.xml` with your own terms. 

Example with the search terms "HIV" and "HIV-A":

```
esearch -db pubmed -query "('HIV'[All Fields] OR 'HIV-A'[All Fields]) AND (\"${start_pm}\"[CRDT] : \"${end_pm}\"[CRDT]) NOT preprint[pt]" | efetch -format docsum > data/scraper/pubmed/litcovid.xml
``` 
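
For longer term lists, the query string can be generated instead of written by hand. The sketch below mirrors the structure of the query above; `build_pubmed_query` is a hypothetical helper, and the date format in the usage example is an assumption, since the format of `start_pm` and `end_pm` is set in `run.sh`.

```python
def build_pubmed_query(terms, start, end):
    """Build a PubMed query that matches any of `terms` in all fields,
    restricted to a creation-date range and excluding preprints."""
    term_clause = " OR ".join(f"'{t}'[All Fields]" for t in terms)
    date_clause = '("{}"[CRDT] : "{}"[CRDT])'.format(start, end)
    return "(" + term_clause + ") AND " + date_clause + " NOT preprint[pt]"


# Usage example with placeholder dates (assumed YYYY/MM/DD format)
query = build_pubmed_query(["HIV", "HIV-A"], "2024/01/01", "2024/01/31")
```

The resulting string can be passed to `esearch -db pubmed -query`, and the same list of terms can be reused for `covid_terms` in `app/scripts/scrape_papers.py` so that both places stay consistent.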
