# Interpreting Nucleotide Transformer Models with Sparse Autoencoders

This notebook demonstrates how to analyze and interpret the internal representations of the Nucleotide Transformer (NT) model using Sparse Autoencoders (SAEs).

**Purpose**: Transformer-based models like NT have achieved impressive results in genomic sequence modeling, but their internal representations remain largely opaque. We use SAEs to identify interpretable features in the model's latent space that correspond to biological concepts.

**Approach**:
1. Load a pre-trained Nucleotide Transformer model and genomic sequence data
2. Extract activations from an intermediate layer of the model
3. Train (or load) a Sparse Autoencoder on these activations
4. Analyze specific latent features to understand what biological patterns they detect
5. Validate findings using BLAST searches against known genetic databases


# Environment Setup and Dependencies (feel free to ignore)

We install the required packages, mount Google Drive, and set the random seeds. Feel free to skip this section.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

!pip install transformers

Mounted at /content/drive


In [2]:
# set seeds
import random
import numpy as np
import torch

def set_seed(seed):
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)

set_seed(42)

In [3]:
## load custom functions from utils.py

import sys
sys.path.append('//content/drive/MyDrive/SAEs_for_Genomics')

import importlib
import utils
importlib.reload(utils)

<module 'utils' from '//content/drive/MyDrive/SAEs_for_Genomics/utils.py'>

# Load NT model

Here we load the smallest version of the Nucleotide Transformer (50M parameters). The model follows a standard BERT architecture and is pretrained on genomes from hundreds of different species. For details, see the paper [here](https://www.nature.com/articles/s41592-024-02523-z).

In [4]:
# Load the smallest Nucleotide Transformer (50M parameters)
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoConfig
import torch

num_params = 50 ## default 50

# Import the tokenizer and the model
tokenizer_nt = AutoTokenizer.from_pretrained(f"InstaDeepAI/nucleotide-transformer-v2-{num_params}m-multi-species", trust_remote_code=True)
model_nt = AutoModelForMaskedLM.from_pretrained(f"InstaDeepAI/nucleotide-transformer-v2-{num_params}m-multi-species", trust_remote_code=True)

# Option 2: get random init
config = AutoConfig.from_pretrained(f"InstaDeepAI/nucleotide-transformer-v2-{num_params}m-multi-species", trust_remote_code=True)
#model_nt = AutoModelForMaskedLM.from_config(config, trust_remote_code=True)



# Load and preprocess Addgene dataset

This dataset contains engineered plasmid sequences; we run the NT model on them to extract activations.

In [5]:
import pandas as pd


# Constants
TEST_DATA_PATH = '/content/drive/MyDrive/NOO_paper/Datasets/WorldWide/BLAST_geac_ext_169k_val_random.csv'
TRAIN_DATA_PATH = '/content/drive/MyDrive/NOO_paper/Datasets/WorldWide/BLAST_geac_ext_169k_train_random.csv'
INFREQUENT_THRESHOLD = 10

def split_test_data(test_data):
    """Split test data into input and target variables."""
    y_test = test_data['nations']
    x_test = test_data[['sequence']]
    return x_test, y_test

def replace_infrequent_labels(labels, threshold=INFREQUENT_THRESHOLD):
    """Identify and replace infrequent labels."""
    label_counts = labels.value_counts()
    infrequent_labels = label_counts[label_counts < threshold].index
    return labels.replace(infrequent_labels, 'infrequent')

def map_labels_to_integers(labels):
    """Map labels to integers."""
    unique_labels = labels.unique()
    return {label: int(i) for i, label in enumerate(unique_labels)}

def without_US(data):
    """Filter out rows where the nation is 'UNITED STATES'."""
    data_wo_US = data[data['nations'] != 'UNITED STATES']
    data_wo_US.reset_index(drop=True, inplace=True)

    data_w_US = data[data['nations'] == 'UNITED STATES']
    data_w_US.reset_index(drop=True, inplace=True)
    return data_wo_US, data_w_US

def US_vs_them(labels):
    """Categorize labels into 'UNITED STATES' and 'NON US'."""
    return labels.apply(lambda x: x if x == 'UNITED STATES' else 'NON US')

def pad_sequence(seq, length, pad_char='N'):
    """Pad sequences to the specified length with the given character."""
    return seq.ljust(length, pad_char)[:length]

# Load data
train_data = pd.read_csv(TRAIN_DATA_PATH)
test_data = pd.read_csv(TEST_DATA_PATH)

print(f'test_data shape: {test_data.shape}')

# Remove US
# train_data, train_data_US = without_US(train_data)
# test_data, test_data_US = without_US(test_data)

print(f'test_data shape: {test_data.shape}')

# Split data
x_train, y_train = train_data[['sequence']], train_data['nations']
x_test, y_test = split_test_data(test_data)

print(f'test_data shape: {y_test.shape}')
print(f'x_train shape: {x_train.shape}')
print(f'y_train shape: {y_train.shape}')

# Combine labels from train and test datasets
processed_labels = pd.concat([y_train, y_test], axis=0, ignore_index=True)
label_to_int = map_labels_to_integers(processed_labels)


# map labels to integers
y_train = y_train.map(label_to_int)
y_test = y_test.map(label_to_int)

print(f'y_test shape: {y_test.shape}')


# reset indices before concat
x_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
x_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

df_train = pd.concat([x_train, y_train], axis=1)
df_val = pd.concat([x_test, y_test], axis=1)

print(f'test_data shape: {test_data.shape}')


# Keep only sequences longer than min_length (with min_length = 0 this drops empty sequences only)
min_length = 0
df_train = df_train[df_train['sequence'].str.len() > min_length]
df_val = df_val[df_val['sequence'].str.len() > min_length]

print(f'test_data shape: {test_data.shape}')


# Ensure indices are reset correctly
df_train.reset_index(drop=True, inplace=True)
df_val.reset_index(drop=True, inplace=True)

# Display the split data
print("Train Data Shape:", df_train.shape)
print("Validation Data Shape:", df_val.shape)


test_data shape: (15551, 4)
test_data shape: (15551, 4)
test_data shape: (15551,)
x_train shape: (93306, 1)
y_train shape: (93306,)
y_test shape: (15551,)
test_data shape: (15551, 4)
test_data shape: (15551, 4)
Train Data Shape: (93306, 2)
Validation Data Shape: (15551, 2)


# Set-up & Load SAE

Here we define the Sparse Autoencoder (SAE) architecture that will be used to interpret the Nucleotide Transformer model.

Key components of our SAE implementation:

1. **Dictionary expansion**: We use a hidden dimension that is larger than the model's MLP dimension by a factor of `dict_mult=8` (512 × 8 = 4096 latents), which leaves room for more specialized, sparse features.

2. **Activation function**: The SAE uses a JumpReLU activation function.

3. **Loss function components**:
   - Reconstruction loss: ensures the SAE can accurately reconstruct the original activations.
   - Sparsity loss: we use a continuous approximation of the L0 norm to encourage sparsity. This continuous approximation lets us optimise L0 directly, which we think is a better proxy for sparsity than L1.
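
To make the activation and the sparsity penalty concrete, here is a minimal, dependency-free sketch of a JumpReLU with a fixed threshold and of a sigmoid relaxation of the L0 count. The threshold of 0.3 and the temperature of 1.0 mirror the defaults in the `AutoEncoder` class defined below; the function names `jump_relu` and `soft_l0` are illustrative assumptions and not part of the notebook's code.

```python
import math

THRESHOLD = 0.3    # mirrors the default activation threshold of the SAE class
TEMPERATURE = 1.0  # mirrors the default sigmoid temperature of the SAE class

def jump_relu(pre_act, threshold=THRESHOLD):
    """Pass the value through only if it exceeds the threshold, else return 0."""
    return pre_act if pre_act > threshold else 0.0

def soft_l0(act, threshold=THRESHOLD, temperature=TEMPERATURE):
    """Sigmoid relaxation of the indicator [|act| > threshold]."""
    return 1.0 / (1.0 + math.exp(-(abs(act) - threshold) / temperature))

pre_acts = [-1.0, 0.1, 0.3, 2.0]

# Hard-thresholded activations, as used for the reconstruction
acts = [jump_relu(p) for p in pre_acts]

# Exact count of active latents (not differentiable)
true_l0 = sum(a != 0.0 for a in acts)

# Differentiable proxy, computed on the ReLU output before thresholding
proxy_l0 = sum(soft_l0(max(p, 0.0)) for p in pre_acts)

print(acts, true_l0, proxy_l0)
```

As the temperature shrinks, `soft_l0` approaches the hard 0/1 indicator, but its gradient vanishes away from the threshold, so the temperature trades fidelity to the true L0 count against useful gradients.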



After setting up the architecture, we'll load pre-trained weights that have already been optimized on millions of activations from the Nucleotide Transformer model.

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F

cfg = {
    "seed": 49,
    "batch_size": 4096*6,
    "buffer_mult": 384,
    "lr": 5e-5,
    #"num_tokens": tokenizer_nt.vocab_size,
    "d_model": 512,
    "l1_coeff": 1e-1,
    "beta1": 0.9,
    "beta2": 0.999,
    "dict_mult": 8, # hidden_d = d_model * dict_mult
    "seq_len": 512,
    "d_mlp": 512,
    "enc_dtype":"fp32",
    "remove_rare_dir": False,
    "total_training_steps": 10000,
    "lr_warm_up_steps": 1000,
    "device": "cuda"
}

cfg["model_batch_size"] = 64
cfg["buffer_size"] = cfg["batch_size"] * cfg["buffer_mult"]
cfg["buffer_batches"] = cfg["buffer_size"] // cfg["seq_len"]

DTYPES = {"fp32": torch.float32, "fp16": torch.float16, "bf16": torch.bfloat16}

class AutoEncoder(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # HP-choices
        d_hidden = cfg["d_mlp"] * cfg["dict_mult"]
        d_mlp = cfg["d_mlp"]
        self.l0_coeff = cfg.get("l0_coeff", 5)
        self.threshold = cfg.get("activation_threshold", 0.3)
        # Temperature for sigmoid approximation
        self.temperature = cfg.get("temperature", 1.0)
        dtype = DTYPES[cfg["enc_dtype"]]
        torch.manual_seed(cfg["seed"])

        self.W_enc = nn.Parameter(torch.nn.init.kaiming_uniform_(torch.empty(d_mlp, d_hidden, dtype=dtype)))
        self.W_dec = nn.Parameter(torch.nn.init.kaiming_uniform_(torch.empty(d_hidden, d_mlp, dtype=dtype)))
        self.b_enc = nn.Parameter(torch.zeros(d_hidden, dtype=dtype))
        self.b_dec = nn.Parameter(torch.zeros(d_mlp, dtype=dtype))
        self.W_dec.data[:] = self.W_dec / self.W_dec.norm(dim=-1, keepdim=True)

        self.d_hidden = d_hidden
        self.to("cuda") if torch.cuda.is_available() else self.to("cpu")

    def get_continuous_l0(self, x):
        """
        Compute continuous relaxation of L0 norm using sigmoid
        This provides useful gradients unlike the discrete L0
        """
        # Shifted sigmoid to approximate step function
        return torch.sigmoid((x.abs() - self.threshold) / self.temperature)

    def forward(self, x):
        # encoding and decoding of input vec
        x_cent = x - self.b_dec
        pre_acts = x_cent @ self.W_enc + self.b_enc
        acts = F.relu(pre_acts)

        # Compute continuous L0 approximation before thresholding
        l0_proxy = self.get_continuous_l0(acts)

        # Apply a hard threshold in the forward pass: activations at or below the threshold are zeroed.
        # With a fixed (not learned) threshold this acts as a JumpReLU.
        acts_sparse = (acts.abs() > self.threshold).float() * acts
        x_reconstruct = acts_sparse @ self.W_dec + self.b_dec

        # L2 Loss (Reconstruction Loss)
        l2_loss = F.mse_loss(x_reconstruct.float(), x.float(), reduction='none')
        l2_loss = l2_loss.sum(-1)
        l2_loss = l2_loss.mean()

        # Relative reconstruction error (ratio of L2 norms), reported under the name nmse
        nmse = torch.norm(x - x_reconstruct, p=2) / torch.norm(x, p=2)

        # Continuous L0 loss (using sigmoid approximation)
        l0_loss = l0_proxy.sum(dim=1).mean()

        # Total Loss: reconstruction + sparsity
        loss = l2_loss + self.l0_coeff * l0_loss

        # For monitoring: true L0 count (not used in optimization)
        true_l0 = (acts_sparse.float().abs() > 0).float().sum(dim=1).mean()

        # For monitoring: L1 loss
        l1_loss = acts_sparse.float().abs().sum(-1).mean()

        return loss, x_reconstruct, acts_sparse, l2_loss, nmse, l1_loss, true_l0

    @torch.no_grad()
    def remove_parallel_component_of_grads(self):
        W_dec_normed = self.W_dec / self.W_dec.norm(dim=-1, keepdim=True)
        W_dec_grad_proj = (self.W_dec.grad * W_dec_normed).sum(-1, keepdim=True) * W_dec_normed
        self.W_dec.grad -= W_dec_grad_proj


## Load already-trained SAE

In [7]:
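weights_path = "/content/drive/MyDrive/SAEs_for_Genomics/Weights/nt50m_sae_+40mtokens.pt"
# weights_only=True restricts loading to tensors and other plain data
state_dict = torch.load(weights_path, map_location="cuda" if torch.cuda.is_available() else "cpu", weights_only=True)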
sae_model = AutoEncoder(cfg)
sae_model.load_state_dict(state_dict)





<All keys matched successfully>

# Using the trained SAE to interpret the Nucleotide Transformer

In the following sections, we'll:
- Load tokenized genomic sequences with biological annotations
- Apply our pre-trained SAE to NT activations
- Analyze which SAE features activate on which biological elements
- Validate our findings through external genetic databases



## Loading test sequences with functional annotations


In a different notebook we created three random, non-overlapping sets of annotated plasmid sequences. Here we load these sequences together with their annotations.

In [8]:
import pandas as pd
import torch
from transformers import AutoTokenizer

def load_and_process_annotations(file_path):
    """Load CSV and add 'valseq_' prefix to seq_id column if not already present."""
    df = pd.read_csv(file_path)
    df['seq_id'] = df['seq_id'].astype(str)
    # Add 'valseq_' prefix only if it's not already there
    df['seq_id'] = df['seq_id'].apply(lambda x: x if x.startswith('valseq_') else f'valseq_{x}')
    return df

def extract_and_tokenize_sequences(df_annotations, df_val, tokenizer_nt):
    """Extract sequence IDs, get corresponding sequences, and tokenize them."""
    # Extract and sort sequence IDs
    seq_ids = list(set(df_annotations['seq_id']))
    # More robust parsing of sequence IDs
    parsed_ids = []
    for seq_id in seq_ids:
        try:
            if 'valseq_' in seq_id:
                parsed_ids.append(int(seq_id.split('valseq_')[1]))
            else:
                parsed_ids.append(int(seq_id))
        except ValueError:
            print(f"Warning: Could not parse seq_id: {seq_id}")
            continue

    seq_ids = sorted(parsed_ids)

    # Get and tokenize sequences
    sequences = df_val['sequence'].iloc[seq_ids].tolist()
    tokens = tokenizer_nt(
        sequences,
        max_length=512,
        padding='max_length',
        truncation=True,
        return_tensors="pt"
    )

    return tokens, seq_ids

# File paths
base_path = '/content/drive/MyDrive/SAEs_for_Genomics/Annotated_seqs'
files = {
    's0': f'{base_path}/ann_of_1000_seqs_set0.csv',
    's1': f'{base_path}/ann_of_1000_seqs_set1.csv',
    's2': f'{base_path}/ann_of_1000_seqs_set2.csv',
}

# Process all files
dfs = {key: load_and_process_annotations(path) for key, path in files.items()}

# Extract and tokenize sequences for each dataset
results = {
    key: extract_and_tokenize_sequences(df, df_val, tokenizer_nt)
    for key, df in dfs.items()
}

# Unpack results if needed
tokens_s0, seq_ids_s0 = results['s0']
tokens_s1, seq_ids_s1 = results['s1']
tokens_s2, seq_ids_s2 = results['s2']

In [None]:
## SANITY CHECKs

for _ in range(100):

    # check that sequences are not identical at the same position
    N = np.random.randint(0, len(seq_ids_s1))
    assert not torch.equal(tokens_s1['input_ids'][N], tokens_s2['input_ids'][N])
    assert not torch.equal(tokens_s1['input_ids'][N], tokens_s0['input_ids'][N])
    assert not torch.equal(tokens_s2['input_ids'][N], tokens_s0['input_ids'][N])

# look at overlap between seq_ids
assert len(set(seq_ids_s1).intersection(set(seq_ids_s2))) == 0
assert len(set(seq_ids_s1).intersection(set(seq_ids_s0))) == 0
assert len(set(seq_ids_s2).intersection(set(seq_ids_s0))) == 0



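### Create a per-token DataFrame with annotations from the tokenised sequences

### Skip for N >= 1000

The cell below builds the token DataFrames from scratch, which takes a long time. For sets of 1000 or more sequences, skip it and load the saved DataFrames instead.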
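
The internals of `utils.make_token_df_new` are not shown in this notebook, so the following is only a sketch of the underlying idea under stated assumptions: each token covers 6 consecutive nucleotides (the `nucleotides_per_token` value used below), special tokens are ignored, and annotations are given as half-open `(start, end, label)` nucleotide intervals. The helper names `token_span` and `annotate_tokens` are hypothetical.

```python
NUCLEOTIDES_PER_TOKEN = 6  # assumption taken from the nucleotides_per_token argument

def token_span(token_pos, k=NUCLEOTIDES_PER_TOKEN):
    """Half-open nucleotide interval [start, end) covered by a token."""
    return token_pos * k, (token_pos + 1) * k

def annotate_tokens(sequence, annotations, k=NUCLEOTIDES_PER_TOKEN):
    """List every full k-mer token together with the labels of overlapping annotations."""
    rows = []
    for pos in range(len(sequence) // k):
        start, end = token_span(pos, k)
        # Two half-open intervals overlap if each starts before the other ends
        labels = [label for a_start, a_end, label in annotations
                  if a_start < end and start < a_end]
        rows.append({
            "token_pos": pos,
            "tokens": sequence[start:end],
            "token_annotations": labels,
        })
    return rows

# Toy example: an 18-nucleotide sequence with one annotation on nucleotides 3..9
rows = annotate_tokens("ATGACCGAGTACAAGCCC", [(3, 10, "PuroR")])
for row in rows:
    print(row)
```

A token that only partially overlaps an annotated region still receives its label here; whether the notebook's helper makes the same choice is an assumption.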

In [None]:
# Create a table that lists each token in the sequences alongside its annotation(s)
"""
for i in ['s0', 's1', 's2']:

    if i == 's0':
        tokens = tokens_s0['input_ids']
        seq_ids = seq_ids_s0

    elif i == 's1':
        tokens = tokens_s1['input_ids']
        seq_ids = seq_ids_s1
    elif i == 's2':
        tokens = tokens_s2['input_ids']
        seq_ids = seq_ids_s2

    token_df = utils.make_token_df_new(
                          tokens = tokens.squeeze(),
                          tokenizer = tokenizer_nt,
                          df_annotated = dfs[i],
                          seq_ids = seq_ids,
                          len_prefix = 6, ## choice: what should these be?
                          len_suffix = 6,
                          nucleotides_per_token = 6, # particular to this model
                          descriptor_col = 'Type' # values: Feature, Type, Description
    )
    token_df

    # save token_df
    token_df.to_csv(f'/content/drive/MyDrive/SAEs_for_Genomics/Annotated_seqs/token_df_1k_s{i}_TYPE.csv', index=False)

"""

### Load the saved token DataFrames directly

In [9]:
# load token_df for >= 1000 seqs
token_df_1k_s1 = pd.read_csv('/content/drive/MyDrive/SAEs_for_Genomics/Annotated_seqs/token_df_1k_ss1.csv')
token_df_1k_s2 = pd.read_csv('/content/drive/MyDrive/SAEs_for_Genomics/Annotated_seqs/token_df_1k_ss2.csv')
token_df_1k_s0 = pd.read_csv('/content/drive/MyDrive/SAEs_for_Genomics/Annotated_seqs/token_df_1k_ss0.csv')


## Running SAE on Sequences

Here we get the latent activations of the SAE for each annotated input sequence. We need these to interpret the latents in the next section.

**I recommend using an L4 GPU (or more powerful) to speed up this part.** With less GPU RAM, reduce the batch size.

We start by getting the SAE activations for every token in our dataset.
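
The alignment between SAE latents and the rows of the token DataFrame rests on a simple index convention, sketched here under stated assumptions: batches are taken over sequences, every sequence is padded or truncated to 512 tokens (as in the tokenizer call earlier), and flattening keeps sequences in their original order. The helper names `sequence_batches` and `flat_row` are hypothetical.

```python
SEQ_LEN = 512  # tokens per padded sequence, as in the tokenizer call

def sequence_batches(num_seqs, batch_size):
    """Yield (start, end) index pairs that cover all sequences exactly once."""
    for start in range(0, num_seqs, batch_size):
        yield start, min(start + batch_size, num_seqs)

def flat_row(seq_rank, token_pos, seq_len=SEQ_LEN):
    """Row of a token after flattening (num_seqs, seq_len, d) to (num_seqs * seq_len, d)."""
    return seq_rank * seq_len + token_pos

# 130 sequences in batches of 52 give two full batches and one partial batch
batches = list(sequence_batches(num_seqs=130, batch_size=52))
print(batches)

# Token 483 of the sequence at rank 614 in the tokenised set
print(flat_row(614, 483))
```

Under this convention the row index modulo 512 equals `token_pos`, which is consistent with the activation tables shown later in this notebook (for example index 314851 with `token_pos` 483).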

In [10]:
import torch
from tqdm import tqdm

d_model = cfg["d_model"]
d_mlp = cfg["d_mlp"]
num_layer = 11 # @param
batch_size = 52 # number of sequences per batch

tokens = tokens_s1 #@param options:

# Batches are slices over sequences (dim 0 of input_ids), so count sequences, not tokens
num_seqs = tokens['input_ids'].shape[0]
num_batches = (num_seqs + batch_size - 1) // batch_size

all_latents = []
all_acts = []

# Ensure models are in eval mode and the NT model is on the GPU
sae_model.eval()
model_nt.eval()
model_nt.cuda()

# Add progress bar
for i in tqdm(range(num_batches), desc="Processing batches", unit="batch"):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, num_seqs)

    # Select the sequences of the current batch
    batch_input_ids = tokens['input_ids'][start_idx:end_idx].cuda()
    batch_attention_mask = tokens['attention_mask'][start_idx:end_idx].cuda()

    with torch.no_grad():
        # mixed precision (torch.autocast replaces the deprecated torch.cuda.amp.autocast)
        with torch.autocast(device_type="cuda"):
            # Get MLP activations
            mlp_act = utils.get_layer_activations(model_nt,
                                                batch_input_ids,
                                                batch_attention_mask,
                                                layer_N=num_layer)

            # Flatten to one row per token
            mlp_act = mlp_act[0].reshape(-1, d_mlp)
            all_acts.append(mlp_act)

            # Forward pass through SAE
            loss, x_reconstruct, latents, l2_loss, nmse, l1_loss, true_l0 = sae_model(mlp_act)
            all_latents.append(latents)

# Combine results, move to cpu before
all_acts = torch.cat(all_acts, dim=0).cpu()
all_latents = [x.cpu() for x in all_latents]
combined_latents = torch.cat(all_latents, dim=0).cpu()
torch.cuda.empty_cache()



## Interpreting SAE latents

Let's look at the most strongly activating tokens for a given SAE latent, alongside their functional annotations. As a case study, we look at a particularly interpretable latent, 946, which is highly monosemantic for genes that encode puromycin resistance.

In [12]:
latent_id = 946 # @param or set particular int value in range 0, 4095

# we avoid modifying token_df directly as it's very time-consuming to reload if we mess it up
token_df_copy = token_df_1k_s1.copy() # @param

# get the activation value of the chosen SAE latent for every token
hidden_act_feature_id = combined_latents[:, latent_id]

# add this to the dataframe
token_df_copy[f"latent-{latent_id}-act"] = hidden_act_feature_id.cpu().detach().numpy()

# sort to show the most activating tokens on top, add colours
token_df_copy.sort_values(f"latent-{latent_id}-act", ascending=False).head(50).style.background_gradient("coolwarm")


## Calculate F1 Scores of latent-concept detection

Now that we have identified a latent that appears to detect puromycin resistance genes, we need a quantitative way to evaluate how accurately it identifies the relevant genetic elements. We use the F1 score, which balances precision (what fraction of the tokens that activate the feature are annotated as PuroR) and recall (what fraction of the PuroR-annotated tokens activate the feature).

A challenge in evaluating these features is how to handle multi-token genetic elements. Traditional evaluation approaches would treat each token independently, but this could underestimate the capability of our feature detector. In genomic data, a feature might only need to strongly activate on one part of a gene to successfully identify it.

We implement two evaluation approaches:
1. **Modified recall calculation**: For each annotated region (like HIV genes), we only require the feature to activate strongly on at least one token within that region. This reflects the detection capabilities more accurately.
2. **Standard token-level evaluation**: Each token is evaluated independently, providing a more conservative estimate of performance.

The function `preprocess_annotation_data` handles the first approach, while setting `modified_recall=False` gives us the second approach. We calculate metrics across multiple activation thresholds to find the optimal detection point.

The results show that for the PuroR annotation (puromycin resistance gene), latent 946 achieves its highest F1 score of 0.913 at threshold 5 with the modified approach (precision 0.903, recall 0.923). This suggests that the feature is a highly specific detector for the puromycin resistance gene.

#### Elana's method for recall calculation

If the latent activates on at least one token in an annotated region, we do not count the tokens in that region on which it did not activate as misses.
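
The difference between the two evaluation approaches can be illustrated on a toy example. This is a dependency-free sketch, not the notebook's implementation: the token tuples and activation values are made up, and, following `preprocess_annotation_data` below, the strongest annotated token is taken per sequence.

```python
from collections import defaultdict

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 from two equally long boolean lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy tokens: (seq_id, is_annotated, activation); values are invented for illustration
tokens = [
    (1, True, 6.0), (1, True, 0.5), (1, True, 0.2), (1, False, 0.1),
    (2, True, 0.4), (2, True, 0.3), (2, False, 5.5), (2, False, 0.0),
]
threshold = 5

# Standard token-level evaluation: every token counts on its own
y_true = [annotated for _, annotated, _ in tokens]
y_pred = [act > threshold for _, _, act in tokens]
std_precision, std_recall, std_f1 = precision_recall_f1(y_true, y_pred)

# Modified recall: keep only the strongest annotated token per sequence
best_act = defaultdict(float)
for seq_id, annotated, act in tokens:
    if annotated:
        best_act[seq_id] = max(best_act[seq_id], act)
mod_recall = sum(act > threshold for act in best_act.values()) / len(best_act)

# Precision is still computed over all tokens
denom = std_precision + mod_recall
mod_f1 = 2 * std_precision * mod_recall / denom if denom else 0.0

print(f"standard: P={std_precision:.3f} R={std_recall:.3f} F1={std_f1:.3f}")
print(f"modified: P={std_precision:.3f} R={mod_recall:.3f} F1={mod_f1:.3f}")
```

In this toy data the latent fires on one of five annotated tokens, but on one of two annotated sequences, so the modified recall is higher than the token-level recall while precision is unchanged.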

In [None]:
from sklearn.metrics import precision_score, recall_score
import pandas as pd

def preprocess_annotation_data(token_df, annotation, latent_id):
    """
    Preprocesses token dataframe for a given annotation and latent ID.
    Among the tokens carrying the annotation, keeps only the token with the
    highest activation per sequence (seq_id); all other tokens are kept as-is.
    """
    act_col = f"latent-{latent_id}-act"
    # regex=False: treat the annotation as a literal substring
    is_annotated = token_df['token_annotations'].str.contains(annotation, regex=False)

    # Get the highest-activation annotated token of each sequence
    high_act_tokens = (
        token_df[is_annotated]
        .sort_values(act_col, ascending=False, kind="stable")
        .groupby('seq_id')
        .head(1)
    )

    # Combine with non-annotated tokens
    return pd.concat([high_act_tokens, token_df[~is_annotated]])


def compute_metrics_across_thresholds(token_df, annotation, latent_id, thresholds=None, modified_recall: bool = True):
    """
    Computes precision, recall, and F1 scores across different activation thresholds.

    Args:
        token_df: DataFrame with token data
        annotation: String identifying the annotation type
        latent_id: ID of the latent being analyzed
        thresholds: Iterable of activation thresholds; if None, uses
            range(round(max activation) - 1)
        modified_recall: If True, recall is computed on the preprocessed data
            (one annotated token per sequence); if False, on every token

    Returns:
        List of tuples (threshold, precision, recall, f1)
    """
    act_col = f"latent-{latent_id}-act"

    # Preprocess data
    if modified_recall:
        modified_df = preprocess_annotation_data(token_df, annotation, latent_id)
    else:
        modified_df = token_df.copy()

    # Generate thresholds
    if thresholds is None:
        max_act = round(max(token_df[act_col]))
        thresholds = range(max_act - 1)

    print(thresholds)

    # Ground truth masks do not depend on the threshold
    true_precision = token_df['token_annotations'].apply(lambda x: 1 if annotation in x else 0)
    true_recall = modified_df['token_annotations'].apply(lambda x: 1 if annotation in x else 0)

    results = []
    for threshold in thresholds:
        # Generate prediction masks
        pred_precision = (token_df[act_col] > threshold).astype(int)
        pred_recall = (modified_df[act_col] > threshold).astype(int)

        # Compute metrics
        precision = precision_score(true_precision, pred_precision)
        recall = recall_score(true_recall, pred_recall)
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        results.append((threshold, precision, recall, f1))

    return results

def print_metrics(results):
    """Prints formatted metrics for each threshold."""
    for threshold, precision, recall, f1 in results:
        print(f"F1 score for threshold {threshold}: {f1:.3f}, "
              f"Precision: {precision:.3f}, Recall: {recall:.3f}")
        print("-" * 50)

# Make sure the activations of latent 946 are in the dataframe before evaluating it
latent_id = 946
token_df_copy[f"latent-{latent_id}-act"] = combined_latents[:, latent_id].cpu().detach().numpy()

results = compute_metrics_across_thresholds(token_df_copy, annotation='PuroR', latent_id=latent_id, thresholds=None, modified_recall=True)
print_metrics(results)

results = compute_metrics_across_thresholds(token_df_copy, annotation='PuroR', latent_id=latent_id, thresholds=None, modified_recall=False)
print_metrics(results)



range(0, 11)
F1 score for threshold 0: 0.289, Precision: 0.170, Recall: 0.949
--------------------------------------------------
F1 score for threshold 1: 0.425, Precision: 0.274, Recall: 0.949
--------------------------------------------------
F1 score for threshold 2: 0.625, Precision: 0.472, Recall: 0.923
--------------------------------------------------
F1 score for threshold 3: 0.766, Precision: 0.654, Recall: 0.923
--------------------------------------------------
F1 score for threshold 4: 0.868, Precision: 0.819, Recall: 0.923
--------------------------------------------------
F1 score for threshold 5: 0.913, Precision: 0.903, Recall: 0.923
--------------------------------------------------
F1 score for threshold 6: 0.850, Precision: 0.949, Recall: 0.769
--------------------------------------------------
F1 score for threshold 7: 0.789, Precision: 0.965, Recall: 0.667
--------------------------------------------------
F1 score for threshold 8: 0.498, Precision: 0.981, Recall: 

## Result: Latent Feature 1264 Detects HIV-Related Sequences

The analysis below reveals a striking pattern: latent feature #1264 in our SAE strongly activates on specific functional elements characteristic of HIV. To validate our hypothesis that this feature might be detecting HIV or lentivirus-related patterns, we:

1. Identified the top 100 tokens that most strongly activate this feature
2. Performed BLAST searches against the NCBI nucleotide database
3. Found that ~86% of these top-activating sequences are from HIV or related lentiviruses

This finding suggests that the Nucleotide Transformer model has internally learned to represent HIV-specific genetic patterns during its pre-training, despite not being explicitly trained on viral classification tasks. The SAE has successfully isolated this knowledge into a single interpretable feature.

This demonstrates the power of combining transformer models with interpretability techniques like SAEs for discovering biologically meaningful patterns in genetic data.
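
How the fraction of HIV-related hits was computed is not shown in this section. One simple way to do it, sketched here as an assumption rather than the notebook's method, is keyword matching on the titles of the best BLAST hits. The keyword list, the helper names and the example titles are all invented placeholders and not real BLAST output.

```python
# Hypothetical keyword list; adjust to the organisms of interest
LENTIVIRUS_KEYWORDS = (
    "human immunodeficiency virus",
    "hiv-1",
    "hiv-2",
    "simian immunodeficiency virus",
    "lentivir",
)

def is_lentiviral(hit_title, keywords=LENTIVIRUS_KEYWORDS):
    """True if the hit title mentions any of the keywords (case-insensitive)."""
    title = hit_title.lower()
    return any(keyword in title for keyword in keywords)

def lentiviral_fraction(hit_titles):
    """Fraction of hit titles that match a lentivirus keyword."""
    if not hit_titles:
        return 0.0
    return sum(is_lentiviral(title) for title in hit_titles) / len(hit_titles)

# Placeholder titles, invented for illustration only
example_titles = [
    "Human immunodeficiency virus 1 isolate EXAMPLE, env gene",
    "Lentiviral vector EXAMPLE, complete sequence",
    "Cloning vector EXAMPLE, complete sequence",
    "Simian immunodeficiency virus isolate EXAMPLE",
]
fraction = lentiviral_fraction(example_titles)
print(f"{fraction:.0%} of the example hits match a lentivirus keyword")
```

Keyword matching is crude: whether engineered lentiviral vectors should count as HIV-related is a judgment call that changes the resulting percentage.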

In [None]:
latent_id = 1264 # @param or set particular int value in range 0, 4095

# we avoid modifying token_df directly as it's very time-consuming to reload if we mess it up
token_df_copy = token_df_1k_s1.copy() # @param

# get the activation value for the N-th unit in the SAE for each input in batch
hidden_act_feature_id = combined_latents[:, latent_id] # N = feature_id

# add this to the dataframe
token_df_copy[f"latent-{latent_id}-act"] = hidden_act_feature_id.cpu().detach().numpy()

# sort to show the most activating tokens on top, add colours
token_df_copy.sort_values(f"latent-{latent_id}-act", ascending=False).head(50
                                                                           ).style.background_gradient("coolwarm")

Unnamed: 0,seq_id,token_pos,tokens,context,token_annotations,context_annotations,e-value annotation,percentage match,latent-1264-act
314851,9530,483,AATACA,GGAGTGGGACAGAGAAATTAACAATTACACAAGCTT |AATACA| CTCCTTAATTGAAGAATCGCAAAACCAGCAAGAAAA,['env'],['env'],[1.66e-226],[100.],12.200504
102808,2965,408,TGGTGC,CATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAG |TGGTGC| AGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGT,"['RRE', 'env']","['RRE', 'RRE', 'env']",[1.00e-126 3.49e-205],[100. 100.],11.841129
506990,15506,110,GTGGAG,GGGGACCCGACAGGCCCGAAGGAATAGAAGAAGAAG |GTGGAG| AGAGAGACAGAGACAGATCCATTCGATTAGTGAACG,['env'],['env'],[5.38e-181],[100.],11.598942
444818,13279,402,TGGTGC,CATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAG |TGGTGC| AGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGT,"['RRE', 'env']","['RRE', 'RRE', 'env']",[1.00e-126 3.46e-205],[100. 100.],11.497379
505136,15429,304,TGGTGC,CATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAG |TGGTGC| AGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGT,"['RRE', 'env']","['RRE', 'RRE', 'env']",[9.30e-127 2.99e-205],[100. 100.],11.364567
489436,14702,476,TGGTGC,CATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAG |TGGTGC| AGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGT,"['RRE', 'env']","['RRE', 'RRE', 'env']",[1.10e-126 3.81e-205],[100. 100.],11.130192
219510,6405,374,TGGTGC,CATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAG |TGGTGC| AGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGT,"['RRE', 'env']","['RRE', 'RRE', 'env']",[1.00e-126 3.33e-205],[100. 100.],11.122379
27999,786,351,GTGCAG,TTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTG |GTGCAG| AGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGTTC,"['RRE', 'env']","['RRE', 'RRE', 'env']",[1.10e-126 1.76e-226],[100. 100.],11.083317
482580,14440,276,GTGCAG,TTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTG |GTGCAG| AGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGTTC,"['RRE', 'env']","['RRE', 'RRE', 'env']",[9.40e-127 1.46e-226],[100. 100.],11.059879
314870,9530,502,GTGGTA,AAGTTTGTGGAATTGGTTTAACATAACAAATTGGCT |GTGGTA| TATAAAATTATTCATAATGATAGTAGGAGGCTTGGT,['env'],['env'],[1.66e-226],[100.],10.872379
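
A table like the one above can be summarized by tallying how often each annotation appears among the top-activating tokens. The following is a minimal sketch on toy values, not the notebook's own analysis: the helper name `annotation_counts` is hypothetical, and it assumes the annotation column may hold either Python lists or their string form (as happens after a CSV round-trip).

```python
import ast
from collections import Counter


def annotation_counts(activations, annotations, top_n=50):
    """Count annotation labels among the top_n highest-activating tokens."""
    ranked = sorted(zip(activations, annotations), key=lambda pair: pair[0], reverse=True)
    counts = Counter()
    for _, labels in ranked[:top_n]:
        # Labels may be stored as strings such as "['RRE', 'env']" after a CSV round-trip
        if isinstance(labels, str):
            labels = ast.literal_eval(labels)
        counts.update(labels)
    return counts


# Toy values for illustration only (not taken from the real DataFrame)
toy_acts = [12.2, 0.1, 11.8, 11.6]
toy_anns = ["['env']", "['gag']", "['RRE', 'env']", ["env"]]

counts = annotation_counts(toy_acts, toy_anns, top_n=3)
print(counts.most_common())
```

In the notebook, this could presumably be called with `token_df_copy[f"latent-{latent_id}-act"]` and `token_df_copy["token_annotations"]` as the two arguments.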


### Verification of results by BLASTing the most activating tokens



In [None]:
!pip install biopython
!pip install tqdm

Collecting biopython
  Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.85


In [None]:
# Create a FASTA file from the sequences
def create_fasta_from_df(df, output_file):
    with open(output_file, 'w') as f:
        for idx, row in df.iterrows():
            # Get the sequence and remove spaces and '|'
            seq = row['context'].replace(' ', '').replace('|', '')

            # Write in FASTA format with sequence ID and the sequence
            f.write(f">sequence_{idx}\n{seq}\n")

# Create the FASTA file
output_file = "sequences.fasta"
top_N = 100

# Sort and store the result, then take top N rows
sorted_df = token_df_copy.sort_values(f"latent-{latent_id}-act", ascending=False)
top_sequences = sorted_df.head(top_N)

create_fasta_from_df(top_sequences, output_file)

# Verify the file contents
with open(output_file, 'r') as f:
    print("First few sequences in the FASTA file:")
    print(f.read().strip()[:500])  # Print first 500 characters as preview

First few sequences in the FASTA file:
>sequence_314851
GGAGTGGGACAGAGAAATTAACAATTACACAAGCTTAATACACTCCTTAATTGAAGAATCGCAAAACCAGCAAGAAAA
>sequence_102808
CATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTGGTGCAGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGT
>sequence_506990
GGGGACCCGACAGGCCCGAAGGAATAGAAGAAGAAGGTGGAGAGAGAGACAGAGACAGATCCATTCGATTAGTGAACG
>sequence_444818
CATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTGGTGCAGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGT
>sequence_505136
CATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTGGTGCAGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGT
>sequence_489436
CAT
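
Several of the top-activating contexts are identical: three of the first five records in the FASTA preview above share one sequence, so each would trigger the same BLAST query. One option is to group identical sequences before querying. This is a minimal sketch, with `group_identical` as a hypothetical helper name; it is not part of the notebook's pipeline.

```python
def group_identical(records):
    """Map each distinct sequence to the list of record IDs that share it.

    Python dicts preserve insertion order, so sequences stay in first-seen order.
    """
    groups = {}
    for seq_id, seq in records:
        groups.setdefault(seq, []).append(seq_id)
    return groups


# First five records from the FASTA preview above
shared = "CATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTGGTGCAGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGT"
records = [
    ("sequence_314851", "GGAGTGGGACAGAGAAATTAACAATTACACAAGCTTAATACACTCCTTAATTGAAGAATCGCAAAACCAGCAAGAAAA"),
    ("sequence_102808", shared),
    ("sequence_506990", "GGGGACCCGACAGGCCCGAAGGAATAGAAGAAGAAGGTGGAGAGAGAGACAGAGACAGATCCATTCGATTAGTGAACG"),
    ("sequence_444818", shared),
    ("sequence_505136", shared),
]

groups = group_identical(records)
for seq, ids in groups.items():
    print(f"{len(ids)} record(s) share {seq[:12]}...")
```

BLASTing only the distinct sequences and counting each result once per ID in its group should give the same tally as querying every record, with fewer requests to NCBI.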


In [None]:
from Bio import Entrez, SeqIO
from Bio.Blast import NCBIWWW, NCBIXML
from tqdm import tqdm
import time


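def blast_sequence(seq, database="nr"):
    """Submit one sequence to NCBI BLAST (blastn) and return the XML result handle.

    Reads the module-level settings `e_threshold` and `n_hits` from the config
    section of this cell. Returns None if the request fails.
    """
    try:
        # Run BLAST search
        result_handle = NCBIWWW.qblast(
            "blastn",               # nucleotide BLAST
            database,               # database to search
            seq,
            expect=e_threshold,     # E-value threshold
            hitlist_size=n_hits,    # number of hits to return
        )
        return result_handle
    except Exception as e:
        print(f"Error during BLAST: {e}")
        return None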

def analyze_blast_results(blast_record):
    hiv_related = False
    for alignment in blast_record.alignments:
        if any(term.lower() in alignment.title.lower()
               for term in ['hiv', 'lentivirus', 'immunodeficiency virus']):
            hiv_related = True
            break
    return hiv_related

# Config
Entrez.email = "maiwald.aaron@outlook.de"
n_hits = 30
e_threshold = 1e-10



# Load the sequences from the FASTA file
sequences = []
hiv_matches = 0

with open('sequences.fasta', 'r') as file:
    for record in SeqIO.parse(file, 'fasta'):
        sequences.append(str(record.seq))

# Process each sequence
for i, seq in enumerate(tqdm(sequences)):

    print(f"Processing sequence {i+1}/{len(sequences)}")
    result_handle = blast_sequence(seq)

    if result_handle:
        print("Parsing BLAST results...")
        # Parse BLAST results
        blast_records = NCBIXML.parse(result_handle)

        for blast_record in blast_records:
            if analyze_blast_results(blast_record):
                hiv_matches += 1
                print("HIV/lentivirus match found!")


    # NCBI recommends waiting between requests
    time.sleep(3)
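
One caveat with the title check in `analyze_blast_results`: a plain substring test for `'hiv'` also fires inside unrelated words such as "archive". The sketch below shows a stricter variant that only matches at the start of a word. The names `HIV_PATTERN`, `is_hiv_related` and `substring_check` are hypothetical, and the hit titles are made up for illustration.

```python
import re

# Leading word boundary only, so "HIV-1", "HIV1" and "lentiviruses" still match
HIV_PATTERN = re.compile(r"\b(?:hiv|lentivirus|immunodeficiency virus)", re.IGNORECASE)


def is_hiv_related(titles):
    """Return True if any hit title mentions HIV or a lentivirus at a word start."""
    return any(HIV_PATTERN.search(title) for title in titles)


def substring_check(titles):
    """The plain substring test, mirroring the logic of analyze_blast_results."""
    terms = ["hiv", "lentivirus", "immunodeficiency virus"]
    return any(term in title.lower() for title in titles for term in terms)


# Made-up hit titles for illustration only (not real BLAST output)
hiv_titles = ["Human immunodeficiency virus 1 isolate EXAMPLE, env gene"]
other_titles = ["Example sequence archive entry, cloning vector"]

print(is_hiv_related(hiv_titles), substring_check(hiv_titles))
print(is_hiv_related(other_titles), substring_check(other_titles))
```

In the loop above, the titles for one query would come from `[alignment.title for alignment in blast_record.alignments]`. Whether the stricter check changes the tally for this dataset is not something the sketch establishes.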

### Results of BLAST analysis:

In [None]:

# Calculate and print results
match_percentage = (hiv_matches / len(sequences)) * 100 if sequences else 0.0
print("\nResults:")
print(f"Total sequences: {len(sequences)}")
print(f"HIV/lentivirus matches: {hiv_matches}")
print(f"Percentage of matches: {match_percentage:.2f}%")
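
Since the match percentage is estimated from only 100 sequences, it can help to report an uncertainty range alongside it. A minimal sketch using the Wilson score interval for a binomial proportion follows. The function name `wilson_interval` is hypothetical, the counts in the example are placeholders rather than results from this notebook, and the interval treats the sequences as independent draws, which is optimistic here because many top-activating contexts overlap.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Placeholder counts for illustration only (not the notebook's results)
low, high = wilson_interval(successes=8, total=10)
print(f"8/10 matches: 95% interval {low:.1%} to {high:.1%}")
```

With the notebook's variables, the call would be `wilson_interval(hiv_matches, len(sequences))`.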

  0%|          | 0/100 [00:00<?, ?it/s]

Processing sequence 1/100
Parsing BLAST results...
HIV/lentivirus match found!


  1%|          | 1/100 [01:05<1:48:28, 65.74s/it]

Processing sequence 2/100
Parsing BLAST results...
HIV/lentivirus match found!


  2%|▏         | 2/100 [03:25<2:58:42, 109.42s/it]

Processing sequence 3/100
Parsing BLAST results...
HIV/lentivirus match found!


  3%|▎         | 3/100 [03:45<1:50:47, 68.53s/it] 

Processing sequence 4/100
Parsing BLAST results...
HIV/lentivirus match found!


  4%|▍         | 4/100 [05:05<1:56:58, 73.11s/it]

Processing sequence 5/100
Parsing BLAST results...
HIV/lentivirus match found!


  5%|▌         | 5/100 [05:25<1:25:11, 53.81s/it]

Processing sequence 6/100
Parsing BLAST results...
HIV/lentivirus match found!


  6%|▌         | 6/100 [05:45<1:06:32, 42.47s/it]

Processing sequence 7/100
Parsing BLAST results...
HIV/lentivirus match found!


  7%|▋         | 7/100 [07:05<1:24:54, 54.78s/it]

Processing sequence 8/100
Parsing BLAST results...
HIV/lentivirus match found!


  8%|▊         | 8/100 [10:25<2:34:41, 100.89s/it]

Processing sequence 9/100
Parsing BLAST results...
HIV/lentivirus match found!


  9%|▉         | 9/100 [11:45<2:23:08, 94.38s/it] 

Processing sequence 10/100
Parsing BLAST results...
HIV/lentivirus match found!


 10%|█         | 10/100 [13:05<2:15:04, 90.05s/it]

Processing sequence 11/100
Parsing BLAST results...


 11%|█         | 11/100 [18:25<3:57:47, 160.31s/it]

Processing sequence 12/100
Parsing BLAST results...
HIV/lentivirus match found!


 12%|█▏        | 12/100 [18:45<2:52:28, 117.59s/it]

Processing sequence 13/100
Parsing BLAST results...


 13%|█▎        | 13/100 [19:05<2:07:41, 88.06s/it] 

Processing sequence 14/100
Parsing BLAST results...
HIV/lentivirus match found!


 14%|█▍        | 14/100 [19:25<1:36:40, 67.44s/it]

Processing sequence 15/100
Parsing BLAST results...
HIV/lentivirus match found!


 15%|█▌        | 15/100 [19:45<1:15:26, 53.26s/it]

Processing sequence 16/100
Parsing BLAST results...
HIV/lentivirus match found!


 16%|█▌        | 16/100 [20:05<1:00:21, 43.11s/it]

Processing sequence 17/100
Parsing BLAST results...
HIV/lentivirus match found!


 17%|█▋        | 17/100 [20:25<50:01, 36.16s/it]  

Processing sequence 18/100
Parsing BLAST results...
HIV/lentivirus match found!


 18%|█▊        | 18/100 [20:45<42:49, 31.34s/it]

Processing sequence 19/100
Parsing BLAST results...
HIV/lentivirus match found!


 19%|█▉        | 19/100 [22:05<1:01:59, 45.92s/it]

Processing sequence 20/100
Parsing BLAST results...
HIV/lentivirus match found!


 20%|██        | 20/100 [22:25<50:53, 38.17s/it]  

Processing sequence 21/100
Parsing BLAST results...
HIV/lentivirus match found!


 21%|██        | 21/100 [22:45<43:14, 32.85s/it]

Processing sequence 22/100
Parsing BLAST results...
HIV/lentivirus match found!


 22%|██▏       | 22/100 [24:06<1:01:09, 47.05s/it]

Processing sequence 23/100
Parsing BLAST results...
HIV/lentivirus match found!


 23%|██▎       | 23/100 [24:25<49:40, 38.71s/it]  

Processing sequence 24/100
Parsing BLAST results...
HIV/lentivirus match found!


 24%|██▍       | 24/100 [24:45<42:07, 33.26s/it]

Processing sequence 25/100
Parsing BLAST results...
HIV/lentivirus match found!


 25%|██▌       | 25/100 [25:05<36:24, 29.13s/it]

Processing sequence 26/100
Parsing BLAST results...
HIV/lentivirus match found!


 26%|██▌       | 26/100 [25:25<32:47, 26.58s/it]

Processing sequence 27/100
Parsing BLAST results...
HIV/lentivirus match found!


 27%|██▋       | 27/100 [25:45<29:42, 24.42s/it]

Processing sequence 28/100
Parsing BLAST results...


 28%|██▊       | 28/100 [26:05<27:48, 23.17s/it]

Processing sequence 29/100
Parsing BLAST results...
HIV/lentivirus match found!


 29%|██▉       | 29/100 [30:25<1:51:29, 94.21s/it]

Processing sequence 30/100
Parsing BLAST results...
HIV/lentivirus match found!


 30%|███       | 30/100 [30:45<1:23:53, 71.90s/it]

Processing sequence 31/100
Parsing BLAST results...
HIV/lentivirus match found!


 31%|███       | 31/100 [31:05<1:04:51, 56.39s/it]

Processing sequence 32/100
Parsing BLAST results...
HIV/lentivirus match found!


 32%|███▏      | 32/100 [32:25<1:11:52, 63.42s/it]

Processing sequence 33/100
Parsing BLAST results...
HIV/lentivirus match found!


 33%|███▎      | 33/100 [33:45<1:16:26, 68.45s/it]

Processing sequence 34/100
Parsing BLAST results...
HIV/lentivirus match found!


 34%|███▍      | 34/100 [37:05<1:58:44, 107.95s/it]

Processing sequence 35/100
Parsing BLAST results...
HIV/lentivirus match found!


 35%|███▌      | 35/100 [37:25<1:28:13, 81.43s/it] 

Processing sequence 36/100
Parsing BLAST results...
HIV/lentivirus match found!


 36%|███▌      | 36/100 [37:45<1:07:20, 63.14s/it]

Processing sequence 37/100
Parsing BLAST results...


 37%|███▋      | 37/100 [38:05<52:34, 50.08s/it]  

Processing sequence 38/100
Parsing BLAST results...
HIV/lentivirus match found!


 38%|███▊      | 38/100 [39:25<1:01:05, 59.12s/it]

Processing sequence 39/100
Parsing BLAST results...
HIV/lentivirus match found!


 39%|███▉      | 39/100 [39:45<48:06, 47.33s/it]  

Processing sequence 40/100
Parsing BLAST results...


 40%|████      | 40/100 [41:05<57:11, 57.19s/it]

Processing sequence 41/100
Parsing BLAST results...
HIV/lentivirus match found!


 41%|████      | 41/100 [41:25<45:10, 45.95s/it]

Processing sequence 42/100
Parsing BLAST results...
HIV/lentivirus match found!


 42%|████▏     | 42/100 [41:45<36:57, 38.24s/it]

Processing sequence 43/100
Parsing BLAST results...
HIV/lentivirus match found!


 43%|████▎     | 43/100 [42:05<31:04, 32.71s/it]

Processing sequence 44/100
Parsing BLAST results...
HIV/lentivirus match found!


 44%|████▍     | 44/100 [42:25<27:06, 29.04s/it]

Processing sequence 45/100
Parsing BLAST results...
HIV/lentivirus match found!


 45%|████▌     | 45/100 [42:45<23:59, 26.17s/it]

Processing sequence 46/100
Parsing BLAST results...
HIV/lentivirus match found!


 46%|████▌     | 46/100 [45:05<54:21, 60.40s/it]

Processing sequence 47/100
Parsing BLAST results...
HIV/lentivirus match found!


 47%|████▋     | 47/100 [46:25<58:32, 66.28s/it]

Processing sequence 48/100
Parsing BLAST results...


 48%|████▊     | 48/100 [46:45<45:20, 52.32s/it]

Processing sequence 49/100
Parsing BLAST results...
HIV/lentivirus match found!


 49%|████▉     | 49/100 [47:05<36:16, 42.68s/it]

Processing sequence 50/100
Parsing BLAST results...
HIV/lentivirus match found!


 50%|█████     | 50/100 [47:25<29:51, 35.83s/it]

Processing sequence 51/100
Parsing BLAST results...
HIV/lentivirus match found!


 51%|█████     | 51/100 [47:45<25:29, 31.22s/it]

Processing sequence 52/100
Parsing BLAST results...
HIV/lentivirus match found!


 52%|█████▏    | 52/100 [48:05<22:09, 27.70s/it]

Processing sequence 53/100
Parsing BLAST results...
HIV/lentivirus match found!


 53%|█████▎    | 53/100 [48:25<19:56, 25.46s/it]

Processing sequence 54/100
Parsing BLAST results...
HIV/lentivirus match found!


 54%|█████▍    | 54/100 [48:45<18:12, 23.75s/it]

Processing sequence 55/100
Parsing BLAST results...


 55%|█████▌    | 55/100 [49:05<17:03, 22.74s/it]

Processing sequence 56/100
Parsing BLAST results...
HIV/lentivirus match found!


 56%|█████▌    | 56/100 [50:25<29:15, 39.90s/it]

Processing sequence 57/100
Parsing BLAST results...
HIV/lentivirus match found!


 57%|█████▋    | 57/100 [50:45<24:15, 33.84s/it]

Processing sequence 58/100
Parsing BLAST results...
HIV/lentivirus match found!


 58%|█████▊    | 58/100 [51:05<20:51, 29.81s/it]

Processing sequence 59/100
Parsing BLAST results...
HIV/lentivirus match found!


 59%|█████▉    | 59/100 [51:25<18:16, 26.74s/it]

Processing sequence 60/100
Parsing BLAST results...


 60%|██████    | 60/100 [51:45<16:32, 24.81s/it]

Processing sequence 61/100
Parsing BLAST results...
HIV/lentivirus match found!


 61%|██████    | 61/100 [52:05<15:09, 23.32s/it]

Processing sequence 62/100
Parsing BLAST results...
HIV/lentivirus match found!


 62%|██████▏   | 62/100 [52:25<14:09, 22.35s/it]

Processing sequence 63/100
Parsing BLAST results...
HIV/lentivirus match found!


 63%|██████▎   | 63/100 [52:45<13:18, 21.59s/it]

Processing sequence 64/100
Parsing BLAST results...
HIV/lentivirus match found!


 64%|██████▍   | 64/100 [53:05<12:42, 21.19s/it]

Processing sequence 65/100
Parsing BLAST results...


 65%|██████▌   | 65/100 [53:25<12:07, 20.79s/it]

Processing sequence 66/100
Parsing BLAST results...
HIV/lentivirus match found!


 66%|██████▌   | 66/100 [53:45<11:40, 20.59s/it]

Processing sequence 67/100
Parsing BLAST results...
HIV/lentivirus match found!


 67%|██████▋   | 67/100 [54:05<11:12, 20.38s/it]

Processing sequence 68/100
Parsing BLAST results...
HIV/lentivirus match found!


 68%|██████▊   | 68/100 [54:25<10:51, 20.35s/it]

Processing sequence 69/100
Parsing BLAST results...
HIV/lentivirus match found!


 69%|██████▉   | 69/100 [54:45<10:25, 20.17s/it]

Processing sequence 70/100
Parsing BLAST results...
HIV/lentivirus match found!


 70%|███████   | 70/100 [55:05<10:04, 20.14s/it]

Processing sequence 71/100
Parsing BLAST results...
HIV/lentivirus match found!


 71%|███████   | 71/100 [55:25<09:40, 20.01s/it]

Processing sequence 72/100
Parsing BLAST results...
HIV/lentivirus match found!


 72%|███████▏  | 72/100 [55:45<09:23, 20.14s/it]

Processing sequence 73/100
Parsing BLAST results...


 73%|███████▎  | 73/100 [56:05<08:59, 19.97s/it]

Processing sequence 74/100
Parsing BLAST results...
HIV/lentivirus match found!


 74%|███████▍  | 74/100 [56:25<08:41, 20.06s/it]

Processing sequence 75/100
Parsing BLAST results...
HIV/lentivirus match found!


 75%|███████▌  | 75/100 [1:13:45<2:15:53, 326.13s/it]

Processing sequence 76/100
Parsing BLAST results...


 76%|███████▌  | 76/100 [1:15:05<1:40:52, 252.19s/it]

Processing sequence 77/100
Parsing BLAST results...


 77%|███████▋  | 77/100 [1:15:25<1:09:57, 182.51s/it]

Processing sequence 78/100
Parsing BLAST results...
HIV/lentivirus match found!


 78%|███████▊  | 78/100 [1:15:45<49:05, 133.91s/it]  

Processing sequence 79/100
Parsing BLAST results...
HIV/lentivirus match found!


 79%|███████▉  | 79/100 [1:16:05<34:50, 99.56s/it] 

Processing sequence 80/100
Parsing BLAST results...
HIV/lentivirus match found!


 80%|████████  | 80/100 [1:16:25<25:16, 75.84s/it]

Processing sequence 81/100
Parsing BLAST results...
HIV/lentivirus match found!


 81%|████████  | 81/100 [1:17:45<24:24, 77.08s/it]

Processing sequence 82/100
Parsing BLAST results...
HIV/lentivirus match found!


 82%|████████▏ | 82/100 [1:18:05<17:58, 59.92s/it]

Processing sequence 83/100
Parsing BLAST results...
HIV/lentivirus match found!


 83%|████████▎ | 83/100 [1:18:25<13:35, 47.99s/it]

Processing sequence 84/100
