# FASTA Dataset Preprocessing Visualization

This notebook demonstrates how DNA/RNA sequences are preprocessed for the Binary Neural Network (BNN) model.

## Overview
- **Input**: Raw FASTA sequences (COVID vs Non-COVID)
- **Output**: K-mer frequency vectors for BNN training
- **Method**: Extract overlapping K-mers of length k (k=5 in this demonstration) and build a vocabulary-based frequency encoding

In [30]:
# Import required libraries
import os
import sys
from collections import Counter

# Add the project directory to the module search path so local modules can be imported
sys.path.append('/workspace/app/Pure_BNN')

print("Setting up FASTA dataset preprocessing demonstration...")
print("=" * 60)

Setting up FASTA dataset preprocessing demonstration...


In [31]:
# Example raw sequences (simulated examples from FASTA format)
print("1. RAW SEQUENCES FROM FASTA FILES")
print("=" * 40)

# Example COVID sequence (shortened for demonstration)
covid_sequence = "ATGGCGAACCGTCAGGACCTGGCCCAGTTTCAGGACAAGTGTCAGAATCGCTGGGATAACACCGCAGTAAGCGTCACCTTGAAGGCTGTCTTTGCTGGTGACAACGTCACCTTCCAGGCAATTGCCGACACCTTCACGTGCAGAAATGCTGTCGCGCTTACCTACAACGCTATGGCTCAGTCAACCTTCGTCGGCAACAACCCCTTGGGCAACAACGTCACGTGGAAC"

# Example Non-COVID sequence (shortened for demonstration)
non_covid_sequence = "ATGAAATACCTATTGCCTACGGCAGCCGCTGGATTGTTATTACTCGCGGCCCCTATATGTCTTTTGCAAAAGCCCTCTCGCTATTTTGGTTTTTATGGCCATTATCTGATCCCCACCCGAAATCCTATGGCAACAACTTTAAAAACAAAAACATCATGGAAGCCATGGCGCCAGTTGATAGACCGGTGGCCACGGATCTTGAAGGCAGTTGTCGCCAG"

print("COVID Sequence (first 200 nucleotides):")
print(f">{covid_sequence[:200]}...")
print(f"Length: {len(covid_sequence)} nucleotides")
print()

print("Non-COVID Sequence (first 200 nucleotides):")
print(f">{non_covid_sequence[:200]}...")
print(f"Length: {len(non_covid_sequence)} nucleotides")
print()

print("Class Labels:")
print("• COVID sequences → Label: 0 (positive class)")
print("• Non-COVID sequences → Label: 1 (negative class)")

1. RAW SEQUENCES FROM FASTA FILES
COVID Sequence (first 200 nucleotides):
>ATGGCGAACCGTCAGGACCTGGCCCAGTTTCAGGACAAGTGTCAGAATCGCTGGGATAACACCGCAGTAAGCGTCACCTTGAAGGCTGTCTTTGCTGGTGACAACGTCACCTTCCAGGCAATTGCCGACACCTTCACGTGCAGAAATGCTGTCGCGCTTACCTACAACGCTATGGCTCAGTCAACCTTCGTCGGCAACAA...
Length: 228 nucleotides

Non-COVID Sequence (first 200 nucleotides):
>ATGAAATACCTATTGCCTACGGCAGCCGCTGGATTGTTATTACTCGCGGCCCCTATATGTCTTTTGCAAAAGCCCTCTCGCTATTTTGGTTTTTATGGCCATTATCTGATCCCCACCCGAAATCCTATGGCAACAACTTTAAAAACAAAAACATCATGGAAGCCATGGCGCCAGTTGATAGACCGGTGGCCACGGATCTT...
Length: 218 nucleotides

Class Labels:
• COVID sequences → Label: 0 (positive class)
• Non-COVID sequences → Label: 1 (negative class)
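
The two sequences above are hard-coded strings. When the input comes from actual FASTA files, each record is a header line starting with `>` followed by one or more sequence lines. The following is a minimal sketch of a reader, assuming plain-text FASTA with wrapped sequence lines; `parse_fasta` and the sample records are hypothetical names made up for illustration and are not taken from the training script.

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; normalise case and collect them
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record FASTA text, for illustration only
example_fasta = """>seq1 example record
ATGGCGAACC
GTCAGG
>seq2 another record
atgaaatacc
"""

records = list(parse_fasta(io.StringIO(example_fasta)))
for header, seq in records:
    print(header, len(seq))
```

Upper-casing at read time matters here because the K-mer extraction step only keeps the characters `A`, `C`, `G` and `T`, so lowercase input would otherwise be discarded.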


In [32]:
# K-mer extraction function (simplified version from the training script)
def extract_kmers(sequence, k=3):
    """Extract all overlapping K-mers of length k from a sequence."""
    # Drop anything that is not an unambiguous nucleotide (e.g. N, gaps, newlines)
    clean_sequence = ''.join(c for c in sequence if c in 'ACGT')
    # Every window start in range(len - k + 1) yields a full-length K-mer
    return [clean_sequence[i:i + k] for i in range(len(clean_sequence) - k + 1)]

print("2. K-MER EXTRACTION PROCESS")
print("=" * 40)

# Set K-mer length
k = 5
print(f"K-mer length: {k}")
print()

# Extract K-mers from both sequences
covid_kmers = extract_kmers(covid_sequence, k)
non_covid_kmers = extract_kmers(non_covid_sequence, k)

print("COVID sequence K-mers (first 20):")
print(covid_kmers[:20])
print(f"Total K-mers extracted: {len(covid_kmers)}")
print()

print("Non-COVID sequence K-mers (first 20):")
print(non_covid_kmers[:20])
print(f"Total K-mers extracted: {len(non_covid_kmers)}")
print()

# Show unique K-mers
covid_unique = set(covid_kmers)
non_covid_unique = set(non_covid_kmers)

print(f"Unique K-mers in COVID sequence: {len(covid_unique)}")
print(f"Unique K-mers in Non-COVID sequence: {len(non_covid_unique)}")
print(f"Common K-mers: {len(covid_unique.intersection(non_covid_unique))}")

2. K-MER EXTRACTION PROCESS
K-mer length: 5

COVID sequence K-mers (first 20):
['ATGGC', 'TGGCG', 'GGCGA', 'GCGAA', 'CGAAC', 'GAACC', 'AACCG', 'ACCGT', 'CCGTC', 'CGTCA', 'GTCAG', 'TCAGG', 'CAGGA', 'AGGAC', 'GGACC', 'GACCT', 'ACCTG', 'CCTGG', 'CTGGC', 'TGGCC']
Total K-mers extracted: 224

Non-COVID sequence K-mers (first 20):
['ATGAA', 'TGAAA', 'GAAAT', 'AAATA', 'AATAC', 'ATACC', 'TACCT', 'ACCTA', 'CCTAT', 'CTATT', 'TATTG', 'ATTGC', 'TTGCC', 'TGCCT', 'GCCTA', 'CCTAC', 'CTACG', 'TACGG', 'ACGGC', 'CGGCA']
Total K-mers extracted: 214

Unique K-mers in COVID sequence: 182
Unique K-mers in Non-COVID sequence: 190
Common K-mers: 41
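
A sequence of L valid nucleotides yields L - k + 1 overlapping K-mers, which matches the totals above (228 - 5 + 1 = 224 and 218 - 5 + 1 = 214). One consequence of `extract_kmers` is worth noting: it deletes non-ACGT characters before windowing, so the bases on either side of a removed character become adjacent and can form K-mers that never occur in the original sequence. The sketch below shows one possible alternative that skips any window containing an ambiguous base instead; `extract_kmers_strict` is a hypothetical helper, not part of the training script.

```python
def extract_kmers_strict(sequence, k=5):
    """Return K-mers only from windows made up entirely of A, C, G, T."""
    sequence = sequence.upper()
    valid = set("ACGT")
    return [
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= valid
    ]

clean = "ATGGCGAACC"       # 10 valid bases
with_gap = "ATGGCNGAACC"   # the same bases with an ambiguous N inserted

print(len(extract_kmers_strict(clean)))   # 10 - 5 + 1 = 6 windows
print(extract_kmers_strict(with_gap))     # only the windows that avoid the N
```

Which behaviour is preferable depends on how often ambiguous bases occur in the dataset; for sequences with no such characters both approaches give identical K-mers.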


In [33]:
# Build vocabulary from all K-mers
print("3. K-MER VOCABULARY BUILDING")
print("=" * 40)

# Combine all K-mers to build vocabulary
all_kmers = covid_kmers + non_covid_kmers
kmer_counts = Counter(all_kmers)

# Keep the top K-mers (max_kmers=1000 exceeds the number of unique K-mers here, so all are kept)
max_kmers = 1000
most_common_kmers = kmer_counts.most_common(max_kmers)

print(f"Building vocabulary with top {max_kmers} most frequent K-mers:")
print()

# Create K-mer to index mapping
kmer_to_idx = {}
for i, (kmer, count) in enumerate(most_common_kmers):
    kmer_to_idx[kmer] = i

print("Vocabulary (K-mer → Index):")
for i, (kmer, count) in enumerate(most_common_kmers):
    print(f"  {kmer} → {i:2d} (appears {count:2d} times)")

print()
print(f"Vocabulary size: {len(kmer_to_idx)}")
print(f"Total K-mers processed: {len(all_kmers)}")
print(f"Unique K-mers found: {len(kmer_counts)}")

3. K-MER VOCABULARY BUILDING
Building vocabulary with top 1000 most frequent K-mers:

Vocabulary (K-mer → Index):
  ATGGC →  0 (appears  5 times)
  ACAAC →  1 (appears  5 times)
  CGTCA →  2 (appears  4 times)
  ACCTT →  3 (appears  4 times)
  GGCAA →  4 (appears  4 times)
  AACAA →  5 (appears  4 times)
  TGGCC →  6 (appears  3 times)
  CAGTT →  7 (appears  3 times)
  GCTGG →  8 (appears  3 times)
  GTCAC →  9 (appears  3 times)
  CACCT → 10 (appears  3 times)
  CAACG → 11 (appears  3 times)
  CCTTC → 12 (appears  3 times)
  GAAAT → 13 (appears  3 times)
  TATGG → 14 (appears  3 times)
  GCAAC → 15 (appears  3 times)
  CAACA → 16 (appears  3 times)
  CCTAT → 17 (appears  3 times)
  TGGCG → 18 (appears  2 times)
  GTCAG → 19 (appears  2 times)
  TCAGG → 20 (appears  2 times)
  CAGGA → 21 (appears  2 times)
  AGGAC → 22 (appears  2 times)
  GGCCC → 23 (appears  2 times)
  CCAGT → 24 (appears  2 times)
  GACAA → 25 (appears  2 times)
  CAGAA → 26 (appears  2 times)
  TCGCT → 27 (appears 
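
The vocabulary size is bounded by both `max_kmers` and the number of distinct K-mers observed. For the two demonstration sequences, the counts reported earlier give 182 + 190 - 41 = 331 distinct K-mers, which is below the cap of 1000, so every observed K-mer receives an index. For DNA there are at most 4^k distinct K-mers (1024 for k=5). Note also that `Counter.most_common` orders equal counts by first occurrence, so index assignment depends on the order in which sequences were concatenated. The sketch below shows one way to make the ordering independent of input order by breaking ties alphabetically; `build_vocabulary` is a hypothetical helper introduced for illustration.

```python
from collections import Counter

def build_vocabulary(kmers, max_kmers=1000):
    """Map the most frequent K-mers to indices, breaking ties alphabetically."""
    counts = Counter(kmers)
    # Sort by descending count, then by K-mer string for a stable order
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {kmer: idx for idx, (kmer, _) in enumerate(ranked[:max_kmers])}

toy_kmers = ["TGGCC", "ATGGC", "ACAAC", "ATGGC", "ACAAC", "CGTCA"]
vocab = build_vocabulary(toy_kmers, max_kmers=3)
print(vocab)  # {'ACAAC': 0, 'ATGGC': 1, 'CGTCA': 2}

# Size check for the two demonstration sequences (inclusion-exclusion)
print(182 + 190 - 41)  # 331 distinct K-mers observed
print(4 ** 5)          # 1024 possible DNA 5-mers
```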

In [34]:
# Encode sequences as K-mer frequency vectors
def encode_sequence_kmers(sequence, kmer_to_idx, k=3):
    """Encode sequence using K-mer frequency representation"""
    kmers = extract_kmers(sequence, k)
    kmer_vector = [0] * len(kmer_to_idx)
    
    # Count K-mer occurrences
    kmer_counts = Counter(kmers)
    for kmer, count in kmer_counts.items():
        if kmer in kmer_to_idx:
            idx = kmer_to_idx[kmer]
            kmer_vector[idx] = count
    
    # Normalize by the number of K-mers extracted from the sequence
    if len(kmers) > 0:
        kmer_vector = [count / len(kmers) for count in kmer_vector]
    
    return kmer_vector

print("4. SEQUENCE ENCODING (K-mer Frequency Vectors)")
print("=" * 50)

# Encode both sequences
covid_vector = encode_sequence_kmers(covid_sequence, kmer_to_idx, k)
non_covid_vector = encode_sequence_kmers(non_covid_sequence, kmer_to_idx, k)

print("COVID sequence encoded as frequency vector:")
print("Index: K-mer  → Frequency")
for i, (kmer, _) in enumerate(most_common_kmers):
    freq = covid_vector[i]
    print(f"  {i:2d}: {kmer}  → {freq:.4f}")

print()
print("Non-COVID sequence encoded as frequency vector:")
print("Index: K-mer  → Frequency")
for i, (kmer, _) in enumerate(most_common_kmers):
    freq = non_covid_vector[i]
    print(f"  {i:2d}: {kmer}  → {freq:.4f}")

print()
print("Vector Properties:")
print(f"• Vector length: {len(covid_vector)}")
print(f"• Sum of COVID vector: {sum(covid_vector):.4f}")
print(f"• Sum of Non-COVID vector: {sum(non_covid_vector):.4f}")
print("• Frequencies are normalized by sequence length")

4. SEQUENCE ENCODING (K-mer Frequency Vectors)
COVID sequence encoded as frequency vector:
Index: K-mer  → Frequency
   0: ATGGC  → 0.0089
   1: ACAAC  → 0.0179
   2: CGTCA  → 0.0179
   3: ACCTT  → 0.0179
   4: GGCAA  → 0.0134
   5: AACAA  → 0.0089
   6: TGGCC  → 0.0045
   7: CAGTT  → 0.0045
   8: GCTGG  → 0.0089
   9: GTCAC  → 0.0134
  10: CACCT  → 0.0134
  11: CAACG  → 0.0134
  12: CCTTC  → 0.0134
  13: GAAAT  → 0.0045
  14: TATGG  → 0.0045
  15: GCAAC  → 0.0089
  16: CAACA  → 0.0089
  17: CCTAT  → 0.0000
  18: TGGCG  → 0.0045
  19: GTCAG  → 0.0089
  20: TCAGG  → 0.0089
  21: CAGGA  → 0.0089
  22: AGGAC  → 0.0089
  23: GGCCC  → 0.0045
  24: CCAGT  → 0.0045
  25: GACAA  → 0.0089
  26: CAGAA  → 0.0089
  27: TCGCT  → 0.0045
  28: CGCTG  → 0.0045
  29: ACACC  → 0.0089
  30: GCAGT  → 0.0045
  31: TCACC  → 0.0089
  32: CCTTG  → 0.0089
  33: CTTGA  → 0.0045
  34: TTGAA  → 0.0045
  35: TGAAG  → 0.0045
  36: GAAGG  → 0.0045
  37: AAGGC  → 0.0045
  38: GCTGT  → 0.0089
  39: CTGTC  → 0.0089
  4
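
Each entry in the vectors above is a K-mer count divided by the total number of K-mers extracted from that sequence (224 for the COVID example), so 0.0089 corresponds to 2/224 and 0.0179 to 4/224. When every K-mer of a sequence is in the vocabulary the vector sums to 1; any K-mer outside the vocabulary lowers that sum. The sketch below makes this explicit by also returning the out-of-vocabulary fraction; `encode_counts` and the toy vocabulary are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def encode_counts(kmers, kmer_to_idx):
    """Return (frequency vector, fraction of K-mers outside the vocabulary)."""
    vector = [0.0] * len(kmer_to_idx)
    total = len(kmers)
    if total == 0:
        return vector, 0.0
    missed = 0
    for kmer, count in Counter(kmers).items():
        idx = kmer_to_idx.get(kmer)
        if idx is None:
            missed += count          # K-mer not in the vocabulary
        else:
            vector[idx] = count / total
    return vector, missed / total

toy_vocab = {"ATGGC": 0, "ACAAC": 1}
toy_kmers = ["ATGGC", "ATGGC", "ACAAC", "TTTTT"]
vector, oov = encode_counts(toy_kmers, toy_vocab)
print(vector)  # [0.5, 0.25]
print(oov)     # 0.25

# Values from the COVID vector above, reproduced from raw counts
print(round(2 / 224, 4), round(4 / 224, 4))  # 0.0089 0.0179
```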

In [35]:
# Show final dataset format
print("5. FINAL DATASET FORMAT FOR BNN TRAINING")
print("=" * 45)

print("Training Sample Format:")
print("-" * 25)

# Simulate how data appears in the dataloader
samples = [
    (covid_vector, 0, "COVID"),
    (non_covid_vector, 1, "Non-COVID")
]

for i, (vector, label, class_name) in enumerate(samples):
    print(f"Sample {i+1} ({class_name}):")
    print(f"  Input vector shape: {len(vector)}")
    print(f"  Input vector (first 10): {vector[:10]}")
    print(f"  Label: {label}")
    print(f"  Class: {class_name}")
    print()

print("Summary of Preprocessing Pipeline:")
print("-" * 35)
print("1. Raw FASTA sequence (DNA/RNA nucleotides)")
print("2. Extract overlapping K-mers of length k=5")
print("3. Build vocabulary from most frequent K-mers")
print("4. Count K-mer frequencies in each sequence")
print("5. Normalize frequencies by sequence length")
print("6. Create fixed-length vectors for BNN input")
print()
print("Key Benefits:")
print("• Fixed input size regardless of sequence length")
print("• Captures local sequence patterns (K-mers)")
print("• Frequency encoding preserves importance")
print("• Suitable for binary neural network training")

5. FINAL DATASET FORMAT FOR BNN TRAINING
Training Sample Format:
-------------------------
Sample 1 (COVID):
  Input vector shape: 331
  Input vector (first 10): [0.008928571428571428, 0.017857142857142856, 0.017857142857142856, 0.017857142857142856, 0.013392857142857142, 0.008928571428571428, 0.004464285714285714, 0.004464285714285714, 0.008928571428571428, 0.013392857142857142]
  Label: 0
  Class: COVID

Sample 2 (Non-COVID):
  Input vector shape: 331
  Input vector (first 10): [0.014018691588785047, 0.004672897196261682, 0.0, 0.0, 0.004672897196261682, 0.009345794392523364, 0.009345794392523364, 0.009345794392523364, 0.004672897196261682, 0.0]
  Label: 1
  Class: Non-COVID

Summary of Preprocessing Pipeline:
-----------------------------------
1. Raw FASTA sequence (DNA/RNA nucleotides)
2. Extract overlapping K-mers of length k=5
3. Build vocabulary from most frequent K-mers
4. Count K-mer frequencies in each sequence
5. Normalize frequencies by sequence length
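6. Create fixed-length vectors for BNN input

Key Benefits:
• Fixed input size regardless of sequence length
• Captures local sequence patterns (K-mers)
• Frequency encoding preserves importance
• Suitable for binary neural network training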
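
The pipeline summarized above can be sketched end to end as a single function that turns labeled sequences into a feature matrix and a label list. This is a compact illustration under simplifying assumptions (sequences already cleaned and upper-cased, labels following the notebook's convention of 0 for COVID and 1 for Non-COVID); `kmers_of` and `build_dataset` are hypothetical names and not the training script's API.

```python
from collections import Counter

def kmers_of(seq, k):
    """All overlapping K-mers of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_dataset(labeled_sequences, k=5, max_kmers=1000):
    """Turn (sequence, label) pairs into fixed-length vectors X and labels y."""
    per_seq = [kmers_of(seq, k) for seq, _ in labeled_sequences]
    # Vocabulary: the max_kmers most frequent K-mers across all sequences
    counts = Counter(kmer for kmers in per_seq for kmer in kmers)
    vocab = {kmer: i for i, (kmer, _) in enumerate(counts.most_common(max_kmers))}
    X = []
    for kmers in per_seq:
        row = [0.0] * len(vocab)
        for kmer, count in Counter(kmers).items():
            if kmer in vocab:
                row[vocab[kmer]] = count / len(kmers)
        X.append(row)
    y = [label for _, label in labeled_sequences]
    return X, y, vocab

# Toy input: two short sequences with the notebook's label convention
toy = [("ATGGCGAACC", 0), ("ATGAAATACCTA", 1)]
X, y, vocab = build_dataset(toy, k=3)
print(len(X), len(X[0]) == len(X[1]) == len(vocab), y)
```

Every row has the same length as the vocabulary regardless of the length of the sequence it came from, which is what gives the model a fixed input size.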

In [36]:
# Simulate the binarization process in BNN
import torch
import torch.nn as nn

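def sign_function(x):
    """Binary activation: maps each value to +1 or -1 (0 is sent to +1)."""
    # torch.sign would return 0 for an input of exactly 0, which is not a binary value
    return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

def binarize_weights(weights):
    """Binarize weights to +1 or -1 (0 is sent to +1)."""
    return torch.where(weights >= 0, torch.ones_like(weights), -torch.ones_like(weights))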

print("6. BINARIZATION IN BINARY NEURAL NETWORK")
print("=" * 45)

# Convert our vectors to PyTorch tensors
covid_tensor = torch.tensor(covid_vector, dtype=torch.float32)
non_covid_tensor = torch.tensor(non_covid_vector, dtype=torch.float32)

print("Step 1: Input K-mer Frequency Vectors")
print("-" * 35)
print(f"COVID vector (first 10 values): {covid_tensor[:10].numpy()}")
print(f"Non-COVID vector (first 10 values): {non_covid_tensor[:10].numpy()}")
print()

# Simulate a simple binary layer transformation
print("Step 2: Binary Layer Processing")
print("-" * 30)

# Create a simple binary weight matrix (simulated)
input_size = len(covid_vector)
hidden_size = 8  # Small example
torch.manual_seed(42)  # For reproducible results

# Simulate binary weights (+1 or -1)
binary_weights = torch.randn(hidden_size, input_size)
binary_weights = binarize_weights(binary_weights)

print("Binary weights example (first layer, first neuron):")
print(f"Shape: {binary_weights.shape}")
print(f"First 10 weights: {binary_weights[0, :10].numpy()}")
print("Note: All weights are either +1 or -1")
print()

# Forward pass simulation
print("Step 3: Forward Pass Through Binary Layer")
print("-" * 40)

# Matrix multiplication with binary weights
covid_hidden = torch.matmul(binary_weights, covid_tensor)
non_covid_hidden = torch.matmul(binary_weights, non_covid_tensor)

print("Before binary activation:")
print(f"COVID hidden states: {covid_hidden.numpy()}")
print(f"Non-COVID hidden states: {non_covid_hidden.numpy()}")
print()

# Apply binary activation (sign function)
covid_binary = sign_function(covid_hidden)
non_covid_binary = sign_function(non_covid_hidden)

print("After binary activation (sign function):")
print(f"COVID binary states: {covid_binary.numpy()}")
print(f"Non-COVID binary states: {non_covid_binary.numpy()}")
print("Note: All activations are either +1 or -1")
print()

6. BINARIZATION IN BINARY NEURAL NETWORK
Step 1: Input K-mer Frequency Vectors
-----------------------------------
COVID vector (first 10 values): [0.00892857 0.01785714 0.01785714 0.01785714 0.01339286 0.00892857
 0.00446429 0.00446429 0.00892857 0.01339286]
Non-COVID vector (first 10 values): [0.01401869 0.0046729  0.         0.         0.0046729  0.00934579
 0.00934579 0.00934579 0.0046729  0.        ]

Step 2: Binary Layer Processing
------------------------------
Binary weights example (first layer, first neuron):
Shape: torch.Size([8, 331])
First 10 weights: [ 1.  1.  1. -1.  1. -1. -1. -1. -1.  1.]
Note: All weights are either +1 or -1

Step 3: Forward Pass Through Binary Layer
----------------------------------------
Before binary activation:
COVID hidden states: [ 9.82142687e-02 -7.14285821e-02  9.31322575e-09  6.25000000e-02
 -1.16071425e-01  6.25000000e-02  3.72529030e-09  9.82142910e-02]
Non-COVID hidden states: [ 0.03738317  0.05607476 -0.05607476 -0.0093458  -0.05607476 -
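
Two points about this step can be illustrated without PyTorch. First, a sign activation needs a convention for an input of exactly 0 if every output is to be +1 or -1. Pre-activations such as 9.31e-09 in the output above are not multiples of 1/224, which is consistent with rounding residue from a sum that is zero in exact arithmetic, so their sign carries little information. Second, with weights restricted to +1 and -1, a dot product reduces to adding and subtracting the inputs. The sketch below is pure Python; `binary_sign` and `binary_dot` are hypothetical helpers, and sending 0 to +1 is an assumed convention rather than something confirmed by the training script.

```python
def binary_sign(x):
    """Map a real value to +1 or -1, sending 0 to +1 (an assumed convention)."""
    return 1 if x >= 0 else -1

def binary_dot(weights, inputs):
    """Dot product with +/-1 weights: add where w = +1, subtract where w = -1."""
    total = 0.0
    for w, x in zip(weights, inputs):
        total += x if w > 0 else -x
    return total

inputs = [0.25, 0.5, 0.25]
# Binarize some real-valued weights to +1 / -1
weights = [binary_sign(w) for w in [0.7, -1.3, 0.2]]

# 0.25 - 0.5 + 0.25 is exactly 0.0, the case where the convention matters
pre_activation = binary_dot(weights, inputs)
print(weights, pre_activation, binary_sign(pre_activation))
```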