# Bioinformatics Hash Table Seeding and Search Toolkit

This notebook demonstrates various hash-based seeding and search strategies for DNA sequence analysis, including k-mer hashing, minimisers, spaced seeds, strobemers, and fuzzy (approximate) matching.

**Sections:**

1. Hash Table for k-mers (with chaining for collisions)
2. k-mer Hashing and Query
3. Minimisers
4. Spaced Seeds
5. Strobemers
6. Fuzzy Seeds (Allowing Mismatches)

## Input and Output

**Inputs:**
- `reference_sequence` (str): The DNA sequence (reference genome) to be indexed and searched.
- `query_sequence` (str): The DNA sequence (query/read) to search for in the reference.
- `k` (int): Length of k-mers or seeds.
- `w` (int): Window size for minimisers/strobemers.
- `s` (int): Number of minimisers to combine for strobemers.
- `step` (int): Step size for strobemers.
- `pattern` (str): Binary string (e.g., '1101') for spaced seeds.
- `max_mismatches` (int): Maximum allowed mismatches for fuzzy seeds.

**Outputs:**
- Each seeding/search technique reports where the query (or its seeds/variants) matches the indexed reference:
    - For k-mer search, a printed line per query k-mer stating whether it was found in the reference.
    - For minimisers and spaced seeds, a mapping of query positions to the hash table indices of the matching seeds.
    - For strobemers, a list of matches with query start, reference positions, and the strobemer sequence.
    - For fuzzy seeds, a sorted list of matching reference positions, with per-variant details printed.

## 1. Hash Table for k-mers

**Time Complexity:**
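- Insert: O(1) average, O(N) worst case (all N stored keys chained in one bucket)
- Search: O(1) average, O(N) worst case
- Both bounds exclude the O(k) cost of hashing a key, since the binary string for a k-mer is 2k bits long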

**Space Complexity:** O(N), where N is the number of unique k-mers stored
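
To make the average-case and worst-case bounds above concrete, here is a minimal sketch of chaining that uses Python lists as buckets rather than the linked nodes defined in the cell below. The helper names (`make_table`, `bucket_of`, `insert`, `lookup`) are illustrative assumptions and not part of the notebook's API.

```python
def make_table(size):
    """Create `size` empty buckets; each bucket is a list acting as a chain."""
    return [[] for _ in range(size)]

def bucket_of(key, size):
    """Hash a binary string by reading it as a base-2 integer, modulo the table size."""
    return int(key, 2) % size

def insert(table, key, position):
    """Append `position` to the entry for `key`, creating the entry if it is new."""
    bucket = table[bucket_of(key, len(table))]
    for stored_key, positions in bucket:
        if stored_key == key:
            positions.append(position)
            return
    bucket.append((key, [position]))

def lookup(table, key):
    """Return the positions stored for `key`, or an empty list if it is absent."""
    for stored_key, positions in table[bucket_of(key, len(table))]:
        if stored_key == key:
            return positions
    return []

table = make_table(4)
insert(table, "0001", 0)    # 1 % 4 == 1
insert(table, "0101", 7)    # 5 % 4 == 1, so this key shares a bucket with "0001"
insert(table, "0001", 12)   # existing key: only the position list grows

print(lookup(table, "0001"))
print(lookup(table, "0101"))
print(len(table[1]))        # number of distinct keys chained in bucket 1
```

With a table this small, collisions are frequent and every lookup scans the chain. Choosing a table size close to the number of distinct keys keeps chains short, which is where the O(1) average comes from.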

In [None]:
class HashNode:
    """
    Node for the hash table (for chaining in case of collisions).
    Stores the binary k-mer string, list of positions, and pointer to next node.
    
    Input:
        key (str): binary k-mer string
        position (int): position in the reference
    Output:
        HashNode object
    """
    def __init__(self, key, position):
        self.key = key  # binary k-mer string
        self.positions = [position]  # list of positions in the reference
        self.next = None  # for chaining if collision occurs

def binary_to_decimal(binary):
    """
    Converts a binary string to its decimal equivalent.

    Input:
        binary (str): A string representing a binary number
    Output:
        int: The decimal representation of the binary number (0 for an empty string)
    """
    # int(..., 2) parses base-2 directly; guard the empty string, which int() rejects
    return int(binary, 2) if binary else 0

class HashTable:
    """
    Implements a hash table with chaining for collisions.
    
    Input:
        size (int): Size of the hash table
    Output:
        HashTable object
    """
    def __init__(self, size):
        self.size = size
        self.table = [None] * size

    def hash(self, key):
        """
        Computes the hash value for a given key (binary k-mer string).
        
        Input:
            key (str): The key to be hashed
        Output:
            int: The hash value (decimal representation of the binary string, modulo table size)
        """
        return binary_to_decimal(key) % self.size

    def insert(self, key, position):
        """
        Inserts a k-mer at a position, handling collisions by chaining.
        
        Input:
            key (str): binary k-mer string
            position (int): position in the reference sequence
        Output:
            None (modifies hash_table in-place)
        """
        index = self.hash(key)
        if self.table[index] is None:
            self.table[index] = HashNode(key, position)
        else:
            node = self.table[index]
            prev = None
            while node is not None:
                if node.key == key:
                    node.positions.append(position)
                    return
                prev = node
                node = node.next
            prev.next = HashNode(key, position)

    def search(self, key):
        """
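        Searches for a key and returns the index of the bucket whose chain contains it, otherwise -1.
        Callers that need the stored positions must walk the chain at that index to the node with a matching key.

        Input:
            key (str): binary k-mer string
        Output:
            int: The bucket index in the hash table if key is found, otherwise -1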
        """
        index = self.hash(key)
        node = self.table[index]
        while node is not None:
            if node.key == key:
                return index
            node = node.next
        return -1

    def print_table(self):
        """
        Prints the contents of the hash table.
        Input: None
        Output: None (prints to stdout)
        """
        for index, node in enumerate(self.table):
            print(f'Index {index}: ', end='')
            while node is not None:
                print(f'({node.key}) -> {node.positions} -> ', end='')
                node = node.next
            print("None")

## 2. k-mer Hashing and Query

**Time Complexity:**
- k-mer hashing: O(n·k), where n is the length of the reference (each of the n − k + 1 windows is sliced and encoded in O(k))
- Query: O(m·k) on average, where m is the length of the query

**Space Complexity:** O(n)
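
The index-then-query flow can be sketched compactly with a built-in dictionary in place of the custom hash table. This is an illustration under simplifying assumptions: the k-mers are used directly as keys without the 2-bit encoding, and `build_kmer_index` and `find_seed_hits` are hypothetical names. The sequences are the ones used in the example at the end of the notebook.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every overlapping k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def find_seed_hits(query, k, index):
    """Map each query position whose k-mer occurs in the index to the reference positions."""
    hits = {}
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        if kmer in index:          # membership test does not create a new entry
            hits[i] = index[kmer]
    return hits

reference = "CTGACTGAGCTACTATCCTGA"
query = "ATGCTACT"

index = build_kmer_index(reference, 4)
hits = find_seed_hits(query, 4, index)

print(index["CTGA"])  # a repeated k-mer keeps all of its positions
print(hits)           # query offsets 2, 3, 4 hit reference offsets 8, 9, 10
```

The three hits lie on one diagonal (reference position minus query position is 6 in each case), which is the kind of evidence a seed-and-extend aligner would use to anchor an alignment.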

In [None]:
# 2-bit encoding of the four DNA bases
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}

def genome_to_binary(seq):
    """
    Converts a DNA k-mer to a binary string using: A=00, C=01, G=10, T=11

    Characters other than uppercase A, C, G, T (for example 'N') are skipped,
    so the result can be shorter than 2 * len(seq).

    Input:
        seq (str): DNA k-mer
    Output:
        str: binary string
    """
    return "".join(BASE_TO_BITS.get(base, "") for base in seq)

def k_mer(k, reference_sequence, hash_table):
    """
    Slides a window of length k over the reference, converts each k-mer to binary, and inserts into the hash table.
    
    Input:
        k (int): Length of k-mer.
        reference_sequence (str): Reference DNA sequence.
        hash_table (HashTable): Hash table object to store k-mers.
    Output:
        None (modifies hash_table in-place)
    """
    i = 0
    while i <= len(reference_sequence) - k:
        current = reference_sequence[i : i + k]
        binary = genome_to_binary(current)
        hash_table.insert(binary, i)
        i += 1  # For overlapping k-mers

def query_search_kmer(k, query_sequence, hash_table):
    """
    Slides a window of length k over the query, converts each k-mer to binary, and searches in the hash table.

    Input:
        k (int): Length of k-mer.
        query_sequence (str): Query DNA sequence.
        hash_table (HashTable): Hash table object containing reference k-mers.
    Output:
        None. Prints, for each query k-mer, the hash table index and the reference
        positions where it is found, or a NOT found message.
    """
    i = 0
    while i <= len(query_sequence) - k:
        seed = query_sequence[i : i + k]
        binary = genome_to_binary(seed)
        index = hash_table.search(binary)
        if index != -1:
            # search() returns the bucket index; walk the chain to the node holding this key
            node = hash_table.table[index]
            while node.key != binary:
                node = node.next
            print(f"Seed '{seed}' found at hash table index {index}, reference positions {node.positions}")
        else:
            print(f"Seed '{seed}' NOT found in reference.")
        i += 1

## 3. Minimisers

**Time Complexity:** O(nw), where n is the length of the reference and w is the window size

**Space Complexity:** O(n)
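
As a small worked illustration of window-based selection, the sketch below picks the smallest of every `w` consecutive k-mers and records where it starts. It assumes lexicographic order on the DNA letters, which agrees with the 2-bit encoding A=00, C=01, G=10, T=11, and breaks ties by leftmost position. `minimisers_with_positions` is a hypothetical helper, not a function from this notebook.

```python
def minimisers_with_positions(seq, w, k):
    """Return the distinct (start position, k-mer) minimisers of `seq`.

    A window covers `w` consecutive k-mers. With leftmost tie-breaking the
    chosen position never moves backwards, so comparing against the most
    recent pick is enough to drop repeats from overlapping windows.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Smallest k-mer in the window; ties go to the smaller offset
        offset = min(range(w), key=lambda j: (window[j], j))
        pick = (start + offset, window[offset])
        if not selected or selected[-1] != pick:
            selected.append(pick)
    return selected

result = minimisers_with_positions("CTGACTGA", w=3, k=2)
print(result)
```

For `CTGACTGA` the 2-mers are CT, TG, GA, AC, CT, TG, GA. The five windows pick CT (position 0), then AC (position 3) three times in a row, then CT (position 4), so only three seeds are stored instead of seven k-mers.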

In [None]:
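def minimisers(reference_sequence, w, k, hash_table):
    """
    For each window of w consecutive k-mers over the reference, finds the k-mer with the smallest
    binary value (the minimiser), inserts it at its own start position, and returns the list of minimisers.

    Input:
        reference_sequence (str): Reference DNA sequence.
        w (int): Window size (number of consecutive k-mers per window).
        k (int): k-mer length.
        hash_table (HashTable): Hash table to store minimisers.
    Output:
        minimisers_list (list): List of minimiser binary strings, one per window.
    """
    window = []  # (decimal value, start position, binary string) for the current window
    minimisers_list = []
    last_inserted = None
    i = 0
    while i <= len(reference_sequence) - k:
        current = reference_sequence[i : i + k]
        binary = genome_to_binary(current)
        decimal = binary_to_decimal(binary)
        window.append((decimal, i, binary))
        if len(window) == w:
            # Tuple ordering picks the smallest value, breaking ties by leftmost position
            min_decimal, min_position, min_binary = min(window)
            if min_position != last_inserted:
                # Overlapping windows often share a minimiser; store each occurrence once
                hash_table.insert(min_binary, min_position)
                last_inserted = min_position
            minimisers_list.append(min_binary)
            window.pop(0)  # Slide the window
        i += 1
    return minimisers_list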

def query_search_minimisers(k, query_sequence, w, hash_table, reference_sequence):
    """
    For each window in the query, finds the minimiser and searches for it in the hash table.

    Input:
        k (int): k-mer length.
        query_sequence (str): Query DNA sequence.
        w (int): Window size.
        hash_table (HashTable): Hash table containing reference minimisers.
        reference_sequence (str): Reference DNA sequence (unused; kept so existing calls still work).
    Output:
        matches (dict): Mapping of query positions (start of each matched minimiser) to hash table indices.
    """
    window = []  # (decimal value, start position, binary string) for the current window
    matches = {}
    i = 0
    # Slide over the query, not the reference
    while i <= len(query_sequence) - k:
        current = query_sequence[i : i + k]
        binary = genome_to_binary(current)
        decimal = binary_to_decimal(binary)
        window.append((decimal, i, binary))
        if len(window) == w:
            min_decimal, min_position, min_binary = min(window)
            index = hash_table.search(min_binary)
            if index != -1:
                matches[min_position] = index
            window.pop(0)  # Slide the window
        i += 1
    return matches

## 4. Spaced Seeds

**Time Complexity:** O(nk), where n is the length of the reference and k is the pattern length

**Space Complexity:** O(n)
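
The effect of the pattern is easiest to see on a single window. The sketch below keeps only the bases under a `'1'` and shows that two windows differing only at a `'0'` (don't-care) position reduce to the same seed. `apply_pattern` is a hypothetical helper that isolates the masking step.

```python
def apply_pattern(window, pattern):
    """Keep the bases of `window` that sit under a '1' in `pattern`."""
    if len(window) != len(pattern):
        raise ValueError("window and pattern must have the same length")
    return "".join(base for base, bit in zip(window, pattern) if bit == "1")

pattern = "1101"
print(apply_pattern("ACGT", pattern))  # third base is ignored
print(apply_pattern("ACTT", pattern))  # differs from ACGT only at the ignored position
print(apply_pattern("AGGT", pattern))  # differs at a position the pattern keeps
```

The first two windows produce the same seed `ACT`, so a substitution at the third base does not prevent a match. The third window produces `AGT`, because its difference falls on a position the pattern retains.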

In [None]:
def spaced_seeds(reference_sequence, pattern, hash_table):
    """
    For each window, only keeps bases where pattern is '1', converts to binary, and inserts.
    
    Input:
        reference_sequence (str): Reference DNA sequence.
        pattern (str): Binary string for spaced seed pattern (e.g., '1101').
        hash_table (HashTable): Hash table to store spaced seeds.
    Output:
        None (modifies hash_table in-place)
    """
    k = len(pattern)
    i = 0
    while i <= len(reference_sequence) - k:
        seed = reference_sequence[i:i+k]
        new_seed = ""
        for j in range(k):
            if pattern[j] == '1':
                new_seed += seed[j]
        if len(new_seed) > 0:
            binary = genome_to_binary(new_seed)
            hash_table.insert(binary, i)
        i += 1

def search_spaced_seeds(sequence, pattern, hash_table):
    """
    For each window in the query, applies the pattern, converts to binary, and searches in the hash table.
    
    Input:
        sequence (str): Query DNA sequence.
        pattern (str): Binary string for spaced seed pattern.
        hash_table (HashTable): Hash table containing reference spaced seeds.
    Output:
        matches (dict): Mapping of query positions to hash table indices where spaced seeds are found.
    """
    k = len(pattern)
    matches = {}
    for i in range(len(sequence) - k + 1):
        seed = sequence[i:i + k]
        new_seed = ""
        for j in range(k):
            if pattern[j] == '1':
                new_seed += seed[j]
        if len(new_seed) > 0:
            binary = genome_to_binary(new_seed)
            positions = hash_table.search(binary)
            if positions != -1:
                matches[i] = positions
    return matches

## 5. Strobemers

**Time Complexity:** O(nws), where n is the length of the reference, w is window size, s is strobemer size

**Space Complexity:** O(n)
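
The linking step used in this section (take `s` minimisers spaced `step` apart in the minimiser list and concatenate them) can be shown in isolation. The sketch below works on plain strings rather than binary-encoded minimisers, and `link_seeds` is a hypothetical name for that step only.

```python
def link_seeds(seeds, s, step):
    """Concatenate `s` seeds taken every `step` entries, starting at each valid index."""
    linked = []
    # The last valid start index leaves room for the remaining s - 1 seeds
    for i in range(len(seeds) - (s - 1) * step):
        linked.append("".join(seeds[i + j * step] for j in range(s)))
    return linked

seeds = ["AC", "CT", "GA", "TT"]
print(link_seeds(seeds, s=2, step=2))  # pairs seeds 0+2 and 1+3
print(link_seeds(seeds, s=2, step=1))  # pairs adjacent seeds
```

Each linked seed has length `s` times the seed length. For binary-encoded k-mers at 2 bits per base, that is `2 * k * s` characters.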

In [None]:
def strobemers(reference_sequence, w, k, s, step, hash_table):
    """
    Combines s minimisers (with a step) to form a strobemer and inserts into the hash table.
    Each strobemer is stored under the index of its first minimiser in the minimiser list.
    Note that the call to minimisers() also inserts the individual minimisers into hash_table.

    Input:
        reference_sequence (str): Reference DNA sequence.
        w (int): Window size for minimisers.
        k (int): k-mer length.
        s (int): Number of minimisers to combine.
        step (int): Step size for strobemers.
        hash_table (HashTable): Hash table to store strobemers.
    Output:
        strobemers_list (list): List of strobemer binary strings.
    """
    minimiser_list = minimisers(reference_sequence, w, k, hash_table)
    strobemers_list = []
    for i in range(len(minimiser_list) - (s-1)*step):
        strober_seq = ""
        positions = []
        for j in range(s):
            if i + j*step >= len(minimiser_list):
                break
            strober_seq += minimiser_list[i + j*step]
            positions.append(i + j*step)
        # Each base is encoded as 2 bits, so s minimisers of length k give 2*k*s characters
        if len(strober_seq) == 2*k*s:
            hash_table.insert(strober_seq, positions[0])
            strobemers_list.append(strober_seq)
    return strobemers_list

def query_search_strobemers(query_sequence, w, k, s, step, hash_table):
    """
    Forms strobemers from the query and searches for them in the hash table.

    Input:
        query_sequence (str): Query DNA sequence.
        w (int): Window size for minimisers.
        k (int): k-mer length.
        s (int): Number of minimisers to combine.
        step (int): Step size for strobemers.
        hash_table (HashTable): Hash table containing reference strobemers.
    Output:
        matches (list): List of dicts with query start, reference positions, and strobemer sequence.
    """
    matches = []
    # Collect query minimisers in a scratch table so the reference index is not modified
    query_minimisers = minimisers(query_sequence, w, k, HashTable(hash_table.size))
    for i in range(len(query_minimisers) - (s-1)*step):
        strober_seq = ""
        for j in range(s):
            if i + j*step >= len(query_minimisers):
                break
            strober_seq += query_minimisers[i + j*step]
        # Each base is encoded as 2 bits, so s minimisers of length k give 2*k*s characters
        if len(strober_seq) == 2*k*s:
            index = hash_table.search(strober_seq)
            if index != -1:
                # The bucket head may hold a different key; walk the chain to the matching node
                node = hash_table.table[index]
                while node.key != strober_seq:
                    node = node.next
                matches.append({
                    'query_start': i,
                    'reference_positions': node.positions,
                    'sequence': strober_seq
                })
    return matches

## 6. Fuzzy Seeds (Allowing Mismatches)

**Time Complexity:** O(m * k^max_mismatches * 4^max_mismatches), where m is query length

**Space Complexity:** O(m)
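
The cost above is driven by how many variants each k-mer expands into. The sketch below enumerates every string within a given Hamming distance of a k-mer, using the same `combinations` and `product` approach as the search function, and collects the variants in a set so that duplicates are counted once. `hamming_neighbours` is a hypothetical helper.

```python
from itertools import combinations, product

def hamming_neighbours(kmer, max_mismatches):
    """Return the set of DNA strings that differ from `kmer` at up to `max_mismatches` positions."""
    variants = set()
    for m in range(max_mismatches + 1):
        for positions in combinations(range(len(kmer)), m):
            for substitutions in product("ACGT", repeat=m):
                variant = list(kmer)
                for pos, base in zip(positions, substitutions):
                    variant[pos] = base
                # Substituting a base with itself recreates a closer variant; the set removes it
                variants.add("".join(variant))
    return variants

one_off = hamming_neighbours("ACGT", 1)
two_off = hamming_neighbours("ACGT", 2)
print(len(one_off))  # 1 exact + 4 positions * 3 alternative bases
print(len(two_off))  # adds C(4, 2) * 3 * 3 two-mismatch variants
```

The number of distinct variants is the sum over m of C(k, m) · 3^m, which for k = 4 gives 13 with one mismatch and 67 with two. The loop itself visits C(k, m) · 4^m combinations, so the bound stated above is an upper bound on the lookups performed.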

In [None]:
from itertools import combinations, product

def fuzzy_seeds_search(query_sequence, k, max_mismatches, hash_table):
    """
    For each k-mer in the query, generates all variants with up to max_mismatches mismatches,
    and searches for them in the hash table.
    
    Input:
        query_sequence (str): Query DNA sequence.
        k (int): k-mer length.
        max_mismatches (int): Maximum allowed mismatches.
        hash_table (HashTable): Hash table containing reference k-mers.
    Output:
        all_matches (list): Sorted list of reference positions where matches were found.
        (Also prints details of matches to stdout.)
    """
    all_matches = set()
    for i in range(len(query_sequence) - k + 1):
        query_kmer = query_sequence[i:i + k]
        printed_header = False
        for m in range(0, max_mismatches + 1):
            for positions in combinations(range(k), m):
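                for substitutions in product('ACGT', repeat=m):
                    # Skip substitutions that leave a base unchanged: they repeat a variant
                    # already covered at a lower mismatch count and would be mislabelled
                    if any(query_kmer[pos] == base for pos, base in zip(positions, substitutions)):
                        continue
                    variant = list(query_kmer)
                    for pos, base in zip(positions, substitutions):
                        variant[pos] = base
                    variant = ''.join(variant)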
                    variant_binary = genome_to_binary(variant)
                    index = hash_table.search(variant_binary)
                    if index != -1:
                        node = hash_table.table[index]
                        while node:
                            if node.key == variant_binary:
                                if not printed_header:
                                    print(f"\nSeed '{query_kmer}' (max {max_mismatches} mismatches):")
                                    printed_header = True
                                print(f"  {m}-mismatch: '{variant}' → positions {node.positions}")
                                all_matches.update(node.positions)
                                break
                            node = node.next
        if not printed_header:
            print(f"\nSeed '{query_kmer}' → NO matches (with {max_mismatches} mismatches allowed)")
    return sorted(all_matches)

## Example Usage

In [None]:
# Example reference and query
reference_sequence = "CTGACTGAGCTACTATCCTGA"
query_sequence = "ATGCTACT"
k = 4
table_size = 4 ** k  # One slot per possible 4-mer, so 4-mers never collide

# Each technique gets its own table: seeds of different kinds can share a
# binary key (a strobemer of two 2-mers is 8 bits, like a 4-mer) and would
# otherwise produce false matches.

# 1. k-mer hashing
print("=== k-mer Hashing ===")
kmer_table = HashTable(table_size)
k_mer(k, reference_sequence, kmer_table)
kmer_table.print_table()
query_search_kmer(k, query_sequence, kmer_table)

# 2. Minimisers
print("\n=== Minimisers ===")
minimiser_table = HashTable(table_size)
minimisers(reference_sequence, 2, 2, minimiser_table)
minimiser_table.print_table()
matches = query_search_minimisers(2, query_sequence, 2, minimiser_table, reference_sequence)
for q_pos, table_index in matches.items():
    print(f"Query position {q_pos} matched hash table index {table_index}")

# 3. Spaced Seeds
print("\n=== Spaced Seeds ===")
pattern = '1101'
spaced_table = HashTable(table_size)
spaced_seeds(reference_sequence, pattern, spaced_table)
spaced_table.print_table()
matches = search_spaced_seeds(query_sequence, pattern, spaced_table)
for q_pos, table_index in matches.items():
    print(f"Query position {q_pos} matched hash table index {table_index}")

# 4. Strobemers
print("\n=== Strobemers ===")
strobemer_table = HashTable(table_size)
strobemers(reference_sequence, 2, 2, 2, 1, strobemer_table)
strobemer_table.print_table()
matches = query_search_strobemers(query_sequence, 2, 2, 2, 1, strobemer_table)
for match in matches:
    print(match)

# 5. Fuzzy Seeds (searches the k-mer table built in step 1)
print("\n=== Fuzzy Seeds ===")
fuzzy_seeds_search(query_sequence, 4, 1, kmer_table)
