# Motif Finding
## Exhaustive Algorithm

### High-Level Description (Conceptual)

**Goal**: Find the most representative common subsequence (motif) of fixed length across multiple DNA sequences.

**Approach**:
Use **exhaustive enumeration** to test all possible combinations of starting positions (offsets) for motifs in each sequence.

1. **Generate all possible combinations** of motif start positions for each sequence.
2. For each combination:

   * Extract motif substrings from each sequence based on the offsets.
   * Compute a **score**: for each column, take the count of the most frequent character, then sum these maxima over all columns.
3. Return the combination of offsets with the highest score.
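
The steps above can be sketched compactly with `itertools.product`, which yields every tuple of offsets. This is only a minimal illustration, not the implementation used in this notebook; the names `column_score` and `exhaustive_motif` are hypothetical and were chosen for this example.

```python
from itertools import product

def column_score(snippets):
    # Sum, over the columns, of the count of the most frequent character.
    return sum(max(col.count(c) for c in set(col)) for col in zip(*snippets))

def exhaustive_motif(seqs, motif_len):
    # Valid start positions for each sequence: 0 .. len(seq) - motif_len.
    ranges = [range(len(s) - motif_len + 1) for s in seqs]
    best_offsets, best_score = None, -1
    for offsets in product(*ranges):
        snippets = [s[o:o + motif_len] for s, o in zip(seqs, offsets)]
        current = column_score(snippets)
        if current > best_score:  # keep the first strictly better combination
            best_offsets, best_score = offsets, current
    return best_offsets, best_score

seqs = "ATGGTCGC TTGTCTGA CCGTAGTA".split()
result = exhaustive_motif(seqs, 3)
print(result)  # ((3, 2, 2), 8)
```

With `n` sequences and `k` valid offsets each, the loop visits `k ** n` combinations, which is why this approach only suits small inputs.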

---

### Low-Level Description (Implementation)

**`enumerar` function** (with its helper `proximo`):

* Generates all possible combinations of motif start positions (offsets).
* Each offset ranges from `0` to `seq_length - motif_length`, inclusive.
* `proximo` implements a counter-like iteration (base-n counting) with carry-over and returns `None` once every combination has been produced.
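
To make the counter-like iteration concrete, the sketch below advances a two-position counter with three valid values per position and records the order of the generated combinations. The name `next_offsets` is hypothetical; it only mirrors the carry-over idea described above.

```python
def next_offsets(offsets, limit):
    """Advance `offsets` in place like an odometer; return None on overflow."""
    for i in range(len(offsets)):
        offsets[i] += 1
        if offsets[i] < limit:
            return offsets
        offsets[i] = 0  # carry into the next position
    return None

state = [0, 0]
order = []
while state is not None:
    order.append(tuple(state))
    state = next_offsets(state, 3)

print(order)
# [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (0, 2), (1, 2), (2, 2)]
```

The first position changes fastest, so it behaves like the least significant digit of a base-3 number.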

**`score` function**:

* Given a combination of offsets, extracts the corresponding motif snippets from each sequence.
* For each position in the motif (column), counts the frequency of each character.
* Sums the maximum frequency per column to compute the total score.
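
As a small worked example of this scoring rule, the sketch below scores three aligned snippets column by column. It uses `collections.Counter` instead of repeated `count` calls; the helper name `column_scores` is an assumption made for this illustration.

```python
from collections import Counter

def column_scores(snippets):
    # One value per column: the count of its most common character.
    return [Counter(col).most_common(1)[0][1] for col in zip(*snippets)]

snippets = ["GTC", "GTC", "GTA"]
per_column = column_scores(snippets)
total = sum(per_column)
print(per_column, total)  # [3, 3, 2] 8
```

The first two columns agree in all three snippets, while the last column has two `C` and one `A`, giving 3 + 3 + 2 = 8.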

**`motif` function**:

* Iterates through all possible offset combinations using `enumerar`.
* Uses `score` to evaluate each combination.
* Tracks and returns the combination with the highest score found.


In [7]:
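def proximo(offset, num_seqs, limite):
    """Advance `offset` in place to the next combination; return None when exhausted."""
    pos = 0

    while pos < num_seqs:
        offset[pos] += 1

        # No overflow at this position: the new combination is ready
        if offset[pos] < limite:
            return offset

        # Overflow: reset this position and carry into the next one
        offset[pos] = 0
        pos += 1

    # Every position overflowed, so all combinations have been visited
    return None

def enumerar(num_seqs, tam_seqs, tam_motifs):
    """Yield every combination of motif start positions as a tuple."""
    estado = [0] * num_seqs
    limite = tam_seqs - tam_motifs + 1  # number of valid start positions per sequence

    while estado is not None:
        yield tuple(estado)
        estado = proximo(estado, num_seqs, limite)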

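def score(seqs, offset, tam_motif):
    """Sum, over the motif columns, of the count of the most frequent character."""
    # Cut the candidate motif out of each sequence at its offset
    snips = [s[p: p + tam_motif] for p, s in zip(offset, seqs)]
    return sum(max(col.count(x) for x in set(col)) for col in zip(*snips))

def motif(seqs, num_seqs, tam_seq, tam_motif):
    """Exhaustively search for the offsets with the highest score."""
    # Ensure all sequences are of the expected length
    assert all(len(s) == tam_seq for s in seqs)

    melhor_pos = None
    maior_s = 0
    for estado in enumerar(num_seqs, tam_seq, tam_motif):
        atual = score(seqs, estado, tam_motif)
        # Keep the first combination that strictly improves the best score
        if atual > maior_s:
            melhor_pos = estado
            maior_s = atual

    return (melhor_pos, maior_s)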

In [8]:
seqs = "ATGGTCGC TTGTCTGA CCGTAGTA".split()
motif(seqs, 3, 8, 3)

((3, 2, 2), 8)

In [9]:
seqs = "ATGGTCGC TTGTCTGA CCGTAGTA ATGCTAGC ATGGGTAG AGTAGCGC GGTAGATG TATATAAG".split()
motif(seqs, 8, 8, 3)

((1, 0, 3, 4, 5, 2, 2, 0), 21)

## Branch and Bound
### High-Level Description (Conceptual)

**Goal**: Identify the most representative fixed-length motif across multiple DNA sequences more efficiently than exhaustive search.

**Approach**:
Use a **Branch and Bound** strategy to prune the search space:

1. **Systematically explore** all possible combinations of motif start positions.
2. At each step (branch), compute a **partial score** using the current offsets.
3. Use a **bounding function** to estimate the maximum possible score from the current state.
4. **Prune branches** whose upper bound is no better than the best score found so far.
5. Keep track of the best scoring configuration throughout the process.

---

### Low-Level Description (Implementation)

**`score_bb` function**:

* Given a set of motif start positions (`offset`), extracts the motif substrings from each sequence.
* Computes the **score** as the sum of the highest character frequency per column.

**`branch_and_bound` function**:

* A recursive function that builds offset combinations one level at a time.
* **Base Case**: If all offsets are selected, compute the score and update the best solution if it’s an improvement.
* **Branching**: At each level, test each possible offset for the current sequence.
* **Bounding**: For each partial configuration:

  * Compute the current partial score.
  * Estimate the **upper bound** (max possible additional score assuming perfect alignment for remaining sequences).
  * Only continue if this upper bound is better than the best score found so far.
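
The bounding step can be illustrated with a little arithmetic. Each sequence that has not been placed yet can raise the score by at most one per column, so at most `motif_len` in total. The helper `upper_bound` below is a hypothetical name used only for this sketch, and the numbers are illustrative rather than taken from a real run.

```python
def upper_bound(partial_score, num_seqs, level, motif_len):
    # Sequences 0..level are already placed; each remaining sequence can add
    # at most one match per column, i.e. at most motif_len in total.
    remaining = num_seqs - level - 1
    return partial_score + remaining * motif_len

# Two of three sequences placed (level = 1), motif length 3, partial score 4.
bound = upper_bound(4, num_seqs=3, level=1, motif_len=3)
best_so_far = 8
should_explore = bound > best_so_far

print(bound, should_explore)  # 7 False
```

Because the bound of 7 cannot beat a best score of 8, the whole subtree below this partial configuration can be skipped without losing the optimum.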

**`motif_bb` function**:

* Initializes the global state to track the best score and offsets.
* Defines the valid range of offsets.
* Starts the recursive Branch and Bound process.
* Returns the best motif configuration found and its score.


In [10]:
def score_bb(seqs, offset, tam_motif):
    # Extract motifs from sequences based on the given offsets
    snips = [s[p: p + tam_motif] for p, s in zip(offset, seqs)]
    
    # Calculate the score: sum of the highest frequency of any nucleotide/character in each column
    return sum(max(col.count(x) for x in set(col)) for col in zip(*snips))

def branch_and_bound(offset, num_seqs, limite, tam_motifs, estado_global, seqs, level=0):
    """ Branch and Bound implementation to find the best motif alignment """
    
    # Base case: if all offsets have been filled
    if level == num_seqs:
        # Calculate the score for the current motif alignment
        atual = score_bb(seqs, offset, tam_motifs)

        # If the current score is better than the global best, update the best score and positions
        if atual > estado_global["maior_s"]:
            estado_global["maior_s"] = atual
            estado_global["melhor_pos"] = offset[:]

        # Always stop here: there is no sequence left to branch on, and falling
        # through to the loop below would index past the end of `offset`
        return

    # Branching: Generate candidates for the current level
    for pos in range(limite):
        offset[level] = pos  # Set the current offset for this sequence

        # Calculate the partial score with the new offset configuration
        atual = score_bb(seqs, offset[:level+1], tam_motifs)

        # Bounding: each remaining sequence can add at most tam_motifs to the score
        limite_superior = atual + (num_seqs - level - 1) * tam_motifs

        # If the upper bound of this branch is greater than the current best score, recurse further
        if limite_superior > estado_global["maior_s"]:
            branch_and_bound(offset, num_seqs, limite, tam_motifs, estado_global, seqs, level+1)

def motif_bb(seqs, num_seqs, tam_seq, tam_motif):
    """ Function to start the Branch and Bound search for the best motif alignment """

    # Ensure all sequences are of the expected length
    assert all(len(s) == tam_seq for s in seqs)

    # Initialize the global state: store the best score and positions
    estado_global = {"maior_s": 0, "melhor_pos": None}

    # Initialize the list of offsets (start at position 0 for each sequence)
    offset = [0] * num_seqs

    # Calculate the limit for valid offset values (based on sequence length and motif length)
    limite = tam_seq - tam_motif + 1

    # Start the Branch and Bound process
    branch_and_bound(offset, num_seqs, limite, tam_motif, estado_global, seqs)

    # Return the best motif positions and the corresponding score found
    return estado_global["melhor_pos"], estado_global["maior_s"]

In [11]:
seqs = "ATGGTCGC TTGTCTGA CCGTAGTA".split()
motif_bb(seqs, 3, 8, 3)

([3, 2, 2], 8)

In [None]:
seqs = """cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc""".splitlines()
motif_bb(seqs, 5, len(seqs[0]), 8)

([25, 20, 2, 55, 59], 40)

## Gibbs Sampling
### High-Level Description (Conceptual)

**Goal**: Identify a shared fixed-length motif across multiple DNA sequences using a **stochastic heuristic** rather than exhaustive search.

**Approach**:
Use the **Gibbs Sampling** algorithm, a probabilistic iterative method that refines motif positions based on a position weight matrix (PWM):

1. **Initialize**: Randomly select a start position for the motif in each sequence.
2. Iteratively:

   * **Remove one sequence** temporarily.
   * Build a PWM from motifs in the remaining sequences.
   * **Sample a new position** in the removed sequence, proportional to how well its substrings match the PWM.
   * **Update** the positions and compute a score.
3. Keep track of the best-scoring motif configuration throughout.
4. Stop when either the maximum number of iterations is reached or the score has not improved for a set number of consecutive iterations (the convergence threshold).

---

### Low-Level Description (Implementation)

**`random_init_positions`**:

* Randomly selects a starting index for the motif in each sequence.

**`create_motifs`**:

* Extracts motif substrings based on current positions.

**`pwm`**:

* Builds a **Position Weight Matrix** from a list of motifs using Laplace smoothing (pseudocounts).
* For each column (motif position), calculates the relative frequency of each nucleotide (A, T, C, G).
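
A small worked example may help show the effect of the pseudocounts. The sketch below builds a PWM from two motifs with a pseudocount of 1; `build_pwm` is a hypothetical standalone function written for this illustration.

```python
def build_pwm(motifs, pseudo=1, bases="ATCG"):
    pwm = []
    for column in zip(*motifs):
        # Add the pseudocount to every base so no probability is zero.
        counts = {b: column.count(b) + pseudo for b in bases}
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in bases})
    return pwm

pwm = build_pwm(["GTC", "GTA"])

# Column 0 holds G twice: P(G) = (2 + 1) / (2 + 4) = 0.5, every other base 1/6.
print(pwm[0]["G"], pwm[0]["A"])
```

Even bases that never appear in a column keep a small non-zero probability, which prevents a single mismatch from driving the probability of a whole candidate motif to zero.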

**`prob_seq`**:

* Computes the probability of a given motif based on the PWM by multiplying base-wise probabilities.

**`prob_positions`**:

* Computes the probability distribution over all possible start positions in a sequence based on the current PWM.
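
The two probability steps above can be sketched together as follows. The PWM in this example is made up for illustration (it strongly favours the motif `GT`), and the function names `window_probability` and `position_distribution` are assumptions, not names from the notebook.

```python
from math import prod

def window_probability(window, pwm):
    # Multiply the per-column probabilities of the bases in the window.
    return prod(col[base] for col, base in zip(pwm, window))

def position_distribution(seq, pwm):
    w = len(pwm)
    raw = [window_probability(seq[i:i + w], pwm) for i in range(len(seq) - w + 1)]
    total = sum(raw)
    # Normalise so the values form a probability distribution over start positions.
    return [p / total for p in raw]

# Hypothetical 2-column PWM that strongly favours "GT".
pwm = [
    {"A": 0.1, "T": 0.1, "C": 0.1, "G": 0.7},
    {"A": 0.1, "T": 0.7, "C": 0.1, "G": 0.1},
]
dist = position_distribution("AGTC", pwm)
print(dist)
```

The windows are `AG`, `GT`, and `TC` with raw probabilities 0.01, 0.49, and 0.01, so after normalisation almost all of the probability mass falls on start position 1.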

**`roulette_wheel`**:

* Implements **roulette wheel selection**, a probabilistic method to pick the next position using the computed distribution.
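
Roulette wheel selection can be checked empirically: over many draws, each index should be chosen roughly in proportion to its weight. The sketch below is a standalone variant with an optional `rng` parameter (an assumption added here so the demo can use a fixed seed), followed by the equivalent call to the standard library's `random.choices`.

```python
import random

def roulette_wheel(probabilities, rng=random):
    # Walk the cumulative sum until it passes a uniform draw.
    r = rng.uniform(0, sum(probabilities))
    cumulative = 0.0
    for i, p in enumerate(probabilities):
        cumulative += p
        if cumulative >= r:
            return i
    return len(probabilities) - 1  # guard against floating-point round-off

rng = random.Random(0)  # fixed seed only to make the demo repeatable
weights = [0.1, 0.8, 0.1]
draws = [roulette_wheel(weights, rng) for _ in range(10_000)]
share_of_1 = draws.count(1) / len(draws)
print(share_of_1)  # close to 0.8

# The standard library offers the same kind of weighted draw directly.
alt = rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Sampling in proportion to the weights, rather than always taking the most probable position, is what lets Gibbs sampling escape poor local configurations.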

**`gibbs_sampling`**:

* Main algorithm loop:

  * Randomly removes one sequence.
  * Builds PWM from the others.
  * Samples a new motif position for the removed sequence based on the PWM.
  * Updates current positions and calculates a motif **score** (based on column-wise maximum probabilities).
  * Keeps track of the best-scoring configuration.

**`motif_gibbs`**:

* Validates inputs and wraps the Gibbs sampling process.
* Returns the best motif alignment positions and corresponding score.


In [None]:
import random

class GibbsSampling:
    """Implements Gibbs Sampling algorithm for motif discovery.
    
    Attributes:
        seqs (list): List of DNA sequences to analyze
        w (int): Motif length to search for
        pseudo (int): Pseudocount value to prevent zero probabilities
        n (int): Number of sequences
        t (int): Length of each sequence (assumes equal-length sequences)
    """
    
    def __init__(self, sequences, motif_length, pseudo=1):
        """Initialize GibbsSampling parameters.
        
        Args:
            sequences (list): List of DNA sequences (must be equal length)
            motif_length (int): Length of motif to search for
            pseudo (int, optional): Pseudocount for Laplace smoothing. Default=1
        """
        self.seqs = sequences
        self.w = motif_length
        self.pseudo = pseudo
        self.n = len(sequences)
        self.t = len(sequences[0])

    def random_init_positions(self):
        """Generate random starting positions for motif search.
        
        Returns:
            dict: Initial positions mapping {sequence: start_index}
        """
        return {seq: random.randint(0, self.t - self.w) for seq in self.seqs}

    def create_motifs(self, positions):
        """Extract motifs based on current positions.
        
        Args:
            positions (dict): Current motif positions {seq: start_index}
            
        Returns:
            list: List of motif sequences corresponding to current positions
        """
        return [seq[positions[seq]:positions[seq] + self.w] for seq in positions]

    def pwm(self, motifs):
        """Build Position Weight Matrix (PWM) from motifs.
        
        Args:
            motifs (list): List of motif sequences to analyze
            
        Returns:
            list: PWM matrix as list of dictionaries {base: probability}
        """
        bases = 'ATCG'
        pwm_matrix = []
        for pos in zip(*motifs):
            counts = {base: pos.count(base) + self.pseudo for base in bases}
            total = sum(counts.values())
            pwm_matrix.append({base: counts[base] / total for base in bases})
        return pwm_matrix

    def prob_seq(self, seq, pwm):
        """Calculate probability of sequence given PWM.
        
        Args:
            seq (str): DNA sequence to evaluate
            pwm (list): Position Weight Matrix
            
        Returns:
            float: Probability score for the sequence
        """
        prob = 1.0
        for i, base in enumerate(seq):
            prob *= pwm[i][base]
        return prob

    def prob_positions(self, seq, pwm):
        """Calculate normalized probabilities for all motif positions in sequence.
        
        Args:
            seq (str): Sequence to analyze
            pwm (list): Current Position Weight Matrix
            
        Returns:
            list: Normalized probabilities for each possible starting position
        """
        probabilities = [self.prob_seq(seq[i:i+self.w], pwm) for i in range(self.t - self.w + 1)]
        total = sum(probabilities)
        return [p / total for p in probabilities]

    def roulette_wheel(self, probabilities):
        """Select position using probabilistic roulette wheel selection.
        
        Args:
            probabilities (list): List of position probabilities
            
        Returns:
            int: Selected position index
        """
        r = random.uniform(0, sum(probabilities))
        s = 0
        for i, p in enumerate(probabilities):
            s += p
            if s >= r:
                return i
        return len(probabilities) - 1

    def gibbs_sampling(self, max_iter=100, threshold=50):
        """Execute Gibbs Sampling algorithm.
        
        Args:
            max_iter (int): Maximum iterations. Default=100
            threshold (int): Convergence threshold (iterations without improvement). Default=50
            
        Returns:
            tuple: (best_positions dict, best_score float)
        """
        positions = self.random_init_positions()
        best_positions = positions.copy()
        best_score = 0
        count = 0

        for _ in range(max_iter):
            # Stop early after `threshold` consecutive iterations without improvement
            if count >= threshold:
                break

            # Leave one randomly chosen sequence out and build the PWM from the rest
            seq_to_remove = random.choice(self.seqs)
            temp_positions = positions.copy()
            temp_positions.pop(seq_to_remove)
            motifs = self.create_motifs(temp_positions)
            pwm = self.pwm(motifs)

            # Sample a new start position for the left-out sequence
            probabilities = self.prob_positions(seq_to_remove, pwm)
            new_pos = self.roulette_wheel(probabilities)
            positions[seq_to_remove] = new_pos

            # Score the full configuration: sum of the largest probability per column
            current_motifs = self.create_motifs(positions)
            pwm_all = self.pwm(current_motifs)
            score = sum(max(col.values()) for col in pwm_all)

            if score > best_score:
                best_score = score
                best_positions = positions.copy()
                count = 0
            else:
                count += 1

        return best_positions, best_score

    def print_motif(self, positions):
        """Display motifs based on final positions.
        
        Args:
            positions (dict): Final motif positions {seq: start_index}
        """
        for seq in self.seqs:
            if seq in positions:
                start = positions[seq]
                print(seq[start:start+self.w])


def motif_gibbs(seqs, num_seqs, tam_seq, tam_motif, max_iter=100, threshold=50):
    """Wrapper function for Gibbs Sampling motif discovery.
    
    Args:
        seqs (list): List of DNA sequences
        num_seqs (int): Expected number of sequences (validation)
        tam_seq (int): Expected sequence length (validation)
        tam_motif (int): Motif length to search for
        max_iter (int): Maximum iterations. Default=100
        threshold (int): Convergence threshold. Default=50
        
    Returns:
        tuple: (positions dict, score float)
        
    Raises:
        AssertionError: If input validation fails
    """
    assert len(seqs) == num_seqs
    assert all(len(s) == tam_seq for s in seqs)

    gibbs = GibbsSampling(seqs, tam_motif)
    positions, score = gibbs.gibbs_sampling(max_iter=max_iter, threshold=threshold)
    return positions, score



In [None]:
seqs = "ATGGTCGC TTGTCTGA CCGTAGTA".split()
motif_gibbs(seqs, 3, 8, 3)


({'ATGGTCGC': 2, 'TTGTCTGA': 1, 'CCGTAGTA': 1}, 1.4285714285714284)