# Median String Search Algorithm


The Median String Search Algorithm is used in bioinformatics to find a "median string" for a given set of DNA sequences. The Hamming distance between two strings of equal length is the number of positions at which their symbols differ. Because a candidate string of length `l` is usually shorter than the sequences, its distance to a sequence is defined as the minimum Hamming distance between the candidate and any length-`l` substring of that sequence. The median string is the string `s` of length `l` that minimizes the sum of these distances over all `n` given DNA sequences. This is useful in motif finding, where we aim to find a short pattern (motif) that occurs, possibly with mutations, in every sequence of a set.

Here's how we can implement it step by step:

1. Generate all $4^l$ possible strings of length `l` over the four nucleotides (A, C, G, T).
2. For each candidate string, compute its Hamming distance to every length-`l` substring of each sequence, keep the minimum per sequence, and sum these minima to get the candidate's total distance.
3. Return the candidate with the smallest total distance.
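
Step 1 can be sketched on its own before looking at the full search. The helper name `candidate_strings` below is hypothetical and only for illustration; it relies on `itertools.product`, which yields tuples in lexicographic order of the alphabet.

```python
import itertools


def candidate_strings(l, alphabet="ACGT"):
    """Yield every string of length l over the alphabet, in lexicographic order."""
    for letters in itertools.product(alphabet, repeat=l):
        yield "".join(letters)


candidates = list(candidate_strings(2))
print(len(candidates))  # 16, i.e. 4 ** 2
print(candidates[:5])   # ['AA', 'AC', 'AG', 'AT', 'CA']
```

Using a generator means the candidates do not all have to be held in memory at once, which starts to matter as `l` grows.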

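The algorithm's complexity is $O(4^l \cdot l \cdot n \cdot m)$, where `l` is the length of the median string, `n` is the number of sequences, and `m` is the length of the sequences. Each of the $4^l$ candidates is compared against up to `m` substrings in each of the `n` sequences, and each comparison inspects `l` characters. This makes the algorithm exponential in `l`, so it is only practical for small values of `l`.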
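
To get a feel for the growth, the rough cost estimate below counts the work of the brute-force search. The function name `brute_force_cost` is hypothetical, and the counts assume that all `n` sequences have the same length `m` and that every substring comparison inspects all `l` characters.

```python
def brute_force_cost(l, n, m):
    """Return (candidates, substring comparisons, character comparisons)."""
    candidates = 4 ** l                 # all strings of length l over A, C, G, T
    windows = n * (m - l + 1)           # length-l substrings across all sequences
    window_comparisons = candidates * windows
    char_comparisons = window_comparisons * l
    return candidates, window_comparisons, char_comparisons


print(brute_force_cost(3, n=4, m=10))  # (64, 2048, 6144)

# The candidate count alone grows by a factor of 4 with each extra character.
for l in (3, 6, 9, 12):
    print(l, brute_force_cost(l, n=4, m=100))
```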


In [6]:
import itertools

## Calculating Hamming Distance

The `hamming_distance` function computes the Hamming distance between two strings `s1` and `s2` of equal length.

In [7]:
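def hamming_distance(s1, s2):
    # Count the positions at which the two strings differ.
    # Note: zip() stops at the shorter string, so s1 and s2 are expected
    # to have equal length.
    #
    # Equivalent explicit loop:
    # distance = 0
    # for i in range(len(s1)):
    #     if s1[i] != s2[i]:
    #         distance += 1
    distance = sum(c1 != c2 for c1, c2 in zip(s1, s2))
    return distance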
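
Because `zip` silently stops at the end of the shorter string, passing strings of different lengths gives a misleadingly small distance instead of an error. A stricter variant might look like the sketch below; the name `hamming_distance_strict` is hypothetical and not part of the notebook's code.

```python
def hamming_distance_strict(s1, s2):
    """Hamming distance that rejects inputs of unequal length."""
    if len(s1) != len(s2):
        raise ValueError(
            f"strings must have equal length, got {len(s1)} and {len(s2)}"
        )
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming_distance_strict("GATTACA", "GACTATA"))  # 2

# Without the length check, the extra characters would simply be ignored:
print(sum(c1 != c2 for c1, c2 in zip("ACGT", "AC")))  # 0
```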


## Calculating Distance to Closest Pattern

The `distance_to_closest_pattern` function measures how closely a pattern matches a DNA sequence. It computes the Hamming distance between the given pattern and every substring of the sequence that has the same length as the pattern, and returns the minimum of these distances.

In [8]:
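def distance_to_closest_pattern(pattern, sequence):
    # Take the window length from the pattern itself rather than from a global variable
    l = len(pattern)
    min_distance = float('inf')
    # Slide a window of length l across the sequence
    for i in range(len(sequence) - l + 1):
        substring = sequence[i:i + l]
        distance = hamming_distance(pattern, substring)
        if distance < min_distance:
            min_distance = distance
    return min_distance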
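
To see what the sliding window does, the sketch below lists every window with its distance instead of only the minimum. The helper `window_distances` is hypothetical; it takes the window length from the pattern itself and computes the Hamming distance inline so that the snippet is self-contained.

```python
def window_distances(pattern, sequence):
    """Return (start index, window, Hamming distance) for every window."""
    k = len(pattern)
    rows = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        distance = sum(a != b for a, b in zip(pattern, window))
        rows.append((i, window, distance))
    return rows


rows = window_distances("CTG", "TCTGACGTCA")
for row in rows:
    print(row)

# The minimum over all windows is the distance to the closest pattern.
print(min(distance for _, _, distance in rows))  # 0
```

Here the window starting at index 1 is an exact match, so the minimum distance is 0.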


## Main Function

The main function generates all possible DNA strings of length `l` and, for each candidate, sums its `distance_to_closest_pattern` over all sequences. It tracks the candidate with the smallest total distance and returns it as the median string. When several candidates tie, the first one in lexicographic order is returned, because the best string is updated only when a strictly smaller total is found.

In [9]:
def median_string_search(dna_sequences, l):
    alphabet = ['A', 'C', 'G', 'T']

    # Generate all possible strings of length l
    all_possible_strings = [''.join(p) for p in itertools.product(alphabet, repeat=l)]

    # Initialize variables to track the best median string
    best_median_string = None
    best_total_distance = float('inf')

    # For each possible string, calculate the total distance to all sequences
    for pattern in all_possible_strings:
        total_distance = 0
        for sequence in dna_sequences:
            total_distance += distance_to_closest_pattern(pattern, sequence)

        # If we found a smaller total distance, update the best median string
        if total_distance < best_total_distance:
            best_total_distance = total_distance
            best_median_string = pattern

    return best_median_string
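
The function above builds the full list of $4^l$ candidates before searching. A more compact variant, sketched below under the assumption that every sequence has at least `l` characters, streams the candidates through `min` instead. The names `total_distance` and `median_string_lazy` are hypothetical, and the distance helper is inlined so that the snippet runs on its own.

```python
import itertools


def total_distance(pattern, dna_sequences):
    """Sum, over all sequences, of the distance to the closest window."""
    k = len(pattern)
    return sum(
        min(
            sum(a != b for a, b in zip(pattern, seq[i:i + k]))
            for i in range(len(seq) - k + 1)
        )
        for seq in dna_sequences
    )


def median_string_lazy(dna_sequences, l, alphabet="ACGT"):
    """Return a median string without materializing all candidates."""
    candidates = ("".join(p) for p in itertools.product(alphabet, repeat=l))
    # min() keeps the first minimal element, so ties resolve to the
    # lexicographically smallest candidate.
    return min(candidates, key=lambda p: total_distance(p, dna_sequences))


example = ["AGCTGACCTG", "CGCTGACGTA", "GCTGACCTGA", "TCTGACGTCA"]
print(median_string_lazy(example, 3))  # CTG
```

The running time is unchanged; only the memory needed for the candidate list is saved.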


## Testing the Algorithm

In [10]:
dna_sequences = [
    "AGCTGACCTG",
    "CGCTGACGTA",
    "GCTGACCTGA",
    "TCTGACGTCA"
]
l = 3  # Length of the median string
median_string = median_string_search(dna_sequences, l)
print(f"Median String: {median_string}")

Median String: CTG
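
A median string is not necessarily unique. The sketch below collects every candidate that reaches the minimum total distance for the same four sequences; the function name `all_median_strings` is hypothetical and the snippet re-implements the distance computation so that it is self-contained.

```python
import itertools


def all_median_strings(dna_sequences, l, alphabet="ACGT"):
    """Return (best total distance, all candidates achieving it)."""
    scores = {}
    for letters in itertools.product(alphabet, repeat=l):
        pattern = "".join(letters)
        # For each sequence take the closest window, then sum over sequences.
        scores[pattern] = sum(
            min(
                sum(a != b for a, b in zip(pattern, seq[i:i + l]))
                for i in range(len(seq) - l + 1)
            )
            for seq in dna_sequences
        )
    best = min(scores.values())
    return best, [p for p, d in scores.items() if d == best]


sequences = ["AGCTGACCTG", "CGCTGACGTA", "GCTGACCTGA", "TCTGACGTCA"]
print(all_median_strings(sequences, 3))  # (0, ['CTG', 'GAC', 'TGA'])
```

All three strings occur exactly in every sequence, so each has a total distance of 0; `CTG` is simply the first of them in lexicographic order.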
