Dan Shea  
2021-06-22  

#### Problem
As is the case with point mutations, the most common type of sequencing error occurs when a single nucleotide from a read is interpreted incorrectly.

__Given:__ A collection of up to 1000 reads of equal length (at most 50 bp) in FASTA format. Some of these reads were generated with a single-nucleotide error. For each read $s$ in the dataset, one of the following applies:

- $s$ was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement);
- $s$ is incorrect, it appears in the dataset exactly once, and its Hamming distance is $1$ with respect to _exactly one correct read_ in the dataset (or its reverse complement).

__Return:__ A list of all corrections in the form `[old read]->[new read]`. (Each correction must be a single symbol substitution, and you may return the corrections in any order.)
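
Both conditions rest on two operations: the reverse complement of a read and the Hamming distance between two reads of equal length. The following is a minimal pure-Python sketch of those operations, independent of Biopython. The names `reverse_complement` and `hamming_distance` are illustrative assumptions and are not part of the notebook code further down.

```python
# Illustrative helpers (hypothetical names, not the notebook's own functions).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return read.translate(COMPLEMENT)[::-1]


def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Reads taken from the sample dataset below.
print(reverse_complement("TCATC"))                             # GATGA
print(hamming_distance("TTCAT", "TTGAT"))                      # 1
print(hamming_distance("GAGGA", reverse_complement("TCATC")))  # 1
```

The last line shows why reverse complements matter: `GAGGA` is not within distance 1 of any read as written, but it is within distance 1 of the reverse complement of `TCATC`.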

##### Sample Dataset
```
>Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCC
```
##### Sample Output
```
TTCAT->TTGAT
GAGGA->GATGA
TTTCC->TTTCA
```
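
As a cross-check of the sample, the sketch below reproduces the output above from the nine sample reads using only the standard library. It is one possible approach under the problem's guarantees, not the method used in the cells that follow; `revcomp`, `find_corrections`, and `SAMPLE_READS` are hypothetical names introduced here.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

SAMPLE_READS = ["TCATC", "TTCAT", "TCATC", "TGAAA", "GAGGA",
                "TTTCA", "ATCAA", "TTGAT", "TTTCC"]


def revcomp(read):
    """Reverse complement of a DNA string over A, C, G, T."""
    return read.translate(COMPLEMENT)[::-1]


def find_corrections(reads):
    """Return 'old->new' strings for every read that appears only once."""
    # Count each read in both orientations; a read equal to its own
    # reverse complement is counted once per occurrence, not twice.
    counts = Counter()
    for read in reads:
        counts[read] += 1
        rc = revcomp(read)
        if rc != read:
            counts[rc] += 1

    # Correct reads (in either orientation) were seen at least twice.
    correct = sorted(r for r, c in counts.items() if c >= 2)

    corrections = []
    for read in reads:
        if counts[read] != 1:
            continue
        # The problem guarantees exactly one candidate at distance 1.
        for candidate in correct:
            if sum(x != y for x, y in zip(read, candidate)) == 1:
                corrections.append(f"{read}->{candidate}")
                break
    return corrections


for line in find_corrections(SAMPLE_READS):
    print(line)
```

Storing both orientations in the counter means the reverse-complement case needs no separate branch when searching for the correction: the target is already present in the orientation that differs by one symbol.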

In [1]:
from Bio import SeqIO

In [2]:
def parse_input(filename):
    with open(filename, 'r') as fh:
        sequences = list(SeqIO.parse(fh, 'fasta'))
        return sequences

In [3]:
def hamming(a, b):
    """Return the number of positions at which equal-length sequences a and b differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

In [4]:
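def compute_counts(sequences):
    """For each read, count the reads equal to it or to its reverse complement (itself included)."""
    n = len(sequences)
    seqs = [rec.seq for rec in sequences]
    # Compute each reverse complement once instead of once per comparison.
    revcomps = [seq.reverse_complement() for seq in seqs]
    counts = [1] * n
    for i in range(n):
        for j in range(i + 1, n):
            if seqs[i] == seqs[j] or seqs[i] == revcomps[j]:
                counts[i] += 1
                counts[j] += 1
    return counts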

In [5]:
def compute_corr(sequences, counts):
    """Map each erroneous read index to (correct read index, matched_as_reverse_complement)."""
    n = len(sequences)
    # Reads seen once (counting reverse complements) are treated as erroneous,
    # reads seen more than once as correct.
    no_match_idx = [idx for idx in range(n) if counts[idx] == 1]
    correct_idx = [idx for idx in range(n) if counts[idx] > 1]
    corrections = {}

    for i in no_match_idx:
        read = sequences[i].seq
        for j in correct_idx:
            # Stop at the first correct read within Hamming distance 1,
            # trying the forward orientation before the reverse complement.
            if hamming(read, sequences[j].seq) == 1:
                corrections[i] = (j, False)
                break
            if hamming(read, sequences[j].reverse_complement().seq) == 1:
                corrections[i] = (j, True)
                break
    return corrections

In [6]:
def format_corr(sequences, corrections):
    corr = []
    for idx, (correct_idx, revcomp) in corrections.items():
        from_seq = sequences[idx].seq
        target = sequences[correct_idx]
        # Use the orientation of the correct read that matched the erroneous one.
        to_seq = target.reverse_complement().seq if revcomp else target.seq
        corr.append(f'{from_seq}->{to_seq}\n')
    return corr

In [7]:
def output_corr(corr, filename='ans.txt'):
    with open(filename, 'w') as fh:
        fh.writelines(corr)

In [8]:
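def parse_file_print_ans(filename):
    """Run the full pipeline on `filename` and write the corrections to 'ans.txt'.

    Each call overwrites 'ans.txt', the default output file of output_corr.
    """
    sequences = parse_input(filename)
    counts = compute_counts(sequences)
    corrections = compute_corr(sequences, counts)
    corr = format_corr(sequences, corrections)
    output_corr(corr)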

In [9]:
parse_file_print_ans('sample.txt')

In [10]:
parse_file_print_ans('rosalind_corr.txt')