# Bioinformatics quick practice problems (from Rosalind)
https://rosalind.info/problems/list-view/

### Problem: Counting DNA Nucleotides

A DNA string is a sequence of symbols chosen from the alphabet consisting of 'A' (adenine), 'C' (cytosine), 'G' (guanine), and 'T' (thymine). The length of a DNA string is the number of symbols it contains.

For example, the DNA string `"ATGCTTCAGAAAGGTCTTACG"` is a string of length 21.

### Input:
- A DNA string `s` with a maximum length of 1,000 nucleotides (nt).

### Output:
- Four integers, each representing the count of the nucleotides 'A', 'C', 'G', and 'T' in the DNA string `s`, respectively. The counts should be separated by spaces.
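
As a quick check of the expected output, the counts can also be computed on a string that is already in memory. This is a minimal sketch rather than the file-based solution in this notebook; the helper name `count_nucleotides` is hypothetical.

```python
from collections import Counter

def count_nucleotides(dna):
    """Return the counts of 'A', 'C', 'G', 'T' in dna as a tuple (hypothetical helper)."""
    counts = Counter(dna)
    # A Counter returns 0 for symbols that never occur in the string
    return tuple(counts[base] for base in "ACGT")

# Example string from the problem statement (length 21)
sample = "ATGCTTCAGAAAGGTCTTACG"
print(*count_nucleotides(sample))  # prints: 6 4 5 6
```

The four counts add up to 21, the length of the example string.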

In [1]:
def counting_nucs(file_path):
    with open(file_path, "r") as file:
        A = 0
        C = 0
        G = 0
        T = 0
        
        while True:
            char = file.read(1)
            if not char:
                break
            if char == "A":
                A += 1
            elif char == "C":
                C += 1
            elif char == "G":
                G += 1
            elif char == "T":
                T += 1
    return A,C,G,T

file_path = "rosalind_data/rosalind_dna.txt"
result = counting_nucs(file_path)
print(*result)  # the * operator unpacks the tuple so each count is passed to print as a separate argument


201 212 245 231


### Problem: Transcribing DNA into RNA

An RNA string is similar to a DNA string but uses 'U' (uracil) instead of 'T' (thymine). The process of transcribing DNA to RNA involves replacing all occurrences of 'T' with 'U'.

### Input:
- A DNA string `t` with a maximum length of 1,000 nucleotides (nt).

### Output:
- The RNA string `u`, which is formed by replacing all 'T' characters in `t` with 'U'.
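
The replacement rule can be tried on a short in-memory string before running it on a file. A minimal sketch, assuming the input contains only uppercase DNA symbols; the helper name `transcribe` is hypothetical.

```python
def transcribe(dna):
    """Return the RNA string for dna by replacing every 'T' with 'U' (hypothetical helper)."""
    return dna.replace("T", "U")

# DNA example string from the first problem in this notebook
print(transcribe("ATGCTTCAGAAAGGTCTTACG"))  # prints: AUGCUUCAGAAAGGUCUUACG
```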

In [28]:
def DNA_to_RNA(file_path):
    with open(file_path, "r") as file:
        RNA = file.read().replace('T', 'U')
    return RNA

file_path = "rosalind_data/rosalind_rna.txt"
result = DNA_to_RNA(file_path)
print(result)

AUGCCAGGCACGUAGAAGUAAUGUUAGGAUAGGCUACAAUAGAGAUCGCACUCGAUACACCGCCACGAAAUUGAUGCAGCAACCAAGACUCUGGUAUAAGUGUGAGGCGGGAUUUACCUUGGAUGCCGUUGGCACAGCUAUGGUUUGCCUACACAGGUUUGCAGGGUCGUGGACUACGCUUCAAAACUAGGUUUAUGGAUCCGAACCUUUGUAUGAAAAUCGCAUAUUCUUGGUACUCGUCAGCAAGUGGGAGGAUCCGUUACGAUUCCUCCAUGUAAAUGGCGGGCAGCCGGUCGCAAUUCAAAAUUAAUCAGUCGGUACGCGAUCGGUGUUAAAGCUGCUUGUAAGUCGCAUCAUUUCCUCUUCGAACCGAAUAAUUUCCAAAAUUUGUUUUCGGGUACUGUCCUGACUCAGAUUCUCAGUAAUGGCCUAAAUUCUAUUUCUCCGUUCAACAUGGUACAUCCUCCCUUUAACAGCCUUGCAAUUUACACUGUCCGUGCCACACAUAGCCUCUAUCGGGGGGGAAUACAUUUGGAACGAUGGUUACCUGGUGUUGAUAUACAGCAAAAUUCAGGCCAAAGCCAUCUUCGAACCCGUAUACGCUAUAAAAAAGUUCCCAAGUCGUGUCUCAUGAACGAAAUACUGCAACAGUAUAAGAGGCUUGGAGGAACUGGACUUACUCAGUCACAGCUCGUAGCCCUAUUGGAUGCCUGCUGUGCUACACUUCGACUAGGCAGAAAUAGCGCGCUGUAGAGAUAAUAGGCAACGUAGAACAGACUUCGACAUCAUGAGGAUGCAUCACUUGGAGCACUGGUGACGACUCGUUAUGGGAAAGCUUUGUGCGCCCCAGUGCACCUGACGUCUCUAUCCAGCCUCCGGUUCUCCUAAGUCCCGAGGCUAUGAUUCUUGGAUGAGAAUCUACGAUACCAUCACAUGUUUUAAUGUGUGGGUGCGGGUUUCGUCACUA



### Problem: Complementing a Strand of DNA

In DNA, the symbols 'A' (adenine) and 'T' (thymine) are complements of each other, as are 'C' (cytosine) and 'G' (guanine). To find the reverse complement of a DNA string, you first reverse the string and then replace each nucleotide with its complement.

For example, the reverse complement of the DNA string `"GTCA"` is `"TGAC"`.

### Input:
- A DNA string `s` with a maximum length of 1,000 base pairs (bp).

### Output:
- The reverse complement `sc` of the DNA string `s`.
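
The example from the problem statement can be reproduced with a translation table. This is a sketch of one possible approach, assuming an uppercase string made up only of A, C, G, and T; the names `COMPLEMENT` and `revcomp` are hypothetical.

```python
# Translation table mapping each nucleotide to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(dna):
    """Reverse dna, then replace each nucleotide with its complement (hypothetical helper)."""
    return dna[::-1].translate(COMPLEMENT)

print(revcomp("GTCA"))  # prints: TGAC
```

Reversing first and complementing second gives the same result as complementing first and reversing second, because the complement acts on each symbol independently.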

In [33]:
def reverse_complement(file_path):
    with open(file_path, "r") as file:
        complement = ""
        while True:
            char = file.read(1)
            if not char:
                break
            if char == "A":
                complement += "T"
            elif char == "C":
                complement += "G"
            elif char == "G":
                complement += "C"
            elif char == "T":
                complement += "A"
        # reverse the complemented string; a separate name avoids shadowing the function
        result = complement[::-1]
        
    return result

file_path = "rosalind_data/rosalind_revc.txt"
result = reverse_complement(file_path)
print(result)


TAATAACCACATGTCGGTACCTATATTGAGCGCGGCACTCTGGGGCTCCGCAAGTATGTGCACATAGACAATTTGTATAATCGGTGCACACTAAGAAACCTGAGTCGACACTTTCAGCCCGCCATTCGAACAAGTTCAAAGCGAAGACTAAAGCCCTCACGCTAGGTTTTAGAGGACGTTTCGTAGATCCCTCGCGCAGGACAACACCACCATTAAGCCCACCCCACGCTTTTTCCCAGCGACGATTCACTCGCGTTGAGTATGCTCGACAGTACCCAACAGAAGCTCGCCCGGCGGGATAGGACGTTTTACGGCCTTTATGCAAGCATACTGACAACCTGATAGGATCAACTTGCGCAAACTACCCCAGAGACTCTGGTTTCGGCAGGTACTTCGCGGAGGGATGTAACCGCATATAGGGGTACCTAGGGGAAGAAGTAGGATTAACCCGGGGTGCTCTGCGGTTACGCTATCCAATCTATGGACGCGATTTACCCCGCTCGCTACTTATTAGAGACATGCTATTGAGCCAAATGCTGCGGTGAGGCCACACAAAACTAGGTGGAGCGTACGGCGTAGACAGTCCGCAATACTTCTTGGCGAGCTCTAGCAAATCACAGGAGCCAATGCTCTGGGTTGACAGTAACCTGGAGGTACAAGATAACCGTGCTATTGTCTAGTGGACCACTAATTTCCCGCTTAAATTAGCATATACGTCCTTCTACCGAATGTACAATTACAGTATAAGACATCCTACCTGCACGGTCAAAGGGTCTTCTGAACTAACGGTACCGGTCGCGTAGGGAAAGCTCTGTTAGTTTCGTGATGTCACGCTGACATATCGCGTCGAGCCATTGACATTTGCGCAGACCGATTACGGTCGCAATGCCCTTCTGCAATTGTAATAATATTCACATCTTTCACGTTTGGCCAAACAATGTCTTTGACGTA


### Problem: Rabbits and Recurrence Relations - Fibonacci Sequence

The rabbit problem is a variation of the Fibonacci sequence: it asks for the number of rabbit pairs present after a given number of months. The population starts with one newborn pair. Every pair of reproductive age produces `k` new pairs each month, and newborn pairs take one month to reach reproductive age. With `F(n)` the number of pairs in month `n`, this gives the recurrence `F(n) = F(n-1) + k * F(n-2)` with `F(1) = F(2) = 1`.

### Input:
- Two positive integers:
  - `n`: the number of months (where `1 ≤ n ≤ 40`)
  - `k`: the number of rabbit pairs produced by each pair of rabbits per month (where `1 ≤ k ≤ 5`)

### Output:
- The total number of rabbit pairs after `n` months, considering the reproduction rules.
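
The growth rule can be checked by hand on small inputs. The sketch below takes `n` and `k` directly instead of reading them from a file, and assumes the population starts with one newborn pair, so that the count is 1 in both of the first two months; the function name `rabbit_pairs_after` is hypothetical.

```python
def rabbit_pairs_after(n, k):
    """Number of rabbit pairs in month n when each mature pair produces k pairs per month.

    Assumes F(1) = F(2) = 1 and F(n) = F(n-1) + k * F(n-2) (hypothetical helper).
    """
    prev2, prev1 = 1, 1  # F(1) and F(2)
    for _ in range(3, n + 1):
        # pairs alive last month, plus k offspring for each pair alive two months ago
        prev2, prev1 = prev1, prev1 + k * prev2
    return prev1

# For k = 3 the monthly counts are 1, 1, 4, 7, 19
print(rabbit_pairs_after(5, 3))  # prints: 19
```

With `k = 1` the function reduces to the ordinary Fibonacci sequence.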

In [44]:
def rabbit_pairs(file_path):
    with open(file_path, "r") as file:
        numbers = file.read().split()
        n = int(numbers[0])
        k = int(numbers[1])
        if n == 1 or n == 2:
            return 1

        # Initialize base cases
        prev2 = 1  # F(n-2)
        prev1 = 1  # F(n-1)
        
        # Calculate F(n) using the recurrence relation
        for i in range(3, n + 1):
            current = prev1 + k * prev2
            prev2 = prev1
            prev1 = current

    return prev1

file_path = "rosalind_data/rosalind_fib.txt"
result = rabbit_pairs(file_path)
print(result)

357913941


### Problem: Computing GC Content

The GC-content of a DNA string is the percentage of nucleotides in the string that are either 'C' (cytosine) or 'G' (guanine). For example, the GC-content of the DNA string `"AGCTATAG"` is 37.5%, which represents the proportion of 'C' and 'G' nucleotides in the string. The reverse complement of a DNA string has the same GC-content.

DNA strings are often labeled in databases using the FASTA format. In FASTA format, each string starts with a line that begins with '>', followed by a label. Subsequent lines contain the DNA sequence. The label for each string in Rosalind's implementation will follow the format "Rosalind_xxxx", where "xxxx" is a four-digit code between 0000 and 9999.

### Input:
- Up to 10 DNA strings in FASTA format, with each string having a maximum length of 1,000 base pairs.

### Output:
- The ID of the DNA string with the highest GC-content, followed by the GC-content of that string as a percentage. An answer is accepted if it lies within an absolute error of 0.001 of the true value.
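
The percentage from the problem statement can be reproduced on in-memory strings. This is a minimal sketch that skips FASTA parsing; the function name `gc_content` is hypothetical, and the two records are made up for illustration only.

```python
def gc_content(dna):
    """Return the percentage of 'G' and 'C' symbols in dna (0.0 for an empty string)."""
    if not dna:
        return 0.0
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna)

print(gc_content("AGCTATAG"))  # prints: 37.5

# Made-up records; the labels only imitate the Rosalind_xxxx pattern
records = {
    "Rosalind_0001": "AGCTATAG",
    "Rosalind_0002": "GGCCAT",
}

# Pick the label whose sequence has the highest GC-content
best_id = max(records, key=lambda label: gc_content(records[label]))
print(best_id)
print(f"{gc_content(records[best_id]):.6f}")
```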

In [58]:
import sys
class FastAreader :
   
    def __init__ (self, fname=None):
        '''constructor: saves attribute fname '''
        self.fname = fname
            
    def doOpen (self):
        ''' Handle file opens, allowing STDIN.'''
        if self.fname is None:
            return sys.stdin
        else:
            return open(self.fname)
        
    def readFasta (self):
        ''' Read an entire FastA record and return the sequence header/sequence'''
        header = ''
        sequence = ''
        
        with self.doOpen() as fileH:
            
            # skip to first fasta header; stop at end of file so that an
            # input without any '>' line cannot loop forever
            line = fileH.readline()
            while line and not line.startswith('>') :
                line = fileH.readline()
            if not line:
                return
            header = line[1:].rstrip()

            for line in fileH:
                if line.startswith ('>'):
                    yield header,sequence
                    header = line[1:].rstrip()
                    sequence = ''
                else :
                    sequence += ''.join(line.rstrip().split()).upper()

        yield header,sequence
    
def countNuc(sequence, nuc):
    ''' Count how many times the single nucleotide nuc occurs in sequence. '''
    return sequence.count(nuc)

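def find_gc_content(sequence):
    ''' Return the GC-content of sequence as a percentage. '''
    nucCount = len(sequence)
    if nucCount == 0:
        return 0.0  # avoid dividing by zero on an empty record
    gCount = countNuc(sequence, "G")
    cCount = countNuc(sequence, "C")
    gC_content = ((gCount + cCount) / nucCount)*100
    return gC_content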

reader = FastAreader("rosalind_data/rosalind_gc.txt")
highest_header  = None
highest_content = 0

for header, sequence in reader.readFasta():
    seq_GC = find_gc_content(sequence)
    if highest_header is None or seq_GC > highest_content:
        highest_header = header
        highest_content = seq_GC
    
print(f"{highest_header}\n{highest_content:.6f}")



Rosalind_3421
51.260504


### Problem: Counting Point Mutations (Hamming Distance)

The Hamming distance between two strings `s` and `t` is a measure of how many corresponding symbols differ between the two strings. For two strings of equal length, the Hamming distance is the count of positions at which the symbols in the two strings differ.

For example, a Hamming distance of 7 means that the two strings differ at exactly 7 positions and agree at every other position.

### Input:
- Two DNA strings `s` and `t`, which have equal length and do not exceed 1,000 base pairs.

### Output:
- The Hamming distance `dH(s, t)`, which is the count of positions where the symbols in `s` and `t` differ.
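
The definition translates almost directly into a position-by-position comparison. A minimal sketch on in-memory strings; the function name `hamming_distance` is hypothetical, and the two strings are illustrative inputs whose distance is 7.

```python
def hamming_distance(s, t):
    """Count the positions at which s and t differ (hypothetical helper)."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    # zip pairs up the symbols at each position; True counts as 1 in the sum
    return sum(a != b for a, b in zip(s, t))

print(hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))  # prints: 7
```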

In [62]:
def countPointMuts(file_path):
    with open(file_path, "r") as file:
        # strip trailing newlines so they are never compared as symbols
        seq1 = file.readline().strip()
        seq2 = file.readline().strip()
        
        # count the positions where the paired symbols differ
        hammingCount = 0
        for seq1nuc, seq2nuc in zip(seq1, seq2):
            if seq1nuc != seq2nuc:
                hammingCount += 1
    return hammingCount

file_path = "rosalind_data/rosalind_hamm.txt"
result = countPointMuts(file_path)
print(result)

504


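### Problem: Translating RNA into Protein

The cell below translates an RNA string into a protein string. It reads the RNA three nucleotides (one codon) at a time, maps each codon to its amino acid with the standard RNA codon table, and stops at the first stop codon.

In [64]: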
def RNA_to_AA(rna_seq):
    rnaCodonTable = {
        # RNA codon table
        # U
        'UUU': 'F', 'UCU': 'S', 'UAU': 'Y', 'UGU': 'C',  # UxU
        'UUC': 'F', 'UCC': 'S', 'UAC': 'Y', 'UGC': 'C',  # UxC
        'UUA': 'L', 'UCA': 'S', 'UAA': '-', 'UGA': '-',  # UxA
        'UUG': 'L', 'UCG': 'S', 'UAG': '-', 'UGG': 'W',  # UxG
        # C
        'CUU': 'L', 'CCU': 'P', 'CAU': 'H', 'CGU': 'R',  # CxU
        'CUC': 'L', 'CCC': 'P', 'CAC': 'H', 'CGC': 'R',  # CxC
        'CUA': 'L', 'CCA': 'P', 'CAA': 'Q', 'CGA': 'R',  # CxA
        'CUG': 'L', 'CCG': 'P', 'CAG': 'Q', 'CGG': 'R',  # CxG
        # A
        'AUU': 'I', 'ACU': 'T', 'AAU': 'N', 'AGU': 'S',  # AxU
        'AUC': 'I', 'ACC': 'T', 'AAC': 'N', 'AGC': 'S',  # AxC
        'AUA': 'I', 'ACA': 'T', 'AAA': 'K', 'AGA': 'R',  # AxA
        'AUG': 'M', 'ACG': 'T', 'AAG': 'K', 'AGG': 'R',  # AxG
        # G
        'GUU': 'V', 'GCU': 'A', 'GAU': 'D', 'GGU': 'G',  # GxU
        'GUC': 'V', 'GCC': 'A', 'GAC': 'D', 'GGC': 'G',  # GxC
        'GUA': 'V', 'GCA': 'A', 'GAA': 'E', 'GGA': 'G',  # GxA
        'GUG': 'V', 'GCG': 'A', 'GAG': 'E', 'GGG': 'G'  # GxG
    }

    aa_seq = ""
    for i in range(0, len(rna_seq) - 2, 3):
        codon = rna_seq[i:i+3]
        if codon in rnaCodonTable:
            amino_acid = rnaCodonTable[codon]
            if amino_acid == '-':  # Stop translating at termination codon
                break
            aa_seq += amino_acid
    return aa_seq

def read_rna_from_file(file_path):
    with open(file_path, "r") as file:
        rna_seq = file.read().strip().replace("\n", "")
    return rna_seq

file_path = "rosalind_data/rosalind_prot.txt"
rna_seq = read_rna_from_file(file_path)
result = RNA_to_AA(rna_seq)
print(result)


MSSPVVIVLCGRVYTPPRTRRNSHDGERSPPSSSRAVLPSTDVFLRTPKRSPRKPMNCSQVPNFYTQTIVVCRALLGTHSLGHGNFSEFMRVSLAHPREPSSHSDHIDRGASGNGAGISRKSRCDYNRFETVKRRSGEAHNTKIRHYKLHSFMTTQIKERVLTTVETGQTRCMPLSTGRHLSLYCPQPSCPSCIRLLTSSCLTVWAEPLYSCRQFSSVVSGKLASPVWIIILRRYRPSDASLTLATEYICDRSIPESEKINCVPIPSLKMRLCSAGRSSCIAYIFQHFRLLIKHTPTSRRLLLHRAAAINDNASPSPGGRCWLSTRSKRDTLIAAPYARGLSRTHPAAYVSPKRGSVHAGCSRSSPRHGKVCRPRTCARDGDLSSHLKGSRVTRNYSDKTGCGSYARGEDTLCDIADSGLVGLTTTYRGLTLIGSMIKRHSRFKRTCFYDPQPANWTHLAVAGVFYGSCICGVHISAFIPSPHNVFFLSSTFNYQELGTLRVWVMLIISPARIVGDAPYLANLFPTQAKIPIVSCETNLLQPAAFRNCDATRAGLFDIAESLLGSLSQCSMIPIMCRKHLPTRWQPTVYTEQESHVRTGFQLRPLVGHIRLRYLDTPSAHNSGECNTQSLRGTFAGETLRHYVVALPGRAPQVSTLVLKGIHQAIGDRDQTRLRISPEIPWDFQHVNTLRSLKHRWRSVTASLLTFPGVFELQYDNPAPSPGLVCKDKDHLQLRRLNPPARDDAGVKRFTRNPKMARPNISGSDCLITNPPARACAFTILLVHFTRRVMVLAFSIREPIRQQSGHWRSAASSSGGQRLIVEVGIVAHSARAFVRRPPLRFGSTETHKYSFRFVFAAGQRSSRSNLNTTVHSKIWAEKAGRRVGKRTLSPYREGKGEFHWNVRRARRSRSTRRLLGDLWPLHQAALRVRPYHIDVASCLFRPKVPNYPHGSSAPGTSALFRLRGVWISGGWLVCRPVLHAWGKLPIGLVGGLRPKKLNLRM
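
The codon table in the cell above lists all 64 entries by hand. As a cross-check, the same table can be generated from the standard genetic code written as one 64-letter string in `UCAG` order. This is a sketch under that assumption; the names `BASES`, `AMINO_ACIDS`, `CODON_TABLE`, and `translate` are hypothetical, and `'*'` marks a stop codon here instead of `'-'`.

```python
from itertools import product

BASES = "UCAG"
# Amino acids of the standard genetic code for codons UUU, UUC, UUA, UUG, UCU, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# product() varies the last base fastest, which matches the order of AMINO_ACIDS
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(rna):
    """Translate rna codon by codon, stopping at the first stop codon (hypothetical helper)."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]  # raises KeyError on a non-RNA symbol
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"))  # prints: MAMAPRTEINSTRING
```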

### Problem: Finding Motifs in a DNA Sequence

Given two DNA strings `s` and `t`, the task is to find every position at which `t` occurs as a substring of `s`. A substring is a run of consecutive characters within another string.

Positions are 1-based and counted from the left. For example, in the string `"AUGCUUCAGAAAGGUCUUACG"` the character `'U'` occurs at positions 2, 5, 6, 15, 17, and 18, and the substring that starts at position 2 and ends at position 5 is `"UGCU"`.

If `t` occurs more than once in `s`, every starting position must be returned, including those of overlapping occurrences.

### Input:
- Two DNA strings `s` and `t`, each with a maximum length of 1,000 base pairs.

### Output:
- A list of all positions in `s` where `t` begins as a substring.
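
The 1-based positions from the example in the problem statement can be reproduced with repeated calls to `str.find`. A minimal sketch on in-memory strings; the function name `motif_positions` is hypothetical, and the second call uses an illustrative input with overlapping matches.

```python
def motif_positions(s, t):
    """Return the 1-based start positions of every occurrence of t in s (hypothetical helper)."""
    positions = []
    start = s.find(t)
    while start != -1:
        positions.append(start + 1)   # convert the 0-based index to a 1-based position
        start = s.find(t, start + 1)  # resume one character later so overlaps are found
    return positions

print(motif_positions("AUGCUUCAGAAAGGUCUUACG", "U"))  # prints: [2, 5, 6, 15, 17, 18]
print(motif_positions("GATATATGCATATACTT", "ATAT"))   # prints: [2, 4, 10]
```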

In [93]:
def findMotifs(file_path):
    with open(file_path, "r") as file:
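        sequence = file.readline().strip()
        motif = file.readline().strip()
        positions = []
        index = 0
        while True:
            # find() returns the 0-based index of the next match, or -1 if there is none
            index = sequence.find(motif, index)
            if index == -1:
                break
            # adding 1 gives the 1-based position and also moves the next
            # search one character forward, so overlapping matches are found
            index += 1
            positions.append(str(index))
    return " ".join(positions)
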

file_path = "rosalind_data/rosalind_subs.txt"
result = findMotifs(file_path)
print(result)

23 38 45 66 110 286 328 359 366 404 411 489 535 589 606 622 629 714 734 758


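### Problem: Finding a Shared Motif

A common substring of a collection of strings is a substring of every string in the collection. The cell below reads DNA strings in FASTA format and returns a longest common substring of all of them; if several substrings share the maximum length, it returns the first one it finds.

In [7]: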
def readFasta(file_path):
    ''' Read a FastA file and return a dictionary with headers as keys and sequences as values. '''
    fasta_dict = {}
    header = ''
    sequence = ''
    
    with open(file_path, "r") as fileH:
        for line in fileH:
            if line.startswith('>'):
                # Save the previous header and sequence
                if header:
                    fasta_dict[header] = sequence
                # Update header and reset sequence
                header = line[1:].rstrip()
                sequence = ''
            else:
                # Append to the current sequence
                sequence += ''.join(line.rstrip().split()).upper()
                
        # Save the last header and sequence
        if header:
            fasta_dict[header] = sequence

    return fasta_dict

def findSharedMotifs(fasta_dict):
    ''' Find the longest common substring in all sequences of the FASTA dictionary. '''
    sequences = list(fasta_dict.values())
    
    if not sequences:
        return ""
    
    def is_common_substring(substring, sequences):
        ''' Check if the substring is present in all sequences. '''
        return all(substring in seq for seq in sequences)
    
    # Get the shortest sequence to minimize search space
    shortest_seq = min(sequences, key=len)
    longest_motif = ""

    # Generate all possible substrings of the shortest sequence and find the longest common one
    for length in range(len(shortest_seq), 0, -1):  # Decrease length from longest to shortest
        for start in range(len(shortest_seq) - length + 1):
            substring = shortest_seq[start:start + length]
            if is_common_substring(substring, sequences):
                longest_motif = substring
                break
        if longest_motif:
            break

    return longest_motif


# Example usage:
file_path = "rosalind_data/rosalind_lcsm.txt"
fasta_dict = readFasta(file_path)
result = findSharedMotifs(fasta_dict)
print(result)


ATCCAAAATCATTGGATGCTGACTGATTGGGGAGTTGATCTTACACCGCGAAGCGTTCACCTCATTCGCGGCTCTATTCAGGCACAGGAAACTGGAGTCCACACTTTCACGTCGGGATCGGATCCCTCATCACCACCGTAGCGTCCCGTATGGGCGGAGATAAC
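
The search above tries every substring of the shortest sequence, longest first. One possible speed-up, sketched here as an alternative and not as the method used in this notebook, is a binary search on the motif length: if the sequences share a substring of some length, they also share one of every shorter length. The function names and the three short input strings are hypothetical.

```python
def shared_substrings(sequences, length):
    """Return the set of substrings of the given length that occur in every sequence."""
    first = sequences[0]
    shared = {first[i:i + length] for i in range(len(first) - length + 1)}
    for seq in sequences[1:]:
        shared &= {seq[i:i + length] for i in range(len(seq) - length + 1)}
        if not shared:
            break  # nothing in common at this length
    return shared

def longest_shared_motif(sequences):
    """Return one longest common substring of all sequences ('' if there is none)."""
    if not sequences:
        return ""
    best = ""
    low, high = 0, min(len(seq) for seq in sequences)
    while low < high:
        mid = (low + high + 1) // 2
        shared = shared_substrings(sequences, mid)
        if shared:
            best = min(shared)  # alphabetically first, so the result is deterministic
            low = mid
        else:
            high = mid - 1
    return best

print(longest_shared_motif(["GATTACA", "TAGACCA", "ATACA"]))  # prints: AC
```

For these inputs `"TA"` and `"CA"` are equally long common substrings; the sketch reports the alphabetically first one.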
