# DNA

Problem
A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

Sample Dataset
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
Sample Output
20 12 17 21
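
As a quick check of the problem statement against the sample dataset, the counting can be sketched with `collections.Counter`. The function name `count_nucleotides` is only illustrative and not part of the original notebook.

```python
from collections import Counter

def count_nucleotides(s):
    """Return the counts of 'A', 'C', 'G', 'T' in s as a space-separated string."""
    counts = Counter(s)  # symbols that never occur simply count as 0
    return " ".join(str(counts[base]) for base in "ACGT")

sample = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
dna_sample_counts = count_nucleotides(sample)
print(dna_sample_counts)  # 20 12 17 21
```

A single pass with `Counter` avoids scanning the string four times, although for strings of at most 1000 nt either approach is fine.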

In [62]:
# Read the DNA string and drop the trailing newline
with open('rosalind_dna.txt', 'r') as f:
    dna_seq = f.read().strip()

# Count each nucleotide in the order A, C, G, T
a = dna_seq.count("A")
b = dna_seq.count("C")
c = dna_seq.count("G")
d = dna_seq.count("T")
print(a, b, c, d)

201 199 206 210


# RNA 

Problem
An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

Given: A DNA string t having length at most 1000 nt.

Return: The transcribed RNA string of t.

Sample Dataset
GATGGAACTTGACTACGTAAATT
Sample Output
GAUGGAACUUGACUACGUAAAUU

In [64]:
seq = "GATGGAACTTGACTACGTAAATT"
seq.replace("T","U")

'GAUGGAACUUGACUACGUAAAUU'

# REVC

Problem
In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string s^c formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement s^c of s.

Sample Dataset
AAAACCCGGT
Sample Output
ACCGGGTTTT

In [84]:
seq = "AAAACCCGGT"
char_to_replace = {"A" : "T" , "C" : "G" , "T" :"A", "G" : "C"}
seq=seq.translate(str.maketrans(char_to_replace))
seq[::-1]

'ACCGGGTTTT'

# FIB

Problem
Given: Positive integers n≤40 and k≤5.

Return: The total number of rabbit pairs that will be present after n months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).

Sample Dataset
5 3
Sample Output
19
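
The problem statement corresponds to the recurrence F_n = F_(n-1) + k * F_(n-2) with F_1 = F_2 = 1: the pairs alive last month survive, and every pair that was alive two months ago is now of reproduction age and adds k new pairs. A minimal sketch that lists the population month by month follows. The function name `rabbit_pairs_by_month` is an illustrative choice, not taken from the notebook.

```python
def rabbit_pairs_by_month(n, k):
    """Return [F_1, ..., F_n] for F_m = F_(m-1) + k * F_(m-2), with F_1 = F_2 = 1."""
    pairs = [1, 1]
    for _ in range(n - 2):
        # last month's pairs survive; pairs from two months ago each add k new pairs
        pairs.append(pairs[-1] + k * pairs[-2])
    return pairs[:n]

fib_sample = rabbit_pairs_by_month(5, 3)
print(fib_sample)  # [1, 1, 4, 7, 19]
```

The last entry matches the sample output of 19 for n = 5 and k = 3.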

In [65]:
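def fibo_rabbit(months, progeny):
    # child: pairs too young to reproduce; parent: reproduction-age pairs
    child, parent = 1, 0
    for i in range(months - 1):
        # every parent pair produces `progeny` new pairs; last month's children mature
        child, parent = parent * progeny, child + parent
    return child + parent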

# TEST
fibo_rabbit(5,3)


19

# GC

Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
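
The two claims in this paragraph can be checked with a small sketch. The helper names `gc_content` and `reverse_complement` are illustrative, and the code assumes the string contains only 'A', 'C', 'G' and 'T'.

```python
def gc_content(s):
    """Percentage of symbols in s that are 'C' or 'G'."""
    return 100 * (s.count("G") + s.count("C")) / len(s)

def reverse_complement(s):
    """Complement every symbol, then reverse the string."""
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

example = "AGCTATAG"
gc_example = gc_content(example)
gc_example_rc = gc_content(reverse_complement(example))
print(gc_example, gc_example_rc)  # 37.5 37.5
```

Complementing swaps 'C' with 'G' and 'A' with 'T', so the combined number of 'C' and 'G' symbols is unchanged, which is why both values agree.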

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated.

Sample Dataset
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
Sample Output
Rosalind_0808
60.919540
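
For readers without Biopython, the FASTA layout described above can be parsed with plain Python. The following is a minimal sketch run on the sample dataset. `parse_fasta` and `gc_percent` are hypothetical helper names, and the parser assumes every record starts with a '>' line.

```python
def parse_fasta(text):
    """Return {label: sequence} for FASTA-formatted text."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:]          # a '>' line starts a new record
            records[label] = ""
        elif label is not None:
            records[label] += line    # sequence lines are concatenated
    return records

def gc_percent(seq):
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)

sample_fasta = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
"""

records = parse_fasta(sample_fasta)
best_id = max(records, key=lambda name: gc_percent(records[name]))
best_gc = gc_percent(records[best_id])
print(best_id)
print(f"{best_gc:.6f}")
```

On the sample this selects Rosalind_0808, whose 87 symbols contain 53 'C' or 'G', giving 60.919540 to six decimal places.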


In [87]:
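from Bio import SeqIO
# GC returns a percentage; newer Biopython releases replace it with
# gc_fraction, which returns a fraction between 0 and 1
from Bio.SeqUtils import GC

gc_content = 0
seq_name = ""
with open("rosalind_gc.txt", "r") as fasta_file:
    for record in SeqIO.parse(fasta_file, "fasta"):
        current_gc = GC(record.seq)  # GC percentage of this record
        if current_gc > gc_content:
            gc_content = current_gc
            seq_name = record.id
print(seq_name)
print(gc_content)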

Rosalind_5991
52.17391304347826


# SUBS
Finding a Motif in DNA

Problem


Given: Two DNA strings s and t (each of length at most 1 kbp).

Return: All locations of t as a substring of s.

Sample Dataset
GATATATGCATATACTT
ATAT
Sample Output
2 4 10



In [88]:
s = "GATATATGCATATACTT"
t = "ATAT"
for position in range(len(s)):
    if s[position:].startswith(t):
        print(position+1)

2
4
10


In [67]:
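import re

with open("rosalind_subs.txt") as f:
    s, t = f.read().split()

# A zero-width lookahead lets finditer report overlapping occurrences of t
pattern = re.compile("(?=(" + re.escape(t) + "))")

for match in pattern.finditer(s):
    print(match.start() + 1)  # positions are reported 1-based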

15
91
98
132
163
223
239
267
321
347
354
361
376
411
562
607
643
720
727
763
791
808
842
849


# LIA

Given: Two positive integers k (k≤7) and N (N≤2^k). In this problem, we begin with Tom, who in the 0th generation has genotype Aa Bb. Tom has two children in the 1st generation, each of whom has two children, and so on. Each organism always mates with an organism having genotype Aa Bb.

Return: The probability that at least N Aa Bb organisms will belong to the k-th generation of Tom's family tree (don't count the Aa Bb mates at each level). Assume that Mendel's second law holds for the factors.

Sample Dataset
2 1
Sample Output
0.684
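
The key observation is that a child of any parent mated with an Aa partner is heterozygous with probability 1/2, whatever the parent's genotype. With two independent factors, every organism in the tree is therefore Aa Bb with probability 1/4, and the number of Aa Bb organisms among the 2^k members of generation k is binomially distributed. The sketch below checks the 1/2 claim by enumeration and then sums the binomial tail. The function names are illustrative only.

```python
from fractions import Fraction
from itertools import product
from math import comb

def prob_heterozygous(parent, mate="Aa"):
    """Probability that a child of parent x mate is heterozygous for one factor."""
    # one allele from each parent: four equally likely combinations
    children = [a + b for a, b in product(parent, mate)]
    hetero = sum(1 for child in children if child[0] != child[1])
    return Fraction(hetero, len(children))

# The result is 1/2 for every possible parent genotype
per_factor = {g: prob_heterozygous(g) for g in ("AA", "Aa", "aa")}

# Two independent factors (Mendel's second law): 1/2 * 1/2 = 1/4 for Aa Bb
p = Fraction(1, 2) ** 2

def at_least(k, n, p):
    """P(at least n successes among 2**k independent trials with probability p)."""
    total = 2 ** k
    return sum(comb(total, i) * p**i * (1 - p)**(total - i)
               for i in range(n, total + 1))

lia_sample = at_least(2, 1, p)
print(float(lia_sample))  # 0.68359375
```

For the sample (k = 2, N = 1) this is 1 - (3/4)^4 = 175/256, which rounds to the sample output 0.684.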

In [9]:
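import math
k = 2
N = 1
P = 2**k  # number of organisms in generation k

# Each organism is Aa Bb with probability 0.25, independently of the others,
# so the number of Aa Bb organisms follows a binomial distribution
probability1 = 0
for i in range(N, P + 1):
    probability2 = math.comb(P, i) * (0.25**i) * (0.75**(P - i))
    probability1 += probability2
print(probability1)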

0.68359375


# IEV
Calculating Expected Offspring

Problem

Given: Six nonnegative integers, each of which does not exceed 20,000. The integers correspond to the number of couples in a population possessing each genotype pairing for a given factor. In order, the six given integers represent the number of couples having the following genotypes:

AA-AA
AA-Aa
AA-aa
Aa-Aa
Aa-aa
aa-aa

Return: The expected number of offspring displaying the dominant phenotype in the next generation, under the assumption that every couple has exactly two offspring.

Sample Dataset
1 0 0 1 0 1
Sample Output
3.5
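
The expected value can be written out directly. As notation introduced here for illustration, let $x_1, \dots, x_6$ be the six couple counts in the order listed above. A child shows the dominant phenotype with probability 1 for the first three pairings, 3/4 for Aa-Aa, 1/2 for Aa-aa and 0 for aa-aa, and each couple has two children, so by linearity of expectation:

$$
E = 2\left(x_1 + x_2 + x_3 + \tfrac{3}{4}x_4 + \tfrac{1}{2}x_5 + 0 \cdot x_6\right)
$$

For the sample dataset this gives

$$
E = 2\left(1 + 0 + 0 + \tfrac{3}{4}\cdot 1 + \tfrac{1}{2}\cdot 0 + 0 \cdot 1\right) = 3.5
$$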

In [16]:
couples = [1, 0, 0, 1, 0, 1]
# Probability that one child shows the dominant phenotype, per genotype pairing:
# AA-AA, AA-Aa, AA-aa, Aa-Aa, Aa-aa, aa-aa
dominant_prob = [1, 1, 1, 3 / 4, 1 / 2, 0]
# Every couple has exactly two offspring
expected = [2 * n * p for n, p in zip(couples, dominant_prob)]
print(sum(expected))


3.5


# CONS
Consensus and Profile

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
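
To make the two outputs concrete, here is a minimal sketch on a toy collection of three strings. The strings are made up for illustration and are not a Rosalind dataset, and `profile_and_consensus` is a hypothetical helper name. The profile matrix stores, for each symbol, how often it occurs in each column; the consensus string takes a most frequent symbol from each column.

```python
from collections import Counter

def profile_and_consensus(strings):
    """Return (profile, consensus) for equal-length DNA strings."""
    # zip(*strings) walks the collection column by column
    columns = [Counter(col) for col in zip(*strings)]
    profile = {base: [c[base] for c in columns] for base in "ACGT"}
    # on ties, max keeps the first symbol in "ACGT" order
    consensus = "".join(max("ACGT", key=lambda b: c[b]) for c in columns)
    return profile, consensus

toy_strings = ["ATGC", "ATGA", "CTGC"]  # illustrative input only
toy_profile, toy_consensus = profile_and_consensus(toy_strings)

print(toy_consensus)  # ATGC
for base in "ACGT":
    print(f"{base}:", " ".join(map(str, toy_profile[base])))
```

Here the first column holds A, A, C, so 'A' wins with a count of 2, and the last column holds C, A, C, so 'C' wins.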


In [23]:
with open('rosalind_cons.txt', 'r') as f:
    file = f.read()

# Each chunk after a '>' holds one record: the label line, then the sequence,
# which may be wrapped over several lines and is joined back into one string
sequences = []
for chunk in file.split('>'):
    lines = chunk.strip().splitlines()
    if not lines:
        continue
    sequences.append(''.join(line.strip() for line in lines[1:]))

# Profile matrix

n = len(sequences[0])

profile_matrix = {
    'A': [0]*n,
    'C': [0]*n,
    'G': [0]*n,
    'T': [0]*n,
    }

for dna in sequences:
    for position, nucleotide in enumerate(dna):
        profile_matrix[nucleotide][position] += 1

#consensus sequence formation

result = []
for position in range(n):
    max_count = 0
    max_nucleotide = None
    for nucleotide in profile_matrix:
        if profile_matrix[nucleotide][position] > max_count:
            max_count = profile_matrix[nucleotide][position]
            max_nucleotide = nucleotide
    result.append(max_nucleotide)

print(''.join(result))
print('A:', ' '.join(map(str, profile_matrix['A'])))
print('C:', ' '.join(map(str, profile_matrix['C'])))
print('G:', ' '.join(map(str, profile_matrix['G'])))
print('T:', ' '.join(map(str, profile_matrix['T'])))




ACTTTAAGTGGAAAGTAGGAATGGGCGATCTACGAATCCCTCAACAGTTTGATAACCAGC
A: 4 2 2 2 0 3 6 1 2 3 1 4 4 4 3 1 4 1 1 4 3 0 0 2 2 1 1 3 3 3 2 3 2 2 5 3 2 2 2 2 3 0 4 4 1 4 3 3 3 2 1 5 2 4 5 2 3 5 3 2
C: 3 3 2 0 3 3 2 1 1 0 4 3 1 3 0 2 2 2 2 2 2 3 4 2 2 5 4 2 3 4 2 2 3 2 2 1 2 5 4 5 1 4 1 0 4 3 1 1 2 1 2 1 2 1 3 3 4 1 2 5
G: 1 2 2 3 1 3 2 6 3 6 5 3 4 1 4 3 1 4 6 1 2 2 5 5 3 2 5 3 0 2 2 2 3 3 2 3 2 0 1 2 2 4 1 4 3 2 4 2 1 1 4 1 2 3 2 3 1 2 4 2
T: 2 3 4 5 6 1 0 2 4 1 0 0 1 2 3 4 3 3 1 3 3 5 1 1 3 2 0 2 4 1 4 3 2 3 1 3 4 3 3 1 4 2 4 2 2 1 2 4 4 6 3 3 4 2 0 2 2 2 1 1


# PROB
Introduction to Random Strings

Problem
Given: A DNA string s of length at most 100 bp and an array A containing at most 20 numbers between 0 and 1.

Return: An array B having the same length as A in which B[k] represents the common logarithm of the probability that a random string constructed with the GC-content found in A[k] will match s exactly.
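
With GC-content x, each 'G' or 'C' is drawn with probability x/2 and each 'A' or 'T' with probability (1 - x)/2. Since the symbols are independent, the probability of matching s is the product of these per-symbol probabilities, and its common logarithm is the sum of their logarithms. A minimal sketch on a toy input follows. The function name is illustrative, and the code assumes 0 < x < 1 and that s contains only 'A', 'C', 'G' and 'T'.

```python
from math import log10, isclose

def log10_match_probability(s, gc):
    """Common log of the probability that a random string with GC-content gc equals s."""
    gc_count = s.count("G") + s.count("C")
    at_count = len(s) - gc_count  # assumes only A, C, G, T occur in s
    return gc_count * log10(gc / 2) + at_count * log10((1 - gc) / 2)

# Toy input: with gc = 0.5 every symbol has probability 0.25
toy_log_prob = log10_match_probability("AT", 0.5)
print(round(toy_log_prob, 3))  # -1.204
```

Working with counts avoids building a list with one entry per symbol, and summing logarithms avoids the underflow that multiplying many small probabilities could cause.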

In [44]:
from math import log10
with open("rosalind_prob.txt", "r") as file:
    content = file.readlines()
string = content[0].strip()
probability_list = list(map(float, content[1].split()))

res = ""
for i in probability_list:
    temp = []
    for j in string:
        # i is the GC-content: G and C each have probability i/2, A and T each (1-i)/2
        if j == "A" or j == "T":
            temp.append((1 - i) / 2)
        elif j == "G" or j == "C":
            temp.append(i / 2)
    # The log of a product is the sum of the logs
    new_tmp = [log10(y) for y in temp]

    c = round(sum(new_tmp), 3)
    res = res + " " + str(c)
print(res)


 -74.739 -70.019 -64.364 -61.452 -57.975 -57.396 -56.245 -55.975 -55.941 -56.238 -57.322 -57.762 -59.977 -61.866 -66.945 -68.529 -79.939 -84.703
