# Counting Nucleotides in a DNA String

Sample Dataset



AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

logic: To count the occurrences of each nucleotide (A, C, G, T) in the given DNA string, we call the string's count method once per symbol. Let's write a Python function to do this:

In [1]:
def count_nucleotides(dna_string):
    return (
        dna_string.count('A'),
        dna_string.count('C'),
        dna_string.count('G'),
        dna_string.count('T')
    )

# Test with the sample dataset
sample_dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
result = count_nucleotides(sample_dna)
print(" ".join(map(str, result)))

20 12 17 21
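
As an alternative to four separate `count` calls, the tally can be done in a single pass over the string. The sketch below assumes the input contains only the symbols A, C, G, and T; `count_nucleotides_single_pass` is a hypothetical name used for illustration, not part of the original solution.

```python
from collections import Counter

def count_nucleotides_single_pass(dna_string):
    # Counter tallies every symbol in one pass over the string
    counts = Counter(dna_string)
    # A Counter returns 0 for symbols it never saw, so short inputs work too
    return tuple(counts[base] for base in "ACGT")

sample_dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
print(" ".join(map(str, count_nucleotides_single_pass(sample_dna))))
```

For strings of this size either approach is fine; the single pass mainly matters for very long sequences.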


# Conversion from DNA string to RNA string

Sample Dataset:
GATGGAACTTGACTACGTAAATT

logic: In order to solve this problem, we need to create a function that transcribes a DNA string to an RNA string by replacing all occurrences of 'T' with 'U'. Here's a Python function to accomplish this:

In [2]:
def dna_to_rna(dna_string):
    return dna_string.replace('T', 'U')

# Test with the sample dataset
sample_dna = "GATGGAACTTGACTACGTAAATT"
rna_string = dna_to_rna(sample_dna)
print(rna_string)

GAUGGAACUUGACUACGUAAAUU


# Complementing a Strand of DNA

Sample Dataset:  
AAAACCCGGT

logic: To solve this problem, we need to create a function that produces the reverse complement of a given DNA string. Here's a Python function to accomplish this:

In [3]:
def reverse_complement(dna_string):
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    return ''.join(complement[base] for base in reversed(dna_string))

# Test with the sample dataset
sample_dna = "AAAACCCGGT"
result = reverse_complement(sample_dna)
print(result)

ACCGGGTTTT
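
The same result can be sketched with Python's built-in translation tables instead of a dictionary lookup per base. `reverse_complement_translate` and `COMPLEMENT_TABLE` are hypothetical names introduced here, and the sketch assumes an uppercase DNA string.

```python
# Translation table mapping each base to its complement
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def reverse_complement_translate(dna_string):
    # Complement every base, then reverse the result with a slice
    return dna_string.translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement_translate("AAAACCCGGT"))
```

One behavioural difference is worth noting: characters outside A, C, G, and T pass through unchanged here, whereas the dictionary version raises a KeyError.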


# Fibonacci's Rabbits and Recurrence Relations

This problem is a variation of the Fibonacci sequence, often referred to as the "Rabbit Sequence" or "Fibonacci's Rabbits".

Sample Dataset:  
5 3

logic: Here n is the number of months and k is the number of new pairs that each mature pair produces per month. By the second month we have 1 mature pair and 0 immature pairs, because the original pair has matured. In each subsequent month the number of new immature pairs is the number of mature pairs from last month multiplied by k. The number of new mature pairs is the sum of immature and mature pairs from last month (all rabbits mature after one month). We update our mature and immature counts for the next iteration. The total number of pairs is the sum of mature and immature pairs.

In [10]:
def rabbit_pairs(n, k):
    if n <= 2:
        return 1
    
    # State at month 2: the original pair has matured and has not yet bred
    mature = 1  # Mature (reproductive) pairs
    immature = 0  # Immature pairs
    
    # Calculate rabbit pairs for each month
    for month in range(2, n):
        # New immature pairs = mature pairs * k
        new_immature = mature * k
        # New mature pairs = last month's immature pairs + last month's mature pairs
        new_mature = immature + mature
        
        mature = new_mature
        immature = new_immature
        
        total = mature + immature
        print(f"Month {month+1}: {total}")
    
    return mature + immature

# Test with the sample dataset
n, k = 5, 3
result = rabbit_pairs(n, k)
print(f"Final result: {result}")

Month 3: 4
Month 4: 7
Month 5: 19
Final result: 19
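
The mature/immature bookkeeping is equivalent to a two-term recurrence: the mature pairs in a given month are exactly the total pairs of the month before, so total(n) = total(n-1) + k * total(n-2). A minimal sketch of that form follows; `rabbit_pairs_recurrence` is a hypothetical name.

```python
def rabbit_pairs_recurrence(n, k):
    # F(1) = F(2) = 1 and F(n) = F(n-1) + k * F(n-2)
    previous, current = 1, 1
    for _ in range(n - 2):
        previous, current = current, current + k * previous
    return current

print(rabbit_pairs_recurrence(5, 3))
```

With k = 1 this reduces to the ordinary Fibonacci sequence.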


# Computing GC Content

Because of the base pairing relations of the two DNA strands, cytosine and guanine will always appear in equal amounts in a double-stranded DNA molecule. Thus, to analyze the symbol frequencies of DNA for comparison against a database, we compute the molecule's GC-content, or the percentage of its bases that are either cytosine or guanine.

Sample Dataset:
>Sequence_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Sequence_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Sequence_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

logic: To solve this problem, we need to create a function that can parse FASTA format DNA strings, calculate their GC-content, and return the ID of the string with the highest GC-content along with its GC-content percentage. 

In [11]:
def parse_fasta(fasta_string):
    sequences = {}
    current_id = None
    current_seq = []

    for line in fasta_string.split('\n'):
        if line.startswith('>'):
            if current_id:
                sequences[current_id] = ''.join(current_seq)
            current_id = line[1:]
            current_seq = []
        else:
            current_seq.append(line)

    if current_id:
        sequences[current_id] = ''.join(current_seq)

    return sequences

def gc_content(dna_string):
    gc_count = dna_string.count('G') + dna_string.count('C')
    return (gc_count / len(dna_string)) * 100

def highest_gc_content(fasta_string):
    sequences = parse_fasta(fasta_string)
    max_gc = 0
    max_id = ''

    for seq_id, sequence in sequences.items():
        gc = gc_content(sequence)
        if gc > max_gc:
            max_gc = gc
            max_id = seq_id

    return max_id, max_gc

# Test with the sample dataset
sample_data = """>Sequence_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Sequence_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Sequence_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT"""

result_id, result_gc = highest_gc_content(sample_data)
print(result_id)
print(f"{result_gc:.6f}")

Sequence_0808
60.919540
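
Two edge cases are worth a second look: an empty sequence would cause a division by zero, and a dataset with no G or C at all would return an empty ID because the running maximum starts at 0. The sketch below handles both under the assumption that records arrive as an ID-to-sequence dictionary; `gc_percent` and `best_gc` are hypothetical names.

```python
def gc_percent(seq):
    # Treat an empty sequence as 0% rather than dividing by zero
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

def best_gc(records):
    # max() with a key always returns one of the records, even if every value is 0
    scored = ((name, gc_percent(seq)) for name, seq in records.items())
    return max(scored, key=lambda item: item[1])

print(best_gc({"a": "GGCC", "b": "ATGC", "c": "ATAT"}))
```

When several records tie, `max` keeps the first one it sees in dictionary order.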


# Counting Point Mutations and Hamming Distance

Sample Dataset:

Sequence A: GAGCCTACTAACGGGAT , 
Sequence B: CATCGTAATGACGGCCT

logic: To solve this problem, we need to create a function that calculates the Hamming distance between two DNA strings. 

In [14]:
def hamming_distance(s, t):
    if len(s) != len(t):
        raise ValueError("Strings must be of equal length")
    
    return sum(c1 != c2 for c1, c2 in zip(s, t))

# Test with the sample dataset
s = "GAGCCTACTAACGGGAT"
t = "CATCGTAATGACGGCCT"

distance = hamming_distance(s, t)
print(distance)

7


# Mendel's Law of Segregation

Sample Dataset:  2 , 2 , 2

logic: To solve this problem, we need to calculate the probability of producing an offspring with a dominant allele when randomly selecting two organisms from a population. We calculate the total number of organisms and the total number of possible pairs. We then calculate the probability for each scenario that results in a dominant allele. When k (homozygous dominant) mates with anyone, it's always dominant. When m (heterozygous) mates with m, there's a 0.75 chance of dominant allele. When m mates with n (homozygous recessive), there's a 0.5 chance of dominant allele.

In [20]:
def dominant_allele_probability(k, m, n):
    total = k + m + n
    total_pairs = total * (total - 1)  # Number of ordered pairs of distinct organisms
    
    # Probabilities of getting at least one dominant allele:
    
    # k with k
    p_kk = k * (k - 1) / total_pairs
    
    # k with m or n
    p_k_other = k * (m + n) * 2 / total_pairs
    
    # m with m (0.75 chance)
    p_mm = m * (m - 1) / total_pairs * 0.75
    
    # m with n (0.5 chance)
    p_mn = m * n * 2 / total_pairs * 0.5
    
    # Probability of getting a dominant allele
    probability = p_kk + p_k_other + p_mm + p_mn
    
    return probability

# Test with the sample dataset
k, m, n = 2, 2, 2
result = dominant_allele_probability(k, m, n)
print(f"{result:.5f}")

0.78333
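
A useful cross-check is to compute the complement: the probability that the offspring is homozygous recessive, subtracted from 1. Only three pairings can produce a recessive offspring, so there are fewer terms to get wrong. This is a sketch under the same k, m, n conventions; `dominant_probability_complement` is a hypothetical name.

```python
def dominant_probability_complement(k, m, n):
    total = k + m + n
    ordered_pairs = total * (total - 1)
    # A recessive offspring needs a recessive allele from both parents
    p_nn = n * (n - 1) / ordered_pairs           # aa x aa: always recessive
    p_mn = 2 * m * n / ordered_pairs * 0.5       # Aa x aa in either order: 1/2
    p_mm = m * (m - 1) / ordered_pairs * 0.25    # Aa x Aa: 1/4
    return 1 - (p_nn + p_mn + p_mm)

print(f"{dominant_probability_complement(2, 2, 2):.5f}")
```

Both routes should agree to floating-point precision for any population with at least two organisms.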


# Translating RNA into Protein

Sample Dataset : AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA

logic: To solve this problem, we need to translate an RNA sequence into a protein sequence using the RNA codon table. We define a codon_table dictionary that maps each RNA codon to its corresponding amino acid. The rna_to_protein function takes an RNA string as input and translates it to a protein string. It iterates over the RNA string in steps of 3 (each codon is 3 nucleotides). For each codon, it looks up the corresponding amino acid in the codon_table. If it encounters a 'Stop' codon, it stops the translation. It builds up the protein string one amino acid at a time. We test the function with the sample dataset provided.

In [22]:
def rna_to_protein(rna_string):
    # RNA codon table
    codon_table = {
        'UUU': 'F', 'CUU': 'L', 'AUU': 'I', 'GUU': 'V',
        'UUC': 'F', 'CUC': 'L', 'AUC': 'I', 'GUC': 'V',
        'UUA': 'L', 'CUA': 'L', 'AUA': 'I', 'GUA': 'V',
        'UUG': 'L', 'CUG': 'L', 'AUG': 'M', 'GUG': 'V',
        'UCU': 'S', 'CCU': 'P', 'ACU': 'T', 'GCU': 'A',
        'UCC': 'S', 'CCC': 'P', 'ACC': 'T', 'GCC': 'A',
        'UCA': 'S', 'CCA': 'P', 'ACA': 'T', 'GCA': 'A',
        'UCG': 'S', 'CCG': 'P', 'ACG': 'T', 'GCG': 'A',
        'UAU': 'Y', 'CAU': 'H', 'AAU': 'N', 'GAU': 'D',
        'UAC': 'Y', 'CAC': 'H', 'AAC': 'N', 'GAC': 'D',
        'UAA': 'Stop', 'CAA': 'Q', 'AAA': 'K', 'GAA': 'E',
        'UAG': 'Stop', 'CAG': 'Q', 'AAG': 'K', 'GAG': 'E',
        'UGU': 'C', 'CGU': 'R', 'AGU': 'S', 'GGU': 'G',
        'UGC': 'C', 'CGC': 'R', 'AGC': 'S', 'GGC': 'G',
        'UGA': 'Stop', 'CGA': 'R', 'AGA': 'R', 'GGA': 'G',
        'UGG': 'W', 'CGG': 'R', 'AGG': 'R', 'GGG': 'G'
    }

    protein = []
    for i in range(0, len(rna_string), 3):
        codon = rna_string[i:i+3]
        if len(codon) == 3:  # Ensure we have a full codon
            amino_acid = codon_table[codon]
            if amino_acid == 'Stop':
                break
            protein.append(amino_acid)

    return ''.join(protein)

# Test with the sample dataset
rna_string = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
protein_string = rna_to_protein(rna_string)
print(protein_string)

MAMAPRTEINSTRING


# Finding a Motif in DNA

Sample Dataset : GATATATGCATATACTT , ATAT

logic: To solve this problem, we need to find all occurrences of a substring within a larger string. We define a function that takes two parameters (s, t). We initialize an empty list, locations, and iterate through the main string s, but only up to the last position where the substring t still fits. At each position we check whether the slice of s starting there, with length equal to that of t, matches t. If it does, we add the position to our locations list. We add 1 to the index because the problem specifies 1-indexed positions. After the function definition, we read the input strings s and t, call our function with these inputs, and print the results, joining the list of locations with spaces.

In [25]:
def find_substring_locations(s, t):
    locations = []
    for i in range(len(s) - len(t) + 1):
        if s[i:i+len(t)] == t:
            locations.append(i + 1)  # Add 1 because positions are 1-indexed
    return locations

# Get input from the user
print("Enter the main DNA string and the substring on separate lines:")
s = input().strip()
t = input().strip()

# Find locations
locations = find_substring_locations(s, t)

# Print results
print(' '.join(map(str, locations)))

Enter the main DNA string and the substring on separate lines:


 GATATATGCATATACTT
 ATAT


2 4 10
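
The slice-by-slice scan can also be sketched with `str.find`, which lets the string implementation do the searching. The key detail is resuming one character after each hit so that overlapping occurrences are not skipped. `find_motif_positions` is a hypothetical name.

```python
def find_motif_positions(s, t):
    positions = []
    start = s.find(t)
    while start != -1:
        positions.append(start + 1)  # report 1-indexed positions
        # Resume one character later so overlapping occurrences are found
        start = s.find(t, start + 1)
    return positions

print(" ".join(map(str, find_motif_positions("GATATATGCATATACTT", "ATAT"))))
```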


# Consensus and Profile 
### Finding an average-case strand to represent the most likely common ancestor of the given strands



Sample Dataset :
>Strand_1
ATCCAGCT
>Strand_2
GGGCAACT
>Strand_3
ATGGATCT
>Strand_4
AAGCAACC
>Strand_5
TTGGAACT
>Strand_6
ATGCCATT
>Strand_7
ATGGCACT

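logic: Copy the entire code below into a Python file and run it. The script prompts you to enter the DNA strings, either in FASTA format or as a simple list with one string per line. Press Enter after each line, and enter an empty line to finish the input. The script builds the profile matrix (the count of A, C, G, and T at each position), derives the consensus string from the most frequent symbol in each column, and prints the consensus string followed by the profile matrix.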

In [30]:
def parse_input(input_string):
    sequences = {}
    lines = input_string.strip().split('\n')
    
    if lines and lines[0].startswith('>'):  # FASTA format
        current_id = ''
        current_sequence = ''
        for line in lines:
            if line.startswith('>'):
                if current_id:
                    sequences[current_id] = current_sequence
                current_id = line[1:]
                current_sequence = ''
            else:
                current_sequence += line.strip()
        if current_id:
            sequences[current_id] = current_sequence
    else:  # Simple list of sequences
        for i, line in enumerate(lines, 1):
            if line.strip():
                sequences[f'Sequence_{i}'] = line.strip()
    
    return sequences

def create_profile_matrix(dna_strings):
    if not dna_strings:
        raise ValueError("No DNA strings provided")
    length = len(dna_strings[0])
    profile = {'A': [0] * length, 'C': [0] * length, 'G': [0] * length, 'T': [0] * length}
    
    for dna in dna_strings:
        if len(dna) != length:
            raise ValueError(f"All DNA strings must have the same length. Found lengths {length} and {len(dna)}")
        for i, nucleotide in enumerate(dna):
            if nucleotide not in profile:
                raise ValueError(f"Invalid nucleotide {nucleotide} found in DNA string")
            profile[nucleotide][i] += 1
    
    return profile

def find_consensus(profile):
    if not profile['A']:
        raise ValueError("Profile matrix is empty")
    consensus = ''
    for i in range(len(profile['A'])):
        max_count = 0
        max_nucleotide = ''
        for nucleotide in 'ACGT':
            if profile[nucleotide][i] > max_count:
                max_count = profile[nucleotide][i]
                max_nucleotide = nucleotide
        consensus += max_nucleotide
    return consensus

# Get input from the user
print("Enter the DNA strings (in FASTA format or as a simple list, press Enter twice to finish):")
input_data = ''
while True:
    line = input()
    if line == '':
        break
    input_data += line + '\n'

print("\nReceived input:")
print(input_data)

try:
    # Parse input
    sequences = parse_input(input_data)
    if not sequences:
        raise ValueError("No valid sequences found in input")
    
    print("\nParsed sequences:")
    for id, seq in sequences.items():
        print(f"{id}: {seq}")
    
    dna_strings = list(sequences.values())

    # Create profile matrix
    profile = create_profile_matrix(dna_strings)

    # Find consensus string
    consensus = find_consensus(profile)

    # Print results
    print("\nResults:")
    print(consensus)
    for nucleotide in 'ACGT':
        print(f"{nucleotide}: {' '.join(map(str, profile[nucleotide]))}")

except ValueError as e:
    print(f"Error: {e}")

Enter the DNA strings (in FASTA format or as a simple list, press Enter twice to finish):


 ATCCAGCT
 GGGCAACT
 ATGGATCT
 AAGCAACC
 TTGGAACT
 ATGCCATT
 ATGGCACT
 



Received input:
ATCCAGCT
GGGCAACT
ATGGATCT
AAGCAACC
TTGGAACT
ATGCCATT
ATGGCACT


Parsed sequences:
Sequence_1: ATCCAGCT
Sequence_2: GGGCAACT
Sequence_3: ATGGATCT
Sequence_4: AAGCAACC
Sequence_5: TTGGAACT
Sequence_6: ATGCCATT
Sequence_7: ATGGCACT

Results:
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
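
The column-wise counting can be sketched more compactly by transposing the strings with `zip`. This version assumes all strings have equal length and contain only A, C, G, and T (it does not validate either, unlike the script above); `profile_and_consensus` is a hypothetical name.

```python
def profile_and_consensus(dna_strings):
    # zip(*dna_strings) yields one tuple of bases per column
    columns = list(zip(*dna_strings))
    profile = {base: [column.count(base) for column in columns] for base in "ACGT"}
    # max() keeps the first base in "ACGT" order when counts tie
    consensus = "".join(max("ACGT", key=column.count) for column in columns)
    return consensus, profile

strands = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
           "TTGGAACT", "ATGCCATT", "ATGGCACT"]
consensus, profile = profile_and_consensus(strands)
print(consensus)
for base in "ACGT":
    print(f"{base}: {' '.join(map(str, profile[base]))}")
```

Note that `zip` silently truncates to the shortest string, so the explicit length check in the original script remains the safer choice for untrusted input.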



# Mortal Fibonacci Rabbits

### A depiction of a rabbit situation in which rabbits live for (m=3) months (meaning that they reproduce only twice before dying).

Sample Dataset : 
6 3

logic: To solve this problem, we modify the Fibonacci recurrence to account for rabbits dying after m months. We define a function mortal_fibonacci_rabbits that takes two parameters (n, m). We initialize a list rabbits of length m, where rabbits[i] is the number of pairs that are i months old, starting with 1 newborn pair in the first month. For each month up to n, the number of newborn pairs is the sum of all mature pairs (every age group except the current newborns at index 0). We then shift the population by one month, inserting the newborns at the front and dropping the oldest group (which dies after m months). After n months, we return the sum over all age groups. We get the input from the user and print the result.

In [32]:
def mortal_fibonacci_rabbits(n, m):
    # Initialize the rabbit population for the first m months
    rabbits = [0] * m
    rabbits[0] = 1  # First month, we have 1 pair of rabbits

    # Calculate the rabbit population for each month
    for month in range(1, n):
        # Calculate new rabbits born this month
        new_rabbits = sum(rabbits[1:])  # All rabbits except newborns reproduce

        # Shift the rabbit population, removing rabbits that die
        rabbits = [new_rabbits] + rabbits[:-1]

    # Return the total number of rabbit pairs alive
    return sum(rabbits)

# Get input from the user
n, m = map(int, input("Enter n and m separated by a space: ").split())

# Calculate and print the result
result = mortal_fibonacci_rabbits(n, m)
print(result)

Enter n and m separated by a space:  6 3 


4
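
The shifting of age groups maps naturally onto a fixed-length deque, which drops the oldest group automatically when a new one is pushed at the front. This is a sketch of the same idea, not a different algorithm; `mortal_rabbits_deque` is a hypothetical name.

```python
from collections import deque

def mortal_rabbits_deque(n, m):
    # ages[i] = number of pairs that are i months old
    ages = deque([1] + [0] * (m - 1), maxlen=m)
    for _ in range(n - 1):
        # Every pair except this month's newborns (index 0) breeds
        newborns = sum(ages) - ages[0]
        # With maxlen set, appendleft discards the oldest group on the right
        ages.appendleft(newborns)
    return sum(ages)

print(mortal_rabbits_deque(6, 3))
```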


# Overlap Graphs 
### A Brief Introduction to Graph Theory

Sample Dataset :
>Sequence_0498
AAATAAA
>Sequence_2391
AAATTTT
>Sequence_2323
TTTTCCC
>Sequence_0442
AAATCCC
>Sequence_5013
GGGTGGG

logic: The overlap_graph function takes a dictionary whose keys are sequence IDs and whose values are the DNA sequences. It checks each ordered pair of distinct sequences to see if the k-length suffix of one matches the k-length prefix of the other, and records a directed edge when it does. The script reads the strings from the user one per line in the form 'Sequence_ID : DNASEQUENCE' rather than raw FASTA, and an empty line finishes the input. We then create the overlap graph with k=3 using our overlap_graph function, and the script outputs the adjacency list for the overlap graph.

In [41]:
def overlap_graph(sequences, k):
    adjacency_list = []
    for id1, seq1 in sequences.items():
        suffix = seq1[-k:]
        for id2, seq2 in sequences.items():
            if id1 != id2 and seq2.startswith(suffix):
                adjacency_list.append((id1, id2))
    return adjacency_list

# Get input from the user
print("Enter the DNA strings one at a time (press Enter without input to finish):")
sequences = {}
while True:
    input_data = input().strip()
    if not input_data:
        break
    parts = input_data.split(':')
    if len(parts) == 2:
        seq_id = parts[0].strip()
        seq = parts[1].strip()
        sequences[seq_id] = seq
    else:
        print("Invalid input format. Please use 'Sequence_ID : DNASEQUENCE'")

# Create overlap graph with k=3
k = 3
graph = overlap_graph(sequences, k)

# Print results
if graph:
    for edge in graph:
        print(f"{edge[0]} {edge[1]}")
else:
    print("No overlaps found.")







Enter the DNA strings one at a time (press Enter without input to finish):


 Sequence_0498 : AAATAAA
 Sequence_2391 : AAATTTT
 Sequence_2323 : TTTTCCC
 Sequence_0442 : AAATCCC
 Sequence_5013 : GGGTGGG
 


Sequence_0498 Sequence_2391
Sequence_0498 Sequence_0442
Sequence_2391 Sequence_2323
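
The pairwise comparison is quadratic in the number of sequences. For larger inputs, one possible refinement is to group the sequences by prefix first, so that each suffix needs only a dictionary lookup. The sketch below assumes the same ID-to-sequence dictionary; `overlap_graph_indexed` is a hypothetical name.

```python
from collections import defaultdict

def overlap_graph_indexed(sequences, k):
    # Group sequence IDs by their k-length prefix
    by_prefix = defaultdict(list)
    for seq_id, seq in sequences.items():
        by_prefix[seq[:k]].append(seq_id)

    edges = []
    for seq_id, seq in sequences.items():
        # .get avoids creating empty entries for suffixes that match nothing
        for other_id in by_prefix.get(seq[-k:], []):
            if other_id != seq_id:
                edges.append((seq_id, other_id))
    return edges

sample = {
    "Sequence_0498": "AAATAAA",
    "Sequence_2391": "AAATTTT",
    "Sequence_2323": "TTTTCCC",
    "Sequence_0442": "AAATCCC",
    "Sequence_5013": "GGGTGGG",
}
for source, target in overlap_graph_indexed(sample, 3):
    print(source, target)
```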


# Calculating Expected Offspring

Sample Dataset : 1 0 0 1 0 1

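logic: We define a function expected_dominant_offspring that takes a list of the number of couples for each genotype pairing. Each couple is assumed to produce exactly two offspring, so the expected count is 2 times the number of couples times the probability of a dominant phenotype, summed over the six pairings. The probabilities for each genotype pairing to produce offspring with the dominant phenotype are: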

AA-AA: 1 (100%)
AA-Aa: 1 (100%)
AA-aa: 1 (100%)
Aa-Aa: 0.75 (75%)
Aa-aa: 0.5 (50%)
aa-aa: 0 (0%)

In [42]:
def expected_dominant_offspring(couples):
    # Probability of dominant phenotype for each genotype pairing
    probabilities = [1, 1, 1, 0.75, 0.5, 0]
    
    # Calculate expected value
    expected = sum(c * p * 2 for c, p in zip(couples, probabilities))
    
    return expected

# Get input from the user
input_data = input("Enter six integers separated by spaces: ")
couples = list(map(int, input_data.split()))

if len(couples) != 6:
    print("Error: You must enter exactly 6 integers.")
else:
    result = expected_dominant_offspring(couples)
    print(f"{result:.1f}")

Enter six integers separated by spaces:  1 0 0 1 0 1


3.5


# Finding a Shared Motif

Sample Dataset : 
>Sequence_1
GATTACA
>Sequence_2
TAGACCA
>Sequence_3
ATACA

logic: To solve this problem, we'll implement a function that finds the longest common substring of a collection of DNA strings.

In [47]:
def parse_input(input_string):
    sequences = {}
    for line in input_string.split('\n'):
        parts = line.strip().split()
        if len(parts) == 2:
            sequences[parts[0]] = parts[1]
    return sequences

def longest_common_substring(strings):
    if not strings:
        return []
    
    ref = strings[0]
    longest = []
    max_length = 0
    
    for i in range(len(ref)):
        for j in range(i + 1, len(ref) + 1):
            substring = ref[i:j]
            
            if all(substring in s for s in strings[1:]):
                if len(substring) > max_length:
                    longest = [substring]
                    max_length = len(substring)
                elif len(substring) == max_length and substring not in longest:
                    longest.append(substring)
    
    return longest

# Get input from the user
print("Enter the DNA strings (press Enter twice to finish):")
input_data = ''
while True:
    line = input()
    if line == '':
        break
    input_data += line + '\n'

# Parse input
sequences = parse_input(input_data)

# Find longest common substrings
results = longest_common_substring(list(sequences.values()))

# Print results
print("Longest common substrings:")
for result in results:
    print(result)

Enter the DNA strings (press Enter twice to finish):


 Sequence_1 GATTACA
 Sequence_2 TAGACCA
 Sequence_3 ATACA
 


Longest common substrings:
TA
AC
CA
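
If only one longest common substring is needed, a possible shortcut is to search from the longest candidate length downwards and stop at the first hit. The sketch below draws candidates from the shortest string, since no common substring can be longer than it; `shared_motif` is a hypothetical name.

```python
def shared_motif(strings):
    # Candidates come from the shortest string: nothing longer can be shared
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate  # first hit at the greatest length wins
    return ""

print(shared_motif(["GATTACA", "TAGACCA", "ATACA"]))
```

This returns a single answer rather than every tie, which is a deliberate trade-off for an earlier exit.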


# Mendel's Law of Independent Assortment

Sample Dataset : 2 1

logic: We import the comb function from the math module to calculate combinations. The binomial probability formula we're using is:     P(X ≥ N) = Σ(i=N to 2^k) C(2^k, i) * p^i * (1-p)^(2^k-i)
Where: X is the number of AaBb organisms in generation k, which contains 2^k organisms; C(n, i) is the number of ways to choose i items from n items; and p is the probability of an individual being AaBb (0.25).

In [49]:
from math import comb

def probability_AaBb(k, N):
    # Probability of an offspring being AaBb
    p_AaBb = 0.25
    
    # Total number of organisms in k-th generation
    total_organisms = 2**k
    
    # Calculate probability of at least N AaBb organisms
    probability = 0
    for i in range(N, total_organisms + 1):
        probability += comb(total_organisms, i) * (p_AaBb**i) * ((1-p_AaBb)**(total_organisms-i))
    
    return probability

# Get input from user
k, N = map(int, input("Enter k and N separated by a space: ").split())

# Calculate and print the result
result = probability_AaBb(k, N)
print(f"{result:.3f}")

Enter k and N separated by a space:  2 1


0.684
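
When N is small relative to 2^k, it can be cheaper to sum the terms below N and subtract from 1. This sketch uses the same binomial model with p = 0.25; `probability_at_least` is a hypothetical name.

```python
from math import comb

def probability_at_least(k, N, p=0.25):
    total = 2 ** k  # organisms in generation k
    # P(X >= N) = 1 - P(X < N)
    below = sum(comb(total, i) * p**i * (1 - p)**(total - i) for i in range(N))
    return 1 - below

print(f"{probability_at_least(2, 1):.3f}")
```

For the sample input this is simply 1 - 0.75^4, the probability that not all four organisms fail to be AaBb.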


# Finding a Protein Motif

Sample Dataset :             
A2Z669 , 
B5ZC00 , 
P07204_TRBM_HUMAN , 
P20840_SAG1_YEAST

logic: To solve this problem, we need to create a script that can fetch protein sequences from UniProt and then search for the N-glycosylation motif in these sequences, that is, N, followed by any amino acid except P, followed by S or T, followed by any amino acid except P. We import the re module for regular expressions and the requests module to fetch data from the UniProt website. fetch_protein_sequence(uniprot_id): This function fetches the protein sequence in FASTA format from UniProt, trying several URL formats in turn, and returns just the sequence (without the header). find_n_glycosylation_motif(sequence): This function uses a regular expression to find all occurrences of the N-glycosylation motif in the sequence. It returns a list of starting positions (1-indexed). process_proteins(uniprot_ids): This function processes each UniProt ID, fetches its sequence, finds the motif positions, and prints the results if any motifs are found. We get the input UniProt IDs from the user.

In [55]:
import re
import requests

def fetch_protein_sequence(uniprot_id):
    # Try different URL formats
    urls = [
        f"https://www.uniprot.org/uniprot/{uniprot_id}.fasta",
        f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta",
        f"https://www.uniprot.org/uniprot/{uniprot_id.split('_')[0]}.fasta"
    ]
    
    for url in urls:
        try:
            response = requests.get(url)
            response.raise_for_status()
            lines = response.text.split('\n')
            return ''.join(lines[1:])  # Skip the header line
        except requests.exceptions.RequestException:
            continue
    
    print(f"Error: Failed to fetch sequence for {uniprot_id}")
    return None

def find_n_glycosylation_motif(sequence):
    # Zero-width lookahead so that overlapping motif occurrences are all reported;
    # a consuming pattern would skip a match that starts inside the previous one
    pattern = r'(?=N[^P][ST][^P])'
    matches = re.finditer(pattern, sequence)
    return [match.start() + 1 for match in matches]  # +1 for 1-based indexing

def process_proteins(uniprot_ids):
    for uniprot_id in uniprot_ids:
        sequence = fetch_protein_sequence(uniprot_id)
        if sequence:
            motif_positions = find_n_glycosylation_motif(sequence)
            if motif_positions:
                print(uniprot_id)
                print(' '.join(map(str, motif_positions)))

# Get input from the user
print("Enter UniProt IDs (one per line, press Enter twice to finish):")
uniprot_ids = []
while True:
    line = input().strip()
    if not line:
        break
    uniprot_ids.append(line)

# Process the proteins
process_proteins(uniprot_ids)

Enter UniProt IDs (one per line, press Enter twice to finish):


 B5ZC00
 A2Z669
 P07204_TRBM_HUMAN
 P20840_SAG1_YEAST
 


B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614
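
Motif occurrences can overlap, and `re.finditer` reports only non-overlapping matches of a consuming pattern. The offline check below compares a consuming pattern with a zero-width lookahead on a made-up fragment; the string `NNTSA` is a hypothetical example, not taken from any of the proteins above.

```python
import re

# Hypothetical fragment with two overlapping motif occurrences (positions 1 and 2)
fragment = "NNTSA"

# A consuming pattern swallows four characters per match, hiding the second hit
consuming = [m.start() + 1 for m in re.finditer(r"N[^P][ST][^P]", fragment)]
# A lookahead matches without consuming, so every start position is tested
lookahead = [m.start() + 1 for m in re.finditer(r"(?=N[^P][ST][^P])", fragment)]

print(consuming)
print(lookahead)
```

No network access is needed for this check, which makes it a convenient regression test for the motif search on its own.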


# Inferring mRNA from Protein

Sample Dataset :  
MA

logic: In order to solve this problem we need to calculate the number of possible RNA sequences that could produce a given protein sequence, considering the genetic code and the stop codon. We start with a total of 1 and iterate through each amino acid in the protein, and for each amino acid, we multiply the total by the number of codons that can encode it. We use the modulo operation after each multiplication to keep the number manageable. After processing all amino acids, we multiply by the number of stop codons (3) to account for the end of the protein. We return the final total, which is the number of possible RNA sequences modulo 1,000,000.

In [57]:
def count_rna_sequences(protein):
    modulo = 1000000
    codon_count = {
        'F': 2, 'L': 6, 'S': 6, 'Y': 2, 'C': 2, 'W': 1,
        'P': 4, 'H': 2, 'Q': 2, 'R': 6, 'I': 3, 'M': 1,
        'T': 4, 'N': 2, 'K': 2, 'V': 4, 'A': 4, 'D': 2,
        'E': 2, 'G': 4
    }
    stop_codons = 3  # UAA, UAG, UGA

    total = 1
    for amino_acid in protein:
        total = (total * codon_count[amino_acid]) % modulo

    # Multiply by the number of stop codons for the last position
    total = (total * stop_codons) % modulo

    return total

# Get input from the user
protein = input("Enter the protein sequence: ").strip()

# Calculate and print the result
result = count_rna_sequences(protein)
print(result)

Enter the protein sequence:  MA


12


# Open Reading Frames

Sample Dataset :
>Sequence_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG

logic: An open reading frame (ORF) starts at a start codon and ends at a stop codon, with no other stop codons in between. The script reads a single line of the form 'Sequence_ID DNASEQUENCE' from the user. The find_orfs function does the main work, and it considers both the original sequence and its reverse complement. For each strand, it checks all three reading frames. It looks for start codons ('ATG') and then searches for the next in-frame stop codon. When it finds a complete ORF, it translates it to a protein sequence using a hand-written DNA codon table, so no external libraries are needed. It uses a set to store unique protein sequences. Finally we call find_orfs to get all possible protein sequences and print each one.

In [64]:
def reverse_complement(dna):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(complement[base] for base in reversed(dna))

def translate(seq):
    codon_table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
    protein = ""
    for i in range(0, len(seq), 3):
        codon = seq[i:i+3]
        if len(codon) == 3:
            amino_acid = codon_table.get(codon, 'X')
            if amino_acid == '_':
                break
            protein += amino_acid
    return protein

def find_orfs(seq):
    orfs = set()
    
    for strand in [seq, reverse_complement(seq)]:
        for frame in range(3):
            for start in range(frame, len(strand)-2, 3):
                if strand[start:start+3] == 'ATG':
                    for end in range(start+3, len(strand)-2, 3):
                        codon = strand[end:end+3]
                        if codon in ['TAA', 'TAG', 'TGA']:
                            orf = strand[start:end+3]
                            protein = translate(orf)
                            orfs.add(protein)
                            break
    return list(orfs)

# Get input from the user
input_data = input("Enter the sequence data: ").strip()

# Parse input
parts = input_data.split()
if len(parts) == 2:
    seq_id, dna_seq = parts
else:
    print("Invalid input format. Please use 'Sequence_ID DNASEQUENCE'")
    exit()

# Find ORFs and translate to proteins
proteins = find_orfs(dna_seq)

# Print results
for protein in proteins:
    print(protein)

Enter the sequence data:  Sequence_99 AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG


MTPRLGLESLLE
MLLGSFRLIPKETLIQVAGSSPCNLS
MGMTPRLGLESLLE
M


# Enumerating Gene Orders

### Rearrangements Power Large-Scale Genomic Changes

Sample Dataset : 3

logic: We import the itertools module, which provides the permutations function. We define a function that uses itertools.permutations(range(1, n+1)) to generate all permutations of numbers from 1 to n. It converts the result to a list and counts the total number of permutations. It returns both the total count and the list of permutations.

In [66]:
import itertools

def generate_permutations(n):
    # Generate all permutations
    perms = list(itertools.permutations(range(1, n+1)))
    
    # Count the total number of permutations
    total = len(perms)
    
    return total, perms

# Get input from the user
n = int(input("Enter a positive integer n (≤7): "))

# Generate permutations
total, permutations = generate_permutations(n)

# Print results
print(total)
for perm in permutations:
    print(' '.join(map(str, perm)))

Enter a positive integer n (≤7):  3


6
1 2 3
1 3 2
2 1 3
2 3 1
3 1 2
3 2 1


# Calculating Protein Mass

Sample Dataset : 
SKADYEK

logic: To solve this problem, we need to calculate the total weight of a protein string based on the monoisotopic mass of each amino acid. The script will work for any valid protein string up to 1000 amino acids long, as specified in the problem. It assumes that the input string contains only valid amino acid symbols (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y). 

In [68]:
def protein_mass(protein_string):
    # Monoisotopic mass table for amino acids
    mass_table = {
        'A': 71.03711,
        'C': 103.00919,
        'D': 115.02694,
        'E': 129.04259,
        'F': 147.06841,
        'G': 57.02146,
        'H': 137.05891,
        'I': 113.08406,
        'K': 128.09496,
        'L': 113.08406,
        'M': 131.04049,
        'N': 114.04293,
        'P': 97.05276,
        'Q': 128.05858,
        'R': 156.10111,
        'S': 87.03203,
        'T': 101.04768,
        'V': 99.06841,
        'W': 186.07931,
        'Y': 163.06333
    }

    total_mass = sum(mass_table[aa] for aa in protein_string)
    return round(total_mass, 3)

# Get input from the user
protein_string = input("Enter the protein string: ").strip()

# Calculate and print the result
result = protein_mass(protein_string)
print(result)

Enter the protein string:  SKADYEK


821.392
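
The function above assumes valid input. If a clearer error message is wanted for stray characters, one option is to validate the string before summing. This is only a sketch: the names MONOISOTOPIC_MASS and protein_mass_checked are hypothetical, and the table repeats the values used above.

```python
# Same monoisotopic masses as in the cell above, written compactly.
MONOISOTOPIC_MASS = {
    'A': 71.03711, 'C': 103.00919, 'D': 115.02694, 'E': 129.04259,
    'F': 147.06841, 'G': 57.02146, 'H': 137.05891, 'I': 113.08406,
    'K': 128.09496, 'L': 113.08406, 'M': 131.04049, 'N': 114.04293,
    'P': 97.05276, 'Q': 128.05858, 'R': 156.10111, 'S': 87.03203,
    'T': 101.04768, 'V': 99.06841, 'W': 186.07931, 'Y': 163.06333,
}

def protein_mass_checked(protein_string):
    """Sum residue masses, rejecting symbols that are not in the table."""
    unknown = sorted(set(protein_string) - set(MONOISOTOPIC_MASS))
    if unknown:
        raise ValueError(f"Unknown amino acid symbol(s): {', '.join(unknown)}")
    return round(sum(MONOISOTOPIC_MASS[aa] for aa in protein_string), 3)

print(protein_mass_checked("SKADYEK"))  # same sample as above
```

Calling protein_mass_checked("SKB") raises a ValueError that names the offending symbol instead of a bare KeyError.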


# Locating Restriction Sites

Sample Dataset :
>Sequence_24 TCAATGCATGCGGGTCTATATGCAT

logic: The restriction enzyme is a homodimer, which means that it is composed of two identical substructures. Each of these structures separates from the restriction enzyme in order to bind to and cut one strand of the phage DNA molecule; both substructures are pre-programmed with the same target string containing 4 to 12 nucleotides to search for within the phage DNA. Because the same target must be found on both strands, the recognition site has to equal its own reverse complement (a reverse palindrome). For this reason we need a script that finds all reverse palindromes of length 4 to 12 in a given DNA sequence and prints the 1-indexed start position and length of each.

In [72]:
def reverse_complement(dna):
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    return ''.join(complement[base] for base in reversed(dna))

def find_reverse_palindromes(dna):
    palindromes = []
    for i in range(len(dna)):
        for length in range(4, 13):  # lengths from 4 to 12
            if i + length > len(dna):
                break
            substring = dna[i:i+length]
            if substring == reverse_complement(substring):
                palindromes.append((i+1, length))  # +1 because positions are 1-indexed
    return palindromes

def parse_input(input_string):
    parts = input_string.strip().split()
    if len(parts) == 2:
        return parts[1]  # Return just the DNA sequence
    else:
        raise ValueError("Invalid input format")

# Get input from the user
input_data = input("Enter the DNA sequence: ").strip()

try:
    # Parse input
    dna_sequence = parse_input(input_data)

    # Find reverse palindromes
    results = find_reverse_palindromes(dna_sequence)

    # Print results
    for position, length in results:
        print(f"{position} {length}")

except ValueError as e:
    print(f"Error: {e}")

Enter the DNA sequence:  Sequence_24 TCAATGCATGCGGGTCTATATGCAT


4 6
5 4
6 6
7 4
17 4
18 4
20 6
21 4
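
To make the output easier to read, the sketch below extracts the site reported on the first output line and confirms that it equals its own reverse complement. It uses str.translate rather than the dictionary lookup above; the names COMPLEMENT and is_reverse_palindrome are hypothetical.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def is_reverse_palindrome(dna):
    """Return True if dna equals its own reverse complement."""
    return dna == dna.translate(COMPLEMENT)[::-1]

sequence = "TCAATGCATGCGGGTCTATATGCAT"
start, length = 4, 6  # first output line above; start is 1-indexed
site = sequence[start - 1:start - 1 + length]
print(site, is_reverse_palindrome(site))  # ATGCAT True
```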


# RNA Splicing

Sample Dataset :
>Sequence_10
ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG
>Sequence_12
ATCGGTCGAA
>Sequence_15
ATCGGTCGAGCGTGT

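logic: We'll need to parse the input, remove the introns from the DNA sequence, transcribe it to RNA, and then translate it to a protein sequence. Enter each record on a single line when prompted, with the identifier and the sequence separated by a space (for example, >Sequence_12 ATCGGTCGAA); the parser ignores lines that do not have exactly these two parts, so multi-line FASTA records must be joined first. The first sequence should be the main DNA string, followed by the intron sequences. Finish the input with an empty line (that is, press Enter twice). The script removes every occurrence of each intron, stops translating at the first stop codon, and outputs the protein sequence.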

In [74]:
def parse_input(input_string):
    sequences = {}
    for line in input_string.split('\n'):
        parts = line.strip().split()
        if len(parts) == 2:
            sequences[parts[0]] = parts[1]
    return sequences

def remove_introns(dna, introns):
    for intron in introns:
        dna = dna.replace(intron, '')
    return dna

def transcribe(dna):
    return dna.replace('T', 'U')

def translate(rna):
    codon_table = {
        'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L', 'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L',
        'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M', 'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V',
        'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S', 'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
        'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T', 'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
        'UAU': 'Y', 'UAC': 'Y', 'UAA': '_', 'UAG': '_', 'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
        'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K', 'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
        'UGU': 'C', 'UGC': 'C', 'UGA': '_', 'UGG': 'W', 'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
        'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R', 'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G'
    }
    protein = ''
    for i in range(0, len(rna), 3):
        codon = rna[i:i+3]
        if len(codon) == 3:
            amino_acid = codon_table.get(codon, '')
            if amino_acid == '_':
                break
            protein += amino_acid
    return protein

# Get input from the user
print("Enter the DNA sequences (press Enter twice to finish):")
input_data = ''
while True:
    line = input()
    if line == '':
        break
    input_data += line + '\n'

# Parse input
sequences = parse_input(input_data)

# The first sequence is the main DNA string, the rest are introns
dna = list(sequences.values())[0]
introns = list(sequences.values())[1:]

# Remove introns, transcribe to RNA, and translate to protein
dna_without_introns = remove_introns(dna, introns)
rna = transcribe(dna_without_introns)
protein = translate(rna)

# Print the result
print(protein)

Enter the DNA sequences (press Enter twice to finish):


 >Sequence_10 ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG
 >Sequence_12 ATCGGTCGAA
 >Sequence_15 ATCGGTCGAGCGTGT
 


MVYIADKQHVASREAYGHMFKVCA
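
The sample dataset is written in multi-line FASTA, while the script above expects one record per line. A minimal sketch of a parser that accepts headers and sequences on separate lines follows. The name parse_fasta is hypothetical, and the second record is wrapped across two lines purely for illustration.

```python
def parse_fasta(text):
    """Return (identifier, sequence) pairs in input order.

    A line starting with '>' opens a new record; the following lines are
    joined to form its sequence.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            records.append([line[1:], ''])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]

sample = """>Sequence_12
ATCGGTCGAA
>Sequence_15
ATCGGTCGAG
CGTGT
"""
for name, seq in parse_fasta(sample):
    print(name, seq)
```

The first pair would be treated as the main DNA string and the remaining pairs as introns, mirroring the convention used above.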


# Enumerating k-mers Lexicographically

Sample Dataset :
A C G T
2

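logic: In order to solve this problem we need a script that generates all possible strings of a given length from a given alphabet and orders them lexicographically. itertools.product builds every length-n combination of symbols, and sorted() then orders the joined strings by character code, which coincides with alphabetical order for uppercase letters such as A, C, G, T.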

In [76]:
from itertools import product

def generate_strings(alphabet, n):
    # Generate all possible combinations
    combinations = product(alphabet, repeat=n)
    
    # Convert each combination to a string and sort lexicographically
    strings = sorted(''.join(combo) for combo in combinations)
    
    return strings

# Get input from the user
alphabet = input("Enter the alphabet symbols separated by spaces: ").split()
n = int(input("Enter the length of strings to generate: "))

# Generate and print the strings
result = generate_strings(alphabet, n)
for string in result:
    print(string)

Enter the alphabet symbols separated by spaces:  A C G T
Enter the length of strings to generate:  2


AA
AC
AG
AT
CA
CC
CG
CT
GA
GC
GG
GT
TA
TC
TG
TT
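
If the alphabet is supplied in a custom order that should be respected, the sorted() call can be dropped, because itertools.product already yields tuples in the order of its input. A sketch under that assumption (the helper name kmers_in_alphabet_order is hypothetical):

```python
from itertools import product

def kmers_in_alphabet_order(alphabet, n):
    """List all length-n strings, ordered by the position of each symbol
    in the given alphabet rather than by character code."""
    return [''.join(combo) for combo in product(alphabet, repeat=n)]

# With a reordered alphabet the output follows T < A < G < C.
print(kmers_in_alphabet_order(['T', 'A', 'G', 'C'], 2)[:5])
# ['TT', 'TA', 'TG', 'TC', 'AT']
```

For an alphabet that is already in alphabetical order, such as A C G T, this gives the same list as the sorted version above.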


# Longest Increasing Subsequence

Sample Dataset :
5
5 1 4 2 3

logic: This problem requires implementing an algorithm to find the longest increasing subsequence (LIS) and longest decreasing subsequence (LDS) of a given permutation. We define two main functions, longest_increasing_subsequence(seq) and longest_decreasing_subsequence(seq). The LIS function implements the O(n^2) dynamic programming algorithm: for each position it records the length of the longest increasing subsequence ending there and the index of its predecessor, then walks the predecessors back from the best end point. The LDS function runs the LIS algorithm on the negated sequence and negates the result, since a subsequence decreases exactly when its negation increases.

In [79]:
def longest_increasing_subsequence(seq):
    n = len(seq)
    # L[i] - length of the longest increasing subsequence ending at i
    L = [1] * n
    # P[i] - predecessor of element i in the longest increasing subsequence
    P = [-1] * n

    for i in range(1, n):
        for j in range(i):
            if seq[i] > seq[j] and L[i] < L[j] + 1:
                L[i] = L[j] + 1
                P[i] = j

    # Find the maximum length and its index
    length = max(L)
    index = L.index(length)

    # Reconstruct the sequence
    result = []
    while index != -1:
        result.append(seq[index])
        index = P[index]

    return result[::-1]  # Reverse the result

def longest_decreasing_subsequence(seq):
    # We can use LIS algorithm on the negated sequence
    lis = longest_increasing_subsequence([-x for x in seq])
    return [-x for x in lis]  # Convert back to original values

# Get input from the user
n = int(input("Enter the length of the permutation: "))
permutation = list(map(int, input("Enter the permutation separated by spaces: ").split()))

# Find longest increasing and decreasing subsequences
lis = longest_increasing_subsequence(permutation)
lds = longest_decreasing_subsequence(permutation)

# Print results
print(' '.join(map(str, lis)))
print(' '.join(map(str, lds)))

Enter the length of the permutation:  5
Enter the permutation separated by spaces:  5 1 4 2 3


1 2 3
5 4 2
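
The dynamic programming above takes O(n^2) time. When only the length is needed, a common alternative is the O(n log n) approach based on binary search. This is a different technique from the one used in the notebook, sketched here with a hypothetical helper named lis_length.

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence of seq."""
    # tails[k] is the smallest value that can end an increasing
    # subsequence of length k + 1 seen so far.
    tails = []
    for x in seq:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)   # x extends the longest subsequence found so far
        else:
            tails[pos] = x    # x gives a smaller tail for length pos + 1
    return len(tails)

permutation = [5, 1, 4, 2, 3]
print(lis_length(permutation))                    # 3, matching "1 2 3"
print(lis_length([-x for x in permutation]))      # 3, matching "5 4 2"
```

Recovering the subsequence itself with this method would need extra bookkeeping of predecessors, which the sketch leaves out.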


# Genome Assembly as Shortest Superstring

Sample Dataset
>Sequence_56
ATTAGACCTG
>Sequence_57
CCTGCCGGAA
>Sequence_58
AGACCTGCCG
>Sequence_59
GCCGGAATAC

logic: This problem requires implementing an algorithm to find a shortest superstring that contains all given DNA sequences. The overlap(s1, s2) function finds the largest overlap between the end of s1 and the start of s2. The merge_strings(s1, s2) function merges s1 and s2 based on their overlap. The find_shortest_superstring(dna_strings) function implements a greedy algorithm: it repeatedly finds the ordered pair of strings with the largest overlap, merges them, and replaces them in the list with the merged string. This continues until only one string remains, which is returned as the superstring. The greedy choice is a heuristic and is not guaranteed to give the shortest superstring for arbitrary inputs. The parse_input(input_string) function reads one record per line in the form 'Sequence_ID DNASEQUENCE' and returns a list of DNA sequences. We get the input from the user, parse it to get the list of DNA strings, find the superstring using our algorithm, and print the result.

In [82]:
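def overlap(s1, s2):
    # Length of the longest suffix of s1 that is also a prefix of s2
    for i in range(min(len(s1), len(s2)), 0, -1):
        if s1[-i:] == s2[:i]:
            return i
    return 0

def merge_strings(s1, s2):
    # Append s2 to s1, keeping the overlapping part only once
    o = overlap(s1, s2)
    return s1 + s2[o:]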

def find_shortest_superstring(dna_strings):
    while len(dna_strings) > 1:
        max_overlap = 0
        best_pair = (0, 1)
        
        for i in range(len(dna_strings)):
            for j in range(len(dna_strings)):
                if i != j:
                    o = overlap(dna_strings[i], dna_strings[j])
                    if o > max_overlap:
                        max_overlap = o
                        best_pair = (i, j)
        
        i, j = best_pair
        merged = merge_strings(dna_strings[i], dna_strings[j])
        dna_strings = [s for k, s in enumerate(dna_strings) if k not in best_pair]
        dna_strings.append(merged)
    
    return dna_strings[0]

def parse_input(input_string):
    sequences = []
    for line in input_string.split('\n'):
        parts = line.strip().split()
        if len(parts) == 2:
            sequences.append(parts[1])
    return sequences

# Get input from the user
print("Enter the DNA sequences (press Enter twice to finish):")
input_data = ''
while True:
    line = input()
    if line == '':
        break
    input_data += line + '\n'

# Parse input
dna_strings = parse_input(input_data)

# Find shortest superstring
if dna_strings:
    result = find_shortest_superstring(dna_strings)
    print(result)
else:
    print("No valid DNA sequences found in the input.")

Enter the DNA sequences (press Enter twice to finish):


 Sequence_56 ATTAGACCTG
 Sequence_57 CCTGCCGGAA
 Sequence_58 AGACCTGCCG
 Sequence_59 GCCGGAATAC
 


ATTAGACCTGCCGGAATAC
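
A quick sanity check on the result is to confirm that every read occurs in the assembled string. The sketch below does this for the sample; the helper name is_superstring is hypothetical.

```python
def is_superstring(candidate, reads):
    """Return True if every read occurs as a contiguous substring of candidate."""
    return all(read in candidate for read in reads)

reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
assembled = "ATTAGACCTGCCGGAATAC"  # output of the cell above

print(is_superstring(assembled, reads), len(assembled))  # True 19
```

This confirms only that the result is a valid superstring, not that it is the shortest one.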


# Perfect Matchings and RNA Secondary Structures

Sample Dataset : 
>Sequence_23
AGCUAGUCAU

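logic: This problem requires us to calculate the total number of perfect matchings in the bonding graph of an RNA string. In a perfect matching every A is paired with a U and every C with a G, so one exists only if the counts of A and U are equal and the counts of C and G are equal. With a occurrences of A there are a! ways to pair them with the U's, and with c occurrences of C there are c! ways to pair them with the G's, giving a! * c! matchings in total. For the sample, a = 3 and c = 2, so the answer is 3! * 2! = 12.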

In [85]:
from math import factorial

def count_perfect_matchings(rna):
    # Count the occurrences of each nucleotide
    count_A = rna.count('A')
    count_U = rna.count('U')
    count_C = rna.count('C')
    count_G = rna.count('G')
    
    # Check if the counts are valid for perfect matching
    if count_A != count_U or count_C != count_G:
        return 0
    
    # Calculate the number of perfect matchings
    return factorial(count_A) * factorial(count_C)

def parse_input(input_string):
    parts = input_string.strip().split()
    if len(parts) == 2:
        return parts[1]  # Return just the RNA sequence
    else:
        raise ValueError("Invalid input format")

# Get input from the user
input_data = input("Enter the RNA sequence: ").strip()

try:
    # Parse input
    rna = parse_input(input_data)

    # Calculate and print the result
    result = count_perfect_matchings(rna)
    print(result)

except ValueError as e:
    print(f"Error: {e}")

Enter the RNA sequence:  Sequence_23 AGCUAGUCAU


12
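
The factorial formula can be cross-checked on short strings by enumerating the matchings directly. The sketch below always pairs the first unmatched position with each complementary partner in turn, so every matching is counted exactly once. The function name brute_force_matchings is hypothetical, and the approach is only practical for small inputs.

```python
from math import factorial

def brute_force_matchings(rna):
    """Count perfect matchings of A-U and C-G pairs by direct enumeration."""
    pairs = {('A', 'U'), ('U', 'A'), ('C', 'G'), ('G', 'C')}

    def count(remaining):
        # remaining holds the indices of positions not yet matched
        if not remaining:
            return 1
        first, rest = remaining[0], remaining[1:]
        total = 0
        for k, other in enumerate(rest):
            if (rna[first], rna[other]) in pairs:
                total += count(rest[:k] + rest[k + 1:])
        return total

    return count(tuple(range(len(rna))))

sample_rna = "AGCUAGUCAU"
print(brute_force_matchings(sample_rna))  # 12
print(factorial(3) * factorial(2))        # 12, from the formula
```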
