### Introduction to the Bioinformatics Armory

**Problem:** This initial problem is aimed at familiarizing you with Rosalind's task-solving pipeline. To solve it, you merely have to take a given DNA sequence and find its nucleotide counts; this problem is equivalent to “Counting DNA Nucleotides” in the Stronghold.

Of the many tools for DNA sequence analysis, one of the most popular is the Sequence Manipulation Suite. Commonly known as SMS 2, it comprises a collection of programs for generating, formatting, and analyzing short strands of DNA and polypeptides.

One of the simplest SMS 2 programs, called DNA stats, counts the number of occurrences of each nucleotide in a given strand of DNA. An online interface for DNA stats can be found here.

**Given:** A DNA string s of length at most 1000 bp.<br>
**Return:** Four integers (separated by spaces) representing the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s. Note: You must provide your answer in the format shown in the sample output below.

In [4]:
# Get the string

with open('../../../Downloads/rosalind_ini.txt') as f:
    s = f.readline().strip()

In [5]:
nucleotide_counts = [0]*4 # The order is A,C,G,T

for i in s:
    nucleotide_counts[0] += (i == 'A')
    nucleotide_counts[1] += (i == 'C')
    nucleotide_counts[2] += (i == 'G')
    nucleotide_counts[3] += (i == 'T')
    
print(*nucleotide_counts)

258 234 240 220
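For comparison, the same tally can be sketched with `collections.Counter` from the standard library. The function name `count_nucleotides` and the toy input string are illustrative assumptions, not part of the Rosalind dataset or the original solution.

```python
from collections import Counter

def count_nucleotides(dna):
    """Return the counts of A, C, G and T in `dna`, in that order."""
    counts = Counter(dna)
    # A Counter returns 0 for symbols that never occur, so no KeyError here
    return [counts[base] for base in "ACGT"]

# Hypothetical toy input, not the Rosalind dataset
print(*count_nucleotides("AACGTTTG"))  # prints: 2 1 2 3
```

Indexing the counter in a fixed `"ACGT"` order keeps the output in the order the problem asks for, regardless of which symbol appears first in the string.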


### GenBank Introduction

**Problem:** GenBank comprises several subdivisions:

    Nucleotide: a collection of nucleic acid sequences from several sources.
    Genome Survey Sequence (GSS): uncharacterized short genomic sequences.
    Expressed Sequence Tags (EST): uncharacterized short cDNA sequences.

Searching the Nucleotide database with general text queries will produce the most relevant results. You can also use a simple query based on protein name, gene name or gene symbol.

To limit your search to only certain kinds of records, you can search using GenBank's Limits page or alternatively use the Filter your results field to select categories of records after a search.

If you cannot find what you are searching for, check how the database interpreted your query by investigating the Search details field on the right side of the page. This field automatically translates your search into standard keywords.

For example, if you search for Drosophila, the Search details field will contain (Drosophila[All Fields]), and you will obtain all entries that mention Drosophila (including all its endosymbionts). You can restrict your search to only organisms belonging to the Drosophila genus by using a search tag and searching for Drosophila[Organism].

**Given:** A genus name, followed by two dates in YYYY/M/D format.<br>
**Return:** The number of Nucleotide GenBank entries for the given genus that were published between the dates specified.

In [13]:
from Bio import Entrez
handle = Entrez.esearch(db="nucleotide", term='"Microphotus"[Organism] AND ("2001/02/11"[Publication Date] : "2011/05/21"[Publication Date])')
record = Entrez.read(handle)
print(record["Count"])

44
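The search term in the cell above combines an `[Organism]` tag with a `[Publication Date]` range. A minimal sketch of assembling that string from the problem inputs follows; the helper name `build_genbank_term` is a hypothetical choice, and the sketch only builds the string, since the actual search needs network access to NCBI.

```python
def build_genbank_term(genus, start_date, end_date):
    """Build an Entrez search term restricted to a genus and a date range.

    Hypothetical helper: it only formats the query string and does not
    contact NCBI.
    """
    return (
        f'"{genus}"[Organism] AND '
        f'("{start_date}"[Publication Date] : "{end_date}"[Publication Date])'
    )

term = build_genbank_term("Microphotus", "2001/02/11", "2011/05/21")
print(term)
```

The resulting string can then be passed as the `term` argument of `Entrez.esearch(db="nucleotide", term=term)`. Building it in one place makes it easier to keep the parentheses around the date range balanced.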


### Data Formats

**Problem:** GenBank can be accessed here. A detailed description of the GenBank format can be found here. A tool, from the SMS 2 package, for converting GenBank to FASTA can be found here. 

**Given:** A collection of n (n≤10) GenBank entry IDs. <br>
**Return:** The shortest of the strings associated with the IDs in FASTA format.

In [34]:
from Bio import Entrez
from Bio import SeqIO

In [35]:
# Read the file
with open('../../../Downloads/rosalind_frmt.txt') as f:
    ids = f.readline().split()

In [36]:
ids

['NM_002037',
 'NM_001082732',
 'JX317645',
 'JX445144',
 'JN573266',
 'BT149870',
 'JX460804',
 'JX205496']

In [37]:
# NCBI asks for a contact email with every E-utilities request;
# the address below is a placeholder, so substitute your own
Entrez.email = 'A.N.Other@example.com'
handle = Entrez.efetch(db='nucleotide', id=ids, rettype='fasta')
records = list(SeqIO.parse(handle, 'fasta'))
handle.close()


In [38]:
print(records) # Looking inside the 'records'

[SeqRecord(seq=Seq('AGAGCATCAGCAAGAGTAGCAGCGAGCAGCCGCGCTGGTGGCGGCGGCGCGTCG...GAA'), id='NM_002037.5', name='NM_002037.5', description='NM_002037.5 Homo sapiens FYN proto-oncogene, Src family tyrosine kinase (FYN), transcript variant 1, mRNA', dbxrefs=[]), SeqRecord(seq=Seq('TTTCACACGGCCACTGCTGTTCACAGAAAATGCCAGGATGATGCCTCGCCTGCG...AGG'), id='NM_001082732.2', name='NM_001082732.2', description='NM_001082732.2 Oryctolagus cuniculus toll like receptor 4 (TLR4), mRNA', dbxrefs=[]), SeqRecord(seq=Seq('ATGGCATCCACAAGCAGCAGCAGCAGAATCAACAACAACCGCCATGCCGTCAGG...TAA'), id='JX317645.1', name='JX317645.1', description='JX317645.1 Culex quinquefasciatus neuropeptide F mRNA, complete cds', dbxrefs=[]), SeqRecord(seq=Seq('ACATGGGAAGCAGTGGGACGAGCAGAGTAGTCAAAGACTCAACTAAGCTTCATC...AAA'), id='JX445144.3', name='JX445144.3', description='JX445144.3 Paeonia lactiflora cultivar Taohuafeixue ethylene-insensitive 3 (EIN3) mRNA, complete cds', dbxrefs=[]), SeqRecord(seq=Seq('GGATTCCGACGTGAATCTCGGTTCGTAGTTTGCGGT

In [39]:
print(records[0]) # Understanding the 'record' object

ID: NM_002037.5
Name: NM_002037.5
Description: NM_002037.5 Homo sapiens FYN proto-oncogene, Src family tyrosine kinase (FYN), transcript variant 1, mRNA
Number of features: 0
Seq('AGAGCATCAGCAAGAGTAGCAGCGAGCAGCCGCGCTGGTGGCGGCGGCGCGTCG...GAA')


In [40]:
# Further looking into the 'record' object
for i in records:
    print(i.id,'\n',i.seq,'\n')

NM_002037.5 
 AGAGCATCAGCAAGAGTAGCAGCGAGCAGCCGCGCTGGTGGCGGCGGCGCGTCGTTGCAGTTGCGCCATCTGTCAGGAGCGGAGCCGGCGAGGAGGGGGCTGCCGCGGGCGAGGAGGAGGGGTCGCCGCGAGCCGAAGGCCTTCGAGACCCGCCCGCCGCCCGGCGGCGAGAGTAGAGGCGAGGTTGTTGTGCGAGCGGCGCGTCCTCTCCCGCCCGGGCGCGCCGCGCTTCTCCCAGCGCACCGAGGACCGCCCGGGCGCACACAAAGCCGCCGCCCGCGCCGCACCGCCCGGCGGCCGCCGCCCGCGCCAGGGAGGGATTCGGCCGCCGGGCCGGGGACACCCCGGCGCCGCCCCCTCGGTGCTCTCGGAAGGCCCACCGGCTCCCGGGCCCGCCGGGGACCCCCCGGAGCCGCCTCGGCCGCGCCGGAGGAGGGCGGGGAGAGGACCATGTGAGTGGGCTCCGGAGCCTCAGCGCCGCGCAGTTTTTTTGAAGAAGCAGGATGCTGATCTAAACGTGGAAAAAGACCAGTCCTGCCTCTGTTGTAGAAGACATGTGGTGTATATAAAGTTTGTGATCGTTGGCGGACATTTTGGAATTTAGATAATGGGCTGTGTGCAATGTAAGGATAAAGAAGCAACAAAACTGACGGAGGAGAGGGACGGCAGCCTGAACCAGAGCTCTGGGTACCGCTATGGCACAGACCCCACCCCTCAGCACTACCCCAGCTTCGGTGTGACCTCCATCCCCAACTACAACAACTTCCACGCAGCCGGGGGCCAAGGACTCACCGTCTTTGGAGGTGTGAACTCTTCGTCTCATACGGGGACCTTGCGTACGAGAGGAGGAACAGGAGTGACACTCTTTGTGGCCCTTTATGACTATGAAGCACGGACAGAAGATGACCTGAGTTTTCACAAAGGAGAAAAATTTCAAATATTGAACAGCTCGGAAGGAGATTGGTGGGAAGCCCGCTCCTTGACAA

In [41]:
# Looking at the sequence lengths
def shortest_records(records):

    # Pick the record with the shortest sequence (the first one on ties)
    r = min(records, key=lambda rec: len(rec.seq))

    # Print it in FASTA format: '>' plus the description, then the sequence
    print(f'>{r.description}\n{r.seq}')


In [42]:
shortest_records(records)

>JX317645.1 Culex quinquefasciatus neuropeptide F mRNA, complete cds
ATGGCATCCACAAGCAGCAGCAGCAGAATCAACAACAACCGCCATGCCGTCAGGTCATCAGCCTCTTCAGCGTTCACCCAGCGACTGCTAATCGGCCTGCTGGTCTGCACCCTGGTGCTGGATCTTAGCTGCCTGACCGAGGCCCGGCCCCAGGACGATCCCACCTCCGTCGCCGAAGCCATCCGACTGCTGCAGGAGCTGGAAACCAAGCACGCCCAACATGCCCGACCAAGATTCGGAAAACGTGGCTATCTCCAGCCGGCAAGTTACGGCCAGGACGAACAGGAGCGAAACTATTACAGGATGATTGGCAGGATTCAGCGTTTTCAAGATGAACAGAACGCCGTACTCAACTAA
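The selection step can also be tried offline, without querying NCBI. The sketch below parses FASTA text with a small hand-written parser and returns the shortest entry; `parse_fasta`, `shortest_fasta`, and the toy records are hypothetical stand-ins for `SeqIO.parse` and the fetched GenBank entries, not part of Biopython.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def shortest_fasta(text):
    """Return the shortest record of `text`, formatted as FASTA."""
    header, seq = min(parse_fasta(text), key=lambda pair: len(pair[1]))
    return f">{header}\n{seq}"

# Hypothetical toy records; the first sequence spans two lines
toy = """>seq1 hypothetical record
ACGTACGT
ACGT
>seq2 hypothetical record
ACGTA
"""
print(shortest_fasta(toy))
```

Sequence lines are joined before their lengths are compared, so a record that wraps over several lines is measured by its full length rather than by its first line.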
