# Bioinformatics Armory

## INI - Introduction to the Bioinformatics Armory

> **Let's Be Practical**      
If you are an accomplished coder, then you can write a separate program for every new task you encounter. In practice, these programs only need to be written once and posted to the web, where those of us who are not great coders can use them quickly and efficiently. In the Armory, we will familiarize ourselves with a sampling of some of the more popular bioinformatics tools taken from "out of the box" software.         
To be equitable, we will focus mainly on free, internet-based software and on programs that are compatible with multiple operating systems. The "Problem" section will contain links to this software, with short descriptions about how to use it.

**Problem**

This initial problem is aimed at familiarizing you with Rosalind's task-solving pipeline. To solve it, you merely have to take a given [DNA sequence](http://rosalind.info/glossary/dna/) and find its [nucleotide](http://rosalind.info/glossary/nucleotide/) counts; this problem is equivalent to [“Counting DNA Nucleotides”](http://rosalind.info/problems/dna/) in the [Stronghold](http://rosalind.info/problems/list-view/).

Of the many tools for [DNA](http://rosalind.info/glossary/dna/) sequence analysis, one of the most popular is the [Sequence Manipulation Suite](http://rosalind.info/glossary/sms-2/). Commonly known as SMS 2, it comprises a collection of programs for generating, formatting, and analyzing short [strands](http://rosalind.info/glossary/strand/) of DNA and [polypeptides](http://rosalind.info/glossary/polypeptide/).

One of the simplest SMS 2 programs, called `DNA stats`, counts the number of occurrences of each nucleotide in a given strand of DNA. An online interface for `DNA stats` can be found [here](http://www.bioinformatics.org/sms2/dna_stats.html).

> **Given:**   A [DNA string](http://rosalind.info/glossary/dna-string/) `s` of length at most 1000 [bp](http://rosalind.info/glossary/base-pair/).

> **Return:**  Four integers (separated by spaces) representing the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in `s`. **Note**: You must provide your answer in the format shown in the sample output below.

**Sample Dataset**

    AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
    
**Sample Output**
    
    20 12 17 21
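
The counting step itself needs no third-party library. The following is a minimal sketch using only the Python standard library; the helper name `count_nucleotides` is hypothetical and not part of the notebook or of SMS 2. It is checked here against the sample dataset above.

```python
from collections import Counter


def count_nucleotides(dna):
    """Return the counts of A, C, G and T in `dna` as a space-separated string."""
    # normalise case and drop surrounding whitespace such as a trailing newline
    counts = Counter(dna.strip().upper())
    # Counter returns 0 for symbols that never occur, so absent bases are safe
    return " ".join(str(counts[base]) for base in "ACGT")


sample = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
print(count_nucleotides(sample))  # prints: 20 12 17 21
```

Joining the counts with single spaces yields the exact format of the sample output, with no trailing space.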

In [8]:
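def count_dna_bases_from_file(filename):

    from Bio.Seq import Seq

    bases = ['A', 'C', 'G', 'T']
    result_str = ""

    # read the whole file and drop line breaks so only sequence symbols remain;
    # the with-block closes the file even if reading fails
    with open(filename, 'r') as f_in:
        input_str = f_in.read().replace("\n", "")

    # build a Biopython sequence object from the raw string
    my_seq = Seq(input_str)

    # append each count followed by a space, in A, C, G, T order
    for base in bases:
        result_str += str(my_seq.count(base)) + " "

    # print() with a single argument works under both Python 2 and Python 3
    print("results: '%s' " % result_str)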

In [10]:
input_path = "source_files/bioinf_armory/rosalind_ini.txt"
count_dna_bases_from_file(input_path)

results: '235 208 215 220 ' 


## DBPR - Introduction to Protein Databases

**Problem**

The UniProt Knowledgebase can be found [here](http://www.uniprot.org/help/uniprotkb).

You can see a complete description of a protein by entering its UniProt access ID into the site's query field. Equivalently, you may simply insert its ID (`uniprot_id`) directly into a UniProt hyperlink as follows:

    http://www.uniprot.org/uniprot/uniprot_id 

For example, the data for protein B5ZC00 can be found at [http://www.uniprot.org/uniprot/B5ZC00](http://www.uniprot.org/uniprot/B5ZC00).

Swiss-Prot holds protein data as a structured `.txt` file. You can obtain it by simply adding `.txt` to the link:

    http://www.uniprot.org/uniprot/uniprot_id.txt
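
As a small illustration of the link pattern above, the sketch below builds the flat-file address from an ID. The function name `uniprot_txt_url` is hypothetical; the base address is copied from the text, and the code only builds the string without checking that the address resolves.

```python
def uniprot_txt_url(uniprot_id):
    """Build the flat-file URL for a UniProt ID using the pattern from the text."""
    # base address taken from the problem statement; ".txt" selects the flat file
    return "http://www.uniprot.org/uniprot/" + uniprot_id.strip() + ".txt"


print(uniprot_txt_url("B5ZC00"))  # prints: http://www.uniprot.org/uniprot/B5ZC00.txt
```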
    
> **Given:** The UniProt ID of a protein.

> **Return:** A list of biological processes in which the protein is involved (biological processes are found in a subsection of the protein's "Gene Ontology" (GO) section).

**Sample Dataset**

    Q5SLP9
    
**Sample Output**
    
    DNA recombination
    DNA repair
    DNA replication
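
To show where the biological processes sit in the flat file, here is a rough offline sketch that filters cross-reference (`DR`) lines by hand. It assumes that GO entries are written as semicolon-separated `DR` lines and that process terms carry a `P:` prefix. The helper name `biological_processes` and the contents of `SAMPLE_TEXT` are hand-written for illustration and were not downloaded from UniProt.

```python
def biological_processes(flat_file_text):
    """Collect GO biological-process names from UniProt flat-file text."""
    processes = []
    for line in flat_file_text.splitlines():
        # cross-reference lines are assumed to start with the "DR" line code
        if not line.startswith("DR"):
            continue
        # drop the line code and the final period, then split into fields
        fields = [field.strip() for field in line[2:].rstrip(".").split(";")]
        # keep GO entries whose term is a biological process ("P:" prefix)
        if len(fields) >= 3 and fields[0] == "GO" and fields[2].startswith("P:"):
            # split once so that colons inside a term name are preserved
            processes.append(fields[2].split(":", 1)[1])
    return processes


# Illustrative input only: hand-written lines in the assumed DR layout.
SAMPLE_TEXT = """\
DR   GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell.
DR   GO; GO:0006310; P:DNA recombination; IEA:UniProtKB-HAMAP.
DR   Pfam; PF00000; Example; 1.
"""

for name in biological_processes(SAMPLE_TEXT):
    print(name)  # prints: DNA recombination
```

Only the `P:` entry survives the filter; the `C:` (cellular component) line and the non-GO line are skipped.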

In [19]:
def print_involved_processes_by_UniProtId(uniprot_id):
    '''
    Print the biological processes in which the protein is involved
    (biological processes are found in a subsection of the protein's "Gene Ontology" (GO) section).
    '''
    from Bio import ExPASy
    from Bio import SwissProt
    handle = ExPASy.get_sprot_raw(uniprot_id) #you can give several IDs separated by commas
    record = SwissProt.read(handle) # use SwissProt.parse for multiple proteins

    for ref in record.cross_references:

        # example for ref:
        # ('GO', 'GO:0006310', 'P:DNA recombination', 'IEA:UniProtKB-HAMAP')

        # keep only GO cross-references whose term is a biological process,
        # i.e. the term starts with "P:" as in "P:DNA repair"
        if ref[0] == 'GO' and len(ref) > 2 and ref[2].startswith('P:'):
            # split once so that colons inside the term name are preserved
            print(ref[2].split(':', 1)[1])


In [22]:
my_id = "Q2STE7"
print_involved_processes_by_UniProtId(my_id)

ATP hydrolysis coupled proton transport
plasma membrane ATP synthesis coupled proton transport
