# Bioinformatics Stronghold

## Counting DNA Nucleotides

**[DNA](https://en.wikipedia.org/wiki/DNA) (or deoxyribonucleic acid)** is the molecule that carries the genetic information in all cellular forms of life and some viruses.
It belongs to a class of molecules called the [nucleic acids](https://en.wikipedia.org/wiki/Nucleic_acid), which are polynucleotides, that is, long chains of nucleotides.

Each nucleotide consists of three components:

* a [nitrogenous base](https://en.wikipedia.org/wiki/Nitrogenous_base): [cytosine](https://en.wikipedia.org/wiki/Cytosine) (C), [guanine](https://en.wikipedia.org/wiki/Guanine) (G), [adenine](https://en.wikipedia.org/wiki/Adenine) (A) or [thymine](https://en.wikipedia.org/wiki/Thymine) (T)
* a [5-carbon sugar molecule](https://en.wikipedia.org/wiki/Pentose) (deoxyribose in the case of DNA, ribose in the case of [RNA](https://en.wikipedia.org/wiki/RNA) (ribonucleic acid))
* a [phosphate](https://en.wikipedia.org/wiki/Phosphate) molecule

The backbone of the polynucleotide is a chain of sugar and phosphate molecules. Each of the sugar groups in this sugar-phosphate backbone is linked to one of the four nitrogenous bases.

![Image of DNA](https://www2.le.ac.uk/projects/vgec/diagrams/31%20polynucleotide%202.jpg/image_preview)

<font size='4'> Problem </font>

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a DNA string of length 21 (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is <font size='3'>"ATGCTTCAGAAAGGTCTTACG"</font>.

**Given**: A DNA string s of length at most 1000 nt.

**Return**: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.
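
The constraints in the problem statement can be checked before any counting is done. The following is a minimal sketch, assuming a hypothetical helper `check_dna_input` (not part of the original notebook) that enforces the 1000 nt limit and the four-symbol alphabet:

```python
# Hypothetical helper for illustration; names are assumptions, not from the notebook
MAX_LENGTH = 1000             # "length at most 1000 nt" from the problem statement
ALPHABET = frozenset("ACGT")  # the DNA alphabet used in the problem

def check_dna_input(s):
    """Return True if s satisfies the stated length and alphabet constraints."""
    return len(s) <= MAX_LENGTH and set(s) <= ALPHABET

example = "ATGCTTCAGAAAGGTCTTACG"
print(len(example))              # 21
print(check_dna_input(example))  # True
print(check_dna_input("ATGX"))   # False: 'X' is outside the alphabet
```

This check is case-sensitive, so lower-case input would need to be upper-cased first.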

<font size='4'> Define functions </font>

In [1]:
# Define DNA code
dna_code = ["A", "T", "C", "G"]

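# Define a function that upper-cases the given sequence and reports
# any symbol that is not in dna_code
def valid_seq(seq):
    seq = seq.upper()
    for index, nuc in enumerate(seq):
        if nuc not in dna_code:
            # Only a warning is printed (index is 0-based);
            # the sequence is still returned with the unhandled symbol in it
            print(f'Unhandled nucleotide at position {index}!')
    return seq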

# Define function to return the nucleotide count
def nuc_count(seq):
    num_A = seq.count("A")
    num_C = seq.count("C")
    num_G = seq.count("G")
    num_T = seq.count("T")
    return (num_A, num_C, num_G, num_T)

# Alternative function that returns a dictionary with nucleotide count
def countNucFrequency(seq):
    tmpFreqDict = {"A": 0, "C": 0, "G": 0, "T": 0}
    for nuc in seq:
        tmpFreqDict[nuc] += 1
    return tmpFreqDict
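
The dictionary-based function above raises a `KeyError` if the sequence contains a symbol other than A, C, G or T. As one possible alternative, the sketch below uses `collections.Counter` from the standard library; the function name `count_nuc_counter` is an assumption made for illustration and is not part of the original notebook:

```python
from collections import Counter

def count_nuc_counter(seq):
    """Count A, C, G and T in seq using collections.Counter (hypothetical alternative)."""
    counts = Counter(seq)
    # Counter returns 0 for missing keys, so absent nucleotides are reported as 0
    # and unexpected symbols are simply left out of the result
    return {nuc: counts[nuc] for nuc in "ACGT"}

print(count_nuc_counter("AACGTN"))  # {'A': 2, 'C': 1, 'G': 1, 'T': 1}
```

Whether silently ignoring unexpected symbols is desirable depends on the use case; failing loudly, as the original function does, may be preferable when input quality matters.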

<font size='4'> Test our functions </font>

In [2]:
# Sample dataset (first and last characters changed to lower case to exercise valid_seq)
dna_string = "aGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGc"

# Sample output
print(f"The number of nucleotides (A, C, G, T) is {nuc_count(valid_seq(dna_string))}")
print(f"\nNucleotide count: {countNucFrequency(valid_seq(dna_string))}")

The number of nucleotides (A, C, G, T) is (20, 12, 17, 21)

Nucleotide count: {'A': 20, 'C': 12, 'G': 17, 'T': 21}
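
The problem statement asks for four integers separated by spaces, whereas the output above shows a tuple and a dictionary. A minimal sketch of producing the requested format, assuming a hypothetical helper `format_counts` (not part of the original notebook):

```python
def format_counts(seq):
    """Return the A, C, G, T counts of seq as a space-separated string."""
    seq = seq.upper()  # accept lower-case input, as valid_seq does
    return " ".join(str(seq.count(nuc)) for nuc in "ACGT")

dna_string = "aGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGc"
print(format_counts(dna_string))  # 20 12 17 21
```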
