# Lesson 7
+ dictionary


Dictionaries are mappings between keys and the values stored against them. Unlike lists and tuples, dictionaries are unordered: elements are accessed by key, not by position. Alternatively, one can think of a dictionary as a set in which a value is stored against every element of the set.

They can be defined as follows:

``d = dict()``
``d = {}``

Dictionaries can be defined by using the ``{ key : value }`` syntax. 

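The iteration order of a dictionary is not the order in which elements were added; it follows the dictionary's own internal ordering (based on hash index ordering). This holds for the Python 2 interpreter used in this lesson; from Python 3.7 onward dictionaries preserve insertion order. In code that may run on either, it is best never to assume an ordering when iterating over the elements of a dictionary.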
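
When a reproducible order matters, one option is to iterate over the sorted keys instead of the dictionary itself. The snippet below is a minimal sketch of that idea; the dictionary contents and the names `codon_to_amino` and `ordered_keys` are illustrative only.

```python
# Hypothetical example data: codon -> amino acid
codon_to_amino = {"GCU": "A", "CGU": "R", "AAU": "N"}

# sorted() returns the keys as a list in alphabetical order,
# independent of how the dictionary stores them internally
ordered_keys = sorted(codon_to_amino)

for codon in ordered_keys:
    print(codon, codon_to_amino[codon])
```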

To add an element:

``d[key] = value``
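
The basic operations can be sketched as follows. The dictionary name `complement` and its contents are chosen only for illustration.

```python
complement = {}                  # empty dictionary
complement["A"] = "T"            # add a new key with its value
complement["C"] = "G"
complement["A"] = "U"            # assigning to an existing key overwrites its value

print(complement["A"])           # lookup by key
print("G" in complement)         # membership test on the keys
print(complement.get("G", "?"))  # lookup with a fallback for a missing key
print(len(complement))           # number of stored keys
```

Indexing a missing key with `complement["G"]` raises `KeyError`, while `get` returns the fallback value.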


### Reverse complement
Create a dictionary that supports computing the reverse complement of a DNA sequence.
Write a function that receives a string as input and computes and returns its reverse complement.
Write a program that asks the user for a sequence and, by using the previously defined function, computes and outputs its reverse complement.

Input: ``GGGGaaaaaaaatttatatat``

Output: ``atatataaattttttttCCCC``

In [11]:
DNACOMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}

def computeReverseComplement(seq):
    rcseq = ""
    size = len(seq)
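    for i in range(size-1,-1,-1):    #from the last character to the first
        chinseq = seq[i]
        #index with [] so that a character that is not a nucleotide raises KeyError
        #(.get would return None and the concatenation below would fail)
        chcompl = DNACOMPLEMENT[chinseq]
        rcseq += chcompl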
    return rcseq

inseq = raw_input()
revcomp = computeReverseComplement(inseq)
print(revcomp)


GGGGaaaaaaaatttatatat
atatataaattttttttCCCC
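
A more compact variant of the same computation, shown here only as an alternative sketch, builds the result with `reversed()` and `str.join` instead of an index loop. The function name `reverse_complement` is not part of the lesson code.

```python
DNACOMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A',
                 'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}

def reverse_complement(seq):
    # reversed() walks the string from the last character to the first;
    # DNACOMPLEMENT[ch] raises KeyError if ch is not a known nucleotide
    return "".join(DNACOMPLEMENT[ch] for ch in reversed(seq))

print(reverse_complement("GGGGaaaaaaaatttatatat"))
```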


### Palindromic Nucleotide Sequence
Write a program that receives as input a sequence of 6 nucleotides and outputs True if it is a palindrome with respect to DNA sequence matching (that is, the reverse complement is identical to the sequence itself).
Use the previously developed function.

For example: CACGTG is a palindrome

In [14]:
DNACOMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}

def computeReverseComplement(seq):
    rcseq = ""
    size = len(seq)
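    for i in range(size-1,-1,-1):    #from the last character to the first
        chinseq = seq[i]
        #index with [] so that a character that is not a nucleotide raises KeyError
        #(.get would return None and the concatenation below would fail)
        chcompl = DNACOMPLEMENT[chinseq]
        rcseq += chcompl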
    return rcseq

def isPalindrome4matching_slow(seq):
    revcomp = computeReverseComplement(seq)
    print(revcomp)
    size = len(revcomp)
    for i in range(0, size):
        if revcomp[i] != seq[i]:
            return False
    return True

inseq = raw_input()
ris = isPalindrome4matching_slow(inseq)
print(ris)

CACGTG
CACGTG
CACGTG
True
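
Since strings can be compared directly, the character-by-character loop is not strictly needed. The sketch below shows a shorter check under that assumption; the names `reverse_complement` and `is_reverse_palindrome` are illustrative and not part of the lesson code.

```python
DNACOMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A',
                 'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}

def reverse_complement(seq):
    # complement each character while reading the sequence backwards
    return "".join(DNACOMPLEMENT[ch] for ch in reversed(seq))

def is_reverse_palindrome(seq):
    # == compares the two strings character by character
    return seq == reverse_complement(seq)

print(is_reverse_palindrome("CACGTG"))
print(is_reverse_palindrome("GGGGAA"))
```

The comparison is case-sensitive, so a mixed-case input such as `CACGtg` is not reported as a palindrome.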


### Loading a dictionary from file
Write a function that receives as input the name of a file containing, separated by commas, the information on amino acid codes, and creates and returns a dictionary having the three-character code (all capitals) as key and the single-character code as value.

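File structure (the header names 10 columns while each data row has 11 fields: the three-letter and one-letter codes are separate fields, at indexes 1 and 2):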
Name,Abbr.,Molecular Weight,Molecular Formula,Residue Formula,Residue Weight,pKa,pKb,pKx,pl
Alanine,Ala,A,89.10,C3H7NO2,C3H5NO,71.08,2.34,9.69,–,6.00
Arginine,Arg,R,174.20,C6H14N4O2,C6H12N4O,156.19,2.17,9.04,12.48,10.76
Asparagine,Asn,N,132.12,C4H8N2O3,C4H6N2O2,114.11,2.02,8.80,–,5.41
Aspartic acid,Asp,D,133.11,C4H7NO4,C4H5NO3,115.09,1.88,9.60,3.65,2.77

Expected output:
amino_dict = {'ALA':'A','ARG':'R','ASN':'N','ASP':'D','CYS':'C','GLY':'G','GLN':'Q','GLU':'E','HIS':'H','ILE':'I','LEU':'L','LYS':'K','MET':'M','PHE':'F','PRO':'P','SER':'S','THR':'T','TRP':'W','TYR':'Y','VAL':'V'}

In [7]:
def loadAminoDictionary(filename):
    fin = open(filename, "r")
    header = fin.readline()  #read and discard first line
    aadict = {}
    fullinfo = {}
    for line in fin:
        aminoline = line.strip()
        aminoinfo = aminoline.split(",") # aminoinfo = line.strip().split(",")
        #print(aminoinfo)
        #['Alanine', 'Ala', 'A', '89.10', 'C3H7NO2', 'C3H5NO', '71.08', '2.34', '9.69', '\xe2\x80\x93', '6.00']
        aadict[aminoinfo[1].upper()] = aminoinfo[2].upper()
        fullinfo[aminoinfo[1].upper()] = aminoinfo
    fin.close()
    return aadict, fullinfo
        
srcfile = "amino_acids_table.csv"
d, dall = loadAminoDictionary(srcfile)
print(d)
print(dall)


{'ILE': 'I', 'GLN': 'Q', 'HYP': 'O', 'GLY': 'G', 'GLP': 'U', 'GLU': 'E', 'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'LYS': 'K', 'PRO': 'P', 'ASN': 'N', 'VAL': 'V', 'THR': 'T', 'HIS': 'H', 'TRP': 'W', 'PHE': 'F', 'ALA': 'A', 'MET': 'M', 'LEU': 'L', 'ARG': 'R', 'TYR': 'Y'}
{'ILE': ['Isoleucine', 'Ile', 'I', '131.18', 'C6H13NO2', 'C6H11NO', '113.16', '2.36', '9.60', '\xe2\x80\x93', '6.02'], 'GLN': ['Glutamine', 'Gln', 'Q', '146.15', 'C5H10N2O3', 'C5H8N2O2', '128.13', '2.17', '9.13', '\xe2\x80\x93', '5.65'], 'HYP': ['Hydroxyproline', 'Hyp', 'O', '131.13', 'C5H9NO3', 'C5H7NO2', '113.11', '1.82', '9.65', '\xe2\x80\x93', '\xe2\x80\x93'], 'GLY': ['Glycine', 'Gly', 'G', '75.07', 'C2H5NO2', 'C2H3NO', '57.05', '2.34', '9.60', '\xe2\x80\x93', '5.97'], 'GLP': ['Pyroglutamatic', 'Glp', 'U', '139.11', 'C5H7NO3', 'C5H5NO2', '121.09', '\xe2\x80\x93', '\xe2\x80\x93', '\xe2\x80\x93', '5.68'], 'GLU': ['Glutamic acid', 'Glu', 'E', '147.13', 'C5H9NO4', 'C5H7NO3', '129.12', '2.19', '9.67', '4.25', '3.22'], 'CYS': [
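
To try the parsing logic without the CSV file at hand, the same steps can be run on an in-memory sample. This is a sketch only: `SAMPLE` copies the header and two of the rows shown above, `load_amino_dictionary` is a hypothetical variant that takes an open file-like object instead of a file name, and it returns only the code dictionary.

```python
import io

# Header plus two rows copied from the file structure shown above
SAMPLE = u"""Name,Abbr.,Molecular Weight,Molecular Formula,Residue Formula,Residue Weight,pKa,pKb,pKx,pl
Arginine,Arg,R,174.20,C6H14N4O2,C6H12N4O,156.19,2.17,9.04,12.48,10.76
Aspartic acid,Asp,D,133.11,C4H7NO4,C4H5NO3,115.09,1.88,9.60,3.65,2.77
"""

def load_amino_dictionary(handle):
    handle.readline()                    # read and discard the header line
    aadict = {}
    for line in handle:
        fields = line.strip().split(",")
        if len(fields) < 3:              # skip blank or malformed lines
            continue
        # field 1 is the three-letter code, field 2 the one-letter code
        aadict[fields[1].upper()] = fields[2].upper()
    return aadict

amino_dict = load_amino_dictionary(io.StringIO(SAMPLE))
print(amino_dict)
```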

### Loading the codon table
Write a function that receives as input the name of a CSV file storing the correspondence between codons and amino acids and returns a dictionary having the triplet as key and the amino acid as value.
Write a function that receives as input an RNA sequence and computes and returns the corresponding protein.
Write a program that receives as input the name of a file storing an RNA sequence in FASTA format and the name of the file where the results are to be saved, computes the protein (to be saved in that file), and outputs codon usage statistics.
The file containing the encoding is called "codon.csv".
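
Codon usage statistics can be collected with a dictionary that maps each codon to the number of times it occurs. The sketch below is one possible approach, not the lesson's solution; the function name `codon_usage` and the short test sequence are assumptions made for illustration.

```python
def codon_usage(seq):
    counts = {}
    # step through the sequence three characters at a time,
    # ignoring an incomplete triplet at the end
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i+3]
        # get() returns 0 the first time a codon is seen
        counts[codon] = counts.get(codon, 0) + 1
    return counts

usage = codon_usage("AUGGCCAUGGCGCCCUGA")
for codon in sorted(usage):
    print(codon, usage[codon])
```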


Codon table (linearized):

GCU,A
GCC,A
GCA,A
GCG,A
CGU,R
CGC,R
...

FASTA file:

>A06662 Synthetic nucleotide sequence of the human GSH transferase pi gene. : Location:1..1000
UGGGACCAGUCAGCAGAGGCAGCGUGUGUGCGCGUGCGUGUGCGUGUGUGUGCGUGUGUG
UGUGUACGCUUGCAUUUGUGUCGGGUGGGUAAGGAGAUAGAGAUGGGCGGGCAGUAGGCC
...
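
A FASTA record of this kind consists of a header line starting with `>` followed by the sequence split over several lines. The following sketch parses such a record from a list of lines, assuming a single record per file; `parse_fasta` and `sample` are hypothetical names, and the sample holds only the lines shown above.

```python
def parse_fasta(lines):
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:                 # skip empty lines
            continue
        if line.startswith(">"):     # header line: keep the text after '>'
            header = line[1:]
        else:                        # sequence line: collect it
            chunks.append(line)
    # join the pieces once at the end instead of concatenating in the loop
    return header, "".join(chunks)

sample = [
    ">A06662 Synthetic nucleotide sequence of the human GSH transferase pi gene. : Location:1..1000",
    "UGGGACCAGUCAGCAGAGGCAGCGUGUGUGCGCGUGCGUGUGCGUGUGUGUGCGUGUGUG",
    "UGUGUACGCUUGCAUUUGUGUCGGGUGGGUAAGGAGAUAGAGAUGGGCGGGCAGUAGGCC",
]
header, seq = parse_fasta(sample)
print(header.split()[0])   # first word of the header, used here as the gene name
print(len(seq))
```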

Output file:



In [17]:
CODONTABLEFILE = "codon.csv"

def loadCodonAminoDictionary(filename):
    fin = open(filename, "r")
    header = fin.readline()  #read and discard first line
    dict = {}
    for line in fin:
        line = line.strip()
        info = line.split(",") # aminoinfo = line.strip().split(",")
        #print(info)
        #['GCU' , 'A']
        dict[info[0].upper()] = info[1].upper()
    fin.close()
    return dict

codondict = loadCodonAminoDictionary(CODONTABLEFILE)
print(codondict)

['GCC', 'A']
['GCA', 'A']
['GCG', 'A']
['CGU', 'R']
['CGC', 'R']
['CGA', 'R']
['CGG', 'R']
['AGA', 'R']
['AGG', 'R']
['UCU', 'S']
['UCC', 'S']
['UCA', 'S']
['UCG', 'S']
['AGU', 'S']
['AGC', 'S']
['AUU', 'I']
['AUC', 'I']
['AUA', 'I']
['AUU', 'I']
['AUC', 'I']
['AUA', 'I']
['UUA', 'L']
['UUG', 'L']
['CUU', 'L']
['CUC', 'L']
['CUA', 'L']
['CUG', 'L']
['GGU', 'G']
['GGC', 'G']
['GGA', 'G']
['GGG', 'G']
['AAU', 'N']
['AAC', 'N']
['GUU', 'V']
['GUC', 'V']
['GUA', 'V']
['GUG', 'V']
['GAU', 'D']
['GAC', 'D']
['ACU', 'T']
['ACC', 'T']
['ACA', 'T']
['ACG', 'T']
['UGU', 'C']
['UGC', 'C']
['CCU', 'P']
['CCC', 'P']
['CCA', 'P']
['CCG', 'P']
['CAA', 'Q']
['CAG', 'Q']
['GAA', 'E']
['GAG', 'E']
['CAU', 'H']
['CAC', 'H']
['AAA', 'K']
['AAG', 'K']
['UUU', 'F']
['UUC', 'F']
['UAU', 'Y']
['UAC', 'Y']
['AUG', 'M']
['UGG', 'W']
['AUG', 'START']
['UAG', 'STOP']
['UGA', 'STOP']
['UAA', 'STOP']
{'GUC': 'V', 'AUA': 'I', 'GUA': 'V', 'GUG': 'V', 'ACU': 'T', 'AAC': 'N', 'AGG': 'R', 'UGG': 'W', 'UAG': 'STOP', 'AGC

In [40]:
CODONTABLEFILE = "codon.csv"

def loadCodonAminoDictionary(filename):
    fin = open(filename, "r")
    header = fin.readline()  #read and discard first line
    cadict = {}
    for line in fin:
        line = line.strip()
        info = line.split(",") # aminoinfo = line.strip().split(",")
        #print(info)
        #['GCU' , 'A']
        cadict[info[0].upper()] = info[1].upper()
    fin.close()
    return cadict

codondict = loadCodonAminoDictionary(CODONTABLEFILE)
print(codondict)

{'ACC': 'T', 'AUA': 'I', 'AAG': 'K', 'AAA': 'K', 'GUU': 'V', 'AAC': 'N', 'AGG': 'R', 'UAU': 'Y', 'GUC': 'V', 'UAG': 'STOP', 'AGC': 'S', 'AUC': 'I', 'AGA': 'R', 'AAU': 'N', 'AGU': 'S', 'ACU': 'T', 'GUG': 'V', 'CAC': 'H', 'ACG': 'T', 'CAA': 'Q', 'CAG': 'Q', 'CCG': 'P', 'CCC': 'P', 'GGU': 'G', 'UCU': 'S', 'AUG': 'M', 'CGA': 'R', 'CCA': 'P', 'CGC': 'R', 'UGG': 'W', 'CGG': 'R', 'UCG': 'S', 'CCU': 'P', 'GGG': 'G', 'GGA': 'G', 'GGC': 'G', 'GAG': 'E', 'UCC': 'S', 'UAC': 'Y', 'GAC': 'D', 'GAA': 'E', 'GCA': 'A', 'GCC': 'A', 'CUU': 'L', 'UCA': 'S', 'GCG': 'A', 'UGA': 'STOP', 'CUG': 'L', 'UAA': 'STOP', 'AUU': 'I', 'CAU': 'H', 'CUA': 'L', 'UUU': 'F', 'CGU': 'R', 'GUA': 'V', 'UGC': 'C', 'UGU': 'C', 'ACA': 'T', 'CUC': 'L', 'UUG': 'L', 'UUA': 'L', 'GAU': 'D', 'UUC': 'F'}


### Translating RNA into Protein
The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term *genetic string* will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

Input: An RNA string s corresponding to a strand of mRNA, of length at most 10 kbp (1 kbp = 1000 base pairs).

Input: ``AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA``
Output: ``MAMAPRTEINSTRING``
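
The example can be reproduced without the codon file by using a reduced table. This is a sketch under stated assumptions: `CODONS` contains only the codons needed for this input (values taken from the codon dictionary printed above), `translate` is a hypothetical name, and, unlike the lesson's function, it stops at the first STOP codon instead of skipping it.

```python
# Reduced codon table: only the entries needed for the example input
CODONS = {
    "AUG": "M", "GCC": "A", "GCG": "A", "CCC": "P", "AGA": "R",
    "ACU": "T", "GAG": "E", "AUC": "I", "AAU": "N", "AGU": "S",
    "ACC": "T", "CGU": "R", "AUU": "I", "AAC": "N", "GGG": "G",
    "UGA": "STOP", "UAA": "STOP", "UAG": "STOP",
}

def translate(rna, table):
    protein = []
    for i in range(0, len(rna) - 2, 3):       # one complete triplet per step
        amino = table.get(rna[i:i+3], "*")    # "*" marks a codon missing from the table
        if amino == "STOP":                   # assumption: translation ends here
            break
        protein.append(amino)
    return "".join(protein)

print(translate("AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA", CODONS))
```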

In [41]:
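CODONTABLEFILE = "codon.csv"
UNKNOWN = "*"      #symbol written for a triplet that is not in the table
BEGIN = "START"    #markers used in the codon table, skipped by getProtein
END = "STOP"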

def loadCodonAminoDictionary(filename):
    fin = open(filename, "r")
    header = fin.readline()  #read and discard first line
    cadict = {}
    for line in fin:
        line = line.strip()
        info = line.split(",") # aminoinfo = line.strip().split(",")
        #print(info)
        #['GCU' , 'A']
        cadict[info[0].upper()] = info[1].upper()
    fin.close()
    return cadict

def getProtein(seq, codict):
    size = len(seq)
    prot = ""
    for i in range(0, size, 3):    #three elements at a time
        triplet = seq[i:i+3]  #UGG
        if triplet in codict:    #"in" replaces has_key(), which exists only in Python 2
            corr = codict[triplet]
            if corr != BEGIN and corr != END:    #skip the START/STOP markers
                prot += corr
        else:
            prot += UNKNOWN
    return prot

codondict = loadCodonAminoDictionary(CODONTABLEFILE)

rnaseq = raw_input()
protein = getProtein(rnaseq, codondict)
print(protein)

AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
MAMAPRTEINSTRING


In [28]:
CODONTABLEFILE = "codon.csv"
UNKNOWN = "*"
BEGIN = "START"
END = "STOP"

def loadCodonAminoDictionary(filename):
    fin = open(filename, "r")
    header = fin.readline()  #read and discard first line
    cadict = {}
    for line in fin:
        line = line.strip()
        info = line.split(",") # aminoinfo = line.strip().split(",")
        #print(info)
        #['GCU' , 'A']
        cadict[info[0].upper()] = info[1].upper()
    fin.close()
    return cadict

def loadFasta(filename):
    fin = open(filename, "r")
    header = fin.readline() #header
    seq = ""
    for line in fin: 
        seq += line.strip()
    fin.close()
    return seq

def getProtein(seq, codict):
    size = len(seq)
    prot = ""
    for i in range(0, size, 3):    #three elements at a time
        triplet = seq[i:i+3]  #UGG
        if triplet in codict:    #"in" replaces has_key(), which exists only in Python 2
            corr = codict[triplet]
            if corr != BEGIN and corr != END:    #skip the START/STOP markers
                prot += corr
        else:
            prot += UNKNOWN
    return prot
            

codondict = loadCodonAminoDictionary(CODONTABLEFILE)
srcFile = "rnaseq.FASTA" #raw_input()
dstFile = "protein_seq.FASTA" #raw_input()
rnaseq = loadFasta(srcFile)
protein = getProtein(rnaseq, codondict)
print(protein)



WDQSAEAACVRVRVRVCACVCVRLHLCRVGKEIEGGQAQVPKALNPLVWSLLRAGAIEKSEQGCVGLEGSSREASSK*FAIIWENPARDRQNGIESWQLKWTGFGTSLVVGSKQRRIWDSGGLAWGRRGCLRGWEGEDDTWWCLAGGGQGLCEGTARATEAFDP*VPEPGRQDLHCGRPGEHLA


In [38]:
CODONTABLEFILE = "codon.csv"
UNKNOWN = "*"
BEGIN = "START"
END = "STOP"

def loadCodonAminoDictionary(filename):
    fin = open(filename, "r")
    header = fin.readline()  #read and discard first line
    cadict = {}
    for line in fin:
        line = line.strip()
        info = line.split(",") # aminoinfo = line.strip().split(",")
        #print(info)
        #['GCU' , 'A']
        cadict[info[0].upper()] = info[1].upper()
    fin.close()
    return cadict

def loadFasta(filename):
    fin = open(filename, "r")
    header = fin.readline() #header
    seq = ""
    for line in fin: 
        seq += line.strip()
    fin.close()
    return seq

def getProtein(seq, codict):
    size = len(seq)
    prot = ""
    for i in range(0, size, 3):    #three elements at a time
        triplet = seq[i:i+3]  #UGG
        if triplet in codict:    #"in" replaces has_key(), which exists only in Python 2
            corr = codict[triplet]
            if corr != BEGIN and corr != END:    #skip the START/STOP markers
                prot += corr
        else:
            prot += UNKNOWN
    return prot
            

codondict = loadCodonAminoDictionary(CODONTABLEFILE)
srcFile = "rnaseq.FASTA" #raw_input()
dstFile = "protein_seq.FASTA" #raw_input()
rnaseq = loadFasta(srcFile)
protein = getProtein(rnaseq, codondict)
fout = open(dstFile, "w")
fout.write(protein)
fout.close()


Modify the previous program to extract the name of the gene, which is written as ``gene_protein`` on the first line of the output file, and translate the three successive reading frames, writing ``x-frame`` before each frame.
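
The three frames differ only in the position where reading starts (offset 0, 1 or 2). The sketch below shows how the triplets change with the offset; `codons_in_frame` and the short sequence are illustrative names and data, and incomplete triplets at the end are dropped here.

```python
def codons_in_frame(seq, frame):
    # start at the given offset and take complete triplets only
    return [seq[i:i+3] for i in range(frame, len(seq) - 2, 3)]

seq = "AUGGCCUGA"
frames = []
for frame in range(3):
    codons = codons_in_frame(seq, frame)
    frames.append(codons)
    print(str(frame) + "-frame", codons)
```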

In [39]:
CODONTABLEFILE = "codon.csv"
UNKNOWN = "*"
BEGIN = "START"
END = "STOP"
GENEUNKNOWN = "unknown"

def loadCodonAminoDictionary(filename):
    fin = open(filename, "r")
    header = fin.readline()  #read and discard first line
    cadict = {}
    for line in fin:
        line = line.strip()
        info = line.split(",") # aminoinfo = line.strip().split(",")
        #print(info)
        #['GCU' , 'A']
        cadict[info[0].upper()] = info[1].upper()
    fin.close()
    return cadict

def loadFasta(filename):
    fin = open(filename, "r")
    header = fin.readline() #header
    if header[0] == '>':
        geneinfo = header.split()
        gene = geneinfo[0]
    else:
        gene = GENEUNKNOWN
    seq = ""
    for line in fin: 
        seq += line.strip()
    fin.close()
    return gene, seq

def getProtein(seq, startat, codict):
    size = len(seq)
    prot = ""
    for i in range(startat, size, 3):    #three elements at a time
        triplet = seq[i:i+3]  #UGG
        if triplet in codict:    #"in" replaces has_key(), which exists only in Python 2
            #the START/STOP markers are kept in the output of each frame
            prot += codict[triplet]
        else:
            prot += UNKNOWN
    return prot

            

codondict = loadCodonAminoDictionary(CODONTABLEFILE)
srcFile = "rnaseq.FASTA" #raw_input()
dstFile = "protein_seq_3.FASTA" #raw_input()
genename, rnaseq = loadFasta(srcFile)
fout = open(dstFile, "w")
fout.write(genename + "_protein\n")
for i in range(0,3):
    protein = getProtein(rnaseq, i, codondict)
    fout.write(str(i) + "-frame\n")
    fout.write(protein + "\n")
fout.close()


MAMAPRTEINSTRINGSTOP
