# Python Collections (Mappings & Streams)

## Mappings
* A mapping is a mutable collection of key/value pairs in which each value is looked up by its key rather than by its position.
* Data structures implementing mappings include *associative arrays*, *lookup tables*, and *hash tables*.
<img src="images/dict.png">
***
### Dictionaries
* There is just one built-in mapping type in Python: `dict`, for *"dictionary"*.
* `dict` can be called with a collection argument to create a dictionary with the elements of the argument.
* The elements must be tuples or lists of two elements - a key and a value:

In [2]:
dict((('A','adenine'),('T', 'thymine'), ('C','cytosine'),('G','guanine')))

{'A': 'adenine', 'T': 'thymine', 'C': 'cytosine', 'G': 'guanine'}

* Dictionaries can also be written as a comma-separated list of key/value pairs enclosed in curly braces, with each key and value separated by a colon.
* Empty braces create an empty dictionary.
* The order within the braces does not affect lookups or equality comparisons. Since Python 3.7, a `dict` keeps its entries in insertion order; earlier versions imposed an arbitrary order of their own.

In [3]:
{'A': 'adenine', 'C': 'cytosine', 'G': 'guanine', 'T': 'thymine'}

{'A': 'adenine', 'C': 'cytosine', 'G': 'guanine', 'T': 'thymine'}
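
The points about empty braces and ordering can be checked with a short sketch. The variable names are illustrative only, and the ordering behavior assumes Python 3.7 or later.

```python
# Empty braces create an empty dictionary (not an empty set).
empty = {}

# Iteration follows the order in which keys were inserted (Python 3.7+).
bases = {'T': 'thymine', 'A': 'adenine'}
bases['C'] = 'cytosine'
key_order = list(bases)

# Equality compares contents only; the order of the pairs is ignored.
same_content = bases == {'A': 'adenine', 'C': 'cytosine', 'T': 'thymine'}

print(type(empty), len(empty))
print(key_order, same_content)
```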

* The keys of a mapping must be unique within the collection.
* `dict` keys must be hashable, so instances of mutable built-in types (lists, sets, dictionaries) cannot be used as keys.
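
A minimal sketch of both key rules, using made-up example data: a repeated key keeps only its last value, and a mutable key such as a list is rejected.

```python
# A repeated key keeps only the last value assigned to it.
codons = {'AUG': 'Met', 'AUG': 'START'}
n_entries = len(codons)

# Immutable values such as strings and tuples work as keys.
pair_counts = {('A', 'T'): 2, ('C', 'G'): 3}

# A list is mutable and unhashable, so Python raises TypeError.
try:
    bad = {['A', 'T']: 2}
    list_key_error = None
except TypeError as err:
    list_key_error = str(err)

print(codons, n_entries)
print(list_key_error)
```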

### Dictionary example: RNA codon translation table

In [10]:
RNA_codon_table = {
#                        Second Base
#        U             C             A             G
# U
    'UUU': 'Phe', 'UCU': 'Ser', 'UAU': 'Tyr', 'UGU': 'Cys',     # UxU
    'UUC': 'Phe', 'UCC': 'Ser', 'UAC': 'Tyr', 'UGC': 'Cys',     # UxC
    'UUA': 'Leu', 'UCA': 'Ser', 'UAA': '---', 'UGA': '---',     # UxA
    'UUG': 'Leu', 'UCG': 'Ser', 'UAG': '---', 'UGG': 'Trp',     # UxG
# C
    'CUU': 'Leu', 'CCU': 'Pro', 'CAU': 'His', 'CGU': 'Arg',     # CxU
    'CUC': 'Leu', 'CCC': 'Pro', 'CAC': 'His', 'CGC': 'Arg',     # CxC
    'CUA': 'Leu', 'CCA': 'Pro', 'CAA': 'Gln', 'CGA': 'Arg',     # CxA
    'CUG': 'Leu', 'CCG': 'Pro', 'CAG': 'Gln', 'CGG': 'Arg',     # CxG
# A
    'AUU': 'Ile', 'ACU': 'Thr', 'AAU': 'Asn', 'AGU': 'Ser',     # AxU
    'AUC': 'Ile', 'ACC': 'Thr', 'AAC': 'Asn', 'AGC': 'Ser',     # AxC
    'AUA': 'Ile', 'ACA': 'Thr', 'AAA': 'Lys', 'AGA': 'Arg',     # AxA
    'AUG': 'Met', 'ACG': 'Thr', 'AAG': 'Lys', 'AGG': 'Arg',     # AxG
# G
    'GUU': 'Val', 'GCU': 'Ala', 'GAU': 'Asp', 'GGU': 'Gly',     # GxU
    'GUC': 'Val', 'GCC': 'Ala', 'GAC': 'Asp', 'GGC': 'Gly',     # GxC
    'GUA': 'Val', 'GCA': 'Ala', 'GAA': 'Glu', 'GGA': 'Gly',     # GxA
    'GUG': 'Val', 'GCG': 'Ala', 'GAG': 'Glu', 'GGG': 'Gly'      # GxG
}


In [11]:
def translate_RNA_codon(codon):
    """RNA codon lookup from a dictionary"""
    return RNA_codon_table[codon]


In [12]:
translate_RNA_codon('GUG')

'Val'
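
Indexing with square brackets raises `KeyError` when the codon is not in the table. One possible way to handle that, sketched here with a hypothetical helper `translate_RNA_codon_safe` and a three-entry excerpt of the table, is `dict.get`, which returns a default instead.

```python
# Small excerpt of the codon table, repeated so the block runs on its own.
codon_excerpt = {'GUG': 'Val', 'AUG': 'Met', 'UAA': '---'}

def translate_RNA_codon_safe(codon, table=codon_excerpt, default=None):
    """Hypothetical helper: look up a codon, returning default if absent."""
    return table.get(codon.upper(), default)

known = translate_RNA_codon_safe('gug')
unknown = translate_RNA_codon_safe('XYZ')
print(known, unknown)
```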

In [7]:
RNA_codon_table

{'UUU': 'Phe',
 'UCU': 'Ser',
 'UAU': 'Tyr',
 'UGU': 'Cys',
 'UUC': 'Phe',
 'UCC': 'Ser',
 'UAC': 'Tyr',
 'UGC': 'Cys',
 'UUA': 'Leu',
 'UCA': 'Ser',
 'UAA': '---',
 'UGA': '---',
 'UUG': 'Leu',
 'UCG': 'Ser',
 'UAG': '---',
 'UGG': 'Trp',
 'CUU': 'Leu',
 'CCU': 'Pro',
 'CAU': 'His',
 'CGU': 'Arg',
 'CUC': 'Leu',
 'CCC': 'Pro',
 'CAC': 'His',
 'CGC': 'Arg',
 'CUA': 'Leu',
 'CCA': 'Pro',
 'CAA': 'Gln',
 'CGA': 'Arg',
 'CUG': 'Leu',
 'CCG': 'Pro',
 'CAG': 'Gln',
 'CGG': 'Arg',
 'AUU': 'Ile',
 'ACU': 'Thr',
 'AAU': 'Asn',
 'AGU': 'Ser',
 'AUC': 'Ile',
 'ACC': 'Thr',
 'AAC': 'Asn',
 'AGC': 'Ser',
 'AUA': 'Ile',
 'ACA': 'Thr',
 'AAA': 'Lys',
 'AGA': 'Arg',
 'AUG': 'Met',
 'ACG': 'Thr',
 'AAG': 'Lys',
 'AGG': 'Arg',
 'GUU': 'Val',
 'GCU': 'Ala',
 'GAU': 'Asp',
 'GGU': 'Gly',
 'GUC': 'Val',
 'GCC': 'Ala',
 'GAC': 'Asp',
 'GGC': 'Gly',
 'GUA': 'Val',
 'GCA': 'Ala',
 'GAA': 'Glu',
 'GGA': 'Gly',
 'GUG': 'Val',
 'GCG': 'Ala',
 'GAG': 'Glu',
 'GGG': 'Gly'}

* The standard-library function `pprint` ("pretty print") helps you see the structure of your data; by default it also sorts dictionary keys. To make it available, include the following line in your Python files:

In [8]:
from pprint import pprint as pp

In [9]:
pp(RNA_codon_table)

{'AAA': 'Lys',
 'AAC': 'Asn',
 'AAG': 'Lys',
 'AAU': 'Asn',
 'ACA': 'Thr',
 'ACC': 'Thr',
 'ACG': 'Thr',
 'ACU': 'Thr',
 'AGA': 'Arg',
 'AGC': 'Ser',
 'AGG': 'Arg',
 'AGU': 'Ser',
 'AUA': 'Ile',
 'AUC': 'Ile',
 'AUG': 'Met',
 'AUU': 'Ile',
 'CAA': 'Gln',
 'CAC': 'His',
 'CAG': 'Gln',
 'CAU': 'His',
 'CCA': 'Pro',
 'CCC': 'Pro',
 'CCG': 'Pro',
 'CCU': 'Pro',
 'CGA': 'Arg',
 'CGC': 'Arg',
 'CGG': 'Arg',
 'CGU': 'Arg',
 'CUA': 'Leu',
 'CUC': 'Leu',
 'CUG': 'Leu',
 'CUU': 'Leu',
 'GAA': 'Glu',
 'GAC': 'Asp',
 'GAG': 'Glu',
 'GAU': 'Asp',
 'GCA': 'Ala',
 'GCC': 'Ala',
 'GCG': 'Ala',
 'GCU': 'Ala',
 'GGA': 'Gly',
 'GGC': 'Gly',
 'GGG': 'Gly',
 'GGU': 'Gly',
 'GUA': 'Val',
 'GUC': 'Val',
 'GUG': 'Val',
 'GUU': 'Val',
 'UAA': '---',
 'UAC': 'Tyr',
 'UAG': '---',
 'UAU': 'Tyr',
 'UCA': 'Ser',
 'UCC': 'Ser',
 'UCG': 'Ser',
 'UCU': 'Ser',
 'UGA': '---',
 'UGC': 'Cys',
 'UGG': 'Trp',
 'UGU': 'Cys',
 'UUA': 'Leu',
 'UUC': 'Phe',
 'UUG': 'Leu',
 'UUU': 'Phe'}


<img src="images/dictop.png" ><br>
<img src="images/dictm.png">

* The last three methods in the table return "sequence-like objects" (dictionary views): they aren't sequences, but they can be used as if they were in many contexts.

In [10]:
list(RNA_codon_table.keys())

['UUU',
 'UCU',
 'UAU',
 'UGU',
 'UUC',
 'UCC',
 'UAC',
 'UGC',
 'UUA',
 'UCA',
 'UAA',
 'UGA',
 'UUG',
 'UCG',
 'UAG',
 'UGG',
 'CUU',
 'CCU',
 'CAU',
 'CGU',
 'CUC',
 'CCC',
 'CAC',
 'CGC',
 'CUA',
 'CCA',
 'CAA',
 'CGA',
 'CUG',
 'CCG',
 'CAG',
 'CGG',
 'AUU',
 'ACU',
 'AAU',
 'AGU',
 'AUC',
 'ACC',
 'AAC',
 'AGC',
 'AUA',
 'ACA',
 'AAA',
 'AGA',
 'AUG',
 'ACG',
 'AAG',
 'AGG',
 'GUU',
 'GCU',
 'GAU',
 'GGU',
 'GUC',
 'GCC',
 'GAC',
 'GGC',
 'GUA',
 'GCA',
 'GAA',
 'GGA',
 'GUG',
 'GCG',
 'GAG',
 'GGG']
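
A short sketch of how these sequence-like objects behave, assuming the three methods in question are `keys()`, `values()`, and `items()`, and using a two-entry table invented for the example.

```python
table = {'UUU': 'Phe', 'UUA': 'Leu'}
keys = table.keys()

# Views support len(), membership tests, and iteration...
n = len(keys)
has_uuu = 'UUU' in keys

# ...and they reflect later changes to the dictionary.
table['AUG'] = 'Met'
keys_after = list(keys)

# Views cannot be indexed; convert to a list first if positions are needed.
try:
    keys[0]
    indexable = True
except TypeError:
    indexable = False

pairs = list(table.items())
print(n, has_uuu, keys_after, indexable)
print(pairs)
```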

## Streams

* A stream is a temporally ordered sequence of indefinite length, usually limited to one type of element.
* Each stream has two ends: a source that provides the elements and a sink that absorbs the elements.
* The more common kinds of stream sources are files, network connections, and the output of a kind of function called a `generator`. 
* Files and network sources are also common kinds of sinks.
***
### Files
* A Python file is an object that is an `interface` to an external file, not the file itself.
* File objects provide methods for reading, writing, and managing their instances.
* Depending on a parameter supplied when an instance is created, the elements of the file object are either bytes or Unicode characters. 
* Some methods treat files as streams of bytes or characters, and other methods treat them as streams of lines of bytes or characters.
* Most of the time a file object is a one-way sequence: it can either be read from or written to. 
* It is possible to create a file object that is a two-way stream, though it would be more accurate to say it is a pair of streams—one for reading and one for writing—that just happen to connect to the same external file. 
* When a file object is created for writing, any existing file at the same path is emptied.
* File objects can be created to append instead, though, so that data is written to the end of an existing file.
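
The difference between writing and appending can be sketched as follows. The file name `notes.txt` and the temporary directory are assumptions made so the example does not touch real data.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'notes.txt')

with open(path, 'w') as f:      # 'w' creates the file, or empties an existing one
    f.write('first\n')
with open(path, 'w') as f:      # opening with 'w' again discards 'first'
    f.write('second\n')
with open(path, 'a') as f:      # 'a' keeps the contents and writes at the end
    f.write('third\n')
with open(path) as f:           # the default mode is 'r': read, text
    contents = f.read()

print(repr(contents))
```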
#### Working with file objects
* The built-in function `open(path, mode)` creates a file object representing the external file at the operating system location specified by the string `path`.
* The default use is reading, and the default interpretation is text.
***
<img src="images/fileopen.png">
* Call the `close()` method to close a file object when it is no longer needed.
* The `with` statement is used to open and name a file, then automatically close the file regardless of whether an error occurs during the execution of its statements.
> `with open(path, mode) as name:`
>     `statements using name`
* More than one file can be opened with the same `with` statement, as when reading from one and writing to the other.
> `with open(path1, mode1) as name1, open(path2, mode2) as name2, ... :`
>     `statements using names`
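
As an illustration of opening two files in one `with` statement, the sketch below copies a small file to another in upper case. The file names and contents are invented for the example.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'in.txt')
dst_path = os.path.join(workdir, 'out.txt')

# Create a small input file to read from.
with open(src_path, 'w') as src:
    src.write('acgt\nttga\n')

# Read from one file and write to the other in a single with statement.
with open(src_path) as src, open(dst_path, 'w') as dst:
    for line in src:
        dst.write(line.upper())

# Both files were closed automatically when the block ended.
both_closed = src.closed and dst.closed

with open(dst_path) as dst:
    result = dst.read()
print(repr(result), both_closed)
```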

#### File reading
* `fileobj.read([count])` - Reads count characters (bytes, if the file was opened in binary mode), or until the end of the file, whichever comes first; if count is omitted, reads everything until the end of the file. If at the end of the file, returns an empty string. This method treats the file as an input stream of characters.
* `fileobj.readline([count])` - Reads one line from the file object and returns the entire line, including the end-of-line character; if count is present, reads at most count characters. If at the end of the file, returns an empty string. This method treats the file as an input stream of lines.
* `fileobj.readlines()` - Reads lines of a file object until the end of the file is reached and returns them as a list of strings; this method treats the file as an input stream of lines.
***
#### File Writing
* `fileobj.write(string)` - Writes string to fileobj, treating it as an output stream of characters.
* `fileobj.writelines(sequence)` - Writes each element of sequence, which must all be strings, to fileobj, treating it as an output stream of lines. No end-of-line characters are added, so each string must already end with one if separate lines are wanted.
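
The reading and writing methods can be tried together on a throwaway file. This is a sketch; the path and the file contents are assumptions chosen for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'lines.txt')

with open(path, 'w') as f:
    f.write('>seq1\n')                    # write() adds nothing extra
    f.writelines(['ACGT\n', 'TTGA\n'])    # writelines() adds no newlines either

with open(path) as f:
    first_four = f.read(4)       # the next four characters
    rest_of_line = f.readline()  # the remainder of the current line
    remaining = f.readlines()    # all lines that are left, as a list
    at_end = f.read()            # empty string at the end of the file

print(repr(first_four), repr(rest_of_line), remaining, repr(at_end))
```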

In [1]:
def read_FASTA_strings(filename):
    """Read FASTA sequence from a file"""
    with open(filename) as file:
        return file.read().split('>')[1:]

In [2]:
seqs = read_FASTA_strings("data/aa003.fasta")

In [3]:
seqs

['gi|6693803|gb|AAF24990.1|AF121349_4 (AF121349) late expression factor 5 [Neodiprion sertifer nucleopolyhedrovirus]\nMPPCSEKTLKDIEEIFLKFRRKKKWEDLIRYLKYKQPKCVKTFNLTGTGHKYHAMWAYNPITDKREKKQISLDVMKIQEL\nHRITNNNSKLYVEIRKIMTDDHRCPCEEIKNYMQQIAEYKNNRSNKVFNTPPTKIVPNALEKILKNFTINLMIDKKPKKK\nITKSAHTIKHPPVLNIDYEHTLEFAGQTTVKEICKHASLGDTIEIQNRSFDEMVNLYTTCVQCKQMYKIQ\n',
 'gi|6693805|gb|AAF24991.1| (AF125506) astacin family metalloendopeptidase FARM-1 [Hydra vulgaris]\nMSSSNHIHVLRAIDEYHKHTCLKFVKRTNQDAYLSFYPGGGCSSLVGYVRGRINDVSLAGGCLRLGTVMHEIGHSIGLYH\nEQSRPDRDDHVTIIWNNIQSNMRFNFDKFDRNKINSLGFPYDYESMMHYESNAFGGGQVTIRTKDPSKQKLIGNRQGFSE\nIDKQQINAMYNCNRGGSTLPPSVPPTVSPVAQCVEGQDLDNRCLGWATSGYCTATDPAHLETMKKKCCKSCKESAICNDK\nNTRCDEWAKKGECKANPNWMLGNCSKSCLVC\n',
 'gi|6693816|gb|AAF24994.1|AF129447_1 (AF129447) RpoB [Klebsiella ornithinolytica]\nAAVKEFFGSSQLSQFMDQNNPLSEITHKRRISALGPGGLTRERAGFEVRDVHPTHYGRVCPIETPEGPNIGLINSLSVYA\nQTNEYGFLETPYRKVTDGVVTDEIHYLSAIEEGNYVIAQANSNLDDEGHFVEDLVTCRSKGESSLFSRDQVDYMDVSTQQ\nVVSVGGSSERVL\

* Problems with output:
 - the description line preceding each sequence is part of the sequence string
 - the string contains internal newline characters.

### Generators

<img src="images/gen1.png">
* A generator is an object that returns values, one per request, from a series it computes.
* Advantages of generators:
 * A generator can produce an infinitely large series of values, such as an endless stream of random integers, each drawn with `random.randint` (which is itself an ordinary function, not a generator).
 * A generator can encapsulate significant computation with the caller requesting values until it finds one that meets some condition
 * A generator can take the place of a list when the list is so long and/or its values are so large that creating the entire list before processing its elements would use enormous amounts of memory.
* A value is obtained from a generator by calling the built-in function `next` with the generator object as its argument. 
* The function that produced the generator object resumes its execution until a `yield` statement is encountered. The value of the `yield` is returned as the value of `next`.
* The values of parameters and names assigned in the function are retained between calls.
> `next(generator[, default])` - Gets the next value from the generator object; if the generator has no more values to produce, returns `default`, raising `StopIteration` if no default value was specified.
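
A small sketch of `next` with and without a default value. The generator function `gen_two` is defined only for this example.

```python
def gen_two():
    yield 1
    yield 2

g = gen_two()
a = next(g)             # runs the function up to the first yield
b = next(g)             # resumes and runs up to the second yield
c = next(g, 'done')     # generator is exhausted, so the default is returned

# Without a default, an exhausted generator raises StopIteration.
try:
    next(g)
    raised = False
except StopIteration:
    raised = True

print(a, b, c, raised)
```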

In [1]:
def genTest():
    yield 1
    yield 2

In [2]:
genTest()

<generator object genTest at 0x7fc5d862e5c8>

In [7]:
foo = genTest()

In [8]:
next(foo)

1

In [9]:
for n in genTest():
    print(n)

1
2


In [11]:
def genFib():
    fibn_1 = 1  # fib(n - 1)
    fibn_2 = 0  # fib(n - 2)
    while True:
        fibn = fibn_1 + fibn_2  # fib(n) = fib(n - 1) + fib(n - 2)
        yield fibn              # named fibn to avoid shadowing built-in next
        fibn_2 = fibn_1
        fibn_1 = fibn

In [13]:
fib = genFib()
for i in range(10):
    print(next(fib))

1
2
3
5
8
13
21
34
55
89
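
Because the series never ends, the caller decides when to stop. The sketch below shows two common ways: taking a fixed number of values with `itertools.islice`, and requesting values until one meets a condition. `gen_fib` is a compact restatement of the generator, written here so the block runs on its own.

```python
from itertools import islice

def gen_fib():
    """Yield the series 1, 2, 3, 5, 8, ... without end."""
    prev, curr = 0, 1
    while True:
        prev, curr = curr, prev + curr
        yield curr

# Take exactly ten values from the infinite series.
first_ten = list(islice(gen_fib(), 10))

# Keep requesting values until one exceeds 1000, then stop.
first_big = next(n for n in gen_fib() if n > 1000)

print(first_ten)
print(first_big)
```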


## Comprehensions

* A *comprehension* creates a set, list, or dictionary from the results of evaluating an expression for each element of another collection.
* Each kind of comprehension is written surrounded by the characters used to surround the corresponding type of collection value: brackets for lists, and braces for sets and dictionaries.
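
The three kinds of comprehension side by side, as a sketch over a small made-up list of codons:

```python
codons = ['AUG', 'UUU', 'AUG', 'GGG']

# Set comprehension: braces, one expression; duplicates collapse.
first_bases = {c[0] for c in codons}

# List comprehension: brackets; one result per element, order kept.
lowercase = [c.lower() for c in codons]

# Dictionary comprehension: braces, key: value; repeated keys collapse.
lengths = {c: len(c) for c in codons}

print(first_bases, lowercase, lengths)
```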

### List comprehensions
The simplest form of list comprehension is:
`[expression for item in collection]`

In [None]:
def validate_base_sequence(base_sequence, RNAflag = False):
    valid_bases = 'UCAG' if RNAflag else 'TCAG'
    return all([(base in valid_bases)
                for base in base_sequence.upper()])
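
A possible variant, not part of the original notebook: dropping the square brackets turns the list comprehension into a generator expression, so `all` can stop at the first invalid base instead of building the whole list first. The name `validate_base_sequence_lazy` is hypothetical.

```python
def validate_base_sequence_lazy(base_sequence, RNAflag=False):
    """Variant of validate_base_sequence that stops at the first invalid base."""
    valid_bases = 'UCAG' if RNAflag else 'TCAG'
    return all(base in valid_bases for base in base_sequence.upper())

dna_ok = validate_base_sequence_lazy('atcg')
dna_bad = validate_base_sequence_lazy('AUCG')             # U is not a DNA base
rna_ok = validate_base_sequence_lazy('AUCG', RNAflag=True)
print(dna_ok, dna_bad, rna_ok)
```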

In [4]:
from random import randint

def random_base(RNAflag = False):
    return ('UCAG' if RNAflag else 'TCAG')[randint(0,3)]

def random_codon(RNAflag = False):
    return random_base(RNAflag) + random_base(RNAflag) + random_base(RNAflag)

def random_codons(minlength = 3, maxlength = 10, RNAflag = False):
    """Generate a random list of codons (RNA if RNAflag, else DNA)
    between minlength and maxlength, inclusive"""
    return [random_codon(RNAflag)
            for n in range(randint(minlength, maxlength))]

In [5]:
minlength = 3
maxlength = 10
RNAflag = True

In [7]:
randnum = randint(minlength, maxlength)
randnum

4

In [8]:
[n for n in range(randnum)]

[0, 1, 2, 3]

In [9]:
[random_codon(RNAflag) for n in range(randnum)]

['CGC', 'CGG', 'UCC', 'GUC']

In [13]:
def random_codons_translation(minlength = 3, maxlength = 10):
    """Generate a random list of amino acids by translating between
    minlength and maxlength (inclusive) random RNA codons"""
    return [translate_RNA_codon(codon) for codon in
            random_codons(minlength, maxlength, True)]
    

In [14]:
random_codons_translation()

['Glu', 'Pro', 'Ser', 'Asp', 'Thr', '---', 'Thr', 'Ser', 'His', 'His']

In [15]:
def test():
    print()
    print(random_base())
    print(random_base())
    print(random_base(False))
    print(random_base(False))
    print()
    print(random_base(True))
    print(random_base(True))
    print(random_base(True))
    print(random_base(True))
    print()
    print(random_codon())
    print(random_codon(False))
    print(random_codon(True))
    print()
    print(random_codons())
    print(random_codons())
    print(random_codons())
    print(random_codons())
    print()
    print(random_codons(6))
    print(random_codons(6, 15))
    print()
    print(random_codons(RNAflag = True))
    print(random_codons(RNAflag = True))
    print()
    print(random_codons_translation())
    print(random_codons_translation(5))
    print()
    print(random_codons_translation(8, 12))
    print(random_codons_translation(8, 12))
test()


C
G
C
C

G
U
A
G

GGT
TGT
ACA

['TCA', 'TCG', 'TAG', 'CCT', 'TCT', 'AGC', 'CTC', 'TGG', 'GGC']
['TAA', 'CTA', 'ATG', 'ATG', 'TAA', 'AGA', 'ATG', 'TCT', 'TTC', 'CGG']
['CCA', 'AAC', 'GAC', 'TCC', 'GGT', 'AGC', 'AGA', 'TGG', 'TGC']
['TTT', 'ATG', 'GGC', 'CAG']

['AGA', 'ACT', 'TAG', 'ACG', 'GTT', 'GCG', 'CCT', 'ATT', 'TAA', 'AAC']
['TAG', 'TCC', 'TAA', 'TGC', 'GAA', 'ATG']

['CAA', 'GAA', 'CCG', 'AGU', 'UUA', 'GAC', 'CGU', 'GAC']
['UGG', 'AUG', 'UAC', 'AGU', 'GAC', 'GGC', 'CUC', 'UGG', 'CAG', 'ACU']

['Arg', 'Ile', 'Asn', 'Ala', 'Ile', 'Leu', 'Ser']
['Leu', '---', 'Asn', 'Thr', '---', '---', 'Cys', 'Pro', 'Pro']

['Ser', 'Ile', 'Cys', 'Ala', 'Thr', 'Val', 'Gly', 'Asp', 'Ser', 'Arg', 'Ala']
['Val', 'Trp', 'Thr', 'His', 'Arg', 'Thr', 'Ile', 'Phe']


#### Revisit FASTA reader

In [None]:
def read_FASTA_entries(filename):
    return [seq.partition('\n') for seq in read_FASTA_strings(filename)]

In [None]:
def read_FASTA_sequences(filename):
    return [[seq[0], seq[2].replace('\n', '')]           # delete newlines
             for seq in read_FASTA_entries(filename)]

In [None]:
def read_FASTA_sequences_unpacked(filename):
    return [(info, seq.replace('\n', ''))
            for info, ignore, seq in                     # ignore is ignored (!)
            read_FASTA_entries(filename)]
    

In [None]:
def read_FASTA_sequences_and_info(filename):
    return [[seq[0].split('|'), seq[1]] for seq in
            read_FASTA_sequences(filename)]

In [None]:
filename = 'data/aa003.fasta'

seqs = read_FASTA_strings(filename)

seqs = read_FASTA_entries(filename)

seqs = read_FASTA_sequences(filename)

seqs = read_FASTA_sequences_unpacked(filename)

seqs = read_FASTA_sequences_and_info(filename)
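
To see what each stage returns without needing `data/aa003.fasta`, the sketch below runs the first three readers on a tiny FASTA file written to a temporary directory. The two entries are invented for illustration and are not real sequences; the function definitions are repeated so the block runs on its own.

```python
import os
import tempfile

# Two made-up entries in FASTA layout: a '>' description line, then sequence lines.
fasta_text = ('>id1|gb|X1| first protein\nMPPC\nSEKT\n'
              '>id2|gb|X2| second protein\nMSSS\n')
path = os.path.join(tempfile.mkdtemp(), 'tiny.fasta')
with open(path, 'w') as f:
    f.write(fasta_text)

def read_FASTA_strings(filename):
    with open(filename) as file:
        return file.read().split('>')[1:]

def read_FASTA_entries(filename):
    return [seq.partition('\n') for seq in read_FASTA_strings(filename)]

def read_FASTA_sequences(filename):
    return [[seq[0], seq[2].replace('\n', '')]
            for seq in read_FASTA_entries(filename)]

strings = read_FASTA_strings(path)      # description and sequence in one string
entries = read_FASTA_entries(path)      # (description, '\n', sequence) tuples
sequences = read_FASTA_sequences(path)  # [description, sequence], newlines removed
info_fields = sequences[0][0].split('|')

print(strings[0])
print(entries[0])
print(sequences)
print(info_fields)
```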
