## Learning objectives


1. Python data structures

2. Nesting data structures

3. Copying data structures

4. Using dictionary methods to access data

5. Importing modules and getting command line arguments

6. Applying data structures in "for" loops

---


## Python data structures

So far, we've talked about lists. Now let's look at the other three data structures Python offers.

- tuples
- sets
- dictionaries

### Tuples

Tuples are a lot like lists in that they're ordered and can hold anything, but unlike lists, once they're created they can't be changed. We define a tuple using parentheses, `()`, or the function `tuple()`, which takes a list (or any other iterable) as an argument. If the tuple contains only one item, you need a comma after it: `(X,)`.

In [1]:
A = (1, 2, 3)

# What happens if we try to assign a new item to A[1]?
A[1] = 5

TypeError: 'tuple' object does not support item assignment

Tuples are good for things that you don't want or need to change. We'll see a good example of such a use case in a moment. Just like lists, you can find the number of entries with `len`.

In [2]:
A = (1, 2, 3)
print(len(A))

3
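As a small sketch of the two creation rules described above (the variable names are only illustrative): the trailing comma is what makes a one-item tuple, and `tuple()` accepts any iterable, not just a list.

```python
# Parentheses alone do not make a tuple; the trailing comma does
not_a_tuple = (5)
one_item = (5,)
print(type(not_a_tuple).__name__)
print(type(one_item).__name__)

# tuple() accepts any iterable, such as a list or a string
from_list = tuple([1, 2, 3])
from_string = tuple("abc")
print(from_list)
print(from_string)
```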


### Sets

A set is a collection of unique, immutable ("hashable") objects. This means that the objects in a set cannot be changed once created. Sets are unordered but have really fast lookup. They also have set-specific operations like intersection and union. Sets are defined using the function set(), called either with no argument (an empty set) or with a list or tuple as its argument.

In [3]:
A = set([1, 2, 3])

# We can check if the set contains something with "in"
print(1 in A)

# We can add things to the set with the "add" method
A.add(4)

# We can remove things using the "remove" method
A.remove(1)

# We can convert to a list, but the order cannot be counted on
print(list(A))

# But what if we put in a mutable object?
A.add([1])

True
[2, 3, 4]


TypeError: unhashable type: 'list'

Sets are implemented with a hash table, which is why lookup is fast and why every item must be hashable.
![Hash_table_5_0_1_1_1_1_1_LL.svg](attachment:Hash_table_5_0_1_1_1_1_1_LL.svg)
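The set-specific operations mentioned above could look like the following sketch. It uses the `{...}` literal, which is another way to write a non-empty set, and the names `common`, `combined` and `only_a` are illustrative.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

# Items present in both sets
common = A & B      # same as A.intersection(B)
# Items present in either set
combined = A | B    # same as A.union(B)
# Items in A that are not in B
only_a = A - B      # same as A.difference(B)

# sorted() gives a predictable order for printing
print(sorted(common))
print(sorted(combined))
print(sorted(only_a))
```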

### Dictionaries

A dictionary is a collection of key-value pairs. Keys can be anything immutable (hashable), and values can be anything. In the Python version used here (3.5), dictionaries are unordered; since Python 3.7 they preserve insertion order. Either way, they have fast lookup because, like sets, they use a hash table. Dictionaries are defined using curly braces, `{}`, or the function `dict()`.

In [4]:
A = {1: "a", "b": 2}

# We can add to a dictionary by defining a new value for a key
A["c"] = 3

# We can remove a value using "del"
del A[1]

# Also, dictionaries can use "pop"
v = A.pop("c")
print("Pop:")
print(v)

# We can check for a key with "in"
print("\nIn:")
print("b" in A)

# If we want a list-like object for a key, convert to tuple
B = [1, 2, 3]
A[tuple(B)] = "list key"

print("\nDictionary contents:")
print(A)

# We can also extract keys, values, or both from a dictionary
# We need to convert to list if not using in a for loop
keys = list(A.keys())
print("\nKeys:")
print(keys)

values = list(A.values())
print("\nvalues:")
print(values)

both = list(A.items())
print("\nBoth:")
print(both)

# One other really useful method for dictionaries, setdefault()
A.setdefault("new", "entry")
A.setdefault("b", "no_used")
print("\nDictionary contents:")
print(A)

# Along those lines, we can use the method "get" if we want a default value if the key is missing
print("\nGet:")
print(A.get("new", "missing"))
print(A.get("not_there", "missing"))

Pop:
3

In:
True

Dictionary contents:
{'b': 2, (1, 2, 3): 'list key'}

Keys:
['b', (1, 2, 3)]

values:
[2, 'list key']

Both:
[('b', 2), ((1, 2, 3), 'list key')]

Dictionary contents:
{'new': 'entry', 'b': 2, (1, 2, 3): 'list key'}

Get:
entry
missing
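The difference between `get` and `setdefault` is easy to miss in the output above, so here is a minimal sketch that isolates it. The dictionary `counts` is an illustrative stand-in, not part of the lesson data.

```python
counts = {"b": 2}

# get() only reads: the missing key is not added
value = counts.get("x", 0)
print(value, "x" in counts)

# setdefault() reads and, if the key is missing, also stores the default
value = counts.setdefault("x", 0)
print(value, "x" in counts)

# An existing key is left untouched by setdefault()
counts.setdefault("b", 99)
print(counts["b"])
```

This is why `setdefault` is handy when building up a dictionary, while `get` is the better choice when you only want to look something up.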


## Importing modules

Modules are collections of code that someone has written and that can be brought into your project and used. They can give you access to functions, classes and variables. There are several very useful modules built into Python. One of them, `copy`, comes up later in this lesson. For now, let's look at one that you will use in almost every script, `sys`.

In [5]:
import sys

# To use something defined in sys, prefix its name with "sys."
print(type(sys.stdout))

# We could also use from sys import to get a specific object or objects
from sys import stdout
print(type(stdout))
      
# We can also rename the module within our program
import sys as mysys
print(type(mysys.stdout))

<class 'ipykernel.iostream.OutStream'>
<class 'ipykernel.iostream.OutStream'>
<class 'ipykernel.iostream.OutStream'>


A program that can't be given different information each time it runs is of limited use. Let's see how to get values given as arguments when running the program.
`sys.argv` is a list containing everything on the command line, starting with the Python script name. So `sys.argv[0]` is the script name and the arguments passed to the script follow it. To work with them, you can copy them from `sys.argv`.

In [6]:
import sys
arguments = list(sys.argv)
print(arguments)

['/Users/msauria/miniconda2/envs/py3/lib/python3.5/site-packages/ipykernel_launcher.py', '-f', '/Users/msauria/Library/Jupyter/runtime/kernel-aa970173-6bdf-4cbd-a123-7082cc1ec0db.json']
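Inside a notebook, `sys.argv` holds the kernel launcher's arguments, as the output above shows. To illustrate what a script would see, here is a sketch that passes a simulated argument list to a hypothetical helper, `split_arguments`. The script name `count_genes.py` and the file name `data.ctab` are made up for the example.

```python
import sys

def split_arguments(argv):
    """Separate the script name from the arguments that follow it."""
    script_name = argv[0]
    arguments = argv[1:]
    return script_name, arguments

# Simulate running: python count_genes.py data.ctab 5
# (hypothetical script and file names, used only for illustration)
name, args = split_arguments(["count_genes.py", "data.ctab", "5"])
print(name)
print(args)

# Command line arguments always arrive as strings
threshold = int(args[1])
print(threshold + 1)

# In a real script you would pass sys.argv itself:
# name, args = split_arguments(sys.argv)
```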


## A little more about "for" loops


Because `for` loops are such a critical tool in Python, let's make sure that we understand them. A for loop does one thing: it takes items from an ordered set of objects and puts them, one at a time, into a variable that can then be used in the body of the for loop.


The ordered set of objects can be a list, a tuple, the characters of a string, or an iterator. An iterator is like a list that doesn't hold everything in memory at once but instead fetches or creates the next item when it is needed. We've seen two objects that behave this way so far, `range` and filehandles.



In [7]:
A = ['a', 'b', 'c']
for i in A:
    print(i)

A = range(3)
for i in A:
    print(i)

a
b
c
0
1
2
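To make the "one item at a time" idea concrete, this sketch does by hand roughly what a for loop does for us, using the built-in functions `iter` and `next`. The variable names are illustrative.

```python
A = ['a', 'b', 'c']

# iter() asks the list for an iterator over its items
it = iter(A)

# next() fetches one item at a time
first = next(it)
second = next(it)
third = next(it)
print(first, second, third)

# When nothing is left, next() raises StopIteration,
# which is the signal a for loop uses to stop
try:
    next(it)
    finished = False
except StopIteration:
    finished = True
print(finished)
```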


The basic syntax of the for loop doesn't change, regardless of what you are looping over. You just need the keyword `for`, a variable to put values in, the keyword `in`, the ordered set of objects, and a colon to indicate the beginning of the body of the for loop. Just be careful not to change the length of the ordered set of items you are looping through inside your for loop, as this can cause errors or silently skipped items.
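Here is a small sketch of what can go wrong when the length changes during a loop, along with one common workaround, looping over a copy. With a list the problem is silent: an item gets skipped. Dictionaries and sets raise a `RuntimeError` instead. The names `risky` and `safe` are illustrative.

```python
values = [1, 2, 2, 3]

# Risky: removing items from the list being looped over skips entries
risky = list(values)
for v in risky:
    if v == 2:
        risky.remove(v)
print(risky)  # one of the 2s survives

# Safer: loop over a copy and modify the original
safe = list(values)
for v in list(safe):
    if v == 2:
        safe.remove(v)
print(safe)
```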

For loops can also implicitly unpack things, i.e. take a variable that contains multiple ordered entries and assign each to a different variable.

In [8]:
A = [("a", 1), ["b", 2]]
for a, b in A:
    print("{} and {}".format(a, b))

a and 1
b and 2


## Combining tuples and lists


As we saw, tuples are immutable, meaning that they can't be changed once created. However, we also saw that when you create a list, what you have really created is a reference to the chain of items in the list. So, what happens when you have a list in a tuple?




In [9]:
L = [1,2,3]
T1 = (L, 'A', 'B')
print(T1)

([1, 2, 3], 'A', 'B')


But, what happens if we try to change something in the list? A list is mutable, but it is contained in a tuple, which isn't.




In [10]:
L[0] = 9
print(T1)

([9, 2, 3], 'A', 'B')


Okay, it works! That is because the tuple only holds a reference to the list, and that reference isn't changing. Only the contents of the list it points to are.
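One way to check this for yourself is with the `is` operator, which tests whether two names refer to the same object. The sketch below repeats the example with fresh variables. `same_before`, `same_after` and `replaced` are illustrative names.

```python
L = [1, 2, 3]
T1 = (L, 'A', 'B')

# The tuple stores a reference to the very same list object
same_before = T1[0] is L

# Changing the list's contents does not change which object the tuple holds
L[0] = 9
same_after = T1[0] is L
print(same_before, same_after)

# Pointing the tuple at a different object is what is forbidden
try:
    T1[0] = [4, 5, 6]
    replaced = True
except TypeError:
    replaced = False
print(replaced)
```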


## Shallow and deep copy


We've seen how data structures can be nested. Now what if we need to have two copies of them?




In [11]:
T2 = tuple(T1)
print(T2)

([9, 2, 3], 'A', 'B')


But if we alter the list of T1, what happens to T2?




In [12]:
T1[0][0] = 7
print(T1)
print(T2)

([7, 2, 3], 'A', 'B')
([7, 2, 3], 'A', 'B')


What we have done is make a shallow copy: T2 is a new tuple, but it refers to the same list as T1. If we need a copy of all of the elements, including the list, we need to make a deep copy. We could do this manually.




In [13]:
# Copy the inner list, then build a new tuple around that copy
L2 = list(L)
T2 = (L2,) + T1[1:]

However, this is not very elegant or practical. Another option is to use a built-in Python module called `copy`. The two main functions in this module are ```copy``` and ```deepcopy```, which are pretty self-explanatory names. Let's see what a deep copy looks like.




In [14]:
import copy

T2 = copy.deepcopy(T1)
T1[0][0] = 5
print(T1)
print(T2)

([5, 2, 3], 'A', 'B')
([7, 2, 3], 'A', 'B')


We now have a deep copy of our tuple. What's more, the ```deepcopy``` function will work on most Python objects, and on all of the ones you will encounter here.
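For comparison, here is a sketch that puts `copy.copy` and `copy.deepcopy` side by side. It uses a list as the outer container (an assumption made for illustration) so that the shallow copy is a genuinely new outer object.

```python
import copy

original = [[1, 2, 3], 'A', 'B']
shallow = copy.copy(original)
deep = copy.deepcopy(original)

# Change the nested list through the original
original[0][0] = 7

print(shallow[0])  # shares the inner list with the original
print(deep[0])     # has its own independent inner list
```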


### Exercise


Create a nested data structure (one data structure as the member of another data structure) and make a deep copy of it. Then alter the nested data structure and print both outer data structures to prove that your deep copy worked.




In [15]:
from copy import deepcopy

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = deepcopy(A)

A[0][0] = 'A'
print(A)
print(B)

[['A', 2, 3], [4, 5, 6], [7, 8, 9]]
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]


## Accessing dictionary items


Earlier you saw three different methods for accessing information in a dictionary: ```.keys```, ```.values``` and ```.items```. Let's look at how these work. Take the use case of keeping track of which tissues each of a set of genes is expressed in. Given a set of gene names A-E and tissues 0-6, let's find some basic properties of this data set. First, let's find how many genes are expressed in tissue 1.




In [16]:
data = {
    'A': [0, 2, 3],
    'B': [1],
    'C': [3, 4, 5],
    'D': [0, 1, 2, 3],
    'E': [6],
}

count = 0
for g in data.keys():
    if 1 in data[g]:
        count += 1
    # or we could simply do count += 1 in data[g]
print(count)

# We could also use the .values method
count = 0
for t in data.values():
    count += 1 in t
print(count)

2
2


What about keeping track of which genes are expressed in tissue 1? To do this, we need both the keys (gene names) and values (tissue lists) so the most logical method to use is ```.items```. This will introduce a new bit of for loop syntax. Since two items are returned each time in a tuple, the key and the value, we can either put that tuple into a variable and access each part in the for loop like this:




In [17]:
genes = []
for i in data.items():
    key = i[0]
    value = i[1]
    if 1 in value:
        genes.append(key)
print(genes)

['B', 'D']


However, an easier way is to unpack the tuple on the line where we write the for loop. This works because we know ahead of time how many values are in the tuple.




In [18]:
genes = []
for k, v in data.items():
    if 1 in v:
        genes.append(k)
print(genes)

['B', 'D']
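The same filter can also be written as a list comprehension, which combines the loop, the unpacking and the test in one expression. This is a sketch of an alternative, not a replacement for the loop above. The data dictionary is repeated so the example runs on its own.

```python
data = {
    'A': [0, 2, 3],
    'B': [1],
    'C': [3, 4, 5],
    'D': [0, 1, 2, 3],
    'E': [6],
}

# Keep the key whenever tissue 1 appears in its value list
genes = [k for k, v in data.items() if 1 in v]

# sorted() so the result does not depend on dictionary ordering
print(sorted(genes))
```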


### Exercise


Write code that takes in our data dictionary and counts the total number of unique tissues and returns a sorted list of these tissues.




In [19]:
u_tissues = []
for k, v in data.items():
    u_tissues += v
u_tissues = list(set(u_tissues))
u_tissues.sort()

print(u_tissues)


[0, 1, 2, 3, 4, 5, 6]
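Another possible solution builds the set directly with its `update` method and then uses the built-in `sorted`, which returns a new sorted list. This sketch also prints the count that the exercise asks for. `sorted_tissues` is an illustrative name.

```python
data = {
    'A': [0, 2, 3],
    'B': [1],
    'C': [3, 4, 5],
    'D': [0, 1, 2, 3],
    'E': [6],
}

u_tissues = set()
for tissues in data.values():
    # update() adds every item of the list, ignoring duplicates
    u_tissues.update(tissues)

sorted_tissues = sorted(u_tissues)
print(len(sorted_tissues))
print(sorted_tissues)
```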


## Enumerate


Since we've just seen the unpacking syntax in a for loop, let's look at one other very handy function that uses this approach, ```enumerate```. Enumerate takes anything you can loop over (lists, tuples, iterators and so on) and returns a counter along with each item.




In [20]:
L = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']

for i, v in enumerate(L):
    print(i, v)

0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
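The counter starts at zero by default, but `enumerate` also accepts a `start` argument. A brief sketch, with `numbered` as an illustrative name:

```python
L = ['A', 'B', 'C']

numbered = []
# start=1 makes the counter begin at 1 instead of 0
for i, v in enumerate(L, start=1):
    numbered.append((i, v))
print(numbered)
```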


## Counting unique items


Now let's say we want to count unique items in a text file. In this case, let's look at a t_data.ctab file, which is a file of transcript-level expression measurements formatted for the program 'ballgown'. We'll be using the file ```SRR072893.t_data.ctab``` from the data directory. First, let's load it into memory.




In [28]:
data = []
# "with" closes the file for us, even if an error occurs
with open('../../data/SRR072893.t_data.ctab', 'r') as fs:
    for line in fs:
        data.append(line.rstrip().split('\t'))

Now, the first line of the file will tell us which columns have which data.




In [30]:
print(data[0])

['t_id', 'chr', 'strand', 'start', 'end', 't_name', 'num_exons', 'length', 'gene_id', 'gene_name', 'cov', 'FPKM']
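Rather than counting columns by eye, the position of a field could be looked up from the header row with the list method `index`. A minimal sketch, using a copy of the header printed above (`gene_col` and `fpkm_col` are illustrative names):

```python
header = ['t_id', 'chr', 'strand', 'start', 'end', 't_name',
          'num_exons', 'length', 'gene_id', 'gene_name', 'cov', 'FPKM']

# index() returns the zero-based position of the first match
gene_col = header.index('gene_name')
fpkm_col = header.index('FPKM')
print(gene_col, fpkm_col)
```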


Now for counting unique genes. We can see that gene names are in column 9, counting the first column as number zero. We've already seen that keeping track of unique elements is an excellent use of sets, so let's use that approach here.

What steps do we need to accomplish this?

1. create a variable to keep track of results
2. find the gene name field
3. check if gene is in our variable
4. if not, add it
5. repeat 2-4 for each entry in data

In [31]:
genes = set()
for line in data[1:]: # We don't want to include the header
    gene = line[9]
    if gene not in genes:
        genes.add(gene)
    # we don't even need to check, we could simply add
    # genes.add(gene)
print(genes)

{'CR33294', 'Or82a', 'aux', 'CG14642', 'CG14641', 'CR45220', 'CG40198', 'CG9780', 'CG45783', 'CG12581', 'CR40182', 'CG14636', 'CG42402', 'mir-929', 'Alg-2', 'TwdlF', 'TwdlV', 'DhpD', 'Parp', 'CG34305', 'CR41571', 'Dsk', 'cpx', 'TwdlG', 'spok', 'CG41128', 'CG31516', 'CG41099', 'Gfat1', 'CR41601', 'CR12798', 'abs', 'CG12582', 'CR45597', 'TwdlU', 'CG9766', 'CG14644', 'CG45784', 'Gel', 'Tim17b', 'CG1092'}


### Exercise


Can you modify the code so that we are only counting unique genes that have at least one transcript with non-zero expression?




In [32]:
genes = set()
for line in data[1:]: # We don't want to include the header
    gene = line[9]
    fpkm = float(line[11])
    if fpkm > 0:
        genes.add(gene)
print(genes)

{'Alg-2', 'CG41128', 'CG31516', 'CG41099', 'aux', 'Parp', 'CG14641', 'abs', 'CR45220', 'CG12581', 'CR40182', 'Tim17b', 'CG12582'}


## Counting occurrences of items


What if we want to count the number of transcripts for each gene? We will still need to keep track of unique gene names, but now we need to track the number of transcripts for each as well. This is exactly the sort of thing a dictionary is good for. Let's use the same approach as above, but modify our code to use a dictionary instead of a set to track genes.




In [33]:
genes = {}
for line in data[1:]: # We don't want to include the header
    gene = line[9]
    if gene not in genes:
        genes[gene] = 1
    else:
        genes[gene] += 1
print(genes)

{'TwdlG': 3, 'CR33294': 2, 'CG41128': 1, 'Or82a': 1, 'CG31516': 1, 'CG41099': 4, 'CG14642': 3, 'CG14641': 1, 'CG34305': 1, 'abs': 1, 'CR45220': 2, 'CR41601': 1, 'CG40198': 2, 'CR12798': 1, 'CG9780': 1, 'CG12581': 2, 'CR40182': 1, 'CR45597': 1, 'CG14636': 2, 'Gfat1': 9, 'CG12582': 5, 'mir-929': 2, 'aux': 4, 'TwdlV': 1, 'Alg-2': 1, 'TwdlF': 1, 'DhpD': 1, 'TwdlU': 1, 'Parp': 1, 'CG14644': 2, 'CG45784': 1, 'CR41571': 1, 'CG42402': 3, 'Dsk': 1, 'CG45783': 1, 'cpx': 21, 'Gel': 7, 'Tim17b': 2, 'CG1092': 2, 'CG9766': 1, 'spok': 1}


Of course, we can actually make this even a little easier by getting rid of the conditional statement and using the dictionary method ```.setdefault```.




In [34]:
genes = {}
for line in data[1:]: # We don't want to include the header
    gene = line[9]
    genes.setdefault(gene, 0)
    genes[gene] += 1
print(genes)

{'TwdlG': 3, 'CR33294': 2, 'CG41128': 1, 'Or82a': 1, 'CG31516': 1, 'CG41099': 4, 'CG14642': 3, 'CG14641': 1, 'CG34305': 1, 'abs': 1, 'CR45220': 2, 'CR41601': 1, 'CG40198': 2, 'CR12798': 1, 'CG9780': 1, 'CG12581': 2, 'CR40182': 1, 'CR45597': 1, 'CG14636': 2, 'Gfat1': 9, 'CG12582': 5, 'mir-929': 2, 'aux': 4, 'TwdlV': 1, 'Alg-2': 1, 'TwdlF': 1, 'DhpD': 1, 'TwdlU': 1, 'Parp': 1, 'CG14644': 2, 'CG45784': 1, 'CR41571': 1, 'CG42402': 3, 'Dsk': 1, 'CG45783': 1, 'cpx': 21, 'Gel': 7, 'Tim17b': 2, 'CG1092': 2, 'CG9766': 1, 'spok': 1}
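Python's standard library also has a ready-made tool for this kind of tallying, `collections.Counter`, which is a dictionary subclass. The sketch below uses a short, made-up list of gene names standing in for the `line[9]` values, so the counts are illustrative and not taken from the file.

```python
from collections import Counter

# Hypothetical gene names, standing in for line[9] of each row
gene_column = ['cpx', 'Gel', 'cpx', 'aux', 'Gel', 'cpx']

counts = Counter(gene_column)
print(counts['cpx'])
print(counts['Gel'])
print(counts['not_there'])  # missing keys count as zero
```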


### Exercise


Given our set of genes and transcript counts, can you figure out which gene has the most transcripts and how many transcripts that is? Start with writing the pseudo-code so you know what steps you need to accomplish the task.

1. create variables to keep track of the best gene name and number of transcripts
2. load a gene name and number of transcripts
3. check if current number of transcripts is bigger than current best
4. if yes, replace best gene name and transcript number
5. repeat 2-4 until all genes in our gene set have been considered

In [35]:
best_gene = ''
best_transcripts = 0
for k, v in genes.items():
    if v > best_transcripts:
        best_gene = k
        best_transcripts = v
print(best_gene, best_transcripts)

cpx 21
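Once the loop version makes sense, the same result could be obtained with the built-in `max` and its `key` argument. This sketch uses three of the counts from the output above rather than the full dictionary.

```python
# A few of the transcript counts shown earlier
genes = {'cpx': 21, 'Gfat1': 9, 'Gel': 7}

# key tells max() to compare the (name, count) pairs by their count
best_gene, best_transcripts = max(genes.items(), key=lambda item: item[1])
print(best_gene, best_transcripts)
```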
