## Data Structures and Sequences

### Tuples

A common use of variable unpacking is iterating over sequences of tuples or lists.

In [2]:
seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

for a, b, c in seq:
    print(f'a={a}, b={b}, c={c}')

a=1, b=2, c=3
a=4, b=5, c=6
a=7, b=8, c=9


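In an unpacking assignment, *rest captures any remaining elements as a list. The same syntax is used in function signatures to capture an arbitrarily long list of positional arguments.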

In [3]:
values = 1, 2, 3, 4, 5

a, b, *rest = values

display(a, b)
display(rest)

1

2

[3, 4, 5]
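The starred syntax also works in a function signature, where it collects extra positional arguments into a tuple. The following is a minimal sketch; the function name `summarize` is hypothetical and exists only for illustration.

```python
def summarize(first, *rest):
    # rest is a tuple holding every positional argument after the first
    return first, len(rest)

head, n_extra = summarize(1, 2, 3, 4, 5)

# By convention, unwanted values are often unpacked into an underscore name
a, b, *_ = (1, 2, 3, 4, 5)

print(head, n_extra)  # 1 4
print(a, b)           # 1 2
```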

### Lists

Using insert, you can insert an element at a specific location in a list.

In [4]:
b_list = ['foo', 'peekaboo', 'baz', 'dwarf']

b_list.insert(1, 'red')

print(b_list)

['foo', 'red', 'peekaboo', 'baz', 'dwarf']


pop removes and returns an element at a particular index.

In [5]:
b_list.pop(2)

print(b_list)

['foo', 'red', 'baz', 'dwarf']


Adding two lists together with + concatenates them.

In [6]:
[4, None, 'foo'] + [7, 8, (2, 3)]

[4, None, 'foo', 7, 8, (2, 3)]

If you already have a list defined, you can append multiple elements to it using the extend method.<br>
Concatenating with + creates a new list and copies the objects over, so using extend to append elements to an existing list is usually preferable, especially if you are building up a large list.

In [7]:
x = [4, None, 'foo']

x.extend([7, 8, (2, 3)])

print(x)

[4, None, 'foo', 7, 8, (2, 3)]
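To make the difference between the two approaches concrete, here is a small sketch that builds the same list both ways. The data in `chunks` is made up for illustration.

```python
chunks = [[1, 2], [3, 4], [5, 6]]  # illustrative data

# Concatenation: each + creates a brand-new list object
combined = []
for chunk in chunks:
    combined = combined + chunk

# extend: the same list object grows in place
everything = []
original_id = id(everything)
for chunk in chunks:
    everything.extend(chunk)

print(combined)                       # [1, 2, 3, 4, 5, 6]
print(everything)                     # [1, 2, 3, 4, 5, 6]
print(id(everything) == original_id)  # True
```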


The sort method has a few options that come in handy. For example, passing a key function lets us sort a collection of strings by their lengths.

In [10]:
b = ['saw', 'small', 'He', 'foxes', 'six']

b.sort(key=len)

print(b)

['He', 'saw', 'six', 'small', 'foxes']
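Two related options may be worth knowing: sort also accepts reverse=True, and the built-in sorted function takes the same arguments but returns a new list instead of sorting in place. A short sketch using the same words (the variable names are illustrative):

```python
words = ['saw', 'small', 'He', 'foxes', 'six']

# sorted returns a new list and leaves the original untouched
longest_first = sorted(words, key=len, reverse=True)

print(longest_first)  # ['small', 'foxes', 'saw', 'six', 'He']
print(words)          # ['saw', 'small', 'He', 'foxes', 'six']
```

Because the sort is stable, strings of equal length keep their original relative order even when reverse=True is used.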


The built-in bisect module implements binary search and insertion into a sorted list.<br>
bisect.bisect finds the location where an element should be inserted to keep the list sorted, while bisect.insort actually inserts the element into that location.

In [16]:
import bisect

c = [1, 2, 2, 2, 3, 4, 7]

display(bisect.bisect(c, 5))

bisect.insort(c, 6)

print(c)

6

[1, 2, 2, 2, 3, 4, 6, 7]
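When the list contains duplicates, it can matter which side of the equal elements the insertion point falls on. The sketch below compares bisect_left and bisect_right on the same list; bisect.bisect is an alias for bisect_right.

```python
import bisect

c = [1, 2, 2, 2, 3, 4, 7]

# bisect_right: insertion point after any elements equal to 2
right = bisect.bisect_right(c, 2)
# bisect_left: insertion point before any elements equal to 2
left = bisect.bisect_left(c, 2)

print(left, right)  # 1 4

# Neither function checks that the list is sorted; on unsorted input
# the call still succeeds but the returned position is not meaningful.
```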


When slicing, a step can also be used after a second colon to, say, take every other element.

In [18]:
seq = [7, 2, 3, 7, 5, 6, 0, 1]

seq[::2]

[7, 3, 5, 0]

A clever use of this is to pass -1, which has the effect of reversing a list or tuple.

In [19]:
seq[::-1]

[1, 0, 6, 5, 7, 3, 2, 7]

### Built-in Sequence Functions

When indexing data, a helpful pattern that uses enumerate is computing a dict mapping of the values of a sequence to their locations in the sequence.

In [20]:
some_list = ['foo', 'bar', 'baz']
mapping = {}

for i, v in enumerate(some_list):
    mapping[v] = i
    
mapping

{'foo': 0, 'bar': 1, 'baz': 2}

zip "pairs" up the elements of a number of lists, tuples, or other sequences, producing an iterator of tuples that can be turned into a list:

In [26]:
seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']

zipped = zip(seq1, seq2)

list(zipped)

[('foo', 'one'), ('bar', 'two'), ('baz', 'three')]

zip can take an arbitrary number of sequences, and the number of elements it produces is determined by the shortest sequence.

In [28]:
seq3 = [True, False]

list(zip(seq1, seq2, seq3))

[('foo', 'one', True), ('bar', 'two', False)]
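If you would rather keep going until the longest sequence is exhausted, itertools.zip_longest from the standard library pads the shorter ones. A minimal sketch with the same three sequences:

```python
from itertools import zip_longest

seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']
seq3 = [True, False]

# Missing positions are filled with fillvalue (None is the default)
padded = list(zip_longest(seq1, seq2, seq3, fillvalue=None))

print(padded)
# [('foo', 'one', True), ('bar', 'two', False), ('baz', 'three', None)]
```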

A very common use of zip is simultaneously iterating over multiple sequences, possibly also combined with enumerate:

In [29]:
for i, (a, b) in enumerate(zip(seq1, seq2)):
    print(f'{i}: {a}, {b}')

0: foo, one
1: bar, two
2: baz, three


Given a "zipped" sequence, zip can be applied in a clever way to "unzip" the sequence.

In [30]:
pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'), ('Curt', 'Schilling')]

first_names, last_names = zip(*pitchers)

display(first_names)
display(last_names)

('Nolan', 'Roger', 'Curt')

('Ryan', 'Clemens', 'Schilling')

reversed iterates over the elements of a sequence in reverse order:

In [31]:
list(reversed(range(10)))

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

### Dictionaries

In [33]:
d1 = {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

The keys and values methods give you iterable views of the dict's keys and values, respectively.

In [37]:
display(list(d1.keys()))
display(list(d1.values()))

['a', 'b', 7]

['some value', [1, 2, 3, 4], 'an integer']

You can merge one dict into another using the update method.

In [38]:
d1.update({'b': 'foo', 'c': 12})
d1

{'a': 'some value', 'b': 'foo', 7: 'an integer', 'c': 12}

You will occasionally end up with two sequences that you want to pair up element-wise in a dict.

In [39]:
mapping = dict(zip(range(5), reversed(range(5))))
mapping

{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}

The dict methods get and pop can take a default value to be returned.<br>
get by default will return None if the key is not present, while pop will raise an exception.

In [57]:
value = d1.get('b', 'default_value')
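The sketch below walks through the four cases described above: get and pop, each with and without a default. The dict `d` and the string 'fallback' are illustrative values.

```python
d = {'a': 'some value', 'b': 'foo'}

no_default = d.get('z')                # missing key, no default: None
with_default = d.get('z', 'fallback')  # missing key: the default is returned

removed = d.pop('b', 'fallback')  # key exists: value returned, key removed
missing = d.pop('z', 'fallback')  # key absent: the default is returned

try:
    d.pop('z')  # key absent and no default: raises KeyError
except KeyError as exc:
    error = exc

print(no_default, with_default)  # None fallback
print(removed, missing)          # foo fallback
print(d)                         # {'a': 'some value'}
```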

To categorize a list of words by their first letters as a dict of lists, you can use the setdefault method:

In [60]:
words = ['apple', 'bar', 'bar', 'atom', 'book']
by_letter = {}

for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)
    
by_letter

{'a': ['apple', 'atom'], 'b': ['bar', 'bar', 'book']}

The built-in collections module has a useful class, defaultdict, which makes this even easier.

In [62]:
from collections import defaultdict

by_letter = defaultdict(list)

for word in words:
    by_letter[word[0]].append(word)
    
by_letter

defaultdict(list, {'a': ['apple', 'atom'], 'b': ['bar', 'bar', 'book']})
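One behavior of defaultdict that can surprise: simply looking up a missing key creates it. The following sketch shows this, along with two ways to test for a key without adding it.

```python
from collections import defaultdict

by_letter = defaultdict(list)
by_letter['a'].append('apple')

# Merely looking up a missing key inserts it with the default value
_ = by_letter['z']
print(dict(by_letter))  # {'a': ['apple'], 'z': []}

# `in` and .get() do not trigger the default factory
has_q = 'q' in by_letter
got_q = by_letter.get('q')
print(has_q, got_q)     # False None
```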

### Sets

A set is an unordered collection of unique elements and can be created in two ways: via the set function or via a set literal with curly braces.

In [68]:
set([2, 2, 2, 1, 3, 3])

{1, 2, 3}

In [73]:
{2, 2, 2, 1, 3, 3}

{1, 2, 3}

Sets support mathematical set operations like union, intersection, difference, and symmetric difference.

In [76]:
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

display(a.union(b))
display(a | b)

{1, 2, 3, 4, 5, 6, 7, 8}

{1, 2, 3, 4, 5, 6, 7, 8}

In [77]:
display(a.intersection(b))
display(a & b)

{3, 4, 5}

{3, 4, 5}
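Difference and symmetric difference follow the same pattern of a method plus an operator. A short sketch with the same two sets; the results are sorted only to make the printed order predictable.

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

# Elements in a that are not in b
diff = a - b                  # same as a.difference(b)
# Elements in exactly one of the two sets
sym_diff = a ^ b              # same as a.symmetric_difference(b)

print(sorted(diff))      # [1, 2]
print(sorted(sym_diff))  # [1, 2, 6, 7, 8]
```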

All of the logical set operations have in-place counterparts, which enable you to replace the contents of the set on the left side of the operation with the result. For large datasets, this may be more efficient:

In [79]:
a |= b

a

{1, 2, 3, 4, 5, 6, 7, 8}

You can also check if a set is a subset of (is contained in) or a superset of (contains all elements of) another set:

In [83]:
a_set = {1, 2, 3, 4, 5}

display({1, 2, 3}.issubset(a_set))

display(a_set.issuperset({1, 2, 3}))

True

True

In [84]:
{1, 2, 3} == {3, 2, 1}

True

### List, Set, and Dict Comprehensions

list_comp = [<i>expr</i> for val in collection if <i>condition</i>]<br>
dict_comp = {<i>key-expr</i> : <i>value-expr</i> for value in collection if <i>condition</i>}<br>
set_comp = {<i>expr</i> for val in collection if <i>condition</i>}<br>

In [85]:
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

[x.upper() for x in strings if len(x) > 2]

['BAT', 'CAR', 'DOVE', 'PYTHON']

Suppose we wanted a set containing just the lengths of the strings contained in the collection:

In [86]:
unique_lengths = {len(x) for x in strings}

unique_lengths

{1, 2, 3, 4, 6}

We could also express this more functionally using the map function.

In [87]:
set(map(len, strings))

{1, 2, 3, 4, 6}

As a simple dict comprehension example, we could create a lookup map of these strings to their locations in the list:

In [89]:
loc_mapping = {val: index for index, val in enumerate(strings)}

loc_mapping

{'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}

Suppose we have a list of lists containing names and we wanted to get a single list containing all of the names with two or more e's in them:

In [92]:
all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'], ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']]

result = [name for names in all_data for name in names if name.count('e') >= 2]

result

['Steven']

Here is another example where we "flatten" a list of tuples of integers into a simple list of integers:

In [93]:
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

flattened = [x for tup in some_tuples for x in tup]

flattened

[1, 2, 3, 4, 5, 6, 7, 8, 9]
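The order of the for clauses in a nested comprehension matches the order you would write them as nested loops. As a sketch, the flattening above is equivalent to the following, and the last lines show the different case of a comprehension placed inside another one.

```python
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

flattened = []
for tup in some_tuples:  # first for clause = outer loop
    for x in tup:        # second for clause = inner loop
        flattened.append(x)

print(flattened)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# A comprehension inside a comprehension keeps the nesting instead
nested = [[x for x in tup] for tup in some_tuples]
print(nested)     # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```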

## Functions

In [95]:
import re  # regex

states = ['  Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda', 'south   carolina##', 'West virginia?']

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result

clean_strings(states)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South   Carolina',
 'West Virginia']

An alternative approach that you may find useful is to make a list of the operations you want to apply to a particular set of strings.

In [97]:
def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

clean_strings(states, clean_ops)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South   Carolina',
 'West Virginia']
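One benefit of the list-of-operations pattern is that adding a step only means adding a function to the list. As a sketch, suppose we also want to collapse the repeated spaces left in 'South   Carolina'; the helper `collapse_whitespace` is hypothetical and not part of the original example.

```python
import re

def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

def collapse_whitespace(value):
    # Hypothetical extra step: replace runs of whitespace with one space
    return re.sub(r'\s+', ' ', value)

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

clean_ops = [str.strip, remove_punctuation, collapse_whitespace, str.title]

cleaned = clean_strings(['south   carolina##', '  Alabama '], clean_ops)
print(cleaned)  # ['South Carolina', 'Alabama']
```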

You can use functions as arguments to other functions like the built-in map function, which applies a function to a sequence of some kind:

In [98]:
for x in map(remove_punctuation, states):
    print(x)

  Alabama 
Georgia
Georgia
georgia
FlOrIda
south   carolina
West virginia


### Lambda Functions

In [99]:
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]

apply_to_list(ints, lambda x: x * 2)

[8, 0, 2, 10, 12]

In [100]:
[x * 2 for x in ints]

[8, 0, 2, 10, 12]

Suppose you wanted to sort a collection of strings by the number of distinct letters in each string:

In [102]:
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

strings.sort(key=lambda x: len(set(x)))
strings

['aaaa', 'foo', 'abab', 'bar', 'card']

Currying is computer science jargon that means deriving new functions from existing ones by partial argument application.

In [104]:
def add_numbers(x, y):
    return x + y

add_five = lambda y: add_numbers(5, y)

The second argument to add_numbers is said to be curried.<br>
The built-in functools module can simplify this process using the partial function.

In [105]:
from functools import partial

add_five = partial(add_numbers, 5)
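A brief usage sketch: the object returned by partial is called like any other function. The second half, which fixes a keyword argument of the built-in int, is an added illustration and `parse_binary` is a hypothetical name.

```python
from functools import partial

def add_numbers(x, y):
    return x + y

add_five = partial(add_numbers, 5)
print(add_five(10))  # 15

# partial can also fix keyword arguments, for example the base of int()
parse_binary = partial(int, base=2)
print(parse_binary('1010'))  # 10
```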

### Generators

A generator is a concise way to construct a new iterable object. Whereas normal functions execute and return a single result at a time, generators return a sequence of multiple results lazily, pausing after each one until the next one is requested.<br>
To create a generator, use the yield keyword instead of return in a function.

In [113]:
def squares(n=10):
    print(f'Generating squares from 1 to {n ** 2}')
    for i in range(1, n + 1):
        yield i ** 2
        
gen = squares()

for x in gen:
    print(x, end=' ')

Generating squares from 1 to 100
1 4 9 16 25 36 49 64 81 100 
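To see the lazy behavior more directly, you can step through a generator by hand with the built-in next function. The sketch below reuses squares with a smaller n.

```python
def squares(n=10):
    print(f'Generating squares from 1 to {n ** 2}')
    for i in range(1, n + 1):
        yield i ** 2

gen = squares(3)       # nothing is printed yet: the body has not started

first = next(gen)      # prints the message, then runs to the first yield
second = next(gen)     # resumes after the first yield
remaining = list(gen)  # consumes whatever is left

print(first, second, remaining)  # 1 4 [9]
print(list(gen))                 # []: an exhausted generator is empty
```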

Another even more concise way to make a generator is by using a generator expression. This is a generator analogue to list, dict, and set comprehensions.

In [114]:
gen = (x ** 2 for x in range(100))

# is equivalent to 

def _make_gen():
    for x in range(100):
        yield x ** 2
        
gen = _make_gen()

Generator expressions can be used instead of list comprehensions as function arguments in many cases:

In [119]:
display(sum(x ** 2 for x in range(100)))
display(dict((i, i ** 2) for i in range(5)))

328350

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

The standard library itertools module has a collection of generators for many common data algorithms.<br>
For example, groupby takes any sequence and a function, grouping consecutive elements in the sequence by the return value of the function.

In [121]:
import itertools

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

first_letter = lambda x: x[0]

for letter, group in itertools.groupby(names, first_letter):
    print(letter, list(group)) # group is a generator; avoid reusing the name names

A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
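Because groupby only groups consecutive elements, 'A' appears twice in the output above. If you want one group per key, a common approach is to sort by the same key first, as in this sketch:

```python
import itertools

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
first_letter = lambda x: x[0]

# Sorting by the grouping key makes equal keys adjacent
sorted_names = sorted(names, key=first_letter)

grouped = {
    letter: list(group)
    for letter, group in itertools.groupby(sorted_names, first_letter)
}

print(grouped)
# {'A': ['Alan', 'Adam', 'Albert'], 'S': ['Steven'], 'W': ['Wes', 'Will']}
```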


### Exception Handling

You can catch multiple exception types by writing a tuple of exception types:

In [122]:
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x
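A quick usage sketch showing which inputs trigger each exception type; the sample inputs are illustrative.

```python
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

as_number = attempt_float('1.2345')   # converts cleanly
bad_text = attempt_float('something') # float raises ValueError
bad_type = attempt_float((1, 2))      # float raises TypeError

print(as_number)  # 1.2345
print(bad_text)   # something
print(bad_type)   # (1, 2)
```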

In some cases, you may not want to suppress an exception, but you want some code to be executed regardless of whether the code in the try block succeeds or not. To do this, use finally:<br>
<br>
f = open(path, 'w')<br>
try:<br>
&emsp;write_to_file(f)<br>
except:<br>
&emsp;print('Failed')<br>
else:<br>
&emsp;print('Succeeded')<br>
finally:<br>
&emsp;f.close()
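The outline above can be turned into a runnable sketch as follows. The helper `write_to_file` and the temporary path are assumptions made for illustration, and the sketch catches OSError rather than using a bare except so that unrelated errors are not hidden.

```python
import os
import tempfile

def write_to_file(f):
    # Hypothetical helper standing in for real work
    f.write('hello\n')

# Assumed scratch location so the example does not touch real files
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

f = open(path, 'w')
try:
    write_to_file(f)
except OSError:
    status = 'Failed'
else:
    status = 'Succeeded'   # runs only if the try block raised nothing
finally:
    f.close()              # runs whether or not an exception occurred

print(status, f.closed)  # Succeeded True
```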

## Files and the Operating System

In [127]:
path = 'examples/segismundo.txt'

In [128]:
with open(path, encoding='utf-8') as f:
    lines = [x.rstrip() for x in f]

lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']

In [134]:
f = open(path, 'rb') # Binary mode
f.read(10)

b'Sue\xc3\xb1a el '

The read method advances the file handle's position by the number of bytes read; tell gives you the current position.

In [135]:
f.tell()

10

seek changes the file position to the indicated byte in the file.

In [136]:
f.seek(3)

3

To write to a file, you can use the file's write or writelines methods.

In [141]:
with open(path, encoding='utf-8') as source, open('tmp.txt', 'w', encoding='utf-8') as handle:
    handle.writelines(x for x in source if len(x) > 1)  # Excludes blank lines

with open('tmp.txt', encoding='utf-8') as f:
    lines = f.readlines()

lines

['Sueña el rico en su riqueza,\n',
 'que más cuidados le ofrece;\n',
 'sueña el pobre que padece\n',
 'su miseria y su pobreza;\n',
 'sueña el que a medrar empieza,\n',
 'sueña el que afana y pretende,\n',
 'sueña el que agravia y ofende,\n',
 'y en el mundo, en conclusión,\n',
 'todos sueñan lo que son,\n',
 'aunque ninguno lo entiende.\n']

If we open the file in 'rb' mode, read requests an exact number of bytes.

In [147]:
with open(path, 'rb') as f:
    data = f.read(10)

display(data)
data.decode('utf8')

b'Sue\xc3\xb1a el '

'Sueña el '
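Since UTF-8 is a variable-length encoding, a byte count that cuts a character in half cannot be decoded. The sketch below reproduces the ten bytes above from a string literal, so it does not depend on the example file.

```python
data = 'Sueña el '.encode('utf-8')
n_bytes = len(data)
print(n_bytes)  # 10 bytes for 9 characters: ñ takes two bytes

try:
    data[:4].decode('utf-8')  # the slice ends in the middle of ñ
except UnicodeDecodeError as exc:
    error = exc
    print('cannot decode:', exc.reason)

# Including both bytes of ñ decodes cleanly
whole = data[:5].decode('utf-8')
print(whole)  # Sueñ
```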