# Chapter 3. Built-in Data structures, functions and files

## Data structures and sequences

### Tuple

A tuple is a fixed-length, immutable sequence of Python objects.

In [None]:
tup = 4, 5, 6
tup

Although a tuple itself is immutable, any mutable object it contains, such as a list, can still be modified in place:

In [None]:
tup = tuple(['foo', [1, 2], True])
tup[1].append(3)
tup
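To make the distinction concrete, here is a minimal sketch (the flag name `assignment_failed` is illustrative, not from the original). It suggests that rebinding a slot of a tuple fails, while mutating an object stored in that slot succeeds:

```python
tup = tuple(['foo', [1, 2], True])

# Rebinding a slot is not allowed: tuples do not support item assignment.
try:
    tup[2] = False
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Mutating the list stored in slot 1 works, because the tuple still
# refers to the same list object.
tup[1].append(3)

print(assignment_failed)  # True
print(tup)                # ('foo', [1, 2, 3], True)
```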

### List

In contrast with tuples, lists are variable-length and their contents can be modified in place.

In [None]:
a_list = [2, 3, 7, None]
a_list.append(10)
a_list.insert(1, 12)
a_list

In [None]:
a_list.pop(2)

### Sorting
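A list can be sorted in place, without creating a new object, by calling its sort method. sort also accepts a key argument, a function that produces the value to sort by; the second example below sorts strings by length.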

In [None]:
a = [7, 2, 5, 1, 3]
a.sort()
a

In [None]:
b = ['saw', 'small', 'He', 'foxes', 'six']
b.sort(key=len)
b

### Binary search and maintaining a sorted list

The built-in bisect module implements binary search and insertion into a sorted list. bisect.bisect finds the location where an element should be inserted to keep the list sorted, while bisect.insort actually inserts the element at that location.

In [None]:
import bisect
c = [1, 2, 2, 2, 2, 3, 4, 7]
bisect.bisect(c, 2)

In [None]:
bisect.bisect(c, 5)

In [None]:
bisect.bisect(c, 6)

In [None]:
bisect.insort(c, 6)
c
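The bisect functions do not check whether the list is actually sorted, so using them on an unsorted list succeeds silently but may give meaningless results. As a small sketch of how the functions relate (the variable names `left` and `right` are illustrative): bisect.bisect is an alias of bisect.bisect_right, which returns the position after any equal entries, whereas bisect.bisect_left returns the position before them.

```python
import bisect

c = [1, 2, 2, 2, 2, 3, 4, 7]

right = bisect.bisect(c, 2)      # same result as bisect.bisect_right(c, 2)
left = bisect.bisect_left(c, 2)  # position before the run of 2s
print(left, right)  # 1 5

# Inserting at the position returned by bisect keeps the list sorted,
# which is what bisect.insort does in one step.
c.insert(bisect.bisect(c, 6), 6)
print(c)  # [1, 2, 2, 2, 2, 3, 4, 6, 7]
```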

### Slicing

You can select sections of most sequence types by using slice notation, which in its basic form consists of start:stop passed to the indexing operator []:

In [None]:
seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq[1:5]

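The element at the start index is included but the element at the stop index is not, so the number of elements in the result is stop - start. Either start or stop can be omitted, in which case it defaults to the start or the end of the sequence, and negative indices slice relative to the end: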

In [None]:
seq[:5]

In [None]:
seq[3:]

In [None]:
seq[-4:]

In [None]:
seq[-6:-2]
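Slice notation also accepts an optional third component, a step, written start:stop:step. The sketch below is an illustration that goes slightly beyond the cells above; the names `every_other` and `reversed_seq` are made up for the example.

```python
seq = [7, 2, 3, 7, 5, 6, 0, 1]

every_other = seq[::2]    # start at index 0 and take every second element
reversed_seq = seq[::-1]  # a step of -1 walks the sequence backwards

print(every_other)   # [7, 3, 5, 0]
print(reversed_seq)  # [1, 0, 6, 5, 7, 3, 2, 7]

# The basic rule still holds: a start:stop slice has stop - start elements.
assert len(seq[1:5]) == 5 - 1
```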

### Enumerate

In [None]:
some_list = ['foo', 'bar', 'baz']
mapping = {}
for i, v in enumerate(some_list):
    mapping[v] = i
mapping

### Sorted

The sorted function returns a new sorted list from the elements of any sequence:

In [None]:
sorted('horse race')

### zip

zip "pairs" up the elements of a number of lists, tuples, or other sequences. In Python 3 it returns an iterator of tuples, which can be passed to list to materialize the pairs:

In [None]:
seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']
zipped = zip(seq1, seq2)
list(zipped)

A very common use of zip is simultaneously iterating over multiple sequences, possibly also combined with enumerate:

In [None]:
for i, (a, b) in enumerate(zip(seq1, seq2)):
    print('{0}: {1}, {2}'.format(i, a, b))

### Dict

The dict is likely the most important built-in Python data structure. Other common names for it are hash map and associative array. It is a flexibly sized collection of key-value pairs, where both key and value are Python objects. One approach for creating one is to use curly braces {} and colons to separate keys and values:

In [None]:
empty_dict = {}
d1 = {'a' : 'some value', 'b' : [1, 2, 3, 4]}
d1

In [None]:
d1[7] = 'an integer'
d1

In [None]:
d1['b']

### Deleting value

In [None]:
del d1['b']

### Creating dicts from sequences

You will occasionally end up with two sequences that you want to pair up element-wise in a dict. As a first cut, you might write code like this:

In [None]:
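# Illustrative inputs (not in the original); any two equal-length sequences work.
key_list = ['a', 'b', 'c']
value_list = [1, 2, 3]
mapping = {}
for key, value in zip(key_list, value_list):
    mapping[key] = value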

Since a dict is essentially a collection of 2-tuples, the dict function accepts any iterable of 2-tuples:

In [None]:
mapping = dict(zip(range(5), reversed(range(5))))
mapping

In [None]:
a = zip(range(5), reversed(range(5)))
list(a)

### Default values

It's very common to have logic like:

In [None]:
if key in some_dict:
    value = some_dict[key]
else:
    value = default_value

Thus, the dict methods get and pop can take a default value to be returned, so that the above if-else block can be written as

In [None]:
value = some_dict.get(key, default_value)

By default, get returns None if the key is not present, while pop raises a KeyError.
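The behaviour of get and pop for a missing key can be checked with a small sketch. The dict contents and the flag `pop_raised` are illustrative assumptions, not part of the original example:

```python
some_dict = {'a': 1, 'b': 2}

# get returns None, or the supplied default, when the key is missing.
missing = some_dict.get('z')
with_default = some_dict.get('z', 0)
print(missing, with_default)  # None 0

# pop raises KeyError for a missing key unless a default is supplied.
try:
    some_dict.pop('z')
    pop_raised = False
except KeyError:
    pop_raised = True
popped_default = some_dict.pop('z', -1)
print(pop_raised, popped_default)  # True -1

# When the key is present, pop removes it and returns its value.
removed = some_dict.pop('a')
print(removed, some_dict)  # 1 {'b': 2}
```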

### Valid dict key types

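While the values of a dict can be any Python object, the keys generally have to be immutable objects like scalar types (int, float, string) or tuples (all the objects in the tuple need to be immutable, too). The technical term here is hashability. You can check whether an object is hashable, and so can be used as a key in a dict, with the hash function: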

In [None]:
hash('string')

In [None]:
hash((1, 2, (2, 3)))
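By contrast, hashing fails as soon as a mutable object such as a list is involved. The following sketch (the names `hashable` and `d` are illustrative) shows the failure and one common workaround, converting the list to a tuple before using it as a key:

```python
# A tuple that contains a list is not hashable, because lists are mutable.
try:
    hash((1, 2, [2, 3]))
    hashable = True
except TypeError:
    hashable = False
print(hashable)  # False

# Converting the list to a tuple gives a usable dict key.
d = {}
d[tuple([1, 2, 3])] = 5
print(d)  # {(1, 2, 3): 5}
```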

### Set

A set is an unordered collection of unique elements. You can think of sets as dicts with keys only and no values. A set can be created in two ways: via the set function or via a set literal with curly braces:

In [None]:
a = set([2,2,2,1,3,3])
b = {3,4,5,6,7,8}

The union of these two sets is the set of distinct elements occurring in either set. It can be computed with either the union method or the | binary operator:

In [None]:
a.union(b)

In [None]:
a | b

The intersection contains the elements occurring in both sets. It can be computed with either the intersection method or the & binary operator:

In [None]:
a.intersection(b)

In [None]:
a & b

We also have the following set operations:

* add 
* clear
* remove
* pop
* union
* update
* intersection
* intersection_update
* difference
* difference_update
* symmetric_difference
* symmetric_difference_update
* issubset
* issuperset
* isdisjoint

See the Python documentation for details, or recall the corresponding operations from Mathematics 2.
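A few of the methods listed above, sketched on two small sets. The result names are illustrative; sorted is used only to print the elements in a predictable order, since sets are unordered.

```python
a = {1, 2, 3}
b = {3, 4, 5, 6, 7, 8}

only_in_a = sorted(a.difference(b))               # in a but not in b
in_exactly_one = sorted(a.symmetric_difference(b))  # in a or b, but not both
print(only_in_a)       # [1, 2]
print(in_exactly_one)  # [1, 2, 4, 5, 6, 7, 8]

print({1, 2}.issubset(a))     # True
print(a.isdisjoint({9, 10}))  # True

# The *_update variants modify the set in place instead of returning a new one.
c = a.copy()
c.intersection_update(b)
print(c)  # {3}
```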

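### List, Set and Dict Comprehensions

List comprehensions let you form a new list concisely by filtering the elements of a collection and transforming the elements that pass the filter, all in one expression: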

In [None]:
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']
[x.upper() for x in strings if len(x) > 2]

Set and dict comprehensions are a natural extension, producing sets and dicts in an idiomatically similar way instead of lists. A dict comprehension looks like this:

In [None]:
dict_comp = {key-expr: value-expr for value in collection if condition}

A set comprehension looks like the equivalent list comprehension except with curly braces instead of square brackets:

In [None]:
set_comp = {expr for value in collection if condition}

Like list comprehensions, these are mostly conveniences, but they similarly can make code both easier to write and read. Consider the list of strings from before. Suppose we wanted a set containing just the lengths of the strings contained in the collection; we could easily compute this using a set comprehension:

In [None]:
unique_lengths = {len(x) for x in strings}
unique_lengths

We could also express this more functionally using the map function

In [None]:
set(map(len, strings))

As a simple dict comprehension example, we could create a lookup map from each of these strings to its location in the list:

In [None]:
loc_mapping = {val : index for index, val in enumerate(strings)}
loc_mapping

### Nested list comprehensions

Suppose we have a list containing some English and Spanish names:

In [None]:
all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'],
            ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']]

You might have gotten these names from a couple of files and decided to organize them by language. Now, suppose we wanted to get a single list containing all names with two or more e's in them. We could do this with a simple for loop:

In [None]:
names_of_interest = []
for names in all_data:
    enough_es = [name for name in names if name.count('e') >= 2]
    names_of_interest.extend(enough_es)

You can actually wrap this whole operation up in a single nested list comprehension, which will look like:

In [None]:
result = [name for names in all_data for name in names if name.count('e') >= 2]
result

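At first, nested list comprehensions are a bit hard to wrap your head around. The for parts of the comprehension are arranged in the order of nesting, and any filter condition goes at the end as before. Here is another example, which flattens a list of tuples of integers into a simple list of integers: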

In [None]:
some_tuples = [(1,2,3), (4,5,6), (7,8,9)]
flattened = [x for tup in some_tuples for x in tup]
flattened

This is the same as

In [None]:
flattened = []

for tup in some_tuples:
    for x in tup:
        flattened.append(x)

## Functions

Functions are the primary and most important method of code organization and reuse in Python. As a rule of thumb, if you anticipate needing to repeat the same or very similar code more than once, it may be worth writing a reusable function. Functions can also help make your code more readable by giving a name to a group of Python statements.

Functions are declared with the **def** keyword and returned from with the **return** keyword.

In [None]:
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

There is no issue with having multiple **return** statements. If Python reaches the end of a function without encountering a **return** statement, **None** is returned automatically.

Each function can have **positional** arguments and **keyword** arguments. Keyword arguments are most commonly used to specify default values or optional arguments. In the preceding function, x and y are positional arguments while z is a keyword argument. This means that the function can be called in any of these ways:

In [None]:
my_function(5, 6, z=0.7)
my_function(3.14, 7, 3.5)
my_function(10, 20)

The main restriction on function arguments is that the keyword arguments **must** follow the positional arguments (if any). You can specify keyword arguments in any order; this frees you from having to remember the order in which the function arguments were specified. You only need to remember their names.
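As a short sketch of this rule, the function from above can be called with every argument passed by keyword, in any order. The result names `r1` and `r2` are illustrative:

```python
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

# Keyword arguments may appear in any order.
r1 = my_function(x=5, y=6, z=7)
r2 = my_function(y=6, x=5, z=7)
print(r1, r2)  # 77 77

# A positional argument after a keyword argument is a syntax error,
# so a call such as my_function(x=5, 6) would not even compile.
```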

### Namespaces, Scope and Local Functions

Functions can access variables in two different scopes: **global** and **local**. An alternative and more descriptive name for a variable scope in Python is a **namespace**. Any variables that are assigned within a function are, by default, assigned to the local namespace. The local namespace is created when the function is called and is immediately populated by the function's arguments. After the function is finished, the local namespace is destroyed (with some exceptions that are outside the purview of this chapter). Consider the following function:

In [None]:
def func():
    a = []
    for i in range(5):
        a.append(i)

When *func()* is called, the empty list a is created, five elements are appended, and then a is destroyed when the function exits. Suppose instead we had declared a as follows:

In [None]:
a = []
def func():
    for i in range(5):
        a.append(i)

Each call to func will modify the list a:

In [None]:
func()
a

In [None]:
func()
a

Assigning variables outside of the function's scope is possible, but those variables must be declared as global via the global keyword:

In [None]:
a = None

def bind_a_variable():
    global a
    a = []

bind_a_variable()
print(a)

### Returning Multiple Values

In [None]:
def f():
    a = 5
    b = 6
    c = 7
    return a, b, c

a, b, c = f()

In data analysis and other scientific applications, you may find yourself doing this often. What's happening here is that the function is actually just returning one object, namely a tuple which is then being unpacked into the result variables. In the preceding example, we could have done this instead:

In [None]:
return_value = f()
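A quick check of this claim, as a sketch that repeats the definition of f so that it runs on its own:

```python
def f():
    a = 5
    b = 6
    c = 7
    return a, b, c

# Without unpacking, the single returned object is a 3-tuple.
return_value = f()
print(type(return_value).__name__)  # tuple
print(return_value)                 # (5, 6, 7)

# Unpacking that tuple gives the same three values as a, b, c = f().
a, b, c = return_value
print(a, b, c)  # 5 6 7
```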

Alternatively it may be useful to return a dictionary instead:

In [None]:
def f():
    a = 5
    b = 6
    c = 7
    return {'a' : a, 'b' : b, 'c' : c}

### Functions Are Objects

Since Python functions are objects, many constructs can be easily expressed that are difficult to do in other languages. Suppose we were doing some data cleaning and needed to apply a bunch of transformations to the following list of strings:

In [None]:
states = ['  Alabama  ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda', 'south carolina##', 'West virginia?']

Anyone who has ever worked with user-submitted survey data has seen messy results like these. Lots of things need to happen to make this list of strings uniform and ready for analysis: stripping whitespace, removing punctuation symbols, and standardizing on proper capitalization. One way to do this is to use built-in string methods along with the re standard library module for regular expressions:

In [None]:
import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result

In [None]:
clean_strings(states)

An alternative approach that you may find useful is to make a list of the operations you want to apply to a particular set of strings:

In [None]:
def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

clean_strings(states, clean_ops)

A more *functional* pattern like this enables you to easily modify how the strings are transformed at a very high level. The clean_strings function is also now more reusable and generic.

You can use functions as arguments to other functions like the built-in map function, which applies a function to a sequence of some kind:

In [None]:
for x in map(remove_punctuation, states):
    print(x)

### Anonymous (Lambda) Functions

Python has support for so-called **anonymous** or **lambda** functions, which are a way of writing functions consisting of a single expression, the result of which is the return value. They are defined with the lambda keyword, which has no meaning other than "we are declaring an anonymous function":

In [None]:
def short_function(x):
    return x*2

equiv_anon = lambda x: x * 2

These are especially convenient in data analysis because, as you'll see, there are many cases where data transformation functions will take functions as arguments. It's often less typing (and clearer) to pass a lambda function as opposed to writing a full-out function declaration or even assigning the lambda function to a local variable.

In [None]:
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]
apply_to_list(ints, lambda x: x * 2)

As another example, suppose you wanted to sort a collection of strings by the number of distinct letters in each string:

In [None]:
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

Here we could pass a lambda function to the list's sort method:

In [None]:
strings.sort(key=lambda x: len(set(list(x))))
strings

### Currying: Partial Argument Application

Currying is computer science jargon (named after the mathematician Haskell Curry) that means deriving new functions from existing ones by *partial argument application*. For example, suppose we had a trivial function that adds two numbers together:

In [None]:
def add_numbers(x, y):
    return x + y

Using this function, we could derive a new function of one variable, add_five, that adds 5 to its argument:

In [None]:
add_five = lambda y : add_numbers(5, y)

The second argument to add_numbers is said to be *curried*. There's nothing fancy here, as all we've really done is define a new function that calls an existing function. The built-in functools module can simplify this process using the partial function:

In [None]:
from functools import partial
add_five = partial(add_numbers, 5)
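The object returned by partial is called like any other function. The sketch below repeats the definitions so that it runs on its own; `parse_binary` is an illustrative name, added to suggest that keyword arguments can be fixed in the same way:

```python
from functools import partial

def add_numbers(x, y):
    return x + y

# Fix the first positional argument to 5.
add_five = partial(add_numbers, 5)
print(add_five(10))  # 15

# Fix a keyword argument instead: the built-in int accepts base.
parse_binary = partial(int, base=2)
print(parse_binary('101'))  # 5
```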

### Generators

Having a consistent way to iterate over sequences, like objects in a list or lines in a file, is an important Python feature. This is accomplished by means of the **iterator protocol**, a generic way to make objects iterable. For example, iterating over a dict yields the dict keys:

In [None]:
some_dict = {'a' : 1, 'b' : 2, 'c' : 3}
for key in some_dict:
    print(key)

When you write for key in some_dict, the Python interpreter first attempts to create an iterator out of some_dict:

In [None]:
dict_iterator = iter(some_dict)
dict_iterator

An iterator is any object that will yield objects to the Python interpreter when used in a context like a for loop. Most methods expecting a list or list-like object will also accept any iterable object. This includes built-in functions such as min, max, and sum, and type constructors like list and tuple:

In [None]:
list(dict_iterator)

A generator is a concise way to construct a new iterable object. Whereas normal functions execute and return a single result at a time, generators return a sequence of multiple results lazily, pausing after each one until the next one is requested. To create a generator, use the yield keyword instead of return in a function.

In [None]:
def squares(n=10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2

gen = squares()

In [None]:
for x in gen:
    print(x, end=' ')
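The lazy behaviour can be observed directly with the built-in next function. This is a sketch that repeats the generator definition; the names `first`, `rest` and `leftover` are illustrative:

```python
def squares(n=10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2

gen = squares(3)     # nothing is printed yet: the body has not started running
first = next(gen)    # prints the message, then runs up to the first yield
rest = list(gen)     # consumes the remaining values
print(first, rest)   # 1 [4, 9]

# A generator can be consumed only once; afterwards it is exhausted.
leftover = list(gen)
print(leftover)      # []
```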

### Generator expressions

Another even more concise way to make a generator is by using a generator expression. This is a generator analogue to list, dict, and set comprehensions; to create one, enclose what would otherwise be a list comprehension within parentheses instead of brackets:

In [None]:
gen = (x ** 2 for x in range(100))
gen

Generator expressions can be used instead of list comprehensions as function arguments in many cases:

In [None]:
sum(x ** 2 for x in range(100))

In [None]:
dict((i, i ** 2) for i in range(5))

### Itertools module

The standard library itertools module has a collection of generators for many common data algorithms. For example, groupby takes any sequence and a function, grouping consecutive elements in the sequence by the return value of the function:

In [None]:
import itertools

first_letter = lambda x : x[0]

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

for letter, group in itertools.groupby(names, first_letter):
    print(letter, list(group))  # group is an iterator, so materialize it
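Because groupby only groups consecutive elements, a key that reappears later in the sequence starts a new group. A common remedy is to sort by the same key first. The following sketch (with illustrative names `unsorted_groups` and `sorted_groups`) compares the two:

```python
import itertools

first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

# Without sorting, 'A' appears twice because 'Albert' is not adjacent
# to 'Alan' and 'Adam'.
unsorted_groups = [(letter, list(group))
                   for letter, group in itertools.groupby(names, first_letter)]
print(unsorted_groups)

# Sorting by the grouping key first yields one group per letter.
sorted_names = sorted(names, key=first_letter)
sorted_groups = [(letter, list(group))
                 for letter, group in itertools.groupby(sorted_names, first_letter)]
print(sorted_groups)
```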

### Errors and Exception Handling

Handling Python errors or exceptions gracefully is an important part of building robust programs. In data analysis applications, many functions only work on certain kinds of input. As an example, Python's float function is capable of casting a string to a floating-point number, but fails with ValueError on improper inputs:

In [None]:
float('1.2345')

In [None]:
float('something')

Suppose we wanted a version of float that fails gracefully, returning the input argument. We can do this by writing a function that encloses the call to float in a try/except block:

In [None]:
def attempt_float(x):
    try:
        return float(x)
    except:
        return x

The code in the except part of the block will only be executed if float(x) raises an exception:

In [None]:
attempt_float('1.2345')

In [None]:
attempt_float('something')

You might notice that float can raise exceptions other than ValueError:

In [None]:
float((1,2))

You might only want to suppress ValueError, since a TypeError (the input was not a string or numeric value) might indicate a legitimate bug in your program. To do that, write the exception type after except:

In [None]:
def attempt_float(x):
    try:
        return float(x)
    except ValueError:
        return x

In [None]:
attempt_float((1, 2))

You can catch multiple exception types by writing a tuple of exception types instead.

In [None]:
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

In some cases, you may not want to suppress an exception, but you want some code to be executed regardless of whether the code in the try block succeeds or not. To do this, use finally:

In [None]:
f = open(path, 'w')

try:
    write_to_file(f)
finally:
    f.close()

Here, the file handle f will always get closed. Similarly, you can have code that executes only if the try: block succeeds using else:

In [None]:
f = open(path, 'w')

try:
    write_to_file(f)
except:
    print('Failed')
else:
    print('Succeeded')
finally:
    f.close()
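The order in which the clauses run can be traced with a self-contained sketch. The helper `run` and its `events` list are hypothetical names introduced for illustration; float stands in for the file-writing code above:

```python
def run(action):
    """Call action() and record which clauses of the try statement ran."""
    events = []
    try:
        action()
    except ValueError:
        events.append('except')
    else:
        events.append('else')
    finally:
        events.append('finally')
    return events

ok = run(lambda: float('1.5'))
failed = run(lambda: float('something'))
print(ok)      # ['else', 'finally']
print(failed)  # ['except', 'finally']
```

In both cases the finally clause runs last, which is why it is a natural place for cleanup such as closing a file.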

### Exceptions in IPython

If an exception is raised while you are running a script or executing any statement, IPython will by default print a full call stack trace (traceback) with a few lines of context around the position at each point in the stack.

## Files and the Operating System

Most of this book uses high-level tools like pandas.read_csv to read files from disk into Python data structures. However, it's important to understand the basics of how to work with files in Python. Fortunately, it's very simple, which is one reason why Python is so popular for text and file munging.

To open a file for reading or writing, use the built-in open function with either a relative or absolute file path:

In [None]:
path = 'examples/segismundo.txt'

f = open(path)

By default, the file is opened in read-only mode 'r'. We can then treat the file handle f like a list and iterate over the lines like so:

In [None]:
for line in f:
    pass

The lines come out of the file with the end-of-line (EOL) markers intact, so you'll often see code to get an EOL-free list of lines in a file like:

In [None]:
lines = [x.rstrip() for x in open(path)]
lines

When you use open to create file objects, it is important to explicitly close the file when you are finished with it. Closing the file releases its resources back to the operating system:

In [None]:
f.close()

One of the ways to make it easier to clean up open files is to use the with statement:

In [None]:
with open(path) as f:
    lines = [x.rstrip() for x in f]

This will automatically close the file f when exiting the with block.
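The following sketch checks this behaviour through the file object's closed attribute. It writes to a scratch file in a temporary directory (a hypothetical `demo.txt`, used so that the example does not depend on examples/segismundo.txt):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')

    # Write two lines; the file is closed when the with block ends.
    with open(path, 'w') as f:
        f.write('first line\nsecond line\n')
    closed_after_write = f.closed
    print(closed_after_write)  # True

    # Read the lines back without their end-of-line markers.
    with open(path) as f:
        lines = [x.rstrip() for x in f]
    print(lines)  # ['first line', 'second line']
```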

If we had typed f = open(path, 'w'), a new file at examples/segismundo.txt would have been created, overwriting any file already in its place.

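For readable files, some of the most commonly used methods are read, seek, and tell. read returns a certain number of characters from the file. What counts as a character is determined by the file's encoding in text mode, or is simply a raw byte if the file is opened in binary mode ('rb'):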

In [None]:
f = open(path)

f.read(10)

In [None]:
f2 = open(path, 'rb')

f2.read(10)

The read method advances the file handle's position by the number of bytes read. tell gives you the current position:

In [None]:
f.tell()

In [None]:
f2.tell()

In [None]:
import sys

sys.getdefaultencoding()

seek changes the file position to the indicated byte in the file:

In [None]:
f.seek(3)

In [None]:
f.close()
f2.close()
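As a self-contained sketch of read, tell, and seek working together, the example below uses a scratch file opened in binary mode, where positions are plain byte offsets. The file name `demo.bin` and its contents are assumptions made for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.bin')
    with open(path, 'wb') as f:
        f.write(b'0123456789abcdef')

    with open(path, 'rb') as f:
        chunk = f.read(10)           # read the first 10 bytes
        pos_after_read = f.tell()    # position advanced by 10
        f.seek(3)                    # jump back to byte offset 3
        after_seek = f.read(4)       # read 4 bytes starting there

print(chunk)           # b'0123456789'
print(pos_after_read)  # 10
print(after_seek)      # b'3456'
```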