# CHAPTER 3
# Built-in Data Structures, Functions, and Files

## Data Structures and Sequences

### Tuple
- **Tuple** = fixed-length, immutable sequence of Python objects.
- Simple tuples can be created as a comma-separated sequence of values.
- More complex expressions, such as nested tuples, often require enclosing the values in parentheses.
- You can convert any sequence or iterator to a tuple by invoking the **tuple** function.
- Elements can be accessed with square brackets [] as with most other sequence types.
- While the objects stored in a tuple may be mutable themselves, once the tuple is created it’s not possible to modify which object is stored in each slot.
- You can concatenate tuples using the + operator to produce longer tuples.
- Multiplying a tuple by an integer, as with lists, has the effect of concatenating together that many copies of the tuple.

In [2]:
# Define a simple tuple as a comma-separated sequence of values

tup = 4, 5, 6
tup

(4, 5, 6)

In [3]:
# Check the type of the above defined python object

type(tup)

tuple

In [1]:
# Define a more complex nested tuple containing 2 tuples of different sizes

nested_tup = (4, 5, 6), (7, 8)
nested_tup

((4, 5, 6), (7, 8))

In [3]:
# Define a list

my_list = [4, 0, 2]
type(my_list)

list

In [4]:
# Convert my_list to a tuple

tuple(my_list)

(4, 0, 2)

**REMEMBER**: Strings are a sequence of Unicode characters and therefore can be treated like other sequences.

In [1]:
# Convert a string to a tuple

tup = tuple('python')
tup

('p', 'y', 't', 'h', 'o', 'n')

**REMEMBER**: Sequences are 0-indexed in Python.

In [2]:
# Check the first element in tup

tup[0]

'p'

In [4]:
# Define another tuple by converting a list to a tuple

tup = tuple(['foo', [1, 2], True])
tup

('foo', [1, 2], True)

In [5]:
# Try to change the value of the tuple at tup[2] - you will get an error

tup[2] = False

TypeError: 'tuple' object does not support item assignment

In [6]:
# If an object inside a tuple is mutable, such as a list, you can modify it in-place

tup[1].append(3)
tup

('foo', [1, 2, 3], True)

In [15]:
# A tuple that contains a single object needs a trailing comma; without it, the parentheses only group the expression

test_tuple = ('bar')
type(test_tuple)

#test_tuple = ('bar',)
#type(test_tuple)

str
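Parentheses alone do not create a tuple; the comma does. A minimal sketch of the difference (the names `not_a_tuple` and `one_tuple` are illustrative and do not appear in the original notebook):

```python
# Parentheses only group the expression, so this is still a string
not_a_tuple = ('bar')

# The trailing comma is what makes a one-element tuple
one_tuple = ('bar',)

print(type(not_a_tuple))  # <class 'str'>
print(type(one_tuple))    # <class 'tuple'>
print(len(one_tuple))     # 1
```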

In [8]:
# Concatenate tuples using the + operator
(4, None, 'foo') + (6, 0) + ('bar',)

(4, None, 'foo', 6, 0, 'bar')

In [16]:
# Multiplying a tuple by an integer

('foo', 'bar') * 4

('foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar')
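One detail worth keeping in mind is that multiplying a tuple repeats the references to its objects rather than copying the objects themselves. The following sketch illustrates this with a mutable element (the names `inner` and `repeated` are made up for the example):

```python
inner = [1, 2]

# The resulting tuple holds three references to the same list object
repeated = (inner,) * 3

# Mutating the list is therefore visible through every slot
inner.append(3)

print(repeated)                    # ([1, 2, 3], [1, 2, 3], [1, 2, 3])
print(repeated[0] is repeated[2])  # True
```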

#### Unpacking tuples
- If you try to assign to a tuple-like expression of variables, Python will attempt to unpack the value on the righthand side of the equals sign.
- Even sequences with nested tuples can be unpacked.
- Using this functionality you can easily swap the values of two variables.
- A common use of variable unpacking is iterating over sequences of tuples or lists.
- Another common use is returning multiple values from a function. (covered later in the book)
- You can use the **\*rest** syntax to pluck a few elements from the beginning of a tuple and collect the remaining ones in a list.
- The **rest** part is sometimes something you want to discard; there is nothing special about the name **rest**.
- Many Python programmers will use the underscore (_) for unwanted variables.
        a, b, *_ = values

In [17]:
# Assign to a tuple-like expression of variables

tup = (4, 5, 6)
a, b, c = tup
print(a, b, c)

4 5 6


In [18]:
# Unpack nested tuples

tup = 4, 5, (6, 7)
a, b, (c, d) = tup
d

7

In [19]:
# Swap the values of two variables

a, b = 1, 2
print(a, b)

b, a = a, b
print(a, b)

1 2
2 1


In [20]:
# Iterating over a list of tuples

seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

for a, b, c in seq:
    print('a={0}, b={1}, c={2}'.format(a, b, c))

a=1, b=2, c=3
a=4, b=5, c=6
a=7, b=8, c=9


In [30]:
# Using the special *rest syntax to select a few elements from a tuple

values = 1, 2, 3, 4, 5, 6

a,b, *rest = values
print('a =',a, 'b =',b, 'rest = ', rest)

a = 1 b = 2 rest =  [3, 4, 5, 6]
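The underscore convention mentioned above can be sketched as follows. The second unpacking, with the starred name in the middle, is an extra illustration and uses hypothetical names (`first`, `middle`, `last`):

```python
values = 1, 2, 3, 4, 5, 6

# Keep the first two values and discard the remainder under the name _
a, b, *_ = values
print(a, b)  # 1 2

# The starred name can also sit in the middle; it always receives a list
first, *middle, last = values
print(first, middle, last)  # 1 [2, 3, 4, 5] 6
```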


#### Tuple methods
- Because the size and contents of a tuple cannot be modified, it has very few instance methods.
- **count** = counts the number of occurrences of a value.

In [31]:
# Use the count method on a tuple

tup = (1, 2, 2, 2, 3, 4, 2)
tup.count(2)

4

**REMEMBER**: You can use tab completion in Jupyter notebooks to check the methods available on a tuple.

In [35]:
# Tab completion tuple methods

# tup.<Tab>   (press Tab after the dot to list the available methods, such as count and index)

### List
- Lists are variable-length and their contents can be modified in-place.
- You can define them using square brackets [] or using the **list** type function.
- You can convert a tuple to a list using the **list** function.
- The **list** function is used in data processing as a way to materialize an iterator or generator expression.

In [2]:
# Define a list using [] brackets

a_list = [2, 3, 7, None]
a_list

[2, 3, 7, None]

In [3]:
# Convert a tuple to a list

tup = ('foo', 'bar', 'baz')
b_list = list(tup)
b_list

['foo', 'bar', 'baz']

In [4]:
# Modify an element of b_list

b_list[1] = 'peekaboo'
b_list

['foo', 'peekaboo', 'baz']

In [6]:
# Generate a sequence of integers using the range function

gen = range(10)
gen

range(0, 10)

In [9]:
# Check the type of gen

type(gen)

range

In [10]:
# Convert the type range to a list

list(gen)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
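The same **list** call also materializes a generator expression. A small sketch, assuming an illustrative generator named `squares`:

```python
# A generator expression is lazy: nothing is computed when it is defined
squares = (n * n for n in range(5))

# list() consumes the generator and stores the results
squares_list = list(squares)
print(squares_list)   # [0, 1, 4, 9, 16]

# A generator can only be consumed once, so a second pass yields nothing
print(list(squares))  # []
```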

#### Adding and removing elements
- Elements can be appended to the end of the list with the **append** method.
- Using **insert** you can insert an element at a specific location in the list. The insertion index must be between 0 and the length of the list, inclusive.
- **insert** is computationally expensive compared with **append**, because references to subsequent elements have to be shifted internally to make room for the new element.
- If you need to insert elements at both the beginning and end of a sequence, you may wish to explore **collections.deque**. (https://docs.python.org/2/library/collections.html)
- **pop** removes and returns an element at a particular index.
- Elements can be removed by value with **remove**, which locates the first such value and removes it from the list.
- Check if a list contains a value using the **in** keyword. The keyword **not** can be used to negate **in**.
- Checking whether a list contains a value is a lot slower than doing so with dicts and sets, because Python makes a linear scan across the values of the list, whereas it can check the others (based on hash tables) in constant time on average.

In [12]:
# Append an element to a list

b_list.append('dwarf')
b_list

['foo', 'peekaboo', 'baz', 'dwarf']

In [13]:
# Insert an element at a specific location in the list

b_list.insert(1, 'red')
b_list

['foo', 'red', 'peekaboo', 'baz', 'dwarf']

In [14]:
# Use pop to remove an element at a certain location in the list

b_list.pop(2)
b_list

['foo', 'red', 'baz', 'dwarf']

In [15]:
# Append another element to the list

b_list.append('foo')
b_list

['foo', 'red', 'baz', 'dwarf', 'foo']

In [16]:
# Use remove to remove elements by value - only first element with that value is removed

b_list.remove('foo')
b_list

['red', 'baz', 'dwarf', 'foo']

In [17]:
# Check if a list contains a value

'dwarf' in b_list

True

In [18]:
# Check if a list does not contain a value

'dwarf' not in b_list

False
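For the case of adding elements at both ends, here is a brief sketch of **collections.deque**. The variable name `queue` and its contents are illustrative only:

```python
from collections import deque

queue = deque(['baz', 'dwarf'])

# appendleft adds at the front without shifting the other elements
queue.appendleft('foo')
# append adds at the back, as with a list
queue.append('red')
print(queue)  # deque(['foo', 'baz', 'dwarf', 'red'])

# popleft and pop remove from the front and the back, respectively
first = queue.popleft()
last = queue.pop()
print(first, last, list(queue))  # foo red ['baz', 'dwarf']
```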

#### Concatenating and combining lists
- You can use **+** operator to **concatenate** lists.
- You can append multiple elements to it using the **extend** method.
- Note that list **concatenation by addition** is a comparatively expensive operation since a new list must be created and the objects copied over. 
- Using **extend** to append elements to an existing list, especially if you are building up a large list, is usually preferable.
        everything = []
        for chunk in list_of_lists:
            everything.extend(chunk)            # faster
            # everything = everything + chunk   # slower

In [19]:
# Concatenating lists using + operator

[4, None, 'foo'] + [7, 8, (2, 3)]

[4, None, 'foo', 7, 8, (2, 3)]

In [20]:
# Use extend method to append multiple elements to a list

x = [4, None, 'foo']
x.extend([7, 8, (2, 3)])
x

[4, None, 'foo', 7, 8, (2, 3)]
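The difference between the two approaches can be made visible by checking object identity. This is a sketch with made-up data (`list_of_lists`, `x`, `y`), not a benchmark:

```python
list_of_lists = [[1, 2], [3, 4, 5], [6]]  # illustrative input

# extend grows the existing list in place
everything = []
for chunk in list_of_lists:
    everything.extend(chunk)
print(everything)  # [1, 2, 3, 4, 5, 6]

# + builds a brand-new list and copies the elements over
x = [4, None, 'foo']
y = x + [7, 8]
print(y is x)  # False

# extend keeps the same list object
same_object = x
x.extend([7, 8])
print(x is same_object)  # True
```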

#### Sorting
- You can sort a list in-place (without creating a new object) by calling its **sort** method.
- **sort** accepts an optional sort **key**: a function that produces the value used to order the objects.
- For example, we could **sort** a collection of strings by their lengths.

In [2]:
# Sort a list of integers

a = [7, 2, 5, 1, 3]
a.sort()
a

[1, 2, 3, 5, 7]


In [3]:
# Sort a collection of strings by their length

b = ['saw', 'small', 'He', 'foxes', 'six']
b.sort(key=len)
b

['He', 'saw', 'six', 'small', 'foxes']

#### Binary search and maintaining a sorted list
- The built-in **bisect** module implements binary search and insertion into a sorted list.
- **bisect.bisect** finds the location where an element should be inserted to keep the list sorted.
- **bisect.insort** actually inserts the element into the location found by **bisect.bisect**.
- The **bisect** module functions do not check whether the list is sorted.
- Using the **bisect** module with an unsorted list will succeed without error but may lead to incorrect results.

In [4]:
# Import the bisect module

import bisect

In [14]:
# Use bisect.bisect to find the location where the element 5 should be inserted in list c

c = [1, 2, 2, 2, 3, 4, 7]
bisect.bisect(c, 5)

6

In [16]:
# Use bisect.insort to insert the element 5 at the location found by bisect.bisect

bisect.insort(c, 5)
c

[1, 2, 2, 2, 3, 4, 5, 7]
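To illustrate the warning about unsorted input, the sketch below calls **bisect.insort** on a list that is not sorted. The lists `sorted_values` and `unsorted_values` are made up for the example:

```python
import bisect

sorted_values = [1, 2, 2, 2, 3, 4, 7]

# bisect.bisect is an alias for bisect_right: it returns the position
# after any existing equal elements; bisect_left returns the position before
print(bisect.bisect_left(sorted_values, 2))   # 1
print(bisect.bisect_right(sorted_values, 2))  # 4

# On an unsorted list the call succeeds, but the result is not sorted
unsorted_values = [7, 1, 4, 2, 3]
bisect.insort(unsorted_values, 5)
print(unsorted_values)                             # [7, 1, 4, 2, 3, 5]
print(unsorted_values == sorted(unsorted_values))  # False
```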

#### Slicing
- You can select sections of most sequence types by using slice notation, which in its basic form consists of [**start:stop**].
- The **start** index is included while the **stop** index is not: [1:4] selects the elements at indexes 1, 2, and 3.
- Slices of a list can be assigned with a sequence.
- Either the **start** or **stop** can be omitted, in which case they default to the start of the sequence and the end of the sequence, respectively.
- **Negative** indices slice the sequence relative to the end.
- A **step** can also be used after a second colon to, say, take every other element: [::2].
- Using [::-1] has the effect of reversing a list or tuple.

In [30]:
# Slicing a list using [start:stop]

seq = ['foo', 'peekaboo', 'python', 'red', 'dwarf']
print(seq)
print(seq[1:4])

['foo', 'peekaboo', 'python', 'red', 'dwarf']
['peekaboo', 'python', 'red']


In [31]:
# Assign a slice in a list with a sequence

seq[3:4] = ['green', 'black']
seq

['foo', 'peekaboo', 'python', 'green', 'black', 'dwarf']

In [32]:
# Omit start to get everything from the beginning of the list till stop

print(seq)
print(seq[:3])

['foo', 'peekaboo', 'python', 'green', 'black', 'dwarf']
['foo', 'peekaboo', 'python']


In [33]:
# Omit stop to get everything from start till the end of the list

print(seq)
print(seq[4:])

['foo', 'peekaboo', 'python', 'green', 'black', 'dwarf']
['black', 'dwarf']


In [34]:
# Use negative indices to slice relative to the end

print(seq)
print(seq[-4:-2])

['foo', 'peekaboo', 'python', 'green', 'black', 'dwarf']
['python', 'green']


In [29]:
# Use a step to take every other element
print(seq)
print(seq[::2])

['foo', 'peekaboo', 'python', 'green', 'black', 'dwarf']
['foo', 'python', 'black']


In [36]:
# Use a step of -1 to reverse a list
print(seq)
print(seq[::-1])

['foo', 'peekaboo', 'python', 'green', 'black', 'dwarf']
['dwarf', 'black', 'green', 'python', 'peekaboo', 'foo']
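A few consequences of the slicing rules above, shown as a short sketch. The list matches the one used in the cells above, while `start`, `stop`, and `shallow_copy` are illustrative names:

```python
seq = ['foo', 'peekaboo', 'python', 'green', 'black', 'dwarf']

# Because stop is excluded, an in-range slice has stop - start elements
start, stop = 1, 4
print(len(seq[start:stop]) == stop - start)  # True

# A stop beyond the end of the list is clipped rather than raising an error
print(seq[4:100])  # ['black', 'dwarf']

# Omitting both start and stop produces a shallow copy
shallow_copy = seq[:]
print(shallow_copy == seq, shallow_copy is seq)  # True False

# A negative step walks backwards, here taking every other element
print(seq[::-2])  # ['dwarf', 'green', 'peekaboo']
```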


### Built-in Sequence Functions

#### enumerate
- Returns a sequence of (i, value) tuples
- Used to keep track of the index of the current item when iterating over a sequence.
- Do it yourself approach:
        i = 0
        for value in collection:
            # do something with value
            i += 1
- Python built-in function **enumerate**:
        for i, value in enumerate(collection):
            # do something with value
- When you are indexing data, a helpful pattern that uses **enumerate** is computing a **dict mapping** the values of a sequence (which are assumed to be unique) to their locations in the sequence.

In [41]:
# Define a list called some_list & an empty dict called mapping

some_list = ['foo', 'bar', 'baz']
mapping = {}

In [43]:
# Use enumerate to map the values from some_list to their locations in the sequence

for i, v in enumerate(some_list):
    mapping[v] = i
    
mapping

{'foo': 0, 'bar': 1, 'baz': 2}
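The uniqueness assumption matters: if a value repeats, later positions overwrite earlier ones. A small sketch with made-up data (`letters`, `last_position`, `numbered`), which also shows the optional `start` argument of **enumerate**:

```python
letters = ['a', 'b', 'a']

# With duplicates, the dict keeps only the last index seen for each value
last_position = {}
for i, v in enumerate(letters):
    last_position[v] = i
print(last_position)  # {'a': 2, 'b': 1}

# enumerate can begin counting from a value other than 0
numbered = list(enumerate(letters, start=1))
print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'a')]
```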

#### sorted
- The **sorted** function returns a new sorted list from the elements of any sequence.
- The **sorted** function accepts the same arguments as the **sort** method on lists.

In [47]:
# Sort a list of integers using sorted function

sorted([7, 1, 2, 6, 0, 3, 2])

[0, 1, 2, 2, 3, 6, 7]
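Since **sorted** accepts the same arguments as the **sort** method, the `key` and `reverse` arguments work here too, and the input is left untouched. A sketch, assuming the illustrative names `words` and `by_length`:

```python
words = ['saw', 'small', 'He', 'foxes', 'six']

# sorted returns a new list; the original is not modified
by_length = sorted(words, key=len)
print(by_length)  # ['He', 'saw', 'six', 'small', 'foxes']
print(words)      # ['saw', 'small', 'He', 'foxes', 'six']

# Any sequence works, including a string (sorted character by character)
print(sorted('horse race'))
# [' ', 'a', 'c', 'e', 'e', 'h', 'o', 'r', 'r', 's']

# reverse=True sorts in descending order
print(sorted([7, 1, 2, 6], reverse=True))  # [7, 6, 2, 1]
```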

#### zip
- **zip** = “pairs” up the elements of a number of lists, tuples, or other sequences to create an iterator of tuples (wrap it in **list** to materialize it).
- **zip** can take an arbitrary number of sequences, and the number of elements it produces is determined by the *shortest* sequence.
- A very common use of **zip** is simultaneously iterating over multiple sequences, possibly also combined with **enumerate**.
- Given a “zipped” sequence, **zip** can be applied to “unzip” the sequence.
        zip(*my_list_of_tuples)

In [49]:
# Use zip to pair up the elements of 2 lists into a list of tuples

seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']
zipped = zip(seq1, seq2)
list(zipped)

[('foo', 'one'), ('bar', 'two'), ('baz', 'three')]

In [50]:
# Add a 3rd list with only 2 elements and use zip with seq1, seq2, seq3

seq3 = [False, True]
list(zip(seq1, seq2, seq3))

[('foo', 'one', False), ('bar', 'two', True)]

In [51]:
# Iterate over multiple sequences using zip & enumerate

for i, (a, b) in enumerate(zip(seq1, seq2)):
    print('{0}: {1}, {2}'.format(i, a, b))

0: foo, one
1: bar, two
2: baz, three


In [52]:
# "Unzip" a sequence

pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'), ('Curt', 'Schilling')]
first_names, last_names = zip(*pitchers)
print(first_names, last_names)

('Nolan', 'Roger', 'Curt') ('Ryan', 'Clemens', 'Schilling')


#### reversed
- **reversed** iterates over the elements of a sequence in reverse order.
- **reversed** returns an iterator (generators are discussed in more detail later), so it does not create the reversed sequence until materialized (e.g., with *list* or a *for loop*).

In [55]:
# Reversed function

list(reversed(range(10)))

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
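The lazy behavior can be observed directly by pulling items one at a time. A minimal sketch, using an illustrative iterator named `rev`:

```python
rev = reversed(['foo', 'bar', 'baz'])

# Nothing has been produced yet; next() pulls one element at a time
first_item = next(rev)
print(first_item)  # baz

# list() materializes whatever is left
remaining = list(rev)
print(remaining)   # ['bar', 'foo']

# Once exhausted, the iterator yields nothing more
print(list(rev))   # []
```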

### Dict
- **dict** is likely the most important built-in Python data structure. 
- In other languages it is more commonly known as a **hash map** or **associative array**.
- **dict** = a flexibly sized collection of *key-value* pairs, where key and value are Python objects. 
- You can create a **dict** using  curly braces {} and colons to separate keys and values.
- You can access, insert, or set elements using the same syntax as for accessing elements of a **list** or **tuple** with square brackets [].
- You can check if a **dict** contains a **key** using **in**.
- You can delete values either using the **del** keyword or the **pop** method (which *simultaneously* returns the value and deletes the key).
- The **keys** method and **values** method give you iterable views of the dict’s keys and values, respectively.
- Since Python 3.7, a dict preserves the insertion order of its key-value pairs, and these methods output the keys and values in that same order.
- You can merge one dict into another using the **update** method, which changes dicts in-place, so any existing keys in the data passed to **update** will have their old values discarded.

In [16]:
# Create an empty dict

empty_dict = {}

In [17]:
# Define a dict with 2 key-value pairs

dict1 = {'a' : 'some value', 'b' : [1, 2, 3, 4]}
dict1

{'a': 'some value', 'b': [1, 2, 3, 4]}

In [18]:
# Insert a new key-value pair in dict1

dict1[7] = 'an integer'
dict1

{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

In [19]:
# Access the value for the key 'b' from dict1

dict1['b']

[1, 2, 3, 4]

In [20]:
# Check if dict1 contains key 'b'

'b' in dict1

True

In [21]:
# Insert  2 other key-value pairs in dict1

dict1[5] = 'some value'
dict1['dummy'] = 'another value'
dict1

{'a': 'some value',
 'b': [1, 2, 3, 4],
 7: 'an integer',
 5: 'some value',
 'dummy': 'another value'}

In [22]:
# Delete the key-value pair for key = 5

del dict1[5]
dict1

{'a': 'some value',
 'b': [1, 2, 3, 4],
 7: 'an integer',
 'dummy': 'another value'}

In [26]:
# Remove the key 'dummy' with the pop method and assign the returned value to ret

ret = dict1.pop('dummy')
print('ret =', ret, '& dict1 =', dict1)

ret = another value & dict1 = {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}


In [27]:
# Use the keys method to access the keys in dict1

list(dict1.keys())

['a', 'b', 7]

In [28]:
# Use the values method to access the values in dict1

list(dict1.values())

['some value', [1, 2, 3, 4], 'an integer']

In [29]:
# Update dict1

dict1.update({'b' : 'foo', 'c' : 12})
dict1

{'a': 'some value', 'b': 'foo', 7: 'an integer', 'c': 12}

#### Creating dicts from sequences
- It's common to end up with two **sequences** that you want to pair up element-wise in a **dict**.
- You could use a for loop:
        mapping = {}
        for key, value in zip(key_list, value_list):
            mapping[key] = value
- Or, better, use the **dict** function, which accepts a sequence of 2-tuples.

**REMEMBER**: **zip** = pairs up the elements of a number of lists, tuples, or other sequences to create an *iterator of tuples*.

In [32]:
# Create a dict using the zip function to create a list of 2-tuples

list_2_tuples = zip(range(5), reversed(range(5)))
mapping = dict(list_2_tuples)
mapping

{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}

#### Default values
- It’s very common to have logic like:
        if key in some_dict:
            value = some_dict[key]
        else:
            value = default_value
- The above if-else block can be simplified using the **get** and **pop** dict methods because they can take **default values** to be returned:
        value = some_dict.get(key, default_value)
- **get** by default will return **None** if the key is not present, while **pop** will raise an exception.
- With **setting** values, a common case is for the values in a dict to be other collections, like lists. In this case you can use the **setdefault** dict method.
- The built-in **collections** module has a useful class, **defaultdict**, which makes setting dict values even easier. 
- To create one, you pass a type or function for generating the default value for each slot in the dict:
        from collections import defaultdict
        by_letter = defaultdict(list)
        for word in words:
            by_letter[word[0]].append(word)

In [3]:
# Categorize a list of words by their first letters as a dict of lists using setdefault dict method

# Define the list of words
words = ['apple', 'bat', 'bar', 'atom', 'book', 'compass'] 

# Define an empty dict
by_letter = {}

# Use the setdefault dict method to categorize the list of words by their first letter
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)
    
# Output the dict by_letter
by_letter

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book'], 'c': ['compass']}

In [15]:
# Categorize a list of words by their first letters as a dict of lists using defaultdict from collections module

# Define a list of colors
colors = ['red', 'blue', 'green', 'gray', 'pink', 'black']

# Import defaultdict from the collections module
from collections import defaultdict

# Create a defaultdict of lists & use a for loop to append each color under its first letter
colors_by_letter = defaultdict(list)
for color in colors:
    colors_by_letter[color[0]].append(color)
    
# Output the dict colors_by_letter
colors_by_letter

defaultdict(list,
            {'r': ['red'],
             'b': ['blue', 'black'],
             'g': ['green', 'gray'],
             'p': ['pink']})
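The default-value behavior of **get** and **pop** described above can be sketched as follows. The dict `some_dict` and the key `'z'` are made up for the example:

```python
some_dict = {'a': 1, 'b': 2}

# get returns the stored value, or the supplied default when the key is missing
print(some_dict.get('a', 0))  # 1
print(some_dict.get('z', 0))  # 0

# Without a default, get returns None for a missing key
print(some_dict.get('z'))     # None

# pop with a default returns it instead of raising for a missing key
print(some_dict.pop('z', 'missing'))  # missing

# pop without a default raises KeyError for a missing key
try:
    some_dict.pop('z')
except KeyError as exc:
    print('KeyError:', exc)  # KeyError: 'z'
```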

#### Valid dict key types
- **values** of a **dict** can be any Python object.
- **keys** have to be *immutable* objects like scalar types (int, float, string) or tuples (all the objects in the tuple need to be immutable, too).
- The technical term here is **hashability**. 
- You can check whether an object is hashable (can be used as a key in a dict) with the **hash** function.
- To use a list as a key, one option is to convert it to a tuple.

In [17]:
# Use hash function on a string

hash('string')

9037810056053108966

In [18]:
# Use hash function on a tuple

hash((1, 2, (2, 3)))

1097636502276347782

In [19]:
# Use hash function on a tuple that contains a list - it will fail because lists are mutable (unhashable)

hash((1, 2, [2, 3]))

TypeError: unhashable type: 'list'

In [20]:
# Convert a list to a tuple and use it as a key in a dict

d = {}
d[tuple([1, 2, 3])] = 5
d

{(1, 2, 3): 5}
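If you want to test hashability without letting the exception propagate, one option is a small helper built on the **hash** function. `is_hashable` is a hypothetical name introduced here, not a built-in:

```python
def is_hashable(obj):
    """Return True if obj can be used as a dict key or set element."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable('string'))          # True
print(is_hashable((1, 2, (2, 3))))    # True
print(is_hashable((1, 2, [2, 3])))    # False, the tuple contains a list
print(is_hashable([1, 2, 3]))         # False
print(is_hashable(tuple([1, 2, 3])))  # True
```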

### Set
- **set** = an unordered collection of unique elements.
- You can think of them like dicts, but keys only, no values. 
- A **set** can be created in two ways: via the **set function** or via a set literal with curly braces.
- **Sets** support mathematical set operations like union, intersection, difference, and symmetric difference.
- **Union** of two sets is the set of distinct elements occurring in either set. This can be computed with either the **union** method or the | binary operator.
- **Intersection** contains the elements occurring in both sets. The & operator or the **intersection** method can be used.
- All of the logical set operations have in-place counterparts, which enable you to replace the contents of the set on the left side of the operation with the result.
- Like dict keys, **set** elements generally must be *immutable* (hashable). To store list-like elements, convert them to tuples.
- You can also check if a set is a **subset** of (is contained in) or a **superset** of (contains all elements of) another set.
- Sets are equal if and only if their contents are equal.

In [22]:
# Define sets using the set function & curly braces

set1 = set([10, 2, 2, 2, 1, 3, 4, 5, 2])
set2 = {4, 2, 3, 4, 2, 2, 7, 8}

print(set1, set2)

{1, 2, 3, 4, 5, 10} {2, 3, 4, 7, 8}


In [23]:
# Union of sets set1 & set2 using union method or | binary operator

u1 = set1.union(set2)
u2 = set1 | set2
print(u1, u2)

{1, 2, 3, 4, 5, 7, 8, 10} {1, 2, 3, 4, 5, 7, 8, 10}


In [24]:
# Intersection of sets set1 & set2 using intersection method or & operator

intersect1 = set1.intersection(set2)
intersect2 = set1 & set2
print(intersect1, intersect2)

{2, 3, 4} {2, 3, 4}


In [25]:
# Create set3 as a copy of set1
set3 = set1.copy()

# Replace set3 with the union between set3 & set2
set3 |= set2
set3

{1, 2, 3, 4, 5, 7, 8, 10}

In [26]:
# Create set4 as a copy of set1
set4 = set1.copy()

# Replace set4 as the intersection of set4 & set2
set4 &= set2
set4

{2, 3, 4}

In [27]:
# In order to have list-like elements you need to convert the list to a tuple

my_set = {tuple([1, 2, 3, 4])}
my_set

{(1, 2, 3, 4)}

In [28]:
# Check if a set is a subset of another set

a_set = {1, 2, 3, 4, 5}
{1, 2, 3}.issubset(a_set)

True

In [29]:
# Check if a set is a superset of another set

a_set.issuperset({1, 2, 3})

True

#### Python set operations
    Function                          Alternative syntax   Description
    a.add(x)                          N/A                  Add element x to the set a
    a.clear()                         N/A                  Reset the set a to an empty state, discarding all of its elements
    a.remove(x)                       N/A                  Remove element x from the set a
    a.pop()                           N/A                  Remove an arbitrary element from the set a, raising KeyError if the set is empty
    a.union(b)                        a | b                All of the unique elements in a and b
    a.update(b)                       a |= b               Set the contents of a to be the union of the elements in a and b
    a.intersection(b)                 a & b                All of the elements in both a and b
    a.intersection_update(b)          a &= b               Set the contents of a to be the intersection of the elements in a and b
    a.difference(b)                   a - b                The elements in a that are not in b
    a.difference_update(b)            a -= b               Set a to the elements in a that are not in b
    a.symmetric_difference(b)         a ^ b                All of the elements in either a or b but not both
    a.symmetric_difference_update(b)  a ^= b               Set a to contain the elements in either a or b but not both
    a.issubset(b)                     a <= b               True if the elements of a are all contained in b
    a.issuperset(b)                   a >= b               True if the elements of b are all contained in a
    a.isdisjoint(b)                   N/A                  True if a and b have no elements in common
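A few of the operations from the table that are not exercised in the cells above, as a short sketch. The sets `a`, `b`, and `c` are illustrative, and results are passed through `sorted` so the printed order does not depend on the set's internal layout:

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

# Difference: elements of a that are not in b
print(sorted(a - b))  # [1, 2]

# Symmetric difference: elements in exactly one of the two sets
print(sorted(a ^ b))  # [1, 2, 6, 7, 8]

# In-place difference modifies the set on the left
c = a.copy()
c -= b
print(sorted(c))      # [1, 2]

# Disjointness, subset via operator, and equality by contents
print(a.isdisjoint({9, 10}))   # True
print({1, 2, 3} <= a)          # True
print({1, 2, 3} == {3, 2, 1})  # True
```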

### List, Set, and Dict Comprehensions
- **List comprehensions** allow you to concisely form a new list by filtering the elements of a collection, transforming the elements passing the filter in one concise expression:
        [expr for val in collection if condition]
- This is equivalent to the following for loop:
        result = []
        for val in collection:
            if condition:
                result.append(expr)
- The filter condition can be omitted, leaving only the expression. 
- A **dict comprehension** looks like this:
        dict_comp = {key-expr : value-expr for value in collection
                    if condition}
- A **set comprehension** looks like the equivalent list comprehension except with curly braces instead of square brackets:
        set_comp = {expr for value in collection if condition}

In [33]:
# Create a list of strings
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

# Filter out strings with length 2 or less and convert what remains to uppercase
list_comp1 = [x.upper() for x in strings if len(x) > 2]

# Omit the filter condition which means we make all values uppercase
list_comp2 = [x.upper() for x in strings]

# Print output
print(list_comp1, list_comp2)

['BAT', 'CAR', 'DOVE', 'PYTHON'] ['A', 'AS', 'BAT', 'CAR', 'DOVE', 'PYTHON']


In [35]:
# Create a set containing just the lengths of the strings contained in the 'strings' collection

unique_lengths = {len(x) for x in strings}
unique_lengths

{1, 2, 3, 4, 6}

In [36]:
# We could achieve the same thing as the cell above using the map function

set(map(len, strings))

{1, 2, 3, 4, 6}

In [37]:
# Create a lookup map of the strings and their location in the list

loc_mapping = {val : index for index, val in enumerate(strings)}
loc_mapping

{'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}

#### Nested list comprehensions
- The for parts of the **list comprehension** are arranged according to the order of nesting, and any filter condition is put at the end as before.
- You can have arbitrarily many levels of nesting, though if you have more than two or three levels of nesting you should probably start to question whether this makes sense from a code readability standpoint.

In [38]:
# List of lists containing some English and Spanish names
all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'],
            ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']]

In [39]:
# Get a single list containing all names with two or more e’s in them using a for loop
names_of_interest = []
for names in all_data:
    enough_es = [name for name in names if name.count('e') >= 2]
    names_of_interest.extend(enough_es)
names_of_interest

['Steven']

In [40]:
# Get a single list containing all names with two or more e’s in them using nested list comprehension
result = [name for names in all_data for name in names if name.count('e') >= 2]
result

['Steven']

In [41]:
# “Flatten” a list of tuples of integers into a simple list of integers

some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
flattened = [x for tup in some_tuples for x in tup]
flattened

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [42]:
# A list comprehension inside a list comprehension resulting in a list of lists

[[x for x in tup] for tup in some_tuples]

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
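The order of the **for** parts in a nested comprehension mirrors the order of the equivalent nested loops. A sketch of the flattening example written both ways (`flattened_loop` and `flattened_comp` are illustrative names):

```python
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# Nested for loops: the outer loop comes first, the inner loop second
flattened_loop = []
for tup in some_tuples:
    for x in tup:
        flattened_loop.append(x)

# The comprehension lists its for clauses in the same outer-to-inner order
flattened_comp = [x for tup in some_tuples for x in tup]

print(flattened_loop)                    # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(flattened_loop == flattened_comp)  # True
```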

## Functions
- **Functions** are the primary and most important method of code organization and reuse in Python.
- As a rule of thumb, if you anticipate needing to repeat the same or very similar code more than once, it may be worth writing a reusable **function**.
- **Functions** can also help make your code more readable by giving a name to a group of Python statements.
- **Functions** are declared with the **def** keyword and returned from with the **return** keyword.
- There is no issue with having multiple return statements. 
- If Python reaches the end of a function without encountering a return statement, None is returned automatically.
- Each **function** can have **positional** arguments and **keyword** arguments.
- **Keyword** arguments are most commonly used to specify **default values** or optional arguments.
- The main restriction on **function arguments** is that the **keyword** arguments must follow the **positional** arguments (if any).
- You can specify keyword arguments in any order.
- This way you only need to remember the name of the keyword arguments and not their order.
- It is possible to use keywords for passing positional arguments as well which can help with readability.

In [43]:
# Define a function called my_function
# x and y are positional arguments while z is a keyword argument

def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

In [45]:
# Ways to call the function my_function

my_function(5, 6, z=0.7)
my_function(3.14, 7, 3.5)
my_function(10, 20)
my_function(x=5, y=6, z=7)
my_function(y=6, x=5, z=7)

77
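A notebook cell only displays the value of its last expression, which is why a single result appears above. The sketch below repeats the function definition so it is self-contained and prints every call; the rounding is only there to keep the float output short:

```python
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

# z passed by keyword; z <= 1, so the division branch runs
print(round(my_function(5, 6, z=0.7), 4))    # 0.0636
# z passed positionally
print(round(my_function(3.14, 7, 3.5), 2))   # 35.49
# z omitted, so the default of 1.5 is used
print(my_function(10, 20))                   # 45.0
# Keyword arguments can be given in any order
print(my_function(x=5, y=6, z=7))            # 77
print(my_function(y=6, x=5, z=7))            # 77
```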

### Namespaces, Scope, and Local Functions
- Functions can access variables in two different scopes: **global** and **local**.
- An alternative and more descriptive name describing a **variable scope** in Python is a **namespace**. 
- Any variables that are assigned within a function by default are assigned to the **local namespace**. 
- The **local namespace** is created when the function is called and immediately populated by the function’s arguments. 
- After the function is finished, the **local namespace** is destroyed.
- Assigning variables outside of the function’s scope is possible, but those variables must be declared as global via the **global** keyword.
- The use of the **global** keyword is generally discouraged.
- Typically **global variables** are used to store some kind of state in a system. If you find yourself using a lot of them, it may indicate a need for object-oriented programming (using classes).

In [46]:
# Consider the following function

def func():
    a = []
    for i in range(5):
        a.append(i)

# When func() is called, the empty list a is created, five elements are appended, 
# and then a is destroyed when the function exits.

In [47]:
# If instead we declare a as follows:

a = []
def func():
    for i in range(5):
        a.append(i)
        
# a is no longer destroyed when the function exits

In [50]:
# Assigning variables outside the function’s scope with the global keyword

# Assign None to a
a = None

# Define the function bind_a_variable()
def bind_a_variable():
    global a
    a = []

# Call the function bind_a_variable()
bind_a_variable()

# Print a
print(a)

[]


### Returning Multiple Values
- When a **function** returns multiple values, it is actually returning one object, namely a *tuple*, which is then unpacked into the result variables.

In [51]:
# Example of returning multiple values from a function

# Define the function
def f():
    a = 5
    b = 6
    c = 7
    return a, b, c

# Unpack the multiple returned values
a, b, c = f()

# Print the output
print(a, b, c)

5 6 7


In [52]:
# Capture the function result as a single tuple

return_value = f()
return_value

(5, 6, 7)

In [53]:
# Return multiple values as a dict

# Define the function so that it returns a dict
def f():
    a = 5
    b = 6
    c = 7
    return {'a' : a, 'b' : b, 'c' : c}

# Call the function and show the output
f()

{'a': 5, 'b': 6, 'c': 7}

### Functions Are Objects
- **Functions** should be reusable and generic.
- You can use functions as **arguments** to other functions.

In [54]:
# DATA CLEANING EXERCISE: Suppose we were doing some data cleaning and
# needed to apply a bunch of transformations to a list of strings

# List of strings that we need to clean
states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
          'south carolina##', 'West virginia?']

# To make this list of strings uniform and ready for analysis, we need to:
#     - strip whitespace
#     - remove punctuation symbols
#     - standardize on proper capitalization

In [61]:
# Use built-in string methods along with the re standard library module for regular expressions

# Import the re module
import re

# Define the function
def clean_strings(strings):
    result = []
    for value in strings:
        # Use the string strip() method to remove both leading & trailing whitespace
        value = value.strip()
        # Use regular expressions to remove characters '!', '#', '?'
        value = re.sub('[!#?]', '', value)
        # Use title method so that first character in every word is upper case
        value = value.title()
        # Append the modified string to the result list
        result.append(value)
    return result

In [63]:
# Call function clean_strings()
clean_strings(states)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South Carolina',
 'West Virginia']

In [64]:
# DATA CLEANING EXERCISE: Another approach is to make a list of the operations
# you want to apply to a particular set of strings:

# Define the function remove_punctuation()
def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

# Create a list of transformations to apply to your strings
clean_ops = [str.strip, remove_punctuation, str.title]

# Define the function clean_strings(), which takes 2 arguments: a list of strings and a list of functions
def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

In [65]:
# Call function clean_strings()
clean_strings(states, clean_ops)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South Carolina',
 'West Virginia']
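Functions can also be passed to built-ins that expect a callable. As a small sketch, the built-in `map` applies `remove_punctuation` to every element of `states`; the data and helper are repeated here only so the block runs on its own.

```python
import re

# Same sample data and helper as in the cells above
states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
          'south carolina##', 'West virginia?']


def remove_punctuation(value):
    return re.sub('[!#?]', '', value)


# map applies the function passed as its first argument to every element
cleaned = list(map(remove_punctuation, states))
print(cleaned)
```

Only punctuation is removed here; whitespace and capitalization are untouched because `str.strip` and `str.title` were not applied.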

### Anonymous (Lambda) Functions
- **Lambda functions** = a way of writing functions consisting of a single expression, the result of which is the return value.
- They are defined with the **lambda** keyword, which has no meaning other than “we are declaring an anonymous function”.
        def short_function(x):
            return x * 2
            
        equiv_anon = lambda x: x * 2
- **Lambda functions** are called **anonymous functions** because, unlike functions declared with the **def** keyword, the function object's __name__ attribute is set to a generic placeholder rather than an explicit name.

In [66]:
# Lambda function example

# Define a function called apply_to_list() which returns the result of a list comprehension
# The function takes 2 arguments: a list & a function
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

# Create a list of integers called ints
ints = [4, 0, 1, 5, 6]

# Call the function apply_to_list() & apply a simple lambda function
apply_to_list(ints, lambda x: x * 2)

[8, 0, 2, 10, 12]

In [68]:
# Lambda function example

# Suppose you wanted to sort a collection of strings by the number of distinct letters in each string:

# Create a list of strings
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

# Pass the lambda function to the list's sort method
strings.sort(key=lambda x: len(set(x)))
strings

['aaaa', 'foo', 'abab', 'bar', 'card']
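To make the "anonymous" part concrete, this sketch compares the `__name__` attribute of a function created with `def` against one created with `lambda`. It also uses the built-in `sorted`, which accepts the same `key` argument as `list.sort` but returns a new list.

```python
def short_function(x):
    return x * 2


equiv_anon = lambda x: x * 2

# A def statement records the function's name; a lambda gets a generic one
print(short_function.__name__)
print(equiv_anon.__name__)

# Both objects behave the same when called
print(short_function(21), equiv_anon(21))

# sorted leaves the original list untouched and returns a sorted copy
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']
by_distinct = sorted(strings, key=lambda s: len(set(s)))
print(by_distinct)
```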

### Currying: Partial Argument Application
- **Currying** is computer science jargon (named after the mathematician Haskell Curry) that means deriving new functions from existing ones by partial argument application.
- The built-in **functools** module can simplify this process using the **partial** function.

In [70]:
# Function that adds 2 numbers together
def add_numbers(x, y):
    return x + y

# Using this function, we could derive a new function of one variable, add_five, that adds 5 to its argument
add_five = lambda y: add_numbers(5, y)

# The second argument to add_numbers is said to be curried

In [71]:
# Use the built-in functools module and partial function

from functools import partial
add_five = partial(add_numbers, 5)
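A short usage sketch for `partial`: the derived function is called like any other, and keyword arguments can be fixed as well. The name `parse_binary` is hypothetical and only illustrates pinning the `base` parameter of the built-in `int`.

```python
from functools import partial


def add_numbers(x, y):
    return x + y


# partial fixes the first positional argument (x=5)
add_five = partial(add_numbers, 5)
print(add_five(10))

# Keyword arguments can be fixed too; here int's base parameter is pinned to 2
parse_binary = partial(int, base=2)
print(parse_binary('1010'))
```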

### Generators
- Having a consistent way to iterate over sequences, like objects in a list or lines in a file, is an important Python feature.
- **Iterator protocol** = a generic way to make objects iterable.
- **Iterator** = any object that will yield objects to the Python interpreter when used in a context like a for loop.
- Most methods expecting a list or list-like object will also accept any iterable object. This includes built-in functions such as min, max, and sum, and type constructors like list and tuple.
- **Generator** = a concise way to construct a new iterable object.
- Whereas normal functions execute and return a single result at a time, generators return a sequence of multiple results lazily, pausing after each one until the next one is requested. 
- To create a **generator**, use the **yield** keyword instead of return in a function.
- When you call the **generator**, no code is executed immediately; it is only when you request elements from the generator that it begins executing its code.

In [72]:
# Iterating over a dict yields the dict keys

some_dict = {'a': 1, 'b': 2, 'c': 3}

for key in some_dict:
    print(key)

a
b
c


In [74]:
# When you write for key in some_dict, the Python interpreter first attempts to create 
# an iterator out of some_dict

dict_iterator = iter(some_dict)
dict_iterator

<dict_keyiterator at 0x1e8e44daa98>

In [75]:
# The type constructor list accepts any iterable object

list(dict_iterator)

['a', 'b', 'c']

In [76]:
# Create a generator using the yield keyword

def squares(n=10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2

In [77]:
# When the generator is called, no code is executed immediately

gen = squares()
gen

<generator object squares at 0x000001E8E4B3E6C8>

In [78]:
# Request elements from the generator

for x in gen:
    print(x, end=' ')

Generating squares from 1 to 100
1 4 9 16 25 36 49 64 81 100 
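The lazy behavior can also be observed one step at a time with the built-in `next`. This is a minimal sketch reusing the `squares` generator from above with a smaller `n`.

```python
def squares(n=10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2


gen = squares(3)           # nothing is printed yet

first = next(gen)          # the body runs up to the first yield; the message prints here
second = next(gen)
third = next(gen)

# Once exhausted, the generator raises StopIteration
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True

# Iterating over an exhausted generator produces nothing
leftover = list(gen)
print(first, second, third, exhausted, leftover)
```

A for loop handles `StopIteration` internally, which is why the loop in the earlier cell simply ends after the last value.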

#### Generator expressions
- A more concise way to make a generator is by using a **generator expression**.
- This is a generator analogue to list, dict, and set comprehensions.
- To create one, enclose what would otherwise be a list comprehension within parentheses instead of brackets.

In [79]:
# Generator expression

gen = (x ** 2 for x in range(100))
gen

<generator object <genexpr> at 0x000001E8E4B3E3C8>

In [81]:
# Above generator expression is equivalent to the following function

def _make_gen():
    for x in range(100):
        yield x ** 2
gen = _make_gen()

gen

<generator object _make_gen at 0x000001E8E4B3EAC8>

In [82]:
# Generator expressions can be used instead of list comprehensions as function arguments

dict((i, i **2) for i in range(5))

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
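Two further points are worth a quick sketch: a generator expression passed as the only argument needs no extra parentheses, and a generator can be consumed only once.

```python
# Generator expression passed directly as the sole function argument
total = sum(x ** 2 for x in range(100))
print(total)

# A generator is exhausted after one pass
gen = (x ** 2 for x in range(5))
first_pass = list(gen)
second_pass = list(gen)    # already exhausted, so this is empty
print(first_pass, second_pass)
```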

#### itertools module
- The standard library **itertools** module has a collection of generators for many common data algorithms.

In [83]:
# Import the itertools module
import itertools

# Define a lambda function that takes the first element of a sequence
first_letter = lambda x: x[0]

# Create a list of names
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

# Loop over the names in the list & group them by first letter
for letter, names in itertools.groupby(names, first_letter):
    print(letter, list(names)) # names is a generator

A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
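Note that `groupby` only groups consecutive elements with the same key, which is why 'A' appears twice in the output above. A minimal sketch of the usual workaround is to sort by the same key first; the variable names `sorted_names` and `groups` are illustrative.

```python
import itertools

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
first_letter = lambda x: x[0]

# Sort by the same key first so that equal keys become adjacent
sorted_names = sorted(names, key=first_letter)

groups = {}
for letter, group in itertools.groupby(sorted_names, first_letter):
    groups[letter] = list(group)    # group is a lazy sub-iterator

print(groups)
```

Because `sorted` is stable, names sharing a first letter keep their original relative order.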


#### Some useful itertools functions

| Function                          | Description |
| :---                              |    :----    |
| combinations(iterable, k)         | Generates a sequence of all possible k-tuples of elements in the iterable, ignoring order and without replacement (see also the companion function combinations_with_replacement) |
| permutations(iterable, k)         | Generates a sequence of all possible k-tuples of elements in the iterable, respecting order |
| groupby(iterable[, keyfunc])      | Generates (key, sub-iterator) for each unique key |
| product(*iterables, repeat=1)     | Generates the Cartesian product of the input iterables as tuples, similar to a nested for loop |
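A brief sketch of three of these functions on small inputs. Each returns a lazy iterator, so `list` is used to materialize the results; the variable names are illustrative.

```python
import itertools

letters = ['a', 'b', 'c']

# All 2-element selections, ignoring order
pairs = list(itertools.combinations(letters, 2))

# All 2-element arrangements, respecting order
ordered_pairs = list(itertools.permutations(letters, 2))

# Cartesian product of [0, 1] with itself
grid = list(itertools.product([0, 1], repeat=2))

print(pairs)
print(ordered_pairs)
print(grid)
```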

### Errors and Exception Handling
- Handling Python **errors** or **exceptions** is an important part of building robust programs.
- In data analysis applications, many functions only work on certain kinds of input.
- To handle **errors** or **exceptions** we can use the **try/except** block.
- To suppress a certain exception write the **exception type** after except.
- You can catch multiple exception types by writing a tuple of exception types instead (the parentheses are required).
- In some cases, you may not want to suppress an exception, but you want some code to be executed regardless of whether the code in the try block succeeds or not. To do this, use **finally:**.
- Similarly, you can have code that executes only if the try: block succeeds using **else:**.

In [84]:
# Python’s float function is capable of casting a string to a floating-point number, 
# but fails with ValueError on improper inputs

print(float('1.2345'))
print(float('something'))

1.2345


ValueError: could not convert string to float: 'something'

In [86]:
# Define a function to handle the exceptions from float function
def attempt_float(x):
    try:
        return float(x)
    except:
        return x
    
# The code in the except: part of the block will only be executed if float(x) raises an exception

In [87]:
# Call the function attempt_float()
attempt_float('something')

'something'

In [88]:
# Float can raise exceptions other than ValueError
float((1, 2))

TypeError: float() argument must be a string or a number, not 'tuple'

In [91]:
# To only suppress ValueError, since a TypeError (the input was not a
# string or numeric value) might indicate a legitimate bug in your program
# write the exception type after except

def attempt_float(x):
    try:
        return float(x)
    except ValueError:
        return x
    
attempt_float((1, 2))

TypeError: float() argument must be a string or a number, not 'tuple'

In [92]:
# You can catch multiple exception types by writing a tuple of exception types instead
# (the parentheses are required)

def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x
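A quick check of this last version on three kinds of input: a valid numeric string, an invalid string (ValueError) and a tuple (TypeError).

```python
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x


# Valid input is converted; the other two are returned unchanged
results = [attempt_float(v) for v in ['1.2345', 'something', (1, 2)]]
print(results)
```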

In [93]:
# Use finally: to execute some code regardless of whether the try: block succeeds or not

# Open a file handle
f = open(path, 'w')

try:
    write_to_file(f)
finally:
    f.close()  # the file handle f will always get closed

NameError: name 'path' is not defined

In [94]:
# You can have code that executes only if the try: block succeeds by using else:

f = open(path, 'w')

try:
    write_to_file(f)
except:
    print('Failed')
else:
    print('Succeeded')
finally:
    f.close()

NameError: name 'path' is not defined
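The two cells above stop with a NameError because neither `path` nor `write_to_file` is defined in the notebook. A self-contained sketch of the same try/except/else/finally pattern might look like the following; `write_to_file`, its `fail` flag, `try_write` and the temporary file are all hypothetical stand-ins.

```python
import os
import tempfile


# Hypothetical stand-in for write_to_file; raises an error on request
def write_to_file(handle, fail=False):
    if fail:
        raise OSError('simulated write failure')
    handle.write('some text\n')


def try_write(path, fail=False):
    f = open(path, 'w')
    try:
        write_to_file(f, fail=fail)
    except OSError:
        status = 'Failed'
    else:
        status = 'Succeeded'     # runs only if the try block raised nothing
    finally:
        f.close()                # runs in both cases
    return status, f.closed


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')
    ok = try_write(path)
    bad = try_write(path, fail=True)

print(ok, bad)
```

In both calls the handle ends up closed, which is the point of placing `f.close()` in the `finally` clause.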

## Files and the Operating System
- To open a file for reading or writing, use the built-in **open** function with either a relative or absolute file path.
- By default, the file is opened in read-only mode 'r'.
- We can then treat the **file handle** f like a list and iterate over the lines like:
        for line in f:
            pass
- The lines come out of the file with the end-of-line (EOL) markers intact, so you’ll often see code to get an EOL-free list of lines from a file.
- When you use **open** to create file objects, it is important to explicitly **close** the file when you are finished with it. 
- Closing the file releases its resources back to the operating system.
- One of the ways to make it easier to clean up open files is to use the **with** statement.
- Using **f = open(path, 'w')** would overwrite our file at *examples/segismundo.txt*. 
- There is also the 'x' file mode, which creates a writable file but fails if the file path already exists.

In [1]:
# Define the file path
path = 'examples/segismundo.txt'

# Read the file using the open function
f = open(path)

In [4]:
# Get rid of EOL markers from file f & create a list of lines
# Pass encoding='utf-8' explicitly: otherwise open() falls back to the platform's
# default encoding, which can garble accented characters
lines = [x.rstrip() for x in open(path, encoding='utf-8')]
lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']

In [5]:
# Close the file
f.close()

In [6]:
# Use with statement to open a file
with open(path, encoding='utf-8') as f:
    lines = [x.rstrip() for x in f]
    
# This will automatically close the file f when exiting the with block

#### Readable files
- For readable files, some of the most commonly used methods are **read**, **seek**, and **tell**.
- **read** = returns a certain number of characters from the file.
- The **read** method advances the file handle’s position by the number of bytes read.
- What constitutes a “character” is determined by the file’s encoding (e.g., UTF-8) or simply raw bytes if the file is opened in binary mode.
- **tell** = gives you the current position.
- **seek** = changes the file position to the indicated byte in the file.

In [24]:
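# Open the file handle in text mode, decoding the bytes as UTF-8
f = open(path, encoding='utf-8')

# Read 10 characters from the file
f.read(10)

'Sueña el r'

In [25]:
# Open the file handle in binary mode
f2 = open(path, 'rb')

# Read 10 bytes
f2.read(10)

b'Sue\xc3\xb1a el '

In [26]:
# Use the tell method to check the position: the 10 characters took 11 bytes,
# because 'ñ' is encoded as two bytes in UTF-8
f.tell()

11

In [27]:
# Use the tell method to check the position of the binary handle
f2.tell()

10

In [28]:
# You can check Python's default string encoding in the sys module
# (by default, open() in text mode takes its encoding from the locale settings
# instead, unless encoding= is passed)
import sys
sys.getdefaultencoding()

'utf-8'

In [29]:
# Use seek to move to the indicated byte in the file
f.seek(3)
f.read(1)

'ñ'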

In [30]:
# Close the file
f.close()
f2.close()

#### Python file modes

| Mode                  | Description |
| :---                  |    :----    |
| r                     | Read-only mode| 
| w                     | Write-only mode; creates a new file (erasing the data for any file with the same name)|
| x            | Write-only mode; creates a new file, but fails if the file path already exists|
| a   | Append to existing file (create the file if it does not already exist)|
| r+               | Read and write        |
| b               | Add to mode for binary files (i.e., 'rb' or 'wb')        |
| t             | Text mode for files (automatically decoding bytes to Unicode). This is the default if not specified. Add t to other modes to use this (i.e., 'rt' or 'xt')        |
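The difference between the 'x', 'a' and 'w' modes can be sketched on a temporary file. The path `modes.txt` is a hypothetical scratch file created only for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes.txt')    # hypothetical scratch file

    # 'x' creates the file; this succeeds because it does not exist yet
    with open(path, 'x') as f:
        f.write('first line\n')

    # A second 'x' on the same path fails instead of overwriting
    try:
        open(path, 'x')
        x_failed = False
    except FileExistsError:
        x_failed = True

    # 'a' appends to the existing content
    with open(path, 'a') as f:
        f.write('second line\n')
    with open(path) as f:
        after_append = f.read()

    # 'w' erases the existing content before writing
    with open(path, 'w') as f:
        f.write('replaced\n')
    with open(path) as f:
        after_write = f.read()

print(x_failed)
print(repr(after_append))
print(repr(after_write))
```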

#### Writing to a file
- To write text to a file, you can use the file’s **write** or **writelines** methods.

In [31]:
# Write the non-blank lines of the source file to a file called 'tmp.txt'
# (both files are opened as UTF-8 and closed automatically by the with statement)
with open(path, encoding='utf-8') as source, open('tmp.txt', 'w', encoding='utf-8') as handle:
    handle.writelines(x for x in source if len(x) > 1)

In [37]:
# Check the lines of the file you created above
with open('tmp.txt', encoding='utf-8') as f:
    lines = f.readlines()
    
lines

['Sueña el rico en su riqueza,\n',
 'que más cuidados le ofrece;\n',
 'sueña el pobre que padece\n',
 'su miseria y su pobreza;\n',
 'sueña el que a medrar empieza,\n',
 'sueña el que afana y pretende,\n',
 'sueña el que agravia y ofende,\n',
 'y en el mundo, en conclusión,\n',
 'todos sueñan lo que son,\n',
 'aunque ninguno lo entiende.\n']
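For comparison, here is a small sketch that uses both **write** and **writelines** on a temporary file. The file name `out.txt` and the sample lines are made up for the example.

```python
import os
import tempfile

lines = ['first\n', '\n', 'second\n']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'out.txt')    # hypothetical scratch file

    with open(path, 'w') as handle:
        # write takes one string and returns the number of characters written
        n_written = handle.write('header\n')
        # writelines takes an iterable of strings and adds no newlines itself
        handle.writelines(x for x in lines if len(x) > 1)

    with open(path) as f:
        contents = f.readlines()

print(n_written)
print(contents)
```

Since `writelines` does not append line endings, each string must already end with '\n' if separate lines are wanted.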

#### Important Python file methods or attributes

| Syntax                | Description |
| :---                  |    :----    |
| read([size])          | Return data from file as a string, with optional size argument indicating the number of bytes to read |
| readlines([size])     | Return list of lines in the file, with optional size argument        |
| write(str)            | Write passed string to file        |
| writelines(strings)   | Write passed sequence of strings to the file        |
| close()               | Close the handle        |
| flush()               | Flush the internal I/O buffer to disk        |
| seek(pos)             | Move to indicated file position (integer)        |
| tell()                | Return current file position as integer        |
| closed                | True if the file is closed        |

#### Bytes and Unicode with Files
- The default behavior for Python files is text mode, which means that you intend to work with Python strings (i.e., **Unicode**).
- This contrasts with binary mode, which you can obtain by appending b onto the file mode.
- UTF-8 is a variable-length Unicode encoding.
- Depending on the text encoding, you may be able to **decode** the bytes to a str object.
- Text mode, combined with the **encoding** option of open, provides a convenient way to convert from one Unicode encoding to another.
- Beware using **seek** when opening files in any mode other than binary. 
- If the file position falls in the middle of the bytes defining a Unicode character, then subsequent reads will result in an error.

In [41]:
# Open the file as bytes
with open(path, 'rb') as f:
    data = f.read(10)
    
data

b'Sue\xc3\xb1a el '

In [42]:
# Use decode to create a str object
data.decode('utf8')

'Sueña el '
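The last three points of this section can be sketched together: converting a file from UTF-8 to another encoding through text mode, and the decoding error triggered by seeking into the middle of a multi-byte character. The file names `source.txt` and `sink.txt` and the choice of ISO-8859-1 as the target encoding are assumptions made for the example.

```python
import os
import tempfile

text = 'Sueña el rico en su riqueza,\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, 'source.txt')    # hypothetical scratch files
    sink = os.path.join(tmp_dir, 'sink.txt')

    # newline='' turns off newline translation so byte counts match on every platform
    with open(src, 'w', encoding='utf-8', newline='') as f:
        f.write(text)

    # Convert by reading with one encoding and writing with another
    with open(src, encoding='utf-8', newline='') as f_in, \
            open(sink, 'w', encoding='iso-8859-1', newline='') as f_out:
        f_out.write(f_in.read())

    with open(src, 'rb') as f:
        utf8_bytes = f.read()
    with open(sink, 'rb') as f:
        latin1_bytes = f.read()

    # Byte 4 is the second byte of 'ñ' in UTF-8, so decoding from there fails
    with open(src, encoding='utf-8') as f:
        f.seek(4)
        try:
            f.read(1)
            seek_error = False
        except UnicodeDecodeError:
            seek_error = True

print(utf8_bytes)
print(latin1_bytes)
print(seek_error)
```

The UTF-8 file is one byte longer because 'ñ' takes two bytes there and a single byte in ISO-8859-1.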