Beata Sirowy
# __Data structures: examples__
Based on McKinney, W. (2022) _Python for Data Analysis_

## Tuples

In [None]:
tup = 3, 5, 7
print(tup)
print(tup[2])

(3, 5, 7)
7


Converting a sequence to a tuple:

In [16]:
n = [5, 78, 56]
y = tuple(n)
print(y)


(5, 78, 56)


In [None]:
n = "Hello"
tuple(n)

('H', 'e', 'l', 'l', 'o')

In [20]:
tuple("Hello")[1]

'e'

A nested tuple:

In [21]:
x = (4,5,6), (7,8,9)
x

((4, 5, 6), (7, 8, 9))

In [22]:
x[1]

(7, 8, 9)

If an object inside a tuple is mutable, such as a list, you can modify it in place:

In [31]:
tup1 = 2, [4, 6], 8
tup1[1].append(7)
tup1


(2, [4, 6, 7], 8)

You can concatenate tuples using the + operator, and repeat them by multiplying with an integer using *:

In [32]:
("bim", "bom")*5

('bim', 'bom', 'bim', 'bom', 'bim', 'bom', 'bim', 'bom', 'bim', 'bom')

In [35]:
("bim", "bom") + (1, 3, 5) + (2, [4, 6, 3])


('bim', 'bom', 1, 3, 5, 2, [4, 6, 3])

Unpacking an expression / swapping variable names:

In [5]:
tup = 1,2,3
a,b,c = tup
b

2

In [7]:
a,b = 1,3
print(b)
b,a = a,b
print(b)

3
1


Iterating over sequences of tuples or lists:

In [None]:
seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
for x, y, z in seq:
    print(f'x={x}, y={y}, z={z}')

x=1, y=2, z=3
x=4, y=5, z=6
x=7, y=8, z=9


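### Unpacking with *
The star syntax (e.g. *rest or *_) can be used to "pluck" a few elements from the beginning of a tuple and collect the remaining ones in a list. The same syntax is used in function signatures to capture an arbitrarily long list of positional arguments. By convention, the name _ marks values you do not intend to use: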

In [10]:
n = 1,2,3,4,5,6
a, b, *_ = n
_

[3, 4, 5, 6]
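As a small sketch of both uses of the star syntax (the names first, rest, middle, and total are illustrative and not part of the original notebook):

```python
# Star-unpacking: capture "everything else" in a list
values = (1, 2, 3, 4, 5, 6)
first, second, *rest = values
print(first, second, rest)   # 1 2 [3, 4, 5, 6]

# The starred name can also sit in the middle
head, *middle, tail = values
print(middle)                # [2, 3, 4, 5]

# In a function signature, *args collects extra positional arguments in a tuple
def total(*args):
    return sum(args)

print(total(1, 2, 3))        # 6
```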

Very few methods exist for tuples. A useful one (also available on lists) is count, which counts the number of occurrences of a value:

In [11]:
a = (1, 2, 2, 2, 3, 4, 2)
a.count(2)

4

## Lists

In [12]:
y = list(range(10))
y

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [15]:
y.append("horse")
y

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'horse']

Insert is computationally expensive compared with append, because references to subsequent elements have to be shifted internally to make room for the new element:

In [35]:
y.insert(2, "pig")
y

[0, 1, 'pig', 2, 3, 4, 5, 6, 7, 8, 9, 'horse']
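If you often need to insert at the front of a sequence, the standard library's collections.deque is one option, since it supports fast appends at both ends. A minimal sketch (the variable name and data are illustrative):

```python
from collections import deque

animals = deque(["cow", "horse"])
animals.appendleft("pig")    # add at the front without shifting elements
animals.append("goat")       # add at the end, like list.append
print(list(animals))         # ['pig', 'cow', 'horse', 'goat']
```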

In [None]:
y.pop(2)

'pig'

In [51]:
y

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'horse']

In [52]:
"horse" in y

True

In [53]:
"pig" in y

False

In [None]:
y.reverse()  # reverses the list in place and returns None
print(y)

['horse', 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

In [None]:
y.pop(0)
y.sort()
y

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

We can pass a secondary sort key—that is, a function that produces a value to use to sort the objects. For example, we could sort a collection of strings by their lengths:

In [66]:
b = ["saw", "small", "He", "foxes", "six"]
b.sort(key=len)
b

['He', 'saw', 'six', 'small', 'foxes']
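A key function can also return a tuple, which gives a way to break ties. The sketch below (with an illustrative list) sorts by length first and alphabetically, ignoring case, second. It uses sorted rather than sort, so the original list is left untouched:

```python
words = ["saw", "small", "He", "foxes", "six"]

# Sort by length, then case-insensitively by the word itself
by_len_then_alpha = sorted(words, key=lambda w: (len(w), w.lower()))
print(by_len_then_alpha)   # ['He', 'saw', 'six', 'foxes', 'small']

# reverse=True flips the order: longest words first
longest_first = sorted(words, key=len, reverse=True)
print(longest_first[-1])   # He
```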

### Slicing

In [67]:
seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq[1:4]

[2, 3, 7]

In [68]:
seq[:5]

[7, 2, 3, 7, 5]

In [70]:
seq[5:]

[6, 0, 1]

In [73]:
seq[0:3] = [1,2,11]
seq

[1, 2, 11, 7, 5, 6, 0, 1]

Negative indices slice the sequence relative to the end:

In [82]:
seq[-2:]

[0, 1]

In [83]:
seq[-5:-1]

[7, 5, 6, 0]

![image.png](attachment:image.png)

A step can also be used after a second colon to take, e.g., every other element:

In [88]:
seq[::2]

[1, 11, 5, 0]

Passing a step of -1 has the useful effect of reversing a list or tuple:

In [89]:
seq[::-1]

[1, 0, 6, 5, 7, 11, 2, 1]
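A few properties of slicing that are easy to check, sketched with an illustrative list: the start index is included and the stop index is not, so a slice has stop - start elements; slicing returns a new list; and slice assignment may change the length of the list.

```python
nums = [7, 2, 3, 7, 5, 6, 0, 1]

part = nums[1:4]             # elements at positions 1, 2 and 3
print(len(part))             # 3, i.e. stop - start

copy = nums[:]               # a full slice makes a shallow copy
copy[0] = 99
print(nums[0])               # 7, the original is unchanged

nums[3:5] = [6]              # replace two elements with one
print(nums)                  # [7, 2, 3, 6, 6, 0, 1]
```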

## Dictionary

In other programming languages, dictionaries are sometimes called hash maps or associative arrays. A dictionary stores a collection of key-value pairs, where key and value are Python objects.

Creating dictionaries from sequences:

In [95]:
mapping = {}
for key, value in zip([1,2,3], ["ann", "bob", "charlie"]):
    mapping[key] = value
mapping

{1: 'ann', 2: 'bob', 3: 'charlie'}

In [96]:
list(mapping.items())

[(1, 'ann'), (2, 'bob'), (3, 'charlie')]

In [100]:
mapping.update({3:"david"})
mapping

{1: 'ann', 2: 'bob', 3: 'david'}

In [101]:
mapping.update({5:"joe"})
mapping

{1: 'ann', 2: 'bob', 3: 'david', 5: 'joe'}

In [103]:
2 in mapping

True

In [105]:
mapping["bob"] = "barbara"
mapping

{1: 'ann', 2: 'bob', 3: 'david', 5: 'joe', 'bob': 'barbara'}

In [106]:
mapping["barbara"] = 6
mapping

{1: 'ann', 2: 'bob', 3: 'david', 5: 'joe', 'bob': 'barbara', 'barbara': 6}

In [108]:
mapping["john"] = 12
mapping

{1: 'ann',
 2: 'bob',
 3: 'david',
 5: 'joe',
 'bob': 'barbara',
 'barbara': 6,
 'john': 12}

In [110]:
mapping[5]= "ellie"
mapping

{1: 'ann',
 2: 'bob',
 3: 'david',
 5: 'ellie',
 'bob': 'barbara',
 'barbara': 6,
 'john': 12}

In [None]:
del mapping["barbara"]
mapping

{1: 'ann', 2: 'bob', 3: 'david', 5: 'ellie', 'bob': 'barbara', 'john': 12}

In [112]:
del mapping[2]
mapping

{1: 'ann', 3: 'david', 5: 'ellie', 'bob': 'barbara', 'john': 12}

In [113]:
list(mapping.items())

[(1, 'ann'), (3, 'david'), (5, 'ellie'), ('bob', 'barbara'), ('john', 12)]

In [114]:
mapping.update({2:"brenda", 12:"jim"})
mapping

{1: 'ann',
 3: 'david',
 5: 'ellie',
 'bob': 'barbara',
 'john': 12,
 2: 'brenda',
 12: 'jim'}

In [117]:
mapping.pop(2)
mapping

{1: 'ann', 3: 'david', 5: 'ellie', 'bob': 'barbara', 'john': 12, 12: 'jim'}

In [119]:
mapping.pop("bob")
mapping

{1: 'ann', 3: 'david', 5: 'ellie', 'john': 12, 12: 'jim'}

In [120]:
mapping.get(3)

'david'
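Both get and pop accept a default value to return when the key is absent. Without a default, get returns None, while pop (like indexing with brackets) raises KeyError. A short sketch with an illustrative dictionary:

```python
ages = {"ann": 31, "bob": 25}

print(ages.get("ann"))           # 31
print(ages.get("zoe"))           # None, no KeyError is raised
print(ages.get("zoe", 0))        # 0, an explicit default

removed = ages.pop("bob")        # pop returns the removed value
print(removed, ages)             # 25 {'ann': 31}

missing = ages.pop("zoe", None)  # a default avoids KeyError for absent keys
print(missing)                   # None
```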

Categorizing a list of words by their first letters as a dictionary of lists:

In [150]:
words = ["apple", "bat", "bar", "atom", "book"]
by_letter = {}
 
for n in words:
     letter = n[0]
     if letter not in by_letter:
         by_letter[letter] = [n]
     else:
         by_letter[letter].append(n)
by_letter

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

The setdefault dictionary method can be used to simplify this workflow:

In [151]:
words = ["apple", "bat", "bar", "atom", "book"]
by_letter = {}
 
for n in words:
     letter = n[0]
     by_letter.setdefault(letter, []).append(n)
by_letter

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}
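Another option from the standard library is collections.defaultdict, which creates the default value automatically the first time a key is accessed. A sketch of the same grouping:

```python
from collections import defaultdict

words = ["apple", "bat", "bar", "atom", "book"]
by_letter = defaultdict(list)   # missing keys start out as an empty list

for word in words:
    by_letter[word[0]].append(word)

print(dict(by_letter))   # {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}
```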

### Valid dictionary key types

While dictionary values in Python can be any object, the keys must generally be immutable objects, such as scalar types (int, float, string) or tuples (where all elements in the tuple are also immutable). This property is known as hashability. To determine if an object is hashable (and thus can be used as a dictionary key), you can use the hash function:

In [152]:
hash("hello john")

-6031759716228369289

In [154]:
hash((1,3,4))

-1070212610609621906

In [155]:
hash([1,5,5])

TypeError: unhashable type: 'list'

In [181]:
f = list(range(50, 55))

for n in f:
    print(f"Hashed {n} is: ")
    print(hash(str(n)))

Hashed 50 is: 
-5782350072571480311
Hashed 51 is: 
313275033364399876
Hashed 52 is: 
-5767693949663812336
Hashed 53 is: 
-5400422504188886393
Hashed 54 is: 
8312797451441545071


In [177]:
hash(str(4123))

1226802963314174934
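One way to wrap this check in a helper is sketched below; is_hashable is a hypothetical name, not a built-in. Note that a tuple is hashable only if everything inside it is. Also, hash values of strings generally differ between Python sessions, so the numbers printed above will usually not be reproduced exactly.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key or set element."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable("hello john"))     # True
print(is_hashable((1, 3, 4)))        # True
print(is_hashable([1, 5, 5]))        # False
print(is_hashable((1, [2, 3])))      # False, the tuple contains a list
```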

## Set

A set is an unordered collection of unique elements. A set can be created either via the set function or via a set literal with curly braces:

In [1]:
set([3, 5, 6 ,7, 7, 2, 2, 1, 5, 3])

{1, 2, 3, 5, 6, 7}

In [3]:
{3, 5, 6 ,7, 7, 2, 2, 1, 5, 3}

{1, 2, 3, 5, 6, 7}

Sets support mathematical set operations like union, intersection, difference, and symmetric difference. 

Union:

In [25]:
a = {1,2,3}
b = {3,6,7,8}

a.union(b)

{1, 2, 3, 6, 7, 8}

In [26]:
a | b

{1, 2, 3, 6, 7, 8}

Intersection:

In [10]:
a.intersection(b)

{3}

In [11]:
a & b

{3}

Difference:

In [12]:
a.difference(b)

{1, 2}

In [13]:
b.difference(a)

{6, 7, 8}

In [14]:
a.symmetric_difference(b)

{1, 2, 6, 7, 8}

![image.png](attachment:image.png)

![image.png](attachment:image.png)

If you pass an input that is not a set to methods like union and intersection, Python will convert the input to a set before executing the operation. When using the binary operators, both objects must already be sets.
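The difference between the methods and the operators can be checked directly. A minimal sketch with illustrative values:

```python
a = {1, 2, 3}

# The methods accept any iterable, here a list
print(a.union([3, 6, 7]) == {1, 2, 3, 6, 7})   # True
print(a.intersection([3, 6, 7]) == {3})        # True

# The operators require sets on both sides
try:
    a | [3, 6, 7]
except TypeError as exc:
    print("TypeError:", exc)

# Converting explicitly makes the operator work
print((a | set([3, 6, 7])) == {1, 2, 3, 6, 7})  # True
```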

Like dictionary keys, set elements generally must be immutable, and they must be hashable (which means that calling hash on a value does not raise an exception). To store list-like elements (or other mutable sequences) in a set, we can convert them to tuples:

In [19]:
data = [1, 2, 3, 4]

my_set = set(data)

my_set

{1, 2, 3, 4}

In [21]:
data = [1, 2, 3, 4]
my_set = {tuple(data)}
my_set

{(1, 2, 3, 4)}

We can check if a set is a subset of (is contained in) or a superset of (contains all elements of) another set:

In [24]:
a_set = {1, 2, 5, 7}
{1, 2}.issubset(a_set)

True

## Built-in sequence functions

The __sorted()__ function returns a new sorted list from the elements of any sequence:

In [27]:
sorted("Hello Python")

[' ', 'H', 'P', 'e', 'h', 'l', 'l', 'n', 'o', 'o', 't', 'y']

In [29]:
sorted([7, 1, 2, 6, 0, 3, 2])

[0, 1, 2, 2, 3, 6, 7]

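__zip()__ “pairs” up the elements of a number of lists, tuples, or other sequences. It returns an iterator of tuples, wrapped in list() below to display the result, and it stops as soon as the shortest input is exhausted: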

In [33]:
a = ["Ann", "John", "Bob"]
b = ["NYC", "SF", "LA"]
y = zip(a, b)
list(y)

[('Ann', 'NYC'), ('John', 'SF'), ('Bob', 'LA')]

In [35]:
a = ["Ann", "John", "Bob"]
b = ["NYC", "SF", "LA"]
c = ["car", "bike"]
y = zip(a, b, c)
list(y)

[('Ann', 'NYC', 'car'), ('John', 'SF', 'bike')]
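zip can also be used to "unzip" a list of pairs back into separate sequences, by combining it with the star syntax. A short sketch with illustrative data:

```python
pairs = [("Ann", "NYC"), ("John", "SF"), ("Bob", "LA")]

# zip(*pairs) passes each pair as a separate argument to zip
names, cities = zip(*pairs)
print(names)    # ('Ann', 'John', 'Bob')
print(cities)   # ('NYC', 'SF', 'LA')
```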

__enumerate()__ returns a sequence of (i, value) tuples:

In [38]:
for n, (x, y) in enumerate(zip(a, b)):
    print(f"{n}: {x}, {y}")

0: Ann, NYC
1: John, SF
2: Bob, LA


A basic approach to iterating over a sequence while keeping track of the index is:


In [None]:
collection = ["Ann", "John", "Bob"]  # example data
index = 0
for value in collection:
    print(index, value)  # do something with value
    index += 1

The above is equivalent to:

In [None]:
collection = ["Ann", "John", "Bob"]  # example data
for index, value in enumerate(collection):
    print(index, value)  # do something with value

In [40]:
for n, a in enumerate(["Ann", "John", "Bob"]):
    print(n, a)

0 Ann
1 John
2 Bob


In [41]:
for n, a in enumerate(["Ann", "John", "Bob"]):
    print(n+1, a)

1 Ann
2 John
3 Bob


__reversed()__ iterates over the elements of a sequence in reverse order:

In [49]:
a = list(range(10))
print(a)
b = list(reversed(range(1, 11)))
print(b)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]


## List, Set and Dictionary Comprehensions

__List comprehensions__ allow you to concisely form a new list by filtering the elements of a collection and transforming the elements that pass the filter, all in one expression. A comprehension is equivalent to the following loop:

In [None]:
result = []
for value in collection:
    if condition:
        result.append(expr)

Comprehensions take the basic form:

In [None]:
[expr for value in collection if condition]


Given a list of strings, we could for example filter out strings with length 2 or less and convert the remaining ones to uppercase like this:

In [53]:
strings = ["a", "as", "ally", "bat", "car", "dove", "python"]
[x.upper() for x in strings if len(x) > 2]



['ALLY', 'BAT', 'CAR', 'DOVE', 'PYTHON']

__A dictionary and a set comprehension:__

In [None]:
dict_comp = {key-expr: value-expr for value in collection
             if condition}

In [None]:
set_comp = {expr for value in collection if condition}

In [55]:
strings = ["a", "as", "ally", "bat", "car", "dove", "python"]
unique_lengths = {len(x) for x in strings}
unique_lengths


{1, 2, 3, 4, 6}

Alternatively, we can express this using the __map()__ function:

In [56]:
set(map(len, strings))

{1, 2, 3, 4, 6}

As a dictionary comprehension example, we could create a lookup map from each position in the list to the string stored there:

In [65]:
strings = ["a", "as", "ally", "bat", "car", "dove", "python"]
loc_mapping = {key: value for key, value in enumerate(strings)}
loc_mapping

{0: 'a', 1: 'as', 2: 'ally', 3: 'bat', 4: 'car', 5: 'dove', 6: 'python'}

Starting key numbers from 1 instead of 0: 

In [66]:
strings = ["a", "as", "ally", "bat", "car", "dove", "python"]
loc_mapping = {key + 1: value for key, value in enumerate(strings)}
loc_mapping

{1: 'a', 2: 'as', 3: 'ally', 4: 'bat', 5: 'car', 6: 'dove', 7: 'python'}
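Two variations that may be useful, sketched with illustrative names: enumerate accepts a start argument, which avoids adding 1 by hand, and swapping key and value in the comprehension gives the inverse lookup from each string to its position.

```python
strings = ["a", "as", "ally", "bat", "car", "dove", "python"]

# enumerate's start argument shifts the first index
numbered = {i: s for i, s in enumerate(strings, start=1)}
print(numbered[1], numbered[7])    # a python

# Inverse lookup: from each string to its position in the list
position_of = {s: i for i, s in enumerate(strings)}
print(position_of["bat"])          # 3
```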

__Nested comprehensions:__

 We want to get a single list containing all names with two or more a’s in them. We could do this with a simple for loop:

In [68]:
all_data = [["John", "Emily", "Michael", "Mary", "Steven"], ["Maria", "Juan", "Javier", "Natalia", "Pilar"]]

names_of_interest = []

for names in all_data:
    enough_a = [x for x in names if x.count("a") >= 2]
    names_of_interest.extend(enough_a)

names_of_interest


['Maria', 'Natalia']

__A nested list comprehension__ (the for parts of the list comprehension are arranged according to the order of nesting, and any filter condition is put at the end as before):

In [73]:
result = [x for names in all_data 
          for x in names if x.count("a") >= 2]
result

['Maria', 'Natalia']

Here is another example where we “flatten” a list of tuples of integers into a simple list of integers:

In [75]:
tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

integers_list = [x for tup in tuples for x in tup]
integers_list

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [76]:
tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

integers_list = [x for tup in tuples for x in tup if x <4]
integers_list

[1, 2, 3]

The order of the for expressions would be the same if you wrote a nested for loop instead of a list comprehension:

In [80]:
integers = []

for tup in tuples:
    for x in tup:
        integers.append(x)
integers
    

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [81]:
integers = []

for tup in tuples:
    for x in tup:
        if x < 4: 
            integers.append(x)
integers
    

[1, 2, 3]

It’s important to distinguish the syntax just shown from a list comprehension inside a list comprehension:

In [83]:
[[x for x in tup] for tup in tuples]

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

## Functions

If Python reaches the end of a function without encountering a return statement, None is returned automatically. 

In [84]:
def my_function(x):
    print(x)

my_function("Hello")



Hello


In [87]:
result = my_function("Hello")
print(result)

Hello
None


In [109]:
h = []

def function1():
    for i in range(1, 11):
        h.append(i)
    return h
function1()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

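Each call to function1() appends to the same list h, so h keeps growing. The output below comes from a session in which the function had already been called several times: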

In [113]:
function1()
function1()
print(h)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
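If this accumulation is not wanted, one option is to create the list inside the function, so that every call starts fresh. A sketch (make_numbers is an illustrative name):

```python
def make_numbers():
    numbers = []                 # a new local list on every call
    for i in range(1, 11):
        numbers.append(i)
    return numbers

first = make_numbers()
second = make_numbers()
print(len(first), len(second))   # 10 10
print(first is second)           # False, two independent lists
```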


Assigning variables outside of the function's scope is possible, but those variables must be declared explicitly using either the global or nonlocal keywords:

In [127]:
m = None

def function3():
    global m
    m = []
    for i in range(1, 5):
        m.append(i)
    return m
function3()

[1, 2, 3, 4]
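The nonlocal keyword plays the same role for a variable that lives in an enclosing function rather than at module level. A minimal sketch, with make_counter as a hypothetical example:

```python
def make_counter():
    count = 0

    def increment():
        nonlocal count        # rebind the variable of the enclosing function
        count += 1
        return count

    return increment

counter = make_counter()
print(counter())   # 1
print(counter())   # 2
```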

 Typically, global variables are used to store some kind of state in a system. If you find yourself using a lot of them, it may indicate a need for object-oriented programming (using classes).

### Returning Multiple Values

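A function can return several values at once. It is actually returning just one object, a tuple, which can then be unpacked into result variables. Returning a dictionary is an alternative when the values should carry names: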

In [135]:
def f():
    a = 1
    b = 3
    c = 2
    return a, b, c
f()

(1, 3, 2)

In [142]:
def f():
    a = 1
    b = 3
    c = 2
    return {"a": a, "b": b, "c": c }

f()


{'a': 1, 'b': 3, 'c': 2}
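The unpacking step can be seen by assigning the result of the tuple-returning version of f in two ways (a short sketch):

```python
def f():
    a = 1
    b = 3
    c = 2
    return a, b, c          # one tuple object

result = f()
print(type(result))         # <class 'tuple'>

a, b, c = f()               # unpack into separate names
print(a, b, c)              # 1 3 2
```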

### Functions are objects; regular expressions
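re.sub() replaces matching substrings with a new string (which can be empty), either for all occurrences or for a specified number of them, e.g. re.sub("[!#?]", "", value). Inside a character class such as [!#?], the "?" is a literal question mark; it marks non-greedy matching only when it follows a quantifier (as in ".*?").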

In [147]:
import re

states = ["   Alabama ", "Georgia!", "Georgia", "georgia", "FlOrIda", "south   carolina##", "West virginia?"]


def clean_strings(states):
    result = []
    for value in states:
        value = value.strip()
        value = re.sub("[!#?]", "", value)
        value = value.title()
        result.append(value)
    return result

clean_strings(states)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South   Carolina',
 'West Virginia']
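Because functions are objects, the cleaning steps can themselves be stored in a list and applied in turn. The sketch below follows that idea; remove_punctuation, clean_ops, and apply_ops are illustrative names:

```python
import re

def remove_punctuation(value):
    return re.sub("[!#?]", "", value)

# The cleaning steps, stored as ordinary objects in a list
clean_ops = [str.strip, remove_punctuation, str.title]

def apply_ops(strings, ops):
    result = []
    for value in strings:
        for func in ops:          # apply each function in turn
            value = func(value)
        result.append(value)
    return result

states = ["   Alabama ", "Georgia!", "FlOrIda", "West virginia?"]
print(apply_ops(states, clean_ops))
# ['Alabama', 'Georgia', 'Florida', 'West Virginia']
```

Adding or reordering a step then only requires changing the list, not the loop.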

### Lambda functions

So-called anonymous or lambda functions are a way of writing functions consisting of a single expression, the result of which is the return value. They are defined with the lambda keyword, which has no meaning other than “we are declaring an anonymous function”:

In [152]:
def short_function(x):
    return x * 2

short_function(5)

10

An alternative expression:

In [None]:

y = lambda x: x*2

Lambda functions are convenient in data analysis because there are many cases where data transformation functions will take functions as arguments. 

In [157]:
def funct(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]

funct(ints, lambda x: x * 2)

[8, 0, 2, 10, 12]

We want to sort a collection of strings by the number of distinct letters in each string:

In [159]:
strings = ["Fanny", "Carol", "Bob", "Anna", "Philip", "Jo"]
strings.sort(key=lambda x: len(set(x)))
strings

['Jo', 'Bob', 'Anna', 'Fanny', 'Carol', 'Philip']

### Generators

An iterator is any object that will yield objects to the Python interpreter when used in a context like a for loop. Most methods expecting a list or list-like object will also accept any iterable object. This includes built-in methods such as min, max, and sum, and type constructors like list and tuple:

In [172]:
some_dict = {"a": 1, "b": 2, "c": 3}

for key in some_dict:
    print(key)

a
b
c


In [162]:
dict_iterator = iter(some_dict)
list(dict_iterator)

['a', 'b', 'c']
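What a for loop does with an iterator can be imitated by calling next by hand until StopIteration is raised. A small sketch:

```python
some_dict = {"a": 1, "b": 2, "c": 3}
it = iter(some_dict)

print(next(it))    # a
print(next(it))    # b
print(next(it))    # c

try:
    next(it)       # nothing left to produce
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)   # True

# Functions such as sum and max accept any iterable
print(sum(some_dict.values()), max(some_dict))   # 6 c
```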

A generator is a convenient way, similar to writing a normal function, to construct a new iterable object. Whereas normal functions execute and return a single result at a time, generators can return a sequence of multiple values by pausing and resuming execution each time the generator is used. To create a generator, use the yield keyword instead of return in a function:

In [173]:
def squares(n=10):
    print(f"Generating squares from 1 to {n ** 2}")
    for i in range(1, n + 1):
        yield i ** 2 
gen = squares()
gen

<generator object squares at 0x0000023D0AC89A80>

In [174]:
for x in gen:
    print(x)

Generating squares from 1 to 100
1
4
9
16
25
36
49
64
81
100


Another way to make a generator is by using a generator expression. This is a generator analogue to list, dictionary, and set comprehensions. To create one, enclose what would otherwise be a list comprehension within parentheses instead of brackets:

In [178]:
gen2 = (x ** 2 for x in range(5))
gen2

<generator object <genexpr> at 0x0000023D0ADA0D40>

In [179]:
for n in gen2:
    print(n)

0
1
4
9
16


Generator expressions can be used instead of list comprehensions as function arguments in some cases:

In [185]:
sum(x for x in range(10001))

50005000

In [187]:
dict((i, i*2) for i in range (2,6))

{2: 4, 3: 6, 4: 8, 5: 10}

## Itertools module

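The standard library itertools module has a collection of generators for many common data algorithms. For example, groupby takes any sequence and a function, and groups consecutive elements in the sequence by the return value of the function: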

In [188]:
import itertools

def first_letter(x):
    return x[0]

names = ["Alan", "Adam", "Wes", "Will", "Albert", "Steven"]

for letter, group in itertools.groupby(names, first_letter):
    print(letter, list(group))  # group is an iterator over one run of names

A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
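groupby only groups consecutive elements, which is why the letter A appears twice above. If one group per letter is wanted, a common approach is to sort by the same key first. A sketch:

```python
import itertools

def first_letter(x):
    return x[0]

names = ["Alan", "Adam", "Wes", "Will", "Albert", "Steven"]

# Sorting by the grouping key puts equal keys next to each other
sorted_names = sorted(names, key=first_letter)
grouped = {
    letter: list(group)
    for letter, group in itertools.groupby(sorted_names, first_letter)
}
print(grouped)
# {'A': ['Alan', 'Adam', 'Albert'], 'S': ['Steven'], 'W': ['Wes', 'Will']}
```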


Other itertools functions:

![image.png](attachment:image.png)

## Handling errors

In some cases, we may not want to suppress an exception but still want some code to be executed regardless of whether the code in the try block succeeds. To do this, we use finally. Code that should run only if the try block succeeds goes in an else clause:

In [None]:
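path = "example_output.txt"  # hypothetical output path

def write_to_file(f):
    f.write("some text\n")  # hypothetical helper standing in for real work

f = open(path, mode="w")

try:
    write_to_file(f)
except Exception: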
    print("Failed")
else:
    print("Succeeded")
finally:
    f.close()
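It is usually preferable to catch only the exceptions you expect rather than everything. A minimal sketch, using a hypothetical helper attempt_float that falls back to returning its input when the conversion fails:

```python
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):   # only the errors expected from float()
        return x

print(attempt_float("1.2345"))      # 1.2345
print(attempt_float("something"))   # something
print(attempt_float((1, 2)))        # (1, 2)
```

Any other kind of exception would still propagate, which makes unexpected problems easier to notice.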

To get additional context around errors:

In [189]:
%xmode verbose

Exception reporting mode: Verbose


## Files and the operating system

In [None]:
path = "examples/segismundo.txt"
f = open(path, encoding="utf-8")

By default, the file is opened in read-only mode "r". We can then treat the file object f like a list and iterate over the lines like so:

In [None]:
for line in f:
    print(line)

In [None]:
lines = [x.rstrip() for x in open(path, encoding="utf-8")]


When you use open to create file objects, it is recommended to close the file when you are finished with it. Closing the file releases its resources back to the operating system:

In [None]:
f.close()

One of the ways to make it easier to clean up open files is to use the with statement:

In [None]:
with open(path, encoding="utf-8") as f:
    lines = [x.rstrip() for x in f]

If we had typed f = open(path, "w"), a new file would have been created at that path (be careful!), overwriting any file in its place.
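A self-contained sketch of the write, append, and read modes, run inside a temporary directory so that nothing existing is overwritten (the file name and contents are hypothetical):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "example.txt")   # hypothetical file name

    # "w" creates the file (or overwrites an existing one)
    with open(tmp_path, mode="w", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")

    # "a" appends to the end instead of overwriting
    with open(tmp_path, mode="a", encoding="utf-8") as f:
        f.write("third line\n")

    # The default mode "r" reads the file
    with open(tmp_path, encoding="utf-8") as f:
        lines = [x.rstrip() for x in f]

    print(f.closed)   # True, the with block closed the file

print(lines)          # ['first line', 'second line', 'third line']
```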

Python file modes

![image.png](attachment:image.png)

Python file methods or attributes

![image.png](attachment:image.png)