# Types and containers, string manipulation

### Reminder from last lecture
We talked about variable assignment, such as:

In [29]:
# first assignment
a = 1
b = "hello world"
print(a, type(a))
print(b, type(b))

1 <class 'int'>
hello world <class 'str'>


We also talked briefly about Python's built-in types.

Summary of native types:
- string (`str`): contains characters, supports Unicode, there is no separate type for individual characters;
- numeric types: integers (`int`) have variable length, meaning they have no minimum or maximum value. Floating point numbers (`float`, numbers with a decimal point) are double precision (64 bit). An `int` or `float` can be negative. Important: **floating point** implies **variable precision**: the resolution of your variable (i.e. the minimum difference between two representable values) depends on the order of magnitude of the number. Most of the time you will have enough precision for all practical purposes, but be aware that some numbers (especially decimal fractions) may not have an exact representation!
- booleans (`True` and `False`)
- collections: `tuple` (immutable sequence), `list` (mutable sequence), `set` (set of unique items), `dict` (key-value mapping)
- none: `None` is a special object of type `NoneType`; it is typically used to signal the absence of a value.

In [1]:
"python"                            # str
b"\xf0\x9f\x90\x8d"                 # bytes
42                                  # int
42., 42.0, 4.2e1                    # float
(1, 42., "python")                  # tuple
[1, 42., "python"]                  # list
{1, 42., "python"}                  # set
{1: "foo", 42.: "bar"}              # dict
None                                # NoneType
True, False                         # bool

(True, False)

#### Floating point precision

In [2]:
a = 0.3
b = 0.1 + 0.1 + 0.1
print(a == b)
print(a , b)

False
0.3 0.30000000000000004


Can you guess what is happening? *Floating-point representation errors* happen because `float`s are stored in base-2 (binary) representation, where e.g. 0.1 does not have a finite binary representation, and therefore must be approximated.
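
Because of this, comparing floats with `==` is fragile. A minimal sketch of a more tolerant comparison, assuming the default tolerances of the standard-library function `math.isclose()` are acceptable for your use case:

```python
import math

a = 0.3
b = 0.1 + 0.1 + 0.1

exact_equal = (a == b)             # exact comparison fails because of rounding
close_enough = math.isclose(a, b)  # compares within a relative tolerance

print(exact_equal, close_enough)
```

The tolerance can be adjusted through the `rel_tol` and `abs_tol` arguments if the defaults do not suit your data.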

## 1. Containers

The following types are *data structures*. As the name indicates, they are useful for storing data!

### Tuples
Tuples are immutable sequences of values. Once constructed, they cannot be modified.

In [45]:
a = ()
type(a)

tuple

In [48]:
a = (1,2,3)
print(a)
print(a[0], a[1], a[2])
a[0] = 9 # try this!

(1, 2, 3)
1 2 3


TypeError: 'tuple' object does not support item assignment

In [54]:
a, b, c = 1, 2, 3
t = (a, b, c)
print(t)
#t[0] = 4 # this cannot work
a = 4 # maybe this will work?
print(a)
print(t)

(1, 2, 3)
4
(1, 2, 3)


In [55]:
print(t)

(1, 2, 3)


So be careful: the tuple stored the values of `a`, `b`, `c` at the time it was created, and assigning a new value to `a` will not change what's in the tuple!

### Lists
Lists are the simplest form of collection that can be modified. We can use the `.append()` method to add elements. Note that in contrast to functions like `print()`, we use dot notation with methods. We will explain this when we cover functions and methods.

In [57]:
a = list() # create an empty list using the list() built-in function
b = [] # create an empty list using the `[]` literal
b.append(1) # add an element to the list
b.append("hello")
print(b)

[1, 'hello']


Collections can be non-homogeneous, but this is rarely a good practice to adopt!

You can create lists from tuples:

In [58]:
c = list((1,2,3))
print(c, type(c))
c[0] = 4 # now we can modify the list!
print(c)

[1, 2, 3] <class 'list'>
[4, 2, 3]


We will show a few examples of list *slicing*. Slicing is a very powerful syntactic tool that lets you manipulate collections with a very compact notation. Spend a bit of time learning it; you will use it all the time! Important to note: Python starts indexing from zero!

In [37]:
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

#print(l[2:]) # start at index 2
#print(l[2:9]) # select between indices 2 and 9-1 (upper limits are exclusive)
#print(l[:9]) # stop at index 9-1
print(l[-1], l[-2], "...") # access individual elements counting from the end
print(l[::-1]) # reverse the entire list

10 9 ...
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
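
Slices also accept an optional third number, the step, as in `l[start:stop:step]`. A short sketch of a few more combinations (the variable names are only illustrative):

```python
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

every_second = l[::2]   # start at index 0, take every second element
middle_step = l[1:8:3]  # indices 1, 4 and 7 (the upper limit 8 is exclusive)
last_three = l[-3:]     # the last three elements

print(every_second)
print(middle_step)
print(last_three)
```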


We can check the number of elements in a collection with the `len()` function:

In [64]:
print(len(l))

10


Note how lists are not the same as arrays or vectors! For example:

In [65]:
a = [1,2,3]
b = [4,5,6]
c = a + b # this concatenates the lists, does not add their values!
print(c)

[1, 2, 3, 4, 5, 6]


You can have nested lists:

In [66]:
a = [1,2,3]
b = [4,5,6]
c = [7,8,9]
m = [a, b, c]
print(m)

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]


Let's try some methods for working with lists: `.insert()`, `.append()`, `.extend()`, and `.pop()`.

In [43]:
# Add an element to the list at a specified index
a = [1,2,3]
a.insert(1, 4)  # First argument is index, second is value to insert
print(a)
a.insert(1, "banana")
print(a)

[1, 4, 2, 3]
[1, 'banana', 4, 2, 3]


In [38]:
# Add an element to end of list
a = [1,2,3]
print(a)
a.append(4)
print(a)

[1, 2, 3]
[1, 2, 3, 4]


In [44]:
# Add multiple elements to list
a = [1,2,3]
a.extend([4, "banana"])
print(a)

[1, 2, 3, 4, 'banana']


In [46]:
# Remove elements from list
a = [1,2,3]
a.pop(2)
print(a)

[1, 2]


We can also sort lists, using `.sort()`. Strings will be sorted alphabetically, numbers will be sorted by value.

In [47]:
# Sort a list of strings
colors = ["green", "yellow", "blue"]
print(colors)
colors.sort()
print(colors)

['green', 'yellow', 'blue']
['blue', 'green', 'yellow']


In [48]:
# Sort a list of numbers
numbers = [593, 2, 39010, 1595]
numbers.sort()
print(numbers)

[2, 593, 1595, 39010]
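
Note that `.sort()` modifies the list in place. If you want to keep the original order, one option is the built-in function `sorted()`, which returns a new list. A small sketch:

```python
numbers = [593, 2, 39010, 1595]

ascending = sorted(numbers)                 # new list; `numbers` is unchanged
descending = sorted(numbers, reverse=True)  # largest value first

print(numbers)
print(ascending)
print(descending)
```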


Also useful are the functions `sum()`, `min()`, `max()`. These can also be used on tuples.

In [56]:
number_list = [300, 12, 50]
number_tuple = (10, 20, 5)
print(sum(number_list), min(number_list), max(number_list))
print(sum(number_tuple), min(number_tuple), max(number_tuple))

362 12 300
35 5 20


### Sets
Sets are similar to lists, but they are unordered and do not hold repeated copies of the same value.

In [70]:
a = {"a", "b"} # note that {} is not an empty set but an empty dictionary!
print(a, type(a))
a.add("a")
print(a)
a.add("c")
print(a)

{'a', 'b'} <class 'set'>
{'a', 'b'}
{'c', 'a', 'b'}


In [71]:
# We can also build it from a list using the function set()
l = [1,2,2,3,4,5]
s = set(l) # repeated elements will not be preserved!
print(s)

{1, 2, 3, 4, 5}


### Dictionaries
Dictionaries map a key to a value (key-value pair). Values can be of any valid Python type, and keys can be of any immutable (more precisely, hashable) type, e.g. integers, floats, strings, Booleans, or tuples. Neither has to be homogeneous in general (but again, there is a difference between what you *can* and what you *should* do). In particular, you can't use a list as a key.

In [58]:
a = {"", ""}
d = {}
print(a, type(a))
print(d, type(d))

{''} <class 'set'>
{} <class 'dict'>


Here is a silly example of a dictionary:

In [73]:
d = dict() # initialisation statement
d[1] = 'a' # 1 is the key, a is the value
d['b'] = 2 # 'b' is the key, 2 is the value


"""
As mixed as it gets (almost).
A bit confusing.
Also not very useful?    
"""
print(d)

{1: 'a', 'b': 2}


Here is a more sensible example.

In [63]:
family_locations = { 'Elisa' : 'Germany', 'Norah' : 'New York', 'Kristen' : 'Pennsylvania'}
print(family_locations)

{'Elisa': 'Germany', 'Norah': 'New York', 'Kristen': 'Pennsylvania'}


In [62]:
# Access a value
print(family_locations["Elisa"])

Germany


Note that we could store the same information with a list - why is a dictionary more convenient?
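
One way to see the difference is to try both. The sketch below uses a hypothetical list-based layout (two parallel lists, with names chosen only for this illustration) next to the dictionary version:

```python
# Hypothetical list-based layout: two parallel lists
names = ["Elisa", "Norah", "Kristen"]
locations = ["Germany", "New York", "Pennsylvania"]

# First find the position of the name, then use it in the other list
position = names.index("Elisa")
from_lists = locations[position]

# With a dictionary the lookup is a single step
locations_by_name = {"Elisa": "Germany", "Norah": "New York", "Kristen": "Pennsylvania"}
from_dict = locations_by_name["Elisa"]

print(from_lists, from_dict)
```

With lists we also have to keep the two sequences aligned by hand, while the dictionary keeps each name and location together.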

Dictionaries are mutable: we can add, overwrite, and delete elements.

In [65]:
# Add element to dictionary
family_locations['Eric'] = 'Wisconsin'
print(family_locations)

{'Elisa': 'Germany', 'Norah': 'New York', 'Kristen': 'Pennsylvania', 'Eric': 'Wisconsin'}


In [66]:
# Overwrite/update element in a dictionary
family_locations['Norah'] = 'Germany'
print(family_locations)

{'Elisa': 'Germany', 'Norah': 'Germany', 'Kristen': 'Pennsylvania', 'Eric': 'Wisconsin'}


In [67]:
# Delete element from a dictionary
del family_locations['Elisa']
print(family_locations)

{'Norah': 'Germany', 'Kristen': 'Pennsylvania', 'Eric': 'Wisconsin'}


What happens when we try to access a key that isn't in the dictionary?

In [69]:
print(family_locations["Paul"])

KeyError: 'Paul'

- Dictionaries are one-way maps: you can get a value given its key, you can have repeated values but not repeated keys!
- You can use integers as keys, but this does not turn a dictionary into a vector.
- Dictionaries in old versions of Python (before 3.6) stored key-value pairs in arbitrary order. Since Python 3.7, dictionaries are guaranteed to preserve the order in which elements were inserted (CPython 3.6 already behaved this way as an implementation detail).
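
If you are not sure whether a key exists, you can avoid the `KeyError` by checking first or by using the `.get()` method. A minimal sketch, where the default value `"unknown"` is just an illustrative choice:

```python
family_locations = {"Norah": "Germany", "Kristen": "Pennsylvania", "Eric": "Wisconsin"}

has_paul = "Paul" in family_locations                   # membership test on the keys
paul = family_locations.get("Paul")                     # returns None instead of raising
paul_default = family_locations.get("Paul", "unknown")  # or a default of our choice

print(has_paul, paul, paul_default)
```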

### A simple structured dataset

A common situation in science is having a table with labeled data. Python does not provide a native table or matrix format, but you can achieve something similar with a dictionary of lists, for example:

In [75]:
names = ['proton', 'neutron', 'electron']
symbols = ['p', 'n', 'e']
masses = [938, 939, 0.511] # in MeV/c^2

particles = { "name" : names, "symbol" : symbols, "mass" : masses}

print(particles)

{'name': ['proton', 'neutron', 'electron'], 'symbol': ['p', 'n', 'e'], 'mass': [938, 939, 0.511]}


Now, if you access a given index on each list, you will get all the properties of a particle. This is still a crude way to build a structured dataset, but one that can be easily converted to the formats used by popular libraries.
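
As a sketch of what "accessing a given index on each list" looks like, assuming each list holds exactly one entry per particle, in the same order:

```python
names = ["proton", "neutron", "electron"]
symbols = ["p", "n", "e"]
masses = [938, 939, 0.511]

particles = {"name": names, "symbol": symbols, "mass": masses}

i = 2  # position of the electron in each list
electron = (particles["name"][i], particles["symbol"][i], particles["mass"][i])
print(electron)
```

This only works as long as all lists have the same length and ordering, which is one reason dedicated table libraries are usually preferable.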

### `in` operator
The `in` operator has two main use cases:
- check if a string is part of another string;
- check if a value is present in a collection.

In [77]:
a = "hello" in "hello world"
print(a)

b = 3 in [2,4,5]
print(b)

True
False
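
With dictionaries, `in` looks at the keys, not the values. A short sketch (the dictionary is a shortened copy of the earlier example):

```python
family_locations = {"Norah": "Germany", "Kristen": "Pennsylvania"}

key_found = "Norah" in family_locations               # checks the keys
value_as_key = "Germany" in family_locations          # values are not searched
value_found = "Germany" in family_locations.values()  # search the values explicitly

print(key_found, value_as_key, value_found)
```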


## 2. Strings

### String manipulation
Strings are an immutable sequence of characters, similar to a tuple, which we learned about last week. We can use the built-in `len()` function on them, as well as slicing, which we used above to get sub-lists.

In [3]:
my_string = "Hello World!"
print(my_string, len(my_string))

Hello World! 12


Slice to print part of the string.

In [4]:
print(my_string[0:6])

Hello 


Use a negative index to get the end of the string.

In [5]:
print(my_string[-1:])

!


Or even reverse the string!

In [6]:
print(my_string[::-1])

!dlroW olleH


We can "add" or concatenate multiple strings with the operator `+`.

In [7]:
my_string_1 = "Hello "
my_string_2 = "World!"
print(my_string_1 + my_string_2)

Hello World!


Let's try out some interesting string methods: `.lower()`, `.upper()`, `.rstrip()`, `.lstrip()`, `.strip()`, `.startswith()`, `.endswith()`, `.find()`, and `.replace()`.

In [8]:
my_string.upper()

'HELLO WORLD!'

Notice that this method returns a new string; the original string is not modified.

In [16]:
print(my_string)

Hello World!


In [9]:
my_string = "    Hello World!    "
print(my_string)

    Hello World!    


We can remove white space with `.lstrip()`, `.rstrip()`, and `.strip()`.

In [12]:
my_string.rstrip()

'    Hello World!'

We can get information about the string contents with `.endswith()` and `.find()`.

In [13]:
my_string = "Hello World!"
my_string.endswith("!")

True

In [None]:
print(my_string.find("o "))  # index where the first match starts
print(my_string.find("vjke35i345"))  # -1 means the substring was not found

Or replace part of the string with `.replace()`. Again, the original string is not modified.

In [19]:
print(my_string.replace("World", "Back At You"))
print(my_string)

Hello Back At You!
Hello World!
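
The remaining methods from the list above work along the same lines. A brief sketch, using the built-in `repr()` only to make the surrounding white space visible:

```python
padded = "    Hello World!    "

left_stripped = padded.lstrip()  # removes leading white space only
both_stripped = padded.strip()   # removes leading and trailing white space
lowered = both_stripped.lower()  # all characters in lower case
starts = both_stripped.startswith("Hello")

print(repr(left_stripped))
print(repr(both_stripped))
print(lowered, starts)
```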


### String formatting
In Python, there are several ways of building strings that incorporate variables of different types. If you are curious about the disadvantages/advantages of different string formatting methods, you can read this article: https://realpython.com/python-string-formatting/#3-string-interpolation-f-strings-python-36.

### String interpolation (legacy)
The oldest style is *[string interpolation](https://docs.python.org/3/library/stdtypes.html#old-string-formatting)*:

In [20]:
a = 1.2345
b = 42
#print("a = %d, b = %d" % (a, b)) # d -> integer
#print("a = %02d" % a)
#print("a = %f" % a) # f -> float
print("a = %.2f" % a)

a = 1.23


The `%d` and similar codes (`%s`, `%x`) are called conversion specifiers. This style has many pitfalls and is discouraged in new code. We recommend against using it; it is included here only in case you come across it in existing code.

### f-strings and format() method
*f-strings* are *formatted string literals* that let you easily embed Python variables and expressions in strings. An alternative and less compact notation uses the `format()` method.

In [21]:
a, b = 1.2345, 42
#print(f"a = {a}, b = {b}") # this is simple
#print(f"{a=}, {b=}") # this is even more compact, although less flexible
print("a = {}, b = {}".format(a, b)) # this is an alternative standard, can be more or less readable depending on the circumstances

a = 1.2345, b = 42


You can control the spacing, number of zeros, number of decimals etc. with specific format strings. Let's try with a pair of `int` values.

In [22]:
a, b = 42, 1042
print(f"b = {b}")
print(f"b = {b:4d}")
print(f"b = {b:5d}")
print(f"a = {a:4d}") # this pads with spaces to a minimum width of 4 characters
print(f"a = {a:04d}") # this will fill with zeros instead

b = 1042
b = 1042
b =  1042
a =   42
a = 0042


And now with `float` values.

In [24]:
a = 123.456
print(f"a = {a}") # default
print(f"a = {a:.2f}") # only print two places past the decimal place
print(f"a = {a:.6f}") # print 6 places past the decimal place
print(f"a = {a:.2e}") # exponential notation!

a = 123.456
a = 123.46
a = 123.456000
a = 1.23e+02


You can even do inline arithmetic with f-strings.
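
A minimal sketch of what this can look like, reusing the integer values from above (the variable names `summary` and `ratio` are only illustrative):

```python
a, b = 42, 1042

summary = f"a + b = {a + b}"    # any expression can go inside the braces
ratio = f"b / a = {b / a:.3f}"  # expressions and format specifiers can be combined

print(summary)
print(ratio)
```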

### Multiline strings
You can build a multiline string using the newline (`\n`) escape sequence. What's an escape sequence? It's a sequence of characters that starts with a special character (`\`) and is subject to a special treatment. Here is an example where it is important, using the `print()` function.

In [25]:
print('Let's escape!')

SyntaxError: unterminated string literal (detected at line 1) (2091332172.py, line 1)

In [26]:
""" 
If you want to print an apostrophe in a string delimited by apostrophes, you will have to escape the character.
"""
print('Let\'s escape!')

""" 
But this will work flawlessly if you choose quotation marks as delimiters.
"""
print("Let's not escape!")

Let's escape!
Let's not escape!


Back to multiline strings...

In [27]:
print("Line 1\nLine 2\nLine 3")

Line 1
Line 2
Line 3


You could get the same with three `print()` statements, however in some cases you may want to use a single one. For better readability, you could compose the string as follows:

In [28]:
s = "Line 1\n"
s += "Line 2\n"
s += "Line 3"
print(s)

Line 1
Line 2
Line 3


Code using this style can easily get very cluttered, so use it sparingly!
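
A possible alternative, sketched here, is a triple-quoted string like the ones used as comments in the escape example: line breaks typed inside the quotes become part of the string.

```python
s = """Line 1
Line 2
Line 3"""
print(s)
```

This produces the same three lines as the `\n` version, at the cost of the text no longer following the indentation of the surrounding code.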