<a href="https://colab.research.google.com/github/albertomanfreda/intensive_school_ml/blob/master/Lesson2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Structures

## Lists

A **list** is a powerful data structure that can contain an arbitrary number of elements of any type (not necessarily all of the same type). The syntax to create a list is the following:



```
list_name = [item0, item1, item2, ...]
```



In [19]:
my_list = [1, 2, 3., 'a string']

Elements in a list can be individually accessed using square brackets, as in the following example (this is also called **random access**, though there is nothing actually random involved):

In [11]:
my_list = [1, 2, 3., 'a string']
# Note: element numbering starts from zero, not one!
print(my_list[0], my_list[1], my_list[2], my_list[3])
# Negative indices count from the end: -1 is the last element, -2 the second-to-last and so on
print(my_list[-1], my_list[-2], my_list[-3], my_list[-4])

1 2 3.0 a string
a string 3.0 2 1


You can get the number of elements currently in a list with the **len** function.

In [12]:
my_list = [1, 2, 3., 'a string']
print(len(my_list))

4
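As a small complementary sketch (not part of the original lesson), the snippet below combines indexing with **len** to show how negative indices relate to positive ones. It also shows that indices do not wrap around: asking for a position outside the valid range raises an `IndexError`.

```python
my_list = [1, 2, 3., 'a string']
n = len(my_list)

# For a list of length n, index -k refers to the same element as index n - k
same_element = my_list[-1] == my_list[n - 1]
print(same_element)

# Valid indices go from -n to n - 1; anything else raises an IndexError
try:
    my_list[n]
    out_of_range_caught = False
except IndexError:
    out_of_range_caught = True
print(out_of_range_caught)
```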


Lists in Python are **mutable**, that is, they can be changed by adding or removing elements, or by replacing the element stored at a given index.

In [16]:
my_list = [1, 2., 'a string']

# Add an element at the end
my_list.append('another_string')
print(my_list)

# Change the element at index 2
my_list[2] = 'a different string' 
print(my_list)

# Insert a new element in position 1
my_list.insert(1, 1.5) 
print(my_list)

# Remove the element in second-to-last position
my_list.pop(-2)
print(my_list)

[1, 2.0, 'a string', 'another_string']
[1, 2.0, 'a different string', 'another_string']
[1, 1.5, 2.0, 'a different string', 'another_string']
[1, 1.5, 2.0, 'another_string']
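The cell above is not an exhaustive list of the ways to modify a list. As a brief sketch of a few more standard options: `pop` also returns the element it removes, `remove` deletes by value rather than by index, and the `del` statement deletes by index without returning anything.

```python
my_list = [1, 1.5, 2.0, 'another_string']

# pop() removes an element and also returns it (the last one by default)
last = my_list.pop()
print(last, my_list)

# remove() deletes the first element equal to the given value
my_list.remove(1.5)
print(my_list)

# del removes the element at a given index without returning it
del my_list[0]
print(my_list)
```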


Lists can also be easily sorted (in-place). This, of course, requires the elements in the list to be homogeneous enough that any pair of them can be compared.

Curiosity: the algorithm for list sorting in Python is called *Timsort* (after Tim Peters, who developed it) and it's a very clever and efficient one: https://en.wikipedia.org/wiki/Timsort.

In [20]:
a = [2, 4., 3, 1] # it's ok to mix floats and integers
a.sort()
print(a)

my_list = [1, 2., 'a string'] # it's not ok to compare numbers and strings!
my_list.sort()
print(my_list)

[1, 2, 3, 4.0]


TypeError: ignored
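To make the meaning of "in-place" more concrete, here is a short sketch contrasting the `sort` method with the built-in `sorted` function, which is not covered above: `sort` modifies the list itself and returns `None`, while `sorted` leaves the list untouched and returns a new one.

```python
a = [2, 4., 3, 1]

# sorted() returns a new sorted list and leaves the original untouched
b = sorted(a)
print(a, b)

# sort() works in place and returns None
result = a.sort()
print(a, result)

# Both accept reverse=True for descending order
c = sorted(a, reverse=True)
print(c)
```

A common mistake is writing `a = a.sort()`, which replaces the list with `None`.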

### Slices

In Python, there is a powerful tool to select a subset of a list (and of other data containers): **slices**.

**Note**: slices are also a fundamental tool for array and tensor manipulation in the NumPy library, which is the foundation of the Python scientific ecosystem.

The syntax for slices is:

```
a_list_slice = a_list[start:end:step]
```

This returns a new list containing the elements from *start* (**included**) to *end* (**excluded**) in steps of *step*, in the very same way as the **range** function we have seen before.

In [13]:
song_words = ['We', 'all', 'live', 'in', 'a', 'Yellow', 'Submarine']
print(song_words[1:4])

# Here we set also the step
print(song_words[0:5:2])

['all', 'live', 'in']
['We', 'live', 'a']


In [4]:
song_words = ['We', 'all', 'live', 'in', 'a', 'Yellow', 'Submarine']
# Omitting the start argument, the slice will start from the beginning
print(song_words[:4])

# Omitting the end argument, the slice will end at the end of the list
print(song_words[4:])

# Omitting both we get the full list
print(song_words[:])

# And here we get the list reversed
print(song_words[::-1])

['We', 'all', 'live', 'in']
['a', 'Yellow', 'Submarine']
['We', 'all', 'live', 'in', 'a', 'Yellow', 'Submarine']
['Submarine', 'Yellow', 'a', 'in', 'live', 'all', 'We']
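Two properties of list slices are worth illustrating with a minimal sketch (the variable names `first_words` and `tail` are just for this example): slicing a list produces a new list, so modifying the slice leaves the original alone, and slice bounds that fall outside the list are clipped instead of raising an error.

```python
song_words = ['We', 'all', 'live', 'in', 'a', 'Yellow', 'Submarine']

# A slice is a new list: changing it does not affect the original
first_words = song_words[:3]
first_words[0] = 'You'
print(first_words)
print(song_words[:3])

# Unlike single-element access, out-of-range slice bounds are silently clipped
tail = song_words[5:100]
print(tail)
```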


You can also create a slice object first, using the built-in **slice** function, and then apply it to a list afterwards.

In [8]:
song_words = ['We', 'all', 'live', 'in', 'a', 'Yellow', 'Submarine']
my_slice = slice(1, 5, 2)
print(song_words[my_slice])

['all', 'in']
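One reason to build a slice object explicitly is that it can be reused. The sketch below (with a hypothetical `numbers` list added for illustration) passes `None` for the omitted arguments, which corresponds to leaving them out in the `[start:end:step]` notation, and applies the same slice to two different lists.

```python
song_words = ['We', 'all', 'live', 'in', 'a', 'Yellow', 'Submarine']
numbers = [0, 1, 2, 3, 4, 5, 6]

# A slice object stores its three parameters; omitted ones are None
every_other = slice(None, None, 2)  # equivalent to [::2]
print(every_other.start, every_other.stop, every_other.step)

# The same slice object can be reused on different sequences
words_subset = song_words[every_other]
numbers_subset = numbers[every_other]
print(words_subset)
print(numbers_subset)
```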


### For-loop on lists

Let's say you want to iterate over a list and do something with each of its elements. You can do it like this:

In [23]:
alphabet = ['a', 'b', 'c', 'd']
for i in range(0, len(alphabet)):
    print(alphabet[i])

a
b
c
d


However, Python provides a much more expressive way to do the same thing, which you should *always* prefer:

```
for element in list:
    do something
```



In [24]:
alphabet = ['a', 'b', 'c', 'd']
for letter in alphabet:
    print(letter)

a
b
c
d


Any data container that allows you to iterate on its elements like that is called an **iterable**. We will encounter other iterables soon.

What if you also need to know the index of the current element?
You can do that with the **enumerate** function. The syntax is:


```
for i, element in enumerate(list):
    do_something
```



In [27]:
cities = ['Rome', 'Paris', 'London', 'Berlin', 'New York', 'Tokyo', 'Beijing']
cities.sort() # alphabetical order
print('Cities alphabetically sorted:')
for i, city in enumerate(cities):
    print('{:d}) {}'.format(i+1, city))


Cities alphabetically sorted:
1) Beijing
2) Berlin
3) London
4) New York
5) Paris
6) Rome
7) Tokyo
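The `i+1` in the cell above is only there to make the numbering start from one. As a possible alternative, `enumerate` accepts an optional `start` argument that sets the first value of the counter; the sketch below uses a shorter list and collects the formatted lines in a list named `lines` purely for illustration.

```python
cities = ['Rome', 'Paris', 'London']

# start=1 makes the counter begin at 1, so no manual i+1 is needed
lines = []
for i, city in enumerate(cities, start=1):
    line = '{:d}) {}'.format(i, city)
    lines.append(line)
    print(line)
```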


And what if you want to iterate over two lists at the same time? That's a job for the **zip()** function:

In [44]:
alphabet = ['a', 'b', 'c']
numbers = [1, 2, 3, 4]
# If the lists have different sizes, zip stops as soon as the shorter one ends
for letter, number in zip(alphabet, numbers):
    print('{} -> {:d}'.format(letter, number))

a -> 1
b -> 2
c -> 3


This works for any pair of iterables, even of different types, not just two lists.
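As a minimal sketch of that flexibility, the example below zips a string, a list and a tuple together (tuples and strings are introduced later in this lesson). It also shows that `zip` is not limited to two arguments. The variable names are illustrative only.

```python
letters = 'abc'              # a string
numbers = [1, 2, 3]          # a list
flags = (True, False, True)  # a tuple

# zip accepts any number of iterables, which may be of different types
triples = []
for letter, number, flag in zip(letters, numbers, flags):
    triples.append((letter, number, flag))
print(triples)
```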

## Tuples

At first glance, **tuples** are very similar to lists. However, they have one major difference: they are **immutable**. That means that a tuple cannot be altered after it has been created: you cannot add or remove elements, and you cannot replace the elements inside it.

A tuple is created similarly to a list, only with round brackets.

In [30]:
a_tuple = (1, 2, 3., 'a_string')
print(a_tuple[0])

# Attempting to change an element of a tuple will produce an error
a_tuple[0] = 5

1


TypeError: ignored

When creating a tuple of a single element, however, the syntax is slightly different, to avoid possible ambiguities:

In [31]:
# We need to put a comma even if there are no other elements after that one
single_element_tuple = ('a', )
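The ambiguity mentioned above comes from the fact that round brackets are also used to group expressions. A quick check (the names `not_a_tuple` and `a_tuple` are just for this sketch) makes the difference visible:

```python
not_a_tuple = ('a')   # parentheses alone just group an expression
a_tuple = ('a', )     # the trailing comma is what makes it a tuple

print(type(not_a_tuple))
print(type(a_tuple))
print(len(a_tuple))
```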

Tuples are iterables and support slicing just like lists, so all we said before still applies here.

In [35]:
numbers_tuple = (1, 2, 3, 4, 5)
for number in numbers_tuple:
    print(number)

print(numbers_tuple[1:3])

1
2
3
4
5
(2, 3)


# Strings

Strings in Python are sequences of characters. They can be easily created by just putting some text inside double or single quotes.

In [1]:
a_string = 'This is a string'
another_string = "This is a string, too"

Strings are iterables too and they support slicing. Like tuples, strings are **immutable** (so, in some way, strings are like tuples of characters).

In [43]:
a_string = 'This is a string'

print(len(a_string))
print(a_string[2])
print(a_string[1:-3:2])

a_string[2] = 'd'

16
i
hsi  t


TypeError: ignored
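Since a string cannot be modified, the usual workaround is to build a new one. The following is one possible sketch, combining two slices of the original string with the replacement character (the name `modified` is illustrative):

```python
a_string = 'This is a string'

# Build a new string from slices of the old one
modified = a_string[:2] + 'd' + a_string[3:]
print(modified)

# The original string is unchanged
print(a_string)
```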

## String formatting

There are several different ways to format strings in Python:

* The ugly way, just adding pieces with '+'
* "Old-school" with the *%* operator
* With the new *format* syntax introduced with Python 3 (and also available from Python 2.6)
* With the even newer *f-strings* (Python 3.6+), which we won't cover here.


In [31]:
# The ugly way: avoid this if you can
name = 'Bob'
age = 15
print('My name is ' + name + ', I am ' + str(age) + ' years old.')

My name is Bob, I am 15 years old.


### Old style string formatting

In [6]:
name = 'Bob'
greetings = 'Hi, %s!' % name # %s is a placeholder for a string
print(greetings)

age = 15
print('I am %d years old' % age) # %d is the placeholder for an integer

pi = 3.1415926535
print('%f' % pi) # %f is for floating point numbers
print('%.9f' % pi) # You can control the number of decimal digits like this

# You can use the % operator to substitute more than one value in a string
print('%s is %d years old' % (name, age)) # Note the parentheses!

Hi, Bob!
I am 15 years old
3.141593
3.141592654
Bob is 15 years old


### New string formatting

*  The placeholders are indicated by a *{}*
*  Format specifiers are declared inside the placeholder, after a colon
*  The *%* operator at the end of the string is replaced by *format* 

In [12]:
print('{} is {:d} years old'.format(name, age))
print('{:.6f}'.format(pi))

Bob is 15 years old
3.141593


The new syntax may seem slightly less intuitive, but it is also more powerful:


In [15]:
# We can give names to the placeholders, which helps readability
print('{son} is the son of {mother}'.format(son='Bob', mother='Alice'))

Bob is the son of Alice


In [14]:
# We can also substitute the placeholders later
unformatted_string = '{} is the son of {}'
mother = 'Alice'
son = 'Bob'
print(unformatted_string.format(son, mother))

Bob is the son of Alice
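Two more features of the *format* syntax, shown here as a brief sketch rather than a complete reference: placeholders can refer to arguments by position (and reuse them), and the part after the colon can set a minimum width and an alignment. The names `sentence` and `row` are illustrative.

```python
# Placeholders can refer to arguments by position, and reuse them
sentence = '{0} is the son of {1}; {1} is the mother of {0}'.format('Bob', 'Alice')
print(sentence)

# Width and alignment go after the colon: < left-aligns, > right-aligns
row = '{:<8}|{:>6.2f}'.format('pi', 3.1415926535)
print(row)
```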


Besides '%', Python strings treat one more character specially: the backslash (\\).

This is used essentially for two reasons:

1.  Escape other special characters, so that they lose their special meaning and can be actually represented in the string:

  * \\'  (escaped single quote)

  * \\"  (escaped double quote)

  * \\\  (escaped backslash)

2.  Combine with other characters to create special symbols:
  
  * \n 	(New Line)
  * \r 	(Carriage Return)
  * \t 	(Tab)
  * others....

In [28]:
print('Bob\'s mother is Alice')

Bob's mother is Alice


In [29]:
print('Let\'s start a newline\n\tand give it a little shift!')

Let's start a newline
	and give it a little shift!
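It may help to see that an escape sequence, although typed as two characters, is stored as a single character in the string. The sketch below checks this with **len** and also shows an escaped backslash; the Windows-style path is a made-up example.

```python
# An escape sequence is written with two characters but stored as one
newline = '\n'
print(len(newline))

# To get a literal backslash, escape it with another backslash
path = 'C:\\temp'  # hypothetical path, for illustration only
print(path)
print(len(path))
```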


### Further reading

Learning all about the string formatting syntax is way beyond the scope of this course. You can find a useful list of tricks here:

https://pyformat.info/#string_pad_align

Under the hood, one of the main changes in the transition from Python 2 to Python 3 is that strings are Unicode by default. As a data scientist, it is unlikely that you will need to care about that, but if you are really interested you may take a look at:

https://docs.python.org/3/howto/unicode.htm

or read chapter 4 of the great book 'Fluent Python' by Luciano Ramalho.