
# 5. Data Structures

This chapter describes some things you've learned about already in more detail, and adds some new things as well.

## 5.1. More on Lists

The list data type has some more methods. Here are all of the methods of list objects:

list.__append__(_x_): Add an item to the end of the list. Equivalent to `a[len(a):] = [x]`.

list.__extend__(_iterable_): Extend the list by appending all the items from the iterable. Equivalent to `a[len(a):] = iterable`.

list.__insert__(_i, x_): Insert an item at a given position. The first argument is the index of the element before which to insert, so `a.insert(0, x)` inserts at the front of the list, and `a.insert(len(a), x)` is equivalent to `a.append(x)`.

list.__remove__(_x_): Remove the first item from the list whose value is _x_. Raises a [ValueError](#) if there is no such item.

list.__pop__([_i_]): Remove the item at the given position in the list, and return it. If no index is specified, `a.pop()` removes and returns the last item in the list. (The square brackets around the _i_ in the method signature denote that the parameter is optional, not that you should type square brackets at that position. You will see this notation frequently in the Python Library Reference.)

list.__clear__(): Remove all items from the list. Equivalent to `del a[:]`.

list.__index__(_x[, start[, end]]_): Return the zero-based index in the list of the first item whose value is _x_. Raises a [ValueError](#) if there is no such item. The optional arguments _start_ and _end_ are interpreted as in slice notation and are used to limit the search to a particular subsequence of the list. The returned index is computed relative to the beginning of the full sequence rather than the _start_ argument.

list.__count__(_x_): Return the number of times _x_ appears in the list.

list.__sort__(_key=None, reverse=False_): Sort the items of the list in place (the arguments can be used for sort customization, see [sorted()](#) for their explanation).

list.__reverse__(): Reverse the elements of the list in place.

list.__copy__(): Return a shallow copy of the list. Equivalent to `a[:]`.

An example that uses most of the list methods:

In [2]:
fruits = ['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']
print(fruits.count('apple'))
print(fruits.count('tangerine'))
print(fruits.index('banana'))
print(fruits.index('banana', 4))  # Find next banana starting at position 4
fruits.reverse()
print(fruits)
fruits.append('grape')
print(fruits)
fruits.sort()
print(fruits)
print(fruits.pop())

2
0
3
6
['banana', 'apple', 'kiwi', 'banana', 'pear', 'apple', 'orange']
['banana', 'apple', 'kiwi', 'banana', 'pear', 'apple', 'orange', 'grape']
['apple', 'apple', 'banana', 'banana', 'grape', 'kiwi', 'orange', 'pear']
pear
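
The example above does not exercise every method in the list. The following is a minimal sketch of the remaining ones (`insert`, `extend`, `remove`, `pop` with an index, `copy`, and `clear`); the names `letters` and `backup` are made up for illustration.

```python
# Illustrative data; `letters` and `backup` are hypothetical names.
letters = ['b', 'c']
letters.insert(0, 'a')        # insert before index 0, i.e. at the front
letters.extend(['d', 'b'])    # append every item of another iterable
print(letters)                # ['a', 'b', 'c', 'd', 'b']

letters.remove('b')           # removes only the first 'b'
print(letters)                # ['a', 'c', 'd', 'b']

print(letters.pop(1))         # removes and returns the item at index 1: c
print(letters)                # ['a', 'd', 'b']

backup = letters.copy()       # shallow copy, same as letters[:]
letters.clear()               # same as del letters[:]
print(letters, backup)        # [] ['a', 'd', 'b']
```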


You might have noticed that methods like `insert`, `remove`, or `sort` that only modify the list have no return value printed - they return the default `None`. [[1]](#) This is a design principle for all mutable data structures in Python.

[[1]](#) Other languages may return the mutated object, which allows method chaining, such as `d->insert("a")->remove("b")->sort();`.
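
A short sketch of what this means in practice, using made-up lists: assigning the result of `sort()` captures `None`, whereas the built-in `sorted()` returns a new list and leaves its argument unchanged.

```python
numbers = [3, 1, 2]

result = numbers.sort()       # sorts in place and returns None
print(result)                 # None
print(numbers)                # [1, 2, 3]

# sorted() builds and returns a new list instead of mutating its argument.
original = [3, 1, 2]
ordered = sorted(original)
print(original, ordered)      # [3, 1, 2] [1, 2, 3]
```

A common mistake that follows from this is writing `numbers = numbers.sort()`, which replaces the list with `None`.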

### 5.1.3. List Comprehensions

List comprehensions provide a concise way to create lists. Common applications are to make new lists where each element is the result of some operations applied to each member of another sequence or iterable, or to create a subsequence of those elements that satisfy a certain condition.

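For example, assume we want to create a list of squares. One option combines `map()` with a `lambda`; a list comprehension produces the same result and is more concise and readable: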

In [7]:
squares = list(map(lambda x: x**2, range(10)))
squares

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [9]:
squares = [x**2 for x in range(10)]
squares

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
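
The introduction to this section also mentions selecting only the elements that satisfy a condition. As a small sketch (the variable names are illustrative), an `if` clause at the end of the comprehension does the filtering, and the explicit loop after it builds the same list:

```python
# Keep only the squares of even numbers.
even_squares = [x**2 for x in range(10) if x % 2 == 0]
print(even_squares)           # [0, 4, 16, 36, 64]

# The equivalent explicit loop, for comparison.
even_squares_loop = []
for x in range(10):
    if x % 2 == 0:
        even_squares_loop.append(x**2)

print(even_squares == even_squares_loop)  # True
```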

## 5.3. Tuples and Sequences

We saw that lists and strings have many common properties, such as indexing and slicing operations. They are **two examples of _sequence_ data types** (see [Sequence Types -- list, tuple, range](#)). Since Python is an evolving language, other sequence data types may be added. There is also **another standard sequence data type: the _tuple_**.

A tuple consists of a __number of values separated by commas__, for instance:

In [19]:
t = 12345, 54321, 'hello!'
print(t[0])
print(t)

# Tuples may be nested:
u = t, (1, 2, 3, 4, 5)
print(u)

# Tuples are immutable:
# t[0] = 88888

# but they can contain mutable objects:
v = ([1, 2, 3], [3, 2, 1])
print(v)
print(v[0])

v[0][0] = 4
print(v)

12345
(12345, 54321, 'hello!')
((12345, 54321, 'hello!'), (1, 2, 3, 4, 5))
([1, 2, 3], [3, 2, 1])
[1, 2, 3]
([4, 2, 3], [3, 2, 1])


As you can see, on output tuples are always enclosed in parentheses, so that nested tuples are interpreted correctly; they may be input with or without surrounding parentheses, although parentheses are often necessary anyway (if the tuple is part of a larger expression). It is not possible to assign to the individual items of a tuple; however, it is possible to create tuples that contain mutable objects, such as lists.
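
The assignment that is commented out in the cell above fails at run time. A minimal sketch of what happens when it is attempted, wrapped in `try`/`except` so that the cell keeps running (the `caught` flag is only there for illustration):

```python
t = 12345, 54321, 'hello!'

caught = False
try:
    t[0] = 88888              # tuples do not support item assignment
except TypeError as exc:
    caught = True
    print(type(exc).__name__) # TypeError

print(t)                      # (12345, 54321, 'hello!'), unchanged
```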

Though tuples may seem similar to lists, they are often used in different situations and for different purposes. Tuples are [immutable](#), and usually contain a heterogeneous sequence of elements that are accessed via unpacking (see later in this section) or indexing (or even by attribute in the case of [namedtuples](#)). Lists are [mutable](#), and their elements are usually homogeneous and are accessed by iterating over the list.
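
To make the contrast concrete, here is a small sketch using `collections.namedtuple` from the standard library. The `Student` type and the sample values are invented for this example.

```python
from collections import namedtuple

# A tuple as a heterogeneous record: name (str), age (int), enrolled (bool).
Student = namedtuple('Student', ['name', 'age', 'enrolled'])  # hypothetical type
s = Student('Ada', 21, True)

print(s[0])                   # access by index: Ada
print(s.age)                  # access by attribute: 21
name, age, enrolled = s       # access by unpacking
print(name, age, enrolled)    # Ada 21 True

# A list as a homogeneous collection, accessed by iteration.
ages = [21, 19, 23]
total = 0
for a in ages:
    total += a
print(total)                  # 63
```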

A special problem is the construction of tuples containing 0 or 1 items: the syntax has some extra quirks to accommodate these. Empty tuples are constructed by an empty pair of parentheses; a tuple with one item is constructed by following a value with a comma (it is not sufficient to enclose a single value in parentheses). Ugly, but effective. For example:

In [20]:
empty = ()
singleton = 'hello',  # <-- note trailing comma
print(len(empty))
print(len(singleton))
print(singleton)

0
1
('hello',)


In [25]:
lst = [i for i in range(10)]
tple = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
print(type(lst), type(tple))

<class 'list'> <class 'tuple'>


The statement `t = 12345, 54321, 'hello!'` is an example of __tuple packing__: the values `12345`, `54321`, and `'hello!'` are packed together in a tuple. The reverse operation is also possible:

In [27]:
x, y, z = t
print(x, y, z)

12345 54321 hello!


This is called, appropriately enough, _sequence unpacking_ and works for any sequence on the right-hand side. Sequence unpacking requires that there are as many variables on the left side of the equals sign as there are elements in the sequence. Note that multiple assignment is really just a combination of tuple packing and sequence unpacking.
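
A brief sketch of these three points, with illustrative variable names: unpacking works on sequences other than tuples, a mismatch in length raises `ValueError`, and swapping two variables is multiple assignment at work.

```python
# Unpacking works for any sequence, for example a string or a list.
a, b, c = 'xyz'
first, second = [10, 20]
print(a, b, c)                # x y z

# Multiple assignment: the right-hand side is packed, then unpacked.
first, second = second, first
print(first, second)          # 20 10

# The number of targets must match the number of elements.
t = 12345, 54321, 'hello!'
mismatch = None
try:
    p, q = t                  # three values, two targets
except ValueError as exc:
    mismatch = type(exc).__name__
print(mismatch)               # ValueError
```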

## 5.4. Sets

Python also includes a data type for _sets_. A set is an __unordered collection__ with __no duplicate elements__. Basic uses include __membership testing__ and __eliminating duplicate entries__. Set objects also support __mathematical operations__ like union, intersection, difference, and symmetric difference.

Curly braces or the [set()](#) function can be used to create sets. Note: to create an empty set you have to use `set()`, not `{}`; the latter creates an empty dictionary, a data structure that we discuss in the next section.
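
The difference between `{}` and `set()` is easy to check with `type()`. A minimal sketch, with made-up variable names:

```python
empty_dict = {}               # empty braces create a dictionary
empty_set = set()             # the only way to write an empty set

print(type(empty_dict))       # <class 'dict'>
print(type(empty_set))        # <class 'set'>

# Non-empty braces without key: value pairs do create a set.
numbers = {1, 2, 2, 3}
print(type(numbers))          # <class 'set'>
print(len(numbers))           # 3, because the duplicate 2 is dropped
```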

Here is a brief demonstration:

In [34]:
# A set is an unordered collection with no duplicate elements
# Basic uses include membership testing and eliminating duplicate entries
basket = {'apple', 'orange', 'apple', 'pear', 'orange', 'banana'}
print(basket)
print('orange' in basket) # fast membership testing
print('crabgrass' in basket)

# Demonstrate set operations on unique letters from two words
a = set('abracadabra')
b = set('alacazam')
print(a)  # unique letters in a
print(a - b)  # letters in a but not in b
print(a | b)  # letters in a or b or both
print(a & b)  # letters in both a and b
print(a ^ b)  # letters in a or b but not both

{'apple', 'pear', 'orange', 'banana'}
True
False
{'c', 'r', 'b', 'd', 'a'}
{'r', 'b', 'd'}
{'m', 'c', 'r', 'd', 'b', 'z', 'l', 'a'}
{'c', 'a'}
{'z', 'l', 'd', 'm', 'r', 'b'}
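
Each operator used above also has a method form on set objects. The sketch below checks that the two spellings agree and shows the duplicate-elimination use mentioned earlier; the `words` list is invented for illustration, and `sorted()` is used only to get a predictable order for printing.

```python
a = set('abracadabra')
b = set('alacazam')

# Operator and method forms give the same results.
print(a - b == a.difference(b))            # True
print(a | b == a.union(b))                 # True
print(a & b == a.intersection(b))          # True
print(a ^ b == a.symmetric_difference(b))  # True

# Eliminating duplicates: convert to a set, then sort for a stable order.
words = ['pear', 'apple', 'pear', 'kiwi', 'apple']  # hypothetical data
unique_words = sorted(set(words))
print(unique_words)                        # ['apple', 'kiwi', 'pear']
```

Unlike the operators, the method forms accept any iterable as their argument, so `a.union('xyz')` works without first converting the string to a set.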


Similarly to [list comprehensions](#), set comprehensions are also supported:

In [35]:
a = { x for x in 'abracadabra' if x not in 'abc' }
a

{'d', 'r'}