# NB18: List Algorithms

## Programming Fundamentals

## L.EIC/2022-23

#### João Correia Lopes$^{1}$, Nuno Macedo$^{1}$, Pedro Vasconcelos$^{2}$
$^{1}$FEUP/DEI & INESC TEC\
$^{2}$FCUP/DCC & LIACC

> “Intelligence is the ability to avoid doing work, yet getting the work done.”

Linus Torvalds

## Goals

By the end of this class, the student should be able to:

- Explain and implement linear search and binary search

- Describe other algorithms that work with lists

- Compare the performance of those algorithms

## Bibliography

- Peter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers, *How to Think Like a Computer Scientist — Learning with Python 3 (RLE)*, 2012 (Chapter 14)
[[HTML]](http://openbookproject.net/thinkcs/python/english3e/list_algorithms.html)

- Brad Miller and David Ranum, *Problem Solving with Algorithms and Data Structures using Python* (Section 6.3, Section 6.4)
[[HTML]](https://runestone.academy/runestone/books/published/pythonds/SortSearch/toctree.html)

# 18 List Algorithms




## 18.1 Introduction

> **And now for something completely different!** (*Monty Python's Flying Circus*)

- Rather than introduce more programming constructs, or new Python syntax and features

- ... we focus on some algorithms that work with lists

### Alice in Wonderland

- Examples work with the book "Alice in Wonderland" and a "vocabulary file"

![alice](https://raw.githubusercontent.com/fp-leic/public/main/notebooks/18/alice_cover.jpg)

$\Rightarrow$
[Alice in Wonderland](http://openbookproject.net/thinkcs/python/english3e/_downloads/alice_in_wonderland.txt)  
$\Rightarrow$
[Vocabulary](http://openbookproject.net/thinkcs/python/english3e/_downloads/vocab.txt)  
$\Rightarrow$
[Alice's Adventures in Wonderland by Lewis Carroll](http://www.gutenberg.org/ebooks/19033)

## 18.2 The linear search algorithm

### The linear search algorithm

- We'd like to know the index where a specific item occurs within a list of items

- Specifically, we'll return the index of the item if it is found, or `None` if the item doesn't occur in the list

![image](https://raw.githubusercontent.com/fp-leic/public/main/notebooks/18/linear.png)

$\Rightarrow$
[Problem Solving with Algorithms...](https://runestone.academy/runestone/books/published/pythonds/SortSearch/TheSequentialSearch.html)  
$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/18/linear.py>

### Linear search


Searching all items of a sequence from first to last is called a **linear search**.

Given a list of numbers:

In [None]:
ls = [2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37, 43, 47, 53]

Find and return the index of `target` in the given sequence `xs`, or `None` when not found:

In [None]:
def linear_search(xs, target):
    """ Find and return the index of target in sequence xs, or None """
    for (i, v) in enumerate(xs):
        if v == target:
            return i
    return None

In [None]:
print(linear_search(ls, 1))
print(linear_search(ls, 17))

### Probing

- Each time we check whether `v == target` we'll call it a **probe**

    - We like to count probes as a measure of how efficient our algorithm is

    - This will be a good enough indication of how long our algorithm will take to execute


- Linear searching is characterized by the fact that the number of probes needed to find some target depends directly on the length of the list $N$

- On average, when the target is present, we're going to need to go about halfway through the list, or $N/2$ probes
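
The probe counts above can be checked with a small sketch. The helper name `linear_search_probes` and the list name `primes` are hypothetical (they are not part of the lecture code); the helper behaves like `linear_search` but also reports how many probes were made.

```python
def linear_search_probes(xs, target):
    """ Like linear_search, but return a pair (index, number of probes). """
    probes = 0
    for (i, v) in enumerate(xs):
        probes += 1              # one probe per comparison v == target
        if v == target:
            return (i, probes)
    return (None, probes)        # target absent: every item was probed


primes = [2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37, 43, 47, 53]

# Average number of probes over all the targets that are present
total = sum(linear_search_probes(primes, p)[1] for p in primes)
print(total / len(primes))               # 7.5, i.e. (N + 1) / 2 for N = 14
print(linear_search_probes(primes, 1))   # (None, 14): an absent target costs N probes
```

For a list of 14 items the average over present targets is 7.5 probes, which matches the "about halfway" estimate.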


### Linear Performance

- We say that linear search has *linear runtime* (as in a linear function of $N$)

- Analyses like this are pretty meaningless for small lists

    - The computer is quick enough not to bother if the list only has a handful of items

- So generally, we're interested in the **scalability** of our algorithms

    - How do they perform if we throw bigger problems at them?

    - What happens for **really large** datasets, e.g. how does Google search so brilliantly well?

## 18.3 A more realistic problem

### A more realistic problem

- As children learn to read, there are expectations that their vocabulary will grow

- So a child of age 14 is expected to know more words than a child of age 8

- When prescribing reading books for a grade, an important question might be:
    
> **"which words in this book are not in the expected
> vocabulary at this level?"**


- Let us assume we can read a vocabulary of words into our program

- Then read the text of a book, and split it into words

$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/18/alice.py>

First, download to your local file system the two files you'll need:

In [None]:
!wget http://openbookproject.net/thinkcs/python/english3e/_downloads/alice_in_wonderland.txt

In [None]:
!wget http://openbookproject.net/thinkcs/python/english3e/_downloads/vocab.txt

A utility function we'll need later:

In [None]:
def load_from_file(filename):
    """ Read the file named filename, return its content as a string. """
    # outside Colab:
    # path = filename
    # inside Colab:
    path = "/content/" + filename
    with open(path, "r") as f:   # the with statement closes the file for us
        file_content = f.read()
    return file_content

Now, to open the vocabulary:

In [None]:
vocab = load_from_file("vocab.txt")
print("There are", len(vocab), "chars in the vocabulary.\nStarting with:")
vocab[:80]

The basic strategy is to run through each of the words in the book, look it up in the vocabulary, and if it is not in the vocabulary, save it into a new resulting list which we return from the function.

In [None]:
def find_unknown_words(vocab, wds):
    """ Return a list of words in wds that do not occur in vocab. """
    result = []
    for w in wds:
        if (linear_search(vocab, w) is None):  # linear search
            result.append(w)
    return result

Split the vocabulary into words:

In [None]:
def load_words_from_file(filename):
    """ Read words from filename, return list of words. """
    file_content = load_from_file(filename)
    wds = file_content.split()
    return wds

Now read a sensibly sized vocabulary:

In [None]:
bigger_vocab = load_words_from_file("vocab.txt")
print("There are {0} words in the vocabulary.\nStarting with:\n{1} "
      .format(len(bigger_vocab), bigger_vocab[:12]))

Books are full of punctuation, and have mixtures of lowercase and uppercase letters. We need to clean up the contents of the book.

This will involve removing punctuation, and converting everything to the same case (lowercase, because our vocabulary is all in lowercase).

In [None]:
def text_to_words(the_text):
    """ return a list of words with all punctuation removed, and all in lowercase. """
    my_substitutions = the_text.maketrans(
      # If you find any of these
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
      # Replace them by these
      "abcdefghijklmnopqrstuvwxyz                                          ")
    # Translate the text now.
    cleaned_text = the_text.translate(my_substitutions)
    wds = cleaned_text.split()
    return wds

Try it, using an (in)famous quotation:

In [None]:
quote = "I'm a lumberjack and I'm OK. I sleep all night and I work all day!"
print(quote)   # named quote so that the built-in str is not shadowed
clean_quote = text_to_words(quote)
print(clean_quote)
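
The cleaning step described above can also be sketched without typing the two long strings by hand. This is only an alternative sketch: the function name `text_to_words_sketch` is hypothetical, and it assumes that the characters to blank out are exactly the ASCII punctuation and digits provided by the standard `string` module.

```python
import string


def text_to_words_sketch(the_text):
    """ Return a list of lowercase words, with punctuation and digits removed. """
    to_blank = string.punctuation + string.digits
    # Build a table that maps each unwanted character to a space;
    # both arguments of maketrans must have the same length.
    table = str.maketrans(to_blank, " " * len(to_blank))
    return the_text.lower().translate(table).split()


print(text_to_words_sketch("I'm a lumberjack and I'm OK."))
```

Building the replacement string with `" " * len(to_blank)` keeps the two arguments the same length, which `maketrans` requires.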

Get a clean book from a file:

In [None]:
def get_words_in_book(filename):
    """ Read a book from filename, and return a list of its words. """
    file_content = load_from_file(filename)
    wds = text_to_words(file_content)
    return wds

Now we're ready to read in our book:

In [None]:
book_words = get_words_in_book("alice_in_wonderland.txt")
print("There are {0} words in the book.\nStarting with:\n{1}"
      .format(len(book_words), book_words[:12]))

Find the unknown words and make some timing measurements:

In [None]:
import time
print("Finding unknown words...", end='')
t0 = time.perf_counter()

missing_words = find_unknown_words(bigger_vocab, book_words)

t1 = time.perf_counter()
print(" took {0:.4f} seconds.".format(t1-t0))
print()
print("There are {0} unknown words.\nStarting with:\n{1}"
      .format(len(missing_words), missing_words[:12]))

## 18.4 Binary Search

### Binary Search

- If you were given a vocabulary and asked to tell if some word was present, you'd probably start in the middle

- You can do this because the vocabulary is **ordered** --- so you can probe some word in the middle, and immediately realize that your target was before (or perhaps after) the one you had probed

- Applying this principle repeatedly leads us to a very much better algorithm for searching in a list of items that are already ordered

- This algorithm is called **binary search**

- It is a good example of *divide and conquer*

### Region of Interest (ROI)

- Our algorithm will start with the ROI set to all the items in the list

- On the first probe in the middle of the ROI, there are three possible outcomes:

    - either we find the target

    - or we learn that we can discard the top half of the ROI

    - or we learn that we can discard the bottom half of the ROI

- Trying with 54...

![image](https://raw.githubusercontent.com/fp-leic/public/main/notebooks/18/binary.png)

$\Rightarrow$
[Problem Solving with Algorithms...](https://runestone.academy/runestone/books/published/pythonds/SortSearch/TheBinarySearch.html)  
$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/18/binary.py>

In [None]:
def binary_search(xs, target):
    """ Find and return the index of target in sequence xs, or None."""
    lb = 0        # lower bound (lower index)
    ub = len(xs)  # upper bound (1 past the higher index)
    while True:
        if lb == ub:     # If region of interest (ROI) becomes empty
            return None  # NOT found!

        # Next probe should be in the middle of the ROI
        mid_index = (lb + ub) // 2

        # Fetch the item at that position
        item_at_mid = xs[mid_index]

        # comment if not debugging
        print("ROI[{0}:{1}](size={2}), probed='{3}', target='{4}'"
              .format(lb, ub, ub-lb, item_at_mid, target))

        # How does the probed item compare to the target?
        if item_at_mid == target:
            return mid_index      # Found it!
        if item_at_mid < target:
            lb = mid_index + 1    # Use upper half of ROI next time
        else:
            ub = mid_index        # Use lower half of ROI next time

- The debug print statements give us a trace of the probes done during a search

In [None]:
ls = [2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37, 43, 47]  # len(ls) = 13

Try it and have a look at the output:

In [None]:
print(binary_search(ls, 4))
#print(binary_search(ls, 37))

In [None]:
ls = [2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37, 43, 47, 53]  # len(ls) = 14

In [None]:
print(binary_search(ls, 17))
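
To see how few probes binary search needs, the same algorithm can be instrumented to count them. This is a sketch, not the lecture's code: the names `binary_search_probes`, `primes` and `worst` are hypothetical.

```python
def binary_search_probes(xs, target):
    """ Like binary_search, but return a pair (index, number of probes). """
    lb = 0
    ub = len(xs)
    probes = 0
    while lb < ub:                   # the ROI xs[lb:ub] is not empty
        mid_index = (lb + ub) // 2
        probes += 1                  # one probe per item fetched and compared
        if xs[mid_index] == target:
            return (mid_index, probes)
        if xs[mid_index] < target:
            lb = mid_index + 1       # keep the upper half of the ROI
        else:
            ub = mid_index           # keep the lower half of the ROI
    return (None, probes)


primes = [2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37, 43, 47]   # 13 items
print(binary_search_probes(primes, 4))     # (None, 4)
print(binary_search_probes(primes, 37))    # (10, 2)

# Each probe at least halves the ROI, so the worst case is
# floor(log2(N)) + 1 probes, which equals N.bit_length() for N >= 1.
worst = max(binary_search_probes(primes, t)[1] for t in range(60))
print(worst, len(primes).bit_length())     # 4 4
```

Compare this with linear search, where an absent target costs one probe per item in the list.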

### Back to "A more realistic problem"


Now, we'll use `binary_search` to find the unknown words.

First let us redefine the function without debugging code:

In [None]:
def binary_search(xs, target):
    """ Find and return the index of target in sequence xs, or None."""
    lb = 0        # lower bound (lower index)
    ub = len(xs)  # upper bound (1 past the higher index)
    while True:
        if lb == ub:     # If region of interest (ROI) becomes empty
            return None  # NOT found!
        # Next probe should be in the middle of the ROI
        mid_index = (lb + ub) // 2
        # Fetch the item at that position
        item_at_mid = xs[mid_index]
        # How does the probed item compare to the target?
        if item_at_mid == target:
            return mid_index      # Found it!
        if item_at_mid < target:
            lb = mid_index + 1    # Use upper half of ROI next time
        else:
            ub = mid_index        # Use lower half of ROI next time

Now we can redefine `find_unknown_words` to use `binary_search` instead of `linear_search`:

In [None]:
def find_unknown_words(vocab, wds):
    """ Return a list of words in wds that do not occur in vocab """
    result = []
    for w in wds:
        if (binary_search(vocab, w) is None):  # binary search
            result.append(w)
    return result

In [None]:
import time
print("Finding unknown words...", end='')
t0 = time.perf_counter()

missing_words = find_unknown_words(bigger_vocab, book_words)

t1 = time.perf_counter()
print(" took {0:.4f} seconds.".format(t1-t0))
print()
print("There are {0} unknown words.\nStarting with:\n{1}"
      .format(len(missing_words), missing_words[:12]))

- What a spectacular difference: more than 200 times faster!
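
Python's standard library also ships a binary search in the `bisect` module. The sketch below wraps `bisect.bisect_left` so that it returns the same kind of result as `binary_search`; the wrapper name `binary_search_bisect` and the small word list are hypothetical, and the list is assumed to be sorted already.

```python
import bisect


def binary_search_bisect(xs, target):
    """ Return the index of target in the sorted list xs, or None. """
    # bisect_left returns the leftmost position where target could be
    # inserted while keeping xs sorted
    i = bisect.bisect_left(xs, target)
    if i < len(xs) and xs[i] == target:
        return i
    return None


words = ["alice", "cat", "hatter", "rabbit"]     # already sorted
print(binary_search_bisect(words, "hatter"))     # 2
print(binary_search_bisect(words, "queen"))      # None
```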


$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/18/alice.py>

## 18.5 Removing adjacent duplicates

### Removing adjacent duplicates from a list

- We often want to get the unique elements in a list, i.e. produce a new list in which each different element occurs just once

- Consider our case of looking for words in *Alice in Wonderland* that are not in our vocabulary

- We had a report that there are 3398 such words, but there are duplicates in that list

- In fact, the word "alice" occurs 398 times in the book, and it is not in our vocabulary!

- How should we remove these duplicates?



- A good approach is to sort the list, then **remove all adjacent
    duplicates**

$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/18/adjacent.py>


In a sorted list, to remove duplicates, we simply have to remember the most recent item that was inserted into the result, and avoid inserting it again.

In [None]:
def remove_adjacent_dups(xs):
    """ Return a new list in which all adjacent duplicates from xs
        have been removed. """
    result = []
    most_recent_elem = None
    for e in xs:
        if e != most_recent_elem:
            result.append(e)
            most_recent_elem = e
    return result

In [None]:
ls = friends = ["Joe", "Zoe", "Joe", "Angelina", "Zuki", "Thandi", "Zuki"]

In [None]:
print(ls)
ls.sort()
print(ls)
print(remove_adjacent_dups(ls))

The amount of work done in this algorithm is linear — each item in the list `xs` causes the loop to execute exactly once, and there are no nested loops.
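
The same "sort, then drop adjacent duplicates" idea can be sketched with `itertools.groupby` from the standard library, which groups consecutive equal items. This is an alternative sketch, not the lecture's implementation; the names `friends` and `unique_friends` are only for illustration.

```python
import itertools

friends = ["Joe", "Zoe", "Joe", "Angelina", "Zuki", "Thandi", "Zuki"]

# groupby yields one (key, group) pair per run of equal adjacent items,
# so after sorting, the keys are exactly the unique elements
unique_friends = [key for key, _group in itertools.groupby(sorted(friends))]
print(unique_friends)   # ['Angelina', 'Joe', 'Thandi', 'Zoe', 'Zuki']
```

Like `remove_adjacent_dups`, this only removes all duplicates when equal items are next to each other, which is why the list is sorted first.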

### Back to "A more realistic problem"

- Let us go back now to our analysis of *Alice in Wonderland*

- Before checking the words in the book against the vocabulary, we'll sort those words into order, and eliminate duplicates

$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/18/alice2.py>

In [None]:
all_words = get_words_in_book("alice_in_wonderland.txt")
all_words.sort()
book_words = remove_adjacent_dups(all_words)
print()
print("There are {0} words in the book. Only {1} are unique."
      .format(len(all_words), len(book_words)))
print("The first 12 words are:\n{0}".format(book_words[:12]))


Lewis Carroll was able to write a classic piece of literature using only 2570 different words!

## 18.6 Merging sorted lists

### Merging sorted lists

- Suppose we have two sorted lists (`xs` and `ys`)

- Devise an algorithm to merge them together into a single sorted list



- A simple approach is to concatenate the two lists and sort the result

```
   newlist = xs + ys
   newlist.sort()
```

- But this doesn't use the fact that the two lists are already sorted

- We can take advantage of this to devise a more efficient algorithm

- Sorting a list requires a more complicated algorithm (which is what  
   the `sort` method does)  



$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/18/merge.py>

### Merging one element at a time

Idea:

- compare the first element of both lists

- add the smaller of the two into the result

- continue with the remaining elements until at least one of the lists is empty

- if necessary, add any remaining elements to the end of the result


In [None]:
def merge(xs, ys):
    """ Merge sorted lists xs and ys. Return a sorted result. """
    result = []
    i = 0   # index of the next element of xs
    j = 0   # index of the next element of ys

    while i<len(xs) and j<len(ys):
        # Both lists still have items; add the smaller item to result.
        if xs[i] <= ys[j]:
            result.append(xs[i])
            i = i + 1
        else:
            result.append(ys[j])
            j = j + 1
    # loop ended
    if i >= len(xs):             # If xs list is finished,
        result.extend(ys[j:])    # Add remaining items from ys
    elif j >= len(ys):           # Same, but swapping roles
        result.extend(xs[i:])
    return result


Try it here:

In [None]:
ls1 = ["Joe", "Zoe", "Brad"]
ls2 = ["Angelina", "Zuki", "Thandi", "Paris"]

In [None]:
ls1.sort()
print(ls1)
ls2.sort()
print(ls2)
ls3 = merge(ls1, ls2)
print(ls3)
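
For comparison, the standard library offers `heapq.merge`, which merges sorted inputs lazily. The sketch below uses it on the same names; the variable names `sorted1`, `sorted2` and `merged` are only for illustration, and both inputs are assumed to be sorted before the call.

```python
import heapq

sorted1 = sorted(["Joe", "Zoe", "Brad"])
sorted2 = sorted(["Angelina", "Zuki", "Thandi", "Paris"])

# heapq.merge returns an iterator over the merged items,
# so we wrap it in list() to see the whole result
merged = list(heapq.merge(sorted1, sorted2))
print(merged)
# ['Angelina', 'Brad', 'Joe', 'Paris', 'Thandi', 'Zoe', 'Zuki']
```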

## 18.7 Alice in Wonderland, again!

### Back to "A more realistic problem"

- Let us go back now to our analysis of *Alice in Wonderland*

- Previously we sorted the words from the book, and eliminated duplicates

- Our vocabulary is also sorted

- So finding all items in the second list that are not in the first list would be another way to implement `find_unknown_words`


Instead of searching for every word in the dictionary (either by linear or binary search), we can use a variant of the merge algorithm above to return the words that occur in the book, but not in the vocabulary.

$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/18/alice3.py>

In [None]:
def find_unknowns_merge_pattern(vocab, wds):
    """ Both the vocab and wds must be sorted.  Return a new
        list of words from wds that do not occur in vocab.
    """
    result = []
    i = 0   # index in vocab
    j = 0   # index in wds

    while i<len(vocab) and j<len(wds):
        if vocab[i] == wds[j]:    # good, word exists in vocab
            j = j+1               # so move on to the next book word

        elif vocab[i] < wds[j]:   # move past this vocab word
            i = i+1

        else:                       # word is not in vocab
            result.append(wds[j])
            j = j+1
    # end of loop
    if i >= len(vocab):        # there are no more words in vocab
        result.extend(wds[j:])

    return result

Find the unknown words, again:

In [None]:
import time
all_words = get_words_in_book("alice_in_wonderland.txt")
t0 = time.perf_counter()

all_words.sort()
book_words = remove_adjacent_dups(all_words)
missing_words = find_unknowns_merge_pattern(bigger_vocab, book_words)

t1 = time.perf_counter()
print("There are {0} unknown words.".format(len(missing_words)))
print("That took {0:.4f} seconds.".format(t1-t0))

Another increase in performance, although not as impressive as the first one.
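
A merge-style function is easy to get subtly wrong, so it helps to try the pattern on tiny inputs and cross-check the answer. The sketch below is a compact, hypothetical restatement of the pattern (`unknowns_by_merge`, `small_vocab` and `small_book` are not part of the lecture code), checked against a set difference.

```python
def unknowns_by_merge(vocab, wds):
    """ Both lists must be sorted. Return the words of wds not in vocab. """
    result = []
    i = 0   # index in vocab
    j = 0   # index in wds
    while i < len(vocab) and j < len(wds):
        if vocab[i] == wds[j]:      # word is known: move to the next word
            j += 1
        elif vocab[i] < wds[j]:     # vocab word is smaller: skip it
            i += 1
        else:                       # vocab has moved past it: word is unknown
            result.append(wds[j])
            j += 1
    result.extend(wds[j:])          # leftovers (if any) cannot be in vocab
    return result


small_vocab = ["a", "cat", "dog", "the"]
small_book = ["ant", "cat", "the", "zebra"]

print(unknowns_by_merge(small_vocab, small_book))      # ['ant', 'zebra']
print(sorted(set(small_book) - set(small_vocab)))      # same words, via sets
```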

### What we've learned

- We started with a word-by-word **linear lookup** in the vocabulary that ran in about 50 seconds

- We implemented a clever **binary search**, and got that down to 0.22 seconds, more than 200 times faster

- But then we did something even better: we sorted the words from the book, eliminated duplicates, and used a **merging pattern** to find words from the book that were not in the vocabulary; this was about five times faster than even the binary lookup algorithm

- In the end, our algorithm is more than 1000 times faster than our first attempt!

# Further reading

### How Binary Search Makes Computers Much, Much Faster

Tom Scott

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('KXJSjte_OAI')

-- João Correia Lopes, Nuno Macedo & Pedro Vasconcelos