<img src="dsci512_header.png" width="600">

# Lecture 2

In [1]:
import numpy as np
import pandas as pd
from collections import defaultdict

import matplotlib.pyplot as plt
import altair as alt

Outline:

- Linear search and binary search intro (15 min)
- Code timings (15 min)
- Sorting (10 min)
- Break (5 min)
- Hash tables, hash functions (15 min)
- Lookup tables, Python `dict` (5 min)
- Python's `defaultdict` (15 min)

## Learning objectives

- Describe some basic sorting algorithms and their time complexities.
- Describe the binary search algorithm.
- Explain why searching in a sorted list is faster than in an unsorted list.
- Explain the pros and cons of hash tables, and how they work at a high level.
- Apply python's `dict` and `defaultdict` data structures.

## Linear search and binary search intro (15 min)

We return to the problem of checking whether an element is present in a collection.

#### Linear search

In [2]:
def search_unsorted(data, key):
    """
    Searches the key in data using linear search 
    and returns True if found and False otherwise. 

    Parameters
    ----------
    data : list
           the elements to search within
    key  : int
           the key to search for

    Returns
    -------
    bool
        boolean if key is contained in the data 

    Examples
    --------
    >>> search_unsorted([1, 7, 67, 35, 45], 3)
    False
    >>> search_unsorted([1, 7, 67, 35, 45], 7)
    True
    """

    for element in data:
        if element == key:
            return True
    return False

In [3]:
# Some tests

# key is the first element in the list
assert search_unsorted([4, 7, 9, -12, 1000], 4)

# key is the last element in the list
assert search_unsorted([4, 7, 9, -12, 1000], 1000)

# key occurs multiple times in the list
assert search_unsorted([4, 7, 9, -12, 4, 1000], 4)

# key is larger than the largest element in the list
assert not search_unsorted([4, 7, 9, -12, 1000], 2000)

# key is smaller than the smallest element in the list
assert not search_unsorted([4, 7, 9, -12, 1000], -18)

# nothing is in an empty list
assert not search_unsorted([], 1)

**Question:** What is the time complexity of the `search_unsorted`, as a function of the length of the list, $n$? 

<br><br><br><br><br><br><br><br> 
**Answer:** The time complexity of the `search_unsorted` function is $O(n)$ because in the worst case the function loops over $n$ elements. 

#### Binary search

- If the list is already sorted, we can search much faster with _binary search_.
- See the "binary search video" that was recommended as pre-class viewing. 
- We start in the middle and can just restrict ourselves to searching half the list after a comparison.
- Note: the input list must be sorted for the code to work.

In [4]:
def search_sorted(data, key):
    """
    Searches the key in data using binary search 
    and returns True if found and False otherwise. 

    Parameters
    ----------
    data : list
           a list of sorted elements to search within
    key  : int
           the key to search for

    Returns
    -------
    bool :
        boolean if key is contained in the data 

    Examples
    --------
    >>> search_sorted([1, 7, 35, 45, 67], 3)
    False
    >>> search_sorted([1, 7, 35, 45, 67], 7)
    True
    """
    
    while len(data) > 0:
        mid = len(data)//2
        if data[mid] == key:
            return True
        if key < data[mid]:
            data = data[:mid] 
        else:
            data = data[mid+1:]
    return False

In [5]:
data = [-12, 4, 7, 9, 45, 45, 987, 1000, 2000]

In [6]:
# Test cases for binary search

# key is the first element in the list
assert search_sorted(data, -12) == True

# key is the last element in the list
assert search_sorted(data, 2000) == True

# key occurs multiple times in the list
assert search_sorted(data, 45) == True

# key is larger than the largest element in the list
assert search_sorted(data, 3000) == False

# key is smaller than the smallest element in the list
assert search_sorted(data, -18) == False

# nothing is in an empty list
assert search_sorted([], 1) == False
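The implementation above narrows the search by slicing `data`. With a Python `list`, each slice copies the elements it keeps, so here is a minimal sketch of an alternative that tracks two indices instead of slicing. The name `search_sorted_indices` and the variables `low` and `high` are illustrative and not part of the lecture code.

```python
def search_sorted_indices(data, key):
    """Binary search on sorted data without slicing (illustrative sketch)."""
    low, high = 0, len(data) - 1  # inclusive bounds of the remaining search range
    while low <= high:
        mid = (low + high) // 2
        if data[mid] == key:
            return True
        if key < data[mid]:
            high = mid - 1  # keep searching the left half
        else:
            low = mid + 1   # keep searching the right half
    return False


sorted_data = [-12, 4, 7, 9, 45, 45, 987, 1000, 2000]
print(search_sorted_indices(sorted_data, 45))    # True
print(search_sorted_indices(sorted_data, 3000))  # False
```

It should return the same answers as `search_sorted` on any sorted input; only the bookkeeping differs.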

**Question:** What is the time complexity of the `search_sorted`, as a function of the length of the list, $n$? 

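<br><br><br><br><br><br><br><br> **Answer:** The time complexity of the `search_sorted` function is $O(\log n)$ because the search space is halved in each iteration of the loop, so in the worst case the loop runs about $\log_2 n$ times. (This counts loop iterations. Slicing a Python `list` copies the elements it keeps, whereas slicing a NumPy array, as used in the timings below, returns a view without copying.)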

**Question:** What happens if you call `search_unsorted` on sorted data? What happens if you call `search_sorted` on unsorted data?

<br><br><br><br><br><br> **Answer:** The `search_unsorted` function does not care whether the data is sorted. Either way, it checks the elements one by one and returns `True` as soon as it finds the key. The `search_sorted` function, on the other hand, assumes the data is sorted in ascending order, so it may miss the key if called on unsorted data. Binary search relies on the fact that, at any position in a sorted list, every element to the left is less than or equal to the element at that position and every element to the right is greater than or equal to it. That is what lets it decide which half of the list to search next.

For example:

In [7]:
search_sorted([3, 2, 1], 1)

False

**Question:** Why doesn't the `search_sorted` function start by verifying that the list is indeed sorted?

<br><br><br><br><br><br>
**Answer:** because this would take $O(n)$ time, defeating the purpose of the $O(\log\, n)$ lookup.
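To see why the check is expensive, here is a minimal sketch of a sortedness check (the helper name `is_sorted` is hypothetical). It has to compare every adjacent pair, so on sorted input it looks at all $n$ elements before the binary search even begins.

```python
def is_sorted(data):
    """Return True if data is in ascending order (illustrative helper)."""
    # Compare each element with its successor: up to n - 1 comparisons.
    return all(data[i] <= data[i + 1] for i in range(len(data) - 1))


print(is_sorted([1, 2, 2, 5]))  # True
print(is_sorted([3, 2, 1]))     # False
```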

## Code timing (15 min)

Below we empirically measure the running times of 4 approaches:

1. Using `in` with a Python `set` (same as last class)
2. `search_sorted`, that is, our implementation of binary search above
3. Using `in` with a Python `list` (same as last class)
4. `search_unsorted`, that is, our implementation of linear search above

**Question:** Why do I search for $-1$ in the code below? Why not $1$?

In [8]:
list_sizes = [100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]

results = defaultdict(list)
results["size"] = list_sizes

key = -1

for list_size in list_sizes:
    print('List size: ', list_size)
    x = np.random.randint(1e8, size=list_size)

    time = %timeit -q -o -r 1 search_unsorted(x, key)
    results["Unsorted list linear"].append(time.average)
    # Note: -q prevents it from printing to the terminal
    #       -o returns the result as an object (its .average is the average time in seconds)
    #       -r 1 runs only 1 repeat instead of the default of 7, which saves time

    time = %timeit -q -o -r 1 (key in x)
    results["Unsorted list in"].append(time.average)

    x.sort()
    time = %timeit -q -o -r 1 search_sorted(x, key)
    results["Sorted list binary"].append(time.average)

    x_set = set(x)
    time = %timeit -q -o -r 1 (key in x_set)
    results["Python set in"].append(time.average)

List size:  100
List size:  1000
List size:  10000
List size:  100000
List size:  1000000
List size:  10000000


**Answer**: We search for $-1$ because we know it will not be in the array (`np.random.randint(1e8, size=list_size)` only generates non-negative integers). This gives us a worst-case timing: a search is slowest when it has to keep looking all the way to the end, and it can be much faster if it finds the key right away. For example, if the key were the first element, linear search would seem extremely fast.

In [9]:
df = pd.DataFrame(results, columns=list(results.keys()))
df

Unnamed: 0,size,Unsorted list linear,Unsorted list in,Sorted list binary,Python set in
0,100,1.6e-05,2e-06,5e-06,3.961091e-08
1,1000,0.000171,3e-06,8e-06,4.132132e-08
2,10000,0.001515,6e-06,1e-05,3.969443e-08
3,100000,0.015571,3.8e-05,1.2e-05,4.082971e-08
4,1000000,0.151856,0.000517,1.5e-05,4.082498e-08
5,10000000,1.524873,0.008633,1.8e-05,4.522875e-08


Are these consistent with the time complexities we expected?

Reading runtimes from a table: what happens if $N$ becomes $10N$?

- Linear: time $T$ goes up to $10T$
- Logarithmic: time $T$ goes up to $T+\Delta T$
- Constant: time $T$ stays about the same
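As a rough check of these three rules, the sketch below applies them to the timings in the table above. The numbers are copied from that table; the helper names `ratios` and `differences` are illustrative.

```python
# Timings (in seconds) copied from the table above, for sizes 100 ... 10,000,000.
linear = [1.6e-05, 0.000171, 0.001515, 0.015571, 0.151856, 1.524873]
binary = [5e-06, 8e-06, 1e-05, 1.2e-05, 1.5e-05, 1.8e-05]
set_in = [3.961091e-08, 4.132132e-08, 3.969443e-08,
          4.082971e-08, 4.082498e-08, 4.522875e-08]


def ratios(times):
    """Ratio T(10N) / T(N) for consecutive rows."""
    return [later / earlier for earlier, later in zip(times, times[1:])]


def differences(times):
    """Difference T(10N) - T(N) for consecutive rows."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


print("linear ratios:     ", [round(r, 1) for r in ratios(linear)])
print("binary differences:", differences(binary))
print("set ratios:        ", [round(r, 2) for r in ratios(set_in)])
```

For our linear search the ratios stay close to 10, for binary search each tenfold increase adds a few microseconds, and for the set the ratios stay close to 1.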

In [10]:
df_long = pd.melt(df, id_vars="size", var_name="method", value_name="time (s)")

alt.Chart(df_long).mark_line().encode(
    alt.X('size', scale=alt.Scale(type='log')),
    alt.Y('time (s)', scale=alt.Scale(type='log')),
    color='method'
).configure_axis(grid=False)

Note that the `search_sorted` function we wrote is actually slower than the linear search using `in` when the list has about $10,000$ elements or fewer. Remember, big-O is just an "asymptotic" trend. There could be:

- Large/small constants, like $1000\log(n)$ vs. $n$.
- "Lower order terms", like $\log(n)+100 \log \log(n) + 100$ vs. $n$.

We are probably seeing the former; my code performs fewer steps (better complexity), but each step is much slower because the implementation is not optimized. (Often, though, we don't care too much about how code performs for very small inputs.)
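To make the first bullet concrete, the sketch below compares the hypothetical costs $1000\log_2(n)$ and $n$ for increasing $n$. The constant 1000 comes from the example above; the base-2 logarithm and the doubling search are assumptions made for illustration, not measurements of our code.

```python
import math


def slow_log_cost(n):
    """Good complexity, large constant."""
    return 1000 * math.log2(n)


def linear_cost(n):
    """Worse complexity, small constant."""
    return n


n = 2
while slow_log_cost(n) >= linear_cost(n):
    n *= 2  # double n until the logarithmic cost finally wins

print("First power of two where 1000*log2(n) < n:", n)
```

Below that crossover, the method with the better big-O is the slower one.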

(**Note:** the last ~10 minutes of material are not easy and very important. If you didn't completely follow, please review them later and ask questions as needed!)

The two fastest methods (the orange and blue curves) look quite similar, possibly because of the log scale on the y-axis. Let's plot time vs. $\log n$ to try and tell the difference.

In [11]:
df_long = pd.melt(df[["size", "Sorted list binary", "Python set in"]],
                  id_vars="size", var_name="method", value_name="time (s)")

alt.Chart(df_long).mark_line().encode(
    alt.X('size', scale=alt.Scale(type='log')),
    alt.Y('time (s)'),
    color='method'
).configure_axis(grid=False)


We can see that the set really is constant, but the binary search is logarithmic - in the plot, time is linear in $\log(n)$.

## Sorting (10 min)

- Sorting is a very popular topic in Algorithms and Data Structures courses.
- We'll start with "[selection sort](https://en.wikipedia.org/wiki/Selection_sort)".

In [12]:
def selection_sort(x):
    """Sorts x inplace... slowly.

    Parameters
    ----------
    x : list
           the list needed to be sorted

    Returns
    -------
    list :
        the sorted list 

    Examples
    --------
    >>> selection_sort([7, 1, 67, 35, 45])
    [1, 7, 35, 45, 67]
    >>> selection_sort([357, 6, 55, 12, 112])
    [6, 12, 55, 112, 357]
    """

    n = len(x)

    for i in range(n):
        # Get the index of the smallest value from location i onward
        min_ind = np.argmin(x[i:]) + i

        # Swap this with element i
        x[i], x[min_ind] = x[min_ind], x[i] # swap
    return x

- How does the above code work?
  - For the remaining part of the array `x[i:]`, find the smallest element.
  - Swap this with the current element.
  - Repeat until reaching the end of the array.
- We will revisit this code next class

**Question:** What is the time complexity of this method?

<br><br><br><br><br><br>

**Answer:** $O(n^2)$. `argmin` itself takes $O(n)$, and this is called $n$ times, for a total of $O(n^2)$. The actual number of steps is more like $n^2/2$.

<br>

Detailed analysis:

- The swapping takes constant time.
- The call to `np.argmin` needs to look through the array.
  - This takes time proportional to the length of the array.
  - The first time, the length is $n$. Then $n-1$, then $n-2$, etc.
- The number of steps in the above code is

$n+(n-1)+(n-2)+\ldots+3+2+1$

This is an arithmetic series; the sum is $\frac{n(n+1)}{2}=\frac12 n^2 + \frac{n}{2}$

- We ignore the $\frac{n}{2}$ because it is very small compared to $\frac12 n^2$ when $n$ is large.
- E.g. for $n=1000$ we have $\frac{1}{2}n^2= 500000$ vs. $\frac{n}{2}=500$.
- We ignore the $\frac12$ because we're only interested in the growth, not the scaling factor.
- Result: we say the above code is $O(n^2)$.

Put another way, for our purposes this is the same as code that performs $n$ steps inside the loop. The fact that it actually decreases each time is not important enough to show up in the Big O.
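The count above can be checked with a short sketch. The function name `selection_sort_steps` is hypothetical; it only counts how many elements the scans over `x[i:]` would inspect, without sorting anything.

```python
def selection_sort_steps(n):
    """Count the elements inspected by the scans for a list of length n."""
    steps = 0
    for i in range(n):
        steps += n - i  # the scan over x[i:] looks at n - i elements
    return steps


for n in [10, 100, 1000]:
    steps = selection_sort_steps(n)
    # Compare against the closed form n(n+1)/2 and show steps / n^2.
    print(n, steps, n * (n + 1) // 2, steps / n**2)
```

As $n$ grows, the last column approaches $\frac12$, which is the constant that big-O drops.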

**Question:** could we find a sorting algorithm that takes $\log(n)$ time?

<br><br><br><br><br><br>

**Answer:** no way, because it takes $n$ steps to even inspect every element of the input!

- The real answer is that the best general-purpose (comparison-based) sorting algorithms take $O(n \log n)$ time. This is close enough to $O(n)$ that we should be very happy with the result. 
- If you are interested, you can read more about [mergesort](https://www.geeksforgeeks.org/merge-sort/) and [quicksort](https://www.geeksforgeeks.org/quick-sort/). We may go into this a bit next week.
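For the curious, here is a minimal sketch of merge sort, one of the $n \log(n)$ algorithms mentioned above. It is written for readability rather than speed, and the function name `merge_sort` is our own choice; in practice you would use Python's built-in `sorted`.

```python
def merge_sort(x):
    """Return a new sorted list (illustrative merge sort sketch)."""
    if len(x) <= 1:
        return list(x)

    # Split in half and sort each half recursively.
    mid = len(x) // 2
    left = merge_sort(x[:mid])
    right = merge_sort(x[mid:])

    # Merge the two sorted halves by repeatedly taking the smaller front element.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two
    merged.extend(right[j:])  # still has elements left
    return merged


print(merge_sort([7, 1, 67, 35, 45]))  # [1, 7, 35, 45, 67]
```

The list is halved about $\log_2 n$ times, and each level of halving does $O(n)$ work to merge, which is where $n \log(n)$ comes from.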

## Break (5 min)

## Hash tables, hash functions (15 min)


- Python's `set` type supports the following operations in $O(1)$ time:
  - inserting a new element
  - deleting an element
  - checking if an element is present

How could we implement this using the tools we already have?

- Well, what about using linear search to find elements, e.g. a `list`?
  - This is too slow
- What about using binary search?
  - Now searching is fast, but insertion/deletion is slow, because we need to maintain an ordered list
- Enter the [hash table](https://en.wikipedia.org/wiki/Hash_table) - to save the day!
  - Trees could also work (see Lecture 4), but hash tables are the most popular.
  
#### Hash functions

Python objects have a _hash_:

In [13]:
hash("mds")

-7949029789803057674

In [14]:
hash("")

0

It looks like the hash function returns an integer.

In [15]:
hash(5.5)

1152921504606846981

In [16]:
hash(5)

5

In [17]:
hash(-9999)

-9999

It looks like the hash of a Python integer is the integer itself, at least for small enough integers.

In [18]:
hash(999999999999999999999999)

2003764205207330319

Sometimes it fails?

In [19]:
hash([1, 2, 3])

TypeError: unhashable type: 'list'

In [20]:
hash((1, 2, 3))

529344067295497451

In [21]:
hash(None)

281576614

If a Python `set` is a hash table, that means items in it must be hashable (`dict` has the same requirement, for keys):

In [22]:
s = set()

In [23]:
s.add(5.5)

In [24]:
s.add("mds")

In [25]:
s

{5.5, 'mds'}

In [26]:
s.add([1, 2, 3])

TypeError: unhashable type: 'list'

In [27]:
s.add((1, 2, 3))

In [28]:
s

{(1, 2, 3), 5.5, 'mds'}

- Typically, mutable objects are not hashable.
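A quick way to explore this rule is to try hashing a few built-in objects. The helper name `is_hashable` below is hypothetical; it simply reports whether `hash` raises a `TypeError`.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (illustrative helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


examples = [(1, 2, 3), [1, 2, 3], frozenset({1, 2}), {1, 2}, "mds", {"a": 1}, (1, [2])]
for obj in examples:
    print(type(obj).__name__, obj, is_hashable(obj))
```

Note the last example: a tuple is immutable, but it is only hashable if everything inside it is hashable too.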

#### Hash tables

- So, it looks like the hash function maps from an object to an integer.
- And that Python `set`s use these hash functions.
- How do they work?
- The hash table is basically a list of lists, and the hash function (mod the array size) maps an object to its location in the outer list.
  - But it's a bit more complicated than that.
  - The list typically expands and contracts automatically as needed.
  - These operations may be slow, but averaged or "amortized" over many operations, the runtime is $O(1)$
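  - The bucket index (the hash mod the array size) depends on this array size, so items have to be moved to new buckets whenever the array is resized.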
  - There's also an issue of collisions: when two different objects hash to the same place.
- Roughly speaking, we can insert, retrieve, and delete things in $O(1)$ time so long as we have a "good" hash function.
  - The hash function will be "good" for default Python objects, and if you end up needing to implement your own one day you should read a bit more about it.

#### A simple hash table implementation

Below is a (very low-quality) hash table implementation, with only 4 buckets by default:

In [29]:
# myset = HashTable()
# myset.add(5)
# myset.contains(5)

In [30]:
# myset.contains(6)

In [37]:
class HashTable:
    
    def __init__(self, num_buckets=4):
        self.stuff = list() # A list of lists
        self.n = num_buckets
        
        for i in range(num_buckets):
            self.stuff.append([]) # Create the inner lists, one per bucket
        
    def add(self, item):
        if not self.contains(item):
            self.stuff[hash(item) % self.n].append(item)
        
    def contains(self, item):
        return item in self.stuff[hash(item) % self.n]
    
    def __str__(self):
        return str(self.stuff)

(Note: The `hash` function for strings has a random seed that is set at the start of every Python session, so your actual results may vary from mine.)

In [38]:
ht = HashTable()
print(ht)

[[], [], [], []]


- So far, all 4 buckets are empty. 
- Now let's add something:

In [39]:
ht.add("hello")
print(ht)

[[], [], [], ['hello']]


"hello" went into this bucket because 

In [40]:
hash("hello")

9217359063657435943

In [41]:
hash("hello") % 4

3

Now let's add more things:

In [42]:
ht.add("goodbye")
print(ht)

[[], ['goodbye'], [], ['hello']]


In [43]:
ht.add("test")
print(ht)

[['test'], ['goodbye'], [], ['hello']]


In [44]:
ht.add("item")
print(ht)

[['test'], ['goodbye'], ['item'], ['hello']]


In [45]:
ht.add("what")
print(ht)

[['test'], ['goodbye'], ['item'], ['hello', 'what']]


If we want to look for something:

In [46]:
ht.contains("blah")

False

False because 

In [47]:
hash("blah") % 4

2

And "blah" is not found in bucket 2.

In [48]:
ht.contains("item")

True

- Same thing here.
- The key idea is that you only need to look in one bucket - either it's in that bucket, or it's not in the _entire_ hash table.

In [49]:
print(ht)

[['test'], ['goodbye'], ['item'], ['hello', 'what']]


- Above we have a _collision_: that is, 2 items in the same bucket.
- If the main list is able to dynamically grow as the number of items grows, we can keep the number of collisions low.
- This preserves the $O(1)$ operations.
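The growth idea can be sketched as follows. This is an illustration only and not how Python's `set` is actually implemented: the class name `GrowingHashTable`, the doubling strategy, and the `max_load` threshold of one item per bucket on average are all assumptions.

```python
class GrowingHashTable:
    """Hash table that doubles its number of buckets when it gets crowded."""

    def __init__(self, num_buckets=4, max_load=1.0):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0             # number of stored items
        self.max_load = max_load  # assumed threshold: average items per bucket

    def _bucket(self, item):
        return self.buckets[hash(item) % len(self.buckets)]

    def contains(self, item):
        return item in self._bucket(item)

    def add(self, item):
        if self.contains(item):
            return
        if (self.size + 1) / len(self.buckets) > self.max_load:
            self._grow()
        self._bucket(item).append(item)
        self.size += 1

    def _grow(self):
        # Collect everything, double the bucket count, and re-place every item,
        # because hash(item) % len(self.buckets) changes with the new size.
        old_items = [item for bucket in self.buckets for item in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for item in old_items:
            self._bucket(item).append(item)


growing = GrowingHashTable()
for word in ["hello", "goodbye", "test", "item", "what", "blah"]:
    growing.add(word)

print(len(growing.buckets), growing.size)
print(growing.contains("item"), growing.contains("nope"))
```

Each call to `_grow` is slow because it touches every item, but doubling makes these calls rare, which is the "amortized" $O(1)$ behaviour described earlier.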

## Lookup tables, Python `dict` (5 min)

- Python's `dict` type is a dictionary (aka symbol table)
- A dictionary should support the following operations:
  - inserting a new element
  - deleting an element
  - finding an element
- It is much like a `set` except the entries, called "keys", now have some data payload associated with them, which we call "values".
- It is also implemented as a hash table, meaning you can expect $O(1)$ operations.
- Only the keys are hashed, so only the keys have to be hashable.
  - A list can be a value, but not a key.

In [50]:
d = dict()
d[5] = "a"
d["b"] = 9
d

{5: 'a', 'b': 9}

In [51]:
5 in d

True

In [52]:
9 in d  # it only searches the keys

False

In [53]:
d[5]

'a'

In [54]:
d[6]

KeyError: 6

Hashable types:

In [55]:
d[[1,2,3]] = 10

TypeError: unhashable type: 'list'

In [56]:
d[10] = [1,2,3] # OK

A reminder of some dictionary syntax:

In [57]:
f = {i: i*2 for i in range(10)}
f

{0: 0, 1: 2, 2: 4, 3: 6, 4: 8, 5: 10, 6: 12, 7: 14, 8: 16, 9: 18}

In [58]:
for key, val in f.items():
    print("key =", key, " val =", val)

key = 0  val = 0
key = 1  val = 2
key = 2  val = 4
key = 3  val = 6
key = 4  val = 8
key = 5  val = 10
key = 6  val = 12
key = 7  val = 14
key = 8  val = 16
key = 9  val = 18


Nested dictionaries:

In [59]:
g = dict()
g[5] = f
g

{5: {0: 0, 1: 2, 2: 4, 3: 6, 4: 8, 5: 10, 6: 12, 7: 14, 8: 16, 9: 18}}
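The cells above show inserting and finding; for completeness, here is a small sketch of the deletion operation using `del` and `dict.pop`. The dictionary name `scores` and its contents are made up for illustration.

```python
scores = {5: "a", "b": 9}

del scores[5]                  # remove a key; raises KeyError if the key is missing
print(scores)                  # {'b': 9}

value = scores.pop("b")        # remove a key and return its value
print(value, scores)           # 9 {}

missing = scores.pop("missing", None)  # pop with a default does not raise
print(missing)                 # None
```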

## Python's `defaultdict` (15 min)

_Meta-comment: As with parts of our labs, this could be considered DSCI 511 content. But 1 course isn't enough for all the Python we need!_

- Python dictionaries are super useful
- It's often the case that we want to add something to the value of a dictionary. 
- Example: listing multiples


In [60]:
multiples_of_5 = list()
for i in range(100):
    if i % 5 == 0:
        multiples_of_5.append(i)

print(multiples_of_5)

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]


In [61]:
multiples_of_2 = list()
for i in range(100):
    if i % 2 == 0:
        multiples_of_2.append(i)

print(multiples_of_2)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]


In [62]:
multiples_of_3 = list()
for i in range(100):
    if i % 3 == 0:
        multiples_of_3.append(i)

print(multiples_of_3)

[0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99]


In [63]:
multiples_of_4 = list()
for i in range(100):
    if i % 4 == 0:
        multiples_of_4.append(i)

print(multiples_of_4)

[0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96]


- But now let's say we want multiples of 2, 3, 4, 5, 6, 7, 8, 9 all in one place.
- Well, we don't want to violate DRY and copy-paste the above code.
- A dictionary would be ideal for this!

In [64]:
multiples_of = dict()

for base_number in range(2, 10):
    print("Finding the multiples of", base_number)
    
    for i in range(100):
        if i % base_number == 0:
            multiples_of[base_number].append(i)

print(multiples_of)

Finding the multiples of 2


KeyError: 2

- What happened here?
- I tried `multiples_of[base_number]` but that key was not present in the dictionary.
- I need to initialize all those lists!
- Another attempt:

In [65]:
multiples_of = dict()

for base_number in range(2, 10):
    print("Finding the multiples of", base_number)
    
    for i in range(100):
        if i % base_number == 0:
            if base_number not in multiples_of:    # added
                multiples_of[base_number] = list() # added
                
            multiples_of[base_number].append(i)

print(multiples_of)

Finding the multiples of 2
Finding the multiples of 3
Finding the multiples of 4
Finding the multiples of 5
Finding the multiples of 6
Finding the multiples of 7
Finding the multiples of 8
Finding the multiples of 9
{2: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98], 3: [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99], 4: [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96], 5: [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95], 6: [0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96], 7: [0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98], 8: [0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96], 9: [0, 9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 99]}


- This works but we Python users are a bit lazy.
- Enter the `defaultdict`.
- A dictionary with a default value for cases when the key does not exist.
- Use case here: the default is an empty list!

In [66]:
d = dict()
d["hello"]

KeyError: 'hello'

In [67]:
d.get("hello", 5) # equivalent to d["hello"] but returns 5 if "hello" not in d

5

In [68]:
d[4] = 100000
d.get(4, 5)

100000

In [69]:
from collections import defaultdict

In [70]:
dd = defaultdict(list)
dd["hello"]

[]

In [71]:
dd["new key"].append(5)

In [72]:
dd

defaultdict(list, {'hello': [], 'new key': [5]})

- The beauty here is that we can call `append` on a key that doesn't exist.
- It defaults to a new empty list and then immediately appends. 
- So... our original (broken) code works again, if we change `multiples_of` to a `defaultdict`:

In [74]:
multiples_of = defaultdict(list)

for base_number in range(2, 10):
    print("Finding the multiples of", base_number)
    
    for i in range(100):
        if i % base_number == 0:
            multiples_of[base_number].append(i)

print(multiples_of)

Finding the multiples of 2
Finding the multiples of 3
Finding the multiples of 4
Finding the multiples of 5
Finding the multiples of 6
Finding the multiples of 7
Finding the multiples of 8
Finding the multiples of 9
defaultdict(<class 'list'>, {2: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98], 3: [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99], 4: [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96], 5: [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95], 6: [0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96], 7: [0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98], 8: [0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96], 9: [0, 9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 99]})

#### (optional) Type of defaultdict

- In DSCI 511 you saw classes and inheritance.
- Just for fun, we can look at the implementation of `defaultdict` in PyPy, which is an implementation of Python in Python (the OG Python is written in C): https://github.com/reingart/pypy/blob/master/lib_pypy/_collections.py#L387
- We can see here that `defaultdict` inherits from `dict`. Indeed:

In [75]:
type(d)

dict

In [76]:
type(dd)

collections.defaultdict

In [77]:
type(d) == type(dd)

False

In [78]:
isinstance(d, dict)

True

In [79]:
isinstance(dd, dict)

True

In [80]:
type(dd) == dict

False

So in general people prefer the use of `isinstance(obj, class)` over `type(obj) == class`.

#### list vs. list()

- Question: why was it `defaultdict(list)` instead of `defaultdict(list())`?
- What happens when I run this code:

In [81]:
my_list = list()
my_list

[]

In [82]:
bad = defaultdict([])

TypeError: first argument must be callable or None

In [83]:
bad = defaultdict(list())

TypeError: first argument must be callable or None

- But why?
- The code that executes first is `list()`.
- Now we have _a particular list_ being passed in.
- But we don't want one particular list, we want lots of new lists all the time.
- So you need to pass in _the ability to create lists_ 
  - in other words, a function that creates lists.
- That is `list`

In [84]:
list

list

In [85]:
list()

[]

In [2]:
x = defaultdict(lambda : "hello my name is Arman")

In [3]:
x[5]

'hello my name is Arman'
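To see that the callable really is invoked once per missing key, here is a small sketch with a factory that records its own calls. The names `make_list`, `calls`, and `per_key` are hypothetical.

```python
from collections import defaultdict

calls = []


def make_list():
    """Hypothetical factory that records each call, then returns a new list."""
    calls.append("called")
    return []


per_key = defaultdict(make_list)
per_key["a"].append(1)  # missing key: the factory is called
per_key["b"].append(2)  # missing key: the factory is called again
per_key["a"].append(3)  # existing key: the factory is not called

print(dict(per_key))                 # {'a': [1, 3], 'b': [2]}
print(len(calls))                    # 2
print(per_key["a"] is per_key["b"])  # False
```

Every missing key gets its own fresh list, which is exactly what passing one particular list could not provide.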

- In lab you need to count occurrences.
- So, in that case, what is my default value?

In [88]:
text = "Blah blah is he still talking about dictionaries??"

(This can be done with a built-in string method)

In [89]:
text.count('a')

5

(But in the lab we're doing something more sophisticated so let's ignore this for now)

In [90]:
number_of_a = 0
for t in text:
    if t == "a":
        number_of_a += 1
        
number_of_a

5

- Ok, but now we want to count "a" and "b".
- Same as before, let's use a dict.
- We already know this won't work - it's the same problem as before:

In [91]:
number_of_times = dict()

for char in ('a', 'b'):
    for t in text:
        if t == char:
            number_of_times[char] = number_of_times[char] + 1
        
number_of_times

KeyError: 'a'

- So, we use a defaultdict.
- What do we want the default value to be?

<br><br><br>

Answer: 0

In [92]:
defaultdict(0)

TypeError: first argument must be callable or None

Same problem as before! So:

In [93]:
defaultdict(lambda: 0)

defaultdict(<function __main__.<lambda>()>, {})

In [94]:
defaultdict(int)

defaultdict(int, {})

In [95]:
int()

0

Back to the code:

In [96]:
number_of_times = defaultdict(int)

for char in ('a', 'b', 'c'):
    for t in text:
        if t == char:
            number_of_times[char] += 1
        
number_of_times

defaultdict(int, {'a': 5, 'b': 2, 'c': 1})

#### Counter

And finally, for the supremely ~~lazy~~ awesome, we can use `Counter`:

In [97]:
from collections import Counter

This is basically a `defaultdict(int)` but with some fancy methods added:

In [98]:
number_of_times = Counter()

for char in ('a', 'b'):
    for t in text:
        if t == char:
            number_of_times[char] += 1
        
number_of_times

Counter({'a': 5, 'b': 2})

In [99]:
number_of_times.most_common(1)

[('a', 5)]
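`Counter` can also do the counting itself when given an iterable, so the nested loops are not strictly needed. A minimal sketch, using the same `text` as above (the name `counts` is our own):

```python
from collections import Counter

text = "Blah blah is he still talking about dictionaries??"
counts = Counter(text)  # counts every character in one pass

print(counts["a"], counts["b"])  # 5 2
print(counts["z"])               # 0 (a missing key counts as zero, no KeyError)
```

Unlike the loop version, this counts every character in `text`, including spaces and punctuation.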