# Class 16: Reinforcing data structures with unique elements

The previous class introduced sets and dictionaries: how to construct, access, and mutate these objects. We also looked at some real-world examples of how APIs like Twitter's return structured metadata in the form of dictionaries. 

Dictionaries offer $O(1)$ average-case performance for accessing and mutating data, which lists cannot match when looking items up by value. In this class, we'll explore how dictionaries and sets take advantage of [hash functions](https://en.wikipedia.org/wiki/Hash_function) to achieve this performance. Hashing also has [applications for cryptography](https://en.wikipedia.org/wiki/Cryptographic_hash_function): it helps secure data by making it difficult for unauthenticated users to recover the original, and it makes it obvious if someone has altered the data. 

This lesson is adapted from the following resources:

* [Hashing Strings with Python](http://pythoncentral.io/hashing-strings-with-python/), Python Central.
* [Python dictionary implementation](http://www.laurentluce.com/posts/python-dictionary-implementation/), Laurent Luce.
* [What happens when you mess with hashing in Python](http://www.asmeurer.com/blog/posts/what-happens-when-you-mess-with-hashing-in-python/), Aaron Meurer's Blog.
* [Hashing](http://interactivepython.org/runestone/static/pythonds/SortSearch/Hashing.html), Problem Solving with Algorithms and Data Structures.
* [What does `hash` do in python?](http://stackoverflow.com/questions/17585730/what-does-hash-do-in-python), Stack Overflow.

## Hashing functions
A hash function takes an input of variable length and converts it to a fixed-length output. For cryptographic hash functions in particular, calculating the hash of an input is extremely fast, but working backwards from a hash to the input is enormously expensive. This makes them great for cryptography, where you want to verify information easily without making it easy to recover the original. 

The value it puts out can be called a hash, message digest, hash value, or checksum. 

A given input always produces the same hash, and a good hash function spreads different inputs across different outputs. But because arbitrarily many possible inputs map onto a fixed number of possible outputs, *collisions* can occur where two different inputs result in the same hash value. Understanding how likely this is requires diving deeper into the mathematical theory of hashing functions, so we'll simply assume that collisions are rare enough to ignore for a well-designed hash function.
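
To make collisions concrete, here is a minimal sketch of a deliberately weak hash function. The name `toy_hash` and the choice of 8 buckets are made up for illustration; this is not how Python's built-in `hash` works.

```python
def toy_hash(text, num_buckets=8):
    """Hypothetical toy hash: add up the character codes, then wrap around."""
    return sum(ord(char) for char in text) % num_buckets

# Anagrams contain the same characters, so their sums (and hashes) are identical
print(toy_hash('ab'), toy_hash('ba'))

# With only 8 possible outputs, 9 different inputs must share at least one output
words = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
hashes = [toy_hash(word) for word in words]
print(len(set(hashes)) < len(words))
```

Real hash functions have far more possible outputs than 8, which is why collisions are rare in practice, but the same counting argument still applies to them.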

Vanilla Python has a [`hash`](https://docs.python.org/3/library/functions.html#hash) function we can call without importing anything, as well as a [`hashlib`](https://docs.python.org/3/library/hashlib.html#) module that implements cryptographic hash functions. In Python 3.3 and up, the hash of a string like `hash('a')` differs from session to session because it depends on a seed randomized at the start of each session. This is primarily for security, so the string hashes shown below will not match what you see on your own machine.

In [1]:
hash(0)

0

In [7]:
hash(1)

1

In [3]:
hash(1.0)

1

In [4]:
hash(True)

1

In [5]:
hash(False)

0
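
The cells above show that `1`, `1.0`, and `True` all hash to the same value. Because they also compare equal to each other, a dictionary treats them as a single key. A small sketch (the name `same_key` is just for illustration):

```python
# 1, 1.0, and True compare equal and hash equally, so they share one slot
same_key = {}
same_key[1] = 'int'
same_key[1.0] = 'float'
same_key[True] = 'bool'

print(len(same_key))   # one key, not three
print(same_key)        # the first key object is kept; the last value wins
```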

In [1]:
hash('1')

-2503828751330989411

In [2]:
hash('a')

1138401930875163393

We can hash longer strings as well. The same idea, using the cryptographic hash functions in `hashlib` rather than the built-in `hash`, is often how passwords are securely stored in databases.

In [5]:
_short = 'abcd'
_long = 'abcd'*100000000

In [6]:
hash(_short)

-7672584178109169190

In [7]:
hash(_long)

-8229056510577254563
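
The built-in `hash` values for strings change between sessions, so they are not suitable for anything that gets saved. A digest from `hashlib` is stable. Here is a minimal sketch using SHA-256; the helper name `sha256_hex` is an assumption made for this example.

```python
import hashlib

def sha256_hex(text):
    """Return the SHA-256 digest of a string as hexadecimal text."""
    # hashlib works on bytes, so the string is encoded first
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

short_digest = sha256_hex('abcd')
long_digest = sha256_hex('abcd' * 1000)

# Both digests are 64 hex characters, regardless of the input length
print(len(short_digest), len(long_digest))

# The same input gives the same digest every time, in every session
print(short_digest == sha256_hex('abcd'))
```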

We haven't talked about tuples much. Like a list, a tuple is a linear data structure, but it is "immutable": we can't change it (*i.e.*, use mutator methods like append) once it's created. One reason we'd want this is that we can hash tuples, as long as everything inside them is hashable too!

In [10]:
['a','b','c'].append('d')

In [14]:
('a','b','c')[1]

'b'

In [15]:
hash(('a','b','c','d'))

1742628348583651419

In contrast, mutable data structures like lists, sets, and dictionaries can't be hashed and will raise a `TypeError` if you try to hash them.

In [16]:
hash(['a','b','c','d'])

TypeError: unhashable type: 'list'

In [17]:
string_set = set(['a','b','c','d'])
hash(string_set)

TypeError: unhashable type: 'set'

In [18]:
hash({'a':1})

TypeError: unhashable type: 'dict'
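
When you need to hash the contents of a list or set, one common workaround is to copy them into an immutable counterpart first: `tuple` for lists and `frozenset` for sets. A short sketch (the helper `is_hashable` is hypothetical, written just for this example):

```python
letters = ['a', 'b', 'c', 'd']

def is_hashable(obj):
    """Hypothetical helper: True if hash() accepts the object."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable(letters))             # list
print(is_hashable(tuple(letters)))      # immutable copy of the list
print(is_hashable(set(letters)))        # set
print(is_hashable(frozenset(letters)))  # immutable copy of the set
```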

You can also hash an instance of a class you define yourself.

In [19]:
class Foo(object):
    def __init__(self,identifier):
        self.identifier = identifier
        self.items = []
        
    def add_item(self,item):
        self.items.append(item)
        
f = Foo(1)
hash(f)

-9223372036579304234

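Interestingly, after we mutate our object it still has the same hash. By default, the hash of an instance of a user-defined class is derived from the object's identity rather than its contents.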

In [20]:
f.add_item(1)
hash(f)

-9223372036579304234
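
If you want two objects with the same contents to hash the same way, a class can define `__eq__` and `__hash__` together. The `Point` class below is a hypothetical example, not something from earlier classes, and it assumes the coordinates are not changed after the object is used as a key or set member.

```python
class Point:
    """Hypothetical example class whose hash depends on its contents."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        # Two points are equal when their coordinates match
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Reuse tuple hashing so equal points get equal hashes
        return hash((self.x, self.y))

p = Point(1, 2)
q = Point(1, 2)
print(p == q, hash(p) == hash(q))
print(len({p, q}))   # the set keeps only one of the two equal points
```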

## Dictionaries rely on hashing

Remember dictionaries allow us to map unique keys to their values. The keys must be unique, but they don't have to be similarly typed. `test_dict_1` below has keys 'a' and 2 mapping to values 1 and 'b', respectively.

In [21]:
test_dict_1 = {'a':1,2:'b'}
test_dict_1

{'a': 1, 2: 'b'}

In [22]:
test_dict_1['a']

1

In [23]:
test_dict_1[2]

'b'

One important assumption we made was that these keys are hashable objects like integers and strings. We saw in Class 15 that the values in dictionaries like those returned by Twitter's or Wikipedia's APIs can be strings, integers, lists, or other dictionaries. 

But what happens if we try to make a dictionary where the key is a list or a dictionary?

In [24]:
test_dict_2 = {['a','b']:1,{'c':3}:4}

TypeError: unhashable type: 'list'

We get the same error as we got above when we tried to call the hash function on a list or dictionary! So clearly when we construct a dictionary, the keys are being passed to the hash function, which means that only hashable objects like strings and integers can be keys in a dictionary! 

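Hashing is also consistent: two equal strings produce the same hash within a session, no matter which variable name points at them.

In [32]:
f = 'a'
h = 'a'

In [36]:
hash(f)

1138401930875163393

In [30]:
hash(h)

1138401930875163393

Note that we *can* hash an immutable object like a tuple, so we can also use tuples as keys in a dictionary.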

In [25]:
test_dict_3 = {('a','b'):1,('c',2,3):4}
test_dict_3

{('a', 'b'): 1, ('c', 2, 3): 4}
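
To see why keys have to be hashable, here is a heavily simplified sketch of a hash table. `ToyDict` and its method names are hypothetical, and it stores colliding keys in small lists; Python's real dictionary implementation is more sophisticated, but the role of `hash` is the same: it decides where to look.

```python
class ToyDict:
    """Hypothetical, simplified hash table using lists of (key, value) pairs."""
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, key):
        # hash() raises TypeError for unhashable keys, just like a real dict
        return self.buckets[hash(key) % len(self.buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket_for(key)
        for position, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[position] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def __getitem__(self, key):
        # Only one bucket is searched, not the whole table
        for existing_key, value in self._bucket_for(key):
            if existing_key == key:
                return value
        raise KeyError(key)

toy = ToyDict()
toy['a'] = 1
toy[('c', 2, 3)] = 4
toy['a'] = 10
print(toy['a'], toy[('c', 2, 3)])
```

Because the hash points straight at one bucket, a lookup only compares against the few keys stored there, which is where the fast average-case access comes from.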

## Memoization

Adapted from [Lee & Hubbard (2015) Ch. 5](https://learn.colorado.edu/d2l/le/content/190526/viewContent/2892655/View).

Recall the recursive Fibonacci function we wrote in Week 3.

In [37]:
def fib(n):
    if n < 2:
        return n
    else:
        return fib(n-1) + fib(n-2)

This function branches out and makes multiple recursive calls, so it operates in $O(2^N)$ time. Its call stack looks something like this:

* fib(5) = fib(4) + fib(3)
* fib(4) = fib(3) + fib(2)
* fib(3) = fib(2) + fib(1)
* fib(2) = fib(1) + fib(0)
* fib(1) = 1
* fib(0) = 0

But it's not quite that simple: 

![](http://i.imgur.com/P2Gfvbq.png)

We can see that some calls like `fib(3)` or `fib(2)` are made repeatedly. Once a value has been computed, we should store it so we don't need to compute it again. Rather than computing it recursively *each* time, we look up whether the value has already been computed and only compute it if it has not. This technique is called memoization, and a dictionary is a natural place to keep the stored values because lookups are fast.

In [38]:
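# Start off the memo with some base cases; it lives outside the function so it persists between calls
memo = {0:0,1:1}

def fib_memoized(num):
    # Negative inputs have no Fibonacci number and would recurse forever
    if num < 0:
        raise ValueError('num must be non-negative')
    
    # If the num has already been computed (or is a base case), return it
    if num in memo:
        return memo[num]
    
    # Compute and store the recursive case, reusing memoized values
    val = fib_memoized(num-1) + fib_memoized(num-2)
    memo[num] = val
    return val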

In [39]:
fib_memoized(5)

5
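
Python's standard library also offers a ready-made memoizer, `functools.lru_cache`, which stores results in a dictionary keyed on the function's arguments (so the arguments must be hashable). A minimal sketch, with `fib_cached` as a name chosen for this example:

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # keep every result that has been computed
def fib_cached(n):
    if n < 2:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)

print(fib_cached(50))
print(fib_cached.cache_info())   # hits and misses show how much work was reused
```

Each value from `fib_cached(0)` up to `fib_cached(50)` is computed only once, so the running time grows linearly with `n` rather than exponentially.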

## Using hashing to securely store passwords in a database

How would we store and retrieve a password in a database? (Adapted from [Hashing strings with Python](http://pythoncentral.io/hashing-strings-with-python/))

In [2]:
import hashlib
import uuid # Used to generate a random number

uuid.uuid4().hex

'eff1cee20b994124938c0fd612b5fa68'

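Make two functions to hash and check a password: `hash_password` and `check_password`. `hash_password` prepends a random *salt* to the password before hashing, so identical passwords are stored as different strings. The salt is saved next to the hash so that `check_password` can repeat the same computation.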

In [3]:
def hash_password(password):
    # uuid is used to generate a random number
    salt = uuid.uuid4().hex
    return hashlib.sha256(salt.encode() + password.encode()).hexdigest() + ':' + salt
    
def check_password(hashed_password, user_password):
    password, salt = hashed_password.split(':')
    return password == hashlib.sha256(salt.encode() + user_password.encode()).hexdigest()
 
new_pass = input('Please enter a password: ')
hashed_password = hash_password(new_pass)
print('The string to store in the db is: ' + hashed_password)
old_pass = input('Now please enter the password again to check: ')
if check_password(hashed_password, old_pass):
    print('You entered the right password')
else:
    print('I am sorry but the password does not match')

Please enter a password: password
The string to store in the db is: ebdd047de12687072c02fe2851d1abc887899d27dcdaa74dc89a94962f4ea2f2:ce57509a89634604966d54f61cf6c190
Now please enter the password again to check: password
You entered the right password


In [6]:
hash_password('password')

'1debf73c4db9056f821a21011e5d72d255ffdf235f5aa03263c6f2504953d166:4b01ec8271414f4dafca05ba7908042f'

In [7]:
hash_password('password')

'368becf2cdfd37b00ede26cdfd85178b2fd66232a43d14fe7dce84c69720758a:033e4cc4f19b4ae2b7c08d0b2eef0ef6'
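
The two calls above return different strings for the same password because each call draws a new salt. A single pass of SHA-256 is very fast, which also makes guessing fast, so real systems usually prefer a deliberately slow function. Below is a hedged sketch using `hashlib.pbkdf2_hmac` from the standard library; the function names and the iteration count are assumptions for illustration, not a security recommendation.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor for this sketch; tune for real use

def hash_password_slow(password):
    """Return 'hexdigest:hexsalt' using PBKDF2-HMAC-SHA256."""
    salt = os.urandom(16)  # 16 random bytes from the operating system
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, ITERATIONS)
    return digest.hex() + ':' + salt.hex()

def check_password_slow(stored, candidate):
    """Recompute the digest with the stored salt and compare."""
    digest_hex, salt_hex = stored.split(':')
    salt = bytes.fromhex(salt_hex)
    candidate_digest = hashlib.pbkdf2_hmac('sha256', candidate.encode(), salt, ITERATIONS)
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(candidate_digest.hex(), digest_hex)

stored = hash_password_slow('password')
print(check_password_slow(stored, 'password'))
print(check_password_slow(stored, 'passw0rd'))
```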

## Using hashing to simplify Wikipedia revision comparisons

We saw above that we can use the `hash` function to convert a string (among other kinds of objects) to a hashed value. We can pass in a very long string, such as the full text of a Wikipedia article, and still get back a short hash value that is, ideally, unique to that text.

In [49]:
import requests        # Import the requests library to let us talk to APIs
import numpy as np     # Import numpy (if we haven't already)
import pandas as pd    # Use pandas, which we'll discuss more after spring break, to look at the data

# Make a request dictionary with all the parameters the Wikipedia API wants for requests to use
request = {}                         # Start with an empty dictionary
request['action'] = 'query'          # Talk to the query end-point of the API
request['format'] = 'json'           # Return the data in JSON format
request['prop'] = 'revisions'        # Get the revisions of an article
request['titles'] = "Data mining"    # Get the data for the "data mining" article
request['rvlimit'] = 100             # Get the 100 most recent revisions to the article
request['formatversion'] = 2         # Output the data in a friendlier version

# Ask for specific properties of every revision, joining them together by pipes "|"
# User is the username, ids is the unique revision id, size is the number of characters in the content
# sha1 is the hash of the article version
# content is the wiki-text markup version of the article
request['rvprop'] = '|'.join(['user','timestamp','ids','size','sha1','content'])

# Make the request to Wikipedia's API and use the .json() method to convert the JSON response to a Python dictionary
result = requests.get('http://en.wikipedia.org/w/api.php', params=request).json()

# Look at the results
result['query']['pages'][0]['revisions'][0]

{'anon': True,
 'content': '{{Distinguish|analytics|information extraction|data analysis}}\n{{machine learning bar}}\n\n\'\'\'Data mining\'\'\'  is the computational process of discovering patterns in large [[data set]]s involving methods at the intersection of [[artificial intelligence]], [[machine learning]], [[statistics]], and [[database system]]s.<ref name="acm" /> It is an [[interdisciplinary]] subfield of [[computer science]].<ref name="acm">{{cite web |url=http://www.kdd.org/curriculum/index.html |title=Data Mining Curriculum |publisher=[[Association for Computing Machinery|ACM]] [[SIGKDD]] |date=2006-04-30 |accessdate=2014-01-27 }}</ref><ref name="brittanica">{{cite web |last=Clifton |first=Christopher |title=Encyclopædia Britannica: Definition of Data Mining |year=2010 |url=http://www.britannica.com/EBchecked/topic/1056150/data-mining |accessdate=2010-12-09 }}</ref><ref name="elements">{{cite web |last1=Hastie|first1=Trevor |authorlink1=Trevor Hastie|last2=Tibshirani|first2=R

In [50]:
revisions = result['query']['pages'][0]['revisions']
for num,rev in enumerate(revisions):
    if rev['sha1'] == '55f6e2a5ce1e2b104724415794375f55d98cb25c':
        print(num)

20
23
26
28
30


In [51]:
print('https://en.wikipedia.org/w/index.php?oldid='+str(revisions[20]['revid']))

https://en.wikipedia.org/w/index.php?oldid=758414296


In [52]:
print('https://en.wikipedia.org/w/index.php?oldid='+str(revisions[23]['revid']))

https://en.wikipedia.org/w/index.php?oldid=757902516


We see that a few Wikipedia revisions have the same SHA-1 hash value.

In [53]:
pd.Series([i['sha1'] for i in result['query']['pages'][0]['revisions']]).value_counts().head(10)

55f6e2a5ce1e2b104724415794375f55d98cb25c    5
3037378feb106b20cf2c1a3520e093278e64b270    4
dd9902dc7dd668ac8bb40d79023460fb5b187660    3
3329b77085ce6feb53bd6f474871aa5f0c97920d    2
a1e961cc542e6a168d6c40c33f75469585c24852    2
d9a8dfa6448bfa8dc255ab39990fdc2bef7223dd    2
894dab75d14250f15d933067d637205c33ac0ff7    1
ca84e352e0fa44356fdc266280a77ce51e5e19c2    1
5277b0833e0ca1474d1750c55ab88d70bc7a76cb    1
20dd300219000cf81f4019c796a861c73e9bffab    1
dtype: int64

So the revisions at these indexes in the revision history have exactly the same content. Thus when we pass them through the same hash function, they return the same hashed values.
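
The same grouping can be sketched without pandas by using a dictionary keyed on the digest. The `sample_revisions` list below is made-up stand-in data, not output from the Wikipedia API, and the digests are computed locally with `hashlib.sha1`.

```python
import hashlib
from collections import defaultdict

# Made-up stand-ins for revision content; not real Wikipedia data
sample_revisions = [
    {'revid': 1, 'content': 'Data mining is fun.'},
    {'revid': 2, 'content': 'Data mining is VANDALIZED.'},
    {'revid': 3, 'content': 'Data mining is fun.'},   # a revert back to revid 1
]

# Map each SHA-1 digest to the list positions of revisions that share it
positions_by_digest = defaultdict(list)
for position, revision in enumerate(sample_revisions):
    digest = hashlib.sha1(revision['content'].encode('utf-8')).hexdigest()
    positions_by_digest[digest].append(position)

# Keep only the digests that appear more than once
duplicates = [positions for positions in positions_by_digest.values() if len(positions) > 1]
print(duplicates)
```

Repeated digests in a revision history often indicate reverts, where an editor restores an earlier version of the article word for word.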

In [54]:
content = revisions[20]['content']
similar_content = revisions[23]['content']

# What are the hashes for each of these content revisions?
print(hash(content))
print(hash(similar_content))

# Are these two hashes equal to each other?
hashes_equal_boolean = hash(content) == hash(similar_content)
print("Are these two hashes equal to each other? {0}".format(hashes_equal_boolean))

# Is the content itself equal to each other
content_equal_boolean = content == similar_content
print("Are these two content equal to each other? {0}".format(content_equal_boolean))

4157966322820739759
4157966322820739759
Are these two hashes equal to each other? True
Are these two content equal to each other? True

In [56]:
len(similar_content)

45217


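But there are enormous differences in the performance of comparing strings themselves versus comparing hashes of strings to check if they're the same. Note that `timeit.timeit` runs each statement one million times by default, so the times below are totals across all of those runs.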

In [57]:
# Import the timeit module to estimate how long it takes to do each kind of comparison
import timeit

# Compare the strings of the identical Wikipedia article revision content
string_comparison_secs = timeit.timeit('content == similar_content',"from __main__ import content, similar_content")

# Compare the hashes of the identical Wikipedia article revision content
hash_comparison_secs = timeit.timeit('hash(content) == hash(similar_content)',"from __main__ import content, similar_content")

print("The string comparison took {0:.3f} seconds.".format(string_comparison_secs))
print("The hash comparison took {0:.3f} seconds.".format(hash_comparison_secs))
print("This is a {0:.1f}x speed-up!".format(string_comparison_secs/hash_comparison_secs))


The string comparison took 3.723 seconds.
The hash comparison took 0.152 seconds.
This is a 24.5x speed-up!


Thinking about complexity classes for a second, let's pretend the Wikipedia article was 10x the size. How would this affect the performance of checking if two versions of an article are identical?

In [58]:
# Make the content 10x the length
content_10x = content * 10
similar_content_10x = similar_content * 10

# Compare the strings of the identical Wikipedia article revision content
string_comparison_10x_secs = timeit.timeit('content_10x == similar_content_10x',"from __main__ import content_10x, similar_content_10x")

# Compare the hashes of the identical Wikipedia article revision content
hash_comparison_10x_secs = timeit.timeit('hash(content_10x) == hash(similar_content_10x)',"from __main__ import content_10x, similar_content_10x")

print("The string comparison took {0:.3f} seconds.".format(string_comparison_10x_secs))
print("The hash comparison took {0:.3f} seconds.".format(hash_comparison_10x_secs))
print("This is a {0:.1f}x speed-up!".format(string_comparison_10x_secs/hash_comparison_10x_secs))

The string comparison took 42.257 seconds.
The hash comparison took 0.150 seconds.
This is a 281.0x speed-up!


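So we can see now that if we don't need to do anything with the content itself but want to compare content to content, comparing hashes that have already been computed works in close to $O(1)$ time while string comparisons work (worst-case) in $O(N)$ time. A huge speedup! One caveat: computing a hash for the first time still requires reading the whole string, which is $O(N)$. The hash timings above stay flat because Python stores a string's hash after computing it once, so the repeated `hash` calls are just looking up a saved value.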
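
A rough way to see where the time goes is to time the first and second `hash` call on a freshly built string. This sketch relies on a CPython implementation detail (string objects remember their hash once it has been computed), and the helper `time_hash` is a name invented for this example. Exact timings will vary by machine.

```python
import time

def time_hash(text):
    """Return (hash value, seconds taken) for one call to hash()."""
    start = time.perf_counter()
    value = hash(text)
    return value, time.perf_counter() - start

# Build a brand-new string at run time so no hash has been stored for it yet
big_text = ''.join(['abcd'] * 5_000_000)

first_value, first_secs = time_hash(big_text)    # has to read all 20 million characters
second_value, second_secs = time_hash(big_text)  # expected to reuse the stored hash

print(first_value == second_value)
print("first call: {0:.6f}s, second call: {1:.6f}s".format(first_secs, second_secs))
```

The practical lesson is to compute a hash once, store it alongside the content, and compare the stored hashes afterwards.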