Reading code. Writing code. Determining complexity of algorithms.
- Hashing
- Primary clustering, secondary clustering, non-clustering
- Characteristics of a good hash function
- Linear probing, quadratic probing, double hashing, separate chaining, rehashing
- Priority Queues: algorithms for implementing them
- Heap as an array
- Leftist heap
- Skew heap
- Binomial queue
- Sorting
- Stability, adaptive, complexity
- Bubble, insertion, selection, shell
- Quicksort, mergesort, heap sort
- Effect of data: few unique, almost in order
- Bucket sort
- Timsort: merge low/high, natural mergesort, insertion sort to reach the minimum run length, galloping, merging similar-sized adjacent runs
- Union Find
- Union by size, union by weight
- Path compression
- Graphs (introduction only)
- Directed, undirected, bipartite, weighted, connected/unconnected
- Depth first traversal, breadth first traversal, visited flag
- Adjacency matrix, adjacency list
Topics | Sub Topics |
---|---|
Hashing | Hash Function, Collisions, Rehash, Efficiency, Deletions, Cuckoo |
Priority Queue | |
Sorting | |
Union Find | |
Graphing | |
Hashing Instructor Notes
Hashing | Concepts |
---|---|
Hash Function | Hash Function |
Collisions | Linear Probing, Clustering, Quadratic Probing |
This function is called a hash function because it “makes hash” of its inputs
A hash function is a function that:
- When applied to an Object, returns a number
- When applied to equal Objects, returns the same number for each
- When applied to unequal Objects, is very unlikely to return the same number for each
- This function has no other purpose
Hash function with shifted bits and exclusive-or:
unsigned int hash(string key)
{
    unsigned int hashVal = 0;
    for (size_t i = 0; i < key.length(); i++)
        hashVal = (hashVal << 5) ^ key[i] ^ hashVal;
    return hashVal % TABLESIZE;
}
Properties of a good hash function:
- Minimize collisions
- Fast to compute
- Scatter data evenly through the hash table, even when the data contain patterns
- Uses all bits of the key – generally helps in scattering non-random data
- If mod is used, the modulus (table size) should be prime
When two values hash to the same array location, this is called a collision
- Solution #1: Search from desired spot for an empty location (with wrap)
- Solution #2: Try again with a second hash function ...and a third, and a fourth, and a fifth, ... to find a better location.
- Solution #3: Create a linked list of values that hash to this location
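Solution #1 (linear probing) might be sketched as below; the fixed `TABLE_SIZE` and the use of plain int keys are illustrative assumptions, not part of the notes:

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Minimal linear-probing table for int keys; a sketch, not production code.
const int TABLE_SIZE = 13;  // prime, per the notes

struct ProbeTable {
    std::vector<std::optional<int>> slots;
    ProbeTable() : slots(TABLE_SIZE) {}

    // Search forward (with wrap) from the home slot for an empty location.
    bool insert(int key) {
        int i = key % TABLE_SIZE;
        for (int tries = 0; tries < TABLE_SIZE; ++tries) {
            if (!slots[i]) { slots[i] = key; return true; }
            i = (i + 1) % TABLE_SIZE;  // linear probe
        }
        return false;                  // table full
    }

    bool contains(int key) const {
        int i = key % TABLE_SIZE;
        for (int tries = 0; tries < TABLE_SIZE; ++tries) {
            if (!slots[i]) return false;        // empty slot: key not present
            if (*slots[i] == key) return true;
            i = (i + 1) % TABLE_SIZE;
        }
        return false;
    }
};
```

Note how 5 and 18 both hash to slot 5, so 18 lands in slot 6 — the start of a cluster.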
Open addressing: use a vacant spot in the table following Hash(key) – "open" addressing means that we find an unused address and use it.
Evaluation:
- Longer search times as the table fills
- Clustering gets worse and worse (snowballs)
- One problem with linear probing is the tendency to form "clusters": a cluster is a group of filled slots containing no open slot.
- The bigger a cluster gets, the more likely it is that new values will hash into the cluster and make it ever bigger.
- Clusters cause efficiency to degrade.
- Primary clustering is the tendency of certain open-addressing collision-resolution schemes to create long sequences of filled slots: keys that hash to different home locations compete for the same slots on successive probes. It is better if successive probe locations are picked in a "scrambled" order to lessen clustering.
- Secondary clustering occurs when different keys that hash to the same location compete for the same sequence of successive probe locations.
- Quadratic probing eliminates primary clustering, but not secondary clustering.
- Note: for both linear and quadratic probing, the probe sequence depends only on the home location, not on the key. Two different keys that hash to the same location keep competing.
Double hashing: use two functions; Step gives the increment for Hash.
- Define Step(key) = 1 + key % (TABLESIZE − 2) – a personalized increment.
- Step is never used directly as a hash value; it is only an auxiliary function.
- Works better if TABLESIZE and TABLESIZE − 2 are both prime.
- Hash(key) = key % TABLESIZE
- If location Hash(key) is full, successively add the increment given by Step(key), mod TABLESIZE.
- Example: for key = 38 and TABLESIZE = 13, Hash(38) = 12 and Step(38) = 1 + 38 % 11 = 6, so the probe sequence is 12, 5, 11, 4, ...
- Each probe sequence visits all the table locations if the table size and the increment are relatively prime to each other.
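The key = 38, TABLESIZE = 13 example can be checked directly with a small sketch of the two functions:

```cpp
#include <cassert>
#include <vector>

// Double-hashing probe sequence, using the notes' Hash and Step definitions.
const int TABLESIZE = 13;                                  // prime
int Hash(int key) { return key % TABLESIZE; }
int Step(int key) { return 1 + key % (TABLESIZE - 2); }    // TABLESIZE-2 = 11, also prime

// First n probe locations for a key.
std::vector<int> probes(int key, int n) {
    std::vector<int> seq;
    int loc = Hash(key);
    for (int i = 0; i < n; ++i) {
        seq.push_back(loc);
        loc = (loc + Step(key)) % TABLESIZE;  // add the personalized increment
    }
    return seq;
}
```

For key 38 the increment is 6; since gcd(6, 13) = 1, continuing the sequence for 13 steps visits every slot exactly once.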
Another way to handle collisions is to have each entry in the hash table hold more than one (B) item.
- Each entry is called a bucket.
- This doesn’t really solve the problem. As soon as the bucket fills up, you still have to deal with collisions.
- I think it would be better to just have the original table B times as large.
Each location in the hash table is a pointer to a linked list of things that hashed to that location.
Advantages:
- Saves space if records are long
- Collisions are no problem
- Overflow is solved as space is dynamic
- Deletion is easy
Disadvantages:
- Links require space
- Following the linked list is more time consuming
Works well, but we have the overhead of pointers.
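A minimal separate-chaining sketch, again assuming int keys and a fixed `TABLE_SIZE` for illustration:

```cpp
#include <cassert>
#include <list>
#include <vector>

// Separate chaining: each table slot holds a linked list of colliding keys.
const int TABLE_SIZE = 13;

struct ChainTable {
    std::vector<std::list<int>> buckets;
    ChainTable() : buckets(TABLE_SIZE) {}

    // Collisions are no problem: just prepend to the chain.
    void insert(int key) { buckets[key % TABLE_SIZE].push_front(key); }

    bool contains(int key) const {
        for (int k : buckets[key % TABLE_SIZE])
            if (k == key) return true;
        return false;
    }

    // Deletion is easy: unlink the node, no probe sequences to preserve.
    void remove(int key) { buckets[key % TABLE_SIZE].remove(key); }
};
```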
Rehash – create a larger (or smaller) array. Rehash all old elements into new array.
Hash tables are actually surprisingly efficient:
- Until the table is about 70% full, the number of probes (places looked at in the table) is typically only 2 or 3
- Sophisticated mathematical analysis is required to prove that the expected cost of inserting into a hash table, or looking something up in the hash table, is O(1)
- Even if the table is nearly full (leading to occasional long searches), efficiency is usually still quite high
- load factor = number-of-elements / TABLESIZE
- The load factor should not exceed 2/3
- With chaining, the concern is the average size of a chain
Deletions from an open-addressing table break probe sequences. Solution: use lazy deletion – mark the slot as deleted instead of emptying it.
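A lazy-deletion sketch under the same illustrative assumptions (linear probing, int keys, fixed `TABLE_SIZE`); the `Slot` state enum is my naming, not the notes':

```cpp
#include <cassert>
#include <vector>

// Lazy deletion: a removed slot is marked DELETED rather than emptied,
// so probe sequences that pass through it are not broken.
enum class Slot { EMPTY, FULL, DELETED };
const int TABLE_SIZE = 13;

struct LazyTable {
    std::vector<int> keys;
    std::vector<Slot> state;
    LazyTable() : keys(TABLE_SIZE), state(TABLE_SIZE, Slot::EMPTY) {}

    void insert(int key) {                       // assumes the table is not full
        int i = key % TABLE_SIZE;
        while (state[i] == Slot::FULL) i = (i + 1) % TABLE_SIZE;
        keys[i] = key;
        state[i] = Slot::FULL;                   // reuses EMPTY or DELETED slots
    }

    bool contains(int key) const {
        int i = key % TABLE_SIZE;
        while (state[i] != Slot::EMPTY) {        // keep probing past DELETED slots
            if (state[i] == Slot::FULL && keys[i] == key) return true;
            i = (i + 1) % TABLE_SIZE;
        }
        return false;
    }

    void remove(int key) {
        int i = key % TABLE_SIZE;
        while (state[i] != Slot::EMPTY) {
            if (state[i] == Slot::FULL && keys[i] == key) { state[i] = Slot::DELETED; return; }
            i = (i + 1) % TABLE_SIZE;
        }
    }
};
```

Removing 5 and then searching for 18 (which probed past 5's slot at insert time) still succeeds, because the search continues through the DELETED slot.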
Cuckoo hashing provides worst-case constant lookup time.
The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.
- There are two hash functions, one for each table.
- When a new key is inserted, it is placed in one of its two possible locations, "kicking out", that is, displacing, any key that might already reside there.
- The displaced key is then inserted in its alternative location in the other table, again kicking out any key that might reside there, until a vacant position is found or the procedure enters an infinite loop. In the latter case, the hash table is rebuilt in place using new hash functions.
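A two-table cuckoo sketch; the particular `hash1`/`hash2` functions, the table size, and the kick bound are illustrative assumptions:

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <utility>

// Two-table cuckoo hashing sketch for int keys.
const int N = 11;
int hash1(int key) { return key % N; }            // illustrative hash choices
int hash2(int key) { return (key / N) % N; }

struct Cuckoo {
    std::array<std::optional<int>, N> t1{}, t2{};

    // Worst-case O(1) lookup: a key has only two possible homes.
    bool contains(int key) const {
        return (t1[hash1(key)] && *t1[hash1(key)] == key) ||
               (t2[hash2(key)] && *t2[hash2(key)] == key);
    }

    // Insert with displacement; gives up ("rebuild needed") after a
    // bounded number of kicks instead of looping forever on a cycle.
    bool insert(int key) {
        for (int kicks = 0; kicks < 2 * N; ++kicks) {
            auto& s1 = t1[hash1(key)];
            if (!s1) { s1 = key; return true; }
            std::swap(key, *s1);                  // kick out table 1's occupant
            auto& s2 = t2[hash2(key)];
            if (!s2) { s2 = key; return true; }
            std::swap(key, *s2);                  // kick out table 2's occupant
        }
        return false;                             // cycle: rebuild with new hashes
    }
};
```

Inserting 14 (which collides with 3 in table 1) pushes 3 into its alternative slot in table 2, exactly the "cuckoo chick" behavior described above.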
--
Priority Queue Instructor Notes
Anything Greedy
Main Idea | Sub Idea | Description |
---|---|---|
Non-Mergeable Heaps | Binary Heap | |
 | D-Heap | |
Mergeable Heaps | Leftist Heap | |
 | Skew Heap | |
 | Binomial Queue | |
Examples:
- ordering CPU jobs
- searching for the exit in a maze (or looking for moves in the rotation puzzle game)
- emergency room admission processing
A priority queue does not need to keep its data as fully ordered as other data structures (BSTs, splay trees, AVL trees)
// Delete Code
Comparable deleteMin() {
    Comparable x = heap[0];
    heap[0] = heap[size--];   // size is the index of the last element
    percolateDown(0);
    return x;
}
void percolateDown(int hole) {
    Comparable shiftVal = heap[hole];
    while (2*hole + 1 <= size) {
        int left = 2*hole + 1;
        int right = left + 1;
        int target;
        if (right <= size && heap[right] < heap[left])
            target = right;
        else
            target = left;
        if (heap[target] < shiftVal) {
            heap[hole] = heap[target];
            hole = target;
        }
        else
            break;
    }
    heap[hole] = shiftVal;
}
// Insert Code
void insert(Comparable newVal) {
// Efficiency hack: we won’t actually put newVal
// into the heap until we’ve located the position
// it goes in. This avoids having to copy it
// repeatedly during the percolate up.
int hole = ++size;
// Percolate up
for( ; hole>0 && newVal < heap[(hole-1)/2] ;
hole = (hole-1)/2)
heap[hole] = heap[(hole-1)/2];
heap[hole] = newVal;
}
Floyd's method
buildHeap() {
    for (int i = size/2; i >= 0; i--)
        percolateDown(i);
}
// Building a heap by n successive inserts is O(n log n) worst case, O(n) average case.
// Can we get a worst-case O(n) bound? Yes – Floyd's method is O(n) worst case.
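A self-contained version of the notes' array heap (0-indexed, children at 2i+1 and 2i+2, with Floyd's buildHeap) might look like this; here `heap.size()` plays the role of the element count rather than a last-index `size` variable:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Self-contained 0-indexed min-heap following the notes' array layout.
struct MinHeap {
    std::vector<int> heap;

    void percolateDown(int hole) {
        int n = (int)heap.size();
        int shiftVal = heap[hole];
        while (2 * hole + 1 < n) {
            int target = 2 * hole + 1;                        // left child
            if (target + 1 < n && heap[target + 1] < heap[target])
                target = target + 1;                          // right child smaller
            if (heap[target] < shiftVal) {
                heap[hole] = heap[target];
                hole = target;
            } else break;
        }
        heap[hole] = shiftVal;
    }

    // Floyd's method: percolate down every non-leaf, deepest first. O(n).
    void buildHeap(std::vector<int> data) {
        heap = std::move(data);
        for (int i = (int)heap.size() / 2; i >= 0; --i)
            percolateDown(i);
    }

    int deleteMin() {
        int x = heap[0];
        heap[0] = heap.back();      // move last element into the hole
        heap.pop_back();
        if (!heap.empty()) percolateDown(0);
        return x;
    }
};
```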
 | Binary Heap Worst Case | Binary Heap Average Case | AVL Worst | BST Average |
---|---|---|---|---|
Insert | O(log n) | O(1) (2.6 compares) | O(log n) | O(log n) |
Delete Min | O(log n) | O(log n) | O(log n) | O(log n) |
- finding a child/parent index is a multiply/divide by two
- each percolate down operation looks at only two kids
- inserts are at least as common as deleteMins
- division and multiplication by powers of two are fast
- with huge data sets (that can’t be stored in main memory), memory accesses dominate
- Each node has d children
- Still representable by array
- optimize performance based on # of inserts/removes
- power of two for efficiency
- fit one set of children in a cache line (the block of memory that is transferred to the memory cache)
- fit one set of children on a memory page/disk block
Complexity of O(n) with Floyd's method
Leftist heap: binary heap-ordered trees with left subtrees always “longer” than right subtrees – a heap structure that enables fast merges.
- Main idea: Recursively work on right path for Merge/Insert/DeleteMin
- Right path is always short: it has O(log N) nodes
- Merge, Insert, DeleteMin all have O(log N) running time (see text)
the null path length (npl) of a node is the smallest number of nodes between it and a null in the tree
npl(null) = -1
npl(leaf) = 0
npl(single-child node) = 0
- parent’s priority value is ≤ children’s priority values
- result: minimum element is at the root
- null path length of left subtree is ≥ npl of right subtree
- result: tree is at least as “heavy” on the left as the right
- every subtree of a leftist tree is leftist!
- Basically, the shortest path must be to the right. So, if you always take the shortest path, it can’t be longer than log n.
- If both the left and right sub-trees are leftist heaps but the root does not form a leftist heap, we only need to swap the two sub-trees
- We can use this to merge two leftist heaps
- Merging strategy: given two leftist heaps, recursively merge the heap with the larger root value into the right sub-heap of the other root. Traversing back to the root, swap sub-trees as needed to maintain the leftist heap property
Node * merge(Node * t1, Node * t2)   // t1 and t2 are merged; a new tree is created
{
    Node * small;
    if (t1 == NULL) return t2;
    if (t2 == NULL) return t1;
    if (t1->element < t2->element) {
        t1->right = merge(t1->right, t2);
        small = t1;
    }
    else {
        t2->right = merge(t2->right, t1);
        small = t2;
    }
    if (notLeftist(small)) swapkids(small);
    setNullPathLength(small);
    return small;
}
- The heaps are merged, but the result is not a leftist heap as 3 is unhappy. On the way back out of the recursion, swap sub-heaps where necessary. Find the unhappy nodes after updating the null path lengths.
- extra storage for npl
- extra complexity/logic to maintain and check npl
Self-adjusting version of leftist heaps (a la splay trees)
Motivation? Remember that we “often” get heavy right sides when we merge. So, why not just assume the right side is heavy and move it over automatically?
We can make a simple modification to the leftist heap and get similar results without storing (or computing) the null path length.
We always merge with the right child, but after merging, we swap the left and right children for every node in the resulting right path of the temporary tree.
- Do not actually keep track of path lengths
- Adjust tree by swapping children during each merge
- O(log N) amortized time per operation for a sequence of M operations
Skew Notes
- blind adjusting version of leftist heaps
- amortized time for merge, insert, and deleteMin is O(log n)
- worst case time for all three is O(n)
- merge always switches children when fixing right path
- iterative method has only one pass
Node * SkewHeapMerge(Node * t1, Node * t2)   // t1 and t2 are merged; a new tree is created
{
    Node * small;
    if (t1 == NULL) return t2;
    if (t2 == NULL) return t1;
    if (t1->element < t2->element) {
        t1->right = SkewHeapMerge(t1->right, t2);
        small = t1;
    }
    else {
        t2->right = SkewHeapMerge(t2->right, t1);
        small = t2;
    }
    swapkids(small);
    return small;
}
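For contrast with the pseudocode, a compact, runnable skew-heap sketch (recursive merge with an unconditional child swap); the `insert`/`deleteMin` wrappers are illustrative additions:

```cpp
#include <cassert>
#include <utility>

// Runnable skew-heap sketch: merge always swaps children on the merge path.
struct Node {
    int element;
    Node *left = nullptr, *right = nullptr;
    Node(int e) : element(e) {}
};

Node* skewMerge(Node* t1, Node* t2) {
    if (!t1) return t2;
    if (!t2) return t1;
    if (t2->element < t1->element) std::swap(t1, t2);  // t1 keeps the smaller root
    t1->right = skewMerge(t1->right, t2);              // recurse on the right path
    std::swap(t1->left, t1->right);                    // blind, unconditional swap
    return t1;
}

Node* insert(Node* root, int v) { return skewMerge(root, new Node(v)); }

int deleteMin(Node*& root) {
    int v = root->element;
    Node* old = root;
    root = skewMerge(root->left, root->right);         // merge the two subtrees
    delete old;
    return v;
}
```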
--
Sorting Instructor Notes
Main Idea | Sub Idea | Complexity | Adaptive | Stable | In-Place |
---|---|---|---|---|---|
Sort Theory | Adaptive (Non-oblivious) | Definition | | | |
 | Stable | Definition | | | |
Non-Recursive Sorts | Bubble Sort | O(n^2) | Adaptive | Stable | |
 | Insertion Sort | O(n^2) | Adaptive | Stable | |
 | Selection Sort | O(n^2) | Not Adaptive | Not Stable | |
Recursive Sorts | Quicksort | | | | |
 | Shell | | | | |
 | Heap | | | | |
 | Merge | | | | |
- Adaptive: faster with mostly sorted data
Sorts | Complexities | Adaptive | Stable |
---|---|---|---|
Bubble | O(n) – O(n^2) | Adaptive | Stable |
Selection | O(n^2) | Not Adaptive | Not Stable |
Insertion | O(n) – O(n^2) | Adaptive | Stable |
- A sorting algorithm is said to be stable if two objects with equal keys appear in the same order in the sorted output as they appear in the input array. Some sorting algorithms are stable by nature, such as insertion sort, merge sort, and bubble sort.
Although it is one of the elementary sorting algorithms with O(n^2) worst-case time, insertion sort is the algorithm of choice either when the data is nearly sorted (because it is adaptive) or when the problem size is small (because it has low overhead).
For these reasons, and because it is also stable, insertion sort is often used as the recursive base case (when the problem size is small) for higher overhead divide-and-conquer sorting algorithms, such as merge sort or quick sort.
From the comparisons presented here, one might conclude that selection sort should never be used. It does not adapt to the data in any way, so its runtime is always quadratic.
However, selection sort has the property of minimizing the number of swaps. In applications where the cost of swapping items is high, selection sort very well may be the algorithm of choice.
Sorts | Complexities | Adaptive | Stable | In-Place |
---|---|---|---|---|
Quicksort | | | | |
Shell | O(n log n) – O(n^(3/2)) | Adaptive | Not Stable | |
Heap | O(n log n) | Not Adaptive | Not Stable | In-Place |
Merge | O(n log n) | Not Adaptive | Stable | |
- Quicksort works by recursively partitioning the array around a pivot
The worst-case time complexity of shell sort depends on the increment sequence. For the increments 1, 4, 13, 40, 121, ..., which is what is used here, the time complexity is O(n^(3/2)). For other increments, time complexity is known to be O(n^(4/3)) and even O(n·lg²(n)). Neither tight upper bounds on time complexity nor the best increment sequence are known.
Because shell sort is based on insertion sort, shell sort inherits insertion sort's adaptive properties. The adaptation is not as dramatic because shell sort requires one pass through the data for each increment, but it is significant. For the increment sequence shown above, there are log₃(n) increments, so the time complexity for nearly sorted data is O(n·log₃(n)).
Because of its low overhead, relatively simple implementation, adaptive properties, and sub-quadratic time complexity, shell sort may be a viable alternative to the O(n·lg(n)) sorting algorithms for some applications when the data to be sorted is not very large.
Heap sort is simple to implement, performs an O(n·lg(n)) in-place sort, but is not stable.
The first loop, the Θ(n) "heapify" phase, puts the array into heap order. The second loop, the O(n·lg(n)) "sortdown" phase, repeatedly extracts the maximum and restores heap order.
The sink function is written recursively for clarity. Thus, as shown, the code requires Θ(lg(n)) space for the recursive call stack. However, the tail recursion in sink() is easily converted to iteration, which yields the O(1) space bound.
Both phases are slightly adaptive, though not in any particularly useful manner. In the nearly sorted case, the heapify phase destroys the original order. In the reversed case, the heapify phase is as fast as possible since the array starts in heap order, but then the sortdown phase is typical. In the few unique keys case, there is some speedup but not as much as in shell sort or 3-way quicksort.
Merge sort is very predictable. It makes between 0.5lg(n) and lg(n) comparisons per element, and between lg(n) and 1.5lg(n) swaps per element. The minima are achieved for already sorted data; the maxima are achieved, on average, for random data. If using Θ(n) extra space is of no concern, then merge sort is an excellent choice: It is simple to implement, and it is the only stable O(n·lg(n)) sorting algorithm. Note that when sorting linked lists, merge sort requires only Θ(lg(n)) extra space (for recursion).
Merge sort is the algorithm of choice for a variety of situations: when stability is required, when sorting linked lists, and when random access is much more expensive than sequential access (for example, external sorting on tape).
There do exist linear time in-place merge algorithms for the last step of the algorithm, but they are both expensive and complex. The complexity is justified for applications such as external sorting when Θ(n) extra space is not available.
Union Find is an algorithm which uses a disjoint-set data structure to solve the following problem: given some number of items, support merging (union) two items' groups and asking (find) whether two items belong to the same group.
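A disjoint-set sketch with the two optimizations named in the outline, union by size and path compression:

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Disjoint-set (Union Find) with union by size and path compression.
struct UnionFind {
    std::vector<int> parent, size;

    UnionFind(int n) : parent(n), size(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);  // each item is its own root
    }

    int find(int x) {
        if (parent[x] != x)
            parent[x] = find(parent[x]);  // path compression: point directly at root
        return parent[x];
    }

    void unite(int a, int b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        if (size[a] < size[b]) std::swap(a, b);  // union by size: smaller under larger
        parent[b] = a;
        size[a] += size[b];
    }
};
```

Together the two optimizations keep trees nearly flat, making a sequence of operations run in effectively constant amortized time per operation.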
Graph terminology:
- nodes
- relationships (arcs/edges)
- a child is called a successor
- things that are touching are called adjacent
- a parent is called a predecessor
Examples:
Example | Node | Arc (Relationship) |
---|---|---|
Ages | People | |
Digital Logic | | |
Starting Salary | Major | Salary |
Networking | | |
Candy Bars | | |
Test to see | | |
Social Network | People | Relationship to person |
This class of graph is called a "bipartite" graph.
Annotated graph example: Starting Salary. Node: Student; Arc: Salary.
Representations: arrays, linked lists (adjacency list), or an adjacency matrix.
Adjacency Matrix
0 1 2 3 4 5
_____________
0|_|x|_|_|_|_|
1|_|_|_|_|_|_|
2|_|_|_|x|_|_|
3|_|_|_|_|_|_|
4|_|x|_|_|_|_|
5|_|_|_|_|_|_|
- successor -> look across the row
- predecessor -> look down the column
- memory-intensive: the matrix uses O(V²) space regardless of the number of edges
Time Space Tradeoff
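One side of that tradeoff can be sketched with an adjacency list for the same 6-node graph as the matrix above (edges 0→1, 2→3, 4→1); `buildGraph` and `hasEdge` are illustrative names:

```cpp
#include <cassert>
#include <vector>

// Adjacency list: space is O(V + E) rather than the matrix's O(V^2).
std::vector<std::vector<int>> buildGraph() {
    std::vector<std::vector<int>> adj(6);
    adj[0].push_back(1);  // edge 0 -> 1
    adj[2].push_back(3);  // edge 2 -> 3
    adj[4].push_back(1);  // edge 4 -> 1
    return adj;
}

// Successors of u are simply adj[u]; finding predecessors requires
// scanning every list -- the time side of the time/space tradeoff.
bool hasEdge(const std::vector<std::vector<int>>& adj, int u, int v) {
    for (int w : adj[u])
        if (w == v) return true;
    return false;
}
```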
- Greedy: A greedy algorithm is an algorithm that follows the problem solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum.
- Null Path Length: the null path length (npl) of a node is the smallest number of nodes between it and a null in the tree