Python---Search-Algorithm---Linear-Search---Hash-Search

This is a comprehensive, practical guide to linear search and hash-based search (hash tables) in Python. I’ll explain what each method is, how they work, complexity and space trade-offs, Python implementations (including idiomatic use of dict and set), collision handling, tuning and performance tips, testing and benchmarking strategies, common pitfalls, and real-world guidance for choosing one over the other. All examples are pure Python and ready to run.

Table of contents

Quick overview

Linear search: intuition, algorithms, and Python code

Complexity and practical behavior of linear search

When linear search is appropriate

Hash search (hash tables): intuition and core ideas

Core operations: hashing, equality, collisions, and resolution strategies

Python’s built-in hash containers: dict and set (how to use them)

Implementing a simple hash table in Python (educational)

Complexity, performance trade-offs, and memory considerations for hash tables

Hash table internals (high-level view of CPython)

Collisions and attacks, robustness, and hash randomization

When to use linear search vs hash search: decision guide

Testing and benchmarking search implementations in Python

Real-world patterns, pitfalls, and optimization checklist

Summary and quick reference

1. Quick overview

Linear search (sequential search) scans a sequence element-by-element until it finds an element that satisfies the search condition. It works on any iterable and requires no extra data structure. Time complexity: O(n) in the worst case.

Hash search (hash table lookup) stores items indexed by a hash code. Typical operations (lookup, insert, delete) have average-case O(1) time, assuming a good hash function and low load factor. In Python you normally use dict or set.

Both are fundamental. Linear search is trivial and always applicable; hash search needs preprocessing (hash table) but becomes vastly faster for repeated queries.

2. Linear search — intuition, algorithms, and Python code

2.1 Intuition

Think of reading a phonebook line-by-line looking for a name. You check every line until you match — that’s linear search. It’s safe and general: no assumptions about ordering or indexing are required.

2.2 Basic linear search implementations

Return the index of the first occurrence or -1 if not found:

    def linear_search_index(arr, target):
        """Return the index of target in arr, or -1 if not found."""
        for i in range(len(arr)):
            if arr[i] == target:
                return i
        return -1

Return the element (useful for predicate searches):

    def linear_search_element(iterable, predicate):
        """
        Return the first element in iterable that satisfies predicate, or None.
        predicate is callable: element -> bool
        """
        for item in iterable:
            if predicate(item):
                return item
        return None

Pythonic idiom using next and a generator expression:

    # return first element with attribute age > 30, or None
    person = next((p for p in people if p.age > 30), None)

Membership test in lists is linear under the hood:

    if x in arr:  # this performs a linear scan for lists and tuples
        ...

2.3 Variations

Find all matches: collect items or indices into a list — requires a full scan (a minimal sketch appears after this list).

Sentinel method: rarely beneficial in Python — avoids bound checks in low-level languages.

Recursive linear search: works but uses O(n) recursion depth — not recommended.
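For example, a minimal sketch of the find-all-matches variation (the helper name linear_search_all is illustrative, not part of the original code):

    def linear_search_all(arr, target):
        """Return a list of every index where target occurs (always a full scan)."""
        return [i for i, item in enumerate(arr) if item == target]

    # linear_search_all([3, 1, 4, 1, 5], 1) -> [1, 3]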

3. Complexity and practical behavior of linear search

Worst-case time: O(n) comparisons (target absent or last element).

Best-case time: O(1) (target at first position).

Average-case: O(n) — if target uniformly random, expect ~n/2 checks.

Space: O(1) extra for iterative versions.

Practical notes:

For very small arrays (tens of elements), linear search can be faster than algorithms with overhead (like binary search on a sorted array), because of lower constant factors and better CPU caching for contiguous memory.

If you need to perform many membership queries on an immutable dataset, converting to a set or dict is usually worth the upfront cost because subsequent queries are much cheaper.

4. When linear search is appropriate

Data is unsorted and you only need a single search or a few searches.

The collection is small.

The data is not indexable by keys or you must evaluate a complex predicate (e.g., “first user with age > 30 and salary < X”) where hashing by a single key is insufficient.

You’re streaming data or working with generators (one-pass) where you cannot or do not want to store an auxiliary index.

5. Hash search (hash tables) — intuition and core ideas

A hash table uses a function (hash) to map arbitrary keys to integers (hash codes). The hash code is translated into an index into an array (the bucket array). Ideally, different keys map to different indices — but collisions (different keys mapping to the same bucket) occur. Good hash table implementations resolve collisions and resize the internal array as the number of elements grows to ensure that average operations remain O(1).

The common operations:

Insert(key, value) — compute hash, place key and value in the appropriate bucket, resolving collisions as needed.

Lookup(key) — compute hash, inspect bucket (and possibly handle collisions) to find the key.

Delete(key) — find and remove the key; update internal state.

Hash tables are the backbone of Python (dict) and are used everywhere for fast membership and mapping.
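As a quick illustration, the same three operations expressed with Python's dict (the container and key names are just for the example):

    inventory = {}                       # empty table
    inventory['apple'] = 5               # Insert(key, value)
    count = inventory.get('apple')       # Lookup(key) -> 5
    missing = inventory.get('pear')      # Lookup of an absent key -> None
    del inventory['apple']               # Delete(key)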

6. Core operations: hashing, equality, collisions, and resolution strategies

6.1 Hash function

Python exposes hash(x) for hashable objects. A hashable object must define __hash__ and a stable __eq__.

The hash function must be deterministic for the lifetime of the object in the table. For mutable objects, using them as keys is incorrect because their hash may change.

6.2 Equality

Objects that compare equal (a == b) must have the same hash value; Python requires this for dict and set to work correctly. The converse does not hold: unequal objects may share a hash (a collision).

Hash table lookup first filters by hash and then confirms actual equality to handle collisions.
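As an illustration (the Point class below is hypothetical, not from the original), a value object that keeps __hash__ and __eq__ consistent and is therefore safe to use as a dict key, provided its fields are never mutated:

    class Point:
        """Value object; treat instances as immutable once used as keys."""
        def __init__(self, x, y):
            self.x = x
            self.y = y

        def __eq__(self, other):
            if not isinstance(other, Point):
                return NotImplemented
            return (self.x, self.y) == (other.x, other.y)

        def __hash__(self):
            # hash exactly the fields that __eq__ compares
            return hash((self.x, self.y))

    d = {Point(1, 2): "origin-ish"}
    print(Point(1, 2) in d)  # True: equal objects hash to the same bucket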

6.3 Collisions and resolution

Two main families of collision resolution strategies:

Chaining: Each bucket holds a secondary structure (often a linked list or dynamic array) of items that hashed to that bucket. On collision, append to the bucket; on lookup, scan the bucket.

Pros: Simple, easy to implement; buckets can grow arbitrarily.

Cons: Extra pointer overhead; degraded performance if many items collide.

Open addressing (probing): All items are stored directly in the bucket array. On collision you probe alternative slots according to a scheme (linear probing, quadratic probing, double hashing).

Pros: No secondary structures; often better cache performance.

Cons: Requires careful deletion handling; clustering can degrade performance.

CPython’s dict uses an optimized open addressing scheme with perturbation and dynamic resizing (not simple chaining). Java’s HashMap uses chaining (since Java 8 it switches to balanced trees for very long chains to avoid bad worst-case behavior).
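For contrast with the open-addressing table implemented in section 8, here is a minimal chaining sketch (educational only, with a fixed bucket count and no resizing; the class name is illustrative):

    class ChainedHashTable:
        """Toy separate-chaining table: each bucket is a list of (key, value) pairs."""
        def __init__(self, num_buckets=16):
            self._buckets = [[] for _ in range(num_buckets)]

        def _bucket(self, key):
            return self._buckets[hash(key) % len(self._buckets)]

        def __setitem__(self, key, value):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)   # overwrite existing key
                    return
            bucket.append((key, value))        # collision: append to the chain

        def __getitem__(self, key):
            for k, v in self._bucket(key):     # scan the (usually short) chain
                if k == key:
                    return v
            raise KeyError(key)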

6.4 Load factor and resizing

Load factor = number of stored entries / number of buckets. As it increases, collisions increase and average lookup time rises.

Typical strategy: when load factor exceeds a threshold (e.g., 0.66), resize to a bigger table (often about double size) and rehash all entries.

Rehashing is expensive but amortized: occasional expensive resize versus many cheap O(1) operations.
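A small numeric illustration, using the same defaults as the educational table in section 8 (8 buckets, 0.66 maximum load):

    capacity, max_load = 8, 0.66
    for n_entries in range(1, capacity + 1):
        load = n_entries / capacity
        if load > max_load:
            print(f"storing {n_entries} entries -> load {load:.2f} > {max_load}: "
                  f"resize to {capacity * 2} buckets and rehash")
            break
    # prints: storing 6 entries -> load 0.75 > 0.66: resize to 16 buckets and rehash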

7. Python’s built-in hash containers: dict and set

Use these — they are written in C, highly optimized, and handle all the tricky details.

7.1 dict — mapping keys to values

    d = {}
    d['apple'] = 5
    value = d.get('apple')   # returns 5
    exists = 'apple' in d    # True, average O(1)
    d.pop('apple', None)     # remove key

dict handles hashing, equality, collision resolution, and resizing, and it preserves insertion order (guaranteed as a language feature since Python 3.7).

7.2 set — collection of unique keys

    s = set([1, 2, 3])
    if x in s:   # membership test, average O(1)
        ...
    s.add(4)
    s.remove(2)

set is implemented using the same hashed-storage techniques as dict but without values.

7.3 Important idioms

Use key in d or x in s for membership tests — these are fast, idiomatic, and readable.

Use collections.defaultdict, collections.Counter, or dict.setdefault for common aggregation patterns that rely on hashing (a short example follows this list).

For read-only large mappings where memory matters, consider specialized libraries (but in pure Python, dict is usually the right choice).
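A small sketch of the aggregation idioms mentioned above (the sample data is illustrative):

    from collections import Counter, defaultdict

    words = ["spam", "egg", "spam", "ham", "spam"]

    counts = Counter(words)               # Counter({'spam': 3, 'egg': 1, 'ham': 1})

    by_first_letter = defaultdict(list)
    for w in words:
        by_first_letter[w[0]].append(w)   # group without checking "key in dict" first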

8. Implementing a simple hash table in Python (educational)

Below is an educational open-addressing hash table with linear probing. This is not production-ready (CPython’s dict is far more complex and optimized), but it shows the mechanics.

    class SimpleHashTable:
        _EMPTY = object()
        _DELETED = object()

        def __init__(self, capacity=8, max_load=0.66):
            self._size = 0
            self._capacity = capacity
            self._keys = [self._EMPTY] * capacity
            self._values = [None] * capacity
            self._max_load = max_load

        def _hash(self, key):
            return hash(key) & 0x7FFFFFFF  # positive

        def _probe(self, h, i):
            # linear probing
            return (h + i) % self._capacity

        def _resize(self):
            old_keys = self._keys
            old_values = self._values
            self._capacity *= 2
            self._keys = [self._EMPTY] * self._capacity
            self._values = [None] * self._capacity
            self._size = 0
            for k, v in zip(old_keys, old_values):
                if k is not self._EMPTY and k is not self._DELETED:
                    self[k] = v

        def __setitem__(self, key, value):
            if (self._size + 1) / self._capacity > self._max_load:
                self._resize()
            h = self._hash(key)
            i = 0
            first_deleted = None
            while True:
                idx = self._probe(h, i)
                k = self._keys[idx]
                if k is self._EMPTY:
                    # insert here (or reuse an earlier deleted slot)
                    if first_deleted is not None:
                        idx = first_deleted
                    self._keys[idx] = key
                    self._values[idx] = value
                    self._size += 1
                    return
                elif k is self._DELETED:
                    if first_deleted is None:
                        first_deleted = idx
                elif k == key:
                    self._values[idx] = value
                    return
                i += 1

        def __getitem__(self, key):
            h = self._hash(key)
            i = 0
            while True:
                idx = self._probe(h, i)
                k = self._keys[idx]
                if k is self._EMPTY:
                    raise KeyError(key)
                if k is self._DELETED:
                    i += 1
                    continue
                if k == key:
                    return self._values[idx]
                i += 1

        def __contains__(self, key):
            try:
                _ = self[key]
                return True
            except KeyError:
                return False

        def __delitem__(self, key):
            h = self._hash(key)
            i = 0
            while True:
                idx = self._probe(h, i)
                k = self._keys[idx]
                if k is self._EMPTY:
                    raise KeyError(key)
                if k == key:
                    self._keys[idx] = self._DELETED
                    self._values[idx] = None
                    self._size -= 1
                    return
                i += 1

Notes:

This uses linear probing and tombstones (_DELETED) to allow deletions.

Real implementations add more optimizations (better probing, load factors, resizing strategies, and cache-aware layouts).

This example demonstrates why dict is complicated and best left to the interpreter.
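A quick usage check of the table above:

    ht = SimpleHashTable()
    ht['apple'] = 5
    ht['pear'] = 7
    print(ht['apple'])      # 5
    print('pear' in ht)     # True
    del ht['apple']
    print('apple' in ht)    # False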

9. Complexity, performance trade-offs, and memory considerations for hash tables

9.1 Time complexity (average)

Lookup/Insert/Delete: expected O(1) average time.

Worst-case: O(n) if many keys collide or attacker chooses keys to cause collisions (but mitigations exist).

9.2 Space complexity

Hash tables use extra memory: a bucket array (often larger than the number of elements) plus per-entry overhead. dict stores keys and values and some metadata per slot.

Memory cost is the main trade-off compared to arrays or lists.
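A rough way to see the container overhead is sys.getsizeof, keeping in mind it measures only the container itself, not the elements it references:

    import sys

    n = 100_000
    keys = list(range(n))

    print("list:", sys.getsizeof(keys))                   # compact array of references
    print("set :", sys.getsizeof(set(keys)))              # larger: sparse table of slots
    print("dict:", sys.getsizeof(dict.fromkeys(keys)))    # keys + values + per-slot metadata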

9.3 Performance caveats

Python dict and set are highly optimized. For many lookups, they provide large speedups.

Hashing cost per key can be nontrivial for complex objects — keep hash computation cheap.

Resizing costs are amortized but can cause occasional spikes. For predictable performance-sensitive workloads, consider pre-sizing when possible (via idioms like creating dicts with a known approximate number of items, though Python’s API does not provide a direct constructor for initial capacity).

9.4 Cache behavior

Open addressing can be cache-friendly because keys/values are stored in contiguous arrays.

Chaining involves pointer chasing and can be less cache-friendly.

10. Hash table internals (high-level view of CPython)

CPython’s dict is implemented in C with many optimizations:

Uses open addressing with a perturbation algorithm to distribute probes and reduce clustering.

Stores entries in an array of “entries” (key, value, hash) and a smaller index-table for quick lookups (implementation details evolved across versions).

Guarantees insertion order preservation (as of Python 3.7+ language spec).

Resizes the table when necessary and uses calculated growth patterns to balance memory and speed.

You don’t need to know the full internals to use dict effectively, but understanding open addressing, load factors, and the importance of stable hash/equality semantics helps you avoid mistakes.

11. Collisions and attacks, robustness, and hash randomization

11.1 Collision attacks

If an attacker can control keys and craft many keys with the same hash bucket, they can degrade hash table performance to O(n) per operation (hash flood attacks).

To mitigate this, modern Python uses hash randomization (a secret random seed applied to the hash function for some types), making it hard for attackers to predict collision patterns.
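You can observe the randomization directly in CPython: string hashes change between interpreter runs unless PYTHONHASHSEED is fixed, while small integer hashes do not.

    # Run this script twice: the string hash differs across processes,
    # the integer hash stays the same.
    print(hash("hello"))   # varies per process (unless PYTHONHASHSEED is set)
    print(hash(12345))     # stable: small ints hash to themselves in CPython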

11.2 Best practices to avoid problems

Avoid accepting arbitrary user-input keys and using them directly as dict keys on high-throughput servers without safeguards.

For user-provided strings, Python’s randomized hashing helps but you should still validate or limit inputs in security-sensitive contexts.

For custom objects, implement __hash__ and __eq__ carefully — making __hash__ expensive or non-uniform can harm performance.

12. When to use linear search vs hash search: decision guide

Use linear search when:

The dataset is small (e.g., tens of elements).

You need to evaluate a complex predicate that cannot be reduced to a single key.

Data is not stored or cannot be stored in memory (streaming) and you need to check elements on the fly.

Simplicity and minimal overhead are priorities.

Use hash search when:

You need many membership/lookup operations on the same dataset.

You can identify a key (or small composite key) for each item (a tuple-key sketch appears after this list).

The dataset fits in memory and you can accept the additional memory overhead.

You need fast average-case lookup and insertion.
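For example, a composite key can simply be a tuple of hashable fields (the person records below are hypothetical):

    people = [("Ada", "Lovelace", 1815), ("Alan", "Turing", 1912)]

    # index by a composite (first, last) key for O(1) average lookups
    by_name = {(first, last): year for first, last, year in people}

    print(by_name[("Alan", "Turing")])   # 1912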

Comparisons:

Single lookup in unsorted list: linear search (O(n)). If repeated lookups: build a hash table and then O(1) per lookup.

If ordering queries are necessary (e.g., sorted order), a balanced tree or sorted list + binary search may be better than hash tables.
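If the data can be kept sorted, the standard library's bisect module gives O(log n) membership, a useful middle ground:

    import bisect

    sorted_data = [2, 5, 8, 13, 21, 34]

    def contains_sorted(sorted_list, target):
        """O(log n) membership test on an already-sorted list."""
        i = bisect.bisect_left(sorted_list, target)
        return i < len(sorted_list) and sorted_list[i] == target

    print(contains_sorted(sorted_data, 13))  # True
    print(contains_sorted(sorted_data, 14))  # False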

13. Testing and benchmarking search implementations in Python

13.1 Unit tests

Linear search tests:

    def test_linear_search():
        arr = [3, 1, 4, 1, 5]
        assert linear_search_index(arr, 3) == 0
        assert linear_search_index(arr, 5) == 4
        assert linear_search_index(arr, 9) == -1

Hash search tests:

    def test_dict_lookup():
        d = {'a': 1, 'b': 2}
        assert d['a'] == 1
        assert 'c' not in d
        d['c'] = 3
        assert 'c' in d
        del d['b']
        assert 'b' not in d

13.2 Microbenchmarks

Use timeit for microbenchmarks. Example: compare membership tests for lists vs sets:

    import timeit

    n = 100000
    data = list(range(n))
    target = n - 1

    # membership on list
    t_list = timeit.timeit(lambda: (target in data), number=1000)

    # convert to set
    s = set(data)
    t_set = timeit.timeit(lambda: (target in s), number=1000)

    print("list membership:", t_list)
    print("set membership:", t_set)

Tips:

Run benchmarks multiple times and use realistic data sizes.

For fairness, include conversion time if building a set from a list is part of the use case.

Measure cold cache vs warm cache conditions if that matters.

14. Real-world patterns, pitfalls, and optimization checklist

14.1 Patterns

Use set for membership deduplication and fast membership.

Use dict for mapping keys to values; use collections.Counter for frequency counts.

Be careful with in checks inside list comprehensions — if you perform many in checks against the same list, convert it to a set once to avoid repeated linear scans.

Example pitfall:

    # BAD: repeated linear scans
    result = [x for x in items if x in big_list]   # for every x, big_list is scanned

    # GOOD: convert once
    big_set = set(big_list)
    result = [x for x in items if x in big_set]

14.2 Pitfalls & gotchas

Using mutable objects as dict keys — built-in mutable containers (list, dict, set) are unhashable and rejected outright; custom mutable objects are allowed only if you never mutate the attributes that affect equality/hash while they are in the table.

Relying on insertion order prior to Python 3.7 — older implementations did not guarantee it (now guaranteed).

Assuming hash() is cryptographically secure — it’s not. Use a cryptographic hashing library such as hashlib for security-sensitive uses.

14.3 Optimization checklist

If you do many lookups, use set or dict.

Pre-size or estimate data volume to reduce repeated resizing (where applicable).

Keep hash functions cheap and stable.

Avoid using objects with an expensive __eq__ or __hash__ as keys in performance-critical paths.

For memory-critical environments, measure dict overhead — sometimes other data structures (sorted arrays, tries, specialized libraries) may be better.

15. Summary and quick reference

Linear search:

Works on any sequence or iterable.

Implementation: simple for loop or next + generator.

Complexity: O(n) time, O(1) extra space.

Use when dataset is small, unsorted, or you need a single pass or complex predicate.

Hash search (hash tables):

Uses hashing to achieve expected O(1) time for lookup/insert/delete.

Implemented in Python as dict and set. Use these instead of implementing your own in production.

Complexity: O(1) average, O(n) worst-case (mitigated by randomization and good designs).

Use when many lookups or inserts are needed and you can afford memory overhead.

Choosing:

Single lookup on small unsorted data → linear.

Many lookups or key-based retrievals → hash (dict/set).
