
# Lesson 75: String Algorithms


---

### Teacher-Student Tasks

In this class, we will learn what string matching (pattern matching) and hash tables are, and then implement some string matching algorithms in Python.


---

#### String Matching Problem

A string is a sequence of characters used to store text data. As with other data types, we need to perform certain operations on strings. Many string algorithms are available for solving string processing problems, particularly finding a given substring within a string, also known as **pattern matching**.

Let us discuss some of the pattern matching algorithms.

**Pattern Matching:**


- Pattern matching or string matching is used to search a string within another string.
- The algorithm returns the index position where the pattern is matched in a given string. 
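
To make the goal concrete, here is a small sketch of what a pattern matching routine is expected to return, using Python's built-in `str.find()` method as a reference. The variable names are only illustrative; the algorithms in this lesson reproduce this behaviour by hand.

```python
text = "ABCDADBABA"
pattern = "ABA"

# str.find() returns the lowest index at which the pattern starts,
# or -1 when the pattern does not occur in the text.
position = text.find(pattern)
missing = text.find("XYZ")

print(position)  # 7
print(missing)   # -1
```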

We will explore the following string matching algorithms:

1. Brute-force algorithm.
2. Rabin-Karp algorithm.
3. Knuth-Morris-Pratt algorithm.


Before that, let us learn about hash tables.




---

#### Task 1: Hash Tables

A hash table is a data structure in which elements are accessed by a key rather than by an index number.

In this data structure, the data items are stored in key/value pairs. A hash table uses a **hashing function** to find an index position where an element should be stored and retrieved. 

Each position in the hash table is often called a **slot** or a **bucket** and will store an element. Each data item in the form of (key, value) pairs would be stored in the hash table at a position that is decided by the hash value of the data. 

We will be creating our own hashing function and hash table.

**Hashing:**

Hashing is the process of generating a slot (index) for a given key. A perfect hash function assigns a unique slot to every key. In practice, a hash function can generate the same index for multiple keys. Increasing the size of the hash table makes such clashes less likely.
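
The effect of the table size can be checked with a short experiment. The sketch below assumes a simple modulo-based hash (`key % table_size`); the helper name `count_collisions` and the sample keys are made up for illustration.

```python
def count_collisions(keys, table_size):
    """Count the keys that land in a slot already used by an earlier key."""
    used_slots = set()
    collisions = 0
    for key in keys:
        slot = key % table_size  # assumed hash function: key modulo table size
        if slot in used_slots:
            collisions += 1
        else:
            used_slots.add(slot)
    return collisions

sample_keys = [10, 20, 25, 35, 42]
print(count_collisions(sample_keys, 10))  # 2 (20 clashes with 10, 35 clashes with 25)
print(count_collisions(sample_keys, 13))  # 0
```

For these particular keys, a table of size `13` happens to give every key its own slot. A larger table does not guarantee this in general; it only makes clashes less likely.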

First of all, let's create a hash table of size `10` with empty data.

In [None]:
# S1.1: Create an empty hash table.
hash_table = [None] * 10
print(hash_table) 

[None, None, None, None, None, None, None, None, None, None]


Below is a simple hash function that returns the remainder obtained when the key is divided by the length of the hash table. In our case, the length of the hash table is 10.

The modulo operator (`%`) is used in the hashing function. It yields the remainder from the division of the first argument by the second.

In [None]:
# S1.2: Define the hashing function
def hash_func(key):
	return key % len(hash_table)

print(hash_func(10)) 
print(hash_func(20)) 
print(hash_func(25)) 

0
0
5


**Inserting Data into Hash Table:**

Here's a simple implementation of inserting data/values into the hash table. We first use the hashing function to generate a slot/index and store the given value into that slot.




In [None]:
# S1.3: Insert Data into hash table
def insert(hash_table, key, value):
	hash_key = hash_func(key)
	hash_table[hash_key] = value 

# Insert 'India' for key '10'
insert(hash_table, 10, 'India')
print (hash_table)

# Insert 'USA' for key '25'
insert(hash_table, 25, 'USA')
print (hash_table)

['India', None, None, None, None, None, None, None, None, None]
['India', None, None, None, None, 'USA', None, None, None, None]


**Collision:**
- A collision occurs when two items/values get the same slot/index, i.e. the hashing function generates same slot number for multiple items.
- If proper collision resolution steps are not taken then the previous item in the slot will be replaced by the new item whenever the collision occurs.

For Example:

In the code above, we have inserted the items `India` and `USA` with keys `10` and `25` respectively. If we try to insert a new item with key `20`, a collision occurs because our hashing function generates slot `0` for key `20`, but slot `0` in the hash table has already been assigned to the item `India`.

In [None]:
# S1.4: Insert 'Nepal' for key '20'
insert(hash_table, 20, 'Nepal')
print (hash_table)

['Nepal', None, None, None, None, 'USA', None, None, None, None]


As you can see, `India` is replaced by `Nepal` as the first item of the hash table because the result of `hash_func()` for keys `10` and `20` is the same (i.e. `0`).

**Collision Resolution:**

There are generally two ways to resolve a collision:

1. Linear Probing
2. Chaining

**1. Linear Probing**

One way to resolve a collision is to find another open slot whenever there is a collision and store the item in that open slot.

The search for an open slot starts from the slot where the collision happened and moves sequentially through the slots until an empty slot is encountered.

The movement is circular: after the last slot, the search wraps around to the first slot, so the entire hash table can be covered. This kind of sequential search is called Linear Probing.
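
Linear probing can be sketched as follows. This is a minimal illustration, not a complete hash table: the names `probe_insert` and `probe_get` are hypothetical, each slot is assumed to hold a `(key, value)` tuple so that keys can be told apart, and deleting items is not handled.

```python
def probe_insert(table, key, value):
    """Store (key, value) in the first open slot, starting at key % len(table)."""
    size = len(table)
    start = key % size
    for step in range(size):
        slot = (start + step) % size  # wrap around to slot 0 after the last slot
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return slot
    raise RuntimeError("hash table is full")

def probe_get(table, key):
    """Follow the same probe sequence to find the value stored for key."""
    size = len(table)
    start = key % size
    for step in range(size):
        slot = (start + step) % size
        if table[slot] is None:
            return None  # reached an empty slot, so the key was never inserted
        if table[slot][0] == key:
            return table[slot][1]
    return None

probe_table = [None] * 10
probe_insert(probe_table, 10, 'India')  # goes to slot 0
probe_insert(probe_table, 25, 'USA')    # goes to slot 5
probe_insert(probe_table, 20, 'Nepal')  # slot 0 is taken, so it goes to slot 1
print(probe_table[:2])                  # [(10, 'India'), (20, 'Nepal')]
print(probe_get(probe_table, 20))       # Nepal
```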

**2. Chaining**

The other way to resolve a collision is Chaining. This allows multiple items to exist in the same slot/index, forming a chain (collection) of items in a single slot. When a collision happens, the new item is added to the chain in the same slot instead of replacing the existing item.

While implementing Chaining in Python, we first create the hash table as a nested list (lists inside a list).

In [None]:
# S1.5: Create hash table as nested list.
hash_table = [[] for _ in range(10)]
print(hash_table)

[[], [], [], [], [], [], [], [], [], []]


The hashing function will be the same as in the example above.

We change the insert function to use the `append()` method, so that each new value is added to the list in its slot instead of replacing what is already there.

In [None]:
# S1.6: Insert values into the hash table using chaining.
def insert(hash_table, key, value):
	hash_key = hash_func(key)
	hash_table[hash_key].append(value)

# Insert 'Nepal' for key '10'
insert(hash_table, 10, 'Nepal')
print(hash_table)

# Insert 'USA' for key '25'
insert(hash_table, 25, 'USA')
print(hash_table)

# Insert 'India' for key '20'
insert(hash_table, 20, 'India')
print (hash_table)

[['Nepal'], [], [], [], [], [], [], [], [], []]
[['Nepal'], [], [], [], [], ['USA'], [], [], [], []]
[['Nepal', 'India'], [], [], [], [], ['USA'], [], [], [], []]


In this way, we can implement chaining using nested lists in Python. 
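
The chaining example above stores only the values, so an item cannot be looked up by its key afterwards. One possible extension, shown here as a sketch, is to store `(key, value)` tuples in each slot. The names `chain_insert`, `chain_get` and `chained_table` are hypothetical and are not part of the lesson code.

```python
def chain_insert(table, key, value):
    """Add (key, value) to the chain in its slot, or update an existing key."""
    bucket = table[key % len(table)]
    for index, (existing_key, _) in enumerate(bucket):
        if existing_key == key:
            bucket[index] = (key, value)  # key already present: overwrite its value
            return
    bucket.append((key, value))

def chain_get(table, key):
    """Search only the chain in the slot that the key hashes to."""
    for existing_key, value in table[key % len(table)]:
        if existing_key == key:
            return value
    return None

chained_table = [[] for _ in range(10)]
chain_insert(chained_table, 10, 'Nepal')
chain_insert(chained_table, 25, 'USA')
chain_insert(chained_table, 20, 'India')

print(chained_table[0])              # [(10, 'Nepal'), (20, 'India')]
print(chain_get(chained_table, 20))  # India
```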

Let us discuss the string matching algorithms one by one.

---

#### Task 2: Brute Force Algorithm

This is a very basic algorithm used for pattern matching. In this algorithm, we try every possible position of the pattern in the input string and compare the characters at each position to determine whether the pattern is present.

Let us try to understand the brute force algorithm with an example.

Suppose we have a string `S = ABCDADBABA` and pattern `P = ABA`. The algorithm needs to determine whether the pattern exists in the string and the index position of the pattern (P) in the string (S) as shown in the image below:


<img src="https://obj.whitehatjr.com/6a4071ec-6382-4c9b-b81c-678ba8f078c5.gif" height=200/>


In the above example,
- The algorithm starts by aligning the pattern (P) with the beginning of the string (S) and comparing them character by character. Thus, the first 3 characters of the string are checked.
- The last character of the pattern does not match the third character of the string.
- Since there is a mismatch, the pattern is shifted by one place.
- Now, the first character of the pattern (P) is compared with the second character of the string (S).
- In this way, the characters of the string (S) are repeatedly compared with the characters of the pattern (P) until the pattern is found.
- In this example, the pattern is found at index position `7` in the string.

Let us create a Python function to implement Brute Force algorithm.






In [None]:
# S2.1: Create a function to implement brute force algorithm
def brute_force(text, sub_str):
    # Try every starting position where the pattern can still fit
    for i in range(len(text) - len(sub_str) + 1):
        index = i
        for j in range(len(sub_str)):
            if text[index] == sub_str[j]:
                index += 1
            else:
                break
            # All characters of the pattern have matched
            if index - i == len(sub_str):
                return i
    return -1


In [None]:
# S2.2: Look for pattern 'ABA' in the string 'ABCDADBABA' using brute force algorithm
brute_force("ABCDADBABA","ABA")

7

---

#### Task 3: Rabin-Karp Algorithm

The Rabin-Karp algorithm is an improved version of the brute-force approach. It reduces the number of character comparisons by first comparing the hash values of the substrings and the pattern. The algorithm works as follows:

1. First, the hash value of the pattern of length `p` and the hash values of all the possible substrings of length `p` are determined by using a hash function.

   Thus, the total number of substrings would be `(s-p+1)`,
   where `s` is the length of the input string.

2. The hash value of each substring is compared with the hash value of the pattern one by one.

3. If the hash values do not match, then the pattern is moved by one position.

4. If the hash values match, then the substring and the pattern are compared character by character to ensure that the pattern actually exists in the input string, because two different strings can have the same hash value.
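
Computing every substring hash from scratch would cost as much as brute force, so Rabin-Karp is usually implemented with a rolling hash: the hash of the next window is derived from the hash of the current one. The sketch below checks that the rolled values agree with hashes computed directly. The names `BASE`, `PRIME`, `direct_hash` and `roll` are chosen for illustration, and the base `256` and prime `101` are assumed values.

```python
BASE = 256   # assumed alphabet size
PRIME = 101  # assumed prime modulus

def direct_hash(window):
    """Hash a window from scratch as a base-256 number modulo PRIME."""
    value = 0
    for ch in window:
        value = (BASE * value + ord(ch)) % PRIME
    return value

def roll(old_hash, outgoing, incoming, high_weight):
    """Remove the leftmost character and append a new rightmost character."""
    return (BASE * (old_hash - ord(outgoing) * high_weight) + ord(incoming)) % PRIME

text = "NEW YORK NEW DELHI"
width = 3
high_weight = pow(BASE, width - 1, PRIME)  # weight of the leftmost character

rolling = direct_hash(text[:width])
rolled_hashes = [rolling]
for start in range(1, len(text) - width + 1):
    rolling = roll(rolling, text[start - 1], text[start + width - 1], high_weight)
    rolled_hashes.append(rolling)

direct_hashes = [direct_hash(text[s:s + width]) for s in range(len(text) - width + 1)]
print(rolled_hashes == direct_hashes)  # True
```

Each rolling update takes a constant number of operations, regardless of the pattern length.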

<img src="https://obj.whitehatjr.com/cadaff1d-326d-4a18-827d-903c2005d47d.png" />








In [None]:
# S3.1: Create a function to implement Rabin-Karp algorithm

num = 256  # num is the number of characters in the input alphabet
def rabin_karp(pattern, input_string, p_num):  # p_num -> A prime number
  pattern_length = len(pattern)
  string_length = len(input_string)
  # Nothing to search for if the pattern is empty or longer than the input string
  if pattern_length == 0 or pattern_length > string_length:
    return
  hash_pattern = 0  # hash value for pattern
  hash_string = 0   # hash value for the current window of input_string
  h = 1

  # The value of h would be "pow(num, pattern_length-1) % p_num"
  for i in range(pattern_length-1):
    h = (h * num) % p_num

  # Calculate the hash value of pattern and first window of input_string
  for i in range(pattern_length):
    hash_pattern = (num * hash_pattern + ord(pattern[i])) % p_num
    hash_string = (num * hash_string + ord(input_string[i])) % p_num

  # Slide the window over input_string one position at a time
  for i in range(string_length-pattern_length + 1):
    # Compare characters one by one only if the hash values match
    if hash_pattern == hash_string:
      for j in range(pattern_length):
        if input_string[i + j] != pattern[j]:
          break
      else:
        # The inner loop finished without a break, so every character matched
        print("Pattern found at index " + str(i))

    # Calculate hash value for next window of input_string
    if i < string_length-pattern_length:
      hash_string = (num*(hash_string-ord(input_string[i])*h) + ord(input_string[i + pattern_length])) % p_num


In [None]:
# S3.2: Look for pattern 'NEW' in the string 'NEW YORK NEW DELHI' using Rabin-Karp algorithm

rabin_karp("NEW", "NEW YORK NEW DELHI", 101)

Pattern found at index 0
Pattern found at index 9


As you can see, the pattern `NEW` occurs at indexes `0` and `9`.

Hence, we have successfully identified the required pattern using the Rabin-Karp algorithm. 
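
The character-by-character check after a hash match matters because different substrings can share a hash value. The sketch below counts such "spurious hits" for a given modulus. The helper `count_hits` is hypothetical, and the modulus `2` is chosen deliberately small to force many clashes.

```python
def count_hits(pattern, text, prime, base=256):
    """Return (true_hits, spurious_hits) among the windows whose hash matches."""
    def window_hash(window):
        value = 0
        for ch in window:
            value = (base * value + ord(ch)) % prime
        return value

    width = len(pattern)
    target = window_hash(pattern)
    true_hits = 0
    spurious_hits = 0
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        if window_hash(window) == target:
            if window == pattern:
                true_hits += 1
            else:
                spurious_hits += 1  # same hash, different characters
    return true_hits, spurious_hits

print(count_hits("NEW", "NEW YORK NEW DELHI", 2))  # (2, 6)
```

With modulus `2`, six windows have the same hash as `NEW` without being equal to it. A larger prime makes spurious hits rarer, but the final comparison is still needed for a correct result.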



---

#### Task 4: Knuth-Morris-Pratt Algorithm

The **Knuth-Morris-Pratt (KMP)** algorithm is faster than the brute-force algorithm because, on a mismatch, it can shift the pattern by several positions at once rather than by one position every time. Thus, it minimizes the number of comparisons between the pattern and the input string.

The algorithm uses a preprocessed table called the "Prefix Table" to determine how far the pattern should be shifted whenever there is a mismatch.


**Prefix Table:**

The prefix table is also known as the lps table, where lps stands for the longest proper prefix which is also a suffix. A proper prefix is a prefix that is not equal to the string itself.

**For example:** The proper prefixes of `"abc"` are `""`, `"a"` and `"ab"`, but not `"abc"`.

The KMP algorithm uses the lps values to avoid comparing characters that are already known to match.

For each sub-pattern `pattern[0..i]`, where `i` goes from `0` to `m-1` and `m` is the length of the pattern, `lps[i]` stores the length of the longest proper prefix of `pattern[0..i]` that is also a suffix of `pattern[0..i]`.
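
To see what the lps values look like, the table can be computed directly from its definition. The sketch below is deliberately slow and simple, and is meant only for checking small examples by hand; the name `lps_by_definition` is hypothetical, and the sample pattern is made up.

```python
def lps_by_definition(pattern):
    """For every pattern[0..i], find the longest proper prefix that is also a suffix."""
    table = []
    for end in range(1, len(pattern) + 1):
        sub_pattern = pattern[:end]
        longest = 0
        # A proper prefix must be shorter than the sub-pattern itself
        for length in range(1, end):
            if sub_pattern[:length] == sub_pattern[-length:]:
                longest = length
        table.append(longest)
    return table

print(lps_by_definition("ABABCABAB"))  # [0, 0, 1, 2, 0, 1, 2, 3, 4]
```

For instance, the last value is `4` because `ABAB` is both a proper prefix and a suffix of `ABABCABAB`.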

Let's create a function that finds the pattern using the Knuth-Morris-Pratt algorithm with the help of the steps below:

1. Start with `i = 0` and `j = 0`, where `i` is the index into the input string and `j` is the index into the pattern.

2. Compare the characters `string[i]` and `pattern[j]`.

3. Keep incrementing `i` and `j` as long as `string[i]` and `pattern[j]` match. When `j` reaches the length of the pattern, a match has been found at index `i-j`.

4. When there is a mismatch, the characters in `pattern[0..j-1]` are already known to match `string[i-j…i-1]`. If `j` is not `0`, set `j = lps[j-1]` and continue comparing from there without moving `i` back; otherwise, increment `i`.

**Note**: It is not necessary to compare the first `lps[j-1]` characters of the pattern with the string again, as these characters will match in any case.





In [None]:
# S4.1: Create a function to implement KMP algorithm
def KMP(pattern, string):
	pattern_length = len(pattern)
	string_length = len(string)

	# Create lps[] that will hold the longest prefix suffix values
	lps = [0]* pattern_length
	j = 0 # index for pattern[]

	# Preprocess the pattern
	find_lps(pattern, pattern_length, lps)

	i = 0 # index for string[]
	while i < string_length:
		if pattern[j] == string[i]:
			i += 1
			j += 1

		if j == pattern_length:
			print ("Found pattern at index " + str(i-j))
			j = lps[j-1]

		# mismatch after j matches
		elif i < string_length and pattern[j] != string[i]:
		
			if j != 0:
				j = lps[j-1]
			else:
				i += 1

def find_lps(pattern, pattern_length, lps):
	lps_len = 0 # length of the previous lps

	lps[0] = 0 # lps[0] is always 0
	i = 1

	# the loop calculates lps[i] for i = 1 to pattern_length-1
	while i < pattern_length:
		if pattern[i]== pattern[lps_len]:
			lps_len += 1
			lps[i] = lps_len
			i += 1
		else:		
			if lps_len != 0:
				lps_len = lps[lps_len-1]	
        		
			else:
				lps[i] = 0
				i += 1

In [None]:
# S4.2: Look for pattern 'YO' in the string 'NEW YORK NEW DELHI' using the KMP algorithm
string = "NEW YORK NEW DELHI"
pattern = "YO"
KMP(pattern, string)

Found pattern at index 4


As you can see, the pattern `YO` occurs at index `4`.

We will stop here. In the next class, we will learn to perform geometric computational tasks using geometric algorithms.

---