##  Tutorial Topic: Boyer Moore String Algorithm

### Name: MOHIT DASWANI

### NUID: 002102079

String matching is an important task in computer science, with applications ranging from database search to genetics. Many algorithms accomplish this task, including the naive (brute-force) approach, the KMP algorithm, string matching with finite automata, and the Boyer-Moore algorithm. This article focuses on the Boyer-Moore algorithm.

Boyer-Moore is an algorithm for finding occurrences of a substring (the pattern) in a string (the text). It aligns the pattern with the text and compares characters one at a time. When characters do not match, the pattern is shifted forward by the value indicated in a precomputed table (the Bad Match Table) instead of moving by a single position.

Boyer-Moore combines two approaches: the bad character heuristic and the good suffix heuristic. Each of them can independently propose a shift for the pattern.

Preprocessing the pattern produces a separate array for each heuristic, and at each step the algorithm uses whichever heuristic proposes the larger shift. Boyer-Moore starts matching the pattern from its last character, which is a different approach from KMP and the naive algorithm.

In other words, the algorithm scans the characters of the pattern from right to left, beginning with the rightmost character.
When testing a possible placement of pattern P against text T, a mismatch is handled as follows:
1. Let T be the text and P the pattern that we need to find.
2. Assume that the text character currently being compared is T[i] = c and the corresponding pattern character is P[j].
3. If c does not occur anywhere in P, shift the pattern P completely past T[i]. Otherwise, shift P until an occurrence of character c in P is aligned with T[i]. This technique avoids needless comparisons by shifting the pattern relative to the text.
4. The best-case time complexity is O(n/m), where n is the length of the text and m is the length of the pattern. The worst-case complexity of this algorithm is O(n*m).
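
The mismatch rule in step 3 can be sketched as a small helper. This is a minimal illustration, not a full search: the function name `bad_character_shift` and the use of 0-based indices are assumptions made for the example.

```python
def bad_character_shift(pattern, j, c):
    """Shift proposed by the bad character rule.

    j is the 0-based index in the pattern where the mismatch happened,
    c is the text character that caused it.
    """
    # Rightmost index of every character that occurs in the pattern.
    last = {ch: idx for idx, ch in enumerate(pattern)}
    # If c is absent, last.get returns -1 and the pattern moves past c.
    # max(1, ...) guards against a zero or negative shift.
    return max(1, j - last.get(c, -1))


print(bad_character_shift("STING", 4, "R"))  # 5: R is not in the pattern
print(bad_character_shift("STING", 4, "S"))  # 4: align the S's
print(bad_character_shift("STING", 4, "T"))  # 3: align the T's
```

A real implementation would build the `last` table once per pattern rather than on every call.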

### Example
    pattern = "STING"
    string = "A STRING SEARCHING EXAMPLE CONSISTING OF TEXT"

    try to match first m characters
    STING
    A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
    This fails. Slide pattern right to look for other matches.
    Note that R does not occur in pattern. So can slide it past R.
    You may be starting to guess that this method is for large alphabets (e.g. human text)
    while KMP is good for small alphabets (where one could rarely ever do this kind of sliding).
         STING
    A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
    Fails again.
    Rightmost character S is in pattern precisely once, so slide until two S's line up.
             STING
    A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
    No C in pattern. Slide past it.
                  STING
    A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
    No space in pattern. Slide past it.
                       STING
    A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
    No P in pattern. Slide past it.
                            STING
    A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
    No O in pattern. Slide past it.
                                 STING
    A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
    Rightmost char T. Exactly one T in pattern. Slide to align them.
                                    STING
    A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
    match
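
The walkthrough above can be reproduced with a short trace. This is a sketch that uses only the bad character rule and stops at the first match; the function name `trace_alignments` is hypothetical.

```python
def trace_alignments(text, pattern):
    """Return the shifts tried, in order, until the first match."""
    n, m = len(text), len(pattern)
    last = {ch: idx for idx, ch in enumerate(pattern)}  # rightmost positions
    tried = []
    s = 0
    while s <= n - m:
        tried.append(s)
        j = m - 1
        # Compare right to left.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            break  # full match at shift s
        s += max(1, j - last.get(text[s + j], -1))
    return tried


text = "A STRING SEARCHING EXAMPLE CONSISTING OF TEXT"
pattern = "STING"
shifts = trace_alignments(text, pattern)
print(shifts)  # [0, 5, 9, 14, 19, 24, 29, 32]
for s in shifts:
    print(" " * s + pattern)
print(text)
```

Only eight alignments are tried on a text of 45 characters, which is the sliding behaviour the example describes.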

These two strategies are called the heuristics of Boyer-Moore because they are used to reduce the search effort. They are:

* Bad Character Heuristics
* Good Suffix Heuristics

### 1. Bad Character Heuristics
This heuristic has two implications:

* Suppose a character of the text does not occur in the pattern at all. When a mismatch happens at this character (called the bad character), the whole pattern can be shifted past it, and matching begins again from the substring next to the bad character.
* On the other hand, if the bad character is present in the pattern, shift the pattern so that an occurrence of that character in the pattern is aligned with the bad character in the text.

Methods vary on the exact form the table for the bad character rule should take, but a simple constant-time lookup solution is as follows: create a 2D table which is indexed first by the index of the character c in the alphabet and second by the index i in the pattern. This lookup returns the occurrence of c in P with the next-highest index j < i, or -1 if there is no such occurrence. The proposed shift is then i - j, with O(1) lookup time and O(km) space, assuming a finite alphabet of size k.
    
   ![title](https://static.javatpoint.com/tutorial/daa/images/boyer-moore-algorithm.png)

Example 2: the bad character does not occur in the pattern.

   ![title](https://static.javatpoint.com/tutorial/daa/images/boyer-moore-algorithm2.png)
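
The constant-time 2D lookup table described in this section could be built as follows. This is a sketch under assumptions: the table is a dictionary keyed by character (only characters that occur in the pattern get a row), indices are 0-based, and the names `build_bad_character_table` and `lookup_shift` are made up for the example.

```python
def build_bad_character_table(pattern):
    """table[c][i] = largest j < i with pattern[j] == c, or -1 if none."""
    m = len(pattern)
    table = {c: [-1] * m for c in set(pattern)}
    for i in range(1, m):
        # Start from the answers for position i - 1 ...
        for row in table.values():
            row[i] = row[i - 1]
        # ... then record that pattern[i - 1] now occurs left of i.
        table[pattern[i - 1]][i] = i - 1
    return table


def lookup_shift(table, c, i):
    """Shift i - j for a mismatch at pattern index i on text character c."""
    row = table.get(c)
    j = row[i] if row is not None else -1
    return i - j


table = build_bad_character_table("ABAB")
print(table["A"])                   # [-1, 0, 0, 2]
print(table["B"])                   # [-1, -1, 1, 1]
print(lookup_shift(table, "A", 3))  # 1
print(lookup_shift(table, "C", 3))  # 4, since C is not in the pattern
```

Because the lookup only considers occurrences strictly to the left of the mismatch, the proposed shift is always at least 1.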


### Problem in Bad-Character Heuristics:

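In some cases, the bad character heuristic proposes a negative shift. This happens when the rightmost occurrence of the bad character in the pattern lies to the right of the mismatch position.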

For Example:


   ![title](https://static.javatpoint.com/tutorial/daa/images/boyer-moore-algorithm3.png)
   
This means that we need some extra information to produce a shift on encountering a bad character: the last position of every character in the pattern, and the set of characters used in the pattern (often called the alphabet ∑ of the pattern).
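
A minimal sketch of the last-occurrence table, and of how a negative shift arises from it, is shown below. The function name `compute_last_occurrence`, the 0-based indices, and the fallback of shifting by one position are assumptions for illustration; in the full algorithm the good suffix rule supplies the positive shift instead.

```python
def compute_last_occurrence(pattern):
    """Map each character of the pattern to its rightmost 0-based index."""
    last = {}
    for idx, ch in enumerate(pattern):
        last[ch] = idx  # later occurrences overwrite earlier ones
    return last


pattern = "ABA"
window = "AAA"  # part of the text currently aligned with the pattern
last = compute_last_occurrence(pattern)

# Compare right to left: index 2 matches (A == A), index 1 does not (B != A).
j = 1
bad_char = window[j]
raw_shift = j - last.get(bad_char, -1)
print(raw_shift)          # -1: the rightmost A is to the right of index 1
print(max(1, raw_shift))  # 1: a simple guard against moving backwards
```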

### 2. Good Suffix Heuristics:


The good suffix rule is markedly more complex in both concept and implementation than the bad character rule. Like the bad character rule, it exploits the fact that comparisons begin at the end of the pattern and proceed towards the pattern's start. It can be described as follows:

Suppose that for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next comparison to the left.
   1. Then find, if it exists, the right-most copy t' of t in P such that t' is not a suffix of P and the character to the left of t' in P differs from the character to the left of t in P. Shift P to the right so that substring t' in P aligns with substring t in T.
   2. If t' does not exist, then shift the left end of P past the left end of t in T by the least amount so that a prefix of the shifted pattern matches a suffix of t in T.
   3. If no such shift is possible, then shift P by m (the length of P) places to the right.
   4. If an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T.
   5. If no such shift is possible, then shift P by m places, that is, shift P past t.


For Example:


   ![title](https://static.javatpoint.com/tutorial/daa/images/boyer-moore-algorithm4.png)
  
***
    COMPUTE-GOOD-SUFFIX-FUNCTION (P, m)
        Π ← COMPUTE-PREFIX-FUNCTION (P)
        P' ← reverse (P)
        Π' ← COMPUTE-PREFIX-FUNCTION (P')
        for j ← 0 to m
            do ɣ[j] ← m - Π[m]
        for l ← 1 to m
            do j ← m - Π'[l]
               if ɣ[j] > l - Π'[l]
                   then ɣ[j] ← l - Π'[l]
        return ɣ
***
  
  
***
    BOYER-MOORE-MATCHER (T, P, ∑)
        n ← length[T]
        m ← length[P]
        λ ← COMPUTE-LAST-OCCURRENCE-FUNCTION (P, m, ∑)
        ɣ ← COMPUTE-GOOD-SUFFIX-FUNCTION (P, m)
        s ← 0
        while s ≤ n - m
            do j ← m
               while j > 0 and P[j] = T[s + j]
                   do j ← j - 1
               if j = 0
                   then print "Pattern occurs at shift" s
                        s ← s + ɣ[0]
                   else s ← s + max(ɣ[j], j - λ[T[s + j]])
 ***
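
The two pseudocode procedures above can be translated to Python roughly as follows. This is a sketch, not a tuned implementation. It keeps the pseudocode's 1-based positions for `j`, `pi`, `gamma` and `lam` while indexing Python strings from 0, and it assumes the prefix function is the one used by KMP. The helper `compute_last_occurrence` stores only characters that occur in the pattern and treats every other character as position 0.

```python
def compute_prefix_function(p):
    """pi[q] = length of the longest proper prefix of p[:q] that is also its suffix."""
    m = len(p)
    pi = [0] * (m + 1)  # pi[0] is unused padding so that q runs from 1 to m
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi


def compute_good_suffix_function(p):
    """gamma[j] = shift to use after a mismatch at 1-based position j (j = 0: full match)."""
    m = len(p)
    pi = compute_prefix_function(p)
    pi_rev = compute_prefix_function(p[::-1])
    gamma = [m - pi[m]] * (m + 1)
    for l in range(1, m + 1):
        j = m - pi_rev[l]
        gamma[j] = min(gamma[j], l - pi_rev[l])
    return gamma


def compute_last_occurrence(p):
    """lam[c] = rightmost 1-based position of c in p (absent characters count as 0)."""
    return {ch: idx + 1 for idx, ch in enumerate(p)}


def boyer_moore_matcher(text, pattern):
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    lam = compute_last_occurrence(pattern)
    gamma = compute_good_suffix_function(pattern)
    shifts = []
    s = 0
    while s <= n - m:
        j = m
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1
        if j == 0:
            shifts.append(s)
            s += gamma[0]
        else:
            # gamma[j] is always at least 1, so the shift is positive even
            # when the bad character term is zero or negative.
            s += max(gamma[j], j - lam.get(text[s + j - 1], 0))
    return shifts


print(boyer_moore_matcher("A STRING SEARCHING EXAMPLE CONSISTING OF TEXT", "STING"))  # [32]
print(boyer_moore_matcher("abababab", "abab"))  # [0, 2, 4]
```

Taking the maximum of the two proposed shifts is what removes the negative-shift problem of the bad character heuristic.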
 
 
 ### The code for the Boyer-Moore algorithm is as follows:
 
   ![code](https://miro.medium.com/max/1400/1*iTpVEuELCDl1udyKAhrMjA.jpeg)
   
 ### Link for Playground with the code and example:
 
 <a href="https://colab.research.google.com/drive/1GEWxxQ-hhmDA3oWOaMbeFKKwh4Wjd5qi?usp=sharing" target="_blank">https://colab.research.google.com/drive/1GEWxxQ-hhmDA3oWOaMbeFKKwh4Wjd5qi?usp=sharing</a>
 
   
 ### Analysis
 
If there are only a constant number of matches of the pattern in the text, the Boyer-Moore searching algorithm performs O(n) comparisons in the worst case. The proof of this is rather difficult.

In general, Θ(n·m) comparisons are necessary, e.g. if the pattern is a^m and the text is a^n (the letter a repeated m and n times). With a slight modification of the algorithm, the number of comparisons can be bounded by O(n) even in the general case.

If the alphabet is large compared to the length of the pattern, the algorithm performs O(n/m) comparisons on average. This is because a shift by m is often possible due to the bad character heuristic.
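
These bounds can be made concrete by counting character comparisons on two extreme inputs. The sketch below uses only the bad character rule and shifts by one after a match, which is a simplifying assumption; the function name `count_comparisons` is hypothetical.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made by a bad-character-only search."""
    n, m = len(text), len(pattern)
    last = {ch: idx for idx, ch in enumerate(pattern)}
    comparisons = 0
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if pattern[j] != text[s + j]:
                break
            j -= 1
        if j < 0:
            s += 1  # match found, move on by one position
        else:
            s += max(1, j - last.get(text[s + j], -1))
    return comparisons


n, m = 1000, 10
# Pattern a^m in text a^n: every alignment matches, about n*m comparisons.
print(count_comparisons("a" * n, "a" * m))  # 9910
# No text character occurs in the pattern: one comparison per block of m.
print(count_comparisons("b" * n, "a" * m))  # 100
```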

 ### Conclusion

* In typical use the running time is O(n), and execution can actually be sub-linear: the algorithm does not need to check every character of the text but skips over some of them (it checks the right-most character of a block of m first, and if that character is not in the pattern it can skip the entire rest of the block). Without modification, the worst case remains O(n·m).
* Best-case performance is O(n/m). In the best case, only one in m characters needs to be checked.
* The algorithm actually works better (on average) as the pattern length m grows.
* The Boyer-Moore-Horspool algorithm is a simplification that uses only the bad character rule.
    
 ### Variations
 
 
The Boyer–Moore–Horspool algorithm is a simplification of the Boyer–Moore algorithm using only the bad character rule.

The Apostolico–Giancarlo algorithm speeds up the process of checking whether a match has occurred at the given alignment by skipping explicit character comparisons. This uses information gleaned during the pre-processing of the pattern in conjunction with suffix match lengths recorded at each match attempt. Storing suffix match lengths requires an additional table equal in size to the text being searched.

The Raita algorithm improves the performance of the Boyer-Moore-Horspool algorithm. It differs in the order of comparisons within each alignment: it compares the last character of the pattern first, then the first character, then the middle character, and only then the remaining characters.
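
As an illustration of the first variation, a Boyer-Moore-Horspool search might look like the sketch below. It always shifts according to the text character aligned with the last pattern position, whatever position the mismatch occurred at. The function name `horspool_search` and the use of a slice comparison for the window check are choices made for brevity.

```python
def horspool_search(text, pattern):
    """Return every shift at which pattern occurs in text (Horspool variant)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character (excluding the
    # last pattern position) to the end of the pattern.
    shift = {ch: m - 1 - idx for idx, ch in enumerate(pattern[:-1])}
    matches = []
    s = 0
    while s <= n - m:
        if text[s:s + m] == pattern:
            matches.append(s)
        # Characters not in the table allow a full shift by m.
        s += shift.get(text[s + m - 1], m)
    return matches


print(horspool_search("A STRING SEARCHING EXAMPLE CONSISTING OF TEXT", "STING"))  # [32]
```

Dropping the good suffix table keeps preprocessing to a single pass over the pattern, at the cost of the weaker worst-case behaviour discussed in the analysis.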
   
 ### References 
 
AHO, A.V., 1990. Algorithms for finding patterns in strings. In Handbook of Theoretical Computer Science, Volume A, Algorithms and Complexity, J. van Leeuwen ed., Chapter 5, pp. 255–300, Elsevier, Amsterdam.

GONNET, G.H., BAEZA-YATES, R.A., 1991. Handbook of Algorithms and Data Structures in Pascal and C, 2nd Edition, Chapter 7, pp. 251–288, Addison-Wesley Publishing Company.

KNUTH, D.E., MORRIS (Jr) J.H., PRATT, V.R., 1977. Fast pattern matching in strings, SIAM Journal on Computing 6(1):323–350.
    
    
<a href="https://www-igm.univ-mlv.fr/~lecroq/string/node14.html" target="_blank">https://www-igm.univ-mlv.fr/~lecroq/string/node14.html</a>

<a href="https://www.javatpoint.com/daa-boyer-moore-algorithm" target="_blank">https://www.javatpoint.com/daa-boyer-moore-algorithm</a>

<a href="https://www.inf.hs-flensburg.de/lang/algorithmen/pattern/bmen.htm" target="_blank">https://www.inf.hs-flensburg.de/lang/algorithmen/pattern/bmen.htm</a>

