<h1>Regular expressions/State machines</h1>

<h2>Last Time: checking substrings</h2>

In the previous lecture, we investigated some algorithms that implement the Python expression "B in A", where B and A are strings. In this lecture, we will generalize these algorithms to determine whether a string matches a certain kind of pattern, known as a regular expression.

<h2>Pattern Matching</h2>
Suppose that we know about a whole family of genetic diseases that have a particular form. For example, let's define 'lucky' genes to be those that satisfy all of the following.

1. The first letter is either 'a' or 'c'
2. Then there are zero or more letters, each of which is 'g' or 't'
3. The last letters of the gene are 'ccc'

Now, let's say that we have a gene $g$ and want to know if it is a lucky gene. We can do this very easily by looping through the letters of $g$ once, in order. We also don't need to keep track of any variables beyond a fixed handful of flags that record where we are in the pattern. The naive code is below.

In [9]:
def is_lucky(g):
    reading_first_char = True  # currently reading the first character
    read_last_3char = False  # have read the 3rd-to-last character
    read_last_2char = False  # have read the 2nd-to-last character
    read_last_1char = False  # have read the last character

    for x in g:
        if reading_first_char:
            if x == 'a' or x == 'c':
                reading_first_char = False
            else:
                return False
        elif not (read_last_3char or read_last_2char or read_last_1char):  # we have not started reading the ccc at the end
            if x == 'g' or x == 't':
                pass
            elif x == 'c':
                read_last_3char = True  # x must have been the 3rd-to-last character
            else:
                return False  # any other letter (such as 'a') is not allowed here
        elif read_last_3char and not read_last_2char:
            if x == 'c':
                read_last_2char = True
            else:
                return False
        elif read_last_2char and not read_last_1char:
            if x == 'c':
                read_last_1char = True
            else:
                return False
        elif read_last_1char:
            return False  # if we had really read the last character, we wouldn't be here
    # accept only if the final 'c' was the last letter read
    return read_last_1char

def test_is_lucky():
    assert is_lucky("agtgtttgtgggtccc")
    assert not is_lucky("agtgtttgtgggtcccc")
    assert is_lucky("cccc")
    assert is_lucky("agtgggtccc")
test_is_lucky()


In this lecture, we will investigate the properties of strings that can be determined in this way. This theory developed from linguistics, so we call a set of strings a <i>language</i>. Chomsky developed this theory to support his thesis that natural language is inherently recursive, and so cannot be captured by nonrecursive machines. [Reference](https://twiki.di.uniroma1.it/pub/LC/WebHome/chomsky1956.pdf)

The set of lucky strings is a language with a special property: its members can be identified by simply looping through the string once, without keeping track of variables. Such languages are called [<i>regular languages</i>](https://en.wikipedia.org/wiki/Regular_language).


<h2>Regular Expressions</h2>
Regular expressions are an alternate way to formulate regular languages.

First, we define some operations on regular languages:

If L, A and B are languages, then 
1. (Kleene Star) L* is the language whose words are concatenations of zero or more words of L. In particular, L* always contains the empty string.
2. (Union) A|B is the language whose words are in A or in B.
3. (Concatenation) AB is the language whose words are the concatenation of a word in A followed by a word in B.
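The three operations can be tried out on small finite languages represented as Python sets of strings. The following is only a sketch: the function names `union`, `concat`, and `star` are made up for illustration, and since L* is usually infinite, `star` takes an assumed cut-off `max_words` and returns only a finite slice of it.

```python
def union(A, B):
    """Words that are in A or in B."""
    return A | B

def concat(A, B):
    """Every word of A followed by every word of B."""
    return {a + b for a in A for b in B}

def star(L, max_words):
    """Concatenations of at most max_words words of L (a finite slice of L*)."""
    result = {""}  # concatenating zero words gives the empty string
    layer = {""}
    for _ in range(max_words):
        layer = concat(layer, L)  # one more word of L appended
        result |= layer
    return result

# Build a finite sample of the lucky genes from the rules above.
first = union({"a"}, {"c"})             # first letter is 'a' or 'c'
middle = star(union({"g"}, {"t"}), 2)   # at most two letters from {'g', 't'}
lucky_sample = concat(concat(first, middle), {"ccc"})
print(sorted(lucky_sample))
```

Raising `max_words` produces longer lucky genes, but no finite cut-off yields the whole language.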

 A <i>regular expression</i> is an expression that can be formed in the following way:
 1. The empty string is a regular expression.
 2. Letters ('a','t','g','c') are regular expressions.
 3. If R is a regular expression, then R* is a regular expression.
 4. If A and B are regular expressions, then $A|B$ and $AB$ are also regular expressions.

We can also add parentheses to make the order of application clearer.

The regular expressions are just formal sequences of characters. They represent languages according to the interpretations of operations. If a word is a member of the language that the regular expression represents, we say that the word <i>matches</i> the regular expression.

For example, the lucky genes are represented by the regular expression (a|c)(g|t)*ccc, and any particular lucky gene matches the regular expression. In Python syntax, this can be written as [ac][gt]*ccc.

 Kleene's Theorem:
 Each regular expression represents a regular language, and each regular language has a regular expression that represents it.
 



We can use regular expressions to simplify our code that checks whether a gene is lucky.

In [10]:
#https://stackoverflow.com/questions/12595051/check-if-string-matches-pattern see Ali Rizavi's answer.
import re

def regex_is_lucky(txt):
    # fullmatch succeeds only if the entire string matches the pattern
    return re.fullmatch(r"[ac][gt]*ccc", txt) is not None

def test_regex_is_lucky():
    assert regex_is_lucky("agtgtttgtgggtccc")
    assert not regex_is_lucky("agtgtttgtgggtcccc")
    assert regex_is_lucky("cccc")
    assert regex_is_lucky("agtgggtccc")
test_regex_is_lucky()

<h2> State Machines </h2>

We demonstrated that lucky genes can be identified by a single pass through the gene, without keeping track of any variables. We now formalize this sort of program.

A (finite) state machine for a given alphabet $\Sigma$ (here, we stick to $\Sigma = \{a,t,g,c\}$) consists of the following information:
1. A finite set $S$, whose elements are called <i>states</i>
2. A special state $s_0\in S$, called the <i>starting state</i>
3. A subset $S_a\subseteq S$, whose elements are called the <i>accept states</i>
4. A function $T:\Sigma \times S \to S$ called the <i>transition function</i>
5. A current state, $c$ (which can change). We say that the machine is "in" the current state.

We can visualize state machines as directed graphs, where the vertices are the states $S$ and the directed edges are labeled by the letters of $\Sigma$. We draw an edge from state $s_1$ to state $s_2$ labeled by $x\in \Sigma$ if $T(x,s_1)=s_2$.

Each state machine specifies a language. The machine reads a sequence of letters using the following procedure, which returns either "accept" or "reject":

Initially, the machine begins in the starting state, so $c = s_0$. When the machine reads the next letter $x$ of the string, it updates the current state to $c \coloneqq T(x,c)$. The machine reads each letter of the string until it reaches the end. The machine returns "accept" if the last state is an accept state (in other words, if $c\in S_a$ after all of the letters have been read). Otherwise, the machine returns "reject". The language specified by the machine is the set of strings that it accepts.

<img src="state_machine.png" style="background-color:white;">

Here, we show an illustration of a state machine that recognizes lucky genes. There is only one accept state, and it is denoted by the double circles.
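The procedure above can be sketched in a few lines of Python. The state names here (`start`, `middle`, `c1`, `c2`, `c3`, `dead`) are assumptions chosen for illustration and may not match the labels in the picture. The transition function $T$ is stored as a dictionary keyed by (letter, state), and any pair missing from the dictionary is sent to an assumed non-accepting `dead` state.

```python
# Hypothetical state names for a machine recognizing lucky genes.
START, MIDDLE, C1, C2, C3, DEAD = "start", "middle", "c1", "c2", "c3", "dead"
ACCEPT = {C3}  # the single accept state

# Transition function T(x, c): (letter, current state) -> next state.
T = {
    ("a", START): MIDDLE, ("c", START): MIDDLE,   # first letter is 'a' or 'c'
    ("g", MIDDLE): MIDDLE, ("t", MIDDLE): MIDDLE, # zero or more 'g' or 't'
    ("c", MIDDLE): C1,                            # first 'c' of the final 'ccc'
    ("c", C1): C2,
    ("c", C2): C3,
}

def run(word):
    """Return True if the machine accepts the word."""
    c = START
    for x in word:
        # Transitions not listed above lead to the dead state,
        # from which no accept state can be reached.
        c = T.get((x, c), DEAD)
    return c in ACCEPT

print(run("agtgggtccc"), run("agtgtttgtgggtcccc"))
```

Note that the loop stores nothing except the current state, which is exactly the sense in which no variables are needed.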

<h2>Properties of regular languages</h2>

From Kleene's Theorem, it is easy to prove the following. Suppose that A and B are regular languages. Then the following languages are also regular:
1. $\Sigma^\star - A$ (the complement of $A$): 
    - Proof idea: Take the finite state machine for $A$ and switch all of the accept and non-accept states.
2. $A\cap B$: 
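    - Proof idea: Create a finite state machine whose states are pairs $(p,q)$, where $p$ is a state of the machine for $A$ and $q$ is a state of the machine for $B$. Reading a letter updates both components at once. The accept states are the pairs in which both components are accept states.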
3. $A^R$ the language whose words are those of $A$ but reversed: 
    - Proof idea: Reverse all of the arrows in the machine for $A$, and swap the roles of the starting state and the accept states. The resulting finite state machine may be nondeterministic (two arrows labeled 'a' can leave the same node, and there may be several starting states). Use the well-known theorem that nondeterministic state machines are equivalent to deterministic ones.
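The first two proof ideas can be sketched directly in code. In this sketch a machine is assumed to be a tuple `(states, start, accept, T)` with a transition dictionary defined for every (letter, state) pair. The helper names and the two example machines (`ends_c` and `even_a`) are hypothetical and chosen only to keep the example small.

```python
from itertools import product

SIGMA = "atgc"

def accepts(machine, word):
    states, start, accept, T = machine
    c = start
    for x in word:
        c = T[(x, c)]
    return c in accept

def complement(machine):
    """Switch the accept and non-accept states."""
    states, start, accept, T = machine
    return (states, start, states - accept, T)

def intersection(m1, m2):
    """Product machine: run both machines in parallel on pairs of states."""
    s1, start1, acc1, T1 = m1
    s2, start2, acc2, T2 = m2
    states = set(product(s1, s2))
    T = {(x, (p, q)): (T1[(x, p)], T2[(x, q)])
         for x in SIGMA for (p, q) in states}
    accept = {(p, q) for (p, q) in states if p in acc1 and q in acc2}
    return (states, (start1, start2), accept, T)

# Hypothetical example machines over SIGMA:
# ends_c accepts words whose last letter is 'c' (state 1 = "just read a c").
ends_c = ({0, 1}, 0, {1},
          {(x, s): (1 if x == "c" else 0) for x in SIGMA for s in (0, 1)})
# even_a accepts words with an even number of 'a's (state = parity).
even_a = ({0, 1}, 0, {0},
          {(x, s): (1 - s if x == "a" else s) for x in SIGMA for s in (0, 1)})

both = intersection(ends_c, even_a)
print(accepts(both, "aac"), accepts(both, "ac"))
```

The reversal construction is omitted here because it needs the conversion from nondeterministic to deterministic machines.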

<h2>GREP</h2>

grep is a command line tool that searches text for lines matching a regular expression. Its name comes from the ed editor command g/re/p (globally search for a regular expression and print the matching lines). A classic algorithm in this family is the [Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm), which formed the basis of the original Unix fgrep command for searching for a set of fixed strings. It uses an idea similar to the Knuth-Morris-Pratt string matching algorithm, but builds a state machine that matches many patterns at once.
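To get a feel for what such a tool does, here is a minimal sketch of a line filter written with Python's `re` module. It is not how the command line tool is implemented; the function name `grep_lines` is made up for illustration, and it relies on `re.search`, which looks for a match anywhere in the line rather than requiring the whole line to match.

```python
import re

def grep_lines(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]

sample = ["agtccc", "ttt", "xxaccczz"]
print(grep_lines("[ac][gt]*ccc", sample))
```

Unlike `re.fullmatch` in the earlier cell, `re.search` reports a line as soon as some substring of it matches.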

<h2>Machines and Languages</h2>

The theory of regular languages and the finite state machines that recognize them is the first example of a more general phenomenon.

Classes of languages can be described by:

- Grammar rules (rules by which valid words can be formed, like regular expressions)
- Machines (an abstract machine that recognizes valid words, like finite state automata)

See the textbook by Arora and Barak for more information on this perspective.


<img src="Chomsky-hierarchy.svg" style="background-color:white;">

Next week, we will see the recursively enumerable languages and the associated machines, Turing machines.

<h2>The Pumping Lemma: Languages that are not regular</h2>

Not all languages are regular. For example, the set of strings $\mathcal{S}\coloneqq \{``", ``at", ``aatt", ``aaattt", \dots\} = \{(``a"*n +``t"*n) \text{ for n in }range(\infty)\}$ (that is, $n$ copies of 'a' followed by $n$ copies of 't') is not a regular language.

Intuitively, the reason that $\mathcal{S}$ is not regular is that we need memory to recognize members of $\mathcal{S}$. The naive way to do this is to step through the string one character at a time and maintain a counter for the number of $a$'s encountered. Then, once we're done with the $a$'s, we count the number of $t$'s and check that the two are equal. However, finite state machines only have finite memory (in the form of finitely many states) and so cannot maintain a variable to count the (unbounded) number of $a$'s.
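The naive counting approach might look like the following sketch. The function name `in_S` is an assumption, and a single counter is used that goes up on each $a$ and down on each $t$, which is equivalent to comparing two separate counts.

```python
def in_S(word):
    """Check whether word consists of n 'a's followed by n 't's."""
    count = 0  # unbounded counter: this is the memory a state machine lacks
    i = 0
    # Count the leading 'a's.
    while i < len(word) and word[i] == "a":
        count += 1
        i += 1
    # Count down for each 't' that follows.
    while i < len(word) and word[i] == "t":
        count -= 1
        i += 1
    # The whole word must be consumed and the counts must agree.
    return i == len(word) and count == 0

print(in_S("aaattt"), in_S("aat"))
```

The value of `count` can be arbitrarily large, so it cannot be encoded in a fixed finite set of states.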

Of course, this is not a proof, since one might argue that there is a better way to recognize these strings.

To show that no state machine can recognize $\mathcal{S}$, we use an argument known as the <i>Pumping Lemma</i>. Suppose that a state machine $\mathcal{M}$ could recognize $\mathcal{S}$. Since $\mathcal{M}$ has only finitely many states, the pigeonhole principle implies that $\mathcal{M}$ finds itself in the same state twice when reading words that are long enough. To be more specific, let $M$ be the number of states of $\mathcal{M}$, and consider the word $W=``a"*(M+1) +``t"*(M+1) \in \mathcal{S}$. The first half of the word is longer than $M$, so $\mathcal{M}$ must visit some state twice while reading the $a$'s of $W$. Suppose this state is encountered at steps $m_1$ and $m_2$, with $m_1 < m_2 \le M+1$. Then between steps $m_1$ and $m_2$, $\mathcal{M}$ completes a cycle in the directed graph of its states. We can remove the $m_2-m_1$ letters read during this cycle without changing the state the machine is in afterwards, and hence without changing how it behaves on the remaining letters. This shows that $W^\prime \coloneqq ``a"*(M+1-(m_2-m_1)) +``t"*(M+1)$ is also accepted by $\mathcal{M}$. But $W^\prime \not \in \mathcal{S}$, because it has fewer $a$'s than $t$'s, so we have a contradiction.
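The argument can be watched in action on a concrete machine. The machine below is a hypothetical failed attempt at recognizing $\mathcal{S}$: it has $M=3$ states and only tracks the number of $a$'s minus the number of $t$'s modulo 3. The helper names are assumptions made for this sketch. The code finds the repeated state while reading the $a$'s, removes the cycle, and checks that the shortened word is still accepted.

```python
def accepts(T, start, accept, word):
    c = start
    for x in word:
        c = T[(x, c)]
    return c in accept

def find_repeat(T, start, word):
    """Return steps m1 < m2 at which the machine is in the same state."""
    seen = {start: 0}  # state -> step at which it was first visited
    c = start
    for step, x in enumerate(word, start=1):
        c = T[(x, c)]
        if c in seen:
            return seen[c], step
        seen[c] = step
    return None  # cannot happen if the word is longer than the number of states

# Hypothetical machine with M = 3 states over the letters 'a' and 't'.
M = 3
T = {}
for s in range(M):
    T[("a", s)] = (s + 1) % M  # count up on 'a'
    T[("t", s)] = (s - 1) % M  # count down on 't'
start, accept = 0, {0}

W = "a" * (M + 1) + "t" * (M + 1)
m1, m2 = find_repeat(T, start, "a" * (M + 1))
W_prime = "a" * (M + 1 - (m2 - m1)) + "t" * (M + 1)

print(W, accepts(T, start, accept, W))
print(W_prime, accepts(T, start, accept, W_prime))
```

This machine accepts every word of $\mathcal{S}$, but it also accepts `W_prime`, which is not in $\mathcal{S}$. The Pumping Lemma argument shows that every finite state machine that accepts all of $\mathcal{S}$ makes a mistake of this kind.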

Another example of a non-regular language is the language of properly matched parentheses.
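As with $\mathcal{S}$, this language is easy to recognize once an unbounded counter is available. The following is a small sketch under the assumption that only the characters '(' and ')' matter and everything else is ignored; the function name `is_balanced` is made up for illustration.

```python
def is_balanced(word):
    """Check that every ')' closes an earlier '(' and nothing is left open."""
    depth = 0  # number of currently open parentheses (unbounded)
    for x in word:
        if x == "(":
            depth += 1
        elif x == ")":
            depth -= 1
            if depth < 0:
                return False  # a ')' appeared with nothing to close
    return depth == 0

print(is_balanced("(()())"), is_balanced("())("))
```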