# Lexical Analysis

## The Job of Lexical Analysis
- Read in characters from a file
- Determine the groupings of characters (the lexemes)
    - The lowest level(s) of the grammar
- Assign each lexeme to a token
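
The three steps above can be sketched by hand in a few lines of Python. This is only an illustration: the token names (`INTEGER`, `IDENTIFIER`, `OPERATOR`) and the function `lex` are made up for this sketch, and a real lexer would handle many more cases.

```python
def lex(text):
    """Group characters into lexemes and pair each lexeme with a token name."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                 # whitespace only separates lexemes
            i += 1
        elif ch.isdigit():               # a run of digits forms one lexeme
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            tokens.append(("INTEGER", text[start:i]))
        elif ch.isalpha():               # a letter followed by letters/digits
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            tokens.append(("IDENTIFIER", text[start:i]))
        else:                            # anything else: one-character operator
            tokens.append(("OPERATOR", ch))
            i += 1
    return tokens


print(lex("size > 10"))
# [('IDENTIFIER', 'size'), ('OPERATOR', '>'), ('INTEGER', '10')]
```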




## Where Lexical Analysis Fits in
<img alt="Where in the process we are" src="lexprocess.svg" style="width: 75%; height: auto;" />

## Another View of Lexical Analysis
![Syntax highlighted in the atom.io text editor](https://i.github-camo.com/a4cd12f0aa9610f5487f031d3163918abdcb42ec/68747470733a2f2f662e636c6f75642e6769746875622e636f6d2f6173736574732f3637313337382f323236353637312f64303265626565382d396538352d313165332d396238632d3132623263623730313565332e706e67)
From https://atom.io/themes/monokai

## Methods of Lexical Analysis
Lexical analysis boils down to matching patterns.  
There are several ways to do this:
- Use formal descriptions and regular expressions to describe the patterns
- Use a state transition diagram and an accompanying implementation
- Use a state transition diagram and manually construct a table-driven implementation

## Regular Expressions

- A notation that is used to describe patterns
- Is simpler and less expressive than BNF
- Consists of three basic operations:
 - Concatenation **( ab )** 
 - Union  **a | b**
 - Kleene Star or Kleene Closure __a*__ = __$\epsilon$|a|aa|aaa|....__
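
The three basic operations can be tried out with Python's `re` module. The helper `matches` is a name invented for this sketch; it uses `re.fullmatch` so that the whole string has to fit the pattern.

```python
import re


def matches(pattern, s):
    """True if the whole string s is described by pattern."""
    return re.fullmatch(pattern, s) is not None


# Concatenation: "a" followed by "b"
print(matches(r"ab", "ab"), matches(r"ab", "a"))       # True False

# Union: either "a" or "b"
print(matches(r"a|b", "b"), matches(r"a|b", "ab"))     # True False

# Kleene star: zero or more "a"s, including the empty string
print(matches(r"a*", ""), matches(r"a*", "aaa"), matches(r"a*", "ab"))
# True True False
```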


## Regular Expression Notation
- Has some syntactic sugar that could be constructed from the above
 - Parentheses __( a ( b | c ) )__ == **ab|ac**
 - Kleene Plus __a+__ == __aa\*__
 - Zero or One Operator __a?__ = __a | $\epsilon$__ 
 - Character classes  __[a-z]__ == __a | b | c | ... | y | z__
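
One way to get a feel for these equivalences is to compare both sides on every short string. The sketch below does that with Python's `re`; `same_language` is a hypothetical helper, and because it only checks strings up to a fixed length it is a sanity check, not a proof.

```python
import re
from itertools import product


def same_language(p, q, alphabet="abc", max_len=4):
    """Compare two patterns on every string over alphabet up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False
    return True


print(same_language(r"a(b|c)", r"ab|ac"))   # parentheses
print(same_language(r"a+", r"aa*"))         # Kleene plus
print(same_language(r"a?", r"a|"))          # zero or one (empty alternative = epsilon)
print(same_language(r"[a-c]", r"a|b|c"))    # character class
```

All four comparisons should print `True`; comparing `a+` with `a*` prints `False`, since only `a*` accepts the empty string.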

- Can use identifiers to break up long patterns
 - ONES = (one|two|three| ... | ten)
 - TEENS = (eleven|twelve|thirteen|...|nineteen)
 - TENS = (twenty|thirty|forty|...|ninety)
 - MONEY = ((ONES|TEENS)| TENS ONES)(CENTS | (DOLLARS ((ONES|TEENS)| TENS ONES) CENTS))
- Can be simplified even further
 - X = (ONES | TEENS)
 - Y = TENS ONES
 - NUMBER = (X | Y)
 - MONEY2 = NUMBER (CENTS | DOLLARS NUMBER CENTS)
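
The same build-up from named pieces can be sketched with Python string formatting. Several things here are assumptions for illustration only: the word lists contain just the words spelled out above (the elided ones are left out), `CENTS` and `DOLLARS` are taken to be the literal words "cents" and "dollars", and words are separated by a single space.

```python
import re

# Abbreviated word lists: only the words written out in the definitions above.
ONES = r"(one|two|three|ten)"
TEENS = r"(eleven|twelve|thirteen|nineteen)"
TENS = r"(twenty|thirty|forty|ninety)"
CENTS = r"cents"        # assumed literal
DOLLARS = r"dollars"    # assumed literal

# Larger patterns are assembled from the named smaller ones.
X = rf"({ONES}|{TEENS})"
Y = rf"({TENS} {ONES})"
NUMBER = rf"({X}|{Y})"
MONEY2 = rf"{NUMBER} ({CENTS}|{DOLLARS} {NUMBER} {CENTS})"


def is_money(s):
    return re.fullmatch(MONEY2, s) is not None


print(is_money("twenty two cents"))             # True
print(is_money("three dollars eleven cents"))   # True
print(is_money("three dollars"))                # False
```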

## RegEx Examples

Over the alphabet {a,b} give a regular expression for

- Strings with an even number of a's
- Strings with a length that is a multiple of 3

## Regex in the Real World
* Most programming languages support far more complex regular expressions
* Python's `re` module uses a Perl-style regex syntax, similar to Perl Compatible Regular Expressions (PCRE)
    * `\d` matches any digit
    * `\w` matches any alphanumeric character or underscore
* The same syntax can be used in GNU `grep` with the `-P` flag

## Real World Uses of Regex

In [21]:
ls -lh ~/wackypediaFlat.slim

-rw-rw-r-- 1 bryan bryan 2.0G Jan 27  2017 /home/bryan/wackypediaFlat.slim


In [22]:
grep -oP "\d\d\d-\d\d\d-\d\d\d\d" ~/wackypediaFlat.slim | head

800-801-9322
888-241-4556
800-818-8589
465-577-4922
465-577-4923
465-823-7231
800-567-5111
510-234-9054
907-248-3780
866-445-6580
grep: write error: Broken pipe


In [23]:
grep -oP "[tT]he capital of [A-Z]\w+ is [A-Z]\w+" ~/wackypediaFlat.slim | head

The capital of Slovenia is Ljubljana
The capital of Berry is Bourges
The capital of Chile is Santiago
The capital of Alabama is Montgomery
The capital of Anambra is Awka
The capital of Cybertron is Iacon
The capital of Ghor is Chaghcharan
The capital of Colombia is Bogota
The capital of Boharo is Dhahar
The capital of Cundinamarca is Bogota
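
The same two patterns can be tried in Python with `re.findall`. The sample text below is invented for this sketch and is not taken from the file searched above.

```python
import re

# Hypothetical sample text, made up for illustration.
text = "Call 800-555-0100 today. The capital of Chile is Santiago."

phone = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
capital = re.compile(r"[tT]he capital of [A-Z]\w+ is [A-Z]\w+")

print(phone.findall(text))     # ['800-555-0100']
print(capital.findall(text))   # ['The capital of Chile is Santiago']
```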


## Finite Automata
- A class of mathematical machines
- Represented by a state transition diagram
- Recognizes strings that can be described by regular expressions

![Finite Automata to recognize money](moneyDFA.jpg)

## Deterministic Finite Automata

Finite automata that obey certain rules

- For each state, any given input allows only one possible transition
- You cannot transition between two states without consuming input



## DFA Practice

Over the alphabet {a,b} give a DFA that accepts:

- Strings with no more than 3 a's
- Strings with a length that is a multiple of 3

## Non-Deterministic Finite State Automaton (NFA)
* NFAs are computationally equivalent to DFAs
* NFAs allow:
    * A transition over the empty string $\epsilon$
    * Multiple transitions given the same input and same state
 
![Example of two transitions](NFA1.jpg)
                                                                                                                                                  
![Example of empty transition](NFA2.jpg)
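
One way to see why NFAs are no more powerful than DFAs is to simulate an NFA by tracking the *set* of states it could be in. The sketch below uses a small NFA invented for illustration (it is not the one in the figures): it accepts strings over {a, b} that end in "ab", and it has both a double transition on `a` and an empty transition.

```python
EPS = ""  # label used in this sketch for epsilon (empty-string) transitions

# Hypothetical NFA: state 0 has two transitions on 'a' (to 0 and to 1),
# and state 2 has an epsilon transition to the accepting state 3.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, EPS): {3},
}
START, ACCEPT = 0, {3}


def eps_closure(states):
    """All states reachable from `states` using only epsilon transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfa_accepts(string):
    """Follow every possible transition at once by keeping a set of states."""
    current = eps_closure({START})
    for ch in string:
        moved = set()
        for s in current:
            moved |= NFA.get((s, ch), set())
        current = eps_closure(moved)
    return bool(current & ACCEPT)


for s in ["ab", "aab", "aba", ""]:
    print(repr(s), nfa_accepts(s))
```

Each distinct set of NFA states behaves like a single DFA state, which is the idea behind converting an NFA into a DFA.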

## Implementing a DFA with a table
- If we number (or otherwise label) each state, we can create a table where
    - the rows are the states
    - the columns are the inputs
    - the cell value is which state to go to next
- Process input one character at a time, updating a variable that holds the current state
- After all input is processed, accept if the current state is an accept state; otherwise reject

<img src="aa.png" style="height:30vh;margin:0px auto;" alt="Three state DFA, transition from state 1 to state 2 on an 'a', from state 2 to state 1 on a 'b' and from state 2 to state 3 on an 'a'">
<table style="font-size:1.5em;">
<thead>
<tr>
<th></th><th>a</th><th>b</th>
</tr>
</thead>
<tbody>
<tr>
<td style="border-right:1px solid black;">1</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td style="border-right:1px solid black;">2</td>
<td>3</td>
<td>1</td>
</tr>
<tr>
<td style="border-right:1px solid black;">3</td>
<td>3</td>
<td>3</td>
</tr>
</tbody>
</table>
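
A minimal Python sketch of this table-driven approach, using the table above. The table does not say where the machine starts or which states accept, so the start state (1) and accept state (3) below are assumptions; with them, the machine accepts strings containing two consecutive `a`s.

```python
# Transition table from above: TABLE[state][symbol] -> next state.
TABLE = {
    1: {"a": 2, "b": 1},
    2: {"a": 3, "b": 1},
    3: {"a": 3, "b": 3},
}
START = 1       # assumption: not stated in the table
ACCEPT = {3}    # assumption: not stated in the table


def dfa_accepts(string):
    """Process one character at a time, updating the current state."""
    state = START
    for ch in string:
        if ch not in TABLE[state]:   # symbol outside the alphabet: reject
            return False
        state = TABLE[state][ch]
    return state in ACCEPT


for s in ["aa", "abab", "baab", ""]:
    print(repr(s), dfa_accepts(s))
```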


## Lex

- Lexical analyzer generator
 - It writes a lexical analyzer
- Assumption
 - each token matches a regular expression
- Needs
 - set of regular expressions
 - for each expression an action
- Produces
 - A C program
- Automatically handles many tricky problems
- flex is a free reimplementation of the venerable Unix tool lex (often distributed alongside GNU tools, though not itself part of the GNU project)
 - Produces highly optimized code 

## Lex File Layout
- A lex file consists of three sections, separated by a line containing only `%%`
    - Definitions
    - Rules
    - Subroutines
```lex
DEFINITIONS
%%
RULES
%%
SUBROUTINES
```

## Lex Pipeline
* Lex files are traditionally named with an `.l` suffix
* First the lex file is processed by `lex` or `flex`, which generates C code
```bash
flex -oexample_lex.c example.l
```
* Next the C file is compiled. You **must** link with `-lfl` (the flex library) unless you supply your own `yywrap()`
```bash
gcc -oexample_lex example_lex.c -lfl
```

## Lex Example
```c
/* scanner for a toy Pascal-like language */
%{
#include <stdlib.h> /* needed for calls to atoi() and atof() */
%}
DIG [0-9]
ID [a-z][a-z0-9]*
%%
{DIG}+ printf("Integer: %s (%d)\n", yytext, atoi(yytext));
{DIG}+"."{DIG}* printf("Float: %s (%g)\n", yytext, atof(yytext));
if|then|begin|end printf("Keyword: %s\n",yytext);
{ID} printf("Identifier: %s\n",yytext);
"+"|"-"|"*"|"/"|">" printf("Operator: %s\n",yytext);
"{"[^}\n]*"}" /* skip one-line comments */
[ \t\n]+ /* skip whitespace */
. printf("Unrecognized: %s\n",yytext);
%%
int main(){yylex();} 
```

In [24]:
flex -oexample_lex.c example.l

In [26]:
tail example_lex.c

	free( (char *) ptr );	/* see yyrealloc() for (char *) cast */
}

#define YYTABLES_NAME "yytables"

#line 15 "example.l"


int main(){yylex();}



In [27]:
wc example.l example_lex.c

   17    63   537 example.l
 1804  6479 46378 example_lex.c
 1821  6542 46915 total


In [28]:
gcc -oexample_lex example_lex.c -lfl

In [30]:
./example_lex < pascal_example 

Keyword: begin
Keyword: if
Identifier: size
Operator: >
Integer: 10 (10)
Keyword: then
Identifier: size
Operator: *
Operator: -
Float: 3.1415 (3.1415)
Keyword: end


In [29]:
cat pascal_example

begin
 if size > 10
	then size * -3.1415
end
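
For comparison, the matching strategy lex applies to these rules (take the longest match, and on a tie take the rule listed first) can be sketched in Python. This is a rough illustration, not how flex is implemented: flex compiles the rules into a single DFA, while this sketch simply tries every pattern at each position. The names `RULES` and `scan` are made up here.

```python
import re

# (token name, pattern) in the same order as the lex rules; None means "skip".
RULES = [
    ("Integer", r"[0-9]+"),
    ("Float", r"[0-9]+\.[0-9]*"),
    ("Keyword", r"if|then|begin|end"),
    ("Identifier", r"[a-z][a-z0-9]*"),
    ("Operator", r"[+\-*/>]"),
    (None, r"\{[^}\n]*\}"),      # one-line comments
    (None, r"[ \t\n]+"),         # whitespace
    ("Unrecognized", r"."),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def scan(text):
    """Return (token name, lexeme) pairs using longest-match, first-rule-wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None  # (name, lexeme) of the longest match found so far
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule wins when two matches have equal length.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        pos += len(best[1])
        if best[0] is not None:
            tokens.append(best)
    return tokens


source = "begin\n if size > 10\n\tthen size * -3.1415\nend\n"
for name, lexeme in scan(source):
    print(f"{name}: {lexeme}")
```

On the `pascal_example` input this should produce the same sequence of token kinds as the flex-generated scanner. The tie rule is what makes `if` a keyword rather than an identifier, and the longest-match rule is what makes `3.1415` a single float rather than an integer followed by other tokens.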
