Java Lexical Analyzer and Parser Generator

Overview

This project was a part of CS321 Programming Languages and Translators course project. This repository contains the first two phases:

1. Lexical Analyzer

The lexical analyzer generator is required to automatically construct a lexical analyzer from a regular expression description of a set of tokens. The tool is required to construct a nondeterministic finite automata (NFA) for the given regular expressions, combine these NFAs together with a new starting state, convert the resulting NFA to a DFA, minimize it and emit the transition table for the reduced DFA together with a lexical analyzer program that simulates the resulting DFA machine.

2. Parser Generator

The parser generator expects a grammar as input. It should compute First and Follow sets and uses them to construct a predictive parsing table for the grammar. The table is to be used to drive a predictive top-down parser. A bonus task we included is converting any grammar to be LL (1)

Architecture

The project contains the following classes:

nfa

This class represents an NFA. It contains the following attributes:

alphabet

a vector of characters representing all possible inputs.
transition table

represented by a 3 dimensional vector of integers, or a 2 dimensional vector where each entry is a vector. each row corresponds to a state, each column represents an input. each entry in row i and column j is a vector of characters representing the next states to state i under input alphabet[j]
accepted states

A vector of integer representing indices of acceptance states

dfa_state

This class represents one state of the DFA. It contains the following attributes

corresponding_states

A vector of int.contains the indices of the NFA that correspond to this Dfa state
next_states

A vector of integers, its length is the same size as the alphabet vector. element i represents the next state when the input is alphabet[j]
accepted A boolean that tells whether this is an acceptance state or not
state_name

A string representing the name of the state if it is accepted (i.e: num, id, keyword .. ), “unaccepted” otherwise

dfa

this class is responsible for converting an nfa to a dfa. Its attributes are

NFA

type is nfa. represents the nfa we are trying to convert
transition_table a vector of dfa_state representing the states of the DFA with their corresponding NFA states and their next states in the DFA.
alphabet vector of character representing the alphabet
epsilon_closures_table a two dimensional vector of integer. each row i represents a vector of all states to which state i goes to under input epsilon
accepted_states a vector of integer representing indices of acceptance states

minimization

this class is used for minimization a DFA. Its important attributes are

next_states

a two dimensional vector of integers. rows represent the states and columns represent under which input symbol.
minimized_table

a two dimensional vector of integers. rows represent a group of states and columns represent the other states in the same group, it is used through the partitions, so it has some cells which must be ignored.
final_minimized_table

is the same as the minimized_table but without the cells of empty spaces.
DFA

Its type is dfa and is the DFA which we want to minimize.

dfa_instance

this class is responsible for reading the input string and splitting it into tokens it uses a DFA (from class dfa), gets a string and splits the string into tokens.

grammar :

This class represents a grammar with attributes:

non_terminals

vector of symbols which are non terminals.
terminals

vector of symbols which are terminals.
productions_list

vector of productions which represents the grammar rules.

symbol

This class represents a symbol with attributes :

value

string of the symbol value (name).
is_terminal

a boolean attribute that determines if this symbol is one of the terminal symbols or not.
first

vector of strings and each string is one of the terminal symbols value.
first_production

vector of integers and each integer refers to the index of production (in productions_list) that generates that first .
follow

vector of strings and each string is one of the follow of that symbol.
finish_first

boolean attribute which shows if that symbol’s first attribute has been calculated.

production :

This class represents a production with attributes :

LHS : symbol(non terminal).
RHS : vector of symbols(terminals with non terminals).

output :

This class represents the output with attribute:

rhs_productions_stack: stack of symbols

Algorithms Used

Eliminating left recursion

for each production P_j in productions vector
if left hand side LHS₁ is equal to the first symbol in right hand side (left recursion detected):

create a new production (P_new) , with a new symbol (S_new) in the left hand side.
P_new.LHS = S_new, P_new.RHS = P_j.RHS after removing first element.
add P_new to the productions list.
find every other production (p_i) with the same left hand side.
add the new symbol to the right hand side vector of each production with same LHS.
create a new production with left-hand-side = LHS₁and right-hand-side = p_i.right_hand_side + S_new

Left factoring

for each production p_i:
find indices of all production with the same left-hand-side and same first symbol in right-hand-side and push them all to vector v
if v.size is greater than one (i.e: there are other productions that make the grammar not left-factored):
for each element in v (v_j):

create new symbol S_new.
create new production P_new.
P_new.left-hand-side = P_i.left-hand-side
P_new.right-hand-side = {P_i.RHS[0], new_symbol}
push P_new to productions_list
productions_list[v_j].RHS = {sub_vector(productions_list[v_j].RHS, 1, productions_list[v_j].end};
productions_list[v_j].LHS = S_new; \

calculating first sets

we use iterative technique to calculate the first set for every symbol in the grammar (terminal or non terminal) until there is no change with some rules :

If the symbol is terminal, its first is the symbol itself.
if the symbol is non terminal :
if the first symbol in the right hand side does not have ‘/L’ in its first set, its first is the first of that symbol.
if the first symbol in the right hand side has ‘/L’ in its first set, its first is the first of that symbol with the first of the next symbol and that is repeated until the symbol does have ‘/L’ in its first set.

calculating follow sets

we iterate to get the final follow set for each non terminal symbol until there is no change in every follow set by using some rules :

If the symbol is the starting symbol, one of its follow is “$”.
If the symbol exists in the right hand side of one of the productions :
If it is the last symbol in the right hand side of the production or the following symbol has ‘/L’ in its first set, its follow set have all follows of the left hand side symbol.
If it is not the last symbol in the right hand side of the production and the following symbol does not have ‘/L’ in its first set, its follow set have all firsts of the following symbol.

constructing predictive parsing table

for each production rule A of a grammar G

– for each terminal a in FIRST()

add A to M[A,a]

– If in FIRST()

for each terminal a in FOLLOW(A) add A to M[A,a]

– If in FIRST() and $ in FOLLOW(A)

add A to M[A,$]

generating output file

Initialize a stack containing S$.

Repeat until the stack is empty:

Let the next character of ω be t.
If the top of the stack is a terminal r:

– If r and t don't match, report an error.

– Otherwise consume the character t and pop r from the stack.
Otherwise, the top of the stack is a nonterminal A:

– If T[A, t] is undefined, report an error.

– Replace the top of the stack with T[A, t]

handling errors using Panic-Mode error recovery

– For each nonterminal A, mark the entries M[A,a] as synch if a is in the follow set of A. So,for an empty entry, the input symbol is discarded. This should continue until either:

an entry with a production is encountered. In the case, parsing is continued as usual.
an entry marked as synch is encountered. In this case, the parser will pop that non-terminal A from the stack. The parsing continues from that state.

– To handle unmatched terminal symbols, the parser pops that unmatched terminal symbol

from the stack and it issues an error message saying that that unmatched terminal is inserted.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
bin/Debug		bin/Debug
include		include
obj/Debug		obj/Debug
src		src
Last version of compiler.cbp		Last version of compiler.cbp
Last version of compiler.depend		Last version of compiler.depend
Last version of compiler.layout		Last version of compiler.layout
README.md		README.md
code.txt		code.txt
grammar.txt		grammar.txt
main.cpp		main.cpp
output1.txt		output1.txt
rules1.txt		rules1.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Java Lexical Analyzer and Parser Generator

Overview

1. Lexical Analyzer

2. Parser Generator

Architecture

nfa

dfa_state

dfa

minimization

dfa_instance

grammar :

symbol

production :

output :

Algorithms Used

Eliminating left recursion

Left factoring

calculating first sets

calculating follow sets

constructing predictive parsing table

generating output file

handling errors using Panic-Mode error recovery

About

Releases

Packages

Languages

YomnaEldawy/java-compiler

Folders and files

Latest commit

History

Repository files navigation

Java Lexical Analyzer and Parser Generator

Overview

1. Lexical Analyzer

2. Parser Generator

Architecture

nfa

dfa_state

dfa

minimization

dfa_instance

grammar :

symbol

production :

output :

Algorithms Used

Eliminating left recursion

Left factoring

calculating first sets

calculating follow sets

constructing predictive parsing table

generating output file

handling errors using Panic-Mode error recovery

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages