An Easy Compiler Compiler implemented in Java
EasyCC is a Compiler Compiler developed mainly to be an educational tool for students learning about compilers. The software has a command line interface and a graphical user interface. EasyCC is composed of four components: Lexical analysis, Syntax analysis, Semantic analysis and Code generation.
Lexical analysis is the first step in building a compiler. In this step the user has to build a finite automaton representing the possible tokens in the language. The finite automaton should be provided as a JSON file following this structure:
{
"states": [
{
"types": ..., # initial, normal, final
"name": ..., # unique name for a state
"token": ..., # token name describing the target input
"backtrack": ... # true | false. Usually a true is selected if the input has an unfixed length (e.g. variable, number).
},
{
... # Another state
}
],
"edges": [
{
"from": ..., # name of `from` state
"to": ..., # name of `to` state
"value": ... # value of the edge
},
{
... # Another edge
}
]
}
Edge special values:
SPACE
Space or tabEOL
End of lineEOF
End of fileLETTER
A-Za-zNONZERO
1-9OTHER
Any character not covered by an out-edge from a state
Writting a JSON file manually can be difficult for a large finite automaton. Therefore, a graphical user interface is developed to auto-generate the lexical JSON file by entering the states and the edges connecting them.
To run the lexical generator GUI:
./gradlew :lexical-analysis:lexical-generator:build :lexical-analysis:lexical-generator:run
Note: The states on the GUI are displayed using the BFS algorithm assuming that the graph generated is connected. Therefore a state is only visible once it is connected to another visible one
After defining the set of lexical tokens for a language, the next step is writting the language grammar. Each production in the grammar is composed of terminals and non-terminals tokens. The terminals tokens should match one of the lexical tokens defined in the previous step. The syntax for a grammar file is defined as follow:
% 'My language' Grammar file
% Author: EasyCC
A -> B 'C' D
B -> 'B' | EPSILON
D -> 'D'
| EPSILON
Explanation:
- Starting with
%
marks the line as a comment - First production is always the starting point for a grammar. (e.g.
A -> B 'C' D
) - A non-terminal is a word without single quote (e.g.
A
,B
andD
) - A terminal is a word with single quote (e.g.
'B'
,'C'
and'D'
) - Multiple productions sharing the same left hand side can be written on one or more lines seperated by
|
symbol. - An epsilon is represented by the word
EPSILON
. - Spaces between words and symbols are important, otherwise an exception is thrown.