Welcome to Project LEO!
There are two parts to this document.  The first part is a simplified way of writing grammars, written by me.  The second part is from the reference site for our dragon book (http://dragonbook.stanford.edu/lecture-notes.html).
While the second part is more complicated, you can do a lot more with its procedure.
Instructions for writing grammars:
First, in a Word doc or similar, write your start symbol:
For example:
@ <- start symbol

After every declaration, put an arrow and then what it expands to.
For example:

number <- 0|1|2|3|4|5|6|7|8|9

In the above example, I included the pipe ( | ) or "or" sign.  Use this to represent something that could be one thing or another.

Here is a complete example:

equation <- start symbol

equation <- number + number | number - number | number * number | number / number

number <- 0|1|2|3|4|5|6|7|8|9
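
To make this concrete, here is a minimal sketch in Python (my own illustration, not part of the grammar instructions) that stores the example grammar as a table and expands the start symbol into one possible sentence:

import random

# The example grammar: each nonterminal maps to its alternatives,
# and each alternative is a list of symbols.
GRAMMAR = {
    "equation": [["number", "+", "number"],
                 ["number", "-", "number"],
                 ["number", "*", "number"],
                 ["number", "/", "number"]],
    "number": [[d] for d in "0123456789"],
}

def derive(symbol):
    """Expand a symbol until only terminals remain."""
    if symbol not in GRAMMAR:          # a terminal: a dead end, keep it
        return symbol
    alternative = random.choice(GRAMMAR[symbol])
    return " ".join(derive(s) for s in alternative)

print(derive("equation"))   # e.g. "3 + 7"

Each run picks different alternatives, so it prints a different derived sentence each time.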

Remember, all rules MUST eventually end in a terminal.
Have fun! 
Alan

Part 2:
Formal Grammars
Handout written by Maggie Johnson and Julie Zelenski.
What is a grammar?
A grammar is a powerful tool for describing and analyzing languages. It is a set of rules
by which valid sentences in a language are constructed. Here’s a trivial example of
English grammar:
sentence –> <subject> <verb-phrase> <object>
subject –> This | Computers | I
verb-phrase –> <adverb> <verb> | <verb>
adverb –> never
verb –> is | run | am | tell
object –> the <noun> | a <noun> | <noun>
noun –> university | world | cheese | lies

Using the above rules or productions, we can derive simple sentences such as these:
This is a university.
Computers run the world.
I am the cheese.
I never tell lies.

In addition to several reasonable sentences, we can also derive nonsense like "Computers
run cheese" and "This am a lies". These sentences don't make semantic sense, but they
are syntactically correct because they follow the sequence of subject, verb-phrase, and
object. Formal grammars are a tool for syntax, not semantics. We worry about
semantics at a later point in the compiling process. In the syntax analysis phase, we
verify structure, not meaning.

Vocabulary

We need to review some definitions before we can proceed:
grammar: a set of rules by which valid sentences in a language are constructed.

nonterminal: a grammar symbol that can be replaced/expanded to a sequence of
symbols.

terminal: an actual word in a language; these are the symbols in a grammar that
cannot be replaced by anything else. "Terminal" is supposed to conjure up the
idea that it is a dead end: no further expansion is possible.

production: a grammar rule that describes how to replace/exchange symbols. The
general form of a production for a nonterminal is:

X –> Y1 Y2 Y3 ... Yn

The nonterminal X is declared equivalent to the concatenation of the symbols
Y1 Y2 Y3 ... Yn. The production means that anywhere we encounter X, we may
replace it by the string Y1 Y2 Y3 ... Yn. Eventually we will have a string
containing nothing that can be expanded further, i.e., it will consist of only
terminals. Such a string is called a sentence. In the context of programming
languages, a sentence is a syntactically correct and complete program.

derivation: a sequence of applications of the rules of a grammar that produces a
finished string of terminals. A leftmost derivation is where we always
substitute for the leftmost nonterminal as we apply the rules (we can similarly
define a rightmost derivation); see the worked example after these definitions.
A derivation is also called a parse.

start symbol: a grammar has a single nonterminal (the start symbol) from which
all sentences derive:

S –> X1 X2 X3 ... Xn

All sentences are derived from S by successive replacement using the
productions of the grammar.

null symbol: it is sometimes useful to specify that a symbol can be replaced by
nothing at all. To indicate this, we use the null symbol ε, e.g., A –> B | ε.
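
As a worked example (my own addition, using the English grammar from the start of this handout), here is a leftmost derivation of "I never tell lies"; at every step, the leftmost nonterminal is the one replaced:

sentence
–> <subject> <verb-phrase> <object>
–> I <verb-phrase> <object>
–> I <adverb> <verb> <object>
–> I never <verb> <object>
–> I never tell <object>
–> I never tell <noun>
–> I never tell lies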



Productions are often defined in terms of themselves. For example, a list of variables in a
programming language grammar could be specified by this production:
variable_list –> variable | variable_list , variable
Such productions are said to be recursive.
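
A recursive production like this maps naturally onto recursive code. Below is a minimal recursive-descent sketch in Python (my own illustration; the token handling is simplified) that accepts a comma-separated variable list:

def parse_variable_list(tokens, pos=0):
    """Parse: variable_list –> variable | variable_list , variable.
    tokens is a list of strings; returns the position after the list."""
    if pos >= len(tokens) or not tokens[pos].isidentifier():
        raise SyntaxError("expected a variable at position %d" % pos)
    pos += 1                       # consume the first variable
    while pos < len(tokens) and tokens[pos] == ",":
        pos += 1                   # consume ","
        if pos >= len(tokens) or not tokens[pos].isidentifier():
            raise SyntaxError("expected a variable after ','")
        pos += 1                   # consume the next variable
    return pos

parse_variable_list(["x", ",", "y", ",", "z"])   # accepts; returns 5

Note the left recursion is rewritten as a loop, the standard trick: a direct recursive call on variable_list would recurse forever without consuming input.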

Mark: tts
art
OOP
AI & drivers

Sub packages:
drivers
-- art - Morgan, David
-- sound - Mark
-- network - Alan
AI - Alan
Crypto - Nathan
core - all
Go here for the interfacing guide: 
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-manual-325462-rmver.html
Go here for the Google URL parameters guide:
http://cdn.yoast.com/wp-content/uploads/2007/07/google-url-parameters.pdf

Lexer info starts here:
To implement a lexical analyzer by hand, it helps to start with a diagram or other
description for the lexemes of each token. We can then write code to
identify each occurrence of each lexeme on the input and to return information 
about the token identified.

More info on all these words later.
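
As a tiny illustration of that approach (my own sketch, not part of the handout), the code below hand-implements the transition diagram for one kind of lexeme, an unsigned integer, and returns the token name together with the lexeme it matched:

def scan_number(text, start):
    """Follow the transition diagram for an unsigned integer:
    state 0 --digit--> state 1 --digit--> state 1 (accepting)."""
    pos = start
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    if pos == start:
        return None                    # no digit seen: this diagram fails
    return ("number", text[start:pos]) # token name + matching lexeme

print(scan_number("42+x", 0))   # ('number', '42')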

The Role of the Lexical Analyzer

As the first phase of a compiler, the main task of the lexical analyzer is to
read the input characters of the source program, group them into lexemes, and
produce as output a sequence of tokens for each lexeme in the source program.
The stream of tokens is sent to the parser for syntax analysis. It is common
for the lexical analyzer to interact with the symbol table as well. When the
lexical analyzer discovers a lexeme constituting an identifier, it needs to
enter that lexeme into the symbol table.  Since the lexical analyzer is the 
part of the compiler that reads the source text, it may perform certain other 
tasks besides identification of lexemes. One such task is stripping out comments
and whitespace (blank, newline, tab, and perhaps other characters that are used
to separate tokens in the input). Another task is correlating error messages
generated by the compiler with the source program. For instance, the lexical 
analyzer may keep track of the number of newline characters seen, so it can
associate a line number with each error message. In some compilers, the lexical 
analyzer makes a copy of the source program with the error messages inserted at 
the appropriate positions.
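
As a minimal sketch of the line-counting idea (my own illustration, not the handout's code), the line number to attach to an error at a given input offset is one plus the number of newlines before it:

def line_number_at(source, pos):
    """Line number for an error at character offset pos."""
    return source.count("\n", 0, pos) + 1

src = "x = 1\ny = oops\n"
print(line_number_at(src, src.index("oops")))   # 2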

Sometimes, lexical analyzers are divided into a cascade of two processes:
a) Scanning consists of the simple processes that do not require tokenization
of the input, such as deletion of comments and compaction of consecutive
whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner
produces the sequence of tokens as output.
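
A minimal sketch of that first, scanning pass (my own illustration, assuming C-style "//" comments) deletes comments and compacts runs of whitespace into a single blank:

import re

def scan(source):
    """Pre-lexing pass: remove // comments, then compact whitespace."""
    no_comments = re.sub(r"//[^\n]*", "", source)
    return re.sub(r"\s+", " ", no_comments).strip()

print(scan("x = 1   //  set x\ny =  2"))   # "x = 1 y = 2"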

Lexical Analysis Versus Parsing
There are a number of reasons why the analysis portion of a compiler is normally
separated into lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration. The separation
of lexical and syntactic analysis often allows us to simplify at least one
of these tasks. For example, a parser that had to deal with comments
and whitespace as syntactic units would be considerably more complex
than one that can assume comments and whitespace have already been
removed by the lexical analyzer. If we are designing a new language,
separating lexical and syntactic concerns can lead to a cleaner overall
language design.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to
apply specialized techniques that serve only the lexical task, not the job
of parsing. In addition, specialized buffering techniques for reading input
characters can speed up the compiler significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can
be restricted to the lexical analyzer.

Tokens, Patterns, and Lexemes
When discussing lexical analysis, we use three related but distinct terms:

A token is a pair consisting of a token name and an optional attribute
value. The token name is an abstract symbol representing a kind of
lexical unit, e.g., a particular keyword, or a sequence of input characters
denoting an identifier. The token names are the input symbols that the
parser processes. In what follows, we shall generally write the name of a
token in boldface. We will often refer to a token by its token name.

A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword. For identifiers and some other tokens,
the pattern is a more complex structure that is matched by many strings.

A lexeme is a sequence of characters in the source program that matches
the pattern for a token and is identified by the lexical analyzer as an
instance of that token.
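
To tie the three terms together, here is a small sketch in Python (my own illustration; the token set is invented for the example) where each pattern is a regular expression, and scanning the input yields (token name, lexeme) pairs:

import re

# Each token name is paired with a pattern describing its lexemes.
TOKEN_PATTERNS = [
    ("kw_if",  r"if"),            # a keyword: the pattern is the keyword itself
    ("id",     r"[A-Za-z_]\w*"),  # identifiers: one pattern, many lexemes
    ("number", r"\d+"),
    ("relop",  r"<=|>=|==|<|>"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % p for p in TOKEN_PATTERNS))

def tokenize(text):
    for match in MASTER.finditer(text):
        yield match.lastgroup, match.group()   # (token name, lexeme)

print(list(tokenize("if count <= 10")))
# [('kw_if', 'if'), ('id', 'count'), ('relop', '<='), ('number', '10')]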

Attributes for Tokens
When more than one lexeme can match a pattern, the lexical analyzer must
provide the subsequent compiler phases additional information about the
particular lexeme that matched. For example, the pattern for token number
matches both 0 and 1, but it is extremely important for the code generator to
know which lexeme was found in the source program. Thus, in many cases the
lexical analyzer returns to the parser not only a token name, but an attribute
value that describes the lexeme represented by the token; the token name
influences parsing decisions, while the attribute value influences translation of
tokens after the parse.
We shall assume that tokens have at most one associated attribute, although
this attribute may have a structure that combines several pieces of information.
The most important example is the token id, where we need to associate with
the token a great deal of information. Normally, information about an
identifier - e.g., its lexeme, its type, and the location at which it is first found
(in case an error message about that identifier must be issued) - is kept in the
symbol table. Thus, the appropriate attribute value for an identifier is a
pointer to the symbol-table entry for that identifier.
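
Here is a minimal sketch of that arrangement (my own illustration; the table layout is invented for the example). Each id token carries, as its attribute value, a pointer (here, an index) into the symbol table:

# A toy symbol table: each entry records what we know about one identifier.
symbol_table = []          # list of {"lexeme": ..., "type": ..., "line": ...}
symbol_index = {}          # lexeme -> position in symbol_table

def install_id(lexeme, line):
    """Enter an identifier into the symbol table (once) and return
    the token: the name "id" plus a pointer to its table entry."""
    if lexeme not in symbol_index:
        symbol_index[lexeme] = len(symbol_table)
        symbol_table.append({"lexeme": lexeme, "type": None, "line": line})
    return ("id", symbol_index[lexeme])   # attribute = symbol-table entry

print(install_id("count", 3))   # ('id', 0)
print(install_id("count", 9))   # ('id', 0) - same entry, found earlier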
