
# Syntax, Semantics, Parsing and Formal Grammars

## Syntax vs Semantics

Syntax and semantics are two fundamental aspects of any programming language that help in understanding how programs are constructed and what they mean.

### Syntax
Syntax is the set of rules that determines which sequences of symbols form a correctly structured program in a given programming language. These rules dictate how statements and expressions are formed. For example, in Python:


- A statement to assign a value to a variable has the syntax: `variable_name = value`
- A for-loop has the syntax: `for variable in iterable:`

These rules ensure that the program is grammatically correct.

Common syntax errors include:


- Missing or extra braces, brackets, or parentheses
- Incorrect indentation (especially in languages like Python)
- Missing semicolons in languages where they are required (like C, C++, and Java)
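
As a minimal sketch of how such errors are detected, Python's built-in `compile()` can check a piece of source text against the grammar without running it. The helper name `is_syntactically_valid` is made up for this illustration.

```python
def is_syntactically_valid(source: str) -> bool:
    """Return True if `source` parses as Python, False on a syntax error."""
    try:
        # compile() parses and compiles the text; it does not execute it.
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        # IndentationError is a subclass of SyntaxError, so it is caught too.
        return False


print(is_syntactically_valid("x = (1 + 2"))                    # unbalanced parenthesis
print(is_syntactically_valid("for i in range(3):\nprint(i)"))  # loop body not indented
print(is_syntactically_valid("x = 1 + 2"))                     # well-formed assignment
```

Because the check happens before execution, none of the snippets above actually runs.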

### Semantics
Semantics refers to the meaning associated with syntactically valid strings of symbols in a programming language. Even if your code is syntactically correct, it might not do what you intend due to semantic errors. The semantics of a language provide the rules for interpretation of the syntax, which makes it possible for a machine to execute the code written by a developer.

Examples of semantic elements in a language might include:


- Variable scoping rules (e.g., global vs. local scope)
- Type systems (e.g., how different data types interact)
- Evaluation of expressions (e.g., order of operations)

Common semantic errors (in Python, most of these surface only at run time) include:


- Type mismatch (e.g., trying to add a string and an integer)
- Undefined variables
- Division by zero
- Index out of range

### Relation between Syntax and Semantics
Syntax and semantics are closely related but distinct:


- A program could be syntactically correct but semantically wrong. For example, dividing by zero is usually syntactically correct but semantically incorrect.
- Conversely, semantics can't be correct if the syntax is incorrect; a program won't run if it's not syntactically correct.
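
The first point can be illustrated with a short sketch using Python's standard `ast` module: the statement below parses without complaint, yet fails when executed. The variable names are chosen only for this example.

```python
import ast

source = "result = 10 / 0"

# Parsing succeeds: the text follows Python's grammar.
tree = ast.parse(source)

# Executing it fails: division by zero has no defined result.
try:
    exec(compile(tree, "<example>", "exec"))
    outcome = "ran without error"
except ZeroDivisionError as err:
    outcome = f"runtime error: {err}"

print(outcome)
```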

Programming languages often come with a formal specification that details their syntax and semantics, which is essential for compiler and interpreter writers. Programmers generally don't have to study these formal specifications; they learn the rules more implicitly through documentation, tutorials, and examples.

## Going from Source Code to Machine Code

The process of going from programming language code to machine code involves multiple stages, each with its own set of tasks and objectives.

### 1. Preprocessing
This is the first stage for some languages like C and C++. The preprocessor handles tasks like macro expansion, file inclusion, conditional compilation, etc. Source code is manipulated based on preprocessor directives like `#include`, `#define`, and others.

### 2. Lexical Analysis (Lexing)
#### Objective:
To convert the input source code into a stream of tokens. A token is a sequence of characters that represents a fundamental building block of the language, such as an identifier, a keyword, or an operator.

#### How it Works:

- The lexer scans the source code character by character.
- It groups characters into tokens according to the lexical rules of the language.
- Comments and white spaces are often discarded.
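
The steps above can be sketched with a small regular-expression-based lexer. The token categories (`NUMBER`, `IDENT`, `OP`, `SKIP`) are assumptions made for a toy expression language, not the lexical rules of any real language.

```python
import re

# Hypothetical token categories for a tiny expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Scan `text` left to right and return (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("total = price * (2 + 3)"))
```

Each tuple pairs a token kind with the exact characters it was built from; a parser would consume this list rather than the raw text.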

### 3. Syntax Analysis (Parsing)
#### Objective:
To convert the token stream into a parse tree, which represents the syntactic structure of the code based on the language's grammar rules.

#### How it Works:

- The parser applies the grammar rules specified in a formal notation like BNF (Backus-Naur Form) or EBNF (Extended Backus-Naur Form).
- If it encounters a sequence of tokens that doesn't conform to the grammar, a syntax error is produced.

### 4. Semantic Analysis
#### Objective:
To perform checks that are not related to syntax, like type checking, variable binding, etc.

#### How it Works:

- The compiler verifies that the parse tree adheres to the language's semantic rules.
- For example, it may check that variables are declared before use, that functions are called with the correct number of arguments, etc.
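
A "declared before use" check can be sketched on top of Python's `ast` module. This is a deliberately simplified illustration: the function name `undefined_names` is hypothetical, and it assumes straight-line code without functions, classes, loops, or imports.

```python
import ast
import builtins


def undefined_names(source):
    """Return names that are read before being assigned in straight-line code."""
    defined = set(dir(builtins))  # names such as print and len always exist
    problems = []
    for statement in ast.parse(source).body:
        names = [n for n in ast.walk(statement) if isinstance(n, ast.Name)]
        # A statement's reads are checked before its own assignments take effect.
        for name in names:
            if isinstance(name.ctx, ast.Load) and name.id not in defined:
                problems.append(name.id)
        for name in names:
            if isinstance(name.ctx, ast.Store):
                defined.add(name.id)
    return problems


print(undefined_names("x = 1\ny = x + z\nprint(y)"))
```

The input is syntactically valid, so the parser accepts it; the problem with `z` is only found by inspecting the tree.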

### 5. Intermediate Code Generation
#### Objective:
To convert the semantically correct parse tree into an intermediate code that serves as an abstraction over the target machine code.

#### How it Works:

- This intermediate code is usually platform-independent.
- It allows for further optimization without having to deal with the specifics of the target architecture.
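
One common style of intermediate code is three-address code, where every instruction has at most one operator. The sketch below lowers an arithmetic expression into that form; it borrows Python's own parser for convenience, and the temporary names `t1`, `t2`, ... are an illustrative convention.

```python
import ast
import itertools

OPERATORS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def to_three_address(expression):
    """Lower an arithmetic expression to a list of `tN = a op b` instructions."""
    counter = itertools.count(1)
    code = []

    def lower(node):
        # Leaves are returned as operands; inner nodes emit one instruction each.
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
            left = lower(node.left)
            right = lower(node.right)
            temp = f"t{next(counter)}"
            code.append(f"{temp} = {left} {OPERATORS[type(node.op)]} {right}")
            return temp
        raise ValueError(f"unsupported construct: {type(node).__name__}")

    lower(ast.parse(expression, mode="eval").body)
    return code


for line in to_three_address("a + b * 2"):
    print(line)
```

Nothing in the output mentions registers or a particular CPU, which is what makes this kind of representation portable.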

### 6. Optimization
#### Objective:
To optimize the intermediate code for performance, memory usage, or other criteria.

#### How it Works:

- The compiler applies various optimization techniques to eliminate redundant code, improve data flow, etc.
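
Constant folding is one of the simplest such techniques: subexpressions whose operands are all known at compile time are replaced by their value. The sketch below applies it to Python expressions with `ast.NodeTransformer`; the class and function names are made up for this example, and only `+`, `-`, and `*` on numbers are folded.

```python
import ast
import operator

FOLDABLE = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def _is_number(node):
    return isinstance(node, ast.Constant) and isinstance(node.value, (int, float))


class ConstantFolder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the children first (bottom-up)
        if type(node.op) in FOLDABLE and _is_number(node.left) and _is_number(node.right):
            value = FOLDABLE[type(node.op)](node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value=value), node)
        return node


def fold_constants(expression):
    """Return the expression's source text after constant folding."""
    tree = ast.parse(expression, mode="eval")
    folded = ast.fix_missing_locations(ConstantFolder().visit(tree))
    return ast.unparse(folded)  # ast.unparse requires Python 3.9+


print(fold_constants("x * (2 + 3 * 4)"))
```

The part that depends on `x` is left alone, because its value is not known until the program runs.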

### 7. Code Generation
#### Objective:
To convert the optimized intermediate code into the target machine code or assembly language.

#### How it Works:

- The code generator produces the final output based on the specifics of the target architecture.
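
Real code generators target a concrete instruction set, which is beyond a short example. As a rough stand-in, the sketch below emits instructions for a hypothetical stack machine (`PUSH`, `ADD`, `SUB`, `MUL` are invented opcodes) and includes a tiny loop that executes them.

```python
import ast

OPCODES = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}


def generate_stack_code(expression):
    """Emit instructions for a hypothetical stack machine via a post-order walk."""
    program = []

    def emit(node):
        if isinstance(node, ast.Constant):
            program.append(("PUSH", node.value))
        elif isinstance(node, ast.BinOp) and type(node.op) in OPCODES:
            emit(node.left)   # operands are pushed first...
            emit(node.right)
            program.append((OPCODES[type(node.op)], None))  # ...then combined
        else:
            raise ValueError(f"unsupported construct: {type(node).__name__}")

    emit(ast.parse(expression, mode="eval").body)
    return program


def run_stack_code(program):
    """Execute the instruction list and return the value left on the stack."""
    stack = []
    for opcode, operand in program:
        if opcode == "PUSH":
            stack.append(operand)
            continue
        right = stack.pop()
        left = stack.pop()
        if opcode == "ADD":
            stack.append(left + right)
        elif opcode == "SUB":
            stack.append(left - right)
        else:  # "MUL"
            stack.append(left * right)
    return stack.pop()


program = generate_stack_code("2 + 3 * 4")
print(program)
print(run_stack_code(program))
```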

### 8. Linking
#### Objective:
To combine multiple machine code files (possibly from different sources) into a single executable.

#### How it Works:

- The linker resolves external references, assigns final memory addresses to functions and variables, and produces a single executable or library.

This gives you a broad overview of the entire process. Each of these stages can be quite complex and may involve many sub-steps. But this should provide a reasonable high-level understanding of what it takes to get from source code to machine code.

## Compilation vs Interpretation vs Hybrid Approach (review)

Let's review the concepts of compilation vs interpretation vs hybrid approach from previous lessons.

We already know that a compiler is a program that translates source code into machine code. But compiling everything ahead of time is not the only way to get a program running: interpretation and hybrid approaches are alternatives.

### Compilation

In compilation, the entire source code is translated into machine code ahead of time. The resulting machine code is stored as a separate file, which is executed later. This approach is typical of languages like C, C++, Rust, and Go.

### Interpretation

In interpretation, an interpreter reads the program and executes it directly, statement by statement, without first producing a standalone machine-code file. Languages like Python and JavaScript are usually described as interpreted, although their mainstream implementations first compile the source to an internal form.

### Hybrid Approach

In the hybrid approach, the source code is first compiled into intermediate code (often called bytecode), which is then executed by an interpreter or virtual machine. This approach is used by languages like Java, C#, Python (in the CPython implementation), and PHP.
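
CPython, the reference implementation of Python, makes these two steps visible through the built-ins `compile()` and `eval()`. The snippet below is a small illustration; the `co_code` attribute is a CPython implementation detail and its contents differ between versions.

```python
source = "6 * 7"

# Step 1: translate the source text into a code object holding bytecode.
code_object = compile(source, "<example>", "eval")
print(type(code_object).__name__)
print(isinstance(code_object.co_code, bytes))  # raw bytecode, not machine code

# Step 2: the interpreter's evaluation loop executes that bytecode.
result = eval(code_object)
print(result)
```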

### JIT Compilation

Just-in-time (JIT) compilation is a hybrid approach that combines the speed of compiled code with the flexibility of interpretation. In JIT compilation, code (typically intermediate code) is compiled into machine code at runtime, shortly before or while it executes. This approach is used by the standard runtimes of languages like C#, Java, and JavaScript.

## Abstract vs Concrete Syntax Trees

Understanding the difference between abstract and concrete syntax is crucial for those who are interested in the design and implementation of programming languages, compilers, or interpreters.

### Concrete Syntax
Concrete syntax, often called "surface syntax," refers to the literal textual or visual representation of a program. It specifies the exact sequence of characters that are valid in a program. This is what programmers actually write in their code editors.

For example, consider the following Python code for calculating the factorial of a number:

```python
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)

```
The concrete syntax involves everything: the layout of the code, the parentheses, the colons, the indentation, the keywords, etc. Concrete syntax is what the lexical and syntactic analysis phases of a compiler or interpreter read as input.

### Abstract Syntax
Abstract syntax, on the other hand, represents the hierarchical and structural view of a program, abstracting away many of the textual details. In this form, the focus is on the relationships between the elements, rather than on how they are specifically notated.

For the above Python example, an abstract syntax tree (AST) might represent the program like so:


- A `function definition` node with the name "factorial" and argument "n"
   - An `if-else` node
      - A `condition` node specifying that `n == 0`
      - A `return` node with value `1` (the `if` branch)
      - An `else` node
         - A `return` node
            - An `expression` node that multiplies `n` by `factorial(n-1)`

In abstract syntax, details like the placement of parentheses, specific keyword usage, and other syntactic sugar may be irrelevant, as they are not necessary for understanding the program's structure or meaning.
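
Python exposes its own abstract syntax trees through the standard `ast` module, so the factorial example can be inspected directly. The sketch below picks out a few nodes; node class names such as `FunctionDef` and `If` are the ones Python uses and differ slightly from the informal labels in the outline above.

```python
import ast

source = '''
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)
'''

tree = ast.parse(source)
function = tree.body[0]   # the function definition node
branch = function.body[0]  # the if-else node inside it

print(type(function).__name__, function.name, [a.arg for a in function.args.args])
print(type(branch).__name__, "| condition:", ast.unparse(branch.test))
print("if branch:  ", ast.unparse(branch.body[0]))
print("else branch:", ast.unparse(branch.orelse[0]))
```

Note that `ast.unparse` (Python 3.9+) prints `n - 1` with spaces even though the source had `n-1`: the original spacing is a concrete-syntax detail that the tree does not record.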

### Why Both Are Important
Both concrete and abstract syntax have their roles:


- **Concrete Syntax**: Important for writing, reading, and maintaining programs. It's what developers interact with directly.
- **Abstract Syntax**: Critical for program analysis, optimization, and transformation, often serving as an intermediary representation of the program within compilers and other tools.

In programming language theory and in the construction of compilers and interpreters, it's common to convert a program's concrete syntax into its abstract syntax as an early step. This abstract representation is often easier to work with when it comes to tasks like semantic analysis, optimization, and code generation.

Understanding the abstract syntax can also be crucial for tasks like programmatic code manipulation, refactoring, and analysis, which is why many tools and libraries offer ways to work directly with the abstract syntax tree of a program.

## Typical methods for concrete syntax specifications

The specification of concrete syntax for programming languages involves formal methods that provide a precise and unambiguous way to define the rules that determine how programs can be written in that language. Here are some of the most commonly used methods for specifying the concrete syntax:

### Backus-Naur Form (BNF) and Extended Backus-Naur Form (EBNF)
Backus-Naur Form (BNF) and its extension, Extended Backus-Naur Form (EBNF), are among the most widely used notations for specifying the syntax of programming languages. BNF was initially developed to describe the syntax of the Algol 60 programming language.

A BNF specification describes a language in terms of production rules, which define how a sequence of tokens can be generated from a given symbol (called a non-terminal).

For example, a simplified BNF-like specification for a basic arithmetic expression might look something like this:

```text
<expression> ::= <term> "+" <term>
               | <term> "-" <term>
               | <term>

<term>       ::= <factor> "*" <factor>
               | <factor> "/" <factor>
               | <factor>

<factor>     ::= "(" <expression> ")"
               | <number>

<number>     ::= "0" | "1" | "2" | ... | "9"

```
In EBNF, you can introduce more advanced constructs, like optional elements, repetitions, and groupings, making it more expressive than the original BNF.
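
To show how such a grammar drives a parser, here is a minimal recursive-descent sketch in Python with one method per non-terminal. It makes two assumptions that go beyond the BNF above: the rules are read in an EBNF style with repetition (`expression = term { ("+" | "-") term }`), so chains like `1 + 2 + 3` are accepted, and numbers may have several digits. All names are illustrative.

```python
import re


def tokenize(text):
    # Simplification: characters that are not digits, operators, or parentheses are ignored.
    return re.findall(r"\d+|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expression(self):
        # expression = term { ("+" | "-") term }
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):
        # term = factor { ("*" | "/") factor }
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):
        # factor = "(" expression ")" | number
        token = self.advance()
        if token == "(":
            node = self.expression()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    """Return a nested (operator, left, right) tuple tree for `text`."""
    parser = Parser(tokenize(text))
    tree = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("1 + 2 * 3"))
print(parse("(1 + 2) * 3"))
```

Because `term` sits below `expression` in the grammar, multiplication binds tighter than addition without any explicit precedence table.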

### Regular Expressions
Regular expressions are often used in the lexical analysis phase of a compiler to specify the format of tokens such as identifiers, numbers, and special symbols. For example, a regular expression for a basic identifier in many programming languages might be `[a-zA-Z_][a-zA-Z_0-9]*`.
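
A quick way to try this pattern is Python's `re` module. The helper name `is_identifier` is invented for the example; `fullmatch` is used so that the whole string, not just a prefix, must fit the pattern.

```python
import re

IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")


def is_identifier(text):
    """Return True if the entire string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None


for candidate in ["total", "_tmp1", "2fast", "my-var"]:
    print(candidate, is_identifier(candidate))
```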

### Syntax Diagrams (Railroad Diagrams)
Syntax diagrams, also known as railroad diagrams, provide a graphical representation of the syntax rules. In these diagrams, each rule is represented as a path, often with forks and loops to indicate optional or repeated elements.

Syntax diagrams are quite intuitive to understand but are not as concise as BNF or EBNF for complex languages.

### Augmented Parsing Methods
Grammar classes tied to specific parsing algorithms, such as LL (Left-to-right, Leftmost derivation), LR (Left-to-right, Rightmost derivation), and LALR (Look-Ahead LR), constrain how syntax rules may be written so that a parser can be generated from them automatically. These classes are more machine-oriented and are used by parser generators like YACC (Yet Another Compiler-Compiler), which is LALR-based, and ANTLR (Another Tool for Language Recognition), which is LL-based, to produce parsers from a given grammar.

### Attribute Grammars
Attribute grammars extend the concept of grammars like BNF by adding semantic rules to the syntax rules. Each syntax rule is associated with a set of attributes and equations that compute the attributes. Attribute grammars are powerful tools for specifying both syntax and some aspects of semantics.
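
As a loose illustration of the idea, the sketch below computes a single synthesized attribute, `value`, for arithmetic expression trees: each node's value is defined by an equation over the values of its children. The tuple encoding `(operator, left, right)` is an assumption made for this example, not a standard format.

```python
import operator

# One equation per production: how a node's `value` follows from its children's values.
RULES = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def value(node):
    """Synthesized attribute `value`, computed bottom-up."""
    if isinstance(node, int):  # <number>: the value is the number itself
        return node
    op, left, right = node
    return RULES[op](value(left), value(right))


tree = ("+", 1, ("*", 2, 3))  # the expression 1 + 2 * 3
print(value(tree))
```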

These are just a few of the most common methods. Each method has its own set of advantages and disadvantages. Some are more intuitive and human-readable, while others are more machine-oriented and easier to parse. Some are more expressive, while others are more concise. Some are more suitable for specifying syntax, while others can also be used to specify semantics.

## BNF: Backus-Naur Form

The Backus-Naur Form (BNF) is most accurately described as a notation for writing context-free grammars (CFGs). In formal language theory, context-free grammars are used to generate context-free languages. These grammars are powerful enough to describe the syntactic structure of most programming languages.

### Context-Free Grammars (CFG)
In a context-free grammar, the production rules specify that a single non-terminal can be replaced by a sequence of terminals and/or non-terminals. The "context-free" part means that the replacement of a non-terminal does not depend on its surrounding symbols (its "context").

A context-free grammar is typically defined as a 4-tuple $(N, \Sigma, P, S)$, where:

- $N$ is a set of non-terminal symbols.
- $\Sigma$ is a set of terminal symbols (disjoint from $N$).
- $P$ is a set of production rules, each transforming a single non-terminal into a sequence of terminals and/or non-terminals.
- $S$ is the start symbol, an element of $N$.

### Example
A simplified BNF-like representation of an arithmetic expression could be:

```text
<expression> ::= <term> "+" <term>
               | <term> "-" <term>
               | <term>

<term>       ::= <factor> "*" <factor>
               | <factor> "/" <factor>
               | <factor>

<factor>     ::= "(" <expression> ")"
               | <number>

<number>     ::= "0" | "1" | "2" | ... | "9"

```
In this example:


- $N$ would be {`<expression>`, `<term>`, `<factor>`, `<number>`}
- $\Sigma$ would be {"+", "-", "*", "/", "(", ")", "0", "1", "2", ..., "9"}
- $P$ would contain the production rules as shown in the BNF.
- $S$ would be `<expression>` (assuming we are interested in parsing expressions).
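
The four components can also be written down as plain Python data, which makes the definition easy to check and experiment with. This is only a sketch: the encoding of rules as lists of symbols and the helper names `is_well_formed` and `leftmost_derivation` are choices made for this example.

```python
N = {"<expression>", "<term>", "<factor>", "<number>"}
SIGMA = {"+", "-", "*", "/", "(", ")"} | set("0123456789")
P = {
    "<expression>": [["<term>", "+", "<term>"], ["<term>", "-", "<term>"], ["<term>"]],
    "<term>": [["<factor>", "*", "<factor>"], ["<factor>", "/", "<factor>"], ["<factor>"]],
    "<factor>": [["(", "<expression>", ")"], ["<number>"]],
    "<number>": [[digit] for digit in "0123456789"],
}
S = "<expression>"


def is_well_formed(nonterminals, terminals, productions, start):
    """Check the side conditions from the 4-tuple definition."""
    symbols_known = all(
        symbol in nonterminals or symbol in terminals
        for alternatives in productions.values()
        for alternative in alternatives
        for symbol in alternative
    )
    return (
        nonterminals.isdisjoint(terminals)
        and start in nonterminals
        and set(productions) <= nonterminals  # every left-hand side is one non-terminal
        and symbols_known
    )


def leftmost_derivation(choices):
    """Rewrite the leftmost non-terminal using the alternative at each given index."""
    form = [S]
    steps = [" ".join(form)]
    for choice in choices:
        index = next(i for i, symbol in enumerate(form) if symbol in N)
        form[index:index + 1] = P[form[index]][choice]
        steps.append(" ".join(form))
    return steps


print(is_well_formed(N, SIGMA, P, S))
for step in leftmost_derivation([0, 2, 1, 1, 2, 1, 2]):
    print(step)
```

The list of indices selects, step by step, which alternative of the current rule to apply; the sequence shown derives the sentence `1 + 2` from the start symbol.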

BNF allows us to easily represent the hierarchical structure of programming constructs, making it well-suited for describing the syntax of programming languages. However, while BNF is useful for specifying the structure of syntactically valid sentences, it does not capture their semantics, that is, their meaning or behavior.