# Lexical Analysis

Lexical analysis, also known as scanning or tokenization, is the first phase of a compiler or interpreter. It breaks the source code into lexemes (the raw character sequences) and classifies each lexeme as a token. The goal of lexical analysis is to recognize and categorize the different elements in the source code, such as keywords, identifiers, constants, operators, and punctuation symbols.

Here's an overview of the lexical analysis process:

1. Input: The input to the lexical analysis phase is the source code written in a programming language.

2. Scanning: The scanner reads the source code character by character and groups the characters into lexemes based on predefined patterns, usually regular expressions. It skips characters that carry no meaning for later phases, such as whitespace and comments.

3. Tokenization: Each lexeme is classified into a specific category called a token. Tokens represent the basic building blocks of the programming language and have predefined meanings. Common token categories include keywords, identifiers, literals, operators, and punctuation symbols.

4. Building Tokens: The scanner constructs tokens by assigning a token type and, in some cases, additional attributes or values. For example, the token "if" might be assigned the token type "KEYWORD," while the identifier "count" might be assigned the token type "IDENTIFIER" with the attribute "count."

5. Symbol Table: During lexical analysis, a symbol table may be maintained to keep track of identifiers. The scanner typically records only the names; later phases attach the associated information (e.g., types, scopes, memory locations). The symbol table is used by subsequent phases of compilation or interpretation.

6. Error Handling: If the scanner encounters an invalid or unrecognized lexeme, it generates an error message or token indicating a lexical error. These errors are typically reported to the programmer to help identify and correct mistakes in the source code.

7. Output: The output of the lexical analysis phase is a stream of tokens, each representing a categorized element of the source code. These tokens are passed on to the subsequent phases of the compiler or interpreter, such as parsing and semantic analysis.

The lexical analysis phase serves as a foundation for the subsequent phases of compilation or interpretation. It ensures that the source code is divided into meaningful units that can be further processed and analyzed by the compiler or interpreter. The tokens generated during lexical analysis provide a structured representation of the source code that can be used to understand its syntactic structure and semantics.
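The steps above can be sketched with Python's standard `re` module. This is a minimal illustration, not a production scanner: the `Token` type, the `KEYWORDS` set, and the patterns in `TOKEN_SPEC` are assumptions chosen for the example, not part of any particular language definition.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    type: str       # token category, e.g. "KEYWORD"
    value: str      # the lexeme as it appeared in the source
    position: int   # character offset of the lexeme


KEYWORDS = {"if", "else", "while", "return"}  # hypothetical keyword set

# Order matters: earlier patterns win when several match at the same position.
TOKEN_SPEC = [
    ("COMMENT",     r"#[^\n]*"),
    ("WHITESPACE",  r"\s+"),
    ("NUMBER",      r"\d+(?:\.\d+)?"),
    ("IDENTIFIER",  r"[A-Za-z_]\w*"),
    ("OPERATOR",    r"==|<=|>=|!=|[-+*/=<>]"),
    ("PUNCTUATION", r"[(){};,]"),
    ("ERROR",       r"."),  # any other single character is a lexical error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens for `source`, skipping whitespace and comments."""
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind in ("WHITESPACE", "COMMENT"):
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {text!r} at position {match.start()}")
        if kind == "IDENTIFIER" and text in KEYWORDS:
            kind = "KEYWORD"  # keywords look like identifiers until checked
        yield Token(kind, text, match.start())


for token in tokenize("if count >= 10  # check the limit"):
    print(token)
```

Keywords are matched as identifiers first and reclassified afterwards, which keeps the pattern list short and avoids treating `iffy` as the keyword `if` followed by `fy`.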

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Parser

A parser, also known as a syntactic analyzer, is a component of a compiler or interpreter that takes the stream of tokens produced by the lexical analysis phase and checks whether they conform to the grammar of the programming language. The parser performs syntax analysis, determining the structure and organization of the code based on the language's rules.

Here's an overview of the parser's role and the parsing process:

1. Input: The input to the parser is the stream of tokens generated by the lexical analysis phase.

2. Grammar: The parser uses a formal grammar, often expressed in a notation like Backus-Naur Form (BNF) or Extended Backus-Naur Form (EBNF), to define the syntax of the programming language. The grammar consists of production rules that describe how valid expressions and statements can be formed.

3. Parsing Techniques: The parser applies various parsing techniques to analyze the token stream and construct a parse tree or abstract syntax tree (AST) representing the syntactic structure of the code. Common parsing techniques include:

   - Recursive Descent Parsing: A top-down parsing technique in which the parser's code mirrors the grammar, usually with one function per nonterminal. These functions call each other recursively to match the input tokens and build the parse tree.

   - LL Parsing: A top-down parsing technique that reads the input from left to right and produces a leftmost derivation. It uses a lookahead mechanism to decide which production rule to apply based on the next few tokens in the input stream. LL parsers are often generated automatically from LL grammars.

   - LR Parsing: A bottom-up parsing technique that reads the input from left to right and builds the parse tree by repeatedly reducing the right-hand side of a production to its nonterminal, which yields a rightmost derivation in reverse. LR parsers can handle a broader class of grammars than LL parsers.

4. Parse Tree or Abstract Syntax Tree (AST): The parser constructs a parse tree or AST, representing the hierarchical structure of the code according to the grammar. The parse tree is a concrete representation of the syntax, while the AST abstracts away unnecessary details and focuses on the essential elements of the code.

5. Error Handling: If the parser encounters a syntax error or an input that does not conform to the grammar, it generates an error message indicating a syntax error. These errors are typically reported to the programmer, helping identify and correct syntax mistakes in the code.

6. Output: The output of the parsing phase is either a parse tree or an AST. These structures serve as the foundation for subsequent phases, such as semantic analysis and code generation, in the compilation or interpretation process.

The parser's primary goal is to ensure that the input code is syntactically correct according to the language's grammar. It provides a structured representation of the code's syntax, enabling further analysis and transformation in subsequent compiler or interpreter phases.
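As an illustration of the process described above, here is a minimal recursive descent parser for arithmetic expressions. It is a sketch under stated assumptions: the grammar in the comment, the `Parser` class, the tiny `tokenize` helper, and the tuple-based AST are all made up for this example. The helper silently drops characters it does not recognize, which a real scanner would report as errors.

```python
import re

# Assumed grammar (EBNF):
#   expression ::= term { ('+' | '-') term }
#   term       ::= factor { ('*' | '/') factor }
#   factor     ::= NUMBER | '(' expression ')'


def tokenize(text):
    """Split text into number, operator, and parenthesis strings."""
    return re.findall(r"\d+|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Return the current token without consuming it, or None at the end."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.peek()!r}")
        self.pos += 1

    def parse(self):
        tree = self.expression()
        if self.peek() is not None:  # leftover tokens mean a syntax error
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expression(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.term())  # left-associative
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.factor())
        return node

    def factor(self):
        token = self.peek()
        if token == "(":
            self.pos += 1
            node = self.expression()
            self.expect(")")
            return node  # parentheses leave no node of their own in the AST
        if token is not None and token.isdigit():
            self.pos += 1
            return ("num", int(token))
        raise SyntaxError(f"expected a number or '(', found {token!r}")


print(Parser(tokenize("2 + 3 * 4")).parse())
# ('+', ('num', 2), ('*', ('num', 3), ('num', 4)))
```

Each method corresponds to one nonterminal of the assumed grammar. The result is an AST rather than a full parse tree: parentheses and the intermediate `term` and `factor` levels do not appear in the output.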

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### BNF

Backus-Naur Form (BNF) is a notation used to formally describe the syntax of a programming language or other formal languages. It was developed by John Backus and Peter Naur in the late 1950s and early 1960s as a way to define the structure of programming languages in a clear and concise manner.

BNF is a metalanguage, which means it is a language used to describe other languages. It provides a set of rules and symbols to specify the structure of the language being defined. BNF consists of production rules, which define how valid expressions can be formed in the language.

Here are the key components of BNF:

1. Nonterminal symbols: Nonterminal symbols represent syntactic categories or components within the language being defined. In BNF they are written as names enclosed in angle brackets. For example, `<expression>` could be a nonterminal symbol representing a valid expression in the language.

2. Terminal symbols: Terminal symbols represent the basic building blocks or tokens of the language. They are the smallest units of the grammar and cannot be expanded any further. Terminal symbols are typically written as literal strings or symbols, often in quotes. For example, `if`, `while`, and `+` could be terminal symbols in a programming language.

3. Production rules: Production rules define the possible ways in which nonterminal symbols can be expanded or derived. Each production rule consists of a nonterminal symbol on the left-hand side (lhs) and a sequence of nonterminal and/or terminal symbols on the right-hand side (rhs). The `::=` symbol is used to denote the expansion. For example:

   ```
   <expression> ::= <term> '+' <expression>
   ```

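   This rule states that an `<expression>` can be derived by combining a `<term>`, followed by a `+` symbol, followed by another `<expression>`. Because the rule refers to itself, a complete grammar also needs a non-recursive alternative, such as a `<term>` alone, so that a derivation can terminate.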

4. Alternatives: Alternatives allow for multiple options within a production rule. They are represented using the `|` symbol. For example:

   ```
   <term> ::= <factor> '*' <term> | <factor>
   ```

   This rule states that a `<term>` can be derived either by combining a `<factor>`, followed by a `*` symbol, followed by another `<term>`, or simply by a `<factor>` alone.

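5. Optional elements: Square brackets `[ ]` indicate that the enclosed part is optional. This notation comes from Extended BNF (EBNF); in plain BNF the same effect requires two alternatives, one with the part and one without it. For example: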

   ```
   <statement> ::= 'if' <condition> 'then' <statement> ['else' <statement>]
   ```

   This rule states that the `else` branch is optional in an `if-then` statement.

BNF provides a concise and formal way to specify the syntax of a language. It serves as a foundation for language designers, compiler developers, and programmers to understand and implement the structure of programming languages accurately. By following the rules defined in BNF, it becomes possible to generate valid sentences or expressions in the defined language and identify syntax errors when the rules are violated.
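To make the idea of checking input against production rules concrete, the sketch below stores the `<expression>` and `<term>` rules from above as plain Python data and tests whether a token list can be derived from them. Several parts are assumptions added for the example: the `<expression> ::= <term>` alternative, the whole `factor` rule, the terminal `num`, and the function names. The backtracking approach only terminates because this grammar has no left recursion.

```python
# Each nonterminal maps to its list of alternatives;
# each alternative is a sequence of symbols.
GRAMMAR = {
    "expression": [["term", "+", "expression"], ["term"]],
    "term":       [["factor", "*", "term"], ["factor"]],
    "factor":     [["num"], ["(", "expression", ")"]],  # assumed rule
}


def match(symbol, tokens, pos):
    """Yield every position at which `symbol` can end when it starts at `pos`."""
    if symbol not in GRAMMAR:  # terminal: must equal the current token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for alternative in GRAMMAR[symbol]:  # nonterminal: try each alternative
        yield from match_sequence(alternative, tokens, pos)


def match_sequence(symbols, tokens, pos):
    """Yield every end position for matching `symbols` one after another."""
    if not symbols:
        yield pos
        return
    for middle in match(symbols[0], tokens, pos):
        yield from match_sequence(symbols[1:], tokens, middle)


def is_valid(tokens, start="expression"):
    """Return True if the whole token list can be derived from `start`."""
    return any(end == len(tokens) for end in match(start, tokens, 0))


print(is_valid(["num", "+", "num", "*", "num"]))  # True
print(is_valid(["num", "+"]))                     # False
```

Keeping the grammar as data separates the language definition from the matching algorithm, which is the same separation that parser generators rely on.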

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Binary Tree Structure

One common way to represent the result of parsing is a tree data structure, the parse tree or syntax tree. For grammars built around binary operators, this tree is often a binary tree. A parse tree represents the syntactic structure of the input according to the grammar rules. Each node in the tree represents a grammatical construct, and the edges represent the relationships between these constructs.

Here's a high-level overview of how parsing with a binary tree works:

1. Tokenization: The input sequence is first divided into individual tokens or lexical units. These tokens can be keywords, identifiers, operators, literals, or other meaningful units.

2. Parsing Algorithm: A parsing algorithm, such as a top-down or bottom-up approach, is used to build the parse tree. In the case of a binary tree, each node in the tree has at most two children.

   - Top-down parsing (e.g., recursive descent): It starts from the root of the tree and recursively applies grammar rules to the input. The parsing algorithm typically uses predictive parsing, where the next rule to apply is determined based on the current token and the expected grammar production.

   - Bottom-up parsing (e.g., LR parsing): It starts from the individual tokens and builds the tree from the bottom up. The parsing algorithm uses a stack to keep track of the partial parse tree and applies grammar rules in reverse until it reaches the root.

3. Building the Parse Tree: As the parsing algorithm progresses, it constructs the parse tree by creating nodes for each grammar construct encountered and establishing the parent-child relationships.

4. Ambiguity and Error Handling: If the grammar is ambiguous, the same input can have multiple valid parse trees. In such cases, the parsing algorithm may apply disambiguation rules, such as operator precedence and associativity, or generate multiple parse trees. If the input is not valid according to the grammar, parsing errors occur, and appropriate error handling strategies need to be employed.

5. Further Analysis or Processing: Once the parse tree is constructed, it can be further processed for various purposes. For example, it can be used to generate intermediate representations, perform semantic analysis, or execute the code represented by the input.
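The bottom-up approach from step 2 can be sketched with a simplified operator-precedence parser, one of the simplest members of the shift-reduce family. It is not a table-driven LR parser, and the `PRECEDENCE` table, the function name, and the tuple-based tree are assumptions made for this example. It handles only binary operators over plain operands, without parentheses.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}  # assumed operator set


def parse_bottom_up(tokens):
    """Build a binary tree of (operator, left, right) tuples from tokens."""
    operands = []    # stack of completed subtrees
    operators = []   # stack of operators still waiting to be reduced

    def reduce():
        # Replace the two topmost subtrees and an operator with one new node.
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    expect_operand = True
    for token in tokens:
        if token in PRECEDENCE:
            if expect_operand:
                raise SyntaxError(f"unexpected operator {token!r}")
            # Reduce while the operator on the stack binds at least as tightly.
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]:
                reduce()
            operators.append(token)   # shift the operator
            expect_operand = True
        else:
            if not expect_operand:
                raise SyntaxError(f"unexpected operand {token!r}")
            operands.append(token)    # shift the operand as a leaf
            expect_operand = False
    if expect_operand:
        raise SyntaxError("input ends where an operand was expected")
    while operators:
        reduce()
    return operands[0]


print(parse_bottom_up(["2", "+", "3", "*", "4"]))
# ('+', '2', ('*', '3', '4'))
```

The tree grows from the leaves upward: `3 * 4` is reduced to a subtree before `+` is reduced, so the root is the last node to be created.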

Binary trees are a natural fit for expression syntax because most operators take exactly two operands. Each interior node holds an operator, and its left and right children hold the subexpressions the operator applies to. By following the parent-child relationships in the tree, the entire structure of the input can be understood and analyzed.

Note that while binary trees are a common representation for parse trees, other tree structures or even graph structures can be used depending on the requirements of the specific parsing algorithm or application.
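A short sketch of such a binary expression tree follows. The `Node` class and the `evaluate` and `to_infix` helpers are hypothetical names chosen for this example. The tree is built by hand here; in practice a parser would construct it.

```python
import operator
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: str                      # an operator symbol or a numeric literal
    left: Optional["Node"] = None
    right: Optional["Node"] = None


OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def evaluate(node: Node) -> float:
    """Post-order traversal: evaluate both children, then apply the operator."""
    if node.left is None and node.right is None:
        return float(node.value)    # leaf: a number
    return OPS[node.value](evaluate(node.left), evaluate(node.right))


def to_infix(node: Node) -> str:
    """In-order traversal: rebuild a fully parenthesized expression."""
    if node.left is None and node.right is None:
        return node.value
    return f"({to_infix(node.left)} {node.value} {to_infix(node.right)})"


# Tree for 2 + 3 * 4: the multiplication sits deeper, so it is evaluated first.
tree = Node("+", Node("2"), Node("*", Node("3"), Node("4")))

print(to_infix(tree))   # (2 + (3 * 4))
print(evaluate(tree))   # 14.0
```

Operator precedence is encoded in the shape of the tree rather than in the traversal code, which is why the same `evaluate` function works for any expression the parser can produce.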

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)