In [1]:
import { display } from "tslab";
import { readFileSync } from "fs";

const css = readFileSync("../style.css", "utf8");
display.html(`<style>${css}</style>`);

# Evaluating an Exam Using Chevrotain

This notebook shows how we can use the module [`chevrotain`](https://chevrotain.io/docs/) to implement a scanner.

Our goal is to implement a program that can be used to evaluate the results of an exam.

Assume the result of an exam is stored in the string `data` that is defined below:

In [22]:
const data = `Class: Algorithms and Complexity
          Group: TINF22AI1
          MaxPoints = 60
   
          Exercise:      1. 2. 3. 4. 5. 6.
          Jim Smith:     9 12 10  6  6  0
          John Slow:     4  4  2  0  -  -
          Susi Sorglos:  9 12 12  9  9  6
          1609922:       7  4 12  5  5  3
       `;

These data show that an exam on the subject <em style="color:blue">Algorithms and Complexity</em> was held
in the group <em style="color:blue">TINF22AI1</em>.  Furthermore, the equation
```
   MaxPoints = 60
```
shows that in order to achieve the best mark, <em style="color:blue">60</em> points would have been necessary.

There were 6 exercises in this exam and, in this small example, only four students took part: *Jim Smith*, *John Slow*, *Susi Sorglos*, and a student who is identified only by their matriculation number. Each row describing a student's results begins with the name (or matriculation number) of the student, followed by the points achieved in each exercise. Our goal is to write a program that computes the marks for all students.

## Importing the Chevrotain Library

We will use the package [Chevrotain](https://chevrotain.io/).

In particular, we will use:
- The **lexer generator** provided by `createToken` and the `Lexer` class
- **TypeScript interfaces** for type-safe token processing: `ILexingResult`, `IToken`, and `ILexingError`
- JavaScript's built-in **regular expressions** (available unchanged in TypeScript) to match and extract patterns

In [3]:
import {
  createToken,
  Lexer,
  ILexingResult,
  IToken,
  ILexingError
} from "chevrotain";

**Interface Overview:**

- `ILexingResult`: Return type of `lexer.tokenize()` containing:
  - `tokens: IToken[]` – Array of recognized tokens
  - `errors: ILexingError[]` – Array of lexing errors
  - `groups: Record<string, IToken[]>` – Tokens collected into custom groups (empty unless a token definition assigns one)

- `IToken`: Represents a single token with properties:
  - `image: string` – The matched text
  - `tokenType: TokenType` – The token type
  - `startLine: number`, `startColumn: number` – Position in input

- `ILexingError`: Describes lexing errors with:
  - `line: number`, `column: number` – Error position
  - `offset: number` – Character offset in input string
  - `length: number` – Length of the erroneous character sequence
  - `message: string` – Error description

These interfaces enable **type-safe processing** of lexer results in TypeScript.

## Auxiliary Functions

The function `mark(maxPoints: number, points: number): number` takes two arguments and returns a numeric grade:

**Parameters:**
- `maxPoints: number` - The number of points needed to achieve the best mark of 1.0
- `points: number` - The number of points achieved by the student

**Return value:**
- `number` - The calculated grade (between 1.0 and 5.0)

It is assumed that the relation between the mark and the points is mostly linear. A student who achieves 50% of `maxPoints` will get the mark 4.0, while 100% results in mark 1.0.

The formula to calculate the grade is:
$$ \textrm{grade} = 7 - 6 \cdot \frac{\texttt{points}}{\texttt{maxPoints}} $$

However, the worst mark is 5.0. The `Math.min()` function ensures the grade does not exceed 5.0. The result is rounded to one decimal place using `Math.round()`.
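As a worked example, here is the formula applied to two rows of the example data. Jim Smith's 43 of 60 points give

$$ \textrm{grade} = 7 - 6 \cdot \frac{43}{60} = 7 - 4.3 = 2.7 $$

John Slow's 10 points would give an uncapped value of 6.0, so the cap applies:

$$ \min\left(5.0,\; 7 - 6 \cdot \frac{10}{60}\right) = \min(5.0,\; 6.0) = 5.0 $$

In general, the cap takes effect whenever `points` is at most one third of `maxPoints`, because $7 - 6 \cdot \frac{\texttt{points}}{\texttt{maxPoints}} \geq 5$ exactly when $\frac{\texttt{points}}{\texttt{maxPoints}} \leq \frac{1}{3}$.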

In [4]:
function mark(maxPoints: number, points: number): number {
    const grade = 7 - 6 * points / maxPoints;
    return Math.round(Math.min(5.0, grade) * 10) / 10;
}

## Visualizing the Grading Function

To better understand how our `mark()` function converts points to grades, let's visualize it:

In [5]:
import { plotGradeFunction } from "./utils/plotGrade";

plotGradeFunction(mark, 60);

The resulting plot shows that the grade stays at 5.0 (worst) up to 20 points, one third of the maximum, and then decreases linearly to 1.0 (best) at 60 points. A grade of 4.0 is reached at exactly 50% of the maximum points (30 points).

## Token Definitions

In this section, we will define the tokens needed to process our exam data.

Each token is created using Chevrotain's `createToken` function, which takes a configuration object with two main properties:
- `name` - A string identifying the token type
- `pattern` - A regular expression that defines what strings this token matches

### The `HEADER` Token

The `HEADER` token is designed to match informational lines at the beginning of our exam data.

Looking at our example data:

```
Class: Algorithms and Complexity
Group: TINF22AI1
Exercise: 1. 2. 3. 4. 5. 6.
```

Each HEADER line follows this pattern:
1. It starts with one or more letters (for example, "Class", "Group", or "Exercise")
2. This is followed by a colon `:`
3. After the colon comes any descriptive text (such as the course name, group, or exercise numbers)
4. The line ends with a newline character

The regular expression `/[A-Za-z]+:.*\n/` captures this pattern:
- `[A-Za-z]+` matches one or more letters (upper or lowercase)
- `:` matches the literal colon character
- `.*` matches any characters after the colon (the descriptive text)
- `\n` matches the newline at the end

**Note:** By including the newline in the pattern, we ensure that the entire line is recognized as a single token.

In [6]:
const Header = createToken({ 
  name: "HEADER", 
  pattern: /[A-Za-z]+:.*\n/ 
});
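To see which lines the pattern accepts, the following sketch tests it against a few sample lines with a plain `RegExp`, without involving Chevrotain. The leading `^` is added for this check only; it mimics the lexer, which always matches at the current input position. The names `headerPattern` and `headerSamples` are illustrative.

```typescript
// Same pattern as the HEADER token, anchored with ^ so that only a match
// at the very start of the string counts.
const headerPattern = /^[A-Za-z]+:.*\n/;

const headerSamples: string[] = [
  "Class: Algorithms and Complexity\n", // header line
  "Group: TINF22AI1\n",                 // header line
  "Jim Smith:     9 12 10  6  6  0\n",  // student row: a space comes before the colon
  "MaxPoints = 60\n",                   // no colon at all
];

const headerMatches: boolean[] = headerSamples.map((line) => headerPattern.test(line));
console.log(headerMatches); // [ true, true, false, false ]
```

The student row is rejected because the first run of letters (`Jim`) is followed by a space rather than a colon, which is what keeps `HEADER` from swallowing result lines.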

### The `MAXDEF` Token

The `MAXDEF` token matches the line that defines the maximum number of points for the exam.

In our example data, this line looks like:

```
MaxPoints = 60
```

The regular expression `/MaxPoints\s*=\s*[1-9][0-9]*/` captures this pattern:
- `MaxPoints` matches the literal string
- `\s*` matches any amount of whitespace before and after the equals sign
- `=` matches the literal equals sign
- `[1-9][0-9]*` matches a number without leading zeros (e.g., "60", "100")

This token is important because it tells us how many points are needed for the best possible grade.

In [7]:
const MaxDef = createToken({ 
  name: "MAXDEF", 
  pattern: /MaxPoints\s*=\s*[1-9][0-9]*/ 
});

### The `NAME` Token

The `NAME` token matches the name of a student, which is always followed by a colon.

Student names consist of two or more words made of letters, separated by single spaces. Hyphens are not covered by the pattern below. For example:

```
Jim Smith:
Susi Sorglos:
```

The regular expression `/[A-Za-z]+(?: [A-Za-z]+)+:/` ensures:
- The name starts with one or more letters
- It contains at least one space (to distinguish names from headers)
- It ends with a colon `:`

This token helps us identify which student the following points belong to.

In [8]:
const Name = createToken({ 
  name: "NAME", 
  pattern: /[A-Za-z]+(?: [A-Za-z]+)+:/ 
});
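The boundaries of this pattern can be checked with a small sketch using a plain `RegExp`. The anchor `^` is added for the check only, and `Anna-Lena Meier` is a made-up name that does not appear in the exam data.

```typescript
// Same pattern as the NAME token, anchored at the start of the string.
const namePattern = /^[A-Za-z]+(?: [A-Za-z]+)+:/;

const nameSamples: string[] = [
  "Jim Smith:",       // two words, single space
  "Susi Sorglos:",    // two words, single space
  "Class:",           // single word: looks like a header, not a name
  "Anna-Lena Meier:", // hypothetical hyphenated name: '-' is not in [A-Za-z]
];

const nameMatches: boolean[] = nameSamples.map((text) => namePattern.test(text));
console.log(nameMatches); // [ true, true, false, false ]
```

If hyphenated names had to be supported, the character class would need to be extended; as written, such a name would not be recognized as a `NAME` token.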

### The `MATRICULATION` Token

The `MATRICULATION` token matches a student identification number.

Some students are identified by a 7-digit matriculation number followed by a colon, for example:

```
1609922:
```

The regular expression `/[0-9]{7}:/` ensures:
- Exactly seven digits (`[0-9]{7}`)
- Followed by a colon (`:`)

This token helps us process students who are listed by their ID instead of their name.

In [9]:
const Matriculation = createToken({ 
  name: "MATRICULATION", 
  pattern: /[0-9]{7}:/ 
});

### The `NUMBER` Token

The `NUMBER` token matches the points a student achieved in an exercise.

A number is either exactly `0` or starts with a digit from 1-9 followed by any number of digits. This prevents leading zeros, so "007" would be tokenized as three separate numbers: `0`, `0`, `7`.

The regular expression `/0|[1-9][0-9]*/` ensures:
- Either a single zero (`0`)
- Or a non-zero digit followed by more digits (`[1-9][0-9]*`)

These tokens are used to sum up the points for each student.

In [10]:
const Number = createToken({ 
  name: "NUMBER", 
  pattern: /0|[1-9][0-9]*/ 
});
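The leading-zero behaviour described above can be reproduced with a global match on a plain `RegExp`. This is only an approximation of the lexer (a global match would silently skip unmatched characters, which a lexer would report), but for digit-only input the two agree. The variable names are illustrative.

```typescript
// Same pattern as the NUMBER token, with the global flag to collect all matches.
const numberPattern = /0|[1-9][0-9]*/g;

const splitLeadingZeros = "007".match(numberPattern);
const plainNumber = "120".match(numberPattern);

console.log(splitLeadingZeros); // [ '0', '0', '7' ]
console.log(plainNumber);       // [ '120' ]
```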

### The `DASH` Token

The `DASH` token matches a hyphen/minus character `-`.

In the exam data, dashes indicate that a student did not attempt a specific exercise. For example:

```
John Slow: 4 4 2 0 - -
```


Here, John Slow didn't attempt exercises 5 and 6 (indicated by the dashes).

The regular expression `/-/` simply matches a single dash character.

Since dashes don't contribute to the point total, we add this token to the `SKIPPED` group. This means:
- The lexer recognizes dashes (so they don't cause errors)
- They are not included in the token stream
- They effectively represent 0 points

This is similar to how we handle whitespace - recognized but not processed.

In [11]:
const Dash = createToken({ 
  name: "DASH", 
  pattern: /-/, 
  group: Lexer.SKIPPED 
});
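The effect of skipping dashes on the point total can be sketched without the lexer. The helper `sumRowPoints` below is hypothetical and not part of the notebook; it splits the part of a row after the name on whitespace and drops the dashes before summing, which has the same effect as placing `DASH` in the `SKIPPED` group.

```typescript
// Hypothetical helper: sums the points in the part of a row that follows the name.
function sumRowPoints(rowPoints: string): number {
  return rowPoints
    .trim()
    .split(/\s+/)
    .filter((entry) => entry !== "-") // dashes are dropped, like SKIPPED tokens
    .reduce((total, entry) => total + parseInt(entry, 10), 0);
}

console.log(sumRowPoints("4  4  2  0  -  -")); // 10
console.log(sumRowPoints("9 12 10  6  6  0")); // 43
```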

### The `WS` Token

Whitespace between tokens (spaces, tabs, and carriage returns) should be ignored.

In Chevrotain, we use a token in the `SKIPPED` group to recognize and discard it. The regular expression `/[ \t\r]+/` matches any sequence of spaces, tabs, or carriage returns. Newlines are deliberately excluded because they are handled by the `LINEBREAK` token.

This ensures that indentation and column padding in the input do not affect the processing.

In [12]:
const Whitespace = createToken({ 
  name: "WS", 
  pattern: /[ \t\r]+/, 
  group: Lexer.SKIPPED 
});


### The `LINEBREAK` Token

The `LINEBREAK` token matches the newline character `\n`.

This token is important for detecting the end of a student's record. When we reach a LINEBREAK, we know it's time to calculate and output the student's grade.

The regular expression `/\n/` matches a single newline character.


In [13]:
const Linebreak = createToken({ 
  name: "LINEBREAK", 
  pattern: /\n/ 
});


## Creating the Lexer

Now that we have defined all our tokens, we need to collect them in an array and create the lexer.

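**Important:** The order of tokens matters! At each position, Chevrotain tries the token types in array order and uses the first one that matches, so more specific patterns must come before more general ones:
- `MATRICULATION` must come before `NUMBER`: otherwise the digits of `1609922:` would be consumed as a `NUMBER` and the remaining colon would cause a lexing error
- `MAXDEF` is listed before `HEADER` as the more specific pattern, although the two cannot clash here: `HEADER` requires a colon directly after the leading letters, and `MaxPoints = 60` contains none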

In [14]:
const allTokens = [
  Whitespace,
  Dash,
  MaxDef,
  Header,
  Matriculation,
  Name,
  Number,
  Linebreak
];

const lexer = new Lexer(allTokens, { positionTracking: "full" });
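The effect of token order can be illustrated with a simplified model. The function `firstMatchScan` below is not Chevrotain's implementation; it is a minimal stand-in that assumes a first-match strategy: at each position the rules are tried in array order and the first one that matches wins. The type `ScanRule` and all names are hypothetical.

```typescript
// Simplified stand-in for a lexer: try the rules in array order at each
// position and take the FIRST one that matches.
type ScanRule = { label: string; pattern: RegExp };

function firstMatchScan(rules: ScanRule[], input: string): string[] {
  const labels: string[] = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const rule of rules) {
      // The sticky flag "y" forces the match to start exactly at lastIndex.
      const sticky = new RegExp(rule.pattern.source, "y");
      sticky.lastIndex = pos;
      const m = sticky.exec(input);
      if (m !== null && m[0].length > 0) {
        labels.push(rule.label);
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      labels.push(`ERROR(${input[pos]})`);
      pos += 1; // skip the offending character and continue
    }
  }
  return labels;
}

const matriculationRule: ScanRule = { label: "MATRICULATION", pattern: /[0-9]{7}:/ };
const numberRule: ScanRule = { label: "NUMBER", pattern: /0|[1-9][0-9]*/ };

const goodOrder = firstMatchScan([matriculationRule, numberRule], "1609922:");
const badOrder = firstMatchScan([numberRule, matriculationRule], "1609922:");

console.log(goodOrder); // [ 'MATRICULATION' ]
console.log(badOrder);  // [ 'NUMBER', 'ERROR(:)' ]
```

With `NUMBER` first, the seven digits are consumed as a number and the colon is left over with no rule to match it, which is why `Matriculation` precedes `Number` in `allTokens`.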

## Processing the Exam Data

In Chevrotain, token recognition (lexing) and data processing are separate concerns. After tokenization, we process the token stream step-by-step.

We'll build our processor from small, focused functions that each handle one responsibility.

### Step 1: Extracting Maximum Points

When we encounter a `MAXDEF` token (e.g., `"MaxPoints = 60"`), we need to extract the number:

In [15]:
function extractMaxPoints(tokenImage: string): number {
  const match = tokenImage.match(/[1-9][0-9]*/);
  return match ? parseInt(match[0], 10) : 0;
}

This function uses a regex to find the numeric value and returns it as an integer.
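One possible variation, shown here only as a sketch, ties the number to the `MaxPoints` prefix with a capture group and returns `null` when the text is not a valid definition. The function name `parseMaxPointsStrict` is hypothetical and not used elsewhere in this notebook.

```typescript
// Hypothetical alternative to extractMaxPoints: returns null instead of 0
// when the text is not a MaxPoints definition.
function parseMaxPointsStrict(tokenImage: string): number | null {
  const match = /^MaxPoints\s*=\s*([1-9][0-9]*)$/.exec(tokenImage);
  return match ? parseInt(match[1], 10) : null;
}

console.log(parseMaxPointsStrict("MaxPoints = 60"));   // 60
console.log(parseMaxPointsStrict("MaxPoints=100"));    // 100
console.log(parseMaxPointsStrict("Group: TINF22AI1")); // null
```

Returning `null` makes a missing definition explicit, whereas a fallback of `0` would later lead to a division by zero in `mark()`. Since the lexer only passes `MAXDEF` images to the extractor, the simpler version above is sufficient for this notebook.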

### Step 2: Starting a New Student Record

When we see a `NAME` or `MATRICULATION` token, we begin tracking a new student:

In [16]:
function startNewStudent(tokenImage: string): string {
  return tokenImage.slice(0, -1);
}

We simply remove the trailing colon (`:`) from the token to get the clean name or ID.

### Step 3: Outputting a Student's Grade

When we reach a `LINEBREAK`, we calculate and display the student's grade:

In [17]:
function outputGrade(name: string, totalPoints: number, maxPoints: number): void {
  const grade = mark(maxPoints, totalPoints);
  console.log(`${name} has ${totalPoints} points and achieved the mark ${grade}.`);
}

This function uses our previously defined `mark()` function to calculate the grade and formats the output message.

### Step 4: Processing State

To track our progress through the input, we maintain a state object:

In [18]:
interface ProcessingState {
  maxPoints: number;
  currentName: string;
  sumPoints: number;
}

function createInitialState(): ProcessingState {
  return {
    maxPoints: 0,
    currentName: '',
    sumPoints: 0
  };
}

The state keeps track of:
- **`maxPoints`**: The maximum achievable points (from `MAXDEF`)
- **`currentName`**: The student currently being processed
- **`sumPoints`**: Running total of points for the current student

### Step 5: Enhanced Error Checking

The error reporting should display the **exact faulty character(s)** and their **position** in the input:

In [19]:
function printLexerErrors(input: string, errors: ILexingError[]): void {
  for (const err of errors) {
    const faultyText = input.slice(err.offset, err.offset + err.length);
    console.error(
      `Lexing Error: "${faultyText}" at line ${err.line}, column ${err.column} (length: ${err.length}). Details: ${err.message}`
    );
  }
}

This provides users with precise feedback about **which characters caused the error** and **where they are located**, significantly improving debugging.
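How the faulty text is cut out of the input can be shown with a hand-built error object. The interface `LexErrorLike` is a local stand-in that only mirrors the fields read above; the error values and the message are made up for illustration, and a real error object would come from `lexer.tokenize()`. Columns are assumed to be 1-based.

```typescript
// Local stand-in with the same fields the notebook reads from ILexingError.
interface LexErrorLike {
  offset: number;
  length: number;
  line: number;
  column: number;
  message: string;
}

const faultyInput = "Jim Smith: 9 ? 3\n";
const questionMarkOffset = faultyInput.indexOf("?");

// Hand-built error for illustration only.
const sampleError: LexErrorLike = {
  offset: questionMarkOffset,
  length: 1,
  line: 1,
  column: questionMarkOffset + 1, // assumed 1-based column on the first line
  message: "unexpected character (illustrative message)",
};

// offset and length together delimit the offending substring.
const faultyText = faultyInput.slice(
  sampleError.offset,
  sampleError.offset + sampleError.length
);
console.log(`"${faultyText}" at line ${sampleError.line}, column ${sampleError.column}`);
// "?" at line 1, column 14
```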

### Step 6: The Main Processing Loop

Now we can assemble our processing function from these building blocks:

1. **Tokenize** the input
2. **Check for errors** and exit if any are found
3. **Initialize state** to track processing progress
4. **Iterate through tokens**, calling the appropriate helper function for each type

In [20]:
function processExamData(input: string): void {
  const lexingResult: ILexingResult = lexer.tokenize(input);

  if (lexingResult.errors.length > 0) {
    printLexerErrors(input, lexingResult.errors);
    return;
  }

  const state: ProcessingState = createInitialState();

  for (const token of lexingResult.tokens) {
    const tokenType: string = token.tokenType.name;
    const tokenImage: string = token.image;
    
    switch (tokenType) {
      case 'MAXDEF': {
        const maxPoints: number = extractMaxPoints(tokenImage);
        state.maxPoints = maxPoints;
        break;
      }

      case 'NAME':
      case 'MATRICULATION': {
        const studentName: string = startNewStudent(tokenImage);
        state.currentName = studentName;
        state.sumPoints = 0;
        break;
      }

      case 'NUMBER': {
        const points: number = parseInt(tokenImage, 10);
        state.sumPoints += points;
        break;
      }

      case 'LINEBREAK': {
        if (state.currentName !== '') {
          outputGrade(state.currentName, state.sumPoints, state.maxPoints);
          state.currentName = '';
        }
        break;
      }

      case 'HEADER':
        // Headers are recognized but don't affect processing
        break;

      default:
        // Unknown token types are reported with a warning and otherwise ignored
        console.warn(`Unexpected token type: ${tokenType}`);
        break;
    }
  }
}

Now let's run our scanner on the exam data and see the results:

In [23]:
processExamData(data);

Jim Smith has 43 points and achieved the mark 2.7.
John Slow has 10 points and achieved the mark 5.
Susi Sorglos has 57 points and achieved the mark 1.3.
1609922 has 36 points and achieved the mark 3.4.


### How It Works: Example Trace

Let's trace through what happens for one student:

| Token | Helper Function Called | State Update |
|-------|------------------------|--------------|
| `NAME: "Jim Smith:"` | `startNewStudent()` | `currentName = "Jim Smith"`, `sumPoints = 0` |
| `NUMBER: "9"` | *(none)* | `sumPoints = 9` |
| `NUMBER: "12"` | *(none)* | `sumPoints = 21` |
| `NUMBER: "10"` | *(none)* | `sumPoints = 31` |
| `NUMBER: "6"` | *(none)* | `sumPoints = 37` |
| `NUMBER: "6"` | *(none)* | `sumPoints = 43` |
| `NUMBER: "0"` | *(none)* | `sumPoints = 43` |
| `LINEBREAK` | `outputGrade()` | **Output**: `"Jim Smith has 43 points and achieved the mark 2.7."` |