In [None]:
from IPython.core.display import HTML
with open ("../style.css", "r") as file:
    css = file.read()
HTML(css)

# Evaluating an Exam Using Ply

This notebook shows how we can use the package [`ply`](https://ply.readthedocs.io/en/latest/ply.html)
to implement a scanner.  Our goal is to implement a program that can be used to evaluate the results of an exam.  

The function `load_data(file)` takes one argument:
* `file` is the name of a file.

It returns the content of the specified file as a string.

In [None]:
def load_data(file):
    with open(file, 'r') as handle:
        return handle.read()

The result of a specific exam is stored in the file `exam.txt`:

In [None]:
data = load_data('exam.txt')
print(data)

This data shows that there has been an exam on the subject *Advanced Witchcraft*
in the group *TINF09AI*.  Furthermore, the equation
```
MaxPoints = 75
```
shows that in order to achieve the best mark, **75** points would have been necessary.
    
There have been 5 different exercises in this exam. 
Our goal is to write a program that is able to compute the marks for all students.

## Imports

We will use the package [ply](https://ply.readthedocs.io/en/latest/ply.html).
In this example, we will only use the scanner generator that is provided by the module `ply.lex`. Later on, we will also use Python regular expressions.  Therefore, we also have to import the module `re`. 

In [None]:
import ply.lex as lex
import re

## Auxiliary Functions

The function `mark(max_points, points)` takes two arguments:
- `max_points` is the number of points that need to be achieved in order to get the best mark of $1.0$.
- `points`     is the number of points achieved by the student whose mark is to be computed.
  
It is assumed that the relation between the mark of an exam and the number of points achieved in this exam is mostly linear, that a student who has achieved $50\%$ of `max_points` will get the mark $4.0$, and that a student who has achieved $100\%$ of `max_points` will get the mark $1.0$.
However, the worst mark is $5.0$.  Therefore, if the linear formula would yield a mark worse than $5.0$, the `min` function below caps the result at $5.0$.
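
The constants $7$ and $6$ that appear in the implementation follow from the two anchor points stated above. As a short derivation, write $x$ for the fraction `points / max_points` and assume a linear relation with intercept $a$ and slope $b$ (the symbols $x$, $a$, and $b$ are introduced here only for illustration):

$$
\text{mark}(x) = a + b \cdot x, \qquad \text{mark}(0.5) = 4, \qquad \text{mark}(1) = 1
\quad\Longrightarrow\quad
b = \frac{1 - 4}{1 - 0.5} = -6, \qquad a = 1 - b = 7.
$$

Together with the cap at $5.0$ this gives $\text{mark} = \min\bigl(5,\; 7 - 6 \cdot \texttt{points} / \texttt{max\_points}\bigr)$.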

In [None]:
def mark(max_points, points):
    return round(min(5.0, 7 - 6 * points / max_points), 1)

Let's test this function.

In [None]:
for points in range(0, 80+1, 5):
    print(f'mark(75, {points}) = {mark(75, points)}')

The function `percentage(max_points, points)` takes two arguments:
- `points`     is the number of points achieved by the student whose mark is to be computed.
- `max_points` is the number of points that need to be achieved in order to get the best mark of $1.0$.
  
It returns the percentage of points achieved.  Note that this percentage is never higher than `100%`, even if `points` is greater than `max_points`.

In [None]:
def percentage(max_points, points):
    return min(100, round(100*points/max_points))

In [None]:
for points in range(0, 80+1, 5):
    print(f'percentage(75, {points}) = {percentage(75, points)}%')

## Token Declarations

We begin by <em style="color:blue">declaring</em> the list of tokens.  Note that `tokens` is a variable name that is reserved by `ply`: it has to contain the names of all token classes.  In this case, we have declared seven different tokens.
The <em style="color:blue">definitions</em> of these tokens are given later.
- `HEADER` will match the first two lines of the string `data` as well as the fifth line that begins with 
  the string `Exercise:`.  
- `MAXDEF` is a token that will match the line `MaxPoints = 75`.
- `NAME` is a token that will match the name of a student.
- `MATRICULATION` is a token that will match a matriculation number, i.e. a string of seven digits.
- `NUMBER` is a token that will match a natural number.
- `IGNORE` is a token that will match an empty line.  For example, the fourth line in `data` is empty.
- `LINEBREAK` is a token that will match the newline character `\n` at the end of a line.

In [None]:
tokens = [ 'HEADER',
           'MAXDEF',
           'NAME',
           'MATRICULATION',
           'NUMBER',
           'IGNORE',
           'LINEBREAK'
         ]

In [None]:
print(data)

## Token Definitions

Next, we need to provide the definitions of the tokens.  One way to define tokens is via Python functions.  
In this notebook we are only going to use these <em style="color:blue">functional token definitions</em>.
The <em style="color:blue">document string</em> of these functions is a <em style="color:blue">raw string</em> that contains the regular expression defining the semantics of the token.  The regular expression can be followed by code that is needed to further process the token.  The name of the function defining a token has to have the form `t_`**name**, where **name** is the name of the token as declared in the list `tokens`.
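
The mechanism can be illustrated without `ply` itself. The following is a minimal sketch, assuming a hypothetical token function `t_WORD` that is not part of this notebook: the raw string written as the first statement of the function becomes its document string, which can then be read back and compiled like any other regular expression.

```python
import re

def t_WORD(t):
    r'[A-Za-z]+'   # the document string doubles as the regular expression
    return t

# The document string is an ordinary attribute of the function object.
pattern = t_WORD.__doc__

# Compile it in verbose mode and apply it to a sample string.
word_re = re.compile(pattern, re.VERBOSE)
matched = word_re.match('Exam results').group()
print(pattern, matched)
```

The sketch only shows where the regular expression is stored; the actual bookkeeping inside `ply.lex` is more involved.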

### The Token `HEADER`

The token `HEADER` matches any string that is made up of upper and lower case characters followed by a colon.  This colon may be followed by arbitrary characters.
The token extends to the end of the line and includes the terminating newline.

When the function `t_HEADER` is called it is provided with a token `t`.  This is an object that has
the following attributes:
- `t.lexer` is an object of class `Lexer` that contains the scanner that was used to extract the token `t`.
  We are free to attach additional attributes to this `Lexer` object.
- `t.type` is  a string containing the type of the token.  For tokens processed in the function
  `t_HEADER` this type is always the string `HEADER`.
- `t.value` is the actual string matched by the token.
- `t.lexpos` is the position of the token in the input string that is scanned.

Furthermore, the lexer object has one important attribute:
- `t.lexer.lineno` is the line number.  However, it is our responsibility to update this variable
  by incrementing `t.lexer.lineno` every time we read a newline.

In the case of the token `HEADER` we need to increment the attribute `t.lexer.lineno`, as the regular expression contains a newline.

In [None]:
def t_HEADER(t):
    r'[A-Za-z]+:.*\n'
    t.lexer.lineno += 1

### The Token `MAXDEF`

The token `MAXDEF` matches a substring of the form `MaxPoints = 75`.  Note that the regular expression defining the semantics of this token uses the expression `\s*` to match the white space before and after the character `=`.  This is necessary because `ply.lex` uses <em style="color:blue">verbose regular expressions</em> that can contain whitespace for formatting.  Hence a blank character "` `" inside a regular expression is silently discarded.
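
The effect of verbose mode can be checked with the module `re` alone. The sketch below compiles two variants of the pattern with the flag `re.VERBOSE`; the variable names are chosen for this illustration only.

```python
import re

text = 'MaxPoints = 75'

# In verbose mode the blanks in the pattern are dropped,
# so this pattern is equivalent to 'MaxPoints=[1-9][0-9]*'.
literal_blanks = re.compile(r'MaxPoints = [1-9][0-9]*', re.VERBOSE)

# \s* is an explicit whitespace pattern and survives verbose mode.
explicit_space = re.compile(r'MaxPoints\s*=\s*[1-9][0-9]*', re.VERBOSE)

no_match   = literal_blanks.match(text)            # blanks in text are not matched
compact    = literal_blanks.match('MaxPoints=75')  # matches only without blanks
with_space = explicit_space.match(text)            # matches the line from the exam file

print(no_match, compact.group(), with_space.group())
```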

 
After defining the regular expression, the function `t_MAXDEF` has some <em style="color:blue">action code</em> that is used to extract the maximal number of points from the token value and store this number in the variable `t.lexer.max_points`.
`t.value` is the string that is matched by the regular expression.
We extract the maximum number of points using conventional Python regular expressions.  Furthermore, we initialize the student name, 
which is stored in `t.lexer.name`,  to the empty string.

In [None]:
def t_MAXDEF(t):
    r'MaxPoints\s*=\s*[1-9][0-9]*'
    t.lexer.max_points = int(re.findall(r'[0-9]+', t.value)[0])
    t.lexer.name       = ''

### The Token `NAME`

The token `NAME` matches the name of a student followed by a colon.  In general, a student name can be any sequence of letters that contains optional hyphens and blanks.  Note that it is not necessary to use `\s` inside of a character class: even in a verbose regular expression, a blank inside square brackets is taken literally.
Furthermore, note that the hyphen `-` is the last character in the square brackets so it cannot be mistaken for the hyphen of a range.
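
Both remarks can be verified directly with `re`. The following sketch uses a made-up student line (the name and the points are assumptions for illustration, not taken from `exam.txt`) and compiles the pattern in verbose mode.

```python
import re

# Same pattern as in the token definition, compiled in verbose mode.
name_re = re.compile(r'[a-zA-Z -]+:', re.VERBOSE)

line = 'Anna-Lena von Stein: 8 9 7'   # hypothetical input line

m = name_re.match(line)
student = m.group()[:-1]              # drop the trailing colon
print(student)
```

The blank and the hyphen are both accepted as parts of the name, so the match extends up to the colon.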

The action code has to reset the variable `sum_points`, which is stored in `lexer.sum_points`, to `0`.

In [None]:
def t_NAME(t):
    r'[a-zA-Z -]+:'
    t.lexer.name = t.value[:-1] # cut off the colon
    t.lexer.sum_points = 0      # start counting

### The Token `MATRICULATION`

The token `MATRICULATION` matches a string consisting of seven digits.  These digits are followed by a colon.

In [None]:
def t_MATRICULATION(t):
    r'[0-9]{7}:'
    t.lexer.name = t.value[:-1] # cut off the colon
    t.lexer.sum_points = 0      # start counting

### The Token `NUMBER`

The token `NUMBER` matches a natural number.  We have to convert the value, which is initially a *string* of digits, into an integer.  Furthermore, this value is then added to the number of points the current student has achieved in previous exercises.
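
As a rough illustration of this accumulation outside of the scanner, the sketch below applies the same regular expression to a hypothetical line of points (the numbers are made up) and sums the results. It also shows that the pattern does not accept leading zeros as part of a longer number.

```python
import re

number_re = re.compile(r'0|[1-9][0-9]*')

points_line = '8 9 8 0 7'                  # hypothetical points for five exercises
numbers = number_re.findall(points_line)   # the matched digit strings
total = sum(int(n) for n in numbers)       # convert to int before adding

# A leading zero is matched as the number 0 on its own.
leading_zero = number_re.match('007').group()

print(numbers, total, leading_zero)
```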

In [None]:
def t_NUMBER(t):
    r'0|[1-9][0-9]*'
    t.lexer.sum_points += int(t.value)

### The Token `IGNORE`

The token `IGNORE` matches a line that contains only whitespace.  In order to keep track of line numbers we have to increment `lexer.lineno`.  However, we do not return a token at the end of the function.  Hence, if the input contains an empty line, this line is silently discarded.

In [None]:
def t_IGNORE(t):
    r'^[ \t]*\n'
    t.lexer.lineno += 1

### The Token `LINEBREAK`

The token `LINEBREAK` matches a single newline character `\n`.  If a student name is
currently defined, then we output the result for this student.  Note that we set `lexer.name` back to the empty string once we have processed the student.
This allows for empty lines between different students.

Furthermore, we collect the data in four lists:
* `gNames` is the list of names or matriculation numbers,
* `gPoints` is the list of the total number of points achieved by each student,
* `gGrades` is the list of grades,
* `gPercentages` is the list of percentages.

In [None]:
gNames       = []
gPoints      = []
gGrades      = []
gPercentages = []

In [None]:
def t_LINEBREAK(t):
    r'\n'
    global gNames, gPoints, gGrades, gPercentages
    t.lexer.lineno += 1
    if t.lexer.name != '':
        name    = t.lexer.name
        maxpts  = t.lexer.max_points
        points  = t.lexer.sum_points
        grade   = mark(maxpts, points)
        percent = percentage(maxpts, points)
        print(f'{name} has {points} points and achieved the mark {grade}: {percent} % \n')
        gNames       += [name]
        gPoints      += [points]
        gGrades      += [grade]
        gPercentages += [percent]
        t.lexer.name  = ''

---

We have now defined all of the tokens.  
Note that the scanner tries the regular expressions in the same order that we 
have used to define these tokens. 
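
The order matters, for example, for the tokens `MATRICULATION` and `NUMBER`, since both start with digits. The sketch below imitates the scanner with a single alternation of named groups, which is an assumption about how the patterns are combined and only meant as an approximation. The input line is made up.

```python
import re

# MATRICULATION before NUMBER, as in this notebook.
declared_order = re.compile(
    r'(?P<MATRICULATION>[0-9]{7}:)|(?P<NUMBER>0|[1-9][0-9]*)')

# The same two patterns in the opposite order.
swapped_order = re.compile(
    r'(?P<NUMBER>0|[1-9][0-9]*)|(?P<MATRICULATION>[0-9]{7}:)')

text = '1234567: 8 9'   # hypothetical line starting with a matriculation number

first  = declared_order.match(text)
second = swapped_order.match(text)

print(first.lastgroup, first.group())     # which alternative matched, and what
print(second.lastgroup, second.group())
```

With the swapped order the seven digits are consumed as a `NUMBER` and would be added to the points, which is why `t_MATRICULATION` is defined before `t_NUMBER`.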

---

### Ignoring Characters

The string `t_ignore` specifies those characters that should be ignored.  Note that this string is **not** interpreted as a regular expression.  It is just a string of *single characters*.  These characters are allowed to occur as part of other tokens, but when they occur on their own and would otherwise generate a scanning error, they are silently discarded instead of triggering an error. 

In this example we ignore hyphens, blanks, and tabs.

In [None]:
t_ignore  = '- \t'

### Error Handling

The function `t_error` is called when the beginning of the input that
has not yet been processed cannot be matched by any of the regular expressions of the tokens defined above.  In our implementation we print the first character that could not be matched, discard this character, and continue.

In [None]:
def t_error(t):
    print(f"Illegal character '{t.value[0]}' at line {t.lexer.lineno}.")
    t.lexer.skip(1)

### Tricking Ply

The line below is necessary to trick `ply.lex` into assuming that this program is part of an ordinary Python file instead of a *Jupyter notebook*.

In [None]:
__file__ = 'main'

### Generating the Scanner and Running It

The next line generates the scanner.

In [None]:
lexer = lex.lex()

Next, we feed an input string into the generated scanner.

In [None]:
lexer.input(data)

In order to scan the data that we provided in the previous cell, the function `scan` iterates
over all tokens generated by our scanner.

In [None]:
def scan(lexer):
    for t in lexer:
        pass

Finally, we can run the scanner.

In [None]:
scan(lexer)

In [None]:
import pandas as pd

In [None]:
ExamData = { 'Name': gNames, 'Points': gPoints, 'Grade': gGrades, 'Percentage': gPercentages }
ExamDataFrame = pd.DataFrame(ExamData, index=list(range(1, len(gNames)+1)))
ExamDataFrame