# The Reader
This section walks through building a reader for the **E**xtensible **D**ata 
**N**otation (EDN). Most of the descriptions on this page are taken directly from the 
[edn-format][1] spec. While the spec itself is fairly informal, it appears to
provide enough detail to implement a satisfactory reader.

> **edn** is an extensible data notation. A superset of edn is used by Clojure to 
> represent programs, and it is used by Datomic and other applications as a data 
> transfer format. This spec describes edn in isolation from those and other specific
> use cases, to help facilitate implementation of readers and writers in other 
> languages, and for other uses.
> 
> edn supports a rich set of built-in elements, and the definition of extension 
> elements in terms of the others. Users of data formats without such facilities
> must rely on either convention or context to convey elements not included in the 
> base set. This greatly complicates application logic, betraying the apparent 
> simplicity of the format. edn is simple, yet powerful enough to meet the demands of
> applications without convention or complex context-sensitive logic.
> 
> edn is a system for the conveyance of values. It is not a type system, and has no 
> schemas. Nor is it a system for representing objects - there are no reference types, 
> nor should a consumer have an expectation that two equivalent elements in some body of
> edn will yield distinct object identities when read, unless a reader implementation
> goes out of its way to make such a promise. Thus the resulting values should be 
> considered immutable, and a reader implementation should yield values that ensure this,
> to the extent possible.
> 
> edn is a set of definitions for acceptable elements. A use of edn might be a stream or
> file containing elements, but it could be as small as the conveyance of a single element
> in e.g. an HTTP query param.
> 
> There is no enclosing element at the top level. Thus edn is suitable for streaming and
> interactive applications.
> 
> The base set of elements in edn is meant to cover the basic set of data structures 
> common to most programming languages. While edn specifies how those elements are formatted
> in text, it does not dictate the representation that results on the consumer side. A well 
> behaved reader library should endeavor to map the elements to programming language types 
> with similar semantics.

[1]: https://github.com/edn-format/edn

<div class="alert alert-block alert-warning">
While we could use an existing reader implementation, we intend to support not only valid EDN forms but also
many common errors, so that the reader can give the user helpful error messages.
I was unable to find an existing EDN 
</div>

In [None]:
#export
import nbloader
nbloader.install_loader()

from nbdev import patch
from functools import reduce

import re

utils = __import__('Appendix - B - Utilities')

## EDN Spec
This section builds up the reader. The design is simple: iterate through all
characters, ignore whitespace, and dispatch to a reader function.
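
The reader leans on `utils.PushBackCharStream`, which lives in Appendix B and is not shown on this page. As a rough sketch of the interface the code here appears to rely on (one character per `next` call, `None` at end of input, and `push_back` to un-read a character), a minimal stand-in might look like the following. The class name `TinyPushBackStream` and its internals are assumptions for illustration, not the Appendix B implementation, and the line/column tracking methods are left out.

```python
class TinyPushBackStream:
    """Hypothetical stand-in for utils.PushBackCharStream (illustration only)."""

    def __init__(self, text):
        self._chars = list(text)
        self._pos = 0
        self._pushed = []          # un-read characters, most recent last

    def __iter__(self):
        return self

    def __next__(self):
        if self._pushed:
            return self._pushed.pop()
        if self._pos < len(self._chars):
            ch = self._chars[self._pos]
            self._pos += 1
            return ch
        return None                # assumption: end of input is signalled with None

    def push_back(self, ch):
        self._pushed.append(ch)


stream = TinyPushBackStream("ab")
first = next(stream)               # 'a'
peeked = next(stream)              # 'b'
stream.push_back(peeked)           # un-read 'b' so the next call sees it again
rest = [next(stream), next(stream)]
print(first, rest)
```

Because this sketch never raises `StopIteration`, any loop over it has to stop on `None` itself, which is the pattern the reader functions on this page follow.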

In [None]:
# export 

# reader states to run based on character
macros = {}
dispatch_macros = {}

READ_EOF = 'READ_EOF'
READ_FINISHED = 'READ_FINISHED'

def read(stream_or_str, sentinel=None):
    
    if isinstance(stream_or_str, str):
        stream = utils.PushBackCharStream(stream_or_str)
    else:
        stream = stream_or_str
    
    for ch in stream:
        # check for EOF first: is_whitespace(None) would raise a TypeError
        if ch is None: return READ_EOF
        if is_whitespace(ch): continue
        if ch == sentinel: return READ_FINISHED
        
        # Possibly need 1 lookahead
        lookahead = next(stream)
        stream.push_back(lookahead)

        if is_number_literal(ch, lookahead): 
            return read_number(stream, ch)
        elif ch in macros: 
            form = macros[ch](stream, ch)
            
            # if the actual stream is the form then the form was a comment and we should
            # continue on
            if form is not stream:
                return form
        else: 
            return read_symbol(stream, ch)

### General considerations
__edn__ elements, streams and files should be encoded using UTF-8.

Elements are generally separated by whitespace. Whitespace, other than within strings, is not otherwise significant, 
nor need redundant whitespace be preserved during transmissions. Commas `,` are also considered whitespace, other than within strings.

The delimiters `{ } ( ) [ ]` need not be separated from adjacent elements by whitespace.
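
To make the whitespace and delimiter rules concrete, here is a small sketch that splits flat EDN text into tokens. The function name `split_elements` is hypothetical, and the sketch deliberately ignores strings, characters and comments. It also treats `\r` as whitespace, which is an assumption rather than something the notebook's `is_whitespace` does.

```python
def split_elements(text):
    """Split flat EDN text into tokens, treating commas as whitespace.

    Illustration only: strings, characters and comments are not handled.
    """
    delimiters = "{}()[]"
    whitespace = " \t\r\n,"
    tokens, current = [], ""
    for ch in text:
        if ch in whitespace or ch in delimiters:
            # either kind of character ends the token being built
            if current:
                tokens.append(current)
                current = ""
            if ch in delimiters:
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(split_elements("[1,2 , 3]"))   # ['[', '1', '2', '3', ']']
print(split_elements("(a[b]{c})"))   # ['(', 'a', '[', 'b', ']', '{', 'c', '}', ')']
```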



In [None]:
# export

def is_whitespace(ch):  return ch in ' \t\n,'
def is_ending(ch):      return ch in '";@^`()[]{}\\'

### symbols
Symbols are used to represent identifiers, and should map to something other than strings, if possible.

In [None]:
# export

class Symbol(str):
    def __new__(cls, val, *args, **kwargs):
        return str.__new__(cls, val)

    def __init__(self, val, namespace=None):
        self.namespace = namespace
        self.meta = {}
        
    def __eq__(self, val):
        if isinstance(val, type(self)):
            return super().__eq__(val) and self.namespace == val.namespace
        elif isinstance(val, str):
            if self.namespace is not None:
                return f"{self.namespace}/{self}" == val
            else:
                return super().__eq__(val)
        return False
    
    @property
    def name(self):
        return self
    
    def __repr__(self):
        if self.namespace is None:
            return self
        else:
            return f"{self.namespace}/{self}"
        
    def __hash__(self):
        return hash(str(self))

Symbols begin with a non-numeric character and can contain alphanumeric characters and `. * + ! - _ ? $ % & = < >`. If `-`, `+` or `.` are the first character, the second character (if any) must be non-numeric. Additionally, `: #` are allowed as constituent characters in symbols other than as the first character.

`/` has special meaning in symbols. It can be used once only in the middle of a symbol to separate the prefix (often a namespace) from the name, e.g. `my-namespace/foo`. `/` by itself is a legal symbol, but otherwise neither the prefix nor the name part can be empty when the symbol contains `/`.

If a symbol has a prefix and `/`, the following name component should follow the first-character restrictions for symbols as a whole. This is to avoid ambiguity in reading contexts where prefixes might be presumed as implicitly included namespaces and elided thereafter.
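
The symbol rules above can be sketched as a standalone validator. This is a stricter check than the notebook's `parse_symbol` performs, and it rests on two assumptions: "alphanumeric" is taken to mean ASCII letters and digits, and the prefix is held to the same first-character rules as the name. The names `is_valid_part` and `is_valid_symbol` are hypothetical.

```python
import re

# First character: a letter or one of the permitted punctuation characters.
_FIRST = r"[A-Za-z.*+!\-_?$%&=<>]"
# Later characters may also be digits, ':' or '#'.
_REST = r"[A-Za-z0-9.*+!\-_?$%&=<>:#]"
_PART = re.compile(rf"{_FIRST}{_REST}*")


def is_valid_part(part):
    """Check one side of a symbol (the prefix or the name)."""
    if not _PART.fullmatch(part):
        return False
    # A leading '-', '+' or '.' must not be followed by a digit.
    if part[0] in "-+." and len(part) > 1 and part[1].isdigit():
        return False
    return True


def is_valid_symbol(token):
    if token == "/":                 # '/' by itself is a legal symbol
        return True
    if token.count("/") > 1:         # at most one '/' separator
        return False
    # an empty prefix or name fails is_valid_part
    return all(is_valid_part(p) for p in token.split("/"))


for candidate in ["my-namespace/foo", "/", "-a", "-1", "ns/", "a/b/c", ":a", "a:b"]:
    print(candidate, is_valid_symbol(candidate))
```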

### Builtin Elements
#### nil  
nil represents nil, null or nothing. It should be read as an object with similar meaning on the target platform.

#### booleans
`true` and `false` should be mapped to booleans.

If a platform has canonic values for `true` and `false`, it is a further semantic of booleans that all instances of `true` yield that (identical) value, and similarly for `false`.

In [None]:
# export

def is_special(ch):     return ch in '-+.'
def is_non_numeric(ch): return ch.isalpha() or ch in '.*+!-_?$%&=<>@:#'
def is_start(ch):       return (is_non_numeric(ch) and not ch in ':#')

def is_numeric(ch, base=10):
    try:
        int(ch, base)
        return True
    except (TypeError, ValueError):   # None (EOF) or a non-digit character
        return False

def is_number_literal(ch, lookahead):
    if is_numeric(ch):
        return True
    elif ch == '-' or ch == '+':
        return is_numeric(lookahead)
    return False
    
def read_symbol(stream, initch):
    starting_info = stream.starting_line_col_info()    
    token = read_token(stream, initch)
    
    # Special Symbols
    if token == 'nil': return None
    elif token == 'true': return True
    elif token == 'false': return False
    elif token == '/': return Symbol('/')
    
    ns, name = parse_symbol(token)
    symbol = Symbol(name, ns)
    
    # attach line/col info
    ending_info = stream.ending_line_col_info()
    symbol.meta['start_row'] = starting_info[0]
    symbol.meta['start_col'] = starting_info[1]
    symbol.meta['ending_row'] = ending_info[0]
    symbol.meta['ending_col'] = ending_info[1]
    
    return symbol
        
invalid_token = re.compile(r"(^::|.*:$)")
invalid_namespace = re.compile(r".*:$")
def parse_symbol(token):
    "Parses a string into a tuple of the namespace and symbol"
    if not token or invalid_token.match(token):
        raise Exception("Invalid symbol: '{}'".format(token))
    
    # If no namespace just return None as ns and token as symbol
    if token == '/' or '/' not in token:
        return None, token
    
    ns, sym = token.split('/', 1)
    
    if (sym and 
        not is_numeric(sym[0]) and
        not invalid_namespace.match(ns) and
        (sym == '/' or 
         '/' not in sym)):
        return ns, sym
    raise Exception("Invalid symbol: '{}'".format(token))
    

def read_token(stream, initch):
    token = ''
    
    ch = initch
    while True:
        if ch is None or is_whitespace(ch):
            break
        elif is_ending(ch):
            stream.push_back(ch)
            break
        else:
            token += ch
            ch = next(stream)

    return token

In [None]:
# Symbol tests
# tests

# helper
def read_exception(s, expected_msg):
    "Returns True when reading `s` raises an exception whose message is `expected_msg`"
    try:
        read(s)
    except Exception as e:
        return str(e) == expected_msg
    return False
    
# basic symbols
assert read("abc")        == Symbol('abc'),                     'simple symbol'
assert read('ns/my-name') == Symbol('my-name', namespace='ns'), 'symbol with namespace'

# special cases
assert read('true')       == True,                              'true'
assert read('false')      == False,                             'false'
assert read('nil')        == None,                              'nil'

# odd ball cases
assert read('/')          == Symbol('/'),                       'single slash symbol'
assert read('ns//')       == Symbol('/', namespace='ns'),       'slash symbol with namespace'

# exceptional cases
assert read_exception('invalid:',        "Invalid symbol: 'invalid:'"),        'symbol cannot have trailing colon'
assert read_exception('::invalid',       "Invalid symbol: '::invalid'"),       'symbol cannot start with ::'
assert read_exception('ns/double/slash', "Invalid symbol: 'ns/double/slash'"), 'symbol can only have a single slash'

#### strings
Strings are enclosed in `"double quotes"`. May span multiple lines. Standard C/Java escape characters `\t`, `\r`, `\n`, `\\` and `\"` are supported.
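
As a standalone illustration of the escape rules just described, the sketch below decodes a complete string literal held in a Python string, rather than reading from a stream as the notebook's reader does. It handles only the five escapes the spec lists. The name `read_string_literal` and the use of `ValueError` are assumptions made for this example.

```python
ESCAPES = {"t": "\t", "r": "\r", "n": "\n", "\\": "\\", '"': '"'}


def read_string_literal(text):
    """Decode an EDN string literal, including its surrounding double quotes."""
    if not text.startswith('"'):
        raise ValueError("string must start with a double quote")
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == '"':
            if i != len(text) - 1:
                raise ValueError("unexpected text after closing quote")
            return "".join(out)
        if ch == "\\":
            i += 1                   # look at the character after the backslash
            if i >= len(text) or text[i] not in ESCAPES:
                raise ValueError("invalid escape")
            out.append(ESCAPES[text[i]])
        else:
            out.append(ch)           # newlines are ordinary characters here
        i += 1
    raise ValueError("EOF in middle of string")


print(repr(read_string_literal(r'"a\tb"')))        # 'a\tb'
print(repr(read_string_literal(r'"say \"hi\""')))  # 'say "hi"'
```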

In [None]:
#export

def read_string(stream, initch):
    s = ''
    for ch in stream:
        if ch == '"':
            break
        elif ch == '\\':
            s += escape_char(stream)
        elif ch is None:
            raise Exception("EOF in middle of string")
        else:
            s += ch
    return s

macros['"'] = read_string

escape_chars = {'t':'\t', 'r':'\r', 'n':'\n', '\\':'\\', '\"':'\"', 'b':'\b', 'f':'\f'}
def escape_char(stream):
    ch = next(stream)
    
    # Normal character
    if ch in escape_chars:
        ch = escape_chars[ch]
    elif ch == 'u':                                    # Hex unicode escape
        ch = read_unicode_char(stream, base=16, length=4)
    elif is_numeric(ch):                      # Octal Unicode escape
        stream.push_back(ch)
        ch = read_unicode_char(stream, base=8, length=3)
    else:
        raise Exception("Invalid escape '\\{}'".format(ch))
    return ch

def read_unicode_char(stream, base, length):
    unicode_bytes = b'\\'
    
    if base == 16:
        unicode_bytes += b'u'

    for _ in range(length):
        ch = next(stream)
        if not is_numeric(ch, base=base):
            raise Exception("Invalid unicode escape '{}'".format(ch))
        unicode_bytes += bytes(ch, 'utf-8')

    try:
        return unicode_bytes.decode('unicode-escape')
    except:
        raise Exception("Invalid unicode escape '{}'".format(unicode_bytes))

In [None]:
# normal strings
assert read('""')    == '',    "Can have empty string"
assert read('"abc"') == 'abc', 'Simple String'

# escapes
assert read(r'"\t\r\n\\\"\b\f"') == '\t\r\n\\"\b\f', 'Simple escapes'
assert read(r'"\u0021"')         == '!',             'Hex unicode'
assert read(r'"\041"')           == '!',             'Octal unicode'

# illegal
assert read_exception('"abc',  "EOF in middle of string"),        'EOF in middle of string'
assert read_exception(r'"\ "', "Invalid escape '\\ '"),           'Whitespace cannot be escaped'
assert read_exception(r'"\u123G"', "Invalid unicode escape 'G'"), 'Hex base unicode cannot have invalid hex digits'
assert read_exception(r'"\128"',   "Invalid unicode escape '8'"), 'Octal base unicode cannot have invalid octal digits'


#### characters
Characters are preceded by a backslash: `\c`, `\newline`, `\return`, `\space` and `\tab` yield the corresponding characters. 
Unicode characters are represented with `\uNNNN` as in Java. Backslash cannot be followed by whitespace.
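
The character rules can be sketched on their own as a lookup on the text that follows the backslash. This covers only the forms the spec names (a single character, the four named characters, and `\uNNNN`). The function name `read_char_literal` and the `ValueError` messages are assumptions for illustration.

```python
import string

NAMED_CHARS = {"newline": "\n", "return": "\r", "space": " ", "tab": "\t"}


def read_char_literal(token):
    """Convert the text after the backslash (e.g. 'c', 'newline', 'u0021') to a character."""
    if token == "" or token[0].isspace():
        raise ValueError("backslash cannot be followed by whitespace")
    if len(token) == 1:
        return token
    if token in NAMED_CHARS:
        return NAMED_CHARS[token]
    # \uNNNN: exactly four hex digits after the 'u'
    if len(token) == 5 and token[0] == "u" and all(c in string.hexdigits for c in token[1:]):
        return chr(int(token[1:], 16))
    raise ValueError(f"invalid character literal: \\{token}")


print(repr(read_char_literal("c")))        # 'c'
print(repr(read_char_literal("newline")))  # '\n'
print(repr(read_char_literal("u0021")))    # '!'
```

Checking the length before the `u` prefix keeps a token such as `unknown` from being mistaken for a Unicode escape.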

In [None]:
# export 

def read_char(stream, initch):
    ch = next(stream)   
    if ch is None:
        raise Exception("EOF in character")
        
    if is_whitespace(ch):
        raise Exception("Backslash cannot be followed by whitespace")
    
    if is_ending(ch):
        token = ch
    else:
        token = read_token(stream, ch)
    
    if len(token) == 1:        ch = token
    elif token == "newline":   ch = '\n'
    elif token == 'space':     ch = ' '
    elif token == 'tab':       ch = '\t'
    elif token == 'backspace': ch = '\b'
    elif token == 'formfeed':  ch = '\f'
    elif token == 'return':    ch = '\r'
    # Match the whole token so that names such as 'unknown' or 'other' are not
    # mistaken for escapes, and decode the digits directly.
    elif re.fullmatch(r'u[0-9A-Fa-f]{4}', token):   # \uNNNN hex escape
        ch = chr(int(token[1:], 16))
    elif re.fullmatch(r'o[0-7]{1,3}', token):       # \oNNN octal escape
        ch = chr(int(token[1:], 8))
    else:
        raise Exception("Invalid character escape '{}'".format(token))
    
    return ch

macros['\\'] = read_char

In [None]:
# simple
assert read(r'\c')         == 'c', 'Simple character'

# keyword characters
assert read(r'\newline')   == '\n', 'Newline character'
assert read(r'\space')     == ' ',  'Space character'
assert read(r'\tab')       == '\t', 'Tab character'
assert read(r'\backspace') == '\b', 'Backspace character'
assert read(r'\formfeed')  == '\f', 'Formfeed character'
assert read(r'\return')    == '\r', 'Carriage Return character'

# unicode escapes
assert read(r'\u0021')     == '!',  'Hex unicode'
assert read(r'\o41')       == '!',  'Octal unicode'

# ending
assert read(r'\]')          == ']', 'Ending chars can be escaped'
assert read(r'\(')          == '(', 'Ending chars can be escaped'

# exceptions
assert read_exception(r'\unknown', "Invalid character escape 'unknown'"),         'Unrecognized character escape'
assert read_exception(r'\ ',       "Backslash cannot be followed by whitespace"), 'Missing character escape'
assert read_exception('\\',        "EOF in character"),                           'EOF in char'

#### keywords
Keywords are identifiers that typically designate themselves. They are semantically akin to enumeration values. Keywords follow the rules of symbols, except they can (and must) begin with `:`, e.g. `:fred` or `:my/fred`. If the target platform does not have a keyword type distinct from a symbol type, the same type can be used without conflict, since the mandatory leading `:` of keywords is disallowed for symbols. Per the symbol rules above, `:/` and `:/anything` are not legal keywords. A keyword cannot begin with `::`.

If the target platform supports some notion of interning, it is a further semantic of keywords that all instances of the same keyword yield the identical object.
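
Python has no built-in keyword type, so interning has to be done by hand. One possible approach, sketched below, is to cache instances by `(namespace, name)` inside `__new__`. The class `InternedKeyword` is hypothetical and is not the `Keyword` class this notebook defines.

```python
class InternedKeyword:
    """Hypothetical keyword type that returns one shared object per name."""

    _cache = {}

    def __new__(cls, name, namespace=None):
        key = (namespace, name)
        existing = cls._cache.get(key)
        if existing is None:
            # first time this keyword is seen: create and remember it
            existing = super().__new__(cls)
            existing.name = name
            existing.namespace = namespace
            cls._cache[key] = existing
        return existing

    def __repr__(self):
        prefix = f"{self.namespace}/" if self.namespace else ""
        return f":{prefix}{self.name}"


a = InternedKeyword("fred")
b = InternedKeyword("fred")
c = InternedKeyword("fred", namespace="my")
print(a is b)   # True
print(a is c)   # False
print(c)        # :my/fred
```

With identity guaranteed, keywords can be compared with `is`. The trade-off is that the cache keeps every keyword alive for the life of the process.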

In [None]:
# export

class Keyword(Symbol):
    def __repr__(self):
        # reuse Symbol's repr so the namespace, if any, is included
        return ':' + super().__repr__()

def read_keyword(stream, initch):
    starting_info = stream.starting_line_col_info()
    
    ch = next(stream)
    if ch is None or is_whitespace(ch):
        raise Exception('Single colon not allowed')
    
    token = read_token(stream, ch)
    ns, kw = parse_symbol(token)
    
    if ns is not None and ns.startswith(':'):
        raise Exception('Namespace alias not supported')
    
    keyword = Keyword(kw, ns)
    
    # attach line/col info
    ending_info = stream.ending_line_col_info()
    keyword.meta['start_row'] = starting_info[0]
    keyword.meta['start_col'] = starting_info[1]
    keyword.meta['ending_row'] = ending_info[0]
    keyword.meta['ending_col'] = ending_info[1]
    
    return keyword
        
macros[':'] = read_keyword

In [None]:
assert read(':abc')    == Keyword('abc'),                 'Simple'
assert read(':ns/abc') == Keyword('abc', namespace='ns'), 'Keywords can have namespace'

assert read_exception(': ', 'Single colon not allowed'),                    'Keyword with no name'
assert read_exception('::abc/my-symbol', 'Namespace alias not supported'),  'Keyword alias are not currently supported' 

#### integers
Integers consist of the digits 0 - 9, optionally prefixed by - to indicate a negative number, or (redundantly) by +. No integer other than 0 may begin with 0. 64-bit (signed integer) precision is expected. An integer can have the suffix N to indicate that arbitrary precision is desired. -0 is a valid integer not distinct from 0.
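
The integer rule as the spec words it is narrower than the pattern used in the code cell that follows, which also accepts hex, octal and radix forms. A sketch of a spec-only parser might look like this. The name `parse_spec_integer` is hypothetical, and rejecting un-suffixed values outside the signed 64-bit range is an assumption about how "64-bit precision is expected" could be enforced.

```python
import re

# Optional sign, then either a lone 0 or digits without a leading zero,
# then an optional N suffix requesting arbitrary precision.
SPEC_INT = re.compile(r"([-+]?)(0|[1-9][0-9]*)(N?)")


def parse_spec_integer(token):
    m = SPEC_INT.fullmatch(token)
    if m is None:
        raise ValueError(f"invalid integer: {token!r}")
    sign, digits, suffix = m.groups()
    value = int(digits)
    if sign == "-":
        value = -value
    # Python ints are arbitrary precision, so only un-suffixed values are range checked.
    if not suffix and not -2**63 <= value <= 2**63 - 1:
        raise ValueError(f"integer out of 64-bit range: {token!r}")
    return value


print(parse_spec_integer("-0"))    # 0
print(parse_spec_integer("+42"))   # 42
print(parse_spec_integer("9223372036854775808N") == 2**63)   # True
```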

In [None]:
# export

int_pattern = re.compile(r"^([-+]?)(?:(0)|([1-9][0-9]*)|0[xX]([0-9A-Fa-f]+)|0([0-7]+)|([1-9][0-9]?)[rR]([0-9A-Za-z]+)|0[0-9]+)(N)?$")

def read_number(stream, initch):
    s = initch
    for ch in stream:
        if ch is None or is_whitespace(ch) or is_ending(ch):
            stream.push_back(ch)
            return match_number(s)
        else:
            s += ch

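def match_number(s):
    if int_pattern.match(s):
        return match_int(s)
    elif float_pattern.match(s):
        return match_float(s)
    elif ratio_pattern.match(s):
        return match_ratio(s)
    # nothing matched: report the bad token instead of silently returning None
    raise Exception("Invalid number: '{}'".format(s))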
    
def match_int(s):
    m = int_pattern.match(s).groups()
    if m[1] is not None:
        return 0
    
    negate = m[0] == '-'
    
    if m[2] is not None:
        base = 10
        n = m[2]
    elif m[3] is not None:
        base = 16
        n = m[3]
    elif m[4] is not None:
        base = 8
        n = m[4]
    elif m[6] is not None:
        base = int(m[5])
        n = m[6]
    else:
        # a leading zero followed by non-octal digits, e.g. '09'
        raise Exception("Invalid number: '{}'".format(s))
        
    number = int(n, base)
    number = -1 * number if negate else number
    return number

In [None]:
assert read('0')        == 0
assert read('+0')       == 0
assert read('-0')       == 0
assert read('42')       == 42
assert read('052')      == 42
assert read('8r52')     == 42
assert read('0x2a')     == 42
assert read('36r16')    == 42
assert read('2r101010') == 42

#### floating point numbers
64-bit (double) precision is expected.

In [None]:
# export

float_pattern = re.compile(r"^([-+]?[0-9]+(\.[0-9]*)?([eE][-+]?[0-9]+)?)(M)?$")
ratio_pattern = re.compile(r"^([-+]?[0-9]+)/([0-9]+)$")

def match_float(s):
    m = float_pattern.match(s).groups()
    if m[3] is not None:
        return float(m[0])   # TODO: Should we support exact precision? This would be decimal.Decimal
    else:
        return float(s)
    
def match_ratio(s):
    m = ratio_pattern.match(s).groups()
    numerator = m[0][1:] if m[0].startswith('+') else m[0]
    denominator = m[1]    
    
    # evaluated eagerly to a float; fractions.Fraction could keep the ratio exact
    return int(numerator) / int(denominator)


In [None]:
assert read('34.1') == 34.1
assert read('3e5')  == 3e5
assert read('1/2')  == 0.5
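
The TODO in `match_float` raises the question of exact precision. One possible direction, sketched here as a standalone helper rather than a change to the reader, is to map `M`-suffixed numbers to `decimal.Decimal` and ratios to `fractions.Fraction`, both from the standard library. The helper name `parse_exact` is hypothetical, and it assumes the token has already been validated by the patterns above.

```python
from decimal import Decimal
from fractions import Fraction


def parse_exact(token):
    """Hypothetical helper: keep 'M'-suffixed decimals and ratios exact."""
    if token.endswith("M"):
        return Decimal(token[:-1])           # exact decimal, no binary rounding
    if "/" in token:
        numerator, denominator = token.split("/", 1)
        return Fraction(int(numerator), int(denominator))
    return float(token)                      # plain floats stay 64-bit doubles


print(parse_exact("0.1M"))   # 0.1
print(parse_exact("1/3"))    # 1/3
print(parse_exact("34.1"))   # 34.1
```

A `Fraction` still compares equal to the matching float where one exists, so `parse_exact('1/2') == 0.5` holds just as it does for the float-returning version.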

#### lists
A list is a sequence of values. Lists are represented by zero or more elements enclosed in parentheses `()`. Note that lists can be heterogeneous.
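
Reading a list is naturally recursive: an opening parenthesis starts a new sequence, and every element inside it is read the same way as a top-level element. The sketch below shows that shape on a pre-split token list instead of a character stream. The function `read_form` is hypothetical, and atoms are left as plain strings.

```python
def read_form(tokens, pos=0):
    """Read one form from a token list; return (value, next_position)."""
    if pos >= len(tokens):
        raise ValueError("EOF in middle of list")
    token = tokens[pos]
    if token == "(":
        items = []
        pos += 1
        while pos < len(tokens) and tokens[pos] != ")":
            item, pos = read_form(tokens, pos)   # nested lists recurse here
            items.append(item)
        if pos >= len(tokens):
            raise ValueError("EOF in middle of list")
        return items, pos + 1                    # skip the closing ")"
    if token == ")":
        raise ValueError("unmatched )")
    return token, pos + 1


value, end = read_form(["(", "1", "(", "a", "b", ")", ")"])
print(value)   # ['1', ['a', 'b']]
print(end)     # 7
```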

In [None]:
# export
class List(list):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.meta = {}

def read_list(stream, initch):
    starting_info = stream.starting_line_col_info()
    
    forms = read_delimited(stream, initch, sentinel=')')
    thelist = List(forms)
    
    # attach line/col info
    ending_info = stream.ending_line_col_info()
    thelist.meta['start_row'] = starting_info[0]
    thelist.meta['start_col'] = starting_info[1]
    thelist.meta['ending_row'] = ending_info[0]
    thelist.meta['ending_col'] = ending_info[1]
    
    return thelist

def read_delimited(stream, initch, sentinel):
    forms = []
    while True:
        form = read(stream, sentinel)
        # identity checks, so a symbol or string that merely spells
        # READ_EOF / READ_FINISHED is not mistaken for a marker
        if form is READ_EOF:
            raise Exception("EOF in middle of list")
        elif form is READ_FINISHED:
            return forms
        else:
            forms.append(form)

macros['('] = read_list
        

In [None]:

assert read('(1 2 3)') == [1, 2, 3]

read('(1 "abc" 3, (1, 2 3) :key ns/sym)')

In [None]:
# export
class Vector(List):
    pass

def read_vector(stream, initch):
    starting_info = stream.starting_line_col_info()
    forms = read_delimited(stream, initch, sentinel=']')
    thevector = Vector(forms)
    
    # attach line/col info
    ending_info = stream.ending_line_col_info()
    thevector.meta['start_row'] = starting_info[0]
    thevector.meta['start_col'] = starting_info[1]
    thevector.meta['ending_row'] = ending_info[0]
    thevector.meta['ending_col'] = ending_info[1]
    
    return thevector

macros['['] = read_vector


In [None]:
assert read('[1 2 3 4 ]') == [1, 2, 3, 4], 'vector'

In [None]:
# export 
class Map(dict):
    def __init__(self, vals, linerange=None):
        dict.__init__(self, vals)
        self.meta = {}
    
def read_map(stream, initch):
    starting_info = stream.starting_line_col_info()
    forms = read_delimited(stream, initch, sentinel='}')
    
    assert len(forms) % 2 == 0, "Map must have value for every key"
    
    pairs = [forms[i:i+2] for i in range(0, len(forms), 2)]   
    themap = Map(pairs)
    
    # attach line/col info
    ending_info = stream.ending_line_col_info()
    themap.meta['start_row'] = starting_info[0]
    themap.meta['start_col'] = starting_info[1]
    themap.meta['ending_row'] = ending_info[0]
    themap.meta['ending_col'] = ending_info[1]
    
    return themap

macros['{'] = read_map

In [None]:
a, b = Symbol('a'), Symbol('b')
assert read('{a 1 b 3 :abc 123}') == {a: 1, b: 3, Keyword('abc'): 123}, "maps"

#### # dispatch character
Tokens beginning with `#` are reserved. The character following `#` determines the behavior. The dispatches `#{` (sets), `#_` (discard), `#alphabetic-char` (tag) are defined below. `#` is not a delimiter.
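
The rule that the character after `#` selects the behavior can be shown with a small classifier. This only names the dispatch. It does not implement discard or tags, which the reader below does not cover either. The function `classify_dispatch` is hypothetical.

```python
def classify_dispatch(token):
    """Name the dispatch that a '#'-prefixed token selects (illustration only)."""
    if not token.startswith("#") or len(token) < 2:
        raise ValueError("not a dispatch token")
    follower = token[1]              # the character right after '#'
    if follower == "{":
        return "set"
    if follower == "_":
        return "discard"
    if follower.isalpha():
        return "tag"
    raise ValueError(f"invalid dispatch: {token!r}")


print(classify_dispatch("#{"))      # set
print(classify_dispatch("#_foo"))   # discard
print(classify_dispatch("#inst"))   # tag
```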

In [None]:
# export 

def read_dispatch(stream, initch):
    ch = next(stream)
    if ch in dispatch_macros:
        return dispatch_macros[ch](stream, ch)
    raise Exception("Invalid Dispatch")
    
macros['#'] = read_dispatch 

class Set(set):
    def __init__(self, *vals):
        set.__init__(self, vals)
        self.meta = {}
        
def read_set(stream, initch):
    starting_info = stream.starting_line_col_info()
    
    forms = read_delimited(stream, initch, sentinel='}')
    theset = Set(*forms)
    
    # attach line/col info
    ending_info = stream.ending_line_col_info()
    theset.meta['start_row'] = starting_info[0]
    theset.meta['start_col'] = starting_info[1] - 1
    theset.meta['ending_row'] = ending_info[0]
    theset.meta['ending_col'] = ending_info[1]
    
    return theset

dispatch_macros['{'] = read_set

In [None]:
assert read('#{1 2 5 5 1 3 4 4}') == set([1, 2, 3, 4, 5])

#### Comments
If a `;` character is encountered outside of a string, that character and all subsequent characters to the next newline should be ignored.
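
The "outside of a string" condition is the subtle part of this rule. The sketch below strips comments from a whole text while tracking whether the scan is inside a double-quoted string. It is a standalone illustration with a hypothetical name, `strip_comments`, and it does not understand character literals such as `\;` or `\"`, so it is not a substitute for handling comments inside the reader.

```python
def strip_comments(text):
    """Remove ';' comments, leaving ';' inside double-quoted strings alone."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                out.append(text[i + 1])   # keep the escaped character verbatim
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch == ";":
            # skip to the newline, which is kept as ordinary whitespace
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(repr(strip_comments('[1 ; one\n 2]')))   # '[1 \n 2]'
print(repr(strip_comments('"a;b" ; c')))       # '"a;b" '
```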

In [None]:
# export 

def read_comment(stream, initch):
    for ch in stream:
        if ch == '\n':
            break
        # ignore everything else
    # returning the stream itself tells read() that no form
    # was produced, so the comment is skipped.
    return stream
macros[';'] = read_comment

In [None]:
assert read('''[
  1 ; first entry 
  2 ; second entry
]''')                == [1, 2], 'comments in vector'


#### ' (quote)

In [None]:
# export 

def read_quote(stream, initch):
    return List([Symbol('quote'), read(stream)])
macros["'"] = read_quote

In [None]:
# test

assert read("'(1 2 3)") == ['quote', [1, 2, 3]]

In [None]:
from pathlib import Path
import os

script_dir = Path(os.path.abspath("")).resolve()

print(str(script_dir))

nbloader.export_module('02 - The Reader', 'reader', str(script_dir))