# Translating [Markdown] to [Python]

A primary translation in literate programming is the tangle step, which converts the literate program into 
the programming language. The 1979 implementation converts `".WEB"` files to valid Pascal - `".PAS"` - files.
The `pidgy` approach begins with [Markdown] files and produces proper [Python] files as the outcome. The rest of this 
document configures how [IPython] acknowledges the transformation and describes the heuristics that translate [Markdown] to [Python].

[Markdown]: #
[Python]: #

In [1]:
    
    import IPython, typing, mistune as markdown, textwrap, ast, doctest, re, dataclasses, pidgy
    try: 
        from . import base, util
        from .util import FENCE, CONTINUATION, SEMI, COLON, MAGIC, DOCTEST, QUOTES, SPACE, WHITESPACE
    except ImportError: 
        import base, util
        from util import FENCE, CONTINUATION, SEMI, COLON, MAGIC, DOCTEST, QUOTES, SPACE, WHITESPACE

The `pidgy` tangle workflow has three steps:

1. Block-level lexical analysis to tokenize [Markdown].
2. Normalize the tokens to compacted `"code" and not "code"` tokens.
3. Translate the normalized tokens to a string of valid [Python] code.

[Markdown]: #
[Python]: #
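
The three steps above can be sketched with a deliberately simplified, standard-library-only stand-in. The helper names `tokenize`, `normalize`, and `translate` are hypothetical and are not the `pidgy` implementation; the sketch assumes that any line indented by four spaces is code and ignores edge cases such as prose that itself contains triple quotes.

```python
import textwrap

def tokenize(source):
    """Step 1: one token per line; lines indented by four spaces are code."""
    tokens, kind = [], "paragraph"
    for line in source.splitlines(keepends=True):
        if line.strip():  # blank lines inherit the kind of the previous line
            kind = "code" if line.startswith("    ") else "paragraph"
        tokens.append({"type": kind, "text": line})
    return tokens

def normalize(tokens):
    """Step 2: merge neighbouring tokens that share a type."""
    compacted = []
    for token in tokens:
        if compacted and compacted[-1]["type"] == token["type"]:
            compacted[-1]["text"] += token["text"]
        else:
            compacted.append(dict(token))
    return compacted

def translate(tokens):
    """Step 3: keep code, wrap everything else in triple quotes."""
    source = ""
    for token in tokens:
        text = token["text"]
        if token["type"] == "code":
            source += textwrap.dedent(text)
        elif text.strip():
            # Keep surrounding whitespace outside the quotes.
            lead = text[: len(text) - len(text.lstrip())]
            trail = text[len(text.rstrip()):]
            source += f'{lead}"""{text.strip()}"""{trail}'
        else:
            source += text
    return source

markdown_source = "# Title\n\nSome prose.\n\n    x = 1 + 1\n\nMore prose.\n"
python_source = translate(normalize(tokenize(markdown_source)))

namespace = {}
exec(python_source, namespace)
print(namespace["x"])  # prints 2
```

The prose becomes string expressions that Python evaluates and discards, while the indented line runs as ordinary code.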

## Block level lexical analysis.

`pidgy` uses a modified `mistune.BlockLexer` to create block level tokens
for a [Markdown] source. A specific `pidgy` addition is a `doctest` block object; 
`doctest` blocks are testable strings that are ignored by the tangle
step. The tokens are then normalized and translated to [Python] strings.
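
To illustrate what counts as a `doctest` block, the sketch below uses the public `doctest.DocTestParser.get_examples` method from the standard library instead of the private pattern that the lexer borrows. The sample text is made up for illustration.

```python
import doctest

text = (
    "Some prose.\n"
    "\n"
    "    >>> 1 + 1\n"
    "    2\n"
    "\n"
    "More prose.\n"
)

# Each example pairs the source after the `>>>` prompt with its expected output.
examples = doctest.DocTestParser().get_examples(text)
for example in examples:
    print(repr(example.source), repr(example.want))
```

Only the prompt line and its expected output form an example; the surrounding prose is not part of it.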

<details><summary><code>BlockLexer</code></summary>

In [2]:
    class BlockLexer(markdown.BlockLexer, util.ContextDepth):
        class grammar_class(markdown.BlockGrammar):
            doctest = doctest.DocTestParser._EXAMPLE_RE
            block_code = re.compile(r'^((?!\s+>>>\s) {4}[^\n]+\n*)+')
            default_rules = "newline hrule block_code fences heading nptable lheading block_quote list_block def_links def_footnotes table paragraph text".split()

        def parse_doctest(self, m): self.tokens.append({'type': 'paragraph', 'text': m.group(0)})

        def parse_fences(self, m):
            if m.group(2): self.tokens.append({'type': 'paragraph', 'text': m.group(0)})
            else: super().parse_fences(m)

        def parse_hrule(self, m): self.tokens.append(dict(type='hrule', text=m.group(0)))
            
        def parse_def_links(self, m):
            super().parse_def_links(m)
            self.tokens.append(dict(type='def_link', text=m.group(0)))
            
        def parse(self, text: str, default_rules=None, normalize=True) -> typing.List[dict]:
            if not self.depth: self.tokens = []
            with self: tokens = super().parse(util.whiten(text), default_rules)
            if normalize and not self.depth: tokens = normalizer(text, tokens)
            return tokens

</details>
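
The `block_code` pattern from the grammar class can be exercised on its own with the standard `re` module. This is a small sketch with made-up sample strings; it shows that the negative lookahead keeps an indented `>>>` prompt from being treated as indented code.

```python
import re

# Pattern copied from the grammar class above.
block_code = re.compile(r'^((?!\s+>>>\s) {4}[^\n]+\n*)+')

indented_code = "    x = 1\n    y = 2\n\nprose\n"
indented_doctest = "    >>> 1 + 1\n    2\n"

match = block_code.match(indented_code)
print(repr(match.group(0)))                # the two code lines plus the blank line
print(block_code.match(indented_doctest))  # None, so another rule can claim it
```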

## Normalizing the tokens

Tokenizing [Markdown] typically extracts conventions at both the block and inline level.
Fortunately, `pidgy`'s translation is restricted to block level [Markdown] tokens, which mitigates some potential complexities from having opinions about inline code while tangling.

<details><summary><code>normalizer</code></summary>

In [3]:
    def normalizer(text, tokens):
        compacted = []
        while tokens:
            token = tokens.pop(0)
            if 'text' not in token: continue
            else: 
                if not token['text'].strip(): continue
                block, body = token['text'].splitlines(), ""
            while block:
                line = block.pop(0)
                if line:
                    before, line, text = text.partition(line)
                    body += before + line
            if token['type']=='code':
                compacted.append({'type': 'code', 'lang': None, 'text': body})
            else:
                if compacted and compacted[-1]['type'] == 'paragraph':
                    compacted[-1]['text'] += body
                else: compacted.append({'type': 'paragraph', 'text': body})
        if compacted and compacted[-1]['type'] == 'paragraph':
            compacted[-1]['text'] += text
        elif text.strip():
            compacted.append({'type': 'paragraph', 'text': text})
        # Deal with front matter
        if compacted and compacted[0]['text'].startswith('---\n') and '\n---' in compacted[0]['text'][4:]:
            token = compacted.pop(0)
            front_matter, sep, paragraph = token['text'][4:].partition('---')
            compacted = [{'type': 'front_matter', 'text': F"\n{front_matter}"},
                        {'type': 'paragraph', 'text': paragraph}] + compacted
        return compacted

</details>
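
The inner loop of `normalizer` relies on `str.partition` to recover each token's exact source text, including any whitespace that the token text no longer carries. A minimal sketch of that idea in isolation follows; `realign` and the sample fragments are hypothetical names used for illustration, not part of `pidgy`.

```python
def realign(source, fragments):
    """Pair each fragment with the untouched source text that leads up to it."""
    bodies = []
    for fragment in fragments:
        body = ""
        for line in fragment.splitlines():
            if line:
                # Consume the source up to and including this line.
                before, line, source = source.partition(line)
                body += before + line
        bodies.append(body)
    return bodies, source  # whatever is left was not claimed by any fragment

source = "# Title\n\n\n    x = 1\n\ntail\n"
fragments = ["# Title", "x = 1"]  # token text without surrounding whitespace

bodies, rest = realign(source, fragments)
print(bodies)
print(repr(rest))
```

Because every character of the source ends up either in a body or in the remainder, concatenating them reproduces the input exactly, which is what lets the later steps keep the original layout.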

## Flattening the tokens to a [Python] string.

The tokenizer controls the translation of [Markdown] strings to [Python] strings.  Our major constraint is that the translated source should retain the line numbers of the [Markdown] input.
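
One way to see why line numbers matter: if the tangled source keeps every statement on the line where it appeared in the input, then `ast` line numbers, and therefore tracebacks, refer back to the original document. The sketch below checks this for a hand-written pair of strings; the tangled string is an assumed illustration of the output shape, not actual `pidgy` output.

```python
import ast

markdown_source = "# Title\n\nSome prose.\n\n    answer = 6 * 7\n"
tangled_source = '"""# Title\n\nSome prose."""\n\nanswer = 6 * 7\n'

# Line of the assignment in the Markdown input (1-indexed).
markdown_line = next(
    number
    for number, line in enumerate(markdown_source.splitlines(), start=1)
    if "answer" in line
)

# Line of the assignment as reported by the Python parser.
assignment = next(
    node for node in ast.parse(tangled_source).body if isinstance(node, ast.Assign)
)

print(markdown_line, assignment.lineno)  # prints 5 5
```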

<details><summary><code>Flatten</code></summary>

In [4]:
    class Tokenizer(BlockLexer):
        def stringify(self, tokens: typing.List[dict], source: str = """""", last: int =0) -> str:
            INDENT = indent = util.base_indent(tokens) or 4
            for i, token in enumerate(tokens):
                object = token['text']
                if token and token['type'] == 'code':
                    if object.lstrip().startswith(FENCE):

                        object = ''.join(''.join(object.partition(FENCE)[::2]).rpartition(FENCE)[::2])
                        indent = INDENT + util.num_first_indent(object)
                        object = textwrap.indent(object, INDENT*SPACE)

                    if object.lstrip().startswith(MAGIC):  ...
                    else: indent = util.num_last_indent(object)
                elif token and token['type'] == 'front_matter': 
                    object = textwrap.indent(
                        F"locals().update(__import__('ruamel.yaml').yaml.safe_load({util.quote(object)}))\n", indent*SPACE)

                elif not object: ...
                else:
                    object = textwrap.indent(object, SPACE*max(indent-util.num_first_indent(object), 0))
                    for next in tokens[i+1:]:
                        if next['type'] == 'code':
                            next = util.num_first_indent(next['text'])
                            break
                    else: next = indent       
                    Δ = max(next-indent, 0)

                    if not Δ and source.rstrip().rstrip(CONTINUATION).endswith(COLON): 
                        Δ += 4

                    spaces = util.indents(object)
                    "what if the spaces are long enough"
                    object = object[:spaces] + Δ*SPACE+ object[spaces:]
                    if not source.rstrip().rstrip(CONTINUATION).endswith(QUOTES): 
                        object = util.quote(object)
                source += object

            # add a semicolon to the source if the last non-empty block is not code.
            for token in reversed(tokens):
                if token['text'].strip():
                    if token['type'] != 'code': 
                        source = source.rstrip() + SEMI
                    break

            return source

</details>

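One heuristic in `stringify` is worth isolating: when the source emitted so far ends with a colon, and the next code block is not already indented further, the prose block is indented by four more spaces so that the quoted prose lands inside the block the colon opened. The sketch below imitates that rule with a hypothetical helper, `indent_after`, and assumes that `COLON` is `":"` and `CONTINUATION` is a backslash, since `util` is not shown here.

```python
def indent_after(source: str, indent: int = 0) -> int:
    """Return the indent for prose that follows `source`."""
    # Assumed values of the util constants COLON and CONTINUATION.
    colon, continuation = ":", "\\"
    if source.rstrip().rstrip(continuation).endswith(colon):
        return indent + 4
    return indent

code = "def double(x):\n"
prose = "Return twice the input.\n"
body = "    return 2 * x\n"

# The prose is quoted and indented so that it becomes the function's docstring.
spaces = " " * indent_after(code)
tangled = code + spaces + '"""' + prose.strip() + '"""\n' + body

namespace = {}
exec(tangled, namespace)
print(namespace["double"].__doc__)
```

Under these assumptions, a paragraph written directly after a `def` line reads as documentation in the document and also serves as the docstring in the tangled code.
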
## Tangling functions

<details><summary><code>Utilities</code></summary>

In [6]:
    for x in "default_rules footnote_rules list_rules".split():
        setattr(BlockLexer, x, list(getattr(BlockLexer, x)))
        getattr(BlockLexer, x).insert(getattr(BlockLexer, x).index('block_code'), 'doctest')
        if 'block_html' in getattr(BlockLexer, x):
            getattr(BlockLexer, x).pop(getattr(BlockLexer, x).index('block_html'))


</details>
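
The loop above edits each rule list in place. The same list operations on a plain, made-up rule list show the effect: `doctest` is placed directly before `block_code`, and `block_html` is dropped. Assuming the lexer tries rules in list order, this gives the `doctest` rule the first chance at indented text.

```python
rules = ["newline", "hrule", "block_code", "fences", "block_html", "paragraph"]

# Insert the new rule immediately before `block_code`.
rules.insert(rules.index("block_code"), "doctest")

# Remove `block_html` when it is present.
if "block_html" in rules:
    rules.remove("block_html")

print(rules)
```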

In [None]:
    @base.implementation
    def tangle(text:str):
        tokenizer = Tokenizer()
        return tokenizer.stringify(tokenizer.parse(''.join(text)))
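
`tangle` passes its input through `''.join(text)` before parsing. A small illustration of what that call does for two possible input shapes, a single string or a list of lines, follows; the sample values are made up, and whether `pidgy` receives both shapes is an assumption.

```python
as_string = "# Title\n\n    x = 1\n"
as_lines = as_string.splitlines(keepends=True)

# Joining a string iterates over its characters and rebuilds the same string.
print("".join(as_string) == as_string)  # True

# Joining a list of lines concatenates them into one string.
print("".join(as_lines) == as_string)   # True
```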

## `pidgy` transform manager.

The `pidgy` input transform manager puts together the tokenizing, normalizing, and flattening steps that translate [Markdown]
to [Python].  It is a configurable object that can be configured by `IPython`.