Document what the text post-processor is for #41
Comments
Thanks for giving the library a spin! The text renderer is meant to output a human-readable textual representation. If it's spitting out tags and attrs, then that's a bug, and I'd be happy to take patches against it. If you want to customize the output, you can use raw.py instead, giving you a raw AST to play with. Incidentally, at some unspecified point in the future, I'm going to finish Parsimonious (https://github.com/erikrose/parsimonious/) and port the MW grammar to that, at which time I'll start ignoring this.
Yes, the text post-processor is designed to produce nice, human-readable output. As for tags: in the HTML post-processor, there are two kinds of tags, allowed and disallowed. By default, all tags are disallowed, in which case they are treated as "normal" text. That's why "" is rendered as "": by default, it is not a tag. "Allowed" tags are interpreted when they are implemented (like In the text post-processor, you can't currently define which tags are allowed or disallowed. They are all treated as text, except I think we can produce better output with the text renderer. I spent some time looking at how we could adapt the HTML renderer for this purpose. It would take a while and I don't have the time right now, but please feel free to propose improvements if you want to.
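To illustrate the allowed/disallowed distinction described above, here is a minimal standalone sketch. It is not the library's code: `render_tags`, `TAG_RE`, and the `allowed_tags` parameter are hypothetical names, and it only assumes that a disallowed tag should come out as escaped literal text while an allowed tag is passed through as markup.

```python
import html
import re

# Hypothetical pattern for simple opening/closing tags such as <b> or </span>.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")


def render_tags(source, allowed_tags=frozenset()):
    """Keep allowed tags as markup; escape every other tag as plain text.

    Assumption: with the default (empty) set, all tags are disallowed.
    """
    out = []
    pos = 0
    for match in TAG_RE.finditer(source):
        # Text between tags is always escaped.
        out.append(html.escape(source[pos:match.start()], quote=False))
        name = match.group(2).lower()
        if name in allowed_tags:
            out.append(match.group(0))  # allowed: emitted as a real tag
        else:
            out.append(html.escape(match.group(0), quote=False))  # disallowed: literal text
        pos = match.end()
    out.append(html.escape(source[pos:], quote=False))
    return "".join(out)


if __name__ == "__main__":
    print(render_tags("a <b>bold</b> word"))
    print(render_tags("a <b>bold</b> word", allowed_tags={"b"}))
```

With the default empty set, the `<b>` tags come out escaped and would display as literal text in a browser. Passing `allowed_tags={"b"}` lets them through unchanged.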
Finally, I felt inspired. I proposed a first version of a new text post-processor based on the HTML one. Please feel free to test it and propose improvements.
Looking at the post-processor in text.py, I don't fully understand what its purpose is. Is it designed to produce nice, human-readable output (but then why are tags with attributes preserved?) or just to strip off wiki markup?
I am looking for a transformation that yields text only, mostly for processing with NLTK later on. Something like ** for bold text would be acceptable.
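For concreteness, the kind of transformation meant here could look like the rough sketch below. It uses plain regular expressions rather than this library's parser, `wikitext_to_plain` is a hypothetical name, and it assumes only the simplest markup (bold, italics, internal links, HTML-like tags). Templates, tables, and combined bold-italic quoting are not handled.

```python
import re


def wikitext_to_plain(text):
    """Regex-based sketch: reduce simple wiki markup to plain text."""
    # '''bold''' -> **bold** (must run before the italic rule)
    text = re.sub(r"'''(.+?)'''", r"**\1**", text)
    # ''italic'' -> italic
    text = re.sub(r"''(.+?)''", r"\1", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]", r"\1", text)
    # Drop HTML-like tags together with their attributes.
    text = re.sub(r"<[^<>]+>", "", text)
    return text


if __name__ == "__main__":
    sample = ("'''Python''' is a [[programming language|language]] "
              "with ''clean'' <span class=\"x\">syntax</span>.")
    print(wikitext_to_plain(sample))
```

The result is a plain string that can be handed to a tokenizer afterwards.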