Document what the text post-processor is for #41

Open
cburgmer opened this issue Apr 17, 2012 · 3 comments

Comments

@cburgmer
Contributor

Looking at the post-processor in text.py, I don't fully understand what its purpose is. Is it designed to produce nice, human-readable output (but then why are tags with attributes preserved?), or just to strip wiki markup?

I am looking for a transformation that yields text only, mostly for later processing with NLTK. Something like ** for bold text would be acceptable.
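To make the kind of transformation meant here concrete, below is a rough regex-based sketch. It is not the library's post-processor: `wiki_to_plain` is a hypothetical name, and it only handles MediaWiki's `'''bold'''`, `''italic''`, and `[[link]]` syntax, ignoring templates, tables, and nesting.

```python
import re


def wiki_to_plain(source: str) -> str:
    """Hypothetical helper: reduce a few wiki constructs to plain text."""
    # '''bold''' -> **bold** (must run before the italic rule below)
    text = re.sub(r"'''(.+?)'''", r"**\1**", source)
    # ''italic'' -> italic (emphasis is simply dropped)
    text = re.sub(r"''(.+?)''", r"\1", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]", r"\1", text)
    return text


sample = "'''Python''' is a ''language''; see [[Guido van Rossum|its author]] and [[NLTK]]."
print(wiki_to_plain(sample))
```

Regexes like these break down quickly on real wiki pages, which is why a parser-based post-processor is the better long-term route.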

@erikrose
Owner

Thanks for giving the library a spin! The text renderer is meant to output a human-readable textual representation. If it's spitting out tags and attrs, then that's a bug, and I'd be happy to take patches against it.

If you want to customize the output, you can use raw.py instead, giving you a raw AST to play with.

Incidentally, at some unspecified point in the future, I'm going to finish Parsimonious (https://github.com/erikrose/parsimonious/) and port the MW grammar to that, at which time I'll start ignoring this.

@peter17
Collaborator

peter17 commented Apr 26, 2012

Yes, the text post-processor is designed to produce nice, human-readable output.

As for tags: in the HTML post-processor there are two kinds of tags, allowed and disallowed. By default, all tags are disallowed, and in that case they are treated as "normal" text. That is why a tag shows up literally in the rendered output: by default, it is not interpreted as a tag. "Allowed" tags are interpreted once they are implemented (like <p>, <br/>...), and in that case they no longer appear in the output.

In the text post-processor, you currently can't define which tags are allowed or disallowed. They are all treated as text, except <p> and <br />, which are interpreted as a new paragraph and a line break.
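The behaviour described above can be sketched roughly as follows. This is a standalone illustration, not the code in text.py: `render_tags_as_text` is a hypothetical name, and the sketch assumes attribute-less `<p>` and `<br />` tags.

```python
import re

# Hypothetical patterns for the only two tags that get interpreted.
_LINE_BREAK = re.compile(r"<br\s*/?>", re.IGNORECASE)
_PARAGRAPH = re.compile(r"</?p\s*>", re.IGNORECASE)


def render_tags_as_text(source: str) -> str:
    """Interpret <p> and <br />; leave every other tag as literal text."""
    text = _LINE_BREAK.sub("\n", source)   # <br /> -> line break
    text = _PARAGRAPH.sub("\n\n", text)    # <p>, </p> -> paragraph break
    # Adjacent </p><p> pairs produce four newlines; collapse to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = '<p>one<br />two</p><p>three <span class="x">four</span></p>'
print(render_tags_as_text(sample))
```

Note that the `<span class="x">` tag survives with its attributes, which matches the output that prompted this issue.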

I think we can get better output from the text renderer. I spent some time looking at how the HTML renderer could be adapted for this purpose. It would take quite a while and I don't have the time right now, but please feel free to propose improvements if you want to.

@peter17
Collaborator

peter17 commented Apr 26, 2012

Finally, I felt inspired. I proposed a first version of a new text post-processor based on the HTML one. Please feel free to test it and propose improvements.
