Make the tokenizer able to be run separately #41

Open
aredridel opened this issue Nov 14, 2011 · 8 comments

Comments

@aredridel
Owner

Do this while involving the parser as little as possible for state transitions between tokenizing modes.

@gwicke
Contributor

gwicke commented Nov 28, 2011

I am very much interested in this as well. I am currently using the tree builder with a custom (MediaWiki) tokenizer in a prototype. For now I have modified the parser to take a tokenizer as an argument in parse(), but this is just a quick hack that ignores tokenizer modes completely.
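To illustrate the idea of handing a tokenizer to the tree builder, here is a minimal sketch. It is not the html5 package's API: `ToyWikiTokenizer`, `buildTree`, the `tokens()` method, and the token shapes are all hypothetical names chosen for this example, and the "wiki" syntax handled is only `'''bold'''`. Like the quick hack described above, it ignores tokenizer modes entirely.

```javascript
// Hypothetical sketch: none of these names come from the html5 package.
// A "tokenizer" here is any object with a tokens() method that yields tokens.
class ToyWikiTokenizer {
  constructor(text) {
    this.text = text;
  }

  *tokens() {
    // Treat '''bold''' as a <b> element; everything else is character data.
    const parts = this.text.split("'''");
    for (let i = 0; i < parts.length; i++) {
      const insideBold = i % 2 === 1;
      if (insideBold) yield { type: 'StartTag', name: 'b' };
      if (parts[i] !== '') yield { type: 'Characters', data: parts[i] };
      if (insideBold) yield { type: 'EndTag', name: 'b' };
    }
    yield { type: 'EOF' };
  }
}

// A toy tree builder that accepts the tokenizer as an argument.
// It never switches tokenizer modes, so it only suits simple inputs.
function buildTree(tokenizer) {
  const root = { name: '#document', children: [] };
  const stack = [root];
  for (const token of tokenizer.tokens()) {
    const current = stack[stack.length - 1];
    if (token.type === 'StartTag') {
      const element = { name: token.name, children: [] };
      current.children.push(element);
      stack.push(element);
    } else if (token.type === 'EndTag') {
      if (stack.length > 1) stack.pop();
    } else if (token.type === 'Characters') {
      current.children.push(token.data);
    }
  }
  return root;
}

const tree = buildTree(new ToyWikiTokenizer("a '''b''' c"));
console.log(JSON.stringify(tree));
```

Because the builder depends only on the token stream, any tokenizer that produces the same token shapes could be swapped in.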

Right now I am mostly working on the tokenizer, but might be able to put some effort into the interface later.

@aredridel
Owner Author

Oh, sweet. I'm integrating the latest draft's tokenizer changes, which flatten the modes out into separate states. That should help move this along.

@papandreou
Contributor

@aredridel: Has that work been completed? I'm thinking about updating a project to the newest version of html5 and getting rid of the state transitions in my code.

@aredridel
Owner Author

It has not, sadly! It's not trivial to get 100% accurate, since the HTML5 algorithms assume a document tree, and if you're not parsing fully, you don't get one. I need to rework it for some of the latest spec changes, and that should simplify things, since they've flattened some of the parser down into tokenizer states.

@dgreensp

I'm using just the tokenizer at the moment. I was able to pull it out, but it would be nice if it were explicitly separable.

@aredridel
Owner Author

It's hard with the revision of the spec I was originally targeting, since the parser state feeds back into the tokenizer. With the latest revision, much if not all of this is flattened out. As I migrate toward the current parsing spec, it'll smooth that out.
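A rough sketch of why that feedback matters when the tokenizer runs on its own. In the HTML parsing algorithm, the tree construction stage switches the tokenizer into a raw-text-like state after a `<script>` start tag. The toy below imitates that with a single `state` field. `ToyTokenizer`, `tokenize`, the `rawtext` state name, and the `parserFeedback` flag are all hypothetical and far simpler than the real state machine; this is not the html5 package's code.

```javascript
// Hypothetical sketch of parser-to-tokenizer feedback; not the html5 package.
class ToyTokenizer {
  constructor(input) {
    this.input = input;
    this.pos = 0;
    this.state = 'data';    // the consumer may change this between tokens
    this.rawEndTag = null;  // tag name whose end tag terminates raw text
  }

  next() {
    if (this.pos >= this.input.length) return { type: 'EOF' };

    if (this.state === 'rawtext') {
      // Everything up to the matching end tag is plain character data.
      const close = '</' + this.rawEndTag + '>';
      let stop = this.input.indexOf(close, this.pos);
      if (stop === -1) stop = this.input.length;
      const data = this.input.slice(this.pos, stop);
      this.pos = stop;
      this.state = 'data';
      return data !== '' ? { type: 'Characters', data } : this.next();
    }

    if (this.input[this.pos] === '<') {
      const gt = this.input.indexOf('>', this.pos);
      if (gt !== -1) {
        const raw = this.input.slice(this.pos + 1, gt);
        this.pos = gt + 1;
        return raw[0] === '/'
          ? { type: 'EndTag', name: raw.slice(1) }
          : { type: 'StartTag', name: raw };
      }
    }

    // Character data runs until the next '<' (or end of input).
    let lt = this.input.indexOf('<', this.pos + 1);
    if (lt === -1) lt = this.input.length;
    const data = this.input.slice(this.pos, lt);
    this.pos = lt;
    return { type: 'Characters', data };
  }
}

// Drives the tokenizer; parserFeedback stands in for the tree builder
// telling the tokenizer to change state after a <script> start tag.
function tokenize(input, parserFeedback) {
  const tokenizer = new ToyTokenizer(input);
  const out = [];
  for (let t = tokenizer.next(); t.type !== 'EOF'; t = tokenizer.next()) {
    out.push(t);
    if (parserFeedback && t.type === 'StartTag' && t.name === 'script') {
      tokenizer.state = 'rawtext';
      tokenizer.rawEndTag = 'script';
    }
  }
  return out;
}

const source = '<script>if (a<b) x()</script>';
console.log(tokenize(source, true));   // script body stays one Characters token
console.log(tokenize(source, false));  // "<b) x()</script" is misread as a tag
```

Without the feedback, the `<` inside the script body is taken for the start of a tag, which is the kind of inaccuracy a standalone tokenizer has to guard against.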

@dgreensp

Sounds great. Thanks for writing this package, btw, it seems to be one of a kind in the JS world.

@aredridel
Owner Author

I'm kinda surprised at how one of a kind it is, but it's needed!
