forked from ndmitchell/tagsoup
TODO.txt
This document sets out a plan for the next iteration of TagSoup.
Change 1 will eliminate TagPos in favour of Tag (Position String).
The big advantage is that every string (including attribute values)
carries its own position, which is easy to find.
Change 2 will be a massive increase in speed, aiming to be the
fastest HTML parser in any language.
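
A minimal sketch of what Change 1 could look like from the user's side. The
Tag type here is a cut-down stand-in, not the real TagSoup type; reading the
two Ints as line and column is an assumption, and unposition and
attribPosition are hypothetical helper names used only for illustration.

```haskell
-- Payload tagged with a position (assumed here to be line and column).
data Position a = Position !Int !Int a
    deriving (Show, Eq)

instance Functor Position where
    fmap f (Position l c x) = Position l c (f x)

-- Cut-down stand-in for a tag type that is polymorphic in its string type.
data Tag str
    = TagOpen str [(str, str)]
    | TagClose str
    | TagText str
    deriving (Show, Eq)

instance Functor Tag where
    fmap f (TagOpen name atts) = TagOpen (f name) [(f k, f v) | (k, v) <- atts]
    fmap f (TagClose name)     = TagClose (f name)
    fmap f (TagText txt)       = TagText (f txt)

-- Drop the position, keeping only the payload.
unposition :: Position a -> a
unposition (Position _ _ x) = x

-- Position of the value of the first attribute with the given name, if any.
attribPosition :: String -> Tag (Position String) -> Maybe (Int, Int)
attribPosition name (TagOpen _ atts) =
    case [(l, c) | (Position _ _ k, Position l c _) <- atts, k == name] of
        p : _ -> Just p
        []    -> Nothing
attribPosition _ _ = Nothing
```

With this shape, fmap unposition turns a Tag (Position String) back into a
plain Tag String, so code that does not care about positions pays no
syntactic cost.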
---------------------------------------------------------------------
-- PARSER
The parser will be specified as:
data Position a = Position {-# UNPACK #-} !Int {-# UNPACK #-} !Int a
-- ^ check that this is equivalent to Position Int# Int# a
data Flag = TagOpen | AttVal | AttName | EntHex ...
          | Warn String
lexer :: String -> [Either (Position Flag) Char]
-- ^ Spec.hs plus a very minimal definition
strings :: [Either (Position Flag) Char] -> [(Flag, Position String)]
-- ^ Around 7 lines or so; the only difficult bit is moving Warn, as it doesn't capture
tags :: Options a -> [(Flag,a)] -> [Tag a]
-- ^ Longish and handwritten. Make sure to buffer Warn when inside a tag
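
A possible sketch of the strings stage, under stated assumptions. The Flag
type is abbreviated, and the treatment of Warn is a guess at the intent: a
Warn captures no characters, so the characters after it are credited to the
flag that was interrupted, and the Warn is emitted just after that flag's
entry with an empty string. The helper collect is a hypothetical name.

```haskell
data Position a = Position !Int !Int a
    deriving (Show, Eq)

-- Abbreviated flag set, enough for the example.
data Flag = TagOpen | AttName | AttVal | Text | Warn String
    deriving (Show, Eq)

type Lexeme = Either (Position Flag) Char

-- Group the characters that follow each flag into one positioned string.
strings :: [Lexeme] -> [(Flag, Position String)]
strings [] = []
-- A Warn seen before any other flag captures nothing.
strings (Left (Position l c (Warn msg)) : rest) =
    (Warn msg, Position l c "") : strings rest
strings (Left (Position l c flag) : rest) =
    (flag, Position l c text) : warns ++ strings rest'
  where
    (text, warns, rest') = collect rest
-- Characters with no owning flag are dropped in this sketch.
strings (Right _ : rest) = strings rest

-- Collect characters up to the next non-Warn flag, setting aside any
-- Warn flags met on the way so they do not split the string.
collect :: [Lexeme] -> (String, [(Flag, Position String)], [Lexeme])
collect (Right ch : xs) =
    let (cs, ws, ys) = collect xs in (ch : cs, ws, ys)
collect (Left (Position l c (Warn msg)) : xs) =
    let (cs, ws, ys) = collect xs in (cs, (Warn msg, Position l c "") : ws, ys)
collect xs = ([], [], xs)
```

Emitting the warnings after the interrupted string keeps each string whole,
at the cost of holding the warnings until that string ends.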
---------------------------------------------------------------------
-- GRAMMAR
The grammar for flags will be specified as:
type Grammar = [(Flag,String)] -> Maybe [(Flag,String)]
(<+>), (<.>) :: Grammar -> Grammar -> Grammar
star :: Grammar -> Grammar
grammar = star $
    TagOpen <.> star (AttVal <.> ents <+> AttName) <.> (TagShut <+> TagShutEnd) <+>
    TagClose <+>
    Comment <+>
    Text <.> ents
ents = star $ entStart <.> entEnd
entStart = EntName <+> EntHex <+> EntNum
entEnd = EntEndNone <+> EntEndSemi
The flags can be checked against the grammar (though this is not usually done at runtime).
Some flags must always be empty (TagShut/TagShutEnd/EntEndNone/EntEndSemi);
this can also be checked with the grammar.
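
One way the grammar combinators could be realised, as a sketch only. The
Flag constructors are reconstructed from the names used in the grammar above
(the real list is elided in this document). The names tok, mustBeEmpty and
valid are hypothetical, and the fixities, the left-biased choice and the
greedy star are assumptions. The empty-flag set is taken to be TagShut,
TagShutEnd, EntEndNone and EntEndSemi. Warn flags are skipped before checking.

```haskell
data Flag
    = TagOpen | TagShut | TagShutEnd | TagClose | Comment | Text
    | AttName | AttVal
    | EntName | EntHex | EntNum | EntEndNone | EntEndSemi
    | Warn String
    deriving (Show, Eq)

-- A grammar consumes a prefix of the input and returns what is left.
type Grammar = [(Flag, String)] -> Maybe [(Flag, String)]

infixr 3 <.>
infixr 2 <+>

-- Flags that must always carry an empty string.
mustBeEmpty :: Flag -> Bool
mustBeEmpty f = f `elem` [TagShut, TagShutEnd, EntEndNone, EntEndSemi]

-- Match one flag, also enforcing the emptiness rule.
tok :: Flag -> Grammar
tok f ((g, s) : rest)
    | f == g && (not (mustBeEmpty f) || null s) = Just rest
tok _ _ = Nothing

-- Sequence: run p, then q on what p left.
(<.>) :: Grammar -> Grammar -> Grammar
(p <.> q) xs = p xs >>= q

-- Choice: try p, fall back to q on the same input.
(<+>) :: Grammar -> Grammar -> Grammar
(p <+> q) xs = case p xs of
    Nothing -> q xs
    ok      -> ok

-- Greedy repetition; stops when p fails or makes no progress.
star :: Grammar -> Grammar
star p xs = case p xs of
    Just rest | length rest < length xs -> star p rest
    _                                   -> Just xs

grammar, ents, entStart, entEnd :: Grammar
grammar = star $
    tok TagOpen <.> star (tok AttVal <.> ents <+> tok AttName)
                <.> (tok TagShut <+> tok TagShutEnd) <+>
    tok TagClose <+>
    tok Comment <+>
    tok Text <.> ents
ents     = star $ entStart <.> entEnd
entStart = tok EntName <+> tok EntHex <+> tok EntNum
entEnd   = tok EntEndNone <+> tok EntEndSemi

-- A flag stream is valid if the grammar consumes all of it.
valid :: [(Flag, String)] -> Bool
valid xs = grammar [x | x@(f, _) <- xs, not (isWarn f)] == Just []
  where
    isWarn (Warn _) = True
    isWarn _        = False
```

Such a check is a natural fit for a test suite run over the lexer output,
which matches the note that it is not usually done at runtime.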
---------------------------------------------------------------------
-- OPTIMISATIONS
* UNPACK on Position
* Change Flag to Int# when no Warn elements
* Eliminate all Position/Warn when not used
* Deforest all stages
* Pull the opt flags upwards
* lexer/strings should not copy on ByteString/lazy ByteString (BS/LBS) (the big one!)
  For BS/LBS, generate [(Flag,a)] in one step
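
A toy illustration of the no-copy idea for strict ByteString, not the planned
lexer. It relies on the fact that take and drop on a strict ByteString are
O(1) and share the original buffer. Span, slice and angleSpans are
hypothetical names, and the "lexer" only finds the text between each '<' and
the next '>'.

```haskell
import qualified Data.ByteString.Char8 as BS

-- A region of the input: byte offset and length.
data Span = Span !Int !Int
    deriving (Show, Eq)

-- View a span of the input; shares the input's buffer instead of copying.
slice :: BS.ByteString -> Span -> BS.ByteString
slice input (Span off len) = BS.take len (BS.drop off input)

-- Spans of the text between each '<' and the following '>'.
angleSpans :: BS.ByteString -> [Span]
angleSpans input = go 0
  where
    go from = case BS.elemIndex '<' (BS.drop from input) of
        Nothing -> []
        Just i  ->
            let start = from + i + 1
            in case BS.elemIndex '>' (BS.drop start input) of
                Nothing -> []
                Just n  -> Span start n : go (start + n + 1)
```

For example, map (slice input) (angleSpans input) yields the tag bodies as
views onto the original buffer, so no string data is duplicated.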