<html>
<head>
<title>A Lisp Based XML Parser</title>
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
</head>
<body>
<p><strong><big><big>A Lisp Based XML Parser</big></big></strong></p>
<p><a href="#intro">Introduction/Simple Example</a><br>
<a href="#lxml">LXML parse output format</a><br>
<a href="#props">parse-xml non-validating parser properties</a><br>
<a href="#modern">case and international character support issues</a><br>
<a href="#keyword">parse-xml and packages</a><br>
<a href="#namespace">parse-xml, the XML Namespace specification, and packages</a><br>
<a href="#unicode-scalar">ACL does not support Unicode 4 byte scalar values</a><br>
<a href="#big-endian">only little-endian Unicode tested in ACL 6.0 beta</a><br>
<a href="#debug">debugging aids</a><br>
<a href="#conformance">XML Conformance test results</a><br>
<a href="#build">Compiling and Loading the parser</a><br>
<a href="#reference">parse-xml reference</a></p>
<p><a name="intro"></a>The <strong>parse-xml </strong>generic function processes XML
input, returning a list of XML tags,<br>
attributes, and text. Here is a simple example:<br>
<br>
(parse-xml "&lt;item1&gt;&lt;item2 att1='one'/&gt;this is some
text&lt;/item1&gt;")<br>
<br>
--><br>
<br>
((item1 ((item2 att1 "one")) "this is some text"))<br>
<br>
The output format is known as LXML format.<br>
<br>
<a name="lxml"></a><strong>LXML Format</strong><br>
<br>
LXML is a list representation of XML tags and content.<br>
<br>
Each list member may be:<br>
<br>
a. a string containing text content, such as "Here is some text with a "<br>
<br>
b. a list representing an XML tag with associated attributes and/or content,
such as (item1 "text") or ((item1 att1 "help.html")
"link"). If the XML tag
has no attributes, the first list member is a
symbol naming the XML tag, and the remaining members
are its content: each is a string (text content), a symbol (an XML
tag with no attributes or content), or a list (a nested XML tag with
attributes and/or content). If the tag has attributes,
the first list member is itself a list: the tag symbol
followed by two members for each attribute, first a
symbol naming the attribute and then a string holding
the attribute value.<br>
<br>
c. XML comments and/or processing instructions; see the more detailed example below for
further information.</p>
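<p>The LXML shapes described above can be taken apart with a few small accessors. The
following is a minimal sketch, not part of the parser: the names lxml-tag, lxml-attributes,
and lxml-children are hypothetical helpers introduced here for illustration, and they assume
their argument is an element (a symbol or a list), not a text string.</p>

```lisp
;; Hypothetical helpers for reading LXML elements; not part of pxml.

(defun lxml-tag (node)
  "Return the tag symbol of the LXML element NODE."
  (cond ((symbolp node) node)                  ; bare symbol: tag with no attributes or content
        ((symbolp (first node)) (first node))  ; (tag content ...)
        (t (first (first node)))))             ; ((tag attr "value" ...) content ...)

(defun lxml-attributes (node)
  "Return the attribute list (attr \"value\" ...) of NODE, or NIL if it has none."
  (if (and (consp node) (consp (first node)))
      (rest (first node))
      nil))

(defun lxml-children (node)
  "Return the content members of NODE: strings, symbols, and nested lists."
  (if (consp node) (rest node) nil))

;; The element from the simple example in the introduction.
(defparameter *item1*
  '(item1 ((item2 att1 "one")) "this is some text"))
```

<p>Applied to *item1*, lxml-tag returns item1, and
(getf (lxml-attributes (first (lxml-children *item1*))) 'att1) returns "one", because the
attribute list alternates symbols and values like a property list.</p>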
<p><a name="props"></a><strong>Non Validating Parser Properties</strong></p>
<p>Parse-xml is a non-validating XML parser. It will detect non-well-formed XML input.
When<br>
processing valid XML input, parse-xml will optionally produce the same output as a
validating <br>
parser would, including the processing of an external DTD subset and external entity
declarations.<br>
<br>
By default, parse-xml outputs a DTD parse along with the parsed XML contents. The DTD
parse may<br>
be optionally suppressed. The following example shows DTD parsed output components:</p>
<p>(defvar *xml-example-external-url*<br>
"&lt;!ENTITY ext1 'this is some external entity %param1;'&gt;")<br>
<br>
(defun example-callback (var-name token &amp;optional public)<br>
(declare (ignorable token public))<br>
(setf var-name (uri-path var-name))<br>
(if* (equal var-name "null") then nil<br>
else<br>
(let ((string (eval (intern var-name (find-package
:user)))))<br>
(make-string-input-stream string))))<br>
<br>
(defvar *xml-example-string*<br>
"&lt;?xml version='1.0' encoding='utf-8'?&gt;<br>
&lt;!-- the following XML input is well-formed but its validity has not been checked ...
--&gt;<br>
&lt;?piexample this is an example processing instruction tag ?&gt;<br>
&lt;!DOCTYPE example SYSTEM '*xml-example-external-url*' [<br>
&lt;!ELEMENT item1 (item2* | (item3+ , item4))&gt;<br>
&lt;!ELEMENT item2 ANY&gt;<br>
&lt;!ELEMENT item3 (#PCDATA)&gt;<br>
&lt;!ELEMENT item4 (#PCDATA)&gt;<br>
&lt;!ATTLIST item1<br>
att1 CDATA #FIXED 'att1-default'<br>
att2 ID #REQUIRED<br>
att3 ( one | two | three ) 'one'<br>
att4 NOTATION ( four | five ) 'four' &gt;<br>
&lt;!ENTITY % param1 'text'&gt;<br>
&lt;!ENTITY nentity SYSTEM 'null' NDATA somedata&gt;<br>
&lt;!NOTATION notation SYSTEM 'notation-processor'&gt;<br>
]&gt;<br>
&lt;item1 att2='1'&gt;&lt;item3&gt;&amp;ext1;&lt;/item3&gt;&lt;/item1&gt;")<br>
<br>
(pprint (parse-xml *xml-example-string* :external-callback 'example-callback))<br>
<br>
--><br>
<br>
((:xml :version "1.0" :encoding "utf-8")<br>
(:comment " the following XML input is well-formed but its validity has not been checked ...
")<br>
(:pi :piexample "this is an example processing instruction tag ")<br>
(:DOCTYPE :example<br>
(:[ (:ELEMENT :item1 (:choice (:* :item2) (:seq (:+ :item3) :item4))) <br>
(:ELEMENT :item2 :ANY)<br>
(:ELEMENT :item3 :PCDATA) (:ELEMENT :item4
:PCDATA)<br>
(:ATTLIST item1 (att1 :CDATA :FIXED
"att1-default") (att2 :ID :REQUIRED)<br>
(att3
(:enumeration :one :two :three) "one") <br>
(att4 (:NOTATION
:four :five) "four"))<br>
(:ENTITY :param1 :param "text") <br>
(:ENTITY :nentity :SYSTEM "null"
:NDATA :somedata)<br>
(:NOTATION :notation :SYSTEM
"notation-processor"))<br>
(:external (:ENTITY :ext1 "this is some external entity
text")))<br>
((item1 att1 "att1-default" att2 "1" att3 "one"
att4 "four") <br>
(item3 "this is some external entity
text")))<br>
<br>
<br>
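<p>In the output above, the XML declaration, comment, processing instruction, and DOCTYPE
entries are lists headed by a keyword, while the element entry is headed by an element
symbol or by a list. As a rough sketch (the function name lxml-element-entries is
hypothetical, and the set of marker keywords is taken only from this example's output),
the element content can be separated from the rest like this:</p>

```lisp
;; Hypothetical helper; the marker keywords come from the example output above.
(defparameter *non-element-markers* '(:xml :comment :pi :DOCTYPE))

(defun lxml-element-entries (parse-result)
  "Return the members of PARSE-RESULT that are not declaration, comment,
processing instruction, or DOCTYPE entries."
  (remove-if (lambda (entry)
               (and (consp entry)
                    (member (first entry) *non-element-markers*)))
             parse-result))
```

<p>Only the top level of the parse result is filtered; comments or processing instructions
nested inside an element are left where they are.</p>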
<strong><big>Usage Notes</big></strong><br>
<br>
<ol>
<li><a name="modern"></a>The parse-xml function has been compiled and tested primarily in a
Modern ACL. In an ANSI Lisp with wide character support, however, it passes the valid
component of the conformance suite just as it does in a Modern Lisp. Correct operation
of the parser on all possible input depends on wide character support.
<br><br>
</li>
<li><a name="keyword"></a>The parser uses the keyword package for DTD tokens and other
special XML tokens. Since element and attribute token symbols are usually interned
in the current package, it is not recommended to execute parse-xml
when the current package is the keyword package.
<br><br>
</li>
<li><a name="namespace"></a>The XML parser supports the XML Namespaces specification. The
parser recognizes an "xmlns" attribute and attribute names starting with
"xmlns:".
As per the specification, the parser expects the associated value
to be a URI string. The parser then associates each XML Namespace prefix with a
Lisp package, either one provided via the parse-xml :uri-to-package option or, if
necessary, one created on the fly. The following example demonstrates
this behavior:<br>
<p>(setf *xml-example-string4*<br>
"&lt;bibliography<br>
xmlns:bib='http://www.bibliography.org/XML/bib.ns'<br>
xmlns='urn:com:books-r-us'&gt;<br>
&lt;bib:book owner='Smith'&gt;<br>
&lt;bib:title&gt;A Tale of Two Cities&lt;/bib:title&gt;<br>
&lt;bib:bibliography<br>
xmlns:bib='http://www.franz.com/XML/bib.ns'<br>
xmlns='urn:com:books-r-us'&gt;<br>
&lt;bib:library branch='Main'&gt;UK
Library&lt;/bib:library&gt;<br>
&lt;bib:date calendar='Julian'&gt;1999&lt;/bib:date&gt;<br>
&lt;/bib:bibliography&gt;<br>
&lt;bib:date calendar='Julian'&gt;1999&lt;/bib:date&gt;<br>
&lt;/bib:book&gt;<br>
&lt;/bibliography&gt;")<br>
<br>
(setf *uri-to-package* nil)<br>
(setf *uri-to-package*<br>
(acons (parse-uri <a href="http://www.bibliography.org/XML/bib.ns">"http://www.bibliography.org/XML/bib.ns"</a>)<br>
(make-package "bib") *uri-to-package*))<br>
(setf *uri-to-package*<br>
(acons (parse-uri "urn:com:books-r-us")<br>
(make-package "royal") *uri-to-package*))<br>
(setf *uri-to-package*<br>
(acons (parse-uri "http://www.franz.com/XML/bib.ns")<br>
(make-package "franz-ns") *uri-to-package*))<br>
(pprint (multiple-value-list<br>
(parse-xml
*xml-example-string4*<br>
:uri-to-package
*uri-to-package*)))<br>
<br>
--><br>
((((bibliography |xmlns:bib| <a href="http://www.bibliography.org/XML/bib.ns">"http://www.bibliography.org/XML/bib.ns"</a><br>
xmlns "urn:com:books-r-us")<br>
"<br>
"<br>
((bib::book royal::owner "Smith") "<br>
" (bib::title "A Tale of Two
Cities") "<br>
"<br>
((bib::bibliography royal::|xmlns:bib|<br>
"http://www.franz.com/XML/bib.ns" royal::xmlns<br>
"urn:com:books-r-us")<br>
"<br>
" ((franz-ns::library royal::branch
"Main") "UK Library") "<br>
" ((franz-ns::date royal::calendar
"Julian") "1999") "<br>
")<br>
"<br>
" ((bib::date royal::calendar
"Julian") "1999") "<br>
")<br>
"<br>
"))<br>
((#&lt;uri http://www.franz.com/XML/bib.ns&gt; . #&lt;The franz-ns package&gt;)<br>
(#&lt;uri urn:com:books-r-us&gt; . #&lt;The royal package&gt;)<br>
(#&lt;uri http://www.bibliography.org/XML/bib.ns&gt; . #&lt;The bib package&gt;)))<br>
<br>
</li>
<li>In the absence of XML Namespace attributes, element and attribute symbols are interned
in the current package. Note that this implies that attributes and elements referenced
in DTD content will be interned in the current package.
</li>
<li>The parse-xml function has been tested using the OASIS conformance test suite (see
details below). The test suite has wide coverage across possible XML and DTD syntax,
but there may be some syntax paths that have not yet been tested or completely
supported. Here is a list of currently known syntax parsing issues:
<ul>
<li><a name="unicode-scalar"></a>ACL does not support 4 byte Unicode scalar values, so
input containing such data
will not be processed correctly. (Note, however, that parse-xml does correctly detect
and process wide Unicode input.)
</li>
<li><a name="big-endian"></a>The OASIS tests that contain wide Unicode all use a
little-endian encoded Unicode.
Changes to the unicode-check function are required to also support big-endian encoded
Unicode. (Note also that this issue may be resolved by an ACL 6.0 final release change.)
</li>
<li>An initial &lt;?xml declaration in external entity files is skipped without
checking whether the declaration is itself correct.
</li>
</ul>
</li>
<li><a name="debug"></a>When investigating possible parser errors or examining more closely
where the parser
determined that the input was non-well-formed, the net.xml.parser internal symbols
*debug-xml* and *debug-dtd* are useful. When not bound to nil, these variables cause
lexical analysis and intermediate parsing results to be output to *standard-output*.
</li>
<li><a name="loading"></a>It is necessary to load the <b>pxml</b> module before using it.
Typically this can be done by evaluating <b>(require :pxml)</b>.
</li>
</ol>
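<p>Following the usage note on packages above, one way to keep element and attribute
symbols out of the keyword package (and out of your working package) is to bind *package*
around the parse. This is a sketch using only standard Common Lisp; call-in-package and the
package name "xml-scratch" are hypothetical, and the thunk stands in for a call to
parse-xml.</p>

```lisp
;; Hypothetical helper, not part of pxml: run THUNK with *package* bound
;; to the package called NAME, creating that package if it does not exist.
(defun call-in-package (name thunk)
  (let ((*package* (or (find-package name)
                       (make-package name :use nil))))
    (funcall thunk)))

;; Stand-in for a parse: intern a tag name in whatever package is current.
(defparameter *tag*
  (call-in-package "xml-scratch" (lambda () (intern "item1"))))
```

<p>The symbol stored in *tag* belongs to the scratch package, not to the package that was
current when call-in-package was called.</p>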
<a name="conformance"></a><strong>XML Conformance Test Suite</strong><br>
<br>
Using the OASIS test suite <a href="http://www.oasis-open.org">(http://www.oasis-open.org)</a>,
here are the current parse-xml results:<br>
<br>
xmltest/invalid: Not tested, since parse-xml is a non-validating parser<br>
<br>
not-wf/<br>
<br>
ext-sa: 3 tests; all pass<br>
not-sa: 8 tests; all pass<br>
sa: 186 tests; the following fail:<br>
<br>
170.xml: fails because ACL does not support 4
byte Unicode scalar values<br>
<br>
valid/<br>
<br>
ext-sa: 14 tests; all pass<br>
not-sa: 31 tests; all pass<br>
sa: 119 tests; the following fail:<br>
<br>
052.xml, 064.xml, 089.xml: fail because ACL
does not support 4 byte <br>
Unicode scalar values<br>
<br>
<a name="build"></a><big><strong>Compiling and Loading</strong></big><br>
<br>
Loading build.cl into a modern ACL session produces a pxml.fasl file that can
subsequently be<br>
loaded into a modern ACL to provide XML parsing functionality.<br>
<br>
-------------------------------------------------------------------------------------------<br>
<br>
<a name="reference"></a><big><strong>parse-xml reference</strong></big><br>
<br>
parse-xml [Generic
function]<br>
<br>
Arguments: input-source &amp;key external-callback content-only <br>
general-entities
parameter-entities<br>
uri-to-package<br>
<br>
Returns multiple values:<br>
<ol>
<li>LXML and parsed DTD output, as described above.</li>
<li>An association list containing the uri-to-package argument conses (if any)
and conses associated with any XML Namespace packages created during the
parse (see uri-to-package argument description, below).</li>
</ol>
The external-callback argument, if specified, is a function object or symbol
that parse-xml will call when it encounters an external DTD subset
or external entity DTD declaration. Here is an example that shows what
arguments the function should expect and the value it should return:
<br><pre>
(defun file-callback (uri-object token &amp;optional public)
  ;; The uri-object is an ACL URI object created from
  ;; the XML input. In this example, this function
  ;; assumes that all URIs will be file specifications.
  ;;
  ;; The token argument identifies what token is associated
  ;; with the external parse (for example, :DOCTYPE for the
  ;; external DTD subset).
  ;;
  ;; The public argument contains the associated PUBLIC string,
  ;; when present.
  ;;
  (declare (ignorable token public))
  ;; An open stream is returned on success;
  ;; a nil return value indicates that the external
  ;; parse should not occur.
  ;; Note that parse-xml will close the open stream before exiting.
  (ignore-errors (open (uri-path uri-object))))
</pre>
<p>
The general-entities argument is an association list containing general entity symbol
and replacement text pairs. The entity symbols should be in the keyword package.
This option can help produce the desired parse results in
situations where you do not wish to parse external entities or the external DTD subset.
<p>
The parameter-entities argument is an association list containing parameter entity symbol
and replacement text pairs. The entity symbols should be in the keyword package.
This option can help produce the desired parse results in
situations where you do not wish to parse external entities or the external DTD subset.
<p>
The uri-to-package argument is an association list containing URI objects and package
objects. Typically, the URI objects correspond to XML Namespace attribute values, and
the package objects correspond to the desired package for interning symbols associated
with the URI namespace. If the parser encounters a URI object not contained in this list,
it will generate a new package. The first generated package will be named
net.xml.namespace.0,
the second will be named net.xml.namespace.1, and so on.
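<p>The lookup described above can be pictured as follows. This is only an illustrative
sketch, not the parser's implementation: package-for-uri, make-generated-package, and
*namespace-counter* are hypothetical names, and URIs are compared as plain strings here,
whereas parse-xml works with URI objects.</p>

```lisp
;; Illustrative sketch only; these names are hypothetical, not part of pxml.

(defvar *namespace-counter* 0
  "Next suffix to try when naming a generated namespace package.")

(defun make-generated-package ()
  "Create and return a fresh package named net.xml.namespace.N."
  (loop
    (let ((name (format nil "net.xml.namespace.~d" *namespace-counter*)))
      (incf *namespace-counter*)
      ;; Skip names that are already taken.
      (unless (find-package name)
        (return (make-package name :use nil))))))

(defun package-for-uri (uri alist)
  "Return two values: the package associated with the string URI, and ALIST,
extended with a new (uri . package) cons if URI was not already present."
  (let ((entry (assoc uri alist :test #'string=)))
    (if entry
        (values (cdr entry) alist)
        (let ((pkg (make-generated-package)))
          (values pkg (acons uri pkg alist))))))
```

<p>Calling package-for-uri a second time with the same URI string and the alist returned by
the first call yields the same package without creating a new one.</p>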
<h3>parse-xml methods</h3>
<pre>
(parse-xml (p stream) &amp;key
           external-callback content-only
           general-entities
           parameter-entities
           uri-to-package)
(parse-xml (str string) &amp;key
           external-callback content-only
           general-entities
           parameter-entities
           uri-to-package)
</pre>
An easy way to parse a file containing XML input:
<pre>
(with-open-file (p "example.xml")
  (parse-xml p :content-only t))
</pre>
<h3>net.xml.parser unexported special variables:</h3>
<p>
*debug-xml*<br>
<br>
When true, parse-xml generates XML lexical state and intermediary
parse result debugging output.
<p>
*debug-dtd*<br>
<br>
When true, parse-xml generates DTD lexical state and intermediary
parse result debugging output.
</body>
</html>