SAXParseException: <unknown>:2:0: syntax error #4

Open
sidgitind opened this issue Aug 18, 2019 · 2 comments
Comments

@sidgitind

Hi @WillKoehrsen, I am trying to run the code in the notebook and ran into an error in the XML parser code. I am getting a SAX error at this code snippet:

handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Iteratively process file
#handler._pages

for l in lines[-165:-109]:
    parser.feed(l)

The stack trace is as follows:


ExpatError Traceback (most recent call last)
myprojectpath\lib\xml\sax\expatreader.py in feed(self, data, isFinal)
216 # except when invoked from close.
--> 217 self._parser.Parse(data, isFinal)
218 except expat.error as e:

ExpatError: syntax error: line 2, column 0

During handling of the above exception, another exception occurred:

SAXParseException Traceback (most recent call last)
in ()
40
41 for l in lines[-165:-109]:
---> 42 parser.feed(l)
43
44 print(handler._pages)

myprojectpath\lib\xml\sax\expatreader.py in feed(self, data, isFinal)
219 exc = SAXParseException(expat.ErrorString(e.code), e, self)
220 # FIXME: when to invoke error()?
--> 221 self._err_handler.fatalError(exc)
222
223 def _close_source(self):

myprojectpath\lib\xml\sax\handler.py in fatalError(self, exception)
36 def fatalError(self, exception):
37 "Handle a non-recoverable error."
---> 38 raise exception
39
40 def warning(self, exception):

SAXParseException: <unknown>:2:0: syntax error

I would appreciate your help in resolving this issue so I can proceed. Thanks.

@sidgitind
Author

The dump I am using is enwiki-20190101-pages-articles-multistream.xml.bz2, and my machine is a Windows 10 laptop.

@sandertan

@sidgitind As far as I understand, the lines you're feeding into the parser are not valid XML because of the slice you take: [-165:-109]. That slice was valid for the 20180901 release used in the example, but it is not for the 20190101 release. Could you try a different slice, or experiment with removing the slice altogether?
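
One way to avoid hard-coded indices is to pick the lines by page boundaries instead. The following is only a rough sketch, not the notebook's code: it assumes that `<page>` and `</page>` each sit on their own line in the dump, and `TitleHandler`, `complete_pages`, `parse_titles` and the sample `lines` are hypothetical stand-ins I made up for illustration (the real `WikiXmlHandler` would take the place of `TitleHandler`).

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical stand-in for WikiXmlHandler: collects <title> text."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times per element, so buffer it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False
            self.titles.append("".join(self._buffer))


def complete_pages(lines):
    """Yield lists of lines, each spanning one <page> ... </page> block.

    Assumption: the opening and closing page tags are alone on their lines.
    Lines outside a complete block (including a truncated last page) are skipped.
    """
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            current = [line]
        elif current is not None:
            current.append(line)
            if stripped == "</page>":
                yield current
                current = None


def parse_titles(page_lines):
    """Feed one block of lines to a fresh SAX parser and return the titles."""
    handler = TitleHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for line in page_lines:
        parser.feed(line)
    parser.close()
    return handler.titles


# Made-up sample that starts mid-page, like a slice with the wrong offsets.
lines = [
    "leftover text from the previous page\n",
    "  </page>\n",
    "  <page>\n",
    "    <title>Example</title>\n",
    "    <text>Some wikitext</text>\n",
    "  </page>\n",
    "  <page>\n",
    "    <title>Truncated</title>\n",
]

for page_lines in complete_pages(lines):
    print(parse_titles(page_lines))
```

Feeding the sample `lines` directly to the parser raises a `SAXParseException`, because the input does not start with an opening tag. Feeding only the blocks returned by `complete_pages` parses cleanly, whatever the line offsets in a given release happen to be.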
