Add support for parsing large SPARQL XML Results #357

njh · 2020-08-27T00:53:50Z

Replaces #294

Does it need a test for the two places loadXML() is called?

I would rather not add a large file to the Git repo.

k00ni · 2020-08-27T08:48:27Z

You could create a big file during a test, parse it and remove it afterwards. But as far as I understand the documentation about LIBXML_PARSEHUGE ...

which relaxes any hardcoded limit from the parser. This affects limits like maximum depth of a document or the entity recursion, as well as limits of the size of text nodes. (Source: https://www.php.net/manual/en/libxml.constants.php)

... it changes parameters not behavior. So a test might not be required here.

Why another PR besides #352?

njh · 2020-08-27T22:53:58Z

#352 is a fix for parsing large RDF/XML files.

This PR is for parsing SPARQL Results and XML Literals.

njh · 2020-08-27T23:29:28Z

I wrote a quick test to check if adding LIBXML_PARSEHUGE to \EasyRdf\Sparql\Result

It just created a result set with a large number of <result> elements.

    public function testSelectHugeXml()
    {
        $huge = "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\">\n";
        $huge .= "<head><variable name=\"s\"/><variable name=\"p\"/><variable name=\"o\"/></head>\n";
        $huge .= "<results>\n";
        for ($i = 1; $i < 50000; $i++) {
            $huge .= "<result>\n";
            $huge .= "<binding name=\"s\"><uri>http://www.example.com/person/$i</uri></binding>\n";
            $huge .= "<binding name=\"p\"><uri>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</uri></binding>\n";
            $huge .= "<binding name=\"o\"><uri>http://xmlns.com/foaf/0.1/Person</uri></binding>\n";
            $huge .= "</result>\n";
        }
        $huge .= "</results>\n</sparql>\n";

        // Check it is more than 10Mb
        $this->assertGreaterThan(10485760, strlen($huge));

        $result = new Result($huge, 'application/sparql-results+xml');

        $this->assertCount(50000, $result);
        $this->assertSame(50000, $result->numRows());
    }

However it is so slow to run, it takes longer than 60 seconds and the test times out.
Feels like this is a performance bug of its own.

I don't think it is the parsing of the XML that is slow - it is the stepping through the results that is slow 😞
Which is weird, because it is much less complex than parsing RDF/XML.

njh · 2020-09-02T00:02:57Z

I have rewritten the SPARQL XML parser to use XMLReader instead of DOMDocument.

It now parses results more than 10Mb in size and much much faster.
It is a bit hacky - could do with some cleaning up before merging.
I'm also not confident that I haven't introduced bugs - although the tests pass.

Would be good to profile the old and new parser and see how much faster it is.

k00ni · 2020-09-25T12:06:00Z

What is the state of this PR?

njh · 2020-09-25T17:23:43Z

The code currently in the PR is quite hacky - it is just based on XML element names - rather than checking that the path is correct. This could cause it to do weird things.

I started refactoring it so that there is a new \EasyRdf\XMLParser class and a new \EasyRdf\Sparql\ResultXMLParser class that subclasses it - but I decided that this was over-engineered.

I am in the process for creating a \EasyRdf\XMLParser class that allows you to register callbacks - so that the SPARQL result parsing code will remain in \EasyRdf\Sparql\Result.

The new \EasyRdf\XMLParser class will also be used by the RDF/XML parser.

Hoping to get some time this weekend to finish this PR.

njh · 2020-10-01T23:33:30Z

I have finished creating a \EasyRdf\XMLParser class and refactoring the SPARQL result parsing code in \EasyRdf\Sparql\Result. Not totally happy with the code style, but I think it will do.

The performance improvement is significant. I did some quick parse time check on my laptop, running PHP 7.2.31. The units are seconds(!).

SPARQL Result Rows	DOMDocument	XMLReader
1000	0.09102296829	0.01886892319
2000	0.4148211479	0.03816699982
3000	1.046848059	0.06045007706
4000	1.861665964	0.07614183426
5000	2.973433018	0.09430813789
6000	4.380954981	0.1115550995
7000	5.838248968	0.1330370903
8000	8.454589128	0.1832821369
9000	12.74119997	0.2013540268
10000	18.93464804	0.1963009834
11000	15.5420959	0.2084708214
12000	19.11489892	0.2328841686
13000	23.05493402	0.252961874
14000	27.24416304	0.2764151096
15000	36.6921699	0.3372910023
16000	36.95573616	0.3119380474
17000	39.13428283	0.3308308125
18000	46.77152419	0.3676671982
19000	51.85528517	0.387378931
20000	53.71366787	0.3986120224

I had to use a logarithmic scale, to compare the two on a chart:

The tests are all passing - included the new test that checks it can parse a XML document larger than 10 Megabytes.

@k00ni please can you review and merge if you are happy?

k00ni

Fantastic job!

The code looks good. I would change a few comments for styling reasons, but its fine. Will keep this open for a few days, if that's alright. This way our community has a chance to comment.

@zozlak if you have the time, it would be great if you could also have a look here.

Added LIBXML_PARSEHUGE to calls to loadXML()

bba8634

njh added bugfix WIP Work In Progress - don't merge labels Aug 27, 2020

njh requested a review from k00ni August 27, 2020 00:54

njh mentioned this pull request Aug 27, 2020

Implement LIBXML_PARSEHUGE #294

Closed

Merge branch 'master' into libxml-parsehuge

6ad3e6e

njh added 2 commits September 2, 2020 00:58

Added test for parsing SPARQL XML more than 10Mb

8a329ec

Re-wrote SPARQL XML parser to use XMLReader instead of DOMDocument

b9489da

njh changed the title ~~Added LIBXML_PARSEHUGE to calls to loadXML()~~ Add support for parsing large SPARQL XML Results Sep 2, 2020

Merge branch 'master' into libxml-parsehuge

3087b55

njh added 4 commits October 2, 2020 00:02

Added XMLParser utility class

ea16d2b

Added test for SPARQL Result namespace being wrong

a29dd19

Changed \EasyRdf\Sparql\Result class to use new XMLParser

8052188

Corrected parameter ordering for implode()

7bc98cf

njh removed the WIP Work In Progress - don't merge label Oct 1, 2020

njh added this to the 1.1.0 milestone Oct 1, 2020

k00ni previously approved these changes Oct 2, 2020

View reviewed changes

k00ni added the enhancement label Oct 2, 2020

Code style fixes

6bb7b08

njh dismissed k00ni’s stale review via 6bb7b08 October 2, 2020 13:38

k00ni approved these changes Oct 5, 2020

View reviewed changes

njh merged commit 032b284 into master Oct 5, 2020

njh deleted the libxml-parsehuge branch October 5, 2020 08:47

njh mentioned this pull request Oct 5, 2020

Switch to using XMLParser for parsing RDF/XML #352

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for parsing large SPARQL XML Results #357

Add support for parsing large SPARQL XML Results #357

njh commented Aug 27, 2020

k00ni commented Aug 27, 2020

njh commented Aug 27, 2020

njh commented Aug 27, 2020

njh commented Sep 2, 2020 •

edited

k00ni commented Sep 25, 2020

njh commented Sep 25, 2020

njh commented Oct 1, 2020 •

edited

k00ni left a comment •

edited

Add support for parsing large SPARQL XML Results #357

Add support for parsing large SPARQL XML Results #357

Conversation

njh commented Aug 27, 2020

k00ni commented Aug 27, 2020

njh commented Aug 27, 2020

njh commented Aug 27, 2020

njh commented Sep 2, 2020 • edited

k00ni commented Sep 25, 2020

njh commented Sep 25, 2020

njh commented Oct 1, 2020 • edited

k00ni left a comment • edited

Choose a reason for hiding this comment

njh commented Sep 2, 2020 •

edited

njh commented Oct 1, 2020 •

edited

k00ni left a comment •

edited