Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for parsing large SPARQL XML Results #357

Merged
merged 10 commits into from Oct 5, 2020
Merged

Conversation

njh
Copy link
Collaborator

@njh njh commented Aug 27, 2020

Replaces #294

Does it need a test for the two places loadXML() is called?

I would rather not add a large file to the Git repo.

@njh njh added bugfix WIP Work In Progress - don't merge labels Aug 27, 2020
@njh njh requested a review from k00ni August 27, 2020 00:54
@njh njh mentioned this pull request Aug 27, 2020
@k00ni
Copy link
Contributor

k00ni commented Aug 27, 2020

You could create a big file during a test, parse it and remove it afterwards. But as far as I understand the documentation about LIBXML_PARSEHUGE ...

which relaxes any hardcoded limit from the parser. This affects limits like maximum depth of a document or the entity recursion, as well as limits of the size of text nodes. (Source: https://www.php.net/manual/en/libxml.constants.php)

... it changes parameters not behavior. So a test might not be required here.

Why another PR besides #352?

@njh
Copy link
Collaborator Author

njh commented Aug 27, 2020

#352 is a fix for parsing large RDF/XML files.

This PR is for parsing SPARQL Results and XML Literals.

@njh
Copy link
Collaborator Author

njh commented Aug 27, 2020

I wrote a quick test to check if adding LIBXML_PARSEHUGE to \EasyRdf\Sparql\Result

It just created a result set with a large number of <result> elements.

    public function testSelectHugeXml()
    {
        $huge = "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\">\n";
        $huge .= "<head><variable name=\"s\"/><variable name=\"p\"/><variable name=\"o\"/></head>\n";
        $huge .= "<results>\n";
        for ($i = 1; $i < 50000; $i++) {
            $huge .= "<result>\n";
            $huge .= "<binding name=\"s\"><uri>http://www.example.com/person/$i</uri></binding>\n";
            $huge .= "<binding name=\"p\"><uri>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</uri></binding>\n";
            $huge .= "<binding name=\"o\"><uri>http://xmlns.com/foaf/0.1/Person</uri></binding>\n";
            $huge .= "</result>\n";
        }
        $huge .= "</results>\n</sparql>\n";

        // Check it is more than 10Mb
        $this->assertGreaterThan(10485760, strlen($huge));

        $result = new Result($huge, 'application/sparql-results+xml');

        $this->assertCount(50000, $result);
        $this->assertSame(50000, $result->numRows());
    }

However it is so slow to run, it takes longer than 60 seconds and the test times out.
Feels like this is a performance bug of its own.

I don't think it is the parsing of the XML that is slow - it is the stepping through the results that is slow 😞
Which is weird, because it is much less complex than parsing RDF/XML.

@njh njh changed the title Added LIBXML_PARSEHUGE to calls to loadXML() Add support for parsing large SPARQL XML Results Sep 2, 2020
@njh
Copy link
Collaborator Author

njh commented Sep 2, 2020

I have rewritten the SPARQL XML parser to use XMLReader instead of DOMDocument.

It now parses results more than 10Mb in size and much much faster.
It is a bit hacky - could do with some cleaning up before merging.
I'm also not confident that I haven't introduced bugs - although the tests pass.

Would be good to profile the old and new parser and see how much faster it is.

@k00ni
Copy link
Contributor

k00ni commented Sep 25, 2020

What is the state of this PR?

@njh
Copy link
Collaborator Author

njh commented Sep 25, 2020

The code currently in the PR is quite hacky - it is just based on XML element names - rather than checking that the path is correct. This could cause it to do weird things.

I started refactoring it so that there is a new \EasyRdf\XMLParser class and a new \EasyRdf\Sparql\ResultXMLParser class that subclasses it - but I decided that this was over-engineered.

I am in the process for creating a \EasyRdf\XMLParser class that allows you to register callbacks - so that the SPARQL result parsing code will remain in \EasyRdf\Sparql\Result.

The new \EasyRdf\XMLParser class will also be used by the RDF/XML parser.

Hoping to get some time this weekend to finish this PR.

@njh njh removed the WIP Work In Progress - don't merge label Oct 1, 2020
@njh
Copy link
Collaborator Author

njh commented Oct 1, 2020

I have finished creating a \EasyRdf\XMLParser class and refactoring the SPARQL result parsing code in \EasyRdf\Sparql\Result. Not totally happy with the code style, but I think it will do.

The performance improvement is significant. I did some quick parse time check on my laptop, running PHP 7.2.31. The units are seconds(!).

SPARQL Result Rows DOMDocument XMLReader
1000 0.09102296829 0.01886892319
2000 0.4148211479 0.03816699982
3000 1.046848059 0.06045007706
4000 1.861665964 0.07614183426
5000 2.973433018 0.09430813789
6000 4.380954981 0.1115550995
7000 5.838248968 0.1330370903
8000 8.454589128 0.1832821369
9000 12.74119997 0.2013540268
10000 18.93464804 0.1963009834
11000 15.5420959 0.2084708214
12000 19.11489892 0.2328841686
13000 23.05493402 0.252961874
14000 27.24416304 0.2764151096
15000 36.6921699 0.3372910023
16000 36.95573616 0.3119380474
17000 39.13428283 0.3308308125
18000 46.77152419 0.3676671982
19000 51.85528517 0.387378931
20000 53.71366787 0.3986120224

I had to use a logarithmic scale, to compare the two on a chart:
SPARQL Result Parser Performance

The tests are all passing - included the new test that checks it can parse a XML document larger than 10 Megabytes.

@k00ni please can you review and merge if you are happy?

@njh njh added this to the 1.1.0 milestone Oct 1, 2020
k00ni
k00ni previously approved these changes Oct 2, 2020
Copy link
Contributor

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic job!

The code looks good. I would change a few comments for styling reasons, but its fine. Will keep this open for a few days, if that's alright. This way our community has a chance to comment.

@zozlak if you have the time, it would be great if you could also have a look here.

@njh njh merged commit 032b284 into master Oct 5, 2020
@njh njh deleted the libxml-parsehuge branch October 5, 2020 08:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants