Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to using XMLParser for parsing RDF/XML #352

Closed
wants to merge 13 commits into from
Closed

Conversation

njh
Copy link
Collaborator

@njh njh commented Aug 10, 2020

This Pull Request is work in progress for fixing #350

  • Write test to demonstrate the problem
  • Add a test for parsing non-standard namespaces - don't currently seem to test this
  • Switch to using XMLReader API
  • Optimise code a bit - splitURI() isn't needed anymore, the XML parse can do this
  • Check performance of new parser - check it isn't slower than the old implementation

Also had to increase memory limit for PHP while running tests
@njh njh added the WIP Work In Progress - don't merge label Aug 10, 2020
@k00ni
Copy link
Contributor

k00ni commented Aug 10, 2020

Its good that you take care of this.

@njh
Copy link
Collaborator Author

njh commented Aug 26, 2020

I have just committed my work in progress on switching to the XMLReader class.

Annoying it it nearly works but not quite. Some tests are currently failing and I keep getting a very difficult to interpret error message:

EasyRdf\Parser\Exception: startElementHandler() called at state 3 in http://xmlns.com/foaf/0.1/primaryTopic

I am not quite sure what still needs changing/fixing from moving over to the new parser.

@njh
Copy link
Collaborator Author

njh commented Oct 5, 2020

I am going to re-factor this code to use the XMLParser added in #357

@njh
Copy link
Collaborator Author

njh commented Oct 11, 2020

I think I may have worked out why this isn't working.

I created this script:
https://gist.github.com/njh/b5ae7a4399b7c89765c81bc4dbb9df2e

And this is the output:

>>> \EasyRdf\XMLParser
  start: root
    start: one
      text: one
    end: one
    start: two
      text: two
    end: two
    start: three
  end: root
<<< \EasyRdf\XMLParser


>>> xml_parser_create_ns()
  start: root
    start: one
      text: one
    end: one
    start: two
      text: two
    end: two
    start: three
    end: three
  end: root
<<< xml_parser_create_ns()

So I think it might relate to the handling of empty XML elements.

@njh njh removed the help wanted label Oct 12, 2020
@njh
Copy link
Collaborator Author

njh commented Oct 12, 2020

Tests passing and build fixed! 🎉

Sorry that took so long to work out - the inner workings of the RDF/XML parser code is quite hard to understand.

Little bit of tidying up and it should be ready to merge.
Also want to check that the performance hasn't been made worse.

@k00ni
Copy link
Contributor

k00ni commented Oct 12, 2020

Good job so far 👍

@@ -201,8 +186,28 @@ protected function reify($t, $s, $p, $o, $sType, $oType, $oDatatype = null, $oLa
}

/** @ignore */
protected function startElementHandler($p, $t, $a)
public function startElementHandler($parser)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we type-hint $parser here? That might make it more clear what it is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, good idea.
It is a bit annoying that the callback has to be public.
But at least @ignore should remove it from the API documentation.

@njh njh mentioned this pull request Oct 12, 2020
1 task
}

foreach ($a as $key => $uri) {
$xmlns = 'http://www.w3.org/2000/xmlns/';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should not be in a loop, maybe make it a class property?

@njh njh changed the title Fix for parsing large RDF/XML files Switch to using XMLParser for parsing RDF/XML Oct 20, 2020
@njh njh modified the milestones: 1.1.0, 1.2.0 Oct 20, 2020
@njh
Copy link
Collaborator Author

njh commented Oct 20, 2020

Really close to giving up on this 😢

  1. Now that it sort-of work, I tested the performance and it is a little bit slower than the old implementation
  2. When parsing a largish RDF/XML file from dbpedia, it returned a larger number of triples than the old implementation and rapper. So there is still more work to be done fixing bugs. And writing tests.

It is possible that if I fix [2], then the performance will improve.

But I have moved this to 1.2.0 milestone and will back-port the original PR fix.

@k00ni
Copy link
Contributor

k00ni commented Oct 21, 2020

Don't give up!

Also, I rather have more readable code than better performance.

@njh
Copy link
Collaborator Author

njh commented Nov 30, 2020

Might pick this branch up in the future but closing this PR - it has been replaced by #370

@njh njh closed this Nov 30, 2020
@njh njh removed this from the 1.2.0 milestone Dec 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WIP Work In Progress - don't merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants