ANY23-226 Extract JSON-LD embedded in HTML #16
Important to state: this is largely based on our existing META extractor. We are merely looking for the presence of /HTML/HEAD/SCRIPT/.
The main bug was that the entire script node was being sent to JSONLD-Java, not just its content. However, I also made a few other changes while testing that. It turned out that the JSON-LD was invalid, but at some point the exception thrown when parsing fails had been changed to be silently swallowed, so the only indication was that the count was 0. I turned exception propagation back on (there is no reason to swallow it outside of temporary testing). However, in addition to the 4 tests currently failing in the core tests, other tests are now failing due to an inability to parse "<div itemscope>".
Ok Peter, thank you for looking. This is great. I have not seen the test
On Saturday, March 21, 2015, Peter Ansell notifications@github.com wrote:
Lewis
The test failures are in the Microdata parsing code, not JSONLD-Java, so I thought it was fine to push this even though it was going to break the Jenkins build (it was already silently broken before, due to the swallowed exception). The JSON-LD parsing now works. The key fix to what you had done was to send the first child of the script element, which is the actual JSON code.
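For anyone reading along, here is a minimal sketch of the idea behind that fix, not the actual patch. It uses only the JDK's DOM API on a well-formed XHTML string. The class name `JsonLdScriptSketch`, the helper names, and the filter on `type="application/ld+json"` are assumptions made for illustration; the extractor in this PR may select script elements differently. The point is that the text in the script element's first child is what gets handed to the JSON-LD parser, rather than the element node itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Any23 or JSONLD-Java.
class JsonLdScriptSketch {

    // Collects the text content of JSON-LD script elements found under HEAD.
    static List<String> extractJsonLd(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList heads = doc.getElementsByTagName("head");
        for (int h = 0; h < heads.getLength(); h++) {
            NodeList scripts = ((Element) heads.item(h)).getElementsByTagName("script");
            for (int i = 0; i < scripts.getLength(); i++) {
                Element script = (Element) scripts.item(i);
                // Assumption: only scripts typed as JSON-LD are of interest.
                if (!"application/ld+json".equalsIgnoreCase(script.getAttribute("type"))) {
                    continue;
                }
                // The first child is the text node holding the JSON itself.
                // Passing `script` (the element) to a JSON parser would fail.
                Node content = script.getFirstChild();
                if (content != null && content.getNodeValue() != null) {
                    result.add(content.getNodeValue().trim());
                }
            }
        }
        return result;
    }

    // Parses a well-formed XHTML string into a DOM document.
    static Document parse(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><script type=\"application/ld+json\">"
                + "{\"@context\": \"http://schema.org\", \"@type\": \"Person\"}"
                + "</script></head><body/></html>";
        for (String json : extractJsonLd(parse(page))) {
            System.out.println(json); // this string is what a JSON-LD parser should receive
        }
    }
}
```

Real-world HTML is rarely well-formed XML, so an actual extractor would work on the DOM produced by the HTML parser already in use; only the "take the first child's text" step carries over.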
Initial patch for this support.
It is not working correctly. @ansell, can you have a look into the parsing of the JSON-LD textual content?
I've added a '//' comment at the point where I can see the correct parser being selected. It does not seem to parse and extract the JSON-LD, so I know I am doing something wrong.
Thank you very much @ansell if you can have a wee look.