ANY23-247 FIX Attribute name itemscope associated with an element type html must be followed by the ' = ' character. by lewismc · Pull Request #17 · apache/any23

lewismc · 2015-03-30T16:43:28Z

Hi Folks,
PR which fixes this issue locally. I am getting clean builds now again after introducing this new MissingItemscopeAttributeValueRule class.

ansell · 2015-03-30T22:13:26Z

core/src/main/java/org/apache/any23/validator/rule/MissingItemscopeAttributeValueRule.java

How is this class recognised or instantiated? META-INF/services/ or another method?

I looked for it being registered during a single document extraction. It was my understanding that validation and fixes are registered and active as part of the extraction parameters agenda? If a vanilla SingleDocumentExtration is invoked... as per the Any23Test then by default the Fixes and Validations are activated.

It may be done using a classpath scan. I will look into it further.

Ack

On Monday, March 30, 2015, Peter Ansell notifications@github.com wrote:

In
core/src/main/java/org/apache/any23/validator/rule/MissingItemscopeAttributeValueRule.java
#17 (comment):

+/**

* This fixes missing attribute values for the 'itemscope' attribute,

* which was be associated with
nodes.

* Typically when such a snippet of XHTML is fed through the

* {@link org.apache.any23.extractor.rdfa.RDFa11Extractor}, and

* subsequently to Sesame's {@link org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser},

* it will result in the following behavior.

*

* {@code

* [Fatal Error] :23:15: Attribute name "itemscope" associated with an element type "div" must be followed by the ' = ' character.

* }

*

* This Fix is an effort to mitigate against that happening.

*/
+public class MissingItemscopeAttributeValueRule implements Fix {

It may be done using a classpath scan. I will look into it further.

—
Reply to this email directly or view it on GitHub
https://github.com/apache/any23/pull/17/files#r27442717.

Lewis

Everything I've uploaded to the patch is what I have coded. There is no
other black magic on my end to get this invoked.

On Monday, March 30, 2015, Lewis John Mcgibbney lewis.mcgibbney@gmail.com
wrote:

Ack

On Monday, March 30, 2015, Peter Ansell <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

In
core/src/main/java/org/apache/any23/validator/rule/MissingItemscopeAttributeValueRule.java
#17 (comment):

+/**

* This fixes missing attribute values for the 'itemscope' attribute,

* which was be associated with
nodes.

* Typically when such a snippet of XHTML is fed through the

* {@link org.apache.any23.extractor.rdfa.RDFa11Extractor}, and

* subsequently to Sesame's {@link org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser},

* it will result in the following behavior.

*

* {@code

* [Fatal Error] :23:15: Attribute name "itemscope" associated with an element type "div" must be followed by the ' = ' character.

* }

*

* This Fix is an effort to mitigate against that happening.

*/
+public class MissingItemscopeAttributeValueRule implements Fix {

It may be done using a classpath scan. I will look into it further.

—
Reply to this email directly or view it on GitHub
https://github.com/apache/any23/pull/17/files#r27442717.

Lewis

Lewis

There is a hardcoded set in DefaultValidator.loadDefaultRules, but I can't find any place that is doing classpath scanning there.

I also do not understand the relationship between Rule and Fix. In the DefaultValidator, there are either Rule, or Rule+Fix, not just a Fix like you have here.

I will look into it further when I get a chance.

ansell · 2015-03-31T01:32:49Z

Could you rebase your branch onto upstream master and try again?

The place where the error started to become visible as a test failure (when I started rethrowing an exception that was being swallowed incorrectly) is on the current master, but your master branch is 4 commits behind that so the test will still silently succeed on your branch.

lewismc · 2015-03-31T11:21:18Z

@ansell done, the branch is now 2 ahead of master. Yes you are correct, my results are as follows

Results :

Failed tests:
  RoverTest.testRunMultiURLs:104->runWithMultiSourcesAndVerify:119 Unexpected exit code. expected:<0> but was:<1>
  RoverTest.testRunMultiFiles:65->runWithMultiSourcesAndVerify:119 Unexpected exit code. expected:<0> but was:<1>

Tests in error:
  Any23Test.testMicrodataSupport:483->assertExtractorActivation:589->detectAndExtract:558 » Extraction

Tests run: 391, Failures: 2, Errors: 1, Skipped: 9

lewismc · 2015-03-31T11:45:43Z

By the way @ansell, an observation is that whenever we make an attempt to infer the document language, we never succeed. It is always returns null. On every single occasion I get back null.

lewismc · 2015-03-31T11:48:06Z

When I debug this, a good place to set a breakpoint is at line
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L253
The parse fails on the RDFA1.1 parser with the following error... still

  [Fatal Error] :23:15: Attribute name "itemscope" associated with an element type "div" must be followed by the ' = ' character.
[2015-03-31 04:46:46,618]DEBUG544766[main] - org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:488) - html-rdfa11: Error while parsing RDF document.

lewismc · 2016-02-20T07:36:29Z

@ansell the line I am getting the error on is away down in semargl here
https://github.com/levkhomich/semargl/blob/ee8b35fc330deae6cb623fa3c57f583f3684bb76/rdfa/src/main/java/org/semarglproject/rdf/rdfa/RdfaParser.java#L1130
I am going to investigate this issue again this weekend as it is high time we got Any23 master back to successful healthy builds.
It seems to be 'too' strict for what I think Any23 should be providing. We should acknowledge an allow markup scenarios which present an itemscope attribute without a value. It seems like we cannot avoid the exception though unless we skip an entire document extraction... which is not favorable by any means.
What do you think?

…e html must be followed by the ' = ' character

lewismc · 2016-03-25T22:00:01Z

hi @ansell OK I've added in the correct rule and fix as well as a test to verify that empty itemscope values are identified and fixed.
Whilst debugging this however the core issue persists. Reasoning for this is that RDFa11Extractor extends BaseRDFExtractor which inherits the parser function inputstream parameter. This input stream is not the 'fixed' steam but the raw document.
The only way I can think around this is for us to

refactor the RDFa1.1Extractor such that it extends TagSoupDomExtractor as oppose to (eventually) the ContentExtractor, or
undertake a mass refactoring which essentially removes the ContentExtractor altogether... this would provide us with a much more flexible and adaptable extraction framework IMHO.

What do you think?

ansell · 2016-03-25T22:15:44Z

The system does seem a little too complex for our purposes and isn't usable because of that.

Removing generics would be the first step IMO as there are too many rawtypes definitions which indicate generics are being used badly.

ContentExtractor may be able to be completely removed instead of being refitted into the process after that and the parser should always be set to parse as far as practical for our purposes.

It is a little strange that there isn't a buffered, markable, InputStream provided for all of the steps to reuse as necessary rather than pushing a raw InputStream or other source into different extractors.

lewismc · 2016-03-25T22:24:22Z

I agree. Jumping through this in the debugged made me think the same.
I think it is different if Any23 is to be a PURE implementation... But that
is clearly not the case. Any23 fits in best when it can be used to extract
semantics from any old crap input that it is fed. Parsers and extractors
should not fail when there is a piece of crap input HTML. Currently,
that's exactly what happens and it is extremely limiting.

I would like to propose that this PR is committed to master as is, we then
open a brand new issue which acts exactly your comments refactoring out
content extractor and reusing the input stream which has been fixed, etc.

Any thoughts Peter? Thanks fr quick response.

On Friday, March 25, 2016, Peter Ansell notifications@github.com wrote:

The system does seem a little too complex for our purposes and isn't
usable because of that.

Removing generics would be the first step IMO as there are too many
rawtypes definitions which indicate generics are being used badly.

ContentExtractor may be able to be completely removed instead of being
refitted into the process after that and the parser should always be set to
parse as far as practical for our purposes.

It is a little strange that there isn't a buffered, markable, InputStream
provided for all of the steps to reuse as necessary rather than pushing a
raw InputStream or other source into different extractors.

—
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#17 (comment)

Lewis

ansell · 2016-03-25T22:51:54Z

I tested this pull request and it has a few failing tests for me. I know that the Any23 master hasn't been perfect for its test record (mostly due to unreliable remote queries), but I haven't been watching recently to know which tests are expected to fail.

lewismc · 2016-03-25T23:10:24Z

ACK @ansell , master branch is unstable with the following test failures

https://builds.apache.org/view/A-D/view/Any23/job/Any23-trunk/1466/#showFailuresLink

If you can reproduce this locally (or up until your test build fails within core with 3 failing tests) then that is the 'expected' behaviour right now. The Microdata test is directly related to the issue we are now discussing here.

This issue is the most pressing for Any23 right now, IMHO it is a complete blocker to us releasing Any23 1.2.

Thanks for the review.

lewismc · 2016-03-29T04:22:09Z

@ansell any further comments here? I will try to get to work on the larger issue this week.

…pe html must be followed by the ' = ' character. this closes #17

ansell reviewed Mar 30, 2015
View reviewed changes

lewismc force-pushed the ANY23-247 branch from 3bafb0d to fc45932 Compare March 25, 2016 21:49

ANY23-247 FIX Attribute name itemscope associated with an element typ…

fc45932

…e html must be followed by the ' = ' character

asfgit merged commit fc45932 into apache:master Apr 2, 2016

asfgit pushed a commit that referenced this pull request Apr 2, 2016

ANY23-247 FIX Attribute name itemscope associated with an element ty…

57afcd1

…pe html must be followed by the ' = ' character. this closes #17

lewismc deleted the ANY23-247 branch April 2, 2016 19:52

Conversation

lewismc commented Mar 30, 2015

Uh oh!

ansell Mar 30, 2015

Choose a reason for hiding this comment

Uh oh!

lewismc Mar 30, 2015

Choose a reason for hiding this comment

Uh oh!

ansell Mar 30, 2015

Choose a reason for hiding this comment

Uh oh!

lewismc Mar 31, 2015

Choose a reason for hiding this comment

Uh oh!

lewismc Mar 31, 2015

Choose a reason for hiding this comment

Uh oh!

ansell Mar 31, 2015

Choose a reason for hiding this comment

Uh oh!

ansell commented Mar 31, 2015

Uh oh!

lewismc commented Mar 31, 2015

Uh oh!

lewismc commented Mar 31, 2015

Uh oh!

lewismc commented Mar 31, 2015

Uh oh!

lewismc commented Feb 20, 2016

Uh oh!

lewismc commented Mar 25, 2016

Uh oh!

ansell commented Mar 25, 2016

Uh oh!

lewismc commented Mar 25, 2016

Uh oh!

ansell commented Mar 25, 2016

Uh oh!

lewismc commented Mar 25, 2016

Uh oh!

lewismc commented Mar 29, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments