Skip to content
This repository was archived by the owner on Jul 3, 2023. It is now read-only.

ANY23-247 FIX Attribute name itemscope associated with an element type html must be followed by the ' = ' character.#17

Merged
asfgit merged 1 commit intoapache:masterfrom
lewismc:ANY23-247
Apr 2, 2016
Merged

ANY23-247 FIX Attribute name itemscope associated with an element type html must be followed by the ' = ' character.#17
asfgit merged 1 commit intoapache:masterfrom
lewismc:ANY23-247

Conversation

@lewismc
Copy link
Member

@lewismc lewismc commented Mar 30, 2015

Hi Folks,
PR which fixes this issue locally. I am getting clean builds now again after introducing this new MissingItemscopeAttributeValueRule class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this class recognised or instantiated? META-INF/services/ or another method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked for it being registered during a single document extraction. It was my understanding that validation and fixes are registered and active as part of the extraction parameters agenda? If a vanilla SingleDocumentExtration is invoked... as per the Any23Test then by default the Fixes and Validations are activated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be done using a classpath scan. I will look into it further.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

On Monday, March 30, 2015, Peter Ansell notifications@github.com wrote:

In
core/src/main/java/org/apache/any23/validator/rule/MissingItemscopeAttributeValueRule.java
#17 (comment):

+/**

  • * This fixes missing attribute values for the 'itemscope' attribute,
  • * which was be associated with
    nodes.
  • * Typically when such a snippet of XHTML is fed through the
  • * {@link org.apache.any23.extractor.rdfa.RDFa11Extractor}, and
  • * subsequently to Sesame's {@link org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser},
  • * it will result in the following behavior.
  • *
  • * {@code
  • * [Fatal Error] :23:15: Attribute name "itemscope" associated with an element type "div" must be followed by the ' = ' character.
  • * }
  • *
  • * This Fix is an effort to mitigate against that happening.
  • */
    +public class MissingItemscopeAttributeValueRule implements Fix {

It may be done using a classpath scan. I will look into it further.


Reply to this email directly or view it on GitHub
https://github.com/apache/any23/pull/17/files#r27442717.

Lewis

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything I've uploaded to the patch is what I have coded. There is no
other black magic on my end to get this invoked.

On Monday, March 30, 2015, Lewis John Mcgibbney lewis.mcgibbney@gmail.com
wrote:

Ack

On Monday, March 30, 2015, Peter Ansell <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

In
core/src/main/java/org/apache/any23/validator/rule/MissingItemscopeAttributeValueRule.java
#17 (comment):

+/**

  • * This fixes missing attribute values for the 'itemscope' attribute,
  • * which was be associated with
    nodes.
  • * Typically when such a snippet of XHTML is fed through the
  • * {@link org.apache.any23.extractor.rdfa.RDFa11Extractor}, and
  • * subsequently to Sesame's {@link org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser},
  • * it will result in the following behavior.
  • *
  • * {@code
  • * [Fatal Error] :23:15: Attribute name "itemscope" associated with an element type "div" must be followed by the ' = ' character.
  • * }
  • *
  • * This Fix is an effort to mitigate against that happening.
  • */
    +public class MissingItemscopeAttributeValueRule implements Fix {

It may be done using a classpath scan. I will look into it further.


Reply to this email directly or view it on GitHub
https://github.com/apache/any23/pull/17/files#r27442717.

Lewis

Lewis

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a hardcoded set in DefaultValidator.loadDefaultRules, but I can't find any place that is doing classpath scanning there.

I also do not understand the relationship between Rule and Fix. In the DefaultValidator, there are either Rule, or Rule+Fix, not just a Fix like you have here.

I will look into it further when I get a chance.

@ansell
Copy link
Member

ansell commented Mar 31, 2015

Could you rebase your branch onto upstream master and try again?

The place where the error started to become visible as a test failure (when I started rethrowing an exception that was being swallowed incorrectly) is on the current master, but your master branch is 4 commits behind that so the test will still silently succeed on your branch.

@lewismc
Copy link
Member Author

lewismc commented Mar 31, 2015

@ansell done, the branch is now 2 ahead of master. Yes you are correct, my results are as follows

Results :

Failed tests:
  RoverTest.testRunMultiURLs:104->runWithMultiSourcesAndVerify:119 Unexpected exit code. expected:<0> but was:<1>
  RoverTest.testRunMultiFiles:65->runWithMultiSourcesAndVerify:119 Unexpected exit code. expected:<0> but was:<1>

Tests in error:
  Any23Test.testMicrodataSupport:483->assertExtractorActivation:589->detectAndExtract:558 » Extraction

Tests run: 391, Failures: 2, Errors: 1, Skipped: 9

@lewismc
Copy link
Member Author

lewismc commented Mar 31, 2015

By the way @ansell, an observation is that whenever we make an attempt to infer the document language, we never succeed. It is always returns null. On every single occasion I get back null.

@lewismc
Copy link
Member Author

lewismc commented Mar 31, 2015

When I debug this, a good place to set a breakpoint is at line
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L253
The parse fails on the RDFA1.1 parser with the following error... still

  [Fatal Error] :23:15: Attribute name "itemscope" associated with an element type "div" must be followed by the ' = ' character.
[2015-03-31 04:46:46,618]DEBUG544766[main] - org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:488) - html-rdfa11: Error while parsing RDF document.

@lewismc
Copy link
Member Author

lewismc commented Feb 20, 2016

@ansell the line I am getting the error on is away down in semargl here
https://github.com/levkhomich/semargl/blob/ee8b35fc330deae6cb623fa3c57f583f3684bb76/rdfa/src/main/java/org/semarglproject/rdf/rdfa/RdfaParser.java#L1130
I am going to investigate this issue again this weekend as it is high time we got Any23 master back to successful healthy builds.
It seems to be 'too' strict for what I think Any23 should be providing. We should acknowledge an allow markup scenarios which present an itemscope attribute without a value. It seems like we cannot avoid the exception though unless we skip an entire document extraction... which is not favorable by any means.
What do you think?

…e html must be followed by the ' = ' character
@lewismc
Copy link
Member Author

lewismc commented Mar 25, 2016

hi @ansell OK I've added in the correct rule and fix as well as a test to verify that empty itemscope values are identified and fixed.
Whilst debugging this however the core issue persists. Reasoning for this is that RDFa11Extractor extends BaseRDFExtractor which inherits the parser function inputstream parameter. This input stream is not the 'fixed' steam but the raw document.
The only way I can think around this is for us to

What do you think?

@ansell
Copy link
Member

ansell commented Mar 25, 2016

The system does seem a little too complex for our purposes and isn't usable because of that.

Removing generics would be the first step IMO as there are too many rawtypes definitions which indicate generics are being used badly.

ContentExtractor may be able to be completely removed instead of being refitted into the process after that and the parser should always be set to parse as far as practical for our purposes.

It is a little strange that there isn't a buffered, markable, InputStream provided for all of the steps to reuse as necessary rather than pushing a raw InputStream or other source into different extractors.

@lewismc
Copy link
Member Author

lewismc commented Mar 25, 2016

I agree. Jumping through this in the debugged made me think the same.
I think it is different if Any23 is to be a PURE implementation... But that
is clearly not the case. Any23 fits in best when it can be used to extract
semantics from any old crap input that it is fed. Parsers and extractors
should not fail when there is a piece of crap input HTML. Currently,
that's exactly what happens and it is extremely limiting.

I would like to propose that this PR is committed to master as is, we then
open a brand new issue which acts exactly your comments refactoring out
content extractor and reusing the input stream which has been fixed, etc.

Any thoughts Peter? Thanks fr quick response.

On Friday, March 25, 2016, Peter Ansell notifications@github.com wrote:

The system does seem a little too complex for our purposes and isn't
usable because of that.

Removing generics would be the first step IMO as there are too many
rawtypes definitions which indicate generics are being used badly.

ContentExtractor may be able to be completely removed instead of being
refitted into the process after that and the parser should always be set to
parse as far as practical for our purposes.

It is a little strange that there isn't a buffered, markable, InputStream
provided for all of the steps to reuse as necessary rather than pushing a
raw InputStream or other source into different extractors.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#17 (comment)

Lewis

@ansell
Copy link
Member

ansell commented Mar 25, 2016

I tested this pull request and it has a few failing tests for me. I know that the Any23 master hasn't been perfect for its test record (mostly due to unreliable remote queries), but I haven't been watching recently to know which tests are expected to fail.

@lewismc
Copy link
Member Author

lewismc commented Mar 25, 2016

ACK @ansell , master branch is unstable with the following test failures

https://builds.apache.org/view/A-D/view/Any23/job/Any23-trunk/1466/#showFailuresLink

If you can reproduce this locally (or up until your test build fails within core with 3 failing tests) then that is the 'expected' behaviour right now. The Microdata test is directly related to the issue we are now discussing here.

This issue is the most pressing for Any23 right now, IMHO it is a complete blocker to us releasing Any23 1.2.

Thanks for the review.

@lewismc
Copy link
Member Author

lewismc commented Mar 29, 2016

@ansell any further comments here? I will try to get to work on the larger issue this week.

@asfgit asfgit merged commit fc45932 into apache:master Apr 2, 2016
asfgit pushed a commit that referenced this pull request Apr 2, 2016
…pe html must be followed by the ' = ' character. this closes #17
@lewismc lewismc deleted the ANY23-247 branch April 2, 2016 19:52
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments