ANY23-321 Add openie toggle functionality to service
lewismc committed Feb 3, 2018
2 parents 4f40fe0 + 482e780 commit 63ffc9e3e8a8da0b4af7ca5b227f1e199e545227
Showing 42 changed files with 1,777 additions and 182 deletions.

This file was deleted.

@@ -1,5 +1,5 @@
Apache Any23
Copyright 2011-2017 The Apache Software Foundation
Copyright 2011-2018 The Apache Software Foundation
Copyright 2008-2011 Digital Enterprise Research Institute (DERI)

This product includes software developed by
@@ -1,3 +1,69 @@
Apache Any23 2.2
Release Notes
25/01/2018 (dd/mm/yyyy)

Sub-task

[ANY23-155] - Test failure: testRunOnHTTPResource(org.apache.any23.cli.MicrodataParserTest)
[ANY23-267] - Entire extractions fail due to "The element type 'meta' must be terminated by the matching end-tag </meta>"
[ANY23-268] - Entire extraction task fails due to "Element type "t.length" must be followed by either attribute specifications, ">" or "/>"

Bug

[ANY23-12] - character are wrongly encoded in rdfxml output
[ANY23-131] - Nested Microdata are not extracted
[ANY23-140] - Revise Any23 tests to remove fetching of web content
[ANY23-166] - Parsing crashes with attributes that don't use quotes
[ANY23-201] - Service Regularly Times Out on DBPedia Queries
[ANY23-227] - not extracting opengraph rdfa
[ANY23-228] - Invalid URI
[ANY23-230] - any23.org redirects to single slash URI
[ANY23-256] - MicrodataParserTest failing locally but not on Jenkins
[ANY23-260] - Get Any23 listed as an Application capable of using DBPedia
[ANY23-266] - Fix Issues with Failing WebService Examples
[ANY23-271] - Address "...The entity "raquo" was referenced, but not declared" SAXParseException
[ANY23-273] - The content of elements must consist of well-formed character data or markup - no bogus comments
[ANY23-303] - JsonLdError: loading remote context failed: http://schema.org/
[ANY23-306] - Absent binaries for version 2.0
[ANY23-312] - Triple sub-pred-null should not be added into outcome. Change traversing method.
[ANY23-314] - Service fails to return extraction in case of extraction error
[ANY23-316] - Yaml parser does not handle intentional null value
[ANY23-317] - Any23 fails when dealing with JavaScript
[ANY23-318] - ExtractionException handling in BaseRDFExtractor.java kills entire extraction
[ANY23-326] - parsing unclosed meta and input tags fails

New Feature

[ANY23-8] - Write a separate tool for RDFa/microformat detection tool usable in crawlers
[ANY23-233] - Add local extraction cache to Any23 service

Improvement

[ANY23-106] - Gracefully shut down Any23 service
[ANY23-213] - Implement JSON reporting for the Any23 service
[ANY23-214] - ë (e-umlaut or diaeresis) not decoded in RDF output
[ANY23-249] - Update all W3C and other Standards Compliance within Any23
[ANY23-280] - Refactor ContentExtractor to improve extraction flexibility
[ANY23-291] - JSON-LD should be looked up in entire HTML document, not just in <head>
[ANY23-298] - Revisit the OGP.java vocabulary and update it
[ANY23-309] - "Scraper" misspelled as "Scarper" on Downloads webpage
[ANY23-319] - Upgrade jsonld-java dependency to 0.11.1
[ANY23-324] - Replace net.sourceforge.nekohtml with jsoup
[ANY23-325] - Any23 incompatible with http://rdfa.info/test-suite/#

Test

[ANY23-320] - Address @Ignore tests in Any23

Wish

[ANY23-210] - Address 1.0 Release Review Discrepancies

Task

[ANY23-40] - Complete Documentation for Plugin Management system


Apache Any23 2.1
Release Notes
14/09/2017 (dd/mm/yyyy)
@@ -21,7 +21,7 @@
<parent>
<artifactId>apache-any23</artifactId>
<groupId>org.apache.any23</groupId>
<version>2.2-SNAPSHOT</version>
<version>2.3-SNAPSHOT</version>
<relativePath>../</relativePath>
</parent>

@@ -1,11 +1,12 @@
/*
* Copyright 2017 The Apache Software Foundation.
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
@@ -76,3 +76,7 @@ any23.extraction.csv.comment=#
# A confidence threshold for the OpenIE extractions
# Any extractions below this value will not be processed.
any23.extraction.openie.confidence.threshold=0.5

# Use legacy setting to parse html
# with NekoHTML instead of Jsoup
any23.tagsoup.legacy=off
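
The new `any23.tagsoup.legacy` key is an on/off flag that selects the HTML parser. As a rough illustration of how such a flag can be read and acted on, here is a minimal sketch using plain `java.util.Properties`. It does not use Any23's own configuration classes; `LegacyToggleSketch`, `load`, and `isFlagOn` are hypothetical names, and treating a missing key as "off" is an assumption made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class LegacyToggleSketch {

    // Hypothetical helper: parses properties text such as the lines in the hunk above.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Hypothetical helper: "on" (case-insensitive) enables the flag;
    // any other value, or a missing key, is treated as off (an assumption).
    static boolean isFlagOn(Properties props, String key) {
        return "on".equalsIgnoreCase(props.getProperty(key, "off").trim());
    }

    public static void main(String[] args) {
        Properties shipped = load("any23.tagsoup.legacy=off\n");
        Properties overridden = load("any23.tagsoup.legacy=on\n");

        // The parser names are labels only; no parser is invoked here.
        System.out.println(isFlagOn(shipped, "any23.tagsoup.legacy") ? "NekoHTML" : "Jsoup");
        System.out.println(isFlagOn(overridden, "any23.tagsoup.legacy") ? "NekoHTML" : "Jsoup");
    }
}
```

With the shipped default of `off`, the sketch picks the Jsoup branch; flipping the value to `on` picks the legacy branch.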
@@ -21,7 +21,7 @@
<parent>
<groupId>org.apache.any23</groupId>
<artifactId>apache-any23</artifactId>
<version>2.2-SNAPSHOT</version>
<version>2.3-SNAPSHOT</version>
<relativePath>../</relativePath>
</parent>

@@ -1,11 +1,12 @@
/*
* Copyright 2017 The Apache Software Foundation.
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
@@ -1,11 +1,12 @@
/*
* Copyright 2017 The Apache Software Foundation.
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
@@ -1,11 +1,12 @@
# Copyright 2017 The Apache Software Foundation.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -21,7 +21,7 @@
<parent>
<groupId>org.apache.any23</groupId>
<artifactId>apache-any23</artifactId>
<version>2.2-SNAPSHOT</version>
<version>2.3-SNAPSHOT</version>
<relativePath>../</relativePath>
</parent>

@@ -74,6 +74,10 @@
<groupId>net.sourceforge.nekohtml</groupId>
<artifactId>nekohtml</artifactId>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
</dependency>
<dependency>
<groupId>com.beust</groupId>
<artifactId>jcommander</artifactId>
@@ -123,8 +123,10 @@ private void extractLinkDefinedPrefixes(Document in) {
List<Node> linkNodes = DomUtils.findAll(in, "/HTML/HEAD/LINK");
for (Node linkNode : linkNodes) {
NamedNodeMap attributes = linkNode.getAttributes();
String rel = attributes.getNamedItem("rel").getTextContent();
String href = attributes.getNamedItem("href").getTextContent();
Node relNode = attributes.getNamedItem("rel");
String rel = relNode == null ? null : relNode.getTextContent();
Node hrefNode = attributes.getNamedItem("href");
String href = hrefNode == null ? null : hrefNode.getTextContent();
if (rel != null && href != null && RDFUtils.isAbsoluteIRI(href)) {
prefixes.put(rel, SimpleValueFactory.getInstance().createIRI(href));
}
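
The null guard introduced in this hunk can be tried in isolation. The sketch below is a JDK-only illustration, not code from the commit: `NullSafeAttributeSketch`, `attributeOrNull`, and `parseXml` are hypothetical names and the sample markup is invented. It relies on the documented behaviour that `NamedNodeMap.getNamedItem` returns `null` for an absent attribute, which is why chaining `getTextContent()` directly can throw a `NullPointerException`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class NullSafeAttributeSketch {

    // Hypothetical helper mirroring the guard in the hunk:
    // returns the attribute value, or null when the attribute is absent.
    static String attributeOrNull(Node element, String name) {
        NamedNodeMap attributes = element.getAttributes();
        if (attributes == null) {
            return null; // non-element nodes have no attribute map
        }
        Node attribute = attributes.getNamedItem(name);
        return attribute == null ? null : attribute.getTextContent();
    }

    // Hypothetical helper: parses a well-formed XML string with the JDK parser.
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("sample markup failed to parse", e);
        }
    }

    public static void main(String[] args) {
        // The second link has no rel attribute, the case that used to fail.
        Document doc = parseXml("<head>"
                + "<link rel=\"dc\" href=\"http://purl.org/dc/terms/\"/>"
                + "<link href=\"style.css\"/>"
                + "</head>");
        NodeList links = doc.getElementsByTagName("link");
        for (int i = 0; i < links.getLength(); i++) {
            Node link = links.item(i);
            System.out.println(attributeOrNull(link, "rel") + " -> " + attributeOrNull(link, "href"));
        }
    }
}
```

Because both values may now be `null`, the existing `rel != null && href != null` check is what skips incomplete links.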
@@ -135,7 +137,7 @@ private Set<JSONLDScript> extractJSONLDScript(Document in,
String baseProfile, ExtractionParameters extractionParameters,
ExtractionContext extractionContext, ExtractionResult out)
throws IOException, ExtractionException {
List<Node> scriptNodes = DomUtils.findAll(in, "/HTML/HEAD/SCRIPT");
List<Node> scriptNodes = DomUtils.findAll(in, "//SCRIPT");
Set<JSONLDScript> result = new HashSet<>();
extractor = new JSONLDExtractorFactory().createExtractor();
for (Node jsonldNode : scriptNodes) {
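
The change from `/HTML/HEAD/SCRIPT` to `//SCRIPT` widens the search from the document head to the whole document (ANY23-291). The sketch below shows the difference using the JDK's `javax.xml.xpath` API, assuming `DomUtils.findAll` evaluates its second argument as an XPath expression, as the strings suggest. `ScriptXPathSketch`, `count`, and `parseXml` are hypothetical names, and the sample is well-formed XML with upper-case tags chosen to match the expressions in the diff.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class ScriptXPathSketch {

    // Hypothetical helper: number of nodes selected by an XPath expression.
    static int count(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    // Hypothetical helper: parses a well-formed XML string with the JDK parser.
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("sample markup failed to parse", e);
        }
    }

    public static void main(String[] args) {
        // One JSON-LD script in the head, one nested inside the body.
        Document doc = parseXml("<HTML>"
                + "<HEAD><SCRIPT type=\"application/ld+json\">{}</SCRIPT></HEAD>"
                + "<BODY><DIV><SCRIPT type=\"application/ld+json\">{}</SCRIPT></DIV></BODY>"
                + "</HTML>");
        System.out.println("/HTML/HEAD/SCRIPT matches " + count(doc, "/HTML/HEAD/SCRIPT"));
        System.out.println("//SCRIPT matches " + count(doc, "//SCRIPT"));
    }
}
```

The absolute path only sees the script in the head, while the descendant axis also picks up the one nested in the body.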
@@ -101,7 +101,8 @@ private void fixIncludes(HTMLDocument document, Node node, IssueReport report) {
report.notifyIssue(
IssueReport.IssueLevel.WARNING,
"Current node tries to include an ancestor node.",
nodeLocation[0], nodeLocation[1]
nodeLocation == null ? -1 : nodeLocation[0],
nodeLocation == null ? -1 : nodeLocation[1]
);
continue;
}
@@ -139,8 +139,10 @@ private void extractLinkDefinedPrefixes(Document in) {
List<Node> linkNodes = DomUtils.findAll(in, "/HTML/HEAD/LINK");
for(Node linkNode : linkNodes) {
NamedNodeMap attributes = linkNode.getAttributes();
String rel = attributes.getNamedItem("rel").getTextContent();
String href = attributes.getNamedItem("href").getTextContent();
Node relNode = attributes.getNamedItem("rel");
String rel = relNode == null ? null : relNode.getTextContent();
Node hrefNode = attributes.getNamedItem("href");
String href = hrefNode == null ? null : hrefNode.getTextContent();
if(rel != null && href !=null && RDFUtils.isAbsoluteIRI(href)) {
prefixes.put(rel, SimpleValueFactory.getInstance().createIRI(href));
}
@@ -0,0 +1,103 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/


package org.apache.any23.extractor.html;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Arrays;

/**
* @author Hans Brende
*/
public class JsoupUtils {

    public static Document parse(InputStream input, String documentIRI, String encoding) throws IOException {
        //Jsoup doesn't allow null document URIs
        if (documentIRI == null) {
            documentIRI = "";
        }

        //workaround for Jsoup issue #1009
        if (encoding == null) {

            int c;
            do {
                c = input.read();
            } while (c != -1 && Character.isWhitespace(c));

            if (c != -1) {
                int capacity = 256;
                byte[] bytes = new byte[capacity];
                int length = 0;
                bytes[length++] = (byte)c;

                if (c == '<') {
                    c = input.read();
                    if (c != -1) {
                        bytes[length++] = (byte)c;
                        if (c == '?') {
                            c = input.read();

                            while (c != -1) {
                                if (length == capacity) {
                                    capacity *= 2;
                                    bytes = Arrays.copyOf(bytes, capacity);
                                }
                                bytes[length++] = (byte)c;

                                if (c == '>') {
                                    if (length >= 20 && bytes[length - 2] == '?') {
                                        String decl = "<" + new String(bytes, 2, length - 4) + ">";
                                        org.jsoup.nodes.Document doc = org.jsoup.Jsoup.parse(decl, documentIRI, Parser.xmlParser());
                                        for (org.jsoup.nodes.Element el : doc.children()) {
                                            if ("xml".equalsIgnoreCase(el.tagName())) {
                                                String enc = el.attr("encoding");
                                                if (enc != null && !enc.isEmpty()) {
                                                    encoding = enc;
                                                    break;
                                                }
                                            }
                                        }
                                    }
                                    break;
                                }

                                c = input.read();
                            }
                        }
                    }

                }

                input = new SequenceInputStream(new ByteArrayInputStream(bytes, 0, length), input);
            }

        }

        //Use Parser.htmlParser() to parse javascript correctly
        return Jsoup.parse(input, encoding, documentIRI, Parser.htmlParser());
    }

}
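
The workaround in `JsoupUtils.parse` reads the start of the stream, looks for an `<?xml ... encoding="..."?>` declaration, and then puts the consumed bytes back in front of the remaining input. The sketch below is a simplified, JDK-only stand-in for that idea, not the committed implementation. It uses a regular expression instead of Jsoup's XML parser, a `PushbackInputStream` instead of a `SequenceInputStream`, and a fixed 256-byte peek window (an assumption). `XmlDeclEncodingSketch`, `wrap`, and `sniff` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlDeclEncodingSketch {

    // Assumption: the declaration, if present, fits in the first 256 bytes.
    private static final int PEEK_LIMIT = 256;

    // Matches an XML declaration at the start of the input and captures its encoding value.
    private static final Pattern ENCODING = Pattern.compile(
            "^\\s*<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"'][^>]*\\?>",
            Pattern.CASE_INSENSITIVE);

    // Wraps a stream so that up to PEEK_LIMIT bytes can be pushed back after peeking.
    static PushbackInputStream wrap(InputStream input) {
        return new PushbackInputStream(input, PEEK_LIMIT);
    }

    // Returns the declared encoding, or null if there is none.
    // All bytes read while peeking are pushed back, so the caller still sees the full stream.
    static String sniff(PushbackInputStream input) {
        try {
            byte[] head = new byte[PEEK_LIMIT];
            int length = 0;
            int n;
            while (length < PEEK_LIMIT
                    && (n = input.read(head, length, PEEK_LIMIT - length)) > 0) {
                length += n;
            }
            input.unread(head, 0, length);
            // ISO-8859-1 maps every byte to one char, so decoding the prefix cannot fail.
            String prefix = new String(head, 0, length, StandardCharsets.ISO_8859_1);
            Matcher matcher = ENCODING.matcher(prefix);
            return matcher.find() ? matcher.group(1) : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<html><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream input = wrap(new ByteArrayInputStream(page));
        System.out.println("declared encoding: " + sniff(input));
        // The stream is still positioned at the first byte of the document.
        System.out.println("first byte: " + (char) input.read());
    }
}
```

Unlike the committed code, this variant does not discard leading whitespace from the stream and gives up if the declaration runs past the peek window; it is meant only to show the peek-then-restore pattern.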
