Malformed XML/HTML and invalid links #6

Open
cifkao opened this Issue Apr 13, 2015 · 18 comments


cifkao commented Apr 13, 2015

Extracted text is not being escaped, which in some cases results in malformed XML.

For example, the third sentence of Inequality (mathematics) is rendered as:

For the use of the "<" and ">" signs as punctuation, see <a href="Bracket">Bracket</a>.

The correct output would be:

For the use of the "&lt;" and "&gt;" signs as punctuation, see <a href="Bracket">Bracket</a>.

Similarly, the extracted text of Brian Kernighan contains:

The first documented <a href=""Hello, world!" program">"Hello, world!" program</a>, in Kernighan's "A Tutorial Introduction to the Language B" (1972)

which should instead be:

The first documented <a href="&quot;Hello, world!&quot; program">"Hello, world!" program</a>, in Kernighan's "A Tutorial Introduction to the Language B" (1972)

The same applies to page titles in the <doc> elements.

Another issue with links is that most of them are not really hypertext links, but wikilinks. In my opinion, wikilinks should be represented using a different element, e.g. <wikilink>, so the sentence above would become:

The first documented <wikilink page="&quot;Hello, world!&quot; program">"Hello, world!" program</wikilink>, in Kernighan's "A Tutorial Introduction to the Language B" (1972)
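For illustration, a minimal sketch of the escaping being requested here, using Python's standard library (the helper name and the <wikilink> element are from this proposal, not anything WikiExtractor currently does):

from xml.sax.saxutils import escape

def escape_attr(value):
    # escape() handles &, < and >; the extra entity turns embedded
    # double quotes into &quot; so the value is safe inside "..."
    return escape(value, {'"': '&quot;'})

text = 'For the use of the "<" and ">" signs as punctuation, see Bracket.'
page = '"Hello, world!" program'

print(escape(text))
# For the use of the "&lt;" and "&gt;" signs as punctuation, see Bracket.
print('<wikilink page="%s">%s</wikilink>' % (escape_attr(page), escape(page)))
# <wikilink page="&quot;Hello, world!&quot; program">"Hello, world!" program</wikilink>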
Owner

attardi commented Apr 15, 2015

The output should be text, not HTML; hence it is correct that HTML entities are converted to characters, exactly as they appear when reading the page.
In the case of entities within URLs, they should be converted to urlencoding, I suppose.
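A sketch of what that urlencoding could look like, assuming Python 3's standard library:

from urllib.parse import quote

# quote() percent-encodes everything outside the unreserved set,
# including the quotes, commas and spaces in this page title.
page = '"Hello, world!" program'
print('<a href="%s">%s</a>' % (quote(page), page))
# <a href="%22Hello%2C%20world%21%22%20program">"Hello, world!" program</a>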


Blemicek commented Apr 15, 2015

Actually, the output is XML, not plain text. It should contain the predefined XML entities instead of ", &, ', < and > so that it can be parsed later.

Owner

attardi commented Apr 15, 2015

The input is XML, the output is plain text. That is the intended use. I use it for extracting a text corpus for performing linguistic analysis: parsing, QA, creating word embeddings, etc. If the content were not converted, you would get a lot of crap in the output, including comments, etc.

I guess I could add an option to avoid conversion, if that helps.


Owner

attardi commented Apr 15, 2015

Links are now urlencoded.


cifkao commented Apr 15, 2015

If the output is supposed to be plain text, then it does not make sense to represent links, lists and headings using HTML tags (<a>, <h1>, <li>), and it is impossible to parse such output reliably (what if the actual text of the article contains some of these tags, or worse, the <doc> tag? Unlikely, but possible).

Owner

attardi commented Apr 15, 2015

No tags will be present in the output: they all get stripped out, even if there was a <doc> in the article text.
The anchors are only present if you ask for them using the option to preserve links.
Use at your own discretion.


cifkao commented Apr 15, 2015

The <a> tags are not the only issue. If the --sections option is used, <li> and <h1>, <h2> etc. are inserted. If the option is not used, section headings and list items are completely removed (which breaks disambiguation pages, for example, where all the interesting information is present as list items).

Owner

attardi commented Apr 15, 2015

Same reason: they are inserted if you ask for them.
All tables and lists are removed, because they do not form linguistic sentences.
If you want to preserve the structure, you need a different tool.


Blemicek commented Apr 15, 2015

It seems that some HTML/XML tags coming from template output are not removed. E.g. in the article HTML element:

<doc id="274393" url="http://en.wikipedia.org/wiki?curid=274393" title="HTML element">
HTML element

An <abbr title="Hyper Text Markup Language">HTML</abbr> element is an individual component of an <a href="HTML">HTML</a> document or <a href="web page">web page</a>, once this has been parsed into the <a href="HTML Document Object Model">Document Object Model</a>. HTML is composed of a <a href="Tree structure">tree</a> of HTML elements and other <a href="Node (computer science)">nodes</a>, such as text nodes. Each element can have <a href="HTML attribute">HTML attributes</a> specified. Elements can also have content, including other elements and text. Many HTML elements represent <a href="semantics">semantics</a>, or meaning. For example, the codice_1 element represents the title of the document.

...

</doc>

(Anyway, it is a bit confusing to use XML/HTML tags in plain text.)

Owner

attardi commented Apr 19, 2015

I added <abbr> to the list of ignoredTags.
The case of the article HTML element is a little peculiar, since it is about HTML; hence the text extracted from the page should contain tags.
That page, however, is written using the SyntaxHighlight extension.
So now the content of <syntaxhighlight> is not converted.
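For the record, a sketch of the kind of list being described (the entries besides abbr are illustrative, not WikiExtractor's actual ignoredTags):

# Illustrative only: tags whose markup is dropped while their inner
# text is kept; 'abbr' is the entry added by this fix.
ignoredTags = ('abbr', 'b', 'big', 'blockquote', 'center', 'cite',
               'em', 'font', 'i', 'small', 'span', 'sub', 'sup', 'tt', 'u')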


psibre commented Jul 6, 2015

FWIW, the <doc id="..." url="..." title="...">...</doc> output format implies a certain XML affinity. However, the lack of a common single root element makes many XML parsers barf. IMHO, it would make sense to wrap the entire output text file in some top level element.
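A minimal sketch of that workaround, assuming the escaping issues above are fixed (the file name is hypothetical):

import xml.etree.ElementTree as ET

# Wrap the concatenated <doc> elements in a synthetic root so a
# standard XML parser accepts the file.
with open('wiki_00', encoding='utf-8') as f:
    root = ET.fromstring('<docs>' + f.read() + '</docs>')

for doc in root.iter('doc'):
    print(doc.get('id'), doc.get('title'))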

Owner

attardi commented Jul 6, 2015

I agree that it might be confusing. But the format is not meant to be an XML format.
If it were XML, then all sorts of escaping would have to be done, for instance to handle character entities, etc.
But this would defeat the purpose of a text extractor.
The output is just text, with tags used to separate the documents.
It is meant for easy processing: you can just drop the tags with a one-liner sed script.
You are not supposed to use an XML parser, since there is no need for it.
Actually, the use of an XML parser is definitely discouraged, for the reasons mentioned above.
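The same tag-dropping approach in Python rather than sed, as a sketch (the file name is hypothetical; this assumes the default layout where each <doc ...> and </doc> marker sits on its own line):

# Keep only the article text between the <doc> markers.
with open('wiki_00', encoding='utf-8') as src:
    for line in src:
        if not line.startswith(('<doc', '</doc>')):
            print(line, end='')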


psibre commented Jul 7, 2015

I agree that everything between the <doc...> and </doc> is, and should be, plain text.
But the fact that the (sparse) metadata is still encoded in an XML-like way with attributes that do use character entities undermines the effort to avoid XML...

For example, the page for "Weird Al" Yankovic produces something like <doc ... title="&quot;Weird Al&quot; Yankovic">. It seems a bit odd to output XML-like elements with attributes, but to discourage XML parsing to extract the attribute and convert the entities. Why not produce something like JSON instead?
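For comparison, a sketch of what one JSON record per document could look like; json.dumps handles all the quoting automatically (the field names and the id/url values are placeholders, not an agreed format):

import json

record = {
    'id': '123456',
    'url': 'http://en.wikipedia.org/wiki?curid=123456',
    'title': '"Weird Al" Yankovic',
    'text': '...',
}
print(json.dumps(record, ensure_ascii=False))
# {"id": "123456", ..., "title": "\"Weird Al\" Yankovic", "text": "..."}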


psibre commented Jul 7, 2015

I realize that my comments are going a bit off-topic and have opened #30.

Contributor

nathj07 commented Nov 12, 2015

Hi,
First up, this is a good tool and I'm generally finding it very useful.

I may be late to the party here, but this is a big issue. The presence of < as plain text within the <doc>...</doc> tags causes decoding to break. So when decoding the XML using tokenization, it breaks on the presence of < inside the tags, typically with something like XML syntax error on line xx: expected element name after <.

It was mentioned above that a flag could be introduced to handle this, so that those characters in that position get escaped. Has any progress been made on this? If need be I'd be happy to help out with that, given a pointer in the right direction.

Thanks
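A minimal reproduction of that failure, using Python's standard parser instead of the Go tokenizer quoted above (the exact error wording will differ):

import xml.etree.ElementTree as ET

# An unescaped "<" in character data makes any conforming XML parser
# reject the document.
bad = '<doc id="1" title="Inequality (mathematics)">the "<" sign</doc>'
try:
    ET.fromstring(bad)
except ET.ParseError as err:
    print(err)  # e.g. not well-formed (invalid token): line 1, column ...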

Owner

attardi commented Nov 13, 2015

Would it help just enclosing the text within

-- Beppe


Contributor

nathj07 commented Nov 13, 2015

An interesting idea, and I did think that would work in my use case. However, when I ran some simple tests I ended up with an unexpected EOF error.

I think perhaps a command-line flag to enable escaping of characters within the <doc>...</doc> content would work. How does that sound?
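A hedged sketch of how such a flag might look (the flag name and the plumbing are my invention, not the actual WikiExtractor code):

import argparse
from xml.sax.saxutils import escape

parser = argparse.ArgumentParser()
parser.add_argument('--escape-doc-content', action='store_true',
                    help='escape &, < and > in the text inside <doc> elements')
args = parser.parse_args()

def emit(line):
    # Escape character data only; the <doc> markers themselves are
    # written elsewhere and must stay untouched.
    print(escape(line) if args.escape_doc_content else line, end='')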

Contributor

nathj07 commented Nov 13, 2015

How does that PR look for this?
