
Implement MediaWiki table parsing #81

Merged: 45 commits from davidswinegar:tables into earwig:develop on Oct 24, 2014

Conversation

davidswinegar
Contributor

(#10) These tables are just substitutes for HTML tags, so I used the existing Tag node with a few changes. This is a big pull request but I'm available to make any changes you would like to see. This also doesn't fix the issues from #55 regarding comments/templates as whitespace or handle implicitly closing tags from #40. Because the scope of those issues is greater than just tables, I didn't implement them here.
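A rough sketch of what this enables (mine, not from the PR diff; it assumes the post-merge behavior where a wiki table comes back as an ordinary Tag node with `wiki_markup` set, and that parsing stays lossless):

import mwparserfromhell

text = "{|\n|-\n| foo\n|}"
code = mwparserfromhell.parse(text)

# The whole table should come back as a single Tag node whose tag name is
# "table" and whose wiki_markup holds the "{|" shorthand.
table = code.filter_tags()[0]
print(table.tag)          # expected: table
print(table.wiki_markup)  # expected: {|

# Re-serializing the tree should reproduce the original markup exactly.
print(str(code) == text)  # expected: True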

One note: I also changed context in the CTokenizer to use uint64_t to ensure an exact width, which requires an include of <stdint.h>. I'm not sure how well this include works with VS2008 and VS2010, and I currently don't have access to a Windows computer, but it might be necessary in tokenizer.h to do something like:

#ifdef _WIN32
    typedef unsigned long long int uint64_t;
#else
    #include <stdint.h>
#endif

Commit messages:

  • Started parsing table support. This is a big commit (ugh) and it should probably be split up into multiple smaller ones if possible, but that seems unworkable as of right now because of all the dependencies. Also breaks tests of the CTokenizer (double ugh) because I haven't started table support there. May want to pick line by line on this commit later but I need to save my work for now.
  • Added another stack layer for tokenizing table cells because of styling/correctness of implementation. Added many test cases.
  • Started support for parsing table style attributes. I suspect some of this is incorrect, need to add more tests to see.
  • Support for header cells was mostly in already, just needed minor changes. Added two tests as well.
  • Started styling attributes for table row and table start. Still not entirely sure about this, definitely need to make changes regarding padding.
  • Added support for allowing different wiki syntax for replacing the opening and closing tags. Added for table support.
  • Added comments, tried to keep to 80-character lines.
  • Tables and rows use newlines as padding, partly because these characters are pretty important to the integrity of the table. They might need to be in the preceding whitespace of inner tags instead of as padding after, not sure.
  • Changed row recursion handling to make sure the tag is emitted even when hitting recursion limits. Need to test table recursion to make sure that works. Also fixed a bug in which tables were eating the trailing token. Added several tests for rows and trailing tokens with tables.
  • Removed the `StopIteration()` exception for handling table style and instead call `_handle_table_cell_end()` with a new parameter. Also added some random tests for table openings.
  • Make sure py tokenizer methods only call methods that have been declared earlier. Not necessary but makes it much easier to maintain/write the C tokenizer if methods are in the same order.
  • Padding now included on all wiki table cells. With wiki table cells that include attributes, `wiki_markup` is also included (unchanged).
  • Fix problem in which fake table closes were causing a problem inside cells. Changed inline table handling to fix this.
  • Fix problem in which invalid table attributes were being parsed incorrectly. Added tests.
  • Various changes to avoid returning tuples - working on the C tokenizer made me realize this was a bad idea for compatibility/similarity between the two.
  • For C compatibility, switch table cell end to return the stack. Now context is kept by using `keep_context` when calling `self._pop()`.
  • For the C tokenizer, include `<stdint.h>` and use `uint64_t` instead of `int` for context. Changes to tables mean that context can be larger than 32 bits, and it is possible for `int` to only have 16 bits anyway (though this is very unlikely).
  • CTokenizer is completely implemented in this commit - it didn't make much sense to me to split it up. All tests passing, memory test shows no leaks on Linux.
@earwig
Owner

earwig commented Jul 21, 2014

Okay! Good work, for starters. I'm actually really impressed that anyone managed to do this. Let's go through some various points:

  • I'm okay with the <stdint.h> thing; I don't have a Windows computer either right now, so we'll wait until release time to see if we need to make any changes.
  • It looks like there's a problem with building on python 3.4 specifically, according to Travis. I'll look into it.
  • I'm going to merge this PR into feature/tables and review it there for a bit before merging into develop.

Will let you know if anything comes up.

@earwig earwig added this to the version 0.4 milestone Jul 21, 2014
@earwig earwig self-assigned this Jul 21, 2014
@earwig
Owner

earwig commented Jul 21, 2014

In 94a9e32, I fixed a test (table_cell_unclosed_style) that wasn't being parsed correctly due to a missing comma. Now that the test is being run, it fails. Is this related to #40 as you mentioned above, and if so, should we mark it as an expected failure?

Another thing: why are <th>, <td>, and <tr> included as SINGLE tags?

@davidswinegar
Contributor Author

I've been testing on Python 2.7 so I'm not sure what the problem with 3.4 is. I just added that test this morning and didn't read the output after seeing everything passed, so my bad. That is related to #40, but the test should actually fail as written if implicit close is implemented correctly. I meant to test that style shouldn't cross cells, which I apparently forgot to implement (should just involve failing other contexts on cell end, not too difficult to do). I won't be able to work on anything until tomorrow but I can work on these problems and any others you find then.


@earwig
Owner

earwig commented Jul 21, 2014

So I noticed that inline_cell_attributes doesn't roundtrip correctly:

>>> import mwparserfromhell
>>> text = u'{| \n ! name="foo bar" | test ||color="red"| markup!!foo | time \n|}'
>>> code = mwparserfromhell.parse(text)
>>> code == text
False
>>> print code
{|
 ! name="foo bar"|  test ||color="red"| markup!!foo | time
|}
>>> print text
{|
 ! name="foo bar" | test ||color="red"| markup!!foo | time
|}

Perhaps this needs to be re-ordered? I'm not sure if that would break anything else.

One final thing. I'm trying to understand the tree structure you're generating here, and it doesn't seem right to me. From the same example markup above, we get this:

>>> import re
>>> print re.sub("\s+", " ", code.get_tree())
< table > < th name = foo bar /> test < th color = red /> markup < th foo /> time \n </ table >

I asked above why <th> etc are allowed to be self-closing, because this doesn't seem correct. Surely the test, markup, and time strings should be inside their respective <th> tags, not between them?

@davidswinegar
Contributor Author

Back now and working on this, here are my thoughts:

Tree structure
The tree structure is something I probably should have asked about. I realize now that the self-closing tags don't enclose anything. I was confused because this is considered valid HTML:

<table>
  <tr>
    <td> foo
  <tr>
    <td> bar
</table>

This is the closest transliteration of wiki syntax to HTML table syntax, which is why I was trying to emulate it. But it doesn't make sense with the parse tree to treat them as self-closing tags, because then the table rows and cells are separated from their contents. I think removing the self-closing bits and just setting the `closing_wiki_markup` to "" should do the trick, though it'll require some time to go back through the code to fix that.
This is also why the row and cell tags were included as single tags, because they are implicitly closed by the next row/cell in non-XHTML tables. As far as I can tell this isn't supported currently by mwparserfromhell.
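To make that concrete, here's a hypothetical sketch (mine, using the attribute names already mentioned in this thread - `wiki_markup`, `closing_wiki_markup` - plus `contents`) of what a cell should look like once it's a normal enclosing Tag instead of a self-closing one:

import mwparserfromhell

code = mwparserfromhell.parse("{|\n| foo\n|}")

# The cell's text should live inside the Tag's contents, and its closing
# wiki markup should be empty, since wiki cells are closed implicitly by
# the next cell/row or by the end of the table.
cell = [t for t in code.filter_tags() if t.tag == "td"][0]
print(cell.wiki_markup)                # expected: |
print(cell.closing_wiki_markup == "")  # expected: True (implicitly closed)
print(cell.contents)                   # the cell text (plus whatever padding ends up attached)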

Inline cell attributes
Yep, just a stupid mistake. I was thinking earlier that it might be useful to add an integration test that goes through all the mwtest examples and compares str(input) to str(wikicode) to catch these kinds of representation mistakes.
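Something like this, roughly (a sketch; `iter_test_inputs()` is a hypothetical stand-in for however the suite would actually enumerate the example wikitext from the test files):

import mwparserfromhell

def iter_test_inputs():
    # Hypothetical stand-in: yield the wikitext inputs from the existing
    # test cases; a couple of hard-coded samples here for illustration.
    yield "{|\n| foo\n|}"
    yield "{| \n ! name=\"foo bar\" | test\n|}"

def assert_roundtrip(text):
    # Parsing is lossless, so re-serializing the tree must give back the input.
    assert str(mwparserfromhell.parse(text)) == text, text

for text in iter_test_inputs():
    assert_roundtrip(text)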

Travis/Python 3.4
I also see the CI tool building it as C90 while I was coding to C99 standards - it gave a few warnings because of that. It looks like VS doesn't support all of the C99 standard anyway, so I'll look into changing any syntax that could be a problem.

Today I'll plan on going over the issues in this order:

  • fix inline cell attributes
  • kill style/templates/etc on table cell close + add tests (wontfix)
  • switch table rows, cells, header to no longer be self-closing + change tests to match
  • Python 3.4 issues, if I have time

Self-closing wiki syntax tags had incorrectly ordered wiki markup and padding; fixed the ordering.
@earwig
Owner

earwig commented Aug 20, 2014

Sorry, I've been dead lately due to a variety of real life things. Wanted to get on this before I leave for college (which is tomorrow...) but it doesn't look like that's going to happen. I'll take a look first thing after stuff gets settled down, which I expect to be within the next two weeks.

@earwig
Owner

earwig commented Oct 19, 2014

Alright, I'm back alive again and trying to work on this. My memory is a bit fuzzy from two months ago, so I'm going to try to figure out what still needs to be done for this:

  • Add integration test for roundtripping existing unit test cases (done in bd85805)
  • Investigate drop in coverage; see what code is untested

@earwig
Owner

earwig commented Oct 23, 2014

Okay! I'm reasonably happy with the code (for now), all tests are passing, and there are no obvious examples of stuff that's not covered (with the exception of |+ table captions, but ehh...), so I'm running some roundtripping tests on an enwiki dump and if all goes well, I'm going to merge this tonight.

Edit: >500,000 pages tested, about 10% with tables, all roundtripped correctly so far. Going to sleep now, will merge tomorrow assuming nothing changes.

Edit 2: Noticed a bug on Natasha Kaplinsky. Will try to fix.

Edit 3: There is definitely a bug in how the code is handling closing pipes when present on the same line as the open. For example:

{| |}
| foo
|}

MediaWiki treats this as a table with a single cell "foo" (the first |} is ignored), while this code treats it as an empty table with the | foo \n|} treated as regular text. I suppose it's enough of an edge case to ignore for now, but I still need to fix the roundtripping bug present in the article above.

Edit 4: The syntax that looks like {| |} (close on same line as open) is actually not a valid table closure. I'm going to change the code to ignore this, which should fix the above problem.

Edit 5: Pushed updates; will re-run roundtripping tests.

Edit 6: There's a problem with Rafael Nadal but after several hours on this one I can't figure it out. I refactored a lot of the table error recovery code, but that doesn't seem to have fixed it. I want to blame #42, but I'm not sure if that's the real cause here.
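For reference, the {| |} edge case from Edits 3-4 can be sanity-checked with something like this (my sketch; the expected cell count reflects my reading of Edit 4, not output verified against this branch):

import mwparserfromhell

# Per Edit 4, a "|}" on the same line as the "{|" opener is not a valid
# table closure, so the table should stay open until the real "|}" and
# "| foo" should end up inside it as a cell -- matching how MediaWiki
# renders it.
text = "{| |}\n| foo\n|}"
code = mwparserfromhell.parse(text)

print(str(code) == text)  # roundtripping must hold regardless of the parse
cells = code.filter_tags(matches=lambda node: node.tag == "td")
print(len(cells))         # expected: 1 (the single "foo" cell)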

@coveralls

Coverage Status

Coverage decreased (-0.2%) when pulling a15f617 on davidswinegar:tables into 810c24e on earwig:develop.

@davidswinegar
Contributor Author

Just saw your edits - that's very weird. I was testing against syntax on test pages and the "close on same line as open" behavior was making valid tables, so I guess it's not supposed to. The Nadal page is definitely hitting the max cycles on my machine; maybe #42 is to blame, but I'm not sure. I can take a closer look as well this weekend.

@earwig
Owner

earwig commented Oct 24, 2014

Okay, gonna go ahead and merge this. The remaining issue re: Nadal can be dealt with separately.

@earwig earwig merged commit a15f617 into earwig:develop Oct 24, 2014
@legoktm
Contributor

legoktm commented Oct 24, 2014

Exactly 666 commits!

earwig added a commit that referenced this pull request Jun 23, 2017
Also removed the max cycles stop-gap, allowing much more complex pages
to be parsed quickly without losing nodes at the end

Also fixes #65, fixes #102, fixes #165, fixes #183
Also fixes #81 (Rafael Nadal parsing bug)
Also fixes #53, fixes #58, fixes #88, fixes #152 (duplicate issues)