
Implement MediaWiki table parsing #81

Merged: 45 commits from davidswinegar:tables into earwig:develop on Oct 24, 2014

Conversation

davidswinegar
Contributor

(#10) These tables are just substitutes for HTML tags, so I used the existing Tag node with a few changes. This is a big pull request but I'm available to make any changes you would like to see. This also doesn't fix the issues from #55 regarding comments/templates as whitespace or handle implicitly closing tags from #40. Because the scope of those issues is greater than just tables, I didn't implement them here.
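A rough sketch of what this enables (mine, not from the PR diff; it assumes the post-merge behavior where a wiki table comes back as an ordinary Tag node with `wiki_markup` set, and that parsing stays lossless):

import mwparserfromhell

text = "{|\n|-\n| foo\n|}"
code = mwparserfromhell.parse(text)

# The whole table should come back as a single Tag node whose tag name is
# "table" and whose wiki_markup holds the "{|" shorthand.
table = code.filter_tags()[0]
print(table.tag)          # expected: table
print(table.wiki_markup)  # expected: {|

# Re-serializing the tree should reproduce the original markup exactly.
print(str(code) == text)  # expected: True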

One note: I also changed context in the CTokenizer to use uint64_t to ensure an exact width, which requires an include of <stdint.h>. I'm not sure how well this include works with VS2008 and VS2010, and I currently don't have access to a Windows computer, but it might be necessary in tokenizer.h to do something like:

#ifdef _WIN32
    typedef unsigned long long int uint64_t;
#else
    #include <stdint.h>
#endif

Commit messages:

  • Started parsing table support. This is a big commit (ugh) and it should probably be split up into multiple smaller ones if possible, but that seems unworkable as of right now because of all the dependencies. Also breaks tests of the CTokenizer (double ugh) because I haven't started table support there. May want to pick line by line on this commit later but I need to save my work for now.
  • Added another stack layer for tokenizing table cells because of styling/correctness of implementation. Added many test cases.
  • Started support for parsing table style attributes. I suspect some of this is incorrect, need to add more tests to see.
  • Support for header cells was mostly in already, just needed minor changes. Added two tests as well.
  • Started styling attributes for table row and table start. Still not entirely sure about this, definitely need to make changes regarding padding.
  • Added support for allowing different wiki syntax for replacing the opening and closing tags. Added for table support.
  • Added comments, tried to keep to 80-character lines.
  • Tables and rows use newlines as padding, partly because these characters are pretty important to the integrity of the table. They might need to be in the preceding whitespace of inner tags instead of as padding after, not sure.
  • Changed row recursion handling to make sure the tag is emitted even when hitting recursion limits. Need to test table recursion to make sure that works. Also fixed a bug in which tables were eating the trailing token. Added several tests for rows and trailing tokens with tables.
  • Removed the `StopIteration()` exception for handling table style and instead call `_handle_table_cell_end()` with a new parameter. Also added some random tests for table openings.
  • Make sure py tokenizer methods only call methods that have been declared earlier. Not necessary but makes it much easier to maintain/write the C tokenizer if methods are in the same order.
  • Padding now included on all wiki table cells. With wiki table cells that include attributes, `wiki_markup` is also included (unchanged).
  • Fix problem in which fake table closes were causing a problem inside cells. Changed inline table handling to fix this.
  • Fix problem in which invalid table attributes were being parsed incorrectly. Added tests.
  • Various changes to avoid returning tuples - working on the C tokenizer made me realize this was a bad idea for compatibility/similarity between the two.
  • For C compatibility, switch table cell end to return the stack. Now context is kept by using `keep_context` when calling `self._pop()`.
  • For the C tokenizer, include `<stdint.h>` and use `uint64_t` instead of `int` for context. Changes to tables mean that context can be larger than 32 bits, and it is possible for `int` to only have 16 bits anyway (though this is very unlikely).
  • CTokenizer is completely implemented in this commit - it didn't make much sense to me to split it up. All tests passing, memory test shows no leaks on Linux.
@earwig
Owner

earwig commented Jul 21, 2014

Okay! Good work, for starters. I'm actually really impressed that anyone managed to do this. Let's go through some various points:

  • I'm okay with the <stdint.h> thing; I don't have a Windows computer either right now, so we'll wait until release time to see if we need to make any changes.
  • It looks like there's a problem with building on python 3.4 specifically, according to Travis. I'll look into it.
  • I'm going to merge this PR into feature/tables and review it there for a bit before merging into develop.

Will let you know if anything comes up.

@earwig earwig added this to the version 0.4 milestone Jul 21, 2014
@earwig earwig self-assigned this Jul 21, 2014
@earwig
Owner

earwig commented Jul 21, 2014

In 94a9e32, I fixed a test (table_cell_unclosed_style) that wasn't being parsed correctly due to a missing comma. Now that the test is being run, it fails. Is this related to #40 as you mentioned above, and if so, should we mark it as an expected failure?

Another thing: why are <th>, <td>, and <tr> included as SINGLE tags?

@davidswinegar
Contributor Author

I've been testing on Python 2.7 so I'm not sure what the problem with 3.4 is. I just added that test this morning and didn't read the output after seeing everything passed, so my bad. That is related to #40, but the test should actually fail as written if implicit close is implemented correctly. I meant to test that style shouldn't cross cells, which I apparently forgot to implement (should just involve failing other contexts on cell end, not too difficult to do). I won't be able to work on anything until tomorrow but I can work on these problems and any others you find then.


@earwig
Owner

earwig commented Jul 21, 2014

So I noticed that inline_cell_attributes doesn't roundtrip correctly:

>>> import mwparserfromhell
>>> text = u'{| \n ! name="foo bar" | test ||color="red"| markup!!foo | time \n|}'
>>> code = mwparserfromhell.parse(text)
>>> code == text
False
>>> print code
{|
 ! name="foo bar"|  test ||color="red"| markup!!foo | time
|}
>>> print text
{|
 ! name="foo bar" | test ||color="red"| markup!!foo | time
|}

Perhaps this needs to be re-ordered? I'm not sure if that would break anything else.

One final thing. I'm trying to understand the tree structure you're generating here, and it doesn't seem right to me. From the same example markup above, we get this:

>>> import re
>>> print re.sub("\s+", " ", code.get_tree())
< table > < th name = foo bar /> test < th color = red /> markup < th foo /> time \n </ table >

I asked above why <th> etc are allowed to be self-closing, because this doesn't seem correct. Surely the test, markup, and time strings should be inside their respective <th> tags, not between them?

@davidswinegar
Contributor Author

Back now and working on this, here are my thoughts:

Tree structure
The tree structure is something I probably should have asked about. I realize now that the self-closing tags don't enclose anything. I was confused because this is considered valid HTML:

<table>
  <tr>
    <td> foo
  <tr>
    <td> bar
</table>

This is the closest transliteration of wiki syntax to HTML table syntax, which is why I was trying to emulate it. But it doesn't make sense with the parse tree to treat them as self-closing tags, because then the table rows and cells are separated from their contents. I think removing the self-closing bits and just setting the `closing_wiki_markup` to "" should do the trick, though it'll require some time to go back through the code to fix that.
This is also why the row and cell tags were included as single tags, because they are implicitly closed by the next row/cell in non-XHTML tables. As far as I can tell this isn't supported currently by mwparserfromhell.
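To make that concrete, here's a hypothetical sketch (mine, using the attribute names already mentioned in this thread - `wiki_markup`, `closing_wiki_markup` - plus `contents`) of what a cell should look like once it's a normal enclosing Tag instead of a self-closing one:

import mwparserfromhell

code = mwparserfromhell.parse("{|\n| foo\n|}")

# The cell's text should live inside the Tag's contents, and its closing
# wiki markup should be empty, since wiki cells are closed implicitly by
# the next cell/row or by the end of the table.
cell = [t for t in code.filter_tags() if t.tag == "td"][0]
print(cell.wiki_markup)                # expected: |
print(cell.closing_wiki_markup == "")  # expected: True (implicitly closed)
print(cell.contents)                   # the cell text (plus whatever padding ends up attached)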

Inline cell attributes
Yep, just a stupid mistake. I was thinking earlier that it might be useful to add an integration test that goes through all the mwtest examples and compares str(input) to str(wikicode) to catch these kinds of representation mistakes.
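Something like this, roughly (a sketch; `iter_test_inputs()` is a hypothetical stand-in for however the suite would actually enumerate the example wikitext from the test files):

import mwparserfromhell

def iter_test_inputs():
    # Hypothetical stand-in: yield the wikitext inputs from the existing
    # test cases; a couple of hard-coded samples here for illustration.
    yield "{|\n| foo\n|}"
    yield "{| \n ! name=\"foo bar\" | test\n|}"

def assert_roundtrip(text):
    # Parsing is lossless, so re-serializing the tree must give back the input.
    assert str(mwparserfromhell.parse(text)) == text, text

for text in iter_test_inputs():
    assert_roundtrip(text)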

Travis/Python 3.4
I also see the CI tool building it as C90 while I was coding to C99 standards - it gave a few warnings because of that. It looks like VS doesn't support all of the C99 standard anyway, so I'll look into changing any syntax that could be a problem.

Today I'll plan on going over the issues in this order:

  • fix inline cell attributes
  • kill style/templates/etc on table cell close + add tests (wontfix)
  • switch table rows, cells, header to no longer be self-closing + change tests to match
  • Python 3.4 issues, if I have time

Self-closing wiki syntax tags had incorrectly ordered wiki markup and padding; fixed the ordering.
@earwig
Owner

earwig commented Aug 20, 2014

Sorry, I've been dead lately due to a variety of real life things. Wanted to get on this before I leave for college (which is tomorrow...) but it doesn't look like that's going to happen. I'll take a look first thing after stuff gets settled down, which I expect to be within the next two weeks.

@earwig
Owner

earwig commented Oct 19, 2014

Alright, I'm back alive again and trying to work on this. My memory is a bit fuzzy from two months ago, so I'm going to try to figure out what still needs to be done for this:

  • Add integration test for roundtripping existing unit test cases (done in bd85805)
  • Investigate drop in coverage; see what code is untested

@earwig
Owner

earwig commented Oct 23, 2014

Okay! I'm reasonably happy with the code (for now), all tests are passing, and there are no obvious examples of stuff that's not covered (with the exception of |+ table captions, but ehh...), so I'm running some roundtripping tests on an enwiki dump and if all goes well, I'm going to merge this tonight.

Edit: >500,000 pages tested, about 10% with tables, all roundtripped correctly so far. Going to sleep now, will merge tomorrow assuming nothing changes.

Edit 2: Noticed a bug on Natasha Kaplinsky. Will try to fix.

Edit 3: There is definitely a bug in how the code is handling closing pipes when present on the same line as the open. For example:

{| |}
| foo
|}

MediaWiki treats this as a table with a single cell "foo" (the first |} is ignored), while this code treats it as an empty table with the | foo \n|} treated as regular text. I suppose it's enough of an edge case to ignore for now, but I still need to fix the roundtripping bug present in the article above.

Edit 4: The syntax that looks like {| |} (close on same line as open) is actually not a valid table closure. I'm going to change the code to ignore this, which should fix the above problem.

Edit 5: Pushed updates; will re-run roundtripping tests.

Edit 6: There's a problem with Rafael Nadal but after several hours on this one I can't figure it out. I refactored a lot of the table error recovery code, but that doesn't seem to have fixed it. I want to blame #42, but I'm not sure if that's the real cause here.
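For reference, the {| |} edge case from Edits 3-4 can be sanity-checked with something like this (my sketch; the expected cell count reflects my reading of Edit 4, not output verified against this branch):

import mwparserfromhell

# Per Edit 4, a "|}" on the same line as the "{|" opener is not a valid
# table closure, so the table should stay open until the real "|}" and
# "| foo" should end up inside it as a cell -- matching how MediaWiki
# renders it.
text = "{| |}\n| foo\n|}"
code = mwparserfromhell.parse(text)

print(str(code) == text)  # roundtripping must hold regardless of the parse
cells = code.filter_tags(matches=lambda node: node.tag == "td")
print(len(cells))         # expected: 1 (the single "foo" cell)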

@coveralls

Coverage Status

Coverage decreased (-0.2%) when pulling a15f617 on davidswinegar:tables into 810c24e on earwig:develop.

@davidswinegar
Contributor Author

Just saw your edits - that's very weird. I was testing against syntax on test pages and the "close on same line as open" behavior was making valid tables, so I guess it's not supposed to. The Nadal page is definitely hitting the max cycles on my machine; maybe #42 is to blame, but I'm not sure. I can take a closer look as well this weekend.

@earwig
Owner

earwig commented Oct 24, 2014

Okay, gonna go ahead and merge this. The remaining issue re: Nadal can be dealt with separately.

@earwig earwig merged commit a15f617 into earwig:develop Oct 24, 2014
@legoktm
Contributor

legoktm commented Oct 24, 2014

Exactly 666 commits!

earwig added a commit that referenced this pull request Jun 23, 2017
Also removed the max cycles stop-gap, allowing much more complex pages
to be parsed quickly without losing nodes at the end

Also fixes #65, fixes #102, fixes #165, fixes #183
Also fixes #81 (Rafael Nadal parsing bug)
Also fixes #53, fixes #58, fixes #88, fixes #152 (duplicate issues)