Problem with sections from RfA pages #218

ananth1996 · 2019-06-07T09:04:59Z

I'm trying to parse the sections of RfA pages such as https://en.wikipedia.org/wiki/Wikipedia:Requests_for_adminship/7. get_sections() always seems to return a single section, even if I pass skip_style_tags=True. Is there a fix for this? filter_headings(), on the other hand, does return all the headings.
I want to parse the Support, Oppose and Neutral votes. Is there a better way to do this in Python?


earwig · 2019-06-09T19:17:57Z

Hi @ananth1996,

The issue is basically that the entire RfA content is inside a <div> tag, and get_sections() expects headings to be nodes at the top level of the wikicode. Since all headings are inside that <div>, it considers the entire page to be one section.

Here's a cheap workaround:

>>> import mwparserfromhell
>>> # text holds the raw wikitext of the RfA page
>>> code = mwparserfromhell.parse(text, skip_style_tags=True)
>>> if code:
...     first = code.get(0)
...     if isinstance(first, mwparserfromhell.nodes.Tag) and first.tag == 'div':
...         code = first.contents
...
>>> len(code.get_sections())
9

I'll think more about a way to fix this inside the parser.

ananth1996 · 2019-06-10T08:06:55Z

Thank you for the workaround, it is working properly.
I also wanted to ask whether there is any particular way to iterate through list items, similar to some of the methods in wikitextparser. I am also looking to extract the user signature at the end of every vote and was wondering whether a template or general regex pattern is already available in some parser.
Thanks in advance.

earwig · 2019-06-10T11:42:27Z

I don’t think there’s a good built-in way to do that, unfortunately. You would need to do some manual node iteration. For example: for each unnested li tag, find the last wikilink to a user page or user talk page before the next li tag. Something like that might work.
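
As a rough illustration of that heuristic, here is a minimal sketch that uses only the standard re module on the raw wikitext instead of mwparserfromhell node iteration. It assumes each vote sits on a single line starting with exactly one "#", and that the signature is the last link to a User or User talk page on that line. The names USER_LINK and last_signers are made up for this example, and the pattern will not cover every signature style (custom signatures, templates, links to contributions pages).

```python
import re

# Hypothetical pattern: matches [[User:Name...]] or [[User talk:Name...]]
# and captures the user name up to the first "|", "]", "/" or "#".
USER_LINK = re.compile(
    r"\[\[\s*User(?:[ _]talk)?\s*:\s*([^\]|/#]+)",
    re.IGNORECASE,
)


def last_signers(wikitext):
    """Return one user name (or None) per top-level numbered list item."""
    signers = []
    for line in wikitext.splitlines():
        # Keep only unnested items: one leading '#', not '##', '#:', '#*', '#;'.
        if not re.match(r"#(?![#:*;])", line):
            continue
        names = USER_LINK.findall(line)
        # Assumption: the last user link on the line is the voter's signature.
        signers.append(names[-1].strip() if names else None)
    return signers


sample = """====Support====
# Great candidate. [[User:Alice|Alice]] ([[User talk:Alice|talk]]) 10:00, 1 June 2019 (UTC)
#: A threaded reply. [[User:Bob|Bob]] 11:00, 1 June 2019 (UTC)
# Per [[User:Alice]]. [[User:Carol Smith|Carol]] 12:00, 1 June 2019 (UTC)
# Unsigned vote
"""

print(last_signers(sample))  # ['Alice', 'Carol Smith', None]
```

Nested replies such as "#:" lines are skipped on purpose, so a reply signed by someone else is not mistaken for the voter. Running this per section (Support, Oppose, Neutral) would give one list of voters for each.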
