filter_templates does not find matches when template has a linebreak #293

Open
Gonzalo933 opened this issue Sep 16, 2022 · 1 comment

Gonzalo933 commented Sep 16, 2022

When filtering templates, if the template name contains a line break, there seems to be no way to select it with the matches parameter.

For example, given this text taken from the wikitext source of https://en.wikipedia.org/wiki/Cholera:

text = """
{{short description|Bacterial infection of the small intestine}}
{{About|the bacterial disease|the dish|Cholera (food)}}
{{pp-vandalism|small=yes}}
{{Infobox medical condition (new)
| name            = Cholera
| synonyms        = Asiatic cholera, epidemic cholera<ref name=textbook/>
| image           = PHIL 1939 lores.jpg
| caption         = A person with severe [[dehydration]] due to cholera, causing sunken eyes and wrinkled hands and skin.
| field           = [[Infectious disease (medical specialty)|Infectious disease]]
| symptoms        = Large amounts of watery [[diarrhea]], [[vomiting]], [[muscle cramps]]<ref name=WHO2010 /><ref name=CDC2015Pro />
| complications   = [[Dehydration]], [[electrolyte imbalance]]<ref name=WHO2010 />
| onset           = 2 hours to 5 days after exposure<ref name=CDC2015Pro />
| duration        = A few days<ref name=WHO2010 />
| causes          = ''[[Vibrio cholerae]]'' spread by [[fecal-oral route]]<ref name=WHO2010 /><ref name=Fink2016 />
| risks           = Poor [[sanitation]], not enough clean [[drinking water]], [[poverty]]<ref name=WHO2010 />
| diagnosis       = [[Stool test]]<ref name=WHO2010 />
| differential    =
| prevention      = Improved sanitation, [[drinking water|clean water]], [[hand washing]], [[cholera vaccine]]s<ref name=WHO2010 /><ref name=Lancet2012 />
| treatment       = [[Oral rehydration therapy]], [[zinc supplementation]], [[intravenous fluids]], [[antibiotics]]<ref name=WHO2010 /><ref name=CDC2014Zinc />
| medication      =
| prognosis       = Less than 1% mortality rate with proper treatment, untreated mortality rate 50-60%
| frequency       = 3–5&nbsp;million people a year<ref name=WHO2010 />
| deaths          = 28,800 (2015)<ref name=GBD2015De/>
}}

'''Cholera''' is an [[infection]] of the [[small intestine]] by some [[strain (biology)|strains]] of the [[Bacteria|bacterium]] ''[[Vibrio cholerae]]''.<ref name=Fink2016>{{cite book |last1=Finkelstein |first1=Richard A. |chapter=Cholera, ''Vibrio cholerae'' O1 and O139, and Other Pathogenic Vibrios |pmid=21413330 |id={{NCBIBook2|NBK8407}} |editor1-last=Baron |editor1-first=Samuel |title=Medical Microbiology |date=1996 |publisher=University of Texas Medical Branch at Galveston |isbn=978-0-9631172-1-2 |edition=4th }}</ref><ref name=CDC2015Pro /> Symptoms may range from none, to mild, to severe.<ref name=CDC2015Pro>{{cite web|title=Cholera – Vibrio cholerae infection Information for Public Health & Medical Professionals|url=https://www.cdc.gov/cholera/healthprofessionals.html|publisher=[[Centers for Disease Control and Prevention]]|access-date=17 March 2015|date=January 6, 2015|url-status=live|archive-url=https://web.archive.org/web/20150320052724/http://www.cdc.gov/cholera/healthprofessionals.html|archive-date=20 March 2015}}</ref> The classic symptom is large amounts of watery [[diarrhea]] that lasts a few days.<ref name=WHO2010 /> [[Vomiting]] and [[muscle cramps]] may also occur.<ref name=CDC2015Pro /> Diarrhea can be so severe that it leads within hours to severe [[dehydration]] and [[electrolyte imbalance]].<ref name=WHO2010 /> This may result in [[Enophthalmia|sunken eyes]], cold skin, decreased skin elasticity, and wrinkling of the hands and feet.<ref name=Lancet2012>{{cite journal | vauthors = Harris JB, LaRocque RC, Qadri F, Ryan ET, Calderwood SB | title = Cholera | journal = Lancet | volume = 379 | issue = 9835 | pages = 2466–2476 | date = June 2012 | pmid = 22748592 | pmc = 3761070 | doi = 10.1016/s0140-6736(12)60436-x }}</ref> Dehydration can cause the skin to turn [[cyanosis|bluish]].<ref>{{cite book|last1=Bailey|first1=Diane  | name-list-style = vanc |title=Cholera|date=2011|publisher=Rosen Pub.|location=New 
York|isbn=978-1-4358-9437-2|page=7|edition=1st|url=https://books.google.com/books?id=7rvLPx33GPgC&pg=PA7|url-status=live|archive-url=https://web.archive.org/web/20161203190215/https://books.google.com/books?id=7rvLPx33GPgC&pg=PA7|archive-date=2016-12-03}}</ref> Symptoms start two hours to five days after exposure.<ref name=CDC2015Pro />
"""

Some templates are correctly found:

import mwparserfromhell

print([t.name for t in mwparserfromhell.parse(text).filter_templates()])

['short description', 'About', 'pp-vandalism', 'Infobox medical condition (new)\n', 'cite book ', 'NCBIBook2', 'cite web', 'cite journal ', 'cite book']

But when trying to match the "Infobox medical condition (new)" template, the filter returns nothing:

mwparserfromhell.parse(text).filter_templates(matches="Infobox medical condition (new)")

[]

@lahwaacz
Contributor

From the documentation:

matches can be used to further restrict the nodes, either as a function (taking a single Node and returning a boolean) or a regular expression (matched against the node’s string representation with re.search()). If matches is a regex, the flags passed to re.search() are re.IGNORECASE, re.DOTALL, and re.UNICODE, but custom flags can be specified by passing flags.
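The behavior described in that passage can be reproduced with the standard library alone. The sketch below assumes `node_text` is an abbreviated stand-in for the template's string representation; it is not taken from the library itself. It suggests that the parentheses in the pattern matter: unescaped, they act as a regex group rather than as literal characters.

```python
import re

# Abbreviated stand-in (an assumption) for str(node): the regex is searched
# against the template's full wikitext, not only its name.
node_text = "{{Infobox medical condition (new)\n| name = Cholera\n}}"

# Default flags named in the documentation quoted above.
FLAGS = re.IGNORECASE | re.DOTALL | re.UNICODE

raw_pattern = "Infobox medical condition (new)"

# Unescaped parentheses form a capture group, so this pattern looks for
# "Infobox medical condition new" and finds nothing.
raw_match = re.search(raw_pattern, node_text, FLAGS)

# re.escape() turns the parentheses into literal characters.
escaped_match = re.search(re.escape(raw_pattern), node_text, FLAGS)

# A pattern without metacharacters matches despite the trailing newline,
# because re.search() only needs a substring match.
prefix_match = re.search("Infobox medical condition", node_text, FLAGS)

print(raw_match is None)          # True
print(escaped_match is not None)  # True
print(prefix_match is not None)   # True
```

With the library, the equivalent would presumably be `filter_templates(matches=re.escape("Infobox medical condition (new)"))`. Keep in mind that a regex is searched against the whole template text, so it can also match templates that merely contain that string somewhere in their parameters.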

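So your matches="Infobox medical condition (new)" is taken as a regex and searched against the whole template text. The unescaped parentheses form a regex group instead of matching the literal "(new)", so the search finds nothing. The final \n in the template name is not what blocks it, since re.search() only needs a substring match. Escaping the pattern, for example with re.escape(), makes the regex approach work. Note that this is different from the Wikicode.matches method, which compares names loosely and ignores surrounding whitespace. To filter with the latter, use: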

mwparserfromhell.parse(text).filter_templates(matches=lambda template: template.name.matches("Infobox medical condition (new)"))
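To see why the `name.matches(...)` approach tolerates the trailing newline, here is a rough standard-library approximation of that loose comparison. The helpers `normalize_name` and `loose_name_match` are hypothetical names for this sketch, not part of mwparserfromhell. The sketch assumes the comparison trims surrounding whitespace, treats underscores as spaces, and ignores the case of the first letter. It skips the markup stripping that the real method also performs.

```python
def normalize_name(name: str) -> str:
    """Hypothetical helper: trim whitespace, treat underscores as spaces,
    and uppercase the first letter, roughly as page names are compared."""
    name = name.strip().replace("_", " ")
    return name[:1].upper() + name[1:]


def loose_name_match(left: str, right: str) -> bool:
    """Hypothetical helper: compare two template names after normalizing both."""
    return normalize_name(left) == normalize_name(right)


# Template names as they appear in the parsed output shown in the issue.
print(loose_name_match("Infobox medical condition (new)\n",
                       "Infobox medical condition (new)"))  # True
print(loose_name_match("cite book ", "Cite_book"))           # True
print(loose_name_match("cite book", "cite web"))             # False
```

Because this comparison works on the name only and involves no regex, the parentheses need no escaping and the line break after the name is simply trimmed away.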
