Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop invalid XML element attributes #462

Merged
merged 4 commits into from Jan 2, 2024

Conversation

vbarbaresi
Copy link
Contributor

@vbarbaresi vbarbaresi commented Dec 31, 2023

Fixes issue #375

The bug happened when we had a : in an element attribute that didn't match any XML namespace (invalid XML).
In the bug example it was caused bypadding:1px=""; margin:15px=""
We can workaround this issue by manually stripping bad elements. I hope it doesn't impact performance too much

Testing

To reproduce:
trafilatura -u https://web.archive.org/web/20230619162141/https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G --xml

Minimal reproduction example:

echo 'Testing<ul style="" padding:1px; margin:15px""><b>Features:</b> <li>Saves the cost of two dedicated phone lines.</li> al station using Internet or cellular technology.</li> <li>Requires no change to the existing Fire Alarm Control Panel configuration. The IPGSM-4G connects directly to the primary and secondary telephone ports.</li>' | trafilatura --xml

Notes

I had to add a constraint on lxml < 5.0: a new lxml major version broke a test

Fixes issue adbar#375

The bug happened when we had a `:` in an element attribute that didn't match any XML namespace (invalid XML). In the example it was `padding:1px=""; margin:15px=""`
We can workaround it by manually dropping those bad elements.
I hope it doesn't impact performance too much

To reproduce:
`trafilatura -u  https://web.archive.org/web/20230619162141/https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G --xml`

Minimal reproduction example:
```
echo 'Testing<ul style="" padding:1px; margin:15px""><b>Features:</b> <li>Saves the cost of two dedicated phone lines.</li> al station using Internet or cellular technology.</li> <li>Requires no change to the existing Fire Alarm Control Panel configuration. The IPGSM-4G connects directly to the primary and secondary telephone ports.</li>
' | trafilatura --xml
```
Copy link

codecov bot commented Dec 31, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (396b991) 96.76% compared to head (24cd725) 96.82%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #462      +/-   ##
==========================================
+ Coverage   96.76%   96.82%   +0.06%     
==========================================
  Files          22       22              
  Lines        3367     3371       +4     
==========================================
+ Hits         3258     3264       +6     
+ Misses        109      107       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@adbar adbar linked an issue Jan 2, 2024 that may be closed by this pull request
@adbar adbar merged commit de57ac1 into adbar:master Jan 2, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

XMLSyntaxError during conversion to XML output
2 participants