Remove nasty characters from xml before report is parsed#3
beda42 left a comment:
Looks good. Just one comment about safety - could you please check it? Maybe a test would be called for.
# try to remove nasty characters from xml
raw_converted = "".join(
    map(
        lambda ch: ch if ch.isprintable() else " ",
.isprintable() seems like the correct method to use here, but I would be afraid of messing up something legitimate. Have you tried a few existing XMLs to see what gets replaced? I know it is extra work, but I think that it could prove useful in the long run.
# Missing some mandatory field to extract data ->
# exit right away
if not c_report or not hasattr(c_report, "Customer") or not hasattr(c_report.Customer, "ReportItems"):
if (
This looks like 'blacking'. Are you sure we want to reformat external library code? Or is there some change I missed?
Actually, this change fixes it. I managed to push a commit without proper 'blacking'.
Well, I ran some tests, and .isprintable() may not be what we are looking for: for example, a TAB or an EOL returns False. Even though that might not be much of a problem for our type of data, it is still something I would rather not do.
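This behaviour is easy to reproduce. The sketch below counts the characters that an .isprintable()-based filter would replace. The helper name `replaced_characters` and the sample string are made up for illustration and are not part of this PR.

```python
from collections import Counter


def replaced_characters(text: str) -> Counter:
    """Count the characters an isprintable()-based filter would replace."""
    return Counter(ch for ch in text if not ch.isprintable())


# Hypothetical sample: ordinary XML layout (newline, tab) plus one
# genuinely unwanted control character (vertical tab, U+000B).
sample = '<Report>\n\t<Customer Name="Acme"/>\x0b</Report>'

counts = replaced_characters(sample)
for ch, n in sorted(counts.items()):
    print(f"U+{ord(ch):04X} {ch!r}: {n}")
```

Running something like this over a few real reports would show whether anything besides layout whitespace is being replaced.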
There seems to be a good solution here: https://stackoverflow.com/questions/1707890/fast-way-to-filter-illegal-xml-unicode-chars-in-python It is based on the characters that the XML standard actually disallows.
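A minimal sketch of that idea, assuming the filter should follow the XML 1.0 `Char` production and, as in the current change, replace each offending character with a space. The names `_ILLEGAL_XML_CHARS` and `strip_illegal_xml_chars` are hypothetical, and the code is not copied from the linked answer.

```python
import re
import xml.etree.ElementTree as ET

# Matches everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text: str, replacement: str = " ") -> str:
    """Replace characters that XML 1.0 does not allow; keep TAB, LF and CR."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


# Hypothetical input: NUL and vertical tab are illegal, the layout is not.
raw = "<Report>\n\t<Item>ok\x00\x0b</Item>\r\n</Report>"
cleaned = strip_illegal_xml_chars(raw)

# The cleaned text parses, and tabs and line breaks are preserved.
root = ET.fromstring(cleaned)
print(repr(root.find("Item").text))
```

Unlike the .isprintable() approach, this leaves tabs and line endings alone, so legitimate formatting inside a report is not altered.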