Skip to content

Remove nasty characters from xml before report is parsed#3

Open
shenek wants to merge 1 commit into2.xfrom
nasty-characters-in-xml
Open

Remove nasty characters from xml before report is parsed#3
shenek wants to merge 1 commit into2.xfrom
nasty-characters-in-xml

Conversation

@shenek
Copy link
Copy Markdown
Collaborator

@shenek shenek commented Aug 18, 2020

No description provided.

@shenek shenek requested a review from beda42 August 27, 2020 08:50
@shenek shenek self-assigned this Aug 27, 2020
@shenek shenek added the bug Something isn't working label Aug 27, 2020
Copy link
Copy Markdown
Owner

@beda42 beda42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just one comment about safety - could you please check it? Maybe a test would be called for.

Comment thread pycounter/sushi.py
# try to remove nasty characters from xml
raw_converted = "".join(
map(
lambda ch: ch if ch.isprintable() else " ",
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.isprintable() seems like the correct method to use here, but I would be afraid of messing up something legitimate. Have you tried a few existing XMLs to see what gets replaced? I know it is extra work, but I think that it could prove useful in the long run.

Comment thread pycounter/sushi.py
# Missing some mandatory field to extract data ->
# exit right away
if not c_report or not hasattr(c_report, "Customer") or not hasattr(c_report.Customer, "ReportItems"):
if (
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like 'blacking', are you sure we want to reformat external library code? Or is there some change I missed?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this change fixes it. I manage to push a commit without a proper 'blacking'.

@beda42
Copy link
Copy Markdown
Owner

beda42 commented Aug 27, 2020

Well, I made some tests and .isprintable() may not be what we are looking for - for example a TAB or a EOL return False. Even though it might not be such a problem for our type of data, it is still something I would rather not do.

@beda42
Copy link
Copy Markdown
Owner

beda42 commented Aug 27, 2020

It seems there is a good solution here https://stackoverflow.com/questions/1707890/fast-way-to-filter-illegal-xml-unicode-chars-in-python It is based on characters really disallowed in XML by the standard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants