Skip to content
This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

[PRED-2644] Fix decoding error in case of dialect detection #161

Merged
merged 3 commits into from
Jul 19, 2019

Conversation

falkerson
Copy link
Contributor

@falkerson falkerson commented Jul 5, 2019

BATCH SCORING PULL REQUEST

This is a pull request into a public repository for Batch Scoring script maintained by DataRobot.

RATIONALE

We open some file in binary mode and read some N bytes to detect encoding. Later we use this bytes to detect dialect but before it we decode() them into string(unicode). Because we have const number of N, it's possible that during bytes reading last character may be torn apart and then during decode() we can't identify that character.

CHANGES

As solution we will read first N characters instead of bytes to detect dialect. Because it's called only once no huge performance degradation expected.

TESTING

@devexp-slackbot
Copy link

@@ -432,6 +432,19 @@ def sniff_dialect(sample, encoding, sep, skip_dialect, ui):
return dialect


def get_opener_and_mode(is_gz, text=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's going to work for gzipped japanese dataset. I think we need some sort of combination for is_gz and text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in PY3 gzip.open has encoding parameter. For PY2 please take a look here: https://github.com/datarobot/batch-scoring/pull/161/files#diff-6c240e73b54162e5fce6a481761017a2R527

@falkerson falkerson force-pushed the andriy-popovych/PRED-2644 branch 3 times, most recently from 78b621c to bd1eb21 Compare July 15, 2019 11:51
We open some file in binary mode and read some N bytes to detect encoding.
Later we use this bytes to detect dialect but before it we decode() them
into string(unicode). Because we have const number of N, it's possible that
during bytes reading last character may be torn apart and then during decode()
we can't identify that character.
@coveralls
Copy link

coveralls commented Jul 15, 2019

Coverage Status

Coverage increased (+0.2%) to 84.24% when pulling 1b56f0e on andriy-popovych/PRED-2644 into 13da21e on master.

tsh
tsh previously approved these changes Jul 16, 2019
else:
mode = 'rt' if text else 'rb'
return (gzip.open, mode)
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should work, however, I think the whole else can be written using io:

mode = 'rt' if text else 'rb'
return (io.open, mode)

because:

  • io.open uses universal newlines by default (no need to pass U in openning mode)
  • io.open is open in Python 3

ikalnytskyi
ikalnytskyi previously approved these changes Jul 17, 2019
Copy link
Member

@ikalnytskyi ikalnytskyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'd just add a bunch of tests to cover gzipped version of datasets, and ensure encoding and encoding sniffing works fine with them, and maybe add some test japanese dataset too.

@falkerson falkerson dismissed stale reviews from ikalnytskyi and tsh via f0be1b8 July 19, 2019 10:15
actual = out.read_text('utf-8')
with open('tests/fixtures/jpReview_books_reg_out.csv', 'rU') as f:
expected = f.read()
assert str(actual) == str(expected), expected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does str works on Python 2? I mean .read_text should return unicode, and casting a unicode to str should probably fail o_O I'd expect to see here six.text_type instead of str.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't matter in case of output. Here we have:
row_id,0.0,1.0
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0

so it can be converted as ascii

@falkerson falkerson merged commit 95ff508 into master Jul 19, 2019
@falkerson falkerson deleted the andriy-popovych/PRED-2644 branch July 19, 2019 15:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants