Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new encoding parameter to ascii.read #5448

Merged
merged 5 commits into from
Jun 15, 2017

Conversation

saimn
Copy link
Contributor

@saimn saimn commented Nov 2, 2016

Allow to specify encoding when using ascii.read(...). See #3826 for the motivations.
I added test files only for the simple reader, not sure if I should do the same for other readers ?
cc @taldcroft @pllim

@taldcroft
Copy link
Member

@saimn - thanks!

I think that you might be able to do this with a much smaller footprint and more in the idiom of io.ascii. The many existing parameters like delimiter or comment are carried around in the reader data and header objects. In _get_reader() in io/ascii/core.py you would put some code like (but change comment to encoding):

    if 'comment' in kwargs:
        reader.header.comment = kwargs['comment']
        reader.data.comment = kwargs['comment']

You can also do a global grep for delimiter (including the io.ascii docs) in order to find the other places you need to add encoding to make it as a documented and accepted reader parameter.

@taldcroft
Copy link
Member

For testing, in this case you definitely want to test the values as well as the column header names. So I would suggest removing the standardized tests you added and putting in a new test function that tests the column names and values and dtypes.

To be honest I don't entirely understand the expected output for this in Python 2.7 and even why it is not failing tests now (I have an idea but didn't dig through the code to check). However, if I read the latin1 test table in Py2 and try to print it then this raises an exception. I'm not sure if this is a separate issue that can / should be fixed, but in any case we need to make sure Py2 users don't get a partially working situation. (Either fail up front or fully work).

@saimn
Copy link
Contributor Author

saimn commented Nov 2, 2016

@taldcroft - Good points, I forgot to test on Python 2 but actually it fails on Travis (https://travis-ci.org/astropy/astropy/builds/172562778). The green status on the PR is because I pushed the CHANGES entry later with [ci skip] ... I will take a look !

@taldcroft
Copy link
Member

One possibility is not allowing encoding for Py2, at least within this PR, if it turns out to be a real problem. I have the feeling that to fully support different encodings in Py2 will be a lot of work and end up requiring some of the hacks shown in the csv docs.

@saimn
Copy link
Contributor Author

saimn commented Nov 22, 2016

I pushed some commits to give an update, even though I didn't make progress the last 2 weeks. The current status is:

  • I changed the way to pass encoding like other parameters for io.ascii
  • I was stuck on the tests and what can be supported: it seems more difficult than expected for the cparser (within the FileString class), and I didn't find the time to explore this more in detail.
    One possibility to get this in the next release would be to restrict the use of encoding to Python 3 and the slow readers, this should be quite straightforward.

@taldcroft
Copy link
Member

@saimn - sorry for the slow response. This looks much better now and is definitely close.

I'm good with your approach in the second bullet regarding limiting the functionality to slow readers and Py3. I'm assuming that the existing tests you wrote are failing and that is why you [skip ci]'d the last commit here. So just check for the presence of an encoding kwarg and raise informative exceptions if it is set in a case where encoding is not actually supported. You might raise a NotImplementedError to give some hope/indication that it might be supported in the future.

@@ -2642,7 +2642,7 @@ def footprint_to_file(self, filename=None, color='green', width=2):
color : str, optional
Color to use when plotting the line.

width : int, optional
width : int, optionalastropy/io/ascii/tests/test_read.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental edit.

@@ -72,6 +72,8 @@ def get_package_data():
't/simple3.txt',
't/simple4.txt',
't/simple5.txt',
't/simple_latin1.txt',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like your current test strategy is to generate the appropriate file on the fly, so these are not needed.

print('\n\n******** SKIPPING %s' % testfile['name'])
continue

tmpfile = str(tmpdir.join(os.path.basename(testfile['name'])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment here giving guidance on the overall plan of generating a new table with an encoded column where possible.


format = formats.get(testfile['opts'].get('Reader'))

with open(tmpfile, mode='w', encoding='latin1') as fout:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does utf-8 need to be tested as well?

name = u'à' if not six.PY2 else 'alpha'
col = Column(name=name, data=[six.u(x) for x in table.columns[0]])
table.add_column(col, 0)
table[0][0] = u"àéö"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later on this table value should be tested. After this block you could define a variable table00 = table[0][0] and then test.

@saimn
Copy link
Contributor Author

saimn commented Dec 1, 2016

@taldcroft - no problem for the slow pace, I'm not better ;-). Thanks for your comments, the last commit on tests is really messy and unfinished, sorry about that. I will try to make progress on this soon.

@eteq
Copy link
Member

eteq commented Dec 18, 2016

Looks like we're going to have to push this to 2.0 . (will need to move the changelog entry to that section, too).

@eteq eteq modified the milestones: v2.0.0, v1.3.0 Dec 18, 2016
@taldcroft
Copy link
Member

@saimn - what's your thought on this? Defer to next release or try to push?

@saimn
Copy link
Contributor Author

saimn commented Jun 12, 2017

@taldcroft - I will check if I can get something, at least for Python 3 and the slow reader.

@saimn
Copy link
Contributor Author

saimn commented Jun 14, 2017

@taldcroft - I pushed a new simplified version, a said above it works for slow readers and Python 3.

Copy link
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks great! In fact so nice that I also request putting some mention or example in the Getting Started docs. There is a Note at the end of the Reading section that would be a perfect place to mention the option of specifying the encoding parameter.

'--- --- -----',
' 1 2 héllo']

table = ascii.read(testfile, format=fmt, fast_reader=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a loop here for guess in (True, False): to explicitly include the no-guessing case.

CHANGES.rst Outdated
@@ -30,6 +30,9 @@ New Features

- ``astropy.io.ascii``

- Allow to specify encoding in ``ascii.read``, only for Python 3 and with the
slow readers. [#5448]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say "pure-Python readers" instead of "slow readers". 😄

@saimn
Copy link
Contributor Author

saimn commented Jun 15, 2017

@taldcroft - I addressed your comments, and the builds passed.

@taldcroft taldcroft merged commit ecf7910 into astropy:master Jun 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants