Add a new encoding parameter to ascii.read #5448

saimn · 2016-11-02T10:12:14Z

Allow specifying an encoding when using ascii.read(...). See #3826 for the motivation.
I added test files only for the simple reader; I'm not sure whether I should do the same for the other readers.
cc @taldcroft @pllim

taldcroft · 2016-11-02T13:07:19Z

@saimn - thanks!

I think that you might be able to do this with a much smaller footprint and more in the idiom of io.ascii. The many existing parameters like delimiter or comment are carried around in the reader data and header objects. In _get_reader() in io/ascii/core.py you would put some code like (but change comment to encoding):

    if 'comment' in kwargs:
        reader.header.comment = kwargs['comment']
        reader.data.comment = kwargs['comment']

You can also do a global grep for delimiter (including the io.ascii docs) to find the other places where you need to add encoding to make it a documented and accepted reader parameter.

taldcroft · 2016-11-02T13:31:00Z

For testing, in this case you definitely want to test the values as well as the column header names. So I would suggest removing the standardized tests you added and putting in a new test function that tests the column names and values and dtypes.

To be honest I don't entirely understand the expected output for this in Python 2.7 and even why it is not failing tests now (I have an idea but didn't dig through the code to check). However, if I read the latin1 test table in Py2 and try to print it then this raises an exception. I'm not sure if this is a separate issue that can / should be fixed, but in any case we need to make sure Py2 users don't get a partially working situation. (Either fail up front or fully work).

saimn · 2016-11-02T14:09:13Z

@taldcroft - Good points, I forgot to test on Python 2 but actually it fails on Travis (https://travis-ci.org/astropy/astropy/builds/172562778). The green status on the PR is because I pushed the CHANGES entry later with [ci skip] ... I will take a look !

taldcroft · 2016-11-02T22:33:18Z

One possibility is to not allow encoding for Py2, at least within this PR, if it turns out to be a real problem. I have the feeling that fully supporting different encodings in Py2 will be a lot of work and end up requiring some of the hacks shown in the csv docs.

saimn · 2016-11-22T18:40:10Z

I pushed some commits to give an update, even though I didn't make progress the last 2 weeks. The current status is:

- I changed the way encoding is passed so that it works like the other io.ascii parameters.
- I was stuck on the tests and on what can be supported: it seems more difficult than expected for the cparser (within the FileString class), and I didn't find the time to explore this in more detail. One possibility to get this in the next release would be to restrict the use of encoding to Python 3 and the slow readers, which should be quite straightforward.

taldcroft · 2016-11-30T11:54:12Z

@saimn - sorry for the slow response. This looks much better now and is definitely close.

I'm good with your approach in the second bullet regarding limiting the functionality to slow readers and Py3. I'm assuming that the existing tests you wrote are failing and that is why you [skip ci]'d the last commit here. So just check for the presence of an encoding kwarg and raise informative exceptions if it is set in a case where encoding is not actually supported. You might raise a NotImplementedError to give some hope/indication that it might be supported in the future.
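
A minimal sketch of such an up-front check is shown below. The function name `check_encoding_supported` and its `py2` argument are assumptions made for illustration, and `fast_reader` is treated as a plain boolean here, which is a simplification of the real parameter.

```python
import sys


def check_encoding_supported(fast_reader, encoding=None,
                             py2=sys.version_info[0] < 3):
    """Hypothetical guard: fail early where encoding is not supported."""
    if encoding is None:
        # Nothing requested, so every reader and Python version is fine.
        return
    if py2:
        raise NotImplementedError(
            'specifying an encoding is not supported on Python 2')
    if fast_reader:
        raise NotImplementedError(
            'specifying an encoding is not supported with the fast reader; '
            'pass fast_reader=False')


try:
    check_encoding_supported(fast_reader=True, encoding='latin1', py2=False)
except NotImplementedError as exc:
    print(exc)
```

Raising before any parsing starts avoids the partially working situation mentioned earlier in this thread: the call either fails immediately with a clear message or runs with full support.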

taldcroft · 2016-11-30T11:31:20Z

astropy/wcs/wcs.py

@@ -2642,7 +2642,7 @@ def footprint_to_file(self, filename=None, color='green', width=2):
        color : str, optional
            Color to use when plotting the line.

-        width : int, optional
+        width : int, optional


Accidental edit.

taldcroft · 2016-11-30T11:55:52Z

astropy/io/ascii/setup_package.py

@@ -72,6 +72,8 @@ def get_package_data():
                                   't/simple3.txt',
                                   't/simple4.txt',
                                   't/simple5.txt',
+                                   't/simple_latin1.txt',


It looks like your current test strategy is to generate the appropriate file on the fly, so these are not needed.

taldcroft · 2016-11-30T11:57:34Z

astropy/io/ascii/tests/test_read.py

+            print('\n\n******** SKIPPING %s' % testfile['name'])
+            continue
+
+        tmpfile = str(tmpdir.join(os.path.basename(testfile['name'])))


Add a comment here giving guidance on the overall plan of generating a new table with an encoded column where possible.

taldcroft · 2016-11-30T11:58:03Z

astropy/io/ascii/tests/test_read.py

+
+        format = formats.get(testfile['opts'].get('Reader'))
+
+        with open(tmpfile, mode='w', encoding='latin1') as fout:


Does utf-8 need to be tested as well?
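
For context on why both encodings are worth covering, here is a small standard-library sketch (it does not use io.ascii; the file names are made up for illustration). It writes the same text as latin1 and as utf-8, reads each file back with the matching encoding, and checks that the latin1 bytes cannot be decoded as utf-8.

```python
import os
import tempfile

# One header row and one data row containing a non-ASCII character.
text = "a b c\n1 2 h\u00e9llo\n"

decoded = {}
with tempfile.TemporaryDirectory() as tmpdir:
    for enc in ("latin1", "utf-8"):
        path = os.path.join(tmpdir, "table_%s.txt" % enc)
        with open(path, mode="w", encoding=enc) as fout:
            fout.write(text)
        # Reading with the same encoding must round-trip the text.
        with open(path, encoding=enc) as fin:
            decoded[enc] = fin.read()

    # Keep the raw latin1 bytes to show what a wrong guess would do.
    with open(os.path.join(tmpdir, "table_latin1.txt"), "rb") as fin:
        raw_latin1 = fin.read()

try:
    raw_latin1.decode("utf-8")
    mismatch_fails = False
except UnicodeDecodeError:
    mismatch_fails = True
```

The two files differ on disk even though they decode to the same string, so a test that only writes latin1 never exercises the multi-byte case.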

taldcroft · 2016-11-30T12:00:10Z

astropy/io/ascii/tests/test_read.py

+            name = u'à' if not six.PY2 else 'alpha'
+            col = Column(name=name, data=[six.u(x) for x in table.columns[0]])
+            table.add_column(col, 0)
+            table[0][0] = u"àéö"


Later on this table value should be tested. After this block you could define a variable table00 = table[0][0] and then test.

saimn · 2016-12-01T11:51:58Z

@taldcroft - no problem for the slow pace, I'm not better ;-). Thanks for your comments, the last commit on tests is really messy and unfinished, sorry about that. I will try to make progress on this soon.

eteq · 2016-12-18T06:32:34Z

Looks like we're going to have to push this to 2.0. (Will need to move the changelog entry to that section, too.)

taldcroft · 2017-06-12T12:32:37Z

@saimn - what's your thought on this? Defer to next release or try to push?

saimn · 2017-06-12T15:59:28Z

@taldcroft - I will check if I can get something, at least for Python 3 and the slow reader.

saimn · 2017-06-14T09:49:36Z

@taldcroft - I pushed a new simplified version; as said above, it works for slow readers and Python 3.

taldcroft

Thanks, looks great! In fact so nice that I also request putting some mention or example in the Getting Started docs. There is a Note at the end of the Reading section that would be a perfect place to mention the option of specifying the encoding parameter.

taldcroft · 2017-06-14T10:36:35Z

astropy/io/ascii/tests/test_read.py

+                                   '--- --- -----',
+                                   '  1   2 héllo']
+
+        table = ascii.read(testfile, format=fmt, fast_reader=False,


Add a loop here for guess in (True, False): to explicitly include the no-guessing case.
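
The shape of such a loop can be sketched as follows. `read_table` is a hypothetical stand-in with a guessing and a non-guessing code path, written only so the example runs without astropy; it is not how ascii.read works internally.

```python
import os
import tempfile


def read_table(path, encoding=None, guess=True):
    """Hypothetical reader with two code paths that must both honor encoding."""
    with open(path, encoding=encoding) as fh:
        lines = [line.rstrip("\n") for line in fh if line.strip()]
    # With guessing, try a comma first and then a space; otherwise space only.
    delimiters = [",", " "] if guess else [" "]
    for delim in delimiters:
        rows = [line.split(delim) for line in lines]
        # Accept the first delimiter giving a consistent multi-column table.
        if len({len(row) for row in rows}) == 1 and len(rows[0]) > 1:
            return rows
    raise ValueError("could not parse table")


results = {}
with tempfile.TemporaryDirectory() as tmpdir:
    testfile = os.path.join(tmpdir, "table.txt")
    with open(testfile, "w", encoding="latin1") as fout:
        fout.write("a b c\n1 2 h\u00e9llo\n")

    # Cover both the guessing and the no-guessing path explicitly.
    for guess in (True, False):
        rows = read_table(testfile, encoding="latin1", guess=guess)
        assert rows[0] == ["a", "b", "c"]
        assert rows[1] == ["1", "2", "h\u00e9llo"]
        results[guess] = rows
```

Looping over both values guards against the encoding being forwarded on only one of the two paths.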

taldcroft · 2017-06-14T10:39:07Z

CHANGES.rst

@@ -30,6 +30,9 @@ New Features

 - ``astropy.io.ascii``

+  - Allow to specify encoding in ``ascii.read``, only for Python 3 and with the
+    slow readers. [#5448]


Let's say "pure-Python readers" instead of "slow readers". 😄

saimn · 2017-06-15T07:42:06Z

@taldcroft - I addressed your comments, and the builds passed.

taldcroft self-assigned this Nov 2, 2016

taldcroft added io.ascii Affects-release Enhancement labels Nov 2, 2016

taldcroft added this to the v1.3.0 milestone Nov 2, 2016

saimn force-pushed the ascii-encoding branch from fcf08e8 to 7cffbd2 Compare November 22, 2016 18:30

taldcroft requested changes Nov 30, 2016

eteq modified the milestones: v2.0.0, v1.3.0 Dec 18, 2016

saimn added 4 commits June 13, 2017 23:09

Add a new encoding parameter to ascii.read

14703c3

Currently passed to get_readable_fileobj only when guess is True

Forbid the use of encoding with the fast reader or py2

91cfa6a

Rework test

a4d7192

Detail scope in changelog

701d297

saimn force-pushed the ascii-encoding branch from 7cffbd2 to 701d297 Compare June 13, 2017 22:57

taldcroft requested changes Jun 14, 2017

Address review comments

273e981

taldcroft approved these changes Jun 15, 2017

taldcroft merged commit ecf7910 into astropy:master Jun 15, 2017

saimn deleted the ascii-encoding branch June 15, 2017 10:06

This was referenced Jun 15, 2017

Reading an "ascii" table that is unicode #3826

Closed

Better handling of non-ASCII encoded data in io.ascii #2923

Closed
