Fix relative path parsing on windows when using fastparquet #4445

Merged

3 participants
@Dimplexion
Contributor

commented Jan 31, 2019

  • Tests added / passed
  • Passes flake8 dask

Some tests in dataframe/io/tests/test_parquet.py were failing on Windows on my local setup due to a file path issue.

Here's an example of the failure:

C:\development\dask>py.test dask\dataframe\io\tests\test_parquet.py --pdb
=========================================================================================================== test session starts ============================================================================================================
platform win32 -- Python 3.6.6, pytest-4.1.1, py-1.7.0, pluggy-0.8.1
rootdir: C:\development\dask, inifile: setup.cfg
collected 156 items

dask\dataframe\io\tests\test_parquet.py ..x...x.....xx..s.F
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

tmpdir = local('C:\\Users\\Janne\\AppData\\Local\\Temp\\pytest-of-janne.vuorela\\pytest-18\\test_read_glob_pyarrow_fastpar0'), write_engine = 'pyarrow', read_engine = 'fastparquet'

    @write_read_engines()
    def test_read_glob(tmpdir, write_engine, read_engine):
        if write_engine == read_engine == 'fastparquet' and os.name == 'nt':
            # fastparquet or dask is not normalizing filepaths correctly on
            # windows.
            pytest.skip("filepath bug.")
        fn = str(tmpdir)
        ddf.to_parquet(fn, engine=write_engine)
        if os.path.exists(os.path.join(fn, '_metadata')):
            os.unlink(os.path.join(fn, '_metadata'))

        files = os.listdir(fn)
        assert '_metadata' not in files

        # Infer divisions for engines/versions that support it

        ddf2 = dd.read_parquet(os.path.join(fn, '*.parquet'), engine=read_engine,
                               infer_divisions=should_check_divs(write_engine) and should_check_divs(read_engine))
        assert_eq(ddf, ddf2, check_divisions=should_check_divs(write_engine) and should_check_divs(read_engine))

        # No divisions
        ddf2_no_divs = dd.read_parquet(os.path.join(fn, '*.parquet'),
                                       engine=read_engine, infer_divisions=False)
>       assert_eq(ddf.clear_divisions(), ddf2_no_divs, check_divisions=True)

dask\dataframe\io\tests\test_parquet.py:200:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask\dataframe\utils.py:686: in assert_eq
    b = _check_dask(b, check_names=check_names, check_dtypes=check_dtypes)
dask\dataframe\utils.py:615: in _check_dask
    result = dsk.compute(scheduler='sync')
dask\base.py:156: in compute
    (result,) = compute(self, traverse=False, **kwargs)
dask\base.py:398: in compute
    results = schedule(dsk, keys, **kwargs)
dask\local.py:500: in get_sync
    return get_async(apply_sync, 1, dsk, keys, **kwargs)
dask\local.py:446: in get_async
    fire_task()
dask\local.py:442: in fire_task
    callback=queue.put)
dask\local.py:489: in apply_sync
    res = func(*args, **kwds)
dask\local.py:235: in execute_task
    result = pack_exception(e, dumps)
dask\local.py:230: in execute_task
    result = _execute_task(task, data)
dask\core.py:119: in _execute_task
    return func(*args2)
dask\dataframe\io\parquet.py:378: in _read_pf_simple
    df = pf.to_pandas(all_columns, categories, index=index_names)
..\fastparquet\fastparquet\api.py:439: in to_pandas
    assign=parts)
..\fastparquet\fastparquet\api.py:242: in read_row_group_file
    assign=assign, scheme=self.file_scheme)
..\fastparquet\fastparquet\core.py:302: in read_row_group_file
    with open(fn, mode='rb') as f:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <dask.bytes.local.LocalFileSystem object at 0x00000233F9D3D908>
path = 'C:/Users/Janne/AppData/Local/Temp/pytest-of-janne.vuorela/pytest-18/test_read_glob_pyarrow_fastpar0/C:/Users/Janne/AppData/Local/Temp/pytest-of-janne.vuorela/pytest-18/test_read_glob_pyarrow_fastpar0/part.0.parquet', mode = 'rb'
kwargs = {}

    def open(self, path, mode='rb', **kwargs):
        """Make a file-like object

        Parameters
        ----------
        mode: string
            normally "rb", "wb" or "ab" or other.
        kwargs: key-value
            Any other parameters, such as buffer size. May be better to set
            these on the filesystem instance, to apply to all files created by
            it. Not used for local.
        """
>       return open(self._normalize_path(path), mode=mode)
E       OSError: [Errno 22] Invalid argument: 'C:\\Users\\Janne\\AppData\\Local\\Temp\\pytest-of-janne.vuorela\\pytest-18\\test_read_glob_pyarrow_fastpar0\\C:\\Users\\Janne\\AppData\\Local\\Temp\\pytest-of-janne.vuorela\\pytest-18\\test_read_glob_pyarrow_fastpar0\\part.0.parquet'

dask\bytes\local.py:58: OSError
@mrocklin

Member

commented Jan 31, 2019

cc @martindurant do you have time to take a look at this?

@martindurant

Member

commented Jan 31, 2019

Glad to see a fix here, but I must admit I don't really understand the error. Is the same thing done within fastparquet ParquetFile, or does it too need a fix?

@martindurant

Member

commented Jan 31, 2019

In short, I am +1 here, but would appreciate it if you (@Dimplexion) would check the situation for fastparquet stand-alone, which is not tested on Windows under CI.

@Dimplexion

Contributor Author

commented Jan 31, 2019

Sure, I'll check it there and will also give more details about it. I was a little unsure how these projects operate together, so I just made the request here as it fixed the tests for me.

@martindurant

Member

commented Jan 31, 2019

Sometimes we are unsure too :)

@martindurant martindurant merged commit b1dd6ee into dask:master Jan 31, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@Dimplexion

Contributor Author

commented Feb 1, 2019

It seems to me that fastparquet is internally handling the paths correctly. When running the tests on Windows it seems that one of them failed, though, so I'll be taking a look at that later when I have time and will raise an issue or create a pull request on the repo.

The case here was that fastparquet internally handles the path as it should (by converting it to the universal format). The problem arose because the variable base comes from calling analyse_paths, which returns the base of the path in the fastparquet universal format, while the list of paths was used directly without going through any of the fastparquet functions. So if paths == ['C:\\tmp\\parquet\\'] then base == 'C:/tmp/parquet/'. Later, when attempting to get the relative path to the file with the line relpath = path.replace(base, '').lstrip('/'), nothing is replaced because the separators differ, so the end result is just relpath == 'C:\\tmp\\parquet\\', which is not what it was supposed to be. This is then appended to the base path, and the long malformed path ends up being passed to the open function: OSError: [Errno 22] Invalid argument: 'C:\\Users\\Janne\\AppData\\Local\\Temp\\pytest-of-janne.vuorela\\pytest-18\\test_read_glob_pyarrow_fastpar0\\C:\\Users\\Janne\\AppData\\Local\\Temp\\pytest-of-janne.vuorela\\pytest-18\\test_read_glob_pyarrow_fastpar0\\part.0.parquet'.
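
A minimal sketch of the mismatch described above. The paths and the two helper functions are made up for illustration and are not taken from the dask or fastparquet sources; the only line borrowed from the discussion is the `path.replace(base, '').lstrip('/')` expression.

```python
# Hypothetical values: a base in forward-slash form and a raw Windows path.
base = "C:/tmp/parquet"
path = "C:\\tmp\\parquet\\part.0.parquet"


def relpath_naive(path, base):
    """Relative path computed directly on the raw path (the failing case)."""
    return path.replace(base, "").lstrip("/")


def relpath_normalized(path, base):
    """Same expression, but with backslashes converted to '/' first.

    This is an assumed way to avoid the mismatch, not necessarily
    the change made in this pull request.
    """
    return path.replace("\\", "/").replace(base, "").lstrip("/")


# The separators differ, so str.replace finds nothing to remove and
# the "relative" path is still the full absolute path.
naive = relpath_naive(path, base)
print(naive == path)

# Joining it back onto the base yields a doubled path like the one in
# the OSError above.
print(base + "/" + naive)

# After normalizing the separators, only the file name remains.
print(relpath_normalized(path, base))
```

Because `str.replace` is a plain substring operation, it fails silently when the two strings use different separators, which is why the error only surfaces later in `open`.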

@martindurant

Member

commented Feb 1, 2019

Thanks very much for looking into it. Can I please ask you to repost this on the fastparquet repo?

@Dimplexion

Contributor Author

commented Feb 1, 2019

As far as I can see, this issue is only related to how dask is using fastparquet rather than being an issue in fastparquet itself. I guess fastparquet could expose a function that converts a list of system paths into the universal fastparquet format, which would make errors like this less likely. Or am I misunderstanding something?
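
A rough sketch of what such a helper could look like. Both function names are hypothetical and nothing here is an existing fastparquet API; it relies only on the standard library (`pathlib.PureWindowsPath` and `posixpath.commonpath`), and it assumes the inputs are Windows-style or already forward-slash paths.

```python
import posixpath
from pathlib import PureWindowsPath


def to_universal(paths):
    """Hypothetical helper: convert paths to forward-slash form.

    PureWindowsPath accepts both '\\' and '/' as separators, so paths
    that are already in forward-slash form pass through unchanged.
    """
    return [PureWindowsPath(p).as_posix() for p in paths]


def split_base(paths):
    """Return (base, relative_paths), all in forward-slash form."""
    universal = to_universal(paths)
    base = posixpath.commonpath(universal)
    relative = [p[len(base):].lstrip("/") for p in universal]
    return base, relative


paths = [
    "C:\\tmp\\parquet\\part.0.parquet",
    "C:/tmp/parquet/sub/part.1.parquet",
]
base, relative = split_base(paths)
print(base)
print(relative)
```

Since the base and the relative paths are derived from the same normalized list, they cannot disagree on separators.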

@martindurant

Member

commented Feb 1, 2019

> it seems that one of them failed

I mean this one - if you figure out why it happens, I'd be keen to fix it.
I agree that your fix here appears to work for fastparquet-in-dask.

@Dimplexion

Contributor Author

commented Feb 1, 2019

Ah alright, I just opened the issue on it. :)

dask/fastparquet#395

@Dimplexion Dimplexion deleted the Dimplexion:fix-relative-path-parsing-on-windows branch Feb 1, 2019

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Fix relative path parsing on windows when using fastparquet (dask#4445)
* Fix fastparquet relative path parsing on Windows

* Fix path parsing when using fastparquet reader on Windows