
Issue reading CSV files from non-working directory in Windows #4861

Closed
simaster123 opened this issue May 30, 2019 · 9 comments


@simaster123

simaster123 commented May 30, 2019

I posted this issue on Stack Overflow a couple of days ago thinking there would be an easy workaround, but I still cannot read CSVs from a different hard drive using Dask's read_csv function in Windows.

This seems related to the issue discussed here, but the resolutions discussed there did not work for me: #1885

I tried various combinations of converting '/' to '\\', using the os.path functions, and prefixing the path with r'file://', all with no luck. Since reading works flawlessly in pandas, my current workaround is to read the files with pandas and convert the pandas DataFrame to a Dask DataFrame.

Copying from my Stack Overflow question (https://stackoverflow.com/questions/56315885/dask-throws-filenotfounderror-reading-csv-worked-fine-in-pandas):

I'm trying to port some pandas code to Dask, and I'm encountering an issue when reading the CSVs - it appears that Dask prepends the local working directory to the file path in the read operation. It works fine when I read using pandas.

I'm using Windows 10. Working directory is on my C drive; data is in my D drive.

Pandas code:

import pandas as pd

file_path = 'D:/test_data/'
item = 'filename.csv'
temp_df = pd.read_csv(file_path + item, usecols=['time', 'ticker_price'])

Output of print(temp_df.head()):

                         time  ticker_price
0  2019-05-15 09:34:09.233373       0.02843
1  2019-05-15 09:34:11.334135       0.02843
2  2019-05-15 09:34:12.147282       0.02843
3  2019-05-15 09:34:13.705145       0.02843
4  2019-05-15 09:34:14.521257       0.02843
type = <class 'pandas.core.frame.DataFrame'>

Dask code:

import dask.dataframe as dd

file_path = 'D:/test_data/'
item = 'filename.csv'
temp_dd = dd.read_csv(file_path + item, usecols=['time', 'ticker_price'])

Output of print(temp_dd.head()):

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Dan\\PycharmProjects\\project1_folder/D:/test_data/filename.csv'

It looks like Dask is appending the file_path to my data on the D drive to the path of my local working directory (the PycharmProjects folder), while Pandas does not. Are there any solutions for this?

A few things I tried that did not work:

(1)

temp_file_path_str = pathlib.Path(file_path + item)
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])

This returns the same error:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Dan\PycharmProjects\project1_folder/D:\test_data\filename.csv'

(2)

temp_file_path_str = r'file://' + file_path + item
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])

This returns an error that suggests Dask removed the drive ID from the path:

FileNotFoundError: [WinError 3] The system cannot find the path specified: '\test_data\filename.csv'

(3)

temp_file_path_str = 'file://' + file_path + item
temp_file_path_str = pathlib.Path(temp_file_path_str)
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])

This seems to add an extra \ before the drive ID in the path:

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: '\D:\test_data\filename.csv'

@TomAugspurger
Member

What is dask.bytes.core.get_fs_token_paths(file_path + item) on your system?
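For reference, a sketch of calling that helper. In 2019 it lived at dask.bytes.core.get_fs_token_paths; in current releases the same function is provided by fsspec. The POSIX path below is purely illustrative:

```python
from fsspec.core import get_fs_token_paths

# Resolves a path string into a filesystem object, a deterministic
# cache token, and the expanded list of concrete paths - the same
# triple the issue reporter pasted below.
fs, token, paths = get_fs_token_paths('/tmp/example.csv')
print(type(fs).__name__, paths)
```

Inspecting the returned paths shows exactly what path string dask will hand to the underlying filesystem.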

@simaster123
Author

simaster123 commented May 30, 2019

(<dask.bytes.local.LocalFileSystem object at 0x000001FD4DCDEF98>, '5dc96918a5449170f49bbbcaef27ea92', ['F:/test_data/filename.csv'])

(I moved the data to my F drive from my D drive)

@TomAugspurger
Member

Can you post the full traceback?

If you're interested in debugging, I would drop a pdb.set_trace() somewhere around

b_lineterminator = lineterminator.encode()
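A hypothetical sketch of that debugging approach - a stand-in function for the dask CSV-reader internals, with the breakpoint dropped just before the line quoted above so the path variables can be inspected interactively:

```python
import pdb

def encode_lineterminator(lineterminator='\n'):
    # Stand-in for the dask CSV reader internals: pausing here lets
    # you inspect the path variables the reader actually sees.
    pdb.set_trace()
    b_lineterminator = lineterminator.encode()
    return b_lineterminator
```

At the pdb prompt, printing the local variables shows whether the working directory has already been prepended by that point.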

@jakirkham
Member

@simaster123, have you had a chance to try Tom's suggestion above?

@simaster123
Author

Apologies for the delay....

It looks like the issue was caused by me calling Dask's read_csv function in a distributed environment. My Dask code works fine on my local machine using just a local worker, but the error was thrown when I started adding my distributed workers.

The weird file path used in the traceback seems to have been a red herring. The answer here led me in the right direction: https://stackoverflow.com/questions/50987030/file-not-found-error-in-dask-program-run-on-cluster

@jakirkham
Member

Thanks for the update @simaster123.

@myidealab

myidealab commented Dec 10, 2019

@simaster123 I am having the same issue using a Jupyter notebook and a UNC path. Based on the link you provided, it's unclear to me what your solution was.

I had to map the UNC path to a drive letter for dask.dataframe to read the CSV file.

I'm still not sure why the UNC path works in pandas but not in Dask. The function os.path.exists() was used to confirm that the file is there.
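The check described above can be sketched as follows. The UNC path and its drive-letter mapping are hypothetical; on Windows the mapping could be created with something like `net use Z: \\server\share`:

```python
import os

# Hypothetical UNC path and its drive-letter mapping; adjust to your share.
unc_path = r'\\server\share\test_data\filename.csv'
mapped_path = 'Z:/test_data/filename.csv'

# Confirm what the process can actually see before handing the path to
# dask: pandas accepted the UNC form, but dask needed the mapped drive.
results = {p: os.path.exists(p) for p in (unc_path, mapped_path)}
print(results)
```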

@TomAugspurger
Member

@myidealab if you're able to provide a failing example, or to debug where things go wrong, we'd appreciate it.

The function os.path.exists() was used to confirm that the file is there.

Just a reminder: you'd need to verify with os.path.exists that the workers can see the file, in addition to checking from your client process.
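That worker-side check can be sketched like this, using a local in-process client as a stand-in for a real cluster; the path is a hypothetical shared mount:

```python
import os
from dask.distributed import Client

# A local in-process client stands in for a real cluster here; with a
# distributed cluster you would pass the scheduler address instead.
client = Client(processes=False, n_workers=1, threads_per_worker=1)

path = '/my_shared_data/filename.csv'  # hypothetical shared mount

# os.path.exists on the client checks only this machine; client.run
# executes the same check on every worker in the cluster.
local_ok = os.path.exists(path)
worker_ok = client.run(os.path.exists, path)
print(local_ok, worker_ok)
client.close()
```

client.run returns a dict mapping each worker's address to its result, so a file visible to the client but missing from a worker shows up immediately.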

@simaster123
Author

@myidealab To solve my issue, I created identical shares across my cluster and moved everything to the same OS (Ubuntu) using virtual machines. Now, to share data between nodes, I'm using CIFS shares. To address the issue I raised here, I used the same CIFS mounted folder name on each node (e.g., /my_shared_data/) to access the shared folder with the data I wanted to process.

In other words, I think the errors in my original post arose from trying to distribute computation across nodes with different operating systems, combined with inconsistent shared-folder naming.
