Issue reading CSV files from non-working directory in Windows #4861
What is …

```
(<dask.bytes.local.LocalFileSystem object at 0x000001FD4DCDEF98>, '5dc96918a5449170f49bbbcaef27ea92', ['F:/test_data/filename.csv'])
```

(I moved the data to my F drive from my D drive)
Can you post the full traceback? If you're interested in debugging, I would drop a breakpoint around Line 328 in a6abe3c.
@simaster123, have you had a chance to try Tom's suggestion above?
Apologies for the delay.... It looks like the issue was caused by me calling Dask's read_csv function in a distributed environment. My Dask code works fine on my local machine using just a local worker, but the error was thrown when I started adding my distributed workers. The weird file path used in the traceback seems to have been a red herring. The answer here led me in the right direction: https://stackoverflow.com/questions/50987030/file-not-found-error-in-dask-program-run-on-cluster
Thanks for the update @simaster123.
@simaster123 I am having the same issue using a Jupyter notebook and a UNC path. Based on the link you provided, it's unclear to me what your solution was. I had to map the UNC path to a drive letter for dask.dataframe to read the CSV file. I'm still not sure why a UNC path works in pandas, but not in dask. The function …
@myidealab if you're able to provide a failing example, or to debug where things go wrong, we'd appreciate it.
Just a reminder, you'd need to verify that the workers think the file exists with …
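A minimal sketch of that worker-side check (the path here is hypothetical, and `client` is assumed to be a connected `dask.distributed.Client`):

```python
import os

def file_visible(path):
    # True if the process running this function can see the file at `path`.
    return os.path.exists(path)

# Locally:
print(file_visible("D:/test_data/filename.csv"))  # hypothetical path

# On a cluster, run the same check on every worker (assumption: `client`
# is a connected dask.distributed.Client):
# client.run(file_visible, "D:/test_data/filename.csv")
```

If any worker returns `False`, that worker's filesystem simply does not contain the file at that path, which is the usual cause of this error on a cluster.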
@myidealab To solve my issue, I created identical shares across my cluster and moved everything to the same OS (Ubuntu) using virtual machines. Now, to share data between nodes, I'm using CIFS shares. To address the issue I raised here, I used the same CIFS mounted folder name on each node (e.g., /my_shared_data/) to access the shared folder with the data I wanted to process. In other words, I think the errors I had in my original post arose from me trying to distribute computing across nodes with different operating systems + I didn't have consistent shared folder naming.
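As a rough sketch of that setup (the server name, share name, and mount options below are placeholders, not details from this thread), each node mounts the same share at the same path:

```shell
# Run on every node so the data is visible at an identical path cluster-wide.
# //fileserver/shared and the credentials are hypothetical.
sudo mkdir -p /my_shared_data
sudo mount -t cifs //fileserver/shared /my_shared_data -o username=daskuser,uid=$(id -u)
```

With an identical mount point everywhere, a single path like `/my_shared_data/filename.csv` resolves on the scheduler and on every worker.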
I posted this issue on Stackoverflow a couple of days ago thinking there would be an easy workaround, but I still cannot read CSVs from a different hard drive using Dask's `read_csv` function on Windows.
This seems related to the issue discussed here, but the resolutions discussed there did not work for me: #1885
I tried various combinations of converting '/' to '\', using the os path functions, and using r'file://', with no luck. Since the reading works flawlessly in Pandas, my current workaround is to read the files with Pandas and convert the Pandas df to Dask.
Copying from my Stackoverflow question (https://stackoverflow.com/questions/56315885/dask-throws-filenotfounderror-reading-csv-worked-fine-in-pandas):
I'm trying to port some Pandas code to Dask, and I'm encountering an issue when reading the CSVs: it appears that Dask prepends the local working directory to the file path in the read operation. It works fine when I read using Pandas.
I'm using Windows 10. Working directory is on my C drive; data is in my D drive.
Pandas code:
Output of print(temp_df.head()):
Dask code:
Output of print(temp_dd.head()):
It looks like Dask is prepending the path of my local working directory (the PycharmProjects folder) to the file_path of my data on the D drive, while Pandas does not. Are there any solutions for this?
A few things I tried that did not work:
(1)
```python
temp_file_path_str = pathlib.Path(file_path + item)
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])
```
This returns the same error:
```
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Dan\PycharmProjects\project1_folder/D:\test_data\filename.csv'
```
(2)
```python
temp_file_path_str = r'file://' + file_path + item
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])
```
This returns an error that suggests Dask removed the drive ID from the path:
```
FileNotFoundError: [WinError 3] The system cannot find the path specified: '\test_data\filename.csv'
```
(3)
```python
temp_file_path_str = 'file://' + file_path + item
temp_file_path_str = pathlib.Path(temp_file_path_str)
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])
```
This seems to add an extra \ before the drive ID in the path:
```
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: '\D:\test_data\filename.csv'
```
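For reference, a correct Windows-style join restarts at a new drive letter, which is exactly what the concatenated paths in the errors above fail to do. The stdlib `ntpath` module (the Windows flavour of `os.path`, usable on any OS) shows the expected behaviour; the paths are the ones from the traceback:

```python
import ntpath

cwd = r"C:\Users\Dan\PycharmProjects\project1_folder"
data = r"D:\test_data\filename.csv"

# A proper Windows join discards the old location when the second path
# carries its own drive letter; naive string concatenation does not.
print(ntpath.join(cwd, data))  # D:\test_data\filename.csv
print(cwd + "/" + data)        # the malformed path seen in the first error
```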