
Issue reading CSV files from non-working directory in Windows #4861

Closed
simaster123 opened this issue May 30, 2019 · 9 comments


@simaster123

simaster123 commented May 30, 2019

I posted this issue on Stack Overflow a couple of days ago thinking there would be an easy workaround, but I still cannot read CSVs from a different hard drive using Dask's read_csv function in Windows.

This seems related to the issue discussed here, but the resolutions discussed there did not work for me: #1885

I tried various combinations of converting '/' to '\\', using the os.path functions, and prefixing the path with r'file://', all with no luck. Since reading works flawlessly in pandas, my current workaround is to read the files with pandas and convert the pandas DataFrame to a Dask DataFrame.

Copying from my Stack Overflow question (https://stackoverflow.com/questions/56315885/dask-throws-filenotfounderror-reading-csv-worked-fine-in-pandas):

I'm trying to port some pandas code to Dask, and I'm encountering an issue when reading the CSVs - it appears that Dask prepends the local working directory to the file path in the read operation. It works fine when I read using pandas.

I'm using Windows 10. Working directory is on my C drive; data is in my D drive.

Pandas code:

import pandas as pd

file_path = 'D:/test_data/'
item = 'filename.csv'
temp_df = pd.read_csv(file_path + item, usecols=['time', 'ticker_price'])

Output of print(temp_df.head()):

                         time  ticker_price
0  2019-05-15 09:34:09.233373       0.02843
1  2019-05-15 09:34:11.334135       0.02843
2  2019-05-15 09:34:12.147282       0.02843
3  2019-05-15 09:34:13.705145       0.02843
4  2019-05-15 09:34:14.521257       0.02843
type = <class 'pandas.core.frame.DataFrame'>

Dask code:

import dask.dataframe as dd

file_path = 'D:/test_data/'
item = 'filename.csv'
temp_dd = dd.read_csv(file_path + item, usecols=['time', 'ticker_price'])

Output of print(temp_dd.head()):

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Dan\\PycharmProjects\\project1_folder/D:/test_data/filename.csv'

It looks like Dask is appending the file_path to my data on the D drive to the path of my local working directory (the PycharmProjects folder), while Pandas does not. Are there any solutions for this?

A few things I tried that did not work:

(1)

temp_file_path_str = pathlib.Path(file_path + item)
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])

This returns the same error:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Dan\PycharmProjects\project1_folder/D:\test_data\filename.csv'

(2)

temp_file_path_str = r'file://' + file_path + item
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])

This returns an error that suggests Dask removed the drive ID from the path:

FileNotFoundError: [WinError 3] The system cannot find the path specified: '\test_data\filename.csv'

(3)

temp_file_path_str = 'file://' + file_path + item
temp_file_path_str = pathlib.Path(temp_file_path_str)
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])

This seems to add an extra \ before the drive ID in the path:

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: '\D:\test_data\filename.csv'

@TomAugspurger
Member

What is dask.bytes.core.get_fs_token_paths(file_path + item) on your system?
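For reference, a sketch of calling that helper. In 2019 it lived at dask.bytes.core.get_fs_token_paths; in current releases the same function is provided by fsspec. The POSIX path below is purely illustrative:

```python
from fsspec.core import get_fs_token_paths

# Resolves a path string into a filesystem object, a deterministic
# cache token, and the expanded list of concrete paths - the same
# triple the issue reporter pasted below.
fs, token, paths = get_fs_token_paths('/tmp/example.csv')
print(type(fs).__name__, paths)
```

Inspecting the returned paths shows exactly what path string dask will hand to the underlying filesystem.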

@simaster123
Author

simaster123 commented May 30, 2019

(<dask.bytes.local.LocalFileSystem object at 0x000001FD4DCDEF98>, '5dc96918a5449170f49bbbcaef27ea92', ['F:/test_data/filename.csv'])

(I moved the data to my F drive from my D drive)

@TomAugspurger
Member

Can you post the full traceback?

If you're interested in debugging, I would drop a pdb.set_trace() somewhere around

b_lineterminator = lineterminator.encode()
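A hypothetical sketch of that debugging approach - a stand-in function for the dask CSV-reader internals, with the breakpoint dropped just before the line quoted above so the path variables can be inspected interactively:

```python
import pdb

def encode_lineterminator(lineterminator='\n'):
    # Stand-in for the dask CSV reader internals: pausing here lets
    # you inspect the path variables the reader actually sees.
    pdb.set_trace()
    b_lineterminator = lineterminator.encode()
    return b_lineterminator
```

At the pdb prompt, printing the local variables shows whether the working directory has already been prepended by that point.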

@jakirkham
Member

@simaster123, have you had a chance to try Tom's suggestion above?

@simaster123
Author

Apologies for the delay....

It looks like the issue was caused by me calling Dask's read_csv function in a distributed environment. My Dask code works fine on my local machine using just a local worker, but the error was thrown when I started adding my distributed workers.

The weird file path used in the traceback seems to have been a red herring. The answer here led me in the right direction: https://stackoverflow.com/questions/50987030/file-not-found-error-in-dask-program-run-on-cluster

@jakirkham
Member

Thanks for the update @simaster123.

@myidealab

myidealab commented Dec 10, 2019

@simaster123 I am having the same issue using a Jupyter notebook and a UNC path. Based on the link you provided, it's unclear to me what your solution was.

I had to map the UNC path to a drive letter for dask.dataframe to read the CSV file.

I'm still not sure why the UNC path works in pandas but not in Dask. The function os.path.exists() was used to confirm that the file is there.
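The check described above can be sketched as follows. The UNC path and its drive-letter mapping are hypothetical; on Windows the mapping could be created with something like `net use Z: \\server\share`:

```python
import os

# Hypothetical UNC path and its drive-letter mapping; adjust to your share.
unc_path = r'\\server\share\test_data\filename.csv'
mapped_path = 'Z:/test_data/filename.csv'

# Confirm what the process can actually see before handing the path to
# dask: pandas accepted the UNC form, but dask needed the mapped drive.
results = {p: os.path.exists(p) for p in (unc_path, mapped_path)}
print(results)
```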

@TomAugspurger
Member

@myidealab if you're able to provide a failing example, or to debug where things go wrong, we'd appreciate it.

The function os.path.exists() was used to confirm that the file is there.

Just a reminder: you'd need to verify with os.path.exists that the workers can see the file, in addition to checking from your client process.
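That worker-side check can be sketched like this, using a local in-process client as a stand-in for a real cluster; the path is a hypothetical shared mount:

```python
import os
from dask.distributed import Client

# A local in-process client stands in for a real cluster here; with a
# distributed cluster you would pass the scheduler address instead.
client = Client(processes=False, n_workers=1, threads_per_worker=1)

path = '/my_shared_data/filename.csv'  # hypothetical shared mount

# os.path.exists on the client checks only this machine; client.run
# executes the same check on every worker in the cluster.
local_ok = os.path.exists(path)
worker_ok = client.run(os.path.exists, path)
print(local_ok, worker_ok)
client.close()
```

client.run returns a dict mapping each worker's address to its result, so a file visible to the client but missing from a worker shows up immediately.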

@simaster123
Author

@myidealab To solve my issue, I created identical shares across my cluster and moved everything to the same OS (Ubuntu) using virtual machines. Now, to share data between nodes, I'm using CIFS shares. To address the issue I raised here, I used the same CIFS mounted folder name on each node (e.g., /my_shared_data/) to access the shared folder with the data I wanted to process.

In other words, I think the errors in my original post arose from trying to distribute computation across nodes with different operating systems, combined with inconsistent shared-folder naming.
