ModuleNotFoundError: No module named 'pyarrow._parquet' #15417

Closed
asfimport opened this issue May 23, 2017 · 9 comments

Comments

@asfimport

$ python
Python 3.6.1 |Continuum Analytics, Inc.| (default, Mar 22 2017, 20:11:04) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\bmachie\AppData\Local\Continuum\Miniconda3\envs\ml_recommender\lib\site-packages\pyarrow\__init__.py", line 28, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: DLL load failed: The specified procedure could not be found.
>>>

Environment: Windows 7 64-bit, conda environment, Python 3.6.1
pyarrow: 0.3.0.post-np112py36_vc14_1 conda-forge [vc14]
Reporter: Brecht Machiels / @brechtm
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-1064. Please see the migration documentation for further details.

@asfimport

Brecht Machiels / @brechtm:
It seems I had an older version (0.3.pre-np112py36_vc14_0 conda-forge [vc14]) of arrow-cpp installed. After upgrading to the current version (0.3.0.post-np112py36_vc14_1 conda-forge [vc14]), "import pyarrow" works, but "import pyarrow.parquet" fails:

$ python -c "import pyarrow.parquet"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\bmachie\AppData\Local\Continuum\Miniconda3\envs\ml_recommender\lib\site-packages\pyarrow\parquet.py", line 23, in <module>
    from pyarrow._parquet import (ParquetReader, FileMetaData,  # noqa
ModuleNotFoundError: No module named 'pyarrow._parquet'

parquet-cpp is installed: 1.1.0 vc14_1 [vc14] conda-forge
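
One way to tell whether an installed pyarrow build ships the Parquet extension at all is to ask the import system to locate the submodule rather than importing it. The following is a minimal sketch using only the standard library; the helper name `has_submodule` is hypothetical and not part of pyarrow. Note that locating a submodule still imports its parent package, so a broken `pyarrow` import (such as the DLL load failure above) is reported as "not found" here.

```python
import importlib.util


def has_submodule(name):
    """Return True if the import system can locate module `name`.

    The target module itself is not executed, but its parent packages
    are imported as part of the lookup.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Covers ModuleNotFoundError (a parent package is missing) and
        # ImportError raised while importing a parent package.
        return False


if __name__ == "__main__":
    for mod in ("pyarrow", "pyarrow.lib", "pyarrow._parquet"):
        print(mod, "found" if has_submodule(mod) else "MISSING")
```

If `pyarrow` and `pyarrow.lib` are found but `pyarrow._parquet` is missing, the package was most likely built without Parquet support, which matches the traceback above.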

@asfimport

Wes McKinney / @wesm:
This should be resolved in the next 24 hours; we are in the process of making a release.

@asfimport

Brecht Machiels / @brechtm:
Great! And thank you for providing Windows conda packages!

@asfimport

Wes McKinney / @wesm:
Can you check out the updated conda packages and let me know if all is working?

@asfimport

Brecht Machiels / @brechtm:
Yes, 0.4.0 seems to be working. I can perform the import and parse parquet files now. Importing parquet datasets consisting of multiple files but with missing _metadata doesn't seem to be possible, but I don't suppose that is a bug.

@asfimport

Wes McKinney / @wesm:
That sounds buggy to me. Could you open a new JIRA?

@asfimport

Brecht Machiels / @brechtm:
I did eventually get it to open the set of parquet files with missing _metadata file by removing an empty directory ("_impala_insert_staging") that was in the same directory.
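
A possible workaround that avoids deleting the staging directory is to build the list of data files explicitly and skip anything that is not a regular data file. The sketch below uses only the standard library. The helper name `list_data_files` is hypothetical, and treating names that start with "_" or "." as non-data entries is an assumption based on the common Hadoop convention, not something confirmed in this thread. It does not filter by extension, because the files here do not have a `.parquet` suffix.

```python
from pathlib import Path


def list_data_files(root):
    """Return sorted paths of regular files directly under `root`.

    Directories (e.g. "_impala_insert_staging") and entries whose names
    start with "_" or "." (e.g. "_metadata") are skipped.
    """
    return sorted(
        str(p)
        for p in Path(root).iterdir()
        if p.is_file() and not p.name.startswith(("_", "."))
    )
```

The resulting list could then be handed to a reader that accepts explicit file paths; whether a given pyarrow version accepts such a list is not established here.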

I am still not able to open a particular single-file parquet dataset, though. It fails with "ArrowIOError: IOError: Invalid parquet file. Corrupt footer." It cannot be opened by fastparquet either. Trying to load it using PySpark fails with a similar error, so there must be something wrong with it:

py4j.protocol.Py4JJavaError: An error occurred while calling o76.parquet.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 4, itsusraedlp08.jnj.com): java.io.IOException: Could not read footer: java.lang.RuntimeException: hdfs://<snipped>/<parquet_dir>/e4a415679f64f34-7ac06c0506c56aab_1260025109_data.0. is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [50, 51, 56, 10]
        at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)
        ...
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: hdfs://<snipped>/<parquet_dir>/e4a415679f64f34-7ac06c0506c56aab_1260025109_data.0. is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [50, 51, 56, 10]
        at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:423)
        ...
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more

It is possible to run queries against it using Impala though.

@asfimport

Brecht Machiels / @brechtm:
Never mind the comment about the parquet file with the corrupt footer. Turns out it is a CSV file :-)
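
The Spark error already hints at this: the expected tail bytes [80, 65, 82, 49] are the ASCII string "PAR1", while the bytes found, [50, 51, 56, 10], are the digits "238" followed by a newline, which looks like the end of a text file. A quick check along these lines can be sketched with the standard library alone. The helper name `has_parquet_magic` is hypothetical, and the sketch assumes an unencrypted Parquet file, which carries the 4-byte magic at both the start and the end.

```python
import os

PARQUET_MAGIC = b"PAR1"  # bytes [80, 65, 82, 49]


def has_parquet_magic(path):
    """Return True if the file both starts and ends with b"PAR1"."""
    # A file shorter than two magic markers cannot be valid.
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Decode the byte values quoted in the Spark error message.
print(bytes([80, 65, 82, 49]))  # b'PAR1'
print(bytes([50, 51, 56, 10]))  # b'238\n'
```

This only checks the magic markers; a file can pass and still have a damaged footer, so it is a cheap first filter rather than a validity test.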

@asfimport

Wes McKinney / @wesm:
I created ARROW-1079 about the empty directory issue.
