Issue while importing pydoop inside pySpark map function #273
Comments
Hi, and thanks for reporting this. Pydoop uses dynamic extension modules, so it's not importable from a zip archive. It should be importable from an egg (also supported by Spark), but this leads to the same error as above. I have just opened issue #276 for this and hope to get to it soon. In the meantime, since you most likely don't need properties anyway, you should be able to work around the problem as follows:
--- a/pydoop/__init__.py
+++ b/pydoop/__init__.py
@@ -179,9 +179,7 @@ def read_properties(fname):
         with open(fname) as f:
             parser.readfp(AddSectionWrapper(f))
     except IOError as e:
-        if e.errno != errno.ENOENT:
-            raise
-        return None  # compile time, prop file is not there
+        return {}
     return dict(parser.items(AddSectionWrapper.SEC_NAME))
You should end up with a
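As a side note (a sketch added here, not part of the original comment): a quick way to see why a zip archive won't do is to list the compiled extension modules that ship with the installed package. This assumes Python 3 and a local, non-zip pydoop installation:

import importlib.machinery
import os

import pydoop  # assumes pydoop is installed locally (not from a zip)

pkg_dir = os.path.dirname(pydoop.__file__)
ext_suffixes = tuple(importlib.machinery.EXTENSION_SUFFIXES)  # e.g. ('.so', ...)

# Any hit here is a dynamic extension module, which the zip importer cannot load.
for root, _, files in os.walk(pkg_dir):
    for name in files:
        if name.endswith(ext_suffixes):
            print(os.path.join(root, name))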
Hi Simone, thanks for the reply and the workaround. I followed the steps provided and the import issue is resolved, but when I try to do operations with HDFS I am facing issues.
I have tried looking at other open issues, but no one seems to have built pydoop this way.
It is printing the Hadoop version and the Hadoop classpath.
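A minimal diagnostic sketch (an addition, not from the thread), assuming sc is an active SparkContext: it checks what the executors actually see for Hadoop-related environment variables, which pydoop uses to locate the Hadoop installation and configuration.

def env_probe(_):
    import os
    keys = ("HADOOP_HOME", "HADOOP_CONF_DIR", "JAVA_HOME", "CLASSPATH")
    return {k: os.environ.get(k) for k in keys}

# One probe per partition is enough; assumes `sc` is already defined.
print(sc.parallelize(range(4), 4).map(env_probe).collect())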
Hi, It looks like it's not at all straightforward to make a Python package that includes native extensions importable from an egg or other archive. However, I've just tried your pyspark sample code and it works for me if I pass the unzipped installation dir to sc.addFile. And in the pyspark code:
from pyspark import SparkContext, SparkConf
SparkContext.setSystemProperty('spark.executor.memory', '4g')
conf = SparkConf().setAppName("pydoop test")
sc = SparkContext(conf=conf)
sc.addFile("/tmp/pydoop", recursive=True)
rdd = sc.parallelize([
    (12, 34, 56, 67),
    (34, 56, 87, 354),
    (345, 74, 33, 77),
    (453, 56, 73, 56)
], 2)
def func(rec):
    import sys
    from pyspark import SparkFiles
    sys.path.insert(0, SparkFiles.get("pydoop"))
    from pydoop import hdfs
    hdfs.dump("hello", "/user/root/temp_{}.txt".format(rec[0]))
rdd.map(func).take(10)
Note that you need to manually alter
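For completeness, here is one way the unzipped /tmp/pydoop directory passed to sc.addFile above could be prepared from a patched local installation. This is a sketch with illustrative paths, not part of the original comment; note the extra directory level, since SparkFiles.get("pydoop") is what func() puts on sys.path, so the shipped directory has to contain the pydoop package inside it.

import os
import shutil

import pydoop  # the locally installed (and patched) package

src = os.path.dirname(pydoop.__file__)  # e.g. .../site-packages/pydoop
dst = "/tmp/pydoop/pydoop"              # directory handed to sc.addFile(..., recursive=True)

# Copy the package directory as-is, compiled extensions included.
shutil.copytree(src, dst)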
Thanks Simone, the sc.addFile option with recursive is working; I have tested this code in my local VM. Unfortunately, the cluster where I need this runs Spark 1.6.3, whose addFile method does not have a recursive parameter. Regards,
Hi, I believe you can still make it work with the older Spark version. Build pydoop.zip, add it with sc.addFile, and unpack it at run time inside the map function:
def func(rec):
    import sys
    import zipfile
    import tempfile
    from pyspark import SparkFiles
    zip_fn = SparkFiles.get("pydoop.zip")
    d = tempfile.mkdtemp()
    with zipfile.ZipFile(zip_fn, 'r') as zipf:
        zipf.extractall(d)
    sys.path.insert(0, d)
    from pydoop import hdfs
    hdfs.dump("hello", "/user/root/temp_{}.txt".format(rec[0]))
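A small variation on the same idea (a sketch, not from the thread): doing the extraction once per partition rather than once per record keeps the unzip cost down on large RDDs. It assumes pydoop.zip was shipped with sc.addFile as above and that rdd is the RDD from the earlier example.

def write_partition(records):
    import sys
    import tempfile
    import zipfile
    from pyspark import SparkFiles

    # Unpack the shipped archive once for the whole partition.
    d = tempfile.mkdtemp()
    with zipfile.ZipFile(SparkFiles.get("pydoop.zip"), 'r') as zipf:
        zipf.extractall(d)
    sys.path.insert(0, d)

    from pydoop import hdfs
    for rec in records:
        hdfs.dump("hello", "/user/root/temp_{}.txt".format(rec[0]))
    return iter([])

rdd.mapPartitions(write_partition).collect()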
I'm going to make an alpha release pretty soon. You should be able to avoid the zip-unzip round trip by simply zipping the contents of
I have a requirement to write to HDFS inside map, so I am shipping the pydoop.zip dependency module to all worker nodes using the sc.addPyFile option, but when I try to import pydoop.hdfs I get the error below.
Steps followed to create pydoop.zip:
Sample pyspark code in which I am trying to use pydoop inside map:
Please help me to resolve this.
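As a rough illustration of the setup being described (a sketch with hypothetical paths and values, not the original code), shipping a zipped package with sc.addPyFile and importing it inside map looks roughly like this:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("pydoop zip test")
sc = SparkContext(conf=conf)

# Ship the zipped package to every executor; "/tmp/pydoop.zip" is illustrative.
sc.addPyFile("/tmp/pydoop.zip")

def func(rec):
    # Fails for pydoop because its compiled extension modules cannot be
    # loaded from a zip archive (see the discussion above).
    from pydoop import hdfs
    hdfs.dump("hello", "/user/root/temp_{}.txt".format(rec[0]))

rdd = sc.parallelize([(1, 2), (3, 4)], 2)
rdd.map(func).take(2)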