"TypeError: 'int' object is not iterable" cause the application to abort #2

Closed
marcogoldin opened this issue May 14, 2018 · 4 comments

Comments

@marcogoldin

Hi, "map_test.py split" cause the application to abort apparently due to a TypeError.
Is this tool supposed to work with python 3.5 or 3.6?
Anyway, here's the output:

```
2018-05-14 11:51:45,168 INFO Splitting started
/home/aml/mlolUR/ur-analysis-tools/report.py:15: DeprecationWarning: Call to deprecated function remove_sheet (Use wb.remove(worksheet) or del wb[sheetname]).
wb.remove_sheet(wb.active)
2018-05-14 11:51:45,170 INFO Spark initialization
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/05/14 11:51:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/14 11:51:46 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2018-05-14 11:51:47,467 INFO Source file reading
2018-05-14 11:51:53,761 INFO Filter users with small number of events
2018-05-14 11:51:54,122 INFO Split data into train and test
[Stage 3:> (0 + 2) / 200]18/05/14 11:52:08 WARN TaskSetManager: Lost task 1.0 in stage 3.0 (TID 19, 172.31.70.53, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 163, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
command = serializer._read_with_length(file)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
return self.loads(obj)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 451, in loads
return pickle.loads(obj, encoding=encoding)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 783, in _make_skel_func
closure = _reconstruct_closure(closures) if closures else None
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 775, in _reconstruct_closure
return tuple([_make_cell(v) for v in values])
TypeError: 'int' object is not iterable

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

18/05/14 11:52:08 ERROR TaskSetManager: Task 3 in stage 3.0 failed 4 times; aborting job
18/05/14 11:52:08 WARN TaskSetManager: Lost task 1.2 in stage 3.0 (TID 30, 172.31.70.53, executor 0): TaskKilled (killed intentionally)
18/05/14 11:52:08 WARN TaskSetManager: Lost task 2.2 in stage 3.0 (TID 29, 172.31.70.53, executor 0): TaskKilled (killed intentionally)
Traceback (most recent call last):
File "map_test.py", line 647, in
root()
File "/home/aml/anaconda3/lib/python3.6/site-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/home/aml/anaconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/aml/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/aml/anaconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/aml/anaconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "map_test.py", line 122, in split
train_df, test_df = split_data(df)
File "map_test.py", line 63, in split_data
split_date = get_split_date(df, cfg.splitting.split_event, cfg.splitting.train_ratio)
File "map_test.py", line 51, in get_split_date
total_primary_events = date_rdd.count()
File "/home/aml/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 1056, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/home/aml/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 1047, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/home/aml/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 921, in fold
vals = self.mapPartitions(func).collect()
File "/home/aml/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 824, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/home/aml/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 3.0 failed 4 times, most recent failure: Lost task 3.3 in stage 3.0 (TID 28, 172.31.70.53, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 163, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
command = serializer._read_with_length(file)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
return self.loads(obj)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 451, in loads
return pickle.loads(obj, encoding=encoding)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 783, in _make_skel_func
closure = _reconstruct_closure(closures) if closures else None
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 775, in _reconstruct_closure
return tuple([_make_cell(v) for v in values])
TypeError: 'int' object is not iterable

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 163, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
command = serializer._read_with_length(file)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
return self.loads(obj)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 451, in loads
return pickle.loads(obj, encoding=encoding)
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 783, in _make_skel_func
closure = _reconstruct_closure(closures) if closures else None
File "/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 775, in _reconstruct_closure
return tuple([_make_cell(v) for v in values])
TypeError: 'int' object is not iterable

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more

```

@marcogoldin

Never mind, the problem was solved with a correct HDFS configuration and settings in spark-env.sh. The split works now.

@aaronbenz

I'm running into essentially the same issue. Do you recall which configuration needed adjustment?

@marcogoldin commented Jul 17, 2018

Hi, it's been a while and honestly I don't recall precisely what caused the issue.
I use conda, a virtual env with Python 3.5 (3.6+ is risky because of compatibility issues), and all the packages needed by the ur-analysis tools:

numpy scipy pandas ml_metrics predictionio tqdm click openpyxl pyspark

Anyway, it's still working, so here's the relevant part of my .bashrc for Python (3.5) and of spark-env.sh (Spark 2.1.1):

.bashrc:

```sh
# spark
export SPARK_HOME=/home/aml/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
# test line added below for map_test.py split
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH

# Default python: path to virtual env
export PATH="/home/aml/anaconda3/envs/uranalysis4/bin/:$PATH"
```

spark-env.sh:

```sh
# pyspark python (driver and workers)
PYSPARK_PYTHON=/home/aml/anaconda3/envs/uranalysis4/bin/python
PYSPARK_DRIVER_PYTHON=/home/aml/anaconda3/envs/uranalysis4/bin/python
```

This is how it worked for me.
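
A quick way to confirm that this kind of setup actually took effect is to compare the interpreter seen by the driver with the one used by the executors. This is only a minimal sketch (not from the original setup), assuming a working SparkContext:

```python
# Minimal check (illustrative): the driver and the executors should report
# the same Python version once PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
# point at the same conda environment.
import sys
from pyspark import SparkContext

sc = SparkContext(appName="python-version-check")

driver_version = sys.version_info[:3]

# Run one tiny task per default partition and collect the distinct
# versions reported by the worker interpreters.
executor_versions = (
    sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
      .map(lambda _: sys.version_info[:3])
      .distinct()
      .collect()
)

print("driver version   :", driver_version)
print("executor versions:", executor_versions)

sc.stop()
```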

@Penumbra69

Thanks for this! It was exactly the issue I was facing. For me, the only line from your bash setup that I needed to get mine working was:

```sh
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
```

Your mileage may vary, but I can now use PySpark locally in unit tests.
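
For reference, a local test along these lines works for me once `$SPARK_HOME/python` is on PYTHONPATH; this is a hypothetical sketch rather than code from this thread:

```python
# Hypothetical example of a local PySpark unit test; the class and test
# names are illustrative only.
import unittest
from pyspark import SparkContext


class LocalPySparkTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # local[2] runs the driver plus two worker threads in this process,
        # so no cluster is needed for the test.
        cls.sc = SparkContext("local[2]", "local-unit-test")

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_count(self):
        rdd = self.sc.parallelize(range(100))
        self.assertEqual(rdd.count(), 100)


if __name__ == "__main__":
    unittest.main()
```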
