
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD #670

Closed
crazw opened this Issue Jan 25, 2016 · 1 comment

Comments

@crazw

crazw commented Jan 25, 2016

I run pyspark with:

pyspark  --jars es-hadoop-2.1.0/elasticsearch-hadoop-2.1.0.jar

When I connect to Elasticsearch, I get the following error:

[root@fxdata225 crazw]# pyspark  --jars es-hadoop-2.1.0/elasticsearch-hadoop-2.1.0.jar
Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
16/01/25 15:16:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/25 15:16:01 WARN Utils: Your hostname, fxdata225 resolves to a loopback address: 127.0.0.1; using 192.168.1.225 instead (on interface bond0)
16/01/25 15:16:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.6.6 (r266:84292, Feb 22 2013 00:00:18)
SparkContext available as sc, HiveContext available as sqlContext.
>>> conf = {"es.resource":"fum-2016.01.18/logs","es.nodes":"192.168.1.225","es.port":"9200"}
>>> rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat","org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/crazw/spark-1.6.0/python/pyspark/context.py", line 644, in newAPIHadoopRDD
    jconf, batchSize)
  File "/home/crazw/spark-1.6.0/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/home/crazw/spark-1.6.0/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/home/crazw/spark-1.6.0/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1967)
    at org.elasticsearch.hadoop.rest.RestClient.discoverNodes(RestClient.java:110)
    at org.elasticsearch.hadoop.rest.InitializationUtils.discoverNodesIfNeeded(InitializationUtils.java:58)
    at org.elasticsearch.hadoop.rest.RestService.findPartitions(RestService.java:227)
    at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:457)
    at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:438)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:113)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1293)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1288)
    at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:201)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:530)
    at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

>>>

Can anyone give me some suggestions? Thanks very much.

@costin

Member
costin commented Jan 25, 2016

For ES 2.x, use ES-Hadoop 2.2 (currently in rc1). For ES 1.x, one can still use ES-Hadoop 2.1.x.
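
For anyone hitting the same StringIndexOutOfBoundsException in RestClient.discoverNodes: it typically means the connector jar is older than the cluster it talks to. A minimal sketch of the session with a matching jar, assuming Elasticsearch 2.x is running on 192.168.1.225:9200 and the downloaded jar is named elasticsearch-hadoop-2.2.0-rc1.jar (the exact file name depends on the artifact you fetch):

pyspark --jars elasticsearch-hadoop-2.2.0-rc1.jar

>>> # same read as above; only the connector jar changes
>>> conf = {"es.resource": "fum-2016.01.18/logs", "es.nodes": "192.168.1.225", "es.port": "9200"}
>>> rdd = sc.newAPIHadoopRDD(
...     "org.elasticsearch.hadoop.mr.EsInputFormat",      # input format class
...     "org.apache.hadoop.io.NullWritable",              # key class
...     "org.elasticsearch.hadoop.mr.LinkedMapWritable",  # value class
...     conf=conf)
>>> rdd.first()   # (document id, document fields as a map)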

@costin costin closed this Jan 25, 2016
