
error Dbscan.train on 9_1M.csv (cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.alitouka.spark.dbscan.spatial.Box.adjacentBoxes of type scala.collection.immutable.List in instance of org.alitouka.spark.dbscan.spatial.Box) #22

Open
ttpro1995 opened this issue Nov 10, 2017 · 12 comments

ttpro1995 commented Nov 10, 2017

Zeppelin notebook export (JSON): https://gist.github.com/0f067d6ff2239500ca8eed7d38b5872b
Built from commit d3b085286ccb16b146e7bb5234765cbc23e11c66

val data_path2 = "hdfs://127.0.0.1:9000/data/9_1M.csv"
val dataset2 = IOHelper.readDataset(sc, data_path2)
val settings = new DbscanSettings ().withEpsilon (0.8).withNumberOfPoints (4).withTreatBorderPointsAsNoise(true)
val clusteringResult = Dbscan.train (dataset2, settings)

Error log: https://gist.github.com/ttpro1995/7437b1f3b1f944fd26daf2ef4ba73efe

build.sbt

name := "spark_dbscan"

organization := "org.alitouka"

version := "0.0.4"

scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0" % "provided"

libraryDependencies += "org.scalatest" % "scalatest_2.11" % "2.1.3" % "test"

libraryDependencies += "org.apache.commons" % "commons-math3" % "3.2"

// https://mvnrepository.com/artifact/com.github.scopt/scopt_2.10
libraryDependencies += "com.github.scopt" % "scopt_2.11" % "3.7.0"

valera7979 (Contributor) commented Jan 22, 2018

I have the same problem. It occurs with Spark 1.6 and higher. It works if the number of points is reduced below org.alitouka.spark.dbscan.spatial.rdd.PartitioningSettings.DefaultNumberOfPointsInBox, but in that case the whole calculation is performed in one container.
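
For illustration, a minimal sketch of that workaround, assuming Dbscan.train accepts an optional PartitioningSettings argument (an assumption; check the signature in your build). The 10,000,000 cap is a hypothetical value chosen only to exceed the row count of 9_1M.csv:

import org.alitouka.spark.dbscan.{Dbscan, DbscanSettings, IOHelper}
import org.alitouka.spark.dbscan.spatial.rdd.PartitioningSettings

val dataset2 = IOHelper.readDataset(sc, "hdfs://127.0.0.1:9000/data/9_1M.csv")
val settings = new DbscanSettings ().withEpsilon (0.8).withNumberOfPoints (4)

// Keep the whole dataset in a single box: with no adjacent boxes, the
// failing Box.adjacentBoxes serialization is never exercised, but all
// clustering work then runs in one container.
val singleBox = new PartitioningSettings (numberOfPointsInBox = 10000000)
val clusteringResult = Dbscan.train (dataset2, settings, singleBox)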

valera7979 (Contributor) commented Feb 16, 2018

I found a solution: I reduced the Scala version to 2.10, because at higher versions some methods became deprecated, and it worked.
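
For reference, a hedged sketch of what that downgrade can look like in the build.sbt above (the versions are assumptions, not a confirmed working set; Spark 2.2.x was the last release line published for Scala 2.10):

scalaVersion := "2.10.6"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "2.2.0" % "provided"

libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.1.3" % "test"

libraryDependencies += "com.github.scopt" % "scopt_2.10" % "3.7.0"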

lccmpn commented Mar 15, 2018

I tried compiling the library with Scala 2.10.6 and Spark 2.1.0, but still no luck; it throws the same exception. @valera7979, could you explain how you managed to make it work?

valera7979 (Contributor) commented:

I made a pull request; see #24.
Use this code: https://github.com/valera7979/spark_dbscan/tree/rise_Spark

shuangyumo commented Jun 7, 2018

I get the same error, but my friend runs the same code with no error, which is very confusing.
Can you help me solve the problem?

sfdan473414 commented:
I get the same error, but I solved the problem with the following change.
Before:
val partitioningSettings = new PartitioningSettings (numberOfPointsInBox = argsParser.args.numberOfPoints)
After:
val partitioningSettings = new PartitioningSettings ()

It works well now.

Benji81 commented Feb 14, 2019

> I get the same error, but I solved the problem with the following change.
> Before:
> val partitioningSettings = new PartitioningSettings (numberOfPointsInBox = argsParser.args.numberOfPoints)
> After:
> val partitioningSettings = new PartitioningSettings ()
> It works well now.

@sfdan473414, could you give the filename(s) and line number, please?

Benji81 commented Feb 15, 2019

> I found a solution: I reduced the Scala version to 2.10, because at higher versions some methods became deprecated, and it worked.

@valera7979, do you know which parts are deprecated?

sfdan473414 commented:
> @sfdan473414, could you give the filename(s) and line number, please?

The filename is DbscanDriver, but the problem occurs again when the dataset is large (100M_4d). When I run it on a small dataset (150 records), it works well.

sfdan473414 commented:
My Scala version is 2.11.12 and my Spark version is 2.0.0.cloudera2 (Spark 2.x).
The key info is in the exception: "cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.alitouka.spark.dbscan.spatial.Box.adjacentBoxes". That is to say, the field org.alitouka.spark.dbscan.spatial.Box.adjacentBoxes fails serialization, or can't be serialized at all, just like the SparkContext class, so you should add the @transient annotation to it.

Class name: org.alitouka.spark.dbscan.spatial.Box

private [dbscan] class Box (val bounds: Array[BoundsInOneDimension], val boxId: BoxId = 0, val partitionId: Int = -1, @transient var adjacentBoxes: List[Box] = Nil)

Now the program runs well on both big and small datasets.
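
One caveat when applying this: a @transient field is skipped by Java serialization, so on the executor side it is restored as null rather than Nil, and any code that reads adjacentBoxes after a shuffle has to tolerate that. A minimal sketch of the patched class (the extends Serializable clause is assumed, since Box instances are shipped to executors; safeAdjacentBoxes is a hypothetical helper, not part of the library):

private [dbscan] class Box (val bounds: Array[BoundsInOneDimension],
    val boxId: BoxId = 0,
    val partitionId: Int = -1,
    @transient var adjacentBoxes: List[Box] = Nil) extends Serializable {

  // After deserialization the @transient var is null, not Nil;
  // normalize it before use (hypothetical helper).
  def safeAdjacentBoxes: List[Box] = Option(adjacentBoxes).getOrElse(Nil)
}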

DanyYan commented Apr 24, 2019

> ...the field org.alitouka.spark.dbscan.spatial.Box.adjacentBoxes fails serialization... so you should add the @transient annotation to it.

Does the code support high-dimensional data?

laksheenmendis commented:
> ...so you should add the @transient annotation to it. ... Now the program runs well on both big and small datasets.

Thank you very much, @sfdan473414. I had many challenges, but with your suggestion I was able to run this on a Spark 2.2.1 cluster with Hadoop 2.7.3.
