Spark pipeline cannot load with spark CatBoostClassifierModel #2402

spencer-wallace · 2023-06-02T09:03:30Z

Problem: When I try to load a pipeline containing a Spark CatBoost classifier, I get the following error:

AttributeError: module 'ai.catboost.spark' has no attribute 'CatBoostClassificationModel'

AttributeError Traceback (most recent call last)
in <cell line: 5>()
3
4 # Load model
----> 5 loaded_model = mlflow.spark.load_model(logged_model)

/databricks/python/lib/python3.9/site-packages/mlflow/spark.py in load_model(model_uri, dfs_tmpdir, dst_path)
793 get_databricks_profile_uri_from_artifact_uri(root_uri)
794 ):
--> 795 return PipelineModel.load(mlflowdbfs_path)
796
797 return _load_model(

/databricks/spark/python/pyspark/ml/util.py in load(cls, path)
444 def load(cls, path: str) -> RL:
445 """Reads an ML instance from the input path, a shortcut of read().load(path)."""
--> 446 return cls.read().load(path)
447
448

/databricks/spark/python/pyspark/ml/pipeline.py in load(self, path)
282 metadata = DefaultParamsReader.loadMetadata(path, self.sc)
283 if "language" not in metadata["paramMap"] or metadata["paramMap"]["language"] != "Python":
--> 284 return JavaMLReader(cast(Type["JavaMLReadable[PipelineModel]"], self.cls)).load(path)
285 else:
286 uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)

/databricks/spark/python/pyspark/ml/util.py in load(self, path)
398 "This Java ML type cannot be loaded into Python currently: %r" % self._clazz
399 )
--> 400 return self._clazz._from_java(java_obj) # type: ignore[attr-defined]
401
402 def session(self: JR, sparkSession: SparkSession) -> JR:

/databricks/spark/python/pyspark/ml/pipeline.py in _from_java(cls, java_stage)
342 """
343 # Load information from java_stage to the instance.
--> 344 py_stages: List[Transformer] = [JavaParams._from_java(s) for s in java_stage.stages()]
345 # Create a new instance of this stage.
346 py_stage = cls(py_stages)

/databricks/spark/python/pyspark/ml/pipeline.py in <listcomp>(.0)
342 """
343 # Load information from java_stage to the instance.
--> 344 py_stages: List[Transformer] = [JavaParams._from_java(s) for s in java_stage.stages()]
345 # Create a new instance of this stage.
346 py_stage = cls(py_stages)

/databricks/spark/python/pyspark/ml/wrapper.py in _from_java(java_stage)
290 stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
291 # Generate a default new instance from the stage_name class.
--> 292 py_type = __get_class(stage_name)
293 if issubclass(py_type, JavaParams):
294 # Load information from java_stage to the instance.

/databricks/spark/python/pyspark/ml/wrapper.py in __get_class(clazz)
285 m = __import__(module)
286 for comp in parts[1:]:
--> 287 m = getattr(m, comp)
288 return m
289

AttributeError: module 'ai.catboost.spark' has no attribute 'CatBoostClassificationModel'
catboost version: spark 1.2
scala: 2.12
spark: 3.3
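
For readers following the traceback, here is a minimal sketch of the lookup that the last two frames perform. It only mirrors the lines visible in the traceback (the `replace("org.apache.spark", "pyspark")` call and the `__import__`/`getattr` loop); `get_class`, `python_name_for`, `lookup_error`, `demo_pkg`, and `DemoModel` are hypothetical names invented so the example runs without Spark or CatBoost installed.

```python
import sys
import types


def python_name_for(java_class_name: str) -> str:
    """Map a JVM class name to the Python name that is looked up (as in wrapper.py line 290)."""
    return java_class_name.replace("org.apache.spark", "pyspark")


def get_class(clazz: str):
    """Resolve a dotted class name the way the __get_class frame in the traceback does."""
    parts = clazz.split(".")
    module = ".".join(parts[:-1])
    m = __import__(module)  # with no fromlist this returns the top-level package
    for comp in parts[1:]:
        m = getattr(m, comp)  # AttributeError if the module lacks this name
    return m


# Hypothetical stand-in package, registered by hand so the sketch is self-contained.
demo_pkg = types.ModuleType("demo_pkg")
demo_spark = types.ModuleType("demo_pkg.spark")


class DemoModel:
    """Placeholder for a model class that the Python module does export."""


demo_spark.DemoModel = DemoModel
demo_pkg.spark = demo_spark
sys.modules["demo_pkg"] = demo_pkg
sys.modules["demo_pkg.spark"] = demo_spark


def lookup_error(name: str) -> str:
    """Return the AttributeError message for a failed lookup, or '' on success."""
    try:
        get_class(name)
    except AttributeError as exc:
        return str(exc)
    return ""


if __name__ == "__main__":
    # A non-Spark class name passes through the replace() unchanged.
    print(python_name_for("ai.catboost.spark.CatBoostClassificationModel"))
    print(get_class("demo_pkg.spark.DemoModel"))
    print(lookup_error("demo_pkg.spark.MissingModel"))
```

Under this reading, the JVM side loaded the stage successfully, and the failure happens when the Python module `ai.catboost.spark` is asked for an attribute with the same name as the JVM class.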


bakuteyev · 2023-06-21T10:22:03Z

Any workarounds?

ek-ak · 2023-06-30T11:13:04Z

Hello!
Looks like you don't have all necessary JARs in the CLASSPATH environment.

bakuteyev · 2023-06-30T11:20:10Z

> Hello! Looks like you don't have all necessary JARs in the CLASSPATH environment.

In my case the environment is Databricks. It works perfectly with "ai.catboost:catboost-spark_3.2_2.12:1.1.1" but not with "ai.catboost:catboost-spark_3.4_2.12:1.2" (with the corresponding environment).
I'm pretty sure the error is not on my side, because only the loading step of the pipeline fails.
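
One way to compare two environments is to check which names the Python module actually exposes in each. The sketch below is a generic diagnostic and not part of CatBoost or PySpark; `is_exported` and `exported_names` are hypothetical helper names, and the module name `ai.catboost.spark` is taken from the error message above.

```python
import importlib
from typing import List


def is_exported(module_name: str, attr: str) -> bool:
    """True if the module can be imported and exposes `attr`, False otherwise."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


def exported_names(module_name: str, fragment: str) -> List[str]:
    """Sorted attribute names of the module that contain `fragment`."""
    module = importlib.import_module(module_name)
    return sorted(name for name in dir(module) if fragment in name)


if __name__ == "__main__":
    # Prints False when the module is missing or lacks the attribute.
    print(is_exported("ai.catboost.spark", "CatBoostClassificationModel"))
```

Running `exported_names("ai.catboost.spark", "CatBoost")` in both the working and the failing environment would show whether the set of exported class names differs between the two package versions.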

bakuteyev · 2023-08-08T09:56:56Z

Actually it doesn't work with any version of catboost inside Pipeline.

andrey-khropov · 2023-08-28T04:25:47Z

> Actually it doesn't work with any version of catboost inside Pipeline.

That's very weird. There was an error with Pipeline some time ago (#1936); it has been fixed since CatBoost 1.0.4, and there are test cases that check it. Can you check whether the code from the test cases committed in 835c1a2 works in your environment?

andrey-khropov added bug Spark labels Jun 2, 2023

lsli8888 mentioned this issue Apr 30, 2024: PySpark ML CrossValidator cannot load serialized CrossValidator because it cannot find CatBoostRegressor class #2652 (Open)
