[DRAFT] [WIP] Pyspark api wrapper #5658

Closed
wants to merge 5 commits into from

WeichenXu123
Contributor

This takes over #4656; it is still WIP.

@trivialfis
Member

Out of curiosity, what's the difficulty of basing PySpark support on the Python package instead of the JVM packages?

@CodingCat
Member

There is no difficulty, except that you need to translate all of the existing xgboost4j-spark code to Python.

@Ben-Epstein

Ben-Epstein commented May 19, 2020

Unfortunately, I'm unable to import from sparkxgb after following those steps on 1.0.0:

from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.jars','xgboost4j-spark_2.12-1.0.0.jar').config('spark.jars', 'xgboost4j_2.12-1.0.0.jar').getOrCreate()

spark.sparkContext.addPyFile("xgboost4j-spark_2.12-1.0.0.jar")
spark.sparkContext.addPyFile("./xgboost4j_2.12-1.0.0.jar")
from sparkxgb import XGBoostClassifier, XGBoostClassificationModel
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-9-8a1823de6207> in <module>
----> 1 from sparkxgb import XGBoostClassifier, XGBoostClassificationModel
ModuleNotFoundError: No module named 'sparkxgb'

I'm running this in Jupyter; not sure whether that matters.

Until this is packaged officially, is there a workaround? I looked at some other threads, but none of the suggestions seemed to work. Is there a zip file for 1.0.0 that I can download? (I saw one for 0.0.9/0.0.8.)
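
For anyone hitting the same ModuleNotFoundError: as far as I understand, addPyFile only makes an archive importable, so `from sparkxgb import ...` can succeed only if the archive actually contains a Python package named sparkxgb. A minimal sketch for checking that, using only the standard library (the helper name archive_has_package and the demo archive are hypothetical, built here just so the snippet runs on its own):

```python
import os
import tempfile
import zipfile

def archive_has_package(archive_path, package="sparkxgb"):
    """Return True if the zip/jar contains a top-level Python package `package`."""
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()
    # A regular package is identified by its __init__.py at the archive root.
    return f"{package}/__init__.py" in names

# Demo on a throwaway archive; point archive_has_package at a real jar or zip instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo_sparkxgb.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("sparkxgb/__init__.py", "")
    has_pkg = archive_has_package(demo_zip)
    has_other = archive_has_package(demo_zip, "not_there")

print(has_pkg, has_other)
```

If the check returns False for both jars, the import cannot work no matter how the jars are registered. Separately, both .config calls in the snippet above set the same spark.jars key, so the second value replaces the first; a single comma-separated value avoids that.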

from pyspark.ml.util import JavaMLWritable
from pyspark.ml.wrapper import JavaModel, JavaEstimator

from sparkxgb.util import XGBoostReadable
Contributor

It appears that this import was left in unintentionally? I'm assuming that's what is causing Ben's import issue (as you define the class later on as well).

Is the new import simply from xgboost.spark import XGBoostRegressor rather than from sparkxgb import XGBoostRegressor with this change?

Edit: I just saw that Ben was importing from sparkxgb and not from xgboost.spark, so if I understand correctly, both issues were contributing.
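
To tell which of the two import paths is actually available in a given environment, a quick check along these lines may help. This is only a sketch: xgboost.spark is the path guessed at in the question above, not a confirmed module name, and module_available is a hypothetical helper.

```python
import importlib.util

def module_available(name):
    """Return True if `name` can be located on sys.path, without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises when the parent package of a dotted name is missing.
        return False

# Candidate import paths discussed in this thread (both are assumptions here).
candidates = ["sparkxgb", "xgboost.spark"]
availability = {name: module_available(name) for name in candidates}
print(availability)
```

Running this in the same Jupyter kernel that raises the error shows whether the problem is the module name or the missing package itself.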

@FelixYBW
Contributor

We use a modified version of sparkxgb; it works on Spark 3.0 and xgboost master now. Let me know if it can help.

Steps:

  1. os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars xgboost4j_2.12-1.3.0-SNAPSHOT.jar,xgboost4j-spark_2.12-1.3.0-SNAPSHOT.jar pyspark-shell'
  2. sc.addPyFile('file:///sparkxgb.zip')
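
A note on step 1: as far as I know, the value given to --jars is a comma-separated list, and a space after the comma makes the launcher treat the second jar as a separate argument. A minimal sketch of building the string, assuming both jars sit in the working directory (build_submit_args is a hypothetical helper name). The variable has to be set before the first SparkSession or SparkContext is created.

```python
import os

def build_submit_args(jar_paths):
    """Build a PYSPARK_SUBMIT_ARGS value with a comma-separated --jars list."""
    # Strip stray whitespace so no jar path is split off as a separate argument.
    jars = ",".join(path.strip() for path in jar_paths)
    return f"--jars {jars} pyspark-shell"

# Jar names taken from the steps above; adjust to wherever the files actually live.
submit_args = build_submit_args([
    "xgboost4j_2.12-1.3.0-SNAPSHOT.jar",
    " xgboost4j-spark_2.12-1.3.0-SNAPSHOT.jar",
])
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```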

@luoguohao

Any updates on this issue?

@mallman
Contributor

mallman commented Jan 19, 2021

@WeichenXu123 Sorry if this is the wrong place to inquire, but do you know what the plan is for the Pyspark API wrapper? Is this PR still the latest effort?

@mallman
Contributor

mallman commented May 26, 2021

@WeichenXu123 @CodingCat Still interested in this PR (or pyspark xgboost4j support broadly speaking). Any progress on this front?

@FelixYBW
Contributor

This is the one we used:
https://github.com/Intel-bigdata/xgboost/blob/arrow-to-dmatrix/jvm-packages/xgboost4j-spark/contrib/sparkxgb_1.24.zip

@candalfigomoro

@WeichenXu123
Is this attempt still active?

@WeichenXu123
Contributor Author

No. We have

@metra

metra commented Oct 25, 2021

Hey @WeichenXu123 I don't think you finished your sentence? "We have"?

@DTW1004

DTW1004 commented Aug 4, 2022

We use a modified version of sparkxgb; it works on Spark 3.0 and xgboost master now. Let me know if it can help.

Steps:

  1. os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars xgboost4j_2.12-1.3.0-SNAPSHOT.jar,xgboost4j-spark_2.12-1.3.0-SNAPSHOT.jar pyspark-shell'
  2. sc.addPyFile('file:///sparkxgb.zip')

Hello, I see that you can run xgboost on Spark 3. I use pyspark 3.1.2 and have failed many times to get xgboost working. I downloaded the file sparkxgb_1.24.zip, but where can I find the two jars named xgboost4j_2.12-1.3.0-SNAPSHOT.jar and xgboost4j-spark_2.12-1.3.0-SNAPSHOT.jar?
Have you changed the version parameter in sparkxgb's __init__.py? Should it be changed to 1.3.0?
I'm hoping for your reply, since I don't know how to solve the problem. Thanks very much!

@FelixYBW
Contributor

FelixYBW commented Aug 4, 2022

This is the latest one we used. We only tested it on Spark 3.0, so I'm not sure whether it works on Spark 3.1.

https://github.com/Intel-bigdata/xgboost/blob/arrow-to-dmatrix/jvm-packages/xgboost4j-spark/contrib/sparkxgb_1.24.zip

@DTW1004

DTW1004 commented Aug 4, 2022

Thanks for your reply. What I meant was: can you give me a link to download the two jar packages? I can't find them. Thank you!

@FelixYBW
Contributor

FelixYBW commented Aug 4, 2022

One jar uses Gazelle's Arrow-native Parquet reader to read the data; the other transforms the Arrow data format into an xgboost DMatrix directly. Both are outdated. We stopped updating them since no one uses them.
