
XGBoost doesn't generate as many trees as specified in the num_round parameter #2610

Closed
fatihtekin opened this issue Aug 16, 2017 · 5 comments


fatihtekin commented Aug 16, 2017

This is not a bug report but a question to improve my understanding. When I call getModelDump on the Booster object, I don't get as many trees as I set in the "num_round" parameter. I was thinking that if "num_round" is 100, XGBoost would generate 100 trees sequentially and I would see all of them when I call getModelDump. I am sure there is a logical reason behind this, or my understanding is wrong. Could you please explain why only 2 trees are printed instead of 100?

val paramMap = List(
  "eta" -> 0.1, "max_depth" -> 7, "objective" -> "binary:logistic", "num_round" -> 100,
  "eval_metric" -> "auc", "nworkers" -> 8).toMap
val xgboostEstimator = new XGBoostEstimator(paramMap)
// trainModel is another set of standard Spark stages: StringIndexer, OneHotEncoder and VectorAssembler
val pipelineXGBoost = new Pipeline().setStages(Array(trainModel, xgboostEstimator))
val cvModel = pipelineXGBoost.fit(train)
// The call below prints only 2 trees instead of 100, even though num_round is 100!
println(cvModel.stages(1).asInstanceOf[XGBoostClassificationModel].booster.getModelDump()(0))

Versions are as below, using Scala 2.11:

  "ml.dmlc" % "xgboost4j" % "0.7",
  "ml.dmlc" % "xgboost4j-spark" % "0.7",
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql" % "2.2.0",
  "org.apache.spark" %% "spark-graphx" % "2.2.0",
  "org.apache.spark" %% "spark-mllib" % "2.2.0",

Stack Overflow link: https://stackoverflow.com/questions/45707479/xgboost-doesnt-generate-as-many-tree-as-specified-in-the-num-round-parameter

@superbobry (Contributor)

Are you sure about taking element (0) from the result of getModelDump? I think it is supposed to return one entry per tree, so indexing (0) only gives you the dump of the first tree.
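
For example, a minimal sketch reusing cvModel and the stage index from the snippet above:

// getModelDump returns an array with one entry per boosted tree,
// so taking element (0) prints only the first tree.
val booster = cvModel.stages(1)
  .asInstanceOf[XGBoostClassificationModel]
  .booster
val dump = booster.getModelDump()
println(s"number of trees: ${dump.length}") // expected to be 100 here
dump.zipWithIndex.foreach { case (tree, i) =>
  println(s"booster[$i]:\n$tree")
}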

@fatihtekin (Author)

That was it, thanks. I know it is not directly related, but can I also ask how to generate a nicer, more readable output of a dump like the one below? It is really hard to interpret, and I would like to see the original feature names instead of f1..fn.

0:[f24<15.66] yes=1,no=2,missing=1,gain=6.21942,cover=129.287
	1:[f21<261] yes=3,no=4,missing=3,gain=3.42443,cover=71.6611
		3:[f140<2] yes=7,no=8,missing=8,gain=2.75606,cover=61.9827
			7:[f27<20.48] yes=15,no=16,missing=15,gain=3.00671,cover=6.08482
				15:[f80<14] yes=31,no=32,missing=32,gain=1.05651,cover=4.43298
					31:leaf=-0.00412571,cover=1.91075
					32:leaf=-0.09875,cover=2.52223
				16:leaf=0.0649825,cover=1.65183
			8:[f118<23] yes=17,no=18,missing=18,gain=2.48619,cover=55.8978
				17:[f92<2e+06] yes=33,no=34,missing=34,gain=3.16515,cover=4.08267
					33:leaf=0.0710142,cover=1.43293
					34:leaf=-0.0771197,cover=2.64974
				18:[f32<15] yes=35,no=36,missing=36,gain=1.11248,cover=51.8152
					35:[f130<2] yes=51,no=52,missing=52,gain=3.462,cover=21.3867
						51:leaf=0.0397835,cover=1.60196
						52:[f44<2] yes=63,no=64,missing=63,gain=2.42754,cover=19.7847
							63:leaf=-0.10413,cover=17.9684
							64:leaf=0.0112034,cover=1.81628
					36:[f42<17] yes=53,no=54,missing=53,gain=0.561538,cover=30.4285
						53:leaf=-0.12668,cover=29.3089
						54:leaf=-0.0249998,cover=1.1196
		4:[f9<437.88] yes=9,no=10,missing=10,gain=2.93188,cover=9.67846
			9:[f29<35.38] yes=19,no=20,missing=20,gain=3.12078,cover=5.46093
				19:leaf=-0.0656442,cover=1.84422
				20:[f39<7] yes=37,no=38,missing=38,gain=1.13613,cover=3.61671
					37:leaf=-0.00702025,cover=1.27122
					38:leaf=0.0989425,cover=2.34549
			10:[f9<800.01] yes=21,no=22,missing=22,gain=1.2229,cover=4.21753
				21:leaf=-0.106609,cover=3.16405
				22:leaf=0.00745218,cover=1.05348
	2:[f89<263.393] yes=5,no=6,missing=6,gain=3.26881,cover=57.6261
		5:[f25<36] yes=11,no=12,missing=12,gain=3.1383,cover=29.0563
			11:[f89<105.357] yes=23,no=24,missing=24,gain=5.17044,cover=14.8166
				23:[f78<13] yes=39,no=40,missing=40,gain=5.36755,cover=8.94914
					39:leaf=-0.091808,cover=2.22487
					40:[f19<3036.57] yes=55,no=56,missing=55,gain=3.76778,cover=6.72427
						55:[f1100<2] yes=65,no=66,missing=66,gain=2.18262,cover=5.71874
							65:leaf=-0.00891904,cover=1.48012
							66:leaf=0.122998,cover=4.23861
						56:leaf=-0.0705334,cover=1.00553
				24:[f14<16821] yes=41,no=42,missing=42,gain=1.03521,cover=5.86751
					41:leaf=-0.113841,cover=4.58508
					42:leaf=-0.00770119,cover=1.28243
			12:[f30<145] yes=25,no=26,missing=25,gain=3.40768,cover=14.2397
				25:[f2<3235] yes=43,no=44,missing=44,gain=1.33554,cover=13.1357
					43:[f126<2] yes=57,no=58,missing=58,gain=0.447535,cover=11.7918
						57:leaf=-0.024697,cover=1.01128
						58:leaf=-0.127271,cover=10.7805
					44:leaf=-0.00902633,cover=1.34391
				26:leaf=0.0420599,cover=1.10395
		6:[f40<39.94] yes=13,no=14,missing=14,gain=3.09517,cover=28.5697
			13:[f1<2544] yes=27,no=28,missing=28,gain=2.65386,cover=6.64987
				27:[f74<1148.94] yes=45,no=46,missing=46,gain=0.285036,cover=4.87463
					45:leaf=-0.116752,cover=3.82716
					46:leaf=-0.0246764,cover=1.04747
				28:leaf=0.0238662,cover=1.77524
			14:[f29<3.63] yes=29,no=30,missing=29,gain=3.60683,cover=21.9199
				29:[f118<33] yes=47,no=48,missing=48,gain=2.58241,cover=5.06454
					47:leaf=0.0429156,cover=1.12571
					48:leaf=-0.0977424,cover=3.93884
				30:[f1<573] yes=49,no=50,missing=50,gain=2.66939,cover=16.8553
					49:[f52<5] yes=59,no=60,missing=59,gain=4.01581,cover=6.28172
						59:[f24<23.91] yes=67,no=68,missing=68,gain=1.82016,cover=4.95269
							67:leaf=-0.00493858,cover=1.07114
							68:leaf=0.134587,cover=3.88155
						60:leaf=-0.0546345,cover=1.32903
					50:[f1100<2] yes=61,no=62,missing=62,gain=3.16652,cover=10.5736
						61:[f31<2.31] yes=69,no=70,missing=70,gain=1.08024,cover=4.25716
							69:leaf=0.012435,cover=1.10261
							70:leaf=-0.0891398,cover=3.15455
						62:[f81<17] yes=71,no=72,missing=72,gain=2.08267,cover=6.31646
							71:leaf=-0.00785882,cover=4.19799
							72:leaf=0.0983542,cover=2.11846

@superbobry (Contributor)

getModelDump accepts an optional featureMap argument that maps fN to human-readable names.
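
For example, a minimal sketch (the file name "fmap.txt" is illustrative; it must point to an existing feature-map file):

// Passing a feature-map file makes the dump show real feature names
// instead of f0..fN; withStats = true keeps the gain/cover statistics.
val dump = booster.getModelDump("fmap.txt", true)
println(dump(0))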

@fatihtekin (Author)

I guess that means I need to generate the featureMap as a text file. Do you know how I can generate it?

def getModelDump(featureMap: String = null, withStats: Boolean = false, format: String = "text")
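
For reference, a minimal sketch of one way to produce such a file (featureNames and the file name "fmap.txt" are illustrative; the names must be listed in the same order as the VectorAssembler output fed into XGBoostEstimator):

import java.io.PrintWriter

val featureNames: Seq[String] = Seq("age", "income", "clicks") // example only
val pw = new PrintWriter("fmap.txt")
featureNames.zipWithIndex.foreach { case (name, i) =>
  // One line per feature: <index>\t<name>\t<type>, where type is
  // "q" (quantitative), "i" (0/1 indicator) or "int" (integer).
  pw.println(s"$i\t$name\tq")
}
pw.close()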


52DL commented Jun 12, 2018

@CodingCat This seems to be a JVM-package problem, and I encountered it too.
In a binary:logistic task with xgboost4j-spark, I always get one tree from a trained booster; I think we should get as many trees as the "round" parameter specifies...

tqchen closed this as completed Jul 4, 2018