DecisionTree notebook #5

Open
tlienart opened this issue Apr 9, 2018 · 3 comments

@tlienart

tlienart commented Apr 9, 2018

# I would strongly suggest not running past n=4
n = 5

is a bit funny.

Regarding the benchmark against Python, do you

  1. have some idea why DecisionTree.jl is a factor of 3-4 slower?
  2. know whether the results are identical (tree obtained, classification accuracy)? This is not necessarily the case, as DT.jl may be using a different algorithm.
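
One way to approach the second question is to compare the two models' predictions on the same held-out rows. Below is a minimal sketch in plain Python, assuming the predictions from each library have already been exported to lists. The names `y_true`, `preds_jl` and `preds_py` and their values are made up for illustration and are not results from the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def agreement(pred_a, pred_b):
    """Fraction of rows on which two models predict the same label."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)


# Hypothetical labels and predictions (illustrative values only).
y_true = [0, 1, 1, 0, 1, 0]
preds_jl = [0, 1, 0, 0, 1, 0]  # assumed export from the Julia model
preds_py = [0, 1, 1, 0, 0, 0]  # assumed export from the Python model

print("accuracy (Julia): ", accuracy(y_true, preds_jl))
print("accuracy (Python):", accuracy(y_true, preds_py))
print("agreement:        ", agreement(preds_jl, preds_py))
```

In this toy data both models have the same accuracy but disagree on a third of the rows, so matching accuracy alone does not show that the trees are the same. Full agreement on a test set is also only evidence, not proof, of identical trees.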

Also, I think it would be interesting to test on pure decision trees (not only on forests).

@harveydevereux
Collaborator

I'm returning to the DT notebook today so I'll have some answers shortly.

Thanks for spotting the n=5!

@harveydevereux
Collaborator

  1. My best guess is that Python uses sparse data representations for the training data, and perhaps also that in Python the trees are represented as arrays (Cython pointers) rather than as large sets of nested node and leaf objects. It might be useful to look deeper, say into memory allocations in Julia.

  2. Both DT.jl and Python's scikit-learn implement the same model (CART) and seem to produce the same decision tree and prediction accuracy on the Titanic data, so I think the benchmark is a fair comparison.
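
To make the "arrays versus nested node objects" guess in point 1 concrete, here is a small sketch in plain Python of the same toy tree stored both ways. It only illustrates the two layouts and is not a measurement or either library's actual code. The array names are loosely modeled on the `children_left`, `children_right`, `feature` and `threshold` arrays that scikit-learn exposes on a fitted tree; the `Node` class, the `LEAF` sentinel and the toy tree itself are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Nested layout: every node is its own heap-allocated object."""
    feature: int = -1
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: int = 0  # predicted class, meaningful only at leaves


def predict_nested(node, x):
    """Follow object references from the root down to a leaf."""
    while node.left is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


# Flat layout: node i is described by position i of each list.
LEAF = -1  # sentinel chosen for this sketch
feature        = [0,   LEAF, 1,   LEAF, LEAF]
threshold      = [0.5, 0.0,  2.0, 0.0,  0.0]
children_left  = [1,   LEAF, 3,   LEAF, LEAF]
children_right = [2,   LEAF, 4,   LEAF, LEAF]
value          = [0,   0,    0,   1,    0]


def predict_flat(x):
    """Follow integer indices from node 0 down to a leaf."""
    i = 0
    while children_left[i] != LEAF:
        i = children_left[i] if x[feature[i]] <= threshold[i] else children_right[i]
    return value[i]


# The same toy tree, built as nested objects.
nested_root = Node(
    feature=0, threshold=0.5,
    left=Node(value=0),
    right=Node(feature=1, threshold=2.0, left=Node(value=1), right=Node(value=0)),
)

for x in ([0.0, 5.0], [1.0, 1.0], [1.0, 3.0]):
    print(x, predict_nested(nested_root, x), predict_flat(x))
```

Both functions return the same predictions. The difference is that the flat version needs a handful of contiguous arrays for the whole tree, while the nested version allocates one object per node, which is the kind of cost a memory-allocation profile in Julia might reveal.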

I've now added a pure decision tree benchmark. Julia seems to be slower by an order of magnitude, and the gap seems to widen with more data.
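
A rough way to check the "gets worse with more data" observation is to time the same workload at doubling input sizes and see how the runtime grows. The sketch below uses only the Python standard library. The workload is a toy CART-style search for the best Gini split on a single feature with binary labels, not either library's implementation, and `best_split`, `time_fit` and the synthetic data are all assumptions for illustration. In the notebook, the real fit call would take the place of `best_split`.

```python
import random
import time


def gini_from_counts(pos, n):
    """Gini impurity of n binary labels of which pos are positive."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_split(xs, ys):
    """Toy CART-style search: find the threshold on one feature that
    minimises the size-weighted Gini impurity of the two children."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total_pos = sum(y for _, y in pairs)
    best_thr, best_score = None, float("inf")
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal feature values
        score = (
            i * gini_from_counts(left_pos, i)
            + (n - i) * gini_from_counts(total_pos - left_pos, n - i)
        ) / n
        if score < best_score:
            best_thr = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_thr, best_score


def time_fit(n_rows, repeats=3, seed=0):
    """Best-of-`repeats` wall-clock time for one split search on synthetic data."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_rows)]
    ys = [int(x > 0.6) for x in xs]  # synthetic labels, perfectly separable
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        best_split(xs, ys)
        best = min(best, time.perf_counter() - start)
    return best


timings = [(n, time_fit(n)) for n in (1_000, 2_000, 4_000)]
for n, seconds in timings:
    print(f"{n:>6} rows: {seconds:.6f} s")
```

Running the equivalent loop for both implementations and comparing the ratio of their timings at each size would show whether the gap is a constant factor or grows with the data.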

@tlienart
Author

Wow, an order of magnitude! Well, that'd be a nice side project to work on: build a decent DecisionTreeFast.jl. There's really no reason for it to be much slower than the Python one...

I think their code is not based on scikit-learn but rather on something else (that I don't know), like ml toolkit or some similar name. It'd be interesting to see whether it compares favourably to that one (probably not).
