Unable to replicate results on diabetes data from paper
#129
Comments
Thanks for your interest @abhishek-ghose! Sorry you've had a hard time replicating the result... I don't see any blatant error in your code. @aagarwal1996 - could you point Abhishek to the exact code we used to run this experiment in the paper?
Thank you for your response @csinva, and thank you for reviewing the snippet!
The train:test splits were 70%:30% (stratified by label), and I limited the total data size to ~3000 points where more data was available, e.g.,

This is what the plots look like now - note the titles, they have the dataset name, number of trials, number of folds used to select the best

I tried to quantify predictability of HS being better by reporting
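The split protocol described in the comment above (cap the dataset at roughly 3000 points, then a 70:30 split stratified by label) could look something like the sketch below. This is not the code from the thread: the helper name `cap_and_split` is hypothetical, the data is a synthetic stand-in from `make_classification`, and stratifying the subsampling step is an assumption (the comment only says the split itself was stratified).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

MAX_POINTS = 3000  # approximate cap mentioned in the comment


def cap_and_split(X, y, max_points=MAX_POINTS, seed=0):
    """Subsample to at most max_points, then split 70:30, stratified by label."""
    if len(y) > max_points:
        # Assumption: the subsample is stratified too, so class balance is kept.
        X, _, y, _ = train_test_split(
            X, y, train_size=max_points, stratify=y, random_state=seed
        )
    return train_test_split(X, y, test_size=0.3, stratify=y, random_state=seed)


# Synthetic stand-in data (assumption; not one of the paper's datasets).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.65, 0.35], random_state=0
)
X_train, X_test, y_train, y_test = cap_and_split(X, y)
print(X_train.shape, X_test.shape)
```

Fixing `seed` per trial and varying it across trials would give the repeated-trials setup mentioned in the plot titles.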
I have shared some code here (works for the
Hi @csinva @aagarwal1996 - any pointers you can provide? Thanks!
Hi @abhishek-ghose, thanks for checking in again. I'll provide some code that can reproduce our experiments shortly. One quick thing I noticed is that you seem to be tuning over max_depth for HS as well, whereas we used the default parameters for RF and only tuned over the reg parameter.
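On the RF side, one reading of the suggestion above is to leave `RandomForestClassifier` at its scikit-learn defaults (in particular `max_depth=None`) and vary only the number of trees. The sketch below shows just that baseline; the HS step and its reg parameter are deliberately not shown, and the data, tree counts, and variable names are illustrative assumptions rather than the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption; not the diabetes dataset from the paper).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

default_rf_auc = {}
for n_trees in (1, 5, 10, 25):  # illustrative tree counts
    # No max_depth argument: trees are grown with scikit-learn's defaults.
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf.fit(X_tr, y_tr)
    default_rf_auc[n_trees] = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

print(default_rf_auc)
```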
Thanks @aagarwal1996 - I tried out some experiments with not setting the
This is what those results look like. For each dataset, I also calculated the
Now we see a gap; this is between RFn and HSRFn. But if I learn the
Hi,
I was trying to replicate some of the Random Forest results from the paper, specifically Figure 3(D) for the diabetes dataset - but I am unable to see the gap in AUC, as presented in the paper. It's probably me doing something silly :) - appreciate some help!

To simplify identifying a good max_depth for a Random Forest object, I'm using this class - this allows me to use scikit's GridSearchCV:

And here's my code - the X and y values passed in are from the diabetes dataset:

When I plot the columns score against num_trees in df, I see something like this:
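For readers following along, here is a minimal sketch of the kind of experiment described in this post, not the original code: for each forest size, select max_depth with scikit-learn's `GridSearchCV`, score AUC on a held-out set, and collect `num_trees` and `score` in a DataFrame `df`. It uses a plain `RandomForestClassifier` rather than the wrapper class mentioned above, and the synthetic data, depth grid, tree counts, and 3-fold CV are all assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption; not the diabetes dataset from the paper).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

rows = []
for num_trees in (1, 5, 10):  # illustrative forest sizes
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=num_trees, random_state=0),
        param_grid={"max_depth": [1, 2, 4, 8, None]},  # hypothetical grid
        scoring="roc_auc",
        cv=3,
    )
    search.fit(X_tr, y_tr)  # refits the best model on all of X_tr
    score = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
    rows.append(
        {
            "num_trees": num_trees,
            "max_depth": search.best_params_["max_depth"],
            "score": score,
        }
    )

df = pd.DataFrame(rows)
print(df)
```

Plotting `df["score"]` against `df["num_trees"]` gives the kind of curve referred to above.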