
ed-dash comments #26

Closed

alanocallaghan opened this issue Mar 27, 2024 · 5 comments

@alanocallaghan

base_estimator has been renamed to estimator in recent versions of scikit-learn.
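For example, assuming the affected code uses AdaBoostClassifier (the same rename applies to the other ensemble estimators that took a base_estimator argument):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Deprecated in scikit-learn 1.2 and removed in 1.4:
# ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1))

# Current API uses `estimator` instead:
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1))
```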

Also, in the random forest page we specify max_features=1, but the decision boundaries are all bivariate. This makes for a very confusing introduction to random forests:
https://carpentries-incubator.github.io/machine-learning-trees-python/06-random-forest/index.html

@tompollard
Collaborator

I've found this lesson to work well with just two features, but I do play around with some of the parameters to demonstrate what is happening. These demonstrations should be captured in the materials, so I'll try to make some updates to explain things more clearly.

@alanocallaghan
Author

What I mean is that if we're fitting a random forest to two variables, then I'd expect the feature subsampling to produce trees that each use one feature; otherwise it's just a regular tree ensemble.

@tompollard
Collaborator

tompollard commented Mar 27, 2024

> What I mean is that if we're fitting a random forest to two variables, then I'd expect the feature subsampling to produce trees that each use one feature; otherwise it's just a regular tree ensemble.

One of the nice things about dealing with only two variables is that we can demonstrate that this expectation is not true for random forests (at least for this particular implementation).

If it were true that setting max_features=1 led to trees built on a single variable, we would not see the following trees (which all make decisions based on both variables).

[image: individual trees from the forest, each making decisions based on both variables]

The explanation is that features are being limited at each split, not at the model level:

[screenshot: scikit-learn documentation for the max_features parameter]
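You can check this directly by inspecting which features each tree in the forest actually splits on (a minimal sketch on synthetic two-feature data, not the lesson's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy two-feature problem standing in for the lesson's data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

rf = RandomForestClassifier(n_estimators=5, max_features=1, random_state=0)
rf.fit(X, y)

# tree_.feature marks leaf nodes with -2; keep only internal split nodes
for i, tree in enumerate(rf.estimators_):
    used = sorted({int(f) for f in tree.tree_.feature if f >= 0})
    print(f"tree {i} splits on features: {used}")
# Most trees print [0, 1]: one feature is sampled per split, not per tree
```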

@alanocallaghan
Author

Ah, in that case it'd be good to explain that in the lesson.

tompollard added a commit that referenced this issue Mar 27, 2024
@alanocallaghan points out that max_features is confusing for Random Forests. Why does a Random Forest with max_features=1 still result in sub-trees that make decisions based on >1 feature? The explanation is that the max_features argument is applied at the split level, not the tree level.
@tompollard
Collaborator

@alanocallaghan Please could you take a look at #27 and let me know if this resolves the issue?

tompollard added a commit that referenced this issue Mar 27, 2024
Explain the purpose of max_features for Random Forests. Closes #26.