Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Replicate Private SKLearn Classes #2328

Closed

Conversation

thealphadollar
Copy link
Contributor

@thealphadollar thealphadollar commented Mar 7, 2020

As per #2240 , sklearn has deprecated the importing of LinearClassifierMixin and SparseCoefMixin and they are now part of private linear_model/_base.py and will be discontinued for importing from v0.24.

This PR addresses the issue by replicating the objects and storing them within VowpalWabbit which gives us full control over the classes and along we use public API of sklearn to implement them which have very dormant rate of breaking changes.

The exhaustive list of changes in the PR as follows:

  • make sklearn_vw independent of sklearn private classes

    • sklearn will remove importing of private classes from v0.24
    • we use the methods and need to implement them locally
    • serves the purpose of making us the proprietor of the code and only dependent on public API of sklearn which doesn't change very frequently
  • set minimum version of sklearn as 0.18

    • The minimum version required is 0.18 since there is no module sklearn.exceptions in v0.17
      Screenshot from 2020-03-08 01-40-07
  • make python files adhere to VW yapf formatting

fixes #2240

thealphadollar and others added 5 commits March 8, 2020 02:08
- The minimum version required is 0.17 since there is no module `sklearn.exceptions`
- sklearn will remove importing of private classes from v0.24
- we use the methods and need to implement them locally
- serves the purpose of making us the proprietor of the code and only dependent on public API of sklearn which doesn't change very frequently
@peterychang
Copy link
Collaborator

Generally this is bad practice, but I'm not sure if there's a better way to solve the problem. It might be worth holding off on this until 0.24 is closer to release to see if any official guidance (or something that doesn't require us to manage the code on our side) comes out

@thealphadollar
Copy link
Contributor Author

@peterychang Sounds good to me. We can wait a few weeks then.

@kumpera
Copy link
Collaborator

kumpera commented Mar 10, 2020

sklearn is 3 clause bsd. I think we'll need to update VW's notice if we merge this.

@thealphadollar
Copy link
Contributor Author

sklearn is 3 clause bsd. I think we'll need to update VW's notice if we merge this.

Yes, this will have to keep in mind.

Copy link
Collaborator

@lalo lalo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change for now, but wanted to have this as a review item.

python/vowpalwabbit/objects.py Outdated Show resolved Hide resolved
@jackgerrits
Copy link
Member

@thealphadollar We cannot duplicate the internal SKlearn classes to fix the deprecation. Is there direction from SKlearn about how to migrate? Can you come up with another way to solve this issue without duplicating their code.

@thealphadollar
Copy link
Contributor Author

@peterychang @jackgerrits I was sick for the past two weeks and hence no update here. I'll use the method suggested by @peterychang and attempt the resolve the issue and update the same PR.

thealphadollar added 2 commits March 22, 2020 20:22
- This is required since we cannot copy code directly from SKLearn
- This makes it easy for us to use the public API instead of copying everything over here.
- `fit` method is removed from TLCM in order to use the `fit` method defined in VW
@thealphadollar
Copy link
Contributor Author

thealphadollar commented Mar 22, 2020

@jackgerrits @peterychang I've updated the Pull Request with new changes which do not require copying the source from SKLearn.

I've used LinearSVC as it already inherits from BE, SCM, and LCM. SCM has two new methods (densify and sparsify) that we require and hence are not silenced in TLCM. LSVC has __init__ and fit method. While the former has been ignored since TLCM inherited from LCM which in turn inherited from CM and both do not have the __init__ method and hence it was unnecessary to call init on super.

UPDATE FOR BELOW QUOTED TEXT: I've reversed the order since I went through all of the methods in both TLCM (including inherited once, in old one) and figured that there was no overlap of method and hence a swap can be made without any issue.

The ignorance of fit method is important since we do not want to override the fit method from VW and hence it has been made to raise AttributeError which makes the fit from VW used instead of the one from TLCM.

NOTE: The above problem could also be solved by doing class VWClassifier(VW, ThresholdingLinearClassifierMixin) as methods are looked up from left to right but I didn't want to create large changes and avoid any major change.~

By avoiding to call super init we also make sure that we do not invoke for the BaseEstimator __init__ of the parent of LSVC and hence keep the TLCM as close as possible to the current scenario in order to break anything.

- We are inheriting first from VW and then TLCM, `fit` method from VM is used and hence it is not required to remove fit from TLCM.
@thealphadollar
Copy link
Contributor Author

@peterychang @jackgerrits @lalo Can you please review this?

@peterychang
Copy link
Collaborator

It looks ok to me, but I'm not sure if there are any unintended unintended consequences based on how python does inheritance.

IIRC inherited function implementations are chosen left-to-right on the superclass list, so changing the order of inheritance on VWClassifier could change some behavior. On the other hand, the both the VW class and the LinearSVC class adds/overloads the fit function; obviously we want to use our own implementation of it here. I believe LinearSVC also swaps the order of the two mixins compared to what we currently have.

@gramhagen are you aware of any problems with modifying the order of inheritance here? Or with using LinearSVC in place of the two deprecated mixin classes? I couldn't see anything, but i'm not familiar with the nuances of what python is doing here.

@gramhagen
Copy link
Contributor

this seems like a good approach to handling the api changes, although LinearSVC seems like an odd choice. Why not use LogisticRegression?

@thealphadollar
Copy link
Contributor Author

@gramhagen One outright reason is that @peterychang suggested it and that created a bias towards it.

But I also went forward with my own research about the same. As I saw the classes, everything but their number of methods was different. LinearSVC has two methods (which we would need to check for conflict). On the other hand, LogisticRegression has 4 and hence I outrightly chose the former.

@gramhagen
Copy link
Contributor

fair enough =). I am overhauling the sklearn_vw module a bit in #2368 though which will collide with these changes?

would you mind if we pull in some of these modifications there?

@thealphadollar
Copy link
Contributor Author

Sure. Please include it in the mentioned PR.

@peterychang
Copy link
Collaborator

Oh sorry, I only suggested LinearSVC because it looked similar to what we needed, I didn't see LogisticRegression when I was searching through the sklearn repo.

If LogisticRegression makes more sense, then please use it instead. I'm still really new on the ML side of things so I'm mostly making suggestions and comments based on an engineering standpoint

@gramhagen
Copy link
Contributor

@thealphadollar should we close this and coordinate effort on #2368?

@thealphadollar
Copy link
Contributor Author

Yes, sure.

@jackgerrits
Copy link
Member

Closing this based on the comments

@jackgerrits jackgerrits closed this Apr 2, 2020
@thealphadollar thealphadollar deleted the fix_deprecated_imports branch August 19, 2020 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sklearn has deprecated LinearModel Mixins
6 participants