Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modin integration #22

Closed
HamiltonRepoMigrationBot opened this issue Feb 26, 2023 · 10 comments
Closed

Modin integration #22

HamiltonRepoMigrationBot opened this issue Feb 26, 2023 · 10 comments
Labels
enhancement New feature or request migrated-from-old-repo Migrated from old repository

Comments

@HamiltonRepoMigrationBot
Copy link
Collaborator

Issue by skrawcz
Thursday Mar 10, 2022 at 19:09 GMT
Originally opened as stitchfix/hamilton#85


Is your feature request related to a problem? Please describe.
Modin - https://github.com/modin-project/modin - also enables scaling pandas computation. Since we have ray, dask, and koalas, why not add Modin?

Describe the solution you'd like
Modin requires a replacement of the pandas import in user code to work.
We would need to think how to do this:

  1. Do we get people to import "pandas" from hamilton, and we can then control which pandas is actually imported?
  2. Do we require users then to assume modin, by changing the pandas import themselves when defining their hamilton python functions?
  3. Or is there some other way to integrate? E.g. a graph adapter

Additional context
N/A

@HamiltonRepoMigrationBot HamiltonRepoMigrationBot added the enhancement New feature or request label Feb 26, 2023
@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by devin-petersohn
Thursday Mar 10, 2022 at 20:53 GMT


Hi @skrawcz, I'm happy to help support this idea.

  • Would (1) require Modin to be a hard dependency? Would your users want that type of behavior from hamilton?
  • 2 seems much more natural to me. Modin users can be hamilton users in that case, right?

What is required from the Modin side for an integration?

@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by skrawcz
Friday Mar 11, 2022 at 00:13 GMT


@devin-petersohn

In my ideal world, we could we enable people to maintain https://github.com/stitchfix/hamilton/blob/main/examples/hello_world/my_functions.py without touching it.

Reasoning:

  1. that code is reusable in different contexts very easily.
  2. if someone wanted to switch to dask, ray, koalas, they "technically" wouldn't have to change that code. Maybe a platform team wants to control this, and the user shouldn't need to?

idea 1

Add this to the top of the file, if people want

from hamilton.augment import pandas as pd

And if modin was detected as being installed, it would be returned, otherwise vanilla pandas would be.

idea 2

We would require users to "hard code" modin -- i.e. replace the pandas import in https://github.com/stitchfix/hamilton/blob/main/examples/hello_world/my_functions.py with modin.
Or do some if else based on an environment variable, or something like that.
Essentially like idea (1) -- but the user controls it. The downside is that they have to know about modin.

idea 3

Is there some other python duck typing way (this might be considered hacky), that "driver" or "graph adapter" code would own?

@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by elijahbenizzy
Friday Mar 11, 2022 at 03:43 GMT


Either one could be OK I think.

That said, I want to try something like the following, in the driver:

sys.modules['pandas'] = modin.pandas

So long as its the first thing executed (big IF, but it should be), then this should work...

@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by skrawcz
Monday Mar 28, 2022 at 05:08 GMT


Either one could be OK I think.

That said, I want to try something like the following, in the driver:

sys.modules['pandas'] = modin.pandas

So long as its the first thing executed (big IF, but it should be), then this should work...

Hmm, that could work. Will have to prototype and see how the ergonomics feel.

@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by skrawcz
Monday Mar 28, 2022 at 05:10 GMT


And if modin was detected as being installed, it would be returned, otherwise vanilla pandas would be.

Just putting it in here, that we could use https://docs.python.org/3/library/importlib.html#checking-if-a-module-can-be-imported or something like that to check if modin is installed.

@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by devin-petersohn
Thursday Apr 07, 2022 at 01:12 GMT


Thanks @skrawcz and @elijahbenizzy! Is there some way we can help to support this on the Modin side?

Typically the approach I have taken with Modin is that the choice should be the users and that users should be aware that Modin is being used. We make it easy to not only move to Modin, but also back to pandas from Modin if you choose.

Replacing sys.modules could work, but there may be some considerations here:

  • When would you change sys.modules?
    • If it happens on import, and the user doesn't end up using Hamilton after import, they may still end up using Modin thinking it is pandas (weird edge case)
  • Would users be able to opt-out of this (or opt-in)?
  • If Modin had a configuration that did this for you, could that help?

@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by skrawcz
Tuesday Apr 12, 2022 at 05:37 GMT


Thanks @devin-petersohn comments inline:

When would you change sys.modules?

  • If it happens on import, and the user doesn't end up using Hamilton after import, they may still end up using Modin thinking it is pandas (weird edge case)

It wouldn't happen on import of hamilton no. It would be as part of a script/flow of execution:

from hamilton import driver, switch_modin_for_pandas, switch_pandas_for_modin

# do switch here
switch_modin_for_pandas()
# have to import after doing the switch
import func_module
dr = driver.Driver({}, func_module)
df = dr.execute(['a', 'b', ...])
# switch it back
switch_pandas_for_modin()
save_df(df)

Would users be able to opt-out of this (or opt-in)?

The idea is that they opt-in to this, but ideally they don't have to change any of their logic to do so.

If Modin had a configuration that did this for you, could that help?

Potentially. It would enable us to depend on it, rather than having to write and maintain it ourselves.

@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by devin-petersohn
Tuesday Apr 19, 2022 at 02:46 GMT


Ok this should actually be pretty straightforward as an opt-in utility. I think we can probably start with a hacky approach in Hamilton and then merge it into Modin as it matures as a configuration (i.e. modin.config.OverrideAllPandasCalls.enable()). I like the idea of the config, but let's make sure it solves the use case first is the thinking here. Would that process make sense to you @skrawcz and @elijahbenizzy?

@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by skrawcz
Thursday Apr 21, 2022 at 15:41 GMT


@devin-petersohn sounds good.

Do you happen by chance to have that hacky incantation?

@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by skrawcz
Monday May 23, 2022 at 23:38 GMT


@devin-petersohn okay so we can't play with sys modules, because modin still requires access to pandas. So you'd have to provide a means to do it. Created modin-project/modin#4488 to track.

Otherwise I am going to prototype idea 1 and see how that feels.

@elijahbenizzy elijahbenizzy added the migrated-from-old-repo Migrated from old repository label Feb 26, 2023
@zilto zilto closed this as not planned Won't fix, can't repro, duplicate, stale May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request migrated-from-old-repo Migrated from old repository
Projects
None yet
Development

No branches or pull requests

3 participants