-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modularization of data.all code #295
Labels
priority: high
status: in-progress
This issue has been picked and is being implemented
type: modularization
Code refactoring project
Comments
dlpzx
added
type: newfeature
New feature request
priority: high
status: in-progress
This issue has been picked and is being implemented
labels
Feb 13, 2023
dlpzx
added
type: modularization
Code refactoring project
and removed
type: newfeature
New feature request
labels
Mar 14, 2023
nikpodsh
added a commit
that referenced
this issue
Mar 31, 2023
This is a draft PR for showing purposes. There are still some minor issues that needs to be addressed. ### Feature or Bugfix - Refactoring ### Detail There are following changes under this PR: 1. Modularization + Refactoring of notebooks There are new modules that will play a major role in the future refactoring: * Core = contains the code need for application to operate correctly * Common = common code for all modules * Modules = the plugin/feature that can be inserted into the system (at the moment only notebooks) The other part that is related to modularization is the creation of environment parameters. Environment parameter will replace all hardcoded parameters of the environment configuration. There is a new file - config.json that allows you to configure an application configuration. All existing parameters will be migrated via db migration in AWS 2. Extracting permissions and request context (Optional for the modularization) Engine, user, and user groups had been passed as a parameter of context in the request. This had forced to pass a lot of parameters to other methods that weren't even needed. This information should be as a scope of the request session. There is a new way to retrieve the information using `RequestContext.` There is also a new way to use permission checks that require less parameters and make code cleaner. The old way was marked as deprecated 3. Restructure of the code (Optional for the modularization) Since the modularization will touch all the places in the API code it can be a good change to set a new structure of the code. There are small re-organization in notebook module to address * Allocating the resources before the validating parameters * Not clear responsibility of the classes * Mixed layers There are new structure : - resolvers = validate and pass code to service layer - service layer = bisnesss logic - repositories = database logic (all queries should be placed here) - aws = contains a wrapper client upon boto3 - cdk = all logic related to create stacks or code for ecs - tasks = code that will be executed in AWS lambda (short-living tasks) All names can be changed. ### Relates [#295](#295) By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com>
This was referenced Apr 13, 2023
nikpodsh
added a commit
that referenced
this issue
Apr 24, 2023
### Feature or Bugfix - Refactoring (Modularization) ### Relates - Related issues #295 and #412 ### Short Summary First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) ### Long Summary Datasets are huge. It's one of the central modules that's spread everywhere across the application. Migrating the entire Dataset piece would be very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break down this work into "small" steps to make it more convenient to review. Dataset's API consist of the following items: * `Dataset` * `DatasetTable` * `DatasetTableColumn` * `DatasetLocation` * `DatasetProfiling` In this PR, there is only creation of `Dataset` module and migration of `DatasetTableColumn` (and some related to it pieces). Why? Because the plan was to migrate it, to see what issues would come up along with it and to address them here. The refactoring of `DatasetTableColumn` will be in other PR. The issues: 1) Glossaries 2) Feed 3) Long tasks for datasets 4) Redshift Glossaries rely on GraphQL UNION of different type (including datasets). Created an abstraction for glossary registration. There was an idea to change frontend, but it'd require a lot of work to do this Feed: same as glossaries. Solved the similar way. For feed, changing frontend API is more feasible, but I wanted it to be consistent with glossaries Long tasks for datasets. They migrated into tasks folder and doesn't require a dedicated loading for its code (at least for now). But there are two concerns: 1) The deployment uses a direct module folder references to run them (e.g. `dataall.modules.datasets....`, so basically when a module is deactivated, then we shouldn't deploy this tasks as well). I left a TODO for it to address in future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies. 2) There is a reference to `redshift` from long-running tasks = should be address in `redshift` module Redshift: it has some references to `datasets`. So there will be either dependencies among modules or small code duplication (if `redshift` doesn't rely hard on `datasets`) = will be addressed in `redshift` module Other changes: Fixed and improved some tests Extracted glue handler code that related to `DatasetTableColumn` Renamed import mode from tasks to handlers for async lambda. A few hacks that will go away with next refactoring :) Next steps: [Part2 ](nikpodsh#1) in preview :) Extract rest of datasets functionality (perhaps, in a few steps) Refactor extractor modules the same way as notebooks Extract tests to follow the same structure. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
nikpodsh
added a commit
that referenced
this issue
May 2, 2023
This was referenced May 2, 2023
nikpodsh
added a commit
that referenced
this issue
May 4, 2023
### Feature or Bugfix - Refactoring ### Detail Refactoring of the third part of dataset: `DatasetStorageLocation` Introduced the Indexers: this code was migrated from the `indexers.py` and put into modules. Removed unused alarms (which didn't call actual alarm code) Introduced `DatasetShareService` but it seems it will be migrated to share module. All `DatasetXXXServices` will be split onto Services (business logic) and Repositories( DAO layer) in future parts ### Relates - #412 and #295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
nikpodsh
added a commit
that referenced
this issue
May 4, 2023
### Feature or Bugfix - Refactoring ### Detail Refactoring of DatasetTable: Get rid of ElasticSearch connection for every request. Created a lazy way to establish connection. ### Relates #412 and #295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>
nikpodsh
added a commit
that referenced
this issue
May 10, 2023
### Feature or Bugfix - Refactoring ### Detail Refactoring of the Dataset entity and related to it code. Refactoring for Votes Introduced DataPolicy (the same way as ServicePolicy was used used) Extracted dataset related permissions. Used new `has_tenant_permission` instead of `has_tenant_perm` that allows not to pass unused parameters ### Relates #412 and #295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
dlpzx
added a commit
to dlpzx/aws-dataall
that referenced
this issue
May 25, 2023
This is a draft PR for showing purposes. There are still some minor issues that needs to be addressed. ### Feature or Bugfix - Refactoring ### Detail There are following changes under this PR: 1. Modularization + Refactoring of notebooks There are new modules that will play a major role in the future refactoring: * Core = contains the code need for application to operate correctly * Common = common code for all modules * Modules = the plugin/feature that can be inserted into the system (at the moment only notebooks) The other part that is related to modularization is the creation of environment parameters. Environment parameter will replace all hardcoded parameters of the environment configuration. There is a new file - config.json that allows you to configure an application configuration. All existing parameters will be migrated via db migration in AWS 2. Extracting permissions and request context (Optional for the modularization) Engine, user, and user groups had been passed as a parameter of context in the request. This had forced to pass a lot of parameters to other methods that weren't even needed. This information should be as a scope of the request session. There is a new way to retrieve the information using `RequestContext.` There is also a new way to use permission checks that require less parameters and make code cleaner. The old way was marked as deprecated 3. Restructure of the code (Optional for the modularization) Since the modularization will touch all the places in the API code it can be a good change to set a new structure of the code. There are small re-organization in notebook module to address * Allocating the resources before the validating parameters * Not clear responsibility of the classes * Mixed layers There are new structure : - resolvers = validate and pass code to service layer - service layer = bisnesss logic - repositories = database logic (all queries should be placed here) - aws = contains a wrapper client upon boto3 - cdk = all logic related to create stacks or code for ecs - tasks = code that will be executed in AWS lambda (short-living tasks) All names can be changed. ### Relates [data-dot-all#295](data-dot-all#295) By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com>
dlpzx
pushed a commit
to dlpzx/aws-dataall
that referenced
this issue
May 25, 2023
### Feature or Bugfix - Refactoring (Modularization) ### Relates - Related issues data-dot-all#295 and data-dot-all#412 ### Short Summary First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) ### Long Summary Datasets are huge. It's one of the central modules that's spread everywhere across the application. Migrating the entire Dataset piece would be very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break down this work into "small" steps to make it more convenient to review. Dataset's API consist of the following items: * `Dataset` * `DatasetTable` * `DatasetTableColumn` * `DatasetLocation` * `DatasetProfiling` In this PR, there is only creation of `Dataset` module and migration of `DatasetTableColumn` (and some related to it pieces). Why? Because the plan was to migrate it, to see what issues would come up along with it and to address them here. The refactoring of `DatasetTableColumn` will be in other PR. The issues: 1) Glossaries 2) Feed 3) Long tasks for datasets 4) Redshift Glossaries rely on GraphQL UNION of different type (including datasets). Created an abstraction for glossary registration. There was an idea to change frontend, but it'd require a lot of work to do this Feed: same as glossaries. Solved the similar way. For feed, changing frontend API is more feasible, but I wanted it to be consistent with glossaries Long tasks for datasets. They migrated into tasks folder and doesn't require a dedicated loading for its code (at least for now). But there are two concerns: 1) The deployment uses a direct module folder references to run them (e.g. `dataall.modules.datasets....`, so basically when a module is deactivated, then we shouldn't deploy this tasks as well). I left a TODO for it to address in future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies. 2) There is a reference to `redshift` from long-running tasks = should be address in `redshift` module Redshift: it has some references to `datasets`. So there will be either dependencies among modules or small code duplication (if `redshift` doesn't rely hard on `datasets`) = will be addressed in `redshift` module Other changes: Fixed and improved some tests Extracted glue handler code that related to `DatasetTableColumn` Renamed import mode from tasks to handlers for async lambda. A few hacks that will go away with next refactoring :) Next steps: [Part2 ](nikpodsh#1) in preview :) Extract rest of datasets functionality (perhaps, in a few steps) Refactor extractor modules the same way as notebooks Extract tests to follow the same structure. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
dlpzx
pushed a commit
to dlpzx/aws-dataall
that referenced
this issue
May 25, 2023
### Feature or Bugfix - Refactoring ### Detail Refactoring of DatasetProfilingRun ### Relates - data-dot-all#295 and data-dot-all#412 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
dlpzx
pushed a commit
to dlpzx/aws-dataall
that referenced
this issue
May 25, 2023
### Feature or Bugfix - Refactoring ### Detail Refactoring of the third part of dataset: `DatasetStorageLocation` Introduced the Indexers: this code was migrated from the `indexers.py` and put into modules. Removed unused alarms (which didn't call actual alarm code) Introduced `DatasetShareService` but it seems it will be migrated to share module. All `DatasetXXXServices` will be split onto Services (business logic) and Repositories( DAO layer) in future parts ### Relates - data-dot-all#412 and data-dot-all#295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
dlpzx
pushed a commit
to dlpzx/aws-dataall
that referenced
this issue
May 25, 2023
### Feature or Bugfix - Refactoring ### Detail Refactoring of DatasetTable: Get rid of ElasticSearch connection for every request. Created a lazy way to establish connection. ### Relates data-dot-all#412 and data-dot-all#295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>
dlpzx
pushed a commit
to dlpzx/aws-dataall
that referenced
this issue
May 25, 2023
### Feature or Bugfix - Refactoring ### Detail Refactoring of the Dataset entity and related to it code. Refactoring for Votes Introduced DataPolicy (the same way as ServicePolicy was used used) Extracted dataset related permissions. Used new `has_tenant_permission` instead of `has_tenant_perm` that allows not to pass unused parameters ### Relates data-dot-all#412 and data-dot-all#295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
This was referenced Jun 14, 2023
dlpzx
added a commit
that referenced
this issue
Jun 16, 2023
### Feature or Bugfix - Feature ### Detail - For ML Studio and Environments, the original procedure of synthesizing the template and testing the resulting json has been replaced by the `cdk assertions` library recommended in the [documentation](https://docs.aws.amazon.com/cdk/v2/guide/testing.html#testing_getting_started) of CDK. - In the Environment cdk testing, the `extent` method of the `EnvironmentStackExtension` subclasses has been mocked. Now it only tests the `EnvironmentSetup` stack as if no extensions were registered. - In the MLStudio cdk testing, this PR adds a test for the `SageMakerDomainExtension` mocking the environment stack. It tests the `SageMakerDomainExtension` as a standalone. Open question: 1) The rest of cdk stacks (notebooks, pipelines, datasets...) are tested using the old method of printing the json template. Should I go ahead and migrate to using `cdk assertions` library? If so, I thought of doing it for notebooks only and sync on the testing for datasets and pipelines with @dbalintx and @nikpodsh. 2) With the `cdk assertions` library we can test the resources properties more in depth. I added some tests on the environment stack but we could add more asserts in all stacks. I will add more tests on the MLStudio stack and I made a note in the GitHub project to review this once modularization is complete. ### Relates - #295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
nikpodsh
added a commit
that referenced
this issue
Jun 26, 2023
Modularization of the sharing. ### Detail Migrated the sharing part (including tasks) Removed unused methods for datasets There are a few issues that needs to be addressed before merging this PR. But it can be view, since we don't expect major changes here. ### Related #295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com> Co-authored-by: kukushking <kukushkin.anton@gmail.com> Co-authored-by: Dariusz Osiennik <osiend@amazon.com> Co-authored-by: Noah Paige <69586985+noah-paige@users.noreply.github.com> Co-authored-by: Dennis Goldner <107395339+degoldner@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Abdulrahman Kaitoua <abdulrahman.kaitoua@polimi.it> Co-authored-by: akaitoua-sa <126820454+akaitoua-sa@users.noreply.github.com> Co-authored-by: Gezim Musliaj <102723839+gmuslia@users.noreply.github.com> Co-authored-by: Rick Bernotas <97474536+rbernotas@users.noreply.github.com> Co-authored-by: David Mutune Kimengu <57294718+kimengu-david@users.noreply.github.com> Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>
Completed with V2.0 release 🚀 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
priority: high
status: in-progress
This issue has been picked and is being implemented
type: modularization
Code refactoring project
Is your feature request related to a problem? Please describe.
Describe the solution you'd like
Describe alternatives you've considered
Modularization can be done at multiple levels.
Additional context
Add any other context or screenshots about the feature request here.
P.S. Please Don't attach files. Add code snippets directly in the message body instead.
The text was updated successfully, but these errors were encountered: