Modularization of data.all code #295

dlpzx · 2023-02-07T06:35:22Z

Is your feature request related to a problem? Please describe.

As a client I am always frustrated when I customize data.all and then I ran into conflicts whenever I try get a more upgraded version.
Since developing a new feature involves changing many details across the repo, the learning curve can be very steep at the beginning.
I also find it difficult to contribute back, since my developed features collide with some features or modify many different pieces of code in various files across the repo.
Some customers want to use a subset of data.all modules but instead they need to deploy the whole platform and maintain the whole code even when using only the core functionalities.

Describe the solution you'd like

Code should be refractored to group features/modules together. It should be easy to find all code related to a feature and it should be easy to replace it with another piece of code.
Platform should be configurable and customers need to easily define the modules that build their platform and what is the source code for it.

Describe alternatives you've considered
Modularization can be done at multiple levels.

Architecture modularization: similar to what is already implemented. Each component is isolated (db, api, aws...)
Code modularization: refractor code but keep the above infrastructure in single components for all features.
Features as microservices: Each feature module having its own resources

Additional context
Add any other context or screenshots about the feature request here.

P.S. Please Don't attach files. Add code snippets directly in the message body instead.

This is a draft PR for showing purposes. There are still some minor issues that needs to be addressed. ### Feature or Bugfix - Refactoring ### Detail There are following changes under this PR: 1. Modularization + Refactoring of notebooks There are new modules that will play a major role in the future refactoring: * Core = contains the code need for application to operate correctly * Common = common code for all modules * Modules = the plugin/feature that can be inserted into the system (at the moment only notebooks) The other part that is related to modularization is the creation of environment parameters. Environment parameter will replace all hardcoded parameters of the environment configuration. There is a new file - config.json that allows you to configure an application configuration. All existing parameters will be migrated via db migration in AWS 2. Extracting permissions and request context (Optional for the modularization) Engine, user, and user groups had been passed as a parameter of context in the request. This had forced to pass a lot of parameters to other methods that weren't even needed. This information should be as a scope of the request session. There is a new way to retrieve the information using `RequestContext.` There is also a new way to use permission checks that require less parameters and make code cleaner. The old way was marked as deprecated 3. Restructure of the code (Optional for the modularization) Since the modularization will touch all the places in the API code it can be a good change to set a new structure of the code. There are small re-organization in notebook module to address * Allocating the resources before the validating parameters * Not clear responsibility of the classes * Mixed layers There are new structure : - resolvers = validate and pass code to service layer - service layer = bisnesss logic - repositories = database logic (all queries should be placed here) - aws = contains a wrapper client upon boto3 - cdk = all logic related to create stacks or code for ecs - tasks = code that will be executed in AWS lambda (short-living tasks) All names can be changed. ### Relates [#295](#295) By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com>

### Feature or Bugfix - Refactoring (Modularization) ### Relates - Related issues #295 and #412 ### Short Summary First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) ### Long Summary Datasets are huge. It's one of the central modules that's spread everywhere across the application. Migrating the entire Dataset piece would be very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break down this work into "small" steps to make it more convenient to review. Dataset's API consist of the following items: * `Dataset` * `DatasetTable` * `DatasetTableColumn` * `DatasetLocation` * `DatasetProfiling` In this PR, there is only creation of `Dataset` module and migration of `DatasetTableColumn` (and some related to it pieces). Why? Because the plan was to migrate it, to see what issues would come up along with it and to address them here. The refactoring of `DatasetTableColumn` will be in other PR. The issues: 1) Glossaries 2) Feed 3) Long tasks for datasets 4) Redshift Glossaries rely on GraphQL UNION of different type (including datasets). Created an abstraction for glossary registration. There was an idea to change frontend, but it'd require a lot of work to do this Feed: same as glossaries. Solved the similar way. For feed, changing frontend API is more feasible, but I wanted it to be consistent with glossaries Long tasks for datasets. They migrated into tasks folder and doesn't require a dedicated loading for its code (at least for now). But there are two concerns: 1) The deployment uses a direct module folder references to run them (e.g. `dataall.modules.datasets....`, so basically when a module is deactivated, then we shouldn't deploy this tasks as well). I left a TODO for it to address in future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies. 2) There is a reference to `redshift` from long-running tasks = should be address in `redshift` module Redshift: it has some references to `datasets`. So there will be either dependencies among modules or small code duplication (if `redshift` doesn't rely hard on `datasets`) = will be addressed in `redshift` module Other changes: Fixed and improved some tests Extracted glue handler code that related to `DatasetTableColumn` Renamed import mode from tasks to handlers for async lambda. A few hacks that will go away with next refactoring :) Next steps: [Part2 ](nikpodsh#1) in preview :) Extract rest of datasets functionality (perhaps, in a few steps) Refactor extractor modules the same way as notebooks Extract tests to follow the same structure. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

### Feature or Bugfix - Refactoring ### Detail Refactoring of DatasetProfilingRun ### Relates - #295 and #412 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

### Feature or Bugfix - Refactoring ### Detail Refactoring of the third part of dataset: `DatasetStorageLocation` Introduced the Indexers: this code was migrated from the `indexers.py` and put into modules. Removed unused alarms (which didn't call actual alarm code) Introduced `DatasetShareService` but it seems it will be migrated to share module. All `DatasetXXXServices` will be split onto Services (business logic) and Repositories( DAO layer) in future parts ### Relates - #412 and #295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

### Feature or Bugfix - Refactoring ### Detail Refactoring of DatasetTable: Get rid of ElasticSearch connection for every request. Created a lazy way to establish connection. ### Relates #412 and #295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>

### Feature or Bugfix - Refactoring ### Detail Refactoring of the Dataset entity and related to it code. Refactoring for Votes Introduced DataPolicy (the same way as ServicePolicy was used used) Extracted dataset related permissions. Used new `has_tenant_permission` instead of `has_tenant_perm` that allows not to pass unused parameters ### Relates #412 and #295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

This is a draft PR for showing purposes. There are still some minor issues that needs to be addressed. ### Feature or Bugfix - Refactoring ### Detail There are following changes under this PR: 1. Modularization + Refactoring of notebooks There are new modules that will play a major role in the future refactoring: * Core = contains the code need for application to operate correctly * Common = common code for all modules * Modules = the plugin/feature that can be inserted into the system (at the moment only notebooks) The other part that is related to modularization is the creation of environment parameters. Environment parameter will replace all hardcoded parameters of the environment configuration. There is a new file - config.json that allows you to configure an application configuration. All existing parameters will be migrated via db migration in AWS 2. Extracting permissions and request context (Optional for the modularization) Engine, user, and user groups had been passed as a parameter of context in the request. This had forced to pass a lot of parameters to other methods that weren't even needed. This information should be as a scope of the request session. There is a new way to retrieve the information using `RequestContext.` There is also a new way to use permission checks that require less parameters and make code cleaner. The old way was marked as deprecated 3. Restructure of the code (Optional for the modularization) Since the modularization will touch all the places in the API code it can be a good change to set a new structure of the code. There are small re-organization in notebook module to address * Allocating the resources before the validating parameters * Not clear responsibility of the classes * Mixed layers There are new structure : - resolvers = validate and pass code to service layer - service layer = bisnesss logic - repositories = database logic (all queries should be placed here) - aws = contains a wrapper client upon boto3 - cdk = all logic related to create stacks or code for ecs - tasks = code that will be executed in AWS lambda (short-living tasks) All names can be changed. ### Relates [data-dot-all#295](data-dot-all#295) By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com>

### Feature or Bugfix - Refactoring (Modularization) ### Relates - Related issues data-dot-all#295 and data-dot-all#412 ### Short Summary First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) ### Long Summary Datasets are huge. It's one of the central modules that's spread everywhere across the application. Migrating the entire Dataset piece would be very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break down this work into "small" steps to make it more convenient to review. Dataset's API consist of the following items: * `Dataset` * `DatasetTable` * `DatasetTableColumn` * `DatasetLocation` * `DatasetProfiling` In this PR, there is only creation of `Dataset` module and migration of `DatasetTableColumn` (and some related to it pieces). Why? Because the plan was to migrate it, to see what issues would come up along with it and to address them here. The refactoring of `DatasetTableColumn` will be in other PR. The issues: 1) Glossaries 2) Feed 3) Long tasks for datasets 4) Redshift Glossaries rely on GraphQL UNION of different type (including datasets). Created an abstraction for glossary registration. There was an idea to change frontend, but it'd require a lot of work to do this Feed: same as glossaries. Solved the similar way. For feed, changing frontend API is more feasible, but I wanted it to be consistent with glossaries Long tasks for datasets. They migrated into tasks folder and doesn't require a dedicated loading for its code (at least for now). But there are two concerns: 1) The deployment uses a direct module folder references to run them (e.g. `dataall.modules.datasets....`, so basically when a module is deactivated, then we shouldn't deploy this tasks as well). I left a TODO for it to address in future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies. 2) There is a reference to `redshift` from long-running tasks = should be address in `redshift` module Redshift: it has some references to `datasets`. So there will be either dependencies among modules or small code duplication (if `redshift` doesn't rely hard on `datasets`) = will be addressed in `redshift` module Other changes: Fixed and improved some tests Extracted glue handler code that related to `DatasetTableColumn` Renamed import mode from tasks to handlers for async lambda. A few hacks that will go away with next refactoring :) Next steps: [Part2 ](nikpodsh#1) in preview :) Extract rest of datasets functionality (perhaps, in a few steps) Refactor extractor modules the same way as notebooks Extract tests to follow the same structure. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

### Feature or Bugfix - Refactoring ### Detail Refactoring of DatasetProfilingRun ### Relates - data-dot-all#295 and data-dot-all#412 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

### Feature or Bugfix - Refactoring ### Detail Refactoring of the third part of dataset: `DatasetStorageLocation` Introduced the Indexers: this code was migrated from the `indexers.py` and put into modules. Removed unused alarms (which didn't call actual alarm code) Introduced `DatasetShareService` but it seems it will be migrated to share module. All `DatasetXXXServices` will be split onto Services (business logic) and Repositories( DAO layer) in future parts ### Relates - data-dot-all#412 and data-dot-all#295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

### Feature or Bugfix - Refactoring ### Detail Refactoring of DatasetTable: Get rid of ElasticSearch connection for every request. Created a lazy way to establish connection. ### Relates data-dot-all#412 and data-dot-all#295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>

### Feature or Bugfix - Refactoring ### Detail Refactoring of the Dataset entity and related to it code. Refactoring for Votes Introduced DataPolicy (the same way as ServicePolicy was used used) Extracted dataset related permissions. Used new `has_tenant_permission` instead of `has_tenant_perm` that allows not to pass unused parameters ### Relates data-dot-all#412 and data-dot-all#295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dbalintx

### Feature or Bugfix - Feature ### Detail - For ML Studio and Environments, the original procedure of synthesizing the template and testing the resulting json has been replaced by the `cdk assertions` library recommended in the [documentation](https://docs.aws.amazon.com/cdk/v2/guide/testing.html#testing_getting_started) of CDK. - In the Environment cdk testing, the `extent` method of the `EnvironmentStackExtension` subclasses has been mocked. Now it only tests the `EnvironmentSetup` stack as if no extensions were registered. - In the MLStudio cdk testing, this PR adds a test for the `SageMakerDomainExtension` mocking the environment stack. It tests the `SageMakerDomainExtension` as a standalone. Open question: 1) The rest of cdk stacks (notebooks, pipelines, datasets...) are tested using the old method of printing the json template. Should I go ahead and migrate to using `cdk assertions` library? If so, I thought of doing it for notebooks only and sync on the testing for datasets and pipelines with @dbalintx and @nikpodsh. 2) With the `cdk assertions` library we can test the resources properties more in depth. I added some tests on the environment stack but we could add more asserts in all stacks. I will add more tests on the MLStudio stack and I made a note in the GitHub project to review this once modularization is complete. ### Relates - #295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Modularization of the sharing. ### Detail Migrated the sharing part (including tasks) Removed unused methods for datasets There are a few issues that needs to be addressed before merging this PR. But it can be view, since we don't expect major changes here. ### Related #295 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com> Co-authored-by: kukushking <kukushkin.anton@gmail.com> Co-authored-by: Dariusz Osiennik <osiend@amazon.com> Co-authored-by: Noah Paige <69586985+noah-paige@users.noreply.github.com> Co-authored-by: Dennis Goldner <107395339+degoldner@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Abdulrahman Kaitoua <abdulrahman.kaitoua@polimi.it> Co-authored-by: akaitoua-sa <126820454+akaitoua-sa@users.noreply.github.com> Co-authored-by: Gezim Musliaj <102723839+gmuslia@users.noreply.github.com> Co-authored-by: Rick Bernotas <97474536+rbernotas@users.noreply.github.com> Co-authored-by: David Mutune Kimengu <57294718+kimengu-david@users.noreply.github.com> Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>

dlpzx · 2023-09-13T09:52:03Z

Completed with V2.0 release 🚀

dlpzx mentioned this issue Feb 7, 2023

Modularization draft (POC) proposal #296

Closed

dlpzx added type: newfeature New feature request priority: high status: in-progress This issue has been picked and is being implemented labels Feb 13, 2023

dlpzx added this to To do in v2.0.0 Code Modularization via automation Feb 13, 2023

dlpzx moved this from To do to In progress in v2.0.0 Code Modularization Feb 13, 2023

nikpodsh mentioned this issue Mar 9, 2023

modularization: backend Pluginarchitecture #359

Merged

dlpzx added type: modularization Code refactoring project and removed type: newfeature New feature request labels Mar 14, 2023

This was referenced Apr 13, 2023

Modularization of Datasets (backend) #412

Closed

modularization: Dataset Modularization pt.1 #413

Merged

Dataset Modularization pt.2 nikpodsh/aws-dataall#1

Closed

nikpodsh mentioned this issue Apr 24, 2023

modularization: Datasets modularization pt.2 #432

Merged

This was referenced May 2, 2023

modularization: Datasets modularization pt.3 #440

Merged

modularization: Datasets modularization pt.4 #441

Merged

modularization: Datasets modularization pt.5 #442

Merged

dlpzx removed this from In progress in v2.0.0 Code Modularization May 17, 2023

nikpodsh mentioned this issue May 31, 2023

modularization: Dataset Sharing modularization #488

Merged

This was referenced Jun 14, 2023

modularization: testing extension cdk stacks and environment stack separatedly #517

Closed

modularization: testing extensions #518

Merged

dlpzx closed this as completed Sep 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modularization of data.all code #295

Modularization of data.all code #295

dlpzx commented Feb 7, 2023

dlpzx commented Sep 13, 2023

Modularization of data.all code #295

Modularization of data.all code #295

Comments

dlpzx commented Feb 7, 2023

dlpzx commented Sep 13, 2023