Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modularization of data.all code #295

Closed
dlpzx opened this issue Feb 7, 2023 · 1 comment
Closed

Modularization of data.all code #295

dlpzx opened this issue Feb 7, 2023 · 1 comment
Labels
priority: high status: in-progress This issue has been picked and is being implemented type: modularization Code refactoring project

Comments

@dlpzx
Copy link
Contributor

dlpzx commented Feb 7, 2023

Is your feature request related to a problem? Please describe.

  • As a client I am always frustrated when I customize data.all and then I ran into conflicts whenever I try get a more upgraded version.
  • Since developing a new feature involves changing many details across the repo, the learning curve can be very steep at the beginning.
  • I also find it difficult to contribute back, since my developed features collide with some features or modify many different pieces of code in various files across the repo.
  • Some customers want to use a subset of data.all modules but instead they need to deploy the whole platform and maintain the whole code even when using only the core functionalities.

Describe the solution you'd like

  • Code should be refractored to group features/modules together. It should be easy to find all code related to a feature and it should be easy to replace it with another piece of code.
  • Platform should be configurable and customers need to easily define the modules that build their platform and what is the source code for it.

Describe alternatives you've considered
Modularization can be done at multiple levels.

  • Architecture modularization: similar to what is already implemented. Each component is isolated (db, api, aws...)
  • Code modularization: refractor code but keep the above infrastructure in single components for all features.
  • Features as microservices: Each feature module having its own resources

Additional context
Add any other context or screenshots about the feature request here.

P.S. Please Don't attach files. Add code snippets directly in the message body instead.

@dlpzx dlpzx added type: newfeature New feature request priority: high status: in-progress This issue has been picked and is being implemented labels Feb 13, 2023
@dlpzx dlpzx added this to To do in v2.0.0 Code Modularization via automation Feb 13, 2023
@dlpzx dlpzx moved this from To do to In progress in v2.0.0 Code Modularization Feb 13, 2023
@dlpzx dlpzx added type: modularization Code refactoring project and removed type: newfeature New feature request labels Mar 14, 2023
nikpodsh added a commit that referenced this issue Mar 31, 2023
This is a draft PR for showing purposes. There are still some minor
issues that needs to be addressed.

### Feature or Bugfix
- Refactoring

### Detail
There are following changes under this PR:
1. Modularization + Refactoring of notebooks
There are new modules that will play a major role in the future
refactoring:

   * Core = contains the code need for application to operate correctly
   * Common = common code for all modules
* Modules = the plugin/feature that can be inserted into the system (at
the moment only notebooks)

The other part that is related to modularization is the creation of
environment parameters.
Environment parameter will replace all hardcoded parameters of the
environment configuration.
There is a new file - config.json that allows you to configure an
application configuration.
All existing parameters will be migrated via db migration in AWS

2. Extracting permissions and request context (Optional for the
modularization)

Engine, user, and user groups had been passed as a parameter of context
in the request. This had forced to pass a lot of parameters to other
methods that weren't even needed. This information should be as a scope
of the request session.
There is a new way to retrieve the information using `RequestContext.`
There is also a new way to use permission checks that require less
parameters and make code cleaner. The old way was marked as deprecated

3. Restructure of the code (Optional for the modularization)

Since the modularization will touch all the places in the API code it
can be a good change to set a new structure of the code. There are small
re-organization in notebook module to address

   * Allocating the resources before the validating parameters
   * Not clear responsibility of the classes
   * Mixed layers 
  
   There are new structure :
   - resolvers = validate and pass code to service layer
   - service layer = bisnesss logic
   - repositories = database logic (all queries should be placed here)
   - aws = contains a wrapper client upon boto3
   - cdk = all logic related to create stacks or code for ecs
- tasks = code that will be executed in AWS lambda (short-living tasks)

   All names can be changed.

### Relates
[#295](#295)



By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com>
nikpodsh added a commit that referenced this issue Apr 24, 2023
### Feature or Bugfix
- Refactoring (Modularization)

### Relates
- Related issues #295 and #412 

### Short Summary
First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) 

### Long Summary 
Datasets are huge. It's one of the central modules that's spread
everywhere across the application. Migrating the entire Dataset piece
would be very difficult task and, more importantly, even more difficult
to review. Therefore, I decided to break down this work into "small"
steps to make it more convenient to review.
Dataset's API consist of the following items:
* `Dataset`
* `DatasetTable`
* `DatasetTableColumn`
* `DatasetLocation`
* `DatasetProfiling`

In this PR, there is only creation of `Dataset` module and migration of
`DatasetTableColumn` (and some related to it pieces). Why? Because the
plan was to migrate it, to see what issues would come up along with it
and to address them here. The refactoring of `DatasetTableColumn` will
be in other PR.
The issues: 
1) Glossaries
2) Feed
3) Long tasks for datasets
4) Redshift

Glossaries rely on GraphQL UNION of different type (including datasets).
Created an abstraction for glossary registration. There was an idea to
change frontend, but it'd require a lot of work to do this

Feed: same as glossaries. Solved the similar way. For feed, changing
frontend API is more feasible, but I wanted it to be consistent with
glossaries

Long tasks for datasets. They migrated into tasks folder and doesn't
require a dedicated loading for its code (at least for now). But there
are two concerns:
1) The deployment uses a direct module folder references to run them
(e.g. `dataall.modules.datasets....`, so basically when a module is
deactivated, then we shouldn't deploy this tasks as well). I left a TODO
for it to address in future (when we migrate all modules), but we should
bear in mind that it might lead to inconsistencies.
2) There is a reference to `redshift` from long-running tasks = should
be address in `redshift` module

Redshift: it has some references to `datasets`. So there will be either
dependencies among modules or small code duplication (if `redshift`
doesn't rely hard on `datasets`) = will be addressed in `redshift`
module

Other changes:
Fixed and improved some tests 
Extracted glue handler code that related to `DatasetTableColumn`
Renamed import mode from tasks to handlers for async lambda.
A few hacks that will go away with next refactoring :) 

Next steps:
[Part2 ](nikpodsh#1) in preview :) 
Extract rest of datasets functionality (perhaps, in a few steps)
Refactor extractor modules the same way as notebooks
Extract tests to follow the same structure.



By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
nikpodsh added a commit that referenced this issue May 2, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetProfilingRun

### Relates
- #295 and #412 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
nikpodsh added a commit that referenced this issue May 4, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the third part of dataset: `DatasetStorageLocation`
Introduced the Indexers: this code was migrated from the `indexers.py`
and put into modules.
Removed unused alarms (which didn't call actual alarm code)
Introduced `DatasetShareService` but it seems it will be migrated to
share module.
All `DatasetXXXServices` will be split onto Services (business logic)
and Repositories( DAO layer) in future parts

### Relates
- #412 and #295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
nikpodsh added a commit that referenced this issue May 4, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetTable:
Get rid of ElasticSearch connection for every request. Created a lazy
way to establish connection.

### Relates
#412 and #295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>
nikpodsh added a commit that referenced this issue May 10, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the Dataset entity and related to it code.
Refactoring for Votes
Introduced DataPolicy (the same way as ServicePolicy was used used)
Extracted dataset related permissions. 
Used new `has_tenant_permission` instead of `has_tenant_perm` that
allows not to pass unused parameters

### Relates
#412 and #295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
@dlpzx dlpzx removed this from In progress in v2.0.0 Code Modularization May 17, 2023
dlpzx added a commit to dlpzx/aws-dataall that referenced this issue May 25, 2023
This is a draft PR for showing purposes. There are still some minor
issues that needs to be addressed.

### Feature or Bugfix
- Refactoring

### Detail
There are following changes under this PR:
1. Modularization + Refactoring of notebooks
There are new modules that will play a major role in the future
refactoring:

   * Core = contains the code need for application to operate correctly
   * Common = common code for all modules
* Modules = the plugin/feature that can be inserted into the system (at
the moment only notebooks)

The other part that is related to modularization is the creation of
environment parameters.
Environment parameter will replace all hardcoded parameters of the
environment configuration.
There is a new file - config.json that allows you to configure an
application configuration.
All existing parameters will be migrated via db migration in AWS

2. Extracting permissions and request context (Optional for the
modularization)

Engine, user, and user groups had been passed as a parameter of context
in the request. This had forced to pass a lot of parameters to other
methods that weren't even needed. This information should be as a scope
of the request session.
There is a new way to retrieve the information using `RequestContext.`
There is also a new way to use permission checks that require less
parameters and make code cleaner. The old way was marked as deprecated

3. Restructure of the code (Optional for the modularization)

Since the modularization will touch all the places in the API code it
can be a good change to set a new structure of the code. There are small
re-organization in notebook module to address

   * Allocating the resources before the validating parameters
   * Not clear responsibility of the classes
   * Mixed layers

   There are new structure :
   - resolvers = validate and pass code to service layer
   - service layer = bisnesss logic
   - repositories = database logic (all queries should be placed here)
   - aws = contains a wrapper client upon boto3
   - cdk = all logic related to create stacks or code for ecs
- tasks = code that will be executed in AWS lambda (short-living tasks)

   All names can be changed.

### Relates
[data-dot-all#295](data-dot-all#295)

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com>
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this issue May 25, 2023
### Feature or Bugfix
- Refactoring (Modularization)

### Relates
- Related issues data-dot-all#295 and data-dot-all#412

### Short Summary
First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :)

### Long Summary
Datasets are huge. It's one of the central modules that's spread
everywhere across the application. Migrating the entire Dataset piece
would be very difficult task and, more importantly, even more difficult
to review. Therefore, I decided to break down this work into "small"
steps to make it more convenient to review.
Dataset's API consist of the following items:
* `Dataset`
* `DatasetTable`
* `DatasetTableColumn`
* `DatasetLocation`
* `DatasetProfiling`

In this PR, there is only creation of `Dataset` module and migration of
`DatasetTableColumn` (and some related to it pieces). Why? Because the
plan was to migrate it, to see what issues would come up along with it
and to address them here. The refactoring of `DatasetTableColumn` will
be in other PR.
The issues:
1) Glossaries
2) Feed
3) Long tasks for datasets
4) Redshift

Glossaries rely on GraphQL UNION of different type (including datasets).
Created an abstraction for glossary registration. There was an idea to
change frontend, but it'd require a lot of work to do this

Feed: same as glossaries. Solved the similar way. For feed, changing
frontend API is more feasible, but I wanted it to be consistent with
glossaries

Long tasks for datasets. They migrated into tasks folder and doesn't
require a dedicated loading for its code (at least for now). But there
are two concerns:
1) The deployment uses a direct module folder references to run them
(e.g. `dataall.modules.datasets....`, so basically when a module is
deactivated, then we shouldn't deploy this tasks as well). I left a TODO
for it to address in future (when we migrate all modules), but we should
bear in mind that it might lead to inconsistencies.
2) There is a reference to `redshift` from long-running tasks = should
be address in `redshift` module

Redshift: it has some references to `datasets`. So there will be either
dependencies among modules or small code duplication (if `redshift`
doesn't rely hard on `datasets`) = will be addressed in `redshift`
module

Other changes:
Fixed and improved some tests
Extracted glue handler code that related to `DatasetTableColumn`
Renamed import mode from tasks to handlers for async lambda.
A few hacks that will go away with next refactoring :)

Next steps:
[Part2 ](nikpodsh#1) in preview :)
Extract rest of datasets functionality (perhaps, in a few steps)
Refactor extractor modules the same way as notebooks
Extract tests to follow the same structure.

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this issue May 25, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetProfilingRun

### Relates
- data-dot-all#295 and data-dot-all#412 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this issue May 25, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the third part of dataset: `DatasetStorageLocation`
Introduced the Indexers: this code was migrated from the `indexers.py`
and put into modules.
Removed unused alarms (which didn't call actual alarm code)
Introduced `DatasetShareService` but it seems it will be migrated to
share module.
All `DatasetXXXServices` will be split onto Services (business logic)
and Repositories( DAO layer) in future parts

### Relates
- data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this issue May 25, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetTable:
Get rid of ElasticSearch connection for every request. Created a lazy
way to establish connection.

### Relates
data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this issue May 25, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the Dataset entity and related to it code.
Refactoring for Votes
Introduced DataPolicy (the same way as ServicePolicy was used used)
Extracted dataset related permissions.
Used new `has_tenant_permission` instead of `has_tenant_perm` that
allows not to pass unused parameters

### Relates
data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx added a commit that referenced this issue Jun 16, 2023
### Feature or Bugfix
- Feature

### Detail
- For ML Studio and Environments, the original procedure of synthesizing
the template and testing the resulting json has been replaced by the
`cdk assertions` library recommended in the
[documentation](https://docs.aws.amazon.com/cdk/v2/guide/testing.html#testing_getting_started)
of CDK.
- In the Environment cdk testing, the `extent` method of the
`EnvironmentStackExtension` subclasses has been mocked. Now it only
tests the `EnvironmentSetup` stack as if no extensions were registered.
- In the MLStudio cdk testing, this PR adds a test for the
`SageMakerDomainExtension` mocking the environment stack. It tests the
`SageMakerDomainExtension` as a standalone.

Open question: 
1) The rest of cdk stacks (notebooks, pipelines, datasets...) are tested
using the old method of printing the json template. Should I go ahead
and migrate to using `cdk assertions` library? If so, I thought of doing
it for notebooks only and sync on the testing for datasets and pipelines
with @dbalintx and @nikpodsh.

2) With the `cdk assertions` library we can test the resources
properties more in depth. I added some tests on the environment stack
but we could add more asserts in all stacks. I will add more tests on
the MLStudio stack and I made a note in the GitHub project to review
this once modularization is complete.

### Relates
- #295 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
nikpodsh added a commit that referenced this issue Jun 26, 2023
Modularization of the sharing.

### Detail
Migrated the sharing part (including tasks)
Removed unused methods for datasets

There are a few issues that needs to be addressed before merging this
PR. But it can be view, since we don't expect major changes here.

### Related
 #295 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com>
Co-authored-by: kukushking <kukushkin.anton@gmail.com>
Co-authored-by: Dariusz Osiennik <osiend@amazon.com>
Co-authored-by: Noah Paige <69586985+noah-paige@users.noreply.github.com>
Co-authored-by: Dennis Goldner <107395339+degoldner@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Abdulrahman Kaitoua <abdulrahman.kaitoua@polimi.it>
Co-authored-by: akaitoua-sa <126820454+akaitoua-sa@users.noreply.github.com>
Co-authored-by: Gezim Musliaj <102723839+gmuslia@users.noreply.github.com>
Co-authored-by: Rick Bernotas <97474536+rbernotas@users.noreply.github.com>
Co-authored-by: David Mutune Kimengu <57294718+kimengu-david@users.noreply.github.com>
Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>
@dlpzx
Copy link
Contributor Author

dlpzx commented Sep 13, 2023

Completed with V2.0 release 🚀

@dlpzx dlpzx closed this as completed Sep 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: high status: in-progress This issue has been picked and is being implemented type: modularization Code refactoring project
Projects
None yet
Development

No branches or pull requests

1 participant