Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modularization of Datasets (backend) #412

Closed
nikpodsh opened this issue Apr 13, 2023 · 0 comments
Closed

Modularization of Datasets (backend) #412

nikpodsh opened this issue Apr 13, 2023 · 0 comments
Assignees
Labels
type: modularization Code refactoring project

Comments

@nikpodsh
Copy link
Contributor

A work-item for ongoing modularization work for backend #295

@nikpodsh nikpodsh changed the title Modularization of Datasets Modularization of Datasets (backend) Apr 13, 2023
@nikpodsh nikpodsh added the type: modularization Code refactoring project label Apr 13, 2023
@nikpodsh nikpodsh self-assigned this Apr 13, 2023
nikpodsh added a commit that referenced this issue Apr 24, 2023
### Feature or Bugfix
- Refactoring (Modularization)

### Relates
- Related issues #295 and #412 

### Short Summary
First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) 

### Long Summary 
Datasets are huge. It's one of the central modules that's spread
everywhere across the application. Migrating the entire Dataset piece
would be very difficult task and, more importantly, even more difficult
to review. Therefore, I decided to break down this work into "small"
steps to make it more convenient to review.
Dataset's API consist of the following items:
* `Dataset`
* `DatasetTable`
* `DatasetTableColumn`
* `DatasetLocation`
* `DatasetProfiling`

In this PR, there is only creation of `Dataset` module and migration of
`DatasetTableColumn` (and some related to it pieces). Why? Because the
plan was to migrate it, to see what issues would come up along with it
and to address them here. The refactoring of `DatasetTableColumn` will
be in other PR.
The issues: 
1) Glossaries
2) Feed
3) Long tasks for datasets
4) Redshift

Glossaries rely on GraphQL UNION of different type (including datasets).
Created an abstraction for glossary registration. There was an idea to
change frontend, but it'd require a lot of work to do this

Feed: same as glossaries. Solved the similar way. For feed, changing
frontend API is more feasible, but I wanted it to be consistent with
glossaries

Long tasks for datasets. They migrated into tasks folder and doesn't
require a dedicated loading for its code (at least for now). But there
are two concerns:
1) The deployment uses a direct module folder references to run them
(e.g. `dataall.modules.datasets....`, so basically when a module is
deactivated, then we shouldn't deploy this tasks as well). I left a TODO
for it to address in future (when we migrate all modules), but we should
bear in mind that it might lead to inconsistencies.
2) There is a reference to `redshift` from long-running tasks = should
be address in `redshift` module

Redshift: it has some references to `datasets`. So there will be either
dependencies among modules or small code duplication (if `redshift`
doesn't rely hard on `datasets`) = will be addressed in `redshift`
module

Other changes:
Fixed and improved some tests 
Extracted glue handler code that related to `DatasetTableColumn`
Renamed import mode from tasks to handlers for async lambda.
A few hacks that will go away with next refactoring :) 

Next steps:
[Part2 ](nikpodsh#1) in preview :) 
Extract rest of datasets functionality (perhaps, in a few steps)
Refactor extractor modules the same way as notebooks
Extract tests to follow the same structure.



By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
nikpodsh added a commit that referenced this issue May 2, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetProfilingRun

### Relates
- #295 and #412 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
nikpodsh added a commit that referenced this issue May 4, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the third part of dataset: `DatasetStorageLocation`
Introduced the Indexers: this code was migrated from the `indexers.py`
and put into modules.
Removed unused alarms (which didn't call actual alarm code)
Introduced `DatasetShareService` but it seems it will be migrated to
share module.
All `DatasetXXXServices` will be split onto Services (business logic)
and Repositories( DAO layer) in future parts

### Relates
- #412 and #295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
nikpodsh added a commit that referenced this issue May 4, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetTable:
Get rid of ElasticSearch connection for every request. Created a lazy
way to establish connection.

### Relates
#412 and #295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>
nikpodsh added a commit that referenced this issue May 10, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the Dataset entity and related to it code.
Refactoring for Votes
Introduced DataPolicy (the same way as ServicePolicy was used used)
Extracted dataset related permissions. 
Used new `has_tenant_permission` instead of `has_tenant_perm` that
allows not to pass unused parameters

### Relates
#412 and #295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this issue May 25, 2023
### Feature or Bugfix
- Refactoring (Modularization)

### Relates
- Related issues data-dot-all#295 and data-dot-all#412

### Short Summary
First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :)

### Long Summary
Datasets are huge. It's one of the central modules that's spread
everywhere across the application. Migrating the entire Dataset piece
would be very difficult task and, more importantly, even more difficult
to review. Therefore, I decided to break down this work into "small"
steps to make it more convenient to review.
Dataset's API consist of the following items:
* `Dataset`
* `DatasetTable`
* `DatasetTableColumn`
* `DatasetLocation`
* `DatasetProfiling`

In this PR, there is only creation of `Dataset` module and migration of
`DatasetTableColumn` (and some related to it pieces). Why? Because the
plan was to migrate it, to see what issues would come up along with it
and to address them here. The refactoring of `DatasetTableColumn` will
be in other PR.
The issues:
1) Glossaries
2) Feed
3) Long tasks for datasets
4) Redshift

Glossaries rely on GraphQL UNION of different type (including datasets).
Created an abstraction for glossary registration. There was an idea to
change frontend, but it'd require a lot of work to do this

Feed: same as glossaries. Solved the similar way. For feed, changing
frontend API is more feasible, but I wanted it to be consistent with
glossaries

Long tasks for datasets. They migrated into tasks folder and doesn't
require a dedicated loading for its code (at least for now). But there
are two concerns:
1) The deployment uses a direct module folder references to run them
(e.g. `dataall.modules.datasets....`, so basically when a module is
deactivated, then we shouldn't deploy this tasks as well). I left a TODO
for it to address in future (when we migrate all modules), but we should
bear in mind that it might lead to inconsistencies.
2) There is a reference to `redshift` from long-running tasks = should
be address in `redshift` module

Redshift: it has some references to `datasets`. So there will be either
dependencies among modules or small code duplication (if `redshift`
doesn't rely hard on `datasets`) = will be addressed in `redshift`
module

Other changes:
Fixed and improved some tests
Extracted glue handler code that related to `DatasetTableColumn`
Renamed import mode from tasks to handlers for async lambda.
A few hacks that will go away with next refactoring :)

Next steps:
[Part2 ](nikpodsh#1) in preview :)
Extract rest of datasets functionality (perhaps, in a few steps)
Refactor extractor modules the same way as notebooks
Extract tests to follow the same structure.

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this issue May 25, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetProfilingRun

### Relates
- data-dot-all#295 and data-dot-all#412 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this issue May 25, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the third part of dataset: `DatasetStorageLocation`
Introduced the Indexers: this code was migrated from the `indexers.py`
and put into modules.
Removed unused alarms (which didn't call actual alarm code)
Introduced `DatasetShareService` but it seems it will be migrated to
share module.
All `DatasetXXXServices` will be split onto Services (business logic)
and Repositories( DAO layer) in future parts

### Relates
- data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this issue May 25, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetTable:
Get rid of ElasticSearch connection for every request. Created a lazy
way to establish connection.

### Relates
data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dbalintx <132444646+dbalintx@users.noreply.github.com>
dlpzx pushed a commit to dlpzx/aws-dataall that referenced this issue May 25, 2023
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the Dataset entity and related to it code.
Refactoring for Votes
Introduced DataPolicy (the same way as ServicePolicy was used used)
Extracted dataset related permissions.
Used new `has_tenant_permission` instead of `has_tenant_perm` that
allows not to pass unused parameters

### Relates
data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
@nikpodsh nikpodsh closed this as completed Jun 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: modularization Code refactoring project
Projects
None yet
Development

No branches or pull requests

1 participant