Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split command #63

Merged
merged 46 commits into from
Jul 7, 2023
Merged

Split command #63

merged 46 commits into from
Jul 7, 2023

Conversation

dave-connors-3
Copy link
Collaborator

@dave-connors-3 dave-connors-3 commented Jun 13, 2023

Opening up a PR to collaborate on the split command!

Right now, this command:

  • accepts the standard args for select/exclude/selector to identify which resources should be moved
  • registers those resources as a subproject of the original project
  • calls the subproject.initialize() method.

The initialize() method:

  • iterates over the resources in the subproject
    • for resources represented by a .sql , .py, or .csv file, move that file into the subdirectory, then move the resource's yml entry, if necessary
    • for .yml defined resources, copy over only the resource yml entry.
  • writes the subproject's dbt_project.yml

These changes requires some new DbtYmlEditor methods, especially to handle the nuance of source files and make our methods more generic to other resource types that follow similar patterns. Additionally, there's a lot of variable renaming for accuracy's sake, so apologies in advance for lots of edits.

Things this command does not do, but will need to before shipping:

  • declare model access and groups for boundary nodes / leaf nodes of selected resources (question: do we need groups at all?)
  • edit children of boundary nodes / leaf nodes to use the multi-argument ref function
  • write a dependencies.yml file to the parent project
  • copy necessary macros from parent project to subproject
  • tests!
  • anything else we can think of!

I would really love some input here on syntax, logic, and design. This is not at all complete, but

@dave-connors-3
Copy link
Collaborator Author

dave-connors-3 commented Jun 15, 2023

@nicholasyager -- this now also moves relevant custom macros and blindly copies the packages.yml to the new subdirectory. There may be a small bug in one of the yml entry moving methods that I am trying to track down, but if use the split_proj in the test-projects directory and run a:

> dbt-meshify split my_project --select +orders
> cd my_project
> dbt deps && dbt compile

you should get into a runnable state 🤞 let me know if that is the case! then we can work to actually make this code good which it currently is not

# this one appears in the project yml, but i don't think it should be written
contents.pop("query-comment")
contents = filter_empty_dict_items(contents)
self.file_manager.write_file(Path("dbt_project.yml"), contents)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may not want the file manager in this class -- to date, it's only in the DbtMeshConstructor class that's operating on resource files.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this feels like an odd level of abstraction. My mind goes to some other class that operates on DbtProjects to write files. Open to other approaches, though!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i'm with you hre -- would it make sense to just have a DbtSubprojectWriter class or something to that effect?

from dbt_meshify.storage.file_manager import DbtFileManager

class DbtSubprojectWriter:
    def __init__(self, subproject: DbtSubproject, ):
        self.subproject = subproject
        self.file_manager = DbtFileManager(read_project_path=subproject.parent.path, write_project_path=subproject.path)

    def inititalize(self):
        ...

    def write_project(self):
        pass

    def write_package_yml(self):
        pass

    def write_package_directory(self):
        pass

pseudocode written with the knowledge that the filemanager stuff is wonky as hell

@nicholasyager
Copy link
Collaborator

@dave-connors-3 I plan on looking at the code tomorrow evening, but in the meantime I did have a chance to noodle with the command as it's currently implemented.

My specific invocation, for reference:

poetry run dbt-meshify split --select +orders revenue 

Here are some thoughts in no particular order:

  • Since we're copy packages.yml, we could also copy over the dbt_packages directory too. This will let the user needing to run dbt deps before their first build of the subproject.
  • We may want to have some logic in place that redistributes/partitions groups into the subproject. Currently, if a model is part of a group and that model is moved to a new project, it is retaining its old group and compilation fails.
  • As a user, it would be useful to have information logged during execution communicating what has been done. After running the split command for the first time, I was left thinking, "huh. I wonder if that worked." 😝

dbt_meshify/dbt_projects.py Outdated Show resolved Hide resolved
dbt_meshify/dbt_projects.py Outdated Show resolved Hide resolved
dbt_meshify/dbt_projects.py Outdated Show resolved Hide resolved
# this one appears in the project yml, but i don't think it should be written
contents.pop("query-comment")
contents = filter_empty_dict_items(contents)
self.file_manager.write_file(Path("dbt_project.yml"), contents)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this feels like an odd level of abstraction. My mind goes to some other class that operates on DbtProjects to write files. Open to other approaches, though!

Comment on lines 368 to 376
def move_resource(self):
"""
move a resource file from one project to another

"""
current_path = self.get_resource_path()
new_path = self.subdirectory / current_path
new_path.parent.mkdir(parents=True, exist_ok=True)
current_path.rename(new_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut reaction is that this is a little odd. I suspect that it would be more ergonomic/useful to pass the destination path in as an argument. Having that been said, I'm only ~45% confident in this opinion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't disagree! i think it's mostly dependent on how much flexibility we want to afford users on where resources land in the new project

@dave-connors-3
Copy link
Collaborator Author

@nicholasyager agree on all fronts! made the small adjustments you recommended, and will spend some time today refactoring the initialization into a separate class that operates on dbt projects rather than have dbt projects have knowledge of the filesystem.

re: logging -- couldn't agree more, I'll open up a new issue so we can tackle it separately for all the commands we have! would love a dbt-esque logging experience in here too

Copy link
Collaborator

@nicholasyager nicholasyager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some light comments. I'm going to review the file editors more in depth soon:tm: once I have some more bandwidth.

dbt_meshify/dbt_projects.py Show resolved Hide resolved
dbt_meshify/main.py Show resolved Hide resolved
dbt_meshify/storage/dbt_project_creator.py Outdated Show resolved Hide resolved
dbt_meshify/storage/dbt_project_creator.py Outdated Show resolved Hide resolved

def process_model_yml(model_yml: Dict[str, Any]):
"""Processes the yml contents to be written back to a file"""
model_ordered_dict = OrderedDict.fromkeys(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be possible to use the schema.json for manifests from dbt-core to track these keys automatically.

dbt_meshify/storage/file_content_editors.py Show resolved Hide resolved
@@ -152,7 +159,19 @@ def get_catalog_entry(self, unique_id: str) -> Optional[CatalogTable]:

def get_manifest_node(self, unique_id: str) -> Optional[ManifestNode]:
"""Returns the catalog entry for a model in the dbt project's catalog"""
return self.manifest.nodes.get(unique_id)
if unique_id.split(".")[0] in [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about exposure and metric/measure? Are those not listed on purpose?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic was to get nodes vs the resources that are also other top-level keys in the manifest. Looking with fresh eyes, I would guess there's a dbt-core class that represents this type that we could leverage instead of a list of strings!

@@ -2,7 +2,8 @@ version: 2

models:
- name: customers
description: Customer overview data mart, offering key details for each unique customer. One row per customer.
description: Customer overview data mart, offering key details for each unique
customer. One row per customer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the YAML files have some odd newlines showing up now.

x_proj_ref = "{{ ref('my_new_project', 'stg_orders') }}"
child_sql = (Path(dest_project_path) / "models" / "marts" / "orders.sql").read_text()
assert x_proj_ref in child_sql
teardown_test_project()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a teardown after an assert work as expected? I'd think that it would stop at the assert but it might just be my lack of pytest knowledge.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems to work locally!

return yaml.safe_load(yml_str)


class TestRemoveResourceYml:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have some tests when using model versions as well

Comment on lines +38 to +42
with open("test/profiles.yml", "w") as f:
f.write(yaml.dump(test_project_profile))
if write_packages_yml:
with open("test/packages.yml", "w") as f:
f.write(yaml.dump(test_package_yml))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use pathlib like in the other parts of the code rather than open.



def teardown_new_project():
os.system("rm -rf test-projects/test")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😉 let's use shutil.rmtree instead of os.system.

More generally, it is now recommended to use subprocess.run rather than os.system when running commands from Python

Copy link
Collaborator

@nicholasyager nicholasyager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more feedback here!

My tl;dr of the file manager is that it is sufficient, but there are a few opportunities for simplification. I also found a couple areas where defects can trickle in (raising exceptions instead of returning None, using specific exceptions, etc). One major benefit of having this functionality written is that it makes clear to me where we need cleaner APIs -- specifically around manipulating resources. I wonder if a pre-v1 step will be to create first-class classes for our resources and manage SerDe. Definitely a task for later!

Omissions:

  • I've not dug into the yml files generated yet.
  • I'm going to trust the updated tests.

Open questions:

  • How is mypy holding up? I suspect there are a few typing issues around None values and iterables.

# find yml path for resoruces that are not defined
yml_path = Path(self.node.patch_path.split("://")[1]) if self.node.patch_path else None
else:
yml_path = Path(self.node.original_file_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should leverage get_resource_path()


if resource_path is None:
# If this happens, then the model doesn't have a model file, either, which is cause for alarm.
raise Exception(f"Unable to locate the file defining {self.node.name}. Aborting")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love for this to raise a specific exception. Maybe a ModelMissingError, since we cannot recover from this type of issue.

Copy link
Collaborator Author

@dave-connors-3 dave-connors-3 Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would that require a new ModelMissingError class that extends Exception?

Copy link
Collaborator

@nicholasyager nicholasyager Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Likely a good candidate for a future refactor.

Comment on lines +369 to +370
if isinstance(models_yml, str):
raise Exception(f"Unexpected string values in dumped model data in {yml_path}.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little odd. Is there a more systematic way to ensure that read_file returns the appropriate type?

Comment on lines +410 to +411
if model_path is None:
raise Exception(f"Unable to find path to model {self.node.name}. Aborting.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there ever be a time when get_resource_path returns None, or do we want this to error out? My gut says erroring out is the preferred approach, that way do don't introduce surface area for defects.

dbt_meshify/storage/file_content_editors.py Outdated Show resolved Hide resolved
dbt_meshify/storage/file_content_editors.py Outdated Show resolved Hide resolved
dbt_meshify/storage/file_content_editors.py Outdated Show resolved Hide resolved
Comment on lines +90 to +100
def add_group_to_model_yml(model_name: str, group: Group, models_yml: Dict[str, Any]):
"""Add group and access configuration to a model's YAMl properties."""
# parse the yml file into a dictionary with model names as keys
models = resources_yml_to_dict(models_yml)
model_yml = models.get(model_name) or {"name": model_name, "columns": [], "config": {}}

model_yml.update({"group": group.name})
models[model_name] = process_model_yml(model_yml)

models_yml["models"] = list(models.values())
return models_yml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method, add_access_to_model_yml, and add_group_to_yml are very similar in function -- specifically they're adding properties to a specific model property entry. This is fine for now, since this is a first draft. Having that been said, this might be something to add to the IO refactoring work I've been pondering.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these were copy pasted in order to access each method separately, rather than only as part of add_access_and_group_to_yml -- I think the steps in all of these methods are overlapping and too verbose -- i want to shrink this file dramatically!

dave-connors-3 and others added 2 commits July 5, 2023 09:14
Co-authored-by: Nicholas A. Yager <yager@nicholasyager.com>
Co-authored-by: Benoit Perigaud <8754100+b-per@users.noreply.github.com>
@dave-connors-3 dave-connors-3 mentioned this pull request Jul 6, 2023
Copy link
Collaborator

@nicholasyager nicholasyager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a chonky feature! I did another 🔍 of this PR and found a few small areas of refactoring and a CLI UX oddity. Having this been said, there were no flaming red flags along the way, so it's a ✅ from me! Let's get this out there so we can get feedback.

dbt_meshify/dbt_projects.py Outdated Show resolved Hide resolved
Comment on lines +90 to +93
target_directory = Path(create_path) if create_path else None
subproject_creator = DbtSubprojectCreator(
subproject=subproject, target_directory=target_directory
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am of the opinion that DbtSubprojectCreator should not allow None target_directory values. Instead, the calling method should be responsible for passing a valid target. This approach will allow us to reduce the complexity of the underlying API/class.

This is not something that needs to be tackled here, but rather in a refactor ticket.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's good feedback, definitely agree -- IIRC the init handles this, but it's definitely not necessary to hide that behavior!

dbt_meshify/storage/file_content_editors.py Outdated Show resolved Hide resolved
@@ -24,9 +24,9 @@ class ResourceGrouper:
recommendations based on the reference characteristics for each resource.
"""

def __init__(self, project: DbtProject):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great use of types!

@exclude
@project_path
@select
@selector
def split():
def split(project_name, select, exclude, project_path, selector, create_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since select now allows multiple arguments, we cannot have --select before the project_name argument.

(dbt-meshify-py3.11) > $ poetry run dbt-meshify split --select "+orders" revenue                                                                                                                                         
Usage: dbt-meshify split [OPTIONS] PROJECT_NAME
Try 'dbt-meshify split --help' for help.

Error: Missing argument 'PROJECT_NAME'.

Instead, we need to order arguments/options specifically

(dbt-meshify-py3.11) > $ poetry run dbt-meshify split revenue --select "+orders"                                                                                                                                   

I don't this is a blocker per se. At the very least, documentation should be refined in a follow-up.

dave-connors-3 and others added 2 commits July 7, 2023 10:12
Co-authored-by: Nicholas A. Yager <yager@nicholasyager.com>
Co-authored-by: Nicholas A. Yager <yager@nicholasyager.com>
@dave-connors-3 dave-connors-3 merged commit 764ad1f into main Jul 7, 2023
1 check passed
@nicholasyager nicholasyager deleted the split-command branch July 25, 2023 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants