Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DS Dict #34

Closed
yevgenybulochnik opened this issue Nov 16, 2021 · 8 comments
Closed

DS Dict #34

yevgenybulochnik opened this issue Nov 16, 2021 · 8 comments
Assignees

Comments

@yevgenybulochnik
Copy link
Member

Problem Statement

The DS (data source) dictionary is starting to get large. We had discussed moving this into the data source folders themselves. Not sure we want to do this at the moment but wanted to start a discussion.

Questions:

  • What data structure should we move to? (json or yaml)
  • Do we want to keep all the sources that use the dynamic dag together, 1 file vs multiple files each in their own ds folder?

Criteria for Success

Move the large dictionary out of the dynamic dag. We need to pay close attention to what the dependencies are for each DS. Kent i think you already thought about this to some degree with name spacing the user defined macros

Additional Information

  • Yaml
  • caveat with yaml is we ill have to install an additional dependency
@Bridg109
Copy link
Collaborator

Right now all dependencies are imported through sagerx.py and user_defined_macros...which should allow us to move things outside of dynamic dag pretty easily.

Another consideration here using JSON files to store this data. JSON would allow us to define functions in it to pass to dataset_dicts if needed as one offs. There are pros and cons (I think mostly cons) but just an alternative

https://stackoverflow.com/questions/51936785/store-python-function-in-json/51938459

@Bridg109
Copy link
Collaborator

Could also consider moving dynamic DAG into its own folder with settings files for each DS also stored in same folder. This way moving them out of dynamic DAG will be easy

@yevgenybulochnik
Copy link
Member Author

Sorry totally missed these comments earlier. Im still somewhat torn about the best way to handle this. The JSON alt makes sense but I do think the cons outway the pros if we are using eval and functions as strings. One of the comments kinda points out that this is dangerous. Im not sure how risky it actually is for our use case from a security perspective, but everywhere else I have read you want to be extremely cautious about this.

@Bridg109
Copy link
Collaborator

Bridg109 commented Jan 23, 2022

Alternative consider making independent DAGs for each data source. Then using either functional or object paradigm make a generate tasks code block that can be placed as a single line of code to generate all tasks for each DAG. This will module and abstract the DAG functions, organize everything.

Looking at the datasets we currently have there is benefit on a few (CMS, RxNorm) to adjust our process and grab historical files. So I feel like in the short future we will be pulling alot of these out of the dynamic DAG anyways, so my thought is how do we do that and still keep the code maintainable and able to share a common set of processes.

This is pretty complex and might be above my software engineering skills at the moment.

@Bridg109
Copy link
Collaborator

As example the Download_dataset should be a class in its own module. With subclasses and methods that can be adjusted from a dataset info class.

Then a SQL class should be formed in its own module

Then a DS class should call both of these as part of composition

@jrlegrand jrlegrand modified the milestone: Solidify Core Functionality Mar 17, 2022
@jrlegrand
Copy link
Member

@Bridg109 / @yevgenybulochnik - does this issue represent the huge topic of re-structuring the dynamic DAG into like functional programming or a class system? What would be a better name for this issue than "DS Dict" to capture that? Feel free to rename it.

@jrlegrand
Copy link
Member

See notes about RxNorm and NADAC on this PR when we tackle this issue: #158

@jrlegrand
Copy link
Member

Fixed by #215

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

3 participants