Create a ResourceField that allows exporting and importing related subsets of data #1375

Open
pokken-magic opened this issue Jan 1, 2022 · 13 comments

Comments

@pokken-magic

(This is fairly involved -- trying to get all the random ideas I had on this down before I forget them)

A Field created with a reference to a Resource could allow exporting an entire related model. That would potentially enable exporting subsets of data to arbitrary depth, since a ResourceField could itself contain other ResourceFields.

So if you had a Book with Publishers and Authors, you could export that Book and all of the Publishers and Authors necessary for it. ResourceFields would need to be imported first, which likely would require significant UI changes -- such as an additional workflow step for each level of depth (if Publishers have Representatives, you'd have to load Representatives before Publishers, then Publishers and Authors before Books).

The power here is that you could create a more consistent dataset with complex realistic models, without having to take multiple steps to synchronize all the dependencies -- you just export a Book, and everything needed for that Book comes with it.

@pokken-magic
Author

The simplest approach might be to require YAML or JSON for nested datasets; then you could fairly easily sequence the Resources in the order they need to be loaded.

In Excel you could put each resource on a tab and import in tab order. With CSV you could hypothetically use multiple segments, but it might not be worth the bother.

It could be feasible to zip and index subsets (e.g. 1_Author.csv, 2_Publisher.csv, 3_Book.csv), but that is fairly ugly.
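
A minimal sketch of how that load order and the indexed file names could be derived, assuming a hand-written dependency map (the `DEPENDENCIES` dict and both helper functions are hypothetical, not part of django-import-export). It uses the standard library's `graphlib`, so it needs Python 3.9 or later:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each resource lists the resources
# that must be loaded before it.
DEPENDENCIES = {
    "Book": {"Publisher", "Author"},
    "Publisher": {"Representative"},
    "Author": set(),
    "Representative": set(),
}


def load_order(dependencies):
    """Return resource names ordered so that dependencies come first."""
    return list(TopologicalSorter(dependencies).static_order())


def indexed_filenames(order, extension="csv"):
    """Prefix each resource name with its position in the load order."""
    return [f"{i}_{name}.{extension}" for i, name in enumerate(order, start=1)]


order = load_order(DEPENDENCIES)
files = indexed_filenames(order)
```

The relative order of independent resources (here Author and Representative) is not fixed, but Book always comes last because everything else is one of its dependencies.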

@pokken-magic
Author

I am not sure when I’ll get to it but I think I have a pretty good idea of how to approach this design.

Tablib has a concept of a databook, which seems like a good fit here.

So the thought is that a ResourceField (or maybe a widget) requires a foreign key widget and specifies the resource class at instantiation. When exporting a resource that has a ResourceField, export instead creates a databook and adds each ResourceField's resource as a sheet in the databook, ordered ahead of the primary resource's sheet. It should be pretty trivial to filter each ResourceField's resource objects down to the ones referred to by the main dataset, thereby bringing only the needed data.
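
A rough sketch of the filtering and sheet ordering, using plain lists of dicts as stand-ins for querysets and an ordered list of `(title, rows)` pairs as a stand-in for a databook. All names and sample rows are made up for illustration; in Tablib each pair would presumably become a titled Dataset added to a Databook.

```python
# Hypothetical sample data standing in for model tables.
AUTHORS = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cy"},
]
PUBLISHERS = [{"id": 7, "name": "Acme"}, {"id": 8, "name": "Zenith"}]
BOOKS = [
    {"id": 10, "title": "First", "publisher": 7, "authors": [1, 2]},
    {"id": 11, "title": "Second", "publisher": 8, "authors": [3]},
]


def export_with_dependencies(books):
    """Return sheets in load order, keeping only rows the books refer to."""
    author_ids = {author for book in books for author in book["authors"]}
    publisher_ids = {book["publisher"] for book in books}
    return [
        # Dependency sheets come first so they can be imported first.
        ("Author", [a for a in AUTHORS if a["id"] in author_ids]),
        ("Publisher", [p for p in PUBLISHERS if p["id"] in publisher_ids]),
        # The primary resource goes last.
        ("Book", list(books)),
    ]


sheets = export_with_dependencies(BOOKS[:1])
```

Exporting only the first book brings along authors 1 and 2 and publisher 7, and nothing else.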

I need to play around with data books a little and see how the load and export functions work. My hope is we can update our importer to always use data books instead of datasets and thereby be able to import complete sequenced datasets (where resources can load in an order by dependency).

The ability to do complete, internally consistent exports would make import-export an invaluable QA tool for Django, where it is fairly difficult to work with subsets of data (dumpdata and loaddata are not very sophisticated for this purpose).

@pokken-magic
Author

I had to build this for work and it works beautifully, with the constraint that I could not figure out how to get it to spit out anything other than a JSON blob. I will see if I can get permission to post it, but the gist is:

I used a RelatedField to pass the parent object down to the widget, then a ResourceWidget that you give the field's resource_class on init, then:

  1. a render method that builds a queryset of whatever is in the field's RelatedManager, passes that to the resource_class's export method, then renders the result as JSON
  2. a clean method that creates a Tablib dataset from the JSON, then passes that to the resource_class's import_data method. Then you build a queryset from the object_id of each RowResult and return that
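
The two steps above can be sketched with plain Python stand-ins. This is not the code from work and not the library's API: `InMemoryResource` and `NestedResourceWidget` are hypothetical classes that only mimic the shape of the round trip (export related rows to a JSON string on render, import them and return the new ids on clean).

```python
import json


class InMemoryResource:
    """Stand-in for a resource; rows live in a dict keyed by primary key."""

    def __init__(self, store):
        self.store = store

    def export_rows(self, ids):
        return [self.store[pk] for pk in ids]

    def import_rows(self, rows):
        imported_ids = []
        for row in rows:
            self.store[row["id"]] = row  # create or update
            imported_ids.append(row["id"])
        return imported_ids


class NestedResourceWidget:
    """Stand-in widget that nests a whole resource inside one cell."""

    def __init__(self, resource):
        self.resource = resource

    def render(self, related_ids):
        # Export the related rows and serialize them as a JSON string.
        return json.dumps(self.resource.export_rows(related_ids))

    def clean(self, value):
        # Parse the JSON cell, import the rows, return the resulting ids.
        return self.resource.import_rows(json.loads(value))


source = InMemoryResource({1: {"id": 1, "name": "Ann"}, 2: {"id": 2, "name": "Bob"}})
target = InMemoryResource({})

cell = NestedResourceWidget(source).render([1, 2])
created = NestedResourceWidget(target).clean(cell)
```

In the real widget, the ids returned from clean would be turned into a queryset so the parent field can assign them.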

Possible it could be simplified to not need the custom field, but there's some magic in my RelatedField that makes it behave better.

Anyway the bottom line is that this approach of allowing exporting nested resources seems to work pretty well with JSON, and could potentially be added as a capability.

@daniel-butler

I think this is really interesting! If you are able to include code snippets and a general idea of the export/import result, that would be really helpful! I have a similar problem, and honestly most data has complex relationships.

@pokken-magic
Author

pokken-magic commented Apr 16, 2022

Sure!

So before I dive in, I will say that I use a custom Field that issues a save, which enables creating things through Django's RelatedManager/related field capability. I have not tried using the ResourceWidget outside of this Field, so the Field may be required in order to issue the save.

Because of how m2m models save, you at minimum need to inherit your ResourceWidget from the m2m widget (otherwise it will never save the newly created things).

What I did was make a ResourceWidget that inherits from the m2m widget and accepts a resource as one of its arguments. Its clean and render methods then use the resource.

Since a Resource's export method can accept a queryset, you just pass it the queryset from the Field you are exporting and serialize the result in a chosen format (my ResourceWidget has another argument that lets you specify the serialization format). The render method of your widget calls resource.export; you can get at the queryset with 'attr.all()'.

On the clean side, you use the format to create a Tablib dataset, then pass that to import_data and return the queryset of the new things you made (which can be built from the import results).

Waiting on permission from work to do a PR for this, but I think it's pretty intuitive once you start working on it. Long term it would probably be better to update the whole system to use tablib databooks, but that is a lot more work.

@pokken-magic
Author

The downside to this is that I could not figure out a way to get the parent serialization format and inherit it, so you have to be explicit. You could change the parameters of various functions to pass the parent format down to the Field's render/clean functions, but that is a lot of trouble.
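
For illustration, a small sketch of what passing the parent format down could look like. Everything here is hypothetical (the `parent_format` parameter, the `SERIALIZERS` registry, and the widget class are not library API): the widget uses its own explicit format when it has one and otherwise falls back to whatever the caller passes in.

```python
import csv
import io
import json


def to_json(rows):
    return json.dumps(rows)


def to_csv(rows):
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical registry mapping a format name to a serializer.
SERIALIZERS = {"json": to_json, "csv": to_csv}


class FormatAwareWidget:
    def __init__(self, fmt=None):
        self.fmt = fmt  # explicit format; None means "inherit from parent"

    def render(self, rows, parent_format=None):
        fmt = self.fmt or parent_format
        if fmt is None:
            raise ValueError("no serialization format given or inherited")
        return SERIALIZERS[fmt](rows)


rows = [{"id": 1, "name": "Ann"}]
inherited = FormatAwareWidget().render(rows, parent_format="json")
explicit = FormatAwareWidget("csv").render(rows, parent_format="json")
```

The cost is the one mentioned above: every function between the export entry point and the widget would need to carry the extra argument.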

The other downside is that I seriously doubt this would work with complex datasets using anything but YAML/JSON.

Annnnd, if you use YAML you get out-of-order dictionaries, because Tablib's YAML dumper doesn't seem to let you change the setting that causes it.

So bottom line, you're using JSON if you want to do this and not have it be weird :)
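
As a quick check of why JSON avoids the ordering problem, the standard library's `json` module keeps keys in insertion order on both dump and load (as long as `sort_keys` is left off). The sample row is made up:

```python
import json

# Keys deliberately not in alphabetical order.
row = {"title": "First", "id": 10, "authors": [1, 2]}

encoded = json.dumps(row)
decoded = json.loads(encoded)
key_order = list(decoded)
```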

@pokken-magic
Author

Does anyone understand the dynamic resource creation code in the admin site well enough to explain to me how to use it?

I was thinking it would be useful to dynamically generate a resource for ResourceWidget when one is not supplied. That is when I found that code, but the metaclass stuff went over my head at first.

If you know what model something is, how would you concisely make a resource from it on the fly?
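
The general technique is building a class at runtime with `type()`, including its inner `Meta`. Below is a sketch with stand-in classes (`BaseResource`, `Author`, and `resource_factory` are hypothetical, and the stand-in reads `Meta` directly rather than through a metaclass). As far as I can tell, the library's own helper for this is `modelresource_factory` in `import_export.resources`, which takes a model and returns a resource class.

```python
class BaseResource:
    """Stand-in for a resource base class; not the library's ModelResource."""

    class Meta:
        model = None

    def model_name(self):
        return self.Meta.model.__name__


def resource_factory(model, base=BaseResource):
    """Build a '<Model>Resource' class whose Meta points at the model."""
    meta = type("Meta", (object,), {"model": model})
    class_name = model.__name__ + "Resource"
    return type(class_name, (base,), {"Meta": meta})


class Author:
    """Stand-in for a Django model."""


AuthorResource = resource_factory(Author)
resource = AuthorResource()
```

A widget could call a factory like this as a fallback when no resource class is supplied at init.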

@pokken-magic
Author

🔥 I got approval from work to release this code, so I will be putting up a draft PR, since I need some comments.

@pokken-magic
Author

pokken-magic commented Apr 25, 2022

Figured I'd get it out there so people can yoink it for their own private repos if they want. It's very, very useful even in its somewhat incomplete form (we're using the heck out of it for making testing datasets).

Would love people's feedback on things I missed or other ideas for smoothing off the rough edges. Would very much like some test suite ideas.

Major things I catalogued as I was working:

  • behavior of reverse relationship foreignkeys/inlines
  • formats - is it too opinionated to say nested relationships should just be JSON? (I have had all kinds of problems with YAML, and I don't even want to start thinking about CSV/TSV)
  • behavior of the Field widget. There was a really nice fix for post_save someone had written that might be a good way to approach figuring out how to assign values to fields (.set or = etc.). The code I have there is pretty hacky.
  • What's the best way to validate this? I can write a bunch of widget tests and make sure things serialize OK, that might be the easiest approach.
  • How can I automate creating resources for things that don't have them? And would it be useful to have a ResourceOptions option that allows autocreating resources for all fk/m2m fields?

@pokken-magic
Author

If anyone wants to pull that branch down and try it, the BookAdmin has a nested resource you can try that should export a book's categories and authors.

@pokken-magic
Author

I realize this thread is getting kinda spammy, but I didn't want to lose sight of a design thought I've been having. ResourceWidget works pretty well for the specific use case I am passionate about, which is test data and data lifecycle (moving internally consistent subsets of data from place to place for one reason or another).

But what I have been thinking about lately is that one of the distinguishing features of this library is its broad format support, especially non-programmer-friendly formats like Excel and CSV. Tablib's support for so many formats is a real differentiator from things like dumpdata/loaddata and DRF serializers.

So what I'm wondering is if, long term, we might be better off thinking about Tablib Databooks as a way to manage this without forcing everyone down the JSON pathway. I could see some ways to sequence nested data imports that visualize better than deeply nested JSON blobs (a series of CSV tables with line breaks between them, or similar).
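
To make the "series of CSV tables" idea concrete, here is a sketch of one possible layout. The layout itself is an assumption, not anything Tablib defines: each segment starts with a `# Name` line, followed by a header row and data rows, and segments are separated by a blank line. It uses `str.removeprefix`, so it needs Python 3.9 or later.

```python
import csv
import io


def dump_segments(sheets):
    """sheets: list of (name, headers, rows) tuples, already in load order."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for position, (name, headers, rows) in enumerate(sheets):
        if position:
            out.write("\n")  # blank line between segments
        writer.writerow([f"# {name}"])
        writer.writerow(headers)
        writer.writerows(rows)
    return out.getvalue()


def load_segments(text):
    """Split on blank lines and parse each segment back into a tuple."""
    sheets = []
    for block in text.strip("\n").split("\n\n"):
        lines = list(csv.reader(io.StringIO(block)))
        name = lines[0][0].removeprefix("# ")
        sheets.append((name, lines[1], lines[2:]))
    return sheets


sheets = [
    ("Author", ["id", "name"], [["1", "Ann"]]),
    ("Book", ["id", "title", "author"], [["10", "First", "1"]]),
]
text = dump_segments(sheets)
round_tripped = load_segments(text)
```

This naive split breaks if a quoted field contains a blank line, and every value comes back as a string, so it is only meant to show the shape of the format.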

It might be OK to start with the initial JSON-only ResourceWidget and then dig more into Databooks?

@pokken-magic
Author

I’m going to be away for a bit on family business but figured I’d update to say that all my company’s code including resource widget is apparently working on 3.0.0b3 :) not sure how we got upgraded but it’s working.

Looking forward to trying multiple resource classes and detection of natural foreign keys

@matthewhegarty
Contributor

Closing - discussion continues in #445

3 participants