Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type conversions refactorization #16

Closed
wants to merge 64 commits into from

Conversation

wrwrwr
Copy link
Member

@wrwrwr wrwrwr commented Feb 7, 2012

This is something that started as an effort to understand how things work, add some comments to make the process faster for others to come, and maybe clean some things up. However, some additional goals have appeared on the way.

Flaws in type conversions

The first was to fill missing parts of type conversions:

  • converting / deconverting embedded instance fields using the right db_subtype (rather than as RawFields);
  • converting filter arguments for collection field lookups (as a collection item);
  • deconverting new keys in the same way as any other values from the database.

This should fix a few issues, the ones I know about are the Mongo part of #10 and django-nonrel/mongodb-engine#47.

Change to converting / deconverting of embedded instance fields can be disabled by using LegacyEmbeddedModelField (now in toolbox) in case of incompatibility with existing code (hopefully this will be rare, see the comment on NonrelDatabaseOperations.value_for_db_model for some insight). [Update: Not any longer, we've decided to drop this completely.]

Data types abstraction

The second goal was to build data type framework more suitable for a "general" NoSQL database rather than just the GAE datastore:

  • db_type names suitable for a database driver that can process Python types directly;
  • all primary keys and relations use a "key" db_type, unless the back-end decides otherwise;
  • introduced a nonrel_db_type method that makes it possible to override Django's db_type logic even for custom fields;
  • toolbox can handle the "Django side" of type conversions -- converts Django lazy objects and wrappers to standard types before passing them to the database;
  • a bit lesser reliance on sql.Query -- values passed to back-ends compilers are now always accompanied by a proper field (my feeling is that the "column" logic is not very natural for NoSQL databases);

Nonrel fields

The third goal was to make things easier for back-end developers, by taking the burden of handling toolbox fields away (there are many subtleties which can easily lead to mistakes). Now all the code is within the toolbox and all a back-end develop has to do is choose a suitable collection or embedded field db_type for his/her database and call a parent conversion (deconversion) method at the beginning (end) of the back-end methods. There is likely enough to choose from to satisfy most NoSQL databases (see comments in creation.py to learn the available options).

GAE: relations as db.Keys and binary serialization

The fourth was to prepare grounds for storing ForeignKeys and other relations using the special db.Key type of the GAE datastore. A database option is implemented to switch between the two storage methods, but only a "dump data; switch; reload data" procedure is possible. The default is to use the legacy storage for backwards compatibility, but it could be changed with some future major revision. See django-nonrel/djangoappengine#18 for more info.

GAE engine now also serializes DictFields and EmbeddedModelFields using the binary pickle protocol (instead of the ASCII protocol 0); this should work transparently as deconversion is the same for both protocols and no lookups / filtering is specified for these fields (as far as I can tell).

Additional tests and fixes

  • Fix for the EmbeddedModelField.update issue (EmbeddedModelField pre_save, get_db_prep_value rework #14) is already applied [update: already on the develop branch];
  • Added some tests for the natural usage of ListField with ForeignKeys that we've discussed on the mailing list (marked as an expected failure for now) [update: on a separate branch now];
  • Trivial fix for the test for issue Resubmission of pull request for serialization #12 (failing on develop) [update: committed to develop]. Nevertheless, I believe there is more to collection fields serialization and as it is written now it may not work with any kind of nesting [pending verification].
  • Also added some tests checking interaction between Django safe/to-be-escaped strings and lazy objects and the database layer.

Django changes to be considered

  • We could drop all the related_db_type infrastructure -- it's not sufficient for GAE relations-as-Keys and rather hard to follow
  • Replacing various value_to_db_* and convert_values methods on DatabaseOperations with something like the new value_for/from_db(value, field [, lookup]) would allow to avoid double recursion for collection fields, simplify back-end code (get_db_prep_* would call these methods rather than compilers), and make it possible to avoid various hacks in toolbox code (e.g. by moving what is now in Field.get_db_prep_lookup -- and is more or less reverted in NonrelQuery._normalize_lookup_value -- to an abstract SQLDatabaseOperations). See comments on the NonrelDatabaseOperations class if you'd like to know what I find unnecessarily hard to do with the current framework.

Reviewing, testing & merging

I believe the most convenient way to review these changes is by reading creation.py and base.py of toolbox and the engine that you use. If you'd prefer to go through some commits the best bet is to go through the Mongo ones -- only these have been squashed; otherwise you'll find a story of taking bad paths, learning new aspects of the system and redesigning things a couple of times. Any kind of feedback is welcome (including linguistic corrections, comments improvements, code style suggestions, pointing out errors, better design ideas, etc.).

These changes could use some testing. Particularly with preexisting data -- the intention was to keep everything compatible, and a considerable effort was put into understanding what each piece in current conversions does and not lose it, but considering the amount and nature of changes I've probably broken some things. So if you have a project with a testing environment and some existing data, trying these branches out would be really nice -- just backup, update and see if everything works. If it does and it's a GAE project, dump the data, add STORE_RELATIONS_AS_DB_KEYS: True to your database options, reload and check if it still does.

I've run toolbox / appengine / dbindexer / django.contrib (except flatpages and GeoDjango) test cases for GAE under Python 2.5 / 2.7 and the Mongo test suite under 2.6 / 2.7 some horrible amount of times. However, if you have the power to run the full Django test suite (django/tests/runtests.py) in reasonable time -- run it -- I've given up for now (or otherwise would have to postpone pushing this for another week or so ;-) Some tests are expected to fail, but no new failures should be introduced compared to develop branches.

There are branches also called feature/type-conversions-refactor in the GAE and Mongo engine repositories. Each of them depends on this one and this one depends on them -- they need to be merged together.

WARNING: Although this fixes some things it also changes how things are stored, in some cases in a subtle, dynamic way (e.g. DictFields on GAE). It really should not be used in production until going through some more extensive, real-data testing. Nevertheless, in case somebody feels that lucky (especially considering that this is not likely to be merged any time soon :-) dumping your data before and reloading it after the update gives a better chance to avoid some yet unforeseen problems and I'd also highly recommend looking into your database (e.g. at _ah/admin) and checking that things are stored as they should be shortly after that.

…ther than get_db_prep_value for DecimalField.
…rt_values_from_db) to allow symmetric handling of encoding and decoding.
…o not doing any type casting for AutoField if the latter is true.
…llow back-ends to override the main type (for example to easily use list storage for SetFields).
…gic). Use <lookup info (column, type, operator, negated)>, <value>, <value info (starting with field)> argument order everywhere.
…) 3-tuple. Don't pass db_type to filtering-related methods (it was only used for encoding).
…e-String/Unicode) to plain strings for type-inspecting back-ends.
…irement is only that str(value) is reversible given the field kind.
…ary_key is usable and a feature to mark back-ends as supporing using fields other than AutoField as keys or not.
…er for subclasses and these fields should behave much like standard fields.
…column used by the constraint (was: parent primary key, is: child foreign key).
…lso allow the field to be used to prevent convert_value_to/from_db to be run on embedded instance fields' values like in the 0.4 version of the engine.
Conflicts:
	djangotoolbox/fields.py
	djangotoolbox/tests.py
… conversions / deconversions of embedded instance fields' values.
…edModelField implementation and no-deconversion handling to the LegacyEmbeddedModelField.
…umn keys (fields are already available for filtering and updates).
…umn keys (fields are already available for filtering and updates) and pass the key returned by back-ends through database deconversions.
…the-box (to be chosen by a back-end developer).
… lists nor strings nor bytes lets not call it a database ;-)
@emperorcezar
Copy link
Member

I will run your changes agains. Django-lms and see if it breaks anything (mongo).

Thanks. This is really great. Toolbox really needed a refactor.

@jonashaag
Copy link
Contributor

Holy shit O_O

This is insane. Thanks so much!

I don't think we can review this easily, though. It would really help if you could squash your changes and then split them into logical bits. Also, please keep all sorts of style fixes (line breaks, whitespace, typos, etc) in one separate commit as to reduce noise.

From a quick glance:

  • we can drop LegacyEmbeddedModelField entirely. The old data format has been deprecated for two releases, so we're free to drop support :-)
  • We shouldn't pickle anything in djangotoolbox. Databases may have their own representation of types (like Binary for bytes on MongoDB), so pickling should only happen (if at all) in the backend.

Also, you should probably try to get as many of these changes as possible into Django vanilla. Most of the code is database agnostic and it's better for Django-nonrel to have very little code that differs from Django. (You could do that as a GSoC for example.)

Edit: We also need to know of any backwards incompatibilities :-)

@wrwrwr
Copy link
Member Author

wrwrwr commented Feb 7, 2012

On Tue, 2012-02-07 at 07:07 -0800, Adam Cezar Jenkins wrote:

I will run your changes agains. Django-lms and see if it breaks anything (mongo).

Thanks. This is really great. Toolbox really needed a refactor.

Thanks, let me know if something does break. Toolbox compilers
themselves could be also cleaned up a bit (to rely even less on the
Django's SQL machinery or build some more general abstraction), but that
can probably be done in a more incremental manner (unlike this part
where one thing was asking for another :-)


Reply to this email directly or view it on GitHub:
#16 (comment)

@wrwrwr
Copy link
Member Author

wrwrwr commented Feb 7, 2012

On Tue, 2012-02-07 at 07:59 -0800, Jonas Haag wrote:

Holy shit O_O

This is insane. Thanks so much!

Thank you for the good word.

I don't think we can review this easily, though. It would really help if you could squash your changes and then split them into logical bits. Also, please keep all sorts of style fixes (line breaks, whitespace, typos, etc) in one separate commit as to reduce noise.

Yes, all true. I like to keep such formatting stuff separate too (even
though it's against what Django coding style tells me :-) I'll try to
squash some time to at least do a partial squashing, but for the time
being maybe try a file based approach -- in short: DatabaseCreation had
most of the type logic changed, DatabaseOperations: all of the
conversions code collected and mostly rewritten, DatabaseFeatures: two
features removed one added, compilers: changes mostly related to
providing fields for values and adaptation.

From a quick glance:

  • we can drop LegacyEmbeddedModelField entirely. The old data format has been deprecated for two releases, so we're free to drop support :-)

It now also allows to keep using the old logic for embedded instance
fields conversions / deconversions -- I guess this can be removed too
(as far as I can tell, the change can cause incompatibility only in some
rather unusual circumstances), but maybe lets wait and see if anybody
needed to use it or not.

  • We shouldn't pickle anything in djangotoolbox. Databases may have their own representation of types (like Binary for bytes on MongoDB), so pickling should only happen (if at all) in the backend.

The pickling only happens if a back-end developer chooses "string" or
"bytes" storage for DictFields or EmbeddedModelFields (that's not the
default; the first one is more or less what GAE used to use, the second
just uses another pickle protocol). Generally the serialization will be
needed to support nonrel fields with databases that can't store a dict
or a nested list directly (say a relational DB with a BLOB type?); I
don't think it does any harm for toolbox to offer this (it's meant as a
preprocessing before casting to a database specific type).

Also, you should probably try to get as many of these changes as possible into Django vanilla. Most of the code is database agnostic and it's better for Django-nonrel to have very little code that differs from Django. (You could do that as a GSoC for example.)

I have to attend to matters of a more commercial nature for the next
couple of days (a bit less fun than working on nonrel unfortunately :-)
so short-term I'm only planing to do some feature updating / merging /
maintenance. The simplified value conversion framework may be a future
candidate for vanilla, but I'll let others take a look first, maybe we
can come up with something even better?


Reply to this email directly or view it on GitHub:
#16 (comment)

@wrwrwr
Copy link
Member Author

wrwrwr commented Feb 8, 2012

Closing for now, going to make a new request after reducing and bringing some order into the the commits list.

Anybody would mind if I just PEP-ified the whole toolbox? I probably should just restrain from making all those style changes, but I can't make myself to :-) Some of PEP's recommendations may feel awkward, but having something fixed should make further changes more readable.

@wrwrwr wrwrwr closed this Feb 8, 2012
@emperorcezar
Copy link
Member

I think PEP-ifying it is a great idea. Toolbox and nonrel code can be a bit confussing, anything that makes it more readable is better. It couldn't really hurt anything.

@aburgel
Copy link
Member

aburgel commented Feb 13, 2012

i don't mind if you PEP the whole thing, but maybe do it as a separate commit so its easier confirm that nothing breaks.

@wrwrwr
Copy link
Member Author

wrwrwr commented Feb 13, 2012

Yes exactly, for toolbox it's already done -- pepification and all insignificant things in one commit (not 100% strictly to the standard as sometimes it actually decreases readability, but close enough).

Currently working on GAE, but I'll probably just drop any insignificant changes that have wandered into the Mongo branch -- I find it's style specific, but fairly consistent (and there are some more significant things to work on :-)

@emperorcezar
Copy link
Member

I wanted to say a big thank you for this! It's made understanding and following the code paths much easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants