Type conversions refactorization #16

wrwrwr · 2012-02-07T14:28:05Z

This is something that started as an effort to understand how things work, add some comments to make the process faster for others to come, and maybe clean some things up. However, some additional goals have appeared on the way.

Flaws in type conversions

The first was to fill missing parts of type conversions:

converting / deconverting embedded instance fields using the right db_subtype (rather than as RawFields);
converting filter arguments for collection field lookups (as a collection item);
deconverting new keys in the same way as any other values from the database.

This should fix a few issues, the ones I know about are the Mongo part of #10 and django-nonrel/mongodb-engine#47.

Change to converting / deconverting of embedded instance fields can be disabled by using LegacyEmbeddedModelField (now in toolbox) in case of incompatibility with existing code (hopefully this will be rare, see the comment on NonrelDatabaseOperations.value_for_db_model for some insight). [Update: Not any longer, we've decided to drop this completely.]

Data types abstraction

The second goal was to build data type framework more suitable for a "general" NoSQL database rather than just the GAE datastore:

db_type names suitable for a database driver that can process Python types directly;
all primary keys and relations use a "key" db_type, unless the back-end decides otherwise;
introduced a nonrel_db_type method that makes it possible to override Django's db_type logic even for custom fields;
toolbox can handle the "Django side" of type conversions -- converts Django lazy objects and wrappers to standard types before passing them to the database;
a bit lesser reliance on sql.Query -- values passed to back-ends compilers are now always accompanied by a proper field (my feeling is that the "column" logic is not very natural for NoSQL databases);

Nonrel fields

The third goal was to make things easier for back-end developers, by taking the burden of handling toolbox fields away (there are many subtleties which can easily lead to mistakes). Now all the code is within the toolbox and all a back-end develop has to do is choose a suitable collection or embedded field db_type for his/her database and call a parent conversion (deconversion) method at the beginning (end) of the back-end methods. There is likely enough to choose from to satisfy most NoSQL databases (see comments in creation.py to learn the available options).

GAE: relations as db.Keys and binary serialization

The fourth was to prepare grounds for storing ForeignKeys and other relations using the special db.Key type of the GAE datastore. A database option is implemented to switch between the two storage methods, but only a "dump data; switch; reload data" procedure is possible. The default is to use the legacy storage for backwards compatibility, but it could be changed with some future major revision. See django-nonrel/djangoappengine#18 for more info.

GAE engine now also serializes DictFields and EmbeddedModelFields using the binary pickle protocol (instead of the ASCII protocol 0); this should work transparently as deconversion is the same for both protocols and no lookups / filtering is specified for these fields (as far as I can tell).

Additional tests and fixes

Fix for the EmbeddedModelField.update issue (EmbeddedModelField pre_save, get_db_prep_value rework #14) is already applied [update: already on the develop branch];
Added some tests for the natural usage of ListField with ForeignKeys that we've discussed on the mailing list (marked as an expected failure for now) [update: on a separate branch now];
Trivial fix for the test for issue Resubmission of pull request for serialization #12 (failing on develop) [update: committed to develop]. Nevertheless, I believe there is more to collection fields serialization and as it is written now it may not work with any kind of nesting [pending verification].
Also added some tests checking interaction between Django safe/to-be-escaped strings and lazy objects and the database layer.

Django changes to be considered

We could drop all the related_db_type infrastructure -- it's not sufficient for GAE relations-as-Keys and rather hard to follow
Replacing various value_to_db_* and convert_values methods on DatabaseOperations with something like the new value_for/from_db(value, field [, lookup]) would allow to avoid double recursion for collection fields, simplify back-end code (get_db_prep_* would call these methods rather than compilers), and make it possible to avoid various hacks in toolbox code (e.g. by moving what is now in Field.get_db_prep_lookup -- and is more or less reverted in NonrelQuery._normalize_lookup_value -- to an abstract SQLDatabaseOperations). See comments on the NonrelDatabaseOperations class if you'd like to know what I find unnecessarily hard to do with the current framework.

Reviewing, testing & merging

I believe the most convenient way to review these changes is by reading creation.py and base.py of toolbox and the engine that you use. If you'd prefer to go through some commits the best bet is to go through the Mongo ones -- only these have been squashed; otherwise you'll find a story of taking bad paths, learning new aspects of the system and redesigning things a couple of times. Any kind of feedback is welcome (including linguistic corrections, comments improvements, code style suggestions, pointing out errors, better design ideas, etc.).

These changes could use some testing. Particularly with preexisting data -- the intention was to keep everything compatible, and a considerable effort was put into understanding what each piece in current conversions does and not lose it, but considering the amount and nature of changes I've probably broken some things. So if you have a project with a testing environment and some existing data, trying these branches out would be really nice -- just backup, update and see if everything works. If it does and it's a GAE project, dump the data, add STORE_RELATIONS_AS_DB_KEYS: True to your database options, reload and check if it still does.

I've run toolbox / appengine / dbindexer / django.contrib (except flatpages and GeoDjango) test cases for GAE under Python 2.5 / 2.7 and the Mongo test suite under 2.6 / 2.7 some horrible amount of times. However, if you have the power to run the full Django test suite (django/tests/runtests.py) in reasonable time -- run it -- I've given up for now (or otherwise would have to postpone pushing this for another week or so ;-) Some tests are expected to fail, but no new failures should be introduced compared to develop branches.

There are branches also called feature/type-conversions-refactor in the GAE and Mongo engine repositories. Each of them depends on this one and this one depends on them -- they need to be merged together.

WARNING: Although this fixes some things it also changes how things are stored, in some cases in a subtle, dynamic way (e.g. DictFields on GAE). It really should not be used in production until going through some more extensive, real-data testing. Nevertheless, in case somebody feels that lucky (especially considering that this is not likely to be merged any time soon :-) dumping your data before and reloading it after the update gives a better chance to avoid some yet unforeseen problems and I'd also highly recommend looking into your database (e.g. at _ah/admin) and checking that things are stored as they should be shortly after that.

…ch the base class.

…by nonrel query implementations).

…ther than get_db_prep_value for DecimalField.

…me more comments to NonrelQuery methods.

…rt_values_from_db) to allow symmetric handling of encoding and decoding.

…ons).

…o not doing any type casting for AutoField if the latter is true.

…llow back-ends to override the main type (for example to easily use list storage for SetFields).

… fit better; added db_table argument.

…gic). Use <lookup info (column, type, operator, negated)>, <value>, <value info (starting with field)> argument order everywhere.

…e, so it can be used as a part of a key.

…) 3-tuple. Don't pass db_type to filtering-related methods (it was only used for encoding).

…e-String/Unicode) to plain strings for type-inspecting back-ends.

…irement is only that str(value) is reversible given the field kind.

…lanks in value_to_db_* functions.

…ary_key is usable and a feature to mark back-ends as supporing using fields other than AutoField as keys or not.

…er for subclasses and these fields should behave much like standard fields.

…es in a collection field.

…fields.

…column used by the constraint (was: parent primary key, is: child foreign key).

…th a ListField more symmetric.

…lso allow the field to be used to prevent convert_value_to/from_db to be run on embedded instance fields' values like in the 0.4 version of the engine.

… as primary keys.

Conflicts: djangotoolbox/fields.py djangotoolbox/tests.py

…h by gpitfield, thanks!

…uses.

…lit moved into separate methods.

… conversions / deconversions of embedded instance fields' values.

…ry for legacy GAE foreign key storage.

…edModelField implementation and no-deconversion handling to the LegacyEmbeddedModelField.

…umn keys (fields are already available for filtering and updates).

…umn keys (fields are already available for filtering and updates) and pass the key returned by back-ends through database deconversions.

…also skips deconversions.

…the-box (to be chosen by a back-end developer).

… lists nor strings nor bytes lets not call it a database ;-)

emperorcezar · 2012-02-07T15:07:12Z

I will run your changes agains. Django-lms and see if it breaks anything (mongo).

Thanks. This is really great. Toolbox really needed a refactor.

jonashaag · 2012-02-07T15:59:27Z

Holy shit O_O

This is insane. Thanks so much!

I don't think we can review this easily, though. It would really help if you could squash your changes and then split them into logical bits. Also, please keep all sorts of style fixes (line breaks, whitespace, typos, etc) in one separate commit as to reduce noise.

From a quick glance:

we can drop LegacyEmbeddedModelField entirely. The old data format has been deprecated for two releases, so we're free to drop support :-)
We shouldn't pickle anything in djangotoolbox. Databases may have their own representation of types (like Binary for bytes on MongoDB), so pickling should only happen (if at all) in the backend.

Also, you should probably try to get as many of these changes as possible into Django vanilla. Most of the code is database agnostic and it's better for Django-nonrel to have very little code that differs from Django. (You could do that as a GSoC for example.)

Edit: We also need to know of any backwards incompatibilities :-)

wrwrwr · 2012-02-07T19:34:57Z

On Tue, 2012-02-07 at 07:07 -0800, Adam Cezar Jenkins wrote:

I will run your changes agains. Django-lms and see if it breaks anything (mongo).

Thanks. This is really great. Toolbox really needed a refactor.

Thanks, let me know if something does break. Toolbox compilers
themselves could be also cleaned up a bit (to rely even less on the
Django's SQL machinery or build some more general abstraction), but that
can probably be done in a more incremental manner (unlike this part
where one thing was asking for another :-)

Reply to this email directly or view it on GitHub:
#16 (comment)

wrwrwr · 2012-02-07T20:29:51Z

On Tue, 2012-02-07 at 07:59 -0800, Jonas Haag wrote:

Holy shit O_O

This is insane. Thanks so much!

Thank you for the good word.

I don't think we can review this easily, though. It would really help if you could squash your changes and then split them into logical bits. Also, please keep all sorts of style fixes (line breaks, whitespace, typos, etc) in one separate commit as to reduce noise.

Yes, all true. I like to keep such formatting stuff separate too (even
though it's against what Django coding style tells me :-) I'll try to
squash some time to at least do a partial squashing, but for the time
being maybe try a file based approach -- in short: DatabaseCreation had
most of the type logic changed, DatabaseOperations: all of the
conversions code collected and mostly rewritten, DatabaseFeatures: two
features removed one added, compilers: changes mostly related to
providing fields for values and adaptation.

From a quick glance:

we can drop LegacyEmbeddedModelField entirely. The old data format has been deprecated for two releases, so we're free to drop support :-)

It now also allows to keep using the old logic for embedded instance
fields conversions / deconversions -- I guess this can be removed too
(as far as I can tell, the change can cause incompatibility only in some
rather unusual circumstances), but maybe lets wait and see if anybody
needed to use it or not.

We shouldn't pickle anything in djangotoolbox. Databases may have their own representation of types (like Binary for bytes on MongoDB), so pickling should only happen (if at all) in the backend.

The pickling only happens if a back-end developer chooses "string" or
"bytes" storage for DictFields or EmbeddedModelFields (that's not the
default; the first one is more or less what GAE used to use, the second
just uses another pickle protocol). Generally the serialization will be
needed to support nonrel fields with databases that can't store a dict
or a nested list directly (say a relational DB with a BLOB type?); I
don't think it does any harm for toolbox to offer this (it's meant as a
preprocessing before casting to a database specific type).

Also, you should probably try to get as many of these changes as possible into Django vanilla. Most of the code is database agnostic and it's better for Django-nonrel to have very little code that differs from Django. (You could do that as a GSoC for example.)

I have to attend to matters of a more commercial nature for the next
couple of days (a bit less fun than working on nonrel unfortunately :-)
so short-term I'm only planing to do some feature updating / merging /
maintenance. The simplified value conversion framework may be a future
candidate for vanilla, but I'll let others take a look first, maybe we
can come up with something even better?

Reply to this email directly or view it on GitHub:
#16 (comment)

wrwrwr · 2012-02-08T12:29:41Z

Closing for now, going to make a new request after reducing and bringing some order into the the commits list.

Anybody would mind if I just PEP-ified the whole toolbox? I probably should just restrain from making all those style changes, but I can't make myself to :-) Some of PEP's recommendations may feel awkward, but having something fixed should make further changes more readable.

emperorcezar · 2012-02-08T15:00:06Z

I think PEP-ifying it is a great idea. Toolbox and nonrel code can be a bit confussing, anything that makes it more readable is better. It couldn't really hurt anything.

aburgel · 2012-02-13T15:12:51Z

i don't mind if you PEP the whole thing, but maybe do it as a separate commit so its easier confirm that nothing breaks.

wrwrwr · 2012-02-13T15:36:24Z

Yes exactly, for toolbox it's already done -- pepification and all insignificant things in one commit (not 100% strictly to the standard as sometimes it actually decreases readability, but close enough).

Currently working on GAE, but I'll probably just drop any insignificant changes that have wandered into the Mongo branch -- I find it's style specific, but fairly consistent (and there are some more significant things to work on :-)

emperorcezar · 2012-03-26T21:23:21Z

I wanted to say a big thank you for this! It's made understanding and following the code paths much easier.

wrwrwr added 30 commits January 19, 2012 18:21

NonrelDatabaseOperations: added comments and reordered methods to mat…

430f6af

…ch the base class.

Also override the standard deconversion method (currently not called …

faba083

…by nonrel query implementations).

Use an abstract key type for storing primary and foreign keys.

98a870a

Trying to explain convert_value_to/from_db.

6803ab8

Coded an explicit workaround for Django providing get_db_prep_save ra…

1b9eeea

…ther than get_db_prep_value for DecimalField.

Better comment for data_types.

89ed2d3

Moved DecimalField workaround to lookup value normalization. Added so…

9bd5935

…me more comments to NonrelQuery methods.

Provide reference implementations for iterable fields processing.

5301979

Subtype splitting in one place.

8ccf890

Also call the standard DatabaseOperations.convert_values (after conve…

11fe1bd

…rt_values_from_db) to allow symmetric handling of encoding and decoding.

Replaced lazy object handling with what it probably was meant to be.

c46b92d

Also default to the abstract key for references (foreign keys, relati…

510c14d

…ons).

Replaced string_based_auto_field feature with has_key_type, default t…

6c2bcb2

…o not doing any type casting for AutoField if the latter is true.

Use types for iterable fields better fitting the current framework, a…

573c60f

…llow back-ends to override the main type (for example to easily use list storage for SetFields).

Test lists of referenes (ListField with ForeignKeys).

fabca32

Moved additional conversion methods to DatabaseOperations, where they…

aa69d0b

… fit better; added db_table argument.

Match Django and subclasses inheritances.

cf15381

Added a test for lists with keys. See issue #10.

d19b319

Iterable field lookup values also need to be converted using db_subtype.

96adef5

Pass field to compiler conversion methods (needed for table / kind lo…

7616383

…gic). Use <lookup info (column, type, operator, negated)>, <value>, <value info (starting with field)> argument order everywhere.

For related objects convert the value using related model's table nam…

7d3bcb8

…e, so it can be used as a part of a key.

Combine all info needed for encoding / decoding into a single (nested…

10ccb09

…) 3-tuple. Don't pass db_type to filtering-related methods (it was only used for encoding).

Convert Django's strings marked as safe or needing escape (Safe/Escap…

52d857b

…e-String/Unicode) to plain strings for type-inspecting back-ends.

Let back-ends decide what they want to allow for AutoFields, the requ…

6aa9f56

…irement is only that str(value) is reversible given the field kind.

Pass also field internal type to conversions -- needed to fill some b…

4be6391

…lanks in value_to_db_* functions.

Added infrastructure that can be used to extend fields for which prim…

1471ede

…ary_key is usable and a feature to mark back-ends as supporing using fields other than AutoField as keys or not.

Added tests for marked strings (mysterious isinstance(str)).

4a15359

Bring back constant internal types for nonrel fields -- this may matt…

2a82c3d

…er for subclasses and these fields should behave much like standard fields.

A bit more elaborate solution is needed to determine db_infos of valu…

c715534

…es in a collection field.

Memoize db_info on field to improve performance for typed collection …

c03b48c

…fields.

wrwrwr added 20 commits February 1, 2012 21:13

Change the field passed to conversions for _set queries to match the …

f63e7e1

…column used by the constraint (was: parent primary key, is: child foreign key).

Made the test about using model instances rather than foreign keys wi…

dfa58c0

…th a ListField more symmetric.

Test for list with embedded instances update problem (issue #13).

ef87e04

Pass the right connection to Field.db_type calls.

5bd50f8

Improved the embedded instance update test.

15a13be

Compatibility with Mongo's LegacyEmbeddedModelField (pre-0.2 data); a…

a930c88

…lso allow the field to be used to prevent convert_value_to/from_db to be run on embedded instance fields' values like in the 0.4 version of the engine.

Default to not allowing collection / embedded model fields to be used…

6d3589f

… as primary keys.

Merge branch 'develop' into feature/type-conversions-refactor.

ece7095

Conflicts: djangotoolbox/fields.py djangotoolbox/tests.py

Introduced changes fixing EmbeddedModelField.update (issue #13). Patc…

a5173a1

…h by gpitfield, thanks!

Made conversion method names shorter and more similar to what Django …

324b060

…uses.

Code handling type conversions for collections and embedded models sp…

e47811c

…lit moved into separate methods.

Moved LegacyEmbeddedModelField as it can be now also used to avoid db…

3a10484

… conversions / deconversions of embedded instance fields' values.

Use nonrel_db_type to determine db_types of items / values -- necessa…

11cb5b9

…ry for legacy GAE foreign key storage.

Moved model class info serialization / deserialization back to Embedd…

9c80a73

…edModelField implementation and no-deconversion handling to the LegacyEmbeddedModelField.

Make SQLInsertCompiler pass a mapping with field keys rather than col…

e7bc7f6

…umn keys (fields are already available for filtering and updates).

Make SQLInsertCompiler pass a mapping with field keys rather than col…

ab1b7e5

…umn keys (fields are already available for filtering and updates) and pass the key returned by back-ends through database deconversions.

Subfield type conversion avoidance for LegacyEmbeddedModelField that …

32f873c

…also skips deconversions.

Support serialized formats for collection and embedded fields out-of-…

ae0338d

…the-box (to be chosen by a back-end developer).

Removed the supports_dict feature; if it can't store dicts nor nested…

f789e1d

… lists nor strings nor bytes lets not call it a database ;-)

Python 2.5 compatibility.

4d90aec

wrwrwr closed this Feb 8, 2012

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Type conversions refactorization #16

Type conversions refactorization #16

wrwrwr commented Feb 7, 2012

emperorcezar commented Feb 7, 2012

jonashaag commented Feb 7, 2012

wrwrwr commented Feb 7, 2012

wrwrwr commented Feb 7, 2012

wrwrwr commented Feb 8, 2012

emperorcezar commented Feb 8, 2012

aburgel commented Feb 13, 2012

wrwrwr commented Feb 13, 2012

emperorcezar commented Mar 26, 2012

Type conversions refactorization #16

Type conversions refactorization #16

Conversation

wrwrwr commented Feb 7, 2012

Flaws in type conversions

Data types abstraction

Nonrel fields

GAE: relations as db.Keys and binary serialization

Additional tests and fixes

Django changes to be considered

Reviewing, testing & merging

emperorcezar commented Feb 7, 2012

jonashaag commented Feb 7, 2012

wrwrwr commented Feb 7, 2012

wrwrwr commented Feb 7, 2012

wrwrwr commented Feb 8, 2012

emperorcezar commented Feb 8, 2012

aburgel commented Feb 13, 2012

wrwrwr commented Feb 13, 2012

emperorcezar commented Mar 26, 2012