Add a generic data type converter to the Cursor object #442

Merged
amotl merged 2 commits into master from amo/type-converter on Oct 18, 2022

Member

@amotl amotl commented Aug 5, 2022

Hi there,

Introduction

After converging #395 to #437, it became clear that we wanted the type conversion to be

  • optional, enabled only with a feature flag /cc @mfussenegger
  • implemented more elegantly /cc @matriv

This patch aims to resolve both aspects.

Details

This will allow converting fetched data from CrateDB data types to Python data types in different ways. Its usage is completely optional. When not used, the feature will not incur any overhead.

There is also a DefaultTypeConverter, which aims to become a sane default choice when looking at this kind of convenience. It will enable the user to work with native Python data types from the start, for all database types where this makes sense, without needing to establish the required set of default value converters on their own.

If the DefaultTypeConverter does not fit the user's aims, it is easy to define custom type converters, possibly reusing specific ones from the library.

Acknowledgements

The patch follows the implementation suggested by the PyAthena driver very closely, see PyAthena/converter.py. Thank you!

Synopsis

>>> cursor = connection.cursor(converter=Cursor.get_default_converter())
>>> cursor.execute(stmt)
>>> cursor.fetchone()
['foo', IPv4Address('10.10.10.1'), datetime.datetime(2022, 7, 18, 18, 10, 36, 758000)]
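
To illustrate the mechanics behind the synopsis, here is a minimal, self-contained sketch. It does not use the driver: `CONVERTERS`, `to_ipaddress`, `to_datetime`, and `convert_row` are hypothetical names, and the type identifiers (5 for IP, 11 for TIMESTAMP) are the ones used in the doctest of this patch.

```python
import ipaddress
from datetime import datetime, timezone


def to_ipaddress(value):
    # NULL values pass through unchanged.
    return None if value is None else ipaddress.ip_address(value)


def to_datetime(value):
    # The value is expected in milliseconds since the Unix epoch.
    if value is None:
        return None
    # Produce a naive datetime that represents UTC wall-clock time.
    return datetime.fromtimestamp(value / 1e3, tz=timezone.utc).replace(tzinfo=None)


# Hypothetical mapping from type identifier to converter function.
CONVERTERS = {5: to_ipaddress, 11: to_datetime}


def convert_row(col_types, row):
    # Resolve one function per column; unmapped types are returned as-is.
    funcs = [CONVERTERS.get(type_id, lambda value: value) for type_id in col_types]
    return [func(value) for func, value in zip(funcs, row)]


row = convert_row([4, 5, 11], ["foo", "10.10.10.1", 1658167836758])
print(row)
```

The converted row should match the one shown in the synopsis; the real implementation may differ in structure and naming.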

Thoughts

In the form of an early review, please let me know if you endorse this approach and which adjustments you would like to see.

Personally, I would add more test cases to test_cursor.py, in order to exercise the machinery in more detail and to increase code coverage, and probably make the functionality more present in the documentation.

Also, as @mfussenegger mentioned at https://github.com/crate/crate-python/pull/437/files#r930822457, it might make sense to also add the important aspect of converting elements within ARRAY types to this patch already.

With kind regards,
Andreas.


Backlog

Member

seut commented Aug 26, 2022

👍 I really like this approach and would definitely prefer it over #437.

Member

@mfussenegger mfussenegger left a comment

Left a couple of comments regarding some of the implementation details.

I'm also not sure about the level of separation. E.g. the converter application could probably be separated from the cursor without much loss of API/UI convenience. But no strong opinion here.

On the other hand, I think something common like setting the timezone for datetime objects could be made even easier for users.

Comment on lines 75 to 120
def get(self, type_: int) -> Callable[[Optional[int]], Optional[Any]]:
    return self.mappings.get(type_, self._default)

def set(self, type_: int, converter: Callable[[Optional[int]], Optional[Any]]) -> None:
    self.mappings[type_] = converter

def remove(self, type_: int) -> None:
    self.mappings.pop(type_, None)

def update(self, mappings: Dict[int, Callable[[Optional[int]], Optional[Any]]]) -> None:
    self.mappings.update(mappings)
Member

Not sure if we want to make this mutable?
I also wonder if we should introduce an Enum or similar to give names to the type ids?

Member Author

@amotl amotl Sep 8, 2022

> Not sure if we want to make this mutable?

I also wasn't sure, but in the end I liked the flexibility of letting users add and remove converters as they like. See also my other comment below about the blueprint I took from PyAthena.

> I also wonder if we should introduce an Enum or similar to give names to the type ids?

Good idea. I've introduced an Enum with 00dbb42.
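
For readers following along, naming the type ids could look roughly like the sketch below. This is an illustration rather than the code from 00dbb42: the identifiers 4, 5, 11, and 25 are taken from the doctests in this thread, the member name `TEXT` is an assumption, and whether the patch uses `Enum` or `IntEnum` is not shown here.

```python
from enum import IntEnum


class DataType(IntEnum):
    # Identifiers as used in the doctests of this patch; names are illustrative.
    TEXT = 4
    IP = 5
    TIMESTAMP_WITH_TZ = 11
    BIT = 25


# An integer from the `col_types` response field resolves to a named member.
col_types = [4, 5, 11]
names = [DataType(type_id).name for type_id in col_types]
print(names)

# IntEnum members still compare equal to plain integers.
print(DataType.BIT == 25)
```

Looking up an identifier that has no member, such as `DataType(999)` in this sketch, raises a `ValueError`, so callers need to decide how to treat unknown types.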

Member

> I also wasn't sure, but liked the flexibility in the end to make the user able to add and remove converters as they like.

I'd keep it immutable for now. It reduces the API surface and makes it easier to reason about the mappings.
And if a legitimate use-case pops up, it's easy to change later. On the other hand, changing a mutable mapping to immutable later on would be a breaking change.

Member Author

Are you fine with removing mutability by 299b771?
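
As a sketch of what an immutable variant could look like, the class below accepts its mappings once at construction time and exposes only a lookup method. `ImmutableConverter` is a hypothetical name and not the class from 299b771; the use of `MappingProxyType` is one possible way to get a read-only view, chosen here for illustration.

```python
import ipaddress
from types import MappingProxyType
from typing import Any, Callable, Mapping, Optional

ConverterFunction = Callable[[Optional[Any]], Optional[Any]]


class ImmutableConverter:
    """Hypothetical sketch: mappings are fixed when the object is created."""

    def __init__(self, mappings: Optional[Mapping[int, ConverterFunction]] = None):
        # Copy the input, then keep only a read-only view of the copy.
        self._mappings = MappingProxyType(dict(mappings or {}))

    def get(self, type_: int) -> ConverterFunction:
        # Fall back to the identity function for unmapped types.
        return self._mappings.get(type_, lambda value: value)


converter = ImmutableConverter({5: ipaddress.ip_address})
result = converter.get(5)("10.10.10.1")
print(result)
```

Users who need different mappings would create a new converter instead of modifying an existing one, which keeps the API surface small.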

Comment on lines +319 to +328
>>> connection.client.set_next_response({
... "col_types": [4, 5, 11],
... "rows":[ [ "foo", "10.10.10.1", 1658167836758 ] ],
... "cols":[ "name", "address", "timestamp" ],
... "rowcount":1,
... "duration":123
... })

>>> cursor.execute('')
Member

I think this would confuse users because this is a mocked cursor and client.set_next_response isn't something that they can use.

Either we need to point this out explicitly, hide it or use real data and don't mock it.

Member Author

At the beginning of the file cursor.txt, there is a corresponding note about this detail.

> Hardcode the next response of the mocked connection client, so we won't need a sql statement
> to execute.

After that, the set_next_response() method is already used five times within cursor.txt, so I didn't hesitate to continue implementing the test cases this way.

Member

Is there a reason we can't/shouldn't hide it for the rendered docs?

Member Author

As far as I can see, cursor.txt is not part of the rendered docs at https://crate.io/docs/python/, it is only used as a doctest.

Member Author

I've created #448 to have a separate discussion about how to improve/adjust doctests vs. rendered documentation.

Member

We shouldn't write doctests to ensure the client functions. Doctests are to ensure examples in the documentation work. The primary concern is the documentation, not the test coverage it gives. If we use this for testing, we should convert it to regular Python tests.

It's much easier to run isolated cases for debugging etc. when using the regular test framework instead of having one large doctest where you can only run the full thing.
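
To make the point concrete, a regular test case for a converter function might look like the following sketch. It uses only the standard library; `to_datetime` is a stand-in written for this example and `ToDatetimeTest` is a hypothetical name, neither is taken from the patch.

```python
import unittest
from datetime import datetime, timezone


def to_datetime(value):
    # Stand-in for the function under test: epoch milliseconds to naive UTC.
    if value is None:
        return None
    return datetime.fromtimestamp(value / 1e3, tz=timezone.utc).replace(tzinfo=None)


class ToDatetimeTest(unittest.TestCase):
    def test_none_passes_through(self):
        self.assertIsNone(to_datetime(None))

    def test_epoch_milliseconds(self):
        self.assertEqual(
            to_datetime(1658167836758),
            datetime(2022, 7, 18, 18, 10, 36, 758000),
        )


# Run only this test case, without executing a whole doctest file.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToDatetimeTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method can be selected and debugged on its own, which is the isolation a single large doctest file does not offer.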

Member Author

I totally agree with you. I think your guidance should be implemented separately, tracked by #448. If you want to see it sooner rather than later, then I will be happy to tackle it before coming back to this patch.

Member

Sounds good to handle this separately in a follow up

@mfussenegger
Member

There is btw. also https://peps.python.org/pep-0249/#type-objects in the DB API definition - I didn't have a closer look at it, but we should double check to see if it affects what the PR would be doing.

Member Author

amotl commented Aug 29, 2022

Thank you for endorsing this patch and for your valuable suggestions.

Member Author

amotl commented Sep 8, 2022

Hi again,

other than the inline comments, let me address further questions here.

> Personally, I would add more test cases to test_cursor.py, in order to exercise the machinery in more detail and to increase code coverage.

Done, mostly with 1410547.

> It might make sense to also add the important aspect of converting elements within ARRAY types to this patch already.

Done with b21c38e.

> I'm also not sure about the level of separation. E.g. the converter application could probably be separated from the cursor without much loss of API/UI convenience. But no strong opinion here.

I also thought back and forth about this detail. In the end, I followed the path outlined by https://github.com/laughingman7743/PyAthena very closely, where I originally discovered this way of doing it. See PyAthena/converter.py.

> On the other hand, I think something common like setting the timezone for datetime objects could be made even easier for users.

Indeed, that would be sweet. I will think about how we could add this feature.

> There is btw. also https://peps.python.org/pep-0249/#type-objects in the DB API definition - I didn't have a closer look at it, but we should double check to see if it affects what the PR would be doing.

I also found that section in the DBAPI spec, but I think it is exclusively about input parameter conversion:

> Many databases need to have the input in a particular format for binding to an operation’s input parameters. For example, if an input is destined for a DATE column, then it must be bound to the database in a particular string format.

On the other hand, the current implementation is exclusively about output data conversion. That being said, the converter functions themselves could well be reused for that use case, but otherwise I think it is a different area, to be addressed in a different patch.

With kind regards,
Andreas.

Comment on lines +44 to +51
def _to_datetime(value: Optional[float]) -> Optional[datetime]:
    """
    https://docs.python.org/3/library/datetime.html
    """
    if value is None:
        return None
    return datetime.utcfromtimestamp(value / 1e3)
Member Author

@amotl amotl Sep 16, 2022

Dear @mfussenegger, @matriv, and @seut,

regarding @mfussenegger's proposal to let the user optionally specify a time zone, and make the machinery return a timezone-aware datetime object, I see two options here.

Let's assume the user defines a custom timezone object like

>>> import datetime
>>> tz_mst = datetime.timezone(datetime.timedelta(hours=7), name="MST")

and a timestamp in Epoch like

# Fri, 16 Sep 2022 11:09:16 GMT
timestamp = 1663326556

At this point, which variant would be semantically correct? Note that datetime.utcfromtimestamp does not accept the tz keyword argument; it always returns a naive datetime object with tzinfo=None.

  1. We can either use datetime.fromtimestamp, which does accept the tz keyword argument.

    >>> datetime.datetime.fromtimestamp(timestamp, tz=tz_mst)
    datetime.datetime(2022, 9, 16, 18, 9, 16, tzinfo=datetime.timezone(datetime.timedelta(seconds=25200), 'MST'))
    
    >>> datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)
    datetime.datetime(2022, 9, 16, 11, 9, 16, tzinfo=datetime.timezone.utc)
  2. Or we can replace the tzinfo on the naive datetime object returned by utcfromtimestamp, making it timezone-aware.

    >>> datetime.datetime.utcfromtimestamp(timestamp).replace(tzinfo=tz_mst)
    datetime.datetime(2022, 9, 16, 11, 9, 16, tzinfo=datetime.timezone(datetime.timedelta(seconds=25200), 'MST'))
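
The difference between the two options can be checked with a short sketch that converts each result back to an epoch value. It reuses the example values from above (a fixed +07:00 offset labelled "MST"); the variable names `option_1`, `option_2`, and `naive_utc` are introduced only for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Example values from the comment above.
tz_mst = timezone(timedelta(hours=7), name="MST")
timestamp = 1663326556  # Fri, 16 Sep 2022 11:09:16 GMT

# Option 1: convert the instant into the target time zone.
option_1 = datetime.fromtimestamp(timestamp, tz=tz_mst)

# Option 2: take the naive UTC wall-clock time and relabel it with the zone.
naive_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc).replace(tzinfo=None)
option_2 = naive_utc.replace(tzinfo=tz_mst)

# Option 1 round-trips to the original epoch value.
print(option_1.timestamp() == timestamp)

# Option 2 points at a different instant, shifted by the UTC offset.
print(option_2.timestamp() - timestamp)
```

In other words, option 2 keeps the wall-clock digits but changes the moment in time they refer to, while option 1 keeps the moment and changes only its representation.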

I think option 1 is the right choice, so the implementation would look something like this:

# Assumes `import datetime`, as in the snippets above.
def set_timezone(self, tz):
  """
  When the user specifies a time zone upfront, all returned datetime objects
  will be timezone-aware.
  """
  if tz is not None and not isinstance(tz, datetime.timezone):
    raise TypeError(f"Timezone object has wrong type: {tz}")
  self.timezone = tz

def _to_datetime(self, value):
  """
  Convert CrateDB's `TIMESTAMP` column to the native Python `datetime` object.
  When a timezone is given, return a timezone-aware object.
  Otherwise, return a naive object which is meant to be the time in the UTC time zone.
  """
  if self.timezone is not None:
    return datetime.datetime.fromtimestamp(value / 1e3, tz=self.timezone)
  else:
    return datetime.datetime.utcfromtimestamp(value / 1e3)

What do you think?

With kind regards,
Andreas.

Member Author

@amotl amotl Sep 16, 2022

Another implementation could look like this: it always uses datetime.fromtimestamp(value / 1e3, tz=self.timezone), where self.timezone would be datetime.timezone.utc by default, i.e. it will always return timezone-aware datetime objects.

While definitively easier and more straight-forward, it could be less performant when not aiming to use timezone-aware datetime objects and stay in "naive" land on purpose.

# Assumes `import datetime`, as in the snippets above.
def __init__(self):
  # Return datetime objects in UTC by default.
  self.timezone = datetime.timezone.utc

def set_timezone(self, tz):
  """
  Set the time zone you want your `datetime` objects to be returned as.
  """
  if not isinstance(tz, datetime.timezone):
    raise TypeError(f"Timezone object has wrong type: {tz}")
  self.timezone = tz

def _to_datetime(self, value):
  """
  Convert CrateDB's `TIMESTAMP` column to a timezone-aware native Python `datetime` object.
  """
  return datetime.datetime.fromtimestamp(value / 1e3, tz=self.timezone)

Member Author

@amotl amotl Sep 16, 2022

> definitively easier and more straight-forward

Thinking about it once more, I would prefer the implementation variant outlined in my previous comment. That way, the data converter machinery consistently returns timezone-aware datetime objects.

> it could be less performant when not aiming to use timezone-aware datetime objects and stay in "naive" land on purpose.

I think users aiming for raw speed would opt out of the data type converter machinery completely.

Member Author

Hi again. Talk is cheap. I've converged a few more thoughts about this into a proposed patch at #445, completely separating the topic of timezone-aware datetime handling from this one.

Comment on lines +315 to +318
# FIXME: Why does this croak on statements like ``DROP TABLE cities``?
# Note: When needing to debug the test environment, you may want to
# enable this logger statement.
# log.exception("Executing SQL statement failed")
Member Author

@amotl amotl Sep 28, 2022

About

Tests: Add inline comment about how to debug the test environment.

Background story

It was needed because the tests suddenly started failing mysteriously on my machine.

Out of the blue, the test suite started croaking with errors like TABLE 'locations' already exists. It turned out that CrateDB would not accept any write operations because my disk usage was at 96%, even though I still had 60G of free disk space.

Unfortunately, I did not discover this easily, because the test suite swallowed the exception about FORBIDDEN/12/index read-only at this point. After enabling the log.exception statement, the error response from CrateDB became immediately visible within the full stacktrace.

Thoughts

I like having such inline comments in place. Please let me know if you have any better advice.

Comment on lines 92 to 98
# Map data type identifier to converter function.
_DEFAULT_CONVERTERS: Dict[DataType, ConverterFunction] = {
    DataType.IP: _to_ipaddress,
    DataType.TIMESTAMP_WITH_TZ: _to_datetime,
    DataType.TIMESTAMP_WITHOUT_TZ: _to_datetime,
}
Member Author

Not sure if the DefaultTypeConverter should apply the _to_datetime function to both kinds of TIMESTAMP types?

Comment on lines 344 to 364
>>> from crate.client.converter import Converter, DataType

>>> converter = Converter()
>>> converter.set(DataType.BIT, lambda value: int(value[2:-1], 2))
>>> cursor = connection.cursor(converter=converter)

Proof that the converter works correctly, ``B\'0110\'`` should be converted to
``6``. CrateDB's ``BIT`` data type has the numeric identifier ``25``::

>>> connection.client.set_next_response({
... "col_types": [25],
... "rows":[ [ "B'0110'" ] ],
... "cols":[ "value" ],
... "rowcount":1,
... "duration":123
... })

>>> cursor.execute('')

>>> cursor.fetchone()
[6]
Member Author

Do you think the type converter for the BIT data type should be included in the DefaultTypeConverter, like the converters for IP and TIMESTAMP? I am talking about this one:

>>> converter.set(DataType.BIT, lambda value: int(value[2:-1], 2))
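
One detail to keep in mind when reusing the lambda: it slices its argument unconditionally, so a NULL value (`None`) would raise a `TypeError`. A NULL-safe variant could be sketched as follows; `bit_to_int` is a hypothetical helper name and not part of the library.

```python
def bit_to_int(value):
    """Hypothetical helper: parse a bit string literal like B'0110' into an int."""
    if value is None:
        # NULL values pass through unchanged.
        return None
    # Strip the leading "B'" and the trailing "'", then parse as base 2.
    return int(value[2:-1], 2)


print(bit_to_int("B'0110'"))
print(bit_to_int(None))
```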

Member

I'd leave it out for now. Easy enough to add later if requested by users

Comment on lines 109 to 111
@property
def mappings(self) -> Dict[DataType, ConverterFunction]:
    return self._mappings
Member

Suggested change
@property
def mappings(self) -> Dict[DataType, ConverterFunction]:
    return self._mappings

Comment on lines 116 to 117
def set(self, type_: DataType, converter: ConverterFunction) -> None:
    self.mappings[type_] = converter
Member

Suggested change
def set(self, type_: DataType, converter: ConverterFunction) -> None:
    self.mappings[type_] = converter
def set(self, type_: DataType, converter: ConverterFunction) -> None:
    self.mappings[type_] = converter

As already mentioned, I don't think we should make this mutable if it isn't required.

Member Author

I've just added 14a355d, to follow your suggestion.

Comment on lines 32 to 33
InputVal = Any
ConverterFunction = Callable[[Optional[InputVal]], Optional[Any]]
ConverterDefinition = Union[ConverterFunction, Tuple[ConverterFunction, "ConverterDefinition"]]
ColTypesDefinition = Union[int, List[Union[int, "ColTypesDefinition"]]]
Member

Suggested change
- InputVal = Any
- ConverterFunction = Callable[[Optional[InputVal]], Optional[Any]]
- ConverterDefinition = Union[ConverterFunction, Tuple[ConverterFunction, "ConverterDefinition"]]
- ColTypesDefinition = Union[int, List[Union[int, "ColTypesDefinition"]]]
+ ConverterFunction = Callable[[Optional[Any]], Optional[Any]]
+ ConverterDefinition = Union[ConverterFunction, Tuple[ConverterFunction, "ConverterDefinition"]]
+ ColTypesDefinition = Union[int, List[Union[int, "ColTypesDefinition"]]]

No need for custom vocabulary for Any?

Member Author

See 4e88a10.

Comment on lines 135 to 147
def col_type_to_converter(self, type_: ColTypesDefinition) -> ConverterDefinition:
    """
    Resolve integer data type identifier to its corresponding converter function.
    Also handles nested definitions with a *list* of data type identifiers on the
    right hand side, describing the inner type of `ARRAY` values.

    It is important to resolve the converter functions first, in order not to
    hog the row loop with redundant lookups to the `mappings` dictionary.
    """
    result: ConverterDefinition
    if isinstance(type_, list):
        type_, inner_type = type_
        if DataType(type_) is not DataType.ARRAY:
            raise ValueError(f"Data type {type_} is not implemented as collection type")
        result = (self.get(DataType(type_)), self.col_type_to_converter(inner_type))
    else:
        result = self.get(DataType(type_))
    return result
Member

Couldn't this bake the array handling also into a function? Then this could return a function that the cursor can directly call to convert the result - instead of having to go through convert again.
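
A rough sketch of that idea, for illustration only: the nested type definition is resolved once into a single callable, with the array handling captured in a closure. `make_converter`, `to_ip`, and `convert_array` are hypothetical names, the identifier 5 for IP comes from the doctest in this patch, and the identifier 100 for ARRAY is an assumption.

```python
import ipaddress
from typing import Any, Callable, Dict, List, Union

ConverterFunction = Callable[[Any], Any]
ColTypesDefinition = Union[int, List[Any]]

ARRAY = 100  # Assumption: identifier of CrateDB's ARRAY type.
IP = 5       # Identifier as used in the doctest of this patch.


def identity(value):
    return value


def to_ip(value):
    return None if value is None else ipaddress.ip_address(value)


def make_converter(mappings: Dict[int, ConverterFunction],
                   type_: ColTypesDefinition) -> ConverterFunction:
    """Resolve a (possibly nested) type definition to one callable."""
    if isinstance(type_, list):
        outer, inner_type = type_
        if outer != ARRAY:
            raise ValueError(f"Data type {outer} is not implemented as collection type")
        # Resolve the element converter once, outside of the row loop.
        inner = make_converter(mappings, inner_type)

        def convert_array(value):
            # NULL arrays pass through; otherwise convert each element.
            return None if value is None else [inner(item) for item in value]

        return convert_array
    return mappings.get(type_, identity)


convert = make_converter({IP: to_ip}, [ARRAY, [ARRAY, IP]])
result = convert([["10.10.10.1"], ["10.10.10.2", None]])
print(result)
```

With this shape, the cursor would only need to call the returned function per value and would not have to inspect tuples of converters while iterating over rows.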

Member Author

@amotl amotl Oct 12, 2022

Thank you for 0e572ca, it worked out of the box. I've added 6bf53a2 and b298508 to fix up only minor details.

@amotl amotl force-pushed the amo/type-converter branch 2 times, most recently from d7fff77 to 6bf53a2 Compare October 12, 2022 15:07
Member

@mfussenegger mfussenegger left a comment

Left two more comments, otherwise looks good to me

@@ -123,12 +130,16 @@ def __init__(self,
         self.lowest_server_version = self._lowest_server_version()
         self._closed = False
 
-    def cursor(self):
+    def cursor(self, cursor=None, **kwargs) -> Cursor:
Member

What's the cursor=None for?

Member Author

It probably slipped in from the role model implementation of PyAthena. I think it can be removed. Thanks for spotting it.

def cursor(self, cursor: Optional[Type[BaseCursor]] = None, **kwargs) -> BaseCursor:

-- https://github.com/laughingman7743/PyAthena/blob/2f2c0e11487426bd382f480a8468e9bc78c42156/pyathena/connection.py#L231

Comment on lines 246 to 251
@staticmethod
def get_default_converter(mappings: Optional[ConverterMapping] = None) -> Converter:
    """
    Return the standard converter instance.
    """
    return DefaultTypeConverter(more_mappings=mappings)
Member

Is this needed? It's just wrapping a constructor which could be called directly?

Member Author

It looks like get_default_converter() is currently never called with any parameters, so I guess it can also be removed. Thanks!

@amotl amotl merged commit 8b7e737 into master Oct 18, 2022
@amotl amotl deleted the amo/type-converter branch October 18, 2022 17:53