Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ArangoClient generic to support typing when a custom serialiser is used #304

Closed
marcuslimdw opened this issue Dec 8, 2023 · 9 comments
Assignees

Comments

@marcuslimdw
Copy link

marcuslimdw commented Dec 8, 2023

Currently, the parameters of methods such as insert and aql.execute are typed concretely (with Json and Optional[MutableMapping[str, str]] respectively).

This is overly restrictive because these methods will in fact accept other types, as long as the serialiser in use supports them (and in fact, even the default json.loads serialiser can accept a wider range of types).

My suggestion:

Make ArangoClient, AQL, Database, and Collection generic in their serializer type (with the client's serializer type being passed down to everything created from it), something like this (using Python 3.12's new type parameter syntax):

class Serializer[T](Protocol):

    def __call__(self, value: T) -> Json:
        pass


class ArangoClient[T]:

    def __init__(self, serializer: Serializer[T], *args, **kwargs):
        pass

    def db(self, *args, **kwargs) -> Database[T]:
        pass


class StandardDatabase[T]:

    def collection(self, *args, **kwargs) -> Collection[T]:
        pass

    
class StandardCollection[T]:

    def insert(self, document: T, *args, **kwargs):  # no longer Json
        pass
@apetenchea
Copy link
Member

Hi @marcuslimdw,

This is not at all a bad idea. However, if that relies on something which works only on Python 3.12, it's going to heavily reduce the number of Python versions we can support. Currently, we go all the way from Python 3.8 to Python 3.11. So, if a feature requires features that are available only since Python 3.12, we unfortunately cannot adopt it.

Maybe I've misunderstood the limitation to Python 3.12, in which case let me know. I've labeled this as an enhancement so we can keep it around.

@marcuslimdw
Copy link
Author

marcuslimdw commented Mar 13, 2024

@apetenchea the "inline type parameter" syntax (that's how I think of it) is only available in 3.12, so to support >=3.8, we can use the older syntax instead.

The above example might look like this in 3.8:

T = TypeVar("T")

class Serializer(Protocol, Generic[T]):

    def __call__(self, value: T) -> Json:
        pass


class ArangoClient(Generic[T]):

    def __init__(self, serializer: Serializer[T], *args, **kwargs):
        pass

    def db(self, *args, **kwargs) -> Database[T]:
        pass


class StandardDatabase(Generic[T]):

    def collection(self, *args, **kwargs) -> Collection[T]:
        pass

    
class StandardCollection(Generic[T]):

    def insert(self, document: T, *args, **kwargs):  # no longer Json
        pass

@apetenchea
Copy link
Member

I agree that having the Json type hint seems overly restrictive. As you pointed out, one can implement a custom serialiser, which might work with completely different data from Json (serialise from bytes, for example).

However, I have thought about it carefully and I am not convinced that going full generic is exactly what we need right now. The reason for having Json in there is not to enforce (the Python ignores type hinting), but to give an idea on how the method is supposed to be used. So while going fully generic would indeed silence potential false-positive type-related warnings, it would render the whole type-hinting thing useless. There are other ways to work around such false positives, apart from modifying the driver.

I'm not adamant against it, and I can see why would someone opt for your proposed solution, but I'm wondering if there's other benefits you see that this potential change could bring to the driver.

@marcuslimdw
Copy link
Author

marcuslimdw commented Mar 16, 2024

Perhaps it might help if I explained my use case.

At work, we use ArangoDB for storage, but our data models are defined with Pydantic. We've created a custom serialiser that allows us to pass Pydantic models directly to, for example, aql.execute, or insert_many.

We also use Pyright for type checking, which necessarily fails because these methods expect builtin Python objects, but get Pydantic models. To put it another way, we're widening the types that the AQL/Collection methods can take at runtime, but the compile-time types stay the same, causing type checking failure. To me, this is exactly what generics are meant for.

An example that doesn't use Pydantic:

import json
from datetime import datetime
from datetime import timezone
from typing import Any

from arango.client import ArangoClient

# define a custom serialiser that can also handle datetimes
def serialiser(data: Any) -> str:
    if isinstance(data, datetime):
        return data.isoformat()

    return json.dumps(data)


ArangoClient(serializer=serialiser).db().aql.execute(
    "INSERT @document INTO @@collection", bind_vars={"document": {"timestamp": datetime.now(tz=timezone.utc)}, "@collection": "my_collection"}
)

This works at runtime, but fails type checking:

 error: Argument of type "dict[str, dict[str, datetime]]" cannot be assigned to parameter "bind_vars" of type "MutableMapping[str, DataTypes] | None" in function "execute"
    Type "dict[str, dict[str, datetime]]" cannot be assigned to type "MutableMapping[str, DataTypes] | None"
      "dict[str, dict[str, datetime]]" is incompatible with "MutableMapping[str, DataTypes]"
        Type parameter "_VT@MutableMapping" is invariant, but "dict[str, datetime]" is not the same as "DataTypes"
      "dict[str, dict[str, datetime]]" is incompatible with "None" (reportArgumentType)

There are other ways to work around such false positives, apart from modifying the driver.

Could you please suggest an alternative? I'd be happy to hear any possible solution to my problem.

@apetenchea
Copy link
Member

apetenchea commented May 20, 2024

Hello @marcuslimdw,

Please excuse my late reply, I've been away from the project for a while.

Just wanted to let you know that I've given it a go, but unfortunately, the change is not as simple as it looks like. Let me explain why.

Problem

The driver is already designed with having Json as main data type. Probably, in the beginning, that was enough to cover most use-cases. The serializer is used at the "core" of the driver, passed all the way to the Connection class, which couples it to more or less any data processing that we do. Say we introduce a generic Serializer[T] now, that would require at least the following classes to be generalized as well:

  • ArangoClient
  • BaseConnection
  • BasicConnection
  • JwtConnection
  • JwtSuperuserConnection
  • Database
  • StandardDatabase
  • AsyncDatabase
  • BatchDatabase
  • TransactionDatabase
  • OverloadControlDatabase
  • Collection
  • StandardCollection
  • VertexCollection
  • EdgeCollection
  • AQL

The change necessarily spills all over the place, in many of their functions (especially in the Collection sub-classes). There's lots of function arguments that must be (carefully) typed during this generalization. It's not that the volume of work makes it an impossible task, but rather it is tricky to make sure that we do the transition seamlessly, without causing any issues on the side. Perhaps covering the AQL example you posted above would be a bit easier, and introducing more generics one at a time would eventually get us to a place where we can safely include this in the next driver release, but I don't like the idea of introducing generics only partially, as it might create more confusion than good. All methods (aql, insert, update, replace, etc.) should be adapted to support this change.

Example

A small example, just to give you an idea of the problems we could be facing as a result of this:

    # an actual function which is used quite a lot in our code
    def _ensure_key_from_id(self, body: T) -> T:
        if "_id" in body and "_key" not in body:
            doc_id = self._validate_id(body["_id"])
            body = body.copy()
            body["_key"] = doc_id[len(self._id_prefix) :]
        return body

On body = body.copy(), mypy yields: error: Incompatible types in assignment (expression has type "Dict[Any, Any]", variable has type "T") [assignment]. And this is just one of the many errors I was facing while trying out this refactoring. Introducing generics at the moment makes mypy go crazy.

See my (incomplete) pull request for an in-depth look: #343

Believe me, I really tried. But this is proving to be a way bigger refactoring than anticipated, and I don't foresee these changes coming in the near future. If one would ever be so inclined to open a PR, that would be welcome, but I'd rather use my limited time now to get python-arango-async started. That's a new driver, where we'll have the chance to improve upon things (such as the problem you raised right now). But for now, I tend to close with with "won't fix", due to lack of time. Don't get me wrong, I like the idea of introducing more generic type hints into our code, is just that I can't prioritize it. I'm really sorry if that causes inconveniences for your project, wish I could help.

Alternatives

In the meantime, you may try to:

  • dump the Pydantic data type into a format that is accepted by the current driver implementation
  • create a wrapper, such as extending the AQL class yourself to make it work
  • intentionally silence the warning using # type: ignore

IMHO a combination of the latter two would be the most desirable. Creating a sub-class having, for example, and AQL that supports Pydantic, but where you intentionally skip the warning from the underlying implementation, would allow to skip warning from the driver, while still making Pyright aware of your Pydantic usage:

def foo(t: datetime) -> None:
    # use the custom serialiser in the ArangoClient
    ArangoClient(serializer=serialiser).db().aql.execute(
        "INSERT @document INTO @@collection", bind_vars={"document": {"timestamp": t}, "@collection": "my_collection"}  # type: ignore
    )


foo(datetime.now(tz=timezone.utc))  # no warning
foo(3) # pyright complains

@apetenchea apetenchea self-assigned this May 20, 2024
@marcuslimdw
Copy link
Author

Hi @apetenchea, thanks so much for the detailed writeup + your effort! To be honest, I was not expecting anything like what you did - I just wanted to start a discussion on improving typing in python-arango.

I agree with your conclusions - what I would have liked to do, had I been at my current company when they started using ArangoDB, was to introduce an abstraction around the driver that handles such cross-cutting concerns, but I wasn't 😔 such is life.

I understand that you have lots of do and that this is a relatively minor improvement, so thanks once again! Would you like me to close this issue?

@apetenchea
Copy link
Member

apetenchea commented May 21, 2024

It's great to have such discussions within the community, feel free to bring them up anytime. Although my lack of availability can sometimes take a toll on the development, I'm always up for improving the driver.
It's unfortunate that, at least for the time being, it is complicated to generalize codebases such as this one. I'm not an expert in Python generics, but in my experience the story is still young within the broader community. Although available since Python 3.5, there's lots of libraries which still have not integrated any type hinting at all, not to mention generics. And the syntax is still improving (generics in 3.12), which can make it even trickier to fully support (e.g if we want to support 3.9, we can't use a library that has 3.12-style type hints).
There's definitely lots of potential, and any new Python project will most probably going to offer typing support (such as the upcoming async driver), but I think it's going to take a lot of traction to migrate all of the old codebases to the newer syntax.
As you pointed out, a wrapper is the quickest workaround to adjust in such circumstances. If there's nothing else you want to discuss for now, then yes, please go ahead and close the issue :)

@apetenchea
Copy link
Member

Closing as not planned. Feel free to open another issue for discussing potential improvements.

@apetenchea apetenchea closed this as not planned Won't fix, can't repro, duplicate, stale May 25, 2024
@marcuslimdw
Copy link
Author

whoops, I reopened it because I wanted to close as not completed and I think the second request didn't go through

thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants