
Sktime semantic data types for time series & vision #194

Open · fkiraly opened this issue Mar 28, 2022 · 7 comments
Labels: enhancement (New feature or request)

fkiraly commented Mar 28, 2022

I've recently been made aware of this excellent and imo much needed library by @lmmentel.

The reason is its similarity to the datatypes module of sktime, which introduces semantic typing for time series-related data types - we distinguish "mtypes" (machine representations) and "scitypes" (scientific types, what visions calls a semantic type). More details here as reference.
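In a nutshell, the module's interface looks roughly like this (illustrative; exact signatures and the full list of mtype strings are in the datatypes docs):

# rough illustration of the sktime datatypes interface discussed above;
# see the sktime docs for exact signatures and mtype strings
import pandas as pd
from sktime.datatypes import check_is_mtype, convert_to

y = pd.Series([1.0, 2.0, 3.0])

# check an mtype, optionally returning metadata (e.g., number of time stamps)
valid, _, metadata = check_is_mtype(y, mtype="pd.Series", return_metadata=True)

# convert between mtypes of the same scitype ("Series" here)
y_np = convert_to(y, to_type="np.ndarray")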

A few questions for the visions devs:

  • the time series field is notoriously splintered in terms of data representation, and even more so when it comes to learning tasks (as in your ML example). Do you see visions moving in the direction of typing for ML?
  • would you have time to look into the sktime datatypes module and assess how similar this is to visions? If similar, we might be tempted to take a dependency on visions and contribute. Key features are mtype conversions, scitype inference, checks that also return metadata (e.g., number of time stamps in a series, which can be represented 4 different ways)
ieaves commented Mar 29, 2022

Hey @fkiraly - thanks for reaching out! I haven't had a chance to fully grok everything sktime is doing but I might be able to provide a few thoughts based on the high level API and a quick read of the check and convert implementations.

While I use visions fairly regularly for type inference and basic data cleaning, we really think of the project as a library for library authors (like you). It offers a structured API for working with complex data types and type systems. While we definitely had ML applications in mind when we wrote the library, our goal is to help developers build those use cases rather than building them ourselves.

We have a data compression library named compressio, which we wrote to show how easy it is to build on type-based logic; it might be of some interest to you if you decide to use visions. A common motif we've used is Data -> Infer Type -> Take action based on type; you can see what that looks like in our implementation here.
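The motif in (pseudo-)code, with made-up names just to illustrate the shape:

# illustrative pseudo-code of the Data -> Infer Type -> Take action motif;
# COMPRESSION_MAP and the typeset methods are placeholders, not the compressio API
def compress(data, typeset):
    inferred_type = typeset.infer_type(data)     # Infer Type
    compressor = COMPRESSION_MAP[inferred_type]  # pick an action based on the type
    return compressor(data)                      # Take action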

From what I'm seeing there are a few differences between visions and sktime which might inform the rest of the conversation.

Scitypes and Mtypes

In sktime, the same scitype can be implemented by multiple mtypes. For instance, sktime allows the user to specify time series as pandas.DataFrame, as pandas.Series, or as a numpy.ndarray. These are different mtypes which are admissible representations of the same scitype, "time series".

Visions doesn't have a formal abstraction for the data container the way you've developed your notion of mtypes; instead, we provide an API for registering container-specific implementations of all necessary type methods and a dispatch-based system for using them. The end result is the same as far as I can tell; we decided to go with a dispatch approach so that the API was identical regardless of the underlying container.

Type Implementation

Both scitypes and mtypes are encoded by strings in sktime, for easy reference.

In visions, types are encoded as objects with a standard API (including a __str__ method, of course). Within the context of sktime I don't think this matters, other than, potentially, needing some sort of internal mapping between the string name and the visions type, e.g.

TYPE_MAP = {
    "pd-multiindex": sktime.types.pdMultiindex  # Or however it might be named
}

# convert_to would look up TYPE_MAP when passed a string argument as to_type
convert_to(X, to_type="pd-multiindex")

Typesets

From what I can tell, sktime has a single collection of types used throughout the package. Visions, in contrast, has the notion of a Typeset, which is somewhat akin to a namespace for types / type implementations.

Typesets - Graphs / Sets

For each typeset, we construct a directed dependency graph between types. The easiest way to think about this is through the lens of sets. Taking a math example: Integer is a subset of Real, so the equivalent visions type implementation would have an IdentityRelation between the two. We might have a third String type with an InferenceRelation to Integer which applies to sequences like ['1', '2', '3'] and maps the sequence to the integers [1, 2, 3].

Type inference within visions is the same as graph traversal across these relations. We actually have two forms of traversal: detection, which involves no modification to the underlying data (IdentityRelations), and inference, where modification is permitted (InferenceRelations).
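To make that concrete, with the built-in StandardSet that would look something like (a minimal sketch):

import pandas as pd
from visions.typesets import StandardSet

typeset = StandardSet()
series = pd.Series(["1", "2", "3"])

typeset.detect_type(series)  # String - no modification to the data
typeset.infer_type(series)   # Integer - follows the InferenceRelation from String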

One nice thing you get in this construction is the ability to perform inference across many hops in the graph all in one go. For example, maybe you're passed a sequence of ordered YYYY-MM-DD strings with daily frequency; infer might do the following:

String -> 
Datetime (pd.Timestamps) -> 
Date (timestamps but identified to have no sub-daily values) -> 
Daily (type of pandas.Period)

The end result would be a pandas.Period with freq='D' (this is just an example; I may be misusing pandas' implementation of Period).

Because type dependency is explicitly encoded here it's also okay for a sequence to belong to many types (meaning issues like this are not a concern either).


Key features are mtype conversions, scitype inference, checks that also return metadata (e.g., number of time stamps in a series, which can be represented 4 different ways)

All of this should be doable, most actions in visions end up in the traverse_graph function at some point. It tracks three pieces of information:

  1. A copy of the original sequence (potentially modified as it traverses through the graph)
  2. The sequence of types it traversed through
  3. A state object - this is just a dictionary; you can use it to store any metadata you want.

We would be very happy to help you guys in any way we can. Happy to hop on a zoom call as well!

lmmentel commented Apr 3, 2022

This is really hugely helpful @ieaves, thank you for putting in the time to elaborate on how a type system might be applicable in our case. 🙌

I've made a few attempts in the past to break down time series into smaller specialized objects and tried to build complexity back up again, but I hit a few dead ends. After what you've written, I think I can see how a type system might help here. I would like to take another stab at the problem, and I'm wondering what you would recommend as a good place to start when developing a type system?

ieaves commented Apr 3, 2022

Hey @lmmentel, I'm so glad! If you're thinking about using visions there are a couple of examples of building typesets up from scratch available here. I've also written a blog post that (might) have some value, if only for walking through the thought process we were using when writing the basic typesets, here. I think each of those resources is up to date with the current API, but if not don't hesitate to ping me for clarification.


The first thing to do is to write down the set of types you need for your problem space. The scitypes you've already defined will be a good place to start here. Focus on what they are rather than what they aren't. It looks to me like scitypes have some nested structure as well, e.g. Table could be a List[Dict[str, Union[float, int, str]]] or a pd.Series[not Object] (former and latter, respectively).

You can tackle that type in a couple of ways - one would be separate implementations of each type method (contains_op and relations) for each container.

@Table.contains_op.register
def table_contains_series(series: pd.Series, state: dict) -> bool:
    # pd.Series implementation: a Table-like series must not be of object dtype
    return series.dtype != "object"


PRIMITIVE_TYPES = (float, int, str)

@Table.contains_op.register
def table_contains_list(series: list, state: dict) -> bool:
    # list-of-dicts implementation: every value must be a primitive type
    for d in series:
        for value in d.values():
            if not isinstance(value, PRIMITIVE_TYPES):
                return False

    state['thing you want to save'] = 'your information'
    return True

The decorator can be read as {The type you are modifying}.{method to modify}.register; it creates a new dispatched method based on the type signature in the function definition - (pd.Series, dict) in the first case and (list, dict) in the second.

A second approach might be more like what we did in our example ML typeset. In that case, typesets are nested, with one handling primitive types and a second capturing higher-order types like OrdinalRegression.

Generally speaking, we advise using dispatch when your type defines an idea independent of its container. For example, latitude / longitude should be latitude / longitude regardless of whether I feed you the same sequence as a pd.Series, list, tuple, np.ndarray, or something else altogether. You can see how we structured that idea in the backends section of our codebase.
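As an illustration (Latitude here is a hypothetical type, not one that ships with visions):

import pandas as pd

# hypothetical Latitude type: one semantic idea, one contains_op implementation
# registered per container via dispatch
@Latitude.contains_op.register
def latitude_contains_series(series: pd.Series, state: dict) -> bool:
    return series.between(-90, 90).all()


@Latitude.contains_op.register
def latitude_contains_list(sequence: list, state: dict) -> bool:
    return all(-90 <= value <= 90 for value in sequence)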

If you can write down what semantic types are and the attributes you think they ought to have, things become relatively easy. You need a few basic concepts:

  • Semantic type -> An instance of VisionsBaseType
  • contains_op -> defines whether data is of that type, so Type.contains_op(sequence) -> True if sequence is Type and False if not (we usually use in syntax for this though, e.g. sequence in Type).
  • relations -> An instance of TypeRelation describes the connection between two types.

Relations have two methods. Let's say we have a relation between types A and B that defines a mapping from A to B.

  • Relation.relationship(series) -> "Series of type A is also of type or can be transformed into type B"
  • Relation.transformer(series) -> "Performs the transformation from A to B"

Bringing it together with the Float dtype

from typing import Any, Sequence

import numpy as np
import pandas as pd
from multimethod import multimethod

from visions import Complex, Generic, String, VisionsBaseType
from visions.relations import IdentityRelation, InferenceRelation, TypeRelation


class Float(VisionsBaseType):
    """**Float** implementation of :class:`visions.types.type.VisionsBaseType`.

    Examples:
        >>> import visions
        >>> x = [1.0, 2.5, 5.0]
        >>> x in visions.Float
        True
    """

    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        relations = [
            IdentityRelation(Generic),
            InferenceRelation(String),
            InferenceRelation(Complex),
        ]
        return relations

    @staticmethod
    @multimethod
    def contains_op(item: Any, state: dict) -> bool:
        # base dispatch target; container-specific implementations are registered separately
        pass


@Float.register_relationship(Complex, pd.Series)
def complex_is_float(series: pd.Series, state: dict) -> bool:
    # the Complex series can be treated as Float when no element has an imaginary part
    return all(np.imag(series.values) == 0)


@Float.register_transformer(Complex, pd.Series)
def complex_to_float(series: pd.Series, state: dict) -> pd.Series:
    return series.astype(float)

We've got a new class defining the Type, with three relations declared in get_relations:

  1. An IdentityRelation(Generic) - the Generic type is the root of every typesystem.
  2. An InferenceRelation(String) - Encodes mappings from String -> Float
  3. An InferenceRelation(Complex) - Encodes mappings from Complex -> Float

You'll notice relations defined on a type map TO the type, not the other way around.

complex_is_float translates to "The Complex series is actually Float if all of the imaginary elements are 0".

complex_to_float explicitly coerces the series to float.

The transformer on an IdentityRelation is essentially always lambda x: x and the relation method defaults to contains_op unless otherwise specified.

Now it's just a process of bootstrapping your way up through each type you need. Once you've defined your types you can compose them together (literally, just addition, e.g. Float + Integer -> Typeset([Float, Integer])) and you're in business :).
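Spelled out with the types from the example above (a minimal sketch, assuming infer_type as the entry point):

# composing the custom types into a typeset via addition, as described above
my_typeset = Float + Integer + String + Complex

# the resulting typeset can then drive detection / inference over the relation graph
my_typeset.infer_type(pd.Series([1 + 0j, 2.5 + 0j]))  # -> Float, via the Complex -> Float relation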

fkiraly commented Apr 4, 2022

Yes, @ieaves, thanks! Sorry for the late answer, I've taken some time to digest over the weekend.

If I understand correctly what you are saying, please correct me if wrong:

  • visions supports the check and type inference aspects of sktime out of the box (as part of its core feature API), and as far as we can see it could be easily adapted. It does even more, such as allowing multiple types or a type hierarchy, which is pretty neat.
  • visions does not support "machine type" inference
  • visions does not support conversion logic

I do agree with point 1 and point 3; however, point 2 I think is not correct based on what you say?

Since you have a type hierarchy, you could simply make mtypes subtypes of the scitype, no? You would have to designate some types as mtypes and some as scitypes, but types can have types too, so that would not be a problem.

Do you agree or disagree?

What would be nice imo:

  • conversion functionality out-of-the-box, and/or a concept of visions types having, potentially, multiple "machine implementations" which can be converted between. Possibly out of scope for visions?
  • automated type checking based on function signature annotation, e.g., sth like pydantic but based on visions type checks.

Re call - that would be nice!
We'd have to check who in the sktime community would be interested and when the right time is - Fridays are typically good.
It looks like @lmmentel would also like to join :-)

ieaves commented Apr 4, 2022

point 2 I think is not correct based on what you say?

No, you're right, it definitely does. There's no intrinsic difference between machine types and scitypes within visions. They are each just different semantics. You could very easily have something like

# (imports as in the Float example above)
class PandasSeries(VisionsBaseType):
    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        relations = [
            IdentityRelation(Generic),
        ]
        return relations

    @staticmethod
    @multimethod
    def contains_op(item: Any, state: dict) -> bool:
        return isinstance(item, pd.Series)

class NumpyArray(VisionsBaseType):
    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        relations = [
            IdentityRelation(Generic),
        ]
        return relations

    @staticmethod
    @multimethod
    def contains_op(item: Any, state: dict) -> bool:
        return isinstance(item, np.ndarray)

visions does not support conversion logic

It definitely does!

  • Typesets -> a graph
  • Types -> nodes in the graph
  • Conversions (aka relations in our parlance) -> edges between nodes

Every relation requires two methods, one for validating whether a conversion is appropriate, and another for actually performing the conversion (see the complex_is_float and complex_to_float example above).

conversion functionality out-of-the-box, and/or a concept of visions types having, potentially, multiple "machine implementations" which can be converted between.

Taking the pd.Series / np.ndarray example above, you can implement conversions by adding an InferenceRelation into get_relations on both types and implementing the following methods.

@NumpyArray.register_relationship(PandasSeries, pd.Series)
def pandas_is_numpy(series: Any, state: dict) -> bool:
    return True


@NumpyArray.register_transformer(PandasSeries, pd.Series)
def pandas_to_numpy(series: Any, state: dict) -> np.ndarray:
    return np.asarray(series)


@PandasSeries.register_relationship(NumpyArray, np.ndarray)
def numpy_is_pandas(series: Any, state: dict) -> bool:
    return True


@PandasSeries.register_transformer(NumpyArray, np.ndarray)
def numpy_to_pandas(series: Any, state: dict) -> pd.Series:
    return pd.Series(series)
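For completeness, the "adding an InferenceRelation into get_relations" part would look roughly like this on the NumpyArray side (a sketch; the PandasSeries side is symmetric):

class NumpyArray(VisionsBaseType):
    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        return [
            IdentityRelation(Generic),
            InferenceRelation(PandasSeries),  # enables the PandasSeries -> NumpyArray conversion
        ]

    @staticmethod
    @multimethod
    def contains_op(item: Any, state: dict) -> bool:
        return isinstance(item, np.ndarray)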

automated type checking based on function signature annotation, e.g., sth like pydantic but based on visions type checks.

That's a really good idea! It wouldn't be particularly difficult to implement if there was interest. Maybe something to discuss on a call. If y'all want to get together and find a time that works on your end just grab a time off my calendly!

EDIT: There is some risk of overcomplicating things, but in the Pandas -> Numpy, Numpy -> Pandas example above we would normally exclude cyclic relations between types, because we try to offer automatic type inference and we can't resolve cycles automatically (it's like ping-ponging between types). That being said, if you aren't worried about fully automated inference the cycles won't affect you; we would just add something like this for you:

def cast_along_path(series, graph, path, state=None):
    # walk the relation graph along 'path', applying each edge's transformer in turn
    state = {} if state is None else state
    base_type = path[0]
    for vision_type in path[1:]:
        relation = graph[base_type][vision_type]["relationship"]
        series = relation.transform(series, state)
        base_type = vision_type  # advance to the next edge in the path
    return series

Path is just a direction to travel through the graph; to go from PandasSeries to NumpyArray it would be the list [PandasSeries, NumpyArray].

fkiraly commented Apr 5, 2022

It definitely does!

Taking the pd.Series / np.ndarray example above you can implement conversions by adding an InferenceRelation into get_relations on both types and implementing the following methods.

Oh, that's neat!
The register logic basically parallels the convert_dict registration which is somewhat more manual. I'd assume it's the same logic.

Related questions:

  • we have a concept that a conversion might be "lossy", e.g., a conversion pandas to numpy may lose the column and row indices. When you convert back, and you know the conversion is a "back" conversion, you want to pick the lost information up from a temporary storage. Could we do that somehow? This is crucial for our application.
  • @chrisholder implemented nice experimental functionality for identifying a "path" to conversion in the conversion graph, via Dijkstra, when a direct conversion is not defined. We did not merge this since it did not take into account "lossy" vs non-lossy conversions, but I was going to look at it again once other work is out of the way.
  • @chrisholder also implemented nice experimental functionality using enums that allows a user to quickly access the types/strings, e.g. typing Scitype. and then the dropdown menu suggests a number of possible scitypes, instead of having to look them up; similar for Mtype. - is sth like this possible with visions?

ieaves commented Apr 5, 2022

The register logic basically parallels the convert_dict registration which is somewhat more manual. I'd assume it's the same logic.

Yes, it's really just syntactic sugar to make development easier, particularly when working with multiple backends.

we have a concept that a conversion might be "lossy", e.g., a conversion pandas to numpy may lose the column and row indices. When you convert back, and you know the conversion is a "back" conversion, you want to pick the lost information up from a temporary storage. Could we do that somehow? This is crucial for our application.

I think the easiest way to accomplish this is to use the state dict. You might do something like

@NumpyArray.register_transformer(PandasSeries, pd.Series)
def pandas_to_numpy(series: Any, state: dict) -> np.ndarray:
    state['index'] = series.index
    return np.asarray(series)


@PandasSeries.register_transformer(NumpyArray, np.ndarray)
def numpy_to_pandas(series: Any, state: dict) -> pd.Series:
    return pd.Series(series, index=state.get('index', None))

That dictionary will be passed up and down the stack and can be recovered whenever needed.
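Calling those transformers directly just to illustrate the round trip (in practice the graph traversal passes state around for you):

state = {}
series = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

arr = pandas_to_numpy(series, state)    # state['index'] now holds the original index
restored = numpy_to_pandas(arr, state)  # index recovered from state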

@chrisholder implemented nice experimental functionality for identifying a "path" to conversion in the conversion graph, via Dijkstra, when a direct conversion is not defined. We did not merge this since it did not take into account "lossy" vs non-lossy conversions, but I was going to look at it again once other work is out of the way.

We've solved this puzzle with the 'relationship' method e.g.

@Float.register_relationship(Complex, pd.Series)
def complex_is_float(series: pd.Series, state: dict) -> bool:
    return all(np.imag(series.values) == 0)

This way edges along the graph have explicit validation; it's not just that complex numbers can be coerced to floats (just drop the imaginary part) but that they should be, in some ontological sense. For us this was an advantage because it liberated us from requiring the user to specify what they wanted to cast to. Instead, the deepest element in the tree was the best-specified type for the data.

That being said, if you know what you want to cast to then Dijkstra would give you the cast path and you could use the cast_along_path snippet from above. You'll just need to be intentional about setting any metadata in the state dict as you write your relations. Some additional good news: internally we are using networkx, which already has an implementation of Dijkstra ready to go!
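Sketched out, with graph being the typeset's relation graph and the types as its nodes:

import networkx as nx

# shortest path from PandasSeries to NumpyArray through the relation graph;
# nx.dijkstra_path does the same with edge weights, e.g. to penalise lossy edges
path = nx.shortest_path(graph, source=PandasSeries, target=NumpyArray)

# then apply each conversion along the path with the cast_along_path snippet above
result = cast_along_path(series, graph, path)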

@chrisholder also implemented nice experimental functionality using enums that allows a user to quickly access the types/strings, e.g. typing Scitype. and then the dropdown menu suggests a number of possible scitypes, instead of having to look them up; similar for Mtype. - is sth like this possible with visions?

Yes, it should be. If I'm understanding things correctly this is equivalent to finding all connected nodes in the relation graph. There are two subgraphs we track on each Typeset - an "inferential" graph (changes the underlying data, e.g. Complex -> Float), and a "non-inferential" graph (no change to the underlying data, e.g. Object -> String). As I said, we are using networkx under the hood, so the list of reachable scitypes could be generated with something like:

import networkx as nx

from_type = "MyType"
# single_source_shortest_path returns {reachable_node: path_from_from_type}
type_paths = nx.single_source_shortest_path(graph, from_type)
type_enum = [path[-1] for path in type_paths.values()]

If we were to implement this for you it would look a bit like

your_typeset = PandasSeries + NumpyArray  # Any other types you wished to consider

your_typeset.accessible_types(PandasSeries)
-> [NumpyArray]

your_typeset.path_to_type(PandasSeries, NumpyArray) # Only two types so they are a single hop
-> [PandasSeries, NumpyArray]  

your_typeset.cast_to(numpy_array, PandasSeries)
-> pandas_series, state  # i.e. automatically uses the shortest path to coerce the initial numpy_array to a PandasSeries

EDIT: I realized there's a mistake in some of the code snippets I provided (this is what I get for spitballing): the API requires you to return the state dictionary as well, so the registered operations should actually look like:

@NumpyArray.register_transformer(PandasSeries, pd.Series)
def pandas_to_numpy(series: Any, state: dict) -> Tuple[np.ndarray, dict]:
    state['index'] = series.index
    return np.asarray(series), state
