Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Value_Type to In-Memory Column #5158

Closed
wdanilo opened this issue Feb 5, 2023 · 20 comments · Fixed by #6073
Closed

Add Value_Type to In-Memory Column #5158

wdanilo opened this issue Feb 5, 2023 · 20 comments · Fixed by #6073
Assignees
Labels
-libs Libraries: New libraries to be implemented l-table-column-datatypes p-low Low priority x-new-feature Type: new feature request

Comments

@wdanilo
Copy link
Member

wdanilo commented Feb 5, 2023

This task is automatically imported from the old Task Issue Board and it was originally created by James Dunkerley.
Original issue is here.


  • Update prototype Value type to current specification.
  • Adjust storage classes:
    • Hold the minimum and maximum values in the dataset (just the size needed)
    • Hold the nullability flag will be done separately in Implement nullable Value_Types #5872
    • Allow computation of the type from the values in the dataset
    • Allow setting of a type restriction on the column
  • Add set_value_type and value_type to the Column
  • Alter Table.info to return the value type of the column
  • Ensure builders create value typed if possible.
    Type is possible use Mixed in not.

From_Vector take a new optional argument with a fixed type.

@wdanilo wdanilo added this to the Beta Release milestone Feb 6, 2023
@jdunkerley jdunkerley removed their assignment Feb 6, 2023
@jdunkerley jdunkerley moved this to ❓New in Issues Board Feb 6, 2023
@jdunkerley jdunkerley moved this from ❓New to 📤 Backlog in Issues Board Feb 14, 2023
@radeusgd radeusgd moved this from 📤 Backlog to 🔧 Implementation in Issues Board Mar 9, 2023
@radeusgd
Copy link
Member

radeusgd commented Mar 9, 2023

  1. Core

    • Create a proper mapping from SQL Types into Value_Type.
    • Create a Dialect-specific SQL_Type_Mapping mapping from Value_Type to SQL type.
      • Add unit tests checking that the mapping is aligned with the table in the design doc.
      • Ensure if we can make SQL_Type_Mapping.value_type_to_sql injective.
        • Will not work in full with SQLite?
        • It is not injective as we are approximating some types on purpose. We report warnings in such cases when the approximation is inexact. That should be good enough for now.
        • Other issue is the reverse scenario - types like CLOB and TEXT that we both map into Value_Type.Char and so they cannot be distinguished - thus the mapping is not injective. We ignore this for now as it may not be a problem currently.
        • In particular, extend SQL_Type to allow additional parameters, to allow distinguishing types like fixed length text - e.g. CHAR(16).
      • Add SQL_Type_Mapping.to_code which will generate the Dialect specific type name, for example INTEGER. - currently probably not needed, may be needed for upload_table - TODO we will see
    • Add fits_in (bits) to numeric in-memory storage.
      • Aggregate min and max? This may be costly in some operations like masking.
      • Not sure if needed. Will be revised in next PR when working on constraining types of in-memory columns.
    • Decouple in-memory value_type from storage_type - add a value_type property on the column.
      • When creating a new column with a storage, check that the storage and value_type are compatible (i.e. the value_type must be of the same kind and have the same or greater size; or be Mixed).
      • Cannot be tested well without from_vector or set_value_type/cast.
      • After all, not sure if this is the best idea. It may be better to keep a storage<->value_type mapping and let storage keep information on the type data. This will make operations, like ensuring limited integer size, easier.
        • But needed to some extent - we want to be able to re-cast a type to Mixed in O(1) (ideally without changing the storage).
          • Still can be done by adding more metadata to storages.
      • Finally implemented this by adding more info to Storages and mapping Storage into a corresponding Value_Type. Mixed is handling by a special storage wrapper facade that changes its reported type from the actually stored type to the "Any" type. Still WIP until we get set_type...
  2. Tidying up

    • Ensure Table.info returns the Value_Type
      • and that it displays nicely in the GUI.
    • Add value_type parameter to Column.from_vector (Nothing by default which means that the type will be automatically inferred like now).
      • From_Vector take a new optional argument with a fixed type.
      • By default, infer the type.
      • If given values do not match the expected type, throw type mismatch error / does not fit the type error.
        • Figure out what the error type should be here.
        • Implement a Value_Type.is_valid_for function that checks if an Enso value can be stored in the given type.
          • Possibly relying on just InferredBuilder and checking its result type could be slightly more efficient, but it should not make a big difference, and this approach will allow us to raise an error containing an exact value that did not match the type, instead of just saying that some values were invalid. We can revisit this in the future if needed.
    • Rename Table.parse_values into parse?
      • This is suggested by designs. Are we doing it?
    • Make SQL_Type depend on the dialect.
      • Kind of done by removing the constants from SQL_Type and just relying on the mappings.
  3. Operations

    • Add checks on operations that check if the operand has right type (arithmetic require a numeric type, text need text).
    • Consolidate the return type based on operands (usually choose the larger type).
    • in-memory map and zip
      • Add expected_type parameter.
        • Acts exactly like Column.from_vector, i.e. infer by default, otherwise check and error if does not fit.
    • Arithmetic and text etc. operations will use the type consolidation for their type (numeric, text, generic for iif)
      • Consolidation will need to be partially context-specific - like Text + Text should not select the maximum length but instead add the lengths!
      • Probably worth having some module Value_Type_Unifier with various algorithms for numeric, text types, depending on context etc.
      • Add tests for overflows - new value and warnings.
      • If the operand type does not fit an operation (adding Text to Integer etc.), report Invalid_Value_Type error - but with expected not necessarily being a specific Value_Type but a description like "a numeric type".
    • Add a suite of tests that verify that the type merging algorithm is consistent with what SQL generates.
      • We could create tables with the given SQL types, perform each operation and check the metadata of the resulting type.
      • SQLite will be problematic.
      • No longer applicable - we are using the types that the Database reports to us - so this is aligned by definition. With the exception of SQLite Booleans which are not aligned on purpose to provide better UX.
  4. Conversions

    • Add cast to the in-memory column.
      • Allow setting of a type restriction on the column
      • This was called set_value_type in the designs.
      • Widening scenario - if types match and the goal is wider - just change it.
      • Converting scenario - stuff like .to_text or integer<->float
      • Introduce Lossy_Conversion error.
      • Truncating scenario - tests
        • Ensure that if the new type does not fit all numbers because the bit-size is too small, a Lossy_Conversion warning is reported.
        • Ensure if new Text type is too small and some values are truncated, Lossy_Conversion is reported.
    • Add cast to Database column.
      • This will mostly rely on the Value_Type <-> SQL_Type mapping and use the CAST operator.
      • Add tests for DB casting.
      • See if we can keep value_type and sql_type coupled.
        • The idea would be to resort on the mapping sql_type -> Value_Type. Cast will convert the desired Value_Type into a corresponding SQL Type and add a CAST expression.
        • Still, we want column.cast vt . value_type == vt, so this will rely on the property of Value_Type <-> SQL_Type such that forall vt, vt.to_sql_type.to_value_type == vt (i.e. to_sql_type will need to be injective).
    • Implement auto_value_types.
      • Will only work in-memory as too complicated for DB. Standard 'please materialize' error in DB.
      • Tests for use_smallest = True | False logic.

@enso-bot
Copy link

enso-bot bot commented Mar 10, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-03-09):

Progress: Taking care of my pending PRs (the CI was comlpetely clogged yesterday). Reviews. Created a detailed list of things needed for Value Types work (will still be improving as things proceed). Answered some uncertainties/open design questions (may be revised). Started adding first pieces of code for this - workin on the SQL_Type<->Value_Type<->Storage_Type conversion logic and accompanying infrastructure. {The estimate is very rough as its a huge task, it may also change if we decide to split up into smaller pieces - will depend on what will seem more effective} It should be finished by 2023-03-21.

Next Day: Next day I will be working on the same task. Continue. Hopefully have some prototype logic for SQL types and Storage type checks. Probably no tests yet.

@enso-bot
Copy link

enso-bot bot commented Mar 13, 2023

Radosław Waśko reports a new STANDUP for the provided date (2023-03-10):

Progress: Fixing some tests in the pending PRs to ensure they are mergable and rebasing and fixing conflicts. Fleshing out the spec for Value types, figuring out what is possible in various DB backends and how to best get a good common functionality. It should be finished by 2023-03-21.

Next Day: Next day I will be working on the same task. Add more type information to our storages. Then work on SQL type mapping.

@enso-bot
Copy link

enso-bot bot commented Mar 14, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-03-13):

Progress: Work on storage type information. Changed long into a more detailed StorageType. Ran into a weird Java compiler error - need to investigate why it's happening and possibly change code structure a bit to fix it. It should be finished by 2023-03-21.

Next Day: Next day I will be working on the same task. Continue that.

@enso-bot
Copy link

enso-bot bot commented Mar 15, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-03-14):

Progress: Implemented new approach to storage types in-memory (first stage, more fine-grained control like <64bit ints and fixed length strings will be done in a separate iteration). Fixed many issues making the test suite pass again after this refactor. It should be finished by 2023-03-21.

Next Day: Next day I will be working on the same task. Now go on to implement the SQL type mapping.

@enso-bot
Copy link

enso-bot bot commented Mar 16, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-03-15):

Progress: Researching peculiarities of SQLite JDBC driver. Apparently it may allow us to do more than pure SQLite does and we may still keep a Boolean type for columns in SQLite - which could be cool. Created a (yet untested) prototype SQL mapping focused on SQLite. It should be finished by 2023-03-21.

Next Day: Next day I will be working on the same task. Finish the mapping and test it. Work on PostgreSQL mapping - independently at first, but we can later think if we can extract the common parts as there is a (more significant than I initially thought!) overlap.

@enso-bot
Copy link

enso-bot bot commented Mar 17, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-03-16):

Progress: Working on the other direction type translation. Verifying what are exactly our usages of sql type to better understand how to best handle uncertain cases. Experimenting with where we can omit the type without issue and where we could try to use JDBC to infer it. It should be finished by 2023-03-21.

Next Day: Next day I will be working on the same task. Check if we can try use JDBC to figure out most types for us. Update SQLite translation. Start Postgres.

@enso-bot
Copy link

enso-bot bot commented Mar 20, 2023

Radosław Waśko reports a new STANDUP for the provided date (2023-03-17):

Progress: Added unit tests for SQLite and Postgres type mappings. Implement prototype for Postgres, update SQLite to slightly changed requirement for it. Work on fetching types of operation results from the Database instead of figuring them out in our code. Create groundwork for a simple Lazy type in Enso allowing us to compute a value only once and cache it. It should be finished by 2023-03-21.

Next Day: Next day I will be working on the same task. Finish the work on fetching types of op results from DB. Add tests for Lazy and fix it. Fix the mappings.

@radeusgd
Copy link
Member

radeusgd commented Mar 20, 2023

TODO list before the first PR:

  • Finish Postgres mapping to pass all tests.
  • Add overrides for Boolean operation types for SQLite.
    • Ensure we can override types for some operations.
    • How to wire this nicely so that the overrides only happen in SQLite?
      • Adding infer_return_type to the type mapping.
    • Clean-up column operations from obsolete new_type internal arguments.
  • Change all col.sql_type.is_definitely_TYPE into col.value_type.is_TYPE.
    • Remove the is_definitely_*** from SQL_Type.
    • Remove the constants for types from SQL_Type - we should rely on the backend-specific mappings instead.
    • Move Builders in Result_Set to depend on the Type_Mapping - add a place to later plug-in custom Postgres date handling logic.
  • Ensure tests are passing.

@enso-bot
Copy link

enso-bot bot commented Mar 21, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-03-20):

Progress: Working on fetching types and updating the mappings. It should be finished by 2023-03-21.

Next Day: Next day I will be working on the same task. Continue that, do stuff from my PR checklist and work towards preparing the first PR.

@enso-bot
Copy link

enso-bot bot commented Mar 22, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-03-21):

Progress: Wiring through the type override logic. Fixed Postgres mapping. Fixing the SQLite override - apparently there are some more edge cases. It should be finished by 2023-03-21.

Next Day: Next day I will be working on the same task. Continue that, do stuff from my PR checklist and work towards preparing the first PR.

@enso-bot
Copy link

enso-bot bot commented Mar 22, 2023

Radosław Waśko reports a new 🔴 DELAY for today (2023-03-22):

Summary: There is 3 days delay in implementation of the Add Value_Type to In-Memory Column (#5158) task.
It will cause 0 days delay for the delivery of this weekly plan.

Delay Cause: Unexpected SQLite behaviour being a pain. Underestimated how much cleaning will be needed after these changes.

Possible solutions: Hard to estimate big tasks well, the original estimate was a ballpark anyway.

@enso-bot
Copy link

enso-bot bot commented Mar 22, 2023

Radosław Waśko reports a new STANDUP for today (2023-03-22):

Progress: Ensuring that the tests are passing at the current state. Cleaning up. It should be finished by 2023-03-24.

Next Day: Next day I will be working on the same task. Fix remaining tests. Remove outdated SQL_Type references in preparation for PR.

@enso-bot
Copy link

enso-bot bot commented Mar 24, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-03-23):

Progress: Got all tests to pass again. It should be finished by 2023-03-24.

Next Day: Next day I will be working on the same task. Do some further cleanup and prepare the PR.

@radeusgd
Copy link
Member

radeusgd commented Mar 24, 2023

The major PR #6073 is ready, save for fixing conflicts and adding one more small functionality.

I think the remaining types work (parts noted here above, parts tracked by separate tasks), is something along these lines:

  1. Add Day, Month, Year operations on Columns - both in DB and in-mem.
    Add Date/Time support to Postgres backend and implement basic year/month/date operations. #6115
    Date/time support for Postgres. Year/month/day operations on Columns. #6153
  2. Implement Union for DB
    Add Table.union to the Database Table. #5235
  3. Implement Full Join workaround for SQLite
    Implement a workaround for Full joins in SQLite #5254
  4. Implement cast (in memory and DB)
    Implement Table/Column.cast #6112
  5. Add missing type checks to column operations (inmem and DB, mostly inmem I think)
    Add missing type-checks to column operations #6106
  6. Add expected type checks to map, zip and Column.from_vector (in-memory)
    Add value_type support to from_vector, map and zip in in-memory Column. #6111
  7. Improve variety of in-memory storage types (add 16/32-bit ints, fixed length strings) and adapt any functions to support these.
    Extend in-memory storage to support Integers other than 64-bit, and fixed-length/length-capped strings #5159
  8. Implement auto_value_types and update Table.parse_values to use Value_Type.
    Implement Table.auto_value_types #6113
    Adjust Table.parse_values and Column.parse to use the Value_Type type #5660
  9. Extend the Value_Type system with the concept of nullability.
    Implement nullable Value_Types #5872
    • This may be a bigger chunk and possibly split into multiple PRs too.

@enso-bot
Copy link

enso-bot bot commented Mar 27, 2023

Radosław Waśko reports a new 🔴 DELAY for the provided date (2023-03-24):

Summary: There is 4 days delay in implementation of the Add Value_Type to In-Memory Column (#5158) task.
It will cause 0 days delay for the delivery of this weekly plan.

Delay Cause: A bit more time needed to finish the PR and get it reviewed.

@enso-bot
Copy link

enso-bot bot commented Mar 27, 2023

Radosław Waśko reports a new STANDUP for the provided date (2023-03-24):

Progress: Removed obsolete functionality and adapted its usages to ensure tests keep passing - chnged existing places checking operation types to rely on Value Type instead of not-as-portable SQL type. Implemented ability to set custom fetching logic for particular types by each dialect (preparing for custom Dates handling in Postgres). It should be finished by 2023-03-28.

Next Day: Next day I will be working on the same task. Add ability to custom setting statement values by dialect (so that we can have custom date handling logic for Postgres). Align follow-up tasks for Value Types work.

@enso-bot
Copy link

enso-bot bot commented Mar 28, 2023

Radosław Waśko reports a new STANDUP for yesterday (2023-03-27):

Progress: Add customization for logic of setting statement values by each dialect (will be used by dates). Add a test and ensure that custom user-provided queries cannot contain "?" placeholders (this would completely misalign our own generated queries and everything would go insane if allowed). Fighting with some bugs. The PR is finally fully ready and passing tests. It should be finished by 2023-03-28.

Next Day: Next day I will be working on the same task. Align follow-up tasks for Value Types work. Start the first one (if done quickly I may add it to the pending PR) - add Day/Month/Year operations and ensure support for Dates in Postgres backend works as intended - add some tests for date support that will work both in Database and in-memory backends (current ones are mostly in-memory only).

@jdunkerley jdunkerley moved this from 🔧 Implementation to 👁️ Code review in Issues Board Mar 28, 2023
@enso-bot
Copy link

enso-bot bot commented Mar 30, 2023

Radosław Waśko reports a new 🔴 DELAY for the provided date (2023-03-28):

Summary: There is 3 days delay in implementation of the Add Value_Type to In-Memory Column (#5158) task.
It will cause 0 days delay for the delivery of this weekly plan.

Delay Cause: More time needed to get the PR reviewed. Some unexpected test errors needed fixing. Trying to reproduce an imports bug (later fixed by Hubert, thankfully) also took some of the time.

@enso-bot
Copy link

enso-bot bot commented Mar 30, 2023

Radosław Waśko reports a new STANDUP for the provided date (2023-03-28):

Progress: Fixed some failing tests in the pending PR. Planned out follow-up tasks for Value Types - aligned existing tasks and updated their specifications, added missing tasks and added ballpark estimates. It should be finished by 2023-03-31.

Next Day: Next day I will be working on the #6115 task. Work on Postgres date-time support.

@mergify mergify bot closed this as completed in #6073 Mar 31, 2023
mergify bot pushed a commit that referenced this issue Mar 31, 2023
This is the first part of the #5158 umbrella task. It closes #5158, follow-up tasks are listed as a comment in the issue.

- Updates all prototype methods dealing with `Value_Type` with a proper implementation.
- Adds a more precise mapping from in-memory storage to `Value_Type`.
- Adds a dialect-dependent mapping between `SQL_Type` and `Value_Type`.
- Removes obsolete methods and constants on `SQL_Type` that were not portable.
- Ensures that in the Database backend, operation results are computed based on what the Database is meaning to return (by asking the Database about expected types of each operation).
- But also ensures that the result types are sane.
- While SQLite does not officially support a BOOLEAN affinity, we add a set of type overrides to our operations to ensure that Boolean operations will return Boolean values and will not be changed to integers as SQLite would suggest.
- Some methods in SQLite fallback to a NUMERIC affinity unnecessarily, so stuff like `max(text, text)` will keep the `text` type instead of falling back to numeric as SQLite would suggest.
- Adds ability to use custom fetch / builder logic for various types, so that we can support vendor specific types (for example, Postgres dates).

# Important Notes
- There are some TODOs left in the code. I'm still aligning follow-up tasks - once done I will try to add references to relevant tasks in them.
@github-project-automation github-project-automation bot moved this from 👁️ Code review to 🟢 Accepted in Issues Board Mar 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
-libs Libraries: New libraries to be implemented l-table-column-datatypes p-low Low priority x-new-feature Type: new feature request
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants