No support for time with micro-second precision #353

Closed
yjiangnan opened this issue Jul 2, 2018 · 22 comments
Labels: kind/question, wontfix
@yjiangnan

Example:

cqlsh> SELECT totimestamp(now()) FROM myapp.stock_market;

 totimestamp(now())
---------------------------------
 2018-07-02 16:28:44.433000+0000
 2018-07-02 16:28:44.433000+0000
 2018-07-02 16:28:44.434000+0000
 2018-07-02 16:28:44.434000+0000
 2018-07-02 16:28:44.434000+0000
 2018-07-02 16:28:44.434000+0000

As one can see, even though there are 6 digits after the seconds, only the first 3 are non-zero. Therefore, the minimum time precision is only a millisecond. Is it possible to get it to microsecond precision?

I find this useful in my application. Suppose I have a high rate of requests from many users across multiple servers, so that a single millisecond can easily contain multiple requests. How do I then know the order of the requests? Does TIMEUUID guarantee the correct order? Additionally, if I want the users to know the order of the requests of all users (say I run a blockchain and want the users to be able to verify the results), how could that be done easily without adding much overhead (they do not have YugaByte installed)? Is it possible to add an id field corresponding to the order of insertion with an ACID guarantee? Thanks!

@kmuthukk
Collaborator

kmuthukk commented Jul 2, 2018

Hi @yjiangnan

  1. The timestamp type is modeled after Apache Cassandra's timestamp. To quote: <<< Values for the timestamp type are encoded as 64-bit signed integers representing a number of milliseconds since the standard base time known as the epoch: January 1 1970 at 00:00:00 GMT. >>>

  2. For additional precision, while we will not be able to change the behavior of the current type without breaking wire protocol compatibility, we can consider adding a finer-precision timestamp type as a new type in the future.

  3. <<< Additionally, if I want the users to know the order of the requests of all users >>>

It'll be helpful to know what schema you have in mind.

Do you need global order across all users? Or do you need order in the context of each user's actions? For example, is 'userid' the partition portion of your compound key, and do you have requests that need to be ordered within the context of each user (so you need an order_id which is guaranteed to be increasing and is the clustering portion of your compound key)?

Note that TIMEUUID is a unique time-based UUID which has a time component and additional components to make it unique. Additionally, within the context of a single partition (like a userid), it also has the property that it is in strictly increasing order. Will that work for your needs?
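
For illustration, here is a minimal gocql sketch of that layout. The keyspace (myapp), table (user_requests), and columns are hypothetical; userid is the partition key and the TIMEUUID req_id is the clustering column, so rows within one user's partition come back in time order:

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "myapp" // assumes this keyspace already exists
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Hypothetical table: userid is the partition key; req_id (TIMEUUID) is the
	// clustering column, so rows within one user's partition are kept in time order.
	if err := session.Query(`CREATE TABLE IF NOT EXISTS user_requests (
	        userid int, req_id timeuuid, payload text,
	        PRIMARY KEY (userid, req_id))`).Exec(); err != nil {
		log.Fatal(err)
	}

	// now() generates the TIMEUUID on the server.
	if err := session.Query(
		`INSERT INTO user_requests (userid, req_id, payload) VALUES (?, now(), ?)`,
		1, "request-1").Exec(); err != nil {
		log.Fatal(err)
	}

	// Reading one partition back returns that user's requests in insertion-time order.
	iter := session.Query(`SELECT req_id, payload FROM user_requests WHERE userid = ?`, 1).Iter()
	var reqID gocql.UUID
	var payload string
	for iter.Scan(&reqID, &payload) {
		fmt.Println(reqID, payload)
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}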

@yjiangnan
Author

Hi @kmuthukk,

Thanks for the reply. I need the order for both, but the focus is on global order across all users, and I need it in a way that is easily verifiable by a user even in case of network unreliability.

For example, in the case of online interactive gaming, a user establishes a REST or WebSocket connection to the server, and he wants to know exactly what his status and the other users' statuses are at the current moment. Let's also assume he is running an AI to learn from other users' visible action histories (some actions may be invisible to others), so the order of the data and data completeness are very important. In case of network instability, he constantly wants to know whether he is still connected and has retrieved all the historical data after a possible reconnection. Of course, this can be done by retrieving all the historical data since the last time via a REST API, but this obviously adds a lot of network overhead. It would be much easier to see proof in the data (most recently retrieved over WebSocket) that he has retrieved all the data. One way I have in mind is to put all the historical data in a chain so that each row of new data has a field linked to the previous one. E.g.,

[{"id":1, "data":"hash0", "last_id":0}, 
....
{"id":100, "data":"hash99", "last_id":99}]

However, I am not sure if there is a very efficient way to find out what the last insertion is for concurrent insertions. Additionally, if the "data" field is the hash of the content of the previous insertion (i.e., in a blockchain fashion), is it efficiently doable?

@kmuthukk kmuthukk added the kind/question This is a question label Jul 3, 2018
@kmuthukk kmuthukk added this to To Do in YCQL via automation Jul 3, 2018
@kmuthukk
Collaborator

kmuthukk commented Jul 3, 2018

hi @yjiangnan

a) A global, monotonically increasing counter in a sharded/scale-out database will need some kind of centralized service/coordinator (e.g., you can use a special single-row table to keep track of the counter and keep incrementing it in a race-condition-safe read-modify-write operation to hand out a monotonically increasing value to requesting clients). However, there's no guarantee that two concurrent write operations that use these counters commit in the correct order, i.e., an order compatible with the counter values.

b) The approach of chaining the actions in a linked list like scheme would work.

Question: Currently, in YCQL (our Cassandra based data model) we require the primary key (even when it is a compound key) to have at least 1 leading partition key. This means there isn't an easy way to do global order queries efficiently, because there's no efficient way to get the MAX(id) in the table without doing a scan. But we do plan to relax this restriction of requiring a partition key (and allow for the primary key to be based only on CLUSTER/sorted columns).

When we do that (i.e., implement range-based primary keys), finding the max id issued so far in log(N) time, and doing a conditional INSERT (with the IF NOT EXISTS clause) that protects against race conditions due to concurrent operations, should be straightforward.

For example, in YCQL pseudo-code:

  :prev_max <--- SELECT max(id) FROM tab;

  INSERT INTO tab (id, hash, prev_id) VALUES (:prev_max + 1, hash, :prev_max) IF NOT EXISTS;

If the INSERT fails due to concurrent operations, then the SELECT and INSERT have to be retried.

Currently, because YCQL requires a hash partition in the primary key, finding the max(id) will be O(N). If that's not desirable, an alternative would be to keep the current max in another single-row table (say, track_max):

  :prev_max <--- SELECT max_val FROM track_max;
  begin transaction
       INSERT INTO tab (id, hash, prev_id) VALUES (:prev_max + 1, hash, :prev_max) IF NOT EXISTS;
       INSERT INTO track_max (max_val) VALUES (:prev_max + 1);
  end transaction
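
For illustration, here is a rough Go sketch (using gocql) of the first, single-table variant above. The table tab and its columns follow the pseudo-code (they are not an existing schema), and the retry loop is just one way to handle a failed IF NOT EXISTS:

package main

import "github.com/gocql/gocql"

// appendToChain implements the SELECT-then-conditional-INSERT pattern sketched above:
// read the current max id, then insert max+1, retrying if a concurrent writer wins the race.
// Assumes a hypothetical table like: CREATE TABLE tab (id bigint PRIMARY KEY, hash text, prev_id bigint).
func appendToChain(session *gocql.Session, hash string) (int64, error) {
	for {
		var prevMax int64
		// Note: with a hash-partitioned primary key this max(id) scan is O(N) today, as discussed above.
		if err := session.Query(`SELECT max(id) FROM tab`).Scan(&prevMax); err != nil {
			return 0, err
		}

		applied, err := session.Query(
			`INSERT INTO tab (id, hash, prev_id) VALUES (?, ?, ?) IF NOT EXISTS`,
			prevMax+1, hash, prevMax).MapScanCAS(map[string]interface{}{})
		if err != nil {
			return 0, err
		}
		if applied {
			return prevMax + 1, nil // this call now owns id = prevMax+1
		}
		// Another writer inserted id = prevMax+1 first; re-read the max and retry.
	}
}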

@yjiangnan
Author

Hi @kmuthukk,

Thank you very much for the reply. I thought about using a row to track the max id, too, but placing it in the same table with id 0. Would that still work and be better if I have multiple tables to track?

I am not sure on what scope the ACID transaction locks. Does it force all the operations on the cluster to be executed sequentially, or just do it per keyspace, or per table, or just the primary keys that would be affected by the transaction? I think this would have huge implications for the scalability of the cluster if, say, each user could get involved in multiple games and he has some credit that can be used to purchase equipment in each game, so that the credit would be shared and modified from multiple games. If the ACID transaction locks operations globally, then all the games have to be executed sequentially, making it very slow. But if only the affected primary keys are locked, then all the games could be executed in parallel.

Additionally, for the last example, is it better to place SELECT max_val INTO :prev_max FROM track_max; inside the transaction? I suppose a different server trying to do an insertion at the same time would get the same prev_max and create a problem. If placing it inside the transaction, is it correct that the transaction would be guaranteed to succeed even if IF NOT EXISTS is removed? And why do you need INTO :prev_max? Is it just for better readability or is it really necessary with specific purposes?

By the way, what is the most efficient hash function for the serialized string of the previous insertion that can be called inside the transaction? Thanks again.

@rkarthik007
Collaborator

Hi @yjiangnan,

I thought about using a row to track the max id, too, but placing it in the same table with id 0. Would that still work and be better if I have multiple tables to track?

Having two tables or a single table with a fixed id (0 in this case) works about the same in terms of semantics and performance.

I am not sure on what scope the ACID transaction locks. Does it force all the operations on the cluster to be executed sequentially, or just do it per keyspace, or per table, or just the primary keys that would be affected by the transaction? ... But if only the affected primary keys are affected, then all the games could be executed in parallel.

ACID transactions only lock the rows (in fact, just the actual columns) they are modifying. So other, non-overlapping operations would not be affected. You would need to declare which tables are transactional using the table property transactions = { 'enabled' : true };

So in your case, the games would be executed in parallel since only modified columns are optimistically locked.
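
For reference, here is a minimal gocql sketch of declaring a table with that property; the game_credits table and its columns are made up for illustration:

package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "myapp" // assumes this keyspace already exists
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Hypothetical table; the transactions property marks it as eligible for distributed (ACID) transactions.
	if err := session.Query(`CREATE TABLE IF NOT EXISTS game_credits (
	        userid int, game_id int, credit bigint,
	        PRIMARY KEY ((userid), game_id))
	        WITH transactions = { 'enabled' : true }`).Exec(); err != nil {
		log.Fatal(err)
	}
}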

Additionally, for the last example, is it better to place SELECT max_val INTO :prev_max FROM track_max; inside the transaction? ... If placing it inside the transaction, is it correct that the transaction would be guaranteed to succeed even if IF NOT EXISTS is removed? Is it just for better readability or is it really necessary with specific purposes?

Today, we only support SNAPSHOT ISOLATION (SI) as the isolation level. Placing the read inside the transaction would require SERIALIZABLE isolation level. Because of this, the IF NOT EXISTS becomes important.

By the way, what is the most efficient hash function for the serialized string of the previous insertion that can be called inside the transaction?

Sorry, I didn't understand... could you please explain?

@kmuthukk
Collaborator

kmuthukk commented Jul 3, 2018

Hi @yjiangnan

  1. Regarding: <<< I thought about using a row to track the max id, too, but placing it in the same table with id 0. Would that still work and be better if I have multiple tables to track? >>>

Yes, keeping a special row (with id 0 for example) within the same table will work too.

  2. Regarding <<< I am not sure on what scope the ACID transaction locks. Does it force all the operations on the cluster to be executed sequentially, or just do it per keyspace, or per table, or just the primary keys that would be affected by the transaction? >>>

YugaByte takes finer-grained locks. We do not take DB/keyspace/table-level locks, as those would severely limit the concurrency the system can support (as you pointed out). Updates to different rows of a table can all proceed concurrently.

However, for the particular use case you mention, if you maintain a special row to track the max id issued so far - whether it is stored in the same table or a different table - that row will become a hotspot. So all updates related to that table (maybe a game in your case) will be serialized because every transaction on that table will also try to update that special row. But updates to different games can proceed concurrently.

regards,
Kannan

@yjiangnan
Author

Hi @rkarthik007
Thanks for the explanation. For example, take row s = '{"id":1, "data":"hash0", "last_id":0}', where s is a string representation of the row. One can calculate its hash with a function like this:
https://stackoverflow.com/questions/40190282/generate-c-bucket-hash-from-multipart-primary-key

In Cassandra one can define and call that function to calculate the hash of s and insert it into the row with "id":2 in the data column:
https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlCreateFunction.html

By using a hash, the user will also be able to verify the data integrity, so that nobody can change the historical data without changing all data in the row and being detected.

@rkarthik007
Collaborator

Hi @yjiangnan,

Thanks for that explanation. We do not support user-defined functions in YugaByte yet. Would the default hash function applied to the serialized string not work for you? cc @robertpang @m-iancu

@yjiangnan
Author

Hi @rkarthik007

What is the default hash function? Is it dependent on the content of the row so that changing anything in the row would necessarily change the hash? And is the default hash function implemented in common programming languages so that users can easily verify the integrity of the data themselves?

@rkarthik007
Collaborator

@robertpang or @m-iancu - could one of you please answer?

@yjiangnan - this is similar to the hash function that Cassandra uses by default. We have implemented it in C++ and Java, and it should be possible to port it to other languages.

@robertpang
Contributor

@yjiangnan Here is the Java implementation of our default hash function - https://github.com/YugaByte/cassandra-java-driver/blob/3.2.0-yb-x/driver-core/src/main/java/com/yugabyte/driver/core/utils/Jenkins.java. It is based on Bob Jenkins' hash function (http://burtleburtle.net/bob/hash/evahash.html). It is good for sharding / partitioning but not for ensuring data integrity.

For data integrity, I would suggest looking into other hash functions like SHA-2 (https://en.wikipedia.org/wiki/SHA-2).
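
As a sketch of that suggestion in Go, the standard library's crypto/sha256 gives a content-dependent digest of the serialized row (the row literal below is just the example string from the chaining discussion above) that clients in any language can reproduce:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func main() {
	// Serialized form of the previous row, as in the chaining example above.
	prevRow := `{"id":1, "data":"hash0", "last_id":0}`

	// SHA-256 of the serialized row; any change to the row changes the digest.
	sum := sha256.Sum256([]byte(prevRow))
	fmt.Println(hex.EncodeToString(sum[:]))
}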

@yjiangnan
Author

@rkarthik007 @robertpang I think almost any hash function would work in my case because the data to hash does not contain a nonce and there are not many ways to change the data that would cause a collision. However, the problem is where I can call it. If it can only be called outside of the database (e.g., by a client driver), then I think I can implement it by any means. However, this way may cause some delay and frequent conflicts under concurrency during the query-last-record --> hash --> insert process, as it requires two rounds of communication, although I do not have an estimate of the actual delay after cluster deployment.

@m-iancu
Contributor

m-iancu commented Jul 5, 2018

@yjiangnan if a collision chance of approx. 1 in 65536 is OK in your case, then you could use our hash function, which you can call through the CQL token function. However, the caveat is that token requires the arguments to match the schema of the partition key, so there are two options:

  1. all the columns that need to be hashed need to be part of the partition key (example below).
  2. Some values that match the expected types can be computed in the app (e.g. using something like row.toString).

Additionally, if needed, we may be able to add a generic hash builtin (that works similarly but does not have the partition key restriction).

I'm not sure if this example matches your exact question, perhaps you can clarify what your table schema and main query patterns are. Then we can provide a more concrete (and accurate) response.

Here is a simple example:

CREATE KEYSPACE example;
USE example;
CREATE TABLE test(id int, data bigint, last_id int, primary key((id, data, last_id)));
INSERT INTO test(id, data, last_id) values (0, -1, -1) ;
INSERT INTO test(id, data, last_id) values (1, token(0, -1, -1), 0);
SELECT token(id, data, last_id), id, data, last_id FROM test;
 token(id, data, last_id) | id | data                 | last_id
--------------------------+----+----------------------+---------
     -8611445437485809664 |  0 |                   -1 |      -1
     -7244040000625442816 |  1 | -8611445437485809664 |       0

You can see that the data value for id=1 corresponds to the internal hash (token.. column) for row 0.
Now, if you "change" row 0 the hashes will like not match anymore:

DELETE FROM test WHERE id = 0 AND data = -1 AND last_id = -1;
INSERT INTO test(id, data, last_id) VALUES (0, -2, -1) ;
SELECT token(id, data, last_id), id, data, last_id FROM test;

 token(id, data, last_id) | id | data                 | last_id
--------------------------+----+----------------------+---------
      7017452644373364736 |  0 |                   -2 |      -1
     -7244040000625442816 |  1 | -8611445437485809664 |       0

Would something like this help with your "nobody can change the historical data without changing all data in the row and being detected" question above?

@yjiangnan
Author

Hi @m-iancu

Thanks for the example. It is interesting that an almost random int64 integer can have a collision rate as large as 1 in 65536. xxHash for example offers a much lower collision rate:

http://fastcompression.blogspot.com/2014/07/xxhash-wider-64-bits.html

Then implementing xxHash or something in the SHA family would be a nice choice. Additionally, my purpose of keeping a hash to chain the data together is for users to easily verify the data. I am not sure if users can easily call a function to get the same result as the token function. I think xxHash is a nice hash function because it is extremely fast and collision resistant and is implemented in all major programming languages.

@m-iancu
Contributor

m-iancu commented Jul 5, 2018

@yjiangnan We actually use a uint16 hash internally (hence the 65536); the type is only presented as int64 to the user (for closer Cassandra compatibility), which causes the confusion. The purpose of our hashing function is mainly to force bucket-ization of the rows by partition key (for sharding), so collision resistance is not really its goal.

But, as mentioned above, if the general idea of the example matches your use-case we can use it as a starting point and provide another general builtin function (e.g. hash) that uses something like xxHash or SHA-2.

@yjiangnan
Author

@m-iancu

Great! Then I will wait for the new version of YugaByte and hope that it will come with one or more builtin hash functions, since the basic token does not suit my purpose well. For now, I will develop my system with the assumption that such a function already exists. Thanks!

@rkarthik007
Collaborator

@yjiangnan - do you mind opening a GitHub issue for a built-in hash function? Just to make sure we don't lose it in this issue. Thanks!

@yjiangnan
Author

Hi @m-iancu

How can I locate the server of a specific tablet? It seems that the tablet information is saved in the system.partitions table:

cqlsh> select * from system.partitions where keyspace_name='app' and table_name='test_now';

 keyspace_name | table_name | start_key | end_key | id                                   | replica_addresses
---------------+------------+-----------+---------+--------------------------------------+---------------------------------------------------------------------------
           app |   test_now |        0x |  0x2aaa | 3f11f106-da92-398e-cf49-d0e35f14df10 | {'127.0.0.1': 'FOLLOWER', '127.0.0.2': 'FOLLOWER', '127.0.0.3': 'LEADER'}
           app |   test_now |    0x2aaa |  0x5554 | 2a682093-32cb-899f-7546-d1b51c24ada3 | {'127.0.0.1': 'LEADER', '127.0.0.2': 'FOLLOWER', '127.0.0.3': 'FOLLOWER'}
           app |   test_now |    0x5554 |  0x7ffe | d556fe5a-1657-f5b5-c34e-b70bf0c52573 | {'127.0.0.1': 'LEADER', '127.0.0.2': 'FOLLOWER', '127.0.0.3': 'FOLLOWER'}
           app |   test_now |    0x7ffe |  0xaaa8 | 9c7244d2-d4a7-41a9-a849-8205a56792c9 | {'127.0.0.1': 'FOLLOWER', '127.0.0.2': 'LEADER', '127.0.0.3': 'FOLLOWER'}
           app |   test_now |    0xaaa8 |  0xd552 | 036c5139-722f-ccb3-e64e-382da3159854 | {'127.0.0.1': 'FOLLOWER', '127.0.0.2': 'FOLLOWER', '127.0.0.3': 'LEADER'}
           app |   test_now |    0xd552 |      0x | 70e21ea0-a73d-c88c-9d40-5f8dfa159e30 | {'127.0.0.1': 'FOLLOWER', '127.0.0.2': 'LEADER', '127.0.0.3': 'FOLLOWER'}

It seems that generating the hash key for the partition key and comparing it to the start_key and end_key is the way to achieve this. However, I have difficulty generating the key:

cqlsh> select * from app.test_now;

 k | v
---+--------------------------------------
 1 | 397fe288-82a6-11e8-bb4b-0ca9572519fa

(1 rows)
cqlsh> select token(k) from app.test_now;

 token(k)
----------------------
 -7921831744544702464

The token is a negative number. In Python, I can see that it looks like the thing I want:

>>> hex(-7921831744544702464)
'-0x6df0000000000000'
>>> hex((-7921831744544702464) >> 48)
'-0x6df0'

However, it has a negative sign and I am not sure how I can compare it to the start_key and end_key. I eventually want to do this in golang.

@m-iancu
Contributor

m-iancu commented Jul 9, 2018

@yjiangnan Great question, this is again because token uses int64 but we use uint16 internally, so you would need to convert from one to the other. Here is some Java code that does the conversion (from our Cassandra driver fork) as an example: https://github.com/YugaByte/cassandra-java-driver/blob/3.2.0-yb-x/driver-core/src/main/java/com/yugabyte/driver/core/policies/PartitionAwarePolicy.java#L117
(Still need to convert to hex).

Note that we do plan to add an alternative function (e.g. partition_hash instead of token) that actually returns uint16 directly so that users don't have to worry about the internal details and don't need extra implementation work for the conversion (feel free to open an issue on this if there isn't one yet). Until then, hope the link above helps as a sample implementation.

@yjiangnan
Author

Hi @m-iancu Thanks for the info. An even simpler approach would be to add a function that returns the replica_addresses.

I implemented this in golang but still got a negative number:

package main

import "fmt"

func main() {
	var cql_hash int64 = -7921831744544702464
	hash_code := (int)(cql_hash >> 48)
	hash_code ^= 0x8000
	fmt.Println(hash_code)
}

Output:
-60912

I guess this is because hash_code is still a signed integer. More importantly, how can I use it in a query in clients such as gocql? start_key and end_key have data type blob, and comparing different data types does not seem to be supported in YB. The eventual goal is to get the replica_addresses. The final row of the above data in system.partitions is app | test_now | 0xd552 | 0x, so the start_key is larger than the end_key, making the query even more complicated. I cannot simply use where k >= start_key and k < end_key.

@m-iancu
Contributor

m-iancu commented Jul 9, 2018

@yjiangnan here is a working example in Go (the only thing I changed is the type from int to uint16, plus some examples to illustrate the expected behavior).

package main

import "fmt"

func main() {
    var min_cql int64 = -9223372036854775808
    var low_cql int64 = -3074457345618258603 // 1/3 of range
    var high_cql int64 = 3074457345618258602 // 2/3 of range
    var max_cql int64 = 9223372036854775807

    var min_yb uint16 = 0
    var low_yb uint16 = 21845   // 1/3 of range
    var high_yb uint16 = 43690  // 2/3 of range
    var max_yb uint16 = 65535

    fmt.Printf("Cql Hash: %d | YB Hash: %d | Expected YB Hash: %d\n", min_cql, CqlToYBHash(min_cql), min_yb);
    fmt.Printf("Cql Hash: %d | YB Hash: %d | Expected YB Hash: %d\n", low_cql, CqlToYBHash(low_cql), low_yb);
    fmt.Printf("Cql Hash: %d | YB Hash: %d | Expected YB Hash: %d\n", high_cql, CqlToYBHash(high_cql), high_yb);
    fmt.Printf("Cql Hash: %d | YB Hash: %d | Expected YB Hash: %d\n", max_cql, CqlToYBHash(max_cql), max_yb);
}

func CqlToYBHash(cql_hash int64) uint16 {
    var hash_code = (uint16)(cql_hash >> 48);
    hash_code ^= 0x8000;
    return hash_code;
}

This should return:

Cql Hash: -9223372036854775808 | YB Hash: 0 | Expected YB Hash: 0
Cql Hash: -3074457345618258603 | YB Hash: 21845 | Expected YB Hash: 21845
Cql Hash: 3074457345618258602 | YB Hash: 43690 | Expected YB Hash: 43690
Cql Hash: 9223372036854775807 | YB Hash: 65535 | Expected YB Hash: 65535

Regarding your second question, the ranges are start-inclusive and end-exclusive, so writing the max value (0xffff) as the last end_key would technically be incorrect, which is why we loop back to 0x. As an unfortunate side-effect, retrieving the corresponding partitions for a key with a CQL query is a bit tricky. (I can write more details on this if the suggestion below does not work for you.)

But doing one query to look this up each time is probably inefficient anyway. We recommend caching the values from the system.partitions table in your app (with some regular refresh).
(That way you can also convert the blob to uint16 in your app).
Then you can do the lookup and checking in code, e.g.:

func isInPartition(k uint16, start_key uint16, end_key uint16) bool {
	return k >= start_key && (k < end_key || end_key == 0)
}

For example, in our Java driver fork we have a TableSplitMetadata instance for each table which, in turn, contains one PartitionMetadata for each row.
We build (and regularly refresh) this by querying the system.partitions table here.

We use this data in various applications, including for query locality in our Spark connector fork (which uses our Java driver fork internally): see this code
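
As a rough Go sketch of that caching approach with gocql: the app/test_now names and the token value come from the examples above, but decoding the 2-byte start_key/end_key blobs with binary.BigEndian and scanning replica_addresses into a map[string]string are assumptions, not a documented API:

package main

import (
	"encoding/binary"
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

// partition caches one row of system.partitions with its key range decoded to uint16.
type partition struct {
	startKey uint16
	endKey   uint16 // 0 stands for the empty blob 0x (start of the space, or the wrap-around end)
	replicas map[string]string
}

// keyToUint16 decodes a start_key/end_key blob (empty or 2 bytes, assumed big-endian) into a uint16.
func keyToUint16(b []byte) uint16 {
	if len(b) < 2 {
		return 0
	}
	return binary.BigEndian.Uint16(b)
}

// contains applies the same start-inclusive / end-exclusive check as isInPartition above.
func (p partition) contains(k uint16) bool {
	return k >= p.startKey && (k < p.endKey || p.endKey == 0)
}

// loadPartitions reads and caches the tablet ranges for one table; refresh it periodically.
func loadPartitions(session *gocql.Session, keyspace, table string) ([]partition, error) {
	iter := session.Query(
		`SELECT start_key, end_key, replica_addresses FROM system.partitions
		 WHERE keyspace_name = ? AND table_name = ?`, keyspace, table).Iter()

	var parts []partition
	var start, end []byte
	replicas := map[string]string{}
	for iter.Scan(&start, &end, &replicas) {
		parts = append(parts, partition{keyToUint16(start), keyToUint16(end), replicas})
		replicas = map[string]string{}
	}
	return parts, iter.Close()
}

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	parts, err := loadPartitions(session, "app", "test_now")
	if err != nil {
		log.Fatal(err)
	}

	// Same int64 -> uint16 conversion as CqlToYBHash above, applied to the token(k) value from the example.
	var cqlHash int64 = -7921831744544702464
	ybHash := uint16(cqlHash >> 48) ^ 0x8000
	for _, p := range parts {
		if p.contains(ybHash) {
			fmt.Printf("hash %d (0x%04x) -> replicas %v\n", ybHash, ybHash, p.replicas)
		}
	}
}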

@m-iancu m-iancu added the wontfix label Mar 6, 2022
@m-iancu
Contributor

m-iancu commented Mar 6, 2022

Closing based on the discussion and workaround above. No plans to support this feature at this point.

@m-iancu m-iancu closed this as completed Mar 6, 2022
YCQL automation moved this from To Do to Done Mar 6, 2022