# The pilgrim to Mount Acid

## A schema for data

In this age of BigDataⒸ, your business data are enormous and change fast. You may have one billion active users on your platform carrying out all sorts of activities, concurrently of course. You don't want these activities to step on each other. You don't want to store the wrong thing in your users' accounts. You _especially_ don't want any money in transit to disappear in midair. To make things worse, hundreds of new activities pop up each day.

Storing any of these in a stored relation is infeasible. With a traditional RDBMS, [data migrations](https://en.wikipedia.org/wiki/Data_migration) would have already killed you. And with Cozo, stored relations don't even try to support schema change (in fact, the only 'schema' for a stored relation is its arity).

To store such data and meet its query and mutation requirements, a database needs:

* high concurrency;
* fine-grained transactions;
* checks for data integrity;
* ability to rapidly adapt to new data shapes and requirements.

To support these, we have to pay a price. With Cozo, we pay in two ways:

* most transactions must apply only _local changes_ that touch a tiny fraction of the data (otherwise the database cannot satisfy the high concurrency requirements);
* we tolerate indirections (since "all problems in computer science can be solved by another level of indirection").

With these tradeoffs, the solution is the [triple store](https://en.wikipedia.org/wiki/Triplestore).

A _triple_ is a sentence consisting of a subject, a verb, and an object. In the Cozo flavour, the subject is always an opaque identity, such as _entity42_, so it is actually an _entity-attribute-value_ triple. Examples:

* _entity42_ has first name `'Alice'`.
* _entity42_ has last name `'Liddell'`.
* _entity42_ loves _entity81_.
* _entity81_ is aged `20` years old.

We schematize triples by schematizing the verbs (attributes). In our example, the attributes for first name and last name should have type string, the attribute for age should have type integer, and the values of the "loves" relationship should be other entities. Here the types refer to the objects in the triple, since the subject is always an entity.

So let's put this into code:

In [1]:
:schema

put person {
    first_name: string index,
    nick_name: string many index,
    loves: ref many,
    age: int
}

attr_id,op
10000001,assert
10000002,assert
10000003,assert
10000004,assert


The `:schema` at the top indicates that we want to manage the schema instead of running normal queries. We then `put` a _group_ of related attributes. Even though they are declared together, similarly to a table definition in SQL, we need to stress that this actually defines four separate, independent attributes named `person.first_name`, `person.nick_name`, `person.loves` and `person.age`. An entity can have whatever attributes associated with it, even those with different prefixes.

The allowed types for attributes are:

* `ref`
* `bool`
* `int`
* `float`
* `string`
* `bytes`
* `list`

The list type is heterogeneous in its elements. There is no concept of a nullable type and you can't put `null` into values of triples (other than wrapping them in lists first). To indicate missing values, you simply omit the attribute.

The `ref` type has the special meaning of referring to other entities.

After the type come zero or more _modifiers_ (`age` above has none). The `many` modifier indicates that `loves` is a to-many relationship. If we omit it, any person can love at most one other person, which is not very realistic.

The modifier `index` indicates that we want values of this attribute to be _indexed_. Only indexed attributes support efficient value lookups and range scans. `ref` types are always implicitly indexed since the database wants to be able to traverse the graph in both directions.
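To make the rules above concrete, here is a minimal Python sketch of how typed attributes could constrain the values of triples. It is only an illustration under our own assumptions: the `Attr` class, the `SCHEMA` table and `check_value` are hypothetical names, not part of Cozo.

```python
from dataclasses import dataclass

# Hypothetical Python model of attribute schemas; not Cozo's implementation.
PY_TYPES = {"bool": bool, "float": float, "string": str, "bytes": bytes, "list": list}

@dataclass(frozen=True)
class Attr:
    name: str
    type: str            # one of the allowed type names
    many: bool = False   # cardinality: to-many if True
    index: bool = False  # whether value lookups are indexed

SCHEMA = {a.name: a for a in [
    Attr("person.first_name", "string", index=True),
    Attr("person.nick_name", "string", many=True, index=True),
    Attr("person.loves", "ref", many=True),
    Attr("person.age", "int"),
]}

def check_value(attr_name, value, entities=()):
    """Return True if `value` is acceptable for the named attribute."""
    attr = SCHEMA[attr_name]
    if value is None:
        return False                      # no nulls: omit the attribute instead
    if attr.type == "ref":
        return value in entities          # a ref must point at a known entity
    if attr.type == "int":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, PY_TYPES[attr.type])
```

In this model `check_value("person.age", 20)` passes while `check_value("person.age", "20")` does not, and a missing value is expressed by not storing a triple at all.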

Instead of `index`, we can mark attributes with the modifier `unique`, indicating that there cannot be two entities with the same value for the attribute. The value then acts as a _unique identifier_ for the entity, which can be convenient when retrieving entities, since the entity ID is assigned by the database automatically and you cannot choose how it is assigned. So let's add an explicit `person.id` attribute, this time using the non-grouped syntax:

In [2]:
:schema

put person.id: string unique;

attr_id,op
10000005,assert


We can see which attributes the database now contains by running a system directive:

In [3]:
:db schema

attr_id,name,type,cardinality,index,history
10000001,person.first_name,string,one,index,False
10000002,person.nick_name,string,many,index,False
10000003,person.loves,ref,many,none,False
10000004,person.age,int,one,none,False
10000005,person.id,string,one,unique,False


We can rename the attribute:

In [4]:
:db rename attr person.id person.pid

status
OK


In [5]:
:db schema

attr_id,name,type,cardinality,index,history
10000001,person.first_name,string,one,index,False
10000002,person.nick_name,string,many,index,False
10000003,person.loves,ref,many,none,False
10000004,person.age,int,one,none,False
10000005,person.pid,string,one,unique,False


As well as getting rid of it (this will remove all the data associated with the attribute as well):

In [6]:
:db remove attr person.pid

status
OK


In [7]:
:db schema

attr_id,name,type,cardinality,index,history
10000001,person.first_name,string,one,index,False
10000002,person.nick_name,string,many,index,False
10000003,person.loves,ref,many,none,False
10000004,person.age,int,one,none,False


But that's about it. Except for its name, an attribute is _immutable_: you cannot change a `string` attribute to a `ref` attribute, nor can you decide that your `one` attribute should really be `many`.

So what did we mean when we said that this kind of structure can deal with new requirements? Say you initially made the `person.loves` attribute one-to-one and made `person.last_name` a unique index, and now you need to change them. But you need to change them not because the requirements have changed. You need to change them because you made _mistakes_ at the beginning. Such mistakes are fixed by, for example, first renaming the offending attributes, then creating new attributes with the old names, next copying the data from the old attributes to the new ones, and finally deleting the old, wrong attributes. Fixing mistakes should be explicit, and this procedure is very explicit.
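The rename, create, copy, delete procedure can be sketched on a toy in-memory model. This is not Cozo code: the store layout (attribute name mapped to a set of entity/value pairs), the `fix_attribute` helper and the `_old` suffix are all assumptions made for illustration.

```python
# Hypothetical in-memory model: attribute name -> set of (entity, value) pairs.
def fix_attribute(store, name, convert=lambda v: v):
    """Replace a wrongly defined attribute via rename, create, copy, delete."""
    old = name + "_old"               # temporary name, chosen for illustration
    store[old] = store.pop(name)      # 1. rename the offending attribute
    store[name] = set()               # 2. create a new attribute under the old name
    for entity, value in store[old]:  # 3. copy the data over, converting if needed
        store[name].add((entity, convert(value)))
    del store[old]                    # 4. delete the old, wrong attribute
    return store
```

For example, `fix_attribute(store, "person.age", int)` would repair an age attribute that had mistakenly been stored as strings. Every step is spelled out, which is the point.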

New requirements are not mistakes, and they do not invalidate your old data or schema. Examples of changing requirements: you now need to record the passport number and the parent-child relationships of the people in your graph. Very easy:

In [8]:
:schema

put person.passport_no: string many index;
put person.parent_of: ref many;

attr_id,op
10000006,assert
10000007,assert


In [9]:
:db schema

attr_id,name,type,cardinality,index,history
10000001,person.first_name,string,one,index,False
10000002,person.nick_name,string,many,index,False
10000003,person.loves,ref,many,none,False
10000004,person.age,int,one,none,False
10000006,person.passport_no,string,many,index,False
10000007,person.parent_of,ref,many,none,False


## Data with schema

Let's reinstate the `person.id` attribute first:

In [10]:
:schema

put person.id: string one unique;

attr_id,op
10000008,assert


and now we add data to our database. First we add a person called Peter. Besides the `:tx` at the top indicating that we want to execute a transaction, it is just a map:

In [11]:
:tx

{ person.first_name: 'Peter', person.nick_name: 'Pan', person.id: 'p' }

asserts,retracts
3,0


You can insert multiple 'rows' at the same time, and the maps also allow some stylistic variations:

In [12]:
:tx

{"person.first_name": "Quin", "*person.nick_name": ["Q", "The Quick"], "person.id": "q"}
{"person.first_name": "Rich", "person.id": "r"}

asserts,retracts
6,0


Every entity is free to have any combination of attributes suitable for it. Note how we specified several nicknames for Quin at the same time, and Rich does not have a nickname.

To query the triples, use _triple rules_: these look like a list of three items, except there is no comma inside. The first slot contains the _entity ID_ assigned by the system, the middle symbol is the attribute name and must be explicit (it can't be a variable), and the last slot contains the value for the attribute. In fact, you should interpret the attribute name in the middle as an _operator_, which is why there are no commas around it:

In [13]:
?[eid, first_name, nick_name] := [eid person.nick_name nick_name], 
                                 [eid person.first_name first_name]

eid,first_name,nick_name
1b1dfab6-3592-11ed-b04c-96e194705b69,Peter,Pan
1b5e5a2a-3592-11ed-88b1-80d7c2073d1d,Quin,Q
1b5e5a2a-3592-11ed-88b1-80d7c2073d1d,Quin,The Quick


Besides the above _explicit querying_, there is another way to get attributes associated with an entity: you may specify a _pull directive_, which expands an output binding (interpreted as an entity ID) into a map containing the specified attributes. Observe:

In [14]:
?[pid, eid] := [eid person.id pid]

:pull eid {person.first_name, person.nick_name, person.age}

pid,eid
p,"{""_id"":""1b1dfab6-3592-11ed-b04c-96e194705b69"",""person.age"":null,""person.first_name"":""Peter"",""person.nick_name"":[""Pan""]}"
q,"{""_id"":""1b5e5a2a-3592-11ed-88b1-80d7c2073d1d"",""person.age"":null,""person.first_name"":""Quin"",""person.nick_name"":[""Q"",""The Quick""]}"
r,"{""_id"":""1b5e5a3e-3592-11ed-b632-9781c14b3dd9"",""person.age"":null,""person.first_name"":""Rich"",""person.nick_name"":[]}"


If several of your output bindings are entities, you can specify several `:pull` directives one after another, but each output binding can have at most one pull directive associated with it.

Another notable thing is that pulls always return a map, even if some of the requested attributes are missing for the entity (they are filled with `null` instead). In contrast, observe that the query without a pull directive did not return Rich, but returned Quin twice. As can be seen above, the pull also deals with to-many relationships automatically.
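The difference between the two styles is easy to reproduce outside the database. The following Python sketch models the three people as a plain set of triples; the shortened entity IDs (`e1`, `e2`, `e3`) and the helpers `join_query` and `pull` are hypothetical stand-ins, not Cozo's machinery.

```python
# Hypothetical data mirroring the example; entity IDs shortened for readability.
triples = {
    ("e1", "person.first_name", "Peter"), ("e1", "person.nick_name", "Pan"),
    ("e2", "person.first_name", "Quin"), ("e2", "person.nick_name", "Q"),
    ("e2", "person.nick_name", "The Quick"),
    ("e3", "person.first_name", "Rich"),
}
MANY = {"person.nick_name"}  # attributes with cardinality many

def join_query():
    """Like the explicit query: one row per matching pair of triples."""
    return sorted(
        (e1, first, nick)
        for (e1, a1, nick) in triples if a1 == "person.nick_name"
        for (e2, a2, first) in triples if a2 == "person.first_name" and e1 == e2
    )

def pull(eid, attrs):
    """Like a pull: always one map per entity, with null or empty placeholders."""
    out = {"_id": eid}
    for attr in attrs:
        vals = sorted(v for (e, a, v) in triples if e == eid and a == attr)
        out[attr] = vals if attr in MANY else (vals[0] if vals else None)
    return out
```

Here `join_query()` yields one row for Peter, two for Quin and none for Rich, whereas `pull("e3", [...])` still returns a map for Rich with an empty nickname list and `None` for the missing age.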

Pulls can have nested directives (see the manual for details) and can traverse `ref` triples in the reverse direction. But otherwise pull directives are kept deliberately simple. They are only intended for output processing. If you want recursions, non-trivial filters and the like, do it in the Datalog query instead.

Insertions in the triple store actually amount to _assertions_ of facts. If two conflicting facts are asserted, the last one wins:

In [15]:
:tx

{_key: ['person.id', 'p'], person.first_name: "Pete"}

asserts,retracts
1,0


In [16]:
?[pid, eid] := [eid person.id pid], pid == 'p'

:pull eid {person.first_name, person.nick_name}

pid,eid
p,"{""_id"":""1b1dfab6-3592-11ed-b04c-96e194705b69"",""person.first_name"":""Pete"",""person.nick_name"":[""Pan""]}"


Here we specified an existing entity by providing `_key` with an attribute name and a unique value for the attribute. You can only refer to entities this way if the attribute is uniquely indexed. You can also specify an entity by providing its `_id`, but if you have a unique key to use, it is often much clearer.

The next transaction is superficially similar to the last one. But in this case, `person.nick_name` has cardinality `many` instead of `one`:

In [17]:
:tx

{_key: ['person.id', 'p'], person.nick_name: "Ping"}

asserts,retracts
1,0


In [18]:
?[pid, eid] := [eid person.id pid], pid == 'p'

:pull eid {person.first_name, person.nick_name}

pid,eid
p,"{""_id"":""1b1dfab6-3592-11ed-b04c-96e194705b69"",""person.first_name"":""Pete"",""person.nick_name"":[""Pan"",""Ping""]}"


Now the new nickname is simply recorded together with the previous one. Note that if you try to add the same nickname for the same person again, you still get only one copy instead of two:

In [19]:
:tx

{_key: ['person.id', 'p'], person.nick_name: "Ping"}

asserts,retracts
1,0


In [20]:
?[pid, eid] := [eid person.id pid], pid == 'p'

:pull eid {person.first_name, person.nick_name}

pid,eid
p,"{""_id"":""1b1dfab6-3592-11ed-b04c-96e194705b69"",""person.first_name"":""Pete"",""person.nick_name"":[""Pan"",""Ping""]}"


As we have seen, triples abide by set semantics instead of bag semantics as well. If you really want to have duplicates, you need to disambiguate them at the level of values, by for example wrapping them in lists.
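The assertion behaviour seen in the last few cells (last one wins for cardinality `one`, accumulation without duplicates for cardinality `many`) can be summarized in a few lines of Python. This is a sketch of the observed semantics only; `assert_fact` and its arguments are names we made up.

```python
def assert_fact(triples, many_attrs, entity, attr, value):
    """Assert one fact into a set of (entity, attr, value) triples, in place."""
    if attr not in many_attrs:
        # Cardinality one: the new fact replaces any earlier value (last one wins).
        triples -= {t for t in triples if t[0] == entity and t[1] == attr}
    triples.add((entity, attr, value))  # a set keeps only one copy of duplicates
```

Asserting `"Pete"` over `"Peter"` replaces the first name, while asserting the nickname `"Ping"` twice leaves exactly one `"Ping"` next to `"Pan"`.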

To get rid of data, you perform _retractions_:

In [21]:
:tx

retract {_key: ['person.id', 'p'], person.nick_name: "Ping", person.first_name: 'Peter'}

asserts,retracts
0,2


In [22]:
?[pid, eid] := [eid person.id pid], pid == 'p'

:pull eid {person.first_name, person.nick_name, person.id}

pid,eid
p,"{""_id"":""1b1dfab6-3592-11ed-b04c-96e194705b69"",""person.first_name"":null,""person.id"":""p"",""person.nick_name"":[""Pan""]}"


It is OK to retract facts that do not exist, in which case this is just a no-op. Notice that the entity still has its `person.id` attribute: the `_key` specification only indicates what entity to transact. If you want to get rid of the keyed attribute, you have to include it in the transaction map explicitly.

Note that when retracting facts above, we had to provide the database with the values of the existing triples. This can be cumbersome, especially in the case of to-many attributes: if you somehow miss one value, it will remain. Therefore another form of retraction, `retract_all`, is provided:

In [23]:
:tx

retract_all {_key: ['person.id', 'p'], person.nick_name: 0, person.first_name: 0, person.id: 0}

asserts,retracts
0,2


In [24]:
?[pid, eid] := [eid person.id pid], pid == 'p'

:pull eid {person.first_name, person.nick_name, person.id}

pid,eid


In this form, you can provide any value for the attributes: the database does not care and just removes all values associated with the attributes. Above we have used `0` since it is simple to type.
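For to-many attributes, the contrast between the two forms of retraction can be sketched as follows. The model is deliberately narrow: it only covers to-many values stored in a Python set, and both helper functions are hypothetical.

```python
def retract_values(triples, entity, attr, values):
    """Remove only the listed values of a to-many attribute; return how many went."""
    doomed = {(entity, attr, v) for v in values} & triples
    triples -= doomed                 # values that were never stored are a no-op
    return len(doomed)

def retract_all_values(triples, entity, attr):
    """Remove every value of the attribute for the entity, whatever the values are."""
    doomed = {t for t in triples if t[0] == entity and t[1] == attr}
    triples -= doomed
    return len(doomed)
```

With the first form, forgetting to list a nickname leaves that nickname in place; the second form needs no values at all.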

### Nested data mutations

We have so far inserted data in units of entities. This is fine for simple cases, but can become awkward for tree or graph shaped data which are linked together in non-trivial ways. We would need to insert some triples first, get ids of some entities (or use their unique keys), and use these to insert other triples.

Instead, Cozo supports nested data insertion. Let's insert our whole love triangle graph all at once.

Recall that our love triangles are:

In [25]:
?[] <- [['alice', 'eve'],
        ['bob', 'alice'],
        ['eve', 'alice'],
        ['eve', 'bob'],
        ['eve', 'charlie'],
        ['charlie', 'eve'],
        ['david', 'george'],
        ['george', 'george']]

0,1
alice,eve
bob,alice
charlie,eve
david,george
eve,alice
eve,bob
eve,charlie
george,george


We insert them into the triple store thus:

In [26]:
:tx

{
    _tid: 'a', 
    person.id: 'a', 
    person.first_name: 'Alice',
    person.loves: {
        _tid: 'e',
        person.id: 'e',
        person.first_name: 'Eve',
        *person.loves: [
            'a',
            {
                _tid: 'b',
                person.id: 'b',
                person.first_name: 'Bob',
                person.loves: 'a'
            },
            {
                _tid: 'c',
                person.id: 'c',
                person.first_name: 'Charlie',
                person.loves: 'e'
            }
        ]
    }
}

{person.id: 'd', person.first_name: 'David', person.loves: 'g'}
{_tid: 'g', person.id: 'g', person.first_name: 'George', person.loves: 'g'}

asserts,retracts
20,0


Nested mutations are done simply by using maps for `ref` attribute values. We identify entities that do not yet exist in the database by the `_tid` given inline. A `_tid` can be any string you like _except_ strings that can be interpreted as UUIDs. As before, an asterisk `*` before the attribute name denotes that we are transacting multiple triples into an attribute. As the last two maps in the example show, you do not need a `_tid` if you do not need to refer to an entity, and an entity can use its own `_tid` to refer to itself.
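One way to picture what happens to such a nested map is to flatten it into plain triples keyed by temporary IDs. The Python sketch below does this for the transaction above. It is our own illustration, not Cozo's algorithm: `REF_ATTRS`, `flatten` and the `_anonN` placeholder IDs are assumptions, and a real database would go on to replace the temporary IDs with entity IDs.

```python
import itertools

REF_ATTRS = {"person.loves"}   # assumption: which attributes have type ref
_fresh = itertools.count()     # source of placeholder IDs for maps without _tid

def flatten(entity, out):
    """Append (temp_id, attr, value) triples for a nested map; return its temp ID."""
    tid = entity.get("_tid") or f"_anon{next(_fresh)}"
    for key, value in entity.items():
        if key == "_tid":
            continue
        attr = key.lstrip("*")
        values = value if key.startswith("*") else [value]  # '*' marks a list of values
        for v in values:
            if attr in REF_ATTRS and isinstance(v, dict):
                v = flatten(v, out)   # nested map: recurse, then link by its temp ID
            out.append((tid, attr, v))
    return tid

tx = [
    {"_tid": "a", "person.id": "a", "person.first_name": "Alice",
     "person.loves": {
         "_tid": "e", "person.id": "e", "person.first_name": "Eve",
         "*person.loves": [
             "a",
             {"_tid": "b", "person.id": "b", "person.first_name": "Bob", "person.loves": "a"},
             {"_tid": "c", "person.id": "c", "person.first_name": "Charlie", "person.loves": "e"},
         ]}},
    {"person.id": "d", "person.first_name": "David", "person.loves": "g"},
    {"_tid": "g", "person.id": "g", "person.first_name": "George", "person.loves": "g"},
]

triples = []
for m in tx:
    flatten(m, triples)
```

The flattened list has one triple per asserted fact, so its length matches the assertion count reported by the transaction.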

Let's see if we get the same results querying the triple store:

In [27]:
?[loving, loved] := [a person.first_name loving], 
                    [a person.loves b], 
                    [b person.first_name loved]

loving,loved
Alice,Eve
Bob,Alice
Charlie,Eve
David,George
Eve,Alice
Eve,Bob
Eve,Charlie
George,George


Nice!

### A note on the entity ID

As you have probably already noticed, the database automatically assigned UUIDs as entity IDs when we created the entities. For more control, you can also supply the IDs yourself when creating entities:

In [28]:
:tx

{_id: '4e7a35b9-e04d-48a3-9eeb-d8a68ef33c43', person.id: 'u', person.first_name: 'Ursula'}

asserts,retracts
2,0


In [29]:
?[p] <- [['4e7a35b9-e04d-48a3-9eeb-d8a68ef33c43']]

:pull p { person.first_name, person.id }

p
"{""_id"":""4e7a35b9-e04d-48a3-9eeb-d8a68ef33c43"",""person.first_name"":""Ursula"",""person.id"":""u""}"


The system-assigned IDs are version 1 UUIDs, each of which contains a timestamp. You can extract the timestamp by using the function `uuid_timestamp`:

In [30]:
?[pid, ts] := [p person.id pid], ts <- uuid_timestamp(p)

pid,ts
a,1663313732.684265
b,1663313732.684268
c,1663313732.684268
d,1663313732.684269
e,1663313732.684267
g,1663313732.684269
q,1663313715.9019048
r,1663313715.901907
u,


The returned numbers indicate seconds since the UNIX epoch. The UUID we made ourselves does not contain a timestamp as it is of version 4. You can provide any valid UUID as entity ID except the 'nil ID' `00000000-0000-0000-0000-000000000000`:

In [31]:
:tx

{_id: '00000000-0000-0000-0000-000000000000', person.id: '0', person.first_name: 'I am ZERO'}

Using the timestamped version has performance benefits: the database sorts UUIDs in such a way that those with similar timestamps are near each other. This provides data locality similar to that of an auto-incrementing integer key in an RDBMS, while mitigating the risk of malicious users trying to iterate over your data sequentially or estimating the cardinality of your data. The UUIDs generated by the system contain only random bits besides the timestamp. In particular, no node information is encoded in them (the UUID specification allows but does not require it), so users cannot tell on which machine the IDs were generated either. Still, if you want your keys to be completely obscure, provide your own UUIDv4 backed by a good random number generator.
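If you want to check such timestamps outside the database, the standard UUID layout is enough. The sketch below uses Python's built-in `uuid` module; the function name `uuid_v1_unix_seconds` is ours, and we only assume that it mirrors what `uuid_timestamp` reports, based on the outputs shown above.

```python
import uuid

# Number of 100 ns intervals between the UUID epoch (1582-10-15) and the UNIX epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def uuid_v1_unix_seconds(text):
    """Return seconds since the UNIX epoch for a version 1 UUID, else None."""
    u = uuid.UUID(text)
    if u.version != 1:
        return None                   # e.g. version 4 UUIDs carry no timestamp
    return (u.time - UUID_EPOCH_OFFSET) / 1e7
```

Applied to Peter's entity ID `1b1dfab6-3592-11ed-b04c-96e194705b69`, this gives a value in the second `1663313715`, the same second as the timestamps listed for `q` and `r`. The hand-made version 4 ID yields `None`.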

## The time machine

Your data is changing fast. For administrative or regulatory reasons, you may also need records of _how_ your data changes. Or you may be presented with historical data in the first place, and you want your queries to reflect facts _at a particular instant of time_.

Someone used to say that 'more columns in an RDBMS solves anything'. In our case, maybe adding more attributes helps? Let's add to each entity the attribute `valid_at` indicating when the entity is considered valid.

In fact, this is doable, but the resulting system is a total pain to use. First, you will need to _reify_ most of your values. Instead of saying that `[bob person.name 'Bob']`, you need something like `[bob person.used_name name]`, where `[name name.is_spelled 'Bob']` and `[name name.is_valid_at '2020-03-04']`, etc. Next, how are you going to find out what everything was at a particular moment? You cannot use equality conditions to filter entities based on `is_valid_at`, since something that was introduced in 1999 is still valid in 2020, _unless_ some other fact supersedes it or it was retracted _after_ 1999. And we are only after the latest valid fact, not all historical facts at a point in time. Fulfilling these requirements _is_ possible in Cozo with aggregations, but they add a huge amount of complexity to the queries for something that is intuitively very simple.

To solve this particular problem, which occurs more commonly than you might think, Cozo has built-in support for historical facts. This functionality carries a non-trivial performance penalty, so you have to request it explicitly for each attribute. And like other properties of attributes, whether it has history support is immutable. If you change your mind, you need to define a new attribute and copy the data over, as usual.

If you are already worried about performance, rest assured that the built-in support is MUCH MORE performant than the hand-rolled solution outlined above. In fact, querying a history-enabled attribute is about $c \log n$ times slower than the corresponding query for a non-history-enabled attribute, where $c$ is a constant and $n$ is the number of historical facts a given entity-attribute pair has. The logarithmic complexity beats any simple-minded implementation.
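To see where a logarithmic factor can come from, consider keeping the facts of one entity-attribute pair sorted by timestamp and answering "what was the value at time t?" with a binary search. The Python sketch below does exactly that for a cardinality-one attribute. It is an assumption about the general technique, not a description of Cozo's storage engine, and the `History` class is hypothetical.

```python
from bisect import bisect_right

class History:
    """Hypothetical history of one entity-attribute pair, sorted by timestamp."""

    def __init__(self):
        self._times = []   # sorted timestamps
        self._values = []  # value asserted at each timestamp; None marks a retraction

    def record(self, ts, value):
        """Store a fact valid from `ts` on (list insertion itself is linear)."""
        i = bisect_right(self._times, ts)
        self._times.insert(i, ts)
        self._values.insert(i, value)

    def as_of(self, ts):
        """Latest fact at or before `ts`, located by binary search in O(log n)."""
        i = bisect_right(self._times, ts)
        return self._values[i - 1] if i else None
```

Facts may be recorded in any order: after `record(20, "b")`, `record(10, "a")` and `record(30, None)`, the call `as_of(25)` returns `"b"`, while `as_of(5)` and `as_of(99)` both return `None`.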

Let's have some examples. We want to store countries and their heads of state. The schema:

In [102]:
:schema

put country {
    name: string unique,
    head: string index history,
}

attr_id,op
10000001,assert
10000002,assert


For simplicity we assumed a country's name does not change, but obviously its head of state will change, indicated by the modifier `history`. That's actually all you need for the schema.

Now let's insert some data. You can actually insert data as you did before:

In [103]:
:tx

put {country.name: 'US', country.head: 'Biden'}
{country.name: 'UK', country.head: 'Truss'}

asserts,retracts
4,0


In [104]:
?[country, head] := [c country.name country], [c country.head head]

country,head
UK,Truss
US,Biden


By the way, we showed that you can explicitly tell the system that you are doing `put`.

Now let's add in the historical data:

In [105]:
:tx

@'2019-07-24' {_key: ['country.name', 'UK'], country.head: 'Johnson'}
put @ '2017-01-20' {_key: ['country.name', 'US'], country.head: 'Trump'}

asserts,retracts
2,0


The syntax should explain itself. You can specify the time as an ISO 8601 date, in which case it is interpreted as midnight UTC on the stated date, as an RFC 3339 timestamp such as `'1996-12-19T16:39:57-08:00'`, or as an integer indicating the number of _microseconds_ since the UNIX epoch (negative numbers for times before it). Let's see who the heads of state are _now_:

In [106]:
?[country, head] := [c country.name country], [c country.head head]

country,head
UK,Truss
US,Biden


As expected, the historical data does not affect facts now.

Let's explicitly request historical facts:

In [107]:
?[country, head] @ '2020-01-01' := [c country.name country], [c country.head head]

country,head
UK,Johnson
US,Trump


Right. Try another one:

In [108]:
?[country, head] @ '2022-01-01' := [c country.name country], [c country.head head]

country,head
UK,Johnson
US,Trump


Umm ... that doesn't look right. The problem is that when we inserted the facts about Biden and Truss, we did not tell the system when those facts start being valid, so the system assumed the current timestamp. Let's fix that:

In [109]:
:tx

@'2022-09-05' {_key: ['country.name', 'UK'], country.head: 'Truss'}
@'2021-01-20' {_key: ['country.name', 'US'], country.head: 'Biden'}

asserts,retracts
2,0


In [110]:
?[country, head] @ '2022-01-01' := [c country.name country], [c country.head head]

country,head
UK,Johnson
US,Biden


That's more accurate. What about the future?

In [111]:
?[country, head] @ '9999-01-01' := [c country.name country], [c country.head head]

country,head
UK,Truss
US,Biden


Wow, that can't happen no matter what the world is coming to. We fix that by _retracting_ facts as before, but with a timestamp attached (we will use a _very_ generous timestamp):

In [112]:
:tx

retract_all @ '2099-01-01' {_key: ['country.name', 'UK'], country.head: 0}
retract_all @ '2099-01-01' {_key: ['country.name', 'US'], country.head: 0}

asserts,retracts
0,6


In [113]:
?[country, head] @ '9999-01-01' := [c country.name country], [c country.head head]

country,head


Good. What about now, again?

In [114]:
?[country, head] := [c country.name country], [c country.head head]

country,head
UK,Truss
US,Biden


And history?

In [115]:
?[country, head] @ '2018-01-01' := [c country.name country], [c country.head head]

country,head
US,Trump


The UK is missing since we have yet to enter its head of state for this period into the database. Fix it:

In [116]:
:tx

@'2016-07-11' {_key: ['country.name', 'UK'], country.head: 'May'}

asserts,retracts
1,0


One more thing, in case it is not already obvious: timestamps apply at the level of rules, not queries, so you can have a different timestamp for each rule:

In [121]:
?[year, country, head] @ '2019-01-01' := year <-  2019, [c country.name country], [c country.head head]
?[year, country, head] @ '2022-01-01' := year <-  2022, [c country.name country], [c country.head head]
?[year, country, head] /* ~~NoW!~~ */ := year <- 'now', [c country.name country], [c country.head head]

year,country,head
2019,UK,May
2019,US,Trump
2022,UK,Johnson
2022,US,Biden
now,UK,Truss
now,US,Biden


The timestamp is also not required to represent actual time. You can `put` data by giving them integer timestamps with custom interpretation, and query them using the same interpretation. Just don't mix your fictional time and real time.
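If you prefer to compute integer timestamps in application code, the three accepted forms (plain date, RFC 3339 string, integer microseconds) can be reduced to one. The following Python sketch shows one way to do so using only the standard library; `to_micros` is a hypothetical helper, and we merely assume it agrees with the interpretation described earlier.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_micros(ts):
    """Normalize a timestamp to integer microseconds since the UNIX epoch."""
    if isinstance(ts, int):
        return ts                             # already microseconds
    dt = datetime.fromisoformat(ts)           # accepts 'YYYY-MM-DD' and offset forms
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # plain date: midnight UTC
    return (dt - EPOCH) // timedelta(microseconds=1)
```

For instance, `to_micros('1970-01-02')` is `86400000000`, exactly one day in microseconds, and integers pass through unchanged, which keeps fictional integer timelines separate from real dates as long as you do not mix the two.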

A final API before we are done with this time-travelling thing.