# 14. Introduction to Kdb+

- __[Wikipedia on Kdb+](https://en.wikipedia.org/wiki/Kdb%2B)__

- Kdb+, as well as q, is written in k.
- Kdb+ is the data persistence part of q.
- Kdb+ is a relational database which is:
    - __CSDB__ - column-store (columnar) (algol-like db)
    - __TSDB__ - time series database
    - __IMDB__ - in-memory abilities

- CSDB means faster seeking in data stored on hard disks, because a query reads only the columns it needs
- CSDB is slower on retrieving full rows of tables (although whole-row operations are rare in practice)
    - An RSDB retrieves a row in a single disk read
    - A CSDB retrieves a row in multiple disk operations, one per column
- OLTP-focused RDBMSs are more row oriented
- OLAP-focused RDBMSs are a balance of row-oriented and column-oriented

### __[OLTP vs OLAP](https://stackoverflow.com/a/21900244)__

- OLTP and OLAP are two different approaches to data processing: OLTP handles many short transactions, whereas OLAP answers Multi-Dimensional Analytical (MDA) queries efficiently
- __OLTP__:
    - stands for __On-line Transaction Processing__, transaction meaning an atomic change in the state of data.
    - __temporary data__: focussed on the operation of a particular system: frequent changes in data,
    - __volume of transactions (INSERT, UPDATE, DELETE)__: large number of short on-line transactions,
    - main emphasis is on:
        - very fast query __processing of full records__
        - maintaining __data integrity__ in multi-access environments and
    - __measure of effectiveness__: number of transactions per second.
    - __OLTP databases contain__:
         - detailed and __current data__,
         - an __entity model__: a schema used to store transactional data
             in the 3rd normal form (3NF).
    - __Main goals / focus__:
         - availability
         - speed
         - concurrency
         - recoverability
    - __Characteristics of OLTP__:
         - insert and update intensive
         - OLTP applications are used concurrently by a large volume of users
         - because of the above characteristics, data integrity is harder to maintain in an OLTP database than in an OLAP database
         - __Queries__ typically:
             - access individual records (full rows) and
             - are significantly less complex than those in an OLAP database
    - __Typical applications of OLTP__:
         - ATM: commercial transaction processing application
             - the same data (bank account balance) is updated frequently
         - Inventory management application of a web store:
             - Number of items available in the store is updated continuously as customers buy them
             

- __OLAP__:
    - stands for __On-line Analytical Processing__
    - __fixed data__: deals with Historical Data or Archival Data.
    - __volume of transactions__: relatively low,
    - __Queries__:
        - very complex
        - involve aggregations, joins, where and group by clauses.
        - need to access large amount of data in the database
    - __Measure of effectiveness__: response time
    - __Usage__: Data Mining techniques, time series analysis
    - __OLAP databases contain__:
        - aggregated, historical data,
        - stored in multi-dimensional schemas (usually star schema).
    - OLAP databases / data warehouses contain [3 types of data](https://docs.oracle.com/cd/B10500_01/olap.920/a95295/designd4.htm):
        - historical data
        - derived data: generated from historical data using mathematical operations and data transformation
        - metadata: data that describes the data and schema objects
            - it is used by applications to fetch and compute data correctly

- CSDBs are well suited for OLAP-like workloads (like in data warehouses),
    - where highly complex queries are frequently executed over big data
- CSDBs have been developed as hybrids capable of OLTP and OLAP operations
- RSDBs are better suited for OLTP-like workloads,
    - which are more heavily loaded with interactive transactions.
- Disadvantages of CSDBs compared to RSDBs:
    - Retrieving entire rows from tables
    - Transaction-heavy operations (insert, update, delete)

## 14.1. Tables in memory and serialization

- Persist a table with set and read it into memory with get

### 14.1.1. Tables and key tables

- A table is a transposed (flipped) column dictionary: the order of indexing is reversed, but no data is moved
- The table schema looks like this: ([] s:`symbol$(); v:`int$())
- The type of a table is 98h
- The type of a keyed table is 99h
- The meta command returns column names, types and attributes
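
The points above can be sketched in a few lines. This is a minimal illustration only; the names cd, t and kt are made up for the example and do not come from the original notes.

```q
/ a column dictionary: symbol keys mapped to equal-length lists
cd:`s`v!(`a`b`c;10 20 30)
/ flipping (transposing) the dictionary yields a table
t:flip cd
type t        / 98h
/ keying the table on column s yields a keyed table
kt:`s xkey t
type kt       / 99h
/ meta lists each column with its type char, foreign key and attribute
meta t
```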

### 14.1.2. Foreign keys and [link columns](https://code.kx.com/v2/wp/the_application_of_foreign_keys_and_linked_columns_in_kdb.pdf)

- Schema for creating a foreign key:

In [None]:
.kin.table1:([id:til 10] col1:10?`5;col2:3*10?1000)

In [None]:
.kin.table2:([] t1id:`.kin.table1$5 4 7 2 3;date:.z.d+5?30;percent:90+5?20)

In [None]:
.kin.table3:([] t1id:`.kin.table1!(exec id from .kin.table1)?5 4 7 2 3; q:100 101 102 103 109)

In [None]:
.kin.table1

In [None]:
.kin.table2

In [None]:
.kin.table3

- A link column is similar to a foreign key:
    - its entries are indices of rows in a table
    - but you must perform the lookup manually: you find the row indices of the target ids yourself with ? and then tag them with the name of the target table using !
- The advantages of link columns:
    - The target can be a table or a keyed table
    - The target can be the table containing the link column
    - __Link columns can be splayed or partitioned, whereas foreign keys cannot__

In [None]:
kt:([id:1001 1002 1003] s:`a`b`c; v:100 200 300)
t:([] id:`kt$1002 1001 1003 1001; q:100 101 102 103) // foreign key

In [None]:
tk:([] id:1001 1002 1003; s:`a`b`c; v:100 200 300)
t:([] id:`tk!(exec id from tk)?1002 1001 1003 1001; q:100 101 102 103) // link column
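
Once a link column is in place, it can be followed with dot notation in a query, just like a foreign key. A small sketch, reusing the table names tk and t from the cell above with the target ids 1001 1002 1003:

```q
/ target table (not keyed) and a table with a link column into it
tk:([] id:1001 1002 1003; s:`a`b`c; v:100 200 300)
t:([] id:`tk!(exec id from tk)?1002 1001 1003 1001; q:100 101 102 103)
/ follow the link with dot notation to pull columns from tk
select id.s, id.v, q from t
/ the f column of meta shows the link target for column id
meta t
```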

### 14.1.3. Serializing tables

- __set/get__ to persist and load tables
    - set/get is the general serialization/deserialization feature of q
    - a single file is created in the OS file system
- \l can also be used instead of get

In [None]:
(hsym `$dataDir,"table1") set .kin.table1

In [None]:
.kin.ldtable:get hsym `$dataDir,"table1"

In [None]:
.kin.ldtable

### 14.1.4. Operating on serialized tables

- To execute an operation on a table you can:
    - either load it into memory first: tInMem:get `fileHandle, then select from tInMem
    - or do it directly on the persisted table: select from `fileHandle
- Under the hood, __operations are performed in memory in both cases__!!!
    - Serialized table must fit into memory.
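
Both routes can be tried side by side. A minimal sketch, assuming a Unix-like system and a hypothetical scratch path /tmp/kdbdemo/t (not from the original notes):

```q
/ persist a small table as a single serialized file (hypothetical path)
`:/tmp/kdbdemo/t set ([] c1:`a`b`c; c2:10 20 30)
/ route 1: load the whole table into memory, then query it
tInMem:get `:/tmp/kdbdemo/t
r1:select from tInMem where c2>10
/ route 2: query the file handle directly; the table is still read into memory
r2:select from `:/tmp/kdbdemo/t where c2>10
r1~r2
```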

### 14.1.5. The database view

- Zero dimensional persisted form: when 1 table = 1 file
    - The contents of the table is represented by a dot
- Zero dimensional vs higher-dimensional forms: 1 table = multiple files and/or multiple directories.

## Types of data decomposition in q

- Tables can be:
    - __Splayed__: columns of splayed tables are stored in multiple files
    - __Partitioned__: rows of partitioned tables are stored in multiple directories
        - columns are stored in different files in an additional subdirectory with the same name as the table
    - __Segmented__: rows of a segmented table are stored in multiple directories (that have the same structure as the root directory in a partitioned database) on multiple partitions of the hard drive to allow parallel processing of queries

## 14.2. Splayed tables

- Problems with serialized tables:
    - __Space complexity__: The entire table must fit into memory on each user's machine
    - __Time complexity__ (speed): operations on persisted tables will be slow due to reloading the entire table each time into memory

### 14.2.0. Splaying a table

- Splaying is when the columns of a table are persisted in multiple files in one directory:
    - A splayed table corresponds to a directory with the name of the table and
    - Each column corresponds to a file with the name of the column
    - The .d file: the only metadata stored is the list of column names, which is serialized to the hidden .d file (not to be confused with the sym file, which holds the enumeration domain for symbol columns)
- Splayed table visualization:
    1-dimensional persisted form: a series of dots representing columns
- Splaying resolves the memory/reload issue by mapping the columns into memory:
    - Columns are loaded into memory on demand, then memory is released when no longer needed
- Use case: splay tables with many columns, since most queries refer to only a handful of columns

### 14.2.1. Creating splayed tables

- To create a splayed table, either
    - apply set or upsert to the table with a directory handle (trailing slash) instead of a file handle, or
    - splay it manually
- Restrictions on what can be splayed:
    - No general lists: simple or compound lists only
    - Columns of symbol type must be enumerated
    - You cannot splay keyed tables (use link columns instead of keys)
    - A file with the same name as the would-be-splayed directory must not already exist
- To read a splayed table, use
    - get + the original directory handle
- To read only a column, use:
    - get + column file name as a file handle
- To list the files in the directory:
    - \ls + the full path __without__ the backtick and the trailing slash
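
The steps listed above can be sketched as follows. The directory /tmp/kdbdemo/tsplay is a hypothetical scratch location, and the table has no symbol columns so that no enumeration is needed:

```q
/ the trailing slash makes set splay the table instead of writing one file
`:/tmp/kdbdemo/tsplay/ set ([] ti:09:30:00 09:31:00; p:101.5 33.5)
/ list the files that were created: one per column plus the hidden .d file
key `:/tmp/kdbdemo/tsplay
/ read the whole table back through the directory handle
ts:get `:/tmp/kdbdemo/tsplay/
/ read a single column file
pcol:get `:/tmp/kdbdemo/tsplay/p
/ the .d file holds the column names in order
dcols:get `:/tmp/kdbdemo/tsplay/.d
```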

In [None]:
.kin.dirHandle:hsym `$dataDir,"testTrades/"

In [None]:
.kin.dirFileName:hsym `$dataDir,"testTrades" / this won't work with \ls

In [None]:
.kin.dirHandle set .kin.spTrades

In [None]:
.kin.dirFileName set .kin.mktrades[`goog`ibm`fb`tmobil`amazn`alibaba;100]

In [None]:
key `.kin

In [None]:
.kin.mktrades

In [None]:
.kin.tl:get .kin.dirHandle

In [None]:
\ls -a /home/iguana/1_Code/4_jupyter_projects/q4m3_tutorial/data/testTrades

### 14.2.2. Splayed tables with symbol columns

- For all splayed and partitioned tables, all symbol columns must be enumerated over the list of symbols that is serialized into the root directory.
- For enumerating, use a projection of the built-in .Q.en function:

In [None]:
`:/db/t/ set .Q.en[`:/db;] ([] s1:`a`b`c; v:10 20 30; s2:`x`y`z)

### 14.2.3. Splayed tables with nested columns

- The only type of nested columns that can be splayed is made up of compound lists:
    - a compound list is a list of simple lists of uniform type
    - compound lists are indicated by upper case letters in the type column of meta results

In [None]:
/ function to find the columns of table x that cannot be splayed:
/ a column is flagged if its items are general lists or have mixed types
cannotBeSplayed:{where {(ts~1#0h)|1<count ts:distinct type each x} each flip x}

### 14.2.4. Basic operations on splayed tables

- Map a splayed table to memory by
    - either loading it on startup: $ q path/to/splayedTableDir
    - or \l /path/to/splayedTableDir
        - only this format works:
            - no string,
            - no file handle,
            - no trailing forward slash
            - no tilde instead of /home/userName
- In these cases, tables are only mapped to memory, they are not loaded until an expression is evaluated on them
- \a lists the tables in the current context
- extract column data via symbol indexing (dot notation does not work): tableName `colName
- Operations that work on splayed tables:
    - select and exec (exec does not work on partitioned tables)

In [251]:
.kin.dirHandle set .kin.mktrades[`googl`ibm`fb`tmobil`amazn`alibaba;100]

`:/home/iguana/1_Code/4_jupyter_projects/q4m3_tutorial/data/testTrades/


In [253]:
testTrades `time

04:12:55.527 06:03:01.029 01:11:42.896 09:21:44.893 17:12:24.021 21:11:04.805..


### 14.2.5. Operations on a splayed directory

- select
- exec
- upsert
- xasc
- `attr#
- updates applied to a mapped table are only applied in memory, not on disk -> updates are not persistent
    - and there are no built-in operations to update data in a persisted table, only workarounds

### 14.2.6. Appending to a splayed table

- Use upsert to append records to a splayed table:
    - directoryHandle + upsert + table2Name (that contains the rows to be appended)
        - schema of the two tables must match
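
A minimal sketch of such an append, using a hypothetical scratch directory /tmp/kdbdemo/tup and a one-row table whose schema matches the splayed one:

```q
/ create the splayed table (hypothetical path)
`:/tmp/kdbdemo/tup/ set ([] ti:09:30:00 09:31:00; p:101.5 33.5)
/ append one record; column names and types must match the table on disk
`:/tmp/kdbdemo/tup/ upsert ([] ti:enlist 09:32:00; p:enlist 42.0)
/ the table on disk now has three rows
n:count get `:/tmp/kdbdemo/tup/
```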

### 14.2.7. Manual operations on a splayed directory

- These are only file system operations -> inconsistency may occur:
    - Do this only when no queries are being executed


In [None]:
`:/db/t/ set ([] ti:09:30:00 09:31:00; p:101.5 33.5)
`:/db/t/p set .[get `:/db/t/p; where 09:31:00=get `:/db/t/ti; :;42.0] / replace item at depth

- Adding a new column to a splayed table:

In [None]:
`:/db/t/s set (count get `:/db/t/ti)#`
`:/db/t/.d set get[`:/db/t/.d] union `s

- To delete a column, remove the column file and revise the .d file:

In [None]:
system "rm /db/t/s"
`:/db/t/.d set get[`:/db/t/.d] except `s

- Sorting a splayed table. E.g.:
    - create sort index (iasc, idesc)
    - re-index all column files

In [None]:
cs:system "ls /db/t"
I:idesc `:/db/t/ti
{pth set get[pth:hsym `$"/db/t/",x] I} each cs

### 14.2.8. Working with sym files

- What can be done by manipulating the sym files?
    1. Moving a table from one database to another
    2. Change column type from symbol to string
    3. Consolidating enumeration domains
- The sym file can easily be corrupted: always backup database before modifying the sym file

1. Moving tables between databases:
- un-enumerate table in source database
- re-enumerate table in target database
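
The two steps for moving a table can be sketched as below. This is an illustration under assumptions, not a tested migration procedure: the roots /tmp/kdbdemo/db1 and /tmp/kdbdemo/db2 are hypothetical, and the table has a single symbol column s.

```q
src:`:/tmp/kdbdemo/db1      / hypothetical source root
tgt:`:/tmp/kdbdemo/db2      / hypothetical target root
/ set up a splayed table in the source, enumerated over the source sym file
`:/tmp/kdbdemo/db1/t/ set .Q.en[src] ([] s:`a`b`c; v:10 20 30)
/ map the table; column s is an enumeration over the in-memory sym list
t:get `:/tmp/kdbdemo/db1/t/
/ step 1: un-enumerate, turning s back into plain symbols
plain:update value s from t
/ step 2: re-enumerate over the target sym file and splay into the target
`:/tmp/kdbdemo/db2/t/ set .Q.en[tgt] plain
```

Note that .Q.en reassigns the global sym list, which is why the un-enumeration is done before the target's .Q.en is called.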

### 14.2.9. Splayed tables with link columns

### 14.2.10. Query execution on splayed tables

- When the same column is used in separate queries, it is loaded into memory only once and cached in memory until it is garbage collected.

## 14.3. Partitioned tables

- Table partitioning is the 2nd type of table decomposition in q. It is useful when the database is so large that even one column cannot fit into memory.
- E.g.: slice columns into daily partitions
- __ALL PARTITIONED TABLES ARE SPLAYED__, but not all splayed tables are partitioned

### 14.3.1. Partitions

- Partitioning a splayed table means decomposing it further by grouping records that have common values along a column of a special type.
- Partitioning can only be done along columns of this special type.
- The special type is one whose underlying value is an integer; kdb+ supports these partition domains:
    - date
    - month
    - year
    - int

- A partitioned table is 2-dimensional persisted form, it is cut horizontally and vertically:
- Partitions are stored in separate directories in the root directory
- and each partition-directory contains a subdirectory with the name of the table, which contains the splayed files:
/root
    /partitionvalue1
        /tablename
            .d
            column1name
            column2name
            …
    /partitionvalue2
        /tablename
            .d
            column1name
            column2name
            …
        …

- Kdb creates a virtual column for the partition variable. Its name cannot be controlled: it is taken from the partition type (e.g. date)
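
The layout above can be produced by hand with set, one splayed slice per partition directory. A minimal sketch, assuming a hypothetical root /tmp/kdbdemo/pdb and a table t without symbol columns (so no sym file is needed):

```q
/ one directory per date, each holding a splayed slice of table t
`:/tmp/kdbdemo/pdb/2015.01.01/t/ set ([] ti:09:30:00 09:31:00; p:101 102f)
`:/tmp/kdbdemo/pdb/2015.01.02/t/ set ([] ti:09:30:00 09:31:00; p:101.5 102.5)
/ map the database; this also changes the working directory to the root
system "l /tmp/kdbdemo/pdb"
/ date is the virtual partition column; it is not stored in any column file
r:select from t where date=2015.01.02
```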

### 14.3.2. Partition domain

- Kdb can have only a single partition domain
- If you need different granularity of a table, you need to create two separate databases
- How to partition the table?
    - Choose partition granularity based on what kind of units are most frequently queried

### 14.3.3. Creating partitioned tables

### 14.3.4. Working with partitioned tables

- You cannot do these operations on a partitioned table:
    - retrieve an entire row or a column (for obvious reasons)

- Exec does not work on partitioned tables, but there is a workaround:
    - exec … from select … from … 

- Always place the partition constraint first in a where clause with multiple conditions
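
Both rules can be seen on a toy database. The root /tmp/kdbdemo/pdb2 and the table contents are hypothetical and exist only for this sketch:

```q
/ build and map a two-day partitioned database (hypothetical root)
`:/tmp/kdbdemo/pdb2/2015.01.01/t/ set ([] ti:09:30:00 09:31:00; p:101 102f)
`:/tmp/kdbdemo/pdb2/2015.01.02/t/ set ([] ti:09:30:00 09:31:00; p:101.5 102.5)
system "l /tmp/kdbdemo/pdb2"
/ partition constraint first, so only the 2015.01.02 slice is touched
r:select from t where date=2015.01.02, p>102
/ exec workaround: select from the partitioned table, then exec from the in-memory result
ps:exec p from select p from t where date=2015.01.02
```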

### 14.3.5. The virtual column i in partitioned tables

- virtual column i refers to the relative row number within a partition, not the absolute one

### 14.3.6. Query execution on partitioned tables

- Columns are loaded into memory as follows:
    - kdb analyses the where phrase to determine which partitions are targeted by the query
    - processes the remaining where phrases to determine the column subdomains that must be loaded
    - processes the query separately against the requisite partitions to obtain partial results
    - combines the partial results to obtain the final result
- You can speed up the query execution by starting q with slaves. In this case the query will be executed concurrently, one partition per slave.

### 14.3.7. Map-reduce

- Map-reduce in general:
    - m-r decomposes an operation on a list into two sub-operations: map and reduce:
        - 1st step: op_map performed on each sublist -> list of partial results
        - 2nd step: op_reduce is performed on the combined partial results list
- Map-reduce for aggregation in queries, examples:
    - Computing sum of partitioned columns: computes sums for partitions then adds them up
    - Average: compute sum and count for partitions then calculates the average
- Exercise: do a sorting on a partitioned table (use recursion)
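
The average example can be worked through on plain lists. In this sketch the nested list parts stands in for the slices of one column across three partitions; the names are illustrative only:

```q
/ three "partitions" of one numeric column
parts:(1 2 3f;4 5f;6 7 8 9f)
/ map step: a (sum;count) pair for each partition
m:{(sum x;count x)} each parts
/ reduce step: total sum divided by total count
r:(sum m[;0])%sum m[;1]
/ same answer as averaging the whole column at once
r~avg raze parts
```

Averaging the per-partition averages would give a different, wrong answer whenever the partitions differ in size, which is why the map step has to carry both the sum and the count.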

- Map-reduce in queries on partitioned tables:
    - The challenge is to decompose the query into a map step and a reduce step.
    - The solution depends on whether the query involves aggregation or not.
    - Without aggregation:
        - produces a partial result table: the result of the query on each partition is the computed columns for the list of records in the partition matching the constraint
        - all the partial result tables conform, so you can take the __union__ of the partial result tables in order of their virtual partition column values
    - With aggregation:
        - Kdb+ recognizes the following aggregate functions to be map-reducible:
            - avg, cor, count, cov, dev, distinct, first, last, max, med, min, prd, sum, var, wavg, wsum

### 14.3.8. Multiple partitioned tables

- There can be only one partition domain in a given kdb+ root, but
- Multiple tables can share the same partitioning
    - All partitions must contain slices of all tables (they do not need to be populated, though)
        - Solution: create an empty copy of the table where it does not contain data
        - Warning: if you do not create a partition for a table, it will be __removed__

- See example [here](https://code.kx.com/q4m3/14_Introduction_to_Kdb+/#1438-multiple-partitioned-tables)

### 14.3.9. Examples of other partition domain types

- See examples [here](https://code.kx.com/q4m3/14_Introduction_to_Kdb+/#1439-examples-of-other-partition-domain-types)

### 14.3.10. Partitioned tables with links

- Create links between tables in the same partitioned database.
- Restriction: __only intra-partition links__ are allowed: a link column in a table partition can only refer to another table's partition in the same partition directory
    - E.g.: table partitions from the same day/month/year
- See example [here](https://code.kx.com/q4m3/14_Introduction_to_Kdb+/#14310-partitioned-tables-with-links)

## 14.4. Segmented tables

- A performance gain through partitioning can only be achieved when the queries are not I/O-bound, but in most cases they are.
- To speed up I/O-bound queries, the solution requires multiple I/O channels, so data retrieval and processing can occur in parallel.
- The third form of data decomposition, segmentation, solves exactly this problem: it creates segments in different directories (on different hard disk partitions) above the partition directory layer, and assigns separate threads to separate segments, each corresponding to a separate I/O channel.

- Table segmentation is an additional level on top of partitioning
- Segmentation spreads a partitioned table's records across multiple directories that have the same structure as the root directory in a partitioned database
- Each pseudo-root is called a segment; segments are __directories containing a collection of partition directories__.
- The segment directories are on independent I/O channels so that data retrieval and processing can occur in parallel.
- Any criteria can be used to decompose partitions into segments as long as results are conforming record subsets that are __disjoint and complete__ (they reconstitute the original table with no omissions or duplication).
- Segmentation can happen
    - along rows,
    - partitions or
    - a combination of them
    - but cannot occur only along columns

- A segmented table has a 3-dimensional persisted form:
    - cut vertically by splaying
    - cut horizontally by partitioning
    - cut across physical locations

- The segment directories must not reside under the root
- Only the sym file and the par.txt can reside in the root directory:
    - the par.txt contains the paths of the physical locations of the segments
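
A minimal sketch of this layout, assuming a Unix-like system. All paths are hypothetical: two segment directories /tmp/kdbdemo/seg1 and /tmp/kdbdemo/seg2 (on a real system each would sit on its own I/O channel) and a root /tmp/kdbdemo/segdb that holds nothing but par.txt.

```q
/ each segment holds ordinary partition directories with splayed slices of t
`:/tmp/kdbdemo/seg1/2015.01.01/t/ set ([] ti:09:30:00 09:31:00; p:101 102f)
`:/tmp/kdbdemo/seg2/2015.01.02/t/ set ([] ti:09:30:00 09:31:00; p:101.5 102.5)
/ the root contains only par.txt, one segment path per line
system "mkdir -p /tmp/kdbdemo/segdb"
`:/tmp/kdbdemo/segdb/par.txt 0: ("/tmp/kdbdemo/seg1";"/tmp/kdbdemo/seg2")
/ map the segmented database and query it like a partitioned one
system "l /tmp/kdbdemo/segdb"
r:select from t where date within 2015.01.01 2015.01.02
```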

- Example 1 for segmentation:
    - where segmentation is __orthogonal to partitioning__: each partition belongs to only one segment
        - segmentation and partitioning happen along the same variable: date
    - segment the table by bucketing trades into alternating days of the week.

- E.g.2:
    - segment by alphabet: symbols from a-m and n-z are in separate segments
- E.g.3:
    - segment by stock exchange: symbols from a particular stock exchange are in the same segment
- In examples 2 and 3, segmentation is __NOT orthogonal to partitioning__:
        - one partition is divided into/spans across multiple segments

- E.g.4: non-uniform segmentation:
    - Some partitions span segments, some do not

### 14.4.2. Segmentation vs. partitions

[SEE COMPARISON MATRIX HERE](https://code.kx.com/q4m3/14_Introduction_to_Kdb+/#1442-segmentation-vs-partitions)

### 14.4.3. Creating segmented tables

- MISSING: see examples here

### 14.4.4. Multiple segmented tables

- Multiple tables that share the same partition can also be segmented.
- These tables can be distributed differently across the segmentation
- However, if you want to use links or joins between these segmented tables, you have to distribute them similarly across the different segments

- Example of table t and q distributed similarly across different segments:

- See example [here](https://code.kx.com/q4m3/14_Introduction_to_Kdb+/#1444-multiple-segmented-tables)

- __Starting q with slaves__:
    - $ q -s 2 (2 is the number of slaves)

In [None]:
// use peach to run the per-date as-of joins in parallel, one date per slave
aj1:{[d] aj[`sym`ti;select from t where date=d; select from q where date=d]}
raze aj1 peach 2015.01.01 2015.01.02

### 14.4.5. Query execution against segmented tables

- Design principles for segmentations:
    1. Maximize the number of independent I/O channels to retrieve data in parallel: n (several HDDs or SSDs)
    2. Maximize server memory to allocate each slave thread as much memory as it needs
    3. Create n segments to spread data retrieval across the n I/O channels
    4. Open at least n slave threads.

- In such an environment, kdb decomposes a query into two steps of mapping and reducing:
    - Map: a revised form of the original query that executes on each segment
    - Reduce: aggregate the segment results
- This keeps the preliminary calculations as close to the data as possible while performing the aggregation centrally at the last step.

- __Parallel query execution of map-reduce operations on segmented tables__:
    - Step 1: Determining the __target partition footprint__ on each segment
        - kdb+ compares the query’s requisite partitions to the segment layout in par.txt.
        - __The result is a nested list__, each item being the partition list for one segment.
    - Step 2: __Map step execution__ in a specific segment:
        - kdb+ creates a revised query containing the map sub-operation from the original query,
        - the __peach__ command dispatches the revised query to all n slaves.
            - Each slave is provided the partition list created in _step 1_ for one segment and
            - each slave computes the revised query for its dedicated segment.
            - For example, the revised query for avg is: "Compute the sum and count of the sublist"

- __Query execution within one slave/segment__:
    - In a single slave, the revised query is applied against a segment’s partition footprint.
    - Here kdb+ sequentially applies the map sub-operations of the original query across the targeted partitions to obtain partition result tables,
    - then these result tables are collected into a list representing one segment result.
- __Query execution across slaves/segments__ ignoring partition details:
    - At this level, the original query’s map step has n slaves
        - retrieving segment data in parallel and
        - calculating segment results.
    - __Raze__: command flattens the nested list of segment results and
    - __which command???__: reorders the segment results by partition value.
- __Original reduce step__ is applied to combine the full list of ordered partition results into the query result table.

### 14.4.6. Balancing slaves and cores: channel and core utilization

- To achieve 100% saturation of I/O and CPU, we have to optimize slaves and cores.
- __These are only guidelines__:
- __Kdb uses only as many slaves to process the query as there are segments in the query footprint__
- There are two cases:
    - __I/O bound optimization__: when the query has light calculation requirement but intensive I/O load:
        - use n channels => n segments => n slaves => n cores
        - Example: Volume Weighted Average Price (VWAP) calculation
    - __Balanced I/O-compute__: both I/O and calculation are intensive:
        - in this design,
            - while one slave waits for the data to be loaded via one channel, another slave on the same core can do the computation on data already in memory
            - and vice versa: while one slave is crunching data already loaded into memory, another slave can load another segment through a free channel
        - So to maximize channel and core utilization, we want two slaves on each core, and two segments for each channel:
            - __n channels => 2n segments => 2n slaves => n cores__
        - Example: regression analysis

- In practice,
    - start with one scenario,
    - construct the initial configuration
    - test it with your data, typical queries and
    - simulate a realistic user load
    - monitor the I/O saturation and CPU utilization
    - adjust the number of slaves and cores allocated to the q process accordingly.

### 14.4.7. Sample performance data

- See example [here](https://code.kx.com/q4m3/14_Introduction_to_Kdb+/#1447-sample-performance-data) for a configuration of slaves and segments.

## 14.5. [Utilities for splaying and partitioning](https://code.kx.com/q4m3/14_Introduction_to_Kdb+/#145-utilities-for-splaying-and-partitioning)

- The __.Q namespace__ contains the functions for creating and maintaining splayed and partitioned tables
- Kx does not officially support customer use of the functions in the .Q namespace; still, everybody uses them
- [Reference for the functions](https://code.kx.com/v2/ref/#q) in the .Q namespace

- .Q.qp: is partitioned
- [.Q.en](https://code.kx.com/v2/kb/splayed-tables/#enumerating-varchar-columns-in-a-table): enumerate varchar columns (of symbol type)
- [.Q.pv](https://code.kx.com/q4m3/14_Introduction_to_Kdb+/#1453-qpv): list of partition slice directories in the database found in the root
- .Q.ind: getting individual rows of a partitioned table by index:
    - .Q.ind[trade;til 100] is the same for a partitioned table as 'select from table where i<100' for an in-memory table
- .Q.k: returns interpreter version number used in the process
- .Q.l: the functional implementation of \l
- .Q.M: returns long integer infinity 0W~.Q.M
- .Q.MAP: keep partitions mapped to memory to avoid repeated file system calls during a select
    - usage: \l /dirName/ then .Q.MAP
    - not recommended to use with compressed files
- .Q.opt .z.x: returns dictionary of command line arguments
- .Q.dpft: The utility .Q.dpft assists in creating partitioned and segmented tables by incorporating the functionality of .Q.en at a slightly higher level. It is convenient when partitions are loaded and written out iteratively.
- .Q.fs: process large txt files (that do not fit into memory) in chunks
- .Q.fsn: like .Q.fs, but takes the chunk size as an additional parameter
- .Q.chk: writes an empty splayed slice of a table in a partition directory where it is missing
- .Q.view: restricts queries against partitioned or segmented tables to a chosen subset (view) of partitions
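
As an illustration of how these utilities fit together, the sketch below writes one daily partition with .Q.dpft and then maps the database. The root /tmp/kdbdemo/dpftdb and the table contents are hypothetical; .Q.dpft expects the table to be a global variable and is passed its name, not its value.

```q
/ an in-memory table with a symbol column
t:([] ti:09:30:00 09:31:00; sym:`ibm`msft; p:101 33f)
/ arguments: database root, partition value, column to apply the parted attribute to, table name
/ the symbol column is enumerated over the sym file in the root as part of the call
.Q.dpft[`:/tmp/kdbdemo/dpftdb; 2015.01.01; `sym; `t]
/ map the database and query the partition that was just written
system "l /tmp/kdbdemo/dpftdb"
r:select from t where date=2015.01.01
```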

## 14.6. Kdb database

### 14.6.1 Comparing kdb+ to an RDBMS

- Fundamental difference:
    - Tables are based on:
        - Kdb: lists (ordered with duplicate elements)
        - SQL: sets (unordered, distinct elements)
    - Data format for storage:
        - Kdb: as contiguous items in column lists
        - RDBMS: as fields with non-contiguous rows
    - Tables operations:
        - Kdb: vector operations on columns
        - SQL: scalar operations on individual fields and rows
- More differences:
    - Table creation:
        - Kdb: functionally in q language
        - SQL: defined declaratively in DDL on disks
    - Data persistence:
        - Kdb: Serialized q entities stored in O/S file system; no separate metadata
        - SQL: tables and related metadata stored in an opaque repository by row
    - Data access:
        - Kdb: direct data access in q. Query forms are in q-sql for table manipulation
        - SQL: DDL for accessing metadata; SQL as language for accessing data
    - Memory residence:
        - Kdb: tables live in memory but can be persisted to disk; column subsets are page faulted into memory for mapped tables
        - SQL: Tables reside on disk; query result sets reside in program memory
    - Data modification in memory:
        - Kdb: memory resident tables modifiable via q and q-sql
        - SQL: no data modification in memory
    - Data modification on disks:
        - Kdb: only with append (upsert) via q
        - SQL: INSERT, UPDATE via SQL
    - Data programming:
        - Kdb: programs written in q, which is an integrated vector functional programming language; tables are first class entities
        - SQL: declarative relational programming language; programs are stored procedures written in proprietary procedural language
    - Transactions:
        - Kdb: no built-in transaction support
        - SQL: support for transactions via COMMIT and ROLLBACK

### 14.6.2 The Physical Layout of a kdb+ Database

- A kdb+ database is a file system directory and its subdirectories holding q entities.
- The root directory is the root of the database
- All constituents of the database are q entities saved in files
- Database entities
    - either reside at some level under the root (splayed and partitioned tables)
    - or referenced in the par.txt file under the root (segmented tables)

#### 14.6.2.1. The sym file

- The sym file is an optional serialized q data file
    - containing a list of unique symbols
        - that are used as the domain for symbol enumeration
- Placing the sym file in the root directory guarantees that it will be loaded into memory at startup
- According to convention: all symbol columns from all tables are enumerated over a single domain sym
    - Multiple domains are allowed; however, the .Q utilities handling symbol enumeration work only with a single sym domain
    - With symbols in multiple domains, the ~ function won't work because they will have different enumeration types
- Corrupting the sym file will result in irresolvable symbol columns in the database:
    - Safeguards:
        - use conditional enumeration (`fileHandle/sym? or the appropriate .Q utilities: .Q.en etc.)
        - built-in file locking to mediate concurrent updates

#### 14.6.2.2 Other Serialized Files in Root

- By placing any serialized q entity into the root directory, you can have it loaded into a variable with the same name as the file
- E.g. a small keyed table can be initialized this way

#### 14.6.2.3 Scripts

- Any q code in the root directory can be loaded on startup:
- E.g.: functions defined in such a script can be viewed as stored procedures for the database
- Alternatively, you can create QINIT directory and place all your files in it that you want to load upon q process startup

#### 14.6.2.4 Splayed tables

- splayed table directories must be immediately under the root
- all columns are mapped into memory, but only a few of them will reside simultaneously in the memory
- recently accessed columns are cached by the OS; this has a performance close to that of in-memory tables

#### 14.6.2.5 Partitioned tables

- Partition directories should also be under the root directory
- They must have a uniform structure:
    - each partition must contain splayed directories for all tables, even for tables with no data in that partition -> splay an empty schema for a table with no records in a partition
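
Splaying an empty schema could look like the sketch below. It assumes the `/db` root and the trade table `t` (columns `ti`, `sym`, `p`) from the example in section 14.7; the partition `2015.01.03` is hypothetical and stands for a day with no trades.

```q
/ assumption: /db and table t are those of section 14.7; 2015.01.03 is a made-up empty day
/ the column types must match the populated partitions: ti is second, sym symbol, p float
`:/db/2015.01.03/t/ set .Q.en[`:/db;] ([] ti:`second$(); sym:`symbol$(); p:`float$())
```

The `.Q.chk` utility is meant to fill such gaps across a whole database, so writing empty tables by hand is mainly useful for understanding what the uniform structure requires.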

#### 14.6.2.6 Segmented tables

- No records of a segmented table can reside under the root directory
- Only the par.txt file can be in the root with one entry per line
    - each entry is an OS path to a segment directory, which holds that segment's share of the data
- Symlinks can also be in the root directory

### 14.6.3. Creating and populating a kdb+ database

- Point q to a directory at startup
    - $$: q directoryName/ or
    - \l /directoryName
- That directory becomes the root directory for the kdb+ database and also the current working directory for the OS. We shall refer to this scenario as kdb+ startup to distinguish it from an arbitrary q session. We shall cover the items that Kdb+ startup finds in the order that it handles them:
    - Serialized q entities
    - Splayed tables
    - Partitioned or segmented tables
    - Scripts

#### 14.6.3.2 Serialized q entities

- Serialized data files are loaded into a variable with the same name as the file
    - E.g.: sym file
- Only load data files that are in the root
- Files with extensions in their names will not be loaded
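
A small round trip shows the naming rule. This is a sketch under the assumption that `db` is a scratch directory relative to the current directory (a hypothetical name, not a path from the text).

```q
/ assumption: db is a hypothetical scratch root relative to the current directory
`:db/LIFE set 42                        / serialize an atom into the file db/LIFE
`:db/lookup set ([s:`a`b`c] v:1 2 3)    / serialize a small keyed table
get `:db/LIFE                           / read the file back explicitly: 42

/ load the root; each extension-less data file becomes a variable named after the file
\l db
LIFE                                    / 42
(lookup `b)`v                           / 2
```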

#### 14.6.3.3 Splayed tables

- Subdirectories under the root directory that are recognized as splayed tables are mapped into memory automatically at startup (the sym file holding the enumeration domain of all symbol columns in the table should be in the root directory)

#### 14.6.3.4 Partitioned tables

- Subdirectories of root folder recognized as valid partition values are also mapped
- Root can contain both partitioned and splayed subdirectories
- Partitioned and segmented tables are mutually exclusive???

#### 14.6.3.5 Segmented tables

- kdb+ identifies segmented tables through the presence of a par.txt file, which contains the paths to the different segments of the table.
- Valid segmented tables are then mapped into memory
- Both segmented and splayed tables can be in the root directory
- Partitioned and segmented tables are mutually exclusive???

#### 14.6.3.6 Scripts (fileName.q)

- Files with the .q extension are interpreted as q scripts
- Files with the .k extension are interpreted as k scripts
- Best practice: put one script file in the root, which contains the paths to the files to be loaded in the order we want them to be loaded
- Invalid code in the script aborts the loading of the whole script

## 14.7. Putting it all together

- Example code for creating a partitioned database

In [None]:
/ create serialized variables
`:/db/LIFE set 42
`:/db/f set {x*y}
`:/db/lookup set ([s:`a`b`c] v:1 2 3)

/ create splayed tables
`:/db/tref/ set ([] c1:1 2 3; c2:1.1 2.2 3.3)
`:/db/cust/ set .Q.en[`:/db;] ([] sym:`ibm`msft`goog; name:`:/db/sym?`oracle`microsoft`google)

/ create partitioned tables
`:/db/2015.01.01/t/ set .Q.en[`:/db;] ([] ti:09:30:00 09:31:00; sym:`ibm`msft; p:101 33f)
`:/db/2015.01.02/t/ set .Q.en[`:/db;] ([] ti:09:30:00 09:31:00; sym:`ibm`msft; p:101.5 33.5)
`:/db/2015.01.01/q/ set .Q.en[`:/db;] ([] ti:09:30:00 09:31:00; sym:`ibm`msft; b:100.75 32.75; a:101.25 33.25)
`:/db/2015.01.02/q/ set .Q.en[`:/db;] ([] ti:09:30:00 09:30:00; sym:`ibm`msft; b:101.25 33.25; a:101.75 33.75)

/ create load script
`:/db/init.q 0: ("TheUniverse:42";"\\l /lib/math.q";
 "\\l /lib/expr.q")
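
Once the cell above has written its files, loading and querying the database might look like the sketch below. It assumes the files exist under `/db` exactly as written and that the `/lib/math.q` and `/lib/expr.q` scripts named in `init.q` are present; otherwise the load script will fail.

```q
/ load the database root written by the preceding cell
\l /db

/ date is a virtual column derived from the partition directory names
select from t where date=2015.01.01
select sym, b, a from q where date within 2015.01.01 2015.01.02, sym=`ibm

/ serialized entities are ordinary variables after the load
LIFE
f[6;7]
```

Putting the `date` constraint first in the `where` clause lets kdb+ restrict the query to the matching partitions before looking at any column data.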

- Example code to create a segmented database:

In [None]:
/ create serialized variables
`:/db/LIFE set 42
`:/db/f set {x*y}
`:/db/lookup set ([s:`a`b`c] v:1 2 3)

/ create splayed tables
`:/db/tref/ set ([] c1:1 2 3; c2:1.1 2.2 3.3)
`:/db/cust/ set .Q.en[`:/db;] ([] sym:`ibm`msft`goog; name:`oracle`microsoft`google)

/ create segmented tables
extr:{[t;r] select from t where (`$1#'string sym) within r}
t:.Q.en[`:/db;] ([] ti:09:30:00 09:31:00; sym:`ibm`t; p:101 17f)
q:.Q.en[`:/db;] ([] ti:09:29:59 09:29:59 09:30:00; sym:`ibm`t`ibm; b:100.75 16.9 100.8; a:101.25 17.1 101.1)
`:/am/2015.01.01/t/ set extr[t;`a`m]
`:/nz/2015.01.01/t/ set extr[t;`n`z]
`:/am/2015.01.01/q/ set extr[q;`a`m]
`:/nz/2015.01.01/q/ set extr[q;`n`z]
t:.Q.en[`:/db;] ([] ti:09:30:00 09:31:00; sym:`t`ibm; p:17.1 100.9)
q:.Q.en[`:/db;] ([] ti:09:29:59 09:29:59 09:30:00; sym:`t`ibm`t; b:17 100.7 17.1;a:17.2 101.25 17.25)
`:/am/2015.01.02/t/ set extr[t;`a`m]
`:/nz/2015.01.02/t/ set extr[t;`n`z]
`:/am/2015.01.02/q/ set extr[q;`a`m]
`:/nz/2015.01.02/q/ set extr[q;`n`z]

`:/db/par.txt 0: ("/am"; "/nz")

/ create load script
`:/db/init.q 0: ("TheUniverse:6*7"; "\\l /lib/math.q"; "\\l /lib/expr.q")

## 14.8. QHOME

### 14.8.1. Environment variables (QHOME, QLIC, QINIT)

- Three environment variables QHOME, QLIC and QINIT are used by kdb+ at startup.
- __QHOME__ specifies the directory where kdb+ expects to find the bootstrap file q.k. By default, it also looks there for the license file k4.lic. If QHOME is not defined, kdb+ falls back to $HOME/q for Unix-based systems and c:\q for Windows.
- __QLIC__ overrides the default location for the license file. If QLIC is not defined, kdb+ falls back to QHOME (or its fallback).
- __QINIT__ specifies the name of the file that is executed immediately after the load of q.k. If QINIT is not defined, kdb+ attempts to load the file __q.q__ from QHOME. If QHOME is not defined or q.q is not found, no error is reported.
    - QINIT is executed in the root context

### 14.8.2. q in the hood

- Starting a q session with a bare 'q' command, without specifying a directory, makes the current directory the working directory
- loading a script from outside the working root directory does not change the root directory
- loading a database changes the current directory to the database root
- q -u: controls how much of the file hierarchy is visible to the session
- when loading a file without specifying its path,
    - first, q searches for it in the current directory, if not found
    - second, in QHOME, if not found,
    - third, in $HOME
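
The effect of loading a database on the working directory can be observed with the `\cd` system command. A sketch, assuming `/db` is the database root created in section 14.7.

```q
/ show the current working directory of the session
\cd

/ assumption: /db is the database root from section 14.7
\l /db

/ the working directory should now be the database root
\cd
```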

# APPENDIX A: [Built-in functions](https://code.kx.com/q4m3/A_Built-in_Functions/)

Functions I already used and know
- Communication with OS
    - getenv ``environmentVariableName: returns environment variable value
    - setenv: `varName setenv varValueAsString
    - system
- Communicating with the q interpreter:
    - parse
    - eval
    - eval parse someQEntity
- String manipulation
    - ss: string search
    - ssr: string search and replace
    - like: pattern matching
    - sv: join
    - vs: split
    - upper: convert to upper case
- Data structure manipulation:
    - raze for lists
    - ungroup for tables
    - idesc
    - iasc
    - where
    - within
- Mathematical functions
    - til: the first n non-negative integers
    - var: variance
    - sum: sum
    - avg: arithmetic mean
    - wavg: weighted average
    - wsum: weighted sum
    - xrank: for computing quantiles
- Multiple overloaded functions:
    - value
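
A few of the functions above in action. This is a quick sketch with made-up inputs, meant as a memory aid and not as a full reference.

```q
/ string manipulation
"hello world" ss "o"                  / 4 7  (positions of the matches)
ssr["hello world";"o";"0"]            / "hell0 w0rld"
"hello" like "he*"                    / 1b
"," vs "ab,cd"                        / ("ab";"cd")
"," sv ("ab";"cd")                    / "ab,cd"
upper "abc"                           / "ABC"

/ data structure manipulation
raze (1 2;3 4)                        / 1 2 3 4
ungroup ([] k:`a`b; v:(1 2;3 4))      / one row per nested item: four rows
iasc 30 10 20                         / 1 2 0  (indices that sort ascending)
idesc 30 10 20                        / 0 2 1  (indices that sort descending)
where 0101b                           / 1 3
5 within 1 10                         / 1b

/ mathematical functions
til 5                                 / 0 1 2 3 4
1 1 2 wavg 10 20 30                   / 22.5
4 xrank 10 20 30 40 50 60 70 80       / 0 0 1 1 2 2 3 3  (quartile buckets)
```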

# APPENDIX B: [Error messages](https://code.kx.com/q4m3/B_Error_Messages/)

## B.1. Runtime errors

- access: you cannot read files above the root directory, or your user/password is invalid
- assign: attempt to assign to a reserved word
- conn: too many incoming connections; max. 1022
- domain: argument out of domain; e.g.: til -1
- glim: limit on the number of `g# attributes is exceeded. There is no limit from q 3.2 onward
- length: incompatible list lengths
- limit:
    - attempting to create a list longer than the limit
    - attempt to serialize an object > 2GB
- loop: circular reference loop; a::b::a
- mismatch: columns cannot be aligned for operation
- mlim: nested column limit 999 is exceeded
- nyi: not yet implemented
- os: operating system error
- pl: peach cannot handle parallel lambdas
- Q7
- rank: invalid rank or valence
- type: wrong type
- value: missing value
- vd1: attempted multithread update
- wsfull: memory allocation failed due to running out of swap or hitting the -W limit
- xxx: xxx undefined
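
Several of these errors are easy to reproduce safely with protected evaluation, which hands the error text to a handler instead of aborting. A minimal sketch; the handler `{x}` simply returns the message, and `undefinedName` is a hypothetical name assumed not to exist in the session.

```q
/ @[f;arg;handler] and .[f;args;handler] trap the error and pass its text to the handler
@[til; -1; {x}]                   / "domain"
.[+; (1 2; 1 2 3); {x}]           / "length"
.[+; (1; `a); {x}]                / "type"

/ assumption: undefinedName is not defined in this session
@[value; "undefinedName"; {x}]    / "undefinedName"
```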

In [1]:
2 xexp 10

1024f


## B.2. Parse errors

- unpaired item: ")}[({]
- branch: a branch more than 255 byte codes away
- char: invalid character
- constants: too many constants; max 96
- globals: too many global variables: 255 max
- locals: too many local variables: 24 max
- params: too many parameters in function: 8 max

## B.3. System errors

- xxx:yyy
    - xxx is a kdb message. xxx can be:
        - addr
        - close
        - conn
        - p from -p
        - snd
        - rcv
        - fileName (invalid)
    - yyy is the OS message

## B.4. License errors

- cores: exceeded number of licensed cores
- exp: expiry date passed
- host: unlicensed host
- k4.lic: k4.lic file not found
- os: unlicensed os
- srv: attempt to use client-only license in server mode
- upd: attempt to use kdb version more recent than update date
- user: unlicensed user
- wha: invalid system date