# Notes before starting

1. Q is interpreted from right to left, with no operator precedence. E.g. 10*5+2 = 70 (that is, 10*7), not 52.
1. In Q, assignment is :, not = as in Python
1. In Q, a comment starts with /, not # as in Python
1. This Jupyter Notebook is run under a Q kernel. To install, see https://github.com/KxSystems/jupyterq
   * Python code block starts with /%python
   * Q code block starts with /%q (optional)
1. Some Q data type notations
   * Boolean: 1b denotes true, 0b denotes false. We'll see later that a boolean can be handled like an integer
   * Date: Like 2019.06.16. A date can also be handled like an integer, where 0 represents 2000.01.01
   * Character: Like "a". A list of characters, also called a string, is written like "abc".
   * Symbol: Like `abc, an atomic representation of a string. The characters of a symbol cannot be accessed directly, unlike those of a string, which is a list.
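
The right-to-left rule in the first note can be mimicked in Python. The sketch below is only an illustration: `eval_right_to_left` is a hypothetical helper, not part of Q or any library, and it assumes a flat list of numbers separated by the operators +, - and *.

```python
import operator

# Hypothetical operator table for this sketch only
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_right_to_left(tokens):
    """Evaluate a flat list such as [10, "*", 5, "+", 2] from right to left,
    ignoring the usual operator precedence."""
    result = tokens[-1]
    # Walk leftwards over (operand, operator) pairs
    for i in range(len(tokens) - 2, 0, -2):
        left = tokens[i - 1]
        result = OPS[tokens[i]](left, result)
    return result

print(eval_right_to_left([10, "*", 5, "+", 2]))  # 70, i.e. 10*(5+2)
print(10 * 5 + 2)                                # 52, Python's own precedence
```

The same helper gives 7 for 10-5-2, because the rightmost subtraction runs first: 10-(5-2).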

# Processing a list of boolean

In [1]:
/%python
from numpy.random import seed, rand, randn, randint
from datetime import datetime, timedelta  
import numpy as np
import pandas as pd
import string

## Generate a list of 10 integers

In [2]:
/%python
# Generate a list of 10 integers
print(list(range(10)))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [3]:
/%q
til 10

0 1 2 3 4 5 6 7 8 9


## Find which element is >= 5

In [4]:
/%python
# Find which element is >= 5. Here we use a for loop
print([x >= 5 for x in list(range(10))])

[False, False, False, False, False, True, True, True, True, True]


### Replace for-loop with vector processing

In [5]:
/%python
# Use vector processing via numpy instead of a for loop
# As you can see, the >= operation is broadcast across the entire numpy array
print(np.array(range(10))>=5)

[False False False False False  True  True  True  True  True]


In [6]:
/%q
/ Similarly in Q, the >= operation is applied atomically across the entire list
/ the suffix 'b' indicates boolean
(til 10) >=5 

0000011111b


## Output 1 if number is >= 5, else 0 (like thresholding the output of a sigmoid function)

In [7]:
/%python
# Output 1 if number is >= 5, else 0 (like thresholding the output of a sigmoid function), in a for loop
print([1 if x>=5 else 0 for x in list(range(10))])

[0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


### Replace for-loop with vector processing

In [8]:
/%python
# vector processing way
# It is very convenient to be able to handle boolean like an integer and 
# apply arithmetic operations against it, without using a verbose if-else statement.
print((np.array(range(10)) >= 5) + 0)

[0 0 0 0 0 1 1 1 1 1]


In [9]:
/%q
((til 10) >=5 ) + 0

0 0 0 0 0 1 1 1 1 1
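
Treating booleans as integers is useful beyond the `+ 0` trick. The following is a minimal sketch, assuming only numpy; the variable names (`mask`, `count`, `total`) are made up for illustration.

```python
import numpy as np

values = np.arange(10)
mask = values >= 5                    # boolean array

as_int = mask.astype(int)             # explicit cast, same values as mask + 0
count = int(mask.sum())               # True counts as 1, so the sum is the number of matches
total = int((values * mask).sum())    # multiplying by the mask zeroes out non-matches

print(as_int)
print(count, total)
```

Here `count` is how many elements satisfy the condition and `total` is the sum of only those elements, both computed without an if-else.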


# Processing a list of dates

## get a list of incremental dates starting today

In [10]:
/%python
print([datetime.today() + timedelta(x) for x in range(10)])

[datetime.datetime(2019, 6, 17, 20, 42, 41, 249308), datetime.datetime(2019, 6, 18, 20, 42, 41, 249308), datetime.datetime(2019, 6, 19, 20, 42, 41, 249308), datetime.datetime(2019, 6, 20, 20, 42, 41, 249308), datetime.datetime(2019, 6, 21, 20, 42, 41, 249308), datetime.datetime(2019, 6, 22, 20, 42, 41, 249308), datetime.datetime(2019, 6, 23, 20, 42, 41, 249308), datetime.datetime(2019, 6, 24, 20, 42, 41, 249308), datetime.datetime(2019, 6, 25, 20, 42, 41, 249308), datetime.datetime(2019, 6, 26, 20, 42, 41, 249308)]


### Replace for-loop with vector processing

In [11]:
/%python
# As in the boolean case above, the date is broadcast across the array
print( np.datetime64('today') + range(10) )

['2019-06-17' '2019-06-18' '2019-06-19' '2019-06-20' '2019-06-21'
 '2019-06-22' '2019-06-23' '2019-06-24' '2019-06-25' '2019-06-26']


In [12]:
/%q
/ Similar concept as above
.z.D + til 10

2019.06.17 2019.06.18 2019.06.19 2019.06.20 2019.06.21 2019.06.22 2019.06.23 ..
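
The idea that a date behaves like an integer can also be checked on the numpy side. This is a sketch under the assumption of day-resolution `datetime64` values; it shows numpy's internal day count only, and a fixed start date replaces 'today' so the result is reproducible.

```python
import numpy as np

start = np.datetime64("2019-06-17")    # fixed date instead of 'today'
dates = start + np.arange(10)          # ten consecutive days

day_numbers = dates.astype("int64")    # integer day counts since numpy's epoch, 1970-01-01
gaps = np.diff(day_numbers)            # consecutive dates differ by exactly 1

print(dates[0], day_numbers[0])
print(gaps)
```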


# Processing of string

## Generate a list of 10 strings (or symbols in Q) with 3 random characters

In [13]:
/%python
char_array = np.random.choice(list(string.ascii_lowercase), size=[10, 3])
print( [''.join(arr) for arr in char_array] )

['uip', 'mas', 'ypz', 'dls', 'whe', 'mgz', 'pxs', 'eid', 'sjr', 'cjg']


In [14]:
/%q
// Use the built-in ? operator
10?`3

// Another way is to mimic the Python code, 
// but since it requires a deeper understanding of Q vector processing concepts, we won't go through the details.
//`$() {y;x,"abcdefghijklmnopqrstuvwxyz"[3?26]}/: 10#0


`mil`igf`kao`baf`kfh`jec`kfm`lkk`kfi`fgl


## Append characters to a list of strings

In [15]:
/%python
char_array = np.random.choice(list(string.ascii_lowercase), size=[10, 3])
print( [''.join(arr) + ".n" for arr in char_array] )

['kbe.n', 'jbj.n', 'qoo.n', 'cof.n', 'uso.n', 'oad.n', 'bjh.n', 'aoq.n', 'mmd.n', 'ljc.n']


### Replace for-loop with lambda + vector processing

In [16]:
/%q
// Since symbol can't be manipulated directly, we first cast it to a string (ie list of characters),
// append ".n" to the end, then re-cast it back to symbol. The operator to append string is ",".
// To break down the chain of operations... (remember, Q is interpreted from right to left)
//    string(2?`3) converts (`aaa;`bbb) --> ("aaa";"bbb")
//    {x,".n"} : {} represents a lambda, x is the first (implicit) parameter. The lambda appends ".n" to the input.
//    "each" is like Python's map (or pandas' apply). It applies the lambda to every item of the list passed to it, without a for-loop!
//    {x,".n"} each string(2?`3) converts (`aaa;`bbb) --> ("aaa.n";"bbb.n") 
//    `$ re-cast string to symbol, against the entire list without any for-loop!
`${x,".n"} each string(10?`3)

`enf.n`plh.n`nni.n`glc.n`gkp.n`bgh.n`ifn.n`foh.n`kdj.n`eeg.n
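
For comparison, the Python side can also drop the list comprehension. The sketch below shows two options: `map` with a lambda, which is the closest analogue of `each`, and `np.char.add`, which broadcasts the suffix over the whole array. The three sample strings are copied from the earlier Python output.

```python
import numpy as np

strings = ["uip", "mas", "ypz"]   # sample values from the earlier Python output

# map + lambda: apply the function to every item of the list
with_map = list(map(lambda s: s + ".n", strings))

# np.char.add: element-wise string concatenation, with the suffix broadcast
with_numpy = np.char.add(np.array(strings), ".n")

print(with_map)
print(with_numpy.tolist())
```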


# Hold on... why bother with vector processing?
The benefits become more obvious as we handle multi-dimensional data and large volumes of data.

## Performance
Let's use element-wise multiplication of matrices as an example. The dimension used here is 300 (features) by 1000 (samples), a relatively small dataset in the world of machine learning. Even at this size, vector processing can outperform the for-loop by more than 20 times (the exact ratio depends on the machine). The gap widens as the dimension (say, the number of samples) increases.

In [17]:
/%python
matrix_a = np.zeros((300,1000))  # first input
matrix_b = np.zeros((300,1000))  # second input (a separate array, not an alias of matrix_a)
matrix_c = np.zeros(matrix_a.shape)   # to store the result

### For-loop

In [18]:
/%python
for i in range(matrix_a.shape[0]):
  for j in range(matrix_a.shape[1]):
    matrix_c[i][j] = matrix_a[i][j]*matrix_b[i][j]

print( matrix_c )

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### Vector processing

In [19]:
/%python
print( np.multiply(matrix_a, matrix_b) )

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
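
To measure the difference on your own machine, the two approaches can be timed side by side. This is a rough sketch, not a rigorous benchmark: it uses a single run of `time.perf_counter`, random inputs instead of zeros, and a hypothetical helper `multiply_loop`. The printed timings will vary from machine to machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)        # seeded so both approaches see the same data
a = rng.random((300, 1000))
b = rng.random((300, 1000))

def multiply_loop(x, y):
    """Element-wise product using explicit Python loops."""
    out = np.zeros(x.shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = x[i, j] * y[i, j]
    return out

t0 = time.perf_counter()
loop_result = multiply_loop(a, b)
t1 = time.perf_counter()
vector_result = np.multiply(a, b)
t2 = time.perf_counter()

print(f"for-loop: {t1 - t0:.4f}s, vectorized: {t2 - t1:.4f}s")
print("same result:", np.allclose(loop_result, vector_result))
```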


## Readability
As the example above shows, the notation used in vector processing is more readable. When you have a complex model or business logic to implement, this can make a real difference.

# Finally, let's store those lists in a table, in memory and on disk

In [20]:
/%python
booleans = np.array(range(10))>=5
dates = np.datetime64('today') + range(10) 
char_array = np.random.choice(list(string.ascii_lowercase), size=[10, 3])
strings = [''.join(arr) + ".n" for arr in char_array]
floats = np.random.rand(10)*10
table = pd.DataFrame({ "b":booleans, "d":dates, "s": strings, "f":floats  })
print(table)

       b          d      s         f
0  False 2019-06-17  mer.n  7.304223
1  False 2019-06-18  tac.n  5.325464
2  False 2019-06-19  ygq.n  2.078310
3  False 2019-06-20  cga.n  5.494876
4  False 2019-06-21  tkv.n  7.014140
5   True 2019-06-22  zht.n  0.239879
6   True 2019-06-23  kvq.n  7.416130
7   True 2019-06-24  wzj.n  7.875459
8   True 2019-06-25  idj.n  3.385423
9   True 2019-06-26  bby.n  5.944745


In [21]:
/%q
booleans:(til 10) >=5;
dates: .z.D + til 10;
symbols: `${x,".n"} each string[10?`3];
floats:10?10.0;
table:([] b:booleans; d:dates; s:symbols; f:floats);
table

b d          s     f        
----------------------------
0 2019.06.17 nce.n 3.91543  
0 2019.06.18 jog.n 0.8123546
0 2019.06.19 cih.n 9.367503 
0 2019.06.20 hkp.n 2.782122 
0 2019.06.21 aea.n 2.392341 
1 2019.06.22 blm.n 1.508133 
1 2019.06.23 ooe.n 1.567317 
1 2019.06.24 jgj.n 9.785    
1 2019.06.25 cfl.n 7.043314 
1 2019.06.26 bpm.n 9.441671 


## Querying the table
Find entries where column b is true, f > 5, and d is more than 5 days from today

In [22]:
/%python
print( table[ (table.b * table.f > 5) & (table.d > np.datetime64('today')+5)] )

      b          d      s         f
6  True 2019-06-23  kvq.n  7.416130
7  True 2019-06-24  wzj.n  7.875459
9  True 2019-06-26  bby.n  5.944745


### Q-SQL
Q-SQL is a set of built-in SQL-like query templates in Q

In [23]:
/%q
// Find entries where column b is true, f > 5, and d is more than 5 days from today
// Here, we treat b as an integer by multiplying it with f.
// This also works: select from table where b = 1b, f > 5, ...
// We also treat d like an integer by adding a number of days to it
select from table where (f*b) > 5, d > .z.D+5

b d          s     f       
---------------------------
1 2019.06.24 jgj.n 9.785   
1 2019.06.25 cfl.n 7.043314
1 2019.06.26 bpm.n 9.441671
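
The pandas query can likewise be written with explicit boolean masks instead of multiplying b by f, mirroring the `where b = 1b, f > 5` form mentioned in the Q comment. The sketch below rebuilds the pandas table from the values printed earlier and fixes "today" at 2019-06-17 (an assumption made so the result is reproducible).

```python
import pandas as pd

# Rows copied from the pandas table printed earlier
table = pd.DataFrame({
    "b": [False] * 5 + [True] * 5,
    "d": pd.date_range("2019-06-17", periods=10),
    "s": ["mer.n", "tac.n", "ygq.n", "cga.n", "tkv.n",
          "zht.n", "kvq.n", "wzj.n", "idj.n", "bby.n"],
    "f": [7.304223, 5.325464, 2.078310, 5.494876, 7.014140,
          0.239879, 7.416130, 7.875459, 3.385423, 5.944745],
})

today = pd.Timestamp("2019-06-17")     # fixed stand-in for np.datetime64('today')
cutoff = today + pd.Timedelta(days=5)

# Each condition is its own mask; & combines them element-wise
result = table[table.b & (table.f > 5) & (table.d > cutoff)]
print(result)
```

This selects the same rows (6, 7 and 9) as the multiplication form, and reads closer to the stated conditions.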


## Save to and load from files

In [34]:
/%python
# CSV. Data types are lost.
table.to_csv('table.csv')

# Python binary. Data types are preserved.
table.to_pickle('table.pkl')
pd.read_pickle('table.pkl')

In [35]:
/%q
// CSV. Data types are lost.
save `:table.csv;

// KDB+ binary. Data types are preserved.
save `:table;
load `:table;
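
The difference between the two formats can be seen by reading the files back on the Python side. This is a small sketch with a two-row stand-in table written to a temporary directory; the column names follow the table above, and the paths are illustrative only.

```python
import os
import tempfile
import pandas as pd

table = pd.DataFrame({
    "b": [False, True],
    "d": pd.to_datetime(["2019-06-17", "2019-06-18"]),
    "f": [7.304223, 5.325464],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "table.csv")
    pkl_path = os.path.join(tmp, "table.pkl")

    table.to_csv(csv_path, index=False)
    table.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)      # types are re-inferred from text
    from_pkl = pd.read_pickle(pkl_path)   # types come back as they were saved

print(from_csv.dtypes)
print(from_pkl.dtypes)
```

After the CSV round trip the date column is no longer a datetime column unless it is parsed again, while the pickle round trip returns the original dtypes.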