### Load utility functions

In [1]:
\l ../../ml.q
.ml.loadfile`:utils/init.q

In [2]:
n:1000
xf:1000?100f
yf:1000?100f
xb:010010101011111010110111b
yb:000000000001000000111000b
xroc:rand each 1000#0b
yroc:asc 1000?1f
5#table:([]x:n?10000f;x1:1+til n;x2:reverse til n;x3:n?100f)
5#complextable:([]sym:n?`4;time:asc n?0D00:59:00.000;x:n?10000f;x1:1+til n;x2:reverse til n;x3:n?100f)

x        x1 x2  x3      
------------------------
3455.276 1  999 12.30845
7933.981 2  998 61.03279
2514.913 3  997 2.814231
4154.687 4  996 54.86936
5158.35  5  995 56.59052


sym  time                 x        x1 x2  x3       
---------------------------------------------------
mmib 0D00:00:01.826405818 9381.123 1  999 86.19983 
mkli 0D00:00:10.065060686 1335.608 2  998 71.78646 
piel 0D00:00:11.097194077 1155.152 3  997 67.29624 
mank 0D00:00:13.213437027 3288.032 4  996 0.3079086
hfni 0D00:00:13.668105890 3387.841 5  995 76.41682 


### stats.q Examples

The functions in this script compute statistics from the output of machine learning algorithms. Some examples of their use follow.

In [3]:
/ correlation matrix
.ml.utils.corrmat[table]
/ get descriptive statistics for a table
.ml.utils.describe[table]
/ given classification labels and predictions, produce a 2x2 confusion matrix
.ml.utils.confmat[xb;yb]

  | x          x1         x2          x3         
- | ---------------------------------------------
x | 1          -0.0162072 0.0162072   0.03213823 
x1| -0.0162072 1          -1          0.01134946 
x2| 0.0162072  -1         1           -0.01134946
x3| 0.03213823 0.01134946 -0.01134946 1          


     | x        x1       x2       x3       
-----| ------------------------------------
count| 1000     1000     1000     1000     
mean | 4953.491 500.5    499.5    49.77201 
std  | 2890.066 288.8194 288.8194 28.91279 
min  | 7.908894 1        0        0.1122762
q1   | 2491.828 250.75   249.75   24.38531 
q2   | 5000.222 500.5    499.5    49.96016 
q3   | 7453.287 750.25   749.25   74.98685 
max  | 9994.308 1000     999      99.98165 


0| 8 12
1| 1 3 
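
The four counts in the confusion matrix can be cross-checked with core q, without the toolkit. This is a minimal sketch, not the library's implementation. It assumes that `xb` holds the predictions and `yb` the true labels, an ordering inferred from the output above; the names `tn`, `fp`, `fn` and `tp` are introduced here for illustration only.

```q
/ sample vectors copied from the setup cell
xb:010010101011111010110111b   / treated as predictions (assumption)
yb:000000000001000000111000b   / treated as true labels (assumption)
tn:sum not xb|yb               / predicted 0, actual 0
fp:sum xb&not yb               / predicted 1, actual 0
fn:sum yb&not xb               / predicted 0, actual 1
tp:sum xb&yb                   / predicted 1, actual 1
(tn,fp;fn,tp)                  / rows: actual 0,1; columns: predicted 0,1
```

Under that assumption the result is `8 12` over `1 3`, which agrees with the matrix printed above.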


In [4]:
/ the following are a subset of the functions that can be used to compare predictions with real results
-1"Precision of the calculation is: ",string .ml.utils.precision[xb;yb;1b];
-1"Specificity of the calculation is: ",string .ml.utils.specificity[xb;yb;1b];
-1"Sensitivity of the calculation is: ",string .ml.utils.sensitivity[xb;yb;1b];
-1"Accuracy of the calculation is: ",string .ml.utils.accuracy[xb;yb];
-1"Calculation of mean-squared error: ",string .ml.utils.mse[xf;yf];
-1"Calculation of sum-squared error: ",string .ml.utils.sse[xf;yf];
-1"This is the t-score for two independent samples with unequal variances: ",string .ml.utils.tscoreeq[xf;yf];
-1"Calculate the area under an ROC curve for diagnostics for a binary classifier: ",string .ml.utils.rocaucscore[xroc;yroc];

Precision of the calculation is: 0.2
Specificity of the calculation is: 0.4
Sensitivity of the calculation is: 0.75
Accuracy of the calculation is: 0.4583333
Calculation of mean-squared error: 1648.344
Calculation of sum-squared error: 1648344
This is the t-score for two independent samples with unequal variances: 0.6049928
Calculate the area under an ROC curve for diagnostics for a binary classifier: 0.49934
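
The four classification scores above follow directly from the sample bit vectors, so they can be reproduced with core q. The sketch below is illustrative and is not the toolkit's code; it assumes `xb` is the prediction vector and `yb` the label vector (an ordering inferred from the printed values), and the names `prec`, `sens`, `spec` and `acc` are hypothetical.

```q
xb:010010101011111010110111b        / predictions (assumption)
yb:000000000001000000111000b        / true labels (assumption)
prec:(sum xb&yb)%sum xb             / TP%(TP+FP) = 3%15
sens:(sum xb&yb)%sum yb             / TP%(TP+FN) = 3%4
spec:(sum not xb|yb)%sum not yb     / TN%(TN+FP) = 8%20
acc:avg xb=yb                       / (TP+TN)%N  = 11%24
prec,sens,spec,acc                  / 0.2 0.75 0.4 0.4583333
```

The regression metrics are consistent in the same way: the sum-squared error printed above (1648344) is 1000 times the mean-squared error (1648.344), matching the 1000 points in `xf` and `yf`.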


### funcs.q Examples

Conversion of pandas dataframes to q tables and vice versa.

In [5]:
/ conversion of q table to pandas dataframe
dftab:.ml.utils.tab2df[10#table]
print dftab

             x  x1   x2         x3
0  3455.276149   1  999  12.308453
1  7933.980743   2  998  61.032792
2  2514.913306   3  997   2.814231
3  4154.686674   4  996  54.869356
4  5158.350274   5  995  56.590523
5  7083.205539   6  994  16.680499
6  9138.481687   7  993  52.008451
7  9145.908495   8  992  95.254735
8  6715.353606   9  991  93.561197
9   884.439801  10  990   2.615358


In [6]:
/ conversion back to q table from pandas dataframe
5#.ml.utils.df2tab dftab

x        x1 x2  x3      
------------------------
3455.276 1  999 12.30845
7933.981 2  998 61.03279
2514.913 3  997 2.814231
4154.687 4  996 54.86936
5158.35  5  995 56.59052


In [7]:
/ convert the symbols within the table to an enumeration
5#enumtab:.ml.utils.enum complextable

sym time                 x        x1 x2  x3       
--------------------------------------------------
0   0D00:00:01.826405818 9381.123 1  999 86.19983 
1   0D00:00:10.065060686 1335.608 2  998 71.78646 
2   0D00:00:11.097194077 1155.152 3  997 67.29624 
3   0D00:00:13.213437027 3288.032 4  996 0.3079086
4   0D00:00:13.668105890 3387.841 5  995 76.41682 
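
One way to picture this step is as a lookup of each symbol in the list of distinct symbols. The sketch below uses core q only and assumes that codes are assigned in order of first appearance, which is consistent with the output above but is not confirmed by it; the vector `s` is sample data, with one repeated symbol added for illustration.

```q
s:`mmib`mkli`piel`mank`hfni`mmib    / sample symbols; the final repeat is illustrative
(distinct s)?s                      / 0 1 2 3 4 0
```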


In [8]:
/ convert times to longs
5#t2long:.ml.utils.times2long[enumtab]

sym time        x        x1 x2  x3       
-----------------------------------------
0   1826405818  9381.123 1  999 86.19983 
1   10065060686 1335.608 2  998 71.78646 
2   11097194077 1155.152 3  997 67.29624 
3   13213437027 3288.032 4  996 0.3079086
4   13668105890 3387.841 5  995 76.41682 
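
The long values shown above are the timespans expressed in nanoseconds, which is what a plain cast to long yields in q. A minimal sketch using two of the values from the table (the name `ts` is introduced here for illustration):

```q
ts:0D00:00:01.826405818 0D00:00:10.065060686
`long$ts                            / 1826405818 10065060686
```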


In [9]:
/ split the data into a training set and a test set
.ml.utils.traintestsplit[xf;yf;0.2]
/ the split can also be seeded so that the same partition is applied to multiple datasets
.ml.utils.traintestsplitseed[xf;yf;0.1;42]

xtrain| 36.15143 19.58467 77.58173 54.97936 69.9044 38.39461 88.25151 39.7576..
ytrain| 55.37373 44.20741 5.110413 16.34049 66.20432 84.75023 64.68219 30.571..
xtest | 31.64413 28.5799 72.48355 95.48708 80.50147 49.62382 26.20215 1.56385..
ytest | 97.71186 71.99879 18.20468 7.787298 75.79171 73.43714 81.1176 24.3663..


xtrain| 70.43314 57.78177 36.67275 18.56475 95.50901 20.11578 94.52199 55.655..
ytrain| 75.56175 40.01745 45.79215 14.04364 12.74362 90.18013 64.11941 2.3588..
xtest | 27.44944 4.164985 79.19793 1.280633 65.35973 20.73435 53.39515 72.274..
ytest | 48.61721 73.51245 49.86128 34.66436 11.76313 50.79442 16.5771 67.1711..
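
For readers who want to see the mechanics, a shuffled split can be sketched in core q as below. The helper `ttsplit` is hypothetical and is not the toolkit's implementation; it assumes the third argument is the fraction of rows held out for testing.

```q
/ hypothetical helper: shuffle the row indices, then cut them into train and test
ttsplit:{[x;y;sz]
  n:count x;
  i:neg[n]?n;                       / random permutation of the row indices
  k:n-ceiling n*sz;                 / number of rows kept for training
  `xtrain`ytrain`xtest`ytest!(x k#i;y k#i;x k _ i;y k _ i)}
res:ttsplit[til 10;10+til 10;0.2]
count each res                      / sizes of the four pieces
```

Because the same permutation indexes both `x` and `y`, each feature row stays paired with its target. Setting the random seed first (for example with `system"S 42"`) would be one way to make such a split repeatable.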


The functions below are used for the interrogation of data and the creation of arrays.

In [10]:
/ array creation
.ml.utils.arange[0;220;2.5]        / evenly spaced values from 0 to 220 in steps of 2.5
.ml.utils.linspace[0;5;200]        / 200 evenly spaced data points between 0 and 5
/ datatype information
.ml.utils.dtypes[complextable]
/ matrix shape
.ml.utils.shape[flip value t:flip complextable]
/ range calculations for numeric values of the 'complex table'
.ml.utils.range flip 2_t

0 2.5 5 7.5 10 12.5 15 17.5 20 22.5 25 27.5 30 32.5 35 37.5 40 42.5 45 47.5 5..


0 0.02512563 0.05025126 0.07537688 0.1005025 0.1256281 0.1507538 0.1758794 0...


sym | 11
time| 16
x   | 9
x1  | 7
x2  | 7
x3  | 9


1000 6


x | 9971.215
x1| 999
x2| 999
x3| 99.89498
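
The array helpers and the column summaries above can be approximated with core q. The sketch below is an illustration under stated assumptions, not the toolkit's code: `arange` is assumed to exclude the end point (the NumPy convention), `linspace` to include both end points (consistent with the step of 0.02512563, i.e. 5%199, in the output above), and the table `demo` is sample data.

```q
arange:{[s;e;st]s+st*til ceiling(e-s)%st}    / half-open interval, end point excluded (assumption)
linspace:{[s;e;n]s+til[n]*(e-s)%n-1}         / n points including both end points
arange[0;10;2.5]                             / 0 2.5 5 7.5
linspace[0;5;6]                              / 0 1 2 3 4 5f
demo:([]a:1 2 3;b:1.5 2.5 4.5)               / sample table
type each flip demo                          / type number per column: 7 (long), 9 (float)
{max[x]-min x}each flip demo                 / range per column: 2 and 3
```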


### preprocess.q Examples

In [11]:
5#tab:([]time:asc n?00:00:05.000;@[n?1000f;10?100;:;0n];n?10000f;n?10f;n#2;n#0n)

time         x        x1       x2       x3 x4
---------------------------------------------
00:00:00.003 209.0333 3840.776 5.951947 2    
00:00:00.004          8150.623 6.529447 2    
00:00:00.006 755.1312 7342.214 8.318315 2    
00:00:00.008 372.3322 3640.228 8.341788 2    
00:00:00.026 83.0886  6356.158 8.068155 2    


In [12]:
/ creates a 'rolled' table with 3 elements per window in order to produce a forecasting frame. 
5#.ml.utils.tablerolldrop[tab;`time;3]

time          x        x1       x2       x3 x4
----------------------------------------------
t00:00:00.006 209.0333 3840.776 5.951947 2    
t00:00:00.006          8150.623 6.529447 2    
t00:00:00.006 755.1312 7342.214 8.318315 2    
t00:00:00.008          8150.623 6.529447 2    
t00:00:00.008 755.1312 7342.214 8.318315 2    


In [13]:
/ min-max scale the data by column so that values lie between 0 and 1
5#.ml.utils.minmaxscaler[flip 1_flip tab]

x          x1        x2        x3 x4
------------------------------------
0.2091286  0.3840551 0.5950374      
           0.8163077 0.6528172      
0.7559122  0.7352289 0.8317964      
0.3726326  0.3639412 0.8341449      
0.08302585 0.6363332 0.8067675      
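
Min-max scaling of a single column can be written in one line of core q. The function name `minmax` is hypothetical and the sketch is not the toolkit's implementation; it also shows why a constant column such as `x3` comes out empty above, since the denominator is zero and `0%0` is null in q.

```q
minmax:{(x-min x)%max[x]-min x}
minmax 2 4 6 10f                    / 0 0.25 0.5 1
minmax 3#2f                         / 0n 0n 0n, because 0%0 is null
```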


In [14]:
/ standard scaling of the dataset by column: (x-avg x)%dev x
5#.ml.utils.stdscaler[flip 1_flip tab]

x          x1         x2        x3 x4
-------------------------------------
-1.025089  -0.4274092 0.3359808      
           1.0842     0.5366565      
0.8793566  0.8006631  1.158271       
-0.4556053 -0.4977485 1.166428       
-1.464305  0.4548199  1.071343       
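
The formula in the comment can be tried on a small vector with core q. The name `stdscale` is hypothetical; note that q's `dev` is the population standard deviation.

```q
stdscale:{(x-avg x)%dev x}
stdscale 2 4 4 4 5 5 7 9f           / -1.5 -0.5 -0.5 -0.5 0 0 1 2 (avg 5, dev 2)
```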


In [15]:
/ removes all columns with no variance (a single repeated value)
5#nullfreetab:.ml.utils.dropconstant[tab]

time         x        x1       x2      
---------------------------------------
00:00:00.003 209.0333 3840.776 5.951947
00:00:00.004          8150.623 6.529447
00:00:00.006 755.1312 7342.214 8.318315
00:00:00.008 372.3322 3640.228 8.341788
00:00:00.026 83.0886  6356.158 8.068155
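
A comparable filter can be sketched in core q by keeping only the columns that hold more than one distinct value. This is an assumption about the behaviour rather than the toolkit's code; it is consistent with the output above, where both the constant column `x3` and the all-null column `x4` disappear. The table `demo` and the name `keep` are introduced for illustration.

```q
demo:([]a:1 2 3;b:3#5;c:3#0n)                    / b is constant, c is entirely null
keep:where 1<count each distinct each flip demo  / columns with more than one distinct value
keep#demo                                        / only column a remains
```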


In [16]:
/ find columns that contain null values
.ml.utils.checknulls[tab]

`x`x4
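
The same check can be expressed with core q primitives, shown here on a small sample table (`demo` is illustrative data, not part of the notebook):

```q
demo:([]a:1 0N 3;b:1 2 3f;c:3#0n)
where any each null flip demo       / `a`c
```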


In [17]:
/ create polynomial features from the initial table; the polynomials can be tuned according to use case
5#nullfreetab^.ml.utils.polytab[nullfreetab;2;1]

time         x        x1       x2       x_x1     x_x2     x1_x2   
------------------------------------------------------------------
00:00:00.003 209.0333 3840.776 5.951947 802850.1 1244.155 22860.1 
00:00:00.004          8150.623 6.529447                   53219.06
00:00:00.006 755.1312 7342.214 8.318315 5544335  6281.419 61074.85
00:00:00.008 372.3322 3640.228 8.341788 1355374  3105.917 30366.01
00:00:00.026 83.0886  6356.158 8.068155 528124.2 670.3717 51282.47
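
The second-order columns above are pairwise products of the original columns (for example, `x_x1` in the first row is 209.0333 times 3840.776). A core q sketch of that construction follows; it is not the toolkit's implementation, and the names `demo`, `pc`, `pp` and `nm` are hypothetical.

```q
demo:([]x:1 2f;x1:3 4f;x2:5 6f)                 / sample table
pc:cols demo
pp:raze{x[y],/:(y+1) _ x}[pc]each til count pc  / unordered column pairs
nm:{`$"_" sv string x}each pp                   / names such as x_x1
demo,'flip nm!{prd demo x}each pp               / append one product column per pair
```

For this sample the first row becomes `1 3 5 3 5 15`, i.e. the three inputs followed by `x_x1`, `x_x2` and `x1_x2`.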


In [18]:
/ this produces the power set of all possible polynomial features that can be built from the table
5#.ml.utils.powerset[nullfreetab;1]

time         x        x1       x2       x_x1     x_x2     x1_x2    x_x1_x2     
-------------------------------------------------------------------------------
00:00:00.003 209.0333 3840.776 5.951947 802850.1 1244.155 22860.1  4778521     
00:00:00.004          8150.623 6.529447                   53219.06             
00:00:00.006 755.1312 7342.214 8.318315 5544335  6281.419 61074.85 4.611952e+07
00:00:00.008 372.3322 3640.228 8.341788 1355374  3105.917 30366.01 1.130624e+07
00:00:00.026 83.0886  6356.158 8.068155 528124.2 670.3717 51282.47 4260988     


---