## Symbolic Aggregate Approximation

### 1.  [reference](http://dl.acm.org/citation.cfm?id=1285965)
### 2. main usage for time series data:
1. indexing and query
2. calculating the distance between time series, which enables clustering/classification
3. symbolic representation for time series - inspiring text-mining related tasks such as association mining
4. vector representation of time-series
    
### 3. algorithm steps

1. Segment the time-series data into gapless pieces (gaps can be introduced by, e.g., missing values or a change of sampling frequency)

2. Each piece will be SAXed into a sequence of "words" (e.g., "abcdd" "aabcd", ...). This is done by rolling a sliding window of length $window$ with a stride of length $stride$. If $stride$ < $window$, consecutive windows overlap. Each window is later converted into one word.

3. for each sliding window:

    3.1 whiten/normalize across the window (this step is key to many problems)
    
    3.2 discretize on the time axis (index) by grouping points into equal-sized bins (bin sizes can be fractional) - controlled by $nbins$. For each bin, use the bin mean as the local approximation.
    
    3.3 discretize on the value axis by dividing values into $nlevels$ equiprobable levels (quantiles); each level corresponds to a "letter", found by looking up the bin mean in the $cutpoint$ table
    
    3.4 at the end, each bin in a sliding window will be mapped to a letter, each window in the piece of time-series will be mapped to a word, and the whole piece of series will be a sentence
    
    3.5 calculate the distance between two symbolic representations from their corresponding levels
    
    3.6 if a vector representation is necessary, each letter can be mapped to a scalar value, such as the mean of the corresponding level.
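
Steps 3.1 to 3.4 can be sketched for a single window as follows. This is a minimal illustration in plain Python 3 (standard library only), not the `pysax` implementation. The function names are hypothetical, the window length is assumed to be divisible by $nbins$ (so fractional bins are not handled), and the cutpoints are assumed to be the equiprobable quantiles of a standard normal distribution, which is reasonable only because the window is whitened first.

```python
from statistics import NormalDist, mean, pstdev
import string


def whiten(window):
    """Step 3.1: shift to zero mean and scale to unit variance."""
    mu, sigma = mean(window), pstdev(window)
    if sigma == 0:
        return [0.0 for _ in window]  # constant window: no shape to encode
    return [(x - mu) / sigma for x in window]


def paa(window, nbins):
    """Step 3.2: mean of each of nbins equal-sized, contiguous bins.

    Assumption: len(window) is divisible by nbins (no fractional bins).
    """
    size = len(window) // nbins
    return [mean(window[i * size:(i + 1) * size]) for i in range(nbins)]


def cutpoints(nlevels):
    """Step 3.3: nlevels - 1 breakpoints that split N(0, 1) into equiprobable levels."""
    return [NormalDist().inv_cdf(k / nlevels) for k in range(1, nlevels)]


def to_word(window, nbins, nlevels):
    """Steps 3.1-3.4: map one window to a word of nbins letters."""
    cuts = cutpoints(nlevels)
    letters = []
    for value in paa(whiten(window), nbins):
        level = sum(value > c for c in cuts)  # number of cutpoints below the bin mean
        letters.append(string.ascii_lowercase[level])
    return "".join(letters)


print(to_word([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], nbins=3, nlevels=3))  # abc
```

A steadily rising window maps to `abc`: the first bin falls in the lowest level, the middle bin in the central level, and the last bin in the highest level.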

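Steps 3.5 and 3.6 can be sketched in the same spirit. The distance below follows the MINDIST lower bound from the SAX literature, where identical or adjacent letters contribute zero and other pairs contribute the gap between their cutpoints. Whether `pysax` uses exactly this distance is an assumption, and all function names are hypothetical. The letter-to-scalar mapping uses the mean of a standard normal variable restricted to each level, again assuming Gaussian cutpoints.

```python
from math import sqrt
from statistics import NormalDist

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def cutpoints(nlevels):
    """Breakpoints that split N(0, 1) into nlevels equiprobable levels."""
    return [NormalDist().inv_cdf(k / nlevels) for k in range(1, nlevels)]


def letter_distance(a, b, cuts):
    """Distance between two letters; identical or adjacent levels count as 0."""
    i, j = sorted((ALPHABET.index(a), ALPHABET.index(b)))
    if j - i <= 1:
        return 0.0
    return cuts[j - 1] - cuts[i]


def word_distance(w1, w2, window, nlevels):
    """Step 3.5: MINDIST-style distance between two words.

    `window` is the number of raw points each word was built from.
    """
    if len(w1) != len(w2):
        raise ValueError("words must have the same number of letters")
    cuts = cutpoints(nlevels)
    total = sum(letter_distance(a, b, cuts) ** 2 for a, b in zip(w1, w2))
    return sqrt(window / len(w1)) * sqrt(total)


def level_means(nlevels):
    """Step 3.6: one scalar per letter, the mean of N(0, 1) within each level."""
    nd = NormalDist()

    def pdf(x):
        return 0.0 if x in (float("-inf"), float("inf")) else nd.pdf(x)

    edges = [float("-inf")] + cutpoints(nlevels) + [float("inf")]
    # E[X | a < X < b] = (pdf(a) - pdf(b)) / P(a < X < b), with P = 1 / nlevels
    return [(pdf(a) - pdf(b)) * nlevels for a, b in zip(edges, edges[1:])]


print(word_distance("aa", "cc", window=4, nlevels=4))
print(level_means(4))
```

Because adjacent letters have zero distance, this measure cannot tell `ab` from `bb`. That is the price of a lower bound that never overestimates the distance between the whitened windows.
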
## sax module test

In [21]:
import pysax
import numpy as np
reload(pysax)
sax = pysax.SAXModel(window=3, stride=2) 

In [22]:

list(sax.sliding_window_index(10))

[slice(0, 3, None), slice(2, 5, None), slice(4, 7, None), slice(6, 9, None)]
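
The slices above can be reproduced with a few lines of plain Python 3. This is a sketch of the windowing logic only, assuming windows start at index 0 and that a trailing partial window is dropped. It is a standalone function with hypothetical parameters, not the `pysax` method itself.

```python
def sliding_window_index(n, window, stride):
    """Yield slices of length `window`, spaced `stride` apart, that fit inside n points."""
    for start in range(0, n - window + 1, stride):
        yield slice(start, start + window)


print(list(sliding_window_index(10, window=3, stride=2)))
# [slice(0, 3, None), slice(2, 5, None), slice(4, 7, None), slice(6, 9, None)]
```

With 10 points, a window of 3, and a stride of 2, the last point (index 9) is not covered by any window, because a window starting at index 8 would run past the end of the series.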

In [25]:
ws = np.random.random(10)
print ws.mean(), ws.std()
ss = sax.whiten(ws)
print ss.mean(), ss.std() 

0.524387272869 0.214122501064
2.58126853225e-16 0.999999999533
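
The whitened standard deviation falls just short of 1. That is consistent with a small constant added to the denominator to guard against division by zero: 0.214122501064 / (0.214122501064 + 1e-10) ≈ 0.999999999533. The sketch below assumes that form. The `eps` parameter and its value are inferred from the output above, not confirmed from the `pysax` source, and the code uses plain Python 3 with the population standard deviation (the NumPy default).

```python
from statistics import mean, pstdev


def whiten(values, eps=1e-10):
    """Zero-mean, almost unit-variance scaling.

    eps is an assumed guard so that a constant window does not divide by zero.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / (sigma + eps) for x in values]


ws = [0.2, 0.4, 0.6, 0.8]
ss = whiten(ws)
print(mean(ss), pstdev(ss))
```

With this form a constant window whitens to all zeros instead of raising an error, at the cost of a standard deviation that is slightly below 1.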


In [19]:
ws

array([ 0.30291988,  0.58340595,  0.31254414,  0.92532847,  0.67185901,
        0.59398496,  0.79084947,  0.74645292,  0.66051761,  0.23549668])