$\newcommand{\ol}{\overline{l}}$
$\newcommand{\os}{\overline{s}}$

## Sequence Mining

In [12]:
import nbimporter
from MakeTransactions_t5 import make_cal_transactions
PATH = '../../../datasets/pmdata/'

Let us load the transactions first:

In [13]:
t = make_cal_transactions(PATH, [1,2,3,4], n_clusters=7)

Each transaction can be seen as a sequence of items once its items are ordered by timestamp:

In [14]:
for tr in t:
    tr.sort(key = lambda x: x[0] )

In [15]:
t = [ [i[1] for i in tr ]   for tr in t ]

In [16]:
t[0]

['calories cluster: 3, day: Friday',
 'calories cluster: 3, day: Friday',
 'calories cluster: 3, day: Friday',
 'calories cluster: 3, day: Friday',
 'calories cluster: 3, day: Friday',
 'calories cluster: 3, day: Friday',
 'calories cluster: 2, day: Friday',
 'calories cluster: 2, day: Friday',
 'calories cluster: 0, day: Friday',
 'calories cluster: 4, day: Friday',
 'calories cluster: 4, day: Friday',
 'calories cluster: 4, day: Friday',
 'calories cluster: 0, day: Friday',
 'calories cluster: 4, day: Friday',
 'calories cluster: 4, day: Friday',
 'calories cluster: 1, day: Friday',
 'calories cluster: 5, day: Friday',
 'calories cluster: 2, day: Friday',
 'calories cluster: 3, day: Friday',
 'calories cluster: 4, day: Friday',
 'calories cluster: 1, day: Friday',
 'calories cluster: 0, day: Friday',
 'calories cluster: 2, day: Friday',
 'calories cluster: 4, day: Friday']

Given a sequence of labels $s = [l_1, \ldots, l_n]$, a sequence $p = [\ol_{1},\ldots ,\ol_m]$ is a **pattern of** $s$ if and only if there exists a strictly increasing function $f: \{1, \ldots, m\} \rightarrow \{1,\ldots, n\}$ such that $l_{f(i)} = \ol_{i}$ for every $i \in \{1, \ldots, m\}$. If such a function exists we write $p \in s$, and $p \notin s$ otherwise.
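
In other words, $p$ is a pattern of $s$ when $p$ is a (not necessarily contiguous) subsequence of $s$. A minimal sketch of this check in plain Python, independent of the notebook's data; the helper name `is_pattern` is hypothetical and not part of the course material:

```python
def is_pattern(p, s):
    """Return True if p is a pattern of s, i.e. p is a subsequence of s."""
    it = iter(s)
    # `label in it` advances the iterator until it finds `label`, so each
    # pattern item has to be found strictly after the previous one.
    return all(label in it for label in p)


print(is_pattern(['a', 'c'], ['a', 'b', 'c']))  # True
print(is_pattern(['c', 'a'], ['a', 'b', 'c']))  # False
print(is_pattern(['a', 'a'], ['a', 'b', 'c']))  # False
```

The single shared iterator plays the role of the strictly increasing function $f$: every successful lookup consumes the prefix of $s$ up to the matched position.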


Given a multiset of sequences $S =s_1, \ldots, s_N$ and a pattern $p$, we define the support of $p$ in $S$, written $sup_S(p)$, as the fraction of sequences of $S$ that contain $p$: $sup_S(p)= \frac{|\{i: p \in s_i\}|}{N}$.

A pattern association rule $par$ is a rule of the form $ s \rightarrow \os$ where $s$ and $\os$ are sequences.
Given a multiset of sequences $S =s_1, \ldots, s_N$ the support of the rule $par$ in $S$, written
$sup_S(par)$, is $sup_S(par) = sup_S(s \cdot \os)$ where $s \cdot \os$ is the sequence concatenation 
operation. Moreover, the confidence of $par$ in $S$, written $conf_S(par)$, is the value 
$conf_S(par)= \frac{sup_S(par)}{sup_S(s)}$.
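
The definitions of support and confidence can be sketched on a toy multiset of sequences. The names `is_pattern`, `support` and `confidence` are hypothetical helpers, and returning `0.0` when the antecedent never occurs is an assumption of this sketch (the definition leaves that case undefined):

```python
def is_pattern(p, s):
    """True if p is a subsequence of s."""
    it = iter(s)
    return all(label in it for label in p)


def support(p, S):
    """Fraction of the sequences in S that contain p as a pattern."""
    return sum(is_pattern(p, s) for s in S) / len(S)


def confidence(antecedent, consequent, S):
    """sup(antecedent . consequent) / sup(antecedent)."""
    sup_antecedent = support(antecedent, S)
    if sup_antecedent == 0:
        return 0.0  # assumption: convention chosen for an unsupported antecedent
    # List concatenation implements the sequence concatenation of the rule.
    return support(antecedent + consequent, S) / sup_antecedent


S = [['a', 'b', 'c'], ['a', 'c', 'b'], ['b', 'a'], ['a', 'b']]
print(support(['a', 'b'], S))            # 0.75
print(support(['a', 'b', 'c'], S))       # 0.25
print(confidence(['a', 'b'], ['c'], S))  # 0.25 / 0.75, i.e. one third
```

Here three of the four sequences contain $[a, b]$, but only the first one also contains a later $c$, so the rule $[a, b] \rightarrow [c]$ has support $0.25$ and confidence $1/3$.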


## Example

You can use regular expressions to detect patterns elegantly, but you have to use a special separator that DOES NOT appear as a character or substring in ANY transaction (be careful!).

In [20]:
SEP = '<!>'

In [103]:
p = ['calories cluster: 3, day: Friday',
     'calories cluster: 4, day: Friday',
     'calories cluster: 0, day: Friday',
     'calories cluster: 0, day: Friday',
     'calories cluster: 2, day: Friday',
     'calories cluster: 2, day: Friday']

In [104]:
import re 

In [105]:
ttext = SEP + SEP.join(t[0]) + SEP

In [106]:
preg = '.*'.join([ re.escape(SEP + i + SEP) for i in p ])

In [107]:
search = re.search(preg, ttext)
match = search[0] if search is not None else None

In [108]:
print(match.replace(SEP, '\n') if match is not None else 'No Match')

No Match


In [109]:
# Count the transactions that contain the whole rule (antecedent followed by consequent)
c = 0
preg = '.*'.join([ re.escape(SEP + i + SEP) for i in p ])  # does not depend on tr
for tr in t:
    ttext = SEP + SEP.join(tr) + SEP
    if re.search(preg, ttext):
        c += 1

In [110]:
sup_rule = c/len(t)

In [111]:
p = ['calories cluster: 3, day: Friday',
 'calories cluster: 4, day: Friday',
 'calories cluster: 0, day: Friday', ]

In [112]:
# Count the transactions that contain the antecedent alone
c = 0
preg = '.*'.join([ re.escape(SEP + i + SEP) for i in p ])  # does not depend on tr
for tr in t:
    ttext = SEP + SEP.join(tr) + SEP
    if re.search(preg, ttext):
        c += 1

In [113]:
sup_antecedent = c/len(t)

In [114]:
sup_rule / sup_antecedent

0.07246376811594202
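
In the cells above, adjacent items of a transaction share a single separator, while the regular expression expects every pattern item to be surrounded by its own pair of separators; two consecutive pattern items therefore cannot match two adjacent elements of a sequence. A minimal sketch of one possible variant, assuming each item is wrapped in its own separators and that no item is empty (the helper name `regex_is_pattern` is hypothetical, and figures computed with it may differ from the value reported above):

```python
import re

SEP = '<!>'  # must not occur inside any item


def regex_is_pattern(p, s, sep=SEP):
    """Regex-based subsequence test that also accepts adjacent matches."""
    # Wrap every item in its own separators, e.g. '<!>a<!><!>b<!>'.
    text = ''.join(sep + item + sep for item in s)
    # '.*' lets any number of wrapped items sit between two pattern items;
    # re.DOTALL lets '.' match newlines too, in case an item contains one.
    regex = '.*'.join(re.escape(sep + item + sep) for item in p)
    return re.search(regex, text, flags=re.DOTALL) is not None


print(regex_is_pattern(['a', 'b'], ['a', 'b', 'c']))  # True (adjacent items)
print(regex_is_pattern(['a', 'c'], ['a', 'b', 'c']))  # True
print(regex_is_pattern(['b', 'a'], ['a', 'b', 'c']))  # False
```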

## Assignment

- Build a function ```extract_pattern_rules(seq, minsup, minconf, minlen, maxlen)``` 
that extracts from the set of sequences $seq$ all the pattern association rules with length between $minlen$ and $maxlen$, whose underlying pattern has support greater than $minsup$ and whose confidence is greater than $minconf$;

- on the extracted rules, perform an analysis with secondary measures similar to the one presented during the lectures.