# Finding frequent sequential patterns with gaps shorter than specified in sequence databases using prefixSpan

This tutorial has two parts. In the first part, we describe the basic approach to finding frequent patterns in a sequence database using the prefixSpan algorithm. In the second part, we describe an advanced approach, where we evaluate the prefixSpan algorithm on a dataset at different *maxGap* thresholds.

## Prerequisites:

1. Installing the PAMI library

In [1]:
!pip install -U pami

Collecting pami
  Downloading pami-2024.10.24.2-py3-none-any.whl.metadata (80 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80.3/80.3 kB 3.9 MB/s eta 0:00:00
Collecting resource (from pami)
  Downloading Resource-0.2.1-py2.py3-none-any.whl.metadata (478 bytes)
Collecting validators (from pami)
  Downloading validators-0.34.0-py3-none-any.whl.metadata (3.8 kB)
Collecting sphinx-rtd-theme (from pami)
  Downloading sphinx_rtd_theme-3.0.1-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting discord.py (from pami)
  Downloading discord.py-2.4.0-py3-none-any.whl.metadata (6.9 kB)
Collecting JsonForm>=0.0.2 (from resource->pami)
  Downloading JsonForm-0.0.2.tar.gz (2.4 kB)
  Preparing metadata (setup.py) ... done
Collecting JsonSir>=0.0.2 (from resource->pami)
  Downloading JsonSir-0.0.2.tar.gz (2.2 kB)
  Preparing metadata (setup.py) ... done
Collecting python-easyconfig>=0.1.0 (from resource->pami)
  Downloading Python_EasyCon

2. Downloading a sample dataset

In [2]:
!wget -nc https://github.com/UdayLab/PAMI/tree/main/notebooks/sequencePatternMining/basic/airDatabase.txt

--2024-10-29 05:48:27--  https://www.dropbox.com/scl/fi/c2xdmns7rprxnkgd9h3gb/airPollution.csv?rlkey=q7zoop7mi2n4z3qi94lpc1jlf
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.18, 2620:100:6031:18::a27d:5112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ucba0397d27cc0bea184f285a81a.dl.dropboxusercontent.com/cd/0/inline/CdWDiWvNTDyjxaA_Fq430pJd5-DwK15OQg24ikF0hDvC7fYq87yS2t0jqIL9oxvodUwvyg-nzrnjPFmV1l3TQ3Ri0P13gAwVT6Wge0JR7Ldj3CTD-ewYqFyq1-cw9d86uCb0XVCO052EvCp87yuxtF2h/file# [following]
--2024-10-29 05:48:28--  https://ucba0397d27cc0bea184f285a81a.dl.dropboxusercontent.com/cd/0/inline/CdWDiWvNTDyjxaA_Fq430pJd5-DwK15OQg24ikF0hDvC7fYq87yS2t0jqIL9oxvodUwvyg-nzrnjPFmV1l3TQ3Ri0P13gAwVT6Wge0JR7Ldj3CTD-ewYqFyq1-cw9d86uCb0XVCO052EvCp87yuxtF2h/file
Resolving ucba0397d27cc0bea184f285a81a.dl.dropboxusercontent.com (ucba0397d27cc0bea184f285a81a.dl.dropboxusercontent.com)... 162.125.81.15, 

3. Printing a few lines of the dataset to learn its format

In [3]:
!head -2 airDatabase.txt

head: cannot open 'airDatabase.txt' for reading: No such file or directory


_Format:_ every row stores one sequence, with its items separated by a separator. Every subsequence within a row ends with "-1", and every sequence ends with "-2".

__Example:__

item1 item2 -1 item3 item4 -1 -2

item1 item4 -1 item6 -1 -2
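
As a rough illustration of this layout, the sketch below splits one row into its subsequences. The helper name `parse_sequence` is hypothetical (it is not part of PAMI), and it assumes whitespace-separated tokens as in the example rows above; a real file may use a different separator.

```python
def parse_sequence(line, sep=None):
    """Split one database row into a list of subsequences (lists of items).

    Hypothetical helper for illustration only; sep=None splits on any whitespace.
    """
    subsequences, current = [], []
    for token in line.strip().split(sep):
        if token == "-2":      # marks the end of the whole sequence
            break
        if token == "-1":      # marks the end of the current subsequence
            subsequences.append(current)
            current = []
        else:
            current.append(token)
    return subsequences


rows = [
    "item1 item2 -1 item3 item4 -1 -2",
    "item1 item4 -1 item6 -1 -2",
]
for row in rows:
    print(parse_sequence(row))
```

Each printed list corresponds to one sequence, and each inner list to one subsequence.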

## Part 1: Finding frequent sequential patterns with gaps shorter than specified using prefixSpan

### Step 1: Understanding the statistics of a sequence database

In [24]:
#import the class file
from PAMI.extras.dbStats import SequentialDatabase as stats

#specify the file name
inputFile = 'airDatabase.txt'

#initialize the class
obj=stats.SequentialDatabase(inputFile,sep='\t')

#execute the class
obj.readDatabase()

### Step 2: Print the database statistics, such as sequence and subsequence sizes, for more information

In [2]:
obj.printStats()

Database size (total no of sequence) : 135
Number of items : 121
Minimum Sequence Size : 1
Average Sequence Size : 20.955555555555556
Maximum Sequence Size : 24
Standard Deviation Sequence Size : 6.568010766746562
Variance in Sequence Sizes : 43.460696517412934
Sequence size (total no of subsequence) : 2829
Minimum subSequence Size : 1
Average subSequence Size : 18.83457051961824
Maximum subSequence Size : 104
Standard Deviation Sequence Size : 18.84802364721196
Variance in Sequence Sizes : 355.37361350890427


### Step 3: Choosing an appropriate *minSup* value and *maxGap*

In [25]:
minSup = 0.4 #minSup is given here as a fraction (between 0 and 1) of the database size. Users can also specify minSup as a count.
maxGap = 4 #maxGap should be an int greater than 0
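
To give a feel for what a gap threshold restricts, here is a minimal sketch of a gap-constrained occurrence check. It is not PAMI's implementation: the function `occurs_within_gap` is hypothetical, and it assumes one common reading of the constraint, namely that consecutive matched items may be at most `max_gap` subsequence positions apart. The library's exact definition of *maxGap* may differ.

```python
def occurs_within_gap(sequence, pattern, max_gap):
    """Return True if the items of `pattern` (non-empty) appear in order in
    `sequence`, with consecutive matches at most `max_gap` positions apart.

    Assumption: the gap is the difference between subsequence indices.
    """
    # Positions where the first pattern item can be matched.
    reachable = {i for i, itemset in enumerate(sequence) if pattern[0] in itemset}
    for item in pattern[1:]:
        # Positions of `item` reachable from an earlier match within max_gap.
        reachable = {
            j
            for j, itemset in enumerate(sequence)
            if item in itemset and any(0 < j - i <= max_gap for i in reachable)
        }
        if not reachable:
            return False
    return bool(reachable)


sequence = [["a"], ["x"], ["x"], ["b"]]
print(occurs_within_gap(sequence, ["a", "b"], max_gap=2))  # "a" and "b" are 3 positions apart
print(occurs_within_gap(sequence, ["a", "b"], max_gap=3))
```

Tracking every reachable position, rather than only the first match, avoids rejecting a pattern whose earliest match happens to be too far from the next item.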

### Step 4: Mining frequent sequential patterns using prefixSpan

In [26]:
from PAMI.sequentialPattern.basic import PrefixSpan as alg


_ap = alg.PrefixSpan('airDatabase.txt', minSup, '\t',maxGap=maxGap)
_ap.startMine()
_Patterns = _ap.getPatterns()
_memUSS = _ap.getMemoryUSS()
print("Total Memory in USS:", _memUSS)
_memRSS = _ap.getMemoryRSS()
print("Total Memory in RSS", _memRSS)
_run = _ap.getRuntime()
print("Total ExecutionTime in ms:", _run)
print("Total number of Frequent Patterns:", len(_Patterns))
_ap.save("results.txt")

0.05209803581237793
0.052195072174072266
Frequent patterns were generated successfully using prefixSpan algorithm 
Total Memory in USS: 246816768
Total Memory in RSS 283996160
Total ExecutionTime in ms: 5.089320182800293
Total number of Frequent Patterns: 870


### Step 5: Investigating the generated patterns
Open the patterns file and inspect the generated patterns. If they are interesting, use them; otherwise, repeat Steps 3 and 4 with a different *minSup* or *maxGap* value.

In [27]:
!head results.txt

['POINT(130.2113464,32.7321302)', '-1']:57 
['POINT(130.3597423,33.5840497)', '-1']:54 
['POINT(130.4674218,32.9808242)', '-1']:81 
['POINT(130.4674218,32.9808242)', 'POINT(130.601994,32.507843)', '-1']:55 
['POINT(130.4674218,32.9808242)', 'POINT(132.7326196,33.8884275)', '-1']:65 
['POINT(130.4674218,32.9808242)', 'POINT(136.6548337,35.0051925)', '-1']:66 
['POINT(130.4674218,32.9808242)', 'POINT(136.6548337,35.0051925)', '-1', 'POINT(130.4674218,32.9808242)', '-1']:54 
['POINT(130.4674218,32.9808242)', 'POINT(130.9612121,33.8854016)', '-1']:64 
['POINT(130.4674218,32.9808242)', 'POINT(130.9612121,33.8854016)', '-1', 'POINT(130.9612121,33.8854016)', '-1']:55 
['POINT(130.4674218,32.9808242)', '-1', 'POINT(130.4674218,32.9808242)', '-1']:59 


The storage format is: _frequentPattern:support_
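
For downstream processing, a saved line can be split back into its pattern and support. The sketch below assumes each line looks like the output shown above (a Python-style list, a colon, then the support); `parse_pattern_line` is a hypothetical helper, not a PAMI function.

```python
import ast


def parse_pattern_line(line):
    """Split one saved line into (list of subsequences, support)."""
    # The support follows the last colon on the line.
    pattern_text, support_text = line.rsplit(":", 1)
    tokens = ast.literal_eval(pattern_text)  # e.g. ['A', 'B', '-1']
    subsequences, current = [], []
    for token in tokens:
        if token == "-1":  # end of one subsequence
            subsequences.append(current)
            current = []
        else:
            current.append(token)
    return subsequences, int(support_text)


line = "['POINT(130.4674218,32.9808242)', 'POINT(130.601994,32.507843)', '-1']:55 "
pattern, support = parse_pattern_line(line)
print(pattern, support)
```

Using `ast.literal_eval` rather than splitting on commas matters here, because the `POINT(...)` items themselves contain commas.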

## Part 2: Evaluating the prefixSpan algorithm on a dataset at different *maxGap* values

### Step 1: Import the libraries and specify the input parameters

In [28]:
#Import the libraries
from PAMI.sequentialPattern.basic import PrefixSpan as alg #import the algorithm
import pandas as pd

#Specify the input parameters
inputFile = "airDatabase.txt"
seperator='\t'
minSupCount= 0.4
maximumGapList = [2,3,4,5,6,7]
#minSup is given here as a fraction (between 0 and 1) of the database size; it can also be specified as a count.
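
The relation between a fractional and a count-based threshold can be sketched as follows. This assumes that a float *minSup* is interpreted as a fraction of the number of sequences, which is consistent with the smallest support of 54 among the patterns listed in Part 1 for a database of 135 sequences; `min_sup_as_count` is a hypothetical helper, not part of PAMI.

```python
def min_sup_as_count(min_sup, database_size):
    """Hypothetical helper: express minSup as a number of sequences."""
    if isinstance(min_sup, float):
        # Assumed behaviour: a float is a fraction of the database size.
        return min_sup * database_size
    return min_sup  # an int is already a count


print(min_sup_as_count(0.4, 135))  # 135 sequences, as reported by printStats()
print(min_sup_as_count(54, 135))
```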

In [29]:
result = pd.DataFrame(columns=['algorithm', 'minSup',"maxGap" ,'patterns', 'runtime', 'memory'])
#initialize a data frame to store the results of prefixSpan algorithm

In [30]:
for maxGap in maximumGapList:
    obj = alg.PrefixSpan(inputFile, minSup=minSupCount,sep=seperator,maxGap=maxGap)
    obj.startMine()
    #store the results in the data frame
    result.loc[result.shape[0]] = ['prefixSpan', minSupCount,maxGap, len(obj.getPatterns()), obj.getRuntime(), obj.getMemoryRSS()]

0.05330920219421387
0.0534052848815918
Frequent patterns were generated successfully using prefixSpan algorithm 
0.05083060264587402
0.05086326599121094
Frequent patterns were generated successfully using prefixSpan algorithm 
0.05088233947753906
0.050913333892822266
Frequent patterns were generated successfully using prefixSpan algorithm 
0.05080270767211914
0.05083060264587402
Frequent patterns were generated successfully using prefixSpan algorithm 
0.05076169967651367
0.05079030990600586
Frequent patterns were generated successfully using prefixSpan algorithm 
0.05147719383239746
0.05150723457336426
Frequent patterns were generated successfully using prefixSpan algorithm 


In [31]:
print(result)

    algorithm  minSup  maxGap  patterns    runtime     memory
0  prefixSpan     0.4       2       244   3.981484  283885568
1  prefixSpan     0.4       3       434   4.276574  283885568
2  prefixSpan     0.4       4       870   5.061490  283885568
3  prefixSpan     0.4       5      1959   6.646942  283885568
4  prefixSpan     0.4       6      3814   8.822992  283885568
5  prefixSpan     0.4       7      7662  13.526026  287408128
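
One way to read this table is to look at how quickly the number of patterns grows as *maxGap* increases. The sketch below uses plain Python and the pattern counts copied from the table above; the variable names are illustrative only.

```python
# (maxGap, number of patterns) copied from the result table
rows = [
    (2, 244),
    (3, 434),
    (4, 870),
    (5, 1959),
    (6, 3814),
    (7, 7662),
]

growth = []
for (gap_prev, n_prev), (gap, n) in zip(rows, rows[1:]):
    factor = n / n_prev  # ratio of pattern counts between consecutive maxGap values
    growth.append((gap, round(factor, 2)))
    print(f"maxGap {gap_prev} -> {gap}: {factor:.2f}x patterns")
```

On this dataset the pattern count roughly doubles with each unit increase in *maxGap*, while the reported memory stays almost unchanged.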
