# Detecting Rare Patterns Using the IBM SPSS RPI Algorithm


Advances in data collection and storage technologies have made complex temporal data sets increasingly available. In such data, each instance is a trace of an entity's behavior, recorded as a time series of events with one or more variables. This kind of data is called an event-based time series, and analyzing it is one of the most challenging topics in data mining research.

## What Is an Event-Based Time Series?

An event-based time series consists of one or more sequences of events that occur at different time points. Each event can optionally be linked to a numeric value. The time points are unevenly spaced; that is, the intervals between consecutive events are of arbitrary length.
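The definition above can be sketched in plain Scala. The `Event` class, its field names, and the sample bookings are hypothetical and serve only to illustrate the shape of the data; they are not part of the RPI API.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object EventSeriesSketch {
  // One event: which entity, when, what type, and an optional numeric value.
  final case class Event(entityId: String, time: LocalDate, eventType: String, value: Option[Double])

  // Gaps in days between consecutive events of one entity, after sorting by time.
  def intervalsInDays(events: Seq[Event]): Seq[Long] = {
    val sorted = events.sortBy(_.time.toEpochDay)
    sorted.zip(sorted.drop(1)).map { case (a, b) => ChronoUnit.DAYS.between(a.time, b.time) }
  }

  // Hypothetical bookings for a single customer; the third event carries no value.
  val sample: Seq[Event] = Seq(
    Event("c1", LocalDate.of(2016, 6, 1), "flight", Some(420.0)),
    Event("c1", LocalDate.of(2016, 6, 2), "hotel", Some(180.0)),
    Event("c1", LocalDate.of(2016, 6, 9), "car", None),
    Event("c1", LocalDate.of(2016, 6, 30), "hotel", Some(95.5))
  )
}
```

Calling `EventSeriesSketch.intervalsInDays(EventSeriesSketch.sample)` yields gaps of 1, 7, and 21 days, which illustrates the uneven spacing.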

<img src="https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/632af13e-5c20-4d2d-bad3-09c5e598ef39.jpg"  width="350">

Event-based time series data can be collected in many industrial and scientific domains. The preceding example shows data from an online travel agency. Each transaction record includes a customer ID and the type of product booked online. A time stamp shows when the booking happened, and a numeric value shows how much money was spent on it. Data with these characteristics is found in other settings too. At a bank, customers perform different activities at different times: withdrawals, deposits, transfers, and so on. At a gas station, customer activities might include topping up, refilling, and shopping. All of these customer activities, or events, can be represented as event-based time series data. Pattern analysis on event-based time series can therefore help an enterprise gain insight into customer behavior, for example for behavior prediction, demand shaping, and personalized promotion.

Compared with traditional time series or sequence data, event-based time series pattern analysis poses the following challenges:

* Unlike sequence analysis, which mines only the order of events, event-based time series pattern analysis must also mine the time intervals between consecutive events.
* Event-based time series pattern analysis is concerned with the values linked to events and with the relationships between adjacent events, rather than with the events alone.

To address these challenges, IBM provides the Rare Pattern Identifier (RPI) Analysis algorithm, which discovers rare temporal patterns in event-based time series data.

## IBM SPSS RPI Analysis

The IBM SPSS Rare Pattern Identifier (RPI) Analysis algorithm discovers rare temporal patterns in event-based time series data by accounting for two elements of each event: the time interval and the event value, which together reveal the sequential relationship among adjacent events. Rare temporal patterns are discovered across all entities and can be used as features for customer segmentation or behavior prediction.

RPI Analysis can handle data with the following characteristics:

* The data consists of one or more series of events that occurred at different time points.
* Each series is an unequally spaced time series.
* Each event might be linked to a numeric value.

RPI Analysis can provide the following information:

* Discretization rules for the linked values and time intervals.
* Temporal patterns whose vertical support is below a predefined threshold.
* Temporal patterns whose horizontal support and rate are above predefined thresholds.
* Temporal features that characterize each entity.
* Interestingness of the identified rare patterns and of the entities that exhibit them.

### Use Case

Bill administers a large website and wants to detect potential network attacks that target it. He has learned from experience that rare patterns of user activity on the website might indicate network attacks.


The data that he extracted from the web server log file has the following characteristics:

* There are millions of users to analyze.
* Each user has a sequence of transactional events.
* Each event record includes:
   - userID: the identifier of the user
   - visitTime: the time at which the user visited a URL
   - url: the encoded value of the URL

<img src="https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/91cf9508-a6fd-47b1-a70c-91f9c7d05f0d.jpg"  width="200">

These characteristics match the properties of event-based time series data.

### Step 1. Load Data
First, specify the data type of each field, and then load the data file `rpi_data.csv`.

__Note:__ _The automatically inserted code does not take the schema into account. Add the schema cells below first, and then add `schema(schema).` before the `.load` call._ _Also delete `option("inferSchema", "true").`, because the schema is not inferred from the file._

In [1]:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DateType}

In [2]:
val schema = StructType(
  StructField("userID", StringType, true) ::
  StructField("visitTime", DateType, true) ::
  StructField("url", StringType, true) ::Nil)

schema = StructType(StructField(userID,StringType,true), StructField(visitTime,DateType,true), StructField(url,StringType,true))



In [3]:
import com.ibm.ibmos2spark.CloudObjectStorage

// @hidden_cell
var credentials = scala.collection.mutable.HashMap[String, String](
    "endPoint"->"https://s3-api.us-geo.objectstorage.service.networklayer.com",
    "apiKey"->"kUkuG6QrEdOHy3MRJBp1CS_f0FuIrbjfK6mYDjAh-c6p",
    "serviceId"->"iam-ServiceId-5a988285-b05d-4148-afb9-19cbd594edb6",
    "iamServiceEndpoint" -> "https://iam.bluemix.net/oidc/token")

var configurationName = "os_cba83a820ee941cd921cc2bbfefd15eb_configs"
var cos = new CloudObjectStorage(sc, credentials, configurationName, "bluemix_cos")

import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder().
    getOrCreate()
val df = spark.
    read.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").
    schema(schema).
    option("header", "true").
    load(cos.url("taller-donotdelete-pr-oig7dhr92xtgem", "rpi_data.csv"))
df.show(5)

+------+----------+---+
|userID| visitTime|url|
+------+----------+---+
|     0|2016-06-21|  0|
|     0|2016-06-11|  0|
|     0|2016-06-14|  0|
|     0|2016-06-17|  0|
|     0|2016-06-13|  1|
+------+----------+---+
only showing top 5 rows



credentials = Map(apiKey -> kUkuG6QrEdOHy3MRJBp1CS_f0FuIrbjfK6mYDjAh-c6p, serviceId -> iam-ServiceId-5a988285-b05d-4148-afb9-19cbd594edb6, endPoint -> https://s3-api.us-geo.objectstorage.service.networklayer.com, iamServiceEndpoint -> https://iam.bluemix.net/oidc/token)
configurationName = os_cba83a820ee941cd921cc2bbfefd15eb_configs
cos = com.ibm.ibmos2spark.CloudObjectStorage@772e7ff8
spark = org.apache.spark.sql.SparkSession@1f09df6f
df = [userID: string, visitTime: date ... 1 more field]



### Step 2. Set Parameters and Run RPI Analysis

Next, Bill sets the entity ID, event time, and event type fields. He sets the minimum pattern length to 3 and the maximum pattern length to 5. He also sets the vertical support threshold to 0.1, the minimum frequency of a rare pattern within an entity to 2, and the minimum rate of a rare pattern within an entity to 0.3.

In [4]:
import com.ibm.spss.ml.frequentpatternmining.rarepatternidentifier.RarePatternIdentifier
val rpi = new RarePatternIdentifier().
  setEntityIDField("userID").
  setEventTimeField("visitTime").
  setEventTypeField("url").
  setMaxVerticalSupport(0.1).
  setMinPatternLength(3).
  setMaxPatternLength(5).
  setMinFreqOfRarePatternInEntity(2).
  setMinRateOfRarePatternInEntity(0.3).
  fit(df)
 
val patternXML = rpi.patternXML

rpi = RarePatternIdentifierModel_cc4581fdb4b8
patternXML = <?xml version='1.0' encoding='UTF-8'?><RarePatternIdentifier><Fields id="userID" time="visitTime" type="url"/><EventType size="4"><ID><Array n="4" type="int">0 1 2 3</Array></ID><Values><Array n="4" type="string">"0" "1" "2" "3"</Array></Values></EventType><TimeInterval size="2" transformation="NONE"><ID><Array n="2" type="int">0 1</Array></ID><CutPoints><Array n="1" type="real">0.0</Array></CutPoints></TimeInterval><Patterns size="3"><Pattern id="0" length="4" vSupport="0.09090909090909091" maxRate="0.5" confidence="1.0" interest="0.5625173454237...



In these settings, vertical support is the percentage of users who exhibit a given pattern; a pattern is considered rare when this value is below the threshold. Rate is the length of the rare pattern expressed as a percentage of all the events of a user.
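As a small worked example of vertical support, the sketch below computes the fraction of entities that exhibit a pattern and compares it with the 0.1 threshold set above. The `vSupport` value of 0.0909... in the model output equals 1/11, which would be consistent with one user out of eleven; that user count is an inference from the number, not something the output states. The function names are hypothetical, and the strict comparison is an assumption.

```scala
object VerticalSupportSketch {
  // Fraction of entities whose event history contains the pattern at least once.
  def verticalSupport(entitiesWithPattern: Int, totalEntities: Int): Double = {
    require(totalEntities > 0, "totalEntities must be positive")
    entitiesWithPattern.toDouble / totalEntities
  }

  // Assumption: a pattern is rare when its vertical support is strictly below the threshold.
  def isRare(vSupport: Double, maxVerticalSupport: Double): Boolean =
    vSupport < maxVerticalSupport

  // One entity out of eleven (hypothetical counts chosen to reproduce 0.0909...).
  val example: Double = verticalSupport(1, 11)
}
```

With these counts, `VerticalSupportSketch.isRare(VerticalSupportSketch.example, 0.1)` is true, so such a pattern would pass the vertical support filter.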

### Step 3. Check Result

In [5]:
val p = new scala.xml.PrettyPrinter(80, 4)
val xml = scala.xml.XML.loadString(patternXML)
print(p.format(xml))

<RarePatternIdentifier>
    <Fields type="url" time="visitTime" id="userID"/>
    <EventType size="4">
        <ID>
            <Array type="int" n="4">0 1 2 3</Array>
        </ID>
        <Values>
            <Array type="string" n="4">
                &quot;0&quot; &quot;1&quot; &quot;2&quot; &quot;3&quot;
            </Array>
        </Values>
    </EventType>
    <TimeInterval transformation="NONE" size="2">
        <ID>
            <Array type="int" n="2">0 1</Array>
        </ID>
        <CutPoints>
            <Array type="real" n="1">0.0</Array>
        </CutPoints>
    </TimeInterval>
    <Patterns size="3">
        <Pattern 
        interestOfConfidence="1.0" interestOfMaxRate="0.5" interestOfVSupport="6.938169513703851E-5" interestOfLength="0.75" interest="0.5625173454237843" confidence="1.0" maxRate="0.5" vSupport="0.09090909090909091" length="4" id="0">
            <Item>
                <Array type="int" n="7">1 0 1 0 1 0 1</Array>
            </Item>
            <Entiti

p = scala.xml.PrettyPrinter@3bca9b0d
xml = <RarePatternIdentifier><Fields type="url" time="visitTime" id="userID"/><EventType size="4"><ID><Array type="int" n="4">0 1 2 3</Array></ID><Values><Array type="string" n="4">&quot;0&quot; &quot;1&quot; &quot;2&quot; &quot;3&quot;</Array></Values></EventType><TimeInterval transformation="NONE" size="2"><ID><Array type="int" n="2">0 1</Array></ID><CutPoints><Array type="real" n="1">0.0</Array></CutPoints></TimeInterval><Patterns size="3"><Pattern interestOfConfidence="1.0" interestOfMaxRate="0.5" interestOfVSupport="6.938169513703851E-5" interestOfLength="0.75" interest="0.5625173454237843" confidence="1.0" maxRate="0.5" vSupport="0.09090909090909091" length="4" id="0"><Item><Array type="int" n="7">1 0 1 ...



The results are returned as pattern XML. The output contains the discretization information and the rare patterns found in the data.
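Reading the XML by eye gets tedious, so it can help to pull the pattern attributes into a small data structure. The sketch below uses the same `scala.xml` library as the cell above. The `fragment` string is a hand-abbreviated stand-in for the real output that keeps a single `Pattern` element with a subset of its attributes, and `PatternSummary` is a hypothetical helper class.

```scala
import scala.xml.XML

object PatternXmlSketch {
  // Abbreviated stand-in for the model output: one <Pattern> with a subset of its attributes.
  val fragment: String =
    """<RarePatternIdentifier>
      |  <Patterns size="1">
      |    <Pattern id="0" length="4" vSupport="0.09090909090909091" maxRate="0.5" confidence="1.0" interest="0.5625173454237843"/>
      |  </Patterns>
      |</RarePatternIdentifier>""".stripMargin

  final case class PatternSummary(id: Int, length: Int, vSupport: Double, interest: Double)

  // Collect one summary per <Pattern> element, reading attributes by name.
  def summaries(xmlText: String): Seq[PatternSummary] = {
    val root = XML.loadString(xmlText)
    (root \\ "Pattern").map { p =>
      PatternSummary(
        (p \ "@id").text.toInt,
        (p \ "@length").text.toInt,
        (p \ "@vSupport").text.toDouble,
        (p \ "@interest").text.toDouble)
    }
  }
}
```

In the notebook, `PatternXmlSketch.summaries(patternXML)` should work the same way on the full output, provided every `Pattern` element carries these four attributes.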

* Discretization rule of the time interval:

<img src="https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/518fe6d1-790e-4844-aa91-dda0806bcbd3.jpg"  width="300">

The time interval was split into 2 categories:

* Category 1: time duration within 1 day.
* Category 2: time duration greater than 1 day.
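The two categories can be related to the single cut point `0.0` reported in the `CutPoints` element. The sketch below maps an interval to a category ID by counting the cut points below it. The boundary rule (a value equal to a cut point falls into the lower category) and the correspondence between the XML IDs 0 and 1 and the categories 1 and 2 listed above are assumptions, not documented behavior.

```scala
object IntervalBinningSketch {
  // Cut points as reported in <CutPoints>; here a single cut at 0.0.
  val cutPoints: Seq[Double] = Seq(0.0)

  // Category ID = number of cut points strictly below the value (assumed boundary rule).
  def category(intervalDays: Double, cuts: Seq[Double]): Int =
    cuts.count(_ < intervalDays)
}
```

Under this rule, a same-day gap of `0.0` maps to ID 0 and any longer gap, such as `3.0`, maps to ID 1.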
 

* Patterns with vertical support, confidence, maximal rate, and interestingness:

<img src="https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/336752e3-ed39-42b1-8178-5335229be897.jpg"  width="600">
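For the pattern with id 0 in the XML output above, the reported `interest` value matches the plain average of the four component scores (`interestOfConfidence`, `interestOfMaxRate`, `interestOfVSupport`, `interestOfLength`). The sketch below checks that arithmetic. This is an observation about one pattern in this output, not a documented formula, so treat it as a sanity check rather than a definition.

```scala
object InterestCheckSketch {
  // Component scores copied from the <Pattern id="0"> element in the output above.
  val components: Seq[Double] = Seq(1.0, 0.5, 6.938169513703851E-5, 0.75)

  // Overall interest reported for the same pattern.
  val reportedInterest: Double = 0.5625173454237843

  // Plain arithmetic mean of the components.
  val mean: Double = components.sum / components.size
}
```

Comparing `InterestCheckSketch.mean` with `InterestCheckSketch.reportedInterest` shows agreement to within floating-point rounding.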
 

The pattern with id 1 describes a user who visits URL 1 four times within one day.
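A pattern's `Item` array can be decoded in the same spirit. The pattern with id 0 in the XML output has length 4 and the seven-element array `1 0 1 0 1 0 1`, which is consistent with event type IDs at even positions interleaved with time-interval category IDs at odd positions. That layout is inferred from the output, not documented, and the `Visit` and `Gap` types below are hypothetical.

```scala
object ItemDecodeSketch {
  sealed trait Step
  final case class Visit(eventType: String) extends Step
  final case class Gap(intervalCategory: Int) extends Step

  // Assumed layout: even positions hold event type IDs, odd positions hold interval category IDs.
  def decode(items: Seq[Int], eventValues: Seq[String]): Seq[Step] =
    items.zipWithIndex.map { case (id, i) =>
      if (i % 2 == 0) Visit(eventValues(id)) else Gap(id)
    }

  // Event type values and item array copied from the XML output above.
  val eventValues: Seq[String] = Seq("0", "1", "2", "3")
  val decoded: Seq[Step] = decode(Seq(1, 0, 1, 0, 1, 0, 1), eventValues)
}
```

Under this reading, `ItemDecodeSketch.decoded` contains four visits to URL "1" separated by three gaps in interval category 0.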

* Rare patterns of each user:

<img src="https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/1fd4f6f8-204c-4e0b-ad51-f89d11a868c5.jpg"  width="500">

For the user with ID 8:

* Rare pattern 0 has a horizontal support (frequency of the pattern) of 2. Its rate is 50%, and its interestingness for user 8 is 0.25.
* Rare pattern 1 has a horizontal support (frequency of the pattern) of 3. Its rate is 60%, and its interestingness for user 8 is 0.3.
* ……
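The per-user thresholds from Step 2 (minimum frequency 2, minimum rate 0.3) can be applied to these values as a quick check. The sketch below is only an illustration: `EntityPattern` is a hypothetical class, and treating the thresholds as inclusive is an assumption.

```scala
object EntityFilterSketch {
  final case class EntityPattern(patternId: Int, frequency: Int, rate: Double)

  // Keep patterns that meet both minimums (assumption: thresholds are inclusive).
  def keep(patterns: Seq[EntityPattern], minFreq: Int, minRate: Double): Seq[EntityPattern] =
    patterns.filter(p => p.frequency >= minFreq && p.rate >= minRate)

  // Values for user 8 as listed above.
  val user8: Seq[EntityPattern] = Seq(EntityPattern(0, 2, 0.5), EntityPattern(1, 3, 0.6))
  val kept: Seq[EntityPattern] = keep(user8, minFreq = 2, minRate = 0.3)
}
```

Both patterns listed for user 8 pass this filter, which agrees with their appearance in the result.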

With all this information, Bill can combine the rare patterns with the structure of his website to analyze potential network attacks further.

## Locating the IBM SPSS RPI Algorithm

The algorithms are available at the following link:

[IBM SPSS algorithm Spark and Python API](http://spss-algo.mybluemix.net/)