# Demo - xStream for row-streaming datasets
This notebook demonstrates our Python row-streaming implementation of [xStream](https://github.com/arielramos97/xStream). It shows how to run the algorithm on the spam-sms dataset and evaluate the resulting anomaly scores.

# Set up environment
Install the mmh3 library and clone the repository that contains the implementation.

In [1]:
!pip install mmh3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mmh3
  Downloading mmh3-3.0.0-cp38-cp38-manylinux2010_x86_64.whl (50 kB)
Installing collected packages: mmh3
Successfully installed mmh3-3.0.0


In [4]:
!git clone https://github.com/mayaawada/Test.git
%cd Test/Row-streaming

Cloning into 'Test'...
remote: Enumerating objects: 75, done.
remote: Counting objects: 100% (75/75), done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 75 (delta 28), reused 75 (delta 28), pack-reused 0
Unpacking objects: 100% (75/75), 4.01 MiB | 8.25 MiB/s, done.
/content/Test/Row-streaming

In [5]:
from XStream_River import xStream


# Imports and runtime setup

In [1]:
import tqdm
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.datasets import load_svmlight_file

# Load the data

In [1]:
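# load_svmlight_file returns a (sparse feature matrix, label array) pair
X, y = load_svmlight_file("data/Row-streaming/spam-sms")
# Convert to a dense numpy matrix; the loop below relies on its .A1 attribute
X = X.todense()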

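
For readers unfamiliar with the svmlight format read by the cell above, the sketch below parses a tiny sample by hand. It is illustrative only: the two sample rows, the `parse_svmlight` helper, and the 1-based index assumption are hypothetical and are not taken from the spam-sms file or from scikit-learn's implementation.

```python
# Hypothetical two-row sample in svmlight format: "<label> <index>:<value> ..."
sample_text = """1 1:0.5 3:2.0
0 2:1.0
"""

def parse_svmlight(text, n_features):
    """Parse svmlight text into dense rows and labels (illustrative only)."""
    rows, labels = [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        label, *pairs = line.split()
        dense = [0.0] * n_features
        for pair in pairs:
            index, value = pair.split(":")
            dense[int(index) - 1] = float(value)  # assumption: 1-based indices
        rows.append(dense)
        labels.append(float(label))
    return rows, labels

rows, labels = parse_svmlight(sample_text, n_features=3)
print(rows)
print(labels)
```

Each line stores only the non-zero features, which is why the loader returns a sparse matrix that the notebook then converts to a dense one.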

# Run the algorithm

In [7]:
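window_size = int(0.05*len(y))  # 5% of the stream; also the unscored initial period below
k = 100
n_chains = 100
depth = 15

cf = xStream(num_components=k, n_chains=n_chains, depth=depth, window_size=window_size)

all_scores = []

for i, sample in enumerate(tqdm.tqdm(X)):
  row = sample.A1  # flatten the 1 x d matrix row into a 1-D array
  cf.learn_one(row)
  # Score only after the first window has been seen
  if i >= window_size:
    # Negate the raw score: the metrics below treat higher values as more anomalous
    anomalyscore = -cf.predict_one(row)
    all_scores.append(anomalyscore[0])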

100%|██████████| 5574/5574 [3:30:17<00:00,  2.26s/it]      
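
The loop above follows a common streaming pattern: update the model on each sample, and start recording scores only after an initial window. A minimal sketch of that pattern on synthetic data follows. The `RunningZScore` class is a hypothetical stand-in and is not xStream; its `score_one` method, the synthetic stream, and the injected outlier are all assumptions made for illustration.

```python
import math
import random

class RunningZScore:
    """Hypothetical stand-in detector, NOT xStream: scores a value by its
    distance from the running mean, in running standard deviations."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def learn_one(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def score_one(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

random.seed(0)
stream = [random.gauss(0.0, 1.0) for _ in range(200)]
stream[150] = 25.0  # inject one obvious outlier

window_size = int(0.05 * len(stream))  # 5% initial window, as in the notebook
detector = RunningZScore()
scores = []
for i, x in enumerate(stream):
    detector.learn_one(x)      # update first, mirroring the cell above
    if i >= window_size:       # record scores only after the initial window
        scores.append(detector.score_one(x))

# Position of the highest score, mapped back to its index in the stream
top = max(range(len(scores)), key=scores.__getitem__)
print(len(scores), top + window_size)
```

Because the first `window_size` samples are never scored, `scores[j]` corresponds to `stream[j + window_size]`. The same offset is why the labels are sliced from `window_size` onward before evaluation.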


In [8]:
y_adjusted = y[window_size:window_size+len(all_scores)]

In [9]:
# Mean Average Precision (MAP): average precision per window, then the mean over windows

chunks = [all_scores[x:x+window_size] for x in range(0, len(all_scores), window_size)]
y_chunks = [y_adjusted[x:x+window_size] for x in range(0, len(y_adjusted), window_size)]

AP_window = []

# The final chunk, which may be shorter than a full window, is skipped (the -1 below)
for i in range(len(y_chunks)-1):
  score = average_precision_score(y_chunks[i], chunks[i])
  AP_window.append(score)
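
To make the windowed computation concrete, here is a small pure-Python sketch on toy data. The `average_precision` helper is a simplified stand-in that assumes no tied scores and at least one positive per window; it is not scikit-learn's implementation, and the labels and scores are made up for illustration.

```python
def average_precision(labels, scores):
    """Mean of precision@k over the ranks k that hold a positive.
    Simplified sketch: assumes no tied scores and at least one positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Hypothetical toy stream: 1 marks an anomaly, higher score = more anomalous
labels = [0, 1, 0, 0, 1, 0, 1, 0, 0]
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.05, 0.7, 0.6, 0.4]
window = 4

score_chunks = [scores[s:s + window] for s in range(0, len(scores), window)]
label_chunks = [labels[s:s + window] for s in range(0, len(labels), window)]

ap_per_window = []
for i in range(len(label_chunks) - 1):  # skip the final chunk, as the notebook does
    ap_per_window.append(average_precision(label_chunks[i], score_chunks[i]))

mean_ap = sum(ap_per_window) / len(ap_per_window)
print(ap_per_window, mean_ap)
```

In the first window the only anomaly is ranked first, so its AP is 1. In the second window the anomalies land at ranks 1 and 3, giving (1/1 + 2/3) / 2 = 5/6.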

# Print results

In [10]:
OAP = average_precision_score(y_adjusted, all_scores) 
MAP = sum(AP_window)/len(AP_window)
AUC = roc_auc_score(y_adjusted, all_scores)

print("XStream: OAP =", OAP,"\n\t",
      "MAP =", MAP, "\n\t", 
      "AUC =", AUC)

XStream: OAP = 0.3730751125550796 
	 MAP = 0.404928224231042 
	 AUC = 0.855161384824829
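
As a reading aid for the AUC figure, the sketch below computes ROC AUC from its pairwise definition: the fraction of (anomaly, normal) pairs in which the anomaly receives the higher score. The `roc_auc` helper and the toy data are hypothetical; the notebook itself uses scikit-learn's `roc_auc_score`.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 marks an anomaly, higher score = more anomalous
labels = [0, 1, 0, 0, 1]
scores = [0.1, 0.9, 0.2, 0.8, 0.3]
auc = roc_auc(labels, scores)
print(auc)
```

Here 5 of the 6 pairs are ordered correctly, so the AUC is 5/6. AUC weighs every pair equally, whereas average precision is dominated by how pure the top of the ranking is, which is one plausible reason the two metrics can differ noticeably on imbalanced data.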
