## Anomaly Detection in Univariate Time Series with OneClassSVM, in the context of the WISDom project

#### Data:
    1. Flow rate data from a sensor in a water supply system located in Barreiro
    2. Holidays from 1970 to 2029 (+ 3 regional holidays of 2018)

### Libraries and Packages

In [1]:
# basics
import pandas as pd
import numpy as np
from datetime import datetime as dt
# for graphics
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
# algorithm
from sklearn import svm

### Load dataset

In [2]:
features = ["date","time","flow","anomaly"]
df = pd.read_csv('barreiro_ano.csv', sep=';', names=features)
holidays = pd.read_csv('holidays2018.csv',sep=';')

### Basic dataset changes

In [5]:
# convert column to datetime
df['date'] = pd.to_datetime(df['date'])
holidays['date'] = pd.to_datetime(holidays['date'])

In [6]:
# extra column indicating day of week
# 0: mon, 1: tue, ..., 5: sat, 6: sun
# ('date' is already datetime at this point, so dayfirst=True has no effect here)
df['dayofweek'] = pd.to_datetime(df['date'],dayfirst=True)
df['dayofweek'] = df['dayofweek'].dt.dayofweek

In [7]:
# if day is a holiday, then dayofweek is -1
df.loc[df.date.isin(holidays.date), 'dayofweek'] = -1
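
The weekday and holiday encoding above can be checked on a small example. The sketch below uses a hypothetical three-row frame (`toy`) and a hypothetical one-row holiday table (`toy_holidays`); neither comes from the project data.

```python
import pandas as pd

# Toy data (hypothetical): three dates, the first of which is a holiday.
toy = pd.DataFrame({"date": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-06"])})
toy_holidays = pd.DataFrame({"date": pd.to_datetime(["2018-01-01"])})

# Monday=0 ... Sunday=6
toy["dayofweek"] = toy["date"].dt.dayofweek
# Overwrite the weekday with -1 wherever the date appears in the holiday table.
toy.loc[toy["date"].isin(toy_holidays["date"]), "dayofweek"] = -1

print(toy["dayofweek"].tolist())  # [-1, 1, 5]
```

The holiday value replaces the weekday entirely, so a holiday that falls on a Monday is no longer distinguishable from one that falls on a Sunday.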

In [8]:
# replace each date with its proleptic Gregorian ordinal (day count, with 0001-01-01 as day 1)
df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].map(dt.toordinal)
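
As a quick sanity check on the ordinal encoding, the following sketch converts a date to its ordinal and back. The two ordinals are the first and last `date` values shown in the dataframe output of this notebook; everything else is illustrative.

```python
from datetime import date

import pandas as pd

# pd.Timestamp subclasses datetime, so toordinal() is available directly.
ordinal = pd.Timestamp("2018-01-01").toordinal()
print(ordinal)  # 736695

# The reverse direction recovers the calendar date from the day count.
last_day = date.fromordinal(737059)
print(last_day)  # 2018-12-31
```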

In [9]:
df

Unnamed: 0,date,time,flow,anomaly,dayofweek
0,736695,00:07:30,18.333067,0,-1
1,736695,00:22:30,18.333067,0,-1
2,736695,00:37:30,19.784872,0,-1
3,736695,00:52:30,22.294744,0,-1
4,736695,01:07:30,27.229756,0,-1
...,...,...,...,...,...
35035,737059,22:52:30,24.792000,0,0
35036,737059,23:07:30,23.029933,0,0
35037,737059,23:22:30,20.415628,0,0
35038,737059,23:37:30,22.019056,0,0


### Time in indexes

In [10]:
# extract all unique time-of-day values present in the dataframe
time_unique_val = df.time.unique()
# count them, so that series with a different number of periods per day are also supported
periods_per_day = len(time_unique_val)
time_unique_ind = np.arange(periods_per_day)
# mapping between each time of day and its index
time_unique = pd.DataFrame({'time': time_unique_val, 'time_unique_ind': time_unique_ind})
# replace the time column with the time index
df['time'] = df['time'].map(time_unique.set_index('time')['time_unique_ind'])
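
For comparison, here is a minimal sketch of the same time-to-index mapping on toy data, next to `pd.factorize`, which produces the same codes in one call. The frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two "days" with three periods each.
toy = pd.DataFrame({"time": ["00:07:30", "00:22:30", "00:37:30",
                             "00:07:30", "00:22:30", "00:37:30"]})

# Approach used in the notebook: unique values -> positional index -> map.
unique_times = toy["time"].unique()
lookup = pd.Series(np.arange(len(unique_times)), index=unique_times)
mapped = toy["time"].map(lookup)

# Compact alternative: factorize numbers values in order of first appearance.
codes, uniques = pd.factorize(toy["time"])

print(mapped.tolist())  # [0, 1, 2, 0, 1, 2]
print(codes.tolist())   # [0, 1, 2, 0, 1, 2]
```

Both variants number the times in order of first appearance, so the indices follow the time of day only if the first day in the file is complete and sorted.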

In [11]:
df

Unnamed: 0,date,time,flow,anomaly,dayofweek
0,736695,0,18.333067,0,-1
1,736695,1,18.333067,0,-1
2,736695,2,19.784872,0,-1
3,736695,3,22.294744,0,-1
4,736695,4,27.229756,0,-1
...,...,...,...,...,...
35035,737059,91,24.792000,0,0
35036,737059,92,23.029933,0,0
35037,737059,93,20.415628,0,0
35038,737059,94,22.019056,0,0


### Split the dataset by label

In [13]:
nor_obs = df.loc[df.anomaly==0]
ano_obs = df.loc[df.anomaly==1]

### Train and test sets

In [14]:
# OneClassSVM is trained with observations of only one class,
# so the algorithm is trained on normal observations (anomaly == 0).
# The remaining normal observations are merged with the anomalous ones to create the test set.
# Note: .loc slices by label and includes the end label, so label 20000 is part of this slice.
train_feature = nor_obs.loc[0:20000, :].drop(columns='anomaly')

In [15]:
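# create the test observations/features
# (DataFrame.append and the positional axis argument of drop were removed in pandas 2.0)
X_test_1 = nor_obs.loc[20000:, :].drop(columns='anomaly')
X_test_2 = ano_obs.drop(columns='anomaly')
X_test = pd.concat([X_test_1, X_test_2])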

In [16]:
Y_test_1 = nor_obs.loc[20000:, 'anomaly']
Y_test_2 = ano_obs['anomaly']
Y_test = pd.concat([Y_test_1, Y_test_2])
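
One detail of the split is worth illustrating: `.loc` slices by label and includes both endpoints, so `loc[0:20000]` and `loc[20000:]` share label 20000 whenever that row exists in the frame. A minimal sketch on a hypothetical five-row frame:

```python
import pandas as pd

# Toy data (hypothetical) with the default integer index 0..4.
toy = pd.DataFrame({"flow": [10.0, 11.0, 12.0, 13.0, 14.0]})

train = toy.loc[0:2]  # labels 0, 1, 2 (the end label is included)
test = toy.loc[2:]    # labels 2, 3, 4

overlap = train.index.intersection(test.index)
print(overlap.tolist())  # [2]

# One possible way to keep the two sets disjoint (an assumption, not the notebook's code):
test_disjoint = toy.loc[3:]
```

With a single shared row out of many thousands the effect on the reported metrics is negligible, but the same pattern can leak more data when the split point is chosen differently.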

### OneClassSVM model

In [17]:
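# setting the hyperparameters for One-Class SVM
# gamma is only used by the 'rbf', 'poly' and 'sigmoid' kernels, so it is ignored with kernel='linear'
# nu is an upper bound on the fraction of training errors: nu=0.95 allows up to 95% of training points to be treated as outliers
oneclass = svm.OneClassSVM(kernel='linear', gamma=0.001, nu=0.95)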

In [21]:
# training model
oneclass.fit(train_feature)

OneClassSVM(gamma=0.001, kernel='linear', nu=0.95)
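
To get a feel for how `nu` influences the model, the sketch below fits `OneClassSVM` on synthetic Gaussian data with a small and a large `nu` and compares the fraction of training points labelled as outliers. The data, the RBF kernel, and the helper `training_outlier_fraction` are assumptions made for illustration; this is not a tuning of the model used in this notebook.

```python
import numpy as np
from sklearn import svm

# Synthetic 2-D data (hypothetical), unrelated to the flow measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))


def training_outlier_fraction(nu):
    """Fit a one-class SVM and return the share of training points predicted as -1."""
    model = svm.OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X)
    return float(np.mean(model.predict(X) == -1))


frac_low = training_outlier_fraction(0.05)
frac_high = training_outlier_fraction(0.95)
print(frac_low, frac_high)
```

Since `nu` bounds the fraction of training errors from above, a value close to 1 lets the model reject most of the data it was trained on, which is consistent with the large number of `-1` predictions seen later on the test set.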

In [22]:
# test the algorithm on the test set
anomaly_pred = oneclass.predict(X_test)

In [23]:
# check the number of outliers predicted by the algorithm
unique, counts = np.unique(anomaly_pred, return_counts=True)
print (np.asarray((unique, counts)).T)

[[   -1 10567]
 [    1  4512]]


In [24]:
# convert Y-test and anomaly_pred to dataframe for ease of operation
Y_test= Y_test.to_frame()
Y_test=Y_test.reset_index()
anomaly_pred = pd.DataFrame(anomaly_pred)
anomaly_pred= anomaly_pred.rename(columns={0: 'prediction'})

In [26]:
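# performance check of the model
# convention used here: the "positive" class is normal (anomaly == 0, predicted +1),
# the "negative" class is anomalous (anomaly == 1, predicted -1)
TP = FN = FP = TN = 0
for j in range(len(Y_test)):
    if Y_test['anomaly'][j] == 0 and anomaly_pred['prediction'][j] == 1:
        TP = TP + 1   # normal predicted as normal
    elif Y_test['anomaly'][j] == 0 and anomaly_pred['prediction'][j] == -1:
        FN = FN + 1   # normal flagged as anomalous (false alarm)
    elif Y_test['anomaly'][j] == 1 and anomaly_pred['prediction'][j] == 1:
        FP = FP + 1   # anomaly predicted as normal (missed anomaly)
    else:
        TN = TN + 1   # anomaly flagged as anomalous (detected)
print(TP, FN, FP, TN)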

4466 10528 46 39
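
The same four counts can also be obtained without an explicit loop by combining boolean masks. The sketch below uses short hypothetical arrays rather than the notebook's `Y_test` and `anomaly_pred`, and keeps the same convention (normal is the "positive" class).

```python
import numpy as np

# Toy labels and predictions (hypothetical).
y_true = np.array([0, 0, 0, 1, 1])    # 0 = normal, 1 = anomaly
y_pred = np.array([1, -1, 1, 1, -1])  # +1 = inlier, -1 = outlier

TP = int(np.sum((y_true == 0) & (y_pred == 1)))   # normal predicted as normal
FN = int(np.sum((y_true == 0) & (y_pred == -1)))  # normal flagged as anomalous
FP = int(np.sum((y_true == 1) & (y_pred == 1)))   # anomaly predicted as normal
TN = int(np.sum((y_true == 1) & (y_pred == -1)))  # anomaly flagged as anomalous

print(TP, FN, FP, TN)  # 2 1 1 1
```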


In [27]:
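# performance metrics (with normal as the "positive" class)
accuracy = (TP+TN)/(TP+FN+FP+TN)
print(accuracy)
# sensitivity: share of normal observations predicted as normal
sensitivity = TP/(TP+FN)
print(sensitivity)
# specificity: share of anomalies flagged as anomalous
specificity = TN/(TN+FP)
print(specificity)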

0.2987598647125141
0.29785247432306255
0.4588235294117647
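
Because normal is treated as the "positive" class above, it can help to re-read the same counts from the anomaly's point of view. The sketch below reuses the four counts printed earlier in this notebook; the variable names (`detected`, `missed`, `false_alarms`) are introduced here only for illustration.

```python
# Counts printed above, where "positive" means normal.
TP, FN, FP, TN = 4466, 10528, 46, 39

detected = TN      # anomalies flagged as outliers
missed = FP        # anomalies predicted as normal
false_alarms = FN  # normal observations flagged as outliers

# Share of true anomalies that were detected (equals the specificity above).
anomaly_recall = detected / (detected + missed)
# Share of flagged observations that were true anomalies.
anomaly_precision = detected / (detected + false_alarms)
# Share of normal observations that were flagged as anomalous.
false_alarm_rate = false_alarms / (false_alarms + TP)

print(anomaly_recall, anomaly_precision, false_alarm_rate)
```

Seen this way, fewer than half of the anomalies are detected, and well under 1% of the flagged observations are real anomalies.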


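One-class SVM performed poorly on this dataset with these hyperparameters: accuracy was about 30%, only 39 of the 85 anomalies in the test set (about 46%) were detected, and 10,528 of the 14,994 normal test observations (about 70%) were flagged as anomalous, i.e. false alarms.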