## Anomaly detection in Univariate Time series with Random Forest (Classification), in the context of WISDom project

#### Data:
    1. Flow Rate Data from a sensor in a Water Sypply System located in Barreiro
    2. Holidays since 1970 to 2029 (+ 3 regional holidays of 2018)

#### Problem Definition:
The task here is to predict whether or not it is an anomaly.
#### Solution:
This is a binary classification problem and we will use a random forest classifier to solve this problem.

### Libraries and Packages

In [1]:
# basics
import pandas as pd
import numpy as np
from datetime import datetime as dt
# for graphics
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set()
# train_test_split
from sklearn.model_selection import train_test_split
# feature scaling
from sklearn.preprocessing import StandardScaler
# algorithm
from sklearn.ensemble import RandomForestClassifier
# evaluatuon metrics
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

### Load dataset

In [2]:
features = ["date","time","flow","anomaly"]
df = pd.read_csv('barreiro_ano.csv', sep=';', names=features)
holidays = pd.read_csv('holidays2018.csv',sep=';')

### New column: DayOfWeek

In [4]:
# extra column indicating day of week
# 0: mon, 1:tue, ..., 5:sat, 6:sun
df['dayofweek'] = pd.to_datetime(df['date'],dayfirst=True)
df['dayofweek'] = df['dayofweek'].dt.dayofweek

In [5]:
# if day is a holiday, then dayofweek is -1
df.loc[df.date.isin(holidays.date), 'dayofweek'] = -1

-1 is definitely viable, or is the number 7 better (for example)?

### Date in indexes

In [6]:
# requisition of Feature Scaling - 1st option
# returns the day count from the date
df['date'] = pd.to_datetime(df['date'])
df['date']=df['date'].map(dt.toordinal)

In [None]:
# OR
# dt.strftime convert index using specified date_format
# Convert date in format d/m/y to date in format y/m/d represented in the column "int_date"
# df['int_date'] = pd.to_datetime(df['date'],dayfirst=True).dt.strftime("%Y%m%d").astype(int)

### Time in indexes

In [7]:
# to extract all unique values (time) present in dataframe
time_unique_val=df.time.unique()
# in order to accept time series with different periods per day
periods_per_day=len(time_unique_val)
time_unique_ind=np.arange(periods_per_day)
#in order to have a mapping between the time of day and its index
time_unique=pd.DataFrame({'time':time_unique_val, 'time_unique_ind':time_unique_ind})
#creates a column with the time index
df['time'] = df['time'].map(time_unique.set_index('time')['time_unique_ind'])

### Final dataset

In [8]:
df

Unnamed: 0,date,time,flow,anomaly,dayofweek
0,736695,0,18.333067,0,-1
1,736695,1,18.333067,0,-1
2,736695,2,19.784872,0,-1
3,736695,3,22.294744,0,-1
4,736695,4,27.229756,0,-1
...,...,...,...,...,...
35035,737059,91,24.792000,0,0
35036,737059,92,23.029933,0,0
35037,737059,93,20.415628,0,0
35038,737059,94,22.019056,0,0


### Train and test sets

In [9]:
X = df.iloc[:, [0,1,2,4]].values
y = df.iloc[:, 3].values

In [10]:
X

array([[ 7.36695000e+05,  0.00000000e+00,  1.83330667e+01,
        -1.00000000e+00],
       [ 7.36695000e+05,  1.00000000e+00,  1.83330667e+01,
        -1.00000000e+00],
       [ 7.36695000e+05,  2.00000000e+00,  1.97848722e+01,
        -1.00000000e+00],
       ...,
       [ 7.37059000e+05,  9.30000000e+01,  2.04156278e+01,
         0.00000000e+00],
       [ 7.37059000e+05,  9.40000000e+01,  2.20190556e+01,
         0.00000000e+00],
       [ 7.37059000e+05,  9.50000000e+01,  2.07920000e+01,
         0.00000000e+00]])

In [11]:
y

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

### Train_test_split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

### Feature Scaling

In [14]:
# Therefore, it would be beneficial to scale our data (this step isn't as important for the random forests algorithm)
# To do so, we will use Scikit-Learn's StandardScaler class.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [15]:
X_train

array([[ 0.28889222, -1.46179811, -0.98349514, -1.3211612 ],
       [ 0.12747893,  0.59551838,  0.69261188,  1.49837869],
       [ 0.36485141,  1.64222327,  0.10842009,  1.02845538],
       ...,
       [ 1.27636173,  0.7037982 , -0.54338321, -0.85123789],
       [ 0.93454536, -0.73993267,  0.48422766,  0.55853206],
       [-1.45816923, -0.12634705, -0.40093037, -1.3211612 ]])

In [24]:
X_test

array([[-1.23029165,  0.84817129,  0.44078602,  0.08860875],
       [ 1.73211689,  1.71440981, -0.73806517, -1.3211612 ],
       [ 0.4882851 ,  0.01802603,  2.11318655, -0.38131457],
       ...,
       [-0.59413341, -0.59555959, -0.09367211, -1.3211612 ],
       [-0.41373032, -1.06477212, -1.23308912,  1.02845538],
       [-0.87898038, -1.35351829, -1.34629055,  1.02845538]])

### Random Forest model

In [18]:
classifier = RandomForestClassifier(n_estimators=550) # random_state=0
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

### Evaluating the algorithm

For classification problems the metrics used to evaluate an algorithm are accuracy, confusion matrix, precision recall, and F1 values.

In [21]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

[[10489     0]
 [    5    18]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     10489
           1       1.00      0.78      0.88        23

    accuracy                           1.00     10512
   macro avg       1.00      0.89      0.94     10512
weighted avg       1.00      1.00      1.00     10512

0.9995243531202436


The accuracy achieved for by our random forest classifier with 550 trees is 99.95%.