# Anomaly detection with k-means clustering

In this notebook we will use k-means clustering to detect anomalies in unlabelled data, i.e. we don't know beforehand what anomalies look like.

This dataset comes from the daily sensor measurements of an urban wastewater treatment plant. The objective is to classify the operational state of the plant in order to predict faults from the plant's state variables at each stage of the treatment process.

<img src='water.jpg'>

The variables in the data are as follows:

| Column   | Description                                                          |
|----------|----------------------------------------------------------------------|
| Q-E      | (input flow to plant)                                                |
| ZN-E     | (input Zinc to plant)                                                |
| PH-E     | (input pH to plant)                                                  |
| DBO-E    | (input Biological demand of oxygen to plant)                         |
| DQO-E    | (input chemical demand of oxygen to plant)                           |
| SS-E     | (input suspended solids to plant)                                    |
| SSV-E    | (input volatile suspended solids to plant)                           |
| SED-E    | (input sediments to plant)                                           |
| COND-E   | (input conductivity to plant)                                        |
| PH-P     | (input pH to primary settler)                                        |
| DBO-P    | (input Biological demand of oxygen to primary settler)               |
| SS-P     | (input suspended solids to primary settler)                          |
| SSV-P    | (input volatile suspended solids to primary settler)                 |
| SED-P    | (input sediments to primary settler)                                 |
| COND-P   | (input conductivity to primary settler)                              |
| PH-D     | (input pH to secondary settler)                                      |
| DBO-D    | (input Biological demand of oxygen to secondary settler)             |
| DQO-D    | (input chemical demand of oxygen to secondary settler)               |
| SS-D     | (input suspended solids to secondary settler)                        |
| SSV-D    | (input volatile suspended solids to secondary settler)               |
| SED-D    | (input sediments to secondary settler)                               |
| COND-D   | (input conductivity to secondary settler)                            |
| PH-S     | (output pH)                                                          |
| DBO-S    | (output Biological demand of oxygen)                                 |
| DQO-S    | (output chemical demand of oxygen)                                   |
| SS-S     | (output suspended solids)                                            |
| SSV-S    | (output volatile suspended solids)                                   |
| SED-S    | (output sediments)                                                   |
| COND-S   | (output conductivity)                                                |
| RD-DBO-P | (performance input Biological demand of oxygen in primary settler)   |
| RD-SS-P  | (performance input suspended solids to primary settler)              |
| RD-SED-P | (performance input sediments to primary settler)                     |
| RD-DBO-S | (performance input Biological demand of oxygen to secondary settler) |
| RD-DQO-S | (performance input chemical demand of oxygen to secondary settler)   |
| RD-DBO-G | (global performance input Biological demand of oxygen)               |
| RD-DQO-G | (global performance input chemical demand of oxygen)                 |
| RD-SS-G  | (global performance input suspended solids)                          |
| RD-SED-G | (global performance input sediments)                                 |

## Load libraries

In [None]:
# libraries
import pandas as pd
import numpy as np

import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline

from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


## Read in data

In [None]:
df = pd.read_csv("anomaly_water_clean.csv",index_col=0)

## Understand data

In [None]:
print(df.info())

In [None]:
# check the timestamp format and frequency
print(df['Date'].head(10))

In [None]:
# change the type of timestamp column for plotting
df['Date'] = pd.to_datetime(df['Date'])
# plot the data
df.plot(x='Date', y='Q-E')

## Clustering

We group together the usual combinations of features. Points that lie far from their cluster centre have unusual combinations of features, so we treat those points as anomalies.

In [None]:
# Take the useful features and standardize them (zero mean, unit variance)
data = df.drop(['Date'], axis=1)
scaler = preprocessing.StandardScaler()
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)

In [None]:
# I choose 2 centroids arbitrarily and add the cluster labels to the main dataframe
n_clust = 2
kmeans = KMeans(n_clusters=n_clust).fit(data)
# score() returns the negative inertia (sum of squared distances to the nearest centroid)
scores = kmeans.score(data)
df['cluster'] = kmeans.predict(data)

# to visualize we must reduce to 2 principal components
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data)
# standardize these 2 new features
pca_scaler = preprocessing.StandardScaler()
np_scaled = pca_scaler.fit_transform(data_pca)
data_pca = pd.DataFrame(np_scaled)

df['principal_feature1'] = data_pca[0]
df['principal_feature2'] = data_pca[1]
# show how many points fall in each cluster
print(df['cluster'].value_counts())
df = df.dropna().copy()
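
Because the number of centroids above is chosen arbitrarily, one common way to sanity-check it is the elbow heuristic: fit k-means for several values of k and look for the point where the inertia stops dropping sharply. The sketch below is only an illustration on synthetic blobs (the array `X`, the range of k, and `random_state=0` are assumptions, not part of the notebook); on the real data you would pass the standardized `data` instead.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs stand in for the standardized sensor data
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 4)) for c in (-5.0, 0.0, 5.0)])

# Fit k-means for a range of k and record the inertia
# (sum of squared distances from each point to its nearest centroid)
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The inertia always decreases as k grows, so the value to look for is where the decrease flattens out, not the minimum.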

In [None]:
# plot the different clusters against the 2 principal components
colors = {0:'red', 1:'blue', 2:'green', 3:'pink', 4:'black', 5:'orange', 6:'cyan', 7:'yellow', 8:'brown', 9:'purple', 10:'white', 11: 'grey', 12:'lightblue', 13:'lightgreen', 14: 'darkgrey'}
plt.scatter(df['principal_feature1'], df['principal_feature2'], c=df["cluster"].apply(lambda x: colors[x]))
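
Since the scatter plot shows only two principal components, it can help to check how much of the variance those two components retain before reading too much into the picture. The following is a minimal sketch on synthetic data (the matrix `X` and its two-factor structure are assumptions made for illustration); with the notebook's objects you would inspect `pca.explained_variance_ratio_` directly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 8 observed columns driven by 2 hidden factors plus a little noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 8))

X_scaled = StandardScaler().fit_transform(X)
pca_demo = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each component, and by both together
ratios = pca_demo.explained_variance_ratio_
retained = ratios.sum()
print(ratios, retained)
```

If the retained fraction is low, points that look close together in the 2D plot may still be far apart in the full feature space.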

In [None]:
# define a function to calculate each point's distance to its assigned cluster centre
def getDistanceByPoint(data, model):
    distance = pd.Series(dtype=float)
    for i in range(0, len(data)):
        Xa = np.array(data.loc[i])
        # labels_ already holds 0-based centroid indices, so no offset is needed
        Xb = model.cluster_centers_[model.labels_[i]]
        distance.at[i] = np.linalg.norm(Xa - Xb)
    return distance
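
As an alternative to the explicit loop, the same distances can be computed in a vectorized way. The sketch below assumes synthetic data (`X` is made up for illustration) and shows two equivalent routes: `KMeans.transform`, which returns the distance from every point to every centroid, and direct indexing of `cluster_centers_` with `labels_`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the standardized feature matrix
X = rng.normal(size=(200, 5))
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() gives an (n_samples, n_clusters) matrix of distances to each centroid;
# the row-wise minimum is the distance to the nearest centroid
dist_nearest = model.transform(X).min(axis=1)

# Equivalent: look up each point's assigned centroid and take the Euclidean norm
dist_assigned = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)
```

Both arrays hold one distance per row, in the same order as the input.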


In [None]:
# get the distance between each point and its nearest centroid. The largest distances are treated as anomalies
distance = getDistanceByPoint(data, kmeans)
df['distance'] = distance

In [None]:
# set the fraction of points to classify as outliers
outliers_fraction = 0.01

In [None]:
# recompute the distances and flag the largest outliers_fraction of them
distance = getDistanceByPoint(data, kmeans)
number_of_outliers = int(outliers_fraction*len(distance))
threshold = distance.nlargest(number_of_outliers).min()
# anomaly21 contains the anomaly result of the cluster method (0:normal, 1:anomaly)
df['anomaly21'] = (distance >= threshold).astype(int)
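
To make the thresholding step concrete, here is a small worked sketch on a hand-made distance series. The helper name `flag_top_fraction` and the `max(1, ...)` guard are assumptions added for illustration and are not part of the notebook; the guard only avoids the case where the fraction rounds down to zero outliers on a short series, which would leave the threshold undefined.

```python
import pandas as pd

def flag_top_fraction(distance, fraction):
    """Flag the largest `fraction` of distances as anomalies (1) and the rest as normal (0)."""
    # Hypothetical guard: always flag at least one point
    n_out = max(1, int(fraction * len(distance)))
    # The threshold is the smallest of the n_out largest distances
    threshold = distance.nlargest(n_out).min()
    return (distance >= threshold).astype(int), threshold

# Ten made-up distances, two of which are clearly larger than the rest
demo_distance = pd.Series([0.5, 0.7, 0.6, 5.0, 0.4, 0.8, 0.3, 0.9, 0.2, 4.0])
flags, demo_threshold = flag_top_fraction(demo_distance, 0.2)
print(flags.tolist(), demo_threshold)
```

Note that ties at the threshold are all flagged, so the number of anomalies can exceed the requested fraction.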

In [None]:
# visualization of anomalies in the cluster view
plt.scatter(df['principal_feature1'], df['principal_feature2'], c=df["anomaly21"].apply(lambda x: colors[x]))

## Visualization through time

Make a few plots of the data through time. You can plot the raw data, and also the principal components which summarize the data. On these plots, add lines or points to show where the anomalies occur in time.
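
One possible starting point is sketched below: a line plot of a single column against `Date`, with the flagged rows overlaid as red points. The helper `plot_with_anomalies` is hypothetical, and the demo frame is synthetic with two spikes flagged by hand; with the notebook's data you would pass `df` and a column such as `'Q-E'` or `'principal_feature1'`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_with_anomalies(frame, column, flag_column="anomaly21", date_column="Date"):
    """Plot `column` over time and mark rows where `flag_column` equals 1."""
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.plot(frame[date_column], frame[column], color="blue", label=column)
    flagged = frame[frame[flag_column] == 1]
    ax.scatter(flagged[date_column], flagged[column], color="red", label="anomaly", zorder=3)
    ax.set_xlabel(date_column)
    ax.set_ylabel(column)
    ax.legend()
    return fig, ax

# Synthetic demo: 60 daily values with two injected spikes, flagged by hand
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=60, freq="D"),
    "Q-E": rng.normal(loc=100.0, scale=2.0, size=60),
    "anomaly21": 0,
})
demo.loc[[10, 40], "Q-E"] += 30.0
demo.loc[[10, 40], "anomaly21"] = 1

fig, ax = plot_with_anomalies(demo, "Q-E")
```

Calling the helper once per column of interest gives a quick view of whether the anomalies cluster in particular periods.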