<h1 style="background-color:#35ffff;font-family:newtimeroman;font-size:225%;text-align:center;border-radius: 30px 50px;"> 📈 Time Series Outlier and Anomaly Detection 📈</h1><a id=0></a>

# Time-Series-Anomaly-Detection

# Introduction:

Anomaly detection is the process of discovering events or points that are unexpected at 
their position in the dataset or that deviate from the normal pattern of the dataset. 
Detecting these points is very important because it gives us an early opportunity to take 
emergency action and control the unusual change. 
We tried several techniques in order to find the best one to apply in our project. 

# Dataset:
This dataset provides information about the telecommunication activity over the city of Milano. The dataset is the result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milano. CDRs log the user activity for billing purposes and network management.

- [Milan dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EGZHFV)
- [Article that describes the data](https://www.nature.com/articles/sdata201555)
 
# Anomaly Detection Methods: 
The outlier and anomaly detection methods covered here are:

<li><a href="#m1">1- Tukey’s box plot method</a>
<li><a href="#m2">2- Isolation forest.</a>
<li><a href="#m3">3- Anomaly Detection with LSTM Autoencoders</a>
<li><a href="#m4">4- Seasonal-Trend Decomposition.</a>

In [80]:
import datetime
import warnings
from datetime import timezone

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from scipy import stats
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings('ignore')

In [81]:
df = pd.read_csv('/kaggle/input/milan-dataset/final_data.csv',parse_dates= ["time"])

In [82]:
df.head()

In [83]:
df.shape

In [84]:
df.info()

In [85]:
df.describe()

In [86]:
sns.boxplot(x=df['internet_cdr'])

In [87]:
fig = plt.figure()
# distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
sns.histplot(df['internet_cdr'], kde=True, stat='density')
plt.show()

#Probability plot
fig = plt.figure()
res = stats.probplot(df['internet_cdr'], plot=plt)

## <h3> 1- Tukey’s box plot method</h3>
<a id="m1"></a>

<img src="https://raw.githubusercontent.com/abdallah-elsawy/Time-Series-Anomaly-Detection/main/Outputs/Tukeys-Box-plot-method/Tukey%E2%80%99s-box.jpg">

In this method we rely on the box plot to determine whether a point is an outlier. Beyond 
that, it lets us decide whether the outlier is a possible or a probable outlier, by 
calculating the following parameters: 
- 25th percentile (Q1) 
- 75th percentile (Q3) 
- interquartile range (IQR = Q3 – Q1) 
- Lower inner fence: Q1 – (1.5 * IQR) 
- Upper inner fence: Q3 + (1.5 * IQR) 
- Lower outer fence: Q1 – (3 * IQR) 
- Upper outer fence: Q3 + (3 * IQR) 

A point that lies between the inner fence and the outer fence is considered a possible 
outlier. A point that lies outside the outer fence is considered a probable 
outlier. 
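
The fence rules above can be checked by hand on a small example. The sketch below is only an illustration: the sample values are made up, and it uses the standard library's `statistics.quantiles` with `method="inclusive"` (linear interpolation between data points) instead of pandas. It labels points exactly as the text defines them, so "possible" means between the inner and outer fences.

```python
from statistics import quantiles


def tukey_fences(values):
    """Return the (lower, upper) inner and outer fences built from Q1, Q3 and the IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer


def classify(value, inner, outer):
    """Label a value as 'probable', 'possible' or 'normal' following the text."""
    if value < outer[0] or value > outer[1]:
        return "probable"
    if value < inner[0] or value > inner[1]:
        return "possible"
    return "normal"


# Hypothetical sample: nine regular readings plus two unusually large ones
sample = [10, 11, 12, 13, 14, 15, 16, 17, 18, 26, 40]
inner, outer = tukey_fences(sample)
labels = {value: classify(value, inner, outer) for value in sample}

print("inner fences:", inner)
print("outer fences:", outer)
print("26 ->", labels[26], "| 40 ->", labels[40])
```

Here Q1 = 12.5 and Q3 = 17.5, so IQR = 5: the value 26 falls between the upper inner fence (25) and the upper outer fence (32.5), while 40 lies beyond the outer fence.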

#### First, apply to a random grid

In [88]:
fig, ax = plt.subplots(figsize=(30,10))
ax.plot(df.groupby("grid_square").get_group(5056).index, 
        df.groupby("grid_square").get_group(5056)['internet_cdr'], color='blue', label = 'Normal')
ax.set_title('Random grid(5056) ', fontsize=20)
plt.legend()
plt.show();

In [89]:
def tukeys_method(df, variable):
    #Takes two parameters: dataframe & variable of interest as string
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    iqr = q3-q1
    inner_fence = 1.5*iqr
    outer_fence = 3*iqr
    
    #inner fence lower and upper end
    inner_fence_le = q1-inner_fence
    inner_fence_ue = q3+inner_fence
    
    #outer fence lower and upper end
    outer_fence_le = q1-outer_fence
    outer_fence_ue = q3+outer_fence
    
    outliers_prob = []
    outliers_poss = []
    for index, x in enumerate(df[variable]):
        if x <= outer_fence_le or x >= outer_fence_ue:
            outliers_prob.append(index)
    for index, x in enumerate(df[variable]):
        if x <= inner_fence_le or x >= inner_fence_ue:
            outliers_poss.append(index)
    return outliers_prob, outliers_poss


In [90]:
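# Reset the index so the row positions returned by tukeys_method line up with the rows
random_grid = df.groupby("grid_square").get_group(5056).reset_index(drop=True)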

probable_outliers_tm, possible_outliers_tm = tukeys_method(random_grid, "internet_cdr")

print(probable_outliers_tm)
print("*****************************************************************************************")
print(possible_outliers_tm)

In [91]:
len(probable_outliers_tm)

In [92]:
len(possible_outliers_tm)

In [93]:
# Mark every possible outlier (beyond the inner fences) with 1, indexed by row position
anomaly = pd.DataFrame({'Anomaly': 1}, index=possible_outliers_tm)

In [94]:
random_grid = pd.concat([random_grid, anomaly], axis=1)

In [95]:
# Rows without a flag are NaN after the concat; turn the column into a clean boolean
random_grid['Anomaly'] = random_grid['Anomaly'].fillna(0).astype(bool)

In [96]:
random_grid

In [97]:
random_grid['Anomaly'].value_counts()

In [98]:
fig, ax = plt.subplots(figsize=(30,10))

anomaly = random_grid.loc[random_grid['Anomaly'] == True, ['internet_cdr']] 

ax.plot(random_grid.index, random_grid['internet_cdr'], color='blue', label = 'Normal')
ax.scatter(anomaly.index,anomaly['internet_cdr'], color='red', label = 'Anomaly')
ax.set_title('Random grid(5056) anomalies points for box plot method', fontsize=20)
plt.legend()
plt.show();

In [99]:
# fig1 = px.line(random_grid,  y="internet_cdr")
# fig1.update_traces(line=dict(color = 'magenta'))

# anomaly = random_grid.loc[random_grid['Anomaly'] == True, ['internet_cdr']] 
# fig2 = px.scatter(anomaly,y="internet_cdr")

# fig3 = go.Figure(data=fig1.data + fig2.data)
# fig3.update_layout(title="Random grid(5056) anomalies points for box blot method")
# fig3.show()

#### Second, apply to all grids

In [100]:
df = pd.read_csv('/kaggle/input/milan-dataset/final_data.csv',parse_dates= ["time"])

In [101]:
full_grid = df.groupby("grid_square")
grids = list(full_grid.groups.keys())

In [102]:
grids

In [103]:
# Group once and reuse the grouper instead of regrouping on every iteration
grouped = df.groupby("grid_square")
data = [grouped.get_group(grid) for grid in grids]
data

In [104]:
x=len(grids)
x

In [105]:
frames = []
for i in range(x):
    probable_outliers_tm, possible_outliers_tm = tukeys_method(data[i], "internet_cdr")

    # Flag the possible outliers by their row position within this grid
    anomaly = pd.DataFrame({'Anomaly': 1}, index=possible_outliers_tm)

    data_2 = pd.concat([data[i].reset_index(drop=True), anomaly], axis=1)

    print("========== grid number {} done ==========".format(i+1))
    print(data_2['Anomaly'].value_counts())
    print(data_2)
    print("===================================================================")

    frames.append(data_2)

# DataFrame.append was removed in pandas 2.0; concatenate the per-grid frames once
anomalies = pd.concat(frames)

In [106]:
#     probable_outliers_tm, possible_outliers_tm = tukeys_method(data[2], "internet_cdr")
    
#     anomaly = pd.DataFrame(possible_outliers_tm)
#     anomaly['Anomaly'] = 1
#     anomaly.set_index(0, inplace=True)

#     x=pd.concat([data[2].reset_index(drop=True), anomaly], axis=1)
#     print (x)
#     print(x['Anomaly'].value_counts())

#     print("========== grid number {} done ==========".format(i+1))

In [107]:
anomalies.reset_index(drop=True,inplace=True)

In [108]:
# Rows without a flag are NaN after the concat; turn the column into a clean boolean
anomalies['Anomaly'] = anomalies['Anomaly'].fillna(0).astype(bool)

In [109]:
anomalies['Anomaly'].value_counts()

In [110]:
fig, ax = plt.subplots(figsize=(30,10))
ax.plot(df.index, df['internet_cdr'], color='blue', label = 'Normal')
ax.set_title('Total grids', fontsize=20)
plt.show();

In [111]:
fig, ax = plt.subplots(figsize=(30,10))

anomaly = anomalies.loc[anomalies['Anomaly'] == True, ['internet_cdr']] 

ax.plot(df.index, df['internet_cdr'], color='blue', label = 'Normal')
ax.scatter(anomaly.index,anomaly['internet_cdr'], color='red', label = 'Anomaly')
ax.set_title('Total grids anomalies points for box plot method', fontsize=20)
plt.legend()
plt.show();

## <h3> 2- Isolation forest.</h3>
<a id="m2"></a>

In this method we rely on a Machine Learning algorithm for the detection: the Isolation 
Forest. 

Isolation Forest is built from decision trees. Points that travel deep into a tree before 
being isolated are unlikely to be anomalies, while points that are isolated after only a 
short path have a high probability of being anomalies. It is an unsupervised learning model, 
so it is used without labeled data. 

The algorithm selects a sample of the dataset and splits it with binary trees: at each node 
it picks a split threshold at random; points below the threshold go to the left branch and 
the remaining points go to the right branch. This process is repeated until every point is 
isolated. 
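
To make the "short path means anomaly" idea concrete, here is a minimal one-dimensional sketch written from scratch. It is not scikit-learn's `IsolationForest` (which also subsamples the data and picks random features); the function names and the sample values are assumptions made for illustration only.

```python
import random


def path_length(point, sample, rng, depth=0, max_depth=20):
    """Number of random splits needed before `point` is alone in its branch."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    low, high = min(sample), max(sample)
    if low == high:
        return depth
    split = rng.uniform(low, high)
    # Follow only the branch that contains the point, as a tree traversal would
    if point < split:
        branch = [value for value in sample if value < split]
    else:
        branch = [value for value in sample if value >= split]
    return path_length(point, branch, rng, depth + 1, max_depth)


def average_path_length(point, sample, n_trees=200, seed=0):
    """Average the isolation depth of `point` over many random trees."""
    rng = random.Random(seed)
    total = sum(path_length(point, sample, rng) for _ in range(n_trees))
    return total / n_trees


# Hypothetical readings: ten regular values and one far-away value
values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 100]
normal_depth = average_path_length(14, values)
outlier_depth = average_path_length(100, values)

print("average depth of a regular point:", normal_depth)
print("average depth of the far-away point:", outlier_depth)
```

Because most random thresholds land in the wide gap between 19 and 100, the far-away value is usually cut off by the first split, while a value in the middle of the cluster needs several splits.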

In [112]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

In [113]:
df = pd.read_csv('/kaggle/input/milan-dataset/final_data.csv', parse_dates= ["time"])

#### First, apply to a random grid

In [114]:
random_grid = df.groupby("grid_square").get_group(5056)
random_grid.info()

In [116]:
# train isolation forest
outliers_fraction = float(.01)
scaler = StandardScaler()
np_scaled = scaler.fit_transform(random_grid["internet_cdr"].values.reshape(-1, 1))
df_data = pd.DataFrame(np_scaled)

model =  IsolationForest(contamination=outliers_fraction)
model.fit(df_data) 

In [117]:
# predict isolation forest
anomaly = model.predict(df_data)

In [118]:
anomaly = pd.DataFrame(anomaly,columns=['anomaly'])
anomaly

In [119]:
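# The predictions are indexed 0..n-1, so reset the grid's index before joining side by side
random_grid = pd.concat([random_grid.reset_index(drop=True), anomaly], axis=1)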
random_grid

In [120]:
random_grid['anomaly'].value_counts()

In [121]:
fig1 = px.line(random_grid,  y="internet_cdr")
fig1.update_traces(line=dict(color = 'turquoise'))

anomaly = random_grid.loc[random_grid['anomaly'] == -1, ['internet_cdr']]

fig2 = px.scatter(anomaly,y='internet_cdr')
fig3 = go.Figure(data=fig1.data + fig2.data)

fig3.update_layout(title="Random grid(5056) anomalies points for isolation forest")
fig3.show()

#### Second, apply to all grids

In [122]:
# Group once and reuse the grouper instead of regrouping on every iteration
grouped = df.groupby("grid_square")
data = [grouped.get_group(grid) for grid in grids]
data

In [123]:
data[0]

In [124]:
# train isolation forest
anomaly_frames = []
for i in range(x):
    full_grid = data[i]

    outliers_fraction = 0.01

    scaler = StandardScaler()
    np_scaled = scaler.fit_transform(full_grid["internet_cdr"].values.reshape(-1, 1))
    df_data = pd.DataFrame(np_scaled)

    model = IsolationForest(contamination=outliers_fraction)
    model.fit(df_data)
    anomaly = model.predict(df_data)
    anomaly = pd.DataFrame(anomaly, columns=['anomaly'])
    print("========== grid number {} predicted ==========".format(i+1))
    anomaly_frames.append(anomaly)

# DataFrame.append was removed in pandas 2.0; concatenate the per-grid predictions once
full_anomaly = pd.concat(anomaly_frames)

In [125]:
full_anomaly.reset_index(inplace= True)
full_anomaly.drop('index', axis=1, inplace=True)
full_anomaly

In [126]:
print(df.shape)
print(full_anomaly.shape)

In [127]:
df = pd.concat([df, full_anomaly], axis=1)
df

In [128]:
df['anomaly'].value_counts()

In [129]:
fig1 = px.line(df,  y="internet_cdr")
fig1.update_traces(line=dict(color = 'rgba(250,100,10,0.2)'))
fig1.update_layout(title="Total grids")
fig1.show()

In [130]:
fig1 = px.line(df,  y="internet_cdr")
fig1.update_traces(line=dict(color = 'rgba(250,100,10,0.2)'))
anomaly = df.loc[df['anomaly'] == -1, ['internet_cdr']] 

fig2 = px.scatter(anomaly,y='internet_cdr')

fig3 = go.Figure(data=fig1.data + fig2.data)
fig3.update_layout(title="Total grids anomalies points for isolation forest")
fig3.show()

In [131]:

# fig, ax = plt.subplots(figsize=(30,10))

# anomaly = df.loc[df['anomaly'] == -1, ['internet_cdr']] 

# ax.plot(df.index, df['internet_cdr'], color='blue', label = 'Normal')
# ax.scatter(anomaly.index,anomaly['internet_cdr'], color='red', label = 'Anomaly')
# ax.set_title('All grids anomalies points', fontsize=20)
# plt.legend()
# plt.show();

## <h3> 3- Anomaly Detection with LSTM Autoencoders. </h3>
<a id="m3"></a>

In this method we rely on detection through forecasting with Deep Learning algorithms. In 
forecasting methods we predict the next point (with some added noise) and compare it with 
the true point at that timestamp by taking the difference between the two. We then define a 
threshold and find the anomalies by comparing that difference with the threshold (we used 
the Mean Absolute Error, MAE). 

Autoencoders are a type of self-supervised neural network that learns from the input data 
itself. We use an autoencoder rather than Principal Component Analysis (PCA) because PCA 
relies on linear algebra, that is, on linear transformations. Autoencoders apply non-linear 
transformations through their activation functions, and this non-linearity lets us stack 
more layers and build deeper networks. 

Long Short-Term Memory (LSTM) is a type of artificial recurrent neural network (RNN). RNNs 
are designed to handle sequential data, with the previous step's output being fed as the 
current step's input. 

<img src="https://raw.githubusercontent.com/abdallah-elsawy/Time-Series-Anomaly-Detection/main/Outputs/LSTM-Autoencoders/Anomaly-detection-autoencoders.png">

We apply dimensionality reduction to our dataset: the encoder compresses the input to a 
smaller dimension, and the decoder then restores it while minimizing the reconstruction 
loss. This does lose some information, but it lets us learn the main pattern of the data. 
Any point that falls outside this pattern, that is, whose reconstruction error exceeds some 
threshold, is then treated as an outlier. 

In [132]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from pandas.plotting import register_matplotlib_converters

%matplotlib inline
%config InlineBackend.figure_format='retina'

register_matplotlib_converters()
sns.set(style='whitegrid', palette='muted', font_scale=1.5)

rcParams['figure.figsize'] = 22, 10

RANDOM_SEED = 42

np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

In [133]:
df = pd.read_csv('/kaggle/input/milan-dataset/final_data.csv',parse_dates=['time'], index_col='time')

In [134]:
df= df.groupby("grid_square").get_group(5056)

In [135]:
train_size = int(len(df) * 0.85)
test_size = len(df) - train_size
# Copy the slices so the scaling below does not write into views of df
train, test = df.iloc[0:train_size].copy(), df.iloc[train_size:len(df)].copy()
print(train.shape, test.shape)

In [136]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler = scaler.fit(train[['internet_cdr']])

train['internet_cdr'] = scaler.transform(train[['internet_cdr']])
test['internet_cdr'] = scaler.transform(test[['internet_cdr']])

In [137]:
def create_dataset(X, y, time_steps=1):
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        v = X.iloc[i:(i + time_steps)].values
        Xs.append(v)        
        ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(ys)

In [138]:
TIME_STEPS = 30

# reshape to [samples, time_steps, n_features]

X_train, y_train = create_dataset(train[['internet_cdr']], train.internet_cdr, TIME_STEPS)
X_test, y_test = create_dataset(test[['internet_cdr']], test.internet_cdr, TIME_STEPS)

print(X_train.shape)

In [139]:
model = keras.Sequential()
model.add(keras.layers.LSTM(
    units=64, 
    input_shape=(X_train.shape[1], X_train.shape[2])
))
model.add(keras.layers.Dropout(rate=0.2))
model.add(keras.layers.RepeatVector(n=X_train.shape[1]))
model.add(keras.layers.LSTM(units=64, return_sequences=True))
model.add(keras.layers.Dropout(rate=0.2))
model.add(keras.layers.TimeDistributed(keras.layers.Dense(units=X_train.shape[2])))
model.compile(loss='mae', optimizer='adam')
model.summary()

In [140]:
# An autoencoder learns to reconstruct its own input, so the target is the window itself
history = model.fit(
    X_train, X_train,
    epochs=20,
    batch_size=32,
    validation_split=0.1,
    shuffle=False
)

In [141]:
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend();

In [142]:
X_test_pred = model.predict(X_test)

test_mae_loss = np.mean(np.abs(X_test_pred - X_test), axis=1)

In [143]:
len(test_mae_loss)

In [144]:
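# distplot is deprecated in recent seaborn releases; flatten the (n, 1) losses for histplot
sns.histplot(test_mae_loss.ravel(), bins=50, kde=True);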

In [145]:
X_train_pred = model.predict(X_train, verbose=0)
train_mae_loss = np.mean(np.abs(X_train_pred - X_train), axis=1)

plt.hist(train_mae_loss, bins=50)
plt.xlabel('Train MAE loss')
plt.ylabel('Number of Samples');

threshold = np.max(train_mae_loss)
print(f'Reconstruction error threshold: {threshold}')

In [146]:
THRESHOLD = 0.65

test_score_df = pd.DataFrame(index=test[TIME_STEPS:].index)
test_score_df['loss'] = test_mae_loss
test_score_df['threshold'] = THRESHOLD
test_score_df['anomaly'] = test_score_df.loss > test_score_df.threshold
test_score_df['internet_cdr'] = test[TIME_STEPS:].internet_cdr

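To determine the cutoff point we use the Mean Absolute Error (MAE) as the reconstruction 
error. We use the MAE because windows that contain outliers yield a noticeably larger error. 
MAE is the mean of the absolute differences between the actual values $y_i$ and the 
predicted values $y'_i$ over the $n$ points of a window: 

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - y'_i\right|$$

A window is flagged as an anomaly when its MAE exceeds the threshold (`THRESHOLD = 0.65` in 
the cell above). 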
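
As a small worked example of this rule, the sketch below computes the MAE for two made-up windows and compares each with the threshold. The window values and their "reconstructions" are hypothetical; only the value 0.65 is taken from the notebook.

```python
def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


THRESHOLD = 0.65  # same cutoff as in the notebook

# Hypothetical (actual, reconstructed) windows of scaled values
windows = [
    ([0.1, 0.2, 0.1], [0.1, 0.2, 0.2]),  # reconstructed almost perfectly
    ([0.0, 2.5, 0.1], [0.1, 0.3, 0.1]),  # the spike at 2.5 is not reproduced
]

losses = [mean_absolute_error(actual, predicted) for actual, predicted in windows]
flags = [loss > THRESHOLD for loss in losses]

for loss, flag in zip(losses, flags):
    print(f"loss={loss:.3f} anomaly={flag}")
```

The first window has an MAE of about 0.03 and stays below the threshold; the second has an MAE of about 0.77 and is flagged.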



In [147]:
plt.plot(test_score_df.index, test_score_df.loss, label='loss')
plt.plot(test_score_df.index, test_score_df.threshold, label='threshold')
plt.xticks(rotation=25)
plt.title('test_score_loss vs. threshold')
plt.legend();

Applying the threshold to the predicted values gives us the anomalies: they are the points 
where the loss signal in the previous graph rises above the threshold line. This produces 
the following graph of the anomalies.

In [148]:
anomalies = test_score_df[test_score_df.anomaly == True]
anomalies

In [149]:
test_score_df['anomaly'].value_counts()

In [150]:
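plt.plot(
  test[TIME_STEPS:].index, 
  # inverse_transform expects 2-D input, so pass a one-column frame and flatten the result
  scaler.inverse_transform(test[TIME_STEPS:][['internet_cdr']]).ravel(), 
  label='internet_cdr'
);

# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(
  x=anomalies.index,
  y=scaler.inverse_transform(anomalies[['internet_cdr']]).ravel(),
  color=sns.color_palette()[3],
  s=52,
  label='anomaly'
)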
plt.xticks(rotation=25)
plt.title('Anomalies')
plt.legend();


## <h3> 4- Seasonal-Trend Decomposition.</h3>
<a id="m4"></a>
 
Now we move to the final method, decomposition. Signal decomposition breaks our signal down 
into its three main components: seasonal, trend and residual (S, T, R). 
The seasonal component contains the fastest pattern, which recurs regularly at a fixed 
interval. The trend captures the general shape of the data over the whole dataset. Finally, 
the residual is what remains of the signal after the seasonal and trend components are 
extracted; it is essentially the random part of the signal.


The residual is our focus here: we first decompose the signal into its three main components 
and then work on the residual. 
We define a threshold based on a confidence interval around the residual (its mean plus or 
minus three standard deviations), apply it to the residual, and decide whether each point is an anomaly or not.
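
The thresholding step on its own can be sketched without any decomposition library. The residual values below are made up for illustration; the rule is the mean plus or minus three sample standard deviations, which matches what pandas' `.std()` computes by default.

```python
from statistics import mean, stdev


def three_sigma_anomalies(residuals):
    """Return the positions whose residual lies outside mean ± 3 standard deviations."""
    mu = mean(residuals)
    sigma = stdev(residuals)  # sample standard deviation (n - 1 in the denominator)
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    return [i for i, r in enumerate(residuals) if r < lower or r > upper]


# Hypothetical residuals: small fluctuations around zero with one large spike at position 10
residuals = [1, -1] * 5 + [12] + [1, -1] * 5
anomalies = three_sigma_anomalies(residuals)

print("anomalous positions:", anomalies)
```

Note that the spike itself inflates the standard deviation, so with very short series a genuine outlier can hide inside the band it helped to widen.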

In [151]:
df = pd.read_csv('/kaggle/input/milan-dataset/final_data.csv')
df.set_index('time', inplace=True)
df_=df

#### First, apply to a random grid

In [152]:
# drop() returns a new frame, which avoids modifying a slice of df in place
random_grid = df.groupby("grid_square").get_group(5056).drop('grid_square', axis=1)
random_grid.head()

In [153]:
from statsmodels.tsa.seasonal import STL

In [154]:
stl = STL(random_grid,period=12)
result = stl.fit()

In [155]:
seasonal = result.seasonal
trend = result.trend
resid = result.resid

In [156]:
plt.figure(figsize=(30,15))



plt.subplot(3,1,1)
plt.plot(trend)
plt.title('Trend', fontsize=16)

plt.subplot(3,1,2)
plt.plot(seasonal)
plt.title('Seasonal', fontsize=16)

plt.subplot(3,1,3)
plt.plot(resid)
plt.title('Residual', fontsize=16)

plt.tight_layout()

To make the residual easier to see, we plot the signal with and without the residual 
component as follows: 

In [157]:
estimated = trend + seasonal
plt.figure(figsize=(12,4))
plt.plot(random_grid)
plt.plot(estimated)
# Title the current figure; `ax` still refers to an axes created in an earlier cell
plt.title('Random grid(5056) plot over (trend + seasonal) plot ', fontsize=20)

#### Anomaly Detection

In [158]:
random_grid = pd.concat([random_grid, resid.to_frame()], axis=1)
random_grid

In [159]:
resid_mu = resid.mean()
resid_dev = resid.std()

lower = resid_mu - 3*resid_dev
upper = resid_mu + 3*resid_dev
    
random_grid['anomaly'] = (random_grid['resid'] < lower) | (random_grid['resid'] > upper)
random_grid

In [160]:
random_grid['anomaly'].value_counts()

In [161]:
random_grid=random_grid.reset_index()
random_grid["time"] = pd.to_datetime(random_grid["time"])

In [162]:
# anomaly = full_grid.loc[full_grid['anomaly'] == True, ['internet_cdr']]
# anomaly.reset_index()

In [163]:
fig, ax = plt.subplots(figsize=(30,10))

anomaly = random_grid.loc[random_grid['anomaly'] == True, ['internet_cdr']] 

ax.plot(random_grid.index, random_grid['internet_cdr'], color='blue', label = 'Normal')
ax.scatter(anomaly.index,anomaly['internet_cdr'], color='red', label = 'Anomaly')
ax.set_title('Random grid anomalies points for Decomposition Method', fontsize=20)
plt.legend()
plt.show();

In [164]:
# fig1 = px.line(random_grid.reset_index(),  y="internet_cdr")
# fig1.update_traces(line=dict(color = 'magenta'))

# anomaly = random_grid.loc[random_grid['anomaly'] == True, ['internet_cdr']] 
# fig2 = px.scatter(anomaly,y="internet_cdr")

# fig3 = go.Figure(data=fig1.data + fig2.data)

# fig3.show()

#### Second, apply to all grids

In [165]:
grouped = df.groupby("grid_square")
# drop() returns a new frame, which avoids modifying a slice of df in place
data = [grouped.get_group(grid).drop('grid_square', axis=1) for grid in grids]
data

In [166]:
# Trial run on a single grid (data[1], the second grid) before looping over all grids
stl = STL(data[1],period=12)
result = stl.fit()
full_decom1 = result.seasonal.to_frame()
full_decom2 = result.trend.to_frame()
full_decom3 = result.resid.to_frame()
print("========== grid number {} done ==========".format(2))

In [167]:
seasonal_frames, trend_frames, resid_frames = [], [], []
for i in range(x):
    stl = STL(data[i], period=12)
    result = stl.fit()
    seasonal_frames.append(result.seasonal.to_frame())
    trend_frames.append(result.trend.to_frame())
    resid_frames.append(result.resid.to_frame())
    print("========== grid number {} done ==========".format(i+1))

# DataFrame.append was removed in pandas 2.0; concatenate the per-grid components once
seasonal_decom = pd.concat(seasonal_frames)
trend_decom = pd.concat(trend_frames)
resid_decom = pd.concat(resid_frames)

In [168]:
df = pd.concat([df, seasonal_decom,trend_decom,resid_decom], axis=1)
df

####  Anomaly Detection

In [169]:
df1=df

In [170]:
# Group once and reuse the grouper instead of regrouping on every iteration
grouped = df.groupby("grid_square")
data = [grouped.get_group(grid) for grid in grids]
data

In [171]:
anomaly_frames = []
for i in range(x):
    resid_mu = data[i]['resid'].mean()
    resid_dev = data[i]['resid'].std()

    # Flag residuals outside mean ± 3 standard deviations for this grid
    lower = resid_mu - 3*resid_dev
    upper = resid_mu + 3*resid_dev

    anomaly = (data[i]['resid'] < lower) | (data[i]['resid'] > upper)
    print("========== grid number {} checked ==========".format(i+1))
    anomaly_frames.append(anomaly.to_frame())

# DataFrame.append was removed in pandas 2.0; concatenate the per-grid flags once
anomalies = pd.concat(anomaly_frames)


In [172]:
anomalies

In [173]:
df = pd.concat([df_, anomalies], axis=1).reset_index()
df["time"] = pd.to_datetime(df["time"])
df

In [174]:
df['resid'].value_counts()

In [175]:
fig, ax = plt.subplots(figsize=(30,10))

anomaly = df.loc[df['resid'] == True, ['internet_cdr']] 

ax.plot(df.index, df['internet_cdr'], color='blue', label = 'Normal')
ax.scatter(anomaly.index,anomaly['internet_cdr'], color='red', label = 'Anomaly')
ax.set_title('Total grids anomalies points for Decomposition Method ', fontsize=20)
plt.legend()
plt.show();

In [176]:
# fig1 = px.line(df,  y="internet_cdr")
# fig1.update_traces(line=dict(color = 'turquoise'))

# anomaly = df.loc[df['resid'] == True, ['internet_cdr']] 

# fig2 = px.scatter(anomaly,y='internet_cdr')

# fig3 = go.Figure(data=fig1.data + fig2.data)

# fig3.show()

In [177]:
anomaly

In [178]:
df.rename(columns = {"resid": "anomaly"}, inplace = True)
df

In [179]:
#df.to_csv("anomaly.csv")

## What is next?
After determining the anomalous points, what should we do about them? In common 
applications such as Microsoft's anomaly detection, a web application sends emails to the 
people concerned about the sudden, unexpected change, and they then take the right decision 
to resolve the issue. 

In our scope here, however, we are concerned with whether a point is an outlier or a point of interest. 

<img src="https://github.com/abdallah-elsawy/Time-Series-Anomaly-Detection/blob/main/Outputs/Time-series-outliers.png?raw=true">

So, if a point we detected as an outlier is data that we do not need, we apply some data 
cleaning to it. On the other hand, an unusual event is not necessarily something we do not 
need. Some events may be important to us because they might happen again in the future, so 
studying them gives us the ability to avoid such a sudden change later by handling it with 
a control process. 
Here, after studying those points, we set all anomalous points to NaN values so that they 
are handled in the coming part, missing value imputation. 
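
A minimal sketch of that last step, using plain Python lists rather than the notebook's DataFrame: the `mask_anomalies` helper and the sample readings are hypothetical, and in pandas the same effect would typically be achieved with a boolean mask and `.loc`.

```python
import math


def mask_anomalies(values, flags):
    """Replace every flagged value with NaN and leave the others untouched."""
    return [math.nan if flagged else value for value, flagged in zip(values, flags)]


# Hypothetical readings with one point flagged by a detector
internet_cdr = [120.0, 118.5, 950.0, 121.3]
is_anomaly = [False, False, True, False]

cleaned = mask_anomalies(internet_cdr, is_anomaly)
print(cleaned)
```

The NaN entries then mark exactly the positions that the imputation step has to fill.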