# Summary

This notebook is a simple proof of concept (PoC) that uses basic statistics to detect anomalies. Later phases would incorporate time series, machine learning, and deep learning methods for anomaly detection.


Overall Anomaly Detection 

- Goal: The AI/ML model will detect anomalies based on deviations from the data pattern of the last 30 days
- How it Works: Each anomaly will be given a score based on the attributes below:
    - The criticality of data
        - Minor
        - Major
        - Critical
    - Duration of anomaly
    - Frequency of anomaly in last 30 days
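
The three attributes above could be combined into a single score as in the following minimal sketch. The weights, the saturation caps, and the function name `anomaly_score` are hypothetical; the notebook does not define how criticality, duration, and frequency are weighted.

```python
# Hypothetical mapping: the notebook only names the three criticality levels.
CRITICALITY_WEIGHT = {"Minor": 1.0, "Major": 2.0, "Critical": 3.0}


def anomaly_score(criticality, duration_minutes, occurrences_last_30_days):
    """Combine criticality, duration, and 30-day frequency into a 0-100 score."""
    # Normalise each attribute to the 0..1 range; the caps are assumptions.
    crit = CRITICALITY_WEIGHT[criticality] / 3.0
    duration = min(duration_minutes / 60.0, 1.0)            # saturates after one hour
    frequency = min(occurrences_last_30_days / 30.0, 1.0)   # saturates at one per day
    # Assumed weighting: criticality 50 %, duration 30 %, frequency 20 %.
    return round(100 * (0.5 * crit + 0.3 * duration + 0.2 * frequency), 1)


print(anomaly_score("Critical", 30, 3))
print(anomaly_score("Minor", 0, 0))
```

Keeping the three components normalised before weighting makes it easy to tune the weights later without changing the score range.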
	
Each anomaly will be stored in a MySQL/NoSQL database with the details below:
- Anomaly id
- Anomaly first occurrence
- Anomaly Duration
- Anomaly Status
- Involved network parameters
- Impacted Customers
- Impacted machines
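
As a rough illustration, the fields listed above could be modelled as a record like the one below before it is written to the database. The class name `AnomalyRecord`, the field types, the status values, and the use of a `client_id` value as an impacted machine are all assumptions made for this sketch.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class AnomalyRecord:
    anomaly_id: int
    first_occurrence: datetime
    duration: timedelta
    status: str  # e.g. "open" or "resolved" (assumed values)
    network_parameters: List[str] = field(default_factory=list)
    impacted_customers: List[str] = field(default_factory=list)
    impacted_machines: List[str] = field(default_factory=list)


record = AnomalyRecord(
    anomaly_id=1,
    first_occurrence=datetime(2022, 3, 1, 14, 55, 27),
    duration=timedelta(minutes=10),
    status="open",
    network_parameters=["peak_upload_speed"],
    impacted_machines=["BETAZRPDCOR001"],
)

# A plain dict is convenient for inserting into either MySQL or a document store.
row = asdict(record)
print(row)
```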


## Preparation

### Connect Data

In [51]:
import numpy as np
import mysql.connector
import pandas as pd
from pandas_profiling import ProfileReport

# Initiate with Parameters
db_name= "core_stats"
col = "peak_upload_speed"


# Start Database Connection
db_connection = mysql.connector.connect(
    host="20.94.254.6",
    user="gyan",
    password="5Gaa$2022",
    database="gyan_db"
)

# Create Database Cursor for SQL Queries
mycursor = db_connection.cursor()
mycursor.execute("SELECT * FROM {} LIMIT 5".format(db_name))

myresult = mycursor.fetchall()
for x in myresult:
    print(x)

# Load data from database and store as pandas Dataframe
df = pd.read_sql('SELECT * FROM {}'.format(db_name), con=db_connection)
df.head()

('BETAZRPDCOR001', datetime.datetime(2022, 3, 1, 14, 55, 27), 2, '0', 45751, 0, 0, 0, 0, 0, 0, 1, 2, 33.3333)
('BETAZRPDCOR001', datetime.datetime(2022, 3, 1, 15, 16, 59), 2, '0', 45901, 0, 0, 0, 0, 0, 0, 1, 2, 33.3333)
('BETAZRPDCOR001', datetime.datetime(2022, 3, 1, 15, 18, 1), 2, '0', 45901, 0, 0, 0, 0, 0, 0, 1, 2, 33.3333)
('BETAZRPDCOR001', datetime.datetime(2022, 3, 1, 15, 24, 2), 2, '0', 45921, 0, 0, 0, 0, 0, 0, 1, 2, 33.3333)
('BETAZRPDCOR001', datetime.datetime(2022, 3, 1, 15, 30, 2), 2, '0', 45940, 0, 0, 0, 0, 0, 0, 1, 2, 33.3333)


Unnamed: 0,client_id,stats_timestamp,total_attached_user,total_rejected_user,peak_upload_speed,peak_download_speed,enodeb_shutdown_count,handover_failure_count,bearer_active_user_count,bearer_rejected_user_count,total_users,total_dropped_packets,enodeb_connected_count,enodeb_connection_status
0,BETAZRPDCOR001,2022-03-01 14:55:27,2,0,45751,0,0,0,0,0,0,1,2,33.3333
1,BETAZRPDCOR001,2022-03-01 15:16:59,2,0,45901,0,0,0,0,0,0,1,2,33.3333
2,BETAZRPDCOR001,2022-03-01 15:18:01,2,0,45901,0,0,0,0,0,0,1,2,33.3333
3,BETAZRPDCOR001,2022-03-01 15:24:02,2,0,45921,0,0,0,0,0,0,1,2,33.3333
4,BETAZRPDCOR001,2022-03-01 15:30:02,2,0,45940,0,0,0,0,0,0,1,2,33.3333



### Data Cleaning

In [52]:
# Load helper functions from a Jupyter notebook (could be changed to a .py module as well)
%run helper_functions.ipynb

In [53]:
getSummaryTable(df)

Unnamed: 0,Columns_Name,missing_rate,unique_count
0,client_id,0.0,2
1,stats_timestamp,0.0,5739
2,total_attached_user,0.0,4
3,total_rejected_user,0.0,1
4,peak_upload_speed,0.0,1450
5,peak_download_speed,0.0,1
6,enodeb_shutdown_count,0.0,1
7,handover_failure_count,0.0,6
8,bearer_active_user_count,0.0,9
9,bearer_rejected_user_count,0.0,1


In [54]:
Summary_table= getSummaryTable(df,True)

In [55]:
getDuplicateColumns(df)

Pair 0 :  peak_download_speed  |  enodeb_shutdown_count
Pair 1 :  peak_download_speed  |  bearer_rejected_user_count
Pair 2 :  enodeb_shutdown_count  |  bearer_rejected_user_count


['enodeb_shutdown_count', 'bearer_rejected_user_count']

In [56]:
temp=df

# Phase 1 - Starter Anomaly Detector

**How it Works?**
1. Summarize first, forecast later. 
    - Explain how a metric is identified as an anomalous data point compared to the last 30 days.

2. Anomaly Detector Category
    - Statistical (Outlier, Z-score)
    - Time Series (Prophet, Arima | Kalman Filter)
    - Machine Learning (Clustering, Classification: Isolation Forest |)
    - Deep Learning (LSTM, Autoencoder | Clockwork RNN, Depth Gated RNN)


**How it's different from the Final Product**
1. Not enough data.
    - As an adjustment, I'll use a 7-day moving window instead of 30 days for the PoC.
    - e.g. Which data points are anomalies compared to the historical 7 (30) days?
    

2. Not enough columns.
    - We simply want to focus on the most important column first.
    - However, the function is designed to apply to more columns of the same type.
    
    
3. The focus is on an automated process rather than on quality.

4. Individual labels or scores instead of an overall score.

5. Keep comparing with existing third-party anomaly detection tools like Anodot.
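
The 7-day moving window mentioned in item 1 could be sketched with a time-based rolling window, as below. This assumes the series has a sorted `DatetimeIndex` and that the window includes the current point; the function name `rolling_zscore` and the toy data are illustrative and not part of the notebook's helpers.

```python
import numpy as np
import pandas as pd


def rolling_zscore(series, window="7D"):
    """Z-score of each point against the trailing time window ending at that point."""
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    # A zero standard deviation (constant window) would divide by zero, so mask it.
    return (series - mean) / std.replace(0, np.nan)


idx = pd.date_range("2022-03-01", periods=10, freq="D")
values = pd.Series([10.0] * 9 + [100.0], index=idx)
z = rolling_zscore(values)
print(z.tail(3))
```

Switching to the 30-day window of the final product would only require passing `window="30D"`.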


## Stats Calculations

- Mean
- Median
- IQR
- Outlier: x > Q3 + 1.5 × IQR or x < Q1 - 1.5 × IQR, where IQR = Q3 - Q1
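
The outlier rule above can be sketched in a few lines of NumPy. The function name `iqr_outlier_labels` is hypothetical; the `iqr_factor` parameter mirrors the keyword used when the notebook calls its own `add_outlier_column` helper, whose implementation is not shown.

```python
import numpy as np


def iqr_outlier_labels(values, iqr_factor=1.5):
    """Return 1 for values outside [Q1 - k*IQR, Q3 + k*IQR], else 0."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - iqr_factor * iqr
    upper = q3 + iqr_factor * iqr
    return ((values < lower) | (values > upper)).astype(int)


labels = iqr_outlier_labels([10, 11, 12, 13, 14, 100])
print(labels)
```

As a worked example, with Q1 = 55583 and Q3 = 72702 (the quartiles that `describe()` reports for `peak_upload_speed` in this notebook), IQR = 17119 and the fences are 55583 - 25678.5 = 29904.5 and 72702 + 25678.5 = 98380.5.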


**Business Explanation** 
We want to identify big jumps, up or down, in upload or download speed.

When a jump happens, we send an alert.


**Column of interest:**  
- Peak upload speed (`peak_upload_speed`)

### Basic Summary Stats

In [57]:
col= "peak_upload_speed"

df[col].describe()

count      5739.000000
mean      61499.402858
std       15739.465869
min          34.000000
25%       55583.000000
50%       69499.000000
75%       72702.000000
max      261251.000000
Name: peak_upload_speed, dtype: float64

### Z - Score

- Function Details
    - Input: dataframe, one column
    - Output:
        - One Score Column
            - e.g. Z-score (roughly -3 to 3) -> adjusted Z-score (0 to 1)
        - One Label Column
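
Since `add_Z_score_column` is defined in the helper notebook and not shown, here is a minimal sketch of a function with the same inputs and outputs. The column names follow the ones that appear in this notebook's output tables; the name `add_z_score_column_sketch`, the rounding to two decimals, and the use of the sample standard deviation are assumptions.

```python
import pandas as pd


def add_z_score_column_sketch(df, col, z_score_threshold=3):
    """Return a copy of df with a rounded Z-score column and a 0/1 label column."""
    out = df.copy()
    std = out[col].std()  # sample standard deviation (pandas default, ddof=1)
    score_col = "Z-score_" + col
    if std == 0:
        # A constant column has no spread, so no point deviates from the mean.
        out[score_col] = 0.0
    else:
        out[score_col] = ((out[col] - out[col].mean()) / std).round(2)
    # Label 1 when the score lies outside [-threshold, threshold], else 0.
    out["label_Z-score_" + col] = (out[score_col].abs() > z_score_threshold).astype(int)
    return out


toy = pd.DataFrame({"peak_upload_speed": [50.0] * 20 + [500.0]})
result = add_z_score_column_sketch(toy, "peak_upload_speed")
print(result.tail(2))
```

As a check against the notebook's numbers: with the mean (61499.40) and standard deviation (15739.47) reported by `describe()`, a value of 45751 gives (45751 - 61499.40) / 15739.47 ≈ -1.00, which agrees with the first row of the Z-score output in this section.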
        
    

In [58]:
add_Z_score_column(df,col)

Unnamed: 0,client_id,stats_timestamp,total_attached_user,total_rejected_user,peak_upload_speed,peak_download_speed,enodeb_shutdown_count,handover_failure_count,bearer_active_user_count,bearer_rejected_user_count,total_users,total_dropped_packets,enodeb_connected_count,enodeb_connection_status,Z-score_peak_upload_speed,label_Z-score_peak_upload_speed
0,BETAZRPDCOR001,2022-03-01 14:55:27,2,0,45751,0,0,0,0,0,0,1,2,33.3333,-1.00,0
1,BETAZRPDCOR001,2022-03-01 15:16:59,2,0,45901,0,0,0,0,0,0,1,2,33.3333,-0.99,0
2,BETAZRPDCOR001,2022-03-01 15:18:01,2,0,45901,0,0,0,0,0,0,1,2,33.3333,-0.99,0
3,BETAZRPDCOR001,2022-03-01 15:24:02,2,0,45921,0,0,0,0,0,0,1,2,33.3333,-0.99,0
4,BETAZRPDCOR001,2022-03-01 15:30:02,2,0,45940,0,0,0,0,0,0,1,2,33.3333,-0.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5734,BETBELPDCOR001,2022-03-22 15:00:03,0,0,44221,0,0,0,0,0,0,0,3,0.0000,-1.10,0
5735,BETBELPDCOR001,2022-03-22 15:05:03,0,0,72702,0,0,0,0,0,0,0,3,0.0000,0.71,0
5736,BETBELPDCOR001,2022-03-22 15:10:03,0,0,72702,0,0,0,0,0,0,0,3,0.0000,0.71,0
5737,BETBELPDCOR001,2022-03-22 15:15:03,0,0,72702,0,0,0,0,0,0,0,3,0.0000,0.71,0


### Outlier

In [59]:
add_outlier_column(df,col)

Unnamed: 0,client_id,stats_timestamp,total_attached_user,total_rejected_user,peak_upload_speed,peak_download_speed,enodeb_shutdown_count,handover_failure_count,bearer_active_user_count,bearer_rejected_user_count,total_users,total_dropped_packets,enodeb_connected_count,enodeb_connection_status,Z-score_peak_upload_speed,label_Z-score_peak_upload_speed,label_outlier_peak_upload_speed
0,BETAZRPDCOR001,2022-03-01 14:55:27,2,0,45751,0,0,0,0,0,0,1,2,33.3333,-1.00,0,0
1,BETAZRPDCOR001,2022-03-01 15:16:59,2,0,45901,0,0,0,0,0,0,1,2,33.3333,-0.99,0,0
2,BETAZRPDCOR001,2022-03-01 15:18:01,2,0,45901,0,0,0,0,0,0,1,2,33.3333,-0.99,0,0
3,BETAZRPDCOR001,2022-03-01 15:24:02,2,0,45921,0,0,0,0,0,0,1,2,33.3333,-0.99,0,0
4,BETAZRPDCOR001,2022-03-01 15:30:02,2,0,45940,0,0,0,0,0,0,1,2,33.3333,-0.99,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5734,BETBELPDCOR001,2022-03-22 15:00:03,0,0,44221,0,0,0,0,0,0,0,3,0.0000,-1.10,0,0
5735,BETBELPDCOR001,2022-03-22 15:05:03,0,0,72702,0,0,0,0,0,0,0,3,0.0000,0.71,0,0
5736,BETBELPDCOR001,2022-03-22 15:10:03,0,0,72702,0,0,0,0,0,0,0,3,0.0000,0.71,0,0
5737,BETBELPDCOR001,2022-03-22 15:15:03,0,0,72702,0,0,0,0,0,0,0,3,0.0000,0.71,0,0


## Automation

### Apply Stats Functions to All Columns in One Database

In [60]:
keeped_column_name = list(Summary_table["Columns_Name"])
print("We keep columns: ", keeped_column_name,"\n")
keeped_column_name.remove("client_id")


We keep columns:  ['client_id', 'total_attached_user', 'peak_upload_speed', 'handover_failure_count', 'bearer_active_user_count', 'total_users', 'total_dropped_packets', 'enodeb_connected_count', 'enodeb_connection_status'] 



['BETAZRPDCOR001', 'BETBELPDCOR001']

In [101]:
# Create a new table to store all the computed metrics
Stats_summary_core = df
client_id_list= list(df.client_id.unique())
Stats_summary_core_new= pd.DataFrame()

for clientID in client_id_list:
    temp_df = Stats_summary_core[Stats_summary_core["client_id"]==clientID]
    
    for col in keeped_column_name:

        temp_df=add_Z_score_column(temp_df,col)
        temp_df=add_outlier_column(temp_df,col)
    

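    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    Stats_summary_core_new = pd.concat([Stats_summary_core_new, temp_df], ignore_index=True)

# Collect label columns from the combined table, where the per-client loop added them
filter_col = [col for col in Stats_summary_core_new if col.startswith('label')]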

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col_name]=Z_score_list
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["label_Z-score_"+col] = [ 1 if (x< -1*z_score_threshold or x>1*z_score_threshold) else 0 for x in df[col_name]]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["label_outlier_"+col]  = outlier_label


In [103]:
Stats_summary_core= Stats_summary_core_new

In [104]:
# Find percentage of labels that's been marked as anomalies, 0.01 means 1%
np.round(Stats_summary_core[filter_col].sum()/Stats_summary_core.shape[0],2)

label_Z-score_peak_upload_speed           0.05
label_outlier_peak_upload_speed           0.24
label_Z-score_total_attached_user         0.00
label_outlier_total_attached_user         0.00
label_Z-score_handover_failure_count      0.01
label_outlier_handover_failure_count      0.01
label_Z-score_bearer_active_user_count    0.01
label_outlier_bearer_active_user_count    0.01
label_Z-score_total_users                 0.03
label_outlier_total_users                 0.03
label_Z-score_total_dropped_packets       0.00
label_outlier_total_dropped_packets       0.10
label_Z-score_enodeb_connected_count      0.00
label_outlier_enodeb_connected_count      0.10
label_Z-score_enodeb_connection_status    0.00
label_outlier_enodeb_connection_status    0.10
dtype: float64

### Export to CSVs

In [105]:
export_full_option=False
if export_full_option == True:
    Anomaly_summary = Stats_summary_core[Stats_summary_core[filter_col].sum(axis=1)>=1].reset_index(drop=True)
    Anomaly_summary.to_csv("Anomaly_summary.csv")

In [106]:
import os.path
# folder_path = "C:\Users\Jijun Du\Desktop\Main_Anomaly_Detection\Generated_csv\"

for col in keeped_column_name:
    
    # Keep the rows flagged by at least one anomaly label for this column
    condition= (Stats_summary_core["label_Z-score_"+ col]!=0) | (Stats_summary_core["label_outlier_"+col]!=0)
    
    
    subset_columns=["client_id","stats_timestamp",col,"label_Z-score_"+col,"label_outlier_"+col]
    print("{} rows of anomaly detected for column {}".format(sum(condition),col))
    
    subset_Summary= Stats_summary_core[condition][subset_columns]
    
    csv_filename = "Anomaly_{}_summary.csv".format(col)
    
    if os.path.exists(csv_filename):
        print("File {} already generated".format(csv_filename))
        pass
    else:
        subset_Summary.to_csv(csv_filename,index=False)

    print("Done for:", col,"\n")
    
    

0 rows of anomaly detected for column total_attached_user
File Anomaly_total_attached_user_summary.csv already generated
Done for: total_attached_user 

1391 rows of anomaly detected for column peak_upload_speed
File Anomaly_peak_upload_speed_summary.csv already generated
Done for: peak_upload_speed 

32 rows of anomaly detected for column handover_failure_count
File Anomaly_handover_failure_count_summary.csv already generated
Done for: handover_failure_count 

47 rows of anomaly detected for column bearer_active_user_count
File Anomaly_bearer_active_user_count_summary.csv already generated
Done for: bearer_active_user_count 

155 rows of anomaly detected for column total_users
File Anomaly_total_users_summary.csv already generated
Done for: total_users 

575 rows of anomaly detected for column total_dropped_packets
File Anomaly_total_dropped_packets_summary.csv already generated
Done for: total_dropped_packets 

575 rows of anomaly detected for column enodeb_connected_count
File Anoma

In [107]:
export_anomaly_df= pd.DataFrame()

In [108]:
# folder_path = "C:\Users\Jijun Du\Desktop\Main_Anomaly_Detection\Generated_csv\"
export_anomaly_df= pd.DataFrame()

for col in keeped_column_name:

    # Keep the rows flagged by at least one anomaly label for this column
    condition= (Stats_summary_core["label_Z-score_"+ col]!=0) | (Stats_summary_core["label_outlier_"+col]!=0)
    
    
    subset_columns=["client_id","stats_timestamp",col,"label_Z-score_"+col,"label_outlier_"+col]
    print("{} rows of anomaly detected for column {}".format(sum(condition),col))
    
    subset_Summary= Stats_summary_core[condition][subset_columns]
    
    export_anomaly_df

    print("Done for:", col,"\n")
    
    

0 rows of anomaly detected for column total_attached_user
Done for: total_attached_user 

1391 rows of anomaly detected for column peak_upload_speed
Done for: peak_upload_speed 

32 rows of anomaly detected for column handover_failure_count
Done for: handover_failure_count 

47 rows of anomaly detected for column bearer_active_user_count
Done for: bearer_active_user_count 

155 rows of anomaly detected for column total_users
Done for: total_users 

575 rows of anomaly detected for column total_dropped_packets
Done for: total_dropped_packets 

575 rows of anomaly detected for column enodeb_connected_count
Done for: enodeb_connected_count 

575 rows of anomaly detected for column enodeb_connection_status
Done for: enodeb_connection_status 



In [109]:
def reorder_columns(dataframe, col_name, position):
    """Reorder a dataframe's column.
    Args:
        dataframe (pd.DataFrame): dataframe to use
        col_name (string): column name to move
        position (0-indexed position): where to relocate column to
    Returns:
        pd.DataFrame: re-assigned dataframe
    """
    temp_col = dataframe[col_name]
    dataframe = dataframe.drop(columns=[col_name])
    dataframe.insert(loc=position, column=col_name, value=temp_col)
    return dataframe

In [110]:
# For each column, keep the rows flagged by at least one anomaly label

for col in keeped_column_name:
    condition= (Stats_summary_core["label_Z-score_"+ col]!=0) | (Stats_summary_core["label_outlier_"+col]!=0)


    subset_columns=["client_id","stats_timestamp",col,"label_Z-score_"+col,"label_outlier_"+col]
    print("{} rows of anomaly detected for column {}".format(sum(condition),col))

    # .copy() so the new column below is not written to a slice of Stats_summary_core
    subset_Summary= Stats_summary_core[condition][subset_columns].copy()
    subset_Summary["Attribute_Name"] = col
    subset_Summary = reorder_columns(subset_Summary,"Attribute_Name",2)
    subset_Summary=subset_Summary.rename(columns={str(col):"Attribute_Value","label_Z-score_"+col: "Attribute_Label_Z_Score", "label_outlier_"+col: "Attribute_Label_Outlier"})
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    export_anomaly_df = pd.concat([export_anomaly_df, subset_Summary], ignore_index=True)


0 rows of anomaly detected for column total_attached_user
1391 rows of anomaly detected for column peak_upload_speed
32 rows of anomaly detected for column handover_failure_count
47 rows of anomaly detected for column bearer_active_user_count
155 rows of anomaly detected for column total_users
575 rows of anomaly detected for column total_dropped_packets
575 rows of anomaly detected for column enodeb_connected_count
575 rows of anomaly detected for column enodeb_connection_status


In [111]:
export_anomaly_df.to_csv("export_anomaly_df.csv")

### Anomaly Per Day

In [127]:
Anomaly_summary=Stats_summary_core

In [128]:
ans_time_delta= max(df.stats_timestamp)-min(df.stats_timestamp)
print("Q: What's the number of days for data we have?","\nA:",ans_time_delta)

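num_days = ans_time_delta.days
# Note: Anomaly_summary is the full Stats_summary_core table here, so this ratio counts
# all records per day, not only the rows flagged by at least one label.
ans_avg_anomaly_by_day = np.round(Anomaly_summary.shape[0]/num_days,1)
print("Q: How many records per day is classified as anomaly by at least one label?","\nA:",ans_avg_anomaly_by_day)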


for col in keeped_column_name:
    condition= (Stats_summary_core["label_Z-score_"+ col]!=0) | (Stats_summary_core["label_outlier_"+col]!=0)    
    subset_columns=["client_id","stats_timestamp",col,"label_Z-score_"+col,"label_outlier_"+col]
    subset_Summary= Stats_summary_core[condition][subset_columns]
    
    ans_col_avg_anomaly_by_day= np.round(subset_Summary.shape[0] / num_days,2)
    #print("\n")
    print(col)
    print("Q: How many Total Anomaly Occurs per day?","\nA:",ans_col_avg_anomaly_by_day)


Q: What's the number of days for data we have? 
A: 21 days 00:24:35
Q: How many records per day is classified as anomaly by at least one label? 
A: 273.3
total_attached_user
Q: How many Total Anomaly Occurs per day? 
A: 0.0
peak_upload_speed
Q: How many Total Anomaly Occurs per day? 
A: 66.24
handover_failure_count
Q: How many Total Anomaly Occurs per day? 
A: 1.52
bearer_active_user_count
Q: How many Total Anomaly Occurs per day? 
A: 2.24
total_users
Q: How many Total Anomaly Occurs per day? 
A: 7.38
total_dropped_packets
Q: How many Total Anomaly Occurs per day? 
A: 27.38
enodeb_connected_count
Q: How many Total Anomaly Occurs per day? 
A: 27.38
enodeb_connection_status
Q: How many Total Anomaly Occurs per day? 
A: 27.38
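
The figures above are averages obtained by dividing a total by the number of days. A per-day breakdown, counting only rows flagged by at least one label, could be sketched as follows. The helper name `anomalies_per_day` and the toy data are illustrative; in the notebook the inputs would be `Stats_summary_core` and `filter_col`.

```python
import pandas as pd


def anomalies_per_day(df, label_cols, time_col="stats_timestamp"):
    """Count rows per calendar day that carry at least one anomaly label."""
    flagged = df[label_cols].sum(axis=1) >= 1
    days = pd.to_datetime(df[time_col]).dt.date
    # Summing booleans per day gives the number of flagged rows on that day.
    return flagged.groupby(days).sum()


toy = pd.DataFrame({
    "stats_timestamp": [
        "2022-03-01 14:55:27",
        "2022-03-01 15:16:59",
        "2022-03-02 00:05:00",
        "2022-03-02 00:10:00",
    ],
    "label_a": [1, 0, 0, 1],
    "label_b": [1, 0, 1, 0],
})
per_day = anomalies_per_day(toy, ["label_a", "label_b"])
print(per_day)
```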


### Additional Summary 
- Anomaly for which Column
- Client
- Anomaly Time

**Data Structure**

### Automate Process for Anomaly Experimentation

In [130]:
def experiment_Stats(z_score=3,iqr=1.5, num_days=14):

    Stats_summary_core = df
    for col in keeped_column_name:

        Stats_summary_core=add_Z_score_column(Stats_summary_core,col,z_score_threshold=z_score)
        Stats_summary_core=add_outlier_column(Stats_summary_core,col, iqr_factor=iqr)
    
    print("Anomaly Detected Record per Day")
    print(np.round(Stats_summary_core[filter_col].sum()/num_days,2))
    print("\n")
    print("Anomaly Percentage")
    print(np.round(Stats_summary_core[filter_col].sum()/Stats_summary_core.shape[0],2))


In [131]:
experiment_Stats(4,2)

Anomaly Detected Record per Day
label_Z-score_peak_upload_speed           18.36
label_outlier_peak_upload_speed           17.93
label_Z-score_total_attached_user          2.57
label_outlier_total_attached_user         94.36
label_Z-score_handover_failure_count       2.29
label_outlier_handover_failure_count       2.29
label_Z-score_bearer_active_user_count     2.29
label_outlier_bearer_active_user_count     3.36
label_Z-score_total_users                 11.07
label_outlier_total_users                 11.07
label_Z-score_total_dropped_packets        0.00
label_outlier_total_dropped_packets        0.00
label_Z-score_enodeb_connected_count       0.00
label_outlier_enodeb_connected_count       0.00
label_Z-score_enodeb_connection_status     0.00
label_outlier_enodeb_connection_status     0.00
dtype: float64


Anomaly Percentage
label_Z-score_peak_upload_speed           0.04
label_outlier_peak_upload_speed           0.04
label_Z-score_total_attached_user         0.01
label_outlier_total_att

In [132]:
filter_col_Z = [col for col in Stats_summary_core if col.startswith('label_Z')]
Stats_summary_core[filter_col_Z].head(3)

Unnamed: 0,label_Z-score_peak_upload_speed,label_Z-score_total_attached_user,label_Z-score_handover_failure_count,label_Z-score_bearer_active_user_count,label_Z-score_total_users,label_Z-score_total_dropped_packets,label_Z-score_enodeb_connected_count,label_Z-score_enodeb_connection_status
0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0


In [133]:
Stats_summary_core.head(3)

Unnamed: 0,client_id,stats_timestamp,total_attached_user,total_rejected_user,peak_upload_speed,peak_download_speed,enodeb_shutdown_count,handover_failure_count,bearer_active_user_count,bearer_rejected_user_count,...,label_outlier_total_users,Z-score_total_dropped_packets,label_Z-score_total_dropped_packets,label_outlier_total_dropped_packets,Z-score_enodeb_connected_count,label_Z-score_enodeb_connected_count,label_outlier_enodeb_connected_count,Z-score_enodeb_connection_status,label_Z-score_enodeb_connection_status,label_outlier_enodeb_connection_status
0,BETAZRPDCOR001,2022-03-01 14:55:27,2,0,45751,0,0,0,0,0,...,0,0.5,0,0,-0.5,0,0,0.5,0,0
1,BETAZRPDCOR001,2022-03-01 15:16:59,2,0,45901,0,0,0,0,0,...,0,0.5,0,0,-0.5,0,0,0.5,0,0
2,BETAZRPDCOR001,2022-03-01 15:18:01,2,0,45901,0,0,0,0,0,...,0,0.5,0,0,-0.5,0,0,0.5,0,0


In [118]:
matrix_names= filter_col_Z

In [119]:
Stats_summary_matrix = Stats_summary_core[filter_col_Z].to_numpy()

In [120]:
filter_col

['label_Z-score_peak_upload_speed',
 'label_outlier_peak_upload_speed',
 'label_Z-score_total_attached_user',
 'label_outlier_total_attached_user',
 'label_Z-score_handover_failure_count',
 'label_outlier_handover_failure_count',
 'label_Z-score_bearer_active_user_count',
 'label_outlier_bearer_active_user_count',
 'label_Z-score_total_users',
 'label_outlier_total_users',
 'label_Z-score_total_dropped_packets',
 'label_outlier_total_dropped_packets',
 'label_Z-score_enodeb_connected_count',
 'label_outlier_enodeb_connected_count',
 'label_Z-score_enodeb_connection_status',
 'label_outlier_enodeb_connection_status']

### Scoring for Different Types of Anomalies

In [121]:
Stats_summary_df= pd.DataFrame(Stats_summary_matrix*[12,13,14,14,13,11,14,11])

In [122]:
score_list = list(range(0,110,10))
value_list= []

In [123]:
for score in score_list:
    v= (Stats_summary_df.sum(axis=1)>score).sum()
    value_list.append(v)
    
value_list

[441, 441, 34, 32, 32, 1, 0, 0, 0, 0, 0]
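
The scoring cells above multiply each 0/1 label column by a weight, sum the result per row, and count how many rows exceed each threshold. The sketch below reproduces that logic on a toy label matrix. It assumes each weight corresponds positionally to one Z-score label column; the meaning of the individual weights is not explained in the notebook.

```python
import numpy as np

# Toy 0/1 label matrix: one row per record, one column per Z-score label.
labels = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
])
weights = np.array([12, 13, 14, 14, 13, 11, 14, 11])  # same weights as the cell above

# A matrix-vector product is equivalent to multiplying column-wise and summing each row.
scores = labels @ weights

thresholds = np.arange(0, 110, 10)
rows_above = [int((scores > t).sum()) for t in thresholds]
print(scores)
print(rows_above)
```

The third toy row mirrors the pattern in the output that follows, where three flagged columns weighted 14, 14, and 13 add up to a score of 41.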

In [124]:
Stats_summary_df.loc[Stats_summary_df.sum(axis=1)>10,]

Unnamed: 0,0,1,2,3,4,5,6,7
20,0,0,14,14,13,0,0,0
21,0,0,14,14,13,0,0,0
22,0,0,14,14,13,0,0,0
23,0,0,14,14,13,0,0,0
24,0,0,14,14,13,0,0,0
...,...,...,...,...,...,...,...,...
5629,12,0,0,0,0,0,0,0
5661,12,0,0,0,0,0,0,0
5665,12,0,0,0,0,0,0,0
5693,12,0,0,0,0,0,0,0


In [125]:
Stats_summary_df.loc[Stats_summary_df.sum(axis=1)>40,].head(3)

Unnamed: 0,0,1,2,3,4,5,6,7
20,0,0,14,14,13,0,0,0
21,0,0,14,14,13,0,0,0
22,0,0,14,14,13,0,0,0


In [126]:
Stats_summary_df.loc[Stats_summary_df.sum(axis=1)>30,].head(3)

Unnamed: 0,0,1,2,3,4,5,6,7
20,0,0,14,14,13,0,0,0
21,0,0,14,14,13,0,0,0
22,0,0,14,14,13,0,0,0


In [201]:
theory_data_point = round(24*60/5)

In [202]:
df.shape[0]/14*0.01

2.8985714285714286

In [203]:
24*60/5*0.01

2.88

### Create Database and Insert Core stats

In [73]:
new_db= db_name+"_Anomaly_Summary"
# mycursor.execute("CREATE DATABASE {}".format(new_db))


# Phase 2: Business & Analytics and Automation

# Phase 3: Advanced Anomaly Detection Methods

## Time Series

## Functions to Explore

In [None]:
#https://github.com/Vicam/Unsupervised_Anomaly_Detection/blob/master/custom_function.py
#from pyemma import msm
import pandas as pd
import numpy as np

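# return a Series with the distance between each point and its assigned (closest) centroid
def getDistanceByPoint(data, model):
    # Series.set_value was removed from pandas, so collect the distances in a list first
    distances = []
    for i in range(0,len(data)):
        Xa = np.array(data.iloc[i])
        # labels_ holds 0-based indices into cluster_centers_, so no -1 offset is needed
        Xb = model.cluster_centers_[model.labels_[i]]
        distances.append(np.linalg.norm(Xa-Xb))
    return pd.Series(distances)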

# train markov model to get transition matrix
# def getTransitionMatrix (df):
#     df = np.array(df)
#     model = msm.estimate_markov_model(df, 1)
#     return model.transition_matrix

# return the success probability of the state change 
def successProbabilityMetric(state1, state2, transition_matrix):
    proba = 0
    for k in range(0,len(transition_matrix)):
        if (k != (state2-1)):
            proba += transition_matrix[state1-1][k]
    return 1-proba

# return the success probability of the whole sequence
def sucessScore(sequence, transition_matrix):
    proba = 0 
    for i in range(1,len(sequence)):
        if(i == 1):
            proba = successProbabilityMetric(sequence[i-1], sequence[i], transition_matrix)
        else:
            proba = proba*successProbabilityMetric(sequence[i-1], sequence[i], transition_matrix)
    return proba

# return if the sequence is an anomaly considering a threshold
def anomalyElement(sequence, threshold, transition_matrix):
    if (sucessScore(sequence, transition_matrix) > threshold):
        return 0
    else:
        return 1

# return a list of anomaly labels (0/1) for the whole dataset,
# choosing a sliding window size (size of the sequence to evaluate) and a threshold
# NOTE: relies on getTransitionMatrix, which is commented out above (it needs pyemma)
def markovAnomaly(df, windows_size, threshold):
    transition_matrix = getTransitionMatrix(df)
    real_threshold = threshold**windows_size
    df_anomaly = []
    for j in range(0, len(df)):
        if (j < windows_size):
            df_anomaly.append(0)
        else:
            sequence = df[j-windows_size:j]
            sequence = sequence.reset_index(drop=True)
            df_anomaly.append(anomalyElement(sequence, real_threshold, transition_matrix))
    return df_anomaly
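
`markovAnomaly` calls `getTransitionMatrix`, which is commented out because it depends on `pyemma`. A simple count-based estimate of the transition matrix could stand in for exploration purposes, as sketched below. This is not the estimator that `pyemma` implements; the function name `estimate_transition_matrix`, the `n_states` parameter, and the 1-based state encoding (inferred from the `state1-1` indexing above) are assumptions.

```python
import numpy as np


def estimate_transition_matrix(states, n_states):
    """Row-normalised transition counts for a sequence of 1-based integer states."""
    counts = np.zeros((n_states, n_states))
    for current, nxt in zip(states[:-1], states[1:]):
        counts[current - 1, nxt - 1] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


matrix = estimate_transition_matrix([1, 1, 2, 1, 2, 2], n_states=2)
print(matrix)
```

Each row of the result gives the empirical probability of moving from one state to every other state, which is the shape `successProbabilityMetric` expects.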

In [84]:
import numpy as np
import matplotlib.pyplot as plt


# scale and shift standard-normal samples to get sample values (mean 20, standard deviation 20)
data = np.random.randn(50000)  * 20 + 20

# Function to detect outliers in one-dimensional datasets.
def find_anomalies(data):
    # define a list to accumulate anomalies
    anomalies = []
    
    # Set upper and lower limits to 3 standard deviations from the mean
    data_std = np.std(data)
    data_mean = np.mean(data)
    anomaly_cut_off = data_std * 3
    
    lower_limit  = data_mean - anomaly_cut_off 
    upper_limit = data_mean + anomaly_cut_off
    print(lower_limit)
    # Collect the values that fall outside the limits
    
    for outlier in data:
        if outlier > upper_limit or outlier < lower_limit:
            anomalies.append(outlier)
    return anomalies

find_anomalies(data)

-39.82363313253761


[87.05098068073669,
 -44.50153689442895,
 83.06030704640155,
 -66.2397851634998,
 80.32570363471504,
 -44.570131314635844,
 84.44012343151162,
 -42.42153550973904,
 88.7214983083458,
 87.13051025606175,
 85.49807199793064,
 -54.02755858804922,
 82.48852319426348,
 -44.65543350512681,
 -43.18472806447021,
 -40.29403314504932,
 -42.919128747846045,
 -44.897535595540106,
 -40.868015429640685,
 -42.798424993729675,
 83.40701363312897,
 80.33833211379071,
 84.1114087762238,
 83.0120430284039,
 97.62419386997476,
 -60.50883734702492,
 103.69500796372314,
 -44.29664474040774,
 -44.853050446629524,
 -42.1225677943949,
 -41.14645771764991,
 98.35518352956501,
 85.09872804886461,
 -42.60710742245901,
 -40.64859337285825,
 -52.88032113057588,
 81.26374224411121,
 80.20948083017932,
 81.40675098220848,
 -39.93653210554128,
 -47.27608096626108,
 96.36570823350168,
 -47.61081739694775,
 -54.47613329820615,
 -44.62147013738333,
 83.28833528321165,
 -71.52825426730372,
 -59.13533470356104,
 99.6206970