# Project Introduction: Univariate Anomaly Detection Using Machine Learning Methods

This project covers univariate anomaly detection: identifying unusual patterns or outliers in a single-variable dataset. The notebook implements an end-to-end anomaly detection pipeline in Python, covering data preprocessing, model fitting, and visual inspection of the results.

I explore three machine learning techniques: Isolation Forest, One-Class SVM, and Kernel Density Estimation. Each takes a different approach to detecting anomalies, and this notebook demonstrates how they apply to univariate data. Whether the goal is detecting fraud, monitoring system performance, or ensuring data quality, the project aims to be a practical, hands-on guide to univariate anomaly detection in Python.

Link to dataset: https://www.kaggle.com/datasets/julienjta/twitter-mentions-volumes

# ML Based Anomaly Detection
### 1. Isolation Forest
### 2. OneClassSVM
### 3. Kernel Density

In [40]:
# Import Libraries

import pandas as pd
import numpy as np
from numpy import quantile

# Anomaly detection models: IsolationForest, KernelDensity, and OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import KernelDensity
from sklearn.svm import OneClassSVM

# plotly for graphical representations
import plotly.express as px

import warnings
warnings.filterwarnings("ignore")

In [41]:
# check for dataset in colab files directory
import os
os.listdir()

['.config', 'company_twitter_mentions_dataset.csv', 'sample_data']

In [42]:
# Read the data from csv file. We will be looking at Tweets for Amazon
DIR_PATH = "company_twitter_mentions_dataset.csv"
# Twitter mentions only for Amazon
company = "Amazon"

# read data
df = pd.read_csv(DIR_PATH, usecols=["timestamp", company])

In [43]:
df.head()

             timestamp  Amazon
0  2015-02-26 21:42:53    57.0
1  2015-02-26 21:47:53    43.0
2  2015-02-26 21:52:53    55.0
3  2015-02-26 21:57:53    64.0
4  2015-02-26 22:02:53    93.0


# Data Preprocessing

In [44]:
# Check for datatype of columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15902 entries, 0 to 15901
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   timestamp  15902 non-null  object 
 1   Amazon     15831 non-null  float64
dtypes: float64(1), object(1)
memory usage: 248.6+ KB


In [45]:
# check for null values
df.isnull().sum()

timestamp     0
Amazon       71
dtype: int64


In [46]:
# fill Null values and convert column to int
df['Amazon'] = df['Amazon'].fillna(0).astype(int)

In [47]:
df.head()

             timestamp  Amazon
0  2015-02-26 21:42:53      57
1  2015-02-26 21:47:53      43
2  2015-02-26 21:52:53      55
3  2015-02-26 21:57:53      64
4  2015-02-26 22:02:53      93


In [48]:
df.isnull().sum()

timestamp    0
Amazon       0
dtype: int64


In [49]:
# Truncate each timestamp to the hour so that all mentions within the same hour can be grouped together
df['timestamp'] = df["timestamp"].map(lambda time: f"{time.split(':')[0]}:00:00")

In [50]:
df.head()

             timestamp  Amazon
0  2015-02-26 21:00:00      57
1  2015-02-26 21:00:00      43
2  2015-02-26 21:00:00      55
3  2015-02-26 21:00:00      64
4  2015-02-26 22:00:00      93


In [51]:
# convert timestamp to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15902 entries, 0 to 15901
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   timestamp  15902 non-null  datetime64[ns]
 1   Amazon     15902 non-null  int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 248.6 KB
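
The truncation and parsing steps can also be done in the opposite order: parse the column first, then round each timestamp down to the hour. The following is a minimal sketch on a small hand-made sample (the `sample` frame and its variable names are illustrative and not part of the notebook). It assumes the timestamps use the `YYYY-MM-DD HH:MM:SS` format shown in the `df.head()` output.

```python
import pandas as pd

# Tiny illustrative sample in the same format as the dataset
sample = pd.DataFrame({
    "timestamp": ["2015-02-26 21:42:53", "2015-02-26 21:47:53", "2015-02-26 22:02:53"],
    "Amazon": [57, 43, 93],
})

# String-based truncation followed by parsing, as in the cells above
by_string = pd.to_datetime(
    sample["timestamp"].map(lambda t: f"{t.split(':')[0]}:00:00")
)

# Alternative: parse first, then floor each timestamp to the start of its hour
by_floor = pd.to_datetime(sample["timestamp"]).dt.floor("h")

# Both approaches should yield the same hourly timestamps
print((by_string == by_floor).all())
```

Flooring a parsed datetime does not depend on where the first colon sits in the string, so it keeps working if the timestamp format changes.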


In [53]:
# group and aggregate tweet mentions by hour
df = df.groupby(by='timestamp', as_index=False).sum()

In [54]:
df.head()

             timestamp  Amazon
0  2015-02-26 21:00:00     219
1  2015-02-26 22:00:00     931
2  2015-02-26 23:00:00     568
3  2015-02-27 00:00:00     516
4  2015-02-27 01:00:00     574
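
The zero fill applied earlier interacts with this hourly sum: `groupby(...).sum()` skips missing values by default, so an hour with a missing 5-minute reading gets a lower total either way. The sketch below uses made-up numbers (not taken from the dataset) to show that filling with 0 and then summing gives the same totals as summing the raw values. The `n_missing` column is a hypothetical addition that records which hours are affected.

```python
import numpy as np
import pandas as pd

# Made-up readings: the 21:00 hour has one missing value
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-02-26 21:00:00"] * 3 + ["2015-02-26 22:00:00"] * 3
    ),
    "Amazon": [57.0, np.nan, 55.0, 64.0, 93.0, 70.0],
})

# Fill with 0 and sum, as the notebook does
filled_sum = (
    toy.assign(Amazon=toy["Amazon"].fillna(0))
    .groupby("timestamp")["Amazon"]
    .sum()
)

# Sum the raw values; NaN entries are skipped by default
raw_sum = toy.groupby("timestamp")["Amazon"].sum()

# Count the missing readings per hour so undercounted hours can be identified
n_missing = toy["Amazon"].isna().groupby(toy["timestamp"]).sum()

summary = pd.DataFrame({"total": raw_sum, "n_missing": n_missing})
print(summary)
```

Hours with a nonzero `n_missing` have artificially low totals, which is worth keeping in mind if a detector later flags them as unusually quiet.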


In [55]:
# Function to display the anomaly vs normal observations
def plot_anomalies(
    method: str,
    dataframe,
    x_col: str,
    y_col: str,
    anomaly_col: str = "anomaly"
):
    """
    Creates a scatter plot to visualize anomalies versus normal observations
    based on the specified anomaly detection method.

    Parameters:
        method (str): Name of the anomaly detection method used (e.g., 'Isolation Forest').
        dataframe (pd.DataFrame): DataFrame containing the data for plotting.
        x_col (str): Column name for the X-axis (typically datetime).
        y_col (str): Column name for the Y-axis (typically Twitter mentions or observation values).
        anomaly_col (str): Column name for marking anomalies. Default is 'anomaly'.

    Returns:
        None: Displays the plot.
    """
    # Scatter plot of the observations, colored by the anomaly label
    fig = px.scatter(data_frame=dataframe, x=x_col, y=y_col, color=anomaly_col)
    # `company` is read from the notebook's global scope
    fig.update_layout(title=f"Anomaly detection for {company} using {method}")
    fig.show()


# 1. Isolation Forest

In [57]:
# Creating a copy of dataframe with Twitter Mentions for Isolation Forest
# this ml algorithm predicts -1 for 'anomaly' and 1 for 'normal'
if_df = df[[company]].copy()

# initialize ml model; contamination=0.2 assumes roughly 20% of observations are anomalous
IF = IsolationForest(contamination=0.2)

# fit data to the model
IF.fit(if_df)

# predict anomaly
if_pred = IF.predict(if_df)

# replace -1 with 'Anomaly' and 1 with 'Normal'
if_pred = ['Normal' if value == 1 else 'Anomaly' for value in if_pred]

if_df['timestamp'] = df['timestamp']
if_df['anomaly'] = if_pred

# graphical representation
plot_anomalies(
    method="Isolation Forest",  # Method used for anomaly detection
    dataframe=if_df,            # DataFrame containing the data
    x_col="timestamp",          # Column name for the X-axis (datetime)
    y_col=company,              # Column name for the Y-axis (Twitter mentions for Amazon)
    anomaly_col="anomaly"       # Column indicating anomalies (default is 'anomaly')
)
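
`predict` hides the score that drives the labels. As a rough sketch of what `contamination` does, the example below fits an Isolation Forest on a synthetic series (random counts with a few injected spikes, not the Twitter data) and checks that the -1 labels are exactly the points whose `decision_function` value is negative. The `random_state=0` argument is an addition for reproducibility and is not used in the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic hourly counts with three injected spikes (illustrative data only)
rng = np.random.default_rng(0)
values = rng.normal(loc=500, scale=50, size=500)
values[[50, 200, 350]] = [1500.0, 1800.0, 2100.0]
X = values.reshape(-1, 1)

# contamination sets the score cutoff so that about 20% of the training points fall below it
model = IsolationForest(contamination=0.2, random_state=0).fit(X)
pred = model.predict(X)               # -1 for anomaly, 1 for normal
scores = model.decision_function(X)   # negative values correspond to anomalies

same_labels = np.array_equal(pred == -1, scores < 0)
anomaly_fraction = np.mean(pred == -1)
print(same_labels, round(float(anomaly_fraction), 3))
```

With `contamination=0.2`, roughly one observation in five is labelled an anomaly regardless of how unusual it actually is, so the value is worth tuning against how many anomalies are plausible for the data.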


# 2. One Class SVM

In [59]:
# Creating dataframe with Twitter Mentions for One Class SVM
# this ml algorithm predicts -1 for 'anomaly' and 1 for 'normal'
oc_df = df[[company]].copy()

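# initialize ml model; nu=0.2 is an upper bound on the fraction of training errors
# and a lower bound on the fraction of support vectors
OC_SVM = OneClassSVM(nu=0.2)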

# fit data to the model
OC_SVM.fit(oc_df)

# predict anomaly
oc_pred = OC_SVM.predict(oc_df)

# replace -1 with 'Anomaly' and 1 with 'Normal'
oc_pred = ['Normal' if value == 1 else 'Anomaly' for value in oc_pred]

oc_df['timestamp'] = df['timestamp']
oc_df['anomaly'] = oc_pred

# graphical representation
plot_anomalies(
    method="One Class SVM",  # Method used for anomaly detection
    dataframe=oc_df,         # DataFrame containing the data
    x_col="timestamp",       # Column name for the X-axis (datetime)
    y_col=company,           # Column name for the Y-axis (Twitter mentions for Amazon)
    anomaly_col="anomaly"    # Column indicating anomalies (default is 'anomaly')
)
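
One-Class SVM with the default RBF kernel works on distances between observations, so it can be sensitive to the scale of the input, and hourly counts in the hundreds or thousands are far from unit scale. One option, sketched here on synthetic data rather than the Twitter series, is to standardize the values before fitting. The pipeline and the choice of `StandardScaler` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic hourly counts with three injected spikes (illustrative data only)
rng = np.random.default_rng(0)
values = rng.normal(loc=500, scale=50, size=500)
values[[50, 200, 350]] = [1500.0, 1800.0, 2100.0]
X = values.reshape(-1, 1)

# Standardize to zero mean and unit variance, then fit the One-Class SVM
svm_pipeline = make_pipeline(StandardScaler(), OneClassSVM(nu=0.2, gamma="scale"))
svm_pred = svm_pipeline.fit(X).predict(X)   # -1 for anomaly, 1 for normal

# Map the numeric predictions to the labels used in the notebook
labels = np.where(svm_pred == 1, "Normal", "Anomaly")
svm_anomaly_fraction = np.mean(svm_pred == -1)
print(round(float(svm_anomaly_fraction), 3))
```

Whether scaling changes the flagged points on the real series depends on the data, so comparing both variants in the plot is a reasonable check.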



# 3. Kernel Density Estimation

In [64]:
# Create a new DataFrame with only Twitter mentions for generating the Kernel Density curve
kd_df = df[[company]].copy()

# Initialize the Kernel Density Estimation model
# (defaults: Gaussian kernel with bandwidth=1.0)
KD = KernelDensity()

# Fit the model to the data
KD.fit(kd_df)

# Obtain the log-density of each sample under the fitted model
scores = KD.score_samples(kd_df)

# Set a threshold, below which observations will be flagged as outliers/anomalies
# Here, we use the 20th percentile as the threshold, adjustable as needed
threshold = quantile(scores, 0.2)

# Based on the threshold, label each observation as either "Anomaly" or "Normal"
kd_df['anomaly'] = ['Anomaly' if score < threshold else 'Normal' for score in scores]
kd_df['timestamp'] = df['timestamp']

# Generate a graphical representation
plot_anomalies(
    method="Kernel Density",  # Anomaly detection method used
    dataframe=kd_df,          # DataFrame containing the data to plot
    x_col="timestamp",        # Column for the X-axis (datetime)
    y_col=company,            # Column for the Y-axis (Twitter mentions for Amazon)
    anomaly_col="anomaly"     # Column indicating anomalies
)
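
`KernelDensity()` uses a fixed default bandwidth of 1.0, which is narrow relative to hourly counts that span hundreds of mentions. A common alternative is to derive the bandwidth from the data, for example with Silverman's rule of thumb, h = 0.9 * min(std, IQR / 1.34) * n^(-1/5). The sketch below applies that rule to synthetic data (not the Twitter series). The helper name `silverman_bandwidth` is hypothetical, and the 20th-percentile threshold simply mirrors the cell above.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def silverman_bandwidth(x: np.ndarray) -> float:
    """Silverman's rule of thumb for a 1-D Gaussian kernel (hypothetical helper)."""
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(np.std(x, ddof=1), (q75 - q25) / 1.34)
    return float(0.9 * spread * len(x) ** (-1 / 5))


# Synthetic hourly counts with three injected spikes (illustrative data only)
rng = np.random.default_rng(0)
values = rng.normal(loc=500, scale=50, size=500)
values[[50, 200, 350]] = [1500.0, 1800.0, 2100.0]
X = values.reshape(-1, 1)

# Fit a Gaussian KDE with a data-driven bandwidth
kde = KernelDensity(kernel="gaussian", bandwidth=silverman_bandwidth(values)).fit(X)
log_density = kde.score_samples(X)   # log-density of each observation

# Flag the 20% of observations with the lowest estimated density
kde_threshold = np.quantile(log_density, 0.2)
kde_labels = np.where(log_density < kde_threshold, "Anomaly", "Normal")
print((kde_labels == "Anomaly").mean())
```

As with the quantile threshold in the notebook, the share of flagged points is fixed by the chosen percentile; the bandwidth only changes which points end up in that share.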
