#  Anomaly detection in cellular networks

## 1. Introduction

The purpose of this notebook is to solve an anomaly detection problem proposed as a competition on the Kaggle InClass platform.

## 2. Problem description

### Context

Traditionally, the design of a cellular network focuses on the optimization of energy and resources that guarantees a smooth operation even during peak hours (i.e. periods with higher traffic load). 
However, this implies that cells are overprovisioned with radio resources most of the time. 
Next-generation cellular networks call for dynamic management and configuration in order to adapt to varying user demands in the most efficient way with regard to energy savings and utilization of frequency resources. 
If the network operator were capable of anticipating those variations in the users’ traffic demands, a more efficient management of the scarce (and expensive) network resources would be possible.
Current research in mobile networks looks to Machine Learning (ML) techniques to help manage those resources. 
In this case, you will explore the possibilities of ML to detect abnormal behaviors in the utilization of the network that would motivate a change in the configuration of the base station.


### Objective

The objective of the network optimization team is to analyze traces of past activity, which will be used to train an ML system capable of classifying samples of current activity as:
 - 0 (normal): current activity corresponds to the normal behavior of any working day. Therefore, no reconfiguration or redistribution of resources is needed.
 - 1 (unusual): current activity slightly differs from the behavior usually observed for that time of the day (e.g. due to a strike, demonstration, sports event, etc.), which should trigger a reconfiguration of the base station.

### Dataset

The dataset has been obtained from a real LTE deployment. During two weeks, different metrics were gathered from a set of 10 base stations, each having a different number of cells, every 15 minutes. 
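As a rough sanity check on the expected data volume, the sampling scheme can be turned into a per-cell sample count. This is only a sketch: it assumes that every cell reported in every 15-minute slot for 14 full days, which the description does not guarantee (the real file may contain gaps).

```python
# Back-of-the-envelope sample count per cell.
# Assumption: one sample per 15-minute slot, no gaps, 14 full days.
MINUTES_PER_DAY = 24 * 60
SAMPLING_INTERVAL_MIN = 15
DAYS = 14

samples_per_day = MINUTES_PER_DAY // SAMPLING_INTERVAL_MIN
samples_per_cell = samples_per_day * DAYS

print(samples_per_day)   # 96
print(samples_per_cell)  # 1344
```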

The dataset is provided in the form of a csv file, where each row corresponds to a sample obtained from one particular cell at a certain time. Each data example contains the following features:

 - Time : hour of the day (in the format hh:mm) when the sample was generated.
 - CellName: text string used to uniquely identify the cell that generated the current sample. CellName is in the form xαLTE, where x identifies the base station, and α the cell within that base station (see the example in the right figure).
 - PRBUsageUL and PRBUsageDL: level of resource utilization in that cell measured as the portion of Physical Radio Blocks (PRB) that were in use (%) in the previous 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - meanThr_DL and meanThr_UL: average carried traffic (in Mbps) during the past 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - maxThr_DL and maxThr_UL: maximum carried traffic (in Mbps) measured in the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - meanUE_DL and meanUE_UL: average number of user equipment (UE) devices that were simultaneously active during the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - maxUE_DL and maxUE_UL: maximum number of user equipment (UE) devices that were simultaneously active during the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - maxUE_UL+DL: maximum number of user equipment (UE) devices that were active simultaneously in the last 15 minutes, regardless of UL and DL.
 - Unusual: labels for supervised learning. A value of 0 indicates that the sample corresponds to normal operation; a value of 1 identifies unusual behavior.
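To make the xαLTE naming scheme concrete, here is a minimal sketch that splits a cell name into its base station and cell parts. The helper `parse_cell_name`, the `CellId` type and the regular expression are hypothetical and not part of the original notebook. The sketch assumes that x is a run of digits and α a single letter (for example "3BLTE"), which should be checked against the actual data.

```python
import re
from typing import NamedTuple, Optional

# Assumption: x is one or more digits and α is a single letter, e.g. "3BLTE".
CELL_NAME_PATTERN = re.compile(r"^(?P<station>\d+)(?P<cell>[A-Za-z])LTE$")


class CellId(NamedTuple):
    station: int  # base station identifier (x)
    cell: str     # cell within the base station (α)


def parse_cell_name(cell_name: str) -> Optional[CellId]:
    """Split a CellName into base station and cell; return None if it does not match."""
    match = CELL_NAME_PATTERN.match(cell_name.strip())
    if match is None:
        return None
    return CellId(station=int(match.group("station")), cell=match.group("cell").upper())


print(parse_cell_name("3BLTE"))   # CellId(station=3, cell='B')
print(parse_cell_name("10CLTE"))  # CellId(station=10, cell='C')
```

Returning `None` for names that do not match makes it easy to count how many rows violate the assumed pattern before relying on the split.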

## Libraries

In [None]:
import os
import sys
import random
import getpass
from zipfile import ZipFile
from IPython.display import Image

#Data
import kaggle
import pandas as pd

#Analysis
try:
    import pyspark
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, DoubleType, IntegerType, StringType, TimestampType
    from pyspark.sql.functions import col, lit, concat, split, when, udf, regexp_replace
except ImportError as e:
    # Every Spark cell below depends on these imports, so warn and then fail loudly
    print('WARN: Something wrong with pyspark library. Please check configuration settings!')
    raise

#EDA
import swat

# Reloads functions each time so you can edit a script and not need to restart the kernel
%load_ext autoreload
%autoreload 2

## Helpers

In [None]:
def get_root_dir (src: str, max_nest: int) -> str:
    '''
    Walk up from the current directory until a directory
    containing src is found.
    :param src: name of the source directory to look for
    :param max_nest: maximum number of parent levels to search
    :return: absolute path of the root directory
    '''
    root_dir = os.curdir
    nest = 0
    while src not in os.listdir(root_dir) and nest < max_nest:
        root_dir = os.path.join(os.pardir, root_dir)  # Move one level up the directory tree
        nest += 1
    # If src was not found within max_nest levels, fall back to the current directory
    found = src in os.listdir(root_dir)
    root_dir = os.path.abspath(root_dir) if found else os.path.abspath(os.curdir)
    return root_dir

def set_src (root_dir: str, src: str) -> str:
    '''
     Get the source directory and append
     path to access python packages/scripts within directory
    :param root_dir: root path
    :param src: src path
    :return: last system path record (to check)
    '''
    if src in os.listdir(root_dir):
        src_dir = os.path.join(root_dir, src)
        sys.path.append(src_dir)
    return sys.path[-1]


def set_folder (root_dir: str, folder: str) -> str:
    '''
    Set the folder path based on the folder name
    :param root_dir: root path
    :param folder: folder name
    :return: folder_path from root
    '''
    folder_path = os.path.join(
        root_dir, folder) if folder in os.listdir(root_dir) else os.curdir
    return folder_path

def set_path(path:str, dirname:str) -> str:
    '''
    Set the entire path given a directory name
    :param path: base path
    :param dirname: directory or file name to append to the base path
    :return: new path
    '''
    return os.path.join(path, dirname)


def unzip (inpath: str, outpath: str) -> None:
    '''
    unzip a compressed file
    :param inpath: path of zip
    :param outpath: path to unzip
    :return: None
    '''
    # The context manager closes the archive even if extraction fails
    with ZipFile(inpath, 'r') as zf:
        zf.extractall(outpath)

## Setup

In [None]:
# Folders
root_dir = get_root_dir('src', 5)
src_dir = set_src(root_dir, 'src')
data_dir = set_folder(root_dir, 'data')
raw_data_dir = set_path(data_dir, 'raw')
processed_data_dir = set_path(data_dir, 'processed')
figures_dir = set_folder(root_dir, 'figures')
models_dir = set_folder(root_dir, 'models')

#Variables
# cashost =''
# casport = '5570'
# print('Provide username and password to Viya Server login')
# casuser = input("")
# password = getpass.getpass()
# caslib='casuser'

In [None]:
# To convert to html with collapsible headings and table of contents
# change filename and run cell
# filename = "template.ipynb"
# ! jupyter nbconvert --to html_ch {filename} --template toc2

# 1. Data

## Download from Kaggle

In [None]:
!kaggle competitions download -c anomaly-detection-in-cellular-networks -p ../../data/raw/ --force

In [None]:
unzip('../../data/raw/anomaly-detection-in-cellular-networks.zip', raw_data_dir)

In [None]:
train_path = set_path(raw_data_dir, 'ML-MATT-CompetitionQT1920_train.csv')
val_path = set_path(raw_data_dir, 'ML-MATT-CompetitionQT1920_test.csv')
train_data = pd.read_csv(train_path, header=0, sep=',', engine='python')  # the python engine avoids the UnicodeDecodeError raised by the C engine

## Inspect data

In [None]:
train_data.head()

In [None]:
train_data.columns

In [None]:
train_data.info()

In [None]:
print('The distribution of target variable')
round((train_data['Unusual'].value_counts()/train_data.shape[0])*100, 3)

### Comment

The sample is unbalanced, and some values are missing. All variables are continuous. We need to extract the hour from the Time variable, and we may consider renaming the maxUE_UL+DL column.
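One common way to quantify the imbalance is to derive per-class weights from the label counts. The sketch below uses the "balanced" heuristic, n_samples / (n_classes * class_count). The function name `balanced_class_weights` is hypothetical, and the labels are made-up toy values, not figures from the dataset.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def balanced_class_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up toy labels (not taken from the dataset): 8 normal, 2 unusual.
toy_labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

In this notebook the same function could be applied to the non-missing values of the Unusual column. Whether weighting actually helps would still need to be checked experimentally.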

# 2. ETL

## Initiate Spark session

In [None]:
# Create (or reuse, if it already exists) a Spark session named "Anomaly Detection" that runs on a local master
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Anomaly Detection") \
    .getOrCreate()

In [None]:
spark.getActiveSession()

## Extract

### Define schema and read data


In [None]:
schema_train = StructType() \
    .add("Time", StringType(), True) \
    .add("CellName", StringType(), True) \
    .add("PRBUsageUL", DoubleType(), True) \
    .add("PRBUsageDL", DoubleType(), True) \
    .add("meanThr_DL", DoubleType(), True) \
    .add("meanThr_UL", DoubleType(), True) \
    .add("maxThr_DL", DoubleType(), True) \
    .add("maxThr_UL", DoubleType(), True) \
    .add("meanUE_DL", DoubleType(), True) \
    .add("meanUE_UL", DoubleType(), True) \
    .add("maxUE_DL", DoubleType(), True) \
    .add("maxUE_UL", DoubleType(), True) \
    .add("maxUE_UL+DL", IntegerType(), True) \
    .add("Unusual", IntegerType(), True)

train_df = spark.read.option("header", True) \
                .option("delimiter", ',') \
                .schema(schema_train) \
                .csv(train_path)

In [None]:
schema_val = StructType() \
    .add("Time", StringType(), True) \
    .add("CellName", StringType(), True) \
    .add("PRBUsageUL", DoubleType(), True) \
    .add("PRBUsageDL", DoubleType(), True) \
    .add("meanThr_DL", DoubleType(), True) \
    .add("meanThr_UL", DoubleType(), True) \
    .add("maxThr_DL", DoubleType(), True) \
    .add("maxThr_UL", DoubleType(), True) \
    .add("meanUE_DL", DoubleType(), True) \
    .add("meanUE_UL", DoubleType(), True) \
    .add("maxUE_DL", DoubleType(), True) \
    .add("maxUE_UL", DoubleType(), True) \
    .add("maxUE_UL+DL", IntegerType(), True) 

val_df = spark.read.option("header", True) \
                .option("delimiter", ',') \
                .schema(schema_val) \
                .csv(val_path)

In [None]:
train_df.printSchema()

In [None]:
train_df.show(5)

## Transform

Because we have:

 - a particular time format (hh:mm)
 - a composite cell identifier (xαLTE)
 - a messy column name (maxUE_UL+DL)
 - missing values
 - an unbalanced sample
 
we need to implement some transformations:

 - format the time column consistently, e.g. as HH:mm
 - keep the cell identifier as is, because we want to optimize per cell
 - rename maxUE_UL+DL to maxUE_UL_DL
 - drop rows with missing values, for simplicity
 - possibly assign a weight to each class to penalize the majority class (experiment)
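The time-formatting step can be illustrated on a single value in plain Python, without Spark. The helper `normalize_time` is hypothetical. It produces the same derived fields the notebook builds as Spark columns (timestamp, timestamp_raw, hour, minutes), assuming the input has the form h:mm or hh:mm.

```python
from typing import Dict


def normalize_time(raw: str) -> Dict[str, str]:
    """Turn 'h:mm' or 'hh:mm' into zero-padded time fields."""
    hour, minutes = raw.strip().split(":")
    hour = hour.zfill(2)  # "7" -> "07"; two-digit hours are left unchanged
    return {
        "timestamp": f"{hour}:{minutes}:00",     # seconds are always appended as ":00"
        "timestamp_raw": f"{hour}{minutes}00",   # same value without the colons
        "hour": hour,
        "minutes": minutes,
    }


print(normalize_time("7:15"))
# {'timestamp': '07:15:00', 'timestamp_raw': '071500', 'hour': '07', 'minutes': '15'}
```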


In [None]:
flt = """
PRBUsageUL IS NOT NULL
and PRBUsageDL IS NOT NULL
and meanThr_DL IS NOT NULL
and meanThr_UL IS NOT NULL
and maxThr_DL IS NOT NULL
and maxThr_UL IS NOT NULL
and meanUE_DL IS NOT NULL
and meanUE_UL IS NOT NULL
and maxUE_DL IS NOT NULL
and maxUE_UL IS NOT NULL
and maxUE_UL_DL IS NOT NULL
and Unusual IS NOT NULL
"""

item = split(col('hour_enc'), ':').getItem(0)
# True when the hour is a single digit (0-9) and therefore needs a leading zero
cond = item.isin([str(digit) for digit in range(10)])

# Recipe #
# Append ":00" (seconds) to Time
# Prepend "0" to single-digit hours (0, 1, ..., 9)
# Extract hour
# Extract minutes
# Rename "maxUE_UL+DL" to "maxUE_UL_DL"
# Filter out rows with missing values
# Reorder columns

train_df = train_df.withColumn('hour_enc', concat(col('Time'), lit(":00"))) \
                   .withColumn('timestamp', when(cond, concat(lit("0"), col('hour_enc'))).otherwise(col('hour_enc'))) \
                   .withColumn('timestamp_raw', regexp_replace(col('timestamp'), "\\:", "")) \
                   .withColumn('hour', split(col('timestamp'), ':').getItem(0)) \
                   .withColumn('minutes', split(col('timestamp'), ':').getItem(1)) \
                   .withColumnRenamed("maxUE_UL+DL","maxUE_UL_DL") \
                   .filter(flt) \
                   .select('CellName', 'timestamp_raw', 'timestamp', 'hour', 'minutes', 'PRBUsageUL', 'PRBUsageDL', 
                           'meanThr_DL', 'meanThr_UL', 'maxThr_DL', 'maxThr_UL', 
                           'meanUE_DL', 'meanUE_UL', 'maxUE_DL', 'maxUE_UL', 'maxUE_UL_DL', 
                           'Unusual')



train_df.show(5)
print(f"The new number of rows is {train_df.count()}")

In [None]:
flt = """
PRBUsageUL IS NOT NULL
and PRBUsageDL IS NOT NULL
and meanThr_DL IS NOT NULL
and meanThr_UL IS NOT NULL
and maxThr_DL IS NOT NULL
and maxThr_UL IS NOT NULL
and meanUE_DL IS NOT NULL
and meanUE_UL IS NOT NULL
and maxUE_DL IS NOT NULL
and maxUE_UL IS NOT NULL
and maxUE_UL_DL IS NOT NULL
"""

item = split(col('hour_enc'), ':').getItem(0)
# True when the hour is a single digit (0-9) and therefore needs a leading zero
cond = item.isin([str(digit) for digit in range(10)])

val_df = val_df.withColumn('hour_enc', concat(col('Time'), lit(":00"))) \
                   .withColumn('timestamp', when(cond, concat(lit("0"), col('hour_enc'))).otherwise(col('hour_enc'))) \
                   .withColumn('timestamp_raw', regexp_replace(col('timestamp'), "\\:", "")) \
                   .withColumn('hour', split(col('timestamp'), ':').getItem(0)) \
                   .withColumn('minutes', split(col('timestamp'), ':').getItem(1)) \
                   .withColumnRenamed("maxUE_UL+DL","maxUE_UL_DL") \
                   .filter(flt) \
                   .select('CellName', 'timestamp_raw', 'timestamp', 'hour', 'minutes', 'PRBUsageUL', 'PRBUsageDL', 
                           'meanThr_DL', 'meanThr_UL', 'maxThr_DL', 'maxThr_UL', 
                           'meanUE_DL', 'meanUE_UL', 'maxUE_DL', 'maxUE_UL', 'maxUE_UL_DL')

val_df.show(5)
print(f"The new number of rows is {val_df.count()}")

## Load in SAS Viya for Exploration

I don't actually have a load target available yet, so for now the data is stored in files on disk.

In [None]:
# conn = swat.CAS(cashost, casport, casuser, password)

# 3. Analysis



Based on the context, I expect anomalies to show up as:
 - different trends in the level of resource utilization, in both downlink and uplink
 - different trends in traffic, in both downlink and uplink
 - different trends in usage, in both downlink and uplink
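A simple first check of these expectations is to compare the average of a metric between normal and unusual samples. The sketch below does this in plain Python. The helper `mean_by_label` is hypothetical, and the rows are made-up toy values used only to show the mechanics, so the printed numbers say nothing about the real data.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping


def mean_by_label(rows: Iterable[Mapping[str, float]], metric: str,
                  label: str = "Unusual") -> Dict[int, float]:
    """Average one metric separately for each value of the label column."""
    grouped: Dict[int, List[float]] = defaultdict(list)
    for row in rows:
        grouped[int(row[label])].append(float(row[metric]))
    return {key: mean(values) for key, values in sorted(grouped.items())}


# Made-up toy rows (not taken from the dataset).
toy_rows = [
    {"PRBUsageDL": 1.0, "Unusual": 0},
    {"PRBUsageDL": 3.0, "Unusual": 0},
    {"PRBUsageDL": 6.0, "Unusual": 1},
    {"PRBUsageDL": 10.0, "Unusual": 1},
]
print(mean_by_label(toy_rows, "PRBUsageDL"))  # {0: 2.0, 1: 8.0}
```

Repeating the comparison per hour and per cell would be closer to the per-hour, per-cell figures used in this section.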
    

In [None]:
## Check how to import a VA report into a Jupyter notebook
# %%html
# <iframe src="http://viyalab/reportImages/directImage?reportUri=%2Freports%2Freports%2Ff72f28f9-a6ec-4a6b-a89e-b2802673d644&size=1200x1000&layoutType=entireSection&sectionIndex=0"></iframe>

In [None]:
Image(filename=os.path.join(figures_dir, '1_PRB_UL_Hour_Cell.JPG'))  # os.path.join keeps the path portable across operating systems

In [None]:
Image(filename=os.path.join(figures_dir, '2_PRB_DL_Hour_Cell.JPG'))

In [None]:
Image(filename=os.path.join(figures_dir, '3_mean_max_DL_UL_Usage_0_1.JPG'))

In [None]:
Image(filename=os.path.join(figures_dir, '4_mean_max_DL_UL_Thr_0_1.JPG'))

# 4. Store the data

In [None]:
train, test = train_df.randomSplit([0.9, 0.1], seed=666)

In [None]:
processed_train_path = set_path(processed_data_dir, 'ML-MATT-CompetitionQT1920_train_processed.parquet')
processed_test_path = set_path(processed_data_dir, 'ML-MATT-CompetitionQT1920_test_processed.parquet')
processed_val_path = set_path(processed_data_dir, 'ML-MATT-CompetitionQT1920_val_processed.parquet')
# Write explicitly as Parquet (the default format of save()) so the output matches the .parquet file names
train.drop("timestamp_raw", "timestamp", "maxUE_UL_DL").write.mode('overwrite').parquet(processed_train_path)
test.drop("timestamp_raw", "timestamp", "maxUE_UL_DL").write.mode('overwrite').parquet(processed_test_path)
val_df.drop("timestamp_raw", "timestamp", "maxUE_UL_DL").write.mode('overwrite').parquet(processed_val_path)

# Conclusions

## Key findings 
1. The level of resource utilization shows different trends when anomalies occur, in both downlink and uplink.
2. Traffic shows different trends, in both downlink and uplink.
3. Usage shows different trends, in both downlink and uplink.

## Next steps
Let's engineer some variables...