# Anomaly detection in cellular networks

## 1. Introduction

The purpose of this notebook is to solve an anomaly detection problem proposed as a competition on the Kaggle InClass platform.

## 2. Problem description

### Context

Traditionally, the design of a cellular network focuses on the optimization of energy and resources to guarantee smooth operation even during peak hours (i.e. periods with higher traffic load). 
However, this implies that cells are overprovisioned with radio resources most of the time. 
Next-generation cellular networks call for dynamic management and configuration in order to adapt to varying user demands in the most efficient way with regard to energy savings and utilization of frequency resources. 
If the network operator were capable of anticipating those variations in the users' traffic demands, a more efficient management of the scarce (and expensive) network resources would be possible.
Current research in mobile networks looks to Machine Learning (ML) techniques to help manage those resources. 
In this case, you will explore the possibilities of ML to detect abnormal behaviors in the utilization of the network that would motivate a change in the configuration of the base station.


### Objective

The objective of the network optimization team is to analyze traces of past activity, which will be used to train an ML system capable of classifying samples of current activity as:
 - 0 (normal): current activity corresponds to the normal behavior of any working day. Therefore, no reconfiguration or redistribution of resources is needed.
 - 1 (unusual): current activity slightly differs from the behavior usually observed for that time of the day (e.g. due to a strike, demonstration, sports event, etc.), which should trigger a reconfiguration of the base station.

### Dataset

The dataset has been obtained from a real LTE deployment. Over two weeks, different metrics were gathered every 15 minutes from a set of 10 base stations, each having a different number of cells.
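
As a rough sanity check on the size of the dataset, the sampling scheme above bounds the number of rows that a single cell can contribute. This is only a sketch: it assumes that "two weeks" means 14 full days and that no samples are missing, and the variable names are illustrative.

```python
SAMPLING_INTERVAL_MIN = 15   # stated in the dataset description
DAYS = 14                    # assumption: "two weeks" means 14 full days

# One sample per cell every 15 minutes
samples_per_day = 24 * 60 // SAMPLING_INTERVAL_MIN
samples_per_cell = samples_per_day * DAYS

print(samples_per_day)   # 96
print(samples_per_cell)  # 1344
```

The total row count then depends on the number of cells per base station, which the description does not give.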

The dataset is provided in the form of a csv file, where each row corresponds to a sample obtained from one particular cell at a certain time. Each data example contains the following features:

 - Time: hour of the day (in the format hh:mm) when the sample was generated.
 - CellName: text string used to uniquely identify the cell that generated the current sample. CellName is in the form xαLTE, where x identifies the base station, and α the cell within that base station (see the example in the right figure).
 - PRBUsageUL and PRBUsageDL: level of resource utilization in that cell, measured as the portion of Physical Resource Blocks (PRB) that were in use (%) in the previous 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - meanThr_DL and meanThr_UL: average carried traffic (in Mbps) during the past 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - maxThr_DL and maxThr_UL: maximum carried traffic (in Mbps) measured in the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - meanUE_DL and meanUE_UL: average number of user equipment (UE) devices that were simultaneously active during the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - maxUE_DL and maxUE_UL: maximum number of user equipment (UE) devices that were simultaneously active during the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - maxUE_UL+DL: maximum number of user equipment (UE) devices that were active simultaneously in the last 15 minutes, regardless of UL and DL.
 - Unusual: labels for supervised learning. A value of 0 indicates that the sample corresponds to normal operation; a value of 1 identifies unusual behavior.
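
The two string-valued features above can be decomposed before modelling. The following is a minimal sketch and not part of the original notebook: `parse_cell_name` and `hour_of_day` are hypothetical helper names, and the pattern assumes that x is a number and α a single letter (for example `3BLTE`), which the description above does not guarantee.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: digits (base station), one letter (cell), literal "LTE".
CELL_NAME_RE = re.compile(r"(\d+)([A-Za-z])LTE")

def parse_cell_name(cell_name: str) -> Optional[Tuple[int, str]]:
    """Return (base_station, cell), or None if the name does not fit the assumed pattern."""
    match = CELL_NAME_RE.fullmatch(cell_name.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

def hour_of_day(time_str: str) -> int:
    """Extract the hour from an 'hh:mm' (or 'h:mm') string."""
    return int(time_str.strip().split(":")[0])

print(parse_cell_name("3BLTE"))  # (3, 'B')
print(hour_of_day("10:45"))      # 10
```

Returning `None` for names that do not match keeps unexpected identifiers visible instead of silently mis-parsing them.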

## Libraries

In [None]:
import os
import sys
from zipfile import ZipFile

#Data
import kaggle
import pandas as pd

#Analysis
try:
    import pyspark
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, DoubleType, IntegerType, StringType, TimestampType
    from pyspark.sql.functions import col, split
except ImportError as e:
    print('WARN: Something wrong with pyspark library. Please check configuration settings!')
    raise  # the rest of the notebook cannot run without pyspark
    

# Reloads functions each time so you can edit a script and not need to restart the kernel
%load_ext autoreload
%autoreload 2

In [None]:
# # must go first
# %matplotlib inline
# %config InlineBackend.figure_format='retina'

# # plotting
# import matplotlib as mpl
# from matplotlib import pyplot as plt
# import seaborn as sns
# sns.set_context("poster", font_scale=1.3)

# import sys
# import os
# import datetime

# sns.set()
# sns.set_context('poster', font_scale=1.3)
# sns.set_style("white")

# import warnings
# warnings.filterwarnings('ignore')

# # basic wrangling
# import numpy as np
# import yaml
# import json
# import re
# import pandas as pd

# # eda tools
# import pivottablejs
# import missingno as msno

# # Update matplotlib defaults to something nicer
# mpl_update = {
#     'font.size': 16,
#     'xtick.labelsize': 14,
#     'ytick.labelsize': 14,
#     'figure.figsize': [12.0, 8.0],
#     'axes.labelsize': 20,
#     'axes.labelcolor': '#677385',
#     'axes.titlesize': 20,
#     'lines.color': '#0055A7',
#     'lines.linewidth': 3,
#     'text.color': '#677385',
#     'font.family': 'sans-serif',
#     'font.sans-serif': 'Tahoma'
# }
# mpl.rcParams.update(mpl_update)

## Helpers

In [None]:
def get_root_dir(src:str, max_nest:int) -> str:
    '''
    Walk up from the current directory, at most max_nest levels, until a
    directory containing src is found, and return its absolute path.
    Falls back to the current directory if src is not found.
    '''
    root_dir = os.curdir
    nest = 0
    while src not in os.listdir(root_dir) and nest < max_nest:
        root_dir = os.path.join(os.pardir, root_dir)     # Look up the directory structure for a src directory
        nest += 1

    # If you don't find the src directory, the root directory is this directory.
    # Checking for src directly also covers a match found on the last allowed level.
    found = src in os.listdir(root_dir)
    root_dir = os.path.abspath(root_dir) if found else os.path.abspath(os.curdir)

    return root_dir

def set_src(root_dir:str, src:str) -> str:
    '''
    Get the source directory and append it to sys.path to access python packages/scripts within it.
    Returns the current directory, leaving sys.path untouched, if src is not found.
    '''
    if src not in os.listdir(root_dir):
        return os.curdir
    src_dir = os.path.join(root_dir, src)
    if src_dir not in sys.path:
        sys.path.append(src_dir)
    return src_dir

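def set_data(root_dir:str, data:str) -> str:
    '''
    Return the data directory under root_dir, or the current directory if it does not exist.
    '''
    data_dir = os.path.join(
        root_dir, data) if data in os.listdir(root_dir) else os.curdir
    return data_dir

def set_figures(root_dir:str, figures:str) -> str:
    '''
    Return the figures directory under root_dir, or the current directory if it does not exist.
    '''
    figures_dir = os.path.join(
        root_dir,
        figures) if figures in os.listdir(root_dir) else os.curdir
    return figures_dir

def set_models(root_dir:str, models:str) -> str:
    '''
    Return the models directory under root_dir, or the current directory if it does not exist.
    '''
    models_dir = os.path.join(
        root_dir, models) if models in os.listdir(root_dir) else os.curdir
    return models_dir

def set_path(path:str, dirname:str) -> str:
    '''
    Join path and dirname (a directory or file name) into a single path.
    '''
    return os.path.join(path, dirname)

def unzip(inpath:str, outpath:str) -> None:
    '''
    Extract every file in the zip archive at inpath into the outpath directory.
    '''
    # The context manager closes the archive even if extraction fails
    with ZipFile(inpath, 'r') as zf:
        zf.extractall(outpath)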

# # Prepends the directory path for specifying paths to data or figures
# # dataplus("data.csv") -> "/Users/cmawer/project/data/data.csv"
# # figplus("cool.png") -> "/Users/cmawer/project/figures/cool.png"
# dataplus = lambda x: os.path.join(data_dir, x)
# dataextplus = lambda x: os.path.join(external_data_dir, x)
# figplus = lambda x: os.path.join(figure_dir, x)
# modelsplus = lambda x: os.path.join(models_dir, x)

# # Prepends the date to a string (e.g. to save dated files)
# # dateplus("cool-figure.png") -> "2018-12-05-cool-figure.png"
# now = datetime.datetime.now().strftime("%Y-%m-%d")
# dateplus = lambda x: "%s-%s" % (now, x)

## Setup

In [None]:
root_dir = get_root_dir('src', 5)
src_dir = set_src(root_dir, 'src')
data_dir = set_data(root_dir, 'data')
raw_data_dir = set_path(data_dir, 'raw')
processed_data_dir = set_path(data_dir, 'processed')
figures_dir = set_figures(root_dir, 'figures')
models_dir = set_models(root_dir, 'models')

In [None]:
# To convert to html with collapsible headings and table of contents
# change filename and run cell
# filename = "template.ipynb"
# ! jupyter nbconvert --to html_ch {filename} --template toc2

# 1. Data

## Download from Kaggle and inspect data

In [None]:
!kaggle competitions download -c anomaly-detection-in-cellular-networks -p {raw_data_dir} --force

In [None]:
unzip(set_path(raw_data_dir, 'anomaly-detection-in-cellular-networks.zip'), raw_data_dir)

In [None]:
train_path = set_path(raw_data_dir, 'ML-MATT-CompetitionQT1920_train.csv')
test_path = set_path(raw_data_dir, 'ML-MATT-CompetitionQT1920_test.csv')
train_data = pd.read_csv(train_path, header=0, sep=',', engine='python') #because UnicodeDecodeError with c engine

In [None]:
train_data.head()

In [None]:
train_data.columns

In [None]:
train_data.info()

# 2. ETL

## Initiate Spark session

In [None]:
# Create a Spark session named "Anomaly Detection" on a local master, or reuse the existing one
spark = SparkSession.builder \
    .master("local") \
    .appName("Anomaly Detection") \
    .getOrCreate()

In [None]:
spark.getActiveSession()

## Extract

### Define schema


In [None]:
schema = StructType() \
    .add("Time", StringType(), True) \
    .add("CellName", StringType(), True) \
    .add("PRBUsageUL", DoubleType(), True) \
    .add("PRBUsageDL", DoubleType(), True) \
    .add("meanThr_DL", DoubleType(), True) \
    .add("meanThr_UL", DoubleType(), True) \
    .add("maxThr_DL", DoubleType(), True) \
    .add("maxThr_UL", DoubleType(), True) \
    .add("meanUE_DL", DoubleType(), True) \
    .add("meanUE_UL", DoubleType(), True) \
    .add("maxUE_DL", DoubleType(), True) \
    .add("maxUE_UL", DoubleType(), True) \
    .add("maxUE_UL+DL", IntegerType(), True) \
    .add("Unusual", IntegerType(), True)

schema

In [None]:
train_df = spark.read.option("header", True) \
                .option("delimiter", ',') \
                .schema(schema) \
                .csv(train_path)

test_df = spark.read.option("header", True) \
                .option("delimiter", ',') \
                .schema(schema) \
                .csv(test_path)

In [None]:
train_df.show(5)

## Transform

Because we have:

 - a particular time format (hh:mm)
 - a composite cell identifier (xαLTE)
 - a messy column name (maxUE_UL+DL)
 - missing values
 
we need to implement some transformations:

 - we only want the hour (hh), so we can split the Time field and drop the minutes
 - I would leave the cell identifier as it is, because we want to optimize per cell
 - rename maxUE_UL+DL to maxUE_UL_DL
 - we could consider dropping rows with missing values, for simplicity
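
The missing-value filter used in the cells below lists every column by hand. As a minimal sketch (the helper name `not_null_filter` and the `FEATURE_COLUMNS` list are hypothetical), the same SQL condition can be generated from a list of column names, so that the training and test variants differ only in whether `Unusual` is included. PySpark's `DataFrame.dropna(subset=...)` is an alternative that avoids the SQL string altogether.

```python
from typing import Iterable

FEATURE_COLUMNS = [
    "PRBUsageUL", "PRBUsageDL",
    "meanThr_DL", "meanThr_UL",
    "maxThr_DL", "maxThr_UL",
    "meanUE_DL", "meanUE_UL",
    "maxUE_DL", "maxUE_UL",
    "maxUE_UL_DL",  # column name after renaming maxUE_UL+DL
]

def not_null_filter(columns: Iterable[str]) -> str:
    """Build a SQL condition that keeps rows where every given column is non-null."""
    return " and ".join(f"{name} IS NOT NULL" for name in columns)

# The training filter also requires a label; the test filter in this notebook omits it.
train_flt = not_null_filter(FEATURE_COLUMNS + ["Unusual"])
test_flt = not_null_filter(FEATURE_COLUMNS)
```

Either string can be passed to `DataFrame.filter` in the same way as the hand-written `flt` strings.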


In [None]:
flt = """
PRBUsageUL IS NOT NULL
and PRBUsageDL IS NOT NULL
and meanThr_DL IS NOT NULL
and meanThr_UL IS NOT NULL
and maxThr_DL IS NOT NULL
and maxThr_UL IS NOT NULL
and meanUE_DL IS NOT NULL
and meanUE_UL IS NOT NULL
and maxUE_DL IS NOT NULL
and maxUE_UL IS NOT NULL
and maxUE_UL_DL IS NOT NULL
and Unusual IS NOT NULL
"""

train_df = train_df.withColumn('hour', split(train_df['Time'], ':').getItem(0)) \
                   .withColumnRenamed("maxUE_UL+DL","maxUE_UL_DL") \
                   .filter(flt) \
                   .drop(train_df['Time'])

train_df.show(5)
print(f"The new number of rows is {train_df.count()}")

In [None]:
flt = """
PRBUsageUL IS NOT NULL
and PRBUsageDL IS NOT NULL
and meanThr_DL IS NOT NULL
and meanThr_UL IS NOT NULL
and maxThr_DL IS NOT NULL
and maxThr_UL IS NOT NULL
and meanUE_DL IS NOT NULL
and meanUE_UL IS NOT NULL
and maxUE_DL IS NOT NULL
and maxUE_UL IS NOT NULL
and maxUE_UL_DL IS NOT NULL
"""
test_df = test_df.withColumn('hour', split(test_df['Time'], ':').getItem(0)) \
                   .withColumnRenamed("maxUE_UL+DL","maxUE_UL_DL") \
                   .filter(flt) \
                   .drop(test_df['Time'])

test_df.show(5)
print(f"The new number of rows is {test_df.count()}")

## Load

I don't actually have a load target yet, so for now I store the processed data as CSV files.

In [None]:
processed_train_path = set_path(processed_data_dir, 'ML-MATT-CompetitionQT1920_train_processed.csv')
processed_test_path = set_path(processed_data_dir, 'ML-MATT-CompetitionQT1920_test_processed.csv')

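# Spark writes each DataFrame as a directory of part files at the given path.
# header=True keeps the column names; mode="overwrite" lets the cell be re-run.
train_df.write.csv(processed_train_path, mode="overwrite", header=True)
test_df.write.csv(processed_test_path, mode="overwrite", header=True)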

# 3. Analysis

# Conclusions

## Decisions made

## Key findings 
1. 
2. 
3. 

## Next steps
1. 
2. 

# Appendix

## Watermark 
For full reproducibility of results, use the exact data extraction defined at the top of the notebook and ensure that the environment matches the following:

In [None]:
# ! pip install watermark
%load_ext watermark
%watermark -v -m --iversions -g

<center>© <a href="http://lineagelogistics.com">2019 Lineage Logistics</a></center>