#  Anomaly detection in cellular networks

## Introduction

The purpose of this notebook is to solve an anomaly detection problem proposed as a competition on the Kaggle InClass platform.

## Problem description

### Context

Traditionally, the design of a cellular network focuses on optimizing energy and resources to guarantee smooth operation even during peak hours (i.e. periods with higher traffic load).
However, this implies that cells are overprovisioned with radio resources most of the time.
Next-generation cellular networks call for dynamic management and configuration in order to adapt to varying user demands in the most efficient way with regard to energy savings and utilization of frequency resources.
If the network operator were capable of anticipating those variations in the users’ traffic demands, a more efficient management of the scarce (and expensive) network resources would be possible.
Current research in mobile networks looks to Machine Learning (ML) techniques to help manage those resources.
In this case, you will explore the possibilities of ML to detect abnormal behaviors in the utilization of the network that would motivate a change in the configuration of the base station.


### Objective

The objective of the network optimization team is to analyze traces of past activity, which will be used to train an ML system capable of classifying samples of current activity as:
 - 0 (normal): current activity corresponds to the normal behavior of any working day; therefore, no reconfiguration or redistribution of resources is needed.
 - 1 (unusual): current activity slightly differs from the behavior usually observed for that time of the day (e.g. due to a strike, demonstration, sports event, etc.), which should trigger a reconfiguration of the base station.

### Dataset

The dataset has been obtained from a real LTE deployment. Over two weeks, different metrics were gathered every 15 minutes from a set of 10 base stations, each having a different number of cells.

The dataset is provided as a CSV file, where each row corresponds to a sample obtained from one particular cell at a certain time. Each sample contains the following features:

 - Time : hour of the day (in the format hh:mm) when the sample was generated.
 - CellName: text string used to uniquely identify the cell that generated the current sample. CellName is in the form xαLTE, where x identifies the base station, and α the cell within that base station (see the example in the right figure).
 - PRBUsageUL and PRBUsageDL: level of resource utilization in that cell measured as the portion of Physical Resource Blocks (PRB) that were in use (%) in the previous 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - meanThrDL and meanThrUL: average carried traffic (in Mbps) during the past 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - maxThrDL and maxThrUL: maximum carried traffic (in Mbps) measured in the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - meanUEDL and meanUEUL: average number of user equipment (UE) devices that were simultaneously active during the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - maxUEDL and maxUEUL: maximum number of user equipment (UE) devices that were simultaneously active during the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
 - maxUE_UL+DL: maximum number of user equipment (UE) devices that were active simultaneously in the last 15 minutes, regardless of UL and DL.
 - Unusual: labels for supervised learning. A value of 0 indicates that the sample corresponds to normal operation; a value of 1 identifies unusual behavior.
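
As a small illustration of the `Time` and `CellName` formats described above, the sketch below splits a cell name of the form xαLTE into its base station and cell parts, and an hh:mm string into hour and minutes. The helper names `parse_cell_name` and `parse_time` are hypothetical and not part of the notebook; the regular expression assumes x is a run of digits and α a single letter, which matches the samples shown later (e.g. `10BLTE`, `6ULTE`).

```python
import re

# Assumed pattern: digits (base station), one letter (cell), literal "LTE"
CELL_NAME = re.compile(r"^(\d+)([A-Za-z])LTE$")

def parse_cell_name(name):
    """Return (base_station, cell) for a name such as '10BLTE'."""
    match = CELL_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"Unexpected cell name format: {name!r}")
    return int(match.group(1)), match.group(2)

def parse_time(value):
    """Return (hour, minutes) for a string in hh:mm format."""
    hour, minutes = value.strip().split(":")
    return int(hour), int(minutes)

print(parse_cell_name("10BLTE"))
print(parse_time("10:45"))
```

Splitting the time into separate hour and minutes fields mirrors the `hour` and `minutes` columns that appear in the processed files loaded below.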

## Libraries

In [1]:
import os
import sys
import random
random.seed(888) #set seed for reproducibility
from zipfile import ZipFile
from IPython.display import Image


#Analysis
import pyspark
try:
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
except ImportError as e:
    print('WARN: Something wrong with pyspark library. Please check configuration settings!')
    
#Feature Engineering
from pyspark.sql.functions import col, when, lit, array, explode, rand
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, MinMaxScaler
#Model Training
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    
# Reloads functions each time so you can edit a script and not need to restart the kernel
%load_ext autoreload
%autoreload 2

## Helpers

In [2]:
def get_root_dir(src:str, max_nest:int) -> str:
    '''
    Specify paths and appending directories with relevant python source code.
    '''
    root_dir = os.curdir
    nest = 0
    while src not in os.listdir(root_dir) and nest < max_nest:
        root_dir = os.path.join(os.pardir, root_dir)     # Look up the directory structure for a src directory
        nest += 1
        
    # If you don't find the src directory, the root directory is this directory
    root_dir = os.path.abspath(root_dir) if nest < max_nest else os.path.abspath(
    os.curdir)
    
    return root_dir

def set_src(root_dir:str, src:str) -> str:
    '''
     Get the source directory and append path to access python packages/scripts within directory
    '''
    if src in os.listdir(root_dir):
        src_dir = os.path.join(root_dir, src)
        sys.path.append(src_dir)
    return sys.path[-1]

def set_folder(root_dir:str, folder:str) -> str:
    '''
    Set the folder path based on the folder name
    '''
    folder_path = os.path.join(
        root_dir, folder) if folder in os.listdir(root_dir) else os.curdir
    return folder_path

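def set_path(path:str, dirname:str) -> str:
    '''
    Join a base path with a directory or file name
    '''
    return os.path.join(path, dirname)

def unzip(inpath:str, outpath:str) -> None:
    '''
    Extract every file of the zip archive at inpath into outpath
    '''
    # The context manager closes the archive even if extraction fails
    with ZipFile(inpath, 'r') as zf:
        zf.extractall(outpath)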
    
def metrics(dataframe, actual, predicted):
    '''
    Calculates evaluation metrics from predicted results
    
    Input:
    ---------
        dataframe: spark.sql.dataframe with the real and predicted values [object]
        actual:  Name of column with observed target values [string]
        predicted: Name of column with predicted values [string]
        
    
    Output:
    ---------
        None
    '''
       
    # Along each row are the actual values and down each column are the predicted
    dataframe = dataframe.withColumn(actual, col(actual).cast('integer'))
    dataframe = dataframe.withColumn(predicted, col(predicted).cast('integer'))
    cm = dataframe.crosstab(actual, predicted)
    cm = cm.sort(cm.columns[0], ascending = True)
    
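    # Adds missing column in case just one class was predicted
    if not '0' in cm.columns:
        cm = cm.withColumn('0', lit(0))
    if not '1' in cm.columns:
        cm = cm.withColumn('1', lit(0))
    # Fix the column order to [label, '0', '1'] so the positional indexing below stays valid
    cm = cm.select(cm.columns[0], '0', '1')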
    
    # Subsets values from confusion matrix
    zero = cm.filter(cm[cm.columns[0]] == 0.0)
    first_0 = zero.take(1)
    
    one = cm.filter(cm[cm.columns[0]] == 1.0)
    first_1 = one.take(1)
    
    tn = first_0[0][1]
    fp = first_0[0][2]
    fn = first_1[0][1]
    tp = first_1[0][2]
    
    # Calculate metrics from values in the confusion matrix
    if (tp == 0):
        acc = float((tp + tn) / (tp + tn + fp + fn))
        sen = 0
        spe = float((tn) / (tn + fp))
        prec = 0
        rec = 0
        f1 = 0
    elif (tn == 0):
        acc = float((tp + tn) / (tp + tn + fp + fn))
        sen = float((tp) / (tp + fn))
        spe = 0
        prec = float((tp) / (tp + fp))
        rec = float((tp) / (tp + fn))
        f1 = 2 * float((prec * rec) / (prec + rec))
    else:
        acc = float((tp + tn) / (tp + tn + fp + fn))
        sen = float((tp) / (tp + fn))
        spe = float((tn) / (tn + fp))
        prec = float((tp) / (tp + fp))
        rec = float((tp) / (tp + fn))
        f1 = 2 * float((prec * rec) / (prec + rec))

    # Print results
    print('Confusion Matrix and Statistics: \n')
    cm.show()
    
    print('True Positives:', tp)
    print('True Negatives:', tn)
    print('False Positives:', fp)
    print('False Negatives:', fn)
    print('Total:', dataframe.count(), '\n')
    
    print('Accuracy: {0:.2f}'.format(acc))
    print('Sensitivity: {0:.2f}'.format(sen))
    print('Specificity: {0:.2f}'.format(spe))
    print('Precision: {0:.2f}'.format(prec))
    print('Recall: {0:.2f}'.format(rec))
    print('F1-score: {0:.2f}'.format(f1))

## Setup

In [3]:
root_dir = get_root_dir('src', 5)
src_dir = set_src(root_dir, 'src')
data_dir = set_folder(root_dir, 'data')
raw_data_dir = set_path(data_dir, 'raw')
interim_data_dir = set_path(data_dir, 'interim')
processed_data_dir = set_path(data_dir, 'processed')
figures_dir = set_folder(root_dir, 'figures')
features_dir = set_folder(root_dir, 'features')
models_dir = set_folder(root_dir, 'models')

# 1. Data

## Initiate Spark session

In [4]:
# Create (or reuse, if it already exists) a Spark session named "Anomaly Detection" running locally on 4 threads
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("Anomaly Detection") \
    .getOrCreate()

In [5]:
spark.getActiveSession()

## Load

### Set path

In [6]:
train_path = set_path(processed_data_dir, 'ML-MATT-CompetitionQT1920_train_processed.csv')
test_path = set_path(processed_data_dir, 'ML-MATT-CompetitionQT1920_test_processed.csv')

### Load data

In [7]:
train_df = spark.read.option("header", True) \
                .option("delimiter", ',') \
                .option("inferSchema" , "true") \
                .csv(train_path)

test_df = spark.read.option("header", True) \
                .option("delimiter", ',') \
                .option("inferSchema" , "true") \
                .csv(test_path)

In [8]:
train_df.printSchema()

root
 |-- CellName: string (nullable = true)
 |-- hour: integer (nullable = true)
 |-- minutes: integer (nullable = true)
 |-- PRBUsageUL: double (nullable = true)
 |-- PRBUsageDL: double (nullable = true)
 |-- meanThr_DL: double (nullable = true)
 |-- meanThr_UL: double (nullable = true)
 |-- maxThr_DL: double (nullable = true)
 |-- maxThr_UL: double (nullable = true)
 |-- meanUE_DL: double (nullable = true)
 |-- meanUE_UL: double (nullable = true)
 |-- maxUE_DL: double (nullable = true)
 |-- maxUE_UL: double (nullable = true)
 |-- Unusual: integer (nullable = true)



In [9]:
train_df.show(5)

+--------+----+-------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+-------+
|CellName|hour|minutes|PRBUsageUL|PRBUsageDL|meanThr_DL|meanThr_UL|maxThr_DL|maxThr_UL|meanUE_DL|meanUE_UL|maxUE_DL|maxUE_UL|Unusual|
+--------+----+-------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+-------+
|   3BLTE|  10|     45|    11.642|     1.393|      0.37|     0.041|   15.655|    0.644|    1.114|    1.025|     4.0|     3.0|      1|
|   1BLTE|   9|     45|    21.791|     1.891|     0.537|     0.268|   10.273|    1.154|    1.353|    1.085|     6.0|     4.0|      1|
|   9BLTE|   7|     45|     0.498|     0.398|     0.015|      0.01|    0.262|    0.164|    0.995|    0.995|     1.0|     1.0|      1|
|   4ALTE|   2|     45|     1.891|     1.095|      0.94|     0.024|   60.715|    0.825|    1.035|    0.995|     2.0|     2.0|      1|
|  10BLTE|   3|     30|     0.303|     0.404|     0.016|     0

In [10]:
# .withColumn("minutes_cat", col("minutes").cast('string')) \
train_df_fmt = train_df.withColumn("hour_cat", col("hour").cast('string')) \
                       .select('CellName', 'hour_cat', 'PRBUsageUL', 'PRBUsageDL', 
                           'meanThr_DL', 'meanThr_UL', 'maxThr_DL', 'maxThr_UL', 
                           'meanUE_DL', 'meanUE_UL', 'maxUE_DL', 'maxUE_UL', 'Unusual')


test_df_fmt = test_df.withColumn("hour_cat", col("hour").cast('string')) \
                       .select('CellName', 'hour_cat', 'PRBUsageUL', 'PRBUsageDL', 
                           'meanThr_DL', 'meanThr_UL', 'maxThr_DL', 'maxThr_UL', 
                           'meanUE_DL', 'meanUE_UL', 'maxUE_DL', 'maxUE_UL')

# 2. Feature Engineering

Because we have:

 - an unbalanced sample
 - features on different scales

and we want to understand the role of time, we need to apply some transformations:

 - balance the training sample
 - rescale the features to a common range (min-max scaling)
 - one-hot encode the hour

## Balancing Target

There are different methods to balance data:
  1. Undersampling (the majority class)
  2. Oversampling (the minority class) 
  3. Class weighting (assign the inverse ratio of each class as weights)

The sample is large and we do not want to alter the context, so undersampling is the method chosen here.

**REMEMBER: DON'T BALANCE THE TEST SAMPLE**
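
To make the options above concrete, the following plain-Python sketch derives, from a list of labels, the fraction of the majority class to keep when undersampling and the inverse-frequency weights that class weighting would use. The function name `balancing_plan` and the toy labels are hypothetical. Note that this sketch uses the exact minority/majority fraction, whereas the cells below truncate the ratio to an integer before sampling.

```python
from collections import Counter

def balancing_plan(labels):
    """Return (majority_label, undersample_fraction, class_weights)."""
    counts = Counter(labels)
    ranked = counts.most_common()          # most frequent class first
    majority, n_major = ranked[0]
    _, n_minor = ranked[-1]

    # Undersampling: keep this fraction of the majority class
    undersample_fraction = n_minor / n_major

    # Class weighting: weight inversely proportional to class frequency
    total = sum(counts.values())
    class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}
    return majority, undersample_fraction, class_weights

# Toy labels (illustrative only): six normal samples, three unusual ones
labels = [0] * 6 + [1] * 3
majority, fraction, weights = balancing_plan(labels)
print(majority, fraction, weights)
```

Either quantity is computed from the training labels only, in line with the reminder not to balance the test sample.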

In [11]:
df_major_label = train_df_fmt.filter(col("Unusual") == 0)
df_minor_label= train_df_fmt.filter(col("Unusual") == 1)
ratio = int(df_major_label.count()/df_minor_label.count())
print("The ratio is {}".format(ratio))

The ratio is 2


In [12]:
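# random.seed() does not seed Spark, so pass a seed explicitly to make the draw repeatable
sample = df_major_label.sample(False, 1/ratio, seed=888)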

In [13]:
sample.show(10)

+--------+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+-------+
|CellName|hour_cat|PRBUsageUL|PRBUsageDL|meanThr_DL|meanThr_UL|maxThr_DL|maxThr_UL|meanUE_DL|meanUE_UL|maxUE_DL|maxUE_UL|Unusual|
+--------+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+-------+
|   9ALTE|      13|    15.966|     1.819|     0.415|     0.071|   10.116|    0.706|    1.364|    1.314|     6.0|     5.0|      0|
|   5CLTE|      13|    24.858|     4.446|     1.448|     0.145|   27.558|    2.511|    1.617|    1.213|     7.0|     5.0|      0|
|   6ULTE|       5|     0.202|     1.112|     0.217|     0.013|    4.345|    0.103|    1.021|     0.01|     3.0|     2.0|      0|
|   3CLTE|      21|     7.175|     3.638|     1.705|     0.067|   43.851|    1.032|    1.142|    1.041|     4.0|     3.0|      0|
|   5BLTE|       2|     9.802|     2.223|     0.832|     0.023|   33.075|    0.447|    1.0

In [14]:
train_df_balanced = sample.unionAll(df_minor_label).orderBy(rand())

In [15]:
ratio_balanced = train_df_balanced.where('Unusual == 0').count()/train_df_balanced.where('Unusual == 1').count()
print(f'The ratio now is {int(ratio_balanced)}')

The ratio now is 1


## OneHot encoding

We need to convert the categorical values to numerical indices (StringIndexer) and then encode those indices as one-hot vectors (OneHotEncoder). A VectorAssembler later combines the encoded hour with the numeric features.
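
The two steps can be sketched in plain Python as follows. This is only an illustration of the idea, not Spark's implementation: the helper names `fit_string_index` and `one_hot` are hypothetical, and breaking frequency ties alphabetically is an assumption of this sketch. Every category keeps its own slot, as with `dropLast=False` in the encoder used below.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent label first."""
    counts = Counter(values)
    # Ties are broken alphabetically (assumption made for determinism)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, size):
    """Return a dense one-hot vector with a 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# Toy hour values (illustrative only)
hours = ["8", "8", "8", "13", "13", "2"]
mapping = fit_string_index(hours)
encoded = [one_hot(mapping[h], len(mapping)) for h in hours]
print(mapping)
print(encoded[0], encoded[-1])
```

The mapping is learned once and then reused, which is why the cells below fit the indexer and encoder on the training set and only call `transform` on the test set.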

### Train set

In [16]:
indexer = StringIndexer(inputCol="hour_cat", outputCol="hour_index")
indexer_fit = indexer.fit(train_df_balanced)
train_df_indexed = indexer_fit.transform(train_df_balanced)
# If the input column is numeric, it is cast to string and the string values are indexed.
# The indices are in [0, numLabels). By default they are ordered by label frequency,
# so the most frequent label gets index 0.
train_df_indexed.show(5)

+--------+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+-------+----------+
|CellName|hour_cat|PRBUsageUL|PRBUsageDL|meanThr_DL|meanThr_UL|maxThr_DL|maxThr_UL|meanUE_DL|meanUE_UL|maxUE_DL|maxUE_UL|Unusual|hour_index|
+--------+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+-------+----------+
|   6ULTE|       6|     0.202|     1.011|     0.184|     0.018|     2.86|    0.142|    1.041|     0.01|     3.0|     2.0|      0|      18.0|
|   1CLTE|      11|     21.89|     3.085|      0.68|     0.096|   13.484|    1.163|    1.303|    1.154|     5.0|     5.0|      1|       7.0|
|   6VLTE|       2|     1.095|      2.09|     0.252|     0.029|   12.338|    0.573|    1.095|     0.01|     3.0|     3.0|      1|      13.0|
|   6ULTE|       8|     0.498|     1.194|      0.25|     0.027|   17.531|    0.803|    1.075|     0.01|     3.0|     2.0|      1|       2.0|
|   4BLTE|   

In [17]:
encoder = OneHotEncoder(dropLast=False, inputCol="hour_index", outputCol="hour_vec")
encoder_fit = encoder.fit(train_df_indexed)
train_df_encoded = encoder_fit.transform(train_df_indexed)
train_df_encoded = train_df_encoded.select('CellName', 'PRBUsageUL', 'PRBUsageDL', 
                           'meanThr_DL', 'meanThr_UL', 'maxThr_DL', 'maxThr_UL', 
                           'meanUE_DL', 'meanUE_UL', 'maxUE_DL', 'maxUE_UL', 'hour_vec', 'Unusual')
train_df_encoded.show(5)

+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+---------------+-------+
|CellName|PRBUsageUL|PRBUsageDL|meanThr_DL|meanThr_UL|maxThr_DL|maxThr_UL|meanUE_DL|meanUE_UL|maxUE_DL|maxUE_UL|       hour_vec|Unusual|
+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+---------------+-------+
|   6ULTE|     0.202|     1.011|     0.184|     0.018|     2.86|    0.142|    1.041|     0.01|     3.0|     2.0|(24,[18],[1.0])|      0|
|   1CLTE|     21.89|     3.085|      0.68|     0.096|   13.484|    1.163|    1.303|    1.154|     5.0|     5.0| (24,[7],[1.0])|      1|
|   6VLTE|     1.095|      2.09|     0.252|     0.029|   12.338|    0.573|    1.095|     0.01|     3.0|     3.0|(24,[13],[1.0])|      1|
|   6ULTE|     0.498|     1.194|      0.25|     0.027|   17.531|    0.803|    1.075|     0.01|     3.0|     2.0| (24,[2],[1.0])|      1|
|   4BLTE|    22.433|     4.143|     1.31

### Test set

In [18]:
test_df_indexed = indexer_fit.transform(test_df_fmt)
test_df_indexed.show(5)
test_df_encoded = encoder_fit.transform(test_df_indexed)
test_df_encoded = test_df_encoded.select('CellName', 'PRBUsageUL', 'PRBUsageDL', 
                           'meanThr_DL', 'meanThr_UL', 'maxThr_DL', 'maxThr_UL', 
                           'meanUE_DL', 'meanUE_UL', 'maxUE_DL', 'maxUE_UL', 'hour_vec')

test_df_encoded.show(5)

+--------+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+----------+
|CellName|hour_cat|PRBUsageUL|PRBUsageDL|meanThr_DL|meanThr_UL|maxThr_DL|maxThr_UL|meanUE_DL|meanUE_UL|maxUE_DL|maxUE_UL|hour_index|
+--------+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+----------+
|   6ALTE|       3|     3.781|     1.493|     0.575|     0.042|   22.659|    0.743|    0.985|     0.01|     3.0|     2.0|       0.0|
|   6ULTE|      20|     2.021|     3.335|     0.569|     0.075|   29.265|    1.049|    1.314|     0.01|     6.0|     3.0|      12.0|
|   2ALTE|      11|     0.505|     0.404|     0.014|      0.01|    0.227|    0.097|    1.011|     0.01|     2.0|     1.0|       7.0|
|   3CLTE|       6|     1.011|     0.505|     0.238|     0.021|   20.962|    0.609|    1.011|    1.011|     2.0|     1.0|      18.0|
|   6CLTE|      15|     3.881|     0.498|     0.076|     0.041|    3.

## Standardize data
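
The cells in this section use `MinMaxScaler`, which rescales each feature to a common range rather than standardizing it to zero mean and unit variance. A minimal plain-Python sketch of that rescaling to [0, 1] is shown here; the helper names and the toy rows are hypothetical, and mapping constant columns to 0.5 is a choice made for this sketch.

```python
def fit_min_max(rows):
    """Learn the (min, max) of every column from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def transform_min_max(rows, bounds):
    """Rescale each value to [0, 1] using previously learned bounds."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.5  # constant column: midpoint
            for x, (lo, hi) in zip(row, bounds)
        ])
    return scaled

# Toy data (illustrative only)
train_rows = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test_rows = [[2.5, 40.0]]

bounds = fit_min_max(train_rows)
print(transform_min_max(train_rows, bounds))
print(transform_min_max(test_rows, bounds))
```

Because the bounds come from the training rows only, a test value can land outside [0, 1], as the second column of the test row does here. The same happens when a scaler fitted on the training set is applied to the test set.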

### Train set

In [19]:
scalable_vars = ['PRBUsageUL', 'PRBUsageDL', 'meanThr_DL', 
                 'meanThr_UL', 'maxThr_DL', 'maxThr_UL', 
                 'meanUE_DL', 'meanUE_UL', 'maxUE_DL','maxUE_UL'] + ['hour_vec']

vec_assembler = VectorAssembler(inputCols=scalable_vars, outputCol='vars_vectorized')
df_train_assembled = vec_assembler.transform(train_df_encoded)
scaler = MinMaxScaler(inputCol=vec_assembler.getOutputCol(), outputCol="features")
scaler_fit = scaler.fit(df_train_assembled)
scaled_df_train = scaler_fit.transform(df_train_assembled)
scaled_df_train = scaled_df_train.select('CellName', 'PRBUsageUL', 'PRBUsageDL', 
                           'meanThr_DL', 'meanThr_UL', 'maxThr_DL', 'maxThr_UL', 
                           'meanUE_DL', 'meanUE_UL', 'maxUE_DL', 'maxUE_UL', 'features', 'Unusual')

In [20]:
scaled_df_train.show(10)

+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+--------------------+-------+
|CellName|PRBUsageUL|PRBUsageDL|meanThr_DL|meanThr_UL|maxThr_DL|maxThr_UL|meanUE_DL|meanUE_UL|maxUE_DL|maxUE_UL|            features|Unusual|
+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+--------------------+-------+
|   6ULTE|     0.202|     1.011|     0.184|     0.018|     2.86|    0.142|    1.041|     0.01|     3.0|     2.0|(34,[0,1,2,3,4,5,...|      0|
|   1CLTE|     21.89|     3.085|      0.68|     0.096|   13.484|    1.163|    1.303|    1.154|     5.0|     5.0|(34,[0,1,2,3,4,5,...|      1|
|   6VLTE|     1.095|      2.09|     0.252|     0.029|   12.338|    0.573|    1.095|     0.01|     3.0|     3.0|(34,[0,1,2,3,4,5,...|      1|
|   6ULTE|     0.498|     1.194|      0.25|     0.027|   17.531|    0.803|    1.075|     0.01|     3.0|     2.0|(34,[0,1,2,3,4,5,...|      1|
|   4B

### Test set

In [21]:
df_test_assembled = vec_assembler.transform(test_df_encoded)
scaled_df_test = scaler_fit.transform(df_test_assembled)
scaled_df_test = scaled_df_test.select('CellName', 'PRBUsageUL', 'PRBUsageDL', 
                           'meanThr_DL', 'meanThr_UL', 'maxThr_DL', 'maxThr_UL', 
                           'meanUE_DL', 'meanUE_UL', 'maxUE_DL', 'maxUE_UL', 'features')

scaled_df_test.show(5)

+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+--------------------+
|CellName|PRBUsageUL|PRBUsageDL|meanThr_DL|meanThr_UL|maxThr_DL|maxThr_UL|meanUE_DL|meanUE_UL|maxUE_DL|maxUE_UL|            features|
+--------+----------+----------+----------+----------+---------+---------+---------+---------+--------+--------+--------------------+
|   6ALTE|     3.781|     1.493|     0.575|     0.042|   22.659|    0.743|    0.985|     0.01|     3.0|     2.0|(34,[0,1,2,3,4,5,...|
|   6ULTE|     2.021|     3.335|     0.569|     0.075|   29.265|    1.049|    1.314|     0.01|     6.0|     3.0|(34,[0,1,2,3,4,5,...|
|   2ALTE|     0.505|     0.404|     0.014|      0.01|    0.227|    0.097|    1.011|     0.01|     2.0|     1.0|(34,[0,1,2,3,4,5,...|
|   3CLTE|     1.011|     0.505|     0.238|     0.021|   20.962|    0.609|    1.011|    1.011|     2.0|     1.0|(34,[0,1,2,3,4,5,...|
|   6CLTE|     3.881|     0.498|     0.076|     0.041|    3.93

## Store Features

In [22]:
train_df_feat = scaled_df_train.select("CellName", "features", "Unusual")
train_df_feat.show(10)
train_df_feat.printSchema()

+--------+--------------------+-------+
|CellName|            features|Unusual|
+--------+--------------------+-------+
|   6ULTE|(34,[0,1,2,3,4,5,...|      0|
|   1CLTE|(34,[0,1,2,3,4,5,...|      1|
|   6VLTE|(34,[0,1,2,3,4,5,...|      1|
|   6ULTE|(34,[0,1,2,3,4,5,...|      1|
|   4BLTE|(34,[0,1,2,3,4,5,...|      0|
|   9ALTE|(34,[0,1,2,3,4,5,...|      0|
|   1ALTE|(34,[0,1,2,3,4,5,...|      1|
|   4ALTE|(34,[0,1,2,3,4,5,...|      0|
|   5ALTE|(34,[0,1,2,3,4,5,...|      0|
|   8BLTE|(34,[0,1,2,3,4,5,...|      1|
+--------+--------------------+-------+
only showing top 10 rows

root
 |-- CellName: string (nullable = true)
 |-- features: vector (nullable = true)
 |-- Unusual: integer (nullable = true)



In [23]:
test_df_feat = scaled_df_test.select("CellName", "features")
test_df_feat.show(10)

+--------+--------------------+
|CellName|            features|
+--------+--------------------+
|   6ALTE|(34,[0,1,2,3,4,5,...|
|   6ULTE|(34,[0,1,2,3,4,5,...|
|   2ALTE|(34,[0,1,2,3,4,5,...|
|   3CLTE|(34,[0,1,2,3,4,5,...|
|   6CLTE|(34,[0,1,2,3,4,5,...|
|   7CLTE|(34,[0,1,2,3,4,5,...|
|   1BLTE|(34,[0,1,2,3,4,5,...|
|   3CLTE|(34,[0,1,2,3,4,5,...|
|   2ALTE|(34,[1,2,3,4,5,6,...|
|   3ALTE|(34,[0,1,2,3,4,5,...|
+--------+--------------------+
only showing top 10 rows



In [25]:
# Persist the fitted scaler (this is supposed to work)
scaler_fit.write().overwrite().save(features_dir)
# train_features_path = set_path(features_dir, 'ML-MATT-CompetitionQT1920_train_features.csv')
# test_features_path = set_path(features_dir, 'ML-MATT-CompetitionQT1920_test_features.csv')
# train_df_feat.toPandas().to_csv(train_features_path, index=False, encoding='utf-8')
# test_df_feat.toPandas().to_csv(test_features_path, index=False, encoding='utf-8')

# 3. Model Training

According to the literature, tree-based models work well for this kind of problem.


## GBTClassifier

In [26]:
# Train a GBT model
gbt = GBTClassifier(featuresCol="features", labelCol='Unusual', maxIter=3, seed=888)
model = gbt.fit(train_df_feat)
predictions_train = model.transform(train_df_feat)
predictions_train.show(5)

+--------+--------------------+-------+--------------------+--------------------+----------+
|CellName|            features|Unusual|       rawPrediction|         probability|prediction|
+--------+--------------------+-------+--------------------+--------------------+----------+
|   6ULTE|(34,[0,1,2,3,4,5,...|      0|[-0.0427902648882...|[0.47861791620985...|       1.0|
|   1CLTE|(34,[0,1,2,3,4,5,...|      1|[0.61901727720455...|[0.77522171473472...|       0.0|
|   6VLTE|(34,[0,1,2,3,4,5,...|      1|[-0.0427902648882...|[0.47861791620985...|       1.0|
|   6ULTE|(34,[0,1,2,3,4,5,...|      1|[0.02053032961511...|[0.51026372281406...|       0.0|
|   4BLTE|(34,[0,1,2,3,4,5,...|      0|[0.61901727720455...|[0.77522171473472...|       0.0|
+--------+--------------------+-------+--------------------+--------------------+----------+
only showing top 5 rows
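
The `rawPrediction` and `probability` columns above appear to be linked by a logistic transform of twice the raw score. The sketch below checks that relationship against the first two rows of the output; the function name `gbt_probability` is hypothetical, and the formula is inferred from the displayed (truncated) values rather than taken from the notebook.

```python
import math

def gbt_probability(raw_score):
    """Logistic transform of twice the raw score (assumed relationship)."""
    return 1.0 / (1.0 + math.exp(-2.0 * raw_score))

# First component of rawPrediction for the first two rows shown above (truncated)
for raw in (-0.0427902648882, 0.61901727720455):
    print(raw, round(gbt_probability(raw), 6))
```

Under this reading, a sample is assigned to class 0 when the first probability exceeds 0.5, which agrees with the `prediction` column in the rows shown.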



## Evaluate the model

In [27]:
metrics(predictions_train, 'Unusual', 'prediction')

Confusion Matrix and Statistics: 

+------------------+-----+----+
|Unusual_prediction|    0|   1|
+------------------+-----+----+
|                 0|11147|2020|
|                 1| 3346|6821|
+------------------+-----+----+

True Positives: 6821
True Negatives: 11147
False Positives: 2020
False Negatives: 3346
Total: 23334 

Accuracy: 0.77
Sensitivity: 0.67
Specificity: 0.85
Precision: 0.77
Recall: 0.67
F1-score: 0.72
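
The figures above can be reproduced directly from the four confusion-matrix counts. The plain-Python sketch below does so; `binary_metrics` is a hypothetical helper (not the notebook's `metrics` function) that returns 0.0 whenever a denominator is zero.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute common binary classification metrics from raw counts."""
    def ratio(num, den):
        # Guard against empty denominators (e.g. no positive predictions)
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": recall,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Counts taken from the confusion matrix printed above
scores = binary_metrics(tp=6821, tn=11147, fp=2020, fn=3346)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Note that these scores are measured on the balanced training set, so they describe fit rather than generalization.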


In [28]:
predictions_test = model.transform(test_df_feat)

In [29]:
predictions_test.show(5)

+--------+--------------------+--------------------+--------------------+----------+
|CellName|            features|       rawPrediction|         probability|prediction|
+--------+--------------------+--------------------+--------------------+----------+
|   6ALTE|(34,[0,1,2,3,4,5,...|[-0.2050628010309...|[0.39888200451582...|       1.0|
|   6ULTE|(34,[0,1,2,3,4,5,...|[0.65522141669415...|[0.78758724849262...|       0.0|
|   2ALTE|(34,[0,1,2,3,4,5,...|[1.08604356545617...|[0.89771477223903...|       0.0|
|   3CLTE|(34,[0,1,2,3,4,5,...|[1.16036086748836...|[0.91057872543355...|       0.0|
|   6CLTE|(34,[0,1,2,3,4,5,...|[-0.3375221304381...|[0.33736826819910...|       1.0|
+--------+--------------------+--------------------+--------------------+----------+
only showing top 5 rows



# 4. Model Tuning
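
A parameter grid is the Cartesian product of the candidate values for each hyperparameter, and cross-validation fits one model per combination and fold. The plain-Python sketch below expands the same values used in the grid that follows; the helper `build_grid` is hypothetical, and the fold count of 3 is an assumption for illustration since the cell below does not set it explicitly.

```python
from itertools import product

def build_grid(param_values):
    """Expand {name: [candidates]} into a list of {name: value} combinations."""
    names = list(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

grid_sketch = build_grid({"maxBins": [6], "maxDepth": [2, 4], "maxIter": [4]})

num_folds = 3  # assumed fold count, for illustration only
n_fits = len(grid_sketch) * num_folds

print(grid_sketch)
print("model fits during cross-validation:", n_fits)
```

With a single candidate for `maxBins` and `maxIter`, only `maxDepth` actually varies, so the search compares two configurations.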

In [None]:
#Build Grid
grid = ParamGridBuilder() \
        .addGrid(gbt.maxBins, [6]) \
        .addGrid(gbt.maxDepth, [2,4]) \
        .addGrid(gbt.maxIter, [4]).build()

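#Define evaluator
# Score on the raw model output rather than the hard 0/1 predictions,
# so the ROC curve is built from more than a single threshold
evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', 
                                          labelCol='Unusual', 
                                          metricName='areaUnderROC')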

cv = CrossValidator(estimator=gbt, 
                    estimatorParamMaps=grid, 
                    evaluator=evaluator, 
                    parallelism=4, 
                    seed=888)

cvModel = cv.fit(train_df_feat)

In [None]:
cv_predictions_train = cvModel.transform(train_df_feat)
cv_predictions_train.show(5)

In [None]:
# Evaluate the cross-validated model, not the baseline predictions
metrics(cv_predictions_train, 'Unusual', 'prediction')

# 5. Store the model

In [30]:
model.write().overwrite().save(path=models_dir)

# Conclusion

We now have a trained model. The next step is to build the pipeline for deployment.