# Included in this notebook is some exploratory scratchwork as well as notes on the reference work and data with some ideas

I think our best bet for this project is to do the analysis that we can on this data, but for the narrative to be an exploration on the use of data and machine learning in the 5G RAN slicing problem.

With that in mind, I think it is easy for us to comment on the state of data-driven 5G RAN slicing, as we know it to be limited and for the data available to be sub-par

So, for instance we can demonstrate the lack of (good) publicly available (benchmark) data to motivate research on the topic and use this dataset to demonstrate the shortcomings, make assertions as to what would be useful in the data sense for future work

Even though this current paper is leaps and bounds better than the last works we investigate, there are still some potential issues with this approach and maybe all data driven approaches:

```
"First, there are policies on both raw and augmented/structured data, which dictate how to share data between stakeholders (i.e., slice owners and NOs) via filters that secure privacy restrictions such as for NOs which do not want to share all/parts of their monitoring information and setup/configuration logs with other NOs or even the slices that they host."
```

Maybe I am wrong. I'm not a security expert, but it seems questionable or at least a problem that needs to be analyzed because data anonymization is itself a difficult problem. So, for a workflow like this, the questions 'how to share data between stakeholders while guaranteeing privacy' is worth looking into (not for us).

In [12]:
import os, glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, mean_squared_error
from sklearn.utils import shuffle

In [65]:
from sklearn.ensemble import GradientBoostingRegressor
import json

In [2]:
os.listdir("data/02-PreprocessedDatasets/")

['stable-middistance.csv',
 'stable-shortdistance.csv',
 'stable-longdistance.csv',
 'moving-away.csv',
 'moving-closerfarcloser.csv',
 'stable-shortistance2.csv']

Six different files refer to different scenarios

Phase 1; Regression model for wbCQI
- “predicting” wbCQI (why "predicting"?)

Phase 2; 5G monitoring & History Data
- using elasticmon software


In [3]:
path = 'data/02-PreprocessedDatasets/' + '*.csv'   
files = glob.glob(path)
data = []
for file in files:
    print(file)
    data.append(pd.read_csv(file, sep='\t', index_col=0))

data/02-PreprocessedDatasets/stable-middistance.csv
data/02-PreprocessedDatasets/stable-shortdistance.csv
data/02-PreprocessedDatasets/stable-longdistance.csv
data/02-PreprocessedDatasets/moving-away.csv
data/02-PreprocessedDatasets/moving-closerfarcloser.csv
data/02-PreprocessedDatasets/stable-shortistance2.csv


In [4]:
train_df = pd.concat(data[:-1], ignore_index=True)

In [5]:
val_df = data[-1]

In [6]:
val_df.reset_index(drop=True, inplace=True)

In [8]:
y_train, y_test = train_df['wbcqi'], val_df['wbcqi']

In [9]:
X_train, X_test = train_df.drop(columns=['wbcqi']), val_df.drop(columns=['wbcqi'])

In [27]:
clf = GradientBoostingRegressor(n_estimators=1000, criterion='mse')

In [32]:
clf.fit(X_train, y_train.values.reshape(-1,1))

  return f(**kwargs)


GradientBoostingRegressor(criterion='mse', n_estimators=1000)

In [34]:
preds = clf.predict(X_test)

In [35]:
mean_squared_error(y_test, y_pred, squared=False)

0.48518188327060874

In [37]:
np.unique(y_test.values)

array([15])

In [41]:
np.unique(val_df['wbcqi'].values)

array([15])

Ok so shortistdistance2.csv, i have no idea what it is. I thought it was maybe the validation set for the "shordistance" movement category

Either they did not include the validation set when they posted this data, or they are including samples in both the training and validation sets. obviously that would be a rookie mistake, but the way this paper is written suggests to me they aren't real strong data people

``` 
"we tried to keep a 90% to 10% ratio between training and validation data set sizes"
```
note the word sizes here. there's no language which suggests the data is actually split between these scenarios

## Ok one question sort of answered: shortisdistance2.csv is not helpful

```
Feature exclusion: We excluded MAC statistic metrics like “mcs1Dl”, which are directly calculated based on wbCQI, as the purpose of the envisioned regression model is to predict the dependent wbCQI out of independent metric values.
```

In [67]:
drop_cols = []
for item in train_df.columns.values:
    if 'macStats' in item:
        drop_cols.append(item)

In [71]:
train_df.drop(columns=drop_cols, inplace=True)

Ok now we'll see if this changes the results of what Fatemeh did \
   -not an exact copy of what she did. I'm dropping shortisdistance2.csv \
   -reindexing to drop repeat indiices \
   -dropped macStats features 

In [76]:
def knn_classification(X_train, X_test, y_train, y_test ):
#    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

    knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)

    print(confusion_matrix(y_test, y_pred))
    print("accuracy:", accuracy_score(y_test, y_pred))
    print("precision: ", precision_score(y_test, y_pred, pos_label=1, average='micro'))
    print("recall: ", recall_score(y_test, y_pred, pos_label=1, average='micro'))
    print("rmse: ", mean_squared_error(y_test, y_pred, squared=False))
    return y_pred

In [73]:
def LSVC_(data, label, n_features):
    """returns the features selected based on linearSVC,
    data. label should be numpy array"""
    lsvc = LinearSVC(C=1, penalty="l1", dual=False).fit(data, label)
    coef = np.squeeze(np.sum(np.square(np.array(lsvc.coef_)), axis=0))
    coefidx = np.argsort(coef)
    fidx = coefidx[-n_features:]
    return fidx

In [74]:
data = train_df.drop(columns=['wbcqi'])
label = train_df['wbcqi']

In [77]:
data, label = shuffle(data, label, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.25, random_state=1)
selected_indx = LSVC_(X_train, y_train, 10)
X_train = X_train[X_train.columns[list(selected_indx)]]
X_test = X_test[X_test.columns[list(selected_indx)]]
y_pred = knn_classification(X_train, X_test, y_train, y_test) 



[[   0    0    0    0    0    0    0    4    0    0    0    0    0    0
     2]
 [   0    0    0    1    0    0    0    0    0    0    0    0    0    0
     0]
 [   0    0   88    2    0    1    3    0    0    0    0    0    0    0
     1]
 [   0    0    4   20    8    2    5    0    2    2    0    0    0    0
     0]
 [   0    0    0    4   62   73   12    2    0    1    0    0    0    0
     0]
 [   0    0    1    2   15  177   72   14    2    0    0    0    0    0
     0]
 [   0    0    0    4    8   87  318   92   22    3    0    1    0    0
     0]
 [   0    0    0    2    2   20  107  594  108   14    0    2    0    0
     1]
 [   0    0    0    1    1    3   23  125  293   31    2    3    0    0
     1]
 [   0    0    0    0    0    0    5   28  109   75    9    8    3    0
     4]
 [   0    0    0    0    0    0    0    3   18   38   35   15    6    0
     8]
 [   0    0    0    4    0    1    0    3    7   19   26   84   38    3
    25]
 [   0    0    0    0    0    2    0    

## Ok so dropping those columns makes a big difference. 

In [81]:
y_test - y_pred

13296    1
518      1
21769    1
15249    0
9404     0
        ..
22480    1
17631    0
20972    0
25068   -2
753      0
Name: wbcqi, Length: 6521, dtype: int64

In [83]:
clf = GradientBoostingRegressor(n_estimators=1000, criterion='mse')
data, label = shuffle(data, label, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    label,
                                                    test_size=0.25,
                                                    random_state=1)

In [84]:
clf.fit(X_train, y_train)

GradientBoostingRegressor(criterion='mse', n_estimators=1000)

In [85]:
preds = clf.predict(X_test)

In [87]:
print("rmse: ", mean_squared_error(y_test, y_pred, squared=False))
print("accuracy:", accuracy_score(y_test, y_pred))

rmse:  5.253882536302076
accuracy: 0.29320656341051987


## ok from here I should probably do what they did in the paper and look for anomolous wbcqi values

...to be continued

In [89]:
label[label == 3].shape

(399,)

## Ideas for paper narrative:

A survey on the current state of machine learning based 5G RAN slicing 

    - Pretty easy to do since we would just have to write up a handful that don't publish      their data as well as include our investigation of these two datasets
    
Sample off the top of my head abstract

"RAN slicing is a key technology necessary for enabling and delivering 5G service. Most research in this relatively new area is on algorithm development. There is not yet a consensus problem formulation and thus no state-of-the-art for this topic. We investigate the state of machine learning based 5G RAN slicing with the aim of uncovering the needs of research in this area. We run experiments using two publicly available datasets developed for 5G RAN slicing research and conclude with specific calls for future work to fill in the gaps in this research area"

## Looking at the raw datasets. I don't think this is useful for us, we can at least trust their preprocessing

In [45]:
path = 'data/01-RawDatasets/' + '*.csv'   
files = glob.glob(path)
data = []
for file in files:
    print(file)
    data.append(pd.read_csv(file, index_col=0))

data/01-RawDatasets/movingaway.csv
data/01-RawDatasets/stablemiddistance.csv
data/01-RawDatasets/movingcloserfarcloser.csv
data/01-RawDatasets/stableshortdistance.csv
data/01-RawDatasets/stablelongdistance.csv


In [48]:
df = pd.read_csv('data/01-RawDatasets/movingaway.csv')

macStats_phr
macStats_totalBytesSdusDl
macStats_totalTbsUl
macStats_mcs1Ul
macStats_totalPduDl
macStats_totalBytesSdusUl
macStats_tbsDl
macStats_totalPrbUl
macStats_macSdusDl_sduLength
macStats_macSdusDl_lcid
macStats_prbUl
macStats_totalPduUl
macStats_mcs1Dl
macStats_mcs2Dl
macStats_prbDl
macStats_totalPrbDl
macStats_prbRetxDl
macStats_totalTbsDl


In [64]:
train_df

Unnamed: 0,rsrp,rsrq,wbcqi,macStats_phr,dlCqiReport_sfnSn,macStats_totalBytesSdusDl,macStats_totalTbsUl,macStats_mcs1Ul,macStats_totalPduDl,macStats_totalBytesSdusUl,...,pdcpStats_pktTxBytes,pdcpStats_pktRxAiat,pdcpStats_pktRxBytes,pdcpStats_pktTx,pdcpStats_pktTxW,pdcpStats_pktTxAiatW,pdcpStats_sfn,pdcpStats_pktTxAiat,rnti,quality
0,-107,-8,8,26,11909,8647,538416,10,664,537426,...,6963,232298,23909,19,0,0,232722,176673,6163,0
1,-107,-8,8,26,11829,8647,538290,10,664,537300,...,6963,232298,23909,19,0,0,232672,176673,6163,0
2,-107,-8,8,26,11749,8645,538164,10,663,537174,...,6963,232298,23909,19,0,0,232622,176673,6163,0
3,-107,-8,7,26,11589,8645,537975,10,663,536922,...,6963,232298,23909,19,0,0,232522,176673,6163,0
4,-107,-8,8,26,11349,8643,537597,10,662,536607,...,6963,232298,23909,19,0,0,232372,176673,6163,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26077,-105,-7,12,15,11187,3344,6567,10,32,1572755,...,23595,887446,56733,65,0,0,887630,887139,20457,0
26078,-105,-7,12,16,11107,3344,6504,10,32,1572629,...,23595,887446,56733,65,0,0,887580,887139,20457,0
26079,-105,-7,12,17,11027,3344,6378,10,32,1572503,...,23595,887446,56733,65,0,0,887530,887139,20457,0
26080,-105,-7,12,17,10947,3344,6252,10,32,1572440,...,23595,887446,56733,65,0,0,887480,887139,20457,0


In [56]:
json.loads(df.loc[0, "mac_stats"][1:-1])

{'agent_id': 2,
 'eNBId': 234881024,
 'ue_mac_stats': [{'rnti': 32812,
   'mac_stats': {'rnti': 32812,
    'bsr': [0, 0, 0, 0],
    'phr': 4294967295,
    'rlcReport': [{'lcId': 1,
      'txQueueSize': 0,
      'txQueueHolDelay': 0,
      'statusPduSize': 0},
     {'lcId': 2, 'txQueueSize': 0, 'txQueueHolDelay': 0, 'statusPduSize': 0},
     {'lcId': 3, 'txQueueSize': 0, 'txQueueHolDelay': 0, 'statusPduSize': 0}],
    'pendingMacCes': 0,
    'dlCqiReport': {'sfnSn': 6800,
     'csiReport': [{'servCellIndex': 0,
       'ri': 0,
       'type': 'FLCSIT_P10',
       'p10csi': {'wbCqi': 3}}]},
    'ulCqiReport': {'sfnSn': 6800,
     'cqiMeas': [{'type': 'FLUCT_SRS', 'servCellIndex': 0}],
     'pucchDbm': [{'p0PucchDbm': 0, 'servCellIndex': 0}]},
    'rrcMeasurements': {'measid': 1, 'pcellRsrp': -127, 'pcellRsrq': -13},
    'pdcpStats': {'pktTx': 35,
     'pktTxBytes': 12872,
     'pktTxSn': 14,
     'pktTxW': 0,
     'pktTxBytesW': 0,
     'pktTxAiat': 609518,
     'pktTxAiatW': 0,
     'pkt