# Model Training

This notebook covers the iterative process of training multiple models to find the one best suited to our needs, and then explores ways to improve the selected model's performance.

### Import the necessary libraries

In [19]:
from deepod.models.time_series import DevNetTS, PReNetTS, DeepSADTS, DeepSVDDTS, DeepIsolationForestTS, AnomalyTransformer, COUTA, TcnED, TimesNet, TranAD, USAD
from deepod.models.tabular import DeepSAD, DeepSVDD, DeepIsolationForest, RCA, REPEN, RDP, RoSAS, GOAD, NeuTraL, ICL, SLAD, DevNet, PReNet, FeaWAD
import pandas as pd
from sklearn.model_selection import train_test_split
from deepod.metrics import ts_metrics
from deepod.metrics import point_adjustment
from deepod.metrics import tabular_metrics
import joblib
import torch

Set the variable `data_dir` to the location of the combined data file used for training.

In [20]:
data_dir = "../../data/anomalyDataset.parquet"

In [21]:
df = pd.read_parquet(data_dir)

In [22]:
df.shape

(2280812, 40)

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2280812 entries, 0 to 2280811
Data columns (total 40 columns):
 #   Column             Dtype  
---  ------             -----  
 0   tick               float64
 1   seconds            float64
 2   clockTime          float64
 3   attackerSteamID    float64
 4   zoomLevel          float64
 5   ctAlivePlayers     float64
 6   entityId           float64
 7   penetratedObjects  float64
 8   steamID            float64
 9   ping               float64
 10  endTick            float64
 11  tScore             float64
 12  ctScore            float64
 13  victimName         int64  
 14  weapon             int64  
 15  weaponClass        int64  
 16  hitGroup           int64  
 17  mapName            int64  
 18  lastPlaceName      int64  
 19  ctTeam             int64  
 20  winningSide        int64  
 21  roundEndReason     int64  
 22  playerName         int64  
 23  attackerStrafe     int64  
 24  isSuicide          int64  
 25  isHeadshot        

In [24]:
df.describe()

We select a subset of the data so that we can quickly iterate over all available models and compare their performance without tuning hyperparameters. These results set the baseline for later improvement.

In [None]:
sub_df = df.iloc[:100000]

In [None]:
sub_df["label"].value_counts()

*Note: Our data has a temporal dependence, i.e., the events occur in a meaningful order. Hence, we cannot shuffle the data points and must pick a contiguous block of data as our subset.*

In [None]:
X = sub_df.drop(["label"], axis=1).values
y = sub_df["label"].values

In [None]:
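# train_test_split shuffles by default; disable it to preserve the temporal order noted above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)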
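An order-preserving hold-out split can also be written by hand, which makes the "no shuffling" requirement explicit. The following is a minimal sketch, assuming the rows are already sorted chronologically; `chronological_split` is a hypothetical helper introduced here for illustration and is not part of scikit-learn or DeepOD.

```python
def chronological_split(X, y, test_size=0.2):
    """Split X and y into train/test parts without shuffling.

    Assumes the rows are already in chronological order: the earliest
    rows go to the training set and the latest rows to the test set.
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    n_test = int(round(len(X) * test_size))
    split_at = len(X) - n_test
    return X[:split_at], X[split_at:], y[:split_at], y[split_at:]


# Toy usage with plain lists; NumPy arrays can be sliced the same way.
X_demo = [[t, t * 2] for t in range(10)]
y_demo = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo, test_size=0.2)
print(len(X_tr), len(X_te))
```

Because the test set is always the most recent block, every evaluation point lies after every training point in time.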

Set the device to GPU, if available.

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available.")
else:
    device = torch.device("cpu")
    print("GPU is not available.")

# Unsupervised models


## Tabular models
***
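Each model cell in this notebook repeats the same fit, score, evaluate pattern. As a sketch of how that pattern could be collected into one loop, the hypothetical helper `run_baselines` below takes a mapping from model names to zero-argument factories. It assumes only that each model exposes `fit(X, y=...)` and `decision_function(X)`, as the cells in this notebook do. The toy model and toy metric are stand-ins so that the example runs without DeepOD installed.

```python
def run_baselines(model_factories, X_train, X_test, y_test, metric_fn, y_train=None):
    """Fit each model, score the test set, and collect its metrics.

    model_factories: dict mapping a display name to a zero-argument callable
        that returns an unfitted model with fit(X, y=...) and decision_function(X).
    metric_fn: callable (y_true, scores) -> tuple of metric values.
    y_train: labels for weakly-supervised models, or None for unsupervised ones.
    """
    results = {}
    for name, make_model in model_factories.items():
        model = make_model()
        model.fit(X_train, y=y_train)
        scores = model.decision_function(X_test)
        results[name] = metric_fn(y_test, scores)
    return results


class MeanDistanceModel:
    """Toy stand-in for a detector: score = distance from the training mean."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def decision_function(self, X):
        return [abs(x - self.mean_) for x in X]


def top1_hit(y_true, scores):
    """Toy metric: 1.0 if the highest-scoring point is a labelled anomaly."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return (float(y_true[best] == 1),)


demo = run_baselines({"toy": MeanDistanceModel}, [1.0, 2.0, 3.0], [2.0, 9.0], [0, 1], top1_hit)
print(demo)
```

With the notebook's own objects, the call might look like `run_baselines({"DeepSVDD": lambda: DeepSVDD(lr=0.0001, device=device)}, X_train, X_test, y_test, tabular_metrics)`. The individual cells are still useful when a single model needs its own settings.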

In [None]:
model = DeepSVDD(lr=0.0001, device=device)
model.fit(X_train, y=None)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for DeepSVDD:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = REPEN(lr=0.0001)
model.fit(X_train, y=None)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for REPEN:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = RCA(lr=0.0001)
model.fit(X_train, y=None)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for RCA:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = RDP(lr=0.0001)
model.fit(X_train, y=None)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for RDP:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = GOAD(lr=0.0001)
model.fit(X_train, y=None)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for GOAD:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = ICL(lr=0.0001)
model.fit(X_train, y=None)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for ICL:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = NeuTraL(lr=0.0001)
model.fit(X_train, y=None)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for NeuTraL:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = SLAD(lr=0.0001)
model.fit(X_train, y=None)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for SLAD:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = DeepIsolationForest(lr=0.0001)
model.fit(X_train, y=None)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for DeepIsolationForest:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

## Time-series models
***

In [None]:
model = DeepSVDDTS(device=device, network='LSTM')
model.fit(X_train, y=None)

scores = model.decision_function(X_test)
anomalies = X_test[scores>0.5]

eval_metrics = ts_metrics(y_test, scores)
adj_eval_metrics = ts_metrics(y_test, point_adjustment(y_test, scores))
print("Results for DeepSVDDTS:\n",
      f"auc: {adj_eval_metrics[0]:.2f}, average precision: {adj_eval_metrics[1]:.2f}, f1: {adj_eval_metrics[2]:.2f}, precision: {adj_eval_metrics[3]:.2f}, recall: {adj_eval_metrics[4]:.2f}")

In [None]:
model = AnomalyTransformer()
model.fit(X_train, y=None)

scores = model.decision_function(X_test)
anomalies = X_test[scores>0.5]

auc, ap, f1, p, r = ts_metrics(y_test, scores)
print("Results for AnomalyTransformer:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}, precision: {p:.2f}, recall: {r:.2f}")

In [None]:
model = COUTA()
model.fit(X_train, y=None)

scores = model.decision_function(X_test)
anomalies = X_test[scores>0.5]

auc, ap, f1, p, r = ts_metrics(y_test, scores)
print("Results for COUTA:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}, precision: {p:.2f}, recall: {r:.2f}")

In [None]:
model = TcnED()
model.fit(X_train, y=None)

scores = model.decision_function(X_test)
anomalies = X_test[scores>0.5]

auc, ap, f1, p, r = ts_metrics(y_test, scores)
print("Results for TcnED:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}, precision: {p:.2f}, recall: {r:.2f}")

In [None]:
model = TimesNet()
model.fit(X_train, y=None)

scores = model.decision_function(X_test)
anomalies = X_test[scores>0.5]

auc, ap, f1, p, r = ts_metrics(y_test, scores)
print("Results for TimesNet:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}, precision: {p:.2f}, recall: {r:.2f}")

In [None]:
model = TranAD()
model.fit(X_train, y=None)

scores = model.decision_function(X_test)
anomalies = X_test[scores>0.5]

auc, ap, f1, p, r = ts_metrics(y_test, scores)
print("Results for TranAD:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}, precision: {p:.2f}, recall: {r:.2f}")

In [None]:
model = DeepIsolationForestTS()
model.fit(X_train, y=None)

scores = model.decision_function(X_test)
anomalies = X_test[scores>0.5]

auc, ap, f1, p, r = ts_metrics(y_test, scores)
print("Results for DeepIsolationForestTS:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}, precision: {p:.2f}, recall: {r:.2f}")

In [None]:
model = USAD()
model.fit(X_train, y=None)

scores = model.decision_function(X_test)
anomalies = X_test[scores>0.5]

auc, ap, f1, p, r = ts_metrics(y_test, scores)
print("Results for USAD:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}, precision: {p:.2f}, recall: {r:.2f}")

# Weakly-supervised models

## Tabular models
***

In [None]:
model = DevNet(lr=0.0001)
model.fit(X_train, y=y_train)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for DevNet:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = PReNet(lr=0.0001)
model.fit(X_train, y=y_train)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for PReNet:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = DeepSAD(lr=0.001)
model.fit(X_train, y=y_train)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for DeepSAD:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = FeaWAD(lr=0.0001)
model.fit(X_train, y=y_train)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for FeaWAD:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

In [None]:
model = RoSAS(lr=0.0001)
model.fit(X_train, y=y_train)
scores = model.decision_function(X_test)

auc, ap, f1 = tabular_metrics(y_test, scores)
print("Results for RoSAS:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}")

## Time-series models
***

In [None]:
model = DevNetTS(seq_len=50)
print("X:", X_train.shape, "y:", y_train.shape)
model.fit(X_train, y_train)

scores = model.decision_function(X_test)
anomalies = X_test[scores>0.5]

auc, ap, f1, p, r = ts_metrics(y_test, scores)
print("Results for DevNetTS:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}, precision: {p:.2f}, recall: {r:.2f}")

In [None]:
model = DeepSADTS(batch_size=100, lr=0.001, rep_dim=128, hidden_dims='100,50', act='ReLU', bias=False, epoch_steps=-1)
model.fit(X_train, y_train)

scores = model.decision_function(X_test)
anomalies = X_test[scores>0.5]

auc, ap, f1, p, r = ts_metrics(y_test, scores)
print("Results for DeepSADTS:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}, precision: {p:.2f}, recall: {r:.2f}")

In [None]:
model = PReNetTS(batch_size=100, lr=0.001, rep_dim=128, hidden_dims='100,50', act='ReLU', bias=False, epoch_steps=-1)
model.fit(X_train, y_train)

scores = model.decision_function(X_test)
anomalies = X_test[scores>0.5]

auc, ap, f1, p, r = ts_metrics(y_test, scores)
print("Results for PReNetTS:\n",
      f"auc: {auc:.2f}, average precision: {ap:.2f}, f1: {f1:.2f}, precision: {p:.2f}, recall: {r:.2f}")

# Results

Even though our data is a time series, i.e., events can only happen in a certain order, we observe that the TS models perform especially poorly on our dataset. This may be because a single anomaly within a 'match' unit is not enough to mark a data point as anomalous. It is a series of multiple anomalies that lets us classify with high confidence whether an individual is exploiting the game's system.

However, such a series does not follow an established temporal pattern. The occurrence of anomalous events does not depend directly or completely on the timestamp, although the timestamp is a good feature for tracking these events. For example, a cheater may use cheats only every other round, or in a random pattern, to avoid being too obvious. The goal is for the model to identify the individual(s) within a unit who exhibit anomalous behaviour (such as an extremely high number of kills, instantly spotting or snapping to enemy hitboxes, or unusual pitch and yaw changes during movement and kills).
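One way to read this goal is that per-event anomaly scores should be aggregated per individual before a decision is made, so that a single odd event does not flag a player but a run of them does. The following is a minimal sketch of that idea, not the method used in this notebook. The helpers `mean_score_per_player` and `flag_players`, the choice of the mean as the aggregate, and the threshold value are all assumptions made for illustration.

```python
from collections import defaultdict


def mean_score_per_player(player_ids, scores):
    """Average the per-event anomaly scores for each player."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for pid, score in zip(player_ids, scores):
        totals[pid] += score
        counts[pid] += 1
    return {pid: totals[pid] / counts[pid] for pid in totals}


def flag_players(player_ids, scores, threshold):
    """Return the sorted IDs of players whose mean score exceeds the threshold."""
    means = mean_score_per_player(player_ids, scores)
    return sorted(pid for pid, mean in means.items() if mean > threshold)


# Toy data: player "a" has consistently high scores, while "b" and "c" do not.
demo_ids = ["a", "b", "a", "c", "b", "a"]
demo_scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7]
flagged = flag_players(demo_ids, demo_scores, threshold=0.5)
print(flagged)
```

In practice the IDs could come from a column such as `steamID`, and the scores from a model's `decision_function`. A different aggregate, such as a high quantile, may suit cheaters who only cheat in some rounds.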