<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Anomalies_1_One_Class_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **One-Class Classification for Imbalanced Datasets**

This method is used for anomaly detection. <br>
The model is trained on the 'normal' data.<br>
Then the trained model is used to predict if new data is an anomaly.

One-class classification techniques can be used for binary (two-class) imbalanced classification problems where the negative case (class 0) is taken as normal and the positive case (class 1) is taken as an outlier or anomaly.<br>
- Negative Case: Normal or inlier.<br>
- Positive Case: Anomaly or outlier.<br>

One-class classification has proven especially useful when the minority class lacks any structure and is predominantly composed of small disjuncts or noisy instances.

**Import libraries**

In [None]:
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM

**Create an imbalanced dataset**

In [None]:
# generate a two-feature dataset with roughly 0.1% minority samples
X, y = make_classification(n_samples=5000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
# summarize class distribution
counter = Counter(y)
print(counter)


In [None]:
# scatter plot of the samples, one color per class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

**Split the data into training and test sets**

In [None]:
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

**Use the OneClassSVM from the sklearn library**<br>
Train the model on the training set, with only the majority class used for training.

In [None]:
# define the outlier detection model
model = OneClassSVM(gamma='scale', nu=0.01)
# keep only the majority class (label 0) and fit on it
trainX = trainX[trainy == 0]
model.fit(trainX)

**Use the test set to make inferences**

In [None]:
yhat = model.predict(testX)
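
As a rough illustration of what `predict` returns, the sketch below fits a `OneClassSVM` on synthetic "normal" points and scores two new points. The data, the variable names (`normal_train`, `new_points`), and the chosen coordinates are assumptions made for this example and are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical "normal" data: 500 points drawn around the origin.
rng = np.random.RandomState(0)
normal_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

model = OneClassSVM(gamma='scale', nu=0.01)
model.fit(normal_train)

# One point inside the training cloud, one far away from it.
new_points = np.array([[0.0, 0.0], [8.0, 8.0]])

labels = model.predict(new_points)            # +1 = inlier, -1 = outlier
scores = model.decision_function(new_points)  # negative values fall outside the learned region

print(labels)
print(scores)
```

The sign of `decision_function` matches the label from `predict`, so the raw score can be used when a ranking of "how anomalous" is more useful than a hard label.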

Relabel the true test labels to match the model's output convention: the majority class (inliers) becomes '1' and the minority class (outliers) becomes '-1'.

In [None]:
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
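
The in-place relabeling depends on the order of the two assignments (swapping them would turn every label into -1). A possible alternative, sketched here on a small made-up label array, is to build the mapped labels in a single step with `numpy.where` and leave the original array untouched. The names `y_true` and `y_mapped` are illustrative only.

```python
import numpy as np

# Hypothetical labels in the dataset's convention: 0 = normal, 1 = anomaly.
y_true = np.array([0, 0, 1, 0, 1])

# Map to the OneClassSVM convention in one step: 1 = inlier, -1 = outlier.
y_mapped = np.where(y_true == 0, 1, -1)

print(y_mapped)  # [ 1  1 -1  1 -1]
print(y_true)    # the original labels are unchanged
```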

In [None]:
testX.shape

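The plot shows the true test labels and the model's predictions against the first feature. Points at -1 are outliers: either true anomalies (test set) or samples the model flagged as anomalous (predictions).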

In [None]:
pyplot.scatter(testX[:, 0], testy, label="test set")
pyplot.scatter(testX[:,0],yhat, label="predictions", s=4)
pyplot.legend()
pyplot.show()

In [None]:
score = f1_score(testy, yhat, pos_label=-1)
print('F-measure: %.3f' % score)
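
With `pos_label=-1`, the F-measure treats the outlier class as the positive class, so it reflects how well anomalies are found rather than how well the abundant normal samples are recognized. The sketch below recomputes the score by hand from precision and recall on a tiny made-up example. The helper name `f1_for_label` and the sample labels are assumptions for illustration.

```python
import numpy as np

def f1_for_label(y_true, y_pred, pos_label=-1):
    """Compute F1 for one label from true positives, false positives and false negatives."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == pos_label) & (y_pred == pos_label))
    fp = np.sum((y_true != pos_label) & (y_pred == pos_label))
    fn = np.sum((y_true == pos_label) & (y_pred != pos_label))
    if tp == 0:
        return 0.0  # no outlier was correctly identified
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: two true outliers, one found, one missed, one false alarm.
y_true = [1, 1, 1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, 1]

print('F-measure: %.3f' % f1_for_label(y_true, y_pred))  # F-measure: 0.500
```

Here precision and recall are both 0.5 (one correct detection out of two flagged, and one out of two true outliers), which gives an F-measure of 0.5.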

**Assignment**:
1. Change the size of the dataset. Make it larger and then smaller.
2. Change the number of features.
3. Change the number of clusters per class (`n_clusters_per_class`).
4. Change the class ratio of the dataset.

What do these changes do to the outlier inferences?<br>
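
One way to organize these experiments is to wrap the notebook's steps in a single function and call it with different settings. The sketch below is only a starting point: `run_experiment` and its parameter names are hypothetical, and setting `n_informative=n_features` is an assumption made so that more features and more clusters per class can be requested without further changes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM


def run_experiment(n_samples=5000, n_features=2, n_clusters_per_class=1,
                   majority_weight=0.999, nu=0.01):
    """Hypothetical helper: repeat the notebook's steps for one configuration."""
    X, y = make_classification(
        n_samples=n_samples, n_features=n_features,
        n_informative=n_features,  # assumption: every feature is informative
        n_redundant=0, n_clusters_per_class=n_clusters_per_class,
        weights=[majority_weight], flip_y=0, random_state=4)
    trainX, testX, trainy, testy = train_test_split(
        X, y, test_size=0.5, random_state=2, stratify=y)
    # fit on the majority class only
    model = OneClassSVM(gamma='scale', nu=nu)
    model.fit(trainX[trainy == 0])
    yhat = model.predict(testX)
    # map true labels to the model's convention: 1 = inlier, -1 = outlier
    y_true = np.where(testy == 0, 1, -1)
    return f1_score(y_true, yhat, pos_label=-1)


# Compare two dataset sizes; other parameters can be varied the same way.
for n in (5000, 10000):
    print(n, round(run_experiment(n_samples=n), 3))
```

Note that very small datasets or very extreme class ratios may leave fewer than two anomalies, in which case the stratified split raises an error.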