# Anomaly Detection With Clustering

Many data sets consist of vector-valued samples.
One relatively simple but very effective way of defining expected behavior for this kind of data is to cluster it, and use these clusters as a model of our expectations.
Data samples that do not fall inside, or near, any cluster are often anomalous in some way.

In this notebook we will work with
[Amazon Fine Food Reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews/)
data[1] from Kaggle.
We will generate feature vectors from this data and _cluster_ them to use as an anomaly detection model.

[1] J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013.

The Parquet file used in this notebook was created from the raw Kaggle CSV as follows:
```python
import pandas as pd

# Read the raw Kaggle CSV and keep a random sample of 50,000 reviews.
data = pd.read_csv("data/amazon-reviews.csv")
data = data.sample(50000).reset_index(drop=True)
data = data.drop(columns=["Id", "ProductId", "UserId", "ProfileName", "Time", "Summary"])
# Smoothed helpfulness score: (1 + helpful votes) / (2 + total votes).
data["hscore"] = (1 + data["HelpfulnessNumerator"]) / (2 + data["HelpfulnessDenominator"])
data = data.drop(columns=["HelpfulnessNumerator", "HelpfulnessDenominator"])
data = data.rename(columns={"Score": "score", "Text": "text"})
data = data[["score", "hscore", "text"]]
data.to_parquet("data/amazon-reviews-50K.parquet", compression="brotli")
```

In [None]:
!pip install altair vega pyarrow

In [None]:
import codecs
import random
import math
import numpy as np
import scipy
import scipy.stats
from scipy.stats import gamma, kstest
import pandas as pd
from sklearn.cluster import KMeans
import re

In [None]:
import altair as alt
from detail.altairdf import altairDF
alt.renderers.enable("notebook")

In [None]:
def filterdf(df, pred):
    return df.loc[[idx for idx in df.index if pred(df.loc[idx])]]
def showtxt(df, subset = ["text"]):
    return df.style \
             .applymap(lambda x: 'white-space:wrap', subset=subset) \
             .applymap(lambda x:'text-align:left', subset=subset)

# Loading the data

The raw food review data has been sub-sampled to 50,000 records and stored as a parquet file to reduce its footprint on disk.

We begin by loading the data.
You can see that each review comes with a score, from one to five "stars", a helpfulness score, and the review text.
In this lab we will not be using the helpfulness score.

In [None]:
reviews = pd.read_parquet("data/amazon-reviews-50K.parquet").reindex()
showtxt(reviews.head(5))

In [None]:
htmlbr = re.compile('<br />')
whitesp = re.compile('\\s+')
def cleantxt(txt):
    clean = re.sub(htmlbr, ' ', txt)
    clean = re.sub(whitesp, ' ', clean)
    clean = clean.lower()
    return clean

def hashing_frequency(vecsize, h, norm = 1.0):
    def hf(words):
        if isinstance(words, str):
            # handle both lists of words and space-delimited strings
            words = words.split(" ")
        hsig = np.zeros(vecsize, dtype=np.float32)
        for term in [w for w in words if len(w) > 0]:
            hsig[h(term) % vecsize] += 1.0
        z = np.linalg.norm(hsig) / norm
        if (z > 0.0): hsig /= z
        return hsig
    return hf
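
The hashing trick used by `hashing_frequency` can be tried on a toy input. The sketch below is a compact, standalone variant and not the notebook's own helper: the names `crc32_hash` and `hashed_frequencies` are made up for illustration, and `zlib.crc32` is only an assumed stand-in for the hash function `h`. It is chosen because Python's built-in `hash` for strings can change between interpreter runs.

```python
import zlib
import numpy as np

def crc32_hash(term):
    # Deterministic stand-in for the hash function `h` (an assumption of this sketch).
    return zlib.crc32(term.encode("utf-8"))

def hashed_frequencies(words, vecsize, h=crc32_hash):
    # Count each term into the bucket chosen by its hash,
    # then scale the count vector to unit length.
    if isinstance(words, str):
        words = words.split(" ")
    vec = np.zeros(vecsize, dtype=np.float32)
    for term in words:
        if term:
            vec[h(term) % vecsize] += 1.0
    z = np.linalg.norm(vec)
    return vec / z if z > 0.0 else vec

a = hashed_frequencies("great coffee great price", 16)
b = hashed_frequencies(["price", "great", "coffee", "great"], 16)
print(a.shape, float(np.linalg.norm(a)))
print(np.allclose(a, b))  # word order does not matter, only the counts
```

Because only bucket counts are kept, two reviews with the same words in a different order map to the same vector, and distinct words may collide in one bucket when `vecsize` is small.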

# Visualizing

As before, we would like to visualize our data.
However, in this case, our shingle-based features have hundreds of dimensions.
So we will apply Principal Component Analysis (PCA) to project our features down to 2 dimensions and observe their structure.

In [None]:
import sklearn.decomposition

def append_pca_columns(df, featcol, pcacols=["x", "y"]):
    DIMENSIONS = 2
    data = np.array(list(df[featcol]))
    pca2 = sklearn.decomposition.PCA(DIMENSIONS)
    pca = pca2.fit_transform(data)
    pca_df = pd.DataFrame(pca, columns=pcacols)
    df = df.drop(columns=pcacols, errors='ignore')
    df = pd.concat([df, pca_df], axis=1).reindex()
    return df

def pca_features(df, icol, ocol, dimensions=2):
    data = np.array(list(df[icol]))
    pca2 = sklearn.decomposition.PCA(dimensions)
    pca = pca2.fit_transform(data)
    df[ocol] = list(pca)
    return df
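
A 2-dimensional projection discards most of the information in a high-dimensional feature vector, so it can help to check how much variance the two components retain. The following is a minimal sketch on synthetic data (the random matrix is an assumption standing in for the real feature column), using the `explained_variance_ratio_` attribute of scikit-learn's `PCA`.

```python
import numpy as np
import sklearn.decomposition

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: 500 unit-length vectors in 100 dimensions.
data = rng.random((500, 100))
data /= np.linalg.norm(data, axis=1, keepdims=True)

pca2 = sklearn.decomposition.PCA(2)
projected = pca2.fit_transform(data)

# Fraction of the total variance retained by each of the two components.
retained = pca2.explained_variance_ratio_
print(projected.shape)
print(retained, retained.sum())
```

When the retained fraction is small, overlap between groups in the scatter plot says little about whether they are separated in the full feature space.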

# Anomalies with Word Features

The previous two variations on feature vectors for anomaly detection suggest that there is no one way to define what is "anomalous".
What we detect as an anomaly depends on how we define our expectations.
That in turn depends on what kind of features we collect in the first place.

With that in mind, what will happen if we replace shingles with whole words for generating hashed frequency vectors?

In the following cells we apply the scikit-learn `HashingVectorizer` to create a hashed vector of word counts.
As before, we will normalize these vectors to a length of 1 to put different review lengths on an equal footing.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

HVSIZE = 100
vectorizer = HashingVectorizer(token_pattern='(?u)\\b[A-Za-z]\\w+\\b', n_features = HVSIZE, alternate_sign=False)
hvcounts = vectorizer.fit_transform(reviews["text"].apply(cleantxt))

In [None]:
def normarray(v):
    r = v.toarray().reshape(HVSIZE)
    z = np.linalg.norm(r)
    if (z > 0.0): r /= z
    return r

feats3 = reviews.copy()
feats3["feats"] = [normarray(v) for v in hvcounts]
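
As a sanity check on the normalization step, note that `HashingVectorizer` defaults to `norm="l2"`, so with the settings used above its rows should already have unit length, and `normarray` then acts mainly as a safeguard. The sketch below uses three made-up review strings (not taken from the data set) and compares the default against `norm=None`, which returns raw counts.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Made-up example texts, for illustration only.
texts = ["great coffee at a great price",
         "the dog loved these treats",
         "arrived stale and the bag was torn"]

vec = HashingVectorizer(n_features=100, alternate_sign=False)             # default norm="l2"
raw = HashingVectorizer(n_features=100, alternate_sign=False, norm=None)  # raw counts

normed_rows = vec.transform(texts).toarray()
count_rows = raw.transform(texts).toarray()

print(np.linalg.norm(normed_rows, axis=1))  # row lengths with the default settings
print(count_rows.sum(axis=1))               # number of tokens counted per text
```

The default token pattern ignores single-character tokens, which is why the word "a" in the first text does not contribute to its count.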

# Visualization

Again we use PCA to get our feature vectors into a visualizable form.
In this two-dimensional projection there is relatively little evident structure.

In [None]:
feats3 = append_pca_columns(feats3, "feats")
alt.Chart(feats3.sample(2000)).encode(x="x", y="y", color="score").mark_point().interactive()

# Clustering Hashed Word Counts

As with hashed shingles, the word-based clusters show a lot of overlap in low-dimensional projections, but possible outliers are present.

In [None]:
%%time
data = np.array(list(feats3["feats"]))
clustering = KMeans(n_clusters=10).fit(data)

In [None]:
feats3["pred"] = clustering.predict(np.array(list(feats3["feats"])))
feats3["pstr"] = feats3["pred"].apply(str)
alt.Chart(feats3.sample(2000)).encode(x="x", y="y", color="pstr").mark_point().interactive()

# Hashed Word Anomalies

We apply our now-familiar technique of identifying reviews that are not near any cluster, ranking them by distance to their assigned cluster center.
You can see that word-based features identify different anomalies than shingle-based features, even though both are representations of the review text.

In [None]:
feats3["pdist"] = feats3.apply(lambda row: np.linalg.norm(row["feats"] - clustering.cluster_centers_[row["pred"]]), axis=1)
feats3["pdist"].sample(5)
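
The row-wise distance computation above can also be written in vectorized form. This is a sketch on synthetic 2-dimensional data with one planted outlier (all values are made up): `KMeans.transform` returns the distance from every row to every centroid, so the row-wise minimum is the distance to the nearest centroid. The `n_init` and `random_state` arguments are set here only to make the toy example repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
# Three tight synthetic clusters plus one planted outlier (index 300).
points = np.vstack([c + 0.1 * rng.standard_normal((100, 2)) for c in centers])
outlier = np.array([[2.5, -3.0]])
X = np.vstack([points, outlier])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance from each row to its nearest centroid.
pdist = km.transform(X).min(axis=1)
ranked = np.argsort(pdist)[::-1]  # most distant rows first
print(ranked[:3], pdist[ranked[:3]])
```

On a data frame with tens of thousands of rows, this avoids calling a Python lambda once per row while producing the same distances as the per-row norm against the assigned centroid.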

In [None]:
anomalies = feats3.sort_values(by=["pdist"], ascending=False)[["pdist","score","text"]].head(25)
showtxt(anomalies)

# Exercises

1. How do the anomalies detected with hashed word frequencies differ from those detected with hashed shingles?
1. Can you think of explanations for this difference?
1. Are the anomalies detected by all the different features in this notebook equally useful?