# This notebook demonstrates drift detection for text data with Evidently, using IMDB movie reviews and Amazon Echo Dot reviews.

#### The first part of the notebook uses GloVe vectors; the later part shows how sentence transformers can be used instead.

#### With GloVe, each record is represented as a sentence vector by averaging the glove.6B 50-dimensional word vectors of its words, for both the IMDB and Echo Dot review datasets. If a word has no vector representation, a 50-dimensional zero vector is used in its place.

#### With sentence transformers, we use the all-MiniLM-L6-v2 model to get the embeddings, which are 384-dimensional vectors.
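
The GloVe averaging scheme described above can be illustrated with a minimal sketch. The 3-dimensional `toy_vectors` dictionary and the `average_vector` helper are hypothetical names made up for this illustration; they stand in for the real 50-dimensional GloVe vectors loaded later in the notebook.

```python
# Toy 3-dimensional "word vectors" (assumption: the real GloVe 6B vectors have 50 dimensions).
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}

def average_vector(text, vectors, dim):
    """Average the vectors of all whitespace-separated tokens in `text`.

    Tokens missing from `vectors` contribute a zero vector, so they still
    count towards the denominator.
    """
    tokens = text.split()
    total = [0.0] * dim
    for token in tokens:
        vec = vectors.get(token, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    # Guard against empty strings to avoid division by zero.
    return [t / max(len(tokens), 1) for t in total]

# "xyzzy" has no vector, so it adds zeros but still counts as one of three tokens.
sentence_vec = average_vector("good movie xyzzy", toy_vectors, dim=3)
print(sentence_vec)
```

Because unknown words add zeros while still being counted, records with many out-of-vocabulary words end up with vectors that shrink towards the origin.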

# Install and import the prerequisites

In [None]:
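# Note: this notebook uses the ColumnMapping / evidently.report API; newer Evidently
# releases may have changed it, in which case an older version may need to be pinned.
try:
    import evidently
except ImportError:
    get_ipython().system('pip install git+https://github.com/evidentlyai/evidently.git')

# Install sentence transformers independently, only if they are missing
try:
    import sentence_transformers
except ImportError:
    get_ipython().system('pip install sentence-transformers')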

In [None]:
import pandas as pd
import numpy as np

from sklearn import datasets

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import EmbeddingsDriftMetric
from sklearn.model_selection import train_test_split
from evidently.metrics.data_drift.embedding_drift_methods import model, distance, ratio, mmd

# Load the cleaned IMDB movie reviews dataset and the Amazon Echo Dot reviews dataset

The datasets used in this notebook come from the repository below:

https://github.com/SangamSwadiK/nlp_example_datasets

If you do not want to load the GloVe vectors and would rather use preprocessed vectors to view the drift, go to https://github.com/SangamSwadiK/nlp_example_datasets and either load the files through their raw GitHub URLs or download them and load them locally.

#### See the Kaggle resources below for the pre-processing steps and for access to the original datasets:
1) https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews

2) https://www.kaggle.com/datasets/PromptCloudHQ/amazon-echo-dot-2-reviews-dataset

In [None]:
# IMDB reviews dataset
imdb_5k_data = pd.read_csv("https://raw.githubusercontent.com/SangamSwadiK/test_dataset/main/cleaned_imdb_data_5000_rows.csv")
imdb_5k_data.head()

In [None]:
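# This data is out-of-domain: it has no relationship with movie reviews. You can use the Echo Dot reviews as the current (test) set to better understand how drift detection behaves.
# `squeeze=True` was removed from read_csv in pandas 2.0, so the single-column frame is squeezed into a Series explicitly.
eco_dot_data = pd.read_csv("https://raw.githubusercontent.com/SangamSwadiK/test_dataset/main/eco_data.csv").squeeze("columns")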
eco_dot_data.head()

In [None]:
## Download the GloVe vectors. Run this to experiment with different ways of building record embeddings (averaging the word vectors, summing them, etc.)
!wget http://nlp.stanford.edu/data/glove.6B.zip -P ./
!unzip  glove.6B.zip -d ./

In [None]:
# Load the GloVe vectors from the vector file
def load_glove_model(File):
  """ Loads the keyed vectors from a given text file
  Args:
    File: path to the text file which contains the vectors
  Returns:
    Dictionary: map containing the word:vector pairs
  """
  glove_model = {}
  # The GloVe files are UTF-8 encoded; be explicit so loading does not depend on the platform default
  with open(File, 'r', encoding='utf-8') as f:
    for line in f:
      # Each line is a word followed by its vector components
      split_line = line.split()
      word = split_line[0]
      embedding = np.array(split_line[1:], dtype=np.float64)
      glove_model[word] = embedding
  return glove_model

In [None]:
# Load the 50-dimensional vectors here
glove_vec = load_glove_model("glove.6B.50d.txt")

# Train-test split the IMDB dataset for input data drift comparison

In [None]:
## Perform train test split on imdb data
train_df, test_df, y_train, y_test = train_test_split(imdb_5k_data.review, imdb_5k_data.sentiment, test_size=0.50, random_state=42)

# Convert train and test records into embedding vectors

In [None]:
def get_sentence_vector(dataframe):
  """Get a sentence vector for each text record by averaging the GloVe vectors of its words
  Args:
    dataframe: the pandas Series containing the text records
  Returns:
    list: a list of 50-dimensional sentence vectors, one per record
  """
  tmp_arr = []
  for row in dataframe.values:
    # Split the record into words; iterating over the raw string would yield single characters
    words = row.split()
    tmp = np.zeros(50,)
    for word in words:
      # Words without a GloVe vector contribute a zero vector
      tmp += glove_vec.get(word, np.zeros(50,))
    # Average over the word count; guard against empty records
    tmp = tmp / max(len(words), 1)
    tmp_arr.append(tmp.tolist())

  return tmp_arr

In [None]:
train_matrix =  get_sentence_vector(train_df)

In [None]:
train_df_converted = pd.DataFrame(np.array(train_matrix), index = train_df.index)
train_df_converted.columns = ["col_"+ str(i) for i in range(train_df_converted.shape[1])]
train_df_converted.head()

In [None]:
## Get the sentence vectors for test dataframe
test_matrix =  get_sentence_vector(test_df)

In [None]:
test_df_converted = pd.DataFrame(np.array(test_matrix), index = test_df.index)
test_df_converted.columns = ["col_"+ str(i) for i in range(test_df_converted.shape[1])]

test_df_converted.head()

In [None]:
# Get the sentence vectors for the Echo Dot reviews
eco_dot_matrix = get_sentence_vector(eco_dot_data)

In [None]:
ecodot_review_df_converted = pd.DataFrame(np.array(eco_dot_matrix), index = eco_dot_data.index)
ecodot_review_df_converted.columns = ["col_"+ str(i) for i in range(ecodot_review_df_converted.shape[1])]

ecodot_review_df_converted.head()

# Embeddings Drift Report

Here we take a small subset of the columns (the first 10 embedding dimensions) and calculate the drift on it.
To understand more about how it is calculated, check out the link below:

https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embeddings_drift.py#L27

In [None]:
column_mapping = ColumnMapping(
    embeddings={'small_subset': train_df_converted.columns[:10]}
)

In [None]:
# Here we measure drift on the small subset between train and test imdb records
report = Report(metrics=[
    EmbeddingsDriftMetric('small_subset')
])

report.run(reference_data = train_df_converted[:500], current_data = test_df_converted[500:1000], 
           column_mapping = column_mapping)
report

# Embeddings Drift Detection: model

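This approach trains an SGD classifier to distinguish reference embeddings from current embeddings and calculates its ROC AUC score, which serves as the drift score: the better the classifier separates the two datasets, the stronger the drift, and values above the threshold mean data drift. If pca_components is set, dimensionality reduction with PCA is performed first. If bootstrap is enabled, the ROC AUC is bootstrapped and the drift result is based on quantile_probability.

Check out the link below to understand more about how it is calculated: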

https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L99
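
To make the idea concrete, here is a minimal, dependency-free sketch of classifier-based drift detection. It is not Evidently's implementation: it swaps in a hand-written logistic regression trained with plain SGD, and `roc_auc`, `domain_classifier_auc`, and `sample` are hypothetical helpers written for this illustration on synthetic Gaussian data.

```python
import math
import random

def roc_auc(labels, scores):
    """ROC AUC: probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def domain_classifier_auc(reference, current, epochs=20, lr=0.1, seed=0):
    """Train logistic regression with SGD to tell reference (0) from current (1)."""
    rng = random.Random(seed)
    data = [(x, 0) for x in reference] + [(x, 1) for x in current]
    rng.shuffle(data)
    half = len(data) // 2
    train, held_out = data[:half], data[half:]
    weights, bias = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        rng.shuffle(train)
        for x, y in train:
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
            error = 1.0 / (1.0 + math.exp(-z)) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    # Score the held-out half and measure how well the two datasets separate.
    scores = [bias + sum(w * xi for w, xi in zip(weights, x)) for x, _ in held_out]
    return roc_auc([y for _, y in held_out], scores)

data_rng = random.Random(42)

def sample(mean, n=100, dim=5):
    """Draw n synthetic 'embeddings' from an isotropic Gaussian around `mean`."""
    return [[data_rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]

reference = sample(0.0)
auc_same = domain_classifier_auc(reference, sample(0.0))
auc_shifted = domain_classifier_auc(reference, sample(3.0))
print(f"same distribution AUC: {auc_same:.2f}, shifted AUC: {auc_shifted:.2f}")
print("drift detected on shifted data:", auc_shifted > 0.55)
```

An AUC close to 0.5 means the classifier cannot tell the datasets apart, while an AUC close to 1.0 means they are easy to separate. With small samples the score for undrifted data fluctuates around 0.5, so a fixed threshold such as 0.55 can occasionally be exceeded by chance.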

In [None]:
report = Report(metrics = [
    EmbeddingsDriftMetric('small_subset', 
                          drift_method = model(
                              threshold = 0.55,
                              bootstrap = None,
                              quantile_probability = 0.95,
                              pca_components = None,
                          )
                         )
])

report.run(reference_data = train_df_converted[:500], current_data = test_df_converted[500:1000],  
           column_mapping = column_mapping)
report

# Embeddings Drift Detection: mmd

Here, the drift is calculated using the maximum mean discrepancy (MMD) test.
To learn more about MMD, check out the paper below:
https://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf

To understand how MMD is used as a multivariate test, check out the paper below:

https://arxiv.org/pdf/1810.11953.pdf

To understand more about the implementation, check out the link below:

https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L201
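
As a rough illustration of the statistic itself, the sketch below computes a biased estimate of the squared MMD with an RBF kernel in plain Python. The kernel choice, the `gamma` value, and the helper names are assumptions made for this example and are not taken from Evidently's code.

```python
import math
import random

def rbf_kernel(x, y, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(reference, current, gamma=0.5):
    """Biased estimate of MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    return (mean_kernel(reference, reference, gamma)
            + mean_kernel(current, current, gamma)
            - 2.0 * mean_kernel(reference, current, gamma))

rng = random.Random(0)
reference = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(40)]
shifted = [[rng.gauss(3.0, 1.0), rng.gauss(3.0, 1.0)] for _ in range(40)]

mmd_identical = mmd_squared(reference, reference)
mmd_shifted = mmd_squared(reference, shifted)
print(f"identical samples: {mmd_identical:.4f}, shifted samples: {mmd_shifted:.4f}")
```

The statistic is zero when both samples are identical and grows as the two distributions move apart, which is why a small threshold can be used to flag drift.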

In [None]:
report = Report(metrics = [
    EmbeddingsDriftMetric('small_subset', 
                          drift_method = mmd(
                              threshold = 0.015,
                              bootstrap = None,
                              quantile_probability = 0.95,
                              pca_components = None,
                          )
                         )
])

report.run(reference_data = train_df_converted[:500], current_data = test_df_converted[500:1000],  
           column_mapping = column_mapping)
report

# Embeddings Drift Detection: ratio

Here, the drift is calculated based on the share of drifted embedding components: a statistical test is applied to each individual component, and the threshold parameter sets the share of drifted components above which the whole embedding counts as drifted. The test can be picked from the Evidently stattests module (https://docs.evidentlyai.com/reference/api-reference/evidently.calculations/evidently.calculations.stattests)

If you want to reduce the dimensionality before using this method, use the pca_components parameter.

Check out the link below to understand more about how this works:

https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L139
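
The per-component logic can be sketched as follows. This is an illustration under stated assumptions, not Evidently's code: `normed_wasserstein` is a hypothetical helper that computes the 1-D Wasserstein distance for equal-sized samples and scales it by the reference standard deviation, and `drifted_share` is likewise made up for this example.

```python
import statistics

def normed_wasserstein(ref_col, cur_col):
    """1-D Wasserstein distance for equal-sized samples, scaled by the reference std."""
    if len(ref_col) != len(cur_col):
        raise ValueError("this simplified version needs equal-sized samples")
    gaps = [abs(a - b) for a, b in zip(sorted(ref_col), sorted(cur_col))]
    scale = statistics.pstdev(ref_col) or 1.0  # avoid dividing by zero
    return statistics.fmean(gaps) / scale

def drifted_share(reference, current, component_threshold=0.1):
    """Share of embedding components whose per-component test flags drift."""
    n_components = len(reference[0])
    drifted = 0
    for j in range(n_components):
        ref_col = [row[j] for row in reference]
        cur_col = [row[j] for row in current]
        if normed_wasserstein(ref_col, cur_col) > component_threshold:
            drifted += 1
    return drifted / n_components

# Three components; only the first one is shifted in the current data.
reference = [[float(i), float(i), float(i)] for i in range(10)]
current = [[float(i) + 5.0, float(i), float(i)] for i in range(10)]

share = drifted_share(reference, current, component_threshold=0.1)
print(f"share of drifted components: {share:.2f}")
print("dataset drift:", share > 0.2)
```

Here one of three components drifts, so the share is one third and the dataset-level threshold of 0.2 is exceeded.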

In [None]:
report = Report(metrics = [
    EmbeddingsDriftMetric('small_subset', 
                          drift_method = ratio(
                              component_stattest = 'wasserstein',
                              component_stattest_threshold = 0.1,
                              threshold = 0.2,
                              pca_components = None,
                          )
                         )
])

report.run(reference_data = train_df_converted[:500], current_data = test_df_converted[500:1000],  
           column_mapping = column_mapping)
report

# Embeddings Drift Detection: distance

Here we use the average distance method for measuring the drift: the distance between the average reference embedding and the average current embedding.
The available distances are euclidean, cosine, cityblock and chebyshev.

If bootstrap is enabled, bootstrapping is used to decide on drift based on quantile_probability. If it is not enabled, the threshold parameter is used: all values above this threshold mean data drift. The threshold only applies when bootstrap != True.

To understand how this is implemented, check out the link below:

https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L45
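
The four distance options can be compared on a toy example. The sketch below assumes that the method compares the mean embedding of each dataset; the `mean_vector` helper and the `DISTANCES` table are hypothetical names written for this illustration in plain Python.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def norm(u):
    return math.sqrt(sum(a * a for a in u))

# The four distance names match the options listed above.
DISTANCES = {
    "euclidean": lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v))),
    "cityblock": lambda u, v: sum(abs(a - b) for a, b in zip(u, v)),
    "chebyshev": lambda u, v: max(abs(a - b) for a, b in zip(u, v)),
    "cosine": lambda u, v: 1.0 - sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v)),
}

reference = [[0.0, 0.0], [2.0, 2.0]]   # mean embedding: [1.0, 1.0]
current = [[3.0, 5.0], [5.0, 5.0]]     # mean embedding: [4.0, 5.0]

ref_mean, cur_mean = mean_vector(reference), mean_vector(current)
results = {name: fn(ref_mean, cur_mean) for name, fn in DISTANCES.items()}
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Note how the cosine distance stays small here because the two mean vectors point in almost the same direction, even though the euclidean distance is large. The choice of distance therefore determines which kind of shift is detected.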

In [None]:
report = Report(metrics = [
    EmbeddingsDriftMetric('small_subset', 
                          drift_method = distance(
                              dist = 'euclidean', #"euclidean", "cosine", "cityblock" or "chebyshev"
                              threshold = 0.2,
                              pca_components = None,
                              bootstrap = None,
                              quantile_probability = 0.95
                          )
                         )
])

report.run(reference_data = train_df_converted[:500], current_data = test_df_converted[500:1000],  
           column_mapping = column_mapping)
report

# Drift detection with sentence transformers

This section demonstrates using sentence transformers with Evidently for drift detection.

## Convert to embeddings

In [None]:
# Load the all-MiniLM-L6-v2 model from sentence-transformers

from sentence_transformers import SentenceTransformer
model_miniLM = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# Encode only a fraction
ref_embeddings = model_miniLM.encode(imdb_5k_data["review"][: 100].tolist() )

In [None]:
ref_df = pd.DataFrame(ref_embeddings)
ref_df.columns = ['col_' + str(x) for x in ref_df.columns]
ref_df.head(5)

In [None]:
# Similarly encode only a fraction
cur_embeddings = model_miniLM.encode( eco_dot_data.tolist()[:100] )

In [None]:
cur_df = pd.DataFrame(cur_embeddings)
cur_df.columns = ['col_' + str(x) for x in cur_df.columns]
cur_df.head(5)

## Embeddings Drift Report


Here we take a small subset of the columns (the first 10 embedding dimensions) and calculate the drift on it.
To understand more about how it is calculated, check out the link below:

https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embeddings_drift.py#L27

In [None]:
column_mapping = ColumnMapping(
    embeddings={'small_subset': ref_df.columns[:10]}
)

In [None]:
report = Report(metrics=[
    EmbeddingsDriftMetric('small_subset')
])

report.run(reference_data = ref_df[:50], current_data = cur_df[:50], 
           column_mapping = column_mapping)
report

## Embeddings Drift Detection: model

This approach trains an SGD classifier to distinguish reference embeddings from current embeddings and calculates its ROC AUC score, which serves as the drift score: values above the threshold mean data drift. If pca_components is set, dimensionality reduction with PCA is performed first. If bootstrap is enabled, the ROC AUC is bootstrapped and the drift result is based on quantile_probability.

Check out the link below to understand more about how it is calculated:

https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L99

In [None]:
report = Report(metrics = [
    EmbeddingsDriftMetric('small_subset', 
                          drift_method = model(
                              threshold = 0.55,
                              bootstrap = None,
                              quantile_probability = 0.95,
                              pca_components = None,
                          )
                         )
])

report.run(reference_data = ref_df[:50], current_data = cur_df[:50], 
           column_mapping = column_mapping)
report

## Embeddings Drift Detection: MMD

Here, the drift is calculated using the maximum mean discrepancy (MMD) test. To learn more about MMD, check out the paper below:

https://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf

To understand how MMD is used as a multivariate test, check out the paper below:

https://arxiv.org/pdf/1810.11953.pdf

To understand more about the implementation, check out the link below:

https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L201

In [None]:
report = Report(metrics = [
    EmbeddingsDriftMetric('small_subset', 
                          drift_method = mmd(
                              threshold = 0.015,
                              bootstrap = None,
                              quantile_probability = 0.95,
                              pca_components = None,
                          )
                         )
])

# Compare the IMDB reference embeddings against the Echo Dot embeddings, not against themselves
report.run(reference_data = ref_df[:50], current_data = cur_df[:50],
           column_mapping = column_mapping)
report

## Embeddings Drift Detection: ratio

Here, the drift is calculated based on the share of drifted embedding components: a statistical test is applied to each individual component, and the threshold parameter sets the share of drifted components above which the whole embedding counts as drifted. The test can be picked from the Evidently stattests module (https://docs.evidentlyai.com/reference/api-reference/evidently.calculations/evidently.calculations.stattests)

If you want to reduce the dimensionality before using this method, use the pca_components parameter.

Check out the link below to understand more about how this works:

https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L139

In [None]:
report = Report(metrics = [
    EmbeddingsDriftMetric('small_subset', 
                          drift_method = ratio(
                              component_stattest = 'wasserstein',
                              component_stattest_threshold = 0.1,
                              threshold = 0.2,
                              pca_components = None,
                          )
                         )
])

# Compare the IMDB reference embeddings against the Echo Dot embeddings, not against themselves
report.run(reference_data = ref_df[:50], current_data = cur_df[:50],
           column_mapping = column_mapping)
report

## Embeddings Drift Detection: distance

Here we use the average distance method for measuring the drift: the distance between the average reference embedding and the average current embedding.
The available distances are euclidean, cosine, cityblock and chebyshev.

If bootstrap is enabled, bootstrapping is used to decide on drift based on quantile_probability. If it is not enabled, the threshold parameter is used: all values above this threshold mean data drift. The threshold only applies when bootstrap != True.

To understand how this is implemented, check out the link below:

https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L45
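
One way to picture the bootstrap option is to derive the threshold from the reference data itself instead of fixing it by hand. The sketch below is an assumption about the general idea, not Evidently's exact procedure: it resamples the reference set, measures how far each resampled mean lies from the full reference mean, and uses the `quantile_probability` quantile of those distances as the threshold. All helper names are hypothetical.

```python
import math
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def bootstrap_threshold(reference, sample_size, n_rounds=200,
                        quantile_probability=0.95, seed=0):
    """Quantile of distances between the reference mean and resampled reference means."""
    rng = random.Random(seed)
    ref_mean = mean_vector(reference)
    distances = []
    for _ in range(n_rounds):
        # Resample with replacement to mimic a "current" batch with no drift.
        resample = [rng.choice(reference) for _ in range(sample_size)]
        distances.append(euclidean(ref_mean, mean_vector(resample)))
    distances.sort()
    index = min(int(quantile_probability * n_rounds), n_rounds - 1)
    return distances[index]

data_rng = random.Random(1)
reference = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
current = [[data_rng.gauss(2.0, 1.0) for _ in range(3)] for _ in range(50)]

threshold = bootstrap_threshold(reference, sample_size=len(current))
observed = euclidean(mean_vector(reference), mean_vector(current))
print(f"bootstrapped threshold: {threshold:.3f}, observed distance: {observed:.3f}")
print("drift detected:", observed > threshold)
```

The benefit of this design is that the threshold adapts to the scale and sample size of the data, so no distance value has to be guessed in advance.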

In [None]:
report = Report(metrics = [
    EmbeddingsDriftMetric('small_subset', 
                          drift_method = distance(
                              dist = 'euclidean', #"euclidean", "cosine", "cityblock" or "chebyshev"
                              threshold = 0.2,
                              pca_components = None,
                              bootstrap = None,
                              quantile_probability = 0.95
                          )
                         )
])

# Compare the IMDB reference embeddings against the Echo Dot embeddings, not against themselves
report.run(reference_data = ref_df[:50], current_data = cur_df[:50],
           column_mapping = column_mapping)
report