# Data Drift


Suppose the spam generator becomes more intelligent and begins producing prose which looks "more legitimate" than before. Data drift occurs when the data the model was trained on no longer accurately reflects the data that the model is currently analyzing. Drift can take different forms. To illustrate a few:
 + The structure of data may change. Maybe spam emails start using photo attachments rather than text. Since our model is based on the text within the email, it would likely start performing very poorly.
 + Data can change meaning, even if structure does not. Perhaps spam mail about food becomes our new favorite reading to go along with morning coffee, and we no longer want that type of spam filtered out of our mailbox.
 + Features may change. Features that were previously infrequent may become more frequent, or vice versa. One (unlikely) drift could be that all modern spam emails begin containing the word "coffee" and never the word "tree." This could be an important insight to include in our model.
 
Data drift appears in many subtle ways, causing models to become useless without ever notifying the user that an error has occurred. Models with changing data need to be monitored to ensure that the model is still performing as expected. 
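
As a rough illustration of the last kind of drift listed above, the sketch below compares how often each token appears in two small batches of documents. The corpora and the helper `token_rates` are hypothetical and exist only for this example; they are not part of the notebook's data or pipeline.

```python
from collections import Counter

def token_rates(docs):
    """Fraction of documents that contain each lowercase token."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return {tok: n / len(docs) for tok, n in counts.items()}

# Hypothetical toy corpora; not taken from the notebook's data set.
old_spam = ["win a free tree today", "free prize inside", "cheap tree removal"]
new_spam = ["free coffee today", "coffee prize inside", "cheap coffee beans"]

old_rates = token_rates(old_spam)
new_rates = token_rates(new_spam)

# Change in document frequency for every token seen in either batch.
shift = {
    tok: new_rates.get(tok, 0.0) - old_rates.get(tok, 0.0)
    for tok in set(old_rates) | set(new_rates)
}

# Show the two tokens whose frequency moved the most.
for tok, delta in sorted(shift.items(), key=lambda kv: -abs(kv[1]))[:2]:
    print(f"{tok:>8}: {delta:+.2f}")
```

Tokens with a large positive or negative shift are candidates for a closer look; a real check would need far more data than this toy example.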

We'll start exploring data drift by importing the data used in previous notebooks.

In [None]:
import pandas as pd
import os.path

df = pd.read_parquet(os.path.join("data", "training.parquet"))

We split the data into training and testing sets, as in the modelling notebooks. We use the `random_state` parameter to ensure that the data is split in the same way as it was when we fit the model. 

In [None]:
from sklearn import model_selection

df_train, df_test = model_selection.train_test_split(df, random_state=43)
df_test_spam = df_test[df_test.label == 'spam'].copy() #filter the spam documents
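
The role of `random_state` can be checked on a small example. The sketch below uses a hypothetical 20-row data frame (not the notebook's data) and shows that two calls with the same seed select the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the notebook's data frame.
toy = pd.DataFrame({
    "text": [f"message {i}" for i in range(20)],
    "label": ["spam", "legitimate"] * 10,
})

train_a, test_a = train_test_split(toy, random_state=43)
train_b, test_b = train_test_split(toy, random_state=43)

# Same seed: the same rows end up in the test set, in the same order.
same_seed_matches = test_a.index.equals(test_b.index)
print(same_seed_matches)
print(len(train_a), len(test_a))  # the default test_size is 0.25
```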

Having filtered the test set down to its spam documents, we force them to drift by adding the first few lines of Pride and Prejudice to the start of each spam document.

In [None]:
def add_text(doc, adds):
    """
    Takes in a string _doc_ and
    prepends the text _adds_ to it.
    """
    
    return adds + doc

In [None]:
pride_pred = '''It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters. “My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.'''

In [None]:
# prepend the passage to each spam document
df_test_spam["text"] = df_test_spam.text.apply(add_text, adds=pride_pred)

In [None]:
pd.set_option('display.max_colwidth', None) # ensures that all the text is visible
df_test_spam.sample(3)

We now pass this "drifted" data through the pipeline we created: we compute feature vectors, and we make spam/legitimate classifications using the model we trained. 

In [None]:
from sklearn.pipeline import Pipeline
import pickle

# load the feature-extraction pipeline (the file is closed automatically)
with open('feature_pipeline.sav', 'rb') as f:
    feat_pipeline = pickle.load(f)

# load the trained model
with open('model.sav', 'rb') as f:
    model = pickle.load(f)

feature_pipeline = Pipeline([
    ('features',feat_pipeline)
])

Next, we use our feature engineering pipeline to transform the data into feature vectors. We'll then use a PCA-style projection (discussed in the [visualization](01-vectors-and-visualization.ipynb) notebook; the code below uses scikit-learn's `TruncatedSVD`) to map these large vectors to 2 dimensions so we can view the structure of the new spam data.

In [None]:
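# fit the feature pipeline on the training data, then reuse that fit for the
# drifted spam so both sets of vectors live in the same feature space
ft_train_data = feature_pipeline.fit_transform(df_train["text"], df_train["label"])
ft_drifted_data = feature_pipeline.transform(df_test_spam["text"])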

In [None]:
import sklearn.decomposition

DIMENSIONS = 2
pca = sklearn.decomposition.TruncatedSVD(DIMENSIONS)

# fit_transform original data, put into data frame
pca_a = pca.fit_transform(ft_train_data)
pca_df = pd.DataFrame(pca_a, columns=["x", "y"])

# transform new spam data, put into data frame
pca_b = pca.transform(ft_drifted_data)
pca_df_drift = pd.DataFrame(pca_b, columns=["x", "y"])
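
A plot is the main tool here, but a single number can serve as a rough companion to it. The sketch below is an assumption-laden illustration on synthetic vectors (not the notebook's feature vectors): it fits `TruncatedSVD` on a reference set, projects a shifted set with the same fit, and compares the distance between the two centroids with the typical spread of the reference points.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical stand-ins for feature vectors: 200 "original" rows and
# 50 "drifted" rows whose first ten features are shifted upward.
original = rng.random((200, 30))
drifted = rng.random((50, 30))
drifted[:, :10] += 2.0

svd = TruncatedSVD(n_components=2, random_state=0)
proj_original = svd.fit_transform(original)  # fit on the reference data only
proj_drifted = svd.transform(drifted)        # reuse the same projection

# Distance between group centroids versus the spread of the reference group.
centroid_gap = np.linalg.norm(proj_original.mean(axis=0) - proj_drifted.mean(axis=0))
spread = proj_original.std(axis=0).mean()
print(f"centroid gap: {centroid_gap:.2f}, typical spread: {spread:.2f}")
```

A gap that is large relative to the spread suggests the two groups occupy different regions of the projection. This is only a heuristic and not a statistical test.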

In [None]:
df_test_spam

In [None]:
import altair as alt
from altair.expr import datum
alt.renderers.enable('notebook')
SAMPLE = 2000
# pca_df has a fresh 0..n-1 index, so attach labels by position, not by index
plot_data = pca_df.assign(label=df_train["label"].values)
plot_data["label"] = plot_data["label"].replace("spam", "previous spam")

plot_data_drift = pca_df_drift.assign(label = "drifted spam")

plot_data2 = pd.concat([plot_data_drift, plot_data])
domain = ['legitimate', 'previous spam', 'drifted spam']
range_ = ['lightgray', 'blue', 'red']

chart1 = alt.Chart(plot_data2.sample(SAMPLE))\
            .mark_point(opacity=0.4) \
            .encode(x='x', y='y', color = alt.Color('label', scale= alt.Scale(domain=domain, range=range_)))\
            .interactive()

chart1


With drifted spam emails in red and previous spam emails in blue, it looks like the structure of spam has changed drastically. There's a good chance our model no longer performs as well as it used to. Using a pipeline, let's make predictions for the drifted spam data.

In [None]:
pipeline = Pipeline([
    ('features',feat_pipeline),
    ('model',model)
])

pipeline.fit(df_train["text"], df_train["label"])

# predict test instances
y_preds = pipeline.predict(df_test_spam["text"])
print(y_preds)

It looks as though the drifted data is mostly classified as legitimate (even though the entire test set was made of spam emails), but let's look at a confusion matrix to visualize the predictions.

In [None]:
from sklearn.metrics import confusion_matrix
from mlworkflows import plot

# use a new name so the original data frame `df` is not overwritten
cm_df, chart = plot.binary_confusion_matrix(df_test_spam["label"], y_preds)
cm_df

Not surprisingly, the model is quite terrible at classifying drifted data, since these spam emails look very different from the spam emails we originally trained the model on.
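
The `plot.binary_confusion_matrix` helper comes from this project's own `mlworkflows` module. For readers without it, a minimal sketch of the same idea with scikit-learn's `confusion_matrix` follows. The labels and predictions are hypothetical and only mimic the situation above, where every true label is spam.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical predictions for six emails that are all truly spam.
y_true = ["spam"] * 6
y_pred = ["legitimate", "legitimate", "spam", "legitimate", "legitimate", "legitimate"]

# Rows are true labels and columns are predicted labels, in this order.
labels = ["legitimate", "spam"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[0 0]
#  [5 1]]
```

The top row is empty because there are no legitimate emails in the test set; the bottom row shows five spam emails misclassified as legitimate and one caught.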

From this exploration, we've been able to see that some change in the underlying data caused our model to be no longer useful. Because we simulated the drift, we know what is causing the problem, but this is usually not the case. Further exploration may be needed: is the drift gradual or abrupt? Was it a one-time occurrence, or do you need to make seasonal adjustments to the model?
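
One simple way to start answering the gradual-versus-abrupt question is to track a metric per batch of labelled data and note when it first drops. The sketch below is a hypothetical illustration: the function `flag_batches`, the tolerance, and the accuracy values are all made up for this example.

```python
def flag_batches(batch_accuracies, baseline, tolerance=0.10):
    """Return indices of batches whose accuracy falls more than
    `tolerance` below `baseline`."""
    return [i for i, acc in enumerate(batch_accuracies) if baseline - acc > tolerance]

# Hypothetical per-batch accuracies, e.g. one value per week of labelled mail.
gradual = [0.95, 0.93, 0.90, 0.86, 0.82, 0.77]
abrupt = [0.95, 0.96, 0.94, 0.55, 0.53, 0.56]

print(flag_batches(gradual, baseline=0.95))  # [4, 5]
print(flag_batches(abrupt, baseline=0.95))   # [3, 4, 5]
```

In the gradual series the accuracy slides for several batches before crossing the tolerance, while in the abrupt series it collapses in a single step. Looking at the raw series, and not only the flags, helps tell the two apart.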

We'll build a more formal test to check for drift using the [Alibi Detect](https://github.com/SeldonIO/alibi-detect) library. 


In [None]:
import numpy as np

# change to numpy arrays in order to interact with KSDrift
array_test = np.asarray(df_test)
array_test_spam = np.asarray(df_test_spam)

While there are many methods of detection, we will use [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test), or K-S, tests in this notebook to check for drift. These tests compare, feature by feature, the probability distribution of the original data with that of the (possibly) drifted data. Looking at each feature's drift is helpful, but it is more important to establish whether the data set as a whole has changed in a statistically significant way. Using a [Bonferroni](https://mathworld.wolfram.com/BonferroniCorrection.html) correction, the per-feature K-S results are aggregated and tested as a whole: each feature's p-value is compared against the significance level divided by the number of features.

K-S tests are useful because they can detect drift that is too subtle to see by eye but is still statistically significant. However, this method only reports whether or not drift has occurred; it does not address questions of frequency or severity.
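
To make the idea concrete, here is a minimal sketch of per-feature K-S tests with a Bonferroni correction, written with `scipy.stats.ks_2samp`. It shows the general technique on synthetic numeric data and is not necessarily how Alibi Detect implements it; the function `ks_drift` and the arrays are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(reference, batch, p_val=0.05):
    """Per-feature two-sample K-S tests with a Bonferroni correction.

    Drift is flagged if any feature's p-value falls below p_val / n_features.
    """
    n_features = reference.shape[1]
    p_values = np.array([
        ks_2samp(reference[:, j], batch[:, j]).pvalue for j in range(n_features)
    ])
    threshold = p_val / n_features
    return bool((p_values < threshold).any()), p_values, threshold

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 5))  # hypothetical numeric feature vectors
shifted = rng.normal(size=(500, 5))
shifted[:, 0] += 1.0                   # move the mean of one feature

# Sanity check: a sample compared with itself should show no drift.
print(ks_drift(reference, reference)[0])

is_drift, p_values, threshold = ks_drift(reference, shifted)
print(is_drift, p_values.argmin())     # drift flag and the most-shifted feature
```

The returned p-values also give a rough sense of which features moved, which the single yes/no answer does not.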

In [None]:
# KSDrift
import alibi_detect
from alibi_detect.cd import KSDrift
from sklearn import preprocessing

# initialize label encoder
label_encoder = preprocessing.LabelEncoder() 

p_val = 0.05
drift_detect = KSDrift(
    p_val = p_val, # significance threshold for the K-S test
    X_ref = array_test, # test against original test set
    preprocess_fn = pca, # other options: auto-encoder, softmax output
    preprocess_kwargs = {'model': label_encoder.fit(array_test[:,1]), 'batch_size':32},
    alternative = 'two-sided',  # other options: 'less', 'greater'
    correction = 'bonferroni' # other option: false discovery rate
)

We'll start with a sanity check and test the original data. Since we're feeding in the same data set twice, we should not get any drift.

In [None]:
preds_test = drift_detect.predict(array_test)
labels = ['No!', 'Yes!']
print('Has the data drifted? {}'.format(labels[preds_test['data']['is_drift']]))

This was the desired output! Let's try again, but with the drifted data. 

In [None]:
preds_test = drift_detect.predict(array_test_spam)
print('Has the data drifted? {}'.format(labels[preds_test['data']['is_drift']]))

Great! Our drift detector can confirm that the data has drifted. Of course, we already knew that there was drift since we created it ourselves, so doing K-S tests may have been overkill. However, this test is useful when it isn't known whether the data has drifted.

Now we can both visualize and prove our data has drifted. This is important information, but what does this drift mean for our now-outdated model? *There is no one-size-fits-all answer to this question.* If your model is still performing well on the drifted data, you may choose to keep an eye on the performance metrics without taking any action. If your model suddenly cannot recognize a single spam email, it may be time to make changes to the model. Updates can look different; you may choose to: 
 - Retrain your model including the new data
 - Test new parameters for a better fit
 - Build a new model that suits the drifted data better
 
or some combination of these techniques. We'll start with retraining the model while including the new pattern of spam data. This retraining could be done in a multitude of ways, but the simplest is to add the same Pride and Prejudice passage to the start of each document in a copy of the training spam data.

In [None]:
# add the Pride and Prejudice passage to the spam training data
pd.set_option('display.max_colwidth', None) 

# select the spam training data
df_train_spam_drift = df_train[df_train.label == 'spam'].copy()

# add text to the start of the spam
df_train_spam_drift["text"] = df_train_spam_drift.text.apply(add_text, adds=pride_pred)
df_train_spam_drift

In [None]:
# add the drifted spam data to df_train (DataFrame.append was removed in pandas 2.0)
df_train = pd.concat([df_train, df_train_spam_drift])

Great! We have a new dataset that should capture the same type of drift. Next, let's retrain the model and look at the results. We wouldn't normally use the accuracy score (can't remember why? Look at [this notebook](./02-evaluating-models.ipynb)), but because all of our test data is spam, there are no false positives or true negatives to account for: accuracy here is simply the fraction of spam emails the model catches.
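
The point about accuracy on an all-spam test set can be checked on a toy example. In the sketch below, the labels and predictions are hypothetical; with every true label equal to `"spam"`, the accuracy score and the recall on spam come out the same.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: every test email is spam, as in the drifted test set.
y_true = ["spam"] * 5
y_pred = ["spam", "spam", "legitimate", "spam", "spam"]

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, pos_label="spam")
print(acc, rec)  # both are 4 correct out of 5
```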

In [None]:
# retrain model including drifted spam
pipeline.fit(df_train["text"], df_train["label"])

# predict test instances
y_preds = pipeline.predict(df_test_spam["text"])

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(df_test_spam["label"], y_preds)

This new and improved model is much more successful at identifying the drifted emails as spam! While most models won't be 100% accurate, the drift was significant and the training data reflected an identical drift. If the results were poor, our next step might have been to adjust the parameters set in previous notebooks or research a completely new model better suited for the new data set. 

It's possible to put streamed data into pipelines in order to automatically alert users when drift occurs and retrain the model. We look at integration services in [another notebook](07-services.ipynb) to better understand other capabilities.

## Exercises
The two models perform very similarly on the "drifted" data in this notebook. Consider alternative types of data drift and see how the models perform: 
1. What happens when fewer words from Pride and Prejudice are appended to the spam? 
2. How about using a completely different excerpt of Austen? 
3. How do the models perform when generic text (neither Austen nor food reviews) is appended to the spam? 