
# Week 3: Monitoring Model Performance and Detecting Data Drift

## Introduction to Data Drift
In the fast-paced world of e-commerce, change is constant. This week, we'll explore a critical challenge in maintaining machine learning models: data drift. We'll use Lamada's recent expansion into the French market as a real-world example to understand this concept.

## Lamada's Expansion to France
Exciting news! Lamada has recently expanded its operations to France, opening up a whole new market for their e-commerce platform. While this expansion brings great opportunities, it also introduces new challenges for our sentiment analysis model.
### Understanding Data Drift
Data drift occurs when the statistical properties of the model's input data change over time, potentially affecting the model's performance. In Lamada's case, the introduction of French language reviews is a perfect example of data drift.
### Types of Data Drift

1. Data Drift (also called covariate shift): Changes in the distribution of input features.
2. Concept Drift: Changes in the relationship between input features and the target variable.
3. Target Drift: Changes in the distribution of the target variable.

![Compare ML Models in W&B](https://drive.google.com/uc?id=1D33HeSi85W1Ua5ibQo3wY09KZhAj7oJ8)

In our scenario with Lamada, we're primarily dealing with data drift as the language of the reviews (our input feature) has changed.

To detect data drift, we'll compare our original dataset (English reviews) with the new data coming in from the French market. We'll use the dataset and model we created and logged in Week 1.
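
To make "a change in the distribution of input features" concrete, here is a minimal sketch that compares one simple text feature (mean word length) between a reference and a current sample using a hand-written two-sample Kolmogorov-Smirnov statistic. The reviews are made up for illustration and the helper names (`ks_statistic`, `mean_word_length`) are hypothetical; the notebook itself relies on Evidently for drift detection later on.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    for x in sorted(set(ref_sorted) | set(cur_sorted)):
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap


def mean_word_length(text):
    """Average number of characters per whitespace-separated word."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


# Made-up reviews standing in for the reference (English) and current (French) data.
english = [
    "I loved the product",
    "It did not work for me",
    "Great value and fast delivery",
]
french = [
    "J'ai adoré le produit",
    "Cela n'a pas fonctionné pour moi",
    "Excellent rapport qualité-prix et livraison rapide",
]

drift_score = ks_statistic(
    [mean_word_length(t) for t in english],
    [mean_word_length(t) for t in french],
)
print(f"KS statistic on mean word length: {drift_score:.2f}")
```

A value near 0 means the two samples look alike on this feature, while a value near 1 means they barely overlap. With only three reviews per side the number is purely illustrative.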

## Step 1: Retrieving the Original Dataset and Model
First, we need to fetch two things that we stored in Week 1 using Weights & Biases (wandb):
1. the dataset used to train the model
2. the trained model

Thankfully, we logged both during the training step. Otherwise we would need to regenerate the data, and there is no guarantee that regenerated data would match the data the model was actually trained on.
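
To illustrate why regenerated data cannot simply be trusted, here is a small sketch that fingerprints a file with SHA-256 so two copies of a dataset can be compared byte for byte. The helper name `file_fingerprint` and the throwaway CSV files are assumptions made for this example; they are not part of the Week 1 code.

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Throwaway files standing in for a logged and a regenerated train.csv.
with tempfile.TemporaryDirectory() as tmp:
    logged = Path(tmp) / "logged_train.csv"
    regenerated = Path(tmp) / "regenerated_train.csv"
    logged.write_text("text,label\ngreat product,2\n")
    regenerated.write_text("text,label\ngreat product,1\n")  # one label differs

    same_file_matches = file_fingerprint(logged) == file_fingerprint(logged)
    regenerated_matches = file_fingerprint(logged) == file_fingerprint(regenerated)

print("Same file matches:", same_file_matches)
print("Regenerated file matches:", regenerated_matches)
```

Even a single changed label produces a different fingerprint, which is why fetching the exact logged artifact is safer than rebuilding the dataset.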

Don't forget to get your W&B API Key!

In [1]:
# Installing all the necessary packages
!pip install \
pandas \
scikit-learn \
wandb \
skl2onnx \
onnxruntime \
deep-translator \
evidently

Collecting skl2onnx
  Downloading skl2onnx-1.17.0-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting onnxruntime
  Downloading onnxruntime-1.19.2-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting deep-translator
  Downloading deep_translator-1.11.4-py3-none-any.whl.metadata (30 kB)
Collecting evidently
  Downloading evidently-0.4.38-py3-none-any.whl.metadata (11 kB)
Collecting onnx>=1.2.1 (from skl2onnx)
  Downloading onnx-1.17.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting onnxconverter-common>=1.7.0 (from skl2onnx)
  Downloading onnxconverter_common-1.14.0-py2.py3-none-any.whl.metadata (4.2 kB)
Collecting coloredlogs (from onnxruntime)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting litestar>=2.8.3 (from evidently)
  Downloading litestar-2.12.1-py3-none-any.whl.metadata (105 kB)

In [None]:
!wandb login

In [None]:
import wandb

run = wandb.init()
# Example dataset name: yudhiesh/Drug Review MLOps Uplimit/drug-review-dataset:v2
dataset_artifact = run.use_artifact('kbhuvi-uplimit/Drug Review MLOps Uplimit/drug-review-dataset:v1', type='dataset')
dataset_artifact_dir = dataset_artifact.download()


In [None]:
from pathlib import Path
import pandas as pd

dataset_dir = Path(dataset_artifact_dir)
# Paths to the three CSV files stored in the dataset artifact
train_csv = dataset_dir / "train.csv"
test_csv = dataset_dir / "test.csv"
test_probas_csv = dataset_dir / "test_probas.csv"

train_df = pd.read_csv(train_csv)
test_df = pd.read_csv(test_csv)
test_probas = pd.read_csv(test_probas_csv)

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
test_probas.head()

In [None]:
from enum import Enum


class SentimentLabel(str, Enum):
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"


LABEL_CLASS_TO_NAME = {
    0: SentimentLabel.NEGATIVE.value,
    1: SentimentLabel.NEUTRAL.value,
    2: SentimentLabel.POSITIVE.value,
}

In [None]:
import pandas as pd
import numpy as np

test_df['prob_NEGATIVE'] = test_probas['Negative']
test_df['prob_NEUTRAL'] = test_probas['Neutral']
test_df['prob_POSITIVE'] = test_probas['Positive']

test_df['predicted_label'] = test_probas.idxmax(axis=1).map({'Negative': 0, 'Neutral': 1, 'Positive': 2})
test_df['predicted_sentiment'] = test_df['predicted_label'].map(LABEL_CLASS_TO_NAME)
column_order = ['text', 'label', 'prob_NEGATIVE', 'prob_NEUTRAL', 'prob_POSITIVE', 'predicted_label', 'predicted_sentiment']
test_df = test_df[column_order]

In [None]:
test_df.head()

## Step 2: Simulating Data Drift
In a real-world scenario, data drift would occur naturally over time. For our learning purposes, we'll simulate this drift by translating a sample of our English reviews to French.

### [OPTIONAL] Add in your own kind of perturbations to the reviews
You can try the following:
1. **Spelling Errors**: Introduce random spelling mistakes to simulate typos in reviews.
2. **Emoji Usage**: Add emojis to reviews to simulate changing trends in online communication.
3. **Text Shortening**: Simulate the trend of shorter, more concise reviews.
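
The three ideas above could be sketched as follows. This is only one possible take: the function names, the default rates, and the emoji choices are all assumptions, and you may prefer different perturbations.

```python
import random


def add_typos(text, rate=0.1, seed=42):
    """Swap adjacent letters with probability `rate` to mimic typing errors."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each letter moves at most once
        else:
            i += 1
    return "".join(chars)


def add_emoji(text, emojis=("🙂", "👍", "😡"), seed=42):
    """Append one randomly chosen emoji to the review."""
    rng = random.Random(seed)
    return f"{text} {rng.choice(emojis)}"


def shorten(text, max_words=5):
    """Keep only the first `max_words` words of the review."""
    return " ".join(text.split()[:max_words])


review = "This medication worked well and had no side effects"
for perturbed in (add_typos(review, rate=0.3), add_emoji(review), shorten(review)):
    print(perturbed)
```

Each function maps one string to one string, so it can be applied to a review column in the same way as the translation function below, for example `data_drift_df['text'].apply(shorten)`.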


In [None]:
from deep_translator import GoogleTranslator

def translate_str(review: str) -> str:
    translator = GoogleTranslator(source='auto', target='fr')
    return translator.translate(review)

In [None]:
SEED = 42
SAMPLE_SIZE = 100

data_drift_df = train_df.sample(SAMPLE_SIZE, random_state=SEED)

In [None]:
data_drift_df.head()

In [None]:
# Will take a couple of minutes!

data_drift_df['text'] = data_drift_df['text'].apply(translate_str)

In [None]:
data_drift_df.head()
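
The translation step above makes one network call per review, which is why it is slow and can occasionally fail. Below is a hedged sketch of a wrapper that translates each distinct text only once and retries failed calls. `translate_all` and the fake translator are hypothetical names used for illustration; in the notebook you would pass `translate_str` as `translate_fn`.

```python
import time


def translate_all(texts, translate_fn, max_attempts=3, wait_seconds=0.0):
    """Translate each distinct text once, retrying calls that raise.

    `translate_fn` is any callable mapping str -> str.
    Texts that still fail after `max_attempts` come back as None.
    """
    cache = {}
    for text in texts:
        if text in cache:
            continue  # duplicate review: reuse the earlier result
        cache[text] = None
        for _ in range(max_attempts):
            try:
                cache[text] = translate_fn(text)
                break
            except Exception:
                time.sleep(wait_seconds)  # brief pause before the next attempt
    return [cache[text] for text in texts]


# A stand-in for a network call: it fails on the first call, then succeeds.
calls = {"count": 0}


def flaky_fake_translator(text):
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated network hiccup")
    return f"[fr] {text}"


translated = translate_all(["good", "bad", "good"], flaky_fake_translator)
print(translated)
print("Translator calls:", calls["count"])
```

Catching every `Exception` keeps the sketch short; in practice you would likely catch only the errors your translation client is known to raise.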

## Step 3: Making Predictions on the New Data
Now that we have our "French" reviews, let's use our original model to make predictions on this new data.

In [None]:
import numpy as np
import onnxruntime as rt


run = wandb.init()
# Example name of model: yudhiesh/model-registry/Drugs Review MLOps Uplimit:v1
downloaded_model_path = run.use_model(
    name="kbhuvi-uplimit/Drug Review MLOps Uplimit/run-y3m59we9-logreg_model_LR_train_size_1000.onnx:v0"
)

sess = rt.InferenceSession(downloaded_model_path, providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
query = "I loved the product!"
_, probas = sess.run(None, {input_name: np.array([[query]])})
print(probas[0])

In [None]:
def predict_batch(sess, texts):
    input_name = sess.get_inputs()[0].name
    # Convert list of texts to a 2D numpy array
    input_data = np.array([[text] for text in texts])
    _, probas = sess.run(None, {input_name: input_data})
    return probas

probas = predict_batch(sess, data_drift_df['text'].values)

In [None]:
probas_array = np.array([[prob[0], prob[1], prob[2]] for prob in probas])

print("Shape of probas_array:", probas_array.shape)
print("First few rows of probas_array:", probas_array[:5])

for i, label in LABEL_CLASS_TO_NAME.items():
    data_drift_df[f'prob_{label}'] = probas_array[:, i]


data_drift_df['predicted_label'] = np.argmax(probas_array, axis=1)
data_drift_df['predicted_sentiment'] = data_drift_df['predicted_label'].map(LABEL_CLASS_TO_NAME)

In [None]:
data_drift_df.head()
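
The label assignment above boils down to picking, for each review, the class with the highest probability. Here is a dependency-free sketch of that step. It assumes the model returns one mapping of class index to probability per review, which is an assumption about the exported model's output format; the indexing also works for plain lists. The probabilities are made up.

```python
LABEL_NAMES = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}


def to_rows(probas, n_classes=3):
    """Turn per-review probabilities (indexable by class id) into plain lists."""
    return [[row[i] for i in range(n_classes)] for row in probas]


def predicted_labels(rows):
    """Index of the largest probability in each row (an argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Made-up model output for two reviews.
example_probas = [
    {0: 0.7, 1: 0.2, 2: 0.1},
    {0: 0.1, 1: 0.3, 2: 0.6},
]

rows = to_rows(example_probas)
labels = predicted_labels(rows)
sentiments = [LABEL_NAMES[label] for label in labels]
print(labels, sentiments)
```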

## Step 4: Analyzing Data Drift
To analyze the data drift, we'll use the [Evidently](https://github.com/evidentlyai/evidently/tree/main) library, which provides tools for monitoring machine learning models in production.

The performance report will show how our model's performance has changed when applied to the French reviews. We expect to see a significant drop in performance metrics like accuracy and F1-score.
The data drift report will highlight changes in the statistical properties of our text data. We should observe significant drift in features like average word length, unique word count, and character distributions due to the change in language.
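
One intuition for why both reports should change is that a model trained on English words will hardly recognize any French ones. The sketch below measures the share of words in a review that never appeared in the reference data. It is a simplified illustration with made-up reviews and hypothetical helper names, not a description of how Evidently computes its descriptors.

```python
def vocabulary(texts):
    """Set of lowercased words seen in a collection of texts."""
    return {word.lower() for text in texts for word in text.split()}


def oov_share(text, known_words):
    """Fraction of words in `text` that are not in `known_words`."""
    words = [word.lower() for word in text.split()]
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word not in known_words)
    return unknown / len(words)


# Made-up reference reviews standing in for the English training data.
reference_reviews = ["the product was great", "the delivery was slow"]
known = vocabulary(reference_reviews)

english_share = oov_share("the product was slow", known)
french_share = oov_share("le produit était lent", known)
print(f"Unseen words, English review: {english_share:.0%}")
print(f"Unseen words, French review: {french_share:.0%}")
```

When nearly every word is unseen, a model built on word-level features has little to base its prediction on, which is consistent with the performance drop we expect to see.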

In [None]:
# These imports use the Evidently 0.4.x API (0.4.38 in the install log above)
from evidently.pipeline.column_mapping import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
from evidently.metric_preset import ClassificationPreset
from evidently.metrics import ClassificationQualityMetric, TextDescriptorsDriftMetric, ColumnDriftMetric

column_mapping = ColumnMapping()

column_mapping.target = 'label'
column_mapping.prediction = 'predicted_label'
column_mapping.text_features = ['text']
column_mapping.numerical_features = []
column_mapping.task = 'classification'
column_mapping.categorical_features = []

In [None]:
performance_report = Report(metrics=[
    ClassificationQualityMetric()
])

performance_report.run(reference_data=test_df, current_data=data_drift_df,
                        column_mapping=column_mapping)

In [None]:
performance_report.show()

In [None]:
data_drift_dataset_report = Report(metrics=[
    ColumnDriftMetric(column_name='text')
])

data_drift_dataset_report.run(reference_data=test_df,
                              current_data=data_drift_df,
                              column_mapping=column_mapping)
data_drift_dataset_report.show()

In [None]:
data_drift_report = Report(
    metrics=[
        TextDescriptorsDriftMetric(column_name='text'),
    ]
)

data_drift_report.run(reference_data=test_df, current_data=data_drift_df, column_mapping=column_mapping)
data_drift_report.show()

## TODO: Analyze the Performance Report Results

After running the performance report, you should see a comparison between the current (French) and reference (English) datasets. Your task is to analyze these results and explain their significance.

1. Examine each metric (Accuracy, Precision, Recall, F1) and describe how it has changed from the reference to the current data.

2. Explain why you think each metric has changed in the way it has. Consider the nature of the data drift we've introduced (English to French translation).

3. Discuss what these changes mean for Lamada's sentiment analysis system as they expand into the French market. What are the potential business implications?

4. Propose strategies Lamada could consider to address this performance degradation.

5. Reflect on why this example demonstrates the importance of continuous monitoring in ML systems, especially for businesses operating in diverse markets.

Your analysis should be comprehensive, touching on all the points above. Use the specific numbers from the performance report to support your explanations. Remember to consider both the technical aspects of the model's performance and the real-world business implications for Lamada.

**Hint**: Pay special attention to metrics that have changed dramatically. Think about what each metric represents and how the language change might affect the model's ability to make correct predictions.
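
If it helps to reason about what each metric measures, here is a small sketch that computes accuracy and macro-averaged precision, recall, and F1 by hand. Macro averaging is an assumption here, so check which averaging your report uses. The labels are made up, and the "current" predictions mimic a model that answers POSITIVE for everything.

```python
def classification_metrics(y_true, y_pred, classes=(0, 1, 2)):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


y_true = [0, 0, 1, 1, 2, 2]
reference_metrics = classification_metrics(y_true, [0, 0, 1, 1, 2, 2])
current_metrics = classification_metrics(y_true, [2, 2, 2, 2, 2, 2])
print("Reference:", reference_metrics)
print("Current:  ", current_metrics)
```

In this toy case recall stays at one third because one class is always "found", while precision collapses because most POSITIVE predictions are wrong. Looking for that kind of asymmetry in your own report is a good starting point for the analysis.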

In [None]:
# YOUR ANSWER GOES HERE

---

## [OPTIONAL] TODO: Retrain the Model on French Data

Now that we've identified the performance degradation due to data drift, let's attempt to address it by retraining our model on the new French data. This exercise will help you understand how model retraining can mitigate the effects of data drift. Follow how we trained and evaluated the model in Week 1!

**NOTE**: Because we started with `LogisticRegression` from scikit-learn, we will have to perform stateless retraining, that is, retrain from scratch. According to the scikit-learn documentation [here](https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning), this estimator does not support incremental training.

---
## [OPTIONAL] TODO: Implement Canary Deployment with Ray Serve
In this advanced exercise, you'll implement a canary deployment strategy for your sentiment analysis model using Ray Serve. This approach allows you to gradually roll out a new version of your model while still serving the old version, reducing risk and allowing for easy rollback if issues arise. In a production deployment, we would roll the new model out incrementally, taking model and business metrics into account.

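```python
import random

from ray import serve
from ray.serve.handle import DeploymentHandle

# NOTE: `app` (assumed to be a FastAPI instance), `SimpleModelRequest` and
# `SimpleModelResponse` are not defined here; they are assumed to come from
# your earlier serving code.


@serve.deployment()
class Canary:
    def __init__(self, old_model: DeploymentHandle, new_model: DeploymentHandle, canary_percent: float):
        self.old_model = old_model
        self.new_model = new_model
        # Fraction of requests (between 0 and 1) to send to the new model
        self.canary_percent = canary_percent

    async def predict(self, request: SimpleModelRequest) -> SimpleModelResponse:
        # random.random() is uniform on [0, 1), so the new model receives
        # roughly `canary_percent` of the traffic.
        if random.random() > self.canary_percent:
            results = await self.old_model.predict.remote(request.review)
        else:
            results = await self.new_model.predict.remote(request.review)
        return SimpleModelResponse.model_validate(results.model_dump())


@serve.deployment()
@serve.ingress(app)
class APIIngress:
    def __init__(self, canary_handle: DeploymentHandle) -> None:
        self.handle = canary_handle

    @app.post("/predict")
    async def predict(self, request: SimpleModelRequest):
        # Forward the request to the canary router
        return await self.handle.predict.remote(request)
```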

### Testing the Canary Deployment
After deploying your canary setup, test it to ensure it's working correctly:

1. Send multiple requests to the `/predict` endpoint.
2. Log the responses to see which model version is being used for each request.
3. Verify that approximately 20% of requests are being routed to the new model (this assumes `canary_percent=0.2`).
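
Before testing against a live deployment, it can help to sanity-check the routing rule offline. The sketch below mirrors the comparison used in `Canary.predict` and simulates many routing decisions. The helper names, the request count, and the 2-percentage-point tolerance are assumptions chosen for illustration.

```python
import random


def route(canary_percent, rng):
    """Mirror the routing rule in `Canary.predict` and return the chosen model."""
    return "old" if rng.random() > canary_percent else "new"


def observed_canary_share(canary_percent, n_requests=10_000, seed=0):
    """Simulate `n_requests` routing decisions; return the share sent to the new model."""
    rng = random.Random(seed)
    new_count = sum(route(canary_percent, rng) == "new" for _ in range(n_requests))
    return new_count / n_requests


share = observed_canary_share(canary_percent=0.2)
within_tolerance = abs(share - 0.2) < 0.02
print(f"Share routed to the new model: {share:.1%}")
print("Within tolerance:", within_tolerance)
```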
---