# Onboarding IMDB Movie Reviews for NLP Explainability

This notebook walks through onboarding a model artifact that predicts the sentiment of IMDB movie reviews to Fiddler. Fiddler can explain complex models with a variety of input types, including unstructured text, images, and multi-modal data.

Fiddler is the pioneer in enterprise Model Performance Management (MPM), offering a unified platform that enables Data Science, MLOps, Risk, Compliance, Analytics, and LOB teams to **monitor, explain, analyze, and improve ML deployments at enterprise scale**. 
Obtain contextual insights at any stage of the ML lifecycle, improve predictions, increase transparency and fairness, and optimize business revenue.

---

You can experience Fiddler's NLP monitoring ***in minutes*** by following these four quick steps:

1. Connect to Fiddler
2. Upload a baseline dataset
3. Upload a model package directory containing the **1) package.py and 2) model artifact**
4. Explain your model

# 0. Imports

In [1]:
!pip install -q fiddler-client==1.8.1

import fiddler as fdl
import pandas as pd
import yaml
import datetime
import time
from IPython.display import clear_output

print(f"Running Fiddler client version {fdl.__version__}")

Running Fiddler client version 1.8.1


# 1. Connect to Fiddler

Before you can add information about your model to Fiddler, you'll need to connect using our Python client.

---

**We need a few pieces of information to get started.**
1. The URL you're using to connect to Fiddler
2. Your organization ID
3. Your authorization token

The latter two of these can be found by pointing your browser to your Fiddler URL and navigating to the **Settings** page.

In [2]:
URL = 'https://preprod.fiddler.ai'  # Make sure to include the full URL (including https://).
ORG_ID = 'preprod'
AUTH_TOKEN = '6lxdgyAZ3B2PNFxR3GZ7N4ao6As6UvicPQdamdaU13g'
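
Hardcoding a token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the three values could instead be read from environment variables. The variable names `FIDDLER_URL`, `FIDDLER_ORG_ID`, and `FIDDLER_AUTH_TOKEN` and the helper `load_fiddler_settings` are hypothetical and not part of the Fiddler client.

```python
import os


def load_fiddler_settings(env=None):
    """Read connection settings from environment variables.

    The variable names below are hypothetical; use whatever names
    fit your own environment.
    """
    env = os.environ if env is None else env
    names = ("FIDDLER_URL", "FIDDLER_ORG_ID", "FIDDLER_AUTH_TOKEN")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    url = env["FIDDLER_URL"]
    # The notebook asks for the full URL, including the scheme.
    if not url.startswith("https://"):
        raise ValueError("FIDDLER_URL must be the full URL, including https://")
    return {
        "url": url,
        "org_id": env["FIDDLER_ORG_ID"],
        "auth_token": env["FIDDLER_AUTH_TOKEN"],
    }
```

The returned keys match the keyword arguments used in this notebook, so the result could be passed as `fdl.FiddlerApi(**load_fiddler_settings())`.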

Now just run the following code block to connect the client to your Fiddler environment.

In [3]:
client = fdl.FiddlerApi(
    url=URL,
    org_id=ORG_ID,
    auth_token=AUTH_TOKEN
)

Once you connect, you can create a new project by specifying a unique project ID in the client's [create_project](https://docs.fiddler.ai/reference/clientcreate_project) function.

In [4]:
PROJECT_ID = 'imdb_explainability'

if PROJECT_ID not in client.list_projects():
    print(f'Creating project: {PROJECT_ID}')
    client.create_project(PROJECT_ID)
else:
    print(f'Project: {PROJECT_ID} already exists')

Project: imdb_explainability already exists


# 2. Upload a baseline dataset

In this example, we'll be considering the case where we have a model that **predicts sentiment for movie reviews**.  
  
**Fiddler needs a small sample of data that can serve as a baseline**.


---


*For more information on how to design a baseline dataset, [click here](https://docs.fiddler.ai/docs/designing-a-baseline-dataset).*

In [6]:
PATH_TO_BASELINE_CSV = 'https://media.githubusercontent.com/media/fiddler-labs/fiddler-examples/main/quickstart/data/imdb_baseline.csv'

baseline_df = pd.read_csv(PATH_TO_BASELINE_CSV)
baseline_df

| | sentence | polarity | sentiment |
|---|---|---|---|
| 0 | A real blow-up of the film literally. This Bri... | False | 0.190378 |
| 1 | I only wish that Return of the Jedi, have been... | True | 0.282132 |
| 2 | "I like cheap perfume better; it doesn't last ... | True | 0.238484 |
| 3 | On the eighth day God created Georges. But the... | True | 0.650361 |
| 4 | No, this is not no Alice fairy tale my friends... | True | 0.859355 |
| ... | ... | ... | ... |
| 24995 | Boris Karloff and Bela Lugosi made many films ... | True | 0.845252 |
| 24996 | As horror fans we all know that blind rentals ... | False | 0.282349 |
| 24997 | While visiting Romania with his CIA dad, Tony(... | True | 0.730350 |
| 24998 | This one was marred by potentially great match... | False | 0.619230 |


Fiddler uses this baseline dataset to keep track of important information about your data.
  
This includes **data types**, **data ranges**, and **unique values** for categorical variables.

---

You can construct a [DatasetInfo](https://docs.fiddler.ai/reference/fdldatasetinfo) object to be used as **a schema for keeping track of this information** by running the following code block.

In [7]:
dataset_info = fdl.DatasetInfo.from_dataframe(baseline_df, max_inferred_cardinality=100)
dataset_info

| | column | dtype | count(possible_values) | is_nullable | value_range |
|---|---|---|---|---|---|
| 0 | sentence | STRING | | False | |
| 1 | polarity | BOOLEAN | 2.0 | False | |
| 2 | sentiment | FLOAT | | False | 0.002 - 0.994 |
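
To make the schema idea concrete, here is a small pure-Python sketch of the kind of per-column summary shown above: a type name, nullability, a value range for numeric columns, and the unique values for low-cardinality columns. It is not Fiddler's implementation. The function `summarize_column` and the `CATEGORY` label are hypothetical, and treating `max_inferred_cardinality` as the cutoff below which a string column counts as categorical is an assumption.

```python
def summarize_column(values, max_inferred_cardinality=100):
    """Summarize one column: type name, nullability, range, unique values."""
    present = [v for v in values if v is not None]
    kinds = {type(v) for v in present}
    summary = {
        "dtype": None,
        "is_nullable": len(present) < len(values),
        "value_range": None,
        "possible_values": None,
    }
    if kinds == {bool}:
        summary["dtype"] = "BOOLEAN"
        summary["possible_values"] = [False, True]
    elif kinds and kinds <= {int, float}:
        summary["dtype"] = "FLOAT" if float in kinds else "INTEGER"
        summary["value_range"] = (min(present), max(present))
    else:
        unique = sorted({str(v) for v in present})
        # Assumption: few unique strings means the column is categorical.
        if len(unique) <= max_inferred_cardinality:
            summary["dtype"] = "CATEGORY"
            summary["possible_values"] = unique
        else:
            summary["dtype"] = "STRING"
    return summary
```

Under this reading, a free-text column such as `sentence` has far more than 100 unique values and stays a plain string, which is consistent with the `STRING` dtype in the table above.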


Then use the client's [upload_dataset](https://docs.fiddler.ai/reference/clientupload_dataset) function to send this information to Fiddler.
  
*Just include:*
1. A unique dataset ID
2. The baseline dataset as a pandas DataFrame
3. The `DatasetInfo` object you just created

In [9]:
DATASET_ID = 'imdb_baseline'

client.upload_dataset(
    project_id=PROJECT_ID,
    dataset_id=DATASET_ID,
    dataset={
        'baseline': baseline_df
    },
    info=dataset_info
)

Within your Fiddler environment's UI, you should now be able to see the newly created dataset within your project.

# 3. Upload your model package

Now it's time to upload your model package to Fiddler.  To complete this step, we need to ensure we have the assets required to load the model and a package.py script that tells Fiddler how to call the model's prediction endpoint.  It doesn't matter what this directory is called, but for this example we will call it **/model**.  We also need a few subdirectories to house other assets needed to load the model.

In [10]:
import os

# makedirs creates the parent directories too; exist_ok=True makes the cell safe to re-run.
os.makedirs("model/saved_model/variables", exist_ok=True)

***Your model package directory will need to contain:***
1. A **package.py** file which explains to Fiddler how to invoke your model's prediction endpoint
2. The **model artifact** and any other files required to load the model
3. A **requirements.txt** file specifying which Python libraries are needed by package.py

---

### 3.1.a  Create the **model_info** object 

The [model_info](https://docs.fiddler.ai/reference/fdlmodelinfo) object tells Fiddler the model's task, its target, its input features, and its outputs.


In [10]:
target = 'polarity'
features = ['sentence']
output = ['sentiment']

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(PROJECT_ID, DATASET_ID),
    target=target,
    features=features,
    input_type=fdl.ModelInputType.TEXT,
    model_task=fdl.ModelTask.BINARY_CLASSIFICATION,
    outputs=output,
    display_name='IMDB Sentiment Classifier',
    description='imdb rnn sentiment classifier',
    preferred_explanation_method=fdl.ExplanationMethod.IG_FLEX
)

model_info

Using binary_classification_threshold=0.5


| | column | dtype | count(possible_values) | is_nullable | value_range |
|---|---|---|---|---|---|
| 0 | polarity | BOOLEAN | 2 | False | |

| | column | dtype | count(possible_values) | is_nullable | value_range |
|---|---|---|---|---|---|
| 0 | sentence | STRING | | False | |

| | column | dtype | count(possible_values) | is_nullable | value_range |
|---|---|---|---|---|---|
| 0 | sentiment | FLOAT | | False | 0 - 0 |
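
The `ModelInfo` above requests `fdl.ExplanationMethod.IG_FLEX`. Assuming this refers to a variant of integrated gradients, the sketch below shows the core idea on a toy model. Gradients are averaged along a straight path from a baseline to the input, and the resulting attributions should sum to the difference between the two model outputs. The toy model, the function names, and the step-doubling loop are all illustrative assumptions. How Fiddler uses its own step-count and error settings internally is not confirmed by this notebook.

```python
import math


def model(x, weights):
    """Toy differentiable model: sigmoid of a weighted sum."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x, weights):
    """Analytic gradient of the toy model with respect to x."""
    s = model(x, weights)
    return [s * (1.0 - s) * w for w in weights]


def integrated_gradients(x, baseline, weights, steps):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, weights)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def explain(x, baseline, weights, start_steps=32, max_steps=4096, max_error_pct=1.0):
    """Double the step count until attributions sum to the output difference."""
    target = model(x, weights) - model(baseline, weights)
    steps = start_steps
    while True:
        attributions = integrated_gradients(x, baseline, weights, steps)
        error_pct = 100.0 * abs(sum(attributions) - target) / max(abs(target), 1e-12)
        if error_pct <= max_error_pct or steps * 2 > max_steps:
            return attributions, steps, error_pct
        steps *= 2
```

For a text model, the same computation would run on embedding vectors rather than raw token IDs, because token IDs are discrete and have no meaningful gradient.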


### 3.1.b Add Model Information to Fiddler

In [11]:
MODEL_ID = 'imdb_rnn'

client.add_model(
    project_id=PROJECT_ID,
    dataset_id=DATASET_ID,
    model_id=MODEL_ID,
    model_info=model_info
)

### 3.2 Create the **package.py** file

The contents of the cell below will be written into our ***package.py*** file.  This is the step that varies most with model type, framework, and use case.  The model's ***package.py*** file also allows for preprocessing transformations and other processing before the model's prediction endpoint is called.  For more information about how to create the ***package.py*** file for a variety of model tasks and frameworks, please reference the [Uploading a Model Artifact](https://docs.fiddler.ai/docs/uploading-a-model-artifact#packagepy-script) section of the Fiddler product documentation.

In [11]:
%%writefile model/package.py

import pathlib
import pickle
import re

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences

import fiddler as fdl

# Name the output of your model here - this must match the `outputs` passed to ModelInfo
OUTPUT_COL = ['sentiment']

# This is the name of the input feature of your TensorFlow model
FEATURE_LABEL = 'sentence'

MODEL_ARTIFACT_PATH = 'saved_model'

TOKENIZER_PATH = 'tokenizer.pkl'

ATTRIBUTABLE_LAYER_NAMES = EMBEDDING_NAMES = ['embedding']

MAX_SEQ_LENGTH = 150


def _pad(seq):
    return pad_sequences(seq, MAX_SEQ_LENGTH, padding='post', truncating='post')


class FiddlerModel:
    def __init__(self):
        """Model deserialization and initialization goes here.  Any additional serialized preprocessing
        transformations would be initialized as well - e.g. tokenizers, embedding lookups, etc.
        """
        self.model_dir = pathlib.Path(__file__).parent
        self.model = tf.keras.models.load_model(
            str(self.model_dir / MODEL_ARTIFACT_PATH)
        )

        # Construct sub-models (for each ATTRIBUTABLE_LAYER_NAME)
        # if not possible to attribute directly to the input (e.g. embeddings).
        self.att_sub_models = {
            att_layer: Model(
                self.model.inputs, outputs=self.model.get_layer(att_layer).output
            )
            for att_layer in ATTRIBUTABLE_LAYER_NAMES
        }

        with open(str(self.model_dir / TOKENIZER_PATH), 'rb') as f:
            self.tokenizer = pickle.load(f)

        self.grad_model = self._define_model_grads()

    def get_settings(self):

        # from ig_flex_exec.py
        # DEFAULT_START_STEPS = 32
        # DEFAULT_MAX_STEPS = 2048
        # DEFAULT_MAX_ERROR_PCT = 1.0

        return {
            'ig_start_steps': 32,  # 32
            'ig_max_steps': 4096,  # 2048
            'ig_min_error_pct': 5.0,  # 1.0
        }

    def transform_to_attributable_input(self, input_df):
        """This method is called by the platform and is responsible for transforming the input dataframe
        to the upstream-most representation of model inputs that belongs to a continuous vector-space.
        For this model the raw inputs are discrete token IDs, so the first attributable layer is the
        output of the embedding layer (see ATTRIBUTABLE_LAYER_NAMES).
        """
        transformed_input = self._transform_input(input_df)

        return {
            att_layer: att_sub_model.predict(transformed_input)
            for att_layer, att_sub_model in self.att_sub_models.items()
        }

    def get_ig_baseline(self, input_df):
        """This method is used to generate the baseline against which to compare the input.
        It accepts a pandas DataFrame object containing rows of raw feature vectors that
        need to be explained (in case e.g. the baseline must be sized according to the explain point).
        Must return a pandas DataFrame that can be consumed by the predict method described earlier.
        """
        baseline_df = input_df.copy()
        baseline_df[FEATURE_LABEL] = input_df[FEATURE_LABEL].apply(lambda x: '')

        return baseline_df

    def _transform_input(self, input_df):
        """Helper function that accepts a pandas DataFrame object containing rows of raw feature vectors.
        The output of this method can be any Python object. This function can also
        be used to deserialize complex data types stored in dataset columns (e.g. arrays, or images
        stored in a field in UTF-8 format).
        """
        sequences = self.tokenizer.texts_to_sequences(input_df[FEATURE_LABEL])
        sequences_matrix = sequence.pad_sequences(
            sequences, maxlen=MAX_SEQ_LENGTH, padding='post'
        )
        return sequences_matrix.tolist()

    def predict(self, input_df):
        """Basic predict wrapper.  Takes a DataFrame of input features and returns a DataFrame
        of predictions.
        """
        transformed_input = self._transform_input(input_df)
        pred = self.model.predict(transformed_input)
        return pd.DataFrame(pred, columns=OUTPUT_COL)

    def compute_gradients(self, attributable_input):
        """This method computes gradients of the model output wrt to the differentiable input.
        If there are embeddings, the attributable_input should be the output of the embedding
        layer. In the backend, this method receives the output of the transform_to_attributable_input()
        method. This must return an array of dictionaries, where each entry of the array is the attribution
        for an output. As in the example provided, in case of single output models, this is an array with
        single entry. For the dictionary, the key is the name of the input layer and the values are the
        attributions.
        """
        gradients_by_output = []
        attributable_input_tensor = {
            k: tf.identity(v) for k, v in attributable_input.items()
        }
        gradients_dic_tf = self._gradients_input(attributable_input_tensor)
        gradients_dic_numpy = dict(
            [key, np.asarray(value)] for key, value in gradients_dic_tf.items()
        )
        gradients_by_output.append(gradients_dic_numpy)
        return gradients_by_output

    def _gradients_input(self, x):
        """
        Function to Compute gradients.
        """
        with tf.GradientTape() as tape:
            tape.watch(x)
            preds = self.grad_model(x)

        grads = tape.gradient(preds, x)

        return grads

    def _define_model_grads(self):
        """
        Define a differentiable model, cut from the Embedding Layers.
        This will take as input what the transform_to_attributable_input function defined.
        """
        model = tf.keras.models.load_model(str(self.model_dir / 'saved_model'))

        for index, name in enumerate(EMBEDDING_NAMES):
            model.layers.remove(model.get_layer(name))
            model.layers[index]._batch_input_shape = (None, 150, 64)
            model.layers[index]._dtype = 'float32'
            model.layers[index]._name = name

        new_model = tf.keras.models.model_from_json(model.to_json())

        for layer in new_model.layers:
            try:
                layer.set_weights(self.model.get_layer(name=layer.name).get_weights())
            except Exception:
                # Skip layers with no matching layer (or matching weights) in the original model.
                pass

        return new_model

    #  Here's a project_attributions that works for a different single text input model

    # input_df: explain_point df from raw feature space (model_info)
    # attributions: array[<output_dims>] of dict{tensor_names: }
    #     of array[tensor_dims...]
    # returns: dict{output_names: } of feature attributions described in
    #     GEM [generalized explanation markup].
    def project_attributions(self, input_df, attributions):
        explanations_by_output = {}

        for output_field_index, att in enumerate(attributions):
            segments = re.split(
                r'([ ' + self.tokenizer.filters + '])', input_df.iloc[0][FEATURE_LABEL]
            )

            unpadded_tokens = [
                self.tokenizer.texts_to_sequences([x])[0]
                for x in input_df[FEATURE_LABEL].values
            ]

            padded_tokens = _pad(unpadded_tokens)

            word_tokens = self.tokenizer.sequences_to_texts(
                [[x] for x in padded_tokens[0]]
            )

            # Note - summing over attributions in the embedding direction
            word_attributions = np.sum(att['embedding'][-len(word_tokens) :], axis=1)

            i = 0
            final_attributions = []
            final_segments = []
            for segment in segments:
                if segment != '':  # dump empty tokens
                    final_segments.append(segment)
                    seg_low = segment.lower()
                    if len(word_tokens) > i and seg_low == word_tokens[i]:
                        final_attributions.append(word_attributions[i])
                        i += 1
                    else:
                        final_attributions.append(0)

            gem_text = fdl.gem.GEMText(
                feature_name=FEATURE_LABEL,
                text_segments=final_segments,
                text_attributions=final_attributions,
            )

            gem_container = fdl.gem.GEMContainer(contents=[gem_text])

            explanations_by_output[
                OUTPUT_COL[output_field_index]
            ] = gem_container.render()

        return explanations_by_output


def get_model():
    return FiddlerModel()

Writing model/package.py
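
The `project_attributions` method above is the least obvious part of the package. The sketch below isolates its core logic in plain Python. It collapses the embedding dimension to one score per token, then walks the original text and hands each score to the matching word, leaving separators at zero. The function `project_to_segments` and its `separators` argument are hypothetical simplifications. The real method takes its separators from the Keras tokenizer's `filters` and works on padded sequences.

```python
import re


def project_to_segments(text, tokens, embedding_attributions, separators=" .,!?"):
    """Map per-token embedding attributions back onto segments of the raw text."""
    # Collapse the embedding dimension: one score per token.
    token_scores = [sum(row) for row in embedding_attributions]
    # The capture group keeps the separators so the text can be rebuilt exactly.
    pattern = "([" + re.escape(separators) + "])"
    segments, scores, i = [], [], 0
    for segment in re.split(pattern, text):
        if segment == "":
            continue  # re.split emits empty strings between adjacent separators
        segments.append(segment)
        if i < len(tokens) and segment.lower() == tokens[i]:
            scores.append(token_scores[i])
            i += 1
        else:
            scores.append(0.0)  # separators and unmatched words get no credit
    return segments, scores
```

Because separators are kept as their own segments, joining the returned segments reproduces the input text, which is what lets a UI highlight words in place.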


### 3.3  Ensure your model's artifact is in the **/model** directory

Make sure the model artifact is in the model package directory, along with a *requirements.txt* file listing its dependencies.  The following cell will download this model's binary file, other required assets and our requirements.txt file into our */model* directory.

In [12]:
import urllib.request
urllib.request.urlretrieve("https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/models/imdb/tokenizer.pkl", "model/tokenizer.pkl")
urllib.request.urlretrieve("https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/models/requirements.txt", "model/requirements.txt")
urllib.request.urlretrieve("https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/models/imdb/saved_model/keras_metadata.pb", "model/saved_model/keras_metadata.pb")
urllib.request.urlretrieve("https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/models/imdb/saved_model/saved_model.pb", "model/saved_model/saved_model.pb")
urllib.request.urlretrieve("https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/models/imdb/saved_model/variables/variables.data-00000-of-00001", "model/saved_model/variables/variables.data-00000-of-00001")
urllib.request.urlretrieve("https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/models/imdb/saved_model/variables/variables.index", "model/saved_model/variables/variables.index")

('model/saved_model/variables/variables.index',
 <http.client.HTTPMessage at 0x7fbbe53d12e0>)

### 3.4 Define Model Parameters 

Fiddler is flexible about how a model artifact is onboarded for explainability.  Each model runs in its own container with the libraries it needs, as defined in the requirements.txt file.  The container is built from a base image, and we can specify the compute resources our model requires.  This is done by creating our [DEPLOYMENT_PARAMETERS](https://docs.fiddler.ai/reference/fdldeploymentparams) object.

In [13]:
DEPLOYMENT_PARAMETERS = fdl.DeploymentParams(
    image_uri="md-base/python/deep-learning:1.0.0",
    cpu=1000,
    memory=1024,
    replicas=1
)

### 3.5 Upload the model package directory

Once the model's artifact is in the */model* directory along with the **package.py** file and requirements.txt, the model package directory can be uploaded to Fiddler.
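
Before uploading, it can help to confirm that the directory is complete, since a missing or empty file is easier to diagnose locally than after the upload. The following is a minimal sketch that uses the file names from this notebook. The helper `missing_package_files` is hypothetical and not part of the Fiddler client.

```python
from pathlib import Path

# Files this notebook writes or downloads into the model package directory.
REQUIRED_FILES = (
    "package.py",
    "requirements.txt",
    "tokenizer.pkl",
    "saved_model/keras_metadata.pb",
    "saved_model/saved_model.pb",
    "saved_model/variables/variables.data-00000-of-00001",
    "saved_model/variables/variables.index",
)


def missing_package_files(model_dir, required=REQUIRED_FILES):
    """Return the required files that are absent or empty in model_dir."""
    root = Path(model_dir)
    missing = []
    for relative in required:
        path = root / relative
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing
```

Calling `missing_package_files("model")` should return an empty list when the package is ready to upload.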

In [14]:
client.add_model_artifact(model_dir='model/', project_id=PROJECT_ID, model_id=MODEL_ID, deployment_params=DEPLOYMENT_PARAMETERS)

Within your Fiddler environment's UI, you should now be able to see the newly created model.

# 4. Explain your model

**You're all done!**
  
Now just head to your Fiddler environment's UI and check out NLP explainability for this model.  You can also run the explanation from the Fiddler client.

In [None]:
# Slice of the baseline to run the explanation on (the second row)
explain_df = baseline_df[1:2]
explain_df

In [None]:
explanation = client.run_explanation(
    project_id=PROJECT_ID,
    model_id=MODEL_ID,
    dataset_id=DATASET_ID,
    df=explain_df
)

In [None]:
explanation



---


**Questions?**  
  
Check out [our docs](https://docs.fiddler.ai/) for a more detailed explanation of what Fiddler has to offer.

If you're still looking for answers, fill out a ticket on [our support page](https://fiddlerlabs.zendesk.com/) and we'll get back to you shortly.