# TensorFlow Tutorial for Text Data with Integrated Gradients

## Initialize Fiddler Client
We begin this section as usual by establishing a connection to our
Fiddler instance. We can establish this connection either by specifying 
our credentials directly, or by utilizing our `fiddler.ini` file. More
information can be found in the [setup](https://github.com/fiddler-labs/fiddler-samples/blob/master/content_root/tutorial/00%20Install%20%26%20Setup.ipynb) section.

In [None]:
import fiddler as fdl

# client = fdl.FiddlerApi(url=url, org_id=org_id, auth_token=auth_token)
client = fdl.FiddlerApi()
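
For readers unfamiliar with ini files, the sketch below shows how such a file could be parsed with Python's standard `configparser`. The `[FIDDLER]` section name, the key names, and the example values are assumptions chosen to mirror the `url`, `org_id`, and `auth_token` arguments above; check the setup notebook for the exact layout the client expects.

```python
import configparser

# Hypothetical contents of a fiddler.ini file. The section and key names
# are assumptions that mirror the arguments of fdl.FiddlerApi.
EXAMPLE_INI = """
[FIDDLER]
url = https://example.fiddler.ai
org_id = my_org
auth_token = REPLACE_ME
"""

def read_credentials(ini_text, section="FIDDLER"):
    """Parse connection settings from ini-formatted text into a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser[section])

credentials = read_credentials(EXAMPLE_INI)
print(sorted(credentials))
```

A dict like this could then be unpacked into the explicit constructor call, for example `fdl.FiddlerApi(**credentials)`.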

In [None]:
project_id = 'tf_text'
dataset_id = 'imdb_rnn'
model_id = 'imdb_rnn'

## Create Project

Here we will create a project, a convenient container for housing the models and datasets associated with a given ML use case.

In [None]:
# Creating our project using project_id
if project_id not in client.list_projects():
    client.create_project(project_id)

## Load Dataset
Here we load our baseline dataset from a CSV file called `imdb_rnn.csv`. We will
also create a schema from this data.

In [None]:
import pandas as pd
df = pd.read_csv('/app/fiddler_samples/samples/datasets/imdb_rnn/imdb_rnn.csv')

### Data Cleaning

In [None]:
import re

In [None]:
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub(' ', text)

def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)
    sentence = re.sub(r'\s+', ' ', sentence)    # Removing multiple spaces

    return sentence

In [None]:
processed_sentences = []
sentences = list(df['sentence'])
for sen in sentences:
    processed_sentences.append(preprocess_text(sen))

df['sentence'] = processed_sentences
df.head()
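
To see what the cleaning step does to a single review, here is a small self-contained check that uses the same two regular expressions. The sample string is made up for illustration.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def clean(text):
    """Replace HTML tags with a space, then collapse runs of whitespace."""
    without_tags = TAG_RE.sub(' ', text)
    return re.sub(r'\s+', ' ', without_tags)

sample = 'Great<br /><br />movie!  Loved it.'
cleaned = clean(sample)
print(cleaned)
```

Note that a tag at the very start or end of a review leaves a single leading or trailing space, since the text is not stripped afterwards.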

## Upload Dataset
Before uploading a model, you first need to upload a sample of the model's 
inputs, targets, and any additional metadata that might be useful for model analysis. 
This data sample helps us (among other things) infer the model schema, as well as the 
data type and value range of each feature.

In [None]:
df_schema = fdl.DatasetInfo.from_dataframe(df, max_inferred_cardinality=10)

In [None]:
if dataset_id not in client.list_datasets(project_id):
    upload_result = client.upload_dataset(
        project_id=project_id,
        dataset={'train': df}, 
        dataset_id=dataset_id,
        info=df_schema)
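
As a rough intuition for the `max_inferred_cardinality` argument, the toy rule below treats a column with only a few distinct values as categorical and everything else as free text. This is an assumed reading of the parameter name, not Fiddler's actual inference code, and the column values are invented.

```python
def infer_column_kind(values, max_inferred_cardinality):
    """Toy rule: few distinct values -> 'category', otherwise 'string'."""
    distinct = set(values)
    if len(distinct) <= max_inferred_cardinality:
        return 'category'
    return 'string'

# Hypothetical column contents, for illustration only.
polarity = ['positive', 'negative'] * 10
sentences = ['review number %d' % i for i in range(20)]

kinds = {
    'polarity': infer_column_kind(polarity, 10),
    'sentence': infer_column_kind(sentences, 10),
}
print(kinds)
```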

## Create Model Schema
As you may have noticed, the dataset upload step did not ask for the model's 
features and targets, or for any model-specific information. That's because 
multiple models can be linked to a given dataset schema. We therefore need a 
separate step that creates the model schema, which tells Fiddler which features 
are relevant to the model and what the model task is. Here you can specify the 
input features, the target column, decision columns, and metadata columns, as well as the type of model.

In [None]:
target = 'polarity'
feature_columns = ['sentence']
output = 'sentiment'

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(project_id, dataset_id),
    target=target,
    features=feature_columns,
    input_type=fdl.ModelInputType.TEXT,
    model_task=fdl.ModelTask.BINARY_CLASSIFICATION,
    outputs=output,
    display_name='Text IG',
    description='this is a tensorflow model using text data and IG enabled from tutorial',
    preferred_explanation_method=fdl.ExplanationMethod.IG_FLEX
)
model_info

## Install Tensorflow if necessary

Currently, we support scikit-learn version 0.21.2 and TensorFlow version 2.5.
If you have another version, please contact Fiddler for assistance.

In [None]:
import tensorflow as tf

assert tf.__version__=='2.5.0', 'Please change tensorflow version to 2.5.0'

## Train Model
Build and train your model.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

train_input = df['sentence']

target = 'polarity'
train_target = df[target]
train_target = le.fit_transform(train_target).reshape(-1,1)
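
The label encoding step can be pictured with plain Python: each distinct label is assigned an integer according to its sorted order, and the result is arranged as one value per row. The label strings below are hypothetical, since the actual contents of the `polarity` column are not shown here.

```python
def fit_label_encoding(labels):
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

labels = ['positive', 'negative', 'negative', 'positive']  # hypothetical values
mapping = fit_label_encoding(labels)

# One-element rows mimic the column shape produced by reshape(-1, 1).
encoded = [[mapping[label]] for label in labels]
print(mapping, encoded)
```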

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

vocab_size = 1000
max_seq_length = 150
tok = Tokenizer(num_words=vocab_size)
tok.fit_on_texts(train_input)
sequences = tok.texts_to_sequences(train_input)
sequences_matrix = sequence.pad_sequences(sequences, maxlen=max_seq_length, padding='post')
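
The following pure-Python sketch mimics, in simplified form, what the tokenizer and padding calls above produce. It splits on whitespace only (the Keras `Tokenizer` also filters punctuation), ranks words by frequency starting at 1, drops words whose rank is not below `num_words`, keeps the last `maxlen` ids of long sequences, and right-pads short ones with zeros. The helper names are made up for this illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; rank 1 is the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so ties keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their rank, dropping unknown words and ranks >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_post(seq, maxlen):
    """Keep the last maxlen ids, then right-pad with zeros."""
    kept = seq[-maxlen:]
    return kept + [0] * (maxlen - len(kept))

texts = ['good movie good', 'bad movie']
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=3)
padded = [pad_post(seq, maxlen=4) for seq in sequences]
print(word_index, sequences, padded)
```

Because id 0 is reserved for padding, `num_words=3` keeps only the two most frequent words in this toy corpus.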

In [None]:
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding, GlobalAveragePooling1D
from tensorflow.keras.models import Model

def RNN():
    inputs = Input(name='inputs', shape=[max_seq_length])
    layer = Embedding(vocab_size, 64, input_length=max_seq_length)(inputs)
    
    
    layer = LSTM(64, return_sequences=True)(layer)
    layer = LSTM(32, return_sequences=True)(layer)
    layer = GlobalAveragePooling1D()(layer)
    layer = Dense(256, name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.2)(layer)
    layer = Dense(1, name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    
    model = Model(inputs=inputs, outputs=layer)
    
    return model

In [None]:
from tensorflow.keras.optimizers import RMSprop

model = RNN()
model.summary()
model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])

In [None]:
model.fit(sequences_matrix, train_target, batch_size=128, epochs=5,
          validation_split=0.1, callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.001)]);

## Save Model
Next, we need to save the model along with any preprocessing steps applied 
to the input features (for example, a categorical encoder or a tokenizer).

In [None]:
import pathlib
import pickle

In [None]:
# create model files 
model_dir = pathlib.Path(model_id)
model_dir.mkdir(exist_ok=True)

# save model
model.save(model_dir/'saved_model')

# save tokenizer
with open(model_dir / 'tokenizer.pkl', 'wb') as tok_file:
    pickle.dump(tok, tok_file)
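
Since `package.py` later reloads the tokenizer with `pickle.load`, it can be worth confirming that a pickled object survives a round trip. The sketch below does this in a temporary directory, with a plain dict standing in for the fitted tokenizer.

```python
import pathlib
import pickle
import tempfile

# A plain dict stands in for the fitted tokenizer in this sketch.
stand_in_tokenizer = {'word_index': {'good': 1, 'movie': 2}}

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / 'tokenizer.pkl'
    with open(path, 'wb') as f:
        pickle.dump(stand_in_tokenizer, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == stand_in_tokenizer)
```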


## Write `package.py` and related wrappers

### Import related wrappers

We need to copy the GEM wrapper, which is used to display the attributions, into the model directory. This file is stored in the `utils` directory.

In [None]:
import shutil
shutil.copy('utils/GEM.py', model_dir)

### Write `package.py` file

A wrapper is needed between Fiddler and the model. It translates inputs and 
outputs between what the model expects and what Fiddler is able to consume. 
This file contains functions to transform the input, generate the 
baseline, and compute the attributions. More information can be found [here](https://api.fiddler.ai/#package-py/).

In [None]:
%%writefile imdb_rnn/package.py

import pickle
import pathlib
import re
import numpy as np
import pandas as pd
import tensorflow as tf
from .GEM import GEMContainer, GEMSimple, GEMText

from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Name the output of your model here - this must match the output name given in the model schema (ModelInfo)
OUTPUT_COL = ['sentiment']

# This is the name of the text input column of your TensorFlow model
FEATURE_LABEL = 'sentence'

MODEL_ARTIFACT_PATH = 'saved_model'

TOKENIZER_PATH = 'tokenizer.pkl'

ATTRIBUTABLE_LAYER_NAMES = EMBEDDING_NAMES = ['embedding']

MAX_SEQ_LENGTH = 150

def _pad(seq):
    return pad_sequences(seq, MAX_SEQ_LENGTH,
                         padding='post', truncating='post')

class FiddlerModel:
    def __init__(self):
        """ Model deserialization and initialization goes here.  Any additional serialized preprocessing
            transformations would be initialized as well - e.g. tokenizers, embedding lookups, etc.
        """
        self.model_dir = pathlib.Path(__file__).parent
        self.model = tf.keras.models.load_model(str(self.model_dir / MODEL_ARTIFACT_PATH))

        # Construct sub-models (for each ATTRIBUTABLE_LAYER_NAME)
        # if not possible to attribute directly to the input (e.g. embeddings).
        self.att_sub_models = {att_layer : Model(self.model.inputs,
                    outputs=self.model.get_layer(att_layer).output)
                    for att_layer in ATTRIBUTABLE_LAYER_NAMES}
        
        with open(str(self.model_dir / TOKENIZER_PATH), 'rb') as f:
            self.tokenizer = pickle.load(f)
        
        self.grad_model = self._define_model_grads()
        
    def get_settings(self):
        
        return {'ig_start_steps': 32,  # 32
                'ig_max_steps': 4096,  # 2048
                'ig_min_error_pct':5.0 # 1.0
               }
        
    def transform_to_attributable_input(self, input_df):
        """ This method is called by the platform and is responsible for transforming the input dataframe
            to the upstream-most representation of model inputs that belongs to a continuous vector-space.
            Token ids are discrete, so for this model (as for most NLP models with embedding layers)
            the attributable input is the output of the embedding layer rather than the raw model input.
        """
        transformed_input = self._transform_input(input_df)

        return {att_layer : att_sub_model.predict(transformed_input)
                    for att_layer, att_sub_model in self.att_sub_models.items()}

    def get_ig_baseline(self, input_df):
        """ This method is used to generate the baseline against which to compare the input. 
            It accepts a pandas DataFrame object containing rows of raw feature vectors that 
            need to be explained (in case e.g. the baseline must be sized according to the explain point).
            Must return a pandas DataFrame that can be consumed by the predict method described earlier.
        """
        baseline_df = input_df.copy()
        baseline_df[FEATURE_LABEL] = input_df[FEATURE_LABEL].apply(lambda x: '')

        return baseline_df

    def _transform_input(self, input_df):
        """ Helper function that accepts a pandas DataFrame object containing rows of raw feature vectors. 
            The output of this method can be any Python object. This function can also 
            be used to deserialize complex data types stored in dataset columns (e.g. arrays, or images 
            stored in a field in UTF-8 format).
        """
        sequences = self.tokenizer.texts_to_sequences(input_df[FEATURE_LABEL])
        sequences_matrix = sequence.pad_sequences(sequences,
                                                  maxlen=MAX_SEQ_LENGTH,
                                                  padding='post')
        return sequences_matrix.tolist()
    
    
    def predict(self, input_df):
        """ Basic predict wrapper.  Takes a DataFrame of input features and returns a DataFrame
            of predictions.
        """
        transformed_input = self._transform_input(input_df)
        pred = self.model.predict(transformed_input)
        return pd.DataFrame(pred, columns=OUTPUT_COL)
    
    def compute_gradients(self, attributable_input):
        """ This method computes gradients of the model output with respect to the differentiable input.
            If there are embeddings, the attributable_input should be the output of the embedding 
            layer. In the backend, this method receives the output of the transform_to_attributable_input() 
            method. This must return an array of dictionaries, where each entry of the array is the attribution 
            for an output. As in the example provided, in case of single output models, this is an array with 
            single entry. For the dictionary, the key is the name of the input layer and the values are the 
            attributions.
        """
        gradients_by_output = []
        attributable_input_tensor = {k: tf.identity(v) for k, v in attributable_input.items()}
        gradients_dic_tf = self._gradients_input(attributable_input_tensor)
        gradients_dic_numpy = {key: np.asarray(value) for key, value in gradients_dic_tf.items()}
        gradients_by_output.append(gradients_dic_numpy)
        return gradients_by_output    
    
    def _gradients_input(self, x):
        """
        Function to Compute gradients.
        """
        with tf.GradientTape() as tape:
            tape.watch(x)
            preds = self.grad_model(x)

        grads = tape.gradient(preds, x)

        return grads


    def _define_model_grads(self):
        """
        Define a differentiable model, cut from the Embedding Layers. 
        This will take as input what the transform_to_attributable_input function defined.
        """
        model = tf.keras.models.load_model(str(self.model_dir / MODEL_ARTIFACT_PATH))

        for index, name in enumerate(EMBEDDING_NAMES):
            model.layers.remove(model.get_layer(name))
            model.layers[index]._batch_input_shape = (None, MAX_SEQ_LENGTH, 64)
            model.layers[index]._dtype = 'float32'
            model.layers[index]._name = name

        new_model = tf.keras.models.model_from_json(model.to_json())

        for layer in new_model.layers:
            try:
                layer.set_weights(self.model.get_layer(name=layer.name).get_weights())
            except Exception:
                # Skip layers whose weights cannot be copied (e.g. the replaced input layer).
                pass
        
        return new_model


    def project_attributions(self, input_df, attributions):
        explanations_by_output = {}

        for output_field_index, att in enumerate(attributions):           
            segments = re.split(r'([ ' + self.tokenizer.filters + '])',
                                input_df.iloc[0][FEATURE_LABEL])

            unpadded_tokens = [self.tokenizer.texts_to_sequences([x])[0] for x
                              in input_df[FEATURE_LABEL].values]

            padded_tokens = _pad(unpadded_tokens)

            word_tokens = self.tokenizer.sequences_to_texts(
                [[x] for x in padded_tokens[0]])

            # Note - summing over attributions in the embedding direction
            word_attributions = np.sum(att['embedding'][-len(word_tokens):],
                                       axis=1)

            i = 0
            final_attributions = []
            final_segments = []
            for segment in segments:
                if segment != '':  # dump empty tokens
                    final_segments.append(segment)
                    seg_low = segment.lower()
                    if len(word_tokens) > i and seg_low == word_tokens[i]:
                        final_attributions.append(word_attributions[i])
                        i += 1
                    else:
                        final_attributions.append(0)

            gem_text = GEMText(feature_name=FEATURE_LABEL,
                               text_segments=final_segments,
                               text_attributions=final_attributions)

            gem_container = GEMContainer(contents=[gem_text])

            explanations_by_output[OUTPUT_COL[output_field_index]] \
                = gem_container.render()

        return explanations_by_output


def get_model():
    return FiddlerModel()
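
The alignment logic in `project_attributions` can be tried out in isolation. The sketch below is a simplified stand-in written with plain lists: it sums each token's attribution vector over the embedding dimension, then walks the visible text segments and assigns a score of 0 to separators and to anything that does not match the next known token. The function name, the separator set, and the numbers are all invented for this example.

```python
import re

def project_to_segments(text, word_tokens, token_attributions, separators=' ,.!?'):
    """Align per-token attribution vectors with the visible text segments."""
    # Sum each token's attribution vector over the embedding dimension.
    word_scores = [sum(vector) for vector in token_attributions]
    # The capture group keeps the separators in the split result.
    segments = re.split('([' + re.escape(separators) + '])', text)

    final_segments, final_scores, i = [], [], 0
    for segment in segments:
        if segment == '':  # skip empty pieces produced by adjacent separators
            continue
        final_segments.append(segment)
        if i < len(word_tokens) and segment.lower() == word_tokens[i]:
            final_scores.append(word_scores[i])
            i += 1
        else:
            final_scores.append(0)
    return final_segments, final_scores

segments, scores = project_to_segments(
    'Great movie, great',
    word_tokens=['great', 'movie', 'great'],
    token_attributions=[[0.25, 0.25], [-0.5, 0.25], [0.5, 0.0]],
)
print(list(zip(segments, scores)))
```

Keeping the separators as zero-score segments means the original sentence can be rebuilt by joining the segments, which is convenient for display.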

## Upload Model
Now that we have all the parts we need, we can upload the model to the Fiddler platform. You first need to add your model schema in Fiddler using `add_model`. Then, you can use `add_model_artifact` to upload the entire model directory in one shot. We need the following to upload a model:
- The `path` to the directory
- The `project_id` to which the model belongs
- The `model_id`, which is the name you want to give the model. You can access it in Fiddler henceforth via this ID
- The `dataset_id` which the model is linked to

In total, we will have a model file, a tokenizer file, the GEM.py wrapper and a `package.py` file within our model directory.
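
Before uploading, a quick local check that the directory contains all four pieces can save a failed upload. The helper below is a hypothetical convenience, not part of the Fiddler client; it is demonstrated on a temporary directory in which `GEM.py` is deliberately left out.

```python
import pathlib
import tempfile

REQUIRED = ['saved_model', 'tokenizer.pkl', 'GEM.py', 'package.py']

def missing_artifacts(model_dir, required=REQUIRED):
    """Return the required entries that are absent from model_dir."""
    model_dir = pathlib.Path(model_dir)
    return [name for name in required if not (model_dir / name).exists()]

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / 'saved_model').mkdir()
    (root / 'tokenizer.pkl').write_bytes(b'')
    (root / 'package.py').write_text('')
    missing = missing_artifacts(root)

print(missing)
```

In the notebook, the same check would be `missing_artifacts(model_dir)`, and an empty list means the directory is complete.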

In [None]:
# Delete any previously uploaded model with the same ID so it can be re-added.
client.delete_model(project_id, model_id)

In [None]:
client.add_model(project_id=project_id, model_id=model_id, dataset_id=dataset_id, model_info=model_info)
client.add_model_artifact(model_dir=model_dir, project_id=project_id, model_id=model_id)

## Run Model
Now, let's test out our model by interfacing with the client and 
calling [run model](https://api.fiddler.ai/#run-model).

In [None]:
prediction_input = df.head(3)
client.run_model(project_id, model_id, prediction_input)

## Get Explanation
Let's get an explanation on a selected data point to better understand how our
model came to the conclusion it did. We can do so by calling the `run_explanation`
method. In this case, we will request an explanation using `'ig_flex'`, matching the
`preferred_explanation_method` set in the model schema.
More information on this method can be found [here](https://api.fiddler.ai/#run-explanation).

In [None]:
selected_point = df.head(1)

In [None]:
client.run_explanation(
    project_id=project_id,
    model_id=model_id, 
    df=selected_point, 
    dataset_id=dataset_id,
    explanations='ig_flex')
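
For intuition about what an Integrated Gradients explanation computes, here is a minimal numeric sketch on a toy logistic model with an analytic gradient. It approximates the path integral from a baseline to the input with a midpoint Riemann sum and checks the completeness property, namely that the attributions sum to the difference between the prediction at the input and at the baseline. The step-doubling loop is an assumed interpretation of the `ig_start_steps`, `ig_max_steps`, and `ig_min_error_pct` settings in `package.py`, not Fiddler's documented algorithm, and the weights and inputs are invented.

```python
import math

WEIGHTS = [1.5, -2.0, 0.5]  # toy model parameters, chosen for illustration

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x):
    """Toy logistic model: sigmoid of a weighted sum."""
    return sigmoid(sum(w * v for w, v in zip(WEIGHTS, x)))

def gradient(x):
    """Analytic gradient of the toy model with respect to its input."""
    s = model(x)
    return [s * (1.0 - s) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, totals)]

def explain(x, baseline, start_steps=32, max_steps=4096, min_error_pct=5.0):
    """Double the step count until the completeness error is small enough.

    Assumes model(x) differs from model(baseline), so the error is well defined.
    """
    target = model(x) - model(baseline)
    steps = start_steps
    while True:
        attributions = integrated_gradients(x, baseline, steps)
        error_pct = 100.0 * abs(sum(attributions) - target) / abs(target)
        if error_pct <= min_error_pct or steps * 2 > max_steps:
            return attributions, steps, error_pct
        steps *= 2

x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions, steps_used, error_pct = explain(x, baseline)
print(attributions, steps_used, error_pct)
```

In the text model above, the same idea is applied to the embedding output, with the embeddings of an empty sentence serving as the baseline.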