# TensorFlow Tutorial for Text Data with Integrated Gradients

## Initialize Fiddler Client
We begin, as usual, by establishing a connection to our Fiddler instance.
We can do this either by passing our credentials directly or by using our
`fiddler.ini` file. More information can be found in the
[setup](https://github.com/fiddler-labs/fiddler-samples/blob/master/content_root/tutorial/00%20Install%20%26%20Setup.ipynb) section.

In [None]:
import fiddler as fdl

# client = fdl.FiddlerApi(url=url, org_id=org_id, auth_token=auth_token)
client = fdl.FiddlerApi()

## Create Project

Here we will create a project, a convenient container for housing the models and datasets associated with a given ML use case.

In [None]:
project_id = 'tf_text'

In [None]:
# Creating our project using project_id
if project_id not in client.list_projects():
    client.create_project(project_id)

## Load Dataset
Here we load our baseline dataset from a CSV file called `imdb_rnn.csv` and
infer a dataset schema from it.

In [None]:
import pandas as pd
df = pd.read_csv('/app/fiddler_samples/samples/datasets/imdb_rnn/imdb_rnn.csv')
df_schema = fdl.DatasetInfo.from_dataframe(df, max_inferred_cardinality=1000)

In [None]:
df.head()

## Upload Dataset
Before uploading a model, you need to upload a sample of the data it works
with: the model's inputs, its targets, and any additional metadata that might
be useful for model analysis. Among other things, this sample lets Fiddler infer
the model schema, along with the data type and value range of each feature.

In [None]:
if 'imdb_rnn' not in client.list_datasets(project_id):
    upload_result = client.upload_dataset(
        project_id=project_id,
        dataset={'train': df}, 
        dataset_id='imdb_rnn')

## Create Model Schema
As you may have noticed, the dataset upload step did not ask for the model's
features and targets, or for any other model-specific information. That is
because multiple models can be linked to a single dataset schema. We therefore
need a separate step that creates the model schema, which tells Fiddler which
features are relevant to the model and what the model task is. Here you can
specify the input features, the target column, decision columns, and metadata
columns, as well as the type of model.

In [None]:
target = 'polarity'
feature_columns = ['sentence']
train_input = df[feature_columns]
train_target = df[target]

model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=client.get_dataset_info(project_id, 'imdb_rnn'),
    target=target, 
    features=feature_columns,
    display_name='Text IG',
    description='this is a tensorflow model using text data and IG enabled from tutorial',
    input_type=fdl.ModelInputType.TEXT
)

## Install TensorFlow if necessary

Currently, we support scikit-learn version 0.21.2 and TensorFlow version 1.14.  
If you have another version, please contact Fiddler for assistance.

In [None]:
import tensorflow as tf

assert tf.__version__=='1.14.0', 'Please change tensorflow version to 1.14.0'

In [None]:
import sklearn

assert sklearn.__version__=='0.21.2', 'Please change sklearn version to 0.21.2'

In [None]:
# !pip install tensorflow==1.14

In [None]:
# !pip install scikit-learn==0.21.2

## Train Model
Build and train your model.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_target = le.fit_transform(train_target)
train_target = train_target.reshape(-1,1)

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

vocab_size = 1000
max_seq_length = 150
tok = Tokenizer(num_words=vocab_size)
tok.fit_on_texts(train_input['sentence'])
sequences = tok.texts_to_sequences(train_input['sentence'])
sequences_matrix = sequence.pad_sequences(sequences, maxlen=max_seq_length, padding='post')
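To make the tokenization and padding step more concrete, here is a minimal pure-Python sketch of the idea. It is a simplification and not the Keras implementation: it splits on whitespace only (the real `Tokenizer` also strips punctuation), and the helper names `fit_word_index`, `texts_to_sequences`, and `pad_post` are hypothetical. It assumes, as in Keras, that index 0 is reserved for padding, that only words ranked below `num_words` are kept, and that over-long sequences keep their last `maxlen` tokens.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and ranks >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]


def pad_post(sequences, maxlen):
    """Pad with zeros on the right; keep only the last maxlen tokens."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]
        padded.append(kept + [0] * (maxlen - len(kept)))
    return padded


texts = ['a great great movie', 'a bad movie']
word_index = fit_word_index(texts)

# With num_words=4, only indices 1..3 survive, so 'bad' (rank 4) is dropped.
toy_sequences = texts_to_sequences(texts, word_index, num_words=4)
toy_matrix = pad_post(toy_sequences, maxlen=3)
print(toy_matrix)

# An empty string becomes an all-zero row.
empty_row = pad_post(texts_to_sequences([''], word_index, num_words=4), maxlen=3)
print(empty_row)
```

The all-zero row produced for an empty string is the same kind of input that the `generate_baseline` method later in this tutorial builds as the Integrated Gradients baseline.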

In [None]:
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from tensorflow.keras.models import Model

def RNN():
    inputs = Input(name='inputs', shape=[max_seq_length])
    layer = Embedding(vocab_size, 64, input_length=max_seq_length)(inputs)
    layer = LSTM(64)(layer)
    layer = Dense(256, name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.2)(layer)
    layer = Dense(1, name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs, outputs=layer)
    return model

In [None]:
from tensorflow.keras.optimizers import RMSprop

model = RNN()
model.summary()
model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])
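As a sanity check on `model.summary()`, the number of trainable parameters can be worked out by hand. The sketch below assumes default Keras layer settings (biases enabled and a standard LSTM with four gates); the variable names are illustrative and not part of the tutorial.

```python
vocab_size = 1000
embedding_dim = 64
lstm_units = 64
fc1_units = 256

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense layers: weights plus one bias per output unit.
fc1_params = lstm_units * fc1_units + fc1_units
out_params = fc1_units * 1 + 1

total_params = embedding_params + lstm_params + fc1_params + out_params
print(embedding_params, lstm_params, fc1_params, out_params, total_params)
```

If the summary printed by Keras reports a different total, that would suggest a layer is configured differently from what this sketch assumes.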

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
model.fit(sequences_matrix, train_target, batch_size=128, epochs=5,
          validation_split=0.1, callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.001)])
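The `EarlyStopping` callback above monitors `val_loss` with `min_delta=0.001` and the default patience. The following is a simplified sketch of that stopping rule, assuming a default patience of 0 and that an epoch counts as an improvement only if the loss drops by more than `min_delta`. The function `early_stopping_epoch` is a hypothetical helper for illustration, not a Keras API.

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss + min_delta < best:
            # Improvement larger than min_delta: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# The third epoch improves by only 0.0005, which is below min_delta.
stopped_at = early_stopping_epoch([0.60, 0.50, 0.4995, 0.45])
never_stopped = early_stopping_epoch([0.60, 0.50, 0.40])
print(stopped_at, never_stopped)
```

Under these assumptions, training may end before the requested 5 epochs as soon as one epoch fails to improve the validation loss by more than `min_delta`.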

## Save Model and Schema
Next, we need to save the model along with any preprocessing applied to the
input features (for example, a categorical encoder or a tokenizer).

In [None]:
import pathlib
import shutil
import pickle
import yaml
import tensorflow as tf

project_id = 'tf_text'
model_id = 'tf_ig_imdb'

# create temp dir
model_dir = pathlib.Path(model_id)
shutil.rmtree(model_dir, ignore_errors=True)
model_dir.mkdir()

# save model
tf.keras.experimental.export_saved_model(model, str(model_dir / 'saved_model'))

# save model schema
with open(model_dir / 'model.yaml', 'w') as yaml_file:
    yaml.dump({'model': model_info.to_dict()}, yaml_file)

# save tokenizer
with open(model_dir / 'tokenizer.pkl', 'wb') as tok_file:
    pickle.dump(tok, tok_file)

## Write `package.py` and related wrappers

### Import related wrappers

We need to copy two TensorFlow wrappers into the model directory. These files are stored in the `utils` directory.
- The `tf_saved_model_wrapper.py` file contains a wrapper to load and run a TF model from a saved_model path.
- The `tf_saved_model_wrapper_ig.py` file contains a wrapper to support Integrated Gradients (IG) computation for a TF model loaded from a saved_model path.

In [None]:
files = ['utils/tf_saved_model_wrapper.py', 'utils/tf_saved_model_wrapper_ig.py']
for f in files:
    shutil.copy(f, model_dir)

### Write `package.py` file

A wrapper is needed between Fiddler and the model. This wrapper translates
the inputs and outputs between what the model expects and what Fiddler
is able to consume. The file contains functions to transform the input, generate the
baseline, and project the attributions back onto the original text. More information can be found [here](https://api.fiddler.ai/#package-py/).
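The attribution step is the least obvious of the three, so here is a small standalone sketch of the idea: the original text is split into segments (words, spaces, punctuation), and each segment receives the attribution of the matching model token, or 0 if the model never saw it. This is an illustration only. The function `project_to_segments` and the reduced delimiter set `[ .,!?]` are assumptions made for the example, whereas the wrapper in this tutorial uses the tokenizer's own filter characters.

```python
import re


def project_to_segments(text, word_tokens, word_attributions):
    """Align per-token attributions with the segments of the original text."""
    # Keep the delimiters as their own segments and drop empty strings.
    segments = [s for s in re.split(r'([ .,!?])', text) if s != '']
    attributions = []
    i = 0
    for segment in segments:
        if i < len(word_tokens) and segment.lower() == word_tokens[i]:
            # This segment is the next token consumed by the model.
            attributions.append(word_attributions[i])
            i += 1
        else:
            # Delimiters and out-of-vocabulary words get zero attribution.
            attributions.append(0.0)
    return segments, attributions


segments, attributions = project_to_segments(
    'Great movie, truly!', ['great', 'movie', 'truly'], [0.5, 0.1, 0.25])
print(list(zip(segments, attributions)))

# Here 'movie' is assumed to be out of vocabulary, so it receives 0.
oov_segments, oov_attributions = project_to_segments(
    'Great movie', ['great'], [0.5])
print(list(zip(oov_segments, oov_attributions)))
```

Keeping delimiters as separate segments means the segments can be joined back into the original text, which makes it easier to display attributions in place.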

In [None]:
%%writefile tf_ig_imdb/package.py

import numpy as np
import re
import pathlib
import pickle
import logging
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from .tf_saved_model_wrapper_ig import TFSavedModelWrapperIg


PACKAGE_PATH = pathlib.Path(__file__).parent
SAVED_MODEL_PATH = PACKAGE_PATH / 'saved_model'
TOKENIZER_PATH = PACKAGE_PATH / 'tokenizer.pkl'

LOG = logging.getLogger(__name__)


class MyModel(TFSavedModelWrapperIg):
    def __init__(self, saved_model_path, sig_def_key, tokenizer_path,
                 target,
                 is_binary_classification=False,
                 output_key=None,
                 batch_size=8,
                 output_columns=None,
                 input_tensor_to_differentiable_layer_mapping=None,
                 max_allowed_error=None):
        """
        Class to load and run the IMDB RNN model.
        See: TFSavedModelWrapper

        Note: `target` is the name of the text column to tokenize
        (here 'sentence'), not the prediction target.
        """
        # None defaults avoid sharing mutable default arguments across calls.
        super().__init__(saved_model_path, sig_def_key,
                         is_binary_classification=is_binary_classification,
                         output_key=output_key,
                         batch_size=batch_size,
                         output_columns=output_columns or [],
                         input_tensor_to_differentiable_layer_mapping=
                         input_tensor_to_differentiable_layer_mapping or {},
                         max_allowed_error=max_allowed_error)
        with open(tokenizer_path, 'rb') as handle:
            self.tokenizer = pickle.load(handle)
        self.max_seq_length = 150
        self.target = target

    def transform_input(self, input_df):
        """
        Transform the provided dataframe into one that complies with the input
        interface of the model.

        Overrides the transform_input method of TFSavedModelWrapper.
        """
        
        sequences = self.tokenizer.texts_to_sequences(input_df[self.target])
        sequences_matrix = sequence.pad_sequences(sequences,
                                                  maxlen=self.max_seq_length,
                                                  padding='post')

        return pd.DataFrame({'inputs': sequences_matrix.tolist()})

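    def generate_baseline(self, input_df):
        """
        Build the IG baseline: one empty string per input row, which
        tokenizes to an all-padding (all-zero) sequence.
        """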
        input_tokens = input_df[self.target].apply(lambda x: '')
        sequences = self.tokenizer.texts_to_sequences(input_tokens)
        sequences_matrix = sequence.pad_sequences(sequences,
                                                  maxlen=self.max_seq_length,
                                                  padding='post')

        return pd.DataFrame({'inputs': sequences_matrix.tolist()})

    def project_attributions(self, input_df, transformed_input_df,
                             attributions):
        """
        Maps the transformed input to original input space so that the
        attributions correspond to the features of the original input.
        Overrides the project_attributions method of TFSavedModelWrapper.
        """
        segments = re.split(r'([ '+self.tokenizer.filters+'])', input_df[self.target].iloc[0])
        unpadded_input=[self.tokenizer.texts_to_sequences([x])[0] for x in input_df[self.target].values]
        word_tokens = self.tokenizer.sequences_to_texts([[x] for x in unpadded_input[0]])
        word_attributions = attributions['inputs'][0].astype('float').tolist()[:len(word_tokens)] 
        
        # Let's walk segments and assign attributions to the components where
        # they match word_tokens, the token sequence consumed by the model; otherwise assign 0.
        i = 0
        final_attributions = []
        final_segments = []
        for segment in segments:
            if segment != '':
                final_segments.append(segment)
                seg_low = segment.lower()
                if len(word_tokens)>i and seg_low == word_tokens[i]:
                    final_attributions.append(word_attributions[i])
                    i+=1
                else:
                    final_attributions.append(0)       
        return {"embedding_input":[final_segments, final_attributions]}


def get_model():
    model = MyModel(
        SAVED_MODEL_PATH,
        tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY,
        TOKENIZER_PATH,
        target='sentence',
        is_binary_classification=True,
        batch_size=128,
        output_columns=['probability_polarity_True'],
        input_tensor_to_differentiable_layer_mapping=
        {'inputs': 'embedding/embedding_lookup:0'},
        max_allowed_error=5)
    model.load_model()
    return model


## Upload Model
Now that we have all the parts that we need, we can go ahead and upload the model to the Fiddler platform. You can use the [upload_model_package](https://api.fiddler.ai/#upload-model-package) to upload this entire directory in one shot. We need the following for uploading a model:
- The `path` to the directory
- The `project_id` to which the model belongs
- The `model_id`, which is the name you want to give the model. You can access it in Fiddler henceforth via this ID
- The `dataset` which the model is linked to (optional)  

In total, our model directory will contain the `saved_model` directory, a `model.yaml`, a `tokenizer.pkl`, a `package.py`, and the two wrapper files copied from `utils`.
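Before uploading, it can be useful to confirm that the directory is complete. The sketch below is a hypothetical helper (`missing_package_entries` is not part of the Fiddler client) that checks for the entries created earlier in this tutorial. It is demonstrated on a temporary directory so that it runs on its own.

```python
import pathlib
import tempfile

# Entries written to the model directory in the previous steps of this tutorial.
EXPECTED_ENTRIES = [
    'saved_model',
    'model.yaml',
    'tokenizer.pkl',
    'package.py',
    'tf_saved_model_wrapper.py',
    'tf_saved_model_wrapper_ig.py',
]


def missing_package_entries(model_dir):
    """Return the expected entries that are absent from model_dir."""
    model_dir = pathlib.Path(model_dir)
    return [name for name in EXPECTED_ENTRIES
            if not (model_dir / name).exists()]


# Demo on a throwaway directory that is deliberately incomplete.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = pathlib.Path(tmp)
    (demo_dir / 'saved_model').mkdir()
    for name in ['model.yaml', 'package.py']:
        (demo_dir / name).touch()
    missing = missing_package_entries(demo_dir)

print(missing)
```

In the notebook, calling `missing_package_entries(model_dir)` should return an empty list if all of the previous cells ran successfully.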

In [None]:
client.delete_model(project_id, model_id)
client.upload_model_package(model_dir, project_id, model_id)

## Run Model
Now, let's test out our model by interfacing with the client and 
calling [run model](https://api.fiddler.ai/#run-model).

In [None]:
prediction_input = train_input[:10]
result = client.run_model(project_id, model_id, prediction_input)
result

## Get Explanation
Let's get an explanation for a selected data point to better understand how our
model reached its prediction. We can do so by calling the `run_explanation`
method. In this case, we request an explanation using `'ig'` (Integrated Gradients).
More information on this method can be found [here](https://api.fiddler.ai/#run-explanation).
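For intuition about what Integrated Gradients computes, here is a minimal sketch on a toy model. It is not the implementation used by Fiddler or by the wrapper in this tutorial: the logistic model, its weights, and the function names are all assumptions made for illustration. The method averages the model's gradient along a straight path from a baseline to the input and scales it by the difference between input and baseline. A useful property is that the attributions sum approximately to the difference between the model output at the input and at the baseline.

```python
import math

# Hypothetical toy model: logistic regression with fixed weights.
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = 0.1


def model(x):
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    """Analytic gradient of the sigmoid output with respect to each input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Approximate the path integral of gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
ig_attributions = integrated_gradients(x, baseline)

# Completeness check: attributions should sum to model(x) - model(baseline).
completeness_gap = abs(sum(ig_attributions) - (model(x) - model(baseline)))
print(ig_attributions, completeness_gap)
```

In this tutorial the model input is a sequence of integer token IDs, which cannot be differentiated directly. That is why `get_model` maps the `inputs` tensor to the embedding lookup output, where the gradients are taken, and why the baseline is an all-padding sequence.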

In [None]:
selected_point = df.head(1)

In [None]:
project_id = 'tf_text'
model_id = 'tf_ig_imdb'

ex_ig = client.run_explanation(
    project_id=project_id,
    model_id=model_id, 
    df=selected_point, 
    dataset_id='imdb_rnn',
    explanations='ig')

In [None]:
ex_ig