<a href="https://colab.research.google.com/github/alsmith151/genomics_utilities/blob/master/lit_nlp/examples/notebooks/LIT_sentiment_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using the Learning Interpretability Tool in Notebooks

This notebook demonstrates the [Learning Interpretability Tool](https://pair-code.github.io/lit) (LIT) on a binary sentiment classifier that labels statements as negative (0) or positive (1).

The `LitWidget` constructor takes a dict mapping model names to model objects and a dict mapping dataset names to dataset objects; these are the models and datasets displayed in LIT. Constructing the widget starts the LIT server in the background, which loads the models and datasets and serves the UI.

Render the LIT UI in an output cell by calling the `render` method on the `LitWidget` object. The UI can be rendered multiple times in separate cells if desired. The widget also provides a `stop` method that shuts down the LIT server.

Copyright 2020 Google LLC.
SPDX-License-Identifier: Apache-2.0

In [11]:
# Installing lit-nlp with pip also installs the prerequisites of the core LIT package.
# transformers and datasets are needed for the custom model and dataset classes below.
!pip install lit-nlp transformers datasets



In [14]:
from lit_nlp import notebook
from lit_nlp.examples.glue import data
from lit_nlp.examples.glue import models
from lit_nlp.api import types as lit_types
from lit_nlp.api.dataset import Dataset
from lit_nlp.api.model import Model
from typing import Any, Dict, List

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hide INFO and lower logs. Comment this out for debugging.
from absl import logging
logging.set_verbosity(logging.WARNING)

In [8]:
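# Fetch the trained model weights. Uncomment on the first run:
# SST2Model('./') below loads the checkpoint from the working directory.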
# !wget https://storage.googleapis.com/what-if-tool-resources/lit-models/sst2_tiny.tar.gz
# !tar -xvf sst2_tiny.tar.gz

In [9]:
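class GenomicData(Dataset):
    """Labelled sequences loaded from a Hugging Face dataset saved to disk.

    Expects a "train" split with a "sequence" text column and an integer
    "label" column (0 = NEGATIVE, 1 = POSITIVE).
    """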

    LABELS = ["NEGATIVE", "POSITIVE"]

    def __init__(self, path):

        from datasets import load_from_disk
        # load dataset from huggingface dataset
        dataset = load_from_disk(path)["train"]
        dataset.set_format("pandas")
        df = dataset[:]
        # Store as a list of dicts, conforming to self.spec()
        self._examples = [
            {
                "sequence": row["sequence"],
                "label": self.LABELS[row["label"]],
            }
            for _, row in df.iterrows()
        ]

    def spec(self):
        return {
            "sequence": lit_types.TextSegment(),
            "label": lit_types.CategoryLabel(vocab=self.LABELS),
        }
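
The label handling in `GenomicData.__init__` can be tried without a dataset on disk. The sketch below is a minimal, standard-library illustration: `to_lit_examples` and the sample rows are hypothetical and not part of LIT or of this notebook. It adds a range check, because a negative integer label would otherwise index the label list from the end without raising an error.

```python
from typing import Dict, List

LABELS = ["NEGATIVE", "POSITIVE"]


def to_lit_examples(rows: List[Dict], labels: List[str] = LABELS) -> List[Dict[str, str]]:
    """Convert rows with an integer `label` into LIT-style example dicts."""
    examples = []
    for row in rows:
        idx = int(row["label"])
        # Reject labels outside the vocab instead of letting -1 wrap around.
        if not 0 <= idx < len(labels):
            raise ValueError(f"label {idx} is outside the vocab {labels}")
        examples.append({"sequence": row["sequence"], "label": labels[idx]})
    return examples


# Made-up rows standing in for the "train" split.
rows = [
    {"sequence": "ACGTACGT", "label": 1},
    {"sequence": "TTGACA", "label": 0},
]
examples = to_lit_examples(rows)
print(examples)
```

Each resulting dict has exactly the keys declared in `spec()`, which is what LIT relies on when it displays the dataset.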

In [13]:
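import torch
from typing import Any, Dict, List


class DNAModel(Model):
    """Wraps a Hugging Face sequence classifier so that LIT can query it."""

    # Assumes a two-class checkpoint; the order must match GenomicData.LABELS.
    LABELS = ["NEGATIVE", "POSITIVE"]
    BATCH_SIZE = 32

    def __init__(
        self,
        tokenizer: str,
        model: str,
    ):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer)
        self.model = AutoModelForSequenceClassification.from_pretrained(model)
        self.model.eval()  # inference only, so disable dropout

    def _process_inputs(self, inputs):
        """Tokenise inputs for the model."""
        return self.tokenizer(
            inputs["sequence"],
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512,
        )

    def predict_minibatch(self, inputs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Run one batch of examples and return one output dict per example."""
        batch = self._process_inputs({"sequence": [ex["sequence"] for ex in inputs]})
        with torch.no_grad():
            logits = self.model(**batch).logits
        # MulticlassPreds expects class probabilities, not raw logits.
        probas = torch.softmax(logits, dim=-1).numpy()
        return [{"probas": p} for p in probas]

    def predict(self, inputs, **kw) -> List[Dict[str, Any]]:
        """LIT calls this with an iterable of example dicts."""
        inputs = list(inputs)
        outputs = []
        for i in range(0, len(inputs), self.BATCH_SIZE):
            outputs.extend(self.predict_minibatch(inputs[i : i + self.BATCH_SIZE]))
        return outputs

    def input_spec(self) -> lit_types.Spec:
        return {
            "sequence": lit_types.TextSegment(),
        }

    def output_spec(self) -> lit_types.Spec:
        return {
            # The vocab is the list of class labels, not the tokenizer vocabulary.
            "probas": lit_types.MulticlassPreds(parent="label", vocab=self.LABELS),
        }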
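
`MulticlassPreds` describes a vector of class probabilities aligned with its `vocab`, whereas a sequence classification head produces unnormalised logits. The following is a minimal sketch of that conversion, written in plain Python so that it runs without PyTorch. The `softmax` helper and the sample logits are assumptions made for illustration.

```python
import math
from typing import List, Sequence

LABELS = ["NEGATIVE", "POSITIVE"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]  # made-up scores for a single input
probas = softmax(logits)
# The predicted class is the label at the index of the largest probability.
predicted = LABELS[probas.index(max(probas))]
print(dict(zip(LABELS, probas)), predicted)
```

Because each position in the probability vector is read against the same position in `vocab`, the vocab should list the class labels in the order the classifier emits them.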



In [4]:
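# Create the LIT widget with the model and dataset to analyze.
# Distinct variable names keep the imported `models` module from being shadowed,
# so this cell can be re-run.
lit_datasets = {'sst_dev': data.SST2Data('validation')}
lit_models = {'sst_tiny': models.SST2Model('./')}

widget = notebook.LitWidget(lit_models, lit_datasets, port=8890)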

Downloading and preparing dataset 7.09 MiB (download: 7.09 MiB, generated: 7.22 MiB, total: 14.31 MiB) to /root/tensorflow_datasets/glue/sst2/2.0.0...


Dataset glue downloaded and prepared to /root/tensorflow_datasets/glue/sst2/2.0.0. Subsequent calls will reuse this data.


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at ./.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [5]:
# Render the widget
widget.render(height=600)

If you've found interesting examples using the LIT UI, you can access these in Python using `widget.ui_state`:

In [None]:
widget.ui_state.primary  # the main selected datapoint

In [None]:
widget.ui_state.selection  # the full selected set, if you have multiple points selected

In [None]:
widget.ui_state.pinned  # the pinned datapoint, if you use the 📌 icon or comparison mode

Note that these include some metadata; the bare example is in the `['data']` field for each record:

In [None]:
widget.ui_state.primary['data']

In [None]:
[ex['data'] for ex in widget.ui_state.selection]
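
Once extracted, the bare examples are ordinary dicts and can be summarised with standard Python. The sketch below uses hand-written stand-in records in place of `widget.ui_state.selection` so that it runs without a LIT server. The `id` and `meta` keys, the `sentence` and `label` field names, and all values are assumptions made for illustration.

```python
from collections import Counter

# Stand-in for widget.ui_state.selection: each record wraps the example in 'data'.
selection = [
    {"id": "a1", "data": {"sentence": "a charming film", "label": "1"}, "meta": {}},
    {"id": "b2", "data": {"sentence": "dull and lifeless", "label": "0"}, "meta": {}},
    {"id": "c3", "data": {"sentence": "a joy to watch", "label": "1"}, "meta": {}},
]

# Same extraction as above, followed by a count of examples per label.
examples = [record["data"] for record in selection]
label_counts = Counter(ex["label"] for ex in examples)
print(label_counts)
```

With a live widget, replacing the stand-in list with `widget.ui_state.selection` should give the same kind of summary for the points selected in the UI.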