Copyright 2018 Google LLC.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Evaluation code


__Disclaimer__
*   This notebook contains experimental code, which may change without notice.
*   The ideas presented here are relevant to fairness, but they are not the whole story!



# Notebook summary

This notebook evaluates a list of models along two dimensions:
- "Performance": how well the model classifies the data (intended bias). Currently, we use the AUC.
- "Bias": how much unintended bias the model contains. Currently, we use the pinned AUC.

This notebook takes the following steps:

- Defines the models to evaluate and specifies their signatures (expected inputs/outputs).
- Writes input functions to generate two datasets:
    - A "performance dataset", used for the first set of metrics. This dataset should have a format similar to the training data (a piece of text and a label).
    - A "bias dataset", used for the second set of metrics. This dataset contains a piece of text and a label, plus subgroup information on which to evaluate unintended bias.
- Runs predictions with the utilities in `utils_export`.
- Evaluates the metrics.
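
The pinned AUC mentioned above can be illustrated with a minimal sketch. It assumes the usual definition: the AUC computed on a subgroup's examples combined with an equally sized random sample of the full dataset. The helper `pinned_auc`, the toy data and the column name `subgroup_a` are hypothetical, and the `unintended_ml_bias` library used later in this notebook may differ in its sampling details.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def pinned_auc(df, subgroup, score_col, label_col='label', seed=0):
    """AUC on the subgroup rows plus an equal-size random sample of all rows."""
    subgroup_rows = df[df[subgroup]]  # rows that mention the identity
    background = df.sample(n=len(subgroup_rows), random_state=seed)
    pinned = pd.concat([subgroup_rows, background])
    return roc_auc_score(pinned[label_col], pinned[score_col])


# Toy data (hypothetical): random labels, a random subgroup flag,
# and noisy scores that are higher on average for toxic rows.
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    'label': rng.rand(n) > 0.5,
    'subgroup_a': rng.rand(n) > 0.7,
})
toy['score'] = np.where(toy['label'], 0.7, 0.3) + 0.2 * rng.randn(n)

overall_auc = roc_auc_score(toy['label'], toy['score'])
subgroup_pinned_auc = pinned_auc(toy, 'subgroup_a', 'score')
print('overall AUC: {:.3f}, pinned AUC: {:.3f}'.format(
    overall_auc, subgroup_pinned_auc))
```

A pinned AUC well below the overall AUC suggests that the model's scores for the subgroup are distributed differently from those of the data as a whole.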

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import getpass
from IPython.display import display
import json
import nltk
import numpy as np
import pandas as pd
import pkg_resources
import os
import random
import re
import seaborn as sns

import tensorflow as tf
from tensorflow.python.lib.io import file_io

In [None]:
!pip install -U -q git+https://github.com/conversationai/unintended-ml-bias-analysis

In [None]:
from unintended_ml_bias import model_bias_analysis

In [None]:
import input_fn_example
from utils_export.dataset import Dataset, Model
from utils_export import utils_cloudml
from utils_export import utils_tfrecords

In [None]:
os.environ['GCS_READ_CACHE_MAX_SIZE_MB'] = '0'  # Speeds up access to GCS files; see https://github.com/tensorflow/tensorflow/issues/15530

In [None]:
nltk.download('punkt')

# Settings

### Global variables

In [None]:
# User inputs
PROJECT_NAME = 'wikidetox'

# Part 1: Defining your model

An important user input is the description of the deployed models to evaluate.

1- Defining which models will be used.
`MODEL_NAMES` lists the models to evaluate (format: "model_name:version").

2- Defining the model signature.
Currently, the `Dataset` API does not detect the signature of a CMLE model, so this information is given by a `Model` instance.
You need to describe:
- The input spec (argument `feature_keys_spec`): what the input file should contain. It is a dictionary that maps field names to their types.
- The prediction key (argument `prediction_keys`): the name of the prediction field in the model output.
- The example key (argument `example_key`): a unique identifier for each sentence, generated by the `Dataset` API (i.e., your input data does not need to have this field).
    - When using Cloud MLE for batch predictions, data is processed in an unpredictable order. To match the returned predictions with your input instances, you must have instance keys defined.
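
To illustrate why the example key matters, here is a minimal sketch of re-aligning out-of-order predictions with their inputs. It uses plain pandas rather than the `Dataset` API, whose internals are not shown in this notebook; the toy values are hypothetical, while the column names reuse `comment_key`, `comment_text` and `toxicity/logistic` from the cells below.

```python
import pandas as pd

# Inputs, each tagged with a unique example key.
inputs = pd.DataFrame({
    'comment_key': [0, 1, 2],
    'comment_text': [['hello'], ['you', 'are', 'rude'], ['thanks']],
})

# Batch predictions may come back in a different order.
predictions = pd.DataFrame({
    'comment_key': [2, 0, 1],
    'toxicity/logistic': [0.05, 0.10, 0.90],
})

# Join on the key; `validate` raises if a key is duplicated on either side.
joined = inputs.merge(
    predictions, on='comment_key', how='left', validate='one_to_one')
print(joined)
```

Without the key column there would be no reliable way to tell which score belongs to which comment.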

In [None]:
# User inputs:
MODEL_NAMES = [
    'tf_gru_attention_civil:v_20181109_164318', # "Normal" embeddings, finetuned
    'tf_gru_attention_civil:v_20181109_164403', # "Normal" embeddings, not finetuned
    'tf_gru_attention_civil:v_20181109_164535', # "Unbias" embeddings, finetuned
    'tf_gru_attention_civil:v_20181109_164630', # "Unbias" embeddings, not finetuned
]

In [None]:
# User inputs: Model description (see above for more info).
TEXT_FEATURE_NAME = 'comment_text'  # Input defined in the serving function called in run.py (arg: `text_feature_name`).
SENTENCE_KEY = 'comment_key'  # Input key defined in the serving function called in run.py (arg: `example_key_name`).
LABEL_NAME_PREDICTION_MODEL = 'toxicity/logistic'  # Output prediction: typically $label_name/logistic

In [None]:
model_input_spec = {
    TEXT_FEATURE_NAME: utils_tfrecords.EncodingFeatureSpec.LIST_STRING} #library will use this automatically

model = Model(
    feature_keys_spec=model_input_spec,
    prediction_keys=LABEL_NAME_PREDICTION_MODEL,
    example_key=SENTENCE_KEY,
    model_names=MODEL_NAMES,
    project_name=PROJECT_NAME)

# Part 2: Defining the input_fn

In [None]:
def tokenizer(text, lowercase=True):
  """Converts text to a list of words.

  Args:
    text: piece of text to tokenize (bytes or string).
    lowercase: whether to lowercase the words (boolean).

  Returns:
    A list of strings (words).
  """
  # Decode only when given bytes: `str` has no `decode` method in Python 3.
  if isinstance(text, bytes):
    text = text.decode('utf-8')
  words = nltk.word_tokenize(text)
  if lowercase:
    words = [w.lower() for w in words]
  return words

### Defining input_fn

First, we need to define the input_fn functions that will be fed to the `Dataset` API.
An input_fn must meet the following requirements:
- It returns a pandas DataFrame.
- It has an argument `max_n_examples` to control the size of the DataFrame.
- The DataFrame contains at least a field named by `TEXT_FEATURE_NAME`, which maps to a tokenized text (list of words), and a field 'label', which is 1 for toxic comments and 0 otherwise.

We will define two input_fn functions (one for performance, one for bias). The bias input_fn should also return identity information.

Note: You can use ANY input_fn that meets these requirements. A few examples are available in the file input_fn_example.py (for the toxicity and civil_comments datasets).
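
As a rough illustration of these requirements, the sketch below builds a toy input_fn from an in-memory DataFrame. The factory `create_toy_input_fn`, the sample comments and the whitespace tokenizer are hypothetical, and the way the `Dataset` class calls the input_fn is assumed from the requirements listed above.

```python
import pandas as pd

TEXT_FEATURE_NAME = 'comment_text'  # same value as in this notebook


def create_toy_input_fn(tokenizer, model_input_comment_field):
    """Returns an input_fn backed by a small in-memory dataset (hypothetical)."""
    raw = pd.DataFrame({
        'text': ['You are great', 'You are an idiot', 'Have a nice day'],
        'label': [0, 1, 0],  # 1 for toxic, 0 otherwise
    })

    def input_fn(max_n_examples):
        # Keep at most `max_n_examples` rows and tokenize the text.
        df = raw.head(max_n_examples).copy()
        df[model_input_comment_field] = df['text'].apply(tokenizer)
        return df[[model_input_comment_field, 'label']]

    return input_fn


# A simple whitespace tokenizer stands in for the nltk-based one.
toy_input_fn = create_toy_input_fn(
    lambda text: text.lower().split(), TEXT_FEATURE_NAME)
toy_df = toy_input_fn(max_n_examples=2)
print(toy_df)
```

A bias input_fn would follow the same pattern and additionally return one boolean column per identity.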

In [None]:
# User inputs: Choose which one you want to use OR create your own!
INPUT_FN_PERFORMANCE = input_fn_example.create_input_fn_civil_performance(
    tokenizer,
    model_input_comment_field=TEXT_FEATURE_NAME,
    )
INPUT_FN_BIAS = input_fn_example.create_input_fn_civil_bias(
    tokenizer,
    model_input_comment_field=TEXT_FEATURE_NAME,)

# Part 3: Running prediction

### Performance dataset

In [None]:
# User inputs
SIZE_PERFORMANCE_DATA_SET = 10000

In [None]:
# Pattern for path of tf_records
PERFORMANCE_DATASET_DIR = os.path.join(
    'gs://conversationai-models/',
    getpass.getuser(),
    'tfrecords',
    'performance_dataset_dir')

In [None]:
dataset_performance = Dataset(INPUT_FN_PERFORMANCE, PERFORMANCE_DATASET_DIR)
random.seed(2018) # Need to set seed before loading data to be able to reload same data in the future
dataset_performance.load_data(SIZE_PERFORMANCE_DATA_SET, random_filter_keep_rate=0.5)

In [None]:
# Set recompute_predictions=False to save time if predictions are available.
dataset_performance.add_model_prediction_to_data(model, recompute_predictions=True)

### Bias dataset

In [None]:
# User inputs
SIZE_BIAS_DATA_SET = 20000

In [None]:
# Pattern for path of tf_records
BIAS_DATASET_DIR = os.path.join(
    'gs://conversationai-models/',
    getpass.getuser(),
    'tfrecords',
    'bias_dataset_dir')

In [None]:
dataset_bias = Dataset(INPUT_FN_BIAS, BIAS_DATASET_DIR)
random.seed(2018) # Need to set seed before loading data to be able to reload same data in the future
dataset_bias.load_data(SIZE_BIAS_DATA_SET)

In [None]:
# Set recompute_predictions=False to save time if predictions are available.
dataset_bias.add_model_prediction_to_data(model, recompute_predictions=True)

### Post processing

In [None]:
test_performance_df = dataset_performance.show_data()

In [None]:
test_bias_df = dataset_bias.show_data()

### Analyzing final results

In [None]:
test_performance_df.head()

In [None]:
test_bias_df.head()

# Part 4: Run evaluation metrics

## Performance metrics

### Data Format

At this point, our performance data is in the DataFrame `test_performance_df`, with columns:

- label: True if the comment is toxic, False otherwise.
- < model name >: One column per model; cells contain the score from that model.

You can run the analysis below on any data in this format.

### Run AUC

In [None]:
import sklearn.metrics as metrics

In [None]:
auc_list = []
for _model in MODEL_NAMES:
    fpr, tpr, thresholds = metrics.roc_curve(
        test_performance_df['label'],
        test_performance_df[_model])
    _auc = metrics.auc(fpr, tpr)
    auc_list.append(_auc)
    print('AUC for model {}: {}'.format(_model, _auc))

## Unintended Bias Metrics

### Data Format
At this point, our bias data is in the DataFrame `test_bias_df`, with columns:

*   label: True if the comment is toxic, False otherwise.
*   < model name >: One column per model; cells contain the score from that model.
*   < subgroup >: One column per identity; True if the comment mentions this identity.

You can run the analysis below on any data in this format. Subgroup labels can be
generated from words in the text, or come from human labels if you have them.
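
If you bring your own data, a quick format check can catch problems before running the metrics. The helper below is a hypothetical sketch, not part of `model_bias_analysis`. It assumes that scores are probabilities in [0, 1] and that subgroup columns are boolean.

```python
import pandas as pd


def check_bias_frame(df, model_columns, subgroup_columns,
                     label_column='label'):
    """Returns a list of format problems; empty if the frame looks usable."""
    problems = []
    expected = [label_column] + list(model_columns) + list(subgroup_columns)
    for col in expected:
        if col not in df.columns:
            problems.append('missing column: {}'.format(col))
    for col in model_columns:
        # Assumption: model scores are probabilities in [0, 1].
        if col in df.columns and not df[col].between(0.0, 1.0).all():
            problems.append('scores outside [0, 1] in: {}'.format(col))
    for col in subgroup_columns:
        if col in df.columns and df[col].dtype != bool:
            problems.append('subgroup column is not boolean: {}'.format(col))
    return problems


# Toy frame in the expected format (hypothetical values).
toy_bias_df = pd.DataFrame({
    'label': [True, False, True],
    'model_1': [0.9, 0.2, 0.7],
    'subgroup_a': [True, False, False],
})
ok_report = check_bias_frame(toy_bias_df, ['model_1'], ['subgroup_a'])
bad_report = check_bias_frame(toy_bias_df, ['model_1'], ['subgroup_b'])
print(ok_report, bad_report)
```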


In [None]:
identity_terms_civil_included = []
for _term in input_fn_example.identity_terms_civil:
    if sum(test_bias_df[_term]) >= 20:
        print ('keeping {}'.format(_term))
        identity_terms_civil_included.append(_term)

In [None]:
test_bias_df['model_1'] = test_bias_df['tf_gru_attention_civil:v_20181109_164318']
test_bias_df['model_2'] = test_bias_df['tf_gru_attention_civil:v_20181109_164403']
test_bias_df['model_3'] = test_bias_df['tf_gru_attention_civil:v_20181109_164535']
test_bias_df['model_4'] = test_bias_df['tf_gru_attention_civil:v_20181109_164630']

In [None]:
MODEL_NAMES = ['model_1', 'model_2', 'model_3', 'model_4']

In [None]:
bias_metrics = model_bias_analysis.compute_bias_metrics_for_models(test_bias_df, identity_terms_civil_included, MODEL_NAMES, 'label')

In [None]:
model_bias_analysis.plot_auc_heatmap(bias_metrics, MODEL_NAMES)

In [None]:
model_bias_analysis.plot_aeg_heatmap(bias_metrics, MODEL_NAMES)