In [None]:
#! pip install humanize
#! pip install catboost

# Label noise


## Problem statement 
Given a binary classification task, we traditionally assume data of the form (X, y).

In reality, some of the labels may be incorrect, so we distinguish:
```
y - true label
y* - observed, possibly incorrect label
```

This can obviously affect model training and validation. It also affects the benchmarking process (comparing performance on noisy data doesn't tell you about performance on actual data).

## Types of noise

Can be completely independent:
`p(y* != y | x, y) = p(y* != y)`

class-dependent, depends on y:
`p(y* != y | x, y) = p(y* != y | y)`

feature-dependent, depends on x (and possibly y), so the flip probability does not simplify:
`p(y* != y | x, y)` varies with x

In fraud modeling, higher likelihood of `(y*, y) = (0, 1)` than reverse.
(missed fraud, label maturity, intentional data poisoning, etc.)
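
A minimal sketch of the first two noise types, assuming NumPy and a made-up label vector. The function names and the flip rates are hypothetical, chosen only to illustrate the definitions above and the fraud-style asymmetry.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clean labels: roughly 10% positives (e.g. fraud).
y = (rng.random(10_000) < 0.1).astype(int)

def flip_independent(y, rate, rng):
    """Flip every label with the same probability `rate`."""
    flips = rng.random(y.shape[0]) < rate
    return np.where(flips, 1 - y, y)

def flip_class_dependent(y, rate_pos, rate_neg, rng):
    """Flip positives with probability `rate_pos`, negatives with `rate_neg`."""
    rates = np.where(y == 1, rate_pos, rate_neg)
    flips = rng.random(y.shape[0]) < rates
    return np.where(flips, 1 - y, y)

# Completely independent noise: p(y* != y) is the same for every example.
y_star_indep = flip_independent(y, 0.2, rng)

# Class-dependent noise with a fraud-like asymmetry:
# (y*, y) = (0, 1) is much more likely than (1, 0).
y_star_cc = flip_class_dependent(y, rate_pos=0.3, rate_neg=0.01, rng=rng)
```

Comparing `(y_star_cc != y)` separately on the positive and negative examples shows the asymmetry directly.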

"feature-dependent" is probably the most realistic in fraud, but it has fewer removal techniques and is harder to generate synthetically. We will work with "boundary-consistent" noise, where the probability of being mislabeled is weighted by distance from some decision boundary (score from a model trained on clean data), as implemented in scikit-clean.
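
The idea of weighting flips by distance from a decision boundary can be sketched as follows. This is a hand-rolled NumPy illustration, not scikit-clean's implementation: the function name, the linear weighting, and the random scores standing in for a clean model's output are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def boundary_weighted_flips(scores, noise_level, rng):
    """Pick about `noise_level` of the examples to flip, favouring scores near 0.5.

    `scores` stands in for P(y=1 | x) from a model trained on clean data.
    Returns a boolean mask marking the examples whose label gets flipped.
    """
    n = scores.shape[0]
    n_flips = int(round(noise_level * n))
    # Weight is largest at the boundary (score 0.5) and shrinks towards 0 or 1.
    weights = 1.0 - 2.0 * np.abs(scores - 0.5) + 1e-9
    probs = weights / weights.sum()
    flip_idx = rng.choice(n, size=n_flips, replace=False, p=probs)
    mask = np.zeros(n, dtype=bool)
    mask[flip_idx] = True
    return mask

scores = rng.random(5_000)        # hypothetical clean-model scores
y = (scores > 0.5).astype(int)    # hypothetical clean labels
mask = boundary_weighted_flips(scores, noise_level=0.2, rng=rng)
y_star = np.where(mask, 1 - y, y)
```

Examples the clean model is unsure about are the ones most likely to end up mislabeled, which is what makes this noise feature-dependent.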

## Literature/packages

Many methods in the literature to address this; can build loss functions that are robust to noise, can try to identify and filter (remove) or clean (flip label) examples identified as noisy.

Some packages including CleanLab and scikit-clean. Can also hand-code an ensemble method. Most of these are model-agnostic.

## CleanLab

well-established, state of the art, open source package with some theoretical guarantees

score all examples with y* = 1, determine average score t_1
now score all examples with y* = 0. Any that score above t_1 are marked as noise

can wrap any (sklearn-compatible) model with this process. 
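
The thresholding rule as summarized above can be written in a few lines of NumPy. This only illustrates the simplified description in these notes, with made-up scores; it is not CleanLab's actual code, and the function name is hypothetical. In practice the scores would presumably need to be out-of-sample predictions.

```python
import numpy as np

def flag_suspect_negatives(scores, y_star):
    """Flag observed negatives that score above the mean score of observed positives."""
    scores = np.asarray(scores, dtype=float)
    y_star = np.asarray(y_star)
    # t_1: average score over examples observed as positive
    t_1 = scores[y_star == 1].mean()
    # observed negatives scoring above t_1 are marked as likely noise
    suspect = (y_star == 0) & (scores > t_1)
    return t_1, suspect

scores = np.array([0.9, 0.7, 0.8, 0.2, 0.1, 0.85, 0.3])
y_star = np.array([1, 1, 1, 0, 0, 0, 0])
t_1, suspect = flag_suspect_negatives(scores, y_star)
# t_1 is about 0.8, so only the observed negative scoring 0.85 is flagged
```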

## scikit-clean 

library of several different approaches including filtering as well as noise generation. Is similarly designed to be model-agnostic but doesn't always do a great job (doesn't handle unencoded categorical features well). Some of its methods can also be *very* slow relative to others

## micro-models

slice up training data, train a model on each slice, let models vote on whether to remove data. Can use majority (more than half of models "misclassify" example), consensus (all models misclassify) or any other threshold.
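
A rough sketch of the voting scheme, assuming NumPy only. The nearest-centroid "model" is a toy stand-in for a real classifier, and `micro_model_filter` with its handling of `threshold` is hypothetical; it is not the `MicroModelCleaner` used later in this notebook.

```python
import numpy as np

def fit_centroids(X, y):
    """Toy stand-in for a model: one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in (0, 1)}

def predict_centroids(centroids, X):
    """Predict the class whose centroid is closer."""
    d0 = np.linalg.norm(X - centroids[0], axis=1)
    d1 = np.linalg.norm(X - centroids[1], axis=1)
    return (d1 < d0).astype(int)

def micro_model_filter(X, y_star, num_clfs=8, threshold=0.5, seed=0):
    """Return a boolean mask of the examples to keep."""
    rng = np.random.default_rng(seed)
    # slice the (shuffled) training data into num_clfs disjoint parts
    slices = np.array_split(rng.permutation(len(y_star)), num_clfs)
    votes = np.zeros(len(y_star))
    for idx in slices:
        model = fit_centroids(X[idx], y_star[idx])
        # a model votes to remove an example if it disagrees with the observed label
        votes += predict_centroids(model, X) != y_star
    frac = votes / num_clfs
    # majority: more than half of the models disagree; consensus (threshold=1): all do
    remove = frac >= 1.0 if threshold >= 1 else frac > threshold
    return ~remove

# hypothetical data: two well-separated blobs with 10% of labels flipped
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.repeat([0, 1], 200)
flipped = rng.random(400) < 0.1
y_star = np.where(flipped, 1 - y, y)
keep = micro_model_filter(X, y_star)
```

Consensus voting removes a subset of what majority voting removes, so it is the more conservative of the two.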

## experiment design

take 7 of the datasets - ['ieeecis', 'ccfraud', 'fraudecom', 'sparknov', 'fakejob', 'vehicleloan', 'twitterbot']
* drop the IP and malurl datasets as they are difficult to work with "out of the box"
* use numerical and categorical features, target-encode categorical features (drop text and enrichable features)

add boundary-consistent noise `n` to training data (flipping both classes).

values: `n in [0, 0.1, 0.2, 0.3, 0.4, 0.5]`
    
target encoding is done after noise is added
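
The ordering matters because the encoder is fit on the noisy labels, so the encoded features carry the noise too. A small illustration with plain mean encoding (no smoothing, unlike what a real target encoder would typically apply); the helper name and the data are made up.

```python
import numpy as np

def target_encode(categories, labels):
    """Map each category to the mean label observed for it (no smoothing)."""
    categories = np.asarray(categories)
    labels = np.asarray(labels, dtype=float)
    return {c: labels[categories == c].mean() for c in np.unique(categories)}

cats = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1])        # clean labels
y_star = np.array([1, 0, 1, 0, 0, 1, 0, 1])   # two labels flipped

enc_clean = target_encode(cats, y)        # a -> 0.75, b -> 0.25
enc_noisy = target_encode(cats, y_star)   # a -> 0.50, b -> 0.50
```

With clean labels the two categories are clearly separated; after two flips the encoding no longer distinguishes them.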
    
CatBoost used as base classifier in all cases (default settings, apart from `iterations=100` and `verbose=False` in the config below)

compare following methods for cleaning training data
* baseline (no cleaning done)
* CleanLab
* scikit-clean MCS 
* micro-model majority voting (hand-built)
* micro-model consensus voting (hand-built)

measure AUC on (clean) test data

repeat process 5 times for each experiment (start with clean data, add random noise, filter noise back out, train classifier, etc.), compute mean and std. dev of AUC for each

CleanLab usually winds up being the best, but not uniformly. Baseline is sometimes the best for zero noise (as expected), and sometimes MCS or micro-model majority will come out ahead

In [None]:
# basic imports
import os
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
import humanize
import pickle

# basics from sklearn
from sklearn.metrics import roc_auc_score
from category_encoders.target_encoder import TargetEncoder

# noise generation
from skclean.simulate_noise import flip_labels_cc, BCNoise

# base classifiers
from catboost import CatBoostClassifier

# cleaning methods/helpers
from cleanlab.classification import CleanLearning
from micro_models import MicroModelCleaner
from skclean.pipeline import Pipeline
from skclean.handlers import Filter
from skclean.detectors import MCS

# dataset loader
from load_fdb_datasets import prepare_noisy_dataset, dataset_stats

In [None]:
# wrapper definitions for the various types of cleaning methods we will use. 
# Each one wraps a model_class (in our case catboost, but could use xgboost, etc.)
# resulting model_class can then take noisy data in its .fit() method and clean before training

def baseline_model(model_class, params):
    return model_class(**params)

def cleanlab_model(model_class, params, pulearning=False):
    if pulearning:
        return CleanLearning(model_class(**params), pulearning=pulearning)
    else:
        return CleanLearning(model_class(**params))
    
def micromodels(model_class, pulearning, num_clfs, threshold, params):
    return MicroModelCleaner(model_class, pulearning=pulearning, num_clfs=num_clfs, threshold=threshold, **params)

def skclean_MCS(model_class, params):
    skclean_pipeline = Pipeline([
        ('detector',MCS(classifier=model_class(**params))),
        ('handler',Filter(model_class(**params)))
    ])
    return skclean_pipeline

In [None]:
# some high-level parameters, 
# the number of runs for each experiment (determine mean/std. dev)
num_samples = 5 
# whether to use target encoding on categorical features
target_encoding = True
# whether to save intermediate results to disk (in case of failure etc.)
save_results = True

# we will be creating a lot of classifiers, let's use the same parameters for each
model_config_dict = {
    'catboost': {
        'model_class': CatBoostClassifier,
        'default_params': {
            'verbose': False,
            'iterations': 100
        }
    }
}

# all of our experiments will use catboost and boundary-consistent noise
base_model_type = 'catboost'
noise_type = 'boundary-consistent'
model_class = model_config_dict[base_model_type]['model_class']

# the set of experimental parameters, we will iterate over all these datasets
keys = ['ieeecis', 'sparknov', 'ccfraud', 'fraudecom', 'fakejob', 'vehicleloan', 'twitterbot']
# all these cleaning methods
clf_types = ['baseline', 'skclean_MCS', 'cleanlab', 'micromodels_majority', 'micromodels_consensus']
# all these noise levels
noise_amounts = [0, 0.1, 0.2, 0.3, 0.4, 0.5]
# and we will let cleaning methods know that noise can happen for either class
pulearning = None

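# a little bit of setup for saving intermediate results to disk
# (path and file name are defined unconditionally because the main loop builds the path either way)
results_file_path = './results'
results_file_name = '{}_noise_benchmark_results.pkl'
if save_results:
    # create the results directory if it does not already exist
    os.makedirs(results_file_path, exist_ok=True)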

In [None]:
# initialize results dict, we will index results by dataset/noise_amount/cleaning_method
results = {}

# main experimental loop   
for key in keys:
    # check to see if we have already run this experiment and saved to disk
    full_result_path = os.path.join(results_file_path,results_file_name.format(key))
    if os.path.exists(full_result_path) and save_results:
        with open(full_result_path, 'rb') as results_file:
            results[key] = pickle.load(results_file)
    # otherwise start from scratch
    else:
        # initialize sub-results
        results[key] = {}
        model_params = model_config_dict[base_model_type]['default_params']
        
        for noise_amount in noise_amounts:
            print(f"\n =={key}_{noise_amount}== \n")
            
            # initialize sub-sub-results
            results[key][noise_amount] = {}

            # these are the cleaning classifiers we will use
            clfs = {
                'baseline': baseline_model(model_class, model_params),
                'skclean_MCS': skclean_MCS(model_class, model_params),
                'cleanlab': cleanlab_model(model_class, model_params, pulearning),
                'micromodels_majority': micromodels(model_class, pulearning=pulearning,
                                                    num_clfs=8, threshold=0.5, params=model_params),
                'micromodels_consensus': micromodels(model_class, pulearning=pulearning,
                                                     num_clfs=8, threshold=1, params=model_params),

            }
            print('generating datasets')
            # preparing a dataset has some overhead, we want to do this five times for each dataset/noise level
            # we will save a little bit of time by doing this in advance and using same set of five
            # for each cleaning method
            datasets = [prepare_noisy_dataset(key, noise_type, noise_amount, split=1, target_encoding=target_encoding) 
                        for i in range(num_samples)]
            
            # now for each cleaning method, train a "clean" model on noisy training data, then determine
            # auc on clean test data and record the results. Do this five times for each cleaning method
            # to determine mean/std. dev
            for clf_type in clfs:
                print(f"testing {clf_type}")
                auc = []
                try:
                    for i in range(num_samples):
                        # grab the dataset we need for this run and extract metadata and subsets
                        dataset = datasets[i]
                        features, cat_features, label = dataset['features'], dataset['cat_features'], dataset['label']
                        train, test = dataset['train'], dataset['test']
                        X_tr, y_tr = train[features], train[label].values.reshape(-1)
                        X_ts, y_ts = test[features], test[label].values.reshape(-1)
                        clf = clfs[clf_type]
                        # fit the "clean" classifier on noisy training data
                        clf.fit(X_tr, y_tr)
                        # make predictions on clean test data and calculate AUC
                        y_pred = clf.predict_proba(X_ts)[:, 1]
                        auc.append(roc_auc_score(y_ts, y_pred))
                        print(f"{clf_type} auc: {auc}", end="\r", flush=True)
                    # store mean/std. dev for this run in the results dict
                    results[key][noise_amount][clf_type] = (np.mean(auc), np.std(auc), auc)
                    print('\n{} auc: {:.2f} ± {:.4f}\n'.format(clf_type,
                                                               *results[key][noise_amount][clf_type][:2]))
                # if this run failed for some reason, handle it gracefully
                except Exception as e:
                    results[key][noise_amount][clf_type] = (0, 0, [0] * num_samples)
                    print(e)
    
    # if we are saving intermediate results to disk, do so now
    if save_results:
        with open(full_result_path, 'wb') as results_file:
            pickle.dump(results[key], results_file)

In [None]:
# a couple of helper functions to analyze/summarize results

def highlight_max(s, props=''):
    return np.where(s == np.nanmax(s.values), props, '')

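def record_places(places, scores):
    """Tally finishing places for one experiment; `scores` maps clf -> (mean, std_dev).

    A method ties with the current place holder when its mean plus one std. dev
    reaches that holder's mean.
    """
    # rank methods from best to worst mean AUC
    scores = dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))
    # sentinel score of 2 exceeds any AUC, so the top method always takes 1st place
    last_score, last_placement = 2, 1
    for i, clf in enumerate(scores):
        if scores[clf][0] + scores[clf][1] >= last_score:
            placement = last_placement
        else:
            placement = i + 1
            last_score = scores[clf][0]
            last_placement = placement
        places[clf][placement] += 1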

In [None]:
# create dataframe of results for each experiment, also process results into dict for keeping track of 
# 1st/2nd/etc. place, as well as a dict for plotting later

places = {clf:{p:0 for p in range(1,len(clf_types)+1)} for clf in clf_types}
plots = {key:{clf:[[],[]] for clf in clf_types} for key in keys}
        
for key in results.keys():
    print(f"\n =={key}==\n")
    rows = pd.Index([clf_type for clf_type in clf_types])
    columns = pd.MultiIndex.from_product([noise_amounts, ['mean','std_dev']], names=['noise level', 'auc'])
    df = pd.DataFrame(index=rows, columns=columns)
    
    for noise_amount in noise_amounts:
        scores = {}
        for clf_type in clf_types:
            auc = results[key][noise_amount][clf_type]  
            df.loc[clf_type, (noise_amount, 'mean')] = auc[0] 
            df.loc[clf_type, (noise_amount, 'std_dev')] = auc[1]
            scores[clf_type] = (auc[0], auc[1])

            plots[key][clf_type][0].append(noise_amount)
            plots[key][clf_type][1].append(auc[0])
        record_places(places, scores)
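    display(df.style.set_caption(f"{key}")
            # in recent pandas versions a second .format() call resets the 'mean' columns
            # to the default formatter, so pass both sets of formats in a single dict
            .format({**{(n,'mean'): "{:.2f}" for n in noise_amounts},
                     **{(n,'std_dev'): "{:.4f}" for n in noise_amounts}})
            .apply(highlight_max, props='font-weight:bold;background-color:lightblue', axis=0,
                  subset=[[n,'mean'] for n in noise_amounts]))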

In [None]:
# produce "race results" (i.e. how many first place, second place, etc. finishes)

race_results = pd.DataFrame.from_dict(places).rename(index=lambda x : humanize.ordinal(x))
race_results['totals'] = race_results.sum(axis=1)
display(race_results)
print(race_results.to_latex())

In [None]:
# finally, we can plot the results of individual experiments

colors = ['black','purple','green','red','orange']
linestyles = ['-','--',':']
ylims = {
    'boundary-consistent': {
        'ieeecis':[0.5,0.9],
        'sparknov':[0.5,1],
        'ccfraud':[0.25,1],
        'fraudecom':[0.48,0.52],
        'fakejob':[0.5,1],
        'vehicleloan':[0.57,0.66],
        'twitterbot':[0.7,0.95]
    },
    'class-conditional': {
        'ieeecis':[0.7,0.9],
        'sparknov':[0.7,1],
        'ccfraud':[0.8,1],
        'fraudecom':[0.48,0.52],
        'fakejob':[0.7,1],
        'vehicleloan':[0.5,0.7],
        'twitterbot':[0.8,0.95]
    }
}

x_labels = {
    'boundary-consistent':'Boundary-Consistent Noise Level',
    'class-conditional':'Class-Conditional Type 2 Noise Level'
}

legends = {
    'boundary-consistent':'Cleaning Method',
    'class-conditional':'Type 1 Noise, Cleaning Method'
}
def fix_failures(x):
    if x == 0:
        return None
    else:
        return x

def labels(noise_type, noise_amount, clf_type):
    if noise_type == 'boundary-consistent':
        return '{}'.format(clf_type)
    elif noise_type == 'class-conditional':
        return '{}, {}'.format(noise_amount, clf_type)

for key in results.keys():
    plt.figure(figsize=(10,10))
    
    for c, clf_type in enumerate(clf_types):
        a = plots[key][clf_type]
        plt.plot(a[0],[fix_failures(v) for v in a[1]],
                 label=labels(noise_type, noise_amount, clf_type),
                 color=colors[c],
                 linestyle=linestyles[0])
    plt.title(key)
    plt.xlabel(x_labels[noise_type])
    plt.ylabel('Test AUC')
    plt.ylim(ylims[noise_type][key])
    plt.legend(title=legends[noise_type])
    # make sure the output directory exists before saving
    os.makedirs("./figures", exist_ok=True)
    plt.savefig(f"./figures/label_noise_{key}.png")
    plt.show()