
Jigsaw Toxic 2019 Solution

A solution to the Jigsaw Unintended Bias in Toxicity Classification Kaggle competition.

Fine-tunes BERT and GPT-2 models on the training data with custom weighting schemes and auxiliary target variables.
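To make "custom weighting schemes and auxiliary target variables" more concrete, here is a minimal sketch of a per-example loss that weights the main toxicity term and adds an unweighted auxiliary term. It is plain Python rather than the PyTorch code used in this repo, and the names `weighted_multitask_loss`, `sample_weight`, and `aux_scale` are hypothetical, chosen only for illustration. The weighting actually used in this project may differ.

```python
import math


def bce_with_logit(logit, target):
    """Numerically stable binary cross-entropy for a single logit."""
    # Equivalent to -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))]
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def weighted_multitask_loss(logits, targets, sample_weight, aux_scale=1.0):
    """Loss for one example (illustrative assumption, not this repo's code).

    logits[0] / targets[0] are the main toxicity output and label;
    any remaining entries are auxiliary outputs and labels.
    """
    # Only the main target is scaled by the per-example weight.
    main = sample_weight * bce_with_logit(logits[0], targets[0])
    # Auxiliary targets contribute their unweighted mean loss.
    aux = [bce_with_logit(x, t) for x, t in zip(logits[1:], targets[1:])]
    aux_mean = sum(aux) / len(aux) if aux else 0.0
    return main + aux_scale * aux_mean
```

In this sketch, an example that mentions an identity subgroup could be given a larger `sample_weight` so that mistakes on it cost more, which is the general idea behind such weighting schemes.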

Unfortunately, I used a buggy evaluation metric function during the competition, which severely undermined the effort I put into it. After the competition, I fixed the function and incorporated some of the custom weighting schemes shared by top competitors.
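For reference, the competition metric combines the overall AUC with three per-identity bias AUCs (subgroup, BPSN, and BNSP), each aggregated with a power mean. The sketch below follows my understanding of the published definition (power p = -5, all four weights 0.25). It is not the fixed function from this repo, and it assumes labels and subgroup flags are already binarized. The O(n²) `auc` helper is for clarity only.

```python
def auc(labels, scores):
    """ROC AUC via pairwise comparison; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        return float("nan")  # undefined when one class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def power_mean(values, p=-5):
    """Generalized mean; assumes every value is strictly positive."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


def bias_aucs(labels, scores, in_subgroup):
    """Return (subgroup AUC, BPSN AUC, BNSP AUC) for one identity."""
    rows = list(zip(labels, scores, in_subgroup))

    def pick(keep):
        sel = [(y, s) for y, s, g in rows if keep(y, g)]
        return auc([y for y, _ in sel], [s for _, s in sel])

    subgroup = pick(lambda y, g: g)
    # Background positive, subgroup negative
    bpsn = pick(lambda y, g: (g and not y) or (not g and y))
    # Background negative, subgroup positive
    bnsp = pick(lambda y, g: (g and y) or (not g and not y))
    return subgroup, bpsn, bnsp


def final_score(labels, scores, subgroups, p=-5, w=0.25):
    """subgroups: one list of boolean flags per identity."""
    per_identity = [bias_aucs(labels, scores, g) for g in subgroups]
    bias_terms = [power_mean([m[i] for m in per_identity], p) for i in range(3)]
    return w * auc(labels, scores) + w * sum(bias_terms)
```

Because the power mean with a negative exponent is dominated by the smallest value, a single poorly handled identity drags the score down. This also makes the metric easy to get subtly wrong.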

TODO: Try the renamed huggingface/pytorch-transformers (from huggingface/pytorch-pretrained-BERT) package and the new XLNet models.


Unfortunately, the dependencies of this project are not as well versioned as in my last project, ceshine/imet-collection-2019. This time, however, I included a Dockerfile that can replicate a working environment (at least at the time of writing, July 2019).

Some peculiarities specific to this project:

  • A bundled copy of a dependency is included and should be installed via pip install. This is because the version I used, which lived on that project's master branch, never made it to PyPI, and the latest PyPI version is not compatible with this project.
  • pytorch_helper_bot is included via git subtree to ease the cognitive load on users (it's not on PyPI yet, and I'm not planning to publish it there).

Generally speaking, the essential dependencies of this project (besides the above two) include:

  • PyTorch >= 1.0
  • NVIDIA/apex (for reducing GPU memory consumption and speeding up training on newer GPUs).
  • pandas

TODO: Write down the specific versions of major dependencies that are proven to work.

Kaggle Training and Predicting Workflow

I used almost exactly the same framework as in ceshine/imet-collection-2019, except that this time we don't need a separate validation Kernel. The validation scoring function/metric is integrated into the helperbot workflow.

I used a Kaggle Dataset, toxic-cache, to store tokenized training data, so the kernel won't need to re-tokenize the whole training set in every single run.
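The caching idea can be sketched as a simple load-or-compute helper. This is only an illustration under assumed names: `load_or_tokenize`, the `tokenize` callable, and the pickle format are hypothetical and not necessarily how toxic-cache was built.

```python
import os
import pickle


def load_or_tokenize(texts, tokenize, cache_path):
    """Return tokenized texts, reusing a cached copy when one exists."""
    if os.path.exists(cache_path):
        # Cache hit: skip tokenization entirely.
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Cache miss: tokenize once and store the result for later runs.
    tokens = [tokenize(t) for t in texts]
    with open(cache_path, "wb") as f:
        pickle.dump(tokens, f)
    return tokens
```

On Kaggle, the cache file would be produced once and uploaded as a Dataset, and later kernel runs would read it from the attached dataset directory. Note that this sketch does not detect changes to the texts or the tokenizer, so a stale cache has to be deleted by hand.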

Google Colab Training

Example Colab Notebook: the code is cloned directly from this GitHub repo, but the dataset, cache, and model weights live on Google Drive (you need to set that up in your own account).
