# Hackathon: Fine-tuning DistilBERT

© Data Trainers LLC. GPL v 3.0.

**Author:** Axel Sirota


In this notebook we will fine-tune DistilBERT, a smaller, distilled version of Google's BERT transformer, to create a toxicity classification model.

You can run this lab either locally or in Colab.

- To run in Colab, go to `https://colab.research.google.com`, sign in, and upload this notebook. Colab offers free GPU access.
- To run locally, first install the requirements in `requirements.txt`, then run `jupyter notebook` and open the notebook for this lab.
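
Before running locally, it can help to confirm that the packages this notebook relies on are importable. The sketch below is one way to do that. The module list is an assumption assembled from the install and import cells in this notebook, since the contents of `requirements.txt` are not shown here, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Assumption: module names collected from this notebook's pip and import
# cells; the real requirements.txt may list different packages.
REQUIRED_MODULES = ["tensorflow", "keras_nlp", "textblob", "nltk", "numpy", "gensim"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

If anything is reported missing, installing from `requirements.txt` should resolve it.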

You can use any architecture you want! Good luck!

In [1]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found
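
The output above shows that `nvidia-smi` is not installed on this machine, so no NVIDIA GPU is visible to the shell. A minimal sketch of a check that degrades gracefully instead of printing a shell error, using only the standard library (`gpu_summary` is a hypothetical helper name):

```python
import shutil
import subprocess


def gpu_summary():
    """Return nvidia-smi's output, or None if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    return result.stdout if result.returncode == 0 else None


summary = gpu_summary()
print(summary if summary else "nvidia-smi unavailable or failed; assuming CPU only.")
```

In Colab, a GPU runtime has to be selected in the runtime settings before `nvidia-smi` becomes available.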


In [2]:
!pip install textblob 'keras-nlp' 'keras-preprocessing' 'gensim==4.2.0' np_utils

Collecting keras-nlp
  Downloading keras_nlp-0.6.3-py3-none-any.whl (584 kB)
Collecting keras-preprocessing
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
Collecting gensim==4.2.0
  Downloading gensim-4.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.0 MB)
Collecting np_utils
  Downloading np_utils-0.6.0.tar.gz (61 kB)
  Preparing metadata (setup.py) ... done
Collecting keras-core (from keras-nlp)
  Downloading keras_core-0.1.7-py3-non

In [3]:
import multiprocessing
import os
import random
import warnings

import keras.backend as K
import nltk
import numpy as np
import tensorflow as tf
from textblob import TextBlob

TRACE = False
embedding_dim = 100
rnn_units = 128
epochs = 100
buffer_size = 256
max_len = 50
# Batch size
batch_size = 256
min_count_words = 3
BATCH = True

def set_seeds_and_trace():
  os.environ['PYTHONHASHSEED'] = '0'
  np.random.seed(42)
  tf.random.set_seed(42)
  random.seed(42)
  if TRACE:
    tf.debugging.set_log_device_placement(True)

def set_session_with_gpus_and_cores():
  cores = multiprocessing.cpu_count()
  gpus = len(tf.config.list_physical_devices('GPU'))
  config = tf.compat.v1.ConfigProto(
      device_count={'GPU': gpus, 'CPU': cores},
      intra_op_parallelism_threads=1,
      inter_op_parallelism_threads=1,
  )
  sess = tf.compat.v1.Session(config=config)
  tf.compat.v1.keras.backend.set_session(sess)

set_seeds_and_trace()
set_session_with_gpus_and_cores()
warnings.filterwarnings('ignore')
nltk.download('punkt')
textblob_tokenizer = lambda x: TextBlob(x).words

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
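
The `set_seeds_and_trace` function above fixes the random seeds so that repeated runs draw the same pseudo-random numbers. A minimal sketch of that effect, using only Python's built-in `random` module (`sample_with_seed` is a hypothetical helper; the NumPy and TensorFlow seeds in the cell above are set for the same reason):

```python
import random


def sample_with_seed(seed, n=5):
    """Seed Python's RNG, then draw n integers in the range [0, 99]."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = sample_with_seed(42)
second = sample_with_seed(42)

# Re-seeding with the same value restarts the same sequence.
print(first == second)  # True
```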


In [4]:
%%writefile get_data.sh
if [ ! -f toxic_comments.csv ]; then
  wget -O toxic_comments.csv "https://www.dropbox.com/s/qecfi95tirln8sh/toxic_comments.csv?dl=0"
fi


Writing get_data.sh
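
The script only downloads the dataset when `toxic_comments.csv` is not already present, so re-running the cell is cheap. The same download-if-missing pattern can be sketched in Python. Here `download_if_missing` and its `fetch` parameter are hypothetical names introduced for illustration, and whether `urlretrieve` follows the Dropbox redirects the way `wget` does is an untested assumption.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest, fetch=urlretrieve):
    """Download url to dest unless dest already exists.

    Returns True when a download was attempted and False when it was skipped.
    The fetch argument is injectable so the logic can be tried offline.
    """
    dest = Path(dest)
    if dest.exists():
        return False
    fetch(url, str(dest))
    return True
```

Calling `download_if_missing(url, "toxic_comments.csv")` with the URL from `get_data.sh` would mirror the shell script's behavior under the assumption above.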


In [5]:
!bash get_data.sh

--2023-11-19 19:03:58--  https://www.dropbox.com/s/qecfi95tirln8sh/toxic_comments.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.80.18, 2620:100:6018:18::a27d:312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.80.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/qecfi95tirln8sh/toxic_comments.csv [following]
--2023-11-19 19:03:59--  https://www.dropbox.com/s/raw/qecfi95tirln8sh/toxic_comments.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc3b86433c214efe82224ab63f49.dl.dropboxusercontent.com/cd/0/inline/CH0GJKGzA5Exga1RBLZAuDs-Fn76PlG0c7p6keR6MPRLImyAR762gJNfXYv3NE-iYw1iYULylpGZDWWhfX61-HCdNAR_oB70xFsFTb8Qnma68dYcl-2eT0JgAvA0GNG29gIkv68zROcKDEs4mfEsp6t3/file# [following]
--2023-11-19 19:03:59--  https://uc3b86433c214efe82224ab63f49.dl.dropboxusercontent.com/cd/0/inline/CH0GJKGzA5Exga1RBLZAuDs-Fn76PlG0c7p6keR6MPRLImyAR762gJNfXYv3NE-iYw1iYULylpGZD

In [6]:
!head -n 5 toxic_comments.csv

"id","comment_text","toxic","severe_toxic","obscene","threat","insult","identity_hate"
"0000997932d777bf","Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,0,0,0,0,0
"000103f0d9cfb60f","D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",0,0,0,0,0,0
"000113f07ec002fd","Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,0,0,0,0,0
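
Note that `head -n 5` printed the header plus only three records: the first comment contains a newline inside its quoted `comment_text` field, so one record spans two physical lines. Counting lines therefore does not count records. A minimal sketch of the difference, using Python's standard `csv` module on a shortened, made-up sample with the same quoting style (only three of the eight columns are kept):

```python
import csv
import io

# Assumption: a made-up sample in the same layout as toxic_comments.csv.
# The first record has a newline inside the quoted comment_text field.
sample = (
    '"id","comment_text","toxic"\n'
    '"a1","Explanation\nWhy were my edits reverted?",0\n'
    '"b2","Thanks for the help.",0\n'
)

physical_lines = sample.count("\n")
rows = list(csv.DictReader(io.StringIO(sample)))

print(physical_lines)  # 4 physical lines, including the header
print(len(rows))       # 2 records
print(repr(rows[0]["comment_text"]))
```

Any loader used for this dataset needs to respect quoted newlines in the same way, rather than splitting the file line by line.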
