# garbage_classifier

This notebook classifies website text snippets as useful or not (i.e., garbage) using transfer learning, starting from an existing Hugging Face model:
* Get a model checkpoint for an encoder model
* Use fine-tuning to apply the model to a new classification problem (EAGER website data) with limited new training data
* Apply the new model head to the full EAGER corpus to come up with mixes of models
* Register metrics and the model through a combination of comet.ml and TensorBoard

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sun Jul 10 18:04:10 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Install and import libraries

In [1]:
COMET_PROJECT_NAME = "eager-garbage-classifier"

In [2]:
# check environment
import sys
IN_COLAB = 'google.colab' in sys.modules
print (IN_COLAB)

True


In [4]:
# colab file system setup 
if IN_COLAB: 
    !git clone https://github.com/euphonic/EAGER.git
    !pwd
    !mkdir /content/logs

Cloning into 'EAGER'...
remote: Enumerating objects: 19733, done.
remote: Counting objects: 100% (425/425), done.
remote: Compressing objects: 100% (205/205), done.
remote: Total 19733 (delta 263), reused 360 (delta 220), pack-reused 19308
Receiving objects: 100% (19733/19733), 370.93 MiB | 29.20 MiB/s, done.
Resolving deltas: 100% (5981/5981), done.
Checking out files: 100% (5176/5176), done.
/content


In [4]:
# mount google drive if in colab
drive_path = '/content/drive/'

if IN_COLAB:  
    from google.colab import drive
    drive.mount(drive_path, force_remount=True)

Mounted at /content/drive/


In [10]:
# install huggingface and other modules if in colab
if IN_COLAB: 
    !pip install transformers
    !pip install datasets
    !pip install python-dotenv
    !pip install comet_ml

if IN_COLAB: 
    !pip uninstall -y comet_ml==3.30.0
    !pip install comet_ml

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Found existing installation: comet-ml 3.30.0
Uninstalling comet-ml-3.30.0:
  Successfully uninstalled comet-ml-3.30.0
Collecting comet_ml
  Downloading comet_ml-3.31.5-py2.py3-none-any.whl (361 kB)
Collecting sentry-sdk>=1.1.0
  Downloading sentry_sdk-1.6.0-py2.py3-none-any.whl (145 kB)
Installing collected packages: sentry-sdk, comet-ml
Successfully installed comet-ml-3.31.5 sentr

In [81]:
from comet_ml import Experiment
from comet_ml.api import API
from dotenv import load_dotenv

# setup comet_ml experiment
if IN_COLAB: 
    # read env file from Google drive 
    env_file = drive_path + 'MyDrive/raaste-config/.env'
    comet_config_file = drive_path + 'MyDrive/raaste-config/.comet.config'
    load_dotenv(env_file)

In [6]:
# ml libraries
from transformers import pipeline, AutoTokenizer, TFAutoModelForSequenceClassification, DataCollatorWithPadding, AutoConfig, AdamWeightDecay
from datasets import Dataset
import datasets
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from sklearn.model_selection import train_test_split
import pandas as pd

# other
import numpy as np
import gzip
import tarfile
import datetime

In [7]:
# load tensorboard 
%load_ext tensorboard

## Garbage classifier
Labels: keep text == 1, discard == 0

In [8]:
# load the tokenizer that matches the encoder checkpoint
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [9]:
firm_file_location = '/content/EAGER/data/modeling/garbage/garbage_classifier_input.csv'
input_df = pd.read_csv(firm_file_location)

In [10]:
non_null_df = input_df[~ input_df['sample_text'].isnull() ]
non_null_df.shape

(5601, 2)

In [11]:
dataset = Dataset.from_pandas(non_null_df, split='train')
# cast_column returns a new Dataset, so keep the result
dataset = dataset.cast_column("of_interest", datasets.Value('int8'))
dataset

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Dataset({
    features: ['sample_text', 'of_interest', '__index_level_0__'],
    num_rows: 5601
})

In [12]:
# 80% train, 20% test + validation
train_test_dataset = dataset.train_test_split(test_size=0.2)
# Split the 20% test + valid in half test, half valid
test_valid_dataset = train_test_dataset['test'].train_test_split(test_size=0.5)
# gather everyone if you want to have a single DatasetDict
train_test_valid_dataset = datasets.DatasetDict({
    'train': train_test_dataset['train'],
    'test': test_valid_dataset['test'],
    'valid': test_valid_dataset['train']})

In [13]:
train_test_valid_dataset

DatasetDict({
    train: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__'],
        num_rows: 4480
    })
    test: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__'],
        num_rows: 561
    })
    valid: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__'],
        num_rows: 560
    })
})
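
The split sizes above (4480 / 561 / 560) can be reproduced with plain arithmetic. The sketch below assumes the test share is rounded up and the remainder goes to train; that rounding rule is an assumption chosen because it matches the printed sizes, and `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the test share up (assumed rule)."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


# First split: 80% train, 20% held out.
n_train, n_holdout = split_sizes(5601, 0.2)
# Second split: the held-out rows are halved into valid ("train" side) and test.
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_test, n_valid)
```

Because neither `train_test_split` call passes a `seed`, the row membership of each split changes between runs even though the sizes stay the same.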

In [14]:
def tokenize_function(x):
  return tokenizer(x["sample_text"], truncation=True, max_length=100)

In [15]:
tokenized_dataset = train_test_valid_dataset.map(tokenize_function, batched=True, batch_size=None)




In [16]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4480
    })
    test: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 561
    })
    valid: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 560
    })
})

In [17]:
samples = tokenized_dataset["train"].to_dict()
samples = {k: v for k, v in samples.items() if k not in ["__index_level_0__", "sample_text"]}
for k, v in samples.items(): 
  print (k, v[0:5])

of_interest [0, 0, 0, 0, 0]
input_ids [[101, 1041, 1011, 5653, 2149, 102], [101, 4773, 3981, 2869, 102], [101, 2466, 102], [101, 20116, 2099, 102], [101, 9152, 2232, 102]]
token_type_ids [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
attention_mask [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]


In [18]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length", max_length=100, return_tensors="tf")

In [19]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'attention_mask': TensorShape([4480, 100]),
 'input_ids': TensorShape([4480, 100]),
 'of_interest': TensorShape([4480]),
 'token_type_ids': TensorShape([4480, 100])}
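
To make the `[4480, 100]` shapes concrete, here is a minimal pure-Python sketch of what padding to `max_length=100` does to the first few tokenized rows printed earlier. It is an illustration, not the collator's actual implementation; `pad_to_max_length` is a hypothetical helper, and the pad id of 0 is the `[PAD]` token id in the `bert-base-uncased` vocabulary.

```python
PAD_ID = 0        # [PAD] token id in the bert-base-uncased vocabulary
MAX_LENGTH = 100  # same value passed to the tokenizer and the collator


def pad_to_max_length(input_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Right-pad one tokenized row and build the matching masks."""
    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("row is longer than max_length; truncate it first")
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
        # single-segment input, so every position is segment 0
        "token_type_ids": [0] * max_length,
    }


# First three rows of input_ids from the training split shown above.
rows = [
    [101, 1041, 1011, 5653, 2149, 102],
    [101, 4773, 3981, 2869, 102],
    [101, 2466, 102],
]
padded = [pad_to_max_length(r) for r in rows]
print([len(p["input_ids"]) for p in padded])
print([sum(p["attention_mask"]) for p in padded])
```

Padding every row to the full 100 tokens keeps tensor shapes fixed across batches, at the cost of extra computation on very short snippets like these.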

In [67]:
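# config: start from the checkpoint's settings, then shrink the encoder
# and use a single output logit for the binary keep/discard label
config = AutoConfig.from_pretrained(checkpoint)
config.num_labels = 1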
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2
config.hidden_size = 64
config.intermediate_size = 256
config.num_hidden_layers = 4
config.num_attention_heads = 4
print (type(config))

<class 'transformers.models.bert.configuration_bert.BertConfig'>
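
The overrides above describe a much smaller encoder than `bert-base-uncased` (hidden size 64 instead of 768), so the checkpoint's pretrained weights cannot fit this shape; a model built with `from_config` starts from freshly initialized weights. The sketch below estimates the parameter budget of the reduced configuration. It assumes the standard BERT layout (embeddings, per-layer attention and feed-forward blocks, pooler, linear classifier) and the `bert-base-uncased` defaults of a 30522-token vocabulary, 512 positions and 2 segment types; `bert_param_count` is a hypothetical helper, and the result is an estimate rather than a value read from the model.

```python
def bert_param_count(vocab_size, hidden, intermediate, layers, heads,
                     max_positions=512, type_vocab=2, num_labels=1):
    """Estimate trainable parameters for a BERT-style classifier."""
    if hidden % heads:
        raise ValueError("hidden size must be divisible by the number of heads")
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # query, key, value and output projections, each with a bias
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return {
        "head_dim": hidden // heads,
        "embeddings": embeddings,
        "encoder": layers * (attention + feed_forward),
        "head": pooler + classifier,
    }


small = bert_param_count(vocab_size=30522, hidden=64, intermediate=256,
                         layers=4, heads=4)
total = small["embeddings"] + small["encoder"] + small["head"]
print(small, total)
print(f"embedding share: {small['embeddings'] / total:.1%}")
```

Under these assumptions most of the parameters sit in the token embedding table, so the four transformer layers themselves are a small part of the model.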


In [87]:
num_epochs = 50
batch_sizes = [64]

for bs in batch_sizes: 
  # start a new Comet experiment for this batch size
  experiment = Experiment(project_name=COMET_PROJECT_NAME)
  with experiment.train():
    experiment.log_parameter("batch_size", bs)

  # model
  model = TFAutoModelForSequenceClassification.from_config(config)
  print (type(model))

  print ('batch_size', bs)

  tf_train_dataset = tokenized_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols="of_interest",
    shuffle=True,
    collate_fn=data_collator,
    batch_size=bs,
  )

  tf_validation_dataset = tokenized_dataset["valid"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols="of_interest",
    shuffle=False,
    collate_fn=data_collator,
    batch_size=bs,
  )

  # The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
  # by the total number of epochs
  num_train_steps = len(tf_train_dataset) * num_epochs
  lr_scheduler = PolynomialDecay(
      initial_learning_rate=5e-5, end_learning_rate=0, decay_steps=num_train_steps
  )

  opt = Adam(learning_rate=lr_scheduler, beta_1=0.9, beta_2=0.98)

  log_dir = "/content/logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)    

  early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

  loss = tf.keras.losses.BinaryFocalCrossentropy(from_logits=True, gamma=0.0, label_smoothing=0.2) # gamma = 0 is equivalent to binary cross entropy
  model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

  model.fit(tf_train_dataset, validation_data=tf_validation_dataset, 
        epochs=num_epochs, callbacks=[tensorboard_callback, early_stopping_callback])
  
  experiment.end()

COMET ERROR: Failed to calculate active processors count. Fall back to default CPU count 1
COMET INFO: Experiment is live on comet.ml https://www.comet.com/euphonic/eager-garbage-classifier/ebdc86fc38124b3089c4448e1b7b0690



<class 'transformers.models.bert.modeling_tf_bert.TFBertForSequenceClassification'>
batch_size 64




Epoch 1/50




Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
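
Training stopped at epoch 16 of 50, which is what `EarlyStopping(monitor='val_loss', patience=5)` does once `val_loss` has failed to improve for five epochs in a row. The learning-rate schedule, however, was sized for all 50 epochs. The sketch below re-implements both rules in plain Python to show the interaction. `polynomial_decay` follows the linear (power 1) decay formula, `stopping_epoch` is a hypothetical helper mirroring the patience rule, and the `val_losses` list is made-up illustrative data, not the values logged in this run.

```python
def polynomial_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=3500, power=1.0):
    """Learning rate after `step` optimizer updates."""
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


def stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops, or None."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


steps_per_epoch = 4480 // 64         # training rows / batch size
decay_steps = steps_per_epoch * 50   # schedule sized for num_epochs = 50

# Hypothetical curve: 11 improving epochs, then 5 without improvement.
val_losses = [0.67 - 0.02 * i for i in range(11)] + [0.50] * 5
stop = stopping_epoch(val_losses)
lr_at_stop = polynomial_decay(stop * steps_per_epoch, decay_steps=decay_steps)
print(stop, lr_at_stop)
```

With these numbers the learning rate has only decayed by about a third when training stops. If early stopping is expected to trigger, sizing `decay_steps` for fewer epochs may be worth trying.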


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.com/euphonic/eager-garbage-classifier/ebdc86fc38124b3089c4448e1b7b0690
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     accuracy [16]                : (0.6602678298950195, 0.9598214030265808)
COMET INFO:     batch_accuracy [112]         : (0.515625, 0.984375)
COMET INFO:     batch_loss [112]             : (0.3403220772743225, 0.6932670474052429)
COMET INFO:     epoch_duration [16]          : (3.177935914000045, 19.266099873000712)
COMET INFO:     loss [16]                    : (0.3788703680038452, 0.6813281178474426)
COMET INFO:     val_accuracy [16]            : (0.6517857313156128, 0.8999999761581421)
COMET INFO:     val_loss [16]                : (0.44740596413612366, 0.6695268154144287)
COMET INFO:     validate_batch_accuracy [16] : (0.6
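
The loss values in this summary are easier to read once label smoothing is taken into account. With `label_smoothing=0.2` the targets become 0.9 and 0.1 instead of 1 and 0, so the loss cannot reach zero even for a perfect classifier. The sketch below is a plain-Python approximation of the loss used in the training cell, assuming the usual smoothing rule `y * (1 - s) + 0.5 * s` and a focal factor of `(1 - p_t) ** gamma`; the function names are hypothetical and this is not the Keras implementation.

```python
import math


def smooth(label, smoothing=0.2):
    """Pull a hard 0/1 label towards 0.5."""
    return label * (1.0 - smoothing) + 0.5 * smoothing


def bce_from_logit(logit, target):
    """Binary cross-entropy for one example, given a raw logit."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def focal_bce_from_logit(logit, label, gamma=0.0, smoothing=0.2):
    """Focal variant: scale BCE by (1 - p_t) ** gamma."""
    target = smooth(label, smoothing)
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = target * p + (1.0 - target) * (1.0 - p)
    return (1.0 - p_t) ** gamma * bce_from_logit(logit, target)


# gamma = 0 makes the focal factor 1, leaving plain BCE on smoothed targets.
plain = bce_from_logit(1.5, smooth(1))
focal = focal_bce_from_logit(1.5, 1, gamma=0.0)

# The best possible prediction for a smoothed target of 0.9 is p = 0.9,
# i.e. logit = ln(0.9 / 0.1); the loss there is the entropy of (0.9, 0.1).
loss_floor = focal_bce_from_logit(math.log(9.0), 1)
print(plain, focal, round(loss_floor, 4))
```

Under these assumptions the floor is roughly 0.325, which is consistent with the logged minima (0.34 for `batch_loss`, 0.38 for `loss`) sitting just above it.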

## Register model

In [79]:
# save model to disk -- can be added to for loop above
file_name = COMET_PROJECT_NAME.replace ('-', '_') + '.tf'
model_save_path = '/content/models/' + file_name
print (model_save_path)
model.save (model_save_path)

/content/models/eager_garbage_classifier.tf


In [89]:
experiment.log_model(COMET_PROJECT_NAME, model_save_path)

Please double-check the directory path and the recursive parameter


In [85]:
api = API()
type(experiment)

comet_ml.api.APIExperiment

In [83]:
best_run = 'euphonic/' + COMET_PROJECT_NAME + '/concerned_pouf_708'

api = API()
experiment = api.get(best_run)



AttributeError: ignored

In [75]:

experiment.register_model("eager-garbage-classifier")

ValueError: ignored

In [66]:
print(type(model))

<class 'transformers.models.bert.modeling_tf_bert.TFBertForSequenceClassification'>


## Register model

In [None]:
experiment.end()