# garbage_classifier

This notebook classifies website text snippets as useful or not useful (i.e., garbage) using transfer learning, starting from an existing Hugging Face model:
* Get a model checkpoint for an encoder model
* Fine-tune the model on a new classification problem (EAGER website data) with limited new training data
* Apply the new model head to the full EAGER corpus to come up with mixes of models
* Log metrics and register the model through a combination of comet.ml and TensorBoard

## Install and import libraries

In [1]:
COMET_PROJECT_NAME = "eager-garbage-classifier"

In [2]:
# check environment
import sys
IN_COLAB = 'google.colab' in sys.modules
print (IN_COLAB)

True


In [3]:
# colab file system setup 
if IN_COLAB: 
    !git clone https://github.com/euphonic/EAGER.git
    !pwd
    !mkdir /content/logs

Cloning into 'EAGER'...
remote: Enumerating objects: 19658, done.
remote: Counting objects: 100% (350/350), done.
remote: Compressing objects: 100% (243/243), done.
remote: Total 19658 (delta 207), reused 176 (delta 107), pack-reused 19308
Receiving objects: 100% (19658/19658), 370.55 MiB | 22.73 MiB/s, done.
Resolving deltas: 100% (5925/5925), done.
Checking out files: 100% (5172/5172), done.
/content


In [4]:
# mount google drive if in colab
drive_path = '/content/drive/'

if IN_COLAB:  
    from google.colab import drive
    drive.mount(drive_path, force_remount=True)

Mounted at /content/drive/


In [7]:
# install huggingface and other modules if in colab
if IN_COLAB: 
    !pip install transformers
    !pip install datasets
    !pip install python-dotenv
    !pip install comet_ml==3.30.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.0-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalli

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-dotenv
  Downloading python_dotenv-0.20.0-py3-none-any.whl (17 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-0.20.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [8]:
from comet_ml import Experiment
from dotenv import load_dotenv

# setup comet_ml experiment
if IN_COLAB: 
    # read env file from Google drive 
    env_file = drive_path + 'MyDrive/raaste-config/.env'
    comet_config_file = drive_path + 'MyDrive/raaste-config/.comet.config'
    load_dotenv(env_file)

In [9]:
# ml libraries
from transformers import pipeline, AutoTokenizer, TFAutoModelForSequenceClassification, DataCollatorWithPadding
from datasets import Dataset
import datasets
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from sklearn.model_selection import train_test_split
import pandas as pd

# other
import numpy as np
import gzip
import tarfile
import datetime

In [10]:
# load tensorboard 
%load_ext tensorboard

## Garbage classifier
keep text == 1, discard == 0

In [11]:
# load the tokenizer that matches the pretrained encoder checkpoint
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [12]:
firm_file_location = '/content/EAGER/data/modeling/garbage/garbage_classifier_input.csv'
input_df = pd.read_csv(firm_file_location)

In [13]:
non_null_df = input_df[~ input_df['sample_text'].isnull() ]
non_null_df

Unnamed: 0,sample_text,of_interest
0,Our Management,0
1,Latest Press Releases,0
2,On-Going Clinical Studies on Very Low Nicotine...,1
3,Links to the “Miracle Plant”,0
4,This advisory note presents the conclusions an...,1
...,...,...
5619,DLS,0
5620,Sign up to get the latest news from Socialx,1
5621,Our Distributors,0
5622,The motivation for starting was the frustratio...,1


In [14]:
dataset = Dataset.from_pandas(non_null_df, split='train')
# cast_column returns a new Dataset, so reassign to keep the cast
dataset = dataset.cast_column("of_interest", datasets.Value('int8'))
dataset

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Dataset({
    features: ['sample_text', 'of_interest', '__index_level_0__'],
    num_rows: 5624
})

In [15]:
# 70% train, 30% held out for validation + test
train_test_dataset = dataset.train_test_split(test_size=0.3)
# split the held-out 30% into 70% validation and 30% test
test_valid_dataset = train_test_dataset['test'].train_test_split(test_size=0.30)
# gather the three splits into a single DatasetDict
train_test_valid_dataset = datasets.DatasetDict({
    'train': train_test_dataset['train'],
    'test': test_valid_dataset['test'],
    'valid': test_valid_dataset['train']})

In [16]:
train_test_valid_dataset

DatasetDict({
    train: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__'],
        num_rows: 3936
    })
    test: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__'],
        num_rows: 507
    })
    valid: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__'],
        num_rows: 1181
    })
})
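The split sizes shown above can be reproduced with a little arithmetic. The sketch below assumes the held-out count is rounded up (`ceil(n_rows * test_size)`) and the remaining rows go to the other split. That rounding rule is an assumption, although it does reproduce the 3936 / 1181 / 507 counts displayed above. `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out


# first split: 5624 rows -> train vs. held-out (validation + test)
n_train, n_holdout = split_sizes(5624, 0.3)
# second split: held-out rows -> validation vs. test
n_valid, n_test = split_sizes(n_holdout, 0.3)

print(n_train, n_valid, n_test)
```

Under this assumption the overall proportions are roughly 70% train, 21% validation, and 9% test.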

In [17]:
def tokenize_function(x):
  return tokenizer(x["sample_text"], truncation=True, max_length=100)

In [18]:
tokenized_dataset = train_test_valid_dataset.map(tokenize_function, batched=True, batch_size=2000)




In [19]:
samples = tokenized_dataset["train"].to_dict()
samples = {k: v for k, v in samples.items() if k not in ["idx", "sample_text"]}
# set([len(x) for x in samples["input_ids"]])

In [20]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length", max_length=100, return_tensors="tf")

In [21]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'__index_level_0__': TensorShape([3936]),
 'attention_mask': TensorShape([3936, 100]),
 'input_ids': TensorShape([3936, 100]),
 'of_interest': TensorShape([3936]),
 'token_type_ids': TensorShape([3936, 100])}
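Every tensor in the batch above has a second dimension of exactly 100 because sequences are truncated to `max_length=100` during tokenization and then padded to the same length by the collator. The following is a simplified, pure-Python sketch of that idea. `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores. The pad id of 0 matches the `[PAD]` token of `bert-base-uncased`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]          # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)         # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 2256, 2968, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding everything to a fixed length is simple but wastes computation on short snippets; padding each batch only to its longest member is a common alternative.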

In [22]:
num_epochs = 50

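def train():
  # NOTE: relies on the globals `opt`, `tf_train_dataset`, and `tf_validation_dataset`,
  # which the batch-size loop below sets before each call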
  log_dir = "/content/logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)    

  early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

  model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
  loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
  model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

  model.fit(tf_train_dataset, validation_data=tf_validation_dataset, 
          epochs=num_epochs, callbacks=[tensorboard_callback, early_stopping_callback])
  
  return model
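The `EarlyStopping` callback above ends training once `val_loss` has failed to improve for `patience=5` consecutive epochs, which is why a run scheduled for 50 epochs can finish much sooner. Below is a simplified sketch of that rule in plain Python. It is not the Keras implementation (it ignores options such as `min_delta` and weight restoration), `epochs_until_stop` is a hypothetical helper, and the loss values are invented for illustration.

```python
def epochs_until_stop(val_losses, patience):
    """Return how many epochs run before a patience-based early stop.

    Simplified rule: stop once `patience` consecutive epochs pass
    without the validation loss dropping below the best value so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best: reset the patience counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: all epochs run


# made-up validation losses: best value at epoch 2, then no improvement
losses = [0.50, 0.40, 0.45, 0.46, 0.47, 0.48, 0.49, 0.30]
print(epochs_until_stop(losses, patience=5))
```

In this example training stops after epoch 7, so the lower loss at epoch 8 is never reached; a larger `patience` trades extra compute for a chance to see such late improvements.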


In [30]:
batch_sizes = [8, 16, 32, 64]

for bs in batch_sizes: 
  # start a new Comet experiment for this batch size (credentials come from the environment loaded earlier)
  experiment = Experiment(project_name=COMET_PROJECT_NAME)
  with experiment.train():
    experiment.log_parameter("batch_size", bs)

  print ('batch_size', bs)

  tf_train_dataset = tokenized_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols="of_interest",
    shuffle=True,
    collate_fn=data_collator,
    batch_size=bs,
  )

  tf_validation_dataset = tokenized_dataset["valid"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols="of_interest",
    shuffle=False,
    collate_fn=data_collator,
    batch_size=bs,
  )

  # The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
  # by the total number of epochs
  num_train_steps = len(tf_train_dataset) * num_epochs
  lr_scheduler = PolynomialDecay(
      initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
  )
  opt = Adam(learning_rate=lr_scheduler)

  model = train()
  
  experiment.end()

  model.save('/content/drive/MyDrive/eager-models/garbage_classifier_v3_bs' + str(bs))
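The learning-rate schedule in the loop above decays from `5e-5` to `0.0` over `num_train_steps`. With the default power of 1, `PolynomialDecay` is a linear ramp, which the following plain-Python sketch reproduces. `linear_decay_lr` is a hypothetical helper, and the step count assumes the 3936 training rows shown earlier and a batch size of 8.

```python
def linear_decay_lr(step, decay_steps, initial_lr=5e-5, end_lr=0.0):
    """Learning rate at `step` for a power-1 polynomial (linear) decay."""
    step = min(step, decay_steps)  # the rate stays at end_lr after decay_steps
    return (initial_lr - end_lr) * (1 - step / decay_steps) + end_lr


steps_per_epoch = 3936 // 8                 # assumed: 3936 rows, batch size 8
num_epochs = 50
decay_steps = steps_per_epoch * num_epochs  # total scheduled training steps

for epoch in (0, 6, 25, 50):
    step = epoch * steps_per_epoch
    print(epoch, linear_decay_lr(step, decay_steps))
```

Because the decay is spread over all 50 scheduled epochs, a run that early stopping ends after 6 epochs has only reduced the rate by about 12% under these assumptions. If a fuller decay is wanted, `decay_steps` could be based on the number of epochs actually expected to run.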

COMET ERROR: Failed to calculate active processors count. Fall back to default CPU count 1
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/euphonic/eager-garbage-classifier/fd8b07e5a09a40c2ad158fd6916801e2



batch_size 8


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/euphonic/eager-garbage-classifier/fd8b07e5a09a40c2ad158fd6916801e2
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     accuracy [6]                 : (0.8907520174980164, 0.9872967600822449)
COMET INFO:     batch_accuracy [300]         : (0.5, 1.0)
COMET INFO:     batch_loss [300]             : (0.0027239869814366102, 0.730162501335144)
COMET INFO:     epoch_duration [6]           : (97.66186477700012, 132.29903482600002)
COMET INFO:     loss [6]                     : (0.04978441819548607, 0.26426997780799866)
COMET INFO:     val_accuracy [6]             : (0.8966977000236511, 0.9170194864273071)
COMET INFO:     val_loss [6]                 : (0.21996869146823883, 0.456766277551651)
COMET INFO:     validate_batch_accuracy [90] : (0.870967745

INFO:tensorflow:Assets written to: /content/drive/MyDrive/eager-models/garbage_classifier_v3_bs8/assets
COMET ERROR: Failed to calculate active processors count. Fall back to default CPU count 1
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/euphonic/eager-garbage-classifier/0ee679baa204415e8d123db7a895d2fa



batch_size 16


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/50




Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/euphonic/eager-garbage-classifier/0ee679baa204415e8d123db7a895d2fa
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     accuracy [7]                 : (0.8953251838684082, 0.9905995726585388)
COMET INFO:     batch_accuracy [175]         : (0.5625, 1.0)
COMET INFO:     batch_loss [175]             : (0.0020765531808137894, 0.706519603729248)
COMET INFO:     epoch_duration [7]           : (78.68659436200141, 123.16816753600051)
COMET INFO:     loss [7]                     : (0.023565934970974922, 0.2609458863735199)
COMET INFO:     val_accuracy [7]             : (0.8924640417098999, 0.9161727428436279)
COMET INFO:     val_loss [7]                 : (0.2504030168056488, 0.46417173743247986)
COMET INFO:     validate_batch_accuracy [56] : (0.86290

INFO:tensorflow:Assets written to: /content/drive/MyDrive/eager-models/garbage_classifier_v3_bs16/assets
COMET ERROR: Failed to calculate active processors count. Fall back to default CPU count 1
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/euphonic/eager-garbage-classifier/11eefa33cbcd4f08a34f8441db4f8e49



batch_size 32


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/50




Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/euphonic/eager-garbage-classifier/11eefa33cbcd4f08a34f8441db4f8e49
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     accuracy [6]                 : (0.8998983502388, 0.9926320910453796)
COMET INFO:     batch_accuracy [78]          : (0.4375, 1.0)
COMET INFO:     batch_loss [78]              : (0.00806517992168665, 0.7393709421157837)
COMET INFO:     epoch_duration [6]           : (71.18901056899995, 110.29101833300047)
COMET INFO:     loss [6]                     : (0.025743670761585236, 0.25243130326271057)
COMET INFO:     val_accuracy [6]             : (0.8933107256889343, 0.9127857685089111)
COMET INFO:     val_loss [6]                 : (0.2116578221321106, 0.34844979643821716)
COMET INFO:     validate_batch_accuracy [24] : (0.87215906

INFO:tensorflow:Assets written to: /content/drive/MyDrive/eager-models/garbage_classifier_v3_bs32/assets
COMET ERROR: Failed to calculate active processors count. Fall back to default CPU count 1
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/euphonic/eager-garbage-classifier/c1ad25c4bfc34ef19fb5181ca9af041e



batch_size 64


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/50




Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/euphonic/eager-garbage-classifier/c1ad25c4bfc34ef19fb5181ca9af041e
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     accuracy [7]                 : (0.889735758304596, 0.9900914430618286)
COMET INFO:     batch_accuracy [49]          : (0.65625, 1.0)
COMET INFO:     batch_loss [49]              : (0.021025780588388443, 0.6695417761802673)
COMET INFO:     epoch_duration [7]           : (67.55666414900043, 100.82430036599908)
COMET INFO:     loss [7]                     : (0.02908296138048172, 0.2907353341579437)
COMET INFO:     val_accuracy [7]             : (0.8882303237915039, 0.9229466319084167)
COMET INFO:     val_loss [7]                 : (0.1999080628156662, 0.40699630975723267)
COMET INFO:     validate_batch_accuracy [14] : (0.875, 0

INFO:tensorflow:Assets written to: /content/drive/MyDrive/eager-models/garbage_classifier_v3_bs64/assets
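To compare the four runs at a glance, the validation figures from the Comet summaries above can be collected into a small table. The sketch below copies the maximum `val_accuracy` and minimum `val_loss` reported for each batch size (rounded to four decimals). The dictionary layout and variable names are illustrative assumptions, not part of the notebook.

```python
# best per-run validation figures copied from the Comet summaries above
runs = {
    8:  {"max_val_accuracy": 0.9170, "min_val_loss": 0.2200},
    16: {"max_val_accuracy": 0.9162, "min_val_loss": 0.2504},
    32: {"max_val_accuracy": 0.9128, "min_val_loss": 0.2117},
    64: {"max_val_accuracy": 0.9229, "min_val_loss": 0.1999},
}

best_by_acc = max(runs, key=lambda bs: runs[bs]["max_val_accuracy"])
best_by_loss = min(runs, key=lambda bs: runs[bs]["min_val_loss"])

for bs, m in runs.items():
    print(f"bs={bs:<3} val_acc={m['max_val_accuracy']:.4f} val_loss={m['min_val_loss']:.4f}")
print("best by accuracy:", best_by_acc, "| best by loss:", best_by_loss)
```

These are per-run extremes over epochs rather than final-epoch values, the gaps between runs are about one percentage point or less, and each run comes from a single random split, so the ranking should be read as indicative only.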


## Register model