# text-classification-model-v1
This notebook classifies website text using transfer learning, starting from an existing Hugging Face model:
* Get a model checkpoint for an encoder model
* Fine-tune the model on a new classification problem (EAGER website data) with limited new training data
* Apply the new model head to the full EAGER corpus to come up with mixes of models

In [1]:
!git clone https://github.com/euphonic/EAGER.git

Cloning into 'EAGER'...
remote: Enumerating objects: 19616, done.
remote: Counting objects: 100% (308/308), done.
remote: Compressing objects: 100% (207/207), done.
remote: Total 19616 (delta 180), reused 164 (delta 101), pack-reused 19308
Receiving objects: 100% (19616/19616), 370.00 MiB | 25.37 MiB/s, done.
Resolving deltas: 100% (5898/5898), done.
Checking out files: 100% (11765/11765), done.


## Install libraries

In [None]:
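# pymongo ships its own bson module, used here for reading MongoDB dumps;
# uninstall the standalone bson package first because it conflicts with pymongo's
!pip uninstall --yes bson
!pip install pymongo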

In [2]:
# install Hugging Face libraries (DatasetDict ships with the datasets package)
!pip install transformers
!pip install datasets

Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed huggingface-hub-0.6.0 py

In [27]:
import tensorflow as tf
import numpy as np
from transformers import pipeline, AutoTokenizer, TFAutoModelForSequenceClassification

import gzip
import tarfile
import shutil
import bson
import pandas as pd

## Canned Examples

In [None]:
print(pipeline('sentiment-analysis')('This application looks promising'))

In [28]:
# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = dict(tokenizer(sequences, padding=True, truncation=True, return_tensors="tf"))

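# This is new
# Caveat: the string loss assumes the model outputs probabilities, while this model
# returns logits; the garbage classifier below uses
# SparseCategoricalCrossentropy(from_logits=True) instead.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
labels = tf.convert_to_tensor([1, 1])
model.train_on_batch(batch, labels)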

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


2.3841855067985307e-07

## EAGER test

In [None]:
# EAGER data -- unpack
tar_dir = '/dbfs/FileStore/eager/'
tar_file = tar_dir + "FirmDB_about2_20190131.tar"
print(tar_file)

In [None]:
# tar = tarfile.open(tar_file)
# tar.extractall(tar_dir)
# tar.close()

In [None]:
# gunzip about us pages
ungzip_file = tar_dir + "FirmDB_about2_20190131/pages_ABOUT2.bson"
gzip_file = ungzip_file + ".gz"
print(gzip_file)

In [None]:
with gzip.open(gzip_file, 'rb') as f_in:
    with open(ungzip_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [None]:
with open(ungzip_file,'rb') as f:
    about_pages = bson.decode_all(f.read())

In [None]:
about_pages[0]['full_text']

### Zero-shot classification

In [None]:
firm_file_location = '/dbfs/FileStore/eager/about/Zygo.txt'
# read the firm's "about" page as a list of lines; the context manager closes the file
with open(firm_file_location, "r") as txt_file:
    content_list = txt_file.readlines()
print(content_list)

In [None]:
content_list[6]

In [None]:
classifier = pipeline("zero-shot-classification")

In [None]:
candidate_labels = ["business", "marketing", "manufacturing", "research", "engineering"]

# classify lines 6-9 of the page against the candidate labels
for i in range(6, 10):
    res = classifier(content_list[i], candidate_labels)
    print(res)

In [None]:
# the zero-shot pipeline needs candidate labels for every call
classifier("I havern't a dog.", candidate_labels)

In [None]:
classifier = pipeline("text-classification", model = "textattack/distilbert-base-uncased-CoLA")

for line in content_list:
    print(line)
    # res = classifier(line)
    # print(res)

### Garbage classifier
Labels: `keep_text == 1` means keep the text, `keep_text == 0` means discard it.

In [14]:
import datasets  # needed below for datasets.Value and datasets.DatasetDict
from datasets import Dataset
from sklearn.model_selection import train_test_split
import pandas as pd

In [11]:
firm_file_location = '/content/EAGER/data/modeling/garbage/garbage_classifier_input.csv'
input_df = pd.read_csv(firm_file_location)

In [12]:
non_null_df = input_df[~ input_df['sample_text'].isnull() ]
non_null_df

Unnamed: 0,sample_text,keep_text
0,Our Management,0
1,Latest Press Releases,0
2,On-Going Clinical Studies on Very Low Nicotine...,0
3,Links to the “Miracle Plant”,0
4,This advisory note presents the conclusions an...,1
...,...,...
4057,How It Works,0
4058,Read Article,0
4059,Tank Farms,0
4060,Industries Served,0


In [16]:
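dataset = Dataset.from_pandas(non_null_df, split='train')
# cast_column returns a new Dataset rather than modifying in place, so keep the result
dataset = dataset.cast_column("keep_text", datasets.Value('int8'))
dataset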

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Dataset({
    features: ['sample_text', 'keep_text', '__index_level_0__'],
    num_rows: 4055
})

In [19]:
# 80% train, 20% held out for test + validation
train_test_dataset = dataset.train_test_split(test_size=0.2)
# Split the held-out 20% again: 20% of it becomes test, the remaining 80% validation
test_valid_dataset = train_test_dataset['test'].train_test_split(test_size=0.2)
# gather the three splits into a single DatasetDict
train_test_valid_dataset = datasets.DatasetDict({
    'train': train_test_dataset['train'],
    'test': test_valid_dataset['test'],
    'valid': test_valid_dataset['train']})

In [20]:
train_test_valid_dataset

DatasetDict({
    train: Dataset({
        features: ['sample_text', 'keep_text', '__index_level_0__'],
        num_rows: 3244
    })
    test: Dataset({
        features: ['sample_text', 'keep_text', '__index_level_0__'],
        num_rows: 163
    })
    valid: Dataset({
        features: ['sample_text', 'keep_text', '__index_level_0__'],
        num_rows: 648
    })
})
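
The row counts above follow from applying `test_size=0.2` twice. A minimal sketch of that arithmetic, assuming the test share is rounded up to a whole row (the convention scikit-learn uses, and which `datasets` appears to follow here). `split_sizes` is a hypothetical helper, not part of either library:

```python
def split_sizes(n_rows, test_percent):
    """Return (n_train, n_test) when test_percent of the rows are held out.

    Assumption: the test share is rounded up to a whole row.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    n_test = -(-n_rows * test_percent // 100)  # ceiling division
    return n_rows - n_test, n_test


# First split: 4055 non-null rows, 20% held out
n_train, n_heldout = split_sizes(4055, 20)
# Second split: 20% of the held-out rows become the test set
n_valid, n_test = split_sizes(n_heldout, 20)
print(n_train, n_valid, n_test)  # 3244 648 163
```

Under these assumptions the test set ends up with about 4% of the rows and the validation set with about 16%.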

In [25]:
def tokenize_function(x):
  return tokenizer(x["sample_text"], truncation=True, max_length=100)

In [29]:
tokenized_dataset = train_test_valid_dataset.map(tokenize_function, batched=True, batch_size=2000)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [32]:
samples = tokenized_dataset["train"].to_dict()
samples = {k: v for k, v in samples.items() if k not in ["idx", "sample_text"]}
set([len(x) for x in samples["input_ids"]])

{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100}

In [33]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length", max_length=100, return_tensors="tf")

In [34]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'__index_level_0__': TensorShape([3244]),
 'attention_mask': TensorShape([3244, 100]),
 'input_ids': TensorShape([3244, 100]),
 'keep_text': TensorShape([3244]),
 'token_type_ids': TensorShape([3244, 100])}
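
The shapes above show every sequence padded out to 100 tokens. As a rough illustration of what padding to a fixed length involves, here is a plain-Python sketch. `pad_batch` is a hypothetical helper, the pad id of 0 is an assumption matching the `[PAD]` token of `bert-base-uncased`, and the middle token ids in the example are made up. The real collator also handles `token_type_ids` and tensor conversion.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Pad (or truncate) each list of token ids to exactly max_length.

    Returns the padded ids plus an attention mask that is 1 for real
    tokens and 0 for padding, so the model can ignore the padded positions.
    """
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]              # truncate overly long inputs
        n_pad = max_length - len(seq)       # number of pad positions needed
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# 101 and 102 are the [CLS] and [SEP] ids; the ids between them are illustrative
example = pad_batch([[101, 2256, 2968, 102], [101, 102]], max_length=6)
print(example["input_ids"])
print(example["attention_mask"])
```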

In [35]:
tf_train_dataset = tokenized_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols="keep_text",
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tokenized_dataset["valid"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols="keep_text",
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

In [36]:
tf_train_dataset

<PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(8, None), dtype=tf.int64, name=None), 'token_type_ids': TensorSpec(shape=(8, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(8, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(8,), dtype=tf.int64, name=None))>

In [37]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model.compile(
    optimizer="adam",
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
)
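
`from_logits=True` tells the loss that the model outputs are raw scores rather than probabilities. A minimal plain-Python sketch of the per-example computation (softmax followed by negative log-likelihood), written with the log-sum-exp trick for numerical stability. `sparse_ce_from_logits` is a hypothetical name used only for illustration:

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy of one example given raw logits and an integer class label.

    Equivalent to -log(softmax(logits)[label]), computed as
    logsumexp(logits) - logits[label].
    """
    m = max(logits)  # subtract the max before exponentiating to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Equal logits mean a 50/50 prediction, so the loss is ln(2)
print(sparse_ce_from_logits([0.0, 0.0], 1))
# A confident, correct prediction gives a small loss; a confident, wrong one a large loss
print(sparse_ce_from_logits([-3.0, 3.0], 1), sparse_ce_from_logits([-3.0, 3.0], 0))
```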

In [39]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay

batch_size = 8
num_epochs = 5
# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
num_train_steps = len(tf_train_dataset) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)

opt = Adam(learning_rate=lr_scheduler)
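
With its default power of 1, `PolynomialDecay` reduces to a straight line from the initial to the end learning rate. A small plain-Python sketch of that formula, assuming 405 batches per epoch (3244 // 8, as the comment above describes). `linear_decay` is a hypothetical helper, not a Keras function:

```python
def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=2025):
    """Learning rate at a given step for a linear (power=1) polynomial decay."""
    step = min(step, decay_steps)            # hold the end rate after decay_steps
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left + end_lr


# Assumed step count: 405 batches per epoch for 5 epochs
num_train_steps = (3244 // 8) * 5
for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_decay(step, decay_steps=num_train_steps))
```

The rate starts at 5e-5, is roughly halved midway through training, and reaches 0 on the last step.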

In [40]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [41]:
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=num_epochs)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fe8fabb2d50>

In [43]:
model.save('/content/EAGER/models/garbage_classifier_v1')



INFO:tensorflow:Assets written to: /content/EAGER/models/garbage_classifier_v1/assets



In [None]:
from google.colab import drive
drive.mount('/content/drive')