# GA Capstone
## Classification Modeling

The goal here is to build a binary classification model that labels text as Shakespearean (label 1) or not (label 0), using Farjeon's sonnets as the non-Shakespearean class.

Much of what follows is adapted from the [Hugging Face Text Classification Tutorial](https://huggingface.co/docs/transformers/tasks/sequence_classification) and the notebook linked therein.

### Imports and Preliminaries

In [1]:
# IMPORTS
# Datasets for dataset formatting
from datasets import Dataset, DatasetDict

# tokenizer and collator
from transformers import AutoTokenizer, DataCollatorWithPadding

# model and optimizer
from transformers import TFAutoModelForSequenceClassification, create_optimizer

# support
import numpy as np
import os
import re
import random


In [2]:
# pretrained model designator
MODEL_TYPE = 'distilbert-base-uncased'

# model batch size
BATCH_SIZE = 16

# model num epochs
N_EPOCHS = 8

In [3]:
# directories, etc.
MODEL_DIR = '../models/'
MODEL_NAME = 'shakespeare-farjeon'
MODEL_FULL_PATH = os.path.join(MODEL_DIR, f'{MODEL_NAME}.{MODEL_TYPE}.{N_EPOCHS}')

In [4]:
# helpful regexes
# raw string so that \w reaches the regex engine instead of being treated as a string escape
RE_SENTENCE = re.compile(r'\w.*?[.!?:;]')
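
To make the behavior of `RE_SENTENCE` concrete, here is a small sketch on a made-up sample string (the sample is an assumption for illustration, not text from the data files). Each match starts at a word character and runs lazily to the first `.`, `!`, `?`, `:` or `;`, so any trailing text without a terminator is dropped.

```python
import re

# Same pattern as above, written as a raw string.
RE_SENTENCE = re.compile(r'\w.*?[.!?:;]')

# Hypothetical sample text, used only to illustrate the splitting behavior.
sample = "Be wise as thou art cruel; do not press my patience. Whither goest thou? trailing words"

pieces = RE_SENTENCE.findall(sample)
for piece in pieces:
    print(repr(piece))
```

Because `.` does not match newlines by default, this pattern relies on the text being joined into a single line first, as the loading cell below does.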

### Data Loading and Preparation

In [5]:
# load data
with open('../data/shakespeare-sonnets.clean.txt', 'r') as f:
    shakespeare = ' '.join([line.strip() for line in f.readlines() if line.strip()])[1:]
print(shakespeare[:250])

with open('../data/unused/farjeon-sonnets.clean.txt', 'r') as f:
    farjeon = ' '.join([line.strip() for line in f.readlines() if line.strip()])[5:]
print(farjeon[:250])

From fairest creatures we desire increase, That thereby beauty’s rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou, contracted to thine own bright eyes, Feed’st thy light’s flame with self-
Man cannot be a sophist to his heart, He must look nakedly on his intent, Expose it of all shreds of argument, And strip it like a slave-girl in the mart. What though with speckled truths and masked confessions He still deceives awhile the outer sens


In [6]:
# split into sentences
sentences = {
    'shakespeare': RE_SENTENCE.findall(shakespeare),
    'farjeon': RE_SENTENCE.findall(farjeon)
}
print(sentences['shakespeare'][0])
print(sentences['farjeon'][0])

From fairest creatures we desire increase, That thereby beauty’s rose might never die, But as the riper should by time decease, His tender heir might bear his memory:
Man cannot be a sophist to his heart, He must look nakedly on his intent, Expose it of all shreds of argument, And strip it like a slave-girl in the mart.


In [7]:
# manually train-test split since we're not using pandas and want to retain class proportions
test_ratio = 0.1
train_ratio = 1 - test_ratio

splits = {'test': list(), 'train': list()}

for author, collection in sentences.items():
    # index where the held-out tail of this author's sentences begins
    split_ix = int(len(collection) * train_ratio)
    label = 1 if author == 'shakespeare' else 0

    # not at all elegant, but gets the job done
    train = collection[:split_ix]
    test = collection[split_ix:]
    for sentence in train:
        splits['train'].append({'text': sentence, 'label': label})
    for sentence in test:
        splits['test'].append({'text': sentence, 'label': label})
        
splits['train'][0], splits['train'][-1], splits['test'][0], splits['test'][-1]

({'text': 'From fairest creatures we desire increase, That thereby beauty’s rose might never die, But as the riper should by time decease, His tender heir might bear his memory:',
  'label': 1},
 {'text': 'such patience, sure, Is not lifes child and mine, but mine and deaths.',
  'label': 0},
 {'text': 'Be wise as thou art cruel;', 'label': 1},
 {'text': 'Then fold me in your bosom so deep away That memory cannot touch this loveless day.',
  'label': 0})
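
The per-author split above can also be wrapped in a small helper. The sketch below is one possible refactoring, not the notebook's actual code: the function name `split_by_author` and its `positive` and `seed` parameters are hypothetical. With `seed=None` it keeps the original behavior (the last 10% of each author's sentences become the test set); passing a seed shuffles each author's sentences first.

```python
import random

def split_by_author(sentences, test_ratio=0.1, positive='shakespeare', seed=None):
    """Split each author's sentences separately so class proportions are preserved."""
    splits = {'train': [], 'test': []}
    for author, collection in sentences.items():
        items = list(collection)
        if seed is not None:
            # optional: shuffle so the test set is not just the tail of the text
            random.Random(seed).shuffle(items)
        split_ix = int(len(items) * (1 - test_ratio))
        label = 1 if author == positive else 0
        splits['train'] += [{'text': s, 'label': label} for s in items[:split_ix]]
        splits['test'] += [{'text': s, 'label': label} for s in items[split_ix:]]
    return splits

# Toy data standing in for the real sentence lists.
toy = {
    'shakespeare': [f's{i}' for i in range(10)],
    'farjeon': [f'f{i}' for i in range(20)],
}
toy_splits = split_by_author(toy, test_ratio=0.1)
print(len(toy_splits['train']), len(toy_splits['test']))
```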

In [8]:
# format into DatasetDict format
train_data = Dataset.from_list(splits['train'])
test_data = Dataset.from_list(splits['test'])
dataset = DatasetDict({'train': train_data, 'test': test_data})
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1024
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 114
    })
})

### Tokenization and Prepping Collator

In [9]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE)

def tokenizer_func(batch):
    # with batched=True, `batch['text']` is a list of sentences
    return tokenizer(batch['text'])

In [10]:
tokenized_data = dataset.map(tokenizer_func, batched=True)


In [11]:
collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors='tf')
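
The collator pads each batch only up to the length of its longest sequence rather than to a fixed global length. The NumPy sketch below illustrates that idea in simplified form; it is not the library's implementation, `pad_batch` is a hypothetical helper, and the token ids are made up for the example.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Pad a list of token-id lists to the longest one and build an attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1  # 1 marks real tokens, 0 marks padding
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Two illustrative sequences of different lengths (ids are placeholders).
batch = pad_batch([[101, 2013, 102], [101, 2158, 3685, 2022, 102]])
print(batch['input_ids'])
print(batch['attention_mask'])
```

Padding per batch keeps short batches small, which matters here because the sentence lengths vary a lot.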

### Modeling

In [12]:
# instantiate model
if os.path.exists(MODEL_FULL_PATH):
    model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_FULL_PATH)
else:
    model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_TYPE)

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of

In [13]:
# prep train and test sets for model
tf_train_set = model.prepare_tf_dataset(
    tokenized_data['train'],
    shuffle=True,
    batch_size=BATCH_SIZE,
    collate_fn=collator
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_data['test'],
    shuffle=False,
    batch_size=BATCH_SIZE,
    collate_fn=collator
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [14]:
# set up optimizer
batches_per_epoch = len(tokenized_data['train']) // BATCH_SIZE
total_train_steps = int(batches_per_epoch * N_EPOCHS)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
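
To see what the step count and schedule work out to, here is a plain-Python sketch. It assumes `create_optimizer` is left at its default decay settings, which to my understanding means a linear decay from `init_lr` to zero over `num_train_steps` when there is no warmup; `linear_decay_lr` is a hypothetical helper written for illustration, not a library function.

```python
def linear_decay_lr(step, init_lr, total_steps, end_lr=0.0):
    """Learning rate after `step` optimizer steps under a linear decay (assumed schedule)."""
    step = min(step, total_steps)
    fraction_left = 1.0 - step / total_steps
    return (init_lr - end_lr) * fraction_left + end_lr

# Values from this notebook: 1024 training rows, batch size 16, 8 epochs.
steps_per_epoch = 1024 // 16
total_steps = steps_per_epoch * 8

for step in (0, total_steps // 2, total_steps):
    print(step, linear_decay_lr(step, init_lr=2e-5, total_steps=total_steps))
```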

In [15]:
# compile and fit model
if not os.path.exists(MODEL_FULL_PATH):
    model.compile(optimizer=optimizer)
    model.fit(tf_train_set, validation_data=tf_test_set, epochs=N_EPOCHS)
    os.makedirs(MODEL_FULL_PATH)
    model.save_pretrained(MODEL_FULL_PATH)

### Test

In [18]:
# sample 10 sentences per author (with replacement), then add a few hand-written probes
tests = random.choices(sentences['farjeon'], k=10)
tests += random.choices(sentences['shakespeare'], k=10)
tests += [
    "I'm just a guy doing what guys do",
    "Get away from me you mischevious rogue!",
    "Whither goest thou?",
    "Romeo, O Romeo! Wherefore art thou Romeo?"
]

In [19]:
def get_class_from_output(output):
    return np.argmax(output.logits, axis=1)

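def get_probs_from_output(output, c=1):
    # element-wise sigmoid of each logit, then select column c;
    # note the columns are not normalized to sum to 1 (that would be a softmax)
    logits = output.logits
    return (np.exp(logits) / (1 + np.exp(logits)))[:,c]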

tests_tokened = tokenizer(tests, return_tensors='tf', padding=True)
outputs = model(tests_tokened)
classifications = get_class_from_output(outputs)
probs = get_probs_from_output(outputs)
list(zip(tests, classifications, probs))

[('Why if we dare not hear make hearing ours?', 0, 0.15327503),
 ('Lo, this and this and this I did not spend!', 0, 0.14387587),
 ('And thou wilt to the earth at last, times scorn, Relinquishing a crown thou hast not worn.',
  0,
  0.1252184),
 ('Alas, poor fools!', 0, 0.3099175),
 ('Am I here or there?', 0, 0.1783713),
 ('But since my walls of ignorance are broken, Though on that desert knowledge builds no towers, I cannot say of life, he has not spoken, I cannot say of love, he has no powers.',
  0,
  0.12840965),
 ('O what damnation man would deal himself If meeting her beyond his uttermost dreams He still could face his soul and lie to her.',
  0,
  0.13320598),
 ('Hast neither earth nor seed?', 0, 0.14555538),
 ('A few of us who faltered as we fared Love has returned for.',
  0,
  0.12886521),
 ('half-truth hedged with lies!', 0, 0.13973568),
 ('Against my love shall be as I am now, With Time’s injurious hand crush’d and o’erworn;',
  1,
  0.9382365),
 ('My mistress’ eyes are noth
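
The scores above come from applying a logistic function to each logit separately, so the two class scores for a sentence need not sum to 1. If normalized class probabilities are wanted, a softmax over the two logits is the usual choice. The sketch below is an alternative under that assumption, not the code that produced the numbers above; `softmax_probs` is a hypothetical helper and the logits are made up.

```python
import numpy as np

def softmax_probs(logits, c=1):
    """Return the softmax probability of class `c` for each row of logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # subtract the row-wise max for numerical stability; this does not change the result
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return (exp / exp.sum(axis=1, keepdims=True))[:, c]

# Made-up logits for three sentences: columns are (class 0, class 1).
example_logits = np.array([[2.0, -1.0], [-1.5, 2.5], [0.0, 0.0]])
probs = softmax_probs(example_logits)
print(probs)
```

With two classes, the softmax probability of class 1 equals the logistic function of the difference between the two logits, so the predicted class from `argmax` is unchanged.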

### Conclusion

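Classification works pretty well in this spot check: every Farjeon sentence shown above is labeled 0, and the Shakespeare sentence visible in the output is labeled 1. Keep in mind that the samples are drawn at random from the full sentence lists, so most of them were also seen during training. Woohoo!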