<a href="https://colab.research.google.com/github/brandonowens24/Pre-Trained_Transformers/blob/main/Pre_Trained_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install datasets
! pip install transformers==4.17



## Task 2.1: Dataset

In [None]:
from datasets import load_dataset
from tqdm import tqdm

# Grab Dataset from Huggingface
dataset = load_dataset("sms_spam")

## Task 2.2: Fine-Tuning Pre-Trained Models

In [None]:
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
from sklearn.metrics import f1_score

In [None]:
def tokenize_function(data):
    return tokenizer(data["sms"], padding="max_length", truncation=True, max_length=128)

def compute_metrics(pred):
    labels = pred.label_ids
    predictions = pred.predictions.argmax(axis=1)
    return {"F1:": f1_score(labels, predictions, pos_label=1)}
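
The `padding="max_length", truncation=True, max_length=128` arguments in `tokenize_function` make every example come out exactly 128 token ids long. Here is a minimal sketch of that idea in plain Python. The helper `pad_or_truncate` and the example ids are hypothetical, and a real tokenizer also handles subwords and special tokens, which this sketch ignores. The pad id of 0 matches the `pad_token_id` shown in the model configs further down.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    # Truncate: keep only the first max_length ids.
    clipped = list(token_ids[:max_length])
    # The mask marks real tokens with 1 and padding with 0.
    mask = [1] * len(clipped)
    # Pad on the right until the sequence reaches max_length.
    n_pad = max_length - len(clipped)
    return clipped + [pad_id] * n_pad, mask + [0] * n_pad


# A short sequence gets padded, a long one gets cut off.
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Fixed-length inputs keep batching simple, at the cost of spending compute on padding for short messages.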

#### Model 1: BERT

In [None]:
# Load the bert-small tokenizer
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-small")

# Tokenize the SMS text with the bert-small tokenizer
bert_tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Partition
# NOTE: both selections start from the same seed-42 shuffle, so these 1,000 eval rows
# are also among the 4,000 training rows; select(range(4000, 5000)) would give a held-out set.
bert_train_dataset = bert_tokenized_dataset["train"].shuffle(seed=42).select(range(4000))
bert_eval_dataset = bert_tokenized_dataset["train"].shuffle(seed=42).select(range(1000))

# Grab Existing bert-small for sequence classification
bert_small_model = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-small")

# Establish training arguments; 2 epochs, chosen from prior convergence testing
training_args = TrainingArguments(output_dir="tmp", evaluation_strategy="epoch",
                                  num_train_epochs=2)
# Set up training object
trainer = Trainer(
    model=bert_small_model,
    args=training_args,
    train_dataset=bert_train_dataset,
    eval_dataset=bert_eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

# Save fine-tuned model
trainer.save_model("bert_model_trained")


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/prajjwal1/bert-small/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ac031779e2b4dd1d9da1e39c9d6a29fd45deea195eb3703a701d9c77f60abb4e.1257bb8f1f585038e86954d2560e36ca5c2dd98a8cde30fd22468940c911b672
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-small",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 8,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.17.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/prajjwal1/bert

| Epoch | Training Loss | Validation Loss | F1: |
|---|---|---|---|
| 1 | 0.0738 | 0.031553 | 0.970954 |
| 2 | 0.0218 | 0.008117 | 0.995951 |


Saving model checkpoint to tmp/checkpoint-500
Configuration saved in tmp/checkpoint-500/config.json
Model weights saved in tmp/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sms. If sms are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Saving model checkpoint to tmp/checkpoint-1000
Configuration saved in tmp/checkpoint-1000/config.json
Model weights saved in tmp/checkpoint-1000/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sms. If sms are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


Training completed. D
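
The partition step in the training cell takes the first 4,000 rows and the first 1,000 rows of the same seed-42 shuffle, so the evaluation rows are a subset of the training rows. Here is a minimal sketch of a split that keeps the two sets disjoint, using plain index lists. The helper `split_indices` is hypothetical, and 5,574 is the row count reported by the TF-IDF matrix later in the notebook. With the `datasets` API, the equivalent would presumably be selecting `range(4000, 5000)` from the shuffled split for evaluation.

```python
import random


def split_indices(n_rows, n_train, n_eval, seed=42):
    """Shuffle row indices once, then cut non-overlapping train/eval slices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    # Start the eval slice where the train slice ends.
    evaluation = indices[n_train:n_train + n_eval]
    return train, evaluation


train_idx, eval_idx = split_indices(5574, 4000, 1000)
overlap = set(train_idx) & set(eval_idx)
print(len(train_idx), len(eval_idx), len(overlap))
```

With a held-out evaluation slice, the reported F1 measures generalization rather than recall of rows already seen during fine-tuning.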

In [None]:
# Load in this saved model
bert = AutoModelForSequenceClassification.from_pretrained("bert_model_trained")


trainer = Trainer(
    model=bert,
    args=training_args,
    compute_metrics=compute_metrics
)

bert_results = trainer.evaluate(bert_eval_dataset)

print(bert_results['eval_F1:'])

loading configuration file bert_model_trained/config.json
Model config BertConfig {
  "_name_or_path": "bert_model_trained",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 8,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.17.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file bert_model_trained/pytorch_model.bin
All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassification were initialized from the mode

0.9959514170040485


#### Model 2: Electra

In [None]:
# Load the ELECTRA tokenizer
tokenizer = AutoTokenizer.from_pretrained("bhadresh-savani/electra-base-emotion")

# Tokenize the SMS text with the ELECTRA tokenizer
electra_tokenized = dataset.map(tokenize_function, batched=True)

# Partition
# NOTE: both selections start from the same seed-42 shuffle, so these 1,000 eval rows
# are also among the 4,000 training rows; select(range(4000, 5000)) would give a held-out set.
electra_train_dataset = electra_tokenized["train"].shuffle(seed=42).select(range(4000))
electra_eval_dataset = electra_tokenized["train"].shuffle(seed=42).select(range(1000))

# Grab Existing electra for sequence classification
electra_model = AutoModelForSequenceClassification.from_pretrained("bhadresh-savani/electra-base-emotion")

# Training arguments already established previously
# Set up training object
trainer = Trainer(
    model=electra_model,
    args=training_args,
    train_dataset=electra_train_dataset,
    eval_dataset=electra_eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

# Save fine-tuned model
trainer.save_model("electra_model_train")


loading file https://huggingface.co/bhadresh-savani/electra-base-emotion/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/f0056783be98abb0d6b20e5b346b5bb62031eafef77f812bb21191be71a90da3.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
loading file https://huggingface.co/bhadresh-savani/electra-base-emotion/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/deee9457c375bd31a25f6cf0ad7ede249f4f539eec7bc38c85e32824d57b5e31.dfddd0c8c70880badf1fde8c5ead6bcad9f80371ef0c53356e31719db70bdaa9
loading file https://huggingface.co/bhadresh-savani/electra-base-emotion/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bhadresh-savani/electra-base-emotion/resolve/main/special_tokens_map.json from cache at /root/.cache/huggingface/transformers/baaa5869753c78bf43d6cb67dfd7b79dfb95aa4b0c0179dbc7dcf87cb635fc3f.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
loading file https://hug

| Epoch | Training Loss | Validation Loss | F1: |
|---|---|---|---|
| 1 | 0.1091 | 0.03827 | 0.966942 |
| 2 | 0.031 | 0.010862 | 0.99187 |


Saving model checkpoint to tmp/checkpoint-500
Configuration saved in tmp/checkpoint-500/config.json
Model weights saved in tmp/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: sms. If sms are not expected by `ElectraForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Saving model checkpoint to tmp/checkpoint-1000
Configuration saved in tmp/checkpoint-1000/config.json
Model weights saved in tmp/checkpoint-1000/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: sms. If sms are not expected by `ElectraForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


Training 

In [None]:
# Load in this saved model
electra = AutoModelForSequenceClassification.from_pretrained("electra_model_train")

trainer = Trainer(
    model=electra,
    args=training_args,
    compute_metrics=compute_metrics
)

electra_results = trainer.evaluate(electra_eval_dataset)

print(electra_results['eval_F1:'])

loading configuration file electra_model_train/config.json
Model config ElectraConfig {
  "_name_or_path": "electra_model_train",
  "architectures": [
    "ElectraForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "id2label": {
    "0": "sadness",
    "1": "joy",
    "2": "love",
    "3": "anger",
    "4": "fear",
    "5": "surprise"
  },
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "label2id": {
    "anger": 3,
    "fear": 4,
    "joy": 1,
    "love": 2,
    "sadness": 0,
    "surprise": 5
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 4,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_typ

0.991869918699187


## Task 2.3: Zero-Shot Classification


In [None]:
from transformers import pipeline
bart_classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
from transformers import pipeline
selectra_classifier = pipeline("zero-shot-classification",
                       model="Recognai/zeroshot_selectra_medium")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


#### Prompting

In [None]:
sequence1 = "Is this message spam or ham (non-spam)? FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end."
nsequence1 = "Is this message spam or ham(non-spam)? Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!"

sequence2 = "Classify the following text message into spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end."
nsequence2 = "Classify the following text message into spam or ham (non-spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!"

sequence3 = "Please classify the following message as either spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end."
nsequence3 = "Please classify the following message as either spam or ham (non-spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!"

sequence4 = "Spam is automated and unnecessary, ham(the opposite) is from a real person. Please classify the following text message as being spam or ham(not spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end."
nsequence4 = "Spam is automated and unnecessary, ham(the opposite) is from a real person. Please classify the following text message as being spam or ham(not spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!"

candidate_labels = ['spam', 'ham(non-spam)']


##### Results


In [None]:
print("Sequence 1:")
print(bart_classifier(sequence1, candidate_labels))
print(bart_classifier(nsequence1, candidate_labels))
print(selectra_classifier(sequence1, candidate_labels))
print(selectra_classifier(nsequence1, candidate_labels))

Sequence 1:
{'sequence': "Is this message spam or ham (non-spam)? FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.5264318585395813, 0.4735681116580963]}
{'sequence': 'Is this message spam or ham(non-spam)? Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!', 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.6561935544013977, 0.3438064754009247]}
{'sequence': "Is this message spam or ham (non-spam)? FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.7874264717102051, 0.2125735580921173]}
{'sequence': 'Is this message spam or ham(non-spam)? Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!', 'labels': ['ham(non-spam)', 'spam

In [None]:
print("Sequence 2:")
print(bart_classifier(sequence2, candidate_labels))
print(bart_classifier(nsequence2, candidate_labels))
print(selectra_classifier(sequence2, candidate_labels))
print(selectra_classifier(nsequence2, candidate_labels))

Sequence 2:
{'sequence': "Classify the following text message into spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['spam', 'ham(non-spam)'], 'scores': [0.7364700436592102, 0.2635299265384674]}
{'sequence': 'Classify the following text message into spam or ham (non-spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!', 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.6477053761482239, 0.3522946536540985]}
{'sequence': "Classify the following text message into spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.9079785943031311, 0.09202135354280472]}
{'sequence': 'Classify the following text message into spam or ham (non-spam): Good

In [None]:
print("Sequence 3:")
print(bart_classifier(sequence3, candidate_labels))
print(bart_classifier(nsequence3, candidate_labels))
print(selectra_classifier(sequence3, candidate_labels))
print(selectra_classifier(nsequence3, candidate_labels))

Sequence 3:
{'sequence': "Please classify the following message as either spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['spam', 'ham(non-spam)'], 'scores': [0.8841157555580139, 0.11588427424430847]}
{'sequence': 'Please classify the following message as either spam or ham (non-spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!', 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.5612019300460815, 0.43879806995391846]}
{'sequence': "Please classify the following message as either spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.7929933667182922, 0.20700660347938538]}
{'sequence': 'Please classify the following message as eithe

In [None]:
print("Sequence 4:")
print(bart_classifier(sequence4, candidate_labels))
print(bart_classifier(nsequence4, candidate_labels))
print(selectra_classifier(sequence4, candidate_labels))
print(selectra_classifier(nsequence4, candidate_labels))

Sequence 4:
{'sequence': "Spam is automated and unnecessary, ham(the opposite) is from a real person. Please classify the following text message as being spam or ham(not spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['spam', 'ham(non-spam)'], 'scores': [0.65892094373703, 0.34107905626296997]}
{'sequence': 'Spam is automated and unnecessary, ham(the opposite) is from a real person. Please classify the following text message as being spam or ham(not spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!', 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.8419137597084045, 0.15808622539043427]}
{'sequence': "Spam is automated and unnecessary, ham(the opposite) is from a real person. Please classify the following text message as being spam or ham(not spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live l
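
Each result above is a dict whose `labels` and `scores` lists line up position by position, and in these outputs the labels appear sorted by descending score. Here is a small sketch for pulling out the winning label so the prompts can be compared at a glance. The helper `top_label` is hypothetical. The scores are copied from the BART outputs above for the spam example under prompts 1 and 2.

```python
def top_label(result):
    """Return (label, score) for the highest-scoring candidate label."""
    scores = result["scores"]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return result["labels"][best], scores[best]


# BART on the spam example, prompt 1 and prompt 2 (values copied from the outputs above).
bart_prompt1 = {"labels": ["ham(non-spam)", "spam"],
                "scores": [0.5264318585395813, 0.4735681116580963]}
bart_prompt2 = {"labels": ["spam", "ham(non-spam)"],
                "scores": [0.7364700436592102, 0.2635299265384674]}

for name, result in [("prompt 1", bart_prompt1), ("prompt 2", bart_prompt2)]:
    label, score = top_label(result)
    print(f"{name}: {label} ({score:.3f})")
```

On the same spam message, BART leans toward ham under prompt 1 and toward spam under prompt 2, which shows how sensitive the zero-shot prediction is to the wording of the prompt.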

##### Compare Bart and Selectra to Previous Models

In [None]:
def predict_spam(data):
    # Columns that will hold each zero-shot model's predicted label
    data["Bart"] = None
    data["Selectra"] = None

    for index, row in tqdm(data.iterrows(), total=len(data)):
        # Instruction text from prompt 4, followed by the message
        # (named `prompt` to avoid shadowing the built-in `input`)
        prompt = "Spam is automated and unnecessary, ham(the opposite) is from a real person. Please classify the following text message as being spam or ham(not spam):" + row["sms"]

        # Index 0 of 'labels' is the top-scoring candidate
        bart_pred = bart_classifier(prompt, candidate_labels)['labels'][0]
        selectra_pred = selectra_classifier(prompt, candidate_labels)['labels'][0]

        data.at[index, "Bart"] = bart_pred
        data.at[index, "Selectra"] = selectra_pred

    return data


In [None]:
datatest = dataset["train"].shuffle(seed=42).select(range(500))

datatest = datatest.to_pandas()

In [None]:
df_zero_shot = predict_spam(datatest)

100%|██████████| 500/500 [27:21<00:00,  3.28s/it]


In [None]:
df_zero_shot

| | sms | label | Bart | Selectra |
|---|---|---|---|---|
| 0 | sports fans - get the latest sports news str* ... | 1 | spam | spam |
| 1 | It's justbeen overa week since we broke up and... | 0 | ham(non-spam) | ham(non-spam) |
| 2 | Not directly behind... Abt 4 rows behind ü...\n | 0 | spam | spam |
| 3 | Haha, my legs and neck are killing me and my a... | 0 | ham(non-spam) | ham(non-spam) |
| 4 | Me too baby! I promise to treat you well! I be... | 0 | ham(non-spam) | ham(non-spam) |
| ... | ... | ... | ... | ... |
| 495 | Hows the champ just leaving glasgow!\n | 0 | ham(non-spam) | ham(non-spam) |
| 496 | That would be great. We'll be at the Guild. Co... | 0 | ham(non-spam) | ham(non-spam) |
| 497 | Hey are you angry with me. Reply me dr.\n | 0 | spam | ham(non-spam) |
| 498 | am up to my eyes in philosophy\n | 0 | spam | ham(non-spam) |


In [None]:
df_zero_shot.to_csv('zero_shot_results.csv')


In [None]:
import pandas as pd
df_zero_shot = pd.read_csv("zero_shot_results.csv")

In [None]:
df_zero_shot["Bart"] = df_zero_shot["Bart"].map({"spam": 1, "ham(non-spam)": 0})
df_zero_shot["Selectra"] = df_zero_shot["Selectra"].map({"spam": 1, "ham(non-spam)": 0})

In [None]:
print("Bart F1:", f1_score(df_zero_shot["label"], df_zero_shot["Bart"], pos_label=1))
print("Selectra F1:", f1_score(df_zero_shot["label"], df_zero_shot["Selectra"], pos_label=1))

Bart F1: 0.2877697841726619
Selectra F1: 0.1509433962264151


## Baselines

#### BOW Baseline

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
vectorizer = TfidfVectorizer(input='content', stop_words = 'english')
vectors = vectorizer.fit_transform(dataset["train"]["sms"])
labels = dataset["train"]["label"]
vectors

<5574x8444 sparse matrix of type '<class 'numpy.float64'>'
	with 43577 stored elements in Compressed Sparse Row format>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(vectors, labels, test_size = 0.2)

In [None]:
# LogisticRegression accepts the sparse TF-IDF matrix directly
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)
f1 = f1_score(y_test, y_pred, pos_label=1)
print(f1)

0.8164794007490637


#### Random Class Baseline

In theory, a random-class model making binary predictions picks each class with 50% probability. Since the models above are compared on F1 score, the baseline should be expressed as an F1 score as well:

**F1 Score = (2 × Precision × Recall) / (Precision + Recall)**
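
To make the formula concrete, here is a minimal sketch that computes F1 by hand from true and predicted labels. The helper `f1_from_labels` and the toy labels are made up for illustration. On real data, `f1_score` from scikit-learn should give the same number.

```python
def f1_from_labels(y_true, y_pred, pos_label=1):
    """Compute F1 for the positive class from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 3 spam messages, 2 caught, plus 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 4))
```

Swapping `y_true` and `y_pred` only exchanges the false positives with the false negatives, so the F1 value stays the same, unlike precision or recall taken individually.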

Let's run a simulation in which both the actual labels and the predictions are drawn uniformly at random...

In [None]:
import random

generated_actuals = []
generated_preds = []

for i in range(10000):
  generated_actuals.append(random.randint(0,1))
  generated_preds.append(random.randint(0,1))

print(f1_score(generated_actuals, generated_preds, pos_label=1))

0.4971830985915493


So the F1 score for our Random Class Baseline is roughly **50%**, given that the simulated labels are balanced between the two classes.

#### Target Class Baseline


This baseline reuses `generated_actuals` from above. Our target class is 1, i.e. the actual spam messages.

In [None]:
generated_preds = [1] * 10000

print(f1_score(generated_actuals, generated_preds, pos_label=1))

0.6670221274326846


Our F1 score if every message is predicted as spam is roughly **67%** for our Target Class Baseline
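
Both simulated baselines can also be approximated in closed form from the share of positive labels (the prevalence). For large samples, a coin-flip classifier has precision of about the prevalence and recall of about 0.5, while an always-spam classifier has precision equal to the prevalence and recall of 1. The sketch below plugs those values into the F1 formula. The function names are hypothetical, and the 0.134 prevalence is an assumption based on the commonly reported class balance of the SMS Spam Collection, so it should be checked against `dataset["train"]["label"]`.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def random_guess_f1(prevalence):
    # Coin-flip predictions: precision is about the prevalence, recall about 0.5.
    return f1(prevalence, 0.5)


def always_positive_f1(prevalence):
    # Everything predicted positive: precision equals the prevalence, recall is 1.
    return f1(prevalence, 1.0)


# 0.5 mirrors the balanced simulation; 0.134 is an assumed spam share for the real data.
for p in (0.5, 0.134):
    print(p, round(random_guess_f1(p), 3), round(always_positive_f1(p), 3))
```

With balanced labels this reproduces the simulated values of roughly 0.50 and 0.67. With a much smaller spam share, both baselines would come out considerably lower, so the comparison with the fine-tuned and zero-shot models depends on which label distribution the baseline assumes.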