<a href="https://colab.research.google.com/github/brandonowens24/Pre-Trained_Transformers/blob/main/Pre_Trained_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! pip install datasets
! pip install transformers

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2

## Task 2.1: Dataset

In [2]:
from datasets import load_dataset
from tqdm import tqdm

# Grab Dataset from Huggingface
dataset = load_dataset("sms_spam")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/359k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5574 [00:00<?, ? examples/s]

## Task 2.2 Fine-Tuning Pre-Trained Models

In [25]:
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
from sklearn.metrics import f1_score

In [None]:
def tokenize_function(data):
    return tokenizer(data["sms"], padding="max_length", truncation=True, max_length=128)

def compute_metrics(pred):
    labels = pred.label_ids
    predictions = pred.predictions.argmax(axis=1)
    return {"F1:": f1_score(labels, predictions, pos_label=1)}

#### Model 1: BERT

In [None]:
# Load in bert-small tokenized
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-small")

# Tokenize Existing bert-small with my inputted text
bert_tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Partition
bert_train_dataset = bert_tokenized_dataset["train"].shuffle(seed=42).select(range(4000))
bert_eval_dataset = bert_tokenized_dataset["train"].shuffle(seed=42).select(range(1000))

# Grab Existing bert-small for sequence classification
bert_small_model = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-small")

# Establish Training Arguments, Epochs = 3 from prior convergence testing
training_args = TrainingArguments(output_dir="tmp", evaluation_strategy="epoch",
                                  num_train_epochs=2)
# Set up training object
trainer = Trainer(
    model=bert_small_model,
    args=training_args,
    train_dataset=bert_train_dataset,
    eval_dataset=bert_eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

# Save fine-tuned model
trainer.save_model("bert_model_trained")


In [None]:
# Load in this saved model
bert = AutoModelForSequenceClassification.from_pretrained("bert_model_trained")


trainer= Trainer(
    model=bert,
    args=training_args,
    compute_metrics=compute_metrics
)

bert_results = trainer.evaluate(bert_eval_dataset)

print(bert_results['eval_F1:'])

#### Model 2: Electra

In [None]:
# Load in electra tokenized
tokenizer = AutoTokenizer.from_pretrained("bhadresh-savani/electra-base-emotion")

# Tokenize Existing electra model with my inputted text
electra_tokenized = dataset.map(tokenize_function, batched=True)

# Partition
electra_train_dataset = electra_tokenized["train"].shuffle(seed=42).select(range(4000))
electra_eval_dataset = electra_tokenized["train"].shuffle(seed=42).select(range(1000))

# Grab Existing electra for sequence classification
electra_model = AutoModelForSequenceClassification.from_pretrained("bhadresh-savani/electra-base-emotion")

# Training arguments already established previously
# Set up training object
trainer = Trainer(
    model=electra_model,
    args=training_args,
    train_dataset=electra_train_dataset,
    eval_dataset=electra_eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

# Save fine-tuned model
trainer.save_model("electra_model_train")


loading file https://huggingface.co/bhadresh-savani/electra-base-emotion/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/f0056783be98abb0d6b20e5b346b5bb62031eafef77f812bb21191be71a90da3.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
loading file https://huggingface.co/bhadresh-savani/electra-base-emotion/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/deee9457c375bd31a25f6cf0ad7ede249f4f539eec7bc38c85e32824d57b5e31.dfddd0c8c70880badf1fde8c5ead6bcad9f80371ef0c53356e31719db70bdaa9
loading file https://huggingface.co/bhadresh-savani/electra-base-emotion/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bhadresh-savani/electra-base-emotion/resolve/main/special_tokens_map.json from cache at /root/.cache/huggingface/transformers/baaa5869753c78bf43d6cb67dfd7b79dfb95aa4b0c0179dbc7dcf87cb635fc3f.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
loading file https://hug

Epoch,Training Loss,Validation Loss,F1:
1,0.1091,0.03827,0.966942
2,0.031,0.010862,0.99187


Saving model checkpoint to tmp/checkpoint-500
Configuration saved in tmp/checkpoint-500/config.json
Model weights saved in tmp/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: sms. If sms are not expected by `ElectraForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Saving model checkpoint to tmp/checkpoint-1000
Configuration saved in tmp/checkpoint-1000/config.json
Model weights saved in tmp/checkpoint-1000/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: sms. If sms are not expected by `ElectraForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


Training 

In [None]:
# Load in this saved model
electra = AutoModelForSequenceClassification.from_pretrained("electra_model_train")

trainer= Trainer(
    model=electra,
    args=training_args,
    compute_metrics=compute_metrics
)

electra_results = trainer.evaluate(electra_eval_dataset)

print(electra_results['eval_F1:'])

loading configuration file electra_model_train/config.json
Model config ElectraConfig {
  "_name_or_path": "electra_model_train",
  "architectures": [
    "ElectraForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "id2label": {
    "0": "sadness",
    "1": "joy",
    "2": "love",
    "3": "anger",
    "4": "fear",
    "5": "surprise"
  },
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "label2id": {
    "anger": 3,
    "fear": 4,
    "joy": 1,
    "love": 2,
    "sadness": 0,
    "surprise": 5
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 4,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_typ

0.991869918699187


## Task 2.3: Zero-Shot Classification


In [3]:
from transformers import pipeline

In [7]:
bart_classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [8]:
deberta_classifier = pipeline("zero-shot-classification",model="sileod/deberta-v3-base-tasksource-nli")

#### Prompting

In [10]:
sequence1 = "Is this message spam or ham (non-spam)? FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end."
nsequence1 = "Is this message spam or ham(non-spam)? Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!"

sequence2 = "Classify the following text message into spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end."
nsequence2 = "Classify the following text message into spam or ham (non-spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!"

sequence3 = "Please classify the following message as either spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end."
nsequence3 = "Please classify the following message as either spam or ham (non-spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!"

sequence4 = "Spam is automated and unnecessary, ham(the opposite) is from a real person. Please classify the following text message as being spam or ham(not spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end."
nsequence4 = "Spam is automated and unnecessary, ham(the opposite) is from a real person. Please classify the following text message as being spam or ham(not spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!"

candidate_labels = ['spam', 'ham(non-spam)']


##### Results


In [11]:
print("Sequence 1:")
print(bart_classifier(sequence1, candidate_labels))
print(bart_classifier(nsequence1, candidate_labels))
print(deberta_classifier(sequence1, candidate_labels))
print(deberta_classifier(nsequence1, candidate_labels))

Sequence 1:
{'sequence': "Is this message spam or ham (non-spam)? FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.5264315605163574, 0.4735684096813202]}


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'sequence': 'Is this message spam or ham(non-spam)? Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!', 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.6561940908432007, 0.3438059389591217]}
{'sequence': "Is this message spam or ham (non-spam)? FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['spam', 'ham(non-spam)'], 'scores': [0.6742880344390869, 0.32571202516555786]}
{'sequence': 'Is this message spam or ham(non-spam)? Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!', 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.732473611831665, 0.2675263583660126]}


In [13]:
print("Sequence 2:")
print(bart_classifier(sequence2, candidate_labels))
print(bart_classifier(nsequence2, candidate_labels))
print(deberta_classifier(sequence2, candidate_labels))
print(deberta_classifier(nsequence2, candidate_labels))

Sequence 2:
{'sequence': "Classify the following text message into spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['spam', 'ham(non-spam)'], 'scores': [0.7364699244499207, 0.26353007555007935]}
{'sequence': 'Classify the following text message into spam or ham (non-spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!', 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.6477056741714478, 0.35229429602622986]}
{'sequence': "Classify the following text message into spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.7577852010726929, 0.24221481382846832]}
{'sequence': 'Classify the following text message into spam or ham (non-spam): Go

In [14]:
print("Sequence 3:")
print(bart_classifier(sequence3, candidate_labels))
print(bart_classifier(nsequence3, candidate_labels))
print(deberta_classifier(sequence3, candidate_labels))
print(deberta_classifier(nsequence3, candidate_labels))

Sequence 3:
{'sequence': "Please classify the following message as either spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['spam', 'ham(non-spam)'], 'scores': [0.8841160535812378, 0.11588394641876221]}
{'sequence': 'Please classify the following message as either spam or ham (non-spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!', 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.5612017512321472, 0.438798189163208]}
{'sequence': "Please classify the following message as either spam or ham (non-spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.5861629843711853, 0.4138370454311371]}
{'sequence': 'Please classify the following message as either s

In [15]:
print("Sequence 4:")
print(bart_classifier(sequence4, candidate_labels))
print(bart_classifier(nsequence4, candidate_labels))
print(deberta_classifier(sequence4, candidate_labels))
print(deberta_classifier(nsequence4, candidate_labels))

Sequence 4:
{'sequence': "Spam is automated and unnecessary, ham(the opposite) is from a real person. Please classify the following text message as being spam or ham(not spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end.", 'labels': ['spam', 'ham(non-spam)'], 'scores': [0.6589221954345703, 0.3410778343677521]}
{'sequence': 'Spam is automated and unnecessary, ham(the opposite) is from a real person. Please classify the following text message as being spam or ham(not spam): Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!', 'labels': ['ham(non-spam)', 'spam'], 'scores': [0.8419134020805359, 0.1580866128206253]}
{'sequence': "Spam is automated and unnecessary, ham(the opposite) is from a real person. Please classify the following text message as being spam or ham(not spam): FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live l

##### Compare Bart and Selectra to Previous Models

In [16]:
def predict_spam(data):

  data["Bart"] = None
  data["Selectra"] = None

  for index, row in tqdm(data.iterrows(), total = len(data)):

      bart_input = "Classify the following text message into spam or ham (non-spam):" + row["sms"]
      deberta_input = "Is this message spam or ham (non-spam)?" + row["sms"]

      bart_pred = bart_classifier(bart_input, candidate_labels)['labels'][0]
      deberta_pred = deberta_classifier(deberta_input, candidate_labels)['labels'][0]

      data.at[index, "Bart"] = bart_pred
      data.at[index, "Deberta"] = deberta_pred

  return data


In [17]:
datatest = dataset["train"].shuffle(seed=42).select(range(500))

datatest = datatest.to_pandas()

In [18]:
df_zero_shot = predict_spam(datatest)

100%|██████████| 500/500 [27:00<00:00,  3.24s/it]


In [19]:
df_zero_shot

Unnamed: 0,sms,label,Bart,Selectra,Deberta
0,sports fans - get the latest sports news str* ...,1,spam,,ham(non-spam)
1,It's justbeen overa week since we broke up and...,0,spam,,ham(non-spam)
2,Not directly behind... Abt 4 rows behind ü...\n,0,spam,,ham(non-spam)
3,"Haha, my legs and neck are killing me and my a...",0,spam,,ham(non-spam)
4,Me too baby! I promise to treat you well! I be...,0,spam,,ham(non-spam)
...,...,...,...,...,...
495,Hows the champ just leaving glasgow!\n,0,spam,,ham(non-spam)
496,That would be great. We'll be at the Guild. Co...,0,spam,,ham(non-spam)
497,Hey are you angry with me. Reply me dr.\n,0,spam,,ham(non-spam)
498,am up to my eyes in philosophy\n,0,spam,,spam


In [21]:
df_zero_shot.to_csv('zero_shot_results.csv')

In [22]:
import pandas as pd
df_zero_shot = pd.read_csv("zero_shot_results.csv")

In [23]:
df_zero_shot["Bart"] = df_zero_shot["Bart"].map({"spam": 1, "ham(non-spam)": 0})
df_zero_shot["Deberta"] = df_zero_shot["Deberta"].map({"spam": 1, "ham(non-spam)": 0})

In [26]:
print("Bart Recall:", f1_score(df_zero_shot["Bart"], df_zero_shot["label"]))
print("Deberta Recall:", f1_score(df_zero_shot["Deberta"], df_zero_shot["label"], pos_label=1))

Bart Recall: 0.23885918003565065
Deberta Recall: 0.31578947368421056


## Baselines

#### BOW Baseline

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
vectorizer = TfidfVectorizer(input='content', stop_words = 'english')
vectors = vectorizer.fit_transform(dataset["train"]["sms"])
labels = dataset["train"]["label"]
vectors

<5574x8444 sparse matrix of type '<class 'numpy.float64'>'
	with 43577 stored elements in Compressed Sparse Row format>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(vectors, labels, test_size = 0.2)

In [None]:
train_vectors = X_train.toarray()
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)
f1 = f1_score(y_test, y_pred, pos_label=1)
print(f1)

0.8164794007490637


#### Random Class Baseline

Theoretically, for a random class model with a binary prediction, each class has a 50% chance of being selected. Since we are interested in F1-scores...

**F1 Score = (2 X Precision X Recall)/ (Precision + Recall)**

Let's run a simulation...

In [None]:
import random

generated_actuals = []
generated_preds = []

for i in range(10000):
  generated_actuals.append(random.randint(0,1))
  generated_preds.append(random.randint(0,1))

print(f1_score(generated_actuals, generated_preds, pos_label=1))

0.4971830985915493


So roughly **50%** is the F1 score for our Random Class Baseline

#### Target Class Baseline


Going to use the genrated_actuals from above. Our target class is 1, detecting the actual spam messages.

In [None]:
generated_preds = [1] * 10000

print(f1_score(generated_actuals, generated_preds, pos_label=1))

0.6670221274326846


Our F1 score if every message is spam is roughly **66%** for our Target Class Baseline