# Text Classification with Transformers

### Objective:
1. Familiarize yourself with Hugging Face datasets and models
2. Learn to perform binary and multi-class classification

### Problem Statement:
Authorship profiling is the task of predicting characteristics of an author, such as demographics, from their writing. The hypothesis is that, in certain cases, an author's profile (for example, gender) is reflected in their style of writing.

**Task1 - Gender Prediction (Binary Classification)**: Does an author's gender influence their written text?

**Task2 - Age Group Prediction (Multi Class Classification)**: Does an author's age group influence their written text?

**Dataset**:  [Blog Authorship Corpus](https://huggingface.co/datasets/blog_authorship_corpus)

Change the runtime to GPU if possible.

In [None]:
!pip install 'transformers[torch]'
!pip install evaluate


In [None]:
import os

import evaluate
import numpy as np
import pandas as pd
import torch
from datasets import load_dataset, Dataset, DatasetDict
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)

## Loading Data


In [None]:
raw_dataset = (load_dataset('blog_authorship_corpus', split='train', trust_remote_code=True)
        .train_test_split(train_size=1000, test_size=100))


In [None]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'date', 'gender', 'age', 'horoscope', 'job'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['text', 'date', 'gender', 'age', 'horoscope', 'job'],
        num_rows: 100
    })
})

In [None]:
raw_dataset["train"][0]

{'text': 'urlLink JibJab  made a hilarious cartoon called "This Land" making fun of both Bush and Kerry.',
 'date': '02,August,2004',
 'gender': 'male',
 'age': 23,
 'horoscope': 'Sagittarius',
 'job': 'Student'}

## Task1.1 - Gender Prediction (Binary Classification)
Can you predict the gender of a person from a piece of written text?


In [None]:
labels = ["female", "male"]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}

In [None]:
label2id

{'female': 0, 'male': 1}

In [None]:
id2label

{0: 'female', 1: 'male'}

In [None]:
def tokenize_function(batch, tokenizer, label2id):
    # Truncate to the model's maximum length; padding is left to the data collator,
    # which pads each training batch dynamically.
    tokenized_batch = tokenizer(batch["text"], max_length=tokenizer.model_max_length, truncation=True)
    # Map each gender string to its integer class id.
    tokenized_batch["labels"] = [label2id[label] for label in batch["gender"]]
    return tokenized_batch

Models come in cased and uncased versions. Uncased models lowercase the text and strip accents before tokenization, whereas cased models retain this information. Cased models may be better suited to tasks such as NER and POS tagging, where such information is important.
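To make the difference concrete, here is a minimal standard-library sketch of the kind of normalization an uncased model applies (lowercasing, then stripping accents). The function name `uncased_normalize` is hypothetical and this is only an approximation of the idea, not the tokenizer's actual implementation.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Lowercase the text and drop accent marks (illustrative only)."""
    lowered = text.lower()
    # NFD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("Café in Zürich"))  # cafe in zurich
```

After this step, "Apple" and "apple" become indistinguishable, which is why a cased model can be the better choice when capitalization carries meaning.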

In [None]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)




In [None]:
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True, fn_kwargs={"tokenizer": tokenizer, "label2id":label2id},num_proc=4, remove_columns=raw_dataset['train'].column_names) #use fn_kwargs to pass any arguments to the tokenizing function


In [None]:
tokenized_dataset.set_format(type="torch")

In [None]:
tokenized_dataset["train"][0]

[Data Collators](https://huggingface.co/docs/transformers/main_classes/data_collator) assemble individual examples into batches. Depending on the collator, they take care of padding (e.g. *DataCollatorWithPadding*), dynamic masking (e.g. *DataCollatorForLanguageModeling*) or special token requirements (e.g. *DataCollatorForTokenClassification*).
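As an illustration of what padding at collation time means, the following plain-Python sketch pads a batch of token id lists to the longest sequence in that batch and builds the matching attention mask. The helper `pad_batch` and the token ids are made up for the example; the real collator works on tokenizer outputs and returns tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in the batch (illustrative only)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"][0])       # [101, 7592, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding per batch rather than to a fixed global length keeps short batches short, which usually saves computation.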

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding = True) #pads to the max sequence length in a batch

In [None]:
accuracy = evaluate.load("accuracy")
#https://huggingface.co/docs/evaluate/choosing_a_metric


In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, id2label=id2label, label2id=label2id) #problem_type="multi_label_classification"


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


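### Weights & Biases (W&B)

To turn off W&B logging, pass `report_to="none"` in the TrainingArguments. Note that this is the string `"none"`; Python's `None` can fall back to reporting to all installed integrations.


https://discuss.huggingface.co/t/how-to-turn-wandb-off-in-trainer/6237

In [None]:
training_args = TrainingArguments(
    report_to="none",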
    output_dir=model_name + "_blog_authorship_gender",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.665821 | 0.65 |
| 2 | 0.541200 | 0.76992 | 0.65 |


TrainOutput(global_step=500, training_loss=0.5411568603515625, metrics={'train_runtime': 275.3468, 'train_samples_per_second': 7.264, 'train_steps_per_second': 1.816, 'total_flos': 526222110720000.0, 'train_loss': 0.5411568603515625, 'epoch': 2.0})
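
The `global_step=500` in the output above follows from the run's settings: 1000 training examples at a per-device batch size of 4 give 250 optimizer steps per epoch, and 2 epochs give 500. It also explains the `No log` entry: assuming the default `logging_steps` of 500, the first training loss is only logged at step 500, after the epoch-1 evaluation at step 250. A small sketch of the arithmetic, assuming a single device and no gradient accumulation (the helper name is hypothetical):

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


steps = total_optimizer_steps(num_examples=1000, batch_size=4, num_epochs=2)
print(steps)  # 500
```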

## Task1.2 - Gender Prediction (Binary Classification)
Choose a different encoder model and compare results.
Estimated time: 20 mins


In [None]:
# Todo
# Define model and training parameters
# Perform training
# Peer Learning: Which model did you choose and how did it perform?

In [None]:
model_name = "albert/albert-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Re-tokenize with the new model's tokenizer; use fn_kwargs to pass any arguments to the tokenizing function
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True,
                                    fn_kwargs={"tokenizer": tokenizer, "label2id":label2id},
                                    num_proc=4, remove_columns=raw_dataset['train'].column_names)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding = True) #pads to the max sequence length in a batch
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, id2label=id2label, label2id=label2id)
# Note: training_args is reused from Task 1.1, so checkpoints are written to the same output_dir
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()


Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.75096 | 0.51 |
| 2 | 0.697500 | 0.697082 | 0.54 |


TrainOutput(global_step=500, training_loss=0.697461181640625, metrics={'train_runtime': 218.6065, 'train_samples_per_second': 9.149, 'train_steps_per_second': 2.287, 'total_flos': 47796203520000.0, 'train_loss': 0.697461181640625, 'epoch': 2.0})

## Task2 - Multi Class Classification:
Can you predict the age group of a person from a piece of written text?

In this dataset, the age column is numeric. We convert it into a set of age groups: "<20", "20-30" and "30+".

For ease of processing, we first convert the datasets to pandas DataFrames and then apply the processing needed to create the groups.

In [None]:
raw_df_train = raw_dataset["train"].to_pandas()
raw_df_test = raw_dataset["test"].to_pandas()

In [None]:
raw_df_train.head()

|   | text | date | gender | age | horoscope | job |
|---|---|---|---|---|---|---|
| 0 | Â | 12,July,2004 | female | 15 | Scorpio | Student |
| 1 | urlLink Guys..I forgot to add Juicy Fruit a... | 17,June,2004 | female | 23 | Aquarius | indUnk |
| 2 | Oh little boy, how you make me laugh. | 16,May,2004 | female | 16 | Pisces | Student |
| 3 | urlLink postCount('106861767733727212'); \|... | 07,octubre,2003 | female | 24 | Virgo | indUnk |
| 4 | oh no.... my BIG toe... how?? today was su... | 13,August,2004 | male | 17 | Gemini | Student |


In [None]:
raw_df_train.age.value_counts().sort_index()

13     36
14     84
15    136
16    201
17    239
23    217
24    242
25    185
26    173
27    131
33     46
34     60
35     45
36     37
37     33
38     20
39     10
40     20
41     13
42      4
43     14
44      8
45     23
46      3
47      3
48     17
Name: age, dtype: int64

In [None]:
bins = [0, 20, 30, np.inf]
age_labels = ['<20', '20-30', '30+']
raw_df_train['AgeRange'] = pd.cut(raw_df_train['age'], bins, labels=age_labels)
raw_df_test['AgeRange'] = pd.cut(raw_df_test['age'], bins, labels=age_labels)
raw_df_train.head()

|   | text | date | gender | age | horoscope | job | AgeRange |
|---|---|---|---|---|---|---|---|
| 0 | Â | 12,July,2004 | female | 15 | Scorpio | Student | <20 |
| 1 | urlLink Guys..I forgot to add Juicy Fruit a... | 17,June,2004 | female | 23 | Aquarius | indUnk | 20-30 |
| 2 | Oh little boy, how you make me laugh. | 16,May,2004 | female | 16 | Pisces | Student | <20 |
| 3 | urlLink postCount('106861767733727212'); \|... | 07,octubre,2003 | female | 24 | Virgo | indUnk | 20-30 |
| 4 | oh no.... my BIG toe... how?? today was su... | 13,August,2004 | male | 17 | Gemini | Student | <20 |


In [None]:
raw_df_train.AgeRange.value_counts().sort_index()

<20      696
20-30    948
30+      356
Name: AgeRange, dtype: int64
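
One detail of the grouping above is worth knowing: `pd.cut` builds right-closed intervals by default (`right=True`), so the edges `[0, 20, 30, inf]` produce `(0, 20]`, `(20, 30]` and `(30, inf]`. An author aged exactly 20 would therefore be labelled `<20`. The ages listed above contain no 20 or 30, so the result is unaffected here, but passing `right=False` would match the label names more literally. The sketch below mimics both conventions with the standard library only; the helper `age_group` is hypothetical and is not how pandas implements it.

```python
from bisect import bisect_left, bisect_right

bins = [0, 20, 30, float("inf")]
age_labels = ["<20", "20-30", "30+"]


def age_group(age, right=True):
    """Return the age label using right-closed (default) or left-closed bins."""
    if right:
        # Intervals (0, 20], (20, 30], (30, inf]
        idx = bisect_left(bins, age) - 1
    else:
        # Intervals [0, 20), [20, 30), [30, inf)
        idx = bisect_right(bins, age) - 1
    return age_labels[idx]


print(age_group(20))               # <20
print(age_group(20, right=False))  # 20-30
```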

We convert the DataFrames back into a Hugging Face `DatasetDict` and create the `id2label` and `label2id` mappings.

In [None]:
train_dataset = Dataset.from_dict(raw_df_train)
test_dataset = Dataset.from_dict(raw_df_test)
new_dataset = DatasetDict({"train":train_dataset,"test":test_dataset})

In [None]:
new_dataset["train"][0]

{'text': 'Â',
 'date': '12,July,2004',
 'gender': 'female',
 'age': 15,
 'horoscope': 'Scorpio',
 'job': 'Student',
 'AgeRange': '<20'}

In [None]:
id2label = {idx:label for idx, label in enumerate(age_labels)}
label2id = {label:idx for idx, label in enumerate(age_labels)}

The tokenize function is modified to account for the new labels.

The labels arrive as strings in the form [<20, <20, 20-30, ..., 30+].

However, because the model is configured with `problem_type="multi_label_classification"`, which trains with a binary cross-entropy loss, the labels must be one-hot float vectors of the form [[1,0,0],[1,0,0],[0,1,0],...,[0,0,1]] for training and evaluation.

We therefore first create a labels matrix of zeros with dimensions (batch_size, number_of_labels). Then, for each example, we set the entry at row = batch_position and column = label_position to 1 for the label that is present.

Finally, we return the tokenized batch.
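
The label conversion described above can be sketched in plain Python as follows. The helper name `one_hot_labels` is hypothetical and lists stand in for the NumPy matrix.

```python
def one_hot_labels(batch_labels, label2id):
    """Build a (batch_size, number_of_labels) matrix with a single 1.0 per row."""
    matrix = [[0.0] * len(label2id) for _ in batch_labels]
    for batch_pos, label in enumerate(batch_labels):
        matrix[batch_pos][label2id[label]] = 1.0
    return matrix


label2id = {"<20": 0, "20-30": 1, "30+": 2}
print(one_hot_labels(["<20", "<20", "20-30", "30+"], label2id))
# [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```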

In [None]:
def tokenize_function(batch, tokenizer, label2id, np):
    """Tokenize a batch of texts and attach a one-hot label matrix built from the AgeRange column."""
    batch_size = len(batch["AgeRange"])
    # Truncate only; the data collator pads each training batch dynamically.
    tokenized_batch = tokenizer(batch["text"], max_length=tokenizer.model_max_length, truncation=True)
    # One row per example, one column per label (number of labels = len(label2id)).
    labels_matrix = np.zeros((batch_size, len(label2id)))
    for batch_pos, label in enumerate(batch["AgeRange"]):
        labels_matrix[batch_pos, label2id[label]] = 1
    tokenized_batch["labels"] = labels_matrix.tolist()
    return tokenized_batch

In [None]:
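# Task 1.2 switched to ALBERT; switch back to BERT, the model used for this task
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized_dataset = new_dataset.map(tokenize_function, batched=True,
                                    fn_kwargs={"tokenizer": tokenizer, "label2id":label2id, "np":np},
                                    num_proc=4, remove_columns=new_dataset['train'].column_names)
#use fn_kwargs to pass any arguments to the tokenizing function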


In [None]:
tokenized_dataset.set_format(type="torch")

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding = True) #pads to the max sequence length in a batch

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           problem_type="multi_label_classification",
                                                           num_labels=len(age_labels), #number of classes
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    #f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    #roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {#'f1': f1_micro_average,
               #'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions,
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result
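
With one-hot label rows, scikit-learn's `accuracy_score` computes subset accuracy: a sample counts as correct only if every entry of its predicted row matches the reference row. Because each label is thresholded independently, a prediction can contain no label or several, and both cases count as errors. A plain-Python sketch of the same idea (the helper name `subset_accuracy` is hypothetical, not part of scikit-learn):

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label vector matches exactly."""
    matches = sum(1 for true_row, pred_row in zip(y_true, y_pred)
                  if list(true_row) == list(pred_row))
    return matches / len(y_true)


y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [0, 0, 0], [0, 1, 1], [1, 0, 0]]  # row 2 is empty, row 3 has two labels
print(subset_accuracy(y_true, y_pred))  # 0.5
```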

In [None]:
training_args = TrainingArguments(
    output_dir=model_name+"_blog_authorship_age",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.4412 | 0.568686 | 0.535 |
| 2 | 0.3459 | 0.590835 | 0.61 |


TrainOutput(global_step=1000, training_loss=0.3935282745361328, metrics={'train_runtime': 441.696, 'train_samples_per_second': 9.056, 'train_steps_per_second': 2.264, 'total_flos': 1052453670912000.0, 'train_loss': 0.3935282745361328, 'epoch': 2.0})

In [None]:
text =  "Do you read comics?" #"I am worried about my job and my kids. I do not know how to manage both."

tokenized_text = tokenizer(text, return_tensors="pt")
tokenized_text = {k: v.to(trainer.model.device) for k,v in tokenized_text.items()}

outputs = trainer.model(**tokenized_text)

In [None]:
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.7989, -0.6965, -2.9556]], device='cuda:0',
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [None]:
# apply sigmoid + threshold
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(outputs.logits.squeeze().cpu())
predictions = np.zeros(probs.shape)
predictions[np.where(probs >= 0.5)] = 1
# turn predicted ids into actual label names
predicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]
print(predicted_labels)

['<20']
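
Thresholding independent sigmoid probabilities can return zero or several age groups for a text, even though each author belongs to exactly one group. One alternative for single-label problems is to pick the label with the highest logit. The sketch below compares both in plain Python, using the logits printed above and a second, made-up logit vector; the helper names are hypothetical.

```python
import math

id2label = {0: "<20", 1: "20-30", 2: "30+"}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def threshold_labels(logits, threshold=0.5):
    """Multi-label style: keep every label whose sigmoid probability passes the threshold."""
    return [id2label[i] for i, z in enumerate(logits) if sigmoid(z) >= threshold]


def argmax_label(logits):
    """Single-label style: always return exactly one label, the highest-scoring one."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]


observed = [0.7989, -0.6965, -2.9556]   # logits from the output above
uncertain = [-0.2, -0.5, -1.0]          # made-up logits, all below zero

print(threshold_labels(observed), argmax_label(observed))    # ['<20'] <20
print(threshold_labels(uncertain), argmax_label(uncertain))  # [] <20
```

For the second vector, the thresholded prediction is empty, while the argmax still commits to one group.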
