# Text Classification with Transformers

### Objective:
1. Familiarize yourself with Hugging Face datasets and models
2. Learn to perform binary and multi-class classification

### Problem Statement:
Authorship profiling is the task of predicting characteristics of an author, such as demographic attributes, from their writing. The hypothesis is that, in some cases, an attribute such as gender is reflected in an author's writing style.

**Task1 - Gender Prediction (Binary Classification)**: Does the gender of a person influence their written text?

**Task2 - Age Group Prediction (Multi Class Classification)**: Does the age group of a person influence their written text?

**Dataset**:  [Blog Authorship Corpus](https://huggingface.co/datasets/blog_authorship_corpus)

Change Runtime to GPU if possible

In [42]:
import torch
from datasets import load_dataset, Dataset, DatasetDict
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import Pipeline
import evaluate
import numpy as np
from numpy import zeros
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import os
import pandas as pd
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction

## Loading Data


In [2]:
raw_dataset = (load_dataset('blog_authorship_corpus', split='train', trust_remote_code=True)
        .train_test_split(train_size=1000, test_size=100))


In [3]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'date', 'gender', 'age', 'horoscope', 'job'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['text', 'date', 'gender', 'age', 'horoscope', 'job'],
        num_rows: 100
    })
})

In [4]:
raw_dataset["train"][0]

{'text': 'We scared a seagull &amp; convinced a random dude to take  urlLink a pic of us  with my digicam before going into what promised to be the actual sea port building.',
 'date': '29,August,2003',
 'gender': 'male',
 'age': 24,
 'horoscope': 'Gemini',
 'job': 'Student'}

## Task1.1 - Gender Prediction (Binary Classification)
Can you predict the gender of a person from a piece of written text?


In [5]:
labels = ["female", "male"]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}

In [6]:
label2id

{'female': 0, 'male': 1}

In [7]:
id2label

{0: 'female', 1: 'male'}

In [8]:
def tokenize_function(batch, tokenizer, label2id):
    """Tokenize a batch of texts and attach integer gender labels."""
    tokenized_batch = tokenizer(batch["text"], padding=True, truncation=True,
                                max_length=tokenizer.model_max_length, return_tensors="pt")
    # Map each gender string ("female"/"male") to its integer id
    tokenized_batch["labels"] = [label2id.get(label) for label in batch["gender"]]
    return tokenized_batch

Models can come in cased or uncased versions. Uncased models lowercase the text and strip accents before tokenization, whereas cased models retain this information. Cased models may be better suited to tasks such as NER and POS tagging, where such information is important.
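To make the distinction concrete, the sketch below approximates the normalisation an uncased model applies before splitting text into word pieces: lowercasing followed by accent stripping. The helper name `uncased_normalize` is hypothetical, and this is only a standard-library illustration of the idea, not the tokenizer's own code.

```python
import unicodedata

def uncased_normalize(text: str) -> str:
    """Lowercase text and drop accents (hypothetical helper, for illustration only)."""
    lowered = text.lower()
    # NFD splits an accented character into a base character plus combining marks
    decomposed = unicodedata.normalize("NFD", lowered)
    # Unicode category "Mn" (nonspacing mark) covers the combining accents
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

sample = "Zoë visited Café Müller in Paris"
normalized = uncased_normalize(sample)
print(normalized)
```

After this step "Paris" and "paris" become indistinguishable, which is one reason capitalisation-sensitive tasks such as NER may prefer a cased model.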

In [9]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [10]:
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True,
                                    fn_kwargs={"tokenizer": tokenizer, "label2id": label2id},
                                    num_proc=4, remove_columns=raw_dataset['train'].column_names)
# use fn_kwargs to pass any arguments to the tokenizing function



In [11]:
tokenized_dataset.set_format(type="torch")

In [12]:
tokenized_dataset["train"][0]

{'input_ids': tensor([  101,  2057,  6015,  1037,  2712, 24848,  2140,  1004, 23713,  1025,
          6427,  1037,  6721, 12043,  2000,  2202, 24471, 21202,  2243,  1037,
         27263,  1997,  2149,  2007,  2026, 10667,  5555,  2213,  2077,  2183,
          2046,  2054,  5763,  2000,  2022,  1996,  5025,  2712,  3417,  2311,
          1012,   102,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,   

[Data Collators](https://huggingface.co/docs/transformers/main_classes/data_collator) assemble individual examples into batches. Depending on the collator, they also take care of padding (e.g. *DataCollatorWithPadding*), dynamic masking (e.g. *DataCollatorForLanguageModeling*) or special token requirements (e.g. *DataCollatorForTokenClassification*).

In [13]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding = True) #pads to the max sequence length in a batch
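The idea behind padding to the longest sequence in a batch can be sketched without any library. The function `pad_batch` below is a hypothetical stand-in written for illustration, assuming a pad token id of 0 (the value visible in the tokenized example above); it is not the collator's actual implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length token id lists to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    # Right-pad every sequence with pad_id up to max_len
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2057, 102], [101, 2057, 6015, 1037, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps batches of short texts small, which saves compute.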

In [14]:
accuracy = evaluate.load("accuracy")
#https://huggingface.co/docs/evaluate/choosing_a_metric

In [15]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
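As a rough illustration of what `compute_metrics` does with the model outputs, the sketch below takes the argmax over each row of logits and compares it with the reference labels. It uses plain Python and made-up toy logits; the helper names `argmax` and `accuracy_from_logits` are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Turn per-class logits into class ids and score them against the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy values: four examples, two classes (0 = female, 1 = male)
toy_logits = [[0.2, 1.3], [2.0, -1.0], [-0.5, 0.1], [0.9, 0.4]]
toy_labels = [1, 0, 0, 0]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

The third example is predicted as class 1 but labelled 0, so three of the four predictions match.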

In [16]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, id2label=id2label, label2id=label2id) #problem_type="multi_label_classification"

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
training_args = TrainingArguments(
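    report_to="none",  # the string "none" disables logging integrations; in transformers 4.x, None selects all installed ones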
    output_dir="models/"+model_name + "_blog_authorship_gender",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

In [18]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [19]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.600549 | 0.68 |
| 2 | 0.634000 | 0.563006 | 0.66 |


TrainOutput(global_step=500, training_loss=0.6340369873046875, metrics={'train_runtime': 37.9007, 'train_samples_per_second': 52.769, 'train_steps_per_second': 13.192, 'total_flos': 526222110720000.0, 'train_loss': 0.6340369873046875, 'epoch': 2.0})
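The `global_step` value and the "No log" entry for epoch 1 can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the default `logging_steps=500` (the training arguments above do not override it); `training_steps` is a hypothetical helper.

```python
import math

def training_steps(num_examples, batch_size, epochs):
    """Optimizer steps per epoch and in total (single device, no accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

steps_per_epoch, total_steps = training_steps(num_examples=1000, batch_size=4, epochs=2)
logging_steps = 500  # assumed default, not set explicitly in the notebook

# Epoch (possibly fractional) at which the training loss is first logged
first_logged_epoch = logging_steps / steps_per_epoch
print(steps_per_epoch, total_steps, first_logged_epoch)
```

With 250 steps per epoch, the first training-loss log only arrives at step 500, the end of epoch 2, which is consistent with "No log" in the first row. Setting a smaller `logging_steps` would make the training loss visible earlier.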

## Task1.2 - Gender Prediction (Binary Classification)
Choose a different encoder model and compare results.
Estimated time: 20 minutes


# ToDo
1. Define model and training parameters
2. Perform training
3. Peer Learning: Which model did you choose and how did it perform?

## Task2 - Multi Class Classification:
Can you predict the age group of a person from a piece of written text?

In this dataset, the age column is numeric. We convert it into three age groups: "<20", "20-30" and "30+".

For ease of processing, we first convert the datasets to pandas DataFrames and then apply the binning that creates the groups.

In [20]:
raw_df_train = raw_dataset["train"].to_pandas()
raw_df_test = raw_dataset["test"].to_pandas()

In [21]:
raw_df_train.head()

| | text | date | gender | age | horoscope | job |
|---|---|---|---|---|---|---|
| 0 | We scared a seagull &amp; convinced a random d... | 29,August,2003 | male | 24 | Gemini | Student |
| 1 | And on the really good news front: we visited ... | 30,May,2004 | female | 38 | Virgo | indUnk |
| 2 | damnnnnn..chelsea snatched arjen robben from m... | 02,March,2004 | male | 16 | Libra | Student |
| 3 | Author(s): Peter Mayle Genre: Fiction Revi... | 03,June,2004 | female | 38 | Taurus | indUnk |
| 4 | You may be one of the few who actually comment... | 15,August,2004 | male | 27 | Libra | indUnk |


In [22]:
raw_df_train.age.value_counts().sort_index()

age
13     16
14     46
15     66
16    125
17    117
23     96
24    115
25    109
26     71
27     65
33     22
34     37
35     20
36     23
37     13
38     16
39      7
40      3
41      4
42      3
43      9
44      4
45      1
46      5
47      3
48      4
Name: count, dtype: int64

In [23]:
bins = [0, 20, 30, np.inf]
age_labels = ['<20', '20-30', '30+']
raw_df_train['AgeRange'] = pd.cut(raw_df_train['age'], bins, labels=age_labels)
raw_df_test['AgeRange'] = pd.cut(raw_df_test['age'], bins, labels=age_labels)
raw_df_train.head()

| | text | date | gender | age | horoscope | job | AgeRange |
|---|---|---|---|---|---|---|---|
| 0 | We scared a seagull &amp; convinced a random d... | 29,August,2003 | male | 24 | Gemini | Student | 20-30 |
| 1 | And on the really good news front: we visited ... | 30,May,2004 | female | 38 | Virgo | indUnk | 30+ |
| 2 | damnnnnn..chelsea snatched arjen robben from m... | 02,March,2004 | male | 16 | Libra | Student | <20 |
| 3 | Author(s): Peter Mayle Genre: Fiction Revi... | 03,June,2004 | female | 38 | Taurus | indUnk | 30+ |
| 4 | You may be one of the few who actually comment... | 15,August,2004 | male | 27 | Libra | indUnk | 20-30 |


In [24]:
raw_df_train.AgeRange.value_counts().sort_index()

AgeRange
<20      370
20-30    456
30+      174
Name: count, dtype: int64
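The binning can be cross-checked without pandas. The sketch below mirrors the right-closed intervals that `pd.cut` uses by default, (0, 20], (20, 30] and (30, inf), and recomputes the group sizes from the per-age counts printed earlier. The names `age_group`, `AGE_EDGES` and `AGE_LABELS` are hypothetical and chosen only for this illustration.

```python
from bisect import bisect_left
from collections import Counter

AGE_EDGES = [20, 30]  # inclusive upper bounds of the first two groups
AGE_LABELS = ["<20", "20-30", "30+"]

def age_group(age):
    """Map a numeric age to its group label using right-closed intervals."""
    if age <= 0:
        return None  # outside all bins; pd.cut would produce NaN here
    return AGE_LABELS[bisect_left(AGE_EDGES, age)]

# Per-age counts copied from the value_counts() output above
age_counts = {13: 16, 14: 46, 15: 66, 16: 125, 17: 117,
              23: 96, 24: 115, 25: 109, 26: 71, 27: 65,
              33: 22, 34: 37, 35: 20, 36: 23, 37: 13, 38: 16, 39: 7, 40: 3,
              41: 4, 42: 3, 43: 9, 44: 4, 45: 1, 46: 5, 47: 3, 48: 4}

group_counts = Counter()
for age, count in age_counts.items():
    group_counts[age_group(age)] += count
print(dict(group_counts))
```

One detail worth noting: because the intervals are closed on the right, an age of exactly 20 falls into "<20" and exactly 30 into "20-30". The sampled data contains no such ages, so the labels are not misleading here, but passing `right=False` to `pd.cut` would be an option if they were.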

We then wrap the DataFrames back into a Hugging Face `DatasetDict` and create the `id2label` and `label2id` mappings for the age groups.

In [25]:
train_dataset = Dataset.from_dict(raw_df_train)
test_dataset = Dataset.from_dict(raw_df_test)
new_dataset = DatasetDict({"train":train_dataset,"test":test_dataset})

In [26]:
new_dataset["train"][0]

{'text': 'We scared a seagull &amp; convinced a random dude to take  urlLink a pic of us  with my digicam before going into what promised to be the actual sea port building.',
 'date': '29,August,2003',
 'gender': 'male',
 'age': 24,
 'horoscope': 'Gemini',
 'job': 'Student',
 'AgeRange': '20-30'}

In [27]:
id2label = {idx:label for idx, label in enumerate(age_labels)}
label2id = {label:idx for idx, label in enumerate(age_labels)}

The tokenize function is modified to build the age-group labels.

The raw labels are strings of the form `["<20", "<20", "20-30", ..., "30+"]`.

Because the model below is loaded with `problem_type="multi_label_classification"`, which trains with a binary cross-entropy loss over each label, the labels must instead be one-hot float vectors of the form `[[1,0,0],[1,0,0],[0,1,0], ..., [0,0,1]]` for training and evaluation.

We therefore first create a zero matrix of shape (batch_size, number_of_labels), then set the entry at row = batch position and column = label id to 1.

Finally, we return the tokenized batch together with its labels.

In [28]:
def tokenize_function(batch, tokenizer, label2id, np):
    """Tokenize a batch of texts and attach one-hot age-group labels.

    `np` (NumPy) is passed in through `fn_kwargs`, like the other arguments.
    """
    batch_size = len(batch["AgeRange"])
    tokenized_batch = tokenizer(batch["text"], padding=True, truncation=True,
                                max_length=tokenizer.model_max_length, return_tensors="pt")
    # One row per example, one column per label (number of labels = len(label2id))
    labels_matrix = np.zeros((batch_size, len(label2id)))
    for batch_pos, label in enumerate(batch["AgeRange"]):
        labels_matrix[batch_pos, label2id.get(label)] = 1
    tokenized_batch["labels"] = labels_matrix.tolist()
    return tokenized_batch
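For a toy batch, the label matrix built by the function above looks as follows. This is a dependency-free sketch; `one_hot_labels` and `toy_label2id` are hypothetical names that mirror the notebook's `label2id` mapping for the age groups.

```python
def one_hot_labels(labels, label2id):
    """Build a (batch_size, num_labels) matrix with a single 1.0 per row."""
    matrix = [[0.0] * len(label2id) for _ in labels]
    for row, label in enumerate(labels):
        matrix[row][label2id[label]] = 1.0
    return matrix

toy_label2id = {"<20": 0, "20-30": 1, "30+": 2}
encoded = one_hot_labels(["<20", "<20", "20-30", "30+"], toy_label2id)
for row in encoded:
    print(row)
```

The values are floats because a binary cross-entropy loss compares them directly with per-label probabilities.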

In [29]:
tokenized_dataset = new_dataset.map(tokenize_function, batched=True,
                                    fn_kwargs={"tokenizer": tokenizer, "label2id":label2id, "np":np},
                                    num_proc=4, remove_columns=new_dataset['train'].column_names)
# use fn_kwargs to pass any arguments to the tokenizing function
# remove_columns uses new_dataset so that the added AgeRange column is dropped as well



In [30]:
tokenized_dataset.set_format(type="torch")

In [31]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding = True) #pads to the max sequence length in a batch

In [32]:
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           problem_type="multi_label_classification",
                                                           num_labels=len(age_labels), #number of classes
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [33]:
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    #f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    #roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {#'f1': f1_micro_average,
               #'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions,
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result
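The effect of the sigmoid-plus-threshold step can be seen on a few made-up logits. The sketch below is a plain-Python approximation of the metric above: each label is thresholded independently, and a prediction only counts as correct when the whole row matches (the subset accuracy that `accuracy_score` reports for label matrices). All helper names and values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def threshold_predictions(logits, threshold=0.5):
    """Independently switch each label on when its probability reaches the threshold."""
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]

def subset_accuracy(y_true, y_pred):
    """Fraction of rows where every label matches."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)

toy_logits = [[2.0, -1.0, -3.0],   # one confident label
              [-0.5, -0.2, -1.0],  # nothing reaches the threshold
              [1.0, 1.5, -2.0]]    # two labels pass the threshold
toy_true = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]

toy_pred = threshold_predictions(toy_logits)
toy_accuracy = subset_accuracy(toy_true, toy_pred)
print(toy_pred, toy_accuracy)
```

Because each age group is thresholded on its own, a row can end up with no group or with several, and both cases count as errors here. Taking the argmax over the logits instead would always yield exactly one group, which may be worth comparing since every author belongs to a single age group.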

In [50]:
training_args = TrainingArguments(
    output_dir="models/"+model_name+"_blog_authorship_age",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

In [51]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [52]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.465699 | 0.65 |
| 2 | 0.402900 | 0.527951 | 0.62 |
| 3 | 0.402900 | 0.543877 | 0.57 |
| 4 | 0.194200 | 0.622845 | 0.57 |
| 5 | 0.194200 | 0.617257 | 0.61 |


TrainOutput(global_step=1250, training_loss=0.25565726928710936, metrics={'train_runtime': 78.4015, 'train_samples_per_second': 63.774, 'train_steps_per_second': 15.944, 'total_flos': 1315567088640000.0, 'train_loss': 0.25565726928710936, 'epoch': 5.0})
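The results above show the validation loss rising after the first epoch while the training loss keeps falling, which suggests overfitting. As a sketch of how one might pick a checkpoint from these numbers, the code below selects the epoch with the lowest validation loss. It assumes checkpoints are saved once per epoch in directories named `checkpoint-<global_step>` and that evaluation loss is the selection criterion, which is the documented default when `load_best_model_at_end=True` and no other metric is configured.

```python
# Evaluation results copied from the training output above
eval_log = [
    {"epoch": 1, "eval_loss": 0.465699, "accuracy": 0.65},
    {"epoch": 2, "eval_loss": 0.527951, "accuracy": 0.62},
    {"epoch": 3, "eval_loss": 0.543877, "accuracy": 0.57},
    {"epoch": 4, "eval_loss": 0.622845, "accuracy": 0.57},
    {"epoch": 5, "eval_loss": 0.617257, "accuracy": 0.61},
]

total_steps, epochs = 1250, 5
steps_per_epoch = total_steps // epochs

# Lower loss is better, so take the minimum
best = min(eval_log, key=lambda row: row["eval_loss"])
best_checkpoint = f"checkpoint-{best['epoch'] * steps_per_epoch}"
print(best["epoch"], best_checkpoint)
```

In this run the lowest validation loss and the highest accuracy both occur at epoch 1, so the first checkpoint is a reasonable candidate for inference.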

In [54]:
from transformers import pipeline

In [55]:
classification_pipeline = pipeline("text-classification", model="models/bert-base-uncased_blog_authorship_age/checkpoint-500")

Device set to use cuda:0


In [56]:
classification_pipeline("I am retired")

[{'label': '<20', 'score': 0.7296162247657776}]

In [57]:
classification_pipeline("I love my spouse and kids and love to take them outside after work.")

[{'label': '20-30', 'score': 0.7512363195419312}]

In [53]:
classification_pipeline("My friends and I want to go to Comic con")

[{'label': '20-30', 'score': 0.5268863439559937}]