# Text classification
__pipelines__
* Simple interface
* Automatic model selection
* Less control over the model and its outputs
* Less flexibility: limited to the predefined tasks

__auto classes__
* More flexibility and customization
* Manual setup is more complex

Example:
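* Load pre-trained model weights and tokenizer by name
* model_name is also known as the "model checkpoint"

AutoModel loads only the base model without a task-specific head, so the example defines its own classification head.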

In [1]:
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "I am an example for text classification"

class SimpleClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(SimpleClassifier, self).__init__()
        self.fc = nn.Linear(input_size, num_classes)

    def forward(self, x):
        return self.fc(x)



Tokenize inputs

In [2]:
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=64)

Get the model's hidden states:
* pooler_output: high-level aggregated representation of the whole sequence
* last_hidden_state: raw hidden state for every token

In [3]:
outputs = model(**inputs)
pooled_output = outputs.pooler_output

In [4]:
outputs.last_hidden_state.shape

torch.Size([1, 9, 768])

In [5]:
pooled_output.shape

torch.Size([1, 768])
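
For BERT, pooler_output is the hidden state of the first ([CLS]) token passed through a dense layer and tanh. A common alternative is to aggregate last_hidden_state yourself, for example with a masked mean over the token dimension. The sketch below assumes random stand-in tensors with the same shapes as above (not real BERT outputs); the name mean_pooled is made up for illustration.

```python
import torch

# Stand-ins with the same shapes as in the cells above
# (batch of 1, 9 tokens, hidden size 768). Random values, not BERT outputs.
last_hidden_state = torch.randn(1, 9, 768)
attention_mask = torch.ones(1, 9, dtype=torch.long)

# Expand the mask over the hidden dimension so padded positions contribute zero.
mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [1, 9, 1]
summed = (last_hidden_state * mask).sum(dim=1)                   # [1, 768]
counts = mask.sum(dim=1).clamp(min=1.0)                          # [1, 1]
mean_pooled = summed / counts

print(mean_pooled.shape)
```

The result has the same shape as pooler_output, so it could be fed to the same classification head.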

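Pass the pooled output through the custom classification head to obtain class probabilities. The head is randomly initialized and untrained, so the probabilities are not yet meaningful.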

In [6]:
import torch

classifier_head = SimpleClassifier(pooled_output.size(-1), num_classes=2)
logits = classifier_head(pooled_output)
probs = torch.softmax(logits, dim=1)
probs

tensor([[0.4355, 0.5645]], grad_fn=<SoftmaxBackward0>)

Auto class with preconfigured head

AutoModelForSequenceClassification: sentiment classification on a 5-star rating scale

In [7]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)



In [8]:
text = "The quality of the product was just okay."
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
logits = outputs.logits

# Class indices run from 0 to 4, so add 1 to get the star rating (1 to 5)
predicted_class = torch.argmax(logits, dim=1).item()
predicted_class + 1

3
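
The step from logits to a star rating can be sketched in isolation. The logits below are made-up values for one review, not output of the model above; the only assumption carried over is that class index 0 stands for 1 star.

```python
import torch

# Hypothetical logits for one review over the 5 classes (index 0 = 1 star).
logits = torch.tensor([[-1.2, 0.3, 2.1, 0.4, -1.5]])

# Softmax turns logits into probabilities; argmax picks the most likely class.
probs = torch.softmax(logits, dim=1)
predicted_class = torch.argmax(probs, dim=1).item()
stars = predicted_class + 1

print(f"{stars} stars (p={probs[0, predicted_class].item():.2f})")
```

Softmax does not change which class wins, so argmax over the logits gives the same prediction; the probabilities are only needed if a confidence value is wanted.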

# Text generation

* AutoModelForCausalLM accepts auto-regressive models like gpt2
* Model head for next-token prediction
* Takes a prompt and generates a continuation; max_length caps the total length in tokens, prompt included

In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "This is a simple example for text generation,"
inputs = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(inputs, max_length=26)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
generated_text

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"This is a simple example for text generation, but it's also a good way to get a feel for how the text is generated"

# Datasets for text classification
Example: load the IMDb reviews dataset

In [1]:
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset('imdb')
train_data = dataset['train']
dataloader = DataLoader(train_data, batch_size=2, shuffle=True)



In [2]:
# print the texts of the first batch (two reviews) and the label of its first example
i = 0
batch = next(iter(dataloader))
print(f"Example {i + 1}:")
print("------------------------------")
print(batch['text'])
print("------------------------------")
print("Label:", batch['label'][i])

Example 1:
------------------------------
['Cute film about three lively sisters from Switzerland (often seen running about in matching outfits) who want to get their parents back together (seems mom is still carrying the torch for dad) - so they sail off to New York to stop the dad from marrying a blonde gold-digger he calls "Precious". Dad hasn\'t seen his daughters in ten years, they (oddly enough) don\'t seem to mind and think he\'s wonderful, and meanwhile Precious seems to lead a life mainly run by her overbearing mother (Alice Brady), a woman who just wants to see to it her daughter marries a rich man. The sisters get the idea of pushing Precious into the path of a drunken Hungarian count, tricking the two gold-digging women into thinking he is one of the richest men in Europe. But a case of mistaken identity makes the girls think the count is good-looking Ray Milland, who goes along with the scheme \'cause he has a crush on sister Kay.<br /><br />This film is enjoyable, light f
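
The output above starts with a bracket because the DataLoader collates each field across the batch: batch['text'] is a list of strings and batch['label'] is a tensor. A minimal sketch of that behaviour, assuming a hypothetical toy dataset (toy_rows) with the same two fields so nothing has to be downloaded:

```python
from torch.utils.data import DataLoader

# Hypothetical mini-dataset with the same fields as the IMDb rows.
toy_rows = [
    {"text": "Great film.", "label": 1},
    {"text": "Terrible plot.", "label": 0},
    {"text": "Loved every minute.", "label": 1},
    {"text": "Fell asleep halfway.", "label": 0},
]

# shuffle=False keeps the order fixed so the batch contents are predictable.
loader = DataLoader(toy_rows, batch_size=2, shuffle=False)
batch = next(iter(loader))

print(batch["text"])   # list with one string per example
print(batch["label"])  # tensor with one label per example

# Pair each text with its label to inspect examples one by one.
for text, label in zip(batch["text"], batch["label"].tolist()):
    print(label, text)
```

To print a single review, index into the list first, e.g. batch['text'][0].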

Load a dataset from the Stanford NLP catalogue

In [7]:
from datasets import load_dataset

dataset = load_dataset("stanfordnlp/shp-2", data_dir="reddit/askculinary")
train_data = dataset['train']
example = train_data[1]
print("Post ID:", example["post_id"])
print("Paragraph:", example["history"])

Post ID: jqcxwi
Paragraph: How can I purposely get clumps in my spaghetti Ok this is a weird one guys, but I have an autistic kid and his absolute favourite thing in the world to eat is 'spaghetti chunk'... so like you know when you boil the dried pasta and you get a little lump where some of the spaghetti has fused together? I dont know if I'm explaining this properly but anyway it's his birthday tomorrow and I really wanna make him a bowl of 'spaghetti chunk' and meatballs for his birthday meal (as we can't go out to celebrate due to lockdown)  So yeah I know this is an odd question but how can I cook/prepare the pasta so I can give him a full bowl of chunks? I only have 2 300g packs so not enough for a load of trial and error. I was gonna snap it and cook it in as little water as possible but I really dont know if that will work. Sorry for bizarre question but my son would literally be beside himself with happiness if I were to cook him a big bowl of his goddamn chunks... Thanks in ad

### Train the LLM for next word prediction

Example: "the cat is sleeping on the mat"

An **input sequence** is a segment of text, e.g. "the cat is"

A **target sequence** is the input shifted by one token, so that the target at each position is the token that follows, e.g. "cat is sleeping"
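
The input/target construction can be sketched in plain Python. The whitespace split is only a stand-in for a real tokenizer, and context_len is a made-up parameter for the length of each training window.

```python
# Whitespace split as a stand-in for a real tokenizer.
tokens = "the cat is sleeping on the mat".split()

context_len = 3  # hypothetical window size
pairs = []
for start in range(len(tokens) - context_len):
    input_seq = tokens[start:start + context_len]
    # Same window moved one token ahead: each target is the next token.
    target_seq = tokens[start + 1:start + context_len + 1]
    pairs.append((input_seq, target_seq))

for input_seq, target_seq in pairs:
    print(input_seq, "->", target_seq)
```

The first pair is the example from the text: input "the cat is", target "cat is sleeping".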

### Classifying movie opinions

In [12]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "textattack/distilbert-base-uncased-SST-2"
# Load the tokenizer and pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
  model_name, num_labels=2)

text = ["The best movie I've ever watched!", "What an awful movie. I regret watching it."]

# Tokenize inputs and pass them to the model for inference
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits = outputs.logits

predicted_classes = torch.argmax(logits, dim=1).tolist()
for idx, predicted_class in enumerate(predicted_classes):
    print(f"Predicted class for \"{text[idx]}\": {predicted_class}")



Predicted class for "The best movie I've ever watched!": 1
Predicted class for "What an awful movie. I regret watching it.": 0
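
The two reviews tokenize to different lengths, which is why the tokenizer is called with padding=True: shorter sequences are padded to the longest one in the batch, and an attention mask marks which positions are real tokens. A conceptual sketch in plain Python, assuming made-up token ids and a pad id of 0 (this is not the tokenizer's actual implementation):

```python
# Hypothetical token ids for two sentences of different length.
batch_ids = [
    [101, 1996, 2190, 3185, 102],
    [101, 2054, 2019, 9643, 3185, 1012, 102],
]
pad_id = 0  # assumed pad token id

# Pad every sequence on the right up to the longest one in the batch.
max_len = max(len(ids) for ids in batch_ids)
input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]

# 1 marks a real token, 0 marks padding the model should ignore.
attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```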


### Text generation example

Arrange the code pieces in the right order to load Stanford NLP's "askculinary" dataset, extract a short sequence (the first 60 characters) from the 'history' attribute of the first training instance, and use it as a prompt for the "gpt2" model to generate follow-up text.

In [17]:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset = load_dataset("stanfordnlp/shp-2", data_dir="reddit/askculinary")
train_data = dataset['train']

In [22]:
prompt = train_data[0]["history"][:60]
prompt

'How can I purposely get clumps in my spaghetti Ok this is a '

In [20]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)



In [23]:
inputs = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(inputs, max_length=29)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [25]:
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
generated_text

'How can I purposely get clumps in my spaghetti Ok this is a \xa0problem. I have a lot of spaghetti and I have to make'

In [26]:
len(output)

1
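
len(output) is 1 because generate returns one sequence per prompt in the batch. The warning in the generation cell appears because tokenizer.encode returns only the input ids, so generate has to infer the attention mask and the pad token. The sketch below passes both explicitly. It assumes a tiny, randomly initialized GPT-2 built from a made-up config and made-up token ids, so it runs without downloading weights and its output is meaningless. With the real gpt2 checkpoint, tokenizer(prompt, return_tensors='pt') returns both input_ids and attention_mask, which can be passed on as model.generate(**inputs, ...).

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny, randomly initialized GPT-2 (hypothetical sizes) so nothing is downloaded.
config = GPT2Config(vocab_size=100, n_positions=64, n_embd=32, n_layer=1, n_head=2,
                    bos_token_id=0, eos_token_id=0)
tiny_model = GPT2LMHeadModel(config).eval()

input_ids = torch.tensor([[5, 17, 42, 8, 23]])  # made-up 5-token prompt
attention_mask = torch.ones_like(input_ids)     # no padding: attend to every token

with torch.no_grad():
    output = tiny_model.generate(
        input_ids,
        attention_mask=attention_mask,      # passed explicitly instead of inferred
        pad_token_id=config.eos_token_id,   # GPT-2 has no pad token; reuse EOS
        max_length=12,                      # total length, prompt included
    )

print(len(output), output.shape[1])
```

The returned sequence starts with the prompt tokens and is at most max_length tokens long in total; it can be shorter if the model emits the end-of-sequence token early.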