# 3. Introduction to Hugging Face

In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset

imdb = load_dataset("imdb")




In [None]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [None]:
imdb.keys()

dict_keys(['train', 'test', 'unsupervised'])

In [None]:
imdb['train']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [None]:
imdb['test']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [None]:
imdb['train'].column_names

['text', 'label']

In [None]:
type(imdb['train']['text'])

list

In [None]:
len(imdb['train']['text'])

25000

In [None]:
print(f"Text: \n{imdb['train']['text'][0]}\n")
print(f"Label: {imdb['train']['label'][0]}")

Text: 
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betwe

# 4. AutoTokenizer

In [None]:
!pip install transformers



In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [None]:
tokenizer.is_fast

True

In [None]:
tokenizer.model_max_length

512

In [None]:
tokenizer.vocab_size

30522

### Answers to Questions

1. `AutoTokenizer` created a `DistilBertTokenizerFast` tokenizer for the `distilbert-base-uncased` pre-trained model.
2. Yes, a Rust-based ("fast") tokenizer is available: the class name ends in "Fast" and `tokenizer.is_fast` is `True`.
3. The tokenizer reports a `model_max_length` of `512`, so the model accepts input sequences of up to 512 tokens.
4. The tokenizer's vocabulary contains `30,522` tokens (`tokenizer.vocab_size`).
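
To make the idea of a fixed vocabulary more concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting, the general scheme used by BERT-style tokenizers such as this one. It is an illustration only, not the library's implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, the vocabulary is invented, and the real tokenizer additionally handles punctuation and adds special tokens such as `[CLS]` and `[SEP]`.

```python
def wordpiece_tokenize(text, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece over lowercased, whitespace-split words."""
    tokens = []
    for word in text.lower().split():
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            match = None
            # Try the longest remaining substring first, shrinking from the right.
            while start < end:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # prefix marks a continuation piece
                if candidate in vocab:
                    match = candidate
                    break
                end -= 1
            if match is None:
                # No piece fits, so the whole word maps to the unknown token.
                pieces = [unk_token]
                break
            pieces.append(match)
            start = end
        tokens.extend(pieces)
    return tokens


# Hypothetical toy vocabulary (the real one has 30,522 entries).
toy_vocab = {"[UNK]", "the", "movie", "was", "play", "##ing", "##ed"}
print(wordpiece_tokenize("The movie was playing", toy_vocab))
# ['the', 'movie', 'was', 'play', '##ing']
```

Words that cannot be assembled from vocabulary pieces fall back to `[UNK]`, which is why a larger vocabulary generally means fewer unknown tokens.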

# 5. Tokenization

In [None]:
X_train = imdb['train']['text']
y_train = imdb['train']['label']
X_test = imdb['test']['text']
y_test = imdb['test']['label']

print(type(X_train), type(y_train), type(X_test), type(y_test))
print(len(X_train), len(y_train), len(X_test), len(y_test))

<class 'list'> <class 'list'> <class 'list'> <class 'list'>
25000 25000 25000 25000


In [None]:
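# Truncate reviews to at most 128 tokens; padding=True pads each split to its longest remaining sequence.
train_encodings = tokenizer(X_train, max_length=128, truncation=True, padding=True)
test_encodings = tokenizer(X_test, max_length=128, truncation=True, padding=True)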
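
The effect of `max_length`, `truncation=True`, and `padding=True` in the tokenizer call above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the tokenizer's actual code: `pad_and_truncate` is a hypothetical helper, the token ids are made up, the padding id is assumed to be `0`, and the real tokenizer also keeps special tokens in place when it truncates.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Cut each sequence to max_length, then right-pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for three "reviews" of different lengths.
batch = [[5, 6, 7], [5, 6, 7, 8, 9], [5, 6]]
encoded = pad_and_truncate(batch, max_length=4)
print(encoded["input_ids"])       # [[5, 6, 7, 0], [5, 6, 7, 8], [5, 6, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 0, 0]]
```

Every row ends up with the same length, which is what allows the encodings to be stacked into rectangular tensors in the next section.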

# 6. Preparing the Data for Transformers

In [None]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    tf.constant(y_train, dtype=tf.int32)
))

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    tf.constant(y_test, dtype=tf.int32)
))

# Shuffle only the training split (the buffer covers the whole set), then batch both splits.
train_dataset = train_dataset.shuffle(len(X_train)).batch(16)
test_dataset = test_dataset.batch(16)
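
As a quick sanity check on the batching above, the following sketch works out how many batches each split yields. It assumes the default behaviour of `batch`, where a final partial batch is kept rather than dropped; `batch_sizes` is a hypothetical helper used only for this illustration.

```python
def batch_sizes(num_examples, batch_size):
    """Return the size of every batch when the final partial batch is kept."""
    full, remainder = divmod(num_examples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


sizes = batch_sizes(25000, 16)
print(len(sizes))  # 1563 batches per pass over a 25,000-example split
print(sizes[-1])   # 8 examples in the final, smaller batch
```

Under that assumption, each training epoch runs 1,563 steps, and the last step sees only 8 reviews.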

# 7. Build and Evaluate the Model

In [None]:
from transformers import TFAutoModelForSequenceClassification

# Load DistilBERT with a new, randomly initialized 2-class classification head.
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# The model outputs raw logits, so the loss is configured with from_logits=True.
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]

model.compile(loss=loss, optimizer=optimizer, metrics=metrics)
model.fit(train_dataset, epochs=3)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fbbec335b40>

In [None]:
test_loss, test_acc = model.evaluate(test_dataset)
print(f"Test set accuracy: {test_acc}")

Test set accuracy: 0.8692799806594849
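
For intuition about the loss and metric chosen above, here is a plain-Python sketch of what sparse categorical cross-entropy on logits and sparse categorical accuracy compute for a few examples. It is a simplified illustration, not the Keras implementation: the helper names and the logits are made up, and Keras additionally averages the loss over each batch.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Negative log-softmax probability of the true class, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class equals the integer label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Made-up logits for three reviews (index 0 = negative, index 1 = positive).
batch_logits = [[2.0, -1.0], [0.5, 1.5], [3.0, 0.0]]
labels = [0, 1, 1]

losses = [sparse_categorical_crossentropy(z, y) for z, y in zip(batch_logits, labels)]
print([round(loss_value, 4) for loss_value in losses])
print(accuracy(batch_logits, labels))
```

The third example is confidently wrong, so it contributes by far the largest loss even though it counts as just one error in the accuracy.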
