Loading the dataset using the TensorFlow Datasets API

In [8]:
import tensorflow as tf
import tensorflow_datasets as tfds

In [9]:
(ds_train, ds_test), ds_info = tfds.load('imdb_reviews', 
                                        split=(tfds.Split.TRAIN, tfds.Split.TEST),
                                        as_supervised=True,
                                        with_info=True)

Now check some examples from the dataset.

In [10]:
for review, label in tfds.as_numpy(ds_train.take(5)):
    print(review.decode()[0:50],'\t', label)

This was an absolutely terrible movie. Don't be lu 	 0
I have been known to fall asleep during films, but 	 0
Mann photographs the Alberta Rocky Mountains in a  	 0
This is the kind of film for a snowy Sunday aftern 	 1
As others have mentioned, all the women that go nu 	 1


In [11]:
#!pip install -q transformers

Next, we load the pre-trained BERT tokenizer so that the reviews are tokenized the same way as the text BERT was pre-trained on.

In [12]:
from transformers import BertTokenizer

In [13]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case = True)

Now we need to prepare the data for the BERT model. The model takes three inputs:

Input IDs – Token indices: the numerical representations of the tokens that make up the input sequence. They are often the only required input to the model.

Attention mask – Tells the model which positions to attend to, so that no attention is spent on padding. Each value is either 1 (a real token, attended to) or 0 (a padding token, ignored).

Token type ids – Used in tasks such as sentence-pair classification or question answering, where two different sequences are encoded in the same input IDs. The token type ids mark which sequence each token belongs to (0 for the first, 1 for the second). Special tokens, the classifier token [CLS] and the separator token [SEP], mark the start of the input and the boundaries between sequences.

<center><img src="img1.png"/></center>
<center>Source: BERT Input/Output (https://pysnacks.com/machine-learning/bert-text-classification-with-fine-tuning/)</center>
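
To make the three inputs concrete, here is a minimal pure-Python sketch that builds them for a pair of sequences. The vocabulary `toy_vocab` and the helper `toy_encode_pair` are hypothetical and exist only for this illustration; the real tokenizer uses WordPiece with BERT's own vocabulary, so the actual IDs will differ. The sketch also assumes the pair fits within `max_length` and does no truncation.

```python
# Toy illustration only: a made-up vocabulary, not BERT's real WordPiece vocabulary.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
             "great": 3, "movie": 4, "loved": 5, "it": 6}

def toy_encode_pair(first, second, max_length):
    """Build the three BERT-style inputs for two whitespace-split sequences."""
    first_tokens = ["[CLS]"] + first.split() + ["[SEP]"]
    second_tokens = second.split() + ["[SEP]"]
    tokens = first_tokens + second_tokens

    input_ids = [toy_vocab[t] for t in tokens]
    # 0 marks the first sequence, 1 marks the second sequence.
    token_type_ids = [0] * len(first_tokens) + [1] * len(second_tokens)
    # 1 marks real tokens; padding positions added below get 0.
    attention_mask = [1] * len(tokens)

    # Pad all three lists up to max_length (assumes the pair already fits).
    pad = max_length - len(tokens)
    input_ids += [toy_vocab["[PAD]"]] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}

encoded = toy_encode_pair("great movie", "loved it", max_length=10)
for name, values in encoded.items():
    print(name, values)
```

For a single review, as in this notebook, there is no second sequence, so every token type id is 0.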

The encode_plus function of the tokenizer class will tokenize the raw input.

In [14]:
max_length = 512
def convert_to_feature(review):
    return tokenizer.encode_plus(review,
                    add_special_tokens=True, # add [CLS], [SEP]
                    max_length=max_length, # BERT accepts at most 512 tokens per input
                    padding='max_length', # add [PAD] tokens up to max_length (replaces the deprecated pad_to_max_length)
                    truncation=True, # explicitly cut reviews longer than max_length
                    return_attention_mask=True) # add attention mask to not focus on pad tokens

The following helper functions transform the raw data into the format expected by the BERT model.

In [15]:
def map_to_dict(input_ids, attention_masks, token_type_ids, label):
    return {
        "input_ids":input_ids,
        "attention_mask":attention_masks,
        "token_type_ids": token_type_ids
    }, label

In [16]:
def encode_reviews(ds, limit=-1):
    input_ids_list=[]
    token_type_ids_list=[]
    attention_mask_list=[]
    label_list=[]
    if limit>0:
        ds=ds.take(limit)
    
    for review, label in tfds.as_numpy(ds):
        bert_input = convert_to_feature(review.decode())
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append(label)
    
    return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list,label_list)).map(map_to_dict)


Now we need to create our train and test datasets.

In [17]:
batch_size=6
ds_train_encoded = encode_reviews(ds_train).shuffle(10000).batch(batch_size)
ds_test_encoded = encode_reviews(ds_test).batch(batch_size)
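
As a rough sketch of what this batching implies for training, the snippet below computes how many batches one pass over the training data produces. It assumes the standard 25,000-review IMDB training split and the default behaviour of `batch`, which keeps a smaller final batch rather than dropping it. The helper `batch_sizes` is hypothetical and only for illustration.

```python
def batch_sizes(num_examples, batch_size):
    """Size of each batch when the final partial batch is kept."""
    full, rest = divmod(num_examples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

# Assumption: 25,000 training reviews, batch size 6 as in the cell above.
sizes = batch_sizes(25000, 6)
print("batches per epoch:", len(sizes))
print("size of last batch:", sizes[-1])
```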

Now we will initialize the BERT model for sentiment analysis.

In [18]:
from transformers import TFBertForSequenceClassification
learning_rate = 2e-5
epoch = 1
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-8)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric]) 

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
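
The loss configured above takes integer class labels and raw logits (`from_logits=True`), so the softmax is applied inside the loss. A minimal pure-Python sketch of the per-example computation follows; the function name `sparse_ce_from_logits` and the logit values are hypothetical, and by default Keras additionally averages the per-example values over the batch.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Hypothetical logits that strongly favour class 1 (positive).
logits = [-2.0, 3.0]
loss_if_positive = sparse_ce_from_logits(logits, label=1)  # small loss
loss_if_negative = sparse_ce_from_logits(logits, label=0)  # large loss
print(loss_if_positive, loss_if_negative)
```

The loss is small when the logit of the true class dominates and grows as the model puts more weight on the wrong class.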


Now we will train the model.

In [19]:
bert_history = model.fit(ds_train_encoded, epochs=epoch, validation_data=ds_test_encoded)



Test on a random example.

In [20]:
test_sentence = "This is a really good movie. I loved it and will watch again"
predict_input = tokenizer.encode(test_sentence,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")

In [24]:
tf_output=model.predict(predict_input)[0]
tf_prediction = tf.nn.softmax(tf_output, axis=1)
labels = ['Negative','Positive'] #(0:negative, 1:positive)
label = tf.argmax(tf_prediction, axis=1)
label = label.numpy()
print(labels[label[0]])

Positive
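
The post-processing in the cell above (softmax over the logits, then argmax) can be sketched in pure Python as follows. The logit values and the names `example_logits` and `class_names` are hypothetical; the real values come from `model.predict`.

```python
import math

class_names = ["Negative", "Positive"]  # index 0: negative, index 1: positive

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.5, 2.0]  # hypothetical model output for one sentence
probs = softmax(example_logits)
predicted = class_names[probs.index(max(probs))]  # argmax over the classes
print(predicted, [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly gives the same label; the softmax is only needed when the probabilities themselves are of interest.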


In [2]:
from transformers import pipeline

In [5]:
pipeline(model='sayakpaul/videomae-base-finetuned-ucf101-subset')

KeyError: "Unknown task video-classification, available tasks are ['audio-classification', 'automatic-speech-recognition', 'conversational', 'depth-estimation', 'document-question-answering', 'feature-extraction', 'fill-mask', 'image-classification', 'image-segmentation', 'image-to-text', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text2text-generation', 'token-classification', 'translation', 'visual-question-answering', 'vqa', 'zero-shot-classification', 'zero-shot-image-classification', 'zero-shot-object-detection', 'translation_XX_to_YY']"