### The transformers library is from Hugging Face, and it provides:

* Pretrained transformer models (like BERT, GPT, T5)

* Pipelines for tasks like text classification, translation, summarization, etc.

In [1]:
!pip3 install --quiet transformers

In [2]:
import tensorflow as tf

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
data_path = '/content/drive/MyDrive/0.Latest_DS_Course/RNN/Data/labeledTrainData.tsv'


## Data Preprocessing

#### Load Data

In [5]:
!pip install sentencepiece




### What is sentencepiece?
* sentencepiece is a tokenizer developed by Google.

* It is often used with models like T5, mBART, XLM-R, and ALBERT, which split text into subword units (using byte-pair encoding or a unigram language model) rather than whole words.

* It allows language-independent tokenization, even for text without spaces, like Japanese or Chinese.

* The BERT tokenizer used below relies on WordPiece (a `vocab.txt` file) rather than sentencepiece.
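
To illustrate the point about text without spaces, here is a toy sketch of one idea behind SentencePiece: whitespace is treated as an ordinary symbol (written `▁`), so the original text can be rebuilt exactly from its pieces. It does not call the sentencepiece library; the character-level split and the function names `toy_encode` and `toy_decode` are assumptions made purely for illustration.

```python
# Toy illustration only: real SentencePiece learns subword pieces (BPE or unigram);
# here every character is its own piece.
SPACE_MARK = "\u2581"  # the "▁" symbol SentencePiece uses to mark a space

def toy_encode(text):
    """Replace spaces with the marker, then split into single-character pieces."""
    return list(text.replace(" ", SPACE_MARK))

def toy_decode(pieces):
    """Join the pieces and turn the marker back into spaces."""
    return "".join(pieces).replace(SPACE_MARK, " ")

english = "good movie"
japanese = "映画が好き"  # no spaces between words

for text in (english, japanese):
    pieces = toy_encode(text)
    assert toy_decode(pieces) == text  # the round trip is lossless
    print(pieces)
```

Because the space is just another symbol, the same procedure works whether or not the language separates words with spaces.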

In [7]:
import pandas as pd

# Change data_path above to point to where you have stored the TSV file.
# quoting=3 (csv.QUOTE_NONE) keeps quote characters as literal text.
df = pd.read_csv(data_path, header=0, delimiter="\t", quoting=3)
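
The `quoting=3` argument is the numeric value of `csv.QUOTE_NONE`, which tells the parser to treat quote characters as ordinary text. The sketch below uses Python's standard `csv` module on a made-up two-line TSV string (the sample row is hypothetical, not taken from the dataset) to show the effect.

```python
import csv
import io

# Hypothetical TSV snippet in the same layout as the dataset (id, sentiment, review).
sample = 'id\tsentiment\treview\n"1_1"\t1\t"He said "great" twice."\n'

# QUOTE_NONE keeps quote characters as part of the field instead of parsing them.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

print(csv.QUOTE_NONE)   # the constant behind quoting=3
print(rows[1])          # quotes are still present in the id and review fields
```

This is why the `id` and `review` values in the sampled rows still carry their surrounding quote characters.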


In [8]:
df.sample(n=5)


Unnamed: 0,id,sentiment,review
24642,"""575_3""",0,"""This movie isn't very good. It's boring, and ..."
14218,"""6471_10""",1,"""An OUR GANG Comedy Short.<br /><br />The Gang..."
589,"""11641_1""",0,"""Although I am generally a proponent of the we..."
11303,"""5617_8""",1,"""After his classic film noir homage Chinatown ..."
12169,"""2438_9""",1,"""This was a gem. Amazing acting from the leads..."


In [9]:
# Sentences and labels
sentences = df.review.values
labels = df.sentiment.values


## Tokenize Data using BERT Tokenizer

In [10]:
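# Import only the classes used in this notebook instead of a wildcard import
from transformers import BertTokenizer, TFBertForSequenceClassification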


No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
GroupViT models are not usable since `tensorflow_probability` can't be loaded. It seems you have `tensorflow_probability` installed with the wrong tensorflow version.Please try to reinstall it following the instructions here: https://github.com/tensorflow/probability.
TAPAS models are not usable since `tensorflow_probability` can't be loaded. It seems you have `tensorflow_probability` installed with the wrong tensorflow version. Please try to reinstall it following the instructions here: https://github.com/tensorflow/probability.


In [12]:
# Get BertTokenizer
# Loads a pretrained BERT tokenizer that transforms text into numerical tokens, making it ready to be fed into the BERT model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/tokenizer_config.json
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/tokenizer.json
loading file chat_template.jinja from cache at None


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.52.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



### .from_pretrained('bert-base-uncased')
This function:

* Downloads the vocabulary and configuration of the BERT tokenizer that was used with the 'bert-base-uncased' model (or loads them from the local cache if they were downloaded before).

* 'bert-base-uncased' is a popular variant of BERT that:

  - Has 12 layers and about 110 million parameters.

  - Converts all text to lowercase (hence "uncased").

  - Was trained on English Wikipedia + BooksCorpus.
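
With the vocabulary in hand, the BERT tokenizer splits each word into the longest pieces it can find in that vocabulary, marking continuation pieces with `##`. Below is a minimal sketch of that greedy longest-match-first idea, assuming a tiny hypothetical vocabulary (`toy_vocab`) rather than the real 30,522-entry one. The function `toy_wordpiece` is illustrative and is not the library's implementation.

```python
# Hypothetical mini-vocabulary; the real bert-base-uncased vocab has 30,522 entries.
toy_vocab = {"moon", "##walker", "watch", "##ed", "m", "##j", "[UNK]"}

def toy_wordpiece(word, vocab):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

for word in ["moonwalker", "watched", "mj", "xyz"]:
    print(word, "->", toy_wordpiece(word.lower(), toy_vocab))
```

The `##` prefix is what lets the pieces be glued back into whole words later.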

In [13]:
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]


In [None]:
sentences[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [None]:
#Check tokenized text
print(tokenized_texts[0])



['"', 'with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'm', '##j', 'i', "'", 've', 'started', 'listening', 'to', 'his', 'music', ',', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', ',', 'watched', 'the', 'wi', '##z', 'and', 'watched', 'moon', '##walker', 'again', '.', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', '.', 'moon', '##walker', 'is', 'part', 'biography', ',', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', '.', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'm', '##j', "'", 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'm', "'", 'kay',

In [14]:
# Use only the first 200 tokens of each review for classification (this value can be changed)
max_length = 200
tokenized_texts = [sent[:max_length] for sent in tokenized_texts]

In [15]:


# Add the special tokens BERT expects: [CLS] at the start, [SEP] at the end
tokenized_texts = [['[CLS]'] + sent + ['[SEP]'] for sent in tokenized_texts]



In [16]:
print(tokenized_texts[0])

['[CLS]', '"', 'with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'm', '##j', 'i', "'", 've', 'started', 'listening', 'to', 'his', 'music', ',', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', ',', 'watched', 'the', 'wi', '##z', 'and', 'watched', 'moon', '##walker', 'again', '.', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', '.', 'moon', '##walker', 'is', 'part', 'biography', ',', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', '.', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'm', '##j', "'", 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'm', "'

In [17]:
#Convert tokens into IDs
input_ids = [tokenizer.convert_tokens_to_ids(sent) for sent in tokenized_texts]


In [18]:
print(input_ids[0])

[101, 1000, 2007, 2035, 2023, 4933, 2183, 2091, 2012, 1996, 2617, 2007, 1049, 3501, 1045, 1005, 2310, 2318, 5962, 2000, 2010, 2189, 1010, 3666, 1996, 5976, 4516, 2182, 1998, 2045, 1010, 3427, 1996, 15536, 2480, 1998, 3427, 4231, 26965, 2153, 1012, 2672, 1045, 2074, 2215, 2000, 2131, 1037, 3056, 12369, 2046, 2023, 3124, 2040, 1045, 2245, 2001, 2428, 4658, 1999, 1996, 27690, 2074, 2000, 2672, 2191, 2039, 2026, 2568, 3251, 2002, 2003, 5905, 2030, 7036, 1012, 4231, 26965, 2003, 2112, 8308, 1010, 2112, 3444, 2143, 2029, 1045, 3342, 2183, 2000, 2156, 2012, 1996, 5988, 2043, 2009, 2001, 2761, 2207, 1012, 2070, 1997, 2009, 2038, 11259, 7696, 2055, 1049, 3501, 1005, 1055, 3110, 2875, 1996, 2811, 1998, 2036, 1996, 5793, 4471, 1997, 5850, 2024, 2919, 1049, 1005, 10905, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 17453, 8052, 2021, 1997, 2607, 2023, 2003, 2035, 2055, 2745, 4027, 2061, 4983, 2017, 19512, 2066, 1049, 3501, 1999, 4312, 2059, 2017, 2024, 2183, 2000, 5223, 2023, 1998, 2424, 2

In [None]:

# Pad sequences shorter than max_length + 2 (tokens plus [CLS] and [SEP]) with zeros
input_ids = tf.keras.preprocessing.sequence.pad_sequences(input_ids, maxlen=max_length+2, truncating='post', padding='post')
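
Padding to `max_length+2` leaves room for the `[CLS]` and `[SEP]` tokens added to each truncated review. As a rough pure-Python sketch of what post-padding and post-truncation do (the helper name `pad_post` is hypothetical, the IDs are example values, and this is not the Keras implementation):

```python
def pad_post(seq, maxlen, pad_value=0):
    """Cut a sequence at maxlen, or append pad_value until it reaches maxlen."""
    seq = list(seq)[:maxlen]                        # truncating='post'
    return seq + [pad_value] * (maxlen - len(seq))  # padding='post'

max_length = 200
short_ids = [101, 2023, 3185, 102]  # [CLS] + 2 tokens + [SEP]; example IDs only
padded = pad_post(short_ids, max_length + 2)

print(len(padded))   # 202
print(padded[:6])    # [101, 2023, 3185, 102, 0, 0]
```

Every row ends up with the same length, which is what allows the reviews to be stacked into a single 2-D array.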


## Split data between training and test

In [None]:
from sklearn.model_selection import train_test_split


In [None]:
#80% data will be used for training while 20% will be used for test
trainX, testX, trainY, testY = train_test_split(input_ids, labels, test_size=0.2, random_state=12345)

## Create Attention Masks

Attention masks are used to ignore padding tokens. The mask value is set to 0 for padding tokens and 1 for actual tokens. We create masks for both the training and test data.
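
As a small illustration of the rule, the sketch below builds a mask for one padded sequence in plain Python. The IDs are example values and the helper name `attention_mask` is hypothetical. It relies on the padding ID being 0, which matches `"pad_token_id": 0` in the BERT config.

```python
PAD_ID = 0  # matches "pad_token_id": 0 in the BERT config

def attention_mask(ids, pad_id=PAD_ID):
    """Return 1.0 for real tokens and 0.0 for padding positions."""
    return [float(token_id != pad_id) for token_id in ids]

padded_ids = [101, 2023, 3185, 102, 0, 0, 0]  # example IDs only
mask = attention_mask(padded_ids)
print(mask)  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```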


In [None]:
# Create attention masks for training
train_attn_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in trainX:
    seq_mask = [float(i>0) for i in seq]
    train_attn_masks.append(seq_mask)


In [None]:

# Create attention masks for Test
test_attn_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in testX:
    seq_mask = [float(i>0) for i in seq]
    test_attn_masks.append(seq_mask)

## Our Data is ready at this point


## BUILD MODEL

In [None]:
# Load the pre-trained BERT model with a classification layer on top.
# The Hugging Face library provides TFBertForSequenceClassification for this;
# num_labels defaults to 2, which suits binary sentiment classification.
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

# model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)



loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.51.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/model.safetensors
Loaded 109,482,240 parameters i

In [None]:
# Prepare training: compile the tf.keras model with an optimizer, loss, and accuracy metric
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
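
`from_logits=True` tells the loss that the model outputs raw scores rather than probabilities, so the softmax is applied inside the loss. The sketch below works through that computation by hand for a single example with made-up logits. The function name `sparse_ce_from_logits` is hypothetical, and the code mirrors the math, not the Keras code path.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Softmax over the logits, then negative log-probability of the true class."""
    shift = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[label]), probs

logits = [-1.2, 2.3]   # hypothetical scores for [negative, positive]
loss_value, probs = sparse_ce_from_logits(logits, label=1)
print(probs, loss_value)
```

The loss is small when the true class gets the higher score and grows as the model puts more probability on the wrong class.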

In [None]:

model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0 (unused)
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109483778 (417.65 MB)
Trainable params: 109483778 (417.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
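
The parameter counts in the summary can be checked by hand: the classifier is a single dense layer mapping the 768-dimensional hidden state to 2 classes. The size estimate below assumes 4 bytes (float32) per parameter.

```python
hidden_size = 768          # "hidden_size" in the BERT config
num_labels = 2             # two sentiment classes
bert_params = 109_482_240  # from the summary above

classifier_params = hidden_size * num_labels + num_labels  # weights + biases
total_params = bert_params + classifier_params
size_mb = total_params * 4 / 1024**2  # assuming 4 bytes (float32) per parameter

print(classifier_params)  # 1538
print(total_params)       # 109483778
print(round(size_mb, 2))  # 417.65
```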


## TRAIN MODEL

In [None]:
import numpy as np

In [None]:
train_x_data = {'input_ids': np.array(trainX), 'attention_mask': np.array(train_attn_masks)}
test_x_data = {'input_ids': np.array(testX), 'attention_mask': np.array(test_attn_masks)}

In [None]:
model.fit(train_x_data, trainY, validation_data=(test_x_data, testY), batch_size=16, epochs=2)

Epoch 1/2
 254/1250 [=====>........................] - ETA: 9:54:33 - loss: 0.3817 - accuracy: 0.8209

KeyboardInterrupt: 
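
The `1250` in the progress bar is the number of batches per epoch. A quick check, assuming the dataset has 25,000 rows (an assumption consistent with the 1250 steps shown, not a figure stated in this notebook):

```python
import math

n_rows = 25_000      # assumed dataset size
test_percent = 20    # test_size=0.2 in train_test_split
batch_size = 16

n_test = n_rows * test_percent // 100
n_train = n_rows - n_test
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, steps_per_epoch)  # 20000 1250
```

At the pace shown in the interrupted run, fine-tuning on CPU is very slow; the earlier "No CUDA runtime is found" message suggests no GPU was attached to this session.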

## PREDICT ON TEST DATA

In [None]:

# Get the raw predictions (logits)
predictions = model.predict(test_x_data)

# Convert logits to predicted labels (0 or 1)
predicted_labels = np.argmax(predictions.logits, axis=1)

# Evaluate accuracy manually
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Test Accuracy:", accuracy_score(testY, predicted_labels))
print("\nClassification Report:\n", classification_report(testY, predicted_labels))
print("\nConfusion Matrix:\n", confusion_matrix(testY, predicted_labels))
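
As a sanity check on what these metrics compute, here is a small pure-Python sketch with made-up logits and labels: argmax picks the higher-scoring class, accuracy is the share of matches, and the confusion matrix counts each (actual, predicted) pair. All `toy_*` names and values are hypothetical.

```python
# Hypothetical logits for 4 reviews: [score_negative, score_positive]
toy_logits = [[2.0, -1.0], [0.1, 0.4], [-0.5, 1.5], [1.2, 0.3]]
toy_true = [0, 1, 0, 0]

# argmax along the class axis, as np.argmax(..., axis=1) does
toy_pred = [row.index(max(row)) for row in toy_logits]

accuracy = sum(p == t for p, t in zip(toy_pred, toy_true)) / len(toy_true)

# confusion[actual][predicted], the same layout sklearn's confusion_matrix uses
confusion = [[0, 0], [0, 0]]
for t, p in zip(toy_true, toy_pred):
    confusion[t][p] += 1

print(toy_pred)    # [0, 1, 1, 0]
print(accuracy)    # 0.75
print(confusion)   # [[2, 1], [0, 1]]
```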


In [None]:

# Show a few predictions vs. actual labels
for i in range(5):
    print(f"\nReview: {tokenizer.decode(testX[i], skip_special_tokens=True)}")
    print(f"Actual Sentiment: {'Positive' if testY[i]==1 else 'Negative'}")
    print(f"Predicted Sentiment: {'Positive' if predicted_labels[i]==1 else 'Negative'}")


In [None]:
def predict_sentiment(text, tokenizer, model, max_length=200):
    # Tokenize with special tokens [CLS] and [SEP]
    tokens = tokenizer.tokenize(text)
    tokens = tokens[:max_length]  # Truncate
    tokens = ['[CLS]'] + tokens + ['[SEP]']

    # Convert to IDs
    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Pad to max_length+2 to match training input
    input_ids = tf.keras.preprocessing.sequence.pad_sequences([input_ids], maxlen=max_length+2, padding='post')

    # Create attention mask
    attn_mask = [[float(i > 0) for i in input_ids[0]]]

    # Prepare input dict
    inputs = {
        'input_ids': tf.convert_to_tensor(input_ids),
        'attention_mask': tf.convert_to_tensor(attn_mask)
    }

    # Predict
    outputs = model.predict(inputs)
    pred_label = np.argmax(outputs.logits, axis=1)[0]

    sentiment = "Positive" if pred_label == 1 else "Negative"
    print(f"Review: {text}\nPredicted Sentiment: {sentiment}")
    return sentiment


In [None]:
predict_sentiment("I loved the movie very much", tokenizer, model)
