<a href="https://colab.research.google.com/github/ZohanaZuthi/Sentiment-Analysis-Project/blob/main/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install tensorflow-hub
!pip install tensorflow-text




In [3]:
import tensorflow_hub as hub
# Imported for its side effect: it registers the custom ops the BERT preprocessing model needs.
import tensorflow_text as text


In [10]:
# BERT preprocessing and encoder URLs
preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

In [11]:
text_test = ['nice movie indeed', 'I love python programming']

In [12]:
# Load the preprocessing model and turn the raw strings into BERT input tensors
bert_preprocess_model = hub.KerasLayer(preprocess_url)
text_preprocessed = bert_preprocess_model(text_test)
text_preprocessed.keys()

dict_keys(['input_type_ids', 'input_mask', 'input_word_ids'])

What each key means:
1. input_word_ids
The token IDs for each word or subword (WordPiece) in the sentence.

These IDs index into BERT’s vocabulary.

For example, "I love pizza" might become: [101, 1045, 2293, 10733, 102]

101 is [CLS] and 102 is [SEP]; 0 is the padding ID that fills the rest of the sequence.

2. input_mask
This tells BERT which tokens are real vs. padding.

1 means attend to this token; 0 means ignore it (a padding position).

Example: [1, 1, 1, 1, 0, 0, 0]

3. input_type_ids
This tells BERT which segment each token belongs to.

Used in tasks with two sentences, like question answering.

Tokens from sentence A → 0

Tokens from sentence B → 1

In sentiment analysis (single sentence), this is just a list of 0s.
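To make the three keys concrete, here is a minimal pure-Python sketch of how one already-tokenized sentence could be padded and masked. The helper `pad_and_mask` and its parameters are hypothetical names used for illustration only; the real preprocessing model also handles tokenization and truncation, which this sketch does not.

```python
def pad_and_mask(token_ids, seq_length=128, pad_id=0):
    """Pad one sentence's token IDs and build the matching mask and type IDs."""
    if len(token_ids) > seq_length:
        raise ValueError("this sketch does not truncate; pass at most seq_length IDs")
    n_pad = seq_length - len(token_ids)
    return {
        "input_word_ids": token_ids + [pad_id] * n_pad,     # real IDs, then padding
        "input_mask": [1] * len(token_ids) + [0] * n_pad,   # 1 = real, 0 = padding
        "input_type_ids": [0] * seq_length,                 # single sentence: all 0s
    }

# Token IDs for 'nice movie indeed', including [CLS] (101) and [SEP] (102).
features = pad_and_mask([101, 3835, 3185, 5262, 102])
print(features["input_mask"][:8])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

All three lists have the same length (128 here), which is why the tensors produced by the preprocessing model share the shape (batch_size, 128).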

In [16]:
text_preprocessed['input_mask']

<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>
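Because the mask is 1 for real tokens and 0 for padding, summing each row gives the token count per sentence, [CLS] and [SEP] included. The sketch below uses a hand-built stand-in for the mask above rather than the actual tensor, and `real_token_counts` is a hypothetical helper name.

```python
def real_token_counts(input_mask):
    """Return the number of non-padding positions in each row of a 0/1 mask."""
    return [sum(row) for row in input_mask]

# Stand-in for the (2, 128) mask above: 5 and 6 real tokens, then padding.
mask = [
    [1] * 5 + [0] * 123,
    [1] * 6 + [0] * 122,
]
print(real_token_counts(mask))  # [5, 6]
```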

In [18]:
text_preprocessed['input_word_ids']

<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
array([[  101,  3835,  3185,  5262,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0, 

In [15]:
# Load the BERT encoder and run it on the preprocessed inputs
bert_model = hub.KerasLayer(encoder_url)
bert_results = bert_model(text_preprocessed)
bert_results.keys()

dict_keys(['pooled_output', 'sequence_output', 'encoder_outputs', 'default'])

1. pooled_output
Shape: (batch_size, 768)

This is derived from the final hidden state of the [CLS] token, passed through a dense layer with tanh activation. It serves as a summary of the entire sentence.

Typically used for classification tasks like sentiment analysis.
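As a rough illustration of how a classifier could sit on top of pooled_output, the sketch below applies a logistic-regression head to a single 768-dimensional vector. Everything here is a stand-in: the vector is random rather than a real BERT output, the weights are untrained, and `sentiment_probability` is a hypothetical helper. In practice this step would more likely be a trainable Keras layer.

```python
import math
import random

def sentiment_probability(pooled_vector, weights, bias):
    """Logistic-regression head: sigmoid(w . x + b) on one pooled vector."""
    logit = sum(w * x for w, x in zip(weights, pooled_vector)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

random.seed(0)
hidden_size = 768
# Stand-in for one row of pooled_output; values lie in [-1, 1] like a tanh output.
pooled_vector = [random.uniform(-1.0, 1.0) for _ in range(hidden_size)]
# Untrained placeholder weights; a real head would learn these from labelled data.
weights = [random.gauss(0.0, 0.02) for _ in range(hidden_size)]
bias = 0.0

probability = sentiment_probability(pooled_vector, weights, bias)
print(0.0 < probability < 1.0)  # True
```

The point is the shape of the computation: one vector of size 768 per sentence goes in, and one score per sentence comes out.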

2. sequence_output
Shape: (batch_size, sequence_length, 768)

This gives the embedding for each token in the input.

Useful for token-level tasks, like Named Entity Recognition (NER), Q&A, etc.
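For token-level work, the padding positions in sequence_output are usually not of interest. A minimal sketch of dropping them with input_mask follows; it uses tiny nested lists in place of the real (128, 768) row, and `real_token_vectors` is a hypothetical helper name.

```python
def real_token_vectors(sequence_row, mask_row):
    """Keep only the embeddings whose mask entry is 1 (non-padding positions)."""
    return [vec for vec, keep in zip(sequence_row, mask_row) if keep == 1]

# Toy stand-in: 4 positions with 3-dimensional vectors instead of 128 x 768.
sequence_row = [
    [0.1, 0.2, 0.3],   # [CLS]
    [0.4, 0.5, 0.6],   # a real token
    [0.7, 0.8, 0.9],   # [SEP]
    [0.0, 0.1, 0.0],   # padding position
]
mask_row = [1, 1, 1, 0]

kept = real_token_vectors(sequence_row, mask_row)
print(len(kept))  # 3
```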

3. encoder_outputs
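A list of the hidden states output by each of BERT's Transformer layers.

This model has 12 layers (the L-12 in the encoder URL), so the list is [layer1_output, ..., layer12_output], each with shape (batch_size, sequence_length, 768).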

Used when you want to inspect or extract intermediate layers (e.g., for visualization or probing tasks).
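Since encoder_outputs is an ordinary list, picking a layer is plain indexing, with layer n at index n - 1. The sketch below shows this on a toy list of 12 small nested lists rather than real tensors. `layer_output` is a hypothetical helper, and the assignment of `sequence_output` mirrors the relationship observed in this notebook, where the last entry holds the same values as sequence_output.

```python
def layer_output(encoder_outputs, layer_number):
    """Return the output of a 1-based layer number from a list of layer outputs."""
    if not 1 <= layer_number <= len(encoder_outputs):
        raise IndexError(f"layer_number must be in 1..{len(encoder_outputs)}")
    return encoder_outputs[layer_number - 1]

# Toy stand-in: 12 "layers", each a tiny nested list instead of a (2, 128, 768) tensor.
encoder_outputs = [[[float(layer), float(layer) + 0.5]] for layer in range(1, 13)]
sequence_output = encoder_outputs[-1]  # in the notebook these two hold the same values

print(layer_output(encoder_outputs, 12) == sequence_output)  # True
print(layer_output(encoder_outputs, 1))  # [[1.0, 1.5]]
```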

In [22]:
bert_results['encoder_outputs'][-1]
# the last encoder layer's output is the same as sequence_output

<tf.Tensor: shape=(2, 128, 768), dtype=float32, numpy=
array([[[ 0.07292047,  0.08567817,  0.14476831, ..., -0.09677088,
          0.08722127,  0.07711085],
        [ 0.1783942 , -0.19006078,  0.5034952 , ..., -0.05869824,
          0.32717073, -0.15578493],
        [ 0.18701449, -0.43388748, -0.4887515 , ..., -0.15502778,
          0.00145182, -0.24471016],
        ...,
        [ 0.12083091,  0.12884252,  0.46453542, ...,  0.0737552 ,
          0.17441946,  0.16522075],
        [ 0.07967882, -0.01190652,  0.502254  , ...,  0.13777725,
          0.21002197,  0.00624588],
        [-0.07212701, -0.28303462,  0.59033376, ...,  0.475519  ,
          0.16668482, -0.08920349]],

       [[-0.0790059 ,  0.36335096, -0.21101595, ..., -0.17183761,
          0.16299714,  0.6724268 ],
        [ 0.27883548,  0.43716297, -0.35764822, ..., -0.04463654,
          0.38315183,  0.5887981 ],
        [ 1.2037671 ,  1.0727016 ,  0.48408663, ...,  0.24921048,
          0.407309  ,  0.40481755],
        ...,

In [19]:
bert_results['sequence_output']

<tf.Tensor: shape=(2, 128, 768), dtype=float32, numpy=
array([[[ 0.07292047,  0.08567817,  0.14476831, ..., -0.09677088,
          0.08722127,  0.07711085],
        [ 0.1783942 , -0.19006078,  0.5034952 , ..., -0.05869824,
          0.32717073, -0.15578493],
        [ 0.18701449, -0.43388748, -0.4887515 , ..., -0.15502778,
          0.00145182, -0.24471016],
        ...,
        [ 0.12083091,  0.12884252,  0.46453542, ...,  0.0737552 ,
          0.17441946,  0.16522075],
        [ 0.07967882, -0.01190652,  0.502254  , ...,  0.13777725,
          0.21002197,  0.00624588],
        [-0.07212701, -0.28303462,  0.59033376, ...,  0.475519  ,
          0.16668482, -0.08920349]],

       [[-0.0790059 ,  0.36335096, -0.21101595, ..., -0.17183761,
          0.16299714,  0.6724268 ],
        [ 0.27883548,  0.43716297, -0.35764822, ..., -0.04463654,
          0.38315183,  0.5887981 ],
        [ 1.2037671 ,  1.0727016 ,  0.48408663, ...,  0.24921048,
          0.407309  ,  0.40481755],
        ...,