# Sentiment Analysis using BERT and TensorFlow

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SaschaHeyer/Sentiment-Analysis-GCP/blob/main/notebook/Sentiment_Analysis_BERT_and_TensorFlow.ipynb)

This notebook contains the code for the DoiT blog article https://blog.doit-intl.com/performing-surprisingly-easy-sentiment-analysis-on-google-cloud-platform-fc26b2e2b4b. If you want to deploy this model to Google Cloud, head over to the article.
 

## Author
Sascha Heyer - Senior Machine Learning Engineer at [DoiT International](https://twitter.com/doitint)<br/>
[Twitter](https://twitter.com/HeyerSascha)
[LinkedIn](https://www.linkedin.com/in/saschaheyer/)


## Install Huggingface Transformers

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


## Dependencies

In [3]:
from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification

import tensorflow as tf
import json
import pandas as pd
import numpy as np
from io import StringIO

## Configuration

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

In [4]:
num_gpus_available = len(tf.config.list_physical_devices('GPU'))
print("Num GPUs Available: ", num_gpus_available)
assert num_gpus_available > 0

Num GPUs Available:  1


## Load dataset

In [5]:
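# Original IMDB CSV from the blog article; `file` is not used in the cells below.
file = tf.io.gfile.GFile(
    'gs://machine-learning-samples/datasets/sentiment/imdb/csv/dataset.csv', mode='r').read()

# Local Excel file of messages; this is the data used from here on.
df = pd.read_excel('/content/messages.xlsx')
df = df.dropna()
df = df.astype(str)
train_set = df[0:200]
valid_set = df[200:300]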

In [6]:
df.tail()

Unnamed: 0,messages
693,The perfect occasion to fix it
694,Not solved yet
695,"Sorry, I should have warned you that in this c..."
696,Extra line
697,Fixed


## VADER labels and Train / Validation split



In [8]:

import nltk
nltk.downloader.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [9]:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

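def vader_sentiment_result(sent):
    """Return 0 if VADER scores the text as more negative than positive, else 1."""
    scores = analyzer.polarity_scores(sent)

    # Ties, including fully neutral messages, fall through to label 1.
    if scores["neg"] > scores["pos"]:
        return 0

    return 1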

train_set["vader_result"] = train_set["messages"].apply(lambda x: vader_sentiment_result(x))
valid_set["vader_result"] = valid_set["messages"].apply(lambda x: vader_sentiment_result(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_set["vader_result"] = train_set["messages"].apply(lambda x: vader_sentiment_result(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  valid_set["vader_result"] = valid_set["messages"].apply(lambda x: vader_sentiment_result(x))
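
The labelling rule in `vader_sentiment_result` compares only the `neg` and `pos` scores, so a message that VADER considers entirely neutral (or one where the two scores tie) ends up labelled 1. A minimal sketch of that rule on hand-written score dictionaries, with no NLTK call involved; the score values are made up for illustration and `label_from_scores` is a hypothetical helper that mirrors the notebook's function:

```python
def label_from_scores(scores):
    """Mirror the notebook's rule: 0 only when neg strictly exceeds pos."""
    return 0 if scores["neg"] > scores["pos"] else 1

# Illustrative score dictionaries (not real VADER output).
examples = {
    "clearly negative": {"neg": 0.6, "neu": 0.4, "pos": 0.0},
    "clearly positive": {"neg": 0.0, "neu": 0.5, "pos": 0.5},
    "fully neutral": {"neg": 0.0, "neu": 1.0, "pos": 0.0},
}

example_labels = {name: label_from_scores(s) for name, s in examples.items()}
for name, label in example_labels.items():
    print(f"{name}: {label}")
```

If neutral messages should not count as positive, one option would be to threshold VADER's `compound` score or to introduce a third class, which would also mean changing `num_labels` when the model is created.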


In [None]:
print(train_set)

                                              messages  vader_result
0     @beltran-rubo @juan131 please review this again.             1
1    @vikram-bitnami @juan131 can you sync about th...             1
2    @vikram-bitnami @juan131 can you sync about th...             1
3    > @vikram-bitnami @juan131 can you sync about ...             0
4    Closing in favour of https://github.com/bitnam...             1
..                                                 ...           ...
204  [INFO] Once installed this environment will be...             1
205              [INFO] This may take a few minutes...             1
206  [INFO] Installing environment for https://gith...             1
207  [INFO] Once installed this environment will be...             1
208              [INFO] This may take a few minutes...             1

[200 rows x 2 columns]


In [10]:
from sklearn.model_selection import train_test_split

# Labels must be integers (0/1) for the classification loss, not strings.
sentiments = train_set['vader_result'].values.astype(int).tolist()
reviews = train_set['messages'].values.astype(str).tolist()
print(reviews)
training_sentences, validation_sentences, training_labels, validation_labels = train_test_split(reviews, sentiments, test_size=.2)

['@beltran-rubo @juan131 please review this again.', '@vikram-bitnami @juan131 can you sync about those PRs (https://github.com/bitnami/charts-docs/pull/10 and https://github.com/bitnami/charts-docs/pull/11)? It seems you are (partially) modifying the same files at the same time 😄 ', '@vikram-bitnami @juan131 can you sync about those PRs (https://github.com/bitnami/charts-docs/pull/10 and https://github.com/bitnami/charts-docs/pull/11)? It seems you are (partially) modifying the same files at the same time 😄 ', '> @vikram-bitnami @juan131 can you sync about those PRs (#10 and #11)? It seems you are (partially) modifying the same files at the same time smile_x000D_\n_x000D_\nIndeed, we are discussing this offline.', 'Closing in favour of https://github.com/bitnami/charts-docs/pull/11', '@juan131 can this be merged now? Or if not please let me know when it could be merged. Thanks.', 'Approved.', 'Just an FYI @raquel-campuzano that there is a plan to to delete all the individual `sidecar`
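
`train_test_split(..., test_size=.2)` shuffles the 200 labelled messages and holds out 20% of them. Because no `random_state` is passed, the split differs on every run. A rough stdlib-only sketch of the same idea with a fixed seed; `split_pairs` is a hypothetical helper, not scikit-learn's implementation, and the toy lists stand in for `reviews` and `sentiments`:

```python
import random

def split_pairs(items, labels, test_size=0.2, seed=0):
    """Shuffle indices with a fixed seed, then hold out a test_size share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy stand-ins for the notebook's `reviews` and `sentiments` lists.
toy_reviews = [f"message {i}" for i in range(200)]
toy_labels = [i % 2 for i in range(200)]

tr_x, va_x, tr_y, va_y = split_pairs(toy_reviews, toy_labels)
print(len(tr_x), len(va_x))  # 160 training and 40 validation examples
```

With scikit-learn itself, passing `random_state` to `train_test_split` gives the same kind of reproducibility.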

## Tokenization

In [11]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

In [12]:
tokenizer([training_sentences[0]], truncation=True,
                            padding=True, max_length=64)

{'input_ids': [[101, 1048, 13512, 2213, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}

In [13]:
sequence = '@beltran-rubo @juan131 please review this again.'
tokenizer.tokenize(sequence)

['@',
 'belt',
 '##ran',
 '-',
 'rub',
 '##o',
 '@',
 'juan',
 '##13',
 '##1',
 'please',
 'review',
 'this',
 'again',
 '.']
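
The `##` prefix marks a WordPiece that continues the previous token, which is how out-of-vocabulary names such as `beltran` and `juan131` get split. A small sketch, assuming only that convention, that glues the pieces from the output above back together; `merge_wordpieces` is a hypothetical helper, not part of the tokenizer API:

```python
def merge_wordpieces(tokens):
    """Join '##'-prefixed continuation pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return words

pieces = ['@', 'belt', '##ran', '-', 'rub', '##o', '@', 'juan', '##13', '##1',
          'please', 'review', 'this', 'again', '.']
merged = merge_wordpieces(pieces)
print(merged)
```

For real decoding, the tokenizer's own `convert_tokens_to_string` method is the safer choice, since it also handles spacing.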

In [14]:
tokenizer(sequence)

{'input_ids': [101, 1030, 5583, 5521, 1011, 14548, 2080, 1030, 5348, 17134, 2487, 3531, 3319, 2023, 2153, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [15]:
train_encodings = tokenizer(training_sentences,
                            truncation=True,
                            padding=True)
val_encodings = tokenizer(validation_sentences,
                            truncation=True,
                            padding=True)
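
With `padding=True` the tokenizer pads every sequence in a batch to the length of the longest one and sets `attention_mask` to 0 over the padding. A plain-Python sketch of that step, using the two encodings printed in the cells above; it assumes the pad id 0 that `distilbert-base-uncased` uses for `[PAD]`, and `pad_batch` is a hypothetical helper rather than the library's code:

```python
PAD_ID = 0  # id of [PAD] in the distilbert-base-uncased vocabulary

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad each id list to the longest length and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 1048, 13512, 2213, 999, 102],
    [101, 1030, 5583, 5521, 1011, 14548, 2080, 1030, 5348, 17134,
     2487, 3531, 3319, 2023, 2153, 1012, 102],
])
print(batch["attention_mask"][0])
```

Both sequences start with 101 (`[CLS]`) and end with 102 (`[SEP]`), the special tokens the tokenizer adds around every input.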

## TensorFlow dataset

In [16]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    training_labels
))

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    validation_labels
))

## Model

In [17]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',
                                                              num_labels=2)

Downloading:   0%|          | 0.00/363M [00:00<?, ?B/s]

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'vocab_projector', 'vocab_layer_norm', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use i

In [18]:
# Using the Hugging Face model saves us the time and effort of building the model on our own
# https://www.tensorflow.org/official_models/fine_tuning_bert_files/output_8L__-erBwLIQ_0.png?dcb_=0.04391390122987171

model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________


## Training

In [19]:

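optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5, epsilon=1e-08)
# Use an explicit Keras loss on the logits. Passing `loss=model.compute_loss` is the
# likely cause of the AttributeError recorded below on newer TensorFlow/transformers
# versions (an assumption; the traceback is not shown).
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
# The datasets are already batched, so `batch_size` is not passed to fit().
model.fit(train_dataset.shuffle(10).batch(16),
          epochs=2,
          validation_data=val_dataset.shuffle(10).batch(16))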

Epoch 1/2


AttributeError: ignored
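
A separate point about the training cell: `shuffle(10)` uses a buffer of only 10 elements, so no example can appear more than 9 positions earlier than its original place. The `tf.data` documentation notes that a buffer at least as large as the dataset is needed for a full shuffle. A pure-Python simulation of a shuffle buffer makes the limit visible for the 160 training examples; this is a sketch of the idea, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical name:

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Simulate a shuffle buffer: emit a random buffered item, refill from the stream."""
    rng = random.Random(seed)
    stream = iter(items)
    buffer = []
    # Fill the buffer with the first `buffer_size` items.
    for item in stream:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    output = []
    # For each remaining item, emit a random buffered item and take its slot.
    for item in stream:
        slot = rng.randrange(len(buffer))
        output.append(buffer[slot])
        buffer[slot] = item
    # Drain whatever is left in the buffer in random order.
    rng.shuffle(buffer)
    output.extend(buffer)
    return output

order = buffered_shuffle(range(160), buffer_size=10)
max_jump_ahead = max(value - position for position, value in enumerate(order))
print(max_jump_ahead)  # never more than buffer_size - 1
```

In the notebook this would correspond to something like `train_dataset.shuffle(len(training_sentences))`, which lets every example land anywhere in the epoch.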

IMDB Sentiment Benchmark https://paperswithcode.com/sota/sentiment-analysis-on-imdb

## Save model

In [None]:
model.save_pretrained("./model")

## Load model from storage (for demo purposes, without time for full training)

In [None]:
!gsutil cp -r gs://machine-learning-samples/models/sentiment/model . 

## Load model

In [None]:
loaded_model = TFDistilBertForSequenceClassification.from_pretrained("./model")

## Prediction

In [None]:
test_sentence = "DoiT is a great company"

# Replace test_sentence with a test_sentence_sarcasm variable (not defined in this notebook) if you want to test sarcasm
predict_input = tokenizer.encode(test_sentence,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")

tf_output = loaded_model.predict(predict_input)[0]

tf_prediction = tf.nn.softmax(tf_output, axis=1)
labels = ['Negative','Positive']
label = tf.argmax(tf_prediction, axis=1)
label = label.numpy()
print(labels[label[0]])
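
The prediction cell turns the model's two raw logits into probabilities with softmax and then picks the larger one with argmax. The same arithmetic in plain Python, on made-up logits (the values are illustrative, not model output):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class_names = ['Negative', 'Positive']
example_logits = [-1.2, 2.3]  # hypothetical logits for one sentence

probabilities = softmax(example_logits)
predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
print(class_names[predicted], [round(p, 3) for p in probabilities])
```

Because softmax preserves ordering, taking the argmax of the logits directly gives the same label; the softmax is only needed when the probabilities themselves are reported.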

## Excursion

### Masking

In [None]:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
unmasker("DoiT is a [MASK] company to work for.")

### Tokenization

In [None]:
tokenizer.tokenize('Cat Dog Cat Dog')

In [None]:
tokenizer(['Cat Dog Cat Dog'], 
          truncation=True,
          padding=True, 
          max_length=128)

### Bias
BERT was trained on Wikipedia and BookCorpus and therefore picks up the biases present in that text. This is an important topic, and we need to keep it in mind whenever we work with machine learning models and data.

In [None]:
unmasker("The White man worked as a [MASK].")

In [None]:
unmasker("The woman worked as a [MASK].")

In [None]:
unmasker("The Black woman worked as a [MASK].")